Document Management

Document Operations

This guide covers uploading documents, listing your document collection, retrieving corpus information, and understanding document structure. For the basics, see Getting Started.

Uploading Documents

Document uploads use multipart form data to stream file content efficiently. The upload endpoint accepts the document file along with optional metadata including original filename and MIME type. When metadata is omitted, the system infers values from the uploaded file.

The upload process creates a CORPUS node representing the original file. This node serves as the root of the document hierarchy and triggers parsing workflows that extract content and create structured representations. The upload response includes the corpus identifier, which you use to reference the document in subsequent operations.

Large files are streamed in chunks, so an upload does not require holding the entire file in memory. The system handles streaming transparently, and applications can upload files of various sizes using standard HTTP multipart form data. The outcome of an upload is reported through the HTTP status code and the response body.

cURL

curl -X POST "https://<your_domain>/v1/corpus" \
  -H "Authorization: Bearer <your-token>" \
  -F "file=@document.pdf" \
  -F "original_name=research-paper.pdf" \
  -F "mime_type=application/pdf"

Python

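import requests

headers = {"Authorization": "Bearer <your-token>"}

# Open the file in binary mode so its bytes are sent unmodified.
with open("document.pdf", "rb") as f:
    files = {"file": f}
    # Optional metadata; the server infers these values when they are omitted.
    data = {
        "original_name": "research-paper.pdf",
        "mime_type": "application/pdf",
    }
    response = requests.post(
        "https://<your_domain>/v1/corpus",
        headers=headers,
        files=files,
        data=data,
    )

# Raise an exception on 4xx/5xx responses before reading the body.
response.raise_for_status()
corpus = response.json()
print(f"Corpus ID: {corpus['corpus_id']}")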
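
If you prefer to send the optional metadata explicitly rather than rely on server-side inference, both form fields can be derived from the local path. The sketch below shows one way to do that with Python's standard mimetypes module. The helper name build_upload_fields and the application/octet-stream fallback are assumptions made for illustration; only the field names original_name and mime_type come from the upload example above.

```python
import mimetypes
from pathlib import Path


def build_upload_fields(path: str) -> dict:
    """Derive the optional upload form fields from a local file path.

    Hypothetical helper: only the field names original_name and mime_type
    are taken from the upload example in this guide.
    """
    guessed_type, _encoding = mimetypes.guess_type(path)
    return {
        "original_name": Path(path).name,
        # Assumed fallback for extensions the mimetypes module does not know.
        "mime_type": guessed_type or "application/octet-stream",
    }


fields = build_upload_fields("reports/research-paper.pdf")
print(fields)
```

The resulting dictionary can be passed as the data argument of requests.post in the upload example.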

Supported Formats

MatsuDB supports various document formats:

PDF: Research papers, reports, forms
Spreadsheets: Excel, CSV files
Text files: Plain text, markdown
Images: When embedded in documents

The system automatically detects file types and applies appropriate parsing workflows.

See the API Reference for complete upload endpoint documentation.

Listing Corpus Documents

The list endpoint returns all corpus documents in your namespace, with pagination support. Each corpus entry includes the corpus identifier, original filename, creation timestamp, and blob metadata. Use the listing to discover uploaded documents and review your collection.

Pagination parameters control result size and enable efficient processing of large document collections. Use page_size to control how many results appear per page, and page_token to retrieve subsequent pages. The response includes next_page_token when additional pages are available, and total_count indicating the complete collection size.

List results are ordered by creation time, with most recent documents appearing first. This ordering enables you to find recently uploaded documents quickly while supporting pagination through historical uploads.

Request

GET /v1/corpus?page_size=20&page_token=eyJwYWdlIjoxfQ==
Authorization: Bearer <your-token>

Response

{
  "corpora": [
    {
      "corpus_id": "corpus-123",
      "original_name": "research-paper.pdf",
      "created_at": "2024-01-15T10:30:00Z",
      "mime_type": "application/pdf",
      "blob_checksum": "sha256:abc123..."
    }
  ],
  "next_page_token": "eyJwYWdlIjoyfQ==",
  "total_count": 45
}

Pagination

List operations support pagination through page_size and page_token parameters. When next_page_token appears in a response, additional pages are available. Pass this token as page_token to retrieve subsequent pages.
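
The token-following loop described above can be sketched as a small generator. This is a minimal illustration rather than an official client: iter_corpora and make_http_fetcher are hypothetical names, and the code assumes the response shape shown in the list example (a corpora array plus an optional next_page_token).

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Iterator, Optional


def iter_corpora(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield every corpus entry, following next_page_token until it is absent."""
    page_token = None
    while True:
        page = fetch_page(page_token)
        yield from page.get("corpora", [])
        page_token = page.get("next_page_token")
        if not page_token:
            break


def make_http_fetcher(base_url: str, token: str, page_size: int = 20):
    """Build a fetch_page callable that issues GET {base_url}/v1/corpus."""

    def fetch_page(page_token: Optional[str]) -> dict:
        params = {"page_size": page_size}
        if page_token:
            # Only send page_token after the first page.
            params["page_token"] = page_token
        url = f"{base_url}/v1/corpus?{urllib.parse.urlencode(params)}"
        request = urllib.request.Request(
            url, headers={"Authorization": f"Bearer {token}"}
        )
        with urllib.request.urlopen(request) as response:
            return json.load(response)

    return fetch_page
```

Separating the loop from the HTTP call keeps the pagination logic easy to test with a stubbed fetch_page. In real use you would iterate over iter_corpora(make_http_fetcher("https://<your_domain>", "<your-token>")).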

Retrieving Corpus Information

The get endpoint returns detailed information about a specific corpus document. The response includes the corpus identifier, original filename, blob location, creation timestamp, and MIME type. This information enables you to understand document metadata and track document lifecycle.

The corpus response also reports processing state through the node_count and processing_status fields. Use them to check whether parsing has completed and how many nodes were created from the document.

Request

GET /v1/corpus/corpus-123
Authorization: Bearer <your-token>

Response

{
  "corpus_id": "corpus-123",
  "original_name": "research-paper.pdf",
  "created_at": "2024-01-15T10:30:00Z",
  "mime_type": "application/pdf",
  "blob_checksum": "sha256:abc123...",
  "node_count": 127,
  "processing_status": "COMPLETED"
}
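
To wait for parsing to finish, a client can poll the get endpoint until processing_status reports COMPLETED. The following is a rough sketch under stated assumptions: wait_until_processed and its parameters are hypothetical, COMPLETED is the only status value shown in this guide, and any other value (including possible failure states) is treated as "not done yet", so the loop relies on a timeout to stop.

```python
import time
from typing import Callable


def wait_until_processed(
    get_corpus: Callable[[str], dict],
    corpus_id: str,
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll a corpus until processing_status is COMPLETED or the timeout passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        corpus = get_corpus(corpus_id)
        if corpus.get("processing_status") == "COMPLETED":
            return corpus
        # Any other status is assumed to mean processing is still under way.
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{corpus_id} not processed within {timeout_s}s")
        sleep(interval_s)
```

Here get_corpus is any callable that performs GET /v1/corpus/{corpus_id} and returns the decoded JSON body; passing it in keeps the polling logic independent of the HTTP library.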

Understanding Document Structure

After upload, documents undergo parsing workflows that create structured node hierarchies. The CORPUS node represents the original file, while the ARTIFACT node represents the parsed document structure. Child nodes under the ARTIFACT represent document components: text nodes for paragraphs, image nodes for figures, and table nodes for tabular data.

This hierarchical structure enables navigation from the corpus level down to individual content elements. You can list child nodes to understand document structure, retrieve specific nodes by identifier, or search across all nodes within a document. The hierarchy preserves document organization while enabling granular access to individual components.
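
As a mental model, the hierarchy can be pictured as a small tree. The sketch below is illustrative only: the Node class, its field names, the node identifiers, and the type labels TEXT, IMAGE, and TABLE are assumptions, and placing the ARTIFACT directly under the CORPUS is inferred from the description above rather than taken from the API.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    node_id: str
    node_type: str  # assumed labels: CORPUS, ARTIFACT, TEXT, IMAGE, TABLE
    children: List["Node"] = field(default_factory=list)


def walk(node: Node, depth: int = 0) -> Iterator[Tuple[int, Node]]:
    """Depth-first traversal yielding (depth, node) pairs."""
    yield depth, node
    for child in node.children:
        yield from walk(child, depth + 1)


# Hypothetical document: one corpus, one artifact, four content nodes.
document = Node("corpus-123", "CORPUS", [
    Node("artifact-1", "ARTIFACT", [
        Node("node-1", "TEXT"),
        Node("node-2", "IMAGE"),
        Node("node-3", "TABLE"),
        Node("node-4", "TEXT"),
    ]),
])

# Print an indented outline, then count nodes by type.
for depth, node in walk(document):
    print("  " * depth + f"{node.node_type} {node.node_id}")

type_counts = Counter(node.node_type for _, node in walk(document))
```

The same traversal pattern applies when child nodes are fetched from the API instead of being built in memory.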

Processing workflows enrich nodes with embeddings, positions, and metadata as they progress. Once processing completes, nodes are fully searchable and navigable, so queries and analysis can draw on the complete document structure.

Learn More

For detailed information about document structure:

Node Types: Understanding different node types
Node Navigation: Exploring document hierarchies
Workflows: How documents are processed

Next Steps

Now that you understand document management, explore:

Node Navigation: Explore document structures in detail
Search Operations: Find content within your documents
Common Patterns: Real-world document management workflows

Related Concepts

Understanding these concepts will help you manage documents effectively:

Node: The building blocks of documents
Workflows: How documents are processed
Status Tracking: Monitor processing progress
