Document Management

Document Operations

This guide covers uploading documents, listing your document collection, retrieving corpus information, and understanding document structure. For the basics, see Getting Started.

Uploading Documents

Document uploads use multipart form data to stream file content efficiently. The upload endpoint accepts the document file along with optional metadata including original filename and MIME type. When metadata is omitted, the system infers values from the uploaded file.

The upload process creates a CORPUS node representing the original file. This node serves as the root of the document hierarchy and triggers parsing workflows that extract content and create structured representations. The upload response includes the corpus identifier, which you use to reference the document in subsequent operations.

Large files are streamed in chunks, so an upload does not require holding the whole file in memory. The system handles streaming transparently: applications upload files of various sizes using standard HTTP multipart form data. The outcome of an upload is reported through the HTTP status code and the response body.

curl -X POST "https://<your_domain>/v1/corpus" \
-H "Authorization: Bearer <your-token>" \
-F "file=@document.pdf" \
-F "original_name=research-paper.pdf" \
-F "mime_type=application/pdf"
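
The same request can be assembled from a program. The following is a minimal sketch using only the Python standard library; it mirrors the field names from the curl example above (file, original_name, mime_type). The function name build_upload_request and the fallback part type application/octet-stream are assumptions made for illustration, not part of the MatsuDB API. Note that this sketch builds the whole body in memory, so it suits small files; a streaming client would be preferable for large ones.

```python
import urllib.request
import uuid


def build_upload_request(base_url, token, file_bytes, filename,
                         original_name=None, mime_type=None):
    """Build (but do not send) a multipart POST for the upload endpoint."""
    boundary = uuid.uuid4().hex
    chunks = []

    # File part. The field name "file" matches the curl example.
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: {mime_type or 'application/octet-stream'}\r\n\r\n"
    )
    chunks.append(file_header.encode())
    chunks.append(file_bytes)
    chunks.append(b"\r\n")

    # Optional metadata fields are sent only when provided, so the server
    # can infer them otherwise.
    for name, value in (("original_name", original_name),
                        ("mime_type", mime_type)):
        if value is not None:
            field = (
                f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                f"{value}\r\n"
            )
            chunks.append(field.encode())

    # Closing boundary terminates the multipart body.
    chunks.append(f"--{boundary}--\r\n".encode())

    return urllib.request.Request(
        url=f"{base_url}/v1/corpus",
        data=b"".join(chunks),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the returned object to urllib.request.urlopen would send it; the response body should then contain the corpus identifier described above.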

Supported Formats

MatsuDB supports various document formats:

  • PDF: Research papers, reports, forms
  • Spreadsheets: Excel, CSV files
  • Text files: Plain text, markdown
  • Images: When embedded in documents

The system automatically detects file types and applies appropriate parsing workflows.
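
If you prefer to send the optional metadata explicitly rather than rely on server-side detection, the values can be derived from the local filename. This is a small sketch using Python's standard mimetypes module; the helper name upload_metadata is hypothetical, and the guess is based only on the file extension, not on the file's content.

```python
import mimetypes
from pathlib import Path


def upload_metadata(path):
    """Return optional upload form fields guessed from a local file name."""
    name = Path(path).name
    fields = {"original_name": name}
    guessed, _encoding = mimetypes.guess_type(name)
    if guessed is not None:
        # Send mime_type only when the extension is recognized; otherwise
        # omit it and let the server infer the type.
        fields["mime_type"] = guessed
    return fields
```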

See the API Reference for complete upload endpoint documentation.

Listing Corpus Documents

The list endpoint returns all corpus documents in your namespace with pagination support. Each corpus entry includes the corpus identifier, original filename, creation timestamp, and blob metadata. This listing enables you to discover uploaded documents and understand your document collection.

Pagination parameters control result size and enable efficient processing of large document collections. Use page_size to control how many results appear per page, and page_token to retrieve subsequent pages. The response includes next_page_token when additional pages are available, and total_count indicating the complete collection size.

List results are ordered by creation time, with the most recent documents appearing first. This ordering lets you find recently uploaded documents quickly while still paging back through older uploads.

GET /v1/corpus?page_size=20&page_token=eyJwYWdlIjoxfQ==
Authorization: Bearer <your-token>

Pagination

List operations support pagination through page_size and page_token parameters. When next_page_token appears in a response, additional pages are available. Pass this token as page_token to retrieve subsequent pages.
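
The token-following loop described here can be sketched as a generator. The parameter names page_size and page_token and the response field next_page_token come from this guide; the response key "documents" and the fetch_page callable (which would perform the GET /v1/corpus request and return the decoded JSON body) are assumptions made for illustration.

```python
def iter_corpus(fetch_page, page_size=20):
    """Yield every corpus entry by following next_page_token.

    fetch_page is a caller-supplied function that takes a dict of query
    parameters and returns one decoded response page as a dict.
    """
    token = None
    while True:
        params = {"page_size": page_size}
        if token is not None:
            params["page_token"] = token
        page = fetch_page(params)
        # "documents" is an assumed key for the list of corpus entries.
        yield from page.get("documents", [])
        token = page.get("next_page_token")
        if not token:
            # No token means this was the last page.
            break
```

Keeping the HTTP call behind fetch_page leaves the loop independent of any particular HTTP client and makes it easy to exercise with canned pages.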

Retrieving Corpus Information

The get endpoint returns detailed information about a specific corpus document. The response includes the corpus identifier, original filename, blob location, creation timestamp, and MIME type. This information enables you to understand document metadata and track document lifecycle.

The corpus response also reports processing state through node counts and status information. Use it to check whether parsing has completed and how many nodes were created from the document, which lets you monitor processing progress.

GET /v1/corpus/corpus-123
Authorization: Bearer <your-token>
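
One way to use the processing state is to poll the get endpoint until parsing finishes. The sketch below is hedged heavily: this guide does not specify the response schema, so the field name "status" and the value "completed" are assumptions, as is the get_corpus callable (which would issue GET /v1/corpus/<id> and return the decoded JSON body). Check the API Reference for the actual field names before relying on it.

```python
import time


def wait_until_processed(get_corpus, corpus_id, timeout=60.0, interval=2.0,
                         sleep=time.sleep, clock=time.monotonic):
    """Poll a corpus until its (assumed) status field reads "completed"."""
    deadline = clock() + timeout
    while True:
        corpus = get_corpus(corpus_id)
        # Assumed schema: a "status" field that becomes "completed".
        if corpus.get("status") == "completed":
            return corpus
        if clock() >= deadline:
            raise TimeoutError(f"{corpus_id} not processed after {timeout}s")
        sleep(interval)
```

The sleep and clock parameters exist only so the loop can be tested without real delays.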

Understanding Document Structure

After upload, documents undergo parsing workflows that create structured node hierarchies. The CORPUS node represents the original file, while the ARTIFACT node represents the parsed document structure. Child nodes under the ARTIFACT represent document components: text nodes for paragraphs, image nodes for figures, and table nodes for tabular data.

This hierarchical structure enables navigation from the corpus level down to individual content elements. You can list child nodes to understand document structure, retrieve specific nodes by identifier, or search across all nodes within a document. The hierarchy preserves document organization while enabling granular access to individual components.
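
To make the hierarchy concrete, here is an illustrative in-memory model of it. This is not the MatsuDB response format: the Node class, its field names, and the identifiers are hypothetical, and placing the ARTIFACT directly under the CORPUS is an assumption based on the description above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    node_id: str
    kind: str  # e.g. "CORPUS", "ARTIFACT", "text", "image", "table"
    children: List["Node"] = field(default_factory=list)


def walk(node, depth=0):
    """Yield (depth, node) pairs depth-first, parents before children."""
    yield depth, node
    for child in node.children:
        yield from walk(child, depth + 1)


# A hypothetical document: one corpus, one artifact, three content nodes.
corpus = Node("corpus-123", "CORPUS", [
    Node("artifact-1", "ARTIFACT", [
        Node("node-1", "text"),
        Node("node-2", "image"),
        Node("node-3", "table"),
    ]),
])

outline = [(depth, node.kind) for depth, node in walk(corpus)]
```

Here outline lists the CORPUS at depth 0, the ARTIFACT at depth 1, and the three content nodes at depth 2, which is the top-down navigation path described in this section.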

Processing workflows enrich nodes with embeddings, positions, and metadata as they progress. Once processing completes, nodes become fully searchable and navigable, enabling sophisticated query and analysis operations that leverage the complete document structure.

Next Steps

Now that you understand document management, explore:

  1. Node Navigation: Explore document structures in detail
  2. Search Operations: Find content within your documents
  3. Common Patterns: Real-world document management workflows