Document Management
This guide covers uploading documents, listing your document collection, retrieving corpus information, and understanding document structure. For the basics, see Getting Started.
Uploading Documents
Document uploads use multipart form data to stream file content efficiently. The upload endpoint accepts the document file along with optional metadata including original filename and MIME type. When metadata is omitted, the system infers values from the uploaded file.
The upload process creates a CORPUS node representing the original file. This node serves as the root of the document hierarchy and triggers parsing workflows that extract content and create structured representations. The upload response includes the corpus identifier, which you use to reference the document in subsequent operations.
Large files are streamed in chunks, so an upload never has to fit in memory all at once. The system handles streaming transparently, and applications can upload files of various sizes using standard HTTP multipart form data. The HTTP status code and response body indicate whether the upload completed.
cURL:
    curl -X POST "https://<your_domain>/v1/corpus" \
      -H "Authorization: Bearer <your-token>" \
      -F "file=@document.pdf" \
      -F "original_name=research-paper.pdf" \
      -F "mime_type=application/pdf"
Python:
    import requests

    headers = {"Authorization": "Bearer <your-token>"}

    # Keep the file open while the request is sent so it can be read.
    with open("document.pdf", "rb") as f:
        files = {"file": f}
        data = {
            "original_name": "research-paper.pdf",
            "mime_type": "application/pdf",
        }
        response = requests.post(
            "https://<your_domain>/v1/corpus",
            headers=headers,
            files=files,
            data=data,
        )

    response.raise_for_status()  # fail early on a non-2xx status
    corpus = response.json()
    print(f"Corpus ID: {corpus['corpus_id']}")
MatsuDB supports various document formats:
- PDF: Research papers, reports, forms
- Spreadsheets: Excel, CSV files
- Text files: Plain text, markdown
- Images: When embedded in documents
The system automatically detects file types and applies appropriate parsing workflows.
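If you prefer to send the optional metadata explicitly rather than rely on server-side inference, the fields can be derived from the file path on the client. The following is a minimal sketch using Python's standard mimetypes module; the helper name build_upload_fields is hypothetical, and the guess is based on the file extension only, so it may differ from what the server detects.

```python
import mimetypes
from pathlib import Path


def build_upload_fields(path: str) -> dict:
    """Build the optional upload form fields from a local file path."""
    fields = {"original_name": Path(path).name}
    # guess_type() inspects only the extension; it never opens the file.
    mime_type, _encoding = mimetypes.guess_type(path)
    if mime_type is not None:
        fields["mime_type"] = mime_type
    # For unknown extensions the field is omitted, leaving the server
    # to infer the type from the uploaded file.
    return fields


print(build_upload_fields("reports/research-paper.pdf"))
print(build_upload_fields("notes.unknownext"))
```

The returned dictionary can be passed as the data argument of the upload request shown earlier.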
See the API Reference for complete upload endpoint documentation.
Listing Corpus Documents
The list endpoint returns all corpus documents in your namespace with pagination support. Each corpus entry includes the corpus identifier, original filename, creation timestamp, and blob metadata. This listing enables you to discover uploaded documents and understand your document collection.
Pagination parameters control result size and enable efficient processing of large document collections. Use page_size to control how many results appear per page, and page_token to retrieve subsequent pages. The response includes next_page_token when additional pages are available, and total_count indicating the complete collection size.
List results are ordered by creation time, with most recent documents appearing first. This ordering enables you to find recently uploaded documents quickly while supporting pagination through historical uploads.
Request:
    GET /v1/corpus?page_size=20&page_token=eyJwYWdlIjoxfQ==
    Authorization: Bearer <your-token>
Response:
    {
      "corpora": [
        {
          "corpus_id": "corpus-123",
          "original_name": "research-paper.pdf",
          "created_at": "2024-01-15T10:30:00Z",
          "mime_type": "application/pdf",
          "blob_checksum": "sha256:abc123..."
        }
      ],
      "next_page_token": "eyJwYWdlIjoyfQ==",
      "total_count": 45
    }
To walk the whole collection, pass the next_page_token value from each response as page_token in the next request. A response without next_page_token is the last page.
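The token-following loop can be sketched as a small generator. This is an illustration under assumptions, not the official client: iter_corpora and fetch_page are hypothetical names, and the HTTP call is replaced by an in-memory stand-in so the example runs without a server. The field names corpora and next_page_token come from the response shown above.

```python
from typing import Callable, Iterator, Optional


def iter_corpora(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield every corpus entry, following next_page_token until it is absent."""
    token: Optional[str] = None
    while True:
        page = fetch_page(token)
        yield from page.get("corpora", [])
        token = page.get("next_page_token")
        if not token:
            break  # no token means this was the last page


# In-memory stand-in for GET /v1/corpus, keyed by page_token.
FAKE_PAGES = {
    None: {
        "corpora": [{"corpus_id": "corpus-1"}, {"corpus_id": "corpus-2"}],
        "next_page_token": "t2",
        "total_count": 3,
    },
    "t2": {"corpora": [{"corpus_id": "corpus-3"}], "total_count": 3},
}


def fake_fetch(page_token: Optional[str]) -> dict:
    return FAKE_PAGES[page_token]


ids = [c["corpus_id"] for c in iter_corpora(fake_fetch)]
print(ids)
```

In real use, fetch_page would issue the list request with page_size and, when the token is not None, page_token as query parameters, and return the parsed JSON body.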
Retrieving Corpus Information
The get endpoint returns detailed information about a specific corpus document. The response includes the corpus identifier, original filename, blob location, creation timestamp, and MIME type. This information enables you to understand document metadata and track document lifecycle.
The corpus response also indicates processing state through node counts and status information. You can use this information to understand whether parsing has completed and how many nodes were created from the document. This visibility enables you to monitor processing progress and understand document structure.
Request:
    GET /v1/corpus/corpus-123
    Authorization: Bearer <your-token>
Response:
    {
      "corpus_id": "corpus-123",
      "original_name": "research-paper.pdf",
      "created_at": "2024-01-15T10:30:00Z",
      "mime_type": "application/pdf",
      "blob_checksum": "sha256:abc123...",
      "node_count": 127,
      "processing_status": "COMPLETED"
    }
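One way to monitor processing is to poll the get endpoint until processing_status reads COMPLETED. The sketch below assumes that approach; wait_until_completed is a hypothetical helper, the polling interval and attempt limit are arbitrary, and "PROCESSING" is an invented placeholder for any unfinished state, since only COMPLETED appears in the example above. The HTTP call is injected as a callable so the example runs without a server.

```python
import time
from typing import Callable


def wait_until_completed(
    get_corpus: Callable[[], dict],
    interval_s: float = 2.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll get_corpus() until processing_status is COMPLETED, then return it."""
    for attempt in range(max_attempts):
        corpus = get_corpus()
        if corpus.get("processing_status") == "COMPLETED":
            return corpus
        if attempt < max_attempts - 1:
            sleep(interval_s)  # wait before asking again
    raise TimeoutError(f"not COMPLETED after {max_attempts} attempts")


# Stand-in for GET /v1/corpus/{corpus_id}; "PROCESSING" is an assumed value.
_responses = iter([
    {"corpus_id": "corpus-123", "processing_status": "PROCESSING", "node_count": 40},
    {"corpus_id": "corpus-123", "processing_status": "COMPLETED", "node_count": 127},
])

# sleep is replaced with a no-op so the demonstration finishes instantly.
result = wait_until_completed(lambda: next(_responses), sleep=lambda _s: None)
print(result["processing_status"], result["node_count"])
```

A production version would likely also stop on a failure status, if the API reports one; see Status Tracking for the actual set of states.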
Understanding Document Structure
After upload, documents undergo parsing workflows that create structured node hierarchies. The CORPUS node represents the original file, while the ARTIFACT node represents the parsed document structure. Child nodes under the ARTIFACT represent document components: text nodes for paragraphs, image nodes for figures, and table nodes for tabular data.
This hierarchical structure enables navigation from the corpus level down to individual content elements. You can list child nodes to understand document structure, retrieve specific nodes by identifier, or search across all nodes within a document. The hierarchy preserves document organization while enabling granular access to individual components.
Processing workflows enrich nodes with embeddings, positions, and metadata as they progress. Once processing completes, nodes become fully searchable and navigable, enabling sophisticated query and analysis operations that leverage the complete document structure.
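To make the hierarchy concrete, the sketch below models a parsed document as nested dictionaries and walks it depth-first. The in-memory shape is purely illustrative: the keys type, id, and children and the type names TEXT, IMAGE, and TABLE are assumptions, not the API's actual node schema (see Node Types for that). Only CORPUS and ARTIFACT are named in this guide.

```python
from collections import Counter
from typing import Iterator, Tuple


def walk(node: dict, depth: int = 0) -> Iterator[Tuple[int, dict]]:
    """Yield (depth, node) pairs in depth-first, parent-before-child order."""
    yield depth, node
    for child in node.get("children", []):
        yield from walk(child, depth + 1)


# Hypothetical tree: CORPUS -> ARTIFACT -> content nodes.
corpus = {
    "type": "CORPUS", "id": "corpus-123", "children": [
        {"type": "ARTIFACT", "id": "artifact-1", "children": [
            {"type": "TEXT", "id": "node-1"},
            {"type": "IMAGE", "id": "node-2"},
            {"type": "TABLE", "id": "node-3"},
            {"type": "TEXT", "id": "node-4"},
        ]},
    ],
}

# Print an indented outline of the document.
for depth, node in walk(corpus):
    print("  " * depth + f"{node['type']} {node['id']}")

# Count nodes per type across the whole tree.
counts = Counter(node["type"] for _depth, node in walk(corpus))
print(dict(counts))
```

The same traversal applies if children are fetched lazily from the API instead of being held in memory; only the way each node's children are obtained changes.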
For detailed information about document structure:
- Node Types: Understanding different node types
- Node Navigation: Exploring document hierarchies
- Workflows: How documents are processed
Next Steps
Now that you understand document management, explore:
- Node Navigation: Explore document structures in detail
- Search Operations: Find content within your documents
- Common Patterns: Real-world document management workflows
Understanding these concepts will help you manage documents effectively:
- Nodes: The building blocks of documents
- Workflows: How documents are processed
- Status Tracking: Monitor processing progress