Getting Started
This guide walks you through uploading a document, understanding the processing lifecycle, and performing your first search. For API fundamentals, see the Introduction guide.
Uploading Your First Document
Uploading a document creates a CORPUS node that represents the original file. The upload endpoint accepts multipart form data containing the document file and optional metadata such as the original filename and MIME type. Uploads are streamed, so the endpoint can handle large files as well as small ones.
After upload, the API returns a corpus identifier and metadata. The system immediately begins processing the document through parsing workflows that extract content and create structured node hierarchies. This processing runs asynchronously, so the document becomes searchable only once processing completes, not as soon as the upload returns.
The upload response includes the corpus identifier, which you'll use to reference the document in subsequent operations. The response also includes the blob checksum, which enables the system to recognize duplicate uploads and reuse existing parsing results.
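The checksum-based deduplication described above can be illustrated with a minimal local sketch. It is not the service's implementation: the hash algorithm (SHA-256 here), the `UploadIndex` class, and the corpus identifiers are all assumptions made for illustration.

```python
import hashlib
import io


def blob_checksum(stream, chunk_size=1 << 20):
    """Hash a binary stream chunk by chunk so the whole file never sits in memory."""
    # Assumption: SHA-256 stands in for whatever checksum the service actually uses.
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


class UploadIndex:
    """Hypothetical in-memory map from blob checksum to corpus identifier."""

    def __init__(self):
        self._by_checksum = {}

    def register(self, stream, new_corpus_id):
        """Return (corpus_id, is_duplicate) for the uploaded bytes."""
        checksum = blob_checksum(stream)
        # setdefault keeps the first corpus id recorded for a given checksum.
        corpus_id = self._by_checksum.setdefault(checksum, new_corpus_id)
        return corpus_id, corpus_id != new_corpus_id


index = UploadIndex()
first = index.register(io.BytesIO(b"same bytes"), "corpus-1")
second = index.register(io.BytesIO(b"same bytes"), "corpus-2")
print(first, second)
```

The second call maps to the first corpus identifier because the bytes are identical, which is the property that would let existing parsing results be reused.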
- cURL
curl -X POST "https://<your_domain>/v1/corpus" \
  -H "Authorization: Bearer <your-token>" \
  -F "file=@document.pdf" \
  -F "original_name=document.pdf"
- Python
import requests

headers = {"Authorization": "Bearer <your-token>"}

# Keep the file open for the duration of the request so it can be read during upload.
with open("document.pdf", "rb") as f:
    files = {"file": f}
    data = {"original_name": "document.pdf"}
    response = requests.post(
        "https://<your_domain>/v1/corpus",
        headers=headers,
        files=files,
        data=data,
    )
Document processing happens asynchronously. The upload completes immediately, but parsing and enrichment occur in the background. Use status tracking to monitor when processing completes.
See the Document Management guide for more details on uploading and managing documents.
Understanding the Processing Lifecycle
When you upload a document, the system creates a CORPUS node representing the original file. Processing workflows then create an ARTIFACT node representing the parsed document structure, followed by child nodes representing paragraphs, images, tables, and other document components.
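As a rough mental model, the resulting hierarchy can be pictured as a tree. The sketch below is illustrative only: the `Node` class is hypothetical, it assumes the ARTIFACT node sits beneath the CORPUS node, and the `TABLE` and `IMAGE` type names and all identifiers other than those in this guide's examples are assumptions.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    """Hypothetical client-side representation of a node and its children."""
    node_id: str
    node_type: str
    children: List["Node"] = field(default_factory=list)

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, "Node"]]:
        """Yield (depth, node) pairs in depth-first, parent-before-child order."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


# One uploaded file -> one parsed structure -> component nodes.
corpus = Node("corpus-456", "CORPUS", [
    Node("artifact-1", "ARTIFACT", [
        Node("node-123", "TEXT"),
        Node("node-124", "TABLE"),   # assumed type name
        Node("node-125", "IMAGE"),   # assumed type name
    ]),
])

for depth, node in corpus.walk():
    print("  " * depth + f"{node.node_type} {node.node_id}")
```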
Processing occurs asynchronously through workflows that parse content, generate embeddings, and enrich nodes with additional information. The system tracks processing status for each node, enabling you to monitor progress and understand when content becomes searchable.
You can query node status to understand processing state. A node's status normally progresses from PENDING to RUNNING to COMPLETED; a FAILED or CANCELLED status indicates that processing did not finish. Status information includes error messages when failures occur, enabling you to diagnose and resolve processing issues.
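A polling loop over these states might look like the sketch below. The status endpoint and its response shape are not specified in this guide, so `fetch_status` is a hypothetical callable that you would implement against the real API; the `"state"` and `"error"` keys are likewise assumptions.

```python
import time

TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED"}


def wait_for_processing(fetch_status, interval=2.0, timeout=300.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until a terminal state is reached or the timeout expires.

    fetch_status is a hypothetical callable returning a dict such as
    {"state": "RUNNING"} or {"state": "FAILED", "error": "..."}.
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        state = status["state"]
        if state in TERMINAL_STATES:
            if state != "COMPLETED":
                detail = status.get("error", "no error message")
                raise RuntimeError(f"processing ended in {state}: {detail}")
            return status
        if clock() >= deadline:
            raise TimeoutError(f"still {state} after {timeout} seconds")
        sleep(interval)


# Stand-in for a real status call so the sketch runs without a server.
_fake_states = iter([{"state": "PENDING"}, {"state": "RUNNING"}, {"state": "COMPLETED"}])
final = wait_for_processing(lambda: next(_fake_states), sleep=lambda seconds: None)
print(final["state"])
```

Passing `sleep` and `clock` as parameters keeps the loop easy to test; in real use the defaults apply and only `fetch_status` needs to be supplied.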
For detailed information about the processing lifecycle:
- Workflows: How automated processing pipelines work
- Status Tracking: Monitoring processing progress
- Node Types: Understanding document structure
Your First Search
Once processing completes, nodes become searchable through text search operations. The simplest search uses exact text matching to find nodes containing specific terms. More sophisticated searches use semantic embeddings to find conceptually similar content regardless of exact terminology.
To perform a semantic text search, send a POST request to /v1/search/dense with your query text. The system automatically generates embeddings for your query and finds nodes with similar semantic meaning. Results include similarity scores that indicate how closely each result matches your query.
- Request
POST /v1/search/dense
Content-Type: application/json
Authorization: Bearer <your-token>

{
  "query_text": "climate change impacts",
  "min_similarity": 0.7,
  "top_k": 10
}
- Response
{
  "results": [
    {
      "node_id": "node-123",
      "root_node_id": "corpus-456",
      "similarity_score": 0.89,
      "content": "The effects of global warming on ecosystems...",
      "node_type": "TEXT"
    }
  ],
  "total_count": 15
}
Search results return nodes with their content, positions, and metadata. You can use these results to navigate document structures, retrieve full content, or perform additional searches based on discovered information.
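The dense search call and one way of organizing its results can be sketched in Python as follows. The endpoint path and field names come from the example above; the helper names, the `example.invalid` host, and the choice of the standard-library `urllib` (used here only to keep the sketch dependency-free) are assumptions.

```python
import json
import urllib.request


def build_dense_search_request(base_url, token, query_text, min_similarity=0.7, top_k=10):
    """Prepare (but do not send) a POST request for /v1/search/dense."""
    payload = {"query_text": query_text, "min_similarity": min_similarity, "top_k": top_k}
    return urllib.request.Request(
        url=f"{base_url}/v1/search/dense",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": f"Bearer {token}"},
        method="POST",
    )


def dense_search(base_url, token, query_text, **options):
    """Send the request and decode the JSON response body."""
    request = build_dense_search_request(base_url, token, query_text, **options)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def group_by_document(response_body):
    """Map each root_node_id to its matching node ids, best score first."""
    grouped = {}
    hits = sorted(response_body["results"], key=lambda hit: hit["similarity_score"], reverse=True)
    for hit in hits:
        grouped.setdefault(hit["root_node_id"], []).append(hit["node_id"])
    return grouped


# Response body copied from the example above, so no server is needed here.
sample = {
    "results": [
        {
            "node_id": "node-123",
            "root_node_id": "corpus-456",
            "similarity_score": 0.89,
            "content": "The effects of global warming on ecosystems...",
            "node_type": "TEXT",
        }
    ],
    "total_count": 15,
}
request = build_dense_search_request("https://example.invalid", "<your-token>", "climate change impacts")
print(request.get_method(), request.full_url)
print(group_by_document(sample))
```

Grouping by `root_node_id` is one way to see which uploaded documents a set of hits belongs to before navigating into their structure.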
MatsuDB supports multiple search modes:
- Dense search: Semantic similarity using dense embeddings
- Sparse search: Lexical precision using sparse embeddings
- Exact search: Literal text matching
See the Search guide for detailed information about all search modes.
Next Steps
Now that you've uploaded a document and run your first search, explore these guides:
- Document Management: Upload, list, and manage documents
- Search Operations: Advanced search techniques and filtering
- Node Navigation: Explore document structures
- Automation: Configure automated processing
- Common Patterns: Real-world usage patterns
Deepen your understanding with these concepts:
- Node: The fundamental unit of information
- Embeddings: How semantic search works
- Workflows: Automated processing pipelines