Node
The Node is the fundamental atomic unit in MatsuDB. Every piece of information—whether a paragraph, an image, a table cell, or a document section—exists as a node. Understanding nodes is essential to understanding how MatsuDB works.
Definition
In MatsuDB, the node represents the fundamental atomic unit through which information is understood, stored, and connected. Rather than treating documents as indivisible files, MatsuDB decomposes them into nodes: discrete pieces of information that exist independently while maintaining rich relationships with their context. Each node carries its own identity, content, and position, transforming static files into living, queryable knowledge entities.
This atomic approach enables operations that are impractical on whole files: a paragraph searched and versioned independently, an image linked across thousands of files, a table cell triggering targeted workflows. The node liberates information from its original container.
Beyond documents, nodes represent any structured information: genomic sequences as gene nodes positioned on chromosomes, network topologies as connection nodes, knowledge graphs as concept nodes. MatsuDB is a knowledge platform, not merely a document platform.
Core Philosophy: The Bonsai Approach
MatsuDB's "bonsai philosophy" treats each node with selective attention and care, rejecting monolithic processing where entire documents wait for the slowest component. When a complex PDF arrives, for example, simple text nodes become available immediately while expensive table reconstructions proceed in parallel. High-priority content processes first, regardless of position. Critical nodes receive enrichment, while routine content follows expedited paths.
Nodes form semantic and structural connections across document boundaries. A concept in one document links to its elaboration in another. Similar passages become discoverable through shared characteristics. The rigid tree becomes a flexible lattice in which a node can be reached through its links as well as through its parent.
The bonsai philosophy means that not all nodes receive the same treatment. Critical content gets priority processing, while routine content follows faster paths. This selective approach enables efficient resource utilization while maintaining quality where it matters most.
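The idea that high-priority content processes first, regardless of position, can be illustrated with a priority queue. This is a minimal sketch, not MatsuDB's scheduler: the `(position, priority, name)` tuple layout and the convention that a lower number means more urgent are assumptions made for the example.

```python
import heapq

def processing_order(nodes):
    """Return node names in the order they would be processed.

    `nodes` is a list of (position, priority, name) tuples. Both the
    tuple layout and the "lower number = more urgent" rule are
    hypothetical, chosen only for this illustration.
    """
    # Heap key is (priority, position): urgency wins, position only breaks ties.
    heap = [(priority, position, name) for position, priority, name in nodes]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]

document = [
    (0, 2, "cover page"),
    (1, 2, "boilerplate"),
    (2, 0, "key findings"),
]
print(processing_order(document))
```

The last node in the document is handled first because its priority outranks its position.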
Structure
Identity and Uniqueness
A node's identity emerges from three coordinates. See the Position concept for detailed information about how these coordinates work together.
| Coordinate | Purpose |
|---|---|
| Node ID | Unique identifier within the document tree |
| Root ID | Identifies the document structure the node belongs to and enables ancestry tracking |
| Namespace ID | Multi-tenant isolation boundary (see Namespace) |
This composite identity enables distributed scaling and secure multi-tenancy while preserving hierarchical relationships.
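As a rough illustration, the three coordinates can be modeled as one immutable key. The class name `NodeKey`, the field names, and the string ID values are assumptions for this sketch, not MatsuDB's actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NodeKey:
    """Hypothetical composite identity built from the three coordinates."""
    namespace_id: str  # multi-tenant isolation boundary
    root_id: str       # document structure the node belongs to
    node_id: str       # unique within that document tree

def same_tree(a: NodeKey, b: NodeKey) -> bool:
    """Two nodes share a tree only if namespace and root both match."""
    return a.namespace_id == b.namespace_id and a.root_id == b.root_id

a = NodeKey("tenant-a", "doc-1", "n-7")
b = NodeKey("tenant-a", "doc-1", "n-8")
c = NodeKey("tenant-b", "doc-1", "n-7")  # same IDs, different tenant
```

Because the dataclass is frozen, a key is hashable and can serve directly as a dictionary key or set member. Nodes `a` and `c` remain distinct even though their node and root IDs coincide.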
Type and Classification
Every node declares what kind of information it represents through its type designation, governing interpretation and processing pathways. Common types include CORPUS, ARTIFACT, TEXT, IMAGE, TABLE, FORMULA, and FORM. The system is extensible, and new types like GeneNode or ConceptNode unlock specialized workflows and use cases.
Classifications add nuance beyond the primary type. Text nodes might be classified as structural (headings, captions) or content (substantive information). Image nodes distinguish between diagrams, technical drawings, illustrations, and photographs. Each classification triggers specific workflows, and both type and classification remain extensible, evolving with use rather than constraining information into predetermined categories.
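One way to picture how a type and classification select a workflow is a lookup table keyed by both. The sketch below is an assumption-laden illustration: the workflow names, the `register` and `pick_workflow` helpers, and the fallback order are all hypothetical.

```python
# Hypothetical registry mapping (type, classification) to a workflow name.
# A classification of None means "any classification of this type".
WORKFLOWS = {
    ("TEXT", "structural"): "index_heading",
    ("TEXT", "content"): "embed_text",
    ("IMAGE", "diagram"): "describe_diagram",
    ("TABLE", None): "reconstruct_table",
}

def register(node_type, classification, workflow):
    """Add a new type or classification without touching existing entries."""
    WORKFLOWS[(node_type, classification)] = workflow

def pick_workflow(node_type, classification=None, default="default_pipeline"):
    """Prefer the most specific match, then the type-wide entry, then a default."""
    for key in ((node_type, classification), (node_type, None)):
        if key in WORKFLOWS:
            return WORKFLOWS[key]
    return default

# Extending the system with a new type is a single registration.
register("GeneNode", None, "annotate_gene")
```

Types are plain strings here rather than a closed enumeration, so that new types can be added at runtime, in keeping with the extensibility described above.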
Content and Representation
Nodes store content in multiple forms:
- Direct text: Inline storage for names, paragraphs, and descriptions, serving as the primary interface for meaning
- URIs and fingerprints: For large or binary content like images and tables, nodes maintain URIs pointing to external storage plus cryptographic fingerprints enabling deduplication and integrity verification
- Metadata: Flexible key-value storage including tags, subjects, quality ratings, token counts for language model operations, and any other dimension relevant to use (see Metadata)
This multi-faceted representation acknowledges that information has multiple relevant contexts.
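The role of fingerprints in deduplication and integrity checking can be sketched as follows. SHA-256, the `blob://` URI scheme, and the in-memory `blob_store` dictionary are assumptions for the example; the source only states that fingerprints are cryptographic.

```python
import hashlib

# Hypothetical in-memory stand-in for external storage: fingerprint -> URI.
blob_store = {}

def fingerprint(data: bytes) -> str:
    """Content fingerprint; SHA-256 is an assumed choice of hash."""
    return hashlib.sha256(data).hexdigest()

def put_blob(data: bytes):
    """Store content once per fingerprint and return (uri, fingerprint)."""
    fp = fingerprint(data)
    # setdefault keeps the first URI, so identical content is never stored twice.
    uri = blob_store.setdefault(fp, f"blob://{fp}")
    return uri, fp

def verify(data: bytes, expected_fp: str) -> bool:
    """Integrity check: recompute the fingerprint and compare."""
    return fingerprint(data) == expected_fp
```

A node for a large image would then carry only the URI and fingerprint, while the bytes live in external storage.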
Hierarchical Relationships
Nodes maintain three relationship dimensions:
- Parent-child relationships: Create vertical tree structures mirroring natural organization (sections contain paragraphs, tables contain cells)
- Sequential links: Preserve horizontal reading order and document flow (title → first paragraph → second paragraph)
- Ancestral paths: Provide encoded representation of complete lineage from root to node, enabling efficient queries without tree traversal
The hierarchy becomes navigable in multiple directions simultaneously: up toward roots, down toward leaves, laterally among siblings.
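The three dimensions can be sketched with a small in-memory table. The field names (`parent`, `next`, `path`) and the `/`-joined path encoding are assumptions for illustration; the source does not specify how ancestral paths are encoded.

```python
# Hypothetical node table: parent link, successor link, and encoded path.
nodes = {
    "corpus":   {"parent": None,       "next": None, "path": "corpus"},
    "artifact": {"parent": "corpus",   "next": None, "path": "corpus/artifact"},
    "sec1":     {"parent": "artifact", "next": None, "path": "corpus/artifact/sec1"},
    "title":    {"parent": "sec1",     "next": "p1", "path": "corpus/artifact/sec1/title"},
    "p1":       {"parent": "sec1",     "next": "p2", "path": "corpus/artifact/sec1/p1"},
    "p2":       {"parent": "sec1",     "next": None, "path": "corpus/artifact/sec1/p2"},
}

def ancestors(node_id):
    """Read the lineage straight from the path, with no parent-by-parent walk."""
    return nodes[node_id]["path"].split("/")[:-1]

def descendants(node_id):
    """A prefix match on the path finds the whole subtree."""
    prefix = nodes[node_id]["path"] + "/"
    return sorted(nid for nid, n in nodes.items() if n["path"].startswith(prefix))

def reading_order(first_id):
    """Follow successor links to recover horizontal document flow."""
    order, current = [], first_id
    while current is not None:
        order.append(current)
        current = nodes[current]["next"]
    return order
```

In a database, the prefix match in `descendants` would typically be served by an index on the path column, which is where the storage-for-speed trade pays off.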
Semantic Representation
Vector encodings transform meaning into geometric coordinates where conceptual similarity becomes spatial proximity. The system maintains both dense vectors for semantic similarity (finding "ocean" ↔ "sea" connections) and sparse vectors for lexical precision (preserving exact terms and subtle distinctions). This dual strategy supports both semantic discovery and precise matching, depending on what the query needs.
For detailed information about how embeddings work, see the Embeddings concept documentation.
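To make the dense and sparse distinction concrete, the sketch below scores a query against a node using both representations. The weighted sum in `hybrid_score` and its `alpha` parameter are assumptions; the source does not say how MatsuDB combines the two signals.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def sparse_dot(a, b):
    """Dot product of sparse term-weight dicts; only shared terms contribute."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller dict
    return sum(weight * b.get(term, 0.0) for term, weight in a.items())

def hybrid_score(q_dense, q_sparse, n_dense, n_sparse, alpha=0.5):
    """Hypothetical blend: alpha weights semantic match, the rest lexical match."""
    return alpha * cosine(q_dense, n_dense) + (1 - alpha) * sparse_dot(q_sparse, n_sparse)
```

With sparse vectors alone, a query for "sea" scores zero against a node that only mentions "ocean", since the terms do not overlap. The dense component is what allows the two to be connected.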
Extended Attributes
Positions
MatsuDB's universal positioning framework accommodates spatial, logical, and structural coordinates in a unified abstraction. See Position for details. A node can declare multiple positions simultaneously, combining pixel coordinates on a page with a logical position in a tree. A table cell, for example, has both a row-column address and a spatial extent when rendered.
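A table cell carrying two positions at once might look like the sketch below. The class names, fields, and the `"spatial"` and `"grid"` keys are hypothetical; the Position documentation defines the real model.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class BoundingBox:
    """Hypothetical spatial position: where the content renders on a page."""
    page: int
    x: float
    y: float
    width: float
    height: float

@dataclass(frozen=True)
class GridCell:
    """Hypothetical logical position: row and column inside a table."""
    row: int
    column: int

@dataclass
class PositionedNode:
    node_id: str
    # Several positions can coexist, keyed by the kind of coordinate system.
    positions: dict = field(default_factory=dict)

cell = PositionedNode(
    "cell-r2c3",
    {"spatial": BoundingBox(4, 120.0, 310.5, 80.0, 18.0), "grid": GridCell(2, 3)},
)
```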
Metadata
Extensible key-value storage grows organically with use: extraction confidence scores, project relevance markers, human annotations, and processing parameters. Metadata is searchable and queryable without schema changes. See Metadata for more information.
Processing Status
Detailed operation tracking enables targeted fixes rather than blanket reprocessing. Status transitions follow the sequence PENDING → RUNNING → COMPLETED, FAILED, or CANCELLED. See Status Tracking for comprehensive information about monitoring and debugging processing operations.
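The status sequence can be read as a small state machine. This sketch follows the sequence literally: whether MatsuDB allows other moves, such as cancelling a PENDING operation or retrying a FAILED one, is not stated, so those are rejected here by assumption. The `transition` helper is hypothetical.

```python
from enum import Enum

class Status(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    CANCELLED = "CANCELLED"

# Allowed moves, taken literally from the documented sequence (an assumption).
ALLOWED = {
    Status.PENDING: {Status.RUNNING},
    Status.RUNNING: {Status.COMPLETED, Status.FAILED, Status.CANCELLED},
    Status.COMPLETED: set(),
    Status.FAILED: set(),
    Status.CANCELLED: set(),
}

def transition(current: Status, new: Status) -> Status:
    """Return the new status, or raise if the move is not in ALLOWED."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {new.value}")
    return new
```

Rejecting illegal moves at the point of update keeps the recorded history trustworthy, which is what makes targeted fixes possible.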
Node Types
| Type | Purpose | Characteristics |
|---|---|---|
| CORPUS | Original file | Represents the uploaded file (PDF, spreadsheet, etc.), the document source before parsing |
| ARTIFACT | Parsed document | Canonical result of parsing a corpus, the structured representation of the document, reusable across multiple corpus instances, classification: "corpus_extraction" |
| SECTION | Document section | Intermediate structural node that groups related content (e.g., introduction section, methodology section), enables advanced navigation |
| TEXT | Textual content | Paragraphs, titles, captions, primary semantic encoding candidates |
| IMAGE | Visual content | Spatial positioning, multimodal analysis, cross-modal connections |
| TABLE | Tabular data | 2D structure, cell relationships, container for group table nodes |
| GROUP_TABLE | Table row/group | Intermediate structural node representing a row or group within a table |
| FORMULA | Mathematical expressions | Multiple representations (visual, LaTeX, semantic) |
| FORM | Structured input fields | Field labels + values, automatic data extraction |
The distinction between CORPUS and ARTIFACT enables parsing-level deduplication. When a file is uploaded, a CORPUS node is created representing the original file. The parsing workflow then creates an ARTIFACT node as the canonical parsing result. If the same file (same blob checksum) is uploaded again, a new CORPUS node is created, but the system recognizes the existing ARTIFACT and copies all its child nodes without re-parsing.
Artifact Reusability
The ARTIFACT node serves as the canonical parsing result, enabling parsing-level deduplication. When the same document (identified by blob checksum) is uploaded multiple times, the system can reuse the existing ARTIFACT instead of re-parsing.
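The reuse flow can be sketched in a few lines. Everything here is illustrative: the `upload` and `parse` functions, the dictionary shapes, SHA-256 as the blob checksum, and attaching the copied children to the new CORPUS are assumptions, not MatsuDB's implementation.

```python
import hashlib
from itertools import count

_corpus_ids = count(1)
artifacts_by_checksum = {}  # hypothetical index: blob checksum -> ARTIFACT
parse_calls = []            # records each real parse, to show when work is skipped

def parse(data: bytes):
    """Stand-in for an expensive parser: one child node per non-empty line."""
    parse_calls.append(len(data))
    return [line for line in data.decode().splitlines() if line]

def upload(data: bytes) -> dict:
    checksum = hashlib.sha256(data).hexdigest()
    # Every upload gets its own CORPUS node, even for a repeated file.
    corpus = {"id": next(_corpus_ids), "type": "CORPUS", "checksum": checksum}
    artifact = artifacts_by_checksum.get(checksum)
    if artifact is None:
        # First time this content is seen: parse and cache the ARTIFACT.
        artifact = {
            "type": "ARTIFACT",
            "classification": "corpus_extraction",
            "children": parse(data),
        }
        artifacts_by_checksum[checksum] = artifact
    # Copy the child nodes rather than re-parsing.
    corpus["children"] = list(artifact["children"])
    return corpus
```

Uploading the same bytes twice yields two CORPUS nodes with equal children, while `parse` runs only once.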
Structural Dimensions
Three structural dimensions work together:
- Parent-child tree structures: Create vertical containment that mirrors natural organization (Corpus → Artifact → Section → Paragraph), so each node can be understood in its context
- Sequential flow: Preserves horizontal reading order through successor links (Title → Paragraph 1 → Paragraph 2 → Image), maintaining narrative flow
- Ancestral paths: Encode the complete lineage from root to node, enabling efficient queries without tree traversal; this trades storage for speed, a worthwhile exchange for frequent hierarchical operations
Namespace Isolation
Each node belongs to exactly one namespace, a boundary ensuring complete data isolation between tenants. Isolation is enforced at foundational security levels, not just in application logic. Each namespace establishes its own processing rules, workflows, and practices. As the system scales, namespaces become the natural distribution unit.
Related Concepts
The node concept intertwines with several architectural elements:
- Namespaces: Provide isolation boundaries for multi-tenant separation
- Positions: Offer multimodal coordinates combining spatial and logical dimensions
- Rules & Triggers: Establish trigger conditions for dynamic workflows
- Workflows: Orchestrate transformation from raw to enriched content
- Embeddings: Create geometric representations enabling semantic search
- Status Tracking: Provides operation visibility and debugging capability
- Metadata: Provides extensible key-value storage for custom attributes
Together, these elements form an integrated architecture where nodes serve as the fundamental unit of information atomicity.