Node

Fundamental Concept

The Node is the fundamental atomic unit in MatsuDB. Every piece of information—whether a paragraph, an image, a table cell, or a document section—exists as a node. Understanding nodes is essential to understanding how MatsuDB works.

Definition

MatsuDB does not treat documents as indivisible files. It decomposes them into nodes: discrete pieces of information that exist independently while keeping their relationships to the surrounding context. Each node carries its own identity, content, and position, which turns a static file into a set of queryable knowledge entities.

This atomic approach enables operations that are impractical when the file is the smallest unit: a paragraph can be searched and versioned independently, an image can be linked across thousands of files, and a table cell can trigger a targeted workflow. The node liberates information from its original container.

Beyond Documents

Beyond documents, nodes represent any structured information: genomic sequences as gene nodes positioned on chromosomes, network topologies as connection nodes, knowledge graphs as concept nodes. MatsuDB is a knowledge platform, not merely a document platform.

Core Philosophy: The Bonsai Approach

MatsuDB's "bonsai philosophy" treats each node with selective attention and care, rejecting monolithic processing where entire documents wait for the slowest component. When a complex PDF arrives, for example, simple text nodes become available immediately while expensive table reconstructions proceed in parallel. High-priority content is processed first, regardless of its position in the document. Critical nodes receive enrichment, while routine content follows expedited paths.

Nodes form semantic and structural connections across document boundaries. A concept in one document links to its elaboration in another. Similar passages become discoverable through shared characteristics. The rigid tree transforms into a flexible lattice where meaning flows freely.

Selective Processing

The bonsai philosophy means that not all nodes receive the same treatment. Critical content gets priority processing, while routine content follows faster paths. This selective approach enables efficient resource utilization while maintaining quality where it matters most.
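The selective scheduling described above can be illustrated with a small priority queue. This is only a sketch: the `PendingNode` class, the numeric `priority` field (lower means sooner), and the node IDs are assumptions made for illustration, not MatsuDB's actual scheduler or API.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class PendingNode:
    # Hypothetical record: only priority and position take part in ordering.
    priority: int                        # lower value = processed sooner (assumed convention)
    position: int                        # original document order, used as a tie-breaker
    node_id: str = field(compare=False)
    node_type: str = field(compare=False)


def processing_order(nodes):
    """Return node IDs in the order a priority-first scheduler would pick them."""
    heap = list(nodes)
    heapq.heapify(heap)
    order = []
    while heap:
        order.append(heapq.heappop(heap).node_id)
    return order


nodes = [
    PendingNode(priority=2, position=0, node_id="table-1", node_type="TABLE"),
    PendingNode(priority=1, position=1, node_id="text-1", node_type="TEXT"),
    PendingNode(priority=0, position=2, node_id="text-2", node_type="TEXT"),  # flagged critical
]
print(processing_order(nodes))  # ['text-2', 'text-1', 'table-1']
```

The critical node is handled first even though it appears last in the document, and the expensive table does not block the text nodes.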

Structure

Identity and Uniqueness

A node's identity emerges from three coordinates. See the Position concept for detailed information about how these coordinates work together.

Coordinate	Purpose
Node ID	Unique identifier within the document tree
Root ID	Membership in a particular document structure; enables ancestry tracking
Namespace ID	Multi-tenant isolation boundary (see Namespace)

This composite identity enables distributed scaling and secure multi-tenancy while preserving hierarchical relationships.
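One way to picture the composite identity is as an immutable three-part key. The `NodeKey` class and the sample IDs below are hypothetical and exist only to show why all three coordinates are needed; they are not MatsuDB's storage schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeKey:
    # Hypothetical composite key built from the three coordinates in the table.
    namespace_id: str  # tenant isolation boundary
    root_id: str       # document structure the node belongs to
    node_id: str       # unique within that document tree


# The same node_id and root_id in two namespaces are still distinct nodes.
a = NodeKey("tenant-a", "doc-1", "n-42")
b = NodeKey("tenant-b", "doc-1", "n-42")

store = {a: "paragraph for tenant A", b: "paragraph for tenant B"}
print(store[NodeKey("tenant-a", "doc-1", "n-42")])
```

Because the key is frozen and hashable, it can serve directly as a dictionary key or as a partitioning key in a distributed store.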

Type and Classification

Every node declares what kind of information it represents through its type designation, governing interpretation and processing pathways. Common types include CORPUS, ARTIFACT, TEXT, IMAGE, TABLE, FORMULA, and FORM. The system is extensible, and new types like GeneNode or ConceptNode unlock specialized workflows and use cases.

Classifications add nuance beyond the primary type. Text nodes might be classified as structural (headings, captions) or content (substantive information). Image nodes distinguish between diagrams, technical drawings, illustrations, and photographs. Each classification triggers specific workflows, and both type and classification remain extensible, evolving with use rather than constraining information into predetermined categories.
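As a rough sketch of how type and classification could select a workflow, the example below keeps a registry keyed by the pair and falls back to a type-level handler. The `register` decorator, `select_workflow`, and the workflow functions are invented for illustration; the source does not describe how MatsuDB dispatches internally.

```python
from typing import Callable, Dict, Optional, Tuple

Workflow = Callable[[str], str]

# Hypothetical registry: (type, classification) -> workflow function.
_registry: Dict[Tuple[str, Optional[str]], Workflow] = {}


def register(node_type: str, classification: Optional[str] = None):
    """Register a workflow for a type, optionally narrowed by classification."""
    def decorator(fn: Workflow) -> Workflow:
        _registry[(node_type, classification)] = fn
        return fn
    return decorator


def select_workflow(node_type: str, classification: Optional[str] = None) -> Workflow:
    # Prefer an exact (type, classification) match, then the type-level default.
    workflow = _registry.get((node_type, classification)) or _registry.get((node_type, None))
    if workflow is None:
        raise KeyError(f"no workflow registered for {node_type}/{classification}")
    return workflow


@register("TEXT")
def index_text(content: str) -> str:
    return f"indexed:{content}"


@register("TEXT", "structural")
def outline_heading(content: str) -> str:
    return f"outline:{content}"


@register("GeneNode")  # a new type plugs in without changing existing code
def annotate_gene(content: str) -> str:
    return f"gene:{content}"


print(select_workflow("TEXT", "structural")("Introduction"))
```

Using plain strings rather than a closed enumeration keeps both type and classification open to extension, matching the extensibility described above.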

Content and Representation

Nodes store content in multiple forms:

Direct text: Inline storage for names, paragraphs, and descriptions, serving as the primary interface for meaning
URIs and fingerprints: For large or binary content like images and tables, nodes maintain URIs pointing to external storage plus cryptographic fingerprints enabling deduplication and integrity verification
Metadata: Flexible key-value storage including tags, subjects, quality ratings, token counts for language model operations, and any other dimension relevant to use (see Metadata)

This multi-faceted representation acknowledges that information has multiple relevant contexts.
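The role of fingerprints in deduplication and integrity checks can be sketched as follows. SHA-256 is an assumption (the text only says "cryptographic fingerprints"), and the `BlobIndex` class and `blob://` URIs are hypothetical.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Content fingerprint; SHA-256 is assumed here for illustration."""
    return hashlib.sha256(data).hexdigest()


class BlobIndex:
    """Hypothetical index mapping fingerprints to the URI of stored content."""

    def __init__(self):
        self._uri_by_fingerprint = {}

    def put(self, data: bytes, uri: str):
        # Identical bytes map to the URI stored first, so the content is kept once.
        fp = fingerprint(data)
        stored_uri = self._uri_by_fingerprint.setdefault(fp, uri)
        return fp, stored_uri

    @staticmethod
    def verify(data: bytes, expected_fp: str) -> bool:
        # Integrity check: recompute the fingerprint and compare.
        return fingerprint(data) == expected_fp


index = BlobIndex()
image = b"\x89PNG...example bytes"
fp1, uri1 = index.put(image, "blob://images/logo-1")
fp2, uri2 = index.put(image, "blob://images/logo-2")  # duplicate upload
print(uri1 == uri2)  # True: both nodes can point at the same stored blob
```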

Hierarchical Relationships

Nodes maintain three relationship dimensions:

Parent-child relationships: Create vertical tree structures mirroring natural organization (sections contain paragraphs, tables contain cells)
Sequential links: Preserve horizontal reading order and document flow (title → first paragraph → second paragraph)
Ancestral paths: Provide encoded representation of complete lineage from root to node, enabling efficient queries without tree traversal

The hierarchy becomes navigable in multiple directions simultaneously: up toward roots, down toward leaves, laterally among siblings.

Semantic Representation

Vector encodings transform meaning into geometric coordinates where conceptual similarity becomes spatial proximity. The system maintains both dense vectors for semantic similarity (finding "ocean" ↔ "sea" connections) and sparse vectors for lexical precision (preserving distinctions between specific terms). This dual strategy supports both semantic discovery and precise matching, depending on what the query needs.
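A toy calculation shows how dense and sparse signals can complement each other. The vectors, the `alpha` weighting, and the linear combination are illustrative assumptions; the source does not say how MatsuDB combines the two scores.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def sparse_dot(a: dict, b: dict) -> float:
    """Dot product of sparse term-weight vectors; only shared terms contribute."""
    return sum(weight * b[term] for term, weight in a.items() if term in b)


def hybrid_score(q_dense, q_sparse, n_dense, n_sparse, alpha=0.5):
    # Assumed weighting: alpha for the semantic part, the rest for the lexical part.
    return alpha * cosine(q_dense, n_dense) + (1 - alpha) * sparse_dot(q_sparse, n_sparse)


# Toy data: the query is "ocean"; one node says "sea", the other says "ocean".
query_dense, query_sparse = [0.9, 0.1], {"ocean": 1.0}
sea_dense, sea_sparse = [0.8, 0.2], {"sea": 1.0}
ocean_dense, ocean_sparse = [0.9, 0.1], {"ocean": 1.0}

score_sea = hybrid_score(query_dense, query_sparse, sea_dense, sea_sparse)
score_ocean = hybrid_score(query_dense, query_sparse, ocean_dense, ocean_sparse)
print(score_ocean > score_sea > 0)  # True
```

The "sea" node still scores above zero because the dense vectors are close, while the exact lexical match ranks highest.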

Learn More

For detailed information about how embeddings work, see the Embeddings concept documentation.

Extended Attributes

Positions

MatsuDB's universal positioning framework accommodates spatial, logical, and structural coordinates in a unified abstraction. See Position for details. Nodes declare multiple positions simultaneously, combining pixel coordinates on a page with logical position in a tree. A table cell has both row-column address and spatial extent when rendered.
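The idea of one node holding several positions at once can be sketched with two small position types. The class names `BoundingBox`, `GridCell`, and `PositionedNode`, and their fields, are hypothetical; see Position for how MatsuDB actually models coordinates.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass(frozen=True)
class BoundingBox:
    # Spatial extent on a rendered page (units assumed to be pixels).
    page: int
    x: float
    y: float
    width: float
    height: float


@dataclass(frozen=True)
class GridCell:
    # Logical row-column address inside a table.
    row: int
    column: int


Position = Union[BoundingBox, GridCell]


@dataclass
class PositionedNode:
    node_id: str
    positions: List[Position] = field(default_factory=list)

    def position_of(self, kind):
        """Return the first position of the requested kind, or None."""
        return next((p for p in self.positions if isinstance(p, kind)), None)


cell = PositionedNode(
    node_id="cell-2-3",
    positions=[GridCell(row=2, column=3), BoundingBox(page=1, x=120.0, y=340.0, width=80.0, height=18.0)],
)
print(cell.position_of(GridCell), cell.position_of(BoundingBox).page)
```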

Metadata

Extensible key-value storage grows organically with use, holding extraction confidence scores, project relevance markers, human annotations, and processing parameters. These attributes are searchable and queryable without schema changes. See Metadata for more information.

Processing Status

Detailed operation tracking enables targeted fixes rather than blanket reprocessing. Each operation moves through the status sequence PENDING → RUNNING → COMPLETED, FAILED, or CANCELLED. See Status Tracking for comprehensive information about monitoring and debugging processing operations.
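The status lifecycle can be modeled as a small state machine. This sketch allows only the transitions listed above; whether MatsuDB permits others (for example, cancelling an operation that is still PENDING) is not stated in the source, so the `ALLOWED` table is an assumption.

```python
from enum import Enum


class Status(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    CANCELLED = "CANCELLED"


# Assumed transition table, taken literally from the documented sequence.
ALLOWED = {
    Status.PENDING: {Status.RUNNING},
    Status.RUNNING: {Status.COMPLETED, Status.FAILED, Status.CANCELLED},
}


def transition(current: Status, target: Status) -> Status:
    """Return the new status, or raise if the move is not in the table."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target


status = transition(Status.PENDING, Status.RUNNING)
status = transition(status, Status.FAILED)
print(status.name)  # FAILED
```

Terminal states have no entry in the table, so any attempt to move out of COMPLETED, FAILED, or CANCELLED is rejected.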

Node Types

Type	Purpose	Characteristics
CORPUS	Original file	Represents the uploaded file (PDF, spreadsheet, etc.), the document source before parsing
ARTIFACT	Parsed document	Canonical result of parsing a corpus, the structured representation of the document, reusable across multiple corpus instances, classification: "corpus_extraction"
SECTION	Document section	Intermediate structural node that groups related content (e.g., introduction section, methodology section), enables advanced navigation
TEXT	Textual content	Paragraphs, titles, captions, primary semantic encoding candidates
IMAGE	Visual content	Spatial positioning, multimodal analysis, cross-modal connections
TABLE	Tabular data	2D structure, cell relationships, container for group table nodes
GROUP_TABLE	Table row/group	Intermediate structural node representing a row or group of rows within a table
FORMULA	Mathematical expressions	Multiple representations (visual, LaTeX, semantic)
FORM	Structured input fields	Field labels + values, automatic data extraction

CORPUS vs ARTIFACT

The distinction between CORPUS and ARTIFACT enables parsing-level deduplication. When a file is uploaded, a CORPUS node is created representing the original file. The parsing workflow then creates an ARTIFACT node as the canonical parsing result. If the same file (same blob checksum) is uploaded again, a new CORPUS node is created, but the system recognizes the existing ARTIFACT and copies all its child nodes without re-parsing.

Artifact Reusability

The ARTIFACT node serves as the canonical parsing result, enabling parsing-level deduplication. When the same document (identified by blob checksum) is uploaded multiple times, the system can reuse the existing ARTIFACT instead of re-parsing.
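A minimal sketch of this reuse, assuming SHA-256 as the blob checksum and a hypothetical `ParsingCache` class with an injected parser, might look like the following. It is not MatsuDB's parsing workflow, only an illustration of the checksum lookup.

```python
import hashlib


class ParsingCache:
    """Hypothetical store: one parsed artifact per distinct blob checksum."""

    def __init__(self, parse):
        self._parse = parse      # parser function, injected for illustration
        self._artifacts = {}     # checksum -> list of parsed child nodes
        self.parse_calls = 0

    def upload(self, blob: bytes) -> dict:
        checksum = hashlib.sha256(blob).hexdigest()  # checksum algorithm is assumed
        if checksum not in self._artifacts:
            # First time this content is seen: parse once and keep the result.
            self.parse_calls += 1
            self._artifacts[checksum] = self._parse(blob)
        # Every upload gets its own CORPUS with a copy of the artifact's children.
        return {
            "type": "CORPUS",
            "checksum": checksum,
            "children": list(self._artifacts[checksum]),
        }


def fake_parse(blob: bytes) -> list:
    # Stand-in parser: split the text into paragraph nodes.
    return blob.decode("utf-8").split("\n\n")


cache = ParsingCache(fake_parse)
first = cache.upload(b"Title\n\nFirst paragraph")
second = cache.upload(b"Title\n\nFirst paragraph")  # same bytes, uploaded again
print(cache.parse_calls)  # 1
```

Both uploads produce separate CORPUS records with identical children, but the parser runs only once.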

Structural Dimensions

Parent-child tree structures: Create vertical containment that mirrors natural organization (Corpus → Artifact → Section → Paragraph), so each node can be understood in the context of its ancestors
Sequential flow: Preserves horizontal reading order through successor links (Title → Paragraph 1 → Paragraph 2 → Image), maintaining narrative flow
Ancestral paths: Encode a node's complete lineage, enabling efficient queries without tree traversal; this trades storage for speed, a worthwhile exchange for frequent hierarchical operations
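Reading order through successor links amounts to walking a singly linked chain. The `successor` mapping and the node IDs below are hypothetical; they only illustrate the idea of following links from a starting node.

```python
# Hypothetical successor links: each node points to the next node in reading order.
successor = {
    "title": "paragraph-1",
    "paragraph-1": "paragraph-2",
    "paragraph-2": "image-1",
    "image-1": None,
}


def reading_order(start: str, links: dict) -> list:
    """Follow successor links from a starting node, guarding against cycles."""
    order, seen = [], set()
    current = start
    while current is not None and current not in seen:
        order.append(current)
        seen.add(current)
        current = links.get(current)
    return order


print(reading_order("title", successor))
```

Starting from any node in the chain yields the rest of the document flow from that point onward, independent of where the nodes sit in the parent-child tree.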

Namespace Isolation

Each node belongs to exactly one namespace, a boundary ensuring complete data isolation between tenants. This isolation is enforced at foundational security levels, not just in application logic. Each namespace establishes its own processing rules, workflows, and practices. As the system scales, namespaces become the natural distribution unit.

Related Concepts

The node concept intertwines with several architectural elements:

Namespaces: Provide isolation boundaries for multi-tenant separation
Positions: Offer multimodal coordinates combining spatial and logical dimensions
Rules & Triggers: Establish trigger conditions for dynamic workflows
Workflows: Orchestrate transformation from raw to enriched content
Embeddings: Create geometric representations enabling semantic search
Status Tracking: Provides operation visibility and debugging capability
Metadata: Extensible key-value storage for custom attributes

Together, these elements form an integrated architecture where nodes serve as the fundamental unit of information atomicity.
