Workflows
Definition
Workflows in MatsuDB represent automated processing pipelines that transform and enrich document content. When a document is uploaded or nodes are created, workflows execute sequences of operations that parse content, generate embeddings, and augment nodes with additional information. These workflows operate asynchronously, enabling responsive system behavior while complex processing occurs in the background.
Workflows are triggered automatically by the rules and triggers system, responding to events in the node lifecycle. They orchestrate multiple activities—discrete processing steps that perform specific operations—coordinating their execution, managing retries, and tracking progress. This orchestration ensures reliable processing even when individual steps encounter temporary failures.
Core Philosophy: Orchestrated Processing
Workflows embody the bonsai philosophy by enabling selective, configurable processing. Different workflows handle different aspects of content transformation: parsing workflows decompose documents into nodes, embedding workflows generate semantic representations, and augmentation workflows enrich nodes with additional context. Each workflow can be configured independently, allowing organizations to customize their processing pipelines according to their needs.
The workflow system separates orchestration from execution. Workflows define the sequence and coordination of operations, while activities perform the actual work. This separation enables workflows to manage retries, handle failures gracefully, and track progress without coupling to specific implementation details. Workflows can be monitored, cancelled, and resumed, providing visibility and control over long-running processing operations.
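The separation between orchestration and execution can be sketched in a few lines. This is a minimal illustration, not MatsuDB's actual API: the `Workflow` class, the `Activity` type, and the two example activities are all hypothetical names chosen for this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical type for illustration: an activity takes a node and returns the updated node.
Activity = Callable[[dict], dict]


@dataclass
class Workflow:
    """Defines the order of activities; knows nothing about what they do internally."""
    name: str
    activities: list[Activity]
    completed_steps: list[str] = field(default_factory=list)

    def run(self, node: dict) -> dict:
        for activity in self.activities:
            node = activity(node)                            # execution lives in the activity
            self.completed_steps.append(activity.__name__)   # orchestration tracks progress
        return node


def extract_text(node: dict) -> dict:
    """Example activity: normalise the raw content."""
    return {**node, "text": node["raw"].strip()}


def count_words(node: dict) -> dict:
    """Example activity: derive a simple property from the text."""
    return {**node, "word_count": len(node["text"].split())}


workflow = Workflow("parse", [extract_text, count_words])
result = workflow.run({"id": "n1", "raw": "  hello bonsai world  "})
print(result["word_count"], workflow.completed_steps)
```

Because the workflow only sees callables, an activity can be swapped or reimplemented without touching the sequencing and progress-tracking logic.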
Because workflows run asynchronously in the background, users can continue working while their documents are being processed.
Workflow Types
MatsuDB provides several workflow types, each addressing a specific aspect of document processing:
Parsing Workflows
Parsing workflows transform uploaded documents into structured node hierarchies. When a corpus node is created, they extract content, identify structural elements, and create artifact nodes with child nodes representing paragraphs, images, tables, and other document components. Supported formats include PDF, CSV, spreadsheets, and text files.
Embedding Workflows
Embedding workflows generate semantic representations for nodes containing text content. They process nodes in batches, creating dense and sparse vector embeddings that enable semantic search and similarity matching. They can be configured to use different models, dimensions, and processing strategies.
Augmentation Workflows
Augmentation workflows enrich nodes with additional information beyond their original content:
- Formula augmentation: Extract mathematical descriptions from formula images
- Image augmentation: Generate textual descriptions of visual content
- Table augmentation: Correct cell extraction errors and identify table structures
- Post-parsing augmentation: Perform structural analysis, grouping related nodes and identifying document sections
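The batch processing that embedding workflows perform can be illustrated with a small sketch. Everything here is an assumption made for illustration: `fake_embed` is a stand-in for a real embedding model, and `embed_nodes` and `batched` are hypothetical helpers rather than MatsuDB functions.

```python
from typing import Iterator, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fake_embed(texts: Sequence[str]) -> list[list[float]]:
    """Placeholder model: a 2-d vector of [character count, word count] per text."""
    return [[float(len(t)), float(len(t.split()))] for t in texts]


def embed_nodes(nodes: dict[str, str], batch_size: int) -> dict[str, list[float]]:
    """Embed node texts batch by batch and return vectors keyed by node ID."""
    node_ids = list(nodes)
    vectors: dict[str, list[float]] = {}
    for batch_ids in batched(node_ids, batch_size):
        batch_vectors = fake_embed([nodes[node_id] for node_id in batch_ids])
        vectors.update(zip(batch_ids, batch_vectors))
    return vectors


vectors = embed_nodes({"a": "hi there", "b": "x", "c": "one two three"}, batch_size=2)
print(vectors["a"])
```

Batching keeps the number of model calls proportional to the number of batches rather than the number of nodes, which is usually the motivation for processing nodes this way.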
Workflow Lifecycle
Workflows progress through distinct phases from initiation to completion:
- Initiation: When triggered, a workflow receives input parameters specifying which nodes to process and what operations to perform
- Configuration: The workflow loads configuration settings that determine timeouts, retry policies, and processing parameters
- Execution: Workflows coordinate multiple activities, each performing a specific processing step. Activities execute sequentially or in parallel as required by the workflow logic
- Status Tracking: The workflow tracks the status of each node being processed, updating status information as activities complete or fail
- Completion: Upon completion, workflows update node statuses to reflect success or failure, enabling downstream systems and users to understand processing outcomes
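The phases above can be sketched as a small asynchronous workflow that tracks a status per node. The status names (`pending`, `processing`, `completed`, `failed`) and the function names are assumptions for this sketch; the source does not specify MatsuDB's actual status values.

```python
import asyncio
from enum import Enum


class NodeStatus(str, Enum):
    # Hypothetical status values for illustration.
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"


async def process_node(node_id: str, statuses: dict[str, NodeStatus]) -> None:
    """One activity: update the node's status as it starts, succeeds, or fails."""
    statuses[node_id] = NodeStatus.PROCESSING
    try:
        await asyncio.sleep(0)  # stands in for real processing work
        if node_id.startswith("bad"):
            raise ValueError(f"cannot process {node_id}")
        statuses[node_id] = NodeStatus.COMPLETED
    except ValueError:
        statuses[node_id] = NodeStatus.FAILED


async def run_workflow(node_ids: list[str]) -> dict[str, NodeStatus]:
    # Initiation: every input node starts as pending.
    statuses = {node_id: NodeStatus.PENDING for node_id in node_ids}
    # Execution: activities for independent nodes run concurrently.
    await asyncio.gather(*(process_node(n, statuses) for n in node_ids))
    # Completion: the final statuses tell callers what succeeded and what failed.
    return statuses


final = asyncio.run(run_workflow(["n1", "n2", "bad-3"]))
print({node_id: status.value for node_id, status in final.items()})
```

One failing node does not prevent the others from completing, which matches the per-node status tracking described in the list.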
Workflows handle failures through retry mechanisms, attempting failed activities multiple times before marking nodes as failed. This resilience ensures that temporary issues—network interruptions, service unavailability, or transient errors—do not cause permanent processing failures. Workflows can be cancelled or terminated, gracefully stopping processing and updating node statuses accordingly.
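A retry loop of the kind described here might look like the following. This is a sketch under stated assumptions: `TransientError`, `run_with_retries`, the default of three attempts, and the exponential backoff are illustrative choices, not MatsuDB's documented retry policy.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (for example, a network blip)."""


def run_with_retries(activity, node, max_attempts=3, base_delay=0.0):
    """Run activity(node); retry transient failures, then mark the node failed."""
    for attempt in range(1, max_attempts + 1):
        try:
            activity(node)
        except TransientError:
            if attempt < max_attempts:
                time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
            continue
        node["status"] = "completed"
        return attempt
    node["status"] = "failed"
    return max_attempts


calls = {"count": 0}


def flaky_activity(node):
    """Fails on the first two calls, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("service unavailable")
    node["embedded"] = True


node = {"id": "n1", "status": "pending"}
attempts_used = run_with_retries(flaky_activity, node)
print(attempts_used, node["status"])
```

Only `TransientError` is caught, so any other exception propagates immediately instead of being retried.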
Workflow Chaining
Workflows naturally chain together as processing progresses. A parsing workflow creates nodes, which triggers embedding workflows for text nodes and augmentation workflows for specialized node types. These workflows execute independently, enabling parallel processing of different node types while maintaining coordination through status tracking.
The chaining occurs automatically through the rules and triggers system. When a parsing workflow creates text nodes, embedding triggers detect the new nodes and initiate embedding workflows. When formula nodes are created, formula augmentation triggers initiate augmentation workflows. This automatic chaining enables end-to-end processing pipelines without manual intervention.
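The trigger-driven chaining described above can be pictured as a lookup from node type to follow-up workflows. The rule table and function below are hypothetical: the text and formula mappings follow the description in this section, while the image and table entries and all workflow names are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical rule table: node type -> workflows to start when such a node is created.
TRIGGER_RULES = {
    "text": ["embedding"],
    "formula": ["formula_augmentation"],
    "image": ["image_augmentation"],   # assumed mapping
    "table": ["table_augmentation"],   # assumed mapping
}


def on_nodes_created(nodes: list[dict]) -> dict[str, list[str]]:
    """Group newly created node IDs by the workflow that should process them."""
    pending: dict[str, list[str]] = defaultdict(list)
    for node in nodes:
        for workflow_name in TRIGGER_RULES.get(node["type"], []):
            pending[workflow_name].append(node["id"])
    return dict(pending)


created_by_parser = [
    {"id": "n1", "type": "text"},
    {"id": "n2", "type": "formula"},
    {"id": "n3", "type": "text"},
    {"id": "n4", "type": "section"},  # no rule matches, so nothing is triggered
]
print(on_nodes_created(created_by_parser))
```

Each resulting group can then be handed to its workflow independently, which is what allows different node types to be processed in parallel.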
Workflow chaining respects namespace boundaries and configuration settings. Each workflow in the chain operates within its namespace context, applying namespace-specific configurations. This ensures that processing pipelines remain isolated and customizable per organization.
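Namespace-specific configuration can be modelled as defaults plus per-namespace overrides. The field names, default values, and the `acme` namespace below are invented for this sketch and are not MatsuDB's real settings.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class WorkflowConfig:
    # Hypothetical settings and defaults for illustration.
    timeout_seconds: float = 300.0
    max_attempts: int = 3
    embedding_model: str = "default-model"


DEFAULTS = WorkflowConfig()

# Hypothetical per-namespace overrides; unspecified fields fall back to the defaults.
NAMESPACE_OVERRIDES = {
    "acme": {"timeout_seconds": 900.0, "max_attempts": 5},
}


def config_for(namespace: str) -> WorkflowConfig:
    """Return the effective configuration for a namespace."""
    return replace(DEFAULTS, **NAMESPACE_OVERRIDES.get(namespace, {}))


print(config_for("acme"))
print(config_for("other"))
```

Resolving the configuration by namespace at the start of each workflow keeps one organization's settings from leaking into another's pipeline.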
Effective Usage Principles
Workflow configuration should align with organizational processing requirements and resource constraints. Timeout settings should account for expected processing times while preventing workflows from running indefinitely. Retry policies should balance reliability with resource consumption, retrying transient failures without exhausting resources on permanent errors.
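As an illustration of how a timeout bounds an activity, the sketch below uses Python's standard `asyncio.wait_for`. The `slow_activity` and `run_with_timeout` functions and the `"timed_out"` result are hypothetical; how MatsuDB enforces its timeouts is not described in this section.

```python
import asyncio


async def slow_activity(seconds: float) -> str:
    """Stands in for a processing step that takes `seconds` to finish."""
    await asyncio.sleep(seconds)
    return "done"


async def run_with_timeout(coro, timeout_seconds: float) -> str:
    """Await the activity, but give up once the timeout elapses."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_seconds)
    except asyncio.TimeoutError:
        return "timed_out"


fast_result = asyncio.run(run_with_timeout(slow_activity(0.01), timeout_seconds=1.0))
slow_result = asyncio.run(run_with_timeout(slow_activity(1.0), timeout_seconds=0.01))
print(fast_result, slow_result)
```

A timeout set well above the expected processing time lets normal work finish while still stopping an activity that hangs.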
Monitoring workflow execution and node statuses enables proactive identification of processing issues. Failed workflows should be investigated to understand root causes, whether configuration issues, resource constraints, or external service problems. Status information provides the visibility needed to diagnose and resolve processing problems.
Workflow design should consider idempotency and retryability. Activities should be designed to handle repeated execution safely, enabling workflows to retry failed operations without causing duplicate processing or data corruption. This design enables reliable processing even when failures occur.
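One common way to make an activity safe to repeat is to key its output on the node and its content, so that a retry finds the earlier result instead of redoing the work. The `EmbeddingStore` class below is a hypothetical in-memory sketch of that idea, with a placeholder in place of a real embedding computation.

```python
import hashlib


class EmbeddingStore:
    """In-memory stand-in for wherever results are persisted."""

    def __init__(self) -> None:
        self.vectors: dict[tuple[str, str], list[float]] = {}
        self.compute_calls = 0

    def embed_once(self, node_id: str, text: str) -> list[float]:
        """Compute the embedding only if this node/content pair has not been seen."""
        content_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
        key = (node_id, content_hash)
        if key not in self.vectors:
            self.compute_calls += 1
            self.vectors[key] = [float(len(text))]  # placeholder for a real embedding
        return self.vectors[key]


store = EmbeddingStore()
first = store.embed_once("n1", "hello world")
second = store.embed_once("n1", "hello world")  # simulated retry of the same activity
print(first == second, store.compute_calls, len(store.vectors))
```

Including the content hash in the key means a retry is a no-op, while a genuine change to the node's text still produces a new result.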
Relationship to Other Concepts
Workflows are triggered by the rules and triggers system, responding to events in the node lifecycle. They operate within namespace boundaries, ensuring that processing configurations and statuses are isolated per organization. Workflows process nodes, updating their content, embeddings, and metadata as processing progresses.
Workflows integrate with the status tracking system, updating node statuses throughout execution. They use configuration scoped by namespace, enabling organizational customization of processing behavior. Workflows embody the bonsai philosophy by enabling selective, configurable processing that adapts to organizational needs.
The workflow system enables MatsuDB to transform static documents into living, queryable knowledge through automated processing pipelines. Without workflows, documents would remain unprocessed, and nodes would lack embeddings and enrichment. With workflows, the system automatically extracts structure, generates semantic representations, and enriches content, enabling the sophisticated querying and analysis capabilities that make MatsuDB a knowledge platform.