5.8 KiB
Knowledge Workflow Architecture
This document records the intended v2 Knowledge workflow architecture. It is the canonical reference for the runtime workflow; temporary RFC copies under v2-refactor-temp/ may be removed before release.
Goals
Knowledge operations are modelled as a lightweight workflow rather than a single indexing pipeline:
API / user action
-> KnowledgeWorkflowService
-> JobManager
-> Knowledge job handlers
-> KnowledgeLockManager
-> SQLite / vector store / FileManager
The design keeps three owners:
KnowledgeWorkflowServicedecides the next workflow step.KnowledgeLockManagerserializes same-base mutations and cleanup.- Knowledge job handlers execute one durable stage and call the workflow service for the next step.
Helpers may own source planning, lifecycle writes, artifact refs, and FileProcessing adaptation. They should stay as modules until they need lifecycle-managed resources, IPC, timers, or long-lived state.
Workflow Entry Points
addItems, deleteItems, and reindexItems are async workflow entry points. API resolution means the durable workflow has been accepted, not that every physical side effect has finished.
addItemsresolves after root rows are created and first Knowledge jobs are queued.deleteItemsresolves after top-level target subtrees are markeddeletingandknowledge.delete-subtreeis queued.reindexItemsresolves after each top-level target subtree is confirmed terminal (completedorfailed) andknowledge.reindex-subtreeis queued.
Default item list, search, and RAG hydration exclude deleting items. deleting is a durable cleanup marker, not a tombstone or terminal success state.
Scheduling Model
The workflow service owns all branching:
scheduleItem(baseId, itemId)
directory -> enqueue knowledge.prepare-root
file / note / url -> source planning
direct -> enqueue knowledge.index-documents
invalid -> mark item failed
needs processing -> Round 2 FileProcessing path
Job handlers do not decide whether an item is a root, nested container, direct leaf, or FileProcessing candidate. They perform their current stage and re-enter the workflow service.
Recursive Container Expansion
knowledge.prepare-root expands a directory item and creates or replaces its child rows. Expansion results must not assume every child is a leaf:
prepare-root(container)
-> create/replace child rows
-> for each child:
workflowService.scheduleItem(baseId, childId)
If a child is another directory, scheduleItem queues another knowledge.prepare-root. If a child is file, note, or url, scheduleItem routes it to source planning and indexing. Recursive processing therefore lives in the workflow service loop, not inside a reader-specific branch.
Job Types
Round 1 job types:
-
knowledge.prepare-root: expand a container and schedule each child. -
knowledge.index-documents: read/chunk/embed/write vectors for a concrete document source. Empty reader results or zero chunks still write an empty vector set and complete the item. -
knowledge.delete-subtree: cancel active subtree jobs, delete vectors, delete base-directory files, then delete resolved item ids withdeleteItemsByIds. The create/index path does not register FileManager refs, so there is no separate file-ref detach step; any historicalFileEntryrows are left to the file module's no-reference policy. -
knowledge.reindex-subtree: for terminal subtrees only, delete vectors, remove stale container descendants, reset selected root state, then callscheduleItem. Selected leaf roots keep their source files on disk and are repaired byindex-documentsfromknowledge_item.data. -
knowledge.check-file-processing-result: poll or inspect the FileProcessing job, record the converted markdown's location on the item (viaupdateIndexedRelativePath) on success, then schedule indexing.
knowledge_base.fileProcessorId controls source planning for supported file items. When a source needs conversion, the workflow starts FileProcessing, schedules knowledge.check-file-processing-result, records the converted markdown's location on the item via updateIndexedRelativePath (the indexedRelativePath leaf field, not a separate file-ref artifact row), then indexes that markdown.
Status (2026-06-08): baseline behavior.
check-file-processing-resultis a polling job:scheduleFileProcessingCheckreschedules it atFILE_PROCESSING_CHECK_DELAY_MS = 5_000(5s) intervals until the FileProcessing job reaches a terminal state; if it is still unfinished afterFILE_PROCESSING_MAX_WAIT_MS = 30 * 60 * 1000(30 minutes), the remote job is cancelled and the item is markedfailed. On success the output's relative path is recorded viaupdateIndexedRelativePath, thenindex-documentsis scheduled.
Mutation And Crash Semantics
Same-base Knowledge mutations must go through KnowledgeLockManager. Main SQLite writes still use a synchronous write transaction (DbService.withWriteTx, or an equivalent db.transaction()); the lock manager is not a replacement for that write transaction.
Crash safety comes from durable jobs, durable item states, JobManager recovery, and idempotent cleanup. The in-memory mutation lock only serializes concurrent work in the current process.
Delete and reindex span two stores: the main SQLite database and the per-base vector store. They cannot be one cross-store transaction. Consistency relies on durable re-entry and idempotent vector/artifact/row cleanup.
User-triggered reindex is not a cancellation primitive. The service admits reindex only when the entire selected subtree is already completed or failed. Active states (idle, preparing, processing, reading, embedding) and deleting are rejected; delete remains the operation that can be requested at any time.