mirror of https://github.com/CherryHQ/cherry-studio.git synced 2026-07-03 12:27:41 +08:00

槑囿脑袋 2fbc7bda1c feat(knowledge): optional embedding model with BM25-only fallback (#16553)

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: fullex <106392080+0xfullex@users.noreply.github.com>
Signed-off-by: eeee0717 <chentao020717Work@outlook.com>

2026-07-02 20:19:21 +08:00

Knowledge Service

This document records the current v2 knowledge backend shape in the main process.

It covers the src/main/services/knowledge workflow path and the SQLite-backed data services. It does not describe the legacy src/main/knowledge service or the old knowledge-base:* IPC channels.

For workflow guard details, see Knowledge Operation Guards. For the workflow architecture overview, see Knowledge Workflow Architecture.

Overview

The current implementation is split into four responsibility areas:

- KnowledgeBaseService / KnowledgeItemService
  - Persist SQLite-backed knowledge base and knowledge item data.
  - Persist knowledge_base.status and error; migrated bases with missing embedding models remain as recoverable failed bases.
  - Persist knowledge_base.groupId and dimensions; dimensions is null for BM25-only completed bases (no embedding model) and for failed bases whose embedding contract is unknown.
  - Validate item type / data consistency.
  - Persist knowledge_item.status and error.
  - Reconcile container item status from child item state.
- Data API knowledge handlers
  - Expose database-backed list/get operations and base metadata/config patch.
  - Do not perform vector-store mutations.
- KnowledgeService
  - Owns caller-facing runtime IPC workflow.
  - Creates/deletes/restores bases through data services and vector store services.
  - Registers Knowledge JobManager handlers.
  - Holds the KnowledgeWorkflowService and KnowledgeLockManager.
  - Collapses delete/reindex item inputs to top-level roots and enforces runtime guards.
- Knowledge job handlers
  - Execute durable workflow stages through JobManager.
  - Use KnowledgeWorkflowService for next-step scheduling.
  - Use KnowledgeLockManager for same-base mutations and vector cleanup.

caller
  -> Data API reads / base patch
     -> KnowledgeBaseService / KnowledgeItemService

caller
  -> preload knowledge IPC
     -> KnowledgeService
        -> KnowledgeWorkflowService
        -> JobManager
           -> knowledge.prepare-root / knowledge.index-documents
           -> knowledge.delete-subtree / knowledge.reindex-subtree
              -> KnowledgeLockManager
                 -> KnowledgeBaseService / KnowledgeItemService
                 -> KnowledgeVectorStoreService / FileManager

There is currently no KnowledgeRuntimeService and no in-memory Knowledge queue. Durable work is owned by JobManager.

Caller Contract

Current Data API knowledge endpoints are read/update-only for database state that has no vector-store side effect:

GET /knowledge-bases
GET /knowledge-bases/:id
PATCH /knowledge-bases/:id
GET /knowledge-bases/:id/items
GET /knowledge-items/:id

Caller-facing create/delete/index/search operations go through KnowledgeService IPC.

The caller-facing add model is payload-based:

1. Call runtime IPC once with item payloads.
2. The workflow creates the knowledge_item rows.
3. The workflow queues either preparation or indexing work.

For leaf items (file, url, note):

caller
 -> preload IPC add-items(leaf item payloads)
    -> create leaf items
    -> mark roots processing
    -> enqueue knowledge.index-documents

For container items (directory):

caller
 -> preload IPC add-items(owner item payloads)
    -> create root items
    -> mark roots preparing
    -> enqueue knowledge.prepare-root
    -> prepare-root expands owner
    -> prepare-root creates child items
    -> workflow service schedules each child

Callers should not create item records through Data API and then call runtime IPC with item ids. add-items accepts KnowledgeAddItemInput[] and returns after root items are accepted and first jobs are queued, not after indexing completes.
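The routing between the two flows above can be pictured with a small sketch. The type and function names here (AddItemInputSketch, planInitialWork) are hypothetical and only stand in for the real KnowledgeAddItemInput handling, whose shape is not shown in this document.

```typescript
// Hypothetical types: the real KnowledgeAddItemInput shape is not documented here.
type LeafType = "file" | "url" | "note";
type ItemType = LeafType | "directory";

interface AddItemInputSketch {
  type: ItemType;
  // Payload fields are omitted; only the type matters for routing.
}

interface InitialPlan {
  status: "processing" | "preparing";
  job: "knowledge.index-documents" | "knowledge.prepare-root";
}

// Leaf roots go straight to indexing; container roots must be expanded first.
function planInitialWork(input: AddItemInputSketch): InitialPlan {
  if (input.type === "directory") {
    return { status: "preparing", job: "knowledge.prepare-root" };
  }
  return { status: "processing", job: "knowledge.index-documents" };
}
```

In this sketch the caller only learns that the root was accepted and a first job was chosen, which mirrors the point that add-items returns before indexing completes.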

Delete and reindex remain id-based because they operate on existing persisted items:

delete-items(baseId, itemIds)
reindex-items(baseId, itemIds)

KnowledgeService collapses nested selected ids to top-level roots before calling the workflow service.
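One way to collapse a selection to top-level roots is to drop every selected id that has a selected ancestor. The sketch below assumes each item exposes a parent pointer named parentId, which is a hypothetical field name used for illustration only.

```typescript
interface ItemNode {
  id: string;
  parentId: string | null; // hypothetical field name for the parent pointer
}

// Keep only selected ids that have no selected ancestor.
function collapseToTopLevelRoots(selectedIds: string[], items: ItemNode[]): string[] {
  const parentOf = new Map<string, string | null>();
  items.forEach((item) => parentOf.set(item.id, item.parentId));

  const selected = new Set(selectedIds);
  const uniqueIds = Array.from(selected);
  const roots: string[] = [];

  for (const id of uniqueIds) {
    let ancestor = parentOf.get(id) ?? null;
    let covered = false;
    const seen = new Set<string>([id]); // guards against cyclic parent data
    while (ancestor !== null && !seen.has(ancestor)) {
      if (selected.has(ancestor)) {
        covered = true;
        break;
      }
      seen.add(ancestor);
      ancestor = parentOf.get(ancestor) ?? null;
    }
    if (!covered) roots.push(id);
  }
  return roots;
}
```

Selecting a directory together with one of its children then yields only the directory, so the subtree job handles the child exactly once.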

IPC Surface

KnowledgeService currently owns these public IPC entrypoints:

knowledge:create-base
knowledge:restore-base
knowledge:delete-base
knowledge:add-items
knowledge:delete-items
knowledge:reindex-items
knowledge:search
knowledge:list-item-chunks

These IPC handlers are workflow-oriented: they validate payloads, call data services, and enqueue or execute runtime work internally. The former knowledge:delete-item-chunk entrypoint was removed with the per-base index store cutover, because chunks are derived index rows that are replaced wholesale by reindexing.

KnowledgeService also owns one orphaned v1 bridge entrypoint, knowledge-base:delete. Its only caller was the legacy Redux store/knowledge slice, which has now been removed, so this entrypoint is dead and pending cleanup. It routes to the same delete-base path.

The chunk IPC entrypoint is a runtime inspection helper:

- list-item-chunks rejects failed bases.
- It requires the requested item to be completed.
- Listing chunks for a completed directory also rejects when the subtree still contains deleting descendants, because container status reconciliation ignores deleting children.

Runtime Behavior

Knowledge runtime work is persisted in JobManager. KnowledgeService.onInit registers:

knowledge.prepare-root
knowledge.index-documents
knowledge.delete-subtree
knowledge.reindex-subtree

Each base uses queue base.${baseId}. JobManager owns queue persistence, dispatch, retry, cancellation, timeout, and startup recovery. Knowledge code uses KnowledgeLockManager to serialize same-base vector and item mutations inside the current process.
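As an illustration of in-process, per-base serialization, here is a minimal promise-chain lock. It is a sketch of the general technique, not the actual KnowledgeLockManager; the class name BaseMutationLockSketch and the helper queueNameForBase are assumptions.

```typescript
// Sketch only: serializes async tasks per base id inside one process.
class BaseMutationLockSketch {
  private tails = new Map<string, Promise<void>>();

  run<T>(baseId: string, task: () => Promise<T>): Promise<T> {
    const previous = this.tails.get(baseId) ?? Promise.resolve();
    const result = previous.then(task);
    // The stored tail never rejects, so one failed task cannot block later ones.
    const tail = result.then(
      () => undefined,
      () => undefined,
    );
    this.tails.set(baseId, tail);
    return result;
  }
}

// Queue naming as described above: one queue per base.
function queueNameForBase(baseId: string): string {
  return `base.${baseId}`;
}
```

Tasks for different base ids do not wait on each other in this sketch; only same-base work is ordered. Cross-process exclusion is out of scope here, matching the "inside the current process" wording.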

Current item statuses are:

idle
preparing
processing
reading
embedding
completed
failed
deleting

There is no separate persisted phase field. preparing, reading, and embedding are first-class item statuses.

Current status writes are:

preparing for active directory preparation.
processing for accepted leaf roots before indexing starts, and for containers that still have active children.
reading while a leaf item reads source documents.
embedding while a leaf item embeds chunks.
completed after successful leaf indexing, including leaf indexing that writes zero chunks, or when a container has no active children.
failed on indexing/preparation failure or scheduling compensation.
deleting after user-visible delete intent is written and before physical cleanup completes.

status is the durable business state. JobManager progress is diagnostic execution state and is not the source of truth for item lifecycle. Container status is reconciled from immediate child statuses.
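The container rule can be sketched as a pure function over immediate child statuses. Which statuses count as "active" is an assumption in this sketch (preparing, processing, reading, embedding); in particular, the treatment of idle and failed children is not specified above and is a guess.

```typescript
type ItemStatus =
  | "idle"
  | "preparing"
  | "processing"
  | "reading"
  | "embedding"
  | "completed"
  | "failed"
  | "deleting";

// Assumption: these statuses represent in-flight work.
const ACTIVE_STATUSES: ReadonlySet<ItemStatus> = new Set<ItemStatus>([
  "preparing",
  "processing",
  "reading",
  "embedding",
]);

// Deleting children are ignored; any active child keeps the container processing.
function reconcileContainerStatus(childStatuses: ItemStatus[]): "processing" | "completed" {
  const relevant = childStatuses.filter((status) => status !== "deleting");
  return relevant.some((status) => ACTIVE_STATUSES.has(status)) ? "processing" : "completed";
}
```

Because deleting children are filtered out first, a directory can read as completed while cleanup of a descendant is still pending, which is why list-item-chunks adds its own check for deleting descendants.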

Current persisted knowledge_base columns include:

groupId: nullable group assignment; null means ungrouped.
embeddingModelId: the embedding model; null for BM25-only bases.
dimensions: positive embedding vector width for vector-capable bases; null for BM25-only completed bases (no embedding model) and for failed migrated bases with unknown dimensions. On a completed base it is paired with embeddingModelId — both set, or both null with searchMode forced to bm25 (enforced by the DB CHECK and the entity schema).
status: completed for runnable bases, failed for recoverable base-level migration failures.
error: nullable KnowledgeBaseErrorCode; currently missing_embedding_model for recoverable failed bases.
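The pairing rule between embeddingModelId, dimensions, and searchMode can be expressed as a small validator. This is a sketch of the invariant described above, not the actual DB CHECK or entity schema; BaseRowSketch and the returned messages are illustrative.

```typescript
interface BaseRowSketch {
  status: "completed" | "failed";
  embeddingModelId: string | null;
  dimensions: number | null;
  searchMode: "vector" | "bm25" | "hybrid";
}

// Returns a description of the violation, or null when the row is consistent.
function findEmbeddingContractViolation(base: BaseRowSketch): string | null {
  if (base.dimensions !== null && (!Number.isInteger(base.dimensions) || base.dimensions <= 0)) {
    return "dimensions must be a positive integer or null";
  }
  // Failed migrated bases may carry legacy dimensions without a model.
  if (base.status !== "completed") return null;

  const hasModel = base.embeddingModelId !== null;
  const hasDimensions = base.dimensions !== null;
  if (hasModel !== hasDimensions) {
    return "completed base: embeddingModelId and dimensions must both be set or both be null";
  }
  if (!hasModel && base.searchMode !== "bm25") {
    return "completed BM25-only base: searchMode must be bm25";
  }
  return null;
}
```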

Delete And Reindex

delete-items currently runs:

1. Orchestration loads requested items and collapses descendants to top-level roots.
2. Workflow service marks selected root subtrees deleting under the base mutation lock.
3. Workflow service enqueues knowledge.delete-subtree.
4. The delete job cancels active jobs touching the subtree.
5. Under the base mutation lock, the delete job deletes leaf vectors, deletes Knowledge-owned raw files, and hard-deletes item rows.

Knowledge files are managed by the Knowledge workflow under the base raw/ directory. The create/index path does not register FileManager refs, so delete has no separate FileManager ref cleanup step.

If enqueueing knowledge.delete-subtree fails after rows are marked deleting, rows remain deleting. Startup recovery scans deleting roots and re-enqueues cleanup jobs best-effort.

reindex-items currently runs:

1. Orchestration loads requested items and collapses descendants to top-level roots.
2. Orchestration rejects the request unless every selected subtree item is terminal: completed or failed.
3. Workflow service enqueues knowledge.reindex-subtree.
4. The reindex job skips if delete won the race and any subtree item is now deleting.
5. Under the base mutation lock, the reindex job deletes old vectors, removes expanded descendants for selected container roots, resets selected roots to preparing or processing, and schedules each selected root through the workflow service.
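The two reindex guards run at different times: one when the request is accepted, one when the job starts. A minimal sketch, with hypothetical function names, might look like this:

```typescript
type ItemStatus =
  | "idle"
  | "preparing"
  | "processing"
  | "reading"
  | "embedding"
  | "completed"
  | "failed"
  | "deleting";

function isTerminal(status: ItemStatus): boolean {
  return status === "completed" || status === "failed";
}

// Request-time guard: every item in the selected subtrees must be terminal.
function acceptsReindexRequest(subtreeStatuses: ItemStatus[]): boolean {
  return subtreeStatuses.every(isTerminal);
}

// Job-time guard: if a delete was accepted in the meantime, the reindex job backs off.
function shouldSkipReindexJob(subtreeStatuses: ItemStatus[]): boolean {
  return subtreeStatuses.some((status) => status === "deleting");
}
```

The second check is needed because statuses can change between enqueue and execution; the first check alone cannot rule out a delete that arrives later.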

Reindex is not a cancellation primitive. Delete is the operation that can preempt active work.

Base deletion currently runs:

delete-base(baseId)
 -> cancel active Knowledge jobs in base queue
 -> under base mutation lock:
      delete vector store artifacts
      delete SQLite base row

If vector artifact deletion fails, the SQLite base row is preserved so the user can retry deletion. If SQLite deletion fails after vector artifacts were deleted, orchestration throws an invalidOperation error because the cross-store cleanup cannot be rolled back.
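The ordering and its two failure modes can be sketched with injected dependencies. The DeleteBaseDeps interface and its method names are hypothetical, and the base mutation lock is left out to keep the example short.

```typescript
// Hypothetical dependency surface; the real services are not shown in this document.
interface DeleteBaseDeps {
  cancelJobs(baseId: string): Promise<void>;
  deleteVectorArtifacts(baseId: string): Promise<void>;
  deleteBaseRow(baseId: string): Promise<void>;
}

async function deleteBaseSketch(baseId: string, deps: DeleteBaseDeps): Promise<void> {
  await deps.cancelJobs(baseId);

  // If this throws, the SQLite row has not been touched, so deletion can be retried.
  await deps.deleteVectorArtifacts(baseId);

  try {
    await deps.deleteBaseRow(baseId);
  } catch (cause) {
    // Vector artifacts are already gone and cannot be restored from here.
    throw new Error(`invalidOperation: base ${baseId} row deletion failed after vector cleanup: ${String(cause)}`);
  }
}
```

Deleting vectors before the row means the only unrecoverable state is "row present, vectors gone", which is reported loudly instead of leaving orphaned vector artifacts with no row pointing at them.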

Knowledge files are owned by the Knowledge workflow under its raw/vector storage and are not registered as FileManager FileRef rows. Delete/reindex cleanup stays within Knowledge-owned storage and metadata.

Base Restore

Base restore creates a new knowledge base from an existing base:

restore-base(sourceBaseId, embeddingModelId, dimensions)
 -> data service loads the source base
 -> data service loads source root items
 -> orchestration creates a new base with source config plus the requested embedding model/dimensions
 -> orchestration adds each root item to the new base

dimensions must already be resolved for the selected embeddingModelId before calling restore-base. Automatic flows should fill it from AI Core dimension detection; manual flows accept the user-provided value and rely on the caller to confirm it matches the model. The restore backend only validates that dimensions is a positive integer and uses it to create the new vector store; it does not perform a second model probe. If the value does not match the model's actual embedding output size, the mismatch is expected to surface during the subsequent indexing/write-vector phase.

The source base is preserved. Restore is allowed for failed bases and completed bases, including completed bases whose embeddingModelId and dimensions are unchanged. Same-config restore is a valid clone/rebuild workflow, not rejected as a no-op.

If one or more root items cannot be accepted into the restored base, orchestration best-effort deletes the new base and rethrows the failure as an invalidOperation error. Later background indexing failures are recorded on item status rather than surfaced through this synchronous restore error.
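The compensation step can be sketched as follows. RestoreDeps and its methods are hypothetical stand-ins; the point is only that cleanup of the new base is best-effort and never masks the original failure.

```typescript
// Hypothetical dependency surface for illustration.
interface RestoreDeps {
  createBase(): Promise<string>; // resolves to the new base id
  addRootItems(newBaseId: string): Promise<void>;
  deleteBase(baseId: string): Promise<void>;
}

async function restoreWithCompensation(deps: RestoreDeps): Promise<string> {
  const newBaseId = await deps.createBase();
  try {
    await deps.addRootItems(newBaseId);
    return newBaseId;
  } catch (cause) {
    // Best-effort cleanup: a failure here is swallowed so the original error wins.
    await deps.deleteBase(newBaseId).catch(() => undefined);
    throw new Error(`invalidOperation: restore failed: ${String(cause)}`);
  }
}
```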

Migrated Bases With Missing Embedding Models

During v1-to-v2 migration, a legacy knowledge base may reference an embedding model that does not exist in the migrated user_model table. For example, a legacy model id such as ollama::dengcao/Qwen3-Embedding-0.6B:Q8_0 can be present in Redux knowledge data while no matching V2 user model row exists.

In that case, migration must preserve the user-created knowledge data instead of dropping the base:

knowledge_base.embeddingModelId = null
knowledge_base.dimensions = valid legacy dimensions, or null when unknown
knowledge_base.status = failed
knowledge_base.error = missing_embedding_model
knowledge_item rows under that base continue to migrate
legacy vectors for that base are skipped because there is no confirmed embedding model contract

knowledge_base.error is a shared KnowledgeBaseErrorCode value, not a free-form string. The current recoverable base-level error code is missing_embedding_model.

This means the migrated base is visible as recoverable data, but it is not usable for search/index operations until the user chooses a valid embedding model.
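The column values written for such a base can be sketched as a pure mapping from the legacy dimensions value. The function name and the exact notion of "valid legacy dimensions" (a positive integer) are assumptions for this sketch.

```typescript
interface FailedBaseFields {
  embeddingModelId: null;
  dimensions: number | null;
  status: "failed";
  error: "missing_embedding_model";
}

// Assumption: "valid" legacy dimensions means a positive integer.
function fieldsForMissingEmbeddingModel(legacyDimensions: unknown): FailedBaseFields {
  let dimensions: number | null = null;
  if (typeof legacyDimensions === "number" && Number.isInteger(legacyDimensions) && legacyDimensions > 0) {
    dimensions = legacyDimensions;
  }
  return {
    embeddingModelId: null,
    dimensions,
    status: "failed",
    error: "missing_embedding_model",
  };
}
```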

The failed-base recovery path is knowledge:restore-base, not an in-place rebuild:

user selects a valid embedding model for the failed base
 -> restore-base(sourceBaseId, embeddingModelId, dimensions)
 -> orchestration creates a new completed base using the source base config
 -> orchestration copies only source root items into the new base
 -> add-items triggers the normal workflow indexing flow for the new base

Only root items (groupId = null) are copied. Expanded directory children are intentionally not copied because they belong to the old base hierarchy and can be regenerated by the normal container preparation flow. The old failed base is left intact; product/UI code can decide whether to keep it for confirmation or delete it after a successful restore.

Search

Search is executed by KnowledgeService.search(baseId, query):

1. Reject failed bases.
2. Reject queries without searchable tokens.
3. Resolve the base's searchMode (vector / bm25 / hybrid) and embed the query (skipped for bm25, which is lexical only).
4. Call KnowledgeIndexStore.search on the base's per-base index store with an over-fetched candidate limit (topK × overfetch, capped). The store runs the BM25 lane (search_text_fts, with a LIKE fallback for short CJK tokens), the brute-force vector lane, or fuses both with RRF (hybridAlpha).
5. Filter results whose source items are missing, outside the base, or deleting, then trim to documentCount ?? 10.
6. Rerank when base.rerankModelId is configured.
7. Apply relevance threshold (a no-op for ranking-kind scores) and assign ranks.
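Three of the steps above lend themselves to a short sketch: the capped over-fetch limit, a weighted reciprocal rank fusion, and the post-filter. The parameter names (overfetch, cap), the constant k = 60, and the choice that alpha weights the vector lane are all assumptions; the actual store may combine the lanes differently.

```typescript
interface Candidate {
  chunkId: string;
  itemId: string;
}

interface SourceItem {
  baseId: string;
  status: string;
}

// Over-fetch so that post-filtering still leaves enough results.
function candidateLimit(topK: number, overfetch: number, cap: number): number {
  return Math.min(topK * overfetch, cap);
}

// Weighted RRF over two ranked id lists (best first). Assumption: alpha weights the vector lane.
function fuseRrf(vectorRanked: string[], bm25Ranked: string[], alpha: number, k = 60): Map<string, number> {
  const scores = new Map<string, number>();
  vectorRanked.forEach((id, index) => {
    scores.set(id, (scores.get(id) ?? 0) + alpha / (k + index + 1));
  });
  bm25Ranked.forEach((id, index) => {
    scores.set(id, (scores.get(id) ?? 0) + (1 - alpha) / (k + index + 1));
  });
  return scores;
}

// Drop candidates whose source item is missing, in another base, or deleting; then trim.
function filterAndTrim(
  candidates: Candidate[],
  itemsById: Map<string, SourceItem>,
  baseId: string,
  documentCount?: number,
): Candidate[] {
  const limit = documentCount ?? 10;
  return candidates
    .filter((candidate) => {
      const item = itemsById.get(candidate.itemId);
      return item !== undefined && item.baseId === baseId && item.status !== "deleting";
    })
    .slice(0, limit);
}
```

Fused scores of this kind only express relative order, which is consistent with the relevance threshold being a no-op for ranking-kind scores.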

Current KnowledgeSearchResult includes:

pageContent
score
scoreKind
rank
metadata
optional itemId
required chunkId

chunkId is the search unit identity (search_unit.unit_id) used for result-level attribution. itemId equals the unit's material_id (= knowledge_item.id).

Current Retrieval Cost Assumption

The current v2 implementation intentionally does not create a vector index and does not use an indexed approximate-nearest-neighbor lookup. Similarity search scans the embedding rows directly and sorts by the engine's scalar cosine distance (vec_distance_cosine on sqlite-vec, with the query vector bound as a raw little-endian float32 BLOB).

This means retrieval cost scales roughly linearly with the number of vector rows in a single knowledge base. That tradeoff is currently accepted because it keeps the runtime path simpler for expected near-term corpus sizes.
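To make the linear cost concrete, the sketch below encodes a query vector as little-endian float32 bytes and runs an in-memory brute-force cosine scan. It only mirrors what the SQL engine does conceptually; the function names are illustrative, and cosine distance is taken as 1 minus cosine similarity with non-zero vectors assumed.

```typescript
// Encode a vector as raw little-endian float32 bytes (4 bytes per component).
function toFloat32LeBlob(vector: number[]): Uint8Array {
  const bytes = new Uint8Array(vector.length * 4);
  const view = new DataView(bytes.buffer);
  vector.forEach((value, index) => view.setFloat32(index * 4, value, true));
  return bytes;
}

// Cosine distance = 1 - cosine similarity. Assumes equal lengths and non-zero vectors.
function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface EmbeddingRow {
  id: string;
  embedding: number[];
}

// Every row is visited once per query, so cost grows linearly with the row count.
function bruteForceTopK(query: number[], rows: EmbeddingRow[], topK: number): { id: string; distance: number }[] {
  return rows
    .map((row) => ({ id: row.id, distance: cosineDistance(query, row.embedding) }))
    .sort((left, right) => left.distance - right.distance)
    .slice(0, topK);
}
```

Each query performs one distance computation per stored row plus a sort, so doubling the rows in a base roughly doubles the scan work.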

Current guidance:

Treat the no-index design as the default for now, not as an unlimited scaling guarantee.
Re-evaluate indexed search if real single-base corpora grow toward 100k+ rows or retrieval latency budgets can no longer tolerate a few hundred milliseconds per query.
If future product requirements change, adding a vector index remains a valid follow-up optimization rather than a blocked prerequisite for the current design.
