mirror of
https://github.com/CherryHQ/cherry-studio.git
synced 2026-07-04 05:00:00 +08:00
### What this PR does

Before this PR:

- Knowledge embeddings and reranking ran through the legacy embedjs-based knowledgeV1 stack with their own provider clients, independent of the app's AI service.
- File-processing intake accepted several heterogeneous input shapes, and knowledge file items were tracked by FileEntry ids, coupling file content to the file-manager entry/cache.

After this PR:

- Embeddings and reranking are routed through the unified `AiService` (with cherryin rerank support) and guarded by strict embedding-dimension validation that rejects stale/mismatched vectors.
- File-processing intake is collapsed to a single path-based model; knowledge file items are stored by base-relative path under the knowledge-base directory, and v1 uploads are copied into the v2 base dir during migration so migrated items stay reindexable/restorable.
- Legacy `knowledgeV1` is removed; the orchestration services were renamed to `KnowledgeService` / `FileProcessingService`.
- Chat -> knowledge attach is temporarily disconnected (tracked TODO) while the v2 file-manager bridge is rebuilt.

Fixes #N/A (no linked issue)

### Why we need it and why it was done in this way

Routing embeddings/rerank through `AiService` unifies provider handling and credentials and removes the parallel embedjs client stack and its v1 coupling. Storing knowledge files by base-relative path (instead of FileEntry ids) makes each knowledge base self-contained and portable.

The following tradeoffs were made:

- A large, coordinated refactor plus a migration step that physically copies v1 uploads into the v2 base dir, in exchange for removing the parallel client stack and making bases self-contained.
- Base-relative path storage required a fail-fast/dedup strategy for same-named files and a guard for blank legacy filenames.

The following alternatives were considered:

- Keeping the embedjs stack behind an adapter: rejected; perpetuates the parallel client and v1 coupling.
- Keeping FileEntry-id storage: rejected; couples knowledge files to the file-manager cache and blocks portability.

### Breaking changes

- `knowledgeV1` is removed. Legacy v1 knowledge data reaches v2 only through the v2 migrators; there is no v1 fallback.
- The v2 knowledge HTTP API (API gateway) now returns v2-native per-entry fields (`embeddingModelId`, `createdAt` on base entries; `chunkId`, `scoreKind`, `rank` on search results). The response envelope (`knowledge_bases`, `searched_bases`, `total`) is unchanged. See `v2-refactor-temp/docs/breaking-changes/2026-06-05-knowledge-api-v2.md`.

### Special notes for your reviewer

- This branch went through several rounds of multi-agent code review. The most recent 6 commits address review findings: directory-import path collisions, migrated-file source copying + blank `relativePath` guard, addItems rollback error preservation, eager `document_to_markdown` output-target validation, a `CompletedKnowledgeBase` type guard, and breaking-changes doc corrections.
- Chat -> knowledge attach is intentionally disconnected for now (tracked in `v2-refactor-temp/docs/knowledge/knowledge-todo.md`).
- Local full `pnpm lint`/`pnpm test` was not run per the project's review conventions; please rely on CI / `pnpm build:check`.

### Checklist

- [x] Branch: This PR targets the correct branch: `main` for active development, `v1` for v1 maintenance fixes
- [x] PR: The PR description is expressive enough and will help future contributors
- [x] Code: Write code that humans can understand and Keep it simple
- [x] Refactor: You have left the code cleaner than you found it (Boy Scout Rule)
- [x] Upgrade: Impact of this change on upgrade flows was considered and addressed if required
- [ ] Documentation: A user-guide update was considered and is present (link) or not required.
- [x] Self-review: I have reviewed my own code before requesting review from others

### Release note

```release-note
NONE
```

---------

Signed-off-by: eeee0717 <chentao020717Work@outlook.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
9.4 KiB
Knowledge Schema Notes (V2)
This document records the current V2 knowledge target schema, migration constraints, and temporary scope boundaries.
Scope Clarification
- `video` items are out of scope for V2 knowledge data migration and should be skipped.
- `memory` items belong to the memory module, not the knowledge module, and should be skipped in knowledge migration.
Current Target Schema
knowledge_base
- Persisted columns: `id`, `name`, `groupId`, `dimensions`, `embeddingModelId`, `status`, `error`, `rerankModelId`, `fileProcessorId`, `chunkSize`, `chunkOverlap`, `threshold`, `documentCount`, `searchMode`, `hybridAlpha`, `createdAt`, `updatedAt`
knowledge_item
- Persisted columns: `id`, `baseId`, `groupId`, `type`, `data`, `status`, `error`, `createdAt`, `updatedAt`
- New app-created knowledge items use ordered UUID generation for `id`.
Fields Removed From The V2 SQLite Schema
- `video` is not a target `knowledge_item.type`.
- `memory` is not a target `knowledge_item.type`.
- `sitemap` is not a target `knowledge_item.type`; legacy sitemap entries are migrated as `url` items.
- Legacy runtime-only item fields are not stored as standalone SQLite columns: `uniqueId`, `uniqueIds`, `processingProgress`, `retryCount`, `isPreprocessed`
- `remark` is not part of the V2 SQLite schema.
- `sourceUrl` is not a standalone `knowledge_item` column:
  - for notes, it may exist inside `data.sourceUrl`
  - for url items, the URL is stored inside the typed `data` payload
- Official v1 legacy exports do not contain `groupId`.
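The type rules in this section can be summarized as a small mapping. The following is a minimal sketch, not the actual migrator: `mapLegacyItemType` is a hypothetical helper name, and skipping unknown legacy types is an assumption of this sketch rather than documented behavior.

```typescript
// Target V2 item types named in this document.
type V2ItemType = 'file' | 'url' | 'note' | 'directory'

// Hypothetical helper: returns the target V2 type, or null when the
// legacy item should be skipped by the knowledge migration.
function mapLegacyItemType(legacyType: string): V2ItemType | null {
  switch (legacyType) {
    case 'video':
    case 'memory':
      return null // not a target knowledge_item.type
    case 'sitemap':
      return 'url' // legacy sitemap entries are migrated as url items
    case 'file':
      return 'file'
    case 'url':
      return 'url'
    case 'note':
      return 'note'
    case 'directory':
      return 'directory'
    default:
      return null // assumption: unknown legacy types are skipped
  }
}
```

Returning `null` for skipped types keeps the "skip" decision explicit at the call site instead of hiding it behind an exception.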
groupId Semantics
- `knowledge_item` is modeled as a flat same-base item collection.
- `groupId` is an optional stable grouping key inside one knowledge base.
- Typical usage: items from the same imported source/container
  - Examples: one directory import, one URL collection
- When one item is the logical container/owner of a group, downstream items use `groupId = containerItem.id`
- The schema enforces same-base ownership:
  - `(baseId, groupId)` must reference `(baseId, id)` in `knowledge_item`
  - deleting the owner cascades to grouped members
- Current runtime read flows use:
  - `GET /knowledge-bases/:id/items` for flat item listing
  - optional query filters: `type`, `groupId`
- Current runtime write workflows use `KnowledgeService` IPC, not DataApi endpoints:
  - add items: normalize caller-friendly inputs, create SQLite rows, and enqueue prepare/index tasks
  - delete items: interrupt runtime work, delete vectors, then delete SQLite roots
  - reindex items: interrupt runtime work, delete old vectors, rebuild expanded children when needed, then enqueue indexing
  - search and chunk mutation: execute against the per-base vector store through runtime IPC
- DataApi remains limited to SQLite-backed reads and knowledge base metadata PATCH.
- Migration from official v1 data does not preserve or infer grouping metadata:
  - official v1 exports are flat
  - migrated items are inserted with `groupId = null`
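The same-base ownership rule and the cascade on owner deletion can be illustrated in memory. This is a sketch under stated assumptions: the real constraint lives in the SQLite schema, and `KnowledgeItemRow`, `findOwnershipViolations`, and `deleteWithCascade` are hypothetical names that only model the three columns involved.

```typescript
// Only the columns relevant to grouping are modeled here.
interface KnowledgeItemRow {
  id: string
  baseId: string
  groupId: string | null
}

// Composite key helper for (baseId, id) lookups.
const ownerKey = (baseId: string, id: string): string => JSON.stringify([baseId, id])

// Returns rows whose (baseId, groupId) does not reference an existing (baseId, id).
function findOwnershipViolations(items: KnowledgeItemRow[]): KnowledgeItemRow[] {
  const existing = new Set(items.map((item) => ownerKey(item.baseId, item.id)))
  return items.filter(
    (item) => item.groupId !== null && !existing.has(ownerKey(item.baseId, item.groupId))
  )
}

// Models "deleting the owner cascades to grouped members", transitively,
// so nested owners inside the same base are removed as well.
function deleteWithCascade(
  items: KnowledgeItemRow[],
  baseId: string,
  ownerId: string
): KnowledgeItemRow[] {
  const removed = new Set<string>([ownerId])
  let changed = true
  while (changed) {
    changed = false
    for (const item of items) {
      const grouped = item.baseId === baseId && item.groupId !== null && removed.has(item.groupId)
      if (grouped && !removed.has(item.id)) {
        removed.add(item.id)
        changed = true
      }
    }
  }
  return items.filter((item) => !(item.baseId === baseId && removed.has(item.id)))
}
```

Migrated v1 rows all carry `groupId = null`, so they never violate the rule and are never removed by another item's cascade.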
Current type / data Integrity Boundary
- `knowledge_item.type` and `knowledge_item.data` are intended to stay aligned by controlled UI flows.
- In the current V2 scope, knowledge item create/edit operations are expected to come from strongly associated UI forms or controlled write paths for each item type.
- The current implementation does not add an extra DB-level cross-structure constraint that re-validates `data` against the stored `type` on every write.
- At the DataApi/service layer:
  - create flows still rely on controlled write paths to keep `type` and `data` aligned
  - update flows re-validate `data` against the stored `type` before persisting changes
- Downstream knowledge code may therefore treat the stored `type` + `data` pair as a trusted contract produced by the app's controlled write path.
- If future write paths are added outside the current controlled UI flow, such as import tools, scripts, sync jobs, or public/external APIs, this assumption must be revisited and explicit boundary validation should be added at that time.
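An update-time check of `data` against the stored `type` might look roughly like the sketch below. The payload field names (`path`, `url`, `content`) are assumptions made for illustration; this document only states that url items keep the URL inside `data` and that notes may carry `data.sourceUrl`. `assertDataMatchesStoredType` is a hypothetical name.

```typescript
type ItemType = 'file' | 'url' | 'note' | 'directory'
type Payload = Record<string, unknown>

// Hypothetical per-type checks; the real payload shapes are not specified here.
const dataValidators: Record<ItemType, (data: Payload) => boolean> = {
  file: (data) => typeof data['path'] === 'string',
  url: (data) => typeof data['url'] === 'string',
  note: (data) =>
    typeof data['content'] === 'string' &&
    (data['sourceUrl'] === undefined || typeof data['sourceUrl'] === 'string'),
  directory: (data) => typeof data['path'] === 'string'
}

// Throws when the incoming data does not fit the type already stored on the row.
function assertDataMatchesStoredType(storedType: ItemType, data: unknown): void {
  if (typeof data !== 'object' || data === null || Array.isArray(data)) {
    throw new TypeError('knowledge item data must be an object')
  }
  if (!dataValidators[storedType](data as Payload)) {
    throw new TypeError(`data does not match stored type "${storedType}"`)
  }
}
```

The check keys off the stored `type` rather than anything in the incoming payload, which mirrors the update-flow rule described above.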
Current Non-Goals
- This phase does not reconstruct hierarchy from legacy v1 exports.
- This phase does not infer directory child relationships during migration.
- This phase does not introduce a first-class `knowledge_group` table.
- This phase does not preserve temporary processing lifecycle states beyond the `uniqueId`-based status rule below.
- This phase does not migrate `video` or `memory` into V2 knowledge tables.
dimensions Resolution Rule
- `dimensions` is treated as a required field for target V2 `knowledge_base`.
- Migration does not trust legacy Redux `dimensions` as the source of truth.
- Migration must resolve `dimensions` from the legacy vector database by inspecting:
  - the per-base legacy vector DB file
  - the `vectors` table
  - a non-null vector blob whose byte length can be converted to a positive dimension count
- Resolution is considered failed when the legacy vector DB is missing, empty, invalid, or its vector blob length cannot be parsed into a valid positive dimension count.
- When resolution fails, the knowledge base is considered unusable in V2 migration:
  - skip the entire base
  - skip all items under that base
  - record a warning for diagnostics
- Migration does not apply fallback or auto-fix for unresolved `dimensions`.
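The blob-length rule and the skip-on-failure behavior can be sketched as follows. This assumes vectors are stored as packed 32-bit floats (4 bytes per dimension), which this document does not state; `resolveDimensionsFromBlob`, `planBaseMigration`, and the `LegacyBase` shape are hypothetical, and reading the blob out of the legacy `vectors` table is left out.

```typescript
// Assumption: each dimension is one packed float32 value.
const BYTES_PER_FLOAT32 = 4

// Returns a positive dimension count, or null when resolution fails.
function resolveDimensionsFromBlob(blob: Uint8Array | null): number | null {
  if (blob === null || blob.byteLength === 0) return null
  if (blob.byteLength % BYTES_PER_FLOAT32 !== 0) return null
  return blob.byteLength / BYTES_PER_FLOAT32
}

interface LegacyBase {
  id: string
  vectorBlob: Uint8Array | null // first non-null blob from the legacy vectors table, if any
}

interface MigrationPlan {
  bases: Array<{ id: string; dimensions: number }>
  warnings: string[]
}

// Bases with unresolved dimensions are skipped with a warning; no fallback is applied.
function planBaseMigration(legacyBases: LegacyBase[]): MigrationPlan {
  const plan: MigrationPlan = { bases: [], warnings: [] }
  for (const base of legacyBases) {
    const dimensions = resolveDimensionsFromBlob(base.vectorBlob)
    if (dimensions === null) {
      plan.warnings.push(`Skipping knowledge base ${base.id}: dimensions could not be resolved`)
      continue
    }
    plan.bases.push({ id: base.id, dimensions })
  }
  return plan
}
```

Items under a skipped base would be dropped by the caller, since they have no target base to attach to.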
Item Status Migration Rule
- Legacy `processingStatus` is treated as runtime state and is not used as the migration source of truth.
- Migration infers target V2 `knowledge_item.status` from legacy `uniqueId`:
  - non-empty `uniqueId` -> `completed`
  - otherwise -> `idle`
- Temporary legacy states such as in-progress or failed processing are not preserved as V2 status during migration.
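As a minimal sketch, the status rule reduces to a single check. `LegacyItem` and `inferMigratedStatus` are hypothetical names, and treating a whitespace-only `uniqueId` as non-empty is an assumption of this sketch.

```typescript
type MigratedStatus = 'completed' | 'idle'

// Only the legacy fields relevant to the status rule are modeled.
interface LegacyItem {
  uniqueId?: string | null
  processingStatus?: string // runtime state; deliberately ignored below
}

function inferMigratedStatus(item: LegacyItem): MigratedStatus {
  return typeof item.uniqueId === 'string' && item.uniqueId.length > 0 ? 'completed' : 'idle'
}
```

For example, a legacy item with `processingStatus: 'failed'` and no `uniqueId` migrates as `idle`, not `failed`.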
Runtime Status Boundary
- `knowledge_item.status` and `knowledge_item.error` remain part of the official V2 business schema.
- The runtime queue implementation is not part of the schema contract:
  - no separate task table
  - no persisted queue record
  - no persisted task run id
- Runtime currently uses an in-memory `p-queue` based pipeline in `KnowledgeRuntimeService`.
- The schema-level `status` set is: `idle`, `preparing`, `processing`, `reading`, `embedding`, `completed`, `failed`, `deleting`
- Current runtime writes:
  - `preparing` while a `directory` root or nested directory is being expanded
  - `reading` while a leaf item is reading source documents
  - `embedding` while a leaf item is embedding / writing vectors
  - `processing` while a container has active descendants but is not itself expanding
  - `completed` after successful leaf indexing, or when a container has no active children
  - `failed` on runtime failure, interrupt cleanup failure, or shutdown interruption
- `fileProcessorId` is persisted in base config, but it does not participate in runtime execution yet.
- In other words:
  - queue structure is implementation detail
  - `status` is business lifecycle and coarse runtime progress
  - container status is reconciled from its own status and child item statuses
  - these concerns must not be conflated
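One possible reading of the container reconciliation rule is sketched below. It is not the `KnowledgeRuntimeService` logic: which statuses count as "active", and leaving a `failed` or `deleting` container untouched, are assumptions of this sketch, and `reconcileContainerStatus` is a hypothetical name.

```typescript
type ItemStatus =
  | 'idle'
  | 'preparing'
  | 'processing'
  | 'reading'
  | 'embedding'
  | 'completed'
  | 'failed'
  | 'deleting'

// Assumption: these statuses represent in-flight work on a child item.
const ACTIVE_STATUSES: ReadonlySet<ItemStatus> = new Set<ItemStatus>([
  'preparing',
  'processing',
  'reading',
  'embedding'
])

function reconcileContainerStatus(own: ItemStatus, children: readonly ItemStatus[]): ItemStatus {
  // A container that is still expanding keeps `preparing`.
  // Assumption: `failed` and `deleting` are terminal for this purpose and left as-is.
  if (own === 'preparing' || own === 'failed' || own === 'deleting') return own
  // Active descendants -> `processing`; no active children -> `completed`.
  return children.some((status) => ACTIVE_STATUSES.has(status)) ? 'processing' : 'completed'
}
```

The function takes only statuses as input, which reflects the point above that queue structure stays out of the schema contract.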
Current Runtime Consumption Notes
- Runtime entrypoint:
  - `src/main/services/knowledge/runtime/KnowledgeRuntimeService.ts`
- Reader dispatch code still exists for stored `knowledge_item.type` values:
  - `file` -> file reader by extension
  - `url` -> fetch markdown through Jina Reader
  - `note` -> inline note content
  - `directory` -> currently treated as a container placeholder and returns no documents
- `sitemap` is no longer a valid persisted V2 `knowledge_item.type`. Legacy v1 sitemap items are mapped to `url` during migration and indexed through the URL path.
- Runtime add flow accepts new item payloads:
  - leaf payloads create `knowledge_item` rows and enqueue `index-leaf`
  - `directory` payloads create root rows and enqueue `prepare-root`
- `prepare-root` expands a directory owner inside the runtime queue, creates child rows, and enqueues concrete leaf children as `index-leaf`.
- Callers must not create user-supplied nested `directory` items under another item. Nested directory rows may still be created internally by directory expansion to preserve filesystem hierarchy.
- Runtime embedding model resolution currently expects `knowledge_base.embeddingModelId` in `providerId::modelId` format and only supports `ollama` as the active provider.
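Parsing the `providerId::modelId` format could look like the sketch below. `parseEmbeddingModelId` and `EmbeddingModelRef` are hypothetical names, splitting on the first `::` is an assumption (it lets the model part keep single colons such as tags), and the error messages are illustrative only.

```typescript
interface EmbeddingModelRef {
  providerId: string
  modelId: string
}

// Per this document, only `ollama` is currently supported as the active provider.
const SUPPORTED_PROVIDERS: ReadonlySet<string> = new Set(['ollama'])

function parseEmbeddingModelId(raw: string): EmbeddingModelRef {
  const separatorIndex = raw.indexOf('::')
  // Both sides of the separator must be non-empty.
  if (separatorIndex <= 0 || separatorIndex + 2 >= raw.length) {
    throw new Error(`Expected "providerId::modelId", got "${raw}"`)
  }
  const providerId = raw.slice(0, separatorIndex)
  const modelId = raw.slice(separatorIndex + 2)
  if (!SUPPORTED_PROVIDERS.has(providerId)) {
    throw new Error(`Unsupported embedding provider "${providerId}"`)
  }
  return { providerId, modelId }
}
```

With this sketch, `parseEmbeddingModelId('ollama::nomic-embed-text:latest')` yields `providerId` `ollama` and `modelId` `nomic-embed-text:latest`, while an id with any other provider prefix is rejected.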
Implementation Status
- `video` and `memory` items are skipped during migration.
- The target schema uses optional `groupId`, but migration from official v1 data still writes it as `null`.
- The current DataApi contract exposes flat item read/listing only; write operations go through runtime orchestration.
- Group ownership is represented implicitly by `groupId = ownerItem.id`; there is no standalone group table in the current phase.
- `dimensions` resolution failure skips the entire base and all nested items, with warnings recorded in migration output.
- Knowledge item status migration uses `uniqueId` instead of `processingStatus`.
- The current runtime service is `KnowledgeRuntimeService`, not the old `KnowledgeService` name used in earlier notes.
- Current runtime queue behavior is a single in-memory `PQueue({ concurrency: 5 })` shared across knowledge bases; there is no per-base serial queue yet.
- Current runtime queue entries are `prepare-root` and `index-leaf`; preparation and leaf indexing share interrupt / wait / shutdown cleanup semantics.