mirror of https://github.com/CherryHQ/cherry-studio.git synced 2026-07-04 05:00:00 +08:00

Files

槑囿脑袋 1382a8dd7c feat(knowledge): route embeddings and reranking through the AI service (#15796 )

### What this PR does

Before this PR:

- Knowledge embeddings and reranking ran through the legacy
embedjs-based
knowledgeV1 stack with their own provider clients, independent of the
app's
  AI service.
- File-processing intake accepted several heterogeneous input shapes,
and
knowledge file items were tracked by FileEntry ids, coupling file
content to
  the file-manager entry/cache.

After this PR:

- Embeddings and reranking are routed through the unified `AiService`
(with
cherryin rerank support) and guarded by strict embedding-dimension
validation
  that rejects stale/mismatched vectors.
- File-processing intake is collapsed to a single path-based model;
knowledge
  file items are stored by base-relative path under the knowledge-base
directory, and v1 uploads are copied into the v2 base dir during
migration so
  migrated items stay reindexable/restorable.
- Legacy `knowledgeV1` is removed; the orchestration services were
renamed to
  `KnowledgeService` / `FileProcessingService`.
- Chat -> knowledge attach is temporarily disconnected (tracked TODO)
while the
  v2 file-manager bridge is rebuilt.

Fixes #N/A (no linked issue)

### Why we need it and why it was done in this way

Routing embeddings/rerank through `AiService` unifies provider handling
and
credentials and removes the parallel embedjs client stack and its v1
coupling.
Storing knowledge files by base-relative path (instead of FileEntry ids)
makes
each knowledge base self-contained and portable.

The following tradeoffs were made:

- A large, coordinated refactor plus a migration step that physically
copies v1
uploads into the v2 base dir, in exchange for removing the parallel
client
  stack and making bases self-contained.
- Base-relative path storage required a fail-fast/dedup strategy for
same-named
  files and a guard for blank legacy filenames.

The following alternatives were considered:

- Keeping the embedjs stack behind an adapter — rejected; perpetuates
the
  parallel client and v1 coupling.
- Keeping FileEntry-id storage — rejected; couples knowledge files to
the
  file-manager cache and blocks portability.

### Breaking changes

- `knowledgeV1` is removed. Legacy v1 knowledge data reaches v2 only
through the
  v2 migrators; there is no v1 fallback.
- The v2 knowledge HTTP API (API gateway) now returns v2-native
per-entry fields
(`embeddingModelId`, `createdAt` on base entries; `chunkId`,
`scoreKind`,
  `rank` on search results). The response envelope (`knowledge_bases`,
  `searched_bases`, `total`) is unchanged. See

`v2-refactor-temp/docs/breaking-changes/2026-06-05-knowledge-api-v2.md`.

### Special notes for your reviewer

- This branch went through several rounds of multi-agent code review.
The most
recent 6 commits address review findings: directory-import path
collisions,
migrated-file source copying + blank `relativePath` guard, addItems
rollback
error preservation, eager `document_to_markdown` output-target
validation, a
`CompletedKnowledgeBase` type guard, and breaking-changes doc
corrections.
- Chat -> knowledge attach is intentionally disconnected for now
(tracked in
  `v2-refactor-temp/docs/knowledge/knowledge-todo.md`).
- Local full `pnpm lint`/`pnpm test` was not run per the project's
review
  conventions; please rely on CI / `pnpm build:check`.

### Checklist

- [x] Branch: This PR targets the correct branch — `main` for active
development, `v1` for v1 maintenance fixes
- [x] PR: The PR description is expressive enough and will help future
contributors
- [x] Code: Write code that humans can understand and Keep it simple
- [x] Refactor: You have left the code cleaner than you found it (Boy
Scout Rule)
- [x] Upgrade: Impact of this change on upgrade flows was considered and
addressed if required
- [ ] Documentation: A user-guide update was considered and is present
(link) or not required.
- [x] Self-review: I have reviewed my own code before requesting review
from others

### Release note

```release-note
NONE
```

---------

Signed-off-by: eeee0717 <chentao020717Work@outlook.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-09 14:04:29 +08:00

9.4 KiB

Raw Permalink Blame History

Knowledge Schema Notes (V2)

This document records the current V2 knowledge target schema, migration constraints, and temporary scope boundaries.

Scope Clarification

video items are out of scope for V2 knowledge data migration and should be skipped.
memory items belong to the memory module, not the knowledge module, and should be skipped in knowledge migration.

Current Target Schema

`knowledge_base`

Persisted columns:
- id
- name
- groupId
- dimensions
- embeddingModelId
- status
- error
- rerankModelId
- fileProcessorId
- chunkSize
- chunkOverlap
- threshold
- documentCount
- searchMode
- hybridAlpha
- createdAt
- updatedAt

`knowledge_item`

Persisted columns:
- id
- baseId
- groupId
- type
- data
- status
- error
- createdAt
- updatedAt
New app-created knowledge items use ordered UUID generation for id.

Fields Removed From The V2 SQLite Schema

video is not a target knowledge_item.type.
memory is not a target knowledge_item.type.
sitemap is not a target knowledge_item.type; legacy sitemap entries are migrated as url items.
Legacy runtime-only item fields are not stored as standalone SQLite columns:
- uniqueId
- uniqueIds
- processingProgress
- retryCount
- isPreprocessed
remark is not part of the V2 SQLite schema.
sourceUrl is not a standalone knowledge_item column:
- for notes, it may exist inside data.sourceUrl
- for url items, the URL is stored inside the typed data payload
Official v1 legacy exports do not contain groupId.

`groupId` Semantics

knowledge_item is modeled as a flat same-base item collection.
groupId is an optional stable grouping key inside one knowledge base.
- Typical usage: items from the same imported source/container
- Examples: one directory import, one URL collection
- When one item is the logical container/owner of a group, downstream items use groupId = containerItem.id
- The schema enforces same-base ownership:
  - (baseId, groupId) must reference (baseId, id) in knowledge_item
  - deleting the owner cascades to grouped members
Current runtime read flows use:
- GET /knowledge-bases/:id/items for flat item listing
- optional query filters: type, groupId
Current runtime write workflows use KnowledgeService IPC, not DataApi endpoints:
- add items: normalize caller-friendly inputs, create SQLite rows, and enqueue prepare/index tasks
- delete items: interrupt runtime work, delete vectors, then delete SQLite roots
- reindex items: interrupt runtime work, delete old vectors, rebuild expanded children when needed, then enqueue indexing
- search and chunk mutation: execute against the per-base vector store through runtime IPC
DataApi remains limited to SQLite-backed reads and knowledge base metadata PATCH.
Migration from official v1 data does not preserve or infer grouping metadata:
- official v1 exports are flat
- migrated items are inserted with groupId = null

Current `type` / `data` Integrity Boundary

knowledge_item.type and knowledge_item.data are intended to stay aligned by controlled UI flows.
In the current V2 scope, knowledge item create/edit operations are expected to come from strongly associated UI forms or controlled write paths for each item type.
The current implementation does not add an extra DB-level cross-structure constraint that re-validates data against the stored type on every write.
At the DataApi/service layer:
- create flows still rely on controlled write paths to keep type and data aligned
- update flows re-validate data against the stored type before persisting changes
Downstream knowledge code may therefore treat the stored type + data pair as a trusted contract produced by the app's controlled write path.
If future write paths are added outside the current controlled UI flow, such as import tools, scripts, sync jobs, or public/external APIs, this assumption must be revisited and explicit boundary validation should be added at that time.

Current Non-Goals

This phase does not reconstruct hierarchy from legacy v1 exports.
This phase does not infer directory child relationships during migration.
This phase does not introduce a first-class knowledge_group table.
This phase does not preserve temporary processing lifecycle states beyond the uniqueId-based status rule below.
This phase does not migrate video or memory into V2 knowledge tables.

`dimensions` Resolution Rule

dimensions is treated as a required field for target V2 knowledge_base.
Migration does not trust legacy Redux dimensions as the source of truth.
Migration must resolve dimensions from the legacy vector database by inspecting:
- the per-base legacy vector DB file
- the vectors table
- a non-null vector blob whose byte length can be converted to a positive dimension count
Resolution is considered failed when the legacy vector DB is missing, empty, invalid, or its vector blob length cannot be parsed into a valid positive dimension count.
When resolution fails, the knowledge base is considered unusable in V2 migration:
- skip the entire base
- skip all items under that base
- record a warning for diagnostics
Migration does not apply fallback or auto-fix for unresolved dimensions.

Item Status Migration Rule

Legacy processingStatus is treated as runtime state and is not used as the migration source of truth.
Migration infers target V2 knowledge_item.status from legacy uniqueId:
- non-empty uniqueId -> completed
- otherwise -> idle
Temporary legacy states such as in-progress or failed processing are not preserved as V2 status during migration.

Runtime Status Boundary

knowledge_item.status and knowledge_item.error remain part of the official V2 business schema.
The runtime queue implementation is not part of the schema contract:
- no separate task table
- no persisted queue record
- no persisted task run id
Runtime currently uses an in-memory p-queue based pipeline in KnowledgeRuntimeService.
The schema-level status set is:
- idle
- preparing
- processing
- reading
- embedding
- completed
- failed
- deleting
Current runtime writes:
- preparing while a directory root or nested directory is being expanded
- reading while a leaf item is reading source documents
- embedding while a leaf item is embedding / writing vectors
- processing while a container has active descendants but is not itself expanding
- completed after successful leaf indexing, or when a container has no active children
- failed on runtime failure, interrupt cleanup failure, or shutdown interruption
fileProcessorId is persisted in base config, but it does not participate in runtime execution yet.
In other words:
- queue structure is implementation detail
- status is business lifecycle and coarse runtime progress
- container status is reconciled from its own status and child item statuses
- these concerns must not be conflated

Current Runtime Consumption Notes

Runtime entrypoint:
- src/main/services/knowledge/runtime/KnowledgeRuntimeService.ts
Reader dispatch code still exists for stored knowledge_item.type values:
- file -> file reader by extension
- url -> fetch markdown through Jina Reader
- note -> inline note content
- directory -> currently treated as a container placeholder and returns no documents
sitemap is no longer a valid persisted V2 knowledge_item.type. Legacy v1 sitemap items are mapped to url during migration and indexed through the URL path.
Runtime add flow accepts new item payloads:
- leaf payloads create knowledge_item rows and enqueue index-leaf
- directory payloads create root rows and enqueue prepare-root
prepare-root expands a directory owner inside the runtime queue, creates child rows, and enqueues concrete leaf children as index-leaf.
Callers must not create user-supplied nested directory items under another item. Nested directory rows may still be created internally by directory expansion to preserve filesystem hierarchy.
Runtime embedding model resolution currently expects knowledge_base.embeddingModelId in providerId::modelId format and only supports ollama as the active provider.

Implementation Status

video and memory items are skipped during migration.
The target schema uses optional groupId, but migration from official v1 data still writes it as null.
The current DataApi contract exposes flat item read/listing only; write operations go through runtime orchestration.
Group ownership is represented implicitly by groupId = ownerItem.id; there is no standalone group table in the current phase.
dimensions resolution failure skips the entire base and all nested items, with warnings recorded in migration output.
Knowledge item status migration uses uniqueId instead of processingStatus.
The current runtime service is KnowledgeRuntimeService, not the old KnowledgeService name used in earlier notes.
Current runtime queue behavior is a single in-memory PQueue({ concurrency: 5 }) shared across knowledge bases; there is no per-base serial queue yet.
Current runtime queue entries are prepare-root and index-leaf; preparation and leaf indexing share interrupt / wait / shutdown cleanup semantics.

9.4 KiB Raw Permalink Blame History