CherryHQ-cherry-studio/v2-refactor-temp/docs/knowledge/knowledge-schema.md
槑囿脑袋 1382a8dd7c feat(knowledge): route embeddings and reranking through the AI service (#15796)
### What this PR does

Before this PR:

- Knowledge embeddings and reranking ran through the legacy embedjs-based
  knowledgeV1 stack with their own provider clients, independent of the app's
  AI service.
- File-processing intake accepted several heterogeneous input shapes, and
  knowledge file items were tracked by FileEntry ids, coupling file content to
  the file-manager entry/cache.

After this PR:

- Embeddings and reranking are routed through the unified `AiService` (with
  cherryin rerank support) and guarded by strict embedding-dimension validation
  that rejects stale/mismatched vectors.
- File-processing intake is collapsed to a single path-based model; knowledge
  file items are stored by base-relative path under the knowledge-base
  directory, and v1 uploads are copied into the v2 base dir during migration so
  migrated items stay reindexable/restorable.
- Legacy `knowledgeV1` is removed; the orchestration services were renamed to
  `KnowledgeService` / `FileProcessingService`.
- Chat -> knowledge attach is temporarily disconnected (tracked TODO) while the
  v2 file-manager bridge is rebuilt.

Fixes #N/A (no linked issue)

### Why we need it and why it was done in this way

Routing embeddings/rerank through `AiService` unifies provider handling and
credentials and removes the parallel embedjs client stack and its v1 coupling.
Storing knowledge files by base-relative path (instead of FileEntry ids) makes
each knowledge base self-contained and portable.

The following tradeoffs were made:

- A large, coordinated refactor plus a migration step that physically copies v1
  uploads into the v2 base dir, in exchange for removing the parallel client
  stack and making bases self-contained.
- Base-relative path storage required a fail-fast/dedup strategy for same-named
  files and a guard for blank legacy filenames.

The following alternatives were considered:

- Keeping the embedjs stack behind an adapter: rejected; perpetuates the
  parallel client and v1 coupling.
- Keeping FileEntry-id storage: rejected; couples knowledge files to the
  file-manager cache and blocks portability.

### Breaking changes

- `knowledgeV1` is removed. Legacy v1 knowledge data reaches v2 only through the
  v2 migrators; there is no v1 fallback.
- The v2 knowledge HTTP API (API gateway) now returns v2-native per-entry fields
  (`embeddingModelId`, `createdAt` on base entries; `chunkId`, `scoreKind`,
  `rank` on search results). The response envelope (`knowledge_bases`,
  `searched_bases`, `total`) is unchanged. See
  `v2-refactor-temp/docs/breaking-changes/2026-06-05-knowledge-api-v2.md`.

### Special notes for your reviewer

- This branch went through several rounds of multi-agent code review. The most
  recent 6 commits address review findings: directory-import path collisions,
  migrated-file source copying + blank `relativePath` guard, addItems rollback
  error preservation, eager `document_to_markdown` output-target validation, a
  `CompletedKnowledgeBase` type guard, and breaking-changes doc corrections.
- Chat -> knowledge attach is intentionally disconnected for now (tracked in
  `v2-refactor-temp/docs/knowledge/knowledge-todo.md`).
- Local full `pnpm lint`/`pnpm test` was not run per the project's review
  conventions; please rely on CI / `pnpm build:check`.

### Checklist

- [x] Branch: This PR targets the correct branch — `main` for active development, `v1` for v1 maintenance fixes
- [x] PR: The PR description is expressive enough and will help future contributors
- [x] Code: Write code that humans can understand and Keep it simple
- [x] Refactor: You have left the code cleaner than you found it (Boy Scout Rule)
- [x] Upgrade: Impact of this change on upgrade flows was considered and addressed if required
- [ ] Documentation: A user-guide update was considered and is present (link) or not required.
- [x] Self-review: I have reviewed my own code before requesting review from others

### Release note

```release-note
NONE
```

---------

Signed-off-by: eeee0717 <chentao020717Work@outlook.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 14:04:29 +08:00


Knowledge Schema Notes (V2)

This document records the current V2 knowledge target schema, migration constraints, and temporary scope boundaries.

Scope Clarification

  • video items are out of scope for V2 knowledge data migration and should be skipped.
  • memory items belong to the memory module, not the knowledge module, and should be skipped in knowledge migration.

Current Target Schema

knowledge_base

  • Persisted columns:
    • id
    • name
    • groupId
    • dimensions
    • embeddingModelId
    • status
    • error
    • rerankModelId
    • fileProcessorId
    • chunkSize
    • chunkOverlap
    • threshold
    • documentCount
    • searchMode
    • hybridAlpha
    • createdAt
    • updatedAt

knowledge_item

  • Persisted columns:
    • id
    • baseId
    • groupId
    • type
    • data
    • status
    • error
    • createdAt
    • updatedAt
  • New app-created knowledge items use ordered UUID generation for id.

Fields Removed From The V2 SQLite Schema

  • video is not a target knowledge_item.type.
  • memory is not a target knowledge_item.type.
  • sitemap is not a target knowledge_item.type; legacy sitemap entries are migrated as url items.
  • Legacy runtime-only item fields are not stored as standalone SQLite columns:
    • uniqueId
    • uniqueIds
    • processingProgress
    • retryCount
    • isPreprocessed
  • remark is not part of the V2 SQLite schema.
  • sourceUrl is not a standalone knowledge_item column:
    • for notes, it may exist inside data.sourceUrl
    • for url items, the URL is stored inside the typed data payload
  • Official v1 legacy exports do not contain groupId.
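
The persisted item columns and the removed legacy types can be summarized in a short TypeScript sketch. This is illustrative only: the column types, the nullability, and the helper name mapLegacyItemType are assumptions made for the example, not the project's actual definitions. Treating unknown legacy types as skipped is also an assumption.

```typescript
// Target item types, taken from the reader dispatch list in this document.
type KnowledgeItemType = 'file' | 'url' | 'note' | 'directory'

// Sketch of a knowledge_item row. Column names come from this document;
// the TypeScript types and nullability are assumptions.
interface KnowledgeItemRow {
  id: string
  baseId: string
  groupId: string | null
  type: KnowledgeItemType
  data: Record<string, unknown>
  status: string
  error: string | null
  createdAt: number
  updatedAt: number
}

// Hypothetical helper: maps a legacy v1 item type to a V2 type,
// or returns null when the item should be skipped during migration.
function mapLegacyItemType(legacyType: string): KnowledgeItemType | null {
  switch (legacyType) {
    case 'file':
    case 'url':
    case 'note':
    case 'directory':
      return legacyType
    case 'sitemap':
      // Legacy sitemap entries are migrated as url items.
      return 'url'
    default:
      // video and memory are out of scope; unknown types are skipped (assumption).
      return null
  }
}

// Example row for a migrated sitemap entry (all values are made up).
const migratedSitemapRow: KnowledgeItemRow = {
  id: 'item-1',
  baseId: 'base-1',
  groupId: null,
  type: mapLegacyItemType('sitemap') ?? 'url',
  data: { url: 'https://example.com/sitemap.xml' },
  status: 'idle',
  error: null,
  createdAt: 0,
  updatedAt: 0
}
```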

groupId Semantics

  • knowledge_item is modeled as a flat same-base item collection.
  • groupId is an optional stable grouping key inside one knowledge base.
    • Typical usage: items from the same imported source/container
    • Examples: one directory import, one URL collection
    • When one item is the logical container/owner of a group, downstream items use groupId = containerItem.id
    • The schema enforces same-base ownership:
      • (baseId, groupId) must reference (baseId, id) in knowledge_item
      • deleting the owner cascades to grouped members
  • Current runtime read flows use:
    • GET /knowledge-bases/:id/items for flat item listing
    • optional query filters: type, groupId
  • Current runtime write workflows use KnowledgeService IPC, not DataApi endpoints:
    • add items: normalize caller-friendly inputs, create SQLite rows, and enqueue prepare/index tasks
    • delete items: interrupt runtime work, delete vectors, then delete SQLite roots
    • reindex items: interrupt runtime work, delete old vectors, rebuild expanded children when needed, then enqueue indexing
    • search and chunk mutation: execute against the per-base vector store through runtime IPC
  • DataApi remains limited to SQLite-backed reads and knowledge base metadata PATCH.
  • Migration from official v1 data does not preserve or infer grouping metadata:
    • official v1 exports are flat
    • migrated items are inserted with groupId = null
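
The same-base ownership rule and the cascade on owner deletion can be modeled in memory. The sketch below is not the SQLite schema; it only mimics the (baseId, groupId) -> (baseId, id) reference and the cascade behavior with plain arrays. The names GroupedItem, insertGroupedItem, and deleteItemCascade are hypothetical.

```typescript
interface GroupedItem {
  id: string
  baseId: string
  groupId: string | null
}

// Rejects a groupId that does not point at an existing item in the same base.
function insertGroupedItem(items: GroupedItem[], item: GroupedItem): GroupedItem[] {
  if (item.groupId !== null) {
    const owner = items.find((i) => i.baseId === item.baseId && i.id === item.groupId)
    if (!owner) {
      throw new Error(`groupId ${item.groupId} has no owner in base ${item.baseId}`)
    }
  }
  return [...items, item]
}

// Deletes an item and, transitively, every member grouped under it in the same base.
function deleteItemCascade(items: GroupedItem[], baseId: string, id: string): GroupedItem[] {
  const doomed = new Set<string>([id])
  let grew = true
  while (grew) {
    grew = false
    for (const item of items) {
      const grouped = item.groupId !== null && doomed.has(item.groupId)
      if (item.baseId === baseId && grouped && !doomed.has(item.id)) {
        doomed.add(item.id)
        grew = true
      }
    }
  }
  return items.filter((item) => !(item.baseId === baseId && doomed.has(item.id)))
}
```

For example, inserting a directory owner and two files with groupId set to the owner's id, then deleting the owner, leaves only the ungrouped items of that base.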

Current type / data Integrity Boundary

  • knowledge_item.type and knowledge_item.data are intended to stay aligned by controlled UI flows.
  • In the current V2 scope, knowledge item create/edit operations are expected to come from strongly associated UI forms or controlled write paths for each item type.
  • The current implementation does not add an extra DB-level cross-structure constraint that re-validates data against the stored type on every write.
  • At the DataApi/service layer:
    • create flows still rely on controlled write paths to keep type and data aligned
    • update flows re-validate data against the stored type before persisting changes
  • Downstream knowledge code may therefore treat the stored type + data pair as a trusted contract produced by the app's controlled write path.
  • If future write paths are added outside the current controlled UI flow, such as import tools, scripts, sync jobs, or public/external APIs, this assumption must be revisited and explicit boundary validation should be added at that time.
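
The update-time re-validation described above could look roughly like the following sketch. The payload field names (path, url, content) are hypothetical, since this document does not list the per-type data fields; only data.sourceUrl on notes is mentioned. The function name validateItemDataForUpdate is also an assumption.

```typescript
type StoredItemType = 'file' | 'url' | 'note' | 'directory'
type ItemData = Record<string, unknown>

// Hypothetical per-type checks; the real payload shapes are not specified here.
const dataValidators: Record<StoredItemType, (data: ItemData) => boolean> = {
  file: (d) => typeof d['path'] === 'string',
  url: (d) => typeof d['url'] === 'string',
  note: (d) =>
    typeof d['content'] === 'string' &&
    (d['sourceUrl'] === undefined || typeof d['sourceUrl'] === 'string'),
  directory: (d) => typeof d['path'] === 'string'
}

// Re-validates the incoming data against the type already stored on the row.
// The caller cannot change the type here; only data is checked.
function validateItemDataForUpdate(storedType: StoredItemType, nextData: unknown): ItemData {
  if (typeof nextData !== 'object' || nextData === null || Array.isArray(nextData)) {
    throw new Error('data must be a plain object')
  }
  const data = nextData as ItemData
  if (!dataValidators[storedType](data)) {
    throw new Error(`data does not match stored type "${storedType}"`)
  }
  return data
}
```

A new external write path would need a comparable check on create as well, because the create flow currently relies on controlled callers.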

Current Non-Goals

  • This phase does not reconstruct hierarchy from legacy v1 exports.
  • This phase does not infer directory child relationships during migration.
  • This phase does not introduce a first-class knowledge_group table.
  • This phase does not preserve temporary processing lifecycle states beyond the uniqueId-based status rule below.
  • This phase does not migrate video or memory into V2 knowledge tables.

dimensions Resolution Rule

  • dimensions is treated as a required field for target V2 knowledge_base.
  • Migration does not trust legacy Redux dimensions as the source of truth.
  • Migration must resolve dimensions from the legacy vector database by inspecting:
    • the per-base legacy vector DB file
    • the vectors table
    • a non-null vector blob whose byte length can be converted to a positive dimension count
  • Resolution is considered failed when the legacy vector DB is missing, empty, invalid, or its vector blob length cannot be parsed into a valid positive dimension count.
  • When resolution fails, the knowledge base is considered unusable in V2 migration:
    • skip the entire base
    • skip all items under that base
    • record a warning for diagnostics
  • Migration does not apply fallback or auto-fix for unresolved dimensions.
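
A minimal sketch of the resolution rule, assuming the legacy vector blob is a packed array of 4-byte float32 values. The encoding is an assumption made for the example; this document only says the byte length must convert to a positive dimension count. The names resolveDimensionsFromBlob and decideBaseMigration are hypothetical.

```typescript
// Assumption: each vector component is stored as a 4-byte float32.
const BYTES_PER_COMPONENT = 4

// Returns the dimension count, or null when the blob cannot be parsed.
function resolveDimensionsFromBlob(blob: Uint8Array | null): number | null {
  if (blob === null || blob.byteLength === 0) return null
  if (blob.byteLength % BYTES_PER_COMPONENT !== 0) return null
  return blob.byteLength / BYTES_PER_COMPONENT
}

type BaseMigrationDecision =
  | { kind: 'migrate'; dimensions: number }
  | { kind: 'skip'; warning: string }

// No fallback or auto-fix: an unresolved base is skipped together with its items.
function decideBaseMigration(baseId: string, blob: Uint8Array | null): BaseMigrationDecision {
  const dimensions = resolveDimensionsFromBlob(blob)
  if (dimensions === null) {
    return { kind: 'skip', warning: `knowledge base ${baseId}: dimensions could not be resolved` }
  }
  return { kind: 'migrate', dimensions }
}
```

Under the float32 assumption, a 4096-byte blob resolves to 1024 dimensions, while a missing blob or a 6-byte blob leads to the base being skipped with a warning.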

Item Status Migration Rule

  • Legacy processingStatus is treated as runtime state and is not used as the migration source of truth.
  • Migration infers target V2 knowledge_item.status from legacy uniqueId:
    • non-empty uniqueId -> completed
    • otherwise -> idle
  • Temporary legacy states such as in-progress or failed processing are not preserved as V2 status during migration.
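
The status rule is small enough to state as a function. In this sketch the name inferMigratedStatus is hypothetical, and "non-empty" is read as a string of length greater than zero, which is an assumption about how whitespace-only values are treated.

```typescript
type MigratedItemStatus = 'completed' | 'idle'

// Legacy processingStatus is deliberately not an input: only uniqueId decides.
function inferMigratedStatus(uniqueId: string | null | undefined): MigratedItemStatus {
  return typeof uniqueId === 'string' && uniqueId.length > 0 ? 'completed' : 'idle'
}
```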

Runtime Status Boundary

  • knowledge_item.status and knowledge_item.error remain part of the official V2 business schema.
  • The runtime queue implementation is not part of the schema contract:
    • no separate task table
    • no persisted queue record
    • no persisted task run id
  • Runtime currently uses an in-memory p-queue based pipeline in KnowledgeRuntimeService.
  • The schema-level status set is:
    • idle
    • preparing
    • processing
    • reading
    • embedding
    • completed
    • failed
    • deleting
  • Current runtime writes:
    • preparing while a directory root or nested directory is being expanded
    • reading while a leaf item is reading source documents
    • embedding while a leaf item is embedding / writing vectors
    • processing while a container has active descendants but is not itself expanding
    • completed after successful leaf indexing, or when a container has no active children
    • failed on runtime failure, interrupt cleanup failure, or shutdown interruption
  • fileProcessorId is persisted in base config, but it does not participate in runtime execution yet.
  • In other words:
    • the queue structure is an implementation detail
    • status represents the business lifecycle and coarse runtime progress
    • container status is reconciled from its own status and child item statuses
    • these concerns must not be conflated
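
Container reconciliation, as described by the runtime writes above, might be sketched as follows. Which statuses count as "active" and how idle or failed children affect the container are not specified in this document, so the ACTIVE_STATUSES set and the precedence order in reconcileContainerStatus are assumptions.

```typescript
type KnowledgeItemStatus =
  | 'idle'
  | 'preparing'
  | 'processing'
  | 'reading'
  | 'embedding'
  | 'completed'
  | 'failed'
  | 'deleting'

// Assumption: these statuses mean work is still in flight.
const ACTIVE_STATUSES = new Set<KnowledgeItemStatus>(['preparing', 'processing', 'reading', 'embedding'])

function reconcileContainerStatus(
  own: KnowledgeItemStatus,
  children: KnowledgeItemStatus[]
): KnowledgeItemStatus {
  // A container that is expanding, failed, or being deleted keeps its own status.
  if (own === 'preparing' || own === 'failed' || own === 'deleting') return own
  // Active descendants while the container is not itself expanding -> processing.
  if (children.some((status) => ACTIVE_STATUSES.has(status))) return 'processing'
  // No active children -> completed.
  return 'completed'
}
```

Note that the function reads only statuses, never queue state, which keeps the queue structure out of the schema contract.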

Current Runtime Consumption Notes

  • Runtime entrypoint:
    • src/main/services/knowledge/runtime/KnowledgeRuntimeService.ts
  • Reader dispatch code still exists for stored knowledge_item.type values:
    • file -> file reader by extension
    • url -> fetch markdown through Jina Reader
    • note -> inline note content
    • directory -> currently treated as a container placeholder and returns no documents
  • sitemap is no longer a valid persisted V2 knowledge_item.type. Legacy v1 sitemap items are mapped to url during migration and indexed through the URL path.
  • Runtime add flow accepts new item payloads:
    • leaf payloads create knowledge_item rows and enqueue index-leaf
    • directory payloads create root rows and enqueue prepare-root
  • prepare-root expands a directory owner inside the runtime queue, creates child rows, and enqueues concrete leaf children as index-leaf.
  • Callers must not create user-supplied nested directory items under another item. Nested directory rows may still be created internally by directory expansion to preserve filesystem hierarchy.
  • Runtime embedding model resolution currently expects knowledge_base.embeddingModelId in providerId::modelId format and only supports ollama as the active provider.
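
Parsing the providerId::modelId format and applying the current provider restriction can be sketched as below. The function names and error messages are hypothetical; splitting on the first "::" is an assumption that lets model ids keep their own single colons.

```typescript
interface EmbeddingModelRef {
  providerId: string
  modelId: string
}

// Splits on the first "::" and requires both halves to be non-empty.
function parseEmbeddingModelId(value: string): EmbeddingModelRef {
  const separatorIndex = value.indexOf('::')
  if (separatorIndex <= 0 || separatorIndex + 2 >= value.length) {
    throw new Error(`expected providerId::modelId, got "${value}"`)
  }
  return {
    providerId: value.slice(0, separatorIndex),
    modelId: value.slice(separatorIndex + 2)
  }
}

// Per this document, only ollama is currently supported as the active provider.
const SUPPORTED_PROVIDER_IDS = new Set<string>(['ollama'])

function resolveEmbeddingModel(embeddingModelId: string): EmbeddingModelRef {
  const ref = parseEmbeddingModelId(embeddingModelId)
  if (!SUPPORTED_PROVIDER_IDS.has(ref.providerId)) {
    throw new Error(`unsupported embedding provider "${ref.providerId}"`)
  }
  return ref
}
```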

Implementation Status

  • video and memory items are skipped during migration.
  • The target schema uses optional groupId, but migration from official v1 data still writes it as null.
  • The current DataApi contract exposes flat item read/listing only; write operations go through runtime orchestration.
  • Group ownership is represented implicitly by groupId = ownerItem.id; there is no standalone group table in the current phase.
  • dimensions resolution failure skips the entire base and all nested items, with warnings recorded in migration output.
  • Knowledge item status migration uses uniqueId instead of processingStatus.
  • The current runtime service is KnowledgeRuntimeService, not the old KnowledgeService name used in earlier notes.
  • Current runtime queue behavior is a single in-memory PQueue({ concurrency: 5 }) shared across knowledge bases; there is no per-base serial queue yet.
  • Current runtime queue entries are prepare-root and index-leaf; preparation and leaf indexing share interrupt / wait / shutdown cleanup semantics.
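
To illustrate what a single shared queue with concurrency 5 means for prepare-root and index-leaf entries, here is a dependency-free stand-in. It is not p-queue and not the KnowledgeRuntimeService implementation; the class BoundedTaskQueue, the QueueEntry shape, and the handler signature are hypothetical, and interrupt / shutdown handling is omitted.

```typescript
type QueueEntry =
  | { kind: 'prepare-root'; baseId: string; itemId: string }
  | { kind: 'index-leaf'; baseId: string; itemId: string }

type EntryHandler = (entry: QueueEntry) => Promise<void>

// One queue for all knowledge bases: entries from different bases compete for the same slots.
class BoundedTaskQueue {
  private readonly concurrency: number
  private running = 0
  private readonly waiting: Array<() => void> = []

  constructor(concurrency: number) {
    this.concurrency = concurrency
  }

  get activeCount(): number {
    return this.running
  }

  get pendingCount(): number {
    return this.waiting.length
  }

  add(entry: QueueEntry, handler: EntryHandler): Promise<void> {
    return new Promise<void>((resolve, reject) => {
      const settle = () => {
        this.running--
        const next = this.waiting.shift()
        if (next) next()
      }
      const start = () => {
        this.running++
        handler(entry).then(
          () => {
            settle()
            resolve()
          },
          (error: unknown) => {
            settle()
            reject(error)
          }
        )
      }
      // Start immediately while a slot is free; otherwise wait in FIFO order.
      if (this.running < this.concurrency) start()
      else this.waiting.push(start)
    })
  }
}
```

With seven entries added at once and a concurrency of 5, five handlers start immediately and two wait, regardless of which knowledge base each entry belongs to.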