CherryHQ-cherry-studio

mirror of https://github.com/CherryHQ/cherry-studio.git synced 2026-07-05 21:50:46 +08:00

Author	SHA1	Message	Date
槑囿脑袋	1382a8dd7c	feat(knowledge): route embeddings and reranking through the AI service (#15796)

### What this PR does

Before this PR:

- Knowledge embeddings and reranking ran through the legacy embedjs-based knowledgeV1 stack with their own provider clients, independent of the app's AI service.
- File-processing intake accepted several heterogeneous input shapes, and knowledge file items were tracked by FileEntry ids, coupling file content to the file-manager entry/cache.

After this PR:

- Embeddings and reranking are routed through the unified `AiService` (with cherryin rerank support) and guarded by strict embedding-dimension validation that rejects stale/mismatched vectors.
- File-processing intake is collapsed to a single path-based model; knowledge file items are stored by base-relative path under the knowledge-base directory, and v1 uploads are copied into the v2 base dir during migration so migrated items stay reindexable/restorable.
- Legacy `knowledgeV1` is removed; the orchestration services were renamed to `KnowledgeService` / `FileProcessingService`.
- Chat -> knowledge attach is temporarily disconnected (tracked TODO) while the v2 file-manager bridge is rebuilt.

Fixes #N/A (no linked issue)

### Why we need it and why it was done in this way

Routing embeddings/rerank through `AiService` unifies provider handling and credentials and removes the parallel embedjs client stack and its v1 coupling. Storing knowledge files by base-relative path (instead of FileEntry ids) makes each knowledge base self-contained and portable.

The following tradeoffs were made:

- A large, coordinated refactor plus a migration step that physically copies v1 uploads into the v2 base dir, in exchange for removing the parallel client stack and making bases self-contained.
- Base-relative path storage required a fail-fast/dedup strategy for same-named files and a guard for blank legacy filenames.

The following alternatives were considered:

- Keeping the embedjs stack behind an adapter: rejected; perpetuates the parallel client and v1 coupling.
- Keeping FileEntry-id storage: rejected; couples knowledge files to the file-manager cache and blocks portability.

### Breaking changes

- `knowledgeV1` is removed. Legacy v1 knowledge data reaches v2 only through the v2 migrators; there is no v1 fallback.
- The v2 knowledge HTTP API (API gateway) now returns v2-native per-entry fields (`embeddingModelId`, `createdAt` on base entries; `chunkId`, `scoreKind`, `rank` on search results). The response envelope (`knowledge_bases`, `searched_bases`, `total`) is unchanged. See `v2-refactor-temp/docs/breaking-changes/2026-06-05-knowledge-api-v2.md`.

### Special notes for your reviewer

- This branch went through several rounds of multi-agent code review. The most recent 6 commits address review findings: directory-import path collisions, migrated-file source copying + blank `relativePath` guard, addItems rollback error preservation, eager `document_to_markdown` output-target validation, a `CompletedKnowledgeBase` type guard, and breaking-changes doc corrections.
- Chat -> knowledge attach is intentionally disconnected for now (tracked in `v2-refactor-temp/docs/knowledge/knowledge-todo.md`).
- Local full `pnpm lint`/`pnpm test` was not run per the project's review conventions; please rely on CI / `pnpm build:check`.

### Checklist

- [x] Branch: This PR targets the correct branch: `main` for active development, `v1` for v1 maintenance fixes
- [x] PR: The PR description is expressive enough and will help future contributors
- [x] Code: Write code that humans can understand and Keep it simple
- [x] Refactor: You have left the code cleaner than you found it (Boy Scout Rule)
- [x] Upgrade: Impact of this change on upgrade flows was considered and addressed if required
- [ ] Documentation: A user-guide update was considered and is present (link) or not required.
- [x] Self-review: I have reviewed my own code before requesting review from others

### Release note

```release-note
NONE
```

---------

Signed-off-by: eeee0717 <chentao020717Work@outlook.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 14:04:29 +08:00
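
The "strict embedding-dimension validation" mentioned in commit 1382a8dd7c can be pictured with a minimal sketch. This is not the code from the PR: the function names, the return shape, and the error message are all hypothetical, and the only behavior taken from the commit message is that vectors whose length does not match the expected dimension are rejected.

```typescript
// Hypothetical helpers for illustration only; these names do not come from the repository.
interface DimensionMismatch {
  index: number // position of the offending vector in the batch
  actual: number // length that was found
}

// Returns the first vector whose length differs from `expected`, or undefined if all match.
function findDimensionMismatch(vectors: number[][], expected: number): DimensionMismatch | undefined {
  for (let index = 0; index < vectors.length; index++) {
    if (vectors[index].length !== expected) {
      return { index, actual: vectors[index].length }
    }
  }
  return undefined
}

// Fail fast instead of storing vectors produced with a different (for example, stale) model dimension.
function assertEmbeddingDimensions(vectors: number[][], expected: number): void {
  if (!Number.isInteger(expected) || expected <= 0) {
    throw new RangeError(`Expected dimension must be a positive integer, got ${expected}`)
  }
  const mismatch = findDimensionMismatch(vectors, expected)
  if (mismatch) {
    throw new Error(`Embedding ${mismatch.index} has ${mismatch.actual} dimensions, expected ${expected}`)
  }
}
```

Checking every vector before writing, rather than after, keeps a half-indexed batch from mixing dimensions inside one knowledge base.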
槑囿脑袋	2250ccb52f	feat(knowledge): integrate file processing for document ingestion (#15470)

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: fullex <106392080+0xfullex@users.noreply.github.com>
Signed-off-by: eeee0717 <chentao020717Work@outlook.com>
Signed-off-by: icarus <eurfelux@gmail.com>	2026-05-31 23:10:36 +08:00
fullex	c514dcc049	refactor(shared): move packages/shared to src/shared

packages/shared was never a real pnpm workspace package (no package.json); it was referenced only through the @shared TypeScript path alias. Relocate it under src/ via git mv (143 files, detected as pure renames).

Repoint the @shared alias and include globs to src/shared across electron.vite.config.ts, tsconfig.{json,node,web}.json and vitest.config.ts; update scripts/check-custom-exts.ts, scripts/update-languages.ts, the eslint.config.mjs generated-file globs, the data-classify generator output targets, .github/CODEOWNERS path rules, and CLAUDE.md/docs/source-comment references.

The @shared alias name is unchanged, so all 1403 @shared/* import sites resolve without modification. Verified with typecheck:node, typecheck:web and the full test suite (700 files, 9739 tests passing).	2026-05-28 21:02:49 -07:00
槑囿脑袋	01e7e31a8e	feat(v2): add main-side file processing backend (#13968)

### What this PR does

Before this PR:

File processing on the `v2` branch was still described and wired around split OCR / markdown APIs, legacy feature names, and feature-first provider structure. OCR-like image text extraction and document-to-markdown conversion did not share one task contract, and provider task ids / polling details were harder to keep behind the Main-process boundary.

After this PR:

`v2` file processing follows `v2-refactor-temp/docs/fileProcessing/file-processing-service.md` as the design baseline:

- exposes one Main-side task API: `startTask`, `getTask`, and `cancelTask`
- replaces split file-processing IPC with `file-processing:start-task`, `file-processing:get-task`, and `file-processing:cancel-task`
- renames features and preference keys to `image_to_text` and `document_to_markdown`
- adds `FileProcessingTaskService` as the in-memory source of truth for task ids, task state, progress, cancellation, TTL pruning, remote-poll dedupe, and task change events
- keeps provider task ids, remote context, query context, abort controllers, and in-flight polling inside Main-process task records
- maps completed results to artifacts: inline `text/plain` for `image_to_text`, and persisted markdown file artifacts for `document_to_markdown`
- reorganizes providers into processor-first handlers under `src/main/services/fileProcessing/processors`
- moves Tesseract worker ownership under `processors/tesseract/runtime`
- removes the new file-processing module's old `ocr/` and `markdown/` split directories after migrating their logic
- updates shared schemas, presets, preference generation, migration mappings, and tests for the renamed feature model

The public file-processing contract is now:

```ts
await window.api.fileProcessing.startTask({
  feature: 'image_to_text',
  file,
  processorId: 'tesseract'
})
await window.api.fileProcessing.getTask({ taskId })
await window.api.fileProcessing.cancelTask({ taskId })
```

Architecture overview:

```text
Renderer / upper-layer caller
  |
  | startTask / getTask / cancelTask
  v
FileProcessingOrchestrationService
  |
  | Zod validation + delegation
  v
FileProcessingTaskService
  |
  | taskId, task store, TTL, cancellation,
  | background execution, remote polling, artifacts
  v
processorRegistry[processorId].capabilities[feature]
  |
  +--> image-to-text handlers
  |      -> text/plain artifact
  |
  +--> document-to-markdown handlers
         -> feature.files.data/fileId/file-processing/taskId/output.md
```

Notes:

- The new file-processing API does not keep facades for `file-processing:extract-text`, `file-processing:start-markdown-conversion-task`, or `file-processing:get-markdown-conversion-task-result`.
- `FileProcessingOrchestrationService` is intentionally only the IPC validation and delegation layer.
- Task state is Main-process runtime coordination state, not DataApi or Cache state.
- Renderer task subscriptions, a global UI task center, and full renderer business-flow migration are intentionally out of scope for this PR.
- The legacy standalone OCR path outside the new file-processing module can coexist during the v2 transition, but the new file-processing interface is not polluted by those split-API types.

Fixes #N/A

### Why we need it and why it was done in this way

This PR makes OCR-style image text extraction and document-to-markdown conversion use the same Main-process task model before renderer-side adoption. The unified contract gives upper layers one way to start work, query progress, handle failure, cancel work, and consume completed artifacts without learning provider-specific polling details.

The following tradeoffs were made:

- Fast OCR now also goes through a task API, so callers need start/query behavior instead of a direct `extractText -> text` call.
- Task state remains session-scoped in memory; completed artifacts are persisted, but task snapshots are not restored after app restart.
- Remote-provider cancellation is best-effort: local polling and state transition stop immediately, but third-party provider-side cancellation is not guaranteed.
- Renderer integration is intentionally compile-safe and minimal in this PR; full UX migration should happen in follow-up changes.
- Tesseract keeps a processor-owned runtime service, while other processors stay as handlers/utilities until they need lifecycle-managed resources.

The following alternatives were considered:

- Keeping separate OCR and markdown conversion APIs, which would preserve the current split but continue duplicating task, progress, cancellation, and result semantics.
- Adding a DataApi task table or Cache mirror for file-processing task state, which would create a second source of truth for runtime coordination state.
- Adding renderer push subscriptions in this PR, which would expand the scope beyond the Main-side task contract.
- Introducing a generic process manager for all processors, which is premature while only Tesseract currently owns reusable lifecycle resources.

Links to places where the discussion took place: `v2-refactor-temp/docs/fileProcessing/file-processing-service.md`

### Breaking changes

None for released user-facing behavior.

If this PR introduces breaking changes, please describe the changes and the impact on users.

For the in-progress `v2` file-processing integration, this replaces the split file-processing IPC/preload shape with the unified `startTask/getTask/cancelTask` contract. It also renames file-processing feature and preference keys from the old `text_extraction` / `markdown_conversion` model to `image_to_text` / `document_to_markdown`.

### Special notes for your reviewer

- This PR targets `v2`, not `main`.
- Review this against `v2-refactor-temp/docs/fileProcessing/file-processing-service.md`; that document is the source of truth for the module boundary.
- Main review points: unified task API, artifact model, cancellation semantics, processor-first registry/handlers, hidden provider runtime state, and no DataApi/Cache task storage.
- `document_to_markdown` artifacts are persisted under `application.getPath('feature.files.data')/fileId/file-processing/taskId/output.md`.
- `image_to_text` artifacts are returned inline as plain text artifacts and are not persisted.
- Current local verification status:
  - `pnpm format`: passed
  - `pnpm build:check`: passed
  - Vitest inside `build:check`: `432` test files passed, `7171` tests passed, `72` skipped

### Checklist

This checklist is not enforcing, but it's a reminder of items that could be relevant to every PR. Approvers are expected to review this list.

- [x] PR: The PR description is expressive enough and will help future contributors
- [x] Code: [Write code that humans can understand](https://en.wikiquote.org/wiki/Martin_Fowler#code-for-humans) and [Keep it simple](https://en.wikipedia.org/wiki/KISS_principle)
- [x] Refactor: You have [left the code cleaner than you found it (Boy Scout Rule)](https://learning.oreilly.com/library/view/97-things-every/9780596809515/ch08.html)
- [x] Upgrade: Impact of this change on upgrade flows was considered and addressed if required
- [ ] Documentation: A [user-guide update](https://docs.cherry-ai.com) was considered and is present (link) or not required. Check this only when the PR introduces or changes a user-facing feature or behavior.
- [x] Self-review: I have reviewed my own code (e.g., via [`/gh-pr-review`](/.claude/skills/gh-pr-review/SKILL.md), `gh pr diff`, or GitHub UI) before requesting review from others

### Release note

```release-note
NONE
```

---------

Signed-off-by: eeee0717 <chentao020717Work@outlook.com>
Co-authored-by: fullex <106392080+0xfullex@users.noreply.github.com>	2026-05-08 21:25:03 +08:00
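
The start/query/cancel behavior that commit 01e7e31a8e asks callers to adopt could look roughly like the sketch below. Only the three method names (`startTask`, `getTask`, `cancelTask`) and the `feature` / `processorId` / `taskId` fields come from the commit message; the status values, the `TaskSnapshot` shape, the omission of the `file` argument, the `runToCompletion` helper, and the in-memory `createFakeApi` stand-in are assumptions made so the example runs outside Electron.

```typescript
// Assumed shapes: the commit message names the methods but not their return types.
type TaskStatus = 'pending' | 'processing' | 'completed' | 'failed' | 'cancelled'

interface TaskSnapshot {
  taskId: string
  status: TaskStatus
}

interface TaskApi {
  startTask(input: { feature: string; processorId: string }): Promise<TaskSnapshot>
  getTask(input: { taskId: string }): Promise<TaskSnapshot>
  cancelTask(input: { taskId: string }): Promise<TaskSnapshot>
}

const TERMINAL: ReadonlySet<TaskStatus> = new Set<TaskStatus>(['completed', 'failed', 'cancelled'])

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

// Start a task, poll until it reaches a terminal state, and cancel it if the poll budget runs out.
async function runToCompletion(
  api: TaskApi,
  input: { feature: string; processorId: string },
  options: { intervalMs: number; maxPolls: number }
): Promise<TaskSnapshot> {
  let task = await api.startTask(input)
  for (let poll = 0; poll < options.maxPolls && !TERMINAL.has(task.status); poll++) {
    await sleep(options.intervalMs)
    task = await api.getTask({ taskId: task.taskId })
  }
  return TERMINAL.has(task.status) ? task : api.cancelTask({ taskId: task.taskId })
}

// In-memory stand-in (hypothetical): reports 'completed' after `pollsNeeded` getTask calls.
function createFakeApi(pollsNeeded: number): TaskApi {
  let polls = 0
  let status: TaskStatus = 'pending'
  const snapshot = (): TaskSnapshot => ({ taskId: 'task-1', status })
  return {
    async startTask() {
      return snapshot()
    },
    async getTask() {
      if (status !== 'cancelled') {
        polls += 1
        status = polls >= pollsNeeded ? 'completed' : 'processing'
      }
      return snapshot()
    },
    async cancelTask() {
      status = 'cancelled'
      return snapshot()
    }
  }
}
```

In the real application the `api` argument would presumably be `window.api.fileProcessing`, provided its return values match the assumed shapes. Because the commit describes remote-provider cancellation as best-effort, a `cancelled` result here only says that local polling stopped.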

4 Commits