Commit Graph

4 Commits

Author SHA1 Message Date
槑囿脑袋
1382a8dd7c feat(knowledge): route embeddings and reranking through the AI service (#15796)
### What this PR does

Before this PR:

- Knowledge embeddings and reranking ran through the legacy
embedjs-based
knowledgeV1 stack with their own provider clients, independent of the
app's
  AI service.
- File-processing intake accepted several heterogeneous input shapes,
and
knowledge file items were tracked by FileEntry ids, coupling file
content to
  the file-manager entry/cache.

After this PR:

- Embeddings and reranking are routed through the unified `AiService`
(with
cherryin rerank support) and guarded by strict embedding-dimension
validation
  that rejects stale/mismatched vectors.
- File-processing intake is collapsed to a single path-based model;
knowledge
  file items are stored by base-relative path under the knowledge-base
directory, and v1 uploads are copied into the v2 base dir during
migration so
  migrated items stay reindexable/restorable.
- Legacy `knowledgeV1` is removed; the orchestration services were
renamed to
  `KnowledgeService` / `FileProcessingService`.
- Chat -> knowledge attach is temporarily disconnected (tracked TODO)
while the
  v2 file-manager bridge is rebuilt.

Fixes #N/A (no linked issue)

### Why we need it and why it was done in this way

Routing embeddings/rerank through `AiService` unifies provider handling
and
credentials and removes the parallel embedjs client stack and its v1
coupling.
Storing knowledge files by base-relative path (instead of FileEntry ids)
makes
each knowledge base self-contained and portable.

The following tradeoffs were made:

- A large, coordinated refactor plus a migration step that physically
copies v1
uploads into the v2 base dir, in exchange for removing the parallel
client
  stack and making bases self-contained.
- Base-relative path storage required a fail-fast/dedup strategy for
same-named
  files and a guard for blank legacy filenames.

The following alternatives were considered:

- Keeping the embedjs stack behind an adapter — rejected; perpetuates
the
  parallel client and v1 coupling.
- Keeping FileEntry-id storage — rejected; couples knowledge files to
the
  file-manager cache and blocks portability.

### Breaking changes

- `knowledgeV1` is removed. Legacy v1 knowledge data reaches v2 only
through the
  v2 migrators; there is no v1 fallback.
- The v2 knowledge HTTP API (API gateway) now returns v2-native
per-entry fields
(`embeddingModelId`, `createdAt` on base entries; `chunkId`,
`scoreKind`,
  `rank` on search results). The response envelope (`knowledge_bases`,
  `searched_bases`, `total`) is unchanged. See

`v2-refactor-temp/docs/breaking-changes/2026-06-05-knowledge-api-v2.md`.

### Special notes for your reviewer

- This branch went through several rounds of multi-agent code review.
The most
recent 6 commits address review findings: directory-import path
collisions,
migrated-file source copying + blank `relativePath` guard, addItems
rollback
error preservation, eager `document_to_markdown` output-target
validation, a
`CompletedKnowledgeBase` type guard, and breaking-changes doc
corrections.
- Chat -> knowledge attach is intentionally disconnected for now
(tracked in
  `v2-refactor-temp/docs/knowledge/knowledge-todo.md`).
- Local full `pnpm lint`/`pnpm test` was not run per the project's
review
  conventions; please rely on CI / `pnpm build:check`.

### Checklist

- [x] Branch: This PR targets the correct branch — `main` for active
development, `v1` for v1 maintenance fixes
- [x] PR: The PR description is expressive enough and will help future
contributors
- [x] Code: Write code that humans can understand and Keep it simple
- [x] Refactor: You have left the code cleaner than you found it (Boy
Scout Rule)
- [x] Upgrade: Impact of this change on upgrade flows was considered and
addressed if required
- [ ] Documentation: A user-guide update was considered and is present
(link) or not required.
- [x] Self-review: I have reviewed my own code before requesting review
from others

### Release note

```release-note
NONE
```

---------

Signed-off-by: eeee0717 <chentao020717Work@outlook.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 14:04:29 +08:00
槑囿脑袋
2250ccb52f feat(knowledge): integrate file processing for document ingestion (#15470)
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: fullex <106392080+0xfullex@users.noreply.github.com>
Signed-off-by: eeee0717 <chentao020717Work@outlook.com>
Signed-off-by: icarus <eurfelux@gmail.com>
2026-05-31 23:10:36 +08:00
fullex
c514dcc049 refactor(shared): move packages/shared to src/shared
packages/shared was never a real pnpm workspace package (no package.json); it was referenced only through the @shared TypeScript path alias. Relocate it under src/ via git mv (143 files, detected as pure renames).

Repoint the @shared alias and include globs to src/shared across electron.vite.config.ts, tsconfig.{json,node,web}.json and vitest.config.ts; update scripts/check-custom-exts.ts, scripts/update-languages.ts, the eslint.config.mjs generated-file globs, the data-classify generator output targets, .github/CODEOWNERS path rules, and CLAUDE.md/docs/source-comment references.

The @shared alias name is unchanged, so all 1403 @shared/* import sites resolve without modification. Verified with typecheck:node, typecheck:web and the full test suite (700 files, 9739 tests passing).
2026-05-28 21:02:49 -07:00
槑囿脑袋
01e7e31a8e feat(v2): add main-side file processing backend (#13968)
### What this PR does

Before this PR:

File processing on the `v2` branch was still described and wired around
split OCR / markdown APIs, legacy feature names, and feature-first
provider structure. OCR-like image text extraction and
document-to-markdown conversion did not share one task contract, and
provider task ids / polling details were harder to keep behind the
Main-process boundary.

After this PR:

`v2` file processing follows
`v2-refactor-temp/docs/fileProcessing/file-processing-service.md` as the
design baseline:

- exposes one Main-side task API: `startTask`, `getTask`, and
`cancelTask`
- replaces split file-processing IPC with `file-processing:start-task`,
`file-processing:get-task`, and `file-processing:cancel-task`
- renames features and preference keys to `image_to_text` and
`document_to_markdown`
- adds `FileProcessingTaskService` as the in-memory source of truth for
task ids, task state, progress, cancellation, TTL pruning, remote-poll
dedupe, and task change events
- keeps provider task ids, remote context, query context, abort
controllers, and in-flight polling inside Main-process task records
- maps completed results to artifacts: inline `text/plain` for
`image_to_text`, and persisted markdown file artifacts for
`document_to_markdown`
- reorganizes providers into processor-first handlers under
`src/main/services/fileProcessing/processors`
- moves Tesseract worker ownership under `processors/tesseract/runtime`
- removes the new file-processing module's old `ocr/` and `markdown/`
split directories after migrating their logic
- updates shared schemas, presets, preference generation, migration
mappings, and tests for the renamed feature model

The public file-processing contract is now:

```ts
await window.api.fileProcessing.startTask({
  feature: 'image_to_text',
  file,
  processorId: 'tesseract'
})

await window.api.fileProcessing.getTask({ taskId })
await window.api.fileProcessing.cancelTask({ taskId })
```

Architecture overview:

```text
Renderer / upper-layer caller
        |
        | startTask / getTask / cancelTask
        v
FileProcessingOrchestrationService
        |
        | Zod validation + delegation
        v
FileProcessingTaskService
        |
        | taskId, task store, TTL, cancellation,
        | background execution, remote polling, artifacts
        v
processorRegistry[processorId].capabilities[feature]
        |
        +--> image-to-text handlers
        |       -> text/plain artifact
        |
        +--> document-to-markdown handlers
                -> feature.files.data/fileId/file-processing/taskId/output.md
```

Notes:

- The new file-processing API does not keep facades for
`file-processing:extract-text`,
`file-processing:start-markdown-conversion-task`, or
`file-processing:get-markdown-conversion-task-result`.
- `FileProcessingOrchestrationService` is intentionally only the IPC
validation and delegation layer.
- Task state is Main-process runtime coordination state, not DataApi or
Cache state.
- Renderer task subscriptions, a global UI task center, and full
renderer business-flow migration are intentionally out of scope for this
PR.
- The legacy standalone OCR path outside the new file-processing module
can coexist during the v2 transition, but the new file-processing
interface is not polluted by those split-API types.

Fixes #N/A

### Why we need it and why it was done in this way

This PR makes OCR-style image text extraction and document-to-markdown
conversion use the same Main-process task model before renderer-side
adoption. The unified contract gives upper layers one way to start work,
query progress, handle failure, cancel work, and consume completed
artifacts without learning provider-specific polling details.

The following tradeoffs were made:

- Fast OCR now also goes through a task API, so callers need start/query
behavior instead of a direct `extractText -> text` call.
- Task state remains session-scoped in memory; completed artifacts are
persisted, but task snapshots are not restored after app restart.
- Remote-provider cancellation is best-effort: local polling and state
transition stop immediately, but third-party provider-side cancellation
is not guaranteed.
- Renderer integration is intentionally compile-safe and minimal in this
PR; full UX migration should happen in follow-up changes.
- Tesseract keeps a processor-owned runtime service, while other
processors stay as handlers/utilities until they need lifecycle-managed
resources.

The following alternatives were considered:

- Keeping separate OCR and markdown conversion APIs, which would
preserve the current split but continue duplicating task, progress,
cancellation, and result semantics.
- Adding a DataApi task table or Cache mirror for file-processing task
state, which would create a second source of truth for runtime
coordination state.
- Adding renderer push subscriptions in this PR, which would expand the
scope beyond the Main-side task contract.
- Introducing a generic process manager for all processors, which is
premature while only Tesseract currently owns reusable lifecycle
resources.

Links to places where the discussion took place:
`v2-refactor-temp/docs/fileProcessing/file-processing-service.md`

### Breaking changes

None for released user-facing behavior.

If this PR introduces breaking changes, please describe the changes and
the impact on users.

For the in-progress `v2` file-processing integration, this replaces the
split file-processing IPC/preload shape with the unified
`startTask/getTask/cancelTask` contract. It also renames file-processing
feature and preference keys from the old `text_extraction` /
`markdown_conversion` model to `image_to_text` / `document_to_markdown`.

### Special notes for your reviewer

- This PR targets `v2`, not `main`.
- Review this against
`v2-refactor-temp/docs/fileProcessing/file-processing-service.md`; that
document is the source of truth for the module boundary.
- Main review points: unified task API, artifact model, cancellation
semantics, processor-first registry/handlers, hidden provider runtime
state, and no DataApi/Cache task storage.
- `document_to_markdown` artifacts are persisted under
`application.getPath('feature.files.data')/fileId/file-processing/taskId/output.md`.
- `image_to_text` artifacts are returned inline as plain text artifacts
and are not persisted.
- Current local verification status:
  - `pnpm format`: passed
  - `pnpm build:check`: passed
- Vitest inside `build:check`: `432` test files passed, `7171` tests
passed, `72` skipped

### Checklist

This checklist is not enforcing, but it's a reminder of items that could
be relevant to every PR.
Approvers are expected to review this list.

- [x] PR: The PR description is expressive enough and will help future
contributors
- [x] Code: [Write code that humans can
understand](https://en.wikiquote.org/wiki/Martin_Fowler#code-for-humans)
and [Keep it simple](https://en.wikipedia.org/wiki/KISS_principle)
- [x] Refactor: You have [left the code cleaner than you found it (Boy
Scout
Rule)](https://learning.oreilly.com/library/view/97-things-every/9780596809515/ch08.html)
- [x] Upgrade: Impact of this change on upgrade flows was considered and
addressed if required
- [ ] Documentation: A [user-guide update](https://docs.cherry-ai.com)
was considered and is present (link) or not required. Check this only
when the PR introduces or changes a user-facing feature or behavior.
- [x] Self-review: I have reviewed my own code (e.g., via
[`/gh-pr-review`](/.claude/skills/gh-pr-review/SKILL.md), `gh pr diff`,
or GitHub UI) before requesting review from others

### Release note

```release-note
NONE
```

---------

Signed-off-by: eeee0717 <chentao020717Work@outlook.com>
Co-authored-by: fullex <106392080+0xfullex@users.noreply.github.com>
2026-05-08 21:25:03 +08:00