refactor(knowledge-data): adjust knowledge v2 data and service (#14719)

Co-authored-by: fullex <0xfullex@gmail.com>
Co-authored-by: fullex <106392080+0xfullex@users.noreply.github.com>
槑囿脑袋
2026-05-01 04:24:48 -07:00
committed by GitHub
parent 8d3ce3bfb1
commit 434d4a938f
86 changed files with 14234 additions and 4933 deletions

@@ -0,0 +1,5 @@
---
'@vectorstores/libsql': patch
---
Align libSQL vector store schema and metadata handling with the V2 knowledge migration/runtime flow.

@@ -2,25 +2,38 @@
This document records the current v2 knowledge backend shape in the main process.
It covers the `src/main/services/knowledge` runtime path and the SQLite-backed data services. It does not describe the legacy `src/main/knowledge` service or the old `knowledge-base:*` IPC channels.
## Overview
The current implementation is split into three layers:
The current implementation is split into four responsibility areas:
1. `KnowledgeBaseService` / `KnowledgeItemService`
- Persist SQLite-backed knowledge base and knowledge item data.
- Persist `knowledge_base.status` and `error`; migrated bases with missing embedding models remain as recoverable `failed` bases.
- Persist `knowledge_base.groupId`, `emoji`, and `dimensions`; `dimensions` is nullable only for failed bases whose embedding contract is unknown.
- Validate `type` / `data` consistency.
- Persist `knowledge_item.status` and `error`.
2. `KnowledgeOrchestrationService`
- Exposes the caller-facing IPC workflow.
- Coordinates expand, create, filter, add, delete, and search flows.
3. `KnowledgeRuntimeService`
- Persist `knowledge_item.status`, `phase`, and `error`.
- Reconcile container item status from child item state.
2. Data API knowledge handlers
- Expose database-backed list/get operations and base metadata/config patch.
- Do not perform vector-store mutations.
3. `KnowledgeOrchestrationService`
- Owns caller-facing runtime IPC workflow.
- Creates/deletes bases through data services.
- Collapses delete/reindex item inputs to top-level roots and coordinates runtime cleanup with SQLite deletion.
4. `KnowledgeRuntimeService`
- Executes indexing and retrieval work.
- Owns the in-memory add queue, interruption handling, and vector-store coordination.
- Creates runtime-added items.
- Owns the in-memory runtime queue, interruption handling, preparation, indexing, and vector-store coordination.
```text
caller
-> Data API
-> preload IPC
-> Data API reads / base patch
-> KnowledgeBaseService / KnowledgeItemService
caller
-> preload knowledgeRuntime IPC
-> KnowledgeOrchestrationService
-> KnowledgeBaseService / KnowledgeItemService
-> KnowledgeRuntimeService
@@ -29,70 +42,185 @@ caller
## Caller Contract
The caller-facing model is now unified:
Current Data API knowledge endpoints are read/update-only for database state that has no vector-store side effect:
1. Create item records through Data API.
2. Call runtime IPC once with item ids.
- `GET /knowledge-bases`
- `GET /knowledge-bases/:id`
- `PATCH /knowledge-bases/:id`
- `GET /knowledge-bases/:id/items`
- `GET /knowledge-items/:id`
Caller-facing create/delete/index/search operations go through `KnowledgeOrchestrationService` IPC.
The caller-facing add model is payload-based:
1. Call runtime IPC once with item payloads.
2. Runtime creates the `knowledge_item` rows.
3. Runtime queues either preparation or indexing work.
For leaf items (`file`, `url`, `note`):
```text
caller
-> Data API create item(s)
-> preload IPC add-items(item ids)
-> preload IPC add-items(leaf item payloads)
-> runtime creates leaf items
-> runtime enqueues index-leaf tasks
```
For container items (`directory`, `sitemap`):
```text
caller
-> Data API create owner item
-> preload IPC add-items(owner item ids)
-> orchestration expands owner
-> orchestration persists child items
-> orchestration filters indexable leaf items
-> runtime enqueues leaf items
-> preload IPC add-items(owner item payloads)
-> runtime creates root items
-> runtime enqueues prepare-root tasks
-> prepare-root expands owner inside the queue
-> prepare-root creates child items
-> prepare-root enqueues index-leaf tasks for concrete leaf children
```
The caller no longer needs to invoke separate `expand*` IPC APIs.
Callers should not create item records through Data API and then call runtime IPC with item ids. `add-items` accepts `KnowledgeRuntimeAddItemInput[]` and returns after root items are accepted, not after indexing completes.
Delete and reindex remain id-based because they operate on existing persisted items:
```text
delete-items(baseId, itemIds)
reindex-items(baseId, itemIds)
```
`KnowledgeOrchestrationService` collapses nested selected ids to top-level roots before calling runtime.
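The root-collapsing step can be pictured roughly as follows. This is a minimal sketch, assuming each item exposes an `id` and a nullable `groupId` that points at its parent item (as the schema's self-referencing foreign key suggests). `ItemRef` and `collapseToRoots` are hypothetical names used for illustration, not the service's actual API.

```typescript
// Hypothetical minimal item shape: `groupId` is the parent item id, or null for roots.
interface ItemRef {
  id: string
  groupId: string | null
}

// Keep only the selected ids that have no selected ancestor.
function collapseToRoots(items: ItemRef[], selectedIds: string[]): string[] {
  const parentOf = new Map<string, string | null>()
  for (const item of items) parentOf.set(item.id, item.groupId)
  const selected = new Set(selectedIds)

  const hasSelectedAncestor = (id: string): boolean => {
    const seen = new Set<string>([id]) // guards against accidental cycles
    let parent = parentOf.get(id) ?? null
    while (parent !== null && !seen.has(parent)) {
      if (selected.has(parent)) return true
      seen.add(parent)
      parent = parentOf.get(parent) ?? null
    }
    return false
  }

  return Array.from(selected).filter((id) => !hasSelectedAncestor(id))
}
```

Selecting both a directory and one of its nested files would then yield only the directory id, so the descendant is handled once through its root.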
Current product scope does not allow users to add nested `directory` / `sitemap` items under another item. Nested directory rows may be created internally by directory expansion to preserve hierarchy.
## IPC Surface
`KnowledgeOrchestrationService` currently owns the public IPC entrypoints:
`KnowledgeOrchestrationService` currently owns these public IPC entrypoints:
- `knowledge-runtime:create-base`
- `knowledge-runtime:restore-base`
- `knowledge-runtime:delete-base`
- `knowledge-runtime:add-items`
- `knowledge-runtime:delete-items`
- `knowledge-runtime:reindex-items`
- `knowledge-runtime:search`
- `knowledge-runtime:list-item-chunks`
- `knowledge-runtime:delete-item-chunk`
These IPC handlers are workflow-oriented. They may call data services and runtime services internally before returning.
These IPC handlers are workflow-oriented. They validate payloads, call data services, and call runtime services internally.
## Runtime Behavior
`KnowledgeRuntimeService` keeps a single in-memory add queue with:
`KnowledgeRuntimeService` keeps a single in-memory runtime queue with:
- one shared queue across all knowledge bases
- fixed concurrency of `5`
- item-level deduplication for pending/running add work
- interruption support for delete and shutdown
- task kinds: `prepare-root` and `index-leaf`
- item-level deduplication for pending/running runtime work
- interruption support for delete, reindex, and shutdown
- a per-base vector write lock so concurrent tasks do not write the same base store at the same time
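One way to picture the per-base vector write lock is a promise chain keyed by base id. The sketch below is an assumption about the general technique, not the runtime's actual implementation; `KeyedLock` and `run` are hypothetical names.

```typescript
// Hypothetical helper: serializes async work per key (here, per knowledge base id).
class KeyedLock {
  private tails = new Map<string, Promise<void>>()

  async run<T>(key: string, task: () => Promise<T>): Promise<T> {
    // Tails never reject, so chaining with `then` alone is enough.
    const previous = this.tails.get(key) ?? Promise.resolve()
    const result = previous.then(task)
    // The tail settles after this task regardless of its outcome.
    const tail = result.then(
      () => undefined,
      () => undefined
    )
    this.tails.set(key, tail)
    try {
      return await result
    } finally {
      // Drop the entry once no newer task has been queued behind this one.
      if (this.tails.get(key) === tail) this.tails.delete(key)
    }
  }
}
```

Tasks for different base ids still run concurrently under the shared queue; only writes that target the same key are serialized.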
Current status writes are:
- `pending` before enqueue
- `completed` after successful vector write
- `failed` on error or shutdown interruption
- `processing, phase = preparing` for active `directory` / `sitemap` preparation
- `processing, phase = reading` while a leaf item reads source documents
- `processing, phase = embedding` while a leaf item embeds and writes vectors
- `processing, phase = null` after a container's own preparation finishes while descendant leaf items are still processing
- `completed, phase = null` after successful leaf indexing or when a container has no active children
- `failed, phase = null` on error, cleanup failure, or shutdown interruption
Intermediate states such as `file_processing`, `read`, and `embed` remain reserved in schema/types, but are not written by the current runtime.
`status` is the aggregate business state. `phase` is runtime progress. Container status is reconciled from its own phase and child statuses.
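The reconciliation rule for a non-failed container can be sketched as below. This is only an illustration of the status list above: it assumes that an "active" child means a child whose status is `processing`, and it does not model the container's own failure path. `reconcileContainerStatus` is a hypothetical name.

```typescript
type ItemStatus = 'idle' | 'processing' | 'completed' | 'failed'
type ContainerPhase = 'preparing' | null

// Hypothetical reconciliation: derive a container's status from its own phase and its children.
function reconcileContainerStatus(ownPhase: ContainerPhase, childStatuses: ItemStatus[]): ItemStatus {
  // The container is still expanding itself.
  if (ownPhase === 'preparing') return 'processing'
  // Preparation finished, but at least one child is still being indexed (assumed meaning of "active").
  if (childStatuses.some((status) => status === 'processing')) return 'processing'
  // No active children remain.
  return 'completed'
}
```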
Current persisted `knowledge_base` columns include:
- `groupId`: nullable group assignment; `null` means ungrouped.
- `emoji`: user-visible base icon, filled by service/migration defaults.
- `dimensions`: positive embedding vector width for completed bases; nullable for failed migrated bases with unknown dimensions.
- `status`: `completed` for runnable bases, `failed` for recoverable base-level migration failures.
- `error`: nullable `KnowledgeBaseErrorCode`; currently `missing_embedding_model` for recoverable failed bases.
## Delete And Reindex
`delete-items` currently runs:
1. Orchestration loads requested items and collapses descendants to top-level roots.
2. Runtime interrupts root tasks and waits for running root work to settle.
3. Runtime fresh-queries descendants.
4. Runtime interrupts root + descendant tasks and waits again.
5. Runtime deletes leaf vectors.
6. Orchestration deletes top-level root SQLite rows; database cascade removes descendants.
`reindex-items` currently runs:
1. Orchestration loads requested items and collapses descendants to top-level roots.
2. Runtime interrupts root + descendants using the same two-stage interrupt flow.
3. Runtime deletes existing leaf vectors.
4. Container roots delete old leaf descendants and enqueue fresh `prepare-root`.
5. Leaf roots write `processing` and enqueue fresh `index-leaf`.
If destructive cleanup fails after interrupt, runtime writes the cleanup error to the affected item state before rethrowing so callers can surface the failure.
Base deletion follows the same ordering:
```text
delete-base(baseId)
-> runtime interrupts base work and returns interrupted item ids
-> data service deletes the SQLite base and cascaded items
-> runtime deletes base vector artifacts
```
If SQLite deletion fails after runtime work was interrupted, orchestration marks the interrupted items failed and rethrows the SQLite error.
If post-SQLite artifact cleanup fails, orchestration logs the cleanup error and rejects the delete call with a partial-deletion error. At that point the durable SQLite rows are already gone, but callers should not report the operation as fully successful because vector artifacts may remain on disk.
Base restore creates a new knowledge base from an existing base when the caller needs a fresh embedding/index setup, such as a migrated base whose legacy embedding model is unavailable or a completed base whose embedding model was changed by the user:
```text
restore-base(sourceBaseId, embeddingModelId, dimensions)
-> data service loads the source base
-> data service loads source root items
-> orchestration creates a new base with source config plus the new embedding model/dimensions
-> orchestration adds each root item to the new base
```
`dimensions` must already be resolved for the selected `embeddingModelId` before calling `restore-base`. Automatic flows should fill it from AI Core dimension detection; manual flows accept the user-provided value and rely on the caller to confirm it matches the model. The restore backend only validates that `dimensions` is a positive integer and uses it to create the new vector store; it does not perform a second model probe. If the value does not match the model's actual embedding output size, the mismatch is expected to surface during the subsequent reindex/write-vector phase.
The source base is preserved. For completed source bases, `restore-base` is only valid when `embeddingModelId` or `dimensions` changes; a completed base with unchanged embedding config would be a no-op clone and is rejected. If one or more root items cannot be accepted into the restored base, orchestration aggregates those synchronous acceptance failures, best-effort deletes the new base, and rethrows the aggregate error. Later background indexing failures are recorded on item status instead of this synchronous restore error.
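The synchronous checks described above can be summarized in a small predicate. This is a hedged sketch: `SourceBase`, `RestoreRequest`, and `checkRestoreRequest` are hypothetical names, and the error strings are placeholders rather than the service's real messages.

```typescript
// Hypothetical shapes, reduced to the fields the restore rules mention.
interface SourceBase {
  status: 'completed' | 'failed'
  embeddingModelId: string | null
  dimensions: number | null
}

interface RestoreRequest {
  sourceBaseId: string
  embeddingModelId: string
  dimensions: number
}

// Returns a rejection reason, or null when the request may proceed.
function checkRestoreRequest(source: SourceBase, request: RestoreRequest): string | null {
  // The backend only validates the shape of `dimensions`; it does not probe the model.
  if (!Number.isInteger(request.dimensions) || request.dimensions <= 0) {
    return 'dimensions must be a positive integer'
  }
  const unchanged =
    source.embeddingModelId === request.embeddingModelId && source.dimensions === request.dimensions
  // A completed base with unchanged embedding config would be a no-op clone.
  if (source.status === 'completed' && unchanged) {
    return 'embedding config is unchanged'
  }
  return null
}
```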
### Migrated Bases With Missing Embedding Models
During v1-to-v2 migration, a legacy knowledge base may reference an embedding model that does not exist in the migrated `user_model` table. For example, a legacy model id such as `ollama::dengcao/Qwen3-Embedding-0.6B:Q8_0` can be present in Redux knowledge data while no matching V2 user model row exists.
In that case, migration must preserve the user-created knowledge data instead of dropping the base:
- `knowledge_base.embeddingModelId = null`
- `knowledge_base.dimensions = valid legacy dimensions, or null when unknown`
- `knowledge_base.status = failed`
- `knowledge_base.error = missing_embedding_model`
- `knowledge_item` rows under that base continue to migrate
- legacy vectors for that base are skipped because there is no confirmed embedding model contract
`knowledge_base.error` is a shared `KnowledgeBaseErrorCode` value, not a free-form string. The current recoverable base-level error code is `missing_embedding_model`.
This means the migrated base is visible as recoverable data, but it is not usable for search/index operations until the user chooses a valid embedding model.
The failed-base recovery path is `knowledge-runtime:restore-base`, not an in-place rebuild:
```text
user selects a valid embedding model for the failed base
-> restore-base(sourceBaseId, embeddingModelId, dimensions)
-> orchestration creates a new completed base using the source base config
-> orchestration copies only source root items into the new base
-> add-items triggers the normal runtime indexing flow for the new base
```
Only root items (`groupId = null`) are copied. Expanded directory/sitemap children are intentionally not copied because they belong to the old base hierarchy and can be regenerated by the normal container preparation flow. The old failed base is left intact; product/UI code can decide whether to keep it for confirmation or delete it after a successful restore.
## Search
Search is executed by `KnowledgeRuntimeService.search(base, query)`:
1. embed query
1. resolve and run the embedding model for the query
2. query the libsql vector store
3. map nodes into `KnowledgeSearchResult`
4. rerank only when `base.rerankModelId` is configured
4. call rerank only when `base.rerankModelId` is configured
Current `KnowledgeSearchResult` includes:
@@ -110,27 +238,10 @@ The current v2 implementation intentionally does **not** create a libSQL vector
Similarity search currently queries the base table directly and sorts by `vector_distance_cos(...)`.
This means retrieval cost scales roughly linearly with the number of vector rows in a single knowledge base.
That tradeoff is currently accepted because it keeps the runtime path simpler and performs well enough for the expected near-term corpus sizes.
A local benchmark run on April 15, 2026 with 1536-dimension embeddings and `topK=10` measured approximately:
- `20k` rows: `~78ms` warm vector search
- `50k` rows: `~195ms` warm vector search
That tradeoff is currently accepted because it keeps the runtime path simpler for expected near-term corpus sizes.
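To make the linear-cost point concrete, the following sketch mimics an unindexed scan in plain TypeScript: every row is scored against the query, then the results are sorted. It illustrates the cost shape only; the real search runs inside libSQL. `cosineDistance` and `topK` are hypothetical names, and the distance is assumed to follow the usual `1 - cosine similarity` definition.

```typescript
// Cosine distance = 1 - cosine similarity; smaller means more similar.
function cosineDistance(a: number[], b: number[]): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

interface VectorRow {
  id: string
  vector: number[]
}

// Brute-force search: one distance computation per stored row, then a sort.
function topK(query: number[], rows: VectorRow[], k: number): { id: string; distance: number }[] {
  return rows
    .map((row) => ({ id: row.id, distance: cosineDistance(query, row.vector) }))
    .sort((x, y) => x.distance - y.distance)
    .slice(0, k)
}
```

Doubling the number of rows doubles the number of distance computations, which is why the guidance below ties the no-index design to corpus size.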
Current guidance:
1. Treat the no-index design as the default for now, not as an unlimited scaling guarantee.
2. Re-evaluate indexed search if real single-base corpora grow toward `100k+` rows or retrieval latency budgets can no longer tolerate a few hundred milliseconds per query.
3. If future product requirements change, adding a vector index remains a valid follow-up optimization rather than a blocked prerequisite for the current design.
## Deletion
Deletion still requires two concerns to be handled:
1. Runtime deletion
- interrupt queue work
- delete vectors
2. Data deletion
- remove SQLite rows through Data API
The runtime layer does not delete SQLite business data by itself.

@@ -0,0 +1,99 @@
-- HAND-EDITED MIGRATION: merged generated steps for the v2 knowledge final schema.
-- Keep the additive columns before the table rebuild so INSERT...SELECT only reads existing columns.
ALTER TABLE `knowledge_base` ADD `group_id` text;--> statement-breakpoint
ALTER TABLE `knowledge_base` ADD `emoji` text;--> statement-breakpoint
ALTER TABLE `knowledge_base` ADD `status` text;--> statement-breakpoint
ALTER TABLE `knowledge_base` ADD `error` text;--> statement-breakpoint
ALTER TABLE `knowledge_item` ADD `phase` text;--> statement-breakpoint
PRAGMA foreign_keys=OFF;--> statement-breakpoint
CREATE TABLE `__new_knowledge_base` (
`id` text PRIMARY KEY NOT NULL,
`name` text NOT NULL,
`group_id` text,
`emoji` text NOT NULL,
`dimensions` integer,
`embedding_model_id` text,
`status` text NOT NULL,
`error` text,
`rerank_model_id` text,
`file_processor_id` text,
`chunk_size` integer NOT NULL,
`chunk_overlap` integer NOT NULL,
`threshold` real,
`document_count` integer,
`search_mode` text NOT NULL,
`hybrid_alpha` real,
`created_at` integer NOT NULL,
`updated_at` integer NOT NULL,
FOREIGN KEY (`group_id`) REFERENCES `group`(`id`) ON UPDATE no action ON DELETE set null,
FOREIGN KEY (`embedding_model_id`) REFERENCES `user_model`(`id`) ON UPDATE no action ON DELETE no action,
FOREIGN KEY (`rerank_model_id`) REFERENCES `user_model`(`id`) ON UPDATE no action ON DELETE set null,
CONSTRAINT "knowledge_base_search_mode_check" CHECK("__new_knowledge_base"."search_mode" IN ('default', 'bm25', 'hybrid')),
CONSTRAINT "knowledge_base_status_check" CHECK("__new_knowledge_base"."status" IN ('completed', 'failed')),
CONSTRAINT "knowledge_base_status_error_check" CHECK(
(
"__new_knowledge_base"."status" = 'completed'
AND "__new_knowledge_base"."embedding_model_id" IS NOT NULL
AND "__new_knowledge_base"."dimensions" IS NOT NULL
AND "__new_knowledge_base"."dimensions" > 0
AND "__new_knowledge_base"."error" IS NULL
)
OR (
"__new_knowledge_base"."status" = 'failed'
AND "__new_knowledge_base"."error" IS NOT NULL
AND length(trim("__new_knowledge_base"."error")) > 0
)
)
);
--> statement-breakpoint
INSERT INTO `__new_knowledge_base`("id", "name", "group_id", "emoji", "dimensions", "embedding_model_id", "status", "error", "rerank_model_id", "file_processor_id", "chunk_size", "chunk_overlap", "threshold", "document_count", "search_mode", "hybrid_alpha", "created_at", "updated_at") SELECT "id", "name", "group_id", "emoji", "dimensions", "embedding_model_id", "status", "error", "rerank_model_id", "file_processor_id", "chunk_size", "chunk_overlap", "threshold", "document_count", "search_mode", "hybrid_alpha", "created_at", "updated_at" FROM `knowledge_base`;--> statement-breakpoint
DROP TABLE `knowledge_base`;--> statement-breakpoint
ALTER TABLE `__new_knowledge_base` RENAME TO `knowledge_base`;--> statement-breakpoint
PRAGMA foreign_keys=ON;--> statement-breakpoint
CREATE TABLE `__new_knowledge_item` (
`id` text PRIMARY KEY NOT NULL,
`base_id` text NOT NULL,
`group_id` text,
`type` text NOT NULL,
`data` text NOT NULL,
`status` text NOT NULL,
`phase` text,
`error` text,
`created_at` integer NOT NULL,
`updated_at` integer NOT NULL,
FOREIGN KEY (`base_id`) REFERENCES `knowledge_base`(`id`) ON UPDATE no action ON DELETE cascade,
FOREIGN KEY (`base_id`,`group_id`) REFERENCES `knowledge_item`(`base_id`,`id`) ON UPDATE no action ON DELETE cascade,
CONSTRAINT "knowledge_item_type_check" CHECK("__new_knowledge_item"."type" IN ('file', 'url', 'note', 'sitemap', 'directory')),
CONSTRAINT "knowledge_item_status_check" CHECK("__new_knowledge_item"."status" IN ('idle', 'processing', 'completed', 'failed')),
CONSTRAINT "knowledge_item_phase_check" CHECK(
"__new_knowledge_item"."phase" IS NULL
OR ("__new_knowledge_item"."type" IN ('file', 'url', 'note') AND "__new_knowledge_item"."phase" IN ('reading', 'embedding'))
OR ("__new_knowledge_item"."type" IN ('directory', 'sitemap') AND "__new_knowledge_item"."phase" = 'preparing')
),
CONSTRAINT "knowledge_item_status_phase_error_check" CHECK(
(
"__new_knowledge_item"."status" IN ('idle', 'completed')
AND "__new_knowledge_item"."phase" IS NULL
AND "__new_knowledge_item"."error" IS NULL
)
OR (
-- Containers may stay processing after their own prepare phase ends
-- while descendant leaf items continue reading/embedding.
"__new_knowledge_item"."status" = 'processing'
AND "__new_knowledge_item"."error" IS NULL
)
OR (
"__new_knowledge_item"."status" = 'failed'
AND "__new_knowledge_item"."phase" IS NULL
AND "__new_knowledge_item"."error" IS NOT NULL
AND length(trim("__new_knowledge_item"."error")) > 0
)
)
);
--> statement-breakpoint
INSERT INTO `__new_knowledge_item`("id", "base_id", "group_id", "type", "data", "status", "phase", "error", "created_at", "updated_at") SELECT "id", "base_id", "group_id", "type", "data", "status", "phase", "error", "created_at", "updated_at" FROM `knowledge_item`;--> statement-breakpoint
DROP TABLE `knowledge_item`;--> statement-breakpoint
ALTER TABLE `__new_knowledge_item` RENAME TO `knowledge_item`;--> statement-breakpoint
CREATE INDEX `knowledge_item_base_type_created_idx` ON `knowledge_item` (`base_id`,`type`,`created_at`);--> statement-breakpoint
CREATE INDEX `knowledge_item_base_group_created_idx` ON `knowledge_item` (`base_id`,`group_id`,`created_at`);--> statement-breakpoint
CREATE UNIQUE INDEX `knowledge_item_baseId_id_unique` ON `knowledge_item` (`base_id`,`id`);
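
The `knowledge_item` CHECK constraints above can be read as a single predicate over `type`, `status`, `phase`, and `error`. The TypeScript sketch below mirrors them for illustration; `ItemState` and `isValidItemState` are hypothetical names, and JavaScript's `trim()` strips more whitespace characters than SQLite's default `trim()`, so the two are not byte-for-byte equivalent.

```typescript
type ItemType = 'file' | 'url' | 'note' | 'sitemap' | 'directory'
type ItemStatus = 'idle' | 'processing' | 'completed' | 'failed'
type ItemPhase = 'reading' | 'embedding' | 'preparing' | null

interface ItemState {
  type: ItemType
  status: ItemStatus
  phase: ItemPhase
  error: string | null
}

function isValidItemState(state: ItemState): boolean {
  // knowledge_item_phase_check: leaf phases and container phases do not mix.
  const isLeaf = state.type === 'file' || state.type === 'url' || state.type === 'note'
  const phaseMatchesType =
    state.phase === null ||
    (isLeaf ? state.phase === 'reading' || state.phase === 'embedding' : state.phase === 'preparing')
  if (!phaseMatchesType) return false

  // knowledge_item_status_phase_error_check
  if (state.status === 'idle' || state.status === 'completed') {
    return state.phase === null && state.error === null
  }
  if (state.status === 'processing') {
    // Phase may be null here, e.g. a container waiting on descendant leaf items.
    return state.error === null
  }
  // status === 'failed'
  return state.phase === null && state.error !== null && state.error.trim().length > 0
}
```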

File diff suppressed because it is too large.

@@ -126,6 +126,13 @@
"when": 1777474700000,
"tag": "0017_giant_vermin",
"breakpoints": true
},
{
"idx": 18,
"version": "6",
"when": 1777555974425,
"tag": "0018_tearful_jamie_braddock",
"breakpoints": true
}
],
"version": "7"

@@ -193,10 +193,14 @@ export enum IpcChannel {
KnowledgeBase_Search = 'knowledge-base:search',
KnowledgeBase_Rerank = 'knowledge-base:rerank',
KnowledgeRuntime_CreateBase = 'knowledge-runtime:create-base',
KnowledgeRuntime_RestoreBase = 'knowledge-runtime:restore-base',
KnowledgeRuntime_DeleteBase = 'knowledge-runtime:delete-base',
KnowledgeRuntime_AddItems = 'knowledge-runtime:add-items',
KnowledgeRuntime_DeleteItems = 'knowledge-runtime:delete-items',
KnowledgeRuntime_ReindexItems = 'knowledge-runtime:reindex-items',
KnowledgeRuntime_Search = 'knowledge-runtime:search',
KnowledgeRuntime_ListItemChunks = 'knowledge-runtime:list-item-chunks',
KnowledgeRuntime_DeleteItemChunk = 'knowledge-runtime:delete-item-chunk',
//file
File_Open = 'file:open',

@@ -1,23 +1,67 @@
import { describe, expect, it } from 'vitest'
import { CreateKnowledgeBaseSchema, UpdateKnowledgeBaseSchema } from '../data/api/schemas/knowledges'
import { KnowledgeBaseSchema, KnowledgeItemSchema } from '../data/types/knowledge'
import {
ListKnowledgeBasesQuerySchema,
ListKnowledgeItemsQuerySchema,
UpdateKnowledgeBaseSchema
} from '../data/api/schemas/knowledges'
import {
CreateKnowledgeBaseSchema,
CreateKnowledgeItemSchema,
DEFAULT_KNOWLEDGE_BASE_CHUNK_OVERLAP,
DEFAULT_KNOWLEDGE_BASE_CHUNK_SIZE,
KNOWLEDGE_BASE_ERROR_MISSING_EMBEDDING_MODEL,
KnowledgeBaseSchema,
KnowledgeItemSchema,
KnowledgeRuntimeAddItemInputSchema,
RestoreKnowledgeBaseSchema
} from '../data/types/knowledge'
describe('Knowledge base schemas', () => {
it('accepts valid numeric tuning fields', () => {
const result = CreateKnowledgeBaseSchema.safeParse({
name: 'KB',
dimensions: 1024,
embeddingModelId: 'embed-model',
groupId: ' group-1 ',
emoji: '📚',
chunkSize: 800,
chunkOverlap: 120,
threshold: 0.5,
documentCount: 5,
searchMode: 'hybrid',
hybridAlpha: 0.7
})
expect(result.success).toBe(true)
if (result.success) {
expect(result.data.groupId).toBe('group-1')
}
})
it('rejects blank create group ids', () => {
expect(
CreateKnowledgeBaseSchema.safeParse({
name: 'KB',
dimensions: 1024,
embeddingModelId: 'embed-model',
chunkSize: 800,
chunkOverlap: 120,
threshold: 0.5,
documentCount: 5,
searchMode: 'hybrid',
hybridAlpha: 0.7
groupId: ' '
}).success
).toBe(true)
).toBe(false)
})
it('does not apply product defaults in create schema', () => {
const result = CreateKnowledgeBaseSchema.safeParse({
name: 'KB',
dimensions: 1024,
embeddingModelId: 'embed-model'
})
expect(result.success).toBe(true)
if (result.success) {
expect(result.data).not.toHaveProperty('emoji')
expect(result.data).not.toHaveProperty('searchMode')
}
})
it('rejects invalid numeric tuning fields in create schema', () => {
@@ -35,9 +79,138 @@ describe('Knowledge base schemas', () => {
expect(result.success).toBe(false)
})
it('rejects invalid create chunk relationships', () => {
expect(
CreateKnowledgeBaseSchema.safeParse({
name: 'KB',
dimensions: 1024,
embeddingModelId: 'embed-model',
chunkOverlap: 120
}).success
).toBe(false)
expect(
CreateKnowledgeBaseSchema.safeParse({
name: 'KB',
dimensions: 1024,
embeddingModelId: 'embed-model',
chunkSize: 120,
chunkOverlap: 120
}).success
).toBe(false)
})
it('rejects extra fields in create schema', () => {
const result = CreateKnowledgeBaseSchema.safeParse({
name: 'KB',
dimensions: 1024,
embeddingModelId: 'embed-model',
createdAt: '2026-04-10T00:00:00.000Z'
})
expect(result.success).toBe(false)
})
it('validates restore-base DTOs', () => {
const result = RestoreKnowledgeBaseSchema.safeParse({
sourceBaseId: 'base-1',
dimensions: 3072,
embeddingModelId: 'openai::text-embedding-3-large'
})
expect(result.success).toBe(true)
})
it('rejects extra fields in restore-base DTOs', () => {
expect(
RestoreKnowledgeBaseSchema.safeParse({
sourceBaseId: 'base-1',
dimensions: 3072,
embeddingModelId: 'openai::text-embedding-3-large',
chunkSize: 800
}).success
).toBe(false)
})
it('validates create-item DTO item shapes', () => {
expect(
CreateKnowledgeItemSchema.safeParse({
type: 'note',
data: { source: 'hello', content: 'hello' }
}).success
).toBe(true)
})
it('uses create-item DTO shapes for runtime add-item inputs', () => {
expect(
KnowledgeRuntimeAddItemInputSchema.safeParse({
type: 'url',
data: { source: 'https://example.com/docs', url: 'https://example.com/docs' },
groupId: null
}).success
).toBe(true)
expect(
KnowledgeRuntimeAddItemInputSchema.safeParse({
type: 'file',
data: {
source: '/docs/guide.md',
file: {
id: 'file-1',
name: 'guide.md',
origin_name: 'guide.md',
path: '/docs/guide.md',
size: 12,
ext: '.md',
type: 'text',
created_at: '2026-04-10T00:00:00.000Z',
count: 1
}
}
}).success
).toBe(true)
expect(
KnowledgeRuntimeAddItemInputSchema.safeParse({
type: 'url',
url: 'https://example.com/docs',
name: 'Docs'
}).success
).toBe(false)
expect(
KnowledgeRuntimeAddItemInputSchema.safeParse({
type: 'note',
data: { source: 'hello', content: 'hello' }
}).success
).toBe(true)
expect(
KnowledgeRuntimeAddItemInputSchema.safeParse({
type: 'note',
content: 'hello',
source: 'note-1'
}).success
).toBe(false)
})
it('rejects extra fields in create-item and list query schemas', () => {
expect(
CreateKnowledgeItemSchema.safeParse({
type: 'note',
data: { source: 'hello', content: 'hello' },
extra: true
}).success
).toBe(false)
expect(ListKnowledgeBasesQuerySchema.safeParse({ page: 1, limit: 20, extra: true }).success).toBe(false)
expect(ListKnowledgeItemsQuerySchema.safeParse({ page: 1, limit: 20, type: 'note', extra: true }).success).toBe(
false
)
})
it('rejects invalid numeric tuning fields in update schema', () => {
const result = UpdateKnowledgeBaseSchema.safeParse({
embeddingModelId: 'openai::text-embedding-3-small',
chunkSize: -10,
chunkOverlap: -1,
threshold: 1.1,
@@ -54,6 +227,10 @@ describe('Knowledge base schemas', () => {
name: 'KB',
dimensions: 1024,
embeddingModelId: 'embed-model',
groupId: null,
emoji: '📁',
status: 'completed',
error: null,
chunkSize: 0,
chunkOverlap: -1,
threshold: 2,
@@ -66,14 +243,149 @@ describe('Knowledge base schemas', () => {
expect(result.success).toBe(false)
})
it('accepts nullable groupId and requires persisted defaults in entity schema', () => {
const result = KnowledgeBaseSchema.safeParse({
id: 'kb-1',
name: 'KB',
dimensions: 1024,
embeddingModelId: 'embed-model',
groupId: null,
emoji: '📁',
status: 'completed',
error: null,
chunkSize: DEFAULT_KNOWLEDGE_BASE_CHUNK_SIZE,
chunkOverlap: DEFAULT_KNOWLEDGE_BASE_CHUNK_OVERLAP,
searchMode: 'hybrid',
createdAt: '2026-04-10T00:00:00.000Z',
updatedAt: '2026-04-10T00:00:00.000Z'
})
expect(result.success).toBe(true)
if (result.success) {
expect(result.data.emoji).toBe('📁')
expect(result.data.searchMode).toBe('hybrid')
}
})
it('requires completed bases to have positive dimensions and allows failed bases with unknown dimensions', () => {
const failedBase = {
id: 'kb-1',
name: 'KB',
embeddingModelId: null,
groupId: null,
emoji: '📁',
status: 'failed',
error: KNOWLEDGE_BASE_ERROR_MISSING_EMBEDDING_MODEL,
chunkSize: DEFAULT_KNOWLEDGE_BASE_CHUNK_SIZE,
chunkOverlap: DEFAULT_KNOWLEDGE_BASE_CHUNK_OVERLAP,
searchMode: 'hybrid',
createdAt: '2026-04-10T00:00:00.000Z',
updatedAt: '2026-04-10T00:00:00.000Z'
}
const completedBase = {
...failedBase,
embeddingModelId: 'embed-model',
status: 'completed',
error: null
}
expect(KnowledgeBaseSchema.safeParse({ ...completedBase, dimensions: null }).success).toBe(false)
expect(KnowledgeBaseSchema.safeParse({ ...completedBase, dimensions: 0 }).success).toBe(false)
expect(KnowledgeBaseSchema.safeParse({ ...failedBase, dimensions: null }).success).toBe(true)
expect(KnowledgeBaseSchema.safeParse({ ...failedBase, dimensions: 0 }).success).toBe(false)
expect(KnowledgeBaseSchema.safeParse({ ...failedBase, dimensions: 768 }).success).toBe(true)
})
it('requires persisted config to be present in entity schema', () => {
expect(
KnowledgeBaseSchema.safeParse({
id: 'kb-1',
name: 'KB',
dimensions: 1024,
embeddingModelId: 'embed-model',
status: 'completed',
error: null,
createdAt: '2026-04-10T00:00:00.000Z',
updatedAt: '2026-04-10T00:00:00.000Z'
}).success
).toBe(false)
expect(
KnowledgeBaseSchema.safeParse({
id: 'kb-1',
name: 'KB',
dimensions: 1024,
embeddingModelId: 'embed-model',
emoji: '📁',
status: 'completed',
error: null,
chunkSize: DEFAULT_KNOWLEDGE_BASE_CHUNK_SIZE,
chunkOverlap: DEFAULT_KNOWLEDGE_BASE_CHUNK_OVERLAP,
searchMode: 'hybrid',
createdAt: '2026-04-10T00:00:00.000Z',
updatedAt: '2026-04-10T00:00:00.000Z'
}).success
).toBe(false)
})
it('rejects invalid knowledge base emoji values', () => {
expect(
CreateKnowledgeBaseSchema.safeParse({
name: 'KB',
dimensions: 1024,
embeddingModelId: 'embed-model',
emoji: 'books'
}).success
).toBe(false)
expect(
UpdateKnowledgeBaseSchema.safeParse({
emoji: 'books'
}).success
).toBe(false)
expect(
CreateKnowledgeBaseSchema.safeParse({
name: 'KB',
dimensions: 1024,
embeddingModelId: 'embed-model',
emoji: ' 📚 '
}).success
).toBe(false)
expect(
UpdateKnowledgeBaseSchema.safeParse({
emoji: ' '
}).success
).toBe(false)
expect(
KnowledgeBaseSchema.safeParse({
id: 'kb-1',
name: 'KB',
dimensions: 1024,
embeddingModelId: 'embed-model',
emoji: 'books',
status: 'completed',
error: null,
chunkSize: DEFAULT_KNOWLEDGE_BASE_CHUNK_SIZE,
chunkOverlap: DEFAULT_KNOWLEDGE_BASE_CHUNK_OVERLAP,
createdAt: '2026-04-10T00:00:00.000Z',
updatedAt: '2026-04-10T00:00:00.000Z'
}).success
).toBe(false)
})
it('requires knowledge items to carry an explicit nullable error field', () => {
expect(
KnowledgeItemSchema.safeParse({
id: 'item-1',
baseId: 'kb-1',
groupId: null,
type: 'note',
data: { content: 'hello' },
data: { source: 'hello', content: 'hello' },
status: 'idle',
phase: null,
error: null,
createdAt: '2026-04-10T00:00:00.000Z',
updatedAt: '2026-04-10T00:00:00.000Z'
@@ -84,22 +396,140 @@ describe('Knowledge base schemas', () => {
KnowledgeItemSchema.safeParse({
id: 'item-1',
baseId: 'kb-1',
groupId: null,
type: 'note',
data: { content: 'hello' },
data: { source: 'hello', content: 'hello' },
status: 'idle',
phase: null,
createdAt: '2026-04-10T00:00:00.000Z',
updatedAt: '2026-04-10T00:00:00.000Z'
}).success
).toBe(false)
})
it('separates knowledge item status from runtime phase', () => {
expect(
KnowledgeItemSchema.safeParse({
id: 'item-1',
baseId: 'kb-1',
groupId: null,
type: 'note',
data: { source: 'hello', content: 'hello' },
status: 'processing',
phase: 'reading',
error: null,
createdAt: '2026-04-10T00:00:00.000Z',
updatedAt: '2026-04-10T00:00:00.000Z'
}).success
).toBe(true)
expect(
KnowledgeItemSchema.safeParse({
id: 'item-1',
baseId: 'kb-1',
groupId: null,
type: 'note',
data: { source: 'hello', content: 'hello' },
status: 'read',
phase: null,
error: null,
createdAt: '2026-04-10T00:00:00.000Z',
updatedAt: '2026-04-10T00:00:00.000Z'
}).success
).toBe(false)
})
it('rejects invalid knowledge item status phase error combinations', () => {
const validItem = {
id: 'item-1',
baseId: 'kb-1',
groupId: null,
type: 'note' as const,
data: { source: 'hello', content: 'hello' },
createdAt: '2026-04-10T00:00:00.000Z',
updatedAt: '2026-04-10T00:00:00.000Z'
}
expect(KnowledgeItemSchema.safeParse({ ...validItem, status: 'idle', phase: null, error: null }).success).toBe(true)
expect(KnowledgeItemSchema.safeParse({ ...validItem, status: 'completed', phase: null, error: null }).success).toBe(
true
)
expect(
KnowledgeItemSchema.safeParse({ ...validItem, status: 'processing', phase: null, error: null }).success
).toBe(true)
expect(
KnowledgeItemSchema.safeParse({ ...validItem, status: 'processing', phase: 'reading', error: null }).success
).toBe(true)
expect(
KnowledgeItemSchema.safeParse({ ...validItem, status: 'failed', phase: null, error: 'read failed' }).success
).toBe(true)
expect(KnowledgeItemSchema.safeParse({ ...validItem, status: 'idle', phase: 'reading', error: null }).success).toBe(
false
)
expect(
KnowledgeItemSchema.safeParse({ ...validItem, status: 'completed', phase: null, error: 'stale' }).success
).toBe(false)
expect(
KnowledgeItemSchema.safeParse({ ...validItem, status: 'processing', phase: null, error: 'stale' }).success
).toBe(false)
expect(
KnowledgeItemSchema.safeParse({ ...validItem, status: 'failed', phase: 'reading', error: 'read failed' }).success
).toBe(false)
expect(KnowledgeItemSchema.safeParse({ ...validItem, status: 'failed', phase: null, error: '' }).success).toBe(
false
)
})
it('restricts processing phase by knowledge item type', () => {
const leafItem = {
id: 'leaf-1',
baseId: 'kb-1',
groupId: null,
type: 'note' as const,
data: { source: 'leaf', content: 'leaf content' },
status: 'processing' as const,
error: null,
createdAt: '2026-04-10T00:00:00.000Z',
updatedAt: '2026-04-10T00:00:00.000Z'
}
const containerItem = {
id: 'container-1',
baseId: 'kb-1',
groupId: null,
type: 'directory' as const,
data: { source: '/docs', path: '/docs' },
status: 'processing' as const,
error: null,
createdAt: '2026-04-10T00:00:00.000Z',
updatedAt: '2026-04-10T00:00:00.000Z'
}
expect(KnowledgeItemSchema.safeParse({ ...leafItem, phase: null }).success).toBe(true)
expect(KnowledgeItemSchema.safeParse({ ...leafItem, phase: 'reading' }).success).toBe(true)
expect(KnowledgeItemSchema.safeParse({ ...leafItem, phase: 'embedding' }).success).toBe(true)
expect(KnowledgeItemSchema.safeParse({ ...leafItem, phase: 'preparing' }).success).toBe(false)
expect(KnowledgeItemSchema.safeParse({ ...containerItem, phase: null }).success).toBe(true)
expect(KnowledgeItemSchema.safeParse({ ...containerItem, phase: 'preparing' }).success).toBe(true)
expect(KnowledgeItemSchema.safeParse({ ...containerItem, phase: 'reading' }).success).toBe(false)
expect(KnowledgeItemSchema.safeParse({ ...containerItem, phase: 'embedding' }).success).toBe(false)
})
})
it('allows migrated knowledge bases to have a null embedding model id', () => {
it('accepts failed knowledge bases with a null embedding model id', () => {
const result = KnowledgeBaseSchema.safeParse({
id: 'kb-null-model',
name: 'KB nullable model',
dimensions: 1024,
embeddingModelId: null,
groupId: null,
emoji: '📁',
status: 'failed',
error: KNOWLEDGE_BASE_ERROR_MISSING_EMBEDDING_MODEL,
chunkSize: DEFAULT_KNOWLEDGE_BASE_CHUNK_SIZE,
chunkOverlap: DEFAULT_KNOWLEDGE_BASE_CHUNK_OVERLAP,
searchMode: 'hybrid',
createdAt: '2026-04-10T00:00:00.000Z',
updatedAt: '2026-04-10T00:00:00.000Z'
})
@@ -107,8 +537,102 @@ it('allows migrated knowledge bases to have a null embedding model id', () => {
expect(result.success).toBe(true)
})
it('keeps embeddingModelId optional in patch schema but rejects null clears', () => {
expect(UpdateKnowledgeBaseSchema.safeParse({ embeddingModelId: 'openai::text-embedding-3-small' }).success).toBe(true)
it('rejects invalid knowledge base status error combinations', () => {
const validBase = {
id: 'kb-1',
name: 'KB',
dimensions: 1024,
groupId: null,
emoji: '📁',
chunkSize: DEFAULT_KNOWLEDGE_BASE_CHUNK_SIZE,
chunkOverlap: DEFAULT_KNOWLEDGE_BASE_CHUNK_OVERLAP,
searchMode: 'hybrid' as const,
createdAt: '2026-04-10T00:00:00.000Z',
updatedAt: '2026-04-10T00:00:00.000Z'
}
expect(
KnowledgeBaseSchema.safeParse({
...validBase,
embeddingModelId: 'embed-model',
status: 'completed',
error: null
}).success
).toBe(true)
expect(
KnowledgeBaseSchema.safeParse({
...validBase,
embeddingModelId: null,
status: 'completed',
error: null
}).success
).toBe(false)
expect(
KnowledgeBaseSchema.safeParse({
...validBase,
embeddingModelId: 'embed-model',
status: 'completed',
error: 'stale'
}).success
).toBe(false)
expect(
KnowledgeBaseSchema.safeParse({
...validBase,
embeddingModelId: null,
status: 'failed',
error: null
}).success
).toBe(false)
expect(
KnowledgeBaseSchema.safeParse({
...validBase,
embeddingModelId: null,
status: 'failed',
error: ''
}).success
).toBe(false)
expect(
KnowledgeBaseSchema.safeParse({
...validBase,
embeddingModelId: null,
status: 'failed',
error: 'unknown_error'
}).success
).toBe(false)
})
it('rejects embedding model changes in patch schema', () => {
expect(UpdateKnowledgeBaseSchema.safeParse({ embeddingModelId: 'openai::text-embedding-3-small' }).success).toBe(
false
)
expect(UpdateKnowledgeBaseSchema.safeParse({ embeddingModelId: null }).success).toBe(false)
expect(UpdateKnowledgeBaseSchema.safeParse({}).success).toBe(true)
})
it('rejects optional config null clears in patch schema', () => {
expect(UpdateKnowledgeBaseSchema.safeParse({ chunkSize: null }).success).toBe(false)
expect(UpdateKnowledgeBaseSchema.safeParse({ chunkOverlap: null }).success).toBe(false)
expect(UpdateKnowledgeBaseSchema.safeParse({ searchMode: null }).success).toBe(false)
expect(UpdateKnowledgeBaseSchema.safeParse({ rerankModelId: null }).success).toBe(false)
expect(UpdateKnowledgeBaseSchema.safeParse({ fileProcessorId: null }).success).toBe(false)
expect(UpdateKnowledgeBaseSchema.safeParse({ threshold: null }).success).toBe(false)
expect(UpdateKnowledgeBaseSchema.safeParse({ documentCount: null }).success).toBe(false)
expect(UpdateKnowledgeBaseSchema.safeParse({ hybridAlpha: null }).success).toBe(false)
expect(UpdateKnowledgeBaseSchema.safeParse({ chunkSize: 1024, chunkOverlap: 200 }).success).toBe(true)
expect(
UpdateKnowledgeBaseSchema.safeParse({
rerankModelId: 'rerank-1',
fileProcessorId: 'processor-1',
threshold: 0.5,
documentCount: 5,
hybridAlpha: 0.7
}).success
).toBe(true)
})
it('keeps patch groupId aligned with topic semantics', () => {
expect(UpdateKnowledgeBaseSchema.safeParse({ groupId: null }).success).toBe(true)
expect(UpdateKnowledgeBaseSchema.safeParse({ groupId: ' group-1 ' }).success).toBe(true)
expect(UpdateKnowledgeBaseSchema.safeParse({ groupId: ' ' }).success).toBe(false)
expect(UpdateKnowledgeBaseSchema.safeParse({ emoji: null }).success).toBe(false)
})
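
The lifecycle tests above pin down which `status` / `phase` / `error` combinations a knowledge item may hold. The following is a minimal, dependency-free sketch of those rules, not the zod schema itself. The names `Lifecycle`, `CONTAINER_TYPES`, and `isValidLifecycle` are hypothetical and exist only for illustration. Treating `sitemap` and `directory` as container types is taken from the entity schemas in this commit.

```typescript
// Hypothetical mirror of the lifecycle rules exercised by the tests above.
type ItemType = 'file' | 'url' | 'note' | 'sitemap' | 'directory'
type ItemStatus = 'idle' | 'processing' | 'completed' | 'failed'
type ItemPhase = 'preparing' | 'reading' | 'embedding'

interface Lifecycle {
  type: ItemType
  status: ItemStatus
  phase: ItemPhase | null
  error: string | null
}

// Container items expand into child items; leaf items are read and embedded.
const CONTAINER_TYPES: ReadonlySet<ItemType> = new Set<ItemType>(['sitemap', 'directory'])

function isValidLifecycle(item: Lifecycle): boolean {
  // Failed items carry a non-blank error and no phase.
  if (item.status === 'failed') {
    return item.phase === null && item.error !== null && item.error.trim().length > 0
  }
  // Every other status must have a null error.
  if (item.error !== null) return false
  // Only processing items may carry a phase.
  if (item.status !== 'processing') return item.phase === null
  // A processing item may not have reported a phase yet.
  if (item.phase === null) return true
  // Containers may only be 'preparing'; leaves may only be 'reading' or 'embedding'.
  return CONTAINER_TYPES.has(item.type) ? item.phase === 'preparing' : item.phase !== 'preparing'
}
```

Under these assumptions, `isValidLifecycle({ type: 'note', status: 'processing', phase: 'preparing', error: null })` returns `false`, matching the leaf-item case in the last test.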

@@ -1,152 +1,45 @@
/**
* Knowledge API DTOs and schema contracts.
* Knowledge DataApi schemas.
*
* Runtime/index operations are exposed through KnowledgeOrchestrationService
* IPC contracts in `src/main/services/knowledge/types/ipc`, not through DataApi.
*/
import type { OffsetPaginationResponse } from '@shared/data/api'
import {
DirectoryItemDataSchema,
FileItemDataSchema,
FileMetadataSchema,
type KnowledgeBase,
KnowledgeChunkOverlapSchema,
KnowledgeChunkSizeSchema,
KnowledgeDocumentCountSchema,
KnowledgeHybridAlphaSchema,
KnowledgeBaseEntitySchema,
type KnowledgeItem,
KnowledgeItemStatusSchema,
KnowledgeItemTypeSchema,
KnowledgeSearchModeSchema,
KnowledgeThresholdSchema,
NoteItemDataSchema,
SitemapItemDataSchema,
UrlItemDataSchema
KnowledgeItemTypeSchema
} from '@shared/data/types/knowledge'
import * as z from 'zod'
export const CreateKnowledgeBaseSchema = z.object({
name: z.string().trim().min(1),
description: z.string().optional(),
dimensions: z.number().int().positive(),
embeddingModelId: z.string().trim().min(1),
rerankModelId: z.string().optional(),
fileProcessorId: z.string().optional(),
chunkSize: KnowledgeChunkSizeSchema.optional(),
chunkOverlap: KnowledgeChunkOverlapSchema.optional(),
threshold: KnowledgeThresholdSchema.optional(),
documentCount: KnowledgeDocumentCountSchema.optional(),
searchMode: KnowledgeSearchModeSchema.optional(),
hybridAlpha: KnowledgeHybridAlphaSchema.optional()
})
export type CreateKnowledgeBaseDto = z.infer<typeof CreateKnowledgeBaseSchema>
const KNOWLEDGE_BASE_MUTABLE_FIELDS = {
name: true,
groupId: true,
emoji: true,
rerankModelId: true,
fileProcessorId: true,
chunkSize: true,
chunkOverlap: true,
threshold: true,
documentCount: true,
searchMode: true,
hybridAlpha: true
} as const
export const UpdateKnowledgeBaseSchema = z
.object({
name: z.string().trim().min(1).optional(),
description: z.string().nullable().optional(),
embeddingModelId: z.string().trim().min(1).optional(),
rerankModelId: z.string().nullable().optional(),
fileProcessorId: z.string().nullable().optional(),
chunkSize: KnowledgeChunkSizeSchema.nullable().optional(),
chunkOverlap: KnowledgeChunkOverlapSchema.nullable().optional(),
threshold: KnowledgeThresholdSchema.nullable().optional(),
documentCount: KnowledgeDocumentCountSchema.nullable().optional(),
searchMode: KnowledgeSearchModeSchema.nullable().optional(),
hybridAlpha: KnowledgeHybridAlphaSchema.nullable().optional()
// `embeddingModelId` and `dimensions` are intentionally excluded: changing
// either invalidates existing vectors and must go through a runtime reindex flow.
export const UpdateKnowledgeBaseSchema = KnowledgeBaseEntitySchema.pick(KNOWLEDGE_BASE_MUTABLE_FIELDS)
.partial()
.extend({
rerankModelId: KnowledgeBaseEntitySchema.shape.rerankModelId,
fileProcessorId: KnowledgeBaseEntitySchema.shape.fileProcessorId,
threshold: KnowledgeBaseEntitySchema.shape.threshold,
documentCount: KnowledgeBaseEntitySchema.shape.documentCount,
hybridAlpha: KnowledgeBaseEntitySchema.shape.hybridAlpha
})
.strict()
export type UpdateKnowledgeBaseDto = z.infer<typeof UpdateKnowledgeBaseSchema>
export {
DirectoryItemDataSchema,
FileItemDataSchema,
FileMetadataSchema,
KnowledgeItemStatusSchema,
KnowledgeItemTypeSchema,
KnowledgeSearchModeSchema,
NoteItemDataSchema,
SitemapItemDataSchema,
UrlItemDataSchema
}
const CreateKnowledgeItemBaseSchema = z
.object({
ref: z.string().trim().min(1).optional(),
groupId: z.string().nullable().optional(),
groupRef: z.string().trim().min(1).optional()
})
.strict()
type CreateKnowledgeItemReferenceInput = z.input<typeof CreateKnowledgeItemBaseSchema>
function validateCreateKnowledgeItemReferences(item: CreateKnowledgeItemReferenceInput, ctx: z.RefinementCtx): void {
if (item.groupId != null && item.groupRef != null) {
ctx.addIssue({
code: 'custom',
path: ['groupRef'],
message: 'Knowledge items cannot specify both groupId and groupRef'
})
}
}
export function getCreateKnowledgeItemsReferenceErrors(
items: CreateKnowledgeItemReferenceInput[]
): Record<string, string[]> {
const refs = new Set<string>()
const duplicateRefs = new Set<string>()
const missingGroupRefs = new Set<string>()
for (const item of items) {
if (item.ref) {
if (refs.has(item.ref)) {
duplicateRefs.add(item.ref)
} else {
refs.add(item.ref)
}
}
}
for (const item of items) {
if (item.groupId == null && item.groupRef && !refs.has(item.groupRef)) {
missingGroupRefs.add(item.groupRef)
}
}
const fieldErrors: Record<string, string[]> = {}
if (duplicateRefs.size > 0) {
fieldErrors.ref = [`Duplicate knowledge item refs in request batch: ${[...duplicateRefs].join(', ')}`]
}
if (missingGroupRefs.size > 0) {
fieldErrors.groupRef = [`Knowledge item group ref not found in request batch: ${[...missingGroupRefs].join(', ')}`]
}
return fieldErrors
}
export const CreateKnowledgeItemSchema = z.discriminatedUnion('type', [
CreateKnowledgeItemBaseSchema.extend({
type: z.literal('file'),
data: FileItemDataSchema
}).superRefine(validateCreateKnowledgeItemReferences),
CreateKnowledgeItemBaseSchema.extend({
type: z.literal('url'),
data: UrlItemDataSchema
}).superRefine(validateCreateKnowledgeItemReferences),
CreateKnowledgeItemBaseSchema.extend({
type: z.literal('note'),
data: NoteItemDataSchema
}).superRefine(validateCreateKnowledgeItemReferences),
CreateKnowledgeItemBaseSchema.extend({
type: z.literal('sitemap'),
data: SitemapItemDataSchema
}).superRefine(validateCreateKnowledgeItemReferences),
CreateKnowledgeItemBaseSchema.extend({
type: z.literal('directory'),
data: DirectoryItemDataSchema
}).superRefine(validateCreateKnowledgeItemReferences)
])
export type CreateKnowledgeItemDto = z.infer<typeof CreateKnowledgeItemSchema>
export type UpdateKnowledgeBaseDto = z.input<typeof UpdateKnowledgeBaseSchema>
export const KNOWLEDGE_ITEMS_DEFAULT_PAGE = 1
export const KNOWLEDGE_ITEMS_DEFAULT_LIMIT = 20
@@ -155,75 +48,35 @@ export const KNOWLEDGE_BASES_DEFAULT_PAGE = 1
export const KNOWLEDGE_BASES_DEFAULT_LIMIT = 20
export const KNOWLEDGE_BASES_MAX_LIMIT = 100
export const CreateKnowledgeItemsSchema = z
.object({
items: z.array(CreateKnowledgeItemSchema).min(1).max(KNOWLEDGE_ITEMS_MAX_LIMIT)
})
.superRefine((value, ctx) => {
const fieldErrors = getCreateKnowledgeItemsReferenceErrors(value.items)
for (const [field, messages] of Object.entries(fieldErrors)) {
for (const message of messages) {
ctx.addIssue({
code: 'custom',
path: ['items', field],
message
})
}
}
})
export type CreateKnowledgeItemsDto = z.infer<typeof CreateKnowledgeItemsSchema>
export const UpdateKnowledgeItemDataSchema = z.union([
FileItemDataSchema,
UrlItemDataSchema,
NoteItemDataSchema,
SitemapItemDataSchema,
DirectoryItemDataSchema
])
export const UpdateKnowledgeItemSchema = z
.object({
data: UpdateKnowledgeItemDataSchema.optional(),
status: KnowledgeItemStatusSchema.optional(),
error: z.string().nullable().optional()
})
.strict()
export type UpdateKnowledgeItemDto = z.infer<typeof UpdateKnowledgeItemSchema>
export const KnowledgeBaseListQuerySchema = z.object({
export const ListKnowledgeBasesQuerySchema = z.strictObject({
page: z.int().positive().default(KNOWLEDGE_BASES_DEFAULT_PAGE),
limit: z.int().positive().max(KNOWLEDGE_BASES_MAX_LIMIT).default(KNOWLEDGE_BASES_DEFAULT_LIMIT)
})
export type KnowledgeBaseListQueryParams = z.input<typeof KnowledgeBaseListQuerySchema>
export type KnowledgeBaseListQuery = z.output<typeof KnowledgeBaseListQuerySchema>
export type ListKnowledgeBasesQueryParams = z.input<typeof ListKnowledgeBasesQuerySchema>
export type ListKnowledgeBasesQuery = z.output<typeof ListKnowledgeBasesQuerySchema>
/**
* Query parameters for GET /knowledge-bases/:id/items
*
* Returns flat knowledge items for one knowledge base with optional filters.
*/
export const KnowledgeItemsQuerySchema = z.object({
export const ListKnowledgeItemsQuerySchema = z.strictObject({
page: z.int().positive().default(KNOWLEDGE_ITEMS_DEFAULT_PAGE),
limit: z.int().positive().max(KNOWLEDGE_ITEMS_MAX_LIMIT).default(KNOWLEDGE_ITEMS_DEFAULT_LIMIT),
type: KnowledgeItemTypeSchema.optional(),
groupId: z.string().optional()
groupId: z.string().nullable().optional()
})
export type KnowledgeItemsQueryParams = z.input<typeof KnowledgeItemsQuerySchema>
export type KnowledgeItemsQuery = z.output<typeof KnowledgeItemsQuerySchema>
export type ListKnowledgeItemsQueryParams = z.input<typeof ListKnowledgeItemsQuerySchema>
export type ListKnowledgeItemsQuery = z.output<typeof ListKnowledgeItemsQuerySchema>
export type KnowledgeSchemas = {
'/knowledge-bases': {
GET: {
query?: KnowledgeBaseListQueryParams
query?: ListKnowledgeBasesQueryParams
response: OffsetPaginationResponse<KnowledgeBase>
}
POST: {
body: CreateKnowledgeBaseDto
response: KnowledgeBase
}
}
'/knowledge-bases/:id': {
@@ -236,10 +89,6 @@ export type KnowledgeSchemas = {
body: UpdateKnowledgeBaseDto
response: KnowledgeBase
}
DELETE: {
params: { id: string }
response: void
}
}
'/knowledge-bases/:id/items': {
@@ -248,17 +97,9 @@ export type KnowledgeSchemas = {
*/
GET: {
params: { id: string }
query?: KnowledgeItemsQueryParams
query?: ListKnowledgeItemsQueryParams
response: OffsetPaginationResponse<KnowledgeItem>
}
/**
* Create flat knowledge items with optional grouping metadata.
*/
POST: {
params: { id: string }
body: CreateKnowledgeItemsDto
response: { items: KnowledgeItem[] }
}
}
'/knowledge-items/:id': {
@@ -266,21 +107,5 @@ export type KnowledgeSchemas = {
params: { id: string }
response: KnowledgeItem
}
PATCH: {
params: { id: string }
body: UpdateKnowledgeItemDto
response: KnowledgeItem
}
/**
* Delete one knowledge item by id.
*
* If the deleted item acts as a group owner, all items with
* `groupId = :id` are deleted in the same operation through the
* database-level same-base cascade constraint.
*/
DELETE: {
params: { id: string }
response: void
}
}
}
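
The new `UpdateKnowledgeBaseSchema` accepts only a whitelist of mutable fields, rejects unknown keys through `.strict()`, and allows `null` only for `groupId`. The sketch below illustrates that key-level behaviour without zod. It does not check value types, and `MUTABLE_FIELDS` and `rejectedPatchFields` are hypothetical names, not part of the codebase.

```typescript
// Field names copied from KNOWLEDGE_BASE_MUTABLE_FIELDS in this commit.
const MUTABLE_FIELDS = [
  'name',
  'groupId',
  'emoji',
  'rerankModelId',
  'fileProcessorId',
  'chunkSize',
  'chunkOverlap',
  'threshold',
  'documentCount',
  'searchMode',
  'hybridAlpha'
] as const

// Returns the patch keys that would be rejected: keys outside the whitelist
// (for example embeddingModelId or dimensions) and null values on any field
// other than groupId. Value types are intentionally not validated here.
function rejectedPatchFields(patch: Record<string, unknown>): string[] {
  const mutable = new Set<string>(MUTABLE_FIELDS)
  return Object.entries(patch)
    .filter(([key, value]) => !mutable.has(key) || (value === null && key !== 'groupId'))
    .map(([key]) => key)
}
```

An empty result means the patch passes this key-level check. The real schema additionally validates each value, for example trimming `groupId` and rejecting blank strings.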

@@ -121,6 +121,9 @@ export type UseCacheSchema = {
'chat.generating': boolean
'chat.web_search.searching': boolean
// Knowledge recall test query history (session-only)
'knowledge.recall.search_queries': Record<string, string[]>
// Minapp management
'minapp.opened_keep_alive': CacheValueTypes.CacheMinAppType[]
'minapp.current_id': string
@@ -190,6 +193,7 @@ export const DefaultUseCache: UseCacheSchema = {
'chat.selected_message_ids': [],
'chat.generating': false,
'chat.web_search.searching': false,
'knowledge.recall.search_queries': {},
// Minapp management
'minapp.opened_keep_alive': [],

@@ -0,0 +1,37 @@
import { describe, expect, it } from 'vitest'
import { KnowledgeSearchResultSchema } from '../knowledge'
describe('KnowledgeSearchResultSchema', () => {
const result = {
pageContent: 'hello',
score: 0.9,
metadata: {
itemId: 'item-1',
itemType: 'note',
source: 'note-1',
chunkIndex: 0,
tokenCount: 1
},
itemId: 'item-1',
chunkId: 'chunk-1'
}
it('accepts explicit chunk metadata', () => {
expect(KnowledgeSearchResultSchema.parse(result)).toEqual(result)
})
it('rejects search results without required metadata fields', () => {
const invalidResult = {
...result,
metadata: {
itemId: 'item-1',
itemType: 'note',
source: 'note-1',
chunkIndex: 0
}
}
expect(() => KnowledgeSearchResultSchema.parse(invalidResult)).toThrow()
})
})

@@ -22,7 +22,7 @@ import { UniqueModelIdSchema } from './model'
* so we trust the renderer-side contract instead of paying for a second layer
* of schema-level cross-field validation.
*/
export const EntityTypeSchema = z.enum(['assistant', 'topic', 'model', 'agent'])
export const EntityTypeSchema = z.enum(['assistant', 'topic', 'model', 'agent', 'knowledge'])
export type EntityType = z.infer<typeof EntityTypeSchema>
/**

@@ -3,37 +3,145 @@ import * as z from 'zod'
import { type FileMetadata, FileTypeSchema } from './file'
/**
* Shared knowledge domain types.
* Knowledge domain types.
*
* Entity schemas live here so DataApi schemas and DB schemas can reuse the
* same source of truth.
* Keep this file as the single shared entry point for knowledge data contracts.
* Sections below separate persisted entities, runtime search types, and
* runtime operation DTOs.
*/
// ============================================================================
// Constants and Field Schemas
// ============================================================================
export const KNOWLEDGE_ITEM_TYPES = ['file', 'url', 'note', 'sitemap', 'directory'] as const
export const KnowledgeItemTypeSchema = z.enum(KNOWLEDGE_ITEM_TYPES)
export type KnowledgeItemType = z.infer<typeof KnowledgeItemTypeSchema>
export const KNOWLEDGE_ITEM_STATUSES = [
'idle',
'pending',
'file_processing',
'read',
'embed',
'completed',
'failed'
] as const
export const KNOWLEDGE_ITEM_STATUSES = ['idle', 'processing', 'completed', 'failed'] as const
export const KnowledgeItemStatusSchema = z.enum(KNOWLEDGE_ITEM_STATUSES)
export type KnowledgeItemStatus = z.infer<typeof KnowledgeItemStatusSchema>
export const KNOWLEDGE_ITEM_PHASES = ['preparing', 'reading', 'embedding'] as const
export const KnowledgeItemPhaseSchema = z.enum(KNOWLEDGE_ITEM_PHASES)
export type KnowledgeItemPhase = z.infer<typeof KnowledgeItemPhaseSchema>
export const KnowledgeLeafItemPhaseSchema = z.enum(['reading', 'embedding'])
export const KnowledgeContainerItemPhaseSchema = z.literal('preparing')
export const KNOWLEDGE_SEARCH_MODES = ['default', 'bm25', 'hybrid'] as const
export const KnowledgeSearchModeSchema = z.enum(KNOWLEDGE_SEARCH_MODES)
export type KnowledgeSearchMode = z.infer<typeof KnowledgeSearchModeSchema>
export const DEFAULT_KNOWLEDGE_SEARCH_MODE: KnowledgeSearchMode = 'hybrid'
export const KNOWLEDGE_BASE_STATUSES = ['completed', 'failed'] as const
export const KnowledgeBaseStatusSchema = z.enum(KNOWLEDGE_BASE_STATUSES)
export type KnowledgeBaseStatus = z.infer<typeof KnowledgeBaseStatusSchema>
export const DEFAULT_KNOWLEDGE_BASE_STATUS: KnowledgeBaseStatus = 'completed'
export const KNOWLEDGE_BASE_ERROR_CODES = ['missing_embedding_model'] as const
export const KnowledgeBaseErrorCodeSchema = z.enum(KNOWLEDGE_BASE_ERROR_CODES)
export type KnowledgeBaseErrorCode = z.infer<typeof KnowledgeBaseErrorCodeSchema>
export const KNOWLEDGE_BASE_ERROR_MISSING_EMBEDDING_MODEL: KnowledgeBaseErrorCode = 'missing_embedding_model'
export const KnowledgeChunkSizeSchema = z.number().int().positive()
export const KnowledgeChunkOverlapSchema = z.number().int().min(0)
export const KnowledgeThresholdSchema = z.number().min(0).max(1)
export const KnowledgeDocumentCountSchema = z.number().int().positive()
export const KnowledgeHybridAlphaSchema = z.number().min(0).max(1)
export const KnowledgeBaseEmojiSchema = z.emoji()
export const DEFAULT_KNOWLEDGE_BASE_CHUNK_SIZE = 1024
export const DEFAULT_KNOWLEDGE_BASE_CHUNK_OVERLAP = 200
export const DEFAULT_KNOWLEDGE_BASE_EMOJI = '📁'
export const KNOWLEDGE_RUNTIME_ITEMS_MAX = 100
export const KNOWLEDGE_NOTE_CONTENT_MAX = 1_000_000
// ============================================================================
// Knowledge Base Entity
// ============================================================================
/**
* Knowledge base metadata stored in SQLite.
*/
export const KnowledgeBaseEntitySchema = z.strictObject({
id: z.string(),
name: z.string().trim().min(1),
groupId: z.string().trim().min(1).nullable(),
emoji: KnowledgeBaseEmojiSchema,
dimensions: z.number().int().positive().nullable(),
embeddingModelId: z.string().trim().min(1).nullable(),
status: KnowledgeBaseStatusSchema,
error: KnowledgeBaseErrorCodeSchema.nullable(),
rerankModelId: z.string().optional(),
fileProcessorId: z.string().optional(),
chunkSize: KnowledgeChunkSizeSchema,
chunkOverlap: KnowledgeChunkOverlapSchema,
threshold: KnowledgeThresholdSchema.optional(),
documentCount: KnowledgeDocumentCountSchema.optional(),
searchMode: KnowledgeSearchModeSchema,
hybridAlpha: KnowledgeHybridAlphaSchema.optional(),
createdAt: z.iso.datetime(),
updatedAt: z.iso.datetime()
})
export const KnowledgeBaseSchema = KnowledgeBaseEntitySchema.superRefine((value, ctx) => {
if (value.status === 'completed') {
if (value.embeddingModelId === null) {
ctx.addIssue({
code: 'custom',
path: ['embeddingModelId'],
message: 'Completed knowledge base requires an embedding model'
})
}
if (value.error !== null) {
ctx.addIssue({
code: 'custom',
path: ['error'],
message: 'Completed knowledge base cannot have an error'
})
}
if (value.dimensions === null) {
ctx.addIssue({
code: 'custom',
path: ['dimensions'],
message: 'Completed knowledge base requires positive dimensions'
})
}
}
if (value.status === 'failed' && value.error === null) {
ctx.addIssue({
code: 'custom',
path: ['error'],
message: 'Failed knowledge base requires an error'
})
}
if (value.chunkOverlap >= value.chunkSize) {
ctx.addIssue({
code: 'custom',
path: ['chunkOverlap'],
message: 'Chunk overlap must be smaller than chunk size'
})
}
if (value.hybridAlpha != null && value.searchMode !== 'hybrid') {
ctx.addIssue({
code: 'custom',
path: ['hybridAlpha'],
message: 'Hybrid alpha requires hybrid search mode'
})
}
})
export type KnowledgeBase = z.infer<typeof KnowledgeBaseSchema>
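
The cross-field checks in the `superRefine` callback above can be read as a plain function that collects the offending field paths. This is a hedged, zod-free sketch of the same conditions. `BaseFields` and `baseIssuePaths` are hypothetical names, and the error-code enum is simplified to `string | null`.

```typescript
// Only the fields that participate in cross-field validation.
interface BaseFields {
  status: 'completed' | 'failed'
  error: string | null
  embeddingModelId: string | null
  dimensions: number | null
  chunkSize: number
  chunkOverlap: number
  searchMode: 'default' | 'bm25' | 'hybrid'
  hybridAlpha?: number
}

// Returns the path of every violated rule, in the order the refinement checks them.
function baseIssuePaths(base: BaseFields): string[] {
  const paths: string[] = []
  if (base.status === 'completed') {
    // A usable base needs a model, known dimensions, and no error.
    if (base.embeddingModelId === null) paths.push('embeddingModelId')
    if (base.error !== null) paths.push('error')
    if (base.dimensions === null) paths.push('dimensions')
  }
  // A failed base must say why it failed.
  if (base.status === 'failed' && base.error === null) paths.push('error')
  if (base.chunkOverlap >= base.chunkSize) paths.push('chunkOverlap')
  // hybridAlpha is only meaningful in hybrid search mode.
  if (base.hybridAlpha != null && base.searchMode !== 'hybrid') paths.push('hybridAlpha')
  return paths
}
```

A failed base with `embeddingModelId: null` and `error: 'missing_embedding_model'` yields no issues here, which is the recoverable migrated state the tests accept.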
// ============================================================================
// Knowledge Item Data
// ============================================================================
const KnowledgeItemSharedSchema = z.strictObject({
source: z.string().trim().min(1)
})
/**
* Temporary schema mirroring the current FileMetadata shape.
@@ -56,54 +164,38 @@ export const FileMetadataSchema: z.ZodType<FileMetadata> = z.object({
/**
* File item data.
*/
export const FileItemDataSchema = z.object({
export const FileItemDataSchema = KnowledgeItemSharedSchema.extend({
file: FileMetadataSchema
})
export type FileItemData = z.infer<typeof FileItemDataSchema>
/**
* URL item data.
*/
export const UrlItemDataSchema = z.object({
url: z.string().trim().min(1),
name: z.string().trim().min(1)
export const UrlItemDataSchema = KnowledgeItemSharedSchema.extend({
url: z.string().trim().min(1)
})
export type UrlItemData = z.infer<typeof UrlItemDataSchema>
/**
* Note item data.
*/
export const NoteItemDataSchema = z.object({
content: z.string(),
export const NoteItemDataSchema = KnowledgeItemSharedSchema.extend({
content: z.string().max(KNOWLEDGE_NOTE_CONTENT_MAX),
sourceUrl: z.string().optional()
})
export type NoteItemData = z.infer<typeof NoteItemDataSchema>
/**
* Sitemap item data.
*/
export const SitemapItemDataSchema = z.object({
url: z.string().trim().min(1),
name: z.string().trim().min(1)
export const SitemapItemDataSchema = KnowledgeItemSharedSchema.extend({
url: z.string().trim().min(1)
})
export type SitemapItemData = z.infer<typeof SitemapItemDataSchema>
/**
* Directory item data.
*/
export const DirectoryItemDataSchema = z.object({
name: z.string().trim().min(1),
export const DirectoryItemDataSchema = KnowledgeItemSharedSchema.extend({
path: z.string().trim().min(1)
})
export type DirectoryItemData = z.infer<typeof DirectoryItemDataSchema>
export type KnowledgeItemDataMap = {
file: FileItemData
url: UrlItemData
note: NoteItemData
sitemap: SitemapItemData
directory: DirectoryItemData
}
/**
* JSON payload stored in `knowledge_item.data`.
@@ -117,15 +209,177 @@ export const KnowledgeItemDataSchema = z.union([
])
export type KnowledgeItemData = z.infer<typeof KnowledgeItemDataSchema>
/**
* Knowledge base metadata stored in SQLite.
*/
export const KnowledgeBaseSchema = z.object({
// ============================================================================
// Knowledge Item Entity
// ============================================================================
const KnowledgeItemEntityBaseSchema = z.strictObject({
id: z.string(),
name: z.string().min(1),
description: z.string().optional(),
baseId: z.string(),
groupId: z.string().trim().min(1).nullable().optional(),
createdAt: z.iso.datetime(),
updatedAt: z.iso.datetime()
})
const IdleKnowledgeItemLifecycleSchema = {
status: z.literal('idle'),
phase: z.null(),
error: z.null()
} as const
const ProcessingKnowledgeItemLifecycleSchema = {
status: z.literal('processing'),
phase: KnowledgeItemPhaseSchema.nullable(),
error: z.null()
} as const
const LeafProcessingKnowledgeItemLifecycleSchema = {
status: z.literal('processing'),
phase: KnowledgeLeafItemPhaseSchema.nullable(),
error: z.null()
} as const
const ContainerProcessingKnowledgeItemLifecycleSchema = {
status: z.literal('processing'),
phase: KnowledgeContainerItemPhaseSchema.nullable(),
error: z.null()
} as const
const CompletedKnowledgeItemLifecycleSchema = {
status: z.literal('completed'),
phase: z.null(),
error: z.null()
} as const
const FailedKnowledgeItemLifecycleSchema = {
status: z.literal('failed'),
phase: z.null(),
error: z.string().trim().min(1)
} as const
const KnowledgeItemLifecycleSchemas = [
z.strictObject(IdleKnowledgeItemLifecycleSchema),
z.strictObject(ProcessingKnowledgeItemLifecycleSchema),
z.strictObject(CompletedKnowledgeItemLifecycleSchema),
z.strictObject(FailedKnowledgeItemLifecycleSchema)
] as const
export const KnowledgeItemLifecycleSchema = z.discriminatedUnion('status', KnowledgeItemLifecycleSchemas)
export type KnowledgeItemLifecycle = z.infer<typeof KnowledgeItemLifecycleSchema>
const createKnowledgeItemEntitySchemas = <
TType extends KnowledgeItemType,
TData extends z.ZodType,
TProcessingLifecycle extends
| typeof LeafProcessingKnowledgeItemLifecycleSchema
| typeof ContainerProcessingKnowledgeItemLifecycleSchema
>(
type: TType,
data: TData,
processingLifecycle: TProcessingLifecycle
) =>
[
KnowledgeItemEntityBaseSchema.extend({
type: z.literal(type),
data,
...IdleKnowledgeItemLifecycleSchema
}),
KnowledgeItemEntityBaseSchema.extend({
type: z.literal(type),
data,
...processingLifecycle
}),
KnowledgeItemEntityBaseSchema.extend({
type: z.literal(type),
data,
...CompletedKnowledgeItemLifecycleSchema
}),
KnowledgeItemEntityBaseSchema.extend({
type: z.literal(type),
data,
...FailedKnowledgeItemLifecycleSchema
})
] as const
const FileKnowledgeItemSchema = z.discriminatedUnion(
'status',
createKnowledgeItemEntitySchemas('file', FileItemDataSchema, LeafProcessingKnowledgeItemLifecycleSchema)
)
const UrlKnowledgeItemSchema = z.discriminatedUnion(
'status',
createKnowledgeItemEntitySchemas('url', UrlItemDataSchema, LeafProcessingKnowledgeItemLifecycleSchema)
)
const NoteKnowledgeItemSchema = z.discriminatedUnion(
'status',
createKnowledgeItemEntitySchemas('note', NoteItemDataSchema, LeafProcessingKnowledgeItemLifecycleSchema)
)
const SitemapKnowledgeItemSchema = z.discriminatedUnion(
'status',
createKnowledgeItemEntitySchemas('sitemap', SitemapItemDataSchema, ContainerProcessingKnowledgeItemLifecycleSchema)
)
const DirectoryKnowledgeItemSchema = z.discriminatedUnion(
'status',
createKnowledgeItemEntitySchemas(
'directory',
DirectoryItemDataSchema,
ContainerProcessingKnowledgeItemLifecycleSchema
)
)
/**
* Knowledge item record stored in SQLite.
*/
export const KnowledgeItemSchema = z.union([
FileKnowledgeItemSchema,
UrlKnowledgeItemSchema,
NoteKnowledgeItemSchema,
SitemapKnowledgeItemSchema,
DirectoryKnowledgeItemSchema
])
export type KnowledgeItem = z.infer<typeof KnowledgeItemSchema>
export type KnowledgeItemOf<T extends KnowledgeItemType> = Extract<KnowledgeItem, { type: T }>
// ============================================================================
// Runtime Search and Chunk Types
// ============================================================================
export const KnowledgeChunkMetadataSchema = z.strictObject({
itemId: z.string(),
itemType: KnowledgeItemTypeSchema,
source: z.string().trim().min(1),
chunkIndex: z.number().int().min(0),
tokenCount: z.number().int().min(0)
})
export type KnowledgeChunkMetadata = z.infer<typeof KnowledgeChunkMetadataSchema>
export type KnowledgeSourceMetadata = Pick<KnowledgeChunkMetadata, 'source'>
/**
* Search result returned by retrieval.
*/
export const KnowledgeSearchResultSchema = z.strictObject({
pageContent: z.string(),
score: z.number(),
metadata: KnowledgeChunkMetadataSchema,
itemId: z.string().optional(),
chunkId: z.string()
})
export type KnowledgeSearchResult = z.infer<typeof KnowledgeSearchResultSchema>
export const KnowledgeItemChunkSchema = z.strictObject({
id: z.string(),
itemId: z.string(),
content: z.string(),
metadata: KnowledgeChunkMetadataSchema
})
export type KnowledgeItemChunk = z.infer<typeof KnowledgeItemChunkSchema>
// ============================================================================
// Runtime Operation Schemas
// ============================================================================
const KnowledgeBaseRuntimeConfigSchema = z.strictObject({
dimensions: z.number().int().positive(),
embeddingModelId: z.string().min(1).nullable(),
embeddingModelId: z.string().trim().min(1),
rerankModelId: z.string().optional(),
fileProcessorId: z.string().optional(),
chunkSize: KnowledgeChunkSizeSchema.optional(),
@@ -133,58 +387,76 @@ export const KnowledgeBaseSchema = z.object({
threshold: KnowledgeThresholdSchema.optional(),
documentCount: KnowledgeDocumentCountSchema.optional(),
searchMode: KnowledgeSearchModeSchema.optional(),
hybridAlpha: KnowledgeHybridAlphaSchema.optional(),
createdAt: z.iso.datetime(),
updatedAt: z.iso.datetime()
hybridAlpha: KnowledgeHybridAlphaSchema.optional()
})
export type KnowledgeBase = z.infer<typeof KnowledgeBaseSchema>
const KnowledgeItemBaseSchema = z.object({
id: z.string(),
baseId: z.string(),
groupId: z.string().nullable().optional(),
status: KnowledgeItemStatusSchema,
error: z.string().nullable(),
createdAt: z.iso.datetime(),
updatedAt: z.iso.datetime()
})
const refineRuntimeConfig = (value: z.infer<typeof KnowledgeBaseRuntimeConfigSchema>, ctx: z.RefinementCtx): void => {
if (value.chunkOverlap != null && value.chunkSize == null) {
ctx.addIssue({
code: 'custom',
path: ['chunkSize'],
message: 'Chunk size is required when chunk overlap is provided'
})
}
if (value.chunkOverlap != null && value.chunkSize != null && value.chunkOverlap >= value.chunkSize) {
ctx.addIssue({
code: 'custom',
path: ['chunkOverlap'],
message: 'Chunk overlap must be smaller than chunk size'
})
}
}
/**
* Knowledge item record stored in SQLite.
* Runtime create-base request. This is intentionally not a DataApi endpoint:
* orchestration creates the SQLite row and initializes the vector store.
*/
export const KnowledgeItemSchema = z.discriminatedUnion('type', [
KnowledgeItemBaseSchema.extend({
export const CreateKnowledgeBaseSchema = KnowledgeBaseRuntimeConfigSchema.extend({
name: z.string().trim().min(1),
groupId: z.string().trim().min(1).optional(),
emoji: KnowledgeBaseEmojiSchema.optional()
}).superRefine(refineRuntimeConfig)
export type CreateKnowledgeBaseDto = z.input<typeof CreateKnowledgeBaseSchema>
export const RestoreKnowledgeBaseSchema = z.strictObject({
sourceBaseId: z.string().trim().min(1),
// Dimensions must be the resolved embedding vector size for embeddingModelId.
// Automatic callers should fill this from AI Core dimension detection; manual
// callers are responsible for confirming the value matches the selected model.
// Restore validates shape only and does not probe the model again server-side.
dimensions: z.number().int().positive(),
embeddingModelId: z.string().trim().min(1)
})
export type RestoreKnowledgeBaseDto = z.input<typeof RestoreKnowledgeBaseSchema>
const CreateKnowledgeItemBaseSchema = z.strictObject({
groupId: z.string().trim().min(1).nullable().optional()
})
export const CreateKnowledgeItemSchema = z.discriminatedUnion('type', [
CreateKnowledgeItemBaseSchema.extend({
type: z.literal('file'),
data: FileItemDataSchema
}),
KnowledgeItemBaseSchema.extend({
CreateKnowledgeItemBaseSchema.extend({
type: z.literal('url'),
data: UrlItemDataSchema
}),
KnowledgeItemBaseSchema.extend({
CreateKnowledgeItemBaseSchema.extend({
type: z.literal('note'),
data: NoteItemDataSchema
}),
KnowledgeItemBaseSchema.extend({
CreateKnowledgeItemBaseSchema.extend({
type: z.literal('sitemap'),
data: SitemapItemDataSchema
}),
KnowledgeItemBaseSchema.extend({
CreateKnowledgeItemBaseSchema.extend({
type: z.literal('directory'),
data: DirectoryItemDataSchema
})
])
export type KnowledgeItem = z.infer<typeof KnowledgeItemSchema>
export type KnowledgeItemOf<T extends KnowledgeItemType> = Extract<KnowledgeItem, { type: T }>
export type CreateKnowledgeItemDto = z.infer<typeof CreateKnowledgeItemSchema>
/**
* Search result returned by retrieval.
*/
export const KnowledgeSearchResultSchema = z.object({
pageContent: z.string(),
score: z.number(),
metadata: z.record(z.string(), z.unknown()),
itemId: z.string().optional(),
chunkId: z.string()
})
export type KnowledgeSearchResult = z.infer<typeof KnowledgeSearchResultSchema>
export const KnowledgeRuntimeAddItemInputSchema = CreateKnowledgeItemSchema
export type KnowledgeRuntimeAddItemInput = CreateKnowledgeItemDto
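
The chunk rule that `refineRuntimeConfig` attaches to `CreateKnowledgeBaseSchema` can be illustrated without zod. This is a minimal sketch, not the project's code: `ChunkConfig`, `ChunkIssue`, and `checkChunkConfig` are hypothetical names introduced here. Only the two conditions and their messages are taken from the diff above.

```typescript
// Hypothetical stand-in for the chunk-related slice of the runtime config.
interface ChunkConfig {
  chunkSize?: number | null
  chunkOverlap?: number | null
}

// Hypothetical issue shape; the real code reports through a zod refinement context.
interface ChunkIssue {
  path: 'chunkSize' | 'chunkOverlap'
  message: string
}

// Mirrors the two checks in refineRuntimeConfig and returns the issues as a list.
function checkChunkConfig(value: ChunkConfig): ChunkIssue[] {
  const issues: ChunkIssue[] = []
  const { chunkSize, chunkOverlap } = value

  // An overlap is meaningless without a chunk size to compare against.
  if (chunkOverlap != null && chunkSize == null) {
    issues.push({
      path: 'chunkSize',
      message: 'Chunk size is required when chunk overlap is provided'
    })
  }

  // The overlap must be strictly smaller than the chunk size.
  if (chunkOverlap != null && chunkSize != null && chunkOverlap >= chunkSize) {
    issues.push({
      path: 'chunkOverlap',
      message: 'Chunk overlap must be smaller than chunk size'
    })
  }

  return issues
}

console.log(checkChunkConfig({ chunkSize: 512, chunkOverlap: 64 }).length) // 0
console.log(checkChunkConfig({ chunkOverlap: 64 })[0]?.path) // chunkSize
```

Both fields stay optional on their own. The rule only constrains the combination, which is why it lives in a refinement rather than in the field schemas.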


@@ -31,6 +31,17 @@ function toError(error: unknown): Error {
return error instanceof Error ? error : new Error(String(error))
}
function toFts5TokenQuery(query: string): string | null {
const tokens = query.match(/[\p{L}\p{N}_]+/gu) ?? []
const nonEmptyTokens = tokens.map((token) => token.trim()).filter((token) => token.length > 0)
if (nonEmptyTokens.length === 0) {
return null
}
return nonEmptyTokens.map((token) => `"${token.replaceAll('"', '""')}"`).join(' AND ')
}
function validateMetadataKey(key: string): string {
if (!SAFE_METADATA_KEY_PATTERN.test(key)) {
throw new Error(`Invalid metadata filter key: ${key}`)
@@ -267,9 +278,6 @@ export class LibSQLVectorStore extends BaseVectorStore {
const id = node.id_.length ? node.id_ : null
const externalId = node.sourceNode?.nodeId || node.id_
const meta = node.metadata || {}
if (!meta.create_date) {
meta.create_date = new Date()
}
const nodeId = id ?? '<auto-id>'
const embedding = this.normalizeEmbeddingOrThrow(this.getNodeEmbedding(node, nodeId), nodeId)
@@ -337,6 +345,16 @@ export class LibSQLVectorStore extends BaseVectorStore {
await this.clientInstance.execute(statement)
}
async deleteByIdAndExternalId(chunkId: string, refDocId: string): Promise<void> {
await this.ensureInitialized()
const collectionCriteria = this.collection.length ? 'AND collection = ?' : ''
const sql = `DELETE FROM ${this.tableName} WHERE id = ? AND external_id = ? ${collectionCriteria}`
const args = this.collection.length ? [chunkId, refDocId, this.collection] : [chunkId, refDocId]
const statement: InStatement = { sql, args: toInArgs(args) }
await this.clientInstance.execute(statement)
}
private normalizeEmbeddingOrThrow(embedding: number[] | undefined, nodeId: string): Float32Array {
if (!embedding || embedding.length === 0) {
throw new Error(`Missing embedding for node ${nodeId}`)
@@ -583,6 +601,16 @@ export class LibSQLVectorStore extends BaseVectorStore {
throw new Error('queryStr is required for BM25 mode')
}
const matchQuery = toFts5TokenQuery(query.queryStr)
if (!matchQuery) {
return {
nodes: [],
similarities: [],
ids: []
}
}
const { where, params } = this.buildWhereClause(query, 'v')
// Use FTS5 for BM25 search
@@ -596,7 +624,7 @@ export class LibSQLVectorStore extends BaseVectorStore {
ORDER BY score
LIMIT ${max}
`,
args: toInArgs([...params, query.queryStr])
args: toInArgs([...params, matchQuery])
}
try {
@@ -734,4 +762,39 @@ export class LibSQLVectorStore extends BaseVectorStore {
})
return results.rows.length > 0
}
async listByExternalId(refDocId: string): Promise<Document<Metadata>[]> {
await this.ensureInitialized()
const collectionCriteria = this.collection.length ? 'AND collection = ?' : ''
const sql = `SELECT id, external_id, document, metadata FROM ${this.tableName}
WHERE external_id = ? ${collectionCriteria}
ORDER BY CASE WHEN json_valid(metadata) THEN CAST(json_extract(metadata, '$.chunkIndex') AS INTEGER) ELSE NULL END, id`
const params = this.collection.length ? [refDocId, this.collection] : [refDocId]
const results = await this.clientInstance.execute({
sql,
args: toInArgs(params)
})
return results.rows.map((row) => {
const metadata = this.parseJson<Metadata>(
row.metadata as Metadata | string | null | undefined,
{},
{
field: 'metadata',
rowId: String(row.id ?? '')
}
)
const externalId = typeof row.external_id === 'string' && row.external_id.length > 0 ? row.external_id : undefined
if (externalId && metadata.itemId === undefined) {
metadata.itemId = externalId
}
return new Document({
id_: String(row.id),
text: String(row.document || ''),
metadata
})
})
}
}
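
To make the BM25 change concrete, the sketch below reproduces the tokenization done by `toFts5TokenQuery` so it can be run on its own. It assumes the same token class as the diff (Unicode letters, digits, and `_`). The `TOKEN_PATTERN` constant and the use of the `RegExp` constructor are choices made for this sketch, not part of the commit. The `trim`/`filter` pass from the original is omitted because tokens matched by this class cannot contain whitespace or be empty.

```typescript
// Same token class as the helper in the diff: Unicode letters, digits, and "_".
// Built with the RegExp constructor only so the sketch does not depend on the
// compiler target supporting \p{...} in regex literals.
const TOKEN_PATTERN = new RegExp('[\\p{L}\\p{N}_]+', 'gu')

function toFts5TokenQuery(query: string): string | null {
  const tokens: string[] = query.match(TOKEN_PATTERN) ?? []
  if (tokens.length === 0) {
    // Nothing searchable (for example "..."): the caller returns an empty result.
    return null
  }
  // Quote every token as an FTS5 string and require all of them to match.
  return tokens.map((token) => `"${token.replace(/"/g, '""')}"`).join(' AND ')
}

console.log(toFts5TokenQuery('artificial technology')) // "artificial" AND "technology"
console.log(toFts5TokenQuery('DeepSeek-V3.2')) // "DeepSeek" AND "V3" AND "2"
console.log(toFts5TokenQuery('C++')) // "C"
console.log(toFts5TokenQuery('...')) // null
```

Quoting each token keeps characters such as `-`, `.`, and `"` in user text from being read as FTS5 query syntax, and joining with `AND` lets non-adjacent words match the same chunk. This is consistent with the new `bm25` and `hybrid` test cases in this commit.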


@@ -93,6 +93,27 @@ describe('LibSQLVectorStore', () => {
expect(ids[1]).toBeDefined()
})
it('should preserve caller metadata without injecting create_date', async () => {
const metadata: Metadata = { category: 'test', score: 1.0 }
const node = new TextNode({
id_: 'chunk-metadata-preserved',
text: 'Document chunk',
embedding: [0.1, 0.2],
metadata
})
await store.add([node])
expect(metadata).toEqual({ category: 'test', score: 1.0 })
const rows = await client.execute("SELECT metadata FROM test_embeddings WHERE id = 'chunk-metadata-preserved'")
expect(rows.rows).toHaveLength(1)
expect(JSON.parse(String(rows.rows[0]?.metadata))).toEqual({
category: 'test',
score: 1.0
})
})
it('should reject nodes with missing embeddings instead of writing zero vectors', async () => {
const node = new TextNode({
id_: 'chunk-missing-embedding',
@@ -923,6 +944,54 @@ describe('LibSQLVectorStore', () => {
})
})
it('should query bm25 mode with non-consecutive multi-word user text', async () => {
const result = await store.query({
queryStr: 'artificial technology',
similarityTopK: 2,
mode: 'bm25' as VectorStoreQueryMode
})
const nodes = result.nodes ?? []
expect(nodes.length).toBeGreaterThan(0)
expect(nodes.some((node) => node.getContent(MetadataMode.NONE).includes('artificial intelligence'))).toBe(true)
})
it('should query bm25 mode with punctuation as ordinary user text', async () => {
await store.add([
new TextNode({
text: 'DeepSeek-V3.2 release notes mention node.js, README.md, and C++ usage examples',
embedding: [0.7, 0.3],
metadata: { category: 'release' }
})
])
const queries = ['DeepSeek-V3.2', 'README.md', 'node.js', 'C++', 'DeepSeek "V3.2"']
for (const queryStr of queries) {
const result = await store.query({
queryStr,
similarityTopK: 3,
mode: 'bm25' as VectorStoreQueryMode
})
expect(result.nodes?.length ?? 0).toBeGreaterThan(0)
}
})
it('should return empty bm25 results for punctuation-only user text', async () => {
const result = await store.query({
queryStr: '...',
similarityTopK: 3,
mode: 'bm25' as VectorStoreQueryMode
})
expect(result).toEqual({
nodes: [],
similarities: [],
ids: []
})
})
it('should throw error for bm25 mode without queryStr', async () => {
const query: VectorStoreQuery = {
queryEmbedding: [0.5, 0.5],
@@ -954,6 +1023,38 @@ describe('LibSQLVectorStore', () => {
})
})
it('should query hybrid mode with non-consecutive multi-word user text', async () => {
const result = await store.query({
queryEmbedding: [0.9, 0.1],
queryStr: 'artificial technology',
similarityTopK: 2,
mode: 'hybrid' as VectorStoreQueryMode
})
const nodes = result.nodes ?? []
expect(nodes.length).toBeGreaterThan(0)
expect(nodes.some((node) => node.getContent(MetadataMode.NONE).includes('artificial intelligence'))).toBe(true)
})
it('should query hybrid mode with punctuation as ordinary user text', async () => {
await store.add([
new TextNode({
text: 'DeepSeek-V3.2 release notes for hybrid retrieval',
embedding: [0.9, 0.1],
metadata: { category: 'release' }
})
])
const result = await store.query({
queryEmbedding: [0.9, 0.1],
queryStr: 'DeepSeek-V3.2',
similarityTopK: 2,
mode: 'hybrid' as VectorStoreQueryMode
})
expect(result.nodes?.length ?? 0).toBeGreaterThan(0)
})
it('should throw error for hybrid mode without queryEmbedding', async () => {
const query: VectorStoreQuery = {
queryStr: 'artificial intelligence',
@@ -1117,4 +1218,219 @@ describe('LibSQLVectorStore', () => {
expect(await store.exists('item-collection')).toBe(false)
})
})
describe('chunk deletion', () => {
it('should delete one chunk by id, external_id, and collection only', async () => {
store.setCollection('collection-a')
await store.add([
new TextNode({
id_: 'chunk-1',
text: 'first chunk',
embedding: [0.1, 0.2],
metadata: { itemId: 'item-1', chunkIndex: 0, tokenCount: 2 },
relationships: { [NodeRelationship.SOURCE]: { nodeId: 'item-1', metadata: {} } }
}),
new TextNode({
id_: 'chunk-2',
text: 'second chunk',
embedding: [0.2, 0.3],
metadata: { itemId: 'item-1', chunkIndex: 1, tokenCount: 2 },
relationships: { [NodeRelationship.SOURCE]: { nodeId: 'item-1', metadata: {} } }
})
])
const otherCollectionStore = new LibSQLVectorStore({
client,
tableName: 'test_embeddings',
dimensions: 2,
collection: 'collection-b'
})
await otherCollectionStore.add([
new TextNode({
id_: 'chunk-1',
text: 'other collection chunk',
embedding: [0.3, 0.4],
metadata: { itemId: 'item-1', chunkIndex: 0, tokenCount: 3 },
relationships: { [NodeRelationship.SOURCE]: { nodeId: 'item-1', metadata: {} } }
})
])
await store.deleteByIdAndExternalId('chunk-1', 'item-1')
const rows = await client.execute(
'SELECT id, external_id, collection FROM test_embeddings ORDER BY collection, id'
)
expect(rows.rows).toHaveLength(2)
expect(rows.rows[0]).toMatchObject({ id: 'chunk-2', external_id: 'item-1', collection: 'collection-a' })
expect(rows.rows[1]).toMatchObject({ id: 'chunk-1', external_id: 'item-1', collection: 'collection-b' })
})
it('should remove a deleted chunk from bm25 search', async () => {
const node = new TextNode({
id_: 'chunk-bm25-delete',
text: 'delete this exact chunk',
embedding: [0.5, 0.6],
metadata: { itemId: 'item-1', chunkIndex: 0, tokenCount: 4 },
relationships: { [NodeRelationship.SOURCE]: { nodeId: 'item-1', metadata: {} } }
})
await store.add([node])
await store.deleteByIdAndExternalId('chunk-bm25-delete', 'item-1')
const result = await store.query({
queryStr: 'delete exact',
similarityTopK: 5,
mode: 'bm25' as VectorStoreQueryMode
})
expect(result.ids).not.toContain('chunk-bm25-delete')
})
})
describe('listByExternalId', () => {
it('should list documents by external_id in chunk order without embeddings', async () => {
await store.add([
new TextNode({
id_: 'chunk-2',
text: 'second chunk',
embedding: [0.1, 0.2],
metadata: { itemId: 'item-1', chunkIndex: 1, tokenCount: 2 },
relationships: {
[NodeRelationship.SOURCE]: {
nodeId: 'item-1',
metadata: {}
}
}
}),
new TextNode({
id_: 'chunk-1',
text: 'first chunk',
embedding: [0.3, 0.4],
metadata: { itemId: 'item-1', chunkIndex: 0, tokenCount: 2 },
relationships: {
[NodeRelationship.SOURCE]: {
nodeId: 'item-1',
metadata: {}
}
}
}),
new TextNode({
id_: 'other-chunk',
text: 'other item chunk',
embedding: [0.5, 0.6],
metadata: { itemId: 'item-2', chunkIndex: 0, tokenCount: 3 },
relationships: {
[NodeRelationship.SOURCE]: {
nodeId: 'item-2',
metadata: {}
}
}
})
])
const chunks = await store.listByExternalId('item-1')
expect(chunks.map((chunk) => chunk.id_)).toEqual(['chunk-1', 'chunk-2'])
expect(chunks.map((chunk) => chunk.getContent(MetadataMode.NONE))).toEqual(['first chunk', 'second chunk'])
expect(chunks.map((chunk) => chunk.metadata.chunkIndex)).toEqual([0, 1])
expect(() => chunks[0]?.getEmbedding()).toThrow('Embedding not set')
})
it('should fall back to external_id when listed metadata has no itemId', async () => {
await store.add([
new TextNode({
id_: 'chunk-without-item-id',
text: 'chunk without item id',
embedding: [0.1, 0.2],
metadata: { chunkIndex: 0, tokenCount: 4 },
relationships: {
[NodeRelationship.SOURCE]: {
nodeId: 'item-1',
metadata: {}
}
}
})
])
const chunks = await store.listByExternalId('item-1')
expect(chunks).toHaveLength(1)
expect(chunks[0]?.metadata).toMatchObject({
itemId: 'item-1',
chunkIndex: 0,
tokenCount: 4
})
})
it('should tolerate invalid metadata JSON when listing documents', async () => {
const warnSpy = vi.spyOn(console, 'warn').mockImplementation(() => {})
await store.add([
new TextNode({
id_: 'chunk-invalid-list-metadata',
text: 'chunk with invalid list metadata',
embedding: [0.1, 0.2],
metadata: { itemId: 'item-1', chunkIndex: 0, tokenCount: 5 },
relationships: {
[NodeRelationship.SOURCE]: {
nodeId: 'item-1',
metadata: {}
}
}
})
])
await client.execute({
sql: 'UPDATE test_embeddings SET metadata = ? WHERE id = ?',
args: ['{"itemId":', 'chunk-invalid-list-metadata']
})
const chunks = await store.listByExternalId('item-1')
expect(chunks).toHaveLength(1)
expect(chunks[0]?.id_).toBe('chunk-invalid-list-metadata')
expect(chunks[0]?.metadata).toEqual({ itemId: 'item-1' })
expect(warnSpy).toHaveBeenCalledWith(
'Failed to parse metadata JSON for row chunk-invalid-list-metadata',
expect.any(Error)
)
warnSpy.mockRestore()
})
it('should respect collection when listing documents', async () => {
store.setCollection('collection-a')
await store.add([
new TextNode({
id_: 'collection-a-chunk',
text: 'collection a chunk',
embedding: [0.1, 0.2],
metadata: { itemId: 'item-1', chunkIndex: 0, tokenCount: 3 },
relationships: {
[NodeRelationship.SOURCE]: {
nodeId: 'item-1',
metadata: {}
}
}
})
])
store.setCollection('collection-b')
await store.add([
new TextNode({
id_: 'collection-b-chunk',
text: 'collection b chunk',
embedding: [0.3, 0.4],
metadata: { itemId: 'item-1', chunkIndex: 0, tokenCount: 3 },
relationships: {
[NodeRelationship.SOURCE]: {
nodeId: 'item-1',
metadata: {}
}
}
})
])
const chunks = await store.listByExternalId('item-1')
expect(chunks.map((chunk) => chunk.id_)).toEqual(['collection-b-chunk'])
})
})
})
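
The `listByExternalId` tests above rely on two behaviors: malformed metadata JSON degrades to an empty object, and a missing `itemId` is filled from `external_id`. The following is a minimal sketch of that row mapping under stated assumptions. `Row`, `parseMetadata`, and `resolveMetadata` are hypothetical names, the store's real `parseJson` helper also logs a warning (omitted here), and treating non-object JSON as empty is an assumption of this sketch.

```typescript
type Metadata = Record<string, unknown>

// Hypothetical subset of the columns selected by listByExternalId.
interface Row {
  id: string
  external_id: string | null
  metadata: string | null
}

// Tolerant parse: missing, invalid, or non-object JSON yields an empty object.
function parseMetadata(raw: string | null): Metadata {
  if (raw === null) {
    return {}
  }
  try {
    const parsed: unknown = JSON.parse(raw)
    const isPlainObject = typeof parsed === 'object' && parsed !== null && !Array.isArray(parsed)
    return isPlainObject ? (parsed as Metadata) : {}
  } catch {
    return {}
  }
}

// Fill metadata.itemId from external_id only when the metadata does not carry one.
function resolveMetadata(row: Row): Metadata {
  const metadata = parseMetadata(row.metadata)
  const externalId = row.external_id !== null && row.external_id.length > 0 ? row.external_id : undefined
  if (externalId !== undefined && metadata.itemId === undefined) {
    metadata.itemId = externalId
  }
  return metadata
}

const broken: Row = { id: 'chunk-1', external_id: 'item-1', metadata: '{"itemId":' }
console.log(JSON.stringify(resolveMetadata(broken))) // {"itemId":"item-1"}
```

An `itemId` already present in the metadata is left untouched. The fallback only covers rows written without one.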


@@ -59,6 +59,17 @@ describe('groupHandlers', () => {
expect(listByEntityTypeMock).not.toHaveBeenCalled()
})
it('should accept knowledge entityType in GET query params', async () => {
listByEntityTypeMock.mockResolvedValueOnce([{ id: 'g-knowledge', entityType: 'knowledge', name: 'Knowledge' }])
const result = await groupHandlers['/groups'].GET({
query: { entityType: 'knowledge' }
} as never)
expect(listByEntityTypeMock).toHaveBeenCalledWith('knowledge')
expect(result).toEqual([{ id: 'g-knowledge', entityType: 'knowledge', name: 'Knowledge' }])
})
})
describe('/groups POST', () => {
@@ -93,6 +104,18 @@ describe('groupHandlers', () => {
expect(createMock).not.toHaveBeenCalled()
})
it('should accept knowledge entityType in POST body', async () => {
createMock.mockResolvedValueOnce({ id: 'g-knowledge', entityType: 'knowledge', name: 'Knowledge' })
await expect(
groupHandlers['/groups'].POST({
body: { entityType: 'knowledge', name: 'Knowledge' }
} as never)
).resolves.toMatchObject({ id: 'g-knowledge', entityType: 'knowledge' })
expect(createMock).toHaveBeenCalledWith({ entityType: 'knowledge', name: 'Knowledge' })
})
})
describe('/groups/:id', () => {


@@ -1,34 +1,24 @@
import type { CreateKnowledgeItemsDto } from '@shared/data/api/schemas/knowledges'
import { beforeEach, describe, expect, it, vi } from 'vitest'
const {
listKnowledgeBasesMock,
createKnowledgeBaseMock,
getKnowledgeBaseByIdMock,
updateKnowledgeBaseMock,
deleteKnowledgeBaseMock,
listKnowledgeItemsMock,
createKnowledgeItemsMock,
getKnowledgeItemByIdMock,
updateKnowledgeItemMock,
deleteKnowledgeItemMock
getKnowledgeItemByIdMock
} = vi.hoisted(() => ({
listKnowledgeBasesMock: vi.fn(),
createKnowledgeBaseMock: vi.fn(),
getKnowledgeBaseByIdMock: vi.fn(),
updateKnowledgeBaseMock: vi.fn(),
deleteKnowledgeBaseMock: vi.fn(),
listKnowledgeItemsMock: vi.fn(),
createKnowledgeItemsMock: vi.fn(),
getKnowledgeItemByIdMock: vi.fn(),
updateKnowledgeItemMock: vi.fn(),
deleteKnowledgeItemMock: vi.fn()
getKnowledgeItemByIdMock: vi.fn()
}))
vi.mock('@data/services/KnowledgeBaseService', () => ({
knowledgeBaseService: {
list: listKnowledgeBasesMock,
create: createKnowledgeBaseMock,
getById: getKnowledgeBaseByIdMock,
update: updateKnowledgeBaseMock,
delete: deleteKnowledgeBaseMock
@@ -38,10 +28,7 @@ vi.mock('@data/services/KnowledgeBaseService', () => ({
vi.mock('@data/services/KnowledgeItemService', () => ({
knowledgeItemService: {
list: listKnowledgeItemsMock,
createMany: createKnowledgeItemsMock,
getById: getKnowledgeItemByIdMock,
update: updateKnowledgeItemMock,
delete: deleteKnowledgeItemMock
getById: getKnowledgeItemByIdMock
}
}))
@@ -112,64 +99,12 @@ describe('knowledgeHandlers', () => {
expect(listKnowledgeBasesMock).not.toHaveBeenCalled()
})
it('should parse and delegate POST to knowledgeBaseService.create', async () => {
const body = {
name: ' Knowledge Base ',
dimensions: 1536,
embeddingModelId: ' text-embedding-3-large '
}
createKnowledgeBaseMock.mockResolvedValueOnce({
id: 'kb-1',
name: 'Knowledge Base',
dimensions: 1536,
embeddingModelId: 'text-embedding-3-large'
})
const result = await knowledgeHandlers['/knowledge-bases'].POST({ body })
expect(createKnowledgeBaseMock).toHaveBeenCalledWith({
name: 'Knowledge Base',
dimensions: 1536,
embeddingModelId: 'text-embedding-3-large'
})
expect(result).toMatchObject({ id: 'kb-1' })
})
it('should reject invalid POST bodies before calling the service', async () => {
await expect(
knowledgeHandlers['/knowledge-bases'].POST({
body: {
name: ' ',
dimensions: 1536,
embeddingModelId: 'model-1'
}
} as never)
).rejects.toHaveProperty('name', 'ZodError')
expect(createKnowledgeBaseMock).not.toHaveBeenCalled()
})
it('should reject blank embedding model ids before calling the service', async () => {
await expect(
knowledgeHandlers['/knowledge-bases'].POST({
body: {
name: 'Knowledge Base',
dimensions: 1536,
embeddingModelId: ' '
}
} as never)
).rejects.toHaveProperty('name', 'ZodError')
expect(createKnowledgeBaseMock).not.toHaveBeenCalled()
})
})
describe('/knowledge-bases/:id', () => {
it('should delegate GET/PATCH/DELETE with the path id', async () => {
it('should delegate GET/PATCH with the path id', async () => {
getKnowledgeBaseByIdMock.mockResolvedValueOnce({ id: 'kb-1' })
updateKnowledgeBaseMock.mockResolvedValueOnce({ id: 'kb-1', name: 'Updated Base' })
deleteKnowledgeBaseMock.mockResolvedValueOnce(undefined)
await expect(knowledgeHandlers['/knowledge-bases/:id'].GET({ params: { id: 'kb-1' } })).resolves.toEqual({
id: 'kb-1'
@@ -185,15 +120,9 @@ describe('knowledgeHandlers', () => {
name: 'Updated Base'
})
await expect(
knowledgeHandlers['/knowledge-bases/:id'].DELETE({
params: { id: 'kb-1' }
})
).resolves.toBeUndefined()
expect(getKnowledgeBaseByIdMock).toHaveBeenCalledWith('kb-1')
expect(updateKnowledgeBaseMock).toHaveBeenCalledWith('kb-1', { name: 'Updated Base' })
expect(deleteKnowledgeBaseMock).toHaveBeenCalledWith('kb-1')
expect(deleteKnowledgeBaseMock).not.toHaveBeenCalled()
})
it('should reject invalid PATCH bodies before calling the service', async () => {
@@ -222,19 +151,36 @@ describe('knowledgeHandlers', () => {
expect(updateKnowledgeBaseMock).not.toHaveBeenCalled()
})
it('should allow embeddingModelId updates and normalize them before calling the service', async () => {
updateKnowledgeBaseMock.mockResolvedValueOnce({ id: 'kb-1', embeddingModelId: 'new-model' })
it('should reject embeddingModelId updates before calling the service', async () => {
await expect(
knowledgeHandlers['/knowledge-bases/:id'].PATCH({
params: { id: 'kb-1' },
body: {
embeddingModelId: ' new-model '
}
} as never)
).rejects.toHaveProperty('name', 'ZodError')
expect(updateKnowledgeBaseMock).not.toHaveBeenCalled()
})
it('should trim groupId and keep emoji unchanged in PATCH bodies before calling the service', async () => {
updateKnowledgeBaseMock.mockResolvedValueOnce({ id: 'kb-1', groupId: 'group-1', emoji: '📚' })
await expect(
knowledgeHandlers['/knowledge-bases/:id'].PATCH({
params: { id: 'kb-1' },
body: {
groupId: ' group-1 ',
emoji: '📚'
}
})
).resolves.toMatchObject({ id: 'kb-1' })
expect(updateKnowledgeBaseMock).toHaveBeenCalledWith('kb-1', { embeddingModelId: 'new-model' })
expect(updateKnowledgeBaseMock).toHaveBeenCalledWith('kb-1', {
groupId: 'group-1',
emoji: '📚'
})
})
it('should reject null embeddingModelId clears before calling the service', async () => {
@@ -249,6 +195,66 @@ describe('knowledgeHandlers', () => {
expect(updateKnowledgeBaseMock).not.toHaveBeenCalled()
})
it('should reject optional config null clears before calling the service', async () => {
await expect(
knowledgeHandlers['/knowledge-bases/:id'].PATCH({
params: { id: 'kb-1' },
body: {
rerankModelId: null,
fileProcessorId: null,
threshold: null,
documentCount: null,
hybridAlpha: null
}
} as never)
).rejects.toHaveProperty('name', 'ZodError')
expect(updateKnowledgeBaseMock).not.toHaveBeenCalled()
})
it('should reject invalid emoji in PATCH bodies before calling the service', async () => {
await expect(
knowledgeHandlers['/knowledge-bases/:id'].PATCH({
params: { id: 'kb-1' },
body: {
emoji: 'books'
}
} as never)
).rejects.toHaveProperty('name', 'ZodError')
expect(updateKnowledgeBaseMock).not.toHaveBeenCalled()
})
it('should reject whitespace-padded emoji in PATCH bodies before calling the service', async () => {
await expect(
knowledgeHandlers['/knowledge-bases/:id'].PATCH({
params: { id: 'kb-1' },
body: {
emoji: ' 📚 '
}
} as never)
).rejects.toHaveProperty('name', 'ZodError')
expect(updateKnowledgeBaseMock).not.toHaveBeenCalled()
})
it('should pass null groupId clears before calling the service', async () => {
updateKnowledgeBaseMock.mockResolvedValueOnce({ id: 'kb-1', groupId: null })
await expect(
knowledgeHandlers['/knowledge-bases/:id'].PATCH({
params: { id: 'kb-1' },
body: {
groupId: null
}
})
).resolves.toMatchObject({ id: 'kb-1', groupId: null })
expect(updateKnowledgeBaseMock).toHaveBeenCalledWith('kb-1', {
groupId: null
})
})
})
describe('/knowledge-bases/:id/items', () => {
@@ -294,6 +300,27 @@ describe('knowledgeHandlers', () => {
})
})
it('should pass null groupId root filters to knowledge item listing', async () => {
listKnowledgeItemsMock.mockResolvedValueOnce({
items: [],
total: 0,
page: 1
})
await knowledgeHandlers['/knowledge-bases/:id/items'].GET({
params: { id: 'kb-1' },
query: {
groupId: null
}
} as never)
expect(listKnowledgeItemsMock).toHaveBeenCalledWith('kb-1', {
page: KNOWLEDGE_ITEMS_DEFAULT_PAGE,
limit: KNOWLEDGE_ITEMS_DEFAULT_LIMIT,
groupId: null
})
})
it('should reject non-positive page values', async () => {
await expect(
knowledgeHandlers['/knowledge-bases/:id/items'].GET({
@@ -332,267 +359,17 @@ describe('knowledgeHandlers', () => {
expect(listKnowledgeItemsMock).not.toHaveBeenCalled()
})
it('should delegate POST to knowledgeItemService.createMany', async () => {
const body: CreateKnowledgeItemsDto = {
items: [
{
groupId: 'group-1',
type: 'note',
data: { content: 'hello world' }
}
]
}
createKnowledgeItemsMock.mockResolvedValueOnce({
items: [
{
id: 'item-1',
baseId: 'kb-1',
groupId: 'group-1',
type: 'note',
data: { content: 'hello world' }
}
]
})
const result = await knowledgeHandlers['/knowledge-bases/:id/items'].POST({
params: { id: 'kb-1' },
body
})
expect(createKnowledgeItemsMock).toHaveBeenCalledWith('kb-1', {
items: [
{
groupId: 'group-1',
type: 'note',
data: { content: 'hello world' }
}
]
})
expect(result).toMatchObject({
items: [
{
id: 'item-1'
}
]
})
})
it('should accept sitemap owner items with grouped url children', async () => {
const body: CreateKnowledgeItemsDto = {
items: [
{
ref: 'sitemap-root',
type: 'sitemap',
data: {
url: 'https://example.com/sitemap.xml',
name: 'Example Sitemap'
}
},
{
groupRef: 'sitemap-root',
type: 'url',
data: {
url: 'https://example.com/page-a',
name: 'Page A'
}
}
]
}
createKnowledgeItemsMock.mockResolvedValueOnce({
items: [
{
id: 'sitemap-1',
baseId: 'kb-1',
groupId: null,
type: 'sitemap',
data: {
url: 'https://example.com/sitemap.xml',
name: 'Example Sitemap'
}
},
{
id: 'url-1',
baseId: 'kb-1',
groupId: 'sitemap-1',
type: 'url',
data: {
url: 'https://example.com/page-a',
name: 'Page A'
}
}
]
})
const result = await knowledgeHandlers['/knowledge-bases/:id/items'].POST({
params: { id: 'kb-1' },
body
})
expect(createKnowledgeItemsMock).toHaveBeenCalledWith('kb-1', body)
expect(result).toMatchObject({
items: [
{ id: 'sitemap-1', type: 'sitemap' },
{ id: 'url-1', type: 'url' }
]
})
})
it('should reject invalid POST bodies before calling the service', async () => {
await expect(
knowledgeHandlers['/knowledge-bases/:id/items'].POST({
params: { id: 'kb-1' },
body: {
items: []
}
} as never)
).rejects.toHaveProperty('name', 'ZodError')
expect(createKnowledgeItemsMock).not.toHaveBeenCalled()
})
it('should reject parentId in flat item create requests', async () => {
await expect(
knowledgeHandlers['/knowledge-bases/:id/items'].POST({
params: { id: 'kb-1' },
body: {
items: [
{
parentId: '550e8400-e29b-41d4-a716-446655440001',
type: 'note',
data: { content: 'hello world' }
}
]
}
} as never)
).rejects.toHaveProperty('name', 'ZodError')
expect(createKnowledgeItemsMock).not.toHaveBeenCalled()
})
it('should reject POST bodies that specify both groupId and groupRef', async () => {
await expect(
knowledgeHandlers['/knowledge-bases/:id/items'].POST({
params: { id: 'kb-1' },
body: {
items: [
{
groupId: 'group-1',
groupRef: 'root',
type: 'note',
data: { content: 'hello world' }
}
]
}
} as never)
).rejects.toHaveProperty('name', 'ZodError')
expect(createKnowledgeItemsMock).not.toHaveBeenCalled()
})
it('should reject POST bodies with duplicate refs', async () => {
await expect(
knowledgeHandlers['/knowledge-bases/:id/items'].POST({
params: { id: 'kb-1' },
body: {
items: [
{
ref: 'duplicate',
type: 'directory',
data: { name: 'files', path: '/tmp/files' }
},
{
ref: 'duplicate',
type: 'note',
data: { content: 'hello world' }
}
]
}
} as never)
).rejects.toHaveProperty('name', 'ZodError')
expect(createKnowledgeItemsMock).not.toHaveBeenCalled()
})
it('should reject POST bodies with missing groupRef targets', async () => {
await expect(
knowledgeHandlers['/knowledge-bases/:id/items'].POST({
params: { id: 'kb-1' },
body: {
items: [
{
groupRef: 'missing-root',
type: 'url',
data: {
url: 'https://example.com/page-a',
name: 'Page A'
}
}
]
}
} as never)
).rejects.toHaveProperty('name', 'ZodError')
expect(createKnowledgeItemsMock).not.toHaveBeenCalled()
})
})
describe('/knowledge-items/:id', () => {
it('should delegate GET/PATCH/DELETE with the item id', async () => {
it('should delegate GET with the item id', async () => {
getKnowledgeItemByIdMock.mockResolvedValueOnce({ id: 'item-1' })
updateKnowledgeItemMock.mockResolvedValueOnce({ id: 'item-1', status: 'completed' })
deleteKnowledgeItemMock.mockResolvedValueOnce(undefined)
await expect(knowledgeHandlers['/knowledge-items/:id'].GET({ params: { id: 'item-1' } })).resolves.toEqual({
id: 'item-1'
})
await expect(
knowledgeHandlers['/knowledge-items/:id'].PATCH({
params: { id: 'item-1' },
body: { status: 'completed' }
})
).resolves.toEqual({
id: 'item-1',
status: 'completed'
})
await expect(
knowledgeHandlers['/knowledge-items/:id'].DELETE({
params: { id: 'item-1' }
})
).resolves.toBeUndefined()
expect(getKnowledgeItemByIdMock).toHaveBeenCalledWith('item-1')
expect(updateKnowledgeItemMock).toHaveBeenCalledWith('item-1', { status: 'completed' })
expect(deleteKnowledgeItemMock).toHaveBeenCalledWith('item-1')
})
it('should reject invalid PATCH bodies before calling the service', async () => {
await expect(
knowledgeHandlers['/knowledge-items/:id'].PATCH({
params: { id: 'item-1' },
body: {
status: 'unknown'
}
} as never)
).rejects.toHaveProperty('name', 'ZodError')
expect(updateKnowledgeItemMock).not.toHaveBeenCalled()
})
it('should reject groupId in PATCH bodies before calling the service', async () => {
await expect(
knowledgeHandlers['/knowledge-items/:id'].PATCH({
params: { id: 'item-1' },
body: {
groupId: 'group-1'
}
} as never)
).rejects.toHaveProperty('name', 'ZodError')
expect(updateKnowledgeItemMock).not.toHaveBeenCalled()
})
})
})


@@ -1,5 +1,14 @@
/**
* Knowledge API Handlers.
* Knowledge API Handlers
*
* Implements the SQLite-backed knowledge endpoints:
* - Knowledge base list/detail reads
* - Knowledge base metadata/config updates
* - Knowledge item reads within a base or by item id
*
* DataApi only exposes operations that are satisfied by the database layer.
* Runtime/index mutations that create, delete, restore, or reindex vector-store
* artifacts are coordinated by `KnowledgeOrchestrationService` instead.
*/
import { knowledgeBaseService } from '@data/services/KnowledgeBaseService'
@@ -7,23 +16,16 @@ import { knowledgeItemService } from '@data/services/KnowledgeItemService'
import type { HandlersFor } from '@shared/data/api/apiTypes'
import type { KnowledgeSchemas } from '@shared/data/api/schemas/knowledges'
import {
CreateKnowledgeBaseSchema,
CreateKnowledgeItemsSchema,
KnowledgeBaseListQuerySchema,
KnowledgeItemsQuerySchema,
UpdateKnowledgeBaseSchema,
UpdateKnowledgeItemSchema
ListKnowledgeBasesQuerySchema,
ListKnowledgeItemsQuerySchema,
UpdateKnowledgeBaseSchema
} from '@shared/data/api/schemas/knowledges'
export const knowledgeHandlers: HandlersFor<KnowledgeSchemas> = {
'/knowledge-bases': {
GET: async ({ query }) => {
const parsed = KnowledgeBaseListQuerySchema.parse(query ?? {})
const parsed = ListKnowledgeBasesQuerySchema.parse(query ?? {})
return await knowledgeBaseService.list(parsed)
},
POST: async ({ body }) => {
const parsed = CreateKnowledgeBaseSchema.parse(body)
return await knowledgeBaseService.create(parsed)
}
},
@@ -34,35 +36,19 @@ export const knowledgeHandlers: HandlersFor<KnowledgeSchemas> = {
PATCH: async ({ params, body }) => {
const parsed = UpdateKnowledgeBaseSchema.parse(body)
return await knowledgeBaseService.update(params.id, parsed)
},
DELETE: async ({ params }) => {
await knowledgeBaseService.delete(params.id)
return undefined
}
},
'/knowledge-bases/:id/items': {
GET: async ({ params, query }) => {
const parsed = KnowledgeItemsQuerySchema.parse(query ?? {})
const parsed = ListKnowledgeItemsQuerySchema.parse(query ?? {})
return await knowledgeItemService.list(params.id, parsed)
},
POST: async ({ params, body }) => {
const parsed = CreateKnowledgeItemsSchema.parse(body)
return await knowledgeItemService.createMany(params.id, parsed)
}
},
'/knowledge-items/:id': {
GET: async ({ params }) => {
return await knowledgeItemService.getById(params.id)
},
PATCH: async ({ params, body }) => {
const parsed = UpdateKnowledgeItemSchema.parse(body)
return await knowledgeItemService.update(params.id, parsed)
},
DELETE: async ({ params }) => {
await knowledgeItemService.delete(params.id)
return undefined
}
}
}
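
The remaining `PATCH` handler follows a parse-then-delegate pattern: the body is validated first, and the data service is called only with the parsed result. The sketch below illustrates that pattern for `groupId` alone and is not the real schema. `KnowledgeBasePatch`, `parsePatch`, and `patchBase` are hypothetical names, errors are plain `Error` objects rather than `ZodError`, and rejecting a blank `groupId` is an assumption carried over from the create schema.

```typescript
// Hypothetical, heavily reduced patch shape; UpdateKnowledgeBaseSchema has more fields.
interface KnowledgeBasePatch {
  groupId?: string | null
}

// Strict parse: unknown keys (for example embeddingModelId) are rejected outright.
function parsePatch(body: Record<string, unknown>): KnowledgeBasePatch {
  const patch: KnowledgeBasePatch = {}
  for (const key of Object.keys(body)) {
    if (key !== 'groupId') {
      throw new Error(`Unrecognized key: ${key}`)
    }
    const value = body[key]
    if (value === null) {
      // null clears the group assignment and is passed through unchanged.
      patch.groupId = null
    } else if (typeof value === 'string' && value.trim().length > 0) {
      patch.groupId = value.trim()
    } else {
      throw new Error('Invalid groupId')
    }
  }
  return patch
}

// Parse first; the update callback (a stand-in for the data service) only sees valid input.
function patchBase<T>(
  id: string,
  body: Record<string, unknown>,
  update: (id: string, patch: KnowledgeBasePatch) => T
): T {
  const parsed = parsePatch(body)
  return update(id, parsed)
}

const result = patchBase('kb-1', { groupId: ' group-1 ' }, (id, patch) => ({ id, ...patch }))
console.log(result) // { id: 'kb-1', groupId: 'group-1' }
```

Because parsing throws before `update` is reached, a rejected body never touches the service. That is the property the handler tests assert with `not.toHaveBeenCalled()`.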


@@ -1,59 +1,72 @@
import type {
KnowledgeItemData,
KnowledgeItemStatus,
KnowledgeItemType,
KnowledgeSearchMode
import {
type KnowledgeBaseErrorCode,
type KnowledgeBaseStatus,
type KnowledgeItemData,
type KnowledgeItemPhase,
type KnowledgeItemStatus,
type KnowledgeItemType,
type KnowledgeSearchMode
} from '@shared/data/types/knowledge'
import { sql } from 'drizzle-orm'
import { check, foreignKey, index, integer, real, sqliteTable, text, unique } from 'drizzle-orm/sqlite-core'
import { createUpdateTimestamps, uuidPrimaryKey, uuidPrimaryKeyOrdered } from './_columnHelpers'
import { groupTable } from './group'
import { userModelTable } from './userModel'
/**
* knowledge_base table - Knowledge base metadata
*/
// Durable base metadata; per-base vector stores remain runtime artifacts.
export const knowledgeBaseTable = sqliteTable(
'knowledge_base',
{
id: uuidPrimaryKey(),
name: text().notNull(),
description: text(),
dimensions: integer().notNull(),
groupId: text().references(() => groupTable.id, { onDelete: 'set null' }),
emoji: text().notNull(),
dimensions: integer(),
// Embedding model: FK to user_model(id) — UniqueModelId "providerId::modelId"
embeddingModelId: text().references(() => userModelTable.id, { onDelete: 'set null' }),
embeddingModelId: text().references(() => userModelTable.id),
// Rerank model: FK to user_model(id) — UniqueModelId "providerId::modelId"
status: text().$type<KnowledgeBaseStatus>().notNull(),
error: text().$type<KnowledgeBaseErrorCode>(),
// Preserve the base when an optional rerank model is removed.
rerankModelId: text().references(() => userModelTable.id, { onDelete: 'set null' }),
// File processing processor ID
fileProcessorId: text(),
// Configuration
chunkSize: integer(),
chunkOverlap: integer(),
chunkSize: integer().notNull(),
chunkOverlap: integer().notNull(),
threshold: real(),
documentCount: integer(),
searchMode: text().$type<KnowledgeSearchMode>(),
searchMode: text().$type<KnowledgeSearchMode>().notNull(),
hybridAlpha: real(),
...createUpdateTimestamps
},
(t) => [
check('knowledge_base_search_mode_check', sql`${t.searchMode} IN ('default', 'bm25', 'hybrid')`),
check('knowledge_base_status_check', sql`${t.status} IN ('completed', 'failed')`),
check(
'knowledge_base_search_mode_check',
sql`${t.searchMode} IN ('default', 'bm25', 'hybrid') OR ${t.searchMode} IS NULL`
'knowledge_base_status_error_check',
sql`
(
${t.status} = 'completed'
AND ${t.embeddingModelId} IS NOT NULL
AND ${t.dimensions} IS NOT NULL
AND ${t.dimensions} > 0
AND ${t.error} IS NULL
)
OR (
${t.status} = 'failed'
AND ${t.error} IS NOT NULL
AND length(trim(${t.error})) > 0
)
`
)
]
)
/**
* knowledge_item table - Knowledge items (files, URLs, notes, etc.)
*
* Uses uuidPrimaryKeyOrdered (UUID v7) because knowledge items are a growing,
* time-ordered dataset with paginated list queries.
*/
// User-added sources and expanded import children; chunks/embeddings live in the vector store.
export const knowledgeItemTable = sqliteTable(
'knowledge_item',
{
@@ -62,35 +75,59 @@ export const knowledgeItemTable = sqliteTable(
.notNull()
.references(() => knowledgeBaseTable.id, { onDelete: 'cascade' }),
// Stable business grouping for items from the same source/container.
// Examples: one directory import, one sitemap expansion, one URL collection.
// The composite self-FK below keeps expanded children in the owner's base.
groupId: text(),
// Type: 'file' | 'url' | 'note' | 'sitemap' | 'directory'
type: text().$type<KnowledgeItemType>().notNull(),
// Unified data field (Discriminated Union)
data: text({ mode: 'json' }).$type<KnowledgeItemData>().notNull(),
// Processing status
status: text().$type<KnowledgeItemStatus>().notNull().default('idle'),
status: text().$type<KnowledgeItemStatus>().notNull(),
phase: text().$type<KnowledgeItemPhase>(),
error: text(),
...createUpdateTimestamps
},
(t) => [
check('knowledge_item_type_check', sql`${t.type} IN ('file', 'url', 'note', 'sitemap', 'directory')`),
check('knowledge_item_status_check', sql`${t.status} IN ('idle', 'processing', 'completed', 'failed')`),
check(
'knowledge_item_status_check',
sql`${t.status} IN ('idle', 'pending', 'file_processing', 'read', 'embed', 'completed', 'failed')`
'knowledge_item_phase_check',
sql`
${t.phase} IS NULL
OR (${t.type} IN ('file', 'url', 'note') AND ${t.phase} IN ('reading', 'embedding'))
OR (${t.type} IN ('directory', 'sitemap') AND ${t.phase} = 'preparing')
`
),
// Enforce that group owners live inside the same knowledge base.
check(
'knowledge_item_status_phase_error_check',
sql`
(
${t.status} IN ('idle', 'completed')
AND ${t.phase} IS NULL
AND ${t.error} IS NULL
)
OR (
-- Containers may stay processing after their own prepare phase ends
-- while descendant leaf items continue reading/embedding.
${t.status} = 'processing'
AND ${t.error} IS NULL
)
OR (
${t.status} = 'failed'
AND ${t.phase} IS NULL
AND ${t.error} IS NOT NULL
AND length(trim(${t.error})) > 0
)
`
),
// Deletes expanded children when their group-owner item is deleted.
foreignKey({ columns: [t.baseId, t.groupId], foreignColumns: [t.baseId, t.id] }).onDelete('cascade'),
// Main tab/list query path: same-base items filtered by type and ordered by createdAt.
// Supports list queries by base/type with stable creation ordering.
index('knowledge_item_base_type_created_idx').on(t.baseId, t.type, t.createdAt),
// Group result lookups, e.g. show all items from one imported source/container.
// Supports fetches of all children for a group owner inside a base.
index('knowledge_item_base_group_created_idx').on(t.baseId, t.groupId, t.createdAt),
// Required by the same-base self-reference on (baseId, groupId) -> (baseId, id).
// Required target for the composite self-reference above.
unique('knowledge_item_baseId_id_unique').on(t.baseId, t.id)
]
)
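
The `knowledge_item_status_phase_error_check` constraint above can be read as a predicate over a single row. The following is a minimal sketch of that reading, not code from this commit: the `ItemRow` shape and the function name `satisfiesStatusPhaseErrorCheck` are hypothetical, and the database CHECK remains the source of truth.

```typescript
// Hypothetical row shape for illustration; the real columns live in knowledgeItemTable.
type ItemStatus = 'idle' | 'processing' | 'completed' | 'failed'
type ItemPhase = 'reading' | 'embedding' | 'preparing' | null

interface ItemRow {
  status: ItemStatus
  phase: ItemPhase
  error: string | null
}

// SQLite trim() strips spaces only, so strip spaces only here as well.
const trimSpaces = (value: string): string => value.replace(/^ +| +$/g, '')

// Mirrors the three OR-branches of knowledge_item_status_phase_error_check.
function satisfiesStatusPhaseErrorCheck(row: ItemRow): boolean {
  if (row.status === 'idle' || row.status === 'completed') {
    return row.phase === null && row.error === null
  }
  if (row.status === 'processing') {
    // phase may be null here: a container can stay processing after its
    // own prepare phase ends while leaf items keep reading/embedding.
    return row.error === null
  }
  // status === 'failed': no phase, and a non-blank error is required.
  return row.phase === null && row.error !== null && trimSpaces(row.error).length > 0
}
```

Under this reading, `{ status: 'processing', phase: null, error: null }` is accepted, while a `failed` row with a blank error string is rejected.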


@@ -100,7 +100,7 @@ export async function createMigrationContext(
dexieExport: dexieFileReader,
dexieSettings: new DexieSettingsReader(dexieSettingsRecords),
localStorage: new LocalStorageReader(localStorageRecords),
knowledgeVectorSource: new KnowledgeVectorSourceReader(),
knowledgeVectorSource: new KnowledgeVectorSourceReader(paths.knowledgeBaseDir),
legacyHomeConfig: new LegacyHomeConfigReader(paths.legacyConfigFile)
},
db,


@@ -1,15 +1,4 @@
/**
* Knowledge migrator - migrates knowledge bases and items from Redux/Dexie to SQLite
*
* Data sources:
* - Redux knowledge slice (`knowledge.bases`)
* - Dexie `knowledge_notes` table (full note content)
* - Dexie `files` table (file metadata fallback)
*
* Target tables:
* - `knowledge_base`
* - `knowledge_item`
*/
/** Migrates legacy knowledge bases/items from Redux and Dexie exports into SQLite. */
import fs from 'node:fs'
import path from 'node:path'
@@ -22,6 +11,7 @@ import { loggerService } from '@logger'
import { sanitizeFilename } from '@main/utils/file'
import type { ExecuteResult, PrepareResult, ValidateResult, ValidationError } from '@shared/data/migration/v2/types'
import type { FileMetadata } from '@shared/data/types/file'
import { KNOWLEDGE_BASE_ERROR_MISSING_EMBEDDING_MODEL } from '@shared/data/types/knowledge'
import { sql } from 'drizzle-orm'
import type { MigrationContext } from '../core/MigrationContext'
@@ -37,13 +27,14 @@ import {
transformKnowledgeBase,
transformKnowledgeItem
} from './mappings/KnowledgeMappings'
import { resolveModelReference } from './transformers/ModelTransformers'
import { legacyModelToUniqueId, resolveModelReference } from './transformers/ModelTransformers'
const logger = loggerService.withContext('KnowledgeMigrator')
const ITEM_INSERT_BATCH_SIZE = 200
const LOOKUP_STREAM_BATCH_SIZE = 200
const LEGACY_VECTOR_TABLE_NAME = 'vectors'
const SKIP_WARNING_SAMPLE_LIMIT = 3
type DimensionResolutionReason =
| 'ok'
@@ -108,6 +99,12 @@ const getInvalidKnowledgeBaseConfigWarning = (
return `Knowledge base ${base.id}: cleared invalid config fields: ${clearedFields.join(', ')}`
}
const resolveLegacyKnowledgeBaseDimensions = (base: LegacyKnowledgeBaseWithIdentity): number | null => {
return typeof base.dimensions === 'number' && Number.isInteger(base.dimensions) && base.dimensions > 0
? base.dimensions
: null
}
export class KnowledgeMigrator extends BaseMigrator {
readonly id = 'knowledge'
readonly name = 'KnowledgeBase'
@@ -119,6 +116,7 @@ export class KnowledgeMigrator extends BaseMigrator {
private preparedBases: NewKnowledgeBase[] = []
private preparedItems: NewKnowledgeItem[] = []
private warnings: string[] = []
private skippedWarnings = new Map<string, { count: number; samples: string[] }>()
private seenBaseIds = new Set<string>()
private seenItemIds = new Set<string>()
@@ -128,16 +126,36 @@ export class KnowledgeMigrator extends BaseMigrator {
this.preparedBases = []
this.preparedItems = []
this.warnings = []
this.skippedWarnings = new Map<string, { count: number; samples: string[] }>()
this.seenBaseIds = new Set<string>()
this.seenItemIds = new Set<string>()
}
private recordWarning(message: string): void {
logger.warn(message)
this.warnings.push(message)
}
private recordSkippedWarning(reason: string, message: string): void {
const bucket = this.skippedWarnings.get(reason) ?? { count: 0, samples: [] }
bucket.count += 1
if (bucket.samples.length < SKIP_WARNING_SAMPLE_LIMIT) {
bucket.samples.push(message)
}
this.skippedWarnings.set(reason, bucket)
}
private flushSkippedWarnings(): void {
for (const [reason, bucket] of this.skippedWarnings) {
const summary = `Skipped knowledge records (${reason}): count=${bucket.count}; examples: ${bucket.samples.join(' | ')}`
this.recordWarning(summary)
}
this.skippedWarnings.clear()
}
private getLegacyKnowledgeDbPath(baseId: string, knowledgeBaseDir: string): string | null {
// The knowledge base directory comes from MigrationPaths, which is resolved
// once at the migration gate entry by resolveMigrationPaths(). This avoids
// calling app.getPath('userData') directly (which would miss custom userData
// overrides from legacy config.json) and avoids the v2 path registry (which
// is not available during migration).
// MigrationPaths already accounts for legacy custom userData before the v2 path registry is available.
const rootPath = knowledgeBaseDir
const sanitizedBaseId = sanitizeFilename(baseId, '_')
const resolvedDbPath = path.resolve(rootPath, sanitizedBaseId)
@@ -145,8 +163,7 @@ export class KnowledgeMigrator extends BaseMigrator {
if (relativePath === '' || relativePath.startsWith('..') || path.isAbsolute(relativePath)) {
const warningMessage = `Skipped knowledge base ${baseId}: invalid legacy vector DB path`
logger.warn(warningMessage)
this.warnings.push(warningMessage)
this.recordWarning(warningMessage)
return null
}
@@ -175,8 +192,7 @@ export class KnowledgeMigrator extends BaseMigrator {
if (blobLength % Float32Array.BYTES_PER_ELEMENT !== 0) {
const warningMessage = `Invalid vector blob length for knowledge base ${baseId}: ${blobLength} is not divisible by ${Float32Array.BYTES_PER_ELEMENT}`
logger.warn(warningMessage)
this.warnings.push(warningMessage)
this.recordWarning(warningMessage)
return null
}
@@ -230,8 +246,7 @@ export class KnowledgeMigrator extends BaseMigrator {
const warningMessage = `Failed to inspect legacy vector DB for knowledge base ${base.id}: ${
error instanceof Error ? error.message : String(error)
}`
logger.warn(warningMessage)
this.warnings.push(warningMessage)
this.recordWarning(warningMessage)
return { dimensions: null, reason: 'vector_db_error' }
} finally {
if (client) {
@@ -241,8 +256,7 @@ export class KnowledgeMigrator extends BaseMigrator {
const warningMessage = `Failed to close legacy vector DB client for knowledge base ${base.id}: ${
error instanceof Error ? error.message : String(error)
}`
logger.warn(warningMessage)
this.warnings.push(warningMessage)
this.recordWarning(warningMessage)
}
}
}
@@ -312,8 +326,7 @@ export class KnowledgeMigrator extends BaseMigrator {
if (!(await ctx.sources.dexieExport.tableExists('knowledge_notes'))) {
const warningMessage = 'knowledge_notes export file not found - note content fallback to Redux item content'
logger.warn(warningMessage)
this.warnings.push(warningMessage)
this.recordWarning(warningMessage)
return noteById
}
@@ -343,8 +356,7 @@ export class KnowledgeMigrator extends BaseMigrator {
if (!(await ctx.sources.dexieExport.tableExists('files'))) {
const warningMessage = 'files export file not found - file item fallback by id disabled'
logger.warn(warningMessage)
this.warnings.push(warningMessage)
this.recordWarning(warningMessage)
return filesById
}
@@ -412,8 +424,7 @@ export class KnowledgeMigrator extends BaseMigrator {
if (!hasKnowledgeBaseIdentity(base)) {
this.skippedCount += 1
const warningMessage = 'Skipped invalid knowledge base: missing id or name'
logger.warn(warningMessage)
this.warnings.push(warningMessage)
this.recordSkippedWarning('invalid_knowledge_base_identity', warningMessage)
continue
}
@@ -425,44 +436,46 @@ export class KnowledgeMigrator extends BaseMigrator {
this.skippedCount += 1 + items.length
this.sourceCount += items.length
const warningMessage = `Skipped duplicate knowledge base ${validBase.id}`
logger.warn(warningMessage)
this.warnings.push(warningMessage)
this.recordSkippedWarning('duplicate_knowledge_base', warningMessage)
continue
}
const resolvedDimensions = await this.resolveDimensionsForBase(validBase, ctx.paths.knowledgeBaseDir)
const embeddingModelId = legacyModelToUniqueId(validBase.model ?? null)
const embeddingResolution = resolveModelReference(embeddingModelId, validModelIds)
const resolvedDimensions =
embeddingResolution.kind === 'resolved'
? await this.resolveDimensionsForBase(validBase, ctx.paths.knowledgeBaseDir)
: { dimensions: resolveLegacyKnowledgeBaseDimensions(validBase), reason: 'legacy_dimensions' as const }
if (resolvedDimensions.dimensions === null) {
if (embeddingResolution.kind === 'resolved' && resolvedDimensions.dimensions === null) {
this.skippedCount += 1 + items.length
this.sourceCount += items.length
const warningMessage = `Skipped knowledge base ${validBase.id}: ${resolvedDimensions.reason}`
logger.warn(warningMessage)
this.warnings.push(warningMessage)
this.recordSkippedWarning(`knowledge_base_${resolvedDimensions.reason}`, warningMessage)
continue
}
const baseResult = transformKnowledgeBase(validBase, resolvedDimensions.dimensions)
const preparedBase = { ...baseResult.value }
const embeddingResolution = resolveModelReference(preparedBase.embeddingModelId ?? null, validModelIds)
if (embeddingResolution.kind === 'resolved') {
preparedBase.embeddingModelId = embeddingResolution.modelId
} else {
preparedBase.embeddingModelId = null
const warningMessage =
embeddingResolution.kind === 'dangling'
? `Knowledge base ${validBase.id}: dangling embedding model reference ${embeddingResolution.modelId} was cleared`
: `Knowledge base ${validBase.id}: missing embedding model reference was cleared`
logger.warn(warningMessage)
this.warnings.push(warningMessage)
? `Knowledge base ${validBase.id}: dangling embedding model reference ${embeddingResolution.modelId} requires restore with a new embedding model`
: `Knowledge base ${validBase.id}: missing embedding model reference requires restore with a new embedding model`
this.recordWarning(warningMessage)
preparedBase.embeddingModelId = null
preparedBase.status = 'failed'
preparedBase.error = KNOWLEDGE_BASE_ERROR_MISSING_EMBEDDING_MODEL
}
const rerankResolution = resolveModelReference(preparedBase.rerankModelId ?? null, validModelIds)
preparedBase.rerankModelId = rerankResolution.kind === 'resolved' ? rerankResolution.modelId : null
if (rerankResolution.kind === 'dangling') {
const warningMessage = `Knowledge base ${validBase.id}: dangling rerank model reference ${rerankResolution.modelId} was cleared`
logger.warn(warningMessage)
this.warnings.push(warningMessage)
this.recordWarning(warningMessage)
}
this.seenBaseIds.add(preparedBase.id!)
@@ -470,8 +483,7 @@ export class KnowledgeMigrator extends BaseMigrator {
const invalidConfigWarning = getInvalidKnowledgeBaseConfigWarning(validBase, preparedBase)
if (invalidConfigWarning) {
logger.warn(invalidConfigWarning)
this.warnings.push(invalidConfigWarning)
this.recordWarning(invalidConfigWarning)
}
for (const item of items) {
@@ -485,16 +497,14 @@ export class KnowledgeMigrator extends BaseMigrator {
if (!itemResult.ok) {
this.skippedCount += 1
const warningMessage = this.formatItemWarning(validBase.id, item, itemResult.reason)
logger.warn(warningMessage)
this.warnings.push(warningMessage)
this.recordSkippedWarning(`knowledge_item_${itemResult.reason}`, warningMessage)
continue
}
if (this.seenItemIds.has(itemResult.value.id!)) {
this.skippedCount += 1
const warningMessage = `Skipped duplicate knowledge item ${itemResult.value.id!} in base ${validBase.id}`
logger.warn(warningMessage)
this.warnings.push(warningMessage)
this.recordSkippedWarning('duplicate_knowledge_item', warningMessage)
continue
}
@@ -503,6 +513,8 @@ export class KnowledgeMigrator extends BaseMigrator {
}
}
this.flushSkippedWarnings()
logger.info('KnowledgeMigrator.prepare completed', {
sourceCount: this.sourceCount,
preparedBases: this.preparedBases.length,
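
The skip-warning aggregation added in this file (`recordSkippedWarning` / `flushSkippedWarnings`) can be illustrated in isolation. This is a standalone sketch under stated assumptions: the class name `SkippedWarningBuckets` and its `record`/`flush` methods are hypothetical, and unlike the migrator it returns the summaries instead of logging them.

```typescript
// Hypothetical standalone version of the bucketed skip warnings.
const SAMPLE_LIMIT = 3

class SkippedWarningBuckets {
  private buckets = new Map<string, { count: number; samples: string[] }>()

  // Count every skip, but keep at most SAMPLE_LIMIT example messages per reason.
  record(reason: string, message: string): void {
    const bucket = this.buckets.get(reason) ?? { count: 0, samples: [] }
    bucket.count += 1
    if (bucket.samples.length < SAMPLE_LIMIT) {
      bucket.samples.push(message)
    }
    this.buckets.set(reason, bucket)
  }

  // Emit one summary line per reason, then reset the buckets.
  flush(): string[] {
    const summaries: string[] = []
    this.buckets.forEach((bucket, reason) => {
      summaries.push(
        `Skipped knowledge records (${reason}): count=${bucket.count}; examples: ${bucket.samples.join(' | ')}`
      )
    })
    this.buckets.clear()
    return summaries
  }
}
```

With five skips recorded under one reason, `flush()` yields a single line that reports `count=5` and lists only the first three messages, which keeps the warning list short when many records are skipped for the same reason.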


@@ -5,6 +5,13 @@ import { knowledgeBaseTable, knowledgeItemTable } from '@data/db/schemas/knowled
import { type Client, createClient } from '@libsql/client'
import { loggerService } from '@logger'
import type { ExecuteResult, PrepareResult, ValidateResult, ValidationError } from '@shared/data/migration/v2/types'
import {
KNOWLEDGE_BASE_ERROR_MISSING_EMBEDDING_MODEL,
KnowledgeChunkMetadataSchema,
type KnowledgeItemData,
type KnowledgeItemType
} from '@shared/data/types/knowledge'
import { estimateTokenCount } from 'tokenx'
import { v4 as uuidv4 } from 'uuid'
import type { MigrationContext } from '../core/MigrationContext'
@@ -14,6 +21,9 @@ const logger = loggerService.withContext('KnowledgeVectorMigrator')
const VECTORSTORE_TABLE_NAME = 'libsql_vectorstores_embedding'
const INSERT_BATCH_SIZE = 100
const LEGACY_VECTOR_BACKUP_SUFFIX = '.embedjs.bak'
const INDEXABLE_KNOWLEDGE_ITEM_TYPES = new Set<KnowledgeItemType>(['file', 'url', 'note'])
const SKIP_WARNING_SAMPLE_LIMIT = 3
function yieldToEventLoop(): Promise<void> {
return new Promise((resolve) => {
@@ -39,10 +49,26 @@ interface LegacyKnowledgeStateWithLoaders {
interface PreparedVectorRow {
document: string
externalId: string
itemType: KnowledgeItemType
source: string
chunkIndex: number
tokenCount: number
embedding: number[]
}
interface MigratedKnowledgeItemForVector {
id: string
baseId: string
type: KnowledgeItemType
data: KnowledgeItemData
}
interface LoaderTarget {
id: string
itemType: KnowledgeItemType
source: string
}
interface PreparedBasePlan {
baseId: string
dbPath: string
@@ -60,6 +86,7 @@ export class KnowledgeVectorMigrator extends BaseMigrator {
private sourceCount = 0
private skippedCount = 0
private warnings: string[] = []
private skippedWarnings = new Map<string, { count: number; samples: string[] }>()
private preparedBasePlans: PreparedBasePlan[] = []
private successfulBaseIds = new Set<string>()
private targetCountByBaseId = new Map<string, number>()
@@ -69,6 +96,7 @@ export class KnowledgeVectorMigrator extends BaseMigrator {
this.sourceCount = 0
this.skippedCount = 0
this.warnings = []
this.skippedWarnings = new Map<string, { count: number; samples: string[] }>()
this.preparedBasePlans = []
this.successfulBaseIds = new Set<string>()
this.targetCountByBaseId = new Map<string, number>()
@@ -79,6 +107,33 @@ export class KnowledgeVectorMigrator extends BaseMigrator {
return `${dbPath}.vectorstore.tmp`
}
private getLegacyBackupPath(dbPath: string): string {
return `${dbPath}${LEGACY_VECTOR_BACKUP_SUFFIX}`
}
private recordWarning(message: string): void {
logger.warn(message)
this.warnings.push(message)
}
private recordSkippedWarning(reason: string, message: string): void {
const bucket = this.skippedWarnings.get(reason) ?? { count: 0, samples: [] }
bucket.count += 1
if (bucket.samples.length < SKIP_WARNING_SAMPLE_LIMIT) {
bucket.samples.push(message)
}
this.skippedWarnings.set(reason, bucket)
}
private flushSkippedWarnings(): void {
for (const [reason, bucket] of this.skippedWarnings) {
const summary = `Skipped knowledge vector records (${reason}): count=${bucket.count}; examples: ${bucket.samples.join(' | ')}`
this.recordWarning(summary)
}
this.skippedWarnings.clear()
}
private async ensureVectorStoreSchema(client: Client, dimensions: number): Promise<void> {
await client.execute({
sql: `
@@ -180,7 +235,10 @@ export class KnowledgeVectorMigrator extends BaseMigrator {
row.document,
JSON.stringify({
itemId: row.externalId,
...(row.source.trim() !== '' ? { source: row.source } : {})
itemType: row.itemType,
source: row.source,
chunkIndex: row.chunkIndex,
tokenCount: row.tokenCount
}),
`[${row.embedding.join(',')}]`
])
@@ -195,31 +253,50 @@ export class KnowledgeVectorMigrator extends BaseMigrator {
})
}
private buildLoaderKeyMap(
private getMigratedItemSource(data: KnowledgeItemData): string {
if (!data || typeof data !== 'object' || !('source' in data) || typeof data.source !== 'string') {
return ''
}
return data.source.trim()
}
private buildLoaderTargetMap(
legacyBase: LegacyKnowledgeBaseWithLoaders | undefined,
migratedItemIds: Set<string>
): Map<string, string> {
const map = new Map<string, string>()
migratedItemsById: Map<string, MigratedKnowledgeItemForVector>
): Map<string, LoaderTarget> {
const map = new Map<string, LoaderTarget>()
if (!legacyBase || !Array.isArray(legacyBase.items)) {
return map
}
for (const item of legacyBase.items) {
if (!item.id || !migratedItemIds.has(item.id)) {
if (!item.id) {
continue
}
const migratedItem = migratedItemsById.get(item.id)
if (!migratedItem) {
continue
}
const target: LoaderTarget = {
id: migratedItem.id,
itemType: migratedItem.type,
source: this.getMigratedItemSource(migratedItem.data)
}
if (Array.isArray(item.uniqueIds) && item.uniqueIds.length > 0) {
for (const uniqueId of item.uniqueIds) {
if (typeof uniqueId === 'string' && uniqueId.trim() !== '') {
map.set(uniqueId, item.id)
map.set(uniqueId, target)
}
}
continue
}
if (typeof item.uniqueId === 'string' && item.uniqueId.trim() !== '') {
map.set(item.uniqueId, item.id)
map.set(item.uniqueId, target)
}
}
@@ -239,14 +316,19 @@ export class KnowledgeVectorMigrator extends BaseMigrator {
}
const migratedItems = await ctx.db
.select({ id: knowledgeItemTable.id, baseId: knowledgeItemTable.baseId })
.select({
id: knowledgeItemTable.id,
baseId: knowledgeItemTable.baseId,
type: knowledgeItemTable.type,
data: knowledgeItemTable.data
})
.from(knowledgeItemTable)
const migratedItemIdsByBaseId = new Map<string, Set<string>>()
const migratedItemsByBaseId = new Map<string, Map<string, MigratedKnowledgeItemForVector>>()
for (const item of migratedItems) {
const bucket = migratedItemIdsByBaseId.get(item.baseId) ?? new Set<string>()
bucket.add(item.id)
migratedItemIdsByBaseId.set(item.baseId, bucket)
const bucket = migratedItemsByBaseId.get(item.baseId) ?? new Map<string, MigratedKnowledgeItemForVector>()
bucket.set(item.id, item)
migratedItemsByBaseId.set(item.baseId, bucket)
}
const legacyBasesById = new Map(
@@ -256,11 +338,23 @@ export class KnowledgeVectorMigrator extends BaseMigrator {
)
for (const base of migratedBases) {
if (base.status === 'failed' || base.embeddingModelId === null) {
const warningMessage = `Skipped knowledge vector base ${base.id}: missing embedding model`
this.recordSkippedWarning(KNOWLEDGE_BASE_ERROR_MISSING_EMBEDDING_MODEL, warningMessage)
continue
}
const dimensions = base.dimensions
if (typeof dimensions !== 'number' || !Number.isInteger(dimensions) || dimensions <= 0) {
const warningMessage = `Skipped knowledge vector base ${base.id}: invalid dimensions`
this.recordSkippedWarning('invalid_dimensions', warningMessage)
continue
}
const legacyBase = legacyBasesById.get(base.id)
if (!legacyBase) {
const warningMessage = `Skipped knowledge vector base ${base.id}: legacy knowledge base not found`
logger.warn(warningMessage)
this.warnings.push(warningMessage)
this.recordSkippedWarning('legacy_base_missing', warningMessage)
continue
}
@@ -268,26 +362,22 @@ export class KnowledgeVectorMigrator extends BaseMigrator {
switch (source.status) {
case 'invalid_path': {
const warningMessage = `Skipped knowledge vector base ${base.id}: invalid legacy vector DB path`
logger.warn(warningMessage)
this.warnings.push(warningMessage)
this.recordSkippedWarning('invalid_path', warningMessage)
continue
}
case 'missing': {
const warningMessage = `Skipped knowledge vector base ${base.id}: legacy vector DB missing`
logger.warn(warningMessage)
this.warnings.push(warningMessage)
this.recordSkippedWarning('missing', warningMessage)
continue
}
case 'directory': {
const warningMessage = `Skipped knowledge vector base ${base.id}: legacy vector DB path is a directory`
logger.warn(warningMessage)
this.warnings.push(warningMessage)
this.recordSkippedWarning('directory', warningMessage)
continue
}
case 'not_embedjs': {
const warningMessage = `Skipped knowledge vector base ${base.id}: legacy DB is not embedjs format`
logger.warn(warningMessage)
this.warnings.push(warningMessage)
this.recordSkippedWarning('not_embedjs', warningMessage)
continue
}
}
@@ -295,38 +385,65 @@ export class KnowledgeVectorMigrator extends BaseMigrator {
const vectorRows = source.rows
this.sourceCount += vectorRows.length
const loaderKeyMap = this.buildLoaderKeyMap(
const loaderTargetMap = this.buildLoaderTargetMap(
legacyBase,
migratedItemIdsByBaseId.get(base.id) ?? new Set<string>()
migratedItemsByBaseId.get(base.id) ?? new Map<string, MigratedKnowledgeItemForVector>()
)
const rows: PreparedVectorRow[] = []
const chunkIndexByItemId = new Map<string, number>()
for (const row of vectorRows) {
// V2 only keeps vectors that can be proven to belong to an existing
// migrated knowledge_item row. Unmapped legacy vectors are treated
// as invalid index residue and are intentionally dropped.
const externalId = loaderKeyMap.get(row.uniqueLoaderId)
if (!externalId) {
const target = loaderTargetMap.get(row.uniqueLoaderId)
if (!target) {
this.skippedCount += 1
const warningMessage = `Skipped knowledge vector row in base ${base.id}: uniqueLoaderId '${row.uniqueLoaderId}' cannot be mapped to item.id`
logger.warn(warningMessage)
this.warnings.push(warningMessage)
this.recordSkippedWarning('unmapped_loader', warningMessage)
continue
}
if (!row.vector || row.vector.length === 0) {
if (!INDEXABLE_KNOWLEDGE_ITEM_TYPES.has(target.itemType)) {
this.skippedCount += 1
const warningMessage = `Skipped knowledge vector row in base ${base.id}: vector payload missing for uniqueLoaderId '${row.uniqueLoaderId}'`
logger.warn(warningMessage)
this.warnings.push(warningMessage)
const warningMessage = `Skipped knowledge vector row in base ${base.id}: container item '${target.id}' of type '${target.itemType}' is not indexable`
this.recordSkippedWarning('non_indexable_container', warningMessage)
continue
}
if (row.vector.status === 'unsupported_encoding') {
this.skippedCount += 1
const warningMessage = `Skipped knowledge vector row in base ${base.id}: unsupported vector encoding '${row.vector.encoding}' for uniqueLoaderId '${row.uniqueLoaderId}'`
this.recordSkippedWarning('unsupported_vector_encoding', warningMessage)
continue
}
if (row.vector.status === 'missing' || row.vector.value.length === 0) {
this.skippedCount += 1
const warningMessage = `Skipped knowledge vector row in base ${base.id}: vector payload missing for uniqueLoaderId '${row.uniqueLoaderId}'`
this.recordSkippedWarning('missing_vector_payload', warningMessage)
continue
}
const sourceText = row.source.trim() || target.source
if (sourceText === '') {
this.skippedCount += 1
const warningMessage = `Skipped knowledge vector row in base ${base.id}: source missing for item '${target.id}'`
this.recordSkippedWarning('missing_source', warningMessage)
continue
}
const chunkIndex = chunkIndexByItemId.get(target.id) ?? 0
chunkIndexByItemId.set(target.id, chunkIndex + 1)
rows.push({
document: row.pageContent,
externalId,
source: row.source,
embedding: row.vector
externalId: target.id,
itemType: target.itemType,
source: sourceText,
chunkIndex,
tokenCount: estimateTokenCount(row.pageContent),
embedding: row.vector.value
})
}
@@ -336,23 +453,27 @@ export class KnowledgeVectorMigrator extends BaseMigrator {
this.preparedBasePlans.push({
baseId: base.id,
dbPath: source.dbPath,
dimensions: base.dimensions,
dimensions,
rows,
sourceRowCount: vectorRows.length
})
}
this.flushSkippedWarnings()
return {
success: true,
itemCount: this.sourceCount,
warnings: this.warnings.length > 0 ? this.warnings : undefined
}
} catch (error) {
this.flushSkippedWarnings()
const errorMessage = error instanceof Error ? error.message : String(error)
logger.error('KnowledgeVectorMigrator.prepare failed', error as Error)
return {
success: false,
itemCount: this.sourceCount,
warnings: [error instanceof Error ? error.message : String(error)]
warnings: [...this.warnings, errorMessage]
}
}
}
@@ -371,6 +492,7 @@ export class KnowledgeVectorMigrator extends BaseMigrator {
for (const plan of this.preparedBasePlans) {
const tempPath = this.getTempVectorStorePath(plan.dbPath)
const backupPath = this.getLegacyBackupPath(plan.dbPath)
try {
const rebuiltRows: Array<PreparedVectorRow & { id: string }> = plan.rows.map((row) => ({
@@ -415,7 +537,12 @@ export class KnowledgeVectorMigrator extends BaseMigrator {
await yieldToEventLoop()
}
await fs.promises.rm(plan.dbPath, { force: true })
// First migration preserves the legacy embedjs DB; retries remove the stale failed target before swapping.
if (!fs.existsSync(backupPath) && fs.existsSync(plan.dbPath)) {
await fs.promises.rename(plan.dbPath, backupPath)
} else {
await fs.promises.rm(plan.dbPath, { force: true })
}
await fs.promises.rename(tempPath, plan.dbPath)
this.successfulBaseIds.add(plan.baseId)
@@ -492,17 +619,51 @@ export class KnowledgeVectorMigrator extends BaseMigrator {
})
}
const missingOrMismatchedItemIdResult = await client.execute({
sql: `SELECT count(*) AS count FROM ${VECTORSTORE_TABLE_NAME} WHERE json_extract(metadata, '$.itemId') IS NULL OR json_extract(metadata, '$.itemId') = '' OR json_extract(metadata, '$.itemId') != external_id`,
const metadataResult = await client.execute({
sql: `SELECT id, external_id, metadata FROM ${VECTORSTORE_TABLE_NAME}`,
args: []
})
const missingOrMismatchedItemIdCount = Number(missingOrMismatchedItemIdResult.rows[0]?.count ?? 0)
if (missingOrMismatchedItemIdCount > 0) {
let invalidMetadataCount = 0
let mismatchedItemIdCount = 0
for (const row of metadataResult.rows) {
let metadata: unknown
try {
metadata = JSON.parse(String(row.metadata ?? '{}'))
} catch {
invalidMetadataCount += 1
continue
}
const parsedMetadata = KnowledgeChunkMetadataSchema.safeParse(metadata)
if (!parsedMetadata.success) {
invalidMetadataCount += 1
continue
}
const externalId = typeof row.external_id === 'string' ? row.external_id : String(row.external_id ?? '')
if (parsedMetadata.data.itemId !== externalId) {
mismatchedItemIdCount += 1
}
}
if (invalidMetadataCount > 0) {
errors.push({
key: `knowledge_vector_missing_item_id_${plan.baseId}`,
key: `knowledge_vector_invalid_metadata_${plan.baseId}`,
expected: 0,
actual: missingOrMismatchedItemIdCount,
message: `Found ${missingOrMismatchedItemIdCount} knowledge vector rows without matching metadata.itemId in base ${plan.baseId}`
actual: invalidMetadataCount,
message: `Found ${invalidMetadataCount} knowledge vector rows with invalid runtime metadata in base ${plan.baseId}`
})
}
if (mismatchedItemIdCount > 0) {
errors.push({
key: `knowledge_vector_mismatched_item_id_${plan.baseId}`,
expected: 0,
actual: mismatchedItemIdCount,
message: `Found ${mismatchedItemIdCount} knowledge vector rows whose metadata.itemId does not match external_id in base ${plan.baseId}`
})
}
} finally {
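
The per-item `chunkIndex` written into the vector metadata earlier in this file comes from a counter keyed by item id (`chunkIndexByItemId`). A minimal sketch of that numbering, assuming a hypothetical helper name `assignChunkIndexes` and a simplified row shape; in the migrator the counter advances only for rows that pass every skip check:

```typescript
// Hypothetical helper: numbers rows per item in the order they are encountered.
function assignChunkIndexes<T extends { itemId: string }>(rows: T[]): Array<T & { chunkIndex: number }> {
  const nextIndexByItemId = new Map<string, number>()
  return rows.map((row) => {
    // The first chunk of each item gets index 0; later chunks count up from there.
    const chunkIndex = nextIndexByItemId.get(row.itemId) ?? 0
    nextIndexByItemId.set(row.itemId, chunkIndex + 1)
    return { ...row, chunkIndex }
  })
}
```

Because the index follows the row order of the legacy vector table, it reflects that order rather than any chunk position stored in the legacy data.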


@@ -9,9 +9,9 @@
| Knowledge bases + lightweight items | Redux `knowledge.bases` | `ReduxStateReader.getCategory('knowledge')` |
| Full note content | Dexie `knowledge_notes` | `knowledge_notes.json` |
| File metadata fallback | Dexie `files` | `files.json` |
| Legacy vector databases | Filesystem | `ctx.paths.knowledgeBaseDir/{baseId}` (via `MigrationPaths`) |
| Legacy vector databases | Filesystem | `ctx.paths.knowledgeBaseDir/<sanitizedBaseId>` (via `MigrationPaths`) |
> **Note**: The legacy vector DB path comes from `ctx.paths.knowledgeBaseDir`, which is pre-computed by `MigrationPaths` from the resolved v1 userData directory. Do NOT call `app.getPath('userData')` directly — see `migration/v2/README.md` Path Safety section.
> **Note**: The legacy vector DB path comes from `ctx.paths.knowledgeBaseDir`, which is pre-computed by `MigrationPaths` from the resolved v1 userData directory. The base id is sanitized with `sanitizeFilename(baseId, '_')`. Do NOT call `app.getPath('userData')` directly — see `migration/v2/README.md` Path Safety section.
## Target Tables
@@ -22,15 +22,21 @@
1. Base metadata migration
- Legacy base model/rerank model are transformed to `embeddingModelId` and `rerankModelId`.
- Migrated base `searchMode` is set to `default`.
- Model references are resolved against migrated `user_model` rows.
- Missing or dangling embedding model references are preserved as recoverable failed bases with `embeddingModelId = null`, `status = failed`, and `error = missing_embedding_model`.
- `error = missing_embedding_model` is the current shared `KnowledgeBaseErrorCode` member for recoverable base-level embedding model loss.
- Missing or dangling rerank references are cleared with warnings.
- Migrated base `searchMode` is set to `hybrid`.
- Legacy preprocess provider id is mapped to `fileProcessorId`.
- Invalid runtime tuning fields are normalized away instead of causing the whole base to be skipped.
- Invalid runtime tuning fields are normalized to schema-safe defaults/nulls instead of causing the whole base to be skipped.
2. Unified item payload migration
- Legacy item `content` is transformed into the new `knowledge_item.data` union payload by item type.
- Supported migrated item types are `file`, `url`, `note`, `sitemap`, and `directory`.
- V2 models `knowledge_item` as a flat item list with optional `groupId`.
- Official v1 exports do not provide grouping metadata.
- Migrated items are therefore inserted with `groupId = null` by design.
- `directory` and `sitemap` are container/source declarations in `knowledge_item`; their own container-level vectors are handled by `KnowledgeVectorMigrator` as non-indexable and are not written to the V2 vector store.
3. Note content source priority
- Prefer Dexie `knowledge_notes` content.
@@ -47,6 +53,13 @@
- `uniqueId` present and non-empty -> `completed`
- otherwise -> `idle`
6. Vector dimension dependency
- Completed bases require a resolved positive `knowledge_base.dimensions` value.
- The migrator resolves dimensions from the legacy per-base vector DB, using the first non-null `vectors.vector` blob length.
- This migrator does not copy vector rows. It only prepares the base and item records needed by `KnowledgeVectorMigrator`.
- If dimension resolution fails for a base with a resolved embedding model, the base and its items are skipped because the target schema cannot safely materialize a completed base.
- If the embedding model is missing or dangling, the base is preserved as `failed`; valid legacy `dimensions` are kept, otherwise `dimensions` is `null`.
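The `length(vector)/4` rule above can be sketched as a small helper. This is only an illustration: the function name is hypothetical, and rejecting byte lengths that are not a multiple of 4 is an assumption about what "cannot be parsed into a valid positive dimension count" means, not a confirmed detail of the migrator.

```typescript
// Hypothetical helper (not part of the migrator): derive an embedding
// dimension count from the byte length of a legacy vector blob, assuming
// each component is a 4-byte float32 value.
const BYTES_PER_FLOAT32 = 4

function dimensionsFromBlobByteLength(byteLength: number | null | undefined): number | null {
  // Missing, non-integer, or empty blobs cannot yield a positive dimension count.
  if (typeof byteLength !== 'number' || !Number.isInteger(byteLength) || byteLength <= 0) {
    return null
  }
  // Assumption: a blob that is not a whole number of float32 values is invalid.
  if (byteLength % BYTES_PER_FLOAT32 !== 0) {
    return null
  }
  return byteLength / BYTES_PER_FLOAT32
}
```

For example, a 4096-byte blob would resolve to 1024 dimensions, while a `null` result corresponds to the case where a base with a resolved embedding model is skipped.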
## Field Mappings
### knowledge_base mapping
@@ -55,16 +68,17 @@
|----------------------|---------------------------|-------|
| `id` | `id` | Direct copy |
| `name` | `name` | Direct copy |
| `description` | `description` | Direct copy |
| `dimensions` | `dimensions` | Read from legacy vector DB `vectors.vector` blob length (`length(vector)/4`) |
| `model` | `embeddingModelId` | Converted to `provider::modelId` |
| `rerankModel` | `rerankModelId` | Optional, converted to `provider::modelId` |
| _no legacy grouping field_ | `groupId` | V1 knowledge bases do not carry group metadata; migrate as `null` |
| _constant_ | `emoji` | Always `📁` during v1 migration |
| `dimensions` | `dimensions` | Completed bases use legacy vector DB blob length (`length(vector)/4`); failed bases keep valid legacy dimensions or `null` |
| `model` | `embeddingModelId` / `status` / `error` | Converted to `provider::modelId`, then resolved against `user_model`; missing/dangling references produce a failed recoverable base |
| `rerankModel` | `rerankModelId` | Optional, converted to `provider::modelId`, then resolved against `user_model`; dangling references are cleared |
| `preprocessProvider.provider.id` | `fileProcessorId` | Optional |
| `chunkSize` | `chunkSize` | Copied when positive; otherwise cleared |
| `chunkOverlap` | `chunkOverlap` | Copied when non-negative and smaller than `chunkSize`; otherwise cleared |
| `chunkSize` | `chunkSize` | Copied when positive integer; otherwise normalized to the default chunk size |
| `chunkOverlap` | `chunkOverlap` | Copied when non-negative integer and smaller than `chunkSize`; otherwise normalized to the default overlap for the resolved chunk size |
| `threshold` | `threshold` | Copied when within `[0, 1]`; otherwise cleared |
| `documentCount` | `documentCount` | Copied when positive; otherwise cleared |
| _constant_ | `searchMode` | Always `default` during v1 migration |
| _constant_ | `searchMode` | Always `hybrid` during v1 migration |
| `created_at` | `createdAt` | Timestamp conversion |
| `updated_at` | `updatedAt` | Timestamp conversion |
@@ -88,31 +102,66 @@
- `memory` items are skipped.
- Legacy per-base knowledge store paths that resolve to directories are skipped as unsupported pre-v2 layouts.
- Invalid/malformed items are skipped and recorded as warnings in `prepare`.
- Invalid knowledge-base tuning fields are cleared during migration; they do not cause the base or its items to be skipped.
- Invalid knowledge-base tuning fields are normalized during migration; they do not cause the base or its items to be skipped.
## Directory and Sitemap Semantics
- `directory` and `sitemap` items are migrated into `knowledge_item` when their legacy payload is valid.
- They preserve the source/root declaration needed to show the original knowledge entry in V2.
- V1 does not provide separate child `knowledge_item` ids for every expanded directory or sitemap child document.
- Therefore, this migrator does not synthesize child item rows during v1 migration.
- Any legacy vector rows that map back to the root `directory` or `sitemap` item are considered container-level vectors and are skipped by `KnowledgeVectorMigrator` with warnings.
- Child content vectors are only migrated when they can be mapped to an existing migrated `file`, `url`, or `note` item id.
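The container rule above can be pictured as a three-way decision per legacy vector row. The sketch below is illustrative only: `classifyVectorRow`, `VectorRowDecision`, and the decision labels are hypothetical names, and the real `KnowledgeVectorMigrator` may structure this differently.

```typescript
type KnowledgeItemType = 'file' | 'url' | 'note' | 'sitemap' | 'directory'

// Hypothetical decision labels used only in this sketch.
type VectorRowDecision = 'migrate' | 'skip_container' | 'skip_unmapped'

// Item types whose vectors are written to the V2 vector store.
const INDEXABLE_TYPES: ReadonlySet<KnowledgeItemType> = new Set<KnowledgeItemType>(['file', 'url', 'note'])

function classifyVectorRow(
  mappedItemId: string | undefined,
  itemTypeById: ReadonlyMap<string, KnowledgeItemType>
): VectorRowDecision {
  const itemType = mappedItemId === undefined ? undefined : itemTypeById.get(mappedItemId)
  // Rows that cannot be tied to a migrated knowledge_item are index residue.
  if (itemType === undefined) {
    return 'skip_unmapped'
  }
  // Rows tied to a directory or sitemap root are container-level vectors.
  return INDEXABLE_TYPES.has(itemType) ? 'migrate' : 'skip_container'
}
```

Note that a `skip_container` decision affects only the vector row; the `directory` or `sitemap` row itself stays in `knowledge_item`.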
## Current Constraint Decisions
- `dimensions` is required in target schema.
- `dimensions` is required only for completed bases; failed migrated bases may have `dimensions = null`.
- The legacy Redux `dimensions` field is not treated as the migration source of truth.
- `dimensions` is resolved from legacy vector DB content by inspecting:
- the per-base legacy vector DB file
- the `vectors` table
- a non-null vector blob whose byte length can be converted to a positive dimension count (`length(vector)/4`)
- If the per-base legacy knowledge store path resolves to a directory instead of a SQLite file, that base is treated as an unsupported legacy layout and is skipped.
- If the legacy vector DB is missing, empty, invalid, or the vector blob length cannot be parsed into a valid positive dimension count, that base is treated as unusable in V2 migration:
- If the legacy vector DB is missing, empty, invalid, or the vector blob length cannot be parsed into a valid positive dimension count, a base with a resolvable embedding model is treated as unusable in V2 migration:
- the base is skipped
- all items under that base are skipped
- a warning is recorded during `prepare`
- Missing embedding model identity is treated as a structural migration failure for that base.
- A missing or dangling embedding model reference is cleared to `null`; `status` is set to `failed` and `error` to `missing_embedding_model`, and a warning is recorded. That error value is a shared `KnowledgeBaseErrorCode`, not a free-form string. This path does not inspect the legacy vector DB: valid legacy `dimensions` are preserved, and invalid or missing legacy `dimensions` are stored as `null`.
- Non-structural tuning config (`chunkSize`, `chunkOverlap`, `threshold`, `documentCount`) is migrated on a best-effort basis:
- valid values are preserved
- invalid values are cleared
- invalid `chunkSize` / `chunkOverlap` values are replaced with defaults
- invalid nullable tuning values such as `threshold` / `documentCount` are cleared
- the base still migrates
- V2 keeps `knowledge_item` flat and uses optional `groupId` for grouping queries.
- Legacy v1 knowledge data does not include that field, so migrated items keep it as `null`.
- This document describes migration behavior only; runtime APIs may set `groupId` after migration.
- Runtime schema enforces same-base group ownership through `(baseId, groupId) -> (baseId, id)`.
## Missing Embedding Model Recovery
A common recoverable case is a legacy knowledge base whose embedding model id exists in Redux but not in the V2 `user_model` table. For example, Redux may contain `ollama::dengcao/Qwen3-Embedding-0.6B:Q8_0` while no matching migrated user model row exists.
The migrator handles this as a recoverable failed base:
```text
embeddingModelId = null
status = failed
error = missing_embedding_model
```
The base and its `knowledge_item` rows are preserved. `KnowledgeVectorMigrator` skips vectors for this base because the embedding model contract cannot be verified.
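A minimal sketch of that decision, assuming the migrated `user_model` ids are available as a set of `provider::modelId` strings. The function and type names are hypothetical, and the `completed` status for resolved bases follows the test expectations in this change rather than a documented constant value.

```typescript
// Hypothetical result shape for this sketch; the real migrator writes these
// fields onto the prepared knowledge_base row.
type ResolvedEmbeddingModel =
  | { embeddingModelId: string; status: 'completed'; error: null }
  | { embeddingModelId: null; status: 'failed'; error: 'missing_embedding_model' }

function resolveEmbeddingModel(
  legacyModelId: string | null,
  migratedUserModelIds: ReadonlySet<string>
): ResolvedEmbeddingModel {
  // A reference counts as resolved only when a matching user_model row exists.
  if (legacyModelId !== null && migratedUserModelIds.has(legacyModelId)) {
    return { embeddingModelId: legacyModelId, status: 'completed', error: null }
  }
  // Missing (null) and dangling (unknown id) references share one recoverable outcome.
  return { embeddingModelId: null, status: 'failed', error: 'missing_embedding_model' }
}
```

With only `silicon::BAAI/bge-m3` migrated, the `ollama::dengcao/Qwen3-Embedding-0.6B:Q8_0` reference from the example above would fall into the failed branch.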
User recovery is handled by runtime restore, not by mutating the failed base in place:
```text
knowledge-runtime:restore-base
-> create a new knowledge base with the source base config and selected embedding model
-> copy source root items only
-> run the normal createBase + addItems indexing flow
```
The original failed base remains available after restore so the UI can let the user confirm success before deleting it.
## Validation
- Count validation uses migrator stats:


@@ -9,7 +9,9 @@
| Migrated knowledge base identities and dimensions | SQLite `knowledge_base` | `knowledge_base` table |
| Migrated knowledge item identities | SQLite `knowledge_item` | `knowledge_item` table |
| Legacy loader metadata | Redux `knowledge.bases[].items[]` | `ReduxStateReader.getCategory('knowledge')` |
| Legacy chunk vectors | Per-base legacy vector DB | `application.getPath('feature.knowledgebase.data', <sanitizedBaseId>)` |
| Legacy chunk vectors | Per-base legacy vector DB | `ctx.sources.knowledgeVectorSource.loadBase(base.id)` |
The source reader is initialized by `MigrationContext` with `ctx.paths.knowledgeBaseDir`. It must read from the migration-resolved v1 userData path, not from the v2 path registry or `app.getPath()`. `KnowledgeVectorMigrator` itself should continue to use the reader abstraction instead of constructing vector DB paths inline.
## Target Storage
@@ -19,27 +21,38 @@
## Key Transformations
1. Loader identity remapping
- Failed knowledge bases without a resolved embedding model are skipped at the base level; they keep their SQLite base/items and must be rebuilt after the user selects a new model.
- `uniqueLoaderId` is not kept as a persisted field.
- It is resolved back to `knowledge_item.id` and written into `external_id`.
- `uniqueIds[]` takes precedence over legacy `uniqueId`.
- A legacy vector row is considered valid only if it can be mapped to an existing V2 `knowledge_item.id`.
- Unmapped legacy rows are treated as invalid index residue, not as business data that must be preserved.
2. Chunk payload migration
2. Indexable item filtering
- Only vectors mapped to indexable V2 item types are migrated.
- Indexable types are `file`, `url`, and `note`.
- Vectors mapped to container items, currently `directory` and `sitemap`, are skipped with warnings.
- This does not remove the `directory` or `sitemap` rows from `knowledge_item`; it only prevents container-level vectors from being written into the V2 vector store.
3. Chunk payload migration
- `pageContent` -> `document`
- `knowledge_item.id` -> `metadata.itemId`
- `source` -> optional `metadata.source`
- `knowledge_item.type` -> `metadata.itemType`
- Legacy row `source`, falling back to `knowledge_item.data.source` -> `metadata.source`
- Per-item migrated row order -> `metadata.chunkIndex`
- Estimated document token count -> `metadata.tokenCount`
- Other legacy metadata fields are dropped.
3. Embedding reuse
4. Embedding reuse
- Legacy `vector` payloads are decoded from `F32_BLOB` and written directly to `embeddings`.
- Unsupported vector encodings are skipped under `unsupported_vector_encoding`, separate from truly missing payloads.
- Existing chunk embeddings are reused; this migrator does not re-embed content.
4. Chunk identity regeneration
5. Chunk identity regeneration
- Legacy chunk IDs are not reused.
- Every migrated vector row gets a new UUID v4 `id`.
5. Schema bootstrap
6. Schema bootstrap
- Creates `external_id`, `collection`, and FTS schema needed by `@vectorstores/libsql`.
- Migrated rows use `collection = base.id` so runtime reads and deletes match the same per-base store contract.
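The chunk payload and embedding reuse steps above can be sketched as follows. Everything here is illustrative: the interfaces and function names are hypothetical, the little-endian float32 layout of the legacy blob is an assumption, and `metadata.tokenCount` and the regenerated UUID `id` are left out because this document does not specify how they are computed.

```typescript
// Hypothetical shapes for this sketch; they do not mirror the real reader types.
interface LegacyVectorRow {
  pageContent: string
  source?: string | null
  vector: Uint8Array | null
}

interface MigratedItem {
  id: string
  type: 'file' | 'url' | 'note'
  source: string // stands in for knowledge_item.data.source
}

interface MigratedChunk {
  externalId: string
  collection: string
  document: string
  embeddings: number[]
  metadata: { itemId: string; itemType: string; source: string; chunkIndex: number }
}

// Assumption: the blob is a packed sequence of little-endian float32 values.
function decodeF32Blob(bytes: Uint8Array | null): number[] | null {
  if (bytes === null || bytes.byteLength === 0 || bytes.byteLength % 4 !== 0) {
    return null
  }
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength)
  const values: number[] = []
  for (let offset = 0; offset < bytes.byteLength; offset += 4) {
    values.push(view.getFloat32(offset, true))
  }
  return values
}

function toMigratedChunk(
  row: LegacyVectorRow,
  item: MigratedItem,
  baseId: string,
  chunkIndex: number
): MigratedChunk | null {
  // Rows without a usable vector payload are skipped, not re-embedded.
  const embeddings = decodeF32Blob(row.vector)
  if (embeddings === null) {
    return null
  }
  // Prefer the legacy row source, falling back to the migrated item source.
  const legacySource = row.source?.trim()
  return {
    externalId: item.id,
    collection: baseId,
    document: row.pageContent,
    embeddings,
    metadata: {
      itemId: item.id,
      itemType: item.type,
      source: legacySource ? legacySource : item.source,
      chunkIndex
    }
  }
}
```

Keeping `externalId` and `metadata.itemId` derived from the same `item.id` is what lets the later validation step compare the two values directly.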
@@ -47,29 +60,33 @@
- The migrator writes each rebuilt vector store to a temporary sibling file first.
- The original embedjs DB stays untouched until the temporary file has been written successfully.
- Once the temp file is ready, the migrator replaces the original DB in place.
- The migration flow relies on the user-completed pre-migration v1 backup; it does not keep an additional in-place rollback copy.
- Once the temp file is ready, the migrator moves the original embedjs DB to a
`.embedjs.bak` sibling and places the rebuilt V2 store at the original path.
- Retry reads from the `.embedjs.bak` sibling when the original path already
contains a V2 vector store from an earlier attempt.
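The ordering of the replacement steps can be illustrated with an in-memory stand-in for the filesystem. This is a sketch under stated assumptions, not the migrator's code: `FakeFs`, `swapInRebuiltStore`, and the `.v2.tmp` temp suffix are hypothetical, and "prefer the backup whenever it exists" is a simplification of the retry rule described above.

```typescript
// In-memory stand-in for a directory: path -> file contents.
type FakeFs = Map<string, string>

const BACKUP_SUFFIX = '.embedjs.bak'
const TEMP_SUFFIX = '.v2.tmp' // hypothetical temp file name

// Simplified retry rule: if a backup sibling exists, it is the legacy source.
function pickLegacySource(fs: FakeFs, dbPath: string): string | null {
  const backupPath = dbPath + BACKUP_SUFFIX
  if (fs.has(backupPath)) {
    return backupPath
  }
  return fs.has(dbPath) ? dbPath : null
}

function rename(fs: FakeFs, from: string, to: string): void {
  const contents = fs.get(from)
  if (contents === undefined) {
    throw new Error(`missing file: ${from}`)
  }
  fs.set(to, contents)
  fs.delete(from)
}

function swapInRebuiltStore(fs: FakeFs, dbPath: string, rebuild: (legacy: string) => string): void {
  const sourcePath = pickLegacySource(fs, dbPath)
  if (sourcePath === null) {
    throw new Error(`no legacy source for ${dbPath}`)
  }
  // 1. Write the rebuilt store to a temporary sibling; the source stays untouched.
  const tempPath = dbPath + TEMP_SUFFIX
  fs.set(tempPath, rebuild(fs.get(sourcePath) as string))
  // 2. On a first run, move the original embedjs DB aside as the backup.
  if (sourcePath === dbPath) {
    rename(fs, dbPath, dbPath + BACKUP_SUFFIX)
  }
  // 3. Place the rebuilt store at the original path.
  rename(fs, tempPath, dbPath)
}
```

Because a retry always rebuilds from the backup, running the swap twice produces the same V2 store instead of migrating an already migrated file.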
## IMPORTANT: Current Limitations
- Base-level execution failures are treated as migration failures, not as skippable data warnings. If rebuilding or replacing one base fails, `execute()` returns `success: false`.
- The current implementation does **not** preserve a retryable in-place copy of the original embedjs DB. It does not keep `.bak` files or other retry artifacts beside the knowledge DB path.
- Because the replacement is in-place, a failure that happens after the original DB has been removed but before the new file is fully placed may leave the base without a usable legacy source file on disk.
- Therefore, retry semantics currently depend on the user restoring the pre-migration v1 backup before running migration again. The migrator itself does not guarantee that a failed run leaves the knowledge vector source in a reusable retry state.
- This limitation is intentional for the current implementation, but it is **important** and may need follow-up design discussion or future changes if the project later wants first-class retry support without requiring manual restore.
- Retry depends on the `.embedjs.bak` sibling staying beside the rewritten V2
store until the migration flow has completed.
## Validation
- Per-base row count must equal the prepared row count.
- `external_id` must be non-empty for every migrated row.
- `metadata.itemId` must be present and match `external_id` for every migrated row.
- `metadata.source` is optional and is only preserved when the legacy row has a non-empty `source`.
- `metadata` must satisfy the runtime `KnowledgeChunkMetadataSchema`.
## Skipped Data
- Bases missing from migrated `knowledge_base`
- Bases marked `failed` or with `embeddingModelId = null`
- Bases whose legacy DB file is missing, resolves to a directory, or does not contain a `vectors` table
- Vector rows whose `uniqueLoaderId` cannot be mapped to a migrated `knowledge_item.id`
- Vector rows mapped to non-indexable container item types such as `directory` or `sitemap`
- Vector rows with missing or empty `vector` payloads
- Vector rows whose `vector` payload exists but is exposed through an unsupported runtime encoding
- Vector rows whose source cannot be resolved from either the legacy row or migrated `knowledge_item.data.source`
If every legacy vector row under one base is skipped, the rebuilt V2 vector store for that base is expected to be empty. This is intentional: only vectors that can be proven to belong to migrated `knowledge_item` rows remain valid in V2.


@@ -1,6 +1,7 @@
import fs from 'node:fs'
import { createClient } from '@libsql/client'
import { KNOWLEDGE_BASE_ERROR_MISSING_EMBEDDING_MODEL } from '@shared/data/types/knowledge'
import { beforeEach, describe, expect, it, vi } from 'vitest'
vi.mock('node:fs', async () => {
@@ -297,12 +298,11 @@ describe('KnowledgeMigrator dimensions resolution', () => {
expect(result.warnings?.some((warning: string) => warning.includes('Skipped knowledge base kb-empty'))).toBe(true)
})
it('prepare preserves knowledge base and clears dangling model references', async () => {
it('prepare preserves knowledge base and items with dangling embedding model reference', async () => {
const migrator = new KnowledgeMigrator() as any
vi.spyOn(migrator, 'resolveDimensionsForBase').mockResolvedValue({
dimensions: 1024,
reason: 'ok'
})
const resolveDimensionsForBase = vi
.spyOn(migrator, 'resolveDimensionsForBase')
.mockRejectedValue(new Error('should not inspect vector DB for missing models'))
const ctx = {
paths: { knowledgeBaseDir: '/mock/userData/Data/KnowledgeBase' },
@@ -313,9 +313,10 @@ describe('KnowledgeMigrator dimensions resolution', () => {
{
id: 'kb-dangling-model',
name: 'Dangling KB',
dimensions: 768,
model: { id: 'qwen', name: 'qwen', provider: 'cherryai' },
rerankModel: { id: 'rerank', name: 'rerank', provider: 'cherryai' },
items: []
items: [{ id: 'item-1', type: 'note', content: 'test' }]
}
]
})
@@ -336,13 +337,84 @@ describe('KnowledgeMigrator dimensions resolution', () => {
expect(result.success).toBe(true)
expect(migrator.preparedBases).toHaveLength(1)
expect(migrator.preparedBases[0].embeddingModelId).toBeNull()
expect(migrator.preparedBases[0].rerankModelId).toBeNull()
expect(migrator.preparedBases[0]).toMatchObject({
id: 'kb-dangling-model',
dimensions: 768,
embeddingModelId: null,
status: 'failed',
error: KNOWLEDGE_BASE_ERROR_MISSING_EMBEDDING_MODEL,
rerankModelId: null
})
expect(migrator.preparedItems).toHaveLength(1)
expect(migrator.skippedCount).toBe(0)
expect(migrator.sourceCount).toBe(2)
expect(resolveDimensionsForBase).not.toHaveBeenCalled()
expect(result.warnings?.some((warning: string) => warning.includes('dangling embedding model reference'))).toBe(
true
)
})
it('prepare materializes valid chunk defaults for migrated knowledge bases', async () => {
const migrator = new KnowledgeMigrator() as any
vi.spyOn(migrator, 'resolveDimensionsForBase').mockResolvedValue({
dimensions: 1024,
reason: 'ok'
})
const ctx = {
paths: { knowledgeBaseDir: '/mock/userData/Data/KnowledgeBase' },
sources: {
reduxState: {
getCategory: vi.fn().mockReturnValue({
bases: [
{
id: 'kb-missing-chunk',
name: 'Missing chunk config',
model: { id: 'm1', name: 'model-1', provider: 'openai' },
items: []
},
{
id: 'kb-small-chunk',
name: 'Small chunk config',
model: { id: 'm2', name: 'model-2', provider: 'openai' },
chunkSize: 128,
items: []
}
]
})
},
dexieExport: {
tableExists: vi.fn().mockResolvedValue(false),
readTable: vi.fn()
}
},
db: {
select: vi.fn().mockReturnValue({
from: vi.fn().mockResolvedValue([{ id: 'openai::m1' }, { id: 'openai::m2' }])
})
}
} as any
const result = await migrator.prepare(ctx)
expect(result.success).toBe(true)
expect(migrator.preparedBases).toHaveLength(2)
expect(migrator.preparedBases).toEqual(
expect.arrayContaining([
expect.objectContaining({
id: 'kb-missing-chunk',
chunkSize: 1024,
chunkOverlap: 200
}),
expect.objectContaining({
id: 'kb-small-chunk',
chunkSize: 128,
chunkOverlap: 127
})
])
)
})
it('prepare skips base and items when legacy knowledge store path is a directory', async () => {
const migrator = new KnowledgeMigrator() as any
vi.spyOn(migrator, 'resolveDimensionsForBase').mockResolvedValue({
@@ -521,10 +593,12 @@ describe('KnowledgeMigrator dimensions resolution', () => {
const fileItem = migrator.preparedItems.find((item: any) => item.id === 'file-item-1')
expect(noteItem?.data).toEqual({
source: 'https://streamed.example.com',
content: 'streamed note content',
sourceUrl: 'https://streamed.example.com'
})
expect(fileItem?.data).toEqual({
source: '/tmp/report.pdf',
file: expect.objectContaining({
id: 'file-1',
name: 'report.pdf'
@@ -570,10 +644,61 @@ describe('KnowledgeMigrator dimensions resolution', () => {
expect(migrator.preparedBases).toHaveLength(1)
expect(migrator.preparedBases[0].embeddingModelId).toBe('silicon::BAAI/bge-m3')
expect(migrator.preparedBases[0].rerankModelId).toBe('silicon::Qwen/Qwen3-Reranker-8B')
expect(migrator.preparedBases[0].searchMode).toBe('default')
expect(migrator.preparedBases[0].searchMode).toBe('hybrid')
expect(migrator.skippedCount).toBe(0)
})
it('prepare clears dangling rerank model reference while keeping resolved embedding model', async () => {
const migrator = new KnowledgeMigrator() as any
vi.spyOn(migrator, 'resolveDimensionsForBase').mockResolvedValue({
dimensions: 1024,
reason: 'ok'
})
const ctx = {
paths: { knowledgeBaseDir: '/mock/userData/Data/KnowledgeBase' },
sources: {
reduxState: {
getCategory: vi.fn().mockReturnValue({
bases: [
{
id: 'kb-dangling-rerank',
name: 'KB dangling rerank',
model: { id: 'BAAI/bge-m3', name: 'BAAI/bge-m3', provider: 'silicon' },
rerankModel: { id: 'missing-rerank', name: 'missing-rerank', provider: 'silicon' },
items: []
}
]
})
},
dexieExport: {
tableExists: vi.fn().mockResolvedValue(false),
readTable: vi.fn()
}
},
db: {
select: vi.fn().mockReturnValue({
from: vi.fn().mockResolvedValue([{ id: 'silicon::BAAI/bge-m3' }])
})
}
} as any
const result = await migrator.prepare(ctx)
expect(result.success).toBe(true)
expect(migrator.preparedBases).toHaveLength(1)
expect(migrator.preparedBases[0]).toMatchObject({
id: 'kb-dangling-rerank',
embeddingModelId: 'silicon::BAAI/bge-m3',
status: 'completed',
error: null,
rerankModelId: null
})
expect(result.warnings).toContain(
'Knowledge base kb-dangling-rerank: dangling rerank model reference silicon::missing-rerank was cleared'
)
})
it('prepare infers item status from legacy uniqueId', async () => {
const migrator = new KnowledgeMigrator() as any
vi.spyOn(migrator, 'resolveDimensionsForBase').mockResolvedValue({
@@ -626,12 +751,11 @@ describe('KnowledgeMigrator dimensions resolution', () => {
expect(statusById.get('i-failed-with-unique-id')).toBe('completed')
})
it('prepare preserves base and items when embedding model is missing', async () => {
it('prepare preserves failed missing-model bases with null dimensions when legacy dimensions are missing', async () => {
const migrator = new KnowledgeMigrator() as any
vi.spyOn(migrator, 'resolveDimensionsForBase').mockResolvedValue({
dimensions: 1024,
reason: 'ok'
})
const resolveDimensionsForBase = vi
.spyOn(migrator, 'resolveDimensionsForBase')
.mockRejectedValue(new Error('should not inspect vector DB for missing models'))
const ctx = {
paths: { knowledgeBaseDir: '/mock/userData/Data/KnowledgeBase' },
@@ -661,13 +785,58 @@ describe('KnowledgeMigrator dimensions resolution', () => {
expect(result.success).toBe(true)
expect(migrator.preparedBases).toHaveLength(1)
expect(migrator.preparedBases[0]).toMatchObject({
id: 'kb-no-model',
dimensions: null,
embeddingModelId: null,
status: 'failed',
error: KNOWLEDGE_BASE_ERROR_MISSING_EMBEDDING_MODEL
})
expect(migrator.preparedItems).toHaveLength(2)
expect(migrator.skippedCount).toBe(0)
expect(migrator.sourceCount).toBe(3)
expect(migrator.preparedBases[0].embeddingModelId).toBeNull()
expect(
result.warnings?.some((warning: string) => warning.includes('missing embedding model reference was cleared'))
).toBe(true)
expect(resolveDimensionsForBase).not.toHaveBeenCalled()
})
it('prepare preserves legacy dimensions for failed bases when embedding model is missing', async () => {
const migrator = new KnowledgeMigrator() as any
const resolveDimensionsForBase = vi
.spyOn(migrator, 'resolveDimensionsForBase')
.mockRejectedValue(new Error('should not inspect vector DB for missing models'))
const ctx = {
paths: { knowledgeBaseDir: '/mock/userData/Data/KnowledgeBase' },
sources: {
reduxState: {
getCategory: vi.fn().mockReturnValue({
bases: [
{
id: 'kb-no-model',
name: 'KB without model',
dimensions: 768,
items: [{ id: 'i1', type: 'note', content: 'test' }]
}
]
})
},
dexieExport: {
tableExists: vi.fn().mockResolvedValue(false),
readTable: vi.fn()
}
}
} as any
const result = await migrator.prepare(ctx)
expect(result.success).toBe(true)
expect(migrator.preparedBases[0]).toMatchObject({
id: 'kb-no-model',
dimensions: 768,
embeddingModelId: null,
status: 'failed',
error: KNOWLEDGE_BASE_ERROR_MISSING_EMBEDDING_MODEL
})
expect(resolveDimensionsForBase).not.toHaveBeenCalled()
})
it('prepare skips duplicate base ids and duplicate item ids with warnings', async () => {
@@ -725,8 +894,20 @@ describe('KnowledgeMigrator dimensions resolution', () => {
expect(migrator.skippedCount).toBe(3)
expect(migrator.preparedBases.map((base: any) => base.id)).toEqual(['kb-1', 'kb-2'])
expect(migrator.preparedItems.map((item: any) => item.id)).toEqual(['item-1', 'item-dup', 'item-2'])
expect(result.warnings).toContain('Skipped duplicate knowledge base kb-1')
expect(result.warnings).toContain('Skipped duplicate knowledge item item-dup in base kb-2')
expect(
result.warnings?.some(
(warning: string) =>
warning.includes('Skipped knowledge records (duplicate_knowledge_base): count=1') &&
warning.includes('Skipped duplicate knowledge base kb-1')
)
).toBe(true)
expect(
result.warnings?.some(
(warning: string) =>
warning.includes('Skipped knowledge records (duplicate_knowledge_item): count=1') &&
warning.includes('Skipped duplicate knowledge item item-dup in base kb-2')
)
).toBe(true)
})
it('prepare migrates legacy flat items without grouping metadata', async () => {
@@ -920,6 +1101,77 @@ describe('KnowledgeMigrator execute/validate paths', () => {
expect(transaction).toHaveBeenCalledTimes(2)
})
it('execute writes recoverable failed bases and their items', async () => {
const migrator = new KnowledgeMigrator() as any
migrator.preparedBases = [
{
id: 'kb-missing-model',
name: 'Missing Model KB',
groupId: null,
emoji: '📁',
dimensions: 768,
embeddingModelId: null,
status: 'failed',
error: KNOWLEDGE_BASE_ERROR_MISSING_EMBEDDING_MODEL,
rerankModelId: null,
fileProcessorId: null,
chunkSize: 1024,
chunkOverlap: 200,
threshold: null,
documentCount: null,
searchMode: 'hybrid',
hybridAlpha: null,
createdAt: 1775114958369,
updatedAt: 1775114958369
}
]
migrator.preparedItems = [
{
id: 'item-1',
baseId: 'kb-missing-model',
groupId: null,
type: 'note',
data: { source: 'note', content: 'note' },
status: 'idle',
phase: null,
error: null,
createdAt: 1775114958369,
updatedAt: 1775114958369
}
]
const insertedValues: unknown[] = []
const values = vi.fn(async (value: unknown) => {
insertedValues.push(value)
})
const insert = vi.fn().mockReturnValue({ values })
const transaction = vi.fn(async (callback: (tx: any) => Promise<void>) => {
await callback({ insert })
})
const result = await migrator.execute({
db: { transaction }
} as any)
expect(result.success).toBe(true)
expect(result.processedCount).toBe(2)
expect(insertedValues).toEqual([
expect.objectContaining({
id: 'kb-missing-model',
embeddingModelId: null,
status: 'failed',
error: KNOWLEDGE_BASE_ERROR_MISSING_EMBEDDING_MODEL
}),
[
expect.objectContaining({
id: 'item-1',
baseId: 'kb-missing-model',
status: 'idle'
})
]
])
})
it('execute failure keeps processedCount to already committed base groups only', async () => {
const migrator = new KnowledgeMigrator() as any
migrator.preparedBases = [


@@ -1,8 +1,15 @@
import path from 'node:path'
import type { knowledgeBaseTable, knowledgeItemTable } from '@data/db/schemas/knowledge'
import type { FileMetadata } from '@shared/data/types/file'
import type { KnowledgeItemData, KnowledgeItemStatus } from '@shared/data/types/knowledge'
import {
DEFAULT_KNOWLEDGE_BASE_CHUNK_OVERLAP,
DEFAULT_KNOWLEDGE_BASE_CHUNK_SIZE,
DEFAULT_KNOWLEDGE_BASE_EMOJI,
DEFAULT_KNOWLEDGE_BASE_STATUS,
DEFAULT_KNOWLEDGE_SEARCH_MODE,
KNOWLEDGE_BASE_ERROR_MISSING_EMBEDDING_MODEL,
type KnowledgeItemData,
type KnowledgeItemStatus
} from '@shared/data/types/knowledge'
import { legacyModelToUniqueId } from '../transformers/ModelTransformers'
@@ -44,7 +51,6 @@ export interface LegacyKnowledgeItem {
export interface LegacyKnowledgeBase {
id?: string
name?: string
description?: string
dimensions?: number
model?: LegacyModel | null
rerankModel?: LegacyModel | null
@@ -113,19 +119,44 @@ export const toTimestamp = (value: number | undefined): number => {
export const inferKnowledgeItemStatus = (item: Pick<LegacyKnowledgeItem, 'uniqueId'>): KnowledgeItemStatus =>
typeof item.uniqueId === 'string' && item.uniqueId.trim() !== '' ? 'completed' : 'idle'
const normalizeKnowledgeItemError = (
status: KnowledgeItemStatus,
processingError: string | undefined
): string | null => {
if (status !== 'failed') {
return null
}
const normalizedError = processingError?.trim()
return normalizedError ? normalizedError : 'Legacy knowledge item failed without an error message.'
}
const getDefaultChunkOverlap = (chunkSize: number): number => {
if (chunkSize <= 1) {
return 0
}
return Math.min(DEFAULT_KNOWLEDGE_BASE_CHUNK_OVERLAP, chunkSize - 1)
}
function normalizeMigratedKnowledgeBaseConfig<T extends Partial<NewKnowledgeBase>>(config: T): T {
const normalized = { ...config }
if (normalized.chunkSize != null && normalized.chunkSize <= 0) {
normalized.chunkSize = undefined as T['chunkSize']
}
const chunkSizeCandidate = normalized.chunkSize
const chunkSize =
typeof chunkSizeCandidate === 'number' && Number.isInteger(chunkSizeCandidate) && chunkSizeCandidate > 0
? chunkSizeCandidate
: DEFAULT_KNOWLEDGE_BASE_CHUNK_SIZE
normalized.chunkSize = chunkSize as T['chunkSize']
if (normalized.chunkOverlap != null) {
if (normalized.chunkOverlap < 0) {
normalized.chunkOverlap = undefined as T['chunkOverlap']
} else if (normalized.chunkSize == null || normalized.chunkOverlap >= normalized.chunkSize) {
normalized.chunkOverlap = undefined as T['chunkOverlap']
}
const chunkOverlapCandidate = normalized.chunkOverlap
if (
typeof chunkOverlapCandidate !== 'number' ||
!Number.isInteger(chunkOverlapCandidate) ||
chunkOverlapCandidate < 0 ||
chunkOverlapCandidate >= chunkSize
) {
normalized.chunkOverlap = getDefaultChunkOverlap(chunkSize) as T['chunkOverlap']
}
if (normalized.threshold != null && (normalized.threshold < 0 || normalized.threshold > 1)) {
@@ -172,7 +203,7 @@ export const resolveLegacyFileMetadata = (
export const transformKnowledgeBase = (
base: LegacyKnowledgeBaseWithIdentity,
dimensions: number
dimensions: number | null
): KnowledgeBaseTransformResult => {
const embeddingModelId = legacyModelToUniqueId(base.model ?? null)
const rerankModelId = legacyModelToUniqueId(base.rerankModel ?? null)
@@ -180,16 +211,19 @@ export const transformKnowledgeBase = (
const transformedBase: NewKnowledgeBase = {
id: base.id,
name: base.name,
description: base.description,
groupId: null,
emoji: DEFAULT_KNOWLEDGE_BASE_EMOJI,
dimensions,
embeddingModelId: embeddingModelId ?? null,
embeddingModelId,
status: embeddingModelId ? DEFAULT_KNOWLEDGE_BASE_STATUS : 'failed',
error: embeddingModelId ? null : KNOWLEDGE_BASE_ERROR_MISSING_EMBEDDING_MODEL,
rerankModelId: rerankModelId ?? null,
fileProcessorId: base.preprocessProvider?.provider?.id,
chunkSize: base.chunkSize,
chunkOverlap: base.chunkOverlap,
chunkSize: base.chunkSize ?? DEFAULT_KNOWLEDGE_BASE_CHUNK_SIZE,
chunkOverlap: base.chunkOverlap ?? DEFAULT_KNOWLEDGE_BASE_CHUNK_OVERLAP,
threshold: base.threshold,
documentCount: base.documentCount,
searchMode: 'default',
searchMode: DEFAULT_KNOWLEDGE_SEARCH_MODE,
createdAt: toTimestamp(base.created_at),
updatedAt: toTimestamp(base.updated_at)
}
@@ -228,7 +262,7 @@ export const transformKnowledgeItem = (
}
type = 'file'
data = { file }
data = { source: file.path, file }
} else if (item.type === 'url') {
if (typeof item.content !== 'string' || item.content.trim() === '') {
return {
@@ -239,8 +273,8 @@ export const transformKnowledgeItem = (
type = 'url'
data = {
url: item.content,
name: item.content
source: item.content,
url: item.content
}
} else if (item.type === 'sitemap') {
if (typeof item.content !== 'string' || item.content.trim() === '') {
@@ -252,8 +286,8 @@ export const transformKnowledgeItem = (
type = 'sitemap'
data = {
url: item.content,
name: item.content
source: item.content,
url: item.content
}
} else if (item.type === 'directory') {
if (typeof item.content !== 'string' || item.content.trim() === '') {
@@ -265,7 +299,7 @@ export const transformKnowledgeItem = (
type = 'directory'
data = {
name: path.basename(item.content),
source: item.content,
path: item.content
}
} else if (item.type === 'note') {
@@ -274,6 +308,7 @@ export const transformKnowledgeItem = (
type = 'note'
data = {
source: note?.sourceUrl ?? item.sourceUrl ?? content,
content,
sourceUrl: note?.sourceUrl ?? item.sourceUrl
}
@@ -284,6 +319,8 @@ export const transformKnowledgeItem = (
}
}
const status = inferKnowledgeItemStatus(item)
return {
ok: true,
value: {
@@ -296,8 +333,9 @@ export const transformKnowledgeItem = (
groupId: null,
type,
data,
status: inferKnowledgeItemStatus(item),
error: item.processingError ?? null,
status,
phase: null,
error: normalizeKnowledgeItemError(status, item.processingError),
createdAt: toTimestamp(item.created_at),
updatedAt: toTimestamp(item.updated_at)
}
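
The `source` field added to each item's `data` in the hunks above follows one rule per item type. A minimal sketch of that rule, assuming simplified input shapes: `LegacyItemSketch` and `resolveSource` are hypothetical names for illustration, and the real transformer also looks up notes and files in separate maps.

```typescript
// Hypothetical, simplified legacy item shapes (the real legacy type has more fields).
type LegacyItemSketch =
  | { type: 'file'; filePath: string }
  | { type: 'url' | 'sitemap' | 'directory'; content: string }
  | { type: 'note'; content: string; sourceUrl?: string }

// Pick the value stored as `data.source` for each item type.
function resolveSource(item: LegacyItemSketch): string {
  switch (item.type) {
    case 'file':
      // Files use the file path.
      return item.filePath
    case 'url':
    case 'sitemap':
    case 'directory':
      // URLs, sitemaps, and directories reuse the legacy content string.
      return item.content
    case 'note':
      // Notes prefer the source URL and fall back to the note content.
      return item.sourceUrl ?? item.content
  }
}
```

Keeping `source` alongside the type-specific field (`url`, `path`, `file`) gives every item one field with the same name, whatever its type.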


@@ -1,4 +1,5 @@
import { FILE_TYPE } from '@shared/data/types/file'
import { KNOWLEDGE_BASE_ERROR_MISSING_EMBEDDING_MODEL } from '@shared/data/types/knowledge'
import { describe, expect, it } from 'vitest'
import { legacyModelToUniqueId } from '../../transformers/ModelTransformers'
@@ -28,7 +29,7 @@ describe('KnowledgeMappings', () => {
expect(inferKnowledgeItemStatus({} as any)).toBe('idle')
})
it('transformKnowledgeBase preserves the knowledge base when model is unavailable', () => {
it('transformKnowledgeBase marks knowledge bases without an embedding model as failed', () => {
expect(
transformKnowledgeBase(
{
@@ -41,9 +42,48 @@ describe('KnowledgeMappings', () => {
ok: true,
value: expect.objectContaining({
id: 'kb-1',
name: 'KB 1',
embeddingModelId: null,
rerankModelId: null
status: 'failed',
error: KNOWLEDGE_BASE_ERROR_MISSING_EMBEDDING_MODEL
})
})
})
it('transformKnowledgeBase fills default chunk config when legacy values are missing', () => {
expect(
transformKnowledgeBase(
{
id: 'kb-default-config',
name: 'KB default config',
model: { id: 'BAAI/bge-m3', name: 'bge', provider: 'silicon' }
},
1024
)
).toStrictEqual({
ok: true,
value: expect.objectContaining({
chunkSize: 1024,
chunkOverlap: 200
})
})
})
it('transformKnowledgeBase keeps default overlap below a preserved small chunk size', () => {
expect(
transformKnowledgeBase(
{
id: 'kb-small-chunk',
name: 'KB small chunk',
model: { id: 'BAAI/bge-m3', name: 'bge', provider: 'silicon' },
chunkSize: 128
},
1024
)
).toStrictEqual({
ok: true,
value: expect.objectContaining({
chunkSize: 128,
chunkOverlap: 127
})
})
})
@@ -74,7 +114,7 @@ describe('KnowledgeMappings', () => {
})
})
it('transformKnowledgeBase clears invalid tuning config instead of skipping the base', () => {
it('transformKnowledgeBase normalizes invalid tuning config instead of skipping the base', () => {
expect(
transformKnowledgeBase(
{
@@ -95,10 +135,10 @@ describe('KnowledgeMappings', () => {
name: 'KB invalid config',
embeddingModelId: 'silicon::BAAI/bge-m3',
chunkSize: 200,
chunkOverlap: undefined,
chunkOverlap: 199,
threshold: undefined,
documentCount: undefined,
searchMode: 'default'
searchMode: 'hybrid'
})
})
})
@@ -173,10 +213,12 @@ describe('KnowledgeMappings', () => {
groupId: null,
type: 'note',
data: {
source: 'https://dexie.example.com',
content: 'dexie-content',
sourceUrl: 'https://dexie.example.com'
},
status: 'idle',
phase: null,
error: null,
createdAt: expect.any(Number),
updatedAt: expect.any(Number)
@@ -207,9 +249,11 @@ describe('KnowledgeMappings', () => {
groupId: null,
type: 'file',
data: {
source: '/tmp/report.pdf',
file: fileMetadata
},
status: 'completed',
phase: null,
error: null,
createdAt: expect.any(Number),
updatedAt: expect.any(Number)
@@ -217,6 +261,53 @@ describe('KnowledgeMappings', () => {
})
})
it('transformKnowledgeItem clears blank legacy processing errors for idle and completed items', () => {
const idleResult = transformKnowledgeItem(
'kb-1',
{
id: 'idle-note',
type: 'note',
content: 'idle note',
processingError: ''
},
{
noteById: new Map(),
filesById: new Map()
}
)
const completedResult = transformKnowledgeItem(
'kb-1',
{
id: 'completed-file',
type: 'file',
content: 'file-1',
uniqueId: 'loader-1',
processingError: ' '
},
{
noteById: new Map(),
filesById: new Map([['file-1', fileMetadata]])
}
)
expect(idleResult).toStrictEqual({
ok: true,
value: expect.objectContaining({
status: 'idle',
phase: null,
error: null
})
})
expect(completedResult).toStrictEqual({
ok: true,
value: expect.objectContaining({
status: 'completed',
phase: null,
error: null
})
})
})
it('transformKnowledgeItem rejects unsupported legacy item types', () => {
expect(
transformKnowledgeItem(
@@ -259,10 +350,11 @@ describe('KnowledgeMappings', () => {
groupId: null,
type: 'directory',
data: {
name: 'docs',
source: '/tmp/docs',
path: '/tmp/docs'
},
status: 'idle',
phase: null,
error: null,
createdAt: expect.any(Number),
updatedAt: expect.any(Number)
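
The chunk-overlap expectations in these tests (1024 gives 200, 128 gives 127, and an invalid overlap with chunk size 200 gives 199) are consistent with a default that is capped just below the chunk size. A minimal sketch under that assumption: `defaultChunkOverlap` is a hypothetical stand-in for `getDefaultChunkOverlap`, whose body is not shown in this diff, and the constant 200 is inferred from the tests.

```typescript
// Assumed default overlap, inferred from the test expectations in this commit.
const ASSUMED_DEFAULT_CHUNK_OVERLAP = 200

// Hypothetical stand-in for getDefaultChunkOverlap: keep the default
// strictly below the chunk size.
function defaultChunkOverlap(chunkSize: number): number {
  return Math.max(0, Math.min(ASSUMED_DEFAULT_CHUNK_OVERLAP, chunkSize - 1))
}

// Keep a legacy overlap only when it is a non-negative integer below chunkSize;
// otherwise fall back to the capped default.
function normalizeChunkOverlap(chunkSize: number, overlap: unknown): number {
  if (typeof overlap === 'number' && Number.isInteger(overlap) && overlap >= 0 && overlap < chunkSize) {
    return overlap
  }
  return defaultChunkOverlap(chunkSize)
}
```

With this reading, a migrated base always ends up with a usable overlap instead of an `undefined` value.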


@@ -1,26 +1,38 @@
import fs from 'node:fs'
import path from 'node:path'
import { pathToFileURL } from 'node:url'
import { application } from '@application'
import { type Client, createClient, type Value as LibsqlValue } from '@libsql/client'
import { sanitizeFilename } from '@main/utils/file'
const LEGACY_VECTOR_TABLE_NAME = 'vectors'
const LEGACY_VECTOR_BACKUP_SUFFIX = '.embedjs.bak'
export interface LegacyKnowledgeVectorRow {
pageContent: string
uniqueLoaderId: string
source: string
vector: number[] | null
vector: LegacyKnowledgeVectorDecodeResult
}
export type LegacyKnowledgeVectorDecodeResult =
| { status: 'decoded'; value: number[] }
| { status: 'missing' }
| { status: 'unsupported_encoding'; encoding: string }
export type LegacyKnowledgeVectorLoadResult =
| { status: 'ok'; dbPath: string; rows: LegacyKnowledgeVectorRow[] }
| { status: 'invalid_path' | 'missing' | 'directory' | 'not_embedjs'; dbPath?: string }
export class KnowledgeVectorSourceReader {
constructor(private readonly knowledgeBaseDir: string) {}
getLegacyDbPath(baseId: string): string | null {
return application.getPath('feature.knowledgebase.data', sanitizeFilename(baseId, '_'))
return path.join(this.knowledgeBaseDir, sanitizeFilename(baseId, '_'))
}
private getLegacyBackupPath(dbPath: string): string {
return `${dbPath}${LEGACY_VECTOR_BACKUP_SUFFIX}`
}
async loadBase(baseId: string): Promise<LegacyKnowledgeVectorLoadResult> {
@@ -29,7 +41,11 @@ export class KnowledgeVectorSourceReader {
return { status: 'invalid_path' }
}
const backupPath = this.getLegacyBackupPath(dbPath)
if (!fs.existsSync(dbPath)) {
if (fs.existsSync(backupPath)) {
return this.loadLegacyDb(dbPath, backupPath)
}
return { status: 'missing', dbPath }
}
@@ -38,7 +54,16 @@ export class KnowledgeVectorSourceReader {
return { status: 'directory', dbPath }
}
const client = createClient({ url: pathToFileURL(dbPath).toString() })
const result = await this.loadLegacyDb(dbPath, dbPath)
if (result.status === 'not_embedjs' && fs.existsSync(backupPath)) {
return this.loadLegacyDb(dbPath, backupPath)
}
return result
}
private async loadLegacyDb(dbPath: string, sourcePath: string): Promise<LegacyKnowledgeVectorLoadResult> {
const client = createClient({ url: pathToFileURL(sourcePath).toString() })
try {
const isEmbedjs = await this.isEmbedjsDatabase(client)
if (!isEmbedjs) {
@@ -82,30 +107,49 @@ export class KnowledgeVectorSourceReader {
// client/runtime combinations. In local verification on macOS this returns
// ArrayBuffer, but other environments may expose Float32Array or another
// ArrayBufferView, so keep the decoder intentionally permissive.
private deserializeLegacyVector(raw: LibsqlValue): number[] | null {
private describeLegacyVectorEncoding(raw: LibsqlValue): string {
if (raw === null) {
return 'null'
}
if (raw === undefined) {
return 'undefined'
}
if (typeof raw !== 'object') {
return typeof raw
}
return raw.constructor?.name ?? 'Object'
}
private deserializeLegacyVector(raw: LibsqlValue): LegacyKnowledgeVectorDecodeResult {
if (raw === null || raw === undefined) {
return null
return { status: 'missing' }
}
if (raw instanceof Float32Array) {
return Array.from(raw)
return { status: 'decoded', value: Array.from(raw) }
}
if (raw instanceof ArrayBuffer) {
return Array.from(new Float32Array(raw))
return { status: 'decoded', value: Array.from(new Float32Array(raw)) }
}
if (ArrayBuffer.isView(raw)) {
const view = raw as ArrayBufferView
return Array.from(
new Float32Array(view.buffer, view.byteOffset, view.byteLength / Float32Array.BYTES_PER_ELEMENT)
)
return {
status: 'decoded',
value: Array.from(
new Float32Array(view.buffer, view.byteOffset, view.byteLength / Float32Array.BYTES_PER_ELEMENT)
)
}
}
if (Array.isArray(raw)) {
return raw.map((value) => Number(value))
return { status: 'decoded', value: raw.map((value) => Number(value)) }
}
return null
return { status: 'unsupported_encoding', encoding: this.describeLegacyVectorEncoding(raw) }
}
}
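
Returning a tagged result instead of `number[] | null` lets callers tell a missing vector apart from one stored in an unknown encoding. A minimal sketch of how a consumer might branch on it: the union is copied from the diff above, while `describeDecodeResult` and `decodeView` are hypothetical helpers written for illustration only.

```typescript
type LegacyKnowledgeVectorDecodeResult =
  | { status: 'decoded'; value: number[] }
  | { status: 'missing' }
  | { status: 'unsupported_encoding'; encoding: string }

// Hypothetical consumer: every status must be handled explicitly.
function describeDecodeResult(result: LegacyKnowledgeVectorDecodeResult): string {
  switch (result.status) {
    case 'decoded':
      return `decoded ${result.value.length} dimensions`
    case 'missing':
      return 'skipped: vector payload is missing'
    case 'unsupported_encoding':
      return `skipped: unsupported encoding ${result.encoding}`
  }
}

// Reinterpret any ArrayBufferView as float32 values, mirroring the
// ArrayBuffer.isView branch of deserializeLegacyVector.
function decodeView(view: ArrayBufferView): number[] {
  return Array.from(
    new Float32Array(view.buffer, view.byteOffset, view.byteLength / Float32Array.BYTES_PER_ELEMENT)
  )
}
```

Because the `switch` covers every member of the union, adding a new status later makes the compiler flag each consumer that has not been updated.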


@@ -6,29 +6,6 @@ import { pathToFileURL } from 'node:url'
import { createClient } from '@libsql/client'
import { afterEach, beforeEach, describe, expect, it, vi } from 'vitest'
const { setKnowledgeBaseRoot, getPathMock } = vi.hoisted(() => {
let currentKnowledgeBaseRoot = ''
return {
setKnowledgeBaseRoot: (nextPath: string) => {
currentKnowledgeBaseRoot = nextPath
},
getPathMock: vi.fn((key: string, filename?: string) => {
if (key !== 'feature.knowledgebase.data') {
throw new Error(`Unexpected path key: ${key}`)
}
return filename ? path.join(currentKnowledgeBaseRoot, filename) : currentKnowledgeBaseRoot
})
}
})
vi.mock('@application', () => ({
application: {
getPath: getPathMock
}
}))
vi.mock('@main/utils/file', () => ({
sanitizeFilename: (value: string) => value
}))
@@ -79,13 +56,34 @@ async function createLegacyVectorDb(
client.close()
}
async function createLegacyVectorDbWithRawVector(dbPath: string, vectorColumnType: string, vectorValue: unknown) {
const client = createClient({ url: pathToFileURL(dbPath).toString() })
await client.execute(`
CREATE TABLE vectors (
id TEXT PRIMARY KEY,
pageContent TEXT UNIQUE,
uniqueLoaderId TEXT NOT NULL,
source TEXT NOT NULL,
vector ${vectorColumnType},
metadata TEXT
)
`)
const encodedValue = vectorValue == null ? 'NULL' : `'${String(vectorValue).replaceAll("'", "''")}'`
await client.execute(`
INSERT INTO vectors (id, pageContent, uniqueLoaderId, source, vector, metadata)
VALUES ('legacy-row-1', 'hello vector', 'loader-1', '/tmp/file.md', ${encodedValue}, '{}')
`)
client.close()
}
describe('KnowledgeVectorSourceReader', () => {
let tempRoot: string
beforeEach(() => {
tempRoot = fs.mkdtempSync(path.join(os.tmpdir(), 'knowledge-vector-source-reader-'))
fs.mkdirSync(path.join(tempRoot, 'KnowledgeBase'), { recursive: true })
setKnowledgeBaseRoot(path.join(tempRoot, 'KnowledgeBase'))
})
afterEach(() => {
@@ -93,7 +91,7 @@ describe('KnowledgeVectorSourceReader', () => {
})
it('loads legacy embedjs rows from the knowledge base path', async () => {
const reader = new KnowledgeVectorSourceReader()
const reader = new KnowledgeVectorSourceReader(path.join(tempRoot, 'KnowledgeBase'))
const dbPath = path.join(tempRoot, 'KnowledgeBase', 'kb-1')
await createLegacyVectorDb(dbPath, [
@@ -114,14 +112,54 @@ describe('KnowledgeVectorSourceReader', () => {
pageContent: 'hello vector',
uniqueLoaderId: 'loader-1',
source: '/tmp/file.md',
vector: [1, 2]
vector: { status: 'decoded', value: [1, 2] }
}
]
})
})
it('marks null legacy vector payloads as missing', async () => {
const reader = new KnowledgeVectorSourceReader(path.join(tempRoot, 'KnowledgeBase'))
const dbPath = path.join(tempRoot, 'KnowledgeBase', 'kb-1')
await createLegacyVectorDbWithRawVector(dbPath, 'BLOB', null)
await expect(reader.loadBase('kb-1')).resolves.toEqual({
status: 'ok',
dbPath,
rows: [
{
pageContent: 'hello vector',
uniqueLoaderId: 'loader-1',
source: '/tmp/file.md',
vector: { status: 'missing' }
}
]
})
})
it('marks unknown legacy vector encodings as unsupported', async () => {
const reader = new KnowledgeVectorSourceReader(path.join(tempRoot, 'KnowledgeBase'))
const dbPath = path.join(tempRoot, 'KnowledgeBase', 'kb-1')
await createLegacyVectorDbWithRawVector(dbPath, 'TEXT', 'not-a-vector')
await expect(reader.loadBase('kb-1')).resolves.toEqual({
status: 'ok',
dbPath,
rows: [
{
pageContent: 'hello vector',
uniqueLoaderId: 'loader-1',
source: '/tmp/file.md',
vector: { status: 'unsupported_encoding', encoding: 'string' }
}
]
})
})
it('returns not_embedjs for non-embedjs sqlite files', async () => {
const reader = new KnowledgeVectorSourceReader()
const reader = new KnowledgeVectorSourceReader(path.join(tempRoot, 'KnowledgeBase'))
const dbPath = path.join(tempRoot, 'KnowledgeBase', 'kb-1')
const client = createClient({ url: pathToFileURL(dbPath).toString() })


@@ -9,120 +9,74 @@ import { knowledgeBaseTable } from '@data/db/schemas/knowledge'
import { loggerService } from '@logger'
import { DataApiErrorFactory } from '@shared/data/api'
import type { OffsetPaginationResponse } from '@shared/data/api/apiTypes'
import type { ListKnowledgeBasesQuery, UpdateKnowledgeBaseDto } from '@shared/data/api/schemas/knowledges'
import {
type CreateKnowledgeBaseDto,
type KnowledgeBaseListQuery,
type UpdateKnowledgeBaseDto
} from '@shared/data/api/schemas/knowledges'
import type { KnowledgeBase, KnowledgeSearchMode } from '@shared/data/types/knowledge'
DEFAULT_KNOWLEDGE_BASE_CHUNK_OVERLAP,
DEFAULT_KNOWLEDGE_BASE_CHUNK_SIZE,
DEFAULT_KNOWLEDGE_BASE_EMOJI,
DEFAULT_KNOWLEDGE_BASE_STATUS,
DEFAULT_KNOWLEDGE_SEARCH_MODE,
type KnowledgeBase,
KnowledgeBaseSchema
} from '@shared/data/types/knowledge'
import { desc, eq, sql } from 'drizzle-orm'
import { nullsToUndefined, timestampToISO } from './utils/rowMappers'
const logger = loggerService.withContext('DataApi:KnowledgeBaseService')
export interface KnowledgeBaseConfigInput {
chunkSize?: number | null
chunkOverlap?: number | null
threshold?: number | null
documentCount?: number | null
searchMode?: KnowledgeSearchMode | null
type KnowledgeBaseRow = typeof knowledgeBaseTable.$inferSelect
function validateKnowledgeBaseConfig(config: {
chunkSize: number
chunkOverlap: number
searchMode?: string | null
hybridAlpha?: number | null
}
function addFieldError(
fieldErrors: Record<string, string[]>,
field: keyof KnowledgeBaseConfigInput,
message: string
): void {
if (!fieldErrors[field]) {
fieldErrors[field] = []
}
fieldErrors[field].push(message)
}
export function normalizeKnowledgeBaseConfigDependencies<T extends KnowledgeBaseConfigInput>(config: T): T {
const normalized = { ...config }
if (normalized.chunkOverlap != null) {
if (normalized.chunkSize == null || normalized.chunkOverlap >= normalized.chunkSize) {
normalized.chunkOverlap = undefined as T['chunkOverlap']
}
}
if (normalized.hybridAlpha != null && normalized.searchMode !== 'hybrid') {
normalized.hybridAlpha = undefined as T['hybridAlpha']
}
return normalized
}
export function validateKnowledgeBaseConfig(config: KnowledgeBaseConfigInput): Record<string, string[]> {
}): Record<string, string[]> {
const fieldErrors: Record<string, string[]> = {}
if (config.chunkSize != null && config.chunkSize <= 0) {
addFieldError(fieldErrors, 'chunkSize', 'Chunk size must be greater than 0')
if (config.chunkOverlap >= config.chunkSize) {
fieldErrors.chunkOverlap = ['Chunk overlap must be smaller than chunk size']
}
if (config.chunkOverlap != null && config.chunkOverlap < 0) {
addFieldError(fieldErrors, 'chunkOverlap', 'Chunk overlap must be greater than or equal to 0')
}
if (config.threshold != null && (config.threshold < 0 || config.threshold > 1)) {
addFieldError(fieldErrors, 'threshold', 'Threshold must be between 0 and 1')
}
if (config.documentCount != null && config.documentCount <= 0) {
addFieldError(fieldErrors, 'documentCount', 'Document count must be greater than 0')
}
const hybridAlphaIsInRange = config.hybridAlpha == null || (config.hybridAlpha >= 0 && config.hybridAlpha <= 1)
if (!hybridAlphaIsInRange) {
addFieldError(fieldErrors, 'hybridAlpha', 'Hybrid alpha must be between 0 and 1')
}
const chunkOverlap = config.chunkOverlap
if (chunkOverlap != null && chunkOverlap >= 0) {
if (config.chunkSize == null) {
addFieldError(fieldErrors, 'chunkOverlap', 'Chunk overlap requires chunk size')
} else if (chunkOverlap >= config.chunkSize) {
addFieldError(fieldErrors, 'chunkOverlap', 'Chunk overlap must be smaller than chunk size')
}
}
if (config.hybridAlpha != null && hybridAlphaIsInRange && config.searchMode !== 'hybrid') {
addFieldError(fieldErrors, 'hybridAlpha', 'Hybrid alpha requires hybrid search mode')
if (config.hybridAlpha != null && config.searchMode !== 'hybrid') {
fieldErrors.hybridAlpha = ['Hybrid alpha requires hybrid search mode']
}
return fieldErrors
}
function rowToKnowledgeBase(row: typeof knowledgeBaseTable.$inferSelect): KnowledgeBase {
function rowToKnowledgeBase(row: KnowledgeBaseRow): KnowledgeBase {
const clean = nullsToUndefined(row)
return {
return KnowledgeBaseSchema.parse({
...clean,
// Preserve the nullable (`T | null`) contract of these fields: bypass clean, which would narrow null to undefined
groupId: row.groupId,
dimensions: row.dimensions,
embeddingModelId: row.embeddingModelId,
error: row.error,
createdAt: timestampToISO(row.createdAt),
updatedAt: timestampToISO(row.updatedAt)
}
})
}
export class KnowledgeBaseService {
async list(query: KnowledgeBaseListQuery): Promise<OffsetPaginationResponse<KnowledgeBase>> {
const db = application.get('DbService').getDb()
private get db() {
return application.get('DbService').getDb()
}
async list(query: ListKnowledgeBasesQuery): Promise<OffsetPaginationResponse<KnowledgeBase>> {
const { page, limit } = query
const offset = (page - 1) * limit
const [rows, [{ count }]] = await Promise.all([
db
this.db
.select()
.from(knowledgeBaseTable)
.orderBy(desc(knowledgeBaseTable.createdAt), desc(knowledgeBaseTable.id))
.limit(limit)
.offset(offset),
db.select({ count: sql<number>`count(*)` }).from(knowledgeBaseTable)
this.db.select({ count: sql<number>`count(*)` }).from(knowledgeBaseTable)
])
return {
@@ -133,8 +87,7 @@ export class KnowledgeBaseService {
}
async getById(id: string): Promise<KnowledgeBase> {
const db = application.get('DbService').getDb()
const [row] = await db.select().from(knowledgeBaseTable).where(eq(knowledgeBaseTable.id, id)).limit(1)
const [row] = await this.db.select().from(knowledgeBaseTable).where(eq(knowledgeBaseTable.id, id)).limit(1)
if (!row) {
throw DataApiErrorFactory.notFound('KnowledgeBase', id)
@@ -144,131 +97,129 @@ export class KnowledgeBaseService {
}
async create(dto: CreateKnowledgeBaseDto): Promise<KnowledgeBase> {
const db = application.get('DbService').getDb()
const createValues: Omit<typeof knowledgeBaseTable.$inferInsert, 'id' | 'createdAt' | 'updatedAt'> = {
name: dto.name.trim(),
description: dto.description,
dimensions: dto.dimensions,
embeddingModelId: dto.embeddingModelId.trim(),
rerankModelId: dto.rerankModelId ?? null,
fileProcessorId: dto.fileProcessorId,
chunkSize: dto.chunkSize,
chunkOverlap: dto.chunkOverlap,
threshold: dto.threshold,
documentCount: dto.documentCount,
searchMode: dto.searchMode,
const createConfig = {
chunkSize: dto.chunkSize ?? DEFAULT_KNOWLEDGE_BASE_CHUNK_SIZE,
chunkOverlap: dto.chunkOverlap ?? DEFAULT_KNOWLEDGE_BASE_CHUNK_OVERLAP,
searchMode: dto.searchMode ?? DEFAULT_KNOWLEDGE_SEARCH_MODE,
hybridAlpha: dto.hybridAlpha
}
const createFieldErrors = validateKnowledgeBaseConfig(createValues)
const createFieldErrors = validateKnowledgeBaseConfig(createConfig)
if (Object.keys(createFieldErrors).length > 0) {
throw DataApiErrorFactory.validation(createFieldErrors)
}
const [row] = await db.insert(knowledgeBaseTable).values(createValues).returning()
const createValues: Omit<typeof knowledgeBaseTable.$inferInsert, 'id' | 'createdAt' | 'updatedAt'> = {
name: dto.name.trim(),
groupId: dto.groupId ?? null,
emoji: dto.emoji ?? DEFAULT_KNOWLEDGE_BASE_EMOJI,
dimensions: dto.dimensions,
embeddingModelId: dto.embeddingModelId.trim(),
status: DEFAULT_KNOWLEDGE_BASE_STATUS,
error: null,
rerankModelId: dto.rerankModelId ?? null,
fileProcessorId: dto.fileProcessorId ?? null,
chunkSize: createConfig.chunkSize,
chunkOverlap: createConfig.chunkOverlap,
threshold: dto.threshold ?? null,
documentCount: dto.documentCount ?? null,
searchMode: createConfig.searchMode,
hybridAlpha: createConfig.hybridAlpha ?? null
}
const row = await this.db.transaction(async (tx) => {
const [inserted] = await tx.insert(knowledgeBaseTable).values(createValues).returning()
return inserted
})
logger.info('Created knowledge base', { id: row.id, name: row.name })
return rowToKnowledgeBase(row)
}
async update(id: string, dto: UpdateKnowledgeBaseDto): Promise<KnowledgeBase> {
const db = application.get('DbService').getDb()
const existing = await this.getById(id)
const updates: Partial<typeof knowledgeBaseTable.$inferInsert> = {}
if (dto.name !== undefined) updates.name = dto.name.trim()
if (dto.description !== undefined) updates.description = dto.description
const nextConfig: {
chunkSize: number
chunkOverlap: number
searchMode: KnowledgeBase['searchMode']
hybridAlpha: number | null | undefined
} = {
chunkSize: dto.chunkSize !== undefined ? dto.chunkSize : existing.chunkSize,
chunkOverlap: dto.chunkOverlap !== undefined ? dto.chunkOverlap : existing.chunkOverlap,
searchMode: dto.searchMode !== undefined ? dto.searchMode : existing.searchMode,
hybridAlpha: dto.hybridAlpha !== undefined ? dto.hybridAlpha : existing.hybridAlpha
}
if (dto.embeddingModelId !== undefined) {
const nextEmbeddingModelId = dto.embeddingModelId.trim()
if (nextEmbeddingModelId !== (existing.embeddingModelId ?? null)) {
updates.embeddingModelId = nextEmbeddingModelId
}
if (dto.searchMode !== undefined && dto.searchMode !== 'hybrid' && dto.hybridAlpha === undefined) {
nextConfig.hybridAlpha = null
}
if (dto.rerankModelId !== undefined) {
updates.rerankModelId = dto.rerankModelId ?? null
const updateFieldErrors = validateKnowledgeBaseConfig(nextConfig)
if (Object.keys(updateFieldErrors).length > 0) {
throw DataApiErrorFactory.validation(updateFieldErrors)
}
const updates: Partial<typeof knowledgeBaseTable.$inferInsert> = {}
if (dto.name !== undefined) {
const nextName = dto.name.trim()
if (nextName !== existing.name) updates.name = nextName
}
if (dto.groupId !== undefined && dto.groupId !== existing.groupId) {
updates.groupId = dto.groupId
}
if (dto.emoji !== undefined && dto.emoji !== existing.emoji) {
updates.emoji = dto.emoji
}
if (dto.rerankModelId !== undefined && dto.rerankModelId !== existing.rerankModelId) {
updates.rerankModelId = dto.rerankModelId
}
if (dto.fileProcessorId !== undefined && dto.fileProcessorId !== existing.fileProcessorId) {
updates.fileProcessorId = dto.fileProcessorId
}
if (nextConfig.chunkSize !== existing.chunkSize) {
updates.chunkSize = nextConfig.chunkSize
}
if (nextConfig.chunkOverlap !== existing.chunkOverlap) {
updates.chunkOverlap = nextConfig.chunkOverlap
}
if (dto.threshold !== undefined && dto.threshold !== existing.threshold) {
updates.threshold = dto.threshold
}
if (dto.documentCount !== undefined && dto.documentCount !== existing.documentCount) {
updates.documentCount = dto.documentCount
}
if (nextConfig.searchMode !== existing.searchMode) {
updates.searchMode = nextConfig.searchMode
}
if ((nextConfig.hybridAlpha ?? undefined) !== existing.hybridAlpha) {
updates.hybridAlpha = nextConfig.hybridAlpha
}
if (dto.fileProcessorId !== undefined) updates.fileProcessorId = dto.fileProcessorId
if (dto.chunkSize !== undefined) updates.chunkSize = dto.chunkSize
if (dto.chunkOverlap !== undefined) updates.chunkOverlap = dto.chunkOverlap
if (dto.threshold !== undefined) updates.threshold = dto.threshold
if (dto.documentCount !== undefined) updates.documentCount = dto.documentCount
if (dto.searchMode !== undefined) updates.searchMode = dto.searchMode
if (dto.hybridAlpha !== undefined) updates.hybridAlpha = dto.hybridAlpha
if (Object.keys(updates).length === 0) {
return existing
}
const mergedConfig = {
chunkSize: dto.chunkSize !== undefined ? dto.chunkSize : existing.chunkSize,
chunkOverlap: dto.chunkOverlap !== undefined ? dto.chunkOverlap : existing.chunkOverlap,
threshold: dto.threshold !== undefined ? dto.threshold : existing.threshold,
documentCount: dto.documentCount !== undefined ? dto.documentCount : existing.documentCount,
searchMode: dto.searchMode !== undefined ? dto.searchMode : existing.searchMode,
hybridAlpha: dto.hybridAlpha !== undefined ? dto.hybridAlpha : existing.hybridAlpha
}
const normalizedConfig = { ...mergedConfig }
if (dto.chunkSize !== undefined && dto.chunkOverlap === undefined) {
normalizedConfig.chunkOverlap = normalizeKnowledgeBaseConfigDependencies({
chunkSize: mergedConfig.chunkSize,
chunkOverlap: mergedConfig.chunkOverlap
}).chunkOverlap
}
if (dto.searchMode !== undefined && dto.hybridAlpha === undefined) {
normalizedConfig.hybridAlpha = normalizeKnowledgeBaseConfigDependencies({
searchMode: mergedConfig.searchMode,
hybridAlpha: mergedConfig.hybridAlpha
}).hybridAlpha
}
const updateFieldErrors = validateKnowledgeBaseConfig(normalizedConfig)
if (Object.keys(updateFieldErrors).length > 0) {
throw DataApiErrorFactory.validation(updateFieldErrors)
}
const nextChunkSize = normalizedConfig.chunkSize ?? null
if (nextChunkSize !== (existing.chunkSize ?? null)) {
updates.chunkSize = nextChunkSize
}
const nextChunkOverlap = normalizedConfig.chunkOverlap ?? null
if (nextChunkOverlap !== (existing.chunkOverlap ?? null)) {
updates.chunkOverlap = nextChunkOverlap
}
const nextThreshold = normalizedConfig.threshold ?? null
if (nextThreshold !== (existing.threshold ?? null)) {
updates.threshold = nextThreshold
}
const nextDocumentCount = normalizedConfig.documentCount ?? null
if (nextDocumentCount !== (existing.documentCount ?? null)) {
updates.documentCount = nextDocumentCount
}
const nextSearchMode = normalizedConfig.searchMode ?? null
if (nextSearchMode !== (existing.searchMode ?? null)) {
updates.searchMode = nextSearchMode
}
const nextHybridAlpha = normalizedConfig.hybridAlpha ?? null
if (nextHybridAlpha !== (existing.hybridAlpha ?? null)) {
updates.hybridAlpha = nextHybridAlpha
}
const [row] = await db.update(knowledgeBaseTable).set(updates).where(eq(knowledgeBaseTable.id, id)).returning()
const row = await this.db.transaction(async (tx) => {
const [updated] = await tx
.update(knowledgeBaseTable)
.set(updates)
.where(eq(knowledgeBaseTable.id, id))
.returning()
return updated
})
logger.info('Updated knowledge base', { id, changes: Object.keys(dto) })
return rowToKnowledgeBase(row)
}
async delete(id: string): Promise<void> {
const db = application.get('DbService').getDb()
// Verify knowledge base exists
await this.getById(id)
await db.delete(knowledgeBaseTable).where(eq(knowledgeBaseTable.id, id))
await this.db.transaction(async (tx) => {
await tx.delete(knowledgeBaseTable).where(eq(knowledgeBaseTable.id, id))
})
logger.info('Deleted knowledge base', { id })
}
}
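
The reworked `update` path merges the patch into the existing config, clears `hybridAlpha` when the search mode moves away from `hybrid`, and then validates the merged result. A minimal sketch of that order of operations, detached from the database: `ConfigSketch`, `mergeConfig`, and `validateConfig` are hypothetical names, and only the fields the new validator checks are modeled.

```typescript
interface ConfigSketch {
  chunkSize: number
  chunkOverlap: number
  searchMode: string
  hybridAlpha: number | null
}

// Merge a partial patch over the stored config. `undefined` means "not provided".
function mergeConfig(existing: ConfigSketch, patch: Partial<ConfigSketch>): ConfigSketch {
  const next: ConfigSketch = {
    chunkSize: patch.chunkSize !== undefined ? patch.chunkSize : existing.chunkSize,
    chunkOverlap: patch.chunkOverlap !== undefined ? patch.chunkOverlap : existing.chunkOverlap,
    searchMode: patch.searchMode !== undefined ? patch.searchMode : existing.searchMode,
    hybridAlpha: patch.hybridAlpha !== undefined ? patch.hybridAlpha : existing.hybridAlpha
  }
  // Leaving hybrid mode without an explicit alpha drops the stored alpha.
  if (patch.searchMode !== undefined && patch.searchMode !== 'hybrid' && patch.hybridAlpha === undefined) {
    next.hybridAlpha = null
  }
  return next
}

// Cross-field checks only; single-field ranges are assumed to be enforced by the DTO schema.
function validateConfig(config: ConfigSketch): Record<string, string[]> {
  const fieldErrors: Record<string, string[]> = {}
  if (config.chunkOverlap >= config.chunkSize) {
    fieldErrors['chunkOverlap'] = ['Chunk overlap must be smaller than chunk size']
  }
  if (config.hybridAlpha != null && config.searchMode !== 'hybrid') {
    fieldErrors['hybridAlpha'] = ['Hybrid alpha requires hybrid search mode']
  }
  return fieldErrors
}
```

Validating the merged config rather than the raw patch means that shrinking `chunkSize` below the stored overlap is rejected, where the old code silently cleared the overlap.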


@@ -6,154 +6,53 @@
import { application } from '@application'
import { knowledgeItemTable } from '@data/db/schemas/knowledge'
import { type SqliteErrorHandlers, withSqliteErrors } from '@data/db/sqliteErrors'
import { loggerService } from '@logger'
import type { OffsetPaginationResponse } from '@shared/data/api'
import { DataApiErrorFactory } from '@shared/data/api'
import type {
CreateKnowledgeItemsDto,
KnowledgeItemsQuery,
UpdateKnowledgeItemDto
} from '@shared/data/api/schemas/knowledges'
import { getCreateKnowledgeItemsReferenceErrors } from '@shared/data/api/schemas/knowledges'
import type { ListKnowledgeItemsQuery } from '@shared/data/api/schemas/knowledges'
import {
DirectoryItemDataSchema,
FileItemDataSchema,
type CreateKnowledgeItemDto,
type KnowledgeItem,
NoteItemDataSchema,
SitemapItemDataSchema,
UrlItemDataSchema
type KnowledgeItemPhase,
KnowledgeItemSchema,
type KnowledgeItemStatus
} from '@shared/data/types/knowledge'
import { and, desc, eq, inArray, sql } from 'drizzle-orm'
import { and, desc, eq, inArray, isNull, sql } from 'drizzle-orm'
import { knowledgeBaseService } from './KnowledgeBaseService'
import { timestampToISO } from './utils/rowMappers'
const logger = loggerService.withContext('DataApi:KnowledgeItemService')
const CONTAINER_CHILD_FAILURE_ERROR = 'One or more child items failed'
const KNOWLEDGE_ITEM_DATA_SCHEMAS = {
file: FileItemDataSchema,
url: UrlItemDataSchema,
note: NoteItemDataSchema,
sitemap: SitemapItemDataSchema,
directory: DirectoryItemDataSchema
} as const
type KnowledgeItemRow = typeof knowledgeItemTable.$inferSelect
type PlannedKnowledgeItemInsert = CreateKnowledgeItemsDto['items'][number] & {
parsedData: CreateKnowledgeItemsDto['items'][number]['data']
index: number
type KnowledgeItemStatusUpdate = {
phase?: KnowledgeItemPhase | null
}
function getCreateKnowledgeItemGroupingErrors(
itemsToCreate: CreateKnowledgeItemsDto['items']
): Record<string, string[]> {
const itemsByRef = new Map(
itemsToCreate
.filter((item): item is (typeof itemsToCreate)[number] & { ref: string } => typeof item.ref === 'string')
.map((item) => [item.ref, item] as const)
)
for (const item of itemsToCreate) {
if (item.ref && item.groupRef === item.ref) {
return {
groupRef: ['Knowledge item cannot reference itself as group owner']
}
}
}
const visitState = new Map<string, 'visiting' | 'visited'>()
const hasCycle = (ref: string): boolean => {
const state = visitState.get(ref)
if (state === 'visiting') {
return true
}
if (state === 'visited') {
return false
}
visitState.set(ref, 'visiting')
const targetRef = itemsByRef.get(ref)?.groupRef
if (targetRef && itemsByRef.has(targetRef) && hasCycle(targetRef)) {
return true
}
visitState.set(ref, 'visited')
return false
}
for (const ref of itemsByRef.keys()) {
if (hasCycle(ref)) {
return {
groupRef: ['Knowledge item grouping cannot contain cycles within one request batch']
}
}
}
return {}
type FailedKnowledgeItemStatusUpdate = {
error: string
}
function rowToKnowledgeItem(row: typeof knowledgeItemTable.$inferSelect): KnowledgeItem {
// Drizzle's `text({ mode: 'json' })` decoder already ran by the time we
// get here, so `row.data` is either the decoded object, null (missing
// blob), or in the legacy/bad-typing case a raw string. The JSON-parse
// branch exists for defence-in-depth; the awaitKnowledgeItemRead wrapper
// on the query side is what actually catches corrupt-blob SyntaxError
// before it ever reaches this converter.
const parseJson = <T>(value: T | string | null | undefined, context?: string): T | null => {
if (value == null) return null
if (typeof value === 'string') {
try {
return JSON.parse(value)
} catch (error) {
logger.error(`Failed to parse JSON data${context ? ` for ${context}` : ''}`, error as Error)
throw DataApiErrorFactory.dataInconsistent(
'KnowledgeItem',
`Corrupted data in knowledge item${context ? ` '${context}'` : ''}`
)
}
}
return value as T
}
type KnowledgeItemsByBaseOptions = {
groupId?: string | null
}
const parsedData = parseJson(row.data, row.id)
if (!parsedData) {
throw DataApiErrorFactory.dataInconsistent('KnowledgeItem', `Knowledge item '${row.id}' has missing or null data`)
}
return {
function rowToKnowledgeItem(row: KnowledgeItemRow): KnowledgeItem {
return KnowledgeItemSchema.parse({
id: row.id,
baseId: row.baseId,
groupId: row.groupId,
type: row.type,
data: parsedData,
data: row.data,
status: row.status,
phase: row.phase,
error: row.error,
createdAt: timestampToISO(row.createdAt),
updatedAt: timestampToISO(row.updatedAt)
} as KnowledgeItem
}
/**
* Run a knowledge_item read query and translate any Drizzle JSON-decode
* SyntaxError into a domain-typed DATA_INCONSISTENT response.
*
* Rationale: Drizzle's `text({ mode: 'json' })` calls JSON.parse as part of
* row materialisation. If a `data` blob in the DB is corrupt (bit rot, manual
* SQL edit, bad migration), the `await db.select()` call throws a bare
* SyntaxError from inside the driver, *before* rowToKnowledgeItem runs. The
* service would then leak `SyntaxError: Expected property name or '}' ...`
* to callers instead of a DataApiError. Wrapping the read here converts it.
*/
async function awaitKnowledgeItemRead<T>(fn: () => PromiseLike<T>, context: string): Promise<T> {
try {
return await fn()
} catch (e) {
if (e instanceof SyntaxError) {
throw DataApiErrorFactory.dataInconsistent('KnowledgeItem', `Corrupted data in knowledge item ${context}`)
}
throw e
}
})
}
export class KnowledgeItemService {
@@ -162,7 +61,7 @@ export class KnowledgeItemService {
return dbService.getDb()
}
async list(baseId: string, query: KnowledgeItemsQuery): Promise<OffsetPaginationResponse<KnowledgeItem>> {
async list(baseId: string, query: ListKnowledgeItemsQuery): Promise<OffsetPaginationResponse<KnowledgeItem>> {
await knowledgeBaseService.getById(baseId)
const { page, limit, type, groupId } = query
const offset = (page - 1) * limit
@@ -172,22 +71,18 @@ export class KnowledgeItemService {
conditions.push(eq(knowledgeItemTable.type, type))
}
if (groupId !== undefined) {
conditions.push(eq(knowledgeItemTable.groupId, groupId))
conditions.push(groupId === null ? isNull(knowledgeItemTable.groupId) : eq(knowledgeItemTable.groupId, groupId))
}
const where = conditions.length === 1 ? conditions[0] : and(...conditions)
const where = and(...conditions)
const [rows, [{ count }]] = await Promise.all([
awaitKnowledgeItemRead(
() =>
this.db
.select()
.from(knowledgeItemTable)
.where(where)
.orderBy(desc(knowledgeItemTable.createdAt), desc(knowledgeItemTable.id))
.limit(limit)
.offset(offset),
`in base '${baseId}'`
),
this.db
.select()
.from(knowledgeItemTable)
.where(where)
.orderBy(desc(knowledgeItemTable.createdAt), desc(knowledgeItemTable.id))
.limit(limit)
.offset(offset),
this.db.select({ count: sql<number>`count(*)` }).from(knowledgeItemTable).where(where)
])
@@ -198,69 +93,106 @@ export class KnowledgeItemService {
}
}
async createMany(baseId: string, dto: CreateKnowledgeItemsDto): Promise<{ items: KnowledgeItem[] }> {
async getItemsByBaseId(baseId: string, options: KnowledgeItemsByBaseOptions = {}): Promise<KnowledgeItem[]> {
await knowledgeBaseService.getById(baseId)
const referenceErrors = getCreateKnowledgeItemsReferenceErrors(dto.items)
if (Object.keys(referenceErrors).length > 0) {
throw DataApiErrorFactory.validation(referenceErrors)
const conditions = [eq(knowledgeItemTable.baseId, baseId)]
if (options.groupId !== undefined) {
conditions.push(
options.groupId === null ? isNull(knowledgeItemTable.groupId) : eq(knowledgeItemTable.groupId, options.groupId)
)
}
const groupingErrors = getCreateKnowledgeItemGroupingErrors(dto.items)
if (Object.keys(groupingErrors).length > 0) {
throw DataApiErrorFactory.validation(groupingErrors)
const where = and(...conditions)
const rows = await this.db
.select()
.from(knowledgeItemTable)
.where(where)
.orderBy(knowledgeItemTable.createdAt, knowledgeItemTable.id)
return rows.map((row) => rowToKnowledgeItem(row))
}
async create(baseId: string, item: CreateKnowledgeItemDto): Promise<KnowledgeItem> {
await this.validateGroupOwner(baseId, item.groupId)
const [row] = await this.db.transaction(async (tx) =>
withSqliteErrors(
() =>
tx
.insert(knowledgeItemTable)
.values({
baseId,
groupId: item.groupId ?? null,
type: item.type,
data: item.data,
status: 'idle',
phase: null,
error: null
})
.returning(),
{
foreignKey: () =>
item.groupId
? DataApiErrorFactory.validation({
groupId: [`Knowledge item group owner not found in base '${baseId}': ${item.groupId}`]
})
: DataApiErrorFactory.notFound('KnowledgeBase', baseId),
check: (constraintName) =>
DataApiErrorFactory.validation({
_root: [
constraintName
? `Knowledge item failed CHECK constraint '${constraintName}'`
: 'Knowledge item failed a CHECK constraint'
]
})
} satisfies SqliteErrorHandlers
)
)
if (!row) {
throw DataApiErrorFactory.dataInconsistent('KnowledgeItem', 'Knowledge item create result missing')
}
const itemsToCreate = dto.items.map((item, index) => {
const parsed = KNOWLEDGE_ITEM_DATA_SCHEMAS[item.type].safeParse(item.data)
if (!parsed.success) {
throw DataApiErrorFactory.validation({
[`items.${index}.data`]: [`Data payload does not match knowledge item type '${item.type}'`]
})
}
logger.info('Created knowledge item', { baseId, id: row.id, type: row.type })
return rowToKnowledgeItem(row)
}
return {
...item,
parsedData: parsed.data,
index
}
})
private async validateGroupOwner(baseId: string, groupId: string | null | undefined): Promise<void> {
if (groupId == null) {
return
}
const requestedGroupIds = [
...new Set(itemsToCreate.flatMap((item) => (item.groupId != null ? [item.groupId] : [])))
]
const existingGroupIds = await this.getExistingGroupIdsInBase(baseId, requestedGroupIds)
const missingGroupIds = requestedGroupIds.filter((groupId) => !existingGroupIds.has(groupId))
if (missingGroupIds.length > 0) {
if (groupId.trim().length === 0) {
throw DataApiErrorFactory.validation({
groupId: [`Knowledge item group owner not found in base '${baseId}': ${missingGroupIds.join(', ')}`]
groupId: ['Knowledge item group owner id is required when groupId is provided']
})
}
const createdRows = await this.createBatch(baseId, itemsToCreate)
const [owner] = await this.db
.select({
type: knowledgeItemTable.type
})
.from(knowledgeItemTable)
.where(and(eq(knowledgeItemTable.baseId, baseId), eq(knowledgeItemTable.id, groupId)))
.limit(1)
const items = itemsToCreate.map((item) => {
const createdRow = createdRows[item.index]
if (!createdRow) {
throw DataApiErrorFactory.dataInconsistent(
'KnowledgeItem',
`Knowledge item create result missing for index '${item.index}'`
)
}
if (!owner) {
throw DataApiErrorFactory.validation({
groupId: [`Knowledge item group owner not found in base '${baseId}': ${groupId}`]
})
}
return rowToKnowledgeItem(createdRow)
})
logger.info('Created knowledge items', { baseId, count: items.length })
return { items }
if (owner.type !== 'directory' && owner.type !== 'sitemap') {
throw DataApiErrorFactory.validation({
groupId: [`Knowledge item group owner must be a directory or sitemap: ${groupId}`]
})
}
}
async getById(id: string): Promise<KnowledgeItem> {
const [row] = await awaitKnowledgeItemRead(
() => this.db.select().from(knowledgeItemTable).where(eq(knowledgeItemTable.id, id)).limit(1),
`'${id}'`
)
const [row] = await this.db.select().from(knowledgeItemTable).where(eq(knowledgeItemTable.id, id)).limit(1)
if (!row) {
throw DataApiErrorFactory.notFound('KnowledgeItem', id)
@@ -269,47 +201,67 @@ export class KnowledgeItemService {
return rowToKnowledgeItem(row)
}
async getByIdsInBase(baseId: string, itemIds: string[]): Promise<KnowledgeItem[]> {
const uniqueItemIds = [...new Set(itemIds)]
async getLeafDescendantItems(baseId: string, rootIds: string[]): Promise<KnowledgeItem[]> {
const leafIds = await this.getLeafDescendantIds(baseId, rootIds)
if (uniqueItemIds.length === 0) {
if (leafIds.length === 0) {
return []
}
const rows = await awaitKnowledgeItemRead(
() =>
this.db
.select()
.from(knowledgeItemTable)
.where(and(eq(knowledgeItemTable.baseId, baseId), inArray(knowledgeItemTable.id, uniqueItemIds))),
`in base '${baseId}'`
)
const rows = await this.db
.select()
.from(knowledgeItemTable)
.where(and(eq(knowledgeItemTable.baseId, baseId), inArray(knowledgeItemTable.id, leafIds)))
const rowsById = new Map(rows.map((row) => [row.id, row]))
const itemsById = new Map(rows.map((row) => [row.id, rowToKnowledgeItem(row)]))
return leafIds.map((id) => {
const row = rowsById.get(id)
for (const itemId of uniqueItemIds) {
if (!itemsById.has(itemId)) {
throw DataApiErrorFactory.notFound('KnowledgeItem', itemId)
if (!row) {
throw DataApiErrorFactory.dataInconsistent('KnowledgeItem', `Leaf descendant row missing for id '${id}'`)
}
}
return uniqueItemIds.map((itemId) => itemsById.get(itemId)!)
return rowToKnowledgeItem(row)
})
}
async getCascadeIdsInBase(baseId: string, rootIds: string[]): Promise<string[]> {
async getDescendantItems(baseId: string, rootIds: string[]): Promise<KnowledgeItem[]> {
const descendantIds = await this.getDescendantIds(baseId, rootIds)
if (descendantIds.length === 0) {
return []
}
const rows = await this.db
.select()
.from(knowledgeItemTable)
.where(and(eq(knowledgeItemTable.baseId, baseId), inArray(knowledgeItemTable.id, descendantIds)))
const rowsById = new Map(rows.map((row) => [row.id, row]))
return descendantIds.map((id) => {
const row = rowsById.get(id)
if (!row) {
throw DataApiErrorFactory.dataInconsistent('KnowledgeItem', `Descendant row missing for id '${id}'`)
}
return rowToKnowledgeItem(row)
})
}
private async getDescendantIds(baseId: string, rootIds: string[]): Promise<string[]> {
const uniqueRootIds = [...new Set(rootIds)]
if (uniqueRootIds.length === 0) {
return []
}
await this.getByIdsInBase(baseId, uniqueRootIds)
const descendantRows = await this.db.all<{ id: string }>(sql`
WITH RECURSIVE descendants AS (
const rows = await this.db.all<{ id: string }>(sql`
WITH RECURSIVE subtree AS (
SELECT id
FROM knowledge_item
WHERE base_id = ${baseId}
AND group_id IN (${sql.join(
AND id IN (${sql.join(
uniqueRootIds.map((id) => sql`${id}`),
sql`, `
)})
@@ -318,117 +270,222 @@ export class KnowledgeItemService {
SELECT child.id
FROM knowledge_item child
INNER JOIN descendants parent ON child.group_id = parent.id
INNER JOIN subtree parent ON child.group_id = parent.id
WHERE child.base_id = ${baseId}
)
SELECT DISTINCT id FROM descendants
SELECT DISTINCT id
FROM subtree
WHERE id NOT IN (${sql.join(
uniqueRootIds.map((id) => sql`${id}`),
sql`, `
)})
`)
const descendantIds = descendantRows.map((row) => row.id)
const rootIdSet = new Set(uniqueRootIds)
return [...uniqueRootIds, ...descendantIds.filter((id) => !rootIdSet.has(id))]
return rows.map((row) => row.id)
}
async update(id: string, dto: UpdateKnowledgeItemDto): Promise<KnowledgeItem> {
const existing = await this.getById(id)
async deleteLeafDescendantItems(baseId: string, rootIds: string[]): Promise<void> {
const uniqueRootIds = [...new Set(rootIds)]
const updates: Partial<typeof knowledgeItemTable.$inferInsert> = {}
if (dto.data !== undefined) {
const parsed = KNOWLEDGE_ITEM_DATA_SCHEMAS[existing.type].safeParse(dto.data)
if (!parsed.success) {
throw DataApiErrorFactory.validation({
data: [`Data payload does not match the existing knowledge item type '${existing.type}'`]
})
if (uniqueRootIds.length === 0) {
return
}
await this.db.run(sql`
WITH RECURSIVE subtree AS (
SELECT id
FROM knowledge_item
WHERE base_id = ${baseId}
AND id IN (${sql.join(
uniqueRootIds.map((id) => sql`${id}`),
sql`, `
)})
UNION ALL
SELECT child.id
FROM knowledge_item child
INNER JOIN subtree parent ON child.group_id = parent.id
WHERE child.base_id = ${baseId}
)
DELETE FROM knowledge_item
WHERE base_id = ${baseId}
AND id IN (SELECT id FROM subtree)
AND id NOT IN (${sql.join(
uniqueRootIds.map((id) => sql`${id}`),
sql`, `
)})
`)
}
private async getLeafDescendantIds(baseId: string, rootIds: string[]): Promise<string[]> {
const uniqueRootIds = [...new Set(rootIds)]
if (uniqueRootIds.length === 0) {
return []
}
const rows = await this.db.all<{ id: string }>(sql`
WITH RECURSIVE subtree AS (
SELECT id, type
FROM knowledge_item
WHERE base_id = ${baseId}
AND id IN (${sql.join(
uniqueRootIds.map((id) => sql`${id}`),
sql`, `
)})
UNION ALL
SELECT child.id, child.type
FROM knowledge_item child
INNER JOIN subtree parent ON child.group_id = parent.id
WHERE child.base_id = ${baseId}
)
SELECT DISTINCT id
FROM subtree
WHERE type IN ('file', 'url', 'note')
`)
return rows.map((row) => row.id)
}
async updateStatus(id: string, status: 'idle' | 'completed'): Promise<KnowledgeItem>
async updateStatus(id: string, status: 'processing', update?: KnowledgeItemStatusUpdate): Promise<KnowledgeItem>
async updateStatus(id: string, status: 'failed', update: FailedKnowledgeItemStatusUpdate): Promise<KnowledgeItem>
async updateStatus(
id: string,
status: KnowledgeItemStatus,
update: KnowledgeItemStatusUpdate | FailedKnowledgeItemStatusUpdate = {}
): Promise<KnowledgeItem> {
const phase = status === 'processing' && 'phase' in update ? (update.phase ?? null) : null
const error = status === 'failed' && 'error' in update ? update.error.trim() : null
if (status === 'failed' && !error) {
throw DataApiErrorFactory.validation({
error: ['Failed knowledge items must include a non-empty error']
})
}
const { item, startContainerIds } = await this.db.transaction(async (tx) => {
const [existingRow] = await tx.select().from(knowledgeItemTable).where(eq(knowledgeItemTable.id, id)).limit(1)
if (!existingRow) {
throw DataApiErrorFactory.notFound('KnowledgeItem', id)
}
updates.data = parsed.data
}
if (dto.status !== undefined) updates.status = dto.status
if (dto.error !== undefined) updates.error = dto.error
if (Object.keys(updates).length === 0) {
return existing
}
const [updatedRow] = await tx
.update(knowledgeItemTable)
.set({ status, phase, error })
.where(eq(knowledgeItemTable.id, id))
.returning()
const [row] = await this.db.update(knowledgeItemTable).set(updates).where(eq(knowledgeItemTable.id, id)).returning()
if (!row) {
throw DataApiErrorFactory.dataInconsistent('KnowledgeItem', `Knowledge item update result missing for id '${id}'`)
}
logger.info('Updated knowledge item', { id, changes: Object.keys(dto) })
return rowToKnowledgeItem(row)
}
if (!updatedRow) {
throw DataApiErrorFactory.dataInconsistent(
'KnowledgeItem',
`Knowledge item status update result missing for id '${id}'`
)
}
async delete(id: string): Promise<void> {
await this.getById(id)
await this.db.delete(knowledgeItemTable).where(eq(knowledgeItemTable.id, id))
logger.info('Deleted knowledge item', { id })
}
private async createBatch(
baseId: string,
itemsToCreate: PlannedKnowledgeItemInsert[]
): Promise<Array<typeof knowledgeItemTable.$inferSelect | undefined>> {
const rowsByIndex = new Map<number, typeof knowledgeItemTable.$inferSelect>()
const itemsByRef = new Map<string, typeof knowledgeItemTable.$inferSelect>()
await this.db.transaction(async (tx) => {
const pendingItems = [...itemsToCreate]
while (pendingItems.length > 0) {
const readyItems = pendingItems.filter((item) => item.groupRef == null || itemsByRef.has(item.groupRef))
if (readyItems.length === 0) {
throw DataApiErrorFactory.dataInconsistent(
'KnowledgeItem',
`Unable to resolve knowledge item grouping in base '${baseId}'`
)
}
for (const item of readyItems) {
const groupId = item.groupRef ? (itemsByRef.get(item.groupRef)?.id ?? null) : (item.groupId ?? null)
const [row] = await tx
.insert(knowledgeItemTable)
.values({
baseId,
groupId,
type: item.type,
data: item.parsedData,
status: 'idle',
error: null
})
.returning()
rowsByIndex.set(item.index, row)
if (item.ref) {
itemsByRef.set(item.ref, row)
}
}
const readyIndices = new Set(readyItems.map((item) => item.index))
for (let index = pendingItems.length - 1; index >= 0; index -= 1) {
if (readyIndices.has(pendingItems[index].index)) {
pendingItems.splice(index, 1)
}
}
return {
item: rowToKnowledgeItem(updatedRow),
startContainerIds: [updatedRow.id, existingRow.groupId]
}
})
return itemsToCreate.map((item) => rowsByIndex.get(item.index))
await this.reconcileContainers(item.baseId, startContainerIds)
logger.info('Updated knowledge item status', { id, status, phase })
return item
}
private async getExistingGroupIdsInBase(baseId: string, groupIds: string[]): Promise<Set<string>> {
const uniqueGroupIds = [...new Set(groupIds)]
async reconcileContainers(baseId: string, startContainerIds: Array<string | null | undefined>): Promise<void> {
await this.db.transaction(async (tx) => {
const queue = [...new Set(startContainerIds.filter((id): id is string => Boolean(id)))]
const visited = new Set<string>()
if (uniqueGroupIds.length === 0) {
return new Set()
}
while (queue.length > 0) {
const containerId = queue.shift()
if (!containerId || visited.has(containerId)) {
continue
}
visited.add(containerId)
const rows = await this.db
.select({ id: knowledgeItemTable.id })
.from(knowledgeItemTable)
.where(and(eq(knowledgeItemTable.baseId, baseId), inArray(knowledgeItemTable.id, uniqueGroupIds)))
const [containerRow] = await tx
.select()
.from(knowledgeItemTable)
.where(and(eq(knowledgeItemTable.baseId, baseId), eq(knowledgeItemTable.id, containerId)))
.limit(1)
return new Set(rows.map((row) => row.id))
if (!containerRow || (containerRow.type !== 'directory' && containerRow.type !== 'sitemap')) {
continue
}
if (containerRow.phase !== null) {
await tx
.update(knowledgeItemTable)
.set({ status: 'processing', error: null })
.where(and(eq(knowledgeItemTable.baseId, baseId), eq(knowledgeItemTable.id, containerId)))
if (containerRow.groupId) {
queue.push(containerRow.groupId)
}
continue
}
const [stats] = await tx
.select({
activeCount: sql<number>`sum(case when ${knowledgeItemTable.status} not in ('completed', 'failed') then 1 else 0 end)`,
failedCount: sql<number>`sum(case when ${knowledgeItemTable.status} = 'failed' then 1 else 0 end)`
})
.from(knowledgeItemTable)
.where(and(eq(knowledgeItemTable.baseId, baseId), eq(knowledgeItemTable.groupId, containerId)))
if (Number(stats?.activeCount ?? 0) > 0) {
await tx
.update(knowledgeItemTable)
.set({ status: 'processing', error: null })
.where(and(eq(knowledgeItemTable.baseId, baseId), eq(knowledgeItemTable.id, containerId)))
if (containerRow.groupId) {
queue.push(containerRow.groupId)
}
continue
}
const nextStatus: KnowledgeItemStatus = Number(stats?.failedCount ?? 0) > 0 ? 'failed' : 'completed'
await tx
.update(knowledgeItemTable)
.set({ status: nextStatus, error: nextStatus === 'failed' ? CONTAINER_CHILD_FAILURE_ERROR : null })
.where(and(eq(knowledgeItemTable.baseId, baseId), eq(knowledgeItemTable.id, containerId)))
if (containerRow.groupId) {
queue.push(containerRow.groupId)
}
}
})
}
async delete(id: string): Promise<void> {
const deleted = await this.db.transaction(async (tx) => {
const [existingRow] = await tx.select().from(knowledgeItemTable).where(eq(knowledgeItemTable.id, id)).limit(1)
if (!existingRow) {
throw DataApiErrorFactory.notFound('KnowledgeItem', id)
}
const [row] = await tx.delete(knowledgeItemTable).where(eq(knowledgeItemTable.id, id)).returning({
id: knowledgeItemTable.id
})
if (!row) {
throw DataApiErrorFactory.notFound('KnowledgeItem', id)
}
return { baseId: existingRow.baseId, groupId: existingRow.groupId }
})
await this.reconcileContainers(deleted.baseId, [deleted.groupId])
logger.info('Deleted knowledge item', { id })
}
}
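
The roll-up rule that `reconcileContainers` applies to each `directory` / `sitemap` container can be read as a small pure function. The sketch below is only an in-memory illustration under stated assumptions: `ContainerSnapshot` and `deriveContainerStatus` are hypothetical names that do not exist in the service, `phase` is modeled as a plain string, and the real method runs these checks as SQL inside a transaction and then walks up to the parent container.

```typescript
// Hypothetical in-memory model of the container roll-up rule (not the service API).
type ItemStatus = 'idle' | 'processing' | 'completed' | 'failed'

interface ContainerSnapshot {
  // Non-null while the container itself still has an active phase (assumed to be a string here).
  phase: string | null
  // Status of each direct child (rows whose groupId equals the container id).
  childStatuses: ItemStatus[]
}

function deriveContainerStatus(container: ContainerSnapshot): ItemStatus {
  // A container with an active phase is reported as processing regardless of its children.
  if (container.phase !== null) {
    return 'processing'
  }

  // Any child that is neither completed nor failed keeps the container processing.
  const hasActiveChild = container.childStatuses.some((status) => status !== 'completed' && status !== 'failed')
  if (hasActiveChild) {
    return 'processing'
  }

  // All children are terminal (or there are none): a single failure fails the container.
  const hasFailedChild = container.childStatuses.some((status) => status === 'failed')
  return hasFailedChild ? 'failed' : 'completed'
}
```

In this reading, a container with no children and no active phase resolves to `completed`, which matches the `?? 0` fallbacks applied to both counters in the query.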


@@ -68,8 +68,14 @@ describe('AssistantDataService', () => {
await dbh.db.insert(knowledgeBaseTable).values({
id,
name: 'KB',
emoji: '📁',
dimensions: 1024,
embeddingModelId: createUniqueModelId('openai', 'text-embedding-3-large')
embeddingModelId: createUniqueModelId('openai', 'text-embedding-3-large'),
status: 'completed',
error: null,
chunkSize: 1024,
chunkOverlap: 200,
searchMode: 'hybrid'
})
}


@@ -43,6 +43,12 @@ describe('GroupService', () => {
// because neither bucket has a predecessor.
expect(topicFirst.orderKey).toBe(assistantFirst.orderKey)
})
it('should create knowledge groups', async () => {
const result = await groupService.create({ entityType: 'knowledge', name: 'Knowledge Group' })
expect(result).toMatchObject({ entityType: 'knowledge', name: 'Knowledge Group' })
})
})
describe('listByEntityType', () => {
@@ -58,6 +64,13 @@ describe('GroupService', () => {
it('should return an empty array when no groups exist for the entityType', async () => {
await expect(groupService.listByEntityType('assistant')).resolves.toEqual([])
})
it('should list groups for the knowledge entityType', async () => {
const knowledgeGroup = await groupService.create({ entityType: 'knowledge', name: 'Knowledge Group' })
await groupService.create({ entityType: 'topic', name: 'Topic Group' })
await expect(groupService.listByEntityType('knowledge')).resolves.toEqual([knowledgeGroup])
})
})
describe('getById', () => {


@@ -1,13 +1,9 @@
import { knowledgeBaseTable } from '@data/db/schemas/knowledge'
import { userModelTable } from '@data/db/schemas/userModel'
import { userProviderTable } from '@data/db/schemas/userProvider'
import {
KnowledgeBaseService,
normalizeKnowledgeBaseConfigDependencies,
validateKnowledgeBaseConfig
} from '@data/services/KnowledgeBaseService'
import { KnowledgeBaseService } from '@data/services/KnowledgeBaseService'
import { ErrorCode } from '@shared/data/api'
import type { CreateKnowledgeBaseDto } from '@shared/data/api/schemas/knowledges'
import { type CreateKnowledgeBaseDto, KNOWLEDGE_BASE_ERROR_MISSING_EMBEDDING_MODEL } from '@shared/data/types/knowledge'
import { createUniqueModelId } from '@shared/data/types/model'
import { setupTestDatabase } from '@test-helpers/db'
import { eq } from 'drizzle-orm'
@@ -22,33 +18,10 @@ describe('KnowledgeBaseService', () => {
await seedUserProvidersAndModelsForKb()
})
/** FK targets for embedding_model_id / rerank_model_id → user_model.id */
/** FK target for embedding_model_id → user_model.id */
async function seedUserProvidersAndModelsForKb() {
await dbh.db.insert(userProviderTable).values([
{ providerId: 'openai', name: 'OpenAI' },
{ providerId: 'cohere', name: 'Cohere' }
])
await dbh.db.insert(userProviderTable).values([{ providerId: 'openai', name: 'OpenAI' }])
await dbh.db.insert(userModelTable).values([
{
id: createUniqueModelId('openai', 'text-embedding-3-large'),
providerId: 'openai',
modelId: 'text-embedding-3-large',
presetModelId: 'text-embedding-3-large',
name: 'text-embedding-3-large',
isEnabled: true,
isHidden: false,
sortOrder: 0
},
{
id: createUniqueModelId('cohere', 'rerank-v1'),
providerId: 'cohere',
modelId: 'rerank-v1',
presetModelId: 'rerank-v1',
name: 'rerank-v1',
isEnabled: true,
isHidden: false,
sortOrder: 0
},
{
id: createUniqueModelId('openai', 'embed-model'),
providerId: 'openai',
@@ -58,16 +31,6 @@ describe('KnowledgeBaseService', () => {
isEnabled: true,
isHidden: false,
sortOrder: 0
},
{
id: createUniqueModelId('cohere', 'rerank-model'),
providerId: 'cohere',
modelId: 'rerank-model',
presetModelId: 'rerank-model',
name: 'rerank-model',
isEnabled: true,
isHidden: false,
sortOrder: 0
}
])
}
@@ -76,10 +39,12 @@ describe('KnowledgeBaseService', () => {
const values: typeof knowledgeBaseTable.$inferInsert = {
id: 'kb-1',
name: 'Knowledge Base',
description: 'Knowledge base description',
emoji: '📁',
dimensions: 1536,
embeddingModelId: createUniqueModelId('openai', 'text-embedding-3-large'),
rerankModelId: createUniqueModelId('cohere', 'rerank-v1'),
embeddingModelId: createUniqueModelId('openai', 'embed-model'),
status: 'completed',
error: null,
rerankModelId: null,
fileProcessorId: 'processor-1',
chunkSize: 800,
chunkOverlap: 120,
@@ -96,7 +61,7 @@ describe('KnowledgeBaseService', () => {
describe('list', () => {
it('should return paginated knowledge bases', async () => {
await seedKnowledgeBase()
await seedKnowledgeBase({ id: 'kb-2', name: 'Another Base', description: null })
await seedKnowledgeBase({ id: 'kb-2', name: 'Another Base' })
const result = await service.list({ page: 2, limit: 1 })
@@ -125,45 +90,78 @@ describe('KnowledgeBaseService', () => {
status: 404
})
})
it('should reject invalid persisted chunk configuration at the read boundary', async () => {
await seedKnowledgeBase({ chunkSize: 100, chunkOverlap: 100 })
await expect(service.getById('kb-1')).rejects.toThrow('Chunk overlap must be smaller than chunk size')
})
})
describe('create', () => {
it('should create a knowledge base with trimmed identifiers', async () => {
it('should create a knowledge base with trimmed identifiers and defaults', async () => {
const dto: CreateKnowledgeBaseDto = {
name: ' New Base ',
description: 'desc',
dimensions: 1024,
embeddingModelId: ` ${createUniqueModelId('openai', 'embed-model')} `,
rerankModelId: createUniqueModelId('cohere', 'rerank-model'),
fileProcessorId: 'processor-1',
chunkSize: 512,
chunkOverlap: 64,
threshold: 0.5,
documentCount: 3,
searchMode: 'hybrid',
hybridAlpha: 0.6
embeddingModelId: ` ${createUniqueModelId('openai', 'embed-model')} `
}
const result = await service.create(dto)
expect(result.name).toBe('New Base')
expect(result.embeddingModelId).toBe(createUniqueModelId('openai', 'embed-model'))
expect(result.chunkSize).toBe(1024)
expect(result.chunkOverlap).toBe(200)
expect(result.emoji).toBe('📁')
expect(result.searchMode).toBe('hybrid')
expect(result.status).toBe('completed')
expect(result.error).toBeNull()
const [row] = await dbh.db.select().from(knowledgeBaseTable).where(eq(knowledgeBaseTable.id, result.id))
expect(row.name).toBe('New Base')
expect(row.groupId).toBeNull()
expect(row.embeddingModelId).toBe(createUniqueModelId('openai', 'embed-model'))
expect(row.rerankModelId).toBeNull()
expect(row.fileProcessorId).toBeNull()
expect(row.chunkSize).toBe(1024)
expect(row.chunkOverlap).toBe(200)
expect(row.threshold).toBeNull()
expect(row.documentCount).toBeNull()
expect(row.emoji).toBe('📁')
expect(row.searchMode).toBe('hybrid')
expect(row.hybridAlpha).toBeNull()
expect(row.status).toBe('completed')
expect(row.error).toBeNull()
})
it('should reject invalid runtime config before insert', async () => {
it('should create a knowledge base with explicit valid chunk config', async () => {
const dto: CreateKnowledgeBaseDto = {
name: 'Invalid Base',
name: 'Small Chunks',
dimensions: 1024,
embeddingModelId: createUniqueModelId('openai', 'embed-model'),
chunkSize: 256,
chunkOverlap: 256
chunkSize: 100,
chunkOverlap: 20
}
await expect(service.create(dto)).rejects.toMatchObject({
const result = await service.create(dto)
expect(result.chunkSize).toBe(100)
expect(result.chunkOverlap).toBe(20)
const [row] = await dbh.db.select().from(knowledgeBaseTable).where(eq(knowledgeBaseTable.id, result.id))
expect(row.chunkSize).toBe(100)
expect(row.chunkOverlap).toBe(20)
})
it('should reject create when default chunkOverlap does not fit explicit chunkSize', async () => {
await expect(
service.create({
name: 'Invalid Small Chunks',
dimensions: 1024,
embeddingModelId: createUniqueModelId('openai', 'embed-model'),
chunkSize: 100
})
).rejects.toMatchObject({
code: ErrorCode.VALIDATION_ERROR,
details: {
fieldErrors: {
@@ -171,9 +169,71 @@ describe('KnowledgeBaseService', () => {
}
}
})
})
})
const rows = await dbh.db.select().from(knowledgeBaseTable)
expect(rows).toHaveLength(0)
describe('status constraints', () => {
it('does not define a database default for status', async () => {
const result = await dbh.client.execute('PRAGMA table_info(`knowledge_base`)')
const statusColumn = result.rows.find((row) => row.name === 'status')
expect(statusColumn).toBeDefined()
expect(statusColumn?.dflt_value).toBeNull()
})
it('allows persisted failed bases with null embedding model ids, null dimensions, and non-empty errors', async () => {
await expect(
seedKnowledgeBase({
dimensions: null,
embeddingModelId: null,
status: 'failed',
error: KNOWLEDGE_BASE_ERROR_MISSING_EMBEDDING_MODEL
})
).resolves.toBeDefined()
const [row] = await dbh.db.select().from(knowledgeBaseTable).where(eq(knowledgeBaseTable.id, 'kb-1'))
expect(row).toMatchObject({
dimensions: null,
embeddingModelId: null,
status: 'failed',
error: KNOWLEDGE_BASE_ERROR_MISSING_EMBEDDING_MODEL
})
await expect(service.getById('kb-1')).resolves.toMatchObject({
dimensions: null,
embeddingModelId: null,
status: 'failed',
error: KNOWLEDGE_BASE_ERROR_MISSING_EMBEDDING_MODEL
})
})
it('rejects invalid persisted knowledge base status combinations', async () => {
await expect(
seedKnowledgeBase({
embeddingModelId: null,
dimensions: null,
status: 'completed',
error: null
})
).rejects.toThrow()
await expect(
seedKnowledgeBase({
id: 'kb-failed-null-error',
embeddingModelId: null,
status: 'failed',
error: null
})
).rejects.toThrow()
await expect(
seedKnowledgeBase({
id: 'kb-failed-empty-error',
embeddingModelId: null,
status: 'failed',
error: '' as typeof knowledgeBaseTable.$inferInsert.error
})
).rejects.toThrow()
})
})
@@ -192,22 +252,26 @@ describe('KnowledgeBaseService', () => {
const result = await service.update('kb-1', {
name: ' Updated Base ',
description: null,
chunkSize: null,
chunkOverlap: null,
emoji: '📚',
chunkSize: 1024,
chunkOverlap: 128,
hybridAlpha: 0.9
})
expect(result.name).toBe('Updated Base')
expect(result.chunkSize).toBe(1024)
expect(result.chunkOverlap).toBe(128)
expect(result.hybridAlpha).toBe(0.9)
expect(result.emoji).toBe('📚')
const [row] = await dbh.db.select().from(knowledgeBaseTable).where(eq(knowledgeBaseTable.id, 'kb-1'))
expect(row.name).toBe('Updated Base')
expect(row.description).toBeNull()
expect(row.chunkSize).toBeNull()
expect(row.chunkSize).toBe(1024)
expect(row.chunkOverlap).toBe(128)
expect(row.emoji).toBe('📚')
})
it('should clear stale dependent config fields during update', async () => {
it('should clear stale hybrid config when search mode changes during update', async () => {
await seedKnowledgeBase({
chunkSize: 256,
chunkOverlap: 120,
@@ -216,21 +280,65 @@ describe('KnowledgeBaseService', () => {
})
const result = await service.update('kb-1', {
chunkSize: 100,
searchMode: 'default'
})
expect(result.chunkSize).toBe(100)
expect(result.searchMode).toBe('default')
expect(result.chunkSize).toBe(256)
expect(result.chunkOverlap).toBe(120)
expect(result.hybridAlpha).toBeUndefined()
const [row] = await dbh.db.select().from(knowledgeBaseTable).where(eq(knowledgeBaseTable.id, 'kb-1'))
expect(row.chunkSize).toBe(100)
expect(row.searchMode).toBe('default')
// Dependent fields cleared
expect(row.chunkOverlap).toBeNull()
expect(row.chunkSize).toBe(256)
expect(row.chunkOverlap).toBe(120)
expect(row.hybridAlpha).toBeNull()
})
it('should reject shrinking chunkSize when the existing chunkOverlap no longer fits', async () => {
await seedKnowledgeBase({ chunkSize: 256, chunkOverlap: 120 })
await expect(
service.update('kb-1', {
chunkSize: 100
})
).rejects.toMatchObject({
code: ErrorCode.VALIDATION_ERROR,
details: {
fieldErrors: {
chunkOverlap: ['Chunk overlap must be smaller than chunk size']
}
}
})
})
it('should reject explicitly provided chunkOverlap when it no longer fits the current chunkSize', async () => {
await seedKnowledgeBase({ chunkSize: 256, chunkOverlap: 120 })
await expect(
service.update('kb-1', {
chunkOverlap: 256
})
).rejects.toMatchObject({
code: ErrorCode.VALIDATION_ERROR,
details: {
fieldErrors: {
chunkOverlap: ['Chunk overlap must be smaller than chunk size']
}
}
})
})
it('should not silently clean stale dependent fields during unrelated updates', async () => {
await seedKnowledgeBase({ searchMode: 'default', hybridAlpha: 0.7 })
await expect(
service.update('kb-1', {
name: 'Renamed Base'
})
).rejects.toThrow('Hybrid alpha requires hybrid search mode')
})
it('should reject explicitly provided hybridAlpha when search mode is not hybrid', async () => {
await seedKnowledgeBase({ searchMode: 'hybrid', hybridAlpha: 0.7 })
@@ -248,40 +356,6 @@ describe('KnowledgeBaseService', () => {
}
})
})
it('should not silently clean stale dependent fields during unrelated updates', async () => {
// Seed a KB whose existing config is already inconsistent (searchMode=default
// but hybridAlpha is populated). An unrelated field update must surface the
// validation error rather than silently scrub the bad field.
await seedKnowledgeBase({ searchMode: 'default', hybridAlpha: 0.7 })
await expect(service.update('kb-1', { name: 'Renamed Base' })).rejects.toMatchObject({
code: ErrorCode.VALIDATION_ERROR,
details: {
fieldErrors: {
hybridAlpha: ['Hybrid alpha requires hybrid search mode']
}
}
})
})
it('should reject explicitly provided chunkOverlap when it no longer fits chunkSize', async () => {
await seedKnowledgeBase({ chunkSize: 256, chunkOverlap: 64 })
await expect(
service.update('kb-1', {
chunkSize: 100,
chunkOverlap: 120
})
).rejects.toMatchObject({
code: ErrorCode.VALIDATION_ERROR,
details: {
fieldErrors: {
chunkOverlap: ['Chunk overlap must be smaller than chunk size']
}
}
})
})
})
describe('delete', () => {
@@ -301,68 +375,4 @@ describe('KnowledgeBaseService', () => {
})
})
})
describe('config helpers (pure)', () => {
describe('normalizeKnowledgeBaseConfigDependencies', () => {
it('should clear stale dependent fields after primary config changes', () => {
expect(
normalizeKnowledgeBaseConfigDependencies({
chunkSize: 100,
chunkOverlap: 120,
searchMode: 'default' as const,
hybridAlpha: 0.6
})
).toEqual({
chunkSize: 100,
chunkOverlap: undefined,
searchMode: 'default',
hybridAlpha: undefined
})
})
})
describe('validateKnowledgeBaseConfig', () => {
it('should return field errors for invalid runtime config combinations', () => {
expect(
validateKnowledgeBaseConfig({
chunkSize: null,
chunkOverlap: 64,
threshold: 1.5,
documentCount: 0,
searchMode: 'default',
hybridAlpha: 2
})
).toEqual({
chunkOverlap: ['Chunk overlap requires chunk size'],
threshold: ['Threshold must be between 0 and 1'],
documentCount: ['Document count must be greater than 0'],
hybridAlpha: ['Hybrid alpha must be between 0 and 1']
})
})
it('should reject hybridAlpha when searchMode is not hybrid', () => {
expect(
validateKnowledgeBaseConfig({
searchMode: 'bm25',
hybridAlpha: 0.7
})
).toEqual({
hybridAlpha: ['Hybrid alpha requires hybrid search mode']
})
})
it('should accept valid config', () => {
expect(
validateKnowledgeBaseConfig({
chunkSize: 512,
chunkOverlap: 64,
threshold: 0.5,
documentCount: 5,
searchMode: 'hybrid',
hybridAlpha: 0.7
})
).toEqual({})
})
})
})
})
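
The expectations in the `validateKnowledgeBaseConfig` tests above imply a specific rule order: for `hybridAlpha`, the range check wins over the search-mode check, and `chunkOverlap` equal to `chunkSize` is already rejected. A minimal sketch that satisfies those assertions might look like the following. `validateConfigSketch`, `KnowledgeBaseConfigInput`, and `FieldErrors` are hypothetical names introduced for illustration; the real helper's signature and internals are not shown in this diff.

```typescript
// Hypothetical sketch: rule order and messages are inferred from the test
// expectations above, not copied from the real validateKnowledgeBaseConfig.
type SearchMode = 'default' | 'bm25' | 'hybrid'

interface KnowledgeBaseConfigInput {
  chunkSize?: number | null
  chunkOverlap?: number | null
  threshold?: number | null
  documentCount?: number | null
  searchMode?: SearchMode | null
  hybridAlpha?: number | null
}

type FieldErrors = Partial<Record<keyof KnowledgeBaseConfigInput, string[]>>

function validateConfigSketch(config: KnowledgeBaseConfigInput): FieldErrors {
  const errors: FieldErrors = {}
  const { chunkSize, chunkOverlap, threshold, documentCount, searchMode, hybridAlpha } = config

  // chunkOverlap depends on chunkSize and must be strictly smaller than it.
  if (chunkOverlap != null) {
    if (chunkSize == null) {
      errors.chunkOverlap = ['Chunk overlap requires chunk size']
    } else if (chunkOverlap >= chunkSize) {
      errors.chunkOverlap = ['Chunk overlap must be smaller than chunk size']
    }
  }

  if (threshold != null && (threshold < 0 || threshold > 1)) {
    errors.threshold = ['Threshold must be between 0 and 1']
  }

  if (documentCount != null && documentCount <= 0) {
    errors.documentCount = ['Document count must be greater than 0']
  }

  // The range check is reported first; the mode dependency only applies to in-range values.
  if (hybridAlpha != null) {
    if (hybridAlpha < 0 || hybridAlpha > 1) {
      errors.hybridAlpha = ['Hybrid alpha must be between 0 and 1']
    } else if (searchMode !== 'hybrid') {
      errors.hybridAlpha = ['Hybrid alpha requires hybrid search mode']
    }
  }

  return errors
}
```

Returning an empty object for a valid config keeps the caller's check to a single `Object.keys(errors).length === 0` test, which matches the `toEqual({})` assertion above.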

File diff suppressed because it is too large


@@ -1,34 +1,91 @@
import { application } from '@application'
import { knowledgeBaseService } from '@data/services/KnowledgeBaseService'
import { knowledgeItemService } from '@data/services/KnowledgeItemService'
import { loggerService } from '@logger'
import { BaseService, DependsOn, Injectable, Phase, ServicePhase } from '@main/core/lifecycle'
import type { CreateKnowledgeItemsDto } from '@shared/data/api/schemas/knowledges'
import type { KnowledgeItem, KnowledgeSearchResult } from '@shared/data/types/knowledge'
import { DataApiErrorFactory } from '@shared/data/api'
import {
type CreateKnowledgeBaseDto,
type KnowledgeBase,
type KnowledgeItem,
type KnowledgeItemChunk,
type KnowledgeRuntimeAddItemInput,
KnowledgeRuntimeAddItemInputSchema,
type KnowledgeSearchResult,
type RestoreKnowledgeBaseDto
} from '@shared/data/types/knowledge'
import { IpcChannel } from '@shared/IpcChannel'
import * as z from 'zod'
import { expandDirectoryOwnerToCreateItems } from './utils/directory'
import { expandSitemapOwnerToCreateItems } from './utils/sitemap'
import { failItems } from './runtime/utils/cleanup'
import {
KnowledgeRuntimeAddItemsPayloadSchema,
KnowledgeRuntimeBasePayloadSchema,
KnowledgeRuntimeCreateBasePayloadSchema,
KnowledgeRuntimeDeleteItemChunkPayloadSchema,
KnowledgeRuntimeItemChunksPayloadSchema,
KnowledgeRuntimeItemsPayloadSchema,
KnowledgeRuntimeRestoreBasePayloadSchema,
KnowledgeRuntimeSearchPayloadSchema
} from './types/ipc'
const KnowledgeRuntimeBasePayloadSchema = z
.object({
baseId: z.string().trim().min(1)
})
.strict()
const logger = loggerService.withContext('KnowledgeOrchestrationService')
const KnowledgeRuntimeItemsPayloadSchema = z
.object({
baseId: z.string().trim().min(1),
itemIds: z.array(z.string().trim().min(1)).min(1)
})
.strict()
export interface KnowledgeRuntimeAddItemsPartialFailure {
sourceItemId: string
sourceItemType: string
message: string
}
const KnowledgeRuntimeSearchPayloadSchema = z
.object({
baseId: z.string().trim().min(1),
query: z.string().trim().min(1).max(1000)
})
.strict()
export class KnowledgeRuntimeAddItemsPartialError extends Error {
readonly failures: KnowledgeRuntimeAddItemsPartialFailure[]
constructor(failures: KnowledgeRuntimeAddItemsPartialFailure[]) {
super(`Failed to restore ${failures.length} knowledge root item(s)`)
this.name = 'KnowledgeRuntimeAddItemsPartialError'
this.failures = failures
}
}
function createRestoreBaseDto(sourceBase: KnowledgeBase, dto: RestoreKnowledgeBaseDto): CreateKnowledgeBaseDto {
// The new vector store is shaped from dto.dimensions. Callers must resolve it
// against dto.embeddingModelId before restore; mismatches surface during reindex.
return {
name: sourceBase.name,
emoji: sourceBase.emoji,
dimensions: dto.dimensions,
embeddingModelId: dto.embeddingModelId,
rerankModelId: sourceBase.rerankModelId,
fileProcessorId: sourceBase.fileProcessorId,
chunkSize: sourceBase.chunkSize,
chunkOverlap: sourceBase.chunkOverlap,
threshold: sourceBase.threshold,
documentCount: sourceBase.documentCount,
searchMode: sourceBase.searchMode,
hybridAlpha: sourceBase.hybridAlpha
}
}
function assertRestoreBaseCanRebuild(sourceBase: KnowledgeBase, dto: RestoreKnowledgeBaseDto): void {
if (sourceBase.status === 'failed') {
return
}
const embeddingModelChanged = dto.embeddingModelId.trim() !== sourceBase.embeddingModelId
const dimensionsChanged = dto.dimensions !== sourceBase.dimensions
if (embeddingModelChanged || dimensionsChanged) {
return
}
throw DataApiErrorFactory.invalidOperation(
'restoreBase',
'Embedding model or dimensions must change when rebuilding a completed knowledge base'
)
}
function normalizeFailureMessage(error: unknown): string {
return error instanceof Error ? error.message : String(error)
}
@Injectable('KnowledgeOrchestrationService')
@ServicePhase(Phase.WhenReady)
@@ -38,120 +95,230 @@ export class KnowledgeOrchestrationService extends BaseService {
this.registerIpcHandlers()
}
async createBase(baseId: string): Promise<void> {
const base = await knowledgeBaseService.getById(baseId)
async createBase(dto: CreateKnowledgeBaseDto): Promise<KnowledgeBase> {
const base = await knowledgeBaseService.create(dto)
const runtime = application.get('KnowledgeRuntimeService')
await runtime.createBase(base)
try {
await runtime.createBase(base.id)
} catch (error) {
await knowledgeBaseService.delete(base.id)
throw error
}
return base
}
async deleteBase(baseId: string): Promise<void> {
const runtime = application.get('KnowledgeRuntimeService')
await runtime.deleteBase(baseId)
}
const interruptedItemIds = await runtime.deleteBase(baseId)
async addItems(baseId: string, itemIds: string[]): Promise<void[]> {
const [base, items] = await Promise.all([
knowledgeBaseService.getById(baseId),
knowledgeItemService.getByIdsInBase(baseId, itemIds)
])
const expandedItems = await this.expandItemsToCreateInputs(items)
const expandedLeafItems =
expandedItems.length === 0
? []
: this.collectIndexableItems(
(
await knowledgeItemService.createMany(baseId, {
items: expandedItems
})
).items
)
const allLeafItems = this.collectIndexableItems([...items, ...expandedLeafItems])
if (allLeafItems.length === 0) {
return []
try {
await knowledgeBaseService.delete(baseId)
} catch (error) {
const normalizedError = error instanceof Error ? error : new Error(String(error))
try {
await failItems(interruptedItemIds, normalizedError.message)
} catch (failureStateError) {
logger.error(
'Failed to persist runtime item failure state after knowledge base deletion failed',
failureStateError instanceof Error ? failureStateError : new Error(String(failureStateError)),
{
baseId,
interruptedItemIds,
deleteError: normalizedError.message
}
)
}
throw error
}
try {
await runtime.deleteBaseArtifacts(baseId)
} catch (error) {
const normalizedError = error instanceof Error ? error : new Error(String(error))
logger.error('Failed to delete knowledge base vector artifacts after SQLite deletion', normalizedError, {
baseId,
interruptedItemIds
})
throw DataApiErrorFactory.invalidOperation(
'deleteBase',
`SQLite knowledge base was deleted, but vector artifact cleanup failed: ${normalizedError.message}`
)
}
}
async restoreBase(dto: RestoreKnowledgeBaseDto): Promise<KnowledgeBase> {
const sourceBase = await knowledgeBaseService.getById(dto.sourceBaseId)
assertRestoreBaseCanRebuild(sourceBase, dto)
const createDto = createRestoreBaseDto(sourceBase, dto)
const rootItems = await knowledgeItemService.getItemsByBaseId(sourceBase.id, { groupId: null })
const restoredBase = await this.createBase(createDto)
try {
const failures: KnowledgeRuntimeAddItemsPartialFailure[] = []
for (const item of rootItems) {
try {
const input = KnowledgeRuntimeAddItemInputSchema.parse({
type: item.type,
data: item.data
})
await this.addItems(restoredBase.id, [input])
} catch (error) {
failures.push({
sourceItemId: item.id,
sourceItemType: item.type,
message: normalizeFailureMessage(error)
})
}
}
if (failures.length > 0) {
throw new KnowledgeRuntimeAddItemsPartialError(failures)
}
} catch (error) {
try {
await this.deleteBase(restoredBase.id)
} catch (cleanupError) {
logger.error(
'Failed to delete restored knowledge base after item restoration failed',
cleanupError instanceof Error ? cleanupError : new Error(String(cleanupError)),
{
sourceBaseId: sourceBase.id,
restoredBaseId: restoredBase.id
}
)
}
throw error
}
return restoredBase
}
async addItems(baseId: string, items: KnowledgeRuntimeAddItemInput[]): Promise<void> {
await this.assertBaseCanRunRuntimeOperation(baseId, 'addItems')
const runtime = application.get('KnowledgeRuntimeService')
return await runtime.addItems(base, allLeafItems)
await runtime.addItems(baseId, items)
}
async deleteItems(baseId: string, itemIds: string[]): Promise<void> {
const [base, items] = await Promise.all([
knowledgeBaseService.getById(baseId),
knowledgeItemService.getByIdsInBase(baseId, itemIds)
])
const items = await this.getTopLevelItemsInBase(baseId, itemIds)
const runtime = application.get('KnowledgeRuntimeService')
await runtime.deleteItems(base, items)
await runtime.deleteItems(baseId, items)
for (const item of items) {
await knowledgeItemService.delete(item.id)
}
}
async reindexItems(baseId: string, itemIds: string[]): Promise<void> {
await this.assertBaseCanRunRuntimeOperation(baseId, 'reindexItems')
const items = await this.getTopLevelItemsInBase(baseId, itemIds)
const runtime = application.get('KnowledgeRuntimeService')
await runtime.reindexItems(baseId, items)
}
async search(baseId: string, query: string): Promise<KnowledgeSearchResult[]> {
const base = await knowledgeBaseService.getById(baseId)
await this.assertBaseCanRunRuntimeOperation(baseId, 'search')
const runtime = application.get('KnowledgeRuntimeService')
return await runtime.search(base, query)
return await runtime.search(baseId, query)
}
async listItemChunks(baseId: string, itemId: string): Promise<KnowledgeItemChunk[]> {
await this.assertBaseCanRunRuntimeOperation(baseId, 'listItemChunks')
await this.getRootItemsInBase(baseId, [itemId])
const runtime = application.get('KnowledgeRuntimeService')
return await runtime.listItemChunks(baseId, itemId)
}
async deleteItemChunk(baseId: string, itemId: string, chunkId: string): Promise<void> {
await this.assertBaseCanRunRuntimeOperation(baseId, 'deleteItemChunk')
await this.getRootItemsInBase(baseId, [itemId])
const runtime = application.get('KnowledgeRuntimeService')
return await runtime.deleteItemChunk(baseId, itemId, chunkId)
}
private async assertBaseCanRunRuntimeOperation(baseId: string, operation: string): Promise<void> {
const base = await knowledgeBaseService.getById(baseId)
if (base.status !== 'failed') {
return
}
throw DataApiErrorFactory.validation(
{
base: [`Knowledge base '${baseId}' is in failed state; restore it before ${operation}.`]
},
`Cannot ${operation} failed knowledge base`
)
}
private async getRootItemsInBase(baseId: string, itemIds: string[]): Promise<KnowledgeItem[]> {
const rootIds = [...new Set(itemIds)]
const items = await Promise.all(rootIds.map((itemId) => knowledgeItemService.getById(itemId)))
const invalidItem = items.find((item) => item.baseId !== baseId)
if (invalidItem) {
throw new Error(`Knowledge item '${invalidItem.id}' does not belong to base '${baseId}'`)
}
return items
}
private async getTopLevelItemsInBase(baseId: string, itemIds: string[]): Promise<KnowledgeItem[]> {
const items = await this.getRootItemsInBase(baseId, itemIds)
const selectedIds = new Set(items.map((item) => item.id))
const descendantSelectedIds = new Set<string>()
for (const item of items) {
const descendants = await knowledgeItemService.getDescendantItems(baseId, [item.id])
for (const descendant of descendants) {
if (selectedIds.has(descendant.id)) {
descendantSelectedIds.add(descendant.id)
}
}
}
return items.filter((item) => !descendantSelectedIds.has(item.id))
}
private registerIpcHandlers(): void {
this.ipcHandle(IpcChannel.KnowledgeRuntime_CreateBase, async (_, payload: unknown) => {
const { baseId } = KnowledgeRuntimeBasePayloadSchema.parse(payload)
return await this.createBase(baseId)
const { base } = KnowledgeRuntimeCreateBasePayloadSchema.parse(payload)
return await this.createBase(base)
})
this.ipcHandle(IpcChannel.KnowledgeRuntime_RestoreBase, async (_, payload: unknown) => {
const dto = KnowledgeRuntimeRestoreBasePayloadSchema.parse(payload)
return await this.restoreBase(dto)
})
this.ipcHandle(IpcChannel.KnowledgeRuntime_DeleteBase, async (_, payload: unknown) => {
const { baseId } = KnowledgeRuntimeBasePayloadSchema.parse(payload)
return await this.deleteBase(baseId)
})
this.ipcHandle(IpcChannel.KnowledgeRuntime_AddItems, async (_, payload: unknown) => {
const { baseId, itemIds } = KnowledgeRuntimeItemsPayloadSchema.parse(payload)
return await this.addItems(baseId, itemIds)
const { baseId, items } = KnowledgeRuntimeAddItemsPayloadSchema.parse(payload)
return await this.addItems(baseId, items)
})
this.ipcHandle(IpcChannel.KnowledgeRuntime_DeleteItems, async (_, payload: unknown) => {
const { baseId, itemIds } = KnowledgeRuntimeItemsPayloadSchema.parse(payload)
return await this.deleteItems(baseId, itemIds)
})
this.ipcHandle(IpcChannel.KnowledgeRuntime_ReindexItems, async (_, payload: unknown) => {
const { baseId, itemIds } = KnowledgeRuntimeItemsPayloadSchema.parse(payload)
return await this.reindexItems(baseId, itemIds)
})
this.ipcHandle(IpcChannel.KnowledgeRuntime_Search, async (_, payload: unknown) => {
const { baseId, query } = KnowledgeRuntimeSearchPayloadSchema.parse(payload)
return await this.search(baseId, query)
})
}
private async expandItemsToCreateInputs(items: KnowledgeItem[]): Promise<CreateKnowledgeItemsDto['items']> {
const expandedItems: CreateKnowledgeItemsDto['items'] = []
for (const item of items) {
const itemCreateInputs = await this.expandItemToCreateInputs(item)
if (itemCreateInputs.length === 0) {
continue
}
expandedItems.push(...itemCreateInputs)
}
return expandedItems
}
private async expandItemToCreateInputs(item: KnowledgeItem): Promise<CreateKnowledgeItemsDto['items']> {
if (item.type === 'directory') {
return await expandDirectoryOwnerToCreateItems(item)
}
if (item.type === 'sitemap') {
return await expandSitemapOwnerToCreateItems(item)
}
return []
}
private collectIndexableItems(items: KnowledgeItem[]): KnowledgeItem[] {
const leafItems = new Map<string, KnowledgeItem>()
for (const item of items) {
if (item.type === 'file' || item.type === 'url' || item.type === 'note') {
leafItems.set(item.id, item)
}
}
return [...leafItems.values()]
this.ipcHandle(IpcChannel.KnowledgeRuntime_ListItemChunks, async (_, payload: unknown) => {
const { baseId, itemId } = KnowledgeRuntimeItemChunksPayloadSchema.parse(payload)
return await this.listItemChunks(baseId, itemId)
})
this.ipcHandle(IpcChannel.KnowledgeRuntime_DeleteItemChunk, async (_, payload: unknown) => {
const { baseId, itemId, chunkId } = KnowledgeRuntimeDeleteItemChunkPayloadSchema.parse(payload)
return await this.deleteItemChunk(baseId, itemId, chunkId)
})
}
}
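
`getTopLevelItemsInBase` above drops any selected item that is a descendant of another selected item, so that `deleteItems` and `reindexItems` act on each subtree once. The service does this by querying `knowledgeItemService.getDescendantItems` per item. The sketch below shows the same filtering rule with a different technique: an in-memory walk up parent pointers. It assumes `groupId` holds the parent item id (as the integration test's `source-child` row suggests); `ItemNode` and `selectTopLevel` are hypothetical names.

```typescript
// Hypothetical sketch: in-memory variant of the top-level filter.
// Assumes groupId points at the parent item, or is null for root items.
interface ItemNode {
  id: string
  groupId: string | null
}

function selectTopLevel(allItems: ItemNode[], selectedIds: string[]): string[] {
  const parentOf = new Map(allItems.map((item) => [item.id, item.groupId] as const))
  const selected = new Set(selectedIds) // also removes duplicate ids

  const hasSelectedAncestor = (id: string): boolean => {
    const seen = new Set<string>([id]) // guards against accidental cycles
    let parent = parentOf.get(id) ?? null
    while (parent !== null && !seen.has(parent)) {
      if (selected.has(parent)) {
        return true
      }
      seen.add(parent)
      parent = parentOf.get(parent) ?? null
    }
    return false
  }

  // Keep only selections that are not covered by another selected ancestor.
  return [...selected].filter((id) => !hasSelectedAncestor(id))
}
```

The per-item descendant query in the service avoids loading the whole base into memory; the parent-walk form is only reasonable when the full item list is already at hand.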


@@ -0,0 +1,202 @@
import { groupTable } from '@data/db/schemas/group'
import { knowledgeBaseTable, knowledgeItemTable } from '@data/db/schemas/knowledge'
import { userModelTable } from '@data/db/schemas/userModel'
import { userProviderTable } from '@data/db/schemas/userProvider'
import { knowledgeItemService } from '@data/services/KnowledgeItemService'
import {
DEFAULT_KNOWLEDGE_BASE_CHUNK_OVERLAP,
DEFAULT_KNOWLEDGE_BASE_CHUNK_SIZE,
DEFAULT_KNOWLEDGE_BASE_EMOJI,
DEFAULT_KNOWLEDGE_SEARCH_MODE,
KNOWLEDGE_BASE_ERROR_MISSING_EMBEDDING_MODEL
} from '@shared/data/types/knowledge'
import { createUniqueModelId } from '@shared/data/types/model'
import { setupTestDatabase } from '@test-helpers/db'
import { eq, isNull } from 'drizzle-orm'
import { beforeEach, describe, expect, it, vi } from 'vitest'
const { runtimeAddItemsMock, runtimeCreateBaseMock, runtimeReindexItemsMock } = vi.hoisted(() => ({
runtimeAddItemsMock: vi.fn(),
runtimeCreateBaseMock: vi.fn(),
runtimeReindexItemsMock: vi.fn()
}))
vi.mock('@application', async () => {
const { mockApplicationFactory } = await import('@test-mocks/main/application')
return mockApplicationFactory({
KnowledgeRuntimeService: {
addItems: runtimeAddItemsMock,
createBase: runtimeCreateBaseMock,
reindexItems: runtimeReindexItemsMock
}
} as Parameters<typeof mockApplicationFactory>[0])
})
vi.mock('@logger', () => ({
loggerService: {
withContext: () => ({
error: vi.fn(),
info: vi.fn(),
warn: vi.fn()
})
}
}))
const { KnowledgeOrchestrationService } = await import('../KnowledgeOrchestrationService')
describe('KnowledgeOrchestrationService integration', () => {
const dbh = setupTestDatabase()
const embeddingModelId = createUniqueModelId('openai', 'text-embedding-3-small')
beforeEach(async () => {
vi.clearAllMocks()
runtimeCreateBaseMock.mockResolvedValue(undefined)
runtimeReindexItemsMock.mockResolvedValue(undefined)
runtimeAddItemsMock.mockImplementation(async (baseId, inputs) => {
for (const input of inputs) {
await knowledgeItemService.create(baseId, input)
}
})
await dbh.db.insert(userProviderTable).values({
providerId: 'openai',
name: 'OpenAI'
})
await dbh.db.insert(userModelTable).values({
id: embeddingModelId,
providerId: 'openai',
modelId: 'text-embedding-3-small',
presetModelId: 'text-embedding-3-small',
name: 'text-embedding-3-small',
isEnabled: true,
isHidden: false,
sortOrder: 0
})
await dbh.db.insert(groupTable).values({
id: 'group-1',
entityType: 'knowledge',
name: 'Legacy group',
orderKey: 'a0'
})
await dbh.db.insert(knowledgeBaseTable).values({
id: 'source-kb',
name: 'Legacy KB',
groupId: 'group-1',
emoji: DEFAULT_KNOWLEDGE_BASE_EMOJI,
dimensions: null,
embeddingModelId: null,
status: 'failed',
error: KNOWLEDGE_BASE_ERROR_MISSING_EMBEDDING_MODEL,
rerankModelId: null,
fileProcessorId: null,
chunkSize: DEFAULT_KNOWLEDGE_BASE_CHUNK_SIZE,
chunkOverlap: DEFAULT_KNOWLEDGE_BASE_CHUNK_OVERLAP,
threshold: null,
documentCount: null,
searchMode: DEFAULT_KNOWLEDGE_SEARCH_MODE,
hybridAlpha: null
})
await dbh.db.insert(knowledgeItemTable).values([
{
id: 'source-root',
baseId: 'source-kb',
groupId: null,
type: 'note',
data: { source: 'source-root', content: 'root content' },
status: 'idle',
phase: null,
error: null
},
{
id: 'source-child',
baseId: 'source-kb',
groupId: 'source-root',
type: 'note',
data: { source: 'source-child', content: 'child content' },
status: 'idle',
phase: null,
error: null
}
])
})
it('restores a failed base into a new completed base and reindexes the restored root', async () => {
const service = new KnowledgeOrchestrationService()
const restoredBase = await service.restoreBase({
sourceBaseId: 'source-kb',
embeddingModelId,
dimensions: 1536
})
expect(restoredBase).toMatchObject({
name: 'Legacy KB',
groupId: null,
dimensions: 1536,
embeddingModelId,
status: 'completed',
error: null
})
expect(restoredBase.id).not.toBe('source-kb')
expect(runtimeCreateBaseMock).toHaveBeenCalledWith(restoredBase.id)
expect(runtimeAddItemsMock).toHaveBeenCalledWith(restoredBase.id, [
{ type: 'note', data: { source: 'source-root', content: 'root content' } }
])
const [sourceBase] = await dbh.db.select().from(knowledgeBaseTable).where(eq(knowledgeBaseTable.id, 'source-kb'))
expect(sourceBase).toMatchObject({
id: 'source-kb',
groupId: 'group-1',
dimensions: null,
embeddingModelId: null,
status: 'failed',
error: KNOWLEDGE_BASE_ERROR_MISSING_EMBEDDING_MODEL
})
const restoredItems = await dbh.db
.select()
.from(knowledgeItemTable)
.where(eq(knowledgeItemTable.baseId, restoredBase.id))
expect(restoredItems).toHaveLength(1)
expect(restoredItems[0]).toMatchObject({
baseId: restoredBase.id,
groupId: null,
type: 'note',
data: { source: 'source-root', content: 'root content' }
})
const sourceChildRows = await dbh.db
.select()
.from(knowledgeItemTable)
.where(eq(knowledgeItemTable.id, 'source-child'))
expect(sourceChildRows).toHaveLength(1)
const restoredRootItems = await dbh.db
.select()
.from(knowledgeItemTable)
.where(eq(knowledgeItemTable.baseId, restoredBase.id))
const restoredRoot = restoredRootItems.find((item) => item.groupId === null)
expect(restoredRoot).toBeDefined()
await service.reindexItems(restoredBase.id, [restoredRoot!.id])
expect(runtimeReindexItemsMock).toHaveBeenCalledWith(
restoredBase.id,
expect.arrayContaining([
expect.objectContaining({
id: restoredRoot!.id,
baseId: restoredBase.id,
groupId: null,
data: { source: 'source-root', content: 'root content' }
})
])
)
expect(runtimeReindexItemsMock).not.toHaveBeenCalledWith('source-kb', expect.anything())
const ungroupedRestoredItems = await dbh.db
.select()
.from(knowledgeItemTable)
.where(isNull(knowledgeItemTable.groupId))
expect(ungroupedRestoredItems.some((item) => item.baseId === restoredBase.id)).toBe(true)
})
})
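
The integration test above only covers the success path of `restoreBase`. The failure path in the service collects one failure record per root item, then deletes the half-restored base and rethrows. A reduced sketch of that collect-then-compensate pattern follows. `restoreAllSketch`, `PartialRestoreError`, and the `addItem` / `cleanup` callbacks are hypothetical stand-ins, and unlike the service (which logs a failed cleanup) this sketch silently ignores it.

```typescript
// Hypothetical sketch of the collect-then-compensate flow used by restoreBase.
interface PartialFailure {
  sourceItemId: string
  message: string
}

class PartialRestoreError extends Error {
  readonly failures: PartialFailure[]

  constructor(failures: PartialFailure[]) {
    super(`Failed to restore ${failures.length} item(s)`)
    this.name = 'PartialRestoreError'
    this.failures = failures
  }
}

async function restoreAllSketch(
  itemIds: string[],
  addItem: (itemId: string) => Promise<void>,
  cleanup: () => Promise<void>
): Promise<void> {
  const failures: PartialFailure[] = []

  // Try every item so the caller learns about all failures, not just the first.
  for (const itemId of itemIds) {
    try {
      await addItem(itemId)
    } catch (error) {
      failures.push({
        sourceItemId: itemId,
        message: error instanceof Error ? error.message : String(error)
      })
    }
  }

  if (failures.length === 0) {
    return
  }

  try {
    await cleanup() // compensate: remove the partially restored target
  } catch {
    // Ignored here so the original restore failure is what the caller sees.
  }
  throw new PartialRestoreError(failures)
}
```

Rethrowing the aggregate error after cleanup, instead of the cleanup error, keeps the reported cause tied to the items that failed.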


@@ -0,0 +1,330 @@
import { loggerService } from '@logger'
import PQueue from 'p-queue'
import type {
EnqueueKnowledgeTaskOptions,
IndexLeafTaskEntry,
KnowledgeQueueSnapshot,
KnowledgeQueueTaskContext,
KnowledgeQueueTaskDescriptor,
PrepareRootTaskEntry
} from './types'
const logger = loggerService.withContext('KnowledgeQueueManager')
const DEFAULT_CONCURRENCY = 5
class KnowledgeQueueInterruptedError extends Error {
constructor(message: string) {
super(message)
this.name = 'KnowledgeQueueInterruptedError'
}
}
type KnowledgeQueueTaskStatus = 'pending' | 'running'
type QueueEntry = EnqueueKnowledgeTaskOptions & {
controller: AbortController
interruptError?: KnowledgeQueueInterruptedError
reject: (error: Error) => void
resolve: () => void
runPromise?: Promise<void>
promise: Promise<void>
status: KnowledgeQueueTaskStatus
settled: boolean
}
export class KnowledgeQueueManager {
private queue: PQueue
private isResetting = false
private resetReason: string | null = null
private readonly entries = new Map<string, QueueEntry>()
// Per-base serialization protects vector-store writes and status completion ordering.
private readonly baseWriteLocks = new Map<string, Promise<void>>()
constructor() {
this.queue = this.createQueue()
}
async reset(reason: string): Promise<KnowledgeQueueTaskDescriptor[]> {
if (this.isResetting) {
throw this.createResetError()
}
this.resetReason = reason
this.isResetting = true
try {
const interruptedEntries = this.interruptAll(reason)
this.queue.clear()
await this.waitForRunning(interruptedEntries.map((entry) => entry.itemId))
this.queue = this.createQueue()
this.baseWriteLocks.clear()
return interruptedEntries
} finally {
this.isResetting = false
this.resetReason = null
}
}
enqueue(options: EnqueueKnowledgeTaskOptions): Promise<void> {
if (this.isResetting) {
return Promise.reject(this.createResetError())
}
const existingEntry = this.entries.get(options.item.id)
if (existingEntry) {
return existingEntry.promise
}
const entry = this.createEntry(options)
this.entries.set(entry.item.id, entry)
this.schedule(entry)
return entry.promise
}
interruptItems(itemIds: string[], reason: string): KnowledgeQueueTaskDescriptor[] {
const interruptedEntries = this.getEntriesByIds(itemIds)
for (const entry of interruptedEntries) {
entry.interruptError ??= new KnowledgeQueueInterruptedError(reason)
if (!entry.controller.signal.aborted) {
entry.controller.abort(entry.interruptError)
}
if (entry.status === 'pending') {
this.rejectEntry(entry, this.createInterruptError(entry))
}
}
return interruptedEntries.map((entry) => this.createDescriptor(entry))
}
interruptBase(baseId: string, reason: string): KnowledgeQueueTaskDescriptor[] {
const itemIds = [...this.entries.values()].filter((entry) => entry.base.id === baseId).map((entry) => entry.item.id)
return this.interruptItems(itemIds, reason)
}
interruptAll(reason: string): KnowledgeQueueTaskDescriptor[] {
return this.interruptItems([...this.entries.keys()], reason)
}
async waitForRunning(itemIds: string[]): Promise<void> {
const runningPromises = this.getEntriesByIds(itemIds)
.filter((entry) => !entry.settled && entry.status === 'running')
.map((entry) => entry.runPromise ?? entry.promise)
if (runningPromises.length === 0) {
return
}
await Promise.allSettled(runningPromises)
}
getSnapshot(): KnowledgeQueueSnapshot {
const snapshot: KnowledgeQueueSnapshot = {
pending: [],
running: []
}
for (const entry of this.entries.values()) {
if (entry.settled) {
continue
}
snapshot[entry.status].push({
...this.createDescriptor(entry)
})
}
return snapshot
}
private createQueue(): PQueue {
return new PQueue({ concurrency: DEFAULT_CONCURRENCY })
}
private createEntry(options: EnqueueKnowledgeTaskOptions): QueueEntry {
const controller = new AbortController()
let resolve!: () => void
let reject!: (error: Error) => void
const promise = new Promise<void>((res, rej) => {
resolve = res
reject = rej
})
return {
...options,
controller,
promise,
reject,
resolve,
settled: false,
status: 'pending'
}
}
private schedule(entry: QueueEntry): void {
void this.queue.add(async () => {
if (this.entries.get(entry.item.id) !== entry || entry.settled || entry.status !== 'pending') {
return
}
entry.status = 'running'
entry.runPromise = this.executeEntry(entry)
await entry.runPromise
})
}
private async executeEntry(entry: QueueEntry): Promise<void> {
try {
this.throwIfInterrupted(entry)
await this.executeQueueEntry(entry)
this.throwIfInterrupted(entry)
this.resolveEntry(entry)
} catch (error) {
const taskError = error instanceof Error ? error : new Error(String(error))
if (taskError !== entry.interruptError) {
logger.error('Knowledge queue task failed unexpectedly', taskError, {
baseId: entry.base.id,
itemId: entry.item.id,
kind: entry.kind
})
}
this.rejectEntry(entry, taskError)
}
}
private async executeQueueEntry(entry: QueueEntry): Promise<void> {
if (entry.kind === 'index-leaf') {
const context: KnowledgeQueueTaskContext<IndexLeafTaskEntry> = {
base: entry.base,
baseId: entry.base.id,
item: entry.item,
itemId: entry.item.id,
itemType: entry.item.type,
kind: entry.kind,
signal: entry.controller.signal,
runWithBaseWriteLock: (task) => this.runWithBaseWriteLock(entry, task)
}
await entry.execute(context)
return
}
const context: KnowledgeQueueTaskContext<PrepareRootTaskEntry> = {
base: entry.base,
baseId: entry.base.id,
item: entry.item,
itemId: entry.item.id,
itemType: entry.item.type,
kind: entry.kind,
signal: entry.controller.signal,
runWithBaseWriteLock: (task) => this.runWithBaseWriteLock(entry, task)
}
await entry.execute(context)
}
private async runWithBaseWriteLock<T>(entry: QueueEntry, task: () => Promise<T>): Promise<T> {
this.throwIfInterrupted(entry)
const baseId = entry.base.id
const previousLock = this.baseWriteLocks.get(baseId) ?? Promise.resolve()
let releaseCurrentLock!: () => void
const currentLock = new Promise<void>((resolve) => {
releaseCurrentLock = resolve
})
const nextLock = previousLock.catch(() => undefined).then(() => currentLock)
this.baseWriteLocks.set(baseId, nextLock)
try {
await previousLock.catch(() => undefined)
this.throwIfInterrupted(entry)
const result = await task()
this.throwIfInterrupted(entry)
return result
} finally {
releaseCurrentLock()
if (this.baseWriteLocks.get(baseId) === nextLock) {
this.baseWriteLocks.delete(baseId)
}
}
}
private getEntriesByIds(itemIds: string[]): QueueEntry[] {
const entries: QueueEntry[] = []
for (const itemId of new Set(itemIds)) {
const entry = this.entries.get(itemId)
if (entry) {
entries.push(entry)
}
}
return entries
}
private deleteEntry(entry: QueueEntry): void {
if (this.entries.get(entry.item.id) === entry) {
this.entries.delete(entry.item.id)
}
}
private createDescriptor(entry: QueueEntry): KnowledgeQueueTaskDescriptor {
return {
base: entry.base,
baseId: entry.base.id,
itemId: entry.item.id,
itemType: entry.item.type,
kind: entry.kind
}
}
private resolveEntry(entry: QueueEntry): void {
if (entry.settled) {
return
}
entry.settled = true
entry.resolve()
this.deleteEntry(entry)
}
private rejectEntry(entry: QueueEntry, error: Error): void {
if (entry.settled) {
return
}
entry.settled = true
entry.reject(error)
this.deleteEntry(entry)
}
private throwIfInterrupted(entry: QueueEntry): void {
if (entry.controller.signal.aborted) {
throw this.createInterruptError(entry)
}
}
private createInterruptError(entry: QueueEntry): Error {
if (!entry.interruptError) {
throw new Error('Knowledge queue entry was aborted without an interrupt error')
}
return entry.interruptError
}
private createResetError(): Error {
return new KnowledgeQueueInterruptedError(this.resetReason!)
}
}
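
`runWithBaseWriteLock` above serializes writes per base by chaining promises: each caller waits for the previous tail, runs its task, and releases its own link in a `finally`. The sketch below isolates that idea without the interrupt checks. `KeyedLock` is a hypothetical name, and this is a simplified illustration under that assumption, not the manager's actual implementation.

```typescript
// Hypothetical sketch: a per-key promise-chain lock, without abort handling.
class KeyedLock {
  // Tail of the wait chain for each key; it settles when the latest holder releases.
  private readonly tails = new Map<string, Promise<void>>()

  async run<T>(key: string, task: () => Promise<T>): Promise<T> {
    const previous = this.tails.get(key) ?? Promise.resolve()

    let release!: () => void
    const current = new Promise<void>((resolve) => {
      release = resolve
    })

    // Later callers wait on this tail, which resolves only after we release.
    const tail = previous.then(() => current)
    this.tails.set(key, tail)

    try {
      await previous
      return await task()
    } finally {
      release()
      // Drop the map entry only if nobody queued behind us in the meantime.
      if (this.tails.get(key) === tail) {
        this.tails.delete(key)
      }
    }
  }
}
```

Because the lock promises themselves never reject, a failing task only rejects its own `run` call; the next waiter for the same key still proceeds.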


@@ -0,0 +1,527 @@
import type { KnowledgeBase, KnowledgeItem, KnowledgeItemOf } from '@shared/data/types/knowledge'
import { beforeEach, describe, expect, it, vi } from 'vitest'
import { KnowledgeQueueManager } from '../KnowledgeQueueManager'
import type {
EnqueueKnowledgeTaskOptions,
IndexLeafTaskEntry,
KnowledgeQueueTaskDescriptor,
PrepareRootTaskEntry
} from '../types'
const { loggerErrorMock, loggerWarnMock } = vi.hoisted(() => ({
loggerErrorMock: vi.fn(),
loggerWarnMock: vi.fn()
}))
vi.mock('@logger', () => ({
loggerService: {
withContext: () => ({
debug: vi.fn(),
error: loggerErrorMock,
info: vi.fn(),
warn: loggerWarnMock
})
}
}))
const BASE_ID = 'base-1'
const BASE: KnowledgeBase = {
id: BASE_ID,
name: 'Base',
groupId: null,
emoji: '📁',
dimensions: 1024,
embeddingModelId: 'ollama::nomic-embed-text',
status: 'completed',
error: null,
chunkSize: 1024,
chunkOverlap: 200,
searchMode: 'hybrid',
createdAt: '2026-04-08T00:00:00.000Z',
updatedAt: '2026-04-08T00:00:00.000Z'
}
function createDeferred<T = void>() {
let resolve!: (value: T | PromiseLike<T>) => void
let reject!: (reason?: unknown) => void
const promise = new Promise<T>((res, rej) => {
resolve = res
reject = rej
})
return { promise, reject, resolve }
}
function createNoteItem(
id = 'note-1',
status: KnowledgeItem['status'] = 'processing',
baseId = BASE_ID
): KnowledgeItemOf<'note'> {
const lifecycle =
status === 'failed'
? ({ status, phase: null, error: `failed ${id}` } as const)
: ({ status, phase: null, error: null } as const)
return {
id,
baseId,
groupId: null,
type: 'note',
data: { source: id, content: `hello ${id}` },
...lifecycle,
createdAt: '2026-04-08T00:00:00.000Z',
updatedAt: '2026-04-08T00:00:00.000Z'
}
}
function createDirectoryItem(
id = 'dir-1',
status: KnowledgeItem['status'] = 'processing',
baseId = BASE_ID
): KnowledgeItemOf<'directory'> {
const lifecycle =
status === 'failed'
? ({ status, phase: null, error: `failed ${id}` } as const)
: ({ status, phase: null, error: null } as const)
return {
id,
baseId,
groupId: null,
type: 'directory',
data: { source: `/docs/${id}`, path: `/docs/${id}` },
...lifecycle,
createdAt: '2026-04-08T00:00:00.000Z',
updatedAt: '2026-04-08T00:00:00.000Z'
}
}
function createIndexTask(
itemId: string,
execute: EnqueueKnowledgeTaskOptions<IndexLeafTaskEntry>['execute'],
baseId = BASE_ID
): EnqueueKnowledgeTaskOptions<IndexLeafTaskEntry> {
return {
base: { ...BASE, id: baseId },
kind: 'index-leaf',
item: createNoteItem(itemId, 'processing', baseId),
execute
}
}
function createPrepareTask(
itemId: string,
execute: EnqueueKnowledgeTaskOptions<PrepareRootTaskEntry>['execute'],
baseId = BASE_ID
): EnqueueKnowledgeTaskOptions<PrepareRootTaskEntry> {
return {
base: { ...BASE, id: baseId },
kind: 'prepare-root',
item: createDirectoryItem(itemId, 'processing', baseId),
execute
}
}
function createTaskDescriptor(
itemId: string,
kind: KnowledgeQueueTaskDescriptor['kind'] = 'index-leaf',
baseId = BASE_ID
): KnowledgeQueueTaskDescriptor {
return {
base: { ...BASE, id: baseId },
baseId,
itemId,
itemType: kind === 'index-leaf' ? 'note' : 'directory',
kind
}
}
function captureError<T>(promise: Promise<T>): Promise<Error> {
return promise.then(
() => new Error('Expected promise to reject'),
(error) => (error instanceof Error ? error : new Error(String(error)))
)
}
async function flushPromises(): Promise<void> {
await Promise.resolve()
await Promise.resolve()
}
describe('KnowledgeQueueManager', () => {
beforeEach(() => {
vi.clearAllMocks()
})
it('deduplicates queued work for the same item', async () => {
const manager = new KnowledgeQueueManager()
const execute = vi.fn(async () => undefined)
const firstPromise = manager.enqueue(createIndexTask('item-1', execute))
const secondPromise = manager.enqueue(createIndexTask('item-1', execute))
expect(secondPromise).toBe(firstPromise)
await expect(firstPromise).resolves.toBeUndefined()
expect(execute).toHaveBeenCalledTimes(1)
expect(manager.getSnapshot()).toEqual({ pending: [], running: [] })
})
it('preserves task kind in snapshots and interrupted entries', async () => {
const manager = new KnowledgeQueueManager()
const blocker = createDeferred()
const started = createDeferred()
const taskPromise = manager.enqueue(
createPrepareTask('dir-1', async () => {
started.resolve()
await blocker.promise
})
)
const taskError = captureError(taskPromise)
await started.promise
expect(manager.getSnapshot().running).toEqual([createTaskDescriptor('dir-1', 'prepare-root')])
expect(manager.interruptItems(['dir-1'], 'deleted')).toEqual([createTaskDescriptor('dir-1', 'prepare-root')])
blocker.resolve()
await expect(taskError).resolves.toMatchObject({ message: 'deleted' })
})
it('rejects pending tasks on interrupt and does not execute them later', async () => {
const manager = new KnowledgeQueueManager()
const blockers = Array.from({ length: 5 }, () => createDeferred())
const executedItemIds: string[] = []
const runningPromises = blockers.map((deferred, index) =>
manager.enqueue(
createIndexTask(`running-${index}`, async (context) => {
executedItemIds.push(context.itemId)
await deferred.promise
})
)
)
await vi.waitFor(() => {
expect(executedItemIds).toHaveLength(5)
})
const pendingPromise = manager.enqueue(
createIndexTask('pending', async (context) => {
executedItemIds.push(context.itemId)
})
)
const pendingError = captureError(pendingPromise)
expect(manager.getSnapshot().pending).toEqual([createTaskDescriptor('pending')])
const interruptedEntries = manager.interruptItems(['pending'], 'deleted')
expect(interruptedEntries).toEqual([createTaskDescriptor('pending')])
await expect(pendingError).resolves.toMatchObject({ message: 'deleted' })
expect(manager.getSnapshot().pending).toEqual([])
for (const blocker of blockers) {
blocker.resolve()
}
await expect(Promise.all(runningPromises)).resolves.toEqual([undefined, undefined, undefined, undefined, undefined])
await flushPromises()
expect(executedItemIds).not.toContain('pending')
})
it('waits for interrupted running tasks to really finish before waitForRunning resolves', async () => {
const manager = new KnowledgeQueueManager()
const started = createDeferred()
const finish = createDeferred()
let waitResolved = false
let signalAbortedAfterFinish = false
const taskPromise = manager.enqueue(
createIndexTask('running', async (context) => {
started.resolve()
await finish.promise
signalAbortedAfterFinish = context.signal.aborted
})
)
const taskError = captureError(taskPromise)
await started.promise
manager.interruptItems(['running'], 'deleted')
const waitPromise = manager.waitForRunning(['running']).then(() => {
waitResolved = true
})
await flushPromises()
expect(waitResolved).toBe(false)
finish.resolve()
await waitPromise
expect(signalAbortedAfterFinish).toBe(true)
await expect(taskError).resolves.toMatchObject({ message: 'deleted' })
expect(loggerErrorMock).not.toHaveBeenCalled()
})
it('treats signal throwIfAborted as a normal running task interruption', async () => {
const manager = new KnowledgeQueueManager()
const started = createDeferred()
const finish = createDeferred()
const taskPromise = manager.enqueue(
createIndexTask('running', async (context) => {
started.resolve()
await finish.promise
context.signal.throwIfAborted()
})
)
const taskError = captureError(taskPromise)
await started.promise
manager.interruptItems(['running'], 'deleted')
finish.resolve()
await expect(taskError).resolves.toMatchObject({ message: 'deleted' })
expect(loggerErrorMock).not.toHaveBeenCalled()
})
it('resets pending work and waits for running work to settle', async () => {
const manager = new KnowledgeQueueManager()
const blockers = Array.from({ length: 5 }, () => createDeferred())
const executedItemIds: string[] = []
const runningPromises = blockers.map((deferred, index) =>
manager.enqueue(
createIndexTask(`running-${index}`, async (context) => {
executedItemIds.push(context.itemId)
await deferred.promise
})
)
)
const runningErrors = runningPromises.map(captureError)
await vi.waitFor(() => {
expect(manager.getSnapshot().running).toHaveLength(5)
})
const pendingPromise = manager.enqueue(
createIndexTask('pending', async (context) => {
executedItemIds.push(context.itemId)
})
)
const pendingError = captureError(pendingPromise)
let resetResolved = false
const resetPromise = manager.reset('reset').then((entries) => {
resetResolved = true
return entries
})
await expect(pendingError).resolves.toMatchObject({ message: 'reset' })
await flushPromises()
expect(resetResolved).toBe(false)
for (const blocker of blockers) {
blocker.resolve()
}
await expect(resetPromise).resolves.toEqual([
...Array.from({ length: 5 }, (_, index) => createTaskDescriptor(`running-${index}`)),
createTaskDescriptor('pending')
])
await expect(Promise.all(runningErrors)).resolves.toEqual(
Array.from({ length: 5 }, () => expect.objectContaining({ message: 'reset' }))
)
expect(manager.getSnapshot()).toEqual({ pending: [], running: [] })
expect(executedItemIds).toEqual(['running-0', 'running-1', 'running-2', 'running-3', 'running-4'])
})
it('rejects new work while reset is waiting for running work', async () => {
const manager = new KnowledgeQueueManager()
const started = createDeferred()
const finish = createDeferred()
const executeAfterReset = vi.fn(async () => undefined)
const runningPromise = manager.enqueue(
createIndexTask('running', async () => {
started.resolve()
await finish.promise
})
)
const runningError = captureError(runningPromise)
await started.promise
const resetPromise = manager.reset('reset')
const rejectedDuringReset = captureError(manager.enqueue(createIndexTask('during-reset', executeAfterReset)))
await expect(rejectedDuringReset).resolves.toMatchObject({ message: 'reset' })
expect(executeAfterReset).not.toHaveBeenCalled()
finish.resolve()
await expect(resetPromise).resolves.toEqual([createTaskDescriptor('running')])
await expect(runningError).resolves.toMatchObject({ message: 'reset' })
await expect(manager.enqueue(createIndexTask('after-reset', executeAfterReset))).resolves.toBeUndefined()
expect(executeAfterReset).toHaveBeenCalledOnce()
})
it('rejects a second reset with the current reset reason while reset is running', async () => {
const manager = new KnowledgeQueueManager()
const started = createDeferred()
const finish = createDeferred()
const runningPromise = manager.enqueue(
createIndexTask('running', async () => {
started.resolve()
await finish.promise
})
)
const runningError = captureError(runningPromise)
await started.promise
const resetPromise = manager.reset('first-reset')
const secondResetError = captureError(manager.reset('second-reset'))
await expect(secondResetError).resolves.toMatchObject({ message: 'first-reset' })
finish.resolve()
await expect(resetPromise).resolves.toEqual([createTaskDescriptor('running')])
await expect(runningError).resolves.toMatchObject({ message: 'first-reset' })
expect(loggerErrorMock).not.toHaveBeenCalled()
})
it('serializes writes for the same base', async () => {
const manager = new KnowledgeQueueManager()
const releaseFirstWrite = createDeferred()
const firstInWriteLock = createDeferred()
const secondStarted = createDeferred()
const events: string[] = []
const firstPromise = manager.enqueue(
createIndexTask('first', async (context) => {
await context.runWithBaseWriteLock(async () => {
events.push('lock:first')
firstInWriteLock.resolve()
await releaseFirstWrite.promise
events.push('unlock:first')
})
})
)
const secondPromise = manager.enqueue(
createIndexTask('second', async (context) => {
secondStarted.resolve()
await context.runWithBaseWriteLock(async () => {
events.push('lock:second')
})
})
)
await firstInWriteLock.promise
await secondStarted.promise
await flushPromises()
expect(events).toEqual(['lock:first'])
releaseFirstWrite.resolve()
await expect(Promise.all([firstPromise, secondPromise])).resolves.toEqual([undefined, undefined])
expect(events).toEqual(['lock:first', 'unlock:first', 'lock:second'])
})
it('does not enter the base write lock body after being interrupted while waiting', async () => {
const manager = new KnowledgeQueueManager()
const releaseFirstWrite = createDeferred()
const firstInWriteLock = createDeferred()
const secondStarted = createDeferred()
const events: string[] = []
const firstPromise = manager.enqueue(
createIndexTask('first', async (context) => {
await context.runWithBaseWriteLock(async () => {
events.push('lock:first')
firstInWriteLock.resolve()
await releaseFirstWrite.promise
events.push('unlock:first')
})
})
)
const secondPromise = manager.enqueue(
createIndexTask('second', async (context) => {
secondStarted.resolve()
await context.runWithBaseWriteLock(async () => {
events.push('lock:second')
})
})
)
const secondError = captureError(secondPromise)
await firstInWriteLock.promise
await secondStarted.promise
manager.interruptItems(['second'], 'deleted')
releaseFirstWrite.resolve()
await expect(firstPromise).resolves.toBeUndefined()
await expect(secondError).resolves.toMatchObject({ message: 'deleted' })
expect(events).toEqual(['lock:first', 'unlock:first'])
})
it('rejects failed tasks, logs unexpected errors, and continues later work', async () => {
const manager = new KnowledgeQueueManager()
const executeNext = vi.fn(async () => undefined)
const failure = new Error('execute failed')
const failedPromise = manager.enqueue(
createIndexTask('failed', async () => {
throw failure
})
)
const failedError = captureError(failedPromise)
const nextPromise = manager.enqueue(createIndexTask('next', executeNext))
await expect(failedError).resolves.toBe(failure)
await expect(nextPromise).resolves.toBeUndefined()
expect(executeNext).toHaveBeenCalledOnce()
expect(manager.getSnapshot()).toEqual({ pending: [], running: [] })
expect(loggerErrorMock).toHaveBeenCalledWith('Knowledge queue task failed unexpectedly', failure, {
baseId: BASE_ID,
itemId: 'failed',
kind: 'index-leaf'
})
})
it('logs non-interruption errors even after a task has been aborted', async () => {
const manager = new KnowledgeQueueManager()
const started = createDeferred()
const finish = createDeferred()
const failure = new Error('failed after abort')
const taskPromise = manager.enqueue(
createIndexTask('running', async (context) => {
started.resolve()
await finish.promise
if (context.signal.aborted) {
throw failure
}
})
)
const taskError = captureError(taskPromise)
await started.promise
manager.interruptItems(['running'], 'deleted')
finish.resolve()
await expect(taskError).resolves.toBe(failure)
expect(loggerErrorMock).toHaveBeenCalledWith('Knowledge queue task failed unexpectedly', failure, {
baseId: BASE_ID,
itemId: 'running',
kind: 'index-leaf'
})
})
})
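
A note on the deduplication test above: `enqueue` is expected to hand back the same promise when the same item is enqueued twice, and to run `execute` only once. The sketch below shows one way that contract could be met. It is an illustration under assumptions: `DedupingQueue` and `pendingCount` are hypothetical names, and concurrency limits, interruption, and the base write lock are left out. It is not the `KnowledgeQueueManager` implementation, which this part of the diff does not show.

```typescript
// Hypothetical illustration only; not the real KnowledgeQueueManager.
class DedupingQueue {
  // One in-flight promise per item id.
  private readonly inFlight = new Map<string, Promise<void>>()

  enqueue(itemId: string, execute: () => Promise<void>): Promise<void> {
    const existing = this.inFlight.get(itemId)
    if (existing) {
      // The item is already queued or running: share its promise.
      return existing
    }

    // Forget the entry once it settles so the item can be enqueued again.
    const cleanup = (): void => {
      if (this.inFlight.get(itemId) === promise) {
        this.inFlight.delete(itemId)
      }
    }

    const promise: Promise<void> = Promise.resolve()
      .then(execute)
      .then(
        () => cleanup(),
        (error: unknown) => {
          cleanup()
          throw error
        }
      )

    this.inFlight.set(itemId, promise)
    return promise
  }

  pendingCount(): number {
    return this.inFlight.size
  }
}

const queue = new DedupingQueue()
let executions = 0
const first = queue.enqueue('item-1', async () => {
  executions += 1
})
const second = queue.enqueue('item-1', async () => {
  executions += 1
})
const sharesPromise = first === second
```

Because the entry is removed once the promise settles, a later `enqueue` for the same id would start fresh work instead of returning a stale result.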


@@ -0,0 +1,163 @@
/**
* Type-safety regression tests for knowledge queue task entries.
*
* This file is typechecked by `pnpm typecheck:node`; every `@ts-expect-error`
* directive asserts an invalid queue task shape that must stay rejected.
*/
import type { KnowledgeBase, KnowledgeItem, KnowledgeItemOf } from '@shared/data/types/knowledge'
import type { EnqueueKnowledgeTaskOptions } from '../types'
const base: KnowledgeBase = {
id: 'base-1',
name: 'Base',
groupId: null,
emoji: '📁',
dimensions: 1024,
embeddingModelId: 'ollama::nomic-embed-text',
status: 'completed',
error: null,
chunkSize: 1024,
chunkOverlap: 200,
searchMode: 'hybrid',
createdAt: '2026-04-08T00:00:00.000Z',
updatedAt: '2026-04-08T00:00:00.000Z'
}
const lifecycle = {
status: 'processing',
phase: null,
error: null
} as const satisfies Pick<KnowledgeItem, 'status' | 'phase' | 'error'>
const noteItem: KnowledgeItemOf<'note'> = {
id: 'note-1',
baseId: base.id,
groupId: null,
type: 'note',
data: { source: 'note-1', content: 'hello note-1' },
...lifecycle,
createdAt: '2026-04-08T00:00:00.000Z',
updatedAt: '2026-04-08T00:00:00.000Z'
}
const fileItem: KnowledgeItemOf<'file'> = {
id: 'file-1',
baseId: base.id,
groupId: null,
type: 'file',
data: {
source: 'file-1',
file: {
id: 'file-1',
name: 'file.md',
origin_name: 'file.md',
path: '/tmp/file.md',
size: 1,
ext: '.md',
type: 'text',
created_at: '2026-04-08T00:00:00.000Z',
count: 1
}
},
...lifecycle,
createdAt: '2026-04-08T00:00:00.000Z',
updatedAt: '2026-04-08T00:00:00.000Z'
}
const urlItem: KnowledgeItemOf<'url'> = {
id: 'url-1',
baseId: base.id,
groupId: null,
type: 'url',
data: { source: 'url-1', url: 'https://example.com' },
...lifecycle,
createdAt: '2026-04-08T00:00:00.000Z',
updatedAt: '2026-04-08T00:00:00.000Z'
}
const directoryItem: KnowledgeItemOf<'directory'> = {
id: 'dir-1',
baseId: base.id,
groupId: null,
type: 'directory',
data: { source: '/tmp/docs', path: '/tmp/docs' },
...lifecycle,
createdAt: '2026-04-08T00:00:00.000Z',
updatedAt: '2026-04-08T00:00:00.000Z'
}
const sitemapItem: KnowledgeItemOf<'sitemap'> = {
id: 'sitemap-1',
baseId: base.id,
groupId: null,
type: 'sitemap',
data: { source: 'https://example.com/sitemap.xml', url: 'https://example.com/sitemap.xml' },
...lifecycle,
createdAt: '2026-04-08T00:00:00.000Z',
updatedAt: '2026-04-08T00:00:00.000Z'
}
const ok = async (): Promise<void> => undefined
const validTasks = [
{
base,
item: noteItem,
kind: 'index-leaf',
execute: ok
},
{
base,
item: fileItem,
kind: 'index-leaf',
execute: ok
},
{
base,
item: urlItem,
kind: 'index-leaf',
execute: ok
},
{
base,
item: directoryItem,
kind: 'prepare-root',
execute: ok
},
{
base,
item: sitemapItem,
kind: 'prepare-root',
execute: ok
}
] satisfies EnqueueKnowledgeTaskOptions[]
void validTasks
// @ts-expect-error - sitemap roots must be prepared before leaf indexing.
const _indexSitemap: EnqueueKnowledgeTaskOptions = {
base,
item: sitemapItem,
kind: 'index-leaf',
execute: ok
}
void _indexSitemap
// @ts-expect-error - note leaf items cannot be prepared as roots.
const _prepareNote: EnqueueKnowledgeTaskOptions = {
base,
item: noteItem,
kind: 'prepare-root',
execute: ok
}
void _prepareNote
const _rawItemId: EnqueueKnowledgeTaskOptions = {
base,
// @ts-expect-error - public enqueue entries must carry the typed item, not a raw id.
itemId: sitemapItem.id,
kind: 'index-leaf',
execute: ok
}
void _rawItemId
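
The type-safety file above relies on `@ts-expect-error`: the directive makes the typecheck fail if the following line stops being a type error, so an accidental loosening of the task union is caught at compile time. A reduced sketch of the same pattern, assuming simplified stand-in types (`LeafItem`, `RootItem`, and `Task` are hypothetical and much smaller than the real `EnqueueKnowledgeTaskOptions`):

```typescript
// Simplified stand-ins for the real item and task types (assumptions).
type LeafItem = { id: string; type: 'note' | 'file' | 'url' }
type RootItem = { id: string; type: 'directory' | 'sitemap' }

type Task = { kind: 'index-leaf'; item: LeafItem } | { kind: 'prepare-root'; item: RootItem }

const sitemap: RootItem = { id: 'sitemap-1', type: 'sitemap' }

// Valid pairing: compiles without any directive.
const prepareSitemap: Task = { kind: 'prepare-root', item: sitemap }
const acceptedKind: string = prepareSitemap.kind

// @ts-expect-error - a root item must not be accepted by an 'index-leaf' task.
const indexSitemap: Task = { kind: 'index-leaf', item: sitemap }
void indexSitemap
```

The directive only has an effect when the file is actually typechecked; a transpile-only run would execute both assignments without complaint.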


@@ -0,0 +1,51 @@
import type { KnowledgeBase, KnowledgeItemOf, KnowledgeItemType } from '@shared/data/types/knowledge'
import type { IndexableKnowledgeItem } from '../types/items'
interface KnowledgeQueueBaseTaskEntry<TItem> {
base: KnowledgeBase
item: TItem
}
export interface IndexLeafTaskEntry extends KnowledgeQueueBaseTaskEntry<IndexableKnowledgeItem> {
kind: 'index-leaf'
}
export interface PrepareRootTaskEntry
extends KnowledgeQueueBaseTaskEntry<KnowledgeItemOf<'directory'> | KnowledgeItemOf<'sitemap'>> {
kind: 'prepare-root'
}
export type KnowledgeQueueTaskEntry = IndexLeafTaskEntry | PrepareRootTaskEntry
export type KnowledgeQueueTaskContext<TEntry extends KnowledgeQueueTaskEntry = KnowledgeQueueTaskEntry> =
TEntry extends KnowledgeQueueTaskEntry
? TEntry & {
baseId: string
itemId: string
itemType: TEntry['item']['type']
/** Interruption waits for running work to observe this signal and settle. */
signal: AbortSignal
runWithBaseWriteLock<T>(task: () => Promise<T>): Promise<T>
}
: never
export type EnqueueKnowledgeTaskOptions<TEntry extends KnowledgeQueueTaskEntry = KnowledgeQueueTaskEntry> =
TEntry extends KnowledgeQueueTaskEntry
? TEntry & {
execute: (context: KnowledgeQueueTaskContext<TEntry>) => Promise<void>
}
: never
export interface KnowledgeQueueTaskDescriptor {
base: KnowledgeBase
baseId: string
itemId: string
itemType: KnowledgeItemType
kind: KnowledgeQueueTaskEntry['kind']
}
export interface KnowledgeQueueSnapshot {
pending: KnowledgeQueueTaskDescriptor[]
running: KnowledgeQueueTaskDescriptor[]
}
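
The `TEntry extends KnowledgeQueueTaskEntry ? ... : never` shape in the types above is a distributive conditional type: when `TEntry` is a union, the condition is applied to each member separately, so every task kind gets an `execute` typed for its own entry. A small sketch of that behaviour, using hypothetical stand-ins (`LeafEntry`, `RootEntry`, `WithExecute`, `runAll`) and a string return value purely so the result can be inspected:

```typescript
// Simplified stand-ins for the queue entry types (assumptions, not the real ones).
type LeafEntry = { kind: 'index-leaf'; item: { type: 'note' | 'file' | 'url' } }
type RootEntry = { kind: 'prepare-root'; item: { type: 'directory' | 'sitemap' } }
type Entry = LeafEntry | RootEntry

// `T extends Entry ? ... : never` distributes over the union, so each member
// gets an `execute` whose context is typed for that member only.
type WithExecute<T extends Entry = Entry> = T extends Entry
  ? T & { execute: (context: T) => string }
  : never

const tasks: WithExecute[] = [
  {
    kind: 'index-leaf',
    item: { type: 'note' },
    execute: (context: LeafEntry) => `leaf:${context.item.type}`
  },
  {
    kind: 'prepare-root',
    item: { type: 'sitemap' },
    execute: (context: RootEntry) => `root:${context.item.type}`
  }
]

function runAll(entries: WithExecute[]): string[] {
  return entries.map((entry) =>
    // Narrowing on `kind` selects the matching `execute` signature.
    entry.kind === 'index-leaf' ? entry.execute(entry) : entry.execute(entry)
  )
}

const results = runAll(tasks)
```

Without the conditional wrapper, `Entry & { execute: (context: Entry) => string }` would hand every callback the full union and force each one to narrow by hand.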


@@ -1,7 +1,7 @@
 import { getFileExt } from '@main/utils/file'
 import type { FileMetadata } from '@shared/data/types/file'
-import type { KnowledgeItemOf } from '@shared/data/types/knowledge'
-import { type Document, type FileReader as VectorStoreFileReader } from '@vectorstores/core'
+import type { KnowledgeItemOf, KnowledgeSourceMetadata } from '@shared/data/types/knowledge'
+import { Document, type FileReader as VectorStoreFileReader } from '@vectorstores/core'
 import { CSVReader } from '@vectorstores/readers/csv'
 import { DocxReader } from '@vectorstores/readers/docx'
 import { JSONReader } from '@vectorstores/readers/json'
@@ -42,5 +42,16 @@ export async function loadFileDocuments(item: KnowledgeItemOf<'file'>): Promise<
 }
 const reader = createSupportedFileReader(file)
-return await reader.loadData(file.path)
+const documents = await reader.loadData(file.path)
+const sourceMetadata: KnowledgeSourceMetadata = {
+source: item.data.source
+}
+return documents.map(
+(document) =>
+new Document({
+text: document.text,
+metadata: { ...sourceMetadata }
+})
+)
 }
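
The change to `loadFileDocuments` above keeps each document's text but discards whatever metadata the underlying reader attached, leaving only `source`. A self-contained sketch of that normalization step, assuming a hypothetical `LoadedDocument` shape and `withSourceMetadata` helper in place of the `Document` class from `@vectorstores/core`:

```typescript
// Minimal stand-in for a loaded document (assumption; the real class comes from @vectorstores/core).
interface LoadedDocument {
  text: string
  metadata: Record<string, unknown>
}

// Keep the text and replace reader-specific metadata with `{ source }` only.
function withSourceMetadata(documents: LoadedDocument[], source: string): LoadedDocument[] {
  return documents.map((document) => ({
    text: document.text,
    metadata: { source }
  }))
}

const normalized = withSourceMetadata(
  [{ text: 'file content', metadata: { page: 1, reader: 'text' } }],
  '/tmp/original.txt'
)
const normalizedMetadata = normalized.map((document) => JSON.stringify(document.metadata))
```

Building a fresh metadata object per document, as the diff does with `{ ...sourceMetadata }`, avoids several documents sharing one mutable object.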


@@ -6,9 +6,7 @@ export async function loadNoteDocuments(item: KnowledgeItemOf<'note'>): Promise<
 new Document({
 text: item.data.content,
 metadata: {
-itemId: item.id,
-itemType: item.type,
-sourceUrl: item.data.sourceUrl
+source: item.data.source
 }
 })
 ]


@@ -14,8 +14,7 @@ export async function loadUrlDocuments(
 if (!markdown) {
 logger.warn('Knowledge URL reader received empty markdown', {
 itemId: item.id,
-sourceUrl: item.data.url,
-name: item.data.name
+sourceUrl: item.data.source
 })
 throw new Error(`Knowledge URL returned empty markdown: ${item.data.url}`)
 }
@@ -24,10 +23,7 @@ export async function loadUrlDocuments(
 new Document({
 text: markdown,
 metadata: {
-itemId: item.id,
-itemType: item.type,
-sourceUrl: item.data.url,
-name: item.data.name
+source: item.data.source
 }
 })
 ]


@@ -0,0 +1,134 @@
import { describe, expect, it, vi } from 'vitest'
const loadDataMock = vi.hoisted(() => vi.fn())
vi.mock('@main/utils/file', () => ({
getFileExt: (path: string) => path.slice(path.lastIndexOf('.'))
}))
vi.mock('@vectorstores/readers/text', async () => {
const { Document } = await import('@vectorstores/core')
return {
TextFileReader: class MockTextFileReader {
loadData = loadDataMock.mockResolvedValue([
new Document({
text: 'file content',
metadata: { page: 1 }
})
])
}
}
})
vi.mock('@vectorstores/readers/csv', () => ({ CSVReader: class MockCSVReader {} }))
vi.mock('@vectorstores/readers/docx', () => ({ DocxReader: class MockDocxReader {} }))
vi.mock('@vectorstores/readers/json', () => ({ JSONReader: class MockJSONReader {} }))
vi.mock('@vectorstores/readers/markdown', () => ({ MarkdownReader: class MockMarkdownReader {} }))
vi.mock('@vectorstores/readers/pdf', () => ({ PDFReader: class MockPDFReader {} }))
vi.mock('../files/DraftsExportReader', () => ({ DraftsExportReader: class MockDraftsExportReader {} }))
vi.mock('../files/EpubReader', () => ({ EpubReader: class MockEpubReader {} }))
vi.mock('../../utils/url', () => ({
fetchKnowledgeWebPage: vi.fn().mockResolvedValue('url content')
}))
const { loadFileDocuments } = await import('../KnowledgeFileReader')
const { loadNoteDocuments } = await import('../KnowledgeNoteReader')
const { loadUrlDocuments } = await import('../KnowledgeUrlReader')
describe('knowledge reader metadata', () => {
it('normalizes file source metadata', async () => {
const documents = await loadFileDocuments({
id: 'file-item-1',
baseId: 'kb-1',
groupId: null,
type: 'file',
data: {
source: '/tmp/original.txt',
file: {
id: 'file-1',
name: 'stored.txt',
origin_name: 'Original.txt',
path: '/tmp/original.txt',
size: 12,
ext: 'txt',
type: 'text',
created_at: '2026-04-08T00:00:00.000Z',
count: 1
}
},
status: 'idle',
phase: null,
error: null,
createdAt: '2026-04-08T00:00:00.000Z',
updatedAt: '2026-04-08T00:00:00.000Z'
})
expect(documents[0]?.metadata).toEqual({
source: '/tmp/original.txt'
})
})
it('normalizes url source metadata', async () => {
const documents = await loadUrlDocuments({
id: 'url-item-1',
baseId: 'kb-1',
groupId: null,
type: 'url',
data: { source: 'https://example.com', url: 'https://example.com' },
status: 'idle',
phase: null,
error: null,
createdAt: '2026-04-08T00:00:00.000Z',
updatedAt: '2026-04-08T00:00:00.000Z'
})
expect(documents[0]?.metadata).toEqual({
source: 'https://example.com'
})
})
it('uses note sourceUrl as source metadata', async () => {
const documents = await loadNoteDocuments({
id: 'note-item-1',
baseId: 'kb-1',
groupId: null,
type: 'note',
data: {
source: 'https://example.com/note',
content: '\n Note title\nbody',
sourceUrl: 'https://example.com/note'
},
status: 'idle',
phase: null,
error: null,
createdAt: '2026-04-08T00:00:00.000Z',
updatedAt: '2026-04-08T00:00:00.000Z'
})
expect(documents[0]?.metadata).toEqual({
source: 'https://example.com/note'
})
})
it('uses note source as source metadata when content is blank', async () => {
const documents = await loadNoteDocuments({
id: 'note-item-1',
baseId: 'kb-1',
groupId: null,
type: 'note',
data: { source: 'note-item-1', content: ' ' },
status: 'idle',
phase: null,
error: null,
createdAt: '2026-04-08T00:00:00.000Z',
updatedAt: '2026-04-08T00:00:00.000Z'
})
expect(documents[0]?.metadata).toEqual({
source: 'note-item-1'
})
})
})


@@ -82,6 +82,7 @@ vi.mock('../files/DraftsExportReader', () => ({
 createdAt: '2026-04-03T00:00:00.000Z',
 updatedAt: '2026-04-03T00:00:00.000Z',
 data: {
+source: filePath,
 file: {
 id: 'file-1',
 name: filePath.split('/').pop() || filePath,
@@ -111,6 +112,7 @@ vi.mock('../files/EpubReader', () => ({
 createdAt: '2026-04-03T00:00:00.000Z',
 updatedAt: '2026-04-03T00:00:00.000Z',
 data: {
+source: filePath,
 file: {
 id: 'file-1',
 name: filePath.split('/').pop() || filePath,
@@ -136,10 +138,12 @@ function createFileItem(ext: string, filePath?: string): KnowledgeItemOf<'file'>
 groupId: null,
 type: 'file',
 status: 'idle',
+phase: null,
 error: null,
 createdAt: '2026-04-03T00:00:00.000Z',
 updatedAt: '2026-04-03T00:00:00.000Z',
 data: {
+source: filePath ?? `/tmp/sample${ext}`,
 file: {
 id: 'file-1',
 name: `sample${ext}`,
@@ -162,10 +166,12 @@ function createNoteItem(content: string, sourceUrl?: string): KnowledgeItemOf<'n
 groupId: null,
 type: 'note',
 status: 'idle',
+phase: null,
 error: null,
 createdAt: '2026-04-03T00:00:00.000Z',
 updatedAt: '2026-04-03T00:00:00.000Z',
 data: {
+source: sourceUrl ?? 'note-1',
 content,
 sourceUrl
 }
@@ -179,12 +185,13 @@ function createUrlItem(): KnowledgeItemOf<'url'> {
 groupId: null,
 type: 'url',
 status: 'idle',
+phase: null,
 error: null,
 createdAt: '2026-04-03T00:00:00.000Z',
 updatedAt: '2026-04-03T00:00:00.000Z',
 data: {
-url: 'https://example.com',
-name: 'Example'
+source: 'https://example.com',
+url: 'https://example.com'
 }
 }
 }
@@ -196,12 +203,13 @@ function createSitemapItem(): KnowledgeItemOf<'sitemap'> {
 groupId: null,
 type: 'sitemap',
 status: 'idle',
+phase: null,
 error: null,
 createdAt: '2026-04-03T00:00:00.000Z',
 updatedAt: '2026-04-03T00:00:00.000Z',
 data: {
-url: 'https://example.com/sitemap.xml',
-name: 'Example Sitemap'
+source: 'https://example.com/sitemap.xml',
+url: 'https://example.com/sitemap.xml'
 }
 }
 }
@@ -213,11 +221,12 @@ function createDirectoryItem(): KnowledgeItemOf<'directory'> {
 groupId: null,
 type: 'directory',
 status: 'idle',
+phase: null,
 error: null,
 createdAt: '2026-04-03T00:00:00.000Z',
 updatedAt: '2026-04-03T00:00:00.000Z',
 data: {
-name: 'example-directory',
+source: '/tmp/example-directory',
 path: '/tmp/example-directory'
 }
 }
@@ -239,10 +248,10 @@ describe('loadKnowledgeItemDocuments', () => {
 const item = createFileItem(ext)
 const docs = await loadKnowledgeItemDocuments(item)
+expect(readerSpies[expectedReader as keyof typeof readerSpies]).toHaveBeenCalledWith(`/tmp/sample${ext}`)
 expect(docs[0]).toMatchObject({
 metadata: {
-reader: expectedReader,
-filePath: `/tmp/sample${ext}`
+source: `/tmp/sample${ext}`
 }
 })
 })
@@ -253,8 +262,7 @@ describe('loadKnowledgeItemDocuments', () => {
 expect(docs[0]).toMatchObject({
 metadata: {
-reader: 'text',
-filePath: '/tmp/sample.log'
+source: '/tmp/sample.log'
 }
 })
 })
@@ -267,8 +275,7 @@ describe('loadKnowledgeItemDocuments', () => {
 expect(customReaderSpies.drafts).toHaveBeenCalled()
 expect(docs[0]).toMatchObject({
 metadata: {
-reader: 'drafts',
-itemId: 'item-1'
+source: '/tmp/sample.draftsexport'
 }
 })
 })
@@ -281,8 +288,7 @@ describe('loadKnowledgeItemDocuments', () => {
 expect(customReaderSpies.epub).toHaveBeenCalled()
 expect(docs[0]).toMatchObject({
 metadata: {
-reader: 'epub',
-itemId: 'item-1'
+source: '/tmp/sample.epub'
 }
 })
 })
@@ -301,9 +307,7 @@ describe('loadKnowledgeItemDocuments', () => {
 expect(docs[0]).toMatchObject({
 text: 'hello world',
 metadata: {
-itemId: 'note-1',
-itemType: 'note',
-sourceUrl: 'https://example.com/note'
+source: 'https://example.com/note'
 }
 })
 })
@@ -328,10 +332,7 @@ describe('loadKnowledgeItemDocuments', () => {
 expect(docs[0]).toMatchObject({
 text: '# Example Page\n\nHello knowledge',
 metadata: {
-itemId: 'url-1',
-itemType: 'url',
-sourceUrl: 'https://example.com',
-name: 'Example'
+source: 'https://example.com'
 }
 })
 })
@@ -346,8 +347,7 @@ describe('loadKnowledgeItemDocuments', () => {
 )
 expect(loggerWarnMock).toHaveBeenCalledWith('Knowledge URL reader received empty markdown', {
 itemId: 'url-1',
-sourceUrl: 'https://example.com',
-name: 'Example'
+sourceUrl: 'https://example.com'
 })
 })


@@ -20,7 +20,7 @@ export class EpubReader extends FileReader<Document<Metadata>> {
 const documents: Document<Metadata>[] = []
 const failedChapterIds: string[] = []
-for (const [index, chapter] of chapters.entries()) {
+for (const chapter of chapters) {
 try {
 const content = await epub.getChapter(chapter.id)
 const text = stripHtml(content)
@@ -31,16 +31,7 @@ export class EpubReader extends FileReader<Document<Metadata>> {
 documents.push(
 new Document({
-text,
-metadata: {
-source: filename,
-title: epub.metadata.title || filename || '',
-creator: epub.metadata.creator || '',
-language: epub.metadata.language || '',
-chapterId: chapter.id,
-chapterTitle: chapter.title || `Chapter ${index + 1}`,
-chapterOrder: index + 1
-}
+text
 })
 )
 } catch (error) {


@@ -57,15 +57,7 @@ describe('EpubReader', () => {
 expect(docs).toHaveLength(2)
 expect(docs[0]?.text).toBe('chapter-1 content')
-expect(docs[0]?.metadata).toMatchObject({
-source: 'book.epub',
-title: 'Test EPUB',
-creator: 'Author',
-language: 'en',
-chapterId: 'chapter-1',
-chapterTitle: 'Chapter 1',
-chapterOrder: 1
-})
+expect(docs[0]?.metadata).toEqual({})
 expect(loggerErrorMock).not.toHaveBeenCalled()
 })


@@ -1,4 +1,9 @@
-import type { KnowledgeBase, KnowledgeSearchResult } from '@shared/data/types/knowledge'
+import {
+DEFAULT_KNOWLEDGE_BASE_CHUNK_OVERLAP,
+DEFAULT_KNOWLEDGE_BASE_CHUNK_SIZE,
+type KnowledgeBase,
+type KnowledgeSearchResult
+} from '@shared/data/types/knowledge'
 import { beforeEach, describe, expect, it, vi } from 'vitest'
 const fetchMock = vi.hoisted(() => vi.fn())
@@ -24,14 +29,23 @@ const { getRerankAdapter } = await import('../adapters')
 const { executeRerankRequest, rerankKnowledgeSearchResults, resolveRerankRuntime } = await import('../rerank')
 function createKnowledgeBase(overrides: Partial<KnowledgeBase> = {}): KnowledgeBase {
+const now = new Date().toISOString()
 return {
-id: 'kb-1',
-name: 'Knowledge Base',
-dimensions: 1024,
-embeddingModelId: 'ollama::nomic-embed-text',
-createdAt: new Date().toISOString(),
-updatedAt: new Date().toISOString(),
-...overrides
+...overrides,
+id: overrides.id ?? 'kb-1',
+name: overrides.name ?? 'Knowledge Base',
+groupId: overrides.groupId ?? null,
+emoji: overrides.emoji ?? '📁',
+dimensions: overrides.dimensions ?? 1024,
+embeddingModelId: overrides.embeddingModelId ?? 'ollama::nomic-embed-text',
+status: overrides.status ?? 'completed',
+error: overrides.error ?? null,
+chunkSize: overrides.chunkSize ?? DEFAULT_KNOWLEDGE_BASE_CHUNK_SIZE,
+chunkOverlap: overrides.chunkOverlap ?? DEFAULT_KNOWLEDGE_BASE_CHUNK_OVERLAP,
+searchMode: overrides.searchMode ?? 'hybrid',
+createdAt: overrides.createdAt ?? now,
+updatedAt: overrides.updatedAt ?? now
 }
 }
@@ -40,13 +54,25 @@ function createSearchResults(): KnowledgeSearchResult[] {
 {
 pageContent: 'alpha',
 score: 0.1,
-metadata: { type: 'text' },
+metadata: {
+itemId: 'item-1',
+itemType: 'note',
+source: 'note-1',
+chunkIndex: 0,
+tokenCount: 1
+},
 chunkId: 'chunk-1'
 },
 {
 pageContent: 'beta',
 score: 0.2,
-metadata: { type: 'text' },
+metadata: {
+itemId: 'item-2',
+itemType: 'note',
+source: 'note-2',
+chunkIndex: 1,
+tokenCount: 1
+},
 chunkId: 'chunk-2'
 }
 ]
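
The rewritten `createKnowledgeBase` fixture spreads `overrides` first and then assigns each field as `overrides.x ?? default`, where the old version listed defaults and spread `overrides` last. One observable difference is how an override that is explicitly `undefined` behaves. The sketch below contrasts the two styles on a hypothetical two-field `Settings` type; the names `withSpread` and `withNullish` are illustrative and do not appear in the codebase.

```typescript
// Hypothetical reduced shape; the real fixture builds a full KnowledgeBase.
interface Settings {
  name: string
  chunkSize: number
}

const DEFAULTS: Settings = { name: 'Knowledge Base', chunkSize: 1024 }

// Spread-last: an explicit `undefined` override replaces the default at runtime.
function withSpread(overrides: Partial<Settings> = {}): Settings {
  return { ...DEFAULTS, ...overrides } as Settings
}

// Per-field `??`: `undefined` (or `null`) overrides fall back to the default.
function withNullish(overrides: Partial<Settings> = {}): Settings {
  return {
    name: overrides.name ?? DEFAULTS.name,
    chunkSize: overrides.chunkSize ?? DEFAULTS.chunkSize
  }
}

const spreadResult = withSpread({ chunkSize: undefined })
const nullishResult = withNullish({ chunkSize: undefined })
```

With the per-field form every required property ends up defined, which matters once the type gains required fields such as `status` or `chunkSize`.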


@@ -1,146 +0,0 @@
import type { KnowledgeBase, KnowledgeItem } from '@shared/data/types/knowledge'
import PQueue from 'p-queue'
export interface AddTaskEntry {
base: KnowledgeBase
item: KnowledgeItem
}
export interface AddTaskContext extends AddTaskEntry {
controller: AbortController
interruptedBy?: 'delete' | 'stop'
}
type QueueEntry = AddTaskContext & {
status: 'pending' | 'running'
promise: Promise<void>
}
export class KnowledgeAddQueue {
private readonly concurrency: number
private readonly executeAdd: (entry: AddTaskContext) => Promise<void>
private queue: PQueue
private entries = new Map<string, QueueEntry>()
constructor(concurrency: number, executeAdd: (entry: AddTaskContext) => Promise<void>) {
this.concurrency = concurrency
this.executeAdd = executeAdd
this.queue = this.createQueue()
}
reset(): void {
this.queue.clear()
this.queue = this.createQueue()
this.entries.clear()
}
enqueue(base: KnowledgeBase, item: KnowledgeItem): Promise<void> {
const existingEntry = this.entries.get(item.id)
if (existingEntry) {
return existingEntry.promise
}
const entry = this.createEntry(base, item)
this.entries.set(item.id, entry)
this.schedule(entry)
return entry.promise
}
interrupt(itemIds: string[], interruptedBy: 'delete' | 'stop', reason: string): AddTaskEntry[] {
const interruptedEntries = this.getEntriesByIds(itemIds)
for (const entry of interruptedEntries) {
if (entry.status === 'pending') {
entry.controller.abort(reason)
this.deleteEntry(entry)
continue
}
entry.interruptedBy = interruptedBy
entry.controller.abort(reason)
}
return interruptedEntries
}
interruptBase(baseId: string, interruptedBy: 'delete' | 'stop', reason: string): AddTaskEntry[] {
const itemIds = this.getEntriesForBase(baseId).map((entry) => entry.item.id)
return this.interrupt(itemIds, interruptedBy, reason)
}
interruptAll(interruptedBy: 'delete' | 'stop', reason: string): AddTaskEntry[] {
return this.interrupt([...this.entries.keys()], interruptedBy, reason)
}
async waitForRunning(itemIds: string[]): Promise<void> {
const executions = this.getEntriesByIds(itemIds)
.filter((entry): entry is QueueEntry & { status: 'running' } => entry.status === 'running')
.map((entry) => entry.promise)
if (executions.length === 0) {
return
}
await Promise.allSettled(executions)
}
private createQueue(): PQueue {
return new PQueue({
concurrency: this.concurrency
})
}
private createEntry(base: KnowledgeBase, item: KnowledgeItem): QueueEntry {
const controller = new AbortController()
return {
base,
item,
promise: Promise.resolve(),
controller,
status: 'pending' as const,
interruptedBy: undefined
}
}
private schedule(entry: QueueEntry): void {
entry.promise = this.queue
.add(
async () => {
if (this.entries.get(entry.item.id) !== entry) {
return
}
entry.status = 'running'
await this.executeAdd(entry)
},
{ signal: entry.controller.signal }
)
.finally(() => {
this.deleteEntry(entry)
})
}
private getEntriesByIds(itemIds: string[]): QueueEntry[] {
const entries = new Map<string, QueueEntry>()
for (const itemId of new Set(itemIds)) {
const entry = this.entries.get(itemId)
if (entry) {
entries.set(itemId, entry)
}
}
return [...entries.values()]
}
private getEntriesForBase(baseId: string): QueueEntry[] {
return [...this.entries.values()].filter((entry) => entry.base.id === baseId)
}
private deleteEntry(entry: QueueEntry): void {
if (this.entries.get(entry.item.id) === entry) {
this.entries.delete(entry.item.id)
}
}
}
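The queue above keys its entries by item id and only deletes an entry when the map still points at that exact object. A minimal sketch of that identity guard follows. `EntryRegistry`, `Entry`, and the `abortReason` field are hypothetical names for illustration only; `abortReason` stands in for the real `AbortController`, and no queue library is involved.

```typescript
// Hypothetical, simplified entry: `abortReason` stands in for an AbortController.
type Entry = {
  itemId: string
  status: 'pending' | 'running'
  abortReason: string | null
}

class EntryRegistry {
  private readonly entries = new Map<string, Entry>()

  // Register a fresh entry for an item, replacing any previous entry with the same id.
  register(itemId: string): Entry {
    const entry: Entry = { itemId, status: 'pending', abortReason: null }
    this.entries.set(itemId, entry)
    return entry
  }

  has(itemId: string): boolean {
    return this.entries.has(itemId)
  }

  // Resolve ids to live entries; duplicate and unknown ids are dropped.
  getByIds(itemIds: string[]): Entry[] {
    const found: Entry[] = []
    Array.from(new Set(itemIds)).forEach((itemId) => {
      const entry = this.entries.get(itemId)
      if (entry) found.push(entry)
    })
    return found
  }

  // Remove the entry only if the map still holds this exact object.
  deleteIfCurrent(entry: Entry): boolean {
    if (this.entries.get(entry.itemId) !== entry) return false
    return this.entries.delete(entry.itemId)
  }

  // Mark matching entries as aborted. Pending entries are removed at once;
  // running entries stay registered until their own cleanup runs.
  interrupt(itemIds: string[], reason: string): Entry[] {
    const interrupted = this.getByIds(itemIds)
    for (const entry of interrupted) {
      entry.abortReason = reason
      if (entry.status === 'pending') this.deleteIfCurrent(entry)
    }
    return interrupted
  }
}
```

The identity check matters when an item is re-enqueued: a late cleanup for the stale entry then leaves the newer entry in place.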


@@ -1,123 +0,0 @@
import { application } from '@application'
import { knowledgeItemService } from '@data/services/KnowledgeItemService'
import { loggerService } from '@logger'
import type { KnowledgeBase, KnowledgeItem } from '@shared/data/types/knowledge'
import type { BaseVectorStore } from '@vectorstores/core'
import { loadKnowledgeItemDocuments } from '../readers/KnowledgeReader'
import { chunkDocuments } from '../utils/chunk'
import { embedDocuments } from '../utils/embed'
import { getEmbedModel } from '../utils/model'
import type { AddTaskContext } from './KnowledgeAddQueue'
import {
DELETE_INTERRUPTED_REASON,
runAbortable,
type RuntimeTaskContext,
SHUTDOWN_INTERRUPTED_REASON
} from './utils/taskRuntime'
const logger = loggerService.withContext('KnowledgeAddRuntime')
const CONTAINER_ITEM_INDEXING_UNSUPPORTED_REASON =
'Container knowledge items must be expanded into child items before indexing'
export class KnowledgeAddRuntime {
constructor(private readonly isStopping: () => boolean) {}
async executeAdd(entry: AddTaskContext): Promise<void> {
const { base, item, controller } = entry
const ctx: RuntimeTaskContext = {
itemId: item.id,
signal: controller.signal
}
let vectorStore: BaseVectorStore | null = null
try {
await runAbortable(this.isStopping, ctx, () =>
knowledgeItemService.update(item.id, {
status: 'pending',
error: null
})
)
const nodes = await this.indexItem(ctx, base, item)
const vectorStoreService = application.get('KnowledgeVectorStoreService')
vectorStore = await runAbortable(this.isStopping, ctx, () => vectorStoreService.createStore(base))
const activeVectorStore = vectorStore
await runAbortable(this.isStopping, ctx, () => activeVectorStore.add(nodes))
await runAbortable(this.isStopping, ctx, () =>
knowledgeItemService.update(item.id, {
status: 'completed',
error: null
})
)
} catch (error) {
const normalizedError = error instanceof Error ? error : new Error(String(error))
if (
entry.interruptedBy ||
normalizedError.message === DELETE_INTERRUPTED_REASON ||
normalizedError.message === SHUTDOWN_INTERRUPTED_REASON
) {
throw normalizedError
}
throw await this.handleAddItemFailure(base, item, vectorStore, normalizedError)
}
}
private async indexItem(ctx: RuntimeTaskContext, base: KnowledgeBase, item: KnowledgeItem) {
if (item.type === 'directory' || item.type === 'sitemap') {
throw new Error(CONTAINER_ITEM_INDEXING_UNSUPPORTED_REASON)
}
const embeddingModel = getEmbedModel(base)
const documents = await runAbortable(this.isStopping, ctx, () => loadKnowledgeItemDocuments(item, ctx.signal))
const chunks = await runAbortable(this.isStopping, ctx, () => chunkDocuments(base, item, documents))
return await runAbortable(this.isStopping, ctx, () => embedDocuments(embeddingModel, chunks, ctx.signal))
}
private async handleAddItemFailure(
base: KnowledgeBase,
item: KnowledgeItem,
vectorStore: BaseVectorStore | null,
error: Error
): Promise<Error> {
logger.error('Failed to add knowledge item', error, {
baseId: base.id,
itemId: item.id,
itemType: item.type
})
try {
await knowledgeItemService.update(item.id, {
status: 'failed',
error: error.message
})
} catch (persistError) {
logger.error(
'Failed to persist knowledge item failure state',
persistError instanceof Error ? persistError : new Error(String(persistError)),
{
baseId: base.id,
itemId: item.id,
itemType: item.type,
originalError: error.message
}
)
}
if (vectorStore) {
try {
await vectorStore.delete(item.id)
} catch (cleanupError) {
logger.warn('Failed to cleanup knowledge item vectors after add failure', {
baseId: base.id,
itemId: item.id,
cleanupError: cleanupError instanceof Error ? cleanupError.message : String(cleanupError)
})
}
}
return error
}
}
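The `catch` block in `executeAdd` first normalizes whatever was thrown into an `Error`, then decides whether it is an expected interruption or a real failure that should be persisted. The sketch below isolates that decision. `normalizeError` and `classifyFailure` are hypothetical helper names, and the two reason strings are assumed to match the constants used by the runtime.

```typescript
// Assumed to match the runtime's interrupt reason constants.
const DELETE_INTERRUPTED_REASON = 'Knowledge task interrupted by item deletion'
const SHUTDOWN_INTERRUPTED_REASON = 'Knowledge task interrupted by service shutdown'

const INTERRUPT_REASONS: string[] = [DELETE_INTERRUPTED_REASON, SHUTDOWN_INTERRUPTED_REASON]

type FailureKind = 'interrupted' | 'failed'

// Turn any thrown value into an Error so callers can rely on `.message`.
function normalizeError(error: unknown): Error {
  return error instanceof Error ? error : new Error(String(error))
}

// Hypothetical helper: an error counts as an interruption when the entry was
// flagged as interrupted or when the message is one of the known reasons.
function classifyFailure(
  error: unknown,
  interruptedBy?: 'delete' | 'stop'
): { kind: FailureKind; error: Error } {
  const normalized = normalizeError(error)
  const interrupted = interruptedBy !== undefined || INTERRUPT_REASONS.indexOf(normalized.message) !== -1
  return { kind: interrupted ? 'interrupted' : 'failed', error: normalized }
}
```

In this sketch only the `'failed'` kind would lead to writing a `failed` status and cleaning up vectors; interruptions are rethrown so the code that triggered them can handle cleanup.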


@@ -1,76 +1,238 @@
import { application } from '@application'
import { knowledgeBaseService } from '@data/services/KnowledgeBaseService'
import { knowledgeItemService } from '@data/services/KnowledgeItemService'
import { loggerService } from '@logger'
import { BaseService, DependsOn, Injectable, Phase, ServicePhase } from '@main/core/lifecycle'
import type { KnowledgeBase, KnowledgeItem, KnowledgeSearchResult } from '@shared/data/types/knowledge'
import { ErrorCode, isDataApiError } from '@shared/data/api'
import {
type KnowledgeBase,
KnowledgeChunkMetadataSchema,
type KnowledgeItem,
type KnowledgeItemChunk,
type KnowledgeItemOf,
type KnowledgeRuntimeAddItemInput,
type KnowledgeSearchResult
} from '@shared/data/types/knowledge'
import { MetadataMode } from '@vectorstores/core'
import { embedMany } from 'ai'
import { KnowledgeQueueManager } from '../queue/KnowledgeQueueManager'
import type {
IndexLeafTaskEntry,
KnowledgeQueueTaskContext,
KnowledgeQueueTaskDescriptor,
KnowledgeQueueTaskEntry,
PrepareRootTaskEntry
} from '../queue/types'
import { loadKnowledgeItemDocuments } from '../readers/KnowledgeReader'
import { rerankKnowledgeSearchResults } from '../rerank/rerank'
import type { IndexableKnowledgeItem } from '../types/items'
import { chunkDocuments } from '../utils/chunk'
import { embedDocuments } from '../utils/embed'
import { filterIndexableKnowledgeItems, isIndexableKnowledgeItem } from '../utils/items'
import { getEmbedModel } from '../utils/model'
import { KnowledgeAddQueue } from './KnowledgeAddQueue'
import { KnowledgeAddRuntime } from './KnowledgeAddRuntime'
import { deleteItemVectors, deleteVectorsForEntries, failItems } from './utils/cleanup'
import { DELETE_INTERRUPTED_REASON, SHUTDOWN_INTERRUPTED_REASON } from './utils/taskRuntime'
import { prepareKnowledgeItem } from './utils/prepare'
const logger = loggerService.withContext('KnowledgeRuntimeService')
const SHUTDOWN_INTERRUPTED_REASON = 'Knowledge task interrupted by service shutdown'
const DELETE_INTERRUPTED_REASON = 'Knowledge task interrupted by item deletion'
const REINDEX_INTERRUPTED_REASON = 'Knowledge task interrupted by reindex'
const EXPECTED_QUEUE_INTERRUPT_REASONS = new Set([
SHUTDOWN_INTERRUPTED_REASON,
DELETE_INTERRUPTED_REASON,
REINDEX_INTERRUPTED_REASON
])
const SEARCH_TOKEN_PATTERN = /[\p{L}\p{N}_]+/u
type QueueTaskLogContext = {
baseId: string
itemId: string
kind: KnowledgeQueueTaskEntry['kind']
}
const mapChunkDocument = (chunk: {
id_: string
metadata: unknown
getContent: (mode?: MetadataMode) => string
}): KnowledgeItemChunk => {
const metadata = KnowledgeChunkMetadataSchema.parse(chunk.metadata ?? {})
return {
id: chunk.id_,
itemId: metadata.itemId,
content: chunk.getContent(MetadataMode.NONE),
metadata
}
}
const assertNeverKnowledgeItem = (item: never): never => {
throw new Error(`Unsupported knowledge item type: ${String((item as { type?: unknown }).type)}`)
}
@Injectable('KnowledgeRuntimeService')
@ServicePhase(Phase.WhenReady)
@DependsOn(['KnowledgeVectorStoreService'])
export class KnowledgeRuntimeService extends BaseService {
private isStopping = false
private addRuntime = new KnowledgeAddRuntime(() => this.isStopping)
private addQueue = new KnowledgeAddQueue(5, (entry) => {
if (this.isStopping) {
throw new Error(SHUTDOWN_INTERRUPTED_REASON)
}
return this.addRuntime.executeAdd(entry)
})
private queue = new KnowledgeQueueManager()
protected onInit(): void {
this.isStopping = false
this.addQueue.reset()
this.queue = new KnowledgeQueueManager()
}
protected async onStop(): Promise<void> {
this.isStopping = true
const interruptedEntries = this.addQueue.interruptAll('stop', SHUTDOWN_INTERRUPTED_REASON)
const interruptedItemIds = interruptedEntries.map((entry) => entry.item.id)
await this.addQueue.waitForRunning(interruptedItemIds)
await deleteVectorsForEntries(interruptedEntries, { continueOnError: true })
await failItems(interruptedItemIds, SHUTDOWN_INTERRUPTED_REASON)
const interruptedEntries = this.queue.interruptAll(SHUTDOWN_INTERRUPTED_REASON)
await this.queue.waitForRunning(interruptedEntries.map((entry) => entry.itemId))
await this.cleanupInterruptedEntries(interruptedEntries, SHUTDOWN_INTERRUPTED_REASON)
}
async createBase(base: KnowledgeBase) {
async createBase(baseId: string): Promise<void> {
const base = await knowledgeBaseService.getById(baseId)
const vectorStoreService = application.get('KnowledgeVectorStoreService')
await vectorStoreService.createStore(base)
}
async deleteBase(baseId: string) {
const interruptedEntries = this.addQueue.interruptBase(baseId, 'delete', DELETE_INTERRUPTED_REASON)
const interruptedItemIds = interruptedEntries.map((entry) => entry.item.id)
async deleteBase(baseId: string): Promise<string[]> {
const interruptedEntries = this.queue.interruptBase(baseId, DELETE_INTERRUPTED_REASON)
await this.queue.waitForRunning(interruptedEntries.map((entry) => entry.itemId))
await this.addQueue.waitForRunning(interruptedItemIds)
let cleanupEntries: Array<{ base: KnowledgeBase; baseId: string; itemIds: string[] }>
try {
cleanupEntries = await this.expandInterruptedEntries(interruptedEntries)
} catch (error) {
const normalizedError = error instanceof Error ? error : new Error(String(error))
await this.persistFailureStateBestEffort(
interruptedEntries.map((entry) => entry.itemId),
normalizedError.message,
{
baseId,
operation: 'deleteBase'
}
)
throw error
}
return cleanupEntries.flatMap((entry) => entry.itemIds)
}
async deleteBaseArtifacts(baseId: string): Promise<void> {
const vectorStoreService = application.get('KnowledgeVectorStoreService')
await vectorStoreService.deleteStore(baseId)
}
async addItems(base: KnowledgeBase, items: KnowledgeItem[]) {
return await Promise.all(items.map((item) => this.addQueue.enqueue(base, item)))
async addItems(baseId: string, inputs: KnowledgeRuntimeAddItemInput[]): Promise<void> {
if (inputs.length === 0) {
return
}
const base = await knowledgeBaseService.getById(baseId)
const acceptedItems: KnowledgeItem[] = []
try {
for (const input of inputs) {
const createdItem = await knowledgeItemService.create(base.id, input)
acceptedItems.push(createdItem)
acceptedItems[acceptedItems.length - 1] =
createdItem.type === 'directory' || createdItem.type === 'sitemap'
? await knowledgeItemService.updateStatus(createdItem.id, 'processing', { phase: 'preparing' })
: await knowledgeItemService.updateStatus(createdItem.id, 'processing')
}
} catch (error) {
const normalizedError = error instanceof Error ? error : new Error(String(error))
logger.error('Failed to add knowledge items', normalizedError, {
baseId: base.id,
accepted: acceptedItems.length,
total: inputs.length
})
await this.deleteAcceptedItemsBestEffort(acceptedItems, normalizedError, base.id)
throw error
}
for (const item of acceptedItems) {
await this.submitRuntimeItem(base, item)
}
}
async deleteItems(base: KnowledgeBase, items: KnowledgeItem[]) {
const rootIds = [...new Set(items.map((item) => item.id))]
const itemIds = await knowledgeItemService.getCascadeIdsInBase(base.id, rootIds)
async reindexItems(baseId: string, rootItems: KnowledgeItem[]): Promise<void> {
const base = await knowledgeBaseService.getById(baseId)
const rootIds = [...new Set(rootItems.map((item) => item.id))]
let interruptIds = rootIds
this.addQueue.interrupt(itemIds, 'delete', DELETE_INTERRUPTED_REASON)
await this.addQueue.waitForRunning(itemIds)
await deleteItemVectors(base, itemIds)
try {
const interrupted = await this.interruptRootsAndDescendants(base.id, rootIds, REINDEX_INTERRUPTED_REASON)
interruptIds = interrupted.interruptIds
const leafItems = filterIndexableKnowledgeItems(
await knowledgeItemService.getLeafDescendantItems(base.id, rootIds)
)
await this.deleteItemVectorsOrFailItems(
base,
leafItems.map((item) => item.id),
interruptIds,
{ baseId: base.id, operation: 'reindexItems', rootIds }
)
const containerItems = rootItems.filter(
(item): item is KnowledgeItemOf<'directory'> | KnowledgeItemOf<'sitemap'> =>
item.type === 'directory' || item.type === 'sitemap'
)
if (containerItems.length > 0) {
// Reindexing directory/sitemap roots rebuilds their leaf children from the source:
// old leaf items are deleted here, then preparation creates fresh leaf items to index.
await knowledgeItemService.deleteLeafDescendantItems(
base.id,
containerItems.map((item) => item.id)
)
}
for (const containerItem of containerItems) {
const preparedRoot = await knowledgeItemService.updateStatus(containerItem.id, 'processing', {
phase: 'preparing'
})
await this.submitRuntimeItem(base, preparedRoot)
}
for (const leafItem of rootItems.filter(isIndexableKnowledgeItem)) {
const processingItem = await knowledgeItemService.updateStatus(leafItem.id, 'processing')
if (isIndexableKnowledgeItem(processingItem)) {
this.enqueueIndexItem(base, processingItem)
}
}
} catch (error) {
await this.failItemsAndRethrow(interruptIds, error, { baseId: base.id, operation: 'reindexItems', rootIds })
}
}
async search(base: KnowledgeBase, query: string): Promise<KnowledgeSearchResult[]> {
async deleteItems(baseId: string, rootItems: KnowledgeItem[]): Promise<void> {
const base = await knowledgeBaseService.getById(baseId)
const rootIds = [...new Set(rootItems.map((item) => item.id))]
let interruptIds = rootIds
try {
const interrupted = await this.interruptRootsAndDescendants(base.id, rootIds, DELETE_INTERRUPTED_REASON)
interruptIds = interrupted.interruptIds
const leafItems = filterIndexableKnowledgeItems(
await knowledgeItemService.getLeafDescendantItems(base.id, rootIds)
)
await this.deleteItemVectorsOrFailItems(
base,
leafItems.map((item) => item.id),
interruptIds,
{ baseId: base.id, operation: 'deleteItems', rootIds }
)
} catch (error) {
await this.failItemsAndRethrow(interruptIds, error, { baseId: base.id, operation: 'deleteItems', rootIds })
}
}
async search(baseId: string, query: string): Promise<KnowledgeSearchResult[]> {
if (!SEARCH_TOKEN_PATTERN.test(query)) {
return []
}
const base = await knowledgeBaseService.getById(baseId)
const model = getEmbedModel(base)
const embedResult = await embedMany({ model, values: [query] })
const queryEmbedding = embedResult.embeddings[0]
@@ -90,19 +252,419 @@ export class KnowledgeRuntimeService extends BaseService {
})
const nodes = results.nodes ?? []
const searchResults = nodes.map((node, index) => {
const metadata = node.metadata ?? {}
const metadata = KnowledgeChunkMetadataSchema.parse(node.metadata ?? {})
return {
pageContent: node.getContent(MetadataMode.NONE),
score: results.similarities[index] ?? 0,
metadata,
itemId: typeof metadata.itemId === 'string' && metadata.itemId.length > 0 ? metadata.itemId : undefined,
itemId: metadata.itemId,
chunkId: node.id_
}
})
if (base.rerankModelId) {
return await rerankKnowledgeSearchResults(base, query, searchResults)
}
return searchResults
}
async listItemChunks(baseId: string, itemId: string): Promise<KnowledgeItemChunk[]> {
const base = await knowledgeBaseService.getById(baseId)
const vectorStoreService = application.get('KnowledgeVectorStoreService')
const vectorStore = await vectorStoreService.createStore(base)
const chunks = await vectorStore.listByExternalId(itemId)
return chunks.map(mapChunkDocument)
}
async deleteItemChunk(baseId: string, itemId: string, chunkId: string): Promise<void> {
const base = await knowledgeBaseService.getById(baseId)
const vectorStoreService = application.get('KnowledgeVectorStoreService')
const vectorStore = await vectorStoreService.createStore(base)
await vectorStore.deleteByIdAndExternalId(chunkId, itemId)
}
private async submitRuntimeItem(base: KnowledgeBase, item: KnowledgeItem): Promise<void> {
switch (item.type) {
case 'file':
case 'url':
case 'note':
this.enqueueIndexItem(base, item)
return
case 'directory':
case 'sitemap':
this.enqueuePrepareRoot(base, item)
return
default:
assertNeverKnowledgeItem(item)
}
}
private enqueueIndexItem(base: KnowledgeBase, item: IndexableKnowledgeItem): void {
const wasAlreadyQueued = this.hasQueuedItem(item.id)
let didStart = false
const promise = this.queue.enqueue({
base,
item,
kind: 'index-leaf',
execute: (context) => {
didStart = true
return this.executeIndexTask(context)
}
})
void promise.catch((error) => {
if (wasAlreadyQueued || didStart || this.isExpectedQueueInterrupt(error)) {
return
}
void this.failItemsAfterEnqueueRejection([item.id], error, {
baseId: base.id,
itemId: item.id,
kind: 'index-leaf'
})
})
}
private enqueuePrepareRoot(
base: KnowledgeBase,
item: KnowledgeItemOf<'directory'> | KnowledgeItemOf<'sitemap'>
): void {
const wasAlreadyQueued = this.hasQueuedItem(item.id)
let didStart = false
const promise = this.queue.enqueue({
base,
item,
kind: 'prepare-root',
execute: (context) => {
didStart = true
return this.executePrepareTask(context)
}
})
void promise.catch((error) => {
if (wasAlreadyQueued || didStart || this.isExpectedQueueInterrupt(error)) {
return
}
void this.failItemsAfterEnqueueRejection([item.id], error, {
baseId: base.id,
itemId: item.id,
kind: 'prepare-root'
})
})
}
private async executePrepareTask(context: KnowledgeQueueTaskContext<PrepareRootTaskEntry>): Promise<void> {
const { base, item } = context
const createdItemIds = new Set<string>([item.id])
try {
const leafItems = await prepareKnowledgeItem({
baseId: base.id,
item,
onCreatedItem: (createdItem) => createdItemIds.add(createdItem.id),
runMutation: (task) => context.runWithBaseWriteLock(task),
signal: context.signal
})
for (const leafItem of leafItems) {
if (await this.shouldEnqueueLeaf(leafItem.id)) {
context.signal.throwIfAborted()
this.enqueueIndexItem(base, leafItem)
}
}
await context.runWithBaseWriteLock(async () => {
await knowledgeItemService.updateStatus(item.id, 'processing')
context.signal.throwIfAborted()
})
} catch (error) {
if (context.signal.aborted) {
context.signal.throwIfAborted()
throw error
}
const normalizedError = error instanceof Error ? error : new Error(String(error))
await this.cleanupFailedItems(base, [...createdItemIds], item, normalizedError)
throw normalizedError
}
}
private async executeIndexTask(context: KnowledgeQueueTaskContext<IndexLeafTaskEntry>): Promise<void> {
const { base, item } = context
try {
await this.indexLeafItem(base, item, context)
} catch (error) {
if (context.signal.aborted) {
context.signal.throwIfAborted()
throw error
}
const normalizedError = error instanceof Error ? error : new Error(String(error))
await this.cleanupFailedItems(base, [item.id], item, normalizedError)
throw normalizedError
}
}
private async indexLeafItem(
base: KnowledgeBase,
item: IndexableKnowledgeItem,
context: KnowledgeQueueTaskContext<IndexLeafTaskEntry>
): Promise<void> {
context.signal.throwIfAborted()
await context.runWithBaseWriteLock(() =>
knowledgeItemService.updateStatus(item.id, 'processing', { phase: 'reading' })
)
const documents = await this.runTaskStep(context, () => loadKnowledgeItemDocuments(item, context.signal))
const chunks = await this.runTaskStep(context, () => chunkDocuments(base, item, documents))
await context.runWithBaseWriteLock(() =>
knowledgeItemService.updateStatus(item.id, 'processing', { phase: 'embedding' })
)
const nodes = await this.runTaskStep(context, () => embedDocuments(getEmbedModel(base), chunks, context.signal))
await context.runWithBaseWriteLock(async () => {
const vectorStoreService = application.get('KnowledgeVectorStoreService')
const activeVectorStore = await this.runTaskStep(context, () => vectorStoreService.createStore(base))
await this.runTaskStep(context, () => activeVectorStore.add(nodes))
await knowledgeItemService.updateStatus(item.id, 'completed')
})
}
private async cleanupFailedItems(
base: KnowledgeBase,
itemIds: string[],
logItem: KnowledgeItem,
error: Error
): Promise<void> {
logger.error('Failed to process knowledge item runtime task', error, {
baseId: base.id,
itemId: logItem.id,
itemType: logItem.type
})
try {
await deleteItemVectors(base, itemIds)
} catch (cleanupError) {
logger.error(
'Failed to cleanup knowledge item vectors after runtime failure',
cleanupError instanceof Error ? cleanupError : new Error(String(cleanupError)),
{
baseId: base.id,
itemIds
}
)
}
await this.persistFailureStateBestEffort(itemIds, error.message, {
baseId: base.id,
itemId: logItem.id,
itemType: logItem.type,
operation: 'runtimeTaskFailure'
})
}
private async persistFailureStateBestEffort(
itemIds: string[],
reason: string,
context: Record<string, unknown>
): Promise<void> {
try {
await failItems(itemIds, reason)
} catch (error) {
logger.error(
'Failed to persist knowledge item failure state during runtime cleanup',
error instanceof Error ? error : new Error(String(error)),
{
...context,
itemIds,
reason
}
)
}
}
private async deleteItemVectorsOrFailItems(
base: KnowledgeBase,
vectorItemIds: string[],
failureItemIds: string[],
context: Record<string, unknown>
): Promise<void> {
try {
await deleteItemVectors(base, vectorItemIds)
} catch (error) {
await this.failItemsAndRethrow(failureItemIds, error, context)
}
}
private async deleteAcceptedItemsBestEffort(
items: KnowledgeItem[],
originalError: Error,
baseId: string
): Promise<void> {
const uniqueItems = [...new Map(items.map((item) => [item.id, item])).values()]
await Promise.all(
uniqueItems.map(async (item) => {
try {
await knowledgeItemService.delete(item.id)
} catch (cleanupError) {
logger.error(
'Failed to rollback accepted knowledge item after addItems failure',
cleanupError instanceof Error ? cleanupError : new Error(String(cleanupError)),
{
baseId,
itemId: item.id,
addError: originalError.message
}
)
}
})
)
}
private async failItemsAndRethrow(
itemIds: string[],
error: unknown,
context: Record<string, unknown>
): Promise<never> {
const normalizedError = error instanceof Error ? error : new Error(String(error))
await this.persistFailureStateBestEffort(itemIds, normalizedError.message, {
...context,
operation: context.operation ?? 'strictRuntimeCleanup'
})
throw error
}
private isExpectedQueueInterrupt(error: unknown): boolean {
return (
error instanceof Error &&
error.name === 'KnowledgeQueueInterruptedError' &&
EXPECTED_QUEUE_INTERRUPT_REASONS.has(error.message)
)
}
private hasQueuedItem(itemId: string): boolean {
const snapshot = this.queue.getSnapshot()
return [...snapshot.pending, ...snapshot.running].some((entry) => entry.itemId === itemId)
}
private async failItemsAfterEnqueueRejection(
itemIds: string[],
error: unknown,
context: QueueTaskLogContext
): Promise<void> {
const normalizedError = error instanceof Error ? error : new Error(String(error))
logger.error('Knowledge queue rejected runtime task before execution', normalizedError, context)
try {
await failItems(itemIds, normalizedError.message)
} catch (failureStateError) {
logger.error(
'Failed to persist knowledge item failure state after queue enqueue rejection',
failureStateError instanceof Error ? failureStateError : new Error(String(failureStateError)),
{
...context,
itemIds,
enqueueError: normalizedError.message
}
)
}
}
private async interruptRootsAndDescendants(
baseId: string,
rootIds: string[],
reason: string
): Promise<{ descendantItems: KnowledgeItem[]; interruptIds: string[] }> {
// Stop roots before descendant lookup so an active expansion cannot enqueue fresh children during cleanup.
this.queue.interruptItems(rootIds, reason)
await this.queue.waitForRunning(rootIds)
const descendantItems = await knowledgeItemService.getDescendantItems(baseId, rootIds)
const interruptIds = [...rootIds, ...descendantItems.map((item) => item.id)]
this.queue.interruptItems(interruptIds, reason)
await this.queue.waitForRunning(interruptIds)
return { descendantItems, interruptIds }
}
private async runTaskStep<T>(context: KnowledgeQueueTaskContext, step: () => Promise<T> | T): Promise<T> {
context.signal.throwIfAborted()
const result = await step()
context.signal.throwIfAborted()
return result
}
private async shouldEnqueueLeaf(itemId: string): Promise<boolean> {
try {
const item = await knowledgeItemService.getById(itemId)
return isIndexableKnowledgeItem(item) && item.status === 'processing'
} catch (error) {
if (isDataApiError(error) && error.code === ErrorCode.NOT_FOUND) {
return false
}
throw error
}
}
private async cleanupInterruptedEntries(entries: KnowledgeQueueTaskDescriptor[], reason: string): Promise<void> {
const cleanupEntries = await this.expandInterruptedEntries(entries)
await this.deleteVectorsForQueueEntries(cleanupEntries)
await this.persistFailureStateBestEffort(
cleanupEntries.flatMap((entry) => entry.itemIds),
reason,
{
operation: 'interruptedRuntimeCleanup'
}
)
}
private async expandInterruptedEntries(
entries: KnowledgeQueueTaskDescriptor[]
): Promise<Array<{ base: KnowledgeBase; baseId: string; itemIds: string[] }>> {
const expandedEntries: Array<{ base: KnowledgeBase; baseId: string; itemIds: string[] }> = []
for (const entry of entries) {
if (entry.kind === 'index-leaf') {
expandedEntries.push({ base: entry.base, baseId: entry.baseId, itemIds: [entry.itemId] })
continue
}
const descendantItems = await knowledgeItemService.getDescendantItems(entry.baseId, [entry.itemId])
expandedEntries.push({
base: entry.base,
baseId: entry.baseId,
itemIds: [entry.itemId, ...descendantItems.map((item) => item.id)]
})
}
return expandedEntries
}
private async deleteVectorsForQueueEntries(
entries: Array<{ base: KnowledgeBase; baseId: string; itemIds: string[] }>
): Promise<void> {
const entriesByBase = new Map<string, { base: KnowledgeBase; itemIds: string[] }>()
for (const entry of entries) {
const existing = entriesByBase.get(entry.baseId)
if (existing) {
existing.itemIds.push(...entry.itemIds)
continue
}
entriesByBase.set(entry.baseId, {
base: entry.base,
// Copy the ids so later pushes for the same base do not mutate the caller's entry.
itemIds: [...entry.itemIds]
})
}
await deleteVectorsForEntries([...entriesByBase.values()])
}
}
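`deleteVectorsForQueueEntries` merges the interrupted entries per base before deleting vectors. The following is a small sketch of that grouping step, assuming a hypothetical `CleanupEntry` shape and a hypothetical `groupItemIdsByBase` helper. Unlike the method above, it also deduplicates ids while grouping, which is an addition of this sketch and not something the source does at this point.

```typescript
// Hypothetical, reduced entry shape: only the fields needed for grouping.
type CleanupEntry = { baseId: string; itemIds: string[] }

// Collect item ids per base into fresh collections, so the input entries are
// never mutated and each id appears once per base.
function groupItemIdsByBase(entries: CleanupEntry[]): Map<string, string[]> {
  const seenByBase = new Map<string, Set<string>>()
  const grouped = new Map<string, string[]>()

  for (const entry of entries) {
    let seen = seenByBase.get(entry.baseId)
    let ids = grouped.get(entry.baseId)
    if (!seen || !ids) {
      seen = new Set<string>()
      ids = []
      seenByBase.set(entry.baseId, seen)
      grouped.set(entry.baseId, ids)
    }
    for (const itemId of entry.itemIds) {
      if (!seen.has(itemId)) {
        seen.add(itemId)
        ids.push(itemId)
      }
    }
  }

  return grouped
}
```

Building new arrays keeps the caller's entries usable afterwards, for example when the same list is later flattened to persist failure state.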


@@ -1,120 +0,0 @@
import type { KnowledgeBase, KnowledgeItem } from '@shared/data/types/knowledge'
import { describe, expect, it, vi } from 'vitest'
import { KnowledgeAddQueue } from '../KnowledgeAddQueue'
function createBase(): KnowledgeBase {
return {
id: 'kb-1',
name: 'KB',
dimensions: 1024,
embeddingModelId: 'ollama::nomic-embed-text',
createdAt: '2026-04-08T00:00:00.000Z',
updatedAt: '2026-04-08T00:00:00.000Z'
}
}
function createItem(id: string): KnowledgeItem {
return {
id,
baseId: 'kb-1',
groupId: null,
type: 'note',
data: { content: id },
status: 'idle',
error: null,
createdAt: '2026-04-08T00:00:00.000Z',
updatedAt: '2026-04-08T00:00:00.000Z'
}
}
function createDeferred<T>() {
let resolve!: (value: T | PromiseLike<T>) => void
let reject!: (reason?: unknown) => void
const promise = new Promise<T>((res, rej) => {
resolve = res
reject = rej
})
return { promise, resolve, reject }
}
describe('KnowledgeAddQueue', () => {
it('deduplicates queued work for the same item', async () => {
const deferred = createDeferred<void>()
const executeAdd = vi.fn(async () => {
await deferred.promise
})
const queue = new KnowledgeAddQueue(1, executeAdd)
const base = createBase()
const item = createItem('item-1')
const firstPromise = queue.enqueue(base, item)
const secondPromise = queue.enqueue(base, item)
await vi.waitFor(() => {
expect(executeAdd).toHaveBeenCalledTimes(1)
})
deferred.resolve()
await expect(Promise.all([firstPromise, secondPromise])).resolves.toEqual([undefined, undefined])
})
it('interrupts pending and running items and returns their entries', async () => {
const deferred = createDeferred<void>()
const executeAdd = vi.fn(async (entry) => {
if (entry.item.id === runningItem.id) {
await deferred.promise
}
if (entry.interruptedBy) {
throw new Error('Knowledge task interrupted by item deletion')
}
})
const queue = new KnowledgeAddQueue(1, executeAdd)
const base = createBase()
const runningItem = createItem('item-running')
const pendingItem = createItem('item-pending')
const runningPromise = queue.enqueue(base, runningItem)
const pendingPromise = queue.enqueue(base, pendingItem)
await vi.waitFor(() => {
expect(executeAdd).toHaveBeenCalledTimes(1)
})
const interruptedEntries = queue.interrupt(
[runningItem.id, pendingItem.id],
'delete',
'Knowledge task interrupted by item deletion'
)
expect(interruptedEntries.map((entry) => entry.item.id)).toEqual([runningItem.id, pendingItem.id])
expect(executeAdd.mock.calls[0][0].interruptedBy).toBe('delete')
deferred.resolve()
await queue.waitForRunning([runningItem.id, pendingItem.id])
await expect(runningPromise).rejects.toThrow('Knowledge task interrupted by item deletion')
await expect(pendingPromise).rejects.toThrow('Knowledge task interrupted by item deletion')
})
it('rejects the public promise when executeAdd throws and continues with later work', async () => {
const queue = new KnowledgeAddQueue(1, async (entry) => {
if (entry.item.id === firstItem.id) {
throw new Error('execute failed')
}
})
const base = createBase()
const firstItem = createItem('item-failed')
const secondItem = createItem('item-next')
const firstPromise = queue.enqueue(base, firstItem)
const secondPromise = queue.enqueue(base, secondItem)
await expect(firstPromise).rejects.toThrow('execute failed')
await expect(secondPromise).resolves.toBeUndefined()
})
})


@@ -1,5 +1 @@
export * from './KnowledgeAddQueue'
export * from './KnowledgeAddRuntime'
export * from './KnowledgeRuntimeService'
export * from './utils/cleanup'
export * from './utils/taskRuntime'


@@ -1,248 +1,213 @@
import type { KnowledgeBase } from '@shared/data/types/knowledge'
import { beforeEach, describe, expect, it, vi } from 'vitest'
const { appGetMock, getStoreIfExistsMock, knowledgeItemUpdateMock, loggerErrorMock, loggerWarnMock, vectorDeleteMock } =
vi.hoisted(() => ({
appGetMock: vi.fn(),
getStoreIfExistsMock: vi.fn(),
knowledgeItemUpdateMock: vi.fn(),
loggerErrorMock: vi.fn(),
loggerWarnMock: vi.fn(),
vectorDeleteMock: vi.fn()
}))
vi.mock('@application', () => ({
application: {
get: appGetMock
const {
knowledgeItemUpdateStatusMock,
loggerErrorMock,
loggerWarnMock,
vectorStoreDeleteMock,
vectorStoreServiceMock
} = vi.hoisted(() => ({
knowledgeItemUpdateStatusMock: vi.fn(),
loggerErrorMock: vi.fn(),
loggerWarnMock: vi.fn(),
vectorStoreDeleteMock: vi.fn(),
vectorStoreServiceMock: {
getStoreIfExists: vi.fn()
}
}))
vi.mock('@application', async () => {
const { mockApplicationFactory } = await import('@test-mocks/main/application')
return mockApplicationFactory({
KnowledgeVectorStoreService: vectorStoreServiceMock
} as Parameters<typeof mockApplicationFactory>[0])
})
vi.mock('@data/services/KnowledgeItemService', () => ({
knowledgeItemService: {
update: knowledgeItemUpdateMock
updateStatus: knowledgeItemUpdateStatusMock
}
}))
vi.mock('@logger', () => ({
loggerService: {
withContext: () => ({
warn: loggerWarnMock,
error: loggerErrorMock
error: loggerErrorMock,
warn: loggerWarnMock
})
}
}))
const { deleteItemVectors, deleteVectorsForEntries, failItems } = await import('../cleanup')
function createBase() {
function createBase(): KnowledgeBase {
return {
id: 'kb-1',
name: 'KB',
groupId: null,
emoji: '📁',
dimensions: 1024,
embeddingModelId: 'ollama::nomic-embed-text',
status: 'completed',
error: null,
chunkSize: 1024,
chunkOverlap: 200,
searchMode: 'hybrid',
createdAt: '2026-04-08T00:00:00.000Z',
updatedAt: '2026-04-08T00:00:00.000Z'
}
}
describe('cleanup', () => {
function createDeferred<T = void>() {
let resolve!: (value: T | PromiseLike<T>) => void
const promise = new Promise<T>((res) => {
resolve = res
})
return { promise, resolve }
}
async function flushPromises(): Promise<void> {
await Promise.resolve()
await Promise.resolve()
}
describe('deleteItemVectors', () => {
beforeEach(() => {
vi.clearAllMocks()
knowledgeItemUpdateStatusMock.mockResolvedValue(undefined)
vectorStoreServiceMock.getStoreIfExists.mockResolvedValue({
delete: vectorStoreDeleteMock
})
})
appGetMock.mockImplementation((serviceName: string) => {
if (serviceName === 'KnowledgeVectorStoreService') {
return {
getStoreIfExists: getStoreIfExistsMock
}
it('does not delete vectors when the vector store does not exist', async () => {
vectorStoreServiceMock.getStoreIfExists.mockResolvedValue(null)
await expect(deleteItemVectors(createBase(), ['item-1'])).resolves.toBeUndefined()
expect(vectorStoreDeleteMock).not.toHaveBeenCalled()
})
it('deduplicates item ids before deleting vectors', async () => {
await deleteItemVectors(createBase(), ['item-1', 'item-1', 'item-2'])
expect(vectorStoreDeleteMock).toHaveBeenCalledTimes(2)
expect(vectorStoreDeleteMock).toHaveBeenCalledWith('item-1')
expect(vectorStoreDeleteMock).toHaveBeenCalledWith('item-2')
})
it('waits for every vector delete attempt before reporting failures', async () => {
const pendingDelete = createDeferred()
let rejected = false
vectorStoreDeleteMock.mockImplementation(async (itemId: string) => {
if (itemId === 'item-fail') {
throw new Error('delete failed')
}
throw new Error(`Unexpected application.get(${serviceName}) in test`)
await pendingDelete.promise
})
const deletePromise = deleteItemVectors(createBase(), ['item-fail', 'item-slow']).catch((error: unknown) => {
rejected = true
throw error
})
await flushPromises()
expect(rejected).toBe(false)
expect(vectorStoreDeleteMock).toHaveBeenCalledWith('item-fail')
expect(vectorStoreDeleteMock).toHaveBeenCalledWith('item-slow')
pendingDelete.resolve()
await expect(deletePromise).rejects.toThrow('Failed to delete vectors for knowledge items in base kb-1: item-fail')
})
})
describe('deleteVectorsForEntries', () => {
beforeEach(() => {
vi.clearAllMocks()
vectorStoreServiceMock.getStoreIfExists.mockResolvedValue({
delete: vectorStoreDeleteMock
})
})
it('does nothing when no vector store exists for the base', async () => {
const base = createBase()
getStoreIfExistsMock.mockResolvedValueOnce(undefined)
await expect(deleteItemVectors(base, ['item-1'])).resolves.toBeUndefined()
expect(getStoreIfExistsMock).toHaveBeenCalledWith(base)
expect(vectorDeleteMock).not.toHaveBeenCalled()
})
it('deduplicates item ids before deleting from an existing vector store', async () => {
const base = createBase()
getStoreIfExistsMock.mockResolvedValueOnce({
delete: vectorDeleteMock
})
vectorDeleteMock.mockResolvedValue(undefined)
await expect(deleteItemVectors(base, ['item-1', 'item-1', 'item-2'])).resolves.toBeUndefined()
expect(getStoreIfExistsMock).toHaveBeenCalledWith(base)
expect(vectorDeleteMock).toHaveBeenCalledTimes(2)
expect(vectorDeleteMock).toHaveBeenCalledWith('item-1')
expect(vectorDeleteMock).toHaveBeenCalledWith('item-2')
})
it('keeps deleting remaining items and reports partial failures', async () => {
const base = createBase()
const deleteError = new Error('delete failed for item-2')
getStoreIfExistsMock.mockResolvedValueOnce({
delete: vectorDeleteMock
})
vectorDeleteMock.mockImplementation(async (itemId: string) => {
if (itemId === 'item-2') {
throw deleteError
}
})
await expect(deleteItemVectors(base, ['item-1', 'item-2'])).rejects.toMatchObject({
name: 'DeleteItemVectorsError',
message: 'Failed to delete vectors for knowledge items in base kb-1: item-2',
baseId: 'kb-1',
succeededItemIds: ['item-1'],
failed: [
{
itemId: 'item-2',
error: deleteError
}
]
})
expect(vectorDeleteMock).toHaveBeenCalledTimes(2)
expect(vectorDeleteMock).toHaveBeenCalledWith('item-1')
expect(vectorDeleteMock).toHaveBeenCalledWith('item-2')
})
it('logs partial vector cleanup failures and continues when continueOnError is enabled', async () => {
const base = createBase()
getStoreIfExistsMock.mockResolvedValueOnce({
delete: vectorDeleteMock
})
vectorDeleteMock.mockImplementation(async (itemId: string) => {
if (itemId === 'item-2') {
throw new Error('delete failed for item-2')
it('logs and continues when deleting vectors fails for one base', async () => {
const firstBase = createBase()
const secondBase = { ...createBase(), id: 'kb-2' }
vectorStoreDeleteMock.mockImplementation(async (itemId: string) => {
if (itemId === 'item-fail') {
throw new Error('delete failed')
}
})
await expect(
deleteVectorsForEntries(
[
{
base,
item: {
id: 'item-1'
}
},
{
base,
item: {
id: 'item-2'
}
}
] as any,
{ continueOnError: true }
)
deleteVectorsForEntries([
{ base: firstBase, itemIds: ['item-fail'] },
{ base: secondBase, itemIds: ['item-ok'] }
])
).resolves.toBeUndefined()
expect(loggerWarnMock).toHaveBeenCalledWith('Failed to delete knowledge item vectors during interruption cleanup', {
baseId: 'kb-1',
itemIds: ['item-1', 'item-2'],
succeededItemIds: ['item-1'],
failedItemIds: ['item-2'],
cleanupError: 'Failed to delete vectors for knowledge items in base kb-1: item-2'
})
})
it('throws vector cleanup errors when continueOnError is disabled', async () => {
const base = createBase()
getStoreIfExistsMock.mockResolvedValueOnce({
delete: vectorDeleteMock
})
vectorDeleteMock.mockRejectedValueOnce(new Error('delete failed for item-1'))
await expect(
deleteVectorsForEntries(
[
{
base,
item: { id: 'item-1' }
}
] as any,
{ continueOnError: false }
)
).rejects.toMatchObject({
name: 'DeleteItemVectorsError',
baseId: 'kb-1',
failed: [
{
itemId: 'item-1'
}
]
})
expect(vectorStoreDeleteMock).toHaveBeenCalledWith('item-fail')
expect(vectorStoreDeleteMock).toHaveBeenCalledWith('item-ok')
expect(loggerErrorMock).toHaveBeenCalledWith(
'Failed to delete knowledge item vectors during runtime cleanup',
expect.objectContaining({
message: 'Failed to delete vectors for knowledge items in base kb-1: item-fail'
}),
{
baseId: firstBase.id,
itemIds: ['item-fail'],
failedItemIds: ['item-fail']
}
)
expect(loggerWarnMock).not.toHaveBeenCalled()
})
})
it('groups entries by base before deleting vectors', async () => {
const firstBase = createBase()
const secondBase = {
...createBase(),
id: 'kb-2'
}
getStoreIfExistsMock
.mockResolvedValueOnce({ delete: vectorDeleteMock })
.mockResolvedValueOnce({ delete: vectorDeleteMock })
vectorDeleteMock.mockResolvedValue(undefined)
await expect(
deleteVectorsForEntries(
[
{ base: firstBase, item: { id: 'item-1' } },
{ base: firstBase, item: { id: 'item-2' } },
{ base: secondBase, item: { id: 'item-3' } }
] as any,
{ continueOnError: true }
)
).resolves.toBeUndefined()
expect(getStoreIfExistsMock).toHaveBeenNthCalledWith(1, firstBase)
expect(getStoreIfExistsMock).toHaveBeenNthCalledWith(2, secondBase)
expect(vectorDeleteMock).toHaveBeenCalledWith('item-1')
expect(vectorDeleteMock).toHaveBeenCalledWith('item-2')
expect(vectorDeleteMock).toHaveBeenCalledWith('item-3')
describe('failItems', () => {
beforeEach(() => {
vi.clearAllMocks()
knowledgeItemUpdateStatusMock.mockResolvedValue(undefined)
})
it('marks interrupted items as failed and logs persistence errors per item', async () => {
knowledgeItemUpdateMock.mockImplementation(async (itemId: string) => {
if (itemId === 'item-2') {
throw new Error('persist failed')
it('marks unique item ids failed with the failure reason', async () => {
await failItems(['item-1', 'item-1', 'item-2'], 'read failed')
expect(knowledgeItemUpdateStatusMock).toHaveBeenCalledTimes(2)
expect(knowledgeItemUpdateStatusMock).toHaveBeenCalledWith('item-1', 'failed', { error: 'read failed' })
expect(knowledgeItemUpdateStatusMock).toHaveBeenCalledWith('item-2', 'failed', { error: 'read failed' })
expect(loggerErrorMock).not.toHaveBeenCalled()
})
it('throws an aggregate error after logging persistence failures', async () => {
const persistError = new Error('database locked')
knowledgeItemUpdateStatusMock.mockImplementation(async (itemId: string) => {
if (itemId === 'item-fail') {
throw persistError
}
})
await expect(failItems(['item-1', 'item-2', 'item-1'], 'stop')).resolves.toBeUndefined()
expect(knowledgeItemUpdateMock).toHaveBeenCalledTimes(2)
expect(knowledgeItemUpdateMock).toHaveBeenCalledWith('item-1', {
status: 'failed',
error: 'stop'
await expect(failItems(['item-ok', 'item-fail'], 'read failed')).rejects.toMatchObject({
name: 'FailedToPersistFailureStateError',
itemIds: ['item-fail'],
reason: 'read failed'
})
expect(knowledgeItemUpdateMock).toHaveBeenCalledWith('item-2', {
status: 'failed',
error: 'stop'
expect(knowledgeItemUpdateStatusMock).toHaveBeenCalledWith('item-ok', 'failed', { error: 'read failed' })
expect(knowledgeItemUpdateStatusMock).toHaveBeenCalledWith('item-fail', 'failed', { error: 'read failed' })
expect(loggerErrorMock).toHaveBeenCalledWith('Failed to persist knowledge item failure state', persistError, {
itemId: 'item-fail',
reason: 'read failed'
})
expect(loggerErrorMock).toHaveBeenCalledWith(
'Failed to persist interrupted knowledge item state',
expect.objectContaining({ message: 'persist failed' }),
'Failed to persist failure state for knowledge items',
expect.objectContaining({ name: 'FailedToPersistFailureStateError' }),
{
itemId: 'item-2',
reason: 'stop'
count: 1,
itemIds: ['item-fail'],
reason: 'read failed'
}
)
})

@@ -0,0 +1,343 @@
import type { KnowledgeItem } from '@shared/data/types/knowledge'
import { beforeEach, describe, expect, it, vi } from 'vitest'
const {
expandDirectoryOwnerToTreeMock,
expandSitemapOwnerToCreateItemsMock,
knowledgeItemCreateMock,
loggerWarnMock,
knowledgeItemUpdateStatusMock
} = vi.hoisted(() => ({
expandDirectoryOwnerToTreeMock: vi.fn(),
expandSitemapOwnerToCreateItemsMock: vi.fn(),
knowledgeItemCreateMock: vi.fn(),
loggerWarnMock: vi.fn(),
knowledgeItemUpdateStatusMock: vi.fn()
}))
vi.mock('@data/services/KnowledgeItemService', () => ({
knowledgeItemService: {
create: knowledgeItemCreateMock,
updateStatus: knowledgeItemUpdateStatusMock
}
}))
vi.mock('@logger', () => ({
loggerService: {
withContext: () => ({
warn: loggerWarnMock
})
}
}))
vi.mock('../../../utils/directory', () => ({
expandDirectoryOwnerToTree: expandDirectoryOwnerToTreeMock
}))
vi.mock('../../../utils/sitemap', () => ({
expandSitemapOwnerToCreateItems: expandSitemapOwnerToCreateItemsMock
}))
import type { PrepareKnowledgeItemOptions } from '../prepare'
const { prepareKnowledgeItem } = await import('../prepare')
const baseId = 'kb-1'
function createPrepareOptions(item: KnowledgeItem, onCreatedItem = vi.fn()): PrepareKnowledgeItemOptions {
const signal = new AbortController().signal
const runMutation: PrepareKnowledgeItemOptions['runMutation'] = async (task) => await task()
return {
baseId,
item,
onCreatedItem,
runMutation,
signal
}
}
function createDirectoryItem(id = 'dir-1', groupId: string | null = null): KnowledgeItem {
return {
id,
baseId,
groupId,
type: 'directory',
data: { source: id, path: `/docs/${id}` },
status: 'processing',
phase: null,
error: null,
createdAt: '2026-04-08T00:00:00.000Z',
updatedAt: '2026-04-08T00:00:00.000Z'
}
}
function createSitemapItem(): KnowledgeItem {
return {
id: 'sitemap-1',
baseId,
groupId: null,
type: 'sitemap',
data: { source: 'sitemap', url: 'https://example.com/sitemap.xml' },
status: 'processing',
phase: null,
error: null,
createdAt: '2026-04-08T00:00:00.000Z',
updatedAt: '2026-04-08T00:00:00.000Z'
}
}
function createNoteItem(id = 'note-1'): KnowledgeItem {
return {
id,
baseId,
groupId: null,
type: 'note',
data: { source: id, content: `hello ${id}` },
status: 'processing',
phase: null,
error: null,
createdAt: '2026-04-08T00:00:00.000Z',
updatedAt: '2026-04-08T00:00:00.000Z'
}
}
function createFileItem(id = 'file-1', groupId: string | null = null): KnowledgeItem {
return {
id,
baseId,
groupId,
type: 'file',
data: {
source: `/docs/${id}.md`,
file: {
id: `${id}-meta`,
name: `${id}.md`,
origin_name: `${id}.md`,
path: `/docs/${id}.md`,
created_at: '2026-04-08T00:00:00.000Z',
size: 10,
ext: '.md',
type: 'text',
count: 1
}
},
status: 'processing',
phase: null,
error: null,
createdAt: '2026-04-08T00:00:00.000Z',
updatedAt: '2026-04-08T00:00:00.000Z'
}
}
describe('prepareKnowledgeItem', () => {
beforeEach(() => {
vi.clearAllMocks()
expandDirectoryOwnerToTreeMock.mockResolvedValue([])
expandSitemapOwnerToCreateItemsMock.mockResolvedValue([])
knowledgeItemCreateMock.mockImplementation(async (_baseId: string, item: Partial<KnowledgeItem>) => ({
id: `${item.type}-created`,
baseId,
groupId: item.groupId ?? null,
type: item.type,
data: item.data,
status: 'idle',
phase: null,
error: null,
createdAt: '2026-04-08T00:00:00.000Z',
updatedAt: '2026-04-08T00:00:00.000Z'
}))
knowledgeItemUpdateStatusMock.mockImplementation(
async (
id: string,
status: KnowledgeItem['status'],
update: { phase?: KnowledgeItem['phase']; error?: string | null } = {}
) => ({
id,
baseId,
groupId: null,
type: id.startsWith('file') ? 'file' : 'note',
data: { source: id, content: id },
status,
phase: update.phase ?? null,
error: update.error ?? null,
createdAt: '2026-04-08T00:00:00.000Z',
updatedAt: '2026-04-08T00:00:00.000Z'
})
)
})
it('returns leaf items directly', async () => {
const note = createNoteItem()
await expect(prepareKnowledgeItem(createPrepareOptions(note))).resolves.toEqual([note])
expect(knowledgeItemCreateMock).not.toHaveBeenCalled()
expect(knowledgeItemUpdateStatusMock).not.toHaveBeenCalled()
})
it('expands directory trees and returns only file leaves', async () => {
const root = createDirectoryItem('dir-root')
const childDir = createDirectoryItem('dir-child', root.id)
const childFile = createFileItem('file-child', childDir.id)
knowledgeItemCreateMock.mockResolvedValueOnce(childDir).mockResolvedValueOnce(childFile)
knowledgeItemUpdateStatusMock.mockResolvedValueOnce(childDir).mockResolvedValueOnce(childFile)
expandDirectoryOwnerToTreeMock.mockResolvedValueOnce([
{
type: 'directory',
data: childDir.data,
children: [
{
type: 'file',
data: childFile.data
}
]
}
])
const options = createPrepareOptions(root)
await expect(prepareKnowledgeItem(options)).resolves.toEqual([childFile])
expect(expandDirectoryOwnerToTreeMock).toHaveBeenCalledWith(root, options.signal)
expect(knowledgeItemCreateMock).toHaveBeenNthCalledWith(1, baseId, {
groupId: root.id,
type: 'directory',
data: childDir.data
})
expect(knowledgeItemCreateMock).toHaveBeenNthCalledWith(2, baseId, {
groupId: childDir.id,
type: 'file',
data: childFile.data
})
expect(knowledgeItemUpdateStatusMock).toHaveBeenCalledWith(childDir.id, 'processing', { phase: 'preparing' })
expect(knowledgeItemUpdateStatusMock).toHaveBeenCalledWith(childDir.id, 'processing')
expect(knowledgeItemUpdateStatusMock).toHaveBeenCalledWith(childFile.id, 'processing')
})
it('marks empty directory roots failed and returns no leaves', async () => {
const root = createDirectoryItem('dir-root')
expandDirectoryOwnerToTreeMock.mockResolvedValueOnce([])
await expect(prepareKnowledgeItem(createPrepareOptions(root))).resolves.toEqual([])
expect(loggerWarnMock).toHaveBeenCalledWith('Directory expansion produced no indexable files', {
baseId,
itemId: root.id,
source: root.data.source
})
expect(knowledgeItemUpdateStatusMock).toHaveBeenCalledWith(root.id, 'failed', {
error: 'Directory contains no indexable files'
})
})
it('marks empty sitemap roots failed and returns no leaves', async () => {
const sitemap = createSitemapItem()
expandSitemapOwnerToCreateItemsMock.mockResolvedValueOnce([])
const options = createPrepareOptions(sitemap)
await expect(prepareKnowledgeItem(options)).resolves.toEqual([])
expect(expandSitemapOwnerToCreateItemsMock).toHaveBeenCalledWith(sitemap, options.signal)
expect(loggerWarnMock).toHaveBeenCalledWith('Sitemap expansion produced no indexable URLs', {
baseId,
itemId: sitemap.id,
source: sitemap.data.source
})
expect(knowledgeItemUpdateStatusMock).toHaveBeenCalledWith(sitemap.id, 'failed', {
error: 'Sitemap contains no indexable URLs'
})
})
it('expands sitemap items into url children and returns url leaves', async () => {
const sitemap = createSitemapItem()
const urlChild: KnowledgeItem = {
id: 'url-child',
baseId,
groupId: sitemap.id,
type: 'url',
data: { source: 'https://example.com/page-1', url: 'https://example.com/page-1' },
status: 'processing',
phase: null,
error: null,
createdAt: '2026-04-08T00:00:00.000Z',
updatedAt: '2026-04-08T00:00:00.000Z'
}
expandSitemapOwnerToCreateItemsMock.mockResolvedValueOnce([
{ groupId: sitemap.id, type: 'url', data: urlChild.data }
])
knowledgeItemCreateMock.mockResolvedValueOnce(urlChild)
knowledgeItemUpdateStatusMock.mockResolvedValueOnce(urlChild)
await expect(prepareKnowledgeItem(createPrepareOptions(sitemap))).resolves.toEqual([urlChild])
expect(knowledgeItemCreateMock).toHaveBeenCalledWith(baseId, {
groupId: sitemap.id,
type: 'url',
data: urlChild.data
})
expect(knowledgeItemUpdateStatusMock).toHaveBeenCalledWith(urlChild.id, 'processing')
})
it('reports created children before marking them processing', async () => {
const sitemap = createSitemapItem()
const urlChild: KnowledgeItem = {
id: 'url-child',
baseId,
groupId: sitemap.id,
type: 'url',
data: { source: 'https://example.com/page-1', url: 'https://example.com/page-1' },
status: 'idle',
phase: null,
error: null,
createdAt: '2026-04-08T00:00:00.000Z',
updatedAt: '2026-04-08T00:00:00.000Z'
}
const onCreatedItem = vi.fn()
expandSitemapOwnerToCreateItemsMock.mockResolvedValueOnce([
{ groupId: sitemap.id, type: 'url', data: urlChild.data }
])
knowledgeItemCreateMock.mockResolvedValueOnce(urlChild)
knowledgeItemUpdateStatusMock.mockRejectedValueOnce(new Error('status failed'))
await expect(prepareKnowledgeItem(createPrepareOptions(sitemap, onCreatedItem))).rejects.toThrow('status failed')
expect(onCreatedItem).toHaveBeenCalledWith(urlChild)
})
it('stops creating children when the runtime signal is aborted after expansion', async () => {
const sitemap = createSitemapItem()
const controller = new AbortController()
const abortError = new Error('interrupted')
expandSitemapOwnerToCreateItemsMock.mockImplementationOnce(async () => {
controller.abort(abortError)
return [
{
groupId: sitemap.id,
type: 'url',
data: { source: 'https://example.com/page-1', url: 'https://example.com/page-1' }
}
]
})
await expect(
prepareKnowledgeItem({
...createPrepareOptions(sitemap),
signal: controller.signal
})
).rejects.toBe(abortError)
expect(knowledgeItemCreateMock).not.toHaveBeenCalled()
expect(knowledgeItemUpdateStatusMock).not.toHaveBeenCalled()
})
it('propagates expansion failures without marking the source failed', async () => {
const sitemap = createSitemapItem()
expandSitemapOwnerToCreateItemsMock.mockRejectedValueOnce(new Error('sitemap expansion failed'))
await expect(prepareKnowledgeItem(createPrepareOptions(sitemap))).rejects.toThrow('sitemap expansion failed')
expect(knowledgeItemUpdateStatusMock).not.toHaveBeenCalledWith(sitemap.id, 'failed', {
error: 'sitemap expansion failed'
})
})
})

@@ -1,47 +0,0 @@
import { describe, expect, it, vi } from 'vitest'
import { assertTaskActive, runAbortable, SHUTDOWN_INTERRUPTED_REASON } from '../taskRuntime'
describe('taskRuntime', () => {
it('runs active work and returns the step result', async () => {
const step = vi.fn(async () => 'done')
await expect(
runAbortable(() => false, { itemId: 'item-1', signal: new AbortController().signal }, step)
).resolves.toBe('done')
expect(step).toHaveBeenCalledTimes(1)
})
it('throws the abort reason when the signal was aborted', () => {
const controller = new AbortController()
controller.abort('Knowledge task interrupted by item deletion')
expect(() => assertTaskActive(() => false, { itemId: 'item-1', signal: controller.signal })).toThrow(
'Knowledge task interrupted by item deletion'
)
})
it('throws the shutdown reason before invoking the step when stopping', async () => {
const step = vi.fn(async () => 'done')
await expect(
runAbortable(() => true, { itemId: 'item-1', signal: new AbortController().signal }, step)
).rejects.toThrow(SHUTDOWN_INTERRUPTED_REASON)
expect(step).not.toHaveBeenCalled()
})
it('rechecks the stopping state after the step instead of using a stale snapshot', async () => {
let stopping = false
await expect(
runAbortable(
() => stopping,
{ itemId: 'item-1', signal: new AbortController().signal },
async () => {
stopping = true
return 'done'
}
)
).rejects.toThrow(SHUTDOWN_INTERRUPTED_REASON)
})
})

@@ -1,31 +1,30 @@
import { application } from '@application'
import { knowledgeItemService } from '@data/services/KnowledgeItemService'
import { loggerService } from '@logger'
import type { KnowledgeBase, KnowledgeItem } from '@shared/data/types/knowledge'
import type { KnowledgeBase } from '@shared/data/types/knowledge'
const logger = loggerService.withContext('KnowledgeRuntimeCleanup')
interface DeleteItemVectorFailure {
itemId: string
error: Error
}
class DeleteItemVectorsError extends Error {
constructor(
readonly baseId: string,
readonly succeededItemIds: string[],
readonly failed: DeleteItemVectorFailure[]
readonly failedItemIds: string[]
) {
super(
`Failed to delete vectors for knowledge items in base ${baseId}: ${failed.map((entry) => entry.itemId).join(', ')}`
)
super(`Failed to delete vectors for knowledge items in base ${baseId}: ${failedItemIds.join(', ')}`)
this.name = 'DeleteItemVectorsError'
}
}
/**
* Deletes vectors for the given item ids within one knowledge base.
*/
class FailedToPersistFailureStateError extends Error {
constructor(
readonly itemIds: string[],
readonly reason: string
) {
super(`Failed to persist failure state for knowledge items: ${itemIds.join(', ')}`)
this.name = 'FailedToPersistFailureStateError'
}
}
export async function deleteItemVectors(base: KnowledgeBase, itemIds: string[]): Promise<void> {
const uniqueItemIds = [...new Set(itemIds)]
if (uniqueItemIds.length === 0) {
@@ -39,85 +38,38 @@ export async function deleteItemVectors(base: KnowledgeBase, itemIds: string[]):
}
const results = await Promise.allSettled(uniqueItemIds.map((itemId) => vectorStore.delete(itemId)))
const succeededItemIds: string[] = []
const failed: DeleteItemVectorFailure[] = []
const failedItemIds = results.flatMap((result, index) => (result.status === 'rejected' ? [uniqueItemIds[index]] : []))
for (const [index, result] of results.entries()) {
const itemId = uniqueItemIds[index]
if (result.status === 'fulfilled') {
succeededItemIds.push(itemId)
continue
}
failed.push({
itemId,
error: result.reason instanceof Error ? result.reason : new Error(String(result.reason))
})
}
if (failed.length > 0) {
throw new DeleteItemVectorsError(base.id, succeededItemIds, failed)
if (failedItemIds.length > 0) {
throw new DeleteItemVectorsError(base.id, failedItemIds)
}
}
/**
* Groups interrupted entries by base and deletes their vectors in batches.
*/
export async function deleteVectorsForEntries(
entries: Array<{ base: KnowledgeBase; item: KnowledgeItem }>,
options: { continueOnError: boolean }
entries: Array<{ base: KnowledgeBase; itemIds: string[] }>
): Promise<void> {
const entriesByBase = new Map<string, { base: KnowledgeBase; itemIds: Set<string> }>()
for (const entry of entries) {
const existing = entriesByBase.get(entry.base.id)
if (existing) {
existing.itemIds.add(entry.item.id)
continue
}
entriesByBase.set(entry.base.id, {
base: entry.base,
itemIds: new Set([entry.item.id])
})
}
for (const { base, itemIds } of entriesByBase.values()) {
for (const { base, itemIds } of entries) {
try {
await deleteItemVectors(base, [...itemIds])
await deleteItemVectors(base, itemIds)
} catch (error) {
if (!options.continueOnError) {
throw error
}
const deleteError = error instanceof DeleteItemVectorsError ? error : null
logger.warn('Failed to delete knowledge item vectors during interruption cleanup', {
const normalizedError = error instanceof Error ? error : new Error(String(error))
logger.error('Failed to delete knowledge item vectors during runtime cleanup', normalizedError, {
baseId: base.id,
itemIds: [...itemIds],
succeededItemIds: deleteError?.succeededItemIds ?? [],
failedItemIds: deleteError?.failed.map((entry) => entry.itemId) ?? [],
cleanupError: error instanceof Error ? error.message : String(error)
itemIds,
failedItemIds: error instanceof DeleteItemVectorsError ? error.failedItemIds : itemIds
})
}
}
}
/**
* Marks interrupted items as failed and logs any persistence errors.
*/
export async function failItems(itemIds: string[], reason: string): Promise<void> {
if (itemIds.length === 0) {
const uniqueItemIds = [...new Set(itemIds)]
if (uniqueItemIds.length === 0) {
return
}
const uniqueItemIds = [...new Set(itemIds)]
const results = await Promise.allSettled(
uniqueItemIds.map((itemId) =>
knowledgeItemService.update(itemId, {
status: 'failed',
error: reason
})
)
uniqueItemIds.map((itemId) => knowledgeItemService.updateStatus(itemId, 'failed', { error: reason }))
)
for (const [index, result] of results.entries()) {
@@ -126,7 +78,7 @@ export async function failItems(itemIds: string[], reason: string): Promise<void
}
logger.error(
'Failed to persist interrupted knowledge item state',
'Failed to persist knowledge item failure state',
result.reason instanceof Error ? result.reason : new Error(String(result.reason)),
{
itemId: uniqueItemIds[index],
@@ -134,4 +86,17 @@ export async function failItems(itemIds: string[], reason: string): Promise<void
}
)
}
const failedItemIds = results.flatMap((result, index) => (result.status === 'rejected' ? [uniqueItemIds[index]] : []))
if (failedItemIds.length === 0) {
return
}
const aggregateError = new FailedToPersistFailureStateError(failedItemIds, reason)
logger.error('Failed to persist failure state for knowledge items', aggregateError, {
count: failedItemIds.length,
itemIds: failedItemIds,
reason
})
throw aggregateError
}
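
The cleanup helpers above attempt every operation, wait for all of them to settle, and only then raise a single aggregate error naming the ids that failed. A minimal sketch of that pattern follows, assuming a hypothetical `BatchDeleteError` and a caller-supplied `deleteOne` callback; neither name is part of this change, and the real code uses `DeleteItemVectorsError` and the vector store's `delete`.

```typescript
// Hypothetical error type for this sketch; it mirrors the shape of DeleteItemVectorsError.
class BatchDeleteError extends Error {
  constructor(readonly failedIds: string[]) {
    super(`Failed to delete: ${failedIds.join(', ')}`)
    this.name = 'BatchDeleteError'
  }
}

// Maps settled results back to the ids whose operation rejected.
// Relies on Promise.allSettled returning results in input order.
function collectRejectedIds(ids: string[], results: PromiseSettledResult<unknown>[]): string[] {
  return ids.filter((_, index) => results[index].status === 'rejected')
}

// Deduplicates ids, attempts every delete, and throws once after all attempts settle.
async function deleteAll(ids: string[], deleteOne: (id: string) => Promise<void>): Promise<void> {
  const uniqueIds = Array.from(new Set(ids))
  const results = await Promise.allSettled(uniqueIds.map((id) => deleteOne(id)))
  const failedIds = collectRejectedIds(uniqueIds, results)
  if (failedIds.length > 0) {
    throw new BatchDeleteError(failedIds)
  }
}
```

Because the rejection is raised only after `Promise.allSettled` resolves, a slow delete is never abandoned just because a sibling failed early, which is the behavior the 'waits for every vector delete attempt before reporting failures' test checks.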

@@ -0,0 +1,175 @@
import { knowledgeItemService } from '@data/services/KnowledgeItemService'
import { loggerService } from '@logger'
import {
type CreateKnowledgeItemDto,
type KnowledgeItem,
type KnowledgeItemOf,
type KnowledgeItemType
} from '@shared/data/types/knowledge'
import type { IndexableKnowledgeItem } from '../../types/items'
import { expandDirectoryOwnerToTree, type ExpandedDirectoryNode } from '../../utils/directory'
import { isIndexableKnowledgeItem } from '../../utils/items'
import { expandSitemapOwnerToCreateItems } from '../../utils/sitemap'
const logger = loggerService.withContext('KnowledgeRuntimePrepare')
const EMPTY_DIRECTORY_ERROR = 'Directory contains no indexable files'
const EMPTY_SITEMAP_ERROR = 'Sitemap contains no indexable URLs'
export interface PrepareKnowledgeItemOptions {
baseId: string
item: KnowledgeItem
onCreatedItem: (item: KnowledgeItem) => void
runMutation: <T>(task: () => Promise<T>) => Promise<T>
signal: AbortSignal
}
export async function prepareKnowledgeItem({
baseId,
item,
onCreatedItem,
runMutation,
signal
}: PrepareKnowledgeItemOptions): Promise<IndexableKnowledgeItem[]> {
signal.throwIfAborted()
if (isIndexableKnowledgeItem(item)) {
return [item]
}
if (item.type === 'directory') {
return await prepareDirectoryForRuntime(baseId, item, onCreatedItem, runMutation, signal)
}
return await prepareSitemapForRuntime(baseId, item, onCreatedItem, runMutation, signal)
}
async function prepareDirectoryForRuntime(
baseId: string,
item: KnowledgeItemOf<'directory'>,
onCreatedItem: (item: KnowledgeItem) => void,
runMutation: <T>(task: () => Promise<T>) => Promise<T>,
signal: AbortSignal
): Promise<IndexableKnowledgeItem[]> {
const expandedChildren = await expandDirectoryOwnerToTree(item, signal)
signal.throwIfAborted()
if (expandedChildren.length === 0) {
logger.warn('Directory expansion produced no indexable files', {
baseId,
itemId: item.id,
source: item.data.source
})
await runMutation(() => knowledgeItemService.updateStatus(item.id, 'failed', { error: EMPTY_DIRECTORY_ERROR }))
return []
}
return await createDirectoryChildren(baseId, item.id, expandedChildren, onCreatedItem, runMutation, signal)
}
async function createDirectoryChildren(
baseId: string,
parentId: string,
children: ExpandedDirectoryNode[],
onCreatedItem: (item: KnowledgeItem) => void,
runMutation: <T>(task: () => Promise<T>) => Promise<T>,
signal: AbortSignal
): Promise<IndexableKnowledgeItem[]> {
const leafItems: IndexableKnowledgeItem[] = []
for (const child of children) {
signal.throwIfAborted()
if (child.type === 'file') {
const createdFile = await createRuntimeItem(
baseId,
{
groupId: parentId,
type: 'file',
data: child.data
},
onCreatedItem,
runMutation,
signal
)
leafItems.push(createdFile)
continue
}
const createdDirectory = await createRuntimeItem(
baseId,
{
groupId: parentId,
type: 'directory',
data: child.data
},
onCreatedItem,
runMutation,
signal
)
const childLeafItems = await createDirectoryChildren(
baseId,
createdDirectory.id,
child.children,
onCreatedItem,
runMutation,
signal
)
await runMutation(() => knowledgeItemService.updateStatus(createdDirectory.id, 'processing'))
leafItems.push(...childLeafItems)
}
return leafItems
}
async function prepareSitemapForRuntime(
baseId: string,
item: KnowledgeItemOf<'sitemap'>,
onCreatedItem: (item: KnowledgeItem) => void,
runMutation: <T>(task: () => Promise<T>) => Promise<T>,
signal: AbortSignal
): Promise<IndexableKnowledgeItem[]> {
const expandedItems = await expandSitemapOwnerToCreateItems(item, signal)
signal.throwIfAborted()
if (expandedItems.length === 0) {
logger.warn('Sitemap expansion produced no indexable URLs', {
baseId,
itemId: item.id,
source: item.data.source
})
await runMutation(() => knowledgeItemService.updateStatus(item.id, 'failed', { error: EMPTY_SITEMAP_ERROR }))
return []
}
const leafItems: IndexableKnowledgeItem[] = []
for (const expandedItem of expandedItems) {
signal.throwIfAborted()
const createdItem = await createRuntimeItem(baseId, expandedItem, onCreatedItem, runMutation, signal)
leafItems.push(createdItem)
}
return leafItems
}
async function createRuntimeItem<T extends KnowledgeItemType>(
baseId: string,
item: Extract<CreateKnowledgeItemDto, { type: T }>,
onCreatedItem: (item: KnowledgeItem) => void,
runMutation: <TResult>(task: () => Promise<TResult>) => Promise<TResult>,
signal: AbortSignal
): Promise<KnowledgeItemOf<T>> {
signal.throwIfAborted()
const createdItem = await runMutation(() => knowledgeItemService.create(baseId, item))
onCreatedItem(createdItem)
const processingItem = await runMutation(() =>
createdItem.type === 'directory' || createdItem.type === 'sitemap'
? knowledgeItemService.updateStatus(createdItem.id, 'processing', { phase: 'preparing' })
: knowledgeItemService.updateStatus(createdItem.id, 'processing')
)
signal.throwIfAborted()
return processingItem as KnowledgeItemOf<T>
}
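
`createDirectoryChildren` walks the expanded tree depth-first, creates a row for every node with its parent's id as `groupId`, and returns only the file leaves for indexing. The sketch below illustrates that traversal in isolation. It is synchronous and keeps rows in memory; the `TreeNode` and `Row` shapes and the `row-N` id scheme are assumptions made for the example, not the real `ExpandedDirectoryNode` or `KnowledgeItem` types.

```typescript
// Hypothetical node shape for this sketch; the real type is ExpandedDirectoryNode.
type TreeNode =
  | { type: 'file'; path: string }
  | { type: 'directory'; path: string; children: TreeNode[] }

// Hypothetical persisted row; parentId plays the role of groupId.
interface Row {
  id: string
  parentId: string
  type: 'file' | 'directory'
  path: string
}

// Creates one row per node (appended to `created`) and returns only the file rows.
function collectFileLeaves(parentId: string, nodes: TreeNode[], created: Row[]): Row[] {
  const leaves: Row[] = []
  for (const node of nodes) {
    const row: Row = { id: `row-${created.length + 1}`, parentId, type: node.type, path: node.path }
    created.push(row)
    if (node.type === 'file') {
      leaves.push(row)
      continue
    }
    // Directories are containers: recurse with the new row as the parent.
    leaves.push(...collectFileLeaves(row.id, node.children, created))
  }
  return leaves
}
```

Directory rows are still created so the hierarchy is preserved, but they never appear in the returned list, matching the 'expands directory trees and returns only file leaves' test.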

@@ -1,39 +0,0 @@
export const SHUTDOWN_INTERRUPTED_REASON = 'Knowledge task interrupted by service shutdown'
export const DELETE_INTERRUPTED_REASON = 'Knowledge task interrupted by item deletion'
export interface RuntimeTaskContext {
itemId: string
signal: AbortSignal
}
/**
* Runs one async runtime step with interruption checks before and after the
* step body.
*/
export async function runAbortable<T>(
isStopping: () => boolean,
ctx: RuntimeTaskContext,
step: () => Promise<T> | T
): Promise<T> {
assertTaskActive(isStopping, ctx)
const result = await step()
assertTaskActive(isStopping, ctx)
return result
}
/**
* Throws when the runtime has been interrupted by shutdown or abort signal.
*/
export function assertTaskActive(isStopping: () => boolean, ctx: RuntimeTaskContext): void {
if (ctx.signal.aborted) {
const reason =
typeof ctx.signal.reason === 'string' && ctx.signal.reason.length > 0
? ctx.signal.reason
: SHUTDOWN_INTERRUPTED_REASON
throw new Error(reason)
}
if (isStopping()) {
throw new Error(SHUTDOWN_INTERRUPTED_REASON)
}
}
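
With this helper removed, the new `prepare.ts` checks interruption by calling `signal.throwIfAborted()` directly. One visible difference is that `throwIfAborted()` rethrows the exact value passed to `AbortController.abort()`, whereas `assertTaskActive` wrapped string reasons in a fresh `Error`. A minimal sketch of the newer behavior, assuming a runtime that provides `AbortSignal.prototype.throwIfAborted` (for example Node 17.3 or later); `ensureActive` is a hypothetical wrapper used only for illustration.

```typescript
// Hypothetical wrapper for this sketch; prepare.ts calls signal.throwIfAborted() inline.
function ensureActive(signal: AbortSignal): void {
  // Throws signal.reason when aborted; does nothing otherwise.
  signal.throwIfAborted()
}

// Returns whatever ensureActive throws, or undefined when it does not throw.
function captureAbort(signal: AbortSignal): unknown {
  try {
    ensureActive(signal)
    return undefined
  } catch (error) {
    return error
  }
}
```

Callers can therefore compare the caught value by identity against the reason they passed to `abort()`, as the 'stops creating children when the runtime signal is aborted after expansion' test does with `rejects.toBe(abortError)`.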

@@ -0,0 +1,190 @@
import { KNOWLEDGE_NOTE_CONTENT_MAX, KNOWLEDGE_RUNTIME_ITEMS_MAX } from '@shared/data/types/knowledge'
import { describe, expect, it } from 'vitest'
import {
KnowledgeRuntimeAddItemsPayloadSchema,
KnowledgeRuntimeBasePayloadSchema,
KnowledgeRuntimeCreateBasePayloadSchema,
KnowledgeRuntimeDeleteItemChunkPayloadSchema,
KnowledgeRuntimeItemChunksPayloadSchema,
KnowledgeRuntimeItemsPayloadSchema,
KnowledgeRuntimeRestoreBasePayloadSchema,
KnowledgeRuntimeSearchPayloadSchema
} from '../ipc'
const createBaseInput = () => ({
name: 'Knowledge Base',
dimensions: 1024,
embeddingModelId: 'openai::text-embedding-3-large'
})
const createRuntimeItem = (index: number) => ({
type: 'note' as const,
data: {
source: `note-${index}`,
content: `note ${index}`
}
})
const createPayload = (count: number) => ({
baseId: 'base-1',
itemIds: Array.from({ length: count }, (_, index) => `item-${index}`)
})
const createAddItemsPayload = (count: number) => ({
baseId: 'base-1',
items: Array.from({ length: count }, (_, index) => createRuntimeItem(index))
})
describe('knowledge runtime payload schemas', () => {
it('accepts valid payloads for every runtime operation', () => {
const cases = [
{ name: 'create base', schema: KnowledgeRuntimeCreateBasePayloadSchema, payload: { base: createBaseInput() } },
{
name: 'restore base',
schema: KnowledgeRuntimeRestoreBasePayloadSchema,
payload: {
sourceBaseId: 'base-1',
dimensions: 3072,
embeddingModelId: 'openai::text-embedding-3-large'
}
},
{ name: 'base', schema: KnowledgeRuntimeBasePayloadSchema, payload: { baseId: 'base-1' } },
{ name: 'add items', schema: KnowledgeRuntimeAddItemsPayloadSchema, payload: createAddItemsPayload(1) },
{ name: 'items', schema: KnowledgeRuntimeItemsPayloadSchema, payload: createPayload(1) },
{ name: 'search', schema: KnowledgeRuntimeSearchPayloadSchema, payload: { baseId: 'base-1', query: 'hello' } },
{
name: 'item chunks',
schema: KnowledgeRuntimeItemChunksPayloadSchema,
payload: { baseId: 'base-1', itemId: 'item-1' }
},
{
name: 'delete item chunk',
schema: KnowledgeRuntimeDeleteItemChunkPayloadSchema,
payload: { baseId: 'base-1', itemId: 'item-1', chunkId: 'chunk-1' }
}
]
for (const { name, payload, schema } of cases) {
expect(schema.safeParse(payload).success, name).toBe(true)
}
})
it('rejects invalid payloads for every runtime operation', () => {
const cases = [
{
name: 'create base',
schema: KnowledgeRuntimeCreateBasePayloadSchema,
payload: { base: { ...createBaseInput(), name: '' } }
},
{
name: 'restore base',
schema: KnowledgeRuntimeRestoreBasePayloadSchema,
payload: { sourceBaseId: 'base-1', dimensions: 3072, embeddingModelId: '', chunkOverlap: 120 }
},
{ name: 'base', schema: KnowledgeRuntimeBasePayloadSchema, payload: { baseId: '' } },
{ name: 'add items', schema: KnowledgeRuntimeAddItemsPayloadSchema, payload: createAddItemsPayload(0) },
{ name: 'items', schema: KnowledgeRuntimeItemsPayloadSchema, payload: createPayload(0) },
{ name: 'search', schema: KnowledgeRuntimeSearchPayloadSchema, payload: { baseId: 'base-1', query: '' } },
{
name: 'item chunks',
schema: KnowledgeRuntimeItemChunksPayloadSchema,
payload: { baseId: 'base-1', itemId: '' }
},
{
name: 'delete item chunk',
schema: KnowledgeRuntimeDeleteItemChunkPayloadSchema,
payload: { baseId: 'base-1', itemId: 'item-1', chunkId: '' }
}
]
for (const { name, payload, schema } of cases) {
expect(schema.safeParse(payload).success, name).toBe(false)
}
})
})
describe('KnowledgeRuntimeAddItemsPayloadSchema', () => {
it('accepts one runtime item', () => {
expect(KnowledgeRuntimeAddItemsPayloadSchema.safeParse(createAddItemsPayload(1)).success).toBe(true)
})
it('accepts runtime items at the runtime batch limit', () => {
expect(
KnowledgeRuntimeAddItemsPayloadSchema.safeParse(createAddItemsPayload(KNOWLEDGE_RUNTIME_ITEMS_MAX)).success
).toBe(true)
})
it('rejects empty runtime item lists', () => {
expect(KnowledgeRuntimeAddItemsPayloadSchema.safeParse(createAddItemsPayload(0)).success).toBe(false)
})
it('rejects runtime items above the runtime batch limit', () => {
expect(
KnowledgeRuntimeAddItemsPayloadSchema.safeParse(createAddItemsPayload(KNOWLEDGE_RUNTIME_ITEMS_MAX + 1)).success
).toBe(false)
})
it('rejects note content above the runtime note content limit', () => {
expect(
KnowledgeRuntimeAddItemsPayloadSchema.safeParse({
baseId: 'base-1',
items: [
{
type: 'note',
data: { source: 'note-1', content: 'a'.repeat(KNOWLEDGE_NOTE_CONTENT_MAX + 1) }
}
]
}).success
).toBe(false)
})
it('rejects blank group owner ids', () => {
expect(
KnowledgeRuntimeAddItemsPayloadSchema.safeParse({
baseId: 'base-1',
items: [{ type: 'note', groupId: ' ', data: { source: 'note-1', content: 'note' } }]
}).success
).toBe(false)
})
})
describe('KnowledgeRuntimeItemsPayloadSchema', () => {
it('accepts one item id', () => {
expect(KnowledgeRuntimeItemsPayloadSchema.safeParse(createPayload(1)).success).toBe(true)
})
it('accepts item ids at the runtime batch limit', () => {
expect(KnowledgeRuntimeItemsPayloadSchema.safeParse(createPayload(KNOWLEDGE_RUNTIME_ITEMS_MAX)).success).toBe(true)
})
it('rejects empty item id lists', () => {
expect(KnowledgeRuntimeItemsPayloadSchema.safeParse(createPayload(0)).success).toBe(false)
})
it('rejects item ids above the runtime batch limit', () => {
expect(KnowledgeRuntimeItemsPayloadSchema.safeParse(createPayload(KNOWLEDGE_RUNTIME_ITEMS_MAX + 1)).success).toBe(
false
)
})
})
describe('KnowledgeRuntimeSearchPayloadSchema', () => {
it('accepts queries at the runtime query length limit', () => {
expect(
KnowledgeRuntimeSearchPayloadSchema.safeParse({
baseId: 'base-1',
query: 'a'.repeat(1000)
}).success
).toBe(true)
})
it('rejects queries above the runtime query length limit', () => {
expect(
KnowledgeRuntimeSearchPayloadSchema.safeParse({
baseId: 'base-1',
query: 'a'.repeat(1001)
}).success
).toBe(false)
})
})

@@ -0,0 +1,51 @@
import {
CreateKnowledgeBaseSchema,
KNOWLEDGE_RUNTIME_ITEMS_MAX,
KnowledgeRuntimeAddItemInputSchema,
RestoreKnowledgeBaseSchema
} from '@shared/data/types/knowledge'
import * as z from 'zod'
export const KnowledgeRuntimeCreateBasePayloadSchema = z.strictObject({
base: CreateKnowledgeBaseSchema
})
export type KnowledgeRuntimeCreateBasePayload = z.infer<typeof KnowledgeRuntimeCreateBasePayloadSchema>
export const KnowledgeRuntimeRestoreBasePayloadSchema = RestoreKnowledgeBaseSchema
export type KnowledgeRuntimeRestoreBasePayload = z.infer<typeof KnowledgeRuntimeRestoreBasePayloadSchema>
export const KnowledgeRuntimeBasePayloadSchema = z.strictObject({
baseId: z.string().trim().min(1)
})
export type KnowledgeRuntimeBasePayload = z.infer<typeof KnowledgeRuntimeBasePayloadSchema>
export const KnowledgeRuntimeAddItemsPayloadSchema = z.strictObject({
baseId: z.string().trim().min(1),
items: z.array(KnowledgeRuntimeAddItemInputSchema).min(1).max(KNOWLEDGE_RUNTIME_ITEMS_MAX)
})
export type KnowledgeRuntimeAddItemsPayload = z.infer<typeof KnowledgeRuntimeAddItemsPayloadSchema>
export const KnowledgeRuntimeItemsPayloadSchema = z.strictObject({
baseId: z.string().trim().min(1),
itemIds: z.array(z.string().trim().min(1)).min(1).max(KNOWLEDGE_RUNTIME_ITEMS_MAX)
})
export type KnowledgeRuntimeItemsPayload = z.infer<typeof KnowledgeRuntimeItemsPayloadSchema>
export const KnowledgeRuntimeSearchPayloadSchema = z.strictObject({
baseId: z.string().trim().min(1),
query: z.string().trim().min(1).max(1000)
})
export type KnowledgeRuntimeSearchPayload = z.infer<typeof KnowledgeRuntimeSearchPayloadSchema>
export const KnowledgeRuntimeItemChunksPayloadSchema = z.strictObject({
baseId: z.string().trim().min(1),
itemId: z.string().trim().min(1)
})
export type KnowledgeRuntimeItemChunksPayload = z.infer<typeof KnowledgeRuntimeItemChunksPayloadSchema>
export const KnowledgeRuntimeDeleteItemChunkPayloadSchema = z.strictObject({
baseId: z.string().trim().min(1),
itemId: z.string().trim().min(1),
chunkId: z.string().trim().min(1)
})
export type KnowledgeRuntimeDeleteItemChunkPayload = z.infer<typeof KnowledgeRuntimeDeleteItemChunkPayloadSchema>

@@ -0,0 +1,3 @@
import type { KnowledgeItemOf } from '@shared/data/types/knowledge'
export type IndexableKnowledgeItem = KnowledgeItemOf<'file' | 'url' | 'note'>

@@ -1,16 +1,22 @@
import { type KnowledgeBase, KnowledgeChunkMetadataSchema } from '@shared/data/types/knowledge'
import { Document } from '@vectorstores/core'
import { describe, expect, it } from 'vitest'
import { chunkDocuments } from '../chunk'
function createBase() {
function createBase(): KnowledgeBase {
return {
id: 'kb-1',
name: 'KB',
groupId: null,
emoji: '📁',
dimensions: 1024,
embeddingModelId: 'ollama::nomic-embed-text',
status: 'completed',
error: null,
chunkSize: 1000,
chunkOverlap: 0,
searchMode: 'hybrid',
createdAt: '2026-04-08T00:00:00.000Z',
updatedAt: '2026-04-08T00:00:00.000Z'
}
@@ -22,8 +28,9 @@ function createItem() {
baseId: 'kb-1',
groupId: null,
type: 'note' as const,
data: { content: 'hello' },
data: { source: 'item-1', content: 'hello' },
status: 'idle' as const,
phase: null,
error: null,
createdAt: '2026-04-08T00:00:00.000Z',
updatedAt: '2026-04-08T00:00:00.000Z'
@@ -39,32 +46,55 @@ describe('chunkDocuments', () => {
const documents = [
new Document({
text: 'hello world',
metadata: { sourceUrl: 'https://example.com/1' }
metadata: { source: 'https://example.com/1', page: 1 }
}),
new Document({
text: 'goodbye world',
metadata: { sourceUrl: 'https://example.com/2' }
metadata: { source: 'https://example.com/2' }
})
]
const chunks = chunkDocuments(createBase(), createItem(), documents)
const metadata = chunks.map((chunk) => KnowledgeChunkMetadataSchema.parse(chunk.metadata))
expect(chunks).toHaveLength(2)
expect(chunks[0]?.metadata).toMatchObject({
sourceUrl: 'https://example.com/1',
expect(metadata[0]).toMatchObject({
source: 'https://example.com/1',
itemId: 'item-1',
itemType: 'note',
sourceDocumentIndex: 0,
chunkIndex: 0,
chunkCount: 1
tokenCount: expect.any(Number)
})
expect(chunks[1]?.metadata).toMatchObject({
sourceUrl: 'https://example.com/2',
expect(metadata[0]).not.toHaveProperty('page')
expect(metadata[1]).toMatchObject({
source: 'https://example.com/2',
itemId: 'item-1',
itemType: 'note',
sourceDocumentIndex: 1,
chunkIndex: 0,
chunkCount: 1
chunkIndex: 1,
tokenCount: expect.any(Number)
})
expect(metadata[0]?.tokenCount).toBeGreaterThan(0)
})
it('throws before returning chunks when source metadata is missing', () => {
expect(() =>
chunkDocuments(createBase(), createItem(), [
new Document({
text: 'hello world',
metadata: {}
})
])
).toThrow()
})
it('throws before returning chunks when source metadata is blank', () => {
expect(() =>
chunkDocuments(createBase(), createItem(), [
new Document({
text: 'hello world',
metadata: { source: ' ' }
})
])
).toThrow()
})
})

@@ -4,7 +4,7 @@ import path from 'node:path'
import { afterEach, describe, expect, it, vi } from 'vitest'
const { expandDirectoryOwnerToCreateItems } = await import('../directory')
const { expandDirectoryOwnerToTree } = await import('../directory')
const realFs = await vi.importActual<typeof NodeFs>('node:fs')
const realOs = await vi.importActual<typeof NodeOs>('node:os')
@@ -12,7 +12,11 @@ function createTempRoot() {
return realFs.mkdtempSync(path.join(realOs.tmpdir(), 'knowledge-directory-expand-'))
}
describe('expandDirectoryOwnerToCreateItems', () => {
function createSignal() {
return new AbortController().signal
}
describe('expandDirectoryOwnerToTree', () => {
let tempRoot: string | undefined
afterEach(() => {
@@ -22,7 +26,7 @@ describe('expandDirectoryOwnerToCreateItems', () => {
}
})
it('expands a directory owner into child createMany dto items with preserved hierarchy', async () => {
it('expands a directory owner into a tree while preserving hierarchy', async () => {
tempRoot = createTempRoot()
const rootDir = path.join(tempRoot, 'anna')
const nestedDir = path.join(rootDir, 'agents', 'skills')
@@ -30,51 +34,139 @@ describe('expandDirectoryOwnerToCreateItems', () => {
realFs.writeFileSync(path.join(rootDir, '.dockerignore'), 'node_modules')
realFs.writeFileSync(path.join(nestedDir, 'skill.md'), '# skill')
const items = await expandDirectoryOwnerToCreateItems({
id: 'dir-owner-1',
baseId: 'kb-1',
groupId: null,
type: 'directory',
data: {
name: 'anna',
path: rootDir
const nodes = await expandDirectoryOwnerToTree(
{
id: 'dir-owner-1',
baseId: 'kb-1',
groupId: null,
type: 'directory',
data: {
source: rootDir,
path: rootDir
},
status: 'idle',
phase: null,
error: null,
createdAt: '2026-04-08T00:00:00.000Z',
updatedAt: '2026-04-08T00:00:00.000Z'
},
status: 'idle',
error: null,
createdAt: '2026-04-08T00:00:00.000Z',
updatedAt: '2026-04-08T00:00:00.000Z'
})
const agentsDir = items.find((item) => item.type === 'directory' && item.data.path === path.join(rootDir, 'agents'))
const skillsDir = items.find(
(item) => item.type === 'directory' && item.data.path === path.join(rootDir, 'agents', 'skills')
)
const rootFile = items.find(
(item) => item.type === 'file' && item.data.file.path === path.join(rootDir, '.dockerignore')
)
const nestedFile = items.find(
(item) => item.type === 'file' && item.data.file.path === path.join(nestedDir, 'skill.md')
createSignal()
)
expect(agentsDir).toMatchObject({
ref: 'dir:/agents',
groupId: 'dir-owner-1'
})
expect(skillsDir).toMatchObject({
ref: 'dir:/agents/skills',
groupRef: 'dir:/agents'
})
expect(rootFile).toBeUndefined()
expect(nestedFile).toMatchObject({
groupRef: 'dir:/agents/skills',
type: 'file'
})
expect(nestedFile && nestedFile.type === 'file' ? nestedFile.data.file : undefined).toMatchObject({
name: 'skill.md',
origin_name: 'skill.md',
path: path.join(nestedDir, 'skill.md'),
ext: '.md',
count: 1
})
expect(nodes).toEqual([
{
type: 'directory',
data: { source: path.join(rootDir, 'agents'), path: path.join(rootDir, 'agents') },
children: [
{
type: 'directory',
data: { source: nestedDir, path: nestedDir },
children: [
{
type: 'file',
data: {
source: path.join(nestedDir, 'skill.md'),
file: expect.objectContaining({
name: 'skill.md',
origin_name: 'skill.md',
path: path.join(nestedDir, 'skill.md'),
ext: '.md',
count: 1
})
}
}
]
}
]
}
])
})
it('skips empty nested directories while preserving non-empty directory hierarchy', async () => {
tempRoot = createTempRoot()
const rootDir = path.join(tempRoot, 'workspace')
const emptyDir = path.join(rootDir, 'empty')
const nestedDir = path.join(rootDir, 'guides', 'api')
realFs.mkdirSync(emptyDir, { recursive: true })
realFs.mkdirSync(nestedDir, { recursive: true })
realFs.writeFileSync(path.join(rootDir, 'readme.md'), '# readme')
realFs.writeFileSync(path.join(nestedDir, 'reference.md'), '# reference')
const nodes = await expandDirectoryOwnerToTree(
{
id: 'dir-owner-1',
baseId: 'kb-1',
groupId: null,
type: 'directory',
data: {
source: rootDir,
path: rootDir
},
status: 'idle',
phase: null,
error: null,
createdAt: '2026-04-08T00:00:00.000Z',
updatedAt: '2026-04-08T00:00:00.000Z'
},
createSignal()
)
expect(JSON.stringify(nodes)).not.toContain(emptyDir)
expect(nodes).toContainEqual(
expect.objectContaining({
type: 'file',
data: expect.objectContaining({
file: expect.objectContaining({ path: path.join(rootDir, 'readme.md') })
})
})
)
expect(nodes).toContainEqual(
expect.objectContaining({
type: 'directory',
data: expect.objectContaining({ path: path.join(rootDir, 'guides') }),
children: [
expect.objectContaining({
type: 'directory',
data: expect.objectContaining({ path: nestedDir }),
children: [
expect.objectContaining({
type: 'file',
data: expect.objectContaining({
file: expect.objectContaining({ path: path.join(nestedDir, 'reference.md') })
})
})
]
})
]
})
)
})
it('stops before reading when the runtime signal is already aborted', async () => {
tempRoot = createTempRoot()
const controller = new AbortController()
const abortError = new Error('interrupted')
controller.abort(abortError)
await expect(
expandDirectoryOwnerToTree(
{
id: 'dir-owner-1',
baseId: 'kb-1',
groupId: null,
type: 'directory',
data: {
source: tempRoot,
path: tempRoot
},
status: 'idle',
phase: null,
error: null,
createdAt: '2026-04-08T00:00:00.000Z',
updatedAt: '2026-04-08T00:00:00.000Z'
},
controller.signal
)
).rejects.toBe(abortError)
})
})

@@ -0,0 +1,60 @@
import type { FileMetadata } from '@shared/data/types/file'
import type { KnowledgeItem } from '@shared/data/types/knowledge'
import { describe, expect, it } from 'vitest'
import { filterIndexableKnowledgeItems, isIndexableKnowledgeItem } from '../items'
function createFileMetadata(): FileMetadata {
return {
id: 'file-1',
name: 'guide.md',
origin_name: 'guide.md',
path: '/docs/guide.md',
created_at: '2026-04-08T00:00:00.000Z',
size: 12,
ext: '.md',
type: 'text',
count: 1
}
}
function createItem(type: KnowledgeItem['type']): KnowledgeItem {
const base = {
id: `${type}-1`,
baseId: 'kb-1',
groupId: null,
status: 'idle',
phase: null,
error: null,
createdAt: '2026-04-08T00:00:00.000Z',
updatedAt: '2026-04-08T00:00:00.000Z'
} as const
switch (type) {
case 'file':
return { ...base, type, data: { source: '/docs/file.md', file: createFileMetadata() } }
case 'url':
return { ...base, type, data: { source: 'https://example.com', url: 'https://example.com' } }
case 'note':
return { ...base, type, data: { source: 'note', content: 'note' } }
case 'sitemap':
return {
...base,
type,
data: { source: 'https://example.com/sitemap.xml', url: 'https://example.com/sitemap.xml' }
}
case 'directory':
return { ...base, type, data: { source: '/docs', path: '/docs' } }
}
}
describe('indexable knowledge item helpers', () => {
it('recognizes file, url, and note as indexable leaves', () => {
const items = ['file', 'url', 'note', 'sitemap', 'directory'].map((type) =>
createItem(type as KnowledgeItem['type'])
)
expect(items.map((item) => isIndexableKnowledgeItem(item))).toEqual([true, true, true, false, false])
expect(filterIndexableKnowledgeItems(items).map((item) => item.type)).toEqual(['file', 'url', 'note'])
})
})

@@ -1,7 +1,8 @@
import { beforeEach, describe, expect, it, vi } from 'vitest'
const { fetchMock, loggerWarnMock } = vi.hoisted(() => ({
const { fetchMock, loggerErrorMock, loggerWarnMock } = vi.hoisted(() => ({
fetchMock: vi.fn(),
loggerErrorMock: vi.fn(),
loggerWarnMock: vi.fn()
}))
@@ -11,7 +12,7 @@ vi.mock('@logger', () => ({
debug: vi.fn(),
info: vi.fn(),
warn: loggerWarnMock,
error: vi.fn()
error: loggerErrorMock
})
}
}))
@@ -24,9 +25,32 @@ vi.mock('electron', () => ({
const { expandSitemapOwnerToCreateItems } = await import('../sitemap')
function createSitemapOwner(id = 'sitemap-owner-1', url = 'https://example.com/sitemap.xml') {
return {
id,
baseId: 'kb-1',
groupId: null,
type: 'sitemap' as const,
data: {
source: url,
url
},
status: 'idle' as const,
phase: null,
error: null,
createdAt: '2026-04-08T00:00:00.000Z',
updatedAt: '2026-04-08T00:00:00.000Z'
}
}
function createSignal() {
return new AbortController().signal
}
describe('expandSitemapOwnerToCreateItems', () => {
beforeEach(() => {
fetchMock.mockReset()
loggerErrorMock.mockReset()
loggerWarnMock.mockReset()
})
@@ -44,36 +68,23 @@ describe('expandSitemapOwnerToCreateItems', () => {
)
)
const items = await expandSitemapOwnerToCreateItems({
id: 'sitemap-owner-1',
baseId: 'kb-1',
groupId: null,
type: 'sitemap',
data: {
url: 'https://example.com/sitemap.xml',
name: 'https://example.com/sitemap.xml'
},
status: 'idle',
error: null,
createdAt: '2026-04-08T00:00:00.000Z',
updatedAt: '2026-04-08T00:00:00.000Z'
})
const items = await expandSitemapOwnerToCreateItems(createSitemapOwner(), createSignal())
expect(items).toEqual([
{
groupId: 'sitemap-owner-1',
type: 'url',
data: {
url: 'https://example.com/page-1',
name: 'https://example.com/page-1'
source: 'https://example.com/page-1',
url: 'https://example.com/page-1'
}
},
{
groupId: 'sitemap-owner-1',
type: 'url',
data: {
url: 'https://example.com/page-2',
name: 'https://example.com/page-2'
source: 'https://example.com/page-2',
url: 'https://example.com/page-2'
}
}
])
@@ -81,20 +92,7 @@ describe('expandSitemapOwnerToCreateItems', () => {
it('rejects unsupported sitemap protocols before fetching', async () => {
await expect(
expandSitemapOwnerToCreateItems({
id: 'sitemap-owner-2',
baseId: 'kb-1',
groupId: null,
type: 'sitemap',
data: {
url: 'file:///etc/passwd',
name: 'file:///etc/passwd'
},
status: 'idle',
error: null,
createdAt: '2026-04-08T00:00:00.000Z',
updatedAt: '2026-04-08T00:00:00.000Z'
})
expandSitemapOwnerToCreateItems(createSitemapOwner('sitemap-owner-2', 'file:///etc/passwd'), createSignal())
).rejects.toThrow('Invalid knowledge url: file:///etc/passwd')
expect(fetchMock).not.toHaveBeenCalled()
@@ -104,20 +102,10 @@ describe('expandSitemapOwnerToCreateItems', () => {
fetchMock.mockResolvedValue(new Response('<urlset></urlset>', { status: 200 }))
await expect(
expandSitemapOwnerToCreateItems({
id: 'sitemap-owner-3',
baseId: 'kb-1',
groupId: null,
type: 'sitemap',
data: {
url: 'https://example.com/empty-sitemap.xml',
name: 'https://example.com/empty-sitemap.xml'
},
status: 'idle',
error: null,
createdAt: '2026-04-08T00:00:00.000Z',
updatedAt: '2026-04-08T00:00:00.000Z'
})
expandSitemapOwnerToCreateItems(
createSitemapOwner('sitemap-owner-3', 'https://example.com/empty-sitemap.xml'),
createSignal()
)
).resolves.toEqual([])
expect(loggerWarnMock).toHaveBeenCalledWith('Sitemap expansion produced no URLs', {
@@ -125,4 +113,46 @@ describe('expandSitemapOwnerToCreateItems', () => {
sitemapUrl: 'https://example.com/empty-sitemap.xml'
})
})
it('uses an internal fetch timeout signal', async () => {
fetchMock.mockResolvedValue(new Response('<urlset></urlset>', { status: 200 }))
await expandSitemapOwnerToCreateItems(createSitemapOwner('sitemap-owner-4'), createSignal())
expect(fetchMock).toHaveBeenCalledWith('https://example.com/sitemap.xml', {
signal: expect.any(AbortSignal)
})
})
it('rejects before fetching when the runtime signal is already aborted', async () => {
const controller = new AbortController()
const abortError = new Error('interrupted')
controller.abort(abortError)
await expect(expandSitemapOwnerToCreateItems(createSitemapOwner(), controller.signal)).rejects.toBe(abortError)
expect(fetchMock).not.toHaveBeenCalled()
})
it('aborts the fetch signal when the runtime signal aborts', async () => {
const controller = new AbortController()
fetchMock.mockImplementation(
async () =>
new Promise(() => {
// Keep fetch pending so the test can inspect the supplied signal.
})
)
void expandSitemapOwnerToCreateItems(createSitemapOwner(), controller.signal).catch(() => undefined)
await vi.waitFor(() => {
expect(fetchMock).toHaveBeenCalled()
})
const fetchSignal = fetchMock.mock.calls[0]?.[1]?.signal as AbortSignal
expect(fetchSignal.aborted).toBe(false)
controller.abort(new Error('interrupted'))
expect(fetchSignal.aborted).toBe(true)
})
})

@@ -34,6 +34,7 @@ function createDeferred<T>() {
describe('fetchKnowledgeWebPage', () => {
beforeEach(() => {
vi.useRealTimers()
fetchMock.mockReset()
})
@@ -133,4 +134,45 @@ describe('fetchKnowledgeWebPage', () => {
await expect(Promise.all(requests)).resolves.toEqual(['page 1', 'page 2', 'page 3', 'page 4', 'page 5'])
expect(maxActiveFetches).toBeLessThanOrEqual(3)
})
it('does not create the fetch timeout while a request is waiting in the queue', async () => {
const timeoutSpy = vi.spyOn(AbortSignal, 'timeout')
const deferredResponses = Array.from({ length: 4 }, () => createDeferred<Response>())
let fetchCallIndex = 0
fetchMock.mockImplementation(async () => {
const deferred = deferredResponses[fetchCallIndex]
fetchCallIndex += 1
if (!deferred) {
throw new Error('Unexpected fetch call')
}
return await deferred.promise
})
const queuedController = new AbortController()
const activeRequests = [
fetchKnowledgeWebPage('https://example.com/1'),
fetchKnowledgeWebPage('https://example.com/2'),
fetchKnowledgeWebPage('https://example.com/3')
]
const queuedRequest = fetchKnowledgeWebPage('https://example.com/4', queuedController.signal)
void queuedRequest.catch(() => undefined)
await vi.waitFor(() => {
expect(fetchMock).toHaveBeenCalledTimes(3)
})
expect(timeoutSpy).toHaveBeenCalledTimes(3)
expect(fetchMock).toHaveBeenCalledTimes(3)
queuedController.abort(new Error('queued abort'))
deferredResponses[0].resolve(new Response('page 1', { status: 200 }))
deferredResponses[1].resolve(new Response('page 2', { status: 200 }))
deferredResponses[2].resolve(new Response('page 3', { status: 200 }))
await expect(Promise.all(activeRequests)).resolves.toEqual(['page 1', 'page 2', 'page 3'])
expect(fetchMock).toHaveBeenCalledTimes(3)
timeoutSpy.mockRestore()
})
})

@@ -1,32 +1,32 @@
import type { KnowledgeBase, KnowledgeItem } from '@shared/data/types/knowledge'
import { type KnowledgeBase, KnowledgeChunkMetadataSchema, type KnowledgeItem } from '@shared/data/types/knowledge'
import { Document, type Document as VectorStoreDocument, SentenceSplitter } from '@vectorstores/core'
import { estimateTokenCount } from 'tokenx'
/**
* Splits source documents into chunked vector-store documents and attaches
* knowledge-item metadata needed by downstream indexing steps.
*/
export function chunkDocuments(base: KnowledgeBase, item: KnowledgeItem, documents: VectorStoreDocument[]) {
const splitter = new SentenceSplitter({
chunkSize: base.chunkSize,
chunkOverlap: base.chunkOverlap
})
let chunkIndex = 0
return documents.flatMap((document, documentIndex) => {
return documents.flatMap((document) => {
const chunks = splitter.splitText(document.text).filter(Boolean)
return chunks.map(
(chunk, chunkIndex) =>
new Document({
text: chunk,
metadata: {
...document.metadata,
itemId: item.id,
itemType: item.type,
sourceDocumentIndex: documentIndex,
chunkIndex,
chunkCount: chunks.length
}
})
)
return chunks.map((chunk) => {
const currentChunkIndex = chunkIndex
chunkIndex += 1
const metadata = KnowledgeChunkMetadataSchema.parse({
source: document.metadata.source,
itemId: item.id,
itemType: item.type,
chunkIndex: currentChunkIndex,
tokenCount: estimateTokenCount(chunk)
})
return new Document({
text: chunk,
metadata
})
})
})
}
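
The change to `chunkDocuments` replaces the per-document `chunkIndex` with a counter that keeps increasing across all source documents of one item. A minimal sketch of that numbering follows. It is an illustration under stated assumptions: `SourceDocSketch`, `ChunkSketch`, `splitFixedWidth`, and `chunkWithRunningIndex` are hypothetical names, the fixed-width splitter stands in for `SentenceSplitter`, and the blank-source check only approximates what `KnowledgeChunkMetadataSchema.parse` appears to enforce in the tests.

```typescript
// Hypothetical, simplified shapes; the real code uses Document from @vectorstores/core.
interface SourceDocSketch {
  text: string
  source: string
}

interface ChunkSketch {
  text: string
  source: string
  itemId: string
  chunkIndex: number
}

// Stand-in splitter: fixed-width slices instead of sentence-aware splitting.
function splitFixedWidth(text: string, size: number): string[] {
  const parts: string[] = []
  for (let start = 0; start < text.length; start += size) {
    parts.push(text.slice(start, start + size))
  }
  return parts
}

// Numbers chunks with one counter shared by every document of the item.
function chunkWithRunningIndex(itemId: string, documents: SourceDocSketch[], size: number): ChunkSketch[] {
  const chunks: ChunkSketch[] = []
  let chunkIndex = 0
  for (const document of documents) {
    // Assumption: a blank source is rejected before any chunk is returned.
    if (document.source.trim().length === 0) {
      throw new Error('chunk source must not be blank')
    }
    for (const text of splitFixedWidth(document.text, size)) {
      chunks.push({ text, source: document.source, itemId, chunkIndex })
      chunkIndex += 1
    }
  }
  return chunks
}
```

With a shared counter, `(itemId, chunkIndex)` identifies a chunk without also needing the source document's position, which may be why `sourceDocumentIndex` and `chunkCount` no longer appear in the metadata.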

@@ -2,28 +2,48 @@ import fs from 'node:fs/promises'
import path from 'node:path'
import { getFileType } from '@main/utils/file'
import type { CreateKnowledgeItemsDto } from '@shared/data/api/schemas/knowledges'
import type { FileMetadata } from '@shared/data/types/file'
import type { KnowledgeItem } from '@shared/data/types/knowledge'
import type { NotesTreeNode } from '@types'
import { v4 as uuidv4 } from 'uuid'
type CreateKnowledgeItemInput = CreateKnowledgeItemsDto['items'][number]
export type ExpandedDirectoryNode =
| {
type: 'directory'
data: {
source: string
path: string
}
children: ExpandedDirectoryNode[]
}
| {
type: 'file'
data: {
source: string
file: FileMetadata
}
}
/**
* Recursively reads a directory tree and converts it into note-tree nodes.
*/
async function readDirectoryTree(dirPath: string, rootPath: string = dirPath): Promise<NotesTreeNode[]> {
async function readDirectoryTree(
dirPath: string,
signal: AbortSignal,
rootPath: string = dirPath
): Promise<NotesTreeNode[]> {
signal.throwIfAborted()
const entries = await fs.readdir(dirPath, { withFileTypes: true })
signal.throwIfAborted()
const nodes: NotesTreeNode[] = []
for (const entry of entries) {
signal.throwIfAborted()
if (entry.name.startsWith('.')) {
continue
}
const entryPath = path.join(dirPath, entry.name)
const stats = await fs.stat(entryPath)
signal.throwIfAborted()
const relativePath = path.relative(rootPath, entryPath)
const treePath = `/${relativePath.replace(/\\/g, '/')}`
@@ -36,7 +56,7 @@ async function readDirectoryTree(dirPath: string, rootPath: string = dirPath): P
externalPath: entryPath,
createdAt: stats.birthtime.toISOString(),
updatedAt: stats.mtime.toISOString(),
children: await readDirectoryTree(entryPath, rootPath)
children: await readDirectoryTree(entryPath, signal, rootPath)
})
continue
}
@@ -57,12 +77,9 @@ async function readDirectoryTree(dirPath: string, rootPath: string = dirPath): P
return nodes
}
/**
* Builds file metadata for an external file path so it can be stored as a
* knowledge file item.
*/
async function createExternalFileMetadata(filePath: string): Promise<FileMetadata> {
async function createExternalFileMetadata(filePath: string, signal: AbortSignal): Promise<FileMetadata> {
const stats = await fs.stat(filePath)
signal.throwIfAborted()
const originName = path.basename(filePath)
const ext = path.extname(originName)
@@ -79,67 +96,62 @@ async function createExternalFileMetadata(filePath: string): Promise<FileMetadat
}
}
type GroupingTarget = { groupId: string } | { groupRef: string }
/**
* Flattens a directory node into create-item inputs while preserving the
* parent-child grouping relationship.
*/
async function flattenDirectoryNode(node: NotesTreeNode, parent: GroupingTarget): Promise<CreateKnowledgeItemInput[]> {
async function expandDirectoryNode(node: NotesTreeNode, signal: AbortSignal): Promise<ExpandedDirectoryNode | null> {
if (node.type === 'file') {
return [
{
...parent,
type: 'file',
data: {
file: await createExternalFileMetadata(node.externalPath)
}
return {
type: 'file',
data: {
source: node.externalPath,
file: await createExternalFileMetadata(node.externalPath, signal)
}
]
}
}
if (node.type !== 'folder') {
return []
return null
}
const ref = node.treePath === '/' ? 'root' : `dir:${node.treePath}`
const items: CreateKnowledgeItemInput[] = [
{
ref,
...parent,
type: 'directory',
data: {
name: node.name,
path: node.externalPath
}
}
]
const children: ExpandedDirectoryNode[] = []
for (const child of node.children ?? []) {
items.push(...(await flattenDirectoryNode(child, { groupRef: ref })))
const expandedChild = await expandDirectoryNode(child, signal)
if (expandedChild) {
children.push(expandedChild)
}
}
return items
if (children.length === 0) {
return null
}
return {
type: 'directory',
data: {
source: node.externalPath,
path: node.externalPath
},
children
}
}
/**
* Expands a directory owner item into a batch of child knowledge items that
* mirror the directory structure on disk.
*/
export async function expandDirectoryOwnerToCreateItems(
owner: KnowledgeItem
): Promise<CreateKnowledgeItemsDto['items']> {
export async function expandDirectoryOwnerToTree(
owner: KnowledgeItem,
signal: AbortSignal
): Promise<ExpandedDirectoryNode[]> {
if (owner.type !== 'directory') {
throw new Error(`Knowledge item '${owner.id}' must be type 'directory', received '${owner.type}'`)
}
const resolvedPath = path.resolve(owner.data.path)
const children = await readDirectoryTree(resolvedPath)
const items: CreateKnowledgeItemsDto['items'] = []
const children = await readDirectoryTree(resolvedPath, signal)
const expandedChildren: ExpandedDirectoryNode[] = []
for (const child of children) {
items.push(...(await flattenDirectoryNode(child, { groupId: owner.id })))
const expandedChild = await expandDirectoryNode(child, signal)
if (expandedChild) {
expandedChildren.push(expandedChild)
}
}
return items
return expandedChildren
}
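
The new `expandDirectoryNode` returns `null` for a directory with no surviving children, so empty branches are pruned bottom-up. The sketch below shows that pruning on an in-memory tree. It is not the project's code: `RawNodeSketch`, `ExpandedNodeSketch`, and `expandNodeSketch` are hypothetical names, and the file metadata lookup is left out.

```typescript
// Hypothetical input shape; the real code walks NotesTreeNode values read from disk.
type RawNodeSketch =
  | { kind: 'file'; path: string }
  | { kind: 'folder'; path: string; children: RawNodeSketch[] }

// Hypothetical output shape, loosely following ExpandedDirectoryNode in the diff.
type ExpandedNodeSketch =
  | { type: 'file'; path: string }
  | { type: 'directory'; path: string; children: ExpandedNodeSketch[] }

// Expands one node, returning null for folders that contain no files at any depth.
function expandNodeSketch(node: RawNodeSketch, signal: AbortSignal): ExpandedNodeSketch | null {
  // Stop early when the caller has aborted; throws the signal's abort reason.
  signal.throwIfAborted()

  if (node.kind === 'file') {
    return { type: 'file', path: node.path }
  }

  const children: ExpandedNodeSketch[] = []
  for (const child of node.children) {
    const expanded = expandNodeSketch(child, signal)
    if (expanded) {
      children.push(expanded)
    }
  }

  // A folder whose children were all pruned is pruned as well.
  if (children.length === 0) {
    return null
  }
  return { type: 'directory', path: node.path, children }
}
```

Because pruning happens after the children are expanded, a folder that contains only empty folders disappears too.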

@@ -2,9 +2,6 @@ import type { EmbeddingModelV3 } from '@ai-sdk/provider'
import { type Document as VectorStoreDocument, NodeRelationship, TextNode } from '@vectorstores/core'
import { embedMany } from 'ai'
/**
* Embeds chunked documents and converts them into vector-store text nodes.
*/
export async function embedDocuments(
model: EmbeddingModelV3,
documents: VectorStoreDocument[],

@@ -0,0 +1,11 @@
import type { KnowledgeItem } from '@shared/data/types/knowledge'
import type { IndexableKnowledgeItem } from '../types/items'
export function isIndexableKnowledgeItem(item: KnowledgeItem): item is IndexableKnowledgeItem {
return item.type === 'file' || item.type === 'url' || item.type === 'note'
}
export function filterIndexableKnowledgeItems(items: KnowledgeItem[]): IndexableKnowledgeItem[] {
return items.filter(isIndexableKnowledgeItem)
}
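
The two helpers above rely on a type predicate, so `Array.prototype.filter` narrows the element type. Here is a self-contained sketch of the same pattern. `ItemSketch`, `ItemOfSketch`, and the simplified `data` shape are assumptions; in particular, defining `KnowledgeItemOf` through `Extract` is a guess at how the shared type might work, not something this diff shows.

```typescript
// Hypothetical, reduced item union; the real KnowledgeItem carries more fields.
type ItemSketch =
  | { type: 'file'; data: { source: string } }
  | { type: 'url'; data: { source: string } }
  | { type: 'note'; data: { source: string } }
  | { type: 'sitemap'; data: { source: string } }
  | { type: 'directory'; data: { source: string } }

// Assumption: selects union members by their discriminant.
type ItemOfSketch<T extends ItemSketch['type']> = Extract<ItemSketch, { type: T }>

type IndexableItemSketch = ItemOfSketch<'file' | 'url' | 'note'>

// Type predicate: container items (sitemap, directory) are not indexed directly.
function isIndexableSketch(item: ItemSketch): item is IndexableItemSketch {
  return item.type === 'file' || item.type === 'url' || item.type === 'note'
}

// The predicate lets filter return the narrowed element type without a cast.
function filterIndexableSketch(items: ItemSketch[]): IndexableItemSketch[] {
  return items.filter(isIndexableSketch)
}
```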

@@ -8,21 +8,12 @@ export function getKnowledgeBaseEmbeddingModelMissingMessage(baseId: string): st
return `Knowledge base ${baseId} has no embedding model configured. Select a new embedding model before indexing or searching.`
}
/**
* Temporary knowledge-domain model resolver.
* TODO: unify model acquisition after ai-core moves into main.
*/
/**
* Resolves the embedding model configured on a knowledge base.
*/
export function getEmbedModel(base: KnowledgeBase): EmbeddingModelV3 {
if (!base.embeddingModelId) {
throw new Error(getKnowledgeBaseEmbeddingModelMissingMessage(base.id))
}
const { providerId, modelId } = parseCompositeModelId(base.embeddingModelId)
// todo: wait model/provider pr merged
// const {baseUrl, apiKey} = model/provider.getxxx
if (providerId !== 'ollama') {
throw new Error(`Unsupported embedding provider: ${providerId}`)

@@ -1,22 +1,19 @@
import { loggerService } from '@logger'
import type { CreateKnowledgeItemsDto } from '@shared/data/api/schemas/knowledges'
import type { KnowledgeItem } from '@shared/data/types/knowledge'
import type { CreateKnowledgeItemDto, KnowledgeItemOf } from '@shared/data/types/knowledge'
import { net } from 'electron'
import { XMLParser } from 'fast-xml-parser'
import { sanitizeKnowledgeUrl } from './url'
const logger = loggerService.withContext('KnowledgeSitemapExpansion')
const DEFAULT_FETCH_TIMEOUT_MS = 30000
const DEFAULT_SITEMAP_FETCH_TIMEOUT_MS = 30000
const sitemapParser = new XMLParser()
type ParsedSitemapDocument = {
urlset?: { url?: Array<{ loc?: string }> | { loc?: string } }
}
type SitemapUrlChildInput = Extract<CreateKnowledgeItemDto, { type: 'url' }>
/**
* Normalizes sitemap url entries into a flat string list.
*/
function normalizeLocs(value: Array<{ loc?: string }> | { loc?: string } | undefined): string[] {
if (!value) {
return []
@@ -26,29 +23,27 @@ function normalizeLocs(value: Array<{ loc?: string }> | { loc?: string } | undef
return entries.map((entry) => entry.loc?.trim()).filter((loc): loc is string => Boolean(loc))
}
/**
* Expands a sitemap owner item into child url items fetched from the remote
* sitemap document.
*/
export async function expandSitemapOwnerToCreateItems(owner: KnowledgeItem): Promise<CreateKnowledgeItemsDto['items']> {
if (owner.type !== 'sitemap') {
throw new Error(`Knowledge item '${owner.id}' must be type 'sitemap', received '${owner.type}'`)
}
export async function expandSitemapOwnerToCreateItems(
owner: KnowledgeItemOf<'sitemap'>,
signal: AbortSignal
): Promise<SitemapUrlChildInput[]> {
const sitemapUrl = owner.data.url
try {
const safeSitemapUrl = sanitizeKnowledgeUrl(sitemapUrl)
signal.throwIfAborted()
const response = await net.fetch(safeSitemapUrl, {
signal: AbortSignal.timeout(DEFAULT_FETCH_TIMEOUT_MS)
signal: AbortSignal.any([signal, AbortSignal.timeout(DEFAULT_SITEMAP_FETCH_TIMEOUT_MS)])
})
signal.throwIfAborted()
if (!response.ok) {
throw new Error(`Failed to read sitemap ${safeSitemapUrl}: HTTP ${response.status}`)
}
const xml = await response.text()
signal.throwIfAborted()
const parsed = sitemapParser.parse(xml) as ParsedSitemapDocument
const pageUrls = [...new Set(normalizeLocs(parsed.urlset?.url).map((url) => sanitizeKnowledgeUrl(url)))]
@@ -63,11 +58,15 @@ export async function expandSitemapOwnerToCreateItems(owner: KnowledgeItem): Pro
groupId: owner.id,
type: 'url' as const,
data: {
url,
name: url
source: url,
url
}
}))
} catch (error) {
if (signal.aborted) {
throw error
}
const normalizedError = error instanceof Error ? error : new Error(String(error))
logger.error(`Failed to expand sitemap: ${sitemapUrl}`, normalizedError)
throw error


@@ -31,23 +31,23 @@ export function sanitizeKnowledgeUrl(rawUrl: string): string {
}
}
/**
* Fetches a knowledge web page through the Jina reader endpoint and returns
* the normalized markdown payload.
*/
export async function fetchKnowledgeWebPage(url: string, signal?: AbortSignal): Promise<string> {
try {
const safeUrl = sanitizeKnowledgeUrl(url)
const response = await knowledgeWebFetchQueue.add(
async () =>
await net.fetch(`${JINA_READER_BASE_URL}${safeUrl}`, {
signal: signal ?? AbortSignal.timeout(DEFAULT_FETCH_TIMEOUT_MS),
async () => {
const timeoutSignal = AbortSignal.timeout(DEFAULT_FETCH_TIMEOUT_MS)
const fetchSignal = signal ? AbortSignal.any([signal, timeoutSignal]) : timeoutSignal
return await net.fetch(`${JINA_READER_BASE_URL}${safeUrl}`, {
signal: fetchSignal,
headers: {
'X-Retain-Images': 'none',
'X-Return-Format': 'markdown'
}
}),
})
},
signal ? { signal } : undefined
)
if (!response) {


@@ -1,19 +1,33 @@
import { loggerService } from '@logger'
import { BaseService, Injectable, Phase, ServicePhase } from '@main/core/lifecycle'
import { DataApiErrorFactory } from '@shared/data/api'
import type { KnowledgeBase } from '@shared/data/types/knowledge'
import type { BaseVectorStore } from '@vectorstores/core'
import { LibSQLVectorStore } from '@vectorstores/libsql'
import { libSqlVectorStoreProvider } from './providers/LibSqlVectorStoreProvider'
import type { KnowledgeVectorStore } from './types'
const logger = loggerService.withContext('KnowledgeVectorStoreService')
function assertVectorStoreReadyBase(base: KnowledgeBase): asserts base is KnowledgeBase & { dimensions: number } {
if (base.status === 'completed' && typeof base.dimensions === 'number' && base.dimensions > 0) {
return
}
throw DataApiErrorFactory.invalidOperation(
'createKnowledgeVectorStore',
`Knowledge base '${base.id}' is not ready for vector store operations`
)
}
@Injectable('KnowledgeVectorStoreService')
@ServicePhase(Phase.WhenReady)
export class KnowledgeVectorStoreService extends BaseService {
private instanceCache = new Map<string, BaseVectorStore>()
private instanceCache = new Map<string, KnowledgeVectorStore>()
async createStore(base: KnowledgeBase): Promise<KnowledgeVectorStore> {
assertVectorStoreReadyBase(base)
async createStore(base: KnowledgeBase): Promise<BaseVectorStore> {
if (this.instanceCache.has(base.id)) {
logger.debug('Reusing cached vector store', { baseId: base.id })
return this.instanceCache.get(base.id)!
@@ -22,7 +36,7 @@ export class KnowledgeVectorStoreService extends BaseService {
// Cache is keyed only by base.id because store-shaping config is treated as immutable
// for an existing knowledge base. If embedding model / dimensions change, callers must
// migrate into a new knowledge base instead of mutating the existing one in place.
const store = await libSqlVectorStoreProvider.create(base)
const store = (await libSqlVectorStoreProvider.create(base)) as KnowledgeVectorStore
this.instanceCache.set(base.id, store)
logger.info('Created vector store', {
baseId: base.id,
@@ -32,7 +46,9 @@ export class KnowledgeVectorStoreService extends BaseService {
return store
}
async getStoreIfExists(base: KnowledgeBase): Promise<BaseVectorStore | undefined> {
async getStoreIfExists(base: KnowledgeBase): Promise<KnowledgeVectorStore | undefined> {
assertVectorStoreReadyBase(base)
const cachedStore = this.instanceCache.get(base.id)
if (cachedStore) {
logger.debug('Using cached vector store from getStoreIfExists', { baseId: base.id })
@@ -82,7 +98,7 @@ export class KnowledgeVectorStoreService extends BaseService {
}
}
private closeStoreInstance(store: BaseVectorStore | undefined): void {
private closeStoreInstance(store: KnowledgeVectorStore | undefined): void {
if (!store) {
return
}


@@ -1,4 +1,9 @@
import type * as LifecycleModule from '@main/core/lifecycle'
import {
DEFAULT_KNOWLEDGE_BASE_CHUNK_OVERLAP,
DEFAULT_KNOWLEDGE_BASE_CHUNK_SIZE,
type KnowledgeBase
} from '@shared/data/types/knowledge'
import { beforeEach, describe, expect, it, vi } from 'vitest'
const { loggerDebugMock, loggerErrorMock, loggerInfoMock, providerCreateMock, providerDeleteMock, providerExistsMock } =
@@ -59,12 +64,19 @@ vi.mock('../providers/LibSqlVectorStoreProvider', () => ({
const { KnowledgeVectorStoreService } = await import('../KnowledgeVectorStoreService')
const { LibSQLVectorStore } = await import('@vectorstores/libsql')
function createBase(id = 'kb-1') {
function createBase(id = 'kb-1'): KnowledgeBase {
return {
id,
name: 'KB',
groupId: null,
emoji: '📁',
dimensions: 1024,
embeddingModelId: 'ollama::nomic-embed-text',
status: 'completed',
error: null,
chunkSize: DEFAULT_KNOWLEDGE_BASE_CHUNK_SIZE,
chunkOverlap: DEFAULT_KNOWLEDGE_BASE_CHUNK_OVERLAP,
searchMode: 'hybrid',
createdAt: '2026-04-08T00:00:00.000Z',
updatedAt: '2026-04-08T00:00:00.000Z'
}
@@ -184,6 +196,23 @@ describe('KnowledgeVectorStoreService', () => {
expect(loggerInfoMock).toHaveBeenCalledWith('Opening existing vector store from disk', { baseId: base.id })
})
it('rejects failed bases with null dimensions before touching the provider', async () => {
const service = new KnowledgeVectorStoreService()
const base = {
...createBase(),
dimensions: null,
embeddingModelId: null,
status: 'failed',
error: 'missing_embedding_model'
} satisfies KnowledgeBase
await expect(service.createStore(base)).rejects.toThrow('not ready for vector store operations')
await expect(service.getStoreIfExists(base)).rejects.toThrow('not ready for vector store operations')
expect(providerCreateMock).not.toHaveBeenCalled()
expect(providerExistsMock).not.toHaveBeenCalled()
})
it('returns the cached store from getStoreIfExists without probing the provider', async () => {
const service = new KnowledgeVectorStoreService()
const base = createBase()


@@ -4,6 +4,7 @@ import { pathToFileURL } from 'node:url'
import { application } from '@application'
import { loggerService } from '@logger'
import { sanitizeFilename } from '@main/utils/file'
import { DataApiErrorFactory } from '@shared/data/api'
import type { KnowledgeBase } from '@shared/data/types/knowledge'
import type { BaseVectorStore } from '@vectorstores/core'
import { LibSQLVectorStore } from '@vectorstores/libsql'
@@ -14,11 +15,24 @@ const logger = loggerService.withContext('LibSqlVectorStoreProvider')
export class LibSqlVectorStoreProvider implements BaseVectorStoreProvider {
async create(base: KnowledgeBase): Promise<BaseVectorStore> {
if (
base.status !== 'completed' ||
typeof base.dimensions !== 'number' ||
!Number.isInteger(base.dimensions) ||
base.dimensions <= 0
) {
throw DataApiErrorFactory.invalidOperation(
'createLibSqlVectorStore',
`Knowledge base '${base.id}' is not ready for vector store operations`
)
}
const dimensions = base.dimensions
const dbPath = await this.getKnowledgeBaseFilePath(base.id)
return new LibSQLVectorStore({
collection: base.id,
dimensions: base.dimensions,
dimensions,
clientConfig: {
url: pathToFileURL(dbPath).toString()
}


@@ -0,0 +1,6 @@
import type { BaseVectorStore, Document, Metadata } from '@vectorstores/core'
export interface KnowledgeVectorStore extends BaseVectorStore {
listByExternalId(itemId: string): Promise<Document<Metadata>[]>
deleteByIdAndExternalId(chunkId: string, itemId: string): Promise<void>
}


@@ -25,7 +25,14 @@ import type {
UnifiedPreferenceType,
UpgradeChannel
} from '@shared/data/preference/preferenceTypes'
import type { KnowledgeSearchResult as KnowledgeVectorSearchResult } from '@shared/data/types/knowledge'
import type {
CreateKnowledgeBaseDto,
KnowledgeBase,
KnowledgeItemChunk,
KnowledgeRuntimeAddItemInput,
KnowledgeSearchResult as KnowledgeVectorSearchResult,
RestoreKnowledgeBaseDto
} from '@shared/data/types/knowledge'
import type { ExternalAppInfo } from '@shared/externalApp/types'
import { IpcChannel } from '@shared/IpcChannel'
import type { ShortcutPreferenceKey } from '@shared/shortcuts/types'
@@ -345,16 +352,24 @@ const api = {
})
},
knowledgeRuntime: {
createBase: (baseId: string): Promise<void> =>
ipcRenderer.invoke(IpcChannel.KnowledgeRuntime_CreateBase, { baseId }),
createBase: (base: CreateKnowledgeBaseDto): Promise<KnowledgeBase> =>
ipcRenderer.invoke(IpcChannel.KnowledgeRuntime_CreateBase, { base }),
restoreBase: (dto: RestoreKnowledgeBaseDto): Promise<KnowledgeBase> =>
ipcRenderer.invoke(IpcChannel.KnowledgeRuntime_RestoreBase, dto),
deleteBase: (baseId: string): Promise<void> =>
ipcRenderer.invoke(IpcChannel.KnowledgeRuntime_DeleteBase, { baseId }),
addItems: (baseId: string, itemIds: string[]): Promise<void> =>
ipcRenderer.invoke(IpcChannel.KnowledgeRuntime_AddItems, { baseId, itemIds }),
addItems: (baseId: string, items: KnowledgeRuntimeAddItemInput[]): Promise<void> =>
ipcRenderer.invoke(IpcChannel.KnowledgeRuntime_AddItems, { baseId, items }),
deleteItems: (baseId: string, itemIds: string[]): Promise<void> =>
ipcRenderer.invoke(IpcChannel.KnowledgeRuntime_DeleteItems, { baseId, itemIds }),
reindexItems: (baseId: string, itemIds: string[]): Promise<void> =>
ipcRenderer.invoke(IpcChannel.KnowledgeRuntime_ReindexItems, { baseId, itemIds }),
search: (baseId: string, query: string): Promise<KnowledgeVectorSearchResult[]> =>
ipcRenderer.invoke(IpcChannel.KnowledgeRuntime_Search, { baseId, query })
ipcRenderer.invoke(IpcChannel.KnowledgeRuntime_Search, { baseId, query }),
listItemChunks: (baseId: string, itemId: string): Promise<KnowledgeItemChunk[]> =>
ipcRenderer.invoke(IpcChannel.KnowledgeRuntime_ListItemChunks, { baseId, itemId }),
deleteItemChunk: (baseId: string, itemId: string, chunkId: string): Promise<void> =>
ipcRenderer.invoke(IpcChannel.KnowledgeRuntime_DeleteItemChunk, { baseId, itemId, chunkId })
},
window: {
setMinimumSize: (width: number, height: number) =>


@@ -1,62 +1,79 @@
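# Knowledge Backend: Current Implementation
This document only records the backend layering, call boundaries, and runtime orchestration behavior already implemented in `src/main/services/knowledge`.
This document only records the backend layering, call boundaries, and runtime orchestration behavior already implemented in `src/main/services/knowledge` on the current branch.
Its goal is not to describe an ideal design but to state the stable facts of the current code clearly, so that the ongoing v2 refactor can keep converging.
Its goal is not to describe an ideal design but to state the stable facts of the current code clearly, so that the ongoing v2 refactor can keep converging. It does not cover the legacy `src/main/knowledge` / `knowledge-base:*` channels.
## 1. Current architecture diagram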
```text
+----------------------------------------------------------------------------------+
| Callers |
| Callers |
| |
| UI (Data API) UI / preload IPC / main-side calls |
+------------------------------------------+---------------------------------------+
|
+--------------------------+ +-----------------------------------+
| Data API | | KnowledgeOrchestrationService |
| knowledge handlers | | caller-facing workflow facade |
+-------------+------------+ +-----------------+-----------------+
| |
v v
+--------------------------+ +---------------------------+
| KnowledgeBaseService |<---------| KnowledgeItemService |
| base data logic | | item data + status |
+-------------+------------+ +-------------+-------------+
| |
v v
+----------------------+ +---------------------------+
| SQLite / Drizzle | | KnowledgeRuntimeService |
+----------------------+ | runtime execution / queue |
+-------------+-------------+
|
v
+---------------------------+
| reader / chunk / embed / |
| rerank / vectorstore |
+-------------+-------------+
|
v
+------------------------+
| LibSQL vector store |
+------------------------+
| UI (Data API reads/patch) UI / preload IPC / main-side workflow |
+-----------------------------------+----------------------------------------------+
|
v
+-------------------------------+
| Data API |
| knowledge read/update |
+---------------+---------------+
|
v
+-------------------------------+
| KnowledgeBaseService / |<-----------------------------+
| KnowledgeItemService | |
| SQLite business data | |
+---------------+---------------+ |
| |
v |
+-------------------+ |
| SQLite / Drizzle | |
+-------------------+ |
|
+----------------------------------------+ |
| KnowledgeOrchestrationService |----------------------------------+
| caller-facing runtime workflow facade |
+-------------------+--------------------+
|
v
+----------------------------------------+
| KnowledgeRuntimeService |
| prepare/index/search/chunk runtime |
+-------------------+--------------------+
|
v
+----------------------------------------+
| reader / chunk / embed / rerank / |
| KnowledgeVectorStoreService |
+-------------------+--------------------+
|
v
+------------------+
| LibSQL vector DB |
+------------------+
```
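The knowledge backend is currently split into three layers:
The knowledge backend is currently split into four main responsibility layers:
1. `KnowledgeBaseService` / `KnowledgeItemService`
- CRUD for the knowledge business master data in SQLite
- Persists updates to `knowledge_item.status` / `error`
2. `KnowledgeOrchestrationService`
- Orchestrates the caller-facing workflow
- Provides the unified caller-facing IPC
- Chains expand / create / add / delete / search into single-call entry points
3. `KnowledgeRuntimeService`
- Runs the runtime
- Reads and writes the knowledge business master data in SQLite
- Persists updates to `knowledge_item.status` / `phase` / `error`
- Validates that `knowledge_item.data` is consistent with `type`
- Aggregates child item status upward onto container items
2. Data API knowledge handlers
- Expose only the reads and base metadata/config updates that the database can satisfy directly
- Do not perform runtime mutations; do not create or delete vector store artifacts
3. `KnowledgeOrchestrationService`
- Owns the caller-facing runtime IPC and the main-side workflow facade
- Handles create/delete base, normalization of delete/reindex item ids, and forwarding of chunk/search workflows
- Does not run reader/chunk/embed/vector writes directly and does not own the queue
4. `KnowledgeRuntimeService`
- Creates runtime add items; enqueues and runs `prepare-root` / `index-leaf`
- Chains the reader / chunk / embedding / vector store calls
- Handles the queue, interruption, stop cleanup, and search execution
- Handles the queue, interruption, stop cleanup, reindex, and search execution
## 2. Role of the data services
## 2. Role of the data services and the Data API
`src/main/data/services/KnowledgeBaseService.ts` and `src/main/data/services/KnowledgeItemService.ts` are data services.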
@@ -65,14 +82,26 @@
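1. Reads and writes of the SQLite business tables
2. Persisting data after DTO validation
3. Consistency validation between `knowledge_item.data` and `type`
4. Persisting item status and error information
4. Persisting item status, phase, and error information
5. After a leaf item completes or fails, updating the status of its `directory` / `sitemap` container upward
They are not responsible for:
1. Reader scheduling
2. embedding
3. Vector store writes and search
4. Runtime queue management
1. caller-facing runtime IPC
2. reader
3. Embedding calls
4. Vector store writes and search
5. Runtime queue management
The Data API knowledge handlers currently expose only: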
1. `GET /knowledge-bases`
2. `GET /knowledge-bases/:id`
3. `PATCH /knowledge-bases/:id`
4. `GET /knowledge-bases/:id/items`
5. `GET /knowledge-items/:id`
In other words, the Data API currently exposes no knowledge base creation, knowledge base deletion, knowledge item creation, knowledge item deletion, or item status mutation. These operations carry runtime side effects and are all handled by `KnowledgeOrchestrationService`.
## 3. Role of `KnowledgeRuntimeService`
@@ -87,28 +116,33 @@
1. `@Injectable('KnowledgeRuntimeService')`
2. `@ServicePhase(Phase.WhenReady)`
3. Registered in the application service registry
3. `@DependsOn(['KnowledgeVectorStoreService'])`
4. Registered in the application service registry
The core capabilities it currently exposes to internal callers are:
1. `createBase(base)`
1. `createBase(baseId)`
2. `deleteBase(baseId)`
3. `addItems(base, items)`
4. `deleteItems(base, items)`
5. `search(base, query)`
3. `addItems(baseId, inputs)`
4. `deleteItems(baseId, rootItems)`
5. `reindexItems(baseId, rootItems)`
6. `search(baseId, query)`
7. `listItemChunks(baseId, itemId)`
8. `deleteItemChunk(baseId, itemId, chunkId)`
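It is responsible for:
1. Enqueueing and running item-level indexing tasks
2. Advancing `knowledge_item.status` through its finite set of states
3. Writing failure and interruption reasons back to the database
4. Acquiring, deleting, and cleaning up vector store instances
5. Chaining rerank after search
6. Queue interruption and compensating vector cleanup on stop / delete
1. Creating the `knowledge_item` rows passed in by runtime add
2. Enqueueing and running `prepare-root` / `index-leaf` tasks
3. Advancing `knowledge_item.status` / `phase` through their finite set of states
4. Writing failure and interruption reasons back to the database
5. Acquiring, deleting, and cleaning up vector store instances
6. Chaining rerank after search
7. Queue interruption and compensating vector cleanup on stop / delete / reindex
It is not responsible for:
1. Master-data CRUD for `knowledge_base` / `knowledge_item`
1. Master-data CRUD for `knowledge_base`
2. Caller-facing IPC workflow orchestration
3. The external expansion entry point for `directory` / `sitemap` owner items
4. A persistent task queue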
@@ -129,172 +163,138 @@
1. `@Injectable('KnowledgeOrchestrationService')`
2. `@ServicePhase(Phase.WhenReady)`
3. Registered in the application service registry
3. `@DependsOn(['KnowledgeRuntimeService'])`
4. Registered in the application service registry
The core IPC capabilities it currently exposes to callers are:
1. `createBase(baseId)`
1. `createBase(base dto)`
2. `deleteBase(baseId)`
3. `addItems(baseId, itemIds)`
3. `addItems(baseId, item payloads)`
4. `deleteItems(baseId, itemIds)`
5. `search(baseId, query)`
5. `reindexItems(baseId, itemIds)`
6. `search(baseId, query)`
7. `listItemChunks(baseId, itemId)`
8. `deleteItemChunk(baseId, itemId, chunkId)`
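It is responsible for:
1. The unified caller-facing knowledge runtime IPC
2. Reading master data for the item ids passed in
3. Expanding `directory` / `sitemap` owner items internally
4. Persisting expanded child items through `KnowledgeItemService.createMany()`
5. Filtering down to the leaf items that are actually indexable, then handing them to `KnowledgeRuntimeService.addItems()`
6. Coordinating the call order between the runtime and the data services
2. On create base, coordinating SQLite base creation and vector store creation
3. On delete base, running runtime cleanup first and then deleting the SQLite base
4. Reading master data for the item ids passed to delete / reindex / chunk operations
5. On delete / reindex, normalizing the passed ids into top-level roots
6. Deleting the SQLite root rows after runtime cleanup completes
It is not responsible for:
1. Running reader / chunk / embed / vector writes directly
2. Owning the queue directly
3. Owning vector store instances directly
4. Expanding `directory` / `sitemap`
5. Creating expanded child items
## 4. Current call boundaries and caller contract
### 4.1 UI
### 4.1 UI / preload
The current v2 runtime call model is: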
```text
UI
|
+--> Data API -> knowledge handlers -> KnowledgeBaseService / KnowledgeItemService
+--> Data API
| -> list/get knowledge bases
| -> patch base metadata/config
| -> list/get knowledge items
|
\--> preload IPC -> KnowledgeOrchestrationService
-> KnowledgeRuntimeService
\--> preload knowledgeRuntime IPC
-> create/delete base
-> add/delete/reindex items
-> search
-> list/delete chunks
```
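The current implementation requires callers to clearly distinguish two call paths:
When adding a file / url / note / directory / sitemap, callers should go directly through:
1. Data API
- Persistent CRUD for `knowledge_base` / `knowledge_item`
- Master-data creation for owner items / leaf items that the caller creates explicitly
- Persistent reads and writes of `knowledge_item.status` / `error`
2. runtime IPC
- The unified knowledge workflow entry point
- Expanding `directory` / `sitemap` inside the main process when needed
- Index enqueueing, vector writes, and deletion
- Search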
```text
caller
-> preload IPC add-items(item payloads)
```
The stable interfaces on the Data API side are currently:
1. `GET /knowledge-bases`
2. `POST /knowledge-bases`
3. `GET /knowledge-bases/:id`
4. `PATCH /knowledge-bases/:id`
5. `DELETE /knowledge-bases/:id`
6. `GET /knowledge-bases/:id/items`
7. `POST /knowledge-bases/:id/items`
8. `GET /knowledge-items/:id`
9. `PATCH /knowledge-items/:id`
10. `DELETE /knowledge-items/:id`
The runtime IPC channels already exposed by preload are:
1. `knowledge-runtime:create-base`
2. `knowledge-runtime:delete-base`
3. `knowledge-runtime:add-items`
4. `knowledge-runtime:delete-items`
5. `knowledge-runtime:search`
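Callers no longer need to create items through the Data API first, nor pass the created item ids on to runtime `addItems`.
### 4.1.1 Call chain for leaf items
For directly indexable leaf items such as `file` / `url` / `note`, callers should use: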
```text
caller
-> Data API create item(s)
-> get created item ids
-> preload IPC add-items(item ids)
```
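In other words:
1. First create the persistent `knowledge_item` through the Data API
2. Then pass the item ids returned by the Data API to runtime `addItems`
3. The runtime does not create leaf item master data on the caller's behalf
4. The input semantics of runtime `addItems` are "item ids that already exist in SQLite"
When adding files in bulk, the current contract is: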
```text
caller
-> Data API create file items
-> get created file item ids
-> preload IPC add-items(file item ids)
-> preload IPC add-items(leaf item payloads)
-> runtime creates leaf items
-> leaf status = processing, phase = null
-> enqueue index-leaf
```
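### 4.1.2 Call chain for container items
`directory` / `sitemap` have now been consolidated into the same "two-step call model" as leaf items.
Callers should currently use:
`directory` / `sitemap` have now been consolidated into the same runtime call model as leaf items.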
```text
caller
-> Data API create owner item
-> preload IPC add-items(owner item ids)
-> preload IPC add-items(owner item payloads)
-> runtime creates root item
-> root status = processing, phase = preparing
-> enqueue prepare-root(root id)
-> prepare-root expands owner
-> prepare-root creates child items
-> prepare-root enqueues index-leaf(child leaf ids)
-> clear root phase
-> reconcile container status from children
```
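In other words:
1. Master-data creation for owner items still goes through the Data API
2. The external IPC no longer exposes `expand*`; instead `KnowledgeOrchestrationService.addItems()` inspects the owner item type internally
3. If a `directory` / `sitemap` owner item is passed in, orchestration will:
- Expand the owner
- Persist child items through `KnowledgeItemService.createMany()`
- Filter out the indexable leaf items
- Call `KnowledgeRuntimeService.addItems()` to enqueue indexing
4. `groupId` / `groupRef` are still responsible for writing the persistent owner / child / nested child relationships into `knowledge_item`
5. Callers no longer need to run the four steps "expand -> create children -> filter -> add" explicitly themselves
This boundary is a hard constraint of the current implementation:
1. Expand is still responsible for generating the persistent items to create
2. Child items are still persisted to SQLite through `KnowledgeItemService.createMany()`
3. `KnowledgeRuntimeService` still only orchestrates read / chunk / embedding / vector write for indexable items
4. Orchestration merely folds the steps above into a single caller-facing IPC call; it does not change the final data/runtime boundary
5. A mixed batch can be used to persist the tree structure, but that does not mean a mixed batch can enter the runtime indexing queue directly
1. Expansion happens only inside the runtime `prepare-root` task
2. Child items are persisted to SQLite by the prepare helper through `KnowledgeItemService.create()`
3. `KnowledgeRuntimeService` owns the queue lifecycle for both root preparation and leaf indexing
4. Orchestration is only the caller-facing workflow facade and takes no part in preparation details
5. A mixed batch may contain both leaf and root container payloads, but it is ultimately split into the two queue task kinds, `prepare-root` / `index-leaf`
This call chain still fits the layering "data services own master data, the runtime owns indexing execution, orchestration consolidates the workflow" and is not boundary drift.
### 4.1.3 Nested container constraint
The current internal flow for `directory` / `sitemap` can be spelled out further as:
The current product constraint is that callers may not add a `directory` / `sitemap` as a user-supplied child of another item.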
```text
directory/sitemap
-> Data API create owner
-> IPC add-items(owner item ids)
-> orchestration expand owner
-> orchestration create expanded items
-> orchestration filter indexable leaf items
-> runtime add-items(indexable child items)
```
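The only allowed container sources are:
### 4.1.3 Current constraints on the delete path
1. Top-level `directory` / `sitemap` roots that the user adds through `addItems()`
2. Nested `directory` rows that directory expansion creates internally to preserve the directory hierarchy
Deletion likewise needs to distinguish persistent deletion from runtime deletion.
The disallowed sources are:
1. A `directory` / `sitemap` that the user explicitly creates with another item as its parent
2. A `sitemap` that the user places under another `directory` / `sitemap` as a descendant root that can be prepared independently
This constraint affects the review boundary for delete / reindex:
1. The runtime currently only needs to interrupt the passed-in roots and the descendants found by a fresh query
2. There is no need to add a stable-loop interrupt for the future nested-container scenario in which a descendant `prepare-root` keeps publishing new leaves after the snapshot
3. If user-added nested `directory` / `sitemap` is opened up in the future, the interrupt/reconcile semantics must be redesigned before this input capability is enabled
### 4.1.4 Current constraints on the delete path
When deleting an item, callers should treat it as two independent things:
1. runtime IPC `delete-items`
- Enters the delete workflow through orchestration
- Interrupts pending / running add tasks
- Normalizes the passed ids into top-level roots
- Interrupts the root's `prepare-root` / `index-leaf`
- Queries descendants fresh
- Interrupts the descendants' `prepare-root` / `index-leaf`
- Deletes the vectors of the item and its cascaded children
2. Data API `DELETE /knowledge-items/:id`
- Deletes the `knowledge_item` in SQLite
- Relies on the database cascade to delete grouped descendants
2. Orchestration deletes the SQLite root rows after runtime cleanup
- The database cascade deletes grouped descendants
When deleting a base, callers likewise need to distinguish two steps:
1. runtime IPC `delete-base`
- Enters the delete workflow through orchestration
- Interrupts the related add tasks under that base
- Deletes the corresponding vector store
2. Data API `DELETE /knowledge-bases/:id`
- Deletes the base and its associated items in SQLite
Deleting a base first interrupts and waits for that base's runtime work, then deletes the SQLite base and its associated items.
After the SQLite deletion succeeds, the base's vector artifacts are deleted on a best-effort basis; an artifact cleanup failure is only logged and does not roll back the completed SQLite deletion.
Under the current implementation, a Data API deletion neither cleans up the vector store nor interrupts runtime tasks on the caller's behalf.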
@@ -302,21 +302,46 @@ base 删除时,调用方同样需要区分两步:
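Other main-process modules that need caller-facing workflow capabilities should prefer `KnowledgeOrchestrationService`.
A main-process module that already holds leaf items and only needs low-level indexing execution can call `KnowledgeRuntimeService` directly.
A main-process module that only needs SQLite master-data reads and writes should call `KnowledgeBaseService` / `KnowledgeItemService` directly.
A main-process module that needs business master-data capabilities should call `KnowledgeBaseService` / `KnowledgeItemService` directly.
## 5. Base workflow
## 5. Current queue model
Current flow of `createBase(dto)`:
### 5.1 Implemented behavior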
```text
IPC create-base(CreateKnowledgeBaseDto)
-> KnowledgeBaseService.create(dto)
-> KnowledgeRuntimeService.createBase(base.id)
-> KnowledgeVectorStoreService.createStore(base)
-> return created base
```
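The current implementation uses one custom in-process add queue:
If vector store initialization fails, orchestration calls `KnowledgeBaseService.delete(base.id)` to roll back the SQLite base it just created, then rethrows the original error to the caller.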
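The create-then-roll-back order described above can be sketched as follows. This is a minimal illustration, not the project's code: `BaseStore`, `Runtime`, and `createBaseWithRollback` are hypothetical stand-ins that model only the calls named in the flow.

```typescript
// Hypothetical stand-ins for the real services; only the calls named in the flow are modelled.
interface Base {
  id: string
  name: string
}

interface BaseStore {
  create(dto: { name: string }): Promise<Base>
  delete(id: string): Promise<void>
}

interface Runtime {
  createBase(baseId: string): Promise<void>
}

async function createBaseWithRollback(dto: { name: string }, bases: BaseStore, runtime: Runtime): Promise<Base> {
  // 1. Persist the SQLite row first.
  const base = await bases.create(dto)
  try {
    // 2. Then initialise the vector store for that base.
    await runtime.createBase(base.id)
    return base
  } catch (error) {
    // 3. Roll back the row that was just created.
    await bases.delete(base.id)
    // 4. Surface the original error to the caller.
    throw error
  }
}
```

The sketch does not decide what should happen if the rollback delete itself fails; the text above does not specify that case.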
Current flow of `deleteBase(baseId)`:
```text
IPC delete-base(baseId)
-> KnowledgeRuntimeService.deleteBase(baseId)
-> KnowledgeBaseService.delete(baseId)
-> KnowledgeRuntimeService.deleteBaseArtifacts(baseId)
```
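The runtime deletion phase first interrupts the pending / running runtime tasks under the base, waits for running tasks to settle, and returns the interrupted item ids.
The data service then deletes the SQLite base and its associated items.
After the SQLite deletion succeeds, orchestration calls artifact cleanup to delete the base's vector store; a failure in this cleanup is only logged.
If the SQLite deletion fails, orchestration marks the interrupted items as failed and then rethrows the SQLite deletion error to the caller.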
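The ordering and the two different failure policies in this delete flow can be sketched as below. All names (`DeleteDeps`, `deleteBaseWorkflow`, and the dependency functions) are hypothetical and chosen for illustration; the failure reason string is also an assumption.

```typescript
// Hypothetical dependency bundle; each function stands in for one step of the flow.
interface DeleteDeps {
  interruptBase(baseId: string): Promise<string[]> // resolves with the interrupted item ids
  deleteBaseRow(baseId: string): Promise<void>
  deleteArtifacts(baseId: string): Promise<void>
  markItemsFailed(itemIds: string[], reason: string): Promise<void>
  logError(message: string, error: unknown): void
}

async function deleteBaseWorkflow(baseId: string, deps: DeleteDeps): Promise<void> {
  // Interrupt runtime work first and remember which items were affected.
  const interrupted = await deps.interruptBase(baseId)

  try {
    await deps.deleteBaseRow(baseId)
  } catch (error) {
    // The rows still exist, so the interrupted items are marked failed before rethrowing.
    if (interrupted.length > 0) {
      await deps.markItemsFailed(interrupted, 'interrupted by base deletion')
    }
    throw error
  }

  try {
    // Best-effort: a failure here is logged and never rolls back the SQLite deletion.
    await deps.deleteArtifacts(baseId)
  } catch (error) {
    deps.logError(`Failed to delete vector artifacts for base ${baseId}`, error)
  }
}
```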
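## 6. Current queue model
The current implementation uses one in-process runtime queue:
1. The queue is owned by `KnowledgeRuntimeService`
2. The queue is a single-instance in-memory queue
3. Default `concurrency = 5`
4. Add item tasks of all bases share this one queue
5. Delete does not enter the queue; it first interrupts the related add tasks and then deletes vectors directly
4. Runtime tasks of all bases share this one queue
5. Queue tasks are either `prepare-root` or `index-leaf`
6. delete / reindex do not enter the queue; they first interrupt the related runtime tasks and then delete vectors directly
The current implementation does not implement the following old design assumptions: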
@@ -324,132 +349,137 @@ base 删除时,调用方同样需要区分两步:
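2. It is not a round-robin scheduler
3. There is no global persistent task table
### 5.2 Currently observable state
The queue maintains an internal `entries` map; each entry records:
Internally the queue currently maintains an `entries` map; each entry records: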
1. `base`
2. `baseId`
3. `itemId`
4. `kind = prepare-root | index-leaf`
5. `status = pending | running | settled`
6. `controller`
7. `promise`
8. `runPromise`
9. `interruptError`
1. `item.id`
2. `status = pending | running`
3. `controller`
4. `promise`
5. `interruptedBy`
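This state is used only to:
Its only purpose is to:
1. Track which add tasks are still waiting to run
2. Track which add tasks are running
3. Interrupt the corresponding tasks on delete / shutdown
1. Track which runtime tasks are still waiting to run
2. Track which runtime tasks are running
3. Interrupt the corresponding tasks on delete / reindex / shutdown
4. On shutdown, identify which items were interrupted and compensate by marking them failed
All of this state is purely an internal runtime implementation detail and is not part of the external data model.
It is not part of the external data model.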
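A single in-memory queue with an `entries` map, a concurrency limit, per-item de-duplication, and `AbortController`-based interruption could look roughly like the sketch below. `TinyRuntimeQueue` is a hypothetical class written for illustration; it is not the project's `KnowledgeQueueManager`, and it tracks fewer fields than the entry described above.

```typescript
type TaskKind = 'prepare-root' | 'index-leaf'
type Task = (signal: AbortSignal) => Promise<void>

interface Entry {
  itemId: string
  kind: TaskKind
  status: 'pending' | 'running'
  controller: AbortController
  promise: Promise<void>
}

// Hypothetical sketch of a single shared in-memory queue.
class TinyRuntimeQueue {
  private entries = new Map<string, Entry>()
  private waiting: Array<() => void> = []
  private running = 0

  constructor(private readonly concurrency = 5) {}

  enqueue(itemId: string, kind: TaskKind, task: Task): Promise<void> {
    const existing = this.entries.get(itemId)
    if (existing) return existing.promise // de-duplicate: reuse the in-flight promise

    const controller = new AbortController()
    const entry: Entry = { itemId, kind, status: 'pending', controller, promise: Promise.resolve() }
    const cleanup = (): void => {
      this.entries.delete(itemId)
    }
    entry.promise = this.run(entry, task).then(cleanup, (error) => {
      cleanup()
      throw error
    })
    this.entries.set(itemId, entry)
    return entry.promise
  }

  // Aborts a pending or running task; returns false when the item has no entry.
  interrupt(itemId: string): boolean {
    const entry = this.entries.get(itemId)
    if (!entry) return false
    entry.controller.abort()
    return true
  }

  private async run(entry: Entry, task: Task): Promise<void> {
    if (this.running >= this.concurrency) {
      // Wait until a finishing task hands its slot over.
      await new Promise<void>((resolve) => this.waiting.push(resolve))
    } else {
      this.running++
    }
    try {
      if (entry.controller.signal.aborted) {
        throw new Error(`Task for ${entry.itemId} was interrupted before it started`)
      }
      entry.status = 'running'
      await task(entry.controller.signal)
    } finally {
      const next = this.waiting.shift()
      if (next) next() // hand the slot to the next waiter
      else this.running-- // or release it
    }
  }
}
```

Handing the slot directly to the next waiter, instead of decrementing and re-incrementing the counter, keeps a newly enqueued task from slipping in between and exceeding the limit.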
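### 5.3 Enqueue behavior
## 7. Current indexing execution path
Current behavior of `addItems(base, items)`:
1. For each item passed in, first write `status = pending` individually
2. Clear the item's old `error` at the same time
3. Each item is enqueued as an add task as soon as its own status write succeeds
4. If the same item is already pending or running, enqueueing it again reuses the existing promise instead of enqueueing a duplicate
5. The current implementation is not an atomic batch-start model in which the whole batch's status is persisted first and enqueueing then begins together
6. As a result, if one item fails before its `pending` write or enqueue, other items that have already started may continue to run
Current behavior of `deleteItems(base, items)`:
1. Does not update item status
2. First interrupts pending / running add tasks with the same id
3. Waits for the related running add tasks to settle
4. Deletes the vectors of these items directly
Currently present:
1. Item-level add deduplication
2. A mechanism for delete / stop to interrupt add tasks
Currently absent:
1. A priority queue
2. A pause / resume API
3. Automatic retry
## 6. Current indexing execution path
A single indexing run for one `knowledge_item` currently is:
A single indexing run for one leaf `knowledge_item` currently is: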
```text
addItems
-> status = pending
-> queue task
-> create leaf item
-> status = processing, phase = null
-> queue task index-leaf
-> phase = reading
-> loadKnowledgeItemDocuments(item)
-> chunkDocuments(base, item, documents)
-> phase = embedding
-> getEmbedModel(base)
-> embedDocuments(model, chunks)
-> vectorStore.add(nodes)
-> status = completed
-> runWithBaseWriteLock
-> KnowledgeVectorStoreService.createStore(base)
-> vectorStore.add(nodes)
-> status = completed, phase = null
```
When any step throws:
When any non-interruption error is thrown:
```text
catch error
-> status = failed
-> logger.error(...)
-> best-effort cleanup vectors
-> status = failed, phase = null
-> error = normalizedError.message
-> rethrow the error
```
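The execution path for `fileProcessorId` is not implemented yet. That part of the code is still `// todo file processing`.
A single preparation run for a `directory` / `sitemap` currently is:
## 7. Current implementation boundary of `knowledge_item.status`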
```text
addItems
-> create root item
-> root status = processing, phase = preparing
-> queue task prepare-root
-> expand directory/sitemap with queue AbortSignal
-> create child items
-> child leaf status = processing
-> child directory status = processing, phase = preparing
-> enqueue child leaf index-leaf
-> root phase = null
-> reconcile root/container statuses from children
```
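### 7.1 Enum definition
When preparation is interrupted:
The schema and shared types still keep the full set of states:
1. The queue signal is checked before and after expand I/O, at loop boundaries, and at child-create boundaries
2. The prepare task stops publishing new stale leaf tasks
3. Cleanup is handled uniformly by the runtime interrupt flow
4. Already-created roots / descendants are marked `failed`, or are deleted by the SQLite cascade in the delete flow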
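Checking the queue signal at loop and child-create boundaries can be sketched as follows. `assertNotInterrupted`, `buildChildren`, and `ChildInput` are hypothetical names for illustration; the project's actual prepare helper is not part of this document.

```typescript
// Hypothetical child payload, mirroring the sitemap child fields (groupId, type, data.source, data.url).
interface ChildInput {
  groupId: string
  type: 'url'
  data: { source: string; url: string }
}

// Throws as soon as the queue has signalled an interrupt.
function assertNotInterrupted(signal: AbortSignal): void {
  if (signal.aborted) {
    throw new Error('Preparation interrupted')
  }
}

function buildChildren(
  ownerId: string,
  urls: string[],
  signal: AbortSignal,
  onCreate: (child: ChildInput) => void
): number {
  let created = 0
  for (const url of urls) {
    assertNotInterrupted(signal) // loop boundary: stop before creating the next child
    onCreate({ groupId: ownerId, type: 'url', data: { source: url, url } })
    created++
  }
  assertNotInterrupted(signal) // final check before any leaf task would be published
  return created
}
```

Because the check runs before each child is created, an interrupt that arrives mid-expansion leaves at most the already-created children behind, which the interrupt flow described above then cleans up.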
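`fileProcessorId` is kept in the schema/config, but the runtime processing path does not use this setting yet.
## 8. Current implementation boundary of `knowledge_item.status` / `phase`
`status` currently expresses the overall state: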
1. `idle`
2. `pending`
3. `file_processing`
4. `read`
5. `embed`
6. `completed`
7. `failed`
2. `processing`
3. `completed`
4. `failed`
### 7.2 What the runtime actually writes today
The `phase` field currently allows the following values:
The only states `KnowledgeRuntimeService` actually writes today are:
1. `null`
2. `preparing`
3. `reading`
4. `embedding`
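1. Writes `pending` before enqueueing
2. Writes `completed` on success
3. Writes `failed` on any failure or shutdown interruption
The active states `KnowledgeRuntimeService` currently writes are:
1. `processing, phase = preparing`: a `directory` / `sitemap` root or nested directory is expanding / creating children
2. `processing, phase = reading`: a leaf is reading source documents
3. `processing, phase = embedding`: a leaf is embedding / writing to the vector store
4. `completed, phase = null`: leaf indexing finished, or a container has no active children
5. `failed, phase = null`: a runtime task failed, interrupt cleanup failed, or shutdown-interruption compensation ran
In other words:
1. `file_processing` / `read` / `embed` are still reserved states for now
2. They are in the schema, but the runtime does not advance to these intermediate states yet
1. `status` no longer carries phase semantics such as `read` / `embed`
2. `phase` is internal runtime progress and should not be exposed externally through the generic Data API update DTO
3. A container's final status is reconciled bottom-up from its own phase and its children's statuses
This part must be stated explicitly in the document, because the old document treated these states as an "already implemented progression", which the implementation does not match.
The core problem this split solves is that `processing/read/embed` no longer express both the overall state and the runtime phase at once, and directory/sitemap preparation and children indexing are no longer mixed into one field.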
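The status / phase split and the bottom-up reconcile can be sketched with two small types and one function. This is an illustration under stated assumptions: `reconcileContainer` is a hypothetical name, an "active" child is taken to mean one whose status is `processing`, and the effect of failed children on a container is not specified above, so the sketch does not model it.

```typescript
type ItemStatus = 'idle' | 'processing' | 'completed' | 'failed'
type ItemPhase = null | 'preparing' | 'reading' | 'embedding'

interface ItemState {
  status: ItemStatus
  phase: ItemPhase
}

// Hypothetical reconcile rule for a container item (directory / sitemap).
function reconcileContainer(own: ItemState, children: ItemState[]): ItemState {
  // While the container itself is still expanding, it stays in its preparing phase.
  if (own.phase === 'preparing') {
    return { status: 'processing', phase: 'preparing' }
  }
  // Assumption: a child counts as active when its status is `processing`.
  if (children.some((child) => child.status === 'processing')) {
    return { status: 'processing', phase: null }
  }
  // No active children left: the container is completed.
  return { status: 'completed', phase: null }
}
```

Keeping `phase` out of `status` means the container can report `processing` for two different reasons (its own preparation, or its children's indexing) without inventing extra status values.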
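## 8. Lifecycle behavior
## 9. Lifecycle behavior
`KnowledgeRuntimeService` is already integrated with the lifecycle system; its current behavior is as follows:
`KnowledgeRuntimeService` and `KnowledgeVectorStoreService` are already integrated with the lifecycle system.
### 8.1 `onInit`
### 9.1 `KnowledgeRuntimeService.onInit`
It currently does two things:
It currently does one thing: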
1. `isStopping = false`
2. `addQueue.reset()`
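1. Recreates the in-process `KnowledgeQueueManager`
There is currently no startup logic to "scan intermediate states and compensate failures".
There is currently no startup logic to "scan intermediate states and compensate failures" or to "automatically resume indexing tasks".
### 8.2 `onStop`
### 9.2 `KnowledgeRuntimeService.onStop`
The current stop flow is:
1. `isStopping = true`
2. Calls `addQueue.interruptAll('stop', SHUTDOWN_INTERRUPTED_REASON)`
3. Collects the interrupted entries and itemIds
4. Waits for the related running add tasks to settle
5. Best-effort deletes the vectors already written by these interrupted items
1. Calls `queue.interruptAll(SHUTDOWN_INTERRUPTED_REASON)`
2. Collects the interrupted `prepare-root` / `index-leaf` entries
3. Waits for the related running tasks to settle
4. For `index-leaf`, cleans up the corresponding leaf vectors
5. For `prepare-root`, queries descendants fresh and cleans up the root / descendant vectors
6. Writes these items as `failed` in bulk
This means: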
@@ -458,75 +488,96 @@ schema 和共享类型仍然保留完整状态集合:
2. Leftover vectors of interrupted items are currently cleaned up on stop
3. But there is no automatic recovery after restart
## 9. Current boundaries of Reader / Chunk / Embed / Search
### 9.3 `KnowledgeVectorStoreService.onStop`
### 9.1 Reader
The current stop flow is:
The reader is dispatched by `loadKnowledgeItemDocuments(item)` on `item.type`:
1. Iterates over the cached vector stores
2. Calls `client().close()` on each `LibSQLVectorStore`
3. Clears `instanceCache`
## 10. Current boundaries of Reader / Chunk / Embed / Search
### 10.1 Reader
The reader is dispatched by `loadKnowledgeItemDocuments(item)` on the leaf `item.type`:
1. `file` -> `KnowledgeFileReader`
2. `url` -> `KnowledgeUrlReader`
3. `note` -> `KnowledgeNoteReader`
4. `sitemap` -> `KnowledgeSitemapReader`
5. `directory` -> `KnowledgeDirectoryReader`
当前 runtime reader 不直接索引 `directory` / `sitemap`。这两类 item 必须先通过 `prepare-root` 展开成 `file` / `url` leaf items 后再进入 indexing。
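Actual behavior of each reader today:
1. `file`
- Selects a reader by file extension
- Already supports `pdf` / `csv` / `docx` / `epub` / `json` / `md` / `draftsexport`
- Already supports `.pdf` / `.csv` / `.docx` / `.epub` / `.json` / `.md` / `.draftsexport`
- Other extensions fall back to `TextFileReader`
- metadata keeps `source`
2. `url`
- Fetches markdown through `https://r.jina.ai/<url>`
- Metadata keeps `itemId` / `itemType` / `sourceUrl` / `name`
- Supports `AbortSignal`
- metadata keeps `source`
3. `note`
- Wraps `content` directly into a single `Document`
- metadata keeps `source`
4. `sitemap`
- The `KnowledgeSitemapReader` code path is still kept
- But the runtime side does not index `sitemap` items directly for now
- The caller first creates the sitemap owner, then expands it into concrete `url` items through runtime IPC, and only then enters the indexing flow
5. `directory`
- Currently serves only as a container placeholder
- The reader logs a warning and returns an empty array
- In other words, it does not produce indexable documents directly; the caller must first create the directory owner and then expand it into concrete child items through runtime IPC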
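
The extension-based selection for `file` items can be illustrated with a small sketch. The extension list and the `TextFileReader` fallback come from the list above; the helper names, the lowercasing of extensions, and the `reader:<ext>` labels are assumptions made only for this example.

```typescript
// Extensions the document lists as having a dedicated reader.
const DEDICATED_EXTENSIONS = ['.pdf', '.csv', '.docx', '.epub', '.json', '.md', '.draftsexport']

// Hypothetical helper: returns the lowercased extension of a bare file name, or ''.
function extensionOf(fileName: string): string {
  const dot = fileName.lastIndexOf('.')
  // dot <= 0 covers both "no extension" and dotfiles such as ".gitignore".
  return dot <= 0 ? '' : fileName.slice(dot).toLowerCase()
}

// Hypothetical helper: picks a reader label, falling back to TextFileReader.
function pickFileReader(fileName: string): string {
  const ext = extensionOf(fileName)
  return DEDICATED_EXTENSIONS.indexOf(ext) >= 0 ? 'reader:' + ext : 'TextFileReader'
}
```

The helper expects a bare file name rather than a full path; a real implementation would more likely use the platform path utilities.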
### 9.2 Chunk
### 10.2 Chunk
`chunkDocuments(base, item, documents)` currently does the following:
1. Uses `SentenceSplitter`
2. Reads `base.chunkSize` and `base.chunkOverlap`
3. Writes metadata for each chunk:
- the original document metadata
- `itemId`
- `itemType`
- `sourceDocumentIndex`
- `chunkIndex`
- `chunkCount`
- `tokenCount`
### 9.3 Embed
`KnowledgeChunkMetadataSchema` currently requires metadata to contain:
1. `itemId`
2. `itemType`
3. `source`
4. `chunkIndex`
5. `tokenCount`
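
The required metadata fields can be pictured as a plain type plus a runtime guard. This is a hand-written stand-in, not the real `KnowledgeChunkMetadataSchema`; the field types (non-empty strings, non-negative numbers) are assumptions.

```typescript
// Assumed shape of the five required metadata fields.
interface KnowledgeChunkMetadata {
  itemId: string
  itemType: string
  source: string
  chunkIndex: number
  tokenCount: number
}

// Hypothetical guard: checks presence and basic types of every required field.
function isKnowledgeChunkMetadata(value: unknown): value is KnowledgeChunkMetadata {
  if (typeof value !== 'object' || value === null) return false
  const record = value as { [key: string]: unknown }
  const { itemId, itemType, source, chunkIndex, tokenCount } = record
  return (
    typeof itemId === 'string' &&
    itemId.length > 0 &&
    typeof itemType === 'string' &&
    itemType.length > 0 &&
    typeof source === 'string' &&
    source.length > 0 &&
    typeof chunkIndex === 'number' &&
    chunkIndex >= 0 &&
    typeof tokenCount === 'number' &&
    tokenCount >= 0
  )
}
```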
### 10.3 Embed
`getEmbedModel(base)` currently only supports:
1. Parsing `providerId::modelId` from `embeddingModelId`
2. Accepting only `providerId === 'ollama'`
3. Obtaining the embedding model through `createOllama().textEmbeddingModel(modelId)`
Other providers currently throw an error directly.
Other providers currently throw an error directly. An empty `embeddingModelId` also throws.
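
The parsing and provider check described above can be sketched as below. `parseEmbeddingModelId` and the error messages are hypothetical; splitting on the first `::` is also an assumption, since the document does not say how model ids containing `::` are handled.

```typescript
interface EmbeddingModelRef {
  providerId: string
  modelId: string
}

// Hypothetical helper mirroring the three rules: non-empty id,
// `providerId::modelId` format, and `ollama` as the only accepted provider.
function parseEmbeddingModelId(embeddingModelId: string | null | undefined): EmbeddingModelRef {
  if (!embeddingModelId) {
    throw new Error('embeddingModelId is empty')
  }
  const separatorIndex = embeddingModelId.indexOf('::')
  if (separatorIndex <= 0 || separatorIndex + 2 >= embeddingModelId.length) {
    throw new Error('Invalid embeddingModelId: ' + embeddingModelId)
  }
  const providerId = embeddingModelId.slice(0, separatorIndex)
  const modelId = embeddingModelId.slice(separatorIndex + 2)
  if (providerId !== 'ollama') {
    throw new Error('Unsupported embedding provider: ' + providerId)
  }
  return { providerId, modelId }
}
```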
`embedDocuments(model, documents)` currently:
`embedDocuments(model, documents, signal)` currently:
1. Generates embeddings in batch with `embedMany`
2. Constructs `TextNode`
3. Writes `itemId` back on `NodeRelationship.SOURCE`
2. Supports passing an `AbortSignal` to the AI SDK
3. Constructs `TextNode`
4. Writes `itemId` and metadata back on `NodeRelationship.SOURCE`
### 9.4 Search
### 10.4 Search
The current `search(base, query)` chain is:
The current `search(baseId, query)` chain is:
```text
embed query
getEmbedModel(base)
-> embed query with embedMany
-> KnowledgeVectorStoreService.createStore(base)
-> vectorStore.query(...)
-> map nodes into KnowledgeSearchResult[]
-> rerankKnowledgeSearchResults(base, query, results)
-> optional rerankKnowledgeSearchResults(base, query, results)
```
Query parameters come from the base:
@@ -535,7 +586,9 @@ embed query
2. `similarityTopK = base.documentCount ?? 10`
3. `alpha = base.hybridAlpha`
### 9.5 Current Real State of Rerank
If the query embedding is empty, it throws `Failed to embed search query: model returned empty result`.
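
The empty-embedding guard and the `documentCount ?? 10` default can be sketched together. `SearchDeps` and `searchSketch` are hypothetical names, and treating both a missing embedding and a zero-length vector as "empty" is an assumption; only the error message and the top-K default come from the document.

```typescript
// Hypothetical dependencies standing in for embedMany and vectorStore.query.
interface SearchDeps {
  embedQuery(query: string): Promise<number[][]>
  queryStore(embedding: number[], similarityTopK: number): Promise<string[]>
}

async function searchSketch(
  deps: SearchDeps,
  query: string,
  documentCount: number | null | undefined
): Promise<string[]> {
  const embeddings = await deps.embedQuery(query)
  const queryEmbedding = embeddings[0]
  // Assumption: "empty" covers both no embedding at all and a zero-length vector.
  if (!queryEmbedding || queryEmbedding.length === 0) {
    throw new Error('Failed to embed search query: model returned empty result')
  }
  // similarityTopK = base.documentCount ?? 10, as stated in the query parameters.
  const similarityTopK = documentCount ?? 10
  return deps.queryStore(queryEmbedding, similarityTopK)
}
```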
### 10.5 Current Real State of Rerank
The rerank code path already exists, but runtime configuration resolution is not wired up yet:
@@ -545,15 +598,16 @@ embed query
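In other words, rerank is "a code shell that exists but is not actually enabled yet".
## 10. Boundaries of `KnowledgeVectorStoreService`
## 11. Boundaries of `KnowledgeVectorStoreService`
`KnowledgeVectorStoreService` is currently responsible for minimal caching and lifecycle management of runtime vector stores.
It is responsible for:
1. Creating or reusing a store by `base.id`
2. Deleting the store file of a single base
3. Closing all cached stores on shutdown
2. Opening an existing on-disk store on demand
3. Deleting the store file of a single base
4. Closing all cached stores on shutdown
Its important current constraints are: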
@@ -563,10 +617,12 @@ embed query
The actual provider today is `LibSqlVectorStoreProvider`:
1. The vector file path is `application.getPath('feature.knowledgebase.data', <sanitizedBaseId>)`
2. Deleting a base also deletes the corresponding file
1. The vector file path is `application.getPath('feature.knowledgebase.data', sanitizeFilename(baseId, '_'))`
2. The collection uses `base.id`
3. dimensions uses `base.dimensions`
4. Deleting a base also deletes the corresponding file
## 11. What Is Explicitly Out of Scope Today
## 12. What Is Explicitly Out of Scope Today
The current implementation does not do:
@@ -578,19 +634,19 @@ embed query
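6. Automatically resuming and continuing indexing
7. Automatic retry
8. A chunk-level queue
9. Implicit automatic expansion of `directory` / `sitemap` items by the runtime inside `addItems`
9. User-added nested `directory` / `sitemap` items
10. A genuinely usable rerank runtime configuration
11. Support for non-`ollama` embedding providers
12. A file-processing chain driven by `fileProcessorId`
## 12. Principles for Future Updates to This Document
## 13. Principles for Future Updates to This Document
This document should be updated only after the following behaviors have actually landed:
1. The runtime queue changes from a single queue to a per-base queue
2. The intermediate statuses `file_processing` / `read` / `embed` actually start being persisted
3. The rerank runtime configuration is actually wired up
4. `fileProcessorId` starts participating in the runtime execution chain
5. The runtime natively takes over implicit expansion and indexing orchestration of `directory` / `sitemap` items in `addItems`
2. The rerank runtime configuration is actually wired up
3. `fileProcessorId` starts participating in the runtime execution chain
4. Users can add nested `directory` / `sitemap` items
5. Queue interrupt changes from the current root + fresh descendants model to a stable-loop or generation/runId model
Until these behaviors land, the document should continue to describe what is currently implemented and should not be written ahead as a target design.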


@@ -14,9 +14,12 @@ This document records the current V2 knowledge target schema, migration constrai
- Persisted columns:
- `id`
- `name`
- `description`
- `groupId`
- `emoji`
- `dimensions`
- `embeddingModelId`
- `status`
- `error`
- `rerankModelId`
- `fileProcessorId`
- `chunkSize`
@@ -37,6 +40,7 @@ This document records the current V2 knowledge target schema, migration constrai
- `type`
- `data`
- `status`
- `phase`
- `error`
- `createdAt`
- `updatedAt`
@@ -71,34 +75,12 @@ This document records the current V2 knowledge target schema, migration constrai
- Current runtime read flows use:
- `GET /knowledge-bases/:id/items` for flat item listing
- optional query filters: `type`, `groupId`
- Current runtime create flow uses:
- `POST /knowledge-bases/:id/items`
- request bodies may carry `groupId`, `ref`, and `groupRef`
- `groupId` may point to an already existing owner item in the same knowledge base
- `ref` is an optional request-local reference key for one newly created item in the current batch
- `groupRef` is an optional request-local owner reference that points to another item's `ref` in the same batch
- `ref` and `groupRef` are request-level helper fields only:
- they are not persisted to SQLite
- they are resolved by the DataApi/service layer before insert
- the persisted relationship is still `groupId = ownerItem.id`
- `groupId` and `groupRef` are mutually exclusive on one item
- `groupRef` must resolve to a `ref` present in the same request batch
- one request batch may therefore create:
- a new owner item and its grouped members together
- a multi-level same-base grouping tree
- the current create contract rejects invalid batch-local grouping:
- duplicate `ref` values in one request batch
- missing `groupRef` targets
- self-references
- cycles within one request batch
- Current runtime update flow uses:
- `PATCH /knowledge-items/:id`
- mutable fields may include `data`, `status`, `error`
- `groupId` is create-only in the current DataApi contract and is not updated through `PATCH /knowledge-items/:id`
- Current delete flow is item-level only:
- `DELETE /knowledge-items/:id`
- when the deleted item is a logical group owner, the database cascade also removes items with `groupId = :id`
- there is still no separate first-class group resource or `DELETE /knowledge-groups/:id` endpoint
- Current runtime write workflows use `KnowledgeOrchestrationService` IPC, not DataApi endpoints:
- add items: normalize caller-friendly inputs, create SQLite rows, and enqueue prepare/index tasks
- delete items: interrupt runtime work, delete vectors, then delete SQLite roots
- reindex items: interrupt runtime work, delete old vectors, rebuild expanded children when needed, then enqueue indexing
- search and chunk mutation: execute against the per-base vector store through runtime IPC
- DataApi remains limited to SQLite-backed reads and knowledge base metadata PATCH.
- Migration from official v1 data does not preserve or infer grouping metadata:
- official v1 exports are flat
- migrated items are inserted with `groupId = null`
@@ -147,59 +129,62 @@ This document records the current V2 knowledge target schema, migration constrai
## Runtime Status Boundary
- `knowledge_item.status` and `knowledge_item.error` remain part of the official V2 business schema.
- `knowledge_item.status`, `knowledge_item.phase`, and `knowledge_item.error` remain part of the official V2 business schema.
- The runtime queue implementation is not part of the schema contract:
- no separate task table
- no persisted queue record
- no scheduler-specific stage column
- no persisted task run id
- Runtime currently uses an in-memory `p-queue` based pipeline in `KnowledgeRuntimeService`.
- The schema-level status set is still:
- The schema-level `status` set is:
- `idle`
- `pending`
- `file_processing`
- `read`
- `embed`
- `processing`
- `completed`
- `failed`
- But the current runtime implementation only persists:
- `pending` before enqueue
- `completed` after successful vector write
- `failed` on any exception or shutdown interruption
- `file_processing`, `read`, and `embed` remain reserved intermediate statuses in the schema and shared types, but are not written by the current runtime yet.
- The schema-level `phase` set is:
- `null`
- `preparing`
- `reading`
- `embedding`
- Current runtime writes:
- `processing, phase = preparing` while a `directory` / `sitemap` root or nested directory is being expanded
- `processing, phase = reading` while a leaf item is reading source documents
- `processing, phase = embedding` while a leaf item is embedding / writing vectors
- `completed, phase = null` after successful leaf indexing, or when a container has no active children
- `failed, phase = null` on runtime failure, interrupt cleanup failure, or shutdown interruption
- `fileProcessorId` is persisted in base config, but it does not participate in runtime execution yet.
- In other words:
- queue structure is implementation detail
- item status is business state
- some business states are currently reserved for future runtime expansion
- `status` is aggregate business state
- `phase` is runtime progress
- container status is reconciled from its own phase and child item statuses
- these concerns must not be conflated
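
The split between `status` and `phase`, and the bottom-up reconciliation of container status, can be illustrated with a small sketch. The status set is read from this diff as `idle` / `pending` / `processing` / `completed` / `failed`. The exact reconcile rules are not spelled out in the document, so the precedence used here (own `preparing` phase, then active children, then failed children, else completed) is an assumption, as is the function name.

```typescript
type KnowledgeItemStatus = 'idle' | 'pending' | 'processing' | 'completed' | 'failed'
type KnowledgeItemPhase = null | 'preparing' | 'reading' | 'embedding'

// Hypothetical reconcile rule for a container (directory / sitemap) item.
function reconcileContainerStatus(
  ownPhase: KnowledgeItemPhase,
  childStatuses: KnowledgeItemStatus[]
): KnowledgeItemStatus {
  // The container itself is still being expanded.
  if (ownPhase === 'preparing') return 'processing'
  // Any child still queued or running keeps the container in processing.
  const hasActiveChild = childStatuses.some((s) => s === 'pending' || s === 'processing')
  if (hasActiveChild) return 'processing'
  // Assumption: a failed child makes the settled container failed.
  if (childStatuses.some((s) => s === 'failed')) return 'failed'
  // No active children (including no children at all) means completed.
  return 'completed'
}
```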
## Current Runtime Consumption Notes
- Runtime entrypoint:
- `src/main/services/knowledge/KnowledgeRuntimeService.ts`
- `src/main/services/knowledge/runtime/KnowledgeRuntimeService.ts`
- Reader dispatch code still exists for stored `knowledge_item.type` values:
- `file` -> file reader by extension
- `url` -> fetch markdown through Jina Reader
- `note` -> inline note content
- `sitemap` -> sitemap reader code path is present, but current runtime does not index `sitemap` items directly
- `directory` -> currently treated as a container placeholder and returns no documents
- This means `directory` and `sitemap` remain valid persisted `knowledge_item.type` values, but the current runtime does not index them directly.
- For container expansion flows, upstream callers may still create mixed persisted child batches under one owner/group, for example `directory` + `file`.
- That mixed batch is a persistence concern, not an indexing contract:
- container items may be stored in `knowledge_item`
- but only concrete indexable leaf items should be submitted to runtime `addItems`
- In other words, callers must distinguish:
- create set: all items that should be persisted into `knowledge_item`
- index set: only items that runtime is allowed to index
- Upstream callers must therefore flatten containers into concrete child items and filter out non-indexable container types before indexing.
- This means `directory` and `sitemap` remain valid persisted `knowledge_item.type` values, but they are prepared before leaf indexing rather than indexed directly.
- Runtime add flow accepts new item payloads:
- leaf payloads create `knowledge_item` rows and enqueue `index-leaf`
- `directory` / `sitemap` payloads create root rows and enqueue `prepare-root`
- `prepare-root` expands the owner inside the runtime queue, creates child rows, and enqueues concrete leaf children as `index-leaf`.
- Callers must not create user-supplied nested `directory` / `sitemap` items under another item. Nested directory rows may still be created internally by directory expansion to preserve filesystem hierarchy.
- Runtime embedding model resolution currently expects `knowledge_base.embeddingModelId` in `providerId::modelId` format and only supports `ollama` as the active provider.
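
The runtime add flow described in this list maps payload types to queue entry kinds, which can be sketched as follows. `AddItemPayload` and its optional `groupId` field are hypothetical; the document does not show the real payload shape, and using `groupId` to detect a caller-supplied nested container is an assumption.

```typescript
type KnowledgeItemType = 'file' | 'url' | 'note' | 'directory' | 'sitemap'
type QueueEntryKind = 'prepare-root' | 'index-leaf'

// Hypothetical payload shape for the runtime add flow.
interface AddItemPayload {
  type: KnowledgeItemType
  groupId?: string | null
}

function isContainerType(type: KnowledgeItemType): boolean {
  return type === 'directory' || type === 'sitemap'
}

// Decides which queue entry a new item payload produces.
function planQueueEntry(payload: AddItemPayload): QueueEntryKind {
  if (isContainerType(payload.type)) {
    // Assumption: a caller-supplied container under another item is rejected.
    if (payload.groupId) {
      throw new Error('nested directory / sitemap items cannot be supplied by callers')
    }
    return 'prepare-root'
  }
  return 'index-leaf'
}
```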
## Implementation Status
- `video` and `memory` items are skipped during migration.
- The target schema uses optional `groupId`, but migration from official v1 data still writes it as `null`.
- The current DataApi contract is flat item CRUD plus filtered listing; it does not expose tree navigation.
- The current DataApi contract exposes flat item read/listing only; write operations go through runtime orchestration.
- Group ownership is represented implicitly by `groupId = ownerItem.id`; there is no standalone group table in the current phase.
- `dimensions` resolution failure skips the entire base and all nested items, with warnings recorded in migration output.
- Knowledge item status migration uses `uniqueId` instead of `processingStatus`.
- The current runtime service is `KnowledgeRuntimeService`, not the old `KnowledgeService` name used in earlier notes.
- Current runtime queue behavior is a single in-memory `PQueue({ concurrency: 5 })` shared across knowledge bases; there is no per-base serial queue yet.
- Current runtime queue entries are `prepare-root` and `index-leaf`; preparation and leaf indexing share interrupt / wait / shutdown cleanup semantics.


@@ -0,0 +1,129 @@
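# Knowledge Pending Data Structure Changes
This document records the data structure items in Knowledge V2 that still need to change.
It records only confirmed structural adjustments, not unconfirmed UI inferences or implementation details.
## Confirmed
### 1. Reuse `groupTable` and Add `groupId` to `knowledge_base`
#### Conclusion
- Do not add a separate `knowledge_group` table.
- Reuse the existing `groupTable` in `src/main/data/db/schemas/group.ts`.
- Add a `groupId` field on `knowledge_base` that references `groupTable.id`.
- If upper layers create Knowledge-specific groups, `group.entityType` follows this convention:
- Suggested value: `knowledge_base`
- `KnowledgeBaseService` currently does not strictly validate `entityType`; this matches the existing `topic.groupId` behavior.
#### Purpose
- Support grouping of the knowledge base list on the left side of Knowledge V2.
- Make knowledge base grouping business data rather than a renderer-local mock field.
- Avoid misusing `knowledge_item.groupId`:
- `knowledge_item.groupId` still means item-level source/container grouping
- It must not be reused for knowledge base navigation grouping
#### Structures to Change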
1. SQLite Schema
- `src/main/data/db/schemas/knowledge.ts`
- Add to `knowledgeBaseTable`:
- `groupId: text().references(() => groupTable.id, { onDelete: 'set null' })`
2. Shared Data Types / API Schema
- `packages/shared/data/types/knowledge.*`
- `packages/shared/data/api/schemas/knowledges.ts`
- `KnowledgeBase`, `CreateKnowledgeBaseDto`, and `UpdateKnowledgeBaseDto` need to support `groupId`
3. Data Service / Handler Constraints
- `KnowledgeBaseService`
- knowledge-related handlers
- Stay consistent with `topic.groupId`:
- The service layer adds no extra business validation for `groupId` / `entityType`
- create / update pass `groupId` through as-is, with no trimming or empty-value normalization
- Referential integrity is enforced by the SQLite foreign key constraint
4. Migration / Compatibility Strategy
- Legacy knowledge base data currently has no `groupId`
- The migration phase may write `null` first
- Whether to backfill a default group will be confirmed separately later
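#### Out of Scope for Now
- No new `knowledge_group` table
- No repurposing of `knowledge_item.groupId` as a knowledge base grouping field
- No multi-level grouping
- No group-specific extra fields:
- `icon`
- `color`
- `isDefault`
- `parentId`
#### Impact
- The `knowledge_base` primary data structure
- The knowledge base DataApi input/output contract
- The data source for the future left-side group list in the renderer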
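### 2. Add `emoji` to `knowledge_base`
#### Conclusion
- Add an `emoji` field on `knowledge_base` for displaying the knowledge base icon.
- Storage matches the existing `assistantTable`:
- SQLite uses `emoji: text()`
- No separate extension fields such as image, SVG, or icon type
- `emoji` is part of the knowledge base primary data and is not kept in renderer-local state or temporary UI configuration.
- API / service layer behavior aligns with assistant:
- `KnowledgeBase.emoji` always returns a non-empty value
- The default value is `📁`
#### Purpose
- Support knowledge base icon display in the Knowledge V2 left-side list, the detail header, and similar places.
- Make the icon persistable business data rather than a value derived in the UI layer.
- Stay consistent with the existing assistant emoji storage pattern to reduce the cost of understanding and implementation.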
#### Structures to Change
1. SQLite Schema
- `src/main/data/db/schemas/knowledge.ts`
- Add to `knowledgeBaseTable`:
- `emoji: text()`
2. Shared Data Types / API Schema
- `packages/shared/data/types/knowledge.ts`
- `packages/shared/data/api/schemas/knowledges.ts`
- `KnowledgeBase`, `CreateKnowledgeBaseDto`, and `UpdateKnowledgeBaseDto` need to support `emoji`
3. Data Service / Handler Constraints
- `KnowledgeBaseService`
- knowledge-related handlers
- Suggested constraints:
- Accept only a single emoji character, matching assistant behavior
- As with assistant, the API layer guarantees the returned value always carries an emoji
4. Migration / Compatibility Strategy
- Legacy knowledge base data currently has no `emoji`
- The migration phase allows the database value to remain `null`
- On read, the service layer fills in the default value `📁`
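
The read-time default can be sketched as a small row-to-entity mapping. The row and entity shapes are trimmed to three fields for illustration, and `toKnowledgeBase` is a hypothetical name; treating an empty string the same as `null` is an assumption based on the "always non-empty" requirement.

```typescript
const DEFAULT_KNOWLEDGE_BASE_EMOJI = '📁'

// Trimmed, hypothetical shapes: the stored row may hold null, the API entity may not.
interface KnowledgeBaseRow {
  id: string
  name: string
  emoji: string | null
}

interface KnowledgeBaseEntity {
  id: string
  name: string
  emoji: string
}

// Fills in the default emoji when the stored value is null or empty.
function toKnowledgeBase(row: KnowledgeBaseRow): KnowledgeBaseEntity {
  return {
    id: row.id,
    name: row.name,
    emoji: row.emoji ? row.emoji : DEFAULT_KNOWLEDGE_BASE_EMOJI
  }
}
```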
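#### Out of Scope for Now
- No new `icon`
- No new `iconType`
- No new `iconUrl`
- No new `cover`
- No new `separatorRule`
#### Impact
- The `knowledge_base` primary data structure
- The knowledge base DataApi input/output contract
- The icon data source for the knowledge base list and detail header in the renderer
## To Be Added
- Future structural adjustments should be appended to this document in the same format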


@@ -134,12 +134,16 @@ V2 迁移时,不保留这个旧字段作为最终业务标识,而是把它
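Content fields in legacy vector records are converted as follows:
- `pageContent` -> `document`
- `knowledge_item.id` -> `metadata.itemId`
- `source` -> optional `metadata.source`
- `knowledge_item.id` -> `metadata.itemId` and `external_id`
- `knowledge_item.type` -> `metadata.itemType`
- `source` -> `metadata.source`
- chunk order -> `metadata.chunkIndex`
- estimated token count of the chunk text -> `metadata.tokenCount`
The current implementation does not keep all legacy metadata; it keeps only the minimum needed for migration and retrieval.
`metadata.itemId` and `external_id` hold the same value, which is used to restore stable business item ownership.
If a legacy row's `source` is empty, the migration result does not fill in a default value and does not force `metadata.source` to be written.
Migrated metadata must satisfy the runtime `KnowledgeChunkMetadataSchema`:
`itemId`, `itemType`, `source`, `chunkIndex`, and `tokenCount` are all required fields.
Legacy rows for which no valid `source` can be derived are skipped rather than written with incomplete metadata.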
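
The newer field mapping can be sketched as a pure function over legacy rows. All type and function names here are hypothetical. The sketch does not model any attempt to derive a missing `source` from other data, takes the token estimator as a parameter because the real estimate is not described, and numbers `chunkIndex` over the rows that are kept, which is an assumption.

```typescript
// Hypothetical, trimmed shapes for a legacy row and a migrated row.
interface LegacyVectorRow {
  pageContent: string
  source: string | null
}

interface MigratedVectorRow {
  externalId: string
  document: string
  metadata: {
    itemId: string
    itemType: string
    source: string
    chunkIndex: number
    tokenCount: number
  }
}

function mapLegacyRows(
  item: { id: string; type: string },
  rows: LegacyVectorRow[],
  estimateTokens: (text: string) => number
): MigratedVectorRow[] {
  const migrated: MigratedVectorRow[] = []
  for (const row of rows) {
    const source = row.source ? row.source.trim() : ''
    // Rows without a usable source are skipped instead of written incomplete.
    if (source.length === 0) continue
    migrated.push({
      externalId: item.id, // same value as metadata.itemId
      document: row.pageContent,
      metadata: {
        itemId: item.id,
        itemType: item.type,
        source,
        chunkIndex: migrated.length, // assumption: index among kept rows
        tokenCount: estimateTokens(row.pageContent)
      }
    })
  }
  return migrated
}
```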
### 5.3 Embedding Reuse
@@ -171,20 +175,26 @@ V2 迁移时,不保留这个旧字段作为最终业务标识,而是把它
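## 6. File Safety Constraints
The migrator currently uses a "rebuild into a temporary file + replace in place" strategy.
The migrator currently uses a "rebuild into a temporary file + keep a legacy `.bak` + replace in place" strategy.
The rules are:
1. First write a temporary file next to the original DB
- `<dbPath>.vectorstore.tmp`
2. Replace the original file only after the temporary file is fully written and verified
3. If the current base fails before the final replacement, the original legacy DB stays unchanged
2. Do not move or delete the original legacy `embedjs` DB until the temporary file is fully written and verified
3. Once the temporary file is ready, move the original legacy DB to a sibling backup file
- `<dbPath>.embedjs.bak`
4. Then move the temporary V2 vector store back to the original DB path
- `<dbPath>`
5. If the current base fails before the final replacement, the original legacy DB remains at the original path
6. If the current base has completed replacement but a later migrator or the final validation fails, retry re-reads the legacy vector source from `<dbPath>.embedjs.bak`
This means:
1. The migration tries to avoid corrupting the original file midway
2. The current flow relies on the user having completed a V1 backup before migration
3. The migrator itself does not maintain an additional rollback copy
1. The migration does not damage the original legacy DB before the V2 store is successfully written
2. `.embedjs.bak` is the legacy vector source copy that the migrator keeps for retries within the same migration run
3. `.embedjs.bak` is not the complete V1 backup the user creates before migration, and it does not replace the full-backup requirement
4. The current implementation does not automatically delete `.embedjs.bak` after a successful migration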
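
The newer three-step replacement (temporary file, `.embedjs.bak`, original path) can be sketched against a minimal file-operations interface. `FileOps`, both function names, and the `isV2Store` predicate are hypothetical; the document does not describe how the source reader detects a V2 store, and keeping an already existing `.bak` on retry is an assumption drawn from rule 6.

```typescript
// Hypothetical minimal file operations, injected so the ordering is easy to follow.
interface FileOps {
  exists(path: string): boolean
  rename(from: string, to: string): void
}

// Runs only after the temporary V2 store has been fully written and verified.
function finalizeVectorStoreSwap(fs: FileOps, dbPath: string): void {
  const tmpPath = dbPath + '.vectorstore.tmp'
  const bakPath = dbPath + '.embedjs.bak'
  if (!fs.exists(tmpPath)) {
    throw new Error('temporary V2 store is missing: ' + tmpPath)
  }
  // Assumption: on retry the .bak already exists and is the only legacy source, so keep it.
  if (!fs.exists(bakPath)) {
    fs.rename(dbPath, bakPath)
  }
  fs.rename(tmpPath, dbPath)
}

// Picks the legacy source for a (re)run: the original path unless it already
// holds a V2 store, in which case fall back to the sibling .bak.
function resolveLegacySource(fs: FileOps, dbPath: string, isV2Store: (path: string) => boolean): string {
  const bakPath = dbPath + '.embedjs.bak'
  if (fs.exists(dbPath) && !isV2Store(dbPath)) return dbPath
  if (fs.exists(bakPath)) return bakPath
  throw new Error('no legacy vector source available for ' + dbPath)
}
```

A real implementation would also need to handle a crash between the two renames; this sketch only shows the intended ordering.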
## IMPORTANT: Currently Accepted Limitations
@@ -193,18 +203,15 @@ V2 迁移时,不保留这个旧字段作为最终业务标识,而是把它
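1. A base-level execution failure is a migration failure, not skippable data
- If a base fails while rebuilding the temporary store, writing the target tables, or replacing the final file, `execute()` returns `success: false` directly
- Such failures are not counted in `skippedCount`, and must not be logged as a warning and then treated as success
2. The current implementation does **not** keep an additional retryable legacy `.bak` file in the knowledge base directory
- That is, the migrator does not maintain an in-place recoverable V1 vector source for later automatic retries
3. The current replacement strategy is still "replace in place"
- So if a failure occurs in the window where the original file has been removed but the new file has not yet landed, the legacy source on disk may no longer be reusable
4. The current retry strategy relies on the complete V1 backup the user made before migration
- If migration fails, the user should manually restore the original backup before migrating again
- The migrator itself does **not guarantee** that the directory is still in a "can be migrated again directly" state after a failure
- This is also the direct reason the current implementation does not use a three-stage replacement (old -> .bak -> new)
- The current product policy requires the user to make a complete file backup before starting migration
- So the source of truth for failure recovery is the pre-migration backup manually restored by the user, not a temporary rollback copy additionally maintained by the migrator in the knowledge base directory
5. This point is important and may need further discussion or changes later
- If the project later wants to support "retry directly after failure with no manual restore", the file replacement and backup strategy must be redesigned
2. Retry depends on `.embedjs.bak` remaining next to the migrated V2 vector store
- If the original path already holds a V2 vector store, the source reader falls back to reading the sibling `.embedjs.bak`
- If `.embedjs.bak` is deleted externally, the migrator can no longer guarantee a retry from the legacy source within the same run
3. The lifecycle of `.embedjs.bak` is currently defined only up to "supporting migration retry"
- The current code does not clean up `.embedjs.bak` after the overall migration succeeds
- If the product later wants to free the disk space used by the old vector store automatically after success, a cleanup strategy, implementation, and tests must be added separately
4. The user's complete pre-migration backup is still necessary
- `.embedjs.bak` only covers the legacy vector DB of a single knowledge base
- The source of truth for global recovery after a full migration failure is still the pre-migration backup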
## 7. Validation Rules
@@ -278,7 +285,7 @@ V2 迁移时,不保留这个旧字段作为最终业务标识,而是把它
The current runtime vector-side implementation lives in:
- `src/main/services/knowledge/KnowledgeRuntimeService.ts`
- `src/main/services/knowledge/runtime/KnowledgeRuntimeService.ts`
- `src/main/services/knowledge/vectorstore/KnowledgeVectorStoreService.ts`
- `src/main/services/knowledge/vectorstore/providers/LibSqlVectorStoreProvider.ts`