Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> Co-authored-by: fullex <106392080+0xfullex@users.noreply.github.com> Signed-off-by: eeee0717 <chentao020717Work@outlook.com>
Knowledge Operation Guards
This document records the guard and recovery semantics for the three caller-facing knowledge item operations:
- addItems
- deleteItems
- reindexItems
The operations intentionally do not share one generic validation pipeline. They share small guards where the semantics match, but each operation keeps its own explicit flow because their state transitions and enqueue-failure behavior are different.
Shared Helpers
assertBaseCanRunRuntimeOperation
Used by operations that create or rebuild runtime work on an existing base.
- addItems: rejects failed bases.
- reindexItems: rejects failed bases.
- deleteItems: does not use this guard. Deleting a failed base's items must remain possible so callers can clean up recoverable or partially migrated data.
KnowledgeItemService.getOutermostSelectedItemIds
Used by subtree id-based operations: deleteItems and reindexItems.
- De-duplicates input item ids.
- Loads each selected item.
- Rejects items that do not belong to the requested baseId.
- Removes selected descendants when their selected ancestor is already present.
- Prevents the same subtree from being deleted or reindexed more than once in a single request.
This helper is not used by addItems because addItems receives new item payloads, not persisted item ids.
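The collapse to outermost roots can be sketched as follows. This is a minimal illustration, not the service's actual code: the `CollapseRow` shape, the in-memory `rowsById` map, and the choice to throw on a missing row are assumptions; only the `groupId` parent link and the `baseId` check come from this document.

```typescript
// Hypothetical row shape: only id, baseId, and the parent link (groupId) matter here.
interface CollapseRow {
  id: string
  baseId: string
  groupId: string | null
}

function collapseToOutermostIds(
  baseId: string,
  itemIds: string[],
  rowsById: Map<string, CollapseRow>
): string[] {
  // De-duplicate while keeping first-seen order.
  const uniqueIds = Array.from(new Set(itemIds))
  const selected = new Set(uniqueIds)
  const roots: string[] = []

  for (const id of uniqueIds) {
    const row = rowsById.get(id)
    // Assumption: a missing row is rejected the same way as a foreign-base row.
    if (!row || row.baseId !== baseId) {
      throw new Error(`Item ${id} does not belong to base ${baseId}`)
    }

    // Walk up the ancestor chain; drop the item if any ancestor is also selected.
    let parentId: string | null = row.groupId
    let coveredByAncestor = false
    while (parentId !== null) {
      if (selected.has(parentId)) {
        coveredByAncestor = true
        break
      }
      parentId = rowsById.get(parentId)?.groupId ?? null
    }

    if (!coveredByAncestor) roots.push(id)
  }
  return roots
}
```

Because each surviving id has no selected ancestor, a later subtree delete or reindex touches every row at most once per request.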
KnowledgeService.getRootItemsInBase
Private helper used only by single-item chunk operations.
- De-duplicates input item ids.
- Loads each selected item.
- Rejects items that do not belong to the requested baseId.
Subtree operations do not use this helper; they use KnowledgeItemService.getOutermostSelectedItemIds instead.
Subtree Status Reconciliation
Any non-delete subtree status update must reconcile parent containers outside the updated subtree. For example, if a child subtree is marked failed after a scheduling failure, the parent directory must also be recalculated so it does not remain processing without active work.
Subtree membership must be resolved in the same serialized write transaction as the status write. Do not precompute subtree ids before entering DbService.withWriteTx; a concurrent create/delete between the read and update can leave descendants visible or reconcile containers against stale membership.
Hard Delete FileRef Cleanup
Final hard deletes must clear Knowledge file_ref rows for the full deletion subtree in the same DbService.withWriteTx before deleting knowledge_item rows. file_ref.sourceId is polymorphic and has no FK to knowledge_item; deleting a container cascades child knowledge_item rows through knowledge_item.groupId, but the database cannot cascade their file refs.
deleteItemsByIds must therefore expand explicit ids to the full subtree with a recursive CTE for ref cleanup. Row deletion may still target the explicit ids and rely on the groupId cascade, but ref cleanup must use the complete subtree id set.
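To make the difference between the explicit ids and the complete subtree id set concrete, here is an in-memory sketch of the expansion that the recursive CTE performs. The SQL in the comment is illustrative only, and the `SubtreeNode` shape and `expandToFullSubtree` name are hypothetical.

```typescript
// Illustrative SQL only; the exact column names are an assumption:
//   WITH RECURSIVE subtree(id) AS (
//     SELECT id FROM knowledge_item WHERE id IN (/* explicit ids */)
//     UNION
//     SELECT k.id FROM knowledge_item k JOIN subtree s ON k.groupId = s.id
//   )
//   SELECT id FROM subtree;

// Hypothetical minimal row shape for the in-memory equivalent.
interface SubtreeNode {
  id: string
  groupId: string | null
}

function expandToFullSubtree(explicitIds: string[], rows: SubtreeNode[]): string[] {
  // Index children by parent id once, so the walk below is linear.
  const childrenByParent = new Map<string, string[]>()
  for (const row of rows) {
    if (row.groupId === null) continue
    const siblings = childrenByParent.get(row.groupId) ?? []
    siblings.push(row.id)
    childrenByParent.set(row.groupId, siblings)
  }

  // Breadth-first walk from the explicit ids; `seen` plays the role of UNION.
  const seen = new Set<string>()
  const queue = [...explicitIds]
  while (queue.length > 0) {
    const id = queue.shift() as string
    if (seen.has(id)) continue
    seen.add(id)
    queue.push(...(childrenByParent.get(id) ?? []))
  }
  return Array.from(seen)
}
```

File ref cleanup would use the expanded set, while row deletion could still target only the explicit ids and rely on the groupId cascade.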
assertSubtreesCanReindex
Used only by reindexItems.
- Runs after selected item ids have been collapsed to top-level roots.
- Loads each selected root subtree with roots included.
- Allows reindex only when every item in every selected subtree is terminal: completed or failed.
- Rejects active or deleting subtree state: idle, preparing, processing, reading, embedding, or deleting.
This is the backend authority for user-triggered reindex. UI may hide the reindex action for non-terminal rows, but the service guard must still reject stale or direct calls.
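A minimal sketch of the terminal-state check, assuming the subtrees have already been loaded into memory. The status names are the ones listed above; the `ReindexGuardItem` shape, the function name, and the error message are hypothetical.

```typescript
type KnowledgeItemStatus =
  | 'idle'
  | 'preparing'
  | 'processing'
  | 'reading'
  | 'embedding'
  | 'deleting'
  | 'completed'
  | 'failed'

// Only these two statuses admit a user-triggered reindex.
const TERMINAL_STATUSES: ReadonlySet<KnowledgeItemStatus> = new Set<KnowledgeItemStatus>([
  'completed',
  'failed'
])

// Hypothetical minimal item shape.
interface ReindexGuardItem {
  id: string
  status: KnowledgeItemStatus
}

// Each inner array is one selected root subtree, root included.
function assertAllSubtreesTerminal(subtrees: ReindexGuardItem[][]): void {
  for (const subtree of subtrees) {
    const blocking = subtree.find((item) => !TERMINAL_STATUSES.has(item.status))
    if (blocking) {
      throw new Error(`Cannot reindex: item ${blocking.id} is ${blocking.status}`)
    }
  }
}
```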
Chunk Operations
Used by listItemChunks. (The chunk-level delete deleteItemChunk was removed with the per-base index store cutover — chunks are derived index rows now, replaced wholesale by rebuildMaterial.)
- Rejects failed bases through assertBaseCanRunRuntimeOperation.
- Loads the requested item and rejects items outside the requested baseId.
- Allows chunk listing only when the requested item itself is completed.
- For completed directory list requests, also rejects if any descendant is deleting.
The UI should only expose chunk viewing for completed rows, but the service guard remains the backend authority for stale or direct IPC calls. The extra container descendant check exists because container reconciliation ignores deleting children, so a container can stay completed while cleanup is still pending below it.
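The four checks can be read as one ordered guard. The sketch below assumes the base status, the requested item, and its descendants are already loaded; the `ChunkGuardItem` shape, the string-typed statuses, and the function name are hypothetical.

```typescript
// Hypothetical minimal item shape; `type` is only compared against 'directory'.
interface ChunkGuardItem {
  id: string
  baseId: string
  type: string
  status: string
}

function assertItemCanListChunks(
  baseId: string,
  baseStatus: string,
  item: ChunkGuardItem,
  descendants: ChunkGuardItem[]
): void {
  if (baseStatus === 'failed') {
    throw new Error(`Base ${baseId} is failed`)
  }
  if (item.baseId !== baseId) {
    throw new Error(`Item ${item.id} does not belong to base ${baseId}`)
  }
  if (item.status !== 'completed') {
    throw new Error(`Item ${item.id} is ${item.status}, not completed`)
  }
  // A container can stay completed while cleanup is still pending below it.
  if (item.type === 'directory' && descendants.some((d) => d.status === 'deleting')) {
    throw new Error(`Directory ${item.id} has descendants that are being deleted`)
  }
}
```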
addItems
addItems accepts new item payloads and creates persisted knowledge_item rows before scheduling the first workflow jobs.
addItems(baseId, inputs)
-> reject failed base
-> no-op on empty inputs
-> under same-base mutation lock:
create each item
set root status to preparing for containers
set root status to processing for leaves
rollback created rows if create/status update fails
-> schedule each accepted item
container -> knowledge.prepare-root
leaf -> knowledge.index-documents
invalid -> mark item failed, no job
deleting -> skip
-> if enqueue throws:
mark accepted items that did not finish scheduling as failed
rethrow
Why Enqueue Failure Marks Items Failed
addItems writes an active status before enqueueing. If enqueue fails after the mutation block, the row would otherwise stay in preparing or processing without a durable job to advance it.
The compensating rule is:
- items whose scheduling completed are left alone, because they already have a job or an intentional no-job terminal decision;
- the failing item and any later accepted items are marked failed;
- the original enqueue error is rethrown to the caller.
This prevents stuck active rows while avoiding deletion of rows that may already be referenced by a queued job.
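The compensating rule can be sketched as a loop that tracks how many items finished scheduling. This is a simplified, synchronous illustration (real scheduling is presumably asynchronous); `scheduleAcceptedItems` and the two callbacks are hypothetical names, not the service's API.

```typescript
function scheduleAcceptedItems(
  itemIds: string[],
  scheduleItem: (itemId: string) => void, // may throw when enqueue fails
  markFailed: (itemIds: string[]) => void
): void {
  let finished = 0
  try {
    for (const itemId of itemIds) {
      scheduleItem(itemId)
      finished += 1
    }
  } catch (error) {
    // Items before `finished` already have a job or a deliberate no-job decision.
    // The failing item and every later accepted item have no durable job.
    markFailed(itemIds.slice(finished))
    throw error
  }
}
```

Marking rows failed rather than deleting them keeps rows that a queued job may already reference.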
deleteItems
deleteItems operates on existing item ids and is modeled as a durable cleanup state machine.
deleteItems(baseId, itemIds)
-> de-duplicate ids
-> load selected items
-> reject items outside baseId
-> collapse nested selections to top-level roots
-> no-op if no roots remain
-> under same-base mutation lock:
mark selected root subtrees deleting
-> enqueue knowledge.delete-subtree
idempotency key = knowledge:${baseId}:${sorted root ids}:delete
-> if enqueue throws:
keep rows deleting
log and rethrow
Why Enqueue Failure Keeps deleting
deleting is a recoverable intermediate state, not a terminal error. Once a subtree is marked deleting, other runtime paths can stop treating it as normal searchable/indexable content.
If enqueue fails, the rows remain deleting. The service does not run an in-session retry loop. Startup recovery scans deleting roots once and re-enqueues cleanup jobs best-effort:
deleteItems enqueue failure
-> keep rows deleting
-> throw the enqueue error to the caller
onAllReady
-> scan deleting root groups
-> enqueue knowledge.delete-subtree in bounded chunks
-> log scan or enqueue failures without retrying in-session
This keeps delete cleanup durable across process restart without maintaining a runtime recovery scheduler for the small enqueue-failure window.
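A rough sketch of the best-effort startup pass for one base, assuming the deleting roots have already been scanned. The function names, the callback signatures, and the default chunk size of 50 are all hypothetical; the document only states that chunks are bounded and that failures are logged without an in-session retry.

```typescript
function toChunks<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new Error('chunk size must be at least 1')
  const chunks: T[][] = []
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size))
  }
  return chunks
}

// Returns how many cleanup jobs were accepted by the queue.
function reenqueueDeletingRoots(
  baseId: string,
  deletingRootIds: string[],
  enqueueDeleteSubtree: (baseId: string, rootIds: string[]) => void,
  logError: (message: string, error: unknown) => void,
  chunkSize = 50 // hypothetical bound
): number {
  let enqueued = 0
  for (const batch of toChunks(deletingRootIds, chunkSize)) {
    try {
      enqueueDeleteSubtree(baseId, batch)
      enqueued += 1
    } catch (error) {
      // Best-effort: log and move on. Rows stay deleting for the next startup.
      logError(`Failed to re-enqueue delete cleanup for base ${baseId}`, error)
    }
  }
  return enqueued
}
```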
Why Delete Cleanup Failure Does Not Mark Items failed
knowledge.delete-subtree is responsible for removing vector artifacts, detaching Knowledge file references, and deleting the resolved knowledge_item rows. If that job fails or is cancelled after rows were already marked deleting, the rows must stay deleting.
Do not convert these rows to ordinary failed items as a terminal fallback:
- deleting is the state that hides requested-deletion content from default list, search, and RAG reads;
- failed means an indexing or preparation workflow failed, so list and search paths may treat the item as visible user data;
- if vector cleanup failed before all chunks were removed, deleting -> failed can make stale chunks searchable again;
- delete-base may cancel delete-subtree jobs because base deletion has taken ownership of cleanup, so cancellation is not always an item-level failure.
The recovery path for failed delete cleanup is to keep deleting, then let JobManager retry an existing knowledge.delete-subtree job or startup recovery enqueue another cleanup job for orphan deleting roots. If the product needs a user-visible terminal delete failure later, add an explicit delete-failure state or job-level UI, and keep that state excluded from default list, search, and RAG reads.
reindexItems
reindexItems operates on existing item ids but does not change item state in the caller-facing entrypoint.
reindexItems(baseId, itemIds)
-> reject failed base
-> de-duplicate ids
-> load selected items
-> reject items outside baseId
-> collapse nested selections to top-level roots
-> no-op if no roots remain
-> reject unless every selected root subtree is completed or failed
-> enqueue knowledge.reindex-subtree
idempotency key = knowledge:${baseId}:${sorted root ids}:reindex
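The delete and reindex idempotency keys share one shape, so a single builder can serve both. In this sketch the function name and the comma separator between sorted root ids are assumptions; the flows above only fix the overall `knowledge:${baseId}:${sorted root ids}:<kind>` layout.

```typescript
type SubtreeJobKind = 'delete' | 'reindex'

function buildSubtreeIdempotencyKey(
  baseId: string,
  rootIds: string[],
  kind: SubtreeJobKind
): string {
  // Sort a copy so the same selection yields the same key in any input order.
  const sortedRootIds = [...rootIds].sort().join(',') // separator is an assumption
  return `knowledge:${baseId}:${sortedRootIds}:${kind}`
}
```

Sorting before joining means two requests for the same set of roots map to the same key regardless of the order in which the caller listed them.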
Why Reindex Requires Terminal Subtrees
User-triggered reindex is intentionally an offline rebuild of an existing subtree, not a cancellation or preemption primitive.
Allowing reindex while a subtree is still preparing, processing, reading, or embedding would force reindex-subtree to coordinate with active indexing and expansion jobs. That reintroduces cancellation races: old jobs may still be reading, writing vectors, attaching refs, or expanding children while the reindex job is deleting vectors and resetting rows.
The simpler rule is:
- active work must finish as completed or failed before the user can reindex;
- failed work can be retried by reindexing because it is already terminal;
- deleting work cannot be reindexed because delete owns cleanup once the durable deleting intent is written;
- delete remains available at any time and is the only user action allowed to preempt active work.
Why Reindex Does Not Pre-Mark Items Active
The reindex entrypoint only accepts the durable job. It does not set roots to preparing or processing before enqueueing.
The reindex job owns the destructive and stateful work:
- clear vectors for resolved leaf items;
- delete previous container descendants when selected roots are containers;
- keep selected leaf root file refs because those root items still own their source files;
- skip if the target subtree became deleting after the entrypoint guard;
- reset subtree item state;
- call scheduleItem for each selected root.
Because the entrypoint does not write an active status before enqueueing, enqueue failure can be reported directly without leaving stuck active rows.
Delete Wins Reindex Races
reindexItems rejects deleting before enqueue, and reindex-subtree treats deleting as a higher-priority state if delete wins the race after enqueue:
- at job entry, it checks the target subtree and completes as skipped if any item is deleting;
- under the same-base mutation lock, it checks again before clearing vectors or resetting statuses;
- it does not cancel active jobs. Reindex is only admitted for terminal subtrees, so there should be no active indexing or expansion work to cancel.
This prevents a later reindex request from cancelling delete cleanup or turning a deleting row back into preparing / processing.
These two deleting checks are intentional, even though the entrypoint already rejects deleting subtrees. They cover the window between enqueue and job execution while preserving the rule that delete is always available.
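The two checks can be sketched as a check at job entry followed by a re-check under the lock. This is a simplified, synchronous illustration; `runReindexSubtree`, the callback parameters, and the `'skipped' | 'reset'` result type are hypothetical stand-ins for the real handler.

```typescript
type ReindexOutcome = 'skipped' | 'reset'

function runReindexSubtree(
  loadSubtreeStatuses: () => string[],
  withBaseMutationLock: (fn: () => ReindexOutcome) => ReindexOutcome,
  resetSubtree: () => void // clears vectors and resets statuses
): ReindexOutcome {
  const hasDeleting = (): boolean => loadSubtreeStatuses().some((s) => s === 'deleting')

  // First check: cheap exit at job entry if delete already won.
  if (hasDeleting()) return 'skipped'

  return withBaseMutationLock((): ReindexOutcome => {
    // Second check: delete may have marked rows between job entry and the lock.
    if (hasDeleting()) return 'skipped'
    resetSubtree()
    return 'reset'
  })
}
```

The second check is what stops the job from resetting a row that became deleting while it was waiting for the lock.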
Why Reindex Keeps Schedule-Failure Compensation
After the reset mutation, selected roots are deliberately visible as preparing or processing before their follow-up jobs are scheduled. This keeps the UI honest: a user-triggered reindex immediately appears as active work.
Because those active statuses are written before scheduleItem, the handler must compensate if scheduling fails. The failing roots are marked failed so the UI does not show stuck active work without a durable job. Do not remove this compensation unless reindex introduces a separate non-active pending state, such as a dedicated reindexing or pending_reindex lifecycle state.
Reindex FileRef Ownership
Knowledge source file_ref rows are business ownership refs, not vector artifacts. Reindex must not detach refs for selected leaf roots because the root knowledge_item rows remain alive and still read data.fileEntryId.
Leaf indexing repairs this relationship instead: knowledge.index-documents rebuilds Knowledge source refs from the current knowledge_item.data before reading the source. For file items, that creates the canonical knowledge_item / source ref to data.fileEntryId; for note and URL items, it clears stale Knowledge file refs.
File ref detach during reindex is valid only when rows are actually being removed, such as stale descendants from a container expansion. Those descendants are deleted through deleteItemsByIds, which performs full subtree ref cleanup in the delete transaction.
prepare-root
prepare-root is an internal job, but it creates child rows and schedules their leaf indexing jobs, so it has its own cleanup and compensation rules.
knowledge.prepare-root(baseId, itemId)
-> skip missing or deleting roots
-> under same-base mutation lock:
find previous descendants
ignore descendants already deleting
clear vectors for removable leaf descendants
detach file refs for removable descendants
delete removable descendants by resolved id
-> under same-base mutation lock:
re-read root and skip if it is now missing or deleting
expand source into new child rows
set root status processing
-> schedule each recreated leaf
if scheduling fails:
mark leaves that did not finish scheduling failed
leave already scheduled leaves alone
rethrow
The stale expansion cleanup clears vectors and file refs before deleting resolved descendant rows so a retry does not leave stale vectors or file refs from a previous partial expansion.
The second root read closes the race where prepare-root loads an active root, then a delete request marks that root deleting before expansion starts. Once a root is deleting, no new children may be created under it.
The child scheduling compensation mirrors addItems: once a child job was accepted, the row is left alone; the failing child and later children are marked failed so no processing leaf remains without a job.
Shutdown
KnowledgeService does not cancel knowledge jobs during service shutdown. Knowledge job handlers use JobManager recovery: 'retry', so unfinished pending, delayed, or running rows are left for JobManager startup recovery instead of being terminal-cancelled while their knowledge items still show active statuses.
Review Checklist
When changing these operations, check the operation-specific failure behavior before extracting shared code.
| Operation | Failed base | Root collapse | Extra status guard | State before enqueue | Enqueue failure |
|---|---|---|---|---|---|
| addItems | Reject | N/A | N/A | preparing / processing | Mark unscheduled accepted rows failed |
| deleteItems | Allow | Yes | N/A | deleting | Keep deleting; startup recovery best-effort re-enqueues |
| reindexItems | Reject | Yes | Entire selected subtree must be completed or failed | None | Throw; no active state was written |
| listItemChunks | Reject | N/A | Requested item must be completed; container list rejects deleting descendants | N/A | N/A |
Prefer shared helpers for exact common behavior, such as base-state guards, base ownership checks, root collapse, queue names, and idempotency key builders. Keep operation flows explicit when the state or recovery semantics differ.