docs: add plan-07..11 architectural fix plans

Plan masters #2685-#2689 covering server runtime GA, OpenCode integration,
data-pipeline integrity, build/artifact hygiene, and observer output fidelity.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Alex Newman
2026-05-28 16:14:51 -07:00
parent c3d2af7c14
commit c46563c68d
5 changed files with 202 additions and 0 deletions

@@ -0,0 +1,52 @@
# [plan-07] Server Runtime GA — graduate server-beta to a first-class runtime
## Defect
The standalone server runtime was extracted from `worker-service` but never inherited the worker's startup, auth, deployment, and operability guarantees. It is shipped under a "beta" label, yet the gaps are not stylistic — they are structural: the server can boot into a non-functional state (no mode loaded), its auth contract disagrees with its own route middleware, its container stack has no supervision, its CLI cannot perform routine operator tasks, and there is no install/uninstall/test path for it at all. The decision is **not** to remove the server runtime — it is to **remove the beta label**, which means closing every parity and hardening gap so the server runtime is as trustworthy as the in-plugin worker.
## Children
- #2428 — server-beta API key scopes (default `memories:read`) don't match the scopes the local route middleware actually requires
- #2443 — `server-beta-service` never calls `loadMode('code')` → every generation job fails with "No mode loaded"
- #2444 — `runServerBetaCli()` default `start` spawns a daemon and exits → unusable under systemd `Type=simple`
- #2540 — tracking issue for the multi-PR server-beta contribution series
- #2541 — API keys persisted as unsalted single-round SHA-256 → offline-crackable if the DB leaks (needs argon2id + timing-safe verify)
- #2543 — `claude-mem install` has no end-to-end setup for the server runtime (Docker + pg + redis, key gen, IDE MCP injection)
- #2550 — no end-to-end test exercising the server runtime path
- #2552 — Viewer UI unreachable on the server runtime (no static handler / API compat layer mounted)
- #2554 — no subscription auth path; API-key-only is prohibitively expensive; stale `DEFAULT_MODEL` → 404; loopback ECONNREFUSED in Docker
- #2558 — Docker stack: no restart policy, brittle `REDIS_URL` env, no credentials-file mount
- #2560 — Postgres schema gaps (missing `platform_source`/`metadata`/indexes), no key-scope migration tool, missing batch route
- #2562 — `api-key-service.ts` filename hides which auth backend it implements (DX)
- #2564 — hooks have no runtime selector; can't switch worker ↔ server without reinstall
- #2568 — `claude-mem uninstall` only knows the worker runtime; server operators must tear down manually
- #2572 — Server CLI missing `api-key`/`keys`/`jobs` subcommands, no helmet hardening, no wrong-runtime guard
## Fix sequence
1. **Boot correctness:** call `loadMode('code')` (and validate a mode is loaded) before the server accepts jobs; fail fast if not (#2443). Make `start` run in the foreground by default with an explicit `--daemon` flag (#2444).
2. **Auth contract:** reconcile API-key scope defaults with the route middleware's required scopes; add a scope-migration tool; move key hashing to argon2id with timing-safe verification (#2428, #2541, #2560).
3. **Deployment hardening:** Docker stack gets a restart policy, a Redis env fallback, and a credentials-file mount; document subscription vs API-key auth and fix the stale default-model 404 + loopback ECONNREFUSED (#2554, #2558).
4. **Operability:** complete the server CLI (`api-key`/`keys`/`jobs`), add helmet, add a wrong-runtime guard; mount the Viewer UI / API compat layer on the server runtime (#2572, #2552).
5. **Install / uninstall / switch:** end-to-end `claude-mem install --runtime server` and matching uninstall; a hook-level runtime selector so users switch worker ↔ server without reinstall (#2543, #2568, #2564).
6. **Tests + rename:** an end-to-end server-runtime test in CI; rename `api-key-service.ts` to reveal its backend (#2550, #2562). Land via the tracked PR series (#2540).
7. **Drop the beta label** only once steps 1-6 land and the server runtime passes the same matrix as the worker.
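
Steps 1 and 2 above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's implementation: `ServerState`, `assertModeLoaded`, `missingScopes`, the route keys, and the `memories:write` scope are hypothetical names. Only `memories:read` and the "No mode loaded" message come from the issues listed above.

```typescript
// Hypothetical shape; the real server state may look different.
interface ServerState {
  loadedMode: string | null;
}

// Step 1: fail fast at boot instead of failing every job later.
function assertModeLoaded(state: ServerState): string {
  if (state.loadedMode === null) {
    throw new Error("No mode loaded: refusing to accept jobs");
  }
  return state.loadedMode;
}

// Step 2: for each route, list the required scopes a default key would lack.
// Routes with no gap are omitted from the result.
function missingScopes(
  defaultScopes: readonly string[],
  routeScopes: Readonly<Record<string, readonly string[]>>,
): Record<string, string[]> {
  const granted = new Set(defaultScopes);
  const gaps: Record<string, string[]> = {};
  for (const route of Object.keys(routeScopes)) {
    const missing = routeScopes[route].filter((scope) => !granted.has(scope));
    if (missing.length > 0) {
      gaps[route] = missing;
    }
  }
  return gaps;
}
```

An empty result from `missingScopes` would be the condition a CI check asserts; any non-empty result names the routes where the default key and the middleware disagree.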
## Test matrix
| Deployment | Auth mode | Required behavior |
|---|---|---|
| Docker compose | API key | Boots with mode loaded; restart policy recovers a killed container; jobs succeed |
| Docker compose | Subscription | Auth path documented + functional; no API-key cost surprise |
| systemd `Type=simple` | API key | `start` stays in foreground; unit does not flap |
| Bare `claude-mem install --runtime server` | API key | Install wires pg+redis+keys+MCP; uninstall fully tears down |
| Switch worker→server (no reinstall) | either | Runtime selector flips; observations resume on the new runtime |
The matrix lives in CI (#2550). A server-runtime regression must fail CI before a user can file an issue.
## Out of scope
- Worker-runtime-only lifecycle bugs → plan-03.
- Generic installer error taxonomy (non-server) → plan-04.
- Host env contamination of the Anthropic subprocess → plan-06.
- New provider backends (Vertex/DeepSeek/OpenAI-compatible) → tracked as standalone feature requests.

@@ -0,0 +1,38 @@
# [plan-08] OpenCode Integration Event-Contract Correctness — make the OpenCode plugin actually capture
## Defect
claude-mem's OpenCode plugin was written against event names that do not exist in OpenCode's hook API (e.g. `session.created`, `message.updated`, `session.compacted`, `file.edited`, `session.deleted`). Because none of the subscribed events ever fires, sessions are never initialized and no observations are captured, yet `claude-mem install` still reports success. The result is a silent failure: OpenCode users believe memory is recording when it is recording nothing. A second defect compounds it: the OpenCode search client parses `data.items` while the worker returns Claude-style `data.content` blocks, so even manual search returns "No results". The same client also resolves the worker port from the wrong source.
The architectural fix is to bind the plugin to OpenCode's **real** event contract and to the worker's **actual** response shape, then add a contract test so a future OpenCode API change fails CI rather than silently disabling capture.
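
The contract test mentioned above could start from a check like the one below. It is a sketch only: `KNOWN_OPENCODE_EVENTS` holds just the three hook names this plan cites as real, and `unknownSubscriptions` is a hypothetical helper. A real test would take the authoritative event list from OpenCode itself rather than a hand-maintained set.

```typescript
// Hook names the plan cites as real. Assumption: this list is partial and a
// real contract test would load the full set from OpenCode's own definitions.
const KNOWN_OPENCODE_EVENTS: ReadonlySet<string> = new Set([
  "tool.execute.after",
  "chat.message",
  "experimental.session.compacting",
]);

// Returns every subscribed event name that is not in the known set.
function unknownSubscriptions(subscribed: readonly string[]): string[] {
  return subscribed.filter((eventName) => !KNOWN_OPENCODE_EVENTS.has(eventName));
}
```

Asserting that `unknownSubscriptions(...)` is empty turns a silent capture failure into a failing test.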
## Children
- #2435 — plugin subscribes to non-existent OpenCode event names → zero sessions/observations recorded
- #2406 — `claude_mem_search` always returns "No results" (`data.items` vs `data.content`); worker-port resolution uses the wrong source
- #2419 — feature framing of the same gap: OpenCode plugin lacks `tool.execute.after` observation capture
- #2462 — duplicate user report: OpenCode install reports success but captures no memory
## Fix sequence
1. **Rebind to real events:** rewrite the plugin against OpenCode's actual hooks (`tool.execute.after`, `chat.message`, `experimental.session.compacting`, …); add session init + observation POST on the correct events (#2435, #2419).
2. **Fix the search client:** parse the worker's `data.content` block shape; resolve the worker port from the authoritative settings source, with the env override honored (#2406).
3. **Contract test:** a test that asserts the plugin subscribes only to event names OpenCode actually emits, and that the search client parses the worker's real response shape. This is the regression guard.
4. **Install honesty:** OpenCode install must verify capture is live (one round-trip) before reporting success, so a future contract break surfaces at install time (ties to plan-04).
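
Step 2 above might look roughly like this. The block shape (`type` plus optional `text`) is an assumption based on the "Claude-style" description, and `extractSearchText`, `resolveWorkerPort`, and the `CLAUDE_MEM_WORKER_PORT` variable name are hypothetical. The fallback port is passed in rather than hard-coded because the plan does not state a default.

```typescript
// Assumed Claude-style content block; the worker's exact fields may differ.
interface ContentBlock {
  type: string;
  text?: string;
}

interface SearchResponse {
  content?: ContentBlock[];
  items?: unknown[]; // the shape the client previously (wrongly) expected
}

// Collects the text of every text block; returns [] when none is present.
function extractSearchText(data: SearchResponse): string[] {
  const blocks = Array.isArray(data.content) ? data.content : [];
  const texts: string[] = [];
  for (const block of blocks) {
    if (block.type === "text" && typeof block.text === "string") {
      texts.push(block.text);
    }
  }
  return texts;
}

// Assumed resolution order: valid env override, then settings, then fallback.
function resolveWorkerPort(
  env: Readonly<Record<string, string | undefined>>,
  settingsPort: number | undefined,
  fallback: number,
): number {
  const raw = env["CLAUDE_MEM_WORKER_PORT"]; // hypothetical variable name
  const fromEnv = raw === undefined ? NaN : Number(raw);
  if (Number.isInteger(fromEnv) && fromEnv > 0 && fromEnv < 65536) {
    return fromEnv;
  }
  return settingsPort ?? fallback;
}
```

Passing the environment in as a plain object keeps both functions pure, so the contract test in step 3 can exercise them without a running worker.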
## Test matrix
| Surface | Input | Required behavior |
|---|---|---|
| Tool execution in OpenCode | a tool call | `tool.execute.after` fires → observation POSTed → row appears |
| Session lifecycle | open/compact/close | session init + compaction handled via real events |
| `claude_mem_search` | a query with known results | parses `data.content`; returns the rows (not "No results") |
| Worker-port resolution | non-default port via env/settings | client targets the correct port |
The matrix lives in CI. An OpenCode-capture regression must fail CI before a user can file an issue.
## Out of scope
- Claude Code hook IO discipline → plan-01.
- Worker write-path / persistence correctness → plan-09.
- The worker's own search SQL source-scoping → plan-09.

@@ -0,0 +1,41 @@
# [plan-09] Data-Pipeline Integrity & Migration Transparency — stop "healthy worker, frozen observations"
## Defect
The worker can report healthy while new data silently stops reaching the `observations` table. The write path and schema contract are enforced inconsistently and their failures are swallowed, so the operator-visible symptom is identical across several distinct root causes: memory looks "frozen" and there is no error. Concretely: an idempotent migration is skipped on pre-existing DBs and its absence is silent; the MCP write tools drop required fields so the row the sync trigger needs is never populated; session identity is never set so messages are dropped; and the search layer ignores the source-scoping filter so memories bleed across agents.
The architectural fix is a **verified write-path + migration contract**: migrations run unconditionally and idempotently on every boot with a logged schema-version audit; the MCP/REST create path validates the field→column→trigger chain so a record cannot be accepted yet fail to persist; ingestion filtering is expressive enough to exclude noise; and source-scoping is applied wherever memory is read.
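
The "cannot be accepted yet fail to persist" rule could be sketched as a validation step in front of the insert. The three field names are the ones #2684 reports as dropped; treating all three as required non-empty strings is an assumption, and `validateCreate` is a hypothetical name.

```typescript
// Fields issue #2684 reports as dropped; treated here as required (assumption).
const REQUIRED_FIELDS = ["title", "narrative", "type"] as const;
type RequiredField = (typeof REQUIRED_FIELDS)[number];

type CreateInput = Partial<Record<RequiredField, unknown>>;
type ObservationRow = Record<RequiredField, string>;

// Rejects loudly rather than persisting an empty record.
function validateCreate(input: CreateInput): ObservationRow {
  const missing = REQUIRED_FIELDS.filter((field) => {
    const value = input[field];
    return typeof value !== "string" || value.trim() === "";
  });
  if (missing.length > 0) {
    throw new Error(`create rejected, missing fields: ${missing.join(", ")}`);
  }
  return {
    title: input.title as string,
    narrative: input.narrative as string,
    type: input.type as string,
  };
}
```

Returning a fully typed row means downstream code that populates the trigger's precondition column never sees an optional field.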
## Children
- #2433 — `merged_into_project` migration silently skipped on pre-existing DBs (`schema_versions ≤ 23`); queries throw no-such-column, UI falsely shows "no memory yet"
- #2684 — MCP `observation_add` / `memory_add` drop `narrative`/`title`/`type` → empty records, `narrative` column never populated, sync trigger never fires, observations frozen
- #2533 — observations never persisted; `pending_messages` empty, `memory_session_id`/`worker_port` never set
- #2389 — `/api/search` ignores `platformSource`; Codex/other-agent search returns cross-platform / null-source memories
- #2442 — transcript `MatchRule` has no negation, so structurally-identical guardian/subagent sessions can't be excluded and pollute memory
## Fix sequence
1. **Migrations always run, idempotently, with audit:** drop the conditional skip; run every migration on boot guarded by idempotency, and log a schema-version audit line so a missing column is impossible to ship silently (#2433).
2. **Write-path contract:** validate the MCP/REST `create` field set, map it to columns, and assert the sync trigger's precondition column is populated; reject (loudly) a create that can't persist instead of accepting an empty record (#2684, #2533).
3. **Session identity:** ensure `memory_session_id`/`worker_port` are set before messages are accepted, so nothing is dropped for want of identity (#2533).
4. **Read-side source-scoping:** thread `platform_source` through the search SQL, the `search` MCP tool schema, and context generation so agents only see their own memory unless explicitly cross-querying (#2389).
5. **Ingestion filtering:** add `not_equals`/`not_in`/`not_contains` (and fix `exists:false`) to transcript `MatchRule` so noise sessions can be excluded (#2442).
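
The negation operators in step 5 could behave as in the sketch below. The `MatchRule` shape here is hypothetical (a single `field` plus optional operators); only the operator names `not_equals`, `not_in`, `not_contains`, and `exists` come from the plan. How a missing field interacts with the negation operators is also an assumption: here a missing field passes them.

```typescript
// Hypothetical rule shape; the real transcript MatchRule schema may differ.
interface MatchRule {
  field: string;
  equals?: string;
  not_equals?: string;
  not_in?: string[];
  not_contains?: string;
  exists?: boolean;
}

type TranscriptRecord = Readonly<Record<string, string | undefined>>;

// Returns true only when every operator present on the rule is satisfied.
function matches(rule: MatchRule, record: TranscriptRecord): boolean {
  const value = record[rule.field];
  const present = value !== undefined;
  // `exists: false` must match only when the field is absent.
  if (rule.exists !== undefined && present !== rule.exists) return false;
  if (rule.equals !== undefined && value !== rule.equals) return false;
  if (rule.not_equals !== undefined && value === rule.not_equals) return false;
  if (rule.not_in !== undefined && present && rule.not_in.indexOf(value as string) !== -1) return false;
  if (rule.not_contains !== undefined && present && (value as string).indexOf(rule.not_contains) !== -1) return false;
  return true;
}
```

With this shape, a rule such as `{ field: "agent", not_in: ["guardian", "subagent"] }` would exclude the noise sessions described in #2442, where `agent` is again an illustrative field name.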
## Test matrix
| Stage | Condition | Required behavior |
|---|---|---|
| Boot on pre-existing DB (`schema_versions ≤ 23`) | migration pending | migration runs idempotently; schema version logged; queries never throw no-such-column |
| MCP `observation_add` | full + partial field set | row persists with `narrative`; missing required field → loud reject, never empty row |
| Session start | first message | `memory_session_id`/`worker_port` set before accept |
| Search | `platformSource=codex` | only codex-sourced rows returned; no null-source bleed |
| Transcript ingest | guardian/subagent session | excluded by negation rule; not stored |
The matrix lives in CI. A "healthy worker, frozen observations" regression must fail CI before a user can file an issue.
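
The first matrix row (boot on a pre-existing DB) can be illustrated with an in-memory stand-in for the schema. This is a sketch, not the project's migration runner: `Migration`, `SchemaState`, and `runMigrations` are hypothetical, and a real implementation would inspect the database's actual columns instead of a `Set`.

```typescript
// Hypothetical migration descriptor: one added column per version.
interface Migration {
  version: number;
  column: string;
}

// In-memory stand-in for a table's columns and the recorded schema versions.
interface SchemaState {
  columns: Set<string>;
  appliedVersions: Set<number>;
}

// Runs every migration on every boot. The guard checks the actual schema,
// not only the version record, so a recorded-but-missing column is repaired.
function runMigrations(
  state: SchemaState,
  migrations: readonly Migration[],
  log: (line: string) => void,
): void {
  for (const migration of migrations) {
    if (!state.columns.has(migration.column)) {
      state.columns.add(migration.column);
      log(`migration ${migration.version}: added column ${migration.column}`);
    }
    state.appliedVersions.add(migration.version);
  }
  const versions = Array.from(state.appliedVersions).sort((a, b) => a - b);
  log(`schema audit: versions=[${versions.join(",")}]`);
}
```

Running it twice against the same state adds the column once and emits the audit line both times, which is the behavior a CI test for that row would assert.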
## Out of scope
- Worker process supervision / crashes → plan-03.
- Chroma vector-search engine stability → plan-03 (process supervision) / upstream.
- OpenCode client response parsing → plan-08.

@@ -0,0 +1,35 @@
# [plan-10] Build / Bundle / CI Artifact Hygiene — enforce a boundary on what we ship
## Defect
There is no enforced discipline on the contents, size, or correctness of published artifacts, so dead weight and maintainer files leak into what users install, and `main` can ship with a broken typecheck. The worker bundler reaches past the plugin's declared dependency boundary and pulls in code that is never used; there is no CI guard to catch the resulting bloat; the published npm tarball ships maintainer `CLAUDE.md` files because there is no `files` allowlist; and `npm run typecheck` is red on `main`. Each is a symptom of the same missing contract: **the build must declare and enforce its boundaries — externals, size, tarball contents, and a green typecheck — in CI**.
## Children
- #2584 — `worker-service.cjs` bundles unused `better-auth` (94 OAuth URLs, ~3.7MB); bundler reaches past the dep boundary
- #2570 — no bundle-size guardrail in CI; bash-only marketplace-sync breaks on Windows (non-idempotent)
- #2538 — 24 pre-existing TypeScript errors block `npm run typecheck` on `main` (Express 5 / React 19 / logger union drift)
- #2537 — published npm tarball ships five `CLAUDE.md` files (no `files` allowlist / `.npmignore`)
## Fix sequence
1. **Externalize / treeshake:** mark `better-auth` (and any other server-only dep) external to the worker bundle, or gate it behind the server runtime so it never enters the worker artifact (#2584).
2. **Bundle-size canary in CI:** record a baseline and fail CI when the worker bundle grows past a threshold; port the marketplace-sync step to a cross-platform, idempotent script (#2570).
3. **Green typecheck gate:** fix the 24 drift errors (Express 5, React 19, logger union) and make `npm run typecheck` a required CI check so `main` can't go red again (#2538).
4. **Tarball allowlist:** add a `files` allowlist (and/or `.npmignore`) so only intended artifacts publish; assert tarball contents in CI (#2537).
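
Steps 2 and 4 reduce to two small checks that a CI script could run. Both functions are hypothetical sketches: the tolerance value, the prefix-based allowlist, and the example paths are assumptions, not the project's actual thresholds or layout.

```typescript
// True when the bundle has grown no more than `tolerance` (a fraction,
// e.g. 0.1 for 10%) past the recorded baseline.
function bundleWithinBudget(
  currentBytes: number,
  baselineBytes: number,
  tolerance: number,
): boolean {
  return currentBytes <= baselineBytes * (1 + tolerance);
}

// Returns packed paths not covered by any allowlisted entry. An entry matches
// a path exactly or as a leading prefix (e.g. "dist/" covers "dist/a.js").
function unexpectedTarballEntries(
  packed: readonly string[],
  allowed: readonly string[],
): string[] {
  return packed.filter(
    (path) => !allowed.some((entry) => path === entry || path.indexOf(entry) === 0),
  );
}
```

In CI, the `packed` list could come from the file listing that `npm pack --dry-run` reports, and a non-empty result from `unexpectedTarballEntries` would fail the job.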
## Test matrix
| Artifact | Check | Required behavior |
|---|---|---|
| `worker-service.cjs` | bundle size vs baseline | no `better-auth`; size under threshold or CI fails |
| Repo `main` | `npm run typecheck` | exit 0; required check |
| npm tarball | `npm pack` contents | only allowlisted files; no maintainer `CLAUDE.md` |
| Marketplace sync | run on Windows + POSIX | idempotent; succeeds on both |
The matrix lives in CI. An artifact-hygiene regression must fail CI before a user can install it.
## Out of scope
- Runtime dependencies missing at install time (node_modules / zod not shipped) → plan-04 (install/dependency completeness).
- Worker runtime crashes → plan-03.

@@ -0,0 +1,36 @@
# [plan-11] Observer / Summarizer Output Fidelity & Resilience — trust what the agent emits, or recover
## Defect
claude-mem's quality depends on the observer/summarizer emitting **truthful, parseable** output, but nothing enforces either property. Two failure modes anchor this plan. First, the observer SDK sometimes returns conversational prose, an empty string, or a "session exhausted" closure string instead of `<observation>` XML; the parser silently drops the entire batch and observations stay at zero, with no recovery and no signal. Second, the summarizer can confabulate — inventing cross-session narrative and fabricating a nonexistent git commit hash — while keeping `files_modified` accurate, which poisons every future context injection that trusts it.
This is distinct from plan-05 (which governs the observer's *tool permissions*, not whether its emitted text is parseable or true). The architectural fix is an **output-fidelity contract**: classify the observer's output (valid XML vs idle-empty vs prose vs poisoned session), recover by killing and respawning a poisoned SDK session while preserving pending work, and run a cheap verification pass that cross-checks generated claims (e.g. commit hashes) against ground truth before persisting.
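
The four-way classification described above might be sketched like this. It is illustrative only: the regular expression is a cheap shape check rather than an XML parser, the 80-character preview length is arbitrary, and `POISON_MARKERS` contains only the "session exhausted" phrase the plan mentions; the real closure strings may differ.

```typescript
type ObserverOutputKind = "valid-xml" | "idle-empty" | "prose" | "poisoned";

interface Classification {
  kind: ObserverOutputKind;
  preview: string; // short excerpt for diagnostics
}

// Assumed marker list; only "session exhausted" is named in the plan.
const POISON_MARKERS: readonly string[] = ["session exhausted"];

function classifyObserverOutput(raw: string): Classification {
  const text = raw.trim();
  const preview = text.slice(0, 80);
  if (text === "") {
    return { kind: "idle-empty", preview };
  }
  // Shape check only: an opening <observation> tag followed by a closing one.
  if (/<observation[\s>][\s\S]*<\/observation>/.test(text)) {
    return { kind: "valid-xml", preview };
  }
  const lower = text.toLowerCase();
  if (POISON_MARKERS.some((marker) => lower.indexOf(marker) !== -1)) {
    return { kind: "poisoned", preview };
  }
  return { kind: "prose", preview };
}
```

Checking for XML before the poison markers keeps a valid observation that merely quotes the phrase from being misclassified.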
## Children
- #2485 — observer SDK returns prose / empty / closure strings; parser drops all batches → observations stay 0, no recovery
- #2574 — summarizer hallucinates cross-session content and fabricates a nonexistent commit hash (`files_modified` correct), poisoning future injection
## Fix sequence
1. **Classify, don't silently drop:** split the observer's non-XML output into idle-empty vs prose vs poisoned-session; attach a preview to diagnostics so dropped batches are visible, not silent (#2485).
2. **Recover from poison:** after N consecutive invalid outputs, kill and respawn the SDK session while preserving pending messages, so a poisoned session can't wedge the pipeline at zero (#2485).
3. **Verify before persist:** cross-check generated claims against ground truth — validate any emitted commit hash with `git cat-file -e`, and reconcile `title`/`narrative` against `files_modified`; log input-context provenance so confabulation is traceable (#2574).
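
Steps 2 and 3 can be sketched as a consecutive-failure counter and a hash cross-check. All names are hypothetical. Two assumptions are worth flagging: idle outputs leave the counter untouched (to avoid respawn churn), and the hash pattern matches any 7 to 40 character hex token, so it can flag ordinary hex-looking words. The existence check is injected; in production it could wrap `git cat-file -e <hash>`.

```typescript
type OutputOutcome = "valid" | "idle" | "invalid";

// Signals a respawn after `threshold` consecutive invalid outputs.
class PoisonTracker {
  private consecutiveInvalid = 0;
  private readonly threshold: number;

  constructor(threshold: number) {
    this.threshold = threshold;
  }

  // Returns true when the session should be killed and respawned.
  record(outcome: OutputOutcome): boolean {
    if (outcome === "idle") return false; // assumption: idle neither counts nor resets
    if (outcome === "valid") {
      this.consecutiveInvalid = 0;
      return false;
    }
    this.consecutiveInvalid += 1;
    if (this.consecutiveInvalid >= this.threshold) {
      this.consecutiveInvalid = 0; // the respawned session starts clean
      return true;
    }
    return false;
  }
}

// Returns the distinct hash-like tokens in `summary` that the checker rejects.
function unverifiedHashes(
  summary: string,
  objectExists: (hash: string) => boolean,
): string[] {
  const candidates = summary.match(/\b[0-9a-f]{7,40}\b/g) ?? [];
  const unique = Array.from(new Set(candidates));
  return unique.filter((hash) => !objectExists(hash));
}
```

Injecting `objectExists` keeps the verification logic testable without a repository; a summary with any unverified hash would be rejected or flagged rather than persisted.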
## Test matrix
| Observer output | Required behavior |
|---|---|
| valid `<observation>` XML | parsed + persisted |
| empty (idle) | classified idle; no error, no respawn churn |
| conversational prose | classified prose; preview logged; not persisted as observation |
| "session exhausted" closure | classified poisoned; session killed + respawned; pending preserved |
| fabricated commit hash | `git cat-file -e` fails → claim rejected/flagged, not persisted |
The matrix lives in CI. An output-fidelity regression must fail CI before a user can file an issue.
## Out of scope
- Observer SDK tool permissions / security enforcement → plan-05.
- Worker process supervision / restart loops → plan-03.
- Write-path / persistence schema correctness → plan-09.