* plan-10 Phase 1: ship deterministic plugin runtime dependency closure

Approach A — commit & ship plugin/bun.lock so the plugin's runtime node_modules install is deterministic, fixing the recurring `Cannot find module 'zod/v3'` (#2730).

- align generated plugin zod range to root (^4.4.3) in build-hooks.js
- new scripts/gen-plugin-lockfile.cjs generates plugin/bun.lock as a build artifact after build-hooks.js writes plugin/package.json
- track & ship plugin/bun.lock (.gitignore negation, .npmignore, files allowlist)
- install with `bun install --frozen-lockfile --ignore-scripts` at runtime

Refs #2783, #2730

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* plan-10 Phase 2: fail loud at install time on a broken dependency closure

Strengthen verifyCriticalModules to assert each dependency is actually importable via require.resolve (not merely a directory), and assert the worker-required zod subpaths resolve: zod/v3, zod/v4, zod/v4-mini. A partial/stale install now fails `npx claude-mem install` immediately instead of surfacing later as a Stop-hook `Cannot find module 'zod/v3'`.

Bin-only packages (e.g. tree-sitter-cli, which has no bare-name entry point) fall back to resolving <dep>/package.json so a healthy install isn't falsely rejected.

Adds tests/cli/verify-critical-modules.test.ts covering a missing zod/v3 subpath (throws), a complete zod (passes), and a bin-only dep (passes).

Refs #2783, #2730

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* plan-10 Phase 3: clean-room install + import smoke test (#2730 backstop)

Add scripts/smoke-clean-room.cjs and a `smoke:clean-room` npm script. Against fresh temp dirs (never the repo's node_modules) it:

- copies plugin/, runs `bun install --frozen-lockfile --ignore-scripts`, asserts zod, zod/v3, zod/v4, zod/v4-mini resolve, and boots the bundled worker asserting no `Cannot find module` — the direct #2730 regression guard;
- `npm pack`s, installs the tarball into a second temp dir, and load-tests the published bin entrypoint, warning loudly on any declared main/exports target missing from the tarball (latent #2537 gap).

Exits non-zero naming the missing module on any failure; cleans up all temp dirs and the tarball in a finally.

Refs #2783, #2730

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* plan-10 Phase 4: gate CI and publish on the clean-room dependency closure

- ci.yml: new `clean-room-deps` job (between build and the docker e2e job) runs a frozen-lockfile drift check on the committed plugin lockfile, then `npm run build` + `npm run smoke:clean-room`. The drift step catches a contributor who changed plugin deps without regenerating plugin/bun.lock.
- npm-publish.yml: add setup-bun and run `npm run smoke:clean-room` between build and `npm publish`, so a broken runtime closure cannot be published on a tag push (ci.yml does not run on tags). Secrets block untouched.

Refs #2783, #2730

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* plan-10: doc recluster note + Phase 0 execution slice for #2730

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* plans: backlog recluster (2026-06-04) — cross-cluster execution order + plan-13 doc

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* plan-10: gen-plugin-lockfile degrades gracefully when bun is absent

The Windows build CI job has no bun on PATH; regenerating the lockfile there threw and failed the build. The committed plugin/bun.lock is already the deterministic closure, so skip regeneration (non-fatal) when bun is missing and a lockfile exists; fail loud only when neither is available.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
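
The Phase 2 check described above (resolve each dependency, fall back to `<dep>/package.json` for bin-only packages, then resolve the required subpaths) might look roughly like the sketch below. This is not the repository's actual verifyCriticalModules: the function name `findUnresolvable` and its parameters are hypothetical, and it assumes the packages involved do not hide `package.json` behind an `exports` map.

```typescript
import { createRequire } from "node:module";
import * as path from "node:path";

// Hypothetical helper: returns every specifier that cannot be resolved
// from `installDir`. An empty array means the closure looks importable.
function findUnresolvable(
  installDir: string,
  deps: string[],
  subpaths: string[],
): string[] {
  // Resolve relative to the install directory, not this script's own
  // node_modules. The anchor file itself does not need to exist.
  const req = createRequire(path.resolve(installDir, "package.json"));

  const canResolve = (specifier: string): boolean => {
    try {
      req.resolve(specifier);
      return true;
    } catch {
      return false;
    }
  };

  const missing: string[] = [];
  for (const dep of deps) {
    // Bin-only packages have no bare-name entry point, so accept a
    // resolvable package.json as proof that the package is installed.
    if (!canResolve(dep) && !canResolve(`${dep}/package.json`)) {
      missing.push(dep);
    }
  }
  for (const subpath of subpaths) {
    if (!canResolve(subpath)) missing.push(subpath);
  }
  return missing;
}
```

A caller would throw when the returned list is non-empty, naming the missing specifiers, for example after `findUnresolvable(pluginDir, ["zod"], ["zod/v3", "zod/v4", "zod/v4-mini"])`, where `pluginDir` stands in for wherever the plugin is installed.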
[plan-03] Worker / Daemon Lifecycle Hardening — supervision, identity, resource bounds
Defect
The worker/daemon has no robust lifecycle contract: startup health is checked against the wrong PID so start reports "process died during startup" even when it is alive; the PID file is never validated against process identity, so a recycled PID produces a permanent ghost-PID deadlock; the generator's spawned SDK child is SIGTERM'd (exit 143) mid-run leaving the queue to drown; Bun workers OOM-cascade when the host runs a heavy dev server; observer transcripts grow unbounded (single 1.9 GB JSONL); and on Windows the cumulative effect is zero observations ever generated. These are all the same gap: no identity-validated supervision with bounded resources and honest health.
Children
- #2747 — worker-cli `start` always fails 'Process died during startup' — waitForHealth checks the wrong PID
- #2726 — Worker PID file not validated against process identity → permanent ghost-PID deadlock (Windows)
- #2740 — Generator's spawned SDK child gets SIGTERM (exit 143) at ~3 min; no observations insert; queue drowns
- #2720 — Bun workers OOM cascade on Windows when host project runs Next.js dev (Turbopack)
- #2754 — Observer session transcripts grow unbounded — single 1.9 GB JSONL, 6.1 GB total
- #2703 — 0 observations ever generated on Windows (cross-cutting worker defects)
Fix sequence
Design doc: plans/03-worker-lifecycle.md. Health-check the actual spawned PID; validate PID-file identity (pid+start-time) before trusting/killing; supervise the SDK child with restart-on-unexpected-exit and queue drain protection; bound memory + transcript size with rotation; converge the Windows zero-observation path on the above.
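
As an illustration of the PID-file identity step (pid+start-time), here is a minimal sketch of the decision logic. It is not taken from the design doc: the `PidRecord` shape, the `StartTimeProbe` callback, and the tolerance value are all assumptions. How a process start time is actually read is platform-specific and is left to the injected probe.

```typescript
// Hypothetical PID-file contents: the pid plus the start time recorded at spawn.
interface PidRecord {
  pid: number;
  startTimeMs: number;
}

// Hypothetical probe: returns the start time of the live process with this
// pid, or null when no such process exists.
type StartTimeProbe = (pid: number) => number | null;

type Verdict = "ours" | "stale" | "recycled";

// Assumed tolerance for rounding differences between the recorded and
// probed start times.
const START_TIME_TOLERANCE_MS = 2_000;

function classifyPidRecord(record: PidRecord, probe: StartTimeProbe): Verdict {
  const liveStart = probe(record.pid);
  // No process with that pid: the PID file is left over from a dead worker.
  if (liveStart === null) return "stale";
  // A process exists, but it only counts as our worker if it started when
  // the PID file says it did. Otherwise the OS has reused the pid.
  const drift = Math.abs(liveStart - record.startTimeMs);
  return drift <= START_TIME_TOLERANCE_MS ? "ours" : "recycled";
}
```

Under this sketch only an `"ours"` verdict would justify trusting or signalling the process. Both `"stale"` and `"recycled"` would lead to discarding the PID file and starting fresh, which is what avoids the ghost-PID deadlock.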
Test matrix
| Host | Scenario | Required behavior |
|---|---|---|
| all | start | health checks the real PID; no false "died" |
| all | recycled PID | identity mismatch → no ghost deadlock |
| all | long generation | SDK child survives or restarts; queue drains |
| Windows | host Next.js dev running | no OOM cascade; observations land |
| all | long session | transcript rotates; bounded disk |
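
For the last row (transcript rotates; bounded disk), one possible shape for size-based rotation is sketched below. The function name `rotateIfOversized`, the numbered-suffix scheme, and the parameters are assumptions for illustration, not the plan's chosen design. `keep` is expected to be at least 1.

```typescript
import * as fs from "node:fs";

// Rotates `file` once it reaches `maxBytes`, keeping at most `keep`
// rotated copies (file.1 is the newest, file.<keep> the oldest).
// Returns true when a rotation happened.
function rotateIfOversized(file: string, maxBytes: number, keep: number): boolean {
  if (!fs.existsSync(file) || fs.statSync(file).size < maxBytes) {
    return false;
  }
  // Shift older copies up by one: file.(keep-1) -> file.keep, ..., file.1 -> file.2.
  // The previous file.<keep>, if any, is overwritten and so drops out.
  for (let i = keep - 1; i >= 1; i--) {
    const from = `${file}.${i}`;
    if (fs.existsSync(from)) {
      fs.renameSync(from, `${file}.${i + 1}`);
    }
  }
  // The active transcript becomes file.1; the writer recreates `file`
  // on its next append.
  fs.renameSync(file, `${file}.1`);
  return true;
}
```

Called before each append, this keeps disk use to roughly `(keep + 1) * maxBytes` per transcript, plus whatever a single append overshoots by.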
Out of scope
Env contamination of the SDK subprocess (was plan-06); observer output parsing (plan-11).