Alex Newman cf450cec00 fix(build): enforce shipped dependency-closure boundary (plan-10, closes #2783) (#2800)
* plan-10 Phase 1: ship deterministic plugin runtime dependency closure

Approach A — commit & ship plugin/bun.lock so the plugin's runtime
node_modules install is deterministic, fixing the recurring
`Cannot find module 'zod/v3'` (#2730).

- align generated plugin zod range to root (^4.4.3) in build-hooks.js
- new scripts/gen-plugin-lockfile.cjs generates plugin/bun.lock as a
  build artifact after build-hooks.js writes plugin/package.json
- track & ship plugin/bun.lock (.gitignore negation, .npmignore, files allowlist)
- install with `bun install --frozen-lockfile --ignore-scripts` at runtime

Refs #2783, #2730

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* plan-10 Phase 2: fail loud at install time on a broken dependency closure

Strengthen verifyCriticalModules to assert each dependency is actually
importable via require.resolve (not merely a directory), and assert the
worker-required zod subpaths resolve: zod/v3, zod/v4, zod/v4-mini.
A partial/stale install now fails `npx claude-mem install` immediately
instead of surfacing later as a Stop-hook `Cannot find module 'zod/v3'`.

Bin-only packages (e.g. tree-sitter-cli, which has no bare-name entry
point) fall back to resolving <dep>/package.json so a healthy install
isn't falsely rejected.

Adds tests/cli/verify-critical-modules.test.ts covering a missing zod/v3
subpath (throws), a complete zod (passes), and a bin-only dep (passes).

Refs #2783, #2730

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* plan-10 Phase 3: clean-room install + import smoke test (#2730 backstop)

Add scripts/smoke-clean-room.cjs and a `smoke:clean-room` npm script.
Against fresh temp dirs (never the repo's node_modules) it:
- copies plugin/, runs `bun install --frozen-lockfile --ignore-scripts`,
  asserts zod, zod/v3, zod/v4, zod/v4-mini resolve, and boots the bundled
  worker asserting no `Cannot find module` — the direct #2730 regression guard;
- `npm pack`s, installs the tarball into a second temp dir, and load-tests
  the published bin entrypoint, warning loudly on any declared main/exports
  target missing from the tarball (latent #2537 gap).

Exits non-zero naming the missing module on any failure; cleans up all
temp dirs and the tarball in a finally.

Refs #2783, #2730

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* plan-10 Phase 4: gate CI and publish on the clean-room dependency closure

- ci.yml: new `clean-room-deps` job (between build and the docker e2e job)
  runs a frozen-lockfile drift check on the committed plugin lockfile, then
  `npm run build` + `npm run smoke:clean-room`. The drift step catches a
  contributor who changed plugin deps without regenerating plugin/bun.lock.
- npm-publish.yml: add setup-bun and run `npm run smoke:clean-room` between
  build and `npm publish`, so a broken runtime closure cannot be published
  on a tag push (ci.yml does not run on tags). Secrets block untouched.

Refs #2783, #2730

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* plan-10: doc recluster note + Phase 0 execution slice for #2730

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* plans: backlog recluster (2026-06-04) — cross-cluster execution order + plan-13 doc

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* plan-10: gen-plugin-lockfile degrades gracefully when bun is absent

The Windows build CI job has no bun on PATH; regenerating the lockfile there
threw and failed the build. The committed plugin/bun.lock is already the
deterministic closure, so skip regeneration (non-fatal) when bun is missing
and a lockfile exists; fail loud only when neither is available.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 20:27:34 -07:00


[plan-03] Worker / Daemon Lifecycle Hardening — supervision, identity, resource bounds

Defect

The worker/daemon has no robust lifecycle contract. Startup health is checked against the wrong PID, so start reports "process died during startup" even when the process is alive. The PID file is never validated against process identity, so a recycled PID produces a permanent ghost-PID deadlock. The generator's spawned SDK child receives SIGTERM (exit 143) mid-run, so no observations are inserted and the queue backs up. Bun workers OOM-cascade when the host runs a heavy dev server. Observer transcripts grow unbounded (a single 1.9 GB JSONL). On Windows, the cumulative effect is that no observations are ever generated. These are all the same gap: no identity-validated supervision with bounded resources and honest health.

Children

  • #2747 — worker-cli start always fails 'Process died during startup' — waitForHealth checks the wrong PID
  • #2726 — Worker PID file not validated against process identity → permanent ghost-PID deadlock (Windows)
  • #2740 — Generator's spawned SDK child gets SIGTERM (exit 143) at ~3 min; no observations insert; queue drowns
  • #2720 — Bun workers OOM cascade on Windows when host project runs Next.js dev (Turbopack)
  • #2754 — Observer session transcripts grow unbounded — single 1.9 GB JSONL, 6.1 GB total
  • #2703 — 0 observations ever generated on Windows (cross-cutting worker defects)

Fix sequence

Design doc: plans/03-worker-lifecycle.md.

  1. Health-check the actual spawned PID.
  2. Validate PID-file identity (pid + start-time) before trusting or killing.
  3. Supervise the SDK child with restart-on-unexpected-exit and queue drain protection.
  4. Bound memory and transcript size, with rotation.
  5. Converge the Windows zero-observation path on the above.
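The PID-file identity check (pid + start-time) can be illustrated with a small sketch. This is not the design from plans/03-worker-lifecycle.md; the `PidRecord` shape, the `classifyPidRecord` name, the injected `lookupStartTime` callback, and the 2-second tolerance are all assumptions made for illustration. The idea is only that a PID is trusted when a live process exists under it and that process started at the time the file recorded.

```typescript
// Hypothetical shape of the PID file contents; the real format may differ.
interface PidRecord {
  pid: number;
  startTimeMs: number; // start time the worker recorded when it wrote the file
}

type PidVerdict = "ours" | "dead" | "recycled";

// Assumption: the caller supplies a platform-specific lookup that returns the
// live process's start time in ms, or null when no process has that PID.
type StartTimeLookup = (pid: number) => number | null;

// Classify a PID record before trusting or killing the process it names.
// Start times read back from the OS can be coarser than the recorded value,
// so a small tolerance is allowed (the default here is an arbitrary choice).
function classifyPidRecord(
  record: PidRecord,
  lookupStartTime: StartTimeLookup,
  toleranceMs = 2000,
): PidVerdict {
  const liveStart = lookupStartTime(record.pid);
  if (liveStart === null) return "dead"; // nothing runs under that PID
  const drift = Math.abs(liveStart - record.startTimeMs);
  // Same PID but a different start time means the OS reused the PID.
  return drift <= toleranceMs ? "ours" : "recycled";
}
```

Under this sketch, only an "ours" verdict would justify signalling the process or reporting the worker as running. Both "dead" and "recycled" would mean the PID file is stale and can be discarded, which is what avoids the ghost-PID deadlock.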

Test matrix

| Host | Scenario | Required behavior |
|---|---|---|
| all | start | health checks the real PID; no false "died" |
| all | recycled PID | identity mismatch → no ghost deadlock |
| all | long generation | SDK child survives or restarts; queue drains |
| Windows | host Next.js dev running | no OOM cascade; observations land |
| all | long session | transcript rotates; bounded disk |
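The "long session" row (transcript rotates; bounded disk) could be exercised against a size-triggered rotation helper along the following lines. This is a minimal sketch under stated assumptions: `rotateIfOversized`, `appendRecord`, the `.1` suffix, and the single-generation policy are hypothetical and not taken from the design doc.

```typescript
import * as fs from "node:fs";

// Hypothetical helper: once `filePath` reaches maxBytes, move it to
// `filePath.1`, replacing any older generation. Returns true if it rotated.
function rotateIfOversized(filePath: string, maxBytes: number): boolean {
  let size: number;
  try {
    size = fs.statSync(filePath).size;
  } catch {
    return false; // no transcript yet, nothing to rotate
  }
  if (size < maxBytes) return false;
  const rotated = `${filePath}.1`;
  fs.rmSync(rotated, { force: true }); // drop the previous generation, if any
  fs.renameSync(filePath, rotated);
  return true;
}

// Append one JSONL record, rotating first when the active file is oversized.
function appendRecord(filePath: string, record: unknown, maxBytes: number): void {
  rotateIfOversized(filePath, maxBytes);
  fs.appendFileSync(filePath, JSON.stringify(record) + "\n");
}
```

Because only one previous generation is kept, disk use per transcript stays at roughly twice `maxBytes` plus one record. A test for the matrix row could append records past the threshold and assert that the active file shrinks back after rotation.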

Out of scope

Env contamination of the SDK subprocess (was plan-06); observer output parsing (plan-11).