Files
thedotmack-claude-mem/plans/2026-06-10-worker-restart-single-source-of-truth.md
Alex Newman c0b96288a7 fix(restart): worker-restart single source of truth — self-replacing worker, spawn gate, verified restarts (#2894)
* plan-10 Phase 1: ship deterministic plugin runtime dependency closure

Approach A — commit & ship plugin/bun.lock so the plugin's runtime
node_modules install is deterministic, fixing the recurring
`Cannot find module 'zod/v3'` (#2730).

- align generated plugin zod range to root (^4.4.3) in build-hooks.js
- new scripts/gen-plugin-lockfile.cjs generates plugin/bun.lock as a
  build artifact after build-hooks.js writes plugin/package.json
- track & ship plugin/bun.lock (.gitignore negation, .npmignore, files allowlist)
- install with `bun install --frozen-lockfile --ignore-scripts` at runtime

Refs #2783, #2730

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* plan-10 Phase 2: fail loud at install time on a broken dependency closure

Strengthen verifyCriticalModules to assert each dependency is actually
importable via require.resolve (not merely a directory), and assert the
worker-required zod subpaths resolve: zod/v3, zod/v4, zod/v4-mini.
A partial/stale install now fails `npx claude-mem install` immediately
instead of surfacing later as a Stop-hook `Cannot find module 'zod/v3'`.

Bin-only packages (e.g. tree-sitter-cli, which has no bare-name entry
point) fall back to resolving <dep>/package.json so a healthy install
isn't falsely rejected.

Adds tests/cli/verify-critical-modules.test.ts covering a missing zod/v3
subpath (throws), a complete zod (passes), and a bin-only dep (passes).

Refs #2783, #2730

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* plan-10 Phase 3: clean-room install + import smoke test (#2730 backstop)

Add scripts/smoke-clean-room.cjs and a `smoke:clean-room` npm script.
Against fresh temp dirs (never the repo's node_modules) it:
- copies plugin/, runs `bun install --frozen-lockfile --ignore-scripts`,
  asserts zod, zod/v3, zod/v4, zod/v4-mini resolve, and boots the bundled
  worker asserting no `Cannot find module` — the direct #2730 regression guard;
- `npm pack`s, installs the tarball into a second temp dir, and load-tests
  the published bin entrypoint, warning loudly on any declared main/exports
  target missing from the tarball (latent #2537 gap).

Exits non-zero naming the missing module on any failure; cleans up all
temp dirs and the tarball in a finally.

Refs #2783, #2730

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* plan-10 Phase 4: gate CI and publish on the clean-room dependency closure

- ci.yml: new `clean-room-deps` job (between build and the docker e2e job)
  runs a frozen-lockfile drift check on the committed plugin lockfile, then
  `npm run build` + `npm run smoke:clean-room`. The drift step catches a
  contributor who changed plugin deps without regenerating plugin/bun.lock.
- npm-publish.yml: add setup-bun and run `npm run smoke:clean-room` between
  build and `npm publish`, so a broken runtime closure cannot be published
  on a tag push (ci.yml does not run on tags). Secrets block untouched.

Refs #2783, #2730

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* plan-10: doc recluster note + Phase 0 execution slice for #2730

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* plans: backlog recluster (2026-06-04) — cross-cluster execution order + plan-13 doc

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* plan-10: gen-plugin-lockfile degrades gracefully when bun is absent

The Windows build CI job has no bun on PATH; regenerating the lockfile there
threw and failed the build. The committed plugin/bun.lock is already the
deterministic closure, so skip regeneration (non-fatal) when bun is missing
and a lockfile exists; fail loud only when neither is available.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: rebuild plugin artifacts after merging main (v13.5.1) + plan-10 work

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* chore: rebuild plugin artifacts after merging main v13.5.5

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* chore(deps): daily upgrade pass — agent SDK 0.3.172, better-auth 1.6.16, posthog-node 5.36.15, dompurify 3.4.9

- Bump @anthropic-ai/claude-agent-sdk 0.2.141 -> 0.3.172 (tsc + full test suite green)
- Remove deprecated @types/dompurify stub (dompurify ships its own types)
- Add overrides.tmp ^0.2.7 to clear GHSA-52f5-9888-hmc6 / GHSA-ph9p-34f9-6g65
  via np -> listr-input -> inquirer -> external-editor -> tmp chain
- npm audit: 0 vulnerabilities; npm outdated: clean
- package-lock.json is gitignored in this repo, so not committed

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* plan: worker-restart single-source-of-truth — 7-phase fix for restart races

Phased plan from the adversarially-verified diagnosis (wf_f07f3541-b05):
kill the cache mirror, single verified restart initiator, self-replacing
restart endpoint, unified spawn gate with lockfile, PID-file demotion,
test data-dir isolation, soak verification.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* refactor(restart): delete sync-script cache-mirror and HTTP restart trigger

Phase 1 of plans/2026-06-10-worker-restart-single-source-of-truth.md.
The installed-version cache mirror wrote version-N code into the
version-(N-1) cache dir, manufacturing permanent version disagreement;
the HTTP POST to /api/admin/restart raced the CLI restart that follows
it in build-and-sync. Both are deleted; the CLI worker:restart in the
marketplace copy is now the single restart initiator, and the sleep 1
between the two mechanisms is gone.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat(restart): restart proves itself or exits 1

Phase 2 of plans/2026-06-10-worker-restart-single-source-of-truth.md.
worker-service restart now captures the old worker pid, waits for the
port with the same platform-scaled 15s budget as stop, spawns the
marketplace copy of worker-service.cjs when present, then polls
/api/health until the pid changes and the version matches this build's
baked __DEFAULT_PACKAGE_VERSION__ — success is printed to stdout,
deadline (platform-scaled 30s) exits 1 with the last observed health
payload and the spawned script path. The --daemon generic start-failure
path now exits 1 instead of masquerading as success; the three
duplicate-suppression exits remain 0.

New helper src/services/restart-verify.ts (worker-service.ts bootstraps
on import, so the helper lives in an import-safe module) with 8 tests
covering pid-flip success, stale pid, wrong version, unreachable
timeout, 503-degraded acceptance, and null-oldPid version-only
verification.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat(restart): self-replacing worker — old worker spawns its successor

Phase 3 of plans/2026-06-10-worker-restart-single-source-of-truth.md.
/api/admin/restart was kill-only: hooks that POSTed it then raced the
dying worker with their own lazy-spawn (the observed recycle ping-pong).
Now the dying worker spawns its successor itself — after a re-entrancy-
guarded, deadline-bounded (platform-scaled 10s) graceful shutdown, and
only once its port is confirmed free; stop and signal shutdowns stay
kill-only. The hook recycle path waits for that successor via
/api/health polling (HOOK_READINESS_TIMEOUT_MS budget) and lazy-spawns
only as a fallback, with a warn-only version re-check so a hook never
recycles more than once per invocation.

Shutdown sequence lives in import-safe src/services/worker-shutdown.ts
(worker-service.ts bootstraps on import); registerSignalHandlers no
longer pre-sets isShuttingDown — the supervisor's shutdownInitiated
guard owns signal dedupe, and pre-setting would no-op the new entry
guard. 13 new tests cover re-entrancy, deadline expiry/rejection,
handoff ordering, kill-only reasons, successor-wait vs lazy-spawn
fallback, and pre-graceful bookkeeping failures.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat(restart): one spawn gate; CLI restart defers to the self-replacing worker

Phase 4 of plans/2026-06-10-worker-restart-single-source-of-truth.md.
Three uncoordinated spawn paths (hook lazy-spawn, MCP worker-spawner,
CLI) with two different bun resolvers produced 3-launcher collisions
within a single second. Now a wx-flag lockfile (<DATA_DIR>/spawn.lock,
60s mtime staleness with re-stat-before-unlink, owner-checked release)
gates every external spawn: lock losers never fail — they skip the
spawn and wait for the winner's worker. resolveBunRuntime is deleted in
favor of ProcessManager's resolveWorkerRuntimePath (adds BUN_PATH,
~/.bun/bin, brew, which fallbacks), closing the kill-then-can't-respawn
path; mcp-server prefers the marketplace worker script so stale cache
dirs stop spawning stale workers.

Integration fix surfaced by live verification: the CLI restart raced
the Phase 3 self-replacement handoff (the successor re-binds the port
in ~200ms, so waitForPortFree always timed out and restart exited 1
while the restart had actually succeeded). The CLI now verifies the
worker's self-spawned successor directly, and only spawns — gate-
wrapped, after the port frees — as the fallback when no worker was
running, the shutdown POST was rejected, or no successor appeared. The
dying worker's handoff is intentionally ungated: it spawns only after
its own port closes, and hooks wait on it.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat(restart): demote the PID file — health and port are the liveness oracle

Phase 5 of plans/2026-06-10-worker-restart-single-source-of-truth.md.
The dying worker's shutdown cascade deleted the PID file
unconditionally as its final act, clobbering the successor's
freshly-written file; status then required portInUse AND pidInfo, so a
healthy worker reported as "not running". Now every PID-file deletion
is owner-guarded: the supervisor cascade deletes only its own pid
(removeOwnedPidFile), and the CLI stop/restart-fallback, the restart
handoff, and the daemon start-failure cleanup go through
removePidFileIfOwner (owner-or-dead — a live successor's file always
survives; corrupt files are left for the next boot's validator).

status sources from GET /api/health alone (pid, version, uptime,
workerPath; 503-degraded counts as running and now surfaces its queue
detail), with port-in-use-but-unreachable and not-running fallbacks —
all exit 0 as before. The --daemon duplicate gate checks the port
first (ground truth) and the PID file second (advisory, for the
freed-port-but-undeleted-file window); duplicate suppression stays
exit 0. writePidFile/touchPidFile remain — the file is diagnostics,
and the worker stays its only writer.

Also fixes combined-run test pollution: spawn-gate and worker-utils
timeout tests now eagerly import paths.js before setting a temp
CLAUDE_MEM_DATA_DIR, so the import-time DATA_DIR const can't freeze on
a deleted temp dir for suites loaded later in the same bun process.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* test: no test ever touches the real ~/.claude-mem again

Phase 6 of plans/2026-06-10-worker-restart-single-source-of-truth.md.
process-manager and graceful-shutdown tests wrote corrupt JSON and
sentinel PIDs (2147483647) into the real ~/.claude-mem/worker.pid and
drove the real supervisor.json cascade under a snapshot-restore that a
killed run would skip — that pollution contaminated production logs and
a prior diagnosis. Both files now set a temp CLAUDE_MEM_DATA_DIR at the
top of the file before dynamically importing the code under test (ESM
hoisting makes beforeEach too late), assert their paths landed outside
the real dir, and derive PID_FILE from the same frozen paths module the
code uses so test and code can never diverge under bun's shared module
cache. The snapshot-restore scaffolding is deleted; zero assertions
changed.

tests/preload.ts gains a tripwire: when CLAUDE_MEM_DATA_DIR is unset it
fills a per-run temp dir, so no test in any file can fall through to
the real data dir. Fallout made explicit: worker-spawn child processes
get an explicit temp dir; install-error-matrix restores rather than
deletes the env var; settings-defaults-manager pins the unset-env
default it was implicitly relying on.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(settings): bootstrap notices go to stderr, never stdout

CI on PR #2894 caught the latent bug: on the first boot in a fresh data
dir, SettingsDefaultsManager printed '[SETTINGS] Created settings file
with defaults: ...' to stdout before the start command's JSON hook
payload, corrupting the machine-readable contract every fresh install's
first hook invocation relies on. The Phase 6 per-run temp data dir made
the cold-dir case deterministic in CI, exposing it. Both informational
notices (creation, nested-schema migration) now use console.warn —
stderr — matching the function's existing failure-path idiom; two
regression tests pin stdout silence on both paths.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* refactor(restart): address PR #2894 review — dedupe script resolver, skip futile port wait

Both inline copies of the marketplace-first script-candidate list in
worker-service.ts (restart fallback + successor handoff injection) now
call the exported resolveWorkerScriptPath() ?? __filename, so the
candidate list lives in one place. verifyRestartedWorker's failure
result gains lastPollSawHealth; when the self-replacement handoff
verification timed out while a live (but unverifiable) worker was still
serving on the port, the CLI fallback now skips its port-free wait —
the port cannot free while that worker lives, so the wait only burned
its full platform-scaled budget before the same final verification ran
anyway.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 19:59:34 -07:00

25 KiB
Raw Blame History

Worker Restart: Single Source of Truth

Created: 2026-06-10 Root-cause analysis: 12-agent diagnosis, adversarially verified (workflow wf_f07f3541-b05). Summary: worker-restart failures are caused by five redundant "who is the worker" oracles with uncoordinated writers, a sync-script "hot reload" mirror that writes version-N code into the version-(N-1) cache dir, a kill-only restart endpoint that races hook lazy-spawns, and a build chain that fires two uncoordinated restarts and never verifies the outcome. 98 version-recycle ping-pong events and six EADDRINUSE hard failures observed Jun 810.

Execution model: Each phase is self-contained and lands independently (one commit/PR per phase, in order). Phases 12 are the high-leverage, low-risk slice. Run npm test plus the phase's verification checklist before moving on.


Phase 0: Consolidated Documentation — Allowed APIs and Ground Truth

(Discovery already performed by three fact-extraction subagents on 2026-06-10; consolidated here. Executors: trust these refs, but re-read the cited lines before editing — line numbers drift.)

Allowed APIs (verified to exist — use ONLY these, with these exact signatures)

API Location Signature / shape
ensureWorkerRunning() src/shared/worker-utils.ts:293-381 (): Promise<boolean> — hook lazy-spawn + PR #2768 version-recycle
resolveWorkerScriptPath() src/shared/worker-utils.ts:206-215 candidates: MARKETPLACE_ROOT/plugin/scripts/worker-service.cjs, then cwd()/plugin/scripts/worker-service.cjs
resolveBunRuntime() src/shared/worker-utils.ts:217-235 hook-side resolver; MISSING ~/.bun/bin fallbacks
waitForWorkerPort / waitForWorkerReadiness src/shared/worker-utils.ts:237-273 polls GET /api/readiness; budget env CLAUDE_MEM_HOOK_READINESS_TIMEOUT_MS
ensureWorkerStarted(port, workerScriptPath) src/services/worker-spawner.ts:70-153 returns 'ready' | 'warming' | 'dead'; NO version guard
spawnDaemon(scriptPath, port, extraEnv?) src/services/infrastructure/ProcessManager.ts:408-472 returns PID or undefined; setsid on Unix, PowerShell on Windows
resolveWorkerRuntimePath(options?) src/services/infrastructure/ProcessManager.ts:63-125 full bun resolver chain (BUN, BUN_PATH, ~/.bun/bin, brew paths, which)
PID file APIs ProcessManager.ts:134-168, 508-520 writePidFile(info), readPidFile(): PidInfo|null, removePidFile(), touchPidFile(), cleanStalePidFile()
httpShutdown(port, reason?) src/services/infrastructure/HealthMonitor.ts:94-114 POSTs /api/admin/shutdown?reason=restart
waitForPortFree(port, timeoutMs?) HealthMonitor.ts:85-92 500ms poll
checkVersionMatch(port) src/services/infrastructure/HealthMonitor.ts:~120-161 returns {matches, pluginVersion, workerVersion}; fail-open on ENOENT
performGracefulShutdown(config) src/services/infrastructure/GracefulShutdown.ts:30-58 sequential closes, NO global deadline
flushResponseThen(res, payload, action) src/services/server/flushResponseThen.ts:3-16 responds, runs action on 'finish', ALWAYS process.exit(0)
writeJsonFileAtomic() src/npx-cli/utils/paths.ts:124-205 the ONLY atomic-write helper in the repo
MARKETPLACE_ROOT src/shared/paths.ts:43 ~/.claude/plugins/marketplaces/thedotmack
resolveDataDir() / DATA_DIR src/shared/paths.ts:18-40 env CLAUDE_MEM_DATA_DIR wins (line 19-20); module-level const — computed at import time
Worker port default src/shared/SettingsDefaultsManager.ts:91 37700 + (uid % 100) — NEVER hardcode 37777
/api/health response src/services/server/Server.ts:212-233 {status, version, workerPath, uptime, pid, initialized, mcpReady, ...} — has everything verification needs
/api/readiness response Server.ts:235-247 {status: 'ready'|'initializing', mcpReady}
/api/admin/restart Server.ts:282-294 kill-only via flushResponseThenonRestart()shutdown('restart'); NOTHING respawns (macOS/Linux)
version: BUILT_IN_VERSION baked via __DEFAULT_PACKAGE_VERSION__ esbuild define scripts/build-hooks.js:312,365,421,469,495,540

Anti-patterns (verified NOT to exist — do not invent)

  • There is NO /api/admin/status, NO /api/version route in Server.ts (version comes from /api/health), NO respawn anywhere in the HTTP restart path.
  • There is NO flock/O_EXCL/wx-flag lockfile pattern anywhere in src/ — Phase 4 introduces the first one; copy writeJsonFileAtomic for the write discipline.
  • plugin/scripts/*.cjs are BUILT artifacts — never hand-edit; rebuild with npm run build.
  • package-lock.json is gitignored — do not commit it.
  • Tests are bun test (bunfig.toml preloads tests/preload.ts for the PostHog mock). Mock pattern: mock.module() + query-param cache-bust fresh import (tests/shared/worker-utils-version-recycle.test.ts:22-32).

Behavioral contracts that MUST keep passing

  • tests/shared/worker-utils-version-recycle.test.ts: on version mismatch ensureWorkerRunning() POSTs /api/admin/restart ≥1×; on match, 0×.
  • tests/integrations/spawn-contract-windows.test.ts: spawn-contract env overrides.
  • Full suite: 2203 pass / 0 fail as of v13.5.5.

Phase 1: Delete the cache-mirror; make the CLI restart the single initiator

Why first: The mirror (sync-marketplace.cjs:164-173) is the largest manufactured source of version disagreement (7 of 10 cache dirs hold off-by-one content); the double restart (HTTP POST + sleep 1 + CLI restart) is the race generator. Both fixes are deletions.

What to implement

  1. In scripts/sync-marketplace.cjs:
    • Delete the cache-mirror block (lines ~164-173: INSTALLED_CACHE_PATH + rsync + bun install into the cache dir).
    • Delete detectInstalledVersion() (lines ~79-114) and its call site — it exists only to feed the mirror.
    • Delete the HTTP restart trigger block (lines ~196-216: the http.request POST to /api/admin/restart and its success/error prints). The sync script's job ends at "files synced + bun install in the marketplace copy".
  2. In root package.json line ~67, simplify:
    • From: "build-and-sync": "npm run build && npm run sync-marketplace && sleep 1 && (cd ~/.claude/plugins/marketplaces/thedotmack && npm run worker:restart)"
    • To: "build-and-sync": "npm run build && npm run sync-marketplace && (cd ~/.claude/plugins/marketplaces/thedotmack && npm run worker:restart)"
    • (drop the sleep 1 — it existed to let the now-deleted HTTP kill land before the CLI restart)
  3. Do NOT touch the existing one-time legacy-cache cleanup elsewhere in sync-marketplace.cjs (if present) — only the three blocks named above.

Documentation references

  • Verbatim quotes of all three blocks: Phase 0 table + the restart/shutdown extraction (sync-marketplace.cjs:79-114, 164-173, 196-216; package.json:64-79).

Verification checklist

  • grep -n "INSTALLED_CACHE_PATH\|detectInstalledVersion\|admin/restart" scripts/sync-marketplace.cjs → no matches.
  • grep -n "sleep 1" package.json → no match in build-and-sync.
  • npm run build-and-sync → completes; worker restarts (via CLI path only); curl -s http://127.0.0.1:$PORT/api/health shows the just-built version. (Resolve $PORT as 37700 + uid%100 or from ~/.claude-mem/settings.json.)
  • npm test → green.
  • Stale-cache regression check: ls ~/.claude/plugins/cache/thedotmack/claude-mem/*/package.json | xargs grep -h '"version"' — record current values; after the NEXT release, re-check that no cache dir's content changed (the mirror used to mutate them).

Anti-pattern guards

  • Do not "fix" the mirror by targeting workerPath instead — the feature is being removed, not repaired.
  • Do not add a new restart mechanism here; Phase 2 hardens the existing CLI restart.

Phase 2: Post-restart verification — restart proves itself or exits 1

Why: restart currently exits 0 after spawnDaemon returns a PID — fork success, not boot success. Four silent exit-0 daemon paths mean "✓" with a dead/stale worker.

What to implement

All in src/services/worker-service.ts:

  1. In case 'restart' (lines ~956-973):
    • BEFORE httpShutdown, capture the old worker: GET /api/health (2s timeout) → save oldPid (may be null if no worker).
    • Replace waitForPortFree(port, 5000) with waitForPortFree(port, getPlatformTimeout(15000)) — parity with stop (line ~946).
    • AFTER spawnDaemon, add a verification loop (new helper verifyRestartedWorker(port, oldPid, deadlineMs)): poll GET /api/health every 500ms until health.pid !== oldPid && health.version === EXPECTED_VERSION, where EXPECTED_VERSION is this process's own baked __DEFAULT_PACKAGE_VERSION__ (build-and-sync runs the marketplace copy, so its baked version IS the just-synced version). Deadline: getPlatformTimeout(30000).
    • On success: log Worker restart verified {pid, version} and exit 0. On deadline: console.error with the last observed health payload (or connection error) and exit 1.
  2. In the --daemon block (lines ~1167-1208): change EXIT PATH 4 only (generic start failure, lines ~1204-1206) from process.exit(0) to process.exit(1). Paths 1-3 are legitimate duplicate-suppression and stay exit 0.
  3. Spawn target: change spawnDaemon(__filename, port) (line ~965) to prefer the marketplace script — copy the candidate pattern from resolveWorkerScriptPath() (src/shared/worker-utils.ts:206-215), falling back to __filename when no marketplace copy exists (dev trees, CI).

Documentation references

  • /api/health shape: Server.ts:212-233 (pid, version fields confirmed).
  • getPlatformTimeout: used at worker-service.ts:946.
  • __DEFAULT_PACKAGE_VERSION__ availability inside worker-service bundle: scripts/build-hooks.js:312-313.

Verification checklist

  • New test tests/services/worker-restart-verify.test.ts: mock global.fetch (copy the fetchLog pattern from tests/shared/worker-utils-version-recycle.test.ts:34-50); assert verifyRestartedWorker returns success when health flips to {pid: newPid, version: expected}, failure on stale pid, failure on wrong version, failure on timeout.
  • Manual: npm run build-and-sync → output includes Worker restart verified; then kill -9 <worker pid> mid-restart-window and re-run to see a LOUD exit-1 path (or simulate by pointing the verify loop at a dead port in the test).
  • npm test → green, including the version-recycle contract.

Anti-pattern guards

  • Do NOT poll /api/version (doesn't exist) or invent new health fields.
  • Do NOT compare against marketplace package.json on disk for EXPECTED_VERSION — the baked constant is the truth for "the code I am running"; disk reads reintroduce a second oracle.
  • The verify deadline must not block forever: hard cap, then exit 1.

Phase 3: Self-replacing restart — old worker spawns its successor; recycle path stops spawning into corpses

Why: /api/admin/restart is kill-only; hooks that POST it then lazy-spawn race the dying worker (the ping-pong). If the OLD worker spawns its successor as its final act after the port closes, old and new never coexist and no third party spawns into a corpse.

What to implement

  1. src/services/worker-service.ts shutdown() (lines ~671-699):
    • Re-entrancy guard: the isShuttingDown field (line ~188) is write-only today; make shutdown() check-and-set it at entry (if (this.isShuttingDown) return; this.isShuttingDown = true;).
    • Hard deadline: wrap performGracefulShutdown(...) in Promise.race with a getPlatformTimeout(10000) timer; on deadline, log Graceful shutdown deadline exceeded — proceeding and continue (do not hang on unbounded session drain — drain today can run 35-40s).
    • Successor spawn: when reason === 'restart', after graceful shutdown completes/deadlines: resolve the marketplace script (same candidate pattern as Phase 2 step 3), await waitForPortFree(port, 5000), then spawnDaemon(marketplaceScript, port). If port never frees or spawn returns undefined, log loudly (logger.error) — the next hook's lazy-spawn is the safety net. Note flushResponseThen (flushResponseThen.ts:3-16) calls process.exit(0) after the action completes, so the spawn must be awaited inside the action.
  2. src/shared/worker-utils.ts ensureWorkerRunning() recycle path (lines ~305-330):
    • After POSTing /api/admin/restart, do NOT immediately lazy-spawn. Instead poll GET /api/health (500ms interval, HOOK_READINESS_TIMEOUT_MS budget) for the successor: healthy AND version === pluginVersion (already in hand from checkVersionMatch). Only fall through to the existing lazy-spawn if the successor never appears.
    • Amplifier fix: after waitForWorkerReadiness succeeds anywhere in this function, re-check the version once via /api/health; if still mismatched, log a warning (do NOT loop/recycle again in the same invocation — one recycle per hook event, the next hook retries; unbounded loops here re-create the storm).

Documentation references

  • onRestart wiring: worker-service.ts:255; route: Server.ts:282-294; flushResponseThen.ts:3-16.
  • Drain timings + unbounded awaits: GracefulShutdown.ts:30-75.
  • HOOK_READINESS_TIMEOUT_MS: worker-utils.ts:45-48.
  • Signal-path guard to mirror: src/supervisor/index.ts:56-60.

Verification checklist

  • tests/shared/worker-utils-version-recycle.test.ts still green (still POSTs restart on mismatch — the change is what happens AFTER the POST).
  • New test: shutdown re-entrancy — call shutdown() twice, assert performGracefulShutdown runs once (mock it via mock.module).
  • New test: recycle-no-corpse-spawn — fetch mock where /api/admin/restart 200s and /api/health returns the NEW version on poll N: assert NO spawn attempt; where health never recovers: assert lazy-spawn fallback fires.
  • Manual end-to-end: with a worker running, curl -X POST http://127.0.0.1:$PORT/api/admin/restart; within ~15s /api/health shows a NEW pid and the marketplace version, with no hook involvement. Check ~/.claude-mem/logs/claude-mem-$(date +%F).log for exactly one shutdown and one daemon start (no duplicate-refusal lines).
  • npm test → green.

Anti-pattern guards

  • The successor spawn happens ONLY on reason === 'restart'stop must stay kill-only.
  • Never spawn before the port is confirmed free — that recreates EXIT PATH 2/3 duplicate suicides.
  • Do not add respawn logic to flushResponseThen itself or to the Windows-managed IPC branch (process.send path, Server.ts:284-289) — Windows wrapper already owns restart there.

Phase 4: One spawn gate — shared ensureWorker() with a spawn lockfile; kill the resolver asymmetry

Why: Three spawn paths (hooks via worker-utils, MCP via worker-spawner, CLI via spawnDaemon) with two different bun resolvers and no mutual exclusion; logs show 3 launchers colliding within one second.

What to implement

  1. New module src/shared/worker-spawn-gate.ts:
    • acquireSpawnLock(): booleanwriteFileSync(join(DATA_DIR, 'spawn.lock'), JSON.stringify({pid: process.pid, startedAt: new Date().toISOString()}), {flag: 'wx'}) in try/catch. On EEXIST: statSync the lock; if mtimeMs older than 30_000ms, unlinkSync and retry ONCE; else return false.
    • releaseSpawnLock(): void — unlink, owner-checked (read it; only delete if pid === process.pid), errors swallowed.
    • Use DATA_DIR from src/shared/paths.ts — never homedir() directly.
  2. In src/shared/worker-utils.ts ensureWorkerRunning() (spawn section, lines ~332-351): wrap the spawn in the lock — if acquireSpawnLock() fails, skip the spawn and go straight to waitForWorkerPort/waitForWorkerReadiness (someone else is spawning; wait for their worker). Release in finally.
  3. Same wrap in src/services/worker-spawner.ts ensureWorkerStarted() around the spawnDaemon call (line ~132).
  4. Resolver unification: delete resolveBunRuntime() from worker-utils.ts (lines 217-235) and import/re-export resolveWorkerRuntimePath from ProcessManager.ts (already exported; it strictly supersedes — adds BUN_PATH, ~/.bun/bin, brew, snap fallbacks). This closes the kill-then-can't-respawn path.
  5. src/servers/mcp-server.ts (lines ~42-51): compute WORKER_SCRIPT_PATH preferring the marketplace copy — copy the candidate pattern from resolveWorkerScriptPath() with fallback to the current own-dir resolution. This stops MCP servers in stale cache dirs from spawning stale workers.

Documentation references

  • Write-discipline reference for the lock: writeJsonFileAtomic, src/npx-cli/utils/paths.ts:124-205 (the wx-flag + cleanup-on-error shape; the lock is simpler — no rename needed, wx IS the atomicity).
  • Spawn call sites table: Phase 0 / spawn-path report §7.
  • resolveWorkerRuntimePath chain: ProcessManager.ts:63-125.

Verification checklist

  • New test tests/shared/worker-spawn-gate.test.ts (temp dir via CLAUDE_MEM_DATA_DIR + dynamic import — see Phase 6 trap): second acquire fails while held; stale lock (backdate mtime via utimesSync) is broken and re-acquired; release is owner-only.
  • grep -rn "resolveBunRuntime" src/ → no definition in worker-utils (only the ProcessManager import).
  • grep -n "spawnHidden\|spawnDaemon" src/shared/worker-utils.ts src/services/worker-spawner.ts → every spawn site is inside the lock.
  • Race test (manual): run 5 concurrent node "$_P/scripts/bun-runner.js" .../worker-service.cjs start invocations with no worker running; logs must show exactly ONE Starting worker daemon and zero refusing to start duplicate storms.
  • npm test → green (version-recycle contract intact).

Anti-pattern guards

  • The lock gates SPAWNING only — never health checks; a held lock must never make a hook fail, only wait.
  • 30s staleness must use file mtime, not clock-in-content comparisons (test pollution wrote startedAt: 2024-01-01 once already).
  • Do not introduce a lock library; wx flag is sufficient and dependency-free.

Phase 5: Demote the PID file — port + /api/health become the only liveness oracle

Why: The dying worker's shutdown cascade deletes the NEW worker's PID file (src/supervisor/shutdown.ts:88), after which status reports a healthy worker as "not running" (status requires portInUse && pidInfo, worker-service.ts:975-988). /api/health already carries pid, version, workerPath — it subsumes the file.

What to implement

  1. Owner-guarded deletion (the clobber fix):
    • src/supervisor/shutdown.ts (~line 88): before rmSync(pidFilePath), read the file; delete ONLY if its pid === process.pid. A mismatch means a successor already wrote its own — log debug and leave it.
    • src/services/worker-service.ts case 'restart' (~line 964) and case 'stop' (~line 949): replace bare removePidFile() with the same owner-or-dead check: delete only if the recorded pid is the one we just shut down (captured in Phase 2's pre-shutdown health probe) or the recorded pid is not alive.
  2. case 'status' (lines ~975-988): source of truth becomes GET /api/health — report pid, version, uptime, workerPath from the response; fall back to "port in use but health unreachable (wedged?)" and "not running". Drop the readPidFile() requirement.
  3. --daemon duplicate gate (lines ~1167-1182): reorder — port/health probe FIRST (it's ground truth), PID file second (advisory only, for the no-port-bound-yet boot window). Keep writePidFile/touchPidFile as diagnostics — the worker itself remains the only writer.

Documentation references

  • Clobber chain: GracefulShutdown.ts:55supervisor.stop()runShutdownCascadeshutdown.ts:87-98.
  • verifyPidFileOwnership + startToken: ProcessManager.ts (PID APIs §3 of spawn-path report).
  • Health shape: Server.ts:212-233.

Verification checklist

  • Update tests/infrastructure/process-manager.test.ts expectations where deletion semantics changed (owner-guard) — coordinate with Phase 6 which relocates this file's data dir.
  • New test: old-worker-cleanup-spares-successor — write PID file as {pid: 99999...} (not own pid), run the shutdown cascade deletion step, assert file survives.
  • Manual: start worker, rm ~/.claude-mem/worker.pid, run worker-service.cjs status → must still print "Worker is running" with pid/version from health.
  • Full ping-pong scenario from Phase 3's manual check still converges.
  • npm test → green.

Anti-pattern guards

  • Do NOT delete writePidFile entirely — external tooling may read it; it's demoted to diagnostics, not removed.
  • status must not require BOTH oracles ever again — health wins, full stop.
  • Do not let the boot-window PID check (advisory) exit 1 — duplicate suppression stays exit 0 (Phase 2 changed only the generic-failure path).

Phase 6: Test hygiene — no test ever touches the real ~/.claude-mem again

Why: tests/infrastructure/process-manager.test.ts writes corrupt JSON and sentinel PIDs into the REAL ~/.claude-mem/worker.pid (snapshot-restore shrinks but doesn't close the race window, and a killed test run leaves corruption behind). It also pollutes the shared log, which contaminated this very diagnosis.

What to implement

  1. tests/infrastructure/process-manager.test.ts:
    • Replace the hardcoded DATA_DIR = path.join(homedir(), '.claude-mem') (lines 24-25) with: create mkdtempSync(join(tmpdir(), 'claude-mem-pm-test-')), set process.env.CLAUDE_MEM_DATA_DIR to it.
    • TRAP: DATA_DIR in src/shared/paths.ts:40 is a module-level const computed at import time, and ESM hoists static imports — setting the env var in beforeEach is too late. Copy the fresh-import pattern from tests/shared/worker-utils-version-recycle.test.ts:30-32 (query-param cache-bust dynamic import) OR set the env var at the very top of the file before any await import(...) of ProcessManager modules (convert the static imports of code-under-test to dynamic).
    • Delete the snapshot-restore of the real PID file (lines 28-43) — unnecessary once isolated; rmSync(tempDir, {recursive: true, force: true}) in afterAll (copy Pattern A: tests/write-json-file-atomic.test.ts:34-40).
  2. Sweep for other offenders: grep -rn "homedir()" tests/ | grep -v node_modules — any test resolving the real data dir gets the same treatment.
  3. Add a tripwire to tests/preload.ts (it already exists for the PostHog mock): if CLAUDE_MEM_DATA_DIR is unset, set it to a per-run mkdtempSync dir so NO test can fall through to ~/.claude-mem. (Env restoration discipline: copy tests/env-isolation.test.ts:31-90.)

Documentation references

  • Offending lines verbatim: process-manager.test.ts:24-25, 28-43, 109, 499-503 (test-hygiene report §1).
  • Copy-ready isolation patterns: Pattern A write-json-file-atomic.test.ts:34-40; Pattern C env restore env-isolation.test.ts:31-90.
  • Env override honored at src/shared/paths.ts:19-20.

Verification checklist

  • bun test tests/infrastructure/ green.
  • While the suite runs: ls -la ~/.claude-mem/worker.pid; shasum ~/.claude-mem/worker.pid before/after → byte-identical (or consistently absent).
  • grep -rn "homedir()" tests/ → no hit resolves a data-dir path for writes.
  • Full npm test green.

Anti-pattern guards

  • Do not mock fs to fake isolation — real temp dirs only, the tests exercise real file semantics.
  • Do not weaken assertions to dodge the trap; the fresh-import pattern is proven in-repo.

Phase 7: Final Verification

  1. Full suite + typecheck: npx tsc --noEmit and npm test — zero failures.
  2. Anti-pattern greps (all must be empty):
    • grep -rn "INSTALLED_CACHE_PATH\|detectInstalledVersion" scripts/
    • grep -n "admin/restart" scripts/sync-marketplace.cjs
    • grep -n "37777" src/ scripts/ -r (hardcoded port)
    • grep -rn "resolveBunRuntime" src/shared/worker-utils.ts
    • grep -rn "homedir(), '.claude-mem'" tests/
  3. Triple-restart soak: run npm run build-and-sync three times consecutively; every run must end Worker restart verified; /api/health pid changes each time, version stays the built version; grep -c "refusing to start duplicate\|Failed to start server" ~/.claude-mem/logs/claude-mem-$(date +%F).log shows no new occurrences during the soak.
  4. Stale-launcher convergence test (the original bug): manually spawn a worker from an OLD cache dir (node ~/.claude/plugins/cache/thedotmack/claude-mem/<old>/scripts/bun-runner.js .../worker-service.cjs start), then trigger any hook (or ensureWorkerRunning via a session). Expect in the log: exactly ONE Worker version mismatch — recycling stale worker, then the self-replacing restart, then a healthy marketplace-version worker — NO ping-pong (no second recycle within 5 minutes).
  5. Observation check: confirm claude-mem itself recorded observations during the soak (the worker was restarting underneath the recorder — the pipeline must survive its own surgery): query the 10 most recent rows in ~/.claude-mem/claude-mem.db observations and confirm fresh timestamps.
  6. Docs: update docs/public/troubleshooting.mdx if it documents the old restart semantics; CHANGELOG is auto-generated — do not edit.