* plan-10 Phase 1: ship deterministic plugin runtime dependency closure Approach A — commit & ship plugin/bun.lock so the plugin's runtime node_modules install is deterministic, fixing the recurring `Cannot find module 'zod/v3'` (#2730). - align generated plugin zod range to root (^4.4.3) in build-hooks.js - new scripts/gen-plugin-lockfile.cjs generates plugin/bun.lock as a build artifact after build-hooks.js writes plugin/package.json - track & ship plugin/bun.lock (.gitignore negation, .npmignore, files allowlist) - install with `bun install --frozen-lockfile --ignore-scripts` at runtime Refs #2783, #2730 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * plan-10 Phase 2: fail loud at install time on a broken dependency closure Strengthen verifyCriticalModules to assert each dependency is actually importable via require.resolve (not merely a directory), and assert the worker-required zod subpaths resolve: zod/v3, zod/v4, zod/v4-mini. A partial/stale install now fails `npx claude-mem install` immediately instead of surfacing later as a Stop-hook `Cannot find module 'zod/v3'`. Bin-only packages (e.g. tree-sitter-cli, which has no bare-name entry point) fall back to resolving <dep>/package.json so a healthy install isn't falsely rejected. Adds tests/cli/verify-critical-modules.test.ts covering a missing zod/v3 subpath (throws), a complete zod (passes), and a bin-only dep (passes). Refs #2783, #2730 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * plan-10 Phase 3: clean-room install + import smoke test (#2730 backstop) Add scripts/smoke-clean-room.cjs and a `smoke:clean-room` npm script. Against fresh temp dirs (never the repo's node_modules) it: - copies plugin/, runs `bun install --frozen-lockfile --ignore-scripts`, asserts zod, zod/v3, zod/v4, zod/v4-mini resolve, and boots the bundled worker asserting no `Cannot find module` — the direct #2730 regression guard; - `npm pack`s, installs the tarball into a second temp dir, and load-tests the published bin entrypoint, warning loudly on any declared main/exports target missing from the tarball (latent #2537 gap). Exits non-zero naming the missing module on any failure; cleans up all temp dirs and the tarball in a finally. Refs #2783, #2730 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * plan-10 Phase 4: gate CI and publish on the clean-room dependency closure - ci.yml: new `clean-room-deps` job (between build and the docker e2e job) runs a frozen-lockfile drift check on the committed plugin lockfile, then `npm run build` + `npm run smoke:clean-room`. The drift step catches a contributor who changed plugin deps without regenerating plugin/bun.lock. - npm-publish.yml: add setup-bun and run `npm run smoke:clean-room` between build and `npm publish`, so a broken runtime closure cannot be published on a tag push (ci.yml does not run on tags). Secrets block untouched. Refs #2783, #2730 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * plan-10: doc recluster note + Phase 0 execution slice for #2730 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * plans: backlog recluster (2026-06-04) — cross-cluster execution order + plan-13 doc Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * plan-10: gen-plugin-lockfile degrades gracefully when bun is absent The Windows build CI job has no bun on PATH; regenerating the lockfile there threw and failed the build. The committed plugin/bun.lock is already the deterministic closure, so skip regeneration (non-fatal) when bun is missing and a lockfile exists; fail loud only when neither is available. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: rebuild plugin artifacts after merging main (v13.5.1) + plan-10 work Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * chore: rebuild plugin artifacts after merging main v13.5.5 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * chore(deps): daily upgrade pass — agent SDK 0.3.172, better-auth 1.6.16, posthog-node 5.36.15, dompurify 3.4.9 - Bump @anthropic-ai/claude-agent-sdk 0.2.141 -> 0.3.172 (tsc + full test suite green) - Remove deprecated @types/dompurify stub (dompurify ships its own types) - Add overrides.tmp ^0.2.7 to clear GHSA-52f5-9888-hmc6 / GHSA-ph9p-34f9-6g65 via np -> listr-input -> inquirer -> external-editor -> tmp chain - npm audit: 0 vulnerabilities; npm outdated: clean - package-lock.json is gitignored in this repo, so not committed Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * plan: worker-restart single-source-of-truth — 7-phase fix for restart races Phased plan from the adversarially-verified diagnosis (wf_f07f3541-b05): kill the cache mirror, single verified restart initiator, self-replacing restart endpoint, unified spawn gate with lockfile, PID-file demotion, test data-dir isolation, soak verification. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * refactor(restart): delete sync-script cache-mirror and HTTP restart trigger Phase 1 of plans/2026-06-10-worker-restart-single-source-of-truth.md. The installed-version cache mirror wrote version-N code into the version-(N-1) cache dir, manufacturing permanent version disagreement; the HTTP POST to /api/admin/restart raced the CLI restart that follows it in build-and-sync. Both are deleted; the CLI worker:restart in the marketplace copy is now the single restart initiator, and the sleep 1 between the two mechanisms is gone. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat(restart): restart proves itself or exits 1 Phase 2 of plans/2026-06-10-worker-restart-single-source-of-truth.md. worker-service restart now captures the old worker pid, waits for the port with the same platform-scaled 15s budget as stop, spawns the marketplace copy of worker-service.cjs when present, then polls /api/health until the pid changes and the version matches this build's baked __DEFAULT_PACKAGE_VERSION__ — success is printed to stdout, deadline (platform-scaled 30s) exits 1 with the last observed health payload and the spawned script path. The --daemon generic start-failure path now exits 1 instead of masquerading as success; the three duplicate-suppression exits remain 0. New helper src/services/restart-verify.ts (worker-service.ts bootstraps on import, so the helper lives in an import-safe module) with 8 tests covering pid-flip success, stale pid, wrong version, unreachable timeout, 503-degraded acceptance, and null-oldPid version-only verification. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat(restart): self-replacing worker — old worker spawns its successor Phase 3 of plans/2026-06-10-worker-restart-single-source-of-truth.md. /api/admin/restart was kill-only: hooks that POSTed it then raced the dying worker with their own lazy-spawn (the observed recycle ping-pong). Now the dying worker spawns its successor itself — after a re-entrancy- guarded, deadline-bounded (platform-scaled 10s) graceful shutdown, and only once its port is confirmed free; stop and signal shutdowns stay kill-only. The hook recycle path waits for that successor via /api/health polling (HOOK_READINESS_TIMEOUT_MS budget) and lazy-spawns only as a fallback, with a warn-only version re-check so a hook never recycles more than once per invocation. Shutdown sequence lives in import-safe src/services/worker-shutdown.ts (worker-service.ts bootstraps on import); registerSignalHandlers no longer pre-sets isShuttingDown — the supervisor's shutdownInitiated guard owns signal dedupe, and pre-setting would no-op the new entry guard. 13 new tests cover re-entrancy, deadline expiry/rejection, handoff ordering, kill-only reasons, successor-wait vs lazy-spawn fallback, and pre-graceful bookkeeping failures. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat(restart): one spawn gate; CLI restart defers to the self-replacing worker Phase 4 of plans/2026-06-10-worker-restart-single-source-of-truth.md. Three uncoordinated spawn paths (hook lazy-spawn, MCP worker-spawner, CLI) with two different bun resolvers produced 3-launcher collisions within a single second. Now a wx-flag lockfile (<DATA_DIR>/spawn.lock, 60s mtime staleness with re-stat-before-unlink, owner-checked release) gates every external spawn: lock losers never fail — they skip the spawn and wait for the winner's worker. resolveBunRuntime is deleted in favor of ProcessManager's resolveWorkerRuntimePath (adds BUN_PATH, ~/.bun/bin, brew, which fallbacks), closing the kill-then-can't-respawn path; mcp-server prefers the marketplace worker script so stale cache dirs stop spawning stale workers. Integration fix surfaced by live verification: the CLI restart raced the Phase 3 self-replacement handoff (the successor re-binds the port in ~200ms, so waitForPortFree always timed out and restart exited 1 while the restart had actually succeeded). The CLI now verifies the worker's self-spawned successor directly, and only spawns — gate- wrapped, after the port frees — as the fallback when no worker was running, the shutdown POST was rejected, or no successor appeared. The dying worker's handoff is intentionally ungated: it spawns only after its own port closes, and hooks wait on it. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat(restart): demote the PID file — health and port are the liveness oracle Phase 5 of plans/2026-06-10-worker-restart-single-source-of-truth.md. The dying worker's shutdown cascade deleted the PID file unconditionally as its final act, clobbering the successor's freshly-written file; status then required portInUse AND pidInfo, so a healthy worker reported as "not running". Now every PID-file deletion is owner-guarded: the supervisor cascade deletes only its own pid (removeOwnedPidFile), and the CLI stop/restart-fallback, the restart handoff, and the daemon start-failure cleanup go through removePidFileIfOwner (owner-or-dead — a live successor's file always survives; corrupt files are left for the next boot's validator). status sources from GET /api/health alone (pid, version, uptime, workerPath; 503-degraded counts as running and now surfaces its queue detail), with port-in-use-but-unreachable and not-running fallbacks — all exit 0 as before. The --daemon duplicate gate checks the port first (ground truth) and the PID file second (advisory, for the freed-port-but-undeleted-file window); duplicate suppression stays exit 0. writePidFile/touchPidFile remain — the file is diagnostics, and the worker stays its only writer. Also fixes combined-run test pollution: spawn-gate and worker-utils timeout tests now eagerly import paths.js before setting a temp CLAUDE_MEM_DATA_DIR, so the import-time DATA_DIR const can't freeze on a deleted temp dir for suites loaded later in the same bun process. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * test: no test ever touches the real ~/.claude-mem again Phase 6 of plans/2026-06-10-worker-restart-single-source-of-truth.md. process-manager and graceful-shutdown tests wrote corrupt JSON and sentinel PIDs (2147483647) into the real ~/.claude-mem/worker.pid and drove the real supervisor.json cascade under a snapshot-restore that a killed run would skip — that pollution contaminated production logs and a prior diagnosis. Both files now set a temp CLAUDE_MEM_DATA_DIR at the top of the file before dynamically importing the code under test (ESM hoisting makes beforeEach too late), assert their paths landed outside the real dir, and derive PID_FILE from the same frozen paths module the code uses so test and code can never diverge under bun's shared module cache. The snapshot-restore scaffolding is deleted; zero assertions changed. tests/preload.ts gains a tripwire: when CLAUDE_MEM_DATA_DIR is unset it fills a per-run temp dir, so no test in any file can fall through to the real data dir. Fallout made explicit: worker-spawn child processes get an explicit temp dir; install-error-matrix restores rather than deletes the env var; settings-defaults-manager pins the unset-env default it was implicitly relying on. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(settings): bootstrap notices go to stderr, never stdout CI on PR #2894 caught the latent bug: on the first boot in a fresh data dir, SettingsDefaultsManager printed '[SETTINGS] Created settings file with defaults: ...' to stdout before the start command's JSON hook payload, corrupting the machine-readable contract every fresh install's first hook invocation relies on. The Phase 6 per-run temp data dir made the cold-dir case deterministic in CI, exposing it. Both informational notices (creation, nested-schema migration) now use console.warn — stderr — matching the function's existing failure-path idiom; two regression tests pin stdout silence on both paths. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * refactor(restart): address PR #2894 review — dedupe script resolver, skip futile port wait Both inline copies of the marketplace-first script-candidate list in worker-service.ts (restart fallback + successor handoff injection) now call the exported resolveWorkerScriptPath() ?? __filename, so the candidate list lives in one place. verifyRestartedWorker's failure result gains lastPollSawHealth; when the self-replacement handoff verification timed out while a live (but unverifiable) worker was still serving on the port, the CLI fallback now skips its port-free wait — the port cannot free while that worker lives, so the wait only burned its full platform-scaled budget before the same final verification ran anyway. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
25 KiB
Worker Restart: Single Source of Truth
Created: 2026-06-10
Root-cause analysis: 12-agent diagnosis, adversarially verified (workflow wf_f07f3541-b05). Summary: worker-restart failures are caused by five redundant "who is the worker" oracles with uncoordinated writers, a sync-script "hot reload" mirror that writes version-N code into the version-(N-1) cache dir, a kill-only restart endpoint that races hook lazy-spawns, and a build chain that fires two uncoordinated restarts and never verifies the outcome. 98 version-recycle ping-pong events and six EADDRINUSE hard failures observed Jun 8–10.
Execution model: Each phase is self-contained and lands independently (one commit/PR per phase, in order). Phases 1–2 are the high-leverage, low-risk slice. Run npm test plus the phase's verification checklist before moving on.
Phase 0: Consolidated Documentation — Allowed APIs and Ground Truth
(Discovery already performed by three fact-extraction subagents on 2026-06-10; consolidated here. Executors: trust these refs, but re-read the cited lines before editing — line numbers drift.)
Allowed APIs (verified to exist — use ONLY these, with these exact signatures)
| API | Location | Signature / shape |
|---|---|---|
ensureWorkerRunning() |
src/shared/worker-utils.ts:293-381 |
(): Promise<boolean> — hook lazy-spawn + PR #2768 version-recycle |
resolveWorkerScriptPath() |
src/shared/worker-utils.ts:206-215 |
candidates: MARKETPLACE_ROOT/plugin/scripts/worker-service.cjs, then cwd()/plugin/scripts/worker-service.cjs |
resolveBunRuntime() |
src/shared/worker-utils.ts:217-235 |
hook-side resolver; MISSING ~/.bun/bin fallbacks |
waitForWorkerPort / waitForWorkerReadiness |
src/shared/worker-utils.ts:237-273 |
polls GET /api/readiness; budget env CLAUDE_MEM_HOOK_READINESS_TIMEOUT_MS |
ensureWorkerStarted(port, workerScriptPath) |
src/services/worker-spawner.ts:70-153 |
returns 'ready' | 'warming' | 'dead'; NO version guard |
spawnDaemon(scriptPath, port, extraEnv?) |
src/services/infrastructure/ProcessManager.ts:408-472 |
returns PID or undefined; setsid on Unix, PowerShell on Windows |
resolveWorkerRuntimePath(options?) |
src/services/infrastructure/ProcessManager.ts:63-125 |
full bun resolver chain (BUN, BUN_PATH, ~/.bun/bin, brew paths, which) |
| PID file APIs | ProcessManager.ts:134-168, 508-520 |
writePidFile(info), readPidFile(): PidInfo|null, removePidFile(), touchPidFile(), cleanStalePidFile() |
httpShutdown(port, reason?) |
src/services/infrastructure/HealthMonitor.ts:94-114 |
POSTs /api/admin/shutdown?reason=restart |
waitForPortFree(port, timeoutMs?) |
HealthMonitor.ts:85-92 |
500ms poll |
checkVersionMatch(port) |
src/services/infrastructure/HealthMonitor.ts:~120-161 |
returns {matches, pluginVersion, workerVersion}; fail-open on ENOENT |
performGracefulShutdown(config) |
src/services/infrastructure/GracefulShutdown.ts:30-58 |
sequential closes, NO global deadline |
flushResponseThen(res, payload, action) |
src/services/server/flushResponseThen.ts:3-16 |
responds, runs action on 'finish', ALWAYS process.exit(0) |
writeJsonFileAtomic() |
src/npx-cli/utils/paths.ts:124-205 |
the ONLY atomic-write helper in the repo |
MARKETPLACE_ROOT |
src/shared/paths.ts:43 |
~/.claude/plugins/marketplaces/thedotmack |
resolveDataDir() / DATA_DIR |
src/shared/paths.ts:18-40 |
env CLAUDE_MEM_DATA_DIR wins (line 19-20); module-level const — computed at import time |
| Worker port default | src/shared/SettingsDefaultsManager.ts:91 |
37700 + (uid % 100) — NEVER hardcode 37777 |
/api/health response |
src/services/server/Server.ts:212-233 |
{status, version, workerPath, uptime, pid, initialized, mcpReady, ...} — has everything verification needs |
/api/readiness response |
Server.ts:235-247 |
{status: 'ready'|'initializing', mcpReady} |
/api/admin/restart |
Server.ts:282-294 |
kill-only via flushResponseThen → onRestart() → shutdown('restart'); NOTHING respawns (macOS/Linux) |
version: BUILT_IN_VERSION |
baked via __DEFAULT_PACKAGE_VERSION__ esbuild define |
scripts/build-hooks.js:312,365,421,469,495,540 |
Anti-patterns (verified NOT to exist — do not invent)
- There is NO
/api/admin/status, NO/api/versionroute in Server.ts (version comes from/api/health), NO respawn anywhere in the HTTP restart path. - There is NO flock/O_EXCL/
wx-flag lockfile pattern anywhere in src/ — Phase 4 introduces the first one; copywriteJsonFileAtomicfor the write discipline. plugin/scripts/*.cjsare BUILT artifacts — never hand-edit; rebuild withnpm run build.package-lock.jsonis gitignored — do not commit it.- Tests are
bun test(bunfig.tomlpreloadstests/preload.tsfor the PostHog mock). Mock pattern:mock.module()+ query-param cache-bust fresh import (tests/shared/worker-utils-version-recycle.test.ts:22-32).
Behavioral contracts that MUST keep passing
tests/shared/worker-utils-version-recycle.test.ts: on version mismatchensureWorkerRunning()POSTs/api/admin/restart≥1×; on match, 0×.tests/integrations/spawn-contract-windows.test.ts: spawn-contract env overrides.- Full suite: 2203 pass / 0 fail as of v13.5.5.
Phase 1: Delete the cache-mirror; make the CLI restart the single initiator
Why first: The mirror (sync-marketplace.cjs:164-173) is the largest manufactured source of version disagreement (7 of 10 cache dirs hold off-by-one content); the double restart (HTTP POST + sleep 1 + CLI restart) is the race generator. Both fixes are deletions.
What to implement
- In
scripts/sync-marketplace.cjs:- Delete the cache-mirror block (lines ~164-173:
INSTALLED_CACHE_PATH+ rsync +bun installinto the cache dir). - Delete
detectInstalledVersion()(lines ~79-114) and its call site — it exists only to feed the mirror. - Delete the HTTP restart trigger block (lines ~196-216: the
http.requestPOST to/api/admin/restartand its success/error prints). The sync script's job ends at "files synced +bun installin the marketplace copy".
- Delete the cache-mirror block (lines ~164-173:
- In root
package.jsonline ~67, simplify:- From:
"build-and-sync": "npm run build && npm run sync-marketplace && sleep 1 && (cd ~/.claude/plugins/marketplaces/thedotmack && npm run worker:restart)" - To:
"build-and-sync": "npm run build && npm run sync-marketplace && (cd ~/.claude/plugins/marketplaces/thedotmack && npm run worker:restart)" - (drop the
sleep 1— it existed to let the now-deleted HTTP kill land before the CLI restart)
- From:
- Do NOT touch the existing one-time legacy-cache cleanup elsewhere in sync-marketplace.cjs (if present) — only the three blocks named above.
Documentation references
- Verbatim quotes of all three blocks: Phase 0 table + the restart/shutdown extraction (sync-marketplace.cjs:79-114, 164-173, 196-216; package.json:64-79).
Verification checklist
grep -n "INSTALLED_CACHE_PATH\|detectInstalledVersion\|admin/restart" scripts/sync-marketplace.cjs→ no matches.grep -n "sleep 1" package.json→ no match in build-and-sync.npm run build-and-sync→ completes; worker restarts (via CLI path only);curl -s http://127.0.0.1:$PORT/api/healthshows the just-built version. (Resolve$PORTas37700 + uid%100or from~/.claude-mem/settings.json.)npm test→ green.- Stale-cache regression check:
ls ~/.claude/plugins/cache/thedotmack/claude-mem/*/package.json | xargs grep -h '"version"'— record current values; after the NEXT release, re-check that no cache dir's content changed (the mirror used to mutate them).
Anti-pattern guards
- Do not "fix" the mirror by targeting
workerPathinstead — the feature is being removed, not repaired. - Do not add a new restart mechanism here; Phase 2 hardens the existing CLI restart.
Phase 2: Post-restart verification — restart proves itself or exits 1
Why: restart currently exits 0 after spawnDaemon returns a PID — fork success, not boot success. Four silent exit-0 daemon paths mean "✓" with a dead/stale worker.
What to implement
All in src/services/worker-service.ts:
- In
case 'restart'(lines ~956-973):- BEFORE
httpShutdown, capture the old worker:GET /api/health(2s timeout) → saveoldPid(may be null if no worker). - Replace
waitForPortFree(port, 5000)withwaitForPortFree(port, getPlatformTimeout(15000))— parity withstop(line ~946). - AFTER
spawnDaemon, add a verification loop (new helperverifyRestartedWorker(port, oldPid, deadlineMs)): pollGET /api/healthevery 500ms untilhealth.pid !== oldPid && health.version === EXPECTED_VERSION, whereEXPECTED_VERSIONis this process's own baked__DEFAULT_PACKAGE_VERSION__(build-and-sync runs the marketplace copy, so its baked version IS the just-synced version). Deadline:getPlatformTimeout(30000). - On success: log
Worker restart verified {pid, version}and exit 0. On deadline:console.errorwith the last observed health payload (or connection error) and exit 1.
- BEFORE
- In the
--daemonblock (lines ~1167-1208): change EXIT PATH 4 only (generic start failure, lines ~1204-1206) fromprocess.exit(0)toprocess.exit(1). Paths 1-3 are legitimate duplicate-suppression and stay exit 0. - Spawn target: change
spawnDaemon(__filename, port)(line ~965) to prefer the marketplace script — copy the candidate pattern fromresolveWorkerScriptPath()(src/shared/worker-utils.ts:206-215), falling back to__filenamewhen no marketplace copy exists (dev trees, CI).
Documentation references
/api/healthshape:Server.ts:212-233(pid,versionfields confirmed).getPlatformTimeout: used atworker-service.ts:946.__DEFAULT_PACKAGE_VERSION__availability inside worker-service bundle:scripts/build-hooks.js:312-313.
Verification checklist
- New test
tests/services/worker-restart-verify.test.ts: mockglobal.fetch(copy the fetchLog pattern fromtests/shared/worker-utils-version-recycle.test.ts:34-50); assertverifyRestartedWorkerreturns success when health flips to{pid: newPid, version: expected}, failure on stale pid, failure on wrong version, failure on timeout. - Manual:
npm run build-and-sync→ output includesWorker restart verified; thenkill -9 <worker pid>mid-restart-window and re-run to see a LOUD exit-1 path (or simulate by pointing the verify loop at a dead port in the test). npm test→ green, including the version-recycle contract.
Anti-pattern guards
- Do NOT poll
/api/version(doesn't exist) or invent new health fields. - Do NOT compare against marketplace
package.jsonon disk forEXPECTED_VERSION— the baked constant is the truth for "the code I am running"; disk reads reintroduce a second oracle. - The verify deadline must not block forever: hard cap, then exit 1.
Phase 3: Self-replacing restart — old worker spawns its successor; recycle path stops spawning into corpses
Why: /api/admin/restart is kill-only; hooks that POST it then lazy-spawn race the dying worker (the ping-pong). If the OLD worker spawns its successor as its final act after the port closes, old and new never coexist and no third party spawns into a corpse.
What to implement
src/services/worker-service.tsshutdown()(lines ~671-699):- Re-entrancy guard: the
isShuttingDownfield (line ~188) is write-only today; makeshutdown()check-and-set it at entry (if (this.isShuttingDown) return; this.isShuttingDown = true;). - Hard deadline: wrap
performGracefulShutdown(...)inPromise.racewith agetPlatformTimeout(10000)timer; on deadline, logGraceful shutdown deadline exceeded — proceedingand continue (do not hang on unbounded session drain — drain today can run 35-40s). - Successor spawn: when
reason === 'restart', after graceful shutdown completes/deadlines: resolve the marketplace script (same candidate pattern as Phase 2 step 3),await waitForPortFree(port, 5000), thenspawnDaemon(marketplaceScript, port). If port never frees or spawn returnsundefined, log loudly (logger.error) — the next hook's lazy-spawn is the safety net. NoteflushResponseThen(flushResponseThen.ts:3-16) callsprocess.exit(0)after the action completes, so the spawn must be awaited inside the action.
- Re-entrancy guard: the
src/shared/worker-utils.tsensureWorkerRunning()recycle path (lines ~305-330):- After POSTing
/api/admin/restart, do NOT immediately lazy-spawn. Instead pollGET /api/health(500ms interval,HOOK_READINESS_TIMEOUT_MSbudget) for the successor: healthy ANDversion === pluginVersion(already in hand fromcheckVersionMatch). Only fall through to the existing lazy-spawn if the successor never appears. - Amplifier fix: after
waitForWorkerReadinesssucceeds anywhere in this function, re-check the version once via/api/health; if still mismatched, log a warning (do NOT loop/recycle again in the same invocation — one recycle per hook event, the next hook retries; unbounded loops here re-create the storm).
- After POSTing
Documentation references
onRestartwiring:worker-service.ts:255; route:Server.ts:282-294;flushResponseThen.ts:3-16.- Drain timings + unbounded awaits:
GracefulShutdown.ts:30-75. HOOK_READINESS_TIMEOUT_MS:worker-utils.ts:45-48.- Signal-path guard to mirror:
src/supervisor/index.ts:56-60.
Verification checklist
tests/shared/worker-utils-version-recycle.test.tsstill green (still POSTs restart on mismatch — the change is what happens AFTER the POST).- New test: shutdown re-entrancy — call
shutdown()twice, assertperformGracefulShutdownruns once (mock it viamock.module). - New test: recycle-no-corpse-spawn — fetch mock where
/api/admin/restart200s and/api/healthreturns the NEW version on poll N: assert NO spawn attempt; where health never recovers: assert lazy-spawn fallback fires. - Manual end-to-end: with a worker running,
curl -X POST http://127.0.0.1:$PORT/api/admin/restart; within ~15s/api/healthshows a NEW pid and the marketplace version, with no hook involvement. Check~/.claude-mem/logs/claude-mem-$(date +%F).logfor exactly one shutdown and one daemon start (no duplicate-refusal lines). npm test→ green.
Anti-pattern guards
- The successor spawn happens ONLY on
reason === 'restart'—stopmust stay kill-only. - Never spawn before the port is confirmed free — that recreates EXIT PATH 2/3 duplicate suicides.
- Do not add respawn logic to
flushResponseThenitself or to the Windows-managed IPC branch (process.sendpath, Server.ts:284-289) — Windows wrapper already owns restart there.
Phase 4: One spawn gate — shared ensureWorker() with a spawn lockfile; kill the resolver asymmetry
Why: Three spawn paths (hooks via worker-utils, MCP via worker-spawner, CLI via spawnDaemon) with two different bun resolvers and no mutual exclusion; logs show 3 launchers colliding within one second.
What to implement
- New module
src/shared/worker-spawn-gate.ts:acquireSpawnLock(): boolean—writeFileSync(join(DATA_DIR, 'spawn.lock'), JSON.stringify({pid: process.pid, startedAt: new Date().toISOString()}), {flag: 'wx'})in try/catch. OnEEXIST:statSyncthe lock; ifmtimeMsolder than 30_000ms,unlinkSyncand retry ONCE; else return false.releaseSpawnLock(): void— unlink, owner-checked (read it; only delete ifpid === process.pid), errors swallowed.- Use
DATA_DIRfromsrc/shared/paths.ts— neverhomedir()directly.
- In
src/shared/worker-utils.tsensureWorkerRunning()(spawn section, lines ~332-351): wrap the spawn in the lock — ifacquireSpawnLock()fails, skip the spawn and go straight towaitForWorkerPort/waitForWorkerReadiness(someone else is spawning; wait for their worker). Release infinally. - Same wrap in
src/services/worker-spawner.tsensureWorkerStarted()around thespawnDaemoncall (line ~132). - Resolver unification: delete
resolveBunRuntime()fromworker-utils.ts(lines 217-235) and import/re-exportresolveWorkerRuntimePathfromProcessManager.ts(already exported; it strictly supersedes — adds BUN_PATH,~/.bun/bin, brew, snap fallbacks). This closes the kill-then-can't-respawn path. src/servers/mcp-server.ts(lines ~42-51): computeWORKER_SCRIPT_PATHpreferring the marketplace copy — copy the candidate pattern fromresolveWorkerScriptPath()with fallback to the current own-dir resolution. This stops MCP servers in stale cache dirs from spawning stale workers.
Documentation references
- Write-discipline reference for the lock:
writeJsonFileAtomic,src/npx-cli/utils/paths.ts:124-205(thewx-flag + cleanup-on-error shape; the lock is simpler — no rename needed,wxIS the atomicity). - Spawn call sites table: Phase 0 / spawn-path report §7.
resolveWorkerRuntimePathchain:ProcessManager.ts:63-125.
Verification checklist
- New test
tests/shared/worker-spawn-gate.test.ts(temp dir viaCLAUDE_MEM_DATA_DIR+ dynamic import — see Phase 6 trap): second acquire fails while held; stale lock (backdate mtime viautimesSync) is broken and re-acquired; release is owner-only. grep -rn "resolveBunRuntime" src/→ no definition in worker-utils (only the ProcessManager import).grep -n "spawnHidden\|spawnDaemon" src/shared/worker-utils.ts src/services/worker-spawner.ts→ every spawn site is inside the lock.- Race test (manual): run 5 concurrent
node "$_P/scripts/bun-runner.js" .../worker-service.cjs startinvocations with no worker running; logs must show exactly ONEStarting worker daemonand zerorefusing to start duplicatestorms. npm test→ green (version-recycle contract intact).
Anti-pattern guards
- The lock gates SPAWNING only — never health checks; a held lock must never make a hook fail, only wait.
- 30s staleness must use file mtime, not clock-in-content comparisons (test pollution wrote
startedAt: 2024-01-01once already). - Do not introduce a lock library;
wxflag is sufficient and dependency-free.
Phase 5: Demote the PID file — port + /api/health become the only liveness oracle
Why: The dying worker's shutdown cascade deletes the NEW worker's PID file (src/supervisor/shutdown.ts:88), after which status reports a healthy worker as "not running" (status requires portInUse && pidInfo, worker-service.ts:975-988). /api/health already carries pid, version, workerPath — it subsumes the file.
What to implement
- Owner-guarded deletion (the clobber fix):
src/supervisor/shutdown.ts(~line 88): beforermSync(pidFilePath), read the file; delete ONLY if itspid === process.pid. A mismatch means a successor already wrote its own — log debug and leave it.src/services/worker-service.tscase 'restart'(~line 964) andcase 'stop'(~line 949): replace bareremovePidFile()with the same owner-or-dead check: delete only if the recorded pid is the one we just shut down (captured in Phase 2's pre-shutdown health probe) or the recorded pid is not alive.
case 'status'(lines ~975-988): source of truth becomesGET /api/health— reportpid,version,uptime,workerPathfrom the response; fall back to "port in use but health unreachable (wedged?)" and "not running". Drop thereadPidFile()requirement.--daemonduplicate gate (lines ~1167-1182): reorder — port/health probe FIRST (it's ground truth), PID file second (advisory only, for the no-port-bound-yet boot window). KeepwritePidFile/touchPidFileas diagnostics — the worker itself remains the only writer.
Documentation references
- Clobber chain:
GracefulShutdown.ts:55→supervisor.stop()→runShutdownCascade→shutdown.ts:87-98. verifyPidFileOwnership+ startToken:ProcessManager.ts(PID APIs §3 of spawn-path report).- Health shape:
Server.ts:212-233.
Verification checklist
- Update
tests/infrastructure/process-manager.test.tsexpectations where deletion semantics changed (owner-guard) — coordinate with Phase 6 which relocates this file's data dir. - New test: old-worker-cleanup-spares-successor — write PID file as
{pid: 99999...}(not own pid), run the shutdown cascade deletion step, assert file survives. - Manual: start worker,
rm ~/.claude-mem/worker.pid, runworker-service.cjs status→ must still print "Worker is running" with pid/version from health. - Full ping-pong scenario from Phase 3's manual check still converges.
npm test→ green.
Anti-pattern guards
- Do NOT delete
writePidFileentirely — external tooling may read it; it's demoted to diagnostics, not removed. statusmust not require BOTH oracles ever again — health wins, full stop.- Do not let the boot-window PID check (advisory) exit 1 — duplicate suppression stays exit 0 (Phase 2 changed only the generic-failure path).
Phase 6: Test hygiene — no test ever touches the real ~/.claude-mem again
Why: tests/infrastructure/process-manager.test.ts writes corrupt JSON and sentinel PIDs into the REAL ~/.claude-mem/worker.pid (snapshot-restore shrinks but doesn't close the race window, and a killed test run leaves corruption behind). It also pollutes the shared log, which contaminated this very diagnosis.
What to implement
tests/infrastructure/process-manager.test.ts:- Replace the hardcoded
DATA_DIR = path.join(homedir(), '.claude-mem')(lines 24-25) with: createmkdtempSync(join(tmpdir(), 'claude-mem-pm-test-')), setprocess.env.CLAUDE_MEM_DATA_DIRto it. - TRAP:
DATA_DIRinsrc/shared/paths.ts:40is a module-level const computed at import time, and ESM hoists static imports — setting the env var inbeforeEachis too late. Copy the fresh-import pattern fromtests/shared/worker-utils-version-recycle.test.ts:30-32(query-param cache-bust dynamic import) OR set the env var at the very top of the file before anyawait import(...)of ProcessManager modules (convert the static imports of code-under-test to dynamic). - Delete the snapshot-restore of the real PID file (lines 28-43) — unnecessary once isolated;
rmSync(tempDir, {recursive: true, force: true})inafterAll(copy Pattern A:tests/write-json-file-atomic.test.ts:34-40).
- Replace the hardcoded
- Sweep for other offenders:
grep -rn "homedir()" tests/ | grep -v node_modules— any test resolving the real data dir gets the same treatment. - Add a tripwire to
tests/preload.ts(it already exists for the PostHog mock): ifCLAUDE_MEM_DATA_DIRis unset, set it to a per-runmkdtempSyncdir so NO test can fall through to~/.claude-mem. (Env restoration discipline: copytests/env-isolation.test.ts:31-90.)
Documentation references
- Offending lines verbatim: process-manager.test.ts:24-25, 28-43, 109, 499-503 (test-hygiene report §1).
- Copy-ready isolation patterns: Pattern A
write-json-file-atomic.test.ts:34-40; Pattern C env restoreenv-isolation.test.ts:31-90. - Env override honored at
src/shared/paths.ts:19-20.
Verification checklist
bun test tests/infrastructure/green.- While the suite runs:
ls -la ~/.claude-mem/worker.pid; shasum ~/.claude-mem/worker.pidbefore/after → byte-identical (or consistently absent). grep -rn "homedir()" tests/→ no hit resolves a data-dir path for writes.- Full
npm testgreen.
Anti-pattern guards
- Do not mock
fsto fake isolation — real temp dirs only, the tests exercise real file semantics. - Do not weaken assertions to dodge the trap; the fresh-import pattern is proven in-repo.
Phase 7: Final Verification
- Full suite + typecheck:
npx tsc --noEmitandnpm test— zero failures. - Anti-pattern greps (all must be empty):
grep -rn "INSTALLED_CACHE_PATH\|detectInstalledVersion" scripts/grep -n "admin/restart" scripts/sync-marketplace.cjsgrep -n "37777" src/ scripts/ -r(hardcoded port)grep -rn "resolveBunRuntime" src/shared/worker-utils.tsgrep -rn "homedir(), '.claude-mem'" tests/
- Triple-restart soak: run
npm run build-and-syncthree times consecutively; every run must endWorker restart verified;/api/healthpid changes each time, version stays the built version;grep -c "refusing to start duplicate\|Failed to start server" ~/.claude-mem/logs/claude-mem-$(date +%F).logshows no new occurrences during the soak. - Stale-launcher convergence test (the original bug): manually spawn a worker from an OLD cache dir (
node ~/.claude/plugins/cache/thedotmack/claude-mem/<old>/scripts/bun-runner.js .../worker-service.cjs start), then trigger any hook (orensureWorkerRunningvia a session). Expect in the log: exactly ONEWorker version mismatch — recycling stale worker, then the self-replacing restart, then a healthy marketplace-version worker — NO ping-pong (no second recycle within 5 minutes). - Observation check: confirm claude-mem itself recorded observations during the soak (the worker was restarting underneath the recorder — the pipeline must survive its own surgery): query the 10 most recent rows in
~/.claude-mem/claude-mem.dbobservationsand confirm fresh timestamps. - Docs: update
docs/public/troubleshooting.mdxif it documents the old restart semantics; CHANGELOG is auto-generated — do not edit.