thedotmack-claude-mem/plans/2026-06-10-worker-restart-single-source-of-truth.md

# Worker Restart: Single Source of Truth

**Created:** 2026-06-10
**Root-cause analysis:** 12-agent diagnosis, adversarially verified (workflow `wf_f07f3541-b05`). Summary: worker-restart failures are caused by five redundant "who is the worker" oracles with uncoordinated writers, a sync-script "hot reload" mirror that writes version-N code into the version-(N-1) cache dir, a kill-only restart endpoint that races hook lazy-spawns, and a build chain that fires two uncoordinated restarts and never verifies the outcome. 98 version-recycle ping-pong events and six EADDRINUSE hard failures observed Jun 8–10.

**Execution model:** Each phase is self-contained and lands independently (one commit/PR per phase, in order). Phases 1–2 are the high-leverage, low-risk slice. Run `npm test` plus the phase's verification checklist before moving on.

---

## Phase 0: Consolidated Documentation — Allowed APIs and Ground Truth

*(Discovery already performed by three fact-extraction subagents on 2026-06-10; consolidated here. Executors: trust these refs, but re-read the cited lines before editing — line numbers drift.)*

### Allowed APIs (verified to exist — use ONLY these, with these exact signatures)

| API | Location | Signature / shape |
|---|---|---|
| `ensureWorkerRunning()` | `src/shared/worker-utils.ts:293-381` | `(): Promise<boolean>` — hook lazy-spawn + PR #2768 version-recycle |
| `resolveWorkerScriptPath()` | `src/shared/worker-utils.ts:206-215` | candidates: `MARKETPLACE_ROOT/plugin/scripts/worker-service.cjs`, then `cwd()/plugin/scripts/worker-service.cjs` |
| `resolveBunRuntime()` | `src/shared/worker-utils.ts:217-235` | hook-side resolver; MISSING `~/.bun/bin` fallbacks |
| `waitForWorkerPort` / `waitForWorkerReadiness` | `src/shared/worker-utils.ts:237-273` | polls `GET /api/readiness`; budget env `CLAUDE_MEM_HOOK_READINESS_TIMEOUT_MS` |
| `ensureWorkerStarted(port, workerScriptPath)` | `src/services/worker-spawner.ts:70-153` | returns `'ready' \| 'warming' \| 'dead'`; NO version guard |
| `spawnDaemon(scriptPath, port, extraEnv?)` | `src/services/infrastructure/ProcessManager.ts:408-472` | returns PID or `undefined`; setsid on Unix, PowerShell on Windows |
| `resolveWorkerRuntimePath(options?)` | `src/services/infrastructure/ProcessManager.ts:63-125` | full bun resolver chain (BUN, BUN_PATH, `~/.bun/bin`, brew paths, `which`) |
| PID file APIs | `ProcessManager.ts:134-168, 508-520` | `writePidFile(info)`, `readPidFile(): PidInfo\|null`, `removePidFile()`, `touchPidFile()`, `cleanStalePidFile()` |
| `httpShutdown(port, reason?)` | `src/services/infrastructure/HealthMonitor.ts:94-114` | POSTs `/api/admin/shutdown?reason=restart` |
| `waitForPortFree(port, timeoutMs?)` | `HealthMonitor.ts:85-92` | 500ms poll |
| `checkVersionMatch(port)` | `src/services/infrastructure/HealthMonitor.ts:~120-161` | returns `{matches, pluginVersion, workerVersion}`; fail-open on ENOENT |
| `performGracefulShutdown(config)` | `src/services/infrastructure/GracefulShutdown.ts:30-58` | sequential closes, NO global deadline |
| `flushResponseThen(res, payload, action)` | `src/services/server/flushResponseThen.ts:3-16` | responds, runs action on 'finish', ALWAYS `process.exit(0)` |
| `writeJsonFileAtomic()` | `src/npx-cli/utils/paths.ts:124-205` | the ONLY atomic-write helper in the repo |
| `MARKETPLACE_ROOT` | `src/shared/paths.ts:43` | `~/.claude/plugins/marketplaces/thedotmack` |
| `resolveDataDir()` / `DATA_DIR` | `src/shared/paths.ts:18-40` | env `CLAUDE_MEM_DATA_DIR` wins (line 19-20); module-level const — computed at import time |
| Worker port default | `src/shared/SettingsDefaultsManager.ts:91` | `37700 + (uid % 100)` — NEVER hardcode 37777 |
| `/api/health` response | `src/services/server/Server.ts:212-233` | `{status, version, workerPath, uptime, pid, initialized, mcpReady, ...}` — has everything verification needs |
| `/api/readiness` response | `Server.ts:235-247` | `{status: 'ready'\|'initializing', mcpReady}` |
| `/api/admin/restart` | `Server.ts:282-294` | kill-only via `flushResponseThen` → `onRestart()` → `shutdown('restart')`; NOTHING respawns (macOS/Linux) |
| `version: BUILT_IN_VERSION` | baked via `__DEFAULT_PACKAGE_VERSION__` esbuild define | `scripts/build-hooks.js:312,365,421,469,495,540` |

### Anti-patterns (verified NOT to exist — do not invent)
- There is NO `/api/admin/status`, NO `/api/version` route in Server.ts (version comes from `/api/health`), NO respawn anywhere in the HTTP restart path.
- There is NO flock/O_EXCL/`wx`-flag lockfile pattern anywhere in src/ — Phase 4 introduces the first one; copy `writeJsonFileAtomic` for the write discipline.
- `plugin/scripts/*.cjs` are BUILT artifacts — never hand-edit; rebuild with `npm run build`.
- `package-lock.json` is gitignored — do not commit it.
- Tests are `bun test` (`bunfig.toml` preloads `tests/preload.ts` for the PostHog mock). Mock pattern: `mock.module()` + query-param cache-bust fresh import (`tests/shared/worker-utils-version-recycle.test.ts:22-32`).

### Behavioral contracts that MUST keep passing
- `tests/shared/worker-utils-version-recycle.test.ts`: on version mismatch `ensureWorkerRunning()` POSTs `/api/admin/restart` ≥1×; on match, 0×.
- `tests/integrations/spawn-contract-windows.test.ts`: spawn-contract env overrides.
- Full suite: 2203 pass / 0 fail as of v13.5.5.

---

## Phase 1: Delete the cache-mirror; make the CLI restart the single initiator

**Why first:** The mirror (`sync-marketplace.cjs:164-173`) is the largest manufactured source of version disagreement (7 of 10 cache dirs hold off-by-one content); the double restart (HTTP POST + sleep 1 + CLI restart) is the race generator. Both fixes are deletions.

### What to implement
1. In `scripts/sync-marketplace.cjs`:
   - Delete the cache-mirror block (lines ~164-173: `INSTALLED_CACHE_PATH` + rsync + `bun install` into the cache dir).
   - Delete `detectInstalledVersion()` (lines ~79-114) and its call site — it exists only to feed the mirror.
   - Delete the HTTP restart trigger block (lines ~196-216: the `http.request` POST to `/api/admin/restart` and its success/error prints). The sync script's job ends at "files synced + `bun install` in the marketplace copy".
2. In root `package.json` line ~67, simplify:
   - From: `"build-and-sync": "npm run build && npm run sync-marketplace && sleep 1 && (cd ~/.claude/plugins/marketplaces/thedotmack && npm run worker:restart)"`
   - To: `"build-and-sync": "npm run build && npm run sync-marketplace && (cd ~/.claude/plugins/marketplaces/thedotmack && npm run worker:restart)"`
   - (drop the `sleep 1` — it existed to let the now-deleted HTTP kill land before the CLI restart)
3. Do NOT touch the existing one-time legacy-cache cleanup elsewhere in sync-marketplace.cjs (if present) — only the three blocks named above.

### Documentation references
- Verbatim quotes of all three blocks: Phase 0 table + the restart/shutdown extraction (sync-marketplace.cjs:79-114, 164-173, 196-216; package.json:64-79).

### Verification checklist
- `grep -n "INSTALLED_CACHE_PATH\|detectInstalledVersion\|admin/restart" scripts/sync-marketplace.cjs` → no matches.
- `grep -n "sleep 1" package.json` → no match in build-and-sync.
- `npm run build-and-sync` → completes; worker restarts (via CLI path only); `curl -s http://127.0.0.1:$PORT/api/health` shows the just-built version. (Resolve `$PORT` as `37700 + uid%100` or from `~/.claude-mem/settings.json`.)
- `npm test` → green.
- Stale-cache regression check: `ls ~/.claude/plugins/cache/thedotmack/claude-mem/*/package.json | xargs grep -h '"version"'` — record current values; after the NEXT release, re-check that no cache dir's content changed (the mirror used to mutate them).

### Anti-pattern guards
- Do not "fix" the mirror by targeting `workerPath` instead — the feature is being removed, not repaired.
- Do not add a new restart mechanism here; Phase 2 hardens the existing CLI restart.

---

## Phase 2: Post-restart verification — restart proves itself or exits 1

**Why:** `restart` currently exits 0 after `spawnDaemon` returns a PID — fork success, not boot success. Four silent exit-0 daemon paths mean "✓" with a dead/stale worker.

### What to implement
All in `src/services/worker-service.ts`:
1. In `case 'restart'` (lines ~956-973):
   - BEFORE `httpShutdown`, capture the old worker: `GET /api/health` (2s timeout) → save `oldPid` (may be null if no worker).
   - Replace `waitForPortFree(port, 5000)` with `waitForPortFree(port, getPlatformTimeout(15000))` — parity with `stop` (line ~946).
   - AFTER `spawnDaemon`, add a verification loop (new helper `verifyRestartedWorker(port, oldPid, deadlineMs)`): poll `GET /api/health` every 500ms until `health.pid !== oldPid && health.version === EXPECTED_VERSION`, where `EXPECTED_VERSION` is this process's own baked `__DEFAULT_PACKAGE_VERSION__` (build-and-sync runs the marketplace copy, so its baked version IS the just-synced version). Deadline: `getPlatformTimeout(30000)`.
   - On success: log `Worker restart verified {pid, version}` and exit 0. On deadline: `console.error` with the last observed health payload (or connection error) and **exit 1**.
2. In the `--daemon` block (lines ~1167-1208): change EXIT PATH 4 only (generic start failure, lines ~1204-1206) from `process.exit(0)` to `process.exit(1)`. Paths 1-3 are legitimate duplicate-suppression and stay exit 0.
3. Spawn target: change `spawnDaemon(__filename, port)` (line ~965) to prefer the marketplace script — copy the candidate pattern from `resolveWorkerScriptPath()` (`src/shared/worker-utils.ts:206-215`), falling back to `__filename` when no marketplace copy exists (dev trees, CI).

### Documentation references
- `/api/health` shape: `Server.ts:212-233` (`pid`, `version` fields confirmed).
- `getPlatformTimeout`: used at `worker-service.ts:946`.
- `__DEFAULT_PACKAGE_VERSION__` availability inside worker-service bundle: `scripts/build-hooks.js:312-313`.

### Verification checklist
- New test `tests/services/worker-restart-verify.test.ts`: mock `global.fetch` (copy the fetchLog pattern from `tests/shared/worker-utils-version-recycle.test.ts:34-50`); assert `verifyRestartedWorker` returns success when health flips to `{pid: newPid, version: expected}`, failure on stale pid, failure on wrong version, failure on timeout.
- Manual: `npm run build-and-sync` → output includes `Worker restart verified`; then `kill -9 <worker pid>` mid-restart-window and re-run to see a LOUD exit-1 path (or simulate by pointing the verify loop at a dead port in the test).
- `npm test` → green, including the version-recycle contract.

### Anti-pattern guards
- Do NOT poll `/api/version` (doesn't exist) or invent new health fields.
- Do NOT compare against marketplace `package.json` on disk for `EXPECTED_VERSION` — the baked constant is the truth for "the code I am running"; disk reads reintroduce a second oracle.
- The verify deadline must not block forever: hard cap, then exit 1.

---

## Phase 3: Self-replacing restart — old worker spawns its successor; recycle path stops spawning into corpses

**Why:** `/api/admin/restart` is kill-only; hooks that POST it then lazy-spawn race the dying worker (the ping-pong). If the OLD worker spawns its successor as its final act after the port closes, old and new never coexist and no third party spawns into a corpse.

### What to implement
1. `src/services/worker-service.ts` `shutdown()` (lines ~671-699):
   - Re-entrancy guard: the `isShuttingDown` field (line ~188) is write-only today; make `shutdown()` check-and-set it at entry (`if (this.isShuttingDown) return; this.isShuttingDown = true;`).
   - Hard deadline: wrap `performGracefulShutdown(...)` in `Promise.race` with a `getPlatformTimeout(10000)` timer; on deadline, log `Graceful shutdown deadline exceeded — proceeding` and continue (do not hang on unbounded session drain — drain today can run 35-40s).
   - Successor spawn: when `reason === 'restart'`, after graceful shutdown completes/deadlines: resolve the marketplace script (same candidate pattern as Phase 2 step 3), `await waitForPortFree(port, 5000)`, then `spawnDaemon(marketplaceScript, port)`. If port never frees or spawn returns `undefined`, log loudly (`logger.error`) — the next hook's lazy-spawn is the safety net. Note `flushResponseThen` (flushResponseThen.ts:3-16) calls `process.exit(0)` after the action completes, so the spawn must be awaited inside the action.
2. `src/shared/worker-utils.ts` `ensureWorkerRunning()` recycle path (lines ~305-330):
   - After POSTing `/api/admin/restart`, do NOT immediately lazy-spawn. Instead poll `GET /api/health` (500ms interval, `HOOK_READINESS_TIMEOUT_MS` budget) for the successor: healthy AND `version === pluginVersion` (already in hand from `checkVersionMatch`). Only fall through to the existing lazy-spawn if the successor never appears.
   - Amplifier fix: after `waitForWorkerReadiness` succeeds anywhere in this function, re-check the version once via `/api/health`; if still mismatched, log a warning (do NOT loop/recycle again in the same invocation — one recycle per hook event, the next hook retries; unbounded loops here re-create the storm).

### Documentation references
- `onRestart` wiring: `worker-service.ts:255`; route: `Server.ts:282-294`; `flushResponseThen.ts:3-16`.
- Drain timings + unbounded awaits: `GracefulShutdown.ts:30-75`.
- `HOOK_READINESS_TIMEOUT_MS`: `worker-utils.ts:45-48`.
- Signal-path guard to mirror: `src/supervisor/index.ts:56-60`.

### Verification checklist
- `tests/shared/worker-utils-version-recycle.test.ts` still green (still POSTs restart on mismatch — the change is what happens AFTER the POST).
- New test: shutdown re-entrancy — call `shutdown()` twice, assert `performGracefulShutdown` runs once (mock it via `mock.module`).
- New test: recycle-no-corpse-spawn — fetch mock where `/api/admin/restart` 200s and `/api/health` returns the NEW version on poll N: assert NO spawn attempt; where health never recovers: assert lazy-spawn fallback fires.
- Manual end-to-end: with a worker running, `curl -X POST http://127.0.0.1:$PORT/api/admin/restart`; within ~15s `/api/health` shows a NEW pid and the marketplace version, with no hook involvement. Check `~/.claude-mem/logs/claude-mem-$(date +%F).log` for exactly one shutdown and one daemon start (no duplicate-refusal lines).
- `npm test` → green.

### Anti-pattern guards
- The successor spawn happens ONLY on `reason === 'restart'` — `stop` must stay kill-only.
- Never spawn before the port is confirmed free — that recreates EXIT PATH 2/3 duplicate suicides.
- Do not add respawn logic to `flushResponseThen` itself or to the Windows-managed IPC branch (`process.send` path, Server.ts:284-289) — Windows wrapper already owns restart there.

---

## Phase 4: One spawn gate — shared ensureWorker() with a spawn lockfile; kill the resolver asymmetry

**Why:** Three spawn paths (hooks via `worker-utils`, MCP via `worker-spawner`, CLI via `spawnDaemon`) with two different bun resolvers and no mutual exclusion; logs show 3 launchers colliding within one second.

### What to implement
1. New module `src/shared/worker-spawn-gate.ts`:
   - `acquireSpawnLock(): boolean` — `writeFileSync(join(DATA_DIR, 'spawn.lock'), JSON.stringify({pid: process.pid, startedAt: new Date().toISOString()}), {flag: 'wx'})` in try/catch. On `EEXIST`: `statSync` the lock; if `mtimeMs` older than 30_000ms, `unlinkSync` and retry ONCE; else return false.
   - `releaseSpawnLock(): void` — unlink, owner-checked (read it; only delete if `pid === process.pid`), errors swallowed.
   - Use `DATA_DIR` from `src/shared/paths.ts` — never `homedir()` directly.
2. In `src/shared/worker-utils.ts` `ensureWorkerRunning()` (spawn section, lines ~332-351): wrap the spawn in the lock — if `acquireSpawnLock()` fails, skip the spawn and go straight to `waitForWorkerPort`/`waitForWorkerReadiness` (someone else is spawning; wait for their worker). Release in `finally`.
3. Same wrap in `src/services/worker-spawner.ts` `ensureWorkerStarted()` around the `spawnDaemon` call (line ~132).
4. Resolver unification: delete `resolveBunRuntime()` from `worker-utils.ts` (lines 217-235) and import/re-export `resolveWorkerRuntimePath` from `ProcessManager.ts` (already exported; it strictly supersedes — adds BUN_PATH, `~/.bun/bin`, brew, snap fallbacks). This closes the kill-then-can't-respawn path.
5. `src/servers/mcp-server.ts` (lines ~42-51): compute `WORKER_SCRIPT_PATH` preferring the marketplace copy — copy the candidate pattern from `resolveWorkerScriptPath()` with fallback to the current own-dir resolution. This stops MCP servers in stale cache dirs from spawning stale workers.

### Documentation references
- Write-discipline reference for the lock: `writeJsonFileAtomic`, `src/npx-cli/utils/paths.ts:124-205` (the `wx`-flag + cleanup-on-error shape; the lock is simpler — no rename needed, `wx` IS the atomicity).
- Spawn call sites table: Phase 0 / spawn-path report §7.
- `resolveWorkerRuntimePath` chain: `ProcessManager.ts:63-125`.

### Verification checklist
- New test `tests/shared/worker-spawn-gate.test.ts` (temp dir via `CLAUDE_MEM_DATA_DIR` + dynamic import — see Phase 6 trap): second acquire fails while held; stale lock (backdate mtime via `utimesSync`) is broken and re-acquired; release is owner-only.
- `grep -rn "resolveBunRuntime" src/` → no definition in worker-utils (only the ProcessManager import).
- `grep -n "spawnHidden\|spawnDaemon" src/shared/worker-utils.ts src/services/worker-spawner.ts` → every spawn site is inside the lock.
- Race test (manual): run 5 concurrent `node "$_P/scripts/bun-runner.js" .../worker-service.cjs start` invocations with no worker running; logs must show exactly ONE `Starting worker daemon` and zero `refusing to start duplicate` storms.
- `npm test` → green (version-recycle contract intact).

### Anti-pattern guards
- The lock gates SPAWNING only — never health checks; a held lock must never make a hook fail, only wait.
- 30s staleness must use file mtime, not clock-in-content comparisons (test pollution wrote `startedAt: 2024-01-01` once already).
- Do not introduce a lock library; `wx` flag is sufficient and dependency-free.

---

## Phase 5: Demote the PID file — port + /api/health become the only liveness oracle

**Why:** The dying worker's shutdown cascade deletes the NEW worker's PID file (`src/supervisor/shutdown.ts:88`), after which `status` reports a healthy worker as "not running" (status requires `portInUse && pidInfo`, `worker-service.ts:975-988`). `/api/health` already carries `pid`, `version`, `workerPath` — it subsumes the file.

### What to implement
1. Owner-guarded deletion (the clobber fix):
   - `src/supervisor/shutdown.ts` (~line 88): before `rmSync(pidFilePath)`, read the file; delete ONLY if its `pid === process.pid`. A mismatch means a successor already wrote its own — log debug and leave it.
   - `src/services/worker-service.ts` `case 'restart'` (~line 964) and `case 'stop'` (~line 949): replace bare `removePidFile()` with the same owner-or-dead check: delete only if the recorded pid is the one we just shut down (captured in Phase 2's pre-shutdown health probe) or the recorded pid is not alive.
2. `case 'status'` (lines ~975-988): source of truth becomes `GET /api/health` — report `pid`, `version`, `uptime`, `workerPath` from the response; fall back to "port in use but health unreachable (wedged?)" and "not running". Drop the `readPidFile()` requirement.
3. `--daemon` duplicate gate (lines ~1167-1182): reorder — port/health probe FIRST (it's ground truth), PID file second (advisory only, for the no-port-bound-yet boot window). Keep `writePidFile`/`touchPidFile` as diagnostics — the worker itself remains the only writer.

### Documentation references
- Clobber chain: `GracefulShutdown.ts:55` → `supervisor.stop()` → `runShutdownCascade` → `shutdown.ts:87-98`.
- `verifyPidFileOwnership` + startToken: `ProcessManager.ts` (PID APIs §3 of spawn-path report).
- Health shape: `Server.ts:212-233`.

### Verification checklist
- Update `tests/infrastructure/process-manager.test.ts` expectations where deletion semantics changed (owner-guard) — coordinate with Phase 6 which relocates this file's data dir.
- New test: old-worker-cleanup-spares-successor — write PID file as `{pid: 99999...}` (not own pid), run the shutdown cascade deletion step, assert file survives.
- Manual: start worker, `rm ~/.claude-mem/worker.pid`, run `worker-service.cjs status` → must still print "Worker is running" with pid/version from health.
- Full ping-pong scenario from Phase 3's manual check still converges.
- `npm test` → green.

### Anti-pattern guards
- Do NOT delete `writePidFile` entirely — external tooling may read it; it's demoted to diagnostics, not removed.
- `status` must not require BOTH oracles ever again — health wins, full stop.
- Do not let the boot-window PID check (advisory) exit 1 — duplicate suppression stays exit 0 (Phase 2 changed only the generic-failure path).

---

## Phase 6: Test hygiene — no test ever touches the real ~/.claude-mem again

**Why:** `tests/infrastructure/process-manager.test.ts` writes corrupt JSON and sentinel PIDs into the REAL `~/.claude-mem/worker.pid` (snapshot-restore shrinks but doesn't close the race window, and a killed test run leaves corruption behind). It also pollutes the shared log, which contaminated this very diagnosis.

### What to implement
1. `tests/infrastructure/process-manager.test.ts`:
   - Replace the hardcoded `DATA_DIR = path.join(homedir(), '.claude-mem')` (lines 24-25) with: create `mkdtempSync(join(tmpdir(), 'claude-mem-pm-test-'))`, set `process.env.CLAUDE_MEM_DATA_DIR` to it.
   - **TRAP:** `DATA_DIR` in `src/shared/paths.ts:40` is a module-level const computed at import time, and ESM hoists static imports — setting the env var in `beforeEach` is too late. Copy the fresh-import pattern from `tests/shared/worker-utils-version-recycle.test.ts:30-32` (query-param cache-bust dynamic import) OR set the env var at the very top of the file before any `await import(...)` of ProcessManager modules (convert the static imports of code-under-test to dynamic).
   - Delete the snapshot-restore of the real PID file (lines 28-43) — unnecessary once isolated; `rmSync(tempDir, {recursive: true, force: true})` in `afterAll` (copy Pattern A: `tests/write-json-file-atomic.test.ts:34-40`).
2. Sweep for other offenders: `grep -rn "homedir()" tests/ | grep -v node_modules` — any test resolving the real data dir gets the same treatment.
3. Add a tripwire to `tests/preload.ts` (it already exists for the PostHog mock): if `CLAUDE_MEM_DATA_DIR` is unset, set it to a per-run `mkdtempSync` dir so NO test can fall through to `~/.claude-mem`. (Env restoration discipline: copy `tests/env-isolation.test.ts:31-90`.)

### Documentation references
- Offending lines verbatim: process-manager.test.ts:24-25, 28-43, 109, 499-503 (test-hygiene report §1).
- Copy-ready isolation patterns: Pattern A `write-json-file-atomic.test.ts:34-40`; Pattern C env restore `env-isolation.test.ts:31-90`.
- Env override honored at `src/shared/paths.ts:19-20`.

### Verification checklist
- `bun test tests/infrastructure/` green.
- While the suite runs: `ls -la ~/.claude-mem/worker.pid; shasum ~/.claude-mem/worker.pid` before/after → byte-identical (or consistently absent).
- `grep -rn "homedir()" tests/` → no hit resolves a data-dir path for writes.
- Full `npm test` green.

### Anti-pattern guards
- Do not mock `fs` to fake isolation — real temp dirs only, the tests exercise real file semantics.
- Do not weaken assertions to dodge the trap; the fresh-import pattern is proven in-repo.

---

## Phase 7: Final Verification

1. **Full suite + typecheck:** `npx tsc --noEmit` and `npm test` — zero failures.
2. **Anti-pattern greps (all must be empty):**
   - `grep -rn "INSTALLED_CACHE_PATH\|detectInstalledVersion" scripts/`
   - `grep -n "admin/restart" scripts/sync-marketplace.cjs`
   - `grep -n "37777" src/ scripts/ -r` (hardcoded port)
   - `grep -rn "resolveBunRuntime" src/shared/worker-utils.ts`
   - `grep -rn "homedir(), '.claude-mem'" tests/`
3. **Triple-restart soak:** run `npm run build-and-sync` three times consecutively; every run must end `Worker restart verified`; `/api/health` pid changes each time, version stays the built version; `grep -c "refusing to start duplicate\|Failed to start server" ~/.claude-mem/logs/claude-mem-$(date +%F).log` shows no new occurrences during the soak.
4. **Stale-launcher convergence test (the original bug):** manually spawn a worker from an OLD cache dir (`node ~/.claude/plugins/cache/thedotmack/claude-mem/<old>/scripts/bun-runner.js .../worker-service.cjs start`), then trigger any hook (or `ensureWorkerRunning` via a session). Expect in the log: exactly ONE `Worker version mismatch — recycling stale worker`, then the self-replacing restart, then a healthy marketplace-version worker — NO ping-pong (no second recycle within 5 minutes).
5. **Observation check:** confirm claude-mem itself recorded observations during the soak (the worker was restarting underneath the recorder — the pipeline must survive its own surgery): query the 10 most recent rows in `~/.claude-mem/claude-mem.db` `observations` and confirm fresh timestamps.
6. **Docs:** update `docs/public/troubleshooting.mdx` if it documents the old restart semantics; CHANGELOG is auto-generated — do not edit.