thedotmack-claude-mem/tests/supervisor/shutdown.test.ts
Alex Newman c0b96288a7 fix(restart): worker-restart single source of truth — self-replacing worker, spawn gate, verified restarts (#2894)
* plan-10 Phase 1: ship deterministic plugin runtime dependency closure

Approach A — commit & ship plugin/bun.lock so the plugin's runtime
node_modules install is deterministic, fixing the recurring
`Cannot find module 'zod/v3'` (#2730).

- align generated plugin zod range to root (^4.4.3) in build-hooks.js
- new scripts/gen-plugin-lockfile.cjs generates plugin/bun.lock as a
  build artifact after build-hooks.js writes plugin/package.json
- track & ship plugin/bun.lock (.gitignore negation, .npmignore, files allowlist)
- install with `bun install --frozen-lockfile --ignore-scripts` at runtime

Refs #2783, #2730

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* plan-10 Phase 2: fail loud at install time on a broken dependency closure

Strengthen verifyCriticalModules to assert each dependency is actually
importable via require.resolve (not merely a directory), and assert the
worker-required zod subpaths resolve: zod/v3, zod/v4, zod/v4-mini.
A partial/stale install now fails `npx claude-mem install` immediately
instead of surfacing later as a Stop-hook `Cannot find module 'zod/v3'`.

Bin-only packages (e.g. tree-sitter-cli, which has no bare-name entry
point) fall back to resolving <dep>/package.json so a healthy install
isn't falsely rejected.
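
The resolution check with the bin-only fallback might look roughly like the
sketch below. It is an illustration, not the project's `verifyCriticalModules`:
`isImportable`, `findUnresolvable`, and the injected `Resolver` type are
hypothetical names, and the resolver is passed in so the logic can be exercised
without a real `node_modules`.

```typescript
// Hypothetical sketch; names are illustrative, not claude-mem's actual API.
// A resolver throws when a module id cannot be resolved (as require.resolve does).
type Resolver = (id: string) => string;

// True when the bare name resolves, or, for bin-only packages with no
// bare-name entry point, when <dep>/package.json resolves instead.
function isImportable(dep: string, resolve: Resolver): boolean {
  for (const candidate of [dep, `${dep}/package.json`]) {
    try {
      resolve(candidate);
      return true;
    } catch {
      // Fall through to the next candidate.
    }
  }
  return false;
}

// Subpaths get no fallback: the exact id must resolve.
function findUnresolvable(deps: string[], subpaths: string[], resolve: Resolver): string[] {
  const missingDeps = deps.filter((dep) => !isImportable(dep, resolve));
  const missingSubpaths = subpaths.filter((id) => {
    try {
      resolve(id);
      return false;
    } catch {
      return true;
    }
  });
  return [...missingDeps, ...missingSubpaths];
}

// Fake install: zod is present but its v3 subpath is missing, and
// tree-sitter-cli exposes only its package.json.
const installed = new Set(['zod', 'zod/v4', 'zod/v4-mini', 'tree-sitter-cli/package.json']);
const fakeResolve: Resolver = (id) => {
  if (!installed.has(id)) throw new Error(`Cannot find module '${id}'`);
  return `/fake/node_modules/${id}`;
};

const missing = findUnresolvable(
  ['zod', 'tree-sitter-cli'],
  ['zod/v3', 'zod/v4', 'zod/v4-mini'],
  fakeResolve
);
```

In Node, a real resolver can be built with `createRequire` from `node:module`,
for example `const req = createRequire(path.join(pluginDir, 'package.json'))`
and then `(id) => req.resolve(id)`; `pluginDir` here is a placeholder.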

Adds tests/cli/verify-critical-modules.test.ts covering a missing zod/v3
subpath (throws), a complete zod (passes), and a bin-only dep (passes).

Refs #2783, #2730

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* plan-10 Phase 3: clean-room install + import smoke test (#2730 backstop)

Add scripts/smoke-clean-room.cjs and a `smoke:clean-room` npm script.
Against fresh temp dirs (never the repo's node_modules) it:
- copies plugin/, runs `bun install --frozen-lockfile --ignore-scripts`,
  asserts zod, zod/v3, zod/v4, zod/v4-mini resolve, and boots the bundled
  worker asserting no `Cannot find module` — the direct #2730 regression guard;
- `npm pack`s, installs the tarball into a second temp dir, and load-tests
  the published bin entrypoint, warning loudly on any declared main/exports
  target missing from the tarball (latent #2537 gap).

Exits non-zero naming the missing module on any failure; cleans up all
temp dirs and the tarball in a finally.

Refs #2783, #2730

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* plan-10 Phase 4: gate CI and publish on the clean-room dependency closure

- ci.yml: new `clean-room-deps` job (between build and the docker e2e job)
  runs a frozen-lockfile drift check on the committed plugin lockfile, then
  `npm run build` + `npm run smoke:clean-room`. The drift step catches a
  contributor who changed plugin deps without regenerating plugin/bun.lock.
- npm-publish.yml: add setup-bun and run `npm run smoke:clean-room` between
  build and `npm publish`, so a broken runtime closure cannot be published
  on a tag push (ci.yml does not run on tags). Secrets block untouched.

Refs #2783, #2730

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* plan-10: doc recluster note + Phase 0 execution slice for #2730

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* plans: backlog recluster (2026-06-04) — cross-cluster execution order + plan-13 doc

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* plan-10: gen-plugin-lockfile degrades gracefully when bun is absent

The Windows build CI job has no bun on PATH; regenerating the lockfile there
threw and failed the build. The committed plugin/bun.lock is already the
deterministic closure, so skip regeneration (non-fatal) when bun is missing
and a lockfile exists; fail loud only when neither is available.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: rebuild plugin artifacts after merging main (v13.5.1) + plan-10 work

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* chore: rebuild plugin artifacts after merging main v13.5.5

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* chore(deps): daily upgrade pass — agent SDK 0.3.172, better-auth 1.6.16, posthog-node 5.36.15, dompurify 3.4.9

- Bump @anthropic-ai/claude-agent-sdk 0.2.141 -> 0.3.172 (tsc + full test suite green)
- Remove deprecated @types/dompurify stub (dompurify ships its own types)
- Add overrides.tmp ^0.2.7 to clear GHSA-52f5-9888-hmc6 / GHSA-ph9p-34f9-6g65
  via np -> listr-input -> inquirer -> external-editor -> tmp chain
- npm audit: 0 vulnerabilities; npm outdated: clean
- package-lock.json is gitignored in this repo, so not committed

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* plan: worker-restart single-source-of-truth — 7-phase fix for restart races

Phased plan from the adversarially-verified diagnosis (wf_f07f3541-b05):
kill the cache mirror, single verified restart initiator, self-replacing
restart endpoint, unified spawn gate with lockfile, PID-file demotion,
test data-dir isolation, soak verification.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* refactor(restart): delete sync-script cache-mirror and HTTP restart trigger

Phase 1 of plans/2026-06-10-worker-restart-single-source-of-truth.md.
The installed-version cache mirror wrote version-N code into the
version-(N-1) cache dir, manufacturing permanent version disagreement;
the HTTP POST to /api/admin/restart raced the CLI restart that follows
it in build-and-sync. Both are deleted; the CLI worker:restart in the
marketplace copy is now the single restart initiator, and the sleep 1
between the two mechanisms is gone.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat(restart): restart proves itself or exits 1

Phase 2 of plans/2026-06-10-worker-restart-single-source-of-truth.md.
worker-service restart now captures the old worker pid, waits for the
port with the same platform-scaled 15s budget as stop, spawns the
marketplace copy of worker-service.cjs when present, then polls
/api/health until the pid changes and the version matches this build's
baked __DEFAULT_PACKAGE_VERSION__ — success is printed to stdout,
deadline (platform-scaled 30s) exits 1 with the last observed health
payload and the spawned script path. The --daemon generic start-failure
path now exits 1 instead of masquerading as success; the three
duplicate-suppression exits remain 0.
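
The verification loop described above can be sketched as follows. This is a
minimal illustration under assumptions, not the contents of
src/services/restart-verify.ts: the `Health` shape, `isVerified`, and
`pollUntilVerified` are hypothetical names, and the health fetcher is injected
rather than calling /api/health directly.

```typescript
// Hypothetical sketch; field and function names are assumptions.
interface Health {
  pid: number;
  version: string;
}

// A restart counts as verified when the version matches the expected build and,
// if the old pid is known, a different process is now answering.
function isVerified(health: Health, oldPid: number | null, expectedVersion: string): boolean {
  const pidFlipped = oldPid === null || health.pid !== oldPid;
  return pidFlipped && health.version === expectedVersion;
}

async function pollUntilVerified(
  fetchHealth: () => Promise<Health | null>,
  oldPid: number | null,
  expectedVersion: string,
  deadlineMs: number,
  intervalMs = 250
): Promise<{ ok: boolean; last: Health | null }> {
  const deadline = Date.now() + deadlineMs;
  let last: Health | null = null;
  while (Date.now() < deadline) {
    // An unreachable worker is treated as "no observation on this poll".
    const seen = await fetchHealth().catch(() => null);
    if (seen) last = seen;
    if (seen && isVerified(seen, oldPid, expectedVersion)) return { ok: true, last };
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  // The caller reports `last` and exits 1.
  return { ok: false, last };
}
```

Keeping the predicate separate from the loop lets the pid-flip, stale-pid,
wrong-version, and null-oldPid cases be tested without timers or a live worker.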

New helper src/services/restart-verify.ts (worker-service.ts bootstraps
on import, so the helper lives in an import-safe module) with 8 tests
covering pid-flip success, stale pid, wrong version, unreachable
timeout, 503-degraded acceptance, and null-oldPid version-only
verification.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat(restart): self-replacing worker — old worker spawns its successor

Phase 3 of plans/2026-06-10-worker-restart-single-source-of-truth.md.
/api/admin/restart was kill-only: hooks that POSTed it then raced the
dying worker with their own lazy-spawn (the observed recycle ping-pong).
Now the dying worker spawns its successor itself — after a re-entrancy-
guarded, deadline-bounded (platform-scaled 10s) graceful shutdown, and
only once its port is confirmed free; stop and signal shutdowns stay
kill-only. The hook recycle path waits for that successor via
/api/health polling (HOOK_READINESS_TIMEOUT_MS budget) and lazy-spawns
only as a fallback, with a warn-only version re-check so a hook never
recycles more than once per invocation.

Shutdown sequence lives in import-safe src/services/worker-shutdown.ts
(worker-service.ts bootstraps on import); registerSignalHandlers no
longer pre-sets isShuttingDown — the supervisor's shutdownInitiated
guard owns signal dedupe, and pre-setting would no-op the new entry
guard. 13 new tests cover re-entrancy, deadline expiry/rejection,
handoff ordering, kill-only reasons, successor-wait vs lazy-spawn
fallback, and pre-graceful bookkeeping failures.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat(restart): one spawn gate; CLI restart defers to the self-replacing worker

Phase 4 of plans/2026-06-10-worker-restart-single-source-of-truth.md.
Three uncoordinated spawn paths (hook lazy-spawn, MCP worker-spawner,
CLI) with two different bun resolvers produced 3-launcher collisions
within a single second. Now a wx-flag lockfile (<DATA_DIR>/spawn.lock,
60s mtime staleness with re-stat-before-unlink, owner-checked release)
gates every external spawn: lock losers never fail — they skip the
spawn and wait for the winner's worker. resolveBunRuntime is deleted in
favor of ProcessManager's resolveWorkerRuntimePath (adds BUN_PATH,
~/.bun/bin, brew, which fallbacks), closing the kill-then-can't-respawn
path; mcp-server prefers the marketplace worker script so stale cache
dirs stop spawning stale workers.
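
A wx-flag lockfile gate of this kind can be sketched as below. It is an
illustration under stated assumptions rather than the project's spawn gate:
`tryAcquireSpawnLock`, `releaseSpawnLock`, and `removeIfStale` are hypothetical
names, and storing the owner's pid as the lock file's content is an assumption
made here to demonstrate an owner-checked release.

```typescript
import { closeSync, mkdtempSync, openSync, readFileSync, rmSync, statSync, unlinkSync, writeSync } from 'fs';
import { tmpdir } from 'os';
import path from 'path';

const STALE_MS = 60_000; // 60s mtime staleness, as in the commit message

// Remove the lock only if it is stale and unchanged between two stats.
function removeIfStale(lockPath: string, staleMs: number): boolean {
  try {
    const before = statSync(lockPath);
    if (Date.now() - before.mtimeMs < staleMs) return false; // fresh: a live owner holds it
    // Re-stat before unlinking so a lock replaced in the meantime is not deleted.
    if (statSync(lockPath).mtimeMs !== before.mtimeMs) return false;
    unlinkSync(lockPath);
    return true;
  } catch {
    return true; // stat or unlink failed (typically the lock vanished); allow one retry
  }
}

function tryAcquireSpawnLock(lockPath: string, ownerPid: number = process.pid, staleMs: number = STALE_MS): boolean {
  for (let attempt = 0; attempt < 2; attempt++) {
    try {
      const fd = openSync(lockPath, 'wx'); // 'wx' fails with EEXIST if the file already exists
      try {
        writeSync(fd, String(ownerPid));
      } finally {
        closeSync(fd);
      }
      return true;
    } catch (err) {
      if ((err as NodeJS.ErrnoException).code !== 'EEXIST') throw err;
      if (!removeIfStale(lockPath, staleMs)) return false; // lost the race: do not spawn
    }
  }
  return false;
}

// Owner-checked release: only the pid recorded in the lock may remove it.
function releaseSpawnLock(lockPath: string, ownerPid: number = process.pid): boolean {
  try {
    if (readFileSync(lockPath, 'utf-8').trim() !== String(ownerPid)) return false;
    unlinkSync(lockPath);
    return true;
  } catch {
    return false;
  }
}

// Two contenders in a throwaway directory.
const gateDir = mkdtempSync(path.join(tmpdir(), 'spawn-gate-'));
const lockPath = path.join(gateDir, 'spawn.lock');
const winner = tryAcquireSpawnLock(lockPath); // creates the lock
const loser = tryAcquireSpawnLock(lockPath); // lock is fresh: skip the spawn and wait instead
const foreignRelease = releaseSpawnLock(lockPath, process.pid + 1); // not the owner
const ownerRelease = releaseSpawnLock(lockPath);
rmSync(gateDir, { recursive: true, force: true });
```

A caller that gets `false` from the acquire step would not treat it as a
failure; per the description above, it skips its own spawn and waits for the
winner's worker to come up.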

Integration fix surfaced by live verification: the CLI restart raced
the Phase 3 self-replacement handoff (the successor re-binds the port
in ~200ms, so waitForPortFree always timed out and restart exited 1
while the restart had actually succeeded). The CLI now verifies the
worker's self-spawned successor directly, and only spawns — gate-
wrapped, after the port frees — as the fallback when no worker was
running, the shutdown POST was rejected, or no successor appeared. The
dying worker's handoff is intentionally ungated: it spawns only after
its own port closes, and hooks wait on it.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat(restart): demote the PID file — health and port are the liveness oracle

Phase 5 of plans/2026-06-10-worker-restart-single-source-of-truth.md.
The dying worker's shutdown cascade deleted the PID file
unconditionally as its final act, clobbering the successor's
freshly-written file; status then required portInUse AND pidInfo, so a
healthy worker reported as "not running". Now every PID-file deletion
is owner-guarded: the supervisor cascade deletes only its own pid
(removeOwnedPidFile), and the CLI stop/restart-fallback, the restart
handoff, and the daemon start-failure cleanup go through
removePidFileIfOwner (owner-or-dead — a live successor's file always
survives; corrupt files are left for the next boot's validator).
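
An owner guard along these lines can be sketched as below. This is a
hypothetical reimplementation for illustration, not the project's
`removeOwnedPidFile` or `removePidFileIfOwner`; the name `removePidFileIfMine`
and its boolean return value are assumptions. It covers only the strict
"delete my own file" case, not the owner-or-dead variant.

```typescript
import { existsSync, mkdtempSync, readFileSync, rmSync, unlinkSync, writeFileSync } from 'fs';
import { tmpdir } from 'os';
import path from 'path';

// Hypothetical sketch: delete the PID file only when it records `currentPid`.
// Returns true when the file was removed.
function removePidFileIfMine(pidFilePath: string, currentPid: number): boolean {
  let recorded: unknown;
  try {
    recorded = JSON.parse(readFileSync(pidFilePath, 'utf-8'));
  } catch {
    return false; // missing or corrupt: ownership cannot be proven, so leave it
  }
  const pid = (recorded as { pid?: unknown } | null)?.pid;
  if (pid !== currentPid) return false; // another process (or nobody) owns the file
  try {
    unlinkSync(pidFilePath);
    return true;
  } catch {
    return false;
  }
}

// A successor's file survives; the caller's own file is removed.
const pidDir = mkdtempSync(path.join(tmpdir(), 'pid-guard-'));
const pidFile = path.join(pidDir, 'worker.pid');

writeFileSync(pidFile, JSON.stringify({ pid: process.pid + 1, port: 37777 }));
const removedForeign = removePidFileIfMine(pidFile, process.pid);
const foreignSurvived = existsSync(pidFile);

writeFileSync(pidFile, JSON.stringify({ pid: process.pid, port: 37777 }));
const removedOwn = removePidFileIfMine(pidFile, process.pid);
const ownGone = !existsSync(pidFile);

rmSync(pidDir, { recursive: true, force: true });
```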

status sources from GET /api/health alone (pid, version, uptime,
workerPath; 503-degraded counts as running and now surfaces its queue
detail), with port-in-use-but-unreachable and not-running fallbacks —
all exit 0 as before. The --daemon duplicate gate checks the port
first (ground truth) and the PID file second (advisory, for the
freed-port-but-undeleted-file window); duplicate suppression stays
exit 0. writePidFile/touchPidFile remain — the file is diagnostics,
and the worker stays its only writer.

Also fixes combined-run test pollution: spawn-gate and worker-utils
timeout tests now eagerly import paths.js before setting a temp
CLAUDE_MEM_DATA_DIR, so the import-time DATA_DIR const can't freeze on
a deleted temp dir for suites loaded later in the same bun process.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* test: no test ever touches the real ~/.claude-mem again

Phase 6 of plans/2026-06-10-worker-restart-single-source-of-truth.md.
process-manager and graceful-shutdown tests wrote corrupt JSON and
sentinel PIDs (2147483647) into the real ~/.claude-mem/worker.pid and
drove the real supervisor.json cascade under a snapshot-restore that a
killed run would skip — that pollution contaminated production logs and
a prior diagnosis. Both files now set a temp CLAUDE_MEM_DATA_DIR at the
top of the file before dynamically importing the code under test (ESM
hoisting makes beforeEach too late), assert their paths landed outside
the real dir, and derive PID_FILE from the same frozen paths module the
code uses so test and code can never diverge under bun's shared module
cache. The snapshot-restore scaffolding is deleted; zero assertions
changed.

tests/preload.ts gains a tripwire: when CLAUDE_MEM_DATA_DIR is unset it
points the variable at a per-run temp dir, so no test in any file can fall through to
the real data dir. Fallout made explicit: worker-spawn child processes
get an explicit temp dir; install-error-matrix restores rather than
deletes the env var; settings-defaults-manager pins the unset-env
default it was implicitly relying on.
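
The tripwire idea can be sketched as below. This is an assumption-laden
illustration, not the actual tests/preload.ts: `ensureIsolatedDataDir` and the
temp-dir prefix are hypothetical, and the environment is passed in as a plain
object so the example does not touch the real `process.env`.

```typescript
import { existsSync, mkdtempSync, rmSync } from 'fs';
import { tmpdir } from 'os';
import path from 'path';

// Hypothetical sketch of a preload tripwire; the real preload may differ.
function ensureIsolatedDataDir(env: Record<string, string | undefined>): string {
  const existing = env.CLAUDE_MEM_DATA_DIR;
  if (existing) return existing; // a test file already chose its own dir
  const dir = mkdtempSync(path.join(tmpdir(), 'claude-mem-test-'));
  env.CLAUDE_MEM_DATA_DIR = dir; // later imports now resolve paths under the temp dir
  return dir;
}

const fakeEnv: Record<string, string | undefined> = {};
const assigned = ensureIsolatedDataDir(fakeEnv); // unset: a temp dir is created
const createdOnDisk = existsSync(assigned);
const second = ensureIsolatedDataDir(fakeEnv); // already set: left unchanged
rmSync(assigned, { recursive: true, force: true });
```

A preload script would call this with `process.env` before any test module
imports the paths module, since the data-dir constant is fixed at import time.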

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(settings): bootstrap notices go to stderr, never stdout

CI on PR #2894 caught the latent bug: on the first boot in a fresh data
dir, SettingsDefaultsManager printed '[SETTINGS] Created settings file
with defaults: ...' to stdout before the start command's JSON hook
payload, corrupting the machine-readable contract every fresh install's
first hook invocation relies on. The Phase 6 per-run temp data dir made
the cold-dir case deterministic in CI, exposing it. Both informational
notices (creation, nested-schema migration) now use console.warn —
stderr — matching the function's existing failure-path idiom; two
regression tests pin stdout silence on both paths.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* refactor(restart): address PR #2894 review — dedupe script resolver, skip futile port wait

Both inline copies of the marketplace-first script-candidate list in
worker-service.ts (restart fallback + successor handoff injection) now
call the exported resolveWorkerScriptPath() ?? __filename, so the
candidate list lives in one place. verifyRestartedWorker's failure
result gains lastPollSawHealth; when the self-replacement handoff
verification timed out while a live (but unverifiable) worker was still
serving on the port, the CLI fallback now skips its port-free wait —
the port cannot free while that worker lives, so the wait only burned
its full platform-scaled budget before the same final verification ran
anyway.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 19:59:34 -07:00

import { afterEach, describe, expect, it } from 'bun:test';
import { existsSync, mkdirSync, readFileSync, rmSync, writeFileSync } from 'fs';
import { tmpdir } from 'os';
import path from 'path';
import { createProcessRegistry } from '../../src/supervisor/process-registry.js';
import { removeOwnedPidFile, runShutdownCascade } from '../../src/supervisor/shutdown.js';

// Build a unique temp directory path; callers create it and register it for cleanup.
function makeTempDir(): string {
  return path.join(tmpdir(), `claude-mem-shutdown-${Date.now()}-${Math.random().toString(36).slice(2)}`);
}

// Directories created by tests, removed in afterEach.
const tempDirs: string[] = [];

describe('supervisor shutdown cascade', () => {
  afterEach(() => {
    while (tempDirs.length > 0) {
      const dir = tempDirs.pop();
      if (dir) {
        rmSync(dir, { recursive: true, force: true });
      }
    }
  });

  it('removes child records and pid file', async () => {
    const tempDir = makeTempDir();
    tempDirs.push(tempDir);
    mkdirSync(tempDir, { recursive: true });
    const registryPath = path.join(tempDir, 'supervisor.json');
    const pidFilePath = path.join(tempDir, 'worker.pid');

    // The PID file records this process, so the cascade owns it and may delete it.
    writeFileSync(pidFilePath, JSON.stringify({
      pid: process.pid,
      port: 37777,
      startedAt: new Date().toISOString()
    }));

    const registry = createProcessRegistry(registryPath);
    registry.register('worker', {
      pid: process.pid,
      type: 'worker',
      startedAt: '2026-03-15T00:00:00.000Z'
    });
    registry.register('dead-child', {
      pid: 2147483647,
      type: 'mcp',
      startedAt: '2026-03-15T00:00:01.000Z'
    });

    await runShutdownCascade({
      registry,
      currentPid: process.pid,
      pidFilePath
    });

    const persisted = JSON.parse(readFileSync(registryPath, 'utf-8'));
    expect(Object.keys(persisted.processes)).toHaveLength(0);
    expect(() => readFileSync(pidFilePath, 'utf-8')).toThrow();
  });

  it('terminates tracked children in reverse spawn order', async () => {
    const tempDir = makeTempDir();
    tempDirs.push(tempDir);
    mkdirSync(tempDir, { recursive: true });

    const registry = createProcessRegistry(path.join(tempDir, 'supervisor.json'));
    registry.register('oldest', {
      pid: 41001,
      type: 'sdk',
      startedAt: '2026-03-15T00:00:00.000Z'
    });
    registry.register('middle', {
      pid: 41002,
      type: 'mcp',
      startedAt: '2026-03-15T00:00:01.000Z'
    });
    registry.register('newest', {
      pid: 41003,
      type: 'chroma',
      startedAt: '2026-03-15T00:00:02.000Z'
    });

    // Stub process.kill so no real signals are sent; record each call in order.
    const originalKill = process.kill;
    const alive = new Set([41001, 41002, 41003]);
    const calls: Array<{ pid: number; signal: NodeJS.Signals | number }> = [];
    process.kill = ((pid: number, signal?: NodeJS.Signals | number) => {
      const normalizedSignal = signal ?? 'SIGTERM';
      // Signal 0 is a liveness probe: throw ESRCH for pids that are no longer alive.
      if (normalizedSignal === 0) {
        if (!alive.has(pid)) {
          const error = new Error(`kill ESRCH ${pid}`) as NodeJS.ErrnoException;
          error.code = 'ESRCH';
          throw error;
        }
        return true;
      }
      calls.push({ pid, signal: normalizedSignal });
      alive.delete(pid);
      return true;
    }) as typeof process.kill;

    try {
      await runShutdownCascade({
        registry,
        currentPid: process.pid,
        pidFilePath: path.join(tempDir, 'worker.pid')
      });
    } finally {
      process.kill = originalKill;
    }

    expect(calls).toEqual([
      { pid: 41003, signal: 'SIGTERM' },
      { pid: 41002, signal: 'SIGTERM' },
      { pid: 41001, signal: 'SIGTERM' }
    ]);
  });

  it('handles already-dead processes gracefully without throwing', async () => {
    const tempDir = makeTempDir();
    tempDirs.push(tempDir);
    mkdirSync(tempDir, { recursive: true });
    const registryPath = path.join(tempDir, 'supervisor.json');

    const registry = createProcessRegistry(registryPath);
    registry.register('dead:1', {
      pid: 2147483640,
      type: 'sdk',
      startedAt: '2026-03-15T00:00:00.000Z'
    });
    registry.register('dead:2', {
      pid: 2147483641,
      type: 'mcp',
      startedAt: '2026-03-15T00:00:01.000Z'
    });

    await runShutdownCascade({
      registry,
      currentPid: process.pid,
      pidFilePath: path.join(tempDir, 'worker.pid')
    });

    const persisted = JSON.parse(readFileSync(registryPath, 'utf-8'));
    expect(Object.keys(persisted.processes)).toHaveLength(0);
  });

  // Phase 5 (worker-restart plan): the dying worker's shutdown cascade runs
  // AFTER the restart successor has written its own PID file. Blind deletion
  // here clobbered the successor's file and made `worker status` report a
  // healthy worker as not running.
  it('old-worker cleanup spares the successor\'s PID file (owner guard)', async () => {
    const tempDir = makeTempDir();
    tempDirs.push(tempDir);
    mkdirSync(tempDir, { recursive: true });
    const registryPath = path.join(tempDir, 'supervisor.json');
    const pidFilePath = path.join(tempDir, 'worker.pid');

    // A successor (NOT this process) already owns the PID file.
    const successorContent = JSON.stringify({
      pid: 99999847,
      port: 37777,
      startedAt: new Date().toISOString()
    });
    writeFileSync(pidFilePath, successorContent);

    const registry = createProcessRegistry(registryPath);
    registry.register('worker', {
      pid: process.pid,
      type: 'worker',
      startedAt: '2026-03-15T00:00:00.000Z'
    });

    await runShutdownCascade({
      registry,
      currentPid: process.pid,
      pidFilePath
    });

    // The successor's file must survive the old worker's dying breath, byte
    // for byte.
    expect(existsSync(pidFilePath)).toBe(true);
    expect(readFileSync(pidFilePath, 'utf-8')).toBe(successorContent);
  });

  it('unregisters all children from registry after cascade', async () => {
    const tempDir = makeTempDir();
    tempDirs.push(tempDir);
    mkdirSync(tempDir, { recursive: true });
    const registryPath = path.join(tempDir, 'supervisor.json');

    const registry = createProcessRegistry(registryPath);
    registry.register('worker', {
      pid: process.pid,
      type: 'worker',
      startedAt: '2026-03-15T00:00:00.000Z'
    });
    registry.register('child:1', {
      pid: 2147483640,
      type: 'sdk',
      startedAt: '2026-03-15T00:00:01.000Z'
    });
    registry.register('child:2', {
      pid: 2147483641,
      type: 'mcp',
      startedAt: '2026-03-15T00:00:02.000Z'
    });

    await runShutdownCascade({
      registry,
      currentPid: process.pid,
      pidFilePath: path.join(tempDir, 'worker.pid')
    });

    expect(registry.getAll()).toHaveLength(0);
  });
});

describe('removeOwnedPidFile (owner guard, Phase 5)', () => {
  afterEach(() => {
    while (tempDirs.length > 0) {
      const dir = tempDirs.pop();
      if (dir) {
        rmSync(dir, { recursive: true, force: true });
      }
    }
  });

  // Create a fresh temp dir (registered for cleanup) and return a PID file path inside it.
  function makePidFilePath(): string {
    const tempDir = makeTempDir();
    tempDirs.push(tempDir);
    mkdirSync(tempDir, { recursive: true });
    return path.join(tempDir, 'worker.pid');
  }

  it('deletes the file when the recorded pid is the current process', () => {
    const pidFilePath = makePidFilePath();
    writeFileSync(pidFilePath, JSON.stringify({ pid: process.pid, port: 37777, startedAt: new Date().toISOString() }));
    removeOwnedPidFile(pidFilePath, process.pid);
    expect(existsSync(pidFilePath)).toBe(false);
  });

  it('leaves the file when the recorded pid belongs to another process', () => {
    const pidFilePath = makePidFilePath();
    writeFileSync(pidFilePath, JSON.stringify({ pid: 99999847, port: 37777, startedAt: new Date().toISOString() }));
    removeOwnedPidFile(pidFilePath, process.pid);
    expect(existsSync(pidFilePath)).toBe(true);
  });

  it('leaves a corrupt file in place (ownership cannot be proven)', () => {
    const pidFilePath = makePidFilePath();
    writeFileSync(pidFilePath, 'not valid json {{{');
    removeOwnedPidFile(pidFilePath, process.pid);
    expect(existsSync(pidFilePath)).toBe(true);
  });

  it('leaves a pid-less JSON file in place (no recorded owner)', () => {
    const pidFilePath = makePidFilePath();
    writeFileSync(pidFilePath, JSON.stringify({ port: 37777 }));
    removeOwnedPidFile(pidFilePath, process.pid);
    expect(existsSync(pidFilePath)).toBe(true);
  });

  it('does not throw when the file is missing', () => {
    const pidFilePath = makePidFilePath();
    expect(() => removeOwnedPidFile(pidFilePath, process.pid)).not.toThrow();
    expect(existsSync(pidFilePath)).toBe(false);
  });
});