mirror of
https://github.com/thedotmack/claude-mem.git
synced 2026-07-03 12:32:32 +08:00
fix(restart): worker-restart single source of truth — self-replacing worker, spawn gate, verified restarts (#2894)
* plan-10 Phase 1: ship deterministic plugin runtime dependency closure Approach A — commit & ship plugin/bun.lock so the plugin's runtime node_modules install is deterministic, fixing the recurring `Cannot find module 'zod/v3'` (#2730). - align generated plugin zod range to root (^4.4.3) in build-hooks.js - new scripts/gen-plugin-lockfile.cjs generates plugin/bun.lock as a build artifact after build-hooks.js writes plugin/package.json - track & ship plugin/bun.lock (.gitignore negation, .npmignore, files allowlist) - install with `bun install --frozen-lockfile --ignore-scripts` at runtime Refs #2783, #2730 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * plan-10 Phase 2: fail loud at install time on a broken dependency closure Strengthen verifyCriticalModules to assert each dependency is actually importable via require.resolve (not merely a directory), and assert the worker-required zod subpaths resolve: zod/v3, zod/v4, zod/v4-mini. A partial/stale install now fails `npx claude-mem install` immediately instead of surfacing later as a Stop-hook `Cannot find module 'zod/v3'`. Bin-only packages (e.g. tree-sitter-cli, which has no bare-name entry point) fall back to resolving <dep>/package.json so a healthy install isn't falsely rejected. Adds tests/cli/verify-critical-modules.test.ts covering a missing zod/v3 subpath (throws), a complete zod (passes), and a bin-only dep (passes). Refs #2783, #2730 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * plan-10 Phase 3: clean-room install + import smoke test (#2730 backstop) Add scripts/smoke-clean-room.cjs and a `smoke:clean-room` npm script. Against fresh temp dirs (never the repo's node_modules) it: - copies plugin/, runs `bun install --frozen-lockfile --ignore-scripts`, asserts zod, zod/v3, zod/v4, zod/v4-mini resolve, and boots the bundled worker asserting no `Cannot find module` — the direct #2730 regression guard; - `npm pack`s, installs the tarball into a second temp dir, and load-tests the published bin entrypoint, warning loudly on any declared main/exports target missing from the tarball (latent #2537 gap). Exits non-zero naming the missing module on any failure; cleans up all temp dirs and the tarball in a finally. Refs #2783, #2730 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * plan-10 Phase 4: gate CI and publish on the clean-room dependency closure - ci.yml: new `clean-room-deps` job (between build and the docker e2e job) runs a frozen-lockfile drift check on the committed plugin lockfile, then `npm run build` + `npm run smoke:clean-room`. The drift step catches a contributor who changed plugin deps without regenerating plugin/bun.lock. - npm-publish.yml: add setup-bun and run `npm run smoke:clean-room` between build and `npm publish`, so a broken runtime closure cannot be published on a tag push (ci.yml does not run on tags). Secrets block untouched. Refs #2783, #2730 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * plan-10: doc recluster note + Phase 0 execution slice for #2730 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * plans: backlog recluster (2026-06-04) — cross-cluster execution order + plan-13 doc Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * plan-10: gen-plugin-lockfile degrades gracefully when bun is absent The Windows build CI job has no bun on PATH; regenerating the lockfile there threw and failed the build. The committed plugin/bun.lock is already the deterministic closure, so skip regeneration (non-fatal) when bun is missing and a lockfile exists; fail loud only when neither is available. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: rebuild plugin artifacts after merging main (v13.5.1) + plan-10 work Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * chore: rebuild plugin artifacts after merging main v13.5.5 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * chore(deps): daily upgrade pass — agent SDK 0.3.172, better-auth 1.6.16, posthog-node 5.36.15, dompurify 3.4.9 - Bump @anthropic-ai/claude-agent-sdk 0.2.141 -> 0.3.172 (tsc + full test suite green) - Remove deprecated @types/dompurify stub (dompurify ships its own types) - Add overrides.tmp ^0.2.7 to clear GHSA-52f5-9888-hmc6 / GHSA-ph9p-34f9-6g65 via np -> listr-input -> inquirer -> external-editor -> tmp chain - npm audit: 0 vulnerabilities; npm outdated: clean - package-lock.json is gitignored in this repo, so not committed Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * plan: worker-restart single-source-of-truth — 7-phase fix for restart races Phased plan from the adversarially-verified diagnosis (wf_f07f3541-b05): kill the cache mirror, single verified restart initiator, self-replacing restart endpoint, unified spawn gate with lockfile, PID-file demotion, test data-dir isolation, soak verification. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * refactor(restart): delete sync-script cache-mirror and HTTP restart trigger Phase 1 of plans/2026-06-10-worker-restart-single-source-of-truth.md. The installed-version cache mirror wrote version-N code into the version-(N-1) cache dir, manufacturing permanent version disagreement; the HTTP POST to /api/admin/restart raced the CLI restart that follows it in build-and-sync. Both are deleted; the CLI worker:restart in the marketplace copy is now the single restart initiator, and the sleep 1 between the two mechanisms is gone. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat(restart): restart proves itself or exits 1 Phase 2 of plans/2026-06-10-worker-restart-single-source-of-truth.md. worker-service restart now captures the old worker pid, waits for the port with the same platform-scaled 15s budget as stop, spawns the marketplace copy of worker-service.cjs when present, then polls /api/health until the pid changes and the version matches this build's baked __DEFAULT_PACKAGE_VERSION__ — success is printed to stdout, deadline (platform-scaled 30s) exits 1 with the last observed health payload and the spawned script path. The --daemon generic start-failure path now exits 1 instead of masquerading as success; the three duplicate-suppression exits remain 0. New helper src/services/restart-verify.ts (worker-service.ts bootstraps on import, so the helper lives in an import-safe module) with 8 tests covering pid-flip success, stale pid, wrong version, unreachable timeout, 503-degraded acceptance, and null-oldPid version-only verification. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat(restart): self-replacing worker — old worker spawns its successor Phase 3 of plans/2026-06-10-worker-restart-single-source-of-truth.md. /api/admin/restart was kill-only: hooks that POSTed it then raced the dying worker with their own lazy-spawn (the observed recycle ping-pong). Now the dying worker spawns its successor itself — after a re-entrancy- guarded, deadline-bounded (platform-scaled 10s) graceful shutdown, and only once its port is confirmed free; stop and signal shutdowns stay kill-only. The hook recycle path waits for that successor via /api/health polling (HOOK_READINESS_TIMEOUT_MS budget) and lazy-spawns only as a fallback, with a warn-only version re-check so a hook never recycles more than once per invocation. Shutdown sequence lives in import-safe src/services/worker-shutdown.ts (worker-service.ts bootstraps on import); registerSignalHandlers no longer pre-sets isShuttingDown — the supervisor's shutdownInitiated guard owns signal dedupe, and pre-setting would no-op the new entry guard. 13 new tests cover re-entrancy, deadline expiry/rejection, handoff ordering, kill-only reasons, successor-wait vs lazy-spawn fallback, and pre-graceful bookkeeping failures. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat(restart): one spawn gate; CLI restart defers to the self-replacing worker Phase 4 of plans/2026-06-10-worker-restart-single-source-of-truth.md. Three uncoordinated spawn paths (hook lazy-spawn, MCP worker-spawner, CLI) with two different bun resolvers produced 3-launcher collisions within a single second. Now a wx-flag lockfile (<DATA_DIR>/spawn.lock, 60s mtime staleness with re-stat-before-unlink, owner-checked release) gates every external spawn: lock losers never fail — they skip the spawn and wait for the winner's worker. resolveBunRuntime is deleted in favor of ProcessManager's resolveWorkerRuntimePath (adds BUN_PATH, ~/.bun/bin, brew, which fallbacks), closing the kill-then-can't-respawn path; mcp-server prefers the marketplace worker script so stale cache dirs stop spawning stale workers. Integration fix surfaced by live verification: the CLI restart raced the Phase 3 self-replacement handoff (the successor re-binds the port in ~200ms, so waitForPortFree always timed out and restart exited 1 while the restart had actually succeeded). The CLI now verifies the worker's self-spawned successor directly, and only spawns — gate- wrapped, after the port frees — as the fallback when no worker was running, the shutdown POST was rejected, or no successor appeared. The dying worker's handoff is intentionally ungated: it spawns only after its own port closes, and hooks wait on it. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat(restart): demote the PID file — health and port are the liveness oracle Phase 5 of plans/2026-06-10-worker-restart-single-source-of-truth.md. The dying worker's shutdown cascade deleted the PID file unconditionally as its final act, clobbering the successor's freshly-written file; status then required portInUse AND pidInfo, so a healthy worker reported as "not running". Now every PID-file deletion is owner-guarded: the supervisor cascade deletes only its own pid (removeOwnedPidFile), and the CLI stop/restart-fallback, the restart handoff, and the daemon start-failure cleanup go through removePidFileIfOwner (owner-or-dead — a live successor's file always survives; corrupt files are left for the next boot's validator). status sources from GET /api/health alone (pid, version, uptime, workerPath; 503-degraded counts as running and now surfaces its queue detail), with port-in-use-but-unreachable and not-running fallbacks — all exit 0 as before. The --daemon duplicate gate checks the port first (ground truth) and the PID file second (advisory, for the freed-port-but-undeleted-file window); duplicate suppression stays exit 0. writePidFile/touchPidFile remain — the file is diagnostics, and the worker stays its only writer. Also fixes combined-run test pollution: spawn-gate and worker-utils timeout tests now eagerly import paths.js before setting a temp CLAUDE_MEM_DATA_DIR, so the import-time DATA_DIR const can't freeze on a deleted temp dir for suites loaded later in the same bun process. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * test: no test ever touches the real ~/.claude-mem again Phase 6 of plans/2026-06-10-worker-restart-single-source-of-truth.md. process-manager and graceful-shutdown tests wrote corrupt JSON and sentinel PIDs (2147483647) into the real ~/.claude-mem/worker.pid and drove the real supervisor.json cascade under a snapshot-restore that a killed run would skip — that pollution contaminated production logs and a prior diagnosis. Both files now set a temp CLAUDE_MEM_DATA_DIR at the top of the file before dynamically importing the code under test (ESM hoisting makes beforeEach too late), assert their paths landed outside the real dir, and derive PID_FILE from the same frozen paths module the code uses so test and code can never diverge under bun's shared module cache. The snapshot-restore scaffolding is deleted; zero assertions changed. tests/preload.ts gains a tripwire: when CLAUDE_MEM_DATA_DIR is unset it fills a per-run temp dir, so no test in any file can fall through to the real data dir. Fallout made explicit: worker-spawn child processes get an explicit temp dir; install-error-matrix restores rather than deletes the env var; settings-defaults-manager pins the unset-env default it was implicitly relying on. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(settings): bootstrap notices go to stderr, never stdout CI on PR #2894 caught the latent bug: on the first boot in a fresh data dir, SettingsDefaultsManager printed '[SETTINGS] Created settings file with defaults: ...' to stdout before the start command's JSON hook payload, corrupting the machine-readable contract every fresh install's first hook invocation relies on. The Phase 6 per-run temp data dir made the cold-dir case deterministic in CI, exposing it. Both informational notices (creation, nested-schema migration) now use console.warn — stderr — matching the function's existing failure-path idiom; two regression tests pin stdout silence on both paths. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * refactor(restart): address PR #2894 review — dedupe script resolver, skip futile port wait Both inline copies of the marketplace-first script-candidate list in worker-service.ts (restart fallback + successor handoff injection) now call the exported resolveWorkerScriptPath() ?? __filename, so the candidate list lives in one place. verifyRestartedWorker's failure result gains lastPollSawHealth; when the self-replacement handoff verification timed out while a live (but unverifiable) worker was still serving on the port, the CLI fallback now skips its port-free wait — the port cannot free while that worker lives, so the wait only burned its full platform-scaled budget before the same final verification ran anyway. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
20
package.json
20
package.json
@@ -64,7 +64,7 @@
|
||||
"scripts": {
|
||||
"dev": "npm run build-and-sync",
|
||||
"build": "node scripts/sync-plugin-manifests.js && node scripts/build-hooks.js && node scripts/gen-plugin-lockfile.cjs",
|
||||
"build-and-sync": "npm run build && npm run sync-marketplace && sleep 1 && (cd ~/.claude/plugins/marketplaces/thedotmack && npm run worker:restart)",
|
||||
"build-and-sync": "npm run build && npm run sync-marketplace && (cd ~/.claude/plugins/marketplaces/thedotmack && npm run worker:restart)",
|
||||
"sync-marketplace": "node scripts/sync-marketplace.cjs",
|
||||
"sync-marketplace:force": "node scripts/sync-marketplace.cjs --force",
|
||||
"build:binaries": "node scripts/build-worker-binary.js",
|
||||
@@ -125,22 +125,22 @@
|
||||
"2fa": false
|
||||
},
|
||||
"dependencies": {
|
||||
"@anthropic-ai/claude-agent-sdk": "^0.2.138",
|
||||
"@better-auth/api-key": "^1.6.9",
|
||||
"@anthropic-ai/claude-agent-sdk": "^0.3.172",
|
||||
"@better-auth/api-key": "^1.6.16",
|
||||
"@clack/prompts": "^1.3.0",
|
||||
"@modelcontextprotocol/sdk": "^1.29.0",
|
||||
"ansi-to-html": "^0.7.2",
|
||||
"better-auth": "^1.6.9",
|
||||
"better-auth": "^1.6.16",
|
||||
"bullmq": "^5.76.6",
|
||||
"cors": "^2.8.6",
|
||||
"dompurify": "^3.4.2",
|
||||
"dompurify": "^3.4.9",
|
||||
"express": "^5.2.1",
|
||||
"glob": "^13.0.6",
|
||||
"handlebars": "^4.7.9",
|
||||
"ioredis": "^5.10.1",
|
||||
"pg": "^8.20.0",
|
||||
"picocolors": "^1.1.1",
|
||||
"posthog-node": "^5.36.8",
|
||||
"posthog-node": "^5.36.15",
|
||||
"react": "^19.2.6",
|
||||
"react-dom": "^19.2.6",
|
||||
"shell-quote": "^1.8.3",
|
||||
@@ -148,6 +148,9 @@
|
||||
"zod": "^4.4.3",
|
||||
"zod-to-json-schema": "^3.25.2"
|
||||
},
|
||||
"overrides": {
|
||||
"tmp": "^0.2.7"
|
||||
},
|
||||
"devDependencies": {
|
||||
"@derekstride/tree-sitter-sql": "^0.3.11",
|
||||
"@tree-sitter-grammars/tree-sitter-lua": "^0.4.1",
|
||||
@@ -157,11 +160,10 @@
|
||||
"@tree-sitter-grammars/tree-sitter-zig": "^1.1.2",
|
||||
"@types/bun": "^1.3.13",
|
||||
"@types/cors": "^2.8.19",
|
||||
"@types/dompurify": "^3.2.0",
|
||||
"@types/express": "^5.0.6",
|
||||
"@types/node": "^25.6.2",
|
||||
"@types/node": "^25.9.2",
|
||||
"@types/pg": "^8.20.0",
|
||||
"@types/react": "^19.2.14",
|
||||
"@types/react": "^19.2.17",
|
||||
"@types/react-dom": "^19.2.3",
|
||||
"esbuild": "^0.28.0",
|
||||
"jimp": "^1.6.1",
|
||||
|
||||
253
plans/2026-06-10-worker-restart-single-source-of-truth.md
Normal file
253
plans/2026-06-10-worker-restart-single-source-of-truth.md
Normal file
@@ -0,0 +1,253 @@
|
||||
# Worker Restart: Single Source of Truth
|
||||
|
||||
**Created:** 2026-06-10
|
||||
**Root-cause analysis:** 12-agent diagnosis, adversarially verified (workflow `wf_f07f3541-b05`). Summary: worker-restart failures are caused by five redundant "who is the worker" oracles with uncoordinated writers, a sync-script "hot reload" mirror that writes version-N code into the version-(N-1) cache dir, a kill-only restart endpoint that races hook lazy-spawns, and a build chain that fires two uncoordinated restarts and never verifies the outcome. 98 version-recycle ping-pong events and six EADDRINUSE hard failures observed Jun 8–10.
|
||||
|
||||
**Execution model:** Each phase is self-contained and lands independently (one commit/PR per phase, in order). Phases 1–2 are the high-leverage, low-risk slice. Run `npm test` plus the phase's verification checklist before moving on.
|
||||
|
||||
---
|
||||
|
||||
## Phase 0: Consolidated Documentation — Allowed APIs and Ground Truth
|
||||
|
||||
*(Discovery already performed by three fact-extraction subagents on 2026-06-10; consolidated here. Executors: trust these refs, but re-read the cited lines before editing — line numbers drift.)*
|
||||
|
||||
### Allowed APIs (verified to exist — use ONLY these, with these exact signatures)
|
||||
|
||||
| API | Location | Signature / shape |
|
||||
|---|---|---|
|
||||
| `ensureWorkerRunning()` | `src/shared/worker-utils.ts:293-381` | `(): Promise<boolean>` — hook lazy-spawn + PR #2768 version-recycle |
|
||||
| `resolveWorkerScriptPath()` | `src/shared/worker-utils.ts:206-215` | candidates: `MARKETPLACE_ROOT/plugin/scripts/worker-service.cjs`, then `cwd()/plugin/scripts/worker-service.cjs` |
|
||||
| `resolveBunRuntime()` | `src/shared/worker-utils.ts:217-235` | hook-side resolver; MISSING `~/.bun/bin` fallbacks |
|
||||
| `waitForWorkerPort` / `waitForWorkerReadiness` | `src/shared/worker-utils.ts:237-273` | polls `GET /api/readiness`; budget env `CLAUDE_MEM_HOOK_READINESS_TIMEOUT_MS` |
|
||||
| `ensureWorkerStarted(port, workerScriptPath)` | `src/services/worker-spawner.ts:70-153` | returns `'ready' \| 'warming' \| 'dead'`; NO version guard |
|
||||
| `spawnDaemon(scriptPath, port, extraEnv?)` | `src/services/infrastructure/ProcessManager.ts:408-472` | returns PID or `undefined`; setsid on Unix, PowerShell on Windows |
|
||||
| `resolveWorkerRuntimePath(options?)` | `src/services/infrastructure/ProcessManager.ts:63-125` | full bun resolver chain (BUN, BUN_PATH, `~/.bun/bin`, brew paths, `which`) |
|
||||
| PID file APIs | `ProcessManager.ts:134-168, 508-520` | `writePidFile(info)`, `readPidFile(): PidInfo\|null`, `removePidFile()`, `touchPidFile()`, `cleanStalePidFile()` |
|
||||
| `httpShutdown(port, reason?)` | `src/services/infrastructure/HealthMonitor.ts:94-114` | POSTs `/api/admin/shutdown?reason=restart` |
|
||||
| `waitForPortFree(port, timeoutMs?)` | `HealthMonitor.ts:85-92` | 500ms poll |
|
||||
| `checkVersionMatch(port)` | `src/services/infrastructure/HealthMonitor.ts:~120-161` | returns `{matches, pluginVersion, workerVersion}`; fail-open on ENOENT |
|
||||
| `performGracefulShutdown(config)` | `src/services/infrastructure/GracefulShutdown.ts:30-58` | sequential closes, NO global deadline |
|
||||
| `flushResponseThen(res, payload, action)` | `src/services/server/flushResponseThen.ts:3-16` | responds, runs action on 'finish', ALWAYS `process.exit(0)` |
|
||||
| `writeJsonFileAtomic()` | `src/npx-cli/utils/paths.ts:124-205` | the ONLY atomic-write helper in the repo |
|
||||
| `MARKETPLACE_ROOT` | `src/shared/paths.ts:43` | `~/.claude/plugins/marketplaces/thedotmack` |
|
||||
| `resolveDataDir()` / `DATA_DIR` | `src/shared/paths.ts:18-40` | env `CLAUDE_MEM_DATA_DIR` wins (line 19-20); module-level const — computed at import time |
|
||||
| Worker port default | `src/shared/SettingsDefaultsManager.ts:91` | `37700 + (uid % 100)` — NEVER hardcode 37777 |
|
||||
| `/api/health` response | `src/services/server/Server.ts:212-233` | `{status, version, workerPath, uptime, pid, initialized, mcpReady, ...}` — has everything verification needs |
|
||||
| `/api/readiness` response | `Server.ts:235-247` | `{status: 'ready'\|'initializing', mcpReady}` |
|
||||
| `/api/admin/restart` | `Server.ts:282-294` | kill-only via `flushResponseThen` → `onRestart()` → `shutdown('restart')`; NOTHING respawns (macOS/Linux) |
|
||||
| `version: BUILT_IN_VERSION` | baked via `__DEFAULT_PACKAGE_VERSION__` esbuild define | `scripts/build-hooks.js:312,365,421,469,495,540` |
|
||||
|
||||
### Anti-patterns (verified NOT to exist — do not invent)
|
||||
- There is NO `/api/admin/status`, NO `/api/version` route in Server.ts (version comes from `/api/health`), NO respawn anywhere in the HTTP restart path.
|
||||
- There is NO flock/O_EXCL/`wx`-flag lockfile pattern anywhere in src/ — Phase 4 introduces the first one; copy `writeJsonFileAtomic` for the write discipline.
|
||||
- `plugin/scripts/*.cjs` are BUILT artifacts — never hand-edit; rebuild with `npm run build`.
|
||||
- `package-lock.json` is gitignored — do not commit it.
|
||||
- Tests are `bun test` (`bunfig.toml` preloads `tests/preload.ts` for the PostHog mock). Mock pattern: `mock.module()` + query-param cache-bust fresh import (`tests/shared/worker-utils-version-recycle.test.ts:22-32`).
|
||||
|
||||
### Behavioral contracts that MUST keep passing
|
||||
- `tests/shared/worker-utils-version-recycle.test.ts`: on version mismatch `ensureWorkerRunning()` POSTs `/api/admin/restart` ≥1×; on match, 0×.
|
||||
- `tests/integrations/spawn-contract-windows.test.ts`: spawn-contract env overrides.
|
||||
- Full suite: 2203 pass / 0 fail as of v13.5.5.
|
||||
|
||||
---
|
||||
|
||||
## Phase 1: Delete the cache-mirror; make the CLI restart the single initiator
|
||||
|
||||
**Why first:** The mirror (`sync-marketplace.cjs:164-173`) is the largest manufactured source of version disagreement (7 of 10 cache dirs hold off-by-one content); the double restart (HTTP POST + sleep 1 + CLI restart) is the race generator. Both fixes are deletions.
|
||||
|
||||
### What to implement
|
||||
1. In `scripts/sync-marketplace.cjs`:
|
||||
- Delete the cache-mirror block (lines ~164-173: `INSTALLED_CACHE_PATH` + rsync + `bun install` into the cache dir).
|
||||
- Delete `detectInstalledVersion()` (lines ~79-114) and its call site — it exists only to feed the mirror.
|
||||
- Delete the HTTP restart trigger block (lines ~196-216: the `http.request` POST to `/api/admin/restart` and its success/error prints). The sync script's job ends at "files synced + `bun install` in the marketplace copy".
|
||||
2. In root `package.json` line ~67, simplify:
|
||||
- From: `"build-and-sync": "npm run build && npm run sync-marketplace && sleep 1 && (cd ~/.claude/plugins/marketplaces/thedotmack && npm run worker:restart)"`
|
||||
- To: `"build-and-sync": "npm run build && npm run sync-marketplace && (cd ~/.claude/plugins/marketplaces/thedotmack && npm run worker:restart)"`
|
||||
- (drop the `sleep 1` — it existed to let the now-deleted HTTP kill land before the CLI restart)
|
||||
3. Do NOT touch the existing one-time legacy-cache cleanup elsewhere in sync-marketplace.cjs (if present) — only the three blocks named above.
|
||||
|
||||
### Documentation references
|
||||
- Verbatim quotes of all three blocks: Phase 0 table + the restart/shutdown extraction (sync-marketplace.cjs:79-114, 164-173, 196-216; package.json:64-79).
|
||||
|
||||
### Verification checklist
|
||||
- `grep -n "INSTALLED_CACHE_PATH\|detectInstalledVersion\|admin/restart" scripts/sync-marketplace.cjs` → no matches.
|
||||
- `grep -n "sleep 1" package.json` → no match in build-and-sync.
|
||||
- `npm run build-and-sync` → completes; worker restarts (via CLI path only); `curl -s http://127.0.0.1:$PORT/api/health` shows the just-built version. (Resolve `$PORT` as `37700 + uid%100` or from `~/.claude-mem/settings.json`.)
|
||||
- `npm test` → green.
|
||||
- Stale-cache regression check: `ls ~/.claude/plugins/cache/thedotmack/claude-mem/*/package.json | xargs grep -h '"version"'` — record current values; after the NEXT release, re-check that no cache dir's content changed (the mirror used to mutate them).
|
||||
|
||||
### Anti-pattern guards
|
||||
- Do not "fix" the mirror by targeting `workerPath` instead — the feature is being removed, not repaired.
|
||||
- Do not add a new restart mechanism here; Phase 2 hardens the existing CLI restart.
|
||||
|
||||
---
|
||||
|
||||
## Phase 2: Post-restart verification — restart proves itself or exits 1
|
||||
|
||||
**Why:** `restart` currently exits 0 after `spawnDaemon` returns a PID — fork success, not boot success. Four silent exit-0 daemon paths mean "✓" with a dead/stale worker.
|
||||
|
||||
### What to implement
|
||||
All in `src/services/worker-service.ts`:
|
||||
1. In `case 'restart'` (lines ~956-973):
|
||||
- BEFORE `httpShutdown`, capture the old worker: `GET /api/health` (2s timeout) → save `oldPid` (may be null if no worker).
|
||||
- Replace `waitForPortFree(port, 5000)` with `waitForPortFree(port, getPlatformTimeout(15000))` — parity with `stop` (line ~946).
|
||||
- AFTER `spawnDaemon`, add a verification loop (new helper `verifyRestartedWorker(port, oldPid, deadlineMs)`): poll `GET /api/health` every 500ms until `health.pid !== oldPid && health.version === EXPECTED_VERSION`, where `EXPECTED_VERSION` is this process's own baked `__DEFAULT_PACKAGE_VERSION__` (build-and-sync runs the marketplace copy, so its baked version IS the just-synced version). Deadline: `getPlatformTimeout(30000)`.
|
||||
- On success: log `Worker restart verified {pid, version}` and exit 0. On deadline: `console.error` with the last observed health payload (or connection error) and **exit 1**.
|
||||
2. In the `--daemon` block (lines ~1167-1208): change EXIT PATH 4 only (generic start failure, lines ~1204-1206) from `process.exit(0)` to `process.exit(1)`. Paths 1-3 are legitimate duplicate-suppression and stay exit 0.
|
||||
3. Spawn target: change `spawnDaemon(__filename, port)` (line ~965) to prefer the marketplace script — copy the candidate pattern from `resolveWorkerScriptPath()` (`src/shared/worker-utils.ts:206-215`), falling back to `__filename` when no marketplace copy exists (dev trees, CI).
|
||||
|
||||
### Documentation references
|
||||
- `/api/health` shape: `Server.ts:212-233` (`pid`, `version` fields confirmed).
|
||||
- `getPlatformTimeout`: used at `worker-service.ts:946`.
|
||||
- `__DEFAULT_PACKAGE_VERSION__` availability inside worker-service bundle: `scripts/build-hooks.js:312-313`.
|
||||
|
||||
### Verification checklist
|
||||
- New test `tests/services/worker-restart-verify.test.ts`: mock `global.fetch` (copy the fetchLog pattern from `tests/shared/worker-utils-version-recycle.test.ts:34-50`); assert `verifyRestartedWorker` returns success when health flips to `{pid: newPid, version: expected}`, failure on stale pid, failure on wrong version, failure on timeout.
|
||||
- Manual: `npm run build-and-sync` → output includes `Worker restart verified`; then `kill -9 <worker pid>` mid-restart-window and re-run to see a LOUD exit-1 path (or simulate by pointing the verify loop at a dead port in the test).
|
||||
- `npm test` → green, including the version-recycle contract.
|
||||
|
||||
### Anti-pattern guards
|
||||
- Do NOT poll `/api/version` (doesn't exist) or invent new health fields.
|
||||
- Do NOT compare against marketplace `package.json` on disk for `EXPECTED_VERSION` — the baked constant is the truth for "the code I am running"; disk reads reintroduce a second oracle.
|
||||
- The verify deadline must not block forever: hard cap, then exit 1.
|
||||
|
||||
---
|
||||
|
||||
## Phase 3: Self-replacing restart — old worker spawns its successor; recycle path stops spawning into corpses
|
||||
|
||||
**Why:** `/api/admin/restart` is kill-only; hooks that POST it then lazy-spawn race the dying worker (the ping-pong). If the OLD worker spawns its successor as its final act after the port closes, old and new never coexist and no third party spawns into a corpse.
|
||||
|
||||
### What to implement
|
||||
1. `src/services/worker-service.ts` `shutdown()` (lines ~671-699):
|
||||
- Re-entrancy guard: the `isShuttingDown` field (line ~188) is write-only today; make `shutdown()` check-and-set it at entry (`if (this.isShuttingDown) return; this.isShuttingDown = true;`).
|
||||
- Hard deadline: wrap `performGracefulShutdown(...)` in `Promise.race` with a `getPlatformTimeout(10000)` timer; on deadline, log `Graceful shutdown deadline exceeded — proceeding` and continue (do not hang on unbounded session drain — drain today can run 35-40s).
|
||||
- Successor spawn: when `reason === 'restart'`, after graceful shutdown completes/deadlines: resolve the marketplace script (same candidate pattern as Phase 2 step 3), `await waitForPortFree(port, 5000)`, then `spawnDaemon(marketplaceScript, port)`. If port never frees or spawn returns `undefined`, log loudly (`logger.error`) — the next hook's lazy-spawn is the safety net. Note `flushResponseThen` (flushResponseThen.ts:3-16) calls `process.exit(0)` after the action completes, so the spawn must be awaited inside the action.
|
||||
2. `src/shared/worker-utils.ts` `ensureWorkerRunning()` recycle path (lines ~305-330):
|
||||
- After POSTing `/api/admin/restart`, do NOT immediately lazy-spawn. Instead poll `GET /api/health` (500ms interval, `HOOK_READINESS_TIMEOUT_MS` budget) for the successor: healthy AND `version === pluginVersion` (already in hand from `checkVersionMatch`). Only fall through to the existing lazy-spawn if the successor never appears.
|
||||
- Amplifier fix: after `waitForWorkerReadiness` succeeds anywhere in this function, re-check the version once via `/api/health`; if still mismatched, log a warning (do NOT loop/recycle again in the same invocation — one recycle per hook event, the next hook retries; unbounded loops here re-create the storm).
|
||||
|
||||
### Documentation references
|
||||
- `onRestart` wiring: `worker-service.ts:255`; route: `Server.ts:282-294`; `flushResponseThen.ts:3-16`.
|
||||
- Drain timings + unbounded awaits: `GracefulShutdown.ts:30-75`.
|
||||
- `HOOK_READINESS_TIMEOUT_MS`: `worker-utils.ts:45-48`.
|
||||
- Signal-path guard to mirror: `src/supervisor/index.ts:56-60`.
|
||||
|
||||
### Verification checklist
|
||||
- `tests/shared/worker-utils-version-recycle.test.ts` still green (still POSTs restart on mismatch — the change is what happens AFTER the POST).
|
||||
- New test: shutdown re-entrancy — call `shutdown()` twice, assert `performGracefulShutdown` runs once (mock it via `mock.module`).
|
||||
- New test: recycle-no-corpse-spawn — fetch mock where `/api/admin/restart` 200s and `/api/health` returns the NEW version on poll N: assert NO spawn attempt; where health never recovers: assert lazy-spawn fallback fires.
|
||||
- Manual end-to-end: with a worker running, `curl -X POST http://127.0.0.1:$PORT/api/admin/restart`; within ~15s `/api/health` shows a NEW pid and the marketplace version, with no hook involvement. Check `~/.claude-mem/logs/claude-mem-$(date +%F).log` for exactly one shutdown and one daemon start (no duplicate-refusal lines).
|
||||
- `npm test` → green.
|
||||
|
||||
### Anti-pattern guards
|
||||
- The successor spawn happens ONLY on `reason === 'restart'` — `stop` must stay kill-only.
|
||||
- Never spawn before the port is confirmed free — that recreates EXIT PATH 2/3 duplicate suicides.
|
||||
- Do not add respawn logic to `flushResponseThen` itself or to the Windows-managed IPC branch (`process.send` path, Server.ts:284-289) — Windows wrapper already owns restart there.
|
||||
|
||||
---
|
||||
|
||||
## Phase 4: One spawn gate — shared ensureWorker() with a spawn lockfile; kill the resolver asymmetry
|
||||
|
||||
**Why:** Three spawn paths (hooks via `worker-utils`, MCP via `worker-spawner`, CLI via `spawnDaemon`) with two different bun resolvers and no mutual exclusion; logs show 3 launchers colliding within one second.
|
||||
|
||||
### What to implement
|
||||
1. New module `src/shared/worker-spawn-gate.ts`:
|
||||
- `acquireSpawnLock(): boolean` — `writeFileSync(join(DATA_DIR, 'spawn.lock'), JSON.stringify({pid: process.pid, startedAt: new Date().toISOString()}), {flag: 'wx'})` in try/catch. On `EEXIST`: `statSync` the lock; if `mtimeMs` older than 30_000ms, `unlinkSync` and retry ONCE; else return false.
|
||||
- `releaseSpawnLock(): void` — unlink, owner-checked (read it; only delete if `pid === process.pid`), errors swallowed.
|
||||
- Use `DATA_DIR` from `src/shared/paths.ts` — never `homedir()` directly.
|
||||
2. In `src/shared/worker-utils.ts` `ensureWorkerRunning()` (spawn section, lines ~332-351): wrap the spawn in the lock — if `acquireSpawnLock()` fails, skip the spawn and go straight to `waitForWorkerPort`/`waitForWorkerReadiness` (someone else is spawning; wait for their worker). Release in `finally`.
|
||||
3. Same wrap in `src/services/worker-spawner.ts` `ensureWorkerStarted()` around the `spawnDaemon` call (line ~132).
|
||||
4. Resolver unification: delete `resolveBunRuntime()` from `worker-utils.ts` (lines 217-235) and import/re-export `resolveWorkerRuntimePath` from `ProcessManager.ts` (already exported; it strictly supersedes — adds BUN_PATH, `~/.bun/bin`, brew, snap fallbacks). This closes the kill-then-can't-respawn path.
|
||||
5. `src/servers/mcp-server.ts` (lines ~42-51): compute `WORKER_SCRIPT_PATH` preferring the marketplace copy — copy the candidate pattern from `resolveWorkerScriptPath()` with fallback to the current own-dir resolution. This stops MCP servers in stale cache dirs from spawning stale workers.
|
||||
|
||||
### Documentation references
|
||||
- Write-discipline reference for the lock: `writeJsonFileAtomic`, `src/npx-cli/utils/paths.ts:124-205` (the `wx`-flag + cleanup-on-error shape; the lock is simpler — no rename needed, `wx` IS the atomicity).
|
||||
- Spawn call sites table: Phase 0 / spawn-path report §7.
|
||||
- `resolveWorkerRuntimePath` chain: `ProcessManager.ts:63-125`.
|
||||
|
||||
### Verification checklist
|
||||
- New test `tests/shared/worker-spawn-gate.test.ts` (temp dir via `CLAUDE_MEM_DATA_DIR` + dynamic import — see Phase 6 trap): second acquire fails while held; stale lock (backdate mtime via `utimesSync`) is broken and re-acquired; release is owner-only.
|
||||
- `grep -rn "resolveBunRuntime" src/` → no definition in worker-utils (only the ProcessManager import).
|
||||
- `grep -n "spawnHidden\|spawnDaemon" src/shared/worker-utils.ts src/services/worker-spawner.ts` → every spawn site is inside the lock.
|
||||
- Race test (manual): run 5 concurrent `node "$_P/scripts/bun-runner.js" .../worker-service.cjs start` invocations with no worker running; logs must show exactly ONE `Starting worker daemon` and zero `refusing to start duplicate` storms.
|
||||
- `npm test` → green (version-recycle contract intact).
|
||||
|
||||
### Anti-pattern guards
|
||||
- The lock gates SPAWNING only — never health checks; a held lock must never make a hook fail, only wait.
|
||||
- 30s staleness must use file mtime, not clock-in-content comparisons (test pollution wrote `startedAt: 2024-01-01` once already).
|
||||
- Do not introduce a lock library; `wx` flag is sufficient and dependency-free.
|
||||
|
||||
---
|
||||
|
||||
## Phase 5: Demote the PID file — port + /api/health become the only liveness oracle
|
||||
|
||||
**Why:** The dying worker's shutdown cascade deletes the NEW worker's PID file (`src/supervisor/shutdown.ts:88`), after which `status` reports a healthy worker as "not running" (status requires `portInUse && pidInfo`, `worker-service.ts:975-988`). `/api/health` already carries `pid`, `version`, `workerPath` — it subsumes the file.
|
||||
|
||||
### What to implement
|
||||
1. Owner-guarded deletion (the clobber fix):
|
||||
- `src/supervisor/shutdown.ts` (~line 88): before `rmSync(pidFilePath)`, read the file; delete ONLY if its `pid === process.pid`. A mismatch means a successor already wrote its own — log debug and leave it.
|
||||
- `src/services/worker-service.ts` `case 'restart'` (~line 964) and `case 'stop'` (~line 949): replace bare `removePidFile()` with the same owner-or-dead check: delete only if the recorded pid is the one we just shut down (captured in Phase 2's pre-shutdown health probe) or the recorded pid is not alive.
|
||||
2. `case 'status'` (lines ~975-988): source of truth becomes `GET /api/health` — report `pid`, `version`, `uptime`, `workerPath` from the response; fall back to "port in use but health unreachable (wedged?)" and "not running". Drop the `readPidFile()` requirement.
|
||||
3. `--daemon` duplicate gate (lines ~1167-1182): reorder — port/health probe FIRST (it's ground truth), PID file second (advisory only, for the no-port-bound-yet boot window). Keep `writePidFile`/`touchPidFile` as diagnostics — the worker itself remains the only writer.
|
||||
|
||||
### Documentation references
|
||||
- Clobber chain: `GracefulShutdown.ts:55` → `supervisor.stop()` → `runShutdownCascade` → `shutdown.ts:87-98`.
|
||||
- `verifyPidFileOwnership` + startToken: `ProcessManager.ts` (PID APIs §3 of spawn-path report).
|
||||
- Health shape: `Server.ts:212-233`.
|
||||
|
||||
### Verification checklist
|
||||
- Update `tests/infrastructure/process-manager.test.ts` expectations where deletion semantics changed (owner-guard) — coordinate with Phase 6 which relocates this file's data dir.
|
||||
- New test: old-worker-cleanup-spares-successor — write PID file as `{pid: 99999...}` (not own pid), run the shutdown cascade deletion step, assert file survives.
|
||||
- Manual: start worker, `rm ~/.claude-mem/worker.pid`, run `worker-service.cjs status` → must still print "Worker is running" with pid/version from health.
|
||||
- Full ping-pong scenario from Phase 3's manual check still converges.
|
||||
- `npm test` → green.
|
||||
|
||||
### Anti-pattern guards
|
||||
- Do NOT delete `writePidFile` entirely — external tooling may read it; it's demoted to diagnostics, not removed.
|
||||
- `status` must not require BOTH oracles ever again — health wins, full stop.
|
||||
- Do not let the boot-window PID check (advisory) exit 1 — duplicate suppression stays exit 0 (Phase 2 changed only the generic-failure path).
|
||||
|
||||
---
|
||||
|
||||
## Phase 6: Test hygiene — no test ever touches the real ~/.claude-mem again
|
||||
|
||||
**Why:** `tests/infrastructure/process-manager.test.ts` writes corrupt JSON and sentinel PIDs into the REAL `~/.claude-mem/worker.pid` (snapshot-restore shrinks but doesn't close the race window, and a killed test run leaves corruption behind). It also pollutes the shared log, which contaminated this very diagnosis.
|
||||
|
||||
### What to implement
|
||||
1. `tests/infrastructure/process-manager.test.ts`:
|
||||
- Replace the hardcoded `DATA_DIR = path.join(homedir(), '.claude-mem')` (lines 24-25) with: create `mkdtempSync(join(tmpdir(), 'claude-mem-pm-test-'))`, set `process.env.CLAUDE_MEM_DATA_DIR` to it.
|
||||
- **TRAP:** `DATA_DIR` in `src/shared/paths.ts:40` is a module-level const computed at import time, and ESM hoists static imports — setting the env var in `beforeEach` is too late. Copy the fresh-import pattern from `tests/shared/worker-utils-version-recycle.test.ts:30-32` (query-param cache-bust dynamic import) OR set the env var at the very top of the file before any `await import(...)` of ProcessManager modules (convert the static imports of code-under-test to dynamic).
|
||||
- Delete the snapshot-restore of the real PID file (lines 28-43) — unnecessary once isolated; `rmSync(tempDir, {recursive: true, force: true})` in `afterAll` (copy Pattern A: `tests/write-json-file-atomic.test.ts:34-40`).
|
||||
2. Sweep for other offenders: `grep -rn "homedir()" tests/ | grep -v node_modules` — any test resolving the real data dir gets the same treatment.
|
||||
3. Add a tripwire to `tests/preload.ts` (it already exists for the PostHog mock): if `CLAUDE_MEM_DATA_DIR` is unset, set it to a per-run `mkdtempSync` dir so NO test can fall through to `~/.claude-mem`. (Env restoration discipline: copy `tests/env-isolation.test.ts:31-90`.)
|
||||
|
||||
### Documentation references
|
||||
- Offending lines verbatim: process-manager.test.ts:24-25, 28-43, 109, 499-503 (test-hygiene report §1).
|
||||
- Copy-ready isolation patterns: Pattern A `write-json-file-atomic.test.ts:34-40`; Pattern C env restore `env-isolation.test.ts:31-90`.
|
||||
- Env override honored at `src/shared/paths.ts:19-20`.
|
||||
|
||||
### Verification checklist
|
||||
- `bun test tests/infrastructure/` green.
|
||||
- While the suite runs: `ls -la ~/.claude-mem/worker.pid; shasum ~/.claude-mem/worker.pid` before/after → byte-identical (or consistently absent).
|
||||
- `grep -rn "homedir()" tests/` → no hit resolves a data-dir path for writes.
|
||||
- Full `npm test` green.
|
||||
|
||||
### Anti-pattern guards
|
||||
- Do not mock `fs` to fake isolation — real temp dirs only, the tests exercise real file semantics.
|
||||
- Do not weaken assertions to dodge the trap; the fresh-import pattern is proven in-repo.
|
||||
|
||||
---
|
||||
|
||||
## Phase 7: Final Verification
|
||||
|
||||
1. **Full suite + typecheck:** `npx tsc --noEmit` and `npm test` — zero failures.
|
||||
2. **Anti-pattern greps (all must be empty):**
|
||||
- `grep -rn "INSTALLED_CACHE_PATH\|detectInstalledVersion" scripts/`
|
||||
- `grep -n "admin/restart" scripts/sync-marketplace.cjs`
|
||||
- `grep -n "37777" src/ scripts/ -r` (hardcoded port)
|
||||
- `grep -rn "resolveBunRuntime" src/shared/worker-utils.ts`
|
||||
- `grep -rn "homedir(), '.claude-mem'" tests/`
|
||||
3. **Triple-restart soak:** run `npm run build-and-sync` three times consecutively; every run must end `Worker restart verified`; `/api/health` pid changes each time, version stays the built version; `grep -c "refusing to start duplicate\|Failed to start server" ~/.claude-mem/logs/claude-mem-$(date +%F).log` shows no new occurrences during the soak.
|
||||
4. **Stale-launcher convergence test (the original bug):** manually spawn a worker from an OLD cache dir (`node ~/.claude/plugins/cache/thedotmack/claude-mem/<old>/scripts/bun-runner.js .../worker-service.cjs start`), then trigger any hook (or `ensureWorkerRunning` via a session). Expect in the log: exactly ONE `Worker version mismatch — recycling stale worker`, then the self-replacing restart, then a healthy marketplace-version worker — NO ping-pong (no second recycle within 5 minutes).
|
||||
5. **Observation check:** confirm claude-mem itself recorded observations during the soak (the worker was restarting underneath the recorder — the pipeline must survive its own surgery): query the 10 most recent rows in `~/.claude-mem/claude-mem.db` `observations` and confirm fresh timestamps.
|
||||
6. **Docs:** update `docs/public/troubleshooting.mdx` if it documents the old restart semantics; CHANGELOG is auto-generated — do not edit.
|
||||
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
@@ -8,13 +8,6 @@ const os = require('os');
|
||||
const INSTALLED_PATH = path.join(os.homedir(), '.claude', 'plugins', 'marketplaces', 'thedotmack');
|
||||
const CACHE_BASE_PATH = path.join(os.homedir(), '.claude', 'plugins', 'cache', 'thedotmack', 'claude-mem');
|
||||
|
||||
// Reject obviously invalid ports before they reach http.request, which would
|
||||
// throw with a confusing error like "RangeError: Port should be > 0 and < 65536".
|
||||
function parseWorkerPort(value) {
|
||||
const port = Number.parseInt(String(value ?? ''), 10);
|
||||
return Number.isInteger(port) && port >= 1 && port <= 65535 ? port : null;
|
||||
}
|
||||
|
||||
function getCurrentBranch() {
|
||||
try {
|
||||
if (!existsSync(path.join(INSTALLED_PATH, '.git'))) {
|
||||
@@ -76,60 +69,6 @@ function getPluginVersion() {
|
||||
}
|
||||
}
|
||||
|
||||
function detectInstalledVersion(buildVersion) {
|
||||
const dataDir = process.env.CLAUDE_MEM_DATA_DIR || path.join(os.homedir(), '.claude-mem');
|
||||
const settingsPath = path.join(dataDir, 'settings.json');
|
||||
let port = parseWorkerPort(process.env.CLAUDE_MEM_WORKER_PORT);
|
||||
if (!port && existsSync(settingsPath)) {
|
||||
try {
|
||||
const s = JSON.parse(readFileSync(settingsPath, 'utf8'));
|
||||
const settingsPort = parseWorkerPort(s.CLAUDE_MEM_WORKER_PORT);
|
||||
if (settingsPort) port = settingsPort;
|
||||
} catch {}
|
||||
}
|
||||
if (!port) {
|
||||
const uid = typeof process.getuid === 'function' ? process.getuid() : 77;
|
||||
port = 37700 + (uid % 100);
|
||||
}
|
||||
let healthBody;
|
||||
try {
|
||||
healthBody = execSync(`curl -s --max-time 2 http://127.0.0.1:${port}/api/health`, {
|
||||
stdio: ['ignore', 'pipe', 'ignore'],
|
||||
}).toString().trim();
|
||||
} catch {
|
||||
return null;
|
||||
}
|
||||
if (!healthBody) return null;
|
||||
let installedVersion;
|
||||
let installedPath;
|
||||
try {
|
||||
const j = JSON.parse(healthBody);
|
||||
installedVersion = j.version;
|
||||
installedPath = j.workerPath;
|
||||
} catch {
|
||||
return null;
|
||||
}
|
||||
if (!installedVersion || installedVersion === buildVersion) return null;
|
||||
return { installedVersion, installedPath };
|
||||
}
|
||||
|
||||
const installedMismatch = detectInstalledVersion(getPluginVersion());
|
||||
if (installedMismatch) {
|
||||
console.log('');
|
||||
console.log('\x1b[33m%s\x1b[0m', 'Version mismatch detected:');
|
||||
console.log(` Building: ${getPluginVersion()}`);
|
||||
console.log(` Installed: ${installedMismatch.installedVersion}`);
|
||||
if (installedMismatch.installedPath) console.log(` Worker path: ${installedMismatch.installedPath}`);
|
||||
console.log('');
|
||||
console.log('Claude Code is pinned to the installed version, so the worker loads from');
|
||||
console.log(`its cache dir. Mirroring this build into the installed-version cache so the`);
|
||||
console.log('worker restart picks up new code without a Claude Code session restart.');
|
||||
console.log('');
|
||||
console.log('\x1b[36m%s\x1b[0m', `For a formal version bump, run \`claude plugin update thedotmack/claude-mem\``);
|
||||
console.log('\x1b[36m%s\x1b[0m', `and restart Claude Code so it loads the ${getPluginVersion()} cache dir.`);
|
||||
console.log('');
|
||||
}
|
||||
|
||||
console.log('Syncing to marketplace...');
|
||||
try {
|
||||
const rootDir = path.join(__dirname, '..');
|
||||
@@ -161,60 +100,8 @@ try {
|
||||
console.log(`Running bun install in cache folder (version ${version})...`);
|
||||
execSync(`bun install`, { cwd: CACHE_VERSION_PATH, stdio: 'inherit' });
|
||||
|
||||
if (installedMismatch && installedMismatch.installedVersion !== version) {
|
||||
const INSTALLED_CACHE_PATH = path.join(CACHE_BASE_PATH, installedMismatch.installedVersion);
|
||||
console.log(`Mirroring to installed-version cache (${installedMismatch.installedVersion}) for hot reload...`);
|
||||
execSync(
|
||||
`rsync -av --delete --exclude=.git ${pluginGitignoreExcludes} plugin/ "${INSTALLED_CACHE_PATH}/"`,
|
||||
{ stdio: 'inherit' }
|
||||
);
|
||||
console.log(`Running bun install in installed-version cache (${installedMismatch.installedVersion})...`);
|
||||
execSync(`bun install`, { cwd: INSTALLED_CACHE_PATH, stdio: 'inherit' });
|
||||
}
|
||||
|
||||
console.log('\x1b[32m%s\x1b[0m', 'Sync complete!');
|
||||
|
||||
console.log('\n🔄 Triggering worker restart...');
|
||||
const http = require('http');
|
||||
const dataDir = process.env.CLAUDE_MEM_DATA_DIR || path.join(os.homedir(), '.claude-mem');
|
||||
const settingsPath = path.join(dataDir, 'settings.json');
|
||||
let settingsPort = null;
|
||||
if (existsSync(settingsPath)) {
|
||||
try {
|
||||
const settings = JSON.parse(readFileSync(settingsPath, 'utf8'));
|
||||
settingsPort = parseWorkerPort(settings.CLAUDE_MEM_WORKER_PORT);
|
||||
} catch {
|
||||
// fall through to env / default
|
||||
}
|
||||
}
|
||||
const uid = typeof process.getuid === 'function' ? process.getuid() : 77;
|
||||
const defaultPort = 37700 + (uid % 100);
|
||||
const workerPort =
|
||||
parseWorkerPort(process.env.CLAUDE_MEM_WORKER_PORT) ??
|
||||
settingsPort ??
|
||||
defaultPort;
|
||||
const req = http.request({
|
||||
hostname: '127.0.0.1',
|
||||
port: workerPort,
|
||||
path: '/api/admin/restart',
|
||||
method: 'POST',
|
||||
timeout: 2000
|
||||
}, (res) => {
|
||||
if (res.statusCode === 200) {
|
||||
console.log('\x1b[32m%s\x1b[0m', `✓ Worker restart triggered on port ${workerPort}`);
|
||||
} else {
|
||||
console.log('\x1b[33m%s\x1b[0m', `ℹ Worker restart on port ${workerPort} returned status ${res.statusCode}`);
|
||||
}
|
||||
});
|
||||
req.on('error', () => {
|
||||
console.log('\x1b[33m%s\x1b[0m', `ℹ No worker reachable on port ${workerPort}; the next worker:restart step will start one.`);
|
||||
});
|
||||
req.on('timeout', () => {
|
||||
req.destroy();
|
||||
console.log('\x1b[33m%s\x1b[0m', `ℹ Worker restart on port ${workerPort} timed out`);
|
||||
});
|
||||
req.end();
|
||||
|
||||
} catch (error) {
|
||||
console.error('\x1b[31m%s\x1b[0m', 'Sync failed:', error.message);
|
||||
process.exit(1);
|
||||
|
||||
@@ -14,7 +14,7 @@ import {
|
||||
CallToolRequestSchema,
|
||||
ListToolsRequestSchema,
|
||||
} from '@modelcontextprotocol/sdk/types.js';
|
||||
import { getWorkerPort, workerHttpRequest } from '../shared/worker-utils.js';
|
||||
import { getWorkerPort, workerHttpRequest, resolveWorkerScriptPath } from '../shared/worker-utils.js';
|
||||
import { ensureWorkerStarted } from '../services/worker-spawner.js';
|
||||
import { searchCodebase, formatSearchResults } from '../services/smart-file-read/search.js';
|
||||
import { parseFile, formatFoldedView, unfoldSymbol, findProjectRoot } from '../services/smart-file-read/parser.js';
|
||||
@@ -49,7 +49,12 @@ const mcpServerDir = (() => {
|
||||
return process.cwd();
|
||||
}
|
||||
})();
|
||||
const WORKER_SCRIPT_PATH = resolve(mcpServerDir, 'worker-service.cjs');
|
||||
// Prefer the canonical marketplace copy of the worker bundle (same
|
||||
// marketplace-first candidates as the hook launcher) over this server's own
|
||||
// directory: an MCP server still running out of a stale plugin cache dir
|
||||
// would otherwise auto-spawn a stale worker. The own-dir resolution stays as
|
||||
// the fallback for installs without a marketplace copy.
|
||||
const WORKER_SCRIPT_PATH = resolveWorkerScriptPath() ?? resolve(mcpServerDir, 'worker-service.cjs');
|
||||
|
||||
function errorIfWorkerScriptMissing(): void {
|
||||
if (!mcpServerDirResolutionFailed) return;
|
||||
|
||||
@@ -167,6 +167,51 @@ export function removePidFile(): void {
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Owner-or-dead guarded PID-file removal (Phase 5, worker-restart plan).
|
||||
*
|
||||
* Deletes the PID file only when the recorded pid is `expectedOwnerPid` (the
|
||||
* worker the caller just shut down, or the caller itself) OR is no longer
|
||||
* alive. A live, different pid means a restart successor has already written
|
||||
* its own file — blind deletion here is exactly the clobber that made
|
||||
* `status` report a healthy worker as not running.
|
||||
*
|
||||
* Malformed files split two ways. Unparseable JSON cannot prove ownership and
|
||||
* is left in place (the safe default): readPidFile() parses it to null so it
|
||||
* never gates a start, and the next worker boot overwrites or cleans it
|
||||
* (validateWorkerPidFile). Parseable JSON with a missing/invalid `pid` field
|
||||
* (e.g. `{"port":37777}`) is treated as a DEAD owner and deleted:
|
||||
* recorded.pid is undefined, so isProcessAlive() returns false and the
|
||||
* owner-or-dead guard falls through to removal. (The supervisor-side
|
||||
* removeOwnedPidFile spares pid-less files instead — that divergence is
|
||||
* intentional: this helper may delete dead leftovers, the shutdown cascade
|
||||
* only ever deletes its own file.)
|
||||
*/
|
||||
export function removePidFileIfOwner(expectedOwnerPid: number | null): void {
|
||||
if (!existsSync(PID_FILE)) return;
|
||||
|
||||
const recorded = readPidFile();
|
||||
if (recorded === null) {
|
||||
logger.debug('SYSTEM', 'PID file unreadable — leaving it (cannot prove ownership)', {
|
||||
path: PID_FILE,
|
||||
expectedOwnerPid
|
||||
});
|
||||
return;
|
||||
}
|
||||
|
||||
const ownedByStoppedWorker = expectedOwnerPid !== null && recorded.pid === expectedOwnerPid;
|
||||
if (!ownedByStoppedWorker && isProcessAlive(recorded.pid)) {
|
||||
logger.debug('SYSTEM', 'PID file belongs to a live, different worker (restart successor?) — leaving it', {
|
||||
path: PID_FILE,
|
||||
recordedPid: recorded.pid,
|
||||
expectedOwnerPid
|
||||
});
|
||||
return;
|
||||
}
|
||||
|
||||
removePidFile();
|
||||
}
|
||||
|
||||
export function getPlatformTimeout(baseMs: number): number {
|
||||
const WINDOWS_MULTIPLIER = 2.0;
|
||||
return process.platform === 'win32' ? Math.round(baseMs * WINDOWS_MULTIPLIER) : baseMs;
|
||||
|
||||
113
src/services/restart-verify.ts
Normal file
113
src/services/restart-verify.ts
Normal file
@@ -0,0 +1,113 @@
|
||||
/**
|
||||
* Restart verification helpers for the CLI `restart` command (worker-service.ts).
|
||||
*
|
||||
* This lives in its own module rather than inside worker-service.ts so tests
|
||||
* can import it directly: worker-service.ts drags in a very large dependency
|
||||
* graph (bun:sqlite, MCP SDK, telemetry, supervisor) and ends with an
|
||||
* isMainModule bootstrap, which makes it unsafe to import from `bun test`.
|
||||
*
|
||||
* `restart` must prove the NEW worker is up (different pid than the old
|
||||
* worker, and self-reporting the same baked version as the CLI process that
|
||||
* initiated the restart) or exit non-zero — a restart that silently leaves
|
||||
* the old worker serving is worse than a failed one
|
||||
* (plans/2026-06-10-worker-restart-single-source-of-truth.md).
|
||||
* Verification reads only the `pid` and `version` fields of GET /api/health
|
||||
* (src/services/server/Server.ts), which the worker reports from its own
|
||||
* baked __DEFAULT_PACKAGE_VERSION__ constant.
|
||||
*/
|
||||
|
||||
import { getWorkerHost, fetchWithTimeout } from '../shared/worker-utils.js';
|
||||
|
||||
interface HealthSnapshot {
|
||||
pid?: unknown;
|
||||
version?: unknown;
|
||||
}
|
||||
|
||||
export interface RestartVerifyOptions {
|
||||
/** Delay between health polls (ms). Default 500. */
|
||||
pollIntervalMs?: number;
|
||||
/** Per-request timeout for each health poll (ms). Default 2000. */
|
||||
requestTimeoutMs?: number;
|
||||
}
|
||||
|
||||
export type RestartVerifyResult =
|
||||
| { ok: true; pid: number; version: string }
|
||||
| {
|
||||
ok: false;
|
||||
lastObserved: string;
|
||||
/**
|
||||
* True when the most recent poll received a health payload — i.e. a
|
||||
* live (but unverifiable) worker is serving on the port. Callers use
|
||||
* this to skip waiting for the port to free: it will not free while
|
||||
* that worker lives.
|
||||
*/
|
||||
lastPollSawHealth: boolean;
|
||||
};
|
||||
|
||||
async function fetchHealthSnapshot(port: number, timeoutMs: number): Promise<HealthSnapshot> {
|
||||
const url = `http://${getWorkerHost()}:${port}/api/health`;
|
||||
const response = await fetchWithTimeout(url, {}, timeoutMs);
|
||||
// /api/health answers 503 when the queue is degraded but still includes
|
||||
// pid/version — a degraded-but-booted worker still proves the restart.
|
||||
return await response.json() as HealthSnapshot;
|
||||
}
|
||||
|
||||
/**
|
||||
* Capture the pid of the currently-running worker before shutting it down.
|
||||
* Returns null when no worker is reachable (nothing listening, timeout, or a
|
||||
* malformed health payload) — verification then only requires that a worker
|
||||
* with the expected version appears.
|
||||
*/
|
||||
export async function getCurrentWorkerPid(port: number, timeoutMs: number = 2000): Promise<number | null> {
|
||||
try {
|
||||
const health = await fetchHealthSnapshot(port, timeoutMs);
|
||||
return typeof health.pid === 'number' ? health.pid : null;
|
||||
} catch {
|
||||
return null;
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Poll GET /api/health until the worker reports a pid different from
|
||||
* `oldPid` AND a version equal to `expectedVersion` (the caller's own baked
|
||||
* __DEFAULT_PACKAGE_VERSION__ constant — never package.json read from disk).
|
||||
*
|
||||
* Hard-capped by `deadlineMs`; on expiry returns `{ ok: false }` carrying the
|
||||
* last observed health payload (or connection error) so the caller can report
|
||||
* it and exit 1.
|
||||
*/
|
||||
export async function verifyRestartedWorker(
|
||||
port: number,
|
||||
oldPid: number | null,
|
||||
expectedVersion: string,
|
||||
deadlineMs: number,
|
||||
options: RestartVerifyOptions = {}
|
||||
): Promise<RestartVerifyResult> {
|
||||
const pollIntervalMs = options.pollIntervalMs ?? 500;
|
||||
const requestTimeoutMs = options.requestTimeoutMs ?? 2000;
|
||||
const deadline = Date.now() + deadlineMs;
|
||||
let lastObserved = 'no health response observed before deadline';
|
||||
let lastPollSawHealth = false;
|
||||
|
||||
while (Date.now() < deadline) {
|
||||
try {
|
||||
const health = await fetchHealthSnapshot(port, requestTimeoutMs);
|
||||
lastObserved = `last health payload: ${JSON.stringify({ pid: health.pid, version: health.version })}`;
|
||||
lastPollSawHealth = true;
|
||||
if (
|
||||
typeof health.pid === 'number' &&
|
||||
health.pid !== oldPid &&
|
||||
typeof health.version === 'string' &&
|
||||
health.version === expectedVersion
|
||||
) {
|
||||
return { ok: true, pid: health.pid, version: health.version };
|
||||
}
|
||||
} catch (error) {
|
||||
lastObserved = `connection error: ${error instanceof Error ? error.message : String(error)}`;
|
||||
lastPollSawHealth = false;
|
||||
}
|
||||
await new Promise(resolve => setTimeout(resolve, pollIntervalMs));
|
||||
}
|
||||
|
||||
return { ok: false, lastObserved, lastPollSawHealth };
|
||||
}
|
||||
@@ -5,7 +5,9 @@ import { spawn } from 'child_process';
|
||||
import { Database } from 'bun:sqlite';
|
||||
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
|
||||
import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';
|
||||
import { getWorkerPort, getWorkerHost } from '../shared/worker-utils.js';
|
||||
import { getWorkerPort, getWorkerHost, fetchWithTimeout, resolveWorkerScriptPath } from '../shared/worker-utils.js';
|
||||
import { getCurrentWorkerPid, verifyRestartedWorker } from './restart-verify.js';
|
||||
import { runShutdownSequence, type WorkerShutdownReason } from './worker-shutdown.js';
|
||||
import { DATA_DIR, DB_PATH, ensureDir } from '../shared/paths.js';
|
||||
import { HOOK_TIMEOUTS } from '../shared/hook-constants.js';
|
||||
import { getUptimeSeconds } from '../shared/uptime.js';
|
||||
@@ -18,6 +20,7 @@ import { configureSupervisorSignalHandlers, getSupervisor, startSupervisor } fro
|
||||
import { sanitizeEnv } from '../supervisor/env-sanitizer.js';
|
||||
|
||||
import { ensureWorkerStarted as ensureWorkerStartedShared, type WorkerStartResult } from './worker-spawner.js';
|
||||
import { acquireSpawnLock, releaseSpawnLock } from '../shared/worker-spawn-gate.js';
|
||||
import { captureEvent, shutdownTelemetry } from './telemetry/telemetry.js';
|
||||
import { collectInstallStats } from './telemetry/install-stats.js';
|
||||
|
||||
@@ -30,7 +33,7 @@ const packageVersion = typeof __DEFAULT_PACKAGE_VERSION__ !== 'undefined' ? __DE
|
||||
import {
|
||||
writePidFile,
|
||||
readPidFile,
|
||||
removePidFile,
|
||||
removePidFileIfOwner,
|
||||
getPlatformTimeout,
|
||||
runOneTimeChromaMigration,
|
||||
runOneTimeCwdRemap,
|
||||
@@ -118,13 +121,10 @@ export function buildStatusOutput(status: 'ready' | 'error', message?: string):
|
||||
};
|
||||
}
|
||||
|
||||
/**
|
||||
* Closed enum for worker_stopped telemetry. Must stay in sync with the
|
||||
* shutdown_reason whitelist documentation (scrub.ts / telemetry.mdx):
|
||||
* stop = /api/admin/shutdown (CLI `stop`), restart = /api/admin/restart or
|
||||
* CLI `restart` (tagged ?reason=restart), signal = SIGTERM/SIGINT handler.
|
||||
*/
|
||||
export type WorkerShutdownReason = 'stop' | 'restart' | 'signal';
|
||||
// Closed enum for worker_stopped telemetry — definition (and its
|
||||
// scrub.ts/telemetry.mdx sync requirements) moved to worker-shutdown.ts, the
|
||||
// import-safe shutdown seam. Re-exported here for existing importers.
|
||||
export type { WorkerShutdownReason } from './worker-shutdown.js';
|
||||
|
||||
// Clean-shutdown sentinel — same marker-file pattern as the one-time markers
|
||||
// in ProcessManager.ts (.chroma-cleaned-v10.3). Written in the graceful
|
||||
@@ -281,8 +281,11 @@ export class WorkerService implements WorkerRef {
|
||||
}
|
||||
|
||||
private registerSignalHandlers(): void {
|
||||
// Do NOT pre-set isShuttingDown here: the flag is now the re-entrancy
|
||||
// guard INSIDE shutdown() (runShutdownSequence), and pre-setting it would
|
||||
// turn the signal-path shutdown into a no-op. The supervisor has its own
|
||||
// signal re-entrancy guard (shutdownInitiated in src/supervisor/index.ts).
|
||||
configureSupervisorSignalHandlers(async () => {
|
||||
this.isShuttingDown = true;
|
||||
await this.shutdown('signal');
|
||||
});
|
||||
}
|
||||
@@ -669,32 +672,59 @@ export class WorkerService implements WorkerRef {
|
||||
}
|
||||
|
||||
async shutdown(reason: WorkerShutdownReason = 'stop'): Promise<void> {
|
||||
if (this.transcriptWatcher) {
|
||||
this.transcriptWatcher.stop();
|
||||
this.transcriptWatcher = null;
|
||||
logger.info('TRANSCRIPT', 'Transcript watcher stopped');
|
||||
}
|
||||
// Full sequence (re-entrancy guard, graceful-shutdown deadline, restart
|
||||
// successor handoff) lives in worker-shutdown.ts so tests can exercise it
|
||||
// without importing this module's bootstrap. When reason === 'restart'
|
||||
// this runs inside flushResponseThen's flushed action, so the successor
|
||||
// spawn completes before that helper's process.exit(0).
|
||||
await runShutdownSequence({
|
||||
reason,
|
||||
isShuttingDown: () => this.isShuttingDown,
|
||||
markShuttingDown: () => { this.isShuttingDown = true; },
|
||||
beforeGracefulShutdown: async () => {
|
||||
if (this.transcriptWatcher) {
|
||||
this.transcriptWatcher.stop();
|
||||
this.transcriptWatcher = null;
|
||||
logger.info('TRANSCRIPT', 'Transcript watcher stopped');
|
||||
}
|
||||
|
||||
if (this.telemetryHeartbeat) {
|
||||
clearInterval(this.telemetryHeartbeat);
|
||||
this.telemetryHeartbeat = null;
|
||||
}
|
||||
// Mark this stop as graceful for the next start's crash detection, and
|
||||
// capture worker_stopped BEFORE shutdownTelemetry() — isShutdown drops
|
||||
// any event captured after the flush, by design.
|
||||
writeCleanShutdownSentinel();
|
||||
captureEvent('worker_stopped', {
|
||||
uptime_seconds: getUptimeSeconds(this.startTime),
|
||||
shutdown_reason: reason,
|
||||
});
|
||||
await shutdownTelemetry();
|
||||
|
||||
await performGracefulShutdown({
|
||||
server: this.server.getHttpServer(),
|
||||
sessionManager: this.sessionManager,
|
||||
mcpClient: this.mcpClient,
|
||||
dbManager: this.dbManager,
|
||||
chromaMcpManager: this.chromaMcpManager || undefined
|
||||
if (this.telemetryHeartbeat) {
|
||||
clearInterval(this.telemetryHeartbeat);
|
||||
this.telemetryHeartbeat = null;
|
||||
}
|
||||
// Mark this stop as graceful for the next start's crash detection, and
|
||||
// capture worker_stopped BEFORE shutdownTelemetry() — isShutdown drops
|
||||
// any event captured after the flush, by design.
|
||||
writeCleanShutdownSentinel();
|
||||
captureEvent('worker_stopped', {
|
||||
uptime_seconds: getUptimeSeconds(this.startTime),
|
||||
shutdown_reason: reason,
|
||||
});
|
||||
await shutdownTelemetry();
|
||||
},
|
||||
performGracefulShutdown: () => performGracefulShutdown({
|
||||
server: this.server.getHttpServer(),
|
||||
sessionManager: this.sessionManager,
|
||||
mcpClient: this.mcpClient,
|
||||
dbManager: this.dbManager,
|
||||
chromaMcpManager: this.chromaMcpManager || undefined
|
||||
}),
|
||||
gracefulDeadlineMs: getPlatformTimeout(10000),
|
||||
restartHandoff: {
|
||||
port: getWorkerPort(),
|
||||
portFreeTimeoutMs: getPlatformTimeout(5000),
|
||||
// Prefer the marketplace-installed script so the successor boots the
|
||||
// freshly-synced plugin, falling back to this script for dev trees /
|
||||
// CI where no marketplace copy exists.
|
||||
resolveSuccessorScript: () => resolveWorkerScriptPath() ?? __filename,
|
||||
waitForPortFree,
|
||||
// Owner-or-dead guarded (Phase 5): the dying worker may delete the
|
||||
// PID file it owns (its own pid) or a dead pid's leftover — never a
|
||||
// live successor's. Guarding at this injection site keeps the
|
||||
// runShutdownSequence seam (`removePidFile: () => void`) unchanged.
|
||||
removePidFile: () => removePidFileIfOwner(process.pid),
|
||||
spawnDaemon,
|
||||
},
|
||||
});
|
||||
}
|
||||
|
||||
@@ -942,12 +972,16 @@ async function main() {
|
||||
}
|
||||
|
||||
case 'stop': {
|
||||
// Capture the dying worker's pid BEFORE shutdown so the PID-file
|
||||
// cleanup below can prove it deletes THAT worker's file (or a dead
|
||||
// pid's leftover) — never a live successor's (Phase 5).
|
||||
const stoppedPid = await getCurrentWorkerPid(port, 2000);
|
||||
await httpShutdown(port);
|
||||
const freed = await waitForPortFree(port, getPlatformTimeout(15000));
|
||||
if (!freed) {
|
||||
logger.warn('SYSTEM', 'Port did not free up after shutdown', { port });
|
||||
}
|
||||
removePidFile();
|
||||
removePidFileIfOwner(stoppedPid);
|
||||
logger.info('SYSTEM', 'Worker stopped successfully');
|
||||
process.exit(0);
|
||||
break;
|
||||
@@ -955,35 +989,142 @@ async function main() {
|
||||
|
||||
case 'restart': {
|
||||
logger.info('SYSTEM', 'Restarting worker');
|
||||
await httpShutdown(port, 'restart');
|
||||
const restartFreed = await waitForPortFree(port, 5000);
|
||||
if (!restartFreed) {
|
||||
console.error('Port still bound after shutdown. Resolve manually.');
|
||||
// Capture the old worker's pid BEFORE shutdown so we can later prove
|
||||
// the worker answering health checks is a NEW process, not the corpse.
|
||||
const oldPid = await getCurrentWorkerPid(port, 2000);
|
||||
// Track whether the worker accepted the shutdown POST: an accepted
|
||||
// reason=restart shutdown means the dying worker spawns its OWN
|
||||
// successor the moment its port frees (worker-shutdown.ts handoff).
|
||||
// That handoff is the PRIMARY restart path; this CLI defers to it.
|
||||
const shutdownAccepted = await httpShutdown(port, 'restart');
|
||||
|
||||
let handoffDetail = '';
|
||||
let handoffSawLiveWorker = false;
|
||||
if (oldPid !== null && shutdownAccepted) {
|
||||
// PRIMARY: the dying worker self-replaces. Do NOT waitForPortFree and
|
||||
// do NOT spawn — the successor re-binds the port within ~200ms of it
|
||||
// freeing, so a port-free wait here loses the race against the very
|
||||
// handoff this CLI just triggered (and a CLI spawn would be a second
|
||||
// restart initiator — the disease this flow cures). Just verify the
|
||||
// successor.
|
||||
const handoff = await verifyRestartedWorker(port, oldPid, packageVersion, getPlatformTimeout(30000));
|
||||
if (handoff.ok) {
|
||||
console.log(`Worker restart verified (pid: ${handoff.pid}, version: ${handoff.version})`);
|
||||
logger.info('SYSTEM', 'Worker restart verified', { pid: handoff.pid, version: handoff.version });
|
||||
process.exit(0);
|
||||
}
|
||||
handoffDetail = `; handoff attempt: ${handoff.lastObserved}`;
|
||||
handoffSawLiveWorker = handoff.lastPollSawHealth;
|
||||
logger.warn('SYSTEM', 'Self-replacing worker handoff did not verify in time — falling back to CLI spawn', {
|
||||
oldPid,
|
||||
lastObserved: handoff.lastObserved,
|
||||
});
|
||||
}
|
||||
|
||||
// FALLBACK — reached when no worker was running, the shutdown POST was
|
||||
// not accepted (e.g. the old worker predates the self-replacement
|
||||
// handoff), or the handoff never produced a verified successor (its
|
||||
// spawn failed). Only here may the CLI spawn, and only through the
|
||||
// spawn gate so it can never race a hook's lazy-spawn.
|
||||
//
|
||||
// When the handoff verification's most recent poll already saw a live
|
||||
// health responder, a worker (just not a verifiable successor) holds
|
||||
// the port — waiting for it to free would burn the full timeout for
|
||||
// nothing, so skip straight to verifying the current owner.
|
||||
const restartFreed = handoffSawLiveWorker
|
||||
? false
|
||||
: await waitForPortFree(port, getPlatformTimeout(15000));
|
||||
// Prefer the marketplace-installed script so restart boots the
|
||||
// freshly-synced plugin, falling back to this script for dev trees /
|
||||
// CI where no marketplace copy exists.
|
||||
const restartScript = resolveWorkerScriptPath() ?? __filename;
|
||||
let spawnedScript = 'none (port still bound — nothing spawned)';
|
||||
let spawnLockHeld = false;
|
||||
if (restartFreed) {
|
||||
// Owner-or-dead guarded (Phase 5): delete only the old worker's PID
|
||||
// file (oldPid) or a dead pid's leftover. If a successor we failed to
|
||||
// observe already wrote its own file, it must survive this cleanup.
|
||||
removePidFileIfOwner(oldPid);
|
||||
// Spawn gate (src/shared/worker-spawn-gate.ts): if another launcher
|
||||
// (a hook or the MCP server) is already mid-spawn, skip our own spawn
|
||||
// and just verify its worker below.
|
||||
spawnLockHeld = acquireSpawnLock();
|
||||
} else {
|
||||
// The port never freed: either the old worker refuses to die (the
|
||||
// verification below fails and reports its health payload) or a
|
||||
// successor we failed to observe in time already owns the port (the
|
||||
// verification below passes). Spawning a competitor here is never
|
||||
// useful — it could not bind the port anyway.
|
||||
logger.warn('SYSTEM', 'Port still bound entering restart fallback — verifying current port owner instead of spawning', { port, portWaitSkipped: handoffSawLiveWorker });
|
||||
}
|
||||
try {
|
||||
if (spawnLockHeld) {
|
||||
const restartPid = spawnDaemon(restartScript, port);
|
||||
if (restartPid === undefined) {
|
||||
console.error('Failed to spawn worker daemon during restart.');
|
||||
// Manual release: process.exit() does not unwind to finally.
|
||||
releaseSpawnLock();
|
||||
process.exit(1);
|
||||
}
|
||||
spawnedScript = restartScript;
|
||||
logger.info('SYSTEM', 'Worker restart spawned (CLI fallback)', { pid: restartPid, script: restartScript });
|
||||
// Hold the lock until the spawned worker owns the port (the spawn
|
||||
// isn't "done" until then — same rule as the other gated
|
||||
// launchers); the longer verification below runs unlocked.
|
||||
await waitForHealth(port, getPlatformTimeout(15000));
|
||||
} else if (restartFreed) {
|
||||
spawnedScript = 'none (another launcher holds the spawn lock)';
|
||||
logger.info('SYSTEM', 'Another launcher holds the spawn lock — skipping CLI restart spawn and verifying its worker');
|
||||
}
|
||||
} finally {
|
||||
if (spawnLockHeld) releaseSpawnLock();
|
||||
}
|
||||
// Restart must prove itself: the new worker has to answer /api/health
|
||||
// with a different pid than the old worker and this CLI's own baked
|
||||
// version, before the hard deadline — otherwise exit 1.
|
||||
const verification = await verifyRestartedWorker(port, oldPid, packageVersion, getPlatformTimeout(30000));
|
||||
if (!verification.ok) {
|
||||
console.error(`Worker restart verification failed (old pid: ${oldPid ?? 'none'}, expected version: ${packageVersion}, spawned script: ${spawnedScript}); ${verification.lastObserved}${handoffDetail}`);
|
||||
process.exit(1);
|
||||
}
|
||||
removePidFile();
|
||||
const restartPid = spawnDaemon(__filename, port);
|
||||
if (restartPid === undefined) {
|
||||
console.error('Failed to spawn worker daemon during restart.');
|
||||
process.exit(1);
|
||||
}
|
||||
logger.info('SYSTEM', 'Worker restart spawned', { pid: restartPid });
|
||||
console.log(`Worker restart verified (pid: ${verification.pid}, version: ${verification.version})`);
|
||||
logger.info('SYSTEM', 'Worker restart verified', { pid: verification.pid, version: verification.version });
|
||||
process.exit(0);
|
||||
break;
|
||||
}
|
||||
|
||||
case 'status': {
|
||||
const portInUse = await isPortInUse(port);
|
||||
const pidInfo = readPidFile();
|
||||
if (portInUse && pidInfo) {
|
||||
// Source of truth is GET /api/health: the worker self-reports pid,
|
||||
// version, uptime and script path (Phase 5, worker-restart plan). The
|
||||
// PID file is diagnostics only — it must never make `status` lie in
|
||||
// either direction (a clobbered file reporting a healthy worker as
|
||||
// down, or a stale file reporting a dead worker as up). Exit code is 0
|
||||
// on every branch, matching the historical behavior.
|
||||
const health = await fetchWorkerHealth(port, getPlatformTimeout(3000));
|
||||
if (health && typeof health.pid === 'number') {
|
||||
console.log('Worker is running');
|
||||
console.log(` PID: ${pidInfo.pid}`);
|
||||
console.log(` Port: ${pidInfo.port}`);
|
||||
console.log(` Started: ${pidInfo.startedAt}`);
|
||||
await printQueueStatusIfBullMq(port);
|
||||
} else {
|
||||
console.log('Worker is not running');
|
||||
console.log(` PID: ${health.pid}`);
|
||||
console.log(` Port: ${port}`);
|
||||
if (typeof health.version === 'string') {
|
||||
console.log(` Version: ${health.version}`);
|
||||
}
|
||||
if (typeof health.uptime === 'number') {
|
||||
console.log(` Uptime: ${health.uptime}s`);
|
||||
}
|
||||
if (typeof health.workerPath === 'string') {
|
||||
console.log(` Worker path: ${health.workerPath}`);
|
||||
}
|
||||
printQueueStatusIfBullMq(health);
|
||||
process.exit(0);
|
||||
}
|
||||
if (await isPortInUse(port)) {
|
||||
// Something owns the port but cannot answer /api/health — a wedged
|
||||
// worker mid-boot/mid-death, or a foreign process. Say so instead of
|
||||
// guessing in either direction.
|
||||
console.log(`Worker port ${port} is in use but health is unreachable (worker may be wedged or still booting)`);
|
||||
process.exit(0);
|
||||
}
|
||||
console.log('Worker is not running');
|
||||
process.exit(0);
|
||||
break;
|
||||
}
|
||||
@@ -1166,6 +1307,21 @@ async function main() {
|
||||
|
||||
case '--daemon':
|
||||
default: {
|
||||
// Duplicate gate, ground truth FIRST (Phase 5): a live worker owns the
|
||||
// port — the port cannot be faked by a stale or clobbered file. Exit 0:
|
||||
// duplicate suppression is a success, not a failure.
|
||||
if (await isPortInUse(port)) {
|
||||
logger.info('SYSTEM', 'Port already in use, refusing to start duplicate', { port });
|
||||
process.exit(0);
|
||||
}
|
||||
|
||||
// PID file second, ADVISORY only: it covers a dying-but-still-alive
|
||||
// predecessor whose port has already been released (so the port check
|
||||
// above misses it) but whose owned PID file has not been deleted yet.
|
||||
// It does NOT cover a just-spawned worker that hasn't bound the port:
|
||||
// writePidFile runs after server.listen, so that worker has no file.
|
||||
// The worker itself remains the sole writer of this file
|
||||
// (writePidFile/touchPidFile stay as diagnostics).
|
||||
const existingPidInfo = readPidFile();
|
||||
if (verifyPidFileOwnership(existingPidInfo)) {
|
||||
logger.info('SYSTEM', 'Worker already running (PID alive), refusing to start duplicate', {
|
||||
@@ -1176,11 +1332,6 @@ async function main() {
|
||||
process.exit(0);
|
||||
}
|
||||
|
||||
if (await isPortInUse(port)) {
|
||||
logger.info('SYSTEM', 'Port already in use, refusing to start duplicate', { port });
|
||||
process.exit(0);
|
||||
}
|
||||
|
||||
process.on('unhandledRejection', (reason) => {
|
||||
logger.error('SYSTEM', 'Unhandled rejection in daemon', {
|
||||
reason: reason instanceof Error ? reason.message : String(reason)
|
||||
@@ -1202,45 +1353,73 @@ async function main() {
|
||||
process.exit(0);
|
||||
}
|
||||
logger.failure('SYSTEM', 'Worker failed to start', {}, error as Error);
|
||||
removePidFile();
|
||||
process.exit(0);
|
||||
// Owner-or-dead guarded (Phase 5): clean up our own PID file (written
|
||||
// in start() before the failure) or a dead leftover, but never a live
|
||||
// competitor's — e.g. a port-conflict loser whose error didn't match
|
||||
// the EADDRINUSE detection above must not clobber the winner's file.
|
||||
removePidFileIfOwner(process.pid);
|
||||
// Genuine start failure (not duplicate suppression): exit non-zero so
|
||||
// the restart verifier and any supervising caller see a dead boot
|
||||
// instead of a silent "success".
|
||||
process.exit(1);
|
||||
});
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
async function printQueueStatusIfBullMq(port: number): Promise<void> {
|
||||
interface WorkerHealthSnapshot {
|
||||
status?: unknown;
|
||||
pid?: unknown;
|
||||
version?: unknown;
|
||||
uptime?: unknown;
|
||||
workerPath?: unknown;
|
||||
queue?: {
|
||||
redis?: {
|
||||
status?: string;
|
||||
host?: string;
|
||||
port?: number;
|
||||
mode?: string;
|
||||
prefix?: string;
|
||||
error?: string;
|
||||
};
|
||||
};
|
||||
}
|
||||
|
||||
/**
|
||||
* Fetch the worker's self-reported state from GET /api/health. Returns null
|
||||
* when nothing answers (connection refused, timeout, non-JSON body).
|
||||
* /api/health answers 503 when the queue is degraded but still includes
|
||||
* pid/version/uptime — a degraded worker is still a RUNNING worker, so both
|
||||
* 200 and 503 payloads are returned as-is.
|
||||
*/
|
||||
async function fetchWorkerHealth(port: number, timeoutMs: number): Promise<WorkerHealthSnapshot | null> {
|
||||
try {
|
||||
const response = await fetchWithTimeout(`http://${getWorkerHost()}:${port}/api/health`, {}, timeoutMs);
|
||||
return await response.json() as WorkerHealthSnapshot;
|
||||
} catch {
|
||||
return null;
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Print BullMQ queue detail from an already-fetched /api/health snapshot.
|
||||
* A degraded worker answers 503 but still includes the queue block, and
|
||||
* `status` already treats that worker as running — so this must not
|
||||
* re-fetch and bail on a non-2xx response (which hid the queue detail
|
||||
* behind "BullMQ health unavailable (HTTP 503)"). Reusing the snapshot in
|
||||
* hand keeps the output consistent with what `status` just reported.
|
||||
*/
|
||||
function printQueueStatusIfBullMq(health: WorkerHealthSnapshot): void {
|
||||
if (SettingsDefaultsManager.get('CLAUDE_MEM_QUEUE_ENGINE').trim().toLowerCase() !== 'bullmq') {
|
||||
return;
|
||||
}
|
||||
try {
|
||||
const response = await fetch(`http://${getWorkerHost()}:${port}/api/health`);
|
||||
if (!response.ok) {
|
||||
console.log(` Queue: BullMQ health unavailable (HTTP ${response.status})`);
|
||||
return;
|
||||
}
|
||||
const body = await response.json() as {
|
||||
queue?: {
|
||||
redis?: {
|
||||
status?: string;
|
||||
host?: string;
|
||||
port?: number;
|
||||
mode?: string;
|
||||
prefix?: string;
|
||||
error?: string;
|
||||
};
|
||||
};
|
||||
};
|
||||
const redis = body.queue?.redis;
|
||||
if (!redis) {
|
||||
return;
|
||||
}
|
||||
const target = `${redis.host ?? 'unknown'}:${redis.port ?? 'unknown'}`;
|
||||
const suffix = redis.status === 'ok' ? '' : ` (${redis.error ?? 'unhealthy'})`;
|
||||
console.log(` Queue: BullMQ Redis ${redis.status ?? 'unknown'} at ${target} [${redis.mode ?? 'external'}, prefix=${redis.prefix ?? 'claude_mem'}]${suffix}`);
|
||||
} catch (error) {
|
||||
console.log(` Queue: BullMQ health unavailable (${error instanceof Error ? error.message : String(error)})`);
|
||||
const redis = health.queue?.redis;
|
||||
if (!redis) {
|
||||
return;
|
||||
}
|
||||
const target = `${redis.host ?? 'unknown'}:${redis.port ?? 'unknown'}`;
|
||||
const suffix = redis.status === 'ok' ? '' : ` (${redis.error ?? 'unhealthy'})`;
|
||||
console.log(` Queue: BullMQ Redis ${redis.status ?? 'unknown'} at ${target} [${redis.mode ?? 'external'}, prefix=${redis.prefix ?? 'claude_mem'}]${suffix}`);
|
||||
}
|
||||
|
||||
const isMainModule = typeof require !== 'undefined' && typeof module !== 'undefined'
|
||||
|
||||
176
src/services/worker-shutdown.ts
Normal file
176
src/services/worker-shutdown.ts
Normal file
@@ -0,0 +1,176 @@
|
||||
/**
|
||||
* Guarded worker shutdown sequence: the dying worker drains gracefully under
|
||||
* a hard deadline and, on restart, spawns its own successor so no other
|
||||
* process races it for the port
|
||||
* (plans/2026-06-10-worker-restart-single-source-of-truth.md).
|
||||
*
|
||||
* This lives in its own module rather than inside worker-service.ts for the
|
||||
* same reason as restart-verify.ts: worker-service.ts drags in a very large
|
||||
* dependency graph (bun:sqlite, MCP SDK, telemetry, supervisor, express) and
|
||||
* ends with an isMainModule bootstrap, which makes it unsafe to import from
|
||||
* `bun test`. WorkerService.shutdown() delegates here with its real
|
||||
* dependencies — this module IS the production shutdown logic, not a test
|
||||
* double.
|
||||
*
|
||||
* Sequence:
|
||||
* 1. Re-entrancy guard — /api/admin/restart, /api/admin/shutdown and the
|
||||
* signal handler can all race into shutdown; only the first wins.
|
||||
* 2. Pre-shutdown bookkeeping (watcher/heartbeat/sentinel/telemetry).
|
||||
* 3. performGracefulShutdown under a hard deadline — it has no global
|
||||
* deadline of its own and session drain has been observed at 35-40s.
|
||||
* 4. reason === 'restart' ONLY: spawn the successor worker as the dying
|
||||
* worker's final act, AFTER the port is confirmed free. 'stop' and
|
||||
* signal shutdowns stay kill-only.
|
||||
*/
|
||||
|
||||
import { logger } from '../utils/logger.js';
|
||||
|
||||
/**
|
||||
* Closed enum for worker_stopped telemetry. Must stay in sync with the
|
||||
* shutdown_reason whitelist documentation (scrub.ts / telemetry.mdx):
|
||||
* stop = /api/admin/shutdown (CLI `stop`), restart = /api/admin/restart or
|
||||
* CLI `restart` (tagged ?reason=restart), signal = SIGTERM/SIGINT handler.
|
||||
*/
|
||||
export type WorkerShutdownReason = 'stop' | 'restart' | 'signal';
|
||||
|
||||
export interface RestartHandoffDeps {
|
||||
port: number;
|
||||
/** Budget for the old worker's port to close before giving up on the spawn. */
|
||||
portFreeTimeoutMs: number;
|
||||
/** Marketplace-script candidates with a dev-tree fallback (resolveWorkerScriptPath pattern). */
|
||||
resolveSuccessorScript: () => string;
|
||||
waitForPortFree: (port: number, timeoutMs: number) => Promise<boolean>;
|
||||
/**
|
||||
* Owner-or-dead guarded deletion (Phase 5): the production injection
|
||||
* (worker-service.ts) deletes only the dying worker's own PID file or a
|
||||
* dead pid's leftover — never a live successor's.
|
||||
*/
|
||||
removePidFile: () => void;
|
||||
spawnDaemon: (scriptPath: string, port: number) => number | undefined;
|
||||
}
|
||||
|
||||
export interface ShutdownSequenceOptions {
|
||||
reason: WorkerShutdownReason;
|
||||
/** Reads the owner's shutdown flag (WorkerService.isShuttingDown). */
|
||||
isShuttingDown: () => boolean;
|
||||
markShuttingDown: () => void;
|
||||
/** Pre-graceful bookkeeping: transcript watcher, heartbeat, sentinel, telemetry flush. */
|
||||
beforeGracefulShutdown: () => Promise<void>;
|
||||
performGracefulShutdown: () => Promise<void>;
|
||||
gracefulDeadlineMs: number;
|
||||
restartHandoff: RestartHandoffDeps;
|
||||
}
|
||||
|
||||
export async function runShutdownSequence(options: ShutdownSequenceOptions): Promise<void> {
|
||||
if (options.isShuttingDown()) {
|
||||
logger.warn('SYSTEM', 'Shutdown already in progress — ignoring re-entrant shutdown request', {
|
||||
reason: options.reason,
|
||||
});
|
||||
return;
|
||||
}
|
||||
options.markShuttingDown();
|
||||
|
||||
try {
|
||||
await options.beforeGracefulShutdown();
|
||||
} catch (error: unknown) {
|
||||
// Pre-graceful bookkeeping (watcher/heartbeat/sentinel/telemetry flush)
|
||||
// failing must not abort the sequence: graceful shutdown and — for
|
||||
// restarts — the successor handoff still have to run. Same "proceed on
|
||||
// error, never abort the handoff" policy as performGracefulShutdown below.
|
||||
logger.error(
|
||||
'SYSTEM',
|
||||
'Pre-graceful shutdown bookkeeping failed — proceeding',
|
||||
{ reason: options.reason },
|
||||
error instanceof Error ? error : new Error(String(error))
|
||||
);
|
||||
}
|
||||
|
||||
// Hard deadline around performGracefulShutdown: on expiry (or failure) log
|
||||
// and continue — a restart must never hang the dying worker on an unbounded
|
||||
// session drain.
|
||||
let deadlineTimer: ReturnType<typeof setTimeout> | undefined;
|
||||
const deadline = new Promise<'deadline'>((resolve) => {
|
||||
deadlineTimer = setTimeout(() => resolve('deadline'), options.gracefulDeadlineMs);
|
||||
deadlineTimer.unref?.();
|
||||
});
|
||||
try {
|
||||
const outcome = await Promise.race([
|
||||
options.performGracefulShutdown().then(
|
||||
() => 'graceful' as const,
|
||||
(error: unknown) => {
|
||||
// A failed graceful shutdown must not abort the restart handoff;
|
||||
// proceed exactly like the deadline path.
|
||||
logger.error(
|
||||
'SYSTEM',
|
||||
'Graceful shutdown failed — proceeding',
|
||||
{ reason: options.reason },
|
||||
error instanceof Error ? error : new Error(String(error))
|
||||
);
|
||||
return 'graceful-error' as const;
|
||||
}
|
||||
),
|
||||
deadline,
|
||||
]);
|
||||
if (outcome === 'deadline') {
|
||||
logger.warn('SYSTEM', 'Graceful shutdown deadline exceeded — proceeding', {
|
||||
deadlineMs: options.gracefulDeadlineMs,
|
||||
reason: options.reason,
|
||||
});
|
||||
}
|
||||
} finally {
|
||||
if (deadlineTimer !== undefined) clearTimeout(deadlineTimer);
|
||||
}
|
||||
|
||||
// Successor handoff — ONLY for restart; 'stop' and signal shutdowns stay
|
||||
// kill-only. The old worker spawns its replacement as its final act, after
|
||||
// its port is confirmed free, so the successor never races the corpse for
|
||||
// the port. The hook recycle path (ensureWorkerRunning in
|
||||
// src/shared/worker-utils.ts) waits for this successor instead of spawning
|
||||
// its own. This runs inside flushResponseThen's flushed action, so it
|
||||
// completes before that helper's process.exit(0).
|
||||
if (options.reason !== 'restart') return;
|
||||
|
||||
const handoff = options.restartHandoff;
|
||||
try {
|
||||
const successorScript = handoff.resolveSuccessorScript();
|
||||
const portFree = await handoff.waitForPortFree(handoff.port, handoff.portFreeTimeoutMs);
|
||||
if (!portFree) {
|
||||
logger.error('SYSTEM', 'Restart successor NOT spawned: port never freed after graceful shutdown — the next hook lazy-spawn is the safety net', {
|
||||
port: handoff.port,
|
||||
timeoutMs: handoff.portFreeTimeoutMs,
|
||||
});
|
||||
return;
|
||||
}
|
||||
// Same ordering as the CLI restart path (worker-service.ts `restart`
|
||||
// case): port free → remove the now-ownerless PID file → spawn. Without
|
||||
// the removal a fast-booting successor can still see this not-yet-exited
|
||||
// process in the PID file and refuse to start as a "duplicate". The
|
||||
// injected implementation is owner-or-dead guarded (Phase 5): it deletes
|
||||
// only this dying worker's own file (or a dead pid's leftover), never a
|
||||
// live successor's.
|
||||
handoff.removePidFile();
|
||||
const successorPid = handoff.spawnDaemon(successorScript, handoff.port);
|
||||
if (successorPid === undefined) {
|
||||
logger.error('SYSTEM', 'Restart successor spawn FAILED — the next hook lazy-spawn is the safety net', {
|
||||
port: handoff.port,
|
||||
script: successorScript,
|
||||
});
|
||||
return;
|
||||
}
|
||||
logger.info('SYSTEM', 'Restart successor spawned', {
|
||||
pid: successorPid,
|
||||
script: successorScript,
|
||||
port: handoff.port,
|
||||
});
|
||||
} catch (error: unknown) {
|
||||
// spawnDaemon can throw (supervisor assertCanSpawn refuses while its stop
|
||||
// cascade is still in flight after a deadline); the handoff must never
|
||||
// turn the dying worker's exit into an unhandled rejection.
|
||||
logger.error(
|
||||
'SYSTEM',
|
||||
'Restart successor handoff threw — the next hook lazy-spawn is the safety net',
|
||||
{ port: handoff.port },
|
||||
error instanceof Error ? error : new Error(String(error))
|
||||
);
|
||||
}
|
||||
}
|
||||
@@ -15,6 +15,7 @@ import {
|
||||
waitForHealth,
|
||||
waitForReadiness,
|
||||
} from './infrastructure/HealthMonitor.js';
|
||||
import { acquireSpawnLock, releaseSpawnLock } from '../shared/worker-spawn-gate.js';
|
||||
|
||||
const WINDOWS_SPAWN_COOLDOWN_MS = 2 * 60 * 1000;
|
||||
|
||||
@@ -127,18 +128,38 @@ export async function ensureWorkerStarted(
|
||||
return 'dead';
|
||||
}
|
||||
|
||||
logger.info('SYSTEM', 'Starting worker daemon', { workerScriptPath });
|
||||
markWorkerSpawnAttempted();
|
||||
const pid = spawnDaemon(workerScriptPath, port);
|
||||
if (pid === undefined) {
|
||||
logger.error('SYSTEM', 'Failed to spawn worker daemon');
|
||||
return 'dead';
|
||||
}
|
||||
// Spawn gate (src/shared/worker-spawn-gate.ts): only ONE gated launcher —
|
||||
// hook, MCP server, or the CLI restart fallback — may spawn at a time. (The
|
||||
// dying worker's restart handoff in worker-shutdown.ts is deliberately NOT
|
||||
// gated: it is the primary spawner on restart, and hooks wait for its
|
||||
// successor.) Losing the lock never fails this path; the loser skips its
|
||||
// spawn and falls through to the SAME wait-for-health/readiness logic
|
||||
// (someone else is spawning — wait for their worker). The winner holds the
|
||||
// lock through the post-spawn health wait (the spawn isn't "done" until the
|
||||
// worker owns the port) and releases in finally on every exit path.
|
||||
const spawnLockHeld = acquireSpawnLock();
|
||||
try {
|
||||
if (spawnLockHeld) {
|
||||
logger.info('SYSTEM', 'Starting worker daemon', { workerScriptPath });
|
||||
markWorkerSpawnAttempted();
|
||||
const pid = spawnDaemon(workerScriptPath, port);
|
||||
if (pid === undefined) {
|
||||
logger.error('SYSTEM', 'Failed to spawn worker daemon');
|
||||
return 'dead';
|
||||
}
|
||||
} else {
|
||||
logger.info('SYSTEM', 'Another launcher holds the spawn lock — skipping duplicate spawn and waiting for its worker');
|
||||
}
|
||||
|
||||
const healthy = await waitForHealth(port, getPlatformTimeout(HOOK_TIMEOUTS.POST_SPAWN_WAIT));
|
||||
if (!healthy) {
|
||||
logger.warn('SYSTEM', 'Worker spawned but health endpoint not responding within window — likely still starting in background');
|
||||
return 'warming';
|
||||
const healthy = await waitForHealth(port, getPlatformTimeout(HOOK_TIMEOUTS.POST_SPAWN_WAIT));
|
||||
if (!healthy) {
|
||||
logger.warn('SYSTEM', spawnLockHeld
|
||||
? 'Worker spawned but health endpoint not responding within window — likely still starting in background'
|
||||
: 'Spawn-lock holder\'s worker not healthy within window — likely still starting in background');
|
||||
return 'warming';
|
||||
}
|
||||
} finally {
|
||||
if (spawnLockHeld) releaseSpawnLock();
|
||||
}
|
||||
|
||||
const ready = await waitForReadiness(port, getPlatformTimeout(HOOK_TIMEOUTS.READINESS_WAIT));
|
||||
@@ -147,7 +168,11 @@ export async function ensureWorkerStarted(
|
||||
}
|
||||
|
||||
clearWorkerSpawnAttempted();
|
||||
// touchPidFile is existsSync-guarded and merely refreshes the live worker's
|
||||
// pid-file mtime — correct for lock losers too, since the worker IS up.
|
||||
touchPidFile();
|
||||
logger.info('SYSTEM', 'Worker started successfully');
|
||||
logger.info('SYSTEM', spawnLockHeld
|
||||
? 'Worker started successfully'
|
||||
: 'Worker is up (started by another launcher)');
|
||||
return ready ? 'ready' : 'warming';
|
||||
}
|
||||
|
||||
@@ -203,7 +203,10 @@ export class SettingsDefaultsManager {
|
||||
mkdirSync(dir, { recursive: true });
|
||||
}
|
||||
writeFileSync(settingsPath, JSON.stringify(defaults, null, 2), 'utf-8');
|
||||
console.log('[SETTINGS] Created settings file with defaults:', settingsPath);
|
||||
// stderr, never stdout: this fires on the first boot in a fresh data
|
||||
// dir, and CLI commands like `start` promise machine-readable JSON
|
||||
// on stdout to the hook framework.
|
||||
console.warn('[SETTINGS] Created settings file with defaults:', settingsPath);
|
||||
} catch (error: unknown) {
|
||||
console.warn('[SETTINGS] Failed to create settings file, using in-memory defaults:', settingsPath, error instanceof Error ? error.message : String(error));
|
||||
}
|
||||
@@ -222,7 +225,8 @@ export class SettingsDefaultsManager {
|
||||
|
||||
try {
|
||||
writeFileSync(settingsPath, JSON.stringify(flatSettings, null, 2), 'utf-8');
|
||||
console.log('[SETTINGS] Migrated settings file from nested to flat schema:', settingsPath);
|
||||
// stderr, never stdout — same JSON-on-stdout contract as above.
|
||||
console.warn('[SETTINGS] Migrated settings file from nested to flat schema:', settingsPath);
|
||||
} catch (error: unknown) {
|
||||
console.warn('[SETTINGS] Failed to auto-migrate settings file:', settingsPath, error instanceof Error ? error.message : String(error));
|
||||
// Continue with in-memory migration even if write fails
|
||||
|
||||
156
src/shared/worker-spawn-gate.ts
Normal file
156
src/shared/worker-spawn-gate.ts
Normal file
@@ -0,0 +1,156 @@
|
||||
import { dirname, join } from 'path';
|
||||
import { mkdirSync, readFileSync, statSync, unlinkSync, writeFileSync } from 'fs';
|
||||
import { resolveDataDir } from './paths.js';
|
||||
|
||||
/**
|
||||
* Cross-launcher spawn lockfile (Phase 4 of
|
||||
* plans/2026-06-10-worker-restart-single-source-of-truth.md).
|
||||
*
|
||||
* Three independent launchers can try to start the worker at the same time —
|
||||
* hooks (src/shared/worker-utils.ts), the MCP server
|
||||
* (src/services/worker-spawner.ts), and the CLI restart fallback
|
||||
* (src/services/worker-service.ts). This gate gives
|
||||
* them mutual exclusion over the SPAWN only: whoever creates
|
||||
* `<DATA_DIR>/spawn.lock` with the `wx` flag (O_CREAT|O_EXCL — the create IS
|
||||
* the atomicity, no rename or lock library needed) is the one launcher allowed
|
||||
* to spawn; everyone else skips their spawn and waits for the winner's worker
|
||||
* to come up.
|
||||
*
|
||||
* Hard rules:
|
||||
* - The lock gates SPAWNING only — never health/readiness checks. A held lock
|
||||
* must never make a hook FAIL, only wait for the holder's worker.
|
||||
* - Staleness is judged by the lock file's mtime (statSync().mtimeMs), never
|
||||
* by clock values stored in the file content.
|
||||
* - The dying worker's restart handoff (src/services/worker-shutdown.ts) is
|
||||
* deliberately NOT gated: it is the PRIMARY spawner on restart, and hooks
|
||||
* wait for its successor instead of competing with it.
|
||||
*/
|
||||
|
||||
/**
|
||||
* A holder that hasn't finished spawning within this window is presumed dead
|
||||
* (crashed mid-spawn); its lock may be broken. The longest in-lock wait any
|
||||
* holder performs is the ~15s post-spawn port/health wait, which
|
||||
* getPlatformTimeout scales 2.0x on Windows to ~30s — so the staleness window
|
||||
* must clear 30s, not 15s. 60s keeps a 2x margin over that worst case.
|
||||
*/
|
||||
const SPAWN_LOCK_STALE_MS = 60_000;
|
||||
|
||||
/**
|
||||
* Resolved at call time (resolveDataDir consults CLAUDE_MEM_DATA_DIR / the
|
||||
* settings file on each call) rather than binding paths.ts's import-time
|
||||
* DATA_DIR const, so every launcher — and the test suite, which points
|
||||
* CLAUDE_MEM_DATA_DIR at a temp dir — agrees on the same lock path.
|
||||
*/
|
||||
function getSpawnLockPath(): string {
|
||||
return join(resolveDataDir(), 'spawn.lock');
|
||||
}
|
||||
|
||||
/**
|
||||
* Try to become the one launcher allowed to spawn the worker.
|
||||
*
|
||||
* Returns true when this process now holds the lock (caller MUST
|
||||
* releaseSpawnLock() in a finally). Returns false when another launcher holds
|
||||
* a fresh lock — the caller must SKIP its spawn and wait for the holder's
|
||||
* worker instead.
|
||||
*
|
||||
* A lock whose mtime is older than SPAWN_LOCK_STALE_MS is broken (unlinked)
|
||||
* and acquisition is retried exactly once.
|
||||
*/
|
||||
export function acquireSpawnLock(): boolean {
|
||||
const lockPath = getSpawnLockPath();
|
||||
const payload = JSON.stringify({
|
||||
pid: process.pid,
|
||||
startedAt: new Date().toISOString(),
|
||||
});
|
||||
|
||||
for (let attempt = 0; attempt < 2; attempt++) {
|
||||
try {
|
||||
mkdirSync(dirname(lockPath), { recursive: true });
|
||||
writeFileSync(lockPath, payload, { flag: 'wx' });
|
||||
return true;
|
||||
} catch (error: unknown) {
|
||||
const code = (error as NodeJS.ErrnoException)?.code;
|
||||
if (code !== 'EEXIST') {
|
||||
// Not contention — the filesystem refused the lock outright (EACCES,
|
||||
// EROFS, ...). Fail OPEN: the lock is a collision guard, not a
|
||||
// correctness gate, and a broken lock mechanism must degrade to the
|
||||
// pre-lock behavior (spawn anyway), never suppress every spawn
|
||||
// forever. releaseSpawnLock() is a no-op when no file was written.
|
||||
return true;
|
||||
}
|
||||
if (attempt > 0) {
|
||||
// We already broke one stale lock and someone re-acquired before us —
|
||||
// treat them as the live holder; never break twice.
|
||||
return false;
|
||||
}
|
||||
|
||||
let mtimeMs: number;
|
||||
try {
|
||||
mtimeMs = statSync(lockPath).mtimeMs;
|
||||
} catch {
|
||||
// Lock vanished between the failed write and the stat — the holder
|
||||
// just released. Retry once via the loop.
|
||||
continue;
|
||||
}
|
||||
|
||||
if (Date.now() - mtimeMs <= SPAWN_LOCK_STALE_MS) {
|
||||
// Fresh lock: another launcher is mid-spawn. Caller waits for its
|
||||
// worker instead of spawning a competitor.
|
||||
return false;
|
||||
}
|
||||
|
||||
// Stale lock: the holder died mid-spawn. Re-stat immediately before
|
||||
// breaking it — if the mtime changed since we judged it stale, another
|
||||
// launcher already broke it and re-took the lock; unlinking now would
|
||||
// delete THEIR fresh lock and mint two winners. The re-stat narrows
|
||||
// that TOCTOU window from the whole staleness evaluation to a few
|
||||
// microseconds.
|
||||
let recheckedMtimeMs: number;
|
||||
try {
|
||||
recheckedMtimeMs = statSync(lockPath).mtimeMs;
|
||||
} catch {
|
||||
// Lock vanished between the staleness judgment and the re-stat —
|
||||
// either the holder released or a competing breaker won. Retry once
|
||||
// via the loop.
|
||||
continue;
|
||||
}
|
||||
if (recheckedMtimeMs !== mtimeMs) {
|
||||
// Re-taken (or refreshed) since we judged it stale — its new owner is
|
||||
// live; yield to them.
|
||||
return false;
|
||||
}
|
||||
|
||||
try {
|
||||
// Break the stale lock and retry once.
|
||||
unlinkSync(lockPath);
|
||||
} catch {
|
||||
// The file vanished between the re-stat and the unlink (a competing
|
||||
// breaker's unlink won the race), or the filesystem refused the
|
||||
// delete (EPERM/EACCES). Either way we cannot claim the break —
|
||||
// yield.
|
||||
return false;
|
||||
}
|
||||
}
|
||||
}
|
||||
return false;
|
||||
}
|
||||
|
||||
/**
|
||||
* Release the spawn lock IF this process owns it. Owner-checked: the file is
|
||||
* read back and only deleted when its pid matches process.pid, so a launcher
|
||||
* can never delete a competitor's live lock (e.g. after its own stale lock
|
||||
* was broken and re-acquired by someone else). All errors are swallowed —
|
||||
* release is best-effort; an orphaned lock self-heals via the staleness
|
||||
* breaker in acquireSpawnLock.
|
||||
*/
|
||||
export function releaseSpawnLock(): void {
|
||||
const lockPath = getSpawnLockPath();
|
||||
try {
|
||||
const lock = JSON.parse(readFileSync(lockPath, 'utf-8')) as { pid?: unknown };
|
||||
if (lock.pid !== process.pid) return;
|
||||
unlinkSync(lockPath);
|
||||
} catch {
|
||||
// Missing, unreadable, or corrupt lock file — leave it alone; the
|
||||
// staleness breaker (SPAWN_LOCK_STALE_MS) reclaims anything orphaned.
|
||||
}
|
||||
}
|
||||
@@ -1,6 +1,5 @@
|
||||
import path from "path";
|
||||
import { readFileSync, existsSync, writeFileSync, renameSync, mkdirSync } from "fs";
|
||||
import { execSync } from "child_process";
|
||||
import { spawnHidden } from "./spawn.js";
|
||||
import { logger } from "../utils/logger.js";
|
||||
import { HOOK_TIMEOUTS, getTimeout } from "./hook-constants.js";
|
||||
@@ -11,6 +10,11 @@ import { validateWorkerPidFile } from "../supervisor/index.js";
|
||||
import { emitBlockingError } from "./hook-io.js";
|
||||
import { captureCliEvent } from "../services/telemetry/cli-telemetry.js";
|
||||
import { checkVersionMatch } from "../services/infrastructure/index.js";
|
||||
// Imported from ProcessManager.js directly (not the infrastructure barrel):
|
||||
// tests mock the barrel module wholesale, and the resolver must stay real.
|
||||
// ProcessManager imports nothing from worker-utils, so no cycle.
|
||||
import { resolveWorkerRuntimePath } from "../services/infrastructure/ProcessManager.js";
|
||||
import { acquireSpawnLock, releaseSpawnLock } from "./worker-spawn-gate.js";
|
||||
|
||||
function readTimeoutEnv(
|
||||
envName: string,
|
||||
@@ -203,7 +207,13 @@ async function isWorkerReady(): Promise<boolean> {
|
||||
return response.ok;
|
||||
}
|
||||
|
||||
function resolveWorkerScriptPath(): string | null {
|
||||
/**
|
||||
* Canonical worker-script resolver: the marketplace install first, then the
|
||||
* dev-tree copy under cwd. Exported so other launchers (e.g. the MCP server)
|
||||
* prefer the same marketplace copy instead of spawning a stale cache-dir
|
||||
* bundle.
|
||||
*/
|
||||
export function resolveWorkerScriptPath(): string | null {
|
||||
const candidates = [
|
||||
path.join(MARKETPLACE_ROOT, 'plugin', 'scripts', 'worker-service.cjs'),
|
||||
path.join(process.cwd(), 'plugin', 'scripts', 'worker-service.cjs'),
|
||||
@@ -214,26 +224,6 @@ function resolveWorkerScriptPath(): string | null {
|
||||
return null;
|
||||
}
|
||||
|
||||
function resolveBunRuntime(): string | null {
|
||||
if (process.env.BUN && existsSync(process.env.BUN)) return process.env.BUN;
|
||||
|
||||
try {
|
||||
const cmd = process.platform === 'win32' ? 'where bun' : 'which bun';
|
||||
const output = execSync(cmd, {
|
||||
stdio: ['ignore', 'pipe', 'ignore'],
|
||||
encoding: 'utf-8',
|
||||
windowsHide: true,
|
||||
});
|
||||
const firstMatch = output
|
||||
.split(/\r?\n/)
|
||||
.map(line => line.trim())
|
||||
.find(line => line.length > 0);
|
||||
return firstMatch || null;
|
||||
} catch {
|
||||
return null;
|
||||
}
|
||||
}
|
||||
|
||||
async function waitForWorkerPort(options: { attempts: number; backoffMs: number }): Promise<boolean> {
|
||||
let delayMs = options.backoffMs;
|
||||
for (let attempt = 1; attempt <= options.attempts; attempt++) {
|
||||
@@ -272,6 +262,62 @@ async function waitForWorkerReadiness(timeoutMs: number = HOOK_READINESS_TIMEOUT
|
||||
return false;
|
||||
}
|
||||
|
||||
/**
|
||||
* Read the version the worker self-reports on GET /api/health. The payload
|
||||
* carries pid/version even on a 503 (degraded queue) response, so the body is
|
||||
* parsed regardless of status — same contract as restart-verify.ts. Returns
|
||||
* null when the worker is unreachable or the payload is malformed.
|
||||
*/
|
||||
async function fetchWorkerHealthVersion(): Promise<string | null> {
|
||||
try {
|
||||
const response = await workerHttpRequest('/api/health', { timeoutMs: HEALTH_CHECK_TIMEOUT_MS });
|
||||
const body = await response.json() as { version?: unknown };
|
||||
return typeof body.version === 'string' ? body.version : null;
|
||||
} catch {
|
||||
return null;
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* After POSTing /api/admin/restart, the OLD worker spawns its own successor
|
||||
* once its port closes (runShutdownSequence in src/services/worker-shutdown.ts;
|
||||
* plans/2026-06-10-worker-restart-single-source-of-truth.md). Wait for that
|
||||
* successor — a worker answering /api/health with the installed plugin's
|
||||
* version — instead of immediately lazy-spawning into the dying worker
|
||||
* (the old behavior, which caused the spawn ping-pong).
|
||||
*/
|
||||
async function waitForRecycledWorker(
|
||||
pluginVersion: string,
|
||||
timeoutMs: number = HOOK_READINESS_TIMEOUT_MS
|
||||
): Promise<boolean> {
|
||||
const start = Date.now();
|
||||
while (Date.now() - start < timeoutMs) {
|
||||
const observedVersion = await fetchWorkerHealthVersion();
|
||||
if (observedVersion === pluginVersion) return true;
|
||||
|
||||
const remainingMs = timeoutMs - (Date.now() - start);
|
||||
if (remainingMs <= 0) break;
|
||||
await new Promise<void>(resolve => setTimeout(resolve, Math.min(500, remainingMs)));
|
||||
}
|
||||
return false;
|
||||
}
|
||||
|
||||
/**
|
||||
* Amplifier guard: a hook recycles a stale worker AT MOST once per
|
||||
* invocation. If the worker that became ready still reports a mismatched
|
||||
* version, warn and return — the NEXT hook event retries. Recycling again in
|
||||
* the same invocation re-creates the restart storm.
|
||||
*/
|
||||
async function warnIfVersionStillMismatched(expectedPluginVersion: string): Promise<void> {
|
||||
const observedVersion = await fetchWorkerHealthVersion();
|
||||
if (observedVersion !== null && observedVersion !== expectedPluginVersion) {
|
||||
logger.warn('SYSTEM', 'Worker is ready but still reports a stale version; not recycling again in this hook invocation (one recycle per hook event)', {
|
||||
pluginVersion: expectedPluginVersion,
|
||||
workerVersion: observedVersion,
|
||||
});
|
||||
}
|
||||
}
|
||||
|
||||
async function isWorkerPortAlive(): Promise<boolean> {
|
||||
let healthy: boolean;
|
||||
try {
|
||||
@@ -291,24 +337,31 @@ async function isWorkerPortAlive(): Promise<boolean> {
|
||||
}
|
||||
|
||||
export async function ensureWorkerRunning(): Promise<boolean> {
|
||||
// Installed-plugin version captured when the alive branch runs, so every
|
||||
// post-readiness path below can run the one-shot amplifier check
|
||||
// (warnIfVersionStillMismatched). Stays null when no worker was alive
|
||||
// (plain cold-start lazy-spawn — no recycle happened, nothing to amplify)
|
||||
// or when the plugin version is unreadable ('unknown').
|
||||
let expectedPluginVersion: string | null = null;
|
||||
|
||||
if (await isWorkerPortAlive()) {
|
||||
// A worker is already alive. If it is a DIFFERENT version than the
|
||||
// installed plugin (e.g. the user upgraded but the previous worker is
|
||||
// still squatting the port), recycle it so the current version takes
|
||||
// over — otherwise the stale worker keeps serving indefinitely.
|
||||
//
|
||||
// Previously this branch only logged the mismatch (debug level) and
|
||||
// returned, so a stale worker was reused across upgrades. We now act on
|
||||
// it: ask the running worker to restart via its localhost-only admin
|
||||
// endpoint, then fall through to the lazy-spawn + readiness path so the
|
||||
// current-version worker is (re)started and awaited.
|
||||
const { matches, pluginVersion, workerVersion } = await checkVersionMatch(getWorkerPort());
|
||||
if (pluginVersion !== 'unknown') {
|
||||
expectedPluginVersion = pluginVersion;
|
||||
}
|
||||
if (matches) {
|
||||
const ready = await waitForWorkerReadiness();
|
||||
if (!ready) {
|
||||
logger.warn('SYSTEM', 'Worker is healthy but not ready; skipping hook API call');
|
||||
return false;
|
||||
}
|
||||
if (expectedPluginVersion !== null) {
|
||||
await warnIfVersionStillMismatched(expectedPluginVersion);
|
||||
}
|
||||
return true;
|
||||
}
|
||||
|
||||
@@ -321,6 +374,28 @@ export async function ensureWorkerRunning(): Promise<boolean> {
|
||||
method: 'POST',
|
||||
timeoutMs: HEALTH_CHECK_TIMEOUT_MS,
|
||||
});
|
||||
// Do NOT lazy-spawn immediately after the POST — the old worker is
|
||||
// still dying and owns the port, so a spawn here races the corpse (the
|
||||
// observed restart ping-pong). The dying worker spawns its own
|
||||
// successor once its port closes (worker-shutdown.ts; see
|
||||
// plans/2026-06-10-worker-restart-single-source-of-truth.md); wait for
|
||||
// that successor and only fall through to lazy-spawn as the safety net
|
||||
// when it never appears.
|
||||
if (await waitForRecycledWorker(pluginVersion)) {
|
||||
const ready = await waitForWorkerReadiness();
|
||||
if (!ready) {
|
||||
logger.warn('SYSTEM', 'Recycled worker appeared but did not become ready; skipping hook API call');
|
||||
return false;
|
||||
}
|
||||
if (expectedPluginVersion !== null) {
|
||||
await warnIfVersionStillMismatched(expectedPluginVersion);
|
||||
}
|
||||
return true;
|
||||
}
|
||||
logger.warn('SYSTEM', 'No successor worker appeared after recycle; falling through to lazy-spawn', {
|
||||
pluginVersion,
|
||||
workerVersion,
|
||||
});
|
||||
} catch (error: unknown) {
|
||||
logger.debug('SYSTEM', 'Worker restart request failed; falling through to lazy-spawn', {
|
||||
error: error instanceof Error ? error.message : String(error),
|
||||
@@ -329,7 +404,7 @@ export async function ensureWorkerRunning(): Promise<boolean> {
|
||||
// Fall through to (re)spawn + readiness wait below.
|
||||
}
|
||||
|
||||
const runtimePath = resolveBunRuntime();
|
||||
const runtimePath = resolveWorkerRuntimePath();
|
||||
const scriptPath = resolveWorkerScriptPath();
|
||||
|
||||
if (!runtimePath) {
|
||||
@@ -341,42 +416,67 @@ export async function ensureWorkerRunning(): Promise<boolean> {
|
||||
return false;
|
||||
}
|
||||
|
||||
logger.info('SYSTEM', 'Worker not running — lazy-spawning', { runtimePath, scriptPath });
|
||||
|
||||
// Spawn gate (worker-spawn-gate.ts): only ONE gated launcher — hook, MCP
|
||||
// server, or the CLI restart fallback — may spawn at a time. (The dying
|
||||
// worker's restart handoff in worker-shutdown.ts is deliberately NOT gated:
|
||||
// it is the primary spawner on restart, and hooks wait for its successor.)
|
||||
// Losing the lock never fails the hook; the loser skips its spawn and waits
|
||||
// for the winner's worker on the existing port/readiness waits below. The
|
||||
// winner holds the lock through the port-open wait (the spawn isn't "done"
|
||||
// until the worker owns the port) and releases in finally on every exit
|
||||
// path.
|
||||
const spawnLockHeld = acquireSpawnLock();
|
||||
try {
|
||||
const proc = spawnHidden(runtimePath, [scriptPath, '--daemon'], {
|
||||
detached: true,
|
||||
stdio: ['ignore', 'ignore', 'ignore'],
|
||||
});
|
||||
proc.unref();
|
||||
} catch (error: unknown) {
|
||||
if (error instanceof Error) {
|
||||
logger.error('SYSTEM', 'Lazy-spawn of worker failed', { runtimePath, scriptPath }, error);
|
||||
} else {
|
||||
logger.error('SYSTEM', 'Lazy-spawn of worker failed (non-Error)', {
|
||||
runtimePath, scriptPath, error: String(error),
|
||||
});
|
||||
}
|
||||
return false;
|
||||
}
|
||||
if (spawnLockHeld) {
|
||||
logger.info('SYSTEM', 'Worker not running — lazy-spawning', { runtimePath, scriptPath });
|
||||
|
||||
// Cold boot (#2795): on the first session after a reboot the SessionStart
|
||||
// `start` hook is booting the daemon in parallel, and a cold macOS+Chroma
|
||||
// worker needs ~7s to bind. The old 3-attempt/250ms budget (~0.75s) expired
|
||||
// long before that, so the context (and session-init) hooks raced boot and
|
||||
// soft-failed to empty — dropping memory injection and the user_prompts row
|
||||
// (the upstream trigger for #2794). Wait up to ~15.5s (≈ POST_SPAWN_WAIT) so
|
||||
// whichever worker wins the port is seen before we give up.
|
||||
const alive = await waitForWorkerPort({ attempts: 6, backoffMs: 500 });
|
||||
if (!alive) {
|
||||
logger.warn('SYSTEM', 'Worker port did not open after lazy-spawn within the cold-boot wait (~15s)');
|
||||
return false;
|
||||
try {
|
||||
const proc = spawnHidden(runtimePath, [scriptPath, '--daemon'], {
|
||||
detached: true,
|
||||
stdio: ['ignore', 'ignore', 'ignore'],
|
||||
});
|
||||
proc.unref();
|
||||
} catch (error: unknown) {
|
||||
if (error instanceof Error) {
|
||||
logger.error('SYSTEM', 'Lazy-spawn of worker failed', { runtimePath, scriptPath }, error);
|
||||
} else {
|
||||
logger.error('SYSTEM', 'Lazy-spawn of worker failed (non-Error)', {
|
||||
runtimePath, scriptPath, error: String(error),
|
||||
});
|
||||
}
|
||||
return false;
|
||||
}
|
||||
} else {
|
||||
logger.info('SYSTEM', 'Another launcher holds the spawn lock — skipping lazy-spawn and waiting for its worker');
|
||||
}
|
||||
|
||||
// Cold boot (#2795): on the first session after a reboot the SessionStart
|
||||
// `start` hook is booting the daemon in parallel, and a cold macOS+Chroma
|
||||
// worker needs ~7s to bind. The old 3-attempt/250ms budget (~0.75s) expired
|
||||
// long before that, so the context (and session-init) hooks raced boot and
|
||||
// soft-failed to empty — dropping memory injection and the user_prompts row
|
||||
// (the upstream trigger for #2794). Wait up to ~15.5s (≈ POST_SPAWN_WAIT) so
|
||||
// whichever worker wins the port is seen before we give up.
|
||||
const alive = await waitForWorkerPort({ attempts: 6, backoffMs: 500 });
|
||||
if (!alive) {
|
||||
logger.warn('SYSTEM', spawnLockHeld
|
||||
? 'Worker port did not open after lazy-spawn within the cold-boot wait (~15s)'
|
||||
: 'Spawn-lock holder\'s worker port did not open within the cold-boot wait (~15s)');
|
||||
return false;
|
||||
}
|
||||
} finally {
|
||||
if (spawnLockHeld) releaseSpawnLock();
|
||||
}
|
||||
const ready = await waitForWorkerReadiness();
|
||||
if (!ready) {
|
||||
logger.warn('SYSTEM', 'Worker lazy-spawned but did not become ready before hook readiness timeout');
|
||||
return false;
|
||||
}
|
||||
// Amplifier guard: even if the worker that won the port is still stale,
|
||||
// never recycle a second time in the same hook invocation.
|
||||
if (expectedPluginVersion !== null) {
|
||||
await warnIfVersionStillMismatched(expectedPluginVersion);
|
||||
}
|
||||
return true;
|
||||
}
|
||||
|
||||
|
||||
@@ -1,5 +1,5 @@
|
||||
import { execFile } from 'child_process';
|
||||
import { rmSync } from 'fs';
|
||||
import { existsSync, readFileSync, rmSync } from 'fs';
|
||||
import { promisify } from 'util';
|
||||
import { logger } from '../utils/logger.js';
|
||||
import { HOOK_TIMEOUTS } from '../shared/hook-constants.js';
|
||||
@@ -84,6 +84,50 @@ export async function runShutdownCascade(options: ShutdownCascadeOptions): Promi
|
||||
options.registry.unregister(record.id);
|
||||
}
|
||||
|
||||
removeOwnedPidFile(pidFilePath, currentPid);
|
||||
|
||||
options.registry.pruneDeadEntries();
|
||||
}
|
||||
|
||||
/**
|
||||
* Owner-guarded PID-file removal (Phase 5, worker-restart plan).
|
||||
*
|
||||
* The shutdown cascade is the dying worker's LAST act — during a restart the
|
||||
* successor worker has typically already written its OWN PID file by the time
|
||||
* this runs. Blindly rmSync'ing here clobbered that file and made
|
||||
* `worker status` report a healthy worker as not running. Deletion therefore
|
||||
* requires proof of ownership: the recorded pid must equal `currentPid`.
|
||||
*
|
||||
* A missing file is a no-op. An unreadable/corrupt file cannot prove
|
||||
* ownership, so it is left in place (the safe default): readPidFile() and
|
||||
* validateWorkerPidFile() both treat unparseable files as ownerless, so a
|
||||
* leftover corrupt file never blocks a successor's boot and is cleaned up by
|
||||
* the next worker start.
|
||||
*/
|
||||
export function removeOwnedPidFile(pidFilePath: string, currentPid: number): void {
|
||||
if (!existsSync(pidFilePath)) return;
|
||||
|
||||
let recordedPid: number | null = null;
|
||||
try {
|
||||
const parsed = JSON.parse(readFileSync(pidFilePath, 'utf-8')) as { pid?: unknown };
|
||||
recordedPid = typeof parsed.pid === 'number' ? parsed.pid : null;
|
||||
} catch (error: unknown) {
|
||||
logger.debug('SYSTEM', 'PID file unreadable during shutdown — leaving it (cannot prove ownership)', {
|
||||
pidFilePath,
|
||||
error: error instanceof Error ? error.message : String(error)
|
||||
});
|
||||
return;
|
||||
}
|
||||
|
||||
if (recordedPid !== currentPid) {
|
||||
logger.debug('SYSTEM', 'PID file not owned by this process — leaving it for its owner (restart successor?)', {
|
||||
pidFilePath,
|
||||
recordedPid,
|
||||
currentPid
|
||||
});
|
||||
return;
|
||||
}
|
||||
|
||||
try {
|
||||
rmSync(pidFilePath, { force: true });
|
||||
} catch (error: unknown) {
|
||||
@@ -96,8 +140,6 @@ export async function runShutdownCascade(options: ShutdownCascadeOptions): Promi
|
||||
});
|
||||
}
|
||||
}
|
||||
|
||||
options.registry.pruneDeadEntries();
|
||||
}
|
||||
|
||||
async function waitForExit(records: ManagedProcessRecord[], timeoutMs: number): Promise<void> {
|
||||
|
||||
@@ -1,31 +1,49 @@
|
||||
import { describe, it, expect, beforeEach, afterEach, mock, spyOn } from 'bun:test';
|
||||
import { existsSync, readFileSync } from 'fs';
|
||||
import { homedir } from 'os';
|
||||
import { describe, it, expect, beforeEach, afterEach, afterAll, mock, spyOn } from 'bun:test';
|
||||
import { existsSync, mkdirSync, mkdtempSync, rmSync } from 'fs';
|
||||
import { homedir, tmpdir } from 'os';
|
||||
import path from 'path';
|
||||
import http from 'http';
|
||||
import {
|
||||
import type {
|
||||
GracefulShutdownConfig,
|
||||
ShutdownableService,
|
||||
CloseableClient,
|
||||
CloseableDatabase,
|
||||
PidInfo
|
||||
} from '../../src/services/infrastructure/index.js';
|
||||
|
||||
// ── Data-dir isolation (Phase 6, worker-restart plan) ──────────────────────
|
||||
// performGracefulShutdown writes/deletes the worker PID file and runs the
|
||||
// supervisor shutdown cascade against paths.supervisorRegistry() — both of
|
||||
// which must resolve into a temp dir, never the real ~/.claude-mem. paths.ts
|
||||
// freezes DATA_DIR at first evaluation (env wins), and ESM hoists static
|
||||
// imports above any env assignment, so the env var is set FIRST and the code
|
||||
// under test is loaded with dynamic imports below. (`import type` above is
|
||||
// erased at compile time and loads nothing.)
|
||||
const TEST_DATA_DIR = mkdtempSync(path.join(tmpdir(), 'claude-mem-shutdown-test-'));
|
||||
const PREVIOUS_DATA_DIR = process.env.CLAUDE_MEM_DATA_DIR;
|
||||
process.env.CLAUDE_MEM_DATA_DIR = TEST_DATA_DIR;
|
||||
|
||||
const {
|
||||
performGracefulShutdown,
|
||||
writePidFile,
|
||||
readPidFile,
|
||||
removePidFile,
|
||||
type GracefulShutdownConfig,
|
||||
type ShutdownableService,
|
||||
type CloseableClient,
|
||||
type CloseableDatabase,
|
||||
type PidInfo
|
||||
} from '../../src/services/infrastructure/index.js';
|
||||
} = await import('../../src/services/infrastructure/index.js');
|
||||
const { paths } = await import('../../src/shared/paths.js');
|
||||
|
||||
const DATA_DIR = path.join(homedir(), '.claude-mem');
|
||||
const PID_FILE = path.join(DATA_DIR, 'worker.pid');
|
||||
// If an earlier test file already evaluated paths.ts, the module cache wins
|
||||
// and DATA_DIR stays frozen on that earlier value — the preload tripwire's
|
||||
// per-run temp dir (tests/preload.ts), never the real ~/.claude-mem. Derive
|
||||
// the asserted paths from the SAME frozen module the code under test uses.
|
||||
const DATA_DIR = paths.dataDir();
|
||||
const PID_FILE = paths.workerPid();
|
||||
|
||||
describe('GracefulShutdown', () => {
|
||||
let originalPidContent: string | null = null;
|
||||
const originalPlatform = process.platform;
|
||||
|
||||
beforeEach(() => {
|
||||
if (existsSync(PID_FILE)) {
|
||||
originalPidContent = readFileSync(PID_FILE, 'utf-8');
|
||||
}
|
||||
mkdirSync(DATA_DIR, { recursive: true });
|
||||
removePidFile();
|
||||
|
||||
Object.defineProperty(process, 'platform', {
|
||||
value: 'darwin',
|
||||
@@ -35,13 +53,7 @@ describe('GracefulShutdown', () => {
|
||||
});
|
||||
|
||||
afterEach(() => {
|
||||
if (originalPidContent !== null) {
|
||||
const { writeFileSync } = require('fs');
|
||||
writeFileSync(PID_FILE, originalPidContent);
|
||||
originalPidContent = null;
|
||||
} else {
|
||||
removePidFile();
|
||||
}
|
||||
removePidFile();
|
||||
|
||||
Object.defineProperty(process, 'platform', {
|
||||
value: originalPlatform,
|
||||
@@ -50,17 +62,38 @@ describe('GracefulShutdown', () => {
|
||||
});
|
||||
});
|
||||
|
||||
afterAll(() => {
|
||||
if (PREVIOUS_DATA_DIR === undefined) {
|
||||
delete process.env.CLAUDE_MEM_DATA_DIR;
|
||||
} else {
|
||||
process.env.CLAUDE_MEM_DATA_DIR = PREVIOUS_DATA_DIR;
|
||||
}
|
||||
if (DATA_DIR === TEST_DATA_DIR) {
|
||||
// paths.ts froze on our per-file dir (this file evaluated it first):
|
||||
// empty it but keep the directory alive so later-loaded modules in this
|
||||
// process don't point at a deleted path.
|
||||
rmSync(TEST_DATA_DIR, { recursive: true, force: true });
|
||||
mkdirSync(TEST_DATA_DIR, { recursive: true });
|
||||
} else {
|
||||
rmSync(TEST_DATA_DIR, { recursive: true, force: true });
|
||||
}
|
||||
});
|
||||
|
||||
it('resolves the PID file and supervisor registry into a temp dir, never the real ~/.claude-mem', () => {
|
||||
const realDataDir = path.join(homedir(), '.claude-mem');
|
||||
expect(DATA_DIR).not.toBe(realDataDir);
|
||||
expect(PID_FILE.startsWith(realDataDir + path.sep)).toBe(false);
|
||||
expect(paths.supervisorRegistry().startsWith(realDataDir + path.sep)).toBe(false);
|
||||
});
|
||||
|
||||
describe('performGracefulShutdown', () => {
|
||||
// Timeout bumped to 15s. performGracefulShutdown calls
|
||||
// getSupervisor().stop() which runs runShutdownCascade against the real
|
||||
// ~/.claude-mem/supervisor.json registry. If the developer has a live
|
||||
// worker + chroma-mcp registered, the cascade SIGTERMs/SIGKILLs them
|
||||
// and waits up to ~5–6s for them to exit, which sails past the default
|
||||
// 5000ms test timeout. The other shutdown tests below are unaffected
|
||||
// because they don't register an mcpClient/dbManager/chromaMcpManager
|
||||
// mock that exercises the same path. This is test-infrastructure debt
|
||||
// — the test interacts with the production supervisor singleton — not
|
||||
// a code regression in the shutdown flow itself.
|
||||
// Timeout kept at 15s as headroom. performGracefulShutdown calls
|
||||
// getSupervisor().stop() which runs runShutdownCascade against
|
||||
// paths.supervisorRegistry() — since the Phase 6 data-dir isolation
|
||||
// above, that registry resolves into a temp dir (empty), so the cascade
|
||||
// no longer SIGTERMs the developer's real worker/chroma-mcp or waits on
|
||||
// their exit. The historic 5s overrun came from the test exercising the
|
||||
// REAL ~/.claude-mem/supervisor.json before isolation.
|
||||
it('should call shutdown steps in correct order', async () => {
|
||||
const callOrder: string[] = [];
|
||||
|
||||
@@ -127,12 +160,14 @@ describe('GracefulShutdown', () => {
|
||||
expect(callOrder.indexOf('chromaMcpManager.stop')).toBeLessThan(callOrder.indexOf('dbManager.close'));
|
||||
}, 15000);
|
||||
|
||||
it('should remove PID file during shutdown', async () => {
|
||||
it('should remove its OWN PID file during shutdown (owner guard)', async () => {
|
||||
const mockSessionManager: ShutdownableService = {
|
||||
shutdownAll: mock(async () => {})
|
||||
};
|
||||
|
||||
writePidFile({ pid: 99999, port: 37777, startedAt: new Date().toISOString() });
|
||||
// Phase 5 (worker-restart plan): the shutdown cascade deletes the PID
|
||||
// file only when this process owns it (recorded pid === process.pid).
|
||||
writePidFile({ pid: process.pid, port: 37777, startedAt: new Date().toISOString() });
|
||||
expect(existsSync(PID_FILE)).toBe(true);
|
||||
|
||||
const config: GracefulShutdownConfig = {
|
||||
@@ -145,6 +180,28 @@ describe('GracefulShutdown', () => {
|
||||
expect(existsSync(PID_FILE)).toBe(false);
|
||||
});
|
||||
|
||||
it('should spare another process\'s PID file during shutdown (restart successor)', async () => {
|
||||
const mockSessionManager: ShutdownableService = {
|
||||
shutdownAll: mock(async () => {})
|
||||
};
|
||||
|
||||
// A restart successor has already written its own PID file by the time
|
||||
// the dying worker's cascade runs — the dying worker must not clobber
|
||||
// it (Phase 5, worker-restart plan).
|
||||
writePidFile({ pid: 99999, port: 37777, startedAt: new Date().toISOString() });
|
||||
expect(existsSync(PID_FILE)).toBe(true);
|
||||
|
||||
const config: GracefulShutdownConfig = {
|
||||
server: null,
|
||||
sessionManager: mockSessionManager
|
||||
};
|
||||
|
||||
await performGracefulShutdown(config);
|
||||
|
||||
expect(existsSync(PID_FILE)).toBe(true);
|
||||
expect(readPidFile()!.pid).toBe(99999);
|
||||
});
|
||||
|
||||
it('should handle missing optional services gracefully', async () => {
|
||||
const mockSessionManager: ShutdownableService = {
|
||||
shutdownAll: mock(async () => {})
|
||||
|
||||
@@ -1,12 +1,26 @@
|
||||
import { describe, it, expect, beforeEach, afterEach } from 'bun:test';
|
||||
import { existsSync, readFileSync, mkdirSync, writeFileSync, rmSync, statSync } from 'fs';
|
||||
import { homedir } from 'os';
|
||||
import { tmpdir } from 'os';
|
||||
import { describe, it, expect, beforeEach, afterEach, afterAll } from 'bun:test';
|
||||
import { existsSync, readFileSync, mkdirSync, mkdtempSync, writeFileSync, rmSync, statSync } from 'fs';
|
||||
import { homedir, tmpdir } from 'os';
|
||||
import path from 'path';
|
||||
import {
|
||||
import type { PidInfo } from '../../src/services/infrastructure/index.js';
|
||||
|
||||
// ── Data-dir isolation (Phase 6, worker-restart plan) ──────────────────────
|
||||
// These tests write corrupt JSON and sentinel PIDs into the worker PID file,
|
||||
// so that file must NEVER be the real ~/.claude-mem/worker.pid. paths.ts
|
||||
// freezes DATA_DIR at first evaluation and ProcessManager freezes PID_FILE
|
||||
// from it at import time — and ESM hoists static imports above any env
|
||||
// assignment — so the env var is set FIRST and the code under test is loaded
|
||||
// with dynamic imports below. (`import type` above is erased at compile time
|
||||
// and loads nothing.)
|
||||
const TEST_DATA_DIR = mkdtempSync(path.join(tmpdir(), 'claude-mem-pm-test-'));
|
||||
const PREVIOUS_DATA_DIR = process.env.CLAUDE_MEM_DATA_DIR;
|
||||
process.env.CLAUDE_MEM_DATA_DIR = TEST_DATA_DIR;
|
||||
|
||||
const {
|
||||
writePidFile,
|
||||
readPidFile,
|
||||
removePidFile,
|
||||
removePidFileIfOwner,
|
||||
getPlatformTimeout,
|
||||
parseElapsedTime,
|
||||
isProcessAlive,
|
||||
@@ -18,28 +32,58 @@ import {
|
||||
runOneTimeChromaMigration,
|
||||
captureProcessStartToken,
|
||||
verifyPidFileOwnership,
|
||||
type PidInfo
|
||||
} from '../../src/services/infrastructure/index.js';
|
||||
} = await import('../../src/services/infrastructure/index.js');
|
||||
const { paths } = await import('../../src/shared/paths.js');
|
||||
|
||||
const DATA_DIR = path.join(homedir(), '.claude-mem');
|
||||
const PID_FILE = path.join(DATA_DIR, 'worker.pid');
|
||||
// If an earlier test file in this bun process already evaluated paths.ts, the
|
||||
// module cache wins and DATA_DIR stays frozen on that earlier value — which is
|
||||
// the preload tripwire's per-run temp dir (tests/preload.ts), never the real
|
||||
// ~/.claude-mem. Derive the paths the assertions use from the SAME frozen
|
||||
// module the code under test uses, so test and code can never diverge.
|
||||
const DATA_DIR = paths.dataDir();
|
||||
const PID_FILE = paths.workerPid();
|
||||
|
||||
describe('ProcessManager', () => {
|
||||
let originalPidContent: string | null = null;
|
||||
const REAL_DATA_DIR = path.join(homedir(), '.claude-mem');
|
||||
|
||||
beforeEach(() => {
|
||||
if (existsSync(PID_FILE)) {
|
||||
originalPidContent = readFileSync(PID_FILE, 'utf-8');
|
||||
}
|
||||
mkdirSync(DATA_DIR, { recursive: true });
|
||||
removePidFile();
|
||||
});
|
||||
|
||||
afterEach(() => {
|
||||
if (originalPidContent !== null) {
|
||||
writeFileSync(PID_FILE, originalPidContent);
|
||||
originalPidContent = null;
|
||||
removePidFile();
|
||||
});
|
||||
|
||||
afterAll(() => {
|
||||
if (PREVIOUS_DATA_DIR === undefined) {
|
||||
delete process.env.CLAUDE_MEM_DATA_DIR;
|
||||
} else {
|
||||
removePidFile();
|
||||
process.env.CLAUDE_MEM_DATA_DIR = PREVIOUS_DATA_DIR;
|
||||
}
|
||||
if (DATA_DIR === TEST_DATA_DIR) {
|
||||
// paths.ts froze on our per-file dir (this file evaluated it first):
|
||||
// empty it but keep the directory alive so later-loaded modules in this
|
||||
// process don't point at a deleted path.
|
||||
rmSync(TEST_DATA_DIR, { recursive: true, force: true });
|
||||
mkdirSync(TEST_DATA_DIR, { recursive: true });
|
||||
} else {
|
||||
rmSync(TEST_DATA_DIR, { recursive: true, force: true });
|
||||
}
|
||||
});
|
||||
|
||||
describe('test isolation (Phase 6, worker-restart plan)', () => {
|
||||
it('resolves the PID file into a temp dir, never the real ~/.claude-mem', () => {
|
||||
expect(DATA_DIR).not.toBe(REAL_DATA_DIR);
|
||||
expect(PID_FILE.startsWith(REAL_DATA_DIR + path.sep)).toBe(false);
|
||||
expect(PID_FILE).toBe(path.join(DATA_DIR, 'worker.pid'));
|
||||
});
|
||||
|
||||
it('writePidFile lands in the isolated dir', () => {
|
||||
writePidFile({ pid: 4242, port: 37777, startedAt: new Date().toISOString() });
|
||||
expect(existsSync(PID_FILE)).toBe(true);
|
||||
expect(readPidFile()!.pid).toBe(4242);
|
||||
});
|
||||
});
|
||||
|
||||
describe('writePidFile', () => {
|
||||
@@ -137,6 +181,68 @@ describe('ProcessManager', () => {
|
||||
});
|
||||
});
|
||||
|
||||
// Phase 5 (worker-restart plan): owner-or-dead guarded deletion. The CLI
|
||||
// stop/restart cleanup and the dying worker's restart handoff must never
|
||||
// delete a live successor's PID file.
|
||||
describe('removePidFileIfOwner', () => {
|
||||
it('deletes the file when the recorded pid matches the expected owner (even if alive)', () => {
|
||||
writePidFile({ pid: process.pid, port: 37777, startedAt: new Date().toISOString() });
|
||||
|
||||
removePidFileIfOwner(process.pid);
|
||||
|
||||
expect(existsSync(PID_FILE)).toBe(false);
|
||||
});
|
||||
|
||||
it('deletes the file when the recorded pid is dead, regardless of owner match', () => {
|
||||
writePidFile({ pid: 2147483647, port: 37777, startedAt: new Date().toISOString() });
|
||||
|
||||
removePidFileIfOwner(null);
|
||||
|
||||
expect(existsSync(PID_FILE)).toBe(false);
|
||||
});
|
||||
|
||||
it('spares the file when the recorded pid is a live, different process (restart successor)', () => {
|
||||
// This test process stands in for the live successor; pid 1 (init,
|
||||
// never this process) stands in for the worker the caller shut down.
|
||||
writePidFile({ pid: process.pid, port: 37777, startedAt: new Date().toISOString() });
|
||||
|
||||
removePidFileIfOwner(1);
|
||||
|
||||
expect(existsSync(PID_FILE)).toBe(true);
|
||||
expect(readPidFile()!.pid).toBe(process.pid);
|
||||
});
|
||||
|
||||
it('spares a corrupt file (ownership cannot be proven)', () => {
|
||||
writeFileSync(PID_FILE, 'not valid json {{{');
|
||||
|
||||
removePidFileIfOwner(process.pid);
|
||||
|
||||
expect(existsSync(PID_FILE)).toBe(true);
|
||||
});
|
||||
|
||||
it('deletes a parseable file with no pid field (treated as dead owner)', () => {
|
||||
// Valid JSON, but no `pid`: recorded.pid is undefined, so
|
||||
// isProcessAlive() is false and the owner-or-dead guard falls through
|
||||
// to removal. This intentionally diverges from the supervisor-side
|
||||
// removeOwnedPidFile, which spares pid-less files — that guard only
|
||||
// ever deletes its own file, while this helper may clean dead
|
||||
// leftovers. The divergence is safe: a pid-less file can't belong to a
|
||||
// live successor (writePidFile always records a pid).
|
||||
writeFileSync(PID_FILE, JSON.stringify({ port: 37777 }));
|
||||
|
||||
removePidFileIfOwner(null);
|
||||
|
||||
expect(existsSync(PID_FILE)).toBe(false);
|
||||
});
|
||||
|
||||
it('does not throw when the file is missing', () => {
|
||||
removePidFile();
|
||||
expect(existsSync(PID_FILE)).toBe(false);
|
||||
|
||||
expect(() => removePidFileIfOwner(process.pid)).not.toThrow();
|
||||
});
|
||||
});
|
||||
|
||||
describe('parseElapsedTime', () => {
|
||||
it('should parse MM:SS format', () => {
|
||||
expect(parseElapsedTime('05:30')).toBe(5);
|
||||
|
||||
@@ -276,13 +276,20 @@ function simulateInstall(_ide: string, scenario: Scenario): Outcome {
|
||||
describe('cross-IDE failure matrix (12 IDEs x 4 scenarios)', () => {
|
||||
const scenarios: Scenario[] = ['happy', 'eresolve', 'missing-uv', 'missing-bun'];
|
||||
|
||||
let prevMatrixDataDir: string | undefined;
|
||||
beforeEach(() => {
|
||||
prevMatrixDataDir = process.env.CLAUDE_MEM_DATA_DIR;
|
||||
process.env.CLAUDE_MEM_DATA_DIR = mkdtempSync(join(tmpdir(), 'cm-matrix-'));
|
||||
});
|
||||
afterEach(() => {
|
||||
const dir = process.env.CLAUDE_MEM_DATA_DIR;
|
||||
if (dir) rmSync(dir, { recursive: true, force: true });
|
||||
delete process.env.CLAUDE_MEM_DATA_DIR;
|
||||
// Restore (not delete): the preload tripwire (tests/preload.ts) pins a
|
||||
// per-run default temp dir, and unconditionally deleting the env var
|
||||
// would expose later test files to the real ~/.claude-mem fallback in
|
||||
// call-time resolvers.
|
||||
if (prevMatrixDataDir === undefined) delete process.env.CLAUDE_MEM_DATA_DIR;
|
||||
else process.env.CLAUDE_MEM_DATA_DIR = prevMatrixDataDir;
|
||||
});
|
||||
|
||||
it('produces 48 cells (12 IDEs x 4 scenarios)', () => {
|
||||
|
||||
@@ -1,4 +1,24 @@
|
||||
import { mock } from 'bun:test';
|
||||
import { mkdtempSync } from 'fs';
|
||||
import { tmpdir } from 'os';
|
||||
import { join } from 'path';
|
||||
|
||||
/**
|
||||
* Data-dir tripwire (Phase 6, worker-restart plan): no test may ever touch the
|
||||
* real ~/.claude-mem. src/shared/paths.ts freezes DATA_DIR at first evaluation
|
||||
* (env CLAUDE_MEM_DATA_DIR wins), and module-level consts like ProcessManager's
|
||||
* PID_FILE inherit that frozen value — so the env var must point at a safe
|
||||
* directory BEFORE any module loads. This preload runs first (bunfig.toml
|
||||
* [test].preload), so when the env var is unset we pin it to a fresh per-run
|
||||
* temp dir. Tests that want tighter isolation still override it per-file /
|
||||
* per-test; this only fills the default so nothing can fall through to the
|
||||
* real data dir. The leaked temp dir per run is deliberate: correctness over
|
||||
* cleanup (an afterAll here could rip the dir out from under frozen module
|
||||
* constants while later test files still run).
|
||||
*/
|
||||
if (!process.env.CLAUDE_MEM_DATA_DIR) {
|
||||
process.env.CLAUDE_MEM_DATA_DIR = mkdtempSync(join(tmpdir(), 'claude-mem-test-run-'));
|
||||
}
|
||||
|
||||
/**
|
||||
* Global posthog-node mock, registered via bunfig.toml [test].preload BEFORE
|
||||
|
||||
176
tests/services/worker-restart-verify.test.ts
Normal file
176
tests/services/worker-restart-verify.test.ts
Normal file
@@ -0,0 +1,176 @@
|
||||
import { describe, it, expect, beforeEach, afterEach, mock } from 'bun:test';
|
||||
import { verifyRestartedWorker, getCurrentWorkerPid } from '../../src/services/restart-verify.js';
|
||||
|
||||
// verifyRestartedWorker lives in src/services/restart-verify.ts (not
|
||||
// worker-service.ts) precisely so this test can import it without triggering
|
||||
// worker-service.ts's top-level side effects (isMainModule bootstrap, bun:sqlite,
|
||||
// MCP SDK, telemetry).
|
||||
|
||||
const EXPECTED_VERSION = '13.5.5-test';
|
||||
const OLD_PID = 11111;
|
||||
const NEW_PID = 22222;
|
||||
const PORT = 45678; // arbitrary; port is always injected, never resolved here
|
||||
|
||||
// Record every HTTP call the verifier makes (same fetchLog pattern as
|
||||
// tests/shared/worker-utils-version-recycle.test.ts).
|
||||
const fetchLog: Array<{ url: string; method: string }> = [];
|
||||
|
||||
// Each test sets this to script what /api/health reports per call.
|
||||
// 'unreachable' rejects the fetch like a connection refusal. The
|
||||
// `{ status, body }` form scripts a non-200 response (e.g. 503 degraded).
|
||||
let healthResponder: (callIndex: number) =>
|
||||
| { pid?: number; version?: string }
|
||||
| { status: number; body: { pid?: number; version?: string } }
|
||||
| 'unreachable';
|
||||
|
||||
function installFetchMock(): void {
|
||||
fetchLog.length = 0;
|
||||
let callIndex = 0;
|
||||
global.fetch = mock((url: string | URL | Request, init?: RequestInit) => {
|
||||
const u = typeof url === 'string' ? url : url.toString();
|
||||
const method = (init?.method ?? 'GET').toUpperCase();
|
||||
fetchLog.push({ url: u, method });
|
||||
|
||||
const scripted = healthResponder(callIndex++);
|
||||
if (scripted === 'unreachable') {
|
||||
return Promise.reject(new Error('connect ECONNREFUSED'));
|
||||
}
|
||||
const status = 'body' in scripted ? scripted.status : 200;
|
||||
const body = 'body' in scripted ? scripted.body : scripted;
|
||||
return Promise.resolve({
|
||||
ok: status >= 200 && status < 300,
|
||||
status,
|
||||
text: () => Promise.resolve(JSON.stringify(body)),
|
||||
json: () => Promise.resolve(body),
|
||||
} as unknown as Response);
|
||||
}) as unknown as typeof fetch;
|
||||
}
|
||||
|
||||
// Short injectable deadline + poll interval so every test completes fast.
|
||||
const FAST = { pollIntervalMs: 10, requestTimeoutMs: 100 };
|
||||
const DEADLINE_MS = 300;
|
||||
|
||||
describe('verifyRestartedWorker — restart must prove itself', () => {
|
||||
const originalFetch = global.fetch;
|
||||
|
||||
beforeEach(() => {
|
||||
installFetchMock();
|
||||
});
|
||||
|
||||
afterEach(() => {
|
||||
global.fetch = originalFetch;
|
||||
});
|
||||
|
||||
it('succeeds when health flips to the new pid with the expected version', async () => {
|
||||
// First poll still sees the old worker, then the new one comes up.
|
||||
healthResponder = i =>
|
||||
i === 0
|
||||
? { pid: OLD_PID, version: EXPECTED_VERSION }
|
||||
: { pid: NEW_PID, version: EXPECTED_VERSION };
|
||||
|
||||
const result = await verifyRestartedWorker(PORT, OLD_PID, EXPECTED_VERSION, DEADLINE_MS, FAST);
|
||||
|
||||
expect(result.ok).toBe(true);
|
||||
if (result.ok) {
|
||||
expect(result.pid).toBe(NEW_PID);
|
||||
expect(result.version).toBe(EXPECTED_VERSION);
|
||||
}
|
||||
// It polled /api/health (only pid + version are read — no /api/version).
|
||||
expect(fetchLog.length).toBeGreaterThanOrEqual(2);
|
||||
expect(fetchLog.every(c => c.url.includes('/api/health') && c.method === 'GET')).toBe(true);
|
||||
});
|
||||
|
||||
it('succeeds when health answers 503 (degraded) but reports the new pid and expected version', async () => {
|
||||
// /api/health returns 503 when the queue is degraded but still includes
|
||||
// pid/version — a degraded-but-booted worker still proves the restart.
|
||||
healthResponder = () => ({ status: 503, body: { pid: NEW_PID, version: EXPECTED_VERSION } });
|
||||
|
||||
const result = await verifyRestartedWorker(PORT, OLD_PID, EXPECTED_VERSION, DEADLINE_MS, FAST);
|
||||
|
||||
expect(result.ok).toBe(true);
|
||||
if (result.ok) {
|
||||
expect(result.pid).toBe(NEW_PID);
|
||||
expect(result.version).toBe(EXPECTED_VERSION);
|
||||
}
|
||||
});
|
||||
|
||||
it('succeeds on version alone when no previous worker existed (oldPid null)', async () => {
|
||||
// getCurrentWorkerPid returned null (nothing was listening before the
|
||||
// restart), so any pid counts — verification only requires the version.
|
||||
healthResponder = () => ({ pid: NEW_PID, version: EXPECTED_VERSION });
|
||||
|
||||
const result = await verifyRestartedWorker(PORT, null, EXPECTED_VERSION, DEADLINE_MS, FAST);
|
||||
|
||||
expect(result.ok).toBe(true);
|
||||
if (result.ok) {
|
||||
expect(result.pid).toBe(NEW_PID);
|
||||
expect(result.version).toBe(EXPECTED_VERSION);
|
||||
}
|
||||
});
|
||||
|
||||
it('fails when health keeps returning the stale (old) pid', async () => {
|
||||
healthResponder = () => ({ pid: OLD_PID, version: EXPECTED_VERSION });
|
||||
|
||||
const result = await verifyRestartedWorker(PORT, OLD_PID, EXPECTED_VERSION, DEADLINE_MS, FAST);
|
||||
|
||||
expect(result.ok).toBe(false);
|
||||
if (!result.ok) {
|
||||
expect(result.lastObserved).toContain(String(OLD_PID));
|
||||
// A live (stale) worker is serving — callers skip the port-free wait.
|
||||
expect(result.lastPollSawHealth).toBe(true);
|
||||
}
|
||||
});
|
||||
|
||||
it('fails when the new worker reports the wrong version', async () => {
|
||||
healthResponder = () => ({ pid: NEW_PID, version: '0.0.1-stale' });
|
||||
|
||||
const result = await verifyRestartedWorker(PORT, OLD_PID, EXPECTED_VERSION, DEADLINE_MS, FAST);
|
||||
|
||||
expect(result.ok).toBe(false);
|
||||
if (!result.ok) {
|
||||
expect(result.lastObserved).toContain('0.0.1-stale');
|
||||
// A live (wrong-version) worker is serving — callers skip the port-free wait.
|
||||
expect(result.lastPollSawHealth).toBe(true);
|
||||
}
|
||||
});
|
||||
|
||||
it('fails on timeout when health is unreachable, reporting the connection error', async () => {
|
||||
healthResponder = () => 'unreachable';
|
||||
|
||||
const start = Date.now();
|
||||
const result = await verifyRestartedWorker(PORT, OLD_PID, EXPECTED_VERSION, DEADLINE_MS, FAST);
|
||||
const elapsed = Date.now() - start;
|
||||
|
||||
expect(result.ok).toBe(false);
|
||||
if (!result.ok) {
|
||||
expect(result.lastObserved).toContain('connection error');
|
||||
expect(result.lastObserved).toContain('ECONNREFUSED');
|
||||
// Nothing is serving on the port — callers may wait for it to free.
|
||||
expect(result.lastPollSawHealth).toBe(false);
|
||||
}
|
||||
// Hard cap: the deadline bounds the wait (generous slack for CI).
|
||||
expect(elapsed).toBeLessThan(DEADLINE_MS + 1000);
|
||||
});
|
||||
});
|
||||
|
||||
describe('getCurrentWorkerPid — old-pid capture before shutdown', () => {
|
||||
const originalFetch = global.fetch;
|
||||
|
||||
beforeEach(() => {
|
||||
installFetchMock();
|
||||
});
|
||||
|
||||
afterEach(() => {
|
||||
global.fetch = originalFetch;
|
||||
});
|
||||
|
||||
it('returns the running worker pid from /api/health', async () => {
|
||||
healthResponder = () => ({ pid: OLD_PID, version: EXPECTED_VERSION });
|
||||
expect(await getCurrentWorkerPid(PORT, 100)).toBe(OLD_PID);
|
||||
});
|
||||
|
||||
it('returns null when no worker is reachable', async () => {
|
||||
healthResponder = () => 'unreachable';
|
||||
expect(await getCurrentWorkerPid(PORT, 100)).toBeNull();
|
||||
});
|
||||
});
|
||||
225
tests/services/worker-shutdown-sequence.test.ts
Normal file
225
tests/services/worker-shutdown-sequence.test.ts
Normal file
@@ -0,0 +1,225 @@
|
||||
import { describe, it, expect } from 'bun:test';
|
||||
import { runShutdownSequence, type ShutdownSequenceOptions, type WorkerShutdownReason } from '../../src/services/worker-shutdown.js';
|
||||
|
||||
// runShutdownSequence lives in src/services/worker-shutdown.ts (not
|
||||
// worker-service.ts) precisely so this test can import it without triggering
|
||||
// worker-service.ts's top-level side effects (isMainModule bootstrap,
|
||||
// bun:sqlite, MCP SDK, telemetry) — same seam precedent as restart-verify.ts.
|
||||
// WorkerService.shutdown() delegates to this function with its real deps, so
|
||||
// these tests exercise the production guard/deadline/handoff logic directly.
|
||||
|
||||
const PORT = 45678;
|
||||
const SCRIPT = '/marketplace/plugin/scripts/worker-service.cjs';
|
||||
|
||||
interface Harness {
|
||||
options: ShutdownSequenceOptions;
|
||||
guard: { shuttingDown: boolean };
|
||||
calls: string[]; // ordered event log
|
||||
counters: {
|
||||
beforeGraceful: number;
|
||||
graceful: number;
|
||||
waitForPortFree: number;
|
||||
removePidFile: number;
|
||||
spawnDaemon: number;
|
||||
};
|
||||
spawnArgs: Array<{ scriptPath: string; port: number }>;
|
||||
}
|
||||
|
||||
function makeHarness(overrides: {
|
||||
reason?: WorkerShutdownReason;
|
||||
gracefulDeadlineMs?: number;
|
||||
beforeGracefulThrows?: boolean;
|
||||
graceful?: () => Promise<void>;
|
||||
portFree?: boolean;
|
||||
spawnResult?: number | undefined;
|
||||
spawnThrows?: boolean;
|
||||
} = {}): Harness {
|
||||
const guard = { shuttingDown: false };
|
||||
const calls: string[] = [];
|
||||
const counters = {
|
||||
beforeGraceful: 0,
|
||||
graceful: 0,
|
||||
waitForPortFree: 0,
|
||||
removePidFile: 0,
|
||||
spawnDaemon: 0,
|
||||
};
|
||||
const spawnArgs: Array<{ scriptPath: string; port: number }> = [];
|
||||
|
||||
const options: ShutdownSequenceOptions = {
|
||||
reason: overrides.reason ?? 'stop',
|
||||
isShuttingDown: () => guard.shuttingDown,
|
||||
markShuttingDown: () => { guard.shuttingDown = true; },
|
||||
beforeGracefulShutdown: async () => {
|
||||
counters.beforeGraceful++;
|
||||
calls.push('beforeGraceful');
|
||||
if (overrides.beforeGracefulThrows) {
|
||||
throw new Error('telemetry flush failed');
|
||||
}
|
||||
},
|
||||
performGracefulShutdown: () => {
|
||||
counters.graceful++;
|
||||
calls.push('graceful');
|
||||
return overrides.graceful ? overrides.graceful() : Promise.resolve();
|
||||
},
|
||||
gracefulDeadlineMs: overrides.gracefulDeadlineMs ?? 1000,
|
||||
restartHandoff: {
|
||||
port: PORT,
|
||||
portFreeTimeoutMs: 1000,
|
||||
resolveSuccessorScript: () => SCRIPT,
|
||||
waitForPortFree: async (port: number) => {
|
||||
counters.waitForPortFree++;
|
||||
calls.push(`waitForPortFree:${port}`);
|
||||
return overrides.portFree ?? true;
|
||||
},
|
||||
removePidFile: () => {
|
||||
counters.removePidFile++;
|
||||
calls.push('removePidFile');
|
||||
},
|
||||
spawnDaemon: (scriptPath: string, port: number) => {
|
||||
counters.spawnDaemon++;
|
||||
calls.push('spawnDaemon');
|
||||
spawnArgs.push({ scriptPath, port });
|
||||
if (overrides.spawnThrows) {
|
||||
throw new Error('Supervisor is shutting down, refusing to spawn worker daemon');
|
||||
}
|
||||
return 'spawnResult' in overrides ? overrides.spawnResult : 9999;
|
||||
},
|
||||
},
|
||||
};
|
||||
|
||||
return { options, guard, calls, counters, spawnArgs };
|
||||
}
|
||||
|
||||
describe('runShutdownSequence — re-entrancy guard', () => {
|
||||
it('runs performGracefulShutdown exactly once when shutdown is invoked twice', async () => {
|
||||
const h = makeHarness({ reason: 'stop' });
|
||||
|
||||
await runShutdownSequence(h.options);
|
||||
await runShutdownSequence(h.options); // re-entrant call: must be a no-op
|
||||
|
||||
expect(h.counters.graceful).toBe(1);
|
||||
expect(h.counters.beforeGraceful).toBe(1);
|
||||
expect(h.guard.shuttingDown).toBe(true);
|
||||
});
|
||||
|
||||
it('blocks a concurrent second invocation (guard is set synchronously at entry)', async () => {
|
||||
const h = makeHarness({
|
||||
reason: 'stop',
|
||||
// Graceful takes a tick so the second call overlaps the first.
|
||||
graceful: () => new Promise(resolve => setTimeout(resolve, 20)),
|
||||
});
|
||||
|
||||
await Promise.all([
|
||||
runShutdownSequence(h.options),
|
||||
runShutdownSequence(h.options),
|
||||
]);
|
||||
|
||||
expect(h.counters.graceful).toBe(1);
|
||||
expect(h.counters.beforeGraceful).toBe(1);
|
||||
});
|
||||
});
|
||||
|
||||
describe('runShutdownSequence — pre-graceful bookkeeping guard', () => {
|
||||
it('proceeds to graceful shutdown and the restart handoff when beforeGracefulShutdown throws', async () => {
|
||||
const h = makeHarness({ reason: 'restart', beforeGracefulThrows: true });
|
||||
|
||||
await runShutdownSequence(h.options); // must not throw
|
||||
|
||||
// Bookkeeping failure is logged and skipped; the sequence still drains
|
||||
// gracefully and still hands off to the successor.
|
||||
expect(h.counters.beforeGraceful).toBe(1);
|
||||
expect(h.counters.graceful).toBe(1);
|
||||
expect(h.counters.waitForPortFree).toBe(1);
|
||||
expect(h.counters.spawnDaemon).toBe(1);
|
||||
});
|
||||
});
|
||||
|
||||
describe('runShutdownSequence — graceful-shutdown deadline', () => {
|
||||
it('proceeds when performGracefulShutdown never resolves (hard deadline)', async () => {
|
||||
const h = makeHarness({
|
||||
reason: 'restart',
|
||||
gracefulDeadlineMs: 50,
|
||||
graceful: () => new Promise<void>(() => { /* hangs forever — unbounded session drain */ }),
|
||||
});
|
||||
|
||||
const start = Date.now();
|
||||
await runShutdownSequence(h.options);
|
||||
const elapsed = Date.now() - start;
|
||||
|
||||
// Deadlined and continued into the restart handoff anyway.
|
||||
expect(elapsed).toBeLessThan(2000);
|
||||
expect(h.counters.waitForPortFree).toBe(1);
|
||||
expect(h.counters.spawnDaemon).toBe(1);
|
||||
});
|
||||
|
||||
it('proceeds (and does not reject) when performGracefulShutdown rejects', async () => {
|
||||
const h = makeHarness({
|
||||
reason: 'restart',
|
||||
graceful: () => Promise.reject(new Error('db close failed')),
|
||||
});
|
||||
|
||||
await runShutdownSequence(h.options); // must not throw
|
||||
|
||||
expect(h.counters.spawnDaemon).toBe(1);
|
||||
});
|
||||
});
|
||||
|
||||
describe('runShutdownSequence — restart successor handoff', () => {
|
||||
it('spawns the successor only AFTER the port is confirmed free (restart)', async () => {
|
||||
const h = makeHarness({ reason: 'restart', portFree: true });
|
||||
|
||||
await runShutdownSequence(h.options);
|
||||
|
||||
expect(h.counters.spawnDaemon).toBe(1);
|
||||
expect(h.spawnArgs[0]).toEqual({ scriptPath: SCRIPT, port: PORT });
|
||||
// Ordering: graceful → port-free confirmation → pid-file cleanup → spawn.
|
||||
const order = h.calls;
|
||||
expect(order.indexOf(`waitForPortFree:${PORT}`)).toBeGreaterThan(order.indexOf('graceful'));
|
||||
expect(order.indexOf('removePidFile')).toBeGreaterThan(order.indexOf(`waitForPortFree:${PORT}`));
|
||||
expect(order.indexOf('spawnDaemon')).toBeGreaterThan(order.indexOf('removePidFile'));
|
||||
});
|
||||
|
||||
it('never spawns when the port never frees', async () => {
|
||||
const h = makeHarness({ reason: 'restart', portFree: false });
|
||||
|
||||
await runShutdownSequence(h.options);
|
||||
|
||||
expect(h.counters.waitForPortFree).toBe(1);
|
||||
expect(h.counters.removePidFile).toBe(0);
|
||||
expect(h.counters.spawnDaemon).toBe(0);
|
||||
});
|
||||
|
||||
it("stays kill-only for reason 'stop'", async () => {
|
||||
const h = makeHarness({ reason: 'stop' });
|
||||
|
||||
await runShutdownSequence(h.options);
|
||||
|
||||
expect(h.counters.waitForPortFree).toBe(0);
|
||||
expect(h.counters.spawnDaemon).toBe(0);
|
||||
});
|
||||
|
||||
it("stays kill-only for reason 'signal'", async () => {
|
||||
const h = makeHarness({ reason: 'signal' });
|
||||
|
||||
await runShutdownSequence(h.options);
|
||||
|
||||
expect(h.counters.waitForPortFree).toBe(0);
|
||||
expect(h.counters.spawnDaemon).toBe(0);
|
||||
});
|
||||
|
||||
it('completes (logging loudly, not throwing) when spawnDaemon returns undefined', async () => {
|
||||
const h = makeHarness({ reason: 'restart', spawnResult: undefined });
|
||||
|
||||
await runShutdownSequence(h.options); // must not throw
|
||||
|
||||
expect(h.counters.spawnDaemon).toBe(1);
|
||||
});
|
||||
|
||||
it('completes when spawnDaemon throws (supervisor refusing mid-cascade)', async () => {
|
||||
const h = makeHarness({ reason: 'restart', spawnThrows: true });
|
||||
|
||||
await runShutdownSequence(h.options); // must not throw
|
||||
|
||||
expect(h.counters.spawnDaemon).toBe(1);
|
||||
});
|
||||
});
|
||||
@@ -8,14 +8,27 @@ import { SettingsDefaultsManager } from '../../src/shared/SettingsDefaultsManage
|
||||
describe('SettingsDefaultsManager', () => {
|
||||
let tempDir: string;
|
||||
let settingsPath: string;
|
||||
let prevDataDirEnv: string | undefined;
|
||||
|
||||
beforeEach(() => {
|
||||
tempDir = join(tmpdir(), `settings-test-${Date.now()}-${Math.random().toString(36).slice(2)}`);
|
||||
mkdirSync(tempDir, { recursive: true });
|
||||
settingsPath = join(tempDir, 'settings.json');
|
||||
|
||||
// The preload tripwire (tests/preload.ts) pins CLAUDE_MEM_DATA_DIR for
|
||||
// the whole run, and loadFromFile applies env overrides on top of file
|
||||
// values — which would make every loadFromFile result diverge from
|
||||
// getAllDefaults()'s hardcoded ~/.claude-mem default. These tests are
|
||||
// about file > defaults behavior on an EXPLICIT settingsPath (no real
|
||||
// data-dir I/O happens here), so drop the env override for their
|
||||
// duration and restore it after.
|
||||
prevDataDirEnv = process.env.CLAUDE_MEM_DATA_DIR;
|
||||
delete process.env.CLAUDE_MEM_DATA_DIR;
|
||||
});
|
||||
|
||||
afterEach(() => {
|
||||
if (prevDataDirEnv === undefined) delete process.env.CLAUDE_MEM_DATA_DIR;
|
||||
else process.env.CLAUDE_MEM_DATA_DIR = prevDataDirEnv;
|
||||
try {
|
||||
rmSync(tempDir, { recursive: true, force: true });
|
||||
} catch {
|
||||
@@ -258,6 +271,40 @@ describe('SettingsDefaultsManager', () => {
|
||||
});
|
||||
});
|
||||
|
||||
describe('stdout discipline', () => {
|
||||
// CLI commands like `start` promise machine-readable JSON on stdout to
|
||||
// the hook framework; settings bootstrap runs inside them, so its
|
||||
// informational notices must go to stderr. PR #2894 CI caught the
|
||||
// creation notice corrupting the start command's JSON on first boot in
|
||||
// a fresh data dir.
|
||||
it('should not write to stdout when creating the settings file', () => {
|
||||
const stdoutCalls: unknown[][] = [];
|
||||
const originalLog = console.log;
|
||||
console.log = (...args: unknown[]) => { stdoutCalls.push(args); };
|
||||
try {
|
||||
expect(existsSync(settingsPath)).toBe(false);
|
||||
SettingsDefaultsManager.loadFromFile(settingsPath);
|
||||
expect(existsSync(settingsPath)).toBe(true);
|
||||
expect(stdoutCalls).toEqual([]);
|
||||
} finally {
|
||||
console.log = originalLog;
|
||||
}
|
||||
});
|
||||
|
||||
it('should not write to stdout when migrating a nested-schema file', () => {
|
||||
writeFileSync(settingsPath, JSON.stringify({ env: { CLAUDE_MEM_MODEL: 'nested-model' } }));
|
||||
const stdoutCalls: unknown[][] = [];
|
||||
const originalLog = console.log;
|
||||
console.log = (...args: unknown[]) => { stdoutCalls.push(args); };
|
||||
try {
|
||||
SettingsDefaultsManager.loadFromFile(settingsPath);
|
||||
expect(stdoutCalls).toEqual([]);
|
||||
} finally {
|
||||
console.log = originalLog;
|
||||
}
|
||||
});
|
||||
});
|
||||
|
||||
describe('getAllDefaults', () => {
|
||||
it('should return a copy of defaults', () => {
|
||||
const defaults1 = SettingsDefaultsManager.getAllDefaults();
|
||||
|
||||
128
tests/shared/worker-spawn-gate.test.ts
Normal file
128
tests/shared/worker-spawn-gate.test.ts
Normal file
@@ -0,0 +1,128 @@
|
||||
import { describe, it, expect, beforeEach, afterEach } from 'bun:test';
|
||||
import { existsSync, mkdtempSync, readFileSync, rmSync, utimesSync, writeFileSync } from 'fs';
|
||||
import { tmpdir } from 'os';
|
||||
import { join } from 'path';
|
||||
|
||||
// Eagerly evaluate src/shared/paths.ts BEFORE any per-test env override:
|
||||
// paths.ts freezes its DATA_DIR const at first evaluation, and without this
|
||||
// import the dynamic imports inside these tests can be the first to evaluate
|
||||
// it — while the env var points at a soon-deleted per-test temp dir — which
|
||||
// poisons every later-loaded module in the same bun process (e.g.
|
||||
// ProcessManager's PID_FILE in combined runs). At this point the env var is
|
||||
// the per-RUN temp dir pinned by the preload tripwire (tests/preload.ts), so
|
||||
// paths.ts freezes on a stable, isolated dir that outlives this file.
|
||||
// The module under test is unaffected: it resolves its lock path at call time
|
||||
// via resolveDataDir(), not via paths.ts's frozen const.
|
||||
import '../../src/shared/paths.js';
|
||||
|
||||
// The spawn gate's lock path comes from resolveDataDir() (src/shared/paths.ts),
|
||||
// which consults CLAUDE_MEM_DATA_DIR — so the env var MUST point at the temp
|
||||
// dir BEFORE the gate module is imported/exercised. The cache-busted dynamic
|
||||
// import follows the worker-utils test idiom
|
||||
// (tests/shared/worker-utils-version-recycle.test.ts).
|
||||
const ORIGINAL_DATA_DIR = process.env.CLAUDE_MEM_DATA_DIR;
|
||||
|
||||
async function importGateFresh() {
|
||||
return import(`../../src/shared/worker-spawn-gate.js?spawn-gate=${Date.now()}-${Math.random()}`);
|
||||
}
|
||||
|
||||
describe('worker-spawn-gate — cross-launcher spawn lockfile', () => {
|
||||
let tempDir: string;
|
||||
let lockPath: string;
|
||||
|
||||
beforeEach(() => {
|
||||
tempDir = mkdtempSync(join(tmpdir(), 'claude-mem-spawn-gate-'));
|
||||
process.env.CLAUDE_MEM_DATA_DIR = tempDir;
|
||||
lockPath = join(tempDir, 'spawn.lock');
|
||||
});
|
||||
|
||||
afterEach(() => {
|
||||
if (ORIGINAL_DATA_DIR === undefined) {
|
||||
delete process.env.CLAUDE_MEM_DATA_DIR;
|
||||
} else {
|
||||
process.env.CLAUDE_MEM_DATA_DIR = ORIGINAL_DATA_DIR;
|
||||
}
|
||||
rmSync(tempDir, { recursive: true, force: true });
|
||||
});
|
||||
|
||||
it('second acquire fails while the lock is held', async () => {
|
||||
const { acquireSpawnLock } = await importGateFresh();
|
||||
|
||||
expect(acquireSpawnLock()).toBe(true);
|
||||
expect(existsSync(lockPath)).toBe(true);
|
||||
|
||||
// A fresh lock is honored: the loser must skip its spawn (and wait).
|
||||
expect(acquireSpawnLock()).toBe(false);
|
||||
|
||||
// The original lock survives the failed attempt.
|
||||
const lock = JSON.parse(readFileSync(lockPath, 'utf-8'));
|
||||
expect(lock.pid).toBe(process.pid);
|
||||
});
|
||||
|
||||
it('breaks a stale lock (mtime backdated >60s) and re-acquires', async () => {
|
||||
const { acquireSpawnLock } = await importGateFresh();
|
||||
|
||||
// A crashed launcher's leftover lock, last touched 61s ago. This also
|
||||
// exercises the re-stat-before-unlink guard's happy path: nothing races
|
||||
// us, so the second stat sees the same mtime and the break proceeds.
|
||||
writeFileSync(
|
||||
lockPath,
|
||||
JSON.stringify({ pid: 999_999_999, startedAt: new Date(Date.now() - 61_000).toISOString() })
|
||||
);
|
||||
const past = new Date(Date.now() - 61_000);
|
||||
utimesSync(lockPath, past, past);
|
||||
|
||||
expect(acquireSpawnLock()).toBe(true);
|
||||
|
||||
// The broken lock was replaced with OUR lock.
|
||||
const lock = JSON.parse(readFileSync(lockPath, 'utf-8'));
|
||||
expect(lock.pid).toBe(process.pid);
|
||||
});
|
||||
|
||||
it('honors a 45s-old lock — within the 60s staleness window (regression: was broken under the old 30s threshold)', async () => {
|
||||
const { acquireSpawnLock } = await importGateFresh();
|
||||
|
||||
// 45s covers the worst legitimate in-lock wait: the ~15s post-spawn
|
||||
// health wait scaled 2.0x by getPlatformTimeout on Windows (~30s). A
|
||||
// 30s staleness window would break this holder's lock mid-spawn.
|
||||
const foreignPayload = JSON.stringify({
|
||||
pid: 999_999_999,
|
||||
startedAt: new Date(Date.now() - 45_000).toISOString(),
|
||||
});
|
||||
writeFileSync(lockPath, foreignPayload);
|
||||
const past = new Date(Date.now() - 45_000);
|
||||
utimesSync(lockPath, past, past);
|
||||
|
||||
expect(acquireSpawnLock()).toBe(false);
|
||||
|
||||
// The holder's lock survives untouched.
|
||||
expect(readFileSync(lockPath, 'utf-8')).toBe(foreignPayload);
|
||||
});
|
||||
|
||||
it('release is owner-only: a foreign lock survives releaseSpawnLock', async () => {
|
||||
const { releaseSpawnLock } = await importGateFresh();
|
||||
|
||||
const foreignPayload = JSON.stringify({
|
||||
pid: process.pid + 1,
|
||||
startedAt: new Date().toISOString(),
|
||||
});
|
||||
writeFileSync(lockPath, foreignPayload);
|
||||
|
||||
releaseSpawnLock();
|
||||
|
||||
expect(existsSync(lockPath)).toBe(true);
|
||||
expect(readFileSync(lockPath, 'utf-8')).toBe(foreignPayload);
|
||||
});
|
||||
|
||||
it('release after own acquire removes the lock file (and it can be re-acquired)', async () => {
|
||||
const { acquireSpawnLock, releaseSpawnLock } = await importGateFresh();
|
||||
|
||||
expect(acquireSpawnLock()).toBe(true);
|
||||
expect(existsSync(lockPath)).toBe(true);
|
||||
|
||||
releaseSpawnLock();
|
||||
expect(existsSync(lockPath)).toBe(false);
|
||||
|
||||
expect(acquireSpawnLock()).toBe(true);
|
||||
});
|
||||
});
|
||||
164
tests/shared/worker-utils-recycle-successor.test.ts
Normal file
164
tests/shared/worker-utils-recycle-successor.test.ts
Normal file
@@ -0,0 +1,164 @@
|
||||
import { describe, it, expect, beforeEach, afterEach, afterAll, mock } from 'bun:test';
|
||||
import { mkdtempSync, rmSync } from 'fs';
|
||||
import { tmpdir } from 'os';
|
||||
import { join } from 'path';
|
||||
import * as realInfrastructure from '../../src/services/infrastructure/index.js';
|
||||
import * as realSupervisor from '../../src/supervisor/index.js';
|
||||
import * as realSpawn from '../../src/shared/spawn.js';
|
||||
|
||||
const realInfrastructureSnapshot = { ...realInfrastructure };
|
||||
const realSupervisorSnapshot = { ...realSupervisor };
|
||||
const realSpawnSnapshot = { ...realSpawn };
|
||||
|
||||
// After the hook POSTs /api/admin/restart the OLD worker spawns its own
|
||||
// successor (src/services/worker-shutdown.ts; see
|
||||
// plans/2026-06-10-worker-restart-single-source-of-truth.md). The hook must
|
||||
// WAIT for that successor on /api/health instead of immediately lazy-spawning
|
||||
// into the dying worker — lazy-spawn remains only as the safety net when no
|
||||
// successor ever appears.
|
||||
|
||||
const PLUGIN_VERSION = '13.4.0';
|
||||
const STALE_VERSION = '13.3.0';
|
||||
|
||||
// Record every HTTP call (same fetchLog pattern as
|
||||
// tests/shared/worker-utils-version-recycle.test.ts).
|
||||
const fetchLog: Array<{ url: string; method: string }> = [];
|
||||
|
||||
// Scripts the version /api/health reports per health call (0-based index).
|
||||
// When the script array is exhausted the last entry repeats.
|
||||
let healthVersionScript: string[] = [PLUGIN_VERSION];
|
||||
|
||||
// Records every spawn attempt — the seam the lazy-spawn fallback goes
|
||||
// through (spawnHidden in src/shared/spawn.ts).
|
||||
const spawnCalls: Array<{ command: string; args: string[] }> = [];
|
||||
|
||||
// The stale worker on the port: alive (health ok) and version-mismatched.
|
||||
mock.module('../../src/services/infrastructure/index.js', () => ({
|
||||
checkVersionMatch: () => Promise.resolve({
|
||||
matches: false,
|
||||
pluginVersion: PLUGIN_VERSION,
|
||||
workerVersion: STALE_VERSION,
|
||||
}),
|
||||
}));
|
||||
|
||||
mock.module('../../src/supervisor/index.js', () => ({
|
||||
validateWorkerPidFile: () => 'alive',
|
||||
}));
|
||||
|
||||
mock.module('../../src/shared/spawn.js', () => ({
|
||||
spawnHidden: (command: string, args: string[]) => {
|
||||
spawnCalls.push({ command, args });
|
||||
return { pid: 4242, unref: () => {} };
|
||||
},
|
||||
}));
|
||||
|
||||
async function importWorkerUtilsFresh() {
|
||||
return import(`../../src/shared/worker-utils.js?worker-utils-recycle-successor=${Date.now()}-${Math.random()}`);
|
||||
}
|
||||
|
||||
function installFetchMock(): void {
|
||||
fetchLog.length = 0;
|
||||
let healthCallIndex = 0;
|
||||
global.fetch = mock((url: string | URL | Request, init?: RequestInit) => {
|
||||
const u = typeof url === 'string' ? url : url.toString();
|
||||
const method = (init?.method ?? 'GET').toUpperCase();
|
||||
fetchLog.push({ url: u, method });
|
||||
|
||||
let body: Record<string, unknown> = {};
|
||||
if (u.includes('/api/health')) {
|
||||
const version = healthVersionScript[Math.min(healthCallIndex, healthVersionScript.length - 1)];
|
||||
healthCallIndex++;
|
||||
body = { version };
|
||||
}
|
||||
|
||||
// /api/readiness and /api/admin/restart answer plain 200 OK.
|
||||
return Promise.resolve({
|
||||
ok: true,
|
||||
status: 200,
|
||||
text: () => Promise.resolve(JSON.stringify(body)),
|
||||
json: () => Promise.resolve(body),
|
||||
} as unknown as Response);
|
||||
}) as unknown as typeof fetch;
|
||||
}
|
||||
|
||||
describe('ensureWorkerRunning — recycle waits for the dying worker\'s successor instead of spawning into the corpse', () => {
|
||||
const originalFetch = global.fetch;
|
||||
const originalReadinessBudget = process.env.CLAUDE_MEM_HOOK_READINESS_TIMEOUT_MS;
|
||||
const originalDataDir = process.env.CLAUDE_MEM_DATA_DIR;
|
||||
let tempDataDir: string;
|
||||
|
||||
beforeEach(() => {
|
||||
// The lazy-spawn fallback now goes through the spawn gate
|
||||
// (src/shared/worker-spawn-gate.ts), which writes <DATA_DIR>/spawn.lock.
|
||||
// Point DATA_DIR at a temp dir so the test never touches the real
|
||||
// ~/.claude-mem lock (a live launcher's lock would make the fallback
|
||||
// SKIP its spawn and fail the expectation below).
|
||||
tempDataDir = mkdtempSync(join(tmpdir(), 'claude-mem-recycle-successor-'));
|
||||
process.env.CLAUDE_MEM_DATA_DIR = tempDataDir;
|
||||
installFetchMock();
|
||||
spawnCalls.length = 0;
|
||||
});
|
||||
|
||||
afterEach(() => {
|
||||
global.fetch = originalFetch;
|
||||
if (originalReadinessBudget === undefined) {
|
||||
delete process.env.CLAUDE_MEM_HOOK_READINESS_TIMEOUT_MS;
|
||||
} else {
|
||||
process.env.CLAUDE_MEM_HOOK_READINESS_TIMEOUT_MS = originalReadinessBudget;
|
||||
}
|
||||
if (originalDataDir === undefined) {
|
||||
delete process.env.CLAUDE_MEM_DATA_DIR;
|
||||
} else {
|
||||
process.env.CLAUDE_MEM_DATA_DIR = originalDataDir;
|
||||
}
|
||||
rmSync(tempDataDir, { recursive: true, force: true });
|
||||
mock.restore();
|
||||
});
|
||||
|
||||
afterAll(() => {
|
||||
mock.module('../../src/services/infrastructure/index.js', () => realInfrastructureSnapshot);
|
||||
mock.module('../../src/supervisor/index.js', () => realSupervisorSnapshot);
|
||||
mock.module('../../src/shared/spawn.js', () => realSpawnSnapshot);
|
||||
});
|
||||
|
||||
it('does NOT spawn when the successor appears on a later health poll', async () => {
|
||||
// Health call 0 = the initial isWorkerPortAlive probe (stale worker).
|
||||
// Call 1 = first successor poll, still the dying stale worker.
|
||||
// Call 2+ = the successor reports the plugin version.
|
||||
healthVersionScript = [STALE_VERSION, STALE_VERSION, PLUGIN_VERSION];
|
||||
// Generous budget — success arrives on poll 2 (~500ms in).
|
||||
process.env.CLAUDE_MEM_HOOK_READINESS_TIMEOUT_MS = '3000';
|
||||
|
||||
const { ensureWorkerRunning } = await importWorkerUtilsFresh();
|
||||
const result = await ensureWorkerRunning();
|
||||
|
||||
expect(result).toBe(true);
|
||||
// The recycle was requested...
|
||||
const restartCalls = fetchLog.filter(
|
||||
c => c.url.includes('/api/admin/restart') && c.method === 'POST'
|
||||
);
|
||||
expect(restartCalls.length).toBe(1);
|
||||
// ...and NO spawn attempt raced the dying worker.
|
||||
expect(spawnCalls.length).toBe(0);
|
||||
});
|
||||
|
||||
it('falls back to lazy-spawn when no successor ever appears within the budget', async () => {
|
||||
healthVersionScript = [STALE_VERSION]; // health never recovers to the plugin version
|
||||
// Small budget so the successor wait expires fast.
|
||||
process.env.CLAUDE_MEM_HOOK_READINESS_TIMEOUT_MS = '700';
|
||||
|
||||
const { ensureWorkerRunning } = await importWorkerUtilsFresh();
|
||||
const result = await ensureWorkerRunning();
|
||||
|
||||
// The restart was still requested exactly once (one recycle per hook)...
|
||||
const restartCalls = fetchLog.filter(
|
||||
c => c.url.includes('/api/admin/restart') && c.method === 'POST'
|
||||
);
|
||||
expect(restartCalls.length).toBe(1);
|
||||
// ...and the safety-net lazy-spawn fired.
|
||||
expect(spawnCalls.length).toBe(1);
|
||||
expect(spawnCalls[0].args).toContain('--daemon');
|
||||
// The mocked port/readiness probes answer OK, so the hook proceeds.
|
||||
expect(result).toBe(true);
|
||||
});
|
||||
});
|
||||
@@ -4,6 +4,17 @@ import { tmpdir } from 'os';
|
||||
import { join } from 'path';
|
||||
|
||||
import { SettingsDefaultsManager } from '../../src/shared/SettingsDefaultsManager.js';
|
||||
// Eagerly evaluate src/shared/paths.ts BEFORE any per-test env override:
|
||||
// paths.ts freezes its DATA_DIR const at first evaluation, and without this
|
||||
// import the dynamic `import('../../src/shared/worker-utils.js')` calls
|
||||
// below can be the first to evaluate it — while the env var points at a
|
||||
// soon-deleted per-test temp dir — poisoning every later-loaded module in the
|
||||
// same bun process (e.g. ProcessManager's PID_FILE in combined runs). At this
|
||||
// point the env var is the per-RUN temp dir pinned by the preload tripwire
|
||||
// (tests/preload.ts), so paths.ts freezes on a stable, isolated dir that
|
||||
// outlives this file. worker-utils itself is unaffected: it resolves its
|
||||
// settings path at call time via SettingsDefaultsManager.get('CLAUDE_MEM_DATA_DIR').
|
||||
import '../../src/shared/paths.js';
|
||||
|
||||
describe('worker-utils API timeout resolution', () => {
|
||||
let tempDir: string;
|
||||
|
||||
@@ -40,11 +40,16 @@ function installFetchMock(): void {
|
||||
|
||||
// /api/health and /api/readiness must report OK so the worker is "alive"
|
||||
// and "ready"; /api/admin/restart and anything else also returns OK.
|
||||
// The health payload carries the PLUGIN version: the recycle path waits
|
||||
// for a successor worker reporting the installed plugin's version on
|
||||
// /api/health (the old worker spawns it after the port closes —
|
||||
// src/services/worker-shutdown.ts), so this scripts "the successor came
|
||||
// up immediately".
|
||||
return Promise.resolve({
|
||||
ok: true,
|
||||
status: 200,
|
||||
text: () => Promise.resolve(''),
|
||||
json: () => Promise.resolve({}),
|
||||
json: () => Promise.resolve({ version: versionMatchResult.pluginVersion }),
|
||||
} as unknown as Response);
|
||||
}) as unknown as typeof fetch;
|
||||
}
|
||||
|
||||
@@ -1,9 +1,9 @@
|
||||
import { afterEach, describe, expect, it } from 'bun:test';
|
||||
import { mkdirSync, readFileSync, rmSync, writeFileSync } from 'fs';
|
||||
import { existsSync, mkdirSync, readFileSync, rmSync, writeFileSync } from 'fs';
|
||||
import { tmpdir } from 'os';
|
||||
import path from 'path';
|
||||
import { createProcessRegistry } from '../../src/supervisor/process-registry.js';
|
||||
import { runShutdownCascade } from '../../src/supervisor/shutdown.js';
|
||||
import { removeOwnedPidFile, runShutdownCascade } from '../../src/supervisor/shutdown.js';
|
||||
|
||||
function makeTempDir(): string {
|
||||
return path.join(tmpdir(), `claude-mem-shutdown-${Date.now()}-${Math.random().toString(36).slice(2)}`);
|
||||
@@ -146,6 +146,45 @@ describe('supervisor shutdown cascade', () => {
|
||||
expect(Object.keys(persisted.processes)).toHaveLength(0);
|
||||
});
|
||||
|
||||
// Phase 5 (worker-restart plan): the dying worker's shutdown cascade runs
|
||||
// AFTER the restart successor has written its own PID file. Blind deletion
|
||||
// here clobbered the successor's file and made `worker status` report a
|
||||
// healthy worker as not running.
|
||||
it('old-worker cleanup spares the successor\'s PID file (owner guard)', async () => {
|
||||
const tempDir = makeTempDir();
|
||||
tempDirs.push(tempDir);
|
||||
mkdirSync(tempDir, { recursive: true });
|
||||
|
||||
const registryPath = path.join(tempDir, 'supervisor.json');
|
||||
const pidFilePath = path.join(tempDir, 'worker.pid');
|
||||
|
||||
// A successor (NOT this process) already owns the PID file.
|
||||
const successorContent = JSON.stringify({
|
||||
pid: 99999847,
|
||||
port: 37777,
|
||||
startedAt: new Date().toISOString()
|
||||
});
|
||||
writeFileSync(pidFilePath, successorContent);
|
||||
|
||||
const registry = createProcessRegistry(registryPath);
|
||||
registry.register('worker', {
|
||||
pid: process.pid,
|
||||
type: 'worker',
|
||||
startedAt: '2026-03-15T00:00:00.000Z'
|
||||
});
|
||||
|
||||
await runShutdownCascade({
|
||||
registry,
|
||||
currentPid: process.pid,
|
||||
pidFilePath
|
||||
});
|
||||
|
||||
// The successor's file must survive the old worker's dying breath, byte
|
||||
// for byte.
|
||||
expect(existsSync(pidFilePath)).toBe(true);
|
||||
expect(readFileSync(pidFilePath, 'utf-8')).toBe(successorContent);
|
||||
});
|
||||
|
||||
it('unregisters all children from registry after cascade', async () => {
|
||||
const tempDir = makeTempDir();
|
||||
tempDirs.push(tempDir);
|
||||
@@ -180,3 +219,64 @@ describe('supervisor shutdown cascade', () => {
|
||||
});
|
||||
});
|
||||
|
||||
describe('removeOwnedPidFile (owner guard, Phase 5)', () => {
|
||||
afterEach(() => {
|
||||
while (tempDirs.length > 0) {
|
||||
const dir = tempDirs.pop();
|
||||
if (dir) {
|
||||
rmSync(dir, { recursive: true, force: true });
|
||||
}
|
||||
}
|
||||
});
|
||||
|
||||
function makePidFilePath(): string {
|
||||
const tempDir = makeTempDir();
|
||||
tempDirs.push(tempDir);
|
||||
mkdirSync(tempDir, { recursive: true });
|
||||
return path.join(tempDir, 'worker.pid');
|
||||
}
|
||||
|
||||
it('deletes the file when the recorded pid is the current process', () => {
|
||||
const pidFilePath = makePidFilePath();
|
||||
writeFileSync(pidFilePath, JSON.stringify({ pid: process.pid, port: 37777, startedAt: new Date().toISOString() }));
|
||||
|
||||
removeOwnedPidFile(pidFilePath, process.pid);
|
||||
|
||||
expect(existsSync(pidFilePath)).toBe(false);
|
||||
});
|
||||
|
||||
it('leaves the file when the recorded pid belongs to another process', () => {
|
||||
const pidFilePath = makePidFilePath();
|
||||
writeFileSync(pidFilePath, JSON.stringify({ pid: 99999847, port: 37777, startedAt: new Date().toISOString() }));
|
||||
|
||||
removeOwnedPidFile(pidFilePath, process.pid);
|
||||
|
||||
expect(existsSync(pidFilePath)).toBe(true);
|
||||
});
|
||||
|
||||
it('leaves a corrupt file in place (ownership cannot be proven)', () => {
|
||||
const pidFilePath = makePidFilePath();
|
||||
writeFileSync(pidFilePath, 'not valid json {{{');
|
||||
|
||||
removeOwnedPidFile(pidFilePath, process.pid);
|
||||
|
||||
expect(existsSync(pidFilePath)).toBe(true);
|
||||
});
|
||||
|
||||
it('leaves a pid-less JSON file in place (no recorded owner)', () => {
|
||||
const pidFilePath = makePidFilePath();
|
||||
writeFileSync(pidFilePath, JSON.stringify({ port: 37777 }));
|
||||
|
||||
removeOwnedPidFile(pidFilePath, process.pid);
|
||||
|
||||
expect(existsSync(pidFilePath)).toBe(true);
|
||||
});
|
||||
|
||||
it('does not throw when the file is missing', () => {
|
||||
const pidFilePath = makePidFilePath();
|
||||
|
||||
expect(() => removeOwnedPidFile(pidFilePath, process.pid)).not.toThrow();
|
||||
expect(existsSync(pidFilePath)).toBe(false);
|
||||
});
|
||||
});
|
||||
|
||||
|
||||
@@ -1,11 +1,14 @@
|
||||
import { describe, it, expect, beforeAll, afterAll, afterEach } from 'bun:test';
|
||||
import { execSync, ChildProcess } from 'child_process';
|
||||
import { existsSync, readFileSync, writeFileSync, unlinkSync, mkdirSync, rmSync } from 'fs';
|
||||
import { homedir } from 'os';
|
||||
import { existsSync, readFileSync, writeFileSync, unlinkSync, mkdirSync, mkdtempSync, rmSync } from 'fs';
|
||||
import { tmpdir } from 'os';
|
||||
import path from 'path';
|
||||
|
||||
const TEST_PORT = 37877;
|
||||
const TEST_DATA_DIR = path.join(homedir(), '.claude-mem-test');
|
||||
// Phase 6 (worker-restart plan): a unique temp dir, NOT a fixed path in the
|
||||
// user's home directory — concurrent runs can't collide and nothing lands
|
||||
// near the real ~/.claude-mem.
|
||||
const TEST_DATA_DIR = mkdtempSync(path.join(tmpdir(), 'claude-mem-worker-spawn-'));
|
||||
const TEST_PID_FILE = path.join(TEST_DATA_DIR, 'worker.pid');
|
||||
const WORKER_SCRIPT = path.join(__dirname, '../plugin/scripts/worker-service.cjs');
|
||||
|
||||
@@ -45,26 +48,20 @@ function runWorkerCommand(command: string, env: Record<string, string> = {}): st
|
||||
}
|
||||
|
||||
describe('Worker Self-Spawn CLI', () => {
|
||||
beforeAll(async () => {
|
||||
if (existsSync(TEST_DATA_DIR)) {
|
||||
rmSync(TEST_DATA_DIR, { recursive: true });
|
||||
}
|
||||
});
|
||||
|
||||
afterAll(async () => {
|
||||
if (existsSync(TEST_DATA_DIR)) {
|
||||
rmSync(TEST_DATA_DIR, { recursive: true });
|
||||
}
|
||||
rmSync(TEST_DATA_DIR, { recursive: true, force: true });
|
||||
});
|
||||
|
||||
describe('status command', () => {
|
||||
// The spawned CLI must be pointed at the isolated temp dir explicitly so
|
||||
// it never reads (or stale-cleans) the real ~/.claude-mem/worker.pid.
|
||||
it('should report worker status in expected format', async () => {
|
||||
const output = runWorkerCommand('status');
|
||||
const output = runWorkerCommand('status', { CLAUDE_MEM_DATA_DIR: TEST_DATA_DIR });
|
||||
expect(output.includes('running')).toBe(true);
|
||||
});
|
||||
|
||||
it('should include PID and port when running', async () => {
|
||||
const output = runWorkerCommand('status');
|
||||
const output = runWorkerCommand('status', { CLAUDE_MEM_DATA_DIR: TEST_DATA_DIR });
|
||||
if (output.includes('Worker running')) {
|
||||
expect(output).toMatch(/PID: \d+/);
|
||||
expect(output).toMatch(/Port: \d+/);
|
||||
|
||||
Reference in New Issue
Block a user