thedotmack-claude-mem

mirror of https://github.com/thedotmack/claude-mem.git synced 2026-07-03 12:32:32 +08:00

Commit Graph

Author: Alex Newman
SHA1: c0b96288a7

fix(restart): worker-restart single source of truth — self-replacing worker, spawn gate, verified restarts (#2894)

* plan-10 Phase 1: ship deterministic plugin runtime dependency closure

Approach A — commit & ship plugin/bun.lock so the plugin's runtime
node_modules install is deterministic, fixing the recurring
`Cannot find module 'zod/v3'` (#2730).

- align generated plugin zod range to root (^4.4.3) in build-hooks.js
- new scripts/gen-plugin-lockfile.cjs generates plugin/bun.lock as a
  build artifact after build-hooks.js writes plugin/package.json
- track & ship plugin/bun.lock (.gitignore negation, .npmignore, files allowlist)
- install with `bun install --frozen-lockfile --ignore-scripts` at runtime

Refs #2783, #2730

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* plan-10 Phase 2: fail loud at install time on a broken dependency closure

Strengthen verifyCriticalModules to assert each dependency is actually
importable via require.resolve (not merely a directory), and assert the
worker-required zod subpaths resolve: zod/v3, zod/v4, zod/v4-mini.
A partial/stale install now fails `npx claude-mem install` immediately
instead of surfacing later as a Stop-hook `Cannot find module 'zod/v3'`.

Bin-only packages (e.g. tree-sitter-cli, which has no bare-name entry
point) fall back to resolving <dep>/package.json so a healthy install
isn't falsely rejected.

Adds tests/cli/verify-critical-modules.test.ts covering a missing zod/v3
subpath (throws), a complete zod (passes), and a bin-only dep (passes).
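
The resolution check described above can be sketched roughly as follows. This is an illustration, not the repository's verifyCriticalModules: the function name `findUnresolvable`, its parameters, and its return shape are assumptions made for the example. It resolves each dependency from the install directory with Node's `createRequire`, falls back to `<dep>/package.json` for bin-only packages, and then checks the required subpaths.

```typescript
import { createRequire } from "node:module";
import * as path from "node:path";

// Hypothetical helper (not the project's code): returns every specifier
// that cannot be resolved from `installDir`.
export function findUnresolvable(
  installDir: string,
  deps: string[],
  subpaths: string[],
): string[] {
  // Resolve relative to the install dir, not to this file's own node_modules.
  const req = createRequire(path.join(installDir, "noop.js"));
  const canResolve = (spec: string): boolean => {
    try {
      req.resolve(spec);
      return true;
    } catch {
      return false;
    }
  };

  const missing: string[] = [];
  for (const dep of deps) {
    // Bin-only packages have no bare-name entry point, so a healthy
    // install is accepted if <dep>/package.json resolves instead.
    if (!canResolve(dep) && !canResolve(`${dep}/package.json`)) {
      missing.push(dep);
    }
  }
  for (const sub of subpaths) {
    if (!canResolve(sub)) missing.push(sub);
  }
  return missing;
}
```

A caller could throw when the returned list is non-empty, naming the missing modules, so that a partial install fails at install time rather than inside a hook.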

Refs #2783, #2730

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* plan-10 Phase 3: clean-room install + import smoke test (#2730 backstop)

Add scripts/smoke-clean-room.cjs and a `smoke:clean-room` npm script.
Against fresh temp dirs (never the repo's node_modules) it:
- copies plugin/, runs `bun install --frozen-lockfile --ignore-scripts`,
  asserts zod, zod/v3, zod/v4, zod/v4-mini resolve, and boots the bundled
  worker asserting no `Cannot find module` — the direct #2730 regression guard;
- `npm pack`s, installs the tarball into a second temp dir, and load-tests
  the published bin entrypoint, warning loudly on any declared main/exports
  target missing from the tarball (latent #2537 gap).

Exits non-zero naming the missing module on any failure; cleans up all
temp dirs and the tarball in a finally.

Refs #2783, #2730

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* plan-10 Phase 4: gate CI and publish on the clean-room dependency closure

- ci.yml: new `clean-room-deps` job (between build and the docker e2e job)
  runs a frozen-lockfile drift check on the committed plugin lockfile, then
  `npm run build` + `npm run smoke:clean-room`. The drift step catches a
  contributor who changed plugin deps without regenerating plugin/bun.lock.
- npm-publish.yml: add setup-bun and run `npm run smoke:clean-room` between
  build and `npm publish`, so a broken runtime closure cannot be published
  on a tag push (ci.yml does not run on tags). Secrets block untouched.

Refs #2783, #2730

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* plan-10: doc recluster note + Phase 0 execution slice for #2730

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* plans: backlog recluster (2026-06-04) — cross-cluster execution order + plan-13 doc

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* plan-10: gen-plugin-lockfile degrades gracefully when bun is absent

The Windows build CI job has no bun on PATH; regenerating the lockfile there
threw and failed the build. The committed plugin/bun.lock is already the
deterministic closure, so skip regeneration (non-fatal) when bun is missing
and a lockfile exists; fail loud only when neither is available.
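
The degrade-gracefully rule above reduces to a small decision table. The sketch below is only an illustration of that rule; `decideLockfileAction` and the `LockfileAction` type are hypothetical names, and how bun is detected on PATH is left out.

```typescript
export type LockfileAction = "regenerate" | "skip" | "fail";

// Hypothetical decision table for the lockfile generator (names are assumptions).
export function decideLockfileAction(
  bunOnPath: boolean,
  lockfileExists: boolean,
): LockfileAction {
  // With bun available, the lockfile is regenerated as a build artifact.
  if (bunOnPath) return "regenerate";
  // Without bun, the committed lockfile is already the deterministic closure.
  if (lockfileExists) return "skip";
  // Neither bun nor a lockfile: nothing can guarantee the closure, so fail loud.
  return "fail";
}
```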

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: rebuild plugin artifacts after merging main (v13.5.1) + plan-10 work

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* chore: rebuild plugin artifacts after merging main v13.5.5

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* chore(deps): daily upgrade pass — agent SDK 0.3.172, better-auth 1.6.16, posthog-node 5.36.15, dompurify 3.4.9

- Bump @anthropic-ai/claude-agent-sdk 0.2.141 -> 0.3.172 (tsc + full test suite green)
- Remove deprecated @types/dompurify stub (dompurify ships its own types)
- Add overrides.tmp ^0.2.7 to clear GHSA-52f5-9888-hmc6 / GHSA-ph9p-34f9-6g65
  via np -> listr-input -> inquirer -> external-editor -> tmp chain
- npm audit: 0 vulnerabilities; npm outdated: clean
- package-lock.json is gitignored in this repo, so not committed

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* plan: worker-restart single-source-of-truth — 7-phase fix for restart races

Phased plan from the adversarially-verified diagnosis (wf_f07f3541-b05):
kill the cache mirror, single verified restart initiator, self-replacing
restart endpoint, unified spawn gate with lockfile, PID-file demotion,
test data-dir isolation, soak verification.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* refactor(restart): delete sync-script cache-mirror and HTTP restart trigger

Phase 1 of plans/2026-06-10-worker-restart-single-source-of-truth.md.
The installed-version cache mirror wrote version-N code into the
version-(N-1) cache dir, manufacturing permanent version disagreement;
the HTTP POST to /api/admin/restart raced the CLI restart that follows
it in build-and-sync. Both are deleted; the CLI worker:restart in the
marketplace copy is now the single restart initiator, and the sleep 1
between the two mechanisms is gone.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat(restart): restart proves itself or exits 1

Phase 2 of plans/2026-06-10-worker-restart-single-source-of-truth.md.
worker-service restart now captures the old worker pid, waits for the
port with the same platform-scaled 15s budget as stop, spawns the
marketplace copy of worker-service.cjs when present, then polls
/api/health until the pid changes and the version matches this build's
baked __DEFAULT_PACKAGE_VERSION__ — success is printed to stdout,
deadline (platform-scaled 30s) exits 1 with the last observed health
payload and the spawned script path. The --daemon generic start-failure
path now exits 1 instead of masquerading as success; the three
duplicate-suppression exits remain 0.

New helper src/services/restart-verify.ts (worker-service.ts bootstraps
on import, so the helper lives in an import-safe module) with 8 tests
covering pid-flip success, stale pid, wrong version, unreachable
timeout, 503-degraded acceptance, and null-oldPid version-only
verification.
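
As a rough illustration of the verification rule (pid must change, version must match, a degraded worker that answers still counts, and a null old pid means version-only verification), the following sketch separates a pure classifier from the polling loop. It is not the contents of src/services/restart-verify.ts: `classifyHealth`, `verifyRestart`, the `Verdict` labels, and the injected `fetchHealth` callback are all assumptions for the example.

```typescript
export interface HealthPayload {
  pid: number;
  version: string;
}

export type Verdict = "verified" | "stale-pid" | "wrong-version" | "unreachable";

// Pure classification of one health poll (hypothetical helper).
export function classifyHealth(
  oldPid: number | null,
  expectedVersion: string,
  payload: HealthPayload | null,
): Verdict {
  if (payload === null) return "unreachable";
  // With no old pid captured, only the version can be verified.
  if (oldPid !== null && payload.pid === oldPid) return "stale-pid";
  if (payload.version !== expectedVersion) return "wrong-version";
  // Any worker that answered with a new pid and the right version counts,
  // including one that reported itself as degraded.
  return "verified";
}

// Poll until verified or the deadline passes; `fetchHealth` is injected so
// the loop does not depend on a particular HTTP client.
export async function verifyRestart(opts: {
  oldPid: number | null;
  expectedVersion: string;
  fetchHealth: () => Promise<HealthPayload | null>;
  deadlineMs: number;
  intervalMs: number;
}): Promise<{ ok: boolean; last: HealthPayload | null }> {
  const deadline = Date.now() + opts.deadlineMs;
  let last: HealthPayload | null = null;
  while (Date.now() < deadline) {
    last = await opts.fetchHealth();
    if (classifyHealth(opts.oldPid, opts.expectedVersion, last) === "verified") {
      return { ok: true, last };
    }
    await new Promise<void>((resolve) => setTimeout(resolve, opts.intervalMs));
  }
  // The caller can print `last` and exit 1, as the commit describes.
  return { ok: false, last };
}
```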

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat(restart): self-replacing worker — old worker spawns its successor

Phase 3 of plans/2026-06-10-worker-restart-single-source-of-truth.md.
/api/admin/restart was kill-only: hooks that POSTed it then raced the
dying worker with their own lazy-spawn (the observed recycle ping-pong).
Now the dying worker spawns its successor itself — after a re-entrancy-
guarded, deadline-bounded (platform-scaled 10s) graceful shutdown, and
only once its port is confirmed free; stop and signal shutdowns stay
kill-only. The hook recycle path waits for that successor via
/api/health polling (HOOK_READINESS_TIMEOUT_MS budget) and lazy-spawns
only as a fallback, with a warn-only version re-check so a hook never
recycles more than once per invocation.

Shutdown sequence lives in import-safe src/services/worker-shutdown.ts
(worker-service.ts bootstraps on import); registerSignalHandlers no
longer pre-sets isShuttingDown — the supervisor's shutdownInitiated
guard owns signal dedupe, and pre-setting would no-op the new entry
guard. 13 new tests cover re-entrancy, deadline expiry/rejection,
handoff ordering, kill-only reasons, successor-wait vs lazy-spawn
fallback, and pre-graceful bookkeeping failures.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat(restart): one spawn gate; CLI restart defers to the self-replacing worker

Phase 4 of plans/2026-06-10-worker-restart-single-source-of-truth.md.
Three uncoordinated spawn paths (hook lazy-spawn, MCP worker-spawner,
CLI) with two different bun resolvers produced 3-launcher collisions
within a single second. Now a wx-flag lockfile (<DATA_DIR>/spawn.lock,
60s mtime staleness with re-stat-before-unlink, owner-checked release)
gates every external spawn: lock losers never fail — they skip the
spawn and wait for the winner's worker. resolveBunRuntime is deleted in
favor of ProcessManager's resolveWorkerRuntimePath (adds BUN_PATH,
~/.bun/bin, brew, which fallbacks), closing the kill-then-can't-respawn
path; mcp-server prefers the marketplace worker script so stale cache
dirs stop spawning stale workers.
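
A minimal sketch of a wx-flag lockfile with mtime staleness, re-stat-before-unlink, and owner-checked release is shown below. It is an illustration under stated assumptions rather than the project's spawn gate: the function names, the plain-string owner token stored in the file, and the `now` parameter (added so staleness can be exercised without waiting) are all hypothetical.

```typescript
import * as fs from "node:fs";

const STALE_MS = 60_000; // 60s mtime staleness window, as in the commit text

// Returns true if this caller won the lock, false if another launcher holds it.
export function acquireSpawnLock(
  lockPath: string,
  owner: string,
  now: number = Date.now(),
): boolean {
  for (let attempt = 0; attempt < 2; attempt++) {
    try {
      // "wx" fails with EEXIST when the file exists, so creation is the claim.
      fs.writeFileSync(lockPath, owner, { flag: "wx" });
      return true;
    } catch (err) {
      if ((err as { code?: string }).code !== "EEXIST") throw err;
    }
    // Lock exists: only retry if it was stale and could be cleared.
    if (!clearIfStale(lockPath, now)) return false;
  }
  return false;
}

function clearIfStale(lockPath: string, now: number): boolean {
  try {
    const first = fs.statSync(lockPath);
    if (now - first.mtimeMs < STALE_MS) return false;
    // Re-stat before unlinking: a changed mtime means the lock was replaced.
    const second = fs.statSync(lockPath);
    if (second.mtimeMs !== first.mtimeMs) return false;
    fs.unlinkSync(lockPath);
    return true;
  } catch (err) {
    // The lock vanished in the meantime; the caller may retry creation.
    return (err as { code?: string }).code === "ENOENT";
  }
}

// Release only a lock this owner wrote; never delete someone else's.
export function releaseSpawnLock(lockPath: string, owner: string): boolean {
  try {
    if (fs.readFileSync(lockPath, "utf8") !== owner) return false;
    fs.unlinkSync(lockPath);
    return true;
  } catch {
    return false;
  }
}
```

In this sketch a lock loser simply gets `false` back and can go on to wait for the winner's worker instead of failing. A small window between the re-stat and the unlink remains; the sketch narrows it but does not close it.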

Integration fix surfaced by live verification: the CLI restart raced
the Phase 3 self-replacement handoff (the successor re-binds the port
in ~200ms, so waitForPortFree always timed out and restart exited 1
while the restart had actually succeeded). The CLI now verifies the
worker's self-spawned successor directly, and only spawns — gate-
wrapped, after the port frees — as the fallback when no worker was
running, the shutdown POST was rejected, or no successor appeared. The
dying worker's handoff is intentionally ungated: it spawns only after
its own port closes, and hooks wait on it.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat(restart): demote the PID file — health and port are the liveness oracle

Phase 5 of plans/2026-06-10-worker-restart-single-source-of-truth.md.
The dying worker's shutdown cascade deleted the PID file
unconditionally as its final act, clobbering the successor's
freshly-written file; status then required portInUse AND pidInfo, so a
healthy worker reported as "not running". Now every PID-file deletion
is owner-guarded: the supervisor cascade deletes only its own pid
(removeOwnedPidFile), and the CLI stop/restart-fallback, the restart
handoff, and the daemon start-failure cleanup go through
removePidFileIfOwner (owner-or-dead — a live successor's file always
survives; corrupt files are left for the next boot's validator).
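
The owner-or-dead rule can be sketched as below. This is not the repository's removePidFileIfOwner; the function name, the result labels, the injectable liveness probe, and the assumption that the PID file is JSON with a numeric `pid` field are all illustrative.

```typescript
import * as fs from "node:fs";

// Default liveness probe: signal 0 tests for existence without signalling.
function isProcessAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but is owned by another user.
    return (err as { code?: string }).code === "EPERM";
  }
}

export type RemoveResult = "removed" | "kept-live-owner" | "kept-corrupt" | "absent";

// Hypothetical helper: delete the PID file only if it records our own pid
// or a pid that is no longer alive.
export function removePidFileIfOwnerOrDead(
  pidFile: string,
  ownPid: number,
  isAlive: (pid: number) => boolean = isProcessAlive,
): RemoveResult {
  let raw: string;
  try {
    raw = fs.readFileSync(pidFile, "utf8");
  } catch {
    return "absent";
  }

  let recorded: unknown;
  try {
    recorded = JSON.parse(raw).pid;
  } catch {
    return "kept-corrupt"; // left in place for a later validator
  }
  if (typeof recorded !== "number" || !Number.isInteger(recorded)) {
    return "kept-corrupt";
  }

  // A live successor's file always survives.
  if (recorded !== ownPid && isAlive(recorded)) return "kept-live-owner";

  fs.unlinkSync(pidFile);
  return "removed";
}
```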

status sources from GET /api/health alone (pid, version, uptime,
workerPath; 503-degraded counts as running and now surfaces its queue
detail), with port-in-use-but-unreachable and not-running fallbacks —
all exit 0 as before. The --daemon duplicate gate checks the port
first (ground truth) and the PID file second (advisory, for the
freed-port-but-undeleted-file window); duplicate suppression stays
exit 0. writePidFile/touchPidFile remain — the file is diagnostics,
and the worker stays its only writer.

Also fixes combined-run test pollution: spawn-gate and worker-utils
timeout tests now eagerly import paths.js before setting a temp
CLAUDE_MEM_DATA_DIR, so the import-time DATA_DIR const can't freeze on
a deleted temp dir for suites loaded later in the same bun process.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* test: no test ever touches the real ~/.claude-mem again

Phase 6 of plans/2026-06-10-worker-restart-single-source-of-truth.md.
process-manager and graceful-shutdown tests wrote corrupt JSON and
sentinel PIDs (2147483647) into the real ~/.claude-mem/worker.pid and
drove the real supervisor.json cascade under a snapshot-restore that a
killed run would skip — that pollution contaminated production logs and
a prior diagnosis. Both files now set a temp CLAUDE_MEM_DATA_DIR at the
top of the file before dynamically importing the code under test (ESM
hoisting makes beforeEach too late), assert their paths landed outside
the real dir, and derive PID_FILE from the same frozen paths module the
code uses so test and code can never diverge under bun's shared module
cache. The snapshot-restore scaffolding is deleted; zero assertions
changed.

tests/preload.ts gains a tripwire: when CLAUDE_MEM_DATA_DIR is unset it
fills a per-run temp dir, so no test in any file can fall through to
the real data dir. Fallout made explicit: worker-spawn child processes
get an explicit temp dir; install-error-matrix restores rather than
deletes the env var; settings-defaults-manager pins the unset-env
default it was implicitly relying on.
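
A preload tripwire of the kind described above might look like the following sketch. It is not the contents of tests/preload.ts; `ensureIsolatedDataDir`, the temp-dir prefix, and passing the environment as a parameter are assumptions for illustration. Only the variable name CLAUDE_MEM_DATA_DIR comes from the commit text.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical preload step: when the data-dir variable is unset, point it
// at a fresh per-run temp dir so nothing falls through to the real data dir.
export function ensureIsolatedDataDir(
  env: Record<string, string | undefined> = process.env,
): string {
  const existing = env.CLAUDE_MEM_DATA_DIR;
  if (existing) return existing; // a test file already chose its own dir

  const dir = fs.mkdtempSync(path.join(os.tmpdir(), "claude-mem-test-"));
  env.CLAUDE_MEM_DATA_DIR = dir;
  return dir;
}
```

Because modules that read the variable at import time freeze its value, a step like this has to run before any code under test is imported, which is why the commit places it in a preload file.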

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(settings): bootstrap notices go to stderr, never stdout

CI on PR #2894 caught the latent bug: on the first boot in a fresh data
dir, SettingsDefaultsManager printed '[SETTINGS] Created settings file
with defaults: ...' to stdout before the start command's JSON hook
payload, corrupting the machine-readable contract every fresh install's
first hook invocation relies on. The Phase 6 per-run temp data dir made
the cold-dir case deterministic in CI, exposing it. Both informational
notices (creation, nested-schema migration) now use console.warn —
stderr — matching the function's existing failure-path idiom; two
regression tests pin stdout silence on both paths.
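
The stdout contract can be illustrated with a small sketch: informational notices go to stderr, and stdout carries only the JSON payload. The function `startWithNotice`, the `Streams` interface, and the payload shape are hypothetical; the injected writers exist only so the behavior can be checked without capturing real streams.

```typescript
export interface Streams {
  out: (line: string) => void;
  err: (line: string) => void;
}

// Defaults mirror the idiom in the commit text: console.warn writes to stderr.
const defaultStreams: Streams = {
  out: (line) => console.log(line),
  err: (line) => console.warn(line),
};

// Hypothetical start routine: a first-boot notice must never share stdout
// with the machine-readable payload.
export function startWithNotice(
  settingsCreated: boolean,
  payload: Record<string, unknown>,
  streams: Streams = defaultStreams,
): void {
  if (settingsCreated) {
    streams.err("[SETTINGS] Created settings file with defaults");
  }
  // stdout stays parseable as a single JSON document.
  streams.out(JSON.stringify(payload));
}
```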

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* refactor(restart): address PR #2894 review — dedupe script resolver, skip futile port wait

Both inline copies of the marketplace-first script-candidate list in
worker-service.ts (restart fallback + successor handoff injection) now
call the exported resolveWorkerScriptPath() ?? __filename, so the
candidate list lives in one place. verifyRestartedWorker's failure
result gains lastPollSawHealth; when the self-replacement handoff
verification timed out while a live (but unverifiable) worker was still
serving on the port, the CLI fallback now skips its port-free wait —
the port cannot free while that worker lives, so the wait only burned
its full platform-scaled budget before the same final verification ran
anyway.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Date: 2026-06-10 19:59:34 -07:00

1 Commit