mirror of
https://github.com/thedotmack/claude-mem.git
synced 2026-07-03 12:32:32 +08:00
main
1 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
c0b96288a7 |
fix(restart): worker-restart single source of truth — self-replacing worker, spawn gate, verified restarts (#2894)
* plan-10 Phase 1: ship deterministic plugin runtime dependency closure Approach A — commit & ship plugin/bun.lock so the plugin's runtime node_modules install is deterministic, fixing the recurring `Cannot find module 'zod/v3'` (#2730). - align generated plugin zod range to root (^4.4.3) in build-hooks.js - new scripts/gen-plugin-lockfile.cjs generates plugin/bun.lock as a build artifact after build-hooks.js writes plugin/package.json - track & ship plugin/bun.lock (.gitignore negation, .npmignore, files allowlist) - install with `bun install --frozen-lockfile --ignore-scripts` at runtime Refs #2783, #2730 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * plan-10 Phase 2: fail loud at install time on a broken dependency closure Strengthen verifyCriticalModules to assert each dependency is actually importable via require.resolve (not merely a directory), and assert the worker-required zod subpaths resolve: zod/v3, zod/v4, zod/v4-mini. A partial/stale install now fails `npx claude-mem install` immediately instead of surfacing later as a Stop-hook `Cannot find module 'zod/v3'`. Bin-only packages (e.g. tree-sitter-cli, which has no bare-name entry point) fall back to resolving <dep>/package.json so a healthy install isn't falsely rejected. Adds tests/cli/verify-critical-modules.test.ts covering a missing zod/v3 subpath (throws), a complete zod (passes), and a bin-only dep (passes). Refs #2783, #2730 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * plan-10 Phase 3: clean-room install + import smoke test (#2730 backstop) Add scripts/smoke-clean-room.cjs and a `smoke:clean-room` npm script. Against fresh temp dirs (never the repo's node_modules) it: - copies plugin/, runs `bun install --frozen-lockfile --ignore-scripts`, asserts zod, zod/v3, zod/v4, zod/v4-mini resolve, and boots the bundled worker asserting no `Cannot find module` — the direct #2730 regression guard; - `npm pack`s, installs the tarball into a second temp dir, and load-tests the published bin entrypoint, warning loudly on any declared main/exports target missing from the tarball (latent #2537 gap). Exits non-zero naming the missing module on any failure; cleans up all temp dirs and the tarball in a finally. Refs #2783, #2730 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * plan-10 Phase 4: gate CI and publish on the clean-room dependency closure - ci.yml: new `clean-room-deps` job (between build and the docker e2e job) runs a frozen-lockfile drift check on the committed plugin lockfile, then `npm run build` + `npm run smoke:clean-room`. The drift step catches a contributor who changed plugin deps without regenerating plugin/bun.lock. - npm-publish.yml: add setup-bun and run `npm run smoke:clean-room` between build and `npm publish`, so a broken runtime closure cannot be published on a tag push (ci.yml does not run on tags). Secrets block untouched. Refs #2783, #2730 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * plan-10: doc recluster note + Phase 0 execution slice for #2730 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * plans: backlog recluster (2026-06-04) — cross-cluster execution order + plan-13 doc Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * plan-10: gen-plugin-lockfile degrades gracefully when bun is absent The Windows build CI job has no bun on PATH; regenerating the lockfile there threw and failed the build. The committed plugin/bun.lock is already the deterministic closure, so skip regeneration (non-fatal) when bun is missing and a lockfile exists; fail loud only when neither is available. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: rebuild plugin artifacts after merging main (v13.5.1) + plan-10 work Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * chore: rebuild plugin artifacts after merging main v13.5.5 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * chore(deps): daily upgrade pass — agent SDK 0.3.172, better-auth 1.6.16, posthog-node 5.36.15, dompurify 3.4.9 - Bump @anthropic-ai/claude-agent-sdk 0.2.141 -> 0.3.172 (tsc + full test suite green) - Remove deprecated @types/dompurify stub (dompurify ships its own types) - Add overrides.tmp ^0.2.7 to clear GHSA-52f5-9888-hmc6 / GHSA-ph9p-34f9-6g65 via np -> listr-input -> inquirer -> external-editor -> tmp chain - npm audit: 0 vulnerabilities; npm outdated: clean - package-lock.json is gitignored in this repo, so not committed Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * plan: worker-restart single-source-of-truth — 7-phase fix for restart races Phased plan from the adversarially-verified diagnosis (wf_f07f3541-b05): kill the cache mirror, single verified restart initiator, self-replacing restart endpoint, unified spawn gate with lockfile, PID-file demotion, test data-dir isolation, soak verification. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * refactor(restart): delete sync-script cache-mirror and HTTP restart trigger Phase 1 of plans/2026-06-10-worker-restart-single-source-of-truth.md. The installed-version cache mirror wrote version-N code into the version-(N-1) cache dir, manufacturing permanent version disagreement; the HTTP POST to /api/admin/restart raced the CLI restart that follows it in build-and-sync. Both are deleted; the CLI worker:restart in the marketplace copy is now the single restart initiator, and the sleep 1 between the two mechanisms is gone. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat(restart): restart proves itself or exits 1 Phase 2 of plans/2026-06-10-worker-restart-single-source-of-truth.md. worker-service restart now captures the old worker pid, waits for the port with the same platform-scaled 15s budget as stop, spawns the marketplace copy of worker-service.cjs when present, then polls /api/health until the pid changes and the version matches this build's baked __DEFAULT_PACKAGE_VERSION__ — success is printed to stdout, deadline (platform-scaled 30s) exits 1 with the last observed health payload and the spawned script path. The --daemon generic start-failure path now exits 1 instead of masquerading as success; the three duplicate-suppression exits remain 0. New helper src/services/restart-verify.ts (worker-service.ts bootstraps on import, so the helper lives in an import-safe module) with 8 tests covering pid-flip success, stale pid, wrong version, unreachable timeout, 503-degraded acceptance, and null-oldPid version-only verification. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat(restart): self-replacing worker — old worker spawns its successor Phase 3 of plans/2026-06-10-worker-restart-single-source-of-truth.md. /api/admin/restart was kill-only: hooks that POSTed it then raced the dying worker with their own lazy-spawn (the observed recycle ping-pong). Now the dying worker spawns its successor itself — after a re-entrancy- guarded, deadline-bounded (platform-scaled 10s) graceful shutdown, and only once its port is confirmed free; stop and signal shutdowns stay kill-only. The hook recycle path waits for that successor via /api/health polling (HOOK_READINESS_TIMEOUT_MS budget) and lazy-spawns only as a fallback, with a warn-only version re-check so a hook never recycles more than once per invocation. Shutdown sequence lives in import-safe src/services/worker-shutdown.ts (worker-service.ts bootstraps on import); registerSignalHandlers no longer pre-sets isShuttingDown — the supervisor's shutdownInitiated guard owns signal dedupe, and pre-setting would no-op the new entry guard. 13 new tests cover re-entrancy, deadline expiry/rejection, handoff ordering, kill-only reasons, successor-wait vs lazy-spawn fallback, and pre-graceful bookkeeping failures. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat(restart): one spawn gate; CLI restart defers to the self-replacing worker Phase 4 of plans/2026-06-10-worker-restart-single-source-of-truth.md. Three uncoordinated spawn paths (hook lazy-spawn, MCP worker-spawner, CLI) with two different bun resolvers produced 3-launcher collisions within a single second. Now a wx-flag lockfile (<DATA_DIR>/spawn.lock, 60s mtime staleness with re-stat-before-unlink, owner-checked release) gates every external spawn: lock losers never fail — they skip the spawn and wait for the winner's worker. resolveBunRuntime is deleted in favor of ProcessManager's resolveWorkerRuntimePath (adds BUN_PATH, ~/.bun/bin, brew, which fallbacks), closing the kill-then-can't-respawn path; mcp-server prefers the marketplace worker script so stale cache dirs stop spawning stale workers. Integration fix surfaced by live verification: the CLI restart raced the Phase 3 self-replacement handoff (the successor re-binds the port in ~200ms, so waitForPortFree always timed out and restart exited 1 while the restart had actually succeeded). The CLI now verifies the worker's self-spawned successor directly, and only spawns — gate- wrapped, after the port frees — as the fallback when no worker was running, the shutdown POST was rejected, or no successor appeared. The dying worker's handoff is intentionally ungated: it spawns only after its own port closes, and hooks wait on it. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat(restart): demote the PID file — health and port are the liveness oracle Phase 5 of plans/2026-06-10-worker-restart-single-source-of-truth.md. The dying worker's shutdown cascade deleted the PID file unconditionally as its final act, clobbering the successor's freshly-written file; status then required portInUse AND pidInfo, so a healthy worker reported as "not running". Now every PID-file deletion is owner-guarded: the supervisor cascade deletes only its own pid (removeOwnedPidFile), and the CLI stop/restart-fallback, the restart handoff, and the daemon start-failure cleanup go through removePidFileIfOwner (owner-or-dead — a live successor's file always survives; corrupt files are left for the next boot's validator). status sources from GET /api/health alone (pid, version, uptime, workerPath; 503-degraded counts as running and now surfaces its queue detail), with port-in-use-but-unreachable and not-running fallbacks — all exit 0 as before. The --daemon duplicate gate checks the port first (ground truth) and the PID file second (advisory, for the freed-port-but-undeleted-file window); duplicate suppression stays exit 0. writePidFile/touchPidFile remain — the file is diagnostics, and the worker stays its only writer. Also fixes combined-run test pollution: spawn-gate and worker-utils timeout tests now eagerly import paths.js before setting a temp CLAUDE_MEM_DATA_DIR, so the import-time DATA_DIR const can't freeze on a deleted temp dir for suites loaded later in the same bun process. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * test: no test ever touches the real ~/.claude-mem again Phase 6 of plans/2026-06-10-worker-restart-single-source-of-truth.md. process-manager and graceful-shutdown tests wrote corrupt JSON and sentinel PIDs (2147483647) into the real ~/.claude-mem/worker.pid and drove the real supervisor.json cascade under a snapshot-restore that a killed run would skip — that pollution contaminated production logs and a prior diagnosis. Both files now set a temp CLAUDE_MEM_DATA_DIR at the top of the file before dynamically importing the code under test (ESM hoisting makes beforeEach too late), assert their paths landed outside the real dir, and derive PID_FILE from the same frozen paths module the code uses so test and code can never diverge under bun's shared module cache. The snapshot-restore scaffolding is deleted; zero assertions changed. tests/preload.ts gains a tripwire: when CLAUDE_MEM_DATA_DIR is unset it fills a per-run temp dir, so no test in any file can fall through to the real data dir. Fallout made explicit: worker-spawn child processes get an explicit temp dir; install-error-matrix restores rather than deletes the env var; settings-defaults-manager pins the unset-env default it was implicitly relying on. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(settings): bootstrap notices go to stderr, never stdout CI on PR #2894 caught the latent bug: on the first boot in a fresh data dir, SettingsDefaultsManager printed '[SETTINGS] Created settings file with defaults: ...' to stdout before the start command's JSON hook payload, corrupting the machine-readable contract every fresh install's first hook invocation relies on. The Phase 6 per-run temp data dir made the cold-dir case deterministic in CI, exposing it. Both informational notices (creation, nested-schema migration) now use console.warn — stderr — matching the function's existing failure-path idiom; two regression tests pin stdout silence on both paths. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * refactor(restart): address PR #2894 review — dedupe script resolver, skip futile port wait Both inline copies of the marketplace-first script-candidate list in worker-service.ts (restart fallback + successor handoff injection) now call the exported resolveWorkerScriptPath() ?? __filename, so the candidate list lives in one place. verifyRestartedWorker's failure result gains lastPollSawHealth; when the self-replacement handoff verification timed out while a live (but unverifiable) worker was still serving on the port, the CLI fallback now skips its port-free wait — the port cannot free while that worker lives, so the wait only burned its full platform-scaled budget before the same final verification ran anyway. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> |