feat(telemetry): reliability signals — retrieval quality, compression trust, worker lifecycle, hook failures (Plan 14) (#2874)

* feat(telemetry): disclose 19 reliability-signal fields and 2 new events across all surfaces Whitelist (scrub.ts), scrub tests, public docs (telemetry.mdx), and CLI disclosure (COLLECTED_FIELDS/EVENT_NAMES) for the Plan 14 reliability signals: search retrieval quality, compression trust, worker lifecycle, and hook failure keys, plus the worker_stopped and hook_failed events. Includes the plan document. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(telemetry): retrieval-quality signals on search_performed SearchManager.search() fills an optional telemetry envelope (result_count, search_strategy, chroma_available, fallback_reason) across all three search paths; handlers stash it on res.locals.searchTelemetry and the existing finish-middleware spreads it into the search_performed capture. Zero-result searches report result_count: 0; Chroma fallback reasons are a closed enum, never the error message. Response shapes unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(telemetry): compression trust signals on session_compressed fabrication_detected/fabricated_count flow through compressionProps (all three emit paths); invalid-output respawns emit a respawn-gated session_compressed with outcome invalid_output and the classifier value; aborted generators emit outcome aborted with abort_reason normalized to a closed enum in the .finally where all five abort flows converge (the .catch path can never observe a non-null abortReason). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(telemetry): worker lifecycle signals — worker_stopped, crash detection, memory metrics Clean-shutdown sentinel written before telemetry flush and consumed at startup; worker_started gains previous_shutdown (crash/clean/unknown) and previous_uptime_seconds derived from the stale PID file; new worker_stopped event (uptime_seconds, shutdown_reason stop/restart/ signal) emitted before shutdownTelemetry(); the CLI restart path tags /api/admin/shutdown?reason=restart so restarts are distinguishable; buildLifecycleProps adds integer process_rss_mb/heap_used_mb. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(telemetry): threshold-gated hook_failed distress signal via CLI transport recordWorkerUnreachable emits hook_failed exactly when the consecutive- failure count reaches the fail-loud threshold; the generic blocking-error branch emits error_mode blocking_error. Both emits are awaited before the process.exit paths so the 2s-capped CLI POST survives; hook_type is a closed enum registered at hookCommand entry. Exit codes unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore(build): regenerate plugin artifacts with Plan 14 telemetry signals Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(tests): make PostHog client regression test order-independent via global preload mock The disableGeoip regression test mocked posthog-node per-file, but telemetry.ts is imported transitively by many test files in the shared bun process, so the mock registered too late and the test failed in full-suite runs — CI on main has been red since v13.5.4. The mock now registers in a bunfig [test].preload before any module loads, which also guarantees test runs can never construct a real PostHog client and flush fabricated events into production analytics (consent is default-on and the suite outlives flushInterval). telemetry.ts gains a test-only state reset so construction is observed deterministically regardless of suite order. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(telemetry): forward shutdown reason in Windows-managed IPC message Review follow-up: the wrapper IPC path discarded the restart tag, so an external Windows wrapper could only ever report shutdown_reason 'stop'. No wrapper in this repo listens for the message, but the reason now travels with it for any that does. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-07-03 12:32:32 +08:00 · 2026-06-10 02:51:45 -07:00
parent 25015fc3e5
commit 4ce51efd28
22 changed files with 1197 additions and 374 deletions
--- a/bunfig.toml
+++ b/bunfig.toml
@@ -1,2 +1,7 @@
 [test]
 smol = true
+# Mocks posthog-node for ALL tests before any module loads — required because
+# telemetry.ts is a process-wide singleton imported transitively by many test
+# files, and because real captures from tests would pollute production
+# analytics. See tests/preload.ts.
+preload = ["./tests/preload.ts"]
--- a/docs/public/telemetry.mdx
+++ b/docs/public/telemetry.mdx
@@ -36,7 +36,7 @@ Every event property passes through a strict whitelist scrubber — any key not
 | `runtime_version` | `1.2.0` | Runtime version string |
 | `node_version` | `22.14.0` | Node.js version string |
 | `duration_ms` | `1843` | How long an operation took |
-| `outcome` | `ok` | Coarse result: ok / error / partial |
+| `outcome` | `ok` | Coarse result — a closed enum: ok / error / partial / invalid_output / aborted |
 | `error_category` | `provider_error` | Coarse error bucket — never an error message |
 | `locale` | `en-US` | Language tag |
 | `is_ci` | `false` | Whether running in CI |
@@ -72,6 +72,25 @@ Every event property passes through a strict whitelist scrubber — any key not
 | `install_age_days` | `104` | Days since the install's first recorded session |
 | `obs_count_7d` / `obs_count_30d` | `1887` / `10357` | Observations stored in the last 7 / 30 days |
 | `days_since_last_obs` | `0` | Days since the most recent observation was stored |
+| `result_count` | `12` | How many results a memory search returned — count only, never the results or the query |
+| `chroma_available` | `true` | Whether the vector-search backend was reachable for a search (false = fell back to full-text search) |
+| `fallback_reason` | `none` | Why a search fell back from vector search: none / chroma_connection / chroma_error / chroma_not_initialized — a closed enum, never an error message |
+| `fabrication_detected` | `false` | Whether a compression's output referenced commit hashes that don't exist in your repo (a model-trust check) |
+| `fabricated_count` | `0` | How many nonexistent commit hashes were detected — count only, never the hashes |
+| `invalid_output_class` | `idle` | Coarse class of an unusable compression output: xml / idle / prose / poisoned (`xml` = looked like the expected format but failed to parse) — never the output itself |
+| `consecutive_invalid_outputs` | `3` | How many unusable outputs occurred in a row before recovery |
+| `respawn_triggered` | `true` | Whether the compression agent was restarted after repeated unusable output |
+| `abort_reason` | `idle` | Why a compression session was aborted: idle / shutdown / overflow / restart_guard / quota / poisoned / none — a closed enum |
+| `previous_shutdown` | `clean` | How the previous worker run ended, detected at startup: crash / clean / unknown |
+| `previous_uptime_seconds` | `86400` | How long the previous worker run was up, in whole seconds |
+| `uptime_seconds` | `3600` | How long the worker was up when it stopped, in whole seconds |
+| `shutdown_reason` | `restart` | Why the worker stopped: stop / restart / signal |
+| `process_rss_mb` | `187` | Worker process resident memory, integer megabytes |
+| `heap_used_mb` | `92` | Worker JS heap in use, integer megabytes |
+| `hook_type` | `observation` | Which hook kind failed: context / session-init / observation / summarize / file-context — our handler names |
+| `error_mode` | `worker_unavailable` | Coarse hook failure mode: worker_unavailable / blocking_error — never an error message |
+| `consecutive_failures` | `3` | How many hook failures occurred in a row (the fail-loud counter) |
+| `threshold_tripped` | `true` | Whether the consecutive-failure count reached the fail-loud threshold |

 One value is derived server-side rather than sent by the client: PostHog resolves the request's sender IP to a **coarse location** (country / region / city) at ingestion, before the IP itself is discarded. The client never attaches an IP to any event, and the raw IP is never stored — see [What is NEVER collected](#what-is-never-collected).

@@ -82,10 +101,12 @@ One value is derived server-side rather than sent by the client: PostHog resolve
 | `install_completed` | `npx claude-mem install` finishes | `ide`, `provider`, `runtime_mode`, `is_update`, `outcome`, `duration_ms`, `interactive`, `install_method`, `bun_version`, `uv_version`, `claude_code_version` |
 | `install_failed` | The installer aborts | `error_category` (our error-taxonomy id), `interactive`, `install_method`, `claude_code_version` |
 | `uninstall_completed` | `npx claude-mem uninstall` finishes | — |
-| `worker_started` | The background worker starts, plus one heartbeat per 24h of uptime | `trigger` (start / heartbeat), `duration_ms`, `ide`, `provider`, `mode`, `runtime_mode`, and the install snapshot: `db_observation_count`, `db_session_count`, `db_summary_count`, `db_project_count`, `db_size_mb`, `install_age_days`, `obs_count_7d`, `obs_count_30d`, `days_since_last_obs` |
-| `session_compressed` | A session compression stores observations, or the compression generator fails | `outcome`, `duration_ms`, `count`, `has_summary`, `provider`, `model`, `ide`, `hook`, `compression_ms`, `observation_type`, `obs_type_*`, `tokens_input`, `tokens_output`, `compression_ratio`, `cost_usd`, `endpoint_class`; on failure `error_category`, `model` |
+| `worker_started` | The background worker starts, plus one heartbeat per 24h of uptime | `trigger` (start / heartbeat), `duration_ms`, `ide`, `provider`, `mode`, `runtime_mode`, process memory (`process_rss_mb`, `heap_used_mb`), the install snapshot: `db_observation_count`, `db_session_count`, `db_summary_count`, `db_project_count`, `db_size_mb`, `install_age_days`, `obs_count_7d`, `obs_count_30d`, `days_since_last_obs`; on a real start also crash detection: `previous_shutdown` (crash / clean / unknown) and, after a clean shutdown, `previous_uptime_seconds` |
+| `session_compressed` | A session compression stores observations, repeated unusable model output forces an agent respawn, the compression generator fails, or a compression session is aborted | `outcome` (ok / invalid_output / error / aborted), `duration_ms`, `count`, `has_summary`, `provider`, `model`, `ide`, `hook`, `compression_ms`, `observation_type`, `obs_type_*`, `tokens_input`, `tokens_output`, `compression_ratio`, `cost_usd`, `endpoint_class`, `fabrication_detected`, `fabricated_count`; on an invalid-output respawn `invalid_output_class`, `consecutive_invalid_outputs`, `respawn_triggered`; on failure `error_category`; on abort `abort_reason` |
 | `context_injected` | Stored memory is injected into a new session | `outcome`, `duration_ms`, `mode`, `provider`, `search_strategy`, `observation_count`, `session_count`, `timeline_depth_days`, `has_session_summary`, `obs_type_*`, `tokens_injected`, `tokens_saved_vs_naive` |
-| `search_performed` | A memory search runs (never the query text) | `endpoint`, `outcome`, `duration_ms` |
+| `search_performed` | A memory search runs (never the query text) | `endpoint`, `outcome`, `duration_ms`, `result_count`, `search_strategy`, `chroma_available`, `fallback_reason` |
+| `worker_stopped` | The background worker shuts down gracefully | `uptime_seconds`, `shutdown_reason` (stop / restart / signal) |
+| `hook_failed` | A claude-mem hook fails hard — the worker is unreachable past the fail-loud threshold, or a blocking error occurs | `hook_type`, `error_mode`, `consecutive_failures`, `threshold_tripped` |
 | `error_occurred` | The worker returns an HTTP 5xx | `error_category` |

 ## What is NEVER collected
--- a/plans/14-telemetry-reliability-signals.md
+++ b/plans/14-telemetry-reliability-signals.md
@@ -0,0 +1,319 @@
+# Plan 14 — Telemetry Reliability Signals
+
+Adds the five highest-value missing telemetry signals identified by the 2026-06-10
+capture-surface audit. Theme: **we instrument success well; failure is invisible.**
+Every signal below feeds the **Reliability** sentence of
+`plans/2026-06-09-telemetry-metrics-spec.md` ("Core pipeline succeeds X% of the
+time at scale") — plus retrieval quality, which today has no KPI at all.
+
+Phases are self-contained: each can be executed in a fresh chat context. Execute
+in order; Phase 1–4 are independent of each other but all depend on Phase 0's
+facts and share the Phase-ritual below.
+
+---
+
+## Phase 0 — Verified facts, allowed APIs, and the every-property ritual
+
+Consolidated from 5 documentation-discovery agents (all high confidence, all
+findings cite read code). **Do not invent APIs beyond this list.**
+
+### The pipeline ritual — EVERY new property or event must touch all five surfaces
+
+| # | Surface | Location | What to do |
+|---|---|---|---|
+| 1 | Scrub whitelist | `src/services/telemetry/scrub.ts:8-82` (`ALLOWED_PROPERTY_KEYS: Set<string>`) | Add the key, grouped with a category comment like the existing ones |
+| 2 | Scrub tests | `tests/telemetry/scrub.test.ts` | Copy the pattern at `:5-31` (single-group) or `:81-106` (multi-key group); also confirm `:139-169` drop-tests still pass |
+| 3 | Public docs | `docs/public/telemetry.mdx` fields table `:26-75`, events table `:78-89` | Add a row per field; new events get an events-table row |
+| 4 | CLI disclosure | `src/npx-cli/commands/telemetry.ts` `COLLECTED_FIELDS:23-66`, `EVENT_NAMES:68-77` | Add a line per field; new event names go in `EVENT_NAMES` |
+| 5 | Capture site | per phase below | Emit via `captureEvent` / `captureCliEvent` only |
+
+### Allowed APIs (verified signatures)
+
+- `captureEvent(event: string, props?: Record<string, unknown>, opts?: { person?: boolean }): void` — `src/services/telemetry/telemetry.ts:72` (worker transport; consent-gated, scrubbed, fire-and-forget)
+- `captureCliEvent(event, props?, opts?): Promise<void>` — `src/services/telemetry/cli-telemetry.ts:22` (short-lived-process transport; direct POST, hard 2s timeout `CAPTURE_TIMEOUT_MS` at `:15`, never throws)
+- `scrubProperties(props): Record<string, string | number | boolean>` — `src/services/telemetry/scrub.ts:91-114` (drops non-whitelisted keys and non-primitives **silently**; strings clamped to 200 chars; numbers must be finite)
+- `collectInstallStats(db): Record<string, number>` — `src/services/telemetry/install-stats.ts:29`
+- `getUptimeSeconds(startedAtMs: number, now?): number` — `src/shared/uptime.ts:5-7`
+- `writePidFile(info: PidInfo) / readPidFile(): PidInfo | null / removePidFile()` — `src/services/infrastructure/ProcessManager.ts:134/141/156`; `PidInfo = { pid, port, startedAt: string /* ISO8601 */, startToken? }` (`src/supervisor/process-registry.ts:49-54`)
+- `recordWorkerUnreachable(): number` — `src/shared/worker-utils.ts:451-470` (returns the consecutive-failure count; persists atomically in `~/.claude-mem/state/hook-failures.json`; threshold default 3, env `CLAUDE_MEM_HOOK_FAIL_LOUD_THRESHOLD`)
+- `classifyObserverOutput(raw): 'xml'|'idle'|'prose'|'poisoned'` — `src/sdk/output-classifier.ts:60-80`
+- `verifyCommitHashesInText(...): CommitVerificationResult` with `fabricated: string[]` — `src/sdk/commit-verification.ts:69-108`
+- `DATA_DIR` / `paths.workerPid()` etc. — `src/shared/paths.ts:40,129-151`
+
+### Global anti-patterns (from discovery; apply to every phase)
+
+- Properties not added to `ALLOWED_PROPERTY_KEYS` are **silently dropped** — no error. Always whitelist first, then emit.
+- Only `number | boolean | closed-enum string`. Never free text, paths, queries, error messages, IDs derived from the user. (An earlier audit draft proposed `error_summary: string` — explicitly rejected.)
+- `person: true` only on lifecycle events (spec constraint, `plans/2026-06-09-telemetry-metrics-spec.md:65-71`). Nothing in this plan adds person properties; do not touch `PERSON_PROPERTY_KEYS`.
+- Never bypass `captureEvent`/`captureCliEvent` with direct PostHog calls.
+- Debug-mode verification harness: `CLAUDE_MEM_TELEMETRY_DEBUG=1` prints would-be payloads to stderr and sends nothing (`telemetry.ts:97-103`).
+
+### Discovery discrepancy to resolve during Phase 2
+
+One agent reported `INVALID_OUTPUT_RESPAWN_THRESHOLD` = 25, another = 3. Read
+`src/services/worker/agents/ResponseProcessor.ts:25` before relying on the value.
+
+---
+
+## Phase 1 — Retrieval quality: `result_count` + strategy/fallback on `search_performed`
+
+**Narrative served:** Reliability + retrieval quality. Zero-result rate becomes
+computable; Chroma's silent degradation to FTS becomes visible (the recurring
+`SQLiteSearchStrategy Database error` incident class).
+
+### Verified obstacles (do not skip)
+
+- The existing capture is a middleware: `SearchRoutes.ts:117-123` inside
+  `res.once('finish')` — it fires **after** the response, outside handler scope.
+  It can see only `endpoint`, `res.statusCode`, and elapsed time. Result arrays,
+  `totalResults` (computed at `SearchManager.ts:307`), `chromaFailed`
+  (`SearchManager.ts:158, 206, 274`) and `chromaFailureReason`
+  (`SearchManager.ts:267-275`) are method-local and unreachable from there.
+- `SearchManager.search()` has three paths: filter-only SQLite (`:165-176`),
+  Chroma (`:179-286`, sets `chromaFailed` on error), Chroma-not-initialized FTS
+  (`:288-305`). Text-format responses (`:420-425`) do not carry counts; only
+  `format='json'` (`:309-316`) includes `totalResults`.
+- `search_strategy` is **already whitelisted** (`scrub.ts:55`); only the new keys
+  need whitelist entries.
+
+### What to implement
+
+1. In `SearchManager.search()`, build a small telemetry envelope alongside the
+   existing return value — do not change response shapes. Collect:
+   `result_count` (the `totalResults` already computed at `:307`),
+   `search_strategy: 'chroma' | 'fts' | 'filter_only'` (one per path above),
+   `chroma_available: boolean` (false when `chromaFailed` or not initialized),
+   `fallback_reason: 'none' | 'chroma_connection' | 'chroma_error' | 'chroma_not_initialized'`
+   (map from `chromaFailureReason.isConnectionError` at `:271`; never the message).
+   Expose it to callers — recommended: return `{ ...existing, telemetry }` for an
+   internal caller, or set it on a mutable param. Simplest verified-safe plumbing:
+   handlers stash it on `res.locals.searchTelemetry`, and the middleware at
+   `SearchRoutes.ts:117-123` spreads `res.locals.searchTelemetry ?? {}` into the
+   existing `captureEvent('search_performed', …)` props.
+2. Whitelist `result_count`, `chroma_available`, `fallback_reason` (ritual #1–4).
+3. Note: `src/services/worker/search/types.ts:53-64` has a `StrategySearchResult`
+   with a `strategy` field but `SearchManager.search()` does not use it — derive
+   strategy from the three paths; do not refactor onto SearchOrchestrator here.
+
+### Verification
+
+- [ ] `bun test tests/telemetry/` green (new scrub cases included)
+- [ ] `npm run typecheck:root` clean
+- [ ] `CLAUDE_MEM_TELEMETRY_DEBUG=1` + a worker search request prints `search_performed` with `result_count`, `search_strategy`, `chroma_available`, `fallback_reason`
+- [ ] Grep guard: `grep -n "fallback_reason" src/services/telemetry/scrub.ts docs/public/telemetry.mdx src/npx-cli/commands/telemetry.ts` hits all three
+- [ ] Zero-result search shows `result_count: 0` (not missing)
+
+### Anti-pattern guards
+
+- Do NOT try to introspect the response body from the middleware (no `res._getBuffer()`-style Express internals — unverified, fragile).
+- Do NOT put `chromaFailureReason.message` in any property — enum only.
+- Do NOT change the text-format response shape consumed by clients.
+
+---
+
+## Phase 2 — Compression quality: fabrication, invalid-output, and abort reasons on `session_compressed`
+
+**Narrative served:** Reliability + model quality (extends yesterday's
+tokens/cost/ratio work with per-model trust signals).
+
+### Verified mechanics (this is the key to doing it right)
+
+- `compressionProps` is built at `ResponseProcessor.ts:194-214`. Non-SDK
+  providers emit immediately (`:228`); the SDK/Claude path stashes the object
+  into `session.pendingCompressionEvent` (`worker-types.ts:60`) at `:216-226`,
+  and `ClaudeProvider.ts:416-435` later merges real token fields and emits;
+  `:442-445` is the no-result fallback emit. **Therefore: any property added to
+  `compressionProps` automatically flows through all three emit paths.**
+- Fabrication scope: `ResponseProcessor.ts:115-135` already computes
+  `fabricated: string[]` via `verifyCommitHashesInText`.
+- Invalid output: `ResponseProcessor.ts:48-88` returns early — **no event fires
+  at all** on that path today. `session.consecutiveInvalidOutputs`
+  (`worker-types.ts:34`) increments at `:54`, resets at `:92`; respawn decision
+  at `:67-79` (`outputClass === 'poisoned'` OR threshold reached — read the
+  threshold at `:25`, see Phase 0 discrepancy).
+- `abortReason` enum: `worker-types.ts:42` — `'idle'|'shutdown'|'overflow'|'restart-guard'|'quota'|string|null`;
+  set at `ClaudeProvider.ts:270` (note: `'quota:…'` prefix format), `:315`,
+  `SessionManager.ts:272,294,407`; consumed at `SessionRoutes.ts:166-167`. The
+  error-path emit is `SessionRoutes.ts:154-163`.
+
+### What to implement
+
+1. **Fabrication:** in `ResponseProcessor.ts` where `fabricated.length` is known
+   (`:128-135`), add to `compressionProps`: `fabrication_detected: boolean`,
+   `fabricated_count: number`. (Flows through deferred path for free.)
+2. **Invalid output:** at the respawn decision (`:67-79`) — and ONLY when a
+   respawn triggers, to bound volume — emit one
+   `captureEvent('session_compressed', { outcome: 'invalid_output', invalid_output_class, consecutive_invalid_outputs, respawn_triggered: true, provider, model, ide, hook })`
+   where `invalid_output_class` is the classifier value (`'idle'|'prose'|'poisoned'`).
+3. **Abort reason:** in the error-path emit (`SessionRoutes.ts:154-163`), add
+   `abort_reason` normalized to a closed enum:
+   `'idle'|'shutdown'|'overflow'|'restart_guard'|'quota'|'none'` — split the
+   `'quota:…'` format on `':'` and map `'restart-guard'` → `'restart_guard'`.
+4. Whitelist `fabrication_detected`, `fabricated_count`, `invalid_output_class`,
+   `consecutive_invalid_outputs`, `respawn_triggered`, `abort_reason` (ritual #1–4).
+
+### Verification
+
+- [ ] `bun test tests/telemetry/` green; `npm run typecheck:root` clean
+- [ ] Debug-mode `session_compressed` payload shows `fabrication_detected: false, fabricated_count: 0` on a normal compression
+- [ ] Grep guard: `grep -rn "abort_reason" src/services/telemetry/scrub.ts src/services/worker/http/routes/SessionRoutes.ts` both hit
+- [ ] Confirm the deferred path carries new props: grep the built `plugin/scripts/worker-service.cjs` for `fabrication_detected` after `npm run build`
+
+### Anti-pattern guards
+
+- Do NOT emit an event per invalid output (volume) — respawn-gated only.
+- Do NOT send raw `abortReason` strings (`'quota:daily'`, `'restart-guard'`) — normalize to the closed enum first; the scrubber will happily pass any ≤200-char string, so enum discipline is on the emitter.
+- Do NOT add the new props anywhere except `compressionProps` for the fabrication fields — adding them only at the `ClaudeProvider` merge would miss non-SDK providers.
+
+---
+
+## Phase 3 — Worker lifecycle: crash detection, `worker_stopped`, heartbeat health
+
+**Narrative served:** Reliability ("crash-free installs") + makes the DAU/uptime
+data trustworthy.
+
+### Verified mechanics
+
+- PID file already stores `startedAt` ISO8601 (`worker-service.ts:289`,
+  `PidInfo` at `process-registry.ts:49-54`) → previous uptime is computable on
+  next start via `Date.parse(startedAt)`.
+- There is NO shutdown sentinel today; marker-file pattern to copy:
+  `ProcessManager.ts:232-254` (`.chroma-cleaned-v10.3`) — write to `DATA_DIR`.
+- Graceful shutdown: `worker-service.ts:565-585`; `shutdownTelemetry()` is called
+  at `:576` and races a 3s flush (`telemetry.ts:124-144`) — an event captured
+  **before** `:576` will flush. Stop-case `removePidFile()` is at `:836`.
+- `worker_started` captures: `:427` (trigger `start`, `person: true`), `:436`
+  (heartbeat, 24h `setInterval` with `.unref()` at `:435-438`); props builder
+  `buildLifecycleProps()` at `:401-426`.
+- `uncaughtException` handler at `:1075-1078` logs and does NOT exit (known smell — out of scope here, do not change process semantics in this plan).
+
+### What to implement
+
+1. **Clean-shutdown sentinel:** in the shutdown path (before `:576`), write
+   `DATA_DIR/.worker-clean-shutdown` containing the ISO timestamp (copy the
+   marker pattern from `ProcessManager.ts:232-254`). Delete the sentinel at
+   startup after reading it.
+2. **Crash detection on start:** in the startup daemon path, before
+   `writePidFile`, derive:
+   - stale PID file present + no sentinel → `previous_shutdown: 'crash'`
+   - sentinel present → `'clean'`
+   - neither (first run) → `'unknown'`
+   - `previous_uptime_seconds` from the stale PID file's `startedAt` to sentinel
+     time (clean) or to `now` minus unknown gap (crash → omit rather than guess;
+     omitted properties are fine).
+   Add both to the existing `captureEvent('worker_started', …)` at `:427`.
+3. **`worker_stopped` event:** immediately before `shutdownTelemetry()` at
+   `:576`, `captureEvent('worker_stopped', { uptime_seconds, shutdown_reason })`
+   with `uptime_seconds` from `getUptimeSeconds(this.startTime)`
+   (`worker-service.ts:122`, `uptime.ts:5-7`) and
+   `shutdown_reason: 'stop' | 'restart' | 'signal'` from the caller. No
+   `person: true`.
+4. **Heartbeat health:** in the heartbeat payload (`:436` / `buildLifecycleProps`),
+   add `process_rss_mb` and `heap_used_mb` as integers from
+   `process.memoryUsage()` (`Math.round(rss / 1024 / 1024)`).
+5. Whitelist `previous_shutdown`, `previous_uptime_seconds`, `uptime_seconds`,
+   `shutdown_reason`, `process_rss_mb`, `heap_used_mb`; add `worker_stopped` to
+   `EVENT_NAMES` and the docs events table (ritual #1–4).
+
+### Verification
+
+- [ ] `bun test tests/telemetry/` green; `npm run typecheck:root` clean
+- [ ] Debug mode: `worker-service restart` prints `worker_stopped` (reason `restart`) then `worker_started` with `previous_shutdown: 'clean'`
+- [ ] Kill -9 the worker, start it: `worker_started` shows `previous_shutdown: 'crash'`
+- [ ] Heartbeat payload contains integer `process_rss_mb`
+- [ ] Sentinel file is removed after startup reads it (no stale `'clean'` after a later crash)
+
+### Anti-pattern guards
+
+- Do NOT compute uptime from in-memory `startTime` for the *previous* run — it's never persisted; use the PID file's `startedAt`.
+- Do NOT emit `worker_stopped` after `shutdownTelemetry()` — `isShutdown` (`telemetry.ts:81`) drops late events by design.
+- Do NOT add the new keys to `PERSON_PROPERTY_KEYS` (spec ingestion-cost constraint).
+- `process.memoryUsage().rss` is bytes — convert; the scrubber drops non-finite numbers silently.
+
+---
+
+## Phase 4 — `hook_failed` event (threshold-gated, CLI transport)
+
+**Narrative served:** Reliability — a failing hook is silent memory loss; today
+the fail-loud counter only writes to the user's stderr.
+
+### Verified constraints (these dictate the design — read before coding)
+
+- Hooks are short-lived processes (<1s typical). The worker transport
+  (posthog-node batching) can never flush there; and emitting via the worker API
+  is self-defeating (the defining failure IS "worker unreachable"). **Transport
+  must be `captureCliEvent`** (`cli-telemetry.ts:22`, direct POST, 2s cap, never throws).
+- **The trap:** `exitGraceful` (`hook-io.ts:166-173`) and `emitBlockingError`
+  (`hook-io.ts:150-159`) call `process.exit()` immediately and do not await
+  pending promises — a fire-and-forget POST is killed mid-flight. The emit must
+  be **awaited before** the exit call, inside the failure branch.
+- Catch taxonomy lives at `hook-command.ts:99-128`: AdapterRejectedInput
+  (`:100-105`), non-blocking input error (`:106-111`), worker-unavailable
+  (`:112-119`, the only branch calling `recordWorkerUnreachable()`), generic
+  blocking error (`:121-128`, exit 2).
+- `recordWorkerUnreachable(): number` returns the consecutive count and knows the
+  threshold — gate on it.
+- Hooks currently import zero telemetry code; `captureCliEvent` has only
+  fs/fetch deps and bundles fine via `scripts/build-hooks.js` esbuild (telemetry
+  modules are not externalized — verified at `build-hooks.js:284-330`).
+
+### What to implement
+
+1. In `hook-command.ts`, in exactly two branches:
+   - **worker-unavailable branch (`:112-119`):** after
+     `recordWorkerUnreachable()` returns `count`, if `count` has just reached the
+     fail-loud threshold (the same condition that triggers the blocking stderr
+     message), `await captureCliEvent('hook_failed', { hook_type, error_mode: 'worker_unavailable', consecutive_failures: count, threshold_tripped: true })`.
+   - **generic blocking-error branch (`:121-128`):**
+     `await captureCliEvent('hook_failed', { hook_type, error_mode: 'blocking_error', threshold_tripped: false })` before `emitBlockingError`.
+   Both branches are rare and already failed — the ≤2s bounded wait is
+   acceptable there. Never emit on the success path or the two skip branches.
+2. `hook_type`: closed enum from the hook event already passed to
+   `hookCommand(platform, event, …)` (`:79`) — use the event/handler name set
+   (`context | session-init | observation | summarize | file-context`), not free text.
+3. Whitelist `hook_type`, `error_mode`, `consecutive_failures`,
+   `threshold_tripped`; add `hook_failed` to `EVENT_NAMES` + docs events table
+   (ritual #1–4).
+
+### Verification
+
+- [ ] `bun test tests/telemetry/` green; `npm run typecheck:root` clean
+- [ ] `npm run build` then grep the built hook artifact for `hook_failed` (confirms bundling)
+- [ ] With the worker stopped and `CLAUDE_MEM_TELEMETRY_DEBUG=1`, run a hook 3× (threshold): third run prints `hook_failed` with `consecutive_failures: 3`
+- [ ] Success-path hook run emits nothing and latency is unchanged
+- [ ] Confirm exit codes unchanged (`HOOK_EXIT_CODES`, `hook-constants.ts:15-20`)
+
+### Anti-pattern guards
+
+- Do NOT fire-and-forget then `process.exit()` — the event dies with the process.
+- Do NOT emit per-invocation hook latency events (volume + inline-latency cost). Worker-side `duration_ms` on `context_injected`/`search_performed` already covers worker latency; defer hook-side latency to a future aggregate.
+- Do NOT route the emit through `executeWithWorkerFallback` or any worker API.
+- Do NOT emit in the AdapterRejectedInput / non-blocking-input branches (expected, noisy, not failures of ours).
+
+---
+
+## Phase 5 — Final verification
+
+1. **Full ritual audit** — for each new key
+   (`result_count, chroma_available, fallback_reason, fabrication_detected, fabricated_count, invalid_output_class, consecutive_invalid_outputs, respawn_triggered, abort_reason, previous_shutdown, previous_uptime_seconds, uptime_seconds, shutdown_reason, process_rss_mb, heap_used_mb, hook_type, error_mode, consecutive_failures, threshold_tripped`):
+   `grep -n "<key>" src/services/telemetry/scrub.ts tests/telemetry/scrub.test.ts docs/public/telemetry.mdx src/npx-cli/commands/telemetry.ts` — all four must hit.
+2. **New events disclosed:** `worker_stopped`, `hook_failed` present in
+   `EVENT_NAMES` (`src/npx-cli/commands/telemetry.ts:68-77`) and the
+   `telemetry.mdx` events table.
+3. **Anti-pattern greps:**
+   - `grep -rn "captureEvent\|captureCliEvent" src/ | grep -v services/telemetry` — every site passes enums/counts only (manual scan of new sites)
+   - `grep -rn "posthog" src/ --include="*.ts" | grep -v services/telemetry` — no direct SDK use outside the pipeline
+   - no `PERSON_PROPERTY_KEYS` additions in the diff
+4. **Tests & build:** `bun test tests/telemetry/` (note: bun only — the suite
+   fails under vitest), `npm run typecheck:root`, `npm run build-and-sync`,
+   worker `/health` returns ok.
+5. **Live smoke:** `CLAUDE_MEM_TELEMETRY_DEBUG=1` walk: search (Phase 1 fields),
+   compression (Phase 2 fields), restart (Phase 3 events), worker-down hook ×3
+   (Phase 4 event).
+6. **Docs deploy:** telemetry.mdx changes auto-deploy on push to main — confirm
+   the public page renders the new rows after release.
+
+## Out of scope (deliberately)
+
+- The `uncaughtException` no-exit smell (`worker-service.ts:1075-1078`) — process-semantics change, separate plan.
+- Per-hook latency events, event-loop-lag sampling, `telemetry_disabled` final ping (product/privacy decision pending), installer funnel (`install_started`), doctor/repair distress signals — candidates for Plan 15 after this data lands.
--- a/plugin/scripts/mcp-server.cjs
+++ b/plugin/scripts/mcp-server.cjs
--- a/plugin/scripts/server-beta-service.cjs
+++ b/plugin/scripts/server-beta-service.cjs
--- a/plugin/scripts/transcript-watcher.cjs
+++ b/plugin/scripts/transcript-watcher.cjs
--- a/plugin/scripts/worker-service.cjs
+++ b/plugin/scripts/worker-service.cjs
--- a/src/cli/hook-command.ts
+++ b/src/cli/hook-command.ts
@@ -10,6 +10,12 @@ import {
  exitGraceful,
  resetHookIoState,
 } from '../shared/hook-io.js';
+import {
+  recordWorkerUnreachable,
+  setActiveHookType,
+  getActiveHookType,
+} from '../shared/worker-utils.js';
+import { captureCliEvent } from '../services/telemetry/cli-telemetry.js';
 import { logger } from '../utils/logger.js';

 export interface HookCommandOptions {
@@ -78,6 +84,9 @@ async function executeHookPipeline(

 export async function hookCommand(platform: string, event: string, options: HookCommandOptions = {}): Promise<number> {
  resetHookIoState();
+  // Register the hook event for the threshold-gated hook_failed telemetry
+  // (closed enum enforced inside; non-enum events just omit hook_type).
+  setActiveHookType(event);

  // Hook IO Discipline (issue #2292):
  // We BUFFER stderr during handler execution so that unsolicited writes from
@@ -113,12 +122,28 @@ export async function hookCommand(platform: string, event: string, options: Hook
      logger.warn('HOOK', `Worker unavailable, skipping hook: ${error instanceof Error ? error.message : error}`);
      // EXIT_SIGNAL per CLAUDE.md: transient worker errors exit 0 to avoid
      // Windows Terminal tab accumulation. The fail-loud counter (worker-utils
-      // recordWorkerUnreachable) handles the surface-after-N-failures path.
+      // recordWorkerUnreachable) handles the surface-after-N-failures path and
+      // emits the threshold-gated hook_failed telemetry internally. Awaited:
+      // when the count JUST reaches the threshold it sends the event and then
+      // exits 2; exitGraceful below would kill a pending POST mid-flight.
+      await recordWorkerUnreachable();
      exitGraceful(options);
      return HOOK_EXIT_CODES.SUCCESS;
    }

    logger.error('HOOK', `Hook error: ${error instanceof Error ? error.message : error}`, {}, error instanceof Error ? error : undefined);
+    // hook_failed telemetry MUST be awaited BEFORE emitBlockingError — it
+    // calls process.exit(2), which would kill a fire-and-forget POST
+    // mid-flight. captureCliEvent never throws and is hard-capped at 2s.
+    // Closed-enum props only: the error message itself is never sent.
+    {
+      const hookType = getActiveHookType();
+      await captureCliEvent('hook_failed', {
+        ...(hookType !== null ? { hook_type: hookType } : {}),
+        error_mode: 'blocking_error',
+        threshold_tripped: false,
+      });
+    }
    // BLOCKING_FEEDBACK: flush the buffered logger.error line to stderr and
    // exit 2 so the model receives it per Claude Code's hook contract.
    emitBlockingError(
--- a/src/npx-cli/commands/telemetry.ts
+++ b/src/npx-cli/commands/telemetry.ts
@@ -30,7 +30,7 @@ const COLLECTED_FIELDS = [
  'runtime_version  runtime version string',
  'node_version     Node.js version string',
  'duration_ms      how long an operation took',
-  'outcome          ok / error / partial',
+  'outcome          ok / error / partial / invalid_output / aborted',
  'error_category   coarse error bucket (never a message)',
  'locale           language tag (e.g. en-US)',
  'is_ci            whether running in CI',
@@ -63,6 +63,24 @@ const COLLECTED_FIELDS = [
  'db_size_mb       memory database file size in MB',
  'install_age_days / days_since_last_obs / obs_count_7d / obs_count_30d',
  '                 install age and recent activity, in days/counts',
+  'result_count     how many results a search returned (never the query)',
+  'chroma_available whether vector search was reachable for a search',
+  'fallback_reason  none / chroma_connection / chroma_error / chroma_not_initialized',
+  'fabrication_detected   whether compression output cited nonexistent commits',
+  'fabricated_count how many nonexistent commit hashes (count, never hashes)',
+  'invalid_output_class   xml / idle / prose / poisoned (never the output)',
+  'consecutive_invalid_outputs   unusable outputs in a row before recovery',
+  'respawn_triggered      whether the compression agent was restarted',
+  'abort_reason     idle / shutdown / overflow / restart_guard / quota / poisoned / none',
+  'previous_shutdown      crash / clean / unknown (detected at worker start)',
+  'previous_uptime_seconds / uptime_seconds',
+  '                 worker uptime in whole seconds (previous run / at stop)',
+  'shutdown_reason  stop / restart / signal',
+  'process_rss_mb / heap_used_mb   worker memory, integer megabytes',
+  'hook_type        context / session-init / observation / summarize / file-context',
+  'error_mode       worker_unavailable / blocking_error (never a message)',
+  'consecutive_failures   hook failures in a row (the fail-loud counter)',
+  'threshold_tripped      whether the fail-loud threshold was reached',
 ];

 const EVENT_NAMES = [
@@ -70,9 +88,11 @@ const EVENT_NAMES = [
  'install_failed',
  'uninstall_completed',
  'worker_started',
+  'worker_stopped',
  'session_compressed',
  'context_injected',
  'search_performed',
+  'hook_failed',
  'error_occurred',
 ];

--- a/src/services/infrastructure/HealthMonitor.ts
+++ b/src/services/infrastructure/HealthMonitor.ts
@@ -91,9 +91,13 @@ export async function waitForPortFree(port: number, timeoutMs: number = 10000):
  return false;
 }

-export async function httpShutdown(port: number): Promise<boolean> {
+export async function httpShutdown(port: number, reason: 'stop' | 'restart' = 'stop'): Promise<boolean> {
  try {
-    const result = await httpRequestToWorker(port, '/api/admin/shutdown', 'POST');
+    // The CLI restart path stops the worker through this same endpoint; the
+    // reason tag lets the worker report shutdown_reason: 'restart' on its
+    // worker_stopped telemetry instead of a generic 'stop'.
+    const endpointPath = reason === 'restart' ? '/api/admin/shutdown?reason=restart' : '/api/admin/shutdown';
+    const result = await httpRequestToWorker(port, endpointPath, 'POST');
    if (!result.ok) {
      logger.warn('SYSTEM', 'Shutdown request returned error', { status: result.statusCode });
      return false;
--- a/src/services/server/Server.ts
+++ b/src/services/server/Server.ts
@@ -80,7 +80,9 @@ export interface AiStatus {
 export interface ServerOptions {
  getInitializationComplete: () => boolean;
  getMcpReady: () => boolean;
-  onShutdown: () => Promise<void>;
+  // reason feeds worker_stopped telemetry: 'restart' when the CLI restart
+  // path tags /api/admin/shutdown with ?reason=restart, 'stop' otherwise.
+  onShutdown: (reason?: 'stop' | 'restart') => Promise<void>;
  onRestart: () => Promise<void>;
  workerPath: string;
  runtime?: string;
@@ -291,7 +293,11 @@ export class Server {
      }
    });

-    this.app.post('/api/admin/shutdown', requireLocalhost, async (_req: Request, res: Response) => {
+    this.app.post('/api/admin/shutdown', requireLocalhost, async (req: Request, res: Response) => {
+      // Closed-enum mapping for worker_stopped telemetry: only the exact
+      // 'restart' tag (set by the CLI restart path) upgrades the reason;
+      // anything else stays 'stop'.
+      const shutdownReason: 'stop' | 'restart' = req.query.reason === 'restart' ? 'restart' : 'stop';
      const isWindowsManaged = process.platform === 'win32' &&
        process.env.CLAUDE_MEM_MANAGED === 'true' &&
        process.send;
@@ -299,9 +305,12 @@ export class Server {
      if (isWindowsManaged) {
        res.json({ status: 'shutting_down' });
        logger.info('SYSTEM', 'Sending shutdown request to wrapper');
-        process.send!({ type: 'shutdown' });
+        // No wrapper in this repo listens for this message (legacy external
+        // path), but forward the reason so a wrapper that does can preserve
+        // shutdown_reason fidelity instead of defaulting to 'stop'.
+        process.send!({ type: 'shutdown', reason: shutdownReason });
      } else {
-        flushResponseThen(res, { status: 'shutting_down' }, () => this.options.onShutdown());
+        flushResponseThen(res, { status: 'shutting_down' }, () => this.options.onShutdown(shutdownReason));
      }
    });

--- a/src/services/telemetry/scrub.ts
+++ b/src/services/telemetry/scrub.ts
@@ -79,6 +79,41 @@ export const ALLOWED_PROPERTY_KEYS: Set<string> = new Set([
  'obs_count_7d',
  'obs_count_30d',
  'days_since_last_obs',
+  // search_performed retrieval quality — result_count is an integer,
+  // chroma_available a boolean, fallback_reason one of OUR enum values
+  // (none | chroma_connection | chroma_error | chroma_not_initialized).
+  // Never the query, never an error message.
+  'result_count',
+  'chroma_available',
+  'fallback_reason',
+  // session_compressed trust signals — booleans, counters, and our own
+  // closed enums (invalid_output_class: xml | idle | prose | poisoned, where
+  // 'xml' means XML-shaped output that still failed to parse; abort_reason:
+  // idle | shutdown | overflow | restart_guard | quota | poisoned | none).
+  // Never model output, never raw abort strings.
+  'fabrication_detected',
+  'fabricated_count',
+  'invalid_output_class',
+  'consecutive_invalid_outputs',
+  'respawn_triggered',
+  'abort_reason',
+  // Worker lifecycle health — previous_shutdown (crash | clean | unknown),
+  // shutdown_reason (stop | restart | signal), uptime in whole seconds, and
+  // process memory as integer megabytes. No paths, no PIDs.
+  'previous_shutdown',
+  'previous_uptime_seconds',
+  'uptime_seconds',
+  'shutdown_reason',
+  'process_rss_mb',
+  'heap_used_mb',
+  // hook_failed distress signal — hook_type is one of OUR hook names
+  // (context | session-init | observation | summarize | file-context),
+  // error_mode (worker_unavailable | blocking_error), plus a consecutive
+  // failure counter and threshold flag. Never an error message.
+  'hook_type',
+  'error_mode',
+  'consecutive_failures',
+  'threshold_tripped',
 ]);

 const MAX_STRING_LENGTH = 200;
--- a/src/services/telemetry/telemetry.ts
+++ b/src/services/telemetry/telemetry.ts
@@ -116,6 +116,18 @@ export function captureEvent(
  }
 }

+/**
+ * Test-only. The module state (singleton client, 30s consent TTL cache,
+ * shutdown latch) is process-wide, and the whole bun test suite shares one
+ * process — without a reset, a test asserting client construction inherits
+ * whatever earlier test files did. Never called by production code.
+ */
+export function __resetTelemetryForTests(): void {
+  client = null;
+  consentCache = null;
+  isShutdown = false;
+}
+
 /**
 * Flush queued events on graceful shutdown. Races the SDK shutdown against a
 * 3s timeout so a slow/unreachable ingestion host can never hang worker stop.
--- a/src/services/worker-service.ts
+++ b/src/services/worker-service.ts
@@ -1,6 +1,6 @@

 import path from 'path';
-import { existsSync } from 'fs';
+import { existsSync, readFileSync, unlinkSync, writeFileSync } from 'fs';
 import { spawn } from 'child_process';
 import { Database } from 'bun:sqlite';
 import { Client } from '@modelcontextprotocol/sdk/client/index.js';
@@ -8,6 +8,7 @@ import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js'
 import { getWorkerPort, getWorkerHost } from '../shared/worker-utils.js';
 import { DATA_DIR, DB_PATH, ensureDir } from '../shared/paths.js';
 import { HOOK_TIMEOUTS } from '../shared/hook-constants.js';
+import { getUptimeSeconds } from '../shared/uptime.js';
 import { SettingsDefaultsManager } from '../shared/SettingsDefaultsManager.js';
 import { getAuthMethodDescription } from '../shared/EnvManager.js';
 import { logger } from '../utils/logger.js';
@@ -117,10 +118,69 @@ export function buildStatusOutput(status: 'ready' | 'error', message?: string):
  };
 }

+/**
+ * Closed enum for worker_stopped telemetry. Must stay in sync with the
+ * shutdown_reason whitelist documentation (scrub.ts / telemetry.mdx):
+ * stop = /api/admin/shutdown (CLI `stop`), restart = /api/admin/restart or
+ * CLI `restart` (tagged ?reason=restart), signal = SIGTERM/SIGINT handler.
+ */
+export type WorkerShutdownReason = 'stop' | 'restart' | 'signal';
+
+// Clean-shutdown sentinel — same marker-file pattern as the one-time markers
+// in ProcessManager.ts (.chroma-cleaned-v10.3). Written in the graceful
+// shutdown path, consumed (read + deleted) at the next startup: sentinel
+// present = previous run stopped cleanly; stale PID file with no sentinel =
+// previous run died without reaching the graceful-shutdown path (crash).
+const CLEAN_SHUTDOWN_SENTINEL_PATH = path.join(DATA_DIR, '.worker-clean-shutdown');
+
+function writeCleanShutdownSentinel(): void {
+  try {
+    ensureDir(DATA_DIR);
+    writeFileSync(CLEAN_SHUTDOWN_SENTINEL_PATH, new Date().toISOString());
+  } catch (error: unknown) {
+    if (error instanceof Error) {
+      logger.warn('SYSTEM', 'Failed to write clean-shutdown sentinel', { path: CLEAN_SHUTDOWN_SENTINEL_PATH }, error);
+    } else {
+      logger.warn('SYSTEM', 'Failed to write clean-shutdown sentinel', { path: CLEAN_SHUTDOWN_SENTINEL_PATH }, new Error(String(error)));
+    }
+  }
+}
+
+function readAndClearCleanShutdownSentinel(): string | null {
+  if (!existsSync(CLEAN_SHUTDOWN_SENTINEL_PATH)) return null;
+
+  let contents: string | null = null;
+  try {
+    contents = readFileSync(CLEAN_SHUTDOWN_SENTINEL_PATH, 'utf-8').trim();
+  } catch (error: unknown) {
+    if (error instanceof Error) {
+      logger.warn('SYSTEM', 'Failed to read clean-shutdown sentinel', { path: CLEAN_SHUTDOWN_SENTINEL_PATH }, error);
+    } else {
+      logger.warn('SYSTEM', 'Failed to read clean-shutdown sentinel', { path: CLEAN_SHUTDOWN_SENTINEL_PATH }, new Error(String(error)));
+    }
+  }
+  try {
+    // Always delete after reading: a stale sentinel would mislabel a later
+    // crash as 'clean'.
+    unlinkSync(CLEAN_SHUTDOWN_SENTINEL_PATH);
+  } catch (error: unknown) {
+    if (error instanceof Error) {
+      logger.warn('SYSTEM', 'Failed to remove clean-shutdown sentinel', { path: CLEAN_SHUTDOWN_SENTINEL_PATH }, error);
+    } else {
+      logger.warn('SYSTEM', 'Failed to remove clean-shutdown sentinel', { path: CLEAN_SHUTDOWN_SENTINEL_PATH }, new Error(String(error)));
+    }
+  }
+  return contents;
+}
+
 export class WorkerService implements WorkerRef {
  private server: Server;
  private startTime: number = Date.now();
  private telemetryHeartbeat: ReturnType<typeof setInterval> | null = null;
+  // Crash detection (worker_started telemetry): derived once at startup from
+  // the previous run's stale PID file + the clean-shutdown sentinel.
+  private previousShutdown: 'clean' | 'crash' | 'unknown' = 'unknown';
+  private previousUptimeSeconds: number | null = null;
  private mcpClient: Client;

  private mcpReady: boolean = false;
@@ -191,8 +251,8 @@ export class WorkerService implements WorkerRef {
    this.server = new Server({
      getInitializationComplete: () => this.initializationCompleteFlag,
      getMcpReady: () => this.mcpReady,
-      onShutdown: () => this.shutdown(),
-      onRestart: () => this.shutdown(),
+      onShutdown: (reason) => this.shutdown(reason ?? 'stop'),
+      onRestart: () => this.shutdown('restart'),
      workerPath: __filename,
      getAiStatus: () => {
        let provider = 'claude';
@@ -223,7 +283,7 @@ export class WorkerService implements WorkerRef {
  private registerSignalHandlers(): void {
    configureSupervisorSignalHandlers(async () => {
      this.isShuttingDown = true;
-      await this.shutdown();
+      await this.shutdown('signal');
    });
  }

@@ -275,10 +335,47 @@ export class WorkerService implements WorkerRef {
    }));
  }

+  /**
+   * Crash detection for worker_started telemetry. Must run BEFORE
+   * startSupervisor() — whose validateWorkerPidFile() deletes the previous
+   * run's stale PID file — and before writePidFile overwrites it.
+   *   - clean-shutdown sentinel present → previous run stopped gracefully
+   *   - stale PID file present, no sentinel → previous run crashed
+   *   - neither (first run, or the spawner already cleaned the stale PID
+   *     file) → unknown
+   * The sentinel is consumed here so it can never mislabel a later crash.
+   */
+  private detectPreviousShutdown(): void {
+    const stalePidInfo = readPidFile();
+    const sentinelTimestamp = readAndClearCleanShutdownSentinel();
+
+    if (sentinelTimestamp !== null) {
+      this.previousShutdown = 'clean';
+      // Previous uptime = previous run's PID-file startedAt → sentinel write
+      // time. The previous run's in-memory startTime is never persisted, so
+      // the PID file is the only source; omit when either side is missing.
+      const startedAtMs = stalePidInfo ? Date.parse(stalePidInfo.startedAt) : NaN;
+      const stoppedAtMs = Date.parse(sentinelTimestamp);
+      if (Number.isFinite(startedAtMs) && Number.isFinite(stoppedAtMs) && stoppedAtMs >= startedAtMs) {
+        this.previousUptimeSeconds = Math.floor((stoppedAtMs - startedAtMs) / 1000);
+      }
+    } else if (stalePidInfo) {
+      // Crash: the previous run's stop time is unknowable, so
+      // previous_uptime_seconds is deliberately omitted rather than guessed.
+      this.previousShutdown = 'crash';
+    } else {
+      this.previousShutdown = 'unknown';
+    }
+  }
+
  async start(): Promise<void> {
    const port = getWorkerPort();
    const host = getWorkerHost();

+    // Must run before startSupervisor(): its validateWorkerPidFile() removes
+    // the dead previous run's stale PID file, which crash detection needs.
+    this.detectPreviousShutdown();
+
    await startSupervisor();

    await this.server.listen(port, host);
@@ -422,11 +519,20 @@ export class WorkerService implements WorkerRef {
          // Snapshot is best-effort; the lifecycle event still ships without it.
          logger.debug('SYSTEM', 'Install stats snapshot failed', {}, error as Error);
        }
+        // Process health for the daily heartbeat: memoryUsage() returns bytes;
+        // the scrubber drops non-finite numbers, so round to whole MiB.
+        const memory = process.memoryUsage();
+        props.process_rss_mb = Math.round(memory.rss / 1024 / 1024);
+        props.heap_used_mb = Math.round(memory.heapUsed / 1024 / 1024);
        return props;
      };
      captureEvent('worker_started', {
        trigger: 'start',
        duration_ms: Date.now() - this.startTime,
+        // Crash detection (detectPreviousShutdown): crash case carries no
+        // previous_uptime_seconds — the stop time is unknowable.
+        previous_shutdown: this.previousShutdown,
+        ...(this.previousUptimeSeconds !== null && { previous_uptime_seconds: this.previousUptimeSeconds }),
        ...buildLifecycleProps(),
      }, { person: true });
      // Long-lived workers would otherwise look like a single day of activity.
@@ -562,7 +668,7 @@ export class WorkerService implements WorkerRef {
    this.sessionManager.removeSessionImmediate(sessionDbId);
  }

-  async shutdown(): Promise<void> {
+  async shutdown(reason: WorkerShutdownReason = 'stop'): Promise<void> {
    if (this.transcriptWatcher) {
      this.transcriptWatcher.stop();
      this.transcriptWatcher = null;
@@ -573,6 +679,14 @@ export class WorkerService implements WorkerRef {
      clearInterval(this.telemetryHeartbeat);
      this.telemetryHeartbeat = null;
    }
+    // Mark this stop as graceful for the next start's crash detection, and
+    // capture worker_stopped BEFORE shutdownTelemetry() — isShutdown drops
+    // any event captured after the flush, by design.
+    writeCleanShutdownSentinel();
+    captureEvent('worker_stopped', {
+      uptime_seconds: getUptimeSeconds(this.startTime),
+      shutdown_reason: reason,
+    });
    await shutdownTelemetry();

    await performGracefulShutdown({
@@ -841,7 +955,7 @@ async function main() {

    case 'restart': {
      logger.info('SYSTEM', 'Restarting worker');
-      await httpShutdown(port);
+      await httpShutdown(port, 'restart');
      const restartFreed = await waitForPortFree(port, 5000);
      if (!restartFreed) {
        console.error('Port still bound after shutdown. Resolve manually.');
--- a/src/services/worker/SearchManager.ts
+++ b/src/services/worker/SearchManager.ts
@@ -21,6 +21,19 @@ import type { TimelineData } from './search/index.js';
 import { ResultFormatter } from './search/ResultFormatter.js';
 import { ChromaUnavailableError } from './search/errors.js';

+/**
+ * Telemetry envelope for search_performed (see docs/public/telemetry.mdx).
+ * Populated by SearchManager.search() via a mutable sink param so response
+ * shapes (json and text formats) stay untouched. Privacy: counts, booleans,
+ * and closed enums only — never query text, results, or error messages.
+ */
+export interface SearchTelemetryEnvelope {
+  result_count?: number;
+  search_strategy?: 'chroma' | 'fts' | 'filter_only';
+  chroma_available?: boolean;
+  fallback_reason?: 'none' | 'chroma_connection' | 'chroma_error' | 'chroma_not_initialized';
+}
+
 export class SearchManager {
  private orchestrator: SearchOrchestrator;
  private timelineBuilder: TimelineBuilder;
@@ -149,7 +162,7 @@ export class SearchManager {
    return normalized;
  }

-  async search(args: any): Promise<any> {
+  async search(args: any, telemetryOut?: SearchTelemetryEnvelope): Promise<any> {
    const normalized = this.normalizeParams(args);
    const { query, type, obs_type, concepts, files, format, ...options } = normalized;
    let observations: ObservationSearchResult[] = [];
@@ -306,6 +319,33 @@ export class SearchManager {

    const totalResults = observations.length + sessions.length + prompts.length;

+    // Telemetry envelope (search_performed): derive the strategy from the
+    // three paths above. Enum/count values only — never the Chroma error
+    // message, query text, or result content.
+    if (telemetryOut) {
+      let searchStrategy: SearchTelemetryEnvelope['search_strategy'];
+      let fallbackReason: SearchTelemetryEnvelope['fallback_reason'];
+      if (!query) {
+        // PATH 1: filter-only SQLite (no query text; Chroma never consulted)
+        searchStrategy = 'filter_only';
+        fallbackReason = 'none';
+      } else if (this.chromaSync) {
+        // PATH 2: Chroma semantic search, degrading to FTS5 on error
+        searchStrategy = chromaFailed ? 'fts' : 'chroma';
+        fallbackReason = chromaFailed
+          ? (chromaFailureReason?.isConnectionError ? 'chroma_connection' : 'chroma_error')
+          : 'none';
+      } else {
+        // PATH 3: FTS5 keyword search (Chroma not initialized)
+        searchStrategy = 'fts';
+        fallbackReason = 'chroma_not_initialized';
+      }
+      telemetryOut.result_count = totalResults;
+      telemetryOut.search_strategy = searchStrategy;
+      telemetryOut.chroma_available = this.chromaSync !== null && !chromaFailed;
+      telemetryOut.fallback_reason = fallbackReason;
+    }
+
    if (format === 'json') {
      return {
        observations,
--- a/src/services/worker/agents/ResponseProcessor.ts
+++ b/src/services/worker/agents/ResponseProcessor.ts
@@ -45,6 +45,13 @@ export async function processAgentResponse(

  const parsed = parseAgentXml(text, session.contentSessionId);

+  // Provider enum for telemetry, derived once so the invalid-output and
+  // success paths stamp the same value.
+  const providerName =
+    session.currentProvider ??
+    ({ SDK: 'claude', Gemini: 'gemini', OpenRouter: 'openrouter' } as Record<string, string>)[agentName] ??
+    'claude';
+
  if (!parsed.valid) {
    // Classify the non-XML output so a dropped batch is VISIBLE, not silent
    // (plan-11, #2485). Attach a preview for diagnostics.
@@ -75,6 +82,18 @@ export async function processAgentResponse(
        consecutiveInvalidOutputs: session.consecutiveInvalidOutputs,
        threshold: INVALID_OUTPUT_RESPAWN_THRESHOLD,
      });
+      // Respawn-gated telemetry ONLY (never per invalid output — volume).
+      // Closed enums and counts; the raw model output never leaves the box.
+      captureEvent('session_compressed', {
+        outcome: 'invalid_output',
+        invalid_output_class: outputClass,
+        consecutive_invalid_outputs: session.consecutiveInvalidOutputs,
+        respawn_triggered: true,
+        provider: providerName,
+        model: typeof modelId === 'string' && modelId ? modelId : 'unknown',
+        ide: session.platformSource,
+        hook: session.lastGeneratorSource,
+      });
      await sessionManager.respawnPoisonedSession(session.sessionDbId);
      return;
    }
@@ -111,6 +130,7 @@ export async function processAgentResponse(
  // ground truth via `git cat-file -e` in the session's repo and strip
  // fabricated hashes from the persisted text. projectRoot carries the cwd of
  // the most recently observed tool-use.
+  let fabricatedCount = 0;
  if (summaryForStore) {
    const { fabricated } = verifyCommitHashesInText(
      [
@@ -125,6 +145,8 @@ export async function processAgentResponse(
      session.contentSessionId
    );

+    fabricatedCount = fabricated.length;
+
    if (fabricated.length > 0) {
      logger.warn('PARSER', `${agentName} summary referenced fabricated commit hash(es); flagging before persist`, {
        sessionId: session.sessionDbId,
@@ -175,10 +197,6 @@ export async function processAgentResponse(

  // Telemetry: counts, enums, and REAL usage only (lastUsage is never an
  // estimate — providers leave it null when the API gave no usage split).
-  const providerName =
-    session.currentProvider ??
-    ({ SDK: 'claude', Gemini: 'gemini', OpenRouter: 'openrouter' } as Record<string, string>)[agentName] ??
-    'claude';
  const typeCounts: Record<string, number> = { bugfix: 0, discovery: 0, decision: 0, refactor: 0, other: 0 };
  for (const obs of labeledObservations) {
    const bucket = obs.type in typeCounts && obs.type !== 'other' ? obs.type : 'other';
@@ -205,6 +223,10 @@ export async function processAgentResponse(
    hook: session.lastGeneratorSource,
    endpoint_class: session.endpointClass,
    compression_ms: compressionMs,
+    // Fabrication signals live HERE (not at the ClaudeProvider merge) so they
+    // flow through all three emit paths: immediate, deferred, and no-result.
+    fabrication_detected: fabricatedCount > 0,
+    fabricated_count: fabricatedCount,
    observation_type: labeledObservations.length > 0 ? dominantType : undefined,
    obs_type_bugfix: typeCounts.bugfix,
    obs_type_discovery: typeCounts.discovery,
--- a/src/services/worker/http/routes/SearchRoutes.ts
+++ b/src/services/worker/http/routes/SearchRoutes.ts
@@ -4,6 +4,7 @@ import * as fs from 'fs';
 import path from 'path';
 import { z } from 'zod';
 import { SearchManager } from '../../SearchManager.js';
+import type { SearchTelemetryEnvelope } from '../../SearchManager.js';
 import { BaseRouteHandler } from '../BaseRouteHandler.js';
 import { validateBody } from '../middleware/validateBody.js';
 import { logger } from '../../../../utils/logger.js';
@@ -115,10 +116,15 @@ export class SearchRoutes extends BaseRouteHandler {
        const segment = req.path === '/' ? 'unified' : req.path.slice(1).split('/')[0];
        const endpoint = KNOWN_SEARCH_ENDPOINTS.has(segment) ? segment : 'other';
        res.once('finish', () => {
+          // res.locals.searchTelemetry is the retrieval-quality envelope
+          // (result_count, search_strategy, chroma_available, fallback_reason)
+          // populated by SearchManager.search() and stashed by the handler —
+          // counts/booleans/enums only, never response-body introspection.
          captureEvent('search_performed', {
            endpoint,
            outcome: res.statusCode < 400 ? 'ok' : 'error',
            duration_ms: Date.now() - searchStartedAt,
+            ...(res.locals.searchTelemetry ?? {}),
          });
        });
      }
@@ -153,7 +159,13 @@ export class SearchRoutes extends BaseRouteHandler {
  }

  private handleUnifiedSearch = this.wrapHandler(async (req: Request, res: Response): Promise<void> => {
-    const result = await this.searchManager.search(req.query);
+    // Mutable telemetry sink: SearchManager.search() fills it with the
+    // retrieval-quality envelope; the /api/search middleware spreads it into
+    // search_performed on response finish. Stashed before the await so the
+    // envelope survives even if response serialization fails afterwards.
+    const searchTelemetry: SearchTelemetryEnvelope = {};
+    res.locals.searchTelemetry = searchTelemetry;
+    const result = await this.searchManager.search(req.query, searchTelemetry);
    res.json(result);
  });

--- a/src/services/worker/http/routes/SessionRoutes.ts
+++ b/src/services/worker/http/routes/SessionRoutes.ts
@@ -26,6 +26,25 @@ import { USER_PROMPT_DEDUPE_WINDOW_MS } from '../../../../shared/user-prompts.js

 const MAX_USER_PROMPT_BYTES = 256 * 1024;

+/**
+ * Collapse session.abortReason onto a closed telemetry enum. The raw value can
+ * carry free text after a colon (e.g. 'quota:<provider message>') — never emit
+ * it verbatim. Unknown or absent reasons map to 'none'.
+ */
+function normalizeAbortReason(
+  reason: string | null | undefined
+): 'idle' | 'shutdown' | 'overflow' | 'restart_guard' | 'quota' | 'poisoned' | 'none' {
+  switch ((reason ?? '').split(':')[0]) {
+    case 'idle': return 'idle';
+    case 'shutdown': return 'shutdown';
+    case 'overflow': return 'overflow';
+    case 'restart-guard': return 'restart_guard';
+    case 'quota': return 'quota';
+    case 'poisoned': return 'poisoned';
+    default: return 'none';
+  }
+}
+
 export class SessionRoutes extends BaseRouteHandler {
  constructor(
    private sessionManager: SessionManager,
@@ -151,6 +170,10 @@ export class SessionRoutes extends BaseRouteHandler {
          provider: provider,
          error: errorMsg
        }, error);
+        // No abort_reason here: every site that sets abortReason aborts the
+        // controller on its next line, so aborted generators either resolve
+        // normally (quota/overflow break) or hit the signal-aborted early
+        // return above — this catch only ever sees non-abort rejections.
        captureEvent('session_compressed', {
          outcome: 'error',
          provider,
@@ -165,6 +188,20 @@ export class SessionRoutes extends BaseRouteHandler {
      .finally(async () => {
        const reason = session.abortReason ?? null;
        session.abortReason = null;  // consume the reason
+        if (reason !== null) {
+          // Abort accounting lives HERE, where the reason is consumed — the
+          // ONLY point every abort flow (idle / shutdown / overflow / quota /
+          // poisoned) passes through. Emit the closed enum, never the raw
+          // string ('quota:…' carries a window suffix).
+          captureEvent('session_compressed', {
+            outcome: 'aborted',
+            provider,
+            model: session.lastModelId ?? 'unknown',
+            abort_reason: normalizeAbortReason(reason),
+            hook: session.lastGeneratorSource,
+            ide: session.platformSource,
+          });
+        }
        await handleGeneratorExit(session, reason, {
          sessionManager: this.sessionManager,
          completionHandler: this.completionHandler,
--- a/src/shared/worker-utils.ts
+++ b/src/shared/worker-utils.ts
@@ -9,6 +9,7 @@ import { MARKETPLACE_ROOT, DATA_DIR } from "./paths.js";
 import { loadFromFileOnce } from "./hook-settings.js";
 import { validateWorkerPidFile } from "../supervisor/index.js";
 import { emitBlockingError } from "./hook-io.js";
+import { captureCliEvent } from "../services/telemetry/cli-telemetry.js";
 import { checkVersionMatch } from "../services/infrastructure/index.js";

 function readTimeoutEnv(
@@ -448,7 +449,35 @@ function getFailLoudThreshold(): number {
  return FAIL_LOUD_DEFAULT_THRESHOLD;
 }

-function recordWorkerUnreachable(): number {
+/**
+ * Closed enum of hook handler names allowed as the `hook_type` telemetry
+ * property. Mirrors the scrub whitelist comment (scrub.ts), the CLI
+ * disclosure (npx-cli/commands/telemetry.ts), and docs/public/telemetry.mdx —
+ * never widen one without the others. Events outside this set (user-message,
+ * file-edit) simply omit hook_type.
+ */
+const TELEMETRY_HOOK_TYPES = ['context', 'session-init', 'observation', 'summarize', 'file-context'] as const;
+export type TelemetryHookType = (typeof TELEMETRY_HOOK_TYPES)[number];
+
+let activeHookType: TelemetryHookType | null = null;
+
+/**
+ * Record which hook event this short-lived hook process is executing, so the
+ * fail-loud counter can tag its threshold-gated hook_failed telemetry.
+ * Called once at hookCommand entry; values outside the closed enum are
+ * dropped (never free text).
+ */
+export function setActiveHookType(event: string): void {
+  activeHookType = (TELEMETRY_HOOK_TYPES as readonly string[]).includes(event)
+    ? (event as TelemetryHookType)
+    : null;
+}
+
+export function getActiveHookType(): TelemetryHookType | null {
+  return activeHookType;
+}
+
+export async function recordWorkerUnreachable(): Promise<number> {
  const state = readHookFailureState();
  const next: HookFailureState = {
    consecutiveFailures: state.consecutiveFailures + 1,
@@ -458,6 +487,23 @@ function recordWorkerUnreachable(): number {

  const threshold = getFailLoudThreshold();
  if (next.consecutiveFailures >= threshold) {
+    // hook_failed distress signal. Gated to the failure that JUST reached the
+    // threshold (`===`, not `>=`): the stderr warning below repeats on every
+    // failure past the threshold, but telemetry emits once per failure streak
+    // to bound volume. MUST be awaited BEFORE emitBlockingError — it calls
+    // process.exit(2) immediately, which would kill a fire-and-forget POST
+    // mid-flight. captureCliEvent never throws and is hard-capped at 2s, so
+    // this cannot hang the fail-loud path. Closed-enum/count props only —
+    // never error text. Transport is the direct CLI POST, never the worker
+    // API (the defining failure here IS "worker unreachable").
+    if (next.consecutiveFailures === threshold) {
+      await captureCliEvent('hook_failed', {
+        ...(activeHookType !== null ? { hook_type: activeHookType } : {}),
+        error_mode: 'worker_unavailable',
+        consecutive_failures: next.consecutiveFailures,
+        threshold_tripped: true,
+      });
+    }
    // #2292 fix: BLOCKING_FEEDBACK. emitBlockingError flushes the Phase 2
    // stderr buffer (so preceding logger.warn lines also surface) and writes
    // via the bypass channel + exits 2. Previously this raw process.stderr.write
@@ -501,7 +547,7 @@ export async function executeWithWorkerFallback<T = unknown>(
 ): Promise<WorkerCallResult<T>> {
  const alive = await ensureWorkerAliveOnce();
  if (!alive) {
-    recordWorkerUnreachable();
+    await recordWorkerUnreachable();
    return { continue: true, reason: 'worker_unreachable', [WORKER_FALLBACK_BRAND]: true };
  }

--- a/tests/preload.ts
+++ b/tests/preload.ts
@@ -0,0 +1,35 @@
+import { mock } from 'bun:test';
+
+/**
+ * Global posthog-node mock, registered via bunfig.toml [test].preload BEFORE
+ * any test file or src module loads. It must be global, not per-file:
+ *
+ *  1. telemetry.ts imports PostHog at module top level, and many src modules
+ *     (ResponseProcessor, SessionRoutes, SearchRoutes, worker-service, ...)
+ *     transitively import telemetry.ts. The whole suite runs in one bun
+ *     process, so a per-file mock.module registers too late once any earlier
+ *     test file has touched those modules — the cached telemetry module keeps
+ *     the real PostHog binding.
+ *  2. Telemetry consent is default-on and the publishable key ships in the
+ *     code, so without this mock a full-suite run constructs a REAL PostHog
+ *     client and can flush fabricated test events into production analytics
+ *     (flushAt: 20 / flushInterval: 10s vs a ~25s suite).
+ *
+ * Tests assert against these recorded calls — see
+ * tests/telemetry/telemetry-client.test.ts.
+ */
+export type PostHogConstructorCall = { apiKey: string; options: Record<string, unknown> };
+export const postHogConstructorCalls: PostHogConstructorCall[] = [];
+export const postHogCaptureCalls: Array<Record<string, unknown>> = [];
+
+mock.module('posthog-node', () => ({
+  PostHog: class {
+    constructor(apiKey: string, options: Record<string, unknown>) {
+      postHogConstructorCalls.push({ apiKey, options });
+    }
+    capture(payload: Record<string, unknown>): void {
+      postHogCaptureCalls.push(payload);
+    }
+    async shutdown(): Promise<void> {}
+  },
+}));
--- a/tests/telemetry/scrub.test.ts
+++ b/tests/telemetry/scrub.test.ts
@@ -136,6 +136,74 @@ describe('scrubProperties', () => {
    expect(result.days_since_last_obs).toBe(0);
  });

+  it('keeps the retrieval quality keys with primitive values', () => {
+    const result = scrubProperties({
+      result_count: 0,
+      chroma_available: false,
+      fallback_reason: 'chroma_connection',
+    });
+
+    expect(result).toEqual({
+      result_count: 0,
+      chroma_available: false,
+      fallback_reason: 'chroma_connection',
+    });
+  });
+
+  it('keeps the compression trust keys with primitive values', () => {
+    const result = scrubProperties({
+      fabrication_detected: true,
+      fabricated_count: 2,
+      invalid_output_class: 'poisoned',
+      consecutive_invalid_outputs: 3,
+      respawn_triggered: true,
+      abort_reason: 'restart_guard',
+    });
+
+    expect(Object.keys(result)).toHaveLength(6);
+    expect(result.fabrication_detected).toBe(true);
+    expect(result.fabricated_count).toBe(2);
+    expect(result.invalid_output_class).toBe('poisoned');
+    expect(result.consecutive_invalid_outputs).toBe(3);
+    expect(result.respawn_triggered).toBe(true);
+    expect(result.abort_reason).toBe('restart_guard');
+  });
+
+  it('keeps the worker lifecycle keys with primitive values', () => {
+    const result = scrubProperties({
+      previous_shutdown: 'crash',
+      previous_uptime_seconds: 86400,
+      uptime_seconds: 3600,
+      shutdown_reason: 'restart',
+      process_rss_mb: 187,
+      heap_used_mb: 92,
+    });
+
+    expect(Object.keys(result)).toHaveLength(6);
+    expect(result.previous_shutdown).toBe('crash');
+    expect(result.previous_uptime_seconds).toBe(86400);
+    expect(result.uptime_seconds).toBe(3600);
+    expect(result.shutdown_reason).toBe('restart');
+    expect(result.process_rss_mb).toBe(187);
+    expect(result.heap_used_mb).toBe(92);
+  });
+
+  it('keeps the hook failure keys with primitive values', () => {
+    const result = scrubProperties({
+      hook_type: 'observation',
+      error_mode: 'worker_unavailable',
+      consecutive_failures: 3,
+      threshold_tripped: true,
+    });
+
+    expect(result).toEqual({
+      hook_type: 'observation',
+      error_mode: 'worker_unavailable',
+      consecutive_failures: 3,
+      threshold_tripped: true,
+    });
+  });
+
  it('drops unknown keys silently', () => {
    const result = scrubProperties({
      version: '1.0.0',
--- a/tests/telemetry/telemetry-client.test.ts
+++ b/tests/telemetry/telemetry-client.test.ts
@@ -1,33 +1,22 @@
-import { describe, it, expect, mock, beforeAll, afterAll } from 'bun:test';
+import { describe, it, expect, beforeAll, afterAll } from 'bun:test';
 import { mkdtempSync, rmSync } from 'fs';
 import { tmpdir } from 'os';
 import { join } from 'path';
+import { postHogConstructorCalls, postHogCaptureCalls } from '../preload';
+import { captureEvent, __resetTelemetryForTests } from '../../src/services/telemetry/telemetry';

 /**
 * Guards the PostHog client construction options. The posthog-node SDK stamps
 * $geoip_disable: true on every event unless disableGeoip: false is passed —
- * losing ingest-side coarse location for every worker event. The module is
- * mocked so no real client (and no network) is ever created.
+ * losing ingest-side coarse location for every worker event.
+ *
+ * posthog-node is mocked globally in tests/preload.ts (it cannot be mocked
+ * per-file: telemetry.ts is imported transitively by many other test files in
+ * the same process, so a local mock.module registers too late). The telemetry
+ * module's process-wide state is reset below so construction is observed from
+ * scratch regardless of suite order.
 */

-type ConstructorCall = { apiKey: string; options: Record<string, unknown> };
-const constructorCalls: ConstructorCall[] = [];
-const captureCalls: Array<Record<string, unknown>> = [];
-
-mock.module('posthog-node', () => ({
-  PostHog: class {
-    constructor(apiKey: string, options: Record<string, unknown>) {
-      constructorCalls.push({ apiKey, options });
-    }
-    capture(payload: Record<string, unknown>): void {
-      captureCalls.push(payload);
-    }
-    async shutdown(): Promise<void> {}
-  },
-}));
-
-const { captureEvent } = await import('../../src/services/telemetry/telemetry');
-
 let tempDir: string;
 const savedEnv: Record<string, string | undefined> = {};
 const ENV_KEYS = [
@@ -46,6 +35,10 @@ beforeAll(() => {
  delete process.env.CLAUDE_MEM_TELEMETRY_DEBUG;
  delete process.env.CLAUDE_MEM_TELEMETRY_KEY;
  delete process.env.DO_NOT_TRACK;
+
+  __resetTelemetryForTests();
+  postHogConstructorCalls.length = 0;
+  postHogCaptureCalls.length = 0;
 });

 afterAll(() => {
@@ -54,21 +47,24 @@ afterAll(() => {
    else process.env[key] = savedEnv[key];
  }
  rmSync(tempDir, { recursive: true, force: true });
+  // Drop the client/consent state built under this file's env so later test
+  // files start from the same blank slate this file demanded.
+  __resetTelemetryForTests();
 });

 describe('PostHog client construction', () => {
  it('constructs the client with disableGeoip: false so ingest-side geolocation works', () => {
    captureEvent('test_event');

-    expect(constructorCalls.length).toBe(1);
-    expect(constructorCalls[0].options.disableGeoip).toBe(false);
+    expect(postHogConstructorCalls.length).toBe(1);
+    expect(postHogConstructorCalls[0].options.disableGeoip).toBe(false);
  });

  it('reuses the client and queues the capture', () => {
    captureEvent('test_event_2');

-    expect(constructorCalls.length).toBe(1);
-    expect(captureCalls.length).toBe(2);
-    expect(captureCalls[1].event).toBe('test_event_2');
+    expect(postHogConstructorCalls.length).toBe(1);
+    expect(postHogCaptureCalls.length).toBe(2);
+    expect(postHogCaptureCalls[1].event).toBe('test_event_2');
  });
 });