feat(telemetry): reliability signals — retrieval quality, compression trust, worker lifecycle, hook failures (Plan 14) (#2874)

* feat(telemetry): disclose 19 reliability-signal fields and 2 new events across all surfaces

Whitelist (scrub.ts), scrub tests, public docs (telemetry.mdx), and CLI
disclosure (COLLECTED_FIELDS/EVENT_NAMES) for the Plan 14 reliability
signals: search retrieval quality, compression trust, worker lifecycle,
and hook failure keys, plus the worker_stopped and hook_failed events.
Includes the plan document.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(telemetry): retrieval-quality signals on search_performed

SearchManager.search() fills an optional telemetry envelope
(result_count, search_strategy, chroma_available, fallback_reason)
across all three search paths; handlers stash it on
res.locals.searchTelemetry and the existing finish-middleware spreads it
into the search_performed capture. Zero-result searches report
result_count: 0; Chroma fallback reasons are a closed enum, never the
error message. Response shapes unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(telemetry): compression trust signals on session_compressed

fabrication_detected/fabricated_count flow through compressionProps (all
three emit paths); invalid-output respawns emit a respawn-gated
session_compressed with outcome invalid_output and the classifier value;
aborted generators emit outcome aborted with abort_reason normalized to
a closed enum in the .finally where all five abort flows converge (the
.catch path can never observe a non-null abortReason).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(telemetry): worker lifecycle signals — worker_stopped, crash detection, memory metrics

Clean-shutdown sentinel written before telemetry flush and consumed at
startup; worker_started gains previous_shutdown (crash/clean/unknown)
and previous_uptime_seconds derived from the stale PID file; new
worker_stopped event (uptime_seconds, shutdown_reason stop/restart/
signal) emitted before shutdownTelemetry(); the CLI restart path tags
/api/admin/shutdown?reason=restart so restarts are distinguishable;
buildLifecycleProps adds integer process_rss_mb/heap_used_mb.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(telemetry): threshold-gated hook_failed distress signal via CLI transport

recordWorkerUnreachable emits hook_failed exactly when the consecutive-
failure count reaches the fail-loud threshold; the generic blocking-error
branch emits error_mode blocking_error. Both emits are awaited before
the process.exit paths so the 2s-capped CLI POST survives; hook_type is
a closed enum registered at hookCommand entry. Exit codes unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(build): regenerate plugin artifacts with Plan 14 telemetry signals

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(tests): make PostHog client regression test order-independent via global preload mock

The disableGeoip regression test mocked posthog-node per-file, but
telemetry.ts is imported transitively by many test files in the shared
bun process, so the mock registered too late and the test failed in
full-suite runs — CI on main has been red since v13.5.4. The mock now
registers in a bunfig [test].preload before any module loads, which
also guarantees test runs can never construct a real PostHog client and
flush fabricated events into production analytics (consent is
default-on and the suite outlives flushInterval). telemetry.ts gains a
test-only state reset so construction is observed deterministically
regardless of suite order.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(telemetry): forward shutdown reason in Windows-managed IPC message

Review follow-up: the wrapper IPC path discarded the restart tag, so an
external Windows wrapper could only ever report shutdown_reason 'stop'.
No wrapper in this repo listens for the message, but the reason now
travels with it for any that does.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
Alex Newman
2026-06-10 02:51:45 -07:00
committed by GitHub
parent 25015fc3e5
commit 4ce51efd28
22 changed files with 1197 additions and 374 deletions

View File

@@ -1,2 +1,7 @@
[test]
smol = true
# Mocks posthog-node for ALL tests before any module loads — required because
# telemetry.ts is a process-wide singleton imported transitively by many test
# files, and because real captures from tests would pollute production
# analytics. See tests/preload.ts.
preload = ["./tests/preload.ts"]

View File

@@ -36,7 +36,7 @@ Every event property passes through a strict whitelist scrubber — any key not
| `runtime_version` | `1.2.0` | Runtime version string |
| `node_version` | `22.14.0` | Node.js version string |
| `duration_ms` | `1843` | How long an operation took |
| `outcome` | `ok` | Coarse result: ok / error / partial |
| `outcome` | `ok` | Coarse result — a closed enum: ok / error / partial / invalid_output / aborted |
| `error_category` | `provider_error` | Coarse error bucket — never an error message |
| `locale` | `en-US` | Language tag |
| `is_ci` | `false` | Whether running in CI |
@@ -72,6 +72,25 @@ Every event property passes through a strict whitelist scrubber — any key not
| `install_age_days` | `104` | Days since the install's first recorded session |
| `obs_count_7d` / `obs_count_30d` | `1887` / `10357` | Observations stored in the last 7 / 30 days |
| `days_since_last_obs` | `0` | Days since the most recent observation was stored |
| `result_count` | `12` | How many results a memory search returned — count only, never the results or the query |
| `chroma_available` | `true` | Whether the vector-search backend was reachable for a search (false = fell back to full-text search) |
| `fallback_reason` | `none` | Why a search fell back from vector search: none / chroma_connection / chroma_error / chroma_not_initialized — a closed enum, never an error message |
| `fabrication_detected` | `false` | Whether a compression's output referenced commit hashes that don't exist in your repo (a model-trust check) |
| `fabricated_count` | `0` | How many nonexistent commit hashes were detected — count only, never the hashes |
| `invalid_output_class` | `idle` | Coarse class of an unusable compression output: xml / idle / prose / poisoned (`xml` = looked like the expected format but failed to parse) — never the output itself |
| `consecutive_invalid_outputs` | `3` | How many unusable outputs occurred in a row before recovery |
| `respawn_triggered` | `true` | Whether the compression agent was restarted after repeated unusable output |
| `abort_reason` | `idle` | Why a compression session was aborted: idle / shutdown / overflow / restart_guard / quota / poisoned / none — a closed enum |
| `previous_shutdown` | `clean` | How the previous worker run ended, detected at startup: crash / clean / unknown |
| `previous_uptime_seconds` | `86400` | How long the previous worker run was up, in whole seconds |
| `uptime_seconds` | `3600` | How long the worker was up when it stopped, in whole seconds |
| `shutdown_reason` | `restart` | Why the worker stopped: stop / restart / signal |
| `process_rss_mb` | `187` | Worker process resident memory, integer megabytes |
| `heap_used_mb` | `92` | Worker JS heap in use, integer megabytes |
| `hook_type` | `observation` | Which hook kind failed: context / session-init / observation / summarize / file-context — our handler names |
| `error_mode` | `worker_unavailable` | Coarse hook failure mode: worker_unavailable / blocking_error — never an error message |
| `consecutive_failures` | `3` | How many hook failures occurred in a row (the fail-loud counter) |
| `threshold_tripped` | `true` | Whether the consecutive-failure count reached the fail-loud threshold |
One value is derived server-side rather than sent by the client: PostHog resolves the request's sender IP to a **coarse location** (country / region / city) at ingestion, before the IP itself is discarded. The client never attaches an IP to any event, and the raw IP is never stored — see [What is NEVER collected](#what-is-never-collected).
@@ -82,10 +101,12 @@ One value is derived server-side rather than sent by the client: PostHog resolve
| `install_completed` | `npx claude-mem install` finishes | `ide`, `provider`, `runtime_mode`, `is_update`, `outcome`, `duration_ms`, `interactive`, `install_method`, `bun_version`, `uv_version`, `claude_code_version` |
| `install_failed` | The installer aborts | `error_category` (our error-taxonomy id), `interactive`, `install_method`, `claude_code_version` |
| `uninstall_completed` | `npx claude-mem uninstall` finishes | — |
| `worker_started` | The background worker starts, plus one heartbeat per 24h of uptime | `trigger` (start / heartbeat), `duration_ms`, `ide`, `provider`, `mode`, `runtime_mode`, and the install snapshot: `db_observation_count`, `db_session_count`, `db_summary_count`, `db_project_count`, `db_size_mb`, `install_age_days`, `obs_count_7d`, `obs_count_30d`, `days_since_last_obs` |
| `session_compressed` | A session compression stores observations, or the compression generator fails | `outcome`, `duration_ms`, `count`, `has_summary`, `provider`, `model`, `ide`, `hook`, `compression_ms`, `observation_type`, `obs_type_*`, `tokens_input`, `tokens_output`, `compression_ratio`, `cost_usd`, `endpoint_class`; on failure `error_category`, `model` |
| `worker_started` | The background worker starts, plus one heartbeat per 24h of uptime | `trigger` (start / heartbeat), `duration_ms`, `ide`, `provider`, `mode`, `runtime_mode`, process memory (`process_rss_mb`, `heap_used_mb`), the install snapshot: `db_observation_count`, `db_session_count`, `db_summary_count`, `db_project_count`, `db_size_mb`, `install_age_days`, `obs_count_7d`, `obs_count_30d`, `days_since_last_obs`; on a real start also crash detection: `previous_shutdown` (crash / clean / unknown) and, after a clean shutdown, `previous_uptime_seconds` |
| `session_compressed` | A session compression stores observations, repeated unusable model output forces an agent respawn, the compression generator fails, or a compression session is aborted | `outcome` (ok / invalid_output / error / aborted), `duration_ms`, `count`, `has_summary`, `provider`, `model`, `ide`, `hook`, `compression_ms`, `observation_type`, `obs_type_*`, `tokens_input`, `tokens_output`, `compression_ratio`, `cost_usd`, `endpoint_class`, `fabrication_detected`, `fabricated_count`; on an invalid-output respawn `invalid_output_class`, `consecutive_invalid_outputs`, `respawn_triggered`; on failure `error_category`; on abort `abort_reason` |
| `context_injected` | Stored memory is injected into a new session | `outcome`, `duration_ms`, `mode`, `provider`, `search_strategy`, `observation_count`, `session_count`, `timeline_depth_days`, `has_session_summary`, `obs_type_*`, `tokens_injected`, `tokens_saved_vs_naive` |
| `search_performed` | A memory search runs (never the query text) | `endpoint`, `outcome`, `duration_ms` |
| `search_performed` | A memory search runs (never the query text) | `endpoint`, `outcome`, `duration_ms`, `result_count`, `search_strategy`, `chroma_available`, `fallback_reason` |
| `worker_stopped` | The background worker shuts down gracefully | `uptime_seconds`, `shutdown_reason` (stop / restart / signal) |
| `hook_failed` | A claude-mem hook fails hard — the worker is unreachable past the fail-loud threshold, or a blocking error occurs | `hook_type`, `error_mode`, `consecutive_failures`, `threshold_tripped` |
| `error_occurred` | The worker returns an HTTP 5xx | `error_category` |
## What is NEVER collected

View File

@@ -0,0 +1,319 @@
# Plan 14 — Telemetry Reliability Signals
Adds the five highest-value missing telemetry signals identified by the 2026-06-10
capture-surface audit. Theme: **we instrument success well; failure is invisible.**
Every signal below feeds the **Reliability** sentence of
`plans/2026-06-09-telemetry-metrics-spec.md` ("Core pipeline succeeds X% of the
time at scale") — plus retrieval quality, which today has no KPI at all.
Phases are self-contained: each can be executed in a fresh chat context. Execute
in order; Phase 14 are independent of each other but all depend on Phase 0's
facts and share the Phase-ritual below.
---
## Phase 0 — Verified facts, allowed APIs, and the every-property ritual
Consolidated from 5 documentation-discovery agents (all high confidence, all
findings cite read code). **Do not invent APIs beyond this list.**
### The pipeline ritual — EVERY new property or event must touch all five surfaces
| # | Surface | Location | What to do |
|---|---|---|---|
| 1 | Scrub whitelist | `src/services/telemetry/scrub.ts:8-82` (`ALLOWED_PROPERTY_KEYS: Set<string>`) | Add the key, grouped with a category comment like the existing ones |
| 2 | Scrub tests | `tests/telemetry/scrub.test.ts` | Copy the pattern at `:5-31` (single-group) or `:81-106` (multi-key group); also confirm `:139-169` drop-tests still pass |
| 3 | Public docs | `docs/public/telemetry.mdx` fields table `:26-75`, events table `:78-89` | Add a row per field; new events get an events-table row |
| 4 | CLI disclosure | `src/npx-cli/commands/telemetry.ts` `COLLECTED_FIELDS:23-66`, `EVENT_NAMES:68-77` | Add a line per field; new event names go in `EVENT_NAMES` |
| 5 | Capture site | per phase below | Emit via `captureEvent` / `captureCliEvent` only |
### Allowed APIs (verified signatures)
- `captureEvent(event: string, props?: Record<string, unknown>, opts?: { person?: boolean }): void``src/services/telemetry/telemetry.ts:72` (worker transport; consent-gated, scrubbed, fire-and-forget)
- `captureCliEvent(event, props?, opts?): Promise<void>``src/services/telemetry/cli-telemetry.ts:22` (short-lived-process transport; direct POST, hard 2s timeout `CAPTURE_TIMEOUT_MS` at `:15`, never throws)
- `scrubProperties(props): Record<string, string | number | boolean>``src/services/telemetry/scrub.ts:91-114` (drops non-whitelisted keys and non-primitives **silently**; strings clamped to 200 chars; numbers must be finite)
- `collectInstallStats(db): Record<string, number>``src/services/telemetry/install-stats.ts:29`
- `getUptimeSeconds(startedAtMs: number, now?): number``src/shared/uptime.ts:5-7`
- `writePidFile(info: PidInfo) / readPidFile(): PidInfo | null / removePidFile()``src/services/infrastructure/ProcessManager.ts:134/141/156`; `PidInfo = { pid, port, startedAt: string /* ISO8601 */, startToken? }` (`src/supervisor/process-registry.ts:49-54`)
- `recordWorkerUnreachable(): number``src/shared/worker-utils.ts:451-470` (returns the consecutive-failure count; persists atomically in `~/.claude-mem/state/hook-failures.json`; threshold default 3, env `CLAUDE_MEM_HOOK_FAIL_LOUD_THRESHOLD`)
- `classifyObserverOutput(raw): 'xml'|'idle'|'prose'|'poisoned'``src/sdk/output-classifier.ts:60-80`
- `verifyCommitHashesInText(...): CommitVerificationResult` with `fabricated: string[]``src/sdk/commit-verification.ts:69-108`
- `DATA_DIR` / `paths.workerPid()` etc. — `src/shared/paths.ts:40,129-151`
### Global anti-patterns (from discovery; apply to every phase)
- Properties not added to `ALLOWED_PROPERTY_KEYS` are **silently dropped** — no error. Always whitelist first, then emit.
- Only `number | boolean | closed-enum string`. Never free text, paths, queries, error messages, IDs derived from the user. (An earlier audit draft proposed `error_summary: string` — explicitly rejected.)
- `person: true` only on lifecycle events (spec constraint, `plans/2026-06-09-telemetry-metrics-spec.md:65-71`). Nothing in this plan adds person properties; do not touch `PERSON_PROPERTY_KEYS`.
- Never bypass `captureEvent`/`captureCliEvent` with direct PostHog calls.
- Debug-mode verification harness: `CLAUDE_MEM_TELEMETRY_DEBUG=1` prints would-be payloads to stderr and sends nothing (`telemetry.ts:97-103`).
### Discovery discrepancy to resolve during Phase 2
One agent reported `INVALID_OUTPUT_RESPAWN_THRESHOLD` = 25, another = 3. Read
`src/services/worker/agents/ResponseProcessor.ts:25` before relying on the value.
---
## Phase 1 — Retrieval quality: `result_count` + strategy/fallback on `search_performed`
**Narrative served:** Reliability + retrieval quality. Zero-result rate becomes
computable; Chroma's silent degradation to FTS becomes visible (the recurring
`SQLiteSearchStrategy Database error` incident class).
### Verified obstacles (do not skip)
- The existing capture is a middleware: `SearchRoutes.ts:117-123` inside
`res.once('finish')` — it fires **after** the response, outside handler scope.
It can see only `endpoint`, `res.statusCode`, and elapsed time. Result arrays,
`totalResults` (computed at `SearchManager.ts:307`), `chromaFailed`
(`SearchManager.ts:158, 206, 274`) and `chromaFailureReason`
(`SearchManager.ts:267-275`) are method-local and unreachable from there.
- `SearchManager.search()` has three paths: filter-only SQLite (`:165-176`),
Chroma (`:179-286`, sets `chromaFailed` on error), Chroma-not-initialized FTS
(`:288-305`). Text-format responses (`:420-425`) do not carry counts; only
`format='json'` (`:309-316`) includes `totalResults`.
- `search_strategy` is **already whitelisted** (`scrub.ts:55`); only the new keys
need whitelist entries.
### What to implement
1. In `SearchManager.search()`, build a small telemetry envelope alongside the
existing return value — do not change response shapes. Collect:
`result_count` (the `totalResults` already computed at `:307`),
`search_strategy: 'chroma' | 'fts' | 'filter_only'` (one per path above),
`chroma_available: boolean` (false when `chromaFailed` or not initialized),
`fallback_reason: 'none' | 'chroma_connection' | 'chroma_error' | 'chroma_not_initialized'`
(map from `chromaFailureReason.isConnectionError` at `:271`; never the message).
Expose it to callers — recommended: return `{ ...existing, telemetry }` for an
internal caller, or set it on a mutable param. Simplest verified-safe plumbing:
handlers stash it on `res.locals.searchTelemetry`, and the middleware at
`SearchRoutes.ts:117-123` spreads `res.locals.searchTelemetry ?? {}` into the
existing `captureEvent('search_performed', …)` props.
2. Whitelist `result_count`, `chroma_available`, `fallback_reason` (ritual #14).
3. Note: `src/services/worker/search/types.ts:53-64` has a `StrategySearchResult`
with a `strategy` field but `SearchManager.search()` does not use it — derive
strategy from the three paths; do not refactor onto SearchOrchestrator here.
### Verification
- [ ] `bun test tests/telemetry/` green (new scrub cases included)
- [ ] `npm run typecheck:root` clean
- [ ] `CLAUDE_MEM_TELEMETRY_DEBUG=1` + a worker search request prints `search_performed` with `result_count`, `search_strategy`, `chroma_available`, `fallback_reason`
- [ ] Grep guard: `grep -n "fallback_reason" src/services/telemetry/scrub.ts docs/public/telemetry.mdx src/npx-cli/commands/telemetry.ts` hits all three
- [ ] Zero-result search shows `result_count: 0` (not missing)
### Anti-pattern guards
- Do NOT try to introspect the response body from the middleware (no `res._getBuffer()`-style Express internals — unverified, fragile).
- Do NOT put `chromaFailureReason.message` in any property — enum only.
- Do NOT change the text-format response shape consumed by clients.
---
## Phase 2 — Compression quality: fabrication, invalid-output, and abort reasons on `session_compressed`
**Narrative served:** Reliability + model quality (extends yesterday's
tokens/cost/ratio work with per-model trust signals).
### Verified mechanics (this is the key to doing it right)
- `compressionProps` is built at `ResponseProcessor.ts:194-214`. Non-SDK
providers emit immediately (`:228`); the SDK/Claude path stashes the object
into `session.pendingCompressionEvent` (`worker-types.ts:60`) at `:216-226`,
and `ClaudeProvider.ts:416-435` later merges real token fields and emits;
`:442-445` is the no-result fallback emit. **Therefore: any property added to
`compressionProps` automatically flows through all three emit paths.**
- Fabrication scope: `ResponseProcessor.ts:115-135` already computes
`fabricated: string[]` via `verifyCommitHashesInText`.
- Invalid output: `ResponseProcessor.ts:48-88` returns early — **no event fires
at all** on that path today. `session.consecutiveInvalidOutputs`
(`worker-types.ts:34`) increments at `:54`, resets at `:92`; respawn decision
at `:67-79` (`outputClass === 'poisoned'` OR threshold reached — read the
threshold at `:25`, see Phase 0 discrepancy).
- `abortReason` enum: `worker-types.ts:42``'idle'|'shutdown'|'overflow'|'restart-guard'|'quota'|string|null`;
set at `ClaudeProvider.ts:270` (note: `'quota:…'` prefix format), `:315`,
`SessionManager.ts:272,294,407`; consumed at `SessionRoutes.ts:166-167`. The
error-path emit is `SessionRoutes.ts:154-163`.
### What to implement
1. **Fabrication:** in `ResponseProcessor.ts` where `fabricated.length` is known
(`:128-135`), add to `compressionProps`: `fabrication_detected: boolean`,
`fabricated_count: number`. (Flows through deferred path for free.)
2. **Invalid output:** at the respawn decision (`:67-79`) — and ONLY when a
respawn triggers, to bound volume — emit one
`captureEvent('session_compressed', { outcome: 'invalid_output', invalid_output_class, consecutive_invalid_outputs, respawn_triggered: true, provider, model, ide, hook })`
where `invalid_output_class` is the classifier value (`'idle'|'prose'|'poisoned'`).
3. **Abort reason:** in the error-path emit (`SessionRoutes.ts:154-163`), add
`abort_reason` normalized to a closed enum:
`'idle'|'shutdown'|'overflow'|'restart_guard'|'quota'|'none'` — split the
`'quota:…'` format on `':'` and map `'restart-guard'``'restart_guard'`.
4. Whitelist `fabrication_detected`, `fabricated_count`, `invalid_output_class`,
`consecutive_invalid_outputs`, `respawn_triggered`, `abort_reason` (ritual #14).
### Verification
- [ ] `bun test tests/telemetry/` green; `npm run typecheck:root` clean
- [ ] Debug-mode `session_compressed` payload shows `fabrication_detected: false, fabricated_count: 0` on a normal compression
- [ ] Grep guard: `grep -rn "abort_reason" src/services/telemetry/scrub.ts src/services/worker/http/routes/SessionRoutes.ts` both hit
- [ ] Confirm the deferred path carries new props: grep the built `plugin/scripts/worker-service.cjs` for `fabrication_detected` after `npm run build`
### Anti-pattern guards
- Do NOT emit an event per invalid output (volume) — respawn-gated only.
- Do NOT send raw `abortReason` strings (`'quota:daily'`, `'restart-guard'`) — normalize to the closed enum first; the scrubber will happily pass any ≤200-char string, so enum discipline is on the emitter.
- Do NOT add the new props anywhere except `compressionProps` for the fabrication fields — adding them only at the `ClaudeProvider` merge would miss non-SDK providers.
---
## Phase 3 — Worker lifecycle: crash detection, `worker_stopped`, heartbeat health
**Narrative served:** Reliability ("crash-free installs") + makes the DAU/uptime
data trustworthy.
### Verified mechanics
- PID file already stores `startedAt` ISO8601 (`worker-service.ts:289`,
`PidInfo` at `process-registry.ts:49-54`) → previous uptime is computable on
next start via `Date.parse(startedAt)`.
- There is NO shutdown sentinel today; marker-file pattern to copy:
`ProcessManager.ts:232-254` (`.chroma-cleaned-v10.3`) — write to `DATA_DIR`.
- Graceful shutdown: `worker-service.ts:565-585`; `shutdownTelemetry()` is called
at `:576` and races a 3s flush (`telemetry.ts:124-144`) — an event captured
**before** `:576` will flush. Stop-case `removePidFile()` is at `:836`.
- `worker_started` captures: `:427` (trigger `start`, `person: true`), `:436`
(heartbeat, 24h `setInterval` with `.unref()` at `:435-438`); props builder
`buildLifecycleProps()` at `:401-426`.
- `uncaughtException` handler at `:1075-1078` logs and does NOT exit (known smell — out of scope here, do not change process semantics in this plan).
### What to implement
1. **Clean-shutdown sentinel:** in the shutdown path (before `:576`), write
`DATA_DIR/.worker-clean-shutdown` containing the ISO timestamp (copy the
marker pattern from `ProcessManager.ts:232-254`). Delete the sentinel at
startup after reading it.
2. **Crash detection on start:** in the startup daemon path, before
`writePidFile`, derive:
- stale PID file present + no sentinel → `previous_shutdown: 'crash'`
- sentinel present → `'clean'`
- neither (first run) → `'unknown'`
- `previous_uptime_seconds` from the stale PID file's `startedAt` to sentinel
time (clean) or to `now` minus unknown gap (crash → omit rather than guess;
omitted properties are fine).
Add both to the existing `captureEvent('worker_started', …)` at `:427`.
3. **`worker_stopped` event:** immediately before `shutdownTelemetry()` at
`:576`, `captureEvent('worker_stopped', { uptime_seconds, shutdown_reason })`
with `uptime_seconds` from `getUptimeSeconds(this.startTime)`
(`worker-service.ts:122`, `uptime.ts:5-7`) and
`shutdown_reason: 'stop' | 'restart' | 'signal'` from the caller. No
`person: true`.
4. **Heartbeat health:** in the heartbeat payload (`:436` / `buildLifecycleProps`),
add `process_rss_mb` and `heap_used_mb` as integers from
`process.memoryUsage()` (`Math.round(rss / 1024 / 1024)`).
5. Whitelist `previous_shutdown`, `previous_uptime_seconds`, `uptime_seconds`,
`shutdown_reason`, `process_rss_mb`, `heap_used_mb`; add `worker_stopped` to
`EVENT_NAMES` and the docs events table (ritual #14).
### Verification
- [ ] `bun test tests/telemetry/` green; `npm run typecheck:root` clean
- [ ] Debug mode: `worker-service restart` prints `worker_stopped` (reason `restart`) then `worker_started` with `previous_shutdown: 'clean'`
- [ ] Kill -9 the worker, start it: `worker_started` shows `previous_shutdown: 'crash'`
- [ ] Heartbeat payload contains integer `process_rss_mb`
- [ ] Sentinel file is removed after startup reads it (no stale `'clean'` after a later crash)
### Anti-pattern guards
- Do NOT compute uptime from in-memory `startTime` for the *previous* run — it's never persisted; use the PID file's `startedAt`.
- Do NOT emit `worker_stopped` after `shutdownTelemetry()``isShutdown` (`telemetry.ts:81`) drops late events by design.
- Do NOT add the new keys to `PERSON_PROPERTY_KEYS` (spec ingestion-cost constraint).
- `process.memoryUsage().rss` is bytes — convert; the scrubber drops non-finite numbers silently.
---
## Phase 4 — `hook_failed` event (threshold-gated, CLI transport)
**Narrative served:** Reliability — a failing hook is silent memory loss; today
the fail-loud counter only writes to the user's stderr.
### Verified constraints (these dictate the design — read before coding)
- Hooks are short-lived processes (<1s typical). The worker transport
(posthog-node batching) can never flush there; and emitting via the worker API
is self-defeating (the defining failure IS "worker unreachable"). **Transport
must be `captureCliEvent`** (`cli-telemetry.ts:22`, direct POST, 2s cap, never throws).
- **The trap:** `exitGraceful` (`hook-io.ts:166-173`) and `emitBlockingError`
(`hook-io.ts:150-159`) call `process.exit()` immediately and do not await
pending promises — a fire-and-forget POST is killed mid-flight. The emit must
be **awaited before** the exit call, inside the failure branch.
- Catch taxonomy lives at `hook-command.ts:99-128`: AdapterRejectedInput
(`:100-105`), non-blocking input error (`:106-111`), worker-unavailable
(`:112-119`, the only branch calling `recordWorkerUnreachable()`), generic
blocking error (`:121-128`, exit 2).
- `recordWorkerUnreachable(): number` returns the consecutive count and knows the
threshold — gate on it.
- Hooks currently import zero telemetry code; `captureCliEvent` has only
fs/fetch deps and bundles fine via `scripts/build-hooks.js` esbuild (telemetry
modules are not externalized — verified at `build-hooks.js:284-330`).
### What to implement
1. In `hook-command.ts`, in exactly two branches:
- **worker-unavailable branch (`:112-119`):** after
`recordWorkerUnreachable()` returns `count`, if `count` has just reached the
fail-loud threshold (the same condition that triggers the blocking stderr
message), `await captureCliEvent('hook_failed', { hook_type, error_mode: 'worker_unavailable', consecutive_failures: count, threshold_tripped: true })`.
- **generic blocking-error branch (`:121-128`):**
`await captureCliEvent('hook_failed', { hook_type, error_mode: 'blocking_error', threshold_tripped: false })` before `emitBlockingError`.
Both branches are rare and already failed — the ≤2s bounded wait is
acceptable there. Never emit on the success path or the two skip branches.
2. `hook_type`: closed enum from the hook event already passed to
`hookCommand(platform, event, …)` (`:79`) — use the event/handler name set
(`context | session-init | observation | summarize | file-context`), not free text.
3. Whitelist `hook_type`, `error_mode`, `consecutive_failures`,
`threshold_tripped`; add `hook_failed` to `EVENT_NAMES` + docs events table
(ritual #14).
### Verification
- [ ] `bun test tests/telemetry/` green; `npm run typecheck:root` clean
- [ ] `npm run build` then grep the built hook artifact for `hook_failed` (confirms bundling)
- [ ] With the worker stopped and `CLAUDE_MEM_TELEMETRY_DEBUG=1`, run a hook 3× (threshold): third run prints `hook_failed` with `consecutive_failures: 3`
- [ ] Success-path hook run emits nothing and latency is unchanged
- [ ] Confirm exit codes unchanged (`HOOK_EXIT_CODES`, `hook-constants.ts:15-20`)
### Anti-pattern guards
- Do NOT fire-and-forget then `process.exit()` — the event dies with the process.
- Do NOT emit per-invocation hook latency events (volume + inline-latency cost). Worker-side `duration_ms` on `context_injected`/`search_performed` already covers worker latency; defer hook-side latency to a future aggregate.
- Do NOT route the emit through `executeWithWorkerFallback` or any worker API.
- Do NOT emit in the AdapterRejectedInput / non-blocking-input branches (expected, noisy, not failures of ours).
---
## Phase 5 — Final verification
1. **Full ritual audit** — for each new key
(`result_count, chroma_available, fallback_reason, fabrication_detected, fabricated_count, invalid_output_class, consecutive_invalid_outputs, respawn_triggered, abort_reason, previous_shutdown, previous_uptime_seconds, uptime_seconds, shutdown_reason, process_rss_mb, heap_used_mb, hook_type, error_mode, consecutive_failures, threshold_tripped`):
`grep -n "<key>" src/services/telemetry/scrub.ts tests/telemetry/scrub.test.ts docs/public/telemetry.mdx src/npx-cli/commands/telemetry.ts` — all four must hit.
2. **New events disclosed:** `worker_stopped`, `hook_failed` present in
`EVENT_NAMES` (`src/npx-cli/commands/telemetry.ts:68-77`) and the
`telemetry.mdx` events table.
3. **Anti-pattern greps:**
- `grep -rn "captureEvent\|captureCliEvent" src/ | grep -v services/telemetry` — every site passes enums/counts only (manual scan of new sites)
- `grep -rn "posthog" src/ --include="*.ts" | grep -v services/telemetry` — no direct SDK use outside the pipeline
- no `PERSON_PROPERTY_KEYS` additions in the diff
4. **Tests & build:** `bun test tests/telemetry/` (note: bun only — the suite
fails under vitest), `npm run typecheck:root`, `npm run build-and-sync`,
worker `/health` returns ok.
5. **Live smoke:** `CLAUDE_MEM_TELEMETRY_DEBUG=1` walk: search (Phase 1 fields),
compression (Phase 2 fields), restart (Phase 3 events), worker-down hook ×3
(Phase 4 event).
6. **Docs deploy:** telemetry.mdx changes auto-deploy on push to main — confirm
the public page renders the new rows after release.
## Out of scope (deliberately)
- The `uncaughtException` no-exit smell (`worker-service.ts:1075-1078`) — process-semantics change, separate plan.
- Per-hook latency events, event-loop-lag sampling, `telemetry_disabled` final ping (product/privacy decision pending), installer funnel (`install_started`), doctor/repair distress signals — candidates for Plan 15 after this data lands.

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

View File

@@ -10,6 +10,12 @@ import {
exitGraceful,
resetHookIoState,
} from '../shared/hook-io.js';
import {
recordWorkerUnreachable,
setActiveHookType,
getActiveHookType,
} from '../shared/worker-utils.js';
import { captureCliEvent } from '../services/telemetry/cli-telemetry.js';
import { logger } from '../utils/logger.js';
export interface HookCommandOptions {
@@ -78,6 +84,9 @@ async function executeHookPipeline(
export async function hookCommand(platform: string, event: string, options: HookCommandOptions = {}): Promise<number> {
resetHookIoState();
// Register the hook event for the threshold-gated hook_failed telemetry
// (closed enum enforced inside; non-enum events just omit hook_type).
setActiveHookType(event);
// Hook IO Discipline (issue #2292):
// We BUFFER stderr during handler execution so that unsolicited writes from
@@ -113,12 +122,28 @@ export async function hookCommand(platform: string, event: string, options: Hook
logger.warn('HOOK', `Worker unavailable, skipping hook: ${error instanceof Error ? error.message : error}`);
// EXIT_SIGNAL per CLAUDE.md: transient worker errors exit 0 to avoid
// Windows Terminal tab accumulation. The fail-loud counter (worker-utils
// recordWorkerUnreachable) handles the surface-after-N-failures path.
// recordWorkerUnreachable) handles the surface-after-N-failures path and
// emits the threshold-gated hook_failed telemetry internally. Awaited:
// when the count JUST reaches the threshold it sends the event and then
// exits 2; exitGraceful below would kill a pending POST mid-flight.
await recordWorkerUnreachable();
exitGraceful(options);
return HOOK_EXIT_CODES.SUCCESS;
}
logger.error('HOOK', `Hook error: ${error instanceof Error ? error.message : error}`, {}, error instanceof Error ? error : undefined);
// hook_failed telemetry MUST be awaited BEFORE emitBlockingError — it
// calls process.exit(2), which would kill a fire-and-forget POST
// mid-flight. captureCliEvent never throws and is hard-capped at 2s.
// Closed-enum props only: the error message itself is never sent.
{
const hookType = getActiveHookType();
await captureCliEvent('hook_failed', {
...(hookType !== null ? { hook_type: hookType } : {}),
error_mode: 'blocking_error',
threshold_tripped: false,
});
}
// BLOCKING_FEEDBACK: flush the buffered logger.error line to stderr and
// exit 2 so the model receives it per Claude Code's hook contract.
emitBlockingError(

View File

@@ -30,7 +30,7 @@ const COLLECTED_FIELDS = [
'runtime_version runtime version string',
'node_version Node.js version string',
'duration_ms how long an operation took',
'outcome ok / error / partial',
'outcome ok / error / partial / invalid_output / aborted',
'error_category coarse error bucket (never a message)',
'locale language tag (e.g. en-US)',
'is_ci whether running in CI',
@@ -63,6 +63,24 @@ const COLLECTED_FIELDS = [
'db_size_mb memory database file size in MB',
'install_age_days / days_since_last_obs / obs_count_7d / obs_count_30d',
' install age and recent activity, in days/counts',
'result_count how many results a search returned (never the query)',
'chroma_available whether vector search was reachable for a search',
'fallback_reason none / chroma_connection / chroma_error / chroma_not_initialized',
'fabrication_detected whether compression output cited nonexistent commits',
'fabricated_count how many nonexistent commit hashes (count, never hashes)',
'invalid_output_class xml / idle / prose / poisoned (never the output)',
'consecutive_invalid_outputs unusable outputs in a row before recovery',
'respawn_triggered whether the compression agent was restarted',
'abort_reason idle / shutdown / overflow / restart_guard / quota / poisoned / none',
'previous_shutdown crash / clean / unknown (detected at worker start)',
'previous_uptime_seconds / uptime_seconds',
' worker uptime in whole seconds (previous run / at stop)',
'shutdown_reason stop / restart / signal',
'process_rss_mb / heap_used_mb worker memory, integer megabytes',
'hook_type context / session-init / observation / summarize / file-context',
'error_mode worker_unavailable / blocking_error (never a message)',
'consecutive_failures hook failures in a row (the fail-loud counter)',
'threshold_tripped whether the fail-loud threshold was reached',
];
const EVENT_NAMES = [
@@ -70,9 +88,11 @@ const EVENT_NAMES = [
'install_failed',
'uninstall_completed',
'worker_started',
'worker_stopped',
'session_compressed',
'context_injected',
'search_performed',
'hook_failed',
'error_occurred',
];

View File

@@ -91,9 +91,13 @@ export async function waitForPortFree(port: number, timeoutMs: number = 10000):
return false;
}
export async function httpShutdown(port: number): Promise<boolean> {
export async function httpShutdown(port: number, reason: 'stop' | 'restart' = 'stop'): Promise<boolean> {
try {
const result = await httpRequestToWorker(port, '/api/admin/shutdown', 'POST');
// The CLI restart path stops the worker through this same endpoint; the
// reason tag lets the worker report shutdown_reason: 'restart' on its
// worker_stopped telemetry instead of a generic 'stop'.
const endpointPath = reason === 'restart' ? '/api/admin/shutdown?reason=restart' : '/api/admin/shutdown';
const result = await httpRequestToWorker(port, endpointPath, 'POST');
if (!result.ok) {
logger.warn('SYSTEM', 'Shutdown request returned error', { status: result.statusCode });
return false;

View File

@@ -80,7 +80,9 @@ export interface AiStatus {
export interface ServerOptions {
getInitializationComplete: () => boolean;
getMcpReady: () => boolean;
onShutdown: () => Promise<void>;
// reason feeds worker_stopped telemetry: 'restart' when the CLI restart
// path tags /api/admin/shutdown with ?reason=restart, 'stop' otherwise.
onShutdown: (reason?: 'stop' | 'restart') => Promise<void>;
onRestart: () => Promise<void>;
workerPath: string;
runtime?: string;
@@ -291,7 +293,11 @@ export class Server {
}
});
this.app.post('/api/admin/shutdown', requireLocalhost, async (_req: Request, res: Response) => {
this.app.post('/api/admin/shutdown', requireLocalhost, async (req: Request, res: Response) => {
// Closed-enum mapping for worker_stopped telemetry: only the exact
// 'restart' tag (set by the CLI restart path) upgrades the reason;
// anything else stays 'stop'.
const shutdownReason: 'stop' | 'restart' = req.query.reason === 'restart' ? 'restart' : 'stop';
const isWindowsManaged = process.platform === 'win32' &&
process.env.CLAUDE_MEM_MANAGED === 'true' &&
process.send;
@@ -299,9 +305,12 @@ export class Server {
if (isWindowsManaged) {
res.json({ status: 'shutting_down' });
logger.info('SYSTEM', 'Sending shutdown request to wrapper');
process.send!({ type: 'shutdown' });
// No wrapper in this repo listens for this message (legacy external
// path), but forward the reason so a wrapper that does can preserve
// shutdown_reason fidelity instead of defaulting to 'stop'.
process.send!({ type: 'shutdown', reason: shutdownReason });
} else {
flushResponseThen(res, { status: 'shutting_down' }, () => this.options.onShutdown());
flushResponseThen(res, { status: 'shutting_down' }, () => this.options.onShutdown(shutdownReason));
}
});

View File

@@ -79,6 +79,41 @@ export const ALLOWED_PROPERTY_KEYS: Set<string> = new Set([
'obs_count_7d',
'obs_count_30d',
'days_since_last_obs',
// search_performed retrieval quality — result_count is an integer,
// chroma_available a boolean, fallback_reason one of OUR enum values
// (none | chroma_connection | chroma_error | chroma_not_initialized).
// Never the query, never an error message.
'result_count',
'chroma_available',
'fallback_reason',
// session_compressed trust signals — booleans, counters, and our own
// closed enums (invalid_output_class: xml | idle | prose | poisoned, where
// 'xml' means XML-shaped output that still failed to parse; abort_reason:
// idle | shutdown | overflow | restart_guard | quota | poisoned | none).
// Never model output, never raw abort strings.
'fabrication_detected',
'fabricated_count',
'invalid_output_class',
'consecutive_invalid_outputs',
'respawn_triggered',
'abort_reason',
// Worker lifecycle health — previous_shutdown (crash | clean | unknown),
// shutdown_reason (stop | restart | signal), uptime in whole seconds, and
// process memory as integer megabytes. No paths, no PIDs.
'previous_shutdown',
'previous_uptime_seconds',
'uptime_seconds',
'shutdown_reason',
'process_rss_mb',
'heap_used_mb',
// hook_failed distress signal — hook_type is one of OUR hook names
// (context | session-init | observation | summarize | file-context),
// error_mode (worker_unavailable | blocking_error), plus a consecutive
// failure counter and threshold flag. Never an error message.
'hook_type',
'error_mode',
'consecutive_failures',
'threshold_tripped',
]);
const MAX_STRING_LENGTH = 200;

View File

@@ -116,6 +116,18 @@ export function captureEvent(
}
}
/**
* Test-only. The module state (singleton client, 30s consent TTL cache,
* shutdown latch) is process-wide, and the whole bun test suite shares one
* process — without a reset, a test asserting client construction inherits
* whatever earlier test files did. Never called by production code.
*/
export function __resetTelemetryForTests(): void {
client = null;
consentCache = null;
isShutdown = false;
}
/**
* Flush queued events on graceful shutdown. Races the SDK shutdown against a
* 3s timeout so a slow/unreachable ingestion host can never hang worker stop.

View File

@@ -1,6 +1,6 @@
import path from 'path';
import { existsSync } from 'fs';
import { existsSync, readFileSync, unlinkSync, writeFileSync } from 'fs';
import { spawn } from 'child_process';
import { Database } from 'bun:sqlite';
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
@@ -8,6 +8,7 @@ import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js'
import { getWorkerPort, getWorkerHost } from '../shared/worker-utils.js';
import { DATA_DIR, DB_PATH, ensureDir } from '../shared/paths.js';
import { HOOK_TIMEOUTS } from '../shared/hook-constants.js';
import { getUptimeSeconds } from '../shared/uptime.js';
import { SettingsDefaultsManager } from '../shared/SettingsDefaultsManager.js';
import { getAuthMethodDescription } from '../shared/EnvManager.js';
import { logger } from '../utils/logger.js';
@@ -117,10 +118,69 @@ export function buildStatusOutput(status: 'ready' | 'error', message?: string):
};
}
/**
* Closed enum for worker_stopped telemetry. Must stay in sync with the
* shutdown_reason whitelist documentation (scrub.ts / telemetry.mdx):
* stop = /api/admin/shutdown (CLI `stop`), restart = /api/admin/restart or
* CLI `restart` (tagged ?reason=restart), signal = SIGTERM/SIGINT handler.
*/
export type WorkerShutdownReason = 'stop' | 'restart' | 'signal';
// Clean-shutdown sentinel — same marker-file pattern as the one-time markers
// in ProcessManager.ts (.chroma-cleaned-v10.3). Written in the graceful
// shutdown path, consumed (read + deleted) at the next startup: sentinel
// present = previous run stopped cleanly; stale PID file with no sentinel =
// previous run died without reaching the graceful-shutdown path (crash).
const CLEAN_SHUTDOWN_SENTINEL_PATH = path.join(DATA_DIR, '.worker-clean-shutdown');
function writeCleanShutdownSentinel(): void {
try {
ensureDir(DATA_DIR);
writeFileSync(CLEAN_SHUTDOWN_SENTINEL_PATH, new Date().toISOString());
} catch (error: unknown) {
if (error instanceof Error) {
logger.warn('SYSTEM', 'Failed to write clean-shutdown sentinel', { path: CLEAN_SHUTDOWN_SENTINEL_PATH }, error);
} else {
logger.warn('SYSTEM', 'Failed to write clean-shutdown sentinel', { path: CLEAN_SHUTDOWN_SENTINEL_PATH }, new Error(String(error)));
}
}
}
function readAndClearCleanShutdownSentinel(): string | null {
if (!existsSync(CLEAN_SHUTDOWN_SENTINEL_PATH)) return null;
let contents: string | null = null;
try {
contents = readFileSync(CLEAN_SHUTDOWN_SENTINEL_PATH, 'utf-8').trim();
} catch (error: unknown) {
if (error instanceof Error) {
logger.warn('SYSTEM', 'Failed to read clean-shutdown sentinel', { path: CLEAN_SHUTDOWN_SENTINEL_PATH }, error);
} else {
logger.warn('SYSTEM', 'Failed to read clean-shutdown sentinel', { path: CLEAN_SHUTDOWN_SENTINEL_PATH }, new Error(String(error)));
}
}
try {
// Always delete after reading: a stale sentinel would mislabel a later
// crash as 'clean'.
unlinkSync(CLEAN_SHUTDOWN_SENTINEL_PATH);
} catch (error: unknown) {
if (error instanceof Error) {
logger.warn('SYSTEM', 'Failed to remove clean-shutdown sentinel', { path: CLEAN_SHUTDOWN_SENTINEL_PATH }, error);
} else {
logger.warn('SYSTEM', 'Failed to remove clean-shutdown sentinel', { path: CLEAN_SHUTDOWN_SENTINEL_PATH }, new Error(String(error)));
}
}
return contents;
}
export class WorkerService implements WorkerRef {
private server: Server;
private startTime: number = Date.now();
private telemetryHeartbeat: ReturnType<typeof setInterval> | null = null;
// Crash detection (worker_started telemetry): derived once at startup from
// the previous run's stale PID file + the clean-shutdown sentinel.
private previousShutdown: 'clean' | 'crash' | 'unknown' = 'unknown';
private previousUptimeSeconds: number | null = null;
private mcpClient: Client;
private mcpReady: boolean = false;
@@ -191,8 +251,8 @@ export class WorkerService implements WorkerRef {
this.server = new Server({
getInitializationComplete: () => this.initializationCompleteFlag,
getMcpReady: () => this.mcpReady,
onShutdown: () => this.shutdown(),
onRestart: () => this.shutdown(),
onShutdown: (reason) => this.shutdown(reason ?? 'stop'),
onRestart: () => this.shutdown('restart'),
workerPath: __filename,
getAiStatus: () => {
let provider = 'claude';
@@ -223,7 +283,7 @@ export class WorkerService implements WorkerRef {
private registerSignalHandlers(): void {
configureSupervisorSignalHandlers(async () => {
this.isShuttingDown = true;
await this.shutdown();
await this.shutdown('signal');
});
}
@@ -275,10 +335,47 @@ export class WorkerService implements WorkerRef {
}));
}
/**
* Crash detection for worker_started telemetry. Must run BEFORE
* startSupervisor() — whose validateWorkerPidFile() deletes the previous
* run's stale PID file — and before writePidFile overwrites it.
* - clean-shutdown sentinel present → previous run stopped gracefully
* - stale PID file present, no sentinel → previous run crashed
* - neither (first run, or the spawner already cleaned the stale PID
* file) → unknown
* The sentinel is consumed here so it can never mislabel a later crash.
*/
private detectPreviousShutdown(): void {
const stalePidInfo = readPidFile();
const sentinelTimestamp = readAndClearCleanShutdownSentinel();
if (sentinelTimestamp !== null) {
this.previousShutdown = 'clean';
// Previous uptime = previous run's PID-file startedAt → sentinel write
// time. The previous run's in-memory startTime is never persisted, so
// the PID file is the only source; omit when either side is missing.
const startedAtMs = stalePidInfo ? Date.parse(stalePidInfo.startedAt) : NaN;
const stoppedAtMs = Date.parse(sentinelTimestamp);
if (Number.isFinite(startedAtMs) && Number.isFinite(stoppedAtMs) && stoppedAtMs >= startedAtMs) {
this.previousUptimeSeconds = Math.floor((stoppedAtMs - startedAtMs) / 1000);
}
} else if (stalePidInfo) {
// Crash: the previous run's stop time is unknowable, so
// previous_uptime_seconds is deliberately omitted rather than guessed.
this.previousShutdown = 'crash';
} else {
this.previousShutdown = 'unknown';
}
}
async start(): Promise<void> {
const port = getWorkerPort();
const host = getWorkerHost();
// Must run before startSupervisor(): its validateWorkerPidFile() removes
// the dead previous run's stale PID file, which crash detection needs.
this.detectPreviousShutdown();
await startSupervisor();
await this.server.listen(port, host);
@@ -422,11 +519,20 @@ export class WorkerService implements WorkerRef {
// Snapshot is best-effort; the lifecycle event still ships without it.
logger.debug('SYSTEM', 'Install stats snapshot failed', {}, error as Error);
}
// Process health for the daily heartbeat: memoryUsage() returns bytes;
// the scrubber drops non-finite numbers, so round to whole MiB.
const memory = process.memoryUsage();
props.process_rss_mb = Math.round(memory.rss / 1024 / 1024);
props.heap_used_mb = Math.round(memory.heapUsed / 1024 / 1024);
return props;
};
captureEvent('worker_started', {
trigger: 'start',
duration_ms: Date.now() - this.startTime,
// Crash detection (detectPreviousShutdown): crash case carries no
// previous_uptime_seconds — the stop time is unknowable.
previous_shutdown: this.previousShutdown,
...(this.previousUptimeSeconds !== null && { previous_uptime_seconds: this.previousUptimeSeconds }),
...buildLifecycleProps(),
}, { person: true });
// Long-lived workers would otherwise look like a single day of activity.
@@ -562,7 +668,7 @@ export class WorkerService implements WorkerRef {
this.sessionManager.removeSessionImmediate(sessionDbId);
}
async shutdown(): Promise<void> {
async shutdown(reason: WorkerShutdownReason = 'stop'): Promise<void> {
if (this.transcriptWatcher) {
this.transcriptWatcher.stop();
this.transcriptWatcher = null;
@@ -573,6 +679,14 @@ export class WorkerService implements WorkerRef {
clearInterval(this.telemetryHeartbeat);
this.telemetryHeartbeat = null;
}
// Mark this stop as graceful for the next start's crash detection, and
// capture worker_stopped BEFORE shutdownTelemetry() — isShutdown drops
// any event captured after the flush, by design.
writeCleanShutdownSentinel();
captureEvent('worker_stopped', {
uptime_seconds: getUptimeSeconds(this.startTime),
shutdown_reason: reason,
});
await shutdownTelemetry();
await performGracefulShutdown({
@@ -841,7 +955,7 @@ async function main() {
case 'restart': {
logger.info('SYSTEM', 'Restarting worker');
await httpShutdown(port);
await httpShutdown(port, 'restart');
const restartFreed = await waitForPortFree(port, 5000);
if (!restartFreed) {
console.error('Port still bound after shutdown. Resolve manually.');

View File

@@ -21,6 +21,19 @@ import type { TimelineData } from './search/index.js';
import { ResultFormatter } from './search/ResultFormatter.js';
import { ChromaUnavailableError } from './search/errors.js';
/**
* Telemetry envelope for search_performed (see docs/public/telemetry.mdx).
* Populated by SearchManager.search() via a mutable sink param so response
* shapes (json and text formats) stay untouched. Privacy: counts, booleans,
* and closed enums only — never query text, results, or error messages.
*/
export interface SearchTelemetryEnvelope {
result_count?: number;
search_strategy?: 'chroma' | 'fts' | 'filter_only';
chroma_available?: boolean;
fallback_reason?: 'none' | 'chroma_connection' | 'chroma_error' | 'chroma_not_initialized';
}
export class SearchManager {
private orchestrator: SearchOrchestrator;
private timelineBuilder: TimelineBuilder;
@@ -149,7 +162,7 @@ export class SearchManager {
return normalized;
}
async search(args: any): Promise<any> {
async search(args: any, telemetryOut?: SearchTelemetryEnvelope): Promise<any> {
const normalized = this.normalizeParams(args);
const { query, type, obs_type, concepts, files, format, ...options } = normalized;
let observations: ObservationSearchResult[] = [];
@@ -306,6 +319,33 @@ export class SearchManager {
const totalResults = observations.length + sessions.length + prompts.length;
// Telemetry envelope (search_performed): derive the strategy from the
// three paths above. Enum/count values only — never the Chroma error
// message, query text, or result content.
if (telemetryOut) {
let searchStrategy: SearchTelemetryEnvelope['search_strategy'];
let fallbackReason: SearchTelemetryEnvelope['fallback_reason'];
if (!query) {
// PATH 1: filter-only SQLite (no query text; Chroma never consulted)
searchStrategy = 'filter_only';
fallbackReason = 'none';
} else if (this.chromaSync) {
// PATH 2: Chroma semantic search, degrading to FTS5 on error
searchStrategy = chromaFailed ? 'fts' : 'chroma';
fallbackReason = chromaFailed
? (chromaFailureReason?.isConnectionError ? 'chroma_connection' : 'chroma_error')
: 'none';
} else {
// PATH 3: FTS5 keyword search (Chroma not initialized)
searchStrategy = 'fts';
fallbackReason = 'chroma_not_initialized';
}
telemetryOut.result_count = totalResults;
telemetryOut.search_strategy = searchStrategy;
telemetryOut.chroma_available = this.chromaSync !== null && !chromaFailed;
telemetryOut.fallback_reason = fallbackReason;
}
if (format === 'json') {
return {
observations,

View File

@@ -45,6 +45,13 @@ export async function processAgentResponse(
const parsed = parseAgentXml(text, session.contentSessionId);
// Provider enum for telemetry, derived once so the invalid-output and
// success paths stamp the same value.
const providerName =
session.currentProvider ??
({ SDK: 'claude', Gemini: 'gemini', OpenRouter: 'openrouter' } as Record<string, string>)[agentName] ??
'claude';
if (!parsed.valid) {
// Classify the non-XML output so a dropped batch is VISIBLE, not silent
// (plan-11, #2485). Attach a preview for diagnostics.
@@ -75,6 +82,18 @@ export async function processAgentResponse(
consecutiveInvalidOutputs: session.consecutiveInvalidOutputs,
threshold: INVALID_OUTPUT_RESPAWN_THRESHOLD,
});
// Respawn-gated telemetry ONLY (never per invalid output — volume).
// Closed enums and counts; the raw model output never leaves the box.
captureEvent('session_compressed', {
outcome: 'invalid_output',
invalid_output_class: outputClass,
consecutive_invalid_outputs: session.consecutiveInvalidOutputs,
respawn_triggered: true,
provider: providerName,
model: typeof modelId === 'string' && modelId ? modelId : 'unknown',
ide: session.platformSource,
hook: session.lastGeneratorSource,
});
await sessionManager.respawnPoisonedSession(session.sessionDbId);
return;
}
@@ -111,6 +130,7 @@ export async function processAgentResponse(
// ground truth via `git cat-file -e` in the session's repo and strip
// fabricated hashes from the persisted text. projectRoot carries the cwd of
// the most recently observed tool-use.
let fabricatedCount = 0;
if (summaryForStore) {
const { fabricated } = verifyCommitHashesInText(
[
@@ -125,6 +145,8 @@ export async function processAgentResponse(
session.contentSessionId
);
fabricatedCount = fabricated.length;
if (fabricated.length > 0) {
logger.warn('PARSER', `${agentName} summary referenced fabricated commit hash(es); flagging before persist`, {
sessionId: session.sessionDbId,
@@ -175,10 +197,6 @@ export async function processAgentResponse(
// Telemetry: counts, enums, and REAL usage only (lastUsage is never an
// estimate — providers leave it null when the API gave no usage split).
const providerName =
session.currentProvider ??
({ SDK: 'claude', Gemini: 'gemini', OpenRouter: 'openrouter' } as Record<string, string>)[agentName] ??
'claude';
const typeCounts: Record<string, number> = { bugfix: 0, discovery: 0, decision: 0, refactor: 0, other: 0 };
for (const obs of labeledObservations) {
const bucket = obs.type in typeCounts && obs.type !== 'other' ? obs.type : 'other';
@@ -205,6 +223,10 @@ export async function processAgentResponse(
hook: session.lastGeneratorSource,
endpoint_class: session.endpointClass,
compression_ms: compressionMs,
// Fabrication signals live HERE (not at the ClaudeProvider merge) so they
// flow through all three emit paths: immediate, deferred, and no-result.
fabrication_detected: fabricatedCount > 0,
fabricated_count: fabricatedCount,
observation_type: labeledObservations.length > 0 ? dominantType : undefined,
obs_type_bugfix: typeCounts.bugfix,
obs_type_discovery: typeCounts.discovery,

View File

@@ -4,6 +4,7 @@ import * as fs from 'fs';
import path from 'path';
import { z } from 'zod';
import { SearchManager } from '../../SearchManager.js';
import type { SearchTelemetryEnvelope } from '../../SearchManager.js';
import { BaseRouteHandler } from '../BaseRouteHandler.js';
import { validateBody } from '../middleware/validateBody.js';
import { logger } from '../../../../utils/logger.js';
@@ -115,10 +116,15 @@ export class SearchRoutes extends BaseRouteHandler {
const segment = req.path === '/' ? 'unified' : req.path.slice(1).split('/')[0];
const endpoint = KNOWN_SEARCH_ENDPOINTS.has(segment) ? segment : 'other';
res.once('finish', () => {
// res.locals.searchTelemetry is the retrieval-quality envelope
// (result_count, search_strategy, chroma_available, fallback_reason)
// populated by SearchManager.search() and stashed by the handler —
// counts/booleans/enums only, never response-body introspection.
captureEvent('search_performed', {
endpoint,
outcome: res.statusCode < 400 ? 'ok' : 'error',
duration_ms: Date.now() - searchStartedAt,
...(res.locals.searchTelemetry ?? {}),
});
});
}
@@ -153,7 +159,13 @@ export class SearchRoutes extends BaseRouteHandler {
}
private handleUnifiedSearch = this.wrapHandler(async (req: Request, res: Response): Promise<void> => {
const result = await this.searchManager.search(req.query);
// Mutable telemetry sink: SearchManager.search() fills it with the
// retrieval-quality envelope; the /api/search middleware spreads it into
// search_performed on response finish. Stashed before the await so the
// envelope survives even if response serialization fails afterwards.
const searchTelemetry: SearchTelemetryEnvelope = {};
res.locals.searchTelemetry = searchTelemetry;
const result = await this.searchManager.search(req.query, searchTelemetry);
res.json(result);
});

View File

@@ -26,6 +26,25 @@ import { USER_PROMPT_DEDUPE_WINDOW_MS } from '../../../../shared/user-prompts.js
const MAX_USER_PROMPT_BYTES = 256 * 1024;
/**
* Collapse session.abortReason onto a closed telemetry enum. The raw value can
* carry free text after a colon (e.g. 'quota:<provider message>') — never emit
* it verbatim. Unknown or absent reasons map to 'none'.
*/
function normalizeAbortReason(
reason: string | null | undefined
): 'idle' | 'shutdown' | 'overflow' | 'restart_guard' | 'quota' | 'poisoned' | 'none' {
switch ((reason ?? '').split(':')[0]) {
case 'idle': return 'idle';
case 'shutdown': return 'shutdown';
case 'overflow': return 'overflow';
case 'restart-guard': return 'restart_guard';
case 'quota': return 'quota';
case 'poisoned': return 'poisoned';
default: return 'none';
}
}
export class SessionRoutes extends BaseRouteHandler {
constructor(
private sessionManager: SessionManager,
@@ -151,6 +170,10 @@ export class SessionRoutes extends BaseRouteHandler {
provider: provider,
error: errorMsg
}, error);
// No abort_reason here: every site that sets abortReason aborts the
// controller on its next line, so aborted generators either resolve
// normally (quota/overflow break) or hit the signal-aborted early
// return above — this catch only ever sees non-abort rejections.
captureEvent('session_compressed', {
outcome: 'error',
provider,
@@ -165,6 +188,20 @@ export class SessionRoutes extends BaseRouteHandler {
.finally(async () => {
const reason = session.abortReason ?? null;
session.abortReason = null; // consume the reason
if (reason !== null) {
// Abort accounting lives HERE, where the reason is consumed — the
// ONLY point every abort flow (idle / shutdown / overflow / quota /
// poisoned) passes through. Emit the closed enum, never the raw
// string ('quota:…' carries a window suffix).
captureEvent('session_compressed', {
outcome: 'aborted',
provider,
model: session.lastModelId ?? 'unknown',
abort_reason: normalizeAbortReason(reason),
hook: session.lastGeneratorSource,
ide: session.platformSource,
});
}
await handleGeneratorExit(session, reason, {
sessionManager: this.sessionManager,
completionHandler: this.completionHandler,

View File

@@ -9,6 +9,7 @@ import { MARKETPLACE_ROOT, DATA_DIR } from "./paths.js";
import { loadFromFileOnce } from "./hook-settings.js";
import { validateWorkerPidFile } from "../supervisor/index.js";
import { emitBlockingError } from "./hook-io.js";
import { captureCliEvent } from "../services/telemetry/cli-telemetry.js";
import { checkVersionMatch } from "../services/infrastructure/index.js";
function readTimeoutEnv(
@@ -448,7 +449,35 @@ function getFailLoudThreshold(): number {
return FAIL_LOUD_DEFAULT_THRESHOLD;
}
function recordWorkerUnreachable(): number {
/**
* Closed enum of hook handler names allowed as the `hook_type` telemetry
* property. Mirrors the scrub whitelist comment (scrub.ts), the CLI
* disclosure (npx-cli/commands/telemetry.ts), and docs/public/telemetry.mdx —
* never widen one without the others. Events outside this set (user-message,
* file-edit) simply omit hook_type.
*/
const TELEMETRY_HOOK_TYPES = ['context', 'session-init', 'observation', 'summarize', 'file-context'] as const;
export type TelemetryHookType = (typeof TELEMETRY_HOOK_TYPES)[number];
let activeHookType: TelemetryHookType | null = null;
/**
* Record which hook event this short-lived hook process is executing, so the
* fail-loud counter can tag its threshold-gated hook_failed telemetry.
* Called once at hookCommand entry; values outside the closed enum are
* dropped (never free text).
*/
export function setActiveHookType(event: string): void {
activeHookType = (TELEMETRY_HOOK_TYPES as readonly string[]).includes(event)
? (event as TelemetryHookType)
: null;
}
export function getActiveHookType(): TelemetryHookType | null {
return activeHookType;
}
export async function recordWorkerUnreachable(): Promise<number> {
const state = readHookFailureState();
const next: HookFailureState = {
consecutiveFailures: state.consecutiveFailures + 1,
@@ -458,6 +487,23 @@ function recordWorkerUnreachable(): number {
const threshold = getFailLoudThreshold();
if (next.consecutiveFailures >= threshold) {
// hook_failed distress signal. Gated to the failure that JUST reached the
// threshold (`===`, not `>=`): the stderr warning below repeats on every
// failure past the threshold, but telemetry emits once per failure streak
// to bound volume. MUST be awaited BEFORE emitBlockingError — it calls
// process.exit(2) immediately, which would kill a fire-and-forget POST
// mid-flight. captureCliEvent never throws and is hard-capped at 2s, so
// this cannot hang the fail-loud path. Closed-enum/count props only —
// never error text. Transport is the direct CLI POST, never the worker
// API (the defining failure here IS "worker unreachable").
if (next.consecutiveFailures === threshold) {
await captureCliEvent('hook_failed', {
...(activeHookType !== null ? { hook_type: activeHookType } : {}),
error_mode: 'worker_unavailable',
consecutive_failures: next.consecutiveFailures,
threshold_tripped: true,
});
}
// #2292 fix: BLOCKING_FEEDBACK. emitBlockingError flushes the Phase 2
// stderr buffer (so preceding logger.warn lines also surface) and writes
// via the bypass channel + exits 2. Previously this raw process.stderr.write
@@ -501,7 +547,7 @@ export async function executeWithWorkerFallback<T = unknown>(
): Promise<WorkerCallResult<T>> {
const alive = await ensureWorkerAliveOnce();
if (!alive) {
recordWorkerUnreachable();
await recordWorkerUnreachable();
return { continue: true, reason: 'worker_unreachable', [WORKER_FALLBACK_BRAND]: true };
}

35
tests/preload.ts Normal file
View File

@@ -0,0 +1,35 @@
import { mock } from 'bun:test';
/**
* Global posthog-node mock, registered via bunfig.toml [test].preload BEFORE
* any test file or src module loads. It must be global, not per-file:
*
* 1. telemetry.ts imports PostHog at module top level, and many src modules
* (ResponseProcessor, SessionRoutes, SearchRoutes, worker-service, ...)
* transitively import telemetry.ts. The whole suite runs in one bun
* process, so a per-file mock.module registers too late once any earlier
* test file has touched those modules — the cached telemetry module keeps
* the real PostHog binding.
* 2. Telemetry consent is default-on and the publishable key ships in the
* code, so without this mock a full-suite run constructs a REAL PostHog
* client and can flush fabricated test events into production analytics
* (flushAt: 20 / flushInterval: 10s vs a ~25s suite).
*
* Tests assert against these recorded calls — see
* tests/telemetry/telemetry-client.test.ts.
*/
export type PostHogConstructorCall = { apiKey: string; options: Record<string, unknown> };
export const postHogConstructorCalls: PostHogConstructorCall[] = [];
export const postHogCaptureCalls: Array<Record<string, unknown>> = [];
mock.module('posthog-node', () => ({
PostHog: class {
constructor(apiKey: string, options: Record<string, unknown>) {
postHogConstructorCalls.push({ apiKey, options });
}
capture(payload: Record<string, unknown>): void {
postHogCaptureCalls.push(payload);
}
async shutdown(): Promise<void> {}
},
}));

View File

@@ -136,6 +136,74 @@ describe('scrubProperties', () => {
expect(result.days_since_last_obs).toBe(0);
});
it('keeps the retrieval quality keys with primitive values', () => {
const result = scrubProperties({
result_count: 0,
chroma_available: false,
fallback_reason: 'chroma_connection',
});
expect(result).toEqual({
result_count: 0,
chroma_available: false,
fallback_reason: 'chroma_connection',
});
});
it('keeps the compression trust keys with primitive values', () => {
const result = scrubProperties({
fabrication_detected: true,
fabricated_count: 2,
invalid_output_class: 'poisoned',
consecutive_invalid_outputs: 3,
respawn_triggered: true,
abort_reason: 'restart_guard',
});
expect(Object.keys(result)).toHaveLength(6);
expect(result.fabrication_detected).toBe(true);
expect(result.fabricated_count).toBe(2);
expect(result.invalid_output_class).toBe('poisoned');
expect(result.consecutive_invalid_outputs).toBe(3);
expect(result.respawn_triggered).toBe(true);
expect(result.abort_reason).toBe('restart_guard');
});
it('keeps the worker lifecycle keys with primitive values', () => {
const result = scrubProperties({
previous_shutdown: 'crash',
previous_uptime_seconds: 86400,
uptime_seconds: 3600,
shutdown_reason: 'restart',
process_rss_mb: 187,
heap_used_mb: 92,
});
expect(Object.keys(result)).toHaveLength(6);
expect(result.previous_shutdown).toBe('crash');
expect(result.previous_uptime_seconds).toBe(86400);
expect(result.uptime_seconds).toBe(3600);
expect(result.shutdown_reason).toBe('restart');
expect(result.process_rss_mb).toBe(187);
expect(result.heap_used_mb).toBe(92);
});
it('keeps the hook failure keys with primitive values', () => {
const result = scrubProperties({
hook_type: 'observation',
error_mode: 'worker_unavailable',
consecutive_failures: 3,
threshold_tripped: true,
});
expect(result).toEqual({
hook_type: 'observation',
error_mode: 'worker_unavailable',
consecutive_failures: 3,
threshold_tripped: true,
});
});
it('drops unknown keys silently', () => {
const result = scrubProperties({
version: '1.0.0',

View File

@@ -1,33 +1,22 @@
import { describe, it, expect, mock, beforeAll, afterAll } from 'bun:test';
import { describe, it, expect, beforeAll, afterAll } from 'bun:test';
import { mkdtempSync, rmSync } from 'fs';
import { tmpdir } from 'os';
import { join } from 'path';
import { postHogConstructorCalls, postHogCaptureCalls } from '../preload';
import { captureEvent, __resetTelemetryForTests } from '../../src/services/telemetry/telemetry';
/**
* Guards the PostHog client construction options. The posthog-node SDK stamps
* $geoip_disable: true on every event unless disableGeoip: false is passed —
* losing ingest-side coarse location for every worker event. The module is
* mocked so no real client (and no network) is ever created.
* losing ingest-side coarse location for every worker event.
*
* posthog-node is mocked globally in tests/preload.ts (it cannot be mocked
* per-file: telemetry.ts is imported transitively by many other test files in
* the same process, so a local mock.module registers too late). The telemetry
* module's process-wide state is reset below so construction is observed from
* scratch regardless of suite order.
*/
type ConstructorCall = { apiKey: string; options: Record<string, unknown> };
const constructorCalls: ConstructorCall[] = [];
const captureCalls: Array<Record<string, unknown>> = [];
mock.module('posthog-node', () => ({
PostHog: class {
constructor(apiKey: string, options: Record<string, unknown>) {
constructorCalls.push({ apiKey, options });
}
capture(payload: Record<string, unknown>): void {
captureCalls.push(payload);
}
async shutdown(): Promise<void> {}
},
}));
const { captureEvent } = await import('../../src/services/telemetry/telemetry');
let tempDir: string;
const savedEnv: Record<string, string | undefined> = {};
const ENV_KEYS = [
@@ -46,6 +35,10 @@ beforeAll(() => {
delete process.env.CLAUDE_MEM_TELEMETRY_DEBUG;
delete process.env.CLAUDE_MEM_TELEMETRY_KEY;
delete process.env.DO_NOT_TRACK;
__resetTelemetryForTests();
postHogConstructorCalls.length = 0;
postHogCaptureCalls.length = 0;
});
afterAll(() => {
@@ -54,21 +47,24 @@ afterAll(() => {
else process.env[key] = savedEnv[key];
}
rmSync(tempDir, { recursive: true, force: true });
// Drop the client/consent state built under this file's env so later test
// files start from the same blank slate this file demanded.
__resetTelemetryForTests();
});
describe('PostHog client construction', () => {
it('constructs the client with disableGeoip: false so ingest-side geolocation works', () => {
captureEvent('test_event');
expect(constructorCalls.length).toBe(1);
expect(constructorCalls[0].options.disableGeoip).toBe(false);
expect(postHogConstructorCalls.length).toBe(1);
expect(postHogConstructorCalls[0].options.disableGeoip).toBe(false);
});
it('reuses the client and queues the capture', () => {
captureEvent('test_event_2');
expect(constructorCalls.length).toBe(1);
expect(captureCalls.length).toBe(2);
expect(captureCalls[1].event).toBe('test_event_2');
expect(postHogConstructorCalls.length).toBe(1);
expect(postHogCaptureCalls.length).toBe(2);
expect(postHogCaptureCalls[1].event).toBe('test_event_2');
});
});