* feat(telemetry): disclose 19 reliability-signal fields and 2 new events across all surfaces

Whitelist (scrub.ts), scrub tests, public docs (telemetry.mdx), and CLI disclosure (COLLECTED_FIELDS/EVENT_NAMES) for the Plan 14 reliability signals: search retrieval quality, compression trust, worker lifecycle, and hook failure keys, plus the worker_stopped and hook_failed events. Includes the plan document.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(telemetry): retrieval-quality signals on search_performed

SearchManager.search() fills an optional telemetry envelope (result_count, search_strategy, chroma_available, fallback_reason) across all three search paths; handlers stash it on res.locals.searchTelemetry and the existing finish-middleware spreads it into the search_performed capture. Zero-result searches report result_count: 0; Chroma fallback reasons are a closed enum, never the error message. Response shapes unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(telemetry): compression trust signals on session_compressed

fabrication_detected/fabricated_count flow through compressionProps (all three emit paths); invalid-output respawns emit a respawn-gated session_compressed with outcome invalid_output and the classifier value; aborted generators emit outcome aborted with abort_reason normalized to a closed enum in the .finally where all five abort flows converge (the .catch path can never observe a non-null abortReason).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(telemetry): worker lifecycle signals: worker_stopped, crash detection, memory metrics

Clean-shutdown sentinel written before telemetry flush and consumed at startup; worker_started gains previous_shutdown (crash/clean/unknown) and previous_uptime_seconds derived from the stale PID file; new worker_stopped event (uptime_seconds, shutdown_reason stop/restart/signal) emitted before shutdownTelemetry(); the CLI restart path tags /api/admin/shutdown?reason=restart so restarts are distinguishable; buildLifecycleProps adds integer process_rss_mb/heap_used_mb.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(telemetry): threshold-gated hook_failed distress signal via CLI transport

recordWorkerUnreachable emits hook_failed exactly when the consecutive-failure count reaches the fail-loud threshold; the generic blocking-error branch emits error_mode blocking_error. Both emits are awaited before the process.exit paths so the 2s-capped CLI POST survives; hook_type is a closed enum registered at hookCommand entry. Exit codes unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(build): regenerate plugin artifacts with Plan 14 telemetry signals

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(tests): make PostHog client regression test order-independent via global preload mock

The disableGeoip regression test mocked posthog-node per-file, but telemetry.ts is imported transitively by many test files in the shared bun process, so the mock registered too late and the test failed in full-suite runs; CI on main has been red since v13.5.4. The mock now registers in a bunfig [test].preload before any module loads, which also guarantees test runs can never construct a real PostHog client and flush fabricated events into production analytics (consent is default-on and the suite outlives flushInterval). telemetry.ts gains a test-only state reset so construction is observed deterministically regardless of suite order.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(telemetry): forward shutdown reason in Windows-managed IPC message

Review follow-up: the wrapper IPC path discarded the restart tag, so an external Windows wrapper could only ever report shutdown_reason 'stop'. No wrapper in this repo listens for the message, but the reason now travels with it for any that does.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Plan 14 — Telemetry Reliability Signals
Adds the five highest-value missing telemetry signals identified by the 2026-06-10
capture-surface audit. Theme: we instrument success well; failure is invisible.
Every signal below feeds the Reliability sentence of
plans/2026-06-09-telemetry-metrics-spec.md ("Core pipeline succeeds X% of the
time at scale") — plus retrieval quality, which today has no KPI at all.
Phases are self-contained: each can be executed in a fresh chat context. Execute in order; Phases 1–4 are independent of each other, but all depend on Phase 0's facts and share the pipeline ritual below.
Phase 0 — Verified facts, allowed APIs, and the every-property ritual
Consolidated from 5 documentation-discovery agents (all high confidence, all findings cite read code). Do not invent APIs beyond this list.
The pipeline ritual — EVERY new property or event must touch all five surfaces
| # | Surface | Location | What to do |
|---|---|---|---|
| 1 | Scrub whitelist | src/services/telemetry/scrub.ts:8-82 (ALLOWED_PROPERTY_KEYS: Set<string>) | Add the key, grouped with a category comment like the existing ones |
| 2 | Scrub tests | tests/telemetry/scrub.test.ts | Copy the pattern at :5-31 (single-group) or :81-106 (multi-key group); also confirm :139-169 drop-tests still pass |
| 3 | Public docs | docs/public/telemetry.mdx fields table :26-75, events table :78-89 | Add a row per field; new events get an events-table row |
| 4 | CLI disclosure | src/npx-cli/commands/telemetry.ts COLLECTED_FIELDS:23-66, EVENT_NAMES:68-77 | Add a line per field; new event names go in EVENT_NAMES |
| 5 | Capture site | per phase below | Emit via captureEvent / captureCliEvent only |
Allowed APIs (verified signatures)
- captureEvent(event: string, props?: Record<string, unknown>, opts?: { person?: boolean }): void, at src/services/telemetry/telemetry.ts:72 (worker transport; consent-gated, scrubbed, fire-and-forget)
- captureCliEvent(event, props?, opts?): Promise<void>, at src/services/telemetry/cli-telemetry.ts:22 (short-lived-process transport; direct POST, hard 2s timeout CAPTURE_TIMEOUT_MS at :15, never throws)
- scrubProperties(props): Record<string, string | number | boolean>, at src/services/telemetry/scrub.ts:91-114 (drops non-whitelisted keys and non-primitives silently; strings clamped to 200 chars; numbers must be finite)
- collectInstallStats(db): Record<string, number>, at src/services/telemetry/install-stats.ts:29
- getUptimeSeconds(startedAtMs: number, now?): number, at src/shared/uptime.ts:5-7
- writePidFile(info: PidInfo) / readPidFile(): PidInfo | null / removePidFile(), at src/services/infrastructure/ProcessManager.ts:134/141/156; PidInfo = { pid, port, startedAt: string /* ISO8601 */, startToken? } (src/supervisor/process-registry.ts:49-54)
- recordWorkerUnreachable(): number, at src/shared/worker-utils.ts:451-470 (returns the consecutive-failure count; persists atomically in ~/.claude-mem/state/hook-failures.json; threshold default 3, env CLAUDE_MEM_HOOK_FAIL_LOUD_THRESHOLD)
- classifyObserverOutput(raw): 'xml'|'idle'|'prose'|'poisoned', at src/sdk/output-classifier.ts:60-80
- verifyCommitHashesInText(...): CommitVerificationResult with fabricated: string[], at src/sdk/commit-verification.ts:69-108
- DATA_DIR / paths.workerPid() etc., at src/shared/paths.ts:40,129-151
Global anti-patterns (from discovery; apply to every phase)
- Properties not added to ALLOWED_PROPERTY_KEYS are silently dropped, with no error. Always whitelist first, then emit.
- Only number | boolean | closed-enum string. Never free text, paths, queries, error messages, IDs derived from the user. (An earlier audit draft proposed error_summary: string; explicitly rejected.)
- person: true only on lifecycle events (spec constraint, plans/2026-06-09-telemetry-metrics-spec.md:65-71). Nothing in this plan adds person properties; do not touch PERSON_PROPERTY_KEYS.
- Never bypass captureEvent / captureCliEvent with direct PostHog calls.
- Debug-mode verification harness: CLAUDE_MEM_TELEMETRY_DEBUG=1 prints would-be payloads to stderr and sends nothing (telemetry.ts:97-103).
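The scrubber rules listed above (whitelist, primitives only, 200-char clamp, finite numbers) can be illustrated with a minimal sketch. This is not the code in scrub.ts: scrubSketch, ALLOWED_KEYS, and MAX_STRING_LENGTH are hypothetical names, and the two whitelisted keys are placeholders. It only shows why an un-whitelisted property vanishes without any error.

```typescript
// Hypothetical stand-in for the real whitelist; the actual set lives in scrub.ts.
const ALLOWED_KEYS = new Set<string>(['result_count', 'search_strategy']);
const MAX_STRING_LENGTH = 200;

type Primitive = string | number | boolean;

// Sketch of the documented scrubber behavior, not the real implementation.
function scrubSketch(
  props: Record<string, unknown>,
  allowed: ReadonlySet<string> = ALLOWED_KEYS,
): Record<string, Primitive> {
  const out: Record<string, Primitive> = {};
  for (const [key, value] of Object.entries(props)) {
    if (!allowed.has(key)) continue; // non-whitelisted: dropped with no error
    if (typeof value === 'string') {
      out[key] = value.slice(0, MAX_STRING_LENGTH); // clamp long strings
    } else if (typeof value === 'number') {
      if (Number.isFinite(value)) out[key] = value; // NaN / Infinity dropped
    } else if (typeof value === 'boolean') {
      out[key] = value;
    }
    // objects, arrays, null, undefined: dropped
  }
  return out;
}
```

Calling scrubSketch({ result_count: 3, fallback_reason: 'none' }) with the placeholder whitelist returns only result_count, which is the failure mode the whitelist-first rule guards against.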
Discovery discrepancy to resolve during Phase 2
One agent reported INVALID_OUTPUT_RESPAWN_THRESHOLD = 25, another = 3. Read
src/services/worker/agents/ResponseProcessor.ts:25 before relying on the value.
Phase 1 — Retrieval quality: result_count + strategy/fallback on search_performed
Narrative served: Reliability + retrieval quality. Zero-result rate becomes
computable; Chroma's silent degradation to FTS becomes visible (the recurring
SQLiteSearchStrategy Database error incident class).
Verified obstacles (do not skip)
- The existing capture is a middleware: SearchRoutes.ts:117-123 inside res.once('finish'). It fires after the response, outside handler scope. It can see only endpoint, res.statusCode, and elapsed time. Result arrays, totalResults (computed at SearchManager.ts:307), chromaFailed (SearchManager.ts:158, 206, 274) and chromaFailureReason (SearchManager.ts:267-275) are method-local and unreachable from there.
- SearchManager.search() has three paths: filter-only SQLite (:165-176), Chroma (:179-286, sets chromaFailed on error), Chroma-not-initialized FTS (:288-305). Text-format responses (:420-425) do not carry counts; only format='json' (:309-316) includes totalResults.
- search_strategy is already whitelisted (scrub.ts:55); only the new keys need whitelist entries.
What to implement
- In SearchManager.search(), build a small telemetry envelope alongside the existing return value; do not change response shapes. Collect: result_count (the totalResults already computed at :307), search_strategy: 'chroma' | 'fts' | 'filter_only' (one per path above), chroma_available: boolean (false when chromaFailed or not initialized), fallback_reason: 'none' | 'chroma_connection' | 'chroma_error' | 'chroma_not_initialized' (map from chromaFailureReason.isConnectionError at :271; never the message). Expose it to callers. Recommended: return { ...existing, telemetry } for an internal caller, or set it on a mutable param. Simplest verified-safe plumbing: handlers stash it on res.locals.searchTelemetry, and the middleware at SearchRoutes.ts:117-123 spreads res.locals.searchTelemetry ?? {} into the existing captureEvent('search_performed', …) props.
- Whitelist result_count, chroma_available, fallback_reason (ritual #1–4).
- Note: src/services/worker/search/types.ts:53-64 has a StrategySearchResult with a strategy field but SearchManager.search() does not use it. Derive strategy from the three paths; do not refactor onto SearchOrchestrator here.
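A minimal sketch of the envelope and the middleware-side spread, under stated assumptions. SearchPath, buildSearchTelemetry, and mergeSearchProps are hypothetical names, not existing code. Two choices below are assumptions the plan does not settle: a failed Chroma attempt still reports search_strategy 'chroma' (the path taken), and the filter-only path reports chroma_available: true because nothing failed.

```typescript
type SearchStrategy = 'chroma' | 'fts' | 'filter_only';
type FallbackReason = 'none' | 'chroma_connection' | 'chroma_error' | 'chroma_not_initialized';

interface SearchTelemetry {
  result_count: number;
  search_strategy: SearchStrategy;
  chroma_available: boolean;
  fallback_reason: FallbackReason;
}

// Hypothetical description of which of the three search() paths ran.
type SearchPath =
  | { kind: 'filter_only' }
  | { kind: 'chroma'; failure: { isConnectionError: boolean } | null }
  | { kind: 'chroma_not_initialized' };

function buildSearchTelemetry(path: SearchPath, totalResults: number): SearchTelemetry {
  switch (path.kind) {
    case 'filter_only':
      // Assumption: Chroma was not consulted, so nothing is reported as failed.
      return { result_count: totalResults, search_strategy: 'filter_only', chroma_available: true, fallback_reason: 'none' };
    case 'chroma_not_initialized':
      return { result_count: totalResults, search_strategy: 'fts', chroma_available: false, fallback_reason: 'chroma_not_initialized' };
    case 'chroma':
      if (path.failure === null) {
        return { result_count: totalResults, search_strategy: 'chroma', chroma_available: true, fallback_reason: 'none' };
      }
      // Only the boolean classification is used; the error message never leaves this scope.
      return {
        result_count: totalResults,
        search_strategy: 'chroma',
        chroma_available: false,
        fallback_reason: path.failure.isConnectionError ? 'chroma_connection' : 'chroma_error',
      };
  }
}

// Middleware side: spread the stashed envelope (if any) into the existing props.
function mergeSearchProps(
  base: Record<string, unknown>,
  locals: { searchTelemetry?: SearchTelemetry },
): Record<string, unknown> {
  return { ...base, ...(locals.searchTelemetry ?? {}) };
}
```

Because result_count is always set from totalResults, a zero-result search yields result_count: 0 rather than a missing property.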
Verification
- bun test tests/telemetry/ green (new scrub cases included)
- npm run typecheck:root clean
- CLAUDE_MEM_TELEMETRY_DEBUG=1 + a worker search request prints search_performed with result_count, search_strategy, chroma_available, fallback_reason
- Grep guard: grep -n "fallback_reason" src/services/telemetry/scrub.ts docs/public/telemetry.mdx src/npx-cli/commands/telemetry.ts hits all three
- Zero-result search shows result_count: 0 (not missing)
Anti-pattern guards
- Do NOT try to introspect the response body from the middleware (no res._getBuffer()-style Express internals: unverified, fragile).
- Do NOT put chromaFailureReason.message in any property; enum only.
- Do NOT change the text-format response shape consumed by clients.
Phase 2 — Compression quality: fabrication, invalid-output, and abort reasons on session_compressed
Narrative served: Reliability + model quality (extends yesterday's tokens/cost/ratio work with per-model trust signals).
Verified mechanics (this is the key to doing it right)
- compressionProps is built at ResponseProcessor.ts:194-214. Non-SDK providers emit immediately (:228); the SDK/Claude path stashes the object into session.pendingCompressionEvent (worker-types.ts:60) at :216-226, and ClaudeProvider.ts:416-435 later merges real token fields and emits; :442-445 is the no-result fallback emit. Therefore: any property added to compressionProps automatically flows through all three emit paths.
- Fabrication scope: ResponseProcessor.ts:115-135 already computes fabricated: string[] via verifyCommitHashesInText.
- Invalid output: ResponseProcessor.ts:48-88 returns early, so no event fires at all on that path today. session.consecutiveInvalidOutputs (worker-types.ts:34) increments at :54, resets at :92; respawn decision at :67-79 (outputClass === 'poisoned' OR threshold reached; read the threshold at :25, see Phase 0 discrepancy).
- abortReason enum: worker-types.ts:42, 'idle'|'shutdown'|'overflow'|'restart-guard'|'quota'|string|null; set at ClaudeProvider.ts:270 (note: 'quota:…' prefix format), :315, SessionManager.ts:272,294,407; consumed at SessionRoutes.ts:166-167. The error-path emit is SessionRoutes.ts:154-163.
What to implement
- Fabrication: in ResponseProcessor.ts where fabricated.length is known (:128-135), add to compressionProps: fabrication_detected: boolean, fabricated_count: number. (Flows through deferred path for free.)
- Invalid output: at the respawn decision (:67-79), and ONLY when a respawn triggers, to bound volume, emit one captureEvent('session_compressed', { outcome: 'invalid_output', invalid_output_class, consecutive_invalid_outputs, respawn_triggered: true, provider, model, ide, hook }) where invalid_output_class is the classifier value ('idle'|'prose'|'poisoned').
- Abort reason: in the error-path emit (SessionRoutes.ts:154-163), add abort_reason normalized to a closed enum: 'idle'|'shutdown'|'overflow'|'restart_guard'|'quota'|'none'. Split the 'quota:…' format on ':' and map 'restart-guard' → 'restart_guard'.
- Whitelist fabrication_detected, fabricated_count, invalid_output_class, consecutive_invalid_outputs, respawn_triggered, abort_reason (ritual #1–4).
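The abort-reason normalization can be sketched as a small pure function. normalizeAbortReason is a hypothetical name. The plan defines no bucket for unrecognized strings, so mapping them to 'none' below is an assumption made only to keep the enum closed.

```typescript
type AbortReasonEnum = 'idle' | 'shutdown' | 'overflow' | 'restart_guard' | 'quota' | 'none';

const KNOWN_ABORT_REASONS: ReadonlySet<string> = new Set([
  'idle', 'shutdown', 'overflow', 'restart_guard', 'quota',
]);

// Sketch: collapse the raw abortReason (which may be free text) to the closed enum.
function normalizeAbortReason(raw: string | null | undefined): AbortReasonEnum {
  if (!raw) return 'none';
  // 'quota:…' carries a suffix after the colon; keep only the prefix.
  const [prefix = ''] = raw.split(':');
  // 'restart-guard' uses a hyphen in the worker; the telemetry enum uses an underscore.
  const candidate = prefix.replace(/-/g, '_');
  // Assumption: anything unrecognized is reported as 'none' rather than passed through.
  return KNOWN_ABORT_REASONS.has(candidate) ? (candidate as AbortReasonEnum) : 'none';
}
```

Keeping this on the emitter side matters because the scrubber accepts any string up to 200 characters and will not enforce the enum.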
Verification
- bun test tests/telemetry/ green; npm run typecheck:root clean
- Debug-mode session_compressed payload shows fabrication_detected: false, fabricated_count: 0 on a normal compression
- Grep guard: grep -rn "abort_reason" src/services/telemetry/scrub.ts src/services/worker/http/routes/SessionRoutes.ts both hit
- Confirm the deferred path carries new props: grep the built plugin/scripts/worker-service.cjs for fabrication_detected after npm run build
Anti-pattern guards
- Do NOT emit an event per invalid output (volume); respawn-gated only.
- Do NOT send raw abortReason strings ('quota:daily', 'restart-guard'). Normalize to the closed enum first; the scrubber will happily pass any ≤200-char string, so enum discipline is on the emitter.
- Do NOT add the new props anywhere except compressionProps for the fabrication fields; adding them only at the ClaudeProvider merge would miss non-SDK providers.
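The respawn gate for the invalid-output event can be sketched as follows. invalidOutputEvent is a hypothetical helper, and the threshold is a parameter because its real value is the unresolved Phase 0 discrepancy. The provider, model, ide, and hook props are omitted here for brevity.

```typescript
type OutputClass = 'xml' | 'idle' | 'prose' | 'poisoned';

interface InvalidOutputProps {
  outcome: 'invalid_output';
  invalid_output_class: 'idle' | 'prose' | 'poisoned';
  consecutive_invalid_outputs: number;
  respawn_triggered: true;
}

// Returns the event props only when a respawn triggers; null means "emit nothing".
function invalidOutputEvent(
  outputClass: OutputClass,
  consecutiveInvalidOutputs: number,
  respawnThreshold: number, // read from ResponseProcessor.ts; value not assumed here
): InvalidOutputProps | null {
  if (outputClass === 'xml') return null; // valid output, not this path
  const respawn = outputClass === 'poisoned' || consecutiveInvalidOutputs >= respawnThreshold;
  if (!respawn) return null; // below threshold: stay silent to bound volume
  return {
    outcome: 'invalid_output',
    invalid_output_class: outputClass,
    consecutive_invalid_outputs: consecutiveInvalidOutputs,
    respawn_triggered: true,
  };
}
```

A caller would pass a non-null result straight to captureEvent('session_compressed', …) and skip the capture otherwise.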
Phase 3 — Worker lifecycle: crash detection, worker_stopped, heartbeat health
Narrative served: Reliability ("crash-free installs") + makes the DAU/uptime data trustworthy.
Verified mechanics
- PID file already stores startedAt ISO8601 (worker-service.ts:289, PidInfo at process-registry.ts:49-54) → previous uptime is computable on next start via Date.parse(startedAt).
- There is NO shutdown sentinel today; marker-file pattern to copy: ProcessManager.ts:232-254 (.chroma-cleaned-v10.3). Write to DATA_DIR.
- Graceful shutdown: worker-service.ts:565-585; shutdownTelemetry() is called at :576 and races a 3s flush (telemetry.ts:124-144), so an event captured before :576 will flush. Stop-case removePidFile() is at :836.
- worker_started captures: :427 (trigger start, person: true), :436 (heartbeat, 24h setInterval with .unref() at :435-438); props builder buildLifecycleProps() at :401-426.
- uncaughtException handler at :1075-1078 logs and does NOT exit (known smell; out of scope here, do not change process semantics in this plan).
What to implement
- Clean-shutdown sentinel: in the shutdown path (before :576), write DATA_DIR/.worker-clean-shutdown containing the ISO timestamp (copy the marker pattern from ProcessManager.ts:232-254). Delete the sentinel at startup after reading it.
- Crash detection on start: in the startup daemon path, before writePidFile, derive:
  - stale PID file present + no sentinel → previous_shutdown: 'crash'
  - sentinel present → 'clean'
  - neither (first run) → 'unknown'
  - previous_uptime_seconds from the stale PID file's startedAt to the sentinel time (clean). After a crash the end time is unknown, so omit the property rather than guess; omitted properties are fine.

  Add both to the existing captureEvent('worker_started', …) at :427.
- worker_stopped event: immediately before shutdownTelemetry() at :576, captureEvent('worker_stopped', { uptime_seconds, shutdown_reason }) with uptime_seconds from getUptimeSeconds(this.startTime) (worker-service.ts:122, uptime.ts:5-7) and shutdown_reason: 'stop' | 'restart' | 'signal' from the caller. No person: true.
- Heartbeat health: in the heartbeat payload (:436 / buildLifecycleProps), add process_rss_mb and heap_used_mb as integers from process.memoryUsage() (Math.round(rss / 1024 / 1024)).
- Whitelist previous_shutdown, previous_uptime_seconds, uptime_seconds, shutdown_reason, process_rss_mb, heap_used_mb; add worker_stopped to EVENT_NAMES and the docs events table (ritual #1–4).
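The crash-detection decision and the memory conversion can be sketched as pure functions. derivePreviousRun and toWholeMb are hypothetical names; the inputs stand for the startedAt read from a stale PID file and the timestamp read from the sentinel, each null when the file is absent. File I/O and sentinel deletion are left out.

```typescript
type PreviousShutdown = 'crash' | 'clean' | 'unknown';

interface PreviousRunProps {
  previous_shutdown: PreviousShutdown;
  previous_uptime_seconds?: number; // omitted when it cannot be known
}

function derivePreviousRun(
  staleStartedAt: string | null,   // ISO8601 from the stale PID file, if present
  sentinelTimestamp: string | null, // ISO8601 from the clean-shutdown sentinel, if present
): PreviousRunProps {
  if (sentinelTimestamp !== null) {
    const props: PreviousRunProps = { previous_shutdown: 'clean' };
    if (staleStartedAt !== null) {
      const seconds = Math.floor((Date.parse(sentinelTimestamp) - Date.parse(staleStartedAt)) / 1000);
      // Guard against unparsable or out-of-order timestamps instead of sending garbage.
      if (Number.isFinite(seconds) && seconds >= 0) props.previous_uptime_seconds = seconds;
    }
    return props;
  }
  // No sentinel: a leftover PID file means the previous run never shut down cleanly.
  if (staleStartedAt !== null) return { previous_shutdown: 'crash' };
  return { previous_shutdown: 'unknown' };
}

// process.memoryUsage() reports bytes; telemetry wants whole megabytes.
function toWholeMb(bytes: number): number {
  return Math.round(bytes / 1024 / 1024);
}
```

In the worker this would be called once at startup before writePidFile, and toWholeMb would wrap process.memoryUsage().rss and .heapUsed inside buildLifecycleProps.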
Verification
- bun test tests/telemetry/ green; npm run typecheck:root clean
- Debug mode: worker-service restart prints worker_stopped (reason restart) then worker_started with previous_shutdown: 'clean'
- Kill -9 the worker, start it: worker_started shows previous_shutdown: 'crash'
- Heartbeat payload contains integer process_rss_mb
- Sentinel file is removed after startup reads it (no stale 'clean' after a later crash)
Anti-pattern guards
- Do NOT compute uptime from in-memory startTime for the previous run; it's never persisted. Use the PID file's startedAt.
- Do NOT emit worker_stopped after shutdownTelemetry(); isShutdown (telemetry.ts:81) drops late events by design.
- Do NOT add the new keys to PERSON_PROPERTY_KEYS (spec ingestion-cost constraint).
- process.memoryUsage().rss is bytes; convert. The scrubber drops non-finite numbers silently.
Phase 4 — hook_failed event (threshold-gated, CLI transport)
Narrative served: Reliability — a failing hook is silent memory loss; today the fail-loud counter only writes to the user's stderr.
Verified constraints (these dictate the design — read before coding)
- Hooks are short-lived processes (<1s typical). The worker transport (posthog-node batching) can never flush there; and emitting via the worker API is self-defeating (the defining failure IS "worker unreachable"). Transport must be captureCliEvent (cli-telemetry.ts:22, direct POST, 2s cap, never throws).
- The trap: exitGraceful (hook-io.ts:166-173) and emitBlockingError (hook-io.ts:150-159) call process.exit() immediately and do not await pending promises, so a fire-and-forget POST is killed mid-flight. The emit must be awaited before the exit call, inside the failure branch.
- Catch taxonomy lives at hook-command.ts:99-128: AdapterRejectedInput (:100-105), non-blocking input error (:106-111), worker-unavailable (:112-119, the only branch calling recordWorkerUnreachable()), generic blocking error (:121-128, exit 2).
- recordWorkerUnreachable(): number returns the consecutive count and knows the threshold; gate on it.
- Hooks currently import zero telemetry code; captureCliEvent has only fs/fetch deps and bundles fine via scripts/build-hooks.js esbuild (telemetry modules are not externalized; verified at build-hooks.js:284-330).
What to implement
- In hook-command.ts, in exactly two branches:
  - worker-unavailable branch (:112-119): after recordWorkerUnreachable() returns count, if count has just reached the fail-loud threshold (the same condition that triggers the blocking stderr message), await captureCliEvent('hook_failed', { hook_type, error_mode: 'worker_unavailable', consecutive_failures: count, threshold_tripped: true }).
  - generic blocking-error branch (:121-128): await captureCliEvent('hook_failed', { hook_type, error_mode: 'blocking_error', threshold_tripped: false }) before emitBlockingError.

  Both branches are rare and already failed; the ≤2s bounded wait is acceptable there. Never emit on the success path or the two skip branches.
- hook_type: closed enum from the hook event already passed to hookCommand(platform, event, …) (:79). Use the event/handler name set (context | session-init | observation | summarize | file-context), not free text.
- Whitelist hook_type, error_mode, consecutive_failures, threshold_tripped; add hook_failed to EVENT_NAMES + docs events table (ritual #1–4).
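The threshold gate and the await-before-exit ordering for the worker-unavailable branch can be sketched as below. reportWorkerUnavailable and the CaptureFn type are hypothetical; the capture function is injected so the sketch runs without the real captureCliEvent, and the caller is expected to invoke its exit helper only after the returned promise settles.

```typescript
type HookType = 'context' | 'session-init' | 'observation' | 'summarize' | 'file-context';

// Hypothetical shape standing in for captureCliEvent(event, props).
type CaptureFn = (
  event: string,
  props: Record<string, string | number | boolean>,
) => Promise<void>;

// Emits hook_failed only on the invocation where the count reaches the threshold.
// Resolves to true when an event was sent, so the caller can exit afterwards.
async function reportWorkerUnavailable(
  hookType: HookType,
  consecutiveFailures: number, // value returned by recordWorkerUnreachable()
  threshold: number,
  capture: CaptureFn,
): Promise<boolean> {
  if (consecutiveFailures !== threshold) return false; // not the tripping invocation
  // Awaited on purpose: process.exit() right after a fire-and-forget POST would kill it.
  await capture('hook_failed', {
    hook_type: hookType,
    error_mode: 'worker_unavailable',
    consecutive_failures: consecutiveFailures,
    threshold_tripped: true,
  });
  return true;
}
```

Gating on equality rather than on "at or above" keeps the signal to one event per outage, which matches the volume concern stated for the other phases.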
Verification
- bun test tests/telemetry/ green; npm run typecheck:root clean
- npm run build then grep the built hook artifact for hook_failed (confirms bundling)
- With the worker stopped and CLAUDE_MEM_TELEMETRY_DEBUG=1, run a hook 3× (threshold): third run prints hook_failed with consecutive_failures: 3
- Success-path hook run emits nothing and latency is unchanged
- Confirm exit codes unchanged (HOOK_EXIT_CODES, hook-constants.ts:15-20)
Anti-pattern guards
- Do NOT fire-and-forget then process.exit(); the event dies with the process.
- Do NOT emit per-invocation hook latency events (volume + inline-latency cost). Worker-side duration_ms on context_injected / search_performed already covers worker latency; defer hook-side latency to a future aggregate.
- Do NOT route the emit through executeWithWorkerFallback or any worker API.
- Do NOT emit in the AdapterRejectedInput / non-blocking-input branches (expected, noisy, not failures of ours).
Phase 5 — Final verification
- Full ritual audit: for each new key (result_count, chroma_available, fallback_reason, fabrication_detected, fabricated_count, invalid_output_class, consecutive_invalid_outputs, respawn_triggered, abort_reason, previous_shutdown, previous_uptime_seconds, uptime_seconds, shutdown_reason, process_rss_mb, heap_used_mb, hook_type, error_mode, consecutive_failures, threshold_tripped): grep -n "<key>" src/services/telemetry/scrub.ts tests/telemetry/scrub.test.ts docs/public/telemetry.mdx src/npx-cli/commands/telemetry.ts; all four must hit.
- New events disclosed: worker_stopped, hook_failed present in EVENT_NAMES (src/npx-cli/commands/telemetry.ts:68-77) and the telemetry.mdx events table.
- Anti-pattern greps:
  - grep -rn "captureEvent\|captureCliEvent" src/ | grep -v services/telemetry; every site must pass enums/counts only (manual scan of new sites)
  - grep -rn "posthog" src/ --include="*.ts" | grep -v services/telemetry; no direct SDK use outside the pipeline
  - no PERSON_PROPERTY_KEYS additions in the diff
- Tests & build: bun test tests/telemetry/ (note: bun only; the suite fails under vitest), npm run typecheck:root, npm run build-and-sync, worker /health returns ok.
- Live smoke: CLAUDE_MEM_TELEMETRY_DEBUG=1 walk: search (Phase 1 fields), compression (Phase 2 fields), restart (Phase 3 events), worker-down hook ×3 (Phase 4 event).
- Docs deploy: telemetry.mdx changes auto-deploy on push to main; confirm the public page renders the new rows after release.
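The ritual audit could also be expressed as a small check over file contents, as a complement to the grep. findMissingDisclosures is a hypothetical helper; it takes the text of each surface as a string so it stays independent of any file-reading API. Like the grep, it uses plain substring matching, so uptime_seconds would also match inside previous_uptime_seconds.

```typescript
// Returns, for each key, the surfaces whose text does not mention it.
// Keys present on every surface are left out of the result.
function findMissingDisclosures(
  keys: readonly string[],
  surfaces: Readonly<Record<string, string>>, // surface name -> file contents
): Record<string, string[]> {
  const missing: Record<string, string[]> = {};
  for (const key of keys) {
    const absentFrom = Object.entries(surfaces)
      .filter(([, text]) => !text.includes(key)) // substring match, same as grep
      .map(([name]) => name);
    if (absentFrom.length > 0) missing[key] = absentFrom;
  }
  return missing;
}
```

An empty result object corresponds to "all four must hit" for every key passed in.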
Out of scope (deliberately)
- The uncaughtException no-exit smell (worker-service.ts:1075-1078): process-semantics change, separate plan.
- Per-hook latency events, event-loop-lag sampling, telemetry_disabled final ping (product/privacy decision pending), installer funnel (install_started), doctor/repair distress signals: candidates for Plan 15 after this data lands.