Files
Alex Newman 4ce51efd28 feat(telemetry): reliability signals — retrieval quality, compression trust, worker lifecycle, hook failures (Plan 14) (#2874)
* feat(telemetry): disclose 19 reliability-signal fields and 2 new events across all surfaces

Whitelist (scrub.ts), scrub tests, public docs (telemetry.mdx), and CLI
disclosure (COLLECTED_FIELDS/EVENT_NAMES) for the Plan 14 reliability
signals: search retrieval quality, compression trust, worker lifecycle,
and hook failure keys, plus the worker_stopped and hook_failed events.
Includes the plan document.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(telemetry): retrieval-quality signals on search_performed

SearchManager.search() fills an optional telemetry envelope
(result_count, search_strategy, chroma_available, fallback_reason)
across all three search paths; handlers stash it on
res.locals.searchTelemetry and the existing finish-middleware spreads it
into the search_performed capture. Zero-result searches report
result_count: 0; Chroma fallback reasons are a closed enum, never the
error message. Response shapes unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(telemetry): compression trust signals on session_compressed

fabrication_detected/fabricated_count flow through compressionProps (all
three emit paths); invalid-output respawns emit a respawn-gated
session_compressed with outcome invalid_output and the classifier value;
aborted generators emit outcome aborted with abort_reason normalized to
a closed enum in the .finally where all five abort flows converge (the
.catch path can never observe a non-null abortReason).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(telemetry): worker lifecycle signals — worker_stopped, crash detection, memory metrics

Clean-shutdown sentinel written before telemetry flush and consumed at
startup; worker_started gains previous_shutdown (crash/clean/unknown)
and previous_uptime_seconds derived from the stale PID file; new
worker_stopped event (uptime_seconds, shutdown_reason stop/restart/
signal) emitted before shutdownTelemetry(); the CLI restart path tags
/api/admin/shutdown?reason=restart so restarts are distinguishable;
buildLifecycleProps adds integer process_rss_mb/heap_used_mb.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(telemetry): threshold-gated hook_failed distress signal via CLI transport

recordWorkerUnreachable emits hook_failed exactly when the consecutive-
failure count reaches the fail-loud threshold; the generic blocking-error
branch emits error_mode blocking_error. Both emits are awaited before
the process.exit paths so the 2s-capped CLI POST survives; hook_type is
a closed enum registered at hookCommand entry. Exit codes unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(build): regenerate plugin artifacts with Plan 14 telemetry signals

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(tests): make PostHog client regression test order-independent via global preload mock

The disableGeoip regression test mocked posthog-node per-file, but
telemetry.ts is imported transitively by many test files in the shared
bun process, so the mock registered too late and the test failed in
full-suite runs — CI on main has been red since v13.5.4. The mock now
registers in a bunfig [test].preload before any module loads, which
also guarantees test runs can never construct a real PostHog client and
flush fabricated events into production analytics (consent is
default-on and the suite outlives flushInterval). telemetry.ts gains a
test-only state reset so construction is observed deterministically
regardless of suite order.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(telemetry): forward shutdown reason in Windows-managed IPC message

Review follow-up: the wrapper IPC path discarded the restart tag, so an
external Windows wrapper could only ever report shutdown_reason 'stop'.
No wrapper in this repo listens for the message, but the reason now
travels with it for any that does.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 02:51:45 -07:00
..