mirror of https://github.com/thedotmack/claude-mem.git synced 2026-07-03 12:32:32 +08:00

Files

Alex Newman ebe6133089 feat(telemetry): carry observation volume on rollups so cache-value survives migration (#3017 )

* feat(telemetry): carry observation volume on rollups so cache-value survives migration

The context-cache-value, per-user-savings, and observation-type-by-model
metrics were derivable only from the legacy per-occurrence events
(context_injected, session_compressed), which decay to zero as the fleet
upgrades to 13.7.0 and switches to the rollups. The rollups already received
the underlying records — they just didn't aggregate the observation fields.

- observer_turn_rollup: add observations_created (Σ per-turn observation count,
  distinct from the rollup's turn `count`) + summed obs_type_* buckets, so
  cost-per-observation (total_cost_usd / observations_created) and
  observation-type-by-top_model are derivable from the rollup alone.
- context_injected_rollup: add total_observations_injected (cache-reuse count)
  + total_tokens_saved_vs_naive (windowed savings sum).
- scrub.ts: whitelist the three new emitted keys (obs_type_* already allowed;
  deny-by-default whitelist would drop them otherwise).
- docs: correct the rollup field tables — the prior context_injected_rollup row
  documented fields the code never actually emitted.
- tests: assert both new aggregations (167 telemetry tests pass).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LEZpnYz9z4TjKcG19qHFrJ

* build(telemetry): regenerate plugin bundles for rollup observation fields

worker-service.cjs and transcript-watcher.cjs rebuilt via `npm run build` to
bundle the new observation aggregation. Incidental, telemetry-unrelated churn
in the other service/UI bundles was left out to keep the diff meaningful.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LEZpnYz9z4TjKcG19qHFrJ

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-21 13:31:35 -07:00

6.2 KiB

Raw Blame History

TODO (next): make the observer IGNORE output it can't parse — kill the "poisoned" respawn heuristic

Status: Not started. Spec for a follow-up cleanup pass. Owner principle (from @thedotmack): The observer parser should IGNORE any output it can't parse. That is the whole job. It must NOT inspect the content of unparseable output and act on it.

The problem

When the observer SDK returns something that isn't parseable <observation>/<summary> XML, the code is supposed to drop it. Instead, a substring classifier labels some of it poisoned and kills + respawns the SDK session. "Poisoned" is decided by a case-insensitive includes() against 11 hardcoded English phrases (session exhausted, context window, prompt is too long, this session has ended, …) — see src/sdk/output-classifier.ts:20-32.

This is a brittle keyword match standing in for "ignore what you can't parse," and it over-fires badly.

Evidence it's broken (live PostHog, 30d, project CMEM)

poisoned aborts as a share of all compression turns, by model:

Model	poisoned % of turns	context
haiku (4.5)	41.1%	200K
claude-haiku-4-5-20251001	33.9% (9.0M turns)	200K
claude-sonnet-4-5	17.7%	200K–1M
claude-sonnet-4-6	9.6%	200K–1M
claude-opus-4-7	5.9%	200K
gemini-2.5-flash-lite	0.1%	1M
openai/gpt-oss-120b:free	0.1%	—
gemini-3-flash-preview / xiaomi mimo-flash:free	~0%	small

Total ≈ 13.2M poisoned aborts / 30d ≈ 22% of all turns.

This is the inverse of real context exhaustion. If "poisoned" meant running out of room, the small-context models would top the list. Instead they're ~0% and the large-context Claude models are 6–41%. The classifier is matching Claude's closure-phrase wording (e.g. "I cannot continue this session"), not session health. Net effect: claude-mem kills+respawns ~1/3 of Haiku sessions on a keyword false positive, throwing away work and re-spending tokens.

(Note: the word "poisoned" is also just wrong — it implies tainted/corrupted input, not a wedged session. The whole concept should go.)

The fix — "parse XML or drop it," nothing content-based

Where it lives

src/services/worker/agents/ResponseProcessor.ts
- :56-120 — the !parsed.valid branch. The correct behavior already exists at :115-120 ("Plain-text skip responses are intentionally ignored" → confirmClaimedMessages drops the batch, returns).
- :75-77 — the offending trigger:
```
const mustRespawn =
  outputClass === 'poisoned' ||                                  // ← remove
  session.consecutiveInvalidOutputs >= INVALID_OUTPUT_RESPAWN_THRESHOLD; // ← see decision
```
- :79-113 — the respawn + telemetry block. :26 — INVALID_OUTPUT_RESPAWN_THRESHOLD = 3.
src/sdk/output-classifier.ts — POISONED_MARKERS (:20-32), poisoned in the ObserverOutputClass union (:13), the precedence check (:67-73).
src/services/worker/SessionManager.ts — respawnPoisonedSession (:252), session.abortReason = 'poisoned' (:273).
src/services/worker/http/routes/SessionRoutes.ts:37,44 — abort_reason enum mapping incl. poisoned.
src/services/telemetry/scrub.ts:90-99 — enum doc comments + whitelisted keys invalid_output_class, abort_reason, respawn_triggered, consecutive_invalid_outputs.
src/npx-cli/commands/telemetry.ts:71,74 — field docs listing poisoned.
Tests: output-classifier tests + any ResponseProcessor respawn tests.

Recommended change: PURE IGNORE (remove ALL content-driven respawn)

output-classifier.ts — delete POISONED_MARKERS and the poisoned class. Classifier collapses to xml (parseable) vs not. Keep previewOutput() for log visibility only — it must never drive behavior.
ResponseProcessor.ts — remove the whole mustRespawn/respawn block (:71-113). Unparseable output always falls through to the existing drop path (:115-120). Keep the warn-log + previewOutput so drops stay visible.
SessionManager.respawnPoisonedSession — remove if it becomes dead code (confirm no other callers; respawnPoisonedSession is referenced from the Phase-2 buffer-flush guards — do NOT flush a rollup for it; just delete the call path cleanly).
scrub.ts / SessionRoutes.ts / CLI docs — drop the poisoned enum value and, if respawn goes entirely, the now-unused respawn_triggered / invalid_output_class / abort_reason keys (or keep them whitelisted but unused — safer to keep keys, remove only the poisoned value from docs).
Tests — delete poison-classification + immediate-respawn tests; add a test asserting unparseable output is dropped (batch confirmed, no respawn) regardless of content.

Why pure-ignore is loop-safe: the drop path calls confirmClaimedMessages — the batch is NOT re-queued, so a genuinely-stuck session can't loop "until quota exhausted." The 3-strikes respawn was guarding a loop that the drop path already prevents.

If a safety net is still wanted (fallback option)

Keep only the content-agnostic consecutiveInvalidOutputs >= 3 structural respawn; remove just the outputClass === 'poisoned' || clause and the POISONED_MARKERS list. This kills the 33–41% false positives but retains a structural circuit-breaker. (Owner leans toward pure-ignore above.)

Verification

bun test (output-classifier + ResponseProcessor + telemetry suites) green; tsc --noEmit clean.
Grep guard: grep -rin "poison" src returns nothing behavioral (only history/comments if intentionally kept).
Manual: feed the observer non-XML output containing the string "context window" → asserted ignored/dropped, session NOT respawned.
Post-ship PostHog: invalid_output_class='poisoned' / abort_reason='poisoned' volume goes to zero on updated installs; Haiku turn throughput should rise (fewer needless respawns).

Dashboard follow-up (separate, non-blocking)

The "Session abort reasons" and "Compression invalid-output classes" tiles on dashboard 1739781 will lose the poisoned slice as installs update — expected. No tile change needed; the decay itself is the confirmation signal.

6.2 KiB Raw Blame History Unescape Escape