mirror of
https://github.com/thedotmack/claude-mem.git
synced 2026-07-03 12:32:32 +08:00
* feat(telemetry): carry observation volume on rollups so cache-value survives migration The context-cache-value, per-user-savings, and observation-type-by-model metrics were derivable only from the legacy per-occurrence events (context_injected, session_compressed), which decay to zero as the fleet upgrades to 13.7.0 and switches to the rollups. The rollups already received the underlying records — they just didn't aggregate the observation fields. - observer_turn_rollup: add observations_created (Σ per-turn observation count, distinct from the rollup's turn `count`) + summed obs_type_* buckets, so cost-per-observation (total_cost_usd / observations_created) and observation-type-by-top_model are derivable from the rollup alone. - context_injected_rollup: add total_observations_injected (cache-reuse count) + total_tokens_saved_vs_naive (windowed savings sum). - scrub.ts: whitelist the three new emitted keys (obs_type_* already allowed; deny-by-default whitelist would drop them otherwise). - docs: correct the rollup field tables — the prior context_injected_rollup row documented fields the code never actually emitted. - tests: assert both new aggregations (167 telemetry tests pass). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LEZpnYz9z4TjKcG19qHFrJ * build(telemetry): regenerate plugin bundles for rollup observation fields worker-service.cjs and transcript-watcher.cjs rebuilt via `npm run build` to bundle the new observation aggregation. Incidental, telemetry-unrelated churn in the other service/UI bundles was left out to keep the diff meaningful. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LEZpnYz9z4TjKcG19qHFrJ --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
78 lines
6.2 KiB
Markdown
78 lines
6.2 KiB
Markdown
# TODO (next): make the observer IGNORE output it can't parse — kill the "poisoned" respawn heuristic
|
||
|
||
**Status:** Not started. Spec for a follow-up cleanup pass.
|
||
**Owner principle (from @thedotmack):** *The observer parser should IGNORE any output it can't parse. That is the whole job.* It must NOT inspect the content of unparseable output and act on it.
|
||
|
||
---
|
||
|
||
## The problem
|
||
|
||
When the observer SDK returns something that isn't parseable `<observation>`/`<summary>` XML, the code is supposed to drop it. Instead, a substring classifier labels some of it **`poisoned`** and **kills + respawns the SDK session**. "Poisoned" is decided by a case-insensitive `includes()` against 11 hardcoded English phrases (`session exhausted`, `context window`, `prompt is too long`, `this session has ended`, …) — see `src/sdk/output-classifier.ts:20-32`.
|
||
|
||
This is a brittle keyword match standing in for "ignore what you can't parse," and it over-fires badly.
|
||
|
||
### Evidence it's broken (live PostHog, 30d, project CMEM)
|
||
|
||
`poisoned` aborts as a share of all compression turns, by model:
|
||
|
||
| Model | poisoned % of turns | context |
|
||
|---|---|---|
|
||
| haiku (4.5) | **41.1%** | 200K |
|
||
| claude-haiku-4-5-20251001 | **33.9%** (9.0M turns) | 200K |
|
||
| claude-sonnet-4-5 | 17.7% | 200K–1M |
|
||
| claude-sonnet-4-6 | 9.6% | 200K–1M |
|
||
| claude-opus-4-7 | 5.9% | 200K |
|
||
| gemini-2.5-flash-lite | **0.1%** | 1M |
|
||
| openai/gpt-oss-120b:free | **0.1%** | — |
|
||
| gemini-3-flash-preview / xiaomi mimo-flash:free | **~0%** | small |
|
||
|
||
Total ≈ **13.2M poisoned aborts / 30d ≈ 22% of all turns.**
|
||
|
||
**This is the inverse of real context exhaustion.** If "poisoned" meant running out of room, the *small-context* models would top the list. Instead they're ~0% and the *large-context Claude* models are 6–41%. The classifier is matching Claude's closure-phrase *wording* (e.g. "I cannot continue this session"), not session health. Net effect: claude-mem kills+respawns ~1/3 of Haiku sessions on a keyword false positive, throwing away work and re-spending tokens.
|
||
|
||
(Note: the word "poisoned" is also just wrong — it implies tainted/corrupted input, not a wedged session. The whole concept should go.)
|
||
|
||
---
|
||
|
||
## The fix — "parse XML or drop it," nothing content-based
|
||
|
||
### Where it lives
|
||
- `src/services/worker/agents/ResponseProcessor.ts`
|
||
- `:56-120` — the `!parsed.valid` branch. The **correct behavior already exists** at `:115-120` ("Plain-text skip responses are intentionally ignored" → `confirmClaimedMessages` drops the batch, returns).
|
||
- `:75-77` — the offending trigger:
|
||
```js
|
||
const mustRespawn =
|
||
outputClass === 'poisoned' || // ← remove
|
||
session.consecutiveInvalidOutputs >= INVALID_OUTPUT_RESPAWN_THRESHOLD; // ← see decision
|
||
```
|
||
- `:79-113` — the respawn + telemetry block. `:26` — `INVALID_OUTPUT_RESPAWN_THRESHOLD = 3`.
|
||
- `src/sdk/output-classifier.ts` — `POISONED_MARKERS` (`:20-32`), `poisoned` in the `ObserverOutputClass` union (`:13`), the precedence check (`:67-73`).
|
||
- `src/services/worker/SessionManager.ts` — `respawnPoisonedSession` (`:252`), `session.abortReason = 'poisoned'` (`:273`).
|
||
- `src/services/worker/http/routes/SessionRoutes.ts:37,44` — `abort_reason` enum mapping incl. `poisoned`.
|
||
- `src/services/telemetry/scrub.ts:90-99` — enum doc comments + whitelisted keys `invalid_output_class`, `abort_reason`, `respawn_triggered`, `consecutive_invalid_outputs`.
|
||
- `src/npx-cli/commands/telemetry.ts:71,74` — field docs listing `poisoned`.
|
||
- Tests: `output-classifier` tests + any `ResponseProcessor` respawn tests.
|
||
|
||
### Recommended change: PURE IGNORE (remove ALL content-driven respawn)
|
||
1. **`output-classifier.ts`** — delete `POISONED_MARKERS` and the `poisoned` class. Classifier collapses to `xml` (parseable) vs not. Keep `previewOutput()` for log visibility only — it must never drive behavior.
|
||
2. **`ResponseProcessor.ts`** — remove the whole `mustRespawn`/respawn block (`:71-113`). Unparseable output always falls through to the existing drop path (`:115-120`). Keep the warn-log + `previewOutput` so drops stay visible.
|
||
3. **`SessionManager.respawnPoisonedSession`** — remove if it becomes dead code (confirm no other callers; `respawnPoisonedSession` is referenced from the Phase-2 buffer-flush guards — do NOT flush a rollup for it; just delete the call path cleanly).
|
||
4. **`scrub.ts` / `SessionRoutes.ts` / CLI docs** — drop the `poisoned` enum value and, if respawn goes entirely, the now-unused `respawn_triggered` / `invalid_output_class` / `abort_reason` keys (or keep them whitelisted but unused — safer to keep keys, remove only the `poisoned` value from docs).
|
||
5. **Tests** — delete poison-classification + immediate-respawn tests; add a test asserting unparseable output is dropped (batch confirmed, no respawn) regardless of content.
|
||
|
||
**Why pure-ignore is loop-safe:** the drop path calls `confirmClaimedMessages` — the batch is NOT re-queued, so a genuinely-stuck session can't loop "until quota exhausted." The 3-strikes respawn was guarding a loop that the drop path already prevents.
|
||
|
||
### If a safety net is still wanted (fallback option)
|
||
Keep **only** the content-agnostic `consecutiveInvalidOutputs >= 3` structural respawn; remove just the `outputClass === 'poisoned' ||` clause and the `POISONED_MARKERS` list. This kills the 33–41% false positives but retains a structural circuit-breaker. (Owner leans toward pure-ignore above.)
|
||
|
||
---
|
||
|
||
## Verification
|
||
- `bun test` (output-classifier + ResponseProcessor + telemetry suites) green; `tsc --noEmit` clean.
|
||
- Grep guard: `grep -rin "poison" src` returns nothing behavioral (only history/comments if intentionally kept).
|
||
- Manual: feed the observer non-XML output containing the string "context window" → asserted **ignored/dropped, session NOT respawned**.
|
||
- Post-ship PostHog: `invalid_output_class='poisoned'` / `abort_reason='poisoned'` volume goes to zero on updated installs; Haiku turn throughput should rise (fewer needless respawns).
|
||
|
||
## Dashboard follow-up (separate, non-blocking)
|
||
The "Session abort reasons" and "Compression invalid-output classes" tiles on dashboard 1739781 will lose the `poisoned` slice as installs update — expected. No tile change needed; the decay itself is the confirmation signal.
|