mirror of
https://github.com/thedotmack/claude-mem.git
synced 2026-07-03 12:32:32 +08:00
feat(telemetry): PostHog historical backfill — anonymized daily rollups + inferred install date (#2912)
* feat(telemetry): add PostHog historical backfill module (Phase 1) One-time anonymized daily-rollup backfill: collectDailyRollups with pinned aggregation semantics, sessions-first install-day inference, deterministic v5 event uuids, 9-step gated transport (marker/consent/debug gates, historicalMigration client, marker only on clean shutdown). Whitelist additions in scrub.ts/common.ts, extended PostHog test mock, full (a)-(k) test coverage. Per plans/2026-06-11-posthog-historical-backfill.md. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat(telemetry): wire historical backfill into worker startup + docs disclosure Fire-and-forget runHistoricalBackfill call in initializeBackground after the worker_started capture (non-blocking, marker-gated one-shot, retries on next start if delivery fails). New "Historical backfill" section in docs/public/telemetry.mdx documenting what is sent, the once-per-install marker, identical consent gates, and upload-time geo caveat. Phases 3-4 of plans/2026-06-11-posthog-historical-backfill.md. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * docs(telemetry): document dual-layer error handling on backfill call site runHistoricalBackfill never rejects by contract; the .catch is the unhandled-rejection backstop for the fire-and-forget call. Addresses Greptile review on #2912. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> --------- Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
This commit is contained in:
502
plans/2026-06-11-posthog-historical-backfill.md
Normal file
502
plans/2026-06-11-posthog-historical-backfill.md
Normal file
@@ -0,0 +1,502 @@
|
||||
# PostHog Historical Backfill — Anonymized Observation Metadata Import
|
||||
|
||||
**Goal**: One-time, per-install backfill of anonymized daily activity rollups from the local
|
||||
SQLite DB into PostHog, using PostHog's historical-migration ingestion mode, so growth
|
||||
(installs over time, reconstructed WAU/MAU, cohort retention) is visible for activity that
|
||||
predates telemetry shipping.
|
||||
|
||||
**What gets sent** (anonymous counts/sums only — never titles, text, prompts, project names,
|
||||
or any raw string column):
|
||||
- One `historical_activity` event per active day per install, profile-less
|
||||
(`$process_person_profile: false`), carrying ONLY rollup counters + `backfilled: true`
|
||||
(no `buildBaseProperties()` — stamping the *current* version/os onto 2025-dated events
|
||||
would permanently poison version-over-time charts):
|
||||
`{ observation_count, session_count, summary_count, prompt_count, project_count,
|
||||
discovery_tokens, obs_type_bugfix, obs_type_discovery, obs_type_decision,
|
||||
obs_type_refactor, obs_type_other, session_completed_count, session_failed_count,
|
||||
sessions_claude_count, sessions_codex_count, sessions_gemini_count,
|
||||
sessions_other_platform_count, subagent_obs_count, backfilled: true }`.
|
||||
- One `install_inferred` event at noon UTC of the install day, `person: true`, with
|
||||
`first_active_date` in properties and `$set` — this single event draws the adoption curve.
|
||||
|
||||
**Known caveats (accept, do not solve)**:
|
||||
- Survivorship bias — only currently-installed, telemetry-consenting users backfill. The
|
||||
curve shows the history of the *retained* base.
|
||||
- The last ~2.5 days before each install's first post-upgrade worker start are never
|
||||
backfilled (PostHog's 48h rule + whole-day windowing). Do NOT "fix" this by re-running
|
||||
without the marker — duplicates are worse than the gap.
|
||||
- Geo properties on backfilled events reflect location at *upload* time, not the historical
|
||||
date (`disableGeoip: false` kept for consistency with live telemetry).
|
||||
- Dashboards slicing by `version` must filter `backfilled != true`; combined install metrics
|
||||
must dedupe `install_inferred` ∪ `install_completed(is_update=false)` per distinct_id
|
||||
(telemetry-era installs emit both).
|
||||
- Event-style live metrics (searches, injections, compression spend) are intentionally absent
|
||||
from backfill — they are not stored locally and cannot be reconstructed.
|
||||
|
||||
---
|
||||
|
||||
## Phase 0: Documentation Discovery — COMPLETE (findings consolidated below)
|
||||
|
||||
All implementation phases MUST use only the APIs listed here. Citations were verified against
|
||||
the working tree, the published posthog-node v5.36.15 source, and the real
|
||||
`~/.claude-mem/claude-mem.db` (read-only) on 2026-06-11, then adversarially re-verified on
|
||||
2026-06-12 (see Review log at the bottom).
|
||||
|
||||
**Precondition**: this worktree has no `node_modules` — run `bun install` (or `npm install`)
|
||||
before Phase 1 so imports and `bun test` resolve.
|
||||
|
||||
### Allowed APIs
|
||||
|
||||
**posthog-node ^5.36.15** (root `package.json:143`, imported in
|
||||
`src/services/telemetry/telemetry.ts:1`; caret range — if the installed major ever drifts,
|
||||
re-verify `historicalMigration` forwarding in `dist/`, search for `historical_migration`):
|
||||
- `new PostHog(apiKey, options)` — `PostHogOptions = Omit<PostHogCoreOptions, 'before_send'> & {...}`.
|
||||
`PostHogCoreOptions` **includes** `historicalMigration?: boolean` ("Special flag to indicate
|
||||
ingested data is for a historical migration", default false). It is forwarded as
|
||||
`historical_migration: true` on every `/batch` request body (verified in
|
||||
`@posthog/core@1.32.1` shipped source). Also valid: `host`, `flushAt`, `maxBatchSize`,
|
||||
`maxQueueSize`, `disableGeoip`.
|
||||
- `client.capture(msg: EventMessage)` — `EventMessage` includes `distinctId`, `event`,
|
||||
`properties`, **`timestamp?: Date`** (a `Date` object, NOT an ISO string), **`uuid?: string`**.
|
||||
Both are forwarded end-to-end to the payload.
|
||||
- **capture() enqueues asynchronously** (a multi-microtask promise chain runs before the event
|
||||
reaches the queue). A bare `await client.flush()` does NOT join those pending captures and
|
||||
can resolve while events are still un-enqueued. Only `client.shutdown()` joins pending
|
||||
capture promises and then loops flush until the queue is empty — **shutdown() is the only
|
||||
delivery barrier**.
|
||||
- `shutdown()` SWALLOWS fetch errors internally (`logFlushError`); its resolution proves
|
||||
nothing about delivery. Delivery failure is observable only via the public error emitter:
|
||||
`client.on('error', handler)` (`@posthog/core` `posthog-core-stateless.ts:301`).
|
||||
- The in-memory queue silently drops the OLDEST event past `maxQueueSize` (default 1,000).
|
||||
Multi-year installs can exceed 1,000 active days — `maxQueueSize` must be raised.
|
||||
- Background flushes fire automatically when the queue reaches `flushAt`; their errors are
|
||||
swallowed AND a non-network (4xx) failure removes the batch from the queue before throwing.
|
||||
Therefore: set `flushAt`/`maxBatchSize`/`maxQueueSize` all to 5000 so NO background flush
|
||||
ever fires and the entire backfill goes as one request at shutdown (the SDK auto-halves the
|
||||
batch on HTTP 413). One request also makes a cross-restart retry byte-identical — the best
|
||||
possible dedupe-key match.
|
||||
|
||||
**PostHog historical-migration rules** (https://posthog.com/docs/migrate):
|
||||
- Set `historicalMigration: true` so events bypass standard ingestion/billing handling.
|
||||
- Event timestamps must be **at least 48 hours in the past**. Server behavior for violating
|
||||
events is undocumented (rejection, billing, silent acceptance all possible) — the
|
||||
client-side day window is the only guard, and it applies to **every** event including
|
||||
`install_inferred`.
|
||||
- "There is no way to selectively delete event data in PostHog" — idempotency is mandatory
|
||||
before anything ships (deterministic `uuid` + completion marker).
|
||||
- uuid dedupe (https://posthog.com/docs/data/events) is **eventual and best-effort**
|
||||
(ClickHouse merge-time, not query-time) and keyed on
|
||||
`(toDate(timestamp), event, distinct_id, uuid)` — retried events must carry byte-identical
|
||||
timestamp+uuid+event+distinctId. The marker (not the uuid) is the PRIMARY idempotency gate;
|
||||
the uuid minimizes damage in the crash-retry window. Residual accepted risk: a crash between
|
||||
server-side ingest and marker write can leave transient duplicates until ClickHouse merges.
|
||||
|
||||
**Existing telemetry modules to copy from** (do not reinvent):
|
||||
- Consent: `resolveTelemetryConsent(process.env, loadTelemetryConfig())` —
|
||||
`src/services/telemetry/consent.ts:68-73`. Install UUID: `getOrCreateInstallId()` —
|
||||
`consent.ts:114-124` (random v4 UUID persisted to `telemetry.json` via `getTelemetryConfigPath()`).
|
||||
- Scrub: `scrubProperties(props)` whitelist scrubber — `src/services/telemetry/scrub.ts:126-149`;
|
||||
whitelist set `ALLOWED_PROPERTY_KEYS` at `scrub.ts:8-117`. **Properties not in the whitelist
|
||||
are silently dropped** — new keys MUST be added there. Already present (verified): the
|
||||
`obs_type_*` family (live `context_injected` ships them), `session_count`,
|
||||
`observation_count`. Confirmed ABSENT (must add): `discovery_tokens`.
|
||||
- Key/host: `getTelemetryApiKey()`, `getTelemetryHost()` — `src/services/telemetry/common.ts:22-28`.
|
||||
Note: `getTelemetryApiKey()` falls back to the **embedded production key** and is never
|
||||
falsy — every worker boot anywhere can send real data (see Phase 2/3 sequencing).
|
||||
Base props: `buildBaseProperties()` — `common.ts:100-113` (returns CURRENT version,
|
||||
os_version, runtime_version, locale, is_ci — historical events must NOT carry it).
|
||||
Person-prop subset: `PERSON_PROPERTY_KEYS` + `buildPersonSet()` — `common.ts:36-76`.
|
||||
`buildPersonSet` only copies keys PRESENT on the event's properties — a person trait that is
|
||||
never assigned to the event silently never ships.
|
||||
- Capture-path conventions (consent gate → scrub → debug-mode stderr print → no-key no-op →
|
||||
capture; swallow all errors): `captureEvent()` — `src/services/telemetry/telemetry.ts:72-117`.
|
||||
- DB-reading telemetry pattern: `collectInstallStats(db: Database)` —
|
||||
`src/services/telemetry/install-stats.ts:29-99`. Per-block try/catch; typed `.get()` casts.
|
||||
Keep the per-block try/catch in the rollup too: older installs may lack a table or column —
|
||||
skip that block's keys, never throw.
|
||||
- Epoch normalization (legacy rows store **seconds**, newer rows milliseconds — verified
|
||||
against a real DB where 273 rows render as 1970 without it):
|
||||
`asMs(col)` → `` `CASE WHEN ${col} < 1000000000000 THEN ${col} * 1000 ELSE ${col} END` `` —
|
||||
`install-stats.ts:23-25`. `DAY_MS = 86_400_000` — `install-stats.ts:27`. Note `MIN(asMs(x))`
|
||||
applies normalization INSIDE the MIN (correct); `asMs(MIN(x))` would not be.
|
||||
- Deterministic UUID: **`Bun.randomUUIDv5(name, namespace)`** — exists and is deterministic on
|
||||
the worker's embedded Bun 1.3.9 (verified by execution). Use it with one fixed namespace
|
||||
UUID constant in the module. The npm `uuid` package is not installed and must not be added.
|
||||
- Discovery-token storage semantics (load-bearing for aggregation):
|
||||
`src/services/sqlite/SessionStore.ts:1901-2015` — ONE `discoveryTokens` value per
|
||||
compression turn is written identically to EVERY observation row of that turn (line 1962)
|
||||
AND to the turn's `session_summaries` row (line 2004). Summing across observations
|
||||
multi-counts by the obs-per-turn factor. **Sum `session_summaries.discovery_tokens` only.**
|
||||
|
||||
**Database access** (`bun:sqlite`, synchronous):
|
||||
- `import { Database } from 'bun:sqlite'` — pattern at `src/services/sqlite/SessionStore.ts:1-2`.
|
||||
- Query: `db.query(sql).get(...)` / `.all(...)` with `as` type casts —
|
||||
e.g. `src/services/worker-service.ts:508-513`, `install-stats.ts:36-50`.
|
||||
- Relevant columns: base schema in `src/services/sqlite/migrations/runner.ts:43-112` —
|
||||
`observations(project, created_at_epoch, memory_session_id, type, agent_type)`,
|
||||
`sdk_sessions(project, started_at_epoch, memory_session_id, status, platform_source)`,
|
||||
`session_summaries(project, created_at_epoch)`, `user_prompts(created_at_epoch)`
|
||||
(counts only — NEVER select `prompt_text`). **`discovery_tokens` is NOT in 43-112**: it is
|
||||
added by `ensureDiscoveryTokensColumn` at `runner.ts:381-402`
|
||||
(`ALTER TABLE ... ADD COLUMN discovery_tokens INTEGER DEFAULT 0` on both observations and
|
||||
session_summaries). Query only columns created in runner.ts — the dev DB has extra columns
|
||||
other installs lack.
|
||||
- Observation `type` buckets: use the closed `STAT_TYPE_BUCKETS` set from
|
||||
`src/services/context/ContextBuilder.ts` (bugfix/discovery/decision/refactor/other) so the
|
||||
backfill vocabulary is identical to live `context_injected`.
|
||||
- `platform_source` is a user-influenceable string — bucket in JS to the closed enum
|
||||
{claude, codex, gemini, other}; never ship the raw value.
|
||||
|
||||
**Worker integration points** (`src/services/worker-service.ts`):
|
||||
- `worker_started` capture: lines 532-540 (inside `initializeBackground()`).
|
||||
- Fire-and-forget background-task precedent to copy:
|
||||
`ChromaSync.backfillAllProjects(...)` `.then/.catch` block at lines 551-556.
|
||||
- Daily heartbeat precedent: `setInterval(..., 24*60*60*1000)` + `.unref?.()` at lines 544-547.
|
||||
- Shutdown order: `worker_stopped` capture BEFORE `shutdownTelemetry()` (lines 699-703).
|
||||
|
||||
**State files in `~/.claude-mem`**:
|
||||
- Path root: `DATA_DIR` / `paths.dataDir()` — `src/shared/paths.ts:40, 129-151`.
|
||||
- Read with `readJsonSafe<T>(path, fallback)` from `src/utils/json-utils.js` (as
|
||||
`consent.ts:5` imports it). **There is no JSON-write helper** — write with
|
||||
`mkdirSync` + `writeFileSync` exactly as `saveTelemetryConfig` does at `consent.ts:103-107`.
|
||||
- The marker is its own file `backfill.json` — do NOT merge it into `telemetry.json`
|
||||
(the consent save path would clobber it).
|
||||
|
||||
**Logger** (console.* is forbidden in services — enforced by
|
||||
`scripts/check-hook-io-discipline.cjs` / `tests/logger-usage-standards.test.ts`):
|
||||
- `import { logger } from '../utils/logger.js'`; `logger.info('SYSTEM', msg, ctx, err?)`.
|
||||
`'TELEMETRY'` is NOT in the `Component` union (`src/utils/logger.ts:15-52`) — use `'SYSTEM'`.
|
||||
|
||||
**Tests** (`bun test`; global posthog-node mock):
|
||||
- Global mock of `posthog-node` in `tests/preload.ts:45-55` records
|
||||
`postHogConstructorCalls` / `postHogCaptureCalls` — extend, don't replace. The mock class
|
||||
has NO `flush()`, `shutdown()` race-compatible signature, or `on()` — add no-op
|
||||
`flush()`/`shutdown()`/`on()` to it.
|
||||
- State reset: `__resetTelemetryForTests()` — `src/services/telemetry/telemetry.ts:125-129`.
|
||||
- Env/temp-dir isolation pattern: `tests/telemetry/telemetry-client.test.ts:30-53`
|
||||
(`CLAUDE_MEM_DATA_DIR = mkdtempSync(...)`).
|
||||
- In-memory DB schema pattern: `tests/telemetry/install-stats.test.ts:8-28` — but its `makeDb`
|
||||
schema LACKS `discovery_tokens`, `memory_session_id`, `type`, `agent_type`, `status`,
|
||||
`platform_source`, and the `user_prompts` table. Extend the test schema with every column
|
||||
the rollup queries touch before writing tests, or the queries throw `no such column` (and
|
||||
the per-block try/catch would mask it as silently-empty rollups).
|
||||
|
||||
### Anti-patterns (DO NOT)
|
||||
|
||||
- ❌ `historical_migration` (snake_case) as a posthog-node constructor option — the SDK option
|
||||
is **`historicalMigration`** (camelCase). Snake_case is only for the raw `/batch` HTTP API.
|
||||
- ❌ Passing `timestamp` as an ISO string to `client.capture()` — `EventMessage.timestamp` is
|
||||
typed `Date`. Construct `new Date(...)`.
|
||||
- ❌ Reusing the live singleton client in `telemetry.ts` for backfill — it lacks
|
||||
`historicalMigration` and its `isShutdown` latch must stay untouched. Build a dedicated,
|
||||
short-lived client.
|
||||
- ❌ Sending properties without adding them to `ALLOWED_PROPERTY_KEYS` — the scrubber drops
|
||||
them silently and the backfill would ship empty events.
|
||||
- ❌ Raw `created_at_epoch` / `started_at_epoch` math without `asMs()` — legacy second-unit
|
||||
rows land in 1970 and poison the earliest cohort. Additionally apply the project-epoch
|
||||
floor (below): corrupt epochs can also land on plausible-looking wrong days, not just 1970.
|
||||
- ❌ Filtering rollup rows by raw epoch against a cutoff instant — that ships a permanently
|
||||
truncated PARTIAL day. Always include whole UTC day buckets only (window rule in Phase 1).
|
||||
- ❌ Summing `discovery_tokens` across observations AND session_summaries — the same per-turn
|
||||
value is stored on every row of the turn (see Phase 0); sum summaries only.
|
||||
- ❌ `buildBaseProperties()` on `historical_activity` events — current version/os on
|
||||
2025-dated events is permanently wrong version-over-time data.
|
||||
- ❌ The npm `uuid` package — use `Bun.randomUUIDv5` (verified present and deterministic).
|
||||
- ❌ `console.*` anywhere in `src/services/` — use `logger` (debug-mode payload printing is the
|
||||
one exception and must use `process.stderr.write`, copying `telemetry.ts:97-103`).
|
||||
- ❌ Writing the completion marker before delivery is confirmed — and "delivery confirmed"
|
||||
means `await client.shutdown()` resolved AND zero `client.on('error')` events, NOT a bare
|
||||
`flush()` (see Phase 0 SDK facts). A crash mid-send must retry on next startup.
|
||||
- ❌ Booting the worker on the dev machine during Phases 1–3 without
|
||||
`CLAUDE_MEM_TELEMETRY=0` (or `CLAUDE_MEM_TELEMETRY_DEBUG=1`) exported — the embedded
|
||||
production key + default-on consent means a casual `npm run build-and-sync` ships the dev
|
||||
machine's entire history to production PostHog before the dry-run gate.
|
||||
|
||||
---
|
||||
|
||||
## Phase 1: Backfill module + tests (rollups, events, transport, marker)
|
||||
|
||||
**Create `src/services/telemetry/backfill.ts`** (one module, one test file).
|
||||
|
||||
### 1.1 Day window (one rule, used everywhere)
|
||||
|
||||
- `lastFullDay = utcDayString(nowMs - 60 * 3_600_000)` — 60h = 48h (PostHog contract) + 12h
|
||||
(noon-UTC event timestamps). Noon of any included day is then guaranteed ≥48h old.
|
||||
- `PROJECT_EPOCH_FLOOR = Date.parse('2024-01-01T00:00:00Z')` — predates claude-mem's first
|
||||
release; rows with normalized epoch below it are corrupt and ignored everywhere (rollups
|
||||
AND first-activity MIN).
|
||||
- `installDay = utcDayString(firstActivityEpochMs)` (1.3). Include only whole UTC days where
|
||||
`installDay <= day <= lastFullDay`, comparing **day strings** (YYYY-MM-DD compares
|
||||
lexicographically) — never raw epochs, so no partial days can ship. The lower bound discards
|
||||
backdated artifact rows (verified on the reference DB: obs ids 66888/66889 carry a
|
||||
2025-08-12 epoch but belong to a session started 2026-04-10).
|
||||
|
||||
### 1.2 `collectDailyRollups(db: Database, lastFullDay: string, installDay: string): DailyRollup[]`
|
||||
|
||||
Copy the query style of `collectInstallStats` (per-block try/catch — a missing table/column
|
||||
skips that block's keys), bucketing every table by
|
||||
`date(<asMs(col)>/1000,'unixepoch') AS day`, merged in a `Map<day, rollup>`. Pinned
|
||||
semantics — these exact aggregations, so any two implementers ship identical numbers:
|
||||
|
||||
| key | source (per day) |
|
||||
|---|---|
|
||||
| `observation_count` | `COUNT(*)` FROM observations |
|
||||
| `obs_type_bugfix/discovery/decision/refactor/other` | observations `GROUP BY day, type`, bucketed via `STAT_TYPE_BUCKETS` |
|
||||
| `subagent_obs_count` | `COUNT(*)` FROM observations WHERE `agent_type IS NOT NULL` |
|
||||
| `session_count` | `COUNT(*)` FROM sdk_sessions **only** (do NOT add observations' distinct memory_session_id — same sessions, double count) |
|
||||
| `session_completed_count` / `session_failed_count` | sdk_sessions `GROUP BY day, status` (closed enum) |
|
||||
| `sessions_claude_count` / `sessions_codex_count` / `sessions_gemini_count` / `sessions_other_platform_count` | sdk_sessions `platform_source`, bucketed in JS to the closed enum |
|
||||
| `summary_count` | `COUNT(*)` FROM session_summaries |
|
||||
| `discovery_tokens` | `SUM(discovery_tokens)` FROM **session_summaries only** (per-turn cost — see Phase 0 storage semantics) |
|
||||
| `prompt_count` | `COUNT(*)` FROM user_prompts (count only — never `prompt_text`) |
|
||||
| `project_count` | `COUNT(DISTINCT project)` over `(SELECT day, project FROM observations UNION SELECT day, project FROM sdk_sessions)` — cross-table distinct in ONE query; never sum per-table distincts (triple-counts the same project) |
|
||||
|
||||
Omitted on purpose: session durations (verified dirty — 22% of completed sessions show >24h
|
||||
stale spans), dev-only columns not in runner.ts, anything string-valued.
|
||||
|
||||
### 1.3 `findFirstActivityEpochMs(db: Database): number | null`
|
||||
|
||||
`SELECT MIN(asMs(started_at_epoch)) FROM sdk_sessions WHERE asMs(started_at_epoch) >= FLOOR`
|
||||
(copy `install-stats.ts:56-61` — note asMs INSIDE the MIN), falling back to the observations
|
||||
MIN **only if sdk_sessions is empty**. Keep sessions-first: session timestamps are write-time
|
||||
and trustworthy, while observation epochs can be backdated artifacts (verified — see 1.1);
|
||||
a cross-table MIN would bake an artifact date into undeletable data.
|
||||
|
||||
### 1.4 `deterministicEventUuid(installId, event, day): string`
|
||||
|
||||
`Bun.randomUUIDv5(`${installId}|${event}|${day}`, BACKFILL_NAMESPACE)` where
|
||||
`BACKFILL_NAMESPACE` is one fixed UUID constant in the module. One line; deterministic;
|
||||
no hand-rolled hashing.
|
||||
|
||||
### 1.5 `buildBackfillEvents(db, installId, nowMs): EventMessage-like[]`
|
||||
|
||||
Pure assembly:
|
||||
- Each rollup → `historical_activity` with `timestamp: new Date(day + 'T12:00:00Z')` and the
|
||||
deterministic uuid. Noon UTC is load-bearing TWICE: it keeps the event inside its day for
|
||||
dashboard timezones in UTC-12..+11 (keep the PostHog project timezone on UTC), and it makes
|
||||
the timestamp retry-stable, which the dedupe key requires — do not "simplify" to a
|
||||
non-deterministic timestamp. Properties: the rollup counters + `backfilled: true`, passed
|
||||
through `scrubProperties(...)`, then `$process_person_profile: false`.
|
||||
**No `buildBaseProperties()`** (see anti-patterns).
|
||||
- One `install_inferred` with `timestamp: new Date(installDay + 'T12:00:00Z')` (noon UTC —
|
||||
retry-stable even if the sessions/observations MIN source flips between runs) and uuid keyed
|
||||
on day `'install'`. Properties: `{ ...buildBaseProperties(), first_active_date: installDay,
|
||||
backfilled: true }` through `scrubProperties`, then `$set = buildPersonSet(props)`
|
||||
(copy `telemetry.ts:91-95`). `first_active_date` MUST be assigned here — `buildPersonSet`
|
||||
only copies keys present on the event, and a whitelisted-but-never-assigned trait silently
|
||||
never ships. Base props are fine on this one person-event ($set = current person state).
|
||||
- If `installDay > lastFullDay` (install younger than ~60h): return `[]`. Such installs have
|
||||
live telemetry coverage for their entire life — there is no pre-telemetry history to
|
||||
reconstruct, and shipping a <48h timestamp violates the migration contract.
|
||||
|
||||
### 1.6 Whitelist additions
|
||||
|
||||
Add to `ALLOWED_PROPERTY_KEYS` (`scrub.ts:8-117`) — verify each against the file first:
|
||||
`discovery_tokens` (confirmed absent), `summary_count`, `prompt_count`, `project_count`,
|
||||
`backfilled`, `first_active_date`, `session_completed_count`, `session_failed_count`,
|
||||
`sessions_claude_count`, `sessions_codex_count`, `sessions_gemini_count`,
|
||||
`sessions_other_platform_count`, `subagent_obs_count`.
|
||||
Already whitelisted — reuse, do not re-add: `observation_count`, `session_count`,
|
||||
the `obs_type_*` family. (`session_count`/`observation_count` carry different semantics on
|
||||
live `context_injected` — acceptable: PostHog properties are filtered per-event in practice;
|
||||
note the collision in the PR description.)
|
||||
Add `first_active_date` to `PERSON_PROPERTY_KEYS` (`common.ts:36-60`).
|
||||
|
||||
### 1.7 Marker + transport: `runHistoricalBackfill(db: Database): Promise<void>`
|
||||
|
||||
Marker file `~/.claude-mem/backfill.json` at `join(dataDir, 'backfill.json')` (mirror
|
||||
`getTelemetryConfigPath()`, `consent.ts:76-78`). Shape:
|
||||
`{ completedAt: string, throughDay: string, eventCount: number, installId: string }`.
|
||||
Read with `readJsonSafe`; write with `mkdirSync` + `writeFileSync` (as
|
||||
`saveTelemetryConfig`, `consent.ts:103-107`).
|
||||
|
||||
Gate sequence (ORDER MATTERS — debug must precede every marker write):
|
||||
1. Marker exists → return (idempotency gate #1).
|
||||
2. `resolveTelemetryConsent(process.env, loadTelemetryConfig())` false → return
|
||||
**without writing the marker** (a later opt-in still backfills).
|
||||
3. Build events via 1.5.
|
||||
4. `CLAUDE_MEM_TELEMETRY_DEBUG === '1'` → `process.stderr.write` one summary line
|
||||
(event count + day range) then one line per event (copy `telemetry.ts:97-103`), do NOT
|
||||
send, do NOT write marker — even when the event list is empty. This dry-run intentionally
|
||||
re-runs on every debug-mode worker start; the marker must never latch from debug mode.
|
||||
(Only the exact value `'1'` activates it, matching `captureEvent`.)
|
||||
5. Zero events → write marker, return. (Fresh installs land here: nothing pre-telemetry
|
||||
exists and live `install_completed`/daily events cover them from day 0 — `install_inferred`
|
||||
is intentionally NOT emitted for them.)
|
||||
6. `!getTelemetryApiKey()` → return without marker. (Vestigial — the embedded key makes this
|
||||
unreachable; keep the one-liner for symmetry with `captureEvent`, write no test for it.)
|
||||
7. Dedicated client:
|
||||
`new PostHog(getTelemetryApiKey(), { host: getTelemetryHost(), historicalMigration: true,
|
||||
flushAt: 5000, maxBatchSize: 5000, maxQueueSize: 5000, disableGeoip: false })` — the 5000s
|
||||
guarantee a single batch, no swallowed background flushes, and no silent queue-cap drops
|
||||
(see Phase 0 SDK facts).
|
||||
8. `const errors: unknown[] = []; client.on('error', e => errors.push(e))`, then
|
||||
`client.capture({ distinctId: getOrCreateInstallId(), event, properties, timestamp, uuid })`
|
||||
per event, then `await client.shutdown()` — NO separate `flush()` call (it is not a
|
||||
delivery barrier) and NO 3s race (this runs fire-and-forget in the background; the SDK's
|
||||
default shutdown timeout is fine).
|
||||
9. Write the marker ONLY if shutdown resolved AND `errors.length === 0`. Wrap the whole body
|
||||
in try/catch that logs via `logger` (`'SYSTEM'`) and never throws (telemetry must never
|
||||
break the worker — `telemetry.ts:114-116`).
|
||||
|
||||
**Verification checklist**:
|
||||
- [ ] New test `tests/telemetry/backfill.test.ts` using the `:memory:` pattern from
|
||||
`tests/telemetry/install-stats.test.ts:8-28` with the schema EXTENDED per Phase 0
|
||||
(discovery_tokens, memory_session_id, type, agent_type, status, platform_source,
|
||||
user_prompts table). Cases:
|
||||
(a) mixed second/ms epochs land on the correct day (insert `created_at_epoch =
|
||||
1755000000` seconds; assert no 1970 day);
|
||||
(b) day window: a row 47h old is excluded AND no partial day ever ships (whole-day
|
||||
buckets only); a row before `PROJECT_EPOCH_FLOOR` or before `installDay` produces no day;
|
||||
(c) `deterministicEventUuid` stable across calls, general UUID shape (it is a v5 — do
|
||||
not assert a v4 version nibble);
|
||||
(d) every property in built events survives `scrubProperties` (no silent drops);
|
||||
(e) empty DB → zero events, no throw;
|
||||
(f) discovery_tokens dedupe: one turn = 3 observations + 1 summary all carrying 100 →
|
||||
day total **100**, not 400;
|
||||
(g) one session + one observation in it, same day → `session_count` 1 and
|
||||
`project_count` 1 (no double counting);
|
||||
(h) `install_inferred` uses the sessions MIN, is stamped noon UTC of `installDay`, and
|
||||
its `$set` contains `first_active_date`;
|
||||
(i) first activity 1h ago → zero events;
|
||||
(j) consent-off → zero `postHogCaptureCalls`, no marker; marker present → zero calls;
|
||||
debug mode → zero calls, no marker **including on an empty DB**; second invocation after
|
||||
success → zero calls;
|
||||
(k) happy path → constructor received `historicalMigration: true`, every capture has
|
||||
`uuid` + `Date` timestamp, marker written with correct `throughDay`; an `error` emitted
|
||||
via the mock's `on` handler → NO marker.
|
||||
- [ ] `bun test tests/telemetry/` passes.
|
||||
|
||||
**Anti-pattern guards**: no marker write reachable before the debug gate; sum
|
||||
summaries-only for discovery_tokens; day-string windowing only; no `buildBaseProperties` on
|
||||
`historical_activity`; no `console.*`; no raw epoch math without `asMs`.
|
||||
|
||||
---
|
||||
|
||||
## Phase 2: Live dry-run against the real DB — THE SHIP/NO-SHIP GATE
|
||||
|
||||
This phase MUST complete before Phase 3 wires anything into the worker, because PostHog data
|
||||
cannot be selectively deleted and `getTelemetryApiKey()` always returns the embedded
|
||||
production key.
|
||||
|
||||
1. Run `runHistoricalBackfill` against the real `~/.claude-mem/claude-mem.db` with
|
||||
`CLAUDE_MEM_TELEMETRY_DEBUG=1` (small runner script or one-off test), and eyeball the
|
||||
stderr payload dump:
|
||||
- First day = **2025-10-19** for this machine (NOT Aug 2025 — the two 2025-08-12 rows are
|
||||
verified backdated artifacts and must be absent thanks to the installDay clamp).
|
||||
- No day newer than `lastFullDay` (≈ T-2.5 days), **no 1970 days**, no day before
|
||||
`install_inferred`'s `first_active_date`.
|
||||
- Plausible counts (~215 active days, ~93k total observations on this machine).
|
||||
- Exactly one `install_inferred`, `first_active_date: 2025-10-19`, noon-UTC timestamp.
|
||||
- No string property that looks like user content; every property is a number, boolean, or
|
||||
the single `first_active_date` date string.
|
||||
2. Repeat runs are expected and safe (debug mode never writes the marker). If a non-debug test
|
||||
run ever latches the marker locally, the reset is `rm ~/.claude-mem/backfill.json`.
|
||||
|
||||
---
|
||||
|
||||
## Phase 3: Worker wiring + telemetry docs disclosure
|
||||
|
||||
**Export `CLAUDE_MEM_TELEMETRY=0` (or `CLAUDE_MEM_TELEMETRY_DEBUG=1`) in the dev shell for
|
||||
every worker boot in this phase** — `npm run build-and-sync` restarts the worker, and an
|
||||
unguarded boot performs the real production send from the dev machine.
|
||||
|
||||
1. **Wire into startup**: in `src/services/worker-service.ts`, immediately after the
|
||||
`worker_started` capture (lines 532-540) and alongside the ChromaSync fire-and-forget
|
||||
precedent (lines 551-556), add:
|
||||
|
||||
```typescript
|
||||
runHistoricalBackfill(this.dbManager.getConnection()).catch(error => {
|
||||
logger.error('SYSTEM', 'Telemetry historical backfill failed (non-blocking)', {}, error as Error);
|
||||
});
|
||||
```
|
||||
|
||||
Non-blocking, after core init. Do not add it to the heartbeat — it is one-shot by marker
|
||||
(a failed run retries on the NEXT worker start because no marker was written).
|
||||
|
||||
2. **Docs**: update `docs/public/telemetry.mdx` — add a short "Historical backfill" section:
|
||||
what is sent (daily anonymous counts + inferred install date), that it runs once, that it
|
||||
honors the same consent gates (`DO_NOT_TRACK`, `CLAUDE_MEM_TELEMETRY=0`, `telemetry.json`),
|
||||
that opting out before first worker start after upgrade prevents it entirely, and that geo
|
||||
properties on backfilled events reflect upload-time location. Follow the existing page's
|
||||
tone/structure (read it first).
|
||||
|
||||
**Verification checklist**:
|
||||
- [ ] `bun test` (full suite) passes — especially `tests/logger-usage-standards.test.ts`.
|
||||
- [ ] `grep -rn "runHistoricalBackfill" src/` shows exactly two hits: definition + the one
|
||||
worker-service call site.
|
||||
- [ ] Worker boots clean **with telemetry disabled in the shell**: `CLAUDE_MEM_TELEMETRY=0
|
||||
npm run build-and-sync`, then confirm via worker log (`~/.claude-mem/logs/`) that
|
||||
startup completes and no backfill error is logged.
|
||||
|
||||
**Anti-pattern guards**: do not block `initializeBackground()` on the backfill promise; do not
|
||||
capture `worker_stopped`-style events from inside backfill; no unguarded dev-machine boots.
|
||||
|
||||
---
|
||||
|
||||
## Phase 4: Final Verification
|
||||
|
||||
1. **Anti-pattern greps** (all must return nothing):
|
||||
- `grep -rn "historical_migration" src/` (wrong spelling for SDK path)
|
||||
- `grep -rn "console\." src/services/telemetry/backfill.ts`
|
||||
- `grep -rn "from 'uuid'" src/`
|
||||
- `grep -n "buildBaseProperties" src/services/telemetry/backfill.ts` → must appear ONLY in
|
||||
the install_inferred builder, never for historical_activity.
|
||||
2. **Whitelist proof**: test asserting `scrubProperties(buildBackfillEvents(...)[i].properties)`
|
||||
retains every expected key on both event types.
|
||||
3. **Full suite**: `bun test` green; `npm run build-and-sync` (telemetry-guarded shell)
|
||||
succeeds; worker starts.
|
||||
4. **Re-run the Phase 2 dry-run** one final time on the release build (the marker is still
|
||||
absent on the dev machine if Phase 3 boots were guarded — if not, `rm
|
||||
~/.claude-mem/backfill.json` first).
|
||||
5. **Post-ship validation (manual, in PostHog UI after release)**:
|
||||
- BEFORE trusting reconstructed WAU: build a unique-users trend mixing one person event
|
||||
(`worker_started`) and one profile-less event (`session_compressed`) for a known
|
||||
installId and confirm it counts 1 user — this validates that profile-less
|
||||
`historical_activity` and person `install_inferred` merge as one unique user.
|
||||
- Adoption curve = `install_inferred` ∪ `install_completed(is_update=false)`, deduped per
|
||||
distinct_id (telemetry-era installs emit both — document on the dashboard).
|
||||
- Trend on unique `historical_activity` users by week = reconstructed WAU.
|
||||
- Annotate dashboards: survivorship bias; `version` slicing must filter
|
||||
`backfilled != true`; geo on backfilled events is upload-time.
|
||||
|
||||
---
|
||||
|
||||
## Phase ordering & session boundaries
|
||||
|
||||
Phase 1 (module + tests) → Phase 2 (live dry-run gate, BEFORE any wiring) → Phase 3
|
||||
(wiring + docs, telemetry-guarded) → Phase 4 (verification + post-ship). An executor starting
|
||||
any phase cold should read this file's Phase 0 plus the cited source files for that phase
|
||||
before writing code. Run `bun install` first — the worktree has no `node_modules`.
|
||||
|
||||
---
|
||||
|
||||
## Review log (2026-06-12)
|
||||
|
||||
Adversarially reviewed by a 55-agent workflow (5 dimensions × independent skeptic
|
||||
verification; 26 findings confirmed, 6 downgraded, 1 disputed, 0 refuted). Material changes
|
||||
vs. the first draft:
|
||||
- **Pinned rollup semantics** (session_count from sdk_sessions only; project_count via UNION
|
||||
distinct; discovery_tokens from summaries only — was multi-counted ~Nx per obs-per-turn).
|
||||
- **Day window redefined** (whole-day buckets ≤ `utcDay(now-60h)`; was an ambiguous row-level
|
||||
48h filter that could ship truncated days and <48h timestamps).
|
||||
- **install_inferred**: kept sessions-first MIN (a proposed cross-table MIN would have baked
|
||||
two verified backdated artifact rows — 2025-08-12 epochs written by a 2026-04-10 session —
|
||||
into undeletable data); added installDay clamp on rollups, 60h skip rule, noon-UTC
|
||||
retry-stable timestamp, and explicit `first_active_date` assignment (was whitelisted but
|
||||
never attached to any event).
|
||||
- **Marker gating rebuilt on real SDK semantics**: bare `flush()` is not a delivery barrier
|
||||
and `shutdown()` swallows errors — single-batch config + `on('error')` latch + marker only
|
||||
on clean shutdown. Dropped the 3s race and the dead "flush every 5,000" advice.
|
||||
- **Phase order fixed**: the live dry-run now precedes worker wiring (the embedded prod key +
|
||||
default-on consent meant the old Phase 3 "boot the worker" step performed the real
|
||||
irreversible send before the old Phase 4 dry-run, whose marker then made that dry-run a
|
||||
silent no-op).
|
||||
- **More usage data, same privacy posture**: added prompt_count, obs_type_* breakdown
|
||||
(whitelist keys already exist from live telemetry), session outcome counts, platform
|
||||
buckets, subagent count; project_count now includes session-only days.
|
||||
- **Simplifications**: Bun.randomUUIDv5 one-liner replaces hand-rolled sha256 nibble-forcing
|
||||
(the "invented API" anti-pattern claim was false — verified on Bun 1.3.9); deleted dead
|
||||
whitelist keys (backfill_days, backfill_events); old Phases 1+2 merged; debug gate moved
|
||||
ahead of all marker writes; corrected citations (discovery_tokens lives in a migration at
|
||||
runner.ts:381-402, not the base schema; there is no JSON-write helper).
|
||||
Reference in New Issue
Block a user