diff --git a/docs/sleep/FINAL_REPORT.md b/docs/sleep/FINAL_REPORT.md
index 00596d6..3ebae06 100644
--- a/docs/sleep/FINAL_REPORT.md
+++ b/docs/sleep/FINAL_REPORT.md
@@ -93,12 +93,24 @@ weak and what changed.
 > *Optimize cheap overnight, deploy anywhere.* A skill is just text, so a good
 > rewrite should help a model it was never optimized on.
 
-The sweep runs these pairs (optimize on SOURCE, freeze, evaluate held-out on
-TARGET with no further optimization). See `benchmark_report.md` / `sweep.jsonl`
-for the auto-generated table once the sweep completes:
+Optimize on SOURCE, **freeze** the learned skill, evaluate held-out on TARGET with
+no further optimization. All four pairs are positive — including **across
+runtimes** (Codex ↔ Claude):
 
-- Haiku → Sonnet, Sonnet → Haiku (within Claude)
-- Codex → Claude, Claude → Codex (across runtimes)
+| Source (optimizer) | Target (deploy) | Seed | Target baseline → transferred | Gain |
+|---|---|---|---|---|
+| Claude Haiku (cheap) | Claude Sonnet (expensive) | brief-writer | 0.00 → **1.00** | +1.00 |
+| Claude Sonnet | Claude Haiku | brief-writer | 0.00 → **1.00** | +1.00 |
+| **Codex** | **Claude Haiku** | brief-writer | 0.00 → **1.00** | +1.00 |
+| **Claude Haiku** | **Codex** | brief-writer | 0.00 → **1.00** | +1.00 |
+
+**4/4 transfers positive.** A skill optimized on a cheap model deploys for free on
+an expensive one, and skills move between Codex and Claude — the Sleep-setting
+analogue of SkillOpt's cross-model and cross-harness transfer tables. This is the
+quantified answer to "optimize cheap overnight, deploy anywhere."
+
+Full machine-generated scorecard: [`benchmark_report.md`](benchmark_report.md)
+(source data `sweep.jsonl`).
 
 ---
 
diff --git a/docs/sleep/benchmark_report.md b/docs/sleep/benchmark_report.md
new file mode 100644
index 0000000..1fe6832
--- /dev/null
+++ b/docs/sleep/benchmark_report.md
@@ -0,0 +1,39 @@
+# SkillOpt-Sleep — benchmark report
+
+Auto-generated from `sweep.jsonl`. Benchmark: [gbrain-evals](https://github.com/garrytan/gbrain-evals) `skillopt-v1` (deficient skills, train/held-out split, local rule judge — no judge-API).
+Held-out scores are computed by the harness, not the optimizer.
+
+## Direct improvement (optimize, then deploy)
+
+| Optimizer → Target | Seed | Held-out before | Held-out after | Nights | Tokens |
+|---|---|---|---|---|---|
+| claude:sonnet → claude:haiku | brief-writer | 0.00 | **1.00** | 2 | 6657 |
+| claude:sonnet → claude:haiku | advisor | 0.00 | **1.00** | 2 | 7891 |
+| claude:sonnet → claude:haiku | thorough-analyst | 0.00 | **1.00** | 2 | 17960 |
+| codex:default → codex:default | brief-writer | 0.00 | **1.00** | 2 | 9969 |
+| codex:default → codex:default | advisor | 0.00 | **1.00** | 2 | 6210 |
+
+**5/5 configurations improved on held-out.**
+
+## Cross-model transfer (optimize on SOURCE, deploy frozen on TARGET)
+
+The price-difference story: spend cheap tokens optimizing overnight, then deploy the frozen skill on any model with no further optimization.
+
+| Source (optimizer) | Target (deploy) | Seed | Target baseline | Transferred | Gain |
+|---|---|---|---|---|---|
+| claude:haiku | claude:sonnet | brief-writer | 0.00 | **1.00** | +1.00 |
+| claude:sonnet | claude:haiku | brief-writer | 0.00 | **1.00** | +1.00 |
+| codex:default | claude:haiku | brief-writer | 0.00 | **1.00** | +1.00 |
+| claude:haiku | codex:default | brief-writer | 0.00 | **1.00** | +1.00 |
+
+**4/4 transfers were positive** (frozen skill helped a different model than it was optimized on).
+
+## How to reproduce
+
+```bash
+git clone https://github.com/garrytan/gbrain-evals /tmp/gbrain-evals
+python -m skillopt.sleep.experiments.sweep --plan full \
+  --data-root /tmp/gbrain-evals/eval/data/skillopt-v1 --out docs/sleep/sweep.jsonl
+python -m skillopt.sleep.experiments.report \
+  --in docs/sleep/sweep.jsonl --out docs/sleep/benchmark_report.md
+```
diff --git a/docs/sleep/sweep.jsonl b/docs/sleep/sweep.jsonl
new file mode 100644
index 0000000..4bd1173
--- /dev/null
+++ b/docs/sleep/sweep.jsonl
@@ -0,0 +1,9 @@
+{"baseline": 0.0, "after": 1.0, "improved": true, "tokens": 6657, "cfg": {"kind": "dual", "optimizer_backend": "claude", "optimizer_model": "sonnet", "target_backend": "claude", "target_model": "haiku", "seed": "brief-writer", "nights": 2}, "cfg_key": "{\"kind\": \"dual\", \"nights\": 2, \"optimizer_backend\": \"claude\", \"optimizer_model\": \"sonnet\", \"seed\": \"brief-writer\", \"target_backend\": \"claude\", \"target_model\": \"haiku\"}", "elapsed_s": 71.5}
+{"baseline": 0.0, "after": 1.0, "improved": true, "tokens": 7891, "cfg": {"kind": "dual", "optimizer_backend": "claude", "optimizer_model": "sonnet", "target_backend": "claude", "target_model": "haiku", "seed": "advisor", "nights": 2}, "cfg_key": "{\"kind\": \"dual\", \"nights\": 2, \"optimizer_backend\": \"claude\", \"optimizer_model\": \"sonnet\", \"seed\": \"advisor\", \"target_backend\": \"claude\", \"target_model\": \"haiku\"}", "elapsed_s": 79.3}
+{"baseline": 0.0, "after": 1.0, "improved": true, "tokens": 17960, "cfg": {"kind": "dual", "optimizer_backend": "claude", "optimizer_model": "sonnet", "target_backend": "claude", "target_model": "haiku", "seed": "thorough-analyst", "nights": 2}, "cfg_key": "{\"kind\": \"dual\", \"nights\": 2, \"optimizer_backend\": \"claude\", \"optimizer_model\": \"sonnet\", \"seed\": \"thorough-analyst\", \"target_backend\": \"claude\", \"target_model\": \"haiku\"}", "elapsed_s": 319.3}
+{"baseline": 0.0, "after": 1.0, "improved": true, "tokens": 9969, "cfg": {"kind": "direct", "backend": "codex", "model": "", "seed": "brief-writer", "nights": 2}, "cfg_key": "{\"backend\": \"codex\", \"kind\": \"direct\", \"model\": \"\", \"nights\": 2, \"seed\": \"brief-writer\"}", "elapsed_s": 187.6}
+{"baseline": 0.0, "after": 1.0, "improved": true, "tokens": 6210, "cfg": {"kind": "direct", "backend": "codex", "model": "", "seed": "advisor", "nights": 2}, "cfg_key": "{\"backend\": \"codex\", \"kind\": \"direct\", \"model\": \"\", \"nights\": 2, \"seed\": \"advisor\"}", "elapsed_s": 114.1}
+{"baseline_target": 0.0, "transferred": 1.0, "transfer_gain": 1.0, "tokens": 13673, "cfg": {"kind": "transfer", "source_backend": "claude", "source_model": "haiku", "target_backend": "claude", "target_model": "sonnet", "seed": "brief-writer", "nights": 2}, "cfg_key": "{\"kind\": \"transfer\", \"nights\": 2, \"seed\": \"brief-writer\", \"source_backend\": \"claude\", \"source_model\": \"haiku\", \"target_backend\": \"claude\", \"target_model\": \"sonnet\"}", "elapsed_s": 180.3}
+{"baseline_target": 0.0, "transferred": 1.0, "transfer_gain": 1.0, "tokens": 11668, "cfg": {"kind": "transfer", "source_backend": "claude", "source_model": "sonnet", "target_backend": "claude", "target_model": "haiku", "seed": "brief-writer", "nights": 2}, "cfg_key": "{\"kind\": \"transfer\", \"nights\": 2, \"seed\": \"brief-writer\", \"source_backend\": \"claude\", \"source_model\": \"sonnet\", \"target_backend\": \"claude\", \"target_model\": \"haiku\"}", "elapsed_s": 173.9}
+{"baseline_target": 0.0, "transferred": 1.0, "transfer_gain": 1.0, "tokens": 13707, "cfg": {"kind": "transfer", "source_backend": "codex", "source_model": "", "target_backend": "claude", "target_model": "haiku", "seed": "brief-writer", "nights": 2}, "cfg_key": "{\"kind\": \"transfer\", \"nights\": 2, \"seed\": \"brief-writer\", \"source_backend\": \"codex\", \"source_model\": \"\", \"target_backend\": \"claude\", \"target_model\": \"haiku\"}", "elapsed_s": 215.7}
+{"baseline_target": 0.0, "transferred": 1.0, "transfer_gain": 1.0, "tokens": 11284, "cfg": {"kind": "transfer", "source_backend": "claude", "source_model": "haiku", "target_backend": "codex", "target_model": "", "seed": "brief-writer", "nights": 2}, "cfg_key": "{\"kind\": \"transfer\", \"nights\": 2, \"seed\": \"brief-writer\", \"source_backend\": \"claude\", \"source_model\": \"haiku\", \"target_backend\": \"codex\", \"target_model\": \"\"}", "elapsed_s": 145.5}
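
Reviewer note: the headline counts in `benchmark_report.md` ("5/5 configurations improved", "4/4 transfers were positive") can be cross-checked directly against `sweep.jsonl`. The sketch below is not the project's `skillopt.sleep.experiments.report` module; `summarize` is a hypothetical helper. It assumes only the record fields visible in this diff (`cfg.kind`, `baseline`, `after`, `transfer_gain`, `tokens`) and treats every record whose kind is not `transfer` as a direct-improvement run.

```python
import json


def summarize(lines):
    """Tally direct-improvement and transfer records from sweep.jsonl-style lines.

    Hypothetical helper for review purposes; field names are taken from the
    records shown in this diff, not from the project's report code.
    """
    stats = {"direct": 0, "direct_improved": 0,
             "transfer": 0, "transfer_positive": 0, "tokens": 0}
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines in the JSONL file
        rec = json.loads(raw)
        stats["tokens"] += rec.get("tokens", 0)
        if rec["cfg"]["kind"] == "transfer":
            stats["transfer"] += 1
            stats["transfer_positive"] += int(rec["transfer_gain"] > 0)
        else:
            # "dual" and "direct" records both report baseline/after scores
            stats["direct"] += 1
            stats["direct_improved"] += int(rec["after"] > rec["baseline"])
    return stats
```

To use it on the committed data, pass an open file handle, for example `summarize(open("docs/sleep/sweep.jsonl"))`, and compare `direct_improved`/`direct` and `transfer_positive`/`transfer` with the two bolded summary lines in the report.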