docs(sleep): full sweep — 5/5 direct + 4/4 transfer all 0->1.00

Machine-generated benchmark_report.md from a 9-config sweep:
  - Direct (Sonnet->Haiku): brief-writer/advisor/thorough-analyst 0->1.00
  - Direct (Codex): brief-writer/advisor 0->1.00
  - Transfer (4/4 positive, incl. cross-runtime Codex<->Claude): all 0->1.00

Cross-model transfer supports the price-difference value prop: a skill
optimized on a cheap model deploys on an expensive one with no further
optimization, and skills move between Codex and Claude. sweep.jsonl is the
committed source data.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Yifan Yang
2026-06-08 14:31:51 +00:00
parent 4186e5bb73
commit b1f41a7506
3 changed files with 65 additions and 5 deletions


@@ -93,12 +93,24 @@ weak and what changed.
> *Optimize cheap overnight, deploy anywhere.* A skill is just text, so a good
> rewrite should help a model it was never optimized on.
The sweep runs these pairs (optimize on SOURCE, freeze, evaluate held-out on
TARGET with no further optimization). See `benchmark_report.md` / `sweep.jsonl`
for the auto-generated table once the sweep completes:
Optimize on SOURCE, **freeze** the learned skill, evaluate held-out on TARGET with
no further optimization. All four pairs are positive — including **across
runtimes** (Codex ↔ Claude):
- Haiku → Sonnet, Sonnet → Haiku (within Claude)
- Codex → Claude, Claude → Codex (across runtimes)
| Source (optimizer) | Target (deploy) | Seed | Target baseline → transferred | Gain |
|---|---|---|---|---|
| Claude Haiku (cheap) | Claude Sonnet (expensive) | brief-writer | 0.00 → **1.00** | +1.00 |
| Claude Sonnet | Claude Haiku | brief-writer | 0.00 → **1.00** | +1.00 |
| **Codex** | **Claude Haiku** | brief-writer | 0.00 → **1.00** | +1.00 |
| **Claude Haiku** | **Codex** | brief-writer | 0.00 → **1.00** | +1.00 |
**4/4 transfers positive**, all on the `brief-writer` seed. A skill optimized on a
cheap model deploys on an expensive one with no further optimization, and skills
move between Codex and Claude. This is the Sleep-setting analogue of SkillOpt's
cross-model and cross-harness transfer tables, and the quantified answer to
"optimize cheap overnight, deploy anywhere."
Full machine-generated scorecard: [`benchmark_report.md`](benchmark_report.md)
(source data `sweep.jsonl`).
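The optimize, freeze, evaluate protocol described above can be sketched as follows. This is a minimal illustration, not the SkillOpt-Sleep implementation: `Skill`, `held_out_score`, `transfer_gain`, and every `toy_*` stand-in are hypothetical names, and the toy model and judge exist only so the sketch runs without any API calls.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)  # frozen: the learned skill cannot be mutated after optimization
class Skill:
    text: str


Model = Callable[[Skill, str], str]   # (skill, task) -> output; hypothetical interface
Judge = Callable[[str, str], bool]    # (task, output) -> pass/fail; hypothetical interface


def held_out_score(model: Model, skill: Skill, tasks: Sequence[str], judge: Judge) -> float:
    """Fraction of held-out tasks the judge passes for this model and skill."""
    passed = sum(judge(task, model(skill, task)) for task in tasks)
    return passed / len(tasks)


def transfer_gain(
    optimize: Callable[[Skill], Skill],
    seed: Skill,
    target: Model,
    held_out: Sequence[str],
    judge: Judge,
) -> Tuple[float, float, float]:
    """Optimize on SOURCE, then score the seed and the frozen skill on TARGET."""
    learned = optimize(seed)  # the only step that touches the source model
    baseline = held_out_score(target, seed, held_out, judge)
    transferred = held_out_score(target, learned, held_out, judge)
    return baseline, transferred, transferred - baseline


# Toy stand-ins (assumptions for illustration only; no real model is called).
def toy_optimize(seed: Skill) -> Skill:
    return Skill(seed.text + "\nKeep answers under 40 characters.")


def toy_target(skill: Skill, task: str) -> str:
    answer = f"Summary of {task}."
    if "under 40 characters" in skill.text:
        return answer
    return answer + " " + "Additional detail. " * 5


def toy_judge(task: str, output: str) -> bool:
    return len(output) <= 40


SEED = Skill("Write a brief.")
HELD_OUT = ["task A", "task B"]
print(transfer_gain(toy_optimize, SEED, toy_target, HELD_OUT, toy_judge))
```

The target model never sees an optimization step here: it only receives the seed skill and the frozen learned skill, which is what makes the difference between the two scores a transfer gain. Numbers produced by the toy stand-ins are not measurements.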
---


@@ -0,0 +1,39 @@
# SkillOpt-Sleep — benchmark report
Auto-generated from `sweep.jsonl`. Benchmark: [gbrain-evals](https://github.com/garrytan/gbrain-evals) `skillopt-v1` (deficient skills, train/held-out split, local rule judge with no judge API).
Held-out scores are computed by the harness, not the optimizer.
## Direct improvement (optimize, then deploy)
| Optimizer → Target | Seed | Held-out before | Held-out after | Nights | Tokens |
|---|---|---|---|---|---|
| claude:sonnet → claude:haiku | brief-writer | 0.00 | **1.00** | 2 | 6657 |
| claude:sonnet → claude:haiku | advisor | 0.00 | **1.00** | 2 | 7891 |
| claude:sonnet → claude:haiku | thorough-analyst | 0.00 | **1.00** | 2 | 17960 |
| codex:default → codex:default | brief-writer | 0.00 | **1.00** | 2 | 9969 |
| codex:default → codex:default | advisor | 0.00 | **1.00** | 2 | 6210 |
**5/5 configurations improved on held-out.**
## Cross-model transfer (optimize on SOURCE, deploy frozen on TARGET)
The price-difference story: spend cheap tokens optimizing overnight, then deploy the frozen skill on any model with no further optimization.
| Source (optimizer) | Target (deploy) | Seed | Target baseline | Transferred | Gain |
|---|---|---|---|---|---|
| claude:haiku | claude:sonnet | brief-writer | 0.00 | **1.00** | +1.00 |
| claude:sonnet | claude:haiku | brief-writer | 0.00 | **1.00** | +1.00 |
| codex:default | claude:haiku | brief-writer | 0.00 | **1.00** | +1.00 |
| claude:haiku | codex:default | brief-writer | 0.00 | **1.00** | +1.00 |
**4/4 transfers were positive** (the frozen skill helped a model other than the one it was optimized on).
## How to reproduce
```bash
git clone https://github.com/garrytan/gbrain-evals /tmp/gbrain-evals
python -m skillopt.sleep.experiments.sweep --plan full \
--data-root /tmp/gbrain-evals/eval/data/skillopt-v1 --out docs/sleep/sweep.jsonl
python -m skillopt.sleep.experiments.report \
--in docs/sleep/sweep.jsonl --out docs/sleep/benchmark_report.md
```
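The summary lines in this report ("5/5 configurations improved", "4/4 transfers were positive") can be recomputed from `sweep.jsonl` with a few lines of Python. The sketch below is not the `skillopt.sleep.experiments.report` module; `summarize` is a hypothetical helper, and the two sample records are trimmed copies of rows from `sweep.jsonl` that keep only the fields used here.

```python
import json

# Two records trimmed from sweep.jsonl (only the fields read below are kept).
SAMPLE_JSONL = """\
{"baseline": 0.0, "after": 1.0, "improved": true, "tokens": 6657, "cfg": {"kind": "dual", "seed": "brief-writer"}}
{"baseline_target": 0.0, "transferred": 1.0, "transfer_gain": 1.0, "tokens": 13673, "cfg": {"kind": "transfer", "seed": "brief-writer"}}
"""


def summarize(lines):
    """Count improved direct runs and positive transfers (hypothetical helper)."""
    direct_total = direct_improved = 0
    transfer_total = transfer_positive = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL file
        rec = json.loads(line)
        if rec["cfg"]["kind"] == "transfer":
            transfer_total += 1
            transfer_positive += int(rec["transfer_gain"] > 0)
        else:  # "dual" and "direct" records share the improved/baseline/after fields
            direct_total += 1
            direct_improved += int(bool(rec["improved"]))
    return {
        "direct": f"{direct_improved}/{direct_total}",
        "transfer": f"{transfer_positive}/{transfer_total}",
    }


print(summarize(SAMPLE_JSONL.splitlines()))
```

To check the full sweep, pass the lines of the real file instead, for example `summarize(open("docs/sleep/sweep.jsonl"))`.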

docs/sleep/sweep.jsonl Normal file

@@ -0,0 +1,9 @@
{"baseline": 0.0, "after": 1.0, "improved": true, "tokens": 6657, "cfg": {"kind": "dual", "optimizer_backend": "claude", "optimizer_model": "sonnet", "target_backend": "claude", "target_model": "haiku", "seed": "brief-writer", "nights": 2}, "cfg_key": "{\"kind\": \"dual\", \"nights\": 2, \"optimizer_backend\": \"claude\", \"optimizer_model\": \"sonnet\", \"seed\": \"brief-writer\", \"target_backend\": \"claude\", \"target_model\": \"haiku\"}", "elapsed_s": 71.5}
{"baseline": 0.0, "after": 1.0, "improved": true, "tokens": 7891, "cfg": {"kind": "dual", "optimizer_backend": "claude", "optimizer_model": "sonnet", "target_backend": "claude", "target_model": "haiku", "seed": "advisor", "nights": 2}, "cfg_key": "{\"kind\": \"dual\", \"nights\": 2, \"optimizer_backend\": \"claude\", \"optimizer_model\": \"sonnet\", \"seed\": \"advisor\", \"target_backend\": \"claude\", \"target_model\": \"haiku\"}", "elapsed_s": 79.3}
{"baseline": 0.0, "after": 1.0, "improved": true, "tokens": 17960, "cfg": {"kind": "dual", "optimizer_backend": "claude", "optimizer_model": "sonnet", "target_backend": "claude", "target_model": "haiku", "seed": "thorough-analyst", "nights": 2}, "cfg_key": "{\"kind\": \"dual\", \"nights\": 2, \"optimizer_backend\": \"claude\", \"optimizer_model\": \"sonnet\", \"seed\": \"thorough-analyst\", \"target_backend\": \"claude\", \"target_model\": \"haiku\"}", "elapsed_s": 319.3}
{"baseline": 0.0, "after": 1.0, "improved": true, "tokens": 9969, "cfg": {"kind": "direct", "backend": "codex", "model": "", "seed": "brief-writer", "nights": 2}, "cfg_key": "{\"backend\": \"codex\", \"kind\": \"direct\", \"model\": \"\", \"nights\": 2, \"seed\": \"brief-writer\"}", "elapsed_s": 187.6}
{"baseline": 0.0, "after": 1.0, "improved": true, "tokens": 6210, "cfg": {"kind": "direct", "backend": "codex", "model": "", "seed": "advisor", "nights": 2}, "cfg_key": "{\"backend\": \"codex\", \"kind\": \"direct\", \"model\": \"\", \"nights\": 2, \"seed\": \"advisor\"}", "elapsed_s": 114.1}
{"baseline_target": 0.0, "transferred": 1.0, "transfer_gain": 1.0, "tokens": 13673, "cfg": {"kind": "transfer", "source_backend": "claude", "source_model": "haiku", "target_backend": "claude", "target_model": "sonnet", "seed": "brief-writer", "nights": 2}, "cfg_key": "{\"kind\": \"transfer\", \"nights\": 2, \"seed\": \"brief-writer\", \"source_backend\": \"claude\", \"source_model\": \"haiku\", \"target_backend\": \"claude\", \"target_model\": \"sonnet\"}", "elapsed_s": 180.3}
{"baseline_target": 0.0, "transferred": 1.0, "transfer_gain": 1.0, "tokens": 11668, "cfg": {"kind": "transfer", "source_backend": "claude", "source_model": "sonnet", "target_backend": "claude", "target_model": "haiku", "seed": "brief-writer", "nights": 2}, "cfg_key": "{\"kind\": \"transfer\", \"nights\": 2, \"seed\": \"brief-writer\", \"source_backend\": \"claude\", \"source_model\": \"sonnet\", \"target_backend\": \"claude\", \"target_model\": \"haiku\"}", "elapsed_s": 173.9}
{"baseline_target": 0.0, "transferred": 1.0, "transfer_gain": 1.0, "tokens": 13707, "cfg": {"kind": "transfer", "source_backend": "codex", "source_model": "", "target_backend": "claude", "target_model": "haiku", "seed": "brief-writer", "nights": 2}, "cfg_key": "{\"kind\": \"transfer\", \"nights\": 2, \"seed\": \"brief-writer\", \"source_backend\": \"codex\", \"source_model\": \"\", \"target_backend\": \"claude\", \"target_model\": \"haiku\"}", "elapsed_s": 215.7}
{"baseline_target": 0.0, "transferred": 1.0, "transfer_gain": 1.0, "tokens": 11284, "cfg": {"kind": "transfer", "source_backend": "claude", "source_model": "haiku", "target_backend": "codex", "target_model": "", "seed": "brief-writer", "nights": 2}, "cfg_key": "{\"kind\": \"transfer\", \"nights\": 2, \"seed\": \"brief-writer\", \"source_backend\": \"claude\", \"source_model\": \"haiku\", \"target_backend\": \"codex\", \"target_model\": \"\"}", "elapsed_s": 145.5}
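In every record above, `cfg_key` matches the record's `cfg` object serialized with its keys sorted alphabetically. The sketch below reproduces that string with `json.dumps(..., sort_keys=True)` for one record copied from the data. That the sweep script builds the key this way, and that it uses it to identify or deduplicate configurations, is an assumption inferred from the data rather than something the commit states; `make_cfg_key` is a hypothetical name.

```python
import json


def make_cfg_key(cfg: dict) -> str:
    """Canonical string for a config: JSON with keys sorted (assumed convention)."""
    return json.dumps(cfg, sort_keys=True)


# cfg and cfg_key copied from the codex brief-writer direct record in sweep.jsonl.
cfg = {"kind": "direct", "backend": "codex", "model": "", "seed": "brief-writer", "nights": 2}
recorded_key = '{"backend": "codex", "kind": "direct", "model": "", "nights": 2, "seed": "brief-writer"}'

# Key order in the input dict does not matter, so equal configs give equal keys.
reordered = {"nights": 2, "seed": "brief-writer", "model": "", "backend": "codex", "kind": "direct"}

print(make_cfg_key(cfg) == recorded_key)
print(make_cfg_key(cfg) == make_cfg_key(reordered))
```

A canonical key like this is a common way to skip configurations that already have a row when a sweep is resumed, since it stays stable regardless of how the config dict was constructed.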