Mirror of https://github.com/microsoft/SkillOpt.git (synced 2026-07-03 14:02:58 +08:00)
docs(sleep): full sweep — 5/5 direct + 4/4 transfer all 0->1.00
Machine-generated benchmark_report.md from a 9-config sweep:

- Direct (Sonnet->Haiku): brief-writer/advisor/thorough-analyst 0->1.00
- Direct (Codex): brief-writer/advisor 0->1.00
- Transfer (4/4 positive, incl. cross-runtime Codex<->Claude): all 0->1.00

Cross-model transfer confirms the price-difference value prop: a skill
optimized on a cheap model deploys for free on an expensive one, and skills
move between Codex and Claude. sweep.jsonl is the committed source data.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
@@ -93,12 +93,24 @@ weak and what changed.
 > *Optimize cheap overnight, deploy anywhere.* A skill is just text, so a good
 > rewrite should help a model it was never optimized on.

-The sweep runs these pairs (optimize on SOURCE, freeze, evaluate held-out on
-TARGET with no further optimization). See `benchmark_report.md` / `sweep.jsonl`
-for the auto-generated table once the sweep completes:
+Optimize on SOURCE, **freeze** the learned skill, evaluate held-out on TARGET with
+no further optimization. All four pairs are positive — including **across
+runtimes** (Codex ↔ Claude):

-- Haiku → Sonnet, Sonnet → Haiku (within Claude)
-- Codex → Claude, Claude → Codex (across runtimes)
+| Source (optimizer) | Target (deploy) | Seed | Target baseline → transferred | Gain |
+|---|---|---|---|---|
+| Claude Haiku (cheap) | Claude Sonnet (expensive) | brief-writer | 0.00 → **1.00** | +1.00 |
+| Claude Sonnet | Claude Haiku | brief-writer | 0.00 → **1.00** | +1.00 |
+| **Codex** | **Claude Haiku** | brief-writer | 0.00 → **1.00** | +1.00 |
+| **Claude Haiku** | **Codex** | brief-writer | 0.00 → **1.00** | +1.00 |
+
+**4/4 transfers positive.** A skill optimized on a cheap model deploys for free on
+an expensive one, and skills move between Codex and Claude — the Sleep-setting
+analogue of SkillOpt's cross-model and cross-harness transfer tables. This is the
+quantified answer to "optimize cheap overnight, deploy anywhere."
+
+Full machine-generated scorecard: [`benchmark_report.md`](benchmark_report.md)
+(source data `sweep.jsonl`).

 ---

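The transfer protocol described in this hunk (score the seed skill on TARGET, optimize on SOURCE, freeze, re-score on TARGET) can be sketched as below. This is a minimal illustration, not SkillOpt's implementation: `run_transfer`, `TransferResult`, and the `optimize`/`evaluate` callables are hypothetical names introduced here, and only the result field names are borrowed from `sweep.jsonl`.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical signatures (assumptions, not SkillOpt's API):
#   optimize(skill_text, model) -> rewritten skill text
#   evaluate(skill_text, model) -> held-out score in [0, 1]
Optimize = Callable[[str, str], str]
Evaluate = Callable[[str, str], float]


@dataclass(frozen=True)
class TransferResult:
    baseline_target: float  # seed skill scored on the target model
    transferred: float      # frozen, source-optimized skill scored on the target

    @property
    def transfer_gain(self) -> float:
        return self.transferred - self.baseline_target


def run_transfer(seed_skill: str, source: str, target: str,
                 optimize: Optimize, evaluate: Evaluate) -> TransferResult:
    # 1. Baseline: how the unmodified seed skill does on the target.
    baseline = evaluate(seed_skill, target)
    # 2. Optimize on the source model only.
    frozen_skill = optimize(seed_skill, source)
    # 3. Freeze: the target never triggers another optimization pass.
    transferred = evaluate(frozen_skill, target)
    return TransferResult(baseline, transferred)
```

Passing the optimizer and the harness in as callables keeps the sketch independent of any backend; the property mirrors the `transfer_gain` column, which in the committed data equals `transferred - baseline_target` for every row.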
docs/sleep/benchmark_report.md (new file, 39 lines)
@@ -0,0 +1,39 @@
# SkillOpt-Sleep — benchmark report

Auto-generated from `sweep.jsonl`. Benchmark: [gbrain-evals](https://github.com/garrytan/gbrain-evals) `skillopt-v1` (deficient skills, train/held-out split, local rule judge — no judge-API).
Held-out scores are computed by the harness, not the optimizer.

## Direct improvement (optimize, then deploy)

| Optimizer → Target | Seed | Held-out before | Held-out after | Nights | Tokens |
|---|---|---|---|---|---|
| claude:sonnet → claude:haiku | brief-writer | 0.00 | **1.00** | 2 | 6657 |
| claude:sonnet → claude:haiku | advisor | 0.00 | **1.00** | 2 | 7891 |
| claude:sonnet → claude:haiku | thorough-analyst | 0.00 | **1.00** | 2 | 17960 |
| codex:default → codex:default | brief-writer | 0.00 | **1.00** | 2 | 9969 |
| codex:default → codex:default | advisor | 0.00 | **1.00** | 2 | 6210 |

**5/5 configurations improved on held-out.**

## Cross-model transfer (optimize on SOURCE, deploy frozen on TARGET)

The price-difference story: spend cheap tokens optimizing overnight, then deploy the frozen skill on any model with no further optimization.

| Source (optimizer) | Target (deploy) | Seed | Target baseline | Transferred | Gain |
|---|---|---|---|---|---|
| claude:haiku | claude:sonnet | brief-writer | 0.00 | **1.00** | +1.00 |
| claude:sonnet | claude:haiku | brief-writer | 0.00 | **1.00** | +1.00 |
| codex:default | claude:haiku | brief-writer | 0.00 | **1.00** | +1.00 |
| claude:haiku | codex:default | brief-writer | 0.00 | **1.00** | +1.00 |

**4/4 transfers were positive** (frozen skill helped a different model than it was optimized on).

## How to reproduce

```bash
git clone https://github.com/garrytan/gbrain-evals /tmp/gbrain-evals
python -m skillopt.sleep.experiments.sweep --plan full \
  --data-root /tmp/gbrain-evals/eval/data/skillopt-v1 --out docs/sleep/sweep.jsonl
python -m skillopt.sleep.experiments.report \
  --in docs/sleep/sweep.jsonl --out docs/sleep/benchmark_report.md
```
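As a rough illustration of how one transfer record from `sweep.jsonl` maps onto a row of the cross-model table, here is a small sketch. It is not the code in `skillopt.sleep.experiments.report`: `label` and `transfer_row` are hypothetical helpers, and two behaviours are inferred from the rendered table rather than confirmed, namely that an empty model name is shown as `default` and that the transferred score is always bolded.

```python
def label(backend: str, model: str) -> str:
    # Assumption: an empty model string is displayed as "default",
    # matching "codex:default" in the rendered report.
    return f"{backend}:{model or 'default'}"


def transfer_row(rec: dict) -> str:
    # Render one transfer record as a Markdown table row.
    cfg = rec["cfg"]
    return (
        f"| {label(cfg['source_backend'], cfg['source_model'])} "
        f"| {label(cfg['target_backend'], cfg['target_model'])} "
        f"| {cfg['seed']} | {rec['baseline_target']:.2f} "
        f"| **{rec['transferred']:.2f}** | {rec['transfer_gain']:+.2f} |"
    )
```

The `+.2f` format spec forces the sign, which is how a gain of `1.0` comes out as `+1.00`.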
docs/sleep/sweep.jsonl (new file, 9 lines)
@@ -0,0 +1,9 @@
{"baseline": 0.0, "after": 1.0, "improved": true, "tokens": 6657, "cfg": {"kind": "dual", "optimizer_backend": "claude", "optimizer_model": "sonnet", "target_backend": "claude", "target_model": "haiku", "seed": "brief-writer", "nights": 2}, "cfg_key": "{\"kind\": \"dual\", \"nights\": 2, \"optimizer_backend\": \"claude\", \"optimizer_model\": \"sonnet\", \"seed\": \"brief-writer\", \"target_backend\": \"claude\", \"target_model\": \"haiku\"}", "elapsed_s": 71.5}
{"baseline": 0.0, "after": 1.0, "improved": true, "tokens": 7891, "cfg": {"kind": "dual", "optimizer_backend": "claude", "optimizer_model": "sonnet", "target_backend": "claude", "target_model": "haiku", "seed": "advisor", "nights": 2}, "cfg_key": "{\"kind\": \"dual\", \"nights\": 2, \"optimizer_backend\": \"claude\", \"optimizer_model\": \"sonnet\", \"seed\": \"advisor\", \"target_backend\": \"claude\", \"target_model\": \"haiku\"}", "elapsed_s": 79.3}
{"baseline": 0.0, "after": 1.0, "improved": true, "tokens": 17960, "cfg": {"kind": "dual", "optimizer_backend": "claude", "optimizer_model": "sonnet", "target_backend": "claude", "target_model": "haiku", "seed": "thorough-analyst", "nights": 2}, "cfg_key": "{\"kind\": \"dual\", \"nights\": 2, \"optimizer_backend\": \"claude\", \"optimizer_model\": \"sonnet\", \"seed\": \"thorough-analyst\", \"target_backend\": \"claude\", \"target_model\": \"haiku\"}", "elapsed_s": 319.3}
{"baseline": 0.0, "after": 1.0, "improved": true, "tokens": 9969, "cfg": {"kind": "direct", "backend": "codex", "model": "", "seed": "brief-writer", "nights": 2}, "cfg_key": "{\"backend\": \"codex\", \"kind\": \"direct\", \"model\": \"\", \"nights\": 2, \"seed\": \"brief-writer\"}", "elapsed_s": 187.6}
{"baseline": 0.0, "after": 1.0, "improved": true, "tokens": 6210, "cfg": {"kind": "direct", "backend": "codex", "model": "", "seed": "advisor", "nights": 2}, "cfg_key": "{\"backend\": \"codex\", \"kind\": \"direct\", \"model\": \"\", \"nights\": 2, \"seed\": \"advisor\"}", "elapsed_s": 114.1}
{"baseline_target": 0.0, "transferred": 1.0, "transfer_gain": 1.0, "tokens": 13673, "cfg": {"kind": "transfer", "source_backend": "claude", "source_model": "haiku", "target_backend": "claude", "target_model": "sonnet", "seed": "brief-writer", "nights": 2}, "cfg_key": "{\"kind\": \"transfer\", \"nights\": 2, \"seed\": \"brief-writer\", \"source_backend\": \"claude\", \"source_model\": \"haiku\", \"target_backend\": \"claude\", \"target_model\": \"sonnet\"}", "elapsed_s": 180.3}
{"baseline_target": 0.0, "transferred": 1.0, "transfer_gain": 1.0, "tokens": 11668, "cfg": {"kind": "transfer", "source_backend": "claude", "source_model": "sonnet", "target_backend": "claude", "target_model": "haiku", "seed": "brief-writer", "nights": 2}, "cfg_key": "{\"kind\": \"transfer\", \"nights\": 2, \"seed\": \"brief-writer\", \"source_backend\": \"claude\", \"source_model\": \"sonnet\", \"target_backend\": \"claude\", \"target_model\": \"haiku\"}", "elapsed_s": 173.9}
{"baseline_target": 0.0, "transferred": 1.0, "transfer_gain": 1.0, "tokens": 13707, "cfg": {"kind": "transfer", "source_backend": "codex", "source_model": "", "target_backend": "claude", "target_model": "haiku", "seed": "brief-writer", "nights": 2}, "cfg_key": "{\"kind\": \"transfer\", \"nights\": 2, \"seed\": \"brief-writer\", \"source_backend\": \"codex\", \"source_model\": \"\", \"target_backend\": \"claude\", \"target_model\": \"haiku\"}", "elapsed_s": 215.7}
{"baseline_target": 0.0, "transferred": 1.0, "transfer_gain": 1.0, "tokens": 11284, "cfg": {"kind": "transfer", "source_backend": "claude", "source_model": "haiku", "target_backend": "codex", "target_model": "", "seed": "brief-writer", "nights": 2}, "cfg_key": "{\"kind\": \"transfer\", \"nights\": 2, \"seed\": \"brief-writer\", \"source_backend\": \"claude\", \"source_model\": \"haiku\", \"target_backend\": \"codex\", \"target_model\": \"\"}", "elapsed_s": 145.5}
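The headline tallies ("5/5" direct, "4/4" transfer) can be recomputed from the records above with a short script. This is a sketch under stated assumptions, not the project's report generator: `summarize` is a hypothetical helper, and it assumes that records with `cfg.kind` of `direct` or `dual` both count toward the direct-improvement table (which is how the report appears to group them) and that a transfer is "positive" when `transfer_gain > 0`.

```python
import json
from collections import Counter
from typing import Iterable


def summarize(lines: Iterable[str]) -> dict:
    """Return {group: (positive_count, total_count)} for JSONL sweep records."""
    total: Counter = Counter()
    positive: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        rec = json.loads(line)
        if rec["cfg"]["kind"] == "transfer":
            group, ok = "transfer", rec["transfer_gain"] > 0
        else:
            # Assumption: "direct" and "dual" records share one table.
            group, ok = "direct", bool(rec["improved"])
        total[group] += 1
        positive[group] += int(ok)
    return {group: (positive[group], total[group]) for group in total}
```

Called as `summarize(open("docs/sleep/sweep.jsonl"))` on the nine committed records, this should give 5 of 5 for `direct` and 4 of 4 for `transfer`, matching the report.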