mirror of
https://github.com/microsoft/SkillOpt.git
synced 2026-07-03 14:02:58 +08:00
benchmark_report.md now 7/7 direct + 4/4 transfer, all 0->1.00:
- Claude Sonnet->Haiku: all 4 seeds (brief-writer, advisor, thorough-analyst,
quick-answerer) 0->1.00
- Codex self-optimized: brief-writer, advisor, quick-answerer 0->1.00
- quick-answerer uses the real ./search tool loop on both runtimes.
This matches gbrain's own "4/4 skills 0->1.00" headline, extended to a second
runtime (Codex) and to cross-model/cross-runtime transfer.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
SkillOpt-Sleep — benchmark report
Auto-generated from sweep.jsonl. Benchmark: gbrain-evals skillopt-v1 (deficient skills, train/held-out split, local rule-based judge; no judge API is called).
Held-out scores are computed by the harness, not the optimizer.
Direct improvement (optimize, then deploy)
| Optimizer → Target | Seed | Held-out before | Held-out after | Nights | Tokens |
|---|---|---|---|---|---|
| claude:sonnet → claude:haiku | brief-writer | 0.00 | 1.00 | 2 | 6657 |
| claude:sonnet → claude:haiku | advisor | 0.00 | 1.00 | 2 | 7891 |
| claude:sonnet → claude:haiku | thorough-analyst | 0.00 | 1.00 | 2 | 17960 |
| codex:default → codex:default | brief-writer | 0.00 | 1.00 | 2 | 9969 |
| codex:default → codex:default | advisor | 0.00 | 1.00 | 2 | 6210 |
| claude:sonnet → claude:haiku | quick-answerer | 0.00 | 1.00 | 2 | 10988 |
| codex:default → codex:default | quick-answerer | 0.00 | 1.00 | 2 | 7347 |
7/7 configurations improved on the held-out split, each from 0.00 to 1.00.
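The summary line can be re-derived from the table itself. The sketch below is a minimal check, not part of the SkillOpt tooling: it parses the seven rows copied from the table above and counts how many improved. The helper names `parse_rows` and `summarize` are hypothetical, and the token total is plain arithmetic over the Tokens column.

```python
# Rows copied verbatim from the direct-improvement table in this report.
DIRECT_TABLE = """\
| claude:sonnet → claude:haiku | brief-writer | 0.00 | 1.00 | 2 | 6657 |
| claude:sonnet → claude:haiku | advisor | 0.00 | 1.00 | 2 | 7891 |
| claude:sonnet → claude:haiku | thorough-analyst | 0.00 | 1.00 | 2 | 17960 |
| codex:default → codex:default | brief-writer | 0.00 | 1.00 | 2 | 9969 |
| codex:default → codex:default | advisor | 0.00 | 1.00 | 2 | 6210 |
| claude:sonnet → claude:haiku | quick-answerer | 0.00 | 1.00 | 2 | 10988 |
| codex:default → codex:default | quick-answerer | 0.00 | 1.00 | 2 | 7347 |
"""


def parse_rows(table: str) -> list[dict]:
    """Turn pipe-delimited markdown rows into dictionaries (hypothetical helper)."""
    rows = []
    for line in table.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        pair, seed, before, after, nights, tokens = cells
        rows.append({
            "pair": pair,
            "seed": seed,
            "before": float(before),
            "after": float(after),
            "nights": int(nights),
            "tokens": int(tokens),
        })
    return rows


def summarize(rows: list[dict]) -> tuple[int, int, int]:
    """Return (improved count, total count, total tokens)."""
    improved = sum(1 for r in rows if r["after"] > r["before"])
    total_tokens = sum(r["tokens"] for r in rows)
    return improved, len(rows), total_tokens


improved, total, tokens = summarize(parse_rows(DIRECT_TABLE))
print(f"{improved}/{total} improved, {tokens} tokens in total")
# → 7/7 improved, 67022 tokens in total
```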
Cross-model transfer (optimize on SOURCE, deploy frozen on TARGET)
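The price-difference story: spend cheap tokens optimizing overnight, then deploy the frozen skill on a different model with no further optimization. The rows below cover both directions between claude:haiku and claude:sonnet, plus both directions between Claude and Codex.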
| Source (optimizer) | Target (deploy) | Seed | Target baseline | Transferred | Gain |
|---|---|---|---|---|---|
| claude:haiku | claude:sonnet | brief-writer | 0.00 | 1.00 | +1.00 |
| claude:sonnet | claude:haiku | brief-writer | 0.00 | 1.00 | +1.00 |
| codex:default | claude:haiku | brief-writer | 0.00 | 1.00 | +1.00 |
| claude:haiku | codex:default | brief-writer | 0.00 | 1.00 | +1.00 |
4/4 transfers were positive: the frozen skill helped a model other than the one it was optimized on. All four transfers use the brief-writer seed.
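The Gain column is consistent with "transferred score minus target baseline". The sketch below makes that reading explicit for the four rows above. The `Transfer` class is hypothetical, and treating the part before the colon in an identifier such as `codex:default` as the runtime is an assumption based on how the identifiers are written here, not on the SkillOpt source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transfer:
    """One row of the transfer table (hypothetical structure, for illustration)."""
    source: str
    target: str
    seed: str
    baseline: float
    transferred: float

    @property
    def gain(self) -> float:
        # Assumed definition: held-out score with the frozen skill minus the baseline.
        return self.transferred - self.baseline

    @property
    def cross_runtime(self) -> bool:
        # Assumption: identifiers read as "runtime:model".
        return self.source.split(":")[0] != self.target.split(":")[0]


# Values copied from the transfer table in this report.
TRANSFERS = [
    Transfer("claude:haiku", "claude:sonnet", "brief-writer", 0.00, 1.00),
    Transfer("claude:sonnet", "claude:haiku", "brief-writer", 0.00, 1.00),
    Transfer("codex:default", "claude:haiku", "brief-writer", 0.00, 1.00),
    Transfer("claude:haiku", "codex:default", "brief-writer", 0.00, 1.00),
]

positive = [t for t in TRANSFERS if t.gain > 0]
cross_runtime = [t for t in TRANSFERS if t.cross_runtime]
print(f"{len(positive)}/{len(TRANSFERS)} positive, {len(cross_runtime)} cross-runtime")
# → 4/4 positive, 2 cross-runtime
```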
How to reproduce
git clone https://github.com/garrytan/gbrain-evals /tmp/gbrain-evals
python -m skillopt.sleep.experiments.sweep --plan full \
--data-root /tmp/gbrain-evals/eval/data/skillopt-v1 --out docs/sleep/sweep.jsonl
python -m skillopt.sleep.experiments.report \
--in docs/sleep/sweep.jsonl --out docs/sleep/benchmark_report.md
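After the sweep finishes, it can help to look at the raw records before regenerating the report. This report does not document the schema of sweep.jsonl, so the sketch below assumes only that the file holds one JSON object per line (the usual meaning of the `.jsonl` extension) and lists whatever field names it finds. `load_jsonl` and `field_names` are hypothetical helpers, not part of `skillopt`.

```python
import json
from pathlib import Path


def load_jsonl(path) -> list:
    """Read a JSON Lines file, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def field_names(records: list) -> list[str]:
    """Collect the sorted union of keys across all object records."""
    names = set()
    for rec in records:
        if isinstance(rec, dict):
            names.update(rec)
    return sorted(names)
```

For example, `field_names(load_jsonl("docs/sleep/sweep.jsonl"))` would show which keys the sweep actually wrote, without guessing at them in advance.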