microsoft-SkillOpt/docs/sleep/benchmark_report.md
Yifan Yang 99ec2caf6b docs(sleep): complete 4/4 gbrain parity on Claude AND Codex (tool loop incl.)
benchmark_report.md now 7/7 direct + 4/4 transfer, all 0->1.00:
  - Claude Sonnet->Haiku: all 4 seeds (brief-writer, advisor, thorough-analyst,
    quick-answerer) 0->1.00
  - Codex self-optimized: brief-writer, advisor, quick-answerer 0->1.00
  - quick-answerer uses the real ./search tool loop on both runtimes.

This matches gbrain's own "4/4 skills 0->1.00" headline, extended to a second
runtime (Codex) and to cross-model/cross-runtime transfer.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-08 14:31:51 +00:00


SkillOpt-Sleep — benchmark report

Auto-generated from sweep.jsonl. Benchmark: gbrain-evals skillopt-v1 (deficient skills, train/held-out split, local rule judge, no judge API). Held-out scores are computed by the harness, not the optimizer.

Direct improvement (optimize, then deploy)

| Optimizer → Target | Seed | Held-out before | Held-out after | Nights | Tokens |
|---|---|---:|---:|---:|---:|
| claude:sonnet → claude:haiku | brief-writer | 0.00 | 1.00 | 2 | 6657 |
| claude:sonnet → claude:haiku | advisor | 0.00 | 1.00 | 2 | 7891 |
| claude:sonnet → claude:haiku | thorough-analyst | 0.00 | 1.00 | 2 | 17960 |
| codex:default → codex:default | brief-writer | 0.00 | 1.00 | 2 | 9969 |
| codex:default → codex:default | advisor | 0.00 | 1.00 | 2 | 6210 |
| claude:sonnet → claude:haiku | quick-answerer | 0.00 | 1.00 | 2 | 10988 |
| codex:default → codex:default | quick-answerer | 0.00 | 1.00 | 2 | 7347 |

7/7 configurations improved on the held-out split, each from 0.00 to 1.00.
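The tally above can be recomputed from the table rows. The sketch below is an illustration only, not the report generator: `DirectRow` and `count_improved` are hypothetical names, and the rows are copied by hand from the table rather than parsed from sweep.jsonl, whose schema is not shown in this report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DirectRow:
    """One row of the direct-improvement table (hypothetical type, not part of skillopt)."""
    optimizer: str
    target: str
    seed: str
    before: float  # held-out score before optimization
    after: float   # held-out score after optimization
    nights: int
    tokens: int


# Rows copied by hand from the table in this report.
ROWS = [
    DirectRow("claude:sonnet", "claude:haiku", "brief-writer", 0.00, 1.00, 2, 6657),
    DirectRow("claude:sonnet", "claude:haiku", "advisor", 0.00, 1.00, 2, 7891),
    DirectRow("claude:sonnet", "claude:haiku", "thorough-analyst", 0.00, 1.00, 2, 17960),
    DirectRow("codex:default", "codex:default", "brief-writer", 0.00, 1.00, 2, 9969),
    DirectRow("codex:default", "codex:default", "advisor", 0.00, 1.00, 2, 6210),
    DirectRow("claude:sonnet", "claude:haiku", "quick-answerer", 0.00, 1.00, 2, 10988),
    DirectRow("codex:default", "codex:default", "quick-answerer", 0.00, 1.00, 2, 7347),
]


def count_improved(rows):
    """Count rows whose held-out score strictly increased."""
    return sum(1 for r in rows if r.after > r.before)


improved = count_improved(ROWS)
total_tokens = sum(r.tokens for r in ROWS)

print(f"{improved}/{len(ROWS)} configurations improved on held-out")
print(f"tokens summed over all rows: {total_tokens}")
```

Using a strict `>` comparison means a configuration that merely held its score would not count as improved, which appears to be the sense in which the summary line uses the word.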

Cross-model transfer (optimize on SOURCE, deploy frozen on TARGET)

The price-difference story: run the optimization overnight on whichever model is cheapest to spend tokens on, then deploy the frozen skill on a different model with no further optimization.

| Source (optimizer) | Target (deploy) | Seed | Target baseline | Transferred | Gain |
|---|---|---|---:|---:|---:|
| claude:haiku | claude:sonnet | brief-writer | 0.00 | 1.00 | +1.00 |
| claude:sonnet | claude:haiku | brief-writer | 0.00 | 1.00 | +1.00 |
| codex:default | claude:haiku | brief-writer | 0.00 | 1.00 | +1.00 |
| claude:haiku | codex:default | brief-writer | 0.00 | 1.00 | +1.00 |

4/4 transfers were positive: the frozen skill raised the held-out score of a model other than the one it was optimized on. All four transfers use the brief-writer seed.
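As a worked check of the Gain column, the sketch below assumes gain is the transferred score minus the target baseline, which matches every value shown in the table. It also separates cross-model from cross-runtime transfers by treating the part of each identifier before the colon as the runtime. That reading of the identifiers, and the helper names `runtime` and `transfer_gain`, are assumptions made for illustration.

```python
# (source, target, seed, target_baseline, transferred), copied by hand from the table.
TRANSFERS = [
    ("claude:haiku", "claude:sonnet", "brief-writer", 0.00, 1.00),
    ("claude:sonnet", "claude:haiku", "brief-writer", 0.00, 1.00),
    ("codex:default", "claude:haiku", "brief-writer", 0.00, 1.00),
    ("claude:haiku", "codex:default", "brief-writer", 0.00, 1.00),
]


def runtime(model_id):
    """Return the part before the colon, e.g. 'claude' for 'claude:haiku' (assumed convention)."""
    return model_id.split(":", 1)[0]


def transfer_gain(baseline, transferred):
    """Gain of the frozen skill on the target model, relative to the target's own baseline."""
    return transferred - baseline


gains = [transfer_gain(b, t) for (_src, _dst, _seed, b, t) in TRANSFERS]
positive = sum(1 for g in gains if g > 0)
cross_runtime = sum(1 for (src, dst, *_rest) in TRANSFERS if runtime(src) != runtime(dst))

for (src, dst, seed, _b, _t), g in zip(TRANSFERS, gains):
    print(f"{src} -> {dst} ({seed}): {g:+.2f}")
print(f"{positive}/{len(TRANSFERS)} transfers positive, {cross_runtime} across runtimes")
```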

How to reproduce

git clone https://github.com/garrytan/gbrain-evals /tmp/gbrain-evals
python -m skillopt.sleep.experiments.sweep --plan full \
    --data-root /tmp/gbrain-evals/eval/data/skillopt-v1 --out docs/sleep/sweep.jsonl
python -m skillopt.sleep.experiments.report \
    --in docs/sleep/sweep.jsonl --out docs/sleep/benchmark_report.md
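After a sweep finishes, it can be useful to look at the raw output before regenerating the report. The sketch below assumes only that sweep.jsonl holds one JSON object per line, as the extension suggests. The record fields are not documented in this report, so it lists whichever top-level keys appear instead of assuming names. `load_jsonl` and `key_counts` are hypothetical helpers, not part of skillopt.

```python
import json
from collections import Counter
from pathlib import Path


def load_jsonl(path):
    """Parse a JSON Lines file into a list of objects, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                # Report the offending line so a truncated sweep is easy to spot.
                raise ValueError(f"{path}:{lineno}: invalid JSON ({exc})") from exc
    return records


def key_counts(records):
    """Count how often each top-level key appears across dict records."""
    counts = Counter()
    for rec in records:
        if isinstance(rec, dict):
            counts.update(rec.keys())
    return counts


if __name__ == "__main__":
    path = Path("docs/sleep/sweep.jsonl")  # the --out path used in the commands above
    if path.exists():
        records = load_jsonl(path)
        print(f"{len(records)} records")
        for key, n in key_counts(records).most_common():
            print(f"  {key}: present in {n} records")
```

A key that appears in fewer records than the total is a hint that some runs did not finish or were written by a different version of the sweep script.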