mirror of https://github.com/microsoft/SkillOpt.git synced 2026-07-03 14:02:58 +08:00

Yifan Yang 99ec2caf6b docs(sleep): complete 4/4 gbrain parity on Claude AND Codex (tool loop incl.)

benchmark_report.md now 7/7 direct + 4/4 transfer, all 0->1.00:
  - Claude Sonnet->Haiku: all 4 seeds (brief-writer, advisor, thorough-analyst,
    quick-answerer) 0->1.00
  - Codex self-optimized: brief-writer, advisor, quick-answerer 0->1.00
  - quick-answerer uses the real ./search tool loop on both runtimes.

This matches gbrain's own "4/4 skills 0->1.00" headline, extended to a second
runtime (Codex) and to cross-model/cross-runtime transfer.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

2026-06-08 14:31:51 +00:00

SkillOpt-Sleep — benchmark report

Auto-generated from sweep.jsonl. Benchmark: gbrain-evals skillopt-v1 (deficient skills, train/held-out split, local rule judge, no judge API). Held-out scores are computed by the harness, not by the optimizer.

Direct improvement (optimize, then deploy)

| Optimizer → Target | Seed | Held-out after | Nights | Tokens |
|---|---|---|---|---|
| claude:sonnet → claude:haiku | brief-writer | 1.00 | 2 | 6657 |
| claude:sonnet → claude:haiku | advisor | 1.00 | 2 | 7891 |
| claude:sonnet → claude:haiku | thorough-analyst | 1.00 | 2 | 17960 |
| codex:default → codex:default | brief-writer | 1.00 | 2 | 9969 |
| codex:default → codex:default | advisor | 1.00 | 2 | 6210 |
| claude:sonnet → claude:haiku | quick-answerer | 1.00 | 2 | 10988 |
| codex:default → codex:default | quick-answerer | 1.00 | 2 | 7347 |

7/7 configurations improved on the held-out split, each reaching 1.00.
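The tally above can be re-derived from the table rows with a few lines of Python. This is only a sketch: the tuple layout and the `summarize` helper are illustrative and not part of SkillOpt, and the baseline of 0.0 is taken from the "all 0->1.00" wording in the commit message rather than from a column in the table.

```python
# Rows copied from the "Direct improvement" table:
# (optimizer, target, seed, held_out_after, nights, tokens)
ROWS = [
    ("claude:sonnet", "claude:haiku", "brief-writer", 1.00, 2, 6657),
    ("claude:sonnet", "claude:haiku", "advisor", 1.00, 2, 7891),
    ("claude:sonnet", "claude:haiku", "thorough-analyst", 1.00, 2, 17960),
    ("codex:default", "codex:default", "brief-writer", 1.00, 2, 9969),
    ("codex:default", "codex:default", "advisor", 1.00, 2, 6210),
    ("claude:sonnet", "claude:haiku", "quick-answerer", 1.00, 2, 10988),
    ("codex:default", "codex:default", "quick-answerer", 1.00, 2, 7347),
]

# Assumption: every seed starts at a held-out score of 0.0 before optimization.
BASELINE = 0.0


def summarize(rows, baseline=BASELINE):
    """Return (configs improved, configs total, sum of the Tokens column)."""
    improved = sum(1 for row in rows if row[3] > baseline)
    total_tokens = sum(row[5] for row in rows)
    return improved, len(rows), total_tokens


improved, total, tokens = summarize(ROWS)
print(f"{improved}/{total} configurations improved on held-out; {tokens} tokens in total")
```

Summing the Tokens column this way gives 67022 across the seven runs.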

Cross-model transfer (optimize on SOURCE, deploy frozen on TARGET)

The price-difference story: spend cheap tokens optimizing overnight, then deploy the frozen skill on a different model with no further optimization.

| Source (optimizer) | Target (deploy) | Seed | Transferred | Gain |
|---|---|---|---|---|
| claude:haiku | claude:sonnet | brief-writer | 1.00 | +1.00 |
| claude:sonnet | claude:haiku | brief-writer | 1.00 | +1.00 |
| codex:default | claude:haiku | brief-writer | 1.00 | +1.00 |
| claude:haiku | codex:default | brief-writer | 1.00 | +1.00 |

4/4 transfers were positive (the frozen skill helped a different model than the one it was optimized on). All four transfers use the brief-writer seed.

How to reproduce

git clone https://github.com/garrytan/gbrain-evals /tmp/gbrain-evals
python -m skillopt.sleep.experiments.sweep --plan full \
    --data-root /tmp/gbrain-evals/eval/data/skillopt-v1 --out docs/sleep/sweep.jsonl
python -m skillopt.sleep.experiments.report \
    --in docs/sleep/sweep.jsonl --out docs/sleep/benchmark_report.md
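To inspect the sweep output directly rather than through the generated report, a JSONL file can be read with the standard library alone. The sketch below is not the project's report module, and the record fields it uses (`optimizer`, `target`, `seed`, `held_out`) are hypothetical; check them against an actual line of sweep.jsonl before relying on this.

```python
import json
from pathlib import Path


def load_records(path):
    """Read one JSON object per non-empty line of a JSONL file."""
    records = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def last_by_config(records):
    """Keep the last held-out score seen for each (optimizer, target, seed).

    The field names here are assumptions made for illustration only.
    """
    latest = {}
    for rec in records:
        key = (rec["optimizer"], rec["target"], rec["seed"])
        latest[key] = rec["held_out"]
    return latest
```

Calling `last_by_config(load_records("docs/sleep/sweep.jsonl"))` would then give one score per configuration, provided the records really carry those fields.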
