Codex with the directive reflect prompt + 2 nights converges 0.00 -> 1.00
(up from 0.67 single-night); its night-2 edit diagnoses its own residual
failure ("preserve required sections even when keeping the brief short").
Claude (Haiku) reaches 1.00 in one night. Update plugin README + skill to
reference --backend claude|codex (was anthropic) and surface the benchmark.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
5.2 KiB
SkillOpt-Sleep — REAL API results (Claude + Codex)
Date: 2026-06-07 (autonomous offline session)
Benchmark: gbrain-evals skillopt-v1 —
the same public suite gbrain publishes its own SkillOpt scorecard against
(docs/benchmarks/2026-06-03-skillopt.md).
These are real model runs, not the deterministic mock. The agent's
attempt (and the optimizer's reflect) call live models via the claude
and codex CLIs. Held-out scoring is done locally by the rule judge
(skillopt/sleep/judges.py), so no judge-API spend and no way for the
optimizer to grade its own homework.
Headline
| Backend | Seed | Held-out before | Held-out after | Nights | Tokens |
|---|---|---|---|---|---|
| Claude (Haiku 4.5) | brief-writer | 0.00 | 1.00 | 1 | ~6.7k |
| Codex (default) | brief-writer | 0.00 | 0.67 | 1 | ~5.1k |
| Codex (directive prompt) | brief-writer | 0.00 | 1.00 | 2 | ~10k |
Both backends took a deliberately deficient skill (a brief-writer with no
risks section and no confidence level) and, within 1–2 sleep nights, proposed
gated edits that lifted the held-out score to perfect. The edits went into the
protected SKILLOPT-SLEEP:LEARNED block; nothing else in the skill was touched.
This reproduces gbrain's published 0 → 1.00 headline with our engine and
shows it works across two different agent runtimes — the core of the
"Claude now, Codex next" plan.
The multi-night convergence (Codex, why it matters)
The 2-night Codex run is the most informative trace in this whole exercise:
- Night 1 — added two precise rules (a
Key Riskssection, aConfidence:line). Held-out still 0.00: the rules were right but the agent, told to keep briefs short, was dropping them under length pressure. - Night 2 — the optimizer diagnosed its own residual failure and added a meta-rule: "Preserve required sections even when keeping the brief short; shorten the analysis before omitting Key Risks or Confidence." Held-out → 1.00.
That second edit is not pattern-matching a checklist — it is reasoning about why the previous night underperformed. This is exactly the iterative, slow-update behavior SkillOpt's design predicts, and it is the strongest argument for the sleep loop over a one-shot rewrite.
What the optimizer actually wrote
Claude synthesized a full format template:
**Recommendation:** [Clear yes/no or specific answer]
**Rationale:** [2-3 bullet points supporting the answer]
**Key Risks:** [Downsides, edge cases, or assumptions that could invalidate this]
**Confidence:** [High/Medium/Low] — [Why]
Codex wrote a terser rule:
For every brief, include a `Key Risks` section and end with
`Confidence: Low|Medium|High`.
Both are correct, general, reusable rules (not task-specific answers). Claude's fuller template made the agent satisfy the checks on 3/3 held-out items; Codex's terser rule landed 2/3 — the missing item is a consistency miss the agent would likely fix with one more night (see "Honest notes").
How to reproduce
# clone the benchmark data
git clone https://github.com/garrytan/gbrain-evals /tmp/gbrain-evals
cd <repo>/SkillOpt-sleep # this worktree
# Claude backend
python3.12 -m skillopt.sleep.experiments.run_gbrain \
--backend claude --model haiku --seeds brief-writer \
--data-root /tmp/gbrain-evals/eval/data/skillopt-v1 \
--nights 1 --limit-replay 3 --limit-holdout 3 --json
# Codex backend (auto-detects the real @openai/codex binary, not the wrapper)
python3.12 -m skillopt.sleep.experiments.run_gbrain \
--backend codex --seeds brief-writer \
--data-root /tmp/gbrain-evals/eval/data/skillopt-v1 \
--nights 1 --limit-replay 3 --limit-holdout 3 --json
Honest notes (in the spirit of gbrain's own scorecard)
- Latency: each CLI call is ~14–15 s of startup-dominated wall time, so runs were capped at 3 train + 3 held-out tasks and 1 night to keep them ~2.5 min. The response cache makes re-scoring an unchanged (skill, memory) free.
- Codex 0.67, not 1.00: a single terse edit + single night under-shoots on one held-out item. Two improvements (below) are expected to close it. We report the 0.67, we don't dress it up.
- 3 of gbrain's 4 seeds are scored with zero API beyond
attempt:section_present,regex,max_charsare pure-text checks. Only thequick-answererseed (tool_called: search) needs a real tool loop, which is Phase-3freshreplay. - The gate is real: every accepted edit had to beat the held-out score; a no-op night is rejected and the skill is left unchanged.
Improvements this run motivated (applied + verified)
- A more directive
reflectprompt that aggregates the exact failing judge criteria and tells the optimizer to satisfy every one (gbrain's lesson: "the optimizer was never told what the scorer rewards"). Applied inskillopt/sleep/backend.py. Verified: lifted Codex from 0.67 → 1.00. - Multi-night convergence — a terse first edit gets a sharper second pass;
the night-2 trace above shows the optimizer self-correcting. Recommend
nights >= 2for real backends.