microsoft-SkillOpt/docs/sleep/real_api_results.md
Yifan Yang 63c79b3602 docs(sleep): record real Claude+Codex gbrain results; both reach 0->1.00
Codex with the directive reflect prompt + 2 nights converges 0.00 -> 1.00
(up from 0.67 single-night); its night-2 edit diagnoses its own residual
failure ("preserve required sections even when keeping the brief short").
Claude (Haiku) reaches 1.00 in one night. Update plugin README + skill to
reference --backend claude|codex (was anthropic) and surface the benchmark.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-08 14:31:51 +00:00


SkillOpt-Sleep — REAL API results (Claude + Codex)

Date: 2026-06-07 (autonomous offline session)

Benchmark: gbrain-evals skillopt-v1, the same public suite gbrain publishes its own SkillOpt scorecard against (docs/benchmarks/2026-06-03-skillopt.md).

These are real model runs, not the deterministic mock. Both the agent's attempt and the optimizer's reflect step call live models via the claude and codex CLIs. Held-out scoring is done locally by the rule judge (skillopt/sleep/judges.py), so there is no judge-API spend and no way for the optimizer to grade its own homework.
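To make "scored locally by a rule judge" concrete, here is a minimal sketch of pure-text checks of the kinds this document names (section_present, regex, max_chars). It is not the contents of skillopt/sleep/judges.py: the criterion dictionary layout, the function names, the 600-character limit, and the "fraction of items passing every check" scoring rule are all assumptions made for illustration.

```python
import re

def check(criterion: dict, text: str) -> bool:
    """Evaluate one pure-text criterion (hypothetical layout) against an output."""
    kind = criterion["type"]
    if kind == "section_present":
        # Case-insensitive substring match on the section name.
        return criterion["name"].lower() in text.lower()
    if kind == "regex":
        return re.search(criterion["pattern"], text, re.MULTILINE) is not None
    if kind == "max_chars":
        return len(text) <= criterion["limit"]
    raise ValueError(f"unknown criterion type: {kind}")

def passes(criteria: list, text: str) -> bool:
    """An output passes only if it satisfies every criterion."""
    return all(check(c, text) for c in criteria)

def held_out_score(criteria: list, outputs: list) -> float:
    """Fraction of held-out outputs that pass, rounded to two decimals."""
    return round(sum(passes(criteria, o) for o in outputs) / len(outputs), 2)

# Illustrative criteria for a brief-writer style seed (values are assumed).
CRITERIA = [
    {"type": "section_present", "name": "Key Risks"},
    {"type": "regex", "pattern": r"^Confidence: (Low|Medium|High)\b"},
    {"type": "max_chars", "limit": 600},
]

good = "Recommendation: yes\nKey Risks: vendor lock-in\nConfidence: Medium"
bad = "Recommendation: yes"

print(held_out_score(CRITERIA, [good, good, good]))  # all three pass
print(held_out_score(CRITERIA, [good, good, bad]))   # two of three pass
```

Because every check is a deterministic function of the output text, the optimizer cannot influence the grade except by changing what the agent writes.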

Headline

| Backend | Seed | Held-out before | Held-out after | Nights | Tokens |
|---|---|---|---|---|---|
| Claude (Haiku 4.5) | brief-writer | 0.00 | 1.00 | 1 | ~6.7k |
| Codex (default) | brief-writer | 0.00 | 0.67 | 1 | ~5.1k |
| Codex (directive prompt) | brief-writer | 0.00 | 1.00 | 2 | ~10k |

Both backends took a deliberately deficient skill (a brief-writer with no risks section and no confidence level) and, within 1-2 sleep nights, proposed gated edits that lifted the held-out score to a perfect 1.00. The edits went into the protected SKILLOPT-SLEEP:LEARNED block; nothing else in the skill was touched.
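One way to guarantee that "nothing else in the skill was touched" is to confine every write to a delimited region of the skill file. The sketch below illustrates that idea only. The marker strings are hypothetical (this document names the block SKILLOPT-SLEEP:LEARNED but does not show its delimiter syntax), and set_learned_block is an illustrative helper, not a function from the SkillOpt codebase.

```python
# Hypothetical delimiters; the real marker syntax is not shown in this document.
BEGIN = "<!-- SKILLOPT-SLEEP:LEARNED:BEGIN -->"
END = "<!-- SKILLOPT-SLEEP:LEARNED:END -->"

def set_learned_block(skill_text: str, learned: str) -> str:
    """Replace only the learned block, creating it at the end if it is absent."""
    block = f"{BEGIN}\n{learned.strip()}\n{END}"
    start = skill_text.find(BEGIN)
    end = skill_text.find(END)
    if start == -1 or end == -1 or end < start:
        # No well-formed block yet: append one and leave the original text as is.
        return skill_text.rstrip("\n") + "\n\n" + block + "\n"
    # Splice the new block in; text before and after the markers is preserved.
    return skill_text[:start] + block + skill_text[end + len(END):]

seed = "# brief-writer\nWrite short briefs.\n"
night1 = set_learned_block(seed, "Include a Key Risks section.")
night2 = set_learned_block(night1, "Include Key Risks and a Confidence line.")
print(night2)
```

Each later night overwrites the block contents in this sketch; an engine that accumulates rules would pass the merged rule list as `learned` instead.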

This reproduces gbrain's published 0 → 1.00 headline with our engine and shows it works across two different agent runtimes — the core of the "Claude now, Codex next" plan.

The multi-night convergence (Codex, why it matters)

The 2-night Codex run is the most informative trace in this whole exercise:

  • Night 1 — added two precise rules (a Key Risks section, a Confidence: line). Held-out still 0.00: the rules were right but the agent, told to keep briefs short, was dropping them under length pressure.
  • Night 2 — the optimizer diagnosed its own residual failure and added a meta-rule: "Preserve required sections even when keeping the brief short; shorten the analysis before omitting Key Risks or Confidence." Held-out → 1.00.

That second edit is not pattern-matching a checklist — it is reasoning about why the previous night underperformed. This is exactly the iterative, slow-update behavior SkillOpt's design predicts, and it is the strongest argument for the sleep loop over a one-shot rewrite.

What the optimizer actually wrote

Claude synthesized a full format template:

**Recommendation:** [Clear yes/no or specific answer]
**Rationale:** [2-3 bullet points supporting the answer]
**Key Risks:** [Downsides, edge cases, or assumptions that could invalidate this]
**Confidence:** [High/Medium/Low] — [Why]

Codex wrote a terser rule:

For every brief, include a `Key Risks` section and end with
`Confidence: Low|Medium|High`.

Both are correct, general, reusable rules (not task-specific answers). Claude's fuller template made the agent satisfy the checks on 3/3 held-out items; Codex's terser rule landed 2/3 — the missing item is a consistency miss the agent would likely fix with one more night (see "Honest notes").

How to reproduce

# clone the benchmark data
git clone https://github.com/garrytan/gbrain-evals /tmp/gbrain-evals

cd <repo>/SkillOpt-sleep   # this worktree

# Claude backend
python3.12 -m skillopt.sleep.experiments.run_gbrain \
  --backend claude --model haiku --seeds brief-writer \
  --data-root /tmp/gbrain-evals/eval/data/skillopt-v1 \
  --nights 1 --limit-replay 3 --limit-holdout 3 --json

# Codex backend (auto-detects the real @openai/codex binary, not the wrapper)
python3.12 -m skillopt.sleep.experiments.run_gbrain \
  --backend codex --seeds brief-writer \
  --data-root /tmp/gbrain-evals/eval/data/skillopt-v1 \
  --nights 1 --limit-replay 3 --limit-holdout 3 --json

Honest notes (in the spirit of gbrain's own scorecard)

  • Latency: each CLI call is ~14-15 s of startup-dominated wall time, so runs were capped at 3 train + 3 held-out tasks and 1 night to keep them to ~2.5 min. The response cache makes re-scoring an unchanged (skill, memory) pair free.
  • Codex 0.67, not 1.00: a single terse edit in a single night undershoots on one held-out item. The two improvements below close the gap. We report the 0.67 as it is rather than dress it up.
  • 3 of gbrain's 4 seeds can be scored with no API calls beyond the agent's attempt: section_present, regex, and max_chars are pure-text checks. Only the quick-answerer seed (tool_called: search) needs a real tool loop, which is Phase-3 fresh replay.
  • The gate is real: every accepted edit had to beat the held-out score; a no-op night is rejected and the skill is left unchanged.
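The cache and gate notes above can be sketched together. This is an illustration under stated assumptions, not the engine's code: ScoreCache, gate, and toy_score are hypothetical names, the cache key (a SHA-256 of the skill and memory strings) is an assumed design, and toy_score stands in for replaying held-out tasks against a live model.

```python
import hashlib

class ScoreCache:
    """Memoize held-out scores by a hash of the (skill, memory) pair."""

    def __init__(self, score_fn):
        self._score_fn = score_fn
        self._store = {}
        self.misses = 0  # number of times the expensive scorer actually ran

    def score(self, skill: str, memory: str = "") -> float:
        key = hashlib.sha256(f"{skill}\x00{memory}".encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._score_fn(skill, memory)
        return self._store[key]

def gate(cache: ScoreCache, current: str, candidate: str, memory: str = ""):
    """Keep the candidate only if it strictly beats the current held-out score."""
    if candidate == current:
        return current, False  # no-op night: rejected without scoring
    if cache.score(candidate, memory) > cache.score(current, memory):
        return candidate, True
    return current, False

def toy_score(skill: str, memory: str) -> float:
    # Stand-in scorer for the demo; a real run would replay held-out tasks here.
    return 1.0 if "Key Risks" in skill else 0.0

cache = ScoreCache(toy_score)
seed = "Write a short brief."
better = seed + "\nInclude a Key Risks section."

kept_noop, ok_noop = gate(cache, seed, seed)      # rejected, nothing scored
kept, ok = gate(cache, seed, better)              # accepted, two scorer calls
kept_again, ok_again = gate(cache, seed, better)  # same verdict, served from cache
kept_worse, ok_worse = gate(cache, better, seed)  # regression is rejected
print(ok_noop, ok, ok_again, ok_worse, cache.misses)
```

The strict `>` comparison is what makes a night that changes nothing, or changes something without improving the score, leave the skill untouched.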

Improvements this run motivated (applied + verified)

  1. A more directive reflect prompt that aggregates the exact failing judge criteria and tells the optimizer to satisfy every one (gbrain's lesson: "the optimizer was never told what the scorer rewards"). Applied in skillopt/sleep/backend.py. Verified: lifted Codex from 0.67 → 1.00.
  2. Multi-night convergence — a terse first edit gets a sharper second pass; the night-2 trace above shows the optimizer self-correcting. Recommend nights >= 2 for real backends.
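The first improvement, aggregating the failing judge criteria into the reflect prompt, might look roughly like the sketch below. The actual prompt text in skillopt/sleep/backend.py is not reproduced in this document, so the wording, the build_reflect_prompt name, and the failure-list format are assumptions chosen only to show the aggregation step.

```python
from collections import Counter

def build_reflect_prompt(skill: str, failures: list[list[str]]) -> str:
    """Build a directive prompt listing every failing criterion, most frequent first.

    `failures` holds, for each replayed task, the descriptions of the
    criteria that task's output failed (an empty list means it passed).
    """
    counts = Counter(name for task in failures for name in task)
    if not counts:
        return "All judge criteria passed; propose no edit."
    bullet_lines = [
        f"- {name} (failed on {n} of {len(failures)} tasks)"
        for name, n in counts.most_common()
    ]
    return "\n".join([
        "The current skill is:",
        skill,
        "",
        "The scorer rejected outputs for these exact criteria:",
        *bullet_lines,
        "",
        "Propose general rules so that every criterion above is satisfied.",
    ])

prompt = build_reflect_prompt(
    "Write a short brief.",
    [
        ["section_present: Key Risks", "regex: Confidence"],
        ["section_present: Key Risks"],
        [],
    ],
)
print(prompt)
```

Listing the criteria verbatim addresses the lesson quoted above: the optimizer is told what the scorer rewards instead of having to infer it from low scores.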