Files
microsoft-SkillOpt/docs/sleep
Yifan Yang 4203086899 feat(sleep): real claude + codex backends, gbrain-evals benchmark, rule judges
Upgrade from mock-only to REAL multi-backend validation:

Backends (skillopt/sleep/backend.py):
  - CliBackend base: shared attempt/judge/reflect prompts, response cache,
    token accounting. Subclasses implement only _call().
  - ClaudeCliBackend: drives `claude -p --output-format text`.
  - CodexCliBackend: drives the REAL @openai/codex `exec -o <file>` for clean
    output; resolve_codex_path() skips the hermes wrapper at ~/.local/bin/codex.
  - reflect() now aggregates the exact failing judge criteria into the prompt
    (gbrain's lesson: tell the optimizer what the scorer rewards).

Rule judges (skillopt/sleep/judges.py): gbrain-compatible local scorers
  (section_present / regex / max_chars / contains / tool_called) — held-out
  scoring with no judge-API spend. TaskRecord gains a `judge` field +
  reference_kind="rule".

gbrain-evals adapter (experiments/gbrain_bench.py, run_gbrain.py): load
  garrytan/gbrain-evals skillopt-v1 deficient skills + train/held-out task
  sets and run our consolidate() loop against the SAME suite gbrain scores.

REAL results (docs/sleep/real_api_results.md), brief-writer seed, 1 night:
  - Claude (Haiku): held-out 0.00 -> 1.00
  - Codex:          held-out 0.00 -> 0.67
  Both proposed a correct, general format rule into the protected LEARNED block.

CLI: --backend {mock,claude,codex}, --codex-path, --model; experiment +
gbrain runners gain --limit-* cost controls. 17 tests pass.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-08 14:31:51 +00:00
..