mirror of
https://github.com/microsoft/SkillOpt.git
synced 2026-07-03 14:02:58 +08:00
Upgrade from mock-only to REAL multi-backend validation:
Backends (skillopt/sleep/backend.py):
- CliBackend base: shared attempt/judge/reflect prompts, response cache,
token accounting. Subclasses implement only _call().
- ClaudeCliBackend: drives `claude -p --output-format text`.
- CodexCliBackend: drives the REAL @openai/codex `exec -o <file>` for clean
output; resolve_codex_path() skips the hermes wrapper at ~/.local/bin/codex.
- reflect() now aggregates the exact failing judge criteria into the prompt
(gbrain's lesson: tell the optimizer what the scorer rewards).
Rule judges (skillopt/sleep/judges.py): gbrain-compatible local scorers
(section_present / regex / max_chars / contains / tool_called) — held-out
scoring with no judge-API spend. TaskRecord gains a `judge` field +
reference_kind="rule".
gbrain-evals adapter (experiments/gbrain_bench.py, run_gbrain.py): load
garrytan/gbrain-evals skillopt-v1 deficient skills + train/held-out task
sets and run our consolidate() loop against the SAME suite gbrain scores.
REAL results (docs/sleep/real_api_results.md), brief-writer seed, 1 night:
- Claude (Haiku): held-out 0.00 -> 1.00
- Codex: held-out 0.00 -> 0.67
Both proposed a correct, general format rule into the protected LEARNED block.
CLI: --backend {mock,claude,codex}, --codex-path, --model; experiment +
gbrain runners gain --limit-* cost controls. 17 tests pass.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>