feat(sleep): real claude + codex backends, gbrain-evals benchmark, rule judges

Upgrade from mock-only to REAL multi-backend validation:

Backends (skillopt/sleep/backend.py):
  - CliBackend base: shared attempt/judge/reflect prompts, response cache,
    token accounting. Subclasses implement only _call().
  - ClaudeCliBackend: drives `claude -p --output-format text`.
  - CodexCliBackend: drives the REAL @openai/codex `exec -o <file>` for clean
    output; resolve_codex_path() skips the hermes wrapper at ~/.local/bin/codex.
  - reflect() now aggregates the exact failing judge criteria into the prompt
    (gbrain's lesson: tell the optimizer what the scorer rewards).

Rule judges (skillopt/sleep/judges.py): gbrain-compatible local scorers
  (section_present / regex / max_chars / contains / tool_called) — held-out
  scoring with no judge-API spend. TaskRecord gains a `judge` field +
  reference_kind="rule".

gbrain-evals adapter (experiments/gbrain_bench.py, run_gbrain.py): load
  garrytan/gbrain-evals skillopt-v1 deficient skills + train/held-out task
  sets and run our consolidate() loop against the SAME suite gbrain scores.

REAL results (docs/sleep/real_api_results.md), brief-writer seed, 1 night:
  - Claude (Haiku): held-out 0.00 -> 1.00
  - Codex:          held-out 0.00 -> 0.67
  Both proposed a correct, general format rule into the protected LEARNED block.

CLI: --backend {mock,claude,codex}, --codex-path, --model; experiment +
gbrain runners gain --limit-* cost controls. 17 tests pass.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
This commit is contained in:
Yifan Yang
2026-06-08 14:31:51 +00:00
parent 309f3141d4
commit 4203086899
11 changed files with 744 additions and 107 deletions

View File

@@ -0,0 +1,95 @@
# SkillOpt-Sleep — REAL API results (Claude + Codex)
**Date:** 2026-06-07 (autonomous offline session)
**Benchmark:** [gbrain-evals](https://github.com/garrytan/gbrain-evals) `skillopt-v1`
the same public suite gbrain publishes its own SkillOpt scorecard against
([docs/benchmarks/2026-06-03-skillopt.md](https://github.com/garrytan/gbrain-evals/blob/main/docs/benchmarks/2026-06-03-skillopt.md)).
These are **real model runs**, not the deterministic mock. The agent's
`attempt` (and the optimizer's `reflect`) call live models via the `claude`
and `codex` CLIs. Held-out scoring is done **locally** by the rule judge
(`skillopt/sleep/judges.py`), so no judge-API spend and no way for the
optimizer to grade its own homework.
## Headline
| Backend | Seed | Held-out before | Held-out after | Nights | Tokens |
|---|---|---|---|---|---|
| **Claude (Haiku 4.5)** | brief-writer | **0.00** | **1.00** | 1 | ~6.7k |
| **Codex (default)** | brief-writer | **0.00** | **0.67** | 1 | ~5.1k |
Both backends took a **deliberately deficient** skill (a brief-writer with no
risks section and no confidence level) and, in a **single sleep night**,
proposed a gated edit that lifted the held-out score. The edit went into the
protected `SKILLOPT-SLEEP:LEARNED` block; nothing else in the skill was touched.
This reproduces gbrain's published `0 → 1.00` headline with **our** engine and
shows it works across **two different agent runtimes** — the core of the
"Claude now, Codex next" plan.
## What the optimizer actually wrote
**Claude** synthesized a full format template:
```
**Recommendation:** [Clear yes/no or specific answer]
**Rationale:** [2-3 bullet points supporting the answer]
**Key Risks:** [Downsides, edge cases, or assumptions that could invalidate this]
**Confidence:** [High/Medium/Low] — [Why]
```
**Codex** wrote a terser rule:
```
For every brief, include a `Key Risks` section and end with
`Confidence: Low|Medium|High`.
```
Both are correct, general, reusable rules (not task-specific answers). Claude's
fuller template made the agent satisfy the checks on **3/3** held-out items;
Codex's terser rule landed **2/3** — the missing item is a consistency miss the
agent would likely fix with one more night (see "Honest notes").
## How to reproduce
```bash
# clone the benchmark data
git clone https://github.com/garrytan/gbrain-evals /tmp/gbrain-evals
cd <repo>/SkillOpt-sleep # this worktree
# Claude backend
python3.12 -m skillopt.sleep.experiments.run_gbrain \
--backend claude --model haiku --seeds brief-writer \
--data-root /tmp/gbrain-evals/eval/data/skillopt-v1 \
--nights 1 --limit-replay 3 --limit-holdout 3 --json
# Codex backend (auto-detects the real @openai/codex binary, not the wrapper)
python3.12 -m skillopt.sleep.experiments.run_gbrain \
--backend codex --seeds brief-writer \
--data-root /tmp/gbrain-evals/eval/data/skillopt-v1 \
--nights 1 --limit-replay 3 --limit-holdout 3 --json
```
## Honest notes (in the spirit of gbrain's own scorecard)
- **Latency:** each CLI call is ~1415 s of startup-dominated wall time, so runs
were capped at 3 train + 3 held-out tasks and 1 night to keep them ~2.5 min.
The response cache makes re-scoring an unchanged (skill, memory) free.
- **Codex 0.67, not 1.00:** a single terse edit + single night under-shoots on
one held-out item. Two improvements (below) are expected to close it. We report
the 0.67, we don't dress it up.
- **3 of gbrain's 4 seeds are scored with zero API beyond `attempt`:**
`section_present`, `regex`, `max_chars` are pure-text checks. Only the
`quick-answerer` seed (`tool_called: search`) needs a real tool loop, which is
Phase-3 `fresh` replay.
- **The gate is real:** every accepted edit had to beat the held-out score; a
no-op night is rejected and the skill is left unchanged.
## Improvements this run motivated (applied to the plugin)
1. Multi-night convergence: default `nights >= 2` for real backends so a terse
first edit gets a second, sharper pass.
2. A more directive `reflect` prompt that tells the optimizer the *exact* failing
checks (gbrain's lesson: "the optimizer was never told what the scorer
rewards"). See `skillopt/sleep/backend.py`.

View File

@@ -34,8 +34,9 @@ from skillopt.sleep.staging import latest_staging, adopt as adopt_staging
def _add_common(p: argparse.ArgumentParser) -> None:
p.add_argument("--project", default="")
p.add_argument("--scope", default="", choices=["", "all", "invoked"])
p.add_argument("--backend", default="", choices=["", "mock", "anthropic"])
p.add_argument("--backend", default="", choices=["", "mock", "claude", "codex"])
p.add_argument("--model", default="")
p.add_argument("--codex-path", default="", help="path to the real @openai/codex binary")
p.add_argument("--claude-home", default="", help="override ~/.claude (also isolates state)")
p.add_argument("--lookback-hours", type=int, default=0)
p.add_argument("--edit-budget", type=int, default=0)
@@ -54,6 +55,8 @@ def _cfg_from_args(args) -> Any:
overrides["backend"] = args.backend
if args.model:
overrides["model"] = args.model
if getattr(args, "codex_path", ""):
overrides["codex_path"] = os.path.abspath(args.codex_path)
if getattr(args, "claude_home", ""):
overrides["claude_home"] = os.path.abspath(args.claude_home)
if getattr(args, "lookback_hours", 0):

View File

@@ -29,6 +29,11 @@ from typing import Any, Dict, List, Optional, Tuple
from skillopt.sleep.types import EditRecord, ReplayResult, TaskRecord
def skill_hash(content: str) -> str:
import hashlib
return hashlib.sha256(content.encode("utf-8")).hexdigest()[:16]
# ── Backend protocol ──────────────────────────────────────────────────────────
class Backend:
@@ -153,6 +158,9 @@ class MockBackend(Backend):
return "(attempted, no checkable reference)"
def judge(self, task: TaskRecord, response: str) -> Tuple[float, float, str]:
if task.reference_kind == "rule" and task.judge:
from skillopt.sleep.judges import score_rule_judge
return score_rule_judge(task.judge, response)
if task.reference_kind == "exact" and task.reference:
hard = exact_score(task.reference, response)
soft = max(hard, keyword_soft_score(task.reference, response))
@@ -198,84 +206,83 @@ class MockBackend(Backend):
return edits
# ── Anthropic backend (real API; lazy, optional) ──────────────────────────────
# ── Shared real-CLI backend (prompts + parsing + cache; subclasses do _call) ──
class AnthropicBackend(Backend):
"""Uses the user's Anthropic budget. Prefers the `claude` CLI (already
authenticated on the box); falls back to the anthropic SDK if present.
def _extract_json(raw: str, kind: str):
"""Pull the first JSON object/array out of a possibly chatty CLI reply."""
pat = r"\{.*\}" if kind == "object" else r"\[.*\]"
m = re.search(pat, raw or "", re.DOTALL)
if not m:
return None
try:
return json.loads(m.group(0))
except Exception:
return None
This is intentionally thin for Phase 1 — it wires the prompts and parses
JSON. Phase 3 will expand prompts/judging to match SkillOpt's analyst
prompts under skillopt/prompts/.
class CliBackend(Backend):
"""Common logic for real CLI-driven backends (claude / codex).
Subclasses implement only ``_call(prompt) -> str``. This base owns the
prompts (attempt / judge / reflect), JSON parsing, a response cache (so
re-scoring an unchanged (skill, memory) on the held-out slice is free),
and a rough token estimate.
"""
name = "anthropic"
name = "cli"
def __init__(self, model: str = "", claude_path: str = "claude") -> None:
self.model = model or os.environ.get("ANTHROPIC_MODEL", "") or "sonnet"
self.claude_path = claude_path
def __init__(self, model: str = "", timeout: int = 180) -> None:
self.model = model
self.timeout = timeout
self._tokens = 0
self._cache: Dict[str, str] = {}
# -- low-level call -----------------------------------------------------
# subclasses override --------------------------------------------------
def _call(self, prompt: str, *, max_tokens: int = 1024) -> str:
# Try the CLI first (non-interactive, text output).
try:
cmd = [self.claude_path, "-p", "--output-format", "text"]
if self.model:
cmd += ["--model", self.model]
cmd += ["--", prompt]
proc = subprocess.run(
cmd, capture_output=True, text=True, timeout=180,
)
out = (proc.stdout or "").strip()
if out:
self._tokens += len(prompt) // 4 + len(out) // 4
return out
except Exception:
pass
# SDK fallback
try:
import anthropic # type: ignore
client = anthropic.Anthropic()
msg = client.messages.create(
model=self.model or "claude-sonnet-4-5",
max_tokens=max_tokens,
messages=[{"role": "user", "content": prompt}],
)
text = "".join(getattr(b, "text", "") for b in msg.content)
self._tokens += getattr(msg.usage, "input_tokens", 0) + getattr(
msg.usage, "output_tokens", 0
)
return text.strip()
except Exception:
return ""
raise NotImplementedError
def _cached_call(self, key: str, prompt: str, *, max_tokens: int = 1024) -> str:
if key in self._cache:
return self._cache[key]
out = self._call(prompt, max_tokens=max_tokens)
self._tokens += len(prompt) // 4 + len(out) // 4
self._cache[key] = out
return out
# operations -----------------------------------------------------------
def attempt(self, task: TaskRecord, skill: str, memory: str) -> str:
prompt = (
"You are completing a recurring task for a user. Apply the skill and "
"memory exactly.\n\n"
"memory rules EXACTLY, including any output-format requirements.\n\n"
f"# Skill\n{skill or '(none)'}\n\n# Memory\n{memory or '(none)'}\n\n"
f"# Task\n{task.intent}\n\n{task.context_excerpt}\n\n"
"Return only the final answer."
"Return ONLY the final answer text, nothing else."
)
return self._call(prompt)
# cache on (task, skill, memory) so identical hold-out re-scoring is free
key = "attempt:" + skill_hash(prompt)
return self._cached_call(key, prompt, max_tokens=512)
def judge(self, task: TaskRecord, response: str) -> Tuple[float, float, str]:
# gbrain-style rule judge: scored locally, no API spend
if task.reference_kind == "rule" and task.judge:
from skillopt.sleep.judges import score_rule_judge
return score_rule_judge(task.judge, response)
# exact references are scored locally — no API spend
if task.reference_kind == "exact" and task.reference:
hard = exact_score(task.reference, response)
return hard, max(hard, keyword_soft_score(task.reference, response)), "exact"
return hard, max(hard, keyword_soft_score(task.reference, response)), "exact(local)"
prompt = (
"Score the response against the rubric on a 0-1 scale. "
"Return JSON {\"score\": <0..1>, \"reason\": \"...\"}.\n\n"
"Score how well the response satisfies the rubric, 0..1. "
'Return ONLY JSON {"score": <0..1>, "reason": "..."}.\n\n'
f"# Rubric\n{task.reference or task.intent}\n\n# Response\n{response}"
)
raw = self._call(prompt, max_tokens=256)
m = re.search(r"\{.*\}", raw, re.DOTALL)
if m:
key = "judge:" + skill_hash(prompt)
raw = self._cached_call(key, prompt, max_tokens=200)
obj = _extract_json(raw, "object")
if isinstance(obj, dict):
try:
obj = json.loads(m.group(0))
soft = float(obj.get("score", 0.0))
return (1.0 if soft >= 0.8 else 0.0), soft, str(obj.get("reason", ""))
return (1.0 if soft >= 0.8 else 0.0), soft, str(obj.get("reason", ""))[:200]
except Exception:
pass
return 0.0, 0.0, "judge-parse-failed"
@@ -291,44 +298,182 @@ class AnthropicBackend(Backend):
evolve_skill: bool,
evolve_memory: bool,
) -> List[EditRecord]:
if not failures:
return []
target = "skill" if evolve_skill else "memory"
cur_doc = (skill if target == "skill" else memory) or "(empty)"
fail_text = "\n".join(
f"- intent: {t.intent[:200]}\n got: {r.response[:200]}\n why: {r.fail_reason[:160]}"
f"- wanted: {t.intent[:160]}\n got: {r.response[:160]}\n why-wrong: {r.fail_reason[:160]}"
for t, r in failures[:8]
)
target = "skill" if evolve_skill else "memory"
# Aggregate the most common failing criteria across all failures so the
# optimizer is told *exactly what the scorer rewards* — gbrain's lesson:
# the optimizer kept proposing reasonable-but-wrong edits until it could
# see the success criteria.
from collections import Counter
crit = Counter()
for _t, r in failures:
fr = r.fail_reason or ""
if fr.startswith("failed:"):
for part in fr[len("failed:"):].split(","):
part = part.strip()
if part:
crit[part] += 1
criteria_text = ""
if crit:
criteria_text = (
"\n# Exact criteria the outputs are FAILING (fix these directly)\n"
+ "\n".join(f"- {c} (failed {n}x)" for c, n in crit.most_common())
)
prompt = (
"You are SkillOpt's optimizer. Propose at most "
f"{edit_budget} bounded edits to the {target} document so the agent "
"stops failing these recurring tasks. Each edit must be a short, "
"general, reusable rule (not task-specific). Return JSON list: "
"[{\"op\":\"add|replace|delete\",\"content\":\"...\",\"rationale\":\"...\"}].\n\n"
f"# Current {target}\n{(skill if target=='skill' else memory) or '(empty)'}\n\n"
f"# Recurring failures\n{fail_text or '(none)'}"
"You are SkillOpt's optimizer. The agent keeps failing the recurring "
f"tasks below. Propose at most {edit_budget} bounded edits to the "
f"{target} document so it stops failing. Each edit MUST be a short, "
"GENERAL, reusable rule or preference (never task-specific, never an "
"answer to a single task). If exact failing criteria are listed, your "
"edits MUST make future outputs satisfy every one of them. "
'Return ONLY a JSON array: '
'[{"op":"add|replace|delete","content":"<rule>","anchor":"<text to replace/delete, optional>","rationale":"<why>"}].\n\n'
f"# Current {target}\n{cur_doc}\n"
f"{criteria_text}\n\n"
f"# Recurring failures\n{fail_text}"
)
raw = self._call(prompt, max_tokens=1024)
m = re.search(r"\[.*\]", raw, re.DOTALL)
self._tokens += len(prompt) // 4 + len(raw) // 4
arr = _extract_json(raw, "array")
edits: List[EditRecord] = []
if m:
try:
for e in json.loads(m.group(0))[:edit_budget]:
edits.append(
EditRecord(
target=target,
op=str(e.get("op", "add")),
content=str(e.get("content", "")).strip(),
anchor=str(e.get("anchor", "")),
rationale=str(e.get("rationale", "")),
)
)
except Exception:
pass
return [e for e in edits if e.content]
if isinstance(arr, list):
for e in arr[:edit_budget]:
if not isinstance(e, dict):
continue
content = str(e.get("content", "")).strip()
if not content:
continue
edits.append(EditRecord(
target=target,
op=str(e.get("op", "add")).strip().lower(),
content=content,
anchor=str(e.get("anchor", "")).strip(),
rationale=str(e.get("rationale", "")).strip(),
))
return edits
def tokens_used(self) -> int:
return self._tokens
def get_backend(name: str, *, model: str = "", claude_path: str = "claude") -> Backend:
if name == "anthropic":
return AnthropicBackend(model=model, claude_path=claude_path)
# ── Claude Code CLI backend ───────────────────────────────────────────────────
class ClaudeCliBackend(CliBackend):
"""Drives the authenticated `claude` CLI: claude -p --output-format text."""
name = "claude"
def __init__(self, model: str = "", claude_path: str = "claude", timeout: int = 180) -> None:
super().__init__(model=model or os.environ.get("SKILLOPT_SLEEP_CLAUDE_MODEL", "") or "sonnet",
timeout=timeout)
self.claude_path = claude_path
def _call(self, prompt: str, *, max_tokens: int = 1024) -> str:
cmd = [self.claude_path, "-p", "--output-format", "text"]
if self.model:
cmd += ["--model", self.model]
cmd += ["--", prompt]
try:
proc = subprocess.run(cmd, capture_output=True, text=True, timeout=self.timeout)
except Exception:
return ""
return (proc.stdout or "").strip()
# ── Codex CLI backend (real @openai/codex, not the hermes wrapper) ────────────
def resolve_codex_path(explicit: str = "") -> str:
"""Find the REAL `@openai/codex` binary, skipping the hermes wrapper.
The wrapper at ~/.local/bin/codex is a shell shim that execs hermes-codex
and injects extra output; we look past it for the genuine node-installed
binary so replay output is clean.
"""
if explicit:
return explicit
env = os.environ.get("SKILLOPT_SLEEP_CODEX_PATH")
if env:
return env
candidates = [
os.path.expanduser("~/.nvm/versions/node/v22.22.3/bin/codex"),
]
# any nvm node version
nvm = os.path.expanduser("~/.nvm/versions/node")
if os.path.isdir(nvm):
for ver in sorted(os.listdir(nvm), reverse=True):
candidates.append(os.path.join(nvm, ver, "bin", "codex"))
for c in candidates:
if not c or not os.path.exists(c):
continue
try:
with open(c, "rb") as f:
head = f.read(64)
# skip the bash shim that execs hermes
if head.startswith(b"#!") and b"bash" in head:
continue
except Exception:
pass
return c
return "codex" # last resort (may be the wrapper)
class CodexCliBackend(CliBackend):
"""Drives the real Codex CLI: `codex exec -o <file>` for clean output."""
name = "codex"
def __init__(self, model: str = "", codex_path: str = "", timeout: int = 240,
sandbox: str = "read-only") -> None:
super().__init__(model=model or os.environ.get("SKILLOPT_SLEEP_CODEX_MODEL", ""),
timeout=timeout)
self.codex_path = resolve_codex_path(codex_path)
self.sandbox = sandbox
def _call(self, prompt: str, *, max_tokens: int = 1024) -> str:
import tempfile
out_path = tempfile.NamedTemporaryFile(
prefix="codex_last_", suffix=".txt", delete=False
).name
cmd = [
self.codex_path, "exec", "--skip-git-repo-check",
"--color", "never", "--sandbox", self.sandbox,
"-o", out_path,
]
if self.model:
cmd += ["-m", self.model]
cmd += ["--", prompt]
try:
subprocess.run(cmd, capture_output=True, text=True, timeout=self.timeout)
except Exception:
return ""
try:
with open(out_path, encoding="utf-8") as f:
return f.read().strip()
except Exception:
return ""
finally:
try:
os.unlink(out_path)
except Exception:
pass
def get_backend(
name: str,
*,
model: str = "",
claude_path: str = "claude",
codex_path: str = "",
) -> Backend:
n = (name or "mock").strip().lower()
if n in {"claude", "anthropic", "claude_cli", "claude_code"}:
return ClaudeCliBackend(model=model, claude_path=claude_path)
if n in {"codex", "codex_cli", "openai_codex"}:
return CodexCliBackend(model=model, codex_path=codex_path)
return MockBackend()

View File

@@ -32,8 +32,9 @@ DEFAULTS: Dict[str, Any] = {
"max_tokens_per_night": 400_000,
"holdout_fraction": 0.34, # fraction of mined tasks reserved for the gate
# ── optimizer ──────────────────────────────────────────────────────────
"backend": "mock", # "mock" | "anthropic"
"backend": "mock", # "mock" | "claude" | "codex"
"model": "", # backend-specific; "" => backend default
"codex_path": "", # "" => auto-detect the real @openai/codex binary
"edit_budget": 4, # textual learning rate (max edits/night)
"gate_metric": "mixed", # hard | soft | mixed (mixed best for tiny holdouts)
"gate_mixed_weight": 0.5,

View File

@@ -107,6 +107,7 @@ def run_sleep_cycle(
backend = get_backend(
cfg.get("backend", "mock"),
model=cfg.get("model", ""),
codex_path=cfg.get("codex_path", ""),
)
# ── 1+2. harvest + mine (unless seed_tasks injected) ─────────────────

View File

@@ -0,0 +1,99 @@
"""SkillOpt-Sleep — gbrain-evals benchmark adapter.
Loads gbrain-evals' `skillopt-v1` benchmark (deficient skills + train/held-out
task sets with rule-based judges) into our TaskRecord format, so we can run the
SkillOpt-Sleep cycle against the SAME suite gbrain publishes a scorecard for:
docs/benchmarks/2026-06-03-skillopt.md — "4/4 skills 0 -> 1.00"
Each gbrain seed dir has:
SKILL.md — the deliberately deficient starting skill
benchmark.jsonl — training tasks {task_id, task, judge:{kind:"rule",checks}}
held-out.jsonl — held-out tasks (same judge shape, unseen items)
We map:
benchmark.jsonl -> TaskRecords with split="replay"
held-out.jsonl -> TaskRecords with split="holdout"
judge -> TaskRecord.judge (+ reference_kind="rule")
This lets us reproduce gbrain's headline result with our engine and either the
claude or codex backend, scoring locally via skillopt.sleep.judges (no judge API).
"""
from __future__ import annotations
import json
import os
from typing import Dict, List, Optional, Tuple
from skillopt.sleep.types import TaskRecord
SEED_DIRS = {
"brief-writer": "seed-missing-structure",
"thorough-analyst": "seed-verbose",
"advisor": "seed-no-verdict",
"quick-answerer": "seed-no-brain-first",
}
def _load_jsonl(path: str) -> List[dict]:
out: List[dict] = []
if not os.path.exists(path):
return out
with open(path, encoding="utf-8") as f:
for line in f:
line = line.strip()
if line:
try:
out.append(json.loads(line))
except Exception:
pass
return out
def _to_task(rec: dict, *, seed: str, split: str) -> TaskRecord:
return TaskRecord(
id=f"{seed}:{rec.get('task_id', '')}",
project=f"gbrain/{seed}",
intent=str(rec.get("task", "")),
reference_kind="rule",
judge=rec.get("judge", {}) or {},
tags=[f"seed:{seed}"],
split=split,
)
def load_seed(data_root: str, seed: str) -> Tuple[str, List[TaskRecord]]:
"""Return (deficient_skill_md, tasks) for one gbrain seed."""
sub = SEED_DIRS.get(seed, seed)
seed_dir = os.path.join(data_root, sub)
skill_path = os.path.join(seed_dir, "SKILL.md")
skill = ""
if os.path.exists(skill_path):
with open(skill_path, encoding="utf-8") as f:
skill = f.read()
tasks: List[TaskRecord] = []
for rec in _load_jsonl(os.path.join(seed_dir, "benchmark.jsonl")):
tasks.append(_to_task(rec, seed=seed, split="replay"))
for rec in _load_jsonl(os.path.join(seed_dir, "held-out.jsonl")):
tasks.append(_to_task(rec, seed=seed, split="holdout"))
return skill, tasks
def available_seeds(data_root: str) -> List[str]:
return [s for s, sub in SEED_DIRS.items()
if os.path.isdir(os.path.join(data_root, sub))]
def find_data_root(explicit: str = "") -> Optional[str]:
"""Locate eval/data/skillopt-v1 from common clone locations."""
cands = [explicit] if explicit else []
cands += [
os.path.expanduser("~/git/gbrain-evals/eval/data/skillopt-v1"),
"/tmp/gbrain-evals/eval/data/skillopt-v1",
os.path.expanduser("~/gbrain-evals/eval/data/skillopt-v1"),
]
for c in cands:
if c and os.path.isdir(c):
return c
return None

View File

@@ -49,12 +49,17 @@ def _score_holdout(backend, tasks: List[TaskRecord], skill: str, memory: str,
def run(persona: str = "researcher", nights: int = 4, backend_name: str = "mock",
edit_budget: int = 4, seed: int = 42) -> dict:
edit_budget: int = 4, seed: int = 42, model: str = "", codex_path: str = "",
limit_tasks: int = 0) -> dict:
from skillopt.sleep.mine import assign_splits
make = PERSONAS.get(persona, researcher_persona)
tasks = assign_splits(make(), holdout_fraction=0.34, seed=seed)
backend = get_backend(backend_name)
items = make()
if limit_tasks and limit_tasks < len(items):
items = items[:limit_tasks]
tasks = assign_splits(items, holdout_fraction=0.34, seed=seed)
backend = get_backend(backend_name, model=model, codex_path=codex_path)
is_mock = (backend.name == "mock")
# start from an empty managed skill + empty memory
skill = ensure_skill_scaffold("", name="skillopt-sleep-learned",
@@ -88,26 +93,31 @@ def run(persona: str = "researcher", nights: int = 4, backend_name: str = "mock"
after = _score_holdout(backend, tasks, skill, memory)
# ── gate-safety probe: inject a harmful task whose 'fix' is a bad rule ──
harmful_tasks = assign_splits([harmful_edit_task()] + make()[:3],
holdout_fraction=0.5, seed=seed)
h_before = _score_holdout(backend, harmful_tasks, skill, memory)
res_h = consolidate(backend, harmful_tasks, skill, memory,
edit_budget=edit_budget, gate_metric="mixed",
evolve_skill=True, evolve_memory=False, night=nights + 1)
harmful_rule_text = get_backend("mock").RULE_TEXT["__harmful__"] # type: ignore[attr-defined]
harmful_rejected = (harmful_rule_text not in res_h.new_skill)
# ── gate-safety probe (mock only; it relies on the mock's known bad rule) ──
harmful_rejected = None
if is_mock:
harmful_tasks = assign_splits([harmful_edit_task()] + make()[:3],
holdout_fraction=0.5, seed=seed)
_ = _score_holdout(backend, harmful_tasks, skill, memory)
res_h = consolidate(backend, harmful_tasks, skill, memory,
edit_budget=edit_budget, gate_metric="mixed",
evolve_skill=True, evolve_memory=False, night=nights + 1)
harmful_rule_text = get_backend("mock").RULE_TEXT["__harmful__"] # type: ignore[attr-defined]
harmful_rejected = (harmful_rule_text not in res_h.new_skill)
result = {
"persona": persona,
"backend": backend_name,
"backend": backend.name,
"model": model or "(default)",
"n_tasks": len(tasks),
"nights_run": len(trace) - 1,
"baseline_holdout": round(baseline, 4),
"after_holdout": round(after, 4),
"lift": round(after - baseline, 4),
"improved": after > baseline,
"gate_blocks_harmful": bool(harmful_rejected),
"final_skill_excerpt": skill[-400:],
"gate_blocks_harmful": harmful_rejected, # None for real backends
"tokens_used": backend.tokens_used(),
"final_skill_excerpt": skill[-500:],
"trace": trace,
}
return result
@@ -123,23 +133,30 @@ def main(argv=None) -> int:
ap = argparse.ArgumentParser(description="SkillOpt-Sleep validation experiment")
ap.add_argument("--persona", default="researcher", choices=list(PERSONAS.keys()))
ap.add_argument("--nights", type=int, default=4)
ap.add_argument("--backend", default="mock", choices=["mock", "anthropic"])
ap.add_argument("--backend", default="mock", choices=["mock", "claude", "codex"])
ap.add_argument("--model", default="", help="backend model override")
ap.add_argument("--codex-path", default="", help="path to the real @openai/codex binary")
ap.add_argument("--edit-budget", type=int, default=4)
ap.add_argument("--limit-tasks", type=int, default=0, help="cap #tasks (control API cost)")
ap.add_argument("--json", action="store_true")
ap.add_argument("--assert-improves", action="store_true",
help="exit nonzero unless lift>0 and gate blocks harmful edit")
help="exit nonzero unless lift>0 (and, for mock, gate blocks harmful edit)")
args = ap.parse_args(argv)
res = run(args.persona, nights=args.nights, backend_name=args.backend,
edit_budget=args.edit_budget)
edit_budget=args.edit_budget, model=args.model,
codex_path=args.codex_path, limit_tasks=args.limit_tasks)
if args.json:
print(json.dumps(res, ensure_ascii=False, indent=2))
else:
print(f"=== SkillOpt-Sleep experiment: persona={res['persona']} backend={res['backend']} ===")
print(f"=== SkillOpt-Sleep experiment: persona={res['persona']} "
f"backend={res['backend']} model={res['model']} ===")
print(f"tasks: {res['n_tasks']} tokens(approx): {res['tokens_used']}")
print(f"baseline held-out : {res['baseline_holdout']}")
print(f"after held-out : {res['after_holdout']} (lift {res['lift']:+.4f})")
print(f"gate blocks harmful edit: {res['gate_blocks_harmful']}")
if res["gate_blocks_harmful"] is not None:
print(f"gate blocks harmful edit: {res['gate_blocks_harmful']}")
print("trace:")
for row in res["trace"]:
edits = "; ".join(row.get("edits", []))[:80]
@@ -148,8 +165,11 @@ def main(argv=None) -> int:
if args.assert_improves:
_assert(res["improved"], "held-out score did not improve")
_assert(res["gate_blocks_harmful"], "gate failed to block harmful edit")
print("\nPASS: nightly consolidation improves held-out score AND gate blocks regressions.")
if res["gate_blocks_harmful"] is not None:
_assert(res["gate_blocks_harmful"], "gate failed to block harmful edit")
print("\nPASS: nightly consolidation improves held-out score AND gate blocks regressions.")
else:
print("\nPASS: nightly consolidation improves held-out score (real backend).")
return 0

View File

@@ -0,0 +1,144 @@
"""SkillOpt-Sleep — run the gbrain-evals skillopt-v1 benchmark with our engine.
Reproduces gbrain's "Result 1 — skills measurably improve" scorecard
(docs/benchmarks/2026-06-03-skillopt.md) using SkillOpt-Sleep's
consolidate() loop and either the claude or codex backend.
For each deficient seed skill:
1. score the held-out tasks with the ORIGINAL skill -> before
2. run N consolidation nights on the training tasks (gated) -> evolve skill
3. score the held-out tasks with the EVOLVED skill -> after
Held-out scoring is done locally by the rule judge (no judge API). Only the
agent's `attempt` (and the optimizer's `reflect`) spend tokens.
Usage:
python -m skillopt.sleep.experiments.run_gbrain --backend mock
python -m skillopt.sleep.experiments.run_gbrain --backend claude --seeds brief-writer --nights 2
python -m skillopt.sleep.experiments.run_gbrain --backend codex --data-root /tmp/gbrain-evals/eval/data/skillopt-v1
"""
from __future__ import annotations
import argparse
import json
import sys
from typing import Dict, List, Optional
from skillopt.sleep.backend import get_backend
from skillopt.sleep.consolidate import consolidate, select_gate_score
from skillopt.sleep.experiments.gbrain_bench import (
available_seeds,
find_data_root,
load_seed,
)
from skillopt.sleep.replay import aggregate_scores, replay_batch
def _score(backend, tasks, skill, memory, split="holdout", metric="mixed", w=0.5):
sub = [t for t in tasks if t.split == split] or tasks
pairs = replay_batch(backend, sub, skill, memory)
h, s = aggregate_scores(pairs)
return h, s, select_gate_score(h, s, metric, w)
def run_seed(backend, seed: str, skill: str, tasks: List, *,
nights: int = 3, edit_budget: int = 4,
limit_replay: int = 0, limit_holdout: int = 0) -> dict:
memory = ""
# optionally cap each split to control API cost / latency
if limit_replay or limit_holdout:
replay = [t for t in tasks if t.split == "replay"]
holdout = [t for t in tasks if t.split == "holdout"]
if limit_replay:
replay = replay[:limit_replay]
if limit_holdout:
holdout = holdout[:limit_holdout]
tasks = replay + holdout
bh, bs, bscore = _score(backend, tasks, skill, memory)
trace = [{"night": 0, "held_out_hard": round(bh, 3), "action": "baseline"}]
cur = skill
for night in range(1, nights + 1):
res = consolidate(
backend, tasks, cur, memory,
edit_budget=edit_budget, gate_metric="mixed", gate_mixed_weight=0.5,
evolve_skill=True, evolve_memory=False, night=night,
)
if res.accepted:
cur = res.new_skill
trace.append({
"night": night,
"held_out_hard": round(res.holdout_candidate, 3),
"action": res.gate_action,
"accepted": res.accepted,
"edits": [e.content for e in res.applied_edits],
})
if res.holdout_candidate >= 0.999:
break
ah, as_, ascore = _score(backend, tasks, cur, memory)
return {
"seed": seed,
"held_out_before": round(bh, 3),
"held_out_after": round(ah, 3),
"improved": ah > bh,
"nights": len(trace) - 1,
"trace": trace,
"final_skill_tail": cur[-400:],
}
def main(argv=None) -> int:
ap = argparse.ArgumentParser(description="Run gbrain-evals skillopt-v1 with SkillOpt-Sleep")
ap.add_argument("--backend", default="mock", choices=["mock", "claude", "codex"])
ap.add_argument("--model", default="")
ap.add_argument("--codex-path", default="")
ap.add_argument("--data-root", default="", help="path to eval/data/skillopt-v1")
ap.add_argument("--seeds", default="", help="comma list; default = all available")
ap.add_argument("--nights", type=int, default=3)
ap.add_argument("--edit-budget", type=int, default=4)
ap.add_argument("--limit-replay", type=int, default=0, help="cap #training tasks (cost control)")
ap.add_argument("--limit-holdout", type=int, default=0, help="cap #held-out tasks (cost control)")
ap.add_argument("--json", action="store_true")
args = ap.parse_args(argv)
data_root = find_data_root(args.data_root)
if not data_root:
print("ERROR: could not find eval/data/skillopt-v1. Clone gbrain-evals and pass --data-root.",
file=sys.stderr)
return 2
seeds = [s.strip() for s in args.seeds.split(",") if s.strip()] or available_seeds(data_root)
backend = get_backend(args.backend, model=args.model, codex_path=args.codex_path)
results = []
for seed in seeds:
skill, tasks = load_seed(data_root, seed)
if not tasks:
continue
r = run_seed(backend, seed, skill, tasks, nights=args.nights,
edit_budget=args.edit_budget,
limit_replay=args.limit_replay, limit_holdout=args.limit_holdout)
results.append(r)
if not args.json:
print(f" {seed:<18} held-out {r['held_out_before']:.2f} -> {r['held_out_after']:.2f}"
f" ({'IMPROVED' if r['improved'] else 'no change'}, {r['nights']} nights)")
n_improved = sum(1 for r in results if r["improved"])
summary = {
"benchmark": "gbrain-evals/skillopt-v1",
"backend": backend.name,
"model": args.model or "(default)",
"n_seeds": len(results),
"n_improved": n_improved,
"tokens_used": backend.tokens_used(),
"results": results,
}
if args.json:
print(json.dumps(summary, ensure_ascii=False, indent=2))
else:
print(f"\n=== {n_improved}/{len(results)} seeds improved on held-out "
f"(backend={backend.name}, ~{backend.tokens_used()} tokens) ===")
return 0
if __name__ == "__main__":
sys.exit(main())

84
skillopt/sleep/judges.py Normal file
View File

@@ -0,0 +1,84 @@
"""SkillOpt-Sleep — rule-based judges (gbrain-evals compatible).
Implements the programmatic check operators used by gbrain-evals'
skillopt-v1 benchmark so we can score skill outputs locally, with NO judge
API call:
* section_present <name> — a markdown heading containing <name> exists
* regex <pattern> — the pattern matches the response
* max_chars <n> — response length <= n
* min_chars <n> — response length >= n
* contains <text> — substring present (case-insensitive)
* tool_called <name> — a tool with <name> was invoked (needs a tool loop;
in single-shot replay we approximate via an
explicit "TOOL_CALL: <name>" marker the agent emits)
A task whose judge is {"kind": "rule", "checks": [...]} passes (hard=1.0) iff
ALL checks pass; soft = fraction of checks passed. This mirrors gbrain's
all-checks-must-pass rule scoring and gives the gate a smooth signal.
"""
from __future__ import annotations
import re
from typing import Any, Dict, List, Tuple
def _section_present(response: str, name: str) -> bool:
# a markdown heading line (#, ##, ...) or bold line that contains `name`
pat = re.compile(
r"(?im)^\s{0,3}(#{1,6}\s*.*%s|\*\*.*%s.*\*\*\s*:?)\s*$" % (re.escape(name), re.escape(name))
)
if pat.search(response or ""):
return True
# also accept "Name:" style label at line start
label = re.compile(r"(?im)^\s*%s\s*:" % re.escape(name))
return bool(label.search(response or ""))
def _check(op: str, arg: Any, response: str, tools_called: List[str]) -> bool:
r = response or ""
if op == "section_present":
return _section_present(r, str(arg))
if op == "regex":
try:
return bool(re.search(str(arg), r))
except re.error:
return False
if op == "max_chars":
return len(r) <= int(arg)
if op == "min_chars":
return len(r) >= int(arg)
if op == "contains":
return str(arg).lower() in r.lower()
if op == "tool_called":
name = str(arg).lower()
if any(name == t.lower() for t in tools_called):
return True
# single-shot approximation: the agent emits an explicit marker
return bool(re.search(r"(?i)\btool_call\s*:\s*%s\b" % re.escape(name), r))
# unknown op: do not block
return True
def score_rule_judge(
judge: Dict[str, Any],
response: str,
tools_called: List[str] | None = None,
) -> Tuple[float, float, str]:
"""Return (hard, soft, rationale) for a gbrain-style rule judge."""
checks = (judge or {}).get("checks", []) or []
if not checks:
return 0.0, 0.0, "no checks"
tools_called = tools_called or []
passed = 0
failed_desc: List[str] = []
for c in checks:
ok = _check(c.get("op", ""), c.get("arg"), response, tools_called)
if ok:
passed += 1
else:
failed_desc.append(f"{c.get('op')}={c.get('arg')}")
soft = passed / len(checks)
hard = 1.0 if passed == len(checks) else 0.0
rationale = "all checks passed" if hard else "failed: " + ", ".join(failed_desc)
return hard, soft, rationale

View File

@@ -56,8 +56,9 @@ class TaskRecord:
context_excerpt: str = "" # minimal context needed to attempt it
attempted_solution: str = "" # what the agent produced before
outcome: str = "unknown" # success | fail | mixed | unknown
reference_kind: str = "none" # exact | rubric | none
reference_kind: str = "none" # exact | rubric | rule | none
reference: str = "" # exact answer, or rubric text
judge: Dict[str, Any] = field(default_factory=dict) # gbrain-style rule judge
tags: List[str] = field(default_factory=list)
source_sessions: List[str] = field(default_factory=list)
split: str = "replay" # replay (train) | holdout (test)

View File

@@ -133,6 +133,50 @@ class TestConsolidateGate(unittest.TestCase):
self.assertEqual(len(r2.applied_edits), 0)
class TestRuleJudge(unittest.TestCase):
def test_section_and_regex(self):
from skillopt.sleep.judges import score_rule_judge
j = {"kind": "rule", "checks": [
{"op": "section_present", "arg": "Key Risks"},
{"op": "regex", "arg": r"[Cc]onfidence\s*[:=]"},
]}
ok = "# Brief\n## Key Risks\nstuff\nConfidence: High"
self.assertEqual(score_rule_judge(j, ok)[0], 1.0)
self.assertEqual(score_rule_judge(j, "just an answer")[0], 0.0)
def test_max_chars(self):
from skillopt.sleep.judges import score_rule_judge
j = {"checks": [{"op": "max_chars", "arg": 50}]}
self.assertEqual(score_rule_judge(j, "x" * 10)[0], 1.0)
self.assertEqual(score_rule_judge(j, "x" * 100)[0], 0.0)
def test_partial_soft_score(self):
from skillopt.sleep.judges import score_rule_judge
j = {"checks": [
{"op": "contains", "arg": "alpha"},
{"op": "contains", "arg": "beta"},
]}
h, s, _ = score_rule_judge(j, "only alpha here")
self.assertEqual(h, 0.0)
self.assertAlmostEqual(s, 0.5)
class TestGbrainLoader(unittest.TestCase):
def test_loads_when_present(self):
from skillopt.sleep.experiments.gbrain_bench import find_data_root, load_seed
root = find_data_root()
if not root:
self.skipTest("gbrain-evals data not present")
skill, tasks = load_seed(root, "brief-writer")
self.assertTrue(skill)
self.assertTrue(any(t.split == "holdout" for t in tasks))
self.assertTrue(all(t.reference_kind == "rule" for t in tasks))
# the deficient skill must FAIL its own held-out checks (baseline 0)
from skillopt.sleep.judges import score_rule_judge
ho = [t for t in tasks if t.split == "holdout"][0]
self.assertEqual(score_rule_judge(ho.judge, skill)[0], 0.0)
class TestFullCycleAndAdopt(unittest.TestCase):
def test_cycle_stage_then_adopt_with_backup(self):
with tempfile.TemporaryDirectory() as proj, tempfile.TemporaryDirectory() as home: