Remove the per-cell full deployment grid section; keep the gate-safety stress
test, experience-replay scaling + night-by-night climb, the dream-diversity
ablation, the gbrain end-to-end result, and the scope/limitations. Renumber
sections; update the README pointer accordingly.
Replace the compact baseline->after grid with three grouped per-benchmark tables
(SearchQA / LiveMath / SpreadsheetBench), each showing all 3 targets x both modes
across every night (N0..N5) + Δ. This makes the trajectory visible (gains reach
a level and hold rather than being single lucky readings) and presents the full
18-cell evidence in a more readable form. Adds a footnote for LiveMath's 4-night
run (train split <50 tasks). Numbers are unchanged; only the presentation is richer.
Adds docs/sleep/RESULTS.md — the complete deployment-scale study behind
SkillOpt-Sleep, presented rigorously (named benchmarks, test sizes, metrics,
baseline->after, single shared protocol):
1. Gate-safety stress test: ungated nano SearchQA collapses 0.554->0.026
(-52.8); the gated twin holds 0.570 — the core argument for the design.
2. Full 18-cell deployment grid (3 benchmarks x 3 targets x gate/free),
shipped config: mean +0.5, range [-2.4, +5.1], nothing hidden.
3. Experience-replay scaling (recall_k 10->20->full: +3.1->+4.5->+5.6) and
the night-by-night climb (0.798->...->0.858, gate accepts as late as N5).
4. Dream-diversity fix as defense-in-depth: 3-config grid comparison
(-2.66/-52.8 -> +0.24/-4.0 -> +0.53/-2.4); the -52.8 cell becomes +2.7
from the dream fix alone.
5. gbrain end-to-end 0.00->1.00 on real Claude + Codex.
6. Honest scope: where it helps vs flat-in-noise, single-seed caveat with a
seed-robustness spot check, keep-the-gate-on.
README Results section now links prominently to it. Docs only; numbers are
self-contained with reproduce commands (no raw run dumps committed).
Label each result with its benchmark, test size, metric, target model, and gate
mode; show absolute baseline→after (not just Δ); state the single shared protocol
once. SearchQA recall-scaling table (1400-item test, SQuAD-EM, GPT-5.5, gated) +
SpreadsheetBench confirmation (280-item, cell-value compare, nano, gate-free) +
the gbrain end-to-end line. Keeps the single-seed / flat-on-noisy caveats.
Adds docs/sleep/README.md — a concise intro to the SkillOpt-Sleep plugin (what
it is, how to use it across the three agents, the opt-in experience-replay /
dream-rollout knobs, and headline results), linking to the full guide section.
Adds a News bullet pointing to it. No code changes.
Per maintainer request:
- Remove the internal/scratch docs/sleep/ tree (reports, raw logs, blog run
JSON, sweep.jsonl) — 23 files — and the root PUBLISHING.md. These were
working notes, not reference docs.
- Take the dedicated SkillOpt-Sleep content out of the main README (News bullet
+ section) and host it in the rendered guide instead: new section 9 in
docs/guideline.html (deployment companion, the three plugins, opt-in
experience replay / dream rollouts) with a sidebar entry.
- Fix the README's opening reference so "Documentation & Reproduction Guide"
links directly to the rendered GitHub Pages page, not the raw .html source.
- Repoint the now-removed docs/sleep links in the plugin READMEs to the guide
section.
The plugin code (plugins/, skillopt_sleep/) is unchanged; only docs move.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Wires two consolidation mechanisms (dream rollouts and experience replay) into the
shipped nightly cycle through three knobs, all default OFF so existing behavior is
unchanged:
- dream_rollouts (>1): multi-rollout contrastive reflection per task
- recall_k (>0): associative recall of the K most-similar past tasks (from a
capped task_archive persisted in state.json) into tonight's dream
- dream_factor (>0): synthetic task variants
New shared engine module skillopt_sleep/dream.py (recall_similar, dream_augment,
dream_consolidate) is called by both the plugin cycle and the experiment harness,
so reported numbers exercise the exact shipped code. Built on the existing
rollouts_k/sample_id support already in consolidate.py/rollout.py.
Validated (5 nights x 10 real tasks/night, full held-out test, GPT-5.5, gated):
the gain scales with recall depth on a clean signal —
SearchQA recall_k=10 +3.1, recall_k=20 +4.5, full-history reference +5.6;
SpreadsheetBench (nano, gate-free) +3.6. Flat within noise on saturated/noisy
cells. See docs/sleep/EXPERIENCE_REPLAY.md (+ raw runs under blog_runs/v2_port/).
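
The recall_k step can be pictured roughly as below. This is a minimal sketch, not
the code in skillopt_sleep/dream.py: the Jaccard token-overlap similarity, the
name recall_similar_sketch, and an archive of plain strings are all assumptions
made for illustration.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercased word tokens; a stand-in for the real similarity features."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def recall_similar_sketch(query: str, archive: list[str], k: int) -> list[str]:
    """Return up to k archived task texts most similar to `query`.

    Hypothetical helper: k <= 0 disables recall, mirroring the default-OFF knob.
    """
    if k <= 0:
        return []
    q = _tokens(query)

    def score(task: str) -> float:
        t = _tokens(task)
        union = q | t
        return len(q & t) / len(union) if union else 0.0

    # sorted() is stable, so equally similar tasks keep their archive order.
    return sorted(archive, key=score, reverse=True)[:k]
```

Raising k pulls more of the archive into tonight's dream, which is the axis the
recall_k=10 / 20 / full-history comparison above varies.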
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
All six adapters duplicated an identical reflect() that delegates to
run_minibatch_reflect. The copies had drifted: OfficeQA/DocVQA silently
dropped meta_skill_context and ALFWorld dropped update_mode, so the analysts
for those three benchmarks ran without inputs that every other benchmark
receives (the meta_skill_context drop is live under the default
use_meta_skill: true).
Move the delegation into EnvAdapter.reflect as one default that forwards
all kwargs uniformly, and delete the six overrides. reflect is no longer
abstract — adapters inherit it and override only for custom logic.
Net -225 lines. Behavior change: OfficeQA/DocVQA/ALFWorld reflect now
receive the kwargs they previously dropped; the three already-correct
benchmarks are unaffected.
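
The shape of the change can be sketched as follows. The class and function names
here (EnvAdapterSketch, run_minibatch_reflect_stub, ToyAdapter) and the
"patch" value for update_mode are hypothetical stand-ins; the real
run_minibatch_reflect signature is not reproduced.

```python
from abc import ABC, abstractmethod
from typing import Any


def run_minibatch_reflect_stub(results: list[dict], **kwargs: Any) -> list[dict]:
    """Placeholder for run_minibatch_reflect; records which kwargs arrived."""
    return [{"patch": f"saw {len(results)} results",
             "source_type": "stub",
             "kwargs": sorted(kwargs)}]


class EnvAdapterSketch(ABC):
    @abstractmethod
    def rollout(self, batch: Any) -> list[dict]:
        ...

    def reflect(self, results: list[dict], **kwargs: Any) -> list[dict]:
        # One non-abstract default: forward every keyword argument unchanged,
        # so no adapter can silently drop meta_skill_context or update_mode.
        return run_minibatch_reflect_stub(results, **kwargs)


class ToyAdapter(EnvAdapterSketch):
    """Inherits reflect(); overrides only what is benchmark-specific."""

    def rollout(self, batch: Any) -> list[dict]:
        return [{"id": "t1", "hard": 0, "soft": 0.0}]
```

Forwarding **kwargs wholesale, rather than naming each argument, is what keeps
future additions from drifting between adapters again.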
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Move Quick Start (now §3) ahead of the data chapter; renumber and fix
cross-references and the sidebar nav.
- Add §3.1 'Your First Demo': states plainly that data/ ships ID manifests
only, gives the one benchmark that runs out of the box (ALFWorld with its
bundled path split), and points other benchmarks to the data/README.md
materialization step. Also offers eval-only with ckpt/ skills as a
lighter sanity check.
- Reframe the data chapter as 'Run on Your Own Data' (§4) with a three-step
lead-in (split dir -> item schema -> --split_dir) and a pointer to §7.2
for new task shapes.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The new-benchmark guide and the env template README referred to the data
loader file as loader.py, but all six built-in benchmarks name it
dataloader.py (skillopt/envs/<name>/dataloader.py). Update the docs and
the template rename step to match the actual convention.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Exercised every plugin shell end to end on a brand-new "SQL must always
include LIMIT" analyst persona:
- Claude Code shell: harvest (2 real crafted transcripts -> 2 tasks), full run
(stages a proposal), adopt (honors the no-op-when-nothing-accepted contract).
- Codex: install.sh places ~/.codex/prompts/sleep.md + ~/.agents/skills correctly.
- Copilot: MCP server initialize -> tools/list -> tools/call returns engine output.
Genuine improvement on the fresh persona, both backends: held-out TEST 0.00 -> 1.00
(Sonnet->Haiku and Codex), the optimizer learning the user's LIMIT house rule and
generalizing to unseen queries. Honest finding: the first split left too few train
tasks (no-op night) — re-balancing fixed it; motivates a small-train-pool warning.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Remove every non-ASCII/CJK character for a professional open-source repo:
- harvest.py: drop hardcoded Chinese feedback phrases; add an env-based
extensibility hook (SKILLOPT_SLEEP_NEG_FEEDBACK / _POS_FEEDBACK) so any
locale can be added without baking one in. Verified with a German example.
- rollout.py / consolidate.py: English comments.
- README.md section heading + anchor, CONTROLLABLE_DREAMING.md, plugin.json,
marketplace.json (also fixed stale path skillopt-sleep-plugin ->
plugins/claude-code), SKILL.md: English only.
- Remove the internal WAKE_UP_SUMMARY.md note (not user-facing, not referenced).
Verified: zero CJK chars remain anywhere; 29 tests pass.
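
A hook of this kind might look like the sketch below. Only the variable name
SKILLOPT_SLEEP_NEG_FEEDBACK comes from this message; the comma-separated
format, the default phrase list, and the function names are assumptions, not
the harvest.py implementation.

```python
import os

# Illustrative English defaults; the shipped list is not reproduced here.
_DEFAULT_NEG = ("that's wrong", "not what i asked")


def feedback_phrases(env_var: str, defaults: tuple[str, ...]) -> tuple[str, ...]:
    """Defaults plus any extra phrases listed in `env_var` (comma-separated)."""
    raw = os.environ.get(env_var, "")
    extra = tuple(p.strip().lower() for p in raw.split(",") if p.strip())
    return defaults + extra


def is_negative(message: str) -> bool:
    """True if the message contains any known negative-feedback phrase."""
    text = message.lower()
    phrases = feedback_phrases("SKILLOPT_SLEEP_NEG_FEEDBACK", _DEFAULT_NEG)
    return any(p in text for p in phrases)
```

Reading the variable on every call, rather than at import time, lets a user
change the locale phrases without restarting a long-running process.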
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Three live runs exercise the new code paths on both runtimes:
A) Claude Sonnet->Haiku, gate=OFF + rollouts_k=2: brief-writer test 0->1.00,
action 'greedy_improved', val & test both reported (3-way split works).
B) Codex, gate=ON + rollouts_k=2: brief-writer test 0->1.00 in 2 nights.
C) Claude Sonnet->Haiku, thorough-analyst, 3 nights: slow-update fires and
distils a durable cross-night meta-rule (general, not task-specific).
Confirms gate-off greedy path, 3-way val/test split, multi-rollout, and the
gate-independent slow-update all work with real models on Claude AND Codex.
Raw logs under docs/sleep/raw/crosscheck_*.txt.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
benchmark_report.md now 7/7 direct + 4/4 transfer, all 0->1.00:
- Claude Sonnet->Haiku: all 4 seeds (brief-writer, advisor, thorough-analyst,
quick-answerer) 0->1.00
- Codex self-optimized: brief-writer, advisor, quick-answerer 0->1.00
- quick-answerer uses the real ./search tool loop on both runtimes.
This matches gbrain's own "4/4 skills 0->1.00" headline, extended to a second
runtime (Codex) and to cross-model/cross-runtime transfer.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
quick-answerer (judge: tool_called=search) reaches 0.00 -> 1.00 with Sonnet
optimizer -> Haiku target: the optimizer wrote an OVERRIDE of the "never use
tools" instruction and the Haiku target genuinely invoked the ./search shim.
All 4 gbrain skillopt-v1 seeds now at 0->1.00, matching gbrain's own headline.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Machine-generated benchmark_report.md from a 9-config sweep:
- Direct (Sonnet->Haiku): brief-writer/advisor/thorough-analyst 0->1.00
- Direct (Codex): brief-writer/advisor 0->1.00
- Transfer (4/4 positive, incl. cross-runtime Codex<->Claude): all 0->1.00
Cross-model transfer confirms the price-difference value prop: a skill
optimized on a cheap model deploys for free on an expensive one, and skills
move between Codex and Claude. sweep.jsonl is the committed source data.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Strong-optimizer/weak-target (Sonnet -> Haiku), fully isolated:
brief-writer, advisor, thorough-analyst all 0.00 -> 1.00 on held-out.
thorough-analyst shows 2-night convergence (0.33 -> 1.00). Codex self-optimized
brief-writer also 0 -> 1.00.
Key finding answering the optimizer/target-split request: the OPTIMIZER MODEL is
decisive — weak Haiku-as-optimizer is flaky (0 or 1.0 across runs), strong
Sonnet-as-optimizer reliably hits 1.0 on every seed. Raw logs under docs/sleep/raw/.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
- reflect() now retries once with a firmer "JSON only" instruction when the
first reply doesn't parse to a non-empty array. A transient non-JSON reply
otherwise wastes a whole night (gate sees no edits -> reject), which made
weak optimizers (Haiku) flaky across runs.
- FINAL_REPORT.md: document the context-leak discovery honestly; Codex cells
stand (clean), Claude cells recomputed under strict isolation.
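
The retry in the first bullet can be sketched as follows. This is an
illustration under assumptions: call_model stands in for the backend call, and
the wording of the firmer instruction is invented here.

```python
import json
from typing import Callable


def parse_edits(reply: str) -> list:
    """Return the reply as a non-empty JSON array, or [] for anything else."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return []
    return data if isinstance(data, list) and data else []


def reflect_with_retry(call_model: Callable[[str], str], prompt: str) -> list:
    """Ask once; if no usable edits come back, ask exactly once more."""
    edits = parse_edits(call_model(prompt))
    if edits:
        return edits
    # Hypothetical wording for the firmer second attempt.
    firmer = prompt + "\n\nReply with a JSON array only. No prose, no code fences."
    return parse_edits(call_model(firmer))
```

Capping the retry at one keeps the worst-case cost of a night bounded at two
reflect calls while still recovering from a single transient non-JSON reply.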
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
- Add skillopt-sleep-plugin/.claude-plugin/marketplace.json so the plugin is
installable via `/plugin marketplace add ./skillopt-sleep-plugin`.
- README install section (clone -> add marketplace -> install -> /sleep status).
- docs/sleep/FINAL_REPORT.md: the consolidated presented results doc (real
Claude+Codex, transfer, and the honest thorough-analyst failure + fix).
- sweep.py flushes stdout for live monitoring.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Codex with the directive reflect prompt + 2 nights converges 0.00 -> 1.00
(up from 0.67 single-night); its night-2 edit diagnoses its own residual
failure ("preserve required sections even when keeping the brief short").
Claude (Haiku) reaches 1.00 in one night. Update plugin README + skill to
reference --backend claude|codex (was anthropic) and surface the benchmark.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Upgrade from mock-only to REAL multi-backend validation:
Backends (skillopt/sleep/backend.py):
- CliBackend base: shared attempt/judge/reflect prompts, response cache,
token accounting. Subclasses implement only _call().
- ClaudeCliBackend: drives `claude -p --output-format text`.
- CodexCliBackend: drives the REAL @openai/codex `exec -o <file>` for clean
output; resolve_codex_path() skips the hermes wrapper at ~/.local/bin/codex.
- reflect() now aggregates the exact failing judge criteria into the prompt
(gbrain's lesson: tell the optimizer what the scorer rewards).
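
The base/subclass split might be pictured as below. It is a rough sketch, not
skillopt/sleep/backend.py: the class names, the call counter used as a crude
substitute for token accounting, and passing the prompt as a trailing argument
to the claude CLI are all assumptions.

```python
import subprocess
from abc import ABC, abstractmethod


class CliBackendSketch(ABC):
    """Shared prompt building and response cache; subclasses supply _call()."""

    def __init__(self) -> None:
        self._cache: dict[str, str] = {}
        self.calls = 0  # crude stand-in for token accounting

    @abstractmethod
    def _call(self, prompt: str) -> str:
        ...

    def complete(self, prompt: str) -> str:
        # Identical prompts are answered from the cache, not the CLI.
        if prompt not in self._cache:
            self.calls += 1
            self._cache[prompt] = self._call(prompt)
        return self._cache[prompt]

    def attempt(self, skill: str, task: str) -> str:
        return self.complete(f"{skill}\n\nTask: {task}")


class ClaudeCliSketch(CliBackendSketch):
    def _call(self, prompt: str) -> str:
        # Flags are those named above; the trailing prompt argument is assumed.
        proc = subprocess.run(
            ["claude", "-p", "--output-format", "text", prompt],
            capture_output=True, text=True, check=True,
        )
        return proc.stdout.strip()


class EchoBackend(CliBackendSketch):
    """Offline stand-in so the sketch runs without any CLI installed."""

    def _call(self, prompt: str) -> str:
        return prompt.upper()
```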
Rule judges (skillopt/sleep/judges.py): gbrain-compatible local scorers
(section_present / regex / max_chars / contains / tool_called) — held-out
scoring with no judge-API spend. TaskRecord gains a `judge` field +
reference_kind="rule".
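
A local scorer for these five rule kinds could look roughly like this. The rule
dict field names (kind, section, pattern, limit, text, tool) and the
markdown-heading test for section_present are assumptions; the actual
judges.py schema follows gbrain's and is not reproduced here.

```python
import re


def judge(rule: dict, output: str, tools_called: tuple[str, ...] = ()) -> bool:
    """Check one output against one rule; no model or API call involved."""
    kind = rule["kind"]
    if kind == "section_present":
        # Assumed meaning: some line is a markdown heading with that title.
        pattern = rf"^#+\s*{re.escape(rule['section'])}\s*$"
        return re.search(pattern, output, re.MULTILINE | re.IGNORECASE) is not None
    if kind == "regex":
        return re.search(rule["pattern"], output) is not None
    if kind == "max_chars":
        return len(output) <= rule["limit"]
    if kind == "contains":
        return rule["text"] in output
    if kind == "tool_called":
        return rule["tool"] in tools_called
    raise ValueError(f"unknown rule kind: {kind}")


def score(rules: list[dict], output: str, tools_called: tuple[str, ...] = ()) -> float:
    """Fraction of rules satisfied, in [0, 1]."""
    if not rules:
        return 0.0
    return sum(judge(r, output, tools_called) for r in rules) / len(rules)
```

Because every rule is a pure function of the output text and the tool trace,
held-out scoring is deterministic and costs nothing per evaluation.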
gbrain-evals adapter (experiments/gbrain_bench.py, run_gbrain.py): load
garrytan/gbrain-evals skillopt-v1 deficient skills + train/held-out task
sets and run our consolidate() loop against the SAME suite gbrain scores.
REAL results (docs/sleep/real_api_results.md), brief-writer seed, 1 night:
- Claude (Haiku): held-out 0.00 -> 1.00
- Codex: held-out 0.00 -> 0.67
Both proposed a correct, general format rule into the protected LEARNED block.
CLI: --backend {mock,claude,codex}, --codex-path, --model; experiment +
gbrain runners gain --limit-* cost controls. 17 tests pass.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Design for a nightly offline self-evolution plugin that synthesizes
SkillOpt (validation-gated bounded text optimizer), Claude Dreams
(offline memory consolidation), and the Agent-Sleep paper (short-term
to long-term experience). Harvests local ~/.claude transcripts, mines
recurring tasks, replays them offline, and consolidates memory+skills
behind a held-out gate.
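
The held-out gate at the end of that pipeline can be sketched as follows. The
strict "must improve" acceptance rule, the action strings, and the evaluate
callback signature are assumptions for illustration, not the shipped design.

```python
from typing import Callable, Sequence


def gated_update(
    current: str,
    candidate: str,
    evaluate: Callable[[str, Sequence[str]], float],
    val_tasks: Sequence[str],
    gate: bool = True,
) -> tuple[str, str]:
    """Return (skill_to_keep, action) for one night's proposed edit."""
    if not gate:
        # Gate off: the candidate is adopted without a held-out check.
        return candidate, "accepted_ungated"
    base = evaluate(current, val_tasks)
    new = evaluate(candidate, val_tasks)
    if new > base:  # assumed rule: accept only a strict improvement
        return candidate, "accepted"
    return current, "rejected"
```

With the gate on, a night that proposes a harmful edit leaves the skill
unchanged, so the worst case is a no-op rather than a regression.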
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
docs/reference/api.md previously documented a fictional EnvAdapter API
(execute / evaluate / build_prompt + DataItem / TaskResult) and a
BENCHMARK_REGISTRY that never existed in code. Anyone following the
documented contract would hit ImportError or TypeError on the first
instantiation.
Replace both pages (api.md and the new-benchmark.md guide) with the real shape
from skillopt/envs/base.py and skillopt/datasets/base.py:
- EnvAdapter: build_train_env, build_eval_env, rollout, reflect,
get_task_types (the 5 actual abstract methods).
- Rollout dicts: id / hard / soft required; everything else preserved
into RolloutResult.extras.
- Reflect dicts: {patch, source_type} schema as consumed by
run_minibatch_reflect.
- BatchSpec: slotted-but-mutable dataclass matching the actual
definition (payload defaults to None, metadata to dict()).
- SplitDataLoader.load_split_items as the one mandatory loader method.
- Registry: _ENV_REGISTRY in scripts/train.py (lazy try/except
ImportError block), not a non-existent BENCHMARK_REGISTRY in
skillopt/envs/__init__.py.
- _base_: documented as a string path, since the current YAML loader
only accepts strings.
The new-benchmark.md guide now walks through a doc-faithful worked
example with a real rollout helper (chat_target + scorer) instead of
hand-waving over the rollout step. Refs microsoft/SkillOpt#30.
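
The rollout-dict contract described in the list above (id / hard / soft required,
everything else kept as extras) can be illustrated with a small sketch. The
dataclass below is a hypothetical stand-in for RolloutResult, and the field
types are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any

REQUIRED_KEYS = ("id", "hard", "soft")


@dataclass
class RolloutResultSketch:
    id: str
    hard: float
    soft: float
    extras: dict[str, Any] = field(default_factory=dict)


def to_result(raw: dict[str, Any]) -> RolloutResultSketch:
    """Validate the required keys and preserve the rest under extras."""
    missing = [k for k in REQUIRED_KEYS if k not in raw]
    if missing:
        raise KeyError(f"rollout dict missing required keys: {missing}")
    extras = {k: v for k, v in raw.items() if k not in REQUIRED_KEYS}
    return RolloutResultSketch(raw["id"], raw["hard"], raw["soft"], extras)
```

An adapter can therefore attach any diagnostic fields it likes (traces, raw
model output) without changing the shared result type.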
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Add docs/guideline.html, a single self-contained documentation guide
(left-nav + content + on-this-page TOC) covering installation, data
preparation, training/eval, full configuration reference, framework
internals, and an API reference. Link it from the README with local,
htmlpreview, and GitHub Pages access instructions.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Skill optimization framework with training loop analogy
- 11 benchmarks, 4 model backends (Azure OpenAI, Claude, Codex, Qwen)
- WebUI for browser-based training control
- Pluggable architecture for extending benchmarks and backends