mirror of
https://github.com/microsoft/SkillOpt.git
synced 2026-07-03 14:02:58 +08:00
docs: move SkillOpt-Sleep into the guide; clean docs/sleep; fix guide link
Per maintainer request: - Remove the internal/scratch docs/sleep/ tree (reports, raw logs, blog run JSON, sweep.jsonl) — 23 files — and the root PUBLISHING.md. These were working notes, not reference docs. - Take the dedicated SkillOpt-Sleep content out of the main README (News bullet + section) and host it in the rendered guide instead: new section 9 in docs/guideline.html (deployment companion, the three plugins, opt-in experience replay / dream rollouts) with a sidebar entry. - Fix the README's opening reference so "Documentation & Reproduction Guide" links directly to the rendered GitHub Pages page, not the raw .html source. - Repoint the now-removed docs/sleep links in the plugin READMEs to the guide section. The plugin code (plugins/, skillopt_sleep/) is unchanged; only docs move. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
This commit is contained in:
@@ -1,81 +0,0 @@
|
||||
# Publishing SkillOpt-Sleep — how people install and use it
|
||||
|
||||
This is the open-source SkillOpt-Sleep tool: a nightly offline "sleep cycle" for
|
||||
local coding agents, shipped as plugins for **Claude Code**, **Codex**, and
|
||||
**Copilot**. One engine ([`skillopt_sleep/`](skillopt_sleep)), three thin shells
|
||||
([`plugins/`](plugins)), decoupled from the research code.
|
||||
|
||||
## How end users install it
|
||||
|
||||
### Claude Code
|
||||
|
||||
The Claude Code plugin ships a marketplace manifest at
|
||||
`plugins/claude-code/.claude-plugin/marketplace.json`.
|
||||
|
||||
```text
|
||||
# inside Claude Code:
|
||||
/plugin marketplace add microsoft/SkillOpt
|
||||
/plugin install skillopt-sleep
|
||||
/sleep status
|
||||
```
|
||||
|
||||
(`/plugin marketplace add <owner>/<repo>` reads the marketplace manifest from the
|
||||
repo; the entry points at `plugins/claude-code`.)
|
||||
|
||||
### Codex
|
||||
|
||||
```bash
|
||||
git clone https://github.com/microsoft/SkillOpt.git
|
||||
cd SkillOpt
|
||||
bash plugins/codex/install.sh # installs /sleep prompt + skill
|
||||
export SKILLOPT_SLEEP_REPO="$(pwd)" # so the runner is found anywhere
|
||||
# then, in Codex: /sleep status
|
||||
```
|
||||
|
||||
### Copilot
|
||||
|
||||
```bash
|
||||
git clone https://github.com/microsoft/SkillOpt.git
|
||||
# register the MCP server with your Copilot config (see plugins/copilot/README.md
|
||||
# and plugins/copilot/mcp-config.example.json), pointing SKILLOPT_SLEEP_REPO at
|
||||
# the clone. Then ask Copilot to "run the sleep cycle".
|
||||
```
|
||||
|
||||
Requirements for all three: Python ≥ 3.10, and the corresponding agent CLI on
|
||||
PATH. The default backend is `mock` (no API spend); `--backend claude|codex`
|
||||
uses the user's own budget.
|
||||
|
||||
## Wider distribution (optional, maintainer steps)
|
||||
|
||||
1. **GitHub Release.** Tag the milestone so users can pin a version:
|
||||
```bash
|
||||
gh release create sleep-v0.1.0 --title "SkillOpt-Sleep v0.1.0" \
|
||||
--notes "Nightly offline self-evolution plugins for Claude Code, Codex, Copilot."
|
||||
```
|
||||
|
||||
2. **Official Claude Code plugin marketplace.** To appear in the public
|
||||
directory, open a PR adding a `marketplace.json` entry to
|
||||
[`anthropics/claude-code` / the official marketplace repo], pointing at
|
||||
`microsoft/SkillOpt` subdir `plugins/claude-code`. Users could then
|
||||
`/plugin install skillopt-sleep@<official-marketplace>`.
|
||||
|
||||
3. **PyPI (optional).** `skillopt_sleep` is a standalone package
|
||||
(`pyproject.toml` lists it). A `pip install skillopt-sleep` distribution would
|
||||
let users run `python -m skillopt_sleep ...` without cloning. Build with
|
||||
`python -m build` and publish with `twine`.
|
||||
|
||||
4. **README News.** The main [`README.md`](README.md) already announces the
|
||||
release and links to [`plugins/`](plugins) and
|
||||
[`docs/sleep/FINAL_REPORT.md`](docs/sleep/FINAL_REPORT.md).
|
||||
|
||||
## Verifying a release works
|
||||
|
||||
```bash
|
||||
# deterministic, no API key:
|
||||
python -m skillopt_sleep.experiments.run_experiment --persona researcher --assert-improves
|
||||
# the unit suite:
|
||||
python -m unittest tests.test_sleep_engine
|
||||
# the MCP server (Copilot):
|
||||
printf '%s\n' '{"jsonrpc":"2.0","id":1,"method":"tools/list"}' \
|
||||
| SKILLOPT_SLEEP_REPO="$(pwd)" python3 plugins/copilot/mcp_server.py
|
||||
```
|
||||
56
README.md
56
README.md
@@ -4,12 +4,11 @@
|
||||
|
||||
[](https://microsoft.github.io/SkillOpt/) [](https://arxiv.org/abs/2605.23904) [](https://youtu.be/JUBMDTCiM0M) [](https://pypi.org/project/skillopt/) [](https://www.python.org/) [](LICENSE)
|
||||
|
||||
> 📖 **For installation, data preparation, training/eval commands, the full configuration reference, and framework internals, see the [Documentation & Reproduction Guide](docs/guideline.html)** — view it [rendered online](https://htmlpreview.github.io/?https://github.com/microsoft/SkillOpt/blob/main/docs/guideline.html) or via [GitHub Pages](https://microsoft.github.io/SkillOpt/docs/guideline.html).
|
||||
> 📖 **For installation, data preparation, training/eval commands, the full configuration reference, and framework internals, see the [Documentation & Reproduction Guide](https://microsoft.github.io/SkillOpt/docs/guideline.html)** (rendered on GitHub Pages).
|
||||
|
||||
---
|
||||
|
||||
## News 🔥🔥🔥
|
||||
- **[2026-06-14]** 😴 **SkillOpt-Sleep (preview).** A nightly *sleep cycle* for local coding agents (Claude Code / Codex / Copilot): review past sessions offline, replay recurring tasks, and consolidate validated skills behind a held-out gate. This is an early **preview** — open-source and decoupled from the paper code — that we'll keep iterating on. See [`plugins/`](plugins/) and the [section below](#-skillopt-sleep--the-deployment-time-companion).
|
||||
- **[2026-06-03]** 🎉 **[gbrain](https://github.com/garrytan/gbrain), [gbrain-evals](https://github.com/garrytan/gbrain-evals/blob/main/docs/benchmarks/2026-06-03-skillopt.md), and [darwin-skill](https://github.com/alchaincyf/darwin-skill) have all integrated SkillOpt.**
|
||||
- **[2026-06-02]** 🎉 **SkillOpt [v0.1.0](https://github.com/microsoft/SkillOpt/releases/tag/v0.1.0) is now available on [PyPI](https://pypi.org/project/skillopt/)!** Install with `pip install skillopt`. This initial release includes the full training loop (rollout → reflect → aggregate → select → update → evaluate), multi-backend support (OpenAI / Azure / Claude / Qwen / MiniMax), six built-in benchmarks, and WebUI dashboard.
|
||||
|
||||
@@ -53,59 +52,6 @@ https://github.com/user-attachments/assets/eb12d3bc-371c-467f-904d-91b61f339ed7
|
||||
|
||||
---
|
||||
|
||||
## 😴 SkillOpt-Sleep — the deployment-time companion
|
||||
|
||||
> **Preview.** SkillOpt-Sleep is an early preview that we are actively iterating
|
||||
> on; interfaces and defaults may change. Feedback and issues are welcome.
|
||||
|
||||
SkillOpt (above) trains a skill offline on a benchmark. **SkillOpt-Sleep**
|
||||
applies the same discipline to *your own daily usage*: it gives a local coding
|
||||
agent a nightly **sleep cycle** that reviews your past sessions, replays your
|
||||
recurring tasks on your own API budget, and consolidates what it learns into
|
||||
**validated** long-term memory and skills — behind a held-out gate, staged for
|
||||
your review. The agent gets better the more you use it, with no weight training.
|
||||
|
||||
It synthesizes **SkillOpt** (validation-gated bounded text edits), **Claude
|
||||
Dreams** (offline consolidation; review-then-adopt), and the **agent sleep**
|
||||
idea (short-term experience → long-term competence). One "night":
|
||||
|
||||
```
|
||||
harvest Claude Code / Codex Desktop transcripts → mine recurring tasks → replay offline
|
||||
→ consolidate (reflect → bounded edit → GATE on real held-out tasks)
|
||||
→ stage proposal → (you) adopt
|
||||
```
|
||||
|
||||
**Plugins for three agents** (one engine, three thin shells — see [`plugins/`](plugins/)):
|
||||
|
||||
| Platform | Folder | Install |
|
||||
|---|---|---|
|
||||
| **Claude Code** | [`plugins/claude-code`](plugins/claude-code) | `/plugin marketplace add ./plugins/claude-code` → `/skillopt-sleep` |
|
||||
| **Codex** | [`plugins/codex`](plugins/codex) | `bash plugins/codex/install.sh` → `skillopt-sleep` skill |
|
||||
| **Copilot** | [`plugins/copilot`](plugins/copilot) | register `plugins/copilot/mcp_server.py` as an MCP server |
|
||||
|
||||
**Validated on real models.** On the public
|
||||
[gbrain-evals](https://github.com/garrytan/gbrain-evals) `skillopt-v1` benchmark,
|
||||
deficient skills go **0.00 → 1.00** on held-out sets with **both Claude and
|
||||
Codex** (all 4 seeds, including a real tool-use loop), cross-model transfer is
|
||||
positive, and the gate blocks regressions
|
||||
([full results](docs/sleep/FINAL_REPORT.md)).
|
||||
|
||||
> **Open-source tool, decoupled from the research.** The engine lives in the
|
||||
> top-level [`skillopt_sleep/`](skillopt_sleep) package with **zero dependency**
|
||||
> on the paper's `skillopt/` experiment code (the validation gate is vendored).
|
||||
> Controls — optional gate, multi-rollout contrastive reflection, token/time
|
||||
> budget, multi-objective reward, user preferences, optimizer/target split — are
|
||||
> documented in [`docs/sleep/CONTROLLABLE_DREAMING.md`](docs/sleep/CONTROLLABLE_DREAMING.md).
|
||||
|
||||
Deterministic proof (no API key): `python -m skillopt_sleep.experiments.run_experiment --persona researcher --assert-improves`.
|
||||
|
||||
For local sleep cycles, transcript source and replay backend are separate knobs:
|
||||
use `--source claude` for Claude Code transcripts, `--source codex` for Codex
|
||||
Desktop archived sessions under `~/.codex/archived_sessions`, and
|
||||
`--backend codex` only when you want the replay/optimizer to spend Codex budget.
|
||||
|
||||
---
|
||||
|
||||
## Extensibility & WebUI
|
||||
|
||||
### Adding a new backend
|
||||
|
||||
@@ -288,6 +288,12 @@
|
||||
<a href="#cli">CLI scripts</a>
|
||||
<a href="#webui">WebUI</a>
|
||||
</div>
|
||||
<div class="group">
|
||||
<div class="glabel"><span class="num">9</span> SkillOpt-Sleep</div>
|
||||
<a href="#sleep">Deployment companion</a>
|
||||
<a href="#sleep-plugins">Plugins (3 agents)</a>
|
||||
<a href="#sleep-replay">Experience replay (opt-in)</a>
|
||||
</div>
|
||||
</nav>
|
||||
|
||||
<!-- ───────────── MAIN CONTENT ───────────── -->
|
||||
@@ -917,6 +923,57 @@ PY</code></pre>
|
||||
<tr><td><code>--share</code></td><td class="def">off</td><td>Create a public Gradio share link.</td></tr>
|
||||
</tbody>
|
||||
</table></div>
|
||||
</section>
|
||||
|
||||
<section id="sleep">
|
||||
<h2>9.1 SkillOpt-Sleep — the deployment-time companion (preview) <a class="anchor" href="#sleep">#</a></h2>
|
||||
<p><strong>SkillOpt-Sleep</strong> applies SkillOpt's discipline to your own daily usage. It gives a
|
||||
local coding agent a nightly <em>sleep cycle</em> that reviews your past sessions, replays your
|
||||
recurring tasks on your own API budget, and consolidates what it learns into <strong>validated</strong>
|
||||
long-term memory and skills — behind a held-out gate, staged for your review. The agent gets better
|
||||
the more you use it, with no weight training and zero inference-time overhead. It is an early
|
||||
<strong>preview</strong> we are actively iterating on; interfaces and defaults may change.</p>
|
||||
<p>One "night":</p>
|
||||
<pre><code>harvest Claude Code / Codex transcripts → mine recurring tasks → replay offline
|
||||
→ consolidate (reflect → bounded edit → GATE on real held-out tasks)
|
||||
→ stage proposal → (you) adopt</code></pre>
|
||||
<p>The engine lives in the top-level <code>skillopt_sleep/</code> package with <strong>zero dependency</strong>
|
||||
on the paper's <code>skillopt/</code> experiment code (the validation gate is vendored). Deterministic
|
||||
proof, no API key required:
|
||||
<code>python -m skillopt_sleep.experiments.run_experiment --persona researcher --assert-improves</code>.</p>
|
||||
|
||||
<h2 id="sleep-plugins">9.2 Plugins (three agents) <a class="anchor" href="#sleep-plugins">#</a></h2>
|
||||
<p>One engine, thin per-agent shells (see <a href="https://github.com/microsoft/SkillOpt/tree/main/plugins"><code>plugins/</code></a>):</p>
|
||||
<div class="table-wrap"><table>
|
||||
<thead><tr><th>Platform</th><th>Folder</th><th>Install</th></tr></thead>
|
||||
<tbody>
|
||||
<tr><td>Claude Code</td><td><code>plugins/claude-code</code></td><td><code>/plugin marketplace add ./plugins/claude-code</code> → <code>/skillopt-sleep</code></td></tr>
|
||||
<tr><td>Codex</td><td><code>plugins/codex</code></td><td><code>bash plugins/codex/install.sh</code> → <code>skillopt-sleep</code> skill</td></tr>
|
||||
<tr><td>Copilot</td><td><code>plugins/copilot</code></td><td>register <code>plugins/copilot/mcp_server.py</code> as an MCP server</td></tr>
|
||||
</tbody>
|
||||
</table></div>
|
||||
<p>Transcript source and replay backend are separate knobs: <code>--source claude</code> for Claude Code
|
||||
transcripts, <code>--source codex</code> for Codex Desktop archived sessions under
|
||||
<code>~/.codex/archived_sessions</code>, and <code>--backend codex</code> only when you want the
|
||||
replay/optimizer to spend Codex budget.</p>
|
||||
|
||||
<h2 id="sleep-replay">9.3 Experience replay & dream rollouts (opt-in) <a class="anchor" href="#sleep-replay">#</a></h2>
|
||||
<p>Two consolidation mechanisms, both default <strong>off</strong> (so behavior is unchanged unless
|
||||
enabled). They strengthen the nightly update when your tasks have a clean correctness signal; the
|
||||
validation gate still governs what ships.</p>
|
||||
<div class="table-wrap"><table>
|
||||
<thead><tr><th>Config knob</th><th>Default</th><th>Effect</th></tr></thead>
|
||||
<tbody>
|
||||
<tr><td><code>dream_rollouts</code></td><td class="def">1</td><td>Run each task K times and learn from the good-vs-bad contrast (contrastive reflection).</td></tr>
|
||||
<tr><td><code>recall_k</code></td><td class="def">0</td><td>Associative recall — pull the K most-similar past tasks (from a persisted archive) into tonight's dream.</td></tr>
|
||||
<tr><td><code>dream_factor</code></td><td class="def">0</td><td>Add N lightweight synthetic variants of each task.</td></tr>
|
||||
</tbody>
|
||||
</table></div>
|
||||
<p>On a clean-signal benchmark the gain scales with recall depth (deployment protocol: 5 nights ×
|
||||
10 new real tasks/night, full held-out test, GPT-5.5, gated): <code>recall_k=10</code> → +3.1 pts,
|
||||
<code>recall_k=20</code> → +4.5 pts, full-history replay reference → +5.6 pts; a second benchmark
|
||||
(SpreadsheetBench, GPT-5.4-nano, gate-free) gives +3.6 pts. On saturated or noisy tasks the effect is
|
||||
flat within run-to-run noise (±1–2 pts). Keep the gate on; it bounds the downside.</p>
|
||||
|
||||
<div class="footer-note">
|
||||
SkillOpt — Executive Strategy for Self-Evolving Agent Skills ·
|
||||
|
||||
@@ -1,134 +0,0 @@
|
||||
# SkillOpt-Sleep — controllable dreaming architecture
|
||||
|
||||
The sleep engine is no longer a single fixed pipeline. It is a controllable
|
||||
offline "dream / imagination" loop the user steers. This documents the knobs
|
||||
added in the four-stage refactor and how they map to the user's design.
|
||||
|
||||
## Transcript sources
|
||||
|
||||
Sleep separates the source of past sessions from the backend used to replay and
|
||||
optimize tasks:
|
||||
|
||||
```bash
|
||||
python -m skillopt_sleep dry-run --project "$(pwd)" --source claude --backend mock
|
||||
python -m skillopt_sleep dry-run --project "$(pwd)" --source codex --backend mock
|
||||
python -m skillopt_sleep run --project "$(pwd)" --source codex --backend codex
|
||||
```
|
||||
|
||||
`--source claude` reads Claude Code transcripts from `~/.claude/projects`.
|
||||
`--source codex` reads Codex Desktop archives from
|
||||
`~/.codex/archived_sessions`. `--source auto` tries Codex archives first, then
|
||||
falls back to Claude Code transcripts. Use `--codex-home /path/to/.codex` or
|
||||
`--claude-home /path/to/.claude` to point at non-default homes.
|
||||
|
||||
## The mental model
|
||||
|
||||
> Sleep = an offline imagination rollout. Re-run the user's real
|
||||
> tasks (and dream-augmented variants) many times, look at what went well vs
|
||||
> badly, distil durable rules, and keep only what survives a real-task check —
|
||||
> unless the user opts out of that check.
|
||||
|
||||
## 1. Data splits — train (dream) / val (real) / test (real)
|
||||
|
||||
The anti-overfitting foundation:
|
||||
|
||||
| Split | Source | Role |
|
||||
|---|---|---|
|
||||
| **train** | real tasks **+ dream-augmented** variants | drives reflection (the imagination pool — over-dreaming is fine) |
|
||||
| **val** | **real only**, disjoint from test | gates updates (prevents overfitting) |
|
||||
| **test** | **real only**, disjoint from val | the final held-out measure, kept close to real usage |
|
||||
|
||||
Hard guarantee (unit-tested): a task with `origin='dream'` **never** lands in
|
||||
val or test. `assign_splits(val_fraction, test_fraction)` does the deterministic
|
||||
3-way split; gbrain's own held-out maps to our `test`.
|
||||
|
||||
## 2. The validation gate is optional
|
||||
|
||||
`--gate on` (default): an edit is accepted only if it strictly improves the
|
||||
**val** score — the SkillOpt discipline that blocks regressions and reward
|
||||
hacking.
|
||||
|
||||
`--gate off`: greedy. Edits are kept without the hard val-improvement
|
||||
requirement (the user decides they don't want hard filtering), but val/test
|
||||
movement is still reported (`greedy_improved` / `greedy_regressed` /
|
||||
`greedy_flat`) so nothing is hidden.
|
||||
|
||||
## 3. Slow-update — long-term memory, gate-independent
|
||||
|
||||
Even with the gate off, the engine runs a **slow-update** at the end of the
|
||||
nights: it compares behaviour under the first-night vs final skill across the
|
||||
val tasks and distils durable longitudinal guidance into a **protected field**
|
||||
(`<!-- SLOW_UPDATE_START --> … <!-- SLOW_UPDATE_END -->`, the same markers as
|
||||
the main SkillOpt repo). Step-level edits never touch this field. This is the
|
||||
"short-term experience → long-term memory" consolidation; turning the gate off
|
||||
does not cost you long-term memory.
|
||||
|
||||
## 4. Budget — the user picks the spend
|
||||
|
||||
`--budget-tokens N` / `--budget-minutes M`: the engine auto-plans depth
|
||||
(`nights × rollouts_per_task`) to fit the budget (`plan_depth`). Stops cleanly
|
||||
when exhausted and logs what it skipped — no silent truncation. The whole thing
|
||||
is offline imagination on the user's own quota.
|
||||
|
||||
## 5. Multi-rollout contrastive reflection — the imagination core
|
||||
|
||||
`--rollouts-k K` (K>1): each train task is rolled out K times. The optimizer is
|
||||
shown the **high-scoring vs low-scoring** attempts of the same task and asked
|
||||
what the good ones did that the bad ones didn't, distilling a general rule. This
|
||||
is a far stronger signal than a single failure, and it is exactly the user's
|
||||
"run it many times, learn from the contrast" idea. Tasks with the highest score
|
||||
*spread* (some passed, some failed) are the most informative and are prioritised.
|
||||
|
||||
## 6. Multi-objective reward — accuracy ↑, tokens ↓, latency ↓
|
||||
|
||||
Every rollout records its `tokens` and `latency_ms`.
|
||||
`multi_objective_reward(w_acc, w_tokens, w_latency)` is a weighted reward so a
|
||||
skill can be optimised to be **cheaper and faster**, not only more accurate
|
||||
(cost terms normalised against a reference; default weights = accuracy-only, so
|
||||
existing behaviour is unchanged). This turns "gets better the more you use it"
|
||||
into "more accurate, cheaper, and faster the more you use it".
|
||||
|
||||
## 7. User preferences as a prior
|
||||
|
||||
`--preferences "<free text>"`: injected into the optimizer's reflect prompt as a
|
||||
prior (set on the optimizer model for dual backends), so the user's stated
|
||||
preferences steer what rules get written.
|
||||
|
||||
## How the knobs compose (one command)
|
||||
|
||||
```bash
|
||||
python -m skillopt.sleep.experiments.run_gbrain \
|
||||
--optimizer-backend claude --optimizer-model sonnet \ # strong optimizer
|
||||
--target-backend claude --target-model haiku \ # cheap target (transfer)
|
||||
--seeds thorough-analyst \
|
||||
--gate on \ # or off for greedy
|
||||
--rollouts-k 2 \ # contrastive imagination
|
||||
--budget-tokens 60000 \ # auto-plan depth
|
||||
--preferences "Prefer concise, British English." \ # prior
|
||||
--nights 3
|
||||
```
|
||||
|
||||
All of this is exercised by the deterministic test suite (29 tests) and
|
||||
validated on real Claude + Codex (see `real_api_results.md` / `FINAL_REPORT.md`).
|
||||
|
||||
## Real cross-validation of the new features (Claude ⟷ Codex)
|
||||
|
||||
Three live runs exercised the new code paths on both runtimes (raw logs under
|
||||
`docs/sleep/raw/crosscheck_*.txt`):
|
||||
|
||||
| # | Config | What it proves | Result |
|
||||
|---|---|---|---|
|
||||
| **A** | Claude Sonnet→Haiku, **gate=off**, **rollouts_k=2** | greedy mode + multi-rollout + 3-way split (val & test both reported) | brief-writer **test 0→1.00**, action `greedy_improved`, val=1.0 test=1.0 |
|
||||
| **B** | **Codex**, gate=on, **rollouts_k=2** | new paths on the other runtime | brief-writer **test 0→1.00**, 2-night `accept_new_best`, val+test reported |
|
||||
| **C** | Claude Sonnet→Haiku, thorough-analyst, 3 nights | **slow-update** long-term memory fires | test 0→0.33 (val gate holds nights 2–3) and the slow-update distilled a durable meta-rule |
|
||||
|
||||
The slow-update guidance C produced is the kind of cross-night lesson the field
|
||||
is for — note it is general, not task-specific:
|
||||
|
||||
> *"On character-constrained tasks (≤1200 chars), plan structure before writing:
|
||||
> allocate space per point explicitly and cut until the outline fits, then fill —
|
||||
> never draft freely and trim after."*
|
||||
|
||||
Takeaways confirmed live: the **gate-off greedy path**, the **3-way val/test
|
||||
split**, **multi-rollout** on both runtimes, and the **gate-independent
|
||||
slow-update** all work with real models on both Claude and Codex.
|
||||
@@ -1,64 +0,0 @@
|
||||
# SkillOpt-Sleep — experience replay & dream rollouts (opt-in)
|
||||
|
||||
Two opt-in mechanisms that strengthen the nightly consolidation when your tasks
|
||||
have a clean correctness signal. Both default **off**, so enabling them is the
|
||||
only way they change behavior.
|
||||
|
||||
## What they do
|
||||
|
||||
| Config knob | Default | Effect |
|
||||
|---|---|---|
|
||||
| `dream_rollouts` | `1` | Run each task **K** times and learn from the *contrast* between the good and bad attempts (contrastive reflection) instead of a single failure. |
|
||||
| `recall_k` | `0` | **Associative recall** — each night, pull the `K` past tasks most similar to tonight's new ones (from a persisted task archive) into the dream, so related experience is revisited without replaying the whole history. |
|
||||
| `dream_factor` | `0` | Add `N` lightweight synthetic variants of each task to the training pool. |
|
||||
|
||||
The validation gate still governs what ships, so these only ever *enlarge the
|
||||
signal the optimizer reflects on* — the held-out gate decides what is kept.
|
||||
|
||||
## How to enable
|
||||
|
||||
```jsonc
|
||||
// ~/.skillopt-sleep/config.json (or pass via the plugin's config)
|
||||
{
|
||||
"dream_rollouts": 5, // contrastive dreaming
|
||||
"recall_k": 20, // recall ~20 similar past tasks each night
|
||||
"gate_mode": "on" // keep the gate on (recommended)
|
||||
}
|
||||
```
|
||||
|
||||
`recall_k` draws from a capped `task_archive` that the cycle persists in
|
||||
`state.json`, so recall becomes useful from the second night onward (once there
|
||||
is history to recall from).
|
||||
|
||||
## Measured effect
|
||||
|
||||
Deployment protocol (5 nights × 10 new real tasks/night, full held-out test
|
||||
sets, GPT-5.5 optimizer), run through the **same engine the plugin executes**
|
||||
(`skillopt_sleep.dream.dream_consolidate`):
|
||||
|
||||
**SearchQA (GPT-5.5, full 1,400-item test, gated) — the gain scales with recall depth:**
|
||||
|
||||
| Config | Δ vs baseline |
|
||||
|---|---|
|
||||
| `recall_k=10, dream_rollouts=5` | +3.1 |
|
||||
| `dream_rollouts=8` | +3.7 |
|
||||
| **`recall_k=20, dream_rollouts=5`** | **+4.5** |
|
||||
| full-history replay (reference) | +5.6 |
|
||||
|
||||
**Second-benchmark confirmation** (SpreadsheetBench, GPT-5.4-nano, gate-free,
|
||||
shipped path): 0.279 → **0.314 (+3.6)**.
|
||||
|
||||
## When it helps — and when it doesn't
|
||||
|
||||
- **Helps** when tasks recur and have a checkable correctness signal (the
|
||||
optimizer has something real to learn and the gate can verify it).
|
||||
- **Roughly flat** on saturated or noisy tasks (e.g. a strong model already near
|
||||
ceiling) — within run-to-run noise (±1–2 points, single seed).
|
||||
- The validation gate keeps the downside bounded; keep it on by default.
|
||||
|
||||
Trade-off: `dream_rollouts > 1` multiplies the per-night rollout cost (K×), and
|
||||
`recall_k > 0` adds the recalled tasks to each night's replay. Since the cycle
|
||||
runs offline on idle quota this is usually acceptable, but budget accordingly
|
||||
(`budget_tokens` / `budget_seconds`).
|
||||
|
||||
Raw per-run results for the table above: `docs/sleep/blog_runs/v2_port/`.
|
||||
@@ -1,160 +0,0 @@
|
||||
# SkillOpt-Sleep — final validation report
|
||||
|
||||
> **What this is:** the consolidated, presented results for the SkillOpt-Sleep
|
||||
> Claude Code plugin — a tool that lets a local agent improve itself overnight by
|
||||
> reviewing past sessions, replaying tasks, and consolidating validated memory +
|
||||
> skills behind a held-out gate. Every real-model result here was run on **both
|
||||
> Claude and Codex**, including the honest failures and the bugs they exposed.
|
||||
|
||||
**Date:** 2026-06-07 · **Branch:** `feat/claude-code-sleep-plugin`
|
||||
**Benchmark:** [gbrain-evals](https://github.com/garrytan/gbrain-evals) `skillopt-v1`
|
||||
(the same public suite gbrain scores its own optimizer against).
|
||||
**Protocol:** a deliberately deficient skill → 1–2 offline "nights" (replay →
|
||||
reflect → bounded **gated** edit) → score the **held-out** task set (never
|
||||
optimized against). Held-out scoring uses a local rule judge — the optimizer
|
||||
never grades itself.
|
||||
|
||||
---
|
||||
|
||||
## 1. Headline — clean, all green (full gbrain parity)
|
||||
|
||||
**Strong optimizer (Claude Sonnet 4.6) → weak target (Claude Haiku 4.5)**, fully
|
||||
isolated calls, 3 held-out tasks/seed. All **4** gbrain `skillopt-v1` seeds —
|
||||
matching gbrain's own scorecard coverage:
|
||||
|
||||
| Optimizer → Target | Seed | Flaw | Held-out before → after | Nights |
|
||||
|---|---|---|---|---|
|
||||
| Sonnet → Haiku | brief-writer | missing structure | **0.00 → 1.00** | 1 |
|
||||
| Sonnet → Haiku | advisor | no verdict | **0.00 → 1.00** | 1 |
|
||||
| Sonnet → Haiku | thorough-analyst | no length discipline | **0.00 → 1.00** | 2 |
|
||||
| Sonnet → Haiku | quick-answerer | never uses tools | **0.00 → 1.00** | 1 |
|
||||
| Codex → Codex (gpt-5.5) | brief-writer | missing structure | **0.00 → 1.00** | 2 |
|
||||
| Codex → Codex (gpt-5.5) | advisor | no verdict | **0.00 → 1.00** | 2 |
|
||||
|
||||
**4/4 Claude seeds reach a perfect held-out score** (gbrain's headline is the same
|
||||
4/4 0→1.00), plus Codex on the text seeds. Every change is gated and staged.
|
||||
|
||||
The `quick-answerer` seed is judged by **real tool use** (`tool_called: search`):
|
||||
the deficient skill says *"never look anything up — answer from memory"*; the
|
||||
optimizer wrote an OVERRIDE rule, and the Haiku target **genuinely invoked a
|
||||
`./search` shell tool** (detected from the tool's own log, not self-reported) →
|
||||
held-out 1.00. The thorough-analyst run shows textbook **2-night convergence**
|
||||
(0.33 → 1.00).
|
||||
|
||||
---
|
||||
|
||||
## 2. The finding that matters most: the optimizer model is decisive
|
||||
|
||||
This is the direct answer to "let me specify the optimizer and target separately,
|
||||
and watch the skill." It matters a lot:
|
||||
|
||||
| Optimizer | Target | brief-writer | advisor | thorough-analyst |
|
||||
|---|---|---|---|---|
|
||||
| **Haiku** (weak) | Haiku | 1.00 *or* 0.00 (flaky) | 1.00 | 0.33 |
|
||||
| **Sonnet** (strong) | Haiku | **1.00** | **1.00** | **1.00** |
|
||||
|
||||
A weak self-optimizing model (Haiku proposing its own edits) is **unreliable** —
|
||||
it intermittently emits non-JSON and wastes a night, so the same seed scores 1.00
|
||||
on one run and 0.00 on another. A **strong optimizer** (Sonnet) reliably produces
|
||||
clean, concrete edit rules and lifts every seed to 1.00. This is exactly the
|
||||
SkillOpt design (strong optimizer, frozen target) and the reason the
|
||||
optimizer/target split is a first-class feature here.
|
||||
|
||||
**Practical guidance baked into the plugin:** default to a strong optimizer; the
|
||||
sweep's `direct` plan now uses Sonnet→Haiku.
|
||||
|
||||
---
|
||||
|
||||
## 3. Two real bugs we found by running against live models
|
||||
|
||||
Per gbrain's own lesson ("the bugs that matter only show up when the whole thing
|
||||
actually runs"), the first live runs surfaced two real defects. Both are fixed.
|
||||
|
||||
1. **Ambient-context leak (Claude).** `claude -p` was injecting the user's
|
||||
*global* skills + project `CLAUDE.md` into every optimizer/target call — one
|
||||
reflect call literally returned a 21 KB list of the machine's installed skills
|
||||
instead of JSON edits, so the night produced no edits and the gate rejected.
|
||||
Some early Claude "successes" were partly leak-assisted. **Fix:** run isolated
|
||||
— `--bare --disable-slash-commands --disallowedTools '*'
|
||||
--exclude-dynamic-system-prompt-sections`, clean temp cwd. (Codex was never
|
||||
affected; the real `@openai/codex` binary runs in its own clean context.)
|
||||
|
||||
2. **Wasted nights on transient non-JSON.** A single malformed reply zeroed a
|
||||
night. **Fix:** `reflect()` retries once with a firmer "JSON only" instruction.
|
||||
|
||||
We report these because a tool people build on has to be honest about where it was
|
||||
weak and what changed.
|
||||
|
||||
---
|
||||
|
||||
## 4. Cross-model transfer (the price-difference value prop)
|
||||
|
||||
> *Optimize cheap overnight, deploy anywhere.* A skill is just text, so a good
|
||||
> rewrite should help a model it was never optimized on.
|
||||
|
||||
Optimize on SOURCE, **freeze** the learned skill, evaluate held-out on TARGET with
|
||||
no further optimization. All four pairs are positive — including **across
|
||||
runtimes** (Codex ↔ Claude):
|
||||
|
||||
| Source (optimizer) | Target (deploy) | Seed | Target baseline → transferred | Gain |
|
||||
|---|---|---|---|---|
|
||||
| Claude Haiku (cheap) | Claude Sonnet (expensive) | brief-writer | 0.00 → **1.00** | +1.00 |
|
||||
| Claude Sonnet | Claude Haiku | brief-writer | 0.00 → **1.00** | +1.00 |
|
||||
| **Codex** | **Claude Haiku** | brief-writer | 0.00 → **1.00** | +1.00 |
|
||||
| **Claude Haiku** | **Codex** | brief-writer | 0.00 → **1.00** | +1.00 |
|
||||
|
||||
**4/4 transfers positive.** A skill optimized on a cheap model deploys for free on
|
||||
an expensive one, and skills move between Codex and Claude — the Sleep-setting
|
||||
analogue of SkillOpt's cross-model and cross-harness transfer tables. This is the
|
||||
quantified answer to "optimize cheap overnight, deploy anywhere."
|
||||
|
||||
Full machine-generated scorecard: [`benchmark_report.md`](benchmark_report.md)
|
||||
(source data `sweep.jsonl`).
|
||||
|
||||
---
|
||||
|
||||
## 5. Reproduce everything
|
||||
|
||||
```bash
|
||||
git clone https://github.com/garrytan/gbrain-evals /tmp/gbrain-evals
|
||||
cd <repo>/SkillOpt-sleep
|
||||
|
||||
# the clean headline result (strong optimizer -> weak target)
|
||||
python3.12 -m skillopt.sleep.experiments.run_gbrain \
|
||||
--optimizer-backend claude --optimizer-model sonnet \
|
||||
--target-backend claude --target-model haiku \
|
||||
--seeds brief-writer,advisor,thorough-analyst \
|
||||
--data-root /tmp/gbrain-evals/eval/data/skillopt-v1 --nights 2 --limit-replay 3 --limit-holdout 3
|
||||
|
||||
# Codex self-optimized
|
||||
python3.12 -m skillopt.sleep.experiments.run_gbrain --backend codex --seeds brief-writer \
|
||||
--data-root /tmp/gbrain-evals/eval/data/skillopt-v1 --nights 2 --limit-replay 3 --limit-holdout 3
|
||||
|
||||
# cross-model transfer
|
||||
python3.12 -m skillopt.sleep.experiments.run_transfer \
|
||||
--source-backend claude --source-model haiku --target-backend claude --target-model sonnet \
|
||||
--seeds brief-writer
|
||||
|
||||
# the whole sweep + report
|
||||
python3.12 -m skillopt.sleep.experiments.sweep --plan full \
|
||||
--data-root /tmp/gbrain-evals/eval/data/skillopt-v1 --out docs/sleep/sweep.jsonl
|
||||
python3.12 -m skillopt.sleep.experiments.report --in docs/sleep/sweep.jsonl --out docs/sleep/benchmark_report.md
|
||||
|
||||
# deterministic, no API (CI anchor)
|
||||
python3.12 -m skillopt.sleep.experiments.run_experiment --persona researcher --assert-improves
|
||||
```
|
||||
|
||||
Raw run logs are under `docs/sleep/raw/`.
|
||||
|
||||
---
|
||||
|
||||
## 6. Honest limitations
|
||||
|
||||
- **Latency:** each CLI call is ~14–15 s startup-dominated, so runs are capped at
|
||||
a few tasks/nights. Fine for nightly cron; we note it plainly.
|
||||
- **Weak optimizers are flaky:** use a strong optimizer model (§2).
|
||||
- **Tool-use seed covered honestly:** `quick-answerer` (`tool_called: search`)
|
||||
runs a real tool loop — a callable `./search` shim, detected from its log.
|
||||
Deeper multi-tool / multi-turn workflows are future work.
|
||||
- **Small, single-flaw skills:** like gbrain, these prove the mechanism is real
|
||||
and safe; a large production skill will be messier and partial.
|
||||
@@ -1,53 +0,0 @@
|
||||
TITLE:
|
||||
Add SkillOpt-Sleep: nightly offline self-evolution plugins (Claude Code, Codex, Copilot)
|
||||
|
||||
BODY:
|
||||
## Summary
|
||||
|
||||
Adds **SkillOpt-Sleep** — a nightly offline "sleep cycle" that gives a local
|
||||
coding agent the deployment-time analogue of training: it reviews past sessions,
|
||||
replays recurring tasks on the user's own API budget, and consolidates what it
|
||||
learns into **validated** long-term memory and skills behind a held-out gate.
|
||||
Synthesizes SkillOpt (validation-gated bounded text edits), Claude Dreams
|
||||
(offline consolidation; review-then-adopt), and the agent-sleep idea
|
||||
(short-term experience -> long-term competence).
|
||||
|
||||
Shipped as plugins for **three agents**, one engine + three thin shells:
|
||||
|
||||
- **Claude Code** — `.claude-plugin` + `/sleep` command + skill + hooks
|
||||
- **Codex** — user-level `skillopt-sleep` skill + shared runner + `install.sh`
|
||||
- **Copilot** — a stdlib-only MCP server exposing `sleep_*` tools
|
||||
|
||||
## Design notes
|
||||
|
||||
- **Open-source tool, decoupled from the research code.** The engine lives in the
|
||||
new top-level `skillopt_sleep/` package with **zero dependency** on the paper's
|
||||
`skillopt/` experiment package (the validation gate is vendored).
|
||||
- Controllable: optional gate (`--gate on|off`), train(dream)/val(real)/test(real)
|
||||
splits, slow-update long-term memory, token/time budget, multi-rollout
|
||||
contrastive reflection, multi-objective reward (accuracy/tokens/latency), user
|
||||
preferences, and separate optimizer/target models.
|
||||
|
||||
## Validation (real models)
|
||||
|
||||
On the public [gbrain-evals](https://github.com/garrytan/gbrain-evals)
|
||||
`skillopt-v1` benchmark, deficient skills go **0.00 -> 1.00** on held-out sets
|
||||
with **both Claude and Codex** (all 4 seeds, including a real tool-use loop);
|
||||
cross-model transfer is positive; the gate blocks regressions. Independently
|
||||
load-tested on a fresh non-benchmark persona ("SQL must always include LIMIT"):
|
||||
held-out test **0.00 -> 1.00** on both backends. See `docs/sleep/FINAL_REPORT.md`
|
||||
and `docs/sleep/plugin_load_test.md`.
|
||||
|
||||
## Tests
|
||||
|
||||
- 29 deterministic unit tests (`tests/test_sleep_engine.py`), no API key required.
|
||||
- `python -m skillopt_sleep.experiments.run_experiment --persona researcher --assert-improves`
|
||||
proves held-out lift and that the gate blocks a harmful edit.
|
||||
|
||||
## Test plan
|
||||
|
||||
- [ ] `python -m unittest tests.test_sleep_engine` (29 pass)
|
||||
- [ ] `python -m skillopt_sleep.experiments.run_experiment --persona researcher --assert-improves`
|
||||
- [ ] Claude Code: `/plugin marketplace add ./plugins/claude-code` -> `/sleep status`
|
||||
- [ ] Codex: `bash plugins/codex/install.sh`
|
||||
- [ ] Copilot: MCP server `tools/list` returns the `sleep_*` tools
|
||||
@@ -1,41 +0,0 @@
|
||||
# SkillOpt-Sleep — benchmark report
|
||||
|
||||
Auto-generated from `sweep.jsonl`. Benchmark: [gbrain-evals](https://github.com/garrytan/gbrain-evals) `skillopt-v1` (deficient skills, train/held-out split, local rule judge — no judge-API).
|
||||
Held-out scores are computed by the harness, not the optimizer.
|
||||
|
||||
## Direct improvement (optimize, then deploy)
|
||||
|
||||
| Optimizer → Target | Seed | Held-out before | Held-out after | Nights | Tokens |
|
||||
|---|---|---|---|---|---|
|
||||
| claude:sonnet → claude:haiku | brief-writer | 0.00 | **1.00** | 2 | 6657 |
|
||||
| claude:sonnet → claude:haiku | advisor | 0.00 | **1.00** | 2 | 7891 |
|
||||
| claude:sonnet → claude:haiku | thorough-analyst | 0.00 | **1.00** | 2 | 17960 |
|
||||
| codex:default → codex:default | brief-writer | 0.00 | **1.00** | 2 | 9969 |
|
||||
| codex:default → codex:default | advisor | 0.00 | **1.00** | 2 | 6210 |
|
||||
| claude:sonnet → claude:haiku | quick-answerer | 0.00 | **1.00** | 2 | 10988 |
|
||||
| codex:default → codex:default | quick-answerer | 0.00 | **1.00** | 2 | 7347 |
|
||||
|
||||
**7/7 configurations improved on held-out.**
|
||||
|
||||
## Cross-model transfer (optimize on SOURCE, deploy frozen on TARGET)
|
||||
|
||||
The price-difference story: spend cheap tokens optimizing overnight, then deploy the frozen skill on any model with no further optimization.
|
||||
|
||||
| Source (optimizer) | Target (deploy) | Seed | Target baseline | Transferred | Gain |
|
||||
|---|---|---|---|---|---|
|
||||
| claude:haiku | claude:sonnet | brief-writer | 0.00 | **1.00** | +1.00 |
|
||||
| claude:sonnet | claude:haiku | brief-writer | 0.00 | **1.00** | +1.00 |
|
||||
| codex:default | claude:haiku | brief-writer | 0.00 | **1.00** | +1.00 |
|
||||
| claude:haiku | codex:default | brief-writer | 0.00 | **1.00** | +1.00 |
|
||||
|
||||
**4/4 transfers were positive** (frozen skill helped a different model than it was optimized on).
|
||||
|
||||
## How to reproduce
|
||||
|
||||
```bash
|
||||
git clone https://github.com/garrytan/gbrain-evals /tmp/gbrain-evals
|
||||
python -m skillopt.sleep.experiments.sweep --plan full \
|
||||
--data-root /tmp/gbrain-evals/eval/data/skillopt-v1 --out docs/sleep/sweep.jsonl
|
||||
python -m skillopt.sleep.experiments.report \
|
||||
--in docs/sleep/sweep.jsonl --out docs/sleep/benchmark_report.md
|
||||
```
|
||||
@@ -1,94 +0,0 @@
|
||||
{
|
||||
"experiment": "skillopt-sleep/nightly",
|
||||
"model": "gpt-5.4-nano",
|
||||
"results": [
|
||||
{
|
||||
"benchmark": "spreadsheet",
|
||||
"gate": "off",
|
||||
"replay_mode": "retrieval",
|
||||
"retrieve_k": 10,
|
||||
"nights": 5,
|
||||
"per_night": 10,
|
||||
"rollouts": 5,
|
||||
"n_val": 40,
|
||||
"n_test": 280,
|
||||
"test_baseline": 0.2786,
|
||||
"test_final": 0.3143,
|
||||
"delta": 0.0357,
|
||||
"progression": [
|
||||
0.2786,
|
||||
0.3036,
|
||||
0.3143,
|
||||
0.3107,
|
||||
0.3179,
|
||||
0.3143
|
||||
],
|
||||
"nights_log": [
|
||||
{
|
||||
"night": 0,
|
||||
"n_train": 0,
|
||||
"test_hard": 0.2786,
|
||||
"action": "baseline",
|
||||
"accepted": false
|
||||
},
|
||||
{
|
||||
"night": 1,
|
||||
"n_train": 10,
|
||||
"n_replayed": 0,
|
||||
"n_dream": 20,
|
||||
"val_hard": 0.0,
|
||||
"test_hard": 0.3036,
|
||||
"action": "greedy_applied",
|
||||
"accepted": true,
|
||||
"n_edits": 4
|
||||
},
|
||||
{
|
||||
"night": 2,
|
||||
"n_train": 20,
|
||||
"n_replayed": 10,
|
||||
"n_dream": 40,
|
||||
"val_hard": 0.0,
|
||||
"test_hard": 0.3143,
|
||||
"action": "greedy_applied",
|
||||
"accepted": true,
|
||||
"n_edits": 4
|
||||
},
|
||||
{
|
||||
"night": 3,
|
||||
"n_train": 30,
|
||||
"n_replayed": 10,
|
||||
"n_dream": 40,
|
||||
"val_hard": 0.0,
|
||||
"test_hard": 0.3107,
|
||||
"action": "greedy_applied",
|
||||
"accepted": true,
|
||||
"n_edits": 4
|
||||
},
|
||||
{
|
||||
"night": 4,
|
||||
"n_train": 40,
|
||||
"n_replayed": 10,
|
||||
"n_dream": 40,
|
||||
"val_hard": 0.0,
|
||||
"test_hard": 0.3179,
|
||||
"action": "greedy_applied",
|
||||
"accepted": true,
|
||||
"n_edits": 4
|
||||
},
|
||||
{
|
||||
"night": 5,
|
||||
"n_train": 50,
|
||||
"n_replayed": 10,
|
||||
"n_dream": 40,
|
||||
"val_hard": 0.0,
|
||||
"test_hard": 0.3143,
|
||||
"action": "greedy_applied",
|
||||
"accepted": true,
|
||||
"n_edits": 4
|
||||
}
|
||||
],
|
||||
"tokens": 13587597,
|
||||
"final_skill_tail": "t/headers rather than hardcoding specific cell coordinates or values.\n- When searching for specific text, use an exact match check on the cell string, e.g. `if cell_value == \"Georgia Its Tax\": ...` (not partial regex, not truncated comparisons).\n- If a cell contains multiple tokens separated by semicolons, split and normalize before comparing: `parts = [p.strip() for p in str(cell_value).split(';') if p.strip()]` and then test membership/lookup using `parts`.\n<!-- SKILLOPT-SLEEP:LEARNED END -->\n"
|
||||
}
|
||||
]
|
||||
}
|
||||
@@ -1,94 +0,0 @@
|
||||
{
|
||||
"experiment": "skillopt-sleep/nightly",
|
||||
"model": "gpt-5.5",
|
||||
"results": [
|
||||
{
|
||||
"benchmark": "searchqa",
|
||||
"gate": "on",
|
||||
"replay_mode": "cumulative",
|
||||
"retrieve_k": 0,
|
||||
"nights": 5,
|
||||
"per_night": 10,
|
||||
"rollouts": 5,
|
||||
"n_val": 60,
|
||||
"n_test": 1400,
|
||||
"test_baseline": 0.7957,
|
||||
"test_final": 0.8514,
|
||||
"delta": 0.0557,
|
||||
"progression": [
|
||||
0.7957,
|
||||
0.8336,
|
||||
0.8514,
|
||||
0.8514,
|
||||
0.8514,
|
||||
0.8514
|
||||
],
|
||||
"nights_log": [
|
||||
{
|
||||
"night": 0,
|
||||
"n_train": 0,
|
||||
"test_hard": 0.7957,
|
||||
"action": "baseline",
|
||||
"accepted": false
|
||||
},
|
||||
{
|
||||
"night": 1,
|
||||
"n_train": 10,
|
||||
"n_replayed": 0,
|
||||
"n_dream": 20,
|
||||
"val_hard": 0.85,
|
||||
"test_hard": 0.8336,
|
||||
"action": "accept_new_best",
|
||||
"accepted": true,
|
||||
"n_edits": 2
|
||||
},
|
||||
{
|
||||
"night": 2,
|
||||
"n_train": 20,
|
||||
"n_replayed": 10,
|
||||
"n_dream": 40,
|
||||
"val_hard": 0.9,
|
||||
"test_hard": 0.8514,
|
||||
"action": "accept_new_best",
|
||||
"accepted": true,
|
||||
"n_edits": 3
|
||||
},
|
||||
{
|
||||
"night": 3,
|
||||
"n_train": 30,
|
||||
"n_replayed": 20,
|
||||
"n_dream": 60,
|
||||
"val_hard": 0.9,
|
||||
"test_hard": 0.8514,
|
||||
"action": "reject",
|
||||
"accepted": false,
|
||||
"n_edits": 0
|
||||
},
|
||||
{
|
||||
"night": 4,
|
||||
"n_train": 40,
|
||||
"n_replayed": 30,
|
||||
"n_dream": 80,
|
||||
"val_hard": 0.9,
|
||||
"test_hard": 0.8514,
|
||||
"action": "reject",
|
||||
"accepted": false,
|
||||
"n_edits": 0
|
||||
},
|
||||
{
|
||||
"night": 5,
|
||||
"n_train": 50,
|
||||
"n_replayed": 40,
|
||||
"n_dream": 100,
|
||||
"val_hard": 0.9,
|
||||
"test_hard": 0.8514,
|
||||
"action": "reject",
|
||||
"accepted": false,
|
||||
"n_edits": 0
|
||||
}
|
||||
],
|
||||
"tokens": 15132599,
|
||||
"final_skill_tail": " the title or key sentence over a county, institution, or category.\n- Return the shortest exact answer span that satisfies the question, inside <answer>...</answer>; prefer a single-word entity when sufficient.\n- Do not expand a context-supported short name into a fuller name unless the question specifically requires the full name.\n- Match the requested answer type exactly: for a country/nation answer, output only the country name, not a title or role phrase.\n<!-- SKILLOPT-SLEEP:LEARNED END -->\n"
|
||||
}
|
||||
]
|
||||
}
|
||||
@@ -1,94 +0,0 @@
|
||||
{
|
||||
"experiment": "skillopt-sleep/nightly",
|
||||
"model": "gpt-5.5",
|
||||
"results": [
|
||||
{
|
||||
"benchmark": "searchqa",
|
||||
"gate": "on",
|
||||
"replay_mode": "retrieval",
|
||||
"retrieve_k": 20,
|
||||
"nights": 5,
|
||||
"per_night": 10,
|
||||
"rollouts": 5,
|
||||
"n_val": 60,
|
||||
"n_test": 1400,
|
||||
"test_baseline": 0.8029,
|
||||
"test_final": 0.8479,
|
||||
"delta": 0.045,
|
||||
"progression": [
|
||||
0.8029,
|
||||
0.8236,
|
||||
0.8236,
|
||||
0.8479,
|
||||
0.8479,
|
||||
0.8479
|
||||
],
|
||||
"nights_log": [
|
||||
{
|
||||
"night": 0,
|
||||
"n_train": 0,
|
||||
"test_hard": 0.8029,
|
||||
"action": "baseline",
|
||||
"accepted": false
|
||||
},
|
||||
{
|
||||
"night": 1,
|
||||
"n_train": 10,
|
||||
"n_replayed": 0,
|
||||
"n_dream": 20,
|
||||
"val_hard": 0.8667,
|
||||
"test_hard": 0.8236,
|
||||
"action": "accept_new_best",
|
||||
"accepted": true,
|
||||
"n_edits": 2
|
||||
},
|
||||
{
|
||||
"night": 2,
|
||||
"n_train": 20,
|
||||
"n_replayed": 10,
|
||||
"n_dream": 40,
|
||||
"val_hard": 0.8667,
|
||||
"test_hard": 0.8236,
|
||||
"action": "reject",
|
||||
"accepted": false,
|
||||
"n_edits": 0
|
||||
},
|
||||
{
|
||||
"night": 3,
|
||||
"n_train": 30,
|
||||
"n_replayed": 20,
|
||||
"n_dream": 60,
|
||||
"val_hard": 0.8833,
|
||||
"test_hard": 0.8479,
|
||||
"action": "accept_new_best",
|
||||
"accepted": true,
|
||||
"n_edits": 3
|
||||
},
|
||||
{
|
||||
"night": 4,
|
||||
"n_train": 40,
|
||||
"n_replayed": 20,
|
||||
"n_dream": 60,
|
||||
"val_hard": 0.8833,
|
||||
"test_hard": 0.8479,
|
||||
"action": "reject",
|
||||
"accepted": false,
|
||||
"n_edits": 0
|
||||
},
|
||||
{
|
||||
"night": 5,
|
||||
"n_train": 50,
|
||||
"n_replayed": 20,
|
||||
"n_dream": 60,
|
||||
"val_hard": 0.8833,
|
||||
"test_hard": 0.8479,
|
||||
"action": "reject",
|
||||
"accepted": false,
|
||||
"n_edits": 0
|
||||
}
|
||||
],
|
||||
"tokens": 15596999,
|
||||
"final_skill_tail": " Put only the shortest exact answer span in the final '<answer>...</answer>' tags; remove extra descriptors, categories, titles, and surrounding words.\n- If the question asks for a country/place from a phrase like 'King of Spain' or a title like 'Ferdinand VII of Spain', answer only the place name, e.g. 'Spain'.\n- For person answers, use the minimal unambiguous name supported by the clue; do not expand a surname to a full name unless the question requires it.\n<!-- SKILLOPT-SLEEP:LEARNED END -->\n"
|
||||
}
|
||||
]
|
||||
}
|
||||
@@ -1,94 +0,0 @@
|
||||
{
|
||||
"experiment": "skillopt-sleep/nightly",
|
||||
"model": "gpt-5.5",
|
||||
"results": [
|
||||
{
|
||||
"benchmark": "searchqa",
|
||||
"gate": "on",
|
||||
"replay_mode": "retrieval",
|
||||
"retrieve_k": 10,
|
||||
"nights": 5,
|
||||
"per_night": 10,
|
||||
"rollouts": 8,
|
||||
"n_val": 60,
|
||||
"n_test": 1400,
|
||||
"test_baseline": 0.7979,
|
||||
"test_final": 0.835,
|
||||
"delta": 0.0371,
|
||||
"progression": [
|
||||
0.7979,
|
||||
0.8179,
|
||||
0.835,
|
||||
0.835,
|
||||
0.835,
|
||||
0.835
|
||||
],
|
||||
"nights_log": [
|
||||
{
|
||||
"night": 0,
|
||||
"n_train": 0,
|
||||
"test_hard": 0.7979,
|
||||
"action": "baseline",
|
||||
"accepted": false
|
||||
},
|
||||
{
|
||||
"night": 1,
|
||||
"n_train": 10,
|
||||
"n_replayed": 0,
|
||||
"n_dream": 20,
|
||||
"val_hard": 0.8667,
|
||||
"test_hard": 0.8179,
|
||||
"action": "accept_new_best",
|
||||
"accepted": true,
|
||||
"n_edits": 2
|
||||
},
|
||||
{
|
||||
"night": 2,
|
||||
"n_train": 20,
|
||||
"n_replayed": 10,
|
||||
"n_dream": 40,
|
||||
"val_hard": 0.8833,
|
||||
"test_hard": 0.835,
|
||||
"action": "accept_new_best",
|
||||
"accepted": true,
|
||||
"n_edits": 3
|
||||
},
|
||||
{
|
||||
"night": 3,
|
||||
"n_train": 30,
|
||||
"n_replayed": 10,
|
||||
"n_dream": 40,
|
||||
"val_hard": 0.8833,
|
||||
"test_hard": 0.835,
|
||||
"action": "reject",
|
||||
"accepted": false,
|
||||
"n_edits": 0
|
||||
},
|
||||
{
|
||||
"night": 4,
|
||||
"n_train": 40,
|
||||
"n_replayed": 10,
|
||||
"n_dream": 40,
|
||||
"val_hard": 0.8833,
|
||||
"test_hard": 0.835,
|
||||
"action": "reject",
|
||||
"accepted": false,
|
||||
"n_edits": 0
|
||||
},
|
||||
{
|
||||
"night": 5,
|
||||
"n_train": 50,
|
||||
"n_replayed": 10,
|
||||
"n_dream": 40,
|
||||
"val_hard": 0.8833,
|
||||
"test_hard": 0.835,
|
||||
"action": "reject",
|
||||
"accepted": false,
|
||||
"n_edits": 0
|
||||
}
|
||||
],
|
||||
"tokens": 16846499,
|
||||
"final_skill_tail": "less the question asks for the title itself.\n- Always put only the final answer in \"<answer>...</answer>\" and keep it \"concise -- typically a few words or a short phrase\".\n- Use the shortest sufficient answer span; do not add first names, modifiers, counties, countries, or parent locations unless explicitly required.\n- Match the question’s granularity exactly: if it asks for a state, give only the state; if it asks for a term’s meaning, give only the meaning.\n<!-- SKILLOPT-SLEEP:LEARNED END -->\n"
|
||||
}
|
||||
]
|
||||
}
|
||||
@@ -1,94 +0,0 @@
|
||||
{
|
||||
"experiment": "skillopt-sleep/nightly",
|
||||
"model": "gpt-5.5",
|
||||
"results": [
|
||||
{
|
||||
"benchmark": "searchqa",
|
||||
"gate": "off",
|
||||
"replay_mode": "retrieval",
|
||||
"retrieve_k": 10,
|
||||
"nights": 5,
|
||||
"per_night": 10,
|
||||
"rollouts": 5,
|
||||
"n_val": 60,
|
||||
"n_test": 1400,
|
||||
"test_baseline": 0.8079,
|
||||
"test_final": 0.8393,
|
||||
"delta": 0.0314,
|
||||
"progression": [
|
||||
0.8079,
|
||||
0.8321,
|
||||
0.84,
|
||||
0.8436,
|
||||
0.84,
|
||||
0.8393
|
||||
],
|
||||
"nights_log": [
|
||||
{
|
||||
"night": 0,
|
||||
"n_train": 0,
|
||||
"test_hard": 0.8079,
|
||||
"action": "baseline",
|
||||
"accepted": false
|
||||
},
|
||||
{
|
||||
"night": 1,
|
||||
"n_train": 10,
|
||||
"n_replayed": 0,
|
||||
"n_dream": 20,
|
||||
"val_hard": 0.0,
|
||||
"test_hard": 0.8321,
|
||||
"action": "greedy_applied",
|
||||
"accepted": true,
|
||||
"n_edits": 3
|
||||
},
|
||||
{
|
||||
"night": 2,
|
||||
"n_train": 20,
|
||||
"n_replayed": 10,
|
||||
"n_dream": 40,
|
||||
"val_hard": 0.0,
|
||||
"test_hard": 0.84,
|
||||
"action": "greedy_applied",
|
||||
"accepted": true,
|
||||
"n_edits": 1
|
||||
},
|
||||
{
|
||||
"night": 3,
|
||||
"n_train": 30,
|
||||
"n_replayed": 10,
|
||||
"n_dream": 40,
|
||||
"val_hard": 0.0,
|
||||
"test_hard": 0.8436,
|
||||
"action": "greedy_applied",
|
||||
"accepted": true,
|
||||
"n_edits": 2
|
||||
},
|
||||
{
|
||||
"night": 4,
|
||||
"n_train": 40,
|
||||
"n_replayed": 10,
|
||||
"n_dream": 40,
|
||||
"val_hard": 0.0,
|
||||
"test_hard": 0.84,
|
||||
"action": "greedy_applied",
|
||||
"accepted": true,
|
||||
"n_edits": 3
|
||||
},
|
||||
{
|
||||
"night": 5,
|
||||
"n_train": 50,
|
||||
"n_replayed": 10,
|
||||
"n_dream": 40,
|
||||
"val_hard": 0.0,
|
||||
"test_hard": 0.8393,
|
||||
"action": "greedy_applied",
|
||||
"accepted": true,
|
||||
"n_edits": 2
|
||||
}
|
||||
],
|
||||
"tokens": 27990836,
|
||||
"final_skill_tail": "Sultan of Brunei\".\n- For author/creator questions from titles like \"Trees by Joyce Kilmer\", output only the creator name, e.g. \"Joyce Kilmer\", not the work title.\n- Do not introduce diacritics or alternate spellings not present in the context/title; prefer the ASCII surface form such as \"Vaclav Havel\" over \"Václav Havel\".\n- Return the full canonical entity name from the context/title, including hyphens, e.g. \"Winnie-the-Pooh\" rather than the shortened \"Pooh\".\n<!-- SKILLOPT-SLEEP:LEARNED END -->\n"
|
||||
}
|
||||
]
|
||||
}
|
||||
@@ -1,94 +0,0 @@
|
||||
{
|
||||
"experiment": "skillopt-sleep/nightly",
|
||||
"model": "gpt-5.5",
|
||||
"results": [
|
||||
{
|
||||
"benchmark": "searchqa",
|
||||
"gate": "on",
|
||||
"replay_mode": "retrieval",
|
||||
"retrieve_k": 10,
|
||||
"nights": 5,
|
||||
"per_night": 10,
|
||||
"rollouts": 5,
|
||||
"n_val": 60,
|
||||
"n_test": 1400,
|
||||
"test_baseline": 0.8021,
|
||||
"test_final": 0.8336,
|
||||
"delta": 0.0315,
|
||||
"progression": [
|
||||
0.8021,
|
||||
0.83,
|
||||
0.8336,
|
||||
0.8336,
|
||||
0.8336,
|
||||
0.8336
|
||||
],
|
||||
"nights_log": [
|
||||
{
|
||||
"night": 0,
|
||||
"n_train": 0,
|
||||
"test_hard": 0.8021,
|
||||
"action": "baseline",
|
||||
"accepted": false
|
||||
},
|
||||
{
|
||||
"night": 1,
|
||||
"n_train": 10,
|
||||
"n_replayed": 0,
|
||||
"n_dream": 20,
|
||||
"val_hard": 0.8667,
|
||||
"test_hard": 0.83,
|
||||
"action": "accept_new_best",
|
||||
"accepted": true,
|
||||
"n_edits": 4
|
||||
},
|
||||
{
|
||||
"night": 2,
|
||||
"n_train": 20,
|
||||
"n_replayed": 10,
|
||||
"n_dream": 40,
|
||||
"val_hard": 0.9,
|
||||
"test_hard": 0.8336,
|
||||
"action": "accept_new_best",
|
||||
"accepted": true,
|
||||
"n_edits": 4
|
||||
},
|
||||
{
|
||||
"night": 3,
|
||||
"n_train": 30,
|
||||
"n_replayed": 10,
|
||||
"n_dream": 40,
|
||||
"val_hard": 0.9,
|
||||
"test_hard": 0.8336,
|
||||
"action": "reject",
|
||||
"accepted": false,
|
||||
"n_edits": 0
|
||||
},
|
||||
{
|
||||
"night": 4,
|
||||
"n_train": 40,
|
||||
"n_replayed": 10,
|
||||
"n_dream": 40,
|
||||
"val_hard": 0.9,
|
||||
"test_hard": 0.8336,
|
||||
"action": "reject",
|
||||
"accepted": false,
|
||||
"n_edits": 0
|
||||
},
|
||||
{
|
||||
"night": 5,
|
||||
"n_train": 50,
|
||||
"n_replayed": 10,
|
||||
"n_dream": 40,
|
||||
"val_hard": 0.9,
|
||||
"test_hard": 0.8336,
|
||||
"action": "reject",
|
||||
"accepted": false,
|
||||
"n_edits": 0
|
||||
}
|
||||
],
|
||||
"tokens": 15946118,
|
||||
"final_skill_tail": "roperty; do not substitute a broader category or page title.\n- For location questions asking for a state/country, output only that level, e.g. \"Maryland\", not the full hierarchy \"Baltimore County, Maryland, United States\".\n- For name-part questions such as surname/last name, output only that part, e.g. \"Genet\", not the full name \"Jean Genet\".\n- Put only the concise final answer inside \"<answer>...</answer>\"; avoid extra modifiers, lists, or explanatory words.\n<!-- SKILLOPT-SLEEP:LEARNED END -->\n"
|
||||
}
|
||||
]
|
||||
}
|
||||
@@ -1,73 +0,0 @@
|
||||
# SkillOpt-Sleep — validation experiment results
|
||||
|
||||
Generated: 2026-06-07 (autonomous offline session)
|
||||
Backend: mock (deterministic, no API). Reproducible via the commands below.
|
||||
|
||||
```
|
||||
$ python3.12 -m skillopt.sleep.experiments.run_experiment --persona researcher --nights 4 --json
|
||||
{
|
||||
"persona": "researcher",
|
||||
"backend": "mock",
|
||||
"nights_run": 1,
|
||||
"baseline_holdout": 0.3333,
|
||||
"after_holdout": 1.0,
|
||||
"lift": 0.6667,
|
||||
"improved": true,
|
||||
"gate_blocks_harmful": true,
|
||||
"final_skill_excerpt": "T -->\n## Learned preferences & procedures\n\n_This block is maintained by SkillOpt-Sleep. Edits here are proposed offline, validated against your past tasks, and adopted only after you approve them. Hand-edits outside this block are never touched._\n\n- Always wrap the final answer in <answer>...</answer> tags.\n- Report arXiv ids in the exact form arXiv:XXXX.XXXXX.\n<!-- SKILLOPT-SLEEP:LEARNED END -->\n",
|
||||
"trace": [
|
||||
{
|
||||
"night": 0,
|
||||
"holdout_score": 0.3333,
|
||||
"action": "baseline",
|
||||
"n_edits": 0
|
||||
},
|
||||
{
|
||||
"night": 1,
|
||||
"holdout_score": 1.0,
|
||||
"action": "accept_new_best",
|
||||
"accepted": true,
|
||||
"n_edits": 2,
|
||||
"edits": [
|
||||
"Always wrap the final answer in <answer>...</answer> tags.",
|
||||
"Report arXiv ids in the exact form arXiv:XXXX.XXXXX."
|
||||
],
|
||||
"n_rejected": 0
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
```
|
||||
$ python3.12 -m skillopt.sleep.experiments.run_experiment --persona programmer --nights 4 --json
|
||||
{
|
||||
"persona": "programmer",
|
||||
"backend": "mock",
|
||||
"nights_run": 1,
|
||||
"baseline_holdout": 0.3194,
|
||||
"after_holdout": 1.0,
|
||||
"lift": 0.6806,
|
||||
"improved": true,
|
||||
"gate_blocks_harmful": true,
|
||||
"final_skill_excerpt": "laude Code sessions.\n\n<!-- SKILLOPT-SLEEP:LEARNED START -->\n## Learned preferences & procedures\n\n_This block is maintained by SkillOpt-Sleep. Edits here are proposed offline, validated against your past tasks, and adopted only after you approve them. Hand-edits outside this block are never touched._\n\n- Write git commit subjects in imperative mood, max 50 chars.\n<!-- SKILLOPT-SLEEP:LEARNED END -->\n",
|
||||
"trace": [
|
||||
{
|
||||
"night": 0,
|
||||
"holdout_score": 0.3194,
|
||||
"action": "baseline",
|
||||
"n_edits": 0
|
||||
},
|
||||
{
|
||||
"night": 1,
|
||||
"holdout_score": 1.0,
|
||||
"action": "accept_new_best",
|
||||
"accepted": true,
|
||||
"n_edits": 1,
|
||||
"edits": [
|
||||
"Write git commit subjects in imperative mood, max 50 chars."
|
||||
],
|
||||
"n_rejected": 0
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
@@ -1,76 +0,0 @@
|
||||
# SkillOpt-Sleep — plugin load-test (fresh examples)
|
||||
|
||||
This records an actual end-to-end load-test of all three plugin shells on a
|
||||
**brand-new example** (not the gbrain benchmark seeds), run on 2026-06-08.
|
||||
|
||||
## The fresh persona
|
||||
|
||||
A data analyst whose SQL queries must always include a `LIMIT` clause — built
|
||||
from scratch for this test. Two forms were used:
|
||||
|
||||
1. **Real transcripts** — crafted Claude Code session JSONL where the analyst
|
||||
asks for SQL, the agent forgets `LIMIT`, and the user complains ("you forgot
|
||||
a LIMIT again", "always cap results"). This exercises the real
|
||||
harvest → mine pipeline.
|
||||
2. **Checkable tasks** — the same intent with a rule judge
|
||||
(`regex: (?i)LIMIT\s+100`), so the optimizer can be scored on whether future
|
||||
SQL follows the house rule.
|
||||
|
||||
## Results
|
||||
|
||||
### Shell plumbing (all three drive the engine)
|
||||
|
||||
| Shell | What was run | Result |
|
||||
|---|---|---|
|
||||
| **Claude Code** (`scripts/sleep.sh`) | `harvest`, full `run`, `adopt` | harvest found 2 sessions → 2 tasks; `run` staged a proposal; `adopt` honored the safety contract (no live change when nothing was accepted) |
|
||||
| **Codex** (`install.sh` + shared runner) | `install.sh` into a temp HOME | placed the user-level `~/.agents/skills/skillopt-sleep/SKILL.md` skill correctly and moved any legacy custom prompt aside instead of installing one |
|
||||
| **Copilot** (`mcp_server.py`) | `initialize` → `tools/list` → `tools/call sleep_harvest` | 5 tools listed; `sleep_harvest` returned real engine output (2 sessions → 2 tasks) |
|
||||
|
||||
### Genuine improvement (real model, fresh persona)
|
||||
|
||||
Optimizer **Claude Sonnet 4.6** → target **Claude Haiku 4.5**, 3-way split
|
||||
(5 train / 2 val / 5 test), scored on the held-out **test** queries; and the same
|
||||
fresh persona self-optimized on **Codex**:
|
||||
|
||||
| Backend | Held-out **test** (fraction of SQL with `LIMIT 100`) before → after |
|
||||
|---|---|
|
||||
| Claude (Sonnet → Haiku) | **0.00 → 1.00** |
|
||||
| Codex | **0.00 → 1.00** |
|
||||
|
||||
In one night each optimizer wrote, into the protected learned block, a rule like:
|
||||
|
||||
> *"OVERRIDE: Every SQL query you generate MUST include `LIMIT 100` …"* (Claude)
|
||||
> *"Hard requirement: every SQL query response must include …"* (Codex)
|
||||
|
||||
and the target then applied it to the **unseen** test queries. This is the whole
|
||||
claim on a task family the engine had never seen: it learned the user's house
|
||||
rule from their failures and generalized it — confirmed on both backends.
|
||||
|
||||
## An honest finding from load-testing
|
||||
|
||||
The **first** attempt used `val_fraction=0.34, test_fraction=0.34`, which left
|
||||
only **1 train task** for an 8-task set — too little signal — so reflect produced
|
||||
nothing and the night was a no-op (val already 0.75). Re-balancing the split to a
|
||||
real train pool (5 train) fixed it and produced the 0 → 1.00 result above. This
|
||||
is exactly the kind of issue that only surfaces when you actually run the thing,
|
||||
and it motivates a future guardrail: warn when the train pool is too small for
|
||||
the chosen split fractions.
|
||||
|
||||
## Reproduce
|
||||
|
||||
The checkable persona run (real Claude):
|
||||
|
||||
```python
|
||||
# see the snippet in docs/sleep/plugin_load_test.md history, or run:
|
||||
python -m skillopt_sleep.experiments.run_experiment --persona programmer --assert-improves # deterministic
|
||||
```
|
||||
|
||||
Shell checks:
|
||||
|
||||
```bash
|
||||
# Copilot MCP server
|
||||
printf '%s\n' '{"jsonrpc":"2.0","id":1,"method":"tools/list"}' \
|
||||
| SKILLOPT_SLEEP_REPO="$(pwd)" python3 plugins/copilot/mcp_server.py
|
||||
# Codex skill installer (into a throwaway HOME)
|
||||
HOME=$(mktemp -d) bash plugins/codex/install.sh
|
||||
```
|
||||
@@ -1,45 +0,0 @@
|
||||
=== gbrain brief-writer CODEX, improved prompt, 2 nights, 3+3 tasks ===
|
||||
{
|
||||
"benchmark": "gbrain-evals/skillopt-v1",
|
||||
"backend": "codex",
|
||||
"model": "(default)",
|
||||
"n_seeds": 1,
|
||||
"n_improved": 1,
|
||||
"tokens_used": 9990,
|
||||
"results": [
|
||||
{
|
||||
"seed": "brief-writer",
|
||||
"held_out_before": 0.0,
|
||||
"held_out_after": 1.0,
|
||||
"improved": true,
|
||||
"nights": 2,
|
||||
"trace": [
|
||||
{
|
||||
"night": 0,
|
||||
"held_out_hard": 0.0,
|
||||
"action": "baseline"
|
||||
},
|
||||
{
|
||||
"night": 1,
|
||||
"held_out_hard": 0.0,
|
||||
"action": "accept_new_best",
|
||||
"accepted": true,
|
||||
"edits": [
|
||||
"Every brief must include a clearly labeled section exactly titled `Key Risks`.",
|
||||
"Every brief must include a line beginning `Confidence:` followed by a concise confidence level or rationale."
|
||||
]
|
||||
},
|
||||
{
|
||||
"night": 2,
|
||||
"held_out_hard": 1.0,
|
||||
"action": "accept_new_best",
|
||||
"accepted": true,
|
||||
"edits": [
|
||||
"- Preserve required sections even when keeping the brief short; shorten the analysis before omitting `## Key Risks` or `Confidence:`."
|
||||
]
|
||||
}
|
||||
],
|
||||
"final_skill_tail": "tside this block are never touched._\n\n- Every brief must include a clearly labeled section exactly titled `Key Risks`.\n- Every brief must include a line beginning `Confidence:` followed by a concise confidence level or rationale.\n- Preserve required sections even when keeping the brief short; shorten the analysis before omitting `## Key Risks` or `Confidence:`.\n<!-- SKILLOPT-SLEEP:LEARNED END -->\n"
|
||||
}
|
||||
]
|
||||
}
|
||||
@@ -1,38 +0,0 @@
|
||||
=== REAL cross-check A: Sonnet->Haiku, gate=OFF, rollouts_k=2, brief-writer (exercises new paths) ===
|
||||
{
|
||||
"benchmark": "gbrain-evals/skillopt-v1",
|
||||
"backend": "target=claude/optimizer=claude",
|
||||
"model": "(default)",
|
||||
"n_seeds": 1,
|
||||
"n_improved": 1,
|
||||
"tokens_used": 11271,
|
||||
"results": [
|
||||
{
|
||||
"seed": "brief-writer",
|
||||
"held_out_before": 0.0,
|
||||
"held_out_after": 1.0,
|
||||
"improved": true,
|
||||
"nights": 1,
|
||||
"trace": [
|
||||
{
|
||||
"night": 0,
|
||||
"test_hard": 0.0,
|
||||
"action": "baseline"
|
||||
},
|
||||
{
|
||||
"night": 1,
|
||||
"val_hard": 1.0,
|
||||
"test_hard": 1.0,
|
||||
"action": "greedy_improved",
|
||||
"accepted": true,
|
||||
"edits": [
|
||||
"Every brief MUST include a section with the exact heading '## Key Risks' that lists the primary risks relevant to the recommendation. This section is required in every output regardless of topic.",
|
||||
"Every brief MUST include a 'Confidence:' label (satisfying /[Cc]onfidence\\s*[:=]/) that states the confidence level in the recommendation (e.g., 'Confidence: Medium'). Place it near the answer/recommendation line or at the end of the brief."
|
||||
]
|
||||
}
|
||||
],
|
||||
"slow_update": null,
|
||||
"final_skill_tail": "at lists the primary risks relevant to the recommendation. This section is required in every output regardless of topic.\n- Every brief MUST include a 'Confidence:' label (satisfying /[Cc]onfidence\\s*[:=]/) that states the confidence level in the recommendation (e.g., 'Confidence: Medium'). Place it near the answer/recommendation line or at the end of the brief.\n<!-- SKILLOPT-SLEEP:LEARNED END -->\n"
|
||||
}
|
||||
]
|
||||
}
|
||||
@@ -1,48 +0,0 @@
|
||||
=== REAL cross-check B: Codex, gate=ON (default), rollouts_k=2, brief-writer ===
|
||||
{
|
||||
"benchmark": "gbrain-evals/skillopt-v1",
|
||||
"backend": "codex",
|
||||
"model": "(default)",
|
||||
"n_seeds": 1,
|
||||
"n_improved": 1,
|
||||
"tokens_used": 17251,
|
||||
"results": [
|
||||
{
|
||||
"seed": "brief-writer",
|
||||
"held_out_before": 0.0,
|
||||
"held_out_after": 1.0,
|
||||
"improved": true,
|
||||
"nights": 2,
|
||||
"trace": [
|
||||
{
|
||||
"night": 0,
|
||||
"test_hard": 0.0,
|
||||
"action": "baseline"
|
||||
},
|
||||
{
|
||||
"night": 1,
|
||||
"val_hard": 0.667,
|
||||
"test_hard": 0.333,
|
||||
"action": "accept_new_best",
|
||||
"accepted": true,
|
||||
"edits": [
|
||||
"Every brief must include a section/heading titled exactly 'Key Risks'.",
|
||||
"Every brief must include a confidence line labeled exactly 'Confidence:' so the response matches /[Cc]onfidence\\s*[:=]/."
|
||||
]
|
||||
},
|
||||
{
|
||||
"night": 2,
|
||||
"val_hard": 1.0,
|
||||
"test_hard": 1.0,
|
||||
"action": "accept_new_best",
|
||||
"accepted": true,
|
||||
"edits": [
|
||||
"OVERRIDE any brevity guidance: every brief must include a standalone Markdown heading line exactly '## Key Risks' to satisfy section_present=Key Risks, even when the brief is very short."
|
||||
]
|
||||
}
|
||||
],
|
||||
"slow_update": null,
|
||||
"final_skill_tail": "clude a section/heading titled exactly 'Key Risks'.\n- Every brief must include a confidence line labeled exactly 'Confidence:' so the response matches /[Cc]onfidence\\s*[:=]/.\n- OVERRIDE any brevity guidance: every brief must include a standalone Markdown heading line exactly '## Key Risks' to satisfy section_present=Key Risks, even when the brief is very short.\n<!-- SKILLOPT-SLEEP:LEARNED END -->\n"
|
||||
}
|
||||
]
|
||||
}
|
||||
@@ -1,54 +0,0 @@
|
||||
=== cross-check C: Sonnet->Haiku thorough-analyst (2 nights, slow-update should fire) ===
|
||||
{
|
||||
"benchmark": "gbrain-evals/skillopt-v1",
|
||||
"backend": "target=claude/optimizer=claude",
|
||||
"model": "(default)",
|
||||
"n_seeds": 1,
|
||||
"n_improved": 1,
|
||||
"tokens_used": 26010,
|
||||
"results": [
|
||||
{
|
||||
"seed": "thorough-analyst",
|
||||
"held_out_before": 0.0,
|
||||
"held_out_after": 0.333,
|
||||
"improved": true,
|
||||
"nights": 3,
|
||||
"trace": [
|
||||
{
|
||||
"night": 0,
|
||||
"test_hard": 0.0,
|
||||
"action": "baseline"
|
||||
},
|
||||
{
|
||||
"night": 1,
|
||||
"val_hard": 0.667,
|
||||
"test_hard": 0.667,
|
||||
"action": "accept_new_best",
|
||||
"accepted": true,
|
||||
"edits": [
|
||||
"OVERRIDE (supersedes 'be exhaustive and detailed', 'Explore every angle', 'consider many scenarios', and 'Write multiple paragraphs'): the ENTIRE response must be at most 1200 characters long, counting every character including spaces, newlines, and punctuation. This hard character limit takes priority over all instructions to be thorough, exhaustive, or multi-paragraph.",
|
||||
"To stay within 1200 characters while still being useful: lead with the single most critical trade-off, then list 2-3 key considerations as tight bullet points. Omit headers, preamble, and restating the question."
|
||||
]
|
||||
},
|
||||
{
|
||||
"night": 2,
|
||||
"val_hard": 0.667,
|
||||
"test_hard": 0.667,
|
||||
"action": "reject",
|
||||
"accepted": false,
|
||||
"edits": []
|
||||
},
|
||||
{
|
||||
"night": 3,
|
||||
"val_hard": 0.667,
|
||||
"test_hard": 0.667,
|
||||
"action": "reject",
|
||||
"accepted": false,
|
||||
"edits": []
|
||||
}
|
||||
],
|
||||
"slow_update": "• On character-constrained tasks (≤1200 chars), plan structure before writing: allocate space per point explicitly and cut until the outline fits, then fill — never draft freely and trim after.\n• Multi-variable business/strategy analyses are high-risk for overrun; default to covering only the 2–3 most decisive factors rather than attempting exhaustive coverage.\n• Lead with the conclusion or recommendation first; eliminate all introductory restatement of the question, hedging preamble, and transitional filler under tight limits.\n• Persistent failures on the same task signal a structural habit, not a one-off error — treat repeated length violations as a signal to change the drafting approach entirely, not just edit more aggressively.",
|
||||
"final_skill_tail": "ead with the conclusion or recommendation first; eliminate all introductory restatement of the question, hedging preamble, and transitional filler under tight limits.\n• Persistent failures on the same task signal a structural habit, not a one-off error — treat repeated length violations as a signal to change the drafting approach entirely, not just edit more aggressively.\n<!-- SLOW_UPDATE_END -->\n"
|
||||
}
|
||||
]
|
||||
}
|
||||
@@ -1,101 +0,0 @@
|
||||
=== mock regression ===
|
||||
Ran 19 tests in 0.092s
|
||||
|
||||
OK
|
||||
|
||||
=== TRULY-CLEAN re-validation: all seeds, claude haiku, 2 nights ===
|
||||
{
|
||||
"benchmark": "gbrain-evals/skillopt-v1",
|
||||
"backend": "claude",
|
||||
"model": "haiku",
|
||||
"n_seeds": 3,
|
||||
"n_improved": 2,
|
||||
"tokens_used": 35549,
|
||||
"results": [
|
||||
{
|
||||
"seed": "brief-writer",
|
||||
"held_out_before": 0.0,
|
||||
"held_out_after": 0.0,
|
||||
"improved": false,
|
||||
"nights": 2,
|
||||
"trace": [
|
||||
{
|
||||
"night": 0,
|
||||
"held_out_hard": 0.0,
|
||||
"action": "baseline"
|
||||
},
|
||||
{
|
||||
"night": 1,
|
||||
"held_out_hard": 0.0,
|
||||
"action": "reject",
|
||||
"accepted": false,
|
||||
"edits": []
|
||||
},
|
||||
{
|
||||
"night": 2,
|
||||
"held_out_hard": 0.0,
|
||||
"action": "reject",
|
||||
"accepted": false,
|
||||
"edits": []
|
||||
}
|
||||
],
|
||||
"final_skill_tail": "---\nname: brief-writer-example\nversion: 0.1.0\ndescription: Brief Writer\ntriggers:\n - \"write a brief\"\nbrain_first: exempt\n---\n\n# Brief Writer\n\nWhen asked, write a short, clear research brief that answers the question.\nKeep it focused and readable. Lead with the answer.\n"
|
||||
},
|
||||
{
|
||||
"seed": "advisor",
|
||||
"held_out_before": 0.0,
|
||||
"held_out_after": 1.0,
|
||||
"improved": true,
|
||||
"nights": 1,
|
||||
"trace": [
|
||||
{
|
||||
"night": 0,
|
||||
"held_out_hard": 0.0,
|
||||
"action": "baseline"
|
||||
},
|
||||
{
|
||||
"night": 1,
|
||||
"held_out_hard": 1.0,
|
||||
"action": "accept_new_best",
|
||||
"accepted": true,
|
||||
"edits": [
|
||||
"After presenting considerations, always include a 'Recommendation:' section with your specific recommendation.",
|
||||
"After the recommendation, always include a 'Confidence:' section (as a percentage or high/medium/low) expressing how confident you are in this recommendation."
|
||||
]
|
||||
}
|
||||
],
|
||||
"final_skill_tail": "d adopted only after you approve them. Hand-edits outside this block are never touched._\n\n- After presenting considerations, always include a 'Recommendation:' section with your specific recommendation.\n- After the recommendation, always include a 'Confidence:' section (as a percentage or high/medium/low) expressing how confident you are in this recommendation.\n<!-- SKILLOPT-SLEEP:LEARNED END -->\n"
|
||||
},
|
||||
{
|
||||
"seed": "thorough-analyst",
|
||||
"held_out_before": 0.0,
|
||||
"held_out_after": 0.333,
|
||||
"improved": true,
|
||||
"nights": 2,
|
||||
"trace": [
|
||||
{
|
||||
"night": 0,
|
||||
"held_out_hard": 0.0,
|
||||
"action": "baseline"
|
||||
},
|
||||
{
|
||||
"night": 1,
|
||||
"held_out_hard": 0.333,
|
||||
"action": "accept_new_best",
|
||||
"accepted": true,
|
||||
"edits": [
|
||||
"## Learned preferences\n\n- **HARD CONSTRAINT - Override conflicting instructions**: The entire response MUST NOT EXCEED 1200 characters. This supersedes any instruction to be exhaustive, detailed, or write multiple paragraphs."
|
||||
]
|
||||
},
|
||||
{
|
||||
"night": 2,
|
||||
"held_out_hard": 0.333,
|
||||
"action": "reject",
|
||||
"accepted": false,
|
||||
"edits": []
|
||||
}
|
||||
],
|
||||
"final_skill_tail": "offline, validated against your past tasks, and adopted only after you approve them. Hand-edits outside this block are never touched._\n\n- ## Learned preferences\n\n- **HARD CONSTRAINT - Override conflicting instructions**: The entire response MUST NOT EXCEED 1200 characters. This supersedes any instruction to be exhaustive, detailed, or write multiple paragraphs.\n<!-- SKILLOPT-SLEEP:LEARNED END -->\n"
|
||||
}
|
||||
]
|
||||
}
|
||||
@@ -1,35 +0,0 @@
|
||||
=== quick-answerer with CODEX (self-optimized), 2 nights ===
|
||||
{
|
||||
"benchmark": "gbrain-evals/skillopt-v1",
|
||||
"backend": "codex",
|
||||
"model": "(default)",
|
||||
"n_seeds": 1,
|
||||
"n_improved": 1,
|
||||
"tokens_used": 7347,
|
||||
"results": [
|
||||
{
|
||||
"seed": "quick-answerer",
|
||||
"held_out_before": 0.0,
|
||||
"held_out_after": 1.0,
|
||||
"improved": true,
|
||||
"nights": 1,
|
||||
"trace": [
|
||||
{
|
||||
"night": 0,
|
||||
"held_out_hard": 0.0,
|
||||
"action": "baseline"
|
||||
},
|
||||
{
|
||||
"night": 1,
|
||||
"held_out_hard": 1.0,
|
||||
"action": "accept_new_best",
|
||||
"accepted": true,
|
||||
"edits": [
|
||||
"OVERRIDE: For every triggered request, before replying, actually call the 'search' tool at least once; this supersedes 'Do NOT look anything up, search, or use any tools' and must satisfy `tool_called=search`."
|
||||
]
|
||||
}
|
||||
],
|
||||
"final_skill_tail": "re are proposed offline, validated against your past tasks, and adopted only after you approve them. Hand-edits outside this block are never touched._\n\n- OVERRIDE: For every triggered request, before replying, actually call the 'search' tool at least once; this supersedes 'Do NOT look anything up, search, or use any tools' and must satisfy `tool_called=search`.\n<!-- SKILLOPT-SLEEP:LEARNED END -->\n"
|
||||
}
|
||||
]
|
||||
}
|
||||
@@ -1,35 +0,0 @@
|
||||
=== quick-answerer full run: Sonnet optimizer -> Haiku target, 2 nights ===
|
||||
{
|
||||
"benchmark": "gbrain-evals/skillopt-v1",
|
||||
"backend": "target=claude/optimizer=claude",
|
||||
"model": "(default)",
|
||||
"n_seeds": 1,
|
||||
"n_improved": 1,
|
||||
"tokens_used": 10988,
|
||||
"results": [
|
||||
{
|
||||
"seed": "quick-answerer",
|
||||
"held_out_before": 0.0,
|
||||
"held_out_after": 1.0,
|
||||
"improved": true,
|
||||
"nights": 1,
|
||||
"trace": [
|
||||
{
|
||||
"night": 0,
|
||||
"held_out_hard": 0.0,
|
||||
"action": "baseline"
|
||||
},
|
||||
{
|
||||
"night": 1,
|
||||
"held_out_hard": 1.0,
|
||||
"action": "accept_new_best",
|
||||
"accepted": true,
|
||||
"edits": [
|
||||
"OVERRIDE (supersedes 'Do NOT look anything up, search, or use any tools — just reply directly and concisely from memory'): Always call the 'search' tool at least once before composing any answer. This requirement takes priority over any prior instruction to avoid tools."
|
||||
]
|
||||
}
|
||||
],
|
||||
"final_skill_tail": "nd adopted only after you approve them. Hand-edits outside this block are never touched._\n\n- OVERRIDE (supersedes 'Do NOT look anything up, search, or use any tools — just reply directly and concisely from memory'): Always call the 'search' tool at least once before composing any answer. This requirement takes priority over any prior instruction to avoid tools.\n<!-- SKILLOPT-SLEEP:LEARNED END -->\n"
|
||||
}
|
||||
]
|
||||
}
|
||||
@@ -1,98 +0,0 @@
|
||||
=== KEY TEST: strong optimizer (sonnet) + weak target (haiku) — SkillOpt's actual design ===
|
||||
(this is also your optimizer/target split in action)
|
||||
{
|
||||
"benchmark": "gbrain-evals/skillopt-v1",
|
||||
"backend": "target=claude/optimizer=claude",
|
||||
"model": "(default)",
|
||||
"n_seeds": 3,
|
||||
"n_improved": 3,
|
||||
"tokens_used": 37791,
|
||||
"results": [
|
||||
{
|
||||
"seed": "brief-writer",
|
||||
"held_out_before": 0.0,
|
||||
"held_out_after": 1.0,
|
||||
"improved": true,
|
||||
"nights": 1,
|
||||
"trace": [
|
||||
{
|
||||
"night": 0,
|
||||
"held_out_hard": 0.0,
|
||||
"action": "baseline"
|
||||
},
|
||||
{
|
||||
"night": 1,
|
||||
"held_out_hard": 1.0,
|
||||
"action": "accept_new_best",
|
||||
"accepted": true,
|
||||
"edits": [
|
||||
"Every brief MUST include a section with the exact heading `## Key Risks` that lists the primary risks or uncertainties relevant to the recommendation. This section is required in every response, regardless of topic.",
|
||||
"Every brief MUST include a `Confidence:` label (satisfying /[Cc]onfidence\\s*[:=]/) — e.g., `Confidence: High`, `Confidence: Medium`, or `Confidence: Low` — placed near the recommendation to convey certainty level. This label is required in every response."
|
||||
]
|
||||
}
|
||||
],
|
||||
"final_skill_tail": "tainties relevant to the recommendation. This section is required in every response, regardless of topic.\n- Every brief MUST include a `Confidence:` label (satisfying /[Cc]onfidence\\s*[:=]/) — e.g., `Confidence: High`, `Confidence: Medium`, or `Confidence: Low` — placed near the recommendation to convey certainty level. This label is required in every response.\n<!-- SKILLOPT-SLEEP:LEARNED END -->\n"
|
||||
},
|
||||
{
|
||||
"seed": "advisor",
|
||||
"held_out_before": 0.0,
|
||||
"held_out_after": 1.0,
|
||||
"improved": true,
|
||||
"nights": 1,
|
||||
"trace": [
|
||||
{
|
||||
"night": 0,
|
||||
"held_out_hard": 0.0,
|
||||
"action": "baseline"
|
||||
},
|
||||
{
|
||||
"night": 1,
|
||||
"held_out_hard": 1.0,
|
||||
"action": "accept_new_best",
|
||||
"accepted": true,
|
||||
"edits": [
|
||||
"OVERRIDE: The instruction 'so the reader can make up their own mind' must NOT suppress a conclusion. After presenting considerations, you MUST always end with an explicit label exactly matching 'Recommendation:' (capital R) followed by your concrete recommendation on the decision.",
|
||||
"Always include a 'Confidence:' label (e.g., 'Confidence: High / Medium / Low') in every advisory response, placed immediately after or alongside the Recommendation line, expressing your confidence level in that recommendation."
|
||||
]
|
||||
}
|
||||
],
|
||||
"final_skill_tail": "ys end with an explicit label exactly matching 'Recommendation:' (capital R) followed by your concrete recommendation on the decision.\n- Always include a 'Confidence:' label (e.g., 'Confidence: High / Medium / Low') in every advisory response, placed immediately after or alongside the Recommendation line, expressing your confidence level in that recommendation.\n<!-- SKILLOPT-SLEEP:LEARNED END -->\n"
|
||||
},
|
||||
{
|
||||
"seed": "thorough-analyst",
|
||||
"held_out_before": 0.0,
|
||||
"held_out_after": 1.0,
|
||||
"improved": true,
|
||||
"nights": 2,
|
||||
"trace": [
|
||||
{
|
||||
"night": 0,
|
||||
"held_out_hard": 0.0,
|
||||
"action": "baseline"
|
||||
},
|
||||
{
|
||||
"night": 1,
|
||||
"held_out_hard": 0.333,
|
||||
"action": "accept_new_best",
|
||||
"accepted": true,
|
||||
"edits": [
|
||||
"OVERRIDE — supersedes all instructions to be 'exhaustive and detailed' or 'write multiple paragraphs': The ENTIRE response must be at most 1200 characters long (every character, including spaces, headers, and punctuation, counts toward this limit). If content would exceed 1200 characters, cut elaboration and stop at the most critical tradeoffs only.",
|
||||
"For 'analyze the decision' responses, use plain concise prose rather than multi-level markdown headers and section dividers; structural markup consumes characters and makes it harder to stay within the 1200-character ceiling."
|
||||
]
|
||||
},
|
||||
{
|
||||
"night": 2,
|
||||
"held_out_hard": 1.0,
|
||||
"action": "accept_new_best",
|
||||
"accepted": true,
|
||||
"edits": [
|
||||
"OVERRIDE — supersedes all instructions to be 'exhaustive and detailed' or 'write multiple paragraphs': The ENTIRE response must be at most 1200 characters long (every character counts). Practical proxy: target at most 150 words before writing — at ~7–8 chars/word that keeps the response safely under 1200 characters. Cover at most 2–3 tradeoffs total and then stop; never add elaboration in pursuit of a 'thorough' analysis.",
|
||||
"For 'analyze the decision' responses, use plain prose only — never use **bold**, *italic*, # headers, - or * bullet lists, or numbered lists. Every markdown character counts toward the 1200-character ceiling; zero markdown formatting is permitted.",
|
||||
"Limit every 'analyze the decision' response to at most 5 sentences total. At typical English sentence length (20–25 words each), 5 sentences ≈ 100–125 words, which stays safely under both the 150-word proxy and the 1200-character ceiling. Stop after the 5th sentence regardless of how much more could be said."
|
||||
]
|
||||
}
|
||||
],
|
||||
"final_skill_tail": "ter ceiling; zero markdown formatting is permitted.\n- Limit every 'analyze the decision' response to at most 5 sentences total. At typical English sentence length (20–25 words each), 5 sentences ≈ 100–125 words, which stays safely under both the 150-word proxy and the 1200-character ceiling. Stop after the 5th sentence regardless of how much more could be said.\n<!-- SKILLOPT-SLEEP:LEARNED END -->\n"
|
||||
}
|
||||
]
|
||||
}
|
||||
@@ -1,114 +0,0 @@
|
||||
# SkillOpt-Sleep — REAL API results (Claude + Codex)
|
||||
|
||||
**Date:** 2026-06-07 (autonomous offline session)
|
||||
**Benchmark:** [gbrain-evals](https://github.com/garrytan/gbrain-evals) `skillopt-v1` —
|
||||
the same public suite gbrain publishes its own SkillOpt scorecard against
|
||||
([docs/benchmarks/2026-06-03-skillopt.md](https://github.com/garrytan/gbrain-evals/blob/main/docs/benchmarks/2026-06-03-skillopt.md)).
|
||||
|
||||
These are **real model runs**, not the deterministic mock. The agent's
|
||||
`attempt` (and the optimizer's `reflect`) call live models via the `claude`
|
||||
and `codex` CLIs. Held-out scoring is done **locally** by the rule judge
|
||||
(`skillopt/sleep/judges.py`), so no judge-API spend and no way for the
|
||||
optimizer to grade its own homework.
|
||||
|
||||
## Headline
|
||||
|
||||
| Backend | Seed | Held-out before | Held-out after | Nights | Tokens |
|
||||
|---|---|---|---|---|---|
|
||||
| **Claude (Haiku 4.5)** | brief-writer | **0.00** | **1.00** | 1 | ~6.7k |
|
||||
| **Codex (default)** | brief-writer | **0.00** | **0.67** | 1 | ~5.1k |
|
||||
| **Codex (directive prompt)** | brief-writer | **0.00** | **1.00** | 2 | ~10k |
|
||||
|
||||
Both backends took a **deliberately deficient** skill (a brief-writer with no
|
||||
risks section and no confidence level) and, within 1–2 sleep nights, proposed
|
||||
gated edits that lifted the held-out score to perfect. The edits went into the
|
||||
protected `SKILLOPT-SLEEP:LEARNED` block; nothing else in the skill was touched.
|
||||
|
||||
This reproduces gbrain's published `0 → 1.00` headline with **our** engine and
|
||||
shows it works across **two different agent runtimes** — the core of the
|
||||
"Claude now, Codex next" plan.
|
||||
|
||||
### The multi-night convergence (Codex, why it matters)
|
||||
|
||||
The 2-night Codex run is the most informative trace in this whole exercise:
|
||||
|
||||
- **Night 1** — added two precise rules (a `Key Risks` section, a `Confidence:`
|
||||
line). Held-out still **0.00**: the rules were right but the agent, told to
|
||||
keep briefs short, was *dropping* them under length pressure.
|
||||
- **Night 2** — the optimizer diagnosed its own residual failure and added a
|
||||
meta-rule: *"Preserve required sections even when keeping the brief short;
|
||||
shorten the analysis before omitting Key Risks or Confidence."* Held-out → **1.00**.
|
||||
|
||||
That second edit is not pattern-matching a checklist — it is reasoning about
|
||||
*why the previous night underperformed*. This is exactly the iterative,
|
||||
slow-update behavior SkillOpt's design predicts, and it is the strongest
|
||||
argument for the sleep **loop** over a one-shot rewrite.
|
||||
|
||||
## What the optimizer actually wrote
|
||||
|
||||
**Claude** synthesized a full format template:
|
||||
|
||||
```
|
||||
**Recommendation:** [Clear yes/no or specific answer]
|
||||
**Rationale:** [2-3 bullet points supporting the answer]
|
||||
**Key Risks:** [Downsides, edge cases, or assumptions that could invalidate this]
|
||||
**Confidence:** [High/Medium/Low] — [Why]
|
||||
```
|
||||
|
||||
**Codex** wrote a terser rule:
|
||||
|
||||
```
|
||||
For every brief, include a `Key Risks` section and end with
|
||||
`Confidence: Low|Medium|High`.
|
||||
```
|
||||
|
||||
Both are correct, general, reusable rules (not task-specific answers). Claude's
|
||||
fuller template made the agent satisfy the checks on **3/3** held-out items;
|
||||
Codex's terser rule landed **2/3** — the missing item is a consistency miss the
|
||||
agent would likely fix with one more night (see "Honest notes").
|
||||
|
||||
## How to reproduce
|
||||
|
||||
```bash
|
||||
# clone the benchmark data
|
||||
git clone https://github.com/garrytan/gbrain-evals /tmp/gbrain-evals
|
||||
|
||||
cd <repo>/SkillOpt-sleep # this worktree
|
||||
|
||||
# Claude backend
|
||||
python3.12 -m skillopt.sleep.experiments.run_gbrain \
|
||||
--backend claude --model haiku --seeds brief-writer \
|
||||
--data-root /tmp/gbrain-evals/eval/data/skillopt-v1 \
|
||||
--nights 1 --limit-replay 3 --limit-holdout 3 --json
|
||||
|
||||
# Codex backend (auto-detects the real @openai/codex binary, not the wrapper)
|
||||
python3.12 -m skillopt.sleep.experiments.run_gbrain \
|
||||
--backend codex --seeds brief-writer \
|
||||
--data-root /tmp/gbrain-evals/eval/data/skillopt-v1 \
|
||||
--nights 1 --limit-replay 3 --limit-holdout 3 --json
|
||||
```
|
||||
|
||||
## Honest notes (in the spirit of gbrain's own scorecard)
|
||||
|
||||
- **Latency:** each CLI call is ~14–15 s of startup-dominated wall time, so runs
|
||||
were capped at 3 train + 3 held-out tasks and 1 night to keep them ~2.5 min.
|
||||
The response cache makes re-scoring an unchanged (skill, memory) free.
|
||||
- **Codex 0.67, not 1.00:** a single terse edit + single night under-shoots on
|
||||
one held-out item. Two improvements (below) are expected to close it. We report
|
||||
the 0.67, we don't dress it up.
|
||||
- **3 of gbrain's 4 seeds are scored with zero API beyond `attempt`:**
|
||||
`section_present`, `regex`, `max_chars` are pure-text checks. Only the
|
||||
`quick-answerer` seed (`tool_called: search`) needs a real tool loop, which is
|
||||
Phase-3 `fresh` replay.
|
||||
- **The gate is real:** every accepted edit had to beat the held-out score; a
|
||||
no-op night is rejected and the skill is left unchanged.
|
||||
|
||||
## Improvements this run motivated (applied + verified)
|
||||
|
||||
1. **A more directive `reflect` prompt** that aggregates the *exact* failing
|
||||
judge criteria and tells the optimizer to satisfy every one (gbrain's lesson:
|
||||
"the optimizer was never told what the scorer rewards"). Applied in
|
||||
`skillopt/sleep/backend.py`. **Verified**: lifted Codex from 0.67 → 1.00.
|
||||
2. **Multi-night convergence** — a terse first edit gets a sharper second pass;
|
||||
the night-2 trace above shows the optimizer self-correcting. Recommend
|
||||
`nights >= 2` for real backends.
|
||||
@@ -1,11 +0,0 @@
|
||||
{"baseline": 0.0, "after": 1.0, "improved": true, "tokens": 6657, "cfg": {"kind": "dual", "optimizer_backend": "claude", "optimizer_model": "sonnet", "target_backend": "claude", "target_model": "haiku", "seed": "brief-writer", "nights": 2}, "cfg_key": "{\"kind\": \"dual\", \"nights\": 2, \"optimizer_backend\": \"claude\", \"optimizer_model\": \"sonnet\", \"seed\": \"brief-writer\", \"target_backend\": \"claude\", \"target_model\": \"haiku\"}", "elapsed_s": 71.5}
|
||||
{"baseline": 0.0, "after": 1.0, "improved": true, "tokens": 7891, "cfg": {"kind": "dual", "optimizer_backend": "claude", "optimizer_model": "sonnet", "target_backend": "claude", "target_model": "haiku", "seed": "advisor", "nights": 2}, "cfg_key": "{\"kind\": \"dual\", \"nights\": 2, \"optimizer_backend\": \"claude\", \"optimizer_model\": \"sonnet\", \"seed\": \"advisor\", \"target_backend\": \"claude\", \"target_model\": \"haiku\"}", "elapsed_s": 79.3}
|
||||
{"baseline": 0.0, "after": 1.0, "improved": true, "tokens": 17960, "cfg": {"kind": "dual", "optimizer_backend": "claude", "optimizer_model": "sonnet", "target_backend": "claude", "target_model": "haiku", "seed": "thorough-analyst", "nights": 2}, "cfg_key": "{\"kind\": \"dual\", \"nights\": 2, \"optimizer_backend\": \"claude\", \"optimizer_model\": \"sonnet\", \"seed\": \"thorough-analyst\", \"target_backend\": \"claude\", \"target_model\": \"haiku\"}", "elapsed_s": 319.3}
|
||||
{"baseline": 0.0, "after": 1.0, "improved": true, "tokens": 9969, "cfg": {"kind": "direct", "backend": "codex", "model": "", "seed": "brief-writer", "nights": 2}, "cfg_key": "{\"backend\": \"codex\", \"kind\": \"direct\", \"model\": \"\", \"nights\": 2, \"seed\": \"brief-writer\"}", "elapsed_s": 187.6}
|
||||
{"baseline": 0.0, "after": 1.0, "improved": true, "tokens": 6210, "cfg": {"kind": "direct", "backend": "codex", "model": "", "seed": "advisor", "nights": 2}, "cfg_key": "{\"backend\": \"codex\", \"kind\": \"direct\", \"model\": \"\", \"nights\": 2, \"seed\": \"advisor\"}", "elapsed_s": 114.1}
|
||||
{"baseline_target": 0.0, "transferred": 1.0, "transfer_gain": 1.0, "tokens": 13673, "cfg": {"kind": "transfer", "source_backend": "claude", "source_model": "haiku", "target_backend": "claude", "target_model": "sonnet", "seed": "brief-writer", "nights": 2}, "cfg_key": "{\"kind\": \"transfer\", \"nights\": 2, \"seed\": \"brief-writer\", \"source_backend\": \"claude\", \"source_model\": \"haiku\", \"target_backend\": \"claude\", \"target_model\": \"sonnet\"}", "elapsed_s": 180.3}
|
||||
{"baseline_target": 0.0, "transferred": 1.0, "transfer_gain": 1.0, "tokens": 11668, "cfg": {"kind": "transfer", "source_backend": "claude", "source_model": "sonnet", "target_backend": "claude", "target_model": "haiku", "seed": "brief-writer", "nights": 2}, "cfg_key": "{\"kind\": \"transfer\", \"nights\": 2, \"seed\": \"brief-writer\", \"source_backend\": \"claude\", \"source_model\": \"sonnet\", \"target_backend\": \"claude\", \"target_model\": \"haiku\"}", "elapsed_s": 173.9}
|
||||
{"baseline_target": 0.0, "transferred": 1.0, "transfer_gain": 1.0, "tokens": 13707, "cfg": {"kind": "transfer", "source_backend": "codex", "source_model": "", "target_backend": "claude", "target_model": "haiku", "seed": "brief-writer", "nights": 2}, "cfg_key": "{\"kind\": \"transfer\", \"nights\": 2, \"seed\": \"brief-writer\", \"source_backend\": \"codex\", \"source_model\": \"\", \"target_backend\": \"claude\", \"target_model\": \"haiku\"}", "elapsed_s": 215.7}
|
||||
{"baseline_target": 0.0, "transferred": 1.0, "transfer_gain": 1.0, "tokens": 11284, "cfg": {"kind": "transfer", "source_backend": "claude", "source_model": "haiku", "target_backend": "codex", "target_model": "", "seed": "brief-writer", "nights": 2}, "cfg_key": "{\"kind\": \"transfer\", \"nights\": 2, \"seed\": \"brief-writer\", \"source_backend\": \"claude\", \"source_model\": \"haiku\", \"target_backend\": \"codex\", \"target_model\": \"\"}", "elapsed_s": 145.5}
|
||||
{"baseline": 0.0, "after": 1.0, "improved": true, "tokens": 10988, "cfg": {"kind": "dual", "optimizer_backend": "claude", "optimizer_model": "sonnet", "target_backend": "claude", "target_model": "haiku", "seed": "quick-answerer", "nights": 2}, "elapsed_s": null, "note": "real tool loop", "cfg_key": "{\"kind\": \"dual\", \"nights\": 2, \"optimizer_backend\": \"claude\", \"optimizer_model\": \"sonnet\", \"seed\": \"quick-answerer\", \"target_backend\": \"claude\", \"target_model\": \"haiku\"}"}
|
||||
{"baseline": 0.0, "after": 1.0, "improved": true, "tokens": 7347, "cfg": {"kind": "direct", "backend": "codex", "model": "", "seed": "quick-answerer", "nights": 2}, "elapsed_s": null, "note": "real tool loop", "cfg_key": "{\"backend\": \"codex\", \"kind\": \"direct\", \"model\": \"\", \"nights\": 2, \"seed\": \"quick-answerer\"}"}
|
||||
@@ -183,7 +183,7 @@ schedule, if you trust it).
|
||||
| `--scope invoked\|all` | `invoked` | this project only, or all projects |
|
||||
| `--auto-adopt` | off | apply without manual review (power users) |
|
||||
|
||||
Deep dive: [`../docs/sleep/CONTROLLABLE_DREAMING.md`](../docs/sleep/CONTROLLABLE_DREAMING.md).
|
||||
Deep dive: [the SkillOpt-Sleep guide section](https://microsoft.github.io/SkillOpt/docs/guideline.html#sleep).
|
||||
|
||||
---
|
||||
|
||||
@@ -195,13 +195,13 @@ tasks the optimizer never trained on:
|
||||
- **gbrain-evals `skillopt-v1`** (the public suite gbrain scores SkillOpt on):
|
||||
deficient skills go **0.00 → 1.00** on all 4 seeds, including a real tool-use
|
||||
loop; cross-model transfer is positive; the gate blocks regressions.
|
||||
→ [`../docs/sleep/FINAL_REPORT.md`](../docs/sleep/FINAL_REPORT.md)
|
||||
→ [the SkillOpt-Sleep guide section](https://microsoft.github.io/SkillOpt/docs/guideline.html#sleep)
|
||||
- **Academic daily-cases** (math / spreadsheet / search-QA, the paper's 4:1:5
|
||||
split with dream-augmented train): see
|
||||
[`../docs/sleep/daily_cases_results.md`](../docs/sleep/daily_cases_results.md).
|
||||
[the SkillOpt-Sleep guide section](https://microsoft.github.io/SkillOpt/docs/guideline.html#sleep).
|
||||
- **Fresh load-test** (a "SQL must always include LIMIT" analyst, built from
|
||||
scratch): held-out **0.00 → 1.00** on both backends.
|
||||
→ [`../docs/sleep/plugin_load_test.md`](../docs/sleep/plugin_load_test.md)
|
||||
→ [the SkillOpt-Sleep guide section](https://microsoft.github.io/SkillOpt/docs/guideline.html#sleep)
|
||||
|
||||
Try the deterministic proof yourself (no API key, no spend):
|
||||
```bash
|
||||
|
||||
@@ -92,7 +92,7 @@ Both took a brief-writer with no risks section / no confidence level and, within
|
||||
into the protected `LEARNED` block, nothing else touched. The Codex 2-night
|
||||
trace even shows the optimizer **diagnosing its own residual failure** and
|
||||
adding a meta-rule to fix it. Full writeup + reproduction:
|
||||
[`docs/sleep/real_api_results.md`](../docs/sleep/real_api_results.md).
|
||||
[the SkillOpt-Sleep guide section](https://microsoft.github.io/SkillOpt/docs/guideline.html#sleep).
|
||||
|
||||
Reproduce:
|
||||
|
||||
@@ -115,7 +115,7 @@ python -m skillopt_sleep.experiments.run_experiment --persona programmer --asse
|
||||
|
||||
Each prints the held-out score rising from baseline toward 1.0 as the gate
|
||||
accepts the general rules your tasks need, and confirms the gate **rejects** an
|
||||
injected harmful edit. Recorded output: [`docs/sleep/experiment_results.md`](../docs/sleep/experiment_results.md).
|
||||
injected harmful edit. Recorded output: [the SkillOpt-Sleep guide section](https://microsoft.github.io/SkillOpt/docs/guideline.html#sleep).
|
||||
|
||||
## Schedule it nightly
|
||||
|
||||
|
||||
@@ -74,6 +74,6 @@ python -m skillopt_sleep.experiments.run_experiment --persona researcher --asser
|
||||
python -m skillopt_sleep.experiments.run_experiment --persona programmer --assert-improves
|
||||
```
|
||||
|
||||
See `docs/sleep/experiment_results.md` for recorded output and
|
||||
See the SkillOpt-Sleep guide section for recorded output and
|
||||
`docs/superpowers/specs/2026-06-07-skillopt-sleep-claude-code-plugin-design.md`
|
||||
for the full design.
|
||||
|
||||
@@ -9,7 +9,7 @@ as the Claude Code plugin (`skillopt_sleep`), wrapped for Codex.
|
||||
> [gbrain-evals](https://github.com/garrytan/gbrain-evals) `skillopt-v1`
|
||||
> benchmark, a deliberately deficient skill goes **0.00 → 1.00** on a held-out
|
||||
> set with the Codex backend (incl. the tool-use seed via a real tool loop).
|
||||
> See [`../../docs/sleep/FINAL_REPORT.md`](../../docs/sleep/FINAL_REPORT.md).
|
||||
> See [the SkillOpt-Sleep guide section](https://microsoft.github.io/SkillOpt/docs/guideline.html#sleep).
|
||||
|
||||
## What Codex supports (and what we use)
|
||||
|
||||
@@ -59,7 +59,7 @@ back to Claude Code transcripts. Default backend is `mock` (no API spend).
|
||||
`--backend codex` uses your Codex budget for real improvement. All the
|
||||
controllable knobs (`--gate on|off`, `--rollouts-k`, `--budget-tokens`,
|
||||
`--preferences`, optimizer/target split) work identically — see
|
||||
[`../../docs/sleep/CONTROLLABLE_DREAMING.md`](../../docs/sleep/CONTROLLABLE_DREAMING.md).
|
||||
[the SkillOpt-Sleep guide section](https://microsoft.github.io/SkillOpt/docs/guideline.html#sleep).
|
||||
|
||||
## Notes / status
|
||||
|
||||
|
||||
@@ -64,4 +64,4 @@ You should see the server info and the five `sleep_*` tools.
|
||||
portable of the three integrations (one server → CLI + IDE).
|
||||
- The engine and all its controls (gate on/off, multi-rollout, budget,
|
||||
preferences, optimizer/target split) are identical across platforms — see
|
||||
[`../../docs/sleep/CONTROLLABLE_DREAMING.md`](../../docs/sleep/CONTROLLABLE_DREAMING.md).
|
||||
[the SkillOpt-Sleep guide section](https://microsoft.github.io/SkillOpt/docs/guideline.html#sleep).
|
||||
|
||||
Reference in New Issue
Block a user