docs(sleep): add a SkillOpt-Sleep module readme + News mention

Adds docs/sleep/README.md — a concise intro to the SkillOpt-Sleep plugin (what it is, how to use it across the three agents, the opt-in experience-replay / dream-rollout knobs, and headline results), linking to the full guide section. Adds a News bullet pointing to it. No code changes.
2026-07-03 14:02:58 +08:00 · 2026-06-15 16:31:15 +00:00
parent b701d9b6d9
commit de3be75bac
2 changed files with 78 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -9,6 +9,7 @@
 ---

 ## News 🔥🔥🔥
+- **[2026-06-15]** 😴 **SkillOpt-Sleep (preview)** — a nightly offline self-evolution companion for local coding agents (Claude Code / Codex / Copilot): review past sessions, replay recurring tasks, and consolidate validated skills behind a held-out gate. See **[`docs/sleep/README.md`](docs/sleep/README.md)** for what it is, how to use it, and results.
 - **[2026-06-03]** 🎉 **[gbrain](https://github.com/garrytan/gbrain), [gbrain-evals](https://github.com/garrytan/gbrain-evals/blob/main/docs/benchmarks/2026-06-03-skillopt.md), and [darwin-skill](https://github.com/alchaincyf/darwin-skill) have all integrated SkillOpt.**
 - **[2026-06-02]** 🎉 **SkillOpt [v0.1.0](https://github.com/microsoft/SkillOpt/releases/tag/v0.1.0) is now available on [PyPI](https://pypi.org/project/skillopt/)!** Install with `pip install skillopt`. This initial release includes the full training loop (rollout → reflect → aggregate → select → update → evaluate), multi-backend support (OpenAI / Azure / Claude / Qwen / MiniMax), six built-in benchmarks, and WebUI dashboard.

--- a/docs/sleep/README.md
+++ b/docs/sleep/README.md
@@ -0,0 +1,77 @@
+# SkillOpt-Sleep 😴 — deployment-time companion (preview)
+
+**SkillOpt-Sleep** applies SkillOpt's discipline to your *own daily usage*. It gives a
+local coding agent a nightly **sleep cycle** that reviews your past sessions, replays
+your recurring tasks on your own API budget, and consolidates what it learns into
+**validated** long-term memory and skills — behind a held-out gate, staged for your
+review. The agent gets better the more you use it, with **no weight training** and
+**zero inference-time overhead**.
+
+> **Preview.** This is an early preview we are actively iterating on; interfaces and
+> defaults may change. The engine lives in the top-level [`skillopt_sleep/`](../../skillopt_sleep)
+> package with **zero dependency** on the paper's `skillopt/` code (the validation gate
+> is vendored).
+
+## How it works
+
+One "night":
+
+```
+harvest Claude Code / Codex transcripts → mine recurring tasks → replay offline
+   → consolidate (reflect → bounded edit → GATE on real held-out tasks)
+   → stage proposal → (you) adopt
+```
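The gate step in the pipeline above can be sketched as follows. This is a minimal illustration, not the `skillopt_sleep` implementation: `Proposal`, `gate`, `mean_score`, `toy_scorer`, and the `min_gain` threshold are all hypothetical names, and the scoring rule is a stand-in for whatever correctness check the real held-out tasks use.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# Hypothetical signature for illustration: (skill_text, task) -> score in [0, 1].
Scorer = Callable[[str, str], float]


@dataclass
class Proposal:
    """A candidate edit staged for human review (not yet adopted)."""
    skill_text: str
    baseline_score: float
    candidate_score: float


def mean_score(skill_text: str, tasks: Sequence[str], scorer: Scorer) -> float:
    """Average score of one skill text over a set of tasks."""
    return sum(scorer(skill_text, task) for task in tasks) / len(tasks)


def gate(current: str, candidate: str, held_out: Sequence[str],
         scorer: Scorer, min_gain: float = 0.0) -> Optional[Proposal]:
    """Stage the candidate only if it beats the current skill on held-out tasks."""
    if not held_out:
        raise ValueError("the gate needs at least one held-out task")
    baseline = mean_score(current, held_out, scorer)
    improved = mean_score(candidate, held_out, scorer)
    if improved - baseline > min_gain:
        return Proposal(candidate, baseline, improved)
    return None  # rejected: nothing is staged, the current skill stays in place


def toy_scorer(skill_text: str, task: str) -> float:
    """Toy check (assumption): a task passes if the skill text mentions it."""
    return 1.0 if task in skill_text else 0.0


held_out = ["pytest", "ruff"]
staged = gate("run pytest", "run pytest and ruff", held_out, toy_scorer)
rejected = gate("run pytest", "run pytest", held_out, toy_scorer)
print(staged)
print(rejected)
```

In this sketch a passing candidate is returned as a staged proposal rather than written anywhere, which mirrors the review-then-adopt step: adoption stays a separate, manual action.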
+
+It synthesizes **SkillOpt** (validation-gated bounded text edits), **Claude Dreams**
+(offline consolidation; review-then-adopt), and the **agent-sleep** idea (short-term
+experience → long-term competence).
+
+## How to use it
+
+One engine, thin per-agent shells (see [`plugins/`](../../plugins)):
+
+| Platform | Folder | Install |
+|---|---|---|
+| **Claude Code** | [`plugins/claude-code`](../../plugins/claude-code) | `/plugin marketplace add ./plugins/claude-code` → `/skillopt-sleep` |
+| **Codex** | [`plugins/codex`](../../plugins/codex) | `bash plugins/codex/install.sh` → `skillopt-sleep` skill |
+| **Copilot** | [`plugins/copilot`](../../plugins/copilot) | register `plugins/copilot/mcp_server.py` as an MCP server |
+
+Deterministic check (no API key needed):
+`python -m skillopt_sleep.experiments.run_experiment --persona researcher --assert-improves`.
+
+### Opt-in: experience replay & dream rollouts
+
+Two consolidation mechanisms, set through the three knobs below. Both are **off** at the
+defaults (one rollout per task, no recall, no synthetic variants), so behavior is unchanged
+until you enable them. They strengthen the nightly update when your tasks have a clean correctness signal; the validation gate still governs what ships.
+
+| Config knob | Default | Effect |
+|---|---|---|
+| `dream_rollouts` | `1` | Run each task K times → learn from the good-vs-bad contrast (contrastive reflection). |
+| `recall_k` | `0` | Associative recall — pull the K most-similar past tasks (from a persisted archive) into tonight's dream. |
+| `dream_factor` | `0` | Add N lightweight synthetic variants of each task. |
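To make `recall_k` concrete, here is a small sketch of associative recall over an archive of past task descriptions. It is an assumption-laden illustration: the README does not say how similarity is computed, so token-level Jaccard overlap is swapped in here as a simple stand-in, and `jaccard`, `recall`, and the example archive are hypothetical.

```python
from typing import List, Sequence


def jaccard(a: str, b: str) -> float:
    """Token-set overlap in [0, 1]; a stand-in similarity, not the plugin's metric."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def recall(task: str, archive: Sequence[str], k: int) -> List[str]:
    """Return the k archived tasks most similar to `task` (k=0 disables recall)."""
    if k <= 0:
        return []
    # sorted() is stable, so ties keep their archive order even with reverse=True.
    ranked = sorted(archive, key=lambda past: jaccard(task, past), reverse=True)
    return ranked[:k]


archive = [
    "fix failing pytest in parser",
    "write changelog entry",
    "fix flaky pytest in lexer",
]
task = "fix pytest in tokenizer"

recalled = recall(task, archive, k=2)
disabled = recall(task, archive, k=0)
print(recalled)
print(disabled)
```

With `k=0` the function returns an empty list, matching the table's default of no recall; a positive `k` adds the nearest past tasks to the night's working set.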
+
+## Results
+
+- **End-to-end on real agents.** On the public
+  [gbrain-evals](https://github.com/garrytan/gbrain-evals) `skillopt-v1` benchmark,
+  deficient seed skills go **0.00 → 1.00** on held-out sets with **both Claude and
+  Codex** (all 4 seeds, including a real tool-use loop).
+- **Experience replay scales the gain on a clean signal** (deployment protocol:
+  5 nights × 10 new real tasks/night, full held-out test, GPT-5.5, gated):
+
+  | Config | Δ vs baseline |
+  |---|---|
+  | `recall_k=10, dream_rollouts=5` | +3.1 pts |
+  | `recall_k=20, dream_rollouts=5` | **+4.5 pts** |
+  | full-history replay (reference) | +5.6 pts |
+
+  A second benchmark (SpreadsheetBench, GPT-5.4-nano, gate-free) gives **+3.6 pts**.
+- **Honest scope.** Gains are real where tasks recur and have a checkable correctness
+  signal; on saturated or noisy tasks the effect is flat within run-to-run noise
+  (±1–2 pts, single seed). The validation gate keeps the downside bounded — keep it on.
+
+## Learn more
+
+Full reference (pipeline, the three plugins, the experience-replay knobs) is in the
+**[Documentation & Reproduction Guide](https://microsoft.github.io/SkillOpt/docs/guideline.html#sleep)**.