docs(sleep): add a SkillOpt-Sleep module readme + News mention

Adds docs/sleep/README.md — a concise intro to the SkillOpt-Sleep plugin (what
it is, how to use it across the three agents, the opt-in experience-replay /
dream-rollout knobs, and headline results), linking to the full guide section.
Adds a News bullet pointing to it. No code changes.
Yifan Yang
2026-06-15 16:31:15 +00:00
parent b701d9b6d9
commit de3be75bac
2 changed files with 78 additions and 0 deletions

@@ -9,6 +9,7 @@
---
## News 🔥🔥🔥
- **[2026-06-15]** 😴 **SkillOpt-Sleep (preview)** — a nightly offline self-evolution companion for local coding agents (Claude Code / Codex / Copilot): review past sessions, replay recurring tasks, and consolidate validated skills behind a held-out gate. See **[`docs/sleep/README.md`](docs/sleep/README.md)** for what it is, how to use it, and results.
- **[2026-06-03]** 🎉 **[gbrain](https://github.com/garrytan/gbrain), [gbrain-evals](https://github.com/garrytan/gbrain-evals/blob/main/docs/benchmarks/2026-06-03-skillopt.md), and [darwin-skill](https://github.com/alchaincyf/darwin-skill) have all integrated SkillOpt.**
- **[2026-06-02]** 🎉 **SkillOpt [v0.1.0](https://github.com/microsoft/SkillOpt/releases/tag/v0.1.0) is now available on [PyPI](https://pypi.org/project/skillopt/)!** Install with `pip install skillopt`. This initial release includes the full training loop (rollout → reflect → aggregate → select → update → evaluate), multi-backend support (OpenAI / Azure / Claude / Qwen / MiniMax), six built-in benchmarks, and WebUI dashboard.

docs/sleep/README.md (new file, 77 lines)

@@ -0,0 +1,77 @@
# SkillOpt-Sleep 😴 — deployment-time companion (preview)
**SkillOpt-Sleep** applies SkillOpt's discipline to your *own daily usage*. It gives a
local coding agent a nightly **sleep cycle** that reviews your past sessions, replays
your recurring tasks on your own API budget, and consolidates what it learns into
**validated** long-term memory and skills — behind a held-out gate, staged for your
review. The agent gets better the more you use it, with **no weight training** and
**zero inference-time overhead**.
> **Preview.** This is an early preview we are actively iterating on; interfaces and
> defaults may change. The engine lives in the top-level [`skillopt_sleep/`](../../skillopt_sleep)
> package with **zero dependency** on the paper's `skillopt/` code (the validation gate
> is vendored).
## How it works
One "night":
```
harvest Claude Code / Codex transcripts → mine recurring tasks → replay offline
→ consolidate (reflect → bounded edit → GATE on real held-out tasks)
→ stage proposal → (you) adopt
```
It synthesizes **SkillOpt** (validation-gated bounded text edits), **Claude Dreams**
(offline consolidation; review-then-adopt), and the **agent-sleep** idea (short-term
experience → long-term competence).
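The consolidate step above (bounded edit, then a gate on held-out tasks) can be sketched as follows. This is a minimal illustration, not the code in `skillopt_sleep/`: the names `Proposal`, `mean_score`, `gate`, `min_similarity`, and `min_gain` are hypothetical, and the "bounded edit" check is approximated here with a `difflib` similarity ratio.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Proposal:
    """A staged skill edit, waiting for the user to adopt or discard it."""
    skill_text: str
    baseline_score: float
    candidate_score: float


def mean_score(skill_text: str, tasks: Sequence[str],
               score_fn: Callable[[str, str], float]) -> float:
    """Average score of one skill text over a set of tasks."""
    if not tasks:
        raise ValueError("the gate needs at least one held-out task")
    return sum(score_fn(skill_text, task) for task in tasks) / len(tasks)


def gate(current: str, candidate: str, held_out: Sequence[str],
         score_fn: Callable[[str, str], float],
         min_similarity: float = 0.5, min_gain: float = 0.0) -> Optional[Proposal]:
    """Return a Proposal only if the edit is bounded AND improves held-out score.

    Assumption: "bounded" is modelled as a minimum character-level similarity
    between the old and new skill text; the real engine may bound edits differently.
    """
    similarity = SequenceMatcher(None, current, candidate).ratio()
    if similarity < min_similarity:
        return None  # edit rewrites too much of the skill
    baseline = mean_score(current, held_out, score_fn)
    improved = mean_score(candidate, held_out, score_fn)
    if improved - baseline <= min_gain:
        return None  # no measurable held-out gain, so nothing is staged
    return Proposal(candidate, baseline, improved)


# Toy usage: a task "passes" if the skill text mentions its keyword.
toy_score = lambda skill, task: 1.0 if task in skill else 0.0
current_skill = "Use pytest for tests. Keep functions small."
candidate_skill = current_skill + " Run ruff first."
proposal = gate(current_skill, candidate_skill, ["pytest", "ruff"], toy_score)
```

In this toy run the candidate is staged because it lifts the held-out mean from 0.5 to 1.0 while leaving most of the original text intact. A full rewrite, or an edit with no held-out gain, returns `None` and nothing is proposed.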
## How to use it
One engine, thin per-agent shells (see [`plugins/`](../../plugins)):
| Platform | Folder | Install |
|---|---|---|
| **Claude Code** | [`plugins/claude-code`](../../plugins/claude-code) | `/plugin marketplace add ./plugins/claude-code` → `/skillopt-sleep` |
| **Codex** | [`plugins/codex`](../../plugins/codex) | `bash plugins/codex/install.sh` → `skillopt-sleep` skill |
| **Copilot** | [`plugins/copilot`](../../plugins/copilot) | register `plugins/copilot/mcp_server.py` as an MCP server |

Deterministic proof (no API key):
`python -m skillopt_sleep.experiments.run_experiment --persona researcher --assert-improves`.
### Opt-in: experience replay & dream rollouts
Two consolidation mechanisms, both **off** by default, so behavior is unchanged unless
you enable them (`dream_rollouts=1` is a single rollout per task, which leaves no
good-vs-bad contrast to learn from). They strengthen the nightly update when your tasks
have a clean correctness signal; the validation gate still governs what ships.
| Config knob | Default | Effect |
|---|---|---|
| `dream_rollouts` | `1` | Run each task K times → learn from the good-vs-bad contrast (contrastive reflection). |
| `recall_k` | `0` | Associative recall — pull the K most-similar past tasks (from a persisted archive) into tonight's dream. |
| `dream_factor` | `0` | Add N lightweight synthetic variants of each task. |
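To make the three knobs concrete, here is a rough sketch of how they could expand one night's task list. Only the knob names and defaults come from the table above; `plan_dream`, the token-overlap similarity, and the placeholder variant strings are assumptions for illustration, not the engine's implementation.

```python
from typing import List, Sequence, Tuple


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for whatever the engine uses."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def plan_dream(tonight: Sequence[str], archive: Sequence[str],
               dream_rollouts: int = 1, recall_k: int = 0,
               dream_factor: int = 0) -> List[Tuple[str, int]]:
    """Return (task, rollout_index) pairs to replay tonight (hypothetical helper)."""
    if dream_rollouts < 1 or recall_k < 0 or dream_factor < 0:
        raise ValueError("dream_rollouts >= 1, recall_k >= 0, dream_factor >= 0")
    tasks = list(tonight)

    # recall_k: pull the K past tasks most similar to any of tonight's tasks.
    if recall_k > 0:
        past = [t for t in dict.fromkeys(archive) if t not in tasks]
        past.sort(key=lambda p: max((jaccard(p, t) for t in tonight), default=0.0),
                  reverse=True)
        tasks.extend(past[:recall_k])

    # dream_factor: add N synthetic variants per task (placeholder strings here;
    # assumption: recalled tasks get variants too).
    tasks.extend([f"{t} (variant {i + 1})" for t in tasks for i in range(dream_factor)])

    # dream_rollouts: run every task K times so good and bad runs can be contrasted.
    return [(t, r) for t in tasks for r in range(dream_rollouts)]


tonight = ["fix flaky pytest test"]
archive = ["fix flaky pytest fixture", "write release notes",
           "debug flaky pytest test timeout"]
default_plan = plan_dream(tonight, archive)
replay_plan = plan_dream(tonight, archive, dream_rollouts=2, recall_k=1)
```

With the defaults the plan is just tonight's tasks, each run once, which matches the "behavior is unchanged" guarantee. Raising `recall_k` and `dream_rollouts` grows the replay set, and with it the API budget spent per night.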
## Results
- **End-to-end on real agents.** On the public
[gbrain-evals](https://github.com/garrytan/gbrain-evals) `skillopt-v1` benchmark,
deficient seed skills go **0.00 → 1.00** on held-out sets with **both Claude and
Codex** (all 4 seeds, including a real tool-use loop).
- **Experience replay scales the gain on a clean signal** (deployment protocol:
  5 nights × 10 new real tasks/night, full held-out test, GPT-5.5, gated):

  | Config | Δ vs baseline |
  |---|---|
  | `recall_k=10, dream_rollouts=5` | +3.1 pts |
  | `recall_k=20, dream_rollouts=5` | **+4.5 pts** |
  | full-history replay (reference) | +5.6 pts |

  A second benchmark (SpreadsheetBench, GPT-5.4-nano, gate-free) gives **+3.6 pts**.
- **Honest scope.** Gains are real where tasks recur and have a checkable correctness
signal; on saturated or noisy tasks the effect is flat within run-to-run noise
(±12 pts, single seed). The validation gate keeps the downside bounded — keep it on.
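One way to read the experience-replay table is as the share of the full-history gain that a bounded recall budget recovers. The snippet below is plain arithmetic on the reported numbers; treating full-history replay as the reference ceiling is an interpretation, and `recovered_fraction` is a hypothetical helper.

```python
# Gains in points vs. baseline, copied from the experience-replay table.
FULL_HISTORY_GAIN = 5.6
BOUNDED_GAINS = {
    "recall_k=10, dream_rollouts=5": 3.1,
    "recall_k=20, dream_rollouts=5": 4.5,
}


def recovered_fraction(gain: float, reference: float) -> float:
    """Fraction of the reference gain achieved by a bounded configuration."""
    if reference <= 0:
        raise ValueError("reference gain must be positive")
    return gain / reference


for config, gain in BOUNDED_GAINS.items():
    share = recovered_fraction(gain, FULL_HISTORY_GAIN)
    print(f"{config}: {share:.0%} of the full-history gain")
```

Under that reading, `recall_k=10` recovers roughly 55% of the full-history gain and `recall_k=20` roughly 80%.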
## Learn more
Full reference (pipeline, the three plugins, the experience-replay knobs) is in the
**[Documentation & Reproduction Guide](https://microsoft.github.io/SkillOpt/docs/guideline.html#sleep)**.