microsoft-SkillOpt/docs/sleep/README.md
CharlesYang030 e4ea6a6771 chore(release): v0.2.0
Highlights since v0.1.0:
- feat: SkillOpt-Sleep engine — nightly offline self-evolution
  (harvest -> mine -> replay -> consolidate behind a validation gate),
  with multi-objective reward, experience replay + dream rollouts,
  slow-update long-term memory, and secret redaction in cycle diagnostics.
  Shipped as the `skillopt-sleep` CLI.
- feat: cross-tool backends & plugin shells — Claude, Codex (+Desktop
  harvest), Copilot, Devin, and OpenClaw.
- feat: SearchQA split materialization + rollout fail-fast.
- fix: Windows robustness for claude/codex backends, hardened JSON
  fallback, Qwen timeout/thinking gating, Codex failure surfacing.

Packaging:
- Bump pyproject / skillopt / skillopt_sleep to 0.2.0.
- Restore skillopt_webui to the packaged wheel.

See CHANGELOG.md for the full changelog and contributor acknowledgements.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-07-02 22:11:10 +08:00

# SkillOpt-Sleep 😴 — deployment-time companion (preview)
**SkillOpt-Sleep** applies SkillOpt's discipline to your *own daily usage*. It gives a
local coding agent a nightly **sleep cycle** that reviews your past sessions, replays
your recurring tasks on your own API budget, and consolidates what it learns into
**validated** long-term memory and skills — behind a held-out gate, staged for your
review. The agent gets better the more you use it, with **no weight training** and
**zero inference-time overhead**.
> **Preview.** This is an early preview we are actively iterating on; interfaces and
> defaults may change. The engine lives in the top-level [`skillopt_sleep/`](../../skillopt_sleep)
> package with **zero dependency** on the paper's `skillopt/` code (the validation gate
> is vendored).
## How it works
One "night":
```
harvest Claude Code / Codex transcripts → mine recurring tasks → replay offline
→ consolidate (reflect → bounded edit → GATE on real held-out tasks)
→ stage proposal → (you) adopt
```
It synthesizes **SkillOpt** (validation-gated bounded text edits), **Claude Dreams**
(offline consolidation; review-then-adopt), and the **agent-sleep** idea (short-term
experience → long-term competence).
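
The consolidate step's "GATE on real held-out tasks" can be pictured with a minimal sketch. Everything here is hypothetical (the `Proposal` type, `gate`, and `toy_score` are illustrative names, not the `skillopt_sleep` API), and it assumes the gate accepts a candidate only when it scores strictly higher than the current skill on the held-out tasks:

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Proposal:
    """A staged edit; nothing is applied until the user adopts it."""
    skill_text: str         # candidate skill text after the bounded edit
    baseline_score: float   # held-out score of the current skill
    candidate_score: float  # held-out score of the candidate
    accepted: bool          # True only if the gate passed


def gate(
    current: str,
    candidate: str,
    held_out: List[str],
    score: Callable[[str, str], float],
) -> Proposal:
    """Assumed rule: accept only a strict improvement on held-out tasks."""
    base = sum(score(current, t) for t in held_out) / len(held_out)
    cand = sum(score(candidate, t) for t in held_out) / len(held_out)
    return Proposal(candidate, base, cand, accepted=cand > base)


def toy_score(skill: str, task: str) -> float:
    # Hypothetical scorer: a task "passes" if the skill mentions its keyword.
    return 1.0 if task in skill else 0.0


held_out = ["pytest", "ruff"]
good = gate("use pytest", "use pytest and ruff", held_out, toy_score)
bad = gate("use pytest", "use tox", held_out, toy_score)
print(good.accepted, bad.accepted)
```

The function only returns a `Proposal`; it never overwrites the current skill, which mirrors the "stage proposal → (you) adopt" step in the pipeline.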
## How to use it
### Quickest path: the `skillopt-sleep` CLI (pip)
```bash
pip install skillopt # installs the engine + the `skillopt-sleep` command
skillopt-sleep dry-run # harvest + mine + replay, report only (changes nothing)
skillopt-sleep run # a full nightly cycle; the proposal is staged for review
skillopt-sleep status # show state + the latest staged proposal
skillopt-sleep adopt # apply the latest staged proposal
skillopt-sleep schedule # install a nightly cron entry for this project
```
The per-agent plugin shells below (Claude Code / Codex / Copilot) still come from the
repo; the CLI above is the standalone, pip-only way to run a cycle.
One engine, thin per-agent shells (see [`plugins/`](../../plugins)):
| Platform | Folder | Install |
|---|---|---|
| **Claude Code** | [`plugins/claude-code`](../../plugins/claude-code) | `/plugin marketplace add ./plugins/claude-code` → `/skillopt-sleep` |
| **Codex** | [`plugins/codex`](../../plugins/codex) | `bash plugins/codex/install.sh` → `skillopt-sleep` skill |
| **Copilot** | [`plugins/copilot`](../../plugins/copilot) | register `plugins/copilot/mcp_server.py` as an MCP server |
Deterministic proof (no API key):
`python -m skillopt_sleep.experiments.run_experiment --persona researcher --assert-improves`.
### Opt-in: experience replay & dream rollouts
Two consolidation mechanisms, experience replay and dream rollouts, are both **off** by
default (behavior is unchanged unless you enable them). They strengthen the nightly
update when your tasks have a clean correctness signal; the validation gate still
governs what ships. Three config knobs control them:
| Config knob | Default | Effect |
|---|---|---|
| `dream_rollouts` | `1` | Run each task K times → learn from the good-vs-bad contrast (contrastive reflection). |
| `recall_k` | `0` | Associative recall — pull the K most-similar past tasks (from a persisted archive) into tonight's dream. |
| `dream_factor` | `0` | Add N lightweight synthetic variants of each task. |
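
To make the three knobs concrete, here is a minimal sketch of how they might shape one night's batch. It is an illustration under stated assumptions, not the engine's code: `build_dream_batch` and `jaccard` are hypothetical names, similarity is a simple token-overlap score, and synthetic variants are assumed to apply to recalled tasks as well as today's.

```python
from typing import List, Tuple


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity (an assumed stand-in for the real recall metric)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def build_dream_batch(
    today: List[str],
    archive: List[str],
    recall_k: int = 0,
    dream_factor: int = 0,
    dream_rollouts: int = 1,
) -> List[Tuple[str, int]]:
    """Return (task, rollout_index) pairs for one night."""
    recalled: List[str] = []
    if recall_k > 0 and today:
        # Rank archived tasks by their best similarity to any of today's tasks.
        ranked = sorted(
            archive,
            key=lambda past: max(jaccard(past, t) for t in today),
            reverse=True,
        )
        recalled = ranked[:recall_k]
    tasks = list(today) + recalled
    # dream_factor: N placeholder variants per task (a real engine would rewrite them).
    variants = [f"{t} (variant {i + 1})" for t in tasks for i in range(dream_factor)]
    # dream_rollouts: each task is run K times so good and bad runs can be contrasted.
    return [(t, r) for t in tasks + variants for r in range(dream_rollouts)]


today = ["fix flaky pytest run"]
archive = ["fix flaky pytest job", "write release notes"]
default_batch = build_dream_batch(today, archive)
tuned_batch = build_dream_batch(today, archive, recall_k=1, dream_factor=1, dream_rollouts=2)
print(len(default_batch), len(tuned_batch))
```

With the defaults, the batch is just today's tasks run once each, which matches the "behavior is unchanged" claim above.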
## Results
> 📊 **More results & analysis — the gate-safety stress test, experience-replay
> scaling, and the dream-diversity ablation — are in
> [`docs/sleep/RESULTS.md`](RESULTS.md).** The highlights:
**Protocol (identical for every row below).** 5 nights × 10 new real "today" tasks
per night; the full held-out **test** split is scored before night 1 (baseline) and
after night 5 (after); optimizer = GPT-5.5; single seed (42); run through the exact
shipped engine (`skillopt_sleep.dream.dream_consolidate`). Numbers are absolute
held-out accuracy; **Δ** = `after - baseline` in percentage points.
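
As a worked example of the Δ definition, the sketch below recomputes it for the `recall_k=20` SearchQA row reported in (b). The helper name `delta_points` is hypothetical. Recomputing from the three-decimal accuracies can differ from a published Δ by about 0.1 pt, presumably because the published figures come from unrounded scores.

```python
def delta_points(baseline: float, after: float) -> float:
    """Delta in percentage points: (after - baseline) * 100, to one decimal."""
    return round((after - baseline) * 100, 1)


# recall_k=20 row: 0.803 -> 0.848
searchqa_delta = delta_points(0.803, 0.848)
print(searchqa_delta)
```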
**(a) End-to-end on real agents — [gbrain-evals](https://github.com/garrytan/gbrain-evals) `skillopt-v1`.**
Deficient seed skills go **0.00 → 1.00** on the held-out set with **both Claude Code
and Codex** as the target agent (all 4 seed skills, including a real tool-use loop).
**(b) Experience replay scales the gain — SearchQA** (1,400-item held-out test,
SQuAD exact-match; target = GPT-5.5; **validation-gated**):
| Replay config (`dream_rollouts=5`) | Baseline → After | Δ (pts) |
|---|---|---|
| `recall_k=10` | 0.802 → 0.834 | +3.1 |
| `recall_k=20` | 0.803 → 0.848 | **+4.5** |
| full-history replay *(reference, not a shipping default)* | 0.796 → 0.851 | +5.6 |
| `recall_k=10`, `dream_rollouts=8` *(more dreaming, same recall)* | 0.798 → 0.835 | +3.7 |
The gain rises monotonically with how much relevant past experience is recalled. The
same SearchQA cell **without** the gate (`recall_k=10`) is 0.808 → 0.839 (+3.1).
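
For readers unfamiliar with the SQuAD exact-match metric named above, here is a minimal sketch of the standard normalization (lowercase, strip punctuation, drop the articles a/an/the, collapse whitespace). Whether the SkillOpt-Sleep scorer matches this exactly is an assumption; `normalize_answer` and `exact_match` are illustrative names.

```python
import re
import string
from typing import List

_PUNCT = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """SQuAD-style normalization applied before comparing answers."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: List[str]) -> float:
    """1.0 if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return float(any(pred == normalize_answer(g) for g in gold_answers))


em_hit = exact_match("The Eiffel Tower!", ["eiffel tower"])
em_miss = exact_match("Paris", ["London"])
print(em_hit, em_miss)
```

Held-out accuracy is then the mean of these per-item scores over the test split.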
**(c) Second benchmark — SpreadsheetBench** (280-item held-out test; the agent's
generated openpyxl code is executed and compared cell-by-cell to a golden workbook;
target = GPT-5.4-nano; gate-free + the output-contract guardrail): 0.279 → 0.314 (**+3.6**).
**(d) Honest scope.** These gains hold where tasks recur and have a checkable
correctness signal. On saturated or noisy benchmarks (e.g. a strong model already
near ceiling) the effect is **flat within run-to-run noise**: single-seed baseline
variance here is ±1–2 pts, so treat sub-~1.5 pt differences as noise. The validation
gate keeps the worst case bounded; keep it **on** by default.
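
The noise guidance above can be applied mechanically when reading a results table. This is a minimal sketch; `classify_delta` is a hypothetical helper, and the 1.5-point floor is simply the rule of thumb quoted in the text, not a statistical test.

```python
NOISE_FLOOR_PTS = 1.5  # rule-of-thumb threshold quoted in the text


def classify_delta(delta_pts: float, noise_floor: float = NOISE_FLOOR_PTS) -> str:
    """Label a before/after difference as noise, gain, or regression."""
    if abs(delta_pts) < noise_floor:
        return "noise"
    return "gain" if delta_pts > 0 else "regression"


labels = [classify_delta(d) for d in (4.5, 0.9, -2.0)]
print(labels)
```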
## Learn more
Full reference (pipeline, the three plugins, the experience-replay knobs) is in the
**[Documentation & Reproduction Guide](https://microsoft.github.io/SkillOpt/docs/guideline.html#sleep)**.