mirror of
https://github.com/microsoft/SkillOpt.git
synced 2026-07-05 23:30:35 +08:00
Per maintainer request: - Remove the internal/scratch docs/sleep/ tree (reports, raw logs, blog run JSON, sweep.jsonl) — 23 files — and the root PUBLISHING.md. These were working notes, not reference docs. - Take the dedicated SkillOpt-Sleep content out of the main README (News bullet + section) and host it in the rendered guide instead: new section 9 in docs/guideline.html (deployment companion, the three plugins, opt-in experience replay / dream rollouts) with a sidebar entry. - Fix the README's opening reference so "Documentation & Reproduction Guide" links directly to the rendered GitHub Pages page, not the raw .html source. - Repoint the now-removed docs/sleep links in the plugin READMEs to the guide section. The plugin code (plugins/, skillopt_sleep/) is unchanged; only docs move. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
139 lines
5.9 KiB
Markdown
139 lines
5.9 KiB
Markdown
# SkillOpt-Sleep (Claude Code plugin)
|
||
|
||
> Give your local Claude agent a **sleep cycle**. Every night it reviews your
|
||
> past sessions offline, replays your recurring tasks on your own API budget,
|
||
> and consolidates what it learns into **validated** memory (`CLAUDE.md`) and
|
||
> skills (`SKILL.md`). Your agent gets better the more you use it — no
|
||
> model-weight training.
|
||
|
||
SkillOpt-Sleep is the **deployment-time** companion to
|
||
[SkillOpt](https://github.com/microsoft/SkillOpt). SkillOpt trains a skill
|
||
offline on a benchmark; SkillOpt-Sleep applies the same discipline to *your own
|
||
daily usage*: bounded text edits, accepted only through a held-out validation
|
||
gate, with rejected edits kept as negative feedback.
|
||
|
||
It synthesizes three ideas:
|
||
|
||
| Idea | Contribution |
|
||
|---|---|
|
||
| **SkillOpt** | skill/memory = trainable text; bounded add/delete/replace edits; **held-out gate** keeps only changes that help. |
|
||
| **Claude Dreams** | offline consolidation over past sessions; input never mutated; output **reviewed then adopted**. |
|
||
| **Agent sleep** | periodic offline replay turns short-term episodes into long-term skill. |
|
||
|
||
## What it does (one "night")
|
||
|
||
```
|
||
harvest ~/.claude transcripts → mine recurring tasks → replay offline
|
||
→ consolidate (reflect → bounded edit → GATE) → stage proposal → (you) adopt
|
||
```
|
||
|
||
Nothing live is modified until **you** run `/skillopt-sleep adopt` (the Dreams "review,
|
||
then adopt or discard" contract). Every adopt backs up the prior file first.
|
||
|
||
## Install
|
||
|
||
**Requirements:** Python ≥ 3.10, and the `claude` CLI (and/or `codex` CLI) on PATH.
|
||
|
||
```bash
|
||
# 1) get the code (the plugin ships inside the SkillOpt repo)
|
||
git clone https://github.com/microsoft/SkillOpt.git
|
||
cd SkillOpt
|
||
|
||
# 2) add the plugin to Claude Code as a local marketplace
|
||
/plugin marketplace add ./skillopt-sleep-plugin
|
||
/plugin install skillopt-sleep@skillopt-sleep
|
||
|
||
# 3) verify
|
||
/skillopt-sleep status
|
||
```
|
||
|
||
The plugin's bundled runner (`scripts/sleep.sh`) auto-selects a Python ≥ 3.10
|
||
interpreter and calls the `skillopt_sleep` engine in the repo. No `pip install`
|
||
is required for the default `mock` backend or for `claude`/`codex` backends —
|
||
they shell out to the CLIs you already have.
|
||
|
||
## Quick start
|
||
|
||
```bash
|
||
# from inside any project you use with Claude Code:
|
||
/skillopt-sleep dry-run # safe preview: what it would learn, no changes staged
|
||
/skillopt-sleep run # full cycle: stages a reviewed proposal (still no live edits)
|
||
/skillopt-sleep status # see history + the latest staged proposal
|
||
/skillopt-sleep adopt # apply the staged proposal to CLAUDE.md / SKILL.md (with backup)
|
||
```
|
||
|
||
Or call the engine directly (Python ≥ 3.10):
|
||
|
||
```bash
|
||
python -m skillopt_sleep run --project "$(pwd)" --scope invoked --backend mock
|
||
python -m skillopt_sleep run --project "$(pwd)" --backend claude # real lift via Claude
|
||
python -m skillopt_sleep run --project "$(pwd)" --backend codex # real lift via Codex
|
||
```
|
||
|
||
Default backend is **`mock`** — deterministic, no API spend — so you can try the
|
||
plumbing for free. Switch to `--backend claude` or `--backend codex` for genuine
|
||
improvement on your own budget.
|
||
|
||
## Does it actually improve? (real models, public benchmark)
|
||
|
||
SkillOpt-Sleep is validated against [gbrain-evals](https://github.com/garrytan/gbrain-evals)'
|
||
public `skillopt-v1` suite — the same benchmark gbrain scores its own skill
|
||
optimizer against. We take a deliberately **deficient** skill and run one sleep
|
||
night; held-out scoring is done by a local rule judge (no judge-API, no way to
|
||
grade its own homework).
|
||
|
||
| Backend | Seed | Held-out before → after | Nights |
|
||
|---|---|---|---|
|
||
| **Claude (Haiku 4.5)** | brief-writer | **0.00 → 1.00** | 1 |
|
||
| **Codex** | brief-writer | **0.00 → 1.00** | 2 |
|
||
|
||
Both took a brief-writer with no risks section / no confidence level and, within
|
||
1–2 nights, proposed gated edits that lifted the held-out score to perfect —
|
||
into the protected `LEARNED` block, nothing else touched. The Codex 2-night
|
||
trace even shows the optimizer **diagnosing its own residual failure** and
|
||
adding a meta-rule to fix it. Full writeup + reproduction:
|
||
[the SkillOpt-Sleep guide section](https://microsoft.github.io/SkillOpt/docs/guideline.html#sleep).
|
||
|
||
Reproduce:
|
||
|
||
```bash
|
||
git clone https://github.com/garrytan/gbrain-evals /tmp/gbrain-evals
|
||
python -m skillopt_sleep.experiments.run_gbrain --backend claude --model haiku \
|
||
--seeds brief-writer --data-root /tmp/gbrain-evals/eval/data/skillopt-v1 \
|
||
--nights 1 --limit-replay 3 --limit-holdout 3
|
||
python -m skillopt_sleep.experiments.run_gbrain --backend codex \
|
||
--seeds brief-writer --data-root /tmp/gbrain-evals/eval/data/skillopt-v1 \
|
||
--nights 1 --limit-replay 3 --limit-holdout 3
|
||
```
|
||
|
||
## Deterministic proof (no API, no keys)
|
||
|
||
```bash
|
||
python -m skillopt_sleep.experiments.run_experiment --persona researcher --assert-improves
|
||
python -m skillopt_sleep.experiments.run_experiment --persona programmer --assert-improves
|
||
```
|
||
|
||
Each prints the held-out score rising from baseline toward 1.0 as the gate
|
||
accepts the general rules your tasks need, and confirms the gate **rejects** an
|
||
injected harmful edit. Recorded output: [the SkillOpt-Sleep guide section](https://microsoft.github.io/SkillOpt/docs/guideline.html#sleep).
|
||
|
||
## Schedule it nightly
|
||
|
||
```bash
|
||
"${CLAUDE_PLUGIN_ROOT}/scripts/install-cron.sh" "$(pwd)" # prints a crontab line; installs nothing
|
||
```
|
||
|
||
## Safety
|
||
|
||
- **Read-only** harvest of `~/.claude`. `mock` replay has no side effects.
|
||
- Proposals are **staged**, never auto-applied (unless you opt in with `--auto-adopt`).
|
||
- Every adopt writes a backup under the staging dir's `backup/`.
|
||
- Per-night **token/task budget caps**; secrets redacted from prompts.
|
||
- `fresh` replay (Phase 3) runs only in throwaway git worktrees.
|
||
|
||
## Status
|
||
|
||
Phase 1 (engine + deterministic experiment + plugin surface) is complete.
|
||
Phase 3 adds the real-API miner/judge and `fresh` worktree replay. See
|
||
[`docs/superpowers/specs/2026-06-07-skillopt-sleep-claude-code-plugin-design.md`](../docs/superpowers/specs/2026-06-07-skillopt-sleep-claude-code-plugin-design.md).
|