From b701d9b6d9b48200d5bb9353e744e36e45aef43b Mon Sep 17 00:00:00 2001
From: Yifan Yang
Date: Mon, 15 Jun 2026 16:20:50 +0000
Subject: [PATCH] docs: move SkillOpt-Sleep into the guide; clean docs/sleep; fix guide link
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Per maintainer request:
- Remove the internal/scratch docs/sleep/ tree (reports, raw logs, blog run
  JSON, sweep.jsonl; 23 files) and the root PUBLISHING.md. These were working
  notes, not reference docs.
- Take the dedicated SkillOpt-Sleep content out of the main README (News
  bullet + section) and host it in the rendered guide instead: new section 9
  in docs/guideline.html (deployment companion, the three plugins, opt-in
  experience replay / dream rollouts) with a sidebar entry.
- Fix the README's opening reference so "Documentation & Reproduction Guide"
  links directly to the rendered GitHub Pages page, not the raw .html source.
- Repoint the now-removed docs/sleep links in the plugin READMEs to the guide
  section.

The plugin code (plugins/, skillopt_sleep/) is unchanged; only docs move.
Co-Authored-By: Claude Opus 4
---
 PUBLISHING.md | 81 ---------
 README.md | 56 +-----
 docs/guideline.html | 57 +++++++
 docs/sleep/CONTROLLABLE_DREAMING.md | 134 ---------------
 docs/sleep/EXPERIENCE_REPLAY.md | 64 -------
 docs/sleep/FINAL_REPORT.md | 160 ------------------
 docs/sleep/PR_DRAFT.md | 53 ------
 docs/sleep/benchmark_report.md | 41 -----
 .../blog_runs/v2_port/conf_ss_nano_free.json | 94 ----------
 .../v2_port/imp_cumulative_gate.json | 94 ----------
 .../blog_runs/v2_port/imp_recall20_gate.json | 94 ----------
 .../blog_runs/v2_port/imp_rollouts8_gate.json | 94 ----------
 .../blog_runs/v2_port/parity_sq_g55_free.json | 94 ----------
 .../blog_runs/v2_port/parity_sq_g55_gate.json | 94 ----------
 docs/sleep/experiment_results.md | 73 --------
 docs/sleep/plugin_load_test.md | 76 ---------
 docs/sleep/raw/codex_brief_writer.txt | 45 -----
 .../crosscheck_A_claude_gateoff_rollouts.txt | 38 -----
 .../sleep/raw/crosscheck_B_codex_rollouts.txt | 48 ------
 .../raw/crosscheck_C_claude_slowupdate.txt | 54 ------
 docs/sleep/raw/haiku_self_clean.txt | 101 -----------
 docs/sleep/raw/quick_answerer_codex.txt | 35 ----
 .../sleep/raw/quick_answerer_sonnet_haiku.txt | 35 ----
 docs/sleep/raw/sonnet_opt_haiku_target.txt | 98 -----------
 docs/sleep/real_api_results.md | 114 -------------
 docs/sleep/sweep.jsonl | 11 --
 plugins/README.md | 8 +-
 plugins/claude-code/README.md | 4 +-
 .../skills/skillopt-sleep/SKILL.md | 2 +-
 plugins/codex/README.md | 4 +-
 plugins/copilot/README.md | 2 +-
 31 files changed, 68 insertions(+), 1890 deletions(-)
 delete mode 100644 PUBLISHING.md
 delete mode 100644 docs/sleep/CONTROLLABLE_DREAMING.md
 delete mode 100644 docs/sleep/EXPERIENCE_REPLAY.md
 delete mode 100644 docs/sleep/FINAL_REPORT.md
 delete mode 100644 docs/sleep/PR_DRAFT.md
 delete mode 100644 docs/sleep/benchmark_report.md
 delete mode 100644 docs/sleep/blog_runs/v2_port/conf_ss_nano_free.json
 delete mode 100644 docs/sleep/blog_runs/v2_port/imp_cumulative_gate.json
 delete mode 100644 docs/sleep/blog_runs/v2_port/imp_recall20_gate.json
 delete mode 100644 docs/sleep/blog_runs/v2_port/imp_rollouts8_gate.json
 delete mode 100644 docs/sleep/blog_runs/v2_port/parity_sq_g55_free.json
 delete mode 100644 docs/sleep/blog_runs/v2_port/parity_sq_g55_gate.json
 delete mode 100644 docs/sleep/experiment_results.md
 delete mode 100644 docs/sleep/plugin_load_test.md
 delete mode 100644 docs/sleep/raw/codex_brief_writer.txt
 delete mode 100644 docs/sleep/raw/crosscheck_A_claude_gateoff_rollouts.txt
 delete mode 100644 docs/sleep/raw/crosscheck_B_codex_rollouts.txt
 delete mode 100644 docs/sleep/raw/crosscheck_C_claude_slowupdate.txt
 delete mode 100644 docs/sleep/raw/haiku_self_clean.txt
 delete mode 100644 docs/sleep/raw/quick_answerer_codex.txt
 delete mode 100644 docs/sleep/raw/quick_answerer_sonnet_haiku.txt
 delete mode 100644 docs/sleep/raw/sonnet_opt_haiku_target.txt
 delete mode 100644 docs/sleep/real_api_results.md
 delete mode 100644 docs/sleep/sweep.jsonl

diff --git a/PUBLISHING.md b/PUBLISHING.md
deleted file mode 100644
index 1d85e5a..0000000
--- a/PUBLISHING.md
+++ /dev/null
@@ -1,81 +0,0 @@
-# Publishing SkillOpt-Sleep: how people install and use it
-
-This is the open-source SkillOpt-Sleep tool: a nightly offline "sleep cycle" for
-local coding agents, shipped as plugins for **Claude Code**, **Codex**, and
-**Copilot**. One engine ([`skillopt_sleep/`](skillopt_sleep)), three thin shells
-([`plugins/`](plugins)), decoupled from the research code.
-
-## How end users install it
-
-### Claude Code
-
-The Claude Code plugin ships a marketplace manifest at
-`plugins/claude-code/.claude-plugin/marketplace.json`.
-
-```text
-# inside Claude Code:
-/plugin marketplace add microsoft/SkillOpt
-/plugin install skillopt-sleep
-/sleep status
-```
-
-(`/plugin marketplace add /` reads the marketplace manifest from the
-repo; the entry points at `plugins/claude-code`.)
-
-### Codex
-
-```bash
-git clone https://github.com/microsoft/SkillOpt.git
-cd SkillOpt
-bash plugins/codex/install.sh # installs /sleep prompt + skill
-export SKILLOPT_SLEEP_REPO="$(pwd)" # so the runner is found anywhere
-# then, in Codex: /sleep status
-```
-
-### Copilot
-
-```bash
-git clone https://github.com/microsoft/SkillOpt.git
-# register the MCP server with your Copilot config (see plugins/copilot/README.md
-# and plugins/copilot/mcp-config.example.json), pointing SKILLOPT_SLEEP_REPO at
-# the clone. Then ask Copilot to "run the sleep cycle".
-```
-
-Requirements for all three: Python ≥ 3.10, and the corresponding agent CLI on
-PATH. The default backend is `mock` (no API spend); `--backend claude|codex`
-uses the user's own budget.
-
-## Wider distribution (optional, maintainer steps)
-
-1. **GitHub Release.** Tag the milestone so users can pin a version:
-   ```bash
-   gh release create sleep-v0.1.0 --title "SkillOpt-Sleep v0.1.0" \
-     --notes "Nightly offline self-evolution plugins for Claude Code, Codex, Copilot."
-   ```
-
-2. **Official Claude Code plugin marketplace.** To appear in the public
-   directory, open a PR adding a `marketplace.json` entry to
-   [`anthropics/claude-code` / the official marketplace repo], pointing at
-   `microsoft/SkillOpt` subdir `plugins/claude-code`. Users could then
-   `/plugin install skillopt-sleep@`.
-
-3. **PyPI (optional).** `skillopt_sleep` is a standalone package
-   (`pyproject.toml` lists it). A `pip install skillopt-sleep` distribution would
-   let users run `python -m skillopt_sleep ...` without cloning. Build with
-   `python -m build` and publish with `twine`.
-
-4. **README News.** The main [`README.md`](README.md) already announces the
-   release and links to [`plugins/`](plugins) and
-   [`docs/sleep/FINAL_REPORT.md`](docs/sleep/FINAL_REPORT.md).
-
-## Verifying a release works
-
-```bash
-# deterministic, no API key:
-python -m skillopt_sleep.experiments.run_experiment --persona researcher --assert-improves
-# the unit suite:
-python -m unittest tests.test_sleep_engine
-# the MCP server (Copilot):
-printf '%s\n' '{"jsonrpc":"2.0","id":1,"method":"tools/list"}' \
-  | SKILLOPT_SLEEP_REPO="$(pwd)" python3 plugins/copilot/mcp_server.py
-```
diff --git a/README.md b/README.md
index 1e6470e..d2204c3 100644
--- a/README.md
+++ b/README.md
@@ -4,12 +4,11 @@
 [![Project Page](https://img.shields.io/badge/Project%20Page-SkillOpt-8dbb3c)](https://microsoft.github.io/SkillOpt/) [![Paper](https://img.shields.io/badge/Paper-arXiv-b31b1b)](https://arxiv.org/abs/2605.23904) [![Project Video](https://img.shields.io/badge/Project%20Video-Watch%20Demo-ff0000)](https://youtu.be/JUBMDTCiM0M) [![PyPI](https://img.shields.io/badge/PyPI-skillopt-green.svg)](https://pypi.org/project/skillopt/) [![Python 3.10+](https://img.shields.io/badge/Python-3.10%2B-blue.svg)](https://www.python.org/) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)

-> 📖 **For installation, data preparation, training/eval commands, the full configuration reference, and framework internals, see the [Documentation & Reproduction Guide](docs/guideline.html)**; view it [rendered online](https://htmlpreview.github.io/?https://github.com/microsoft/SkillOpt/blob/main/docs/guideline.html) or via [GitHub Pages](https://microsoft.github.io/SkillOpt/docs/guideline.html).
+> 📖 **For installation, data preparation, training/eval commands, the full configuration reference, and framework internals, see the [Documentation & Reproduction Guide](https://microsoft.github.io/SkillOpt/docs/guideline.html)** (rendered on GitHub Pages).

 ---

 ## News 🔥🔥🔥
-- **[2026-06-14]** 😴 **SkillOpt-Sleep (preview).** A nightly *sleep cycle* for local coding agents (Claude Code / Codex / Copilot): review past sessions offline, replay recurring tasks, and consolidate validated skills behind a held-out gate. This is an early **preview**, open-source and decoupled from the paper code, that we'll keep iterating on. See [`plugins/`](plugins/) and the [section below](#-skillopt-sleep--the-deployment-time-companion).
 - **[2026-06-03]** 🎉 **[gbrain](https://github.com/garrytan/gbrain), [gbrain-evals](https://github.com/garrytan/gbrain-evals/blob/main/docs/benchmarks/2026-06-03-skillopt.md), and [darwin-skill](https://github.com/alchaincyf/darwin-skill) have all integrated SkillOpt.**
 - **[2026-06-02]** 🎉 **SkillOpt [v0.1.0](https://github.com/microsoft/SkillOpt/releases/tag/v0.1.0) is now available on [PyPI](https://pypi.org/project/skillopt/)!** Install with `pip install skillopt`. This initial release includes the full training loop (rollout → reflect → aggregate → select → update → evaluate), multi-backend support (OpenAI / Azure / Claude / Qwen / MiniMax), six built-in benchmarks, and WebUI dashboard.

@@ -53,59 +52,6 @@ https://github.com/user-attachments/assets/eb12d3bc-371c-467f-904d-91b61f339ed7

 ---

-## 😴 SkillOpt-Sleep: the deployment-time companion
-
-> **Preview.** SkillOpt-Sleep is an early preview that we are actively iterating
-> on; interfaces and defaults may change. Feedback and issues are welcome.
-
-SkillOpt (above) trains a skill offline on a benchmark. **SkillOpt-Sleep**
-applies the same discipline to *your own daily usage*: it gives a local coding
-agent a nightly **sleep cycle** that reviews your past sessions, replays your
-recurring tasks on your own API budget, and consolidates what it learns into
-**validated** long-term memory and skills, behind a held-out gate, staged for
-your review. The agent gets better the more you use it, with no weight training.
-
-It synthesizes **SkillOpt** (validation-gated bounded text edits), **Claude
-Dreams** (offline consolidation; review-then-adopt), and the **agent sleep**
-idea (short-term experience → long-term competence). One "night":
-
-```
-harvest Claude Code / Codex Desktop transcripts → mine recurring tasks → replay offline
-  → consolidate (reflect → bounded edit → GATE on real held-out tasks)
-  → stage proposal → (you) adopt
-```
-
-**Plugins for three agents** (one engine, three thin shells; see [`plugins/`](plugins/)):
-
-| Platform | Folder | Install |
-|---|---|---|
-| **Claude Code** | [`plugins/claude-code`](plugins/claude-code) | `/plugin marketplace add ./plugins/claude-code` → `/skillopt-sleep` |
-| **Codex** | [`plugins/codex`](plugins/codex) | `bash plugins/codex/install.sh` → `skillopt-sleep` skill |
-| **Copilot** | [`plugins/copilot`](plugins/copilot) | register `plugins/copilot/mcp_server.py` as an MCP server |
-
-**Validated on real models.** On the public
-[gbrain-evals](https://github.com/garrytan/gbrain-evals) `skillopt-v1` benchmark,
-deficient skills go **0.00 → 1.00** on held-out sets with **both Claude and
-Codex** (all 4 seeds, including a real tool-use loop), cross-model transfer is
-positive, and the gate blocks regressions
-([full results](docs/sleep/FINAL_REPORT.md)).
-
-> **Open-source tool, decoupled from the research.** The engine lives in the
-> top-level [`skillopt_sleep/`](skillopt_sleep) package with **zero dependency**
-> on the paper's `skillopt/` experiment code (the validation gate is vendored).
-> Controls (optional gate, multi-rollout contrastive reflection, token/time
-> budget, multi-objective reward, user preferences, optimizer/target split) are
-> documented in [`docs/sleep/CONTROLLABLE_DREAMING.md`](docs/sleep/CONTROLLABLE_DREAMING.md).
-
-Deterministic proof (no API key): `python -m skillopt_sleep.experiments.run_experiment --persona researcher --assert-improves`.
-
-For local sleep cycles, transcript source and replay backend are separate knobs:
-use `--source claude` for Claude Code transcripts, `--source codex` for Codex
-Desktop archived sessions under `~/.codex/archived_sessions`, and
-`--backend codex` only when you want the replay/optimizer to spend Codex budget.
-
----
-
 ## Extensibility & WebUI

 ### Adding a new backend
diff --git a/docs/guideline.html b/docs/guideline.html
index ddc6567..8712012 100644
--- a/docs/guideline.html
+++ b/docs/guideline.html
@@ -288,6 +288,12 @@
 CLI scripts
 WebUI
+
@@ -917,6 +923,57 @@ PY
 --share | off | Create a public Gradio share link.
+
+
+
+

9.1 SkillOpt-Sleep: the deployment-time companion (preview)

+

SkillOpt-Sleep applies SkillOpt's discipline to your own daily usage. It gives a local coding agent a nightly sleep cycle that reviews your past sessions, replays your recurring tasks on your own API budget, and consolidates what it learns into validated long-term memory and skills, behind a held-out gate and staged for your review. The agent gets better the more you use it, with no weight training and zero inference-time overhead. It is an early preview we are actively iterating on; interfaces and defaults may change.

+

One "night":

+
harvest Claude Code / Codex transcripts → mine recurring tasks → replay offline
+   → consolidate (reflect → bounded edit → GATE on real held-out tasks)
+   → stage proposal → (you) adopt
+

The engine lives in the top-level skillopt_sleep/ package with zero dependency on the paper's skillopt/ experiment code (the validation gate is vendored). Deterministic proof, no API key required: python -m skillopt_sleep.experiments.run_experiment --persona researcher --assert-improves.
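The "night" pipeline ends in a held-out gate before anything is staged. A minimal sketch of that gating step, assuming a toy scoring function: `Proposal`, `mean_score`, and `gate_candidate` are hypothetical names chosen for illustration and are not taken from the `skillopt_sleep` package.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Proposal:
    """A staged skill edit awaiting user review (hypothetical structure)."""
    skill_text: str
    baseline_score: float
    candidate_score: float


def mean_score(skill: str, tasks: Sequence[str],
               score: Callable[[str, str], float]) -> float:
    """Average score of a skill over a task set; 0.0 for an empty set."""
    if not tasks:
        return 0.0
    return sum(score(skill, task) for task in tasks) / len(tasks)


def gate_candidate(current_skill: str, candidate_skill: str,
                   held_out_tasks: Sequence[str],
                   score: Callable[[str, str], float]) -> Optional[Proposal]:
    """Stage the candidate only if it beats the current skill on held-out tasks."""
    baseline = mean_score(current_skill, held_out_tasks, score)
    candidate = mean_score(candidate_skill, held_out_tasks, score)
    if candidate > baseline:
        return Proposal(candidate_skill, baseline, candidate)
    return None  # the gate blocks the edit; nothing is staged


if __name__ == "__main__":
    # Toy scorer (assumption): a task counts as solved if the skill mentions it.
    toy = lambda skill, task: 1.0 if task in skill else 0.0
    print(gate_candidate("cite sources", "cite sources; run tests",
                         ["run tests", "cite sources"], toy))
```

In this sketch the user still decides whether to adopt a returned `Proposal`; the gate only decides whether a proposal is staged at all.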

+ +

9.2 Plugins (three agents)

+

One engine, thin per-agent shells (see plugins/):

+
| Platform | Folder | Install |
|---|---|---|
| Claude Code | plugins/claude-code | /plugin marketplace add ./plugins/claude-code → /skillopt-sleep |
| Codex | plugins/codex | bash plugins/codex/install.sh → skillopt-sleep skill |
| Copilot | plugins/copilot | register plugins/copilot/mcp_server.py as an MCP server |
+

Transcript source and replay backend are separate knobs: --source claude for Claude Code transcripts, --source codex for Codex Desktop archived sessions under ~/.codex/archived_sessions, and --backend codex only when you want the replay/optimizer to spend Codex budget.
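To illustrate why the two knobs are independent, here is a small `argparse` sketch. It is not the real `skillopt_sleep` CLI: the parser, the `resolve` helper, and both default values are assumptions, and the `mock` choice follows the no-spend backend mentioned in the removed PUBLISHING.md notes.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser with independent --source and --backend options."""
    parser = argparse.ArgumentParser(prog="sleep-sketch")
    # Where transcripts are harvested from. The default is an assumption.
    parser.add_argument("--source", choices=["claude", "codex"], default="claude")
    # Which model runs replay/optimization. "mock" stands for a no-spend backend.
    parser.add_argument("--backend", choices=["mock", "claude", "codex"],
                        default="mock")
    return parser


def resolve(argv: list[str]) -> dict:
    """Parse argv and report both knobs plus whether API budget would be spent."""
    args = build_parser().parse_args(argv)
    return {
        "source": args.source,
        "backend": args.backend,
        "spends_budget": args.backend != "mock",
    }


if __name__ == "__main__":
    # Codex transcripts, replayed without spending any budget.
    print(resolve(["--source", "codex"]))
    # Claude Code transcripts, replayed on Codex budget.
    print(resolve(["--source", "claude", "--backend", "codex"]))
```

Keeping the options separate means harvesting Codex transcripts never implies spending Codex budget; that only happens when the backend is set explicitly.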

+ +

9.3 Experience replay & dream rollouts (opt-in)

+

Two consolidation mechanisms, both default off (so behavior is unchanged unless enabled). They strengthen the nightly update when your tasks have a clean correctness signal; the validation gate still governs what ships.

+
| Config knob | Default | Effect |
|---|---|---|
| dream_rollouts | 1 | Run each task K times and learn from the good-vs-bad contrast (contrastive reflection). |
| recall_k | 0 | Associative recall: pull the K most-similar past tasks (from a persisted archive) into tonight's dream. |
| dream_factor | 0 | Add N lightweight synthetic variants of each task. |
+

On a clean-signal benchmark the gain scales with recall depth (deployment protocol: 5 nights × 10 new real tasks/night, full held-out test, GPT-5.5, gated): recall_k=10 → +3.1 pts, recall_k=20 → +4.5 pts, full-history replay reference → +5.6 pts; a second benchmark (SpreadsheetBench, GPT-5.4-nano, gate-free) gives +3.6 pts. On saturated or noisy tasks the effect is flat within run-to-run noise (±1–2 pts). Keep the gate on; it bounds the downside.
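The three knobs can be pictured as a small config object that shapes each night's plan. The sketch below is illustrative only: the knob names and defaults come from the table above, while `DreamConfig`, `jaccard`, `recall`, and `plan_night` are hypothetical, and word-overlap similarity is a stand-in for whatever similarity measure the engine actually uses.

```python
from dataclasses import dataclass


@dataclass
class DreamConfig:
    """Knob names and defaults follow the guide; the class is hypothetical."""
    dream_rollouts: int = 1  # K rollouts per task (1 = no contrast)
    recall_k: int = 0        # past tasks recalled per new task (0 = off)
    dream_factor: int = 0    # synthetic variants per task (0 = off)


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity in [0, 1]; an assumed stand-in metric."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def recall(task: str, archive: list[str], k: int) -> list[str]:
    """Return the k archived tasks most similar to `task` (empty if k <= 0)."""
    if k <= 0:
        return []
    ranked = sorted(archive, key=lambda past: jaccard(task, past), reverse=True)
    return ranked[:k]


def plan_night(tasks: list[str], archive: list[str],
               cfg: DreamConfig) -> list[dict]:
    """Describe what one night would replay for each new task."""
    return [
        {
            "task": task,
            "rollouts": cfg.dream_rollouts,
            "recalled": recall(task, archive, cfg.recall_k),
            "variants": cfg.dream_factor,
        }
        for task in tasks
    ]


if __name__ == "__main__":
    archive = ["fix flaky ci test", "write changelog"]
    print(plan_night(["fix flaky test"], archive, DreamConfig()))
    print(plan_night(["fix flaky test"], archive,
                     DreamConfig(dream_rollouts=4, recall_k=1)))
```

With the defaults, the plan holds one rollout per task and nothing recalled or synthesized, which matches the "behavior is unchanged unless enabled" statement.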