feat(plugins): ship SkillOpt-Sleep for Claude Code, Codex, and Copilot

Restructure into plugins/{claude-code,codex,copilot}/ — one engine, three thin shells, all calling the shared plugins/run-sleep.sh -> python -m skillopt_sleep. - claude-code/: existing plugin moved here; runner delegates to the shared launcher (fixes repo-root resolution after the move). - codex/: ~/.codex/prompts/sleep.md custom prompt + ~/.agents/skills SKILL.md + install.sh + AGENTS.md hint — Codex's documented, stable extension surfaces. - copilot/: a stdlib-only MCP server (mcp_server.py) exposing sleep_* tools, plus mcp-config.example.json and a copilot-instructions snippet. Verified end to end (initialize -> tools/list -> tools/call returns real engine output). - plugins/README.md overview table; main README News + a dedicated SkillOpt-Sleep section; pyproject lists skillopt_sleep as a first-class package. Decoupling emphasized throughout: open-source tool (skillopt_sleep/) with zero dependency on the research package. 29 tests pass; all three shells resolve. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-07-03 14:02:58 +08:00 · 2026-06-08 14:31:52 +00:00
parent b02ffc2c99
commit f9db99853b
22 changed files with 576 additions and 31 deletions
--- a/plugins/README.md
+++ b/plugins/README.md
@@ -0,0 +1,74 @@
+# SkillOpt-Sleep — plugins for Claude Code, Codex, and Copilot
+
+One engine, three thin shells. **SkillOpt-Sleep** gives a local coding agent a
+nightly **sleep cycle**: it reviews your past sessions offline, replays your
+recurring tasks on your own API budget, and consolidates what it learns into
+**validated** long-term memory and skills — behind a held-out gate, staged for
+your review. Your agent gets better the more you use it, with no model-weight
+training.
+
+It synthesizes three ideas: **SkillOpt** (validation-gated bounded text
+optimization — the research in this repo), **Claude Dreams** (offline memory
+consolidation; input never mutated; review-then-adopt), and the **agent sleep**
+literature (short-term experience → long-term competence).
+
+> **This is an open-source tool, decoupled from the research code.** The engine
+> lives in the top-level [`skillopt_sleep/`](../skillopt_sleep) package and has
+> **zero dependency** on the paper's `skillopt/` experiment package (the
+> validation gate is vendored). You can ship/use it without the research stack.
+
+## The three integrations
+
+| Platform | Folder | Mechanism | Status |
+|---|---|---|---|
+| **Claude Code** | [`claude-code/`](claude-code) | `.claude-plugin` + `/sleep` command + skill + hooks | full, installable |
+| **Codex** | [`codex/`](codex) | `~/.codex/prompts/sleep.md` + `~/.agents/skills` + `AGENTS.md` | full |
+| **Copilot** | [`copilot/`](copilot) | MCP server (`sleep_*` tools) + `copilot-instructions` | full (MCP) |
+
+All three call the **same** [`plugins/run-sleep.sh`](run-sleep.sh) → `python -m
+skillopt_sleep`, so behaviour is identical everywhere. Per-platform setup is in
+each folder's README.
+
+## Quick start (Claude Code)
+
+```bash
+git clone <repo-url> && cd SkillOpt-Sleep
+# Claude Code:
+/plugin marketplace add ./plugins/claude-code
+/plugin install skillopt-sleep@skillopt-sleep
+/sleep status
+```
+Codex: `bash plugins/codex/install.sh`.
+Copilot: register `plugins/copilot/mcp_server.py` as an MCP server.
+
+## What one "night" does
+
+```
+harvest ~/.claude (or session) transcripts → mine recurring tasks → replay offline
+   → consolidate (reflect → bounded edit → GATE on real held-out tasks)
+   → stage proposal → (you) adopt
+```
+
+Nothing live changes until you adopt; every adopt backs up first.
+
+## Controls (work on all platforms)
+
+`--gate on|off` · `--rollouts-k K` (multi-rollout contrastive reflection) ·
+`--budget-tokens/--budget-minutes` · `--preferences "..."` · separate
+optimizer/target models (`--optimizer-model` / `--target-model`) · slow-update
+long-term memory. Full guide:
+[`../docs/sleep/CONTROLLABLE_DREAMING.md`](../docs/sleep/CONTROLLABLE_DREAMING.md).
+
+## Does it actually work?
+
+Validated on the public
+[gbrain-evals](https://github.com/garrytan/gbrain-evals) `skillopt-v1` benchmark
+with **real models on both Claude and Codex**: deficient skills go **0.00 →
+1.00** on held-out sets (all 4 seeds incl. a real tool-use loop), cross-model
+transfer is positive, and the gate blocks regressions. Full results:
+[`../docs/sleep/FINAL_REPORT.md`](../docs/sleep/FINAL_REPORT.md).
+
+Deterministic proof (no API key):
+```bash
+python -m skillopt_sleep.experiments.run_experiment --persona researcher --assert-improves
+```
--- a/plugins/claude-code/.claude-plugin/marketplace.json
+++ b/plugins/claude-code/.claude-plugin/marketplace.json
@@ -0,0 +1,26 @@
+{
+  "$schema": "https://anthropic.com/claude-code/marketplace.schema.json",
+  "name": "skillopt-sleep",
+  "description": "SkillOpt-Sleep: give your local Claude agent a nightly sleep cycle that reviews past sessions and consolidates validated memory + skills.",
+  "owner": {
+    "name": "Yifan Yang",
+    "email": "yifanyang@microsoft.com"
+  },
+  "plugins": [
+    {
+      "name": "skillopt-sleep",
+      "description": "Nightly offline self-evolution: harvest your past Claude Code sessions, replay recurring tasks on your own API budget, and consolidate what the agent learns into validated CLAUDE.md memory and SKILL.md skills — behind a held-out gate, staged for your review.越用越好用. Synthesizes SkillOpt (validation-gated skill optimization), Claude Dreams (offline memory consolidation), and agent sleep/consolidation.",
+      "author": {
+        "name": "Yifan Yang"
+      },
+      "category": "productivity",
+      "source": {
+        "source": "git-subdir",
+        "url": "https://github.com/microsoft/SkillOpt.git",
+        "path": "skillopt-sleep-plugin",
+        "ref": "main"
+      },
+      "homepage": "https://github.com/microsoft/SkillOpt"
+    }
+  ]
+}
--- a/plugins/claude-code/.claude-plugin/plugin.json
+++ b/plugins/claude-code/.claude-plugin/plugin.json
@@ -0,0 +1,22 @@
+{
+  "name": "skillopt-sleep",
+  "description": "Give your local Claude agent a nightly 'sleep cycle': it reviews your past sessions offline, replays recurring tasks on your own API budget, and consolidates what it learns into validated memory (CLAUDE.md) and skills (SKILL.md).越用越好用 — gets better the more you use it. Synthesizes SkillOpt (validation-gated skill optimization), Claude Dreams (offline memory consolidation), and agent sleep/consolidation.",
+  "version": "0.1.0",
+  "author": {
+    "name": "Yifan Yang",
+    "email": "yifanyang@microsoft.com"
+  },
+  "homepage": "https://github.com/microsoft/SkillOpt",
+  "repository": "https://github.com/microsoft/SkillOpt",
+  "license": "MIT",
+  "keywords": [
+    "skillopt",
+    "self-improvement",
+    "memory-consolidation",
+    "dreams",
+    "sleep",
+    "skills",
+    "continual-learning",
+    "offline-optimization"
+  ]
+}
--- a/plugins/claude-code/README.md
+++ b/plugins/claude-code/README.md
@@ -0,0 +1,138 @@
+# SkillOpt-Sleep (Claude Code plugin)
+
+> Give your local Claude agent a **sleep cycle**. Every night it reviews your
+> past sessions offline, replays your recurring tasks on your own API budget,
+> and consolidates what it learns into **validated** memory (`CLAUDE.md`) and
+> skills (`SKILL.md`). Your agent gets better the more you use it — no
+> model-weight training.
+
+SkillOpt-Sleep is the **deployment-time** companion to
+[SkillOpt](https://github.com/microsoft/SkillOpt). SkillOpt trains a skill
+offline on a benchmark; SkillOpt-Sleep applies the same discipline to *your own
+daily usage*: bounded text edits, accepted only through a held-out validation
+gate, with rejected edits kept as negative feedback.
+
+It synthesizes three ideas:
+
+| Idea | Contribution |
+|---|---|
+| **SkillOpt** | skill/memory = trainable text; bounded add/delete/replace edits; **held-out gate** keeps only changes that help. |
+| **Claude Dreams** | offline consolidation over past sessions; input never mutated; output **reviewed then adopted**. |
+| **Agent sleep** | periodic offline replay turns short-term episodes into long-term skill. |
+
+## What it does (one "night")
+
+```
+harvest ~/.claude transcripts → mine recurring tasks → replay offline
+   → consolidate (reflect → bounded edit → GATE) → stage proposal → (you) adopt
+```
+
+Nothing live is modified until **you** run `/sleep adopt` (the Dreams "review,
+then adopt or discard" contract). Every adopt backs up the prior file first.
+
+## Install
+
+**Requirements:** Python ≥ 3.10, and the `claude` CLI (and/or `codex` CLI) on PATH.
+
+```bash
+# 1) get the code (the plugin ships inside the SkillOpt repo)
+git clone https://github.com/microsoft/SkillOpt.git
+cd SkillOpt
+
+# 2) add the plugin to Claude Code as a local marketplace
+/plugin marketplace add ./skillopt-sleep-plugin
+/plugin install skillopt-sleep@skillopt-sleep
+
+# 3) verify
+/sleep status
+```
+
+The plugin's bundled runner (`scripts/sleep.sh`) auto-selects a Python ≥ 3.10
+interpreter and calls the `skillopt_sleep` engine in the repo. No `pip install`
+is required for the default `mock` backend or for `claude`/`codex` backends —
+they shell out to the CLIs you already have.
+
+## Quick start
+
+```bash
+# from inside any project you use with Claude Code:
+/sleep dry-run     # safe preview: what it would learn, no changes staged
+/sleep run         # full cycle: stages a reviewed proposal (still no live edits)
+/sleep status      # see history + the latest staged proposal
+/sleep adopt       # apply the staged proposal to CLAUDE.md / SKILL.md (with backup)
+```
+
+Or call the engine directly (Python ≥ 3.10):
+
+```bash
+python -m skillopt_sleep run --project "$(pwd)" --scope invoked --backend mock
+python -m skillopt_sleep run --project "$(pwd)" --backend claude   # real lift via Claude
+python -m skillopt_sleep run --project "$(pwd)" --backend codex    # real lift via Codex
+```
+
+Default backend is **`mock`** — deterministic, no API spend — so you can try the
+plumbing for free. Switch to `--backend claude` or `--backend codex` for genuine
+improvement on your own budget.
+
+## Does it actually improve? (real models, public benchmark)
+
+SkillOpt-Sleep is validated against [gbrain-evals](https://github.com/garrytan/gbrain-evals)'
+public `skillopt-v1` suite — the same benchmark gbrain scores its own skill
+optimizer against. We take a deliberately **deficient** skill and run one sleep
+night; held-out scoring is done by a local rule judge (no judge-API, no way to
+grade its own homework).
+
+| Backend | Seed | Held-out before → after | Nights |
+|---|---|---|---|
+| **Claude (Haiku 4.5)** | brief-writer | **0.00 → 1.00** | 1 |
+| **Codex** | brief-writer | **0.00 → 1.00** | 2 |
+
+Both took a brief-writer with no risks section / no confidence level and, within
+1–2 nights, proposed gated edits that lifted the held-out score to perfect —
+into the protected `LEARNED` block, nothing else touched. The Codex 2-night
+trace even shows the optimizer **diagnosing its own residual failure** and
+adding a meta-rule to fix it. Full writeup + reproduction:
+[`docs/sleep/real_api_results.md`](../docs/sleep/real_api_results.md).
+
+Reproduce:
+
+```bash
+git clone https://github.com/garrytan/gbrain-evals /tmp/gbrain-evals
+python -m skillopt_sleep.experiments.run_gbrain --backend claude --model haiku \
+  --seeds brief-writer --data-root /tmp/gbrain-evals/eval/data/skillopt-v1 \
+  --nights 1 --limit-replay 3 --limit-holdout 3
+python -m skillopt_sleep.experiments.run_gbrain --backend codex \
+  --seeds brief-writer --data-root /tmp/gbrain-evals/eval/data/skillopt-v1 \
+  --nights 1 --limit-replay 3 --limit-holdout 3
+```
+
+## Deterministic proof (no API, no keys)
+
+```bash
+python -m skillopt_sleep.experiments.run_experiment --persona researcher --assert-improves
+python -m skillopt_sleep.experiments.run_experiment --persona programmer  --assert-improves
+```
+
+Each prints the held-out score rising from baseline toward 1.0 as the gate
+accepts the general rules your tasks need, and confirms the gate **rejects** an
+injected harmful edit. Recorded output: [`docs/sleep/experiment_results.md`](../docs/sleep/experiment_results.md).
+
+## Schedule it nightly
+
+```bash
+"${CLAUDE_PLUGIN_ROOT}/scripts/install-cron.sh" "$(pwd)"   # prints a crontab line; installs nothing
+```
+
+## Safety
+
+- **Read-only** harvest of `~/.claude`. `mock` replay has no side effects.
+- Proposals are **staged**, never auto-applied (unless you opt in with `--auto-adopt`).
+- Every adopt writes a backup under the staging dir's `backup/`.
+- Per-night **token/task budget caps**; secrets redacted from prompts.
+- `fresh` replay (Phase 3) runs only in throwaway git worktrees.
+
+## Status
+
+Phase 1 (engine + deterministic experiment + plugin surface) is complete.
+Phase 3 adds the real-API miner/judge and `fresh` worktree replay. See
+[`docs/superpowers/specs/2026-06-07-skillopt-sleep-claude-code-plugin-design.md`](../docs/superpowers/specs/2026-06-07-skillopt-sleep-claude-code-plugin-design.md).
--- a/plugins/claude-code/commands/sleep.md
+++ b/plugins/claude-code/commands/sleep.md
@@ -0,0 +1,63 @@
+---
+description: Run or manage the SkillOpt-Sleep self-evolution cycle (review past sessions, replay tasks offline, consolidate validated memory + skills)
+argument-hint: "[run | dry-run | status | adopt | harvest] (default: status)"
+allowed-tools: Bash, Read
+---
+
+# /sleep — SkillOpt-Sleep nightly self-evolution
+
+You are driving **SkillOpt-Sleep**: a tool that lets this user's Claude agent
+improve offline by reviewing past sessions, replaying recurring tasks, and
+consolidating what it learns into **validated** memory (`CLAUDE.md`) and skills
+(`SKILL.md`). It is gated like SkillOpt: a change is kept only if it improves a
+held-out replay score, and nothing live is modified until the user adopts it.
+
+## Requested action: $ARGUMENTS
+
+(If `$ARGUMENTS` is empty, treat it as `status`.)
+
+## How to run it
+
+The engine is the `skillopt_sleep` Python package in this repo. Use the
+**plugin's bundled runner** so the right interpreter and repo are on the path:
+
+```bash
+"${CLAUDE_PLUGIN_ROOT}/scripts/sleep.sh" <action> --project "$(pwd)" --scope invoked
+```
+
+`<action>` is one of:
+
+| action    | what it does |
+|-----------|--------------|
+| `status`  | show how many nights have run + the latest staged proposal (READ-ONLY) |
+| `dry-run` | harvest → mine → replay → report, but **stage nothing** (safe preview) |
+| `run`     | full cycle: also **stage** a reviewed proposal (still does NOT touch live files) |
+| `adopt`   | apply the latest staged proposal to live `CLAUDE.md` / `SKILL.md` (backs up first) |
+| `harvest` | debug: print the recurring tasks mined from recent sessions |
+
+Default backend is `mock` (deterministic, no API spend). To use real Anthropic
+budget for genuine improvement, add `--backend anthropic`.
+
+## Steps to follow
+
+1. **Run the requested action** via the bundled runner above. Capture stdout.
+2. **For `run` / `dry-run`:** after it completes, `Read` the generated
+   `report.md` in the staging dir it prints, and show the user:
+   - held-out score: baseline → candidate (the proof it helped)
+   - the gate decision (accept/reject) and the exact edits it proposes
+   - where the proposal is staged
+3. **For `run` that produced an accepted proposal:** tell the user the diff is
+   staged and that **nothing live changed yet**. Offer to run `/sleep adopt`.
+4. **For `adopt`:** confirm which live files were updated and that backups were
+   written under the staging dir's `backup/`.
+5. **Never** edit `CLAUDE.md` or `SKILL.md` yourself — only the `adopt` action
+   does that, with a backup. Respect the review gate.
+
+## Safety reminders
+
+- Harvest is **read-only** over `~/.claude`. Replay in `mock` mode runs no
+  shell side effects.
+- The cycle stages proposals; the user is in control of adoption.
+- If the user asks to schedule this nightly, point them at
+  `${CLAUDE_PLUGIN_ROOT}/scripts/install-cron.sh` (prints a crontab line; does
+  not install anything without confirmation).
--- a/plugins/claude-code/hooks/hooks.json
+++ b/plugins/claude-code/hooks/hooks.json
@@ -0,0 +1,16 @@
+{
+  "hooks": {
+    "SessionEnd": [
+      {
+        "matcher": "*",
+        "hooks": [
+          {
+            "type": "command",
+            "command": "\"${CLAUDE_PLUGIN_ROOT}/hooks/on-session-end.sh\"",
+            "async": true
+          }
+        ]
+      }
+    ]
+  }
+}
--- a/plugins/claude-code/hooks/on-session-end.sh
+++ b/plugins/claude-code/hooks/on-session-end.sh
@@ -0,0 +1,18 @@
+#!/usr/bin/env bash
+# SkillOpt-Sleep SessionEnd hook (async, best-effort, NON-BLOCKING).
+#
+# This does NOT run the optimizer. It only appends a tiny marker so the next
+# nightly cycle knows there is fresh activity to harvest, and (optionally)
+# nudges the user once that a sleep cycle is available. It must never fail the
+# session or spend API budget.
+set -uo pipefail
+
+PLUGIN_ROOT="${CLAUDE_PLUGIN_ROOT:-$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)}"
+STATE_DIR="${HOME}/.skillopt-sleep"
+mkdir -p "$STATE_DIR" 2>/dev/null || exit 0
+
+# Record that a session just ended (cheap; used for "is there new data?").
+printf '%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "${PWD}" \
+  >> "$STATE_DIR/session-end.log" 2>/dev/null || true
+
+exit 0
--- a/plugins/claude-code/scripts/install-cron.sh
+++ b/plugins/claude-code/scripts/install-cron.sh
@@ -0,0 +1,29 @@
+#!/usr/bin/env bash
+# Print (does NOT install) a crontab line that runs SkillOpt-Sleep nightly.
+# The user copies the line into `crontab -e` if they want it.
+set -euo pipefail
+
+PLUGIN_ROOT="${CLAUDE_PLUGIN_ROOT:-$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)}"
+RUNNER="$PLUGIN_ROOT/scripts/sleep.sh"
+PROJECT="${1:-$(pwd)}"
+BACKEND="${2:-mock}"
+
+# 3:17am local — deliberately off the :00 mark so many users don't all hit the
+# API at once (and we leave room for jitter).
+MIN=17
+HOUR=3
+
+cat <<EOF
+# ── SkillOpt-Sleep nightly cycle ────────────────────────────────────────────
+# Review past sessions, replay tasks, stage validated memory/skill updates.
+# Runs at ${HOUR}:$(printf '%02d' $MIN) local every day. Output goes to the project's
+# .skillopt-sleep/ dir; nothing live is changed until you run '/sleep adopt'
+# (unless you pass --auto-adopt below).
+#
+# Copy the next line into 'crontab -e':
+${MIN} ${HOUR} * * *  "${RUNNER}" run --project "${PROJECT}" --scope invoked --backend ${BACKEND} >> "${PROJECT}/.skillopt-sleep/cron.log" 2>&1
+#
+# For fully-autonomous adoption (power users), append: --auto-adopt
+# To spend real API budget for genuine lift, set BACKEND=anthropic above.
+# ────────────────────────────────────────────────────────────────────────────
+EOF
--- a/plugins/claude-code/scripts/sleep.sh
+++ b/plugins/claude-code/scripts/sleep.sh
@@ -0,0 +1,11 @@
+#!/usr/bin/env bash
+# Claude Code plugin runner — thin wrapper over the shared runner so all three
+# platform plugins share one engine launcher. The shared runner lives at
+# <repo>/plugins/run-sleep.sh and handles repo-root + interpreter resolution.
+set -euo pipefail
+HERE="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"   # <repo>/plugins/claude-code/scripts
+SHARED="$(cd "$HERE/../.." && pwd)/run-sleep.sh"        # <repo>/plugins/run-sleep.sh
+if [ ! -f "$SHARED" ] && [ -n "${CLAUDE_PLUGIN_ROOT:-}" ]; then
+  SHARED="$(cd "$CLAUDE_PLUGIN_ROOT/.." && pwd)/run-sleep.sh"
+fi
+exec bash "$SHARED" "$@"
--- a/plugins/claude-code/skills/skillopt-sleep/SKILL.md
+++ b/plugins/claude-code/skills/skillopt-sleep/SKILL.md
@@ -0,0 +1,79 @@
+---
+name: skillopt-sleep
+description: "Use when the user wants their Claude agent to self-improve from past usage, asks about a nightly/offline 'sleep' or 'dream' cycle, memory/skill consolidation, or says things like '让 agent 越用越好用', 'review my past sessions', 'learn my preferences', 'consolidate what you learned', 'run the sleep cycle', or wants to schedule offline self-optimization. Drives the skillopt_sleep engine: harvest past sessions → mine recurring tasks → replay offline → consolidate validated CLAUDE.md/SKILL.md behind a held-out gate."
+---
+
+# SkillOpt-Sleep: offline self-evolution for a local Claude agent
+
+SkillOpt-Sleep gives the user's agent a **sleep cycle**. While the user is
+offline (e.g. nightly), it reviews their real past Claude Code sessions,
+re-runs recurring tasks on their own API budget, and consolidates what it
+learns into **memory** (`CLAUDE.md`) and **skills** (`SKILL.md`) — but only
+keeps changes that pass a held-out validation gate, and only after the user
+adopts them. The agent gets measurably better at *this* user's recurring work,
+with no model-weight training. It is the deployment-time analogue of training:
+short-term experience → long-term competence.
+
+It synthesizes three ideas:
+- **SkillOpt** — the skill/memory doc is trainable text; bounded add/delete/replace
+  edits; accepted only through a held-out gate; rejected edits become negative feedback.
+- **Claude Dreams** — offline consolidation that reads past sessions and rebuilds
+  memory (dedup/merge/resolve); the input is never mutated; output is reviewed then adopted.
+- **Agent sleep** — periodic offline replay turns episodes into durable skill.
+
+## When to use this skill
+
+Trigger when the user wants any of:
+- "make my agent learn from how I use it" / "越用越好用" / "remember my preferences across sessions"
+- a nightly/scheduled or on-demand **offline self-improvement / dream / sleep** run
+- to **review past sessions/trajectories** and distill recurring tasks
+- to **consolidate** feedback into `CLAUDE.md` or a managed skill
+- to **schedule** the cycle (cron) or **adopt** a staged proposal
+
+## The cycle (six stages)
+
+1. **Harvest** — read `~/.claude/projects/*/<session>.jsonl` + `~/.claude/history.jsonl` (READ-ONLY) → session digests.
+2. **Mine** — digests → `TaskRecord`s (recurring intents + outcome labels + checkable refs where possible).
+3. **Replay** — re-run tasks offline under the *current* skill+memory → (hard, soft) scores.
+4. **Consolidate** — reflect on failures → propose bounded edits → **gate** on a held-out slice; accept only if it strictly improves.
+5. **Stage** — write `proposed_CLAUDE.md`, `proposed_SKILL.md`, a diff, and `report.md` into `<project>/.skillopt-sleep/staging/<date>/`. **Nothing live changes.**
+6. **Adopt** — explicit (or opt-in auto): copy staged files over live ones, backing up first.
+
+## How to drive it
+
+Prefer the `/sleep` command. Under the hood it calls the bundled runner:
+
+```bash
+"${CLAUDE_PLUGIN_ROOT}/scripts/sleep.sh" status                       # what's happened
+"${CLAUDE_PLUGIN_ROOT}/scripts/sleep.sh" dry-run --project "$(pwd)"    # safe preview
+"${CLAUDE_PLUGIN_ROOT}/scripts/sleep.sh" run --project "$(pwd)"        # full cycle, stages a proposal
+"${CLAUDE_PLUGIN_ROOT}/scripts/sleep.sh" adopt --project "$(pwd)"      # apply staged proposal (with backup)
+```
+
+- Default backend is `mock` (deterministic, **no API spend**) — good for trying the plumbing.
+- Add `--backend claude` or `--backend codex` to spend the user's real budget for genuine improvement.
+- Scope defaults to the invoked project; `--scope all` harvests every project.
+
+## Hard rules
+
+- **Never** hand-edit the user's `CLAUDE.md` / `SKILL.md` as part of this skill.
+  Only the `adopt` action changes live files, and it backs them up first.
+- Harvest is read-only. `mock` replay has no side effects.
+- Always show the user the **held-out baseline → candidate** score and the
+  exact proposed edits before suggesting adoption. Evidence before adoption.
+- If asked whether it really helps, run
+  `python -m skillopt_sleep.experiments.run_experiment --persona researcher --json`
+  — a deterministic demo that proves held-out lift and that the gate blocks
+  harmful edits.
+
+## Validate / demo
+
+```bash
+# deterministic proof (no API): held-out score rises, gate blocks regressions
+python -m skillopt_sleep.experiments.run_experiment --persona researcher --assert-improves
+python -m skillopt_sleep.experiments.run_experiment --persona programmer  --assert-improves
+```
+
+See `docs/sleep/experiment_results.md` for recorded output and
+`docs/superpowers/specs/2026-06-07-skillopt-sleep-claude-code-plugin-design.md`
+for the full design.
--- a/plugins/codex/README.md
+++ b/plugins/codex/README.md
@@ -0,0 +1,59 @@
+# SkillOpt-Sleep — Codex integration
+
+Give your **Codex** agent a nightly **sleep cycle**: it reviews past sessions
+offline, replays your recurring tasks on your own Codex budget, and consolidates
+what it learns into validated memory + skills behind a held-out gate. Same engine
+as the Claude Code plugin (`skillopt_sleep`), wrapped for Codex.
+
+> **Verified on Codex:** on the public
+> [gbrain-evals](https://github.com/garrytan/gbrain-evals) `skillopt-v1`
+> benchmark, a deliberately deficient skill goes **0.00 → 1.00** on a held-out
+> set with the Codex backend (incl. the tool-use seed via a real tool loop).
+> See [`../../docs/sleep/FINAL_REPORT.md`](../../docs/sleep/FINAL_REPORT.md).
+
+## What Codex supports (and what we use)
+
+Codex (`@openai/codex`) extends via **`AGENTS.md`** instructions, **skills** at
+`~/.agents/skills/<name>/SKILL.md`, and **custom prompts** at
+`~/.codex/prompts/<name>.md` (invoked as `/<name>`). This integration ships all
+three, plus a shared runner.
+
+## Install
+
+```bash
+git clone <repo-url> SkillOpt-Sleep
+cd SkillOpt-Sleep
+bash plugins/codex/install.sh          # installs the /sleep prompt + skill
+export SKILLOPT_SLEEP_REPO="$(pwd)"    # so the runner is found from anywhere
+```
+
+Requires Python ≥ 3.10 and the `codex` CLI on PATH.
+
+## Use
+
+```text
+/sleep status      # what's happened
+/sleep dry-run     # safe preview, stages nothing
+/sleep run         # full cycle, stages a reviewed proposal (no live edits)
+/sleep adopt       # apply the staged proposal (with backup)
+```
+
+Or call the engine directly:
+
+```bash
+python -m skillopt_sleep run --project "$(pwd)" --backend codex
+```
+
+Default backend is `mock` (no API spend). `--backend codex` uses your Codex
+budget for real improvement. All the controllable knobs (`--gate on|off`,
+`--rollouts-k`, `--budget-tokens`, `--preferences`, optimizer/target split) work
+identically — see [`../../docs/sleep/CONTROLLABLE_DREAMING.md`](../../docs/sleep/CONTROLLABLE_DREAMING.md).
+
+## Notes / status
+
+- Codex's `exec` runs shell, so the real-tool-loop replay (e.g. the
+  `tool_called: search` benchmark seed) works natively.
+- Codex's standalone *plugin-package manifest* format is not yet a stable public
+  spec; this integration uses the documented `AGENTS.md` + skills + prompts
+  mechanisms, which are stable. If/when a `codex plugin` package format ships,
+  we'll add a one-file manifest.
--- a/plugins/codex/install.sh
+++ b/plugins/codex/install.sh
@@ -0,0 +1,36 @@
+#!/usr/bin/env bash
+# Install the SkillOpt-Sleep Codex integration into the user's ~/.codex and
+# ~/.agents directories. Idempotent; prints what it does.
+set -euo pipefail
+
+REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
+CODEX_HOME="${CODEX_HOME:-$HOME/.codex}"
+AGENTS_SKILLS="${HOME}/.agents/skills"
+
+echo "[install] repo: $REPO_ROOT"
+
+# 1) custom /sleep prompt
+mkdir -p "$CODEX_HOME/prompts"
+cp "$REPO_ROOT/plugins/codex/prompts/sleep.md" "$CODEX_HOME/prompts/sleep.md"
+echo "[install] /sleep prompt   -> $CODEX_HOME/prompts/sleep.md"
+
+# 2) user-level skill
+mkdir -p "$AGENTS_SKILLS/skillopt-sleep"
+cp "$REPO_ROOT/plugins/codex/skills/skillopt-sleep/SKILL.md" "$AGENTS_SKILLS/skillopt-sleep/SKILL.md"
+echo "[install] skill           -> $AGENTS_SKILLS/skillopt-sleep/SKILL.md"
+
+# 3) record the repo location so the runner is found from anywhere
+echo "[install] add to your shell profile:"
+echo "    export SKILLOPT_SLEEP_REPO=\"$REPO_ROOT\""
+
+# 4) optional: append an AGENTS.md hint (only if the user opts in)
+cat <<EOF
+
+[install] Optional — add this to ~/.codex/AGENTS.md so Codex always knows the tool:
+
+  ## SkillOpt-Sleep
+  An offline self-improvement cycle is available. To run it:
+  \`bash "$REPO_ROOT/plugins/run-sleep.sh" status\`. Use \`/sleep\` for the guided flow.
+
+Done. Try:  /sleep status
+EOF
--- a/plugins/codex/prompts/sleep.md
+++ b/plugins/codex/prompts/sleep.md
@@ -0,0 +1,21 @@
+# /sleep — SkillOpt-Sleep for Codex
+#
+# Custom prompt: copy this file to ~/.codex/prompts/sleep.md and invoke with
+# `/sleep` in the Codex CLI. ($ARGUMENTS is the text after /sleep.)
+
+Run the SkillOpt-Sleep offline self-evolution cycle. Action: $ARGUMENTS
+(empty → "status").
+
+Use the bundled runner via shell:
+
+    bash "${SKILLOPT_SLEEP_REPO:?set SKILLOPT_SLEEP_REPO to the repo root}/plugins/run-sleep.sh" $ARGUMENTS --project "$(pwd)"
+
+Then:
+- For `run`/`dry-run`: read the staged `report.md` and show the held-out
+  baseline → candidate score and the proposed edits. `run` only stages a
+  proposal; nothing live changes until `adopt`.
+- For `adopt`: confirm which files were updated and that a backup was written.
+- Never edit the user's AGENTS.md / skills yourself; only `adopt` does that.
+
+Default backend is `mock` (no API spend). Add `--backend codex` for real
+improvement on the user's Codex budget.
--- a/plugins/codex/skills/skillopt-sleep/SKILL.md
+++ b/plugins/codex/skills/skillopt-sleep/SKILL.md
@@ -0,0 +1,49 @@
+---
+name: skillopt-sleep
+description: Nightly offline self-evolution for a Codex agent. Reviews past sessions, replays recurring tasks, and consolidates validated memory + skills behind a held-out gate. Use when the user wants Codex to learn from past usage, run a "sleep"/"dream" cycle, or schedule offline self-optimization.
+---
+
+# SkillOpt-Sleep (Codex skill)
+
+This skill drives the `skillopt_sleep` engine — an offline "sleep cycle" that
+makes a Codex agent better at the user's recurring work without retraining.
+
+## When to use
+
+Trigger when the user wants to: review past sessions, learn their preferences,
+consolidate feedback into long-term memory/skills, run a nightly/offline
+self-improvement cycle, or adopt a staged proposal.
+
+## How to run it
+
+Invoke the bundled runner via shell (Codex `exec` has shell access). The runner
+finds the engine and a Python ≥ 3.10 automatically:
+
+```bash
+# point at the repo if it isn't auto-detected from CWD:
+export SKILLOPT_SLEEP_REPO=/path/to/SkillOpt-Sleep
+bash "$SKILLOPT_SLEEP_REPO/plugins/run-sleep.sh" <action> --project "$(pwd)"
+```
+
+`<action>` ∈ `status | dry-run | run | adopt | harvest`. Use `--backend codex`
+for real improvement on the user's own Codex budget (default `mock` = no spend).
+
+## Steps
+
+1. Run the requested action; capture stdout.
+2. For `run`/`dry-run`: read the staged `report.md` it prints and show the user
+   the held-out baseline → candidate score and the exact proposed edits.
+3. `run` only **stages** a proposal under `<project>/.skillopt-sleep/staging/`;
+   nothing live changes until `adopt`. Offer `/sleep adopt`.
+4. Never hand-edit the user's `AGENTS.md` / skills yourself — only `adopt` does,
+   and it backs up first.
+
+## Validate
+
+```bash
+python -m skillopt_sleep.experiments.run_gbrain --backend codex \
+  --seeds brief-writer --data-root /path/to/gbrain-evals/eval/data/skillopt-v1 \
+  --nights 2 --limit-replay 3 --limit-holdout 3
+```
+A deficient skill goes 0.00 → 1.00 on a held-out set; the optimizer's edits are
+gated on real-task performance.
--- a/plugins/copilot/README.md
+++ b/plugins/copilot/README.md
@@ -0,0 +1,67 @@
+# SkillOpt-Sleep — GitHub Copilot integration
+
+Give **Copilot** (CLI or VS Code) a nightly **sleep cycle** via a tiny **MCP
+server** that exposes the `skillopt_sleep` engine as tools. MCP is GitHub's
+supported way to extend Copilot, so this works across Copilot CLI, VS Code, and
+other MCP clients with the same server.
+
+## What's here
+
+| File | Purpose |
+|---|---|
+| `mcp_server.py` | stdlib-only MCP (stdio) server exposing `sleep_*` tools |
+| `mcp-config.example.json` | drop-in MCP server config |
+| `copilot-instructions.snippet.md` | paste into `.github/copilot-instructions.md` |
+
+## Install
+
+Requires Python ≥ 3.10. No third-party packages — the server is pure stdlib.
+
+1. **Register the MCP server.** Add the server to your Copilot MCP config
+   (Copilot CLI: `~/.copilot/mcp-config.json`; VS Code: your MCP settings).
+   Use `mcp-config.example.json` as a template — set `SKILLOPT_SLEEP_REPO` to
+   this repo's path:
+
+   ```json
+   {
+     "mcpServers": {
+       "skillopt-sleep": {
+         "command": "python3",
+         "args": ["/abs/path/SkillOpt-Sleep/plugins/copilot/mcp_server.py"],
+         "env": { "SKILLOPT_SLEEP_REPO": "/abs/path/SkillOpt-Sleep" }
+       }
+     }
+   }
+   ```
+
+2. **(Optional) Tell Copilot about it.** Append
+   `copilot-instructions.snippet.md` to your repo's
+   `.github/copilot-instructions.md` so Copilot reaches for the tools when the
+   user asks to "run the sleep cycle".
+
+## Use
+
+Ask Copilot things like *"run the sleep cycle"*, *"what did the last sleep
+propose?"*, *"adopt the staged sleep proposal"*. Copilot calls the MCP tools:
+`sleep_status`, `sleep_dry_run`, `sleep_run`, `sleep_adopt`, `sleep_harvest`.
+
+Each tool takes optional `project`, `backend` (`mock`/`claude`/`codex`), and
+`scope` arguments. Default backend is `mock` (no API spend).
+
+## Verify the server directly (no Copilot needed)
+
+```bash
+printf '%s\n' \
+  '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{}}' \
+  '{"jsonrpc":"2.0","id":2,"method":"tools/list"}' \
+  | SKILLOPT_SLEEP_REPO="$(pwd)" python3 plugins/copilot/mcp_server.py
+```
+You should see the server info and the five `sleep_*` tools.
+
+## Notes / status
+
+- MCP is the stable, official Copilot extension surface, so this is the most
+  portable of the three integrations (one server → CLI + IDE).
+- The engine and all its controls (gate on/off, multi-rollout, budget,
+  preferences, optimizer/target split) are identical across platforms — see
+  [`../../docs/sleep/CONTROLLABLE_DREAMING.md`](../../docs/sleep/CONTROLLABLE_DREAMING.md).
--- a/plugins/copilot/copilot-instructions.snippet.md
+++ b/plugins/copilot/copilot-instructions.snippet.md
@@ -0,0 +1,25 @@
+<!--
+Copy this block into your repo's .github/copilot-instructions.md so Copilot
+knows the SkillOpt-Sleep tools exist. (Copilot reads copilot-instructions.md
+automatically as ambient guidance.)
+-->
+
+## SkillOpt-Sleep (offline self-evolution)
+
+This project has SkillOpt-Sleep available via an MCP server (`skillopt-sleep`).
+It gives the agent a nightly "sleep cycle": it reviews past sessions, replays
+recurring tasks offline, and consolidates validated memory + skills behind a
+held-out gate.
+
+When the user asks to "run the sleep cycle", "review my past sessions", "learn
+my preferences", or "make the agent improve from past usage", use the MCP tools:
+
+- `sleep_status` — what's happened + the latest staged proposal
+- `sleep_dry_run` — safe preview, stages nothing
+- `sleep_run` — full cycle, stages a reviewed proposal (nothing live changes)
+- `sleep_adopt` — apply the staged proposal (backs up first)
+- `sleep_harvest` — list mined recurring tasks
+
+Always show the user the held-out baseline → candidate score and the proposed
+edits before suggesting `sleep_adopt`. Never hand-edit the user's memory/skill
+files; only `sleep_adopt` does that, with a backup.
--- a/plugins/copilot/mcp-config.example.json
+++ b/plugins/copilot/mcp-config.example.json
@@ -0,0 +1,11 @@
+{
+  "mcpServers": {
+    "skillopt-sleep": {
+      "command": "python3",
+      "args": ["plugins/copilot/mcp_server.py"],
+      "env": {
+        "SKILLOPT_SLEEP_REPO": "${workspaceFolder}"
+      }
+    }
+  }
+}
--- a/plugins/copilot/mcp_server.py
+++ b/plugins/copilot/mcp_server.py
@@ -0,0 +1,128 @@
+#!/usr/bin/env python3
+"""SkillOpt-Sleep — minimal MCP server (stdio, stdlib-only).
+
+Exposes the sleep engine as MCP tools so any MCP-capable client (GitHub Copilot
+CLI / VS Code, Claude Desktop, etc.) can drive it. No third-party deps: speaks
+JSON-RPC 2.0 over stdio with just the handful of MCP methods clients need.
+
+Tools exposed:
+  - sleep_status   : how many nights have run + the latest staged proposal
+  - sleep_dry_run  : harvest+mine+replay, report only (no staging)
+  - sleep_run      : full cycle, stages a proposal (nothing live changes)
+  - sleep_adopt    : apply the latest staged proposal (with backup)
+  - sleep_harvest  : debug — list mined recurring tasks
+
+Each tool shells out to `python -m skillopt_sleep <action> ...` and returns its
+stdout. Configure your client to launch:  python plugins/copilot/mcp_server.py
+"""
+from __future__ import annotations
+
+import json
+import os
+import subprocess
+import sys
+
+REPO_ROOT = os.environ.get("SKILLOPT_SLEEP_REPO") or os.path.abspath(
+    os.path.join(os.path.dirname(__file__), "..", "..")
+)
+PROTOCOL_VERSION = "2024-11-05"
+
+TOOLS = [
+    {"name": "sleep_status", "action": "status",
+     "description": "Show how many SkillOpt-Sleep nights have run and the latest staged proposal."},
+    {"name": "sleep_dry_run", "action": "dry-run",
+     "description": "Preview a sleep cycle (harvest+mine+replay) without staging anything."},
+    {"name": "sleep_run", "action": "run",
+     "description": "Run a full sleep cycle; stages a reviewed proposal. Nothing live changes until adopt."},
+    {"name": "sleep_adopt", "action": "adopt",
+     "description": "Apply the latest staged proposal to CLAUDE.md/SKILL.md (backs up first)."},
+    {"name": "sleep_harvest", "action": "harvest",
+     "description": "Debug: list the recurring tasks mined from recent sessions."},
+]
+_BY_NAME = {t["name"]: t for t in TOOLS}
+
+_TOOL_SCHEMA = {
+    "type": "object",
+    "properties": {
+        "project": {"type": "string", "description": "Project dir to evolve (default: cwd)."},
+        "backend": {"type": "string", "enum": ["mock", "claude", "codex"],
+                     "description": "mock = no API spend (default); claude/codex = real."},
+        "scope": {"type": "string", "enum": ["invoked", "all"]},
+    },
+    "additionalProperties": False,
+}
+
+
+def _run_engine(action: str, args: dict) -> str:
+    py = sys.executable or "python3"
+    cmd = [py, "-m", "skillopt_sleep", action]
+    if args.get("project"):
+        cmd += ["--project", str(args["project"])]
+    if args.get("backend"):
+        cmd += ["--backend", str(args["backend"])]
+    if args.get("scope"):
+        cmd += ["--scope", str(args["scope"])]
+    try:
+        proc = subprocess.run(cmd, cwd=REPO_ROOT, capture_output=True, text=True, timeout=3600)
+    except Exception as e:  # noqa: BLE001
+        return f"[error] failed to run engine: {e}"
+    out = (proc.stdout or "").strip()
+    err = (proc.stderr or "").strip()
+    return out + (("\n[stderr]\n" + err) if err else "")
+
+
+def _result(id_, result):
+    return {"jsonrpc": "2.0", "id": id_, "result": result}
+
+
+def _error(id_, code, message):
+    return {"jsonrpc": "2.0", "id": id_, "error": {"code": code, "message": message}}
+
+
+def handle(req: dict):
+    method = req.get("method")
+    id_ = req.get("id")
+    if method == "initialize":
+        return _result(id_, {
+            "protocolVersion": PROTOCOL_VERSION,
+            "capabilities": {"tools": {}},
+            "serverInfo": {"name": "skillopt-sleep", "version": "0.1.0"},
+        })
+    if method in ("notifications/initialized", "initialized"):
+        return None  # notification, no response
+    if method == "tools/list":
+        return _result(id_, {"tools": [
+            {"name": t["name"], "description": t["description"], "inputSchema": _TOOL_SCHEMA}
+            for t in TOOLS
+        ]})
+    if method == "tools/call":
+        params = req.get("params") or {}
+        name = params.get("name")
+        tool = _BY_NAME.get(name)
+        if not tool:
+            return _error(id_, -32602, f"unknown tool: {name}")
+        text = _run_engine(tool["action"], params.get("arguments") or {})
+        return _result(id_, {"content": [{"type": "text", "text": text}]})
+    if method == "ping":
+        return _result(id_, {})
+    return _error(id_, -32601, f"method not found: {method}")
+
+
+def main() -> int:
+    for line in sys.stdin:
+        line = line.strip()
+        if not line:
+            continue
+        try:
+            req = json.loads(line)
+        except Exception:
+            continue
+        resp = handle(req)
+        if resp is not None:
+            sys.stdout.write(json.dumps(resp) + "\n")
+            sys.stdout.flush()
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
--- a/plugins/run-sleep.sh
+++ b/plugins/run-sleep.sh
@@ -0,0 +1,46 @@
+#!/usr/bin/env bash
+# SkillOpt-Sleep shared runner — used by all platform plugins (Claude Code,
+# Codex, Copilot). Resolves the repo root (which contains the skillopt_sleep
+# package), picks a Python >= 3.10, and execs the engine CLI.
+#
+# Usage: run-sleep.sh <run|dry-run|status|adopt|harvest|...> [args...]
+set -euo pipefail
+
+# This script lives at <repo>/plugins/run-sleep.sh, so the repo root (which
+# holds skillopt_sleep/) is one level up. CLAUDE_PLUGIN_ROOT (if set by Claude
+# Code) points at the plugin dir; the engine is then two levels above it.
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+if [ -d "$SCRIPT_DIR/../skillopt_sleep" ]; then
+  REPO_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
+elif [ -n "${CLAUDE_PLUGIN_ROOT:-}" ] && [ -d "$CLAUDE_PLUGIN_ROOT/../../skillopt_sleep" ]; then
+  REPO_ROOT="$(cd "$CLAUDE_PLUGIN_ROOT/../.." && pwd)"
+elif [ -n "${SKILLOPT_SLEEP_REPO:-}" ] && [ -d "$SKILLOPT_SLEEP_REPO/skillopt_sleep" ]; then
+  REPO_ROOT="$SKILLOPT_SLEEP_REPO"
+else
+  # last resort: search upward from CWD
+  d="$PWD"
+  while [ "$d" != "/" ]; do
+    [ -d "$d/skillopt_sleep" ] && { REPO_ROOT="$d"; break; }
+    d="$(dirname "$d")"
+  done
+fi
+if [ -z "${REPO_ROOT:-}" ]; then
+  echo "[sleep] ERROR: could not locate the skillopt_sleep package. Set SKILLOPT_SLEEP_REPO to the repo root." >&2
+  exit 1
+fi
+
+PY=""
+for cand in python3.12 python3.11 python3.10 python3; do
+  if command -v "$cand" >/dev/null 2>&1; then
+    ver="$("$cand" -c 'import sys; print("%d%d" % sys.version_info[:2])' 2>/dev/null || echo 0)"
+    if [ "${ver:-0}" -ge 310 ]; then PY="$cand"; break; fi
+  fi
+done
+if [ -z "$PY" ]; then
+  echo "[sleep] ERROR: need Python >= 3.10 (found none)." >&2
+  exit 1
+fi
+
+if [ "$#" -eq 0 ]; then set -- status; fi
+cd "$REPO_ROOT"
+exec "$PY" -m skillopt_sleep "$@"