feat(plugins): ship SkillOpt-Sleep for Claude Code, Codex, and Copilot

Restructure into plugins/{claude-code,codex,copilot}/ — one engine, three thin
shells, all calling the shared plugins/run-sleep.sh -> python -m skillopt_sleep.

  - claude-code/: existing plugin moved here; runner delegates to the shared
    launcher (fixes repo-root resolution after the move).
  - codex/: ~/.codex/prompts/sleep.md custom prompt + ~/.agents/skills SKILL.md +
    install.sh + AGENTS.md hint — Codex's documented, stable extension surfaces.
  - copilot/: a stdlib-only MCP server (mcp_server.py) exposing sleep_* tools,
    plus mcp-config.example.json and a copilot-instructions snippet. Verified end
    to end (initialize -> tools/list -> tools/call returns real engine output).
  - plugins/README.md overview table; main README News + a dedicated SkillOpt-Sleep
    section; pyproject lists skillopt_sleep as a first-class package.

Decoupling emphasized throughout: open-source tool (skillopt_sleep/) with zero
dependency on the research package. 29 tests pass; all three shells resolve.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
This commit is contained in:
Yifan Yang
2026-06-08 14:31:52 +00:00
parent b02ffc2c99
commit f9db99853b
22 changed files with 576 additions and 31 deletions

74
plugins/README.md Normal file
View File

@@ -0,0 +1,74 @@
# SkillOpt-Sleep — plugins for Claude Code, Codex, and Copilot
One engine, three thin shells. **SkillOpt-Sleep** gives a local coding agent a
nightly **sleep cycle**: it reviews your past sessions offline, replays your
recurring tasks on your own API budget, and consolidates what it learns into
**validated** long-term memory and skills — behind a held-out gate, staged for
your review. Your agent gets better the more you use it, with no model-weight
training.
It synthesizes three ideas: **SkillOpt** (validation-gated bounded text
optimization — the research in this repo), **Claude Dreams** (offline memory
consolidation; input never mutated; review-then-adopt), and the **agent sleep**
literature (short-term experience → long-term competence).
> **This is an open-source tool, decoupled from the research code.** The engine
> lives in the top-level [`skillopt_sleep/`](../skillopt_sleep) package and has
> **zero dependency** on the paper's `skillopt/` experiment package (the
> validation gate is vendored). You can ship/use it without the research stack.
## The three integrations
| Platform | Folder | Mechanism | Status |
|---|---|---|---|
| **Claude Code** | [`claude-code/`](claude-code) | `.claude-plugin` + `/sleep` command + skill + hooks | full, installable |
| **Codex** | [`codex/`](codex) | `~/.codex/prompts/sleep.md` + `~/.agents/skills` + `AGENTS.md` | full |
| **Copilot** | [`copilot/`](copilot) | MCP server (`sleep_*` tools) + `copilot-instructions` | full (MCP) |
All three call the **same** [`plugins/run-sleep.sh`](run-sleep.sh) → `python -m
skillopt_sleep`, so behaviour is identical everywhere. Per-platform setup is in
each folder's README.
## Quick start (Claude Code)
```bash
git clone <repo-url> && cd SkillOpt-Sleep
# Claude Code:
/plugin marketplace add ./plugins/claude-code
/plugin install skillopt-sleep@skillopt-sleep
/sleep status
```
Codex: `bash plugins/codex/install.sh`.
Copilot: register `plugins/copilot/mcp_server.py` as an MCP server.
## What one "night" does
```
harvest ~/.claude (or session) transcripts → mine recurring tasks → replay offline
→ consolidate (reflect → bounded edit → GATE on real held-out tasks)
→ stage proposal → (you) adopt
```
Nothing live changes until you adopt; every adopt backs up first.
## Controls (work on all platforms)
`--gate on|off` · `--rollouts-k K` (multi-rollout contrastive reflection) ·
`--budget-tokens/--budget-minutes` · `--preferences "..."` · separate
optimizer/target models (`--optimizer-model` / `--target-model`) · slow-update
long-term memory. Full guide:
[`../docs/sleep/CONTROLLABLE_DREAMING.md`](../docs/sleep/CONTROLLABLE_DREAMING.md).
## Does it actually work?
Validated on the public
[gbrain-evals](https://github.com/garrytan/gbrain-evals) `skillopt-v1` benchmark
with **real models on both Claude and Codex**: deficient skills go **0.00 →
1.00** on held-out sets (all 4 seeds incl. a real tool-use loop), cross-model
transfer is positive, and the gate blocks regressions. Full results:
[`../docs/sleep/FINAL_REPORT.md`](../docs/sleep/FINAL_REPORT.md).
Deterministic proof (no API key):
```bash
python -m skillopt_sleep.experiments.run_experiment --persona researcher --assert-improves
```

View File

@@ -0,0 +1,26 @@
{
"$schema": "https://anthropic.com/claude-code/marketplace.schema.json",
"name": "skillopt-sleep",
"description": "SkillOpt-Sleep: give your local Claude agent a nightly sleep cycle that reviews past sessions and consolidates validated memory + skills.",
"owner": {
"name": "Yifan Yang",
"email": "yifanyang@microsoft.com"
},
"plugins": [
{
"name": "skillopt-sleep",
"description": "Nightly offline self-evolution: harvest your past Claude Code sessions, replay recurring tasks on your own API budget, and consolidate what the agent learns into validated CLAUDE.md memory and SKILL.md skills — behind a held-out gate, staged for your review.越用越好用. Synthesizes SkillOpt (validation-gated skill optimization), Claude Dreams (offline memory consolidation), and agent sleep/consolidation.",
"author": {
"name": "Yifan Yang"
},
"category": "productivity",
"source": {
"source": "git-subdir",
"url": "https://github.com/microsoft/SkillOpt.git",
"path": "skillopt-sleep-plugin",
"ref": "main"
},
"homepage": "https://github.com/microsoft/SkillOpt"
}
]
}

View File

@@ -0,0 +1,22 @@
{
"name": "skillopt-sleep",
"description": "Give your local Claude agent a nightly 'sleep cycle': it reviews your past sessions offline, replays recurring tasks on your own API budget, and consolidates what it learns into validated memory (CLAUDE.md) and skills (SKILL.md).越用越好用 — gets better the more you use it. Synthesizes SkillOpt (validation-gated skill optimization), Claude Dreams (offline memory consolidation), and agent sleep/consolidation.",
"version": "0.1.0",
"author": {
"name": "Yifan Yang",
"email": "yifanyang@microsoft.com"
},
"homepage": "https://github.com/microsoft/SkillOpt",
"repository": "https://github.com/microsoft/SkillOpt",
"license": "MIT",
"keywords": [
"skillopt",
"self-improvement",
"memory-consolidation",
"dreams",
"sleep",
"skills",
"continual-learning",
"offline-optimization"
]
}

View File

@@ -0,0 +1,138 @@
# SkillOpt-Sleep (Claude Code plugin)
> Give your local Claude agent a **sleep cycle**. Every night it reviews your
> past sessions offline, replays your recurring tasks on your own API budget,
> and consolidates what it learns into **validated** memory (`CLAUDE.md`) and
> skills (`SKILL.md`). Your agent gets better the more you use it — no
> model-weight training.
SkillOpt-Sleep is the **deployment-time** companion to
[SkillOpt](https://github.com/microsoft/SkillOpt). SkillOpt trains a skill
offline on a benchmark; SkillOpt-Sleep applies the same discipline to *your own
daily usage*: bounded text edits, accepted only through a held-out validation
gate, with rejected edits kept as negative feedback.
It synthesizes three ideas:
| Idea | Contribution |
|---|---|
| **SkillOpt** | skill/memory = trainable text; bounded add/delete/replace edits; **held-out gate** keeps only changes that help. |
| **Claude Dreams** | offline consolidation over past sessions; input never mutated; output **reviewed then adopted**. |
| **Agent sleep** | periodic offline replay turns short-term episodes into long-term skill. |
## What it does (one "night")
```
harvest ~/.claude transcripts → mine recurring tasks → replay offline
→ consolidate (reflect → bounded edit → GATE) → stage proposal → (you) adopt
```
Nothing live is modified until **you** run `/sleep adopt` (the Dreams "review,
then adopt or discard" contract). Every adopt backs up the prior file first.
## Install
**Requirements:** Python ≥ 3.10, and the `claude` CLI (and/or `codex` CLI) on PATH.
```bash
# 1) get the code (the plugin ships inside the SkillOpt repo)
git clone https://github.com/microsoft/SkillOpt.git
cd SkillOpt
# 2) add the plugin to Claude Code as a local marketplace
/plugin marketplace add ./skillopt-sleep-plugin
/plugin install skillopt-sleep@skillopt-sleep
# 3) verify
/sleep status
```
The plugin's bundled runner (`scripts/sleep.sh`) auto-selects a Python ≥ 3.10
interpreter and calls the `skillopt_sleep` engine in the repo. No `pip install`
is required for the default `mock` backend or for `claude`/`codex` backends —
they shell out to the CLIs you already have.
## Quick start
```bash
# from inside any project you use with Claude Code:
/sleep dry-run # safe preview: what it would learn, no changes staged
/sleep run # full cycle: stages a reviewed proposal (still no live edits)
/sleep status # see history + the latest staged proposal
/sleep adopt # apply the staged proposal to CLAUDE.md / SKILL.md (with backup)
```
Or call the engine directly (Python ≥ 3.10):
```bash
python -m skillopt_sleep run --project "$(pwd)" --scope invoked --backend mock
python -m skillopt_sleep run --project "$(pwd)" --backend claude # real lift via Claude
python -m skillopt_sleep run --project "$(pwd)" --backend codex # real lift via Codex
```
Default backend is **`mock`** — deterministic, no API spend — so you can try the
plumbing for free. Switch to `--backend claude` or `--backend codex` for genuine
improvement on your own budget.
## Does it actually improve? (real models, public benchmark)
SkillOpt-Sleep is validated against [gbrain-evals](https://github.com/garrytan/gbrain-evals)'
public `skillopt-v1` suite — the same benchmark gbrain scores its own skill
optimizer against. We take a deliberately **deficient** skill and run one sleep
night; held-out scoring is done by a local rule judge (no judge-API, no way to
grade its own homework).
| Backend | Seed | Held-out before → after | Nights |
|---|---|---|---|
| **Claude (Haiku 4.5)** | brief-writer | **0.00 → 1.00** | 1 |
| **Codex** | brief-writer | **0.00 → 1.00** | 2 |
Both took a brief-writer with no risks section / no confidence level and, within
12 nights, proposed gated edits that lifted the held-out score to perfect —
into the protected `LEARNED` block, nothing else touched. The Codex 2-night
trace even shows the optimizer **diagnosing its own residual failure** and
adding a meta-rule to fix it. Full writeup + reproduction:
[`docs/sleep/real_api_results.md`](../docs/sleep/real_api_results.md).
Reproduce:
```bash
git clone https://github.com/garrytan/gbrain-evals /tmp/gbrain-evals
python -m skillopt_sleep.experiments.run_gbrain --backend claude --model haiku \
--seeds brief-writer --data-root /tmp/gbrain-evals/eval/data/skillopt-v1 \
--nights 1 --limit-replay 3 --limit-holdout 3
python -m skillopt_sleep.experiments.run_gbrain --backend codex \
--seeds brief-writer --data-root /tmp/gbrain-evals/eval/data/skillopt-v1 \
--nights 1 --limit-replay 3 --limit-holdout 3
```
## Deterministic proof (no API, no keys)
```bash
python -m skillopt_sleep.experiments.run_experiment --persona researcher --assert-improves
python -m skillopt_sleep.experiments.run_experiment --persona programmer --assert-improves
```
Each prints the held-out score rising from baseline toward 1.0 as the gate
accepts the general rules your tasks need, and confirms the gate **rejects** an
injected harmful edit. Recorded output: [`docs/sleep/experiment_results.md`](../docs/sleep/experiment_results.md).
## Schedule it nightly
```bash
"${CLAUDE_PLUGIN_ROOT}/scripts/install-cron.sh" "$(pwd)" # prints a crontab line; installs nothing
```
## Safety
- **Read-only** harvest of `~/.claude`. `mock` replay has no side effects.
- Proposals are **staged**, never auto-applied (unless you opt in with `--auto-adopt`).
- Every adopt writes a backup under the staging dir's `backup/`.
- Per-night **token/task budget caps**; secrets redacted from prompts.
- `fresh` replay (Phase 3) runs only in throwaway git worktrees.
## Status
Phase 1 (engine + deterministic experiment + plugin surface) is complete.
Phase 3 adds the real-API miner/judge and `fresh` worktree replay. See
[`docs/superpowers/specs/2026-06-07-skillopt-sleep-claude-code-plugin-design.md`](../docs/superpowers/specs/2026-06-07-skillopt-sleep-claude-code-plugin-design.md).

View File

@@ -0,0 +1,63 @@
---
description: Run or manage the SkillOpt-Sleep self-evolution cycle (review past sessions, replay tasks offline, consolidate validated memory + skills)
argument-hint: "[run | dry-run | status | adopt | harvest] (default: status)"
allowed-tools: Bash, Read
---
# /sleep — SkillOpt-Sleep nightly self-evolution
You are driving **SkillOpt-Sleep**: a tool that lets this user's Claude agent
improve offline by reviewing past sessions, replaying recurring tasks, and
consolidating what it learns into **validated** memory (`CLAUDE.md`) and skills
(`SKILL.md`). It is gated like SkillOpt: a change is kept only if it improves a
held-out replay score, and nothing live is modified until the user adopts it.
## Requested action: $ARGUMENTS
(If `$ARGUMENTS` is empty, treat it as `status`.)
## How to run it
The engine is the `skillopt_sleep` Python package in this repo. Use the
**plugin's bundled runner** so the right interpreter and repo are on the path:
```bash
"${CLAUDE_PLUGIN_ROOT}/scripts/sleep.sh" <action> --project "$(pwd)" --scope invoked
```
`<action>` is one of:
| action | what it does |
|-----------|--------------|
| `status` | show how many nights have run + the latest staged proposal (READ-ONLY) |
| `dry-run` | harvest → mine → replay → report, but **stage nothing** (safe preview) |
| `run` | full cycle: also **stage** a reviewed proposal (still does NOT touch live files) |
| `adopt` | apply the latest staged proposal to live `CLAUDE.md` / `SKILL.md` (backs up first) |
| `harvest` | debug: print the recurring tasks mined from recent sessions |
Default backend is `mock` (deterministic, no API spend). To use real Anthropic
budget for genuine improvement, add `--backend anthropic`.
## Steps to follow
1. **Run the requested action** via the bundled runner above. Capture stdout.
2. **For `run` / `dry-run`:** after it completes, `Read` the generated
`report.md` in the staging dir it prints, and show the user:
- held-out score: baseline → candidate (the proof it helped)
- the gate decision (accept/reject) and the exact edits it proposes
- where the proposal is staged
3. **For `run` that produced an accepted proposal:** tell the user the diff is
staged and that **nothing live changed yet**. Offer to run `/sleep adopt`.
4. **For `adopt`:** confirm which live files were updated and that backups were
written under the staging dir's `backup/`.
5. **Never** edit `CLAUDE.md` or `SKILL.md` yourself — only the `adopt` action
does that, with a backup. Respect the review gate.
## Safety reminders
- Harvest is **read-only** over `~/.claude`. Replay in `mock` mode runs no
shell side effects.
- The cycle stages proposals; the user is in control of adoption.
- If the user asks to schedule this nightly, point them at
`${CLAUDE_PLUGIN_ROOT}/scripts/install-cron.sh` (prints a crontab line; does
not install anything without confirmation).

View File

@@ -0,0 +1,16 @@
{
"hooks": {
"SessionEnd": [
{
"matcher": "*",
"hooks": [
{
"type": "command",
"command": "\"${CLAUDE_PLUGIN_ROOT}/hooks/on-session-end.sh\"",
"async": true
}
]
}
]
}
}

View File

@@ -0,0 +1,18 @@
#!/usr/bin/env bash
# SkillOpt-Sleep SessionEnd hook (async, best-effort, NON-BLOCKING).
#
# This does NOT run the optimizer. It only appends a tiny marker so the next
# nightly cycle knows there is fresh activity to harvest, and (optionally)
# nudges the user once that a sleep cycle is available. It must never fail the
# session or spend API budget.
set -uo pipefail
PLUGIN_ROOT="${CLAUDE_PLUGIN_ROOT:-$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)}"
STATE_DIR="${HOME}/.skillopt-sleep"
mkdir -p "$STATE_DIR" 2>/dev/null || exit 0
# Record that a session just ended (cheap; used for "is there new data?").
printf '%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "${PWD}" \
>> "$STATE_DIR/session-end.log" 2>/dev/null || true
exit 0

View File

@@ -0,0 +1,29 @@
#!/usr/bin/env bash
# Print (does NOT install) a crontab line that runs SkillOpt-Sleep nightly.
# The user copies the line into `crontab -e` if they want it.
set -euo pipefail
PLUGIN_ROOT="${CLAUDE_PLUGIN_ROOT:-$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)}"
RUNNER="$PLUGIN_ROOT/scripts/sleep.sh"
PROJECT="${1:-$(pwd)}"
BACKEND="${2:-mock}"
# 3:17am local — deliberately off the :00 mark so many users don't all hit the
# API at once (and we leave room for jitter).
MIN=17
HOUR=3
cat <<EOF
# ── SkillOpt-Sleep nightly cycle ────────────────────────────────────────────
# Review past sessions, replay tasks, stage validated memory/skill updates.
# Runs at ${HOUR}:$(printf '%02d' $MIN) local every day. Output goes to the project's
# .skillopt-sleep/ dir; nothing live is changed until you run '/sleep adopt'
# (unless you pass --auto-adopt below).
#
# Copy the next line into 'crontab -e':
${MIN} ${HOUR} * * * "${RUNNER}" run --project "${PROJECT}" --scope invoked --backend ${BACKEND} >> "${PROJECT}/.skillopt-sleep/cron.log" 2>&1
#
# For fully-autonomous adoption (power users), append: --auto-adopt
# To spend real API budget for genuine lift, set BACKEND=anthropic above.
# ────────────────────────────────────────────────────────────────────────────
EOF

View File

@@ -0,0 +1,11 @@
#!/usr/bin/env bash
# Claude Code plugin runner — thin wrapper over the shared runner so all three
# platform plugins share one engine launcher. The shared runner lives at
# <repo>/plugins/run-sleep.sh and handles repo-root + interpreter resolution.
set -euo pipefail
HERE="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" # <repo>/plugins/claude-code/scripts
SHARED="$(cd "$HERE/../.." && pwd)/run-sleep.sh" # <repo>/plugins/run-sleep.sh
if [ ! -f "$SHARED" ] && [ -n "${CLAUDE_PLUGIN_ROOT:-}" ]; then
SHARED="$(cd "$CLAUDE_PLUGIN_ROOT/.." && pwd)/run-sleep.sh"
fi
exec bash "$SHARED" "$@"

View File

@@ -0,0 +1,79 @@
---
name: skillopt-sleep
description: "Use when the user wants their Claude agent to self-improve from past usage, asks about a nightly/offline 'sleep' or 'dream' cycle, memory/skill consolidation, or says things like '让 agent 越用越好用', 'review my past sessions', 'learn my preferences', 'consolidate what you learned', 'run the sleep cycle', or wants to schedule offline self-optimization. Drives the skillopt_sleep engine: harvest past sessions → mine recurring tasks → replay offline → consolidate validated CLAUDE.md/SKILL.md behind a held-out gate."
---
# SkillOpt-Sleep: offline self-evolution for a local Claude agent
SkillOpt-Sleep gives the user's agent a **sleep cycle**. While the user is
offline (e.g. nightly), it reviews their real past Claude Code sessions,
re-runs recurring tasks on their own API budget, and consolidates what it
learns into **memory** (`CLAUDE.md`) and **skills** (`SKILL.md`) — but only
keeps changes that pass a held-out validation gate, and only after the user
adopts them. The agent gets measurably better at *this* user's recurring work,
with no model-weight training. It is the deployment-time analogue of training:
short-term experience → long-term competence.
It synthesizes three ideas:
- **SkillOpt** — the skill/memory doc is trainable text; bounded add/delete/replace
edits; accepted only through a held-out gate; rejected edits become negative feedback.
- **Claude Dreams** — offline consolidation that reads past sessions and rebuilds
memory (dedup/merge/resolve); the input is never mutated; output is reviewed then adopted.
- **Agent sleep** — periodic offline replay turns episodes into durable skill.
## When to use this skill
Trigger when the user wants any of:
- "make my agent learn from how I use it" / "越用越好用" / "remember my preferences across sessions"
- a nightly/scheduled or on-demand **offline self-improvement / dream / sleep** run
- to **review past sessions/trajectories** and distill recurring tasks
- to **consolidate** feedback into `CLAUDE.md` or a managed skill
- to **schedule** the cycle (cron) or **adopt** a staged proposal
## The cycle (six stages)
1. **Harvest** — read `~/.claude/projects/*/<session>.jsonl` + `~/.claude/history.jsonl` (READ-ONLY) → session digests.
2. **Mine** — digests → `TaskRecord`s (recurring intents + outcome labels + checkable refs where possible).
3. **Replay** — re-run tasks offline under the *current* skill+memory → (hard, soft) scores.
4. **Consolidate** — reflect on failures → propose bounded edits → **gate** on a held-out slice; accept only if it strictly improves.
5. **Stage** — write `proposed_CLAUDE.md`, `proposed_SKILL.md`, a diff, and `report.md` into `<project>/.skillopt-sleep/staging/<date>/`. **Nothing live changes.**
6. **Adopt** — explicit (or opt-in auto): copy staged files over live ones, backing up first.
## How to drive it
Prefer the `/sleep` command. Under the hood it calls the bundled runner:
```bash
"${CLAUDE_PLUGIN_ROOT}/scripts/sleep.sh" status # what's happened
"${CLAUDE_PLUGIN_ROOT}/scripts/sleep.sh" dry-run --project "$(pwd)" # safe preview
"${CLAUDE_PLUGIN_ROOT}/scripts/sleep.sh" run --project "$(pwd)" # full cycle, stages a proposal
"${CLAUDE_PLUGIN_ROOT}/scripts/sleep.sh" adopt --project "$(pwd)" # apply staged proposal (with backup)
```
- Default backend is `mock` (deterministic, **no API spend**) — good for trying the plumbing.
- Add `--backend claude` or `--backend codex` to spend the user's real budget for genuine improvement.
- Scope defaults to the invoked project; `--scope all` harvests every project.
## Hard rules
- **Never** hand-edit the user's `CLAUDE.md` / `SKILL.md` as part of this skill.
Only the `adopt` action changes live files, and it backs them up first.
- Harvest is read-only. `mock` replay has no side effects.
- Always show the user the **held-out baseline → candidate** score and the
exact proposed edits before suggesting adoption. Evidence before adoption.
- If asked whether it really helps, run
`python -m skillopt_sleep.experiments.run_experiment --persona researcher --json`
— a deterministic demo that proves held-out lift and that the gate blocks
harmful edits.
## Validate / demo
```bash
# deterministic proof (no API): held-out score rises, gate blocks regressions
python -m skillopt_sleep.experiments.run_experiment --persona researcher --assert-improves
python -m skillopt_sleep.experiments.run_experiment --persona programmer --assert-improves
```
See `docs/sleep/experiment_results.md` for recorded output and
`docs/superpowers/specs/2026-06-07-skillopt-sleep-claude-code-plugin-design.md`
for the full design.

59
plugins/codex/README.md Normal file
View File

@@ -0,0 +1,59 @@
# SkillOpt-Sleep — Codex integration
Give your **Codex** agent a nightly **sleep cycle**: it reviews past sessions
offline, replays your recurring tasks on your own Codex budget, and consolidates
what it learns into validated memory + skills behind a held-out gate. Same engine
as the Claude Code plugin (`skillopt_sleep`), wrapped for Codex.
> **Verified on Codex:** on the public
> [gbrain-evals](https://github.com/garrytan/gbrain-evals) `skillopt-v1`
> benchmark, a deliberately deficient skill goes **0.00 → 1.00** on a held-out
> set with the Codex backend (incl. the tool-use seed via a real tool loop).
> See [`../../docs/sleep/FINAL_REPORT.md`](../../docs/sleep/FINAL_REPORT.md).
## What Codex supports (and what we use)
Codex (`@openai/codex`) extends via **`AGENTS.md`** instructions, **skills** at
`~/.agents/skills/<name>/SKILL.md`, and **custom prompts** at
`~/.codex/prompts/<name>.md` (invoked as `/<name>`). This integration ships all
three, plus a shared runner.
## Install
```bash
git clone <repo-url> SkillOpt-Sleep
cd SkillOpt-Sleep
bash plugins/codex/install.sh # installs the /sleep prompt + skill
export SKILLOPT_SLEEP_REPO="$(pwd)" # so the runner is found from anywhere
```
Requires Python ≥ 3.10 and the `codex` CLI on PATH.
## Use
```text
/sleep status # what's happened
/sleep dry-run # safe preview, stages nothing
/sleep run # full cycle, stages a reviewed proposal (no live edits)
/sleep adopt # apply the staged proposal (with backup)
```
Or call the engine directly:
```bash
python -m skillopt_sleep run --project "$(pwd)" --backend codex
```
Default backend is `mock` (no API spend). `--backend codex` uses your Codex
budget for real improvement. All the controllable knobs (`--gate on|off`,
`--rollouts-k`, `--budget-tokens`, `--preferences`, optimizer/target split) work
identically — see [`../../docs/sleep/CONTROLLABLE_DREAMING.md`](../../docs/sleep/CONTROLLABLE_DREAMING.md).
## Notes / status
- Codex's `exec` runs shell, so the real-tool-loop replay (e.g. the
`tool_called: search` benchmark seed) works natively.
- Codex's standalone *plugin-package manifest* format is not yet a stable public
spec; this integration uses the documented `AGENTS.md` + skills + prompts
mechanisms, which are stable. If/when a `codex plugin` package format ships,
we'll add a one-file manifest.

36
plugins/codex/install.sh Executable file
View File

@@ -0,0 +1,36 @@
#!/usr/bin/env bash
# Install the SkillOpt-Sleep Codex integration into the user's ~/.codex and
# ~/.agents directories. Idempotent; prints what it does.
set -euo pipefail
REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
CODEX_HOME="${CODEX_HOME:-$HOME/.codex}"
AGENTS_SKILLS="${HOME}/.agents/skills"
echo "[install] repo: $REPO_ROOT"
# 1) custom /sleep prompt
mkdir -p "$CODEX_HOME/prompts"
cp "$REPO_ROOT/plugins/codex/prompts/sleep.md" "$CODEX_HOME/prompts/sleep.md"
echo "[install] /sleep prompt -> $CODEX_HOME/prompts/sleep.md"
# 2) user-level skill
mkdir -p "$AGENTS_SKILLS/skillopt-sleep"
cp "$REPO_ROOT/plugins/codex/skills/skillopt-sleep/SKILL.md" "$AGENTS_SKILLS/skillopt-sleep/SKILL.md"
echo "[install] skill -> $AGENTS_SKILLS/skillopt-sleep/SKILL.md"
# 3) record the repo location so the runner is found from anywhere
echo "[install] add to your shell profile:"
echo " export SKILLOPT_SLEEP_REPO=\"$REPO_ROOT\""
# 4) optional: append an AGENTS.md hint (only if the user opts in)
cat <<EOF
[install] Optional — add this to ~/.codex/AGENTS.md so Codex always knows the tool:
## SkillOpt-Sleep
An offline self-improvement cycle is available. To run it:
\`bash "$REPO_ROOT/plugins/run-sleep.sh" status\`. Use \`/sleep\` for the guided flow.
Done. Try: /sleep status
EOF

View File

@@ -0,0 +1,21 @@
# /sleep — SkillOpt-Sleep for Codex
#
# Custom prompt: copy this file to ~/.codex/prompts/sleep.md and invoke with
# `/sleep` in the Codex CLI. ($ARGUMENTS is the text after /sleep.)
Run the SkillOpt-Sleep offline self-evolution cycle. Action: $ARGUMENTS
(empty → "status").
Use the bundled runner via shell:
bash "${SKILLOPT_SLEEP_REPO:?set SKILLOPT_SLEEP_REPO to the repo root}/plugins/run-sleep.sh" $ARGUMENTS --project "$(pwd)"
Then:
- For `run`/`dry-run`: read the staged `report.md` and show the held-out
baseline → candidate score and the proposed edits. `run` only stages a
proposal; nothing live changes until `adopt`.
- For `adopt`: confirm which files were updated and that a backup was written.
- Never edit the user's AGENTS.md / skills yourself; only `adopt` does that.
Default backend is `mock` (no API spend). Add `--backend codex` for real
improvement on the user's Codex budget.

View File

@@ -0,0 +1,49 @@
---
name: skillopt-sleep
description: Nightly offline self-evolution for a Codex agent. Reviews past sessions, replays recurring tasks, and consolidates validated memory + skills behind a held-out gate. Use when the user wants Codex to learn from past usage, run a "sleep"/"dream" cycle, or schedule offline self-optimization.
---
# SkillOpt-Sleep (Codex skill)
This skill drives the `skillopt_sleep` engine — an offline "sleep cycle" that
makes a Codex agent better at the user's recurring work without retraining.
## When to use
Trigger when the user wants to: review past sessions, learn their preferences,
consolidate feedback into long-term memory/skills, run a nightly/offline
self-improvement cycle, or adopt a staged proposal.
## How to run it
Invoke the bundled runner via shell (Codex `exec` has shell access). The runner
finds the engine and a Python ≥ 3.10 automatically:
```bash
# point at the repo if it isn't auto-detected from CWD:
export SKILLOPT_SLEEP_REPO=/path/to/SkillOpt-Sleep
bash "$SKILLOPT_SLEEP_REPO/plugins/run-sleep.sh" <action> --project "$(pwd)"
```
`<action>``status | dry-run | run | adopt | harvest`. Use `--backend codex`
for real improvement on the user's own Codex budget (default `mock` = no spend).
## Steps
1. Run the requested action; capture stdout.
2. For `run`/`dry-run`: read the staged `report.md` it prints and show the user
the held-out baseline → candidate score and the exact proposed edits.
3. `run` only **stages** a proposal under `<project>/.skillopt-sleep/staging/`;
nothing live changes until `adopt`. Offer `/sleep adopt`.
4. Never hand-edit the user's `AGENTS.md` / skills yourself — only `adopt` does,
and it backs up first.
## Validate
```bash
python -m skillopt_sleep.experiments.run_gbrain --backend codex \
--seeds brief-writer --data-root /path/to/gbrain-evals/eval/data/skillopt-v1 \
--nights 2 --limit-replay 3 --limit-holdout 3
```
A deficient skill goes 0.00 → 1.00 on a held-out set; the optimizer's edits are
gated on real-task performance.

67
plugins/copilot/README.md Normal file
View File

@@ -0,0 +1,67 @@
# SkillOpt-Sleep — GitHub Copilot integration
Give **Copilot** (CLI or VS Code) a nightly **sleep cycle** via a tiny **MCP
server** that exposes the `skillopt_sleep` engine as tools. MCP is GitHub's
supported way to extend Copilot, so this works across Copilot CLI, VS Code, and
other MCP clients with the same server.
## What's here
| File | Purpose |
|---|---|
| `mcp_server.py` | stdlib-only MCP (stdio) server exposing `sleep_*` tools |
| `mcp-config.example.json` | drop-in MCP server config |
| `copilot-instructions.snippet.md` | paste into `.github/copilot-instructions.md` |
## Install
Requires Python ≥ 3.10. No third-party packages — the server is pure stdlib.
1. **Register the MCP server.** Add the server to your Copilot MCP config
(Copilot CLI: `~/.copilot/mcp-config.json`; VS Code: your MCP settings).
Use `mcp-config.example.json` as a template — set `SKILLOPT_SLEEP_REPO` to
this repo's path:
```json
{
"mcpServers": {
"skillopt-sleep": {
"command": "python3",
"args": ["/abs/path/SkillOpt-Sleep/plugins/copilot/mcp_server.py"],
"env": { "SKILLOPT_SLEEP_REPO": "/abs/path/SkillOpt-Sleep" }
}
}
}
```
2. **(Optional) Tell Copilot about it.** Append
`copilot-instructions.snippet.md` to your repo's
`.github/copilot-instructions.md` so Copilot reaches for the tools when the
user asks to "run the sleep cycle".
## Use
Ask Copilot things like *"run the sleep cycle"*, *"what did the last sleep
propose?"*, *"adopt the staged sleep proposal"*. Copilot calls the MCP tools:
`sleep_status`, `sleep_dry_run`, `sleep_run`, `sleep_adopt`, `sleep_harvest`.
Each tool takes optional `project`, `backend` (`mock`/`claude`/`codex`), and
`scope` arguments. Default backend is `mock` (no API spend).
## Verify the server directly (no Copilot needed)
```bash
printf '%s\n' \
'{"jsonrpc":"2.0","id":1,"method":"initialize","params":{}}' \
'{"jsonrpc":"2.0","id":2,"method":"tools/list"}' \
| SKILLOPT_SLEEP_REPO="$(pwd)" python3 plugins/copilot/mcp_server.py
```
You should see the server info and the five `sleep_*` tools.
## Notes / status
- MCP is the stable, official Copilot extension surface, so this is the most
portable of the three integrations (one server → CLI + IDE).
- The engine and all its controls (gate on/off, multi-rollout, budget,
preferences, optimizer/target split) are identical across platforms — see
[`../../docs/sleep/CONTROLLABLE_DREAMING.md`](../../docs/sleep/CONTROLLABLE_DREAMING.md).

View File

@@ -0,0 +1,25 @@
<!--
Copy this block into your repo's .github/copilot-instructions.md so Copilot
knows the SkillOpt-Sleep tools exist. (Copilot reads copilot-instructions.md
automatically as ambient guidance.)
-->
## SkillOpt-Sleep (offline self-evolution)
This project has SkillOpt-Sleep available via an MCP server (`skillopt-sleep`).
It gives the agent a nightly "sleep cycle": it reviews past sessions, replays
recurring tasks offline, and consolidates validated memory + skills behind a
held-out gate.
When the user asks to "run the sleep cycle", "review my past sessions", "learn
my preferences", or "make the agent improve from past usage", use the MCP tools:
- `sleep_status` — what's happened + the latest staged proposal
- `sleep_dry_run` — safe preview, stages nothing
- `sleep_run` — full cycle, stages a reviewed proposal (nothing live changes)
- `sleep_adopt` — apply the staged proposal (backs up first)
- `sleep_harvest` — list mined recurring tasks
Always show the user the held-out baseline → candidate score and the proposed
edits before suggesting `sleep_adopt`. Never hand-edit the user's memory/skill
files; only `sleep_adopt` does that, with a backup.

View File

@@ -0,0 +1,11 @@
{
"mcpServers": {
"skillopt-sleep": {
"command": "python3",
"args": ["plugins/copilot/mcp_server.py"],
"env": {
"SKILLOPT_SLEEP_REPO": "${workspaceFolder}"
}
}
}
}

128
plugins/copilot/mcp_server.py Executable file
View File

@@ -0,0 +1,128 @@
#!/usr/bin/env python3
"""SkillOpt-Sleep — minimal MCP server (stdio, stdlib-only).
Exposes the sleep engine as MCP tools so any MCP-capable client (GitHub Copilot
CLI / VS Code, Claude Desktop, etc.) can drive it. No third-party deps: speaks
JSON-RPC 2.0 over stdio with just the handful of MCP methods clients need.
Tools exposed:
- sleep_status : how many nights have run + the latest staged proposal
- sleep_dry_run : harvest+mine+replay, report only (no staging)
- sleep_run : full cycle, stages a proposal (nothing live changes)
- sleep_adopt : apply the latest staged proposal (with backup)
- sleep_harvest : debug — list mined recurring tasks
Each tool shells out to `python -m skillopt_sleep <action> ...` and returns its
stdout. Configure your client to launch: python plugins/copilot/mcp_server.py
"""
from __future__ import annotations
import json
import os
import subprocess
import sys
REPO_ROOT = os.environ.get("SKILLOPT_SLEEP_REPO") or os.path.abspath(
os.path.join(os.path.dirname(__file__), "..", "..")
)
PROTOCOL_VERSION = "2024-11-05"
TOOLS = [
{"name": "sleep_status", "action": "status",
"description": "Show how many SkillOpt-Sleep nights have run and the latest staged proposal."},
{"name": "sleep_dry_run", "action": "dry-run",
"description": "Preview a sleep cycle (harvest+mine+replay) without staging anything."},
{"name": "sleep_run", "action": "run",
"description": "Run a full sleep cycle; stages a reviewed proposal. Nothing live changes until adopt."},
{"name": "sleep_adopt", "action": "adopt",
"description": "Apply the latest staged proposal to CLAUDE.md/SKILL.md (backs up first)."},
{"name": "sleep_harvest", "action": "harvest",
"description": "Debug: list the recurring tasks mined from recent sessions."},
]
_BY_NAME = {t["name"]: t for t in TOOLS}
_TOOL_SCHEMA = {
"type": "object",
"properties": {
"project": {"type": "string", "description": "Project dir to evolve (default: cwd)."},
"backend": {"type": "string", "enum": ["mock", "claude", "codex"],
"description": "mock = no API spend (default); claude/codex = real."},
"scope": {"type": "string", "enum": ["invoked", "all"]},
},
"additionalProperties": False,
}
def _run_engine(action: str, args: dict) -> str:
py = sys.executable or "python3"
cmd = [py, "-m", "skillopt_sleep", action]
if args.get("project"):
cmd += ["--project", str(args["project"])]
if args.get("backend"):
cmd += ["--backend", str(args["backend"])]
if args.get("scope"):
cmd += ["--scope", str(args["scope"])]
try:
proc = subprocess.run(cmd, cwd=REPO_ROOT, capture_output=True, text=True, timeout=3600)
except Exception as e: # noqa: BLE001
return f"[error] failed to run engine: {e}"
out = (proc.stdout or "").strip()
err = (proc.stderr or "").strip()
return out + (("\n[stderr]\n" + err) if err else "")
def _result(id_, result):
return {"jsonrpc": "2.0", "id": id_, "result": result}
def _error(id_, code, message):
return {"jsonrpc": "2.0", "id": id_, "error": {"code": code, "message": message}}
def handle(req: dict):
method = req.get("method")
id_ = req.get("id")
if method == "initialize":
return _result(id_, {
"protocolVersion": PROTOCOL_VERSION,
"capabilities": {"tools": {}},
"serverInfo": {"name": "skillopt-sleep", "version": "0.1.0"},
})
if method in ("notifications/initialized", "initialized"):
return None # notification, no response
if method == "tools/list":
return _result(id_, {"tools": [
{"name": t["name"], "description": t["description"], "inputSchema": _TOOL_SCHEMA}
for t in TOOLS
]})
if method == "tools/call":
params = req.get("params") or {}
name = params.get("name")
tool = _BY_NAME.get(name)
if not tool:
return _error(id_, -32602, f"unknown tool: {name}")
text = _run_engine(tool["action"], params.get("arguments") or {})
return _result(id_, {"content": [{"type": "text", "text": text}]})
if method == "ping":
return _result(id_, {})
return _error(id_, -32601, f"method not found: {method}")
def main() -> int:
for line in sys.stdin:
line = line.strip()
if not line:
continue
try:
req = json.loads(line)
except Exception:
continue
resp = handle(req)
if resp is not None:
sys.stdout.write(json.dumps(resp) + "\n")
sys.stdout.flush()
return 0
if __name__ == "__main__":
raise SystemExit(main())

46
plugins/run-sleep.sh Executable file
View File

@@ -0,0 +1,46 @@
#!/usr/bin/env bash
# SkillOpt-Sleep shared runner — used by all platform plugins (Claude Code,
# Codex, Copilot). Resolves the repo root (which contains the skillopt_sleep
# package), picks a Python >= 3.10, and execs the engine CLI.
#
# Usage: run-sleep.sh <run|dry-run|status|adopt|harvest|...> [args...]
set -euo pipefail
# This script lives at <repo>/plugins/run-sleep.sh, so the repo root (which
# holds skillopt_sleep/) is one level up. CLAUDE_PLUGIN_ROOT (if set by Claude
# Code) points at the plugin dir; the engine is then two levels above it.
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
if [ -d "$SCRIPT_DIR/../skillopt_sleep" ]; then
REPO_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
elif [ -n "${CLAUDE_PLUGIN_ROOT:-}" ] && [ -d "$CLAUDE_PLUGIN_ROOT/../../skillopt_sleep" ]; then
REPO_ROOT="$(cd "$CLAUDE_PLUGIN_ROOT/../.." && pwd)"
elif [ -n "${SKILLOPT_SLEEP_REPO:-}" ] && [ -d "$SKILLOPT_SLEEP_REPO/skillopt_sleep" ]; then
REPO_ROOT="$SKILLOPT_SLEEP_REPO"
else
# last resort: search upward from CWD
d="$PWD"
while [ "$d" != "/" ]; do
[ -d "$d/skillopt_sleep" ] && { REPO_ROOT="$d"; break; }
d="$(dirname "$d")"
done
fi
if [ -z "${REPO_ROOT:-}" ]; then
echo "[sleep] ERROR: could not locate the skillopt_sleep package. Set SKILLOPT_SLEEP_REPO to the repo root." >&2
exit 1
fi
PY=""
for cand in python3.12 python3.11 python3.10 python3; do
if command -v "$cand" >/dev/null 2>&1; then
ver="$("$cand" -c 'import sys; print("%d%d" % sys.version_info[:2])' 2>/dev/null || echo 0)"
if [ "${ver:-0}" -ge 310 ]; then PY="$cand"; break; fi
fi
done
if [ -z "$PY" ]; then
echo "[sleep] ERROR: need Python >= 3.10 (found none)." >&2
exit 1
fi
if [ "$#" -eq 0 ]; then set -- status; fi
cd "$REPO_ROOT"
exec "$PY" -m skillopt_sleep "$@"