SkillOpt-Sleep — plugins for Claude Code, Codex, and Copilot
Your coding agent forgets everything between sessions. SkillOpt-Sleep fixes
that. While you sleep, it reviews what you did today, notices the rules you
keep repeating ("always add a LIMIT", "answers in \boxed{}", "cite the
source"), and writes them into your agent's long-term memory and skills — but
only the rules that actually make it score better on your own past tasks. You
wake up to an agent that's better at your work, and you approve every change
before it sticks.
One engine, three thin shells. It synthesizes SkillOpt (validation-gated bounded text optimization — the research in this repo), Claude Dreams (offline consolidation; input never mutated; review-then-adopt), and the agent sleep idea (short-term experience → long-term competence).
Open-source tool, decoupled from the research. The engine lives in the top-level
skillopt_sleep/ package with zero dependency on the paper's skillopt/ experiment code (the validation gate is vendored). Use it without the research stack.
| Platform | Folder | Mechanism | Status |
|---|---|---|---|
| Claude Code | claude-code/ | .claude-plugin + /skillopt-sleep command + skill + hooks | full, installable |
| Codex | codex/ | user-level skillopt-sleep skill + shared runner | full |
| Copilot | copilot/ | MCP server (sleep_* tools) + copilot-instructions | full (MCP) |
Install (pick your agent)
| Platform | Install | Then |
|---|---|---|
| Claude Code | /plugin marketplace add microsoft/SkillOpt → /plugin install skillopt-sleep | /skillopt-sleep status |
| Codex | git clone → bash plugins/codex/install.sh | /skillopt-sleep status |
| Copilot | git clone → register plugins/copilot/mcp_server.py as an MCP server | ask "run the sleep cycle" |
Requirements: Python ≥ 3.10 and the agent's CLI on PATH. All three call the same
run-sleep.sh → python -m skillopt_sleep, so behaviour is
identical everywhere. Default backend is mock (no API spend); --backend claude|codex uses your own budget.
How it works: one "night", in plain terms
harvest your past sessions → mine the tasks you keep doing → replay them offline
→ reflect on failures → propose a few rule edits → KEEP only edits that raise
your held-out score → stage a proposal → (you) review & adopt
Nothing live changes until you adopt; every adopt backs up the prior file.
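The adopt step can be pictured with a minimal sketch. It is an illustration only, not the engine's actual code: the function name `adopt`, the `.proposed` file name, and the timestamped `.bak` naming are all assumptions made for the example.

```python
import shutil
import time
from pathlib import Path
from typing import Optional


def adopt(staged: Path, live: Path) -> Optional[Path]:
    """Copy a staged proposal over the live file, backing up the live file first.

    Hypothetical helper: the backup naming scheme is an assumption.
    Returns the backup path, or None if there was no prior file to back up.
    """
    backup = None
    if live.exists():
        stamp = time.strftime("%Y%m%d-%H%M%S")
        backup = live.with_name(f"{live.name}.{stamp}.bak")
        shutil.copy2(live, backup)  # keep the prior version before overwriting
    shutil.copy2(staged, live)  # only now does the live file change
    return backup
```

Until something like `adopt` runs, the staged file sits next to the live one and the agent keeps reading the old rules.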
The split that keeps it honest: dream-train / real-val / real-test
This is the heart of the design, borrowed from the SkillOpt paper's train/selection/test protocol:
| Split | Where it comes from | What it's for |
|---|---|---|
| train | your real tasks + optional "dreamed" variants | what the optimizer learns from. Over-dreaming here is fine — it's imagination. |
| val (selection) | your real tasks only, held out | the gate: an edit is kept only if it raises this score. Stops overfitting. |
| test | your real tasks only, held out, never seen during optimization | the final score we report. Kept as close to your real usage as possible. |
So you can dream up extra training examples to learn a rule robustly, while
the rule is still judged on real, unseen tasks. A dream task can never
land in val or test — that invariant is unit-tested.
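As a rough sketch of that invariant (not the package's real splitter), the split can be written so that dreamed tasks are only ever appended to train. The `Task` class, `split_tasks`, and reading the paper's 4:1:5 figure as train:val:test are assumptions for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    task_id: str
    is_dream: bool = False  # True for synthetic "dreamed" variants


def split_tasks(real_tasks, dream_tasks, ratios=(4, 1, 5), seed=0):
    """Split real tasks into train/val/test; dream tasks join train only.

    Hypothetical helper: ratios are assumed to mean train:val:test.
    """
    if any(t.is_dream for t in real_tasks):
        raise ValueError("a dream task was passed in as a real task")
    shuffled = list(real_tasks)
    random.Random(seed).shuffle(shuffled)  # deterministic for a given seed
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_val = len(shuffled) * ratios[1] // total
    train = shuffled[:n_train] + list(dream_tasks)  # dreams only ever land here
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test
```

With 10 real tasks and 3 dreamed ones, this gives 7 train tasks (4 real + 3 dreamed), 1 val task, and 5 test tasks, and val and test stay real-only by construction.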
What each feature does for you (with examples)
Every control below works on all three platforms (pass it after the action,
e.g. /skillopt-sleep run --rollouts-k 3).
--preferences "..." — tell it your house rules
The single most useful knob. Free text that steers what the optimizer writes, as a prior. Use it to encode the conventions you're tired of repeating.
# A backend engineer:
/skillopt-sleep run --preferences "Always use async/await, never callbacks. \
Prefer pytest over unittest. Commit subjects in imperative mood under 50 chars."
# A data analyst:
/skillopt-sleep run --preferences "Every SQL query must end with LIMIT 1000 unless \
I say otherwise. Money in USD with 2 decimals. Prefer CTEs over nested subqueries."
# A researcher:
/skillopt-sleep run --preferences "Cite sources as [Author, Year]. Math answers in \
\\boxed{}. Keep explanations under 150 words unless I ask for depth."
What it does for you: the next morning your agent already follows these without you re-typing them, and the rules are validated against your real tasks (if a "preference" actually hurts your held-out score, the gate drops it).
--gate on|off — strict vs. greedy
- on (default): an edit is kept only if it raises your held-out score. Safe: blocks plausible-but-wrong rules and reward-hacking.
- off: greedy; keeps edits without the strict check (still reports whether quality moved).
What it does for you: leave it on for trust. Flip it off when you're
exploring and want to see everything the optimizer proposes.
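A minimal sketch of the two modes, assuming a scoring function over the held-out tasks is available. `apply_gate` and `toy_score` are hypothetical names, and the toy scorer stands in for real replay; the actual engine may accept or reject edits differently.

```python
from typing import Callable, List, Sequence, Tuple


def apply_gate(
    rules: List[str],
    candidate_edits: Sequence[str],
    score: Callable[[List[str]], float],
    gate: bool = True,
) -> Tuple[List[str], List[str]]:
    """Try each candidate edit in turn; return (kept_rules, rejected_edits).

    gate=True : keep an edit only if the held-out score strictly rises.
    gate=False: keep every edit (greedy), still tracking the score.
    """
    kept = list(rules)
    rejected: List[str] = []
    best = score(kept)
    for edit in candidate_edits:
        trial = kept + [edit]
        trial_score = score(trial)
        if not gate or trial_score > best:
            kept, best = trial, trial_score
        else:
            rejected.append(edit)
    return kept, rejected


def toy_score(rules: List[str]) -> float:
    """Stand-in for held-out replay: one rule helps, one hurts."""
    s = 0.5
    if "always add a LIMIT" in rules:
        s += 0.5
    if "never use WHERE" in rules:
        s -= 0.3
    return s
```

With the gate on, `apply_gate([], ["always add a LIMIT", "never use WHERE"], toy_score)` keeps the first rule and rejects the second; with `gate=False` both are kept.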
--rollouts-k K — learn from contrast, not just failure
Re-runs each task K times and learns from the difference between the good
and bad attempts, not just a single failure.
/skillopt-sleep run --rollouts-k 3
What it does for you: a much stronger signal. If your agent gets a task right 1 time in 3, the optimizer figures out what the winning attempt did and makes it reliable.
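One way to picture the contrast signal, as a sketch under assumptions rather than the engine's reflection code: run each task K times and keep a (best, worst) pair only when the attempts actually differ in score. `contrastive_pairs` and the `attempt(task, i)` callback are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def contrastive_pairs(
    tasks: Sequence[str],
    attempt: Callable[[str, int], Tuple[str, float]],
    k: int = 3,
) -> List[Tuple[str, str, str]]:
    """Run each task k times; return (task, best_output, worst_output).

    `attempt(task, i)` is an assumed callback returning (output, score)
    for rollout number i. Tasks whose rollouts all score the same give
    no contrast and are skipped.
    """
    pairs = []
    for task in tasks:
        rollouts = [attempt(task, i) for i in range(k)]
        best = max(rollouts, key=lambda r: r[1])
        worst = min(rollouts, key=lambda r: r[1])
        if best[1] > worst[1]:
            pairs.append((task, best[0], worst[0]))
    return pairs


def toy_attempt(task: str, i: int) -> Tuple[str, float]:
    """Toy agent: 'flaky' succeeds only on rollout 1; 'easy' always succeeds."""
    if task == "easy":
        return ("SELECT 1 LIMIT 1000", 1.0)
    if i == 1:
        return ("SELECT * FROM t LIMIT 1000", 1.0)
    return ("SELECT * FROM t", 0.0)
```

Here only the flaky task yields a pair, and the difference between its winning and losing output (the LIMIT clause) is exactly what a reflection step would turn into a rule.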
--optimizer-model / --target-model — optimize cheap, deploy anywhere
Use a strong model to write the rules and a cheap model to run your tasks. The learned skill then helps the cheap model — or any model.
/skillopt-sleep run --optimizer-model sonnet --target-model haiku
What it does for you: spend a little on a smart optimizer overnight; your everyday cheap/fast agent inherits the upgrade. (Verified: a skill optimized on one model lifts a different one — cross-model and even cross-runtime Codex↔Claude.)
--budget-tokens N / --budget-minutes M — cap the spend
You decide how much the nightly "dreaming" costs; it auto-plans how many nights × how many rollouts fit.
/skillopt-sleep run --backend claude --budget-tokens 60000
What it does for you: predictable cost. It stops cleanly when the budget is hit and tells you what it skipped.
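The real planner is not shown here; the sketch below only illustrates the idea of fitting work into a token budget and reporting what was skipped. `plan_within_budget` and the flat per-rollout token estimate are assumptions.

```python
from typing import List, Sequence, Tuple


def plan_within_budget(
    task_ids: Sequence[str],
    rollouts_k: int,
    est_tokens_per_rollout: int,
    budget_tokens: int,
) -> Tuple[List[str], List[str], int]:
    """Greedily plan tasks until the token budget runs out.

    Hypothetical helper: assumes every rollout costs roughly the same.
    Returns (planned, skipped, remaining_tokens).
    """
    planned: List[str] = []
    skipped: List[str] = []
    remaining = budget_tokens
    cost_per_task = rollouts_k * est_tokens_per_rollout
    for task_id in task_ids:
        if cost_per_task <= remaining:
            planned.append(task_id)
            remaining -= cost_per_task
        else:
            skipped.append(task_id)  # reported to the user, not silently dropped
    return planned, skipped, remaining
```

For example, four tasks at 3 rollouts of about 5,000 tokens each cost 15,000 tokens per task, so a 50,000-token budget plans three tasks, skips the fourth, and leaves 5,000 tokens unspent.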
multi-objective (accuracy ↑, tokens ↓, latency ↓)
The reward can weight not just correctness but cost and speed, so a skill can learn to be cheaper and faster, not only more accurate.

What it does for you: "answer directly instead of opening five files" becomes a learned habit.
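A weighted sum is one simple way to combine the three objectives. The sketch below assumes that form; the function name `reward`, the weights, and the reference values are made up for illustration and are not the engine's defaults.

```python
def reward(
    accuracy: float,
    tokens: int,
    latency_s: float,
    *,
    w_acc: float = 1.0,
    w_tok: float = 0.2,
    w_lat: float = 0.1,
    token_ref: int = 10_000,
    latency_ref_s: float = 60.0,
) -> float:
    """Scalar reward: accuracy up, tokens down, latency down.

    Hypothetical weights and reference scales. Costs are normalised to
    [0, 1] and capped so one runaway rollout cannot dominate the total.
    """
    token_cost = min(tokens / token_ref, 1.0)
    latency_cost = min(latency_s / latency_ref_s, 1.0)
    return w_acc * accuracy - w_tok * token_cost - w_lat * latency_cost


# Two equally correct attempts: one answers directly, one opens many files.
direct = reward(accuracy=1.0, tokens=2_000, latency_s=6.0)
wandering = reward(accuracy=1.0, tokens=10_000, latency_s=60.0)
```

Both attempts are fully correct, but the direct one scores higher because it spends fewer tokens and less time, so edits that encourage it are the ones a gate would keep.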
schedule / unschedule — set it and forget it
Built-in nightly scheduling (no manual cron):
/skillopt-sleep schedule --hour 3 --minute 17 # runs every night for this project
/skillopt-sleep unschedule # stop it
What it does for you: it just gets better while you sleep. The nightly run only
stages a proposal — adopting is still your call (or add --auto-adopt when you
schedule, if you trust it).
Full action / flag reference
| Action | Does |
|---|---|
| status | nights so far + the latest staged proposal (read-only) |
| dry-run | harvest→mine→replay→report; stages nothing |
| run | full cycle; stages a proposal; nothing live changes |
| adopt | apply the staged proposal to CLAUDE.md/SKILL.md (backs up first) |
| harvest | debug: print the recurring tasks it mined |
| schedule / unschedule | install/remove the nightly cron entry |
| Flag | Default | Meaning |
|---|---|---|
| --backend mock\|claude\|codex | mock | who runs/optimizes (mock = free) |
| --preferences "..." | – | your house rules, as a prior |
| --gate on\|off | on | strict held-out gate vs. greedy |
| --rollouts-k K | 1 | multi-rollout contrastive reflection |
| --optimizer-model / --target-model | – | split the optimizer from the target |
| --budget-tokens / --budget-minutes | – | cap the nightly spend |
| --scope invoked\|all | invoked | this project only, or all projects |
| --auto-adopt | off | apply without manual review (power users) |
Deep dive: ../docs/sleep/CONTROLLABLE_DREAMING.md.
Does it actually work?
Yes — measured with real models on both Claude and Codex, scored on held-out tasks the optimizer never trained on:
- gbrain-evals skillopt-v1 (the public suite gbrain scores SkillOpt on): deficient skills go 0.00 → 1.00 on all 4 seeds, including a real tool-use loop; cross-model transfer is positive; the gate blocks regressions. → ../docs/sleep/FINAL_REPORT.md
- Academic daily-cases (math / spreadsheet / search-QA, the paper's 4:1:5 split with dream-augmented train): see ../docs/sleep/daily_cases_results.md.
- Fresh load-test (a "SQL must always include LIMIT" analyst, built from scratch): held-out 0.00 → 1.00 on both backends. → ../docs/sleep/plugin_load_test.md
Try the deterministic proof yourself (no API key, no spend):
python -m skillopt_sleep.experiments.run_experiment --persona researcher --assert-improves
It prints the held-out score rising to 1.0 as the gate accepts the right rules, and confirms the gate rejects an injected harmful edit.
Safety
- Read-only harvest of your sessions. mock replay has no side effects.
- Proposals are staged, never auto-applied (unless you opt in with --auto-adopt).
- Every adopt writes a backup. Per-night token/time budget caps. Secrets redacted.