mirror of https://github.com/microsoft/SkillOpt.git synced 2026-07-03 14:02:58 +08:00

SkillOpt-Sleep — plugins for Claude Code, Codex, and Copilot

Your coding agent forgets everything between sessions. SkillOpt-Sleep fixes that. While you sleep, it reviews what you did today, notices the rules you keep repeating ("always add a LIMIT", "answers in \boxed{}", "cite the source"), and writes them into your agent's long-term memory and skills — but only the rules that actually make it score better on your own past tasks. You wake up to an agent that's better at your work, and you approve every change before it sticks.

One engine, three thin shells. It synthesizes SkillOpt (validation-gated bounded text optimization — the research in this repo), Claude Dreams (offline consolidation; input never mutated; review-then-adopt), and the agent sleep idea (short-term experience → long-term competence).

Open-source tool, decoupled from the research. The engine lives in the top-level skillopt_sleep/ package with zero dependency on the paper's skillopt/ experiment code (the validation gate is vendored). Use it without the research stack.

| Platform | Folder | Mechanism | Status |
|---|---|---|---|
| Claude Code | `claude-code/` | `.claude-plugin` + `/skillopt-sleep` command + skill + hooks | full, installable |
| Codex | `codex/` | user-level `skillopt-sleep` skill + shared runner | full |
| Copilot | `copilot/` | MCP server (`sleep_*` tools) + `copilot-instructions` | full (MCP) |

Install (pick your agent)

| Platform | Install | Then |
|---|---|---|
| Claude Code | `/plugin marketplace add microsoft/SkillOpt` → `/plugin install skillopt-sleep` | `/skillopt-sleep status` |
| Codex | `git clone` → `bash plugins/codex/install.sh` | `/skillopt-sleep status` |
| Copilot | `git clone` → register `plugins/copilot/mcp_server.py` as an MCP server | ask "run the sleep cycle" |

Requirements: Python ≥ 3.10 and the agent's CLI on PATH. All three call the same run-sleep.sh → python -m skillopt_sleep, so behaviour is identical everywhere. Default backend is mock (no API spend); --backend claude|codex uses your own budget.

How it works: one "night", in plain terms

harvest your past sessions → mine the tasks you keep doing → replay them offline
  → reflect on failures → propose a few rule edits → KEEP only edits that raise
    your held-out score → stage a proposal → (you) review & adopt

Nothing live changes until you adopt; every adopt backs up the prior file.
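
The flow above can be sketched as a small loop. This is only an illustration under stated assumptions, not the engine's code: `Proposal`, `run_night`, and `toy_score` are hypothetical names, and the toy scorer stands in for replaying your real held-out tasks.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Proposal:
    """Edits that passed the gate, staged for review (hypothetical type)."""
    kept_edits: List[str] = field(default_factory=list)
    baseline_score: float = 0.0
    final_score: float = 0.0


def run_night(
    rules: List[str],
    candidate_edits: List[str],
    score_on_val: Callable[[List[str]], float],
) -> Proposal:
    """Try each proposed edit; keep it only if the held-out score rises."""
    current = list(rules)            # work on a copy: the input is never mutated
    best = score_on_val(current)
    proposal = Proposal(baseline_score=best, final_score=best)
    for edit in candidate_edits:
        trial = current + [edit]
        trial_score = score_on_val(trial)
        if trial_score > best:       # strict improvement required
            current, best = trial, trial_score
            proposal.kept_edits.append(edit)
    proposal.final_score = best
    return proposal                  # staged only; adopting is a separate step


# Toy scorer (assumption): fraction of the rules the held-out tasks need.
REQUIRED = {"add LIMIT", "cite the source"}


def toy_score(rules: List[str]) -> float:
    return len(REQUIRED & set(rules)) / len(REQUIRED)


staged = run_night([], ["add LIMIT", "use tabs", "cite the source"], toy_score)
print(staged.kept_edits, staged.final_score)
```

In this toy run the "use tabs" edit is dropped because it does not move the score, while the two useful rules are kept and staged.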

The split that keeps it honest: dream-train / real-val / real-test

This is the heart of the design, borrowed from the SkillOpt paper's train/selection/test protocol:

| Split | Where it comes from | What it's for |
|---|---|---|
| train | your real tasks + optional "dreamed" variants | what the optimizer learns from. Over-dreaming here is fine; it's imagination. |
| val (selection) | your real tasks only, held out | the gate: an edit is kept only if it raises this score. Stops overfitting. |
| test | your real tasks only, held out, never seen during optimization | the final score we report. Kept as close to your real usage as possible. |

So you can dream up extra training examples to learn a rule robustly, while the rule is still judged on real, unseen tasks. A dream task can never land in val or test — that invariant is unit-tested.
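
A minimal sketch of that invariant, assuming a hypothetical `Task` record with a `dreamed` flag and a `split_tasks` helper (neither name comes from the package). The 4:1:5 ratio is borrowed from the paper's split mentioned later in this README; whether the engine uses the same ratio for your own tasks is an assumption here.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Task:
    prompt: str
    dreamed: bool = False   # True for synthetic "dreamed" variants (hypothetical field)


def split_tasks(
    real: List[Task], dreams: List[Task], seed: int = 0
) -> Tuple[List[Task], List[Task], List[Task]]:
    """Split real tasks 4:1:5 into train/val/test; dreams join train only."""
    if any(t.dreamed for t in real):
        raise ValueError("real pool contains a dreamed task")
    shuffled = list(real)
    random.Random(seed).shuffle(shuffled)   # seeded, so the split is reproducible
    n_train = len(shuffled) * 4 // 10
    n_val = len(shuffled) * 1 // 10
    train = shuffled[:n_train] + list(dreams)
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


real = [Task(f"real-{i}") for i in range(10)]
dreams = [Task(f"dream-{i}", dreamed=True) for i in range(3)]
train, val, test = split_tasks(real, dreams)

# The invariant: no dreamed task ever reaches val or test.
assert not any(t.dreamed for t in val + test)
```

Because dreams are appended after the real tasks are partitioned, they cannot leak into the held-out splits regardless of how many you generate.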

What each feature does for you (with examples)

Every control below works on all three platforms (pass it after the action, e.g. /skillopt-sleep run --rollouts-k 3).

`--preferences "..."` — tell it your house rules

The single most useful knob. Free text that steers what the optimizer writes, as a prior. Use it to encode the conventions you're tired of repeating.

# A backend engineer:
/skillopt-sleep run --preferences "Always use async/await, never callbacks. \
  Prefer pytest over unittest. Commit subjects in imperative mood under 50 chars."

# A data analyst:
/skillopt-sleep run --preferences "Every SQL query must end with LIMIT 1000 unless \
  I say otherwise. Money in USD with 2 decimals. Prefer CTEs over nested subqueries."

# A researcher:
/skillopt-sleep run --preferences "Cite sources as [Author, Year]. Math answers in \
  \\boxed{}. Keep explanations under 150 words unless I ask for depth."

What it does for you: the next morning your agent already follows these without you re-typing them, and the rules are validated against your real tasks (if a "preference" actually hurts your held-out score, the gate drops it).

`--gate on|off` — strict vs. greedy

- on (default): an edit is kept only if it raises your held-out score. Safe: blocks plausible-but-wrong rules and reward-hacking.
- off: greedy. Edits are kept without the strict check (the run still reports whether quality moved).

What it does for you: leave it on for trust. Flip it off when you're exploring and want to see everything the optimizer proposes.
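
The two modes differ only in the keep decision; the score change is reported either way. A small sketch, where `GateDecision` and `gate_decision` are hypothetical names rather than the vendored gate's API:

```python
from typing import NamedTuple


class GateDecision(NamedTuple):
    keep: bool
    delta: float   # change in held-out score, reported in both modes


def gate_decision(
    score_before: float, score_after: float, gate_on: bool = True
) -> GateDecision:
    """Strict mode keeps an edit only on a real improvement; greedy keeps all."""
    delta = score_after - score_before
    keep = delta > 0 if gate_on else True
    return GateDecision(keep=keep, delta=delta)


print(gate_decision(0.50, 0.75))                  # improvement: kept
print(gate_decision(0.50, 0.40))                  # regression: rejected
print(gate_decision(0.50, 0.40, gate_on=False))   # greedy: kept, delta still shown
```

Note that in this sketch an edit that leaves the score unchanged is also rejected in strict mode, matching "only if it raises your held-out score".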

`--rollouts-k K` — learn from contrast, not just failure

Re-runs each task K times and learns from the difference between the good and bad attempts, not just a single failure.

/skillopt-sleep run --rollouts-k 3

What it does for you: a much stronger signal. If your agent gets a task right 1 time in 3, the optimizer figures out what the winning attempt did and makes it reliable.
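
One way to picture the contrast step: run the task K times, then separate the attempts that passed from those that failed so a reflection step can compare them. The helper below is a hedged sketch; `contrast_rollouts`, `attempt`, and `is_correct` are illustrative names, and the canned outputs replace a real agent call.

```python
from typing import Callable, Dict, List


def contrast_rollouts(
    task: str,
    attempt: Callable[[str], str],
    is_correct: Callable[[str], bool],
    k: int = 3,
) -> Dict[str, List[str]]:
    """Run a task k times and group the outputs into wins and losses."""
    outputs = [attempt(task) for _ in range(k)]
    wins = [o for o in outputs if is_correct(o)]
    losses = [o for o in outputs if not is_correct(o)]
    return {"wins": wins, "losses": losses}


# Canned attempts stand in for three real rollouts of the same task.
canned = iter([
    "SELECT * FROM orders",
    "SELECT * FROM orders LIMIT 1000",
    "SELECT * FROM orders",
])
result = contrast_rollouts(
    "list recent orders",
    attempt=lambda task: next(canned),
    is_correct=lambda out: "LIMIT" in out,
    k=3,
)

# A contrast exists only when there is at least one win and one loss.
has_contrast = bool(result["wins"]) and bool(result["losses"])
print(result["wins"], has_contrast)
```

Here the single winning attempt differs from the losers by the `LIMIT` clause, which is the kind of difference a reflection step could turn into a rule.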

`--optimizer-model` / `--target-model` — optimize cheap, deploy anywhere

Use a strong model to write the rules and a cheap model to run your tasks. The learned skill then helps the cheap model — or any model.

/skillopt-sleep run --optimizer-model sonnet --target-model haiku

What it does for you: spend a little on a smart optimizer overnight; your everyday cheap/fast agent inherits the upgrade. (Verified: a skill optimized on one model lifts a different one — cross-model and even cross-runtime Codex↔Claude.)

`--budget-tokens N` / `--budget-minutes M` — cap the spend

You decide how much the nightly "dreaming" costs; it auto-plans how many nights × how many rollouts fit.

/skillopt-sleep run --backend claude --budget-tokens 60000

What it does for you: predictable cost. It stops cleanly when the budget is hit and tells you what it skipped.
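
How the planner fits work into a budget is not spelled out here, so the following is only a guess at the arithmetic. It assumes a flat, hypothetical cost per rollout and picks the largest K that still covers every task, dropping to a partial pass when even K=1 does not fit.

```python
from typing import Dict


def plan_rollouts(
    budget_tokens: int, tokens_per_rollout: int, n_tasks: int, max_k: int = 3
) -> Dict[str, int]:
    """Pick the largest K (up to max_k) whose total replay cost fits the budget."""
    if tokens_per_rollout <= 0 or n_tasks <= 0:
        raise ValueError("tokens_per_rollout and n_tasks must be positive")
    affordable = max(budget_tokens, 0) // tokens_per_rollout  # rollouts we can pay for
    k = min(max_k, affordable // n_tasks)
    tasks_run = n_tasks
    if k == 0:
        # Not even one pass over every task fits: run as many tasks as possible once.
        tasks_run = min(n_tasks, affordable)
        k = 1 if tasks_run else 0
    return {
        "k": k,
        "tasks_run": tasks_run,
        "tasks_skipped": n_tasks - tasks_run,
        "tokens_planned": k * tasks_run * tokens_per_rollout,
    }


# 60,000 tokens, an assumed 2,500 tokens per rollout, 10 mined tasks.
print(plan_rollouts(60_000, 2_500, 10))
# Same budget with 40 tasks: some tasks are skipped and reported.
print(plan_rollouts(60_000, 2_500, 40))
```

The `tasks_skipped` count mirrors the "tells you what it skipped" behaviour described above; the real planner may also spread work across several nights.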

multi-objective (accuracy ↑, tokens ↓, latency ↓)

The reward can weight not just correctness but cost and speed, so a skill can learn to be cheaper and faster, not only more accurate.

What it does for you: "answer directly instead of opening five files" becomes a learned habit.
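
A weighted reward of this kind might look like the sketch below. The weights, the reference budgets, and the function name `reward` are assumptions chosen for illustration; the actual weighting used by the engine is not documented in this README.

```python
def reward(
    accuracy: float,
    tokens: int,
    latency_s: float,
    *,
    token_budget: int = 4000,        # assumed reference cost per task
    latency_budget_s: float = 30.0,  # assumed reference latency per task
    w_acc: float = 1.0,
    w_tok: float = 0.1,
    w_lat: float = 0.1,
) -> float:
    """Accuracy minus small, capped penalties for token use and latency."""
    token_penalty = min(tokens / token_budget, 1.0)
    latency_penalty = min(latency_s / latency_budget_s, 1.0)
    return w_acc * accuracy - w_tok * token_penalty - w_lat * latency_penalty


# Two equally correct answers: the direct one scores higher.
direct = reward(accuracy=1.0, tokens=500, latency_s=3.0)
wandering = reward(accuracy=1.0, tokens=3500, latency_s=25.0)
print(direct > wandering)
```

Capping each penalty at its weight keeps correctness dominant: with these defaults a correct but expensive answer still outscores a cheap wrong one.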

`schedule` / `unschedule` — set it and forget it

Built-in nightly scheduling (no manual cron):

/skillopt-sleep schedule --hour 3 --minute 17     # runs every night for this project
/skillopt-sleep unschedule                        # stop it

What it does for you: it just gets better while you sleep. The nightly run only stages a proposal — adopting is still your call (or add --auto-adopt when you schedule, if you trust it).
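
Since `schedule` installs a cron entry, `--hour 3 --minute 17` corresponds to the cron time fields `17 3 * * *` (minute first, then hour). The sketch below builds such a line; the command portion, including calling `run-sleep.sh` with a `run` argument from the project directory, is an assumption and not the entry the tool actually writes.

```python
import shlex


def cron_line(
    hour: int, minute: int, project_dir: str, runner: str = "run-sleep.sh"
) -> str:
    """Build a crontab line that fires once a day at hour:minute."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute 0-59")
    # Hypothetical command: the real entry is whatever `schedule` installs.
    command = f"cd {shlex.quote(project_dir)} && bash {shlex.quote(runner)} run"
    # crontab field order: minute hour day-of-month month day-of-week
    return f"{minute} {hour} * * * {command}"


print(cron_line(3, 17, "/home/me/proj"))
```

Picking an off-the-hour minute such as 17 is a common way to avoid colliding with other jobs scheduled at exactly 03:00.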

Full action / flag reference

| Action | Does |
|---|---|
| `status` | nights so far + the latest staged proposal (read-only) |
| `dry-run` | harvest→mine→replay→report; stages nothing |
| `run` | full cycle; stages a proposal; nothing live changes |
| `adopt` | apply the staged proposal to `CLAUDE.md`/`SKILL.md` (backs up first) |
| `harvest` | debug: print the recurring tasks it mined |
| `schedule` / `unschedule` | install/remove the nightly cron entry |

| Flag | Default | Meaning |
|---|---|---|
| `--backend mock\|claude\|codex` | `mock` | who runs/optimizes (mock = free) |
| `--preferences "..."` | – | your house rules, as a prior |
| `--gate on\|off` | `on` | strict held-out gate vs. greedy |
| `--rollouts-k K` | `1` | multi-rollout contrastive reflection |
| `--optimizer-model` / `--target-model` | – | split the optimizer from the target |
| `--budget-tokens` / `--budget-minutes` | – | cap the nightly spend |
| `--scope invoked\|all` | `invoked` | this project only, or all projects |
| `--auto-adopt` | off | apply without manual review (power users) |

Deep dive: ../docs/sleep/CONTROLLABLE_DREAMING.md.
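
For readers who think in code, the two tables above can be mirrored as an `argparse` parser. This is a reading aid built from the documented actions, flags, and defaults, not the package's real parser; the value types (for example integer budgets) are assumptions, and the `schedule`-specific `--hour`/`--minute` options are left out.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser that mirrors the documented actions and flags (illustrative only)."""
    p = argparse.ArgumentParser(prog="skillopt-sleep")
    p.add_argument(
        "action",
        choices=["status", "dry-run", "run", "adopt", "harvest",
                 "schedule", "unschedule"],
    )
    p.add_argument("--backend", choices=["mock", "claude", "codex"], default="mock")
    p.add_argument("--preferences", default=None)
    p.add_argument("--gate", choices=["on", "off"], default="on")
    p.add_argument("--rollouts-k", type=int, default=1)
    p.add_argument("--optimizer-model", default=None)
    p.add_argument("--target-model", default=None)
    p.add_argument("--budget-tokens", type=int, default=None)    # type assumed
    p.add_argument("--budget-minutes", type=int, default=None)   # type assumed
    p.add_argument("--scope", choices=["invoked", "all"], default="invoked")
    p.add_argument("--auto-adopt", action="store_true")
    return p


args = build_parser().parse_args(["run", "--rollouts-k", "3"])
print(args.action, args.backend, args.gate, args.rollouts_k)
```

With no extra flags, the parsed defaults match the table: the free `mock` backend, the gate on, one rollout, and manual review.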

Does it actually work?

Yes — measured with real models on both Claude and Codex, scored on held-out tasks the optimizer never trained on:

gbrain-evals skillopt-v1 (the public suite gbrain scores SkillOpt on): deficient skills go 0.00 → 1.00 on all 4 seeds, including a real tool-use loop; cross-model transfer is positive; the gate blocks regressions. → ../docs/sleep/FINAL_REPORT.md
Academic daily-cases (math / spreadsheet / search-QA, the paper's 4:1:5 split with dream-augmented train): see ../docs/sleep/daily_cases_results.md.
Fresh load-test (a "SQL must always include LIMIT" analyst, built from scratch): held-out 0.00 → 1.00 on both backends. → ../docs/sleep/plugin_load_test.md

Try the deterministic proof yourself (no API key, no spend):

python -m skillopt_sleep.experiments.run_experiment --persona researcher --assert-improves

It prints the held-out score rising to 1.0 as the gate accepts the right rules, and confirms the gate rejects an injected harmful edit.

Safety

- Read-only harvest of your sessions. mock replay has no side effects.
- Proposals are staged, never auto-applied (unless you opt in with --auto-adopt).
- Every adopt writes a backup. Per-night token/time budget caps. Secrets redacted.
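
The backup-before-adopt guarantee can be pictured as follows. `adopt_with_backup` and the timestamped `.bak` naming are hypothetical; the sketch only shows the ordering that matters, which is copying the prior file before writing the new one.

```python
import shutil
import tempfile
import time
from pathlib import Path
from typing import Optional


def adopt_with_backup(target: Path, new_text: str) -> Optional[Path]:
    """Copy the current file aside, then write the adopted text."""
    backup: Optional[Path] = None
    if target.exists():
        stamp = time.strftime("%Y%m%d-%H%M%S")
        backup = target.with_name(f"{target.name}.{stamp}.bak")  # naming is an assumption
        shutil.copy2(target, backup)   # back up first, preserving metadata
    target.write_text(new_text, encoding="utf-8")
    return backup


# Demonstration in a throwaway directory, so no real skill file is touched.
with tempfile.TemporaryDirectory() as tmp:
    skill = Path(tmp) / "SKILL.md"
    skill.write_text("old rules\n", encoding="utf-8")
    saved = adopt_with_backup(skill, "old rules\nAlways add a LIMIT.\n")
    assert saved is not None
    assert saved.read_text(encoding="utf-8") == "old rules\n"
```

Rolling back is then a matter of copying the `.bak` file over the adopted one.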
