microsoft-SkillOpt/plugins/README.md
khashayar bec23ed020 Add Devin plugin (plugins/devin): MCP server + ATIF-v1.7 harvest
Wires the skillopt_sleep engine into Devin (Cognition) via an MCP server,
following the same thin-shell pattern as plugins/copilot.

- mcp_server.py: stdlib-only stdio MCP server exposing the standard sleep_*
  tools (status, dry-run, run, adopt, harvest). REPO_ROOT defaults to ../.. so
  it finds skillopt_sleep automatically when run from plugins/devin/.
- harvest_devin.py: converts Devin ATIF-v1.7 transcripts, agentmemory, and
  .devin/skills/*/SKILL.md into the Claude Code-compatible JSONL the engine
  consumes; enriches with taskKey + outcome envelopes (hard test/build signal
  or judge rubric). Workspace auto-detection; cross-platform paths.
- judge.py, mcp-config.example.json, devin-rules.snippet.md, README.md.
- plugins/README.md: add Devin to the platform + install tables.

No changes to skillopt_sleep; shells out to `python -m skillopt_sleep` like the
other plugins. Pure stdlib; default backend mock (no API spend).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 10:42:52 +02:00

# SkillOpt-Sleep — plugins for Claude Code, Codex, Copilot, and Devin
**Your coding agent forgets everything between sessions. SkillOpt-Sleep fixes
that.** While you sleep, it reviews what you did today, notices the rules you
keep repeating ("always add a LIMIT", "answers in `\boxed{}`", "cite the
source"), and writes them into your agent's long-term memory and skills — but
only the rules that actually make it score better on *your own* past tasks. You
wake up to an agent that's better at *your* work, and you approve every change
before it sticks.

One engine, four thin shells. It synthesizes **SkillOpt** (validation-gated
bounded text optimization — the research in this repo), **Claude Dreams**
(offline consolidation; input never mutated; review-then-adopt), and the **agent
sleep** idea (short-term experience → long-term competence).
> **Open-source tool, decoupled from the research.** The engine lives in the
> top-level [`skillopt_sleep/`](../skillopt_sleep) package with **zero
> dependency** on the paper's `skillopt/` experiment code (the validation gate is
> vendored). Use it without the research stack.
---
| Platform | Folder | Mechanism | Status |
|---|---|---|---|
| **Claude Code** | [`claude-code/`](claude-code) | `.claude-plugin` + `/skillopt-sleep` command + skill + hooks | full, installable |
| **Codex** | [`codex/`](codex) | user-level `skillopt-sleep` skill + shared runner | full |
| **Copilot** | [`copilot/`](copilot) | MCP server (`sleep_*` tools) + `copilot-instructions` | full (MCP) |
| **Devin** | [`devin/`](devin) | MCP server (`sleep_*` tools) + Devin ATIF-v1.7 harvest + `.devin/rules` | full (MCP) |
## Install (pick your agent)
| Platform | Install | Then |
|---|---|---|
| **Claude Code** | `/plugin marketplace add microsoft/SkillOpt` → `/plugin install skillopt-sleep` | `/skillopt-sleep status` |
| **Codex** | `git clone` → `bash plugins/codex/install.sh` | `/skillopt-sleep status` |
| **Copilot** | `git clone` → register `plugins/copilot/mcp_server.py` as an MCP server | ask "run the sleep cycle" |
| **Devin** | `git clone` → `devin mcp add skillopt-sleep -- python3 plugins/devin/mcp_server.py` | ask "run the sleep cycle" |

Requirements: Python ≥ 3.10 and the agent's CLI on PATH. All four platforms
drive the same engine ([`run-sleep.sh`](run-sleep.sh) → `python -m skillopt_sleep`),
so behaviour is identical everywhere. Default backend is `mock` (no API spend);
`--backend claude|codex|copilot` uses your own budget.

---
## How it works: one "night", in plain terms
```
harvest your past sessions → mine the tasks you keep doing → replay them offline
→ reflect on failures → propose a few rule edits → KEEP only edits that raise
your held-out score → stage a proposal → (you) review & adopt
```
Nothing live changes until you `adopt`; every adopt backs up the prior file.
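
The stage-then-adopt step described above can be pictured roughly as follows. This is a minimal sketch, not the engine's code: `stage_proposal`, `adopt`, the staging directory, and the `.bak` suffix are all hypothetical names chosen for illustration.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def stage_proposal(staging_dir: Path, name: str, new_text: str) -> Path:
    """Write the proposed file into a staging area; the live file is untouched."""
    staging_dir.mkdir(parents=True, exist_ok=True)
    staged = staging_dir / name
    staged.write_text(new_text, encoding="utf-8")
    return staged


def adopt(staged: Path, live: Path) -> Optional[Path]:
    """Copy the staged file over the live one, backing up the prior version first."""
    backup = None
    if live.exists():
        backup = live.with_name(live.name + ".bak")  # hypothetical backup naming
        shutil.copy2(live, backup)
    shutil.copy2(staged, live)
    return backup


def demo() -> Tuple[str, str, str]:
    """Stage a proposal, confirm the live file is unchanged, then adopt it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        live = root / "SKILL.md"
        live.write_text("old rules\n", encoding="utf-8")
        staged = stage_proposal(root / "staged", "SKILL.md",
                                "old rules\nalways add a LIMIT\n")
        before = live.read_text(encoding="utf-8")  # read after staging, before adopt
        backup = adopt(staged, live)
        return (before,
                live.read_text(encoding="utf-8"),
                backup.read_text(encoding="utf-8"))
```

Staging writes only to the side directory, so `before` still holds the old text. The live file changes, and a backup appears, only when `adopt` runs.
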
### The split that keeps it honest: dream-train / real-val / real-test
This is the heart of the design, borrowed from the SkillOpt paper's
train/selection/test protocol:
| Split | Where it comes from | What it's for |
|---|---|---|
| **train** | your real tasks **+ optional "dreamed" variants** | what the optimizer *learns from*. Over-dreaming here is fine — it's imagination. |
| **val** (selection) | **your real tasks only**, held out | the **gate**: an edit is kept only if it raises this score. Stops overfitting. |
| **test** | **your real tasks only**, held out, never seen during optimization | the **final score** we report. Kept as close to your real usage as possible. |

So you can **dream up extra training examples** to learn a rule robustly, while
the rule is still **judged on real, unseen tasks**. A `dream` task can *never*
land in val or test; that invariant is unit-tested.

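
One way to enforce that invariant is to split only the real tasks and append dreamed ones to train afterwards. The sketch below is an illustration under assumptions, not the engine's implementation: the `Task` class, the `origin` field, and `split_tasks` are hypothetical, and the 4:1:5 proportions are borrowed from the split this README mentions for the academic daily-cases.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Task:
    task_id: str
    origin: str  # "real" or "dream" (hypothetical labels)


def split_tasks(tasks: List[Task], seed: int = 0,
                parts: Tuple[int, int, int] = (4, 1, 5)
                ) -> Tuple[List[Task], List[Task], List[Task]]:
    """Split real tasks into train/val/test; dreamed tasks only ever join train."""
    real = [t for t in tasks if t.origin == "real"]
    dreamed = [t for t in tasks if t.origin == "dream"]
    random.Random(seed).shuffle(real)  # seeded so the split is reproducible

    total = sum(parts)
    n_train = len(real) * parts[0] // total
    n_val = len(real) * parts[1] // total

    train = real[:n_train] + dreamed
    val = real[n_train:n_train + n_val]
    test = real[n_train + n_val:]

    # The invariant from the text: no dreamed task in val or test.
    assert all(t.origin == "real" for t in val + test)
    return train, val, test
```

With 10 real and 3 dreamed tasks this yields 7 train tasks (4 real plus 3 dreamed), 1 val task, and 5 test tasks.
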
---
## What each feature does **for you** (with examples)
Every control below works on all four platforms (pass it after the action,
e.g. `/skillopt-sleep run --rollouts-k 3`).
### `--preferences "..."` — tell it your house rules
The single most useful knob. Free text that steers what the optimizer writes,
as a prior. Use it to encode the conventions you're tired of repeating.
```bash
# A backend engineer:
/skillopt-sleep run --preferences "Always use async/await, never callbacks. \
Prefer pytest over unittest. Commit subjects in imperative mood under 50 chars."
# A data analyst:
/skillopt-sleep run --preferences "Every SQL query must end with LIMIT 1000 unless \
I say otherwise. Money in USD with 2 decimals. Prefer CTEs over nested subqueries."
# A researcher:
/skillopt-sleep run --preferences "Cite sources as [Author, Year]. Math answers in \
\\boxed{}. Keep explanations under 150 words unless I ask for depth."
```
*What it does for you:* the next morning your agent already follows these
without you re-typing them, and the rules are validated against your real tasks
(if a "preference" actually hurts your held-out score, the gate drops it).
### `--gate on|off` — strict vs. greedy
- `on` (default): an edit is kept **only if it raises your held-out score**.
Safe — blocks plausible-but-wrong rules and reward-hacking.
- `off`: greedy — keep edits without the strict check (still reports whether
quality moved).

*What it does for you:* leave it `on` for trust. Flip it `off` when you're
exploring and want to see everything the optimizer proposes.
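
The difference between the two modes comes down to one comparison. The following is a toy sketch under assumptions: `run_gate` and `toy_val_score` are hypothetical, and the real engine scores edits by replaying your held-out tasks rather than by string matching.

```python
from typing import Callable, List


def run_gate(skill: List[str], edits: List[str],
             val_score: Callable[[List[str]], float],
             gate: str = "on") -> List[str]:
    """Try edits one by one; with the gate on, keep only those that raise the val score."""
    best = val_score(skill)
    for edit in edits:
        candidate = skill + [edit]
        score = val_score(candidate)
        if gate == "off" or score > best:
            skill, best = candidate, score
    return skill


def toy_val_score(skill: List[str]) -> float:
    """Stand-in for a held-out evaluation (purely illustrative)."""
    score = 0.0
    if "always add a LIMIT" in skill:
        score += 1.0   # the rule the held-out tasks actually need
    if "prefer long answers" in skill:
        score -= 0.5   # a plausible-sounding rule that hurts
    return score
```

With `gate="on"`, the list `["always add a LIMIT", "prefer long answers"]` is reduced to its first entry, because the second lowers the score from 1.0 to 0.5. With `gate="off"`, both are kept.
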
### `--rollouts-k K` — learn from contrast, not just failure
Re-runs each task `K` times and learns from the difference between the **good**
and **bad** attempts, not just a single failure.
```bash
/skillopt-sleep run --rollouts-k 3
```
*What it does for you:* a much stronger signal. If your agent gets a task right 1
time in 3, the optimizer figures out *what the winning attempt did* and makes it
reliable.
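
To make "learning from contrast" concrete, the sketch below pairs each successful rollout with each failed one, which is the raw material a reflection step could compare. It is an assumption-laden illustration: `contrast_rollouts`, `toy_attempt`, and the 0.5 threshold are hypothetical and do not describe the engine's internals.

```python
from typing import Callable, List, Tuple

Attempt = Tuple[str, float]  # (transcript, score in [0, 1])


def contrast_rollouts(task: str,
                      attempt_fn: Callable[[str, int], Attempt],
                      k: int,
                      threshold: float = 0.5) -> List[Tuple[str, str]]:
    """Run a task k times and pair every good attempt with every bad one."""
    attempts = [attempt_fn(task, i) for i in range(k)]
    good = [text for text, score in attempts if score >= threshold]
    bad = [text for text, score in attempts if score < threshold]
    # With no good attempt (or no bad one) there is nothing to contrast.
    return [(g, b) for g in good for b in bad]


def toy_attempt(task: str, i: int) -> Attempt:
    """Stand-in for a real rollout: only the first attempt adds the LIMIT."""
    if i == 0:
        return (f"{task} LIMIT 1000", 1.0)
    return (task, 0.0)
```

With `k=3` and one success, the function returns two (good, bad) pairs. With `k=1` it returns an empty list, which is why a single rollout can only learn from failure.
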
### `--optimizer-model` / `--target-model` — optimize cheap, deploy anywhere
Use a strong model to *write* the rules and a cheap model to *run* your tasks.
The learned skill then helps the cheap model — or any model.
```bash
/skillopt-sleep run --optimizer-model sonnet --target-model haiku
```
*What it does for you:* spend a little on a smart optimizer overnight; your
everyday cheap/fast agent inherits the upgrade. (Verified: a skill optimized on
one model lifts a different one — cross-model and even cross-runtime
Codex↔Claude.)
### `--budget-tokens N` / `--budget-minutes M` — cap the spend
You decide how much the nightly "dreaming" costs; it auto-plans how many nights
× how many rollouts fit.
```bash
/skillopt-sleep run --backend claude --budget-tokens 60000
```
*What it does for you:* predictable cost. It stops cleanly when the budget is hit
and tells you what it skipped.
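
As a rough picture of the planning arithmetic, the sketch below works out how many tasks fit in a token budget for a single night. It simplifies the real planner (which also spreads work across nights), and `plan_budget` and the per-rollout token estimate are hypothetical.

```python
from typing import Dict


def plan_budget(budget_tokens: int, tokens_per_rollout: int,
                n_tasks: int, rollouts_k: int) -> Dict[str, int]:
    """Decide how many tasks fit in the budget when each task is replayed k times."""
    if tokens_per_rollout <= 0 or rollouts_k <= 0:
        raise ValueError("tokens_per_rollout and rollouts_k must be positive")
    cost_per_task = tokens_per_rollout * rollouts_k
    affordable = min(n_tasks, budget_tokens // cost_per_task)
    return {
        "tasks_run": affordable,
        "tasks_skipped": n_tasks - affordable,  # reported rather than silently dropped
        "tokens_planned": affordable * cost_per_task,
    }
```

For example, a 60000-token budget with an assumed 2500 tokens per rollout, 10 mined tasks, and `--rollouts-k 3` covers 8 tasks and skips 2.
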
### Multi-objective (accuracy ↑, tokens ↓, latency ↓)
The reward can weight not just correctness but **cost and speed**, so a skill can
learn to be cheaper and faster, not only more accurate. *What it does for you:*
"answer directly instead of opening five files" becomes a learned habit.
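
A weighted reward of this kind might look like the sketch below. The function name, the weights, and the reference scales are all assumptions made for illustration; the README does not specify the engine's actual formula.

```python
def reward(accuracy: float, tokens: int, latency_s: float,
           w_acc: float = 1.0, w_tok: float = 0.1, w_lat: float = 0.1,
           token_ref: float = 10_000.0, latency_ref: float = 60.0) -> float:
    """Reward correctness, and penalize token use and latency relative to a reference."""
    return (w_acc * accuracy
            - w_tok * (tokens / token_ref)
            - w_lat * (latency_s / latency_ref))
```

Two equally correct attempts then score differently: one that answers directly in 2000 tokens and 6 seconds outranks one that spends 10000 tokens and 60 seconds opening files first.
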
### `schedule` / `unschedule` — set it and forget it
Built-in nightly scheduling (no manual cron):
```bash
/skillopt-sleep schedule --hour 3 --minute 17 # runs every night for this project
/skillopt-sleep unschedule # stop it
```
*What it does for you:* it just gets better while you sleep. The nightly run only
*stages* a proposal; adopting is still your call (or add `--auto-adopt` when you
schedule, if you trust it).

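
On cron-based systems, `--hour 3 --minute 17` corresponds to the schedule fields `17 3 * * *` (minute, hour, day of month, month, day of week). The helper below is a hypothetical sketch of building such a line. The engine's actual installer and the command it registers are not shown in this README, so the command argument is left as a placeholder.

```python
def cron_line(hour: int, minute: int, command: str) -> str:
    """Build a crontab line that runs `command` once a day at hour:minute."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute 0-59")
    # crontab field order: minute hour day-of-month month day-of-week command
    return f"{minute} {hour} * * * {command}"
```

Calling `cron_line(3, 17, "<your sleep command>")` gives `17 3 * * * <your sleep command>`.
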
---
## Full action / flag reference
| Action | Does |
|---|---|
| `status` | nights so far + the latest staged proposal (read-only) |
| `dry-run` | harvest→mine→replay→report; **stages nothing** |
| `run` | full cycle; **stages** a proposal; nothing live changes |
| `adopt` | apply the staged proposal to `CLAUDE.md`/`SKILL.md` (backs up first) |
| `harvest` | debug: print the recurring tasks it mined |
| `schedule` / `unschedule` | install/remove the nightly cron entry |

| Flag | Default | Meaning |
|---|---|---|
| `--backend mock\|claude\|codex\|copilot` | `mock` | who runs/optimizes (mock = free) |
| `--preferences "..."` | | your house rules, as a prior |
| `--gate on\|off` | `on` | strict held-out gate vs. greedy |
| `--rollouts-k K` | `1` | multi-rollout contrastive reflection |
| `--optimizer-model` / `--target-model` | | split the optimizer from the target |
| `--budget-tokens` / `--budget-minutes` | | cap the nightly spend |
| `--scope invoked\|all` | `invoked` | this project only, or all projects |
| `--auto-adopt` | off | apply without manual review (power users) |

Deep dive: [the SkillOpt-Sleep guide section](https://microsoft.github.io/SkillOpt/docs/guideline.html#sleep).

---
## Does it actually work?
Yes — measured with **real models on both Claude and Codex**, scored on held-out
tasks the optimizer never trained on:
- **gbrain-evals `skillopt-v1`** (the public suite gbrain scores SkillOpt on):
deficient skills go **0.00 → 1.00** on all 4 seeds, including a real tool-use
loop; cross-model transfer is positive; the gate blocks regressions.
→ [the SkillOpt-Sleep guide section](https://microsoft.github.io/SkillOpt/docs/guideline.html#sleep)
- **Academic daily-cases** (math / spreadsheet / search-QA, the paper's 4:1:5
split with dream-augmented train): see
[the SkillOpt-Sleep guide section](https://microsoft.github.io/SkillOpt/docs/guideline.html#sleep).
- **Fresh load-test** (a "SQL must always include LIMIT" analyst, built from
scratch): held-out **0.00 → 1.00** on both backends.
→ [the SkillOpt-Sleep guide section](https://microsoft.github.io/SkillOpt/docs/guideline.html#sleep)

Try the deterministic proof yourself (no API key, no spend):
```bash
python -m skillopt_sleep.experiments.run_experiment --persona researcher --assert-improves
```
It prints the held-out score rising to 1.0 as the gate accepts the right rules,
and confirms the gate **rejects** an injected harmful edit.

---
## Safety
- **Read-only** harvest of your sessions. `mock` replay has no side effects.
- Proposals are **staged**, never auto-applied (unless you opt in with `--auto-adopt`).
- Every adopt writes a backup. Per-night token/time budget caps. Secrets redacted.