mirror of https://github.com/microsoft/SkillOpt.git synced 2026-07-03 14:02:58 +08:00

Yifan Yang 46b3207b96 docs(sleep): trim RESULTS to the headline results (remove the full grid)

Remove the per-cell full deployment grid section; keep the gate-safety stress
test, experience-replay scaling + night-by-night climb, the dream-diversity
ablation, the gbrain end-to-end result, and the scope/limitations. Renumber
sections; update the README pointer accordingly.

2026-06-15 17:08:51 +00:00


SkillOpt-Sleep 😴 — deployment-time companion (preview)

SkillOpt-Sleep applies SkillOpt's discipline to your own daily usage. It gives a local coding agent a nightly sleep cycle that reviews your past sessions, replays your recurring tasks on your own API budget, and consolidates what it learns into validated long-term memory and skills — behind a held-out gate, staged for your review. The agent gets better the more you use it, with no weight training and zero inference-time overhead.

Preview. This is an early preview we are actively iterating on; interfaces and defaults may change. The engine lives in the top-level skillopt_sleep/ package with zero dependency on the paper's skillopt/ code (the validation gate is vendored).

How it works

One "night":

harvest Claude Code / Codex transcripts → mine recurring tasks → replay offline
   → consolidate (reflect → bounded edit → GATE on real held-out tasks)
   → stage proposal → (you) adopt

It synthesizes SkillOpt (validation-gated bounded text edits), Claude Dreams (offline consolidation; review-then-adopt), and the agent-sleep idea (short-term experience → long-term competence).
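
The gate step in the pipeline above can be sketched as follows. This is a minimal illustration under stated assumptions, not the shipped engine: `gate_candidate`, `toy_score`, and their signatures are hypothetical names invented for this example, and accepting only on a strict improvement is an assumed acceptance rule.

```python
from typing import Callable, Sequence, Tuple


def gate_candidate(
    current_skill: str,
    candidate_skill: str,
    held_out_tasks: Sequence[str],
    score: Callable[[str, str], float],
) -> Tuple[str, bool]:
    """Keep the candidate only if it beats the current skill on held-out tasks.

    Hypothetical helper: `score(skill, task)` is assumed to return a value in [0, 1].
    """
    if not held_out_tasks:
        raise ValueError("the gate needs at least one held-out task")

    def mean_score(skill: str) -> float:
        return sum(score(skill, task) for task in held_out_tasks) / len(held_out_tasks)

    # Assumed rule: accept only on a strict improvement over the current skill.
    accepted = mean_score(candidate_skill) > mean_score(current_skill)
    return (candidate_skill if accepted else current_skill), accepted


def toy_score(skill: str, task: str) -> float:
    """Toy scorer for the demo: a task passes if the skill text mentions it."""
    return 1.0 if task in skill else 0.0


tasks = ["csv", "git"]
staged, accepted = gate_candidate("handles csv", "handles csv and git", tasks, toy_score)
print(staged, accepted)
```

In this sketch a rejected candidate leaves the current skill in place, which mirrors the review-then-adopt idea: nothing changes unless an edit first proves itself on tasks it was not tuned on.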

How to use it

One engine, thin per-agent shells (see plugins/):

| Platform | Folder | Install |
|---|---|---|
| Claude Code | `plugins/claude-code` | `/plugin marketplace add ./plugins/claude-code` → `/skillopt-sleep` |
| Codex | `plugins/codex` | `bash plugins/codex/install.sh` → `skillopt-sleep` skill |
| Copilot | `plugins/copilot` | register `plugins/copilot/mcp_server.py` as an MCP server |

Deterministic proof (no API key): `python -m skillopt_sleep.experiments.run_experiment --persona researcher --assert-improves`

Opt-in: experience replay & dream rollouts

Both consolidation mechanisms, experience replay and dream rollouts, are off by default, so behavior is unchanged unless you enable them. They strengthen the nightly update when your tasks have a clean correctness signal; the validation gate still governs what ships.

| Config knob | Default | Effect |
|---|---|---|
| `dream_rollouts` | `1` | Run each task K times → learn from the good-vs-bad contrast (contrastive reflection). |
| `recall_k` | `0` | Associative recall: pull the K most-similar past tasks (from a persisted archive) into tonight's dream. |
| `dream_factor` | `0` | Add N lightweight synthetic variants of each task. |
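
To make the knobs concrete, here is a minimal sketch of the three settings and of what "pull the K most-similar past tasks" could look like. Only the knob names and defaults come from the table above; the `ReplayKnobs` container, the `recall_similar` helper, and the token-overlap (Jaccard) similarity are assumptions made for illustration, and the real engine may store its config and measure similarity differently.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ReplayKnobs:
    """Hypothetical container; field names and defaults follow the table above."""
    dream_rollouts: int = 1  # K runs per task
    recall_k: int = 0        # past tasks recalled per task (0 = no recall)
    dream_factor: int = 0    # synthetic variants per task (0 = none)


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity (an assumed stand-in for the real metric)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def recall_similar(task: str, archive: Sequence[str], k: int) -> List[str]:
    """Return the k archived tasks most similar to `task` (empty when k <= 0)."""
    if k <= 0:
        return []
    ranked = sorted(archive, key=lambda past: jaccard(task, past), reverse=True)
    return ranked[:k]


archive = [
    "rename columns in a csv file",
    "write a git commit message",
    "merge two csv files by id",
]
knobs = ReplayKnobs(recall_k=1)
recalled = recall_similar("dedupe rows in a csv file", archive, knobs.recall_k)
print(recalled)
```

With the default `recall_k=0` the helper returns nothing, which matches the opt-in behavior described above.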

Results

📊 More results & analysis — the gate-safety stress test, experience-replay scaling, and the dream-diversity ablation — are in docs/sleep/RESULTS.md. The highlights:

Protocol (identical for every row below). 5 nights × 10 new real "today" tasks per night; the full held-out test split is scored before night 1 (baseline) and after night 5 (after); optimizer = GPT-5.5; single seed (42); run through the exact shipped engine (skillopt_sleep.dream.dream_consolidate). Numbers are absolute held-out accuracy; Δ = after − baseline in percentage points.

(a) End-to-end on real agents: gbrain-evals skillopt-v1. Deficient seed skills go 0.00 → 1.00 on the held-out set with both Claude Code and Codex as the target agent (all 4 seed skills, including a real tool-use loop).

(b) Experience replay scales the gain: SearchQA (1,400-item held-out test, SQuAD exact-match; target = GPT-5.5; validation-gated):

| Replay config (`dream_rollouts=5` unless noted) | Baseline → After | Δ (pts) |
|---|---|---|
| `recall_k=10` | 0.802 → 0.834 | +3.1 |
| `recall_k=20` | 0.803 → 0.848 | +4.5 |
| full-history replay (reference, not a shipping default) | 0.796 → 0.851 | +5.6 |
| `recall_k=10`, `dream_rollouts=8` (more dreaming, same recall) | 0.798 → 0.835 | +3.7 |

The gain rises monotonically with how much relevant past experience is recalled. The same SearchQA cell without the gate (recall_k=10) is 0.808 → 0.839 (+3.1).
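
For readers unfamiliar with the SQuAD exact-match metric named above, the sketch below follows the normalization used by the standard SQuAD v1.1 evaluation script (lowercase, strip punctuation, drop English articles, collapse whitespace). Whether this repository's scorer applies exactly this normalization is an assumption; the function names are illustrative.

```python
import re
import string
from typing import Sequence


def normalize_answer(text: str) -> str:
    """Lowercase, remove punctuation, drop a/an/the, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: Sequence[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    normalized = normalize_answer(prediction)
    return any(normalized == normalize_answer(gold) for gold in gold_answers)


def accuracy(predictions: Sequence[str], golds: Sequence[Sequence[str]]) -> float:
    """Fraction of items whose prediction exactly matches a gold answer."""
    hits = sum(exact_match(p, g) for p, g in zip(predictions, golds))
    return hits / len(predictions)


print(exact_match("The Eiffel Tower.", ["eiffel tower"]))
```

Exact match is a strict, checkable signal, which is the kind of correctness signal the scope note in (d) says these gains depend on.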

(c) Second benchmark: SpreadsheetBench (280-item held-out test; the agent's generated openpyxl code is executed and compared cell-by-cell to a golden workbook; target = GPT-5.4-nano; gate-free, with the output-contract guardrail): 0.279 → 0.314 (+3.6).

(d) Honest scope. These gains hold where tasks recur and have a checkable correctness signal. On saturated or noisy benchmarks (e.g. a strong model already near ceiling) the effect is flat within run-to-run noise. Single-seed baseline variance here is ±1–2 pts, so treat differences below roughly 1.5 pts as noise. The validation gate keeps the worst case bounded; keep it on by default.
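
As a worked example of the Δ definition in the protocol and the noise guidance in (d), the sketch below computes a gain in percentage points and flags it when it falls under the roughly 1.5 pt noise floor. The helper names are hypothetical, and treating 1.5 as a hard cutoff is a simplification of the guidance above.

```python
def delta_points(baseline: float, after: float) -> float:
    """Delta = after - baseline, expressed in percentage points (one decimal)."""
    return round((after - baseline) * 100, 1)


def within_noise(delta_pts: float, noise_floor_pts: float = 1.5) -> bool:
    """True if the change is smaller than the assumed noise floor."""
    return abs(delta_pts) < noise_floor_pts


# recall_k=20 row from the SearchQA table: 0.803 -> 0.848
gain = delta_points(0.803, 0.848)
print(gain, within_noise(gain))
```

Recomputing Δ from the rounded three-decimal accuracies can differ from a reported Δ by 0.1 pt, presumably because the reported values were computed before rounding.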

Learn more

Full reference (pipeline, the three plugins, the experience-replay knobs) is in the Documentation & Reproduction Guide.
