mirror of https://github.com/microsoft/SkillOpt.git synced 2026-07-03 22:24:36 +08:00

Files

Yifan Yang f9db99853b feat(plugins): ship SkillOpt-Sleep for Claude Code, Codex, and Copilot

Restructure into plugins/{claude-code,codex,copilot}/ — one engine, three thin
shells, all calling the shared plugins/run-sleep.sh -> python -m skillopt_sleep.

  - claude-code/: existing plugin moved here; runner delegates to the shared
    launcher (fixes repo-root resolution after the move).
  - codex/: ~/.codex/prompts/sleep.md custom prompt + ~/.agents/skills SKILL.md +
    install.sh + AGENTS.md hint — Codex's documented, stable extension surfaces.
  - copilot/: a stdlib-only MCP server (mcp_server.py) exposing sleep_* tools,
    plus mcp-config.example.json and a copilot-instructions snippet. Verified end
    to end (initialize -> tools/list -> tools/call returns real engine output).
  - plugins/README.md overview table; main README News + a dedicated SkillOpt-Sleep
    section; pyproject lists skillopt_sleep as a first-class package.

Decoupling emphasized throughout: open-source tool (skillopt_sleep/) with zero
dependency on the research package. 29 tests pass; all three shells resolve.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

2026-06-08 14:31:52 +00:00

3.2 KiB

Raw Permalink Blame History

SkillOpt-Sleep — plugins for Claude Code, Codex, and Copilot

One engine, three thin shells. SkillOpt-Sleep gives a local coding agent a nightly sleep cycle: it reviews your past sessions offline, replays your recurring tasks on your own API budget, and consolidates what it learns into validated long-term memory and skills — behind a held-out gate, staged for your review. Your agent gets better the more you use it, with no model-weight training.

It synthesizes three ideas: SkillOpt (validation-gated bounded text optimization — the research in this repo), Claude Dreams (offline memory consolidation; input never mutated; review-then-adopt), and the agent sleep literature (short-term experience → long-term competence).

This is an open-source tool, decoupled from the research code. The engine lives in the top-level skillopt_sleep/ package and has zero dependency on the paper's skillopt/ experiment package (the validation gate is vendored). You can ship/use it without the research stack.

The three integrations

Platform	Folder	Mechanism	Status
Claude Code	`claude-code/`	`.claude-plugin` + `/sleep` command + skill + hooks	full, installable
Codex	`codex/`	`~/.codex/prompts/sleep.md` + `~/.agents/skills` + `AGENTS.md`	full
Copilot	`copilot/`	MCP server (`sleep_*` tools) + `copilot-instructions`	full (MCP)

All three call the same plugins/run-sleep.sh → python -m skillopt_sleep, so behaviour is identical everywhere. Per-platform setup is in each folder's README.

Quick start (Claude Code)

git clone <repo-url> && cd SkillOpt-Sleep
# Claude Code:
/plugin marketplace add ./plugins/claude-code
/plugin install skillopt-sleep@skillopt-sleep
/sleep status

Codex: bash plugins/codex/install.sh. Copilot: register plugins/copilot/mcp_server.py as an MCP server.

What one "night" does

harvest ~/.claude (or session) transcripts → mine recurring tasks → replay offline
   → consolidate (reflect → bounded edit → GATE on real held-out tasks)
   → stage proposal → (you) adopt

Nothing live changes until you adopt; every adopt backs up first.

Controls (work on all platforms)

--gate on|off · --rollouts-k K (multi-rollout contrastive reflection) · --budget-tokens/--budget-minutes · --preferences "..." · separate optimizer/target models (--optimizer-model / --target-model) · slow-update long-term memory. Full guide: ../docs/sleep/CONTROLLABLE_DREAMING.md.

Does it actually work?

Validated on the public gbrain-evals skillopt-v1 benchmark with real models on both Claude and Codex: deficient skills go 0.00 → 1.00 on held-out sets (all 4 seeds incl. a real tool-use loop), cross-model transfer is positive, and the gate blocks regressions. Full results: ../docs/sleep/FINAL_REPORT.md.

Deterministic proof (no API key):

python -m skillopt_sleep.experiments.run_experiment --persona researcher --assert-improves

3.2 KiB Raw Permalink Blame History