mirror of https://github.com/microsoft/SkillOpt.git synced 2026-07-03 14:02:58 +08:00

Yifan Yang d02098ffc4 docs(sleep): add full Results & Analysis (RESULTS.md); link from README

Adds docs/sleep/RESULTS.md — the complete deployment-scale study behind
SkillOpt-Sleep, presented rigorously (named benchmarks, test sizes, metrics,
baseline->after, single shared protocol):
  1. Gate-safety stress test: ungated nano SearchQA collapses 0.554->0.026
     (-52.8); the gated twin holds 0.570 — the core argument for the design.
  2. Full 18-cell deployment grid (3 benchmarks x 3 targets x gate/free),
     shipped config: mean +0.5, range [-2.4, +5.1], nothing hidden.
  3. Experience-replay scaling (recall_k 10->20->full: +3.1->+4.5->+5.6) and
     the night-by-night climb (0.798->...->0.858, gate accepts as late as N5).
  4. Dream-diversity fix as defense-in-depth: 3-config grid comparison
     (-2.66/-52.8 -> +0.24/-4.0 -> +0.53/-2.4); the -52.8 cell becomes +2.7
     from the dream fix alone.
  5. gbrain end-to-end 0.00->1.00 on real Claude + Codex.
  6. Honest scope: where it helps vs flat-in-noise, single-seed caveat with a
     seed-robustness spot check, keep-the-gate-on.
README Results section now links prominently to it. Docs only; numbers are
self-contained with reproduce commands (no raw run dumps committed).

2026-06-15 16:49:13 +00:00

Name	Last commit	Date
ckpt	docs: clarify optional features and ckpt artifacts	2026-05-31 09:36:25 +00:00
configs	feat(optimizer): skill-aware reflection (EmbodiSkill S_app), config-controlled and env-independent	2026-06-10 13:10:08 +00:00
data	Release data split manifests	2026-06-01 16:02:14 +00:00
docs	docs(sleep): add full Results & Analysis (RESULTS.md); link from README	2026-06-15 16:49:13 +00:00
plugins	docs: move SkillOpt-Sleep into the guide; clean docs/sleep; fix guide link	2026-06-15 16:20:50 +00:00
scripts	feat(optimizer): skill-aware reflection (EmbodiSkill S_app), config-controlled and env-independent	2026-06-10 13:10:08 +00:00
skillopt	refactor: make EnvAdapter.reflect a shared default (fixes dropped reflect kwargs)	2026-06-15 09:06:00 +00:00
skillopt_sleep	feat(sleep): experience replay + dream rollouts in the cycle (opt-in)	2026-06-15 15:58:27 +00:00
skillopt_webui	refactor: rename teacher/student to optimizer/target, remove best skills, fix slow update	2026-05-24 19:15:10 +00:00
skillopt-assets	Use official arXiv logomark	2026-05-24 19:43:19 +00:00
tests	Add Codex Desktop transcript harvesting	2026-06-15 10:23:08 +00:00
.env.example	fix: use correct MiniMax endpoint, model name, and add .venv to gitignore	2026-05-31 05:27:50 +08:00
.gitignore	chore: ignore local experiment launcher scripts (machine-specific endpoints/identities)	2026-06-10 13:10:55 +00:00
CONTRIBUTING.md	SkillOpt v0.1.0: initial release	2026-05-21 17:22:04 +00:00
index.html	Update BibTeX entry in index.html	2026-05-25 14:30:01 +08:00
LICENSE	SkillOpt v0.1.0: initial release	2026-05-21 17:22:04 +00:00
mkdocs.yml	docs: add local environment smoke test guide	2026-05-29 09:26:38 +09:00
pyproject.toml	feat(plugins): ship SkillOpt-Sleep for Claude Code, Codex, and Copilot	2026-06-08 14:31:52 +00:00
README.md	docs(sleep): add a SkillOpt-Sleep module readme + News mention	2026-06-15 16:31:15 +00:00
requirements.txt	SkillOpt v0.1.0: initial release	2026-05-21 17:22:04 +00:00
SECURITY.md	Microsoft mandatory file	2026-05-22 10:48:38 +00:00
skillopt.html	Update BibTeX entry for SkillOpt publication	2026-05-25 14:28:13 +08:00

README.md

SkillOpt: Executive Strategy for Self-Evolving Agent Skills

Train agent skills the way you train neural networks, with epochs, (mini-)batch sizes, learning rates, and validation gates, but without touching model weights.

📖 For installation, data preparation, training/eval commands, the full configuration reference, and framework internals, see the Documentation & Reproduction Guide (rendered on GitHub Pages).

News 🔥🔥🔥

[2026-06-15] 😴 SkillOpt-Sleep (preview) — a nightly offline self-evolution companion for local coding agents (Claude Code / Codex / Copilot): review past sessions, replay recurring tasks, and consolidate validated skills behind a held-out gate. See docs/sleep/README.md for what it is, how to use it, and results.
[2026-06-03] 🎉 gbrain, gbrain-evals, and darwin-skill have all integrated SkillOpt.
[2026-06-02] 🎉 SkillOpt v0.1.0 is now available on PyPI! Install with `pip install skillopt`. This initial release includes the full training loop (rollout → reflect → aggregate → select → update → evaluate), multi-backend support (OpenAI / Azure / Claude / Qwen / MiniMax), six built-in benchmarks, and a WebUI dashboard.

Overview

Modern agent skills are usually hand-crafted, generated one-shot by a strong LLM, or evolved through loosely controlled self-revision — none of which behaves like a deep-learning optimizer for the skill itself, and none of which reliably improves over its starting point under feedback.

SkillOpt treats the skill document as the trainable state of a frozen agent, and trains it with the discipline that makes weight-space optimization reproducible. A separate optimizer model turns scored rollouts into bounded add / delete / replace edits on a single skill document; a candidate edit is accepted only when it strictly improves a held-out validation score. A textual learning-rate budget, a rejected-edit buffer, and an epoch-wise slow / meta update make skill training stable while adding zero inference-time model calls at deployment.
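The acceptance rule described above can be illustrated with a toy sketch. This is not SkillOpt's implementation: the `Edit` class, `apply_edit`, `gated_step`, and the toy validator are hypothetical names invented for this example, and treating the learning-rate budget as a simple cap on the number of edits tried per step is an assumption made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Edit:
    """One bounded edit to the skill document (hypothetical representation)."""
    op: str           # "add", "delete", or "replace"
    target: str = ""  # existing line to delete or replace; unused for "add"
    text: str = ""    # new line for "add" or "replace"


def apply_edit(skill: List[str], edit: Edit) -> List[str]:
    """Return a new skill with the edit applied; the input list is not mutated."""
    lines = list(skill)
    if edit.op == "add":
        lines.append(edit.text)
    elif edit.op == "delete":
        lines.remove(edit.target)
    elif edit.op == "replace":
        lines[lines.index(edit.target)] = edit.text
    else:
        raise ValueError(f"unknown edit op: {edit.op!r}")
    return lines


def gated_step(
    skill: List[str],
    candidates: List[Edit],
    validate: Callable[[List[str]], float],
    max_edits: int,
) -> Tuple[List[str], float, List[Edit]]:
    """Try at most max_edits candidates; keep one only if validation strictly improves."""
    best = validate(skill)
    rejected: List[Edit] = []  # stand-in for a rejected-edit buffer
    for edit in candidates[:max_edits]:  # assumed reading of the edit budget
        trial = apply_edit(skill, edit)
        score = validate(trial)
        if score > best:  # strict improvement on the held-out score
            skill, best = trial, score
        else:
            rejected.append(edit)
    return skill, best, rejected


# Toy held-out validator: fraction of required instructions present in the skill.
REQUIRED = {"Answer concisely.", "Cite the source."}


def toy_validate(skill: List[str]) -> float:
    return len(REQUIRED & set(skill)) / len(REQUIRED)


candidates = [
    Edit("add", text="Cite the source."),
    Edit("add", text="Use emojis."),
    Edit("delete", target="Answer concisely."),
]
final_skill, final_score, rejected = gated_step(
    ["Answer concisely."], candidates, toy_validate, max_edits=3
)
print(final_skill, final_score, len(rejected))
```

In this toy run only the first edit is accepted: the second leaves the score unchanged and the third lowers it, so both land in the rejected list. A tie is rejected because the gate requires strict improvement.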

The deployed artifact is a compact best_skill.md (typically 300–2,000 tokens) that runs against the unchanged target model. Across six benchmarks, seven target models, and three execution harnesses (direct chat, Codex CLI, Claude Code CLI), SkillOpt is best or tied-best on all 52 evaluated (model, benchmark, harness) cells. On GPT-5.5, it lifts the average no-skill accuracy by +23.5 points in direct chat, +24.8 inside the Codex agentic loop, and +19.1 inside Claude Code. Optimized skill artifacts transfer across model scales, between Codex and Claude Code harnesses, and to nearby benchmarks without further optimization.

For the full method, ablations, and per-cell results see the paper; for a visual walkthrough of the loop see the project page; for deeper API / backend / benchmark docs see docs/.

🎬 Demo Video

https://github.com/user-attachments/assets/eb12d3bc-371c-467f-904d-91b61f339ed7

▶ Watch the full demo on YouTube

Extensibility & WebUI

Adding a new backend

A backend = a chat / exec target (e.g. openai_chat, claude_chat, qwen_chat, minimax_chat, codex_exec, claude_code_exec). See docs/guide/new-backend.md for the full contract; in short you add a skillopt/model/<name>_backend.py module, register it in skillopt/model/common.py + backend_config.py, and wire it through the router in skillopt/model/__init__.py. qwen_backend.py and minimax_backend.py are good templates.

Adding a new benchmark

A benchmark = a skillopt/envs/<name>/ package with a dataloader.py, a rollout.py, and an initial.md seed skill. See docs/guide/new-benchmark.md for the full contract; the simplest reference is skillopt/envs/searchqa/.
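As a quick sanity check before wiring up a new benchmark, a small script along these lines can confirm that the package contains the three files named above. The helper `missing_benchmark_files` is a hypothetical name for this sketch and is not part of SkillOpt; the actual contract (function names, signatures, data formats) is defined in docs/guide/new-benchmark.md and is not checked here.

```python
import tempfile
from pathlib import Path
from typing import List

# The three files the README names for a benchmark package.
REQUIRED_FILES = ("dataloader.py", "rollout.py", "initial.md")


def missing_benchmark_files(package_dir: Path) -> List[str]:
    """Return the required file names that are absent from package_dir."""
    return [name for name in REQUIRED_FILES if not (package_dir / name).is_file()]


# Demo on a throwaway directory so the example runs without a SkillOpt checkout.
with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp) / "mybench"  # hypothetical benchmark name
    pkg.mkdir()
    (pkg / "dataloader.py").write_text("# loads the benchmark examples\n")
    (pkg / "initial.md").write_text("Seed skill text.\n")
    demo_missing = missing_benchmark_files(pkg)

print(demo_missing)
```

Against a real checkout, the same function could be pointed at `Path("skillopt/envs") / "<name>"`.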

WebUI

Launch the monitoring dashboard (optional):

pip install -e ".[webui]"
python -m skillopt_webui.app

Flag	Default	Description
`--port`	7860	Server port
`--host`	`0.0.0.0`	Bind address
`--share`	off	Create a public Gradio share link
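If the dashboard is started from a Python script rather than a shell, the flags in the table can be assembled as below. This is a sketch: `build_webui_command` is a hypothetical helper, and it assumes `--share` is a boolean switch that takes no value, which the "off" default suggests but the table does not state.

```python
import sys
from typing import List


def build_webui_command(
    port: int = 7860, host: str = "0.0.0.0", share: bool = False
) -> List[str]:
    """Build the argv list for launching the WebUI with the documented flags."""
    cmd = [
        sys.executable, "-m", "skillopt_webui.app",
        "--port", str(port),
        "--host", host,
    ]
    if share:
        cmd.append("--share")  # assumed to be a value-less switch
    return cmd


# Bind to localhost only, on a non-default port.
cmd = build_webui_command(port=8080, host="127.0.0.1")
print(" ".join(cmd))

# To launch for real (requires the ".[webui]" extra to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

The default bind address `0.0.0.0` listens on all network interfaces, so passing `--host 127.0.0.1` is one way to keep the dashboard reachable only from the local machine.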

Citation

@misc{yang2026skilloptexecutivestrategyselfevolving,
      title={SkillOpt: Executive Strategy for Self-Evolving Agent Skills}, 
      author={Yifan Yang and Ziyang Gong and Weiquan Huang and Qihao Yang and Ziwei Zhou and Zisu Huang and Yan Li and Xuemei Gao and Qi Dai and Bei Liu and Kai Qiu and Yuqing Yang and Dongdong Chen and Xue Yang and Chong Luo},
      year={2026},
      eprint={2605.23904},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2605.23904}
}
