mirror of
https://github.com/microsoft/SkillOpt.git
synced 2026-07-03 14:02:58 +08:00
docs: clarify README and paper-aligned skill artifacts
254
README.md
@@ -4,7 +4,37 @@
|
||||
|
||||
[Project page](https://microsoft.github.io/SkillOpt/) [Paper](https://arxiv.org/abs/2605.23904) [Demo video](https://youtu.be/JUBMDTCiM0M) [Python](https://www.python.org/) [License](LICENSE)
|
||||
|
||||
## 🎬 SkillOpt Demo Video
|
||||
---
|
||||
|
||||
## Overview
|
||||
|
||||
Modern agent skills are usually hand-crafted, generated one-shot by a strong
|
||||
LLM, or evolved through loosely controlled self-revision — none of which
|
||||
behaves like a deep-learning optimizer for the skill itself, and none of
|
||||
which reliably improves over its starting point under feedback.
|
||||
|
||||
**SkillOpt treats the skill document as the trainable state of a frozen
|
||||
agent**, and trains it with the discipline that makes weight-space
|
||||
optimization reproducible. A separate optimizer model turns scored rollouts
|
||||
into bounded add / delete / replace edits on a single skill document; a
|
||||
candidate edit is accepted only when it strictly improves a held-out
|
||||
validation score. A textual learning-rate budget, a rejected-edit buffer,
|
||||
and an epoch-wise slow / meta update make skill training stable while
|
||||
adding **zero inference-time model calls** at deployment.
|
||||
|
||||
The deployed artifact is a compact `best_skill.md` (typically 300–2,000
|
||||
tokens) that runs against the unchanged target model. Across **six
|
||||
benchmarks, seven target models, and three execution harnesses** (direct
|
||||
chat, Codex CLI, Claude Code CLI), SkillOpt is best or tied-best on **all
|
||||
52 evaluated (model, benchmark, harness) cells** and on GPT-5.5 lifts the
|
||||
average no-skill accuracy by **+23.5 points in direct chat, +24.8 inside
|
||||
the Codex agentic loop, and +19.1 inside Claude Code**. Optimized skill
|
||||
artifacts transfer across model scales, between Codex and Claude Code
|
||||
harnesses, and to nearby math benchmarks without further optimization.
|
||||
|
||||
For the full method, ablations, and per-cell results see the [paper](https://arxiv.org/abs/2605.23904); for a visual walkthrough of the loop see the [project page](https://microsoft.github.io/SkillOpt/); for deeper API / backend / benchmark docs see [`docs/`](docs/).
|
||||
|
||||
## 🎬 Demo Video
|
||||
|
||||
https://github.com/user-attachments/assets/eb12d3bc-371c-467f-904d-91b61f339ed7
|
||||
|
||||
@@ -16,14 +46,16 @@ https://github.com/user-attachments/assets/eb12d3bc-371c-467f-904d-91b61f339ed7
|
||||
|
||||
## Install
|
||||
|
||||
**Requirements:** Python 3.10+
|
||||
### Requirements
|
||||
|
||||
- Python 3.10+
|
||||
|
||||
```bash
|
||||
git clone https://github.com/microsoft/SkillOpt.git
|
||||
cd SkillOpt
|
||||
pip install -e .
|
||||
|
||||
# For ALFWorld benchmark (optional):
|
||||
# For the ALFWorld benchmark (optional):
|
||||
pip install -e ".[alfworld]"
|
||||
alfworld-download
|
||||
```
|
||||
@@ -36,7 +68,8 @@ cp .env.example .env
|
||||
source .env
|
||||
```
|
||||
|
||||
**Azure OpenAI** (recommended):
|
||||
#### Azure OpenAI *(recommended)*
|
||||
|
||||
```bash
|
||||
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
|
||||
# Option 1: API key auth
|
||||
@@ -45,73 +78,40 @@ export AZURE_OPENAI_API_KEY="your-key"
|
||||
export AZURE_OPENAI_AUTH_MODE="azure_cli"
|
||||
```
|
||||
|
||||
> **Note:** `AZURE_OPENAI_ENDPOINT` is required for all three modes (`api_key`, `azure_cli`,
|
||||
> `openai_compatible`). Without it, all LLM calls will fail.
|
||||
> **Note:** `AZURE_OPENAI_ENDPOINT` is required for all three modes (`api_key`, `azure_cli`, `openai_compatible`). Without it, all LLM calls will fail.
|
||||
|
||||
#### OpenAI-compatible endpoints
|
||||
|
||||
**OpenAI-compatible endpoints**:
|
||||
```bash
|
||||
export AZURE_OPENAI_ENDPOINT="https://api.openai.com/v1"
|
||||
export AZURE_OPENAI_API_KEY="sk-..."
|
||||
export AZURE_OPENAI_AUTH_MODE="openai_compatible"
|
||||
```
|
||||
|
||||
This routes all calls through the plain OpenAI Python client (no Azure auth, no `api-version`
|
||||
header).
|
||||
This routes all calls through the plain OpenAI Python client (no Azure auth, no `api-version` header).
|
||||
|
||||
> **Note:** SkillOpt reuses the `AZURE_OPENAI_*` env var names even in this mode — there is no
|
||||
> separate `OPENAI_API_KEY` knob.
|
||||
> **Note:** SkillOpt reuses the `AZURE_OPENAI_*` env var names even in this mode — there is no separate `OPENAI_API_KEY` knob.
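
As a quick pre-flight check, the environment-variable rules above can be sketched in a few lines of Python. This is only an illustration: `missing_env_vars` and `REQUIRED_BY_MODE` are hypothetical names, not part of SkillOpt. The per-mode requirements are inferred from the notes above, and treating an unset `AZURE_OPENAI_AUTH_MODE` as `api_key` is an assumption.

```python
import os

# Assumption: variables each auth mode needs, inferred from the README notes.
# SkillOpt's own validation may differ.
REQUIRED_BY_MODE = {
    "api_key": ["AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY"],
    "azure_cli": ["AZURE_OPENAI_ENDPOINT"],
    "openai_compatible": ["AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY"],
}


def missing_env_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    # Assumption: an unset auth mode is treated as "api_key" in this sketch.
    mode = env.get("AZURE_OPENAI_AUTH_MODE", "api_key")
    if mode not in REQUIRED_BY_MODE:
        raise ValueError(f"unknown AZURE_OPENAI_AUTH_MODE: {mode!r}")
    return [name for name in REQUIRED_BY_MODE[mode] if not env.get(name)]
```

Calling `missing_env_vars()` with no argument inspects the current process environment; passing a plain dict makes the check easy to unit-test.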
|
||||
|
||||
#### Anthropic Claude
|
||||
|
||||
**Anthropic Claude**:
|
||||
```bash
|
||||
export ANTHROPIC_API_KEY="sk-ant-..."
|
||||
```
|
||||
|
||||
**Qwen (local vLLM)**:
|
||||
#### Qwen *(local vLLM)*
|
||||
|
||||
```bash
|
||||
export QWEN_CHAT_BASE_URL="http://localhost:8000/v1"
|
||||
export QWEN_CHAT_MODEL="Qwen/Qwen3.5-4B"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Data Preparation
|
||||
|
||||
SkillOpt expects data in a **split directory** with `train/`, `val/`, `test/` subdirectories, each containing a JSON file (e.g., `items.json`).
|
||||
#### MiniMax
|
||||
|
||||
```bash
|
||||
export MINIMAX_BASE_URL="https://api.minimax.io/v1"
|
||||
export MINIMAX_API_KEY="..."
|
||||
export MINIMAX_MODEL="MiniMax-M2.7"
|
||||
```
|
||||
data/my_split/
|
||||
├── train/items.json
|
||||
├── val/items.json
|
||||
└── test/items.json
|
||||
```
|
||||
|
||||
Each JSON file is an array of task items. The required fields depend on the benchmark. For example, SearchQA items look like:
|
||||
|
||||
```json
|
||||
[
|
||||
{
|
||||
"id": "unique_item_id",
|
||||
"question": "Who wrote the novel ...",
|
||||
"context": "[DOC] relevant passage text ...",
|
||||
"answers": ["expected answer"]
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
See `skillopt/envs/<benchmark>/dataloader.py` for the exact format each benchmark expects.
|
||||
|
||||
> **Note:** Benchmark datasets are not included in this repository. Prepare your own data following the format above.
|
||||
|
||||
### Supported Benchmarks
|
||||
|
||||
| Benchmark | Type | Config |
|
||||
|---|---|---|
|
||||
| SearchQA | QA | `configs/searchqa/default.yaml` |
|
||||
| ALFWorld | Embodied agent | `configs/alfworld/default.yaml` |
|
||||
| DocVQA | Document QA | `configs/docvqa/default.yaml` |
|
||||
| LiveMathematicianBench | Math | `configs/livemathematicianbench/default.yaml` |
|
||||
| SpreadsheetBench | Code generation | `configs/spreadsheetbench/default.yaml` |
|
||||
| OfficeQA | Tool-augmented QA | `configs/officeqa/default.yaml` |
|
||||
|
||||
---
|
||||
|
||||
@@ -181,8 +181,7 @@ python scripts/eval_only.py \
|
||||
--azure_openai_endpoint https://your-resource.openai.azure.com/
|
||||
```
|
||||
|
||||
To evaluate a skill produced by a training run, replace `--skill` with that
|
||||
run's best-skill path, for example `outputs/my_run/best_skill.md`.
|
||||
To evaluate a skill produced by your own training run, replace `--skill` with that run's best-skill path, for example `outputs/my_run/best_skill.md`.
|
||||
|
||||
| Split | Description |
|
||||
|---|---|
|
||||
@@ -193,7 +192,7 @@ run's best-skill path, for example `outputs/my_run/best_skill.md`.
|
||||
|
||||
### Output Structure
|
||||
|
||||
Each run writes to a structured output directory:
|
||||
Each training run writes to a structured output directory:
|
||||
|
||||
```
|
||||
outputs/<run_name>/
|
||||
@@ -209,26 +208,148 @@ outputs/<run_name>/
|
||||
|
||||
Re-running the same command auto-resumes from the last completed step.
|
||||
|
||||
### Pretrained Skill Artifacts
|
||||
|
||||
The paper-aligned GPT-5.5 optimized skills are shipped in
|
||||
[`ckpt/<benchmark>/gpt5.5_skill.md`](ckpt/) (one per benchmark — SearchQA,
|
||||
ALFWorld, DocVQA, LiveMathematicianBench, OfficeQA, SpreadsheetBench). Use
|
||||
them with `scripts/eval_only.py` to evaluate the paper-aligned skills on a
|
||||
matching data split without re-running training. See [`ckpt/README.md`](ckpt/README.md)
|
||||
for the full per-benchmark command. This is the first artifact batch; we
|
||||
plan to continue uploading the remaining optimized skills and benchmark
|
||||
split manifests as they are cleaned and verified.
|
||||
|
||||
---
|
||||
|
||||
## Community-contributed configs
|
||||
## Data Preparation
|
||||
|
||||
### Directory layout
|
||||
|
||||
SkillOpt expects data in a **split directory** with `train/`, `val/`, `test/` subdirectories, each containing a JSON file (e.g., `items.json`):
|
||||
|
||||
```
data/my_split/
├── train/items.json
├── val/items.json
└── test/items.json
```
|
||||
|
||||
Each JSON file is an array of task items. The required fields depend on the benchmark. For example, SearchQA items look like:
|
||||
|
||||
```json
[
  {
    "id": "unique_item_id",
    "question": "Who wrote the novel ...",
    "context": "[DOC] relevant passage text ...",
    "answers": ["expected answer"]
  }
]
```
|
||||
|
||||
See `skillopt/envs/<benchmark>/dataloader.py` for the exact format each benchmark expects.
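
Before launching a run, it can help to sanity-check a split directory against the layout described above. The sketch below is a hypothetical standalone helper, not a SkillOpt API. It assumes the file name `items.json` and the SearchQA field names from the example; other benchmarks need a different `required` set, as defined by their `dataloader.py`.

```python
import json
from pathlib import Path

# Assumption: field names taken from the SearchQA example; other benchmarks differ.
SEARCHQA_FIELDS = frozenset({"id", "question", "context", "answers"})


def check_split_dir(split_dir, filename="items.json", required=SEARCHQA_FIELDS):
    """Return a list of human-readable problems found in a split directory."""
    problems = []
    for split in ("train", "val", "test"):
        path = Path(split_dir) / split / filename
        if not path.is_file():
            problems.append(f"{path}: missing file")
            continue
        try:
            items = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"{path}: invalid JSON ({exc})")
            continue
        if not isinstance(items, list):
            problems.append(f"{path}: top-level value must be an array")
            continue
        for index, item in enumerate(items):
            if not isinstance(item, dict):
                problems.append(f"{path}[{index}]: item is not an object")
                continue
            missing = set(required) - set(item)
            if missing:
                problems.append(f"{path}[{index}]: missing fields {sorted(missing)}")
    return problems
```

An empty return value means all three splits exist, parse as JSON arrays, and every item carries the expected fields.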
|
||||
|
||||
> **Note:** Most benchmark datasets are not included in this repository. Prepare your own data following the format above. The exact SearchQA split used in the paper is shipped at [`data/searchqa_id_split/`](data/searchqa_id_split) (400 train / 200 val / 1400 test). We are preparing the remaining benchmark split manifests for upload.
|
||||
|
||||
### Supported Benchmarks
|
||||
|
||||
| Benchmark | Type | Config |
|---|---|---|
| SearchQA | QA | `configs/searchqa/default.yaml` |
| ALFWorld | Embodied agent | `configs/alfworld/default.yaml` |
| DocVQA | Document QA | `configs/docvqa/default.yaml` |
| LiveMathematicianBench | Math | `configs/livemathematicianbench/default.yaml` |
| SpreadsheetBench | Code generation | `configs/spreadsheetbench/default.yaml` |
| OfficeQA | Tool-augmented QA | `configs/officeqa/default.yaml` |
|
||||
|
||||
---
|
||||
|
||||
## Configuration
|
||||
|
||||
### Default settings and paper-reproduction knobs
|
||||
|
||||
`configs/_base_/default.yaml` is the single source of truth for SkillOpt's
|
||||
runtime knobs. Out of the box, every shipped benchmark config inherits
|
||||
from it and keeps the paper protocol visible: 4 epochs, rollout batch 40,
|
||||
reflection minibatch 8, textual learning rate 4 with cosine decay, strict
|
||||
hard validation gating, and slow-update + meta-skill enabled. The slow-update
|
||||
acceptance policy is now explicit because `main` has moved forward from
|
||||
the paper snapshot: the shipped `ckpt/` skills were produced with the gated
|
||||
semantics described in paper Section 3.6, while the current `main` default
|
||||
uses the post-submission force-accept behavior.
|
||||
|
||||
### Slow-update acceptance mode
|
||||
|
||||
The epoch-boundary slow / meta update can be applied two ways, controlled
|
||||
by `optimizer.slow_update_gate_with_selection`:
|
||||
|
||||
```yaml
optimizer:
  slow_update_gate_with_selection: false  # current main default
```
|
||||
|
||||
- **`false`** *(current `main` default)*: force-accept. The
  slow-update guidance is injected into both `current_skill` and
  `best_skill` unconditionally at the epoch boundary. This is the newer
  post-submission behavior on `main`.
- **`true`** *(paper / shipped-skill reproduction)*: gated, matching paper
  Section 3.6 verbatim. The slow-update candidate is evaluated on the
  selection split and accepted only if it passes the same validation gate
  as a step-level edit. Use this setting when re-running optimization to
  match the paper protocol and the provenance of the shipped `ckpt/` skills.
|
||||
|
||||
The trainer prints which mode is active at startup
|
||||
(`[slow update] acceptance=...`). See issue #22 for the discussion that
|
||||
led to the flag.
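
The difference between the two modes comes down to a single decision at the epoch boundary. The following is a minimal sketch of that decision, assuming the strict-improvement gate described for step-level edits. `accept_slow_update` is a hypothetical name and this is not the trainer's actual code.

```python
def accept_slow_update(gate_with_selection, candidate_score, current_score):
    """Decide whether an epoch-boundary slow-update candidate is kept.

    gate_with_selection mirrors optimizer.slow_update_gate_with_selection.
    Scores are assumed to be selection-split scores under the active gate metric.
    """
    if not gate_with_selection:
        # Force-accept mode: the guidance is applied unconditionally.
        return True
    # Gated mode: same rule as a step-level edit, strict improvement only.
    return candidate_score > current_score
```

Under this reading, a candidate that merely ties the current score is kept in force-accept mode but rejected in gated mode, which is one way the two settings can yield different final skills.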
|
||||
|
||||
### Gate metric (`hard` / `soft` / `mixed`)
|
||||
|
||||
The validation gate compares candidate vs. current skills on the selection
|
||||
split using `gate_metric`:
|
||||
|
||||
- **`hard`** *(default, paper)*: exact-match accuracy. The candidate must
  score strictly higher than the current skill to be accepted.
- **`soft`**: per-item soft / partial-credit score. Useful when the
  selection split is small (e.g. ≤10 items) and the reward is continuous,
  where the discrete hard gate often rejects every candidate.
- **`mixed`**: weighted average, `(1 - w) * hard + w * soft`, with `w`
  set by `gate_mixed_weight` (default `0.5`).
|
||||
|
||||
The default is `hard`. The community-contributed example config below shows how to switch.
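
To make the three gate metrics concrete, here is a small sketch of how the gate score could be computed from a hard (exact-match) and a soft (partial-credit) score. The function names are hypothetical and the code is illustrative only. It uses the `mixed` formula quoted above and assumes the strict "greater than" comparison applies to every metric.

```python
def gate_score(hard, soft, gate_metric="hard", gate_mixed_weight=0.5):
    """Combine hard and soft selection-split scores into one gate score."""
    if gate_metric == "hard":
        return hard
    if gate_metric == "soft":
        return soft
    if gate_metric == "mixed":
        w = gate_mixed_weight
        return (1 - w) * hard + w * soft
    raise ValueError(f"unknown gate_metric: {gate_metric!r}")


def passes_gate(candidate_score, current_score):
    """A candidate is accepted only if it strictly improves the gate score."""
    return candidate_score > current_score
```

For example, on a 10-item selection split a candidate might leave exact-match accuracy unchanged at 0.4 while raising the soft score from 0.40 to 0.46. The `hard` gate rejects it, whereas `soft` and `mixed` accept it.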
|
||||
|
||||
### Community-contributed examples
|
||||
|
||||
These are **not** default SkillOpt settings — they are reference configs
|
||||
contributed by users for specific scenarios. The paper-reported numbers
|
||||
were obtained with the default settings, not these.
|
||||
|
||||
- **`configs/examples/soft_gate.yaml`** *(PR #25, contributed by
|
||||
[@lvbaocheng](https://github.com/lvbaocheng))* — switches the
|
||||
validation gate from exact-match (`hard`) to soft / partial-credit
|
||||
(`soft` or `mixed`). Useful when the held-out **selection split is
|
||||
small** (e.g. ≤ ~10 items) and the **reward is continuous**, where the
|
||||
discrete hard gate often rejects every candidate and training stalls.
|
||||
See the comment at the top of the file for details and when not to use
|
||||
it.
|
||||
- **[`configs/examples/soft_gate.yaml`](configs/examples/soft_gate.yaml)**
|
||||
*(PR #25, contributed by [@lvbaocheng](https://github.com/lvbaocheng))* —
|
||||
switches `gate_metric` to `soft` (or `mixed`). See the comment at the
|
||||
top of the file for when to use and when not to.
|
||||
|
||||
---
|
||||
|
||||
## WebUI
|
||||
## Extensibility & WebUI
|
||||
|
||||
### Adding a new backend
|
||||
|
||||
A backend = a chat / exec target (e.g. `openai_chat`, `claude_chat`,
|
||||
`qwen_chat`, `minimax_chat`, `codex_exec`, `claude_code_exec`). See
|
||||
[`docs/guide/new-backend.md`](docs/guide/new-backend.md) for the full
|
||||
contract; in short you add a `skillopt/model/<name>_backend.py` module,
|
||||
register it in `skillopt/model/common.py` + `backend_config.py`, and wire
|
||||
it through the router in `skillopt/model/__init__.py`. `qwen_backend.py`
|
||||
and `minimax_backend.py` are good templates.
|
||||
|
||||
### Adding a new benchmark
|
||||
|
||||
A benchmark = a `skillopt/envs/<name>/` package with a `dataloader.py`, a
|
||||
`rollout.py`, and an `initial.md` seed skill. See
|
||||
[`docs/guide/new-benchmark.md`](docs/guide/new-benchmark.md) for the full
|
||||
contract; the simplest reference is `skillopt/envs/searchqa/`.
|
||||
|
||||
### WebUI
|
||||
|
||||
Launch the monitoring dashboard (optional):
|
||||
|
||||
@@ -243,11 +364,6 @@ python -m skillopt_webui.app
|
||||
| `--host` | `0.0.0.0` | Bind address |
|
||||
| `--share` | off | Create a public Gradio share link |
|
||||
|
||||
```bash
|
||||
# With public share link (useful for remote servers)
|
||||
python -m skillopt_webui.app --share
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Citation
|
||||
|
||||
79
ckpt/README.md
Normal file
@@ -0,0 +1,79 @@
|
||||
# Paper-aligned optimized SkillOpt skills (GPT-5.5)
|
||||
|
||||
This folder ships the GPT-5.5 best skills exported from SkillOpt training
|
||||
runs — one `gpt5.5_skill.md` per benchmark. You can plug them into
|
||||
`scripts/eval_only.py` to evaluate the paper-aligned optimized skills on a
|
||||
given split without re-running the training loop.
|
||||
|
||||
> These are checkpoints associated with the paper, not a general-purpose
|
||||
> tool. They're here so you can verify the reported numbers and use the
|
||||
> skills as portable artifacts. If you want to *train* your own skill,
|
||||
> use `scripts/train.py` per the top-level README.
|
||||
>
|
||||
> This is the first artifact batch. We plan to continue uploading the
|
||||
> remaining optimized skills and benchmark split manifests as they are
|
||||
> cleaned and verified.
|
||||
|
||||
## What's here
|
||||
|
||||
| Benchmark | Skill artifact | Matching config |
|---|---|---|
| SearchQA | `ckpt/searchqa/gpt5.5_skill.md` | `configs/searchqa/default.yaml` |
| ALFWorld | `ckpt/alfworld/gpt5.5_skill.md` | `configs/alfworld/default.yaml` |
| DocVQA | `ckpt/docvqa/gpt5.5_skill.md` | `configs/docvqa/default.yaml` |
| LiveMathematicianBench | `ckpt/livemath/gpt5.5_skill.md` | `configs/livemathematicianbench/default.yaml` |
| OfficeQA | `ckpt/officeqa/gpt5.5_skill.md` | `configs/officeqa/default.yaml` |
| SpreadsheetBench | `ckpt/spreadsheetbench/gpt5.5_skill.md` | `configs/spreadsheetbench/default.yaml` |
|
||||
|
||||
Each file is a plain Markdown skill document (~2k–13k chars). It contains a
|
||||
protected `SLOW_UPDATE` section at the end that holds epoch-wise
|
||||
longitudinal guidance — that's expected, not a formatting issue.
|
||||
|
||||
## How to evaluate a shipped skill
|
||||
|
||||
`scripts/eval_only.py` runs a single skill against a data split without
|
||||
invoking the optimizer. Example for SearchQA against the test split:
|
||||
|
||||
```bash
python scripts/eval_only.py \
    --config configs/searchqa/default.yaml \
    --skill ckpt/searchqa/gpt5.5_skill.md \
    --split valid_unseen \
    --split_dir data/searchqa_id_split \
    --azure_openai_endpoint https://your-resource.openai.azure.com/ \
    --target_model gpt-5.5
```
|
||||
|
||||
Substitute the benchmark, config, skill path, and `--split_dir` to evaluate
|
||||
any of the other five. `--split valid_unseen` is the test split, `valid_seen`
|
||||
is the selection / validation split, `train` is the training split, and
|
||||
`all` runs all three.
|
||||
|
||||
## On comparing to the paper numbers
|
||||
|
||||
To compare against the paper-reported cells, use the same dataset split and
|
||||
scorer. SearchQA's split is checked in at `data/searchqa_id_split/` (400
|
||||
train / 200 selection / 1400 test). For the other benchmarks, point
|
||||
`--split_dir` at your own materialized split; the loader is deterministic
|
||||
from `split_seed` (default `42`) + `split_ratio` (default `2:1:7`) when
|
||||
`split_mode: ratio` is used, so a given `data_path` + seed reproduces
|
||||
across machines. Explicit per-benchmark split manifests are being prepared
|
||||
for upload — see issues #14 and #21.
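
To illustrate why a fixed seed and ratio reproduce across machines, here is a minimal sketch of a seeded ratio split. It is not SkillOpt's loader, and the item-to-split assignment it produces will not match the real one. `ratio_split` is a hypothetical helper, and the sort-then-shuffle procedure is an assumption chosen only to show the idea.

```python
import random


def ratio_split(item_ids, split_seed=42, split_ratio=(2, 1, 7)):
    """Deterministically partition ids into train / val / test by ratio."""
    # Sort first so the result does not depend on the input order.
    ids = sorted(item_ids)
    # A private Random instance keeps the shuffle independent of global state.
    random.Random(split_seed).shuffle(ids)
    total = sum(split_ratio)
    n_train = len(ids) * split_ratio[0] // total
    n_val = len(ids) * split_ratio[1] // total
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }
```

With 2,000 items and the default `2:1:7` ratio, this yields 400 / 200 / 1,400 items, the same sizes as the checked-in SearchQA split.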
|
||||
|
||||
## Why force-accept vs. gated slow-update matters
|
||||
|
||||
The shipped skills were produced with the gated slow-update semantics
|
||||
described in paper Section 3.6:
|
||||
|
||||
```yaml
optimizer:
  slow_update_gate_with_selection: true
```
|
||||
|
||||
Current `main` defaults to `false` (force-accept mode), a newer
|
||||
post-submission behavior where the slow-update guidance is written into
|
||||
`current_skill` and `best_skill` unconditionally at the epoch boundary. If
|
||||
you re-train with the current default, you may produce a *different*
|
||||
`best_skill.md` than the one checked in here. Both modes are supported;
|
||||
see the top-level README's "Configuration -> Slow-update acceptance mode"
|
||||
section.
|
||||