docs: clarify README and paper-aligned skill artifacts

2026-07-03 14:02:58 +08:00 · 2026-05-31 09:11:30 +00:00
parent b4850ce418
commit 9265545c45
2 changed files with 264 additions and 69 deletions
--- a/README.md
+++ b/README.md
@@ -4,7 +4,37 @@

 [![Project Page](https://img.shields.io/badge/Project%20Page-SkillOpt-8dbb3c)](https://microsoft.github.io/SkillOpt/) [![Paper](https://img.shields.io/badge/Paper-arXiv-b31b1b)](https://arxiv.org/abs/2605.23904) [![Project Video](https://img.shields.io/badge/Project%20Video-Watch%20Demo-ff0000)](https://youtu.be/JUBMDTCiM0M) [![Python 3.10+](https://img.shields.io/badge/Python-3.10%2B-blue.svg)](https://www.python.org/) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)

-## 🎬 SkillOpt Demo Video
+---
+
+## Overview
+
+Modern agent skills are usually hand-crafted, generated one-shot by a strong
+LLM, or evolved through loosely controlled self-revision — none of which
+behaves like a deep-learning optimizer for the skill itself, and none of
+which reliably improves over its starting point under feedback.
+
+**SkillOpt treats the skill document as the trainable state of a frozen
+agent**, and trains it with the discipline that makes weight-space
+optimization reproducible. A separate optimizer model turns scored rollouts
+into bounded add / delete / replace edits on a single skill document; a
+candidate edit is accepted only when it strictly improves a held-out
+validation score. A textual learning-rate budget, a rejected-edit buffer,
+and an epoch-wise slow / meta update make skill training stable while
+adding **zero inference-time model calls** at deployment.
+
+The deployed artifact is a compact `best_skill.md` (typically 300–2,000
+tokens) that runs against the unchanged target model. Across **six
+benchmarks, seven target models, and three execution harnesses** (direct
+chat, Codex CLI, Claude Code CLI), SkillOpt is best or tied-best on **all
+52 evaluated (model, benchmark, harness) cells** and on GPT-5.5 lifts the
+average no-skill accuracy by **+23.5 points in direct chat, +24.8 inside
+the Codex agentic loop, and +19.1 inside Claude Code**. Optimized skill
+artifacts transfer across model scales, between Codex and Claude Code
+harnesses, and to nearby math benchmarks without further optimization.
+
+For the full method, ablations, and per-cell results see the [paper](https://arxiv.org/abs/2605.23904); for a visual walkthrough of the loop see the [project page](https://microsoft.github.io/SkillOpt/); for deeper API / backend / benchmark docs see [`docs/`](docs/).
+
+## 🎬 Demo Video

 https://github.com/user-attachments/assets/eb12d3bc-371c-467f-904d-91b61f339ed7

@@ -16,14 +46,16 @@ https://github.com/user-attachments/assets/eb12d3bc-371c-467f-904d-91b61f339ed7

 ## Install

-**Requirements:** Python 3.10+
+### Requirements
+
+- Python 3.10+

 ```bash
 git clone https://github.com/microsoft/SkillOpt.git
 cd SkillOpt
 pip install -e .

-# For ALFWorld benchmark (optional):
+# For the ALFWorld benchmark (optional):
 pip install -e ".[alfworld]"
 alfworld-download
 ```
@@ -36,7 +68,8 @@ cp .env.example .env
 source .env
 ```

-**Azure OpenAI** (recommended):
+#### Azure OpenAI *(recommended)*
+
 ```bash
 export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
 # Option 1: API key auth
@@ -45,73 +78,40 @@ export AZURE_OPENAI_API_KEY="your-key"
 export AZURE_OPENAI_AUTH_MODE="azure_cli"
 ```

-> **Note:** `AZURE_OPENAI_ENDPOINT` is required for all three modes (`api_key`, `azure_cli`,
-> `openai_compatible`). Without it, all LLM calls will fail.
+> **Note:** `AZURE_OPENAI_ENDPOINT` is required for all three modes (`api_key`, `azure_cli`, `openai_compatible`). Without it, all LLM calls will fail.
+
+#### OpenAI-compatible endpoints

-**OpenAI-compatible endpoints**:
 ```bash
 export AZURE_OPENAI_ENDPOINT="https://api.openai.com/v1"
 export AZURE_OPENAI_API_KEY="sk-..."
 export AZURE_OPENAI_AUTH_MODE="openai_compatible"
 ```

-This routes all calls through the plain OpenAI Python client (no Azure auth, no `api-version`
-header).
+This routes all calls through the plain OpenAI Python client (no Azure auth, no `api-version` header).

-> **Note:** SkillOpt reuses the `AZURE_OPENAI_*` env var names even in this mode — there is no
-> separate `OPENAI_API_KEY` knob.
+> **Note:** SkillOpt reuses the `AZURE_OPENAI_*` env var names even in this mode — there is no separate `OPENAI_API_KEY` knob.
+
+#### Anthropic Claude

-**Anthropic Claude**:
 ```bash
 export ANTHROPIC_API_KEY="sk-ant-..."
 ```

-**Qwen (local vLLM)**:
+#### Qwen *(local vLLM)*
+
 ```bash
 export QWEN_CHAT_BASE_URL="http://localhost:8000/v1"
 export QWEN_CHAT_MODEL="Qwen/Qwen3.5-4B"
 ```

---
-
-## Data Preparation
-
-SkillOpt expects data in a **split directory** with `train/`, `val/`, `test/` subdirectories, each containing a JSON file (e.g., `items.json`).
+#### MiniMax

+```bash
+export MINIMAX_BASE_URL="https://api.minimax.io/v1"
+export MINIMAX_API_KEY="..."
+export MINIMAX_MODEL="MiniMax-M2.7"
 ```
-data/my_split/
-├── train/items.json
-├── val/items.json
-└── test/items.json
-```
-
-Each JSON file is an array of task items. The required fields depend on the benchmark. For example, SearchQA items look like:
-
-```json
-[
-  {
-    "id": "unique_item_id",
-    "question": "Who wrote the novel ...",
-    "context": "[DOC] relevant passage text ...",
-    "answers": ["expected answer"]
-  }
-]
-```
-
-See `skillopt/envs/<benchmark>/dataloader.py` for the exact format each benchmark expects.
-
-> **Note:** Benchmark datasets are not included in this repository. Prepare your own data following the format above.
-
-### Supported Benchmarks
-
-| Benchmark | Type | Config |
-|---|---|---|
-| SearchQA | QA | `configs/searchqa/default.yaml` |
-| ALFWorld | Embodied agent | `configs/alfworld/default.yaml` |
-| DocVQA | Document QA | `configs/docvqa/default.yaml` |
-| LiveMathematicianBench | Math | `configs/livemathematicianbench/default.yaml` |
-| SpreadsheetBench | Code generation | `configs/spreadsheetbench/default.yaml` |
-| OfficeQA | Tool-augmented QA | `configs/officeqa/default.yaml` |

 ---

@@ -181,8 +181,7 @@ python scripts/eval_only.py \
  --azure_openai_endpoint https://your-resource.openai.azure.com/
 ```

-To evaluate a skill produced by a training run, replace `--skill` with that
-run's best-skill path, for example `outputs/my_run/best_skill.md`.
+To evaluate a skill produced by your own training run, set `--skill` to that run's best-skill path, for example `outputs/my_run/best_skill.md`.

 | Split | Description |
 |---|---|
@@ -193,7 +192,7 @@ run's best-skill path, for example `outputs/my_run/best_skill.md`.

 ### Output Structure

-Each run writes to a structured output directory:
+Each training run writes to a structured output directory:

 ```
 outputs/<run_name>/
@@ -209,26 +208,148 @@ outputs/<run_name>/

 Re-running the same command auto-resumes from the last completed step.

+### Pretrained Skill Artifacts
+
+The paper-aligned GPT-5.5 optimized skills are shipped in
+[`ckpt/<benchmark>/gpt5.5_skill.md`](ckpt/) (one per benchmark — SearchQA,
+ALFWorld, DocVQA, LiveMathematicianBench, OfficeQA, SpreadsheetBench). Use
+them with `scripts/eval_only.py` to evaluate the paper-aligned skills on a
+matching data split without re-running training. See [`ckpt/README.md`](ckpt/README.md)
+for the full per-benchmark command. This is the first artifact batch; we
+plan to continue uploading the remaining optimized skills and benchmark
+split manifests as they are cleaned and verified.
+
 ---

-## Community-contributed configs
+## Data Preparation
+
+### Directory layout
+
+SkillOpt expects data in a **split directory** with `train/`, `val/`, `test/` subdirectories, each containing a JSON file (e.g., `items.json`):
+
+```
+data/my_split/
+├── train/items.json
+├── val/items.json
+└── test/items.json
+```
+
+Each JSON file is an array of task items. The required fields depend on the benchmark. For example, SearchQA items look like:
+
+```json
+[
+  {
+    "id": "unique_item_id",
+    "question": "Who wrote the novel ...",
+    "context": "[DOC] relevant passage text ...",
+    "answers": ["expected answer"]
+  }
+]
+```
+
+See `skillopt/envs/<benchmark>/dataloader.py` for the exact format each benchmark expects.
+
+> **Note:** Most benchmark datasets are not included in this repository. Prepare your own data following the format above. The exact SearchQA split used in the paper is shipped at [`data/searchqa_id_split/`](data/searchqa_id_split) (400 train / 200 val / 1400 test). We are preparing the remaining benchmark split manifests for upload.
+
+### Supported Benchmarks
+
+| Benchmark | Type | Config |
+|---|---|---|
+| SearchQA | QA | `configs/searchqa/default.yaml` |
+| ALFWorld | Embodied agent | `configs/alfworld/default.yaml` |
+| DocVQA | Document QA | `configs/docvqa/default.yaml` |
+| LiveMathematicianBench | Math | `configs/livemathematicianbench/default.yaml` |
+| SpreadsheetBench | Code generation | `configs/spreadsheetbench/default.yaml` |
+| OfficeQA | Tool-augmented QA | `configs/officeqa/default.yaml` |
+
+---
+
+## Configuration
+
+### Default settings and paper-reproduction knobs
+
+`configs/_base_/default.yaml` is the single source of truth for SkillOpt's
+runtime knobs. Out of the box, every shipped benchmark config inherits
+from it and keeps the paper protocol visible: 4 epochs, rollout batch 40,
+reflection minibatch 8, textual learning rate 4 with cosine decay, strict
+hard validation gating, and slow-update + meta-skill enabled. The slow-update
+acceptance policy is now explicit because `main` has moved forward from
+the paper snapshot: the shipped `ckpt/` skills were produced with the gated
+semantics described in paper Section 3.6, while the current `main` default
+uses the post-submission force-accept behavior.
+
+### Slow-update acceptance mode
+
+The epoch-boundary slow / meta update can be applied two ways, controlled
+by `optimizer.slow_update_gate_with_selection`:
+
+```yaml
+optimizer:
+  slow_update_gate_with_selection: false   # current main default
+```
+
+- **`false`** *(current `main` default)*: force-accept. The
+  slow-update guidance is injected into both `current_skill` and
+  `best_skill` unconditionally at the epoch boundary. This is the newer
+  post-submission behavior on `main`.
+- **`true`** *(paper / shipped-skill reproduction)*: gated, matching paper
+  Section 3.6. The slow-update candidate is evaluated on the
+  selection split and accepted only if it passes the same validation gate
+  as a step-level edit. Use this setting when re-running optimization to
+  match the paper protocol and the provenance of the shipped `ckpt/` skills.
+
+The trainer prints which mode is active at startup
+(`[slow update] acceptance=...`). See issue #22 for the discussion that
+led to the flag.
+
+### Gate metric (`hard` / `soft` / `mixed`)
+
+The validation gate compares candidate vs. current skills on the selection
+split using `gate_metric`:
+
+- **`hard`** *(default, paper)*: exact-match accuracy. A candidate must score
+  strictly higher than the current skill to be accepted.
+- **`soft`**: per-item soft / partial-credit score. Useful when the
+  selection split is small (e.g. ≤10 items) and the reward is continuous,
+  where the discrete hard gate often rejects every candidate.
+- **`mixed`**: weighted average, `(1 - w) * hard + w * soft`, with `w`
+  set by `gate_mixed_weight` (default `0.5`).
+
+The default is `hard`. The community-contributed `configs/examples/soft_gate.yaml` shows how to switch.
+
+### Community-contributed examples

 These are **not** default SkillOpt settings — they are reference configs
 contributed by users for specific scenarios. The paper-reported numbers
 were obtained with the default settings, not these.

- **`configs/examples/soft_gate.yaml`** *(PR #25, contributed by
-  [@lvbaocheng](https://github.com/lvbaocheng))* — switches the
-  validation gate from exact-match (`hard`) to soft / partial-credit
-  (`soft` or `mixed`). Useful when the held-out **selection split is
-  small** (e.g. ≤ ~10 items) and the **reward is continuous**, where the
-  discrete hard gate often rejects every candidate and training stalls.
-  See the comment at the top of the file for details and when not to use
-  it.
+- **[`configs/examples/soft_gate.yaml`](configs/examples/soft_gate.yaml)**
+  *(PR #25, contributed by [@lvbaocheng](https://github.com/lvbaocheng))* —
+  switches `gate_metric` to `soft` (or `mixed`). See the comment at the
+  top of the file for when to use it and when not to.

 ---

-## WebUI
+## Extensibility & WebUI
+
+### Adding a new backend
+
+A backend is a chat or exec target (e.g. `openai_chat`, `claude_chat`,
+`qwen_chat`, `minimax_chat`, `codex_exec`, `claude_code_exec`). See
+[`docs/guide/new-backend.md`](docs/guide/new-backend.md) for the full
+contract; in short you add a `skillopt/model/<name>_backend.py` module,
+register it in `skillopt/model/common.py` + `backend_config.py`, and wire
+it through the router in `skillopt/model/__init__.py`. `qwen_backend.py`
+and `minimax_backend.py` are good templates.
+
+### Adding a new benchmark
+
+A benchmark is a `skillopt/envs/<name>/` package with a `dataloader.py`, a
+`rollout.py`, and an `initial.md` seed skill. See
+[`docs/guide/new-benchmark.md`](docs/guide/new-benchmark.md) for the full
+contract; the simplest reference is `skillopt/envs/searchqa/`.
+
+### WebUI

 Launch the monitoring dashboard (optional):

@@ -243,11 +364,6 @@ python -m skillopt_webui.app
 | `--host` | `0.0.0.0` | Bind address |
 | `--share` | off | Create a public Gradio share link |

-```bash
-# With public share link (useful for remote servers)
-python -m skillopt_webui.app --share
-```
-
 ---

 ## Citation
--- a/ckpt/README.md
+++ b/ckpt/README.md
@@ -0,0 +1,79 @@
+# Paper-aligned optimized SkillOpt skills (GPT-5.5)
+
+This folder ships the GPT-5.5 best skills exported from SkillOpt training
+runs — one `gpt5.5_skill.md` per benchmark. You can plug them into
+`scripts/eval_only.py` to evaluate the paper-aligned optimized skills on a
+given split without re-running the training loop.
+
+> These are checkpoints associated with the paper, not a general-purpose
+> tool. They're here so you can verify the reported numbers and use the
+> skills as portable artifacts. If you want to *train* your own skill,
+> use `scripts/train.py` per the top-level README.
+>
+> This is the first artifact batch. We plan to continue uploading the
+> remaining optimized skills and benchmark split manifests as they are
+> cleaned and verified.
+
+## What's here
+
+| Benchmark | Skill artifact | Matching config |
+|---|---|---|
+| SearchQA | `ckpt/searchqa/gpt5.5_skill.md` | `configs/searchqa/default.yaml` |
+| ALFWorld | `ckpt/alfworld/gpt5.5_skill.md` | `configs/alfworld/default.yaml` |
+| DocVQA | `ckpt/docvqa/gpt5.5_skill.md` | `configs/docvqa/default.yaml` |
+| LiveMathematicianBench | `ckpt/livemath/gpt5.5_skill.md` | `configs/livemathematicianbench/default.yaml` |
+| OfficeQA | `ckpt/officeqa/gpt5.5_skill.md` | `configs/officeqa/default.yaml` |
+| SpreadsheetBench | `ckpt/spreadsheetbench/gpt5.5_skill.md` | `configs/spreadsheetbench/default.yaml` |
+
+Each file is a plain Markdown skill document (~2k–13k chars). It contains a
+protected `SLOW_UPDATE` section at the end that holds epoch-wise
+longitudinal guidance — that's expected, not a formatting issue.
+
+## How to evaluate a shipped skill
+
+`scripts/eval_only.py` runs a single skill against a data split without
+invoking the optimizer. Example for SearchQA against the test split:
+
+```bash
+python scripts/eval_only.py \
+  --config configs/searchqa/default.yaml \
+  --skill ckpt/searchqa/gpt5.5_skill.md \
+  --split valid_unseen \
+  --split_dir data/searchqa_id_split \
+  --azure_openai_endpoint https://your-resource.openai.azure.com/ \
+  --target_model gpt-5.5
+```
+
+Substitute the benchmark, config, skill path, and `--split_dir` to evaluate
+any of the other five. `--split valid_unseen` is the test split, `valid_seen`
+is the selection / validation split, `train` is the training split, and
+`all` runs all three.
+
+## On comparing to the paper numbers
+
+To compare against the paper-reported cells, use the same dataset split and
+scorer. SearchQA's split is checked in at `data/searchqa_id_split/` (400
+train / 200 selection / 1400 test). For the other benchmarks, point
+`--split_dir` at your own materialized split. With `split_mode: ratio`, the
+loader derives the split deterministically from `split_seed` (default `42`)
+and `split_ratio` (default `2:1:7`), so the same `data_path` and seed yield
+the same split across machines. Explicit per-benchmark split manifests are being prepared
+for upload — see issues #14 and #21.
+
+## Why force-accept vs. gated slow-update matters
+
+The shipped skills were produced with the gated slow-update semantics
+described in paper Section 3.6:
+
+```yaml
+optimizer:
+  slow_update_gate_with_selection: true
+```
+
+Current `main` defaults to `false` (force-accept mode), a newer
+post-submission behavior where the slow-update guidance is written into
+`current_skill` and `best_skill` unconditionally at the epoch boundary. If
+you re-train with the current default, you may produce a *different*
+`best_skill.md` than the one checked in here. Both modes are supported;
+see the top-level README's "Configuration -> Slow-update acceptance mode"
+section.