# Add a New Benchmark

Extend SkillOpt with your own benchmark in ~200 lines of code. We will use
a tiny worked example, `docfaithful`, that scores a target model on
how faithfully it answers questions grounded in a small reference doc.

> **Working reference.** The easiest way to start a new env is to read
> and copy from [`skillopt/envs/officeqa/`](https://github.com/microsoft/SkillOpt/tree/main/skillopt/envs/officeqa).
> Everything below is the same shape, simplified.

## What you need to build

To add a benchmark you implement four things:

1. **A `SplitDataLoader` subclass** — knows how to load train / val / test
   item dicts from disk.
2. **A rollout helper** — runs the target model on a batch of items
   under the current skill and scores each prediction.
3. **An `EnvAdapter` subclass** — wires the loader + rollout helper into
   SkillOpt's lifecycle (`build_*_env`, `rollout`, `reflect`,
   `get_task_types`).
4. **A YAML config** — references your env name plus the standard
   train / optimizer / gradient knobs.

Then a short registration block in `scripts/train.py`'s
`_register_builtins()` makes it discoverable.

---

## Step 1 — Create the package

```bash
mkdir -p skillopt/envs/docfaithful
touch skillopt/envs/docfaithful/__init__.py
```

## Step 2 — Implement the data loader

`skillopt/envs/docfaithful/dataloader.py`:

```python
from __future__ import annotations

import json
from pathlib import Path

from skillopt.datasets.base import SplitDataLoader


def _normalize(raw: dict) -> dict:
    """Make sure every item has an ``id``. Other keys are env-specific."""
    return {
        "id": str(raw["uid"]),
        "question": raw["question"],
        "ground_truth": raw["answer"],
        "reference_text": raw.get("reference", ""),
        "task_type": raw.get("category", "docfaithful"),
    }


class DocFaithfulDataLoader(SplitDataLoader):
    """Load DocFaithful items from JSON files inside each split dir."""

    def load_split_items(self, split_path: str) -> list[dict]:
        # split_path is e.g. data/docfaithful_split/train/
        json_files = sorted(Path(split_path).glob("*.json"))
        if not json_files:
            raise FileNotFoundError(f"No .json file found in {split_path}")
        with json_files[0].open(encoding="utf-8") as f:
            raw = json.load(f)
        return [_normalize(item) for item in raw]
```

Only `load_split_items()` is mandatory. If you also want to support
`split_mode="ratio"` (auto-split a single raw file into train/val/test),
override `load_raw_items(data_path)` as well — see
`skillopt/datasets/base.py` docstrings.
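
To make the expected on-disk shape concrete, here is a small standalone sketch that writes a sample split file and reads it back through the same mapping as `_normalize`. The file name `items.json` and the sample records are assumptions made for illustration; the loader above only requires that the first `*.json` file in the split directory holds a JSON array of objects with `uid`, `question`, and `answer` keys.

```python
import json
import tempfile
from pathlib import Path


def normalize(raw: dict) -> dict:
    """Same mapping as `_normalize` in the loader above."""
    return {
        "id": str(raw["uid"]),
        "question": raw["question"],
        "ground_truth": raw["answer"],
        "reference_text": raw.get("reference", ""),
        "task_type": raw.get("category", "docfaithful"),
    }


# Hypothetical raw records; the keys are the ones `_normalize` reads.
SAMPLE_RAW = [
    {"uid": 1, "question": "Who wrote the memo?", "answer": "Ada",
     "reference": "The memo was written by Ada.", "category": "lookup"},
    {"uid": 2, "question": "When was it sent?", "answer": "Monday"},
]


def write_and_load(split_dir: Path) -> list[dict]:
    """Write SAMPLE_RAW as one JSON array, then read it back normalized."""
    split_dir.mkdir(parents=True, exist_ok=True)
    (split_dir / "items.json").write_text(
        json.dumps(SAMPLE_RAW, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    first = sorted(split_dir.glob("*.json"))[0]
    with first.open(encoding="utf-8") as f:
        return [normalize(item) for item in json.load(f)]


with tempfile.TemporaryDirectory() as tmp:
    items = write_and_load(Path(tmp) / "train")
```

Note that the optional keys fall back to defaults: the second record ends up with an empty `reference_text` and the `docfaithful` task type.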

## Step 3 — Write the rollout helper

`skillopt/envs/docfaithful/rollout.py`:

```python
from __future__ import annotations

import json
import os
from pathlib import Path

from skillopt.model import chat_target


def _score(prediction: str, ground_truth: str) -> tuple[int, float]:
    """Trivial exact-match scorer. Replace with F1 / ROUGE / LLM-judge."""
    p = (prediction or "").strip().lower()
    g = (ground_truth or "").strip().lower()
    hard = int(p == g and bool(g))
    soft = 1.0 if hard else 0.0
    return hard, soft


def _rollout_one(item: dict, skill_content: str,
                 *, max_completion_tokens: int) -> dict:
    system = skill_content
    user = (
        f"Question: {item['question']}\n\n"
        f"Reference:\n{item.get('reference_text', '')}\n\n"
        "Answer:"
    )
    prediction, _usage = chat_target(
        system=system,
        user=user,
        max_completion_tokens=max_completion_tokens,
    )
    hard, soft = _score(prediction, item.get("ground_truth", ""))
    return {
        "id": str(item["id"]),
        "hard": hard,
        "soft": soft,
        "predicted_answer": prediction,
        "question": item.get("question", ""),
        "reference_text": item.get("reference_text", ""),
        "task_type": item.get("task_type", "docfaithful"),
    }


def run_batch(*, items: list[dict], skill_content: str, out_root: str,
              workers: int = 4, max_completion_tokens: int = 4096) -> list[dict]:
    """Run a batch of episodes (sequentially here; `workers` is unused)."""
    os.makedirs(out_root, exist_ok=True)
    # For brevity we go sequentially — swap in concurrent.futures.ThreadPoolExecutor
    # when network / model latency dominates.
    results = [
        _rollout_one(item, skill_content,
                     max_completion_tokens=max_completion_tokens)
        for item in items
    ]
    # Write UTF-8 explicitly: ensure_ascii=False can emit non-ASCII text.
    Path(out_root, "rollouts.json").write_text(
        json.dumps(results, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return results
```
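
The docstring of `_score` suggests replacing exact match with F1, among other options. As one possible replacement, here is a minimal sketch of a token-overlap F1 used as the `soft` signal while `hard` stays exact match. The whitespace tokenization and the function names are assumptions for illustration, not something SkillOpt prescribes.

```python
from collections import Counter


def token_f1(prediction: str, ground_truth: str) -> float:
    """Token-overlap F1 in [0, 1] over lowercased whitespace tokens."""
    pred = (prediction or "").lower().split()
    gold = (ground_truth or "").lower().split()
    if not pred or not gold:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def score(prediction: str, ground_truth: str) -> tuple[int, float]:
    """Keep `hard` as exact match; report token F1 as `soft`."""
    p = (prediction or "").strip().lower()
    g = (ground_truth or "").strip().lower()
    hard = int(p == g and bool(g))
    return hard, token_f1(p, g)
```

With this scorer a verbose but correct answer such as "the city of Paris" against "Paris" gets `hard = 0` and a partial `soft`, which gives the optimizer a gradient between total miss and exact hit.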

Two design points worth flagging:

- **Scoring lives here, not in `EnvAdapter`.** There is no `evaluate()`
  method on the ABC. Whatever signal you put in `hard` (0/1, or a float
  in [0, 1] for smoothed reward) and `soft` (float in [0, 1]) is what
  the optimizer reads.
- **Use `skillopt.model.chat_target`**, not raw OpenAI/Claude calls.
  That routes through whichever **chat** target backend the user
  configured (`openai_chat` / `claude_chat` / `qwen_chat` /
  `minimax_chat`) without your adapter caring. Exec-style backends
  (`codex_exec`, `claude_code_exec`) need env-specific rollout code —
  see `skillopt/envs/swebench/` for an example.
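
The comment in `run_batch` mentions swapping the sequential loop for `concurrent.futures.ThreadPoolExecutor`. A minimal sketch of that swap follows; `run_parallel` and `fake_rollout` are hypothetical names, and the fake rollout stands in for a real model call so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_parallel(items: list[dict],
                 rollout_one: Callable[[dict], dict],
                 workers: int = 4) -> list[dict]:
    """Apply `rollout_one` to every item on a thread pool, keeping input order."""
    if not items:
        return []
    with ThreadPoolExecutor(max_workers=max(1, workers)) as pool:
        # Executor.map yields results in the order of `items`,
        # regardless of which call finishes first.
        return list(pool.map(rollout_one, items))


def fake_rollout(item: dict) -> dict:
    """Stand-in for a real model call (hypothetical, for illustration)."""
    return {"id": str(item["id"]), "hard": 1, "soft": 1.0}


results = run_parallel([{"id": i} for i in range(5)], fake_rollout, workers=3)
```

Inside `run_batch`, `rollout_one` could be `functools.partial(_rollout_one, skill_content=skill_content, max_completion_tokens=max_completion_tokens)`. Threads suit this workload because each rollout mostly waits on network I/O.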

## Step 4 — Implement the environment adapter

`skillopt/envs/docfaithful/adapter.py`:

```python
from __future__ import annotations

from skillopt.datasets.base import BatchSpec
from skillopt.envs.base import EnvAdapter
from skillopt.envs.docfaithful.dataloader import DocFaithfulDataLoader
from skillopt.envs.docfaithful.rollout import run_batch


class DocFaithfulAdapter(EnvAdapter):
    """SkillOpt adapter for the DocFaithful benchmark."""

    def __init__(
        self,
        split_dir: str = "",
        data_path: str = "",
        split_mode: str = "split_dir",
        split_ratio: str = "2:1:7",
        split_seed: int = 42,
        split_output_dir: str = "",
        workers: int = 4,
        analyst_workers: int = 4,
        failure_only: bool = False,
        minibatch_size: int = 8,
        edit_budget: int = 4,
        seed: int = 42,
        limit: int = 0,
        max_completion_tokens: int = 4096,
    ) -> None:
        self.workers = workers
        self.analyst_workers = analyst_workers
        self.failure_only = failure_only
        self.minibatch_size = minibatch_size
        self.edit_budget = edit_budget
        self.max_completion_tokens = int(max_completion_tokens)
        self.dataloader = DocFaithfulDataLoader(
            split_dir=split_dir,
            data_path=data_path,
            split_mode=split_mode,
            split_ratio=split_ratio,
            split_seed=split_seed,
            split_output_dir=split_output_dir,
            seed=seed,
            limit=limit,
        )

    # ── Lifecycle ───────────────────────────────────────────────────────

    def setup(self, cfg: dict) -> None:
        super().setup(cfg)
        self.dataloader.setup(cfg)

    def get_dataloader(self):
        return self.dataloader

    # ── Env construction ────────────────────────────────────────────────

    def build_env_from_batch(self, batch: BatchSpec, **kwargs):
        # For dataset-backed envs the "manager" is just the items list.
        return list(batch.payload or [])

    def build_train_env(self, batch_size: int, seed: int, **kwargs):
        batch = self.dataloader.build_train_batch(
            batch_size=batch_size, seed=seed, **kwargs
        )
        return self.build_env_from_batch(batch, **kwargs)

    def build_eval_env(self, env_num: int, split: str, seed: int, **kwargs):
        batch = self.dataloader.build_eval_batch(
            env_num=env_num, split=split, seed=seed, **kwargs
        )
        return self.build_env_from_batch(batch, **kwargs)

    # ── The rollout method (reflect is inherited) ───────────────────────

    def rollout(self, env_manager, skill_content: str,
                out_dir: str, **kwargs) -> list[dict]:
        items: list[dict] = env_manager
        return run_batch(
            items=items,
            skill_content=skill_content,
            out_root=out_dir,
            workers=self.workers,
            max_completion_tokens=self.max_completion_tokens,
        )

    # reflect() is inherited from EnvAdapter — it delegates to
    # run_minibatch_reflect with your analyst_error_* / analyst_success_*
    # prompts. Override it only if you need custom reflection logic.

    def get_task_types(self) -> list[str]:
        seen: list[str] = []
        for item in (
            self.dataloader.train_items
            + self.dataloader.val_items
            + self.dataloader.test_items
        ):
            tt = str(item.get("task_type") or "docfaithful")
            if tt not in seen:
                seen.append(tt)
        return seen or ["docfaithful"]
```

### What the rollout actually does

Look back at `run_batch` from Step 3 — it sends each `item["question"]`
(plus its reference text) to the target model with `skill_content` as the
system prompt, scores the answer against `item["ground_truth"]`, and
returns a list of dicts:

```python
# Illustrative shape only. With Step 3's exact-match scorer, `soft` is
# always 0.0 or 1.0; fractional values need a softer scorer.
[
    {"id": "ex_001", "hard": 1, "soft": 0.92,
     "predicted_answer": "...", "question": "...",
     "reference_text": "..."},
    {"id": "ex_002", "hard": 0, "soft": 0.13, "fail_reason": "...", ...},
    ...
]
```

The trainer only requires `id`, `hard`, `soft`. The rest is preserved on
`RolloutResult.extras` (see `skillopt/types.py`) and is what the inherited
`reflect()` consumes via `run_minibatch_reflect`.
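
Since the trainer depends on those three keys, a cheap pre-flight check in a unit test can catch a malformed rollout dict early. The helper below is a hypothetical sketch, not how SkillOpt itself builds `RolloutResult`; it only mirrors the required/extras split described above and the [0, 1] range stated for `hard` and `soft` in Step 3.

```python
REQUIRED_KEYS = ("id", "hard", "soft")


def split_result(result: dict) -> tuple[dict, dict]:
    """Split one rollout dict into (required fields, extras).

    Raises ValueError if a required key is missing or a score is out of range.
    """
    missing = [k for k in REQUIRED_KEYS if k not in result]
    if missing:
        raise ValueError(f"rollout result missing keys: {missing}")
    for key in ("hard", "soft"):
        if not 0 <= float(result[key]) <= 1:
            raise ValueError(f"{key} must be in [0, 1], got {result[key]!r}")
    required = {k: result[k] for k in REQUIRED_KEYS}
    # Everything else is benchmark-specific context for reflection.
    extras = {k: v for k, v in result.items() if k not in REQUIRED_KEYS}
    return required, extras
```

Running it over the output of `run_batch` on a handful of items is usually enough to confirm the adapter's contract before a full training run.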

## Step 5 — Register the adapter

Edit [`scripts/train.py`](https://github.com/microsoft/SkillOpt/blob/main/scripts/train.py)
and add to `_register_builtins()`:

```python
try:
    from skillopt.envs.docfaithful.adapter import DocFaithfulAdapter
    _ENV_REGISTRY["docfaithful"] = DocFaithfulAdapter
except ImportError:
    pass  # docfaithful deps not installed — skip
```

There is **no `BENCHMARK_REGISTRY` dict in `skillopt/envs/__init__.py`** —
the registry lives in `scripts/train.py` and is populated lazily so that
optional deps don't break `--help`.

## Step 6 — Create the YAML config

`configs/docfaithful/default.yaml`:

```yaml
_base_: ../_base_/default.yaml   # NOTE: string, not list

model:
  reasoning_effort: medium

train:
  batch_size: 16
  accumulation: 1
  num_epochs: 4

gradient:
  minibatch_size: 8
  merge_batch_size: 8

optimizer:
  learning_rate: 4

env:
  name: docfaithful
  # Optional: a seed skill document. Create this file (or any markdown
  # file) yourself before the first run, or omit the key to let SkillOpt
  # start from an empty skill.
  skill_init: skillopt/envs/docfaithful/skills/initial.md
  split_mode: split_dir
  split_dir: data/docfaithful_split
  workers: 4
  max_completion_tokens: 4096
  limit: 0
```

> ⚠️ `_base_` is currently parsed as a **string path**, not a list. Write
> `_base_: ../_base_/default.yaml`, not `_base_: ['../_base_/default.yaml']`.
> See [`skillopt/config.py`](https://github.com/microsoft/SkillOpt/blob/main/skillopt/config.py)
> if you want to add list-form inheritance.
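
For intuition about what single-string `_base_` inheritance usually implies, here is a sketch of a recursive resolve-and-merge over plain dicts. This is an assumption about the general pattern, not the logic in `skillopt/config.py`: the nested-dict override semantics, the `resolve` and `deep_merge` names, the `load_base` callback, and the `TypeError` on list input are all hypothetical.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict: `override` wins, nested dicts merge key by key."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def resolve(config: dict, load_base) -> dict:
    """Resolve a single string-valued `_base_` key, recursively.

    `load_base` maps a path string to the parsed parent config (a dict).
    """
    base_path = config.get("_base_")
    child = {k: v for k, v in config.items() if k != "_base_"}
    if base_path is None:
        return child
    if not isinstance(base_path, str):
        raise TypeError("_base_ must be a string path, not a list")
    parent = resolve(load_base(base_path), load_base)
    return deep_merge(parent, child)
```

Under these assumed semantics, a child that sets only `train.batch_size` keeps every other `train` key from the base file.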

## Step 7 — Run

```bash
# If you set skill_init above, create the seed skill first:
# mkdir -p skillopt/envs/docfaithful/skills
# echo "# DocFaithful initial skill" > skillopt/envs/docfaithful/skills/initial.md

python scripts/train.py --config configs/docfaithful/default.yaml
```

If you get `ValueError: Unknown environment 'docfaithful'. Available: [...]`,
you forgot Step 5.

If you get `TypeError: Can't instantiate abstract class DocFaithfulAdapter`,
you forgot to implement one of the four abstract methods on `EnvAdapter`:
`build_train_env`, `build_eval_env`, `rollout`, `get_task_types`.
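
The `TypeError` comes from Python's `abc` machinery rather than from SkillOpt. The sketch below reproduces it with a stand-in base class, `MiniEnvAdapter`, which is hypothetical and only copies the four abstract method names listed above plus a concrete `reflect`. The `missing_methods` helper shows one way to see which method is still unimplemented.

```python
from abc import ABC, abstractmethod


class MiniEnvAdapter(ABC):
    """Stand-in ABC with the four abstract methods named above."""

    @abstractmethod
    def build_train_env(self, batch_size: int, seed: int, **kwargs): ...

    @abstractmethod
    def build_eval_env(self, env_num: int, split: str, seed: int, **kwargs): ...

    @abstractmethod
    def rollout(self, env_manager, skill_content: str, out_dir: str, **kwargs): ...

    @abstractmethod
    def get_task_types(self) -> list[str]: ...

    def reflect(self, *args, **kwargs):
        """Concrete default, so subclasses inherit it without overriding."""
        return "default reflect"


class Incomplete(MiniEnvAdapter):
    """Forgets `get_task_types`, so the class stays abstract."""

    def build_train_env(self, batch_size, seed, **kwargs):
        return []

    def build_eval_env(self, env_num, split, seed, **kwargs):
        return []

    def rollout(self, env_manager, skill_content, out_dir, **kwargs):
        return []


class Complete(Incomplete):
    def get_task_types(self) -> list[str]:
        return ["docfaithful"]


def missing_methods(cls: type) -> list[str]:
    """List the abstract methods a class still has to implement."""
    return sorted(getattr(cls, "__abstractmethods__", ()))
```

Calling `missing_methods(DocFaithfulAdapter)` in a REPL is a quick way to find the forgotten method without reading the traceback.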

## Tips

- Start with `train.batch_size: 4` and `limit: 10` while debugging.
- The `evaluate` half lives **inside your `rollout`**, not as a separate
  method — there is no `evaluate()` in the `EnvAdapter` ABC. Score the
  prediction in `run_batch` and put the score on each result dict's
  `hard` / `soft`.
- Noisy scoring kills the optimizer. Spend time on `run_batch`'s scoring
  before you spend time on prompts.
- If your benchmark needs heavy optional deps (selenium, vllm, ...),
  wrap the registration block with `try / except ImportError` (Step 5)
  so people without those deps can still `--help`.
- Copy `skillopt/envs/_template/` as a starting skeleton — it now
  implements the real abstract methods.