docs: align API reference and Add-a-Benchmark guide with real EnvAdapter ABC

docs/reference/api.md previously documented a fictional EnvAdapter API (execute / evaluate / build_prompt + DataItem / TaskResult) and a BENCHMARK_REGISTRY that never existed in code. Anyone following the documented contract would hit ImportError or TypeError on the first instantiation.

Replace both pages with the real shape from skillopt/envs/base.py and skillopt/datasets/base.py:

- EnvAdapter: build_train_env, build_eval_env, rollout, reflect, get_task_types (the 5 actual abstract methods).
- Rollout dicts: id / hard / soft required; everything else preserved into RolloutResult.extras.
- Reflect dicts: {patch, source_type} schema as consumed by run_minibatch_reflect.
- BatchSpec: slotted-but-mutable dataclass matching the actual definition (payload defaults to None, metadata to dict()).
- SplitDataLoader.load_split_items as the one mandatory loader method.
- Registry: _ENV_REGISTRY in scripts/train.py (lazy try/except ImportError block), not a non-existent BENCHMARK_REGISTRY in skillopt/envs/__init__.py.
- _base_: documented as a string path, since the current YAML loader only accepts strings.

The new-benchmark.md guide now walks through a docfaithful worked example with a real rollout helper (chat_target + scorer) instead of hand-waving over the rollout step.

Refs microsoft/SkillOpt#30.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-07-03 14:02:58 +08:00 · 2026-06-01 20:14:54 +00:00
parent fb1a76371d
commit 2ca2910649
2 changed files with 512 additions and 186 deletions
--- a/docs/guide/new-benchmark.md
+++ b/docs/guide/new-benchmark.md
@@ -1,181 +1,393 @@
 # Add a New Benchmark

-Extend SkillOpt with your own benchmark in ~100 lines of code.
+Extend SkillOpt with your own benchmark in ~200 lines of code. We will use
+a tiny worked example, `docfaithful`, that scores a target model on
+how faithfully it answers questions grounded in a small reference doc.

-## Overview
+> **Working reference.** The easiest way to start a new env is to read
+> [`skillopt/envs/officeqa/`](https://github.com/microsoft/SkillOpt/tree/main/skillopt/envs/officeqa)
+> and copy its structure. Everything below is the same shape, simplified.

-To add a benchmark, you need:
+## What you need to build

-1. **Data Loader** — Loads and splits your dataset
-2. **Environment Adapter** — Executes tasks and returns scores
-3. **Config** — YAML configuration file
+To add a benchmark you implement four things:

-## Step 1: Create the Benchmark Package
+1. **A `SplitDataLoader` subclass** — knows how to load train / val / test
+   item dicts from disk.
+2. **A rollout helper** — runs the target model on a batch of items
+   under the current skill and scores each prediction.
+3. **An `EnvAdapter` subclass** — wires the loader + rollout helper into
+   SkillOpt's lifecycle (`build_*_env`, `rollout`, `reflect`,
+   `get_task_types`).
+4. **A YAML config** — references your env name plus the standard
+   train / optimizer / gradient knobs.
+
+Then a short `try / except ImportError` block in `scripts/train.py`'s
+`_register_builtins()` makes it discoverable.
+
+---
+
+## Step 1 — Create the package

 ```bash
-mkdir -p skillopt/envs/my_benchmark
-touch skillopt/envs/my_benchmark/__init__.py
+mkdir -p skillopt/envs/docfaithful
+touch skillopt/envs/docfaithful/__init__.py
 ```

-## Step 2: Implement the Data Loader
+## Step 2 — Implement the data loader

-Create `skillopt/envs/my_benchmark/loader.py`:
+`skillopt/envs/docfaithful/loader.py`:

 ```python
-from skillopt.data.base import DataLoader, DataItem
+from __future__ import annotations

-class MyBenchmarkDataLoader(DataLoader):
-    """Load and split your benchmark data."""
-    
-    def __init__(self, data_dir: str, **kwargs):
-        super().__init__(**kwargs)
-        self.data_dir = data_dir
-    
-    def setup(self, cfg: dict):
-        """Initialize splits based on config."""
-        self.split_mode = cfg.get('split_mode', 'ratio')
-        # Load your data here
-        self.items = self._load_items()
-        self._create_splits(cfg)
-    
-    def _load_items(self) -> list[DataItem]:
-        """Load raw data into DataItem objects."""
-        items = []
-        # TODO: Load your data
-        for entry in your_data:
-            items.append(DataItem(
-                id=entry['id'],
-                input=entry['question'],
-                ground_truth=entry['answer'],
-                metadata=entry.get('metadata', {})
-            ))
-        return items
-    
-    def get_split_items(self, split: str) -> list[DataItem]:
-        """Return items for a given split (train/valid/test)."""
-        return self.splits[split]
+import json
+from pathlib import Path
+
+from skillopt.datasets.base import SplitDataLoader
+
+
+def _normalize(raw: dict) -> dict:
+    """Make sure every item has an ``id``. Other keys are env-specific."""
+    return {
+        "id": str(raw["uid"]),
+        "question": raw["question"],
+        "ground_truth": raw["answer"],
+        "reference_text": raw.get("reference", ""),
+        "task_type": raw.get("category", "docfaithful"),
+    }
+
+
+class DocFaithfulDataLoader(SplitDataLoader):
+    """Load DocFaithful items from JSON files inside each split dir."""
+
+    def load_split_items(self, split_path: str) -> list[dict]:
+        # split_path is e.g. data/docfaithful_split/train/
+        json_files = sorted(Path(split_path).glob("*.json"))
+        if not json_files:
+            raise FileNotFoundError(f"No .json file found in {split_path}")
+        with json_files[0].open(encoding="utf-8") as f:
+            raw = json.load(f)
+        return [_normalize(item) for item in raw]
 ```

-## Step 3: Implement the Environment Adapter
+Only `load_split_items()` is mandatory. If you also want to support
+`split_mode="ratio"` (auto-split a single raw file into train/val/test),
+override `load_raw_items(data_path)` as well — see
+`skillopt/datasets/base.py` docstrings.

-Create `skillopt/envs/my_benchmark/env.py`:
+## Step 3 — Write the rollout helper
+
+`skillopt/envs/docfaithful/rollout.py`:

 ```python
-from skillopt.envs.base import EnvAdapter, TaskResult
+from __future__ import annotations

-class MyBenchmarkEnv(EnvAdapter):
-    """Execute tasks and evaluate results."""
-    
-    def __init__(self, cfg: dict):
-        super().__init__(cfg)
-    
-    async def execute(self, item: DataItem, skill: str, model) -> TaskResult:
-        """
-        Execute a single task.
-        
-        Args:
-            item: The data item to process
-            skill: Current skill document content
-            model: The target model instance
-            
-        Returns:
-            TaskResult with prediction, score, and trajectory
-        """
-        # Build prompt with skill document
-        prompt = self.build_prompt(item, skill)
-        
-        # Get model response
-        response = await model.generate(prompt)
-        
-        # Extract prediction
-        prediction = self.parse_response(response)
-        
-        # Score against ground truth
-        score = self.evaluate(prediction, item.ground_truth)
-        
-        return TaskResult(
-            item_id=item.id,
-            prediction=prediction,
-            score=score,
-            trajectory=[
-                {"role": "system", "content": skill},
-                {"role": "user", "content": item.input},
-                {"role": "assistant", "content": response}
-            ]
+import json
+import os
+from pathlib import Path
+
+from skillopt.model import chat_target
+
+
+def _score(prediction: str, ground_truth: str) -> tuple[int, float]:
+    """Trivial exact-match scorer. Replace with F1 / ROUGE / LLM-judge."""
+    p = (prediction or "").strip().lower()
+    g = (ground_truth or "").strip().lower()
+    hard = int(p == g and bool(g))
+    soft = 1.0 if hard else 0.0
+    return hard, soft
+
+
+def _rollout_one(item: dict, skill_content: str,
+                 *, max_completion_tokens: int) -> dict:
+    system = skill_content
+    user = (
+        f"Question: {item['question']}\n\n"
+        f"Reference:\n{item.get('reference_text', '')}\n\n"
+        "Answer:"
+    )
+    prediction, _usage = chat_target(
+        system=system,
+        user=user,
+        max_completion_tokens=max_completion_tokens,
+    )
+    hard, soft = _score(prediction, item.get("ground_truth", ""))
+    return {
+        "id": str(item["id"]),
+        "hard": hard,
+        "soft": soft,
+        "predicted_answer": prediction,
+        "question": item.get("question", ""),
+        "reference_text": item.get("reference_text", ""),
+        "task_type": item.get("task_type", "docfaithful"),
+    }
+
+
+def run_batch(*, items: list[dict], skill_content: str, out_root: str,
+              workers: int = 4, max_completion_tokens: int = 4096) -> list[dict]:
+    """Run a batch of episodes sequentially (``workers`` is reserved for a thread pool)."""
+    os.makedirs(out_root, exist_ok=True)
+    # For brevity we go sequentially — swap in concurrent.futures.ThreadPoolExecutor
+    # when network / model latency dominates.
+    results = [
+        _rollout_one(item, skill_content,
+                     max_completion_tokens=max_completion_tokens)
+        for item in items
+    ]
+    Path(out_root, "rollouts.json").write_text(
+        json.dumps(results, ensure_ascii=False, indent=2), encoding="utf-8"
+    )
+    return results
+```
+
+Two design points worth flagging:
+
+- **Scoring lives here, not in `EnvAdapter`.** There is no `evaluate()`
+  method on the ABC. Whatever signal you put in `hard` (0/1, or a float
+  in [0, 1] for smoothed reward) and `soft` (float in [0, 1]) is what
+  the optimizer reads.
+- **Use `skillopt.model.chat_target`**, not raw OpenAI/Claude calls.
+  That routes through whichever **chat** target backend the user
+  configured (`openai_chat` / `claude_chat` / `qwen_chat` /
+  `minimax_chat`) without your adapter caring. Exec-style backends
+  (`codex_exec`, `claude_code_exec`) need env-specific rollout code —
+  see `skillopt/envs/swebench/` for an example.
+
+## Step 4 — Implement the environment adapter
+
+`skillopt/envs/docfaithful/adapter.py`:
+
+```python
+from __future__ import annotations
+
+import os
+
+from skillopt.datasets.base import BatchSpec
+from skillopt.envs.base import EnvAdapter
+from skillopt.envs.docfaithful.loader import DocFaithfulDataLoader
+from skillopt.envs.docfaithful.rollout import run_batch
+from skillopt.gradient.reflect import run_minibatch_reflect
+
+
+class DocFaithfulAdapter(EnvAdapter):
+    """SkillOpt adapter for the DocFaithful benchmark."""
+
+    def __init__(
+        self,
+        split_dir: str = "",
+        data_path: str = "",
+        split_mode: str = "split_dir",
+        split_ratio: str = "2:1:7",
+        split_seed: int = 42,
+        split_output_dir: str = "",
+        workers: int = 4,
+        analyst_workers: int = 4,
+        failure_only: bool = False,
+        minibatch_size: int = 8,
+        edit_budget: int = 4,
+        seed: int = 42,
+        limit: int = 0,
+        max_completion_tokens: int = 4096,
+    ) -> None:
+        self.workers = workers
+        self.analyst_workers = analyst_workers
+        self.failure_only = failure_only
+        self.minibatch_size = minibatch_size
+        self.edit_budget = edit_budget
+        self.max_completion_tokens = int(max_completion_tokens)
+        self.dataloader = DocFaithfulDataLoader(
+            split_dir=split_dir,
+            data_path=data_path,
+            split_mode=split_mode,
+            split_ratio=split_ratio,
+            split_seed=split_seed,
+            split_output_dir=split_output_dir,
+            seed=seed,
+            limit=limit,
        )
-    
-    def evaluate(self, prediction: str, ground_truth: str) -> float:
-        """
-        Score a prediction against ground truth.
-        
-        Returns:
-            Float between 0.0 and 1.0
-        """
-        # TODO: Implement your scoring logic
-        # Examples: exact match, F1, ANLS, etc.
-        return float(prediction.strip() == ground_truth.strip())
-    
-    def build_prompt(self, item, skill: str) -> str:
-        """Combine skill document with task input."""
-        return f"{skill}\n\n---\n\nQuestion: {item.input}"
-    
-    def parse_response(self, response: str) -> str:
-        """Extract the answer from model response."""
-        return response.strip()
+
+    # ── Lifecycle ───────────────────────────────────────────────────────
+
+    def setup(self, cfg: dict) -> None:
+        super().setup(cfg)
+        self.dataloader.setup(cfg)
+
+    def get_dataloader(self):
+        return self.dataloader
+
+    # ── Env construction ────────────────────────────────────────────────
+
+    def build_env_from_batch(self, batch: BatchSpec, **kwargs):
+        # For dataset-backed envs the "manager" is just the items list.
+        return list(batch.payload or [])
+
+    def build_train_env(self, batch_size: int, seed: int, **kwargs):
+        batch = self.dataloader.build_train_batch(
+            batch_size=batch_size, seed=seed, **kwargs
+        )
+        return self.build_env_from_batch(batch, **kwargs)
+
+    def build_eval_env(self, env_num: int, split: str, seed: int, **kwargs):
+        batch = self.dataloader.build_eval_batch(
+            env_num=env_num, split=split, seed=seed, **kwargs
+        )
+        return self.build_env_from_batch(batch, **kwargs)
+
+    # ── The two real action methods ─────────────────────────────────────
+
+    def rollout(self, env_manager, skill_content: str,
+                out_dir: str, **kwargs) -> list[dict]:
+        items: list[dict] = env_manager
+        return run_batch(
+            items=items,
+            skill_content=skill_content,
+            out_root=out_dir,
+            workers=self.workers,
+            max_completion_tokens=self.max_completion_tokens,
+        )
+
+    def reflect(self, results: list[dict], skill_content: str,
+                out_dir: str, **kwargs) -> list[dict | None]:
+        return run_minibatch_reflect(
+            results=results,
+            skill_content=skill_content,
+            prediction_dir=kwargs.get(
+                "prediction_dir", os.path.join(out_dir, "predictions")
+            ),
+            patches_dir=kwargs.get(
+                "patches_dir", os.path.join(out_dir, "patches")
+            ),
+            workers=self.analyst_workers,
+            failure_only=self.failure_only,
+            minibatch_size=self.minibatch_size,
+            edit_budget=self.edit_budget,
+            random_seed=kwargs.get("random_seed"),
+            error_system=self.get_error_minibatch_prompt(),
+            success_system=self.get_success_minibatch_prompt(),
+            step_buffer_context=kwargs.get("step_buffer_context", ""),
+            update_mode=getattr(self, "_cfg", {}).get("skill_update_mode", "patch"),
+        )
+
+    def get_task_types(self) -> list[str]:
+        seen: list[str] = []
+        for item in (
+            self.dataloader.train_items
+            + self.dataloader.val_items
+            + self.dataloader.test_items
+        ):
+            tt = str(item.get("task_type") or "docfaithful")
+            if tt not in seen:
+                seen.append(tt)
+        return seen or ["docfaithful"]
 ```

-## Step 4: Register the Benchmark
+### What the rollout actually does

-Add to `skillopt/envs/__init__.py`:
+Look back at `run_batch` from Step 3 — it sends each `item["question"]`
+to the target model with `skill_content` as the system prompt, scores
+the answer against `item["ground_truth"]`, and returns a list of dicts:

 ```python
-from .my_benchmark.env import MyBenchmarkEnv
-from .my_benchmark.loader import MyBenchmarkDataLoader
-
-BENCHMARK_REGISTRY = {
-    # ... existing benchmarks ...
-    'my_benchmark': {
-        'env': MyBenchmarkEnv,
-        'loader': MyBenchmarkDataLoader,
-    },
-}
+[
+    {"id": "ex_001", "hard": 1, "soft": 1.0,
+     "predicted_answer": "...", "question": "...",
+     "reference_text": "...", "task_type": "docfaithful"},
+    {"id": "ex_002", "hard": 0, "soft": 0.0, "predicted_answer": "...", ...},
+    ...
+]
 ```

-## Step 5: Create Config
+The trainer only requires `id`, `hard`, `soft`. The rest is preserved on
+`RolloutResult.extras` (see `skillopt/types.py`) and is what your
+`reflect()` consumes via `run_minibatch_reflect`.

-Create `configs/my_benchmark/default.yaml`:
+## Step 5 — Register the adapter
+
+Edit [`scripts/train.py`](https://github.com/microsoft/SkillOpt/blob/main/scripts/train.py)
+and add to `_register_builtins()`:
+
+```python
+    try:
+        from skillopt.envs.docfaithful.adapter import DocFaithfulAdapter
+        _ENV_REGISTRY["docfaithful"] = DocFaithfulAdapter
+    except ImportError:
+        pass  # docfaithful deps not installed — skip
+```
+
+There is **no `BENCHMARK_REGISTRY` dict in `skillopt/envs/__init__.py`** —
+the registry lives in `scripts/train.py` and is populated lazily so that
+optional deps don't break `--help`.
+
+## Step 6 — Create the YAML config
+
+`configs/docfaithful/default.yaml`:

 ```yaml
-_base_: ['../_base_/default.yaml']
+_base_: ../_base_/default.yaml      # NOTE: string, not list

-env:
-  name: my_benchmark
-  data_path: data/my_benchmark
-  split_mode: ratio
-  split_ratio: "2:1:7"
+model:
+  reasoning_effort: medium

 train:
+  batch_size: 16
+  accumulation: 1
  num_epochs: 4
-  batch_size: 40
+
+gradient:
+  minibatch_size: 8
+  merge_batch_size: 8

 optimizer:
  learning_rate: 4
-  lr_scheduler: cosine
-  use_slow_update: true
-  use_meta_skill: true

-gradient:
-  analyst_workers: 16
+env:
+  name: docfaithful
+  # Optional: a seed skill document. Create this file (or any markdown
+  # file) yourself before the first run, or omit the key to let SkillOpt
+  # start from an empty skill.
+  skill_init: skillopt/envs/docfaithful/skills/initial.md
+  split_mode: split_dir
+  split_dir: data/docfaithful_split
+  workers: 4
+  max_completion_tokens: 4096
+  limit: 0
 ```

-## Step 6: Run
+> ⚠️ `_base_` is currently parsed as a **string path**, not a list. Write
+> `_base_: ../_base_/default.yaml`, not `_base_: ['../_base_/default.yaml']`.
+> See [`skillopt/config.py`](https://github.com/microsoft/SkillOpt/blob/main/skillopt/config.py)
+> if you want to add list-form inheritance.
+
+## Step 7 — Run

 ```bash
-python scripts/train.py --config configs/my_benchmark/default.yaml
+# If you set skill_init above, create the seed skill first:
+#   mkdir -p skillopt/envs/docfaithful/skills
+#   echo "# DocFaithful initial skill" > skillopt/envs/docfaithful/skills/initial.md
+
+python scripts/train.py --config configs/docfaithful/default.yaml
 ```

+If you get `ValueError: Unknown environment 'docfaithful'. Available: [...]`,
+you forgot Step 5.
+
+If you get `TypeError: Can't instantiate abstract class DocFaithfulAdapter`,
+you forgot to implement one of the five abstract methods on `EnvAdapter`:
+`build_train_env`, `build_eval_env`, `rollout`, `reflect`,
+`get_task_types`.
+
 ## Tips

-!!! tip
-    - Use a small `batch_size` (10-20) for initial testing
-    - The `evaluate()` method is critical — a noisy metric will confuse the optimizer
+- Start with `train.batch_size: 4` and `limit: 10` while debugging.
+- Scoring lives **inside your `rollout`**, not in a separate
+  method: there is no `evaluate()` in the `EnvAdapter` ABC. Score the
+  prediction in `run_batch` and put the score on each result dict's
+  `hard` / `soft`.
+- Noisy scoring kills the optimizer. Spend time on `run_batch`'s scoring
+  before you spend time on prompts.
+- If your benchmark needs heavy optional deps (selenium, vllm, ...),
+  wrap the registration block with `try / except ImportError` (Step 5)
+  so people without those deps can still run `--help`.
+- Copy `skillopt/envs/_template/` as a starting skeleton — it now
+  implements the real abstract methods.
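
The `_score` helper in Step 3 is exact-match only, and its docstring suggests swapping in F1. A minimal sketch of a token-level F1 scorer that keeps the same `(hard, soft)` return shape might look like the following. The name `score_token_f1` and the lowercase whitespace tokenization are assumptions made for illustration; neither is part of SkillOpt.

```python
from __future__ import annotations

from collections import Counter


def score_token_f1(prediction: str, ground_truth: str) -> tuple[int, float]:
    """Return (hard, soft): hard is exact token match, soft is token-level F1.

    Hypothetical replacement for ``_score``; tokenization is a plain
    lowercase whitespace split, which is an assumption of this sketch.
    """
    pred_tokens = (prediction or "").lower().split()
    gold_tokens = (ground_truth or "").lower().split()
    if not pred_tokens or not gold_tokens:
        return 0, 0.0
    # Multiset intersection: a shared token counts at most as often as it
    # appears in both the prediction and the ground truth.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0, 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    soft = 2 * precision * recall / (precision + recall)
    hard = int(pred_tokens == gold_tokens)
    return hard, soft
```

Unlike the exact-match version, this gives partial credit in `soft` for near-misses while `hard` stays a strict 0/1 signal, so a prediction of "the cat sat" against "the cat" scores `hard == 0` with a nonzero `soft`.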
--- a/docs/reference/api.md
+++ b/docs/reference/api.md
@@ -1,81 +1,195 @@
 # API Reference

+This page documents the public Python API SkillOpt exposes for **extending the
+framework** with new environments / benchmarks. For ready-made adapters,
+browse [`skillopt/envs/`](https://github.com/microsoft/SkillOpt/tree/main/skillopt/envs).
+
+> **Source of truth.** The classes below are real Python classes (ABCs and
+> dataclasses) defined in `skillopt/envs/base.py`, `skillopt/datasets/base.py`,
+> `skillopt/types.py`, and `skillopt/evaluation/gate.py`. If this page ever
+> drifts, the code wins; please open an issue.
+
+---
+
 ## Core Classes

 ### `EnvAdapter`

-Abstract base class for benchmark environments.
+`skillopt/envs/base.py` — abstract adapter that connects the SkillOpt
+trainer to an environment (benchmark, simulator, REST API, ...).
+Subclasses **must** implement the five abstract methods below.

 ```python
+from abc import ABC, abstractmethod
+from skillopt.datasets.base import BaseDataLoader, BatchSpec
+
 class EnvAdapter(ABC):
-    async def execute(self, item, skill, model) -> TaskResult
-    def evaluate(self, prediction, ground_truth) -> float
-    def build_prompt(self, item, skill) -> str
+
+    # ── Lifecycle hooks (have defaults; override only if needed) ────────
+
+    def setup(self, cfg: dict) -> None: ...
+    def get_dataloader(self) -> BaseDataLoader | None: ...
+    def requires_ray(self) -> bool: ...                 # default False
+
+    # ── Abstract methods (subclasses MUST implement) ────────────────────
+
+    @abstractmethod
+    def build_train_env(self, batch_size: int, seed: int, **kwargs):
+        """Return an environment-manager object to be passed to rollout()."""
+
+    @abstractmethod
+    def build_eval_env(self, env_num: int, split: str, seed: int, **kwargs):
+        """Like build_train_env() but for a fixed eval split."""
+
+    @abstractmethod
+    def rollout(self, env_manager, skill_content: str,
+                out_dir: str, **kwargs) -> list[dict]:
+        """Run a batch of episodes with the current skill.
+
+        Each returned dict MUST contain:
+          - "id":   str        episode/task identifier
+          - "hard": int (0|1)  pass/fail (may be float 0.0-1.0 if smoothed)
+          - "soft": float      partial-credit score in [0.0, 1.0]
+        It MAY contain env-specific extra keys (parsed into RolloutResult.extras).
+        """
+
+    @abstractmethod
+    def reflect(self, results: list[dict], skill_content: str,
+                out_dir: str, **kwargs) -> list[dict | None]:
+        """Turn rollout results into a list of raw patch dicts.
+
+        Each dict (or None to drop the slot) MUST contain:
+          - "patch":       {"edits": [...]}     a Patch.to_dict() payload
+          - "source_type": "failure" | "success"
+        """
+
+    @abstractmethod
+    def get_task_types(self) -> list[str]:
+        """Distinct task-type strings used for stratified sampling."""
 ```

-### `DataLoader`
+The trainer also calls a few default-implemented helpers on every adapter:
+`build_reference_text`, `get_reference_metadata`, `attach_reference_context`,
+`select_representative_items`, and `build_env_from_batch`. Read the docstrings
+in `skillopt/envs/base.py` if you need to override any of these — most
+benchmarks don't.

-Abstract base class for data loading and splitting.
+### `BaseDataLoader` / `SplitDataLoader`
+
+`skillopt/datasets/base.py` — episode-planning loaders.

 ```python
-class DataLoader(ABC):
-    def setup(self, cfg: dict) -> None
-    def get_split_items(self, split: str) -> list[DataItem]
+class BaseDataLoader(ABC):
+    def setup(self, cfg: dict) -> None: ...
+    @abstractmethod
+    def build_train_batch(self, batch_size: int, seed: int, **kwargs) -> BatchSpec: ...
+    @abstractmethod
+    def build_eval_batch(self, env_num: int, split: str, seed: int, **kwargs) -> BatchSpec: ...
+
+class SplitDataLoader(BaseDataLoader):
+    """Concrete base for dataset-backed envs with on-disk train/val/test splits.
+
+    Subclasses only need to implement load_split_items() (and optionally
+    load_raw_items() if you also want ``split_mode='ratio'``).
+    """
+    def load_split_items(self, split_path: str) -> list[dict]: ...
+    def load_raw_items(self, data_path: str) -> list[dict]: ...   # optional
 ```

-### `ModelBackend`
+`SplitDataLoader` handles two layout modes:

-Abstract base class for LLM backends.
+| `split_mode` | What it expects |
+|---|---|
+| `"split_dir"` | A directory with `train/`, `val/`, `test/` subdirs already split. |
+| `"ratio"` | A raw dataset path + `split_ratio: "2:1:7"` style string. |
+
+In either case the items returned by `load_split_items()` are plain
+`dict` objects with at minimum an `"id"` key.
+
+### `BatchSpec`
+
+`skillopt/datasets/base.py` — a slotted dataclass describing one batch
+request the trainer hands to the adapter.

 ```python
-class ModelBackend(ABC):
-    async def generate(self, messages, **kwargs) -> ModelResponse
-    async def generate_with_tools(self, messages, tools, **kwargs) -> ModelResponse
-```
-
-### `Trainer`
-
-Main training loop orchestrator.
-
-```python
-class Trainer:
-    def __init__(self, cfg: dict)
-    async def train(self) -> TrainResult
-    async def evaluate(self, skill: str, split: str) -> EvalResult
-```
-
-## Data Classes
-
-### `DataItem`
-
-```python
-@dataclass
-class DataItem:
-    id: str
-    input: str
-    ground_truth: str
+@dataclass(slots=True)
+class BatchSpec:
+    phase: str                 # "train" | "eval"
+    split: str                 # "train" | "val" | "test" | "valid_seen" | ...
+    seed: int
+    batch_size: int
+    payload: object | None = None     # what the loader produced (e.g. list[dict])
    metadata: dict = field(default_factory=dict)
 ```

-### `TaskResult`
+### `Edit` / `Patch`
+
+`skillopt/types.py` — the I/O types Reflect / Aggregate / Update produce
+and consume.

 ```python
+EditOp = Literal["append", "insert_after", "replace", "delete"]
+
 @dataclass
-class TaskResult:
-    item_id: str
-    prediction: str
-    score: float
-    trajectory: list[dict]
+class Edit:
+    op: EditOp
+    content: str = ""
+    target: str = ""
+    support_count: int | None = None
+    source_type: Literal["failure", "success"] | None = None
+    merge_level: int | None = None
+    update_origin: str = ""
+    update_target: str = ""
+
+@dataclass
+class Patch:
+    edits: list[Edit] = field(default_factory=list)
+    reasoning: str = ""
+    ranking_details: dict[str, Any] | None = None
 ```

-### `ModelResponse`
+Both types support `to_dict()` / `from_dict()` for serialization.

-```python
-@dataclass
-class ModelResponse:
-    content: str
-    usage: dict
-    model: str
-```
+### `RolloutResult`

-For detailed source code, see the [`skillopt/`](https://github.com/microsoft/SkillOpt/tree/main/skillopt) directory.
+`skillopt/types.py` — the normalised rollout return type. The trainer
+calls `RolloutResult.from_dict(...)` on each dict returned from
+`EnvAdapter.rollout()`, so the only **hard** requirement on those dicts is
+the three keys above (`id`, `hard`, `soft`). Extra fields are preserved
+into `RolloutResult.extras`.
+
+### `GateResult` / `GateAction`
+
+`skillopt/evaluation/gate.py` — the validation-gate decision types
+returned each epoch.
+
+---
+
+## Registering an environment
+
+Environments are not registered via decorators or a `BENCHMARK_REGISTRY`
+dict. The trainer keeps a lazy registry inside `scripts/train.py` —
+`_ENV_REGISTRY` — populated by `_register_builtins()`. To add a new env
+you append a `try / except ImportError` block there. See
+[Add a New Benchmark](../guide/new-benchmark.md) for the full step-by-step.
+
+---
+
+## Backends (model layer)
+
+The model layer lives under `skillopt.model.*`. Backends are selected
+via `model.optimizer_backend` and `model.target_backend` in the config,
+not by subclassing a base class. Supported values (as of this writing):
+
+| Backend | Optimizer? | Target? |
+|---|---|---|
+| `openai_chat` | ✓ | ✓ |
+| `claude_chat` | ✓ | ✓ |
+| `qwen_chat` | ✓ | ✓ |
+| `minimax_chat` | ✓ | ✓ |
+| `codex_exec` | — | ✓ |
+| `claude_code_exec` | — | ✓ |
+
+See `skillopt/model/backend_config.py` for the live whitelist and
+[`docs/reference/config.md`](./config.md) for the per-backend
+configuration keys.
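
The rollout contract in the `EnvAdapter.rollout` docstring (`id`, `hard`, `soft` required; everything else kept as extras) can be checked before results reach the trainer. The sketch below is a standalone illustration under that reading of the contract. `split_rollout_dict` is a hypothetical helper and does not reproduce the actual `RolloutResult.from_dict` logic in `skillopt/types.py`.

```python
from __future__ import annotations

REQUIRED_KEYS = ("id", "hard", "soft")


def split_rollout_dict(result: dict) -> tuple[dict, dict]:
    """Split one rollout dict into (required fields, extras).

    Hypothetical helper: it only mirrors the documented contract and is
    not the library's own parsing code.
    """
    missing = [k for k in REQUIRED_KEYS if k not in result]
    if missing:
        raise KeyError(f"rollout dict missing required keys: {missing}")
    soft = float(result["soft"])
    if not 0.0 <= soft <= 1.0:
        raise ValueError(f"soft must be in [0.0, 1.0], got {soft}")
    core = {"id": str(result["id"]), "hard": result["hard"], "soft": soft}
    # Everything that is not a required key is treated as an env-specific extra.
    extras = {k: v for k, v in result.items() if k not in REQUIRED_KEYS}
    return core, extras


core, extras = split_rollout_dict(
    {"id": 7, "hard": 1, "soft": 1.0, "predicted_answer": "42"}
)
```

Running a check like this at the end of a rollout helper surfaces a missing key in the adapter itself, where it is easier to trace than a failure later in the training loop.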