Yifan Yang 2ca2910649 docs: align API reference and Add-a-Benchmark guide with real EnvAdapter ABC
docs/reference/api.md previously documented a fictional EnvAdapter API
(execute / evaluate / build_prompt + DataItem / TaskResult) and a
BENCHMARK_REGISTRY that never existed in code. Anyone following the
documented contract would hit ImportError or TypeError on the first
instantiation.

Replace both pages with the real shape from skillopt/envs/base.py and
skillopt/datasets/base.py:

- EnvAdapter: build_train_env, build_eval_env, rollout, reflect,
  get_task_types (the 5 actual abstract methods).
- Rollout dicts: id / hard / soft required; everything else preserved
  into RolloutResult.extras.
- Reflect dicts: {patch, source_type} schema as consumed by
  run_minibatch_reflect.
- BatchSpec: slotted-but-mutable dataclass matching the actual
  definition (payload defaults to None, metadata to dict()).
- SplitDataLoader.load_split_items as the one mandatory loader method.
- Registry: _ENV_REGISTRY in scripts/train.py (lazy try/except
  ImportError block), not a non-existent BENCHMARK_REGISTRY in
  skillopt/envs/__init__.py.
- _base_: documented as a string path, since the current YAML loader
  only accepts strings.

The new-benchmark.md guide now walks through a doc-faithful worked
example with a real rollout helper (chat_target + scorer) instead of
hand-waving over the rollout step. Refs microsoft/SkillOpt#30.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-01 20:14:54 +00:00


# API Reference
This page documents the public Python API SkillOpt exposes for **extending the
framework** with new environments / benchmarks. For ready-made adapters,
browse [`skillopt/envs/`](https://github.com/microsoft/SkillOpt/tree/main/skillopt/envs).
> **Source of truth.** The classes below are real Python ABCs defined in
> `skillopt/envs/base.py`, `skillopt/datasets/base.py`, `skillopt/types.py`,
> and `skillopt/evaluation/gate.py`. If this page ever drifts, the code
> wins — please open an issue.
---
## Core Classes
### `EnvAdapter`
`skillopt/envs/base.py` — abstract adapter that connects the SkillOpt
trainer to an environment (benchmark, simulator, REST API, ...).
Subclasses **must** implement the five abstract methods below.
```python
from abc import ABC, abstractmethod

from skillopt.datasets.base import BaseDataLoader, BatchSpec


class EnvAdapter(ABC):
    # ── Lifecycle hooks (have defaults; override only if needed) ────────
    def setup(self, cfg: dict) -> None: ...
    def get_dataloader(self) -> BaseDataLoader | None: ...
    def requires_ray(self) -> bool: ...  # default False

    # ── Abstract methods (subclasses MUST implement) ────────────────────
    @abstractmethod
    def build_train_env(self, batch_size: int, seed: int, **kwargs):
        """Return an environment-manager object to be passed to rollout()."""

    @abstractmethod
    def build_eval_env(self, env_num: int, split: str, seed: int, **kwargs):
        """Like build_train_env() but for a fixed eval split."""

    @abstractmethod
    def rollout(self, env_manager, skill_content: str,
                out_dir: str, **kwargs) -> list[dict]:
        """Run a batch of episodes with the current skill.

        Each returned dict MUST contain:
          - "id":   str        episode/task identifier
          - "hard": int (0|1)  pass/fail (may be float 0.0-1.0 if smoothed)
          - "soft": float      partial-credit score in [0.0, 1.0]
        It MAY contain env-specific extra keys (parsed into RolloutResult.extras).
        """

    @abstractmethod
    def reflect(self, results: list[dict], skill_content: str,
                out_dir: str, **kwargs) -> list[dict | None]:
        """Turn rollout results into a list of raw patch dicts.

        Each dict (or None to drop the slot) MUST contain:
          - "patch":       {"edits": [...]}  a Patch.to_dict() payload
          - "source_type": "failure" | "success"
        """

    @abstractmethod
    def get_task_types(self) -> list[str]:
        """Distinct task-type strings used for stratified sampling."""
```
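To make the two dict contracts in the docstrings above concrete, here is a minimal, standalone sketch of shape checks for `rollout()` and `reflect()` return values. The helper names `check_rollout_dict` and `check_reflect_dict`, the `"trajectory"` extra key, and the example edit are illustrative assumptions; they are not part of SkillOpt.

```python
from typing import Any, Optional

REQUIRED_ROLLOUT_KEYS = ("id", "hard", "soft")
VALID_SOURCE_TYPES = ("failure", "success")


def check_rollout_dict(result: dict[str, Any]) -> None:
    """Raise ValueError if a rollout dict breaks the documented contract."""
    missing = [k for k in REQUIRED_ROLLOUT_KEYS if k not in result]
    if missing:
        raise ValueError(f"rollout dict is missing keys: {missing}")
    if not isinstance(result["id"], str):
        raise ValueError("'id' must be a str")
    if not 0.0 <= float(result["hard"]) <= 1.0:
        raise ValueError("'hard' must be 0/1 (or a float in [0.0, 1.0])")
    if not 0.0 <= float(result["soft"]) <= 1.0:
        raise ValueError("'soft' must be in [0.0, 1.0]")


def check_reflect_dict(item: Optional[dict[str, Any]]) -> None:
    """Raise ValueError if a reflect() slot breaks the documented contract."""
    if item is None:  # None drops the slot, which the contract allows
        return
    patch = item.get("patch")
    if not isinstance(patch, dict) or "edits" not in patch:
        raise ValueError("'patch' must be a dict with an 'edits' list")
    if item.get("source_type") not in VALID_SOURCE_TYPES:
        raise ValueError("'source_type' must be 'failure' or 'success'")


# Example payloads; the "trajectory" key and the edit text are made up.
rollout_ok = {"id": "task-001", "hard": 1, "soft": 0.75, "trajectory": ["step 1"]}
reflect_ok = {
    "patch": {"edits": [{"op": "append", "content": "Check units first."}]},
    "source_type": "failure",
}
check_rollout_dict(rollout_ok)
check_reflect_dict(reflect_ok)
check_reflect_dict(None)
```

Checks like these can be handy in an adapter's unit tests, since a malformed dict would otherwise only surface once the trainer consumes it.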
The trainer also calls a few default-implemented helpers on every adapter:
`build_reference_text`, `get_reference_metadata`, `attach_reference_context`,
`select_representative_items`, and `build_env_from_batch`. Read the docstrings
in `skillopt/envs/base.py` if you need to override any of these — most
benchmarks don't.
### `BaseDataLoader` / `SplitDataLoader`
`skillopt/datasets/base.py` — episode-planning loaders.
```python
from abc import ABC, abstractmethod


class BaseDataLoader(ABC):
    def setup(self, cfg: dict) -> None: ...

    @abstractmethod
    def build_train_batch(self, batch_size: int, seed: int, **kwargs) -> BatchSpec: ...

    @abstractmethod
    def build_eval_batch(self, env_num: int, split: str, seed: int, **kwargs) -> BatchSpec: ...


class SplitDataLoader(BaseDataLoader):
    """Concrete base for dataset-backed envs with on-disk train/val/test splits.

    Subclasses only need to implement load_split_items() (and optionally
    load_raw_items() if you also want ``split_mode='ratio'``).
    """

    def load_split_items(self, split_path: str) -> list[dict]: ...
    def load_raw_items(self, data_path: str) -> list[dict]: ...  # optional
```
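As an illustration of what a `load_split_items()` body might look like, the sketch below reads every `*.jsonl` file in a split directory. The JSONL layout is an assumption made for this example (SkillOpt does not prescribe a file format here), and the function is written standalone; in a real `SplitDataLoader` subclass it would be a method taking `self`.

```python
import json
from pathlib import Path


def load_split_items(split_path: str) -> list[dict]:
    """Read every *.jsonl file under split_path into a list of item dicts.

    Assumes one JSON object per line; this format is illustrative only.
    """
    items: list[dict] = []
    for file in sorted(Path(split_path).glob("*.jsonl")):
        with file.open(encoding="utf-8") as fh:
            for line_no, line in enumerate(fh, start=1):
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines
                item = json.loads(line)
                if "id" not in item:
                    # Items are expected to carry at least an "id" key.
                    raise ValueError(f"{file.name}:{line_no} has no 'id' key")
                items.append(item)
    return items
```

Sorting the file list keeps the item order stable across runs, which matters if batches are later sampled with a fixed seed.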
`SplitDataLoader` handles two layout modes:

| `split_mode` | What it expects |
|---|---|
| `"split_dir"` | A directory with `train/`, `val/`, `test/` subdirs already split. |
| `"ratio"` | A raw dataset path plus a `split_ratio` string such as `"2:1:7"`. |

In either case the items returned by `load_split_items()` are plain
`dict` objects with at minimum an `"id"` key.
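For intuition about `split_mode="ratio"`, here is a rough sketch of how a `"2:1:7"` string could be turned into three splits. It assumes the weights are ordered train:val:test and that items are shuffled with a seeded RNG before slicing; neither detail is confirmed by this page, and `parse_split_ratio` / `split_by_ratio` are hypothetical helper names, not SkillOpt functions.

```python
import random


def parse_split_ratio(ratio: str) -> tuple[int, ...]:
    """Parse a "2:1:7" style string into three non-negative integer weights."""
    parts = ratio.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected three ':'-separated weights, got {ratio!r}")
    weights = tuple(int(p) for p in parts)
    if min(weights) < 0 or sum(weights) == 0:
        raise ValueError(f"weights must be non-negative and not all zero: {ratio!r}")
    return weights


def split_by_ratio(items: list[dict], ratio: str, seed: int) -> dict[str, list[dict]]:
    """Shuffle items deterministically and slice them by the given weights.

    ASSUMPTION: weights are ordered train:val:test.
    """
    train_w, val_w, test_w = parse_split_ratio(ratio)
    total = train_w + val_w + test_w
    shuffled = list(items)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_w // total
    n_val = len(shuffled) * val_w // total
    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],  # remainder absorbs rounding
    }
```

Integer arithmetic is used for the slice sizes so that rounding never depends on floating-point behaviour; any leftover items fall into the last split.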
### `BatchSpec`
`skillopt/datasets/base.py` — a slotted dataclass describing one batch
request the trainer hands to the adapter.
```python
from dataclasses import dataclass, field


@dataclass(slots=True)
class BatchSpec:
    phase: str                     # "train" | "eval"
    split: str                     # "train" | "val" | "test" | "valid_seen" | ...
    seed: int
    batch_size: int
    payload: object | None = None  # what the loader produced (e.g. list[dict])
    metadata: dict = field(default_factory=dict)
```
### `Edit` / `Patch`
`skillopt/types.py` — the I/O types Reflect / Aggregate / Update produce
and consume.
```python
from dataclasses import dataclass, field
from typing import Any, Literal

EditOp = Literal["append", "insert_after", "replace", "delete"]


@dataclass
class Edit:
    op: EditOp
    content: str = ""
    target: str = ""
    support_count: int | None = None
    source_type: Literal["failure", "success"] | None = None
    merge_level: int | None = None
    update_origin: str = ""
    update_target: str = ""


@dataclass
class Patch:
    edits: list[Edit] = field(default_factory=list)
    reasoning: str = ""
    ranking_details: dict[str, Any] | None = None
```
Both types support `to_dict()` / `from_dict()` for serialization.
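To show the kind of round trip `to_dict()` / `from_dict()` enable, here is a sketch using trimmed stand-in classes. `EditSketch` and `PatchSketch` are hypothetical, carry only a subset of the fields listed above, and their method bodies are an assumption about how such serialization could work, not the actual code in `skillopt/types.py`.

```python
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class EditSketch:  # stand-in for Edit with only three of its fields
    op: str
    content: str = ""
    target: str = ""

    def to_dict(self) -> dict[str, Any]:
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "EditSketch":
        return cls(
            op=data["op"],
            content=data.get("content", ""),
            target=data.get("target", ""),
        )


@dataclass
class PatchSketch:  # stand-in for Patch without ranking_details
    edits: list[EditSketch] = field(default_factory=list)
    reasoning: str = ""

    def to_dict(self) -> dict[str, Any]:
        return {
            "edits": [e.to_dict() for e in self.edits],
            "reasoning": self.reasoning,
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "PatchSketch":
        return cls(
            edits=[EditSketch.from_dict(e) for e in data.get("edits", [])],
            reasoning=data.get("reasoning", ""),
        )


patch = PatchSketch(
    edits=[EditSketch(op="append", content="Double-check units.")],
    reasoning="Several failures came from unit mix-ups.",
)
payload = patch.to_dict()                   # plain dicts and lists, JSON-friendly
restored = PatchSketch.from_dict(payload)   # equal to the original patch
```

The `payload` produced here has the `{"edits": [...]}` shape that a `reflect()` dict carries under its `"patch"` key.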
### `RolloutResult`
`skillopt/types.py` — the normalised rollout return type. The trainer
calls `RolloutResult.from_dict(...)` on each dict returned from
`EnvAdapter.rollout()`, so the only **hard** requirement on those dicts is
the three keys above (`id`, `hard`, `soft`). Extra fields are preserved
into `RolloutResult.extras`.
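A rough mental model of that normalisation is sketched below. `RolloutResultSketch` is a hypothetical stand-in written for this page; the real `RolloutResult` may store or coerce fields differently, so treat the coercions here as assumptions.

```python
from dataclasses import dataclass, field
from typing import Any

CORE_KEYS = ("id", "hard", "soft")


@dataclass
class RolloutResultSketch:  # illustrative stand-in, not skillopt.types.RolloutResult
    id: str
    hard: float
    soft: float
    extras: dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "RolloutResultSketch":
        missing = [k for k in CORE_KEYS if k not in data]
        if missing:
            raise KeyError(f"rollout dict is missing required keys: {missing}")
        return cls(
            id=str(data["id"]),
            hard=float(data["hard"]),
            soft=float(data["soft"]),
            # Everything that is not a core key is carried along untouched.
            extras={k: v for k, v in data.items() if k not in CORE_KEYS},
        )


raw = {"id": "task-001", "hard": 1, "soft": 0.5, "task_type": "math", "turns": 3}
result = RolloutResultSketch.from_dict(raw)
```

In this sketch `result.extras` ends up holding `task_type` and `turns`, which is the behaviour the paragraph above describes for env-specific keys.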
### `GateResult` / `GateAction`
`skillopt/evaluation/gate.py` — the validation-gate decision types
returned each epoch.
---
## Registering an environment
Environments are not registered via decorators or a `BENCHMARK_REGISTRY`
dict. The trainer keeps a lazy registry, `_ENV_REGISTRY`, inside
`scripts/train.py`; it is populated by `_register_builtins()`. To add a new
env, append a `try / except ImportError` block there. See
[Add a New Benchmark](../guide/new-benchmark.md) for the full step-by-step.
---
## Backends (model layer)
The model layer lives under `skillopt.model.*`. Backends are selected
via `model.optimizer_backend` and `model.target_backend` in the config,
not by subclassing a base class. Supported values (as of this writing):

| Backend | Optimizer? | Target? |
|---|---|---|
| `openai_chat` | ✓ | ✓ |
| `claude_chat` | ✓ | ✓ |
| `qwen_chat` | ✓ | ✓ |
| `minimax_chat` | ✓ | ✓ |
| `codex_exec` | ✗ | ✓ |
| `claude_code_exec` | ✗ | ✓ |

See `skillopt/model/backend_config.py` for the live whitelist and
[`docs/reference/config.md`](./config.md) for the per-backend
configuration keys.
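As a quick illustration of the table, the sketch below checks a config's backend selection against a hard-coded copy of those rows. The nested-dict layout of `cfg` and the `check_backends` helper are assumptions made for this example; the authoritative whitelist remains `skillopt/model/backend_config.py`.

```python
# Copied from the table above; may go stale if the live whitelist changes.
OPTIMIZER_BACKENDS = {"openai_chat", "claude_chat", "qwen_chat", "minimax_chat"}
TARGET_BACKENDS = OPTIMIZER_BACKENDS | {"codex_exec", "claude_code_exec"}


def check_backends(cfg: dict) -> None:
    """Raise ValueError if the selected backends are not valid for their role."""
    model = cfg.get("model", {})
    optimizer = model.get("optimizer_backend")
    target = model.get("target_backend")
    if optimizer not in OPTIMIZER_BACKENDS:
        raise ValueError(f"{optimizer!r} cannot be used as an optimizer backend")
    if target not in TARGET_BACKENDS:
        raise ValueError(f"{target!r} cannot be used as a target backend")


# Chat backend as optimizer, exec backend as target: allowed by the table.
cfg = {"model": {"optimizer_backend": "openai_chat", "target_backend": "codex_exec"}}
check_backends(cfg)
```

Swapping the two values in `cfg` would fail the check, because the `*_exec` backends are listed as target-only.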