docs/reference/api.md previously documented a fictional EnvAdapter API
(execute / evaluate / build_prompt + DataItem / TaskResult) and a
BENCHMARK_REGISTRY that never existed in code. Anyone following the
documented contract would hit ImportError or TypeError on the first
instantiation.
Replace both pages with the real shape from skillopt/envs/base.py and
skillopt/datasets/base.py:
- EnvAdapter: build_train_env, build_eval_env, rollout, reflect,
get_task_types (the 5 actual abstract methods).
- Rollout dicts: id / hard / soft required; everything else preserved
into RolloutResult.extras.
- Reflect dicts: {patch, source_type} schema as consumed by
run_minibatch_reflect.
- BatchSpec: slotted-but-mutable dataclass matching the actual
definition (payload defaults to None, metadata to dict()).
- SplitDataLoader.load_split_items as the one mandatory loader method.
- Registry: _ENV_REGISTRY in scripts/train.py (lazy try/except
ImportError block), not a non-existent BENCHMARK_REGISTRY in
skillopt/envs/__init__.py.
- _base_: documented as a string path, since the current YAML loader
only accepts strings.
The new-benchmark.md guide now walks through a doc-faithful worked
example with a real rollout helper (chat_target + scorer) instead of
hand-waving over the rollout step. Refs microsoft/SkillOpt#30.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
API Reference
This page documents the public Python API SkillOpt exposes for extending the
framework with new environments / benchmarks. For ready-made adapters,
browse skillopt/envs/.
Source of truth. The classes below are real Python ABCs defined in
skillopt/envs/base.py, skillopt/datasets/base.py, skillopt/types.py, and
skillopt/evaluation/gate.py. If this page ever drifts, the code wins; please
open an issue.
Core Classes
EnvAdapter
skillopt/envs/base.py — abstract adapter that connects the SkillOpt
trainer to an environment (benchmark, simulator, REST API, ...).
Subclasses must implement the five abstract methods below.
from abc import ABC, abstractmethod

from skillopt.datasets.base import BaseDataLoader, BatchSpec


class EnvAdapter(ABC):
    # ── Lifecycle hooks (have defaults; override only if needed) ────────
    def setup(self, cfg: dict) -> None: ...
    def get_dataloader(self) -> BaseDataLoader | None: ...
    def requires_ray(self) -> bool: ...  # default False

    # ── Abstract methods (subclasses MUST implement) ────────────────────
    @abstractmethod
    def build_train_env(self, batch_size: int, seed: int, **kwargs):
        """Return an environment-manager object to be passed to rollout()."""

    @abstractmethod
    def build_eval_env(self, env_num: int, split: str, seed: int, **kwargs):
        """Like build_train_env() but for a fixed eval split."""

    @abstractmethod
    def rollout(self, env_manager, skill_content: str,
                out_dir: str, **kwargs) -> list[dict]:
        """Run a batch of episodes with the current skill.

        Each returned dict MUST contain:
          - "id":   str          episode/task identifier
          - "hard": int (0|1)    pass/fail (may be float 0.0-1.0 if smoothed)
          - "soft": float        partial-credit score in [0.0, 1.0]
        It MAY contain env-specific extra keys (parsed into RolloutResult.extras).
        """

    @abstractmethod
    def reflect(self, results: list[dict], skill_content: str,
                out_dir: str, **kwargs) -> list[dict | None]:
        """Turn rollout results into a list of raw patch dicts.

        Each dict (or None to drop the slot) MUST contain:
          - "patch":       {"edits": [...]}   a Patch.to_dict() payload
          - "source_type": "failure" | "success"
        """

    @abstractmethod
    def get_task_types(self) -> list[str]:
        """Distinct task-type strings used for stratified sampling."""
The trainer also calls a few default-implemented helpers on every adapter:
build_reference_text, get_reference_metadata, attach_reference_context,
select_representative_items, and build_env_from_batch. Read the docstrings
in skillopt/envs/base.py if you need to override any of these — most
benchmarks don't.
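As a rough illustration of the contract above, here is a minimal sketch of an adapter. To keep it runnable without SkillOpt installed, it defines a local stand-in ABC named `EnvAdapterSketch` that mirrors only the five abstract methods; real code would subclass `skillopt.envs.base.EnvAdapter` instead. The `EchoEnv` class, its task list, its scoring rule, and the exact keys inside each edit dict are hypothetical.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class EnvAdapterSketch(ABC):
    """Local stand-in mirroring the five abstract methods documented above."""

    @abstractmethod
    def build_train_env(self, batch_size: int, seed: int, **kwargs): ...

    @abstractmethod
    def build_eval_env(self, env_num: int, split: str, seed: int, **kwargs): ...

    @abstractmethod
    def rollout(self, env_manager, skill_content: str,
                out_dir: str, **kwargs) -> list[dict]: ...

    @abstractmethod
    def reflect(self, results: list[dict], skill_content: str,
                out_dir: str, **kwargs) -> list[dict | None]: ...

    @abstractmethod
    def get_task_types(self) -> list[str]: ...


class EchoEnv(EnvAdapterSketch):
    """Hypothetical toy env: a task passes if its keyword appears in the skill."""

    TASKS = [{"id": "t1", "keyword": "retry"}, {"id": "t2", "keyword": "timeout"}]

    def build_train_env(self, batch_size, seed, **kwargs):
        return self.TASKS[:batch_size]  # the "env manager" is just a task list here

    def build_eval_env(self, env_num, split, seed, **kwargs):
        return self.TASKS[:env_num]

    def rollout(self, env_manager, skill_content, out_dir, **kwargs):
        results = []
        for task in env_manager:
            passed = task["keyword"] in skill_content
            results.append({
                "id": task["id"],
                "hard": int(passed),
                "soft": 1.0 if passed else 0.0,
                "keyword": task["keyword"],  # env-specific extra key
            })
        return results

    def reflect(self, results, skill_content, out_dir, **kwargs):
        patches = []
        for r in results:
            if r["hard"]:
                patches.append(None)  # nothing to learn from a pass; drop the slot
                continue
            # Assumed edit shape, based on the Edit fields listed on this page.
            edit = {"op": "append", "content": f"Handle {r['keyword']} explicitly."}
            patches.append({"patch": {"edits": [edit]}, "source_type": "failure"})
        return patches

    def get_task_types(self):
        return ["echo"]


env = EchoEnv()
skill = "Always retry."
results = env.rollout(env.build_train_env(batch_size=2, seed=0), skill, out_dir="out")
patches = env.reflect(results, skill, out_dir="out")
```

The toy never touches `out_dir` or `seed`; a real adapter would typically write logs under `out_dir` and use `seed` for reproducible sampling.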
BaseDataLoader / SplitDataLoader
skillopt/datasets/base.py — episode-planning loaders.
class BaseDataLoader(ABC):
    def setup(self, cfg: dict) -> None: ...

    @abstractmethod
    def build_train_batch(self, batch_size: int, seed: int, **kwargs) -> BatchSpec: ...

    @abstractmethod
    def build_eval_batch(self, env_num: int, split: str, seed: int, **kwargs) -> BatchSpec: ...


class SplitDataLoader(BaseDataLoader):
    """Concrete base for dataset-backed envs with on-disk train/val/test splits.

    Subclasses only need to implement load_split_items() (and optionally
    load_raw_items() if you also want ``split_mode='ratio'``).
    """

    def load_split_items(self, split_path: str) -> list[dict]: ...
    def load_raw_items(self, data_path: str) -> list[dict]: ...  # optional
SplitDataLoader handles two layout modes:
| split_mode | What it expects |
|---|---|
| "split_dir" | A directory with train/, val/, test/ subdirs already split. |
| "ratio" | A raw dataset path + split_ratio: "2:1:7" style string. |
In either case the items returned by load_split_items() are plain
dict objects with at minimum an "id" key.
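To make the "ratio" mode more concrete, the sketch below shows one way a "2:1:7" style string could be turned into three partitions. This is not SkillOpt's implementation: the function name `split_by_ratio` is hypothetical, the train/val/test ordering of the three weights is an assumption this page does not state, and a real loader would likely shuffle with a seed before slicing.

```python
def split_by_ratio(items: list[dict], ratio: str) -> dict[str, list[dict]]:
    """Partition items by an "a:b:c" ratio string (assumed order: train, val, test)."""
    weights = [int(part) for part in ratio.split(":")]
    if len(weights) != 3 or min(weights) < 0 or sum(weights) == 0:
        raise ValueError(f"expected three non-negative weights, got {ratio!r}")

    total = sum(weights)
    n = len(items)
    n_train = n * weights[0] // total
    n_val = n * weights[1] // total
    return {
        "train": items[:n_train],
        "val": items[n_train:n_train + n_val],
        "test": items[n_train + n_val:],  # the remainder absorbs rounding
    }


# Every item carries at least an "id" key, as required of loader items.
items = [{"id": f"item-{i}"} for i in range(10)]
splits = split_by_ratio(items, "2:1:7")
```

Integer floor division keeps the partitions disjoint and exhaustive; any leftover items from rounding fall into the last partition.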
BatchSpec
skillopt/datasets/base.py — a slotted dataclass describing one batch
request the trainer hands to the adapter.
from dataclasses import dataclass, field


@dataclass(slots=True)
class BatchSpec:
    phase: str                      # "train" | "eval"
    split: str                      # "train" | "val" | "test" | "valid_seen" | ...
    seed: int
    batch_size: int
    payload: object | None = None   # what the loader produced (e.g. list[dict])
    metadata: dict = field(default_factory=dict)
Edit / Patch
skillopt/types.py — the I/O types Reflect / Aggregate / Update produce
and consume.
from dataclasses import dataclass, field
from typing import Any, Literal

EditOp = Literal["append", "insert_after", "replace", "delete"]


@dataclass
class Edit:
    op: EditOp
    content: str = ""
    target: str = ""
    support_count: int | None = None
    source_type: Literal["failure", "success"] | None = None
    merge_level: int | None = None
    update_origin: str = ""
    update_target: str = ""


@dataclass
class Patch:
    edits: list[Edit] = field(default_factory=list)
    reasoning: str = ""
    ranking_details: dict[str, Any] | None = None
Both types support to_dict() / from_dict() for serialization.
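For a feel of the round trip, here is a hedged sketch using local stand-ins (`EditSketch`, `PatchSketch`) with a reduced field set. Implementing `to_dict()` with `dataclasses.asdict` is an assumption made for this example; the real methods in skillopt/types.py may emit a different layout or omit unset fields.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass, field
from typing import Literal


@dataclass
class EditSketch:
    op: Literal["append", "insert_after", "replace", "delete"]
    content: str = ""
    target: str = ""

    def to_dict(self) -> dict:
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict) -> EditSketch:
        return cls(**data)


@dataclass
class PatchSketch:
    edits: list[EditSketch] = field(default_factory=list)
    reasoning: str = ""

    def to_dict(self) -> dict:
        return asdict(self)  # asdict recurses into the nested edit dataclasses

    @classmethod
    def from_dict(cls, data: dict) -> PatchSketch:
        edits = [EditSketch.from_dict(e) for e in data.get("edits", [])]
        return cls(edits=edits, reasoning=data.get("reasoning", ""))


patch = PatchSketch(
    edits=[EditSketch(op="replace", target="Step 2", content="Validate input first.")],
    reasoning="Two failures skipped validation.",
)
payload = patch.to_dict()                   # plain dicts, safe to serialise as JSON
restored = PatchSketch.from_dict(payload)   # rebuilds the nested dataclasses
```

A payload of this shape is what would sit under the "patch" key of a dict returned from `reflect()`.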
RolloutResult
skillopt/types.py — the normalised rollout return type. The trainer
calls RolloutResult.from_dict(...) on each dict returned from
EnvAdapter.rollout(), so the only hard requirement on those dicts is
the three keys above (id, hard, soft). Extra fields are preserved
into RolloutResult.extras.
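The normalisation step can be pictured roughly as follows. `RolloutResultSketch` is a local stand-in rather than the real class: its field types, the float coercion, and the choice to raise `KeyError` on a missing key are assumptions for illustration only.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any

REQUIRED_KEYS = ("id", "hard", "soft")


@dataclass
class RolloutResultSketch:
    id: str
    hard: float
    soft: float
    extras: dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> RolloutResultSketch:
        missing = [key for key in REQUIRED_KEYS if key not in data]
        if missing:
            raise KeyError(f"rollout dict is missing required keys: {missing}")
        # Everything that is not a required key is carried along untouched.
        extras = {k: v for k, v in data.items() if k not in REQUIRED_KEYS}
        return cls(
            id=str(data["id"]),
            hard=float(data["hard"]),
            soft=float(data["soft"]),
            extras=extras,
        )


result = RolloutResultSketch.from_dict(
    {"id": "ep-7", "hard": 1, "soft": 0.8, "trajectory": ["look", "take key"]}
)

try:
    RolloutResultSketch.from_dict({"id": "ep-8", "hard": 0})
    missing_detected = False
except KeyError:
    missing_detected = True
```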
GateResult / GateAction
skillopt/evaluation/gate.py — the validation-gate decision types
returned each epoch.
Registering an environment
Environments are not registered via decorators or a BENCHMARK_REGISTRY
dict. The trainer keeps a lazy registry inside scripts/train.py —
_ENV_REGISTRY — populated by _register_builtins(). To add a new env
you append a try / except ImportError block there. See
Add a New Benchmark for the full step-by-step.
Backends (model layer)
The model layer lives under skillopt.model.*. Backends are selected
via model.optimizer_backend and model.target_backend in the config —
not via a base class subclass. Supported values (as of this writing):
| Backend | Optimizer? | Target? |
|---|---|---|
| openai_chat | ✓ | ✓ |
| claude_chat | ✓ | ✓ |
| qwen_chat | ✓ | ✓ |
| minimax_chat | ✓ | ✓ |
| codex_exec | - | ✓ |
| claude_code_exec | - | ✓ |
See skillopt/model/backend_config.py for the live whitelist and
docs/reference/config.md for the per-backend
configuration keys.
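As a quick sanity check before launching a run, the table above can be encoded and applied to a config. The sketch assumes the dotted keys `model.optimizer_backend` and `model.target_backend` correspond to a nested `model` mapping; `BACKEND_ROLES` and `check_model_config` are hypothetical names, and the authoritative whitelist remains skillopt/model/backend_config.py.

```python
# Roles copied from the table above; treat this as a snapshot, not the live list.
BACKEND_ROLES = {
    "openai_chat": {"optimizer", "target"},
    "claude_chat": {"optimizer", "target"},
    "qwen_chat": {"optimizer", "target"},
    "minimax_chat": {"optimizer", "target"},
    "codex_exec": {"target"},
    "claude_code_exec": {"target"},
}


def check_model_config(cfg: dict) -> list[str]:
    """Return human-readable problems with the model backend selection."""
    problems = []
    model = cfg.get("model", {})
    for role in ("optimizer", "target"):
        backend = model.get(f"{role}_backend")
        if backend not in BACKEND_ROLES:
            problems.append(f"{role}_backend: unknown backend {backend!r}")
        elif role not in BACKEND_ROLES[backend]:
            problems.append(f"{role}_backend: {backend} cannot act as {role}")
    return problems


ok = check_model_config(
    {"model": {"optimizer_backend": "claude_chat", "target_backend": "codex_exec"}}
)
bad = check_model_config(
    {"model": {"optimizer_backend": "codex_exec", "target_backend": "openai_chat"}}
)
```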