Files
Yifan Yang 2ca2910649 docs: align API reference and Add-a-Benchmark guide with real EnvAdapter ABC
docs/reference/api.md previously documented a fictional EnvAdapter API
(execute / evaluate / build_prompt + DataItem / TaskResult) and a
BENCHMARK_REGISTRY that never existed in code. Anyone following the
documented contract would hit ImportError or TypeError on the first
instantiation.

Replace both pages with the real shape from skillopt/envs/base.py and
skillopt/datasets/base.py:

- EnvAdapter: build_train_env, build_eval_env, rollout, reflect,
  get_task_types (the 5 actual abstract methods).
- Rollout dicts: id / hard / soft required; everything else preserved
  into RolloutResult.extras.
- Reflect dicts: {patch, source_type} schema as consumed by
  run_minibatch_reflect.
- BatchSpec: slotted-but-mutable dataclass matching the actual
  definition (payload defaults to None, metadata to dict()).
- SplitDataLoader.load_split_items as the one mandatory loader method.
- Registry: _ENV_REGISTRY in scripts/train.py (lazy try/except
  ImportError block), not a non-existent BENCHMARK_REGISTRY in
  skillopt/envs/__init__.py.
- _base_: documented as a string path, since the current YAML loader
  only accepts strings.

The new-benchmark.md guide now walks through a docfaithful worked
example with a real rollout helper (chat_target + scorer) instead of
hand-waving over the rollout step. Refs microsoft/SkillOpt#30.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-01 20:14:54 +00:00

6.7 KiB

API Reference

This page documents the public Python API SkillOpt exposes for extending the framework with new environments / benchmarks. For ready-made adapters, browse skillopt/envs/.

Source of truth. The classes below are real Python ABCs defined in skillopt/envs/base.py, skillopt/datasets/base.py, skillopt/types.py, and skillopt/evaluation/gate.py. If this page ever drifts, the code wins — please open an issue.


Core Classes

EnvAdapter

skillopt/envs/base.py — abstract adapter that connects the SkillOpt trainer to an environment (benchmark, simulator, REST API, ...). Subclasses must implement the five abstract methods below.

from abc import ABC, abstractmethod
from skillopt.datasets.base import BaseDataLoader, BatchSpec

class EnvAdapter(ABC):

    # ── Lifecycle hooks (have defaults; override only if needed) ────────

    def setup(self, cfg: dict) -> None: ...
    def get_dataloader(self) -> BaseDataLoader | None: ...
    def requires_ray(self) -> bool: ...                 # default False

    # ── Abstract methods (subclasses MUST implement) ────────────────────

    @abstractmethod
    def build_train_env(self, batch_size: int, seed: int, **kwargs):
        """Return an environment-manager object to be passed to rollout()."""

    @abstractmethod
    def build_eval_env(self, env_num: int, split: str, seed: int, **kwargs):
        """Like build_train_env() but for a fixed eval split."""

    @abstractmethod
    def rollout(self, env_manager, skill_content: str,
                out_dir: str, **kwargs) -> list[dict]:
        """Run a batch of episodes with the current skill.

        Each returned dict MUST contain:
          - "id":   str        episode/task identifier
          - "hard": int (0|1)  pass/fail (may be float 0.0-1.0 if smoothed)
          - "soft": float      partial-credit score in [0.0, 1.0]
        It MAY contain env-specific extra keys (parsed into RolloutResult.extras).
        """

    @abstractmethod
    def reflect(self, results: list[dict], skill_content: str,
                out_dir: str, **kwargs) -> list[dict | None]:
        """Turn rollout results into a list of raw patch dicts.

        Each dict (or None to drop the slot) MUST contain:
          - "patch":       {"edits": [...]}     a Patch.to_dict() payload
          - "source_type": "failure" | "success"
        """

    @abstractmethod
    def get_task_types(self) -> list[str]:
        """Distinct task-type strings used for stratified sampling."""

The trainer also calls a few default-implemented helpers on every adapter: build_reference_text, get_reference_metadata, attach_reference_context, select_representative_items, and build_env_from_batch. Read the docstrings in skillopt/envs/base.py if you need to override any of these — most benchmarks don't.

BaseDataLoader / SplitDataLoader

skillopt/datasets/base.py — episode-planning loaders.

class BaseDataLoader(ABC):
    def setup(self, cfg: dict) -> None: ...
    @abstractmethod
    def build_train_batch(self, batch_size: int, seed: int, **kwargs) -> BatchSpec: ...
    @abstractmethod
    def build_eval_batch(self, env_num: int, split: str, seed: int, **kwargs) -> BatchSpec: ...

class SplitDataLoader(BaseDataLoader):
    """Concrete base for dataset-backed envs with on-disk train/val/test splits.

    Subclasses only need to implement load_split_items() (and optionally
    load_raw_items() if you also want ``split_mode='ratio'``).
    """
    def load_split_items(self, split_path: str) -> list[dict]: ...
    def load_raw_items(self, data_path: str) -> list[dict]: ...   # optional

SplitDataLoader handles two layout modes:

split_mode What it expects
"split_dir" A directory with train/, val/, test/ subdirs already split.
"ratio" A raw dataset path + split_ratio: "2:1:7" style string.

In either case the items returned by load_split_items() are plain dict objects with at minimum an "id" key.

BatchSpec

skillopt/datasets/base.py — a slotted dataclass describing one batch request the trainer hands to the adapter.

@dataclass(slots=True)
class BatchSpec:
    phase: str                 # "train" | "eval"
    split: str                 # "train" | "val" | "test" | "valid_seen" | ...
    seed: int
    batch_size: int
    payload: object | None = None     # what the loader produced (e.g. list[dict])
    metadata: dict = field(default_factory=dict)

Edit / Patch

skillopt/types.py — the I/O types Reflect / Aggregate / Update produce and consume.

EditOp = Literal["append", "insert_after", "replace", "delete"]

@dataclass
class Edit:
    op: EditOp
    content: str = ""
    target: str = ""
    support_count: int | None = None
    source_type: Literal["failure", "success"] | None = None
    merge_level: int | None = None
    update_origin: str = ""
    update_target: str = ""

@dataclass
class Patch:
    edits: list[Edit] = field(default_factory=list)
    reasoning: str = ""
    ranking_details: dict[str, Any] | None = None

Both types support to_dict() / from_dict() for serialization.

RolloutResult

skillopt/types.py — the normalised rollout return type. The trainer calls RolloutResult.from_dict(...) on each dict returned from EnvAdapter.rollout(), so the only hard requirement on those dicts is the three keys above (id, hard, soft). Extra fields are preserved into RolloutResult.extras.

GateResult / GateAction

skillopt/evaluation/gate.py — the validation-gate decision types returned each epoch.


Registering an environment

Environments are not registered via decorators or a BENCHMARK_REGISTRY dict. The trainer keeps a lazy registry inside scripts/train.py_ENV_REGISTRY — populated by _register_builtins(). To add a new env you append a try / except ImportError block there. See Add a New Benchmark for the full step-by-step.


Backends (model layer)

The model layer lives under skillopt.model.*. Backends are selected via model.optimizer_backend and model.target_backend in the config — not via a base class subclass. Supported values (as of this writing):

Backend Optimizer? Target?
openai_chat
claude_chat
qwen_chat
minimax_chat
codex_exec
claude_code_exec

See skillopt/model/backend_config.py for the live whitelist and docs/reference/config.md for the per-backend configuration keys.