The shipped env_template.py and loader_template.py described the same
fictional async execute / evaluate / build_prompt API documented in
docs/reference/api.md. As a result TemplateBenchmarkEnv(cfg) raised
'TypeError: Can't instantiate abstract class' for every copy-and-paste
user who followed the in-tree scaffold.
Rewrite the template so it's a working starting point:
- env_template.py: TemplateBenchmarkEnv(EnvAdapter) now implements all
five real abstract methods (build_train_env, build_eval_env, rollout,
reflect, get_task_types) with no-op defaults documented as TODO.
Instantiable today; pytest 60/60 still passes.
- loader_template.py: TemplateBenchmarkLoader(SplitDataLoader)
implements load_split_items for .json / .jsonl input and explains the
optional load_raw_items override for split_mode="ratio".
- README.md: usage steps now point at scripts/train.py's _ENV_REGISTRY
(the real registry) instead of a non-existent BENCHMARK_REGISTRY in
skillopt/envs/__init__.py, and link to the rewritten new-benchmark
guide.
- config_template.yaml: _base_ is a string path (not a list, which the
loader rejects); skill_init is commented out with a note so the
template config doesn't reference a file the user hasn't created.
Verified locally: 'from skillopt.envs._template.env_template import
TemplateBenchmarkEnv; TemplateBenchmarkEnv()' succeeds. Refs
microsoft/SkillOpt#30.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
docs/reference/api.md previously documented a fictional EnvAdapter API
(execute / evaluate / build_prompt + DataItem / TaskResult) and a
BENCHMARK_REGISTRY that never existed in code. Anyone following the
documented contract would hit ImportError or TypeError on the first
instantiation.
Replace both pages with the real shape from skillopt/envs/base.py and
skillopt/datasets/base.py:
- EnvAdapter: build_train_env, build_eval_env, rollout, reflect,
get_task_types (the 5 actual abstract methods).
- Rollout dicts: id / hard / soft required; everything else preserved
into RolloutResult.extras.
- Reflect dicts: {patch, source_type} schema as consumed by
run_minibatch_reflect.
- BatchSpec: slotted-but-mutable dataclass matching the actual
definition (payload defaults to None, metadata to dict()).
- SplitDataLoader.load_split_items as the one mandatory loader method.
- Registry: _ENV_REGISTRY in scripts/train.py (lazy try/except
ImportError block), not a non-existent BENCHMARK_REGISTRY in
skillopt/envs/__init__.py.
- _base_: documented as a string path, since the current YAML loader
only accepts strings.
The new-benchmark.md guide now walks through a docfaithful worked
example with a real rollout helper (chat_target + scorer) instead of
hand-waving over the rollout step. Refs microsoft/SkillOpt#30.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Add initial test infrastructure covering:
- skillopt/utils/scoring.py (compute_score, skill_hash)
- skillopt/utils/json_utils.py (extract_json, extract_json_array)
- skillopt/types.py (Edit, Patch dataclass serialization)
All tested functions are pure/deterministic with no LLM dependencies.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
PR #26 added a MiniMax chat backend but left three loose ends that
silently dropped any YAML / CLI configuration of minimax_* keys: only
the environment-variable path worked.
- skillopt/config.py: add 6 model.minimax_* entries to _FLATTEN_MAP so
the keys declared in configs/_base_/default.yaml actually survive
flatten_config() (mirroring the existing model.qwen_chat_* block).
- skillopt/engine/trainer.py: import configure_minimax_chat and call
it alongside configure_qwen_chat, so cfg-supplied credentials,
temperature, max_tokens, and enable_thinking reach the backend. Also
apply cfg["minimax_model"] via set_target_deployment when the active
target backend is minimax_chat.
- scripts/train.py: add 6 --minimax_* CLI flags + the corresponding
_CLI_TO_YAML entries, add 'minimax' / 'minimax_chat' to the --backend
choices, auto-route to target_backend=minimax_chat, and pick the
right default target_model for the new backend.
Default behavior on existing backends (openai, claude, qwen, codex,
claude_code_exec) is unchanged; all 8 shipped configs continue to load
with gate_metric falling back to 'hard' for paper reproduction.
feat: add MiniMax as first-class chat backend
Adds skillopt/model/minimax_backend.py (clean port of qwen_backend.py
targeting MiniMax-M2.7 via https://api.minimax.io/v1) and registers it
in the router, backend_config, and common defaults. Existing backends
(openai_chat, claude_chat, qwen_chat, codex_exec, claude_code_exec)
remain bit-for-bit unchanged.
Verified via 10 import / routing / parity subtests; backward-compat
sweep across the 8 shipped configs passes with no regression.
A follow-up commit completes the YAML / CLI plumbing that this PR left
half-wired (FLATTEN_MAP entries, trainer-level configure_minimax_chat
call, and --minimax_* CLI args).
Add optimizer.slow_update_gate_with_selection to control how epoch-boundary
slow-update guidance is applied:
- false (default): force-injected - inject guidance into current & best
unconditionally (unchanged behavior).
- true: gated - evaluate the slow-update candidate on the selection set and
accept/reject via the same validation gate as step-level updates
(logic follows the SkillReflection ablation).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Port qwen_backend.py pattern to minimax_backend.py as a new
OpenAI-compatible urllib-based backend. Includes:
- BASE_URL defaulting to https://api.minimax.chat/v1
- API_KEY, TIMEOUT_SECONDS, MAX_TOKENS, TEMPERATURE env vars
- ENABLE_THINKING support (MiniMax thinking mode)
- configure_minimax_chat() runtime configurator
- chat_target() and chat_target_messages() functions
- TokenTracker integration and get_token_summary()
- set_target_deployment() support
- Default model: MiniMax/MiniMax-Text-01
Move the soft/mixed gate-metric configuration introduced in PR #25 out of
the base default config and into a standalone example config so that
default SkillOpt runs (and paper reproduction) remain bit-for-bit on the
original hard gate.
- configs/_base_/default.yaml: drop gate_metric / gate_mixed_weight keys.
The trainer's cfg.get("gate_metric", "hard") fallback preserves the
original behavior unchanged.
- configs/examples/soft_gate.yaml: new standalone reference config with
a header explaining when to consider it (small selection split with
continuous rewards) and when not to (paper reproduction, large or
binary-reward settings).
- README.md: add a short "Community-contributed configs" section that
clearly flags this as user-contributed and non-default.
Add configurable gate metric (hard / soft / mixed) for skill validation
Default is `hard`, preserving exact pre-PR behavior — verified by 22 unit
assertions on the gate module plus an end-to-end 8-step trainer-trajectory
test that produces a bit-for-bit identical accept/reject sequence between
the pre-PR and post-PR code paths under `gate_metric: hard`. Paper-
reproduction results are unaffected.
`soft` and `mixed` are opt-in via `evaluation.gate_metric` in the config
and address small-selection-set runs where discrete hard accuracy is too
coarse to distinguish candidate skills.
The yaml default `azure_openai_auth_mode: azure_cli` was silently
overwriting `AZURE_OPENAI_AUTH_MODE` exported by the user, because
`configure_clients()` treats any non-empty config value as an explicit
override. Switching the three auth_mode defaults (shared / optimizer /
target) to "" lets `_clean()` drop them and restores the intended
fallback chain: yaml → env var → module default ("azure_cli").
Also update README and .env.example to document the openai_compatible
mode introduced in d5c5b61, and remove the misleading `OPENAI_API_KEY`
snippet — SkillOpt reuses the `AZURE_OPENAI_*` env vars in this mode.
The training gate currently always compares candidate vs. current/best
using *hard* exact-match accuracy. On environments with a small
held-out selection set (e.g. 3-6 items) or partial-credit scoring,
hard accuracy is too coarse: candidate skills that meaningfully
improve per-item soft scores get rejected because the discrete hard
count does not move.
Add three opt-in metrics so users can pick the one that matches their
scoring function:
- `gate_metric: hard` — original behavior (default, fully backward
compatible).
- `gate_metric: soft` — gate on the soft / F1 / partial-credit score.
- `gate_metric: mixed` — `(1 - w) * hard + w * soft`, where `w` is
set by `gate_mixed_weight` (default 0.5).
Changes
-------
- `skillopt/evaluation/gate.py`: extend `evaluate_gate` with
`cand_soft`, `metric`, and `mixed_weight` keyword arguments; add a
pure helper `select_gate_score(hard, soft, metric, mixed_weight)`.
Defaults preserve the original `metric="hard"` behavior — existing
callers that only pass `cand_hard` keep working unchanged.
- `skillopt/evaluation/__init__.py`: export the new helper / type.
- `skillopt/engine/trainer.py`: read `evaluation.gate_metric` and
`evaluation.gate_mixed_weight` from the config (with safe defaults),
pass both metrics into `evaluate_gate`, and project the baseline
`current_score` / `best_score` into metric space so subsequent
comparisons are consistent. Print the gate metric on the
`[6/6 EVALUATE]` line so logs make the decision basis explicit. The
selection cache still records both `(hard, soft)` so a metric change
on resume is non-destructive.
- `configs/_base_/default.yaml`: document and ship the new keys with
backward-compatible defaults (`hard`, `0.5`).
Backward compatibility
----------------------
- Default config does not change behavior: `gate_metric` defaults to
`hard`, exactly matching the previous gate.
- `evaluate_gate(...)` keeps its existing positional signature; the
new parameters are keyword-only with safe defaults.
- `step_record.json` gains optional `gate_metric` and
`candidate_gate_score` fields; old records still load.
Tested
------
- Unit-tested all three metrics + boundary `mixed_weight` values
(0.0 / 1.0) and rejection of unknown metric strings. All six cases
pass.
- Verified `skillopt.engine.trainer` imports cleanly after the
refactor.
Claude Code CLI v2.x renamed the flag; passing --thinking low causes
all rollout calls to fail on CLI 2.1.87+.
Co-authored-by: Cursor <cursoragent@cursor.com>
int() truncates smoothed composite scores (0.0-1.0) to 0,
making all continuous reward values appear as failures.
This broke SkillOpt training pipelines using SmoothedCompositeReward.
int() truncates any float in [0,1) to 0. Replace with float().
Also fix falsy float check in failure detection.
Backward compatible with binary hard=0/1.
- Add 'openai_compatible', 'compat', and 'openai' auth modes to azure_openai.py
- Modify _make_client() to use OpenAI client (not AzureOpenAI) for compatible endpoints
- Update type hints to support both AzureOpenAI and OpenAI clients
- Auto-configure API version sentinel when using compatible modes
- Add .env template for Pioneer.ai configuration
This allows users to use Pioneer.ai or any OpenAI-compatible API endpoint
as both optimizer and target backend without requiring Azure OpenAI.
Resolves: Support for non-Azure OpenAI-compatible providers