The yaml default `azure_openai_auth_mode: azure_cli` was silently
overwriting `AZURE_OPENAI_AUTH_MODE` exported by the user, because
`configure_clients()` treats any non-empty config value as an explicit
override. Switching the three auth_mode defaults (shared / optimizer /
target) to `""` lets `_clean()` drop them and restores the intended
fallback chain: yaml → env var → module default (`azure_cli`).
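
The fallback chain above can be sketched roughly as follows. This is an
illustration only: `clean` and `resolve_auth_mode` are hypothetical names,
and the assumption that `_clean()` drops `None` and empty-string values is
inferred from this message, not taken from the SkillOpt source.

```python
import os
from typing import Mapping, Optional

# Module default named in the message above.
MODULE_DEFAULT_AUTH_MODE = "azure_cli"


def clean(config: Mapping[str, object]) -> dict:
    """Drop None / empty-string values so they do not count as overrides.

    Hypothetical stand-in for `_clean()`; the real helper may differ.
    """
    return {k: v for k, v in config.items() if v is not None and v != ""}


def resolve_auth_mode(
    yaml_config: Mapping[str, object],
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Resolve the auth mode: yaml value, then env var, then module default."""
    env = os.environ if env is None else env
    cleaned = clean(yaml_config)
    return str(
        cleaned.get("azure_openai_auth_mode")
        or env.get("AZURE_OPENAI_AUTH_MODE")
        or MODULE_DEFAULT_AUTH_MODE
    )
```

With a yaml default of `""`, an exported `AZURE_OPENAI_AUTH_MODE` is honored;
with the old non-empty default, the yaml value would always win.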
Also update README and .env.example to document the openai_compatible
mode introduced in d5c5b61, and remove the misleading `OPENAI_API_KEY`
snippet — SkillOpt reuses the `AZURE_OPENAI_*` env vars in this mode.
The training gate currently always compares candidate vs. current/best
using *hard* exact-match accuracy. On environments with a small
held-out selection set (e.g. 3-6 items) or partial-credit scoring,
hard accuracy is too coarse: candidate skills that meaningfully
improve per-item soft scores get rejected because the discrete hard
count does not move.
Add three opt-in metrics so users can pick the one that matches their
scoring function:
- `gate_metric: hard` — original behavior (default, fully backward
compatible).
- `gate_metric: soft` — gate on the soft / F1 / partial-credit score.
- `gate_metric: mixed` — `(1 - w) * hard + w * soft`, where `w` is
set by `gate_mixed_weight` (default 0.5).
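
A minimal sketch of how the three metrics map to a single gate score,
following the formula above. The body is an assumption based on this
description, not a copy of the helper in `skillopt/evaluation/gate.py`.

```python
def select_gate_score(
    hard: float,
    soft: float,
    metric: str = "hard",
    mixed_weight: float = 0.5,
) -> float:
    """Project a (hard, soft) score pair onto the configured gate metric."""
    if metric == "hard":
        return float(hard)
    if metric == "soft":
        return float(soft)
    if metric == "mixed":
        # (1 - w) * hard + w * soft, with w = mixed_weight
        return (1.0 - mixed_weight) * float(hard) + mixed_weight * float(soft)
    raise ValueError(f"unknown gate metric: {metric!r}")
```

For example, a candidate that leaves hard accuracy at 0.5 but raises the
soft score from 0.5 to 0.75 is invisible to `hard`, while `mixed` with
`w = 0.5` moves the gate score from 0.5 to 0.625.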
Changes
-------
- `skillopt/evaluation/gate.py`: extend `evaluate_gate` with
`cand_soft`, `metric`, and `mixed_weight` keyword arguments; add a
pure helper `select_gate_score(hard, soft, metric, mixed_weight)`.
Defaults preserve the original `metric="hard"` behavior — existing
callers that only pass `cand_hard` keep working unchanged.
- `skillopt/evaluation/__init__.py`: export the new helper / type.
- `skillopt/engine/trainer.py`: read `evaluation.gate_metric` and
`evaluation.gate_mixed_weight` from the config (with safe defaults),
pass both metrics into `evaluate_gate`, and project the baseline
`current_score` / `best_score` into metric space so subsequent
comparisons are consistent. Print the gate metric on the
`[6/6 EVALUATE]` line so logs make the decision basis explicit. The
selection cache still records both `(hard, soft)` so a metric change
on resume is non-destructive.
- `configs/_base_/default.yaml`: document and ship the new keys with
backward-compatible defaults (`hard`, `0.5`).
Backward compatibility
----------------------
- Default config does not change behavior: `gate_metric` defaults to
`hard`, exactly matching the previous gate.
- `evaluate_gate(...)` keeps its existing positional signature; the
new parameters are keyword-only with safe defaults.
- `step_record.json` gains optional `gate_metric` and
`candidate_gate_score` fields; old records still load.
Tested
------
- Unit-tested all three metrics + boundary `mixed_weight` values
(0.0 / 1.0) and rejection of unknown metric strings. All six cases
pass.
- Verified `skillopt.engine.trainer` imports cleanly after the
refactor.
Claude Code CLI v2.x renamed the flag; passing `--thinking low` causes
all rollout calls to fail on CLI 2.1.87+.
Co-authored-by: Cursor <cursoragent@cursor.com>
`int()` truncates any float in [0, 1) to 0, so smoothed composite scores
in that range were all recorded as failures.
This broke SkillOpt training pipelines using SmoothedCompositeReward.
Replace `int()` with `float()`.
Also fix the falsy-float check in failure detection.
Backward compatible with binary hard=0/1.
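
The truncation can be reproduced in isolation. The helper names below are
hypothetical, and the explicit `is None` check is only one plausible reading
of the "falsy float check" fix, since the original failure-detection code is
not shown here.

```python
def parse_score_truncating(raw) -> int:
    """Old behavior: int() discards the fractional part."""
    return int(raw)


def parse_score(raw) -> float:
    """New behavior: float() keeps continuous rewards intact."""
    return float(raw)


def is_missing(score) -> bool:
    """Explicit None check; `not score` would also flag a legitimate 0.0."""
    return score is None


# A smoothed reward of 0.73 collapses to 0 under int() but survives float().
assert parse_score_truncating(0.73) == 0
assert parse_score(0.73) == 0.73

# Binary hard scores are unaffected by the change.
assert parse_score(0) == 0.0
assert parse_score(1) == 1.0
```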
- Add 'openai_compatible', 'compat', and 'openai' auth modes to azure_openai.py
- Modify _make_client() to use OpenAI client (not AzureOpenAI) for compatible endpoints
- Update type hints to support both AzureOpenAI and OpenAI clients
- Auto-configure API version sentinel when using compatible modes
- Add .env template for Pioneer.ai configuration
This allows users to use Pioneer.ai or any OpenAI-compatible API endpoint
as both optimizer and target backend without requiring Azure OpenAI.
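
As a rough sketch of the mode dispatch, the three new mode strings can be
treated as aliases that select the plain OpenAI client. The constant and
function names are hypothetical, and the case/whitespace normalization is an
assumption; `azure_openai.py` may handle this differently.

```python
# Mode strings listed in this change; all three select the plain OpenAI client.
COMPATIBLE_AUTH_MODES = frozenset({"openai_compatible", "compat", "openai"})


def uses_openai_compatible_client(auth_mode: str) -> bool:
    """Return True when the mode should use OpenAI rather than AzureOpenAI.

    Lower-casing and stripping the input is an assumption of this sketch.
    """
    return auth_mode.strip().lower() in COMPATIBLE_AUTH_MODES
```

Any other mode, such as `azure_cli`, would continue to go through the
AzureOpenAI client path.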
Resolves: Support for non-Azure OpenAI-compatible providers
Remove sealqa, babyvision, mathverse, mmrb, swebench envs and configs.
Remove deep_probe, deep_reflect, meta_reflect modules and prompts.
Remove download_babyvision script.
These are not part of the core released benchmarks.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>