Commit Graph

87 Commits

Author SHA1 Message Date
copilot-swe-agent[bot]
4f582d4f6e test: add template contract checks and refine benchmark docs 2026-06-01 19:39:52 +00:00
copilot-swe-agent[bot]
b3c7d72364 docs: align benchmark guide and templates with real adapter API 2026-06-01 19:38:17 +00:00
copilot-swe-agent[bot]
36284e1bb0 Initial plan 2026-06-01 19:31:30 +00:00
Yifan Yang
fb1a76371d Merge pull request #29 from LifeIsSoSolong/codex/qwen-chat-optimizer-backend
Support qwen_chat as optimizer backend
2026-06-02 03:27:50 +08:00
Yifan Yang
47063e1ceb Merge pull request #27 from Oxygen56/test/add-core-utility-tests
test: add unit test suite for core utility modules
2026-06-02 03:27:26 +08:00
hwq
181d71b737 Release data split manifests 2026-06-01 16:02:14 +00:00
kaikai-macbook
41012e2d5e Support Qwen chat as optimizer backend 2026-06-01 16:44:49 +08:00
Claude Code Agent
dd8cd993b5 test: add unit test suite for core utility modules
Add initial test infrastructure covering:
- skillopt/utils/scoring.py (compute_score, skill_hash)
- skillopt/utils/json_utils.py (extract_json, extract_json_array)
- skillopt/types.py (Edit, Patch dataclass serialization)

All tested functions are pure/deterministic with no LLM dependencies.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-01 02:04:22 +08:00
Yif Yang
8ebede0efd Refine README for clarity on optimization results
Removed redundant wording about math benchmarks.
2026-05-31 18:20:00 +08:00
Yif Yang
266fca72ab docs: clarify optional features and ckpt artifacts 2026-05-31 09:36:25 +00:00
Yif Yang
9265545c45 docs: clarify README and paper-aligned skill artifacts 2026-05-31 09:23:07 +00:00
Yif Yang
b4850ce418 fix(minimax): wire YAML / CLI config through to backend
PR #26 added a MiniMax chat backend but left three loose ends that
silently dropped any YAML / CLI configuration of minimax_* keys: only
the environment-variable path worked.

- skillopt/config.py: add 6 model.minimax_* entries to _FLATTEN_MAP so
  the keys declared in configs/_base_/default.yaml actually survive
  flatten_config() (mirroring the existing model.qwen_chat_* block).
- skillopt/engine/trainer.py: import configure_minimax_chat and call
  it alongside configure_qwen_chat, so cfg-supplied credentials,
  temperature, max_tokens, and enable_thinking reach the backend. Also
  apply cfg["minimax_model"] via set_target_deployment when the active
  target backend is minimax_chat.
- scripts/train.py: add 6 --minimax_* CLI flags + the corresponding
  _CLI_TO_YAML entries, add 'minimax' / 'minimax_chat' to the --backend
  choices, auto-route to target_backend=minimax_chat, and pick the
  right default target_model for the new backend.

Default behavior on existing backends (openai, claude, qwen, codex,
claude_code_exec) is unchanged; all 8 shipped configs continue to load
with gate_metric falling back to 'hard' for paper reproduction.
2026-05-31 08:22:20 +00:00
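
For readers unfamiliar with the flatten step this commit refers to, here is a minimal sketch of how a whitelist-style map can silently drop nested YAML keys that are not registered. Only the names `_FLATTEN_MAP` and `flatten_config` come from the commit message. The map entries, the dotted-path format, and the function body are assumptions made for illustration, not the actual contents of skillopt/config.py.

```python
from typing import Any

_MISSING = object()  # sentinel so an explicit None in the YAML is still copied

# Hypothetical excerpt: dotted YAML path -> flat key read by the trainer.
_FLATTEN_MAP = {
    "model.qwen_chat_model": "qwen_chat_model",
    "model.minimax_model": "minimax_model",
    "model.minimax_temperature": "minimax_temperature",
}


def flatten_config(nested: dict[str, Any]) -> dict[str, Any]:
    """Copy only the mapped nested keys into a flat dict (assumed behavior)."""
    flat: dict[str, Any] = {}
    for dotted, flat_key in _FLATTEN_MAP.items():
        node: Any = nested
        for part in dotted.split("."):
            if not isinstance(node, dict) or part not in node:
                node = _MISSING
                break
            node = node[part]
        if node is not _MISSING:
            flat[flat_key] = node
    return flat


# A key without a map entry never reaches the flat config.
cfg = flatten_config({"model": {"minimax_model": "MiniMax-M2.7", "unmapped_key": 1}})
print(cfg)
```

Under this assumed design, a key declared in the YAML but missing from the map is dropped without any error, which matches the symptom the commit describes.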
Yif Yang
643346c9f3 Merge pull request #26 from KovaForge/minimax-backend
feat: add MiniMax as first-class chat backend

Adds skillopt/model/minimax_backend.py (clean port of qwen_backend.py
targeting MiniMax-M2.7 via https://api.minimax.io/v1) and registers it
in the router, backend_config, and common defaults. Existing backends
(openai_chat, claude_chat, qwen_chat, codex_exec, claude_code_exec)
remain bit-for-bit unchanged.

Verified via 10 import / routing / parity subtests; backward-compat
sweep across the 8 shipped configs passes with no regression.

A follow-up commit completes the YAML / CLI plumbing that this PR left
half-wired (FLATTEN_MAP entries, trainer-level configure_minimax_chat
call, and --minimax_* CLI args).
2026-05-31 08:20:39 +00:00
Cuzyoung
00602df9e9 feat(slow-update): add config-controlled gated / force-injected modes
Add optimizer.slow_update_gate_with_selection to control how epoch-boundary
slow-update guidance is applied:
- false (default): force-injected - inject guidance into current & best
  unconditionally (unchanged behavior).
- true: gated - evaluate the slow-update candidate on the selection set and
  accept/reject via the same validation gate as step-level updates
  (logic follows the SkillReflection ablation).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-31 02:02:23 +00:00
Declan Murphy
c6da31df44 fix: use correct MiniMax endpoint, model name, and add .venv to gitignore 2026-05-31 05:27:50 +08:00
Declan Murphy
e4201074aa docs: add MiniMax config to default.yaml and .env.example
default.yaml:
- Add minimax_base_url, minimax_api_key, minimax_model, minimax_temperature,
  minimax_max_tokens, minimax_enable_thinking settings
- Add optimizer_minimax_base_url, target_minimax_base_url per-role overrides
- Add optimizer_minimax_api_key, target_minimax_api_key per-role overrides

.env.example:
- Add MINIMAX_BASE_URL, MINIMAX_API_KEY, MINIMAX_MODEL env var docs
2026-05-31 05:22:35 +08:00
Declan Murphy
309ea64ff4 feat: integrate MiniMax into model router, backend config, and common
common.py:
- Add minimax_chat → MiniMax/MiniMax-Text-01 to _BACKEND_DEFAULT_MODELS
- Add minimax/minimax_chat aliases to _BACKEND_ALIASES

backend_config.py:
- Add minimax_chat to set_optimizer_backend() valid set
- Add minimax_chat to set_target_backend() valid set
- Add minimax_chat to is_optimizer_chat_backend()
- Add minimax_chat to is_target_chat_backend()

__init__.py:
- Import minimax_backend as _minimax
- Add minimax_chat to set_backend() legacy handler
- Add minimax_chat to get_backend_name() reporting
- Route chat_target() and chat_target_messages() to _minimax
- Update NotImplementedError messages to list minimax_chat
- Aggregate _minimax into get_token_summary()
- Add _minimax.reset_token_tracker()
- Add configure_minimax_chat() delegator
- Add _minimax to set_reasoning_effort() and set_target_deployment()
2026-05-31 05:22:33 +08:00
Declan Murphy
d224d425f9 feat: add MiniMax chat backend module
Port qwen_backend.py pattern to minimax_backend.py as a new
OpenAI-compatible urllib-based backend. Includes:
- BASE_URL defaulting to https://api.minimax.chat/v1
- API_KEY, TIMEOUT_SECONDS, MAX_TOKENS, TEMPERATURE env vars
- ENABLE_THINKING support (MiniMax thinking mode)
- configure_minimax_chat() runtime configurator
- chat_target() and chat_target_messages() functions
- TokenTracker integration and get_token_summary()
- set_target_deployment() support
- Default model: MiniMax/MiniMax-Text-01
2026-05-31 05:22:29 +08:00
hwq
42e555d28e Update eval-only README example 2026-05-30 15:28:17 +00:00
hwq
933c0a4ab5 Add GPT-5.5 benchmark skills 2026-05-30 15:15:15 +00:00
hwq
1f75d022a5 y 2026-05-30 15:01:34 +00:00
Yif Yang
4f3a9bc055 docs: scope PR #25 gate_metric as opt-in example, not default
Move the soft/mixed gate-metric configuration introduced in PR #25 out of
the base default config and into a standalone example config so that
default SkillOpt runs (and paper reproduction) remain bit-for-bit on the
original hard gate.

- configs/_base_/default.yaml: drop gate_metric / gate_mixed_weight keys.
  The trainer's cfg.get("gate_metric", "hard") fallback preserves the
  original behavior unchanged.
- configs/examples/soft_gate.yaml: new standalone reference config with
  a header explaining when to consider it (small selection split with
  continuous rewards) and when not to (paper reproduction, large or
  binary-reward settings).
- README.md: add a short "Community-contributed configs" section that
  clearly flags this as user-contributed and non-default.
2026-05-30 08:09:03 +00:00
Yif Yang
d190bf37c1 Merge pull request #25 from lvbaocheng/feature/gate-soft-metric
Add configurable gate metric (hard / soft / mixed) for skill validation

Default is `hard`, preserving exact pre-PR behavior — verified by 22 unit
assertions on the gate module plus an end-to-end 8-step trainer-trajectory
test that produces a bit-for-bit identical accept/reject sequence between
the pre-PR and post-PR code paths under `gate_metric: hard`. Paper-
reproduction results are unaffected.

`soft` and `mixed` are opt-in via `evaluation.gate_metric` in the config
and address small-selection-set runs where discrete hard accuracy is too
coarse to distinguish candidate skills.
2026-05-30 08:01:39 +00:00
Yif Yang
02695bd813 Merge pull request #24 from lvbaocheng/fix/claude-cli-effort-flag
fix(claude): use --effort instead of deprecated --thinking flag
2026-05-30 15:31:00 +08:00
Yif Yang
cf287cb608 Merge pull request #20 from 1s1x/fix-continuous-reward-scores
fix: support continuous reward scores (int truncation + falsy float)
2026-05-30 15:30:15 +08:00
Huangzisu
dbc90bd755 fix(auth): let env vars override yaml for openai_compatible mode
The yaml default `azure_openai_auth_mode: azure_cli` was silently
overwriting `AZURE_OPENAI_AUTH_MODE` exported by the user, because
`configure_clients()` treats any non-empty config value as an explicit
override. Switching the three auth_mode defaults (shared / optimizer /
target) to "" lets `_clean()` drop them and restores the intended
fallback chain: yaml → env var → module default ("azure_cli").

Also update README and .env.example to document the openai_compatible
mode introduced in d5c5b61, and remove the misleading `OPENAI_API_KEY`
snippet — SkillOpt reuses the `AZURE_OPENAI_*` env vars in this mode.
2026-05-30 06:58:05 +00:00
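
The fallback chain described in this commit (yaml, then env var, then module default) can be sketched as follows. The names `_clean`, `AZURE_OPENAI_AUTH_MODE`, and the "azure_cli" default appear in the commit message. The body of `_clean` and the helper `resolve_auth_mode` are hypothetical and only illustrate why an empty yaml default lets the environment variable win.

```python
from typing import Mapping, Optional

DEFAULT_AUTH_MODE = "azure_cli"  # module default named in the commit message


def _clean(value: Optional[str]) -> Optional[str]:
    """Treat None and blank strings as 'not set' (assumed behavior)."""
    if value is None:
        return None
    value = value.strip()
    return value or None


def resolve_auth_mode(yaml_value: Optional[str], env: Mapping[str, str]) -> str:
    """Hypothetical helper: yaml value first, then env var, then default."""
    return (
        _clean(yaml_value)
        or _clean(env.get("AZURE_OPENAI_AUTH_MODE"))
        or DEFAULT_AUTH_MODE
    )


env = {"AZURE_OPENAI_AUTH_MODE": "openai_compatible"}

# Old yaml default: the non-empty config value shadows the exported env var.
print(resolve_auth_mode("azure_cli", env))
# New yaml default "": the value is dropped and the env var takes effect.
print(resolve_auth_mode("", env))
```

Passing the environment as a mapping (for example `os.environ`) keeps the sketch easy to test without touching the real process environment.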
lvbaocheng
5d7875cb2e Add configurable gate metric (hard / soft / mixed) for skill validation
The training gate currently always compares candidate vs. current/best
using *hard* exact-match accuracy. On environments with a small
held-out selection set (e.g. 3-6 items) or partial-credit scoring,
hard accuracy is too coarse: candidate skills that meaningfully
improve per-item soft scores get rejected because the discrete hard
count does not move.

Add three opt-in metrics so users can pick the one that matches their
scoring function:

- `gate_metric: hard`  — original behavior (default, fully backward
  compatible).
- `gate_metric: soft`  — gate on the soft / F1 / partial-credit score.
- `gate_metric: mixed` — `(1 - w) * hard + w * soft`, where `w` is
  set by `gate_mixed_weight` (default 0.5).

Changes
-------
- `skillopt/evaluation/gate.py`: extend `evaluate_gate` with
  `cand_soft`, `metric`, and `mixed_weight` keyword arguments; add a
  pure helper `select_gate_score(hard, soft, metric, mixed_weight)`.
  Defaults preserve the original `metric="hard"` behavior — existing
  callers that only pass `cand_hard` keep working unchanged.
- `skillopt/evaluation/__init__.py`: export the new helper / type.
- `skillopt/engine/trainer.py`: read `evaluation.gate_metric` and
  `evaluation.gate_mixed_weight` from the config (with safe defaults),
  pass both metrics into `evaluate_gate`, and project the baseline
  `current_score` / `best_score` into metric space so subsequent
  comparisons are consistent. Print the gate metric on the
  `[6/6 EVALUATE]` line so logs make the decision basis explicit. The
  selection cache still records both `(hard, soft)` so a metric change
  on resume is non-destructive.
- `configs/_base_/default.yaml`: document and ship the new keys with
  backward-compatible defaults (`hard`, `0.5`).

Backward compatibility
----------------------
- Default config does not change behavior: `gate_metric` defaults to
  `hard`, exactly matching the previous gate.
- `evaluate_gate(...)` keeps its existing positional signature; the
  new parameters are keyword-only with safe defaults.
- `step_record.json` gains optional `gate_metric` and
  `candidate_gate_score` fields; old records still load.

Tested
------
- Unit-tested all three metrics + boundary `mixed_weight` values
  (0.0 / 1.0) and rejection of unknown metric strings. All six cases
  pass.
- Verified `skillopt.engine.trainer` imports cleanly after the
  refactor.
2026-05-30 14:45:27 +08:00
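
The three gate metrics described in this commit can be sketched as a small pure function. The name `select_gate_score` and its `(hard, soft, metric, mixed_weight)` parameters are taken from the commit message, as is the mixed formula `(1 - w) * hard + w * soft`. The body below, including the range check on `mixed_weight`, is an assumption and not the code in skillopt/evaluation/gate.py.

```python
def select_gate_score(
    hard: float,
    soft: float,
    metric: str = "hard",
    mixed_weight: float = 0.5,
) -> float:
    """Project a (hard, soft) score pair into the chosen gate metric."""
    if metric == "hard":
        return hard
    if metric == "soft":
        return soft
    if metric == "mixed":
        # Assumption: the weight is expected to lie in [0, 1].
        if not 0.0 <= mixed_weight <= 1.0:
            raise ValueError(f"mixed_weight must be in [0, 1], got {mixed_weight}")
        return (1.0 - mixed_weight) * hard + mixed_weight * soft
    raise ValueError(f"unknown gate metric: {metric!r}")


# Illustrative numbers only: on a 4-item selection set both skills get
# 2/4 exact matches, but the candidate has better partial-credit scores.
current = {"hard": 0.5, "soft": 0.55}
candidate = {"hard": 0.5, "soft": 0.70}

for metric in ("hard", "soft", "mixed"):
    cur = select_gate_score(current["hard"], current["soft"], metric)
    cand = select_gate_score(candidate["hard"], candidate["soft"], metric)
    print(metric, "candidate improves:", cand > cur)
```

With these made-up scores the hard metric sees a tie, while the soft and mixed metrics can tell the two skills apart, which is the small-selection-set case the commit targets.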
lvbaocheng
2532043d25 fix(claude): use --effort instead of deprecated --thinking flag
Claude Code CLI v2.x renamed the flag; passing --thinking low causes
all rollout calls to fail on CLI 2.1.87+.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-30 11:24:13 +08:00
zq
41be2f1803 fix(scoring): use float() instead of int() for continuous reward scores
int() truncates smoothed composite scores in [0.0, 1.0) to 0,
making every continuous reward value below 1.0 appear as a failure.
This broke SkillOpt training pipelines using SmoothedCompositeReward.
2026-05-30 07:47:41 +08:00
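
The truncation problem fixed in this commit is easy to reproduce in isolation. The snippet below is a standalone illustration with made-up scores and hypothetical helper names. It is not the code from skillopt/utils/scoring.py.

```python
def total_truncating(scores: list[float]) -> int:
    """Buggy aggregation: int() drops the fractional part of each score."""
    return sum(int(s) for s in scores)


def total_preserving(scores: list[float]) -> float:
    """Fixed aggregation: float() keeps partial credit intact."""
    return sum(float(s) for s in scores)


# Made-up continuous rewards in [0, 1]; binary 0/1 rewards behave the same
# under both functions, which is why the fix is backward compatible.
scores = [0.25, 0.8, 1.0, 0.0]

print(total_truncating(scores))  # only the exact 1.0 survives
print(total_preserving(scores))
print(total_truncating([0, 1, 1]) == total_preserving([0, 1, 1]))
```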
zq
a62ec857f1 fix(reflect): support continuous reward scores in failure filtering
not r.get("hard") treats non-zero floats as success.
Add explicit float threshold check (< 1e-9).
Backward compatible with binary hard=0/1.
2026-05-29 19:04:42 +08:00
zq
afb552008b fix(trainer): support continuous reward scores in bucket aggregation
int() truncates any float in [0,1) to 0. Replace with float().
Also fix falsy float check in failure detection.
Backward compatible with binary hard=0/1.
2026-05-29 19:03:52 +08:00
Yif Yang
75b5c7f31c Merge pull request #16 from guilhermeleste/feat/pioneer-ai-provider-integration
Add OpenAI-compatible backend support for Pioneer.ai and other providers
2026-05-29 10:14:32 +08:00
Yif Yang
74ea3a1a8f Merge pull request #18 from yong2bba/docs/custom-env-smoke
docs: add local environment smoke test guide
2026-05-29 10:12:55 +08:00
yongjin
657b987de6 docs: add local environment smoke test guide 2026-05-29 09:26:38 +09:00
hwq
2a40aa3c98 Add SearchQA id split 2026-05-28 11:29:59 +00:00
hwq
786d57b5cf Make rollout completion tokens configurable 2026-05-28 09:45:47 +00:00
guilhermeleste
d5c5b61830 Add OpenAI-compatible backend support for Pioneer.ai and other providers
- Add 'openai_compatible', 'compat', and 'openai' auth modes to azure_openai.py
- Modify _make_client() to use OpenAI client (not AzureOpenAI) for compatible endpoints
- Update type hints to support both AzureOpenAI and OpenAI clients
- Auto-configure API version sentinel when using compatible modes
- Add .env template for Pioneer.ai configuration

This allows users to use Pioneer.ai or any OpenAI-compatible API endpoint
as both optimizer and target backend without requiring Azure OpenAI.

Resolves: Support for non-Azure OpenAI-compatible providers
2026-05-28 05:54:43 -03:00
Cuzyoung
99212e3956 docs: remove Star History section for now
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-26 08:12:51 +00:00
Cuzyoung
fc54c44e93 docs: add Star History chart to README
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-26 08:10:16 +00:00
Yif Yang
48adf5a69f Update citation format in README.md 2026-05-26 02:56:58 +08:00
Yif Yang
b11e6dcfb9 Enhance training description in README
Updated README to include '(mini-)batchsize' in the training description.
2026-05-26 02:35:10 +08:00
Yif Yang
4c1b74fce2 Update BibTeX entry in index.html 2026-05-25 14:30:01 +08:00
Yif Yang
db6443384a Update BibTeX entry for SkillOpt publication 2026-05-25 14:28:13 +08:00
Huangzisu
2c7d9074fb update webpage for arxiv link 2026-05-25 05:32:04 +00:00
Yif Yang
c98bcdd5b3 Update README.md 2026-05-25 13:27:40 +08:00
Yif Yang
0f6db9afc4 Update README.md 2026-05-25 13:26:55 +08:00
Yif Yang
5a36ac35ae Merge pull request #7 from microsoft/users/GitHubPolicyService/a41a3ce1-e5a1-4e18-810b-cfb8d2d21c29
Adding Microsoft SECURITY.MD
2026-05-25 13:09:26 +08:00
Lliar-liar
5f4b228543 Soften average gain column styling 2026-05-24 19:45:10 +00:00
Lliar-liar
a9cad7a125 Use official arXiv logomark 2026-05-24 19:43:19 +00:00
Lliar-liar
5e968115f5 Align citation section with SkillLens 2026-05-24 19:39:16 +00:00