213 Commits

Author SHA1 Message Date
Yif Yang
cf287cb608 Merge pull request #20 from 1s1x/fix-continuous-reward-scores
fix: support continuous reward scores (int truncation + falsy float)
2026-05-30 15:30:15 +08:00
Huangzisu
dbc90bd755 fix(auth): let env vars override yaml for openai_compatible mode
The yaml default `azure_openai_auth_mode: azure_cli` was silently
overwriting `AZURE_OPENAI_AUTH_MODE` exported by the user, because
`configure_clients()` treats any non-empty config value as an explicit
override. Switching the three auth_mode defaults (shared / optimizer /
target) to "" lets `_clean()` drop them and restores the intended
fallback chain: yaml → env var → module default ("azure_cli").

Also update README and .env.example to document the openai_compatible
mode introduced in d5c5b61, and remove the misleading `OPENAI_API_KEY`
snippet — SkillOpt reuses the `AZURE_OPENAI_*` env vars in this mode.
2026-05-30 06:58:05 +00:00
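The fallback chain described in this commit (yaml → env var → module default) can be sketched as below. This is a minimal illustration, not the repository's code: `drop_empty` and `resolve_auth_mode` are hypothetical names standing in for whatever `_clean()` and `configure_clients()` actually do, and only the key name, the env var name, and the `"azure_cli"` default come from the commit message.

```python
import os

MODULE_DEFAULT = "azure_cli"  # module default named in the commit message


def drop_empty(config):
    """Hypothetical stand-in for `_clean()`: drop empty values so they
    are not mistaken for explicit overrides."""
    return {k: v for k, v in config.items() if v not in ("", None)}


def resolve_auth_mode(config, env=None):
    """Resolve the auth mode as yaml value, then env var, then default."""
    env = os.environ if env is None else env
    cleaned = drop_empty(config)
    return (
        cleaned.get("azure_openai_auth_mode")
        or env.get("AZURE_OPENAI_AUTH_MODE")
        or MODULE_DEFAULT
    )


user_env = {"AZURE_OPENAI_AUTH_MODE": "openai_compatible"}

# Old yaml default: the non-empty value shadows the user's env var.
old = resolve_auth_mode({"azure_openai_auth_mode": "azure_cli"}, user_env)

# New yaml default "": the key is dropped and the env var wins.
new = resolve_auth_mode({"azure_openai_auth_mode": ""}, user_env)

# Neither yaml nor env var set: fall back to the module default.
fallback = resolve_auth_mode({"azure_openai_auth_mode": ""}, {})
```

Under these assumptions, `old` is `"azure_cli"` even though the user exported a different mode, which is the silent overwrite the commit removes.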
lvbaocheng
5d7875cb2e Add configurable gate metric (hard / soft / mixed) for skill validation
The training gate currently always compares candidate vs. current/best
using *hard* exact-match accuracy. On environments with a small
held-out selection set (e.g. 3-6 items) or partial-credit scoring,
hard accuracy is too coarse: candidate skills that meaningfully
improve per-item soft scores get rejected because the discrete hard
count does not move.

Add three opt-in metrics so users can pick the one that matches their
scoring function:

- `gate_metric: hard`  — original behavior (default, fully backward
  compatible).
- `gate_metric: soft`  — gate on the soft / F1 / partial-credit score.
- `gate_metric: mixed` — `(1 - w) * hard + w * soft`, where `w` is
  set by `gate_mixed_weight` (default 0.5).

Changes
-------
- `skillopt/evaluation/gate.py`: extend `evaluate_gate` with
  `cand_soft`, `metric`, and `mixed_weight` keyword arguments; add a
  pure helper `select_gate_score(hard, soft, metric, mixed_weight)`.
  Defaults preserve the original `metric="hard"` behavior — existing
  callers that only pass `cand_hard` keep working unchanged.
- `skillopt/evaluation/__init__.py`: export the new helper / type.
- `skillopt/engine/trainer.py`: read `evaluation.gate_metric` and
  `evaluation.gate_mixed_weight` from the config (with safe defaults),
  pass both metrics into `evaluate_gate`, and project the baseline
  `current_score` / `best_score` into metric space so subsequent
  comparisons are consistent. Print the gate metric on the
  `[6/6 EVALUATE]` line so logs make the decision basis explicit. The
  selection cache still records both `(hard, soft)` so a metric change
  on resume is non-destructive.
- `configs/_base_/default.yaml`: document and ship the new keys with
  backward-compatible defaults (`hard`, `0.5`).

Backward compatibility
----------------------
- Default config does not change behavior: `gate_metric` defaults to
  `hard`, exactly matching the previous gate.
- `evaluate_gate(...)` keeps its existing positional signature; the
  new parameters are keyword-only with safe defaults.
- `step_record.json` gains optional `gate_metric` and
  `candidate_gate_score` fields; old records still load.

Tested
------
- Unit-tested all three metrics + boundary `mixed_weight` values
  (0.0 / 1.0) and rejection of unknown metric strings. All six cases
  pass.
- Verified `skillopt.engine.trainer` imports cleanly after the
  refactor.
2026-05-30 14:45:27 +08:00
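The three gate metrics above can be sketched as a pure function. The name and argument order follow the `select_gate_score(hard, soft, metric, mixed_weight)` helper mentioned in the commit, but the body is a reconstruction from the message and may differ from the repository's implementation; raising `ValueError` for an unknown metric is an assumption.

```python
def select_gate_score(hard, soft, metric="hard", mixed_weight=0.5):
    """Project a (hard, soft) score pair into the configured metric space."""
    if metric == "hard":
        return hard
    if metric == "soft":
        return soft
    if metric == "mixed":
        # (1 - w) * hard + w * soft, as stated in the commit message.
        return (1.0 - mixed_weight) * hard + mixed_weight * soft
    # Assumption: unknown metric strings are rejected with ValueError.
    raise ValueError(f"unknown gate metric: {metric!r}")


# Illustrative numbers for a 4-item selection set: the candidate solves the
# same 2 items exactly (hard 0.5) but improves partial credit elsewhere.
current = {"hard": 0.5, "soft": 0.5}
candidate = {"hard": 0.5, "soft": 0.75}

for metric in ("hard", "soft", "mixed"):
    cur = select_gate_score(current["hard"], current["soft"], metric)
    cand = select_gate_score(candidate["hard"], candidate["soft"], metric)
    print(f"{metric:5s} current={cur} candidate={cand}")
```

With `metric="hard"` the two scores tie at 0.5, so a hard-only gate cannot see the improvement, while `soft` and `mixed` both rank the candidate higher. Setting `mixed_weight` to 0.0 or 1.0 reduces `mixed` to `hard` or `soft` respectively.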
lvbaocheng
2532043d25 fix(claude): use --effort instead of deprecated --thinking flag
Claude Code CLI v2.x renamed the flag; passing --thinking low causes
all rollout calls to fail on CLI 2.1.87+.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-30 11:24:13 +08:00
zq
41be2f1803 fix(scoring): use float() instead of int() for continuous reward scores
int() truncates smoothed composite scores (0.0-1.0) to 0,
making all continuous reward values appear as failures.
This broke SkillOpt training pipelines using SmoothedCompositeReward.
2026-05-30 07:47:41 +08:00
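The truncation this commit fixes is easy to reproduce in isolation. The scores below are made-up values in the 0.0 to 1.0 range, not output from `SmoothedCompositeReward`.

```python
# Hypothetical continuous reward scores in [0, 1].
scores = [0.0, 0.25, 0.75, 1.0]

# int() truncates toward zero, so every score below 1.0 becomes 0.
as_int = [int(s) for s in scores]

# float() preserves the partial credit.
as_float = [float(s) for s in scores]

mean_int = sum(as_int) / len(as_int)
mean_float = sum(as_float) / len(as_float)

print(as_int, mean_int)
print(as_float, mean_float)
```

Here the `int()` path reports a mean of 0.25 where the actual mean is 0.5, and any run whose scores all stay below 1.0 would be reported as all failures.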
zq
a62ec857f1 fix(reflect): support continuous reward scores in failure filtering
not r.get("hard") treats non-zero floats as success.
Add explicit float threshold check (< 1e-9).
Backward compatible with binary hard=0/1.
2026-05-29 19:04:42 +08:00
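A small comparison of the two failure checks named in this commit, a truthiness test versus an explicit threshold. The function names and the sample records are illustrative, and treating a missing `hard` key as a failure is an assumption; only the `not r.get("hard")` expression and the `1e-9` threshold come from the commit message.

```python
EPS = 1e-9  # threshold quoted in the commit message


def is_failure_truthy(record):
    """Original style of check: any falsy `hard` value counts as a failure."""
    return not record.get("hard")


def is_failure_threshold(record, eps=EPS):
    """Explicit numeric check: a failure is a score below `eps`.
    Assumption: a missing or None score is treated as 0.0."""
    return float(record.get("hard") or 0.0) < eps


records = [
    {"hard": 0},      # binary failure
    {"hard": 1},      # binary success
    {"hard": 0.0},    # continuous failure
    {"hard": 1e-12},  # numerically zero, but truthy in Python
    {"hard": 0.4},    # partial credit
]

truthy = [is_failure_truthy(r) for r in records]
threshold = [is_failure_threshold(r) for r in records]
```

In this sketch both checks agree on binary `hard=0/1` records, which matches the backward-compatibility note; they differ only on the `1e-12` record, which the truthiness test counts as a success.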
zq
afb552008b fix(trainer): support continuous reward scores in bucket aggregation
int() truncates any float in [0,1) to 0. Replace with float().
Also fix falsy float check in failure detection.
Backward compatible with binary hard=0/1.
2026-05-29 19:03:52 +08:00
Yif Yang
75b5c7f31c Merge pull request #16 from guilhermeleste/feat/pioneer-ai-provider-integration
Add OpenAI-compatible backend support for Pioneer.ai and other providers
2026-05-29 10:14:32 +08:00
Yif Yang
74ea3a1a8f Merge pull request #18 from yong2bba/docs/custom-env-smoke
docs: add local environment smoke test guide
2026-05-29 10:12:55 +08:00
yongjin
657b987de6 docs: add local environment smoke test guide
2026-05-29 09:26:38 +09:00
hwq
2a40aa3c98 Add SearchQA id split
2026-05-28 11:29:59 +00:00
hwq
786d57b5cf Make rollout completion tokens configurable
2026-05-28 09:45:47 +00:00
guilhermeleste
d5c5b61830 Add OpenAI-compatible backend support for Pioneer.ai and other providers
- Add 'openai_compatible', 'compat', and 'openai' auth modes to azure_openai.py
- Modify _make_client() to use OpenAI client (not AzureOpenAI) for compatible endpoints
- Update type hints to support both AzureOpenAI and OpenAI clients
- Auto-configure API version sentinel when using compatible modes
- Add .env template for Pioneer.ai configuration

This allows users to use Pioneer.ai or any OpenAI-compatible API endpoint
as both optimizer and target backend without requiring Azure OpenAI.

Resolves: Support for non-Azure OpenAI-compatible providers
2026-05-28 05:54:43 -03:00
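The client selection described in this commit can be sketched as a small dispatch on the auth mode. `client_spec` is a hypothetical helper, not the repository's `_make_client()`; it returns the client class name and constructor keyword arguments rather than importing the `openai` package. The keyword names (`base_url`, `api_key`, `azure_endpoint`, `api_version`) are the ones the `openai` Python library's `OpenAI` and `AzureOpenAI` constructors accept. The Azure branch is simplified to key-based auth and ignores the `azure_cli` flow and the API version sentinel.

```python
# Auth modes listed in the commit message as OpenAI-compatible.
COMPATIBLE_MODES = {"openai_compatible", "compat", "openai"}


def client_spec(auth_mode, endpoint, api_key, api_version=None):
    """Return (client class name, constructor kwargs) for an auth mode."""
    if auth_mode in COMPATIBLE_MODES:
        # Compatible endpoints use the plain OpenAI client with a base URL.
        return "OpenAI", {"base_url": endpoint, "api_key": api_key}
    # Simplified assumption: every other mode builds an Azure client.
    return "AzureOpenAI", {
        "azure_endpoint": endpoint,
        "api_key": api_key,
        "api_version": api_version,
    }


# Placeholder endpoint and key, for illustration only.
name, kwargs = client_spec(
    "openai_compatible", "https://example.invalid/v1", "placeholder-key"
)
```

Returning a spec instead of a live client keeps the sketch runnable without network access or credentials; real code would pass the kwargs to the matching constructor.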
Cuzyoung
99212e3956 docs: remove Star History section for now
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-26 08:12:51 +00:00
Cuzyoung
fc54c44e93 docs: add Star History chart to README
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-26 08:10:16 +00:00
Yif Yang
48adf5a69f Update citation format in README.md
2026-05-26 02:56:58 +08:00
Yif Yang
b11e6dcfb9 Enhance training description in README
Updated README to include '(mini-)batchsize' in the training description.
2026-05-26 02:35:10 +08:00
Yif Yang
4c1b74fce2 Update BibTeX entry in index.html
2026-05-25 14:30:01 +08:00
Yif Yang
db6443384a Update BibTeX entry for SkillOpt publication
2026-05-25 14:28:13 +08:00
Huangzisu
2c7d9074fb update webpage for arxiv link
2026-05-25 05:32:04 +00:00
Yif Yang
c98bcdd5b3 Update README.md
2026-05-25 13:27:40 +08:00
Yif Yang
0f6db9afc4 Update README.md
2026-05-25 13:26:55 +08:00
Yif Yang
5a36ac35ae Merge pull request #7 from microsoft/users/GitHubPolicyService/a41a3ce1-e5a1-4e18-810b-cfb8d2d21c29
Adding Microsoft SECURITY.MD
2026-05-25 13:09:26 +08:00
Lliar-liar
5f4b228543 Soften average gain column styling
2026-05-24 19:45:10 +00:00
Lliar-liar
a9cad7a125 Use official arXiv logomark
2026-05-24 19:43:19 +00:00
Lliar-liar
5e968115f5 Align citation section with SkillLens
2026-05-24 19:39:16 +00:00
Cuzyoung
ded8c27c90 restore: bring back project page HTML and assets
These were accidentally deleted in the cleanup commit.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-24 19:38:34 +00:00
Cuzyoung
f55a26414e cleanup: remove unused benchmarks, deep_probe, meta_reflect
Remove sealqa, babyvision, mathverse, mmrb, swebench envs and configs.
Remove deep_probe, deep_reflect, meta_reflect modules and prompts.
Remove download_babyvision script.
These are not part of the core released benchmarks.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-24 19:36:48 +00:00
Lliar-liar
2df2542aec Stabilize skill evolution layout
2026-05-24 19:36:08 +00:00
Lliar-liar
faa4ec6199 Align header and scroll effects with SkillLens
2026-05-24 19:31:24 +00:00
Cuzyoung
cff7ff6846 fix: rename remaining teacher/student refs, remove .gradio from repo
- Fix teacher/student in deep_reflect, meta_reflect, sealqa, babyvision,
  mathverse, mmrb, swebench envs and prompt templates
- Remove .gradio/certificate.pem from tracked files
- Add .gradio/ to .gitignore

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-24 19:22:20 +00:00
Cuzyoung
7ae2d8766e docs: restore clean README with Install/Data/QuickStart/WebUI/Citation only
Keep remote project page header (badges, video), replace body with our
streamlined 5-section README focused on reproducibility.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-24 19:19:19 +00:00
Lliar-liar
338a88d31c Add model logos to results table
2026-05-24 19:18:57 +00:00
Cuzyoung
4a1b984d87 refactor: rename teacher/student to optimizer/target, remove best skills, fix slow update
- Rename teacher -> optimizer, student -> target across all code, configs, docs, prompts
- CLI: --teacher_model -> --optimizer_model, --student_model -> --target_model
- Remove best_skill files, keep only initial skills
- Fix slow update gate (force write into skill)
- Fix SLOW_UPDATE marker stripping
- Remove deep_reflect and meta_reflect mechanisms
- Update .env.example with export prefix and azure_cli docs
- Add endpoint empty validation in azure_openai.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-24 19:15:10 +00:00
Lliar-liar
6e165d5347 Add Microsoft favicon
2026-05-24 19:14:33 +00:00
Lliar-liar
dde7dc9dd8 Add SkillLens related project link
2026-05-24 19:12:27 +00:00
Lliar-liar
cd9a0a02b9 Restyle project page after SkillLens
2026-05-24 19:08:05 +00:00
Lliar-liar
607bf74a1b Reorder hero evaluation stats
2026-05-24 18:52:05 +00:00
Lliar-liar
9605217e75 Use Microsoft logo in page header
2026-05-24 18:27:25 +00:00
Lliar-liar
c42d541828 Refine project links and citation section
2026-05-24 18:24:48 +00:00
Lliar-liar
2e05edc399 Add project links and citation section
2026-05-24 18:18:36 +00:00
Lliar-liar
6e7d5d0117 Clarify hero harness names
2026-05-24 18:15:35 +00:00
Yif Yang
441ccb9bda Update README.md
2026-05-25 02:15:02 +08:00
Lliar-liar
88a99048a4 Align method comparison chart with page theme
2026-05-24 18:05:23 +00:00
Lliar-liar
bf2106808e Remove method comparison implementation caption
2026-05-24 18:03:21 +00:00
Lliar-liar
ba0fa8c14b Render method comparison from raw data
2026-05-24 18:00:08 +00:00
Lliar-liar
9012a79827 Add main results method comparison chart
2026-05-24 17:55:22 +00:00
Lliar-liar
c64fbcd4f8 Shorten hero target model label
2026-05-24 17:51:11 +00:00
Lliar-liar
6e1027f01a Add harness count to hero badge
2026-05-24 17:48:32 +00:00
Lliar-liar
cd56a5fe7d Make hero results badge more prominent
2026-05-24 17:43:29 +00:00