microsoft-SkillOpt

mirror of https://github.com/microsoft/SkillOpt.git synced 2026-07-03 14:02:58 +08:00

Author	SHA1	Message	Date
Yif Yang	cf287cb608	Merge pull request #20 from 1s1x/fix-continuous-reward-scores fix: support continuous reward scores (int truncation + falsy float)	2026-05-30 15:30:15 +08:00
Huangzisu	dbc90bd755	fix(auth): let env vars override yaml for openai_compatible mode The yaml default `azure_openai_auth_mode: azure_cli` was silently overwriting `AZURE_OPENAI_AUTH_MODE` exported by the user, because `configure_clients()` treats any non-empty config value as an explicit override. Switching the three auth_mode defaults (shared / optimizer / target) to "" lets `_clean()` drop them and restores the intended fallback chain: yaml → env var → module default ("azure_cli"). Also update README and .env.example to document the openai_compatible mode introduced in `d5c5b61`, and remove the misleading `OPENAI_API_KEY` snippet — SkillOpt reuses the `AZURE_OPENAI_*` env vars in this mode.	2026-05-30 06:58:05 +00:00
lvbaocheng	5d7875cb2e	Add configurable gate metric (hard / soft / mixed) for skill validation The training gate currently always compares candidate vs. current/best using hard exact-match accuracy. On environments with a small held-out selection set (e.g. 3-6 items) or partial-credit scoring, hard accuracy is too coarse: candidate skills that meaningfully improve per-item soft scores get rejected because the discrete hard count does not move. Add three opt-in metrics so users can pick the one that matches their scoring function: - `gate_metric: hard` — original behavior (default, fully backward compatible). - `gate_metric: soft` — gate on the soft / F1 / partial-credit score. - `gate_metric: mixed` — `(1 - w) * hard + w * soft`, where `w` is set by `gate_mixed_weight` (default 0.5). Changes ------- - `skillopt/evaluation/gate.py`: extend `evaluate_gate` with `cand_soft`, `metric`, and `mixed_weight` keyword arguments; add a pure helper `select_gate_score(hard, soft, metric, mixed_weight)`. Defaults preserve the original `metric="hard"` behavior — existing callers that only pass `cand_hard` keep working unchanged. - `skillopt/evaluation/__init__.py`: export the new helper / type. - `skillopt/engine/trainer.py`: read `evaluation.gate_metric` and `evaluation.gate_mixed_weight` from the config (with safe defaults), pass both metrics into `evaluate_gate`, and project the baseline `current_score` / `best_score` into metric space so subsequent comparisons are consistent. Print the gate metric on the `[6/6 EVALUATE]` line so logs make the decision basis explicit. The selection cache still records both `(hard, soft)` so a metric change on resume is non-destructive. - `configs/_base_/default.yaml`: document and ship the new keys with backward-compatible defaults (`hard`, `0.5`). Backward compatibility ---------------------- - Default config does not change behavior: `gate_metric` defaults to `hard`, exactly matching the previous gate. - `evaluate_gate(...)` keeps its existing positional signature; the new parameters are keyword-only with safe defaults. - `step_record.json` gains optional `gate_metric` and `candidate_gate_score` fields; old records still load. Tested ------ - Unit-tested all three metrics + boundary `mixed_weight` values (0.0 / 1.0) and rejection of unknown metric strings. All six cases pass. - Verified `skillopt.engine.trainer` imports cleanly after the refactor.	2026-05-30 14:45:27 +08:00
lvbaocheng	2532043d25	fix(claude): use --effort instead of deprecated --thinking flag Claude Code CLI v2.x renamed the flag; passing --thinking low causes all rollout calls to fail on CLI 2.1.87+. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-30 11:24:13 +08:00
zq	41be2f1803	fix(scoring): use float() instead of int() for continuous reward scores int() truncates smoothed composite scores (0.0-1.0) to 0, making all continuous reward values appear as failures. This broke SkillOpt training pipelines using SmoothedCompositeReward.	2026-05-30 07:47:41 +08:00
zq	a62ec857f1	fix(reflect): support continuous reward scores in failure filtering not r.get("hard") treats non-zero floats as success. Add explicit float threshold check (< 1e-9). Backward compatible with binary hard=0/1.	2026-05-29 19:04:42 +08:00
zq	afb552008b	fix(trainer): support continuous reward scores in bucket aggregation int() truncates any float in [0,1) to 0. Replace with float(). Also fix falsy float check in failure detection. Backward compatible with binary hard=0/1.	2026-05-29 19:03:52 +08:00
Yif Yang	75b5c7f31c	Merge pull request #16 from guilhermeleste/feat/pioneer-ai-provider-integration Add OpenAI-compatible backend support for Pioneer.ai and other providers	2026-05-29 10:14:32 +08:00
Yif Yang	74ea3a1a8f	Merge pull request #18 from yong2bba/docs/custom-env-smoke docs: add local environment smoke test guide	2026-05-29 10:12:55 +08:00
yongjin	657b987de6	docs: add local environment smoke test guide	2026-05-29 09:26:38 +09:00
hwq	2a40aa3c98	Add SearchQA id split	2026-05-28 11:29:59 +00:00
hwq	786d57b5cf	Make rollout completion tokens configurable	2026-05-28 09:45:47 +00:00
guilhermeleste	d5c5b61830	Add OpenAI-compatible backend support for Pioneer.ai and other providers - Add 'openai_compatible', 'compat', and 'openai' auth modes to azure_openai.py - Modify _make_client() to use OpenAI client (not AzureOpenAI) for compatible endpoints - Update type hints to support both AzureOpenAI and OpenAI clients - Auto-configure API version sentinel when using compatible modes - Add .env template for Pioneer.ai configuration This allows users to use Pioneer.ai or any OpenAI-compatible API endpoint as both optimizer and target backend without requiring Azure OpenAI. Resolves: Support for non-Azure OpenAI-compatible providers	2026-05-28 05:54:43 -03:00
Cuzyoung	99212e3956	docs: remove Star History section for now Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-26 08:12:51 +00:00
Cuzyoung	fc54c44e93	docs: add Star History chart to README Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-26 08:10:16 +00:00
Yif Yang	48adf5a69f	Update citation format in README.md	2026-05-26 02:56:58 +08:00
Yif Yang	b11e6dcfb9	Enhance training description in README Updated README to include '(mini-)batchsize' in the training description.	2026-05-26 02:35:10 +08:00
Yif Yang	4c1b74fce2	Update BibTeX entry in index.html	2026-05-25 14:30:01 +08:00
Yif Yang	db6443384a	Update BibTeX entry for SkillOpt publication	2026-05-25 14:28:13 +08:00
Huangzisu	2c7d9074fb	update webpage for arxiv link	2026-05-25 05:32:04 +00:00
Yif Yang	c98bcdd5b3	Update README.md	2026-05-25 13:27:40 +08:00
Yif Yang	0f6db9afc4	Update README.md	2026-05-25 13:26:55 +08:00
Yif Yang	5a36ac35ae	Merge pull request #7 from microsoft/users/GitHubPolicyService/a41a3ce1-e5a1-4e18-810b-cfb8d2d21c29 Adding Microsoft SECURITY.MD	2026-05-25 13:09:26 +08:00
Lliar-liar	5f4b228543	Soften average gain column styling	2026-05-24 19:45:10 +00:00
Lliar-liar	a9cad7a125	Use official arXiv logomark	2026-05-24 19:43:19 +00:00
Lliar-liar	5e968115f5	Align citation section with SkillLens	2026-05-24 19:39:16 +00:00
Cuzyoung	ded8c27c90	restore: bring back project page HTML and assets These were accidentally deleted in the cleanup commit. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-24 19:38:34 +00:00
Cuzyoung	f55a26414e	cleanup: remove unused benchmarks, deep_probe, meta_reflect Remove sealqa, babyvision, mathverse, mmrb, swebench envs and configs. Remove deep_probe, deep_reflect, meta_reflect modules and prompts. Remove download_babyvision script. These are not part of the core released benchmarks. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-24 19:36:48 +00:00
Lliar-liar	2df2542aec	Stabilize skill evolution layout	2026-05-24 19:36:08 +00:00
Lliar-liar	faa4ec6199	Align header and scroll effects with SkillLens	2026-05-24 19:31:24 +00:00
Cuzyoung	cff7ff6846	fix: rename remaining teacher/student refs, remove .gradio from repo - Fix teacher/student in deep_reflect, meta_reflect, sealqa, babyvision, mathverse, mmrb, swebench envs and prompt templates - Remove .gradio/certificate.pem from tracked files - Add .gradio/ to .gitignore Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-24 19:22:20 +00:00
Cuzyoung	7ae2d8766e	docs: restore clean README with Install/Data/QuickStart/WebUI/Citation only Keep remote project page header (badges, video), replace body with our streamlined 5-section README focused on reproducibility. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-24 19:19:19 +00:00
Lliar-liar	338a88d31c	Add model logos to results table	2026-05-24 19:18:57 +00:00
Cuzyoung	4a1b984d87	refactor: rename teacher/student to optimizer/target, remove best skills, fix slow update - Rename teacher -> optimizer, student -> target across all code, configs, docs, prompts - CLI: --teacher_model -> --optimizer_model, --student_model -> --target_model - Remove best_skill files, keep only initial skills - Fix slow update gate (force write into skill) - Fix SLOW_UPDATE marker stripping - Remove deep_reflect and meta_reflect mechanisms - Update .env.example with export prefix and azure_cli docs - Add endpoint empty validation in azure_openai.py Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-24 19:15:10 +00:00
Lliar-liar	6e165d5347	Add Microsoft favicon	2026-05-24 19:14:33 +00:00
Lliar-liar	dde7dc9dd8	Add SkillLens related project link	2026-05-24 19:12:27 +00:00
Lliar-liar	cd9a0a02b9	Restyle project page after SkillLens	2026-05-24 19:08:05 +00:00
Lliar-liar	607bf74a1b	Reorder hero evaluation stats	2026-05-24 18:52:05 +00:00
Lliar-liar	9605217e75	Use Microsoft logo in page header	2026-05-24 18:27:25 +00:00
Lliar-liar	c42d541828	Refine project links and citation section	2026-05-24 18:24:48 +00:00
Lliar-liar	2e05edc399	Add project links and citation section	2026-05-24 18:18:36 +00:00
Lliar-liar	6e7d5d0117	Clarify hero harness names	2026-05-24 18:15:35 +00:00
Yif Yang	441ccb9bda	Update README.md	2026-05-25 02:15:02 +08:00
Lliar-liar	88a99048a4	Align method comparison chart with page theme	2026-05-24 18:05:23 +00:00
Lliar-liar	bf2106808e	Remove method comparison implementation caption	2026-05-24 18:03:21 +00:00
Lliar-liar	ba0fa8c14b	Render method comparison from raw data	2026-05-24 18:00:08 +00:00
Lliar-liar	9012a79827	Add main results method comparison chart	2026-05-24 17:55:22 +00:00
Lliar-liar	c64fbcd4f8	Shorten hero target model label	2026-05-24 17:51:11 +00:00
Lliar-liar	6e1027f01a	Add harness count to hero badge	2026-05-24 17:48:32 +00:00
Lliar-liar	cd56a5fe7d	Make hero results badge more prominent	2026-05-24 17:43:29 +00:00
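
Note on commit 5d7875cb2e (gate metric): the message describes three gate metrics and the mixed formula `(1 - w) * hard + w * soft`. The sketch below is one possible reading of that description, not the code in `skillopt/evaluation/gate.py`. The function name `select_gate_score_sketch` is hypothetical, and the range check on `mixed_weight` is an assumption; the commit message only says that unknown metric strings are rejected.

```python
def select_gate_score_sketch(
    hard: float,
    soft: float,
    metric: str = "hard",
    mixed_weight: float = 0.5,
) -> float:
    """Project a (hard, soft) score pair onto the single score used by the gate.

    Illustrative only: mirrors the behavior described in the commit message.
    """
    if metric == "hard":
        # Default described in the commit: gate on exact-match accuracy.
        return hard
    if metric == "soft":
        # Gate on the soft / F1 / partial-credit score.
        return soft
    if metric == "mixed":
        # Assumption: the weight is restricted to [0, 1].
        if not 0.0 <= mixed_weight <= 1.0:
            raise ValueError(f"mixed_weight must be in [0, 1], got {mixed_weight}")
        return (1.0 - mixed_weight) * hard + mixed_weight * soft
    raise ValueError(f"unknown gate metric: {metric!r}")


if __name__ == "__main__":
    # Same candidate, three different decision bases.
    for name in ("hard", "soft", "mixed"):
        print(name, select_gate_score_sketch(0.5, 0.75, metric=name))
```

With `mixed_weight` at 0.0 the mixed score reduces to the hard score, and at 1.0 it reduces to the soft score, which matches the boundary cases the commit message lists as tested.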

213 Commits
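
Note on commits 41be2f1803, a62ec857f1 and afb552008b (continuous reward scores): these messages describe two pitfalls, `int()` truncating scores in [0, 1) to 0 and truthiness checks on float scores. The sketch below reproduces both in isolation. It is a hedged illustration, not the repository's code: `mean_score` and `is_failure_sketch` are hypothetical names, the `1e-9` threshold is taken from the commit message, and treating a missing or `None` score as 0.0 is an assumption.

```python
from typing import Callable, Iterable, Mapping, Optional


def mean_score(scores: Iterable[float], cast: Callable[[float], float]) -> float:
    """Average scores after applying `cast` (int or float) to each one."""
    values = [cast(s) for s in scores]
    return sum(values) / len(values)


def is_failure_sketch(record: Mapping[str, Optional[float]], eps: float = 1e-9) -> bool:
    """Explicit numeric comparison instead of relying on truthiness.

    Assumption: a missing or None "hard" score counts as 0.0.
    """
    score = float(record.get("hard") or 0.0)
    return score < eps


if __name__ == "__main__":
    scores = [0.25, 0.75, 1.0]
    # int() drops the fractional part, so only the 1.0 survives.
    print("int  :", mean_score(scores, int))
    # float() keeps the continuous values.
    print("float:", mean_score(scores, float))
    # Binary records keep their old meaning under the explicit check.
    print(is_failure_sketch({"hard": 0}), is_failure_sketch({"hard": 1}))
```

The explicit comparison behaves the same as the old check for binary `hard=0/1` records, which is the backward compatibility the commit messages claim.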