microsoft-SkillOpt

github/microsoft-SkillOpt

Fork 0

mirror of https://github.com/microsoft/SkillOpt.git synced 2026-07-03 14:02:58 +08:00

Commit Graph

Author	SHA1	Message	Date
lvbaocheng	5d7875cb2e	Add configurable gate metric (hard / soft / mixed) for skill validation The training gate currently always compares candidate vs. current/best using hard exact-match accuracy. On environments with a small held-out selection set (e.g. 3-6 items) or partial-credit scoring, hard accuracy is too coarse: candidate skills that meaningfully improve per-item soft scores get rejected because the discrete hard count does not move. Add three opt-in metrics so users can pick the one that matches their scoring function: - `gate_metric: hard` — original behavior (default, fully backward compatible). - `gate_metric: soft` — gate on the soft / F1 / partial-credit score. - `gate_metric: mixed` — `(1 - w) * hard + w * soft`, where `w` is set by `gate_mixed_weight` (default 0.5). Changes ------- - `skillopt/evaluation/gate.py`: extend `evaluate_gate` with `cand_soft`, `metric`, and `mixed_weight` keyword arguments; add a pure helper `select_gate_score(hard, soft, metric, mixed_weight)`. Defaults preserve the original `metric="hard"` behavior — existing callers that only pass `cand_hard` keep working unchanged. - `skillopt/evaluation/__init__.py`: export the new helper / type. - `skillopt/engine/trainer.py`: read `evaluation.gate_metric` and `evaluation.gate_mixed_weight` from the config (with safe defaults), pass both metrics into `evaluate_gate`, and project the baseline `current_score` / `best_score` into metric space so subsequent comparisons are consistent. Print the gate metric on the `[6/6 EVALUATE]` line so logs make the decision basis explicit. The selection cache still records both `(hard, soft)` so a metric change on resume is non-destructive. - `configs/_base_/default.yaml`: document and ship the new keys with backward-compatible defaults (`hard`, `0.5`). Backward compatibility ---------------------- - Default config does not change behavior: `gate_metric` defaults to `hard`, exactly matching the previous gate. - `evaluate_gate(...)` keeps its existing positional signature; the new parameters are keyword-only with safe defaults. - `step_record.json` gains optional `gate_metric` and `candidate_gate_score` fields; old records still load. Tested ------ - Unit-tested all three metrics + boundary `mixed_weight` values (0.0 / 1.0) and rejection of unknown metric strings. All six cases pass. - Verified `skillopt.engine.trainer` imports cleanly after the refactor.	2026-05-30 14:45:27 +08:00
CharlesYang030	244e346b83	SkillOpt v0.1.0: initial release - Skill optimization framework with training loop analogy - 11 benchmarks, 4 model backends (Azure OpenAI, Claude, Codex, Qwen) - WebUI for browser-based training control - Pluggable architecture for extending benchmarks and backends	2026-05-21 17:22:04 +00:00

Author

SHA1

Message

Date

lvbaocheng

5d7875cb2e

Add configurable gate metric (hard / soft / mixed) for skill validation

The training gate currently always compares candidate vs. current/best
using *hard* exact-match accuracy. On environments with a small
held-out selection set (e.g. 3-6 items) or partial-credit scoring,
hard accuracy is too coarse: candidate skills that meaningfully
improve per-item soft scores get rejected because the discrete hard
count does not move.

Add three opt-in metrics so users can pick the one that matches their
scoring function:

- `gate_metric: hard`  — original behavior (default, fully backward
  compatible).
- `gate_metric: soft`  — gate on the soft / F1 / partial-credit score.
- `gate_metric: mixed` — `(1 - w) * hard + w * soft`, where `w` is
  set by `gate_mixed_weight` (default 0.5).

Changes
-------
- `skillopt/evaluation/gate.py`: extend `evaluate_gate` with
  `cand_soft`, `metric`, and `mixed_weight` keyword arguments; add a
  pure helper `select_gate_score(hard, soft, metric, mixed_weight)`.
  Defaults preserve the original `metric="hard"` behavior — existing
  callers that only pass `cand_hard` keep working unchanged.
- `skillopt/evaluation/__init__.py`: export the new helper / type.
- `skillopt/engine/trainer.py`: read `evaluation.gate_metric` and
  `evaluation.gate_mixed_weight` from the config (with safe defaults),
  pass both metrics into `evaluate_gate`, and project the baseline
  `current_score` / `best_score` into metric space so subsequent
  comparisons are consistent. Print the gate metric on the
  `[6/6 EVALUATE]` line so logs make the decision basis explicit. The
  selection cache still records both `(hard, soft)` so a metric change
  on resume is non-destructive.
- `configs/_base_/default.yaml`: document and ship the new keys with
  backward-compatible defaults (`hard`, `0.5`).

Backward compatibility
----------------------
- Default config does not change behavior: `gate_metric` defaults to
  `hard`, exactly matching the previous gate.
- `evaluate_gate(...)` keeps its existing positional signature; the
  new parameters are keyword-only with safe defaults.
- `step_record.json` gains optional `gate_metric` and
  `candidate_gate_score` fields; old records still load.

Tested
------
- Unit-tested all three metrics + boundary `mixed_weight` values
  (0.0 / 1.0) and rejection of unknown metric strings. All six cases
  pass.
- Verified `skillopt.engine.trainer` imports cleanly after the
  refactor.

2026-05-30 14:45:27 +08:00

CharlesYang030

244e346b83

SkillOpt v0.1.0: initial release

- Skill optimization framework with training loop analogy
- 11 benchmarks, 4 model backends (Azure OpenAI, Claude, Codex, Qwen)
- WebUI for browser-based training control
- Pluggable architecture for extending benchmarks and backends

2026-05-21 17:22:04 +00:00

2 Commits