microsoft-SkillOpt

mirror of https://github.com/microsoft/SkillOpt.git synced 2026-07-03 14:02:58 +08:00

Author	SHA1	Message	Date
Yifan Yang	14c045f04f	Windows robustness for claude/codex backends (+ hardened JSON fallback) (#79) * Robustness for the claude/codex backends on Windows: argv overflow, subprocess encoding, tolerant JSON, test-eval dirs Fixes for problems surfaced while running SkillOpt end-to-end on the bundled `claude` backend (local Claude CLI) on Windows. None of them changes the OpenAI/GPT happy path. 1. skillopt/engine/trainer.py: the final test-eval directory (test_eval_final/) is written to before being created; add os.makedirs(..., exist_ok=True), matching the two sibling test-eval dirs. Without it, summary.json raises FileNotFoundError when a rollout yields zero predictions. 2. skillopt/model/claude_backend.py a. Pass the prompt via stdin (not argv): on Windows the whole command line is capped at ~32 KB and a large optimizer prompt (the success-analyst minibatch carrying several report trajectories) overflows it with [WinError 206], killing the run after retries. b. Pass the system prompt via --append-system-prompt-file (a temp file), not argv. The system prompt here is the skill being optimized, which SkillOpt grows over training; since the ~32 KB cap applies to the SUM of all argv, a grown skill would re-hit [WinError 206] even with the prompt on stdin. c. Pin the subprocess encoding to utf-8 (errors="replace"). With text=True and no encoding=, stdin is encoded with the system codepage; on a zh-CN box (cp936/GBK) a prompt containing an emoji or some Latin-1 characters raises UnicodeEncodeError before the CLI even starts, failing every retry. 3. skillopt/model/codex_backend.py: the same utf-8 encoding pin on its subprocess.run(input=...) call (identical unpinned-encoding pattern). 4. skillopt/utils/json_utils.py: extract_json() returned None for valid-looking JSON that strict json.loads rejects (unescaped ASCII quotes inside CJK string values, trailing commas), silently dropping the analyst's edits on non-schema backends (Claude/Qwen): reflect produces N edits, 0 applied. 
Add a json_repair fallback, but only on a single unambiguous object (a balanced-brace extractor plus a refuse-on-multiple-objects guard), so a chain-of-thought "scratch + final" response can't make repair silently return the wrong (discarded) object, which would be worse than None (None is detectable and retryable; a wrong-but-valid edit is applied blind). Declare json_repair in requirements.txt and the claude/qwen optional extras so the fallback is actually present (it otherwise no-ops, dropping edits silently). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> (cherry picked from commit `dca74a683e`) * fix(json_utils): harden tolerant JSON fallback from PR #77 Follow-up fixes on top of the cherry-picked Windows-robustness change: 1. Make _top_level_brace_objects() fully string-aware in its OUTER scan, not just inside an object. A '{' inside quoted prose (e.g. '"set it to {x}"') no longer starts a candidate object, so extract_json() returns None for prose pseudo-JSON instead of repairing it into a bogus dict, which would be strictly worse than dropping the edit, since extract_json feeds the optimizer's skill edits. 2. Pick the repair candidate BEFORE importing json_repair, so the missing-dependency RuntimeWarning only fires when there is genuinely a single malformed object that could have been repaired. Ordinary no-JSON / prose replies (the common case) now return None silently instead of warning on every call. 3. Resolve dependency-metadata inconsistency: json_repair is optional, so add it to the `all` extra (it was already in `claude`/`qwen`) and demote it from a hard requirement to an optional/commented entry in requirements.txt, matching the project's convention for backend-specific deps. Adds regression tests for prose-with-braces (-> None), no-warning-on-plain-text, single-object repair, and multi-object ambiguity. Existing 22 json tests still pass with and without json_repair installed. 
Co-Authored-By: Claude <noreply@anthropic.com> --------- Co-authored-by: samuelgoofus-boop <260247789+samuelgoofus-boop@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 19:00:23 +08:00
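The string-aware scan and the refuse-on-multiple-objects guard described in commit 14c045f04f can be illustrated roughly as follows. This is a minimal sketch, not the repository's code: `top_level_objects` and `single_object_or_none` are hypothetical names, only double-quoted strings are tracked, and the `json_repair` step is left out (a strict `json.loads` stands in for it).

```python
import json
from typing import List, Optional


def top_level_objects(text: str) -> List[str]:
    """Return every balanced top-level {...} span, skipping braces inside quotes.

    Hypothetical helper: tracks double-quoted strings in the OUTER scan as
    well, so a '{' inside quoted prose never starts a candidate object.
    """
    spans: List[str] = []
    depth = 0
    start = -1
    in_string = False
    escaped = False
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False          # character after a backslash is literal
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "{":
            if depth == 0:
                start = i                # remember where this object begins
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                spans.append(text[start : i + 1])
    return spans


def single_object_or_none(text: str) -> Optional[dict]:
    """Parse only when exactly one top-level object exists; otherwise refuse."""
    candidates = top_level_objects(text)
    if len(candidates) != 1:
        return None                      # zero or ambiguous: detectable, retryable
    try:
        parsed = json.loads(candidates[0])
    except json.JSONDecodeError:
        return None                      # a tolerant repair step would go here
    return parsed if isinstance(parsed, dict) else None


print(single_object_or_none('note: {"a": 1}'))                # one object
print(single_object_or_none('he said "set it to {x}" ok'))    # prose braces
print(single_object_or_none('{"a": 1} then {"b": 2}'))        # ambiguous
```

Returning `None` in the ambiguous case follows the reasoning in the commit message: a missing result can be detected and retried, whereas a wrong but valid object would be applied blind.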
Cuzyoung	44043d4ae5	docs(trainer): drop the stale skill-aware comments (claimed best_skill carries no appendix; it does) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 13:10:08 +00:00
Cuzyoung	7dcd612361	fix(trainer): flush appendix notes on skip branches — lapse-only steps no longer drop them A step whose minibatches yield ONLY execution-lapse notes produces no body patches (analysts return empty-edits carriers, dropped by _normalise_patches), so skip_no_patches / skip_no_rewrite would `continue` before the appendix flush and silently discard every note of the step. This hit exactly the feature's target regime (mature skill body, failures classified as lapses): in c1_searchqa_def_g55_sar, 10/40 steps skipped this way and lost 95 notes total. Extract the flush block into _flush_skill_aware_appendix() and call it on the normal update path (unchanged behavior) AND on both skip branches before `continue`, so notes persist and appendix_notes.json / step_rec counters are recorded for skipped steps too. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 13:10:08 +00:00
Cuzyoung	0dc84162dc	feat(optimizer): skill-aware reflection (EmbodiSkill S_app), config-controlled and env-independent Split failure reflections into SKILL_DEFECT (body edit) vs EXECUTION_LAPSE (protected appendix note that re-emphasizes an existing rule, never edited by step-level analysts). Toggle: optimizer.use_skill_aware_reflection (default false; baseline byte-identical when off). - optimizer/appendix.py: protected APPENDIX region (inject/extract/append with dedup), mirrors the slow_update protected-field pattern - optimizer/skill_aware.py: analyst prompt augmentation, appendix_notes parsing, threshold-gated LLM consolidation, and a process-wide runtime switch (configure_skill_aware_reflection) set once by the trainer - gradient/reflect.py: augment error/success analyst prompts at runtime; None-sentinel kwargs resolve from the global switch, so env adapters need no per-benchmark wiring (works for all envs, present and future) - optimizer/skill.py: generalize the protected-region check to (slow_update, appendix); edits inside any protected region are skipped - engine/trainer.py: inject appendix at init, flush per-step EXECUTION_LAPSE notes after the gate settles, optional consolidation - tests: regression suite incl. toggle-off byte-identical guarantee and env-independent global-switch resolution (6/6 passing + live smoke) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 13:10:08 +00:00
Cuzyoung	ffe581098b	feat(trainer): final-skill val + best promotion; keep best unpolluted by slow_update - slow_update force-inject now writes current_skill ONLY (best_skill stays a faithful val-best snapshot, never receives un-validated slow_update content) - after training, run one val on the final skill; if its gate score beats the incumbent best, promote final to best (updates best_skill/best_step/best_origin) - trainer now evaluates final skill on test itself (reuses best test result when final==best); records final_selection_* and final_test_* in summary.json - spreadsheetbench: head+tail truncate the post-execution verification report at source to fix multi-MB conversation bloat Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-06-10 13:03:17 +00:00
Cuzyoung	372fd56c1e	fix(spreadsheetbench)+optimizer: fix verify-feedback bloat, drop optimizer-side truncation, soft-disable gate A. SpreadsheetBench verification-feedback bloat - rollout.py _auto_verify_output: use official _compare_cell_value (was repr() equality, which falsely flagged 5 vs 5.0 / None vs ""); collapse correct-and-empty cells into a count so large sparse answer ranges no longer flood feedback with MBs of None=None noise. - codegen_agent.py _build_eval_feedback: only list WRONG cells, collapse correct ones into a count. Scoring is unaffected (evaluate() is independent); this only fixes the target model's multi-turn solving feedback. B. Remove optimizer-side truncation (bloat source now fixed) - reflect.py: drop _MAX_TRAJ_CHARS cap and all per-field clips. - update_modes.py / clip.py / lr_autonomous.py: describe_item / short_item_summary no longer truncate; raise ranking/lr token budget. - trainer.py _format_step_buffer: full task_ids / target. - slow_update.py: full comparison samples. C. Soft-disable gate - config.py / trainer.py: use_gate=false no longer raises; validation still runs but candidates are force-accepted (new force_accept branch + log). Misc: aggregate.py merge token budget 4096 -> 16384. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-06-10 13:03:17 +00:00
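Why `repr()` equality misjudges cells such as 5 vs 5.0 or None vs "", and how collapsing correct cells into a count shortens feedback, can be shown with a small sketch. The official `_compare_cell_value` is not reproduced here; `values_match_sketch` and `summarize_cells` are hypothetical stand-ins whose rules are assumptions made for illustration only.

```python
from typing import Any, Dict


def repr_equal(expected: Any, actual: Any) -> bool:
    """The brittle comparison: textual representations must match exactly."""
    return repr(expected) == repr(actual)


def values_match_sketch(expected: Any, actual: Any) -> bool:
    """Hypothetical tolerant comparison (NOT the official _compare_cell_value)."""
    # Assumption: None and "" both mean an empty cell.
    if expected in (None, "") and actual in (None, ""):
        return True
    # Compare numbers by value so 5 and 5.0 agree; bools are kept out of this path.
    numeric = (int, float)
    if (
        isinstance(expected, numeric)
        and isinstance(actual, numeric)
        and not isinstance(expected, bool)
        and not isinstance(actual, bool)
    ):
        return float(expected) == float(actual)
    return expected == actual


def summarize_cells(expected: Dict[str, Any], actual: Dict[str, Any]) -> str:
    """List only the wrong cells; collapse the correct ones into a single count."""
    wrong = []
    correct = 0
    for cell, want in expected.items():
        got = actual.get(cell)
        if values_match_sketch(want, got):
            correct += 1
        else:
            wrong.append(f"{cell}: expected {want!r}, got {got!r}")
    return "\n".join([f"{correct} cells correct"] + wrong)


expected = {"A1": 5, "A2": None, "A3": "x"}
actual = {"A1": 5.0, "A2": "", "A3": "y"}
print(repr_equal(5, 5.0))                 # False: '5' != '5.0'
print(summarize_cells(expected, actual))
```

With a large, mostly empty answer range, the summary stays one line per wrong cell plus a single count line, instead of one line per cell.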
kaikai-macbook	41012e2d5e	Support Qwen chat as optimizer backend	2026-06-01 16:44:49 +08:00
Yif Yang	b4850ce418	fix(minimax): wire YAML / CLI config through to backend PR #26 added a MiniMax chat backend but left three loose ends that silently dropped any YAML / CLI configuration of minimax_* keys: only the environment-variable path worked. - skillopt/config.py: add 6 model.minimax_* entries to _FLATTEN_MAP so the keys declared in configs/_base_/default.yaml actually survive flatten_config() (mirroring the existing model.qwen_chat_* block). - skillopt/engine/trainer.py: import configure_minimax_chat and call it alongside configure_qwen_chat, so cfg-supplied credentials, temperature, max_tokens, and enable_thinking reach the backend. Also apply cfg["minimax_model"] via set_target_deployment when the active target backend is minimax_chat. - scripts/train.py: add 6 --minimax_* CLI flags + the corresponding _CLI_TO_YAML entries, add 'minimax' / 'minimax_chat' to the --backend choices, auto-route to target_backend=minimax_chat, and pick the right default target_model for the new backend. Default behavior on existing backends (openai, claude, qwen, codex, claude_code_exec) is unchanged; all 8 shipped configs continue to load with gate_metric falling back to 'hard' for paper reproduction.	2026-05-31 08:22:20 +00:00
Cuzyoung	00602df9e9	feat(slow-update): add config-controlled gated / force-injected modes Add optimizer.slow_update_gate_with_selection to control how epoch-boundary slow-update guidance is applied: - false (default): force-injected - inject guidance into current & best unconditionally (unchanged behavior). - true: gated - evaluate the slow-update candidate on the selection set and accept/reject via the same validation gate as step-level updates (logic follows the SkillReflection ablation). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-31 02:02:23 +00:00
Yif Yang	d190bf37c1	Merge pull request #25 from lvbaocheng/feature/gate-soft-metric Add configurable gate metric (hard / soft / mixed) for skill validation Default is `hard`, preserving exact pre-PR behavior, verified by 22 unit assertions on the gate module plus an end-to-end 8-step trainer-trajectory test that produces a bit-for-bit identical accept/reject sequence between the pre-PR and post-PR code paths under `gate_metric: hard`. Paper-reproduction results are unaffected. `soft` and `mixed` are opt-in via `evaluation.gate_metric` in the config and address small-selection-set runs where discrete hard accuracy is too coarse to distinguish candidate skills.	2026-05-30 08:01:39 +00:00
lvbaocheng	5d7875cb2e	Add configurable gate metric (hard / soft / mixed) for skill validation The training gate currently always compares candidate vs. current/best using hard exact-match accuracy. On environments with a small held-out selection set (e.g. 3-6 items) or partial-credit scoring, hard accuracy is too coarse: candidate skills that meaningfully improve per-item soft scores get rejected because the discrete hard count does not move. Add three opt-in metrics so users can pick the one that matches their scoring function: - `gate_metric: hard` — original behavior (default, fully backward compatible). - `gate_metric: soft` — gate on the soft / F1 / partial-credit score. - `gate_metric: mixed` — `(1 - w) * hard + w * soft`, where `w` is set by `gate_mixed_weight` (default 0.5). Changes ------- - `skillopt/evaluation/gate.py`: extend `evaluate_gate` with `cand_soft`, `metric`, and `mixed_weight` keyword arguments; add a pure helper `select_gate_score(hard, soft, metric, mixed_weight)`. Defaults preserve the original `metric="hard"` behavior — existing callers that only pass `cand_hard` keep working unchanged. - `skillopt/evaluation/__init__.py`: export the new helper / type. - `skillopt/engine/trainer.py`: read `evaluation.gate_metric` and `evaluation.gate_mixed_weight` from the config (with safe defaults), pass both metrics into `evaluate_gate`, and project the baseline `current_score` / `best_score` into metric space so subsequent comparisons are consistent. Print the gate metric on the `[6/6 EVALUATE]` line so logs make the decision basis explicit. The selection cache still records both `(hard, soft)` so a metric change on resume is non-destructive. - `configs/_base_/default.yaml`: document and ship the new keys with backward-compatible defaults (`hard`, `0.5`). Backward compatibility ---------------------- - Default config does not change behavior: `gate_metric` defaults to `hard`, exactly matching the previous gate. 
- `evaluate_gate(...)` keeps its existing positional signature; the new parameters are keyword-only with safe defaults. - `step_record.json` gains optional `gate_metric` and `candidate_gate_score` fields; old records still load. Tested ------ - Unit-tested all three metrics + boundary `mixed_weight` values (0.0 / 1.0) and rejection of unknown metric strings. All six cases pass. - Verified `skillopt.engine.trainer` imports cleanly after the refactor.	2026-05-30 14:45:27 +08:00
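The three gate metrics from commit 5d7875cb2e reduce to a small selection function. The sketch below follows the formula given in the commit message, `(1 - w) * hard + w * soft`; the function name `gate_score_sketch` is hypothetical and the body is an illustration under that stated formula, not the code of `select_gate_score` in `skillopt/evaluation/gate.py`.

```python
def gate_score_sketch(
    hard: float,
    soft: float,
    metric: str = "hard",
    mixed_weight: float = 0.5,
) -> float:
    """Project a (hard, soft) pair onto the single score the gate compares."""
    if metric == "hard":
        return hard                      # original behavior: exact-match accuracy
    if metric == "soft":
        return soft                      # soft / F1 / partial-credit score
    if metric == "mixed":
        return (1.0 - mixed_weight) * hard + mixed_weight * soft
    raise ValueError(f"unknown gate metric: {metric!r}")


# A 4-item selection set where both skills solve 2 items exactly,
# but the candidate earns more partial credit on the rest.
current = {"hard": 0.5, "soft": 0.55}
candidate = {"hard": 0.5, "soft": 0.70}

for metric in ("hard", "soft", "mixed"):
    cur = gate_score_sketch(current["hard"], current["soft"], metric)
    cand = gate_score_sketch(candidate["hard"], candidate["soft"], metric)
    print(metric, "candidate improves:", cand > cur)
```

Under `hard` the two skills tie, which is the coarse-granularity case the commit describes; `soft` and `mixed` both separate them. With `mixed_weight` at 0.0 or 1.0 the mixed score collapses to the hard or soft score respectively.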
zq	afb552008b	fix(trainer): support continuous reward scores in bucket aggregation int() truncates any float in [0,1) to 0. Replace with float(). Also fix falsy float check in failure detection. Backward compatible with binary hard=0/1.	2026-05-29 19:03:52 +08:00
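The truncation that commit afb552008b fixes is easy to reproduce in isolation. The snippet below is a minimal illustration with made-up reward values; it does not reproduce the trainer's bucket-aggregation code.

```python
# Made-up continuous rewards in [0, 1]; only the last one is a full success.
rewards = [0.0, 0.25, 0.75, 1.0]

# int() drops the fractional part, so every partial score becomes 0.
truncated_total = sum(int(r) for r in rewards)

# float() keeps partial credit and leaves binary 0/1 scores unchanged.
preserved_total = sum(float(r) for r in rewards)

print(truncated_total)   # 1
print(preserved_total)   # 2.0

# Binary hard scores give the same total either way, which is why the
# change is backward compatible.
binary = [0, 1, 1, 0]
print(sum(int(r) for r in binary) == sum(float(r) for r in binary))  # True
```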
Cuzyoung	4a1b984d87	refactor: rename teacher/student to optimizer/target, remove best skills, fix slow update - Rename teacher -> optimizer, student -> target across all code, configs, docs, prompts - CLI: --teacher_model -> --optimizer_model, --student_model -> --target_model - Remove best_skill files, keep only initial skills - Fix slow update gate (force write into skill) - Fix SLOW_UPDATE marker stripping - Remove deep_reflect and meta_reflect mechanisms - Update .env.example with export prefix and azure_cli docs - Add endpoint empty validation in azure_openai.py Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-24 19:15:10 +00:00
CharlesYang030	244e346b83	SkillOpt v0.1.0: initial release - Skill optimization framework with training loop analogy - 11 benchmarks, 4 model backends (Azure OpenAI, Claude, Codex, Qwen) - WebUI for browser-based training control - Pluggable architecture for extending benchmarks and backends	2026-05-21 17:22:04 +00:00

14 Commits