9 Commits

Author SHA1 Message Date
Shunsuke
98d0430bee refactor: make EnvAdapter.reflect a shared default (fixes dropped reflect kwargs)
All six adapters duplicated an identical reflect() that delegates to
run_minibatch_reflect. The copies had drifted: OfficeQA/DocVQA silently
dropped meta_skill_context and ALFWorld dropped update_mode, so those
analysts ran without inputs every other benchmark receives (active under
the default use_meta_skill: true).

Move the delegation into EnvAdapter.reflect as one default that forwards
all kwargs uniformly, and delete the six overrides. reflect is no longer
abstract — adapters inherit it and override only for custom logic.

Net -225 lines. Behavior change: OfficeQA/DocVQA/ALFWorld reflect now
receive the kwargs they previously dropped; the three already-correct
benchmarks are unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 09:06:00 +00:00
Cuzyoung
ffe581098b feat(trainer): final-skill val + best promotion; keep best unpolluted by slow_update
- slow_update force-inject now writes current_skill ONLY (best_skill stays a
  faithful val-best snapshot, never receives un-validated slow_update content)
- after training, run one val on the final skill; if its gate score beats the
  incumbent best, promote final to best (updates best_skill/best_step/best_origin)
- trainer now evaluates final skill on test itself (reuses best test result when
  final==best); records final_selection_* and final_test_* in summary.json
- spreadsheetbench: head+tail truncate the post-execution verification report at
  source to fix multi-MB conversation bloat

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-06-10 13:03:17 +00:00
Cuzyoung
372fd56c1e fix(spreadsheetbench)+optimizer: fix verify-feedback bloat, drop optimizer-side truncation, soft-disable gate
A. SpreadsheetBench verification-feedback bloat
   - rollout.py _auto_verify_output: use official _compare_cell_value (was
     repr() equality, which falsely flagged 5 vs 5.0 / None vs ""); collapse
     correct-and-empty cells into a count so large sparse answer ranges no
     longer flood feedback with MBs of None=None noise.
   - codegen_agent.py _build_eval_feedback: only list WRONG cells, collapse
     correct ones into a count.
   Scoring is unaffected (evaluate() is independent); this only fixes the
   target model's multi-turn solving feedback.

B. Remove optimizer-side truncation (bloat source now fixed)
   - reflect.py: drop _MAX_TRAJ_CHARS cap and all per-field clips.
   - update_modes.py / clip.py / lr_autonomous.py: describe_item /
     short_item_summary no longer truncate; raise ranking/lr token budget.
   - trainer.py _format_step_buffer: full task_ids / target.
   - slow_update.py: full comparison samples.

C. Soft-disable gate
   - config.py / trainer.py: use_gate=false no longer raises; validation still
     runs but candidates are force-accepted (new force_accept branch + log).

Misc: aggregate.py merge token budget 4096 -> 16384.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-06-10 13:03:17 +00:00
hwq
1f75d022a5 y 2026-05-30 15:01:34 +00:00
hwq
786d57b5cf Make rollout completion tokens configurable 2026-05-28 09:45:47 +00:00
Cuzyoung
f55a26414e cleanup: remove unused benchmarks, deep_probe, meta_reflect
Remove sealqa, babyvision, mathverse, mmrb, swebench envs and configs.
Remove deep_probe, deep_reflect, meta_reflect modules and prompts.
Remove download_babyvision script.
These are not part of the core released benchmarks.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-24 19:36:48 +00:00
Cuzyoung
cff7ff6846 fix: rename remaining teacher/student refs, remove .gradio from repo
- Fix teacher/student in deep_reflect, meta_reflect, sealqa, babyvision,
  mathverse, mmrb, swebench envs and prompt templates
- Remove .gradio/certificate.pem from tracked files
- Add .gradio/ to .gitignore

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-24 19:22:20 +00:00
Cuzyoung
4a1b984d87 refactor: rename teacher/student to optimizer/target, remove best skills, fix slow update
- Rename teacher -> optimizer, student -> target across all code, configs, docs, prompts
- CLI: --teacher_model -> --optimizer_model, --student_model -> --target_model
- Remove best_skill files, keep only initial skills
- Fix slow update gate (force write into skill)
- Fix SLOW_UPDATE marker stripping
- Remove deep_reflect and meta_reflect mechanisms
- Update .env.example with export prefix and azure_cli docs
- Add endpoint empty validation in azure_openai.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-24 19:15:10 +00:00
CharlesYang030
244e346b83 SkillOpt v0.1.0: initial release
- Skill optimization framework with training loop analogy
- 11 benchmarks, 4 model backends (Azure OpenAI, Claude, Codex, Qwen)
- WebUI for browser-based training control
- Pluggable architecture for extending benchmarks and backends
2026-05-21 17:22:04 +00:00