All six adapters carried their own copy of a reflect() that delegates to
run_minibatch_reflect. The copies had drifted: OfficeQA/DocVQA silently
dropped meta_skill_context and ALFWorld dropped update_mode, so those
analysts ran without inputs that every other benchmark receives (the
affected path is active under the default use_meta_skill: true).

Move the delegation into EnvAdapter.reflect as one default that forwards
all kwargs uniformly, and delete the six overrides. reflect is no longer
abstract; adapters inherit it and override only for custom logic.

Net -225 lines. Behavior change: the OfficeQA, DocVQA, and ALFWorld
reflect calls now receive the kwargs they previously dropped; the three
already-correct benchmarks are unaffected.
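
A minimal sketch of the inherited default described above. Only the names
EnvAdapter, reflect, run_minibatch_reflect, meta_skill_context and
update_mode come from this log; the helper's signature, the rollout method
and the DocVQAAdapter class name are assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any


def run_minibatch_reflect(trajectories: list, **kwargs: Any) -> dict:
    """Stand-in for the shared helper; its real signature is not shown here."""
    return {"n_trajectories": len(trajectories), "kwargs": dict(kwargs)}


class EnvAdapter(ABC):
    @abstractmethod
    def rollout(self, task: Any) -> Any:
        """Hypothetical abstract method; present only to keep the base abstract."""

    def reflect(self, trajectories: list, **kwargs: Any) -> dict:
        # Default implementation: forward every keyword argument unchanged,
        # so a subclass cannot silently drop meta_skill_context or update_mode.
        return run_minibatch_reflect(trajectories, **kwargs)


class DocVQAAdapter(EnvAdapter):
    # No reflect() override: the adapter inherits the uniform default.
    def rollout(self, task: Any) -> Any:
        return task


result = DocVQAAdapter().reflect(
    ["traj-1", "traj-2"],
    meta_skill_context="ctx",
    update_mode="patch",
)
```

Because the base method takes **kwargs and passes them through as is, a new
keyword added by the trainer reaches every adapter without touching any
adapter code.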
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- slow_update force-inject now writes current_skill ONLY (best_skill stays a
faithful val-best snapshot and never receives un-validated slow_update content)
- after training, run one val on the final skill; if its gate score beats the
incumbent best, promote final to best (updates best_skill/best_step/best_origin)
- trainer now evaluates final skill on test itself (reuses best test result when
final==best); records final_selection_* and final_test_* in summary.json
- spreadsheetbench: head+tail truncate the post-execution verification report at
source to fix multi-MB conversation bloat
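
The post-training promotion rule in the list above can be sketched as
follows. The BestState container, the function name and the "final" origin
label are hypothetical; a strict comparison is assumed for "beats".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BestState:
    best_skill: str
    best_step: int
    best_origin: str
    best_score: float  # gate score of the incumbent best on val


def promote_final_if_better(
    state: BestState, final_skill: str, final_step: int, final_score: float
) -> tuple[BestState, bool]:
    """Return the (possibly updated) best state and whether a promotion happened."""
    # Assumption: "beats" means strictly greater, so a tie keeps the incumbent.
    if final_score > state.best_score:
        promoted = BestState(final_skill, final_step, "final", final_score)
        return promoted, True
    return state, False


incumbent = BestState("skill-v3", 30, "step", 0.71)
kept, promoted_tie = promote_final_if_better(incumbent, "skill-v9", 90, 0.71)
new_best, promoted_win = promote_final_if_better(incumbent, "skill-v9", 90, 0.75)
```

Keeping the incumbent on a tie means best_skill only changes when a
validated score actually improves.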
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
A. SpreadsheetBench verification-feedback bloat
- rollout.py _auto_verify_output: use official _compare_cell_value (was
repr() equality, which falsely flagged 5 vs 5.0 / None vs ""); collapse
correct-and-empty cells into a count so large sparse answer ranges no
longer flood feedback with MBs of None=None noise.
- codegen_agent.py _build_eval_feedback: only list WRONG cells, collapse
correct ones into a count.
Scoring is unaffected (evaluate() is independent); this only fixes the
target model's multi-turn solving feedback.
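
A rough sketch of the feedback shape described in section A: compare by value
rather than repr(), list only wrong cells, and collapse correct ones into a
count. values_match is a hypothetical stand-in for the official
_compare_cell_value, whose exact rules are not shown in this log; the
function names and message format are likewise assumptions.

```python
from typing import Any


def _is_number(value: Any) -> bool:
    # bool is a subclass of int in Python; treat it as non-numeric here.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def values_match(expected: Any, actual: Any) -> bool:
    """Hypothetical comparison: empty cells are equal, numbers compare by value."""
    if expected in (None, "") and actual in (None, ""):
        return True
    if _is_number(expected) and _is_number(actual):
        return float(expected) == float(actual)
    return expected == actual


def build_feedback(expected: dict[str, Any], actual: dict[str, Any]) -> str:
    """List only mismatched cells; report matching cells as a single count."""
    wrong = []
    n_correct = 0
    for cell, exp in expected.items():
        got = actual.get(cell)
        if values_match(exp, got):
            n_correct += 1
        else:
            wrong.append(f"{cell}: expected {exp!r}, got {got!r}")
    return "\n".join([f"{n_correct} cell(s) correct (not listed)"] + wrong)


feedback = build_feedback(
    {"A1": 5, "A2": None, "A3": "x"},
    {"A1": 5.0, "A2": "", "A3": "y"},
)
```

With repr() equality, A1 and A2 in this example would both have been reported
as mismatches; here only A3 is listed.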
B. Remove optimizer-side truncation (bloat source now fixed)
- reflect.py: drop _MAX_TRAJ_CHARS cap and all per-field clips.
- update_modes.py / clip.py / lr_autonomous.py: describe_item /
short_item_summary no longer truncate; raise ranking/lr token budget.
- trainer.py _format_step_buffer: full task_ids / target.
- slow_update.py: full comparison samples.
C. Soft-disable gate
- config.py / trainer.py: use_gate=false no longer raises; validation still
runs but candidates are force-accepted (new force_accept branch + log).
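
One way the soft-disabled gate could look. Only use_gate and the force_accept
branch name come from this log; the function name, the return shape and the
strict score comparison used when the gate is enabled are assumptions.

```python
import logging

logger = logging.getLogger("trainer")


def decide_candidate(
    use_gate: bool, candidate_score: float, incumbent_score: float
) -> tuple[bool, str]:
    """Return (accepted, reason) for a candidate that has already been validated."""
    if not use_gate:
        # Validation still ran; its score is logged but does not block acceptance.
        logger.info(
            "gate disabled: force-accepting candidate (val=%.3f, incumbent=%.3f)",
            candidate_score,
            incumbent_score,
        )
        return True, "force_accept"
    # Assumed gate rule: accept only on a strict improvement over the incumbent.
    if candidate_score > incumbent_score:
        return True, "gate_pass"
    return False, "gate_reject"


forced = decide_candidate(False, 0.10, 0.90)
passed = decide_candidate(True, 0.95, 0.90)
rejected = decide_candidate(True, 0.80, 0.90)
```

Since validation still runs in the forced branch, the scores stay available
for logging even though they no longer decide the outcome.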
Misc: aggregate.py merge token budget 4096 -> 16384.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove sealqa, babyvision, mathverse, mmrb, swebench envs and configs.
Remove deep_probe, deep_reflect, meta_reflect modules and prompts.
Remove download_babyvision script.
These are not part of the core released benchmarks.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Skill optimization framework built around a training-loop analogy
- 11 benchmarks, 4 model backends (Azure OpenAI, Claude, Codex, Qwen)
- WebUI for browser-based training control
- Pluggable architecture for extending benchmarks and backends