From eb4d1ee8d0425561ca76ec5ead0103d3f1a3b102 Mon Sep 17 00:00:00 2001 From: carpedkm Date: Fri, 8 May 2026 08:37:13 +0000 Subject: [PATCH] init --- ablation_plan.md | 1222 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 1222 insertions(+) create mode 100644 ablation_plan.md diff --git a/ablation_plan.md b/ablation_plan.md new file mode 100644 index 0000000..6a4b3ef --- /dev/null +++ b/ablation_plan.md @@ -0,0 +1,1222 @@ +# ReflACT Ablation Study 启动计划 + +最终可复现实验配置已单独固化在 `configs/ablation_study/`: +`matrix.yaml` 记录公共 overrides、benchmark splits、消融变量、token/output caps 和无效 run 规则; +`launch_commands.sh` 记录正确启动命令;`validation.md` 记录监控和填表检查。 + +当前新增实验目标有两类: + +```text +1. 新增 train.batch_size 消融:8 / 24 / 40 / 56 / full,gradient.minibatch_size 固定为 8。 +2. 在现有 ablation table 方法不变的前提下,补齐 LiveMathBench、ALFWorld 和 DocVQA。 +``` + +已完成的 SearchQA / SpreadsheetBench 结果不重复跑,除非缺 batch-size 新表所需的点。`train.batch_size=40` 是默认 setting,优先复用已有 default run;LiveMathBench / ALFWorld 还没有 default 结果,因此需要先跑 default。 + +## 2026-05-06 Harness 冲分实验清单 + +目标:把 harness 分数跑高,而不是继续扩大无关 ablation 矩阵。当前只考虑 +SearchQA、SpreadsheetBench、LiveMathBench、DocVQA 10%;不跑 ALFWorld harness。 + +状态规则: + +```text +[ ] = 还没起,另一台机器可以接着跑。 +[x] = 已经起过、正在跑、或已经完成,不要重复起。 +``` + +通用 harness 固定参数: + +```text +model.teacher_backend=openai_chat +model.teacher=gpt-5.5 +model.teacher_azure_openai_endpoint=https://oaidr21.openai.azure.com/ +model.teacher_azure_openai_auth_mode=azure_cli +model.reasoning_effort=medium +model.student_backend=codex_exec for the Codex rows +model.student=gpt-5.5 +model.codex_exec_use_sdk=auto +model.codex_exec_sandbox=workspace-write +model.codex_exec_approval_policy=never +model.codex_exec_reasoning_effort=medium +model.codex_trace_to_teacher=true +train.num_epochs=4 +train.train_size=0 +train.accumulation=1 +gradient.minibatch_size=8 +gradient.merge_batch_size=8 +gradient.analyst_workers=16 +gradient.use_deep_reflect=false +optimizer.lr_control_mode=fixed +optimizer.skill_update_mode=patch +optimizer.use_slow_update=true +optimizer.slow_update_samples=20 +optimizer.use_meta_skill=true +optimizer.use_meta_reflect=false +evaluation.use_gate=true +evaluation.eval_test=true +env.split_mode=split_dir +``` + +当前已经在跑或已经有结果的参考实验,不要重复起: + +| Status | 通俗名字 | Table row | 说明 | +| --- | --- | --- | --- | +| [x] | SearchQA 从默认 skill 训练 Codex | `HARNESS-BESTSETTING-searchqa-codex` | fresh-machine completed, test `0.8800` | +| [x] | LiveMath 从默认 skill 训练 Codex | `HARNESS-BESTSETTING-livemathematicianbench-codex` | fresh-machine completed, test `0.7600` | +| [x] | LiveMath 从默认 skill 训练 Claude | `HARNESS-BESTSETTING-livemathematicianbench-claude` | fresh-machine completed, test `0.3040` | +| [x] | DocVQA10 从默认 skill 训练 Codex | `HARNESS-BESTSETTING-docvqa10pct-codex` | fresh-machine completed, test `0.7032` | +| [x] | DocVQA10 从默认 skill 训练 Claude | `HARNESS-BESTSETTING-docvqa10pct-claude` | fresh-machine completed after image fix, test `0.5802` | +| [x] | SearchQA 从默认 skill 训练 Claude | `HARNESS-BESTSETTING-searchqa-claude` | fresh-machine running | +| [x] | Spreadsheet 从默认 skill 训练 Codex | `HARNESS-BESTSETTING-spreadsheetbench-codex` | fresh-machine running | +| [x] | Spreadsheet 从默认 skill 训练 Claude | `HARNESS-BESTSETTING-spreadsheetbench-claude` | fresh-machine running | +| [x] | SearchQA 最好 skill 迁移 Codex | `HARNESS-BESTSKILL-searchqa-codex` | completed, test `0.8714` | +| [x] | LiveMath 最好 skill 迁移 Codex | `HARNESS-BESTSKILL-livemathematicianbench-codex` | completed, test `0.8240` | +| [x] | DocVQA10 最好 skill 迁移 Codex | `HARNESS-BESTSKILL-docvqa10pct-codex` | completed, test `0.7032` | +| [x] | Spreadsheet 最好 skill 迁移 Codex | `HARNESS-BESTSKILL-spreadsheetbench-codex-multifix` | running on this machine | +| [x] | 四个 Claude 最好 skill fixed rerun | `HARNESS-BESTSKILL-*-claude*` | running on this machine; wait for fixed summaries | + +Harness 冲分分三步: + +```text +Step 1: Codex 先搜 no-harness 里最可能高分的 setting。 +Step 2: Claude Code 完整复制 Step 1 的 18 个 setting,用来比较两个 harness backend。 +Step 3: Codex 再补齐 no-harness 里剩余可调参数。 +``` + +全程固定: + +```text +optimizer.use_slow_update=true +optimizer.use_meta_skill=true +env.split_mode=split_dir +不调 split ratio;所有 harness 都用当前固定 harness split。 +不跑 meta-only、slow-only、no-meta、no-slow 这类模块关闭实验。 +所有 score-pushing run 都必须用 benchmark-level 唯一 best skill 初始化;不要用 setting-specific best_skill,也不要从默认初始 skill 训练。 +SpreadsheetBench 所有 score-pushing run 都必须保持 env.mode=multi, env.workers=4, env.data_root=data/spreadsheetbench/files。 +``` + +Step 1: Codex 高概率搜索,共 18 个。原则是从 `docs/ablation_paper_tables.html` 的 2:1:7 split 下所有消融实验中,为每个 benchmark 选 test 最好的一个 benchmark-level `best_skill.md`,再把不同参数设置迁移到 harness。每个 benchmark 只允许一个 source skill;参数 sweep 只改变参数,不改变 source skill。不选 split-ratio setting。 + +2026-05-06 note: the successful 2026-05-05 Codex harness rows used +`model.codex_exec_use_sdk=auto` and actually executed through `codex_sdk`. +The first local Codex Step 1 attempts were stopped and their outputs deleted +because `codex_exec` hit student-side Codex service 429s under high concurrency. +The 2026-05-06 08:04-08:26 UTC reruns were also stopped because they used +setting-specific source skills instead of one benchmark-level source skill per +benchmark. They are not valid results. Rerun these rows with SDK/auto auth +restored, low local concurrency, and the canonical source skills below. + +Canonical benchmark-level source skills: +SearchQA: `docs/harness_source_skills/searchqa_best_skill.md` from `outputs/ablation_20260502_040604_unique48/optimizer.lr_scheduler-searchqa-constant/best_skill.md`, test `0.8729`. +SpreadsheetBench: `docs/harness_source_skills/spreadsheetbench_best_skill.md` from `outputs/ablation_20260502_040604_unique48/optimizer.lr_scheduler-spreadsheetbench-constant/best_skill.md`, test `0.8071`. +LiveMathBench: `docs/harness_source_skills/livemathematicianbench_best_skill.md` from `outputs/ablation_livemath_alfworld_clean_20260503_155155_run/LR-livemathematicianbench-8/best_skill.md`, test `0.6694`. +DocVQA: `docs/harness_source_skills/docvqa_best_skill.md` from `outputs/ablation_docvqa_20260503_160225_run/BATCH-docvqa-full/best_skill.md`, test `0.9049`. + +| Status | 通俗名字 | 建议 Run ID | Benchmark | Backend | Skill init | Split / data | 需要改的参数 | +| --- | --- | --- | --- | --- | --- | --- | --- | +| [ ] | SearchQA skill + constant scheduler | `HARNESS-Codex-SearchQA-sched-constant` | SearchQA | `codex_exec` | `docs/harness_source_skills/searchqa_best_skill.md` | `data/searchqa/splits` | Pending valid rerun; 2026-05-06 setting-specific run stopped; Azure MI Codex wrapper, `env.workers=2`, `env.exec_timeout=1020` | +| [ ] | SearchQA skill + linear scheduler | `HARNESS-Codex-SearchQA-sched-linear` | SearchQA | `codex_exec` | `docs/harness_source_skills/searchqa_best_skill.md` | `data/searchqa/splits` | Pending valid rerun; `env.workers=2`, `env.exec_timeout=1020` | +| [ ] | SearchQA skill + full batch | `HARNESS-Codex-SearchQA-batch-full` | SearchQA | `codex_exec` | `docs/harness_source_skills/searchqa_best_skill.md` | `data/searchqa/splits` | Pending valid rerun; `train.batch_size=400`, `env.workers=2`, `env.exec_timeout=1020` | +| [ ] | SearchQA skill + LR=8 | `HARNESS-Codex-SearchQA-lr8` | SearchQA | `codex_exec` | `docs/harness_source_skills/searchqa_best_skill.md` | `data/searchqa/splits` | Pending valid rerun; `optimizer.learning_rate=8`, `env.workers=2`, `env.exec_timeout=1020` | +| [ ] | Spreadsheet skill + constant scheduler | `HARNESS-Codex-Spreadsheet-sched-constant-multi` | SpreadsheetBench | `codex_exec` | `docs/harness_source_skills/spreadsheetbench_best_skill.md` | `data/spreadsheetbench`, data root `data/spreadsheetbench/files` | `optimizer.learning_rate=4`, `optimizer.lr_scheduler=constant`, `optimizer.min_learning_rate=2`, `train.batch_size=40`, `env.mode=multi`, `env.workers=4` | +| [ ] | Spreadsheet skill + LR=4 | `HARNESS-Codex-Spreadsheet-lr4-multi` | SpreadsheetBench | `codex_exec` | `docs/harness_source_skills/spreadsheetbench_best_skill.md` | `data/spreadsheetbench`, data root `data/spreadsheetbench/files` | `optimizer.learning_rate=4`, `optimizer.lr_scheduler=constant`, `optimizer.min_learning_rate=1`, `train.batch_size=40`, `env.mode=multi`, `env.workers=4` | +| [ ] | Spreadsheet skill + LR=16 | `HARNESS-Codex-Spreadsheet-lr16-multi` | SpreadsheetBench | `codex_exec` | `docs/harness_source_skills/spreadsheetbench_best_skill.md` | `data/spreadsheetbench`, data root `data/spreadsheetbench/files` | `optimizer.learning_rate=16`, `optimizer.lr_scheduler=constant`, `optimizer.min_learning_rate=1`, `train.batch_size=40`, `env.mode=multi`, `env.workers=4` | +| [ ] | Spreadsheet skill + minibatch=16 | `HARNESS-Codex-Spreadsheet-minibatch16-multi` | SpreadsheetBench | `codex_exec` | `docs/harness_source_skills/spreadsheetbench_best_skill.md` | `data/spreadsheetbench`, data root `data/spreadsheetbench/files` | `train.batch_size=40`, `gradient.minibatch_size=16`, `env.mode=multi`, `env.workers=4` | +| [ ] | LiveMath skill + LR=8 | `HARNESS-Codex-LiveMath-lr8` | LiveMathBench | `codex_exec` | `docs/harness_source_skills/livemathematicianbench_best_skill.md` | `data/livemathbench/splits` | Pending valid rerun; teacher oaidr9 MI, Codex student MI wrapper, `env.workers=2`, `env.exec_timeout=1020` | +| [ ] | LiveMath skill + LR=16 | `HARNESS-Codex-LiveMath-lr16` | LiveMathBench | `codex_exec` | `docs/harness_source_skills/livemathematicianbench_best_skill.md` | `data/livemathbench/splits` | Pending valid rerun; `optimizer.learning_rate=16`, `env.workers=2`, `env.exec_timeout=1020` | +| [ ] | LiveMath skill + slow10 | `HARNESS-Codex-LiveMath-slow10` | LiveMathBench | `codex_exec` | `docs/harness_source_skills/livemathematicianbench_best_skill.md` | `data/livemathbench/splits` | Pending valid rerun; `optimizer.slow_update_samples=10`, teacher oaidr9 MI, Codex student MI wrapper | +| [ ] | LiveMath skill + linear scheduler | `HARNESS-Codex-LiveMath-sched-linear` | LiveMathBench | `codex_exec` | `docs/harness_source_skills/livemathematicianbench_best_skill.md` | `data/livemathbench/splits` | Pending valid rerun; `optimizer.lr_scheduler=linear`, teacher oaidr9 MI, Codex student MI wrapper | +| [ ] | LiveMath skill + minibatch=4 | `HARNESS-Codex-LiveMath-minibatch4` | LiveMathBench | `codex_exec` | `docs/harness_source_skills/livemathematicianbench_best_skill.md` | `data/livemathbench/splits` | Pending valid rerun; `gradient.minibatch_size=4`, teacher oaidr9 MI, Codex student MI wrapper | +| [ ] | DocVQA10 skill + full batch | `HARNESS-Codex-DocVQA10-batch-full` | DocVQA 10% | `codex_exec` | `docs/harness_source_skills/docvqa_best_skill.md` | `data/harness_splits/docvqa_zisu_first10pct` | Pending valid rerun; `train.batch_size=107`, teacher oaidr9 MI, absolute Codex wrapper | +| [ ] | DocVQA10 skill + LR=16 | `HARNESS-Codex-DocVQA10-lr16` | DocVQA 10% | `codex_exec` | `docs/harness_source_skills/docvqa_best_skill.md` | `data/harness_splits/docvqa_zisu_first10pct` | Pending valid rerun; `optimizer.learning_rate=16`, teacher oaidr9 MI, absolute Codex wrapper | +| [ ] | DocVQA10 skill + LR=8 | `HARNESS-Codex-DocVQA10-lr8` | DocVQA 10% | `codex_exec` | `docs/harness_source_skills/docvqa_best_skill.md` | `data/harness_splits/docvqa_zisu_first10pct` | Pending valid rerun; `optimizer.learning_rate=8`, teacher oaidr9 MI, absolute Codex wrapper | +| [ ] | DocVQA10 skill + minibatch=32 | `HARNESS-Codex-DocVQA10-minibatch32` | DocVQA 10% | `codex_exec` | `docs/harness_source_skills/docvqa_best_skill.md` | `data/harness_splits/docvqa_zisu_first10pct` | Pending valid rerun; `gradient.minibatch_size=32`, teacher oaidr9 MI, absolute Codex wrapper | +| [ ] | DocVQA10 skill + student batch=24 | `HARNESS-Codex-DocVQA10-batch24` | DocVQA 10% | `codex_exec` | `docs/harness_source_skills/docvqa_best_skill.md` | `data/harness_splits/docvqa_zisu_first10pct` | Pending valid rerun; `train.batch_size=24`, teacher oaidr9 MI, absolute Codex wrapper | + +Step 2: Claude Code 镜像实验,共 18 个。Claude Code 现在没有搜完;目前只是 fixed best-skill / best-setting 相关 run 在跑或部分完成。为了公平比较 harness backend,Claude Step 2 应完整复制 Codex Step 1 的 18 个 setting,而不是只复制每个 benchmark 的 top 1。每个 benchmark 也只允许一个 benchmark-level source skill。每个 run 单独打勾,不能用一个勾概括一组。Claude Step 2 默认用 `env.workers=2`, `env.exec_timeout=1020`(17 分钟);不要再把 `workers=4` 当默认稳定设置。 + +| Status | 通俗名字 | 建议 Run ID | Benchmark | Backend | Skill init | 必须保持 | +| --- | --- | --- | --- | --- | --- | --- | +| [ ] | Claude SearchQA skill + constant scheduler | `HARNESS-Claude-SearchQA-sched-constant` | SearchQA | `claude_code_exec` | `docs/harness_source_skills/searchqa_best_skill.md` | Pending valid rerun; prior setting-specific run stopped; `env.workers=2`, `env.exec_timeout=1020` | +| [ ] | Claude SearchQA skill + linear scheduler | `HARNESS-Claude-SearchQA-sched-linear` | SearchQA | `claude_code_exec` | `docs/harness_source_skills/searchqa_best_skill.md` | Pending valid rerun; `env.workers=2`, `env.exec_timeout=1020` | +| [ ] | Claude SearchQA skill + full batch | `HARNESS-Claude-SearchQA-batch-full` | SearchQA | `claude_code_exec` | `docs/harness_source_skills/searchqa_best_skill.md` | Pending valid rerun; `train.batch_size=400`, `env.workers=2`, `env.exec_timeout=1020` | +| [ ] | Claude SearchQA skill + LR=8 | `HARNESS-Claude-SearchQA-lr8` | SearchQA | `claude_code_exec` | `docs/harness_source_skills/searchqa_best_skill.md` | Pending valid rerun; `optimizer.learning_rate=8`, `optimizer.lr_scheduler=constant`, `optimizer.min_learning_rate=1`, `env.workers=2`, `env.exec_timeout=1020` | +| [ ] | Claude Spreadsheet skill + constant scheduler | `HARNESS-Claude-Spreadsheet-sched-constant-multi` | SpreadsheetBench | `claude_code_exec` | `docs/harness_source_skills/spreadsheetbench_best_skill.md` | `env.mode=multi`, `env.workers=4`, `env.data_root=data/spreadsheetbench/files` | +| [ ] | Claude Spreadsheet skill + LR=4 | `HARNESS-Claude-Spreadsheet-lr4-multi` | SpreadsheetBench | `claude_code_exec` | `docs/harness_source_skills/spreadsheetbench_best_skill.md` | `env.mode=multi`, `env.workers=4`, `env.data_root=data/spreadsheetbench/files` | +| [ ] | Claude Spreadsheet skill + LR=16 | `HARNESS-Claude-Spreadsheet-lr16-multi` | SpreadsheetBench | `claude_code_exec` | `docs/harness_source_skills/spreadsheetbench_best_skill.md` | `env.mode=multi`, `env.workers=4`, `env.data_root=data/spreadsheetbench/files` | +| [ ] | Claude Spreadsheet skill + minibatch=16 | `HARNESS-Claude-Spreadsheet-minibatch16-multi` | SpreadsheetBench | `claude_code_exec` | `docs/harness_source_skills/spreadsheetbench_best_skill.md` | `env.mode=multi`, `env.workers=4`, `env.data_root=data/spreadsheetbench/files` | +| [ ] | Claude LiveMath skill + LR=8 | `HARNESS-Claude-LiveMath-lr8` | LiveMathBench | `claude_code_exec` | `docs/harness_source_skills/livemathematicianbench_best_skill.md` | Pending valid rerun; `optimizer.learning_rate=8`, `optimizer.lr_scheduler=constant`, `optimizer.min_learning_rate=1`, `env.workers=2`, `env.exec_timeout=1020` | +| [ ] | Claude LiveMath skill + LR=16 | `HARNESS-Claude-LiveMath-lr16` | LiveMathBench | `claude_code_exec` | `docs/harness_source_skills/livemathematicianbench_best_skill.md` | Pending valid rerun; `optimizer.learning_rate=16`, `optimizer.lr_scheduler=constant`, `optimizer.min_learning_rate=1`, `env.workers=2`, `env.exec_timeout=1020` | +| [ ] | Claude LiveMath skill + slow10 | `HARNESS-Claude-LiveMath-slow10` | LiveMathBench | `claude_code_exec` | `docs/harness_source_skills/livemathematicianbench_best_skill.md` | Pending valid rerun; `optimizer.slow_update_samples=10`, `env.workers=2`, `env.exec_timeout=1020` | +| [ ] | Claude LiveMath skill + linear scheduler | `HARNESS-Claude-LiveMath-sched-linear` | LiveMathBench | `claude_code_exec` | `docs/harness_source_skills/livemathematicianbench_best_skill.md` | Pending valid rerun; `optimizer.lr_scheduler=linear`, `env.workers=2`, `env.exec_timeout=1020` | +| [ ] | Claude LiveMath skill + minibatch=4 | `HARNESS-Claude-LiveMath-minibatch4` | LiveMathBench | `claude_code_exec` | `docs/harness_source_skills/livemathematicianbench_best_skill.md` | Pending valid rerun; `gradient.minibatch_size=4`, `env.workers=2`, `env.exec_timeout=1020` | +| [ ] | Claude DocVQA10 skill + full batch | `HARNESS-Claude-DocVQA10-batch-full` | DocVQA 10% | `claude_code_exec` | `docs/harness_source_skills/docvqa_best_skill.md` | Pending valid rerun; `train.batch_size=107`, `env.workers=2`, `env.exec_timeout=1020` | +| [ ] | Claude DocVQA10 skill + LR=16 | `HARNESS-Claude-DocVQA10-lr16` | DocVQA 10% | `claude_code_exec` | `docs/harness_source_skills/docvqa_best_skill.md` | Pending valid rerun; `optimizer.learning_rate=16`, `env.workers=2`, `env.exec_timeout=1020` | +| [ ] | Claude DocVQA10 skill + LR=8 | `HARNESS-Claude-DocVQA10-lr8` | DocVQA 10% | `claude_code_exec` | `docs/harness_source_skills/docvqa_best_skill.md` | 确认 image attachment fix 生效 | +| [ ] | Claude DocVQA10 skill + minibatch=32 | `HARNESS-Claude-DocVQA10-minibatch32` | DocVQA 10% | `claude_code_exec` | `docs/harness_source_skills/docvqa_best_skill.md` | 确认 image attachment fix 生效 | +| [ ] | Claude DocVQA10 skill + student batch=24 | `HARNESS-Claude-DocVQA10-batch24` | DocVQA 10% | `claude_code_exec` | `docs/harness_source_skills/docvqa_best_skill.md` | 确认 image attachment fix 生效 | + +Step 3: Codex 补齐剩余可调参数。这个阶段是计划内的后续补齐,不一次性全起;通常在 Step 1 初步结果后按 benchmark 展开。搜索空间来自 no-harness ablation 中所有仍允许调的参数;排除 split ratio 和模块关闭实验。每个 run 单独打勾;启动前必须设置 `env.skill_init` 指向该 benchmark 的 canonical best skill,不能用 setting-specific skill,也不能用默认初始 skill。 + +| Status | 建议 Run ID | Benchmark | 参数 | 必须保持 | +| --- | --- | --- | --- | --- | +| [ ] | `HARNESS-Codex-SearchQA-batch8` | SearchQA | `train.batch_size=8` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-SearchQA-batch24` | SearchQA | `train.batch_size=24` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-SearchQA-batch56` | SearchQA | `train.batch_size=56` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-Spreadsheet-batch8-multi` | SpreadsheetBench | `train.batch_size=8` | 对应 no-harness best_skill; `env.mode=multi` | +| [ ] | `HARNESS-Codex-Spreadsheet-batch24-multi` | SpreadsheetBench | `train.batch_size=24` | 对应 no-harness best_skill; `env.mode=multi` | +| [ ] | `HARNESS-Codex-Spreadsheet-batch56-multi` | SpreadsheetBench | `train.batch_size=56` | 对应 no-harness best_skill; `env.mode=multi` | +| [ ] | `HARNESS-Codex-Spreadsheet-batch-full-multi` | SpreadsheetBench | `train.batch_size=80` | 对应 no-harness best_skill; `env.mode=multi` | +| [ ] | `HARNESS-Codex-LiveMath-batch8` | LiveMathBench | `train.batch_size=8` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-LiveMath-batch24` | LiveMathBench | `train.batch_size=24` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-LiveMath-batch56` | LiveMathBench | `train.batch_size=56` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-LiveMath-batch-full` | LiveMathBench | `train.batch_size=35` | 对应 no-harness best_skill; current 2:1:7 train size | +| [ ] | `HARNESS-Codex-DocVQA10-batch8` | DocVQA 10% | `train.batch_size=8` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-DocVQA10-batch56` | DocVQA 10% | `train.batch_size=56` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-SearchQA-minibatch1` | SearchQA | `gradient.minibatch_size=1` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-SearchQA-minibatch2` | SearchQA | `gradient.minibatch_size=2` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-SearchQA-minibatch4` | SearchQA | `gradient.minibatch_size=4` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-SearchQA-minibatch16` | SearchQA | `gradient.minibatch_size=16` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-SearchQA-minibatch32` | SearchQA | `gradient.minibatch_size=32` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-Spreadsheet-minibatch1-multi` | SpreadsheetBench | `gradient.minibatch_size=1` | 对应 no-harness best_skill; `env.mode=multi` | +| [ ] | `HARNESS-Codex-Spreadsheet-minibatch2-multi` | SpreadsheetBench | `gradient.minibatch_size=2` | 对应 no-harness best_skill; `env.mode=multi` | +| [ ] | `HARNESS-Codex-Spreadsheet-minibatch4-multi` | SpreadsheetBench | `gradient.minibatch_size=4` | 对应 no-harness best_skill; `env.mode=multi` | +| [ ] | `HARNESS-Codex-Spreadsheet-minibatch32-multi` | SpreadsheetBench | `gradient.minibatch_size=32` | 对应 no-harness best_skill; `env.mode=multi` | +| [ ] | `HARNESS-Codex-LiveMath-minibatch1` | LiveMathBench | `gradient.minibatch_size=1` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-LiveMath-minibatch2` | LiveMathBench | `gradient.minibatch_size=2` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-LiveMath-minibatch16` | LiveMathBench | `gradient.minibatch_size=16` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-LiveMath-minibatch32` | LiveMathBench | `gradient.minibatch_size=32` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-DocVQA10-minibatch1` | DocVQA 10% | `gradient.minibatch_size=1` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-DocVQA10-minibatch2` | DocVQA 10% | `gradient.minibatch_size=2` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-DocVQA10-minibatch4` | DocVQA 10% | `gradient.minibatch_size=4` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-DocVQA10-minibatch16` | DocVQA 10% | `gradient.minibatch_size=16` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-SearchQA-lr1` | SearchQA | `optimizer.learning_rate=1`, `optimizer.lr_scheduler=constant`, `optimizer.min_learning_rate=1` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-SearchQA-lr2` | SearchQA | `optimizer.learning_rate=2`, `optimizer.lr_scheduler=constant`, `optimizer.min_learning_rate=1` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-SearchQA-lr4` | SearchQA | `optimizer.learning_rate=4`, `optimizer.lr_scheduler=constant`, `optimizer.min_learning_rate=1` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-SearchQA-lr16` | SearchQA | `optimizer.learning_rate=16`, `optimizer.lr_scheduler=constant`, `optimizer.min_learning_rate=1` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-Spreadsheet-lr1-multi` | SpreadsheetBench | `optimizer.learning_rate=1`, `optimizer.lr_scheduler=constant`, `optimizer.min_learning_rate=1` | 对应 no-harness best_skill; `env.mode=multi` | +| [ ] | `HARNESS-Codex-Spreadsheet-lr2-multi` | SpreadsheetBench | `optimizer.learning_rate=2`, `optimizer.lr_scheduler=constant`, `optimizer.min_learning_rate=1` | 对应 no-harness best_skill; `env.mode=multi` | +| [ ] | `HARNESS-Codex-Spreadsheet-lr8-multi` | SpreadsheetBench | `optimizer.learning_rate=8`, `optimizer.lr_scheduler=constant`, `optimizer.min_learning_rate=1` | 对应 no-harness best_skill; `env.mode=multi` | +| [ ] | `HARNESS-Codex-LiveMath-lr1` | LiveMathBench | `optimizer.learning_rate=1`, `optimizer.lr_scheduler=constant`, `optimizer.min_learning_rate=1` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-LiveMath-lr2` | LiveMathBench | `optimizer.learning_rate=2`, `optimizer.lr_scheduler=constant`, `optimizer.min_learning_rate=1` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-LiveMath-lr4` | LiveMathBench | `optimizer.learning_rate=4`, `optimizer.lr_scheduler=constant`, `optimizer.min_learning_rate=1` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-DocVQA10-lr1` | DocVQA 10% | `optimizer.learning_rate=1`, `optimizer.lr_scheduler=constant`, `optimizer.min_learning_rate=1` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-DocVQA10-lr2` | DocVQA 10% | `optimizer.learning_rate=2`, `optimizer.lr_scheduler=constant`, `optimizer.min_learning_rate=1` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-DocVQA10-lr4` | DocVQA 10% | `optimizer.learning_rate=4`, `optimizer.lr_scheduler=constant`, `optimizer.min_learning_rate=1` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-Spreadsheet-sched-linear-multi` | SpreadsheetBench | `optimizer.learning_rate=4`, `optimizer.lr_scheduler=linear`, `optimizer.min_learning_rate=2` | 对应 no-harness best_skill; `env.mode=multi` | +| [ ] | `HARNESS-Codex-LiveMath-sched-constant` | LiveMathBench | `optimizer.learning_rate=4`, `optimizer.lr_scheduler=constant`, `optimizer.min_learning_rate=2` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-DocVQA10-sched-constant` | DocVQA 10% | `optimizer.learning_rate=4`, `optimizer.lr_scheduler=constant`, `optimizer.min_learning_rate=2` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-DocVQA10-sched-linear` | DocVQA 10% | `optimizer.learning_rate=4`, `optimizer.lr_scheduler=linear`, `optimizer.min_learning_rate=2` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-SearchQA-slow5` | SearchQA | `optimizer.slow_update_samples=5` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-SearchQA-slow10` | SearchQA | `optimizer.slow_update_samples=10` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-SearchQA-slow40` | SearchQA | `optimizer.slow_update_samples=40` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-Spreadsheet-slow5-multi` | SpreadsheetBench | `optimizer.slow_update_samples=5` | 对应 no-harness best_skill; `env.mode=multi` | +| [ ] | `HARNESS-Codex-Spreadsheet-slow10-multi` | SpreadsheetBench | `optimizer.slow_update_samples=10` | 对应 no-harness best_skill; `env.mode=multi` | +| [ ] | `HARNESS-Codex-Spreadsheet-slow40-multi` | SpreadsheetBench | `optimizer.slow_update_samples=40` | 对应 no-harness best_skill; `env.mode=multi` | +| [ ] | `HARNESS-Codex-LiveMath-slow5` | LiveMathBench | `optimizer.slow_update_samples=5` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-LiveMath-slow40` | LiveMathBench | `optimizer.slow_update_samples=40` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-DocVQA10-slow5` | DocVQA 10% | `optimizer.slow_update_samples=5` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-DocVQA10-slow10` | DocVQA 10% | `optimizer.slow_update_samples=10` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-DocVQA10-slow40` | DocVQA 10% | `optimizer.slow_update_samples=40` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-SearchQA-policy-changed` | SearchQA | `optimizer.longitudinal_pair_policy=changed` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-SearchQA-policy-unchanged` | SearchQA | `optimizer.longitudinal_pair_policy=unchanged` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-Spreadsheet-policy-changed-multi` | SpreadsheetBench | `optimizer.longitudinal_pair_policy=changed` | 对应 no-harness best_skill; `env.mode=multi` | +| [ ] | `HARNESS-Codex-Spreadsheet-policy-unchanged-multi` | SpreadsheetBench | `optimizer.longitudinal_pair_policy=unchanged` | 对应 no-harness best_skill; `env.mode=multi` | +| [ ] | `HARNESS-Codex-LiveMath-policy-changed` | LiveMathBench | `optimizer.longitudinal_pair_policy=changed` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-LiveMath-policy-unchanged` | LiveMathBench | `optimizer.longitudinal_pair_policy=unchanged` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-DocVQA10-policy-changed` | DocVQA 10% | `optimizer.longitudinal_pair_policy=changed` | 对应 no-harness best_skill | +| [ ] | `HARNESS-Codex-DocVQA10-policy-unchanged` | DocVQA 10% | `optimizer.longitudinal_pair_policy=unchanged` | 对应 no-harness best_skill | + +Step 3 展开数量:共 66 个 Codex 实验。计数规则是:排除 Step 1 已经覆盖的值,排除现有 default/best-skill/best-setting 已覆盖的默认值,排除 split ratio 和模块关闭组合。 + +| 参数组 | SearchQA | SpreadsheetBench | LiveMathBench | DocVQA10 | 小计 | +| --- | ---: | ---: | ---: | ---: | ---: | +| `train.batch_size` | 3 | 4 | 4 | 2 | 13 | +| `gradient.minibatch_size` | 5 | 4 | 4 | 4 | 17 | +| `optimizer.learning_rate` | 4 | 3 | 3 | 3 | 13 | +| `optimizer.lr_scheduler` | 0 | 1 | 1 | 2 | 4 | +| `optimizer.slow_update_samples` | 3 | 3 | 2 | 3 | 11 | +| `optimizer.longitudinal_pair_policy` | 2 | 2 | 2 | 2 | 8 | +| **Total** | **17** | **17** | **16** | **15** | **66** | + +数量上限: + +```text +Codex Step 1:18 个。 +Claude Step 2:18 个,完整复制 Codex Step 1 的 setting。 +Codex Step 3:66 个,补齐剩余允许参数。 +当前完整计划总量:102 个新实验。 +实际启动原则:每轮最多 1-4 个,高负载时不要继续加。 +``` + +明确不跑: + +```text +1. 不跑 ALFWorld harness。 +2. 不在本机重复 fresh-machine best-setting from-scratch。 +3. 不一次性启动 8 个 harness best-setting。 +4. 不在 Claude fixed rerun 完成前做 Claude Step 2 镜像实验。 +5. Claude Step 2 只复制 Codex Step 1 的 18 个 setting;Step 3 不默认镜像到 Claude。 +6. 不跑 `optimizer.use_slow_update/use_meta_skill` 的 off/on 模块关闭组合。 +``` + +## 0. 启动前强制检查 + +每次启动新一批实验前必须满足: + +| 检查项 | 必须满足 | +| --- | --- | +| Python | 使用 `/home/azureuser/workspace-gzy/miniconda3/envs/reflact/bin/python` | +| 代码语法 | `py_compile` 覆盖 launcher、train、LiveMathBench dataloader、ALFWorld dataloader/adapter | +| Harness | 普通 ablation 使用 `model.teacher_backend=openai_chat` 且 `model.student_backend=openai_chat`;本节 harness 冲分使用 `model.teacher_backend=openai_chat` 且 `model.student_backend=codex_exec/claude_code_exec` | +| 模型 | teacher/student 默认均为 `gpt-5.5` | +| Azure | 当前 harness 冲分 endpoint 使用 `https://oaidr21.openai.azure.com/` | +| Auth | teacher/student 默认均使用 `azure_cli` | +| API version | teacher/student 默认均为 `2024-12-01-preview` | +| Reasoning | `model.reasoning_effort=medium`,teacher 和 student 都必须生效 | +| Split | 所有 benchmark 都使用固定 `env.split_mode=split_dir` | +| Train size | `train.train_size=0`,由 split 自动推断 | +| Batch 逻辑 | `steps_per_epoch = ceil(train_size / (train.batch_size * train.accumulation))` | +| Rollout 并发 | 本轮 active benchmark 的 rollout 不允许使用裸 `as_completed`; 必须使用 `wait(..., FIRST_COMPLETED)` + started-at timeout,避免最后一个 future 卡死 | +| LLM timeout | `chat_student` / `chat_student_messages` 必须向 SDK 传入显式 timeout,不能只依赖外层 future timeout | +| Gate/Test | `evaluation.use_gate=true`,`evaluation.eval_test=true` | +| 输出目录 | 每个 run 的 `env.out_root` 唯一;错误目录不复用 | + +启动后立刻检查每个 run 的 `config.json` 和 log 前 50 行,确认模型、split、reasoning、train size、batch size、消融参数都正确。 + +## 1. 固定默认 Setting + +除当前消融变量外,所有实验固定: + +| 参数 | 默认值 | +| --- | --- | +| `model.teacher_backend` | `openai_chat` | +| `model.student_backend` | `openai_chat` | +| `model.teacher` | `gpt-5.5` | +| `model.student` | `gpt-5.5` | +| `model.teacher_azure_openai_endpoint` | `https://t2vgoaigpt4o3.openai.azure.com/` | +| `model.student_azure_openai_endpoint` | `https://t2vgoaigpt4o3.openai.azure.com/` | +| `model.teacher_azure_openai_api_version` | `2024-12-01-preview` | +| `model.student_azure_openai_api_version` | `2024-12-01-preview` | +| `model.teacher_azure_openai_auth_mode` | `azure_cli` | +| `model.student_azure_openai_auth_mode` | `azure_cli` | +| `model.reasoning_effort` | `medium` | +| `train.num_epochs` | `4` | +| `train.train_size` | `0` | +| `train.batch_size` | `40` | +| `train.accumulation` | `1` | +| `train.seed` | `42` | +| `gradient.minibatch_size` | `8` | +| `gradient.merge_batch_size` | `8` | +| `gradient.analyst_workers` | `16` | +| `gradient.use_deep_reflect` | `false` | +| `optimizer.learning_rate` | `4` | +| `optimizer.min_learning_rate` | `2` | +| `optimizer.lr_scheduler` | `cosine` | +| `optimizer.skill_update_mode` | `patch` | +| `optimizer.use_slow_update` | `true` | +| `optimizer.slow_update_samples` | `20` | +| `optimizer.use_meta_skill` | `true` | +| `optimizer.use_meta_reflect` | `false` | +| `evaluation.use_gate` | `true` | +| `evaluation.eval_test` | `true` | + +Benchmark 固定项: + +| Benchmark | Config | Default split | Train | Val | Test | Student rollout | +| --- | --- | --- | ---: | ---: | ---: | --- | +| SearchQA | `configs/searchqa/default.yaml` | `data/ablation_splits/searchqa/2-1-7_seed42` | 400 | 200 | 1400 | `chat_student` | +| SpreadsheetBench | `configs/spreadsheetbench/default.yaml` | `data/ablation_splits/spreadsheetbench/2-1-7_seed42` | 80 | 40 | 280 | `codegen_agent.py -> get_student_client()` | +| LiveMathBench | `configs/livemathematicianbench/default.yaml` | `data/ablation_splits/livemathematicianbench/2-1-7_seed42` | 35 | 18 | 124 | `chat_student` | +| ALFWorld | `configs/alfworld/default.yaml` | `data/ablation_splits/alfworld/2-1-7_seed42` | 39 | 18 | 134 | `chat_student` | +| DocVQA | `configs/docvqa/default.yaml` | `/home/azureuser/zisu/SkillReflection/data/docvqa/splits` | 1070 | 535 | 3744 | `chat_student_messages` | + +ALFWorld 必须设置: + +```bash +export ALFWORLD_DATA=/home/azureuser/.cache/alfworld +``` + +## 2. 已验证的数据和 loader 行为 + +LiveMathBench 使用固定 JSON split,训练时每个 batch 内仍按 seed shuffle choices。已验证 batch 切分: + +| `train.batch_size` | Train size | Steps/epoch | Batch sizes | +| ---: | ---: | ---: | --- | +| 8 | 35 | 5 | `8 + 8 + 8 + 8 + 3` | +| 24 | 35 | 2 | `24 + 11` | +| 40 | 35 | 1 | `35` | +| 56 | 35 | 1 | `35` | +| full | 35 | 1 | `35` | + +ALFWorld 使用固定 `gamefile` split,39 个 train gamefile 从本地 `/home/azureuser/.cache/alfworld/json_2.1.1/train` 中固定抽样,val/test 分别来自 `valid_seen` / `valid_unseen`。已验证本地 gamefile 全部存在,并做过单环境 `reset()` smoke。 + +| `train.batch_size` | Train size | Steps/epoch | Batch sizes | +| ---: | ---: | ---: | --- | +| 8 | 39 | 5 | `8 + 8 + 8 + 8 + 7` | +| 24 | 39 | 2 | `24 + 15` | +| 40 | 39 | 1 | `39` | +| 56 | 39 | 1 | `39` | +| full | 39 | 1 | `39` | + +新增 benchmark 的 `env.split_dir` 多比例数据已补齐: + +| Benchmark | Split | Train | Val | Test | 说明 | +| --- | --- | ---: | ---: | ---: | --- | +| LiveMathBench | `1shot_seed42` | 1 | 1 | 175 | 从同一 177-item 候选池生成 | +| LiveMathBench | `1-1-8_seed42` | 18 | 18 | 141 | 从同一 177-item 候选池生成 | +| LiveMathBench | `2-1-7_seed42` | 35 | 18 | 124 | 按 `scripts/prepare_ablation_splits.py` 同一整数规则生成 | +| LiveMathBench | `4-1-5_seed42` | 71 | 18 | 88 | 从同一 177-item 候选池生成 | +| ALFWorld | `1shot_seed42` | 1 | 1 | 134 | domain-preserving fixed gamefiles | +| ALFWorld | `1-1-8_seed42` | 18 | 18 | 134 | train 只来自 official train,val/test 保持 valid_seen/valid_unseen | +| ALFWorld | `2-1-7_seed42` | 39 | 18 | 134 | 已验证 fixed split | +| ALFWorld | `4-1-5_seed42` | 82 | 18 | 103 | train 只来自 `train_pool_80`,test 从 valid_unseen 固定抽样 | +| DocVQA | `1shot_seed42` | 1 | 1 | 5347 | 从同一 5349-item 候选池生成 | +| DocVQA | `1-1-8_seed42` | 535 | 535 | 4279 | 按 `scripts/prepare_ablation_splits.py` 同一整数规则生成 | +| DocVQA | zisu `splits` | 1070 | 535 | 3744 | 直接使用 `/home/azureuser/zisu/SkillReflection/data/docvqa/splits`;旧 `2-1-7_seed42` 是同一 5349 样本池的重新划分,不再作为最终 2:1:7 | +| DocVQA | `4-1-5_seed42` | 2140 | 535 | 2674 | 按 `scripts/prepare_ablation_splits.py` 同一整数规则生成 | + +DocVQA 图片目录不复制,使用 symlink: + +```text +data/docvqa_images -> /home/azureuser/zisu/SkillReflection/data/docvqa_images +``` + +2026-05-05 DocVQA 数据修正: + +```text +使用路径:/home/azureuser/zisu/SkillReflection/data/docvqa/splits +数量:train=1070, val=535, test=3744, total=5349 +CSV 字段:question, answer, topic, questionId, docId, image_path, ... +图片:所有抽查和全量 image_path 均可通过 data/docvqa_images symlink 解析 +对比:旧 data/ablation_splits/docvqa/2-1-7_seed42 与 zisu split 的 5349 questionId 集合完全相同,但 train/val/test 分配不同。 +结论:所有 DocVQA 2:1:7 结果需要按 zisu split 重跑;旧 DocVQA 表格结果仅作历史参考。 +建议新 run_root:outputs/ablation_docvqa_zisu10pct_20260505_run,避免旧 outputs/ablation_docvqa_20260503_160225_run 的 summary.json 被 launcher 跳过。 +``` + +## 3. 新跑实验清单 + +### 3.1 LiveMathBench / ALFWorld / DocVQA 补齐现有 ablation table + +对 LiveMathBench、ALFWorld 和 DocVQA 新跑下面这些组;默认点会被复用,不重复跑同一设置: + +| 组 | Values | 每 benchmark 新 run 数 | +| --- | --- | ---: | +| `default` | 默认 `gpt-5.5/gpt-5.5` setting | 1 | +| `env.split_dir` | `1shot / 1-1-8 / 4-1-5`,跳过默认 `2-1-7` | 3 | +| `gradient.minibatch_size` | `1 / 2 / 4 / 16 / 32`,跳过默认 `8` | 5 | +| `optimizer.learning_rate` | `1 / 2 / 4 / 8 / 16`,本组固定 `lr_scheduler=constant min_learning_rate=1` | 5 | +| `optimizer.lr_scheduler` | `constant / linear`,跳过默认 `cosine` | 2 | +| `optimizer.slow_update_samples` | `5 / 10 / 40`,跳过默认 `20` | 3 | +| `optimizer.use_slow_update` / `optimizer.use_meta_skill` | `true,false` / `false,true` / `false,false`,跳过默认 `true,true` | 3 | +| `model.student` | `gpt-5.4-pro / gpt-5.4-mini`,跳过默认 `gpt-5.5` | 2 | + +合计:每个新增 benchmark 24 个现有表补齐 run;三个 benchmark 共 72 个。 + +### 3.2 新增 `train.batch_size` ablation + +对五个 benchmark 都补 batch-size 表: + +```text +train.batch_size = 8 / 24 / 40 / 56 / full +gradient.minibatch_size = 8 +``` + +`train.batch_size=40` 是默认点: + +| Benchmark | full 展开值 | 默认 40 来源 | 需要新跑 batch 点 | +| --- | ---: | --- | --- | +| SearchQA | 400 | 复用已有 default | `8 / 24 / 56 / full` | +| SpreadsheetBench | 80 | 复用已有 default | `8 / 24 / 56 / full` | +| LiveMathBench | 35 | 复用本轮新增 default | `8 / 24 / 56 / full` | +| ALFWorld | 39 | 复用本轮新增 default | `8 / 24 / 56 / full` | +| DocVQA | 1070 | 复用本轮新增 default | `8 / 24 / 56 / full` | + +新增 batch-size run 数:五个 benchmark 共 20 个。 + +### 3.3 本轮总新增 run 数 + +| 实验块 | SearchQA | SpreadsheetBench | LiveMathBench | ALFWorld | DocVQA | 新增合计 | +| --- | ---: | ---: | ---: | ---: | ---: | ---: | +| 补齐现有 table | 0 | 0 | 24 | 24 | 24 | 72 | +| 新增 batch-size table | 4 | 4 | 4 | 4 | 4 | 20 | +| 合计 | 4 | 4 | 28 | 28 | 28 | 92 | + +### 3.4 新增 longitudinal comparison example policy ablation + +这个实验只改变 slow update 和 meta skill 看到的 comparison examples,不改 prompt、不改模型、不改 rollout、不改优化主流程。默认 `mixed` 复用原版随机 20 条。 + +| Policy | Comparison examples | Top-up 规则 | +| --- | --- | --- | +| `mixed` | 原版随机样本,可能包含 `10/01/00/11` | 不额外补样本 | +| `changed` | 只保留 `10/01`,即 improved/regressed | 不满 20 时继续从 train split 扫描补满,扫完整个 train split 为止 | +| `unchanged` | 只保留 `00/11`,即 persistent_fail/stable_success | 不强求 20 条,不补样本 | + +本轮先跑除 ALFWorld 外四个 benchmark: + +```text +SearchQA / SpreadsheetBench / LiveMathBench / DocVQA +``` + +新增 run 数:`4 benchmarks x 2 policies = 8`。 + +### 3.5 新增 learning-rate-control baseline + +这个实验只在除 ALFWorld 外四个 benchmark 上跑: + +```text +SearchQA / SpreadsheetBench / LiveMathBench / DocVQA +``` + +两类 baseline: + +| Run | 设置 | 说明 | +| --- | --- | --- | +| `autonomous` | `optimizer.lr_control_mode=autonomous` | 保留 edit/patch 框架,但每个 step 由 teacher 根据当前证据自主输出一个整数 LR;prompt 不给默认 LR、候选 LR 或历史 LR 先验;每步写 `lr_decision.json`,全局写 `lr_history.jsonl` | +| `full-rewrite` | `optimizer.lr_control_mode=none` + `optimizer.skill_update_mode=full_rewrite_minibatch` | 去掉 LR/edit budget/select/apply-edit;每个 minibatch 直接生成完整 skill candidate,aggregate/merge 直接合成最终 candidate skill | + +新增 run 数:`4 benchmarks x 2 baselines = 8`。 + +实现落点: + +| 功能 | 文件 | +| --- | --- | +| autonomous LR 决策 | `reflact/optimizer/lr_autonomous.py` | +| LR control config | `configs/_base_/default.yaml`, `reflact/config.py`, `scripts/train.py` | +| full-rewrite-minibatch mode | `reflact/optimizer/update_modes.py`, `reflact/gradient/reflect.py`, `reflact/gradient/aggregate.py`, `reflact/engine/trainer.py` | +| autonomous LR prompt | `reflact/prompts/lr_autonomous.md` | +| full-rewrite analyst prompts | `reflact/prompts/analyst_error_full_rewrite.md`, `reflact/prompts/analyst_success_full_rewrite.md` | +| full-rewrite merge prompts | `reflact/prompts/merge_failure_full_rewrite.md`, `reflact/prompts/merge_success_full_rewrite.md`, `reflact/prompts/merge_final_full_rewrite.md` | + +验证记录: + +```text +py_compile passed for train, config, trainer, reflect, aggregate, update_modes, lr_autonomous, env base. +SearchQA smoke autonomous passed: wrote steps/step_0001/lr_decision.json and lr_history.jsonl. +SearchQA smoke full-rewrite passed: wrote steps/step_0001/full_rewrite_result.json and candidate_skill.md. +Formal LRCTRL config alignment vs default: ALL OK. +``` + +正式 run root: + +```text +outputs/ablation_lrctrl_20260504_run +``` + +正式启动后必须检查: + +```text +autonomous: lr_decision.json count increases with steps; lr_history.jsonl exists. +full-rewrite: full_rewrite_result.json count increases with steps; edit_budget in step_record is null. +all: config.json differs from default only in lr_control_mode / skill_update_mode / out_root. +``` + +2026-05-05 minimal-prompt full-rewrite rerun: + +```text +Motivation: +The original full-rewrite prompts were highly guided: expert rewriter framing, +failure corrections as high priority, preservation of strongest guidance, and +explicit generalizable behavioral guidance. To test a less scaffolded rewrite +baseline, the original full-rewrite prompt files were replaced in-place with +summary-style prompts. + +Changed prompt files: +- reflact/prompts/analyst_error_full_rewrite.md +- reflact/prompts/analyst_success_full_rewrite.md +- reflact/prompts/merge_failure_full_rewrite.md +- reflact/prompts/merge_success_full_rewrite.md +- reflact/prompts/merge_final_full_rewrite.md + +The replacement prompts keep the same JSON schema and still require complete +replacement skill documents, but only ask the model to summarize lessons or +combine complete skill candidates. They keep the no-task-specific-answer rule. + +Runs started manually, not via matrix launcher: +run_root=outputs/rerun_lrctrl_fullrewrite_minimalprompt_20260505_run + +LRCTRL-searchqa-full-rewrite-minimalprompt + pid=281937 + config=configs/searchqa/default.yaml + split=data/ablation_splits/searchqa/2-1-7_seed42 + verified train/val/test=400/200/1400 + overrides: optimizer.lr_control_mode=none, optimizer.skill_update_mode=full_rewrite_minibatch + +LRCTRL-spreadsheetbench-full-rewrite-minimalprompt + pid=281938 + config=configs/spreadsheetbench/default.yaml + split=data/ablation_splits/spreadsheetbench/2-1-7_seed42 + verified train/val/test=80/40/280 + overrides: optimizer.lr_control_mode=none, optimizer.skill_update_mode=full_rewrite_minibatch + +Table status rows were added in docs/ablation_paper_tables.md; metrics remain +blank until summary.json is produced. + +2026-05-05 14:48 UTC update: + +The code-assembled final merge user prompt also had hardcoded priority wording: +`Group 1 (failure-driven, HIGH priority)` and `Group 2 (success-driven, lower priority)`. +For `full_rewrite_minibatch` only, this was changed in `reflact/gradient/aggregate.py` +to neutral source labels: `Group 1 (from failed trajectories)` and +`Group 2 (from successful trajectories)`. Ordinary patch-mode merging keeps the +existing failure-priority prompt. + +The earlier two `minimalprompt` processes were stopped before `summary.json`. +The replacement runs use a fresh root: + +run_root=outputs/rerun_lrctrl_fullrewrite_minimalprompt_neutralmerge_20260505_run + +LRCTRL-searchqa-full-rewrite-minimalprompt-neutralmerge + pid=438572 + config=configs/searchqa/default.yaml + split=data/ablation_splits/searchqa/2-1-7_seed42 + verified train/val/test=400/200/1400 + overrides: optimizer.lr_control_mode=none, optimizer.skill_update_mode=full_rewrite_minibatch + +LRCTRL-spreadsheetbench-full-rewrite-minimalprompt-neutralmerge + pid=438573 + config=configs/spreadsheetbench/default.yaml + split=data/ablation_splits/spreadsheetbench/2-1-7_seed42 + verified train/val/test=80/40/280 + overrides: optimizer.lr_control_mode=none, optimizer.skill_update_mode=full_rewrite_minibatch +``` + +### 3.6 当前 active experiments snapshot + +2026-05-04 18:00 UTC 左右,当前不跑 ALFWorld,active 训练进程保持 24 个,无重复 out_root。 + +DocVQA 旧矩阵剩余 8 个: + +```text +SLOWN-docvqa-5 +SLOWN-docvqa-10 +SLOWN-docvqa-40 +MOD-docvqa-slow-only +MOD-docvqa-meta-only +MOD-docvqa-none +SMODEL-docvqa-5.4 +SMODEL-docvqa-5.4-mini +``` + +Longitudinal comparison-example policy 8 个: + +```text +LONGPAIR-searchqa-changed +LONGPAIR-searchqa-unchanged +LONGPAIR-spreadsheetbench-changed +LONGPAIR-spreadsheetbench-unchanged +LONGPAIR-livemathematicianbench-changed +LONGPAIR-livemathematicianbench-unchanged +LONGPAIR-docvqa-changed +LONGPAIR-docvqa-unchanged +``` + +Learning-rate-control baseline 8 个: + +```text +LRCTRL-searchqa-autonomous +LRCTRL-searchqa-full-rewrite +LRCTRL-spreadsheetbench-autonomous +LRCTRL-spreadsheetbench-full-rewrite +LRCTRL-livemathematicianbench-autonomous +LRCTRL-livemathematicianbench-full-rewrite +LRCTRL-docvqa-autonomous +LRCTRL-docvqa-full-rewrite +``` + +DocVQA skill-none eval 另有一个 eval-only 进程在跑,不计入 24 个 train: + +```text +outputs/ablation_docvqa_20260503_160225_run/skill-none-model.student-docvqa-5.5 +``` + +旧的 0-byte partial 已归档: + +```text +outputs/ablation_docvqa_20260503_160225_run/archive_skill_none_docvqa_empty_20260504_175635 +``` + +### 3.7 2026-05-05 handoff snapshot + +2026-05-05 08:26 UTC 手动检查状态如下。本节覆盖上一节的过期 active snapshot;上一节只作为历史记录保留。 + +当前没有 `run_ablation_matrix.py` launcher、`rerun_unfinished_no_prior` helper 或自动监控脚本在跑。只保留已经直接启动的 `scripts/train.py` 实验进程。由于 ALFWorld 会派生本地环境 worker,进程行数会被放大;统计时必须按唯一 `env.out_root`/run id 计数。 + +当前 active setting 共 19 个: + +```text +DEFAULT-alfworld-5.5 +SPLIT-alfworld-1shot +SPLIT-alfworld-1-1-8 +SPLIT-alfworld-4-1-5 +BATCH-alfworld-8 +BATCH-alfworld-24 +BATCH-alfworld-56 +BATCH-alfworld-full +MBS-alfworld-1 +MBS-alfworld-2 +MBS-alfworld-4 +MBS-alfworld-16 +MBS-alfworld-32 +LR-alfworld-1 +LR-alfworld-2 +LR-alfworld-4 +LONGPAIR-spreadsheetbench-changed +LRCTRL-docvqa-full-rewrite +LRCTRL-spreadsheetbench-autonomous +``` + +当前尚未启动、建议交给新机器补跑的 setting 共 17 个: + +```text +LR-alfworld-8 +LR-alfworld-16 +SCHED-alfworld-constant +SCHED-alfworld-linear +SLOWN-alfworld-5 +SLOWN-alfworld-10 +SLOWN-alfworld-40 +MOD-alfworld-slow-only +MOD-alfworld-meta-only +MOD-alfworld-none +SMODEL-alfworld-5.4 +SMODEL-alfworld-5.4-mini +LONGPAIR-alfworld-changed +LONGPAIR-alfworld-unchanged +LRCTRL-alfworld-autonomous +LRCTRL-alfworld-full-rewrite +LRCTRL-docvqa-autonomous +``` + +No-prior autonomous LR rerun 状态: + +```text +done: LRCTRL-searchqa-autonomous +done: LRCTRL-livemathematicianbench-autonomous +active: LRCTRL-spreadsheetbench-autonomous +not started: LRCTRL-docvqa-autonomous +``` + +健康状态: + +```text +true 429 check: none found in active logs +Traceback/OOM/Killed check: none found in active logs +memory: about 617 GiB available, train RSS about 20 GiB +known non-fatal issue: SpreadsheetBench active runs have task-level TIMEOUT rows during rollout; these are benchmark item timeouts, not API 429. +``` + +继续手动监控时用这个唯一 run-id 统计,不要用裸进程行数: + +```bash +/home/azureuser/workspace-gzy/miniconda3/envs/reflact/bin/python - <<'PY' +import os, re, subprocess, collections +out = subprocess.check_output(["ps", "-eo", "pid,ppid,lstart,cmd"], text=True, errors="replace") +by = collections.defaultdict(list) +root_by = {} +for line in out.splitlines()[1:]: + if "scripts/train.py" not in line or "python - <<" in line or "ps -eo" in line: + continue + m = re.search(r"env\.out_root=([^\s]+)", line) + if not m: + continue + root = m.group(1) + run_id = os.path.basename(root.rstrip("/")) + by[run_id].append(line.split(None, 1)[0]) + root_by[run_id] = root +print(f"unique_active_settings={len(by)} process_rows={sum(len(v) for v in by.values())}") +for run_id in sorted(by): + print(f"{run_id}\trows={len(by[run_id])}\troot={root_by[run_id]}") +PY +``` + +真实 429 检查必须用错误文本,不要把 ALFWorld progress bar 的 `429/494` 误判为 rate limit: + +```bash +rg -n "Error code: 429|too_many_requests|Too Many Requests|Rate limit|rate_limit" \ + outputs/ablation_alfworld_nonray_20260505_run/logs \ + outputs/ablation_longpair_20260504_run/logs/LONGPAIR-spreadsheetbench-changed.log \ + outputs/ablation_lrctrl_20260504_run/logs/LRCTRL-docvqa-full-rewrite.log \ + outputs/ablation_lrctrl_autonomous_noprior_20260505_031612_run/logs/LRCTRL-spreadsheetbench-autonomous.rerun_20260505_0800.log \ + 2>/dev/null || true +``` + +### 3.8 Fresh-machine handoff + +新机器目标:不要重复当前 19 个 active setting;只补跑 17 个 not-started setting。若新机器和当前机器共享同一磁盘,必须使用新的 run root,避免与当前 active writer 混写。推荐: + +```text +ALFWorld remaining root: +outputs/ablation_alfworld_nonray_newmachine_20260505_run + +DocVQA no-prior autonomous root: +outputs/ablation_lrctrl_autonomous_noprior_newmachine_20260505_run +``` + +环境准备: + +```bash +git clone SkillReflection +cd SkillReflection +git checkout ablation + +conda create -n reflact python=3.11 -y +conda activate reflact +python -m pip install -U pip +python -m pip install openai pyyaml openpyxl numpy gymnasium ray regex tqdm alfworld textworld + +az login +az account show + +export ALFWORLD_DATA=/home/azureuser/.cache/alfworld +test -d "$ALFWORLD_DATA/json_2.1.1/train" +[ -L data/docvqa_images ] || [ -d data/docvqa_images ] +``` + +如果 DocVQA 图片目录不在仓库内,按机器实际路径建立 symlink: + +```bash +ln -s /path/to/docvqa_images data/docvqa_images +``` + +启动前验证: + +```bash +PY=$(which python) +$PY -m py_compile \ + scripts/train.py \ + scripts/run_ablation_matrix.py \ + reflact/envs/alfworld/rollout.py \ + reflact/envs/docvqa/rollout.py \ + reflact/engine/trainer.py \ + reflact/model/azure_openai.py + +$PY - <<'PY' +from reflact.envs.alfworld.dataloader import load_alfworld_splits +s = load_alfworld_splits("data/ablation_splits/alfworld/2-1-7_seed42") +print(len(s["train"]), len(s["val"]), len(s["test"])) +PY +``` + +只生成 ALFWorld 剩余 16 个 direct train 命令。注意:`run_ablation_matrix.py` 原始 dry-run 会枚举整个 ALFWorld matrix,显示 `num_experiments=32` 是正常的,但这 32 个不能全跑;其中 16 个已经在当前机器 active,只能启动 filtered list 里的 16 个 remaining run。 + +```bash +PY=$(which python) +$PY scripts/run_ablation_matrix.py \ + --groups default split batch mbs lr sched slown mod smodel longpair lrctrl \ + --bench alfworld \ + --run-root /home/azureuser/workspace-gzy/SkillReflection/outputs/ablation_alfworld_nonray_newmachine_20260505_run \ + > /tmp/alfworld_all_32_commands.txt + +$PY - <<'PY' +from pathlib import Path + +remaining = { + "LR-alfworld-8", + "LR-alfworld-16", + "SCHED-alfworld-constant", + "SCHED-alfworld-linear", + "SLOWN-alfworld-5", + "SLOWN-alfworld-10", + "SLOWN-alfworld-40", + "MOD-alfworld-slow-only", + "MOD-alfworld-meta-only", + "MOD-alfworld-none", + "SMODEL-alfworld-5.4", + "SMODEL-alfworld-5.4-mini", + "LONGPAIR-alfworld-changed", + "LONGPAIR-alfworld-unchanged", + "LRCTRL-alfworld-autonomous", + "LRCTRL-alfworld-full-rewrite", +} + +lines = Path("/tmp/alfworld_all_32_commands.txt").read_text().splitlines() +out = [] +current = None +for line in lines: + if line.startswith("# "): + current = line[2:].strip() + continue + if current in remaining and "scripts/train.py" in line: + out.append(f"# {current}") + out.append(line) + current = None + +missing = remaining - {out[i][2:] for i in range(0, len(out), 2)} +if missing: + raise SystemExit(f"missing commands: {sorted(missing)}") +Path("/tmp/alfworld_remaining_16_commands.txt").write_text("\n".join(out) + "\n") +print("remaining_alfworld_commands", len(out) // 2) +PY +``` + +只允许启动 `/tmp/alfworld_remaining_16_commands.txt` 里的这些 header 后面的命令,逐个用 `setsid ... > logs/.log 2>&1 &` 手动启动: + +```text +LR-alfworld-8 +LR-alfworld-16 +SCHED-alfworld-constant +SCHED-alfworld-linear +SLOWN-alfworld-5 +SLOWN-alfworld-10 +SLOWN-alfworld-40 +MOD-alfworld-slow-only +MOD-alfworld-meta-only +MOD-alfworld-none +SMODEL-alfworld-5.4 +SMODEL-alfworld-5.4-mini +LONGPAIR-alfworld-changed +LONGPAIR-alfworld-unchanged +LRCTRL-alfworld-autonomous +LRCTRL-alfworld-full-rewrite +``` + +生成 DocVQA no-prior autonomous direct train 命令,同样不加 `--execute`,只启动 `LRCTRL-docvqa-autonomous`,不要启动 full-rewrite: + +```bash +PY=$(which python) +$PY scripts/run_ablation_matrix.py \ + --groups lrctrl \ + --bench docvqa \ + --run-root /home/azureuser/workspace-gzy/SkillReflection/outputs/ablation_lrctrl_autonomous_noprior_newmachine_20260505_run \ + > /tmp/docvqa_lrctrl_noprior_command.txt +``` + +建议新机器第一批不要超过 16 个 direct train;如果 15-30 分钟内无真实 429、无 OOM、日志持续增长,再补到 20-24。不要把 helper/launcher 留在后台自动补位;本轮按手动观察、手动起进程执行。 + +## 4. 启动命令 + +当前恢复第一轮节奏:总并发约 24 个 train。具体做法是拆成 3 个 matrix launcher,每个 `--max-parallel 8`。不要给每个 launcher 设 24,否则会超过总并发预期。 + +如果刚修完 runner 或刚接入新 benchmark,可以先用 `--max-parallel 1` 做短 smoke;确认没有重复 out_root、tail-sample hang、配置错误后,再切到下面的 24-way 运行方式。 + +### 4.1 当前 24-way 启动方式 + +```bash +export ALFWORLD_DATA=/home/azureuser/.cache/alfworld + +setsid /home/azureuser/workspace-gzy/miniconda3/envs/reflact/bin/python \ + scripts/run_ablation_matrix.py \ + --groups default split batch mbs lr sched slown mod smodel \ + --bench docvqa \ + --run-root /home/azureuser/workspace-gzy/SkillReflection/outputs/ablation_docvqa_20260503_160225_run \ + --max-parallel 8 \ + --execute \ + > /home/azureuser/workspace-gzy/SkillReflection/outputs/ablation_docvqa_20260503_160225_run/launcher_parallel8.log 2>&1 & + +setsid /home/azureuser/workspace-gzy/miniconda3/envs/reflact/bin/python \ + scripts/run_ablation_matrix.py \ + --groups default split batch mbs lr sched slown mod smodel \ + --bench livemathematicianbench alfworld \ + --run-root /home/azureuser/workspace-gzy/SkillReflection/outputs/ablation_livemath_alfworld_clean_20260503_155155_run \ + --max-parallel 8 \ + --execute \ + > /home/azureuser/workspace-gzy/SkillReflection/outputs/ablation_livemath_alfworld_clean_20260503_155155_run/launcher_parallel8.log 2>&1 & + +setsid /home/azureuser/workspace-gzy/miniconda3/envs/reflact/bin/python \ + scripts/run_ablation_matrix.py \ + --groups batch \ + --bench searchqa spreadsheetbench \ + --run-root /home/azureuser/workspace-gzy/SkillReflection/outputs/ablation_batch_searchqa_spreadsheet_20260503_153902_run \ + --max-parallel 8 \ + --execute \ + > /home/azureuser/workspace-gzy/SkillReflection/outputs/ablation_batch_searchqa_spreadsheet_20260503_153902_run/launcher_parallel8.log 2>&1 & +``` + +### 4.2 Longitudinal comparison example policy 启动方式 + +ALFWorld 暂停不跑;其余四个 benchmark 用独立 run root,避免污染旧矩阵: + +```bash +setsid /home/azureuser/workspace-gzy/miniconda3/envs/reflact/bin/python \ + scripts/run_ablation_matrix.py \ + --groups longpair \ + --bench searchqa spreadsheetbench livemathematicianbench docvqa \ + --run-root /home/azureuser/workspace-gzy/SkillReflection/outputs/ablation_longpair_20260504_run \ + --max-parallel 8 \ + --execute \ + > /home/azureuser/workspace-gzy/SkillReflection/outputs/ablation_longpair_20260504_run/launcher_longpair_parallel8.log 2>&1 & +``` + +### 4.3 Learning-rate-control baseline 启动方式 + +ALFWorld 暂停不跑;其余四个 benchmark 用独立 run root: + +```bash +setsid /home/azureuser/workspace-gzy/miniconda3/envs/reflact/bin/python \ + scripts/run_ablation_matrix.py \ + --groups lrctrl \ + --bench searchqa spreadsheetbench livemathematicianbench docvqa \ + --run-root /home/azureuser/workspace-gzy/SkillReflection/outputs/ablation_lrctrl_20260504_run \ + --max-parallel 8 \ + --execute \ + > /home/azureuser/workspace-gzy/SkillReflection/outputs/ablation_lrctrl_20260504_run/launcher_lrctrl_parallel8.log 2>&1 & +``` + +注意:`run_ablation_matrix.py` 会在启动时跳过当时 active 的 run,但不会管理先前 orphan train 的生命周期。重启 launcher 前后必须检查 `env.out_root` 是否重复;如果有两个进程写同一目录,立刻杀掉重复进程、删除污染目录、让 launcher 干净重启该 run。 + +### 4.4 实时监控命令 + +进程数和重复输出目录检查: + +```bash +/home/azureuser/workspace-gzy/miniconda3/envs/reflact/bin/python - <<'PY' +import subprocess, re, collections +try: + raw = subprocess.check_output(["pgrep", "-af", "scripts/train.py"], text=True) +except subprocess.CalledProcessError: + raw = "" +roots = [] +for line in raw.splitlines(): + m = re.search(r"env\.out_root=([^\s]+)", line) + if m: + roots.append(m.group(1)) +ctr = collections.Counter(roots) +print("train_count", len(roots)) +print("duplicate_roots", [r for r, c in ctr.items() if c > 1]) +for root in sorted(roots): + print(root.rsplit("/", 1)[-1]) +PY +``` + +launcher 和错误日志检查: + +```bash +for f in \ + outputs/ablation_docvqa_20260503_160225_run/launcher_parallel8.log \ + outputs/ablation_livemath_alfworld_clean_20260503_155155_run/launcher_parallel8.log \ + outputs/ablation_batch_searchqa_spreadsheet_20260503_153902_run/launcher_parallel8.log +do + echo "### $f" + tail -80 "$f" 2>/dev/null || true +done + +rg -n "Traceback|ERROR|Error code|AuthenticationError|BadRequest|RateLimit|content_filter|Killed|rc=|\\[FAIL\\]|\\[RETRY\\]" \ + outputs/ablation_docvqa_20260503_160225_run/logs \ + outputs/ablation_livemath_alfworld_clean_20260503_155155_run/logs \ + outputs/ablation_batch_searchqa_spreadsheet_20260503_153902_run/logs \ + -g '*.log' | tail -120 || true +``` + +结果行数进度检查: + +```bash +/home/azureuser/workspace-gzy/miniconda3/envs/reflact/bin/python - <<'PY' +from pathlib import Path +roots = [ + Path("outputs/ablation_docvqa_20260503_160225_run"), + Path("outputs/ablation_livemath_alfworld_clean_20260503_155155_run"), + Path("outputs/ablation_batch_searchqa_spreadsheet_20260503_153902_run"), +] +for root in roots: + print("\\n##", root.name) + for run in sorted([p for p in root.iterdir() if p.is_dir() and p.name != "logs"]): + summary = run / "summary.json" + if summary.exists(): + print("DONE", run.name) + continue + files = sorted(run.rglob("results.jsonl"), key=lambda p: p.stat().st_mtime if p.exists() else 0) + if not files: + print("NO_RESULTS", run.name) + continue + f = files[-1] + print("RUN", run.name, sum(1 for _ in f.open()), f.relative_to(run)) +PY +``` + +DocVQA 是多模态 benchmark,`2-1-7` 默认 test 有 3744 个样本、selection 有 535 个样本;完整 28-run 矩阵成本显著高于 SearchQA/SpreadsheetBench。高并发时必须持续看 rate limit、content filter、重复 out_root 和尾样本 timeout。 + +## 5. 已完成的本地验证 + +当前本地验证状态: + +| 项 | 状态 | +| --- | --- | +| `py_compile` | 已通过:launcher、train、OpenAI wrapper、LiveMathBench dataloader/adapter、ALFWorld dataloader/adapter/rollout、DocVQA adapter/rollout | +| Tail-sample hang audit | 已通过:SearchQA、SpreadsheetBench、LiveMathBench、ALFWorld、DocVQA rollout active paths 均移除裸 `as_completed` | +| LiveMathBench split | 已加载:train=35 val=18 test=124 | +| ALFWorld split | 已加载:train=39 val=18 test=134 | +| ALFWorld gamefile | 39+18+134 个本地路径全部存在 | +| ALFWorld env smoke | 已通过:单环境 build/reset,reset 后 gamefile 与 fixed split item 一致 | +| ALFWorld reflect `None` fields | 已修复:trajectory 的 `action/env_feedback/reasoning/cmd/obs/content` 先归一为空字符串再截断;修复前 ALFWorld run 不可填表 | +| ALFWorld slow-update `None` fields | 已修复:`reflact/optimizer/slow_update.py` 读取 comparison trajectory 时同样归一 `None`;修复前失败 run 已删除并用同一 matrix launcher 重挂 | +| LiveMathBench token cap | 已修复:旧 768/512 cap 导致 GPT-5.x hidden reasoning 吃完预算,results 中大量空 response;已提到 16384 并通过 8 条 smoke(empty=0, ``=8/8, acc=0.625) | +| ALFWorld token cap | 已修复:旧 512 cap 下 conversation 存在空 `model_response/no_action`;已提到 2048,旧 512-token run 已归档并重启 | +| ALFWorld empty action fallback | 已修复:`rollout.py` 对 empty response / missing `` 统一 fallback 到 `look`;所有修复前启动的 ALFWorld active run 已归档,不能填表 | +| Launcher dry-run | 已验证:LiveMathBench/ALFWorld clean matrix 56 个;DocVQA matrix 28 个;SearchQA/SpreadsheetBench batch-size 8 个 | + +2026-05-04 token-cap restart: + +```bash +export ALFWORLD_DATA=/home/azureuser/.cache/alfworld +setsid /home/azureuser/workspace-gzy/miniconda3/envs/reflact/bin/python \ + scripts/run_ablation_matrix.py \ + --groups default split batch mbs lr sched slown mod smodel \ + --bench livemathematicianbench alfworld \ + --run-root /home/azureuser/workspace-gzy/SkillReflection/outputs/ablation_livemath_alfworld_clean_20260503_155155_run \ + --max-parallel 8 \ + --execute \ + > /home/azureuser/workspace-gzy/SkillReflection/outputs/ablation_livemath_alfworld_clean_20260503_155155_run/launcher_parallel8_tokenfix_20260504_0223.log 2>&1 & +``` + +Token-cap restart monitor checks: + +```text +LiveMath old evidence: DEFAULT baseline test 124 rows, 115 empty response, only 9 answer tags. +LiveMath 4096 smoke: 8 rows, 5 empty response, not enough. +LiveMath 16384 smoke: 8 rows, 0 empty response, 8 answer tags, acc=0.625. +New run first check: LiveMath DEFAULT/SPLIT selection baseline 38 rows total, empty=0, answer_tag=38/38. +New ALFWorld first check: SPLIT-alfworld-1shot first conversation 29 steps, empty_resp=0, no_action=0. +Archives: +- outputs/ablation_livemath_alfworld_clean_20260503_155155_run/archive_livemath_token768_20260504_022258/ +- outputs/ablation_livemath_alfworld_clean_20260503_155155_run/archive_alfworld_token512_20260504_021417/ +``` + +2026-05-04 parallelism top-up: + +```text +SearchQA/SpreadsheetBench batch-size matrix completed, freeing 8 slots. DocVQA launcher remains at 8 active train processes. A second LiveMath/ALFWorld launcher was started with --max-parallel 16 on the same run_root; run_ablation_matrix active_run_ids() skips existing active out_roots and only starts pending runs. This brought total active train processes back to 24 with no duplicate out_root. +``` + +```bash +export ALFWORLD_DATA=/home/azureuser/.cache/alfworld +setsid /home/azureuser/workspace-gzy/miniconda3/envs/reflact/bin/python \ + scripts/run_ablation_matrix.py \ + --groups default split batch mbs lr sched slown mod smodel \ + --bench livemathematicianbench alfworld \ + --run-root /home/azureuser/workspace-gzy/SkillReflection/outputs/ablation_livemath_alfworld_clean_20260503_155155_run \ + --max-parallel 16 \ + --execute \ + > /home/azureuser/workspace-gzy/SkillReflection/outputs/ablation_livemath_alfworld_clean_20260503_155155_run/launcher_parallel16_tokenfix_topup_20260504_0240.log 2>&1 & +``` + +Top-up verification: + +```text +train_count=24 +duplicate_roots=[] +newly started: BATCH-livemathematicianbench-{8,24,56,full}, BATCH-alfworld-{8,24,56,full} +``` + +2026-05-04 ALFWorld empty-action fallback restart: + +```text +发现 DEFAULT-alfworld-5.5 和 BATCH-alfworld-8 中各有 1 条 empty model_response/no_action。为了不混用旧代码,已停掉并归档所有修复前启动的 ALFWorld active run: +- DEFAULT-alfworld-5.5 +- SPLIT-alfworld-1shot +- SPLIT-alfworld-1-1-8 +- SPLIT-alfworld-4-1-5 +- BATCH-alfworld-8 +- BATCH-alfworld-24 +- BATCH-alfworld-56 +- BATCH-alfworld-full + +归档目录: +- outputs/ablation_livemath_alfworld_clean_20260503_155155_run/archive_alfworld_empty_action_20260504_025311/ +- outputs/ablation_livemath_alfworld_clean_20260503_155155_run/archive_alfworld_prefallback_20260504_025402/ + +launcher 已记录这些 run 为 rc=-6 retry;后续 retry 使用修复后的 `reflact/envs/alfworld/rollout.py`。当前 24 并发已满,ALFWorld default/split/batch retry 会在槽位释放后启动。当前 active ALFWorld 是修复后启动的 MBS-alfworld-{1,2,4}。 +``` + +2026-05-04 ALFWorld OOM/concurrency adjustment: + +```text +在 24 总并发下,ALFWorld 多个 run 同时进入 Ray/ALFWorld env 初始化和 slow-update rollout,触发 Ray OOM prevention: +memory 约 823GB / 866GB (0.950+) 超过 Ray 默认 0.95 阈值,导致 worker 被杀。 + +受影响并归档的 partial run: +- MBS-alfworld-{1,2,4,16,32} +- LR-alfworld-{2,4,8,16} +- SCHED-alfworld-{constant,linear} +- SLOWN-alfworld-{5,10,40} + +归档目录: +- outputs/ablation_livemath_alfworld_clean_20260503_155155_run/archive_alfworld_oom_partial_20260504_050517/ + +已停掉 LiveMath/ALFWorld 两个 launcher: +- launcher_parallel8_tokenfix_20260504_0223.log 对应进程 +- launcher_parallel16_tokenfix_topup_20260504_0240.log 对应进程 + +当前只保留已跑较久的 `LR-alfworld-1` 继续,DocVQA launcher 不动,LiveMath 已启动进程继续。后续 ALFWorld 必须低并发补跑,建议 `--bench alfworld --max-parallel 1` 或最多 2;不要再把 ALFWorld 混入 24 总并发。 +``` + +2026-05-04 ALFWorld serial retry launcher: + +```bash +export ALFWORLD_DATA=/home/azureuser/.cache/alfworld +setsid /home/azureuser/workspace-gzy/miniconda3/envs/reflact/bin/python \ + scripts/run_ablation_matrix.py \ + --groups default split batch mbs lr sched slown mod smodel \ + --bench alfworld \ + --run-root /home/azureuser/workspace-gzy/SkillReflection/outputs/ablation_livemath_alfworld_clean_20260503_155155_run \ + --max-parallel 1 \ + --execute \ + > /home/azureuser/workspace-gzy/SkillReflection/outputs/ablation_livemath_alfworld_clean_20260503_155155_run/launcher_alfworld_serial_20260504_1258.log 2>&1 & +``` + +```text +launcher_pid=9322 +first started: DEFAULT-alfworld-5.5 +valid existing ALFWorld result before serial retry: LR-alfworld-1 +reason: previous high-concurrency ALFWorld runs hit Ray OOM; serial retry avoids mixing ALFWorld into 24-way total concurrency. +``` + +2026-05-04 ALFWorld serial retry paused: + +```text +DEFAULT-alfworld-5.5 启动 Ray 后,系统内存 available 从约 19GiB 降到约 6GiB,Ray 同时报 `/tmp/ray` spill 风险。为避免影响正在跑的 DocVQA/LiveMath,已停掉串行 ALFWorld launcher 和 DEFAULT-alfworld-5.5 train 进程。 + +归档目录: +- outputs/ablation_livemath_alfworld_clean_20260503_155155_run/archive_alfworld_serial_lowmem_20260504_1300/ + +后续重启条件:DocVQA/LiveMath 当前队列进一步释放资源后,再用同一串行命令重启 ALFWorld;不要在 available memory 只有十几 GiB 时启动 ALFWorld。 +``` + +2026-05-04 current monitor and GPU-memory clarification: + +```text +当前 SkillReflection active train processes=12,无重复 env.out_root: +- LR-docvqa-16 +- MOD-docvqa-meta-only +- MOD-docvqa-none +- MOD-docvqa-slow-only +- SCHED-docvqa-constant +- SCHED-docvqa-linear +- SLOWN-docvqa-10 +- SLOWN-docvqa-40 +- SLOWN-docvqa-5 +- SMODEL-docvqa-5.4 +- SMODEL-docvqa-5.4-mini +- SMODEL-livemathematicianbench-5.4 + +最新完成并已填表: +- LR-docvqa-8: best_sel=0.9047, base=0.8675, best_test=0.9017, delta=+0.0342, tokens=279012493 + +SkillReflection ALFWorld 当前没有在跑。当前有效 ALFWorld summary 只有: +- LR-alfworld-1 + +nvidia-smi 显示 GPU 显存被 ray::ServeReplica:*GroundingDINOModel / DA3Model 占用,但这些进程 cwd 是: +/home/azureuser/workspace-gzy/zyf/gca-skill + +这不是本仓库 /home/azureuser/workspace-gzy/SkillReflection 的 ALFWorld ablation。 + +另有 unrelated ALFWorld jobs 在: +/home/azureuser/zisu/skill_distill + +这些不是本轮 SkillReflection ablation 输出,不能混进本表。 +``` + +## 6. 不纳入本轮的内容 + +本轮不启动以下实验: + +| 项 | 原因 | +| --- | --- | +| 已完成 default 点的重复 run | 复用 default 结果,避免同一设置重复计费 | +| `train.batch_size=40` 重复 run | 复用 default 结果,避免同一设置重复计费 | +| `model.student=gpt-5.5` student-model 重复 run | 复用 default 结果 | +| `gradient.minibatch_size=8` 重复 run | 复用 default 结果 | +| `optimizer.lr_scheduler=cosine` 重复 run | 复用 default 结果 | +| `optimizer.slow_update_samples=20` 重复 run | 复用 default 结果 | +| `use_slow_update=true,use_meta_skill=true` 重复 run | 复用 default 结果 |