ReflACT: Reflective Agent Tuning
ReflACT is a framework for optimizing an external skill document through iterative rollout, reflection, editing, and gated validation.
It does not fine-tune model weights. Instead, it treats the skill document as the optimization target:
- the student model executes tasks with the current skill
- the teacher model analyzes trajectories and proposes edits
- the framework merges, ranks, applies, and validates those edits
- only validated skill updates are kept
This branch implements a full training loop with step-level skill optimization and optional epoch-level memory mechanisms (slow_update, meta_skill, meta_reflect).
Method Overview
Optimization Target
Each run maintains a mutable markdown skill document. The framework repeatedly improves that document instead of changing model parameters.
This gives a training-style loop for prompt / policy optimization:
- Roll out the current skill on a batch of tasks.
- Reflect on failures and successes.
- Merge patch proposals into a coherent candidate update.
- Rank and select a bounded number of edits.
- Apply those edits to produce a candidate skill.
- Validate the candidate skill on a held-out selection split.
- Keep the update only if the gate accepts it.
Per-Step Pipeline
Every training step executes the following pipeline in reflact/engine/trainer.py:
-
Rollout The student model runs a batch of tasks using the current skill.
-
Reflect The teacher analyzes minibatches of trajectories and emits raw patches. Failure-driven and success-driven patches are tracked separately.
-
Aggregate Raw patches are merged hierarchically. Metadata such as
support_countandsource_typeis carried into the merged patch so later ranking can use it. -
Select The teacher ranks the merged edit pool and keeps up to
edit_budgetedits. -
Update The selected edits are applied to the skill document. The framework records an
edit_apply_report.jsonso you can see which edits actually landed, which were skipped, and why. -
Evaluate / Gate The candidate skill is evaluated on the selection split. Gate validation is mandatory in this branch. A candidate update is accepted only if it improves over the current selection score; a new global best is tracked separately.
Within-Epoch Memory
Inside an epoch, the trainer maintains a step buffer containing:
- compact failure-pattern summaries from previous steps
- rejected edits and their score deltas
That context is fed back into later reflection calls so the teacher can avoid repeating ineffective edits and can focus on unsolved error patterns.
Epoch-Level Mechanisms
This branch supports three optional epoch-level mechanisms.
Slow Update
At the end of each epoch, slow_update compares the previous epoch’s terminal skill and current epoch’s terminal skill on a sampled train subset. It then writes longitudinal guidance into a protected slow-update region inside the skill document.
Importantly, this guidance is not blindly written through. It is converted into a candidate skill and sent through the same selection gate as step-level updates.
Meta Skill
meta_skill is teacher-side cross-epoch memory. It does not directly edit the current skill. Instead, it writes a compact memory artifact describing longer-term patterns across adjacent epochs. That memory is loaded into later reflection / merge / ranking calls as extra context.
Meta Reflect
meta_reflect runs at epoch end over the step history of the current epoch. It looks at accepted and rejected directions from the whole epoch, proposes higher-level patch edits, applies them to a meta candidate, and then sends that candidate through the same selection gate.
What This Branch Guarantees
The current implementation assumes the following as the mainline method contract:
- gate validation is always on
- the current skill, current score, best skill, and best score stay aligned
slow_updateis gated before being committed- patch provenance (
source_type,support_count) reaches selection - patch application is observable through per-edit reports
- resume state is restored from
runtime_state.jsonrather than inferred only from history - all benchmark model calls go through the unified backend router
Model Backends
All model access now goes through the split teacher/student model layer in reflact.model.
Supported teacher backends:
openai_chatclaude_chat
Supported student backends:
openai_chatclaude_chatcodex_execclaude_code_exec
Recommended config shape:
model:
teacher_backend: openai_chat
student_backend: codex_exec
teacher: gpt-5.4
student: gpt-5.4-codex
reasoning_effort: medium
Legacy model.backend and CLI flags like --backend codex still work. They are mapped onto the split backend model for backward compatibility.
The same routing is used by:
- training (
scripts/train.py) - eval-only runs (
scripts/eval_only.py) - SpreadsheetBench standalone prompt eval scripts
- LiveMathematicianBench baseline eval script
- benchmark rollout code inside the main framework
Azure OpenAI
If you use openai_chat, configure either environment variables or config values:
export AZURE_OPENAI_ENDPOINT="https://your-endpoint.openai.azure.com/"
export AZURE_OPENAI_API_KEY="your-api-key"
export AZURE_OPENAI_API_VERSION="2025-04-01-preview"
The config supports both the old keys and the new explicit names:
model:
azure_openai_endpoint: "..."
azure_openai_api_version: "..."
azure_openai_api_key: ""
azure_openai_auth_mode: api_key
azure_openai_ad_scope: "https://cognitiveservices.azure.com/.default"
azure_openai_managed_identity_client_id: ""
azure_openai_auth_mode can be used for API-key auth or Azure AD / managed identity flows.
Exec Harness
codex_exec and claude_code_exec run the student inside a workspace harness instead of a plain chat call. The harness writes task files, renders a dynamic SKILL.md, runs the student CLI, and saves raw execution artifacts such as:
codex_raw.txtcodex_trace_summary.txt- workspace-local task / skill files
This branch keeps meta_skill and apply_patch_with_report, while upgrading the student path to the more realistic workspace-exec setup.
Trace-Aware Deep Reflect
When student_backend=codex_exec and gradient.use_deep_reflect=true, deep reflection can probe a specific earlier Codex attempt:
- the teacher sees a compact Codex trace summary
- deep probe can target
probe_target_id - the follow-up rollout can resume from
probe_after_step
This is wired for the dataset-backed environments in this branch.
Rewrite Mode
Skill updates support two modes:
optimizer.skill_update_mode=patchoptimizer.skill_update_mode=rewrite_from_suggestions
patch keeps the existing fine-grained edit application path and still records edit_apply_report.json.
rewrite_from_suggestions asks the teacher to emit higher-level rewrite suggestions, then rewrites the whole skill in one pass. This is useful when patch edits become too fragmented.
Repository Layout
reflact/
engine/
trainer.py main training loop
gradient/
reflect.py minibatch reflection
aggregate.py hierarchical patch merge
deep_probe.py diagnostic probing for deep reflect
optimizer/
clip.py edit ranking / selection
skill.py patch application + apply report
slow_update.py epoch-level longitudinal guidance
meta_skill.py teacher-side cross-epoch memory
meta_reflect.py epoch-level macro editing
evaluation/
gate.py pure gate decision logic
model/
backend_config.py teacher/student backend routing
azure_openai.py Azure backend
codex_harness.py workspace exec harness + Codex trace parsing
claude_backend.py Claude backend
envs/
... environment adapters and rollout logic
scripts/
train.py unified training entry
eval_only.py evaluate one skill without training
configs/
_base_/default.yaml shared defaults
<env>/default.yaml environment-specific configs
Configuration
Configs use structured YAML with _base_ inheritance.
The base config is configs/_base_/default.yaml. Key defaults in this branch are:
model.teacher_backend = openai_chatmodel.student_backend = openai_chatmodel.reasoning_effort = mediumoptimizer.use_slow_update = trueoptimizer.use_meta_skill = trueoptimizer.use_meta_reflect = falsegradient.use_deep_reflect = falseoptimizer.skill_update_mode = patch
Default setting snapshot:
model:
backend: azure_openai
teacher: gpt-5.4
student: gpt-5.4
teacher_backend: openai_chat
student_backend: openai_chat
reasoning_effort: medium
rewrite_reasoning_effort: ""
rewrite_max_completion_tokens: 64000
codex_exec_path: codex
codex_exec_sandbox: workspace-write
codex_exec_profile: ""
codex_exec_full_auto: false
codex_exec_reasoning_effort: none
claude_code_exec_path: claude
claude_code_exec_profile: ""
codex_trace_to_teacher: true
train:
num_epochs: 4
train_size: 0
batch_size: 80
accumulation: 1
seed: 42
gradient:
minibatch_size: 16
merge_batch_size: 16
analyst_workers: 16
max_analyst_rounds: 3
failure_only: false
use_deep_reflect: false
deep_reflect_failures: 4
deep_reflect_successes: 2
optimizer:
learning_rate: 8
min_learning_rate: 2
lr_scheduler: cosine
skill_update_mode: patch
use_meta_reflect: false
meta_learning_rate: 8
use_slow_update: true
slow_update_samples: 20
use_meta_skill: true
evaluation:
use_gate: true
sel_env_num: 0
test_env_num: 0
eval_test: true
env:
split_mode: ratio
split_ratio: "2:1:7"
split_seed: 42
For the full source of truth, see configs/base/default.yaml.
Selected fields:
| Section | Key | Meaning |
|---|---|---|
model |
teacher_backend |
teacher backend: openai_chat or claude_chat |
model |
student_backend |
student backend: chat backend or exec backend |
model |
teacher |
teacher model / deployment |
model |
student |
student model / deployment |
model |
reasoning_effort |
reasoning budget passed to the backend when supported |
model |
codex_trace_to_teacher |
include Codex trace summaries in teacher reflection context |
train |
num_epochs |
number of epochs |
train |
train_size |
expected train split size, or 0 to infer |
train |
batch_size |
tasks per rollout batch |
train |
accumulation |
number of rollout/reflect minibatches per step |
gradient |
minibatch_size |
trajectories per analyst minibatch |
gradient |
merge_batch_size |
patches per aggregate batch |
gradient |
use_deep_reflect |
enable diagnostic probe rollouts |
gradient |
max_analyst_rounds |
teacher reflection retries / refinement budget |
optimizer |
learning_rate |
max edits kept after selection |
optimizer |
lr_scheduler |
edit-budget scheduler |
optimizer |
use_slow_update |
epoch-level longitudinal guidance |
optimizer |
use_meta_skill |
teacher-side epoch memory |
optimizer |
use_meta_reflect |
epoch-level macro editing |
optimizer |
skill_update_mode |
patch or rewrite_from_suggestions |
evaluation |
sel_env_num |
selection set size (0 means full split) |
evaluation |
test_env_num |
test set size (0 means full split) |
Important Branch Rule
use_gate=false is intentionally not supported in this branch. Gate validation is part of the method contract here.
If an old config still contains evaluation.use_gate: false, the loader / trainer will raise instead of silently continuing.
Supported Environments
The main training entry and eval-only entry now register 11 environments:
| Env | Default rollout shape | Current default split / data setting | Branch alignment |
|---|---|---|---|
alfworld |
environment-backed episodic rollout | native ALFWorld train/eval splits | in reflact_new_zzw |
babyvision |
single-round multimodal QA | split_mode=ratio from raw metadata/images, or prepared split_dir |
in reflact_new_zzw |
docvqa |
single-round multimodal QA | split_dir: data/docvqa_split |
in reflact_new_zzw |
livemathematicianbench |
single-round QA | split_mode=ratio or prepared split_dir |
in reflact_new_zzw |
mathverse |
single-round multimodal math QA | data_root: data/MathVerse, split files loaded from split_dir when provided |
in reflact_new_zzw |
mmrb |
single-round multimodal reasoning QA | split_mode=ratio or prepared split_dir |
in reflact_new_zzw |
officeqa |
multi-turn tool loop | split_dir: data/officeqa_split plus data_dirs: [data/officeqa_docs_official] |
in reflact_new_zzw |
sealqa |
multi-turn tool loop | split_dir: data/sealqa_split |
in reflact_new_zzw |
searchqa |
single-round QA (max_turns=1) |
split_dir: data/searchqa_split |
in reflact_new_zzw |
spreadsheetbench |
codegen loop, default mode=multi, max_turns=30 |
split_dir: data/spreadsheetbench_split, data_root: data/spreadsheetbench_verified_400 |
in reflact_new_zzw, default adjusted here to multi-round |
swebench |
mini-swe-agent multi-step bug-fixing rollout | split_mode=ratio, dataset_name=lite, repo-stratified 2:1:7 split materialized under out_root/_generated_splits/... unless split_dir is provided |
added here, aligned to swe-bench-old |
Data Expectations
The standard two-mode dataset entry path is:
split_mode: ratio- load raw data from
env.data_path - build a deterministic
train/,val/,test/split underenv.split_output_dir(or underout_root/_generated_splits/if unset) - default ratio is explicitly
2:1:7
- load raw data from
split_mode: split_dir- load an existing
env.split_dirwithtrain/,val/,test/subdirectories
- load an existing
This currently applies to:
searchqaspreadsheetbenchbabyvisionlivemathematicianbenchmmrbswebench
ALFWorld is the exception: it is environment-backed rather than JSON split-backed.
The following environments currently expect prepared split directories or extra rooted assets rather than the generic ratio-split path:
docvqamathverseofficeqasealqa
At a high level:
SearchQA: raw QA json / jsonl or pre-split QA json filesSpreadsheetBench: raw task manifest json plus spreadsheet task directory, or a pre-split task manifestALFWorld: installed game environment and configured eval/train splitsBabyVision: rawmeta_data.jsonlplus images, or a pre-split directoryDocVQA: pre-split CSV / JSON data undersplit_dirLiveMathematicianBench: raw monthly QA json files, or a pre-split directoryMathVerse: split files plusdata_rootimage assetsMMRB: raw extracted dataset json files, or a pre-split directoryOfficeQA: pre-split metadata plus resolved office document directoriesSealQA: pre-split metadata for tool-augmented QA tasksSWEBench: HuggingFace SWE-bench dataset alias (lite/verified/full) or a prepared split directory
Split References Across Branches
The split-related defaults are not identical across skillopt-final, reflact_new_zzw, gepa, and swe-bench-old. The practical reference points are:
| Source branch | Explicit split settings / dirs |
|---|---|
skillopt-final |
searchqa -> data/searchqa_split; spreadsheetbench -> data/spreadsheetbench_split; docvqa -> data/docvqa_split; officeqa -> data/officeqa_split; sealqa -> data/sealqa_split; swebench -> ratio split 2:1:7 over the default lite dataset, materialized under out_root/_generated_splits/... |
reflact_new_zzw |
Same 10-benchmark env set as above except no swebench; explicit split dirs are data/searchqa_split, data/spreadsheetbench_split, data/docvqa_split, data/officeqa_split, data/sealqa_split; spreadsheetbench there defaults to mode=single; officeqa uses max_tool_turns=24; sealqa uses max_tool_turns=12 |
gepa |
configs/spreadsheetbench.yaml uses data.splits_dir = data/spreadsheetbench/splits, eval.mode = react, eval.max_turns = 20; configs/swebench.yaml uses dataset = SWE-bench/SWE-bench_Verified with train_size = 100, val_size = 50, test_size = 350 |
swe-bench-old |
Repo-stratified 2:1:7 split over SWE-Bench_Lite, persisted as outputs/.../split/train.json, selection.json, test.json; the example split in that branch is train=60, selection=33, test=207 |
For the 10 benches shared with reflact_new_zzw, the current branch is now aligned on env coverage. The main intentional delta is spreadsheetbench: this branch defaults to multi-round codegen, while reflact_new_zzw kept mode=single by default.
Running Training
Example:
python scripts/train.py --config configs/searchqa/default.yaml
Explicit 2:1:7 split from raw data:
python scripts/train.py \
--config configs/searchqa/default.yaml \
--split_mode ratio \
--data_path /path/to/searchqa_train_2000.json
Directly consume a prepared split directory:
python scripts/train.py \
--config configs/searchqa/default.yaml \
--split_mode split_dir \
--split_dir /path/to/searchqa_split
You can override structured config keys from the CLI:
python scripts/train.py \
--config configs/spreadsheetbench/default.yaml \
--cfg-options model.teacher_backend=openai_chat model.student_backend=codex_exec train.batch_size=40 optimizer.learning_rate=4
Legacy flat overrides still work for common keys:
python scripts/train.py \
--config configs/searchqa/default.yaml \
--backend azure_openai \
--teacher_model gpt-5.4 \
--student_model gpt-5.4 \
--reasoning_effort medium
Exec harness example:
python scripts/train.py \
--config configs/searchqa/default.yaml \
--teacher_backend openai_chat \
--student_backend codex_exec \
--teacher_model gpt-5.4 \
--student_model gpt-5.4-codex \
--use_deep_reflect true \
--skill_update_mode rewrite_from_suggestions
SWEBench example:
python scripts/train.py \
--config configs/swebench/default.yaml \
--cfg-options env.dataset_name=lite env.split_ratio=2:1:7
Eval-Only and Standalone Evaluation
Evaluate a specific skill without training:
python scripts/eval_only.py \
--config configs/searchqa/default.yaml \
--skill reflact/envs/searchqa/skills/initial.md
The same dataset entry modes apply in eval-only runs:
--split_mode ratio --data_path ...--split_mode split_dir --split_dir ...
Standalone scripts also exist for benchmark-specific comparisons, including:
scripts/eval_prompt_custom.pyscripts/eval_prompt_official.pyscripts/eval_livemathematicianbench_baseline.py
These scripts now also support backend selection through the unified model layer.
Output Structure
Each run writes a structured output directory under out_root.
Important top-level artifacts:
config.json— flattened runtime confighistory.json— per-step history recordsruntime_state.json— resume state for current/best skill trackingbest_skill.md— current best validated skillskills/skill_vXXXX.md— persisted skill snapshot per step
Per-step artifacts live under steps/step_XXXX/, including:
merged_patch.jsonranked_edits.jsoncandidate_skill.mdedit_apply_report.jsonrewrite_result.jsonwhen rewrite mode is enabledselection_eval/trajectory_digest.json- rollout and patch subdirectories
Epoch-level artifacts live under:
slow_update/epoch_XX/meta_skill/epoch_XX/meta_reflect/epoch_XX/
Resume Behavior
The trainer resumes from runtime_state.json when present. That state tracks:
- last completed step
- current skill path
- current score
- best skill path
- best score
- origin tags for current and best skill
This is important because skill state can change at both step level and epoch level; resuming only from history.json is not sufficient for this branch’s method logic.
Notes
- This repository focuses on skill optimization logic; datasets are not included.
- Patch application is intentionally observable. Inspect
edit_apply_report.jsonwhen candidate skills do not behave as expected. SpreadsheetBenchnow defaults tomode=multi. If you run an exec student backend there, override back toenv.mode=singlebecause exec backends are still only wired for SpreadsheetBench single-mode rollout.SWEBenchfollows the older mini-swe-agent +swebench.harness.run_evaluationpath, so it requires the SWE-bench / Docker toolchain rather than the generic chat-only stack.slow_updatewrites into a protected skill region and normal edits are prevented from overwriting that region directly.meta_skillis context memory, not a direct skill edit.meta_reflectis a gated skill edit stage, not just logging.
Minimal Setup
conda create -n reflact python=3.11
conda activate reflact
pip install openai pyyaml openpyxl
Depending on the environment, you may also need:
pip install datasets gymnasium numpy ray regex
For SWEBench, you also need a working Docker environment plus the SWE-bench / mini-swe-agent dependencies used in swe-bench-old.