cleanup: remove unused benchmarks, deep_probe, meta_reflect

Remove sealqa, babyvision, mathverse, mmrb, swebench envs and configs.
Remove deep_probe, deep_reflect, meta_reflect modules and prompts.
Remove download_babyvision script.
These are not part of the core released benchmarks.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Cuzyoung
2026-05-24 19:36:27 +00:00
parent 2df2542aec
commit f55a26414e
71 changed files with 0 additions and 11199 deletions

@@ -1,35 +0,0 @@
You are an expert diagnostic-probe designer for ALFWorld embodied tasks.
You will design one short diagnostic instruction to append to the target's prompt
for a handful of representative ALFWorld trajectories.
The goal is to expose whether the target has the right intermediate subgoal,
object/receptacle state, and next-step intention without substantially changing
the current scaffold.
## Hard Constraints
1. Do NOT substantially change the target's existing action-selection scaffold.
2. Do NOT prescribe a brand-new planner or long multi-step policy.
3. Do NOT ask for exhaustive search over all objects or all admissible actions.
4. Keep the diagnostic readout brief and place it inside the existing <think>...</think> block.
5. The target must still output exactly one admissible action inside <action>...</action>.
6. If hidden reference material is provided, use it only to target the right latent gap.
7. Never copy hidden reference content into the target-facing probe.
## Good Probe Targets
- current subgoal
- target object / target receptacle / target state
- decisive missing precondition
- why one candidate action is better than a tempting alternative
- whether the current step should explore, transform an object, or place it
## Bad Probe Targets
- a full optimal plan from start to finish
- exhaustive object inventories
- a new theorem-like or planner-like protocol
Respond ONLY with a valid JSON object:
{
"reasoning": "<why this probe reveals the latent skill gap>",
"probe_instruction": "<the exact instruction text to append to the target prompt>"
}
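
A reply in the JSON shape above could be checked before the probe text is appended to a target prompt. The sketch below is illustrative only: `parse_probe_reply` is a hypothetical helper that does not appear in the removed modules, and it assumes the model returns the bare JSON object with no surrounding text.

```python
from __future__ import annotations

import json


def parse_probe_reply(raw: str) -> dict | None:
    """Return the probe dict if both required string fields are present, else None.

    Hypothetical helper; the key names are taken from the prompt above.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    cleaned = {}
    for key in ("reasoning", "probe_instruction"):
        value = data.get(key)
        # Reject missing, non-string, or blank fields.
        if not isinstance(value, str) or not value.strip():
            return None
        cleaned[key] = value.strip()
    return cleaned
```

Returning `None` on any malformed reply lets the caller skip the probe step, which matches the `if not probe: return []` guard used by the adapters in this commit.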

@@ -1 +0,0 @@
"""BabyVision environment package for ReflACT."""

@@ -1,267 +0,0 @@
"""BabyVision environment adapter for ReflACT."""
from __future__ import annotations
import json
import os
from skillopt.gradient.deep_probe import generate_deep_probe_instruction
from skillopt.datasets.base import BatchSpec
from skillopt.gradient.reflect import run_minibatch_reflect
from skillopt.envs.base import EnvAdapter
from skillopt.envs.babyvision.dataloader import BabyVisionDataLoader
from skillopt.envs.babyvision.rollout import run_batch
from skillopt.model import get_target_backend
class BabyVisionAdapter(EnvAdapter):
"""BabyVision adapter."""
def build_reference_text(self, item: dict) -> str:
cot = str(item.get("cot") or "").strip()
if not cot:
return ""
return f"## Reference CoT\n{cot}"
def get_reference_metadata(self, item: dict) -> dict:
cot = str(item.get("cot") or "").strip()
if not cot:
return {"fields": [], "preview": ""}
return {
"fields": ["cot"],
"preview": cot[:400],
}
def __init__(
self,
split_dir: str = "",
data_path: str = "",
split_mode: str = "ratio",
split_ratio: str = "2:1:7",
split_seed: int = 42,
split_output_dir: str = "",
max_turns: int = 1,
workers: int = 32,
analyst_workers: int = 16,
failure_only: bool = False,
minibatch_size: int = 8,
edit_budget: int = 4,
seed: int = 42,
limit: int = 0,
image_detail: str = "auto",
judge_model: str = "gpt-5.4",
judge_max_completion_tokens: int = 256,
judge_retries: int = 5,
use_deep_reflect: bool = False,
deep_reflect_failures: int = 4,
deep_reflect_successes: int = 2,
) -> None:
self.max_turns = max_turns
self.workers = workers
self.analyst_workers = analyst_workers
self.failure_only = failure_only
self.minibatch_size = minibatch_size
self.edit_budget = edit_budget
self.image_detail = image_detail
self.judge_model = judge_model
self.judge_max_completion_tokens = judge_max_completion_tokens
self.judge_retries = judge_retries
self.use_deep_reflect = use_deep_reflect
self.deep_reflect_failures = deep_reflect_failures
self.deep_reflect_successes = deep_reflect_successes
self.dataloader = BabyVisionDataLoader(
split_dir=split_dir,
data_path=data_path,
split_mode=split_mode,
split_ratio=split_ratio,
split_seed=split_seed,
split_output_dir=split_output_dir,
seed=seed,
limit=limit,
)
def setup(self, cfg: dict) -> None:
super().setup(cfg)
self.dataloader.setup(cfg)
def get_dataloader(self):
return self.dataloader
def build_env_from_batch(self, batch: BatchSpec, **kwargs):
return list(batch.payload or [])
def build_train_env(self, batch_size: int, seed: int, **kwargs):
batch = self.dataloader.build_train_batch(batch_size=batch_size, seed=seed, **kwargs)
return self.build_env_from_batch(batch, **kwargs)
def build_eval_env(self, env_num: int, split: str, seed: int, **kwargs):
batch = self.dataloader.build_eval_batch(env_num=env_num, split=split, seed=seed, **kwargs)
return self.build_env_from_batch(batch, **kwargs)
def rollout(
self,
env_manager,
skill_content: str,
out_dir: str,
**kwargs,
) -> list[dict]:
items: list[dict] = env_manager
return run_batch(
items=items,
out_root=out_dir,
skill_content=skill_content,
max_turns=self.max_turns,
workers=self.workers,
image_detail=self.image_detail,
judge_model=self.judge_model,
judge_max_completion_tokens=self.judge_max_completion_tokens,
judge_retries=self.judge_retries,
diagnostic_mode=kwargs.get("diagnostic_mode", False),
diagnostic_instruction=kwargs.get("diagnostic_instruction", ""),
diagnostic_trace_context_by_id=kwargs.get("diagnostic_trace_context_by_id"),
)
def reflect(
self,
results: list[dict],
skill_content: str,
out_dir: str,
**kwargs,
) -> list[dict | None]:
prediction_dir = kwargs.get("prediction_dir", os.path.join(out_dir, "predictions"))
patches_dir = kwargs.get("patches_dir", os.path.join(out_dir, "patches"))
random_seed = kwargs.get("random_seed")
step_buffer_context = kwargs.get("step_buffer_context", "")
meta_skill_context = kwargs.get("meta_skill_context", "")
return run_minibatch_reflect(
results=results,
skill_content=skill_content,
prediction_dir=prediction_dir,
patches_dir=patches_dir,
workers=self.analyst_workers,
failure_only=self.failure_only,
minibatch_size=self.minibatch_size,
edit_budget=self.edit_budget,
random_seed=random_seed,
error_system=self.get_error_minibatch_prompt(),
success_system=self.get_success_minibatch_prompt(),
step_buffer_context=step_buffer_context,
meta_skill_context=meta_skill_context,
update_mode=getattr(self, "_cfg", {}).get("skill_update_mode", "patch"),
)
def deep_reflect(
self,
results: list[dict],
skill_content: str,
out_dir: str,
**kwargs,
) -> list[dict | None]:
if not self.use_deep_reflect:
return []
env_manager = kwargs.get("env_manager")
prediction_dir = kwargs.get("prediction_dir", os.path.join(out_dir, "predictions"))
random_seed = kwargs.get("random_seed")
step_buffer_context = kwargs.get("step_buffer_context", "")
meta_skill_context = kwargs.get("meta_skill_context", "")
codex_backend = get_target_backend() == "codex_exec"
selected_items = self.select_representative_items(
results,
env_manager if isinstance(env_manager, list) else None,
n_failures=self.deep_reflect_failures,
n_successes=self.deep_reflect_successes,
seed=random_seed,
)
if not selected_items:
return []
selected_ids = {str(item["id"]) for item in selected_items}
selected_results = [row for row in results if str(row.get("id")) in selected_ids]
selected_examples = self.attach_reference_context(selected_results, selected_items)
if codex_backend:
selected_examples = self.attach_codex_probe_context(selected_examples, prediction_dir)
selected_metadata = []
cot_count = 0
for item in selected_items:
meta = self.get_reference_metadata(item)
if meta["fields"]:
cot_count += 1
selected_metadata.append({
"id": str(item["id"]),
"task_type": str(item.get("subtype") or item.get("task_type") or "babyvision"),
"reference_fields": meta["fields"],
"reference_preview": meta["preview"],
})
deep_dir = os.path.join(out_dir, "deep_reflect")
rollout_dir = os.path.join(deep_dir, "rollout")
patches_dir = os.path.join(deep_dir, "patches")
os.makedirs(deep_dir, exist_ok=True)
print(
f" [2b/6 DEEP REFLECT setup] selected={len(selected_items)} "
f"reference_fields=cot({cot_count}/{len(selected_items)})"
)
probe = generate_deep_probe_instruction(
skill_content=skill_content,
items=selected_examples,
prediction_dir=prediction_dir,
system_prompt=self.get_codex_deep_probe_prompt() if codex_backend else self.get_deep_probe_prompt(),
step_buffer_context=step_buffer_context,
meta_skill_context=meta_skill_context,
)
if not probe:
return []
diagnostic_trace_context_by_id = None
if codex_backend:
selected_items, diagnostic_trace_context_by_id, probe = self.resolve_codex_probe_target(
selected_items=selected_items,
selected_examples=selected_examples,
prediction_dir=prediction_dir,
probe=probe,
)
probe_record = {
**probe,
"reference_summary": {
"selected_count": len(selected_items),
"field_counts": {
"cot": cot_count,
},
},
"selected_examples": selected_metadata,
}
with open(os.path.join(deep_dir, "probe.json"), "w", encoding="utf-8") as f:
json.dump(probe_record, f, ensure_ascii=False, indent=2)
deep_results = run_batch(
items=selected_items,
out_root=rollout_dir,
skill_content=skill_content,
max_turns=self.max_turns,
workers=min(self.workers, max(len(selected_items), 1)),
image_detail=self.image_detail,
judge_model=self.judge_model,
judge_max_completion_tokens=self.judge_max_completion_tokens,
judge_retries=self.judge_retries,
diagnostic_mode=True,
diagnostic_instruction=probe["probe_instruction"],
diagnostic_trace_context_by_id=diagnostic_trace_context_by_id,
)
deep_results = self.attach_reference_context(deep_results, selected_items)
return run_minibatch_reflect(
results=deep_results,
skill_content=skill_content,
prediction_dir=os.path.join(rollout_dir, "predictions"),
patches_dir=patches_dir,
workers=self.analyst_workers,
failure_only=self.failure_only,
minibatch_size=self.minibatch_size,
edit_budget=self.edit_budget,
random_seed=random_seed,
error_system=self.get_error_minibatch_prompt(),
success_system=self.get_success_minibatch_prompt(),
step_buffer_context=step_buffer_context,
meta_skill_context=meta_skill_context,
update_mode=getattr(self, "_cfg", {}).get("skill_update_mode", "patch"),
)
def get_task_types(self) -> list[str]:
return self.dataloader.get_task_types()

@@ -1,214 +0,0 @@
"""BabyVision task dataloader."""
from __future__ import annotations
import json
import os
from typing import Any
from skillopt.datasets.base import SplitDataLoader
# ── Raw data loading utilities (for preprocessing / standalone eval) ─────
_CHOICE_LABELS = ["A", "B", "C", "D", "E", "F", "G"]
def _iter_jsonl(path: str) -> list[dict]:
items: list[dict] = []
with open(path, encoding="utf-8") as f:
for line in f:
line = line.strip()
if not line:
continue
items.append(json.loads(line))
return items
def _normalize_ans_type(raw: Any, options: list[dict], choice_answer: Any) -> str:
text = str(raw or "").strip().lower()
if text in {"choice", "multiple_choice", "mcq", "option"}:
return "choice"
if text in {"blank", "open", "open_ended", "fill_blank", "short_answer"}:
return "blank"
if options or choice_answer not in (None, "", []):
return "choice"
return "blank"
def _coerce_options(raw: Any) -> list[dict]:
options: list[dict] = []
if isinstance(raw, list):
for idx, item in enumerate(raw):
if isinstance(item, dict):
text = str(item.get("text") or item.get("content") or item.get("option") or "").strip()
label = str(item.get("label") or _CHOICE_LABELS[idx]).strip()
else:
text = str(item).strip()
label = _CHOICE_LABELS[idx]
if text:
options.append({"label": label, "text": text})
elif isinstance(raw, dict):
for idx, (key, value) in enumerate(raw.items()):
text = str(value).strip()
if text:
options.append({"label": str(key).strip() or _CHOICE_LABELS[idx], "text": text})
return options
def _normalize_choice_answer(choice_answer: Any, options: list[dict]) -> dict[str, str]:
if not options:
return {"label": "", "text": ""}
if isinstance(choice_answer, dict):
label = str(choice_answer.get("label") or "").strip().upper()
text = str(choice_answer.get("text") or "").strip()
for option in options:
if label and option["label"].strip().upper() == label:
return {"label": option["label"], "text": option["text"]}
if text and option["text"] == text:
return {"label": option["label"], "text": option["text"]}
if isinstance(choice_answer, int):
idx = choice_answer
if 0 <= idx < len(options):
return dict(options[idx])
if 1 <= idx <= len(options):
return dict(options[idx - 1])
text = str(choice_answer or "").strip()
label = text.upper().rstrip(".):")
for option in options:
if option["label"].strip().upper() == label:
return dict(option)
if option["text"] == text:
return dict(option)
return {"label": "", "text": ""}
def _coerce_blank_answers(raw: Any) -> list[str]:
if isinstance(raw, list):
return [str(item).strip() for item in raw if str(item).strip()]
if raw is None:
return []
text = str(raw).strip()
return [text] if text else []
def load_items(data_path: str) -> list[dict]:
"""Load and normalise BabyVision items from a directory or JSONL file."""
if not data_path:
raise ValueError("BabyVision requires data_path pointing to a local dataset directory or meta_data.jsonl.")
if os.path.isdir(data_path):
meta_path = os.path.join(data_path, "meta_data.jsonl")
image_root = os.path.join(data_path, "images")
else:
meta_path = data_path
image_root = os.path.join(os.path.dirname(data_path), "images")
if not os.path.exists(meta_path):
raise ValueError(
"BabyVision expected a meta_data.jsonl file. "
f"Could not find: {meta_path}"
)
raw_items = _iter_jsonl(meta_path)
items: list[dict] = []
for idx, raw in enumerate(raw_items):
options = _coerce_options(raw.get("options") or raw.get("choices") or raw.get("choiceOptions"))
ans_type = _normalize_ans_type(raw.get("ansType"), options, raw.get("choiceAns"))
correct_choice = _normalize_choice_answer(raw.get("choiceAns"), options)
blank_answers = _coerce_blank_answers(raw.get("blankAns"))
image_name = str(
raw.get("image")
or raw.get("image_path")
or raw.get("image_file")
or raw.get("img")
or ""
).strip()
if not image_name:
continue
image_path = image_name if os.path.isabs(image_name) else os.path.join(image_root, image_name)
if not os.path.exists(image_path):
alt = os.path.join(os.path.dirname(meta_path), image_name)
if os.path.exists(alt):
image_path = alt
else:
continue
task_id = str(raw.get("taskId") or raw.get("id") or idx + 1)
task_type = str(raw.get("type") or raw.get("taskType") or "unknown").strip() or "unknown"
subtype = str(raw.get("subtype") or raw.get("subType") or task_type).strip() or task_type
question = str(raw.get("question") or raw.get("query") or "").strip()
if not question:
continue
if ans_type == "choice" and not correct_choice["label"]:
continue
if ans_type != "choice" and not blank_answers:
continue
items.append({
"id": task_id,
"task_type": task_type,
"subtype": subtype,
"question": question,
"image_path": os.path.abspath(image_path),
"ans_type": ans_type,
"choices": options,
"correct_choice": correct_choice,
"blank_answers": blank_answers,
"cot": str(raw.get("coT") or raw.get("cot") or "").strip(),
"source_path": os.path.abspath(meta_path),
})
if not items:
raise ValueError(f"No valid BabyVision items loaded from {data_path}")
return items
# ── Dataloader ───────────────────────────────────────────────────────────
class BabyVisionDataLoader(SplitDataLoader):
"""BabyVision dataloader."""
def __init__(
self,
split_dir: str = "",
data_path: str = "",
split_mode: str = "ratio",
split_ratio: str = "2:1:7",
split_seed: int = 42,
split_output_dir: str = "",
seed: int = 42,
limit: int = 0,
**kwargs,
) -> None:
super().__init__(
split_dir=split_dir,
data_path=data_path,
split_mode=split_mode,
split_ratio=split_ratio,
split_seed=split_seed,
split_output_dir=split_output_dir,
seed=seed,
limit=limit,
)
self._task_types: list[str] = []
def load_raw_items(self, data_path: str) -> list[dict]:
return load_items(data_path)
def setup(self, cfg: dict) -> None:
super().setup(cfg)
all_items = self.train_items + self.val_items + self.test_items
task_types = {
item.get("subtype") or item.get("task_type") or "unknown"
for item in all_items
}
self._task_types = sorted(task_types)
def get_task_types(self) -> list[str]:
return list(self._task_types)
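
For orientation, the loader above reads one JSON object per line from `meta_data.jsonl` and resolves image names against a sibling `images/` directory. The sketch below writes and re-reads one such record. The record values are invented for illustration, and `write_dataset` and `read_jsonl` are hypothetical helpers; only the key names (`taskId`, `type`, `subtype`, `question`, `image`, `ansType`, `options`, `choiceAns`, `coT`) come from the removed code.

```python
from __future__ import annotations

import json
import os
import tempfile

# Hypothetical record; the keys are ones the removed loader looks up.
record = {
    "taskId": 1,
    "type": "counting",
    "subtype": "count_objects",
    "question": "How many apples are on the table?",
    "image": "0001.png",
    "ansType": "choice",
    "options": ["2", "3", "4"],  # labelled A, B, C by position
    "choiceAns": "B",
    "coT": "Count each apple once, left to right.",
}


def write_dataset(root: str, records: list[dict]) -> str:
    """Write records to <root>/meta_data.jsonl and create <root>/images/."""
    os.makedirs(os.path.join(root, "images"), exist_ok=True)
    meta_path = os.path.join(root, "meta_data.jsonl")
    with open(meta_path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    return meta_path


def read_jsonl(path: str) -> list[dict]:
    """Parse one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

The removed `load_items` skips any record whose image file cannot be found, so a usable dataset would also need `images/0001.png` on disk.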

@@ -1,160 +0,0 @@
"""BabyVision evaluation helpers using the official-style LLM judge."""
from __future__ import annotations
import re
import string
import regex
from skillopt.model import chat_with_deployment
from skillopt.prompts import load_prompt
_EVAL_MODE = "babyvision_judge_v2_official_style"
def normalize_text(text: str) -> str:
text = str(text).strip().lower()
text = "".join(ch for ch in text if ch not in string.punctuation)
return " ".join(text.split())
def extract_boxed_answer(text: str | None) -> str | None:
"""Extract the final answer using the official BabyVision rule."""
if text is None:
return None
pattern = r'\\boxed\{((?:[^{}]|{(?:[^{}]|{.*})*})*)\}'
matches = regex.findall(pattern, text)
if matches:
return matches[-1]
pattern_alt = r'<\|begin_of_box\|>(.*?)<\|end_of_box\|>'
matches_alt = regex.findall(pattern_alt, text)
if matches_alt:
return matches_alt[-1].strip()
return None
def _token_f1(prediction: str, gold: str) -> float:
pred_tokens = normalize_text(prediction).split()
gold_tokens = normalize_text(gold).split()
if not pred_tokens and not gold_tokens:
return 1.0
if not pred_tokens or not gold_tokens:
return 0.0
pred_set = {}
gold_set = {}
for tok in pred_tokens:
pred_set[tok] = pred_set.get(tok, 0) + 1
for tok in gold_tokens:
gold_set[tok] = gold_set.get(tok, 0) + 1
common = 0
for tok, count in pred_set.items():
common += min(count, gold_set.get(tok, 0))
if common == 0:
return 0.0
precision = common / len(pred_tokens)
recall = common / len(gold_tokens)
return 2 * precision * recall / (precision + recall)
def _format_choices(choices: list[dict]) -> str:
return "\n".join(f"{choice['label']}. {choice['text']}" for choice in choices)
def _judge_answer(
*,
item: dict,
prediction_text: str,
extracted_answer: str,
judge_model: str,
max_completion_tokens: int,
retries: int,
) -> dict:
if item["ans_type"] == "choice":
ground_truth = str(item["correct_choice"]["label"])
else:
if len(item["blank_answers"]) == 1:
ground_truth = item["blank_answers"][0]
else:
ground_truth = " | ".join(item["blank_answers"])
question = str(item["question"])
if item["ans_type"] == "choice" and item.get("choices"):
question = f"{question}\nChoices:\n{_format_choices(item['choices'])}"
raw, _ = chat_with_deployment(
deployment=judge_model,
system="You are a careful and strict evaluator.",
user=load_prompt("judge", env="babyvision").format(
question=question,
groundtruth=ground_truth,
modeloutput=extracted_answer,
),
max_completion_tokens=max_completion_tokens,
retries=retries,
stage="babyvision_judge",
)
judge_response_clean = str(raw).strip().lower()
# Any reply that does not contain "true" (including an unparseable one) counts as incorrect.
correct = "true" in judge_response_clean
return {
"raw": raw,
"correct": correct,
"reason": judge_response_clean,
"matched_gold": ground_truth if correct else "",
}
def evaluate_item(
*,
item: dict,
prediction_text: str,
judge_model: str,
max_completion_tokens: int = 256,
retries: int = 5,
) -> dict:
answer = extract_boxed_answer(prediction_text)
judge = _judge_answer(
item=item,
prediction_text=prediction_text,
extracted_answer=answer,
judge_model=judge_model,
max_completion_tokens=max_completion_tokens,
retries=retries,
)
hard = 1.0 if judge["correct"] else 0.0
result = {
"evaluation_mode": _EVAL_MODE,
"predicted_answer": answer,
"em": hard,
"f1": hard,
"sub_em": hard,
"judge_model": judge_model,
"judge_raw": judge["raw"],
"judge_reason": judge["reason"],
"matched_gold": judge["matched_gold"],
}
if item["ans_type"] == "choice":
result["predicted_label"] = str(answer or "").strip().upper().rstrip(".):")
result["predicted_text"] = ""
result["correct_label"] = str(item["correct_choice"].get("label") or "")
result["correct_text"] = str(item["correct_choice"].get("text") or "")
else:
result["gold_answers"] = list(item["blank_answers"])
best_f1 = 0.0
for gold in item["blank_answers"]:
best_f1 = max(best_f1, _token_f1(str(answer or ""), gold))
result["string_f1"] = best_f1
return result
def evaluation_mode() -> str:
return _EVAL_MODE
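
The `string_f1` field reported for open-ended answers is a bag-of-tokens F1 after lowercasing and punctuation stripping. A compact equivalent using `collections.Counter` might look like the sketch below; `normalize` and `token_f1` are illustrative names, not the removed functions themselves.

```python
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop ASCII punctuation, and split on whitespace."""
    lowered = str(text).strip().lower()
    cleaned = "".join(ch for ch in lowered if ch not in string.punctuation)
    return cleaned.split()


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall, counting repeated tokens."""
    pred, ref = normalize(prediction), normalize(gold)
    if not pred and not ref:
        return 1.0
    if not pred or not ref:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For the prediction "The cat." against the gold answer "the cat sat", two of two predicted tokens and two of three gold tokens match, so precision is 1, recall is 2/3, and F1 is 0.8.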

@@ -1,36 +0,0 @@
You are an expert failure-analysis agent for child-level visual reasoning tasks.
You will be given MULTIPLE failed BabyVision trajectories from a minibatch and the current skill document.
Each trajectory includes the text prompt, the model answer, and the evaluation result.
You do not have direct access to raw pixel content during reflection, so focus on general reasoning,
option-selection, and visual-question-answering behaviors that can be improved through prompting.
## Failure Type Categories
- **visual_detail_miss**: the agent likely overlooked a salient visual attribute, relation, count, or object state
- **option_mismatch**: the agent selected the wrong option despite relevant evidence likely being present
- **instruction_slip**: the agent ignored output format or answered too vaguely
- **answer_granularity**: the agent gave an answer that was too broad, too narrow, or mismatched the expected specificity
- **other**: none of the above
## Rules
1. Focus on patterns recurring across the minibatch.
2. Prefer reusable behaviors for inspecting images and grounding answers in visible evidence.
3. Do not memorize dataset-specific answers.
4. Only patch gaps not already covered by the current skill.
Respond ONLY with a valid JSON object:
{
"batch_size": <number>,
"failure_summary": [
{"failure_type": "<type>", "count": <int>, "description": "<one-line>"}
],
"patch": {
"reasoning": "<why these edits address the common failures>",
"edits": [
{"op": "append", "content": "<markdown>"},
{"op": "insert_after", "target": "<heading/text>", "content": "<markdown>"},
{"op": "replace", "target": "<old text>", "content": "<new text>"},
{"op": "delete", "target": "<exact text to remove>"}
]
}
}
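
The four `op` values in the schema above suggest a simple string-based applier. How the pipeline actually applies patches is not shown in this diff; `apply_edits` below is a hypothetical sketch that assumes exact substring targets, touches only the first match, and silently skips edits whose target is missing.

```python
from __future__ import annotations


def apply_edits(skill: str, edits: list[dict]) -> str:
    """Apply patch edits in order and return the updated skill text."""
    for edit in edits:
        op = edit.get("op")
        target = edit.get("target", "")
        content = edit.get("content", "")
        if op == "append":
            if skill.strip():
                # Separate the new block from existing text with a blank line.
                skill = skill.rstrip("\n") + "\n\n" + content + "\n"
            else:
                skill = content + "\n"
        elif op in {"insert_after", "replace", "delete"}:
            if not target or target not in skill:
                continue  # unknown target: skip rather than guess
            if op == "insert_after":
                skill = skill.replace(target, target + "\n" + content, 1)
            elif op == "replace":
                skill = skill.replace(target, content, 1)
            else:
                skill = skill.replace(target, "", 1)
    return skill
```

Skipping unmatched targets keeps a single bad edit from corrupting the skill document, at the cost of dropping that edit without notice; a real implementation would probably log it.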

@@ -1,25 +0,0 @@
You are an expert success-pattern analyst for child-level visual reasoning tasks.
You will be given MULTIPLE successful BabyVision trajectories from a minibatch and the current skill document.
Identify generalizable behavior patterns that help the agent inspect the image carefully and answer at the right level of specificity.
## Rules
- Focus on broadly useful visual QA behaviors.
- Prefer patterns about systematic image inspection, comparing options, and concise grounded answers.
- Do not add dataset-specific facts.
- "edits" may be empty if the skill already captures the useful patterns.
Respond ONLY with a valid JSON object:
{
"batch_size": <number>,
"success_patterns": ["<pattern 1>", "<pattern 2>"],
"patch": {
"reasoning": "<why these patterns matter>",
"edits": [
{"op": "append", "content": "<markdown>"},
{"op": "insert_after", "target": "<heading/text>", "content": "<markdown>"},
{"op": "replace", "target": "<old text>", "content": "<new text>"},
{"op": "delete", "target": "<exact text to remove>"}
]
}
}

@@ -1,25 +0,0 @@
You are an expert diagnostic-probe designer for BabyVision-style visual reasoning tasks.
You will be shown representative trajectories, the current target skill, and the target's original prompt context.
Design one SMALL diagnostic instruction that exposes the target's intermediate visual judgment without materially changing the original scaffold.
## Hard Constraints
1. Do NOT substantially change the original scaffold.
2. Do NOT prescribe a new step-by-step solving method.
3. You MAY ask for a short structured list of a few intermediate conclusions, candidate cues, or counted units, as long as it stays close to the original scaffold.
4. Do NOT ask for exhaustive listing of all cells, all objects, or a full chain-of-thought.
5. Ask only for a short readout that reveals the target's current latent state.
6. Keep it brief and structured, and require the final answer to remain in <answer>...</answer>.
## Good Probe Targets
- top answer and runner-up
- decisive visual cue
- suspicious region or compared objects
- counting unit or formatting interpretation
- 2-4 short intermediate conclusions that directly support the final answer
Respond ONLY with a valid JSON object:
{
"reasoning": "<why this probe is informative>",
"probe_instruction": "<the exact instruction text to append to the target prompt>"
}

@@ -1,35 +0,0 @@
You are a careful and strict evaluator. You will be given:
1. **Question**
2. **Ground Truth Answer** (correct answer)
3. **Model Output** (answer from another model)
**Your goal:** Determine if the Model Output **accurately matches** the Ground Truth Answer in meaning.
* Matching means: the facts, entities, and key details are equivalent, even if phrasing differs.
* Not matching means: the Model Output is wrong, incomplete, contains extra incorrect facts, or changes the meaning.
**Process (internal reasoning):**
1. Read and understand the Question, Ground Truth Answer, and Model Output.
2. Ignore small wording differences, formatting, or synonyms.
3. If all factual content matches, conclude `True`. Otherwise, conclude `False`.
**Important:**
* Think through your decision step-by-step **internally** before responding.
* In your final output, return **only** True or False, with no extra text or explanation.
**Output format:**
True
or
False
**Input:**
Question: {question},
Ground Truth Answer: {groundtruth},
Model Output: {modeloutput}

@@ -1,13 +0,0 @@
You are an expert visual reasoning agent solving child-level image understanding tasks.
{skill_section}## Task Format
You will receive one image and one question about it.
Inspect the image carefully before answering. Ground the answer in visible evidence.
## Answer Format
Think step by step, then provide your final answer in \boxed{{Answer}} format.
- For multiple-choice questions, output only the single choice label, such as \boxed{{A}}.
- For open questions, output only a short final answer inside \boxed{{...}}.
Example:
\boxed{{B}}
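
Answers in the `\boxed{...}` format requested above can be recovered without a regular expression by counting braces. The removed evaluator used a pattern from the third-party `regex` package; the standard-library sketch below is an alternative, not that rule, and `last_boxed` is a hypothetical name.

```python
from __future__ import annotations


def last_boxed(text: str | None) -> str | None:
    """Return the contents of the last balanced \\boxed{...} in text, or None."""
    if not text:
        return None
    marker = "\\boxed{"
    result = None
    pos = text.find(marker)
    while pos != -1:
        start = pos + len(marker)
        depth = 1
        i = start
        # Walk forward until the opening brace of this marker is closed.
        while i < len(text) and depth:
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
            i += 1
        if depth == 0:
            result = text[start:i - 1]
        pos = text.find(marker, start)
    return result
```

Taking the last match rather than the first mirrors the removed evaluator, which returned `matches[-1]`, so a model that revises its answer mid-response is scored on its final box.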

@@ -1,4 +0,0 @@
"""BabyVision Reflect stage.
Prompts are now loaded from .md files by the base adapter.
"""

@@ -1,483 +0,0 @@
"""BabyVision rollout — multimodal visual QA with image input."""
from __future__ import annotations
import base64
import json
import mimetypes
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from skillopt.envs.babyvision.evaluator import evaluate_item, evaluation_mode, extract_boxed_answer
from skillopt.model import chat_target_messages, get_target_backend, is_target_exec_backend
from skillopt.model.codex_harness import prepare_workspace, render_skill_md, run_target_exec
from skillopt.prompts import load_prompt
def _build_system(skill_content: str) -> str:
if skill_content.strip():
skill_section = f"## Skill\n{skill_content.strip()}\n\n"
else:
skill_section = ""
return load_prompt("rollout_system", env="babyvision").format(skill_section=skill_section)
def _format_choices(choices: list[dict]) -> str:
return "\n".join(f"{choice['label']}. {choice['text']}" for choice in choices)
def _build_user_text(
item: dict,
*,
diagnostic_mode: bool = False,
diagnostic_instruction: str = "",
diagnostic_trace_context: str = "",
) -> str:
parts = []
if diagnostic_trace_context.strip():
parts.append(
"## Previous Codex Trace Snapshot\n"
"This is a partial transcript from an earlier attempt. Use it as your current reasoning context.\n\n"
f"{diagnostic_trace_context.strip()}"
)
parts.append(f"## Question\n{item['question']}")
if item["ans_type"] == "choice":
parts.append(f"## Choices\n{_format_choices(item['choices'])}")
parts.append("Answer using the single correct option label in \\boxed{...}.")
else:
parts.append("Answer with a short phrase in \\boxed{...}.")
if diagnostic_mode and diagnostic_instruction.strip():
parts.append(f"## Training Readout\n{diagnostic_instruction.strip()}")
return "\n\n".join(parts)
def _image_to_data_uri(path: str) -> str:
mime = mimetypes.guess_type(path)[0] or "image/png"
with open(path, "rb") as f:
encoded = base64.b64encode(f.read()).decode("ascii")
return f"data:{mime};base64,{encoded}"
def _build_messages(
item: dict,
skill_content: str,
image_detail: str,
*,
diagnostic_mode: bool = False,
diagnostic_instruction: str = "",
diagnostic_trace_context: str = "",
) -> tuple[list[dict], str, str]:
system = _build_system(skill_content)
user_text = _build_user_text(
item,
diagnostic_mode=diagnostic_mode,
diagnostic_instruction=diagnostic_instruction,
diagnostic_trace_context=diagnostic_trace_context,
)
image_url = {
"url": _image_to_data_uri(item["image_path"]),
}
if image_detail and image_detail != "auto":
image_url["detail"] = image_detail
messages = [
{"role": "system", "content": system},
{
"role": "user",
"content": [
{"type": "text", "text": user_text},
{"type": "image_url", "image_url": image_url},
],
},
]
return messages, system, user_text
def _build_codex_skill(skill_content: str) -> str:
return render_skill_md(
skill_content,
description="Dynamic ReflACT skill for solving the current BabyVision visual reasoning question.",
preamble=(
"Use this skill when answering the current visual reasoning question.\n"
"Inspect the attached image carefully and return the final answer in \\boxed{...}."
),
)
def _run_codex_once(
*,
pred_dir: str,
item: dict,
skill_content: str,
model: str,
timeout: int,
image_detail: str,
diagnostic_mode: bool = False,
diagnostic_instruction: str = "",
diagnostic_trace_context: str = "",
previous_response: str = "",
) -> tuple[str, str, str, str]:
user_text = _build_user_text(
item,
diagnostic_mode=diagnostic_mode,
diagnostic_instruction=diagnostic_instruction,
diagnostic_trace_context=diagnostic_trace_context,
)
task_parts = [user_text]
if previous_response:
task_parts.append(
"## Previous Attempt\n"
f"{previous_response}\n\n"
"Review the same image and question carefully. If needed, correct the answer."
)
task_text = "\n\n".join(task_parts)
skill_md = _build_codex_skill(skill_content)
work_dir = os.path.join(pred_dir, "codex_exec")
prepare_workspace(
work_dir=work_dir,
skill_md=skill_md,
task_text=task_text,
images=[item["image_path"]],
)
prompt = (
"Use the `skillopt-target` skill available in this workspace.\n"
"Read `task.md`, inspect the attached image, and answer the question.\n"
"Return the final answer in \\boxed{...}."
)
final_message, raw = run_target_exec(
work_dir=work_dir,
prompt=prompt,
model=model,
timeout=timeout,
images=[item["image_path"]],
)
return final_message or raw, raw, skill_md, task_text
def process_one(
item: dict,
out_root: str,
skill_content: str,
*,
max_turns: int = 1,
image_detail: str = "auto",
judge_model: str = "gpt-5.4",
judge_max_completion_tokens: int = 256,
judge_retries: int = 5,
diagnostic_mode: bool = False,
diagnostic_instruction: str = "",
diagnostic_trace_context: str = "",
) -> dict:
item_id = str(item["id"])
result = {
"id": item_id,
"question": item["question"],
"task_type": item.get("subtype") or item.get("task_type") or "babyvision",
"task_description": item["question"],
"hard": 0,
"soft": 0.0,
"predicted_answer": "",
"predicted_label": "",
"predicted_text": "",
"response": "",
"fail_reason": "",
"agent_ok": False,
"n_turns": 0,
"image_path": item["image_path"],
"ans_type": item["ans_type"],
"evaluation_mode": evaluation_mode(),
"judge_model": judge_model,
}
if item["ans_type"] == "choice":
result["correct_label"] = item["correct_choice"]["label"]
result["correct_text"] = item["correct_choice"]["text"]
else:
result["gold_answers"] = item["blank_answers"]
try:
pred_dir = os.path.join(out_root, "predictions", item_id)
os.makedirs(pred_dir, exist_ok=True)
if is_target_exec_backend():
from skillopt.model import azure_openai as _llm
response = ""
conversation: list[dict] = [
{"role": "user", "content": f"{item['question']}\n\n[image] {os.path.basename(item['image_path'])}"}
]
system_prompt = ""
user_text = ""
for turn in range(max_turns):
response, raw, system_prompt, user_text = _run_codex_once(
pred_dir=pred_dir,
item=item,
skill_content=skill_content,
model=_llm.TARGET_DEPLOYMENT,
timeout=120,
image_detail=image_detail,
diagnostic_mode=diagnostic_mode if turn == 0 else False,
diagnostic_instruction=diagnostic_instruction if turn == 0 else "",
diagnostic_trace_context=diagnostic_trace_context if turn == 0 else "",
previous_response=response if turn > 0 else "",
)
conversation.append({"type": "message", "turn": turn + 1, "content": response})
if extract_boxed_answer(response) is not None:
break
result["response"] = response
result["agent_ok"] = True
result["n_turns"] = len(conversation) - 1
with open(os.path.join(pred_dir, "target_system_prompt.txt"), "w", encoding="utf-8") as f:
f.write(system_prompt)
with open(os.path.join(pred_dir, "target_user_prompt.txt"), "w", encoding="utf-8") as f:
f.write(user_text)
eval_result = evaluate_item(
item=item,
prediction_text=response,
judge_model=judge_model,
max_completion_tokens=judge_max_completion_tokens,
retries=judge_retries,
)
result["evaluation_mode"] = eval_result["evaluation_mode"]
result["judge_raw"] = eval_result["judge_raw"]
result["judge_reason"] = eval_result["judge_reason"]
result["matched_gold"] = eval_result["matched_gold"]
if item["ans_type"] == "choice":
result["predicted_label"] = eval_result["predicted_label"]
result["predicted_text"] = eval_result["predicted_text"]
result["predicted_answer"] = eval_result["predicted_answer"]
result["hard"] = int(eval_result["em"])
result["soft"] = eval_result["f1"]
if not result["hard"]:
result["fail_reason"] = (
f"judge=0: predicted '{eval_result['predicted_label'] or eval_result['predicted_answer']}' "
f"but expected '{eval_result['correct_label']}' ({eval_result['judge_reason']})"
)
eval_detail = (
f"[EVALUATION RESULT]\n"
f"Question: {item['question']}\n"
f"Predicted label: {eval_result['predicted_label']!r}\n"
f"Predicted text: {eval_result['predicted_text']!r}\n"
f"Correct label: {eval_result['correct_label']!r}\n"
f"Correct text: {eval_result['correct_text']!r}\n"
f"Judge correct: {eval_result['em']}\n"
f"Judge reason: {eval_result['judge_reason']}"
)
else:
result["predicted_answer"] = eval_result["predicted_answer"]
result["hard"] = int(eval_result["em"])
result["soft"] = eval_result["f1"]
if not result["hard"]:
result["fail_reason"] = (
f"judge=0: predicted '{eval_result['predicted_answer']}' "
f"but expected {item['blank_answers']} ({eval_result['judge_reason']})"
)
eval_detail = (
f"[EVALUATION RESULT]\n"
f"Question: {item['question']}\n"
f"Predicted answer: {eval_result['predicted_answer']!r}\n"
f"Gold answers: {item['blank_answers']!r}\n"
f"Judge correct: {eval_result['em']}\n"
f"Judge reason: {eval_result['judge_reason']}\n"
f"String F1: {eval_result.get('string_f1', 0.0):.4f}"
)
conversation.append({"role": "system", "content": eval_detail})
with open(os.path.join(pred_dir, "conversation.json"), "w", encoding="utf-8") as f:
json.dump(conversation, f, ensure_ascii=False, indent=2)
return result
messages, system_prompt, user_text = _build_messages(
item,
skill_content,
image_detail,
diagnostic_mode=diagnostic_mode,
diagnostic_instruction=diagnostic_instruction,
diagnostic_trace_context=diagnostic_trace_context,
)
response = ""
conversation: list[dict] = [
{"role": "user", "content": f"{user_text}\n\n[image] {os.path.basename(item['image_path'])}"}
]
for turn in range(max_turns):
if turn == 0:
resp_text, _ = chat_target_messages(
messages=messages,
max_completion_tokens=768,
retries=5,
stage="rollout",
)
else:
refinement_text = (
f"Your previous answer was:\n{response}\n\n"
"Review the same image and question carefully. "
"If needed, correct your answer. Output the final answer in \\boxed{...}."
)
refinement_messages = [
messages[0],
messages[1],
{"role": "assistant", "content": response},
{"role": "user", "content": refinement_text},
]
resp_text, _ = chat_target_messages(
messages=refinement_messages,
max_completion_tokens=512,
retries=5,
stage="rollout",
)
response = resp_text
conversation.append({"type": "message", "turn": turn + 1, "content": resp_text})
if extract_boxed_answer(resp_text) is not None:
break
result["response"] = response
result["agent_ok"] = True
result["n_turns"] = len(conversation) - 1
with open(os.path.join(pred_dir, "target_system_prompt.txt"), "w", encoding="utf-8") as f:
f.write(system_prompt)
with open(os.path.join(pred_dir, "target_user_prompt.txt"), "w", encoding="utf-8") as f:
f.write(user_text)
eval_result = evaluate_item(
item=item,
prediction_text=response,
judge_model=judge_model,
max_completion_tokens=judge_max_completion_tokens,
retries=judge_retries,
)
result["evaluation_mode"] = eval_result["evaluation_mode"]
result["judge_raw"] = eval_result["judge_raw"]
result["judge_reason"] = eval_result["judge_reason"]
result["matched_gold"] = eval_result["matched_gold"]
if item["ans_type"] == "choice":
result["predicted_label"] = eval_result["predicted_label"]
result["predicted_text"] = eval_result["predicted_text"]
result["predicted_answer"] = eval_result["predicted_answer"]
result["hard"] = int(eval_result["em"])
result["soft"] = eval_result["f1"]
if not result["hard"]:
result["fail_reason"] = (
f"judge=0: predicted '{eval_result['predicted_label'] or eval_result['predicted_answer']}' "
f"but expected '{eval_result['correct_label']}' ({eval_result['judge_reason']})"
)
eval_detail = (
f"[EVALUATION RESULT]\n"
f"Question: {item['question']}\n"
f"Predicted label: {eval_result['predicted_label']!r}\n"
f"Predicted text: {eval_result['predicted_text']!r}\n"
f"Correct label: {eval_result['correct_label']!r}\n"
f"Correct text: {eval_result['correct_text']!r}\n"
f"Judge correct: {eval_result['em']}\n"
f"Judge reason: {eval_result['judge_reason']}"
)
else:
result["predicted_answer"] = eval_result["predicted_answer"]
result["hard"] = int(eval_result["em"])
result["soft"] = eval_result["f1"]
if not result["hard"]:
result["fail_reason"] = (
f"judge=0: predicted '{eval_result['predicted_answer']}' "
f"but expected {item['blank_answers']} ({eval_result['judge_reason']})"
)
eval_detail = (
f"[EVALUATION RESULT]\n"
f"Question: {item['question']}\n"
f"Predicted answer: {eval_result['predicted_answer']!r}\n"
f"Gold answers: {item['blank_answers']!r}\n"
f"Judge correct: {eval_result['em']}\n"
f"Judge reason: {eval_result['judge_reason']}\n"
f"String F1: {eval_result.get('string_f1', 0.0):.4f}"
)
conversation.append({"role": "system", "content": eval_detail})
with open(os.path.join(pred_dir, "conversation.json"), "w", encoding="utf-8") as f:
json.dump(conversation, f, ensure_ascii=False, indent=2)
except Exception as e: # noqa: BLE001
result["fail_reason"] = f"error: {e}"
return result
def run_batch(
items: list[dict],
out_root: str,
skill_content: str,
*,
max_turns: int = 1,
workers: int = 32,
image_detail: str = "auto",
judge_model: str = "gpt-5.4",
judge_max_completion_tokens: int = 256,
judge_retries: int = 5,
diagnostic_mode: bool = False,
diagnostic_instruction: str = "",
diagnostic_trace_context_by_id: dict[str, str] | None = None,
) -> list[dict]:
results_path = os.path.join(out_root, "results.jsonl")
os.makedirs(out_root, exist_ok=True)
expected_eval_mode = evaluation_mode()
done_ids: set[str] = set()
existing: list[dict] = []
rewrite_results = False
if os.path.exists(results_path):
with open(results_path, encoding="utf-8") as f:
for line in f:
try:
row = json.loads(line)
if row.get("evaluation_mode") != expected_eval_mode:
rewrite_results = True
continue
done_ids.add(str(row["id"]))
existing.append(row)
except Exception:
rewrite_results = True
pending = [item for item in items if str(item["id"]) not in done_ids]
if not pending and not rewrite_results:
return existing
total = len(existing) + len(pending)
completed = len(existing)
correct_count = sum(1 for r in existing if r.get("hard", 0))
if existing:
print(f" [rollout] resuming: {completed}/{total} already done", flush=True)
results = list(existing)
file_mode = "w" if rewrite_results else "a"
with open(results_path, file_mode, encoding="utf-8") as outf, ThreadPoolExecutor(max_workers=workers) as ex:
if rewrite_results:
for row in existing:
outf.write(json.dumps(row, ensure_ascii=False) + "\n")
futs = {
ex.submit(
process_one,
item,
out_root,
skill_content,
max_turns=max_turns,
image_detail=image_detail,
judge_model=judge_model,
judge_max_completion_tokens=judge_max_completion_tokens,
judge_retries=judge_retries,
diagnostic_mode=diagnostic_mode,
diagnostic_instruction=diagnostic_instruction,
diagnostic_trace_context=(diagnostic_trace_context_by_id or {}).get(str(item["id"]), ""),
): item
for item in pending
}
for fut in as_completed(futs):
row = fut.result()
results.append(row)
completed += 1
if row.get("hard", 0):
correct_count += 1
acc = correct_count / completed if completed else 0
print(
f" [rollout] {completed}/{total} "
f"(acc={acc:.3f}) id={row.get('id', '?')} "
f"hard={row.get('hard', '?')}",
flush=True,
)
outf.write(json.dumps(row, ensure_ascii=False) + "\n")
outf.flush()
return results
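The resume behaviour of `run_batch` can be looked at in isolation. The sketch below is a minimal standalone approximation, not the removed implementation: the helper name `load_resumable_rows` and the `demo` function are hypothetical. It keeps rows whose `evaluation_mode` matches the expected one, and flags the file for a rewrite when it meets stale or unparseable rows.

```python
import json
import os
import tempfile


def load_resumable_rows(results_path: str, expected_eval_mode: str) -> tuple[list[dict], set[str], bool]:
    """Return (kept_rows, done_ids, needs_rewrite) for a results.jsonl file."""
    kept: list[dict] = []
    done_ids: set[str] = set()
    needs_rewrite = False
    if not os.path.exists(results_path):
        return kept, done_ids, needs_rewrite
    with open(results_path, encoding="utf-8") as f:
        for line in f:
            try:
                row = json.loads(line)
                # Rows scored under a different evaluation mode are dropped,
                # so their items get re-run and the file is rewritten.
                if row.get("evaluation_mode") != expected_eval_mode:
                    needs_rewrite = True
                    continue
                done_ids.add(str(row["id"]))
                kept.append(row)
            except Exception:
                # Corrupt or incomplete lines also force a rewrite.
                needs_rewrite = True
    return kept, done_ids, needs_rewrite


def demo() -> tuple[list[dict], set[str], bool]:
    """Write a small results file (hypothetical data) and reload it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "results.jsonl")
        with open(path, "w", encoding="utf-8") as f:
            f.write(json.dumps({"id": 1, "evaluation_mode": "v2", "hard": 1}) + "\n")
            f.write(json.dumps({"id": 2, "evaluation_mode": "v1", "hard": 0}) + "\n")
            f.write("not json\n")
        return load_resumable_rows(path, "v2")
```

Only pending items (ids not in `done_ids`) would then be submitted to the thread pool. The file is opened in `"w"` mode when a rewrite is needed and in `"a"` mode otherwise.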


@@ -1,18 +0,0 @@
# BabyVision Visual QA Heuristics
## Image Inspection
- First identify the main objects, their attributes, and their spatial relations before answering.
- If the question involves counting, compare all relevant instances carefully instead of stopping after the first match.
- If the question asks about color, size, position, or action, verify the specific visible evidence for that attribute.
## Multiple Choice
- Compare every option against the visible image evidence before deciding.
- Prefer the option that matches the image exactly; reject options that are only partially true or too vague.
- When two options are close, check the smallest discriminating visual detail.
## Open Answers
- Answer with the shortest phrase that is fully supported by the image.
- Match the expected level of specificity: not broader than the image evidence, not narrower than the question asks.
## Final Answer
- Output only the final answer inside <answer>...</answer>.


@@ -1,114 +0,0 @@
from __future__ import annotations
import json
import os
from typing import Any, Callable
from skillopt.gradient.deep_probe import generate_deep_probe_instruction
from skillopt.gradient.reflect import run_minibatch_reflect
def run_no_reference_deep_reflect(
adapter: Any,
results: list[dict],
skill_content: str,
out_dir: str,
*,
env_manager: Any = None,
prediction_dir: str | None = None,
random_seed: int | None = None,
step_buffer_context: str = "",
output_requirements: list[str] | None = None,
metadata_builder: Callable[[dict], dict] | None = None,
) -> list[dict | None]:
"""Run optimizer-designed diagnostic probing without hidden references."""
if not getattr(adapter, "use_deep_reflect", False):
return []
if not isinstance(env_manager, list):
return []
prediction_dir = prediction_dir or os.path.join(out_dir, "predictions")
selected_items = adapter.select_representative_items(
results,
env_manager,
n_failures=getattr(adapter, "deep_reflect_failures", 4),
n_successes=getattr(adapter, "deep_reflect_successes", 2),
seed=random_seed,
)
if not selected_items:
return []
selected_ids = {str(item["id"]) for item in selected_items}
selected_results = [row for row in results if str(row.get("id")) in selected_ids]
if metadata_builder is None:
selected_metadata = [
{
"id": str(item.get("id")),
"task_type": str(item.get("task_type") or item.get("topic") or "unknown"),
"question_preview": str(item.get("question") or "")[:200],
}
for item in selected_items
]
else:
selected_metadata = [metadata_builder(item) for item in selected_items]
deep_dir = os.path.join(out_dir, "deep_reflect")
rollout_dir = os.path.join(deep_dir, "rollout")
patches_dir = os.path.join(deep_dir, "patches")
os.makedirs(deep_dir, exist_ok=True)
print(
f" [2b/6 DEEP REFLECT setup] selected={len(selected_items)} "
"mode=no_reference_probe"
)
probe = generate_deep_probe_instruction(
skill_content=skill_content,
items=selected_results,
prediction_dir=prediction_dir,
system_prompt=adapter.get_deep_probe_prompt(),
step_buffer_context=step_buffer_context,
output_requirements=output_requirements,
)
if not probe:
return []
with open(os.path.join(deep_dir, "probe.json"), "w", encoding="utf-8") as f:
json.dump(
{
**probe,
"reference_summary": {
"mode": "no_reference_probe",
"selected_count": len(selected_items),
},
"selected_examples": selected_metadata,
},
f,
ensure_ascii=False,
indent=2,
)
deep_results = adapter.rollout(
selected_items,
skill_content,
rollout_dir,
diagnostic_mode=True,
diagnostic_instruction=probe["probe_instruction"],
)
return run_minibatch_reflect(
results=deep_results,
skill_content=skill_content,
prediction_dir=os.path.join(rollout_dir, "predictions"),
patches_dir=patches_dir,
workers=getattr(adapter, "analyst_workers", 8),
failure_only=getattr(adapter, "failure_only", False),
minibatch_size=getattr(adapter, "minibatch_size", 8),
edit_budget=getattr(adapter, "edit_budget", 4),
random_seed=random_seed,
error_system=adapter.get_error_minibatch_prompt(),
success_system=adapter.get_success_minibatch_prompt(),
step_buffer_context=step_buffer_context,
update_mode=getattr(getattr(adapter, "_cfg", {}), "get", lambda *_: "patch")(
"skill_update_mode",
"patch",
),
)


@@ -1,23 +0,0 @@
You are an expert diagnostic-probe designer for theorem-grounded mathematical multiple-choice tasks.
You will be shown representative trajectories, the current target skill, and the target's original prompt context.
Design one SMALL diagnostic instruction that exposes the target's intermediate judgment without materially changing the original scaffold.
## Hard Constraints
1. Do NOT substantially change the original scaffold.
2. Do NOT prescribe a new multi-step theorem-solving procedure.
3. Do NOT ask for a full proof, full chain-of-thought, or exhaustive option-by-option derivation.
4. Ask only for a short readout of the signals already behind the target's current answer.
5. Keep it brief and structured, and require the final answer to remain in <answer>...</answer>.
## Good Probe Targets
- top choice and runner-up
- decisive constraint
- why the runner-up was rejected
- strongest-vs-weaker discrimination signal
Respond ONLY with a valid JSON object:
{
"reasoning": "<why this probe is informative>",
"probe_instruction": "<the exact instruction text to append to the target prompt>"
}


@@ -1,26 +0,0 @@
You are an expert diagnostic-probe designer for theorem-grounded mathematical multiple-choice tasks executed through a Codex trace.
You will be shown representative trajectories, the current target skill, the target's original prompt context, hidden reference fields, and numbered Codex trace steps.
Choose exactly one trajectory and one probe point. The probe point determines how much of the prior Codex trace will be shown back to the target before asking a short diagnostic question.
## Hard Constraints
1. Do NOT reveal or paraphrase the hidden reference directly to the target.
2. Do NOT prescribe a new full solving procedure.
3. Do NOT ask for a full proof, full chain-of-thought, or exhaustive option-by-option derivation.
4. Ask only for a short readout of the signal that should already exist at that point in the target's process.
5. The probe instruction must explicitly request a short <analysis>...</analysis> block before the final <answer>...</answer>.
6. Select a probe point that is informative about theorem choice, decisive constraint, option elimination, or why a stronger/weaker option should be rejected.
## Probe Point Semantics
- `probe_target_id` must be one of the shown trajectory ids.
- `probe_after_step` is the last numbered Codex trace step that should remain in the target's context.
- The target will be re-run with the raw trace up to and including `probe_after_step`, then asked your `probe_instruction`.
- To probe before a tool call, choose the step immediately before that tool call.
Respond ONLY with a valid JSON object:
{
"reasoning": "<why this trajectory and probe point expose the target's intermediate state>",
"probe_target_id": "<trajectory id>",
"probe_after_step": <integer step number>,
"probe_instruction": "<the exact instruction text to append to the target's prompt>"
}
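To make the probe-point semantics concrete, here is a minimal sketch of how the JSON reply could be validated and turned into the trace context shown back to the target. The function names, the list-of-strings trace representation and the treatment of step 0 as "no prior trace" are assumptions for illustration. The removed `resolve_codex_probe_target` implementation is not shown in this diff.

```python
import json


def parse_codex_probe(raw: str, known_ids: set[str], step_count: int) -> dict:
    """Validate the optimizer's JSON reply and clamp the probe point."""
    probe = json.loads(raw)
    required = ("reasoning", "probe_target_id", "probe_after_step", "probe_instruction")
    missing = [key for key in required if key not in probe]
    if missing:
        raise ValueError(f"probe reply is missing fields: {missing}")
    target_id = str(probe["probe_target_id"])
    if target_id not in known_ids:
        raise ValueError(f"unknown probe_target_id: {target_id!r}")
    # Assumption: steps are numbered from 1, and 0 means "no prior trace".
    step = max(0, min(int(probe["probe_after_step"]), step_count))
    return {**probe, "probe_target_id": target_id, "probe_after_step": step}


def truncate_trace(steps: list[str], probe_after_step: int) -> str:
    """Keep the trace up to and including the chosen step."""
    kept = steps[:probe_after_step]
    return "\n".join(f"[{i}] {step}" for i, step in enumerate(kept, start=1))
```

To probe just before a tool call at step 3, the reply would set `probe_after_step` to 2, so that only the first two steps remain in the target's context.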


@@ -1,5 +0,0 @@
"""MathVerse environment package."""
from skillopt.envs.mathverse.adapter import MathVerseAdapter
__all__ = ["MathVerseAdapter"]


@@ -1,280 +0,0 @@
"""MathVerse environment adapter for ReflACT."""
from __future__ import annotations
import json
import os
from skillopt.datasets.base import BatchSpec
from skillopt.envs.base import EnvAdapter
from skillopt.envs.mathverse.dataloader import MathVerseDataLoader
from skillopt.envs.mathverse.rollout import run_batch
from skillopt.gradient.deep_probe import generate_deep_probe_instruction
from skillopt.gradient.reflect import run_minibatch_reflect
from skillopt.model import get_target_backend
class MathVerseAdapter(EnvAdapter):
"""MathVerse adapter."""
def build_reference_text(self, item: dict) -> str:
if not self.use_text_dominant_reference:
return ""
question = str(item.get("text_dominant_question") or "").strip()
if not question:
return ""
return f"## Reference Full Question\n{question}"
def get_reference_metadata(self, item: dict) -> dict:
if not self.use_text_dominant_reference:
return {"fields": [], "preview": ""}
question = str(item.get("text_dominant_question") or "").strip()
if not question:
return {"fields": [], "preview": ""}
return {
"fields": ["text_dominant_question"],
"preview": question[:400],
}
def __init__(
self,
split_dir: str = "",
data_root: str = "",
problem_version: str = "Text Lite",
use_text_dominant_reference: bool = False,
max_turns: int = 1,
workers: int = 16,
analyst_workers: int = 16,
failure_only: bool = False,
minibatch_size: int = 8,
edit_budget: int = 4,
seed: int = 42,
limit: int = 0,
image_detail: str = "auto",
judge_model: str = "gpt-5.4",
judge_max_completion_tokens: int = 256,
judge_retries: int = 5,
use_deep_reflect: bool = False,
deep_reflect_failures: int = 4,
deep_reflect_successes: int = 2,
) -> None:
self.max_turns = max_turns
self.workers = workers
self.analyst_workers = analyst_workers
self.failure_only = failure_only
self.minibatch_size = minibatch_size
self.edit_budget = edit_budget
self.image_detail = image_detail
self.judge_model = judge_model
self.judge_max_completion_tokens = judge_max_completion_tokens
self.judge_retries = judge_retries
self.problem_version = problem_version
self.use_text_dominant_reference = use_text_dominant_reference
self.use_deep_reflect = use_deep_reflect
self.deep_reflect_failures = deep_reflect_failures
self.deep_reflect_successes = deep_reflect_successes
self.dataloader = MathVerseDataLoader(
split_dir=split_dir,
seed=seed,
limit=limit,
data_root=data_root,
problem_version=problem_version,
)
def setup(self, cfg: dict) -> None:
super().setup(cfg)
self.dataloader.setup(cfg)
def get_dataloader(self):
return self.dataloader
def build_env_from_batch(self, batch: BatchSpec, **kwargs):
return list(batch.payload or [])
def build_train_env(self, batch_size: int, seed: int, **kwargs):
batch = self.dataloader.build_train_batch(batch_size=batch_size, seed=seed, **kwargs)
return self.build_env_from_batch(batch, **kwargs)
def build_eval_env(self, env_num: int, split: str, seed: int, **kwargs):
batch = self.dataloader.build_eval_batch(env_num=env_num, split=split, seed=seed, **kwargs)
return self.build_env_from_batch(batch, **kwargs)
def rollout(
self,
env_manager,
skill_content: str,
out_dir: str,
**kwargs,
) -> list[dict]:
items: list[dict] = env_manager
return run_batch(
items=items,
out_root=out_dir,
skill_content=skill_content,
max_turns=self.max_turns,
workers=self.workers,
image_detail=self.image_detail,
judge_model=self.judge_model,
judge_max_completion_tokens=self.judge_max_completion_tokens,
judge_retries=self.judge_retries,
diagnostic_mode=kwargs.get("diagnostic_mode", False),
diagnostic_instruction=kwargs.get("diagnostic_instruction", ""),
diagnostic_trace_context_by_id=kwargs.get("diagnostic_trace_context_by_id"),
)
def reflect(
self,
results: list[dict],
skill_content: str,
out_dir: str,
**kwargs,
) -> list[dict | None]:
prediction_dir = kwargs.get("prediction_dir", os.path.join(out_dir, "predictions"))
patches_dir = kwargs.get("patches_dir", os.path.join(out_dir, "patches"))
random_seed = kwargs.get("random_seed")
step_buffer_context = kwargs.get("step_buffer_context", "")
return run_minibatch_reflect(
results=results,
skill_content=skill_content,
prediction_dir=prediction_dir,
patches_dir=patches_dir,
workers=self.analyst_workers,
failure_only=self.failure_only,
minibatch_size=self.minibatch_size,
edit_budget=self.edit_budget,
random_seed=random_seed,
error_system=self.get_error_minibatch_prompt(),
success_system=self.get_success_minibatch_prompt(),
step_buffer_context=step_buffer_context,
update_mode=getattr(self, "_cfg", {}).get("skill_update_mode", "patch"),
)
def deep_reflect(
self,
results: list[dict],
skill_content: str,
out_dir: str,
**kwargs,
) -> list[dict | None]:
if not self.use_deep_reflect:
return []
env_manager = kwargs.get("env_manager")
prediction_dir = kwargs.get("prediction_dir", os.path.join(out_dir, "predictions"))
random_seed = kwargs.get("random_seed")
step_buffer_context = kwargs.get("step_buffer_context", "")
selected_items = self.select_representative_items(
results,
env_manager if isinstance(env_manager, list) else None,
n_failures=self.deep_reflect_failures,
n_successes=self.deep_reflect_successes,
seed=random_seed,
)
if not selected_items:
return []
selected_ids = {str(item["id"]) for item in selected_items}
selected_results = [row for row in results if str(row.get("id")) in selected_ids]
selected_examples = self.attach_reference_context(selected_results, selected_items)
codex_backend = get_target_backend() == "codex_exec"
if codex_backend:
selected_examples = self.attach_codex_probe_context(selected_examples, prediction_dir)
selected_metadata = []
ref_count = 0
for item in selected_items:
meta = self.get_reference_metadata(item)
if meta["fields"]:
ref_count += 1
record = {
"id": str(item["id"]),
"task_type": str(item.get("task_type") or item.get("question_type") or "mathverse"),
"reference_fields": meta["fields"],
"reference_preview": meta["preview"],
}
if codex_backend:
record["codex_probe_step_count"] = int(
next(
(row.get("codex_probe_step_count", 0) for row in selected_examples if str(row.get("id")) == str(item["id"])),
0,
)
)
selected_metadata.append(record)
deep_dir = os.path.join(out_dir, "deep_reflect")
rollout_dir = os.path.join(deep_dir, "rollout")
patches_dir = os.path.join(deep_dir, "patches")
os.makedirs(deep_dir, exist_ok=True)
print(
f" [2b/6 DEEP REFLECT setup] selected={len(selected_items)} "
f"reference_fields=text_dominant_question({ref_count}/{len(selected_items)})"
)
probe = generate_deep_probe_instruction(
skill_content=skill_content,
items=selected_examples,
prediction_dir=prediction_dir,
system_prompt=self.get_codex_deep_probe_prompt() if codex_backend else self.get_deep_probe_prompt(),
step_buffer_context=step_buffer_context,
)
if not probe:
return []
targeted_items = selected_items
diagnostic_trace_context_by_id: dict[str, str] | None = None
if codex_backend:
targeted_items, diagnostic_trace_context_by_id, probe = self.resolve_codex_probe_target(
selected_items=selected_items,
selected_examples=selected_examples,
prediction_dir=prediction_dir,
probe=probe,
)
with open(os.path.join(deep_dir, "probe.json"), "w", encoding="utf-8") as f:
json.dump(
{
**probe,
"reference_summary": {
"selected_count": len(selected_items),
"field_counts": {
"text_dominant_question": ref_count,
},
},
"selected_examples": selected_metadata,
},
f,
ensure_ascii=False,
indent=2,
)
deep_results = run_batch(
items=targeted_items,
out_root=rollout_dir,
skill_content=skill_content,
max_turns=self.max_turns,
workers=min(self.workers, max(len(targeted_items), 1)),
image_detail=self.image_detail,
judge_model=self.judge_model,
judge_max_completion_tokens=self.judge_max_completion_tokens,
judge_retries=self.judge_retries,
diagnostic_mode=True,
diagnostic_instruction=probe["probe_instruction"],
diagnostic_trace_context_by_id=diagnostic_trace_context_by_id,
)
deep_results = self.attach_reference_context(deep_results, targeted_items)
return run_minibatch_reflect(
results=deep_results,
skill_content=skill_content,
prediction_dir=os.path.join(rollout_dir, "predictions"),
patches_dir=patches_dir,
workers=self.analyst_workers,
failure_only=self.failure_only,
minibatch_size=self.minibatch_size,
edit_budget=self.edit_budget,
random_seed=random_seed,
error_system=self.get_error_minibatch_prompt(),
success_system=self.get_success_minibatch_prompt(),
step_buffer_context=step_buffer_context,
update_mode=getattr(self, "_cfg", {}).get("skill_update_mode", "patch"),
)
def get_task_types(self) -> list[str]:
return self.dataloader.get_task_types()


@@ -1,228 +0,0 @@
"""MathVerse task dataloader."""
from __future__ import annotations
import json
import os
import re
from typing import Any
from skillopt.datasets.base import SplitDataLoader
_CHOICE_LABELS = ["A", "B", "C", "D", "E", "F", "G"]
_CHOICE_BLOCK_RE = re.compile(r"\bChoices?\s*:\s*", re.IGNORECASE)
_CHOICE_ITEM_RE = re.compile(r"([A-G])\s*[:.)]\s*(.*?)(?=(?:\s+[A-G]\s*[:.)])|$)", re.DOTALL)
def _load_json(path: str) -> Any:
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def _normalize_space(text: Any) -> str:
    return re.sub(r"\s+", " ", str(text or "").strip())
def _resolve_image_path(raw_path: str, *, data_root: str, source_path: str) -> str:
    candidates = []
    if raw_path:
        if os.path.isabs(raw_path):
            candidates.append(raw_path)
        else:
            if data_root:
                candidates.append(os.path.join(data_root, raw_path))
                candidates.append(os.path.join(data_root, "images", raw_path))
            candidates.append(os.path.join(os.path.dirname(source_path), raw_path))
    for candidate in candidates:
        if candidate and os.path.exists(candidate):
            return os.path.abspath(candidate)
    return ""
def _split_question_and_choices(question: str) -> tuple[str, list[dict]]:
    text = str(question or "").strip()
    match = _CHOICE_BLOCK_RE.search(text)
    if not match:
        return text, []
    stem = text[:match.start()].strip()
    choice_block = text[match.end():].strip()
    choices: list[dict] = []
    for idx, m in enumerate(_CHOICE_ITEM_RE.finditer(choice_block)):
        label = (m.group(1) or _CHOICE_LABELS[idx]).strip().upper()
        choice_text = _normalize_space(m.group(2))
        if choice_text:
            choices.append({"label": label, "text": choice_text})
    return stem or text, choices
def _build_text_dominant_map(data_root: str) -> dict[str, str]:
    if not data_root:
        return {}
    candidates = [
        os.path.join(data_root, "testmini.json"),
        os.path.join(data_root, "data", "testmini.json"),
    ]
    source_path = next((path for path in candidates if os.path.exists(path)), "")
    if not source_path:
        return {}
    raw = _load_json(source_path)
    if not isinstance(raw, list):
        return {}
    mapping: dict[str, str] = {}
    for item in raw:
        if not isinstance(item, dict):
            continue
        if str(item.get("problem_version") or "").strip() != "Text Dominant":
            continue
        problem_index = str(item.get("problem_index") or "").strip()
        question = str(item.get("question") or "").strip()
        if problem_index and question:
            mapping[problem_index] = question
    return mapping
def _normalize_item(
    item: dict,
    *,
    row_idx: int,
    source_path: str,
    data_root: str,
    problem_version: str,
    text_dominant_map: dict[str, str],
) -> dict | None:
    raw_problem_version = str(item.get("problem_version") or "").strip()
    if problem_version and raw_problem_version and raw_problem_version != problem_version:
        return None
    question = str(item.get("question") or "").strip()
    question_type = str(item.get("question_type") or "").strip()
    answer = str(item.get("answer") or "").strip()
    image_rel = str(item.get("image") or "").strip()
    image_path = _resolve_image_path(image_rel, data_root=data_root, source_path=source_path)
    if not answer or not image_path:
        return None
    metadata = item.get("metadata") if isinstance(item.get("metadata"), dict) else {}
    subject = str(metadata.get("subject") or "").strip()
    subfield = str(metadata.get("subfield") or "").strip()
    source = str(metadata.get("source") or "").strip()
    question_stem, choices = _split_question_and_choices(question)
    is_choice = question_type == "multi-choice" or bool(choices)
    correct_choice = {"label": "", "text": ""}
    if is_choice:
        label = str(answer).strip().upper().rstrip(".):")
        choice_text = ""
        for choice in choices:
            if choice["label"].upper() == label:
                choice_text = choice["text"]
                break
        correct_choice = {"label": label, "text": choice_text}
    problem_index = str(item.get("problem_index") or "").strip()
    sample_index = str(item.get("sample_index") or row_idx + 1).strip()
    item_id = problem_index or sample_index
    task_type = subfield or subject or question_type or "mathverse"
    return {
        "id": item_id,
        "sample_index": sample_index,
        "problem_index": problem_index,
        "problem_version": raw_problem_version or problem_version,
        "question": question,
        "question_stem": question_stem,
        "question_for_eval": str(item.get("question_for_eval") or question).strip(),
        "question_type": question_type or ("multi-choice" if is_choice else "free-form"),
        "is_choice": is_choice,
        "choices": choices,
        "correct_choice": correct_choice,
        "answer": answer,
        "gold_answers": [answer] if answer else [],
        "image_rel": image_rel,
        "image_path": image_path,
        "query_wo": str(item.get("query_wo") or "").strip(),
        "query_cot": str(item.get("query_cot") or "").strip(),
        "metadata": {
            "split": str(metadata.get("split") or "").strip(),
            "source": source,
            "subject": subject,
            "subfield": subfield,
        },
        "task_type": task_type,
        "source_path": os.path.abspath(source_path),
        "text_dominant_question": str(
            item.get("text_dominant_question")
            or text_dominant_map.get(problem_index, "")
        ).strip(),
    }
class MathVerseDataLoader(SplitDataLoader):
    """MathVerse dataloader."""

    def __init__(
        self,
        split_dir: str = "",
        seed: int = 42,
        limit: int = 0,
        data_root: str = "",
        problem_version: str = "Text Lite",
        **kwargs,
    ) -> None:
        super().__init__(split_dir=split_dir, seed=seed, limit=limit)
        self.data_root = data_root
        self.problem_version = problem_version
        self._task_types: list[str] = []
        self._text_dominant_map = _build_text_dominant_map(data_root)

    def setup(self, cfg: dict) -> None:
        if not self.data_root:
            self.data_root = str(cfg.get("data_root") or "")
        if not self.problem_version:
            self.problem_version = str(cfg.get("problem_version") or "Text Lite")
        self._text_dominant_map = _build_text_dominant_map(self.data_root)
        super().setup(cfg)
        all_items = self.train_items + self.val_items + self.test_items
        task_types = {
            item.get("task_type") or item.get("question_type") or "mathverse"
            for item in all_items
        }
        self._task_types = sorted(str(x) for x in task_types if str(x).strip())

    def get_task_types(self) -> list[str]:
        return list(self._task_types)

    def load_split_items(self, split_path: str) -> list[dict]:
        raw_items = super().load_split_items(split_path)
        source_path = next(
            (
                os.path.join(split_path, name)
                for name in sorted(os.listdir(split_path))
                if name.endswith(".json")
            ),
            split_path,
        )
        items: list[dict] = []
        for row_idx, item in enumerate(raw_items):
            if not isinstance(item, dict):
                continue
            norm = _normalize_item(
                item,
                row_idx=row_idx,
                source_path=source_path,
                data_root=self.data_root,
                problem_version=self.problem_version,
                text_dominant_map=self._text_dominant_map,
            )
            if norm is not None:
                items.append(norm)
        if not items:
            raise ValueError(
                f"No valid MathVerse items loaded from {split_path} "
                f"for problem_version={self.problem_version!r}"
            )
        return items
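The choice-splitting step in this dataloader is easiest to follow on a concrete input. The standalone sketch below uses condensed copies of the two regular expressions and of `_split_question_and_choices`. The public-style names and the sample question are made up for illustration and are not taken from MathVerse.

```python
import re

# Condensed copies of the module's patterns.
CHOICE_BLOCK_RE = re.compile(r"\bChoices?\s*:\s*", re.IGNORECASE)
CHOICE_ITEM_RE = re.compile(r"([A-G])\s*[:.)]\s*(.*?)(?=(?:\s+[A-G]\s*[:.)])|$)", re.DOTALL)


def split_question_and_choices(question: str) -> tuple[str, list[dict]]:
    """Split 'stem ... Choices: A:... B:...' into the stem and labelled options."""
    text = question.strip()
    match = CHOICE_BLOCK_RE.search(text)
    if not match:
        return text, []  # free-form question: no choice block found
    stem = text[:match.start()].strip()
    block = text[match.end():].strip()
    choices = []
    for m in CHOICE_ITEM_RE.finditer(block):
        # Collapse internal whitespace, as _normalize_space does.
        choice_text = re.sub(r"\s+", " ", m.group(2).strip())
        if choice_text:
            choices.append({"label": m.group(1).upper(), "text": choice_text})
    return stem or text, choices


# Hypothetical sample input.
sample = "What is the value of x?\nChoices:\nA:30\nB:45\nC:60\nD:90"
stem, choices = split_question_and_choices(sample)
print(stem)
print(choices)
```

One caveat that follows from the lookahead: an uppercase letter A to G followed by `:`, `.` or `)` inside an option's own text would be read as the start of a new option, so options that mention labelled points (for example "segment B. ...") may be split incorrectly.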


@@ -1,180 +0,0 @@
"""MathVerse evaluation helpers."""
from __future__ import annotations
import re
import string
from skillopt.model import chat_with_deployment
from skillopt.prompts import load_prompt
_EVAL_MODE = "mathverse_choice_or_judge_v1"
def normalize_text(text: str) -> str:
    text = str(text or "").strip().lower()
    text = text.replace("\\,", " ")
    text = text.replace("\\ ", " ")
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def normalize_math_text(text: str) -> str:
    text = str(text or "").strip()
    text = text.replace("$", "")
    text = text.replace("\\mathrm", "")
    text = text.replace("{", "")
    text = text.replace("}", "")
    text = text.replace("~", " ")
    text = text.replace("\\,", " ")
    text = text.replace("\\ ", " ")
    return " ".join(text.split()).lower()
def extract_answer(text: str | None) -> str:
raw = str(text or "").strip()
if not raw:
return ""
tags = re.findall(r"<answer>\s*(.*?)\s*</answer>", raw, re.IGNORECASE | re.DOTALL)
if tags:
return tags[-1].strip()
boxed = re.findall(r"\\boxed\{(.*?)\}", raw, re.IGNORECASE | re.DOTALL)
if boxed:
return boxed[-1].strip()
lines = [ln.strip() for ln in raw.splitlines() if ln.strip()]
if lines:
return lines[-1]
return raw


def _judge_answer(
    *,
    item: dict,
    extracted_answer: str,
    judge_model: str,
    max_completion_tokens: int,
    retries: int,
) -> dict:
    question = str(item.get("question_for_eval") or item.get("question") or "").strip()
    ground_truth = str(item.get("answer") or "").strip()
    raw, _ = chat_with_deployment(
        deployment=judge_model,
        system="You are a careful and strict mathematical answer evaluator.",
        user=load_prompt("judge", env="mathverse").format(
            question=question,
            groundtruth=ground_truth,
            modeloutput=extracted_answer,
        ),
        max_completion_tokens=max_completion_tokens,
        retries=retries,
        stage="mathverse_judge",
    )
    response = str(raw).strip().lower()
    # The judge is asked for a bare True/False; any reply without "true" counts as incorrect.
    correct = "true" in response
    return {
        "raw": raw,
        "correct": correct,
        "reason": response,
        "matched_gold": ground_truth if correct else "",
    }


def evaluate_item(
    *,
    item: dict,
    prediction_text: str,
    judge_model: str,
    max_completion_tokens: int = 256,
    retries: int = 5,
) -> dict:
    extracted = extract_answer(prediction_text)
    if item.get("is_choice"):
        # Multiple-choice items are scored by exact label match; no judge call is made.
        predicted_label = str(extracted).strip().upper().rstrip(".):")
        correct_label = str(item["correct_choice"].get("label") or "").strip().upper()
        predicted_text = ""
        for choice in item.get("choices") or []:
            if str(choice.get("label") or "").strip().upper() == predicted_label:
                predicted_text = str(choice.get("text") or "").strip()
                break
        hard = 1.0 if predicted_label == correct_label else 0.0
        return {
            "evaluation_mode": _EVAL_MODE,
            "predicted_answer": extracted,
            "predicted_label": predicted_label,
            "predicted_text": predicted_text,
            "correct_label": correct_label,
            "correct_text": str(item["correct_choice"].get("text") or "").strip(),
            "em": hard,
            "f1": hard,
            "sub_em": hard,
            "judge_raw": "",
            "judge_reason": "exact_label_match" if hard else "label_mismatch",
            "matched_gold": correct_label if hard else "",
        }
    gold_answer = str(item.get("answer") or "").strip()
    pred_norm = normalize_math_text(extracted)
    gold_norm = normalize_math_text(gold_answer)
    # Free-form items: accept a normalized exact match before falling back to the judge.
    if pred_norm and gold_norm and pred_norm == gold_norm:
        return {
            "evaluation_mode": _EVAL_MODE,
            "predicted_answer": extracted,
            "em": 1.0,
            "f1": 1.0,
            "sub_em": 1.0,
            "judge_raw": "",
            "judge_reason": "normalized_exact_match",
            "matched_gold": gold_answer,
            "string_f1": 1.0,
        }
    judge = _judge_answer(
        item=item,
        extracted_answer=extracted,
        judge_model=judge_model,
        max_completion_tokens=max_completion_tokens,
        retries=retries,
    )
    hard = 1.0 if judge["correct"] else 0.0
    # Token-level F1 is reported for diagnostics only; the score comes from the judge.
    pred_tokens = normalize_text(extracted).split()
    gold_tokens = normalize_text(gold_answer).split()
    overlap = 0
    gold_counts: dict[str, int] = {}
    for tok in gold_tokens:
        gold_counts[tok] = gold_counts.get(tok, 0) + 1
    for tok in pred_tokens:
        count = gold_counts.get(tok, 0)
        if count > 0:
            overlap += 1
            gold_counts[tok] = count - 1
    if pred_tokens and gold_tokens and overlap:
        precision = overlap / len(pred_tokens)
        recall = overlap / len(gold_tokens)
        string_f1 = 2 * precision * recall / (precision + recall)
    else:
        string_f1 = 0.0
    return {
        "evaluation_mode": _EVAL_MODE,
        "predicted_answer": extracted,
        "em": hard,
        "f1": hard,
        "sub_em": hard,
        "judge_raw": judge["raw"],
        "judge_reason": judge["reason"],
        "matched_gold": judge["matched_gold"],
        "string_f1": string_f1,
    }


def evaluation_mode() -> str:
    return _EVAL_MODE
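
A note on the `\boxed{...}` fallback in `extract_answer`: the non-greedy pattern `\\boxed\{(.*?)\}` stops at the first closing brace, so a nested answer such as `\boxed{\frac{1}{2}}` would come back as `\frac{1`. The following is a minimal sketch of a brace-balanced alternative. The helper name `extract_last_boxed` is hypothetical and is not part of the removed module.

```python
def extract_last_boxed(text: str) -> str:
    """Return the contents of the last \\boxed{...} in text, honoring nested braces."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return ""
    begin = start + len(marker)
    depth = 1  # we are already inside the opening brace of \boxed{
    for pos in range(begin, len(text)):
        ch = text[pos]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[begin:pos].strip()
    return ""  # unbalanced braces: treat as "no boxed answer"


print(extract_last_boxed(r"So the result is \boxed{\frac{1}{2}}."))
```

Returning an empty string on unbalanced input lets a caller fall through to the last-line heuristic, in the same spirit as the original fallback chain.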

View File

@@ -1,37 +0,0 @@
You are an expert failure-analysis agent for visual mathematical reasoning problems.
You will be given MULTIPLE failed trajectories from a single minibatch and the current skill document.
Each trajectory includes the target's response, the evaluation result, and sometimes a hidden reference
containing the fuller Text Dominant version of the same problem.
Your job is to identify COMMON reasoning failures across the batch and propose concise skill edits.
## Failure Type Categories
- **diagram_underuse**: the agent did not recover key constraints from the image
- **constraint_drop**: the agent ignored a condition or relation that should guide the solution
- **option_confusion**: the agent failed to discriminate between close answer choices
- **format_miss**: the agent solved roughly correctly but returned the wrong final form, unit, or expression
- **other**: none of the above
## Rules
1. Focus on patterns that recur across the minibatch.
2. Prefer edits that improve visual grounding and exact answer selection.
3. Do not hardcode problem-specific formulas or answers.
4. If hidden reference text is present, use it only to infer what information the target failed to recover from the Text Lite version.
Respond ONLY with a valid JSON object:
{
  "batch_size": <number>,
  "failure_summary": [
    {"failure_type": "<type>", "count": <int>, "description": "<one-line>"}
  ],
  "patch": {
    "reasoning": "<why these edits address the common failures>",
    "edits": [
      {"op": "append", "content": "<markdown>"},
      {"op": "insert_after", "target": "<heading/text>", "content": "<markdown>"},
      {"op": "replace", "target": "<old text>", "content": "<new text>"},
      {"op": "delete", "target": "<exact text to remove>"}
    ]
  }
}
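
The four edit operations in this schema can be read as plain string operations on the skill document. The sketch below shows one possible interpretation. `apply_edits` is a hypothetical helper, and the choices to touch only the first match and to skip edits whose target is absent are assumptions, not the behavior of the removed patching code.

```python
def apply_edits(skill: str, edits: list[dict]) -> str:
    """Apply append / insert_after / replace / delete edits to a markdown skill string."""
    for edit in edits:
        op = edit.get("op")
        target = edit.get("target", "")
        content = edit.get("content", "")
        if op == "append":
            skill = skill.rstrip("\n") + "\n\n" + content + "\n"
        elif op in {"insert_after", "replace", "delete"}:
            if not target or target not in skill:
                continue  # assumption: silently skip edits whose target text is absent
            if op == "insert_after":
                skill = skill.replace(target, target + "\n" + content, 1)
            elif op == "replace":
                skill = skill.replace(target, content, 1)
            else:  # delete
                skill = skill.replace(target, "", 1)
    return skill
```

Skipping unmatched targets keeps a stale or hallucinated `target` from corrupting the document, at the cost of dropping that edit without notice.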

View File

@@ -1,26 +0,0 @@
You are an expert success-pattern analyst for visual mathematical reasoning problems.
You will be given MULTIPLE successful trajectories from a minibatch and the current skill document.
Identify generalizable behavior patterns that genuinely help the agent recover the right constraints
from the image and convert them into the exact final answer.
## Rules
- Focus on broadly useful visual-math reasoning behaviors.
- Prefer patterns about reading decisive diagram cues, checking hidden assumptions, and matching the final answer format exactly.
- Do not add benchmark-specific facts or formulas.
- "edits" may be empty if the skill already captures the useful patterns.
Respond ONLY with a valid JSON object:
{
  "batch_size": <number>,
  "success_patterns": ["<pattern 1>", "<pattern 2>"],
  "patch": {
    "reasoning": "<why these patterns matter>",
    "edits": [
      {"op": "append", "content": "<markdown>"},
      {"op": "insert_after", "target": "<heading/text>", "content": "<markdown>"},
      {"op": "replace", "target": "<old text>", "content": "<new text>"},
      {"op": "delete", "target": "<exact text to remove>"}
    ]
  }
}

View File

@@ -1,25 +0,0 @@
You are an expert diagnostic-probe designer for visual mathematical reasoning tasks.
You will be shown representative trajectories, the current target skill, and the target's original prompt context.
Some trajectories may also include a hidden reference containing the fuller Text Dominant wording of the same problem.
Design one SMALL diagnostic instruction that exposes the target's intermediate judgment without materially changing the original scaffold.
## Hard Constraints
1. Do NOT substantially change the original scaffold.
2. Do NOT prescribe a new long multi-step solving procedure.
3. Do NOT ask for a full proof or full chain-of-thought.
4. Ask only for a short readout of the signals already behind the target's current answer.
5. Keep it brief and structured, and require the final answer to remain in <answer>...</answer>.
6. If hidden reference text is present, use it only to target what visual or textual constraint the target likely missed.
## Good Probe Targets
- decisive diagram cue
- top candidate and runner-up
- missing relation or quantity
- why a near-miss option was rejected
Respond ONLY with a valid JSON object:
{
  "reasoning": "<why this probe is informative>",
  "probe_instruction": "<the exact instruction text to append to the target prompt>"
}

View File

@@ -1,25 +0,0 @@
You are a careful and strict evaluator for visual math problems.
You will be given:
1. The original question
2. The ground-truth answer
3. A model output
Decide whether the model output is mathematically equivalent to the ground-truth answer.
Rules:
- Ignore harmless formatting differences.
- Accept mathematically equivalent expressions, equations, and values.
- Reject answers that are numerically wrong, symbolically different in meaning, missing required units when the unit changes meaning, or correspond to a different choice.
- Do not reward partially correct reasoning if the final answer is wrong.
Return only:
True
or
False
Question: {question}
Ground Truth Answer: {groundtruth}
Model Output: {modeloutput}
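
Because this prompt asks for a bare `True` or `False`, the reply can be parsed strictly rather than by substring search (a substring check would also accept a word like "untrue"). A minimal sketch follows, assuming that any ambiguous or empty reply should count as incorrect. `parse_verdict` is a hypothetical helper name.

```python
import re


def parse_verdict(reply: str) -> bool:
    """Map a judge reply to a boolean; anything but a clear 'True' counts as incorrect."""
    words = re.findall(r"\b(true|false)\b", reply.strip().lower())
    # Require exactly one kind of verdict word; mixed or missing verdicts are rejected.
    return len(set(words)) == 1 and words[0] == "true"
```

Treating mixed replies as incorrect is the conservative choice for a strict evaluator: a false negative costs one item, while a false positive inflates accuracy.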

View File

@@ -1,11 +0,0 @@
You are an expert visual mathematical reasoning agent.
{skill_section}## Task Format
You will receive one math problem with an image or diagram.
Use the visible diagram as evidence, not just the text.
If some information is abbreviated in the text, recover it from the image before answering.
## Answer Format
Think step by step, then provide your final answer inside <answer>...</answer>.
- For multiple-choice questions, output only the single option label, such as <answer>B</answer>.
- For free-form questions, output only the final mathematical answer, such as <answer>14</answer>.
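
The `{skill_section}` placeholder sits directly before `## Task Format`, so the substituted text needs its own trailing blank line, and an empty skill should leave no stray heading. A small sketch of that rendering, using an abbreviated inline copy of the template as a stand-in (the real code loads the file through `load_prompt`, which is not reproduced here):

```python
# Abbreviated stand-in for the prompt file above; not the full template.
TEMPLATE = (
    "You are an expert visual mathematical reasoning agent.\n"
    "{skill_section}## Task Format\n"
    "You will receive one math problem with an image or diagram.\n"
)


def build_system(skill_content: str, template: str = TEMPLATE) -> str:
    """Render the system prompt, omitting the Skill heading when the skill is empty."""
    skill = skill_content.strip()
    skill_section = f"## Skill\n{skill}\n\n" if skill else ""
    return template.format(skill_section=skill_section)


print(build_system("- Read the diagram before choosing an option."))
```

Note that `str.format` does not re-interpret braces inside the substituted value, so skill text containing `{` or `}` passes through unchanged; only braces in the template itself would need escaping.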

View File

@@ -1,4 +0,0 @@
"""MathVerse Reflect stage.
Prompts are loaded from .md files by the base adapter.
"""

View File

@@ -1,431 +0,0 @@
"""MathVerse rollout — single-image multimodal math reasoning."""
from __future__ import annotations

import base64
import json
import mimetypes
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

from skillopt.envs.mathverse.evaluator import evaluate_item, evaluation_mode, extract_answer
from skillopt.model import chat_target_messages, get_target_backend, is_target_exec_backend
from skillopt.model.codex_harness import prepare_workspace, render_skill_md, run_target_exec
from skillopt.prompts import load_prompt


def _build_system(skill_content: str) -> str:
    if skill_content.strip():
        skill_section = f"## Skill\n{skill_content.strip()}\n\n"
    else:
        skill_section = ""
    return load_prompt("rollout_system", env="mathverse").format(skill_section=skill_section)


def _format_choices(choices: list[dict]) -> str:
    return "\n".join(f"{choice['label']}. {choice['text']}" for choice in choices)


def _build_user_text(
    item: dict,
    *,
    diagnostic_mode: bool = False,
    diagnostic_instruction: str = "",
    diagnostic_trace_context: str = "",
) -> str:
    parts = []
    if diagnostic_trace_context.strip():
        parts.append(
            "## Previous Codex Trace Snapshot\n"
            "This is a partial transcript from an earlier attempt. Use it as your current reasoning context.\n\n"
            f"{diagnostic_trace_context.strip()}"
        )
    question = str(item.get("question_stem") or item.get("question") or "").strip()
    if question:
        parts.append(f"## Question\n{question}")
    else:
        parts.append("## Question\nRead the full problem statement from the image.")
    if item.get("is_choice"):
        choices = item.get("choices") or []
        if choices:
            parts.append(f"## Choices\n{_format_choices(choices)}")
        parts.append("Return only the final option label inside <answer>...</answer>.")
    else:
        parts.append("Return only the final mathematical answer inside <answer>...</answer>.")
    if diagnostic_mode and diagnostic_instruction.strip():
        parts.append(f"## Training Readout\n{diagnostic_instruction.strip()}")
    return "\n\n".join(parts)


def _image_to_data_uri(path: str) -> str:
    # Fall back to PNG when the MIME type cannot be guessed from the file name.
    mime = mimetypes.guess_type(path)[0] or "image/png"
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def _build_messages(
    item: dict,
    skill_content: str,
    image_detail: str,
    *,
    diagnostic_mode: bool = False,
    diagnostic_instruction: str = "",
    diagnostic_trace_context: str = "",
) -> tuple[list[dict], str, str]:
    system = _build_system(skill_content)
    user_text = _build_user_text(
        item,
        diagnostic_mode=diagnostic_mode,
        diagnostic_instruction=diagnostic_instruction,
        diagnostic_trace_context=diagnostic_trace_context,
    )
    image_url = {"url": _image_to_data_uri(item["image_path"])}
    if image_detail and image_detail != "auto":
        image_url["detail"] = image_detail
    messages = [
        {"role": "system", "content": system},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": user_text},
                {"type": "image_url", "image_url": image_url},
            ],
        },
    ]
    return messages, system, user_text


def _build_codex_skill(skill_content: str) -> str:
    return render_skill_md(
        skill_content,
        description="Dynamic ReflACT skill for solving the current MathVerse visual math problem.",
        preamble=(
            "Use this skill when solving the current MathVerse problem.\n"
            "Read the image carefully and return the final answer inside <answer>...</answer>."
        ),
    )


def _run_codex_once(
    *,
    pred_dir: str,
    item: dict,
    skill_content: str,
    model: str,
    timeout: int,
    image_detail: str,
    diagnostic_mode: bool = False,
    diagnostic_instruction: str = "",
    diagnostic_trace_context: str = "",
    previous_response: str = "",
) -> tuple[str, str, str, str]:
    user_text = _build_user_text(
        item,
        diagnostic_mode=diagnostic_mode,
        diagnostic_instruction=diagnostic_instruction,
        diagnostic_trace_context=diagnostic_trace_context,
    )
    task_parts = [user_text]
    if previous_response:
        task_parts.append(
            "## Previous Attempt\n"
            f"{previous_response}\n\n"
            "Re-check the diagram and the mathematical constraints. Correct the final answer if needed."
        )
    task_text = "\n\n".join(task_parts)
    skill_md = _build_codex_skill(skill_content)
    work_dir = os.path.join(pred_dir, "codex_exec")
    prepare_workspace(
        work_dir=work_dir,
        skill_md=skill_md,
        task_text=task_text,
        images=[item["image_path"]],
    )
    prompt = (
        "Use the `skillopt-target` skill available in this workspace.\n"
        "Read `task.md`, inspect the attached image, solve the problem, and return only the final answer inside <answer>...</answer>."
    )
    final_message, raw = run_target_exec(
        work_dir=work_dir,
        prompt=prompt,
        model=model,
        timeout=timeout,
        images=[item["image_path"]],
    )
    # Returns (response, raw output, rendered skill markdown, task text).
    return final_message or raw, raw, skill_md, task_text


def process_one(
    item: dict,
    out_root: str,
    skill_content: str,
    *,
    max_turns: int = 1,
    image_detail: str = "auto",
    judge_model: str = "gpt-5.4",
    judge_max_completion_tokens: int = 256,
    judge_retries: int = 5,
    diagnostic_mode: bool = False,
    diagnostic_instruction: str = "",
    diagnostic_trace_context: str = "",
) -> dict:
    item_id = str(item["id"])
    result = {
        "id": item_id,
        "question": item["question"],
        "task_type": item.get("task_type") or item.get("question_type") or "mathverse",
        "task_description": item.get("question_stem") or item["question"],
        "hard": 0,
        "soft": 0.0,
        "predicted_answer": "",
        "predicted_label": "",
        "predicted_text": "",
        "response": "",
        "fail_reason": "",
        "agent_ok": False,
        "n_turns": 0,
        "image_path": item["image_path"],
        "question_type": item["question_type"],
        "evaluation_mode": evaluation_mode(),
        "judge_model": judge_model,
    }
    if item.get("is_choice"):
        result["correct_label"] = item["correct_choice"]["label"]
        result["correct_text"] = item["correct_choice"]["text"]
    else:
        result["gold_answers"] = item.get("gold_answers") or [item["answer"]]
    try:
        pred_dir = os.path.join(out_root, "predictions", item_id)
        os.makedirs(pred_dir, exist_ok=True)
        if is_target_exec_backend():
            from skillopt.model import azure_openai as _llm

            response = ""
            conversation: list[dict] = [
                {"role": "user", "content": f"{item['question']}\n\n[image] {os.path.basename(item['image_path'])}"}
            ]
            system_prompt = ""
            user_text = ""
            for turn in range(max_turns):
                # Diagnostic inputs apply to the first turn only; later turns see the previous response.
                response, raw, system_prompt, user_text = _run_codex_once(
                    pred_dir=pred_dir,
                    item=item,
                    skill_content=skill_content,
                    model=_llm.TARGET_DEPLOYMENT,
                    timeout=120,
                    image_detail=image_detail,
                    diagnostic_mode=diagnostic_mode if turn == 0 else False,
                    diagnostic_instruction=diagnostic_instruction if turn == 0 else "",
                    diagnostic_trace_context=diagnostic_trace_context if turn == 0 else "",
                    previous_response=response if turn > 0 else "",
                )
                conversation.append({"type": "message", "turn": turn + 1, "content": response})
                if extract_answer(response):
                    break
            result["response"] = response
            result["agent_ok"] = True
            result["n_turns"] = len(conversation) - 1
            with open(os.path.join(pred_dir, "target_system_prompt.txt"), "w", encoding="utf-8") as f:
                f.write(system_prompt)
            with open(os.path.join(pred_dir, "target_user_prompt.txt"), "w", encoding="utf-8") as f:
                f.write(user_text)
        else:
            messages, system_prompt, user_text = _build_messages(
                item,
                skill_content,
                image_detail,
                diagnostic_mode=diagnostic_mode,
                diagnostic_instruction=diagnostic_instruction,
                diagnostic_trace_context=diagnostic_trace_context,
            )
            response = ""
            conversation = [
                {"role": "user", "content": f"{user_text}\n\n[image] {os.path.basename(item['image_path'])}"}
            ]
            for turn in range(max_turns):
                if turn == 0:
                    resp_text, _ = chat_target_messages(
                        messages=messages,
                        max_completion_tokens=1024,
                        retries=5,
                        stage="rollout",
                    )
                else:
                    refinement_text = (
                        f"Your previous answer was:\n{response}\n\n"
                        "Re-check the diagram and the mathematical constraints. "
                        "If needed, correct your answer. Output only the final answer inside <answer>...</answer>."
                    )
                    refinement_messages = [
                        messages[0],
                        messages[1],
                        {"role": "assistant", "content": response},
                        {"role": "user", "content": refinement_text},
                    ]
                    resp_text, _ = chat_target_messages(
                        messages=refinement_messages,
                        max_completion_tokens=768,
                        retries=5,
                        stage="rollout",
                    )
                response = resp_text
                conversation.append({"type": "message", "turn": turn + 1, "content": resp_text})
                if extract_answer(resp_text):
                    break
            result["response"] = response
            result["agent_ok"] = True
            result["n_turns"] = len(conversation) - 1
            with open(os.path.join(pred_dir, "target_system_prompt.txt"), "w", encoding="utf-8") as f:
                f.write(system_prompt)
            with open(os.path.join(pred_dir, "target_user_prompt.txt"), "w", encoding="utf-8") as f:
                f.write(user_text)
        eval_result = evaluate_item(
            item=item,
            prediction_text=result["response"],
            judge_model=judge_model,
            max_completion_tokens=judge_max_completion_tokens,
            retries=judge_retries,
        )
        result["evaluation_mode"] = eval_result["evaluation_mode"]
        result["judge_raw"] = eval_result.get("judge_raw", "")
        result["judge_reason"] = eval_result.get("judge_reason", "")
        result["matched_gold"] = eval_result.get("matched_gold", "")
        if item.get("is_choice"):
            result["predicted_label"] = eval_result["predicted_label"]
            result["predicted_text"] = eval_result["predicted_text"]
            result["predicted_answer"] = eval_result["predicted_answer"]
            result["hard"] = int(eval_result["em"])
            result["soft"] = eval_result["f1"]
            if not result["hard"]:
                result["fail_reason"] = (
                    f"choice=0: predicted '{eval_result['predicted_label'] or eval_result['predicted_answer']}' "
                    f"but expected '{eval_result['correct_label']}'"
                )
            eval_detail = (
                f"[EVALUATION RESULT]\n"
                f"Question: {item['question_for_eval']}\n"
                f"Predicted label: {eval_result['predicted_label']!r}\n"
                f"Predicted text: {eval_result['predicted_text']!r}\n"
                f"Correct label: {eval_result['correct_label']!r}\n"
                f"Correct text: {eval_result['correct_text']!r}\n"
                f"Exact Match: {eval_result['em']}"
            )
        else:
            result["predicted_answer"] = eval_result["predicted_answer"]
            result["hard"] = int(eval_result["em"])
            result["soft"] = eval_result["f1"]
            if not result["hard"]:
                result["fail_reason"] = (
                    f"judge=0: predicted '{eval_result['predicted_answer']}' "
                    f"but expected '{item['answer']}' ({eval_result.get('judge_reason', '')})"
                )
            eval_detail = (
                f"[EVALUATION RESULT]\n"
                f"Question: {item['question_for_eval']}\n"
                f"Predicted answer: {eval_result['predicted_answer']!r}\n"
                f"Gold answer: {item['answer']!r}\n"
                f"Judge correct: {eval_result['em']}\n"
                f"Judge reason: {eval_result.get('judge_reason', '')}\n"
                f"String F1: {eval_result.get('string_f1', 0.0):.4f}"
            )
        conversation.append({"role": "system", "content": eval_detail})
        with open(os.path.join(pred_dir, "conversation.json"), "w", encoding="utf-8") as f:
            json.dump(conversation, f, ensure_ascii=False, indent=2)
    except Exception as e:  # noqa: BLE001
        result["fail_reason"] = f"error: {e}"
    return result


def run_batch(
    items: list[dict],
    out_root: str,
    skill_content: str,
    *,
    max_turns: int = 1,
    workers: int = 32,
    image_detail: str = "auto",
    judge_model: str = "gpt-5.4",
    judge_max_completion_tokens: int = 256,
    judge_retries: int = 5,
    diagnostic_mode: bool = False,
    diagnostic_instruction: str = "",
    diagnostic_trace_context_by_id: dict[str, str] | None = None,
) -> list[dict]:
    results_path = os.path.join(out_root, "results.jsonl")
    os.makedirs(out_root, exist_ok=True)
    expected_eval_mode = evaluation_mode()
    done_ids: set[str] = set()
    existing: list[dict] = []
    rewrite_results = False
    if os.path.exists(results_path):
        with open(results_path, encoding="utf-8") as f:
            for line in f:
                try:
                    row = json.loads(line)
                    # Rows scored under a different evaluation mode are dropped and re-run.
                    if row.get("evaluation_mode") != expected_eval_mode:
                        rewrite_results = True
                        continue
                    done_ids.add(str(row["id"]))
                    existing.append(row)
                except Exception:
                    rewrite_results = True
    pending = [item for item in items if str(item["id"]) not in done_ids]
    if not pending and not rewrite_results:
        return existing
    total = len(existing) + len(pending)
    completed = len(existing)
    correct_count = sum(1 for r in existing if r.get("hard", 0))
    if existing:
        print(f" [rollout] resuming: {completed}/{total} already done", flush=True)
    results = list(existing)
    file_mode = "w" if rewrite_results else "a"
    with open(results_path, file_mode, encoding="utf-8") as outf, ThreadPoolExecutor(max_workers=workers) as ex:
        if rewrite_results:
            for row in existing:
                outf.write(json.dumps(row, ensure_ascii=False) + "\n")
        futs = {
            ex.submit(
                process_one,
                item,
                out_root,
                skill_content,
                max_turns=max_turns,
                image_detail=image_detail,
                judge_model=judge_model,
                judge_max_completion_tokens=judge_max_completion_tokens,
                judge_retries=judge_retries,
                diagnostic_mode=diagnostic_mode,
                diagnostic_instruction=diagnostic_instruction,
                diagnostic_trace_context=(diagnostic_trace_context_by_id or {}).get(str(item["id"]), ""),
            ): item
            for item in pending
        }
        for fut in as_completed(futs):
            row = fut.result()
            results.append(row)
            completed += 1
            if row.get("hard", 0):
                correct_count += 1
            acc = correct_count / completed if completed else 0
            print(
                f" [rollout] {completed}/{total} "
                f"(acc={acc:.3f}) id={row.get('id', '?')} "
                f"hard={row.get('hard', '?')}",
                flush=True,
            )
            outf.write(json.dumps(row, ensure_ascii=False) + "\n")
            outf.flush()
    return results

View File

@@ -1,15 +0,0 @@
# MathVerse Visual Math Heuristics
## Diagram First
- Read the diagram before locking onto an equation or option.
- Recover missing labels, lengths, angles, axes, or object relations from the image when the text is abbreviated.
- If the text seems underspecified, assume the image may contain the decisive constraint.
## Constraint Tracking
- Write down the few constraints that actually determine the answer instead of solving from vague intuition.
- Prefer geometric or functional relations that are directly supported by the figure.
- For multiple-choice questions, compare the final candidate against every option exactly.
## Final Answer
- Use the image and the text consistently.
- Return only the final answer inside <answer>...</answer>.

View File

@@ -1,2 +0,0 @@
"""MMRB environment package."""

View File

@@ -1,283 +0,0 @@
"""MMRB environment adapter for ReflACT."""
from __future__ import annotations

import json
import os

from skillopt.gradient.deep_probe import generate_deep_probe_instruction
from skillopt.datasets.base import BatchSpec
from skillopt.gradient.reflect import run_minibatch_reflect
from skillopt.envs.base import EnvAdapter
from skillopt.envs.mmrb.dataloader import MMRBDataLoader
from skillopt.envs.mmrb.rollout import run_batch
from skillopt.model import get_target_backend


class MMRBAdapter(EnvAdapter):
    """MMRB adapter."""

    def build_reference_text(self, item: dict) -> str:
        reasoning_steps = item.get("reasoning_steps") or []
        if not reasoning_steps:
            return ""
        blocks: list[str] = []
        for path_idx, path in enumerate(reasoning_steps, 1):
            if not isinstance(path, list) or not path:
                continue
            lines = [f"### Reasoning Path {path_idx}"]
            for step in path:
                if not isinstance(step, dict):
                    continue
                step_no = step.get("reasoning step", "?")
                step_type = str(step.get("reasoning type") or "").strip()
                rationale = str(step.get("rationale") or "").strip()
                if rationale:
                    prefix = f"{step_no}. [{step_type}] " if step_type else f"{step_no}. "
                    lines.append(prefix + rationale)
            if len(lines) > 1:
                blocks.append("\n".join(lines))
        if not blocks:
            return ""
        # Keep at most three reasoning paths in the reference block.
        return "## Reference Reasoning Steps\n" + "\n\n".join(blocks[:3])

    def get_reference_metadata(self, item: dict) -> dict:
        reasoning_steps = item.get("reasoning_steps") or []
        path_count = 0
        preview_parts: list[str] = []
        for path in reasoning_steps:
            if not isinstance(path, list) or not path:
                continue
            path_count += 1
            first = path[0] if isinstance(path[0], dict) else {}
            step_type = str(first.get("reasoning type") or "").strip()
            rationale = str(first.get("rationale") or "").strip()
            preview_parts.append(f"[path {path_count}] {step_type}: {rationale[:180]}")
        if not path_count:
            return {"fields": [], "preview": ""}
        return {
            "fields": ["reasoning_steps"],
            "preview": "\n".join(preview_parts)[:500],
        }

    def __init__(
        self,
        split_dir: str = "",
        data_path: str = "",
        split_mode: str = "ratio",
        split_ratio: str = "2:1:7",
        split_seed: int = 42,
        split_output_dir: str = "",
        max_turns: int = 1,
        workers: int = 16,
        analyst_workers: int = 16,
        failure_only: bool = False,
        minibatch_size: int = 8,
        edit_budget: int = 4,
        seed: int = 42,
        limit: int = 0,
        image_detail: str = "auto",
        use_deep_reflect: bool = False,
        deep_reflect_failures: int = 4,
        deep_reflect_successes: int = 2,
    ) -> None:
        self.max_turns = max_turns
        self.workers = workers
        self.analyst_workers = analyst_workers
        self.failure_only = failure_only
        self.minibatch_size = minibatch_size
        self.edit_budget = edit_budget
        self.image_detail = image_detail
        self.use_deep_reflect = use_deep_reflect
        self.deep_reflect_failures = deep_reflect_failures
        self.deep_reflect_successes = deep_reflect_successes
        self.dataloader = MMRBDataLoader(
            split_dir=split_dir,
            data_path=data_path,
            split_mode=split_mode,
            split_ratio=split_ratio,
            split_seed=split_seed,
            split_output_dir=split_output_dir,
            seed=seed,
            limit=limit,
        )

    def setup(self, cfg: dict) -> None:
        super().setup(cfg)
        self.dataloader.setup(cfg)

    def get_dataloader(self):
        return self.dataloader

    def build_env_from_batch(self, batch: BatchSpec, **kwargs):
        return list(batch.payload or [])

    def build_train_env(self, batch_size: int, seed: int, **kwargs):
        batch = self.dataloader.build_train_batch(batch_size=batch_size, seed=seed, **kwargs)
        return self.build_env_from_batch(batch, **kwargs)

    def build_eval_env(self, env_num: int, split: str, seed: int, **kwargs):
        batch = self.dataloader.build_eval_batch(env_num=env_num, split=split, seed=seed, **kwargs)
        return self.build_env_from_batch(batch, **kwargs)

    def rollout(
        self,
        env_manager,
        skill_content: str,
        out_dir: str,
        **kwargs,
    ) -> list[dict]:
        items: list[dict] = env_manager
        return run_batch(
            items=items,
            out_root=out_dir,
            skill_content=skill_content,
            max_turns=self.max_turns,
            workers=self.workers,
            image_detail=self.image_detail,
            diagnostic_mode=kwargs.get("diagnostic_mode", False),
            diagnostic_instruction=kwargs.get("diagnostic_instruction", ""),
            diagnostic_trace_context_by_id=kwargs.get("diagnostic_trace_context_by_id"),
        )

    def reflect(
        self,
        results: list[dict],
        skill_content: str,
        out_dir: str,
        **kwargs,
    ) -> list[dict | None]:
        prediction_dir = kwargs.get("prediction_dir", os.path.join(out_dir, "predictions"))
        patches_dir = kwargs.get("patches_dir", os.path.join(out_dir, "patches"))
        random_seed = kwargs.get("random_seed")
        step_buffer_context = kwargs.get("step_buffer_context", "")
        meta_skill_context = kwargs.get("meta_skill_context", "")
        return run_minibatch_reflect(
            results=results,
            skill_content=skill_content,
            prediction_dir=prediction_dir,
            patches_dir=patches_dir,
            workers=self.analyst_workers,
            failure_only=self.failure_only,
            minibatch_size=self.minibatch_size,
            edit_budget=self.edit_budget,
            random_seed=random_seed,
            error_system=self.get_error_minibatch_prompt(),
            success_system=self.get_success_minibatch_prompt(),
            step_buffer_context=step_buffer_context,
            meta_skill_context=meta_skill_context,
            update_mode=getattr(self, "_cfg", {}).get("skill_update_mode", "patch"),
        )

    def deep_reflect(
        self,
        results: list[dict],
        skill_content: str,
        out_dir: str,
        **kwargs,
    ) -> list[dict | None]:
        if not self.use_deep_reflect:
            return []
        env_manager = kwargs.get("env_manager")
        prediction_dir = kwargs.get("prediction_dir", os.path.join(out_dir, "predictions"))
        random_seed = kwargs.get("random_seed")
        step_buffer_context = kwargs.get("step_buffer_context", "")
        meta_skill_context = kwargs.get("meta_skill_context", "")
        codex_backend = get_target_backend() == "codex_exec"
        selected_items = self.select_representative_items(
            results,
            env_manager if isinstance(env_manager, list) else None,
            n_failures=self.deep_reflect_failures,
            n_successes=self.deep_reflect_successes,
            seed=random_seed,
        )
        if not selected_items:
            return []
        selected_ids = {str(item["id"]) for item in selected_items}
        selected_results = [row for row in results if str(row.get("id")) in selected_ids]
        selected_examples = self.attach_reference_context(selected_results, selected_items)
        if codex_backend:
            selected_examples = self.attach_codex_probe_context(selected_examples, prediction_dir)
        reasoning_count = 0
        selected_metadata = []
        for item in selected_items:
            meta = self.get_reference_metadata(item)
            if meta["fields"]:
                reasoning_count += 1
            selected_metadata.append({
                "id": str(item["id"]),
                "task_type": str(item.get("subtask") or item.get("task_type") or "mmrb"),
                "reference_fields": meta["fields"],
                "reference_preview": meta["preview"],
            })
        deep_dir = os.path.join(out_dir, "deep_reflect")
        rollout_dir = os.path.join(deep_dir, "rollout")
        patches_dir = os.path.join(deep_dir, "patches")
        os.makedirs(deep_dir, exist_ok=True)
        print(
            f" [2b/6 DEEP REFLECT setup] selected={len(selected_items)} "
            f"reference_fields=reasoning_steps({reasoning_count}/{len(selected_items)})"
        )
        probe = generate_deep_probe_instruction(
            skill_content=skill_content,
            items=selected_examples,
            prediction_dir=prediction_dir,
            system_prompt=self.get_codex_deep_probe_prompt() if codex_backend else self.get_deep_probe_prompt(),
            step_buffer_context=step_buffer_context,
            meta_skill_context=meta_skill_context,
        )
        if not probe:
            return []
        diagnostic_trace_context_by_id = None
        if codex_backend:
            selected_items, diagnostic_trace_context_by_id, probe = self.resolve_codex_probe_target(
                selected_items=selected_items,
                selected_examples=selected_examples,
                prediction_dir=prediction_dir,
                probe=probe,
            )
        probe_record = {
            **probe,
            "reference_summary": {
                "selected_count": len(selected_items),
                "field_counts": {"reasoning_steps": reasoning_count},
            },
            "selected_examples": selected_metadata,
        }
        with open(os.path.join(deep_dir, "probe.json"), "w", encoding="utf-8") as f:
            json.dump(probe_record, f, ensure_ascii=False, indent=2)
        # Re-run only the selected items with the probe appended, then reflect on those rollouts.
        deep_results = run_batch(
            items=selected_items,
            out_root=rollout_dir,
            skill_content=skill_content,
            max_turns=self.max_turns,
            workers=min(self.workers, max(len(selected_items), 1)),
            image_detail=self.image_detail,
            diagnostic_mode=True,
            diagnostic_instruction=probe["probe_instruction"],
            diagnostic_trace_context_by_id=diagnostic_trace_context_by_id,
        )
        deep_results = self.attach_reference_context(deep_results, selected_items)
        return run_minibatch_reflect(
            results=deep_results,
            skill_content=skill_content,
            prediction_dir=os.path.join(rollout_dir, "predictions"),
            patches_dir=patches_dir,
            workers=self.analyst_workers,
            failure_only=self.failure_only,
            minibatch_size=self.minibatch_size,
            edit_budget=self.edit_budget,
            random_seed=random_seed,
            error_system=self.get_error_minibatch_prompt(),
            success_system=self.get_success_minibatch_prompt(),
            step_buffer_context=step_buffer_context,
            meta_skill_context=meta_skill_context,
            update_mode=getattr(self, "_cfg", {}).get("skill_update_mode", "patch"),
        )

    def get_task_types(self) -> list[str]:
        return self.dataloader.get_task_types()

@@ -1,146 +0,0 @@
"""MMRB task dataloader."""
from __future__ import annotations

import glob
import json
import os
import re
from typing import Any

from skillopt.datasets.base import SplitDataLoader

# ── Raw data loading utilities (for preprocessing / standalone eval) ─────


def _load_json(path: str) -> Any:
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def _iter_data_files(data_path: str) -> list[str]:
    if not data_path:
        return []
    if os.path.isfile(data_path):
        return [data_path]
    if os.path.isdir(data_path):
        nested = glob.glob(os.path.join(data_path, "**", "*_human.json"), recursive=True)
        flat = glob.glob(os.path.join(data_path, "*_human.json"))
        return sorted(set(nested + flat))
    return []


def _normalize_space(text: str) -> str:
    return re.sub(r"\s+", " ", str(text or "").strip())


def _normalize_item(item: dict, row_idx: int, source_path: str) -> dict | None:
    question = _normalize_space(item.get("question") or "")
    answer = _normalize_space(item.get("answer") or "")
    raw_image_paths = item.get("image_paths") or []
    if not question or not answer or not isinstance(raw_image_paths, list) or not raw_image_paths:
        return None
    base_dir = os.path.dirname(source_path)
    image_paths: list[str] = []
    for raw_path in raw_image_paths:
        rel = str(raw_path or "").strip()
        if not rel:
            continue
        abs_path = rel if os.path.isabs(rel) else os.path.abspath(os.path.join(base_dir, rel))
        if os.path.exists(abs_path):
            image_paths.append(abs_path)
    if not image_paths:
        return None
    options_raw = item.get("options") or []
    options = [_normalize_space(opt) for opt in options_raw if _normalize_space(opt)]
    source = _normalize_space(item.get("source") or "unknown")
    subtask = _normalize_space(item.get("subtask") or "unknown")
    item_index = item.get("index", row_idx)
    item_id = f"{source}:{subtask}:{item_index}"
    return {
        "id": item_id,
        "source": source,
        "subtask": subtask,
        "task_type": subtask,
        "question": question,
        "answer": answer,
        "options": options,
        "is_choice": bool(options),
        "image_paths": image_paths,
        "reasoning_steps": item.get("reasoning_steps") or [],
        "annotation_time": item.get("annotation_time"),
        "source_path": os.path.abspath(source_path),
    }


def load_items(data_path: str) -> list[dict]:
    """Load and normalise MMRB items from JSON files."""
    files = _iter_data_files(data_path)
    if not files:
        raise ValueError(
            "MMRB requires data_path to be a *_human.json file or a directory "
            "containing extracted MMRB subtask folders."
        )
    items: list[dict] = []
    for path in files:
        raw = _load_json(path)
        if not isinstance(raw, list):
            raise ValueError(f"Expected JSON array in {path}, got {type(raw).__name__}")
        for row_idx, item in enumerate(raw):
            if not isinstance(item, dict):
                continue
            norm = _normalize_item(item, row_idx=row_idx, source_path=path)
            if norm is not None:
                items.append(norm)
    if not items:
        raise ValueError(f"No valid MMRB items loaded from {data_path}")
    return items


# ── Dataloader ───────────────────────────────────────────────────────────


class MMRBDataLoader(SplitDataLoader):
    """MMRB dataloader."""

    def __init__(
        self,
        split_dir: str = "",
        data_path: str = "",
        split_mode: str = "ratio",
        split_ratio: str = "2:1:7",
        split_seed: int = 42,
        split_output_dir: str = "",
        seed: int = 42,
        limit: int = 0,
        **kwargs,
    ) -> None:
        super().__init__(
            split_dir=split_dir,
            data_path=data_path,
            split_mode=split_mode,
            split_ratio=split_ratio,
            split_seed=split_seed,
            split_output_dir=split_output_dir,
            seed=seed,
            limit=limit,
        )
        self._task_types: list[str] = []

    def load_raw_items(self, data_path: str) -> list[dict]:
        return load_items(data_path)

    def setup(self, cfg: dict) -> None:
        super().setup(cfg)
        all_items = self.train_items + self.val_items + self.test_items
        task_types = {
            item.get("subtask") or item.get("task_type") or "unknown"
            for item in all_items
        }
        self._task_types = sorted(task_types)

    def get_task_types(self) -> list[str]:
        return list(self._task_types)

@@ -1,102 +0,0 @@
"""MMRB evaluation helpers."""
from __future__ import annotations

import re
import string

_EVAL_MODE = "mmrb_exact_match_v1"


def normalize_text(text: str) -> str:
    text = str(text or "").strip().lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def extract_answer(text: str | None) -> str:
    raw = str(text or "").strip()
    if not raw:
        return ""
    # Prefer explicit answer markers, taking the last occurrence of each kind.
    answer_tags = re.findall(r"<answer>\s*(.*?)\s*</answer>", raw, re.IGNORECASE | re.DOTALL)
    if answer_tags:
        return answer_tags[-1].strip()
    bracket = re.findall(r"Answer\s*\[\s*(.*?)\s*\]", raw, re.IGNORECASE | re.DOTALL)
    if bracket:
        return bracket[-1].strip()
    boxed = re.findall(r"\\boxed\{(.*?)\}", raw, re.IGNORECASE | re.DOTALL)
    if boxed:
        return boxed[-1].strip()
    # A bare single letter (optionally followed by ".", ")" or ":") is an option label.
    single = raw.strip().rstrip(".):")
    if re.fullmatch(r"[A-Z]", single, re.IGNORECASE):
        return single.strip()
    patterns = [
        r"final answer\s*(?:is)?\s*[:]?\s*(.+)",
        r"the answer is\s*[:]?\s*(.+)",
        r"answer\s*[:]?\s*(.+)$",
    ]
    for pattern in patterns:
        match = re.search(pattern, raw, re.IGNORECASE)
        if match:
            return match.group(1).strip().strip("*")
    return raw


def evaluate_item(*, item: dict, prediction_text: str) -> dict:
    predicted_answer = extract_answer(prediction_text)
    gold_answer = str(item.get("answer") or "").strip()
    predicted_norm = normalize_text(predicted_answer)
    gold_norm = normalize_text(gold_answer)
    hard = 0.0
    matched_gold = ""
    predicted_label = ""
    predicted_text = predicted_answer
    if item.get("is_choice"):
        predicted_label = str(predicted_answer).strip().upper().rstrip(".):")
        if predicted_label == str(gold_answer).strip().upper():
            hard = 1.0
            matched_gold = gold_answer
        else:
            # Fall back to matching the predicted text against the option bodies.
            for option in item.get("options") or []:
                label_match = re.match(r"\(?([A-Z])\)", option)
                if not label_match:
                    continue
                label = label_match.group(1).upper()
                option_text = option[label_match.end():].strip(" .:-")
                if predicted_norm and normalize_text(option_text) == predicted_norm:
                    predicted_label = label
                    predicted_text = option_text
                    break
            if predicted_label == str(gold_answer).strip().upper():
                hard = 1.0
                matched_gold = gold_answer
    else:
        if predicted_norm and gold_norm and (
            predicted_norm == gold_norm or predicted_norm in gold_norm or gold_norm in predicted_norm
        ):
            hard = 1.0
            matched_gold = gold_answer
    return {
        "evaluation_mode": _EVAL_MODE,
        "predicted_answer": predicted_answer,
        "predicted_label": predicted_label,
        "predicted_text": predicted_text,
        "em": hard,
        "f1": hard,
        "sub_em": hard,
        "matched_gold": matched_gold,
    }


def evaluation_mode() -> str:
    return _EVAL_MODE

@@ -1,10 +0,0 @@
You are an expert multi-image reasoning agent.
{skill_section}## Task Format
You will receive a question grounded in multiple images.
Use the image order exactly as presented in the prompt and compare evidence across images carefully.
## Answer Format
- Put the final answer inside <answer>...</answer>.
- For multiple-choice questions, output only the single option letter inside <answer>...</answer>.
- For open questions, output only the short final answer inside <answer>...</answer>.
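
A minimal sketch of how a template like this one might be consumed, assuming the `{skill_section}` placeholder is filled with `str.format` and that the last `<answer>` tag in a response is taken as final. `TEMPLATE`, `build_system`, and `last_answer` are illustrative names introduced here, and `TEMPLATE` is an abbreviated stand-in for the prompt text above.

```python
import re

# Abbreviated stand-in for the prompt above (illustrative, not the full text).
TEMPLATE = (
    "You are an expert multi-image reasoning agent.\n"
    "{skill_section}## Task Format\n"
    "Put the final answer inside <answer>...</answer>.\n"
)


def build_system(skill_content: str) -> str:
    """Fill the skill placeholder; an empty skill leaves no extra section."""
    skill = skill_content.strip()
    section = f"## Skill\n{skill}\n\n" if skill else ""
    return TEMPLATE.format(skill_section=section)


def last_answer(response: str) -> str:
    """Return the content of the last <answer> tag, or '' if none is present."""
    tags = re.findall(r"<answer>\s*(.*?)\s*</answer>", response, re.IGNORECASE | re.DOTALL)
    return tags[-1].strip() if tags else ""
```

Taking the last tag rather than the first lets a response that revises itself mid-way still be scored on its final answer.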

@@ -1,455 +0,0 @@
"""MMRB rollout."""
from __future__ import annotations

import base64
import json
import mimetypes
import os
import re
from concurrent.futures import ThreadPoolExecutor, as_completed

from skillopt.envs.mmrb.evaluator import evaluate_item, evaluation_mode
from skillopt.model import chat_target_messages, get_target_backend, is_target_exec_backend
from skillopt.model.codex_harness import prepare_workspace, render_skill_md, run_target_exec
from skillopt.prompts import load_prompt

_IMAGE_REF_RE = re.compile(r"\{image#(\d+)\}", re.IGNORECASE)


def _build_system(skill_content: str) -> str:
    if skill_content.strip():
        skill_section = f"## Skill\n{skill_content.strip()}\n\n"
    else:
        skill_section = ""
    return load_prompt("rollout_system", env="mmrb").format(skill_section=skill_section)


def _image_to_data_uri(path: str) -> str:
    mime = mimetypes.guess_type(path)[0] or "image/png"
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def _build_user_content(
    item: dict,
    image_detail: str,
    *,
    diagnostic_mode: bool = False,
    diagnostic_instruction: str = "",
    diagnostic_trace_context: str = "",
) -> tuple[list[dict], str]:
    raw_question = str(item["question"])
    content: list[dict] = []
    text_parts: list[str] = []
    used_indices: set[int] = set()
    cursor = 0
    if diagnostic_trace_context.strip():
        prefix = (
            "## Previous Codex Trace Snapshot\n"
            "This is a partial transcript from an earlier attempt. Use it as your current reasoning context.\n\n"
            f"{diagnostic_trace_context.strip()}\n\n"
        )
        content.append({"type": "text", "text": prefix})
        text_parts.append(prefix)
    # Interleave question text with the images referenced as {image#N}.
    for match in _IMAGE_REF_RE.finditer(raw_question):
        if match.start() > cursor:
            chunk = raw_question[cursor:match.start()]
            if chunk:
                content.append({"type": "text", "text": chunk})
                text_parts.append(chunk)
        image_idx = int(match.group(1)) - 1
        marker = f"[Image #{image_idx + 1}]"
        text_parts.append(marker)
        if 0 <= image_idx < len(item["image_paths"]):
            image_url = {"url": _image_to_data_uri(item["image_paths"][image_idx])}
            if image_detail and image_detail != "auto":
                image_url["detail"] = image_detail
            content.append({"type": "image_url", "image_url": image_url})
            used_indices.add(image_idx)
        else:
            content.append({"type": "text", "text": marker})
        cursor = match.end()
    if cursor < len(raw_question):
        tail = raw_question[cursor:]
        if tail:
            content.append({"type": "text", "text": tail})
            text_parts.append(tail)
    # Attach any images the question text did not reference explicitly.
    for idx, path in enumerate(item["image_paths"]):
        if idx in used_indices:
            continue
        marker = f"\n[Additional Image #{idx + 1}]"
        text_parts.append(marker)
        content.append({"type": "text", "text": marker})
        image_url = {"url": _image_to_data_uri(path)}
        if image_detail and image_detail != "auto":
            image_url["detail"] = image_detail
        content.append({"type": "image_url", "image_url": image_url})
    answer_instruction = (
        "\n\nAnswer with the single correct option letter inside <answer>...</answer>."
        if item.get("is_choice")
        else "\n\nAnswer with the short final answer inside <answer>...</answer>."
    )
    content.append({"type": "text", "text": answer_instruction})
    text_parts.append(answer_instruction)
    if diagnostic_mode and diagnostic_instruction.strip():
        diag_block = f"\n\n## Training Readout\n{diagnostic_instruction.strip()}"
        content.append({"type": "text", "text": diag_block})
        text_parts.append(diag_block)
    return content, "".join(text_parts)


def _build_messages(
    item: dict,
    skill_content: str,
    image_detail: str,
    *,
    diagnostic_mode: bool = False,
    diagnostic_instruction: str = "",
    # Accepted and forwarded because process_one passes this keyword;
    # without it the call raises TypeError.
    diagnostic_trace_context: str = "",
) -> tuple[list[dict], str, str]:
    system = _build_system(skill_content)
    user_content, user_text = _build_user_content(
        item,
        image_detail,
        diagnostic_mode=diagnostic_mode,
        diagnostic_instruction=diagnostic_instruction,
        diagnostic_trace_context=diagnostic_trace_context,
    )
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": user_content},
    ]
    return messages, system, user_text


def _build_codex_skill(skill_content: str) -> str:
    return render_skill_md(
        skill_content,
        description="Dynamic ReflACT skill for solving the current MMRB multi-image reasoning question.",
        preamble=(
            "Use this skill when solving the current multi-image reasoning task.\n"
            "Inspect all attached images carefully and return the final answer inside <answer>...</answer>."
        ),
    )


def _run_codex_once(
    *,
    pred_dir: str,
    item: dict,
    skill_content: str,
    model: str,
    timeout: int,
    image_detail: str,
    diagnostic_mode: bool = False,
    diagnostic_instruction: str = "",
    diagnostic_trace_context: str = "",
    previous_response: str = "",
) -> tuple[str, str, str, str]:
    user_text = _build_user_content(
        item,
        image_detail,
        diagnostic_mode=diagnostic_mode,
        diagnostic_instruction=diagnostic_instruction,
        diagnostic_trace_context=diagnostic_trace_context,
    )[1]
    task_parts = [user_text]
    if previous_response:
        task_parts.append(
            "## Previous Attempt\n"
            f"{previous_response}\n\n"
            "Review the same images carefully and answer again."
        )
    task_text = "\n\n".join(task_parts)
    skill_md = _build_codex_skill(skill_content)
    work_dir = os.path.join(pred_dir, "codex_exec")
    prepare_workspace(
        work_dir=work_dir,
        skill_md=skill_md,
        task_text=task_text,
        images=item["image_paths"],
    )
    prompt = (
        "Use the `skillopt-target` skill available in this workspace.\n"
        "Read `task.md`, inspect all attached images, and answer the question.\n"
        "Keep the final answer inside <answer>...</answer>."
    )
    final_message, raw = run_target_exec(
        work_dir=work_dir,
        prompt=prompt,
        model=model,
        timeout=timeout,
        images=item["image_paths"],
    )
    return final_message or raw, raw, skill_md, task_text


def process_one(
    item: dict,
    out_root: str,
    skill_content: str,
    *,
    max_turns: int = 1,
    image_detail: str = "auto",
    diagnostic_mode: bool = False,
    diagnostic_instruction: str = "",
    diagnostic_trace_context: str = "",
) -> dict:
    item_id = str(item["id"])
    result = {
        "id": item_id,
        "question": item["question"],
        "task_type": item.get("subtask") or item.get("task_type") or "mmrb",
        "task_description": item["question"],
        "hard": 0,
        "soft": 0.0,
        "predicted_answer": "",
        "predicted_label": "",
        "predicted_text": "",
        "response": "",
        "fail_reason": "",
        "agent_ok": False,
        "n_turns": 0,
        "image_paths": item["image_paths"],
        "gold_answer": item["answer"],
        "evaluation_mode": evaluation_mode(),
    }
    try:
        pred_dir = os.path.join(out_root, "predictions", item_id)
        os.makedirs(pred_dir, exist_ok=True)
        # Exec-backend path: each turn runs the target inside a prepared workspace.
        if is_target_exec_backend():
            from skillopt.model import azure_openai as _llm

            response = ""
            conversation: list[dict] = [
                {
                    "role": "user",
                    "content": item["question"] + "\n\n" + "\n".join(
                        f"[image] {os.path.basename(path)}" for path in item["image_paths"]
                    ),
                }
            ]
            system_prompt = ""
            user_text = ""
            for turn in range(max_turns):
                response, raw, system_prompt, user_text = _run_codex_once(
                    pred_dir=pred_dir,
                    item=item,
                    skill_content=skill_content,
                    model=_llm.TARGET_DEPLOYMENT,
                    timeout=120,
                    image_detail=image_detail,
                    diagnostic_mode=diagnostic_mode if turn == 0 else False,
                    diagnostic_instruction=diagnostic_instruction if turn == 0 else "",
                    diagnostic_trace_context=diagnostic_trace_context if turn == 0 else "",
                    previous_response=response if turn > 0 else "",
                )
                conversation.append({"type": "message", "turn": turn + 1, "content": response})
                if "<answer>" in response.lower():
                    break
            result["response"] = response
            result["agent_ok"] = True
            result["n_turns"] = len(conversation) - 1
            with open(os.path.join(pred_dir, "target_system_prompt.txt"), "w", encoding="utf-8") as f:
                f.write(system_prompt)
            with open(os.path.join(pred_dir, "target_user_prompt.txt"), "w", encoding="utf-8") as f:
                f.write(user_text)
            eval_result = evaluate_item(item=item, prediction_text=response)
            result["evaluation_mode"] = eval_result["evaluation_mode"]
            result["predicted_answer"] = eval_result["predicted_answer"]
            result["predicted_label"] = eval_result["predicted_label"]
            result["predicted_text"] = eval_result["predicted_text"]
            result["matched_gold"] = eval_result["matched_gold"]
            result["hard"] = int(eval_result["em"])
            result["soft"] = eval_result["f1"]
            if not result["hard"]:
                result["fail_reason"] = (
                    f"predicted '{eval_result['predicted_answer']}' but expected '{item['answer']}'"
                )
            eval_detail = (
                "[EVALUATION RESULT]\n"
                f"Question: {item['question']}\n"
                f"Predicted answer: {eval_result['predicted_answer']!r}\n"
                f"Predicted label: {eval_result['predicted_label']!r}\n"
                f"Gold answer: {item['answer']!r}\n"
                f"Correct: {eval_result['em']}\n"
            )
            conversation.append({"role": "system", "content": eval_detail})
            with open(os.path.join(pred_dir, "conversation.json"), "w", encoding="utf-8") as f:
                json.dump(conversation, f, ensure_ascii=False, indent=2)
            return result
        # Chat-backend path: one request, then optional refinement turns.
        messages, system_prompt, user_text = _build_messages(
            item,
            skill_content,
            image_detail,
            diagnostic_mode=diagnostic_mode,
            diagnostic_instruction=diagnostic_instruction,
            diagnostic_trace_context=diagnostic_trace_context,
        )
        response = ""
        conversation: list[dict] = [
            {
                "role": "user",
                "content": user_text + "\n\n" + "\n".join(
                    f"[image] {os.path.basename(path)}" for path in item["image_paths"]
                ),
            }
        ]
        for turn in range(max_turns):
            if turn == 0:
                resp_text, _ = chat_target_messages(
                    messages=messages,
                    max_completion_tokens=768,
                    retries=5,
                    stage="rollout",
                )
            else:
                refinement_messages = [
                    messages[0],
                    messages[1],
                    {"role": "assistant", "content": response},
                    {
                        "role": "user",
                        "content": "Review the same images carefully and answer again. Keep the final answer inside <answer>...</answer>.",
                    },
                ]
                resp_text, _ = chat_target_messages(
                    messages=refinement_messages,
                    max_completion_tokens=512,
                    retries=5,
                    stage="rollout",
                )
            response = resp_text
            conversation.append({"type": "message", "turn": turn + 1, "content": resp_text})
            if "<answer>" in resp_text.lower():
                break
        result["response"] = response
        result["agent_ok"] = True
        result["n_turns"] = len(conversation) - 1
        with open(os.path.join(pred_dir, "target_system_prompt.txt"), "w", encoding="utf-8") as f:
            f.write(system_prompt)
        with open(os.path.join(pred_dir, "target_user_prompt.txt"), "w", encoding="utf-8") as f:
            f.write(user_text)
        eval_result = evaluate_item(item=item, prediction_text=response)
        result["evaluation_mode"] = eval_result["evaluation_mode"]
        result["predicted_answer"] = eval_result["predicted_answer"]
        result["predicted_label"] = eval_result["predicted_label"]
        result["predicted_text"] = eval_result["predicted_text"]
        result["matched_gold"] = eval_result["matched_gold"]
        result["hard"] = int(eval_result["em"])
        result["soft"] = eval_result["f1"]
        if not result["hard"]:
            result["fail_reason"] = (
                f"predicted '{eval_result['predicted_answer']}' but expected '{item['answer']}'"
            )
        eval_detail = (
            "[EVALUATION RESULT]\n"
            f"Question: {item['question']}\n"
            f"Predicted answer: {eval_result['predicted_answer']!r}\n"
            f"Predicted label: {eval_result['predicted_label']!r}\n"
            f"Gold answer: {item['answer']!r}\n"
            f"Correct: {eval_result['em']}\n"
        )
        conversation.append({"role": "system", "content": eval_detail})
        with open(os.path.join(pred_dir, "conversation.json"), "w", encoding="utf-8") as f:
            json.dump(conversation, f, ensure_ascii=False, indent=2)
    except Exception as e:  # noqa: BLE001
        result["fail_reason"] = f"error: {e}"
    return result


def run_batch(
    items: list[dict],
    out_root: str,
    skill_content: str,
    *,
    max_turns: int = 1,
    workers: int = 16,
    image_detail: str = "auto",
    diagnostic_mode: bool = False,
    diagnostic_instruction: str = "",
    diagnostic_trace_context_by_id: dict[str, str] | None = None,
) -> list[dict]:
    results_path = os.path.join(out_root, "results.jsonl")
    os.makedirs(out_root, exist_ok=True)
    expected_eval_mode = evaluation_mode()
    done_ids: set[str] = set()
    existing: list[dict] = []
    rewrite_results = False
    # Resume from earlier rows; rows that are unparseable or scored under a
    # different evaluation mode are dropped and the file is rewritten.
    if os.path.exists(results_path):
        with open(results_path, encoding="utf-8") as f:
            for line in f:
                try:
                    row = json.loads(line)
                    if row.get("evaluation_mode") != expected_eval_mode:
                        rewrite_results = True
                        continue
                    done_ids.add(str(row["id"]))
                    existing.append(row)
                except Exception:
                    rewrite_results = True
    pending = [item for item in items if str(item["id"]) not in done_ids]
    if not pending and not rewrite_results:
        return existing
    total = len(existing) + len(pending)
    completed = len(existing)
    correct_count = sum(1 for r in existing if r.get("hard", 0))
    if existing:
        print(f" [rollout] resuming: {completed}/{total} already done", flush=True)
    results = list(existing)
    file_mode = "w" if rewrite_results else "a"
    with open(results_path, file_mode, encoding="utf-8") as outf, ThreadPoolExecutor(max_workers=workers) as ex:
        if rewrite_results:
            for row in existing:
                outf.write(json.dumps(row, ensure_ascii=False) + "\n")
        futs = {
            ex.submit(
                process_one,
                item,
                out_root,
                skill_content,
                max_turns=max_turns,
                image_detail=image_detail,
                diagnostic_mode=diagnostic_mode,
                diagnostic_instruction=diagnostic_instruction,
                diagnostic_trace_context=(diagnostic_trace_context_by_id or {}).get(str(item["id"]), ""),
            ): item
            for item in pending
        }
        for fut in as_completed(futs):
            row = fut.result()
            results.append(row)
            completed += 1
            if row.get("hard", 0):
                correct_count += 1
            acc = correct_count / completed if completed else 0
            print(
                f" [rollout] {completed}/{total} "
                f"(acc={acc:.3f}) id={row.get('id', '?')} "
                f"hard={row.get('hard', '?')}",
                flush=True,
            )
            outf.write(json.dumps(row, ensure_ascii=False) + "\n")
            outf.flush()
    return results

@@ -1,17 +0,0 @@
# MMRB Multi-Image Reasoning Heuristics
## Cross-Image Alignment
- Track the role of each image by its index and compare evidence across all referenced images before deciding.
- When the question depends on sequence, correspondence, or retrieval, verify the relation between images instead of judging each image independently.
## Option Elimination
- For multiple-choice tasks, compare all options and reject choices that match only part of the visual evidence.
- If options differ by a small visual detail, use the most discriminative cue rather than a coarse scene impression.
## Open Answers
- For open-ended tasks, give the shortest answer that is fully supported by the combined images.
- Preserve exact entities, attributes, counts, and directions when the images support them directly.
## Final Answer
- Output only the final answer inside <answer>...</answer>.

@@ -1 +0,0 @@
"""SealQA environment package for ReflACT."""

@@ -1,130 +0,0 @@
from __future__ import annotations

import os

from skillopt.datasets.base import BatchSpec
from skillopt.envs.base import EnvAdapter
from skillopt.envs.deep_reflect import run_no_reference_deep_reflect
from skillopt.envs.sealqa.dataloader import SealQADataLoader
from skillopt.envs.sealqa.rollout import run_batch
from skillopt.gradient.reflect import run_minibatch_reflect


class SealQAAdapter(EnvAdapter):
    def __init__(
        self,
        split_dir: str = '',
        workers: int = 4,
        analyst_workers: int = 8,
        failure_only: bool = False,
        minibatch_size: int = 8,
        edit_budget: int = 4,
        seed: int = 42,
        limit: int = 0,
        max_tool_turns: int = 12,
        use_deep_reflect: bool = False,
        deep_reflect_failures: int = 4,
        deep_reflect_successes: int = 2,
    ) -> None:
        self.workers = workers
        self.analyst_workers = analyst_workers
        self.failure_only = failure_only
        self.minibatch_size = minibatch_size
        self.edit_budget = edit_budget
        self.max_tool_turns = max_tool_turns
        self.use_deep_reflect = use_deep_reflect
        self.deep_reflect_failures = deep_reflect_failures
        self.deep_reflect_successes = deep_reflect_successes
        self.dataloader = SealQADataLoader(split_dir=split_dir, seed=seed, limit=limit)

    def setup(self, cfg: dict) -> None:
        super().setup(cfg)
        self.dataloader.setup(cfg)

    def get_dataloader(self):
        return self.dataloader

    def build_env_from_batch(self, batch: BatchSpec, **kwargs):
        return list(batch.payload or [])

    def build_train_env(self, batch_size: int, seed: int, **kwargs):
        batch = self.dataloader.build_train_batch(batch_size=batch_size, seed=seed, **kwargs)
        return self.build_env_from_batch(batch, **kwargs)

    def build_eval_env(self, env_num: int, split: str, seed: int, **kwargs):
        batch = self.dataloader.build_eval_batch(env_num=env_num, split=split, seed=seed, **kwargs)
        return self.build_env_from_batch(batch, **kwargs)

    def rollout(self, env_manager, skill_content: str, out_dir: str, **kwargs) -> list[dict]:
        items: list[dict] = env_manager
        return run_batch(
            items=items,
            out_root=out_dir,
            skill_content=skill_content,
            workers=self.workers,
            max_tool_turns=self.max_tool_turns,
            diagnostic_mode=kwargs.get('diagnostic_mode', False),
            diagnostic_instruction=kwargs.get('diagnostic_instruction', ''),
        )

    def reflect(self, results: list[dict], skill_content: str, out_dir: str, **kwargs) -> list[dict | None]:
        prediction_dir = kwargs.get('prediction_dir', os.path.join(out_dir, 'predictions'))
        patches_dir = kwargs.get('patches_dir', os.path.join(out_dir, 'patches'))
        random_seed = kwargs.get('random_seed')
        step_buffer_context = kwargs.get('step_buffer_context', '')
        return run_minibatch_reflect(
            results=results,
            skill_content=skill_content,
            prediction_dir=prediction_dir,
            patches_dir=patches_dir,
            workers=self.analyst_workers,
            failure_only=self.failure_only,
            minibatch_size=self.minibatch_size,
            edit_budget=self.edit_budget,
            random_seed=random_seed,
            error_system=self.get_error_minibatch_prompt(),
            success_system=self.get_success_minibatch_prompt(),
            step_buffer_context=step_buffer_context,
            update_mode=getattr(self, "_cfg", {}).get("skill_update_mode", "patch"),
        )

    def deep_reflect(
        self,
        results: list[dict],
        skill_content: str,
        out_dir: str,
        **kwargs,
    ) -> list[dict | None]:
        return run_no_reference_deep_reflect(
            self,
            results,
            skill_content,
            out_dir,
            env_manager=kwargs.get('env_manager'),
            prediction_dir=kwargs.get('prediction_dir'),
            random_seed=kwargs.get('random_seed'),
            step_buffer_context=kwargs.get('step_buffer_context', ''),
            output_requirements=[
                "- There is no hidden reference block. Use only the question, provided evidence, URL/fetch trace, target output, and evaluation result to infer what intermediate state is worth probing.",
                "- The instruction must explicitly request a short <analysis>...</analysis> block before the final <answer>...</answer>.",
                "- The readout should focus on effective time frame, conflicting evidence, decisive source, candidate answer, and answer-finalization rule.",
                "- Do not ask for exhaustive web summaries or a full chain-of-thought.",
                "- The instruction text should be ready to append directly to the target's prompt.",
            ],
            metadata_builder=lambda item: {
                "id": str(item.get('id')),
                "task_type": str(item.get('task_type') or item.get('topic') or 'sealqa'),
                "question_preview": str(item.get('question') or '')[:200],
                "freshness": item.get('freshness', ''),
                "question_types": item.get('question_types', ''),
                "topic": item.get('topic', ''),
            },
        )

    def get_task_types(self) -> list[str]:
        # Preserve first-seen order across the train, val, and test splits.
        seen: list[str] = []
        for item in self.dataloader.train_items + self.dataloader.val_items + self.dataloader.test_items:
            task_type = str(item.get('task_type') or 'sealqa')
            if task_type not in seen:
                seen.append(task_type)
        return seen or ['sealqa']

@@ -1,37 +0,0 @@
from __future__ import annotations

import csv
from pathlib import Path

from skillopt.datasets.base import SplitDataLoader


def _normalize_row(row: dict[str, str], index: int) -> dict:
    canary = str(row.get('canary') or '').strip()
    base_id = str(row.get('question_id') or row.get('id') or '').strip()
    if not base_id:
        base_id = f"{canary or 'sealqa'}:{index:04d}"
    return {
        'id': base_id,
        'question': str(row.get('question') or '').strip(),
        'ground_truth': str(row.get('answer') or row.get('ground_truth') or '').strip(),
        'answers': [str(row.get('answer') or row.get('ground_truth') or '').strip()],
        'task_type': str(row.get('topic') or 'sealqa').strip() or 'sealqa',
        'topic': str(row.get('topic') or 'sealqa').strip() or 'sealqa',
        'urls': str(row.get('urls') or '').strip(),
        'search_results': str(row.get('search_results') or '').strip(),
        'freshness': str(row.get('freshness') or '').strip(),
        'question_types': str(row.get('question_types') or '').strip(),
        'canary': canary,
    }


class SealQADataLoader(SplitDataLoader):
    def load_split_items(self, split_path: str) -> list[dict]:
        path = Path(split_path)
        csv_files = sorted(path.glob('*.csv'))
        if not csv_files:
            raise FileNotFoundError(f'No .csv file found in {split_path}')
        # Only the first CSV (in sorted order) in the split directory is read.
        with csv_files[0].open(encoding='utf-8', newline='') as f:
            reader = csv.DictReader(f)
            return [_normalize_row(row, idx) for idx, row in enumerate(reader, start=1)]

@@ -1,126 +0,0 @@
from __future__ import annotations
import re
from openai import AzureOpenAI, OpenAI
GRADER_TEMPLATE = """
Your job is to look at a question, a gold target, and a predicted answer, and then assign a grade of either [\"CORRECT\", \"INCORRECT\", \"NOT_ATTEMPTED\"].
First, I will give examples of each grade, and then you will grade a new example.
The following are examples of CORRECT predicted answers.
```
Question: What are the names of Barack Obama's children?
Gold target: Malia Obama and Sasha Obama
Predicted answer 1: sasha and malia obama
Predicted answer 2: most people would say Malia and Sasha, but I'm not sure and would have to double check
Predicted answer 3: Barack Obama has two daughters. Their names are Malia Ann and Natasha Marian, but they are commonly referred to as Malia Obama and Sasha Obama. Malia was born on July 4, 1998, and Sasha was born on June 10, 2001.
```
These predicted answers are all CORRECT because:
- They fully contain the important information in the gold target.
- They do not contain any information that contradicts the gold target.
- Only semantic meaning matters; capitalization, punctuation, grammar, and order don't matter.
- Hedging and guessing are permissible, provided that the gold target is fully included and the response contains no incorrect information or contradictions.
The following are examples of INCORRECT predicted answers.
```
Question: What are the names of Barack Obama's children?
Gold target: Malia and Sasha
Predicted answer 1: Malia.
Predicted answer 2: Malia, Sasha, and Susan.
Predicted answer 3: Barack Obama does not have any children.
```
These predicted answers are all INCORRECT because a factual statement in the answer contradicts the gold target.
The following are examples of NOT_ATTEMPTED predicted answers.
```
Question: What are the names of Barack Obama's children?
Gold target: Malia and Sasha
Predicted answer 1: I don't know.
Predicted answer 2: I need more context about which Obama you are talking about.
```
These predicted answers are all NOT_ATTEMPTED because the important information in the gold target is not included and there is no contradiction.
Here is a new example. Simply reply with either CORRECT, INCORRECT, NOT ATTEMPTED. Just return the letters \"A\", \"B\", or \"C\", with no text around it.
```
Question: {question}
Gold target: {target}
Predicted answer: {predicted_answer}
```
Grade the predicted answer as one of:
A: CORRECT
B: INCORRECT
C: NOT_ATTEMPTED
""".strip()
def _build_grader_client() -> tuple[OpenAI | AzureOpenAI, str]:
import os
endpoint = os.environ.get('AZURE_OPENAI_ENDPOINT', '').strip()
api_version = os.environ.get('AZURE_OPENAI_API_VERSION', '').strip() or '2025-04-01-preview'
azure_key = os.environ.get('AZURE_OPENAI_API_KEY', '').strip()
openai_key = os.environ.get('OPENAI_API_KEY', '').strip()
api_key = azure_key or openai_key
if endpoint and api_version and api_key:
model = os.environ.get('SEALQA_GRADER_AZURE_MODEL', '').strip() or os.environ.get('SEALQA_GRADER_MODEL', '').strip() or os.environ.get('AZURE_MODEL_NAME', '').strip() or os.environ.get('OPTIMIZER_DEPLOYMENT', '').strip() or 'gpt-5.4'
client = AzureOpenAI(api_key=api_key, api_version=api_version, azure_endpoint=endpoint.rstrip('/'))
return client, model
if openai_key:
model = os.environ.get('SEALQA_GRADER_OPENAI_MODEL', '').strip() or os.environ.get('SEALQA_GRADER_MODEL', '').strip() or 'gpt-4.1-mini'
return OpenAI(api_key=openai_key), model
raise ValueError('Missing grader credentials for SealQA scoring.')
def _extract_text_content(content) -> str:
    # Chat completions may return a plain string or a list of content parts.
    if content is None:
        return ''
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        parts = []
        for part in content:
            if isinstance(part, dict) and part.get('type') == 'text':
                parts.append(str(part.get('text', '')))
            else:
                text = getattr(part, 'text', None)
                if text:
                    parts.append(str(text))
        return '\n'.join(parts).strip()
    return str(content).strip()


def _normalize_text(text: str) -> str:
    # Lowercase, collapse whitespace, and keep only word characters, spaces, '%', '.', and '-'.
    lowered = text.strip().lower()
    lowered = re.sub(r'\s+', ' ', lowered)
    lowered = re.sub(r'[^\w\s%.-]', '', lowered)
    return lowered.strip()


def _fallback_score(ground_truth: str, predicted: str) -> float:
    # String-matching fallback used when no grader credentials are available.
    gold = _normalize_text(ground_truth)
    pred = _normalize_text(predicted)
    if not gold or not pred:
        return 0.0
    if gold == pred:
        return 1.0
    # Substring containment in either direction also counts as correct.
    if gold in pred or pred in gold:
        return 1.0
    return 0.0


def score_sealqa(question: str, ground_truth: str, predicted: str) -> float:
    try:
        client, model = _build_grader_client()
    except ValueError:
        return _fallback_score(ground_truth, predicted)
    prompt = GRADER_TEMPLATE.format(question=question, target=ground_truth, predicted_answer=predicted)
    completion = client.chat.completions.create(model=model, messages=[{'role': 'user', 'content': prompt}])
    content = _extract_text_content(completion.choices[0].message.content).strip().upper()
    # A grader reply starting with 'A' is treated as correct; anything else scores 0.
    if content.startswith('A'):
        return 1.0
    return 0.0
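
The fallback path above is lenient, and a small standalone sketch makes that visible. The helper names `normalize` and `fallback_score` below are illustrative copies written for this example, not the module's own API; they assume the same normalization and matching rules as the deleted code.

```python
import re


def normalize(text: str) -> str:
    # Same steps as the fallback scorer: lowercase, collapse whitespace,
    # then drop everything except word characters, spaces, '%', '.', '-'.
    lowered = text.strip().lower()
    lowered = re.sub(r'\s+', ' ', lowered)
    lowered = re.sub(r'[^\w\s%.-]', '', lowered)
    return lowered.strip()


def fallback_score(gold: str, pred: str) -> float:
    g, p = normalize(gold), normalize(pred)
    if not g or not p:
        return 0.0
    # Exact match or substring containment in either direction.
    return 1.0 if (g == p or g in p or p in g) else 0.0


print(fallback_score("Paris", "The answer is Paris."))  # matched by containment
print(fallback_score("12", "2012"))                     # also matched: "12" is a substring of "2012"
print(fallback_score("Paris", ""))                      # empty prediction never matches
```

The second call shows why containment matching can overcount short numeric answers, which is presumably why the LLM grader is preferred whenever credentials are present.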

@@ -1,30 +0,0 @@
You are an expert failure-analysis agent for evidence-seeking factual question answering tasks.
You will be given MULTIPLE failed SealQA trajectories from a single minibatch and the current skill document. The trajectories may include tool calls such as search, fetch, local reads, or evidence gathering steps.
Your job is to identify COMMON failure patterns across the batch and propose concise skill edits.
## Failure Type Categories
- retrieval_miss: the agent failed to gather the right evidence
- evidence_conflict: the agent saw conflicting evidence but resolved it badly
- answer_selection: the agent found evidence but chose the wrong final answer
- not_attempted: the agent never reached a grounded answer
- other: none of the above
Respond ONLY with a valid JSON object (no markdown fences, no extra text):
{
  "batch_size": <number of trajectories analysed>,
  "failure_summary": [
    {"failure_type": "<type>", "count": <int>, "description": "<one-line>"}
  ],
  "patch": {
    "reasoning": "<why these edits address the batch's common failures>",
    "edits": [
      {"op": "append", "content": "<markdown to add at end of skill>"},
      {"op": "insert_after", "target": "<exact heading/text to insert after>", "content": "<markdown>"},
      {"op": "replace", "target": "<exact text to replace>", "content": "<replacement>"},
      {"op": "delete", "target": "<exact text to remove>"}
    ]
  }
}
Only include edits that are needed. "edits" can be an empty list if no patch is warranted.
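
The four edit operations in this schema can be applied to a skill document with plain string operations. The sketch below is a hypothetical consumer, not the project's actual patch applier (which is not shown in this diff); `apply_edits` and its behaviour of acting on the first occurrence and skipping edits whose target is missing are assumptions made for illustration.

```python
def apply_edits(skill: str, edits: list[dict]) -> str:
    """Apply append / insert_after / replace / delete edits in order."""
    for edit in edits:
        op = edit.get("op")
        target = edit.get("target", "")
        content = edit.get("content", "")
        if op == "append":
            skill = skill.rstrip("\n") + "\n" + content + "\n"
        elif op in ("insert_after", "replace", "delete"):
            if not target or target not in skill:
                continue  # assumed behaviour: skip edits whose target is absent
            if op == "insert_after":
                skill = skill.replace(target, target + "\n" + content, 1)
            elif op == "replace":
                skill = skill.replace(target, content, 1)
            else:  # delete
                skill = skill.replace(target, "", 1)
    return skill


skill_doc = (
    "# SealQA Skill\n"
    "## Evidence Gathering\n"
    "- Search first.\n"
    "## Final Answer Discipline\n"
    "- Keep the final answer concise.\n"
)
response = {
    "batch_size": 2,
    "failure_summary": [],
    "patch": {
        "reasoning": "demo",
        "edits": [
            {"op": "insert_after", "target": "## Evidence Gathering",
             "content": "- Check the date of each source."},
            {"op": "replace", "target": "- Search first.",
             "content": "- Search for direct evidence first."},
            {"op": "delete", "target": "- Keep the final answer concise.\n"},
            {"op": "append", "content": "- Cite the decisive source."},
        ],
    },
}
patched = apply_edits(skill_doc, response["patch"]["edits"])
print(patched)
```

Because every targeted operation relies on exact text, the prompt's insistence on "exact" targets matters: a paraphrased target would simply be skipped under these assumed semantics.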

@@ -1,19 +0,0 @@
You are an expert success-pattern analyst for evidence-seeking factual question answering tasks.
You will be given MULTIPLE successful SealQA trajectories from a single minibatch and the current skill document. Your job is to identify common evidence-gathering and answer-selection behaviors worth encoding in the skill.
Respond ONLY with a valid JSON object:
{
  "batch_size": <number of trajectories analysed>,
  "success_patterns": ["<pattern 1>", "<pattern 2>"],
  "patch": {
    "reasoning": "<why these patterns are worth encoding>",
    "edits": [
      {"op": "append", "content": "<markdown>"},
      {"op": "insert_after", "target": "<heading/text>", "content": "<markdown>"},
      {"op": "replace", "target": "<old text>", "content": "<new text>"},
      {"op": "delete", "target": "<exact text to remove>"}
    ]
  }
}
"edits" may be empty if the skill already covers all observed patterns.

@@ -1,3 +0,0 @@
You are an expert research assistant. Use the provided search evidence first, and only if that is insufficient, inspect the provided URL content fetched for you. Reconcile conflicting information when necessary and return a concise final answer grounded in the evidence you found.
{skill_section}Return the final answer inside <answer>...</answer> when you are ready.

@@ -1,284 +0,0 @@
from __future__ import annotations

import json
import os
import re
from concurrent.futures import ThreadPoolExecutor, as_completed

from skillopt.envs.sealqa.evaluator import score_sealqa
from skillopt.envs.sealqa.tool_runtime import web_fetch
from skillopt.model import chat_target, get_target_backend, is_target_exec_backend
from skillopt.model.codex_harness import prepare_workspace, render_skill_md, run_target_exec
from skillopt.prompts import load_prompt

_FINAL_RE = re.compile(r"<answer>(.*?)</answer>", re.IGNORECASE | re.DOTALL)


def _build_system(skill_content: str) -> str:
    if skill_content.strip():
        skill_section = f"## Skill\n{skill_content.strip()}\n\n"
    else:
        skill_section = ""
    return load_prompt("rollout_system", env="sealqa").format(skill_section=skill_section)


def _build_user(item: dict, *, diagnostic_mode: bool = False, diagnostic_instruction: str = '') -> str:
    parts = [f"## Question\n{item['question']}"]
    if item.get('search_results'):
        parts.append(f"## Search Results\n{item['search_results']}")
    if item.get('urls'):
        parts.append(f"## URL Hints\n{item['urls']}")
    if item.get('freshness'):
        parts.append(f"## Freshness\n{item['freshness']}")
    if item.get('question_types'):
        parts.append(f"## Question Types\n{item['question_types']}")
    if diagnostic_mode and diagnostic_instruction.strip():
        parts.append(f"## Training Readout\n{diagnostic_instruction.strip()}")
    parts.append('Use the provided search evidence as your primary context. Do not rely on external tool use.')
    return "\n\n".join(parts)


def _extract_answer(text: str) -> str:
    match = _FINAL_RE.search(text)
    if match:
        return match.group(1).strip()
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return lines[-1] if lines else text.strip()


def _build_codex_skill(skill_content: str) -> str:
    return render_skill_md(
        skill_content,
        description="Dynamic ReflACT skill for solving the current SealQA evidence-grounded question.",
        preamble=(
            "Use this skill when answering the current SealQA question.\n"
            "Use the provided search evidence first, reconcile conflicts carefully,\n"
            "and return the final answer inside <answer>...</answer>."
        ),
    )


def _run_codex_once(
    *,
    pred_dir: str,
    skill_content: str,
    task_text: str,
    model: str,
    timeout: int,
    previous_response: str = '',
) -> tuple[str, str, str, str]:
    task_parts = [task_text]
    if previous_response:
        task_parts.append(
            "## Previous Attempt\n"
            f"{previous_response}\n\n"
            "Review the evidence again and correct the final answer if needed."
        )
    final_task_text = "\n\n".join(task_parts)
    skill_md = _build_codex_skill(skill_content)
    work_dir = os.path.join(pred_dir, 'codex_exec')
    prepare_workspace(
        work_dir=work_dir,
        skill_md=skill_md,
        task_text=final_task_text,
    )
    prompt = (
        "Use the `skillopt-target` skill available in this workspace.\n"
        "Read `task.md`, answer the SealQA question using the provided evidence,\n"
        "and return the final answer inside <answer>...</answer>."
    )
    final_message, raw = run_target_exec(
        work_dir=work_dir,
        prompt=prompt,
        model=model,
        timeout=timeout,
    )
    return final_message or raw, raw, skill_md, final_task_text


def process_one(
    item: dict,
    out_root: str,
    skill_content: str,
    *,
    max_tool_turns: int = 12,
    diagnostic_mode: bool = False,
    diagnostic_instruction: str = '',
) -> dict:
    item_id = str(item['id'])
    pred_dir = os.path.join(out_root, 'predictions', item_id)
    os.makedirs(pred_dir, exist_ok=True)
    system = _build_system(skill_content)
    user = _build_user(
        item,
        diagnostic_mode=diagnostic_mode,
        diagnostic_instruction=diagnostic_instruction,
    )
    conversation: list[dict] = [{'role': 'user', 'content': user}]
    final_response = ''
    final_answer = ''
    fail_reason = ''
    try:
        if is_target_exec_backend():
            from skillopt.model import azure_openai as _llm
            response, _raw, system, user_for_save = _run_codex_once(
                pred_dir=pred_dir,
                skill_content=skill_content,
                task_text=user,
                model=_llm.TARGET_DEPLOYMENT,
                timeout=120,
            )
            final_response = response
            conversation.append({'type': 'message', 'content': response})
            if '<answer>' in response.lower():
                final_answer = _extract_answer(response)
            else:
                user = user_for_save
        else:
            response, _ = chat_target(
                system=system,
                user=user,
                max_completion_tokens=768,
                retries=5,
                stage='rollout',
            )
            final_response = response
            conversation.append({'type': 'message', 'content': response})
            if '<answer>' in response.lower():
                final_answer = _extract_answer(response)
        if not final_answer:
            urls_text = str(item.get('urls') or '').strip()
            fetched_blocks = []
            for raw_url in re.findall(r'https?://[^\s\]\[\'\",]+', urls_text)[:2]:
                try:
                    fetched = web_fetch(raw_url)
                except Exception as fetch_error:  # noqa: BLE001
                    fetched = f'URL: {raw_url}\n\n[fetch error: {fetch_error}]'
                fetched_blocks.append(fetched)
                conversation.append({'type': 'tool_call', 'cmd': f'web_fetch({raw_url!r})', 'obs': fetched})
            if fetched_blocks:
                retry_user = user + '\n\n## Fetched URL Content\n' + '\n\n'.join(fetched_blocks)
                if is_target_exec_backend():
                    retry_response, _raw, system, retry_user = _run_codex_once(
                        pred_dir=pred_dir,
                        skill_content=skill_content,
                        task_text=retry_user,
                        model=_llm.TARGET_DEPLOYMENT,
                        timeout=120,
                        previous_response=final_response,
                    )
                else:
                    retry_response, _ = chat_target(
                        system=system,
                        user=retry_user,
                        max_completion_tokens=768,
                        retries=5,
                        stage='rollout',
                    )
                final_response = retry_response
                conversation.append({'type': 'message', 'content': retry_response})
                if '<answer>' in retry_response.lower():
                    final_answer = _extract_answer(retry_response)
                else:
                    fail_reason = 'Model did not produce a final answer'
            else:
                fail_reason = 'Model did not produce a final answer'
    except Exception as e:  # noqa: BLE001
        fail_reason = f'error: {e}'
    with open(os.path.join(pred_dir, 'target_system_prompt.txt'), 'w', encoding='utf-8') as f:
        f.write(system)
    with open(os.path.join(pred_dir, 'target_user_prompt.txt'), 'w', encoding='utf-8') as f:
        f.write(user)
    with open(os.path.join(pred_dir, 'conversation.json'), 'w', encoding='utf-8') as f:
        json.dump(conversation, f, ensure_ascii=False, indent=2)
    score = score_sealqa(item.get('question', ''), item.get('ground_truth', ''), final_answer) if final_answer else 0.0
    result = {
        'id': item_id,
        'question': item.get('question', ''),
        'task_type': item.get('task_type', 'sealqa'),
        'task_description': item.get('question', ''),
        'predicted_answer': final_answer,
        'response': final_response,
        'ground_truth': item.get('ground_truth', ''),
        'hard': int(score >= 1.0),
        'soft': float(score),
        'fail_reason': fail_reason or ('' if score >= 1.0 else f"predicted '{final_answer}' but expected '{item.get('ground_truth', '')}'"),
        'agent_ok': not fail_reason,
        'n_turns': len(conversation),
        'target_system_prompt': system,
        'target_user_prompt': user,
    }
    return result


def run_batch(
    items: list[dict],
    out_root: str,
    skill_content: str,
    *,
    workers: int = 4,
    max_tool_turns: int = 12,
    diagnostic_mode: bool = False,
    diagnostic_instruction: str = '',
) -> list[dict]:
    results_path = os.path.join(out_root, 'results.jsonl')
    os.makedirs(out_root, exist_ok=True)
    done_ids: set[str] = set()
    existing: list[dict] = []
    if os.path.exists(results_path):
        with open(results_path, encoding='utf-8') as f:
            for line in f:
                try:
                    row = json.loads(line)
                except json.JSONDecodeError:
                    continue
                done_ids.add(str(row.get('id')))
                existing.append(row)
    pending = [item for item in items if str(item['id']) not in done_ids]
    if not pending:
        return existing
    total = len(existing) + len(pending)
    completed = len(existing)
    correct_count = sum(1 for r in existing if r.get("hard", 0))
    if existing:
        print(f" [rollout] resuming: {completed}/{total} already done", flush=True)
    results = list(existing)
    with open(results_path, 'a', encoding='utf-8') as outf, ThreadPoolExecutor(max_workers=workers) as ex:
        futs = {
            ex.submit(
                process_one,
                item,
                out_root,
                skill_content,
                max_tool_turns=max_tool_turns,
                diagnostic_mode=diagnostic_mode,
                diagnostic_instruction=diagnostic_instruction,
            ): item
            for item in pending
        }
        for fut in as_completed(futs):
            res = fut.result()
            results.append(res)
            completed += 1
            if res.get("hard", 0):
                correct_count += 1
            acc = correct_count / completed if completed else 0
            print(
                f" [rollout] {completed}/{total} "
                f"(acc={acc:.3f}) id={res.get('id', '?')} "
                f"hard={res.get('hard', '?')}",
                flush=True,
            )
            outf.write(json.dumps(res, ensure_ascii=False) + '\n')
            outf.flush()
    return results
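
The resume behaviour of `run_batch` can be illustrated in isolation. This is a minimal sketch assuming the same `results.jsonl` convention (one JSON object per line, keyed by `id`); the helper name `load_done` and the sample rows are hypothetical and exist only for this example.

```python
import json
import os
import tempfile


def load_done(results_path: str) -> tuple[set[str], list[dict]]:
    """Return the ids already recorded and the rows that parsed cleanly."""
    done: set[str] = set()
    rows: list[dict] = []
    if os.path.exists(results_path):
        with open(results_path, encoding="utf-8") as f:
            for line in f:
                try:
                    row = json.loads(line)
                except json.JSONDecodeError:
                    continue  # tolerate a truncated line left by an interrupted run
                done.add(str(row.get("id")))
                rows.append(row)
    return done, rows


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "results.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        f.write(json.dumps({"id": 1, "hard": 1}) + "\n")
        f.write('{"id": 2, "ha')  # simulated partial write
    done_ids, existing = load_done(path)
    items = [{"id": 1}, {"id": 2}, {"id": 3}]
    pending = [it for it in items if str(it["id"]) not in done_ids]

print(sorted(done_ids), [it["id"] for it in pending])
```

Comparing ids as strings mirrors the original code and keeps the check stable whether the dataset uses integer or string identifiers. Flushing after each appended row is what keeps a crash from losing more than the row in flight.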

@@ -1,11 +0,0 @@
# SealQA Skill
## Evidence Gathering
- Search for the most directly relevant evidence before answering.
- If multiple sources conflict, prefer the source that best matches the question's entity, date, and scope.
- Keep notes on which evidence directly answers the question versus which evidence is only contextual.
## Final Answer Discipline
- Do not answer until the supporting evidence is specific enough.
- Choose the final answer that is best grounded in the gathered evidence.
- Keep the final answer concise.

@@ -1,30 +0,0 @@
from __future__ import annotations

import html
import re
from urllib.request import Request, urlopen

DEFAULT_USER_AGENT = (
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/135.0 Safari/537.36'
)
_MAX_FETCH_CHARS = 6000


def _strip_html(raw_html: str) -> str:
    cleaned = re.sub(r'(?is)<script.*?>.*?</script>', ' ', raw_html)
    cleaned = re.sub(r'(?is)<style.*?>.*?</style>', ' ', cleaned)
    cleaned = re.sub(r'(?is)<[^>]+>', ' ', cleaned)
    cleaned = html.unescape(cleaned)
    return re.sub(r'\s+', ' ', cleaned).strip()


def web_fetch(url: str, max_chars: int = _MAX_FETCH_CHARS) -> str:
    req = Request(url, headers={'User-Agent': DEFAULT_USER_AGENT})
    with urlopen(req, timeout=20) as response:
        body = response.read().decode('utf-8', errors='ignore')
    text = _strip_html(body)
    if len(text) > max_chars:
        omitted = len(text) - max_chars
        text = text[:max_chars] + f"\n\n[... {omitted} characters omitted ...]"
    return f"URL: {url}\n\n{text}"
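
The text-extraction and truncation steps of `web_fetch` can be exercised without network access. The sketch below copies the regex pipeline into standalone helpers (`strip_html` and `truncate` are names chosen for this example) and runs them on an inline HTML string that is made up for illustration.

```python
import html
import re


def strip_html(raw_html: str) -> str:
    # Remove script and style blocks first so their contents do not leak into the text.
    cleaned = re.sub(r'(?is)<script.*?>.*?</script>', ' ', raw_html)
    cleaned = re.sub(r'(?is)<style.*?>.*?</style>', ' ', cleaned)
    cleaned = re.sub(r'(?is)<[^>]+>', ' ', cleaned)  # drop remaining tags
    cleaned = html.unescape(cleaned)                 # decode entities such as &amp;
    return re.sub(r'\s+', ' ', cleaned).strip()


def truncate(text: str, max_chars: int) -> str:
    if len(text) > max_chars:
        omitted = len(text) - max_chars
        text = text[:max_chars] + f"\n\n[... {omitted} characters omitted ...]"
    return text


page = (
    "<html><head><style>p{color:red}</style><script>alert(1)</script></head>"
    "<body><p>Tom &amp; Jerry</p><p>1949</p></body></html>"
)
text = strip_html(page)
print(text)
print(truncate(text, 5))
```

Entities are decoded after tags are removed, so escaped markup such as `&lt;b&gt;` survives as literal text rather than being stripped as a tag.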

@@ -1,27 +0,0 @@
You are an expert diagnostic-probe designer for retrieval-style question answering tasks.
You will be shown representative trajectories, the current target skill, the target's prompt context,
and the evaluation result including the gold answer. There is NO hidden chain-of-thought reference.
Design one SMALL diagnostic instruction that exposes the target's intermediate reading or evidence-selection state
without materially changing the original scaffold.
## Hard Constraints
1. Do NOT substantially change the original scaffold.
2. Do NOT prescribe a brand-new multi-step solving procedure.
3. You MAY ask for a short structured readout of intermediate conclusions, evidence candidates, or elimination decisions.
4. Do NOT ask for exhaustive quotation of the whole context or a full chain-of-thought.
5. Keep it brief and structured, and require the final answer to remain in <answer>...</answer>.
6. Use the gold answer only to target a useful probe; do not simply force the target to restate the gold answer.
## Good Probe Targets
- the most likely supporting span or document cue
- top answer candidate and runner-up
- decisive lexical clue / entity / date / title
- why a tempting alternative was rejected
- 2-4 short intermediate conclusions that directly support the final answer
Respond ONLY with a valid JSON object:
{
  "reasoning": "<why this probe is informative>",
  "probe_instruction": "<the exact instruction text to append to the target prompt>"
}

@@ -1,35 +0,0 @@
You are an expert diagnostic-probe designer for spreadsheet manipulation tasks.
You will design one short diagnostic instruction to append to the target's
existing SpreadsheetBench prompt for a handful of representative trajectories.
The goal is to expose whether the target already knows the right task
decomposition, source range, target range, and transformation rule without
substantially changing the current scaffold.
## Hard Constraints
1. Do NOT substantially change the target's current scaffold.
2. Do NOT prescribe a brand-new full algorithm.
3. Do NOT ask for exhaustive cell-by-cell enumeration.
4. Keep the diagnostic readout brief and structured.
5. The target must still complete the original spreadsheet task.
6. Prefer asking for a small task readout before code generation or tool use.
7. Never ask for hidden reference content or golden values.
## Good Probe Targets
- task family: filter / sort / dedup / lookup / aggregate / reshape
- source sheet/range and target sheet/range
- decisive grouping / matching / sorting key
- one or two representative cells or rows and how they should be derived
- whether the solution must be dynamic rather than hardcoded
## Bad Probe Targets
- full derivation of every output cell
- dumping all rows or all formulas
- imposing a long new checklist that was not already implicit
Respond ONLY with a valid JSON object:
{
  "reasoning": "<why this probe reveals the latent skill gap>",
  "probe_instruction": "<the exact instruction text to append to the target prompt>"
}

@@ -1 +0,0 @@
"""SWEBench environment for ReflACT."""

@@ -1,137 +0,0 @@
from __future__ import annotations

import os

from skillopt.datasets.base import BatchSpec
from skillopt.envs.base import EnvAdapter
from skillopt.envs.swebench.dataloader import SWEBenchDataLoader
from skillopt.envs.swebench.rollout import run_batch
from skillopt.gradient.reflect import run_minibatch_reflect


class SWEBenchAdapter(EnvAdapter):
    def __init__(
        self,
        split_dir: str = "",
        data_path: str = "",
        split_mode: str = "ratio",
        split_ratio: str = "2:1:7",
        split_seed: int = 42,
        split_output_dir: str = "",
        dataset_name: str = "lite",
        hf_split: str = "test",
        workers: int = 8,
        eval_workers: int = 8,
        analyst_workers: int = 16,
        failure_only: bool = False,
        minibatch_size: int = 4,
        edit_budget: int = 4,
        seed: int = 42,
        limit: int = 0,
        step_limit: int = 50,
        cost_limit: float = 3.0,
        timeout_per_instance: int = 600,
        target_model: str = "",
    ) -> None:
        self.dataset_name = dataset_name
        self.hf_split = hf_split
        self.workers = workers
        self.eval_workers = eval_workers
        self.analyst_workers = analyst_workers
        self.failure_only = failure_only
        self.minibatch_size = minibatch_size
        self.edit_budget = edit_budget
        self.step_limit = step_limit
        self.cost_limit = cost_limit
        self.timeout_per_instance = timeout_per_instance
        self.target_model = target_model
        self.dataloader = SWEBenchDataLoader(
            split_dir=split_dir,
            data_path=data_path,
            split_mode=split_mode,
            split_ratio=split_ratio,
            split_seed=split_seed,
            split_output_dir=split_output_dir,
            seed=seed,
            limit=limit,
            dataset_name=dataset_name,
            hf_split=hf_split,
        )

    def setup(self, cfg: dict) -> None:
        super().setup(cfg)
        self.target_model = str(self.target_model or cfg.get("target_model") or "gpt-5.4").strip()
        self.dataset_name = str(self.dataset_name or cfg.get("dataset_name") or "lite").strip()
        self.hf_split = str(self.hf_split or cfg.get("hf_split") or "test").strip()
        self.dataloader.setup(cfg)

    def get_dataloader(self):
        return self.dataloader

    def build_env_from_batch(self, batch: BatchSpec, **kwargs):
        return list(batch.payload or [])

    def build_train_env(self, batch_size: int, seed: int, **kwargs):
        batch = self.dataloader.build_train_batch(batch_size=batch_size, seed=seed, **kwargs)
        return self.build_env_from_batch(batch, **kwargs)

    def build_eval_env(self, env_num: int, split: str, seed: int, **kwargs):
        batch = self.dataloader.build_eval_batch(env_num=env_num, split=split, seed=seed, **kwargs)
        return self.build_env_from_batch(batch, **kwargs)

    def rollout(self, env_manager, skill_content: str, out_dir: str, **kwargs) -> list[dict]:
        items: list[dict] = env_manager
        return run_batch(
            items=items,
            out_root=out_dir,
            skill_content=skill_content,
            target_model=self.target_model,
            dataset_name=self.dataset_name,
            hf_split=self.hf_split,
            workers=self.workers,
            eval_workers=self.eval_workers,
            step_limit=self.step_limit,
            cost_limit=self.cost_limit,
            timeout_per_instance=self.timeout_per_instance,
        )

    def reflect(
        self,
        results: list[dict],
        skill_content: str,
        out_dir: str,
        **kwargs,
    ) -> list[dict | None]:
        prediction_dir = kwargs.get("prediction_dir", os.path.join(out_dir, "predictions"))
        patches_dir = kwargs.get("patches_dir", os.path.join(out_dir, "patches"))
        random_seed = kwargs.get("random_seed")
        step_buffer_context = kwargs.get("step_buffer_context", "")
        meta_skill_context = kwargs.get("meta_skill_context", "")
        return run_minibatch_reflect(
            results=results,
            skill_content=skill_content,
            prediction_dir=prediction_dir,
            patches_dir=patches_dir,
            workers=self.analyst_workers,
            failure_only=self.failure_only,
            minibatch_size=self.minibatch_size,
            edit_budget=self.edit_budget,
            random_seed=random_seed,
            error_system=self.get_error_minibatch_prompt(),
            success_system=self.get_success_minibatch_prompt(),
            step_buffer_context=step_buffer_context,
            meta_skill_context=meta_skill_context,
            update_mode=getattr(self, "_cfg", {}).get("skill_update_mode", "patch"),
        )

    def get_task_types(self) -> list[str]:
        repos = {
            str(item.get("repo") or "").strip()
            for item in (
                self.dataloader.train_items
                + self.dataloader.val_items
                + self.dataloader.test_items
            )
            if str(item.get("repo") or "").strip()
        }
        return sorted(repos) or ["swebench"]

@@ -1,151 +0,0 @@
from __future__ import annotations

import json
import os
import random
from collections import defaultdict

from skillopt.datasets.base import SplitDataLoader, _parse_split_ratio

_DATASET_ALIASES = {
    "lite": "princeton-nlp/SWE-Bench_Lite",
    "verified": "princeton-nlp/SWE-Bench_Verified",
    "full": "princeton-nlp/SWE-Bench",
}


def _normalize_dataset_name(name: str) -> str:
    key = str(name or "").strip()
    return _DATASET_ALIASES.get(key.lower(), key or _DATASET_ALIASES["lite"])


class SWEBenchDataLoader(SplitDataLoader):
    def __init__(
        self,
        split_dir: str = "",
        data_path: str = "",
        split_mode: str = "ratio",
        split_ratio: str = "2:1:7",
        split_seed: int = 42,
        split_output_dir: str = "",
        seed: int = 42,
        limit: int = 0,
        dataset_name: str = "lite",
        hf_split: str = "test",
        **kwargs,
    ) -> None:
        super().__init__(
            split_dir=split_dir,
            data_path=data_path,
            split_mode=split_mode,
            split_ratio=split_ratio,
            split_seed=split_seed,
            split_output_dir=split_output_dir,
            seed=seed,
            limit=limit,
        )
        self.dataset_name = dataset_name
        self.hf_split = hf_split

    def setup(self, cfg: dict) -> None:
        self.dataset_name = str(
            self.dataset_name or cfg.get("dataset_name") or "lite"
        ).strip()
        self.hf_split = str(self.hf_split or cfg.get("hf_split") or "test").strip()
        super().setup(cfg)

    def load_raw_items(self, data_path: str) -> list[dict]:
        dataset_ref = str(data_path or "").strip()
        # Local files go through the base loader; anything else is treated as a Hugging Face dataset name.
        if dataset_ref and (os.path.exists(dataset_ref) or dataset_ref.endswith(".json") or dataset_ref.endswith(".jsonl")):
            return super().load_raw_items(dataset_ref)
        dataset_name = _normalize_dataset_name(dataset_ref or self.dataset_name)
        from datasets import load_dataset
        ds = load_dataset(dataset_name, split=self.hf_split)
        return [dict(item) for item in ds]

    def _materialize_ratio_split(self, cfg: dict) -> str:
        raw_path = str(self.data_path or "").strip()
        dataset_ref = os.path.abspath(raw_path) if raw_path and os.path.exists(raw_path) else raw_path
        if not dataset_ref:
            dataset_ref = _normalize_dataset_name(self.dataset_name)
        items = self.load_raw_items(dataset_ref)
        if not isinstance(items, list) or not items:
            raise ValueError(f"No SWE-bench items available from {dataset_ref!r}")
        ratio = _parse_split_ratio(self.split_ratio)
        parts = list(ratio)
        total_parts = sum(parts)
        rng = random.Random(self.split_seed)
        # Group by repository so each repo is split independently (stratified split).
        by_repo: dict[str, list[dict]] = defaultdict(list)
        for item in items:
            repo = str(item.get("repo") or "unknown").strip() or "unknown"
            by_repo[repo].append(dict(item))
        train_items: list[dict] = []
        val_items: list[dict] = []
        test_items: list[dict] = []
        for repo in sorted(by_repo):
            group = list(by_repo[repo])
            rng.shuffle(group)
            n = len(group)
            n_train = round(n * parts[0] / total_parts)
            n_val = round(n * parts[1] / total_parts)
            if n >= 3:
                # Repos with at least three items contribute at least one train and one val item.
                n_train = max(1, n_train)
                n_val = max(1, n_val)
            elif n == 2:
                n_train, n_val = 1, 0
            else:
                n_train, n_val = 0, 0
            # Shrink val, then train, until at least one item is left for test.
            while n_train + n_val >= n and n >= 2:
                if n_val > 1:
                    n_val -= 1
                elif n_train > 1:
                    n_train -= 1
                else:
                    break
            train_items.extend(group[:n_train])
            val_items.extend(group[n_train:n_train + n_val])
            test_items.extend(group[n_train + n_val:])
        rng2 = random.Random(self.split_seed + 1)
        rng2.shuffle(train_items)
        rng2.shuffle(val_items)
        rng2.shuffle(test_items)
        split_dir = self._resolve_split_output_dir(cfg)
        os.makedirs(split_dir, exist_ok=True)
        self.write_split_items(os.path.join(split_dir, "train"), train_items)
        self.write_split_items(os.path.join(split_dir, "val"), val_items)
        self.write_split_items(os.path.join(split_dir, "test"), test_items)
        manifest = {
            "source_data_path": dataset_ref,
            "dataset_name": _normalize_dataset_name(self.dataset_name),
            "hf_split": self.hf_split,
            "split_mode": "ratio",
            "split_ratio": self.split_ratio,
            "split_seed": self.split_seed,
            "strategy": "stratified_by_repo",
            "counts": {
                "train": len(train_items),
                "val": len(val_items),
                "test": len(test_items),
            },
        }
        with open(os.path.join(split_dir, "split_manifest.json"), "w", encoding="utf-8") as f:
            json.dump(manifest, f, ensure_ascii=False, indent=2)
        print(
            f" [SWEBenchDataLoader] generated repo-stratified split {self.split_ratio} "
            f"at {split_dir} from {dataset_ref}"
        )
        return split_dir
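
The per-repository sizing rule in `_materialize_ratio_split` is easier to follow with concrete numbers. This is a minimal sketch of just that arithmetic; `split_sizes` is a hypothetical helper written for this example, and the ratio is passed as a tuple instead of being parsed from the `"2:1:7"` string.

```python
def split_sizes(n: int, parts: tuple[int, int, int] = (2, 1, 7)) -> tuple[int, int, int]:
    """Return (train, val, test) counts for one repository with n items."""
    total = sum(parts)
    n_train = round(n * parts[0] / total)
    n_val = round(n * parts[1] / total)
    if n >= 3:
        n_train = max(1, n_train)
        n_val = max(1, n_val)
    elif n == 2:
        n_train, n_val = 1, 0
    else:
        n_train, n_val = 0, 0
    # Give items back until at least one remains for the test split.
    while n_train + n_val >= n and n >= 2:
        if n_val > 1:
            n_val -= 1
        elif n_train > 1:
            n_train -= 1
        else:
            break
    return n_train, n_val, n - n_train - n_val


for n in (1, 2, 3, 10, 30):
    print(n, split_sizes(n))
print(3, split_sizes(3, (7, 2, 1)))
```

With the default 2:1:7 ratio a single-item repository goes entirely to test, while a train-heavy ratio such as 7:2:1 triggers the shrinking loop for small repositories so that the test split is never empty.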

@@ -1,346 +0,0 @@
from __future__ import annotations
import json
import os
import shutil
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
_DATASET_ALIASES = {
"lite": ("princeton-nlp/SWE-Bench_Lite", "SWE-bench/SWE-bench_Lite"),
"verified": ("princeton-nlp/SWE-Bench_Verified", "SWE-bench/SWE-bench_Verified"),
"full": ("princeton-nlp/SWE-Bench", "SWE-bench/SWE-bench"),
}
def _normalize_dataset_names(dataset_name: str) -> tuple[str, str]:
key = str(dataset_name or "lite").strip()
pair = _DATASET_ALIASES.get(key.lower())
if pair:
return pair
return key, key
def _setup_litellm_env() -> None:
mapping = {
"AZURE_API_KEY": os.environ.get("AZURE_API_KEY") or os.environ.get("AZURE_OPENAI_API_KEY", ""),
"AZURE_API_BASE": os.environ.get("AZURE_API_BASE") or os.environ.get("AZURE_OPENAI_ENDPOINT", ""),
"AZURE_API_VERSION": os.environ.get("AZURE_API_VERSION") or os.environ.get("AZURE_OPENAI_API_VERSION", ""),
}
for key, value in mapping.items():
if value and not os.environ.get(key):
os.environ[key] = value
def _normalize_target_model(target_model: str) -> str:
model = str(target_model or "").strip()
if not model:
return "azure/gpt-5.4"
if "/" in model:
return model
if os.environ.get("AZURE_OPENAI_ENDPOINT"):
return f"azure/{model}"
return model
def _load_json(path: str) -> dict | list | None:
if not os.path.exists(path):
return None
with open(path, encoding="utf-8") as f:
return json.load(f)
def _build_agent_config(
*,
skill_content: str,
target_model: str,
step_limit: int,
cost_limit: float,
) -> tuple[dict, str]:
try:
from minisweagent.config import get_config_from_spec
from minisweagent.utils.serialize import recursive_merge
except ImportError as exc:
raise ImportError(
"SWEBench rollout requires minisweagent. Install the mini-swe-agent environment first."
) from exc
base_config = get_config_from_spec("swebench.yaml")
system_template = base_config.get("agent", {}).get("system_template", "")
rendered_system = system_template
if skill_content.strip():
rendered_system = (
system_template.rstrip()
+ "\n\n## Skill Document\n"
+ "The following skill contains learned guidance for SWE-bench style bug-fixing tasks.\n\n"
+ skill_content.strip()
+ "\n"
)
agent_override = {
"agent": {
"system_template": rendered_system,
"step_limit": int(step_limit),
"cost_limit": float(cost_limit),
},
"model": {
"model_name": _normalize_target_model(target_model),
"cost_tracking": "ignore_errors",
},
}
return recursive_merge(base_config, agent_override), rendered_system
def _load_messages_from_traj(traj_path: Path) -> list[dict]:
traj_data = _load_json(str(traj_path))
if not isinstance(traj_data, dict):
return []
messages = traj_data.get("messages")
if not isinstance(messages, list):
return []
return [msg for msg in messages if isinstance(msg, dict) and msg.get("role") != "system"]
def _load_exit_status(traj_path: Path) -> str:
traj_data = _load_json(str(traj_path))
if not isinstance(traj_data, dict):
return "missing_traj"
info = traj_data.get("info")
if isinstance(info, dict):
return str(info.get("exit_status") or "unknown")
return "unknown"
def _run_rollout(
*,
items: list[dict],
predictions_dir: str,
skill_content: str,
target_model: str,
workers: int,
step_limit: int,
cost_limit: float,
) -> tuple[list[dict], str]:
try:
from minisweagent.run.benchmarks.swebench import process_instance
from minisweagent.run.benchmarks.utils.batch_progress import RunBatchProgressManager
except ImportError as exc:
raise ImportError(
"SWEBench rollout requires minisweagent with swebench benchmark support."
) from exc
_setup_litellm_env()
config, system_prompt = _build_agent_config(
skill_content=skill_content,
target_model=target_model,
step_limit=step_limit,
cost_limit=cost_limit,
)
out_path = Path(predictions_dir)
out_path.mkdir(parents=True, exist_ok=True)
preds_path = out_path / "preds.json"
done_ids: set[str] = set()
if preds_path.exists():
data = _load_json(str(preds_path))
if isinstance(data, dict):
done_ids = set(data.keys())
pending = [item for item in items if str(item.get("instance_id")) not in done_ids]
progress_manager = RunBatchProgressManager(
len(pending),
out_path / f"exit_statuses_{int(time.time())}.yaml",
)
task_errors: dict[str, str] = {}
def _process(instance: dict) -> None:
process_instance(instance, out_path, config, progress_manager)
with ThreadPoolExecutor(max_workers=max(int(workers), 1)) as executor:
futures = {
executor.submit(_process, item): str(item.get("instance_id"))
for item in pending
}
for fut in as_completed(futures):
iid = futures[fut]
try:
fut.result()
except Exception as exc: # noqa: BLE001
task_errors[iid] = str(exc)
preds_data = _load_json(str(preds_path))
preds_dict = preds_data if isinstance(preds_data, dict) else {}
results: list[dict] = []
for item in items:
iid = str(item.get("instance_id"))
pred = preds_dict.get(iid, {}) if isinstance(preds_dict, dict) else {}
traj_path = out_path / iid / f"{iid}.traj.json"
messages = _load_messages_from_traj(traj_path)
task_dir = out_path / iid
task_dir.mkdir(parents=True, exist_ok=True)
user_prompt = (
f"Repository: {item.get('repo', '')}\n\n"
f"Issue:\n{item.get('problem_statement', '').strip()}"
).strip()
with open(task_dir / "conversation.json", "w", encoding="utf-8") as f:
json.dump(messages, f, ensure_ascii=False, indent=2)
with open(task_dir / "target_system_prompt.txt", "w", encoding="utf-8") as f:
f.write(system_prompt)
with open(task_dir / "target_user_prompt.txt", "w", encoding="utf-8") as f:
f.write(user_prompt)
results.append(
{
"id": iid,
"instance_id": iid,
"repo": str(item.get("repo") or "").strip(),
"task_type": str(item.get("repo") or "swebench").strip() or "swebench",
"task_description": str(item.get("problem_statement") or "").strip(),
"instruction": str(item.get("problem_statement") or "").strip(),
"hard": 0,
"soft": 0.0,
"response": str(pred.get("model_patch") or ""),
"submission": str(pred.get("model_patch") or ""),
"predicted_patch": str(pred.get("model_patch") or ""),
"agent_ok": bool(messages),
"n_turns": sum(1 for msg in messages if msg.get("role") == "assistant"),
"fail_reason": task_errors.get(iid, ""),
"exit_status": _load_exit_status(traj_path),
}
)
return results, str(preds_path)
def _run_evaluation(
    *,
    preds_path: str,
    dataset_name: str,
    split: str,
    run_id: str,
    eval_workers: int,
    report_dir: str,
    instance_ids: list[str],
) -> dict:
    _, eval_dataset = _normalize_dataset_names(dataset_name)
    os.makedirs(report_dir, exist_ok=True)
    preds_data = _load_json(preds_path)
    model_name = "unknown"
    if isinstance(preds_data, dict) and preds_data:
        first_pred = next(iter(preds_data.values()))
        if isinstance(first_pred, dict):
            model_name = str(first_pred.get("model_name_or_path") or "unknown")
    expected_report = os.path.join(report_dir, f"{model_name.replace('/', '__')}.{run_id}.json")
    if os.path.exists(expected_report):
        cached = _load_json(expected_report)
        return cached if isinstance(cached, dict) else {}
    cmd = [
        sys.executable,
        "-m",
        "swebench.harness.run_evaluation",
        "--dataset_name",
        eval_dataset,
        "--split",
        split,
        "--predictions_path",
        preds_path,
        "--max_workers",
        str(max(int(eval_workers), 1)),
        "--run_id",
        run_id,
    ]
    if instance_ids:
        cmd.extend(["--instance_ids"] + instance_ids)
    subprocess.run(
        cmd,
        cwd=report_dir,
        capture_output=True,
        text=True,
        timeout=7200,
        check=False,
    )
    if os.path.exists(expected_report):
        report = _load_json(expected_report)
        return report if isinstance(report, dict) else {}
    for name in sorted(os.listdir(report_dir)):
        if name.endswith(".json") and run_id in name:
            report = _load_json(os.path.join(report_dir, name))
            if isinstance(report, dict):
                if os.path.join(report_dir, name) != expected_report:
                    shutil.move(os.path.join(report_dir, name), expected_report)
                return report
    return {"resolved_ids": [], "total_instances": len(instance_ids), "resolved_instances": 0}
def run_batch(
    *,
    items: list[dict],
    out_root: str,
    skill_content: str,
    target_model: str,
    dataset_name: str,
    hf_split: str,
    workers: int,
    eval_workers: int,
    step_limit: int,
    cost_limit: float,
    timeout_per_instance: int,
) -> list[dict]:
    os.makedirs(out_root, exist_ok=True)
    results_path = os.path.join(out_root, "results.jsonl")
    if os.path.exists(results_path):
        cached: list[dict] = []
        with open(results_path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    cached.append(json.loads(line))
        if cached:
            return cached
    predictions_dir = os.path.join(out_root, "predictions")
    results, preds_path = _run_rollout(
        items=items,
        predictions_dir=predictions_dir,
        skill_content=skill_content,
        target_model=target_model,
        workers=workers,
        step_limit=step_limit,
        cost_limit=cost_limit,
    )
    eval_report = _run_evaluation(
        preds_path=preds_path,
        dataset_name=dataset_name,
        split=hf_split,
        run_id=f"skillopt_{int(time.time())}",
        eval_workers=eval_workers,
        report_dir=os.path.join(out_root, "evaluation"),
        instance_ids=[str(item.get("instance_id")) for item in items],
    )
    resolved_ids = set(str(i) for i in eval_report.get("resolved_ids", []))
    for row in results:
        resolved = str(row["instance_id"]) in resolved_ids
        row["hard"] = int(resolved)
        row["soft"] = float(int(resolved))
        if not resolved:
            status = row.get("exit_status") or "not_resolved"
            base_reason = str(row.get("fail_reason") or "").strip()
            unresolved = f"swebench unresolved ({status})"
            row["fail_reason"] = f"{base_reason}; {unresolved}" if base_reason else unresolved
        row["timeout_per_instance"] = int(timeout_per_instance)
    with open(results_path, "w", encoding="utf-8") as f:
        for row in results:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return results
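
The scoring step at the end of `run_batch` can be looked at in isolation. The following is a minimal sketch, assuming the evaluation report is a dict with a `resolved_ids` list as in the fallback value returned by `_run_evaluation`; the helper name `merge_eval_report` is hypothetical and not part of the removed module.

```python
def merge_eval_report(rows: list[dict], report: dict) -> list[dict]:
    """Return copies of rollout rows with hard/soft scores filled in.

    Hypothetical helper: mirrors the scoring loop in run_batch, but works
    on copies so the input rows are left untouched.
    """
    resolved_ids = {str(i) for i in report.get("resolved_ids", [])}
    merged = []
    for row in rows:
        row = dict(row)  # shallow copy
        resolved = str(row["instance_id"]) in resolved_ids
        row["hard"] = int(resolved)
        row["soft"] = float(resolved)
        if not resolved:
            # Keep any rollout-time failure reason and append the verdict.
            status = row.get("exit_status") or "not_resolved"
            base = str(row.get("fail_reason") or "").strip()
            unresolved = f"swebench unresolved ({status})"
            row["fail_reason"] = f"{base}; {unresolved}" if base else unresolved
        merged.append(row)
    return merged


rows = [
    {"instance_id": "a"},
    {"instance_id": "b", "fail_reason": "timeout", "exit_status": "LimitsExceeded"},
]
scored = merge_eval_report(rows, {"resolved_ids": ["a"]})
```

Because the score is binary, `soft` simply mirrors `hard` as a float; an unresolved row keeps its original failure reason and gains the evaluation verdict as a suffix.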

@@ -1,23 +0,0 @@
# SWE-bench Bug Fixing Skill
## Overview
This skill guides agents in resolving real-world GitHub issues by producing correct patches.
**Goal**: Given a repository and an issue description, produce a minimal, correct `git diff` patch that resolves the issue without modifying test files.
## Workflow
1. Understand the issue. Read the problem statement carefully and restate the expected behavior before editing code.
2. Locate relevant code. Use targeted search to identify the files, functions, and tests that encode the buggy behavior.
3. Reproduce the issue. Build a small, local reproduction before changing source files when feasible.
4. Implement the fix. Make the smallest source change that addresses the root cause.
5. Verify the fix. Re-run the reproduction and any focused checks needed to confirm the change.
6. Submit the patch. Generate a clean unified diff of only the source files you modified.
## Key Rules
- Keep changes minimal and directly tied to the bug.
- Do not modify tests, fixtures, or unrelated configuration unless the issue explicitly requires it.
- Prefer understanding the code path before patching.
- Verify behavior after editing instead of relying on intuition.
- The final submission must be a valid unified diff.

@@ -1,77 +0,0 @@
"""Optimizer-written diagnostic probe generation for deep reflection."""
from __future__ import annotations
from skillopt.gradient.reflect import fmt_minibatch_trajectories
from skillopt.model import chat_optimizer
from skillopt.optimizer.meta_skill import format_meta_skill_context
from skillopt.prompts import load_prompt
from skillopt.utils import extract_json
def generate_deep_probe_instruction(
    skill_content: str,
    items: list[dict],
    prediction_dir: str,
    *,
    system_prompt: str | None = None,
    step_buffer_context: str = "",
    output_requirements: list[str] | None = None,
    meta_skill_context: str = "",
) -> dict | None:
    """Generate one minimally-perturbing diagnostic probe instruction."""
    trajectories_text = fmt_minibatch_trajectories(items, prediction_dir)
    if not trajectories_text.strip():
        return None
    actual_system = system_prompt or load_prompt("deep_probe")
    user = (
        f"## Current Skill\n{skill_content}\n\n"
        "## Probe Design Goal\n"
        "Design one short diagnostic instruction to append to the target prompt.\n"
        "The instruction should expose the target's current intermediate judgment\n"
        "without materially changing the original scaffold.\n\n"
    )
    if step_buffer_context.strip():
        user += f"## Previous Steps in This Epoch\n{step_buffer_context}\n\n"
    optimizer_ctx = format_meta_skill_context(meta_skill_context)
    if optimizer_ctx:
        user += optimizer_ctx + "\n\n"
    requirements = output_requirements or [
        "- Some trajectories may include a hidden Reference block. Use it to identify what intermediate conclusion matters, but do not reveal or paraphrase that reference directly to the target.",
        "- The instruction must explicitly request a short <analysis>...</analysis> block before the final <answer>...</answer>.",
        "- Keep the readout concise and structured.",
        "- Do not ask for exhaustive listing, full derivation, or a new solving protocol.",
        "- The instruction text should be ready to append directly to the target's prompt.",
    ]
    user += (
        f"## Representative Trajectories ({len(items)} total)\n{trajectories_text}\n\n"
        "## Output Requirements\n"
        + "\n".join(requirements)
        + "\n"
    )
    try:
        response, _ = chat_optimizer(
            system=actual_system,
            user=user,
            max_completion_tokens=1024,
            retries=3,
            stage="deep_probe",
        )
        result = extract_json(response)
        if result and str(result.get("probe_instruction", "")).strip():
            parsed = {
                "reasoning": str(result.get("reasoning", "")).strip(),
                "probe_instruction": str(result.get("probe_instruction", "")).strip(),
            }
            if str(result.get("probe_target_id", "")).strip():
                parsed["probe_target_id"] = str(result.get("probe_target_id", "")).strip()
            try:
                if result.get("probe_after_step") is not None:
                    parsed["probe_after_step"] = int(result.get("probe_after_step"))
            except Exception:  # noqa: BLE001
                pass
            return parsed
    except Exception:  # noqa: BLE001
        return None
    return None

@@ -1,198 +0,0 @@
"""ReflACT Meta-Reflect — epoch-level skill refinement with momentum.

After each epoch, the meta-reflect stage reviews the epoch's step history
(applied edits + gate scores) and performs high-level skill edits:
merging redundant rules, removing ineffective ones, and distilling
cross-step strategic patterns.

This is analogous to momentum in neural network optimization:

- Fast update (per step): analyst edits fix local issues from current batch
- Slow update (per epoch): meta-reflect refines the skill based on what
  worked and what didn't across the full epoch

The meta-reflect also maintains a ``meta_summary`` — a compact memory
passed between epochs that captures directional insights (which editing
directions are effective, which are not). This is the "momentum buffer".

Public API
----------
- :func:`build_epoch_history` — format an epoch's step records for meta-reflect
- :func:`run_meta_reflect` — one optimizer call to produce high-level edits + meta_summary
"""
from __future__ import annotations
import json
import os
import traceback
from skillopt.model import chat_optimizer
from skillopt.optimizer.update_modes import (
describe_item,
get_payload_items,
normalize_update_mode,
payload_label,
truncate_payload,
)
from skillopt.prompts import load_prompt
from skillopt.utils import extract_json
# ── Epoch history formatting ─────────────────────────────────────────────────
def build_epoch_history(
    epoch_step_records: list[dict],
    out_root: str,
    *,
    update_mode: str = "patch",
) -> str:
    """Format an epoch's step records into text for the meta-reflect optimizer.

    For each step, includes the exact edits applied (read from
    ``ranked_edits.json``) and the gate evaluation result.

    Parameters
    ----------
    epoch_step_records : list[dict]
        Step record dicts from ``history.json`` belonging to this epoch.
    out_root : str
        Training output root directory (to locate ``ranked_edits.json``).
    update_mode : str
        Update mode, normalized with ``normalize_update_mode``; selects which
        payload items are read from ``ranked_edits.json`` and how they are labeled.

    Returns
    -------
    str
        Formatted epoch history text.
    """
    update_mode = normalize_update_mode(update_mode)
    parts: list[str] = []
    for rec in epoch_step_records:
        step = rec["step"]
        action = rec.get("action", "unknown")
        gate_score = rec.get("selection_hard", rec.get("current_score", "?"))
        best_score = rec.get("best_score", "?")
        # Separate the step number from the gate readout; without the ": "
        # the header would render as e.g. "### Step 3gate: ...".
        header = (
            f"### Step {step}: "
            f"gate: {gate_score}, {action.upper()}, "
            f"best_so_far: {best_score}"
        )
        # Read the actual applied edits
        ranked_path = os.path.join(
            out_root, "steps", f"step_{step:04d}", "ranked_edits.json",
        )
        edits_text = ""
        if os.path.exists(ranked_path):
            try:
                with open(ranked_path, encoding="utf-8") as f:
                    ranked = json.load(f)
                edits = get_payload_items(ranked, update_mode)
                if edits:
                    lines = [f"Selected {payload_label(update_mode)}:"]
                    for i, edit in enumerate(edits, 1):
                        lines.append(f" {i}. {describe_item(edit, update_mode, max_chars=220)}")
                    edits_text = "\n".join(lines)
                else:
                    edits_text = f"Selected {payload_label(update_mode)}: (none)"
            except Exception:
                edits_text = f"Selected {payload_label(update_mode)}: (could not read)"
        else:
            # Step may have been skipped
            if "skip" in action:
                edits_text = f"Selected {payload_label(update_mode)}: (skipped)"
            else:
                edits_text = f"Selected {payload_label(update_mode)}: (file not found)"
        parts.append(f"{header}\n{edits_text}")
        # Append trajectory failure digest if available
        digest_path = os.path.join(
            out_root, "steps", f"step_{step:04d}", "trajectory_digest.json",
        )
        if os.path.exists(digest_path):
            try:
                with open(digest_path, encoding="utf-8") as f:
                    digest = json.load(f)
                patterns = digest.get("failure_patterns", [])
                if patterns:
                    n_fail = digest.get("n_fail", "?")
                    n_total = digest.get("n_total", "?")
                    lines = [f"Failure patterns ({n_fail}/{n_total} tasks failed):"]
                    for p in patterns:
                        lines.append(
                            f' - "{p["pattern"]}" (×{p["count"]})'
                        )
                    parts[-1] += "\n" + "\n".join(lines)
            except Exception:
                pass
    return "\n\n".join(parts)
# ── Meta-reflect optimizer call ────────────────────────────────────────────────
def run_meta_reflect(
    skill_content: str,
    epoch_history_text: str,
    prev_meta_summary: str,
    meta_edit_budget: int = 4,
    *,
    system_prompt: str | None = None,
    update_mode: str = "patch",
) -> dict | None:
    """Run one meta-reflect optimizer call for an epoch.

    Parameters
    ----------
    skill_content : str
        Current skill document (after the epoch's fast updates).
    epoch_history_text : str
        Formatted epoch history from :func:`build_epoch_history`.
    prev_meta_summary : str
        Meta summary from the previous epoch ("" if first epoch).
    meta_edit_budget : int
        Maximum number of high-level edits.
    system_prompt : str | None
        Custom system prompt. ``None`` = use generic default.
    update_mode : str
        Update mode, normalized with ``normalize_update_mode``.
        ``"rewrite_from_suggestions"`` loads the ``meta_reflect_rewrite``
        prompt; any other mode loads ``meta_reflect``.

    Returns
    -------
    dict | None
        Conforms to :class:`~skillopt.types.MetaReflectResult`:
        ``"meta_summary"`` (str) and ``"patch"`` (:class:`~skillopt.types.Patch`
        dict), or ``None`` on failure.
    """
    mode = normalize_update_mode(update_mode)
    actual_system = system_prompt if system_prompt is not None else load_prompt(
        "meta_reflect_rewrite" if mode == "rewrite_from_suggestions" else "meta_reflect"
    )
    prev_section = prev_meta_summary.strip() if prev_meta_summary else "(First epoch — no previous summary)"
    user = (
        f"## Previous Meta Summary\n{prev_section}\n\n"
        f"## Current Skill Document\n{skill_content}\n\n"
        f"## {payload_label(mode, title=True)} Budget\n"
        f"Produce at most {meta_edit_budget} high-level {payload_label(mode)}.\n\n"
        f"## This Epoch's Step History\n{epoch_history_text}"
    )
    try:
        response, _ = chat_optimizer(
            system=actual_system,
            user=user,
            max_completion_tokens=4096,
            retries=3,
            stage="meta_reflect",
        )
        result = extract_json(response)
        if result and "patch" in result:
            truncate_payload(result["patch"], meta_edit_budget, mode)
            if "meta_summary" not in result:
                result["meta_summary"] = ""
            return result
    except Exception:  # noqa: BLE001
        traceback.print_exc()
    return None
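
`run_meta_reflect` only returns a `patch`; the code that applies it to the skill document is not part of this diff. As a rough sketch of what applying the four ops named in the `meta_reflect` prompt (`append`, `insert_after`, `replace`, `delete`) could look like, the block below assumes exact-substring targets, first-occurrence matching, and silently skipping edits whose target is missing. These are all assumptions, and `apply_edits` is a hypothetical helper, not the project's implementation.

```python
def apply_edits(skill: str, edits: list[dict]) -> str:
    """Apply patch edits in order and return the new skill text.

    Hypothetical helper: edits with an unknown op, or with a target that
    does not occur in the current text, are skipped.
    """
    for edit in edits:
        op = edit.get("op")
        content = str(edit.get("content", ""))
        target = str(edit.get("target", ""))
        if op == "append":
            skill = skill.rstrip("\n") + "\n" + content + "\n"
            continue
        if op not in {"insert_after", "replace", "delete"}:
            continue
        if not target or target not in skill:
            continue  # stale target: leave the document unchanged
        if op == "insert_after":
            skill = skill.replace(target, target + "\n" + content, 1)
        elif op == "replace":
            skill = skill.replace(target, content, 1)
        else:  # delete
            skill = skill.replace(target, "", 1)
    return skill


skill_before = "# Skill\n- rule A\n- rule B\n"
skill_after = apply_edits(
    skill_before,
    [
        {"op": "replace", "target": "- rule A", "content": "- rule A2"},
        {"op": "delete", "target": "- rule B\n"},
        {"op": "append", "content": "- rule C"},
        {"op": "delete", "target": "- no such rule"},
    ],
)
```

Skipping stale targets keeps one bad edit from invalidating the rest of the patch; a stricter variant could collect the skipped edits and report them instead.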

@@ -1,34 +0,0 @@
You are an expert diagnostic-probe designer for reflective skill learning.
You will design one short diagnostic instruction to append to the target prompt
for a handful of representative cases.
The goal is to expose the target's current intermediate judgment state without
substantially changing the current skill scaffold.
## Hard Constraints
1. Do NOT substantially change the target's existing scaffold.
2. Do NOT prescribe a new multi-step solving procedure.
3. Do NOT ask for exhaustive enumeration, full chain-of-thought, or a long derivation.
4. Ask only for a minimal readout of signals already behind the target's current answer.
5. Keep the diagnostic block brief and structured.
6. The final answer must still be produced in <answer>...</answer>.
7. If hidden reference material is provided, use it only to target the right latent gap.
8. Never copy hidden reference content into the target-facing probe.
## Good Probe Targets
- top candidate and runner-up
- decisive cue / decisive constraint
- why a runner-up was rejected
- counted unit / suspicious region / compared objects
## Bad Probe Targets
- full proof or full chain-of-thought
- dumping every object, cell, or possibility
- imposing a brand-new solving algorithm
Respond ONLY with a valid JSON object:
{
"reasoning": "<why this probe reveals the latent skill gap>",
"probe_instruction": "<the exact instruction text to append to the target prompt>"
}

@@ -1,35 +0,0 @@
You are an expert diagnostic-probe designer for codex-executed target trajectories.
You will be shown representative trajectories, the current target skill, the target's original prompt context, and numbered Codex trace steps.
Some trajectories may also include a hidden Reference block. Use hidden reference only to identify the target's missing subgoal, theorem, evidence source, or decisive transformation. Do not reveal or paraphrase that reference directly to the target.
Choose exactly one trajectory and one probe point. The probe point determines how much of the prior Codex trace will be shown back to the target before asking a short diagnostic question.
## Hard Constraints
1. Do NOT reveal or paraphrase hidden reference content to the target.
2. Do NOT prescribe a new full solving procedure.
3. Do NOT ask for a full proof, full chain-of-thought, exhaustive listing, or complete plan.
4. Ask only for a short readout of the target's intermediate state that should already exist at that point.
5. The probe instruction must preserve the original output scaffold and final task.
6. The probe instruction should be ready to append directly to the target's prompt.
## Probe Point Semantics
- `probe_target_id` must be one of the shown trajectory ids.
- `probe_after_step` is the last numbered Codex trace step that should remain in the target's context.
- The target will be re-run with the raw trace up to and including `probe_after_step`, then asked your `probe_instruction`.
- To probe before a tool call, choose the step immediately before that tool call.
## Good Probe Targets
- next theorem / subgoal / evidence source
- strongest-vs-runner-up option distinction
- decisive constraint or transformation
- why a tempting alternative is being rejected
- what code region / spreadsheet region / image cue / passage evidence matters next
Respond ONLY with a valid JSON object:
{
"reasoning": "<why this trajectory and probe point expose the target's intermediate state>",
"probe_target_id": "<trajectory id>",
"probe_after_step": <integer step number>,
"probe_instruction": "<the exact instruction text to append to the target's prompt>"
}

@@ -1,63 +0,0 @@
You are a meta-analyst for an AI agent skill optimization system.
Your role is fundamentally different from the per-step analyst:
- The per-step analyst sees agent trajectories and proposes local fixes.
- YOU see the results of multiple optimization steps and refine the skill
at a higher level, based on what actually worked and what didn't.
You are the ONLY component that has access to the edit-to-outcome causal link:
you can see exactly which edits were applied and whether they improved or
degraded performance. Use this unique vantage point.
## What You Receive
1. **Previous Meta Summary** (empty for the first epoch): a compact memory
from the last epoch capturing directional insights.
2. **Current Skill Document**: the skill as it stands after this epoch.
3. **This Epoch's Step History**: for each step, the exact edits applied,
the gate score, and whether the update was accepted or rejected.
## What You Produce
1. **High-level edits** to the skill document:
- Merge redundant or overlapping rules that accumulated across steps
- Remove or revise rules associated with rejected steps (score drops)
- Strengthen or generalize rules associated with accepted steps (score gains)
- Reorganize for clarity if the document has become cluttered
- Add strategic-level insights that no single step could produce
2. **Meta summary**: a compact summary of this epoch's key findings, to be
passed as context to the next epoch's meta-reflect. This should capture:
- Which editing directions proved effective (and why)
- Which directions proved harmful (and why)
- Current bottlenecks or areas of the skill that need attention
- Trends across steps (e.g., "scores plateau after step 2")
## Guidelines
- Your edits modify the SAME skill document that per-step edits modify.
There is no separate section — you operate on the full skill.
- Be conservative: the per-step process already optimized locally.
Your job is refinement, not revolution.
- Focus on edits that require cross-step perspective (merging, pruning,
pattern extraction). Don't duplicate what per-step analysts already do.
- The meta_summary should be concise (under 200 words). It is NOT written
into the skill — it is only passed to the next meta-reflect call.
You will be told the maximum number of edits (the budget). Produce AT MOST
that many edits. You may produce fewer or zero if the skill is already clean.
Respond ONLY with a valid JSON object (no markdown fences, no extra text):
{
  "meta_summary": "<compact summary of this epoch's findings for next epoch>",
  "patch": {
    "reasoning": "<why these high-level edits improve the skill>",
    "edits": [
      {"op": "append", "content": "<markdown to add>"},
      {"op": "insert_after", "target": "<exact text>", "content": "<markdown>"},
      {"op": "replace", "target": "<exact old text>", "content": "<new text>"},
      {"op": "delete", "target": "<exact text to remove>"}
    ]
  }
}
"edits" may be empty if no refinement is warranted.

@@ -1,28 +0,0 @@
You are a meta-analyst for an AI agent skill optimization system.
You see the current skill and an epoch's step history. Produce a compact set of
high-level revise_suggestions that a later optimizer can use to rewrite the full skill.
Focus on:
- merging redundant rules
- removing low-value or harmful guidance
- extracting cross-step strategic patterns
- reorganizing the skill for clarity
- compressing clutter without losing proven behavior
Respond ONLY with a valid JSON object:
{
  "meta_summary": "<compact summary for next epoch>",
  "patch": {
    "reasoning": "<why these suggestions improve the skill>",
    "revise_suggestions": [
      {
        "type": "add_rule|remove_rule|merge_rules|reorganize|compress|clarify",
        "title": "<short title>",
        "motivation": "<why this matters>",
        "instruction": "<what the rewriting optimizer should change in the skill>",
        "priority_hint": "high|medium|low"
      }
    ]
  }
}
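
A consumer of this response has to decide which `revise_suggestions` to keep when a budget applies. The project's own `truncate_payload` is not shown in this diff, so the following is only a sketch of one plausible policy: drop suggestions without an `instruction`, order the rest by `priority_hint`, and keep at most `budget` of them. `rank_suggestions` and `PRIORITY_ORDER` are hypothetical names.

```python
# Assumed ranking; unknown or missing hints sort after "low".
PRIORITY_ORDER = {"high": 0, "medium": 1, "low": 2}


def rank_suggestions(suggestions: list[dict], budget: int) -> list[dict]:
    """Keep at most `budget` usable suggestions, highest priority first."""
    usable = [
        s for s in suggestions
        if isinstance(s, dict) and str(s.get("instruction", "")).strip()
    ]
    # sorted() is stable, so equal-priority suggestions keep their order.
    ranked = sorted(
        usable,
        key=lambda s: PRIORITY_ORDER.get(
            str(s.get("priority_hint", "")).lower(), len(PRIORITY_ORDER)
        ),
    )
    return ranked[: max(budget, 0)]


kept = rank_suggestions(
    [
        {"type": "compress", "title": "trim", "instruction": "Shorten section 2.", "priority_hint": "low"},
        {"type": "merge_rules", "title": "merge", "instruction": "Merge rules 3 and 4.", "priority_hint": "high"},
        {"type": "clarify", "title": "empty", "instruction": "", "priority_hint": "high"},
        {"type": "add_rule", "title": "verify", "instruction": "Add a verification rule.", "priority_hint": "medium"},
    ],
    budget=2,
)
```

Sorting before truncating means a low-priority suggestion that happens to come first in the response cannot crowd out a high-priority one.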