cleanup: remove unused benchmarks, deep_probe, meta_reflect

Remove sealqa, babyvision, mathverse, mmrb, swebench envs and configs. Remove deep_probe, deep_reflect, meta_reflect modules and prompts. Remove download_babyvision script. These are not part of the core released benchmarks.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-07-03 14:02:58 +08:00 · 2026-05-24 19:36:27 +00:00
parent 2df2542aec
commit f55a26414e
71 changed files with 0 additions and 11199 deletions
--- a/skillopt/envs/alfworld/prompts/deep_probe.md
+++ b/skillopt/envs/alfworld/prompts/deep_probe.md
@@ -1,35 +0,0 @@
-You are an expert diagnostic-probe designer for ALFWorld embodied tasks.
-
-You will design one short diagnostic instruction to append to the target's prompt
-for a handful of representative ALFWorld trajectories.
-
-The goal is to expose whether the target has the right intermediate subgoal,
-object/receptacle state, and next-step intention without substantially changing
-the current scaffold.
-
-## Hard Constraints
-1. Do NOT substantially change the target's existing action-selection scaffold.
-2. Do NOT prescribe a brand-new planner or long multi-step policy.
-3. Do NOT ask for exhaustive search over all objects or all admissible actions.
-4. Keep the diagnostic readout brief and place it inside the existing <think>...</think> block.
-5. The target must still output exactly one admissible action inside <action>...</action>.
-6. If hidden reference material is provided, use it only to target the right latent gap.
-7. Never copy hidden reference content into the target-facing probe.
-
-## Good Probe Targets
-- current subgoal
-- target object / target receptacle / target state
-- decisive missing precondition
-- why one candidate action is better than a tempting alternative
-- whether the current step should explore, transform an object, or place it
-
-## Bad Probe Targets
-- a full optimal plan from start to finish
-- exhaustive object inventories
-- a new theorem-like or planner-like protocol
-
-Respond ONLY with a valid JSON object:
-{
-  "reasoning": "<why this probe reveals the latent skill gap>",
-  "probe_instruction": "<the exact instruction text to append to the target prompt>"
-}
--- a/skillopt/envs/babyvision/__init__.py
+++ b/skillopt/envs/babyvision/__init__.py
@@ -1 +0,0 @@
-"""BabyVision environment package for ReflACT."""
--- a/skillopt/envs/babyvision/adapter.py
+++ b/skillopt/envs/babyvision/adapter.py
@@ -1,267 +0,0 @@
-"""BabyVision environment adapter for ReflACT."""
-from __future__ import annotations
-
-import json
-import os
-
-from skillopt.gradient.deep_probe import generate_deep_probe_instruction
-from skillopt.datasets.base import BatchSpec
-from skillopt.gradient.reflect import run_minibatch_reflect
-from skillopt.envs.base import EnvAdapter
-from skillopt.envs.babyvision.dataloader import BabyVisionDataLoader
-from skillopt.envs.babyvision.rollout import run_batch
-from skillopt.model import get_target_backend
-
-
-class BabyVisionAdapter(EnvAdapter):
-    """BabyVision adapter."""
-
-    def build_reference_text(self, item: dict) -> str:
-        cot = str(item.get("cot") or "").strip()
-        if not cot:
-            return ""
-        return f"## Reference CoT\n{cot}"
-
-    def get_reference_metadata(self, item: dict) -> dict:
-        cot = str(item.get("cot") or "").strip()
-        if not cot:
-            return {"fields": [], "preview": ""}
-        return {
-            "fields": ["cot"],
-            "preview": cot[:400],
-        }
-
-    def __init__(
-        self,
-        split_dir: str = "",
-        data_path: str = "",
-        split_mode: str = "ratio",
-        split_ratio: str = "2:1:7",
-        split_seed: int = 42,
-        split_output_dir: str = "",
-        max_turns: int = 1,
-        workers: int = 32,
-        analyst_workers: int = 16,
-        failure_only: bool = False,
-        minibatch_size: int = 8,
-        edit_budget: int = 4,
-        seed: int = 42,
-        limit: int = 0,
-        image_detail: str = "auto",
-        judge_model: str = "gpt-5.4",
-        judge_max_completion_tokens: int = 256,
-        judge_retries: int = 5,
-        use_deep_reflect: bool = False,
-        deep_reflect_failures: int = 4,
-        deep_reflect_successes: int = 2,
-    ) -> None:
-        self.max_turns = max_turns
-        self.workers = workers
-        self.analyst_workers = analyst_workers
-        self.failure_only = failure_only
-        self.minibatch_size = minibatch_size
-        self.edit_budget = edit_budget
-        self.image_detail = image_detail
-        self.judge_model = judge_model
-        self.judge_max_completion_tokens = judge_max_completion_tokens
-        self.judge_retries = judge_retries
-        self.use_deep_reflect = use_deep_reflect
-        self.deep_reflect_failures = deep_reflect_failures
-        self.deep_reflect_successes = deep_reflect_successes
-        self.dataloader = BabyVisionDataLoader(
-            split_dir=split_dir,
-            data_path=data_path,
-            split_mode=split_mode,
-            split_ratio=split_ratio,
-            split_seed=split_seed,
-            split_output_dir=split_output_dir,
-            seed=seed,
-            limit=limit,
-        )
-
-    def setup(self, cfg: dict) -> None:
-        super().setup(cfg)
-        self.dataloader.setup(cfg)
-
-    def get_dataloader(self):
-        return self.dataloader
-
-    def build_env_from_batch(self, batch: BatchSpec, **kwargs):
-        return list(batch.payload or [])
-
-    def build_train_env(self, batch_size: int, seed: int, **kwargs):
-        batch = self.dataloader.build_train_batch(batch_size=batch_size, seed=seed, **kwargs)
-        return self.build_env_from_batch(batch, **kwargs)
-
-    def build_eval_env(self, env_num: int, split: str, seed: int, **kwargs):
-        batch = self.dataloader.build_eval_batch(env_num=env_num, split=split, seed=seed, **kwargs)
-        return self.build_env_from_batch(batch, **kwargs)
-
-    def rollout(
-        self,
-        env_manager,
-        skill_content: str,
-        out_dir: str,
-        **kwargs,
-    ) -> list[dict]:
-        items: list[dict] = env_manager
-        return run_batch(
-            items=items,
-            out_root=out_dir,
-            skill_content=skill_content,
-            max_turns=self.max_turns,
-            workers=self.workers,
-            image_detail=self.image_detail,
-            judge_model=self.judge_model,
-            judge_max_completion_tokens=self.judge_max_completion_tokens,
-            judge_retries=self.judge_retries,
-            diagnostic_mode=kwargs.get("diagnostic_mode", False),
-            diagnostic_instruction=kwargs.get("diagnostic_instruction", ""),
-            diagnostic_trace_context_by_id=kwargs.get("diagnostic_trace_context_by_id"),
-        )
-
-    def reflect(
-        self,
-        results: list[dict],
-        skill_content: str,
-        out_dir: str,
-        **kwargs,
-    ) -> list[dict | None]:
-        prediction_dir = kwargs.get("prediction_dir", os.path.join(out_dir, "predictions"))
-        patches_dir = kwargs.get("patches_dir", os.path.join(out_dir, "patches"))
-        random_seed = kwargs.get("random_seed")
-        step_buffer_context = kwargs.get("step_buffer_context", "")
-        meta_skill_context = kwargs.get("meta_skill_context", "")
-
-        return run_minibatch_reflect(
-            results=results,
-            skill_content=skill_content,
-            prediction_dir=prediction_dir,
-            patches_dir=patches_dir,
-            workers=self.analyst_workers,
-            failure_only=self.failure_only,
-            minibatch_size=self.minibatch_size,
-            edit_budget=self.edit_budget,
-            random_seed=random_seed,
-            error_system=self.get_error_minibatch_prompt(),
-            success_system=self.get_success_minibatch_prompt(),
-            step_buffer_context=step_buffer_context,
-            meta_skill_context=meta_skill_context,
-            update_mode=getattr(self, "_cfg", {}).get("skill_update_mode", "patch"),
-        )
-
-    def deep_reflect(
-        self,
-        results: list[dict],
-        skill_content: str,
-        out_dir: str,
-        **kwargs,
-    ) -> list[dict | None]:
-        if not self.use_deep_reflect:
-            return []
-
-        env_manager = kwargs.get("env_manager")
-        prediction_dir = kwargs.get("prediction_dir", os.path.join(out_dir, "predictions"))
-        random_seed = kwargs.get("random_seed")
-        step_buffer_context = kwargs.get("step_buffer_context", "")
-        meta_skill_context = kwargs.get("meta_skill_context", "")
-        codex_backend = get_target_backend() == "codex_exec"
-        selected_items = self.select_representative_items(
-            results,
-            env_manager if isinstance(env_manager, list) else None,
-            n_failures=self.deep_reflect_failures,
-            n_successes=self.deep_reflect_successes,
-            seed=random_seed,
-        )
-        if not selected_items:
-            return []
-        selected_ids = {str(item["id"]) for item in selected_items}
-        selected_results = [row for row in results if str(row.get("id")) in selected_ids]
-        selected_examples = self.attach_reference_context(selected_results, selected_items)
-        if codex_backend:
-            selected_examples = self.attach_codex_probe_context(selected_examples, prediction_dir)
-        selected_metadata = []
-        cot_count = 0
-        for item in selected_items:
-            meta = self.get_reference_metadata(item)
-            if meta["fields"]:
-                cot_count += 1
-            selected_metadata.append({
-                "id": str(item["id"]),
-                "task_type": str(item.get("subtype") or item.get("task_type") or "babyvision"),
-                "reference_fields": meta["fields"],
-                "reference_preview": meta["preview"],
-            })
-
-        deep_dir = os.path.join(out_dir, "deep_reflect")
-        rollout_dir = os.path.join(deep_dir, "rollout")
-        patches_dir = os.path.join(deep_dir, "patches")
-        os.makedirs(deep_dir, exist_ok=True)
-        print(
-            f"    [2b/6 DEEP REFLECT setup] selected={len(selected_items)} "
-            f"reference_fields=cot({cot_count}/{len(selected_items)})"
-        )
-        probe = generate_deep_probe_instruction(
-            skill_content=skill_content,
-            items=selected_examples,
-            prediction_dir=prediction_dir,
-            system_prompt=self.get_codex_deep_probe_prompt() if codex_backend else self.get_deep_probe_prompt(),
-            step_buffer_context=step_buffer_context,
-            meta_skill_context=meta_skill_context,
-        )
-        if not probe:
-            return []
-        diagnostic_trace_context_by_id = None
-        if codex_backend:
-            selected_items, diagnostic_trace_context_by_id, probe = self.resolve_codex_probe_target(
-                selected_items=selected_items,
-                selected_examples=selected_examples,
-                prediction_dir=prediction_dir,
-                probe=probe,
-            )
-        probe_record = {
-            **probe,
-            "reference_summary": {
-                "selected_count": len(selected_items),
-                "field_counts": {
-                    "cot": cot_count,
-                },
-            },
-            "selected_examples": selected_metadata,
-        }
-        with open(os.path.join(deep_dir, "probe.json"), "w", encoding="utf-8") as f:
-            json.dump(probe_record, f, ensure_ascii=False, indent=2)
-        deep_results = run_batch(
-            items=selected_items,
-            out_root=rollout_dir,
-            skill_content=skill_content,
-            max_turns=self.max_turns,
-            workers=min(self.workers, max(len(selected_items), 1)),
-            image_detail=self.image_detail,
-            judge_model=self.judge_model,
-            judge_max_completion_tokens=self.judge_max_completion_tokens,
-            judge_retries=self.judge_retries,
-            diagnostic_mode=True,
-            diagnostic_instruction=probe["probe_instruction"],
-            diagnostic_trace_context_by_id=diagnostic_trace_context_by_id,
-        )
-        deep_results = self.attach_reference_context(deep_results, selected_items)
-        return run_minibatch_reflect(
-            results=deep_results,
-            skill_content=skill_content,
-            prediction_dir=os.path.join(rollout_dir, "predictions"),
-            patches_dir=patches_dir,
-            workers=self.analyst_workers,
-            failure_only=self.failure_only,
-            minibatch_size=self.minibatch_size,
-            edit_budget=self.edit_budget,
-            random_seed=random_seed,
-            error_system=self.get_error_minibatch_prompt(),
-            success_system=self.get_success_minibatch_prompt(),
-            step_buffer_context=step_buffer_context,
-            meta_skill_context=meta_skill_context,
-            update_mode=getattr(self, "_cfg", {}).get("skill_update_mode", "patch"),
-        )
-
-    def get_task_types(self) -> list[str]:
-        return self.dataloader.get_task_types()
--- a/skillopt/envs/babyvision/dataloader.py
+++ b/skillopt/envs/babyvision/dataloader.py
@@ -1,214 +0,0 @@
-"""BabyVision task dataloader."""
-from __future__ import annotations
-
-import json
-import os
-from typing import Any
-
-from skillopt.datasets.base import SplitDataLoader
-
-
-# ── Raw data loading utilities (for preprocessing / standalone eval) ─────
-
-_CHOICE_LABELS = ["A", "B", "C", "D", "E", "F", "G"]
-
-
-def _iter_jsonl(path: str) -> list[dict]:
-    items: list[dict] = []
-    with open(path, encoding="utf-8") as f:
-        for line in f:
-            line = line.strip()
-            if not line:
-                continue
-            items.append(json.loads(line))
-    return items
-
-
-def _normalize_ans_type(raw: Any, options: list[dict], choice_answer: Any) -> str:
-    text = str(raw or "").strip().lower()
-    if text in {"choice", "multiple_choice", "mcq", "option"}:
-        return "choice"
-    if text in {"blank", "open", "open_ended", "fill_blank", "short_answer"}:
-        return "blank"
-    if options or choice_answer not in (None, "", []):
-        return "choice"
-    return "blank"
-
-
-def _coerce_options(raw: Any) -> list[dict]:
-    options: list[dict] = []
-    if isinstance(raw, list):
-        for idx, item in enumerate(raw):
-            if isinstance(item, dict):
-                text = str(item.get("text") or item.get("content") or item.get("option") or "").strip()
-                label = str(item.get("label") or _CHOICE_LABELS[idx]).strip()
-            else:
-                text = str(item).strip()
-                label = _CHOICE_LABELS[idx]
-            if text:
-                options.append({"label": label, "text": text})
-    elif isinstance(raw, dict):
-        for idx, (key, value) in enumerate(raw.items()):
-            text = str(value).strip()
-            if text:
-                options.append({"label": str(key).strip() or _CHOICE_LABELS[idx], "text": text})
-    return options
-
-
-def _normalize_choice_answer(choice_answer: Any, options: list[dict]) -> dict[str, str]:
-    if not options:
-        return {"label": "", "text": ""}
-
-    if isinstance(choice_answer, dict):
-        label = str(choice_answer.get("label") or "").strip().upper()
-        text = str(choice_answer.get("text") or "").strip()
-        for option in options:
-            if label and option["label"].strip().upper() == label:
-                return {"label": option["label"], "text": option["text"]}
-            if text and option["text"] == text:
-                return {"label": option["label"], "text": option["text"]}
-
-    if isinstance(choice_answer, int):
-        idx = choice_answer
-        if 0 <= idx < len(options):
-            return dict(options[idx])
-        if 1 <= idx <= len(options):
-            return dict(options[idx - 1])
-
-    text = str(choice_answer or "").strip()
-    label = text.upper().rstrip(".):")
-    for option in options:
-        if option["label"].strip().upper() == label:
-            return dict(option)
-        if option["text"] == text:
-            return dict(option)
-
-    return {"label": "", "text": ""}
-
-
-def _coerce_blank_answers(raw: Any) -> list[str]:
-    if isinstance(raw, list):
-        return [str(item).strip() for item in raw if str(item).strip()]
-    if raw is None:
-        return []
-    text = str(raw).strip()
-    return [text] if text else []
-
-
-def load_items(data_path: str) -> list[dict]:
-    """Load and normalise BabyVision items from a directory or JSONL file."""
-    if not data_path:
-        raise ValueError("BabyVision requires data_path pointing to a local dataset directory or meta_data.jsonl.")
-
-    if os.path.isdir(data_path):
-        meta_path = os.path.join(data_path, "meta_data.jsonl")
-        image_root = os.path.join(data_path, "images")
-    else:
-        meta_path = data_path
-        image_root = os.path.join(os.path.dirname(data_path), "images")
-
-    if not os.path.exists(meta_path):
-        raise ValueError(
-            "BabyVision expected a meta_data.jsonl file. "
-            f"Could not find: {meta_path}"
-        )
-
-    raw_items = _iter_jsonl(meta_path)
-    items: list[dict] = []
-    for idx, raw in enumerate(raw_items):
-        options = _coerce_options(raw.get("options") or raw.get("choices") or raw.get("choiceOptions"))
-        ans_type = _normalize_ans_type(raw.get("ansType"), options, raw.get("choiceAns"))
-        correct_choice = _normalize_choice_answer(raw.get("choiceAns"), options)
-        blank_answers = _coerce_blank_answers(raw.get("blankAns"))
-
-        image_name = str(
-            raw.get("image")
-            or raw.get("image_path")
-            or raw.get("image_file")
-            or raw.get("img")
-            or ""
-        ).strip()
-        if not image_name:
-            continue
-        image_path = image_name if os.path.isabs(image_name) else os.path.join(image_root, image_name)
-        if not os.path.exists(image_path):
-            alt = os.path.join(os.path.dirname(meta_path), image_name)
-            if os.path.exists(alt):
-                image_path = alt
-            else:
-                continue
-
-        task_id = str(raw.get("taskId") or raw.get("id") or idx + 1)
-        task_type = str(raw.get("type") or raw.get("taskType") or "unknown").strip() or "unknown"
-        subtype = str(raw.get("subtype") or raw.get("subType") or task_type).strip() or task_type
-        question = str(raw.get("question") or raw.get("query") or "").strip()
-        if not question:
-            continue
-
-        if ans_type == "choice" and not correct_choice["label"]:
-            continue
-        if ans_type != "choice" and not blank_answers:
-            continue
-
-        items.append({
-            "id": task_id,
-            "task_type": task_type,
-            "subtype": subtype,
-            "question": question,
-            "image_path": os.path.abspath(image_path),
-            "ans_type": ans_type,
-            "choices": options,
-            "correct_choice": correct_choice,
-            "blank_answers": blank_answers,
-            "cot": str(raw.get("coT") or raw.get("cot") or "").strip(),
-            "source_path": os.path.abspath(meta_path),
-        })
-
-    if not items:
-        raise ValueError(f"No valid BabyVision items loaded from {data_path}")
-    return items
-
-
-# ── Dataloader ───────────────────────────────────────────────────────────
-
-class BabyVisionDataLoader(SplitDataLoader):
-    """BabyVision dataloader."""
-
-    def __init__(
-        self,
-        split_dir: str = "",
-        data_path: str = "",
-        split_mode: str = "ratio",
-        split_ratio: str = "2:1:7",
-        split_seed: int = 42,
-        split_output_dir: str = "",
-        seed: int = 42,
-        limit: int = 0,
-        **kwargs,
-    ) -> None:
-        super().__init__(
-            split_dir=split_dir,
-            data_path=data_path,
-            split_mode=split_mode,
-            split_ratio=split_ratio,
-            split_seed=split_seed,
-            split_output_dir=split_output_dir,
-            seed=seed,
-            limit=limit,
-        )
-        self._task_types: list[str] = []
-
-    def load_raw_items(self, data_path: str) -> list[dict]:
-        return load_items(data_path)
-
-    def setup(self, cfg: dict) -> None:
-        super().setup(cfg)
-        all_items = self.train_items + self.val_items + self.test_items
-        task_types = {
-            item.get("subtype") or item.get("task_type") or "unknown"
-            for item in all_items
-        }
-        self._task_types = sorted(task_types)
-
-    def get_task_types(self) -> list[str]:
-        return list(self._task_types)
--- a/skillopt/envs/babyvision/evaluator.py
+++ b/skillopt/envs/babyvision/evaluator.py
@@ -1,160 +0,0 @@
-"""BabyVision evaluation helpers using the official-style LLM judge."""
-from __future__ import annotations
-
-import re
-import string
-
-import regex
-
-from skillopt.model import chat_with_deployment
-from skillopt.prompts import load_prompt
-
-_EVAL_MODE = "babyvision_judge_v2_official_style"
-
-def normalize_text(text: str) -> str:
-    text = str(text).strip().lower()
-    text = "".join(ch for ch in text if ch not in string.punctuation)
-    return " ".join(text.split())
-
-
-def extract_boxed_answer(text: str | None) -> str | None:
-    """Extract the final answer using the official BabyVision rule."""
-    if text is None:
-        return None
-
-    pattern = r'\\boxed\{((?:[^{}]|{(?:[^{}]|{.*})*})*)\}'
-    matches = regex.findall(pattern, text)
-    if matches:
-        return matches[-1]
-
-    pattern_alt = r'<\|begin_of_box\|>(.*?)<\|end_of_box\|>'
-    matches_alt = regex.findall(pattern_alt, text)
-    if matches_alt:
-        return matches_alt[-1].strip()
-
-    return None
-
-
-def _token_f1(prediction: str, gold: str) -> float:
-    pred_tokens = normalize_text(prediction).split()
-    gold_tokens = normalize_text(gold).split()
-    if not pred_tokens and not gold_tokens:
-        return 1.0
-    if not pred_tokens or not gold_tokens:
-        return 0.0
-    pred_set = {}
-    gold_set = {}
-    for tok in pred_tokens:
-        pred_set[tok] = pred_set.get(tok, 0) + 1
-    for tok in gold_tokens:
-        gold_set[tok] = gold_set.get(tok, 0) + 1
-    common = 0
-    for tok, count in pred_set.items():
-        common += min(count, gold_set.get(tok, 0))
-    if common == 0:
-        return 0.0
-    precision = common / len(pred_tokens)
-    recall = common / len(gold_tokens)
-    return 2 * precision * recall / (precision + recall)
-
-
-def _format_choices(choices: list[dict]) -> str:
-    return "\n".join(f"{choice['label']}. {choice['text']}" for choice in choices)
-
-
-def _judge_answer(
-    *,
-    item: dict,
-    prediction_text: str,
-    extracted_answer: str,
-    judge_model: str,
-    max_completion_tokens: int,
-    retries: int,
-) -> dict:
-    if item["ans_type"] == "choice":
-        ground_truth = str(item["correct_choice"]["label"])
-    else:
-        if len(item["blank_answers"]) == 1:
-            ground_truth = item["blank_answers"][0]
-        else:
-            ground_truth = " | ".join(item["blank_answers"])
-
-    question = str(item["question"])
-    if item["ans_type"] == "choice" and item.get("choices"):
-        question = f"{question}\nChoices:\n{_format_choices(item['choices'])}"
-
-    raw, _ = chat_with_deployment(
-        deployment=judge_model,
-        system="You are a careful and strict evaluator.",
-        user=load_prompt("judge", env="babyvision").format(
-            question=question,
-            groundtruth=ground_truth,
-            modeloutput=extracted_answer,
-        ),
-        max_completion_tokens=max_completion_tokens,
-        retries=retries,
-        stage="babyvision_judge",
-    )
-    judge_response_clean = str(raw).strip().lower()
-    if "true" in judge_response_clean:
-        correct = True
-    elif "false" in judge_response_clean:
-        correct = False
-    else:
-        correct = False
-    return {
-        "raw": raw,
-        "correct": correct,
-        "reason": judge_response_clean,
-        "matched_gold": ground_truth if correct else "",
-    }
-
-
-def evaluate_item(
-    *,
-    item: dict,
-    prediction_text: str,
-    judge_model: str,
-    max_completion_tokens: int = 256,
-    retries: int = 5,
-) -> dict:
-    answer = extract_boxed_answer(prediction_text)
-    judge = _judge_answer(
-        item=item,
-        prediction_text=prediction_text,
-        extracted_answer=answer,
-        judge_model=judge_model,
-        max_completion_tokens=max_completion_tokens,
-        retries=retries,
-    )
-    hard = 1.0 if judge["correct"] else 0.0
-
-    result = {
-        "evaluation_mode": _EVAL_MODE,
-        "predicted_answer": answer,
-        "em": hard,
-        "f1": hard,
-        "sub_em": hard,
-        "judge_model": judge_model,
-        "judge_raw": judge["raw"],
-        "judge_reason": judge["reason"],
-        "matched_gold": judge["matched_gold"],
-    }
-
-    if item["ans_type"] == "choice":
-        result["predicted_label"] = str(answer or "").strip().upper().rstrip(".):")
-        result["predicted_text"] = ""
-        result["correct_label"] = str(item["correct_choice"].get("label") or "")
-        result["correct_text"] = str(item["correct_choice"].get("text") or "")
-    else:
-        result["gold_answers"] = list(item["blank_answers"])
-        best_f1 = 0.0
-        for gold in item["blank_answers"]:
-            best_f1 = max(best_f1, _token_f1(str(answer or ""), gold))
-        result["string_f1"] = best_f1
-
-    return result
-
-
-def evaluation_mode() -> str:
-    return _EVAL_MODE
--- a/skillopt/envs/babyvision/prompts/analyst_error.md
+++ b/skillopt/envs/babyvision/prompts/analyst_error.md
@@ -1,36 +0,0 @@
-You are an expert failure-analysis agent for child-level visual reasoning tasks.
-
-You will be given MULTIPLE failed BabyVision trajectories from a minibatch and the current skill document.
-Each trajectory includes the text prompt, the model answer, and the evaluation result.
-You do not have direct access to raw pixel content during reflection, so focus on general reasoning,
-option-selection, and visual-question-answering behaviors that can be improved through prompting.
-
-## Failure Type Categories
-- **visual_detail_miss**: the agent likely overlooked a salient visual attribute, relation, count, or object state
-- **option_mismatch**: the agent selected the wrong option despite relevant evidence likely being present
-- **instruction_slip**: the agent ignored output format or answered too vaguely
-- **answer_granularity**: the agent gave an answer that was too broad, too narrow, or mismatched the expected specificity
-- **other**: none of the above
-
-## Rules
-1. Focus on patterns recurring across the minibatch.
-2. Prefer reusable behaviors for inspecting images and grounding answers in visible evidence.
-3. Do not memorize dataset-specific answers.
-4. Only patch gaps not already covered by the current skill.
-
-Respond ONLY with a valid JSON object:
-{
-  "batch_size": <number>,
-  "failure_summary": [
-    {"failure_type": "<type>", "count": <int>, "description": "<one-line>"}
-  ],
-  "patch": {
-    "reasoning": "<why these edits address the common failures>",
-    "edits": [
-      {"op": "append",       "content": "<markdown>"},
-      {"op": "insert_after", "target": "<heading/text>", "content": "<markdown>"},
-      {"op": "replace",      "target": "<old text>",     "content": "<new text>"},
-      {"op": "delete",       "target": "<exact text to remove>"}
-    ]
-  }
-}
--- a/skillopt/envs/babyvision/prompts/analyst_success.md
+++ b/skillopt/envs/babyvision/prompts/analyst_success.md
@@ -1,25 +0,0 @@
-You are an expert success-pattern analyst for child-level visual reasoning tasks.
-
-You will be given MULTIPLE successful BabyVision trajectories from a minibatch and the current skill document.
-Identify generalizable behavior patterns that help the agent inspect the image carefully and answer at the right level of specificity.
-
-## Rules
-- Focus on broadly useful visual QA behaviors.
-- Prefer patterns about systematic image inspection, comparing options, and concise grounded answers.
-- Do not add dataset-specific facts.
-- "edits" may be empty if the skill already captures the useful patterns.
-
-Respond ONLY with a valid JSON object:
-{
-  "batch_size": <number>,
-  "success_patterns": ["<pattern 1>", "<pattern 2>"],
-  "patch": {
-    "reasoning": "<why these patterns matter>",
-    "edits": [
-      {"op": "append",       "content": "<markdown>"},
-      {"op": "insert_after", "target": "<heading/text>", "content": "<markdown>"},
-      {"op": "replace",      "target": "<old text>",     "content": "<new text>"},
-      {"op": "delete",       "target": "<exact text to remove>"}
-    ]
-  }
-}
--- a/skillopt/envs/babyvision/prompts/deep_probe.md
+++ b/skillopt/envs/babyvision/prompts/deep_probe.md
@@ -1,25 +0,0 @@
-You are an expert diagnostic-probe designer for BabyVision-style visual reasoning tasks.
-
-You will be shown representative trajectories, the current target skill, and the target's original prompt context.
-Design one SMALL diagnostic instruction that exposes the target's intermediate visual judgment without materially changing the original scaffold.
-
-## Hard Constraints
-1. Do NOT substantially change the original scaffold.
-2. Do NOT prescribe a new step-by-step solving method.
-3. You MAY ask for a short structured list of a few intermediate conclusions, candidate cues, or counted units, as long as it stays close to the original scaffold.
-4. Do NOT ask for exhaustive listing of all cells, all objects, or a full chain-of-thought.
-5. Ask only for a short readout that reveals the target's current latent state.
-6. Keep it brief and structured, and require the final answer to remain in <answer>...</answer>.
-
-## Good Probe Targets
-- top answer and runner-up
-- decisive visual cue
-- suspicious region or compared objects
-- counting unit or formatting interpretation
-- 2-4 short intermediate conclusions that directly support the final answer
-
-Respond ONLY with a valid JSON object:
-{
-  "reasoning": "<why this probe is informative>",
-  "probe_instruction": "<the exact instruction text to append to the target prompt>"
-}
--- a/skillopt/envs/babyvision/prompts/judge.md
+++ b/skillopt/envs/babyvision/prompts/judge.md
@@ -1,35 +0,0 @@
-You are a careful and strict evaluator. You will be given:
-
-1. **Question**
-2. **Ground Truth Answer** (correct answer)
-3. **Model Output** (answer from another model)
-
-**Your goal:** Determine if the Model Output **accurately matches** the Ground Truth Answer in meaning.
-
-* Matching means: the facts, entities, and key details are equivalent, even if phrasing differs.
-* Not matching means: the Model Output is wrong, incomplete, contains extra incorrect facts, or changes the meaning.
-
-**Process (internal reasoning):**
-
-1. Read and understand the Question, Ground Truth Answer, and Model Output.
-2. Ignore small wording differences, formatting, or synonyms.
-3. If all factual content matches, conclude `1`. Otherwise, conclude `0`.
-
-**Important:**
-
-* Think through your decision step-by-step **internally** before responding.
-* In your final output, return **only** True or False, with no extra text or explanation.
-
-**Output format:**
-
-True
-
-or
-
-False
-
-**Input:**
-
-Question: {question},
-Ground Truth Answer: {groundtruth},
-Model Output: {modeloutput}
--- a/skillopt/envs/babyvision/prompts/rollout_system.md
+++ b/skillopt/envs/babyvision/prompts/rollout_system.md
@@ -1,13 +0,0 @@
-You are an expert visual reasoning agent solving child-level image understanding tasks.
-
-{skill_section}## Task Format
-You will receive one image and one question about it.
-Inspect the image carefully before answering. Ground the answer in visible evidence.
-
-## Answer Format
-Think step by step, then provide your final answer in \boxed{{Answer}} format.
-- For multiple-choice questions, output only the single choice label, such as \boxed{{A}}.
-- For open questions, output only a short final answer inside \boxed{{...}}.
-
-Example:
-\boxed{{B}}
--- a/skillopt/envs/babyvision/reflect.py
+++ b/skillopt/envs/babyvision/reflect.py
@@ -1,4 +0,0 @@
-"""BabyVision Reflect stage.
-
-Prompts are now loaded from .md files by the base adapter.
-"""
--- a/skillopt/envs/babyvision/rollout.py
+++ b/skillopt/envs/babyvision/rollout.py
@@ -1,483 +0,0 @@
-"""BabyVision rollout — multimodal visual QA with image input."""
-from __future__ import annotations
-
-import base64
-import json
-import mimetypes
-import os
-from concurrent.futures import ThreadPoolExecutor, as_completed
-
-from skillopt.envs.babyvision.evaluator import evaluate_item, evaluation_mode, extract_boxed_answer
-from skillopt.model import chat_target_messages, get_target_backend, is_target_exec_backend
-from skillopt.model.codex_harness import prepare_workspace, render_skill_md, run_target_exec
-from skillopt.prompts import load_prompt
-
-def _build_system(skill_content: str) -> str:
-    if skill_content.strip():
-        skill_section = f"## Skill\n{skill_content.strip()}\n\n"
-    else:
-        skill_section = ""
-    return load_prompt("rollout_system", env="babyvision").format(skill_section=skill_section)
-
-
-def _format_choices(choices: list[dict]) -> str:
-    return "\n".join(f"{choice['label']}. {choice['text']}" for choice in choices)
-
-
-def _build_user_text(
-    item: dict,
-    *,
-    diagnostic_mode: bool = False,
-    diagnostic_instruction: str = "",
-    diagnostic_trace_context: str = "",
-) -> str:
-    parts = []
-    if diagnostic_trace_context.strip():
-        parts.append(
-            "## Previous Codex Trace Snapshot\n"
-            "This is a partial transcript from an earlier attempt. Use it as your current reasoning context.\n\n"
-            f"{diagnostic_trace_context.strip()}"
-        )
-    parts.append(f"## Question\n{item['question']}")
-    if item["ans_type"] == "choice":
-        parts.append(f"## Choices\n{_format_choices(item['choices'])}")
-        parts.append("Answer using the single correct option label in \\boxed{...}.")
-    else:
-        parts.append("Answer with a short phrase in \\boxed{...}.")
-    if diagnostic_mode and diagnostic_instruction.strip():
-        parts.append(f"## Training Readout\n{diagnostic_instruction.strip()}")
-    return "\n\n".join(parts)
-
-
-def _image_to_data_uri(path: str) -> str:
-    mime = mimetypes.guess_type(path)[0] or "image/png"
-    with open(path, "rb") as f:
-        encoded = base64.b64encode(f.read()).decode("ascii")
-    return f"data:{mime};base64,{encoded}"
-
-
-def _build_messages(
-    item: dict,
-    skill_content: str,
-    image_detail: str,
-    *,
-    diagnostic_mode: bool = False,
-    diagnostic_instruction: str = "",
-    diagnostic_trace_context: str = "",
-) -> tuple[list[dict], str, str]:
-    system = _build_system(skill_content)
-    user_text = _build_user_text(
-        item,
-        diagnostic_mode=diagnostic_mode,
-        diagnostic_instruction=diagnostic_instruction,
-        diagnostic_trace_context=diagnostic_trace_context,
-    )
-    image_url = {
-        "url": _image_to_data_uri(item["image_path"]),
-    }
-    if image_detail and image_detail != "auto":
-        image_url["detail"] = image_detail
-    messages = [
-        {"role": "system", "content": system},
-        {
-            "role": "user",
-            "content": [
-                {"type": "text", "text": user_text},
-                {"type": "image_url", "image_url": image_url},
-            ],
-        },
-    ]
-    return messages, system, user_text
-
-
-def _build_codex_skill(skill_content: str) -> str:
-    return render_skill_md(
-        skill_content,
-        description="Dynamic ReflACT skill for solving the current BabyVision visual reasoning question.",
-        preamble=(
-            "Use this skill when answering the current visual reasoning question.\n"
-            "Inspect the attached image carefully and return the final answer in \\boxed{...}."
-        ),
-    )
-
-
-def _run_codex_once(
-    *,
-    pred_dir: str,
-    item: dict,
-    skill_content: str,
-    model: str,
-    timeout: int,
-    image_detail: str,
-    diagnostic_mode: bool = False,
-    diagnostic_instruction: str = "",
-    diagnostic_trace_context: str = "",
-    previous_response: str = "",
-) -> tuple[str, str, str, str]:
-    user_text = _build_user_text(
-        item,
-        diagnostic_mode=diagnostic_mode,
-        diagnostic_instruction=diagnostic_instruction,
-        diagnostic_trace_context=diagnostic_trace_context,
-    )
-    task_parts = [user_text]
-    if previous_response:
-        task_parts.append(
-            "## Previous Attempt\n"
-            f"{previous_response}\n\n"
-            "Review the same image and question carefully. If needed, correct the answer."
-        )
-    task_text = "\n\n".join(task_parts)
-    skill_md = _build_codex_skill(skill_content)
-    work_dir = os.path.join(pred_dir, "codex_exec")
-    prepare_workspace(
-        work_dir=work_dir,
-        skill_md=skill_md,
-        task_text=task_text,
-        images=[item["image_path"]],
-    )
-    prompt = (
-        "Use the `skillopt-target` skill available in this workspace.\n"
-        "Read `task.md`, inspect the attached image, and answer the question.\n"
-        "Return the final answer in \\boxed{...}."
-    )
-    final_message, raw = run_target_exec(
-        work_dir=work_dir,
-        prompt=prompt,
-        model=model,
-        timeout=timeout,
-        images=[item["image_path"]],
-    )
-    return final_message or raw, raw, skill_md, task_text
-
-
-def process_one(
-    item: dict,
-    out_root: str,
-    skill_content: str,
-    *,
-    max_turns: int = 1,
-    image_detail: str = "auto",
-    judge_model: str = "gpt-5.4",
-    judge_max_completion_tokens: int = 256,
-    judge_retries: int = 5,
-    diagnostic_mode: bool = False,
-    diagnostic_instruction: str = "",
-    diagnostic_trace_context: str = "",
-) -> dict:
-    item_id = str(item["id"])
-    result = {
-        "id": item_id,
-        "question": item["question"],
-        "task_type": item.get("subtype") or item.get("task_type") or "babyvision",
-        "task_description": item["question"],
-        "hard": 0,
-        "soft": 0.0,
-        "predicted_answer": "",
-        "predicted_label": "",
-        "predicted_text": "",
-        "response": "",
-        "fail_reason": "",
-        "agent_ok": False,
-        "n_turns": 0,
-        "image_path": item["image_path"],
-        "ans_type": item["ans_type"],
-        "evaluation_mode": evaluation_mode(),
-        "judge_model": judge_model,
-    }
-    if item["ans_type"] == "choice":
-        result["correct_label"] = item["correct_choice"]["label"]
-        result["correct_text"] = item["correct_choice"]["text"]
-    else:
-        result["gold_answers"] = item["blank_answers"]
-
-    try:
-        pred_dir = os.path.join(out_root, "predictions", item_id)
-        os.makedirs(pred_dir, exist_ok=True)
-
-        if is_target_exec_backend():
-            from skillopt.model import azure_openai as _llm
-
-            response = ""
-            conversation: list[dict] = [
-                {"role": "user", "content": f"{item['question']}\n\n[image] {os.path.basename(item['image_path'])}"}
-            ]
-            system_prompt = ""
-            user_text = ""
-            for turn in range(max_turns):
-                response, raw, system_prompt, user_text = _run_codex_once(
-                    pred_dir=pred_dir,
-                    item=item,
-                    skill_content=skill_content,
-                    model=_llm.TARGET_DEPLOYMENT,
-                    timeout=120,
-                    image_detail=image_detail,
-                    diagnostic_mode=diagnostic_mode if turn == 0 else False,
-                    diagnostic_instruction=diagnostic_instruction if turn == 0 else "",
-                    diagnostic_trace_context=diagnostic_trace_context if turn == 0 else "",
-                    previous_response=response if turn > 0 else "",
-                )
-                conversation.append({"type": "message", "turn": turn + 1, "content": response})
-                if extract_boxed_answer(response) is not None:
-                    break
-
-            result["response"] = response
-            result["agent_ok"] = True
-            result["n_turns"] = len(conversation) - 1
-            with open(os.path.join(pred_dir, "target_system_prompt.txt"), "w", encoding="utf-8") as f:
-                f.write(system_prompt)
-            with open(os.path.join(pred_dir, "target_user_prompt.txt"), "w", encoding="utf-8") as f:
-                f.write(user_text)
-
-            eval_result = evaluate_item(
-                item=item,
-                prediction_text=response,
-                judge_model=judge_model,
-                max_completion_tokens=judge_max_completion_tokens,
-                retries=judge_retries,
-            )
-            result["evaluation_mode"] = eval_result["evaluation_mode"]
-            result["judge_raw"] = eval_result["judge_raw"]
-            result["judge_reason"] = eval_result["judge_reason"]
-            result["matched_gold"] = eval_result["matched_gold"]
-            if item["ans_type"] == "choice":
-                result["predicted_label"] = eval_result["predicted_label"]
-                result["predicted_text"] = eval_result["predicted_text"]
-                result["predicted_answer"] = eval_result["predicted_answer"]
-                result["hard"] = int(eval_result["em"])
-                result["soft"] = eval_result["f1"]
-                if not result["hard"]:
-                    result["fail_reason"] = (
-                        f"judge=0: predicted '{eval_result['predicted_label'] or eval_result['predicted_answer']}' "
-                        f"but expected '{eval_result['correct_label']}' ({eval_result['judge_reason']})"
-                    )
-                eval_detail = (
-                    f"[EVALUATION RESULT]\n"
-                    f"Question: {item['question']}\n"
-                    f"Predicted label: {eval_result['predicted_label']!r}\n"
-                    f"Predicted text: {eval_result['predicted_text']!r}\n"
-                    f"Correct label: {eval_result['correct_label']!r}\n"
-                    f"Correct text: {eval_result['correct_text']!r}\n"
-                    f"Judge correct: {eval_result['em']}\n"
-                    f"Judge reason: {eval_result['judge_reason']}"
-                )
-            else:
-                result["predicted_answer"] = eval_result["predicted_answer"]
-                result["hard"] = int(eval_result["em"])
-                result["soft"] = eval_result["f1"]
-                if not result["hard"]:
-                    result["fail_reason"] = (
-                        f"judge=0: predicted '{eval_result['predicted_answer']}' "
-                        f"but expected {item['blank_answers']} ({eval_result['judge_reason']})"
-                    )
-                eval_detail = (
-                    f"[EVALUATION RESULT]\n"
-                    f"Question: {item['question']}\n"
-                    f"Predicted answer: {eval_result['predicted_answer']!r}\n"
-                    f"Gold answers: {item['blank_answers']!r}\n"
-                    f"Judge correct: {eval_result['em']}\n"
-                    f"Judge reason: {eval_result['judge_reason']}\n"
-                    f"String F1: {eval_result.get('string_f1', 0.0):.4f}"
-                )
-            conversation.append({"role": "system", "content": eval_detail})
-            with open(os.path.join(pred_dir, "conversation.json"), "w", encoding="utf-8") as f:
-                json.dump(conversation, f, ensure_ascii=False, indent=2)
-            return result
-
-        messages, system_prompt, user_text = _build_messages(
-            item,
-            skill_content,
-            image_detail,
-            diagnostic_mode=diagnostic_mode,
-            diagnostic_instruction=diagnostic_instruction,
-            diagnostic_trace_context=diagnostic_trace_context,
-        )
-        response = ""
-        conversation: list[dict] = [
-            {"role": "user", "content": f"{user_text}\n\n[image] {os.path.basename(item['image_path'])}"}
-        ]
-
-        for turn in range(max_turns):
-            if turn == 0:
-                resp_text, _ = chat_target_messages(
-                    messages=messages,
-                    max_completion_tokens=768,
-                    retries=5,
-                    stage="rollout",
-                )
-            else:
-                refinement_text = (
-                    f"Your previous answer was:\n{response}\n\n"
-                    "Review the same image and question carefully. "
-                    "If needed, correct your answer. Output the final answer in \\boxed{...}."
-                )
-                refinement_messages = [
-                    messages[0],
-                    messages[1],
-                    {"role": "assistant", "content": response},
-                    {"role": "user", "content": refinement_text},
-                ]
-                resp_text, _ = chat_target_messages(
-                    messages=refinement_messages,
-                    max_completion_tokens=512,
-                    retries=5,
-                    stage="rollout",
-                )
-            response = resp_text
-            conversation.append({"type": "message", "turn": turn + 1, "content": resp_text})
-            if extract_boxed_answer(resp_text) is not None:
-                break
-
-        result["response"] = response
-        result["agent_ok"] = True
-        result["n_turns"] = len(conversation) - 1
-
-        with open(os.path.join(pred_dir, "target_system_prompt.txt"), "w", encoding="utf-8") as f:
-            f.write(system_prompt)
-        with open(os.path.join(pred_dir, "target_user_prompt.txt"), "w", encoding="utf-8") as f:
-            f.write(user_text)
-
-        eval_result = evaluate_item(
-            item=item,
-            prediction_text=response,
-            judge_model=judge_model,
-            max_completion_tokens=judge_max_completion_tokens,
-            retries=judge_retries,
-        )
-        result["evaluation_mode"] = eval_result["evaluation_mode"]
-        result["judge_raw"] = eval_result["judge_raw"]
-        result["judge_reason"] = eval_result["judge_reason"]
-        result["matched_gold"] = eval_result["matched_gold"]
-
-        if item["ans_type"] == "choice":
-            result["predicted_label"] = eval_result["predicted_label"]
-            result["predicted_text"] = eval_result["predicted_text"]
-            result["predicted_answer"] = eval_result["predicted_answer"]
-            result["hard"] = int(eval_result["em"])
-            result["soft"] = eval_result["f1"]
-            if not result["hard"]:
-                result["fail_reason"] = (
-                    f"judge=0: predicted '{eval_result['predicted_label'] or eval_result['predicted_answer']}' "
-                    f"but expected '{eval_result['correct_label']}' ({eval_result['judge_reason']})"
-                )
-            eval_detail = (
-                f"[EVALUATION RESULT]\n"
-                f"Question: {item['question']}\n"
-                f"Predicted label: {eval_result['predicted_label']!r}\n"
-                f"Predicted text: {eval_result['predicted_text']!r}\n"
-                f"Correct label: {eval_result['correct_label']!r}\n"
-                f"Correct text: {eval_result['correct_text']!r}\n"
-                f"Judge correct: {eval_result['em']}\n"
-                f"Judge reason: {eval_result['judge_reason']}"
-            )
-        else:
-            result["predicted_answer"] = eval_result["predicted_answer"]
-            result["hard"] = int(eval_result["em"])
-            result["soft"] = eval_result["f1"]
-            if not result["hard"]:
-                result["fail_reason"] = (
-                    f"judge=0: predicted '{eval_result['predicted_answer']}' "
-                    f"but expected {item['blank_answers']} ({eval_result['judge_reason']})"
-                )
-            eval_detail = (
-                f"[EVALUATION RESULT]\n"
-                f"Question: {item['question']}\n"
-                f"Predicted answer: {eval_result['predicted_answer']!r}\n"
-                f"Gold answers: {item['blank_answers']!r}\n"
-                f"Judge correct: {eval_result['em']}\n"
-                f"Judge reason: {eval_result['judge_reason']}\n"
-                f"String F1: {eval_result.get('string_f1', 0.0):.4f}"
-            )
-
-        conversation.append({"role": "system", "content": eval_detail})
-        with open(os.path.join(pred_dir, "conversation.json"), "w", encoding="utf-8") as f:
-            json.dump(conversation, f, ensure_ascii=False, indent=2)
-    except Exception as e:  # noqa: BLE001
-        result["fail_reason"] = f"error: {e}"
-    return result
-
-
-def run_batch(
-    items: list[dict],
-    out_root: str,
-    skill_content: str,
-    *,
-    max_turns: int = 1,
-    workers: int = 32,
-    image_detail: str = "auto",
-    judge_model: str = "gpt-5.4",
-    judge_max_completion_tokens: int = 256,
-    judge_retries: int = 5,
-    diagnostic_mode: bool = False,
-    diagnostic_instruction: str = "",
-    diagnostic_trace_context_by_id: dict[str, str] | None = None,
-) -> list[dict]:
-    results_path = os.path.join(out_root, "results.jsonl")
-    os.makedirs(out_root, exist_ok=True)
-
-    expected_eval_mode = evaluation_mode()
-    done_ids: set[str] = set()
-    existing: list[dict] = []
-    rewrite_results = False
-    if os.path.exists(results_path):
-        with open(results_path, encoding="utf-8") as f:
-            for line in f:
-                try:
-                    row = json.loads(line)
-                    if row.get("evaluation_mode") != expected_eval_mode:
-                        rewrite_results = True
-                        continue
-                    done_ids.add(str(row["id"]))
-                    existing.append(row)
-                except Exception:
-                    rewrite_results = True
-
-    pending = [item for item in items if str(item["id"]) not in done_ids]
-    if not pending and not rewrite_results:
-        return existing
-
-    total = len(existing) + len(pending)
-    completed = len(existing)
-    correct_count = sum(1 for r in existing if r.get("hard", 0))
-    if existing:
-        print(f"    [rollout] resuming: {completed}/{total} already done", flush=True)
-
-    results = list(existing)
-    file_mode = "w" if rewrite_results else "a"
-    with open(results_path, file_mode, encoding="utf-8") as outf, ThreadPoolExecutor(max_workers=workers) as ex:
-        if rewrite_results:
-            for row in existing:
-                outf.write(json.dumps(row, ensure_ascii=False) + "\n")
-        futs = {
-            ex.submit(
-                process_one,
-                item,
-                out_root,
-                skill_content,
-                max_turns=max_turns,
-                image_detail=image_detail,
-                judge_model=judge_model,
-                judge_max_completion_tokens=judge_max_completion_tokens,
-                judge_retries=judge_retries,
-                diagnostic_mode=diagnostic_mode,
-                diagnostic_instruction=diagnostic_instruction,
-                diagnostic_trace_context=(diagnostic_trace_context_by_id or {}).get(str(item["id"]), ""),
-            ): item
-            for item in pending
-        }
-        for fut in as_completed(futs):
-            row = fut.result()
-            results.append(row)
-            completed += 1
-            if row.get("hard", 0):
-                correct_count += 1
-            acc = correct_count / completed if completed else 0
-            print(
-                f"    [rollout] {completed}/{total} "
-                f"(acc={acc:.3f}) id={row.get('id', '?')} "
-                f"hard={row.get('hard', '?')}",
-                flush=True,
-            )
-            outf.write(json.dumps(row, ensure_ascii=False) + "\n")
-            outf.flush()
-    return results
--- a/skillopt/envs/babyvision/skills/initial.md
+++ b/skillopt/envs/babyvision/skills/initial.md
@@ -1,18 +0,0 @@
-# BabyVision Visual QA Heuristics
-
-## Image Inspection
-- First identify the main objects, their attributes, and their spatial relations before answering.
-- If the question involves counting, compare all relevant instances carefully instead of stopping after the first match.
-- If the question asks about color, size, position, or action, verify the specific visible evidence for that attribute.
-
-## Multiple Choice
-- Compare every option against the visible image evidence before deciding.
-- Prefer the option that matches the image exactly; reject options that are only partially true or too vague.
-- When two options are close, check the smallest discriminating visual detail.
-
-## Open Answers
-- Answer with the shortest phrase that is fully supported by the image.
-- Match the expected level of specificity: not broader than the image evidence, not narrower than the question asks.
-
-## Final Answer
-- Output only the final answer inside <answer>...</answer>.
--- a/skillopt/envs/deep_reflect.py
+++ b/skillopt/envs/deep_reflect.py
@@ -1,114 +0,0 @@
-from __future__ import annotations
-
-import json
-import os
-from typing import Any, Callable
-
-from skillopt.gradient.deep_probe import generate_deep_probe_instruction
-from skillopt.gradient.reflect import run_minibatch_reflect
-
-
-def run_no_reference_deep_reflect(
-    adapter: Any,
-    results: list[dict],
-    skill_content: str,
-    out_dir: str,
-    *,
-    env_manager: Any = None,
-    prediction_dir: str | None = None,
-    random_seed: int | None = None,
-    step_buffer_context: str = "",
-    output_requirements: list[str] | None = None,
-    metadata_builder: Callable[[dict], dict] | None = None,
-) -> list[dict | None]:
-    """Run optimizer-designed diagnostic probing without hidden references."""
-    if not getattr(adapter, "use_deep_reflect", False):
-        return []
-    if not isinstance(env_manager, list):
-        return []
-
-    prediction_dir = prediction_dir or os.path.join(out_dir, "predictions")
-    selected_items = adapter.select_representative_items(
-        results,
-        env_manager,
-        n_failures=getattr(adapter, "deep_reflect_failures", 4),
-        n_successes=getattr(adapter, "deep_reflect_successes", 2),
-        seed=random_seed,
-    )
-    if not selected_items:
-        return []
-
-    selected_ids = {str(item["id"]) for item in selected_items}
-    selected_results = [row for row in results if str(row.get("id")) in selected_ids]
-    if metadata_builder is None:
-        selected_metadata = [
-            {
-                "id": str(item.get("id")),
-                "task_type": str(item.get("task_type") or item.get("topic") or "unknown"),
-                "question_preview": str(item.get("question") or "")[:200],
-            }
-            for item in selected_items
-        ]
-    else:
-        selected_metadata = [metadata_builder(item) for item in selected_items]
-
-    deep_dir = os.path.join(out_dir, "deep_reflect")
-    rollout_dir = os.path.join(deep_dir, "rollout")
-    patches_dir = os.path.join(deep_dir, "patches")
-    os.makedirs(deep_dir, exist_ok=True)
-    print(
-        f"    [2b/6 DEEP REFLECT setup] selected={len(selected_items)} "
-        "mode=no_reference_probe"
-    )
-
-    probe = generate_deep_probe_instruction(
-        skill_content=skill_content,
-        items=selected_results,
-        prediction_dir=prediction_dir,
-        system_prompt=adapter.get_deep_probe_prompt(),
-        step_buffer_context=step_buffer_context,
-        output_requirements=output_requirements,
-    )
-    if not probe:
-        return []
-
-    with open(os.path.join(deep_dir, "probe.json"), "w", encoding="utf-8") as f:
-        json.dump(
-            {
-                **probe,
-                "reference_summary": {
-                    "mode": "no_reference_probe",
-                    "selected_count": len(selected_items),
-                },
-                "selected_examples": selected_metadata,
-            },
-            f,
-            ensure_ascii=False,
-            indent=2,
-        )
-
-    deep_results = adapter.rollout(
-        selected_items,
-        skill_content,
-        rollout_dir,
-        diagnostic_mode=True,
-        diagnostic_instruction=probe["probe_instruction"],
-    )
-    return run_minibatch_reflect(
-        results=deep_results,
-        skill_content=skill_content,
-        prediction_dir=os.path.join(rollout_dir, "predictions"),
-        patches_dir=patches_dir,
-        workers=getattr(adapter, "analyst_workers", 8),
-        failure_only=getattr(adapter, "failure_only", False),
-        minibatch_size=getattr(adapter, "minibatch_size", 8),
-        edit_budget=getattr(adapter, "edit_budget", 4),
-        random_seed=random_seed,
-        error_system=adapter.get_error_minibatch_prompt(),
-        success_system=adapter.get_success_minibatch_prompt(),
-        step_buffer_context=step_buffer_context,
-        update_mode=getattr(getattr(adapter, "_cfg", {}), "get", lambda *_: "patch")(
-            "skill_update_mode",
-            "patch",
-        ),
-    )
--- a/skillopt/envs/livemathematicianbench/prompts/deep_probe.md
+++ b/skillopt/envs/livemathematicianbench/prompts/deep_probe.md
@@ -1,23 +0,0 @@
-You are an expert diagnostic-probe designer for theorem-grounded mathematical multiple-choice tasks.
-
-You will be shown representative trajectories, the current target skill, and the target's original prompt context.
-Design one SMALL diagnostic instruction that exposes the target's intermediate judgment without materially changing the original scaffold.
-
-## Hard Constraints
-1. Do NOT substantially change the original scaffold.
-2. Do NOT prescribe a new multi-step theorem-solving procedure.
-3. Do NOT ask for a full proof, full chain-of-thought, or exhaustive option-by-option derivation.
-4. Ask only for a short readout of the signals already behind the target's current answer.
-5. Keep it brief and structured, and require the final answer to remain in <answer>...</answer>.
-
-## Good Probe Targets
-- top choice and runner-up
-- decisive constraint
-- why the runner-up was rejected
-- strongest-vs-weaker discrimination signal
-
-Respond ONLY with a valid JSON object:
-{
-  "reasoning": "<why this probe is informative>",
-  "probe_instruction": "<the exact instruction text to append to the target prompt>"
-}
--- a/skillopt/envs/livemathematicianbench/prompts/deep_probe_codex.md
+++ b/skillopt/envs/livemathematicianbench/prompts/deep_probe_codex.md
@@ -1,26 +0,0 @@
-You are an expert diagnostic-probe designer for theorem-grounded mathematical multiple-choice tasks executed through a Codex trace.
-
-You will be shown representative trajectories, the current target skill, the target's original prompt context, hidden reference fields, and numbered Codex trace steps.
-Choose exactly one trajectory and one probe point. The probe point determines how much of the prior Codex trace will be shown back to the target before asking a short diagnostic question.
-
-## Hard Constraints
-1. Do NOT reveal or paraphrase the hidden reference directly to the target.
-2. Do NOT prescribe a new full solving procedure.
-3. Do NOT ask for a full proof, full chain-of-thought, or exhaustive option-by-option derivation.
-4. Ask only for a short readout of the signal that should already exist at that point in the target's process.
-5. The probe instruction must explicitly request a short <analysis>...</analysis> block before the final <answer>...</answer>.
-6. Select a probe point that is informative about theorem choice, decisive constraint, option elimination, or why a stronger/weaker option should be rejected.
-
-## Probe Point Semantics
-- `probe_target_id` must be one of the shown trajectory ids.
-- `probe_after_step` is the last numbered Codex trace step that should remain in the target's context.
-- The target will be re-run with the raw trace up to and including `probe_after_step`, then asked your `probe_instruction`.
-- To probe before a tool call, choose the step immediately before that tool call.
-
-Respond ONLY with a valid JSON object:
-{
-  "reasoning": "<why this trajectory and probe point expose the target's intermediate state>",
-  "probe_target_id": "<trajectory id>",
-  "probe_after_step": <integer step number>,
-  "probe_instruction": "<the exact instruction text to append to the target's prompt>"
-}
--- a/skillopt/envs/mathverse/__init__.py
+++ b/skillopt/envs/mathverse/__init__.py
@@ -1,5 +0,0 @@
-"""MathVerse environment package."""
-
-from skillopt.envs.mathverse.adapter import MathVerseAdapter
-
-__all__ = ["MathVerseAdapter"]
--- a/skillopt/envs/mathverse/adapter.py
+++ b/skillopt/envs/mathverse/adapter.py
@@ -1,280 +0,0 @@
-"""MathVerse environment adapter for ReflACT."""
-from __future__ import annotations
-
-import json
-import os
-
-from skillopt.datasets.base import BatchSpec
-from skillopt.envs.base import EnvAdapter
-from skillopt.envs.mathverse.dataloader import MathVerseDataLoader
-from skillopt.envs.mathverse.rollout import run_batch
-from skillopt.gradient.deep_probe import generate_deep_probe_instruction
-from skillopt.gradient.reflect import run_minibatch_reflect
-from skillopt.model import get_target_backend
-
-
-class MathVerseAdapter(EnvAdapter):
-    """MathVerse adapter."""
-
-    def build_reference_text(self, item: dict) -> str:
-        if not self.use_text_dominant_reference:
-            return ""
-        question = str(item.get("text_dominant_question") or "").strip()
-        if not question:
-            return ""
-        return f"## Reference Full Question\n{question}"
-
-    def get_reference_metadata(self, item: dict) -> dict:
-        if not self.use_text_dominant_reference:
-            return {"fields": [], "preview": ""}
-        question = str(item.get("text_dominant_question") or "").strip()
-        if not question:
-            return {"fields": [], "preview": ""}
-        return {
-            "fields": ["text_dominant_question"],
-            "preview": question[:400],
-        }
-
-    def __init__(
-        self,
-        split_dir: str = "",
-        data_root: str = "",
-        problem_version: str = "Text Lite",
-        use_text_dominant_reference: bool = False,
-        max_turns: int = 1,
-        workers: int = 16,
-        analyst_workers: int = 16,
-        failure_only: bool = False,
-        minibatch_size: int = 8,
-        edit_budget: int = 4,
-        seed: int = 42,
-        limit: int = 0,
-        image_detail: str = "auto",
-        judge_model: str = "gpt-5.4",
-        judge_max_completion_tokens: int = 256,
-        judge_retries: int = 5,
-        use_deep_reflect: bool = False,
-        deep_reflect_failures: int = 4,
-        deep_reflect_successes: int = 2,
-    ) -> None:
-        self.max_turns = max_turns
-        self.workers = workers
-        self.analyst_workers = analyst_workers
-        self.failure_only = failure_only
-        self.minibatch_size = minibatch_size
-        self.edit_budget = edit_budget
-        self.image_detail = image_detail
-        self.judge_model = judge_model
-        self.judge_max_completion_tokens = judge_max_completion_tokens
-        self.judge_retries = judge_retries
-        self.problem_version = problem_version
-        self.use_text_dominant_reference = use_text_dominant_reference
-        self.use_deep_reflect = use_deep_reflect
-        self.deep_reflect_failures = deep_reflect_failures
-        self.deep_reflect_successes = deep_reflect_successes
-        self.dataloader = MathVerseDataLoader(
-            split_dir=split_dir,
-            seed=seed,
-            limit=limit,
-            data_root=data_root,
-            problem_version=problem_version,
-        )
-
-    def setup(self, cfg: dict) -> None:
-        super().setup(cfg)
-        self.dataloader.setup(cfg)
-
-    def get_dataloader(self):
-        return self.dataloader
-
-    def build_env_from_batch(self, batch: BatchSpec, **kwargs):
-        return list(batch.payload or [])
-
-    def build_train_env(self, batch_size: int, seed: int, **kwargs):
-        batch = self.dataloader.build_train_batch(batch_size=batch_size, seed=seed, **kwargs)
-        return self.build_env_from_batch(batch, **kwargs)
-
-    def build_eval_env(self, env_num: int, split: str, seed: int, **kwargs):
-        batch = self.dataloader.build_eval_batch(env_num=env_num, split=split, seed=seed, **kwargs)
-        return self.build_env_from_batch(batch, **kwargs)
-
-    def rollout(
-        self,
-        env_manager,
-        skill_content: str,
-        out_dir: str,
-        **kwargs,
-    ) -> list[dict]:
-        items: list[dict] = env_manager
-        return run_batch(
-            items=items,
-            out_root=out_dir,
-            skill_content=skill_content,
-            max_turns=self.max_turns,
-            workers=self.workers,
-            image_detail=self.image_detail,
-            judge_model=self.judge_model,
-            judge_max_completion_tokens=self.judge_max_completion_tokens,
-            judge_retries=self.judge_retries,
-            diagnostic_mode=kwargs.get("diagnostic_mode", False),
-            diagnostic_instruction=kwargs.get("diagnostic_instruction", ""),
-            diagnostic_trace_context_by_id=kwargs.get("diagnostic_trace_context_by_id"),
-        )
-
-    def reflect(
-        self,
-        results: list[dict],
-        skill_content: str,
-        out_dir: str,
-        **kwargs,
-    ) -> list[dict | None]:
-        prediction_dir = kwargs.get("prediction_dir", os.path.join(out_dir, "predictions"))
-        patches_dir = kwargs.get("patches_dir", os.path.join(out_dir, "patches"))
-        random_seed = kwargs.get("random_seed")
-        step_buffer_context = kwargs.get("step_buffer_context", "")
-
-        return run_minibatch_reflect(
-            results=results,
-            skill_content=skill_content,
-            prediction_dir=prediction_dir,
-            patches_dir=patches_dir,
-            workers=self.analyst_workers,
-            failure_only=self.failure_only,
-            minibatch_size=self.minibatch_size,
-            edit_budget=self.edit_budget,
-            random_seed=random_seed,
-            error_system=self.get_error_minibatch_prompt(),
-            success_system=self.get_success_minibatch_prompt(),
-            step_buffer_context=step_buffer_context,
-            update_mode=getattr(self, "_cfg", {}).get("skill_update_mode", "patch"),
-        )
-
-    def deep_reflect(
-        self,
-        results: list[dict],
-        skill_content: str,
-        out_dir: str,
-        **kwargs,
-    ) -> list[dict | None]:
-        if not self.use_deep_reflect:
-            return []
-
-        env_manager = kwargs.get("env_manager")
-        prediction_dir = kwargs.get("prediction_dir", os.path.join(out_dir, "predictions"))
-        random_seed = kwargs.get("random_seed")
-        step_buffer_context = kwargs.get("step_buffer_context", "")
-        selected_items = self.select_representative_items(
-            results,
-            env_manager if isinstance(env_manager, list) else None,
-            n_failures=self.deep_reflect_failures,
-            n_successes=self.deep_reflect_successes,
-            seed=random_seed,
-        )
-        if not selected_items:
-            return []
-
-        selected_ids = {str(item["id"]) for item in selected_items}
-        selected_results = [row for row in results if str(row.get("id")) in selected_ids]
-        selected_examples = self.attach_reference_context(selected_results, selected_items)
-        codex_backend = get_target_backend() == "codex_exec"
-        if codex_backend:
-            selected_examples = self.attach_codex_probe_context(selected_examples, prediction_dir)
-        selected_metadata = []
-        ref_count = 0
-        for item in selected_items:
-            meta = self.get_reference_metadata(item)
-            if meta["fields"]:
-                ref_count += 1
-            record = {
-                "id": str(item["id"]),
-                "task_type": str(item.get("task_type") or item.get("question_type") or "mathverse"),
-                "reference_fields": meta["fields"],
-                "reference_preview": meta["preview"],
-            }
-            if codex_backend:
-                record["codex_probe_step_count"] = int(
-                    next(
-                        (row.get("codex_probe_step_count", 0) for row in selected_examples if str(row.get("id")) == str(item["id"])),
-                        0,
-                    )
-                )
-            selected_metadata.append(record)
-
-        deep_dir = os.path.join(out_dir, "deep_reflect")
-        rollout_dir = os.path.join(deep_dir, "rollout")
-        patches_dir = os.path.join(deep_dir, "patches")
-        os.makedirs(deep_dir, exist_ok=True)
-        print(
-            f"    [2b/6 DEEP REFLECT setup] selected={len(selected_items)} "
-            f"reference_fields=text_dominant_question({ref_count}/{len(selected_items)})"
-        )
-        probe = generate_deep_probe_instruction(
-            skill_content=skill_content,
-            items=selected_examples,
-            prediction_dir=prediction_dir,
-            system_prompt=self.get_codex_deep_probe_prompt() if codex_backend else self.get_deep_probe_prompt(),
-            step_buffer_context=step_buffer_context,
-        )
-        if not probe:
-            return []
-
-        targeted_items = selected_items
-        diagnostic_trace_context_by_id: dict[str, str] | None = None
-        if codex_backend:
-            targeted_items, diagnostic_trace_context_by_id, probe = self.resolve_codex_probe_target(
-                selected_items=selected_items,
-                selected_examples=selected_examples,
-                prediction_dir=prediction_dir,
-                probe=probe,
-            )
-
-        with open(os.path.join(deep_dir, "probe.json"), "w", encoding="utf-8") as f:
-            json.dump(
-                {
-                    **probe,
-                    "reference_summary": {
-                        "selected_count": len(selected_items),
-                        "field_counts": {
-                            "text_dominant_question": ref_count,
-                        },
-                    },
-                    "selected_examples": selected_metadata,
-                },
-                f,
-                ensure_ascii=False,
-                indent=2,
-            )
-
-        deep_results = run_batch(
-            items=targeted_items,
-            out_root=rollout_dir,
-            skill_content=skill_content,
-            max_turns=self.max_turns,
-            workers=min(self.workers, max(len(targeted_items), 1)),
-            image_detail=self.image_detail,
-            judge_model=self.judge_model,
-            judge_max_completion_tokens=self.judge_max_completion_tokens,
-            judge_retries=self.judge_retries,
-            diagnostic_mode=True,
-            diagnostic_instruction=probe["probe_instruction"],
-            diagnostic_trace_context_by_id=diagnostic_trace_context_by_id,
-        )
-        deep_results = self.attach_reference_context(deep_results, targeted_items)
-        return run_minibatch_reflect(
-            results=deep_results,
-            skill_content=skill_content,
-            prediction_dir=os.path.join(rollout_dir, "predictions"),
-            patches_dir=patches_dir,
-            workers=self.analyst_workers,
-            failure_only=self.failure_only,
-            minibatch_size=self.minibatch_size,
-            edit_budget=self.edit_budget,
-            random_seed=random_seed,
-            error_system=self.get_error_minibatch_prompt(),
-            success_system=self.get_success_minibatch_prompt(),
-            step_buffer_context=step_buffer_context,
-            update_mode=getattr(self, "_cfg", {}).get("skill_update_mode", "patch"),
-        )
-
-    def get_task_types(self) -> list[str]:
-        return self.dataloader.get_task_types()
--- a/skillopt/envs/mathverse/dataloader.py
+++ b/skillopt/envs/mathverse/dataloader.py
@@ -1,228 +0,0 @@
-"""MathVerse task dataloader."""
-from __future__ import annotations
-
-import json
-import os
-import re
-from typing import Any
-
-from skillopt.datasets.base import SplitDataLoader
-
-
-_CHOICE_LABELS = ["A", "B", "C", "D", "E", "F", "G"]
-_CHOICE_BLOCK_RE = re.compile(r"\bChoices?\s*:\s*", re.IGNORECASE)
-_CHOICE_ITEM_RE = re.compile(r"([A-G])\s*[:.)]\s*(.*?)(?=(?:\s+[A-G]\s*[:.)])|$)", re.DOTALL)
-
-
-def _load_json(path: str) -> Any:
-    with open(path, encoding="utf-8") as f:
-        return json.load(f)
-
-
-def _normalize_space(text: Any) -> str:
-    return re.sub(r"\s+", " ", str(text or "").strip())
-
-
-def _resolve_image_path(raw_path: str, *, data_root: str, source_path: str) -> str:
-    candidates = []
-    if raw_path:
-        if os.path.isabs(raw_path):
-            candidates.append(raw_path)
-        else:
-            if data_root:
-                candidates.append(os.path.join(data_root, raw_path))
-                candidates.append(os.path.join(data_root, "images", raw_path))
-            candidates.append(os.path.join(os.path.dirname(source_path), raw_path))
-    for candidate in candidates:
-        if candidate and os.path.exists(candidate):
-            return os.path.abspath(candidate)
-    return ""
-
-
-def _split_question_and_choices(question: str) -> tuple[str, list[dict]]:
-    text = str(question or "").strip()
-    match = _CHOICE_BLOCK_RE.search(text)
-    if not match:
-        return text, []
-
-    stem = text[:match.start()].strip()
-    choice_block = text[match.end():].strip()
-    choices: list[dict] = []
-    for idx, m in enumerate(_CHOICE_ITEM_RE.finditer(choice_block)):
-        label = (m.group(1) or _CHOICE_LABELS[idx]).strip().upper()
-        choice_text = _normalize_space(m.group(2))
-        if choice_text:
-            choices.append({"label": label, "text": choice_text})
-    return stem or text, choices
-
-
-def _build_text_dominant_map(data_root: str) -> dict[str, str]:
-    if not data_root:
-        return {}
-    candidates = [
-        os.path.join(data_root, "testmini.json"),
-        os.path.join(data_root, "data", "testmini.json"),
-    ]
-    source_path = next((path for path in candidates if os.path.exists(path)), "")
-    if not source_path:
-        return {}
-
-    raw = _load_json(source_path)
-    if not isinstance(raw, list):
-        return {}
-
-    mapping: dict[str, str] = {}
-    for item in raw:
-        if not isinstance(item, dict):
-            continue
-        if str(item.get("problem_version") or "").strip() != "Text Dominant":
-            continue
-        problem_index = str(item.get("problem_index") or "").strip()
-        question = str(item.get("question") or "").strip()
-        if problem_index and question:
-            mapping[problem_index] = question
-    return mapping
-
-
-def _normalize_item(
-    item: dict,
-    *,
-    row_idx: int,
-    source_path: str,
-    data_root: str,
-    problem_version: str,
-    text_dominant_map: dict[str, str],
-) -> dict | None:
-    raw_problem_version = str(item.get("problem_version") or "").strip()
-    if problem_version and raw_problem_version and raw_problem_version != problem_version:
-        return None
-
-    question = str(item.get("question") or "").strip()
-    question_type = str(item.get("question_type") or "").strip()
-    answer = str(item.get("answer") or "").strip()
-    image_rel = str(item.get("image") or "").strip()
-    image_path = _resolve_image_path(image_rel, data_root=data_root, source_path=source_path)
-    if not answer or not image_path:
-        return None
-
-    metadata = item.get("metadata") if isinstance(item.get("metadata"), dict) else {}
-    subject = str(metadata.get("subject") or "").strip()
-    subfield = str(metadata.get("subfield") or "").strip()
-    source = str(metadata.get("source") or "").strip()
-
-    question_stem, choices = _split_question_and_choices(question)
-    is_choice = question_type == "multi-choice" or bool(choices)
-
-    correct_choice = {"label": "", "text": ""}
-    if is_choice:
-        label = str(answer).strip().upper().rstrip(".):")
-        choice_text = ""
-        for choice in choices:
-            if choice["label"].upper() == label:
-                choice_text = choice["text"]
-                break
-        correct_choice = {"label": label, "text": choice_text}
-
-    problem_index = str(item.get("problem_index") or "").strip()
-    sample_index = str(item.get("sample_index") or row_idx + 1).strip()
-    item_id = problem_index or sample_index
-    task_type = subfield or subject or question_type or "mathverse"
-
-    return {
-        "id": item_id,
-        "sample_index": sample_index,
-        "problem_index": problem_index,
-        "problem_version": raw_problem_version or problem_version,
-        "question": question,
-        "question_stem": question_stem,
-        "question_for_eval": str(item.get("question_for_eval") or question).strip(),
-        "question_type": question_type or ("multi-choice" if is_choice else "free-form"),
-        "is_choice": is_choice,
-        "choices": choices,
-        "correct_choice": correct_choice,
-        "answer": answer,
-        "gold_answers": [answer] if answer else [],
-        "image_rel": image_rel,
-        "image_path": image_path,
-        "query_wo": str(item.get("query_wo") or "").strip(),
-        "query_cot": str(item.get("query_cot") or "").strip(),
-        "metadata": {
-            "split": str(metadata.get("split") or "").strip(),
-            "source": source,
-            "subject": subject,
-            "subfield": subfield,
-        },
-        "task_type": task_type,
-        "source_path": os.path.abspath(source_path),
-        "text_dominant_question": str(
-            item.get("text_dominant_question")
-            or text_dominant_map.get(problem_index, "")
-        ).strip(),
-    }
-
-
-class MathVerseDataLoader(SplitDataLoader):
-    """MathVerse dataloader."""
-
-    def __init__(
-        self,
-        split_dir: str = "",
-        seed: int = 42,
-        limit: int = 0,
-        data_root: str = "",
-        problem_version: str = "Text Lite",
-        **kwargs,
-    ) -> None:
-        super().__init__(split_dir=split_dir, seed=seed, limit=limit)
-        self.data_root = data_root
-        self.problem_version = problem_version
-        self._task_types: list[str] = []
-        self._text_dominant_map = _build_text_dominant_map(data_root)
-
-    def setup(self, cfg: dict) -> None:
-        if not self.data_root:
-            self.data_root = str(cfg.get("data_root") or "")
-        if not self.problem_version:
-            self.problem_version = str(cfg.get("problem_version") or "Text Lite")
-        self._text_dominant_map = _build_text_dominant_map(self.data_root)
-        super().setup(cfg)
-        all_items = self.train_items + self.val_items + self.test_items
-        task_types = {
-            item.get("task_type") or item.get("question_type") or "mathverse"
-            for item in all_items
-        }
-        self._task_types = sorted(str(x) for x in task_types if str(x).strip())
-
-    def get_task_types(self) -> list[str]:
-        return list(self._task_types)
-
-    def load_split_items(self, split_path: str) -> list[dict]:
-        raw_items = super().load_split_items(split_path)
-        source_path = next(
-            (
-                os.path.join(split_path, name)
-                for name in sorted(os.listdir(split_path))
-                if name.endswith(".json")
-            ),
-            split_path,
-        )
-        items: list[dict] = []
-        for row_idx, item in enumerate(raw_items):
-            if not isinstance(item, dict):
-                continue
-            norm = _normalize_item(
-                item,
-                row_idx=row_idx,
-                source_path=source_path,
-                data_root=self.data_root,
-                problem_version=self.problem_version,
-                text_dominant_map=self._text_dominant_map,
-            )
-            if norm is not None:
-                items.append(norm)
-        if not items:
-            raise ValueError(
-                f"No valid MathVerse items loaded from {split_path} "
-                f"for problem_version={self.problem_version!r}"
-            )
-        return items
--- a/skillopt/envs/mathverse/evaluator.py
+++ b/skillopt/envs/mathverse/evaluator.py
@@ -1,180 +0,0 @@
-"""MathVerse evaluation helpers."""
-from __future__ import annotations
-
-import re
-import string
-
-from skillopt.model import chat_with_deployment
-from skillopt.prompts import load_prompt
-
-
-_EVAL_MODE = "mathverse_choice_or_judge_v1"
-
-
-def normalize_text(text: str) -> str:
-    text = str(text or "").strip().lower()
-    text = text.replace("\\,", " ")
-    text = text.replace("\\ ", " ")
-    text = "".join(ch for ch in text if ch not in string.punctuation)
-    return " ".join(text.split())
-
-
-def normalize_math_text(text: str) -> str:
-    text = str(text or "").strip()
-    text = text.replace("$", "")
-    text = text.replace("\\mathrm", "")
-    text = text.replace("{", "")
-    text = text.replace("}", "")
-    text = text.replace("~", " ")
-    text = text.replace("\\,", " ")
-    text = text.replace("\\ ", " ")
-    return " ".join(text.split()).lower()
-
-
-def extract_answer(text: str | None) -> str:
-    raw = str(text or "").strip()
-    if not raw:
-        return ""
-
-    tags = re.findall(r"<answer>\s*(.*?)\s*</answer>", raw, re.IGNORECASE | re.DOTALL)
-    if tags:
-        return tags[-1].strip()
-
-    boxed = re.findall(r"\\boxed\{(.*?)\}", raw, re.IGNORECASE | re.DOTALL)
-    if boxed:
-        return boxed[-1].strip()
-
-    lines = [ln.strip() for ln in raw.splitlines() if ln.strip()]
-    if lines:
-        return lines[-1]
-    return raw
-
-
-def _judge_answer(
-    *,
-    item: dict,
-    extracted_answer: str,
-    judge_model: str,
-    max_completion_tokens: int,
-    retries: int,
-) -> dict:
-    question = str(item.get("question_for_eval") or item.get("question") or "").strip()
-    ground_truth = str(item.get("answer") or "").strip()
-    raw, _ = chat_with_deployment(
-        deployment=judge_model,
-        system="You are a careful and strict mathematical answer evaluator.",
-        user=load_prompt("judge", env="mathverse").format(
-            question=question,
-            groundtruth=ground_truth,
-            modeloutput=extracted_answer,
-        ),
-        max_completion_tokens=max_completion_tokens,
-        retries=retries,
-        stage="mathverse_judge",
-    )
-    response = str(raw).strip().lower()
-    if "true" in response:
-        correct = True
-    elif "false" in response:
-        correct = False
-    else:
-        correct = False
-    return {
-        "raw": raw,
-        "correct": correct,
-        "reason": response,
-        "matched_gold": ground_truth if correct else "",
-    }
-
-
-def evaluate_item(
-    *,
-    item: dict,
-    prediction_text: str,
-    judge_model: str,
-    max_completion_tokens: int = 256,
-    retries: int = 5,
-) -> dict:
-    extracted = extract_answer(prediction_text)
-
-    if item.get("is_choice"):
-        predicted_label = str(extracted).strip().upper().rstrip(".):")
-        correct_label = str(item["correct_choice"].get("label") or "").strip().upper()
-        predicted_text = ""
-        for choice in item.get("choices") or []:
-            if str(choice.get("label") or "").strip().upper() == predicted_label:
-                predicted_text = str(choice.get("text") or "").strip()
-                break
-        hard = 1.0 if predicted_label == correct_label else 0.0
-        return {
-            "evaluation_mode": _EVAL_MODE,
-            "predicted_answer": extracted,
-            "predicted_label": predicted_label,
-            "predicted_text": predicted_text,
-            "correct_label": correct_label,
-            "correct_text": str(item["correct_choice"].get("text") or "").strip(),
-            "em": hard,
-            "f1": hard,
-            "sub_em": hard,
-            "judge_raw": "",
-            "judge_reason": "exact_label_match" if hard else "label_mismatch",
-            "matched_gold": correct_label if hard else "",
-        }
-
-    gold_answer = str(item.get("answer") or "").strip()
-    pred_norm = normalize_math_text(extracted)
-    gold_norm = normalize_math_text(gold_answer)
-    if pred_norm and gold_norm and pred_norm == gold_norm:
-        return {
-            "evaluation_mode": _EVAL_MODE,
-            "predicted_answer": extracted,
-            "em": 1.0,
-            "f1": 1.0,
-            "sub_em": 1.0,
-            "judge_raw": "",
-            "judge_reason": "normalized_exact_match",
-            "matched_gold": gold_answer,
-            "string_f1": 1.0,
-        }
-
-    judge = _judge_answer(
-        item=item,
-        extracted_answer=extracted,
-        judge_model=judge_model,
-        max_completion_tokens=max_completion_tokens,
-        retries=retries,
-    )
-    hard = 1.0 if judge["correct"] else 0.0
-    pred_tokens = normalize_text(extracted).split()
-    gold_tokens = normalize_text(gold_answer).split()
-    overlap = 0
-    gold_counts: dict[str, int] = {}
-    for tok in gold_tokens:
-        gold_counts[tok] = gold_counts.get(tok, 0) + 1
-    for tok in pred_tokens:
-        count = gold_counts.get(tok, 0)
-        if count > 0:
-            overlap += 1
-            gold_counts[tok] = count - 1
-    if pred_tokens and gold_tokens and overlap:
-        precision = overlap / len(pred_tokens)
-        recall = overlap / len(gold_tokens)
-        string_f1 = 2 * precision * recall / (precision + recall)
-    else:
-        string_f1 = 0.0
-
-    return {
-        "evaluation_mode": _EVAL_MODE,
-        "predicted_answer": extracted,
-        "em": hard,
-        "f1": hard,
-        "sub_em": hard,
-        "judge_raw": judge["raw"],
-        "judge_reason": judge["reason"],
-        "matched_gold": judge["matched_gold"],
-        "string_f1": string_f1,
-    }
-
-
-def evaluation_mode() -> str:
-    return _EVAL_MODE
--- a/skillopt/envs/mathverse/prompts/analyst_error.md
+++ b/skillopt/envs/mathverse/prompts/analyst_error.md
@@ -1,37 +0,0 @@
-You are an expert failure-analysis agent for visual mathematical reasoning problems.
-
-You will be given MULTIPLE failed trajectories from a single minibatch and the current skill document.
-Each trajectory includes the target's response, the evaluation result, and sometimes a hidden reference
-containing the fuller Text Dominant version of the same problem.
-
-Your job is to identify COMMON reasoning failures across the batch and propose concise skill edits.
-
-## Failure Type Categories
-- **diagram_underuse**: the agent did not recover key constraints from the image
-- **constraint_drop**: the agent ignored a condition or relation that should guide the solution
-- **option_confusion**: the agent failed to discriminate between close answer choices
-- **format_miss**: the agent solved roughly correctly but returned the wrong final form, unit, or expression
-- **other**: none of the above
-
-## Rules
-1. Focus on patterns that recur across the minibatch.
-2. Prefer edits that improve visual grounding and exact answer selection.
-3. Do not hardcode problem-specific formulas or answers.
-4. If hidden reference text is present, use it only to infer what information the target failed to recover from the Text Lite version.
-
-Respond ONLY with a valid JSON object:
-{
-  "batch_size": <number>,
-  "failure_summary": [
-    {"failure_type": "<type>", "count": <int>, "description": "<one-line>"}
-  ],
-  "patch": {
-    "reasoning": "<why these edits address the common failures>",
-    "edits": [
-      {"op": "append",       "content": "<markdown>"},
-      {"op": "insert_after", "target": "<heading/text>", "content": "<markdown>"},
-      {"op": "replace",      "target": "<old text>",     "content": "<new text>"},
-      {"op": "delete",       "target": "<exact text to remove>"}
-    ]
-  }
-}
--- a/skillopt/envs/mathverse/prompts/analyst_success.md
+++ b/skillopt/envs/mathverse/prompts/analyst_success.md
@@ -1,26 +0,0 @@
-You are an expert success-pattern analyst for visual mathematical reasoning problems.
-
-You will be given MULTIPLE successful trajectories from a minibatch and the current skill document.
-Identify generalizable behavior patterns that genuinely help the agent recover the right constraints
-from the image and convert them into the exact final answer.
-
-## Rules
-- Focus on broadly useful visual-math reasoning behaviors.
-- Prefer patterns about reading decisive diagram cues, checking hidden assumptions, and matching the final answer format exactly.
-- Do not add benchmark-specific facts or formulas.
-- "edits" may be empty if the skill already captures the useful patterns.
-
-Respond ONLY with a valid JSON object:
-{
-  "batch_size": <number>,
-  "success_patterns": ["<pattern 1>", "<pattern 2>"],
-  "patch": {
-    "reasoning": "<why these patterns matter>",
-    "edits": [
-      {"op": "append",       "content": "<markdown>"},
-      {"op": "insert_after", "target": "<heading/text>", "content": "<markdown>"},
-      {"op": "replace",      "target": "<old text>",     "content": "<new text>"},
-      {"op": "delete",       "target": "<exact text to remove>"}
-    ]
-  }
-}
--- a/skillopt/envs/mathverse/prompts/deep_probe.md
+++ b/skillopt/envs/mathverse/prompts/deep_probe.md
@@ -1,25 +0,0 @@
-You are an expert diagnostic-probe designer for visual mathematical reasoning tasks.
-
-You will be shown representative trajectories, the current target skill, and the target's original prompt context.
-Some trajectories may also include a hidden reference containing the fuller Text Dominant wording of the same problem.
-Design one SMALL diagnostic instruction that exposes the target's intermediate judgment without materially changing the original scaffold.
-
-## Hard Constraints
-1. Do NOT substantially change the original scaffold.
-2. Do NOT prescribe a new long multi-step solving procedure.
-3. Do NOT ask for a full proof or full chain-of-thought.
-4. Ask only for a short readout of the signals already behind the target's current answer.
-5. Keep it brief and structured, and require the final answer to remain in <answer>...</answer>.
-6. If hidden reference text is present, use it only to target what visual or textual constraint the target likely missed.
-
-## Good Probe Targets
-- decisive diagram cue
-- top candidate and runner-up
-- missing relation or quantity
-- why a near-miss option was rejected
-
-Respond ONLY with a valid JSON object:
-{
-  "reasoning": "<why this probe is informative>",
-  "probe_instruction": "<the exact instruction text to append to the target prompt>"
-}
--- a/skillopt/envs/mathverse/prompts/judge.md
+++ b/skillopt/envs/mathverse/prompts/judge.md
@@ -1,25 +0,0 @@
-You are a careful and strict evaluator for visual math problems.
-
-You will be given:
-1. The original question
-2. The ground-truth answer
-3. A model output
-
-Decide whether the model output is mathematically equivalent to the ground-truth answer.
-
-Rules:
-- Ignore harmless formatting differences.
-- Accept mathematically equivalent expressions, equations, and values.
-- Reject answers that are numerically wrong, symbolically different in meaning, missing required units when the unit changes meaning, or correspond to a different choice.
-- Do not reward partially correct reasoning if the final answer is wrong.
-
-Return only:
-True
-
-or
-
-False
-
-Question: {question}
-Ground Truth Answer: {groundtruth}
-Model Output: {modeloutput}
--- a/skillopt/envs/mathverse/prompts/rollout_system.md
+++ b/skillopt/envs/mathverse/prompts/rollout_system.md
@@ -1,11 +0,0 @@
-You are an expert visual mathematical reasoning agent.
-
-{skill_section}## Task Format
-You will receive one math problem with an image or diagram.
-Use the visible diagram as evidence, not just the text.
-If some information is abbreviated in the text, recover it from the image before answering.
-
-## Answer Format
-Think step by step, then provide your final answer inside <answer>...</answer>.
-- For multiple-choice questions, output only the single option label, such as <answer>B</answer>.
-- For free-form questions, output only the final mathematical answer, such as <answer>14</answer>.
--- a/skillopt/envs/mathverse/reflect.py
+++ b/skillopt/envs/mathverse/reflect.py
@@ -1,4 +0,0 @@
-"""MathVerse Reflect stage.
-
-Prompts are loaded from .md files by the base adapter.
-"""
--- a/skillopt/envs/mathverse/rollout.py
+++ b/skillopt/envs/mathverse/rollout.py
@@ -1,431 +0,0 @@
-"""MathVerse rollout — single-image multimodal math reasoning."""
-from __future__ import annotations
-
-import base64
-import json
-import mimetypes
-import os
-from concurrent.futures import ThreadPoolExecutor, as_completed
-
-from skillopt.envs.mathverse.evaluator import evaluate_item, evaluation_mode, extract_answer
-from skillopt.model import chat_target_messages, get_target_backend, is_target_exec_backend
-from skillopt.model.codex_harness import prepare_workspace, render_skill_md, run_target_exec
-from skillopt.prompts import load_prompt
-
-
-def _build_system(skill_content: str) -> str:
-    if skill_content.strip():
-        skill_section = f"## Skill\n{skill_content.strip()}\n\n"
-    else:
-        skill_section = ""
-    return load_prompt("rollout_system", env="mathverse").format(skill_section=skill_section)
-
-
-def _format_choices(choices: list[dict]) -> str:
-    return "\n".join(f"{choice['label']}. {choice['text']}" for choice in choices)
-
-
-def _build_user_text(
-    item: dict,
-    *,
-    diagnostic_mode: bool = False,
-    diagnostic_instruction: str = "",
-    diagnostic_trace_context: str = "",
-) -> str:
-    parts = []
-    if diagnostic_trace_context.strip():
-        parts.append(
-            "## Previous Codex Trace Snapshot\n"
-            "This is a partial transcript from an earlier attempt. Use it as your current reasoning context.\n\n"
-            f"{diagnostic_trace_context.strip()}"
-        )
-    question = str(item.get("question_stem") or item.get("question") or "").strip()
-    if question:
-        parts.append(f"## Question\n{question}")
-    else:
-        parts.append("## Question\nRead the full problem statement from the image.")
-
-    if item.get("is_choice"):
-        choices = item.get("choices") or []
-        if choices:
-            parts.append(f"## Choices\n{_format_choices(choices)}")
-        parts.append("Return only the final option label inside <answer>...</answer>.")
-    else:
-        parts.append("Return only the final mathematical answer inside <answer>...</answer>.")
-
-    if diagnostic_mode and diagnostic_instruction.strip():
-        parts.append(f"## Training Readout\n{diagnostic_instruction.strip()}")
-    return "\n\n".join(parts)
-
-
-def _image_to_data_uri(path: str) -> str:
-    mime = mimetypes.guess_type(path)[0] or "image/png"
-    with open(path, "rb") as f:
-        encoded = base64.b64encode(f.read()).decode("ascii")
-    return f"data:{mime};base64,{encoded}"
-
-
-def _build_messages(
-    item: dict,
-    skill_content: str,
-    image_detail: str,
-    *,
-    diagnostic_mode: bool = False,
-    diagnostic_instruction: str = "",
-    diagnostic_trace_context: str = "",
-) -> tuple[list[dict], str, str]:
-    system = _build_system(skill_content)
-    user_text = _build_user_text(
-        item,
-        diagnostic_mode=diagnostic_mode,
-        diagnostic_instruction=diagnostic_instruction,
-        diagnostic_trace_context=diagnostic_trace_context,
-    )
-    image_url = {"url": _image_to_data_uri(item["image_path"])}
-    if image_detail and image_detail != "auto":
-        image_url["detail"] = image_detail
-    messages = [
-        {"role": "system", "content": system},
-        {
-            "role": "user",
-            "content": [
-                {"type": "text", "text": user_text},
-                {"type": "image_url", "image_url": image_url},
-            ],
-        },
-    ]
-    return messages, system, user_text
-
-
-def _build_codex_skill(skill_content: str) -> str:
-    return render_skill_md(
-        skill_content,
-        description="Dynamic ReflACT skill for solving the current MathVerse visual math problem.",
-        preamble=(
-            "Use this skill when solving the current MathVerse problem.\n"
-            "Read the image carefully and return the final answer inside <answer>...</answer>."
-        ),
-    )
-
-
-def _run_codex_once(
-    *,
-    pred_dir: str,
-    item: dict,
-    skill_content: str,
-    model: str,
-    timeout: int,
-    image_detail: str,
-    diagnostic_mode: bool = False,
-    diagnostic_instruction: str = "",
-    diagnostic_trace_context: str = "",
-    previous_response: str = "",
-) -> tuple[str, str, str, str]:
-    user_text = _build_user_text(
-        item,
-        diagnostic_mode=diagnostic_mode,
-        diagnostic_instruction=diagnostic_instruction,
-        diagnostic_trace_context=diagnostic_trace_context,
-    )
-    task_parts = [user_text]
-    if previous_response:
-        task_parts.append(
-            "## Previous Attempt\n"
-            f"{previous_response}\n\n"
-            "Re-check the diagram and the mathematical constraints. Correct the final answer if needed."
-        )
-    task_text = "\n\n".join(task_parts)
-    skill_md = _build_codex_skill(skill_content)
-    work_dir = os.path.join(pred_dir, "codex_exec")
-    prepare_workspace(
-        work_dir=work_dir,
-        skill_md=skill_md,
-        task_text=task_text,
-        images=[item["image_path"]],
-    )
-    prompt = (
-        "Use the `skillopt-target` skill available in this workspace.\n"
-        "Read `task.md`, inspect the attached image, solve the problem, and return only the final answer inside <answer>...</answer>."
-    )
-    final_message, raw = run_target_exec(
-        work_dir=work_dir,
-        prompt=prompt,
-        model=model,
-        timeout=timeout,
-        images=[item["image_path"]],
-    )
-    return final_message or raw, raw, skill_md, task_text
-
-
-def process_one(
-    item: dict,
-    out_root: str,
-    skill_content: str,
-    *,
-    max_turns: int = 1,
-    image_detail: str = "auto",
-    judge_model: str = "gpt-5.4",
-    judge_max_completion_tokens: int = 256,
-    judge_retries: int = 5,
-    diagnostic_mode: bool = False,
-    diagnostic_instruction: str = "",
-    diagnostic_trace_context: str = "",
-) -> dict:
-    item_id = str(item["id"])
-    result = {
-        "id": item_id,
-        "question": item["question"],
-        "task_type": item.get("task_type") or item.get("question_type") or "mathverse",
-        "task_description": item.get("question_stem") or item["question"],
-        "hard": 0,
-        "soft": 0.0,
-        "predicted_answer": "",
-        "predicted_label": "",
-        "predicted_text": "",
-        "response": "",
-        "fail_reason": "",
-        "agent_ok": False,
-        "n_turns": 0,
-        "image_path": item["image_path"],
-        "question_type": item["question_type"],
-        "evaluation_mode": evaluation_mode(),
-        "judge_model": judge_model,
-    }
-    if item.get("is_choice"):
-        result["correct_label"] = item["correct_choice"]["label"]
-        result["correct_text"] = item["correct_choice"]["text"]
-    else:
-        result["gold_answers"] = item.get("gold_answers") or [item["answer"]]
-
-    try:
-        pred_dir = os.path.join(out_root, "predictions", item_id)
-        os.makedirs(pred_dir, exist_ok=True)
-
-        if is_target_exec_backend():
-            from skillopt.model import azure_openai as _llm
-
-            response = ""
-            conversation: list[dict] = [
-                {"role": "user", "content": f"{item['question']}\n\n[image] {os.path.basename(item['image_path'])}"}
-            ]
-            system_prompt = ""
-            user_text = ""
-            for turn in range(max_turns):
-                response, raw, system_prompt, user_text = _run_codex_once(
-                    pred_dir=pred_dir,
-                    item=item,
-                    skill_content=skill_content,
-                    model=_llm.TARGET_DEPLOYMENT,
-                    timeout=120,
-                    image_detail=image_detail,
-                    diagnostic_mode=diagnostic_mode if turn == 0 else False,
-                    diagnostic_instruction=diagnostic_instruction if turn == 0 else "",
-                    diagnostic_trace_context=diagnostic_trace_context if turn == 0 else "",
-                    previous_response=response if turn > 0 else "",
-                )
-                conversation.append({"type": "message", "turn": turn + 1, "content": response})
-                if extract_answer(response):
-                    break
-
-            result["response"] = response
-            result["agent_ok"] = True
-            result["n_turns"] = len(conversation) - 1
-            with open(os.path.join(pred_dir, "target_system_prompt.txt"), "w", encoding="utf-8") as f:
-                f.write(system_prompt)
-            with open(os.path.join(pred_dir, "target_user_prompt.txt"), "w", encoding="utf-8") as f:
-                f.write(user_text)
-        else:
-            messages, system_prompt, user_text = _build_messages(
-                item,
-                skill_content,
-                image_detail,
-                diagnostic_mode=diagnostic_mode,
-                diagnostic_instruction=diagnostic_instruction,
-                diagnostic_trace_context=diagnostic_trace_context,
-            )
-            response = ""
-            conversation = [
-                {"role": "user", "content": f"{user_text}\n\n[image] {os.path.basename(item['image_path'])}"}
-            ]
-            for turn in range(max_turns):
-                if turn == 0:
-                    resp_text, _ = chat_target_messages(
-                        messages=messages,
-                        max_completion_tokens=1024,
-                        retries=5,
-                        stage="rollout",
-                    )
-                else:
-                    refinement_text = (
-                        f"Your previous answer was:\n{response}\n\n"
-                        "Re-check the diagram and the mathematical constraints. "
-                        "If needed, correct your answer. Output only the final answer inside <answer>...</answer>."
-                    )
-                    refinement_messages = [
-                        messages[0],
-                        messages[1],
-                        {"role": "assistant", "content": response},
-                        {"role": "user", "content": refinement_text},
-                    ]
-                    resp_text, _ = chat_target_messages(
-                        messages=refinement_messages,
-                        max_completion_tokens=768,
-                        retries=5,
-                        stage="rollout",
-                    )
-                response = resp_text
-                conversation.append({"type": "message", "turn": turn + 1, "content": resp_text})
-                if extract_answer(resp_text):
-                    break
-
-            result["response"] = response
-            result["agent_ok"] = True
-            result["n_turns"] = len(conversation) - 1
-            with open(os.path.join(pred_dir, "target_system_prompt.txt"), "w", encoding="utf-8") as f:
-                f.write(system_prompt)
-            with open(os.path.join(pred_dir, "target_user_prompt.txt"), "w", encoding="utf-8") as f:
-                f.write(user_text)
-
-        eval_result = evaluate_item(
-            item=item,
-            prediction_text=result["response"],
-            judge_model=judge_model,
-            max_completion_tokens=judge_max_completion_tokens,
-            retries=judge_retries,
-        )
-        result["evaluation_mode"] = eval_result["evaluation_mode"]
-        result["judge_raw"] = eval_result.get("judge_raw", "")
-        result["judge_reason"] = eval_result.get("judge_reason", "")
-        result["matched_gold"] = eval_result.get("matched_gold", "")
-
-        if item.get("is_choice"):
-            result["predicted_label"] = eval_result["predicted_label"]
-            result["predicted_text"] = eval_result["predicted_text"]
-            result["predicted_answer"] = eval_result["predicted_answer"]
-            result["hard"] = int(eval_result["em"])
-            result["soft"] = eval_result["f1"]
-            if not result["hard"]:
-                result["fail_reason"] = (
-                    f"choice=0: predicted '{eval_result['predicted_label'] or eval_result['predicted_answer']}' "
-                    f"but expected '{eval_result['correct_label']}'"
-                )
-            eval_detail = (
-                f"[EVALUATION RESULT]\n"
-                f"Question: {item['question_for_eval']}\n"
-                f"Predicted label: {eval_result['predicted_label']!r}\n"
-                f"Predicted text: {eval_result['predicted_text']!r}\n"
-                f"Correct label: {eval_result['correct_label']!r}\n"
-                f"Correct text: {eval_result['correct_text']!r}\n"
-                f"Exact Match: {eval_result['em']}"
-            )
-        else:
-            result["predicted_answer"] = eval_result["predicted_answer"]
-            result["hard"] = int(eval_result["em"])
-            result["soft"] = eval_result["f1"]
-            if not result["hard"]:
-                result["fail_reason"] = (
-                    f"judge=0: predicted '{eval_result['predicted_answer']}' "
-                    f"but expected '{item['answer']}' ({eval_result.get('judge_reason', '')})"
-                )
-            eval_detail = (
-                f"[EVALUATION RESULT]\n"
-                f"Question: {item['question_for_eval']}\n"
-                f"Predicted answer: {eval_result['predicted_answer']!r}\n"
-                f"Gold answer: {item['answer']!r}\n"
-                f"Judge correct: {eval_result['em']}\n"
-                f"Judge reason: {eval_result.get('judge_reason', '')}\n"
-                f"String F1: {eval_result.get('string_f1', 0.0):.4f}"
-            )
-
-        conversation.append({"role": "system", "content": eval_detail})
-        with open(os.path.join(pred_dir, "conversation.json"), "w", encoding="utf-8") as f:
-            json.dump(conversation, f, ensure_ascii=False, indent=2)
-    except Exception as e:  # noqa: BLE001
-        result["fail_reason"] = f"error: {e}"
-    return result
-
-
-def run_batch(
-    items: list[dict],
-    out_root: str,
-    skill_content: str,
-    *,
-    max_turns: int = 1,
-    workers: int = 32,
-    image_detail: str = "auto",
-    judge_model: str = "gpt-5.4",
-    judge_max_completion_tokens: int = 256,
-    judge_retries: int = 5,
-    diagnostic_mode: bool = False,
-    diagnostic_instruction: str = "",
-    diagnostic_trace_context_by_id: dict[str, str] | None = None,
-) -> list[dict]:
-    results_path = os.path.join(out_root, "results.jsonl")
-    os.makedirs(out_root, exist_ok=True)
-
-    expected_eval_mode = evaluation_mode()
-    done_ids: set[str] = set()
-    existing: list[dict] = []
-    rewrite_results = False
-    if os.path.exists(results_path):
-        with open(results_path, encoding="utf-8") as f:
-            for line in f:
-                try:
-                    row = json.loads(line)
-                    if row.get("evaluation_mode") != expected_eval_mode:
-                        rewrite_results = True
-                        continue
-                    done_ids.add(str(row["id"]))
-                    existing.append(row)
-                except Exception:
-                    rewrite_results = True
-
-    pending = [item for item in items if str(item["id"]) not in done_ids]
-    if not pending and not rewrite_results:
-        return existing
-
-    total = len(existing) + len(pending)
-    completed = len(existing)
-    correct_count = sum(1 for r in existing if r.get("hard", 0))
-    if existing:
-        print(f"    [rollout] resuming: {completed}/{total} already done", flush=True)
-
-    results = list(existing)
-    file_mode = "w" if rewrite_results else "a"
-    with open(results_path, file_mode, encoding="utf-8") as outf, ThreadPoolExecutor(max_workers=workers) as ex:
-        if rewrite_results:
-            for row in existing:
-                outf.write(json.dumps(row, ensure_ascii=False) + "\n")
-        futs = {
-            ex.submit(
-                process_one,
-                item,
-                out_root,
-                skill_content,
-                max_turns=max_turns,
-                image_detail=image_detail,
-                judge_model=judge_model,
-                judge_max_completion_tokens=judge_max_completion_tokens,
-                judge_retries=judge_retries,
-                diagnostic_mode=diagnostic_mode,
-                diagnostic_instruction=diagnostic_instruction,
-                diagnostic_trace_context=(diagnostic_trace_context_by_id or {}).get(str(item["id"]), ""),
-            ): item
-            for item in pending
-        }
-        for fut in as_completed(futs):
-            row = fut.result()
-            results.append(row)
-            completed += 1
-            if row.get("hard", 0):
-                correct_count += 1
-            acc = correct_count / completed if completed else 0
-            print(
-                f"    [rollout] {completed}/{total} "
-                f"(acc={acc:.3f}) id={row.get('id', '?')} "
-                f"hard={row.get('hard', '?')}",
-                flush=True,
-            )
-            outf.write(json.dumps(row, ensure_ascii=False) + "\n")
-            outf.flush()
-    return results
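Review note: the resume logic at the top of the removed `run_batch` can be sketched on its own. This is a minimal illustration, not the project's code: `load_resumable` is a hypothetical helper name, and it assumes the same `results.jsonl` layout as above (one JSON object per line with `id` and `evaluation_mode` keys).

```python
import json
import os
import tempfile


def load_resumable(results_path: str, expected_mode: str) -> tuple[list[dict], set[str], bool]:
    """Return (rows to keep, ids already done, whether the file must be rewritten).

    Hypothetical helper mirroring the removed run_batch: rows scored under a
    different evaluation mode, and lines that fail to parse, are dropped and
    force a rewrite so that stale results are not mixed with fresh ones.
    """
    kept: list[dict] = []
    done_ids: set[str] = set()
    rewrite = False
    if not os.path.exists(results_path):
        return kept, done_ids, rewrite
    with open(results_path, encoding="utf-8") as f:
        for line in f:
            try:
                row = json.loads(line)
                if row.get("evaluation_mode") != expected_mode:
                    rewrite = True
                    continue
                done_ids.add(str(row["id"]))
                kept.append(row)
            except Exception:
                # Corrupt or incomplete line: skip it and rewrite the file.
                rewrite = True
    return kept, done_ids, rewrite


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "results.jsonl")
        with open(path, "w", encoding="utf-8") as f:
            f.write(json.dumps({"id": 1, "evaluation_mode": "v1", "hard": 1}) + "\n")
            f.write(json.dumps({"id": 2, "evaluation_mode": "v0", "hard": 0}) + "\n")
            f.write("not json\n")
        print(load_resumable(path, "v1"))
```

Opening the file in `"w"` mode only when a rewrite is needed, and in `"a"` mode otherwise, keeps an interrupted run cheap to resume while still discarding rows produced under a different evaluation mode.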
--- a/skillopt/envs/mathverse/skills/initial.md
+++ b/skillopt/envs/mathverse/skills/initial.md
@@ -1,15 +0,0 @@
-# MathVerse Visual Math Heuristics
-
-## Diagram First
-- Read the diagram before locking onto an equation or option.
-- Recover missing labels, lengths, angles, axes, or object relations from the image when the text is abbreviated.
-- If the text seems underspecified, assume the image may contain the decisive constraint.
-
-## Constraint Tracking
-- Write down the few constraints that actually determine the answer instead of solving from vague intuition.
-- Prefer geometric or functional relations that are directly supported by the figure.
-- For multiple-choice questions, compare the final candidate against every option exactly.
-
-## Final Answer
-- Use the image and the text consistently.
-- Return only the final answer inside <answer>...</answer>.
--- a/skillopt/envs/mmrb/__init__.py
+++ b/skillopt/envs/mmrb/__init__.py
@@ -1,2 +0,0 @@
-"""MMRB environment package."""
-
--- a/skillopt/envs/mmrb/adapter.py
+++ b/skillopt/envs/mmrb/adapter.py
@@ -1,283 +0,0 @@
-"""MMRB environment adapter for ReflACT."""
-from __future__ import annotations
-
-import json
-import os
-
-from skillopt.gradient.deep_probe import generate_deep_probe_instruction
-from skillopt.datasets.base import BatchSpec
-from skillopt.gradient.reflect import run_minibatch_reflect
-from skillopt.envs.base import EnvAdapter
-from skillopt.envs.mmrb.dataloader import MMRBDataLoader
-from skillopt.envs.mmrb.rollout import run_batch
-from skillopt.model import get_target_backend
-
-
-class MMRBAdapter(EnvAdapter):
-    """MMRB adapter."""
-
-    def build_reference_text(self, item: dict) -> str:
-        reasoning_steps = item.get("reasoning_steps") or []
-        if not reasoning_steps:
-            return ""
-
-        blocks: list[str] = []
-        for path_idx, path in enumerate(reasoning_steps, 1):
-            if not isinstance(path, list) or not path:
-                continue
-            lines = [f"### Reasoning Path {path_idx}"]
-            for step in path:
-                if not isinstance(step, dict):
-                    continue
-                step_no = step.get("reasoning step", "?")
-                step_type = str(step.get("reasoning type") or "").strip()
-                rationale = str(step.get("rationale") or "").strip()
-                if rationale:
-                    prefix = f"{step_no}. [{step_type}] " if step_type else f"{step_no}. "
-                    lines.append(prefix + rationale)
-            if len(lines) > 1:
-                blocks.append("\n".join(lines))
-        if not blocks:
-            return ""
-        return "## Reference Reasoning Steps\n" + "\n\n".join(blocks[:3])
-
-    def get_reference_metadata(self, item: dict) -> dict:
-        reasoning_steps = item.get("reasoning_steps") or []
-        path_count = 0
-        preview_parts: list[str] = []
-        for path in reasoning_steps:
-            if not isinstance(path, list) or not path:
-                continue
-            path_count += 1
-            first = path[0] if isinstance(path[0], dict) else {}
-            step_type = str(first.get("reasoning type") or "").strip()
-            rationale = str(first.get("rationale") or "").strip()
-            preview_parts.append(f"[path {path_count}] {step_type}: {rationale[:180]}")
-        if not path_count:
-            return {"fields": [], "preview": ""}
-        return {
-            "fields": ["reasoning_steps"],
-            "preview": "\n".join(preview_parts)[:500],
-        }
-
-    def __init__(
-        self,
-        split_dir: str = "",
-        data_path: str = "",
-        split_mode: str = "ratio",
-        split_ratio: str = "2:1:7",
-        split_seed: int = 42,
-        split_output_dir: str = "",
-        max_turns: int = 1,
-        workers: int = 16,
-        analyst_workers: int = 16,
-        failure_only: bool = False,
-        minibatch_size: int = 8,
-        edit_budget: int = 4,
-        seed: int = 42,
-        limit: int = 0,
-        image_detail: str = "auto",
-        use_deep_reflect: bool = False,
-        deep_reflect_failures: int = 4,
-        deep_reflect_successes: int = 2,
-    ) -> None:
-        self.max_turns = max_turns
-        self.workers = workers
-        self.analyst_workers = analyst_workers
-        self.failure_only = failure_only
-        self.minibatch_size = minibatch_size
-        self.edit_budget = edit_budget
-        self.image_detail = image_detail
-        self.use_deep_reflect = use_deep_reflect
-        self.deep_reflect_failures = deep_reflect_failures
-        self.deep_reflect_successes = deep_reflect_successes
-        self.dataloader = MMRBDataLoader(
-            split_dir=split_dir,
-            data_path=data_path,
-            split_mode=split_mode,
-            split_ratio=split_ratio,
-            split_seed=split_seed,
-            split_output_dir=split_output_dir,
-            seed=seed,
-            limit=limit,
-        )
-
-    def setup(self, cfg: dict) -> None:
-        super().setup(cfg)
-        self.dataloader.setup(cfg)
-
-    def get_dataloader(self):
-        return self.dataloader
-
-    def build_env_from_batch(self, batch: BatchSpec, **kwargs):
-        return list(batch.payload or [])
-
-    def build_train_env(self, batch_size: int, seed: int, **kwargs):
-        batch = self.dataloader.build_train_batch(batch_size=batch_size, seed=seed, **kwargs)
-        return self.build_env_from_batch(batch, **kwargs)
-
-    def build_eval_env(self, env_num: int, split: str, seed: int, **kwargs):
-        batch = self.dataloader.build_eval_batch(env_num=env_num, split=split, seed=seed, **kwargs)
-        return self.build_env_from_batch(batch, **kwargs)
-
-    def rollout(
-        self,
-        env_manager,
-        skill_content: str,
-        out_dir: str,
-        **kwargs,
-    ) -> list[dict]:
-        items: list[dict] = env_manager
-        return run_batch(
-            items=items,
-            out_root=out_dir,
-            skill_content=skill_content,
-            max_turns=self.max_turns,
-            workers=self.workers,
-            image_detail=self.image_detail,
-            diagnostic_mode=kwargs.get("diagnostic_mode", False),
-            diagnostic_instruction=kwargs.get("diagnostic_instruction", ""),
-            diagnostic_trace_context_by_id=kwargs.get("diagnostic_trace_context_by_id"),
-        )
-
-    def reflect(
-        self,
-        results: list[dict],
-        skill_content: str,
-        out_dir: str,
-        **kwargs,
-    ) -> list[dict | None]:
-        prediction_dir = kwargs.get("prediction_dir", os.path.join(out_dir, "predictions"))
-        patches_dir = kwargs.get("patches_dir", os.path.join(out_dir, "patches"))
-        random_seed = kwargs.get("random_seed")
-        step_buffer_context = kwargs.get("step_buffer_context", "")
-        meta_skill_context = kwargs.get("meta_skill_context", "")
-
-        return run_minibatch_reflect(
-            results=results,
-            skill_content=skill_content,
-            prediction_dir=prediction_dir,
-            patches_dir=patches_dir,
-            workers=self.analyst_workers,
-            failure_only=self.failure_only,
-            minibatch_size=self.minibatch_size,
-            edit_budget=self.edit_budget,
-            random_seed=random_seed,
-            error_system=self.get_error_minibatch_prompt(),
-            success_system=self.get_success_minibatch_prompt(),
-            step_buffer_context=step_buffer_context,
-            meta_skill_context=meta_skill_context,
-            update_mode=getattr(self, "_cfg", {}).get("skill_update_mode", "patch"),
-        )
-
-    def deep_reflect(
-        self,
-        results: list[dict],
-        skill_content: str,
-        out_dir: str,
-        **kwargs,
-    ) -> list[dict | None]:
-        if not self.use_deep_reflect:
-            return []
-
-        env_manager = kwargs.get("env_manager")
-        prediction_dir = kwargs.get("prediction_dir", os.path.join(out_dir, "predictions"))
-        random_seed = kwargs.get("random_seed")
-        step_buffer_context = kwargs.get("step_buffer_context", "")
-        meta_skill_context = kwargs.get("meta_skill_context", "")
-        codex_backend = get_target_backend() == "codex_exec"
-        selected_items = self.select_representative_items(
-            results,
-            env_manager if isinstance(env_manager, list) else None,
-            n_failures=self.deep_reflect_failures,
-            n_successes=self.deep_reflect_successes,
-            seed=random_seed,
-        )
-        if not selected_items:
-            return []
-        selected_ids = {str(item["id"]) for item in selected_items}
-        selected_results = [row for row in results if str(row.get("id")) in selected_ids]
-        selected_examples = self.attach_reference_context(selected_results, selected_items)
-        if codex_backend:
-            selected_examples = self.attach_codex_probe_context(selected_examples, prediction_dir)
-
-        reasoning_count = 0
-        selected_metadata = []
-        for item in selected_items:
-            meta = self.get_reference_metadata(item)
-            if meta["fields"]:
-                reasoning_count += 1
-            selected_metadata.append({
-                "id": str(item["id"]),
-                "task_type": str(item.get("subtask") or item.get("task_type") or "mmrb"),
-                "reference_fields": meta["fields"],
-                "reference_preview": meta["preview"],
-            })
-
-        deep_dir = os.path.join(out_dir, "deep_reflect")
-        rollout_dir = os.path.join(deep_dir, "rollout")
-        patches_dir = os.path.join(deep_dir, "patches")
-        os.makedirs(deep_dir, exist_ok=True)
-        print(
-            f"    [2b/6 DEEP REFLECT setup] selected={len(selected_items)} "
-            f"reference_fields=reasoning_steps({reasoning_count}/{len(selected_items)})"
-        )
-        probe = generate_deep_probe_instruction(
-            skill_content=skill_content,
-            items=selected_examples,
-            prediction_dir=prediction_dir,
-            system_prompt=self.get_codex_deep_probe_prompt() if codex_backend else self.get_deep_probe_prompt(),
-            step_buffer_context=step_buffer_context,
-            meta_skill_context=meta_skill_context,
-        )
-        if not probe:
-            return []
-        diagnostic_trace_context_by_id = None
-        if codex_backend:
-            selected_items, diagnostic_trace_context_by_id, probe = self.resolve_codex_probe_target(
-                selected_items=selected_items,
-                selected_examples=selected_examples,
-                prediction_dir=prediction_dir,
-                probe=probe,
-            )
-        probe_record = {
-            **probe,
-            "reference_summary": {
-                "selected_count": len(selected_items),
-                "field_counts": {"reasoning_steps": reasoning_count},
-            },
-            "selected_examples": selected_metadata,
-        }
-        with open(os.path.join(deep_dir, "probe.json"), "w", encoding="utf-8") as f:
-            json.dump(probe_record, f, ensure_ascii=False, indent=2)
-        deep_results = run_batch(
-            items=selected_items,
-            out_root=rollout_dir,
-            skill_content=skill_content,
-            max_turns=self.max_turns,
-            workers=min(self.workers, max(len(selected_items), 1)),
-            image_detail=self.image_detail,
-            diagnostic_mode=True,
-            diagnostic_instruction=probe["probe_instruction"],
-            diagnostic_trace_context_by_id=diagnostic_trace_context_by_id,
-        )
-        deep_results = self.attach_reference_context(deep_results, selected_items)
-        return run_minibatch_reflect(
-            results=deep_results,
-            skill_content=skill_content,
-            prediction_dir=os.path.join(rollout_dir, "predictions"),
-            patches_dir=patches_dir,
-            workers=self.analyst_workers,
-            failure_only=self.failure_only,
-            minibatch_size=self.minibatch_size,
-            edit_budget=self.edit_budget,
-            random_seed=random_seed,
-            error_system=self.get_error_minibatch_prompt(),
-            success_system=self.get_success_minibatch_prompt(),
-            step_buffer_context=step_buffer_context,
-            meta_skill_context=meta_skill_context,
-            update_mode=getattr(self, "_cfg", {}).get("skill_update_mode", "patch"),
-        )
-
-    def get_task_types(self) -> list[str]:
-        return self.dataloader.get_task_types()
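Review note: the `reasoning_steps` schema that the removed `build_reference_text` consumed is easiest to see with a worked input. The sketch below is a standalone function with the same formatting rules, not the adapter method itself. The sample item is invented for illustration; only the key names (`reasoning step`, `reasoning type`, `rationale`) come from the code above.

```python
def build_reference_text(item: dict) -> str:
    """Format up to three reasoning paths as a Markdown reference block."""
    blocks: list[str] = []
    for path_idx, path in enumerate(item.get("reasoning_steps") or [], 1):
        # Skip malformed or empty paths, but keep the original numbering.
        if not isinstance(path, list) or not path:
            continue
        lines = [f"### Reasoning Path {path_idx}"]
        for step in path:
            if not isinstance(step, dict):
                continue
            step_no = step.get("reasoning step", "?")
            step_type = str(step.get("reasoning type") or "").strip()
            rationale = str(step.get("rationale") or "").strip()
            if rationale:
                prefix = f"{step_no}. [{step_type}] " if step_type else f"{step_no}. "
                lines.append(prefix + rationale)
        # Keep a path only if at least one step carried a rationale.
        if len(lines) > 1:
            blocks.append("\n".join(lines))
    if not blocks:
        return ""
    return "## Reference Reasoning Steps\n" + "\n\n".join(blocks[:3])


if __name__ == "__main__":
    # Invented sample item, for illustration only.
    sample = {
        "reasoning_steps": [
            [
                {"reasoning step": 1, "reasoning type": "perception", "rationale": "Count the bars."},
                {"reasoning step": 2, "rationale": "Compare heights."},
            ]
        ]
    }
    print(build_reference_text(sample))
```

Steps without a rationale are dropped, and a path with no surviving steps contributes nothing, so an item with empty annotations yields an empty string instead of a bare heading.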
--- a/skillopt/envs/mmrb/dataloader.py
+++ b/skillopt/envs/mmrb/dataloader.py
@@ -1,146 +0,0 @@
-"""MMRB task dataloader."""
-from __future__ import annotations
-
-import glob
-import json
-import os
-import re
-from typing import Any
-
-from skillopt.datasets.base import SplitDataLoader
-
-
-# ── Raw data loading utilities (for preprocessing / standalone eval) ─────
-
-def _load_json(path: str) -> Any:
-    with open(path, encoding="utf-8") as f:
-        return json.load(f)
-
-
-def _iter_data_files(data_path: str) -> list[str]:
-    if not data_path:
-        return []
-    if os.path.isfile(data_path):
-        return [data_path]
-    if os.path.isdir(data_path):
-        nested = glob.glob(os.path.join(data_path, "**", "*_human.json"), recursive=True)
-        flat = glob.glob(os.path.join(data_path, "*_human.json"))
-        return sorted(set(nested + flat))
-    return []
-
-
-def _normalize_space(text: str) -> str:
-    return re.sub(r"\s+", " ", str(text or "").strip())
-
-
-def _normalize_item(item: dict, row_idx: int, source_path: str) -> dict | None:
-    question = _normalize_space(item.get("question") or "")
-    answer = _normalize_space(item.get("answer") or "")
-    raw_image_paths = item.get("image_paths") or []
-    if not question or not answer or not isinstance(raw_image_paths, list) or not raw_image_paths:
-        return None
-
-    base_dir = os.path.dirname(source_path)
-    image_paths: list[str] = []
-    for raw_path in raw_image_paths:
-        rel = str(raw_path or "").strip()
-        if not rel:
-            continue
-        abs_path = rel if os.path.isabs(rel) else os.path.abspath(os.path.join(base_dir, rel))
-        if os.path.exists(abs_path):
-            image_paths.append(abs_path)
-    if not image_paths:
-        return None
-
-    options_raw = item.get("options") or []
-    options = [_normalize_space(opt) for opt in options_raw if _normalize_space(opt)]
-    source = _normalize_space(item.get("source") or "unknown")
-    subtask = _normalize_space(item.get("subtask") or "unknown")
-    item_index = item.get("index", row_idx)
-    item_id = f"{source}:{subtask}:{item_index}"
-
-    return {
-        "id": item_id,
-        "source": source,
-        "subtask": subtask,
-        "task_type": subtask,
-        "question": question,
-        "answer": answer,
-        "options": options,
-        "is_choice": bool(options),
-        "image_paths": image_paths,
-        "reasoning_steps": item.get("reasoning_steps") or [],
-        "annotation_time": item.get("annotation_time"),
-        "source_path": os.path.abspath(source_path),
-    }
-
-
-def load_items(data_path: str) -> list[dict]:
-    """Load and normalise MMRB items from JSON files."""
-    files = _iter_data_files(data_path)
-    if not files:
-        raise ValueError(
-            "MMRB requires data_path to be a *_human.json file or a directory "
-            "containing extracted MMRB subtask folders."
-        )
-
-    items: list[dict] = []
-    for path in files:
-        raw = _load_json(path)
-        if not isinstance(raw, list):
-            raise ValueError(f"Expected JSON array in {path}, got {type(raw).__name__}")
-        for row_idx, item in enumerate(raw):
-            if not isinstance(item, dict):
-                continue
-            norm = _normalize_item(item, row_idx=row_idx, source_path=path)
-            if norm is not None:
-                items.append(norm)
-
-    if not items:
-        raise ValueError(f"No valid MMRB items loaded from {data_path}")
-    return items
-
-
-# ── Dataloader ───────────────────────────────────────────────────────────
-
-class MMRBDataLoader(SplitDataLoader):
-    """MMRB dataloader."""
-
-    def __init__(
-        self,
-        split_dir: str = "",
-        data_path: str = "",
-        split_mode: str = "ratio",
-        split_ratio: str = "2:1:7",
-        split_seed: int = 42,
-        split_output_dir: str = "",
-        seed: int = 42,
-        limit: int = 0,
-        **kwargs,
-    ) -> None:
-        super().__init__(
-            split_dir=split_dir,
-            data_path=data_path,
-            split_mode=split_mode,
-            split_ratio=split_ratio,
-            split_seed=split_seed,
-            split_output_dir=split_output_dir,
-            seed=seed,
-            limit=limit,
-        )
-        self._task_types: list[str] = []
-
-    def load_raw_items(self, data_path: str) -> list[dict]:
-        return load_items(data_path)
-
-    def setup(self, cfg: dict) -> None:
-        super().setup(cfg)
-        all_items = self.train_items + self.val_items + self.test_items
-        task_types = {
-            item.get("subtask") or item.get("task_type") or "unknown"
-            for item in all_items
-        }
-        self._task_types = sorted(task_types)
-
-    def get_task_types(self) -> list[str]:
-        return list(self._task_types)
--- a/skillopt/envs/mmrb/evaluator.py
+++ b/skillopt/envs/mmrb/evaluator.py
@@ -1,102 +0,0 @@
-"""MMRB evaluation helpers."""
-from __future__ import annotations
-
-import re
-import string
-
-
-_EVAL_MODE = "mmrb_exact_match_v1"
-
-
-def normalize_text(text: str) -> str:
-    text = str(text or "").strip().lower()
-    text = "".join(ch for ch in text if ch not in string.punctuation)
-    return " ".join(text.split())
-
-
-def extract_answer(text: str | None) -> str:
-    raw = str(text or "").strip()
-    if not raw:
-        return ""
-
-    answer_tags = re.findall(r"<answer>\s*(.*?)\s*</answer>", raw, re.IGNORECASE | re.DOTALL)
-    if answer_tags:
-        return answer_tags[-1].strip()
-
-    bracket = re.findall(r"Answer\s*\[\s*(.*?)\s*\]", raw, re.IGNORECASE | re.DOTALL)
-    if bracket:
-        return bracket[-1].strip()
-
-    boxed = re.findall(r"\\boxed\{(.*?)\}", raw, re.IGNORECASE | re.DOTALL)
-    if boxed:
-        return boxed[-1].strip()
-
-    single = raw.strip().rstrip(".):")
-    if re.fullmatch(r"[A-Z]", single, re.IGNORECASE):
-        return single.strip()
-
-    patterns = [
-        r"final answer\s*(?:is)?\s*[:：]?\s*(.+)",
-        r"the answer is\s*[:：]?\s*(.+)",
-        r"answer\s*[:：]?\s*(.+)$",
-    ]
-    for pattern in patterns:
-        match = re.search(pattern, raw, re.IGNORECASE)
-        if match:
-            return match.group(1).strip().strip("*")
-
-    return raw
-
-
-def evaluate_item(*, item: dict, prediction_text: str) -> dict:
-    predicted_answer = extract_answer(prediction_text)
-    gold_answer = str(item.get("answer") or "").strip()
-    predicted_norm = normalize_text(predicted_answer)
-    gold_norm = normalize_text(gold_answer)
-
-    hard = 0.0
-    matched_gold = ""
-    predicted_label = ""
-    predicted_text = predicted_answer
-
-    if item.get("is_choice"):
-        predicted_label = str(predicted_answer).strip().upper().rstrip(".):")
-        if predicted_label == str(gold_answer).strip().upper():
-            hard = 1.0
-            matched_gold = gold_answer
-        else:
-            for option in item.get("options") or []:
-                label_match = re.match(r"\(?([A-Z])\)", option)
-                if not label_match:
-                    continue
-                label = label_match.group(1).upper()
-                option_text = option[label_match.end():].strip(" .:-")
-                if predicted_norm and normalize_text(option_text) == predicted_norm:
-                    predicted_label = label
-                    predicted_text = option_text
-                    break
-            if predicted_label == str(gold_answer).strip().upper():
-                hard = 1.0
-                matched_gold = gold_answer
-    else:
-        if predicted_norm and gold_norm and (
-            predicted_norm == gold_norm or predicted_norm in gold_norm or gold_norm in predicted_norm
-        ):
-            hard = 1.0
-            matched_gold = gold_answer
-
-    return {
-        "evaluation_mode": _EVAL_MODE,
-        "predicted_answer": predicted_answer,
-        "predicted_label": predicted_label,
-        "predicted_text": predicted_text,
-        "em": hard,
-        "f1": hard,
-        "sub_em": hard,
-        "matched_gold": matched_gold,
-    }
-
-
-def evaluation_mode() -> str:
-    return _EVAL_MODE
-
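Reviewer note: the removed `extract_answer` tries its patterns in a fixed order, and the first one that matches wins. The sketch below re-creates that precedence in a simplified form, for illustration only. The name `extract_answer_sketch` is hypothetical, and the `Answer [...]` bracket form and the "final answer is ..." phrase fallbacks from the removed module are deliberately left out.

```python
import re


def extract_answer_sketch(text: str) -> str:
    """Simplified re-creation of the removed extraction order (not the original module)."""
    raw = (text or "").strip()
    if not raw:
        return ""
    # 1. Explicit <answer>...</answer> tags win; the last tag is used.
    tags = re.findall(r"<answer>\s*(.*?)\s*</answer>", raw, re.IGNORECASE | re.DOTALL)
    if tags:
        return tags[-1].strip()
    # 2. LaTeX-style \boxed{...}; again the last occurrence is used.
    boxed = re.findall(r"\\boxed\{(.*?)\}", raw, re.DOTALL)
    if boxed:
        return boxed[-1].strip()
    # 3. A bare option letter such as "B." or "c)".
    single = raw.rstrip(".):")
    if re.fullmatch(r"[A-Za-z]", single):
        return single
    # 4. Otherwise fall back to the raw text unchanged.
    return raw
```

Taking the last tag rather than the first means a model that revises itself mid-response is scored on its final answer.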
--- a/skillopt/envs/mmrb/prompts/rollout_system.md
+++ b/skillopt/envs/mmrb/prompts/rollout_system.md
@@ -1,10 +0,0 @@
-You are an expert multi-image reasoning agent.
-
-{skill_section}## Task Format
-You will receive a question grounded in multiple images.
-Use the image order exactly as presented in the prompt and compare evidence across images carefully.
-
-## Answer Format
-- Put the final answer inside <answer>...</answer>.
-- For multiple-choice questions, output only the single option letter inside <answer>...</answer>.
-- For open questions, output only the short final answer inside <answer>...</answer>.
--- a/skillopt/envs/mmrb/rollout.py
+++ b/skillopt/envs/mmrb/rollout.py
@@ -1,455 +0,0 @@
-"""MMRB rollout."""
-from __future__ import annotations
-
-import base64
-import json
-import mimetypes
-import os
-import re
-from concurrent.futures import ThreadPoolExecutor, as_completed
-
-from skillopt.envs.mmrb.evaluator import evaluate_item, evaluation_mode
-from skillopt.model import chat_target_messages, get_target_backend, is_target_exec_backend
-from skillopt.model.codex_harness import prepare_workspace, render_skill_md, run_target_exec
-from skillopt.prompts import load_prompt
-
-_IMAGE_REF_RE = re.compile(r"\{image#(\d+)\}", re.IGNORECASE)
-
-
-def _build_system(skill_content: str) -> str:
-    if skill_content.strip():
-        skill_section = f"## Skill\n{skill_content.strip()}\n\n"
-    else:
-        skill_section = ""
-    return load_prompt("rollout_system", env="mmrb").format(skill_section=skill_section)
-
-
-def _image_to_data_uri(path: str) -> str:
-    mime = mimetypes.guess_type(path)[0] or "image/png"
-    with open(path, "rb") as f:
-        encoded = base64.b64encode(f.read()).decode("ascii")
-    return f"data:{mime};base64,{encoded}"
-
-
-def _build_user_content(
-    item: dict,
-    image_detail: str,
-    *,
-    diagnostic_mode: bool = False,
-    diagnostic_instruction: str = "",
-    diagnostic_trace_context: str = "",
-) -> tuple[list[dict], str]:
-    raw_question = str(item["question"])
-    content: list[dict] = []
-    text_parts: list[str] = []
-    used_indices: set[int] = set()
-    cursor = 0
-
-    if diagnostic_trace_context.strip():
-        prefix = (
-            "## Previous Codex Trace Snapshot\n"
-            "This is a partial transcript from an earlier attempt. Use it as your current reasoning context.\n\n"
-            f"{diagnostic_trace_context.strip()}\n\n"
-        )
-        content.append({"type": "text", "text": prefix})
-        text_parts.append(prefix)
-
-    for match in _IMAGE_REF_RE.finditer(raw_question):
-        if match.start() > cursor:
-            chunk = raw_question[cursor:match.start()]
-            if chunk:
-                content.append({"type": "text", "text": chunk})
-                text_parts.append(chunk)
-
-        image_idx = int(match.group(1)) - 1
-        marker = f"[Image #{image_idx + 1}]"
-        text_parts.append(marker)
-        if 0 <= image_idx < len(item["image_paths"]):
-            image_url = {"url": _image_to_data_uri(item["image_paths"][image_idx])}
-            if image_detail and image_detail != "auto":
-                image_url["detail"] = image_detail
-            content.append({"type": "image_url", "image_url": image_url})
-            used_indices.add(image_idx)
-        else:
-            content.append({"type": "text", "text": marker})
-        cursor = match.end()
-
-    if cursor < len(raw_question):
-        tail = raw_question[cursor:]
-        if tail:
-            content.append({"type": "text", "text": tail})
-            text_parts.append(tail)
-
-    for idx, path in enumerate(item["image_paths"]):
-        if idx in used_indices:
-            continue
-        marker = f"\n[Additional Image #{idx + 1}]"
-        text_parts.append(marker)
-        content.append({"type": "text", "text": marker})
-        image_url = {"url": _image_to_data_uri(path)}
-        if image_detail and image_detail != "auto":
-            image_url["detail"] = image_detail
-        content.append({"type": "image_url", "image_url": image_url})
-
-    answer_instruction = (
-        "\n\nAnswer with the single correct option letter inside <answer>...</answer>."
-        if item.get("is_choice")
-        else "\n\nAnswer with the short final answer inside <answer>...</answer>."
-    )
-    content.append({"type": "text", "text": answer_instruction})
-    text_parts.append(answer_instruction)
-
-    if diagnostic_mode and diagnostic_instruction.strip():
-        diag_block = f"\n\n## Training Readout\n{diagnostic_instruction.strip()}"
-        content.append({"type": "text", "text": diag_block})
-        text_parts.append(diag_block)
-
-    return content, "".join(text_parts)
-
-
-def _build_messages(
-    item: dict,
-    skill_content: str,
-    image_detail: str,
-    *,
-    diagnostic_mode: bool = False,
-    diagnostic_instruction: str = "",
-    diagnostic_trace_context: str = "",
-) -> tuple[list[dict], str, str]:
-    system = _build_system(skill_content)
-    user_content, user_text = _build_user_content(
-        item, image_detail,
-        diagnostic_mode=diagnostic_mode, diagnostic_instruction=diagnostic_instruction,
-        diagnostic_trace_context=diagnostic_trace_context,
-    )
-    messages = [
-        {"role": "system", "content": system},
-        {"role": "user", "content": user_content},
-    ]
-    return messages, system, user_text
-
-
-def _build_codex_skill(skill_content: str) -> str:
-    return render_skill_md(
-        skill_content,
-        description="Dynamic ReflACT skill for solving the current MMRB multi-image reasoning question.",
-        preamble=(
-            "Use this skill when solving the current multi-image reasoning task.\n"
-            "Inspect all attached images carefully and return the final answer inside <answer>...</answer>."
-        ),
-    )
-
-
-def _run_codex_once(
-    *,
-    pred_dir: str,
-    item: dict,
-    skill_content: str,
-    model: str,
-    timeout: int,
-    image_detail: str,
-    diagnostic_mode: bool = False,
-    diagnostic_instruction: str = "",
-    diagnostic_trace_context: str = "",
-    previous_response: str = "",
-) -> tuple[str, str, str, str]:
-    user_text = _build_user_content(
-        item,
-        image_detail,
-        diagnostic_mode=diagnostic_mode,
-        diagnostic_instruction=diagnostic_instruction,
-        diagnostic_trace_context=diagnostic_trace_context,
-    )[1]
-    task_parts = [user_text]
-    if previous_response:
-        task_parts.append(
-            "## Previous Attempt\n"
-            f"{previous_response}\n\n"
-            "Review the same images carefully and answer again."
-        )
-    task_text = "\n\n".join(task_parts)
-    skill_md = _build_codex_skill(skill_content)
-    work_dir = os.path.join(pred_dir, "codex_exec")
-    prepare_workspace(
-        work_dir=work_dir,
-        skill_md=skill_md,
-        task_text=task_text,
-        images=item["image_paths"],
-    )
-    prompt = (
-        "Use the `skillopt-target` skill available in this workspace.\n"
-        "Read `task.md`, inspect all attached images, and answer the question.\n"
-        "Keep the final answer inside <answer>...</answer>."
-    )
-    final_message, raw = run_target_exec(
-        work_dir=work_dir,
-        prompt=prompt,
-        model=model,
-        timeout=timeout,
-        images=item["image_paths"],
-    )
-    return final_message or raw, raw, skill_md, task_text
-
-
-def process_one(
-    item: dict,
-    out_root: str,
-    skill_content: str,
-    *,
-    max_turns: int = 1,
-    image_detail: str = "auto",
-    diagnostic_mode: bool = False,
-    diagnostic_instruction: str = "",
-    diagnostic_trace_context: str = "",
-) -> dict:
-    item_id = str(item["id"])
-    result = {
-        "id": item_id,
-        "question": item["question"],
-        "task_type": item.get("subtask") or item.get("task_type") or "mmrb",
-        "task_description": item["question"],
-        "hard": 0,
-        "soft": 0.0,
-        "predicted_answer": "",
-        "predicted_label": "",
-        "predicted_text": "",
-        "response": "",
-        "fail_reason": "",
-        "agent_ok": False,
-        "n_turns": 0,
-        "image_paths": item["image_paths"],
-        "gold_answer": item["answer"],
-        "evaluation_mode": evaluation_mode(),
-    }
-
-    try:
-        pred_dir = os.path.join(out_root, "predictions", item_id)
-        os.makedirs(pred_dir, exist_ok=True)
-
-        if is_target_exec_backend():
-            from skillopt.model import azure_openai as _llm
-
-            response = ""
-            conversation: list[dict] = [
-                {
-                    "role": "user",
-                    "content": item["question"] + "\n\n" + "\n".join(
-                        f"[image] {os.path.basename(path)}" for path in item["image_paths"]
-                    ),
-                }
-            ]
-            system_prompt = ""
-            user_text = ""
-            for turn in range(max_turns):
-                response, raw, system_prompt, user_text = _run_codex_once(
-                    pred_dir=pred_dir,
-                    item=item,
-                    skill_content=skill_content,
-                    model=_llm.TARGET_DEPLOYMENT,
-                    timeout=120,
-                    image_detail=image_detail,
-                    diagnostic_mode=diagnostic_mode if turn == 0 else False,
-                    diagnostic_instruction=diagnostic_instruction if turn == 0 else "",
-                    diagnostic_trace_context=diagnostic_trace_context if turn == 0 else "",
-                    previous_response=response if turn > 0 else "",
-                )
-                conversation.append({"type": "message", "turn": turn + 1, "content": response})
-                if "<answer>" in response.lower():
-                    break
-
-            result["response"] = response
-            result["agent_ok"] = True
-            result["n_turns"] = len(conversation) - 1
-            with open(os.path.join(pred_dir, "target_system_prompt.txt"), "w", encoding="utf-8") as f:
-                f.write(system_prompt)
-            with open(os.path.join(pred_dir, "target_user_prompt.txt"), "w", encoding="utf-8") as f:
-                f.write(user_text)
-
-            eval_result = evaluate_item(item=item, prediction_text=response)
-            result["evaluation_mode"] = eval_result["evaluation_mode"]
-            result["predicted_answer"] = eval_result["predicted_answer"]
-            result["predicted_label"] = eval_result["predicted_label"]
-            result["predicted_text"] = eval_result["predicted_text"]
-            result["matched_gold"] = eval_result["matched_gold"]
-            result["hard"] = int(eval_result["em"])
-            result["soft"] = eval_result["f1"]
-            if not result["hard"]:
-                result["fail_reason"] = (
-                    f"predicted '{eval_result['predicted_answer']}' but expected '{item['answer']}'"
-                )
-            eval_detail = (
-                "[EVALUATION RESULT]\n"
-                f"Question: {item['question']}\n"
-                f"Predicted answer: {eval_result['predicted_answer']!r}\n"
-                f"Predicted label: {eval_result['predicted_label']!r}\n"
-                f"Gold answer: {item['answer']!r}\n"
-                f"Correct: {eval_result['em']}\n"
-            )
-            conversation.append({"role": "system", "content": eval_detail})
-            with open(os.path.join(pred_dir, "conversation.json"), "w", encoding="utf-8") as f:
-                json.dump(conversation, f, ensure_ascii=False, indent=2)
-            return result
-
-        messages, system_prompt, user_text = _build_messages(
-            item,
-            skill_content,
-            image_detail,
-            diagnostic_mode=diagnostic_mode,
-            diagnostic_instruction=diagnostic_instruction,
-            diagnostic_trace_context=diagnostic_trace_context,
-        )
-        response = ""
-        conversation: list[dict] = [
-            {
-                "role": "user",
-                "content": user_text + "\n\n" + "\n".join(
-                    f"[image] {os.path.basename(path)}" for path in item["image_paths"]
-                ),
-            }
-        ]
-
-        for turn in range(max_turns):
-            if turn == 0:
-                resp_text, _ = chat_target_messages(
-                    messages=messages,
-                    max_completion_tokens=768,
-                    retries=5,
-                    stage="rollout",
-                )
-            else:
-                refinement_messages = [
-                    messages[0],
-                    messages[1],
-                    {"role": "assistant", "content": response},
-                    {
-                        "role": "user",
-                        "content": "Review the same images carefully and answer again. Keep the final answer inside <answer>...</answer>.",
-                    },
-                ]
-                resp_text, _ = chat_target_messages(
-                    messages=refinement_messages,
-                    max_completion_tokens=512,
-                    retries=5,
-                    stage="rollout",
-                )
-            response = resp_text
-            conversation.append({"type": "message", "turn": turn + 1, "content": resp_text})
-            if "<answer>" in resp_text.lower():
-                break
-
-        result["response"] = response
-        result["agent_ok"] = True
-        result["n_turns"] = len(conversation) - 1
-
-        with open(os.path.join(pred_dir, "target_system_prompt.txt"), "w", encoding="utf-8") as f:
-            f.write(system_prompt)
-        with open(os.path.join(pred_dir, "target_user_prompt.txt"), "w", encoding="utf-8") as f:
-            f.write(user_text)
-
-        eval_result = evaluate_item(item=item, prediction_text=response)
-        result["evaluation_mode"] = eval_result["evaluation_mode"]
-        result["predicted_answer"] = eval_result["predicted_answer"]
-        result["predicted_label"] = eval_result["predicted_label"]
-        result["predicted_text"] = eval_result["predicted_text"]
-        result["matched_gold"] = eval_result["matched_gold"]
-        result["hard"] = int(eval_result["em"])
-        result["soft"] = eval_result["f1"]
-        if not result["hard"]:
-            result["fail_reason"] = (
-                f"predicted '{eval_result['predicted_answer']}' but expected '{item['answer']}'"
-            )
-
-        eval_detail = (
-            "[EVALUATION RESULT]\n"
-            f"Question: {item['question']}\n"
-            f"Predicted answer: {eval_result['predicted_answer']!r}\n"
-            f"Predicted label: {eval_result['predicted_label']!r}\n"
-            f"Gold answer: {item['answer']!r}\n"
-            f"Correct: {eval_result['em']}\n"
-        )
-        conversation.append({"role": "system", "content": eval_detail})
-        with open(os.path.join(pred_dir, "conversation.json"), "w", encoding="utf-8") as f:
-            json.dump(conversation, f, ensure_ascii=False, indent=2)
-    except Exception as e:  # noqa: BLE001
-        result["fail_reason"] = f"error: {e}"
-    return result
-
-
-def run_batch(
-    items: list[dict],
-    out_root: str,
-    skill_content: str,
-    *,
-    max_turns: int = 1,
-    workers: int = 16,
-    image_detail: str = "auto",
-    diagnostic_mode: bool = False,
-    diagnostic_instruction: str = "",
-    diagnostic_trace_context_by_id: dict[str, str] | None = None,
-) -> list[dict]:
-    results_path = os.path.join(out_root, "results.jsonl")
-    os.makedirs(out_root, exist_ok=True)
-
-    expected_eval_mode = evaluation_mode()
-    done_ids: set[str] = set()
-    existing: list[dict] = []
-    rewrite_results = False
-    if os.path.exists(results_path):
-        with open(results_path, encoding="utf-8") as f:
-            for line in f:
-                try:
-                    row = json.loads(line)
-                    if row.get("evaluation_mode") != expected_eval_mode:
-                        rewrite_results = True
-                        continue
-                    done_ids.add(str(row["id"]))
-                    existing.append(row)
-                except Exception:
-                    rewrite_results = True
-
-    pending = [item for item in items if str(item["id"]) not in done_ids]
-    if not pending and not rewrite_results:
-        return existing
-
-    total = len(existing) + len(pending)
-    completed = len(existing)
-    correct_count = sum(1 for r in existing if r.get("hard", 0))
-    if existing:
-        print(f"    [rollout] resuming: {completed}/{total} already done", flush=True)
-
-    results = list(existing)
-    file_mode = "w" if rewrite_results else "a"
-    with open(results_path, file_mode, encoding="utf-8") as outf, ThreadPoolExecutor(max_workers=workers) as ex:
-        if rewrite_results:
-            for row in existing:
-                outf.write(json.dumps(row, ensure_ascii=False) + "\n")
-        futs = {
-            ex.submit(
-                process_one,
-                item,
-                out_root,
-                skill_content,
-                max_turns=max_turns,
-                image_detail=image_detail,
-                diagnostic_mode=diagnostic_mode,
-                diagnostic_instruction=diagnostic_instruction,
-                diagnostic_trace_context=(diagnostic_trace_context_by_id or {}).get(str(item["id"]), ""),
-            ): item
-            for item in pending
-        }
-        for fut in as_completed(futs):
-            row = fut.result()
-            results.append(row)
-            completed += 1
-            if row.get("hard", 0):
-                correct_count += 1
-            acc = correct_count / completed if completed else 0
-            print(
-                f"    [rollout] {completed}/{total} "
-                f"(acc={acc:.3f}) id={row.get('id', '?')} "
-                f"hard={row.get('hard', '?')}",
-                flush=True,
-            )
-            outf.write(json.dumps(row, ensure_ascii=False) + "\n")
-            outf.flush()
-    return results
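Reviewer note: the removed `run_batch` resumes from an existing `results.jsonl`. It keeps rows scored under the current evaluation mode and rewrites the file when any row is stale or unreadable. The sketch below shows only that filtering step, under stated assumptions: `load_done_rows` is a hypothetical helper name, and the mode strings `"v1"` and `"v0"` are placeholders rather than values from the repository.

```python
import json
import os
import tempfile


def load_done_rows(results_path: str, expected_mode: str) -> tuple[list[dict], bool]:
    """Return (reusable rows, needs_rewrite) for a JSONL results file."""
    rows: list[dict] = []
    needs_rewrite = False
    if not os.path.exists(results_path):
        return rows, needs_rewrite
    with open(results_path, encoding="utf-8") as f:
        for line in f:
            try:
                row = json.loads(line)
            except json.JSONDecodeError:
                needs_rewrite = True  # corrupt line: drop it and rewrite the file
                continue
            if not isinstance(row, dict) or row.get("evaluation_mode") != expected_mode:
                needs_rewrite = True  # scored under a different evaluator: redo it
                continue
            rows.append(row)
    return rows, needs_rewrite


# Usage: one current row, one stale row, one corrupt line.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "results.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        f.write(json.dumps({"id": "a", "evaluation_mode": "v1"}) + "\n")
        f.write(json.dumps({"id": "b", "evaluation_mode": "v0"}) + "\n")
        f.write("{not json\n")
    kept, rewrite = load_done_rows(path, "v1")
    done_ids = {str(r["id"]) for r in kept}
```

Only item `a` counts as done here, so `b` would be re-run and the file rewritten without the stale and corrupt lines.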
--- a/skillopt/envs/mmrb/skills/initial.md
+++ b/skillopt/envs/mmrb/skills/initial.md
@@ -1,17 +0,0 @@
-# MMRB Multi-Image Reasoning Heuristics
-
-## Cross-Image Alignment
-- Track the role of each image by its index and compare evidence across all referenced images before deciding.
-- When the question depends on sequence, correspondence, or retrieval, verify the relation between images instead of judging each image independently.
-
-## Option Elimination
-- For multiple-choice tasks, compare all options and reject choices that match only part of the visual evidence.
-- If options differ by a small visual detail, use the most discriminative cue rather than a coarse scene impression.
-
-## Open Answers
-- For open-ended tasks, give the shortest answer that is fully supported by the combined images.
-- Preserve exact entities, attributes, counts, and directions when the images support them directly.
-
-## Final Answer
-- Output only the final answer inside <answer>...</answer>.
-
--- a/skillopt/envs/sealqa/init.py
+++ b/skillopt/envs/sealqa/init.py
@@ -1 +0,0 @@
-"""SealQA environment package for ReflACT."""
--- a/skillopt/envs/sealqa/adapter.py
+++ b/skillopt/envs/sealqa/adapter.py
@@ -1,130 +0,0 @@
-from __future__ import annotations
-
-import os
-
-from skillopt.datasets.base import BatchSpec
-from skillopt.envs.base import EnvAdapter
-from skillopt.envs.deep_reflect import run_no_reference_deep_reflect
-from skillopt.envs.sealqa.dataloader import SealQADataLoader
-from skillopt.envs.sealqa.rollout import run_batch
-from skillopt.gradient.reflect import run_minibatch_reflect
-
-
-class SealQAAdapter(EnvAdapter):
-    def __init__(
-        self,
-        split_dir: str = '',
-        workers: int = 4,
-        analyst_workers: int = 8,
-        failure_only: bool = False,
-        minibatch_size: int = 8,
-        edit_budget: int = 4,
-        seed: int = 42,
-        limit: int = 0,
-        max_tool_turns: int = 12,
-        use_deep_reflect: bool = False,
-        deep_reflect_failures: int = 4,
-        deep_reflect_successes: int = 2,
-    ) -> None:
-        self.workers = workers
-        self.analyst_workers = analyst_workers
-        self.failure_only = failure_only
-        self.minibatch_size = minibatch_size
-        self.edit_budget = edit_budget
-        self.max_tool_turns = max_tool_turns
-        self.use_deep_reflect = use_deep_reflect
-        self.deep_reflect_failures = deep_reflect_failures
-        self.deep_reflect_successes = deep_reflect_successes
-        self.dataloader = SealQADataLoader(split_dir=split_dir, seed=seed, limit=limit)
-
-    def setup(self, cfg: dict) -> None:
-        super().setup(cfg)
-        self.dataloader.setup(cfg)
-
-    def get_dataloader(self):
-        return self.dataloader
-
-    def build_env_from_batch(self, batch: BatchSpec, **kwargs):
-        return list(batch.payload or [])
-
-    def build_train_env(self, batch_size: int, seed: int, **kwargs):
-        batch = self.dataloader.build_train_batch(batch_size=batch_size, seed=seed, **kwargs)
-        return self.build_env_from_batch(batch, **kwargs)
-
-    def build_eval_env(self, env_num: int, split: str, seed: int, **kwargs):
-        batch = self.dataloader.build_eval_batch(env_num=env_num, split=split, seed=seed, **kwargs)
-        return self.build_env_from_batch(batch, **kwargs)
-
-    def rollout(self, env_manager, skill_content: str, out_dir: str, **kwargs) -> list[dict]:
-        items: list[dict] = env_manager
-        return run_batch(
-            items=items,
-            out_root=out_dir,
-            skill_content=skill_content,
-            workers=self.workers,
-            max_tool_turns=self.max_tool_turns,
-            diagnostic_mode=kwargs.get('diagnostic_mode', False),
-            diagnostic_instruction=kwargs.get('diagnostic_instruction', ''),
-        )
-
-    def reflect(self, results: list[dict], skill_content: str, out_dir: str, **kwargs) -> list[dict | None]:
-        prediction_dir = kwargs.get('prediction_dir', os.path.join(out_dir, 'predictions'))
-        patches_dir = kwargs.get('patches_dir', os.path.join(out_dir, 'patches'))
-        random_seed = kwargs.get('random_seed')
-        step_buffer_context = kwargs.get('step_buffer_context', '')
-        return run_minibatch_reflect(
-            results=results,
-            skill_content=skill_content,
-            prediction_dir=prediction_dir,
-            patches_dir=patches_dir,
-            workers=self.analyst_workers,
-            failure_only=self.failure_only,
-            minibatch_size=self.minibatch_size,
-            edit_budget=self.edit_budget,
-            random_seed=random_seed,
-            error_system=self.get_error_minibatch_prompt(),
-            success_system=self.get_success_minibatch_prompt(),
-            step_buffer_context=step_buffer_context,
-            update_mode=getattr(self, "_cfg", {}).get("skill_update_mode", "patch"),
-        )
-
-    def deep_reflect(
-        self,
-        results: list[dict],
-        skill_content: str,
-        out_dir: str,
-        **kwargs,
-    ) -> list[dict | None]:
-        return run_no_reference_deep_reflect(
-            self,
-            results,
-            skill_content,
-            out_dir,
-            env_manager=kwargs.get('env_manager'),
-            prediction_dir=kwargs.get('prediction_dir'),
-            random_seed=kwargs.get('random_seed'),
-            step_buffer_context=kwargs.get('step_buffer_context', ''),
-            output_requirements=[
-                "- There is no hidden reference block. Use only the question, provided evidence, URL/fetch trace, target output, and evaluation result to infer what intermediate state is worth probing.",
-                "- The instruction must explicitly request a short <analysis>...</analysis> block before the final <answer>...</answer>.",
-                "- The readout should focus on effective time frame, conflicting evidence, decisive source, candidate answer, and answer-finalization rule.",
-                "- Do not ask for exhaustive web summaries or a full chain-of-thought.",
-                "- The instruction text should be ready to append directly to the target's prompt.",
-            ],
-            metadata_builder=lambda item: {
-                "id": str(item.get('id')),
-                "task_type": str(item.get('task_type') or item.get('topic') or 'sealqa'),
-                "question_preview": str(item.get('question') or '')[:200],
-                "freshness": item.get('freshness', ''),
-                "question_types": item.get('question_types', ''),
-                "topic": item.get('topic', ''),
-            },
-        )
-
-    def get_task_types(self) -> list[str]:
-        seen: list[str] = []
-        for item in self.dataloader.train_items + self.dataloader.val_items + self.dataloader.test_items:
-            task_type = str(item.get('task_type') or 'sealqa')
-            if task_type not in seen:
-                seen.append(task_type)
-        return seen or ['sealqa']
--- a/skillopt/envs/sealqa/dataloader.py
+++ b/skillopt/envs/sealqa/dataloader.py
@@ -1,37 +0,0 @@
-from __future__ import annotations
-
-import csv
-from pathlib import Path
-
-from skillopt.datasets.base import SplitDataLoader
-
-
-def _normalize_row(row: dict[str, str], index: int) -> dict:
-    canary = str(row.get('canary') or '').strip()
-    base_id = str(row.get('question_id') or row.get('id') or '').strip()
-    if not base_id:
-        base_id = f"{canary or 'sealqa'}:{index:04d}"
-    return {
-        'id': base_id,
-        'question': str(row.get('question') or '').strip(),
-        'ground_truth': str(row.get('answer') or row.get('ground_truth') or '').strip(),
-        'answers': [str(row.get('answer') or row.get('ground_truth') or '').strip()],
-        'task_type': str(row.get('topic') or 'sealqa').strip() or 'sealqa',
-        'topic': str(row.get('topic') or 'sealqa').strip() or 'sealqa',
-        'urls': str(row.get('urls') or '').strip(),
-        'search_results': str(row.get('search_results') or '').strip(),
-        'freshness': str(row.get('freshness') or '').strip(),
-        'question_types': str(row.get('question_types') or '').strip(),
-        'canary': canary,
-    }
-
-
-class SealQADataLoader(SplitDataLoader):
-    def load_split_items(self, split_path: str) -> list[dict]:
-        path = Path(split_path)
-        csv_files = sorted(path.glob('*.csv'))
-        if not csv_files:
-            raise FileNotFoundError(f'No .csv file found in {split_path}')
-        with csv_files[0].open(encoding='utf-8', newline='') as f:
-            reader = csv.DictReader(f)
-            return [_normalize_row(row, idx) for idx, row in enumerate(reader, start=1)]
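Reviewer note: the removed SealQA loader reads one CSV and synthesises an ID when a row has none. The sketch below illustrates that fallback on an in-memory CSV. `normalize_row_sketch` is a hypothetical, trimmed-down stand-in for `_normalize_row`, and the sample rows are invented for the example.

```python
import csv
import io


def normalize_row_sketch(row: dict[str, str], index: int) -> dict:
    """Trimmed illustration of the ID and topic fallbacks; not the original function."""
    canary = str(row.get("canary") or "").strip()
    base_id = str(row.get("question_id") or row.get("id") or "").strip()
    if not base_id:
        # No usable ID column: build one from the canary (or a default) and the row number.
        base_id = f"{canary or 'sealqa'}:{index:04d}"
    return {
        "id": base_id,
        "question": str(row.get("question") or "").strip(),
        "ground_truth": str(row.get("answer") or row.get("ground_truth") or "").strip(),
        # An empty topic falls back to the generic task type.
        "task_type": str(row.get("topic") or "sealqa").strip() or "sealqa",
    }


# Invented sample data: the second row has no question_id.
sample = "question_id,question,answer,topic\nq7,Who?,Ada,\n,When?,1843,history\n"
reader = csv.DictReader(io.StringIO(sample))
items = [normalize_row_sketch(row, idx) for idx, row in enumerate(reader, start=1)]
```

Numbering from 1 keeps the synthetic IDs aligned with the data row numbers in the CSV, which makes a failing item easier to trace back to its source line.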
--- a/skillopt/envs/sealqa/evaluator.py
+++ b/skillopt/envs/sealqa/evaluator.py
@@ -1,126 +0,0 @@
-from __future__ import annotations
-
-import re
-
-from openai import AzureOpenAI, OpenAI
-
-GRADER_TEMPLATE = """
-Your job is to look at a question, a gold target, and a predicted answer, and then assign a grade of either [\"CORRECT\", \"INCORRECT\", \"NOT_ATTEMPTED\"].
-First, I will give examples of each grade, and then you will grade a new example.
-
-The following are examples of CORRECT predicted answers.
-```
-Question: What are the names of Barack Obama's children?
-Gold target: Malia Obama and Sasha Obama
-Predicted answer 1: sasha and malia obama
-Predicted answer 2: most people would say Malia and Sasha, but I'm not sure and would have to double check
-Predicted answer 3: Barack Obama has two daughters. Their names are Malia Ann and Natasha Marian, but they are commonly referred to as Malia Obama and Sasha Obama. Malia was born on July 4, 1998, and Sasha was born on June 10, 2001.
-```
-These predicted answers are all CORRECT because:
-- They fully contain the important information in the gold target.
-- They do not contain any information that contradicts the gold target.
-- Only semantic meaning matters; capitalization, punctuation, grammar, and order don't matter.
-- Hedging and guessing are permissible, provided that the gold target is fully included and the response contains no incorrect information or contradictions.
-
-The following are examples of INCORRECT predicted answers.
-```
-Question: What are the names of Barack Obama's children?
-Gold target: Malia and Sasha
-Predicted answer 1: Malia.
-Predicted answer 2: Malia, Sasha, and Susan.
-Predicted answer 3: Barack Obama does not have any children.
-```
-These predicted answers are all INCORRECT because a factual statement in the answer contradicts the gold target.
-
-The following are examples of NOT_ATTEMPTED predicted answers.
-```
-Question: What are the names of Barack Obama's children?
-Gold target: Malia and Sasha
-Predicted answer 1: I don't know.
-Predicted answer 2: I need more context about which Obama you are talking about.
-```
-These predicted answers are all NOT_ATTEMPTED because the important information in the gold target is not included and there is no contradiction.
-
-Here is a new example. Grade it as CORRECT, INCORRECT, or NOT_ATTEMPTED, and return only the corresponding letter \"A\", \"B\", or \"C\", with no text around it.
-```
-Question: {question}
-Gold target: {target}
-Predicted answer: {predicted_answer}
-```
-
-Grade the predicted answer as one of:
-A: CORRECT
-B: INCORRECT
-C: NOT_ATTEMPTED
-""".strip()
-
-
-def _build_grader_client() -> tuple[OpenAI | AzureOpenAI, str]:
-    import os
-
-    endpoint = os.environ.get('AZURE_OPENAI_ENDPOINT', '').strip()
-    api_version = os.environ.get('AZURE_OPENAI_API_VERSION', '').strip() or '2025-04-01-preview'
-    azure_key = os.environ.get('AZURE_OPENAI_API_KEY', '').strip()
-    openai_key = os.environ.get('OPENAI_API_KEY', '').strip()
-    api_key = azure_key or openai_key
-    if endpoint and api_version and api_key:
-        model = os.environ.get('SEALQA_GRADER_AZURE_MODEL', '').strip() or os.environ.get('SEALQA_GRADER_MODEL', '').strip() or os.environ.get('AZURE_MODEL_NAME', '').strip() or os.environ.get('OPTIMIZER_DEPLOYMENT', '').strip() or 'gpt-5.4'
-        client = AzureOpenAI(api_key=api_key, api_version=api_version, azure_endpoint=endpoint.rstrip('/'))
-        return client, model
-
-    if openai_key:
-        model = os.environ.get('SEALQA_GRADER_OPENAI_MODEL', '').strip() or os.environ.get('SEALQA_GRADER_MODEL', '').strip() or 'gpt-4.1-mini'
-        return OpenAI(api_key=openai_key), model
-
-    raise ValueError('Missing grader credentials for SealQA scoring.')
-
-
-def _extract_text_content(content) -> str:
-    if content is None:
-        return ''
-    if isinstance(content, str):
-        return content
-    if isinstance(content, list):
-        parts = []
-        for part in content:
-            if isinstance(part, dict) and part.get('type') == 'text':
-                parts.append(str(part.get('text', '')))
-            else:
-                text = getattr(part, 'text', None)
-                if text:
-                    parts.append(str(text))
-        return '\n'.join(parts).strip()
-    return str(content).strip()
-
-
-def _normalize_text(text: str) -> str:
-    lowered = text.strip().lower()
-    lowered = re.sub(r'\s+', ' ', lowered)
-    lowered = re.sub(r'[^\w\s%.-]', '', lowered)
-    return lowered.strip()
-
-
-def _fallback_score(ground_truth: str, predicted: str) -> float:
-    gold = _normalize_text(ground_truth)
-    pred = _normalize_text(predicted)
-    if not gold or not pred:
-        return 0.0
-    if gold == pred:
-        return 1.0
-    if gold in pred or pred in gold:
-        return 1.0
-    return 0.0
-
-
-def score_sealqa(question: str, ground_truth: str, predicted: str) -> float:
-    try:
-        client, model = _build_grader_client()
-    except ValueError:
-        return _fallback_score(ground_truth, predicted)
-
-    prompt = GRADER_TEMPLATE.format(question=question, target=ground_truth, predicted_answer=predicted)
-    completion = client.chat.completions.create(model=model, messages=[{'role': 'user', 'content': prompt}])
-    content = _extract_text_content(completion.choices[0].message.content).strip().upper()
-    if content.startswith('A'):
-        return 1.0
-    return 0.0
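Note on the removed evaluator: `score_sealqa` gives credit only when the grader reply starts with `A`, so a reply that spells out the label (for example `CORRECT`, which starts with `C`) scores 0.0. If this grader is ever restored, a more tolerant parser might look like the sketch below. `parse_grade` and `grade_to_score` are hypothetical names that do not exist in the removed module, and mapping unrecognized replies to `NOT_ATTEMPTED` is an assumption rather than the original behaviour.

```python
import re

# Letter-to-label mapping taken from the tail of the grader prompt.
_LETTER_TO_GRADE = {"A": "CORRECT", "B": "INCORRECT", "C": "NOT_ATTEMPTED"}


def parse_grade(reply: str) -> str:
    """Map a grader reply to CORRECT, INCORRECT, or NOT_ATTEMPTED."""
    text = reply.strip().upper()
    # A bare leading letter such as "A", "B." or "A: CORRECT".
    # The lookahead rejects words that merely start with A, B, or C.
    match = re.match(r"^([ABC])(?![A-Z_])", text)
    if match:
        return _LETTER_TO_GRADE[match.group(1)]
    # Spelled-out labels. INCORRECT is checked before CORRECT because
    # "CORRECT" is a substring of "INCORRECT".
    if re.search(r"NOT[ _]ATTEMPTED", text):
        return "NOT_ATTEMPTED"
    if "INCORRECT" in text:
        return "INCORRECT"
    if "CORRECT" in text:
        return "CORRECT"
    # Assumption: anything unrecognized counts as not attempted.
    return "NOT_ATTEMPTED"


def grade_to_score(reply: str) -> float:
    """Return 1.0 for a CORRECT grade and 0.0 otherwise."""
    return 1.0 if parse_grade(reply) == "CORRECT" else 0.0
```

A reply such as `ANSWER: B` still falls through to `NOT_ATTEMPTED` in this sketch, so a stricter grader prompt remains the more reliable fix.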
--- a/skillopt/envs/sealqa/prompts/analyst_error.md
+++ b/skillopt/envs/sealqa/prompts/analyst_error.md
@@ -1,30 +0,0 @@
-You are an expert failure-analysis agent for evidence-seeking factual question answering tasks.
-
-You will be given MULTIPLE failed SealQA trajectories from a single minibatch and the current skill document. The trajectories may include tool calls such as search, fetch, local reads, or evidence gathering steps.
-
-Your job is to identify COMMON failure patterns across the batch and propose concise skill edits.
-
-## Failure Type Categories
- retrieval_miss: the agent failed to gather the right evidence
- evidence_conflict: the agent saw conflicting evidence but resolved it badly
- answer_selection: the agent found evidence but chose the wrong final answer
- not_attempted: the agent never reached a grounded answer
- other: none of the above
-
-Respond ONLY with a valid JSON object (no markdown fences, no extra text):
-{
-  "batch_size": <number of trajectories analysed>,
-  "failure_summary": [
-    {"failure_type": "<type>", "count": <int>, "description": "<one-line>"}
-  ],
-  "patch": {
-    "reasoning": "<why these edits address the batch's common failures>",
-    "edits": [
-      {"op": "append",       "content": "<markdown to add at end of skill>"},
-      {"op": "insert_after", "target": "<exact heading/text to insert after>", "content": "<markdown>"},
-      {"op": "replace",      "target": "<exact text to replace>",              "content": "<replacement>"},
-      {"op": "delete",       "target": "<exact text to remove>"}
-    ]
-  }
-}
-Only include edits that are needed. "edits" can be an empty list if no patch is warranted.
--- a/skillopt/envs/sealqa/prompts/analyst_success.md
+++ b/skillopt/envs/sealqa/prompts/analyst_success.md
@@ -1,19 +0,0 @@
-You are an expert success-pattern analyst for evidence-seeking factual question answering tasks.
-
-You will be given MULTIPLE successful SealQA trajectories from a single minibatch and the current skill document. Your job is to identify common evidence-gathering and answer-selection behaviors worth encoding in the skill.
-
-Respond ONLY with a valid JSON object:
-{
-  "batch_size": <number of trajectories analysed>,
-  "success_patterns": ["<pattern 1>", "<pattern 2>"],
-  "patch": {
-    "reasoning": "<why these patterns are worth encoding>",
-    "edits": [
-      {"op": "append",       "content": "<markdown>"},
-      {"op": "insert_after", "target": "<heading/text>", "content": "<markdown>"},
-      {"op": "replace",      "target": "<old text>",     "content": "<new text>"},
-      {"op": "delete",       "target": "<exact text to remove>"}
-    ]
-  }
-}
-"edits" may be empty if the skill already covers all observed patterns.
--- a/skillopt/envs/sealqa/prompts/rollout_system.md
+++ b/skillopt/envs/sealqa/prompts/rollout_system.md
@@ -1,3 +0,0 @@
-You are an expert research assistant. Use the provided search evidence first, and only if that is insufficient, inspect the provided URL content fetched for you. Reconcile conflicting information when necessary and return a concise final answer grounded in the evidence you found.
-
-{skill_section}Return the final answer inside <answer>...</answer> when you are ready.
--- a/skillopt/envs/sealqa/rollout.py
+++ b/skillopt/envs/sealqa/rollout.py
@@ -1,284 +0,0 @@
-from __future__ import annotations
-
-import json
-import os
-import re
-from concurrent.futures import ThreadPoolExecutor, as_completed
-
-from skillopt.envs.sealqa.evaluator import score_sealqa
-from skillopt.envs.sealqa.tool_runtime import web_fetch
-from skillopt.model import chat_target, get_target_backend, is_target_exec_backend
-from skillopt.model.codex_harness import prepare_workspace, render_skill_md, run_target_exec
-from skillopt.prompts import load_prompt
-
-_FINAL_RE = re.compile(r"<answer>(.*?)</answer>", re.IGNORECASE | re.DOTALL)
-
-
-def _build_system(skill_content: str) -> str:
-    if skill_content.strip():
-        skill_section = f"## Skill\n{skill_content.strip()}\n\n"
-    else:
-        skill_section = ""
-    return load_prompt("rollout_system", env="sealqa").format(skill_section=skill_section)
-
-
-def _build_user(item: dict, *, diagnostic_mode: bool = False, diagnostic_instruction: str = '') -> str:
-    parts = [f"## Question\n{item['question']}"]
-    if item.get('search_results'):
-        parts.append(f"## Search Results\n{item['search_results']}")
-    if item.get('urls'):
-        parts.append(f"## URL Hints\n{item['urls']}")
-    if item.get('freshness'):
-        parts.append(f"## Freshness\n{item['freshness']}")
-    if item.get('question_types'):
-        parts.append(f"## Question Types\n{item['question_types']}")
-    if diagnostic_mode and diagnostic_instruction.strip():
-        parts.append(f"## Training Readout\n{diagnostic_instruction.strip()}")
-    parts.append('Use the provided search evidence as your primary context. Do not rely on external tool use.')
-    return "\n\n".join(parts)
-
-
-def _extract_answer(text: str) -> str:
-    match = _FINAL_RE.search(text)
-    if match:
-        return match.group(1).strip()
-    lines = [line.strip() for line in text.splitlines() if line.strip()]
-    return lines[-1] if lines else text.strip()
-
-
-def _build_codex_skill(skill_content: str) -> str:
-    return render_skill_md(
-        skill_content,
-        description="Dynamic ReflACT skill for solving the current SealQA evidence-grounded question.",
-        preamble=(
-            "Use this skill when answering the current SealQA question.\n"
-            "Use the provided search evidence first, reconcile conflicts carefully,\n"
-            "and return the final answer inside <answer>...</answer>."
-        ),
-    )
-
-
-def _run_codex_once(
-    *,
-    pred_dir: str,
-    skill_content: str,
-    task_text: str,
-    model: str,
-    timeout: int,
-    previous_response: str = '',
-) -> tuple[str, str, str, str]:
-    task_parts = [task_text]
-    if previous_response:
-        task_parts.append(
-            "## Previous Attempt\n"
-            f"{previous_response}\n\n"
-            "Review the evidence again and correct the final answer if needed."
-        )
-    final_task_text = "\n\n".join(task_parts)
-    skill_md = _build_codex_skill(skill_content)
-    work_dir = os.path.join(pred_dir, 'codex_exec')
-    prepare_workspace(
-        work_dir=work_dir,
-        skill_md=skill_md,
-        task_text=final_task_text,
-    )
-    prompt = (
-        "Use the `skillopt-target` skill available in this workspace.\n"
-        "Read `task.md`, answer the SealQA question using the provided evidence,\n"
-        "and return the final answer inside <answer>...</answer>."
-    )
-    final_message, raw = run_target_exec(
-        work_dir=work_dir,
-        prompt=prompt,
-        model=model,
-        timeout=timeout,
-    )
-    return final_message or raw, raw, skill_md, final_task_text
-
-
-def process_one(
-    item: dict,
-    out_root: str,
-    skill_content: str,
-    *,
-    max_tool_turns: int = 12,
-    diagnostic_mode: bool = False,
-    diagnostic_instruction: str = '',
-) -> dict:
-    item_id = str(item['id'])
-    pred_dir = os.path.join(out_root, 'predictions', item_id)
-    os.makedirs(pred_dir, exist_ok=True)
-
-    system = _build_system(skill_content)
-    user = _build_user(
-        item,
-        diagnostic_mode=diagnostic_mode,
-        diagnostic_instruction=diagnostic_instruction,
-    )
-    conversation: list[dict] = [{'role': 'user', 'content': user}]
-    final_response = ''
-    final_answer = ''
-    fail_reason = ''
-
-    try:
-        if is_target_exec_backend():
-            from skillopt.model import azure_openai as _llm
-
-            response, _raw, system, user_for_save = _run_codex_once(
-                pred_dir=pred_dir,
-                skill_content=skill_content,
-                task_text=user,
-                model=_llm.TARGET_DEPLOYMENT,
-                timeout=120,
-            )
-            final_response = response
-            conversation.append({'type': 'message', 'content': response})
-            if '<answer>' in response.lower():
-                final_answer = _extract_answer(response)
-            else:
-                user = user_for_save
-        else:
-            response, _ = chat_target(
-                system=system,
-                user=user,
-                max_completion_tokens=768,
-                retries=5,
-                stage='rollout',
-            )
-            final_response = response
-            conversation.append({'type': 'message', 'content': response})
-            if '<answer>' in response.lower():
-                final_answer = _extract_answer(response)
-
-        if not final_answer:
-            urls_text = str(item.get('urls') or '').strip()
-            fetched_blocks = []
-            for raw_url in re.findall(r'https?://[^\s\]\[\'\",]+', urls_text)[:2]:
-                try:
-                    fetched = web_fetch(raw_url)
-                except Exception as fetch_error:  # noqa: BLE001
-                    fetched = f'URL: {raw_url}\n\n[fetch error: {fetch_error}]'
-                fetched_blocks.append(fetched)
-                conversation.append({'type': 'tool_call', 'cmd': f'web_fetch({raw_url!r})', 'obs': fetched})
-            if fetched_blocks:
-                retry_user = user + '\n\n## Fetched URL Content\n' + '\n\n'.join(fetched_blocks)
-                if is_target_exec_backend():
-                    retry_response, _raw, system, retry_user = _run_codex_once(
-                        pred_dir=pred_dir,
-                        skill_content=skill_content,
-                        task_text=retry_user,
-                        model=_llm.TARGET_DEPLOYMENT,
-                        timeout=120,
-                        previous_response=final_response,
-                    )
-                else:
-                    retry_response, _ = chat_target(
-                        system=system,
-                        user=retry_user,
-                        max_completion_tokens=768,
-                        retries=5,
-                        stage='rollout',
-                    )
-                final_response = retry_response
-                conversation.append({'type': 'message', 'content': retry_response})
-                if '<answer>' in retry_response.lower():
-                    final_answer = _extract_answer(retry_response)
-                else:
-                    fail_reason = 'Model did not produce a final answer'
-            else:
-                fail_reason = 'Model did not produce a final answer'
-    except Exception as e:  # noqa: BLE001
-        fail_reason = f'error: {e}'
-
-    with open(os.path.join(pred_dir, 'target_system_prompt.txt'), 'w', encoding='utf-8') as f:
-        f.write(system)
-    with open(os.path.join(pred_dir, 'target_user_prompt.txt'), 'w', encoding='utf-8') as f:
-        f.write(user)
-    with open(os.path.join(pred_dir, 'conversation.json'), 'w', encoding='utf-8') as f:
-        json.dump(conversation, f, ensure_ascii=False, indent=2)
-
-    score = score_sealqa(item.get('question', ''), item.get('ground_truth', ''), final_answer) if final_answer else 0.0
-    result = {
-        'id': item_id,
-        'question': item.get('question', ''),
-        'task_type': item.get('task_type', 'sealqa'),
-        'task_description': item.get('question', ''),
-        'predicted_answer': final_answer,
-        'response': final_response,
-        'ground_truth': item.get('ground_truth', ''),
-        'hard': int(score >= 1.0),
-        'soft': float(score),
-        'fail_reason': fail_reason or ('' if score >= 1.0 else f"predicted '{final_answer}' but expected '{item.get('ground_truth', '')}'"),
-        'agent_ok': not fail_reason,
-        'n_turns': len(conversation),
-        'target_system_prompt': system,
-        'target_user_prompt': user,
-    }
-    return result
-
-
-def run_batch(
-    items: list[dict],
-    out_root: str,
-    skill_content: str,
-    *,
-    workers: int = 4,
-    max_tool_turns: int = 12,
-    diagnostic_mode: bool = False,
-    diagnostic_instruction: str = '',
-) -> list[dict]:
-    results_path = os.path.join(out_root, 'results.jsonl')
-    os.makedirs(out_root, exist_ok=True)
-
-    done_ids: set[str] = set()
-    existing: list[dict] = []
-    if os.path.exists(results_path):
-        with open(results_path, encoding='utf-8') as f:
-            for line in f:
-                try:
-                    row = json.loads(line)
-                except json.JSONDecodeError:
-                    continue
-                done_ids.add(str(row.get('id')))
-                existing.append(row)
-
-    pending = [item for item in items if str(item['id']) not in done_ids]
-    if not pending:
-        return existing
-
-    total = len(existing) + len(pending)
-    completed = len(existing)
-    correct_count = sum(1 for r in existing if r.get("hard", 0))
-    if existing:
-        print(f"    [rollout] resuming: {completed}/{total} already done", flush=True)
-
-    results = list(existing)
-    with open(results_path, 'a', encoding='utf-8') as outf, ThreadPoolExecutor(max_workers=workers) as ex:
-        futs = {
-            ex.submit(
-                process_one,
-                item,
-                out_root,
-                skill_content,
-                max_tool_turns=max_tool_turns,
-                diagnostic_mode=diagnostic_mode,
-                diagnostic_instruction=diagnostic_instruction,
-            ): item
-            for item in pending
-        }
-        for fut in as_completed(futs):
-            res = fut.result()
-            results.append(res)
-            completed += 1
-            if res.get("hard", 0):
-                correct_count += 1
-            acc = correct_count / completed if completed else 0
-            print(
-                f"    [rollout] {completed}/{total} "
-                f"(acc={acc:.3f}) id={res.get('id', '?')} "
-                f"hard={res.get('hard', '?')}",
-                flush=True,
-            )
-            outf.write(json.dumps(res, ensure_ascii=False) + '\n')
-            outf.flush()
-    return results
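Note on the removed rollout: `run_batch` resumes an interrupted run by re-reading the append-only `results.jsonl` and skipping items whose ids are already recorded. The sketch below isolates that pattern. `load_completed` and `pending_items` are hypothetical helper names, and unlike the removed code this version also skips rows that are not JSON objects or that lack an `id` field, which is an assumption about the desired behaviour.

```python
from __future__ import annotations

import json
import os


def load_completed(results_path: str) -> tuple[list[dict], set[str]]:
    """Return (rows, done_ids) recovered from an append-only JSONL file."""
    rows: list[dict] = []
    done_ids: set[str] = set()
    if not os.path.exists(results_path):
        return rows, done_ids
    with open(results_path, encoding="utf-8") as f:
        for line in f:
            try:
                row = json.loads(line)
            except json.JSONDecodeError:
                # Blank lines and a truncated final line (from a run that
                # was interrupted mid-write) are ignored.
                continue
            if not isinstance(row, dict) or "id" not in row:
                continue
            rows.append(row)
            done_ids.add(str(row["id"]))
    return rows, done_ids


def pending_items(items: list[dict], done_ids: set[str]) -> list[dict]:
    """Keep only the items whose id has not been recorded yet."""
    return [item for item in items if str(item["id"]) not in done_ids]
```

Ids are compared as strings on both sides, as in the removed code, so an integer id in the dataset matches its stringified form in the results file.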
--- a/skillopt/envs/sealqa/skills/initial.md
+++ b/skillopt/envs/sealqa/skills/initial.md
@@ -1,11 +0,0 @@
-# SealQA Skill
-
-## Evidence Gathering
- Search for the most directly relevant evidence before answering.
- If multiple sources conflict, prefer the source that best matches the question's entity, date, and scope.
- Keep notes on which evidence directly answers the question versus which evidence is only contextual.
-
-## Final Answer Discipline
- Do not answer until the supporting evidence is specific enough.
- Choose the final answer that is best grounded in the gathered evidence.
- Keep the final answer concise.
--- a/skillopt/envs/sealqa/tool_runtime.py
+++ b/skillopt/envs/sealqa/tool_runtime.py
@@ -1,30 +0,0 @@
-from __future__ import annotations
-
-import html
-import re
-from urllib.request import Request, urlopen
-
-DEFAULT_USER_AGENT = (
-    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
-    '(KHTML, like Gecko) Chrome/135.0 Safari/537.36'
-)
-_MAX_FETCH_CHARS = 6000
-
-
-def _strip_html(raw_html: str) -> str:
-    cleaned = re.sub(r'(?is)<script.*?>.*?</script>', ' ', raw_html)
-    cleaned = re.sub(r'(?is)<style.*?>.*?</style>', ' ', cleaned)
-    cleaned = re.sub(r'(?is)<[^>]+>', ' ', cleaned)
-    cleaned = html.unescape(cleaned)
-    return re.sub(r'\s+', ' ', cleaned).strip()
-
-
-def web_fetch(url: str, max_chars: int = _MAX_FETCH_CHARS) -> str:
-    req = Request(url, headers={'User-Agent': DEFAULT_USER_AGENT})
-    with urlopen(req, timeout=20) as response:
-        body = response.read().decode('utf-8', errors='ignore')
-    text = _strip_html(body)
-    if len(text) > max_chars:
-        omitted = len(text) - max_chars
-        text = text[:max_chars] + f"\n\n[... {omitted} characters omitted ...]"
-    return f"URL: {url}\n\n{text}"
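Note on the removed fetch helper: `_strip_html` removes markup with regular expressions. A possible alternative, should a similar helper be needed again, is the standard library's `html.parser.HTMLParser`, sketched below. `_TextExtractor` and `strip_html` are hypothetical names, not the removed implementation. Like the regex version, the sketch puts a space wherever a tag used to be.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self) -> None:
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self) -> str:
        # Join the chunks with spaces, then collapse all whitespace runs.
        return " ".join(" ".join(self._chunks).split())


def strip_html(raw_html: str) -> str:
    """Return the whitespace-normalized visible text of an HTML string."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()
```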
--- a/skillopt/envs/searchqa/prompts/deep_probe.md
+++ b/skillopt/envs/searchqa/prompts/deep_probe.md
@@ -1,27 +0,0 @@
-You are an expert diagnostic-probe designer for retrieval-style question answering tasks.
-
-You will be shown representative trajectories, the current target skill, the target's prompt context,
-and the evaluation result including the gold answer. There is NO hidden chain-of-thought reference.
-Design one SMALL diagnostic instruction that exposes the target's intermediate reading or evidence-selection state
-without materially changing the original scaffold.
-
-## Hard Constraints
-1. Do NOT substantially change the original scaffold.
-2. Do NOT prescribe a brand-new multi-step solving procedure.
-3. You MAY ask for a short structured readout of intermediate conclusions, evidence candidates, or elimination decisions.
-4. Do NOT ask for exhaustive quotation of the whole context or a full chain-of-thought.
-5. Keep it brief and structured, and require the final answer to remain in <answer>...</answer>.
-6. Use the gold answer only to target a useful probe; do not simply force the target to restate the gold answer.
-
-## Good Probe Targets
- the most likely supporting span or document cue
- top answer candidate and runner-up
- decisive lexical clue / entity / date / title
- why a tempting alternative was rejected
- 2-4 short intermediate conclusions that directly support the final answer
-
-Respond ONLY with a valid JSON object:
-{
-  "reasoning": "<why this probe is informative>",
-  "probe_instruction": "<the exact instruction text to append to the target prompt>"
-}
--- a/skillopt/envs/spreadsheetbench/prompts/deep_probe.md
+++ b/skillopt/envs/spreadsheetbench/prompts/deep_probe.md
@@ -1,35 +0,0 @@
-You are an expert diagnostic-probe designer for spreadsheet manipulation tasks.
-
-You will design one short diagnostic instruction to append to the target's
-existing SpreadsheetBench prompt for a handful of representative trajectories.
-
-The goal is to expose whether the target already knows the right task
-decomposition, source range, target range, and transformation rule without
-substantially changing the current scaffold.
-
-## Hard Constraints
-1. Do NOT substantially change the target's current scaffold.
-2. Do NOT prescribe a brand-new full algorithm.
-3. Do NOT ask for exhaustive cell-by-cell enumeration.
-4. Keep the diagnostic readout brief and structured.
-5. The target must still complete the original spreadsheet task.
-6. Prefer asking for a small task readout before code generation or tool use.
-7. Never ask for hidden reference content or golden values.
-
-## Good Probe Targets
- task family: filter / sort / dedup / lookup / aggregate / reshape
- source sheet/range and target sheet/range
- decisive grouping / matching / sorting key
- one or two representative cells or rows and how they should be derived
- whether the solution must be dynamic rather than hardcoded
-
-## Bad Probe Targets
- full derivation of every output cell
- dumping all rows or all formulas
- imposing a long new checklist that was not already implicit
-
-Respond ONLY with a valid JSON object:
-{
-  "reasoning": "<why this probe reveals the latent skill gap>",
-  "probe_instruction": "<the exact instruction text to append to the target prompt>"
-}
--- a/skillopt/envs/swebench/__init__.py
+++ b/skillopt/envs/swebench/__init__.py
@@ -1 +0,0 @@
-"""SWEBench environment for ReflACT."""
--- a/skillopt/envs/swebench/adapter.py
+++ b/skillopt/envs/swebench/adapter.py
@@ -1,137 +0,0 @@
-from __future__ import annotations
-
-import os
-
-from skillopt.datasets.base import BatchSpec
-from skillopt.envs.base import EnvAdapter
-from skillopt.envs.swebench.dataloader import SWEBenchDataLoader
-from skillopt.envs.swebench.rollout import run_batch
-from skillopt.gradient.reflect import run_minibatch_reflect
-
-
-class SWEBenchAdapter(EnvAdapter):
-    def __init__(
-        self,
-        split_dir: str = "",
-        data_path: str = "",
-        split_mode: str = "ratio",
-        split_ratio: str = "2:1:7",
-        split_seed: int = 42,
-        split_output_dir: str = "",
-        dataset_name: str = "lite",
-        hf_split: str = "test",
-        workers: int = 8,
-        eval_workers: int = 8,
-        analyst_workers: int = 16,
-        failure_only: bool = False,
-        minibatch_size: int = 4,
-        edit_budget: int = 4,
-        seed: int = 42,
-        limit: int = 0,
-        step_limit: int = 50,
-        cost_limit: float = 3.0,
-        timeout_per_instance: int = 600,
-        target_model: str = "",
-    ) -> None:
-        self.dataset_name = dataset_name
-        self.hf_split = hf_split
-        self.workers = workers
-        self.eval_workers = eval_workers
-        self.analyst_workers = analyst_workers
-        self.failure_only = failure_only
-        self.minibatch_size = minibatch_size
-        self.edit_budget = edit_budget
-        self.step_limit = step_limit
-        self.cost_limit = cost_limit
-        self.timeout_per_instance = timeout_per_instance
-        self.target_model = target_model
-        self.dataloader = SWEBenchDataLoader(
-            split_dir=split_dir,
-            data_path=data_path,
-            split_mode=split_mode,
-            split_ratio=split_ratio,
-            split_seed=split_seed,
-            split_output_dir=split_output_dir,
-            seed=seed,
-            limit=limit,
-            dataset_name=dataset_name,
-            hf_split=hf_split,
-        )
-
-    def setup(self, cfg: dict) -> None:
-        super().setup(cfg)
-        self.target_model = str(self.target_model or cfg.get("target_model") or "gpt-5.4").strip()
-        self.dataset_name = str(self.dataset_name or cfg.get("dataset_name") or "lite").strip()
-        self.hf_split = str(self.hf_split or cfg.get("hf_split") or "test").strip()
-        self.dataloader.setup(cfg)
-
-    def get_dataloader(self):
-        return self.dataloader
-
-    def build_env_from_batch(self, batch: BatchSpec, **kwargs):
-        return list(batch.payload or [])
-
-    def build_train_env(self, batch_size: int, seed: int, **kwargs):
-        batch = self.dataloader.build_train_batch(batch_size=batch_size, seed=seed, **kwargs)
-        return self.build_env_from_batch(batch, **kwargs)
-
-    def build_eval_env(self, env_num: int, split: str, seed: int, **kwargs):
-        batch = self.dataloader.build_eval_batch(env_num=env_num, split=split, seed=seed, **kwargs)
-        return self.build_env_from_batch(batch, **kwargs)
-
-    def rollout(self, env_manager, skill_content: str, out_dir: str, **kwargs) -> list[dict]:
-        items: list[dict] = env_manager
-        return run_batch(
-            items=items,
-            out_root=out_dir,
-            skill_content=skill_content,
-            target_model=self.target_model,
-            dataset_name=self.dataset_name,
-            hf_split=self.hf_split,
-            workers=self.workers,
-            eval_workers=self.eval_workers,
-            step_limit=self.step_limit,
-            cost_limit=self.cost_limit,
-            timeout_per_instance=self.timeout_per_instance,
-        )
-
-    def reflect(
-        self,
-        results: list[dict],
-        skill_content: str,
-        out_dir: str,
-        **kwargs,
-    ) -> list[dict | None]:
-        prediction_dir = kwargs.get("prediction_dir", os.path.join(out_dir, "predictions"))
-        patches_dir = kwargs.get("patches_dir", os.path.join(out_dir, "patches"))
-        random_seed = kwargs.get("random_seed")
-        step_buffer_context = kwargs.get("step_buffer_context", "")
-        meta_skill_context = kwargs.get("meta_skill_context", "")
-        return run_minibatch_reflect(
-            results=results,
-            skill_content=skill_content,
-            prediction_dir=prediction_dir,
-            patches_dir=patches_dir,
-            workers=self.analyst_workers,
-            failure_only=self.failure_only,
-            minibatch_size=self.minibatch_size,
-            edit_budget=self.edit_budget,
-            random_seed=random_seed,
-            error_system=self.get_error_minibatch_prompt(),
-            success_system=self.get_success_minibatch_prompt(),
-            step_buffer_context=step_buffer_context,
-            meta_skill_context=meta_skill_context,
-            update_mode=getattr(self, "_cfg", {}).get("skill_update_mode", "patch"),
-        )
-
-    def get_task_types(self) -> list[str]:
-        repos = {
-            str(item.get("repo") or "").strip()
-            for item in (
-                self.dataloader.train_items
-                + self.dataloader.val_items
-                + self.dataloader.test_items
-            )
-            if str(item.get("repo") or "").strip()
-        }
-        return sorted(repos) or ["swebench"]
--- a/skillopt/envs/swebench/dataloader.py
+++ b/skillopt/envs/swebench/dataloader.py
@@ -1,151 +0,0 @@
-from __future__ import annotations
-
-import json
-import os
-import random
-from collections import defaultdict
-
-from skillopt.datasets.base import SplitDataLoader, _parse_split_ratio
-
-
-_DATASET_ALIASES = {
-    "lite": "princeton-nlp/SWE-Bench_Lite",
-    "verified": "princeton-nlp/SWE-Bench_Verified",
-    "full": "princeton-nlp/SWE-Bench",
-}
-
-
-def _normalize_dataset_name(name: str) -> str:
-    key = str(name or "").strip()
-    return _DATASET_ALIASES.get(key.lower(), key or _DATASET_ALIASES["lite"])
-
-
-class SWEBenchDataLoader(SplitDataLoader):
-    def __init__(
-        self,
-        split_dir: str = "",
-        data_path: str = "",
-        split_mode: str = "ratio",
-        split_ratio: str = "2:1:7",
-        split_seed: int = 42,
-        split_output_dir: str = "",
-        seed: int = 42,
-        limit: int = 0,
-        dataset_name: str = "lite",
-        hf_split: str = "test",
-        **kwargs,
-    ) -> None:
-        super().__init__(
-            split_dir=split_dir,
-            data_path=data_path,
-            split_mode=split_mode,
-            split_ratio=split_ratio,
-            split_seed=split_seed,
-            split_output_dir=split_output_dir,
-            seed=seed,
-            limit=limit,
-        )
-        self.dataset_name = dataset_name
-        self.hf_split = hf_split
-
-    def setup(self, cfg: dict) -> None:
-        self.dataset_name = str(
-            self.dataset_name or cfg.get("dataset_name") or "lite"
-        ).strip()
-        self.hf_split = str(self.hf_split or cfg.get("hf_split") or "test").strip()
-        super().setup(cfg)
-
-    def load_raw_items(self, data_path: str) -> list[dict]:
-        dataset_ref = str(data_path or "").strip()
-        if dataset_ref and (os.path.exists(dataset_ref) or dataset_ref.endswith(".json") or dataset_ref.endswith(".jsonl")):
-            return super().load_raw_items(dataset_ref)
-
-        dataset_name = _normalize_dataset_name(dataset_ref or self.dataset_name)
-        from datasets import load_dataset
-
-        ds = load_dataset(dataset_name, split=self.hf_split)
-        return [dict(item) for item in ds]
-
-    def _materialize_ratio_split(self, cfg: dict) -> str:
-        dataset_ref = os.path.abspath(str(self.data_path or "").strip()) if str(self.data_path or "").strip() and os.path.exists(str(self.data_path or "").strip()) else str(self.data_path or "").strip()
-        if not dataset_ref:
-            dataset_ref = _normalize_dataset_name(self.dataset_name)
-
-        items = self.load_raw_items(dataset_ref)
-        if not isinstance(items, list) or not items:
-            raise ValueError(f"No SWE-bench items available from {dataset_ref!r}")
-
-        ratio = _parse_split_ratio(self.split_ratio)
-        parts = list(ratio)
-        total_parts = sum(parts)
-        rng = random.Random(self.split_seed)
-
-        by_repo: dict[str, list[dict]] = defaultdict(list)
-        for item in items:
-            repo = str(item.get("repo") or "unknown").strip() or "unknown"
-            by_repo[repo].append(dict(item))
-
-        train_items: list[dict] = []
-        val_items: list[dict] = []
-        test_items: list[dict] = []
-
-        for repo in sorted(by_repo):
-            group = list(by_repo[repo])
-            rng.shuffle(group)
-            n = len(group)
-            n_train = round(n * parts[0] / total_parts)
-            n_val = round(n * parts[1] / total_parts)
-
-            if n >= 3:
-                n_train = max(1, n_train)
-                n_val = max(1, n_val)
-            elif n == 2:
-                n_train, n_val = 1, 0
-            else:
-                n_train, n_val = 0, 0
-
-            while n_train + n_val >= n and n >= 2:
-                if n_val > 1:
-                    n_val -= 1
-                elif n_train > 1:
-                    n_train -= 1
-                else:
-                    break
-
-            train_items.extend(group[:n_train])
-            val_items.extend(group[n_train:n_train + n_val])
-            test_items.extend(group[n_train + n_val:])
-
-        rng2 = random.Random(self.split_seed + 1)
-        rng2.shuffle(train_items)
-        rng2.shuffle(val_items)
-        rng2.shuffle(test_items)
-
-        split_dir = self._resolve_split_output_dir(cfg)
-        os.makedirs(split_dir, exist_ok=True)
-        self.write_split_items(os.path.join(split_dir, "train"), train_items)
-        self.write_split_items(os.path.join(split_dir, "val"), val_items)
-        self.write_split_items(os.path.join(split_dir, "test"), test_items)
-
-        manifest = {
-            "source_data_path": dataset_ref,
-            "dataset_name": _normalize_dataset_name(self.dataset_name),
-            "hf_split": self.hf_split,
-            "split_mode": "ratio",
-            "split_ratio": self.split_ratio,
-            "split_seed": self.split_seed,
-            "strategy": "stratified_by_repo",
-            "counts": {
-                "train": len(train_items),
-                "val": len(val_items),
-                "test": len(test_items),
-            },
-        }
-        with open(os.path.join(split_dir, "split_manifest.json"), "w", encoding="utf-8") as f:
-            json.dump(manifest, f, ensure_ascii=False, indent=2)
-        print(
-            f"  [SWEBenchDataLoader] generated repo-stratified split {self.split_ratio} "
-            f"at {split_dir} from {dataset_ref}"
-        )
-        return split_dir
-
--- a/skillopt/envs/swebench/rollout.py
+++ b/skillopt/envs/swebench/rollout.py
@@ -1,346 +0,0 @@
-from __future__ import annotations
-
-import json
-import os
-import shutil
-import subprocess
-import sys
-import time
-from concurrent.futures import ThreadPoolExecutor, as_completed
-from pathlib import Path
-
-
-_DATASET_ALIASES = {
-    "lite": ("princeton-nlp/SWE-Bench_Lite", "SWE-bench/SWE-bench_Lite"),
-    "verified": ("princeton-nlp/SWE-Bench_Verified", "SWE-bench/SWE-bench_Verified"),
-    "full": ("princeton-nlp/SWE-Bench", "SWE-bench/SWE-bench"),
-}
-
-
-def _normalize_dataset_names(dataset_name: str) -> tuple[str, str]:
-    key = str(dataset_name or "lite").strip()
-    pair = _DATASET_ALIASES.get(key.lower())
-    if pair:
-        return pair
-    return key, key
-
-
-def _setup_litellm_env() -> None:
-    mapping = {
-        "AZURE_API_KEY": os.environ.get("AZURE_API_KEY") or os.environ.get("AZURE_OPENAI_API_KEY", ""),
-        "AZURE_API_BASE": os.environ.get("AZURE_API_BASE") or os.environ.get("AZURE_OPENAI_ENDPOINT", ""),
-        "AZURE_API_VERSION": os.environ.get("AZURE_API_VERSION") or os.environ.get("AZURE_OPENAI_API_VERSION", ""),
-    }
-    for key, value in mapping.items():
-        if value and not os.environ.get(key):
-            os.environ[key] = value
-
-
-def _normalize_target_model(target_model: str) -> str:
-    model = str(target_model or "").strip()
-    if not model:
-        return "azure/gpt-5.4"
-    if "/" in model:
-        return model
-    if os.environ.get("AZURE_OPENAI_ENDPOINT"):
-        return f"azure/{model}"
-    return model
-
-
-def _load_json(path: str) -> dict | list | None:
-    if not os.path.exists(path):
-        return None
-    with open(path, encoding="utf-8") as f:
-        return json.load(f)
-
-
-def _build_agent_config(
-    *,
-    skill_content: str,
-    target_model: str,
-    step_limit: int,
-    cost_limit: float,
-) -> tuple[dict, str]:
-    try:
-        from minisweagent.config import get_config_from_spec
-        from minisweagent.utils.serialize import recursive_merge
-    except ImportError as exc:
-        raise ImportError(
-            "SWEBench rollout requires minisweagent. Install the mini-swe-agent environment first."
-        ) from exc
-
-    base_config = get_config_from_spec("swebench.yaml")
-    system_template = base_config.get("agent", {}).get("system_template", "")
-    rendered_system = system_template
-    if skill_content.strip():
-        rendered_system = (
-            system_template.rstrip()
-            + "\n\n## Skill Document\n"
-            + "The following skill contains learned guidance for SWE-bench style bug-fixing tasks.\n\n"
-            + skill_content.strip()
-            + "\n"
-        )
-
-    agent_override = {
-        "agent": {
-            "system_template": rendered_system,
-            "step_limit": int(step_limit),
-            "cost_limit": float(cost_limit),
-        },
-        "model": {
-            "model_name": _normalize_target_model(target_model),
-            "cost_tracking": "ignore_errors",
-        },
-    }
-    return recursive_merge(base_config, agent_override), rendered_system
-
-
-def _load_messages_from_traj(traj_path: Path) -> list[dict]:
-    traj_data = _load_json(str(traj_path))
-    if not isinstance(traj_data, dict):
-        return []
-    messages = traj_data.get("messages")
-    if not isinstance(messages, list):
-        return []
-    return [msg for msg in messages if isinstance(msg, dict) and msg.get("role") != "system"]
-
-
-def _load_exit_status(traj_path: Path) -> str:
-    traj_data = _load_json(str(traj_path))
-    if not isinstance(traj_data, dict):
-        return "missing_traj"
-    info = traj_data.get("info")
-    if isinstance(info, dict):
-        return str(info.get("exit_status") or "unknown")
-    return "unknown"
-
-
-def _run_rollout(
-    *,
-    items: list[dict],
-    predictions_dir: str,
-    skill_content: str,
-    target_model: str,
-    workers: int,
-    step_limit: int,
-    cost_limit: float,
-) -> tuple[list[dict], str]:
-    try:
-        from minisweagent.run.benchmarks.swebench import process_instance
-        from minisweagent.run.benchmarks.utils.batch_progress import RunBatchProgressManager
-    except ImportError as exc:
-        raise ImportError(
-            "SWEBench rollout requires minisweagent with swebench benchmark support."
-        ) from exc
-
-    _setup_litellm_env()
-    config, system_prompt = _build_agent_config(
-        skill_content=skill_content,
-        target_model=target_model,
-        step_limit=step_limit,
-        cost_limit=cost_limit,
-    )
-
-    out_path = Path(predictions_dir)
-    out_path.mkdir(parents=True, exist_ok=True)
-    preds_path = out_path / "preds.json"
-    done_ids: set[str] = set()
-    if preds_path.exists():
-        data = _load_json(str(preds_path))
-        if isinstance(data, dict):
-            done_ids = set(data.keys())
-
-    pending = [item for item in items if str(item.get("instance_id")) not in done_ids]
-    progress_manager = RunBatchProgressManager(
-        len(pending),
-        out_path / f"exit_statuses_{int(time.time())}.yaml",
-    )
-
-    task_errors: dict[str, str] = {}
-
-    def _process(instance: dict) -> None:
-        process_instance(instance, out_path, config, progress_manager)
-
-    with ThreadPoolExecutor(max_workers=max(int(workers), 1)) as executor:
-        futures = {
-            executor.submit(_process, item): str(item.get("instance_id"))
-            for item in pending
-        }
-        for fut in as_completed(futures):
-            iid = futures[fut]
-            try:
-                fut.result()
-            except Exception as exc:  # noqa: BLE001
-                task_errors[iid] = str(exc)
-
-    preds_data = _load_json(str(preds_path))
-    preds_dict = preds_data if isinstance(preds_data, dict) else {}
-    results: list[dict] = []
-
-    for item in items:
-        iid = str(item.get("instance_id"))
-        pred = preds_dict.get(iid, {}) if isinstance(preds_dict, dict) else {}
-        traj_path = out_path / iid / f"{iid}.traj.json"
-        messages = _load_messages_from_traj(traj_path)
-        task_dir = out_path / iid
-        task_dir.mkdir(parents=True, exist_ok=True)
-        user_prompt = (
-            f"Repository: {item.get('repo', '')}\n\n"
-            f"Issue:\n{item.get('problem_statement', '').strip()}"
-        ).strip()
-        with open(task_dir / "conversation.json", "w", encoding="utf-8") as f:
-            json.dump(messages, f, ensure_ascii=False, indent=2)
-        with open(task_dir / "target_system_prompt.txt", "w", encoding="utf-8") as f:
-            f.write(system_prompt)
-        with open(task_dir / "target_user_prompt.txt", "w", encoding="utf-8") as f:
-            f.write(user_prompt)
-
-        results.append(
-            {
-                "id": iid,
-                "instance_id": iid,
-                "repo": str(item.get("repo") or "").strip(),
-                "task_type": str(item.get("repo") or "swebench").strip() or "swebench",
-                "task_description": str(item.get("problem_statement") or "").strip(),
-                "instruction": str(item.get("problem_statement") or "").strip(),
-                "hard": 0,
-                "soft": 0.0,
-                "response": str(pred.get("model_patch") or ""),
-                "submission": str(pred.get("model_patch") or ""),
-                "predicted_patch": str(pred.get("model_patch") or ""),
-                "agent_ok": bool(messages),
-                "n_turns": sum(1 for msg in messages if msg.get("role") == "assistant"),
-                "fail_reason": task_errors.get(iid, ""),
-                "exit_status": _load_exit_status(traj_path),
-            }
-        )
-
-    return results, str(preds_path)
-
-
-def _run_evaluation(
-    *,
-    preds_path: str,
-    dataset_name: str,
-    split: str,
-    run_id: str,
-    eval_workers: int,
-    report_dir: str,
-    instance_ids: list[str],
-) -> dict:
-    _, eval_dataset = _normalize_dataset_names(dataset_name)
-    os.makedirs(report_dir, exist_ok=True)
-
-    preds_data = _load_json(preds_path)
-    model_name = "unknown"
-    if isinstance(preds_data, dict) and preds_data:
-        first_pred = next(iter(preds_data.values()))
-        if isinstance(first_pred, dict):
-            model_name = str(first_pred.get("model_name_or_path") or "unknown")
-    expected_report = os.path.join(report_dir, f"{model_name.replace('/', '__')}.{run_id}.json")
-    if os.path.exists(expected_report):
-        cached = _load_json(expected_report)
-        return cached if isinstance(cached, dict) else {}
-
-    cmd = [
-        sys.executable,
-        "-m",
-        "swebench.harness.run_evaluation",
-        "--dataset_name",
-        eval_dataset,
-        "--split",
-        split,
-        "--predictions_path",
-        preds_path,
-        "--max_workers",
-        str(max(int(eval_workers), 1)),
-        "--run_id",
-        run_id,
-    ]
-    if instance_ids:
-        cmd.extend(["--instance_ids"] + instance_ids)
-
-    subprocess.run(
-        cmd,
-        cwd=report_dir,
-        capture_output=True,
-        text=True,
-        timeout=7200,
-        check=False,
-    )
-
-    if os.path.exists(expected_report):
-        report = _load_json(expected_report)
-        return report if isinstance(report, dict) else {}
-
-    for name in sorted(os.listdir(report_dir)):
-        if name.endswith(".json") and run_id in name:
-            report = _load_json(os.path.join(report_dir, name))
-            if isinstance(report, dict):
-                if os.path.join(report_dir, name) != expected_report:
-                    shutil.move(os.path.join(report_dir, name), expected_report)
-                return report
-    return {"resolved_ids": [], "total_instances": len(instance_ids), "resolved_instances": 0}
-
-
-def run_batch(
-    *,
-    items: list[dict],
-    out_root: str,
-    skill_content: str,
-    target_model: str,
-    dataset_name: str,
-    hf_split: str,
-    workers: int,
-    eval_workers: int,
-    step_limit: int,
-    cost_limit: float,
-    timeout_per_instance: int,
-) -> list[dict]:
-    os.makedirs(out_root, exist_ok=True)
-    results_path = os.path.join(out_root, "results.jsonl")
-    if os.path.exists(results_path):
-        cached: list[dict] = []
-        with open(results_path, encoding="utf-8") as f:
-            for line in f:
-                line = line.strip()
-                if line:
-                    cached.append(json.loads(line))
-        if cached:
-            return cached
-
-    predictions_dir = os.path.join(out_root, "predictions")
-    results, preds_path = _run_rollout(
-        items=items,
-        predictions_dir=predictions_dir,
-        skill_content=skill_content,
-        target_model=target_model,
-        workers=workers,
-        step_limit=step_limit,
-        cost_limit=cost_limit,
-    )
-    eval_report = _run_evaluation(
-        preds_path=preds_path,
-        dataset_name=dataset_name,
-        split=hf_split,
-        run_id=f"skillopt_{int(time.time())}",
-        eval_workers=eval_workers,
-        report_dir=os.path.join(out_root, "evaluation"),
-        instance_ids=[str(item.get("instance_id")) for item in items],
-    )
-    resolved_ids = set(str(i) for i in eval_report.get("resolved_ids", []))
-    for row in results:
-        resolved = str(row["instance_id"]) in resolved_ids
-        row["hard"] = int(resolved)
-        row["soft"] = float(int(resolved))
-        if not resolved:
-            status = row.get("exit_status") or "not_resolved"
-            base_reason = str(row.get("fail_reason") or "").strip()
-            unresolved = f"swebench unresolved ({status})"
-            row["fail_reason"] = f"{base_reason}; {unresolved}" if base_reason else unresolved
-        row["timeout_per_instance"] = int(timeout_per_instance)
-
-    with open(results_path, "w", encoding="utf-8") as f:
-        for row in results:
-            f.write(json.dumps(row, ensure_ascii=False) + "\n")
-    return results
--- a/skillopt/envs/swebench/skills/initial.md
+++ b/skillopt/envs/swebench/skills/initial.md
@@ -1,23 +0,0 @@
-# SWE-bench Bug Fixing Skill
-
-## Overview
-This skill guides agents in resolving real-world GitHub issues by producing correct patches.
-
-**Goal**: Given a repository and an issue description, produce a minimal, correct `git diff` patch that resolves the issue without modifying test files.
-
-## Workflow
-
-1. Understand the issue. Read the problem statement carefully and restate the expected behavior before editing code.
-2. Locate relevant code. Use targeted search to identify the files, functions, and tests that encode the buggy behavior.
-3. Reproduce the issue. Build a small, local reproduction before changing source files when feasible.
-4. Implement the fix. Make the smallest source change that addresses the root cause.
-5. Verify the fix. Re-run the reproduction and any focused checks needed to confirm the change.
-6. Submit the patch. Generate a clean unified diff of only the source files you modified.
-
-## Key Rules
-
-- Keep changes minimal and directly tied to the bug.
-- Do not modify tests, fixtures, or unrelated configuration unless the issue explicitly requires it.
-- Prefer understanding the code path before patching.
-- Verify behavior after editing instead of relying on intuition.
-- The final submission must be a valid unified diff.
--- a/skillopt/gradient/deep_probe.py
+++ b/skillopt/gradient/deep_probe.py
@@ -1,77 +0,0 @@
-"""Optimizer-written diagnostic probe generation for deep reflection."""
-from __future__ import annotations
-
-from skillopt.gradient.reflect import fmt_minibatch_trajectories
-from skillopt.model import chat_optimizer
-from skillopt.optimizer.meta_skill import format_meta_skill_context
-from skillopt.prompts import load_prompt
-from skillopt.utils import extract_json
-
-
-def generate_deep_probe_instruction(
-    skill_content: str,
-    items: list[dict],
-    prediction_dir: str,
-    *,
-    system_prompt: str | None = None,
-    step_buffer_context: str = "",
-    output_requirements: list[str] | None = None,
-    meta_skill_context: str = "",
-) -> dict | None:
-    """Generate one minimally-perturbing diagnostic probe instruction."""
-    trajectories_text = fmt_minibatch_trajectories(items, prediction_dir)
-    if not trajectories_text.strip():
-        return None
-
-    actual_system = system_prompt or load_prompt("deep_probe")
-    user = (
-        f"## Current Skill\n{skill_content}\n\n"
-        "## Probe Design Goal\n"
-        "Design one short diagnostic instruction to append to the target prompt.\n"
-        "The instruction should expose the target's current intermediate judgment\n"
-        "without materially changing the original scaffold.\n\n"
-    )
-    if step_buffer_context.strip():
-        user += f"## Previous Steps in This Epoch\n{step_buffer_context}\n\n"
-    optimizer_ctx = format_meta_skill_context(meta_skill_context)
-    if optimizer_ctx:
-        user += optimizer_ctx + "\n\n"
-    requirements = output_requirements or [
-        "- Some trajectories may include a hidden Reference block. Use it to identify what intermediate conclusion matters, but do not reveal or paraphrase that reference directly to the target.",
-        "- The instruction must explicitly request a short <analysis>...</analysis> block before the final <answer>...</answer>.",
-        "- Keep the readout concise and structured.",
-        "- Do not ask for exhaustive listing, full derivation, or a new solving protocol.",
-        "- The instruction text should be ready to append directly to the target's prompt.",
-    ]
-    user += (
-        f"## Representative Trajectories ({len(items)} total)\n{trajectories_text}\n\n"
-        "## Output Requirements\n"
-        + "\n".join(requirements)
-        + "\n"
-    )
-
-    try:
-        response, _ = chat_optimizer(
-            system=actual_system,
-            user=user,
-            max_completion_tokens=1024,
-            retries=3,
-            stage="deep_probe",
-        )
-        result = extract_json(response)
-        if result and str(result.get("probe_instruction", "")).strip():
-            parsed = {
-                "reasoning": str(result.get("reasoning", "")).strip(),
-                "probe_instruction": str(result.get("probe_instruction", "")).strip(),
-            }
-            if str(result.get("probe_target_id", "")).strip():
-                parsed["probe_target_id"] = str(result.get("probe_target_id", "")).strip()
-            try:
-                if result.get("probe_after_step") is not None:
-                    parsed["probe_after_step"] = int(result.get("probe_after_step"))
-            except Exception:  # noqa: BLE001
-                pass
-            return parsed
-    except Exception:  # noqa: BLE001
-        return None
-    return None
--- a/skillopt/optimizer/meta_reflect.py
+++ b/skillopt/optimizer/meta_reflect.py
@@ -1,198 +0,0 @@
-"""ReflACT Meta-Reflect — epoch-level skill refinement with momentum.
-
-After each epoch, the meta-reflect stage reviews the epoch's step history
-(applied edits + gate scores) and performs high-level skill edits:
-merging redundant rules, removing ineffective ones, and distilling
-cross-step strategic patterns.
-
-This is analogous to momentum in neural network optimization:
-- Fast update (per step): analyst edits fix local issues from current batch
-- Slow update (per epoch): meta-reflect refines the skill based on what
-  worked and what didn't across the full epoch
-
-The meta-reflect also maintains a ``meta_summary`` — a compact memory
-passed between epochs that captures directional insights (which editing
-directions are effective, which are not). This is the "momentum buffer".
-
-Public API
----------
-- :func:`build_epoch_history`   — format an epoch's step records for meta-reflect
-- :func:`run_meta_reflect`      — one optimizer call to produce high-level edits + meta_summary
-"""
-from __future__ import annotations
-
-import json
-import os
-import traceback
-
-from skillopt.model import chat_optimizer
-from skillopt.optimizer.update_modes import (
-    describe_item,
-    get_payload_items,
-    normalize_update_mode,
-    payload_label,
-    truncate_payload,
-)
-from skillopt.prompts import load_prompt
-from skillopt.utils import extract_json
-
-
-# ── Epoch history formatting ─────────────────────────────────────────────────
-
-
-def build_epoch_history(
-    epoch_step_records: list[dict],
-    out_root: str,
-    *,
-    update_mode: str = "patch",
-) -> str:
-    """Format an epoch's step records into text for the meta-reflect optimizer.
-
-    For each step, includes the exact edits applied (read from
-    ``ranked_edits.json``) and the gate evaluation result.
-
-    Parameters
-    ----------
-    epoch_step_records : list[dict]
-        Step record dicts from ``history.json`` belonging to this epoch.
-    out_root : str
-        Training output root directory (to locate ``ranked_edits.json``).
-
-    Returns
-    -------
-    str
-        Formatted epoch history text.
-    """
-    update_mode = normalize_update_mode(update_mode)
-    parts: list[str] = []
-    for rec in epoch_step_records:
-        step = rec["step"]
-        action = rec.get("action", "unknown")
-        gate_score = rec.get("selection_hard", rec.get("current_score", "?"))
-        best_score = rec.get("best_score", "?")
-
-        header = (
-            f"### Step {step} — "
-            f"gate: {gate_score}, {action.upper()}, "
-            f"best_so_far: {best_score}"
-        )
-
-        # Read the actual applied edits
-        ranked_path = os.path.join(
-            out_root, "steps", f"step_{step:04d}", "ranked_edits.json",
-        )
-        edits_text = ""
-        if os.path.exists(ranked_path):
-            try:
-                with open(ranked_path) as f:
-                    ranked = json.load(f)
-                edits = get_payload_items(ranked, update_mode)
-                if edits:
-                    lines = [f"Selected {payload_label(update_mode)}:"]
-                    for i, edit in enumerate(edits, 1):
-                        lines.append(f"  {i}. {describe_item(edit, update_mode, max_chars=220)}")
-                    edits_text = "\n".join(lines)
-                else:
-                    edits_text = f"Selected {payload_label(update_mode)}: (none)"
-            except Exception:
-                edits_text = f"Selected {payload_label(update_mode)}: (could not read)"
-        else:
-            # Step may have been skipped
-            if "skip" in action:
-                edits_text = f"Selected {payload_label(update_mode)}: (skipped)"
-            else:
-                edits_text = f"Selected {payload_label(update_mode)}: (file not found)"
-
-        parts.append(f"{header}\n{edits_text}")
-
-        # Append trajectory failure digest if available
-        digest_path = os.path.join(
-            out_root, "steps", f"step_{step:04d}", "trajectory_digest.json",
-        )
-        if os.path.exists(digest_path):
-            try:
-                with open(digest_path) as f:
-                    digest = json.load(f)
-                patterns = digest.get("failure_patterns", [])
-                if patterns:
-                    n_fail = digest.get("n_fail", "?")
-                    n_total = digest.get("n_total", "?")
-                    lines = [f"Failure patterns ({n_fail}/{n_total} tasks failed):"]
-                    for p in patterns:
-                        lines.append(
-                            f'  - "{p["pattern"]}" (×{p["count"]})'
-                        )
-                    parts[-1] += "\n" + "\n".join(lines)
-            except Exception:
-                pass
-
-    return "\n\n".join(parts)
-
-
-# ── Meta-reflect optimizer call ────────────────────────────────────────────────
-
-
-def run_meta_reflect(
-    skill_content: str,
-    epoch_history_text: str,
-    prev_meta_summary: str,
-    meta_edit_budget: int = 4,
-    *,
-    system_prompt: str | None = None,
-    update_mode: str = "patch",
-) -> dict | None:
-    """Run one meta-reflect optimizer call for an epoch.
-
-    Parameters
-    ----------
-    skill_content : str
-        Current skill document (after the epoch's fast updates).
-    epoch_history_text : str
-        Formatted epoch history from :func:`build_epoch_history`.
-    prev_meta_summary : str
-        Meta summary from the previous epoch ("" if first epoch).
-    meta_edit_budget : int
-        Maximum number of high-level edits.
-    system_prompt : str | None
-        Custom system prompt. ``None`` = use generic default.
-
-    Returns
-    -------
-    dict | None
-        Conforms to :class:`~skillopt.types.MetaReflectResult`:
-        ``"meta_summary"`` (str) and ``"patch"`` (:class:`~skillopt.types.Patch`
-        dict), or ``None`` on failure.
-    """
-    mode = normalize_update_mode(update_mode)
-    actual_system = system_prompt if system_prompt is not None else load_prompt(
-        "meta_reflect_rewrite" if mode == "rewrite_from_suggestions" else "meta_reflect"
-    )
-
-    prev_section = prev_meta_summary.strip() if prev_meta_summary else "(First epoch — no previous summary)"
-
-    user = (
-        f"## Previous Meta Summary\n{prev_section}\n\n"
-        f"## Current Skill Document\n{skill_content}\n\n"
-        f"## {payload_label(mode, title=True)} Budget\n"
-        f"Produce at most {meta_edit_budget} high-level {payload_label(mode)}.\n\n"
-        f"## This Epoch's Step History\n{epoch_history_text}"
-    )
-
-    try:
-        response, _ = chat_optimizer(
-            system=actual_system,
-            user=user,
-            max_completion_tokens=4096,
-            retries=3,
-            stage="meta_reflect",
-        )
-        result = extract_json(response)
-        if result and "patch" in result:
-            truncate_payload(result["patch"], meta_edit_budget, mode)
-            if "meta_summary" not in result:
-                result["meta_summary"] = ""
-            return result
-    except Exception:  # noqa: BLE001
-        traceback.print_exc()
-
-    return None
--- a/skillopt/prompts/deep_probe.md
+++ b/skillopt/prompts/deep_probe.md
@@ -1,34 +0,0 @@
-You are an expert diagnostic-probe designer for reflective skill learning.
-
-You will design one short diagnostic instruction to append to the target prompt
-for a handful of representative cases.
-
-The goal is to expose the target's current intermediate judgment state without
-substantially changing the current skill scaffold.
-
-## Hard Constraints
-1. Do NOT substantially change the target's existing scaffold.
-2. Do NOT prescribe a new multi-step solving procedure.
-3. Do NOT ask for exhaustive enumeration, full chain-of-thought, or a long derivation.
-4. Ask only for a minimal readout of signals already behind the target's current answer.
-5. Keep the diagnostic block brief and structured.
-6. The final answer must still be produced in <answer>...</answer>.
-7. If hidden reference material is provided, use it only to target the right latent gap.
-8. Never copy hidden reference content into the target-facing probe.
-
-## Good Probe Targets
-- top candidate and runner-up
-- decisive cue / decisive constraint
-- why a runner-up was rejected
-- counted unit / suspicious region / compared objects
-
-## Bad Probe Targets
-- full proof or full chain-of-thought
-- dumping every object, cell, or possibility
-- imposing a brand-new solving algorithm
-
-Respond ONLY with a valid JSON object:
-{
-  "reasoning": "<why this probe reveals the latent skill gap>",
-  "probe_instruction": "<the exact instruction text to append to the target prompt>"
-}
--- a/skillopt/prompts/deep_probe_codex.md
+++ b/skillopt/prompts/deep_probe_codex.md
@@ -1,35 +0,0 @@
-You are an expert diagnostic-probe designer for codex-executed target trajectories.
-
-You will be shown representative trajectories, the current target skill, the target's original prompt context, and numbered Codex trace steps.
-Some trajectories may also include a hidden Reference block. Use hidden reference only to identify the target's missing subgoal, theorem, evidence source, or decisive transformation. Do not reveal or paraphrase that reference directly to the target.
-
-Choose exactly one trajectory and one probe point. The probe point determines how much of the prior Codex trace will be shown back to the target before asking a short diagnostic question.
-
-## Hard Constraints
-1. Do NOT reveal or paraphrase hidden reference content to the target.
-2. Do NOT prescribe a new full solving procedure.
-3. Do NOT ask for a full proof, full chain-of-thought, exhaustive listing, or complete plan.
-4. Ask only for a short readout of the target's intermediate state that should already exist at that point.
-5. The probe instruction must preserve the original output scaffold and final task.
-6. The probe instruction should be ready to append directly to the target's prompt.
-
-## Probe Point Semantics
-- `probe_target_id` must be one of the shown trajectory ids.
-- `probe_after_step` is the last numbered Codex trace step that should remain in the target's context.
-- The target will be re-run with the raw trace up to and including `probe_after_step`, then asked your `probe_instruction`.
-- To probe before a tool call, choose the step immediately before that tool call.
-
-## Good Probe Targets
-- next theorem / subgoal / evidence source
-- strongest-vs-runner-up option distinction
-- decisive constraint or transformation
-- why a tempting alternative is being rejected
-- what code region / spreadsheet region / image cue / passage evidence matters next
-
-Respond ONLY with a valid JSON object:
-{
-  "reasoning": "<why this trajectory and probe point expose the target's intermediate state>",
-  "probe_target_id": "<trajectory id>",
-  "probe_after_step": <integer step number>,
-  "probe_instruction": "<the exact instruction text to append to the target's prompt>"
-}
--- a/skillopt/prompts/meta_reflect.md
+++ b/skillopt/prompts/meta_reflect.md
@@ -1,63 +0,0 @@
-You are a meta-analyst for an AI agent skill optimization system.
-
-Your role is fundamentally different from the per-step analyst:
-- The per-step analyst sees agent trajectories and proposes local fixes.
-- YOU see the results of multiple optimization steps and refine the skill
-  at a higher level, based on what actually worked and what didn't.
-
-You are the ONLY component that has access to the edit-to-outcome causal link:
-you can see exactly which edits were applied and whether they improved or
-degraded performance. Use this unique vantage point.
-
-## What You Receive
-
-1. **Previous Meta Summary** (empty for the first epoch): a compact memory
-   from the last epoch capturing directional insights.
-2. **Current Skill Document**: the skill as it stands after this epoch.
-3. **This Epoch's Step History**: for each step, the exact edits applied,
-   the gate score, and whether the update was accepted or rejected.
-
-## What You Produce
-
-1. **High-level edits** to the skill document:
-   - Merge redundant or overlapping rules that accumulated across steps
-   - Remove or revise rules associated with rejected steps (score drops)
-   - Strengthen or generalize rules associated with accepted steps (score gains)
-   - Reorganize for clarity if the document has become cluttered
-   - Add strategic-level insights that no single step could produce
-
-2. **Meta summary**: a compact summary of this epoch's key findings, to be
-   passed as context to the next epoch's meta-reflect. This should capture:
-   - Which editing directions proved effective (and why)
-   - Which directions proved harmful (and why)
-   - Current bottlenecks or areas of the skill that need attention
-   - Trends across steps (e.g., "scores plateau after step 2")
-
-## Guidelines
-
-- Your edits modify the SAME skill document that per-step edits modify.
-  There is no separate section — you operate on the full skill.
-- Be conservative: the per-step process already optimized locally.
-  Your job is refinement, not revolution.
-- Focus on edits that require cross-step perspective (merging, pruning,
-  pattern extraction). Don't duplicate what per-step analysts already do.
-- The meta_summary should be concise (under 200 words). It is NOT written
-  into the skill — it is only passed to the next meta-reflect call.
-
-You will be told the maximum number of edits (the budget). Produce AT MOST
-that many edits. You may produce fewer or zero if the skill is already clean.
-
-Respond ONLY with a valid JSON object (no markdown fences, no extra text):
-{
-  "meta_summary": "<compact summary of this epoch's findings for next epoch>",
-  "patch": {
-    "reasoning": "<why these high-level edits improve the skill>",
-    "edits": [
-      {"op": "append",       "content": "<markdown to add>"},
-      {"op": "insert_after", "target": "<exact text>", "content": "<markdown>"},
-      {"op": "replace",      "target": "<exact old text>", "content": "<new text>"},
-      {"op": "delete",       "target": "<exact text to remove>"}
-    ]
-  }
-}
-"edits" may be empty if no refinement is warranted.
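The four edit operations and the edit budget described in the removed prompt above can be sketched as a small patch applier. This is an illustrative sketch under stated assumptions, not the repository's implementation: the name `apply_edits`, first-occurrence matching, and silently skipping edits whose `target` is absent are all hypothetical choices.

```python
from typing import Dict, List


def apply_edits(skill: str, edits: List[Dict[str, str]], budget: int) -> str:
    """Hypothetical helper: apply at most `budget` edits, in order, to a skill document."""
    for edit in edits[:budget]:
        op = edit["op"]
        content = edit.get("content", "")
        target = edit.get("target", "")
        if op == "append":
            skill = skill.rstrip("\n") + "\n" + content + "\n"
        elif op in ("insert_after", "replace", "delete"):
            if not target or target not in skill:
                continue  # assumption: skip edits whose exact target text is missing
            if op == "insert_after":
                skill = skill.replace(target, target + "\n" + content, 1)
            elif op == "replace":
                skill = skill.replace(target, content, 1)
            else:  # delete
                skill = skill.replace(target, "", 1)
        else:
            raise ValueError(f"unknown op: {op!r}")
    return skill
```

Matching on exact text mirrors the `<exact text>` placeholders in the schema; a real system might instead reject the whole patch when any target fails to match.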
--- a/skillopt/prompts/meta_reflect_rewrite.md
+++ b/skillopt/prompts/meta_reflect_rewrite.md
@@ -1,28 +0,0 @@
-You are a meta-analyst for an AI agent skill optimization system.
-
-You see the current skill and an epoch's step history. Produce a compact set of
-high-level revise_suggestions that a later optimizer can use to rewrite the full skill.
-
-Focus on:
-- merging redundant rules
-- removing low-value or harmful guidance
-- extracting cross-step strategic patterns
-- reorganizing the skill for clarity
-- compressing clutter without losing proven behavior
-
-Respond ONLY with a valid JSON object:
-{
-  "meta_summary": "<compact summary for next epoch>",
-  "patch": {
-    "reasoning": "<why these suggestions improve the skill>",
-    "revise_suggestions": [
-      {
-        "type": "add_rule|remove_rule|merge_rules|reorganize|compress|clarify",
-        "title": "<short title>",
-        "motivation": "<why this matters>",
-        "instruction": "<what the rewriting optimizer should change in the skill>",
-        "priority_hint": "high|medium|low"
-      }
-    ]
-  }
-}
@@ -1 +0,0 @@
-"""BabyVision environment package for ReflACT."""
@@ -1 +0,0 @@
-"""SealQA environment package for ReflACT."""