From 8acc2dd03e46483ef553f870b88d764687cbf16d Mon Sep 17 00:00:00 2001 From: Cuzyoung Date: Sun, 31 May 2026 09:01:25 +0000 Subject: [PATCH] docs: add self-contained reproduction & usage guideline page Add docs/guideline.html, a single self-contained documentation guide (left-nav + content + on-this-page TOC) covering installation, data preparation, training/eval, full configuration reference, framework internals, and an API reference. Link it from the README with local, htmlpreview, and GitHub Pages access instructions. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- README.md | 23 ++ docs/guideline.html | 911 ++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 934 insertions(+) create mode 100644 docs/guideline.html diff --git a/README.md b/README.md index fb9e003..72a7d5f 100644 --- a/README.md +++ b/README.md @@ -14,6 +14,29 @@ https://github.com/user-attachments/assets/eb12d3bc-371c-467f-904d-91b61f339ed7 --- +## Documentation + +A complete, self-contained **Documentation & Reproduction Guide** lives at +[`docs/guideline.html`](docs/guideline.html). It covers installation, data +preparation, training/eval commands, the full configuration reference, the +framework internals (training loop, validation gate, slow update, meta skill), +and an API/function reference — all in a single page with a left navigation +sidebar. + +Because GitHub shows raw source for `.html` files instead of rendering them, +open the guide one of these ways: + +- **Locally** — clone the repo and open `docs/guideline.html` in any browser + (no build step required). 
+- **Rendered online (no setup)** — via the htmlpreview proxy: + [`htmlpreview.github.io/?…/docs/guideline.html`](https://htmlpreview.github.io/?https://github.com/microsoft/SkillOpt/blob/main/docs/guideline.html) +- **GitHub Pages** — the repository's GitHub Pages site already serves the + project homepage from the repo root, so the guide is reachable alongside it at + `https://microsoft.github.io/SkillOpt/docs/guideline.html` (the homepage at + `https://microsoft.github.io/SkillOpt/` is unaffected). + +--- + ## Install **Requirements:** Python 3.10+ diff --git a/docs/guideline.html b/docs/guideline.html new file mode 100644 index 0000000..439fc55 --- /dev/null +++ b/docs/guideline.html @@ -0,0 +1,911 @@ + + + + + +SkillOpt — Documentation & Reproduction Guide + + + + + + +
+ + + SkillOpt + Documentation & Reproduction Guide + + GitHub ↗ + Paper ↗ +
+ +
+ + + + + +
+ + Microsoft Research +

SkillOpt Documentation & Reproduction Guide

+

Train agent skills like you train neural networks — with epochs, (mini-)batch size, learning rates, and validation gates — but without touching any model weights.

+

This guide walks you from a clean checkout to a reproduced result and a full reference for every configuration knob and core function. It is generated from, and kept consistent with, the current state of the codebase.

+ + +
+

1.1 What is SkillOpt #

+

SkillOpt is a text-space optimizer that improves a frozen language agent by iteratively editing a natural-language skill document — never the model weights. The skill document is a Markdown file that conditions a target model as it executes tasks. SkillOpt treats this document as the "weights" and runs a training loop that mirrors deep-learning training: rollout (forward pass), reflect (backward pass / gradients), select & apply edits (optimizer step), and a validation gate (accept/reject).

+

Two roles split every model call:

+
+
  • Target: executes tasks using the current skill document (the agent being improved).
  • Optimizer: analyzes the target's trajectories and proposes edits to the skill document.
+

The same loop drives six benchmarks out of the box (QA, document QA, embodied agents, math, spreadsheet code generation, and tool-augmented QA).

+
+ +
+

1.2 Deep-Learning ↔ SkillOpt Analogy #

+

Every concept below maps to a concrete code construct, so deep-learning intuitions transfer directly to hyperparameter tuning.

+
+ + + + + + + + + + + + + + + + + + + +
Deep learning | SkillOpt | Where it lives
Model weights | Skill document (Markdown) | skillopt/optimizer/skill.py
Forward pass | Rollout: target runs tasks | envs/<bench>/rollout.py
Loss / score | Task evaluator | envs/<bench>/evaluator.py
Backprop / gradients | Reflect → edit patches | gradient/reflect.py
Gradient aggregation | Hierarchical patch merge | gradient/aggregate.py
Gradient clipping | Rank & select top-k edits | optimizer/clip.py
Learning rate | optimizer.learning_rate (edits/step) | optimizer/scheduler.py
LR scheduler | lr_scheduler (cosine/linear/…) | optimizer/scheduler.py
Optimizer step | Apply patches to the document | optimizer/skill.py
Validation set | Selection split (valid_seen) | evaluation/gate.py
Early stopping / accept | Validation gate | evaluation/gate.py
Momentum | Slow update (epoch boundary) | optimizer/slow_update.py
Meta-learning | Meta skill (cross-epoch memory) | optimizer/meta_skill.py
Batch / minibatch | batch_size / minibatch_size | engine/trainer.py
Epoch | Epoch (+ slow update & meta skill) | engine/trainer.py
+
+
What transfers from DL +

A cosine schedule tends to beat a constant one; moderate learning rates (≈4–16 edits/step) beat very high or very low ones; slow update curbs cross-epoch forgetting; meta-skill memory improves reflection quality. Conversely, bigger rollout batches and more epochs show diminishing returns: skills converge in ~2–4 epochs.

+
+
+ +
+

1.3 Key Features #

+
+

Validation gating

Every candidate skill is scored on a held-out selection split and accepted only if it beats the current skill; it becomes the new best only if it also beats the best skill so far.

+

Slow update

An epoch-boundary longitudinal comparison writes guidance into a protected region, acting as momentum against forgetting. The guidance is either force-injected or selection-gated.

+

Meta skill

Optimizer-side memory that reflects on what worked across epochs and feeds back into reflection.

+

Pluggable backends

OpenAI / Azure OpenAI, Anthropic Claude, local Qwen (vLLM), plus Codex/Claude-Code exec backends for the target.

+

Six benchmarks

SearchQA, DocVQA, ALFWorld, LiveMathematicianBench, SpreadsheetBench, OfficeQA — each a self-contained env module.

+

Auto-resume

Every run is checkpointed step-by-step; re-running the same command continues from the last completed step.

+
+
+ +
+

1.4 Repository Layout #

+
# top level
+configs/            # YAML configs (_base_ + per-benchmark)
+scripts/            # train.py, eval_only.py CLIs
+ckpt/               # packaged reference skills (e.g. gpt5.5_skill.md)
+docs/               # this guide + mkdocs sources
+skillopt/           # the package
+ ├─ config.py        # YAML loading, _base_ inheritance, flatten
+ ├─ engine/trainer.py # the training loop (ReflACTTrainer)
+ ├─ gradient/        # reflect.py (analyst), aggregate.py (merge)
+ ├─ optimizer/       # skill edits, scheduler, clip, slow_update, meta_skill
+ ├─ evaluation/      # gate.py (accept/reject logic)
+ ├─ model/           # backend clients + routing
+ └─ envs/<benchmark>/ # adapter, dataloader, rollout, evaluator, reflect
+
+ + +
+

2.1 Requirements #

+
+
  • Python ≥ 3.10
  • Credentials for at least one model backend (Azure OpenAI, OpenAI-compatible, Anthropic, or a local Qwen server)
  • Benchmark datasets are not bundled; prepare your own splits (see §3)
+
+ +
+

2.2 Install the Package #

+
git clone https://github.com/microsoft/SkillOpt.git
+cd SkillOpt
+pip install -e .
+
+# Optional extras (install only what you need):
+pip install -e ".[alfworld]"   # ALFWorld benchmark
+pip install -e ".[claude]"     # Anthropic Claude backend
+pip install -e ".[qwen]"       # local Qwen backend
+pip install -e ".[webui]"      # monitoring dashboard
+
+# ALFWorld also needs its data assets:
+alfworld-download
+
+ +
+

2.3 Configure Credentials #

+

Copy the template and fill in whichever backend you will use:

+
cp .env.example .env
+# edit .env, then:
+set -a; source .env; set +a
+
One env-var family for all OpenAI modes +

SkillOpt reuses the AZURE_OPENAI_* variable names even for plain OpenAI — there is no separate OPENAI_API_KEY knob. AZURE_OPENAI_ENDPOINT is required for every OpenAI auth mode.

+
+

Azure OpenAI (default)

+
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
+export AZURE_OPENAI_API_VERSION="2024-12-01-preview"
+# Auth option 1 — API key:
+export AZURE_OPENAI_API_KEY="your-key"
+# Auth option 2 — Azure CLI (no key; recommended on Azure VMs):
+export AZURE_OPENAI_AUTH_MODE=azure_cli
+# Auth option 3 — Managed Identity:
+export AZURE_OPENAI_AUTH_MODE=managed_identity
+export AZURE_OPENAI_MANAGED_IDENTITY_CLIENT_ID="your-client-id"
+

OpenAI-compatible endpoint

+
export AZURE_OPENAI_ENDPOINT="https://api.openai.com/v1"
+export AZURE_OPENAI_API_KEY="sk-..."
+export AZURE_OPENAI_AUTH_MODE=openai_compatible
+
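Before launching a run, a small preflight check on the variables above can catch a missing endpoint or key early. This is a minimal sketch, not part of SkillOpt: the helper name `missing_openai_env` is hypothetical, and it assumes that an unset `AZURE_OPENAI_AUTH_MODE` means API-key auth and that only the `api_key` and `openai_compatible` modes need `AZURE_OPENAI_API_KEY`.

```python
import os


def missing_openai_env(env=None):
    """Return the names of required variables that are unset or empty.

    Hypothetical helper. Assumes an unset AZURE_OPENAI_AUTH_MODE means
    API-key auth, and that only key-based modes need AZURE_OPENAI_API_KEY.
    """
    env = os.environ if env is None else env
    missing = []
    # The guide states the endpoint is required for every OpenAI auth mode.
    if not env.get("AZURE_OPENAI_ENDPOINT"):
        missing.append("AZURE_OPENAI_ENDPOINT")
    mode = env.get("AZURE_OPENAI_AUTH_MODE") or "api_key"  # assumed default
    if mode in ("api_key", "openai_compatible") and not env.get("AZURE_OPENAI_API_KEY"):
        missing.append("AZURE_OPENAI_API_KEY")
    return missing


if __name__ == "__main__":
    problems = missing_openai_env()
    print("missing:", problems if problems else "nothing")
```

Passing a plain dict instead of `os.environ` keeps the check easy to test without touching the real environment.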

Anthropic Claude / local Qwen

+
export ANTHROPIC_API_KEY="sk-ant-..."          # claude_chat backend
+
+export QWEN_CHAT_BASE_URL="http://localhost:8000/v1" # local vLLM
+export QWEN_CHAT_MODEL="Qwen/Qwen3.5-4B"
+
+ +
+

2.4 Verify Installation #

+
python -c "import skillopt; print('SkillOpt ready!')"
+
+ + +
+

3.1 Split Directory Format #

+

With env.split_mode: split_dir (the recommended, deterministic mode), SkillOpt reads a directory containing train/, val/, and test/ subfolders, each holding an items.json file with a JSON array of task items:

+
data/my_split/
+ ├─ train/items.json   # used for rollout (the "train split")
+ ├─ val/items.json     # selection split → validation gate (valid_seen)
+ └─ test/items.json    # held-out final eval (valid_unseen)
+
Split naming +

Internally the splits are referred to as train, valid_seen (validation/selection), and valid_unseen (test). The --split flag of eval_only.py uses these names.

+
+
+ +
+

3.2 Item JSON Schema #

+

Required fields depend on the benchmark; consult skillopt/envs/<benchmark>/dataloader.py for the exact contract. A SearchQA item, for example:

+
[
+  {
+    "id":       "unique_item_id",
+    "question": "Who wrote the novel ...",
+    "context":  "[DOC] relevant passage text ...",
+    "answers":  ["expected answer"]
+  }
+]
+
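A quick way to sanity-check a split directory before training is to load each items.json and verify the fields from the example above. This is only a sketch: `load_split_dir` and `REQUIRED_FIELDS` are hypothetical names, the field list is assumed from the SearchQA example, and the real contract lives in each benchmark's dataloader.py.

```python
import json
from pathlib import Path

# Assumed from the SearchQA example above; other benchmarks may differ.
REQUIRED_FIELDS = ("id", "question", "context", "answers")
SPLITS = ("train", "val", "test")


def load_split_dir(split_dir):
    """Load <split>/items.json for each split and check the assumed fields."""
    root = Path(split_dir)
    splits = {}
    for name in SPLITS:
        path = root / name / "items.json"
        with path.open(encoding="utf-8") as fh:
            items = json.load(fh)
        if not isinstance(items, list):
            raise ValueError(f"{path} must contain a JSON array")
        for index, item in enumerate(items):
            missing = [key for key in REQUIRED_FIELDS if key not in item]
            if missing:
                raise ValueError(f"{path} item {index} is missing {missing}")
        splits[name] = items
    return splits
```

Failing fast here is cheaper than discovering a malformed item partway through a rollout.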
Datasets not included +

This repository ships no benchmark data. Prepare your own splits in the format above before training.

+
+
+ +
+

3.3 Split Modes #

+
+ + + + + +
env.split_mode | Behavior
split_dir | Use a pre-built directory with explicit train/val/test folders (set env.split_dir). Deterministic and reproducible.
ratio | Build a deterministic split on the fly from a single env.data_path, using split_seed (and a train:val:test ratio). Convenient for quick experiments.
+
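The ratio mode can be pictured as a seeded shuffle followed by proportional slicing. The sketch below illustrates that idea only; `ratio_split` is a hypothetical function, the 8:1:1 ratio is an assumed example, and the actual splitting code in SkillOpt may differ.

```python
import random


def ratio_split(items, ratio=(8, 1, 1), seed=42):
    """Deterministically split items into train/val/test by a ratio.

    Hypothetical sketch: the 8:1:1 default is an assumption; seed=42
    mirrors the documented split_seed default.
    """
    if len(ratio) != 3 or min(ratio) < 0 or sum(ratio) <= 0:
        raise ValueError("ratio must be three non-negative numbers with a positive sum")
    shuffled = list(items)
    # A private Random instance keeps the split independent of global state.
    random.Random(seed).shuffle(shuffled)
    total = sum(ratio)
    n_train = len(shuffled) * ratio[0] // total
    n_val = len(shuffled) * ratio[1] // total
    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],  # remainder goes to test
    }
```

Because the shuffle depends only on the seed and the input order, re-running with the same data and seed reproduces the same split.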
+ + +
+

4.1 Train a Skill #

+
# Minimal SearchQA run
+python scripts/train.py \
+    --config configs/searchqa/default.yaml \
+    --split_dir /path/to/your/searchqa_split \
+    --azure_openai_endpoint https://your-resource.openai.azure.com/ \
+    --optimizer_model gpt-5.5 \
+    --target_model gpt-5.5
+

Swap the config for another benchmark (e.g. configs/livemathematicianbench/default.yaml, configs/alfworld/default.yaml). Common CLI arguments:

+
+ + + + + + + + + + +
Argument | Description
--config | Benchmark config YAML (required)
--split_dir | Path to the data split directory
--azure_openai_endpoint | Azure OpenAI endpoint URL
--optimizer_model / --target_model | Deployment names for optimizer / target
--num_epochs / --batch_size | Epochs and rollout batch size
--out_root | Output directory
--cfg-options k=v ... | Override any config key (see §6.1)
+
+ +
+

4.2 Evaluate a Skill #

+

Evaluate any skill document (a packaged reference skill, or a trained run's best_skill.md) without training:

+
# Evaluate the packaged GPT-5.5 SearchQA skill on the test split
+python scripts/eval_only.py \
+  --config configs/searchqa/default.yaml \
+  --skill ckpt/searchqa/gpt5.5_skill.md \
+  --split valid_unseen \
+  --split_dir /path/to/searchqa_split \
+  --azure_openai_endpoint https://your-resource.openai.azure.com/
+
+ + + + + + + +
--split | Meaning
valid_unseen | Test set (held-out)
valid_seen | Validation / selection set
train | Training set
all | All splits combined (default)
+
+ +
+

4.3 Output Structure #

+
outputs/<run_name>/
+ ├─ config.json          # flattened runtime config
+ ├─ history.json         # per-step training history
+ ├─ runtime_state.json   # resume checkpoint
+ ├─ best_skill.md        # best validated skill document
+ ├─ skills/skill_vXXXX.md # skill snapshot per step
+ ├─ steps/step_XXXX/     # per-step artifacts (patches, evals)
+ ├─ slow_update/epoch_XX/ # slow-update logs & rollouts
+ └─ meta_skill/epoch_XX/ # meta-skill logs
+
+ +
+

4.4 Auto-Resume #

+

Each completed step persists its state to runtime_state.json and a steps/step_XXXX/ directory. Re-running the same command against the same out_root detects finished work and continues from the last completed step — including epoch-boundary slow-update and meta-skill stages.
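To see how far a run has progressed, one option is to inspect the steps/ directory. The sketch below only scans for the highest step_XXXX folder; it is a hypothetical helper (`last_step_dir`), and it does not read runtime_state.json, whose schema is not documented here and which the trainer itself relies on for resuming.

```python
import re
from pathlib import Path

_STEP_DIR = re.compile(r"^step_(\d+)$")


def last_step_dir(run_dir):
    """Return the highest step number found under <run_dir>/steps, or None.

    Hypothetical inspection helper; a step directory existing is only
    assumed to indicate progress, not guaranteed completion.
    """
    steps_root = Path(run_dir) / "steps"
    if not steps_root.is_dir():
        return None
    numbers = [
        int(match.group(1))
        for path in steps_root.iterdir()
        if path.is_dir() and (match := _STEP_DIR.match(path.name))
    ]
    return max(numbers, default=None)
```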

+
+ + +
+

5.1 The Training Loop #

+

The loop lives in ReflACTTrainer (skillopt/engine/trainer.py). Each epoch runs a series of optimization steps over rollout batches, then performs two epoch-boundary stages.

+
for epoch in epochs:
+    for step in steps:
+        1. Rollout    # target executes a batch of tasks
+        2. Reflect    # optimizer analyzes trajectories → edit patches
+        3. Aggregate  # hierarchically merge similar patches
+        4. Select     # rank & clip edits to the learning rate
+        5. Update     # apply patches → candidate skill
+        6. Gate       # score on selection split → accept / reject
+
+    # epoch boundary (from epoch 2 onward)
+    Slow update   # longitudinal comparison → protected guidance
+    Meta skill    # cross-epoch optimizer memory
+
+ +
+

5.2 The Six Per-Step Stages #

+
+ + + + + + + + + +
Stage | What happens | Source
1. Rollout | The target model runs each task in the batch with the current skill as context, producing trajectories and scores. | envs/<b>/rollout.py
2. Reflect | The optimizer runs an error analyst (and optional success analyst) over minibatches of trajectories, emitting structured edit patches. Runs in parallel across analyst_workers. | gradient/reflect.py
3. Aggregate | Semantically similar patches are merged hierarchically to remove redundancy. | gradient/aggregate.py (merge_patches)
4. Select | Patches are ranked and clipped to the current learning rate (max edits this step), set by the scheduler. | optimizer/clip.py (rank_and_select)
5. Update | Selected edits are applied to the skill document, producing a candidate skill (patch / rewrite modes). | optimizer/skill.py, update_modes.py
6. Gate | The candidate is scored on the selection split and accepted only if it improves (see §5.3). | evaluation/gate.py (evaluate_gate)
+
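The Select stage in the table above amounts to top-k clipping: rank the patches, then keep at most the current edit budget. A minimal sketch follows, assuming each patch carries a numeric `score`; that field and the function name `select_top_k` are hypothetical, and the real ranking in optimizer/clip.py may rely on the optimizer model rather than a stored number.

```python
def select_top_k(patches, budget):
    """Keep at most `budget` patches, highest assumed score first.

    Hypothetical sketch of rank-and-clip; ties keep their input order
    because Python's sort is stable.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    ranked = sorted(patches, key=lambda patch: patch["score"], reverse=True)
    return ranked[:budget]


if __name__ == "__main__":
    candidates = [
        {"id": "a", "score": 0.2},
        {"id": "b", "score": 0.9},
        {"id": "c", "score": 0.5},
    ]
    print([patch["id"] for patch in select_top_k(candidates, budget=2)])
```

With a learning rate of 4, for example, at most four edits would reach the Update stage no matter how many patches reflection produced.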
+ +
+

5.3 Validation Gate #

+

evaluate_gate is a pure decision function. It compares the candidate's selection-set score against the current and best skills:

+
+
  • accept_new_best: candidate > current and candidate > best → becomes both current and best.
  • accept: candidate > current but ≤ best → becomes current only.
  • reject: candidate ≤ current → discarded; current/best unchanged.
+
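The three outcomes above reduce to two comparisons. The sketch below restates them as a pure function for illustration; `gate_decision` is a hypothetical name and is not the signature of evaluate_gate.

```python
def gate_decision(candidate, current, best):
    """Map selection-set scores to one of the three documented outcomes."""
    if candidate > current and candidate > best:
        return "accept_new_best"  # becomes both current and best
    if candidate > current:
        return "accept"  # becomes current only
    return "reject"  # candidate <= current: discarded


if __name__ == "__main__":
    print(gate_decision(candidate=0.62, current=0.58, best=0.60))
    print(gate_decision(candidate=0.59, current=0.58, best=0.60))
    print(gate_decision(candidate=0.58, current=0.58, best=0.60))
```

Note that a tie with the current skill is a rejection, since acceptance requires a strict improvement.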

The comparison metric is configurable via evaluation.gate_metric:

+
+ + + + + + +
Metric | Score used
hard (default) | Exact-match / discrete score
soft | Partial-credit / continuous score
mixed | Weighted blend, controlled by gate_mixed_weight
+
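One plausible reading of the three metrics is sketched below. It assumes the mixed score is a convex combination with gate_mixed_weight applied to the soft score; the function name `gate_score` is hypothetical, and the exact formula used by select_gate_score is not confirmed here.

```python
def gate_score(hard, soft, metric="hard", mixed_weight=0.5):
    """Pick the score the gate compares, under an assumed blend formula."""
    if metric == "hard":
        return hard
    if metric == "soft":
        return soft
    if metric == "mixed":
        if not 0.0 <= mixed_weight <= 1.0:
            raise ValueError("mixed_weight must be in [0, 1]")
        # Assumption: weight on soft, remainder on hard.
        return mixed_weight * soft + (1.0 - mixed_weight) * hard
    raise ValueError(f"unknown gate metric: {metric!r}")
```

Under this reading, a weight of 0 reduces mixed to hard and a weight of 1 reduces it to soft.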
When to use soft/mixed +

The soft/mixed metrics (contributed config configs/examples/soft_gate.yaml) help when the selection split is small and rewards are continuous, where a discrete hard gate may reject every candidate and stall training. Paper numbers use the default hard gate.

+
+
+ +
+

5.4 Slow Update (Momentum) #

+

At each epoch boundary (from epoch 2), the slow update rolls out both the previous epoch's skill and the current skill on the same sampled tasks, categorizes items (improved / regressed / persistent-fail / stable-success), and asks the optimizer to write a free-form guidance block. This guidance lands in a protected region of the skill that step-level edits cannot touch — only the slow update overwrites it. It is SkillOpt's analogue of momentum, countering cross-epoch forgetting.
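The four categories follow directly from comparing per-item success under the previous and current skills. A minimal sketch, assuming each rollout yields a boolean success per item id; `categorize` and `bucket_items` are hypothetical names, and the real slow update may use richer scores.

```python
def categorize(prev_success, curr_success):
    """Classify one item from its success under the previous and current skill."""
    if curr_success and not prev_success:
        return "improved"
    if prev_success and not curr_success:
        return "regressed"
    if not prev_success and not curr_success:
        return "persistent_fail"
    return "stable_success"


def bucket_items(prev_results, curr_results):
    """Group item ids by category; inputs map item id -> bool success."""
    buckets = {"improved": [], "regressed": [], "persistent_fail": [], "stable_success": []}
    for item_id in sorted(prev_results.keys() & curr_results.keys()):
        label = categorize(prev_results[item_id], curr_results[item_id])
        buckets[label].append(item_id)
    return buckets
```

Only items rolled out under both skills are compared, which matches the "same sampled tasks" requirement in the description above.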

+

Acceptance has two modes, selected by optimizer.slow_update_gate_with_selection:

+
+ + + + + +
Mode | Behavior
false (default): force-injected | Guidance is injected into both current and best skills unconditionally. The longitudinal guidance always persists; it is not gated by step-level selection scores.
true: gated | The slow-update candidate is scored on the selection split and accepted/rejected through the same validation gate as step-level updates.
+
+ +
+

5.5 Meta Skill (Optimizer Memory) #

+

The meta skill is optimizer-side memory — it never modifies the target skill document. At the end of each epoch (skipped for epoch 1), the optimizer compares the previous and current epoch's last-step skills on the same sampled tasks and writes a compact, evidence-based reflection on what kind of edits helped or hurt. That memory is then injected as extra context into the next epoch's reflect / merge / learning-rate / ranking stages, so the optimizer accumulates strategy across the run.

+
+ +
+

5.6 Skill Document Anatomy #

+

A skill document is plain Markdown. Initial skills can be empty (learn from scratch) or seeded with domain knowledge via env.skill_init. During training the document accrues rules, patterns, and edge-case handling through accepted edit patches. A dedicated protected region holds the slow-update guidance, delimited by HTML-comment markers:

+
# Question Answering Skill
+
+## Learned rules ...
+- When the context contains multiple candidates, prefer ...
+
+<!-- SLOW_UPDATE_START -->
+# (epoch-level longitudinal guidance; only the slow update writes here)
+<!-- SLOW_UPDATE_END -->
+

Helpers in optimizer/slow_update.py manage this region: inject_empty_slow_update_field (placeholder at epoch 1), extract_slow_update_field (read), and replace_slow_update_field (overwrite). Step-level edits are blocked from modifying anything inside the markers.
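Reading and overwriting a marker-delimited region can be done with a single regular expression. The sketch below shows the general technique only; `read_region` and `write_region` are hypothetical stand-ins and are not the signatures of the helpers in optimizer/slow_update.py.

```python
import re

START = "<!-- SLOW_UPDATE_START -->"
END = "<!-- SLOW_UPDATE_END -->"
# Non-greedy match across newlines between the two markers.
_REGION = re.compile(re.escape(START) + r"(.*?)" + re.escape(END), re.DOTALL)


def read_region(skill_text):
    """Return the guidance inside the markers, or None if they are absent."""
    match = _REGION.search(skill_text)
    return match.group(1).strip() if match else None


def write_region(skill_text, guidance):
    """Overwrite the protected region, appending it if it does not exist yet."""
    block = f"{START}\n{guidance.strip()}\n{END}"
    if _REGION.search(skill_text):
        # A callable replacement avoids backslash-escape surprises in guidance.
        return _REGION.sub(lambda _match: block, skill_text, count=1)
    return skill_text.rstrip("\n") + "\n\n" + block + "\n"
```

A step-level edit guard could use the same pattern to reject any patch whose target text overlaps the matched span.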

+
+ + +
+

6.1 Configuration System #

+

Configs are structured YAML with section blocks (model, train, gradient, optimizer, evaluation, env) and _base_ inheritance. A benchmark config inherits the shared defaults and overrides only what differs:

+
# configs/searchqa/default.yaml
+_base_: ../_base_/default.yaml
+train:
+  train_size: 400
+  batch_size: 40
+optimizer:
+  learning_rate: 4
+env:
+  name: searchqa
+  split_dir: data/searchqa_split
+

Override any key at the command line without editing files:

+
python scripts/train.py --config configs/searchqa/default.yaml \
+  --cfg-options optimizer.learning_rate=16 optimizer.lr_scheduler=linear
+
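The inheritance and override behavior described above can be pictured as a recursive dictionary merge followed by dotted-key assignment. This sketch works on plain dicts (as if the YAML were already parsed); `merge_dicts` and `set_dotted_overrides` are hypothetical names, not the functions in skillopt/config.py, and parsing override values as JSON literals is an assumption.

```python
import copy
import json


def merge_dicts(base, child):
    """Recursively merge child over base without mutating either input."""
    merged = dict(base)
    for key, value in child.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_dicts(merged[key], value)
        else:
            merged[key] = value
    return merged


def parse_value(text):
    """Interpret '16' as int, 'true' as bool, and fall back to the raw string."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return text


def set_dotted_overrides(config, overrides):
    """Apply 'section.key=value' strings to a deep copy of config."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted, _, raw = item.partition("=")
        *parents, leaf = dotted.split(".")
        node = result
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = parse_value(raw)
    return result
```

For example, merging a base config with a benchmark config and then applying `["optimizer.learning_rate=16", "optimizer.lr_scheduler=linear"]` changes only those two keys.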
Reading the tables below +

Each section lists the key (relative to its YAML block), type, default (from configs/_base_/default.yaml), allowed values, and meaning. Defaults shown are the shipped base defaults.

+
+
+ +
+

6.2 model.* #

+
+ + + + + + + + + + + + + + +
Key | Type | Default | Description / options
backend | str | azure_openai | High-level backend label for the run.
optimizer | str | gpt-5.5 | Optimizer model deployment (writes skill edits).
target | str | gpt-5.5 | Target model deployment (executes tasks).
optimizer_backend | str | openai_chat | Client path for the optimizer: openai_chat or claude_chat.
target_backend | str | openai_chat | Client path for the target: openai_chat / claude_chat / qwen_chat / codex_exec / claude_code_exec.
reasoning_effort | str | medium | low / medium / high / xhigh / max (or empty).
rewrite_reasoning_effort | str | "" | Override effort for full-rewrite calls (empty = inherit).
rewrite_max_completion_tokens | int | 64000 | Token cap for full-rewrite optimizer calls.
azure_openai_endpoint | str | "" | Azure resource URL (or via AZURE_OPENAI_ENDPOINT).
azure_openai_api_version | str | 2024-12-01-preview | Azure API version header.
azure_openai_auth_mode | str | "" | api_key / azure_cli / managed_identity / openai_compatible (empty → env default).
+
Separate optimizer / target endpoints +

Every azure_openai_* key also has optimizer_azure_openai_* and target_azure_openai_* variants, letting you point the optimizer and target at different Azure resources. Exec backends (codex_exec, claude_code_exec) add their own codex_exec_* / claude_code_exec_* knobs (sandbox, reasoning effort, SDK mode, etc.).

+
+
+ +
+

6.3 train.* #

+
+ + + + + + + + +
Key | Type | Default | DL analogy | Description
num_epochs | int | 4 | Epochs | Number of training epochs.
train_size | int | 0 | Train-set size | 0 = derive from the dataset split. (Fixed by split size when using split_dir.)
batch_size | int | 40 | Batch size | Tasks rolled out per optimization step.
accumulation | int | 1 | Grad accumulation | Accumulation rounds per step.
seed | int | 42 | Random seed | Reproducibility seed.
+
+ +
+

6.4 gradient.* #

+
+ + + + + + + + +
Key | Type | Default | Description
minibatch_size | int | 8 | Trajectories per reflect minibatch.
merge_batch_size | int | 8 | Patches per merge batch during aggregation.
analyst_workers | int | 16 | Parallel reflection workers (data parallelism).
max_analyst_rounds | int | 3 | Max rounds of analyst reflection per step.
failure_only | bool | false | Reflect only on failed trajectories when true.
+
+ +
+

6.5 optimizer.* #

+
+ + + + + + + + + + + + + +
Key | Type | Default | DL analogy | Description / options
learning_rate | int | 4 | Learning rate | Max edit patches applied per step (the "edit budget").
min_learning_rate | int | 2 | Min LR | Floor edit budget for decaying schedulers.
lr_scheduler | str | cosine | LR schedule | constant / linear / cosine / autonomous.
lr_control_mode | str | fixed | - | fixed / autonomous / none.
skill_update_mode | str | patch | - | patch / rewrite_from_suggestions / full_rewrite_minibatch.
use_slow_update | bool | true | Momentum | Enable epoch-boundary slow update.
slow_update_samples | int | 20 | - | Tasks sampled for the longitudinal comparison.
slow_update_gate_with_selection | bool | false | - | false = force-inject guidance; true = gate it on the selection split (see §5.4).
longitudinal_pair_policy | str | mixed | - | mixed / changed / unchanged: which comparison pairs to keep.
use_meta_skill | bool | true | Meta-learning | Enable cross-epoch optimizer memory.
+
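To build intuition for how learning_rate, min_learning_rate, and the cosine scheduler interact, the sketch below computes an integer edit budget per step using the standard cosine-annealing formula. It is an illustration under assumptions: `cosine_edit_budget` is a hypothetical function, and the rounding and step indexing in optimizer/scheduler.py may differ.

```python
import math


def cosine_edit_budget(step, total_steps, max_lr=4, min_lr=2):
    """Edit budget for a 0-indexed step under standard cosine annealing.

    Hypothetical sketch; defaults mirror the documented learning_rate
    and min_learning_rate, but the rounding rule is an assumption.
    """
    if total_steps <= 1:
        return max_lr
    # Clamp so out-of-range steps stay on the schedule's endpoints.
    progress = min(max(step, 0), total_steps - 1) / (total_steps - 1)
    value = min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
    return max(min_lr, round(value))


if __name__ == "__main__":
    print([cosine_edit_budget(step, 10, max_lr=16, min_lr=2) for step in range(10)])
```

The budget starts at learning_rate, decays smoothly, and never drops below min_learning_rate, so late steps still apply a few edits.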
+ +
+

6.6 evaluation.* #

+
+ + + + + + + + + +
Key | Type | Default | Description / options
use_gate | bool | true | Validation gating is mandatory in this branch (must remain true).
gate_metric | str | hard | hard / soft / mixed: score used by the gate (see §5.3).
gate_mixed_weight | float | 0.5 | Weight on the soft score when gate_metric = mixed.
sel_env_num | int | 0 | Selection-split eval size (0 = use full split).
test_env_num | int | 0 | Test-split eval size (0 = use full split).
eval_test | bool | true | Run a final test evaluation after training.
+
Gate is required +

Setting evaluation.use_gate: false raises an error — validation gating cannot be disabled in this branch.

+
+
+ +
+

6.7 env.* #

+
+ + + + + + + + + + + +
Key | Type | Default | Description
name | str | "" | Benchmark name (searchqa, docvqa, alfworld, …). Selects the env module.
skill_init | str | "" | Path to a seed skill (empty = start from scratch).
split_mode | str | ratio | ratio or split_dir (see §3.3).
split_dir | str | "" | Pre-split directory (when split_mode = split_dir).
data_path | str | "" | Single dataset path (when split_mode = ratio).
split_seed | int | 42 | Seed for deterministic ratio splitting.
exec_timeout | int | 120 | Per-task target/code-agent timeout (seconds).
out_root | str | "" | Output directory for the run.
+
Benchmark-specific env keys +

Env blocks may carry extra benchmark-specific keys (e.g. max_turns, workers, max_completion_tokens, limit). Unmapped env keys are passed straight through to the benchmark adapter; check the relevant configs/<benchmark>/default.yaml.

+
+
+ + +
+

7.1 Supported Benchmarks #

+
+ + + + + + + + + +
Benchmark | Type | Config
SearchQA | Question answering | configs/searchqa/default.yaml
DocVQA | Document QA | configs/docvqa/default.yaml
ALFWorld | Embodied agent | configs/alfworld/default.yaml
LiveMathematicianBench | Math reasoning | configs/livemathematicianbench/default.yaml
SpreadsheetBench | Spreadsheet code generation | configs/spreadsheetbench/default.yaml
OfficeQA | Tool-augmented QA | configs/officeqa/default.yaml
+

Each benchmark is a self-contained module under skillopt/envs/<benchmark>/ with an adapter.py, dataloader.py, rollout.py, and evaluator.py (some add a custom reflect.py). Packaged reference skills live in ckpt/<benchmark>/.

+
+ +
+

7.2 Add a New Benchmark #

+

Use skillopt/envs/_template/ as a starting point. At minimum, implement:

+
+
  1. Dataloader: read your item JSON into the framework's item dicts (dataloader.py).
  2. Rollout: run the target on one item with the current skill and return a trajectory + score (rollout.py).
  3. Evaluator: score predictions against ground truth (evaluator.py).
  4. Adapter: wire the above into the trainer's expected interface and register the env name (adapter.py).
+

Then add a configs/<name>/default.yaml inheriting _base_/default.yaml and set env.name to your new benchmark.

+
+ + +
+

8.1 Module Map #

+
+ + + + + + + + + + +
Module | Responsibility
skillopt/config.py | Load structured YAML, resolve _base_ inheritance, flatten to the trainer's flat dict, apply CLI overrides.
skillopt/engine/trainer.py | ReflACTTrainer: orchestrates the whole loop, gating, slow update, meta skill, resume, and artifact writing.
skillopt/gradient/ | Reflection ("backward pass"): reflect.py analysts, aggregate.py patch merging.
skillopt/optimizer/ | The "optimizer": edit application, learning-rate scheduling, edit selection, slow update, meta skill, rewrite modes.
skillopt/evaluation/gate.py | Pure accept/reject decision and metric selection.
skillopt/model/ | Backend clients (OpenAI/Azure, Claude, Qwen, Codex/Claude-Code exec) and routing.
skillopt/envs/<b>/ | Per-benchmark dataloader, rollout, evaluator, adapter.
+
+ +
+

8.2 Core Functions #

+
+ + + + + + + + + + + + + + + +
Function | File | Purpose
load_config / flatten_config / apply_overrides | config.py | Load YAML with inheritance; flatten sections; apply key=value overrides.
run_minibatch_reflect | gradient/reflect.py | Run error/success analysts over trajectory minibatches → edit patches.
merge_patches | gradient/aggregate.py | Hierarchically merge semantically similar patches.
rank_and_select | optimizer/clip.py | Rank edits and clip to the learning-rate budget.
build_scheduler | optimizer/scheduler.py | Construct the LR (edit-budget) scheduler: constant/linear/cosine/autonomous.
decide_autonomous_learning_rate | optimizer/lr_autonomous.py | Let the optimizer pick the next learning rate (autonomous mode).
apply_patch / apply_edit | optimizer/skill.py | Apply edits to the skill document (respecting the protected region).
rewrite_skill_from_suggestions | optimizer/rewrite.py | Full-rewrite update mode from accumulated suggestions.
evaluate_gate / select_gate_score | evaluation/gate.py | Accept/reject decision; compute hard/soft/mixed score.
run_slow_update | optimizer/slow_update.py | Produce epoch-boundary longitudinal guidance.
replace_slow_update_field / extract_slow_update_field | optimizer/slow_update.py | Read/overwrite the protected guidance region.
run_meta_skill / format_meta_skill_context | optimizer/meta_skill.py | Generate cross-epoch optimizer memory and render it into reflection context.
+
+ +
+

8.3 CLI Scripts #

+

scripts/train.py

+

Runs a full training loop. Required: --config. Override config via --cfg-options section.key=value … or legacy flat flags (--num_epochs, --batch_size, --optimizer_model, --target_model, --lr_scheduler, --edit_budget, --split_dir, …).

+

scripts/eval_only.py

+

Evaluates a skill document without training. Required: --config and --skill. Use --split to choose train / valid_seen / valid_unseen / all.

+
python scripts/eval_only.py \
+  --config configs/searchqa/default.yaml \
+  --skill outputs/my_run/best_skill.md \
+  --split valid_unseen
+
+ +
+

8.4 WebUI #

+

An optional Gradio dashboard to configure parameters and monitor runs:

+
pip install -e ".[webui]"
+python -m skillopt_webui.app          # http://localhost:7860
+python -m skillopt_webui.app --share  # public share link
+
+ + + + + + +
Flag | Default | Description
--port | 7860 | Server port.
--host | 0.0.0.0 | Bind address.
--share | off | Create a public Gradio share link.
+ + +
+ +
+ + + + +
+ + + +