SkillOpt Documentation & Reproduction Guide
Train agent skills like you train neural networks, with epochs, (mini-)batch size, learning rates, and validation gates, but without touching any model weights.
This guide walks you from a clean checkout to a reproduced result and a full reference for every configuration knob and core function. It is generated from, and kept consistent with, the current state of the codebase.
1.1 What is SkillOpt
SkillOpt is a text-space optimizer that improves a frozen language agent by iteratively editing a natural-language skill document, never the model weights. The skill document is a Markdown file that conditions a target model as it executes tasks. SkillOpt treats this document as the "weights" and runs a training loop that mirrors deep-learning training: rollout (forward pass), reflect (backward pass / gradients), select & apply edits (optimizer step), and a validation gate (accept/reject).
Two roles split every model call:
- Target: executes tasks using the current skill document (the agent being improved).
- Optimizer: analyzes the target's trajectories and proposes edits to the skill document.
The same loop drives six benchmarks out of the box (QA, document QA, embodied agents, math, spreadsheet code generation, and tool-augmented QA).
1.2 Deep-Learning ↔ SkillOpt Analogy
Every concept below maps to a concrete code construct, so deep-learning intuitions transfer directly to hyperparameter tuning.
| Deep learning | SkillOpt | Where it lives |
|---|---|---|
| Model weights | Skill document (Markdown) | skillopt/optimizer/skill.py |
| Forward pass | Rollout: target runs tasks | envs/<bench>/rollout.py |
| Loss / score | Task evaluator | envs/<bench>/evaluator.py |
| Backprop / gradients | Reflect → edit patches | gradient/reflect.py |
| Gradient aggregation | Hierarchical patch merge | gradient/aggregate.py |
| Gradient clipping | Rank & select top-k edits | optimizer/clip.py |
| Learning rate | optimizer.learning_rate (edits/step) | optimizer/scheduler.py |
| LR scheduler | lr_scheduler (cosine/linear/…) | optimizer/scheduler.py |
| Optimizer step | Apply patches to the document | optimizer/skill.py |
| Validation set | Selection split (valid_seen) | evaluation/gate.py |
| Early stopping / accept | Validation gate | evaluation/gate.py |
| Momentum | Slow update (epoch boundary) | optimizer/slow_update.py |
| Meta-learning | Meta skill (cross-epoch memory) | optimizer/meta_skill.py |
| Batch / minibatch | batch_size / minibatch_size | engine/trainer.py |
| Epoch | Epoch (+ slow update & meta skill) | engine/trainer.py |
A cosine schedule tends to beat a constant one; moderate learning rates (≈4–16 edits/step) beat very high or very low ones; the slow update curbs cross-epoch forgetting; and meta-skill memory improves reflection quality. Conversely, bigger rollout batches and more epochs show diminishing returns: skills converge in ~2–4 epochs.
1.3 Key Features
Validation gating
Every candidate skill is scored on a held-out selection split and only accepted if it beats the current/best skill.
Slow update
An epoch-boundary longitudinal comparison writes guidance into a protected region, acting as momentum against forgetting. The guidance is either force-injected or selection-gated.
Meta skill
Optimizer-side memory that reflects on what worked across epochs and feeds back into reflection.
Pluggable backends
OpenAI / Azure OpenAI, Anthropic Claude, local Qwen (vLLM), plus Codex/Claude-Code exec backends for the target.
Six benchmarks
SearchQA, DocVQA, ALFWorld, LiveMathematicianBench, SpreadsheetBench, OfficeQA, each a self-contained env module.
Auto-resume
Every run is checkpointed step-by-step; re-running the same command continues from the last completed step.
1.4 Repository Layout
# top level
configs/    # YAML configs (_base_ + per-benchmark)
scripts/    # train.py, eval_only.py CLIs
ckpt/       # packaged reference skills (e.g. gpt5.5_skill.md)
docs/       # this guide + mkdocs sources
skillopt/   # the package
 ├─ config.py          # YAML loading, _base_ inheritance, flatten
 ├─ engine/trainer.py  # the training loop (ReflACTTrainer)
 ├─ gradient/          # reflect.py (analyst), aggregate.py (merge)
 ├─ optimizer/         # skill edits, scheduler, clip, slow_update, meta_skill
 ├─ evaluation/        # gate.py (accept/reject logic)
 ├─ model/             # backend clients + routing
 └─ envs/<benchmark>/  # adapter, dataloader, rollout, evaluator, reflect
2.1 Requirements
- Python ≥ 3.10
- Credentials for at least one model backend (Azure OpenAI, OpenAI-compatible, Anthropic, or a local Qwen server)
- Benchmark datasets are not bundled; prepare your own splits (see §3)
2.2 Install the Package
git clone https://github.com/microsoft/SkillOpt.git
cd SkillOpt
pip install -e .

# Optional extras (install only what you need):
pip install -e ".[alfworld]"  # ALFWorld benchmark
pip install -e ".[claude]"    # Anthropic Claude backend
pip install -e ".[qwen]"      # local Qwen backend
pip install -e ".[webui]"     # monitoring dashboard

# ALFWorld also needs its data assets:
alfworld-download
2.3 Configure Credentials
Copy the template and fill in whichever backend you will use:
cp .env.example .env
# edit .env, then:
set -a; source .env; set +a
SkillOpt reuses the AZURE_OPENAI_* variable names even for plain OpenAI; there is no separate OPENAI_API_KEY knob. AZURE_OPENAI_ENDPOINT is required for every OpenAI auth mode.
Azure OpenAI (default)
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
export AZURE_OPENAI_API_VERSION="2024-12-01-preview"
# Auth option 1: API key
export AZURE_OPENAI_API_KEY="your-key"
# Auth option 2: Azure CLI (no key; recommended on Azure VMs)
export AZURE_OPENAI_AUTH_MODE=azure_cli
# Auth option 3: Managed Identity
export AZURE_OPENAI_AUTH_MODE=managed_identity
export AZURE_OPENAI_MANAGED_IDENTITY_CLIENT_ID="your-client-id"
OpenAI-compatible endpoint
export AZURE_OPENAI_ENDPOINT="https://api.openai.com/v1"
export AZURE_OPENAI_API_KEY="sk-..."
export AZURE_OPENAI_AUTH_MODE=openai_compatible
Anthropic Claude / local Qwen
export ANTHROPIC_API_KEY="sk-ant-..."  # claude_chat backend

export QWEN_CHAT_BASE_URL="http://localhost:8000/v1"  # local vLLM
export QWEN_CHAT_MODEL="Qwen/Qwen3.5-4B"
2.4 Verify Installation
python -c "import skillopt; print('SkillOpt ready!')"
3.1 Split Directory Format
With env.split_mode: split_dir (the recommended, deterministic mode), SkillOpt reads a directory containing train/, val/, and test/ subfolders, each holding an items.json file with a JSON array of task items:
data/my_split/
 ├─ train/items.json  # used for rollout (the "train split")
 ├─ val/items.json    # selection split → validation gate (valid_seen)
 └─ test/items.json   # held-out final eval (valid_unseen)
Internally the splits are referred to as train, valid_seen (validation/selection), and valid_unseen (test). The --split flag of eval_only.py uses these names.
3.2 Item JSON Schema
Required fields depend on the benchmark; consult skillopt/envs/<benchmark>/dataloader.py for the exact contract. A SearchQA item, for example:
[
  {
    "id": "unique_item_id",
    "question": "Who wrote the novel ...",
    "context": "[DOC] relevant passage text ...",
    "answers": ["expected answer"]
  }
]
This repository ships no benchmark data. Prepare your own splits in the format above before training.
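As a rough illustration of the layout above, the following sketch writes three item lists into train/, val/, and test/ subfolders. The helper names (write_split_dir, make_item) and the placeholder items are hypothetical and not part of SkillOpt; the field names simply follow the SearchQA example, so check the benchmark's dataloader.py before relying on them.

```python
import json
import tempfile
from pathlib import Path


def write_split_dir(root, train, val, test):
    """Write each item list to <root>/<split>/items.json (hypothetical helper)."""
    root = Path(root)
    for name, items in (("train", train), ("val", val), ("test", test)):
        split_dir = root / name
        split_dir.mkdir(parents=True, exist_ok=True)
        with open(split_dir / "items.json", "w", encoding="utf-8") as fh:
            json.dump(items, fh, ensure_ascii=False, indent=2)
    return root


def make_item(i):
    # Placeholder content; field names mirror the SearchQA example only.
    return {
        "id": f"item_{i}",
        "question": f"Placeholder question {i}?",
        "context": f"[DOC] placeholder passage {i}",
        "answers": [f"answer {i}"],
    }


items = [make_item(i) for i in range(10)]
split_root = write_split_dir(
    Path(tempfile.mkdtemp()) / "my_split", items[:6], items[6:8], items[8:]
)
print(sorted(p.name for p in split_root.iterdir()))
```

The resulting directory can then be passed as --split_dir (or set as env.split_dir) once it holds real benchmark items.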
3.3 Split Modes
| env.split_mode | Behavior |
|---|---|
| split_dir | Use a pre-built directory with explicit train/val/test folders (set env.split_dir). Deterministic and reproducible. |
| ratio | Build a deterministic split on the fly from a single env.data_path, using split_seed (and a train:val:test ratio). Convenient for quick experiments. |
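To make the idea behind ratio mode concrete, here is a minimal sketch of a seeded split. It is not SkillOpt's implementation: the function name ratio_split and the 0.8/0.1/0.1 ratio are assumptions chosen for illustration, and only the seed default of 42 comes from the split_seed default in this guide.

```python
import random


def ratio_split(items, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle with a fixed seed, then cut into train/val/test by ratio.

    Hypothetical helper; the ratio values are an assumption.
    """
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(items)
    # A private Random instance keeps the split independent of global RNG state.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],  # remainder goes to test
    }


splits = ratio_split(range(100), seed=42)
print({name: len(part) for name, part in splits.items()})
```

Because the shuffle depends only on the seed and the input order, re-running with the same seed reproduces the same three subsets.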
4.1 Train a Skill
# Minimal SearchQA run
python scripts/train.py \
  --config configs/searchqa/default.yaml \
  --split_dir /path/to/your/searchqa_split \
  --azure_openai_endpoint https://your-resource.openai.azure.com/ \
  --optimizer_model gpt-5.5 \
  --target_model gpt-5.5
Swap the config for another benchmark (e.g. configs/livemathematicianbench/default.yaml, configs/alfworld/default.yaml). Common CLI arguments:
| Argument | Description |
|---|---|
| --config | Benchmark config YAML (required) |
| --split_dir | Path to the data split directory |
| --azure_openai_endpoint | Azure OpenAI endpoint URL |
| --optimizer_model / --target_model | Deployment names for optimizer / target |
| --num_epochs / --batch_size | Epochs and rollout batch size |
| --out_root | Output directory |
| --cfg-options k=v ... | Override any config key (see §6.1) |
4.2 Evaluate a Skill
Evaluate any skill document (a packaged reference skill, or a trained run's best_skill.md) without training:
# Evaluate the packaged GPT-5.5 SearchQA skill on the test split
python scripts/eval_only.py \
  --config configs/searchqa/default.yaml \
  --skill ckpt/searchqa/gpt5.5_skill.md \
  --split valid_unseen \
  --split_dir /path/to/searchqa_split \
  --azure_openai_endpoint https://your-resource.openai.azure.com/
| --split | Meaning |
|---|---|
| valid_unseen | Test set (held-out) |
| valid_seen | Validation / selection set |
| train | Training set |
| all | All splits combined (default) |
4.3 Output Structure
outputs/<run_name>/
 ├─ config.json            # flattened runtime config
 ├─ history.json           # per-step training history
 ├─ runtime_state.json     # resume checkpoint
 ├─ best_skill.md          # best validated skill document
 ├─ skills/skill_vXXXX.md  # skill snapshot per step
 ├─ steps/step_XXXX/       # per-step artifacts (patches, evals)
 ├─ slow_update/epoch_XX/  # slow-update logs & rollouts
 └─ meta_skill/epoch_XX/   # meta-skill logs
4.4 Auto-Resume
Each completed step persists its state to runtime_state.json and a steps/step_XXXX/ directory. Re-running the same command against the same out_root detects finished work and continues from the last completed step, including epoch-boundary slow-update and meta-skill stages.
5.1 The Training Loop
The loop lives in ReflACTTrainer (skillopt/engine/trainer.py). Each epoch runs a series of optimization steps over rollout batches, then performs two epoch-boundary stages.
for epoch in epochs:
    for step in steps:
        1. Rollout    # target executes a batch of tasks
        2. Reflect    # optimizer analyzes trajectories → edit patches
        3. Aggregate  # hierarchically merge similar patches
        4. Select     # rank & clip edits to the learning rate
        5. Update     # apply patches → candidate skill
        6. Gate       # score on selection split → accept / reject

    # epoch boundary (from epoch 2 onward)
    Slow update   # longitudinal comparison → protected guidance
    Meta skill    # cross-epoch optimizer memory
5.2 The Six Per-Step Stages
| Stage | What happens | Source |
|---|---|---|
| 1. Rollout | The target model runs each task in the batch with the current skill as context, producing trajectories and scores. | envs/<b>/rollout.py |
| 2. Reflect | The optimizer runs an error analyst (and optional success analyst) over minibatches of trajectories, emitting structured edit patches. Runs in parallel across analyst_workers. | gradient/reflect.py |
| 3. Aggregate | Semantically similar patches are merged hierarchically to remove redundancy. | gradient/aggregate.py → merge_patches |
| 4. Select | Patches are ranked and clipped to the current learning rate (max edits this step), set by the scheduler. | optimizer/clip.py → rank_and_select |
| 5. Update | Selected edits are applied to the skill document, producing a candidate skill (patch / rewrite modes). | optimizer/skill.py, update_modes.py |
| 6. Gate | The candidate is scored on the selection split and accepted only if it improves (see §5.3). | evaluation/gate.py → evaluate_gate |
5.3 Validation Gate
evaluate_gate is a pure decision function. It compares the candidate's selection-set score against the current and best skills:
- accept_new_best: candidate > current and candidate > best → becomes both current and best.
- accept: candidate > current but ≤ best → becomes current only.
- reject: candidate ≤ current → discarded; current/best unchanged.
The comparison metric is configurable via evaluation.gate_metric:
| Metric | Score used |
|---|---|
| hard (default) | Exact-match / discrete score |
| soft | Partial-credit / continuous score |
| mixed | Weighted blend, controlled by gate_mixed_weight |
The soft/mixed metrics (contributed config configs/examples/soft_gate.yaml) help when the selection split is small and rewards are continuous, where a discrete hard gate may reject every candidate and stall training. Paper numbers use the default hard gate.
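The three gate outcomes and the metric choice can be sketched as two small pure functions. This is an illustrative reading of the rules above, not the code in evaluation/gate.py: the names gate_score and gate_decision are hypothetical, and the convex blend used for the mixed metric is an assumption (the guide only states that gate_mixed_weight is the weight on the soft score).

```python
def gate_score(hard, soft, metric="hard", mixed_weight=0.5):
    """Pick the score the gate compares (hypothetical helper)."""
    if metric == "hard":
        return hard
    if metric == "soft":
        return soft
    if metric == "mixed":
        # Assumed blend: mixed_weight on the soft score, the rest on the hard score.
        return mixed_weight * soft + (1.0 - mixed_weight) * hard
    raise ValueError(f"unknown gate metric: {metric!r}")


def gate_decision(candidate, current, best):
    """Return accept_new_best, accept, or reject from three selection scores."""
    if candidate > current and candidate > best:
        return "accept_new_best"
    if candidate > current:
        return "accept"
    return "reject"


# A tie on the discrete score is rejected, while the continuous score can
# still separate the same two skills.
hard_verdict = gate_decision(
    gate_score(0.50, 0.62, "hard"), gate_score(0.50, 0.58, "hard"), best=0.50
)
soft_verdict = gate_decision(
    gate_score(0.50, 0.62, "soft"), gate_score(0.50, 0.58, "soft"), best=0.58
)
print(hard_verdict, soft_verdict)
```

The example mirrors the small-split situation described above: with identical hard scores every candidate is rejected, whereas the soft score lets an improving candidate through.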
5.4 Slow Update (Momentum)
At each epoch boundary (from epoch 2), the slow update rolls out both the previous epoch's skill and the current skill on the same sampled tasks, categorizes items (improved / regressed / persistent-fail / stable-success), and asks the optimizer to write a free-form guidance block. This guidance lands in a protected region of the skill that step-level edits cannot touch; only the slow update overwrites it. It is SkillOpt's analogue of momentum, countering cross-epoch forgetting.
Acceptance has two modes, selected by optimizer.slow_update_gate_with_selection:
| Mode | Behavior |
|---|---|
| false (default), force-injected | Guidance is injected into both current and best skills unconditionally. The longitudinal guidance always persists; it is not gated by step-level selection scores. |
| true, gated | The slow-update candidate is scored on the selection split and accepted/rejected through the same validation gate as step-level updates. |
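The four-way categorization used by the longitudinal comparison can be illustrated with a small sketch. It assumes each task has a score under the previous and the current skill and that a score at or above a threshold counts as success; the function name categorize, the threshold, and the sample scores are all hypothetical and only meant to show the bucketing logic.

```python
from collections import Counter


def categorize(prev_score, curr_score, threshold=1.0):
    """Bucket one task by its outcome under the previous and current skill.

    Hypothetical helper; treating score >= threshold as success is an assumption.
    """
    prev_ok = prev_score >= threshold
    curr_ok = curr_score >= threshold
    if curr_ok and not prev_ok:
        return "improved"
    if prev_ok and not curr_ok:
        return "regressed"
    return "stable-success" if curr_ok else "persistent-fail"


# Made-up scores for four sampled tasks, keyed by item id.
prev_scores = {"q1": 0.0, "q2": 1.0, "q3": 0.0, "q4": 1.0}
curr_scores = {"q1": 1.0, "q2": 0.0, "q3": 0.0, "q4": 1.0}

buckets = {
    item: categorize(prev_scores[item], curr_scores[item]) for item in prev_scores
}
summary = Counter(buckets.values())
print(buckets)
```

A per-category summary like this is the kind of evidence the optimizer could be given when writing the guidance block.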
5.5 Meta Skill (Optimizer Memory)
The meta skill is optimizer-side memory; it never modifies the target skill document. At the end of each epoch (skipped for epoch 1), the optimizer compares the previous and current epoch's last-step skills on the same sampled tasks and writes a compact, evidence-based reflection on which kinds of edits helped or hurt. That memory is then injected as extra context into the next epoch's reflect / merge / learning-rate / ranking stages, so the optimizer accumulates strategy across the run.
5.6 Skill Document Anatomy
A skill document is plain Markdown. Initial skills can be empty (learn from scratch) or seeded with domain knowledge via env.skill_init. During training the document accrues rules, patterns, and edge-case handling through accepted edit patches. A dedicated protected region holds the slow-update guidance, delimited by HTML-comment markers:
# Question Answering Skill

## Learned rules ...
- When the context contains multiple candidates, prefer ...

<!-- SLOW_UPDATE_START -->
# (epoch-level longitudinal guidance; only the slow update writes here)
<!-- SLOW_UPDATE_END -->
Helpers in optimizer/slow_update.py manage this region: inject_empty_slow_update_field (placeholder at epoch 1), extract_slow_update_field (read), and replace_slow_update_field (overwrite). Step-level edits are blocked from modifying anything inside the markers.
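For intuition, reading and overwriting a marker-delimited region can be done with a regular expression. The sketch below is an independent illustration, not the code in optimizer/slow_update.py: read_region and write_region are hypothetical names, and appending the region when it is missing is an assumed behavior.

```python
import re

START = "<!-- SLOW_UPDATE_START -->"
END = "<!-- SLOW_UPDATE_END -->"
# Non-greedy match so only the text between the first marker pair is captured.
_REGION = re.compile(re.escape(START) + r"(.*?)" + re.escape(END), re.DOTALL)


def read_region(skill):
    """Return the guidance text between the markers, or None if absent."""
    match = _REGION.search(skill)
    return match.group(1).strip() if match else None


def write_region(skill, guidance):
    """Overwrite the protected region, appending it if missing (assumption)."""
    block = f"{START}\n{guidance.strip()}\n{END}"
    if _REGION.search(skill):
        # A function replacement keeps backslashes in the guidance literal.
        return _REGION.sub(lambda _m: block, skill, count=1)
    return skill.rstrip("\n") + "\n\n" + block + "\n"


skill = "# Question Answering Skill\n\n- rule one\n"
skill = write_region(skill, "")  # empty placeholder
skill = write_region(skill, "Prefer exact spans from the context.")
print(read_region(skill))
```

Keeping the region behind a single pair of helpers like this makes it straightforward to exclude it from step-level edits.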
6.1 Configuration System
Configs are structured YAML with section blocks (model, train, gradient, optimizer, evaluation, env) and _base_ inheritance. A benchmark config inherits the shared defaults and overrides only what differs:
# configs/searchqa/default.yaml
_base_: ../_base_/default.yaml
train:
  train_size: 400
  batch_size: 40
optimizer:
  learning_rate: 4
env:
  name: searchqa
  split_dir: data/searchqa_split
Override any key at the command line without editing files:
python scripts/train.py --config configs/searchqa/default.yaml \
  --cfg-options optimizer.learning_rate=16 optimizer.lr_scheduler=linear
Each reference section below lists the key (relative to its YAML block), type, default (from configs/_base_/default.yaml), allowed values, and meaning. Defaults shown are the shipped base defaults.
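The inheritance-then-override behavior described above can be approximated on plain dictionaries. This sketch is not the code in skillopt/config.py: deep_merge, parse_scalar, and set_dotted are hypothetical helpers, the value-typing rules are an assumption, and the dictionaries stand in for already-parsed YAML.

```python
import copy


def deep_merge(base, override):
    """Layer override on top of base; nested dicts merge key by key."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def parse_scalar(text):
    """Best-effort typing of a CLI value: bool, int, float, else str (assumed rules)."""
    if text.lower() in ("true", "false"):
        return text.lower() == "true"
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


def set_dotted(cfg, pairs):
    """Apply section.key=value strings to a copy of a nested config dict."""
    out = copy.deepcopy(cfg)
    for pair in pairs:
        dotted, sep, raw = pair.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {pair!r}")
        *parents, leaf = dotted.split(".")
        node = out
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = parse_scalar(raw)
    return out


base_cfg = {
    "train": {"num_epochs": 4, "batch_size": 40},
    "optimizer": {"learning_rate": 4, "lr_scheduler": "cosine"},
}
bench_cfg = {"train": {"train_size": 400}, "env": {"name": "searchqa"}}
cfg = set_dotted(
    deep_merge(base_cfg, bench_cfg),
    ["optimizer.learning_rate=16", "optimizer.lr_scheduler=linear"],
)
print(cfg["optimizer"])
```

Applying the merge first and the overrides last matches the precedence implied above: base defaults, then the benchmark config, then command-line values.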
6.2 model.*
| Key | Type | Default | Description / options |
|---|---|---|---|
| backend | str | azure_openai | High-level backend label for the run. |
| optimizer | str | gpt-5.5 | Optimizer model deployment (writes skill edits). |
| target | str | gpt-5.5 | Target model deployment (executes tasks). |
| optimizer_backend | str | openai_chat | Client path for the optimizer: openai_chat or claude_chat. |
| target_backend | str | openai_chat | Client path for the target: openai_chat / claude_chat / qwen_chat / codex_exec / claude_code_exec. |
| reasoning_effort | str | medium | low / medium / high / xhigh / max (or empty). |
| rewrite_reasoning_effort | str | "" | Override effort for full-rewrite calls (empty = inherit). |
| rewrite_max_completion_tokens | int | 64000 | Token cap for full-rewrite optimizer calls. |
| azure_openai_endpoint | str | "" | Azure resource URL (or via AZURE_OPENAI_ENDPOINT). |
| azure_openai_api_version | str | 2024-12-01-preview | Azure API version header. |
| azure_openai_auth_mode | str | "" | api_key / azure_cli / managed_identity / openai_compatible (empty → env default). |
Every azure_openai_* key also has optimizer_azure_openai_* and target_azure_openai_* variants, letting you point the optimizer and target at different Azure resources. Exec backends (codex_exec, claude_code_exec) add their own codex_exec_* / claude_code_exec_* knobs (sandbox, reasoning effort, SDK mode, etc.).
6.3 train.*
| Key | Type | Default | DL analogy | Description |
|---|---|---|---|---|
| num_epochs | int | 4 | Epochs | Number of training epochs. |
| train_size | int | 0 | Train-set size | 0 = derive from the dataset split. (Fixed by split size when using split_dir.) |
| batch_size | int | 40 | Batch size | Tasks rolled out per optimization step. |
| accumulation | int | 1 | Grad accumulation | Accumulation rounds per step. |
| seed | int | 42 | Random seed | Reproducibility seed. |
6.4 gradient.*
| Key | Type | Default | Description |
|---|---|---|---|
| minibatch_size | int | 8 | Trajectories per reflect minibatch. |
| merge_batch_size | int | 8 | Patches per merge batch during aggregation. |
| analyst_workers | int | 16 | Parallel reflection workers (data parallelism). |
| max_analyst_rounds | int | 3 | Max rounds of analyst reflection per step. |
| failure_only | bool | false | Reflect only on failed trajectories when true. |
6.5 optimizer.*
| Key | Type | Default | DL analogy | Description / options |
|---|---|---|---|---|
| learning_rate | int | 4 | Learning rate | Max edit patches applied per step (the "edit budget"). |
| min_learning_rate | int | 2 | Min LR | Floor edit budget for decaying schedulers. |
| lr_scheduler | str | cosine | LR schedule | constant / linear / cosine / autonomous. |
| lr_control_mode | str | fixed | - | fixed / autonomous / none. |
| skill_update_mode | str | patch | - | patch / rewrite_from_suggestions / full_rewrite_minibatch. |
| use_slow_update | bool | true | Momentum | Enable epoch-boundary slow update. |
| slow_update_samples | int | 20 | - | Tasks sampled for the longitudinal comparison. |
| slow_update_gate_with_selection | bool | false | - | false = force-inject guidance; true = gate it on the selection split (see §5.4). |
| longitudinal_pair_policy | str | mixed | - | mixed / changed / unchanged: which comparison pairs to keep. |
| use_meta_skill | bool | true | Meta-learning | Enable cross-epoch optimizer memory. |
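To show how learning_rate, min_learning_rate, and a cosine schedule could interact as an edit budget, here is a small sketch. It uses the standard cosine-annealing formula rounded to whole edits, which is an assumption about how optimizer/scheduler.py behaves; cosine_edit_budget and select_top_k are hypothetical names, and the patch "score" field is invented for illustration.

```python
import math


def cosine_edit_budget(step, total_steps, lr_max=4, lr_min=2):
    """Integer edit budget decaying from lr_max to lr_min along a cosine curve."""
    if total_steps <= 1:
        return lr_max
    progress = min(max(step, 0), total_steps - 1) / (total_steps - 1)
    value = lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))
    return max(lr_min, int(round(value)))


def select_top_k(patches, budget):
    """Keep the highest-scoring patches, at most budget of them."""
    ranked = sorted(patches, key=lambda patch: patch["score"], reverse=True)
    return ranked[:budget]


budgets = [cosine_edit_budget(s, 5, lr_max=16, lr_min=2) for s in range(5)]
print(budgets)

# Hypothetical ranked patches; only the two best survive a budget of 2.
patches = [{"id": "a", "score": 0.2}, {"id": "b", "score": 0.9}, {"id": "c", "score": 0.5}]
kept = select_top_k(patches, budget=2)
print([patch["id"] for patch in kept])
```

Early steps allow many edits and later steps only a few, which is the text-space counterpart of taking smaller optimizer steps as training converges.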
6.6 evaluation.*
| Key | Type | Default | Description / options |
|---|---|---|---|
| use_gate | bool | true | Validation gating is mandatory in this branch (must remain true). |
| gate_metric | str | hard | hard / soft / mixed: score used by the gate (see §5.3). |
| gate_mixed_weight | float | 0.5 | Weight on the soft score when gate_metric = mixed. |
| sel_env_num | int | 0 | Selection-split eval size (0 = use full split). |
| test_env_num | int | 0 | Test-split eval size (0 = use full split). |
| eval_test | bool | true | Run a final test evaluation after training. |
Setting evaluation.use_gate: false raises an error; validation gating cannot be disabled in this branch.
6.7 env.*
| Key | Type | Default | Description |
|---|---|---|---|
| name | str | "" | Benchmark name (searchqa, docvqa, alfworld, …). Selects the env module. |
| skill_init | str | "" | Path to a seed skill (empty = start from scratch). |
| split_mode | str | ratio | ratio or split_dir (see §3.3). |
| split_dir | str | "" | Pre-split directory (when split_mode = split_dir). |
| data_path | str | "" | Single dataset path (when split_mode = ratio). |
| split_seed | int | 42 | Seed for deterministic ratio splitting. |
| exec_timeout | int | 120 | Per-task target/code-agent timeout (seconds). |
| out_root | str | "" | Output directory for the run. |
Env blocks may carry extra benchmark-specific keys (e.g. max_turns, workers, max_completion_tokens, limit). Unmapped env keys are passed straight through to the benchmark adapter; check the relevant configs/<benchmark>/default.yaml.
7.1 Supported Benchmarks
| Benchmark | Type | Config |
|---|---|---|
| SearchQA | Question answering | configs/searchqa/default.yaml |
| DocVQA | Document QA | configs/docvqa/default.yaml |
| ALFWorld | Embodied agent | configs/alfworld/default.yaml |
| LiveMathematicianBench | Math reasoning | configs/livemathematicianbench/default.yaml |
| SpreadsheetBench | Spreadsheet code generation | configs/spreadsheetbench/default.yaml |
| OfficeQA | Tool-augmented QA | configs/officeqa/default.yaml |
Each benchmark is a self-contained module under skillopt/envs/<benchmark>/ with an adapter.py, dataloader.py, rollout.py, and evaluator.py (some add a custom reflect.py). Packaged reference skills live in ckpt/<benchmark>/.
7.2 Add a New Benchmark
Use skillopt/envs/_template/ as a starting point. At minimum, implement:
- Dataloader (dataloader.py): read your item JSON into the framework's item dicts.
- Rollout (rollout.py): run the target on one item with the current skill and return a trajectory and score.
- Evaluator (evaluator.py): score predictions against ground truth.
- Adapter (adapter.py): wire the above into the trainer's expected interface and register the env name.
Then add a configs/<name>/default.yaml inheriting _base_/default.yaml and set env.name to your new benchmark.
8.1 Module Map
| Module | Responsibility |
|---|---|
| skillopt/config.py | Load structured YAML, resolve _base_ inheritance, flatten to the trainer's flat dict, apply CLI overrides. |
| skillopt/engine/trainer.py | ReflACTTrainer: orchestrates the whole loop, gating, slow update, meta skill, resume, and artifact writing. |
| skillopt/gradient/ | Reflection ("backward pass"): reflect.py analysts, aggregate.py patch merging. |
| skillopt/optimizer/ | The "optimizer": edit application, learning-rate scheduling, edit selection, slow update, meta skill, rewrite modes. |
| skillopt/evaluation/gate.py | Pure accept/reject decision and metric selection. |
| skillopt/model/ | Backend clients (OpenAI/Azure, Claude, Qwen, Codex/Claude-Code exec) and routing. |
| skillopt/envs/<b>/ | Per-benchmark dataloader, rollout, evaluator, adapter. |
8.2 Core Functions
| Function | File | Purpose |
|---|---|---|
| load_config / flatten_config / apply_overrides | config.py | Load YAML with inheritance; flatten sections; apply key=value overrides. |
| run_minibatch_reflect | gradient/reflect.py | Run error/success analysts over trajectory minibatches → edit patches. |
| merge_patches | gradient/aggregate.py | Hierarchically merge semantically similar patches. |
| rank_and_select | optimizer/clip.py | Rank edits and clip to the learning-rate budget. |
| build_scheduler | optimizer/scheduler.py | Construct the LR (edit-budget) scheduler: constant/linear/cosine/autonomous. |
| decide_autonomous_learning_rate | optimizer/lr_autonomous.py | Let the optimizer pick the next learning rate (autonomous mode). |
| apply_patch / apply_edit | optimizer/skill.py | Apply edits to the skill document (respecting the protected region). |
| rewrite_skill_from_suggestions | optimizer/rewrite.py | Full-rewrite update mode from accumulated suggestions. |
| evaluate_gate / select_gate_score | evaluation/gate.py | Accept/reject decision; compute hard/soft/mixed score. |
| run_slow_update | optimizer/slow_update.py | Produce epoch-boundary longitudinal guidance. |
| replace_slow_update_field / extract_slow_update_field | optimizer/slow_update.py | Read/overwrite the protected guidance region. |
| run_meta_skill / format_meta_skill_context | optimizer/meta_skill.py | Generate cross-epoch optimizer memory and render it into reflection context. |
8.3 CLI Scripts
scripts/train.py
Runs a full training loop. Required: --config. Override config via --cfg-options section.key=value … or legacy flat flags (--num_epochs, --batch_size, --optimizer_model, --target_model, --lr_scheduler, --edit_budget, --split_dir, …).
scripts/eval_only.py
Evaluates a skill document without training. Required: --config and --skill. Use --split to choose train / valid_seen / valid_unseen / all.
python scripts/eval_only.py \
  --config configs/searchqa/default.yaml \
  --skill outputs/my_run/best_skill.md \
  --split valid_unseen
8.4 WebUI
An optional Gradio dashboard to configure parameters and monitor runs:
pip install -e ".[webui]"
python -m skillopt_webui.app          # http://localhost:7860
python -m skillopt_webui.app --share  # public share link
| Flag | Default | Description |
|---|---|---|
| --port | 7860 | Server port. |
| --host | 0.0.0.0 | Bind address. |
| --share | off | Create a public Gradio share link. |