The Training Loop
SkillOpt's core insight: optimizing natural-language skill documents follows the same structure as training neural networks.
Overview
┌──────────────────────────────────────────────────────────┐
│  Training Loop                                           │
│                                                          │
│  for epoch in epochs:                                    │
│    for step in steps:                                    │
│      1. Rollout:   Target executes tasks                 │
│      2. Reflect:   Optimizer analyzes trajectories       │
│      3. Aggregate: Hierarchical merge of patches         │
│      4. Select:    Rank & clip edits (learning rate)     │
│      5. Update:    Apply patches to skill doc            │
│      6. Gate:      Validate & accept/reject              │
│                                                          │
│    Epoch Boundary:                                       │
│      • Slow Update (longitudinal comparison & guidance)  │
│      • Meta Skill (cross-epoch strategy memory)          │
└──────────────────────────────────────────────────────────┘
Stage Details
1. Rollout (Forward Pass)
The target model executes tasks using the current skill document as its prompt. Each task produces a trajectory and a score.
# Analogy: forward pass through the network
predictions = model(input, skill_document)
scores = evaluate(predictions, ground_truth)
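To make the rollout stage concrete, here is a minimal sketch of one pass over a task list. The names `RolloutResult`, `rollout`, `toy_target`, and `exact_match`, as well as the task dictionary layout, are assumptions made for illustration and are not taken from SkillOpt's code; the toy target stands in for a real model call.

```python
from dataclasses import dataclass


@dataclass
class RolloutResult:
    task_id: str
    trajectory: str  # what the target produced for this task
    score: float     # evaluation score for that trajectory


def rollout(tasks, skill_document, target, evaluate):
    """Run every task with the current skill document as the prompt."""
    results = []
    for task in tasks:
        trajectory = target(skill_document, task["input"])
        score = evaluate(trajectory, task["expected"])
        results.append(RolloutResult(task["id"], trajectory, score))
    return results


# Toy stand-ins so the sketch runs without a model (hypothetical).
def toy_target(skill, task_input):
    return task_input.upper() if "uppercase" in skill else task_input


def exact_match(output, expected):
    return 1.0 if output == expected else 0.0


tasks = [
    {"id": "t1", "input": "abc", "expected": "ABC"},
    {"id": "t2", "input": "Abc", "expected": "Abc"},
]
results = rollout(tasks, "Always uppercase the input.", toy_target, exact_match)
```

Each `RolloutResult` pairs a trajectory with its score, which is the information the reflection stage needs in order to pick out failures.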
2. Reflect (Backward Pass)
The optimizer model analyzes failed trajectories and produces edit patches — structured suggestions for improving the skill document.
Two modes:
- Shallow: Analyze each trajectory independently
- Deep: Cross-reference multiple failures to find systemic issues
# Analogy: computing gradients
loss.backward()  # the resulting gradients correspond to edit patches
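The difference between the two reflection modes can be sketched as a difference in how trajectories are batched for the optimizer. This is an illustrative sketch only: `reflect`, the `mode` argument, and `toy_optimizer` are hypothetical names, and the real optimizer would be a model call rather than a string formatter.

```python
def reflect(failed_trajectories, optimizer, mode="shallow"):
    """Turn failed trajectories into edit patches.

    `optimizer` is a hypothetical callable that takes a list of
    trajectories and returns a list of patch strings.
    """
    failures = list(failed_trajectories)
    if mode == "shallow":
        # One optimizer call per trajectory, each analyzed in isolation.
        return [patch for trajectory in failures for patch in optimizer([trajectory])]
    if mode == "deep":
        # One optimizer call over all failures, so shared causes are visible.
        return list(optimizer(failures))
    raise ValueError(f"unknown reflection mode: {mode!r}")


def toy_optimizer(batch):
    # Stand-in for a model call: emits one patch per batch it is shown.
    return [f"Fix issue seen in {len(batch)} trajectory(ies)"]


failures = ["wrong units", "wrong units again", "missing citation"]
shallow_patches = reflect(failures, toy_optimizer, mode="shallow")
deep_patches = reflect(failures, toy_optimizer, mode="deep")
```

In this sketch shallow mode yields one patch per failure, while deep mode yields patches informed by the whole batch.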
3. Aggregate
Semantically similar edit patches are merged to avoid redundant edits.
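As a rough illustration of merging near-duplicate patches, the sketch below clusters patches greedily. It makes two simplifying assumptions that should not be read as SkillOpt's method: string similarity from the standard library's `difflib.SequenceMatcher` stands in for semantic similarity, and the merge is a single flat pass rather than a hierarchical one. The `threshold` value and the `support` counter are likewise hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Character-level ratio in [0, 1]; a crude proxy for semantic similarity.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def aggregate(patches, threshold=0.8):
    """Greedily fold each patch into the first sufficiently similar cluster."""
    clusters = []  # each cluster: {"text": representative, "support": count}
    for patch in patches:
        for cluster in clusters:
            if similarity(patch, cluster["text"]) >= threshold:
                cluster["support"] += 1
                break
        else:
            # No similar cluster found, so this patch starts a new one.
            clusters.append({"text": patch, "support": 1})
    return clusters


patches = [
    "Always check units before answering.",
    "Always check the units before answering.",
    "Cite the source table.",
]
merged = aggregate(patches)
```

Keeping a support count per cluster is one way to carry forward how many trajectories motivated an edit, which a later ranking step could use.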
4. Select (Gradient Clipping)
Edits are ranked by relevance score. The learning_rate parameter caps how many edits are applied per step, much as gradient clipping bounds the size of a single update to prevent overshooting.
# Analogy: gradient clipping + optimizer step size
selected = top_k(edits, k=learning_rate)
The lr_scheduler adjusts this over training:
- cosine: Start aggressive, taper smoothly
- linear: Linear decay
- constant: Fixed rate
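The interaction between learning_rate and lr_scheduler can be sketched as follows. The formulas are the standard cosine and linear decay shapes, applied here to an integer edit budget; the `min_lr` floor, the function names, and the `relevance` field are assumptions for illustration rather than SkillOpt's actual parameters.

```python
import math


def scheduled_lr(base_lr, step, total_steps, scheduler="constant", min_lr=1):
    """Return the edit budget (an integer) for a given step."""
    progress = step / max(total_steps - 1, 1)  # 0.0 at the start, 1.0 at the end
    if scheduler == "constant":
        value = base_lr
    elif scheduler == "linear":
        value = base_lr - (base_lr - min_lr) * progress
    elif scheduler == "cosine":
        value = min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
    else:
        raise ValueError(f"unknown scheduler: {scheduler!r}")
    return max(min_lr, round(value))


def select(edits, k):
    """Keep the k edits with the highest relevance score."""
    ranked = sorted(edits, key=lambda edit: edit["relevance"], reverse=True)
    return ranked[:k]


edits = [
    {"id": "a", "relevance": 0.2},
    {"id": "b", "relevance": 0.9},
    {"id": "c", "relevance": 0.5},
]
budget = scheduled_lr(base_lr=9, step=4, total_steps=5, scheduler="cosine")
selected = select(edits, k=2)
```

With these assumed settings the budget starts at 9 edits and decays to 1 by the final step, so early steps rewrite the skill broadly and late steps make small adjustments.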
5. Update (Parameter Update)
Selected edits are applied to the skill document, producing a new version.
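One way to picture the update stage is as a small set of line-level operations on the skill text. The edit schema below (`op`, `target`, `text`) is hypothetical; the source does not specify how patches are represented, so treat this as a sketch of the idea rather than the actual patch format.

```python
def apply_edits(skill, edits):
    """Apply line-level edits in order and return the new skill text."""
    lines = skill.splitlines()
    for edit in edits:
        if edit["op"] == "append":
            lines.append(edit["text"])
        elif edit["op"] == "replace":
            # Swap every line that exactly matches the target.
            lines = [edit["text"] if line == edit["target"] else line for line in lines]
        elif edit["op"] == "delete":
            lines = [line for line in lines if line != edit["target"]]
        else:
            raise ValueError(f"unknown edit op: {edit['op']!r}")
    return "\n".join(lines)


skill_v1 = "Be concise.\nGuess when unsure."
edits = [
    {"op": "replace", "target": "Guess when unsure.", "text": "Say so when unsure."},
    {"op": "append", "text": "Check units."},
]
skill_v2 = apply_edits(skill_v1, edits)
```

Because the function returns a new string and leaves `skill_v1` untouched, the previous version stays available if the gate rejects the update.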
6. Gate (Validation)
The updated skill is evaluated on a selection split (analogous to a validation set). The update is accepted only if performance on that split improves; otherwise the previous skill is kept.
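The accept/reject decision can be sketched as a comparison of mean scores on the selection split. The `gate` signature, the `min_gain` margin, and `toy_score` are assumptions for illustration; a real scorer would run the target model on each item.

```python
from statistics import mean


def gate(current_skill, candidate_skill, selection_items, score_fn, min_gain=0.0):
    """Return (skill to keep, accepted flag) based on selection-split scores."""
    current_score = mean(score_fn(current_skill, item) for item in selection_items)
    candidate_score = mean(score_fn(candidate_skill, item) for item in selection_items)
    # Strict improvement: a tie keeps the current skill.
    accepted = candidate_score > current_score + min_gain
    return (candidate_skill if accepted else current_skill), accepted


def toy_score(skill, item):
    # Hypothetical scorer: an item "passes" if the skill mentions it.
    return 1.0 if item in skill else 0.0


selection_items = ["units", "citation"]
current = "Check units."
better = "Check units. Add a citation."
worse = "Be brief."

kept, accepted = gate(current, better, selection_items, toy_score)
rejected_skill, rejected_flag = gate(current, worse, selection_items, toy_score)
```

Requiring a strict gain means a candidate that merely ties the current skill is rejected, which keeps the document from drifting without measurable benefit.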
Epoch Boundary Mechanisms
Slow Update
At the end of each epoch, starting from epoch 2, the system performs a longitudinal comparison. It rolls out both the previous epoch's skill and the current skill on the same samples and categorizes each item as improved, regressed, persistent_fail, or stable_success. It then generates high-level guidance that is injected into the skill document. This guards against catastrophic forgetting of earlier improvements.
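The four categories follow directly from whether each item passed under the previous skill and under the current one. The sketch below assumes per-item scores keyed by item id and a hypothetical `pass_threshold` that separates success from failure; neither detail is specified in the source.

```python
def categorize(prev_scores, curr_scores, pass_threshold=0.5):
    """Bucket items by how their outcome changed between two skill versions."""
    buckets = {
        "improved": [],
        "regressed": [],
        "persistent_fail": [],
        "stable_success": [],
    }
    for item_id, prev in prev_scores.items():
        prev_ok = prev >= pass_threshold
        curr_ok = curr_scores[item_id] >= pass_threshold
        if curr_ok and not prev_ok:
            buckets["improved"].append(item_id)
        elif prev_ok and not curr_ok:
            buckets["regressed"].append(item_id)
        elif prev_ok and curr_ok:
            buckets["stable_success"].append(item_id)
        else:
            buckets["persistent_fail"].append(item_id)
    return buckets


prev_scores = {"a": 0.0, "b": 1.0, "c": 0.0, "d": 1.0}
curr_scores = {"a": 1.0, "b": 0.0, "c": 0.0, "d": 1.0}
buckets = categorize(prev_scores, curr_scores)
```

The regressed and persistent_fail buckets are the natural inputs for guidance, since they point at behavior that was lost or never learned.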
Meta Skill
A meta-skill memory accumulates high-level strategy notes across the entire training run. At the end of each epoch, the optimizer reflects on what changed between epochs and produces a compact memory that is provided as additional context during future reflection steps.
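A compact cross-epoch memory could be as simple as a bounded list of notes rendered into the reflection prompt. The class below is a hypothetical sketch: `MetaSkillMemory`, `max_notes`, and the rendering format are assumptions, and dropping the oldest note is only one possible way to keep the memory compact.

```python
from collections import deque


class MetaSkillMemory:
    """Bounded store of per-epoch strategy notes."""

    def __init__(self, max_notes=5):
        # deque(maxlen=...) silently drops the oldest note once full.
        self._notes = deque(maxlen=max_notes)

    def add(self, epoch, note):
        self._notes.append((epoch, note.strip()))

    def as_context(self):
        """Render the notes as text to prepend to a reflection prompt."""
        return "\n".join(f"[epoch {epoch}] {note}" for epoch, note in self._notes)


memory = MetaSkillMemory(max_notes=2)
memory.add(1, "Short rules generalize better than long examples.")
memory.add(2, "Unit checks fixed most numeric failures.")
memory.add(3, "Citation rules regress when placed last.")
context = memory.as_context()
```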