Files
microsoft-SkillOpt/docs/guide/training-loop.md
Cuzyoung 4a1b984d87 refactor: rename teacher/student to optimizer/target, remove best skills, fix slow update
- Rename teacher -> optimizer, student -> target across all code, configs, docs, prompts
- CLI: --teacher_model -> --optimizer_model, --student_model -> --target_model
- Remove best_skill files, keep only initial skills
- Fix slow update gate (force write into skill)
- Fix SLOW_UPDATE marker stripping
- Remove deep_reflect and meta_reflect mechanisms
- Update .env.example with export prefix and azure_cli docs
- Add endpoint empty validation in azure_openai.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-24 19:15:10 +00:00

3.8 KiB

The Training Loop

SkillOpt's core insight: optimizing natural-language skill documents follows the same structure as training neural networks.

Overview

┌─────────────────────────────────────────────────────────┐
│                    Training Loop                         │
│                                                         │
│  for epoch in epochs:                                   │
│    for step in steps:                                   │
│      1. Rollout   — Target executes tasks              │
│      2. Reflect   — Optimizer analyzes trajectories       │
│      3. Aggregate — Hierarchical merge of patches       │
│      4. Select    — Rank & clip edits (learning rate)   │
│      5. Update    — Apply patches to skill doc          │
│      6. Gate      — Validate & accept/reject            │
│                                                         │
│    Epoch Boundary:                                       │
│      • Slow Update (longitudinal comparison & guidance) │
│      • Meta Skill  (cross-epoch strategy memory)        │
└─────────────────────────────────────────────────────────┘

Stage Details

1. Rollout (Forward Pass)

The target model executes tasks using the current skill document as its prompt. Each task produces a trajectory and a score.

# Analogy: forward pass through the network
predictions = model(input, skill_document)
scores = evaluate(predictions, ground_truth)

2. Reflect (Backward Pass)

The optimizer model analyzes failed trajectories and produces edit patches — structured suggestions for improving the skill document.

Two modes:

  • Shallow: Analyze each trajectory independently
  • Deep: Cross-reference multiple failures to find systemic issues
# Analogy: computing gradients
gradients = loss.backward()  # → edit patches

3. Aggregate

Semantically similar edit patches are merged to avoid redundant edits.

4. Select (Gradient Clipping)

Edits are ranked by relevance score. The learning_rate parameter caps how many edits are applied per step — just like gradient clipping prevents overshooting.

# Analogy: gradient clipping + optimizer step size
selected = top_k(edits, k=learning_rate)

The lr_scheduler adjusts this over training:

  • cosine: Start aggressive, taper smoothly
  • linear: Linear decay
  • constant: Fixed rate

5. Update (Parameter Update)

Selected edits are applied to the skill document, producing a new version.

6. Gate (Validation)

The updated skill is evaluated on a selection split (analogous to a validation set). The update is only accepted if performance improves.

Epoch Boundary Mechanisms

Slow Update

At the end of each epoch (starting from epoch 2), the system performs a longitudinal comparison: it rolls out both the previous epoch's skill and the current skill on the same samples, categorizes items as improved/regressed/persistent_fail/stable_success, then generates high-level guidance that is injected into the skill document. This prevents catastrophic forgetting of earlier improvements.

Meta Skill

A meta-skill memory accumulates high-level strategy notes across the entire training run. At the end of each epoch, the optimizer reflects on what changed between epochs and produces a compact memory that is provided as additional context during future reflection steps.

Next Steps