# The Training Loop

SkillOpt's core insight: **optimizing natural-language skill documents follows the same structure as training neural networks**.

## Overview

```
┌─────────────────────────────────────────────────────────┐
│                      Training Loop                      │
│                                                         │
│  for epoch in epochs:                                   │
│    for step in steps:                                   │
│      1. Rollout   - Target executes tasks               │
│      2. Reflect   - Optimizer analyzes trajectories     │
│      3. Aggregate - Hierarchical merge of patches       │
│      4. Select    - Rank & clip edits (learning rate)   │
│      5. Update    - Apply patches to skill doc          │
│      6. Gate      - Validate & accept/reject            │
│                                                         │
│    Epoch Boundary:                                      │
│      • Slow Update (longitudinal comparison & guidance) │
│      • Meta Skill  (cross-epoch strategy memory)        │
└─────────────────────────────────────────────────────────┘
```

## Stage Details

### 1. Rollout (Forward Pass)

The **target** model executes tasks using the current skill document as its prompt. Each task produces a trajectory and a score.

```python
# Analogy: forward pass through the network
predictions = model(inputs, skill_document)  # `inputs` avoids shadowing the built-in `input`
scores = evaluate(predictions, ground_truth)
```

### 2. Reflect (Backward Pass)

The **optimizer** model analyzes failed trajectories and produces **edit patches** — structured suggestions for improving the skill document.

Two modes:

- **Shallow**: Analyze each trajectory independently
- **Deep**: Cross-reference multiple failures to find systemic issues

```python
# Analogy: computing gradients (in PyTorch, backward() returns None and fills .grad)
loss.backward()
gradients = [p.grad for p in model.parameters()]  # → edit patches
```
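
The difference between the two modes comes down to how failures are batched before they reach the optimizer. The sketch below illustrates that idea, assuming a failure is any trajectory that scores below a threshold. `Trajectory`, `EditPatch`, `reflect`, and the `analyze` callable are hypothetical names chosen for illustration, not SkillOpt's actual API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    task_id: str
    transcript: str
    score: float  # assumed convention: 1.0 is a full success


@dataclass
class EditPatch:
    suggestion: str
    evidence: List[str]  # task ids that motivated the suggestion


# Hypothetical stand-in for a call to the optimizer model.
Analyzer = Callable[[List[Trajectory]], List[EditPatch]]


def reflect(trajectories: List[Trajectory], analyze: Analyzer,
            mode: str = "shallow", fail_below: float = 1.0) -> List[EditPatch]:
    """Turn failed trajectories into edit patches."""
    failures = [t for t in trajectories if t.score < fail_below]
    if not failures:
        return []
    if mode == "shallow":
        # One optimizer call per failure: each trajectory is seen in isolation.
        return [patch for t in failures for patch in analyze([t])]
    if mode == "deep":
        # One optimizer call over all failures, so shared causes can surface.
        return analyze(failures)
    raise ValueError(f"unknown mode: {mode!r}")


def toy_analyzer(batch: List[Trajectory]) -> List[EditPatch]:
    # Toy analyzer: emits a single patch that cites every task in the batch.
    return [EditPatch("Clarify the output format", [t.task_id for t in batch])]


runs = [Trajectory("a", "...", 0.0), Trajectory("b", "...", 1.0),
        Trajectory("c", "...", 0.0)]
shallow = reflect(runs, toy_analyzer, mode="shallow")
deep = reflect(runs, toy_analyzer, mode="deep")
```

With the toy analyzer, shallow mode yields one patch per failed task, while deep mode yields a single patch whose evidence spans both failures.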

### 3. Aggregate

Semantically similar edit patches are merged to avoid redundant edits.
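
One way to picture the merge is a greedy pass that folds each patch into the first earlier patch it resembles. The sketch below assumes patches are plain suggestion strings and uses `difflib.SequenceMatcher` from the standard library as a purely lexical stand-in for semantic similarity. It is also a flat pass, not a hierarchical merge. The `aggregate` function and the `0.8` threshold are illustrative assumptions.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def similarity(a: str, b: str) -> float:
    # Lexical ratio in [0, 1]; a stand-in for a semantic similarity measure.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def aggregate(suggestions: List[str], threshold: float = 0.8) -> List[Tuple[str, int]]:
    """Greedily merge near-duplicate suggestions.

    Returns (representative, support) pairs, where support counts how many
    input suggestions were folded into the representative.
    """
    merged: List[List] = []  # each entry is [representative, support]
    for suggestion in suggestions:
        for entry in merged:
            if similarity(entry[0], suggestion) >= threshold:
                entry[1] += 1
                break
        else:
            # No existing representative was close enough: start a new group.
            merged.append([suggestion, 1])
    return [(rep, support) for rep, support in merged]


patches = [
    "Always validate the JSON schema before answering",
    "Always validate the JSON schema before answering.",
    "Cite the source document by title",
]
groups = aggregate(patches)
```

Keeping a support count alongside each representative is a design choice that lets a later ranking step favor edits suggested by several trajectories.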

### 4. Select (Gradient Clipping)

Edits are ranked by relevance score. The `learning_rate` parameter caps how many edits are applied per step, much as gradient clipping bounds the size of an update to prevent overshooting.

```python
# Analogy: gradient clipping + optimizer step size
selected = top_k(edits, k=learning_rate)
```
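
A concrete version of the `top_k` call above might look like the following. This is a minimal sketch that assumes each edit is a dict carrying a numeric `relevance` field and that `learning_rate` is an integer count. The `select_edits` name and the dict layout are hypothetical.

```python
from typing import Dict, List


def select_edits(edits: List[Dict], learning_rate: int) -> List[Dict]:
    """Keep the `learning_rate` highest-relevance edits, best first."""
    k = max(0, int(learning_rate))
    ranked = sorted(edits, key=lambda e: e["relevance"], reverse=True)
    return ranked[:k]


candidates = [
    {"id": "e1", "relevance": 0.42},
    {"id": "e2", "relevance": 0.91},
    {"id": "e3", "relevance": 0.67},
]
kept = select_edits(candidates, learning_rate=2)
```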

The `lr_scheduler` adjusts this over training:

- **cosine**: Start aggressive, taper smoothly
- **linear**: Linear decay
- **constant**: Fixed rate
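
The three schedules can be sketched as a function from the step index to an edit budget. The code below assumes the budget decays from `base_lr` to a floor of `min_lr` over `total_steps`. The function name, the `min_lr` floor, and the rounding are illustrative assumptions, not documented SkillOpt behavior.

```python
import math


def scheduled_lr(base_lr: int, step: int, total_steps: int,
                 scheduler: str = "cosine", min_lr: int = 1) -> int:
    """Edit budget for a 0-based `step` out of `total_steps`."""
    if scheduler not in ("cosine", "linear", "constant"):
        raise ValueError(f"unknown scheduler: {scheduler!r}")
    if scheduler == "constant" or total_steps <= 1:
        return base_lr
    # Fraction of training completed, clamped to [0, 1].
    progress = min(max(step, 0), total_steps - 1) / (total_steps - 1)
    if scheduler == "cosine":
        factor = 0.5 * (1.0 + math.cos(math.pi * progress))
    else:  # linear
        factor = 1.0 - progress
    return max(min_lr, round(min_lr + (base_lr - min_lr) * factor))


cosine = [scheduled_lr(8, s, 5, "cosine") for s in range(5)]
linear = [scheduled_lr(8, s, 5, "linear") for s in range(5)]
constant = [scheduled_lr(8, s, 5, "constant") for s in range(5)]
```

Both decaying schedules start at the full budget and end at the floor. They differ only in how quickly the budget shrinks in between.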

### 5. Update (Parameter Update)

Selected edits are applied to the skill document, producing a new version.
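
As an illustration of "producing a new version", the sketch below treats the skill document as immutable text with a version counter. The `SkillDoc` type and the `replace` and `append` operations are hypothetical, since the actual patch format is not specified here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SkillDoc:
    version: int
    text: str


def apply_edits(doc: SkillDoc, edits: List[Dict[str, str]]) -> SkillDoc:
    """Return a new SkillDoc with the edits applied; the input is unchanged."""
    text = doc.text
    for edit in edits:
        if edit["op"] == "replace":
            if edit["old"] not in text:
                continue  # stale edit: the text it targets is no longer present
            text = text.replace(edit["old"], edit["new"], 1)
        elif edit["op"] == "append":
            text = text.rstrip("\n") + "\n" + edit["new"] + "\n"
        else:
            raise ValueError(f"unknown op: {edit['op']!r}")
    return SkillDoc(version=doc.version + 1, text=text)


v0 = SkillDoc(version=0, text="# Skill\n- Answer briefly.\n")
v1 = apply_edits(v0, [
    {"op": "replace", "old": "briefly", "new": "concisely"},
    {"op": "append", "new": "- Cite sources."},
])
```

Returning a fresh object instead of mutating in place keeps the previous version available in case a later stage needs to fall back to it.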

### 6. Gate (Validation)

The updated skill is evaluated on a **selection split** (analogous to a validation set). The update is accepted only if performance improves; otherwise the previous version is kept.
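
The accept/reject decision can be sketched as a comparison of mean scores on the selection split. The `gate` function, the `score_fn` callable, and the strict-improvement rule below are assumptions made for illustration. A real run would score documents by rolling out the target model.

```python
from statistics import mean
from typing import Callable, Sequence, Tuple


def gate(current_doc: str, candidate_doc: str, selection_split: Sequence[str],
         score_fn: Callable[[str, str], float]) -> Tuple[str, bool]:
    """Return (document to keep, whether the candidate was accepted)."""
    current = mean(score_fn(current_doc, item) for item in selection_split)
    candidate = mean(score_fn(candidate_doc, item) for item in selection_split)
    accepted = candidate > current  # ties are rejected: no improvement, no change
    return (candidate_doc if accepted else current_doc), accepted


def keyword_score(doc: str, item: str) -> float:
    # Toy scorer: an item "passes" if its keyword appears in the document.
    return 1.0 if item in doc else 0.0


split = ["json", "cite"]
current = "Reply in json."
kept_doc, accepted = gate(current, "Reply in json. Always cite sources.", split, keyword_score)
rejected_doc, rejected = gate(current, "Reply in prose.", split, keyword_score)
```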

## Epoch Boundary Mechanisms

### Slow Update

At the end of each epoch, starting from epoch 2, the system performs a **longitudinal comparison**. It rolls out both the previous epoch's skill and the current skill on the same samples and categorizes each item as `improved`, `regressed`, `persistent_fail`, or `stable_success`. From these categories it generates high-level **guidance** that is injected into the skill document, which guards against catastrophic forgetting of earlier improvements.
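
The four categories fall out of comparing pass/fail status before and after. The sketch below assumes per-item scores keyed by item id and a single pass threshold. The `categorize` function and the threshold are hypothetical.

```python
from typing import Dict, List


def categorize(prev_scores: Dict[str, float], curr_scores: Dict[str, float],
               pass_threshold: float = 1.0) -> Dict[str, List[str]]:
    """Bucket items by how their pass/fail status changed between two skills."""
    buckets: Dict[str, List[str]] = {
        "improved": [], "regressed": [], "persistent_fail": [], "stable_success": [],
    }
    # Only items rolled out under both skills can be compared.
    for item_id in sorted(prev_scores.keys() & curr_scores.keys()):
        was_ok = prev_scores[item_id] >= pass_threshold
        is_ok = curr_scores[item_id] >= pass_threshold
        if was_ok and is_ok:
            key = "stable_success"
        elif is_ok:
            key = "improved"
        elif was_ok:
            key = "regressed"
        else:
            key = "persistent_fail"
        buckets[key].append(item_id)
    return buckets


prev = {"t1": 1.0, "t2": 0.0, "t3": 1.0, "t4": 0.0}
curr = {"t1": 1.0, "t2": 1.0, "t3": 0.0, "t4": 0.0}
report = categorize(prev, curr)
```

Under this reading, the `regressed` and `persistent_fail` buckets are the natural inputs for guidance, since they mark where earlier gains were lost or never achieved.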

### Meta Skill

A **meta-skill memory** accumulates high-level strategy notes across the entire training run. At the end of each epoch, the optimizer reflects on what changed between epochs and produces a compact memory that is provided as additional context during future reflection steps.
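
A minimal sketch of such a memory is a bounded list of per-epoch notes rendered as extra context. Here compactness is approximated by dropping the oldest notes, which is an assumption; the description above has the optimizer itself produce the compact memory. `MetaSkillMemory` and its methods are hypothetical names.

```python
from collections import deque
from typing import Deque


class MetaSkillMemory:
    """Keeps the most recent strategy notes, one per epoch."""

    def __init__(self, max_notes: int = 5) -> None:
        # A deque with maxlen silently discards the oldest note when full.
        self._notes: Deque[str] = deque(maxlen=max_notes)

    def add_epoch_note(self, epoch: int, note: str) -> None:
        self._notes.append(f"[epoch {epoch}] {note.strip()}")

    def as_context(self) -> str:
        """Render the memory as text to prepend to a reflection prompt."""
        return "\n".join(self._notes)


memory = MetaSkillMemory(max_notes=2)
memory.add_epoch_note(1, "Format fixes helped; long examples hurt.")
memory.add_epoch_note(2, "Edits to the tool-use section regress easily.")
memory.add_epoch_note(3, "Prefer small, targeted rewrites.")
context = memory.as_context()
```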

## Next Steps

- [Understand Skill Documents](skill-document.md)
- [DL ↔ SkillOpt analogy table](dl-analogy.md)