mirror of
https://github.com/microsoft/SkillOpt.git
synced 2026-07-03 14:02:58 +08:00
- Rename teacher -> optimizer, student -> target across all code, configs, docs, prompts - CLI: --teacher_model -> --optimizer_model, --student_model -> --target_model - Remove best_skill files, keep only initial skills - Fix slow update gate (force write into skill) - Fix SLOW_UPDATE marker stripping - Remove deep_reflect and meta_reflect mechanisms - Update .env.example with export prefix and azure_cli docs - Add endpoint empty validation in azure_openai.py Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
93 lines
3.8 KiB
Markdown
93 lines
3.8 KiB
Markdown
# The Training Loop
|
|
|
|
SkillOpt's core insight: **optimizing natural-language skill documents follows the same structure as training neural networks**.
|
|
|
|
## Overview
|
|
|
|
```
|
|
┌─────────────────────────────────────────────────────────┐
|
|
│ Training Loop │
|
|
│ │
|
|
│ for epoch in epochs: │
|
|
│ for step in steps: │
|
|
│ 1. Rollout — Target executes tasks │
|
|
│ 2. Reflect — Optimizer analyzes trajectories │
|
|
│ 3. Aggregate — Hierarchical merge of patches │
|
|
│ 4. Select — Rank & clip edits (learning rate) │
|
|
│ 5. Update — Apply patches to skill doc │
|
|
│ 6. Gate — Validate & accept/reject │
|
|
│ │
|
|
│ Epoch Boundary: │
|
|
│ • Slow Update (longitudinal comparison & guidance) │
|
|
│ • Meta Skill (cross-epoch strategy memory) │
|
|
└─────────────────────────────────────────────────────────┘
|
|
```
|
|
|
|
## Stage Details
|
|
|
|
### 1. Rollout (Forward Pass)
|
|
|
|
The **target** model executes tasks using the current skill document as its prompt. Each task produces a trajectory and a score.
|
|
|
|
```python
|
|
# Analogy: forward pass through the network
|
|
predictions = model(input, skill_document)
|
|
scores = evaluate(predictions, ground_truth)
|
|
```
|
|
|
|
### 2. Reflect (Backward Pass)
|
|
|
|
The **optimizer** model analyzes failed trajectories and produces **edit patches** — structured suggestions for improving the skill document.
|
|
|
|
Two modes:
|
|
|
|
- **Shallow**: Analyze each trajectory independently
|
|
- **Deep**: Cross-reference multiple failures to find systemic issues
|
|
|
|
```python
|
|
# Analogy: computing gradients
|
|
gradients = loss.backward() # → edit patches
|
|
```
|
|
|
|
### 3. Aggregate
|
|
|
|
Semantically similar edit patches are merged to avoid redundant edits.
|
|
|
|
### 4. Select (Gradient Clipping)
|
|
|
|
Edits are ranked by relevance score. The `learning_rate` parameter caps how many edits are applied per step — just like gradient clipping prevents overshooting.
|
|
|
|
```python
|
|
# Analogy: gradient clipping + optimizer step size
|
|
selected = top_k(edits, k=learning_rate)
|
|
```
|
|
|
|
The `lr_scheduler` adjusts this over training:
|
|
|
|
- **cosine**: Start aggressive, taper smoothly
|
|
- **linear**: Linear decay
|
|
- **constant**: Fixed rate
|
|
|
|
### 5. Update (Parameter Update)
|
|
|
|
Selected edits are applied to the skill document, producing a new version.
|
|
|
|
### 6. Gate (Validation)
|
|
|
|
The updated skill is evaluated on a **selection split** (analogous to a validation set). The update is only accepted if performance improves.
|
|
|
|
## Epoch Boundary Mechanisms
|
|
|
|
### Slow Update
|
|
|
|
At the end of each epoch (starting from epoch 2), the system performs a **longitudinal comparison**: it rolls out both the previous epoch's skill and the current skill on the same samples, categorizes items as improved/regressed/persistent_fail/stable_success, then generates high-level **guidance** that is injected into the skill document. This prevents catastrophic forgetting of earlier improvements.
|
|
|
|
### Meta Skill
|
|
|
|
A **meta-skill memory** accumulates high-level strategy notes across the entire training run. At the end of each epoch, the optimizer reflects on what changed between epochs and produces a compact memory that is provided as additional context during future reflection steps.
|
|
|
|
## Next Steps
|
|
|
|
- [Understand Skill Documents](skill-document.md)
|
|
- [DL ↔ SkillOpt analogy table](dl-analogy.md)
|