mirror of
https://github.com/microsoft/SkillOpt.git
synced 2026-07-03 14:02:58 +08:00
feat(sleep): experience replay + dream rollouts in the cycle (opt-in)

Wires two consolidation mechanisms, plus an optional synthetic-augmentation
knob, into the shipped nightly cycle. All default OFF, so existing behavior is
unchanged:

- dream_rollouts (>1): multi-rollout contrastive reflection per task
- recall_k (>0): associative recall of the K most-similar past tasks (from a
  capped task_archive persisted in state.json) into tonight's dream
- dream_factor (>0): synthetic task variants

New shared engine module skillopt_sleep/dream.py (recall_similar, dream_augment,
dream_consolidate) is called by both the plugin cycle and the experiment harness,
so reported numbers exercise the exact shipped code. Built on the existing
rollouts_k/sample_id support already in consolidate.py/rollout.py.

Validated (5 nights x 10 real tasks/night, full held-out test, GPT-5.5, gated):
the gain scales with recall depth on a clean signal:
SearchQA recall_k=10 +3.1, recall_k=20 +4.5, full-history reference +5.6;
SpreadsheetBench (nano, gate-free) +3.6. Flat within noise on saturated/noisy
cells. See docs/sleep/EXPERIENCE_REPLAY.md (+ raw runs under blog_runs/v2_port/).

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
64
docs/sleep/EXPERIENCE_REPLAY.md
Normal file
@@ -0,0 +1,64 @@
# SkillOpt-Sleep: experience replay & dream rollouts (opt-in)

Two opt-in mechanisms, dream rollouts and associative recall (experience
replay), plus an optional synthetic-augmentation knob, strengthen the nightly
consolidation when your tasks have a clean correctness signal. All three knobs
default **off**, so behavior is unchanged until you enable one.

## What they do

| Config knob | Default | Effect |
|---|---|---|
| `dream_rollouts` | `1` | Run each task **K** times and learn from the *contrast* between the good and bad attempts (contrastive reflection) instead of a single failure. |
| `recall_k` | `0` | **Associative recall**: each night, pull the `K` past tasks most similar to tonight's new ones (from a persisted task archive) into the dream, so related experience is revisited without replaying the whole history. |
| `dream_factor` | `0` | Add `N` lightweight synthetic variants of each task to the training pool. |

The validation gate still governs what ships: these knobs only *enlarge the
signal the optimizer reflects on*, and the held-out gate decides what is kept.
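
To make "most similar" concrete, here is a minimal sketch of lexical recall, modelled on `recall_similar` in `skillopt_sleep/dream.py` from this commit but simplified to plain strings. The helper names (`tokens`, `jaccard`, `recall`) and the sample intents are illustrative assumptions, not part of the shipped API.

```python
import re
from typing import List, Set, Tuple


def tokens(text: str) -> Set[str]:
    # Lowercase alphanumeric tokens longer than two characters.
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 2}


def jaccard(a: Set[str], b: Set[str]) -> float:
    # |a ∩ b| / |a ∪ b|, defined as 0.0 when both sets are empty.
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recall(new_intents: List[str], past_intents: List[str], k: int) -> List[Tuple[float, str]]:
    """Rank past intents by their best overlap with any of tonight's intents."""
    new_tok = [tokens(q) for q in new_intents]
    scored = []
    for past in past_intents:
        pt = tokens(past)
        if not pt or not new_tok:
            continue
        sim = max(jaccard(pt, nt) for nt in new_tok)
        if sim > 0.0:  # tasks with no shared token are never recalled
            scored.append((sim, past))
    # Highest similarity first; ties broken by text so the order is deterministic.
    scored.sort(key=lambda x: (-x[0], x[1]))
    return scored[:max(0, k)]


# Illustrative data only.
past = [
    "sum the sales column by region",
    "rename the header row",
    "translate this email to French",
]
tonight = ["sum the revenue column by quarter"]
for sim, intent in recall(tonight, past, k=2):
    print(f"{sim:.2f}  {intent}")
```

Because the measure is purely lexical, common words such as "the" count as overlap; a stop-word list would be one possible refinement, though the shipped code does not use one.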

## How to enable

```jsonc
// ~/.skillopt-sleep/config.json (or pass via the plugin's config)
{
  "dream_rollouts": 5,   // contrastive dreaming
  "recall_k": 20,        // recall ~20 similar past tasks each night
  "gate_mode": "on"      // keep the gate on (recommended)
}
```

`recall_k` draws from a capped `task_archive` that the cycle persists in
`state.json`, so recall becomes useful from the second night onward (once there
is history to recall from).
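
As a rough sketch of how these knobs resolve, the snippet below overlays a user config on the three new defaults and coerces the values the same way the cycle hunk in this commit does (`int(cfg.get(...) or default)`). `DREAM_DEFAULTS`, `resolve_dream_knobs` and `is_opted_in` are hypothetical names used only for illustration.

```python
from typing import Any, Dict

# The three new defaults, copied from the DEFAULTS hunk in this commit.
DREAM_DEFAULTS: Dict[str, Any] = {"dream_rollouts": 1, "dream_factor": 0, "recall_k": 0}


def resolve_dream_knobs(user_cfg: Dict[str, Any]) -> Dict[str, int]:
    """Hypothetical helper: overlay user config on the defaults and coerce to int."""
    cfg = {**DREAM_DEFAULTS, **user_cfg}
    return {
        # `or` falls back to the default when a value is None, 0 or "".
        "recall_k": int(cfg.get("recall_k", 0) or 0),
        "dream_rollouts": int(cfg.get("dream_rollouts", 1) or 1),
        "dream_factor": int(cfg.get("dream_factor", 0) or 0),
    }


def is_opted_in(knobs: Dict[str, int]) -> bool:
    # Any knob away from its default changes what the nightly cycle does.
    return knobs["recall_k"] > 0 or knobs["dream_rollouts"] > 1 or knobs["dream_factor"] > 0


print(resolve_dream_knobs({}))
print(resolve_dream_knobs({"dream_rollouts": 5, "recall_k": 20}))
```

Note that `dream_rollouts: 0` coerces back to `1` under this scheme, so a single rollout is the floor.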

## Measured effect

Deployment protocol (5 nights × 10 new real tasks/night, full held-out test
sets, GPT-5.5 optimizer), run through the **same engine the plugin executes**
(`skillopt_sleep.dream.dream_consolidate`). Each Δ is in points on the test
set, relative to that run's own night-0 baseline.

**SearchQA (GPT-5.5, full 1,400-item test, gated): the gain scales with recall depth.**

| Config | Δ vs baseline (points) |
|---|---|
| `recall_k=10, dream_rollouts=5` | +3.1 |
| `recall_k=10, dream_rollouts=8` | +3.7 |
| **`recall_k=20, dream_rollouts=5`** | **+4.5** |
| full-history replay (reference) | +5.6 |

**Second-benchmark confirmation** (SpreadsheetBench, GPT-5.4-nano, gate-free,
shipped path): 0.279 → **0.314 (+3.6)**.
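
The reported deltas can be re-derived from the raw runs. The sketch below hard-codes the `test_baseline` and `test_final` values from the JSON files added in this commit and converts each difference to points; the mapping of table rows to file names is my reading of the `retrieve_k`, `rollouts` and `gate` fields, not something the doc states explicitly.

```python
# (test_baseline, test_final) copied from docs/sleep/blog_runs/v2_port/*.json
RUNS = {
    "recall_k=10, dream_rollouts=5": (0.8021, 0.8336),     # parity_sq_g55_gate.json
    "recall_k=10, dream_rollouts=8": (0.7979, 0.8350),     # imp_rollouts8_gate.json
    "recall_k=20, dream_rollouts=5": (0.8029, 0.8479),     # imp_recall20_gate.json
    "full-history replay": (0.7957, 0.8514),               # imp_cumulative_gate.json
    "SpreadsheetBench nano, gate-free": (0.2786, 0.3143),  # conf_ss_nano_free.json
}


def delta_points(baseline: float, final: float) -> float:
    """Accuracy gain in points (1 point = 0.01 absolute)."""
    return (final - baseline) * 100


DELTAS = {name: delta_points(b, f) for name, (b, f) in RUNS.items()}
for name, d in DELTAS.items():
    print(f"{name:35s} {d:+.2f}")
```

The first row works out to 3.15 points, which the table reports as +3.1. Each run has its own night-0 baseline (0.7957 to 0.8029 on SearchQA), so the deltas are not measured from one shared starting point.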

## When it helps, and when it doesn't

- **Helps** when tasks recur and have a checkable correctness signal (the
  optimizer has something real to learn and the gate can verify it).
- **Roughly flat** on saturated or noisy tasks (e.g. a strong model already near
  ceiling): within run-to-run noise (±1–2 points, single seed).
- The validation gate keeps the downside bounded; keep it on by default.

Trade-off: `dream_rollouts > 1` multiplies the per-night rollout cost (K×), and
`recall_k > 0` adds the recalled tasks to each night's replay. Since the cycle
runs offline on idle quota this is usually acceptable, but budget accordingly
(`budget_tokens` / `budget_seconds`).

Raw per-run results for the table above: `docs/sleep/blog_runs/v2_port/`.
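
For budgeting, the trade-off described above can be estimated with a back-of-the-envelope calculation. This is a sketch under stated assumptions: every task in the training pool is rolled out `dream_rollouts` times, recall returns at most `recall_k` tasks (fewer if the archive is small or nothing overlaps), and validation rollouts and token counts are not modelled. `estimate_train_rollouts` is a hypothetical helper, not part of the package.

```python
def estimate_train_rollouts(
    n_new_train: int,
    n_archive: int,
    recall_k: int = 0,
    dream_factor: int = 0,
    dream_rollouts: int = 1,
) -> int:
    """Upper-bound estimate of training rollouts for one night (assumptions above)."""
    # Recall cannot return more tasks than the archive holds.
    recalled = min(recall_k, n_archive) if recall_k > 0 else 0
    seeds = n_new_train + recalled
    # Each non-dream training task gets `dream_factor` synthetic variants.
    synthetic = seeds * max(0, dream_factor)
    return (seeds + synthetic) * max(1, dream_rollouts)


# Night 1: empty archive, so recall contributes nothing yet.
print(estimate_train_rollouts(10, n_archive=0, recall_k=20, dream_rollouts=5))
# A later night with 40 archived tasks.
print(estimate_train_rollouts(10, n_archive=40, recall_k=20, dream_rollouts=5))
```

Comparing the result against the default single-shot loop (`estimate_train_rollouts(10, 0)`) gives a rough multiplier to apply when setting `budget_tokens` / `budget_seconds`.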
94
docs/sleep/blog_runs/v2_port/conf_ss_nano_free.json
Normal file
@@ -0,0 +1,94 @@
{
  "experiment": "skillopt-sleep/nightly",
  "model": "gpt-5.4-nano",
  "results": [
    {
      "benchmark": "spreadsheet",
      "gate": "off",
      "replay_mode": "retrieval",
      "retrieve_k": 10,
      "nights": 5,
      "per_night": 10,
      "rollouts": 5,
      "n_val": 40,
      "n_test": 280,
      "test_baseline": 0.2786,
      "test_final": 0.3143,
      "delta": 0.0357,
      "progression": [
        0.2786,
        0.3036,
        0.3143,
        0.3107,
        0.3179,
        0.3143
      ],
      "nights_log": [
        {
          "night": 0,
          "n_train": 0,
          "test_hard": 0.2786,
          "action": "baseline",
          "accepted": false
        },
        {
          "night": 1,
          "n_train": 10,
          "n_replayed": 0,
          "n_dream": 20,
          "val_hard": 0.0,
          "test_hard": 0.3036,
          "action": "greedy_applied",
          "accepted": true,
          "n_edits": 4
        },
        {
          "night": 2,
          "n_train": 20,
          "n_replayed": 10,
          "n_dream": 40,
          "val_hard": 0.0,
          "test_hard": 0.3143,
          "action": "greedy_applied",
          "accepted": true,
          "n_edits": 4
        },
        {
          "night": 3,
          "n_train": 30,
          "n_replayed": 10,
          "n_dream": 40,
          "val_hard": 0.0,
          "test_hard": 0.3107,
          "action": "greedy_applied",
          "accepted": true,
          "n_edits": 4
        },
        {
          "night": 4,
          "n_train": 40,
          "n_replayed": 10,
          "n_dream": 40,
          "val_hard": 0.0,
          "test_hard": 0.3179,
          "action": "greedy_applied",
          "accepted": true,
          "n_edits": 4
        },
        {
          "night": 5,
          "n_train": 50,
          "n_replayed": 10,
          "n_dream": 40,
          "val_hard": 0.0,
          "test_hard": 0.3143,
          "action": "greedy_applied",
          "accepted": true,
          "n_edits": 4
        }
      ],
      "tokens": 13587597,
      "final_skill_tail": "t/headers rather than hardcoding specific cell coordinates or values.\n- When searching for specific text, use an exact match check on the cell string, e.g. `if cell_value == \"Georgia Its Tax\": ...` (not partial regex, not truncated comparisons).\n- If a cell contains multiple tokens separated by semicolons, split and normalize before comparing: `parts = [p.strip() for p in str(cell_value).split(';') if p.strip()]` and then test membership/lookup using `parts`.\n<!-- SKILLOPT-SLEEP:LEARNED END -->\n"
    }
  ]
}
94
docs/sleep/blog_runs/v2_port/imp_cumulative_gate.json
Normal file
@@ -0,0 +1,94 @@
{
  "experiment": "skillopt-sleep/nightly",
  "model": "gpt-5.5",
  "results": [
    {
      "benchmark": "searchqa",
      "gate": "on",
      "replay_mode": "cumulative",
      "retrieve_k": 0,
      "nights": 5,
      "per_night": 10,
      "rollouts": 5,
      "n_val": 60,
      "n_test": 1400,
      "test_baseline": 0.7957,
      "test_final": 0.8514,
      "delta": 0.0557,
      "progression": [
        0.7957,
        0.8336,
        0.8514,
        0.8514,
        0.8514,
        0.8514
      ],
      "nights_log": [
        {
          "night": 0,
          "n_train": 0,
          "test_hard": 0.7957,
          "action": "baseline",
          "accepted": false
        },
        {
          "night": 1,
          "n_train": 10,
          "n_replayed": 0,
          "n_dream": 20,
          "val_hard": 0.85,
          "test_hard": 0.8336,
          "action": "accept_new_best",
          "accepted": true,
          "n_edits": 2
        },
        {
          "night": 2,
          "n_train": 20,
          "n_replayed": 10,
          "n_dream": 40,
          "val_hard": 0.9,
          "test_hard": 0.8514,
          "action": "accept_new_best",
          "accepted": true,
          "n_edits": 3
        },
        {
          "night": 3,
          "n_train": 30,
          "n_replayed": 20,
          "n_dream": 60,
          "val_hard": 0.9,
          "test_hard": 0.8514,
          "action": "reject",
          "accepted": false,
          "n_edits": 0
        },
        {
          "night": 4,
          "n_train": 40,
          "n_replayed": 30,
          "n_dream": 80,
          "val_hard": 0.9,
          "test_hard": 0.8514,
          "action": "reject",
          "accepted": false,
          "n_edits": 0
        },
        {
          "night": 5,
          "n_train": 50,
          "n_replayed": 40,
          "n_dream": 100,
          "val_hard": 0.9,
          "test_hard": 0.8514,
          "action": "reject",
          "accepted": false,
          "n_edits": 0
        }
      ],
      "tokens": 15132599,
      "final_skill_tail": " the title or key sentence over a county, institution, or category.\n- Return the shortest exact answer span that satisfies the question, inside <answer>...</answer>; prefer a single-word entity when sufficient.\n- Do not expand a context-supported short name into a fuller name unless the question specifically requires the full name.\n- Match the requested answer type exactly: for a country/nation answer, output only the country name, not a title or role phrase.\n<!-- SKILLOPT-SLEEP:LEARNED END -->\n"
    }
  ]
}
94
docs/sleep/blog_runs/v2_port/imp_recall20_gate.json
Normal file
@@ -0,0 +1,94 @@
{
  "experiment": "skillopt-sleep/nightly",
  "model": "gpt-5.5",
  "results": [
    {
      "benchmark": "searchqa",
      "gate": "on",
      "replay_mode": "retrieval",
      "retrieve_k": 20,
      "nights": 5,
      "per_night": 10,
      "rollouts": 5,
      "n_val": 60,
      "n_test": 1400,
      "test_baseline": 0.8029,
      "test_final": 0.8479,
      "delta": 0.045,
      "progression": [
        0.8029,
        0.8236,
        0.8236,
        0.8479,
        0.8479,
        0.8479
      ],
      "nights_log": [
        {
          "night": 0,
          "n_train": 0,
          "test_hard": 0.8029,
          "action": "baseline",
          "accepted": false
        },
        {
          "night": 1,
          "n_train": 10,
          "n_replayed": 0,
          "n_dream": 20,
          "val_hard": 0.8667,
          "test_hard": 0.8236,
          "action": "accept_new_best",
          "accepted": true,
          "n_edits": 2
        },
        {
          "night": 2,
          "n_train": 20,
          "n_replayed": 10,
          "n_dream": 40,
          "val_hard": 0.8667,
          "test_hard": 0.8236,
          "action": "reject",
          "accepted": false,
          "n_edits": 0
        },
        {
          "night": 3,
          "n_train": 30,
          "n_replayed": 20,
          "n_dream": 60,
          "val_hard": 0.8833,
          "test_hard": 0.8479,
          "action": "accept_new_best",
          "accepted": true,
          "n_edits": 3
        },
        {
          "night": 4,
          "n_train": 40,
          "n_replayed": 20,
          "n_dream": 60,
          "val_hard": 0.8833,
          "test_hard": 0.8479,
          "action": "reject",
          "accepted": false,
          "n_edits": 0
        },
        {
          "night": 5,
          "n_train": 50,
          "n_replayed": 20,
          "n_dream": 60,
          "val_hard": 0.8833,
          "test_hard": 0.8479,
          "action": "reject",
          "accepted": false,
          "n_edits": 0
        }
      ],
      "tokens": 15596999,
      "final_skill_tail": " Put only the shortest exact answer span in the final '<answer>...</answer>' tags; remove extra descriptors, categories, titles, and surrounding words.\n- If the question asks for a country/place from a phrase like 'King of Spain' or a title like 'Ferdinand VII of Spain', answer only the place name, e.g. 'Spain'.\n- For person answers, use the minimal unambiguous name supported by the clue; do not expand a surname to a full name unless the question requires it.\n<!-- SKILLOPT-SLEEP:LEARNED END -->\n"
    }
  ]
}
94
docs/sleep/blog_runs/v2_port/imp_rollouts8_gate.json
Normal file
@@ -0,0 +1,94 @@
{
  "experiment": "skillopt-sleep/nightly",
  "model": "gpt-5.5",
  "results": [
    {
      "benchmark": "searchqa",
      "gate": "on",
      "replay_mode": "retrieval",
      "retrieve_k": 10,
      "nights": 5,
      "per_night": 10,
      "rollouts": 8,
      "n_val": 60,
      "n_test": 1400,
      "test_baseline": 0.7979,
      "test_final": 0.835,
      "delta": 0.0371,
      "progression": [
        0.7979,
        0.8179,
        0.835,
        0.835,
        0.835,
        0.835
      ],
      "nights_log": [
        {
          "night": 0,
          "n_train": 0,
          "test_hard": 0.7979,
          "action": "baseline",
          "accepted": false
        },
        {
          "night": 1,
          "n_train": 10,
          "n_replayed": 0,
          "n_dream": 20,
          "val_hard": 0.8667,
          "test_hard": 0.8179,
          "action": "accept_new_best",
          "accepted": true,
          "n_edits": 2
        },
        {
          "night": 2,
          "n_train": 20,
          "n_replayed": 10,
          "n_dream": 40,
          "val_hard": 0.8833,
          "test_hard": 0.835,
          "action": "accept_new_best",
          "accepted": true,
          "n_edits": 3
        },
        {
          "night": 3,
          "n_train": 30,
          "n_replayed": 10,
          "n_dream": 40,
          "val_hard": 0.8833,
          "test_hard": 0.835,
          "action": "reject",
          "accepted": false,
          "n_edits": 0
        },
        {
          "night": 4,
          "n_train": 40,
          "n_replayed": 10,
          "n_dream": 40,
          "val_hard": 0.8833,
          "test_hard": 0.835,
          "action": "reject",
          "accepted": false,
          "n_edits": 0
        },
        {
          "night": 5,
          "n_train": 50,
          "n_replayed": 10,
          "n_dream": 40,
          "val_hard": 0.8833,
          "test_hard": 0.835,
          "action": "reject",
          "accepted": false,
          "n_edits": 0
        }
      ],
      "tokens": 16846499,
      "final_skill_tail": "less the question asks for the title itself.\n- Always put only the final answer in \"<answer>...</answer>\" and keep it \"concise -- typically a few words or a short phrase\".\n- Use the shortest sufficient answer span; do not add first names, modifiers, counties, countries, or parent locations unless explicitly required.\n- Match the question’s granularity exactly: if it asks for a state, give only the state; if it asks for a term’s meaning, give only the meaning.\n<!-- SKILLOPT-SLEEP:LEARNED END -->\n"
    }
  ]
}
94
docs/sleep/blog_runs/v2_port/parity_sq_g55_free.json
Normal file
@@ -0,0 +1,94 @@
{
  "experiment": "skillopt-sleep/nightly",
  "model": "gpt-5.5",
  "results": [
    {
      "benchmark": "searchqa",
      "gate": "off",
      "replay_mode": "retrieval",
      "retrieve_k": 10,
      "nights": 5,
      "per_night": 10,
      "rollouts": 5,
      "n_val": 60,
      "n_test": 1400,
      "test_baseline": 0.8079,
      "test_final": 0.8393,
      "delta": 0.0314,
      "progression": [
        0.8079,
        0.8321,
        0.84,
        0.8436,
        0.84,
        0.8393
      ],
      "nights_log": [
        {
          "night": 0,
          "n_train": 0,
          "test_hard": 0.8079,
          "action": "baseline",
          "accepted": false
        },
        {
          "night": 1,
          "n_train": 10,
          "n_replayed": 0,
          "n_dream": 20,
          "val_hard": 0.0,
          "test_hard": 0.8321,
          "action": "greedy_applied",
          "accepted": true,
          "n_edits": 3
        },
        {
          "night": 2,
          "n_train": 20,
          "n_replayed": 10,
          "n_dream": 40,
          "val_hard": 0.0,
          "test_hard": 0.84,
          "action": "greedy_applied",
          "accepted": true,
          "n_edits": 1
        },
        {
          "night": 3,
          "n_train": 30,
          "n_replayed": 10,
          "n_dream": 40,
          "val_hard": 0.0,
          "test_hard": 0.8436,
          "action": "greedy_applied",
          "accepted": true,
          "n_edits": 2
        },
        {
          "night": 4,
          "n_train": 40,
          "n_replayed": 10,
          "n_dream": 40,
          "val_hard": 0.0,
          "test_hard": 0.84,
          "action": "greedy_applied",
          "accepted": true,
          "n_edits": 3
        },
        {
          "night": 5,
          "n_train": 50,
          "n_replayed": 10,
          "n_dream": 40,
          "val_hard": 0.0,
          "test_hard": 0.8393,
          "action": "greedy_applied",
          "accepted": true,
          "n_edits": 2
        }
      ],
      "tokens": 27990836,
      "final_skill_tail": "Sultan of Brunei\".\n- For author/creator questions from titles like \"Trees by Joyce Kilmer\", output only the creator name, e.g. \"Joyce Kilmer\", not the work title.\n- Do not introduce diacritics or alternate spellings not present in the context/title; prefer the ASCII surface form such as \"Vaclav Havel\" over \"Václav Havel\".\n- Return the full canonical entity name from the context/title, including hyphens, e.g. \"Winnie-the-Pooh\" rather than the shortened \"Pooh\".\n<!-- SKILLOPT-SLEEP:LEARNED END -->\n"
    }
  ]
}
94
docs/sleep/blog_runs/v2_port/parity_sq_g55_gate.json
Normal file
@@ -0,0 +1,94 @@
{
  "experiment": "skillopt-sleep/nightly",
  "model": "gpt-5.5",
  "results": [
    {
      "benchmark": "searchqa",
      "gate": "on",
      "replay_mode": "retrieval",
      "retrieve_k": 10,
      "nights": 5,
      "per_night": 10,
      "rollouts": 5,
      "n_val": 60,
      "n_test": 1400,
      "test_baseline": 0.8021,
      "test_final": 0.8336,
      "delta": 0.0315,
      "progression": [
        0.8021,
        0.83,
        0.8336,
        0.8336,
        0.8336,
        0.8336
      ],
      "nights_log": [
        {
          "night": 0,
          "n_train": 0,
          "test_hard": 0.8021,
          "action": "baseline",
          "accepted": false
        },
        {
          "night": 1,
          "n_train": 10,
          "n_replayed": 0,
          "n_dream": 20,
          "val_hard": 0.8667,
          "test_hard": 0.83,
          "action": "accept_new_best",
          "accepted": true,
          "n_edits": 4
        },
        {
          "night": 2,
          "n_train": 20,
          "n_replayed": 10,
          "n_dream": 40,
          "val_hard": 0.9,
          "test_hard": 0.8336,
          "action": "accept_new_best",
          "accepted": true,
          "n_edits": 4
        },
        {
          "night": 3,
          "n_train": 30,
          "n_replayed": 10,
          "n_dream": 40,
          "val_hard": 0.9,
          "test_hard": 0.8336,
          "action": "reject",
          "accepted": false,
          "n_edits": 0
        },
        {
          "night": 4,
          "n_train": 40,
          "n_replayed": 10,
          "n_dream": 40,
          "val_hard": 0.9,
          "test_hard": 0.8336,
          "action": "reject",
          "accepted": false,
          "n_edits": 0
        },
        {
          "night": 5,
          "n_train": 50,
          "n_replayed": 10,
          "n_dream": 40,
          "val_hard": 0.9,
          "test_hard": 0.8336,
          "action": "reject",
          "accepted": false,
          "n_edits": 0
        }
      ],
      "tokens": 15946118,
      "final_skill_tail": "roperty; do not substitute a broader category or page title.\n- For location questions asking for a state/country, output only that level, e.g. \"Maryland\", not the full hierarchy \"Baltimore County, Maryland, United States\".\n- For name-part questions such as surname/last name, output only that part, e.g. \"Genet\", not the full name \"Jean Genet\".\n- Put only the concise final answer inside \"<answer>...</answer>\"; avoid extra modifiers, lists, or explanatory words.\n<!-- SKILLOPT-SLEEP:LEARNED END -->\n"
    }
  ]
}
@@ -44,6 +44,10 @@ DEFAULTS: Dict[str, Any] = {
     "gate_metric": "mixed",  # hard | soft | mixed (mixed best for tiny holdouts)
     "gate_mixed_weight": 0.5,
     "replay_mode": "mock",  # "mock" (sandboxed prompt) | "fresh" (worktree)
+    # ── dream + recall (opt-in; defaults reproduce the prior single-shot loop) ─
+    "dream_rollouts": 1,  # >1 => multi-rollout contrastive reflection per task
+    "dream_factor": 0,  # >0 => add N synthetic variants of each task to the dream
+    "recall_k": 0,  # >0 => recall the K most-similar past tasks into the dream
     "evolve_memory": True,  # consolidate CLAUDE.md
     "evolve_skill": True,  # consolidate the managed SKILL.md
     "llm_mine": True,  # use the backend to mine checkable tasks (real backends)
@@ -15,7 +15,7 @@ from typing import List, Optional

 from skillopt_sleep.backend import get_backend
 from skillopt_sleep.config import SleepConfig, load_config
-from skillopt_sleep.consolidate import consolidate
+from skillopt_sleep.dream import dream_consolidate
 from skillopt_sleep.harvest_sources import harvest_for_config
 from skillopt_sleep.memory import ensure_skill_scaffold
 from skillopt_sleep.mine import mine
@@ -167,9 +167,21 @@ def run_sleep_cycle(
         staging_dir = ""
         return CycleOutcome(report, staging_dir, False, [])

-    # ── 3+4. replay + consolidate (gate) ─────────────────────────────────
-    result = consolidate(
+    # ── 3+4. replay + consolidate (gate), with opt-in dream + recall ──────
+    # recall pulls similar past tasks from the persisted archive; dream_rollouts
+    # / dream_factor enrich the training signal. With the defaults (recall_k=0,
+    # dream_rollouts=1, dream_factor=0) this is exactly the prior single-shot
+    # consolidate: behavior is unchanged unless the user opts in.
+    recall_k = int(cfg.get("recall_k", 0) or 0)
+    history_tasks = []
+    if recall_k > 0:
+        history_tasks = [TaskRecord.from_dict(d) for d in state.task_archive()]
+    result = dream_consolidate(
         backend, tasks, skill, memory,
+        history_tasks=history_tasks,
+        recall_k=recall_k,
+        dream_rollouts=int(cfg.get("dream_rollouts", 1) or 1),
+        dream_factor=int(cfg.get("dream_factor", 0) or 0),
         edit_budget=cfg.get("edit_budget", 4),
         gate_metric=cfg.get("gate_metric", "mixed"),
         gate_mixed_weight=cfg.get("gate_mixed_weight", 0.5),
@@ -178,6 +190,8 @@ def run_sleep_cycle(
         evolve_memory=cfg.get("evolve_memory", True),
         night=night,
     )
+    # archive tonight's real (non-dream) tasks so future nights can recall them
+    state.add_to_archive([t.to_dict() for t in tasks if t.origin != "dream"])

     report.n_replayed = len(tasks)
     report.baseline_score = result.baseline_score
138
skillopt_sleep/dream.py
Normal file
@@ -0,0 +1,138 @@
"""SkillOpt-Sleep: dream + associative recall for nightly consolidation.

Two opt-in mechanisms (both default OFF, so the cycle is unchanged unless the
user enables them) that the deployment experiments validated:

  * dream rollouts: run each task K times and learn from the good-vs-bad
    contrast (set ``dream_rollouts > 1``). Stronger signal than one failure.
  * associative recall: each night, pull the K past tasks most similar to
    tonight's new ones into the dream (set ``recall_k > 0``). Replays relevant
    experience without re-running the whole history.

``dream_consolidate`` wires recall + synthetic augmentation + multi-rollout
consolidation and is called by BOTH the shipped plugin cycle and the benchmark
experiment harness, so the reported numbers exercise the exact code the plugin
runs. Pure-stdlib, zero research/private dependency.
"""
from __future__ import annotations

import re
from typing import List, Optional

from skillopt_sleep.consolidate import ConsolidationResult, consolidate
from skillopt_sleep.types import TaskRecord


# ── synthetic augmentation ("dream up" variants of today's tasks) ─────────────

_WRAPPERS = [
    "(quick one) {q}",
    "Please handle this request: {q}",
    "For the daily report: {q}",
]


def dream_augment(real_tasks: List[TaskRecord], *, factor: int = 1) -> List[TaskRecord]:
    """Create synthetic TRAIN variants of real tasks (origin='dream').

    A light, deterministic rephrasing. Dream tasks are training-only: they
    carry split='train' and never enter the val/test slices the gate scores on.
    """
    out: List[TaskRecord] = []
    for t in real_tasks:
        for k in range(max(0, factor)):
            w = _WRAPPERS[k % len(_WRAPPERS)]
            out.append(TaskRecord(
                id=f"{t.id}_dream{k}", project=t.project,
                intent=w.format(q=t.intent), context_excerpt=t.context_excerpt,
                reference_kind=t.reference_kind, reference=t.reference,
                judge=dict(t.judge), system=t.system,
                tags=list(t.tags) + ["dream"], split="train",
                origin="dream", derived_from=t.id,
            ))
    return out


# ── associative recall (experience replay of similar past tasks) ──────────────

def _tokens(text: str) -> set:
    return {w for w in re.findall(r"[a-z0-9]+", (text or "").lower()) if len(w) > 2}


def recall_similar(new_tasks: List[TaskRecord], history: List[TaskRecord],
                   k: int) -> List[TaskRecord]:
    """Return the ``k`` historical tasks most lexically similar to any of
    tonight's ``new_tasks`` (max Jaccard token overlap). Recalled tasks are
    returned as training material (split='train'); deterministic, stdlib-only.
    """
    if not history or k <= 0 or not new_tasks:
        return []
    new_tok = [_tokens(t.intent) for t in new_tasks]
    new_ids = {t.id for t in new_tasks}
    scored = []
    for h in history:
        if h.id in new_ids:
            continue
        ht = _tokens(h.intent)
        if not ht:
            continue
        sim = max(((len(ht & nt) / len(ht | nt)) if (ht | nt) else 0.0) for nt in new_tok)
        scored.append((sim, h.id, h))
    scored.sort(key=lambda x: (-x[0], x[1]))
    out = []
    for sim, _id, h in scored[:max(0, k)]:
        if sim <= 0.0:
            break
        # recall as training material; copy so the source archive is untouched
        out.append(TaskRecord(
            id=f"recall:{h.id}", project=h.project, intent=h.intent,
            context_excerpt=h.context_excerpt, reference_kind=h.reference_kind,
            reference=h.reference, judge=dict(h.judge), system=h.system,
            tags=list(h.tags) + ["recall"], split="train", origin="real",
            derived_from=h.id,
        ))
    return out


# ── the shared nightly consolidation step ─────────────────────────────────────

def dream_consolidate(
    backend,
    tasks: List[TaskRecord],
    skill: str,
    memory: str,
    *,
    history_tasks: Optional[List[TaskRecord]] = None,
    recall_k: int = 0,
    dream_rollouts: int = 1,
    dream_factor: int = 0,
    edit_budget: int = 4,
    gate_metric: str = "mixed",
    gate_mixed_weight: float = 0.5,
    gate_mode: str = "on",
    evolve_skill: bool = True,
    evolve_memory: bool = True,
    night: int = 1,
) -> ConsolidationResult:
    """Recall similar past experience + dream synthetic variants, then run one
    gated consolidation epoch over the enlarged training pool.

    ``tasks`` is the split-tagged pool for tonight (train + val); recall and
    augmentation only enlarge the TRAIN split, so the val slice the gate scores
    on is never polluted. With ``recall_k=0``, ``dream_factor=0`` and
    ``dream_rollouts=1`` (the defaults) this is the previous ``consolidate``.
    """
    train = [t for t in tasks if t.split == "train"]
    enlarged = list(tasks)
    if recall_k > 0 and history_tasks:
        enlarged += recall_similar(train, history_tasks, recall_k)
    if dream_factor > 0:
        seed = [t for t in enlarged if t.split == "train" and t.origin != "dream"]
        enlarged += dream_augment(seed, factor=dream_factor)
    return consolidate(
        backend, enlarged, skill, memory,
        edit_budget=edit_budget, gate_metric=gate_metric,
        gate_mixed_weight=gate_mixed_weight, gate_mode=gate_mode,
        rollouts_k=dream_rollouts, evolve_skill=evolve_skill,
        evolve_memory=evolve_memory, night=night,
    )
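
To illustrate how `dream_consolidate` enlarges only the training split, here is a self-contained sketch that mirrors its pool-building steps with a stripped-down task type. `Task`, `augment` and `enlarge` are hypothetical stand-ins (the real code uses `TaskRecord`, `dream_augment` and an inline block), and the recalled task is supplied by hand instead of coming from `recall_similar`.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Task:
    """Hypothetical stand-in for TaskRecord with only the fields used here."""
    id: str
    intent: str
    split: str = "train"
    origin: str = "real"


# Same wrapper templates as _WRAPPERS in dream.py.
WRAPPERS = [
    "(quick one) {q}",
    "Please handle this request: {q}",
    "For the daily report: {q}",
]


def augment(seed: List[Task], factor: int) -> List[Task]:
    # One synthetic train-only variant per (task, k), cycling through the wrappers.
    out = []
    for t in seed:
        for k in range(max(0, factor)):
            wrapper = WRAPPERS[k % len(WRAPPERS)]
            out.append(Task(id=f"{t.id}_dream{k}", intent=wrapper.format(q=t.intent),
                            split="train", origin="dream"))
    return out


def enlarge(tasks: List[Task], recalled: List[Task], dream_factor: int) -> List[Task]:
    enlarged = list(tasks) + list(recalled)
    if dream_factor > 0:
        # Only non-dream training tasks (new or recalled) seed synthetic variants.
        seed = [t for t in enlarged if t.split == "train" and t.origin != "dream"]
        enlarged += augment(seed, dream_factor)
    return enlarged


tonight = [Task("t1", "sum column B"), Task("v1", "count rows", split="val")]
recalled = [Task("recall:h7", "sum column A")]
pool = enlarge(tonight, recalled, dream_factor=2)
for t in pool:
    print(t.split, t.origin, t.id, "|", t.intent)
```

The validation task passes through untouched and never seeds a variant, which is the property the docstring relies on when it says the gate's val slice is not polluted.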
@@ -28,6 +28,7 @@ DEFAULT_STATE: Dict[str, Any] = {
     "last_harvest": {},  # project -> iso timestamp of last harvested record
     "slow_memory": "",  # cross-night consolidated lessons (meta-skill analogue)
     "history": [],  # list of per-night summaries
+    "task_archive": [],  # capped list of past mined tasks (for associative recall)
 }
@@ -81,3 +82,15 @@ class SleepState:

     def record_night(self, summary: Dict[str, Any]) -> None:
         self.data.setdefault("history", []).append(summary)
+
+    # ── task archive (associative-recall memory) ──────────────────────────
+    def task_archive(self) -> list:
+        """Past mined tasks as plain dicts (newest last)."""
+        return list(self.data.get("task_archive", []))
+
+    def add_to_archive(self, task_dicts: list, cap: int = 300) -> None:
+        """Append tonight's tasks; keep only the most recent ``cap``."""
+        arc = self.data.setdefault("task_archive", [])
+        arc.extend(task_dicts)
+        if len(arc) > cap:
+            self.data["task_archive"] = arc[-cap:]
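
The capping behaviour can be checked in isolation. The sketch below copies the two archive methods into a minimal class (`ArchiveState` is a hypothetical stand-in for `SleepState`, without persistence) and uses a small `cap` so the eviction is easy to see.

```python
from typing import Any, Dict, List


class ArchiveState:
    """Hypothetical stand-in for SleepState: only the archive methods, no disk I/O."""

    def __init__(self) -> None:
        self.data: Dict[str, Any] = {"task_archive": []}

    def task_archive(self) -> List[dict]:
        # Return a copy so callers cannot mutate the stored list.
        return list(self.data.get("task_archive", []))

    def add_to_archive(self, task_dicts: List[dict], cap: int = 300) -> None:
        arc = self.data.setdefault("task_archive", [])
        arc.extend(task_dicts)
        if len(arc) > cap:
            # Keep only the newest `cap` entries (the oldest are dropped first).
            self.data["task_archive"] = arc[-cap:]


state = ArchiveState()
for night in range(1, 8):  # seven nights, ten tasks each
    state.add_to_archive([{"id": f"n{night}_t{i}"} for i in range(10)], cap=30)

archive = state.task_archive()
print(len(archive), archive[0]["id"], archive[-1]["id"])
```

At the experiments' rate of 10 new tasks per night, the default `cap=300` would retain roughly the last 30 nights. Entries are not de-duplicated, so a task that is mined again on a later night occupies a second slot.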