Mirror of https://github.com/microsoft/SkillOpt.git, synced 2026-07-03 14:02:58 +08:00
docs(sleep): trim RESULTS to the headline results (remove the full grid)
Remove the per-cell full deployment grid section; keep the gate-safety stress test, experience-replay scaling + night-by-night climb, the dream-diversity ablation, the gbrain end-to-end result, and the scope/limitations. Renumber sections; update the README pointer accordingly.
@@ -53,8 +53,8 @@ correctness signal; the validation gate still governs what ships.

 ## Results

-> 📊 **Full study — the complete 18-cell deployment grid, replay-policy ablations,
-> night-by-night progression, the gate-safety stress test, and analysis — is in
+> 📊 **More results & analysis — the gate-safety stress test, experience-replay
+> scaling, and the dream-diversity ablation — are in
 > [`docs/sleep/RESULTS.md`](RESULTS.md).** The highlights:

 **Protocol (identical for every row below).** 5 nights × 10 new real "today" tasks
@@ -51,65 +51,7 @@ argument for SkillOpt-Sleep's design, and why the gate ships **on by default**.

 ---

-## 2. The full deployment grid (shipped config) — every cell, every night
-
-All 18 cells (3 benchmarks × 3 targets × {gate-free, gated}) in the shipped
-configuration (fixed dream rollouts + associative recall), shown **night by
-night** — N0 is the held-out baseline, N5 (or N4) is the final shipped skill.
-Nothing omitted.
-
-#### SearchQA — 1,400-item held-out test, SQuAD exact-match
-
-| Target | Mode | N0 | N1 | N2 | N3 | N4 | N5 | Δ |
-|---|---|---|---|---|---|---|---|---|
-| GPT-5.5 | gate-free | 0.799 | 0.831 | 0.783 | 0.845 | 0.852 | 0.850 | **+5.1** |
-| GPT-5.5 | gated | 0.797 | 0.836 | 0.841 | 0.841 | 0.841 | 0.841 | **+4.4** |
-| GPT-5.4-mini | gate-free | 0.776 | 0.789 | 0.779 | 0.771 | 0.774 | 0.762 | −1.4 |
-| GPT-5.4-mini | gated | 0.776 | 0.775 | 0.796 | 0.790 | 0.790 | 0.790 | **+1.4** |
-| GPT-5.4-nano | gate-free | 0.557 | 0.624 | 0.562 | 0.566 | 0.571 | 0.563 | +0.6 |
-| GPT-5.4-nano | gated | 0.554 | 0.554 | 0.535 | 0.535 | 0.535 | 0.535 | −1.9 |
-
-#### LiveMathematicianBench — 124-item held-out test, multiple-choice label
-
-| Target | Mode | N0 | N1 | N2 | N3 | N4 | Δ |
-|---|---|---|---|---|---|---|---|
-| GPT-5.5 | gate-free | 0.508 | 0.532 | 0.565 | 0.524 | 0.508 | +0.0 |
-| GPT-5.5 | gated | 0.548 | 0.548 | 0.548 | 0.548 | 0.540 | −0.8 |
-| GPT-5.4-mini | gate-free | 0.266 | 0.258 | 0.218 | 0.258 | 0.242 | −2.4 |
-| GPT-5.4-mini | gated | 0.234 | 0.234 | 0.218 | 0.218 | 0.218 | −1.6 |
-| GPT-5.4-nano | gate-free | 0.161 | 0.218 | 0.202 | 0.202 | 0.194 | **+3.2** |
-| GPT-5.4-nano | gated | 0.202 | 0.202 | 0.202 | 0.202 | 0.202 | −0.0 |
-
-<sub>LiveMath's training split has fewer than 50 tasks, so at 10 new tasks/night it completes 4 nights (N0–N4).</sub>
-
-#### SpreadsheetBench — 280-item held-out test, executed-code cell-value compare
-
-| Target | Mode | N0 | N1 | N2 | N3 | N4 | N5 | Δ |
-|---|---|---|---|---|---|---|---|---|
-| GPT-5.5 | gate-free | 0.650 | 0.639 | 0.639 | 0.539 | 0.646 | 0.639 | −1.1 |
-| GPT-5.5 | gated | 0.636 | 0.636 | 0.636 | 0.618 | 0.618 | 0.618 | −1.8 |
-| GPT-5.4-mini | gate-free | 0.339 | 0.336 | 0.329 | 0.346 | 0.318 | 0.343 | +0.4 |
-| GPT-5.4-mini | gated | 0.339 | 0.339 | 0.339 | 0.339 | 0.339 | 0.339 | +0.0 |
-| GPT-5.4-nano | gate-free | 0.293 | 0.300 | 0.293 | 0.293 | 0.296 | 0.339 | **+4.6** |
-| GPT-5.4-nano | gated | 0.318 | 0.318 | 0.325 | 0.325 | 0.325 | 0.325 | +0.7 |
-
-**Aggregate over all 18 cells: mean Δ +0.5, range [−2.4, +5.1]; 7 cells improve >+0.5,
-none worse than −2.4 with the gate-bounded column.**
-
-**Analysis.** Gains concentrate exactly where theory predicts — tasks with a
-**clean, checkable correctness signal and real headroom**: SearchQA on GPT-5.5
-(+5.1 / +4.4), SpreadsheetBench on the weak nano model (+4.6), LiveMath on nano
-(+3.2). Where the signal is **noisy or the model is already near ceiling**
-(LiveMath / SpreadsheetBench on strong GPT-5.5), the trajectories sit flat inside
-run-to-run noise. The night-by-night columns also show the gains are **stable, not
-lucky single readings** — gated cells reach a level and hold it (e.g. SearchQA
-GPT-5.5 0.841 from N2 on; SpreadsheetBench mini 0.339 throughout). Critically, the
-**gated worst case is −2.4** (bounded), whereas Section 1 showed the *ungated*
-worst case is unbounded (−52.8).
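
The Δ column and the aggregate line above can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes Δ is the final-night score minus N0, expressed in percentage points, and it copies the Δ values from the three tables. The names `DELTAS`, `delta_points`, `summarize`, and `worst_by_mode` are hypothetical and are not part of SkillOpt.

```python
from statistics import mean

# Δ values in percentage points, copied from the Δ column of the three tables.
# Keys are (benchmark, target, mode). All names here are illustrative.
DELTAS = {
    ("SearchQA", "GPT-5.5", "gate-free"): 5.1,
    ("SearchQA", "GPT-5.5", "gated"): 4.4,
    ("SearchQA", "GPT-5.4-mini", "gate-free"): -1.4,
    ("SearchQA", "GPT-5.4-mini", "gated"): 1.4,
    ("SearchQA", "GPT-5.4-nano", "gate-free"): 0.6,
    ("SearchQA", "GPT-5.4-nano", "gated"): -1.9,
    ("LiveMath", "GPT-5.5", "gate-free"): 0.0,
    ("LiveMath", "GPT-5.5", "gated"): -0.8,
    ("LiveMath", "GPT-5.4-mini", "gate-free"): -2.4,
    ("LiveMath", "GPT-5.4-mini", "gated"): -1.6,
    ("LiveMath", "GPT-5.4-nano", "gate-free"): 3.2,
    ("LiveMath", "GPT-5.4-nano", "gated"): 0.0,
    ("SpreadsheetBench", "GPT-5.5", "gate-free"): -1.1,
    ("SpreadsheetBench", "GPT-5.5", "gated"): -1.8,
    ("SpreadsheetBench", "GPT-5.4-mini", "gate-free"): 0.4,
    ("SpreadsheetBench", "GPT-5.4-mini", "gated"): 0.0,
    ("SpreadsheetBench", "GPT-5.4-nano", "gate-free"): 4.6,
    ("SpreadsheetBench", "GPT-5.4-nano", "gated"): 0.7,
}


def delta_points(n0: float, final: float) -> float:
    """Assumed definition of Δ: (final score - N0 score) in percentage points."""
    return round((final - n0) * 100, 1)


def summarize(deltas: dict, threshold: float = 0.5) -> dict:
    """Mean, range, and count of cells improving by more than `threshold` points."""
    values = list(deltas.values())
    return {
        "mean": round(mean(values), 1),
        "min": min(values),
        "max": max(values),
        "improved": sum(v > threshold for v in values),
    }


def worst_by_mode(deltas: dict) -> dict:
    """Lowest Δ seen in each mode (gate-free vs gated)."""
    worst = {}
    for (_benchmark, _target, mode), value in deltas.items():
        worst[mode] = min(value, worst.get(mode, value))
    return worst


if __name__ == "__main__":
    print(delta_points(0.799, 0.850))  # SearchQA, GPT-5.5, gate-free
    print(summarize(DELTAS))
    print(worst_by_mode(DELTAS))
```

On the tabulated values, `summarize(DELTAS)` gives a mean of +0.5, a range of [−2.4, +5.1], and 7 cells above +0.5, which matches the aggregate line. `worst_by_mode(DELTAS)` places the −2.4 cell in the gate-free rows (LiveMathematicianBench, GPT-5.4-mini) and gives −1.9 as the lowest gated cell. Recomputing Δ from the rounded N0 and final scores can differ from the table by 0.1 (for example LiveMathematicianBench nano, gate-free), presumably because the table was computed from unrounded scores.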
-
----
-
-## 3. Experience replay turns a one-time bump into a climb
+## 2. Experience replay turns a one-time bump into a climb

 The plugin's two opt-in knobs (`recall_k`, `dream_rollouts`) are what produce the
 gains. On the cleanest signal — **SearchQA, GPT-5.5, gated** — the gain rises
@@ -141,13 +83,14 @@ Recall captures most of cumulative's benefit at a fraction of the per-night cost

 ---

-## 4. Why these gains exist — the dream-diversity fix (and a rigor note)
+## 3. Why these gains exist — the dream-diversity fix (and a rigor note)

 Reflection learns from the **contrast** between good and bad rollouts of the same
 task, which requires the K dream rollouts to be *independent samples*. An early
 version of the engine collapsed them to one cached sample, so contrastive
-reflection never fired. Fixing that, then adding recall, is exactly what produced
-the grid above. The same 18-cell grid under three engine configurations:
+reflection never fired. Fixing that, then adding recall, is what produces the
+gains in Sections 1–2. Measured across an 18-cell deployment sweep (3 benchmarks ×
+3 targets × 2 modes), under three engine configurations:

 | Engine configuration | mean Δ | worst-cell Δ | cells > +0.5 | cells < −0.5 |
 |---|---|---|---|---|
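
The failure mode described in this hunk, K dream rollouts collapsing to one cached sample, can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the engine's code: `sample_rollout`, `cached_rollout_by_task`, and `contrast` are hypothetical names, and a seeded random number stands in for a real rollout score.

```python
import random
from functools import lru_cache


def sample_rollout(task_id: str, seed: int) -> float:
    """Hypothetical stand-in for one dream rollout; returns a score in [0, 1)."""
    rng = random.Random(f"{task_id}:{seed}")
    return rng.random()


@lru_cache(maxsize=None)
def cached_rollout_by_task(task_id: str) -> float:
    # Problematic pattern: the cache key ignores the sample index,
    # so every request for this task returns the same rollout.
    return sample_rollout(task_id, seed=0)


def collapsed_rollouts(task_id: str, k: int) -> list:
    """K 'rollouts' that are really one cached sample repeated K times."""
    return [cached_rollout_by_task(task_id) for _ in range(k)]


def independent_rollouts(task_id: str, k: int) -> list:
    """K rollouts drawn with distinct seeds, so they can differ."""
    return [sample_rollout(task_id, seed=i) for i in range(k)]


def contrast(scores: list) -> float:
    """Gap between the best and worst rollout; zero leaves nothing to contrast."""
    return max(scores) - min(scores)


if __name__ == "__main__":
    print(contrast(collapsed_rollouts("task-1", 4)))    # always 0.0
    print(contrast(independent_rollouts("task-1", 4)))  # positive when samples differ
```

With the collapsed variant the best and worst rollouts are identical, so a reflection step that compares good and bad attempts on the same task has no signal to work with. Keying each sample on its own index or seed restores the spread.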
@@ -164,7 +107,7 @@ slips through.

 ---

-## 5. End-to-end on real agents
+## 4. End-to-end on real agents

 On the public [gbrain-evals](https://github.com/garrytan/gbrain-evals) `skillopt-v1`
 benchmark — designed for exactly this learnable-gap setting — deficient seed skills
@@ -174,7 +117,7 @@ cross-verify each other's consolidated skills.

 ---

-## 6. Honest scope & limitations
+## 5. Honest scope & limitations

 - **Where it helps:** recurring tasks with a checkable correctness signal and real
 headroom. That is the plugin's actual use case (your repeated daily tasks and