docs(sleep): trim RESULTS to the headline results (remove the full grid)

Remove the per-cell full deployment grid section; keep the gate-safety stress
test, experience-replay scaling + night-by-night climb, the dream-diversity
ablation, the gbrain end-to-end result, and the scope/limitations. Renumber
sections; update the README pointer accordingly.
Yifan Yang
2026-06-15 17:08:51 +00:00
parent d43e8dba1a
commit 46b3207b96
2 changed files with 9 additions and 66 deletions


@@ -53,8 +53,8 @@ correctness signal; the validation gate still governs what ships.
## Results
> 📊 **Full study — the complete 18-cell deployment grid, replay-policy ablations,
> night-by-night progression, the gate-safety stress test, and analysis — is in
> 📊 **More results & analysis — the gate-safety stress test, experience-replay
> scaling, and the dream-diversity ablation — are in
> [`docs/sleep/RESULTS.md`](RESULTS.md).** The highlights:
**Protocol (identical for every row below).** 5 nights × 10 new real "today" tasks


@@ -51,65 +51,7 @@ argument for SkillOpt-Sleep's design, and why the gate ships **on by default**.
---
## 2. The full deployment grid (shipped config) — every cell, every night
All 18 cells (3 benchmarks × 3 targets × {gate-free, gated}) in the shipped
configuration (fixed dream rollouts + associative recall), shown **night by
night** — N0 is the held-out baseline, N5 (or N4) is the final shipped skill.
Nothing omitted.
#### SearchQA — 1,400-item held-out test, SQuAD exact-match
| Target | Mode | N0 | N1 | N2 | N3 | N4 | N5 | Δ |
|---|---|---|---|---|---|---|---|---|
| GPT-5.5 | gate-free | 0.799 | 0.831 | 0.783 | 0.845 | 0.852 | 0.850 | **+5.1** |
| GPT-5.5 | gated | 0.797 | 0.836 | 0.841 | 0.841 | 0.841 | 0.841 | **+4.4** |
| GPT-5.4-mini | gate-free | 0.776 | 0.789 | 0.779 | 0.771 | 0.774 | 0.762 | −1.4 |
| GPT-5.4-mini | gated | 0.776 | 0.775 | 0.796 | 0.790 | 0.790 | 0.790 | **+1.4** |
| GPT-5.4-nano | gate-free | 0.557 | 0.624 | 0.562 | 0.566 | 0.571 | 0.563 | +0.6 |
| GPT-5.4-nano | gated | 0.554 | 0.554 | 0.535 | 0.535 | 0.535 | 0.535 | −1.9 |
#### LiveMathematicianBench — 124-item held-out test, multiple-choice label
| Target | Mode | N0 | N1 | N2 | N3 | N4 | Δ |
|---|---|---|---|---|---|---|---|
| GPT-5.5 | gate-free | 0.508 | 0.532 | 0.565 | 0.524 | 0.508 | +0.0 |
| GPT-5.5 | gated | 0.548 | 0.548 | 0.548 | 0.548 | 0.540 | −0.8 |
| GPT-5.4-mini | gate-free | 0.266 | 0.258 | 0.218 | 0.258 | 0.242 | −2.4 |
| GPT-5.4-mini | gated | 0.234 | 0.234 | 0.218 | 0.218 | 0.218 | −1.6 |
| GPT-5.4-nano | gate-free | 0.161 | 0.218 | 0.202 | 0.202 | 0.194 | **+3.2** |
| GPT-5.4-nano | gated | 0.202 | 0.202 | 0.202 | 0.202 | 0.202 | 0.0 |
<sub>LiveMath's training split has fewer than 50 tasks, so at 10 new tasks/night it completes 4 nights (N0–N4).</sub>
#### SpreadsheetBench — 280-item held-out test, executed-code cell-value compare
| Target | Mode | N0 | N1 | N2 | N3 | N4 | N5 | Δ |
|---|---|---|---|---|---|---|---|---|
| GPT-5.5 | gate-free | 0.650 | 0.639 | 0.639 | 0.539 | 0.646 | 0.639 | −1.1 |
| GPT-5.5 | gated | 0.636 | 0.636 | 0.636 | 0.618 | 0.618 | 0.618 | −1.8 |
| GPT-5.4-mini | gate-free | 0.339 | 0.336 | 0.329 | 0.346 | 0.318 | 0.343 | +0.4 |
| GPT-5.4-mini | gated | 0.339 | 0.339 | 0.339 | 0.339 | 0.339 | 0.339 | +0.0 |
| GPT-5.4-nano | gate-free | 0.293 | 0.300 | 0.293 | 0.293 | 0.296 | 0.339 | **+4.6** |
| GPT-5.4-nano | gated | 0.318 | 0.318 | 0.325 | 0.325 | 0.325 | 0.325 | +0.7 |
**Aggregate over all 18 cells: mean Δ +0.5, range [−2.4, +5.1]; 7 cells improve >+0.5,
none worse than −2.4 with the gate-bounded column.**
**Analysis.** Gains concentrate exactly where theory predicts — tasks with a
**clean, checkable correctness signal and real headroom**: SearchQA on GPT-5.5
(+5.1 / +4.4), SpreadsheetBench on the weak nano model (+4.6), LiveMath on nano
(+3.2). Where the signal is **noisy or the model is already near ceiling**
(LiveMath / SpreadsheetBench on strong GPT-5.5), the trajectories sit flat inside
run-to-run noise. The night-by-night columns also show the gains are **stable, not
lucky single readings** — gated cells reach a level and hold it (e.g. SearchQA
GPT-5.5 0.841 from N2 on; SpreadsheetBench mini 0.339 throughout). Critically, the
**gated worst case is −2.4** (bounded), whereas Section 1 showed the *ungated*
worst case is unbounded (−52.8).
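
The Δ column and the aggregate line can be rechecked from the N0 and final-night columns. The following is a minimal sketch in Python, not part of the SkillOpt-Sleep codebase: the names `CELLS` and `delta_points` are illustrative, and the scores are transcribed by hand from the three tables above.

```python
# Held-out accuracy as (N0, final night), transcribed from the tables above.
# Row order per benchmark: GPT-5.5, GPT-5.4-mini, GPT-5.4-nano,
# each listed gate-free first, then gated.
CELLS = {
    "SearchQA": [
        (0.799, 0.850), (0.797, 0.841),
        (0.776, 0.762), (0.776, 0.790),
        (0.557, 0.563), (0.554, 0.535),
    ],
    "LiveMathematicianBench": [  # final night is N4 for this benchmark
        (0.508, 0.508), (0.548, 0.540),
        (0.266, 0.242), (0.234, 0.218),
        (0.161, 0.194), (0.202, 0.202),
    ],
    "SpreadsheetBench": [
        (0.650, 0.639), (0.636, 0.618),
        (0.339, 0.343), (0.339, 0.339),
        (0.293, 0.339), (0.318, 0.325),
    ],
}


def delta_points(first: float, last: float) -> float:
    """Signed change from baseline to final night, in percentage points."""
    return (last - first) * 100.0


deltas = [delta_points(a, b) for rows in CELLS.values() for a, b in rows]
mean_delta = sum(deltas) / len(deltas)
improved = sum(d > 0.5 for d in deltas)  # cells that gain more than +0.5
worst, best = min(deltas), max(deltas)

print(f"cells={len(deltas)} mean={mean_delta:+.1f} "
      f"range=[{worst:+.1f}, {best:+.1f}] improved={improved}")
```

Because the inputs are already rounded to three decimals, a recomputed Δ can differ from the table by 0.1 (LiveMath nano gate-free comes out as +3.3 here against the table's +3.2); the mean, range, and count of improving cells still agree with the aggregate line.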
---
## 3. Experience replay turns a one-time bump into a climb
## 2. Experience replay turns a one-time bump into a climb
The plugin's two opt-in knobs (`recall_k`, `dream_rollouts`) are what produce the
gains. On the cleanest signal — **SearchQA, GPT-5.5, gated** — the gain rises
@@ -141,13 +83,14 @@ Recall captures most of cumulative's benefit at a fraction of the per-night cost
---
## 4. Why these gains exist — the dream-diversity fix (and a rigor note)
## 3. Why these gains exist — the dream-diversity fix (and a rigor note)
Reflection learns from the **contrast** between good and bad rollouts of the same
task, which requires the K dream rollouts to be *independent samples*. An early
version of the engine collapsed them to one cached sample, so contrastive
reflection never fired. Fixing that, then adding recall, is exactly what produced
the grid above. The same 18-cell grid under three engine configurations:
reflection never fired. Fixing that, then adding recall, is what produces the
gains in Sections 1–2. Measured across an 18-cell deployment sweep (3 benchmarks ×
3 targets × 2 modes), under three engine configurations:
| Engine configuration | mean Δ | worst-cell Δ | cells > +0.5 | cells < −0.5 |
|---|---|---|---|---|
@@ -164,7 +107,7 @@ slips through.
---
## 5. End-to-end on real agents
## 4. End-to-end on real agents
On the public [gbrain-evals](https://github.com/garrytan/gbrain-evals) `skillopt-v1`
benchmark — designed for exactly this learnable-gap setting — deficient seed skills
@@ -174,7 +117,7 @@ cross-verify each other's consolidated skills.
---
## 6. Honest scope & limitations
## 5. Honest scope & limitations
- **Where it helps:** recurring tasks with a checkable correctness signal and real
headroom. That is the plugin's actual use case (your repeated daily tasks and