docs(sleep): trim RESULTS to the headline results (remove the full grid)

Remove the per-cell full deployment grid section; keep the gate-safety stress test, experience-replay scaling + night-by-night climb, the dream-diversity ablation, the gbrain end-to-end result, and the scope/limitations. Renumber sections; update the README pointer accordingly.
2026-07-03 14:02:58 +08:00 · 2026-06-15 17:08:51 +00:00
parent d43e8dba1a
commit 46b3207b96
2 changed files with 9 additions and 66 deletions
--- a/docs/sleep/README.md
+++ b/docs/sleep/README.md
@@ -53,8 +53,8 @@ correctness signal; the validation gate still governs what ships.

 ## Results

-> 📊 **Full study — the complete 18-cell deployment grid, replay-policy ablations,
-> night-by-night progression, the gate-safety stress test, and analysis — is in
+> 📊 **More results & analysis — the gate-safety stress test, experience-replay
+> scaling, and the dream-diversity ablation — are in
 > [`docs/sleep/RESULTS.md`](RESULTS.md).** The highlights:

 **Protocol (identical for every row below).** 5 nights × 10 new real "today" tasks
--- a/docs/sleep/RESULTS.md
+++ b/docs/sleep/RESULTS.md
@@ -51,65 +51,7 @@ argument for SkillOpt-Sleep's design, and why the gate ships **on by default**.

 ---

-## 2. The full deployment grid (shipped config) — every cell, every night
-
-All 18 cells (3 benchmarks × 3 targets × {gate-free, gated}) in the shipped
-configuration (fixed dream rollouts + associative recall), shown **night by
-night** — N0 is the held-out baseline, N5 (or N4) is the final shipped skill.
-Nothing omitted.
-
-#### SearchQA — 1,400-item held-out test, SQuAD exact-match
-
-| Target | Mode | N0 | N1 | N2 | N3 | N4 | N5 | Δ |
-|---|---|---|---|---|---|---|---|---|
-| GPT-5.5 | gate-free | 0.799 | 0.831 | 0.783 | 0.845 | 0.852 | 0.850 | **+5.1** |
-| GPT-5.5 | gated | 0.797 | 0.836 | 0.841 | 0.841 | 0.841 | 0.841 | **+4.4** |
-| GPT-5.4-mini | gate-free | 0.776 | 0.789 | 0.779 | 0.771 | 0.774 | 0.762 | −1.4 |
-| GPT-5.4-mini | gated | 0.776 | 0.775 | 0.796 | 0.790 | 0.790 | 0.790 | **+1.4** |
-| GPT-5.4-nano | gate-free | 0.557 | 0.624 | 0.562 | 0.566 | 0.571 | 0.563 | +0.6 |
-| GPT-5.4-nano | gated | 0.554 | 0.554 | 0.535 | 0.535 | 0.535 | 0.535 | −1.9 |
-
-#### LiveMathematicianBench — 124-item held-out test, multiple-choice label
-
-| Target | Mode | N0 | N1 | N2 | N3 | N4 | Δ |
-|---|---|---|---|---|---|---|---|
-| GPT-5.5 | gate-free | 0.508 | 0.532 | 0.565 | 0.524 | 0.508 | +0.0 |
-| GPT-5.5 | gated | 0.548 | 0.548 | 0.548 | 0.548 | 0.540 | −0.8 |
-| GPT-5.4-mini | gate-free | 0.266 | 0.258 | 0.218 | 0.258 | 0.242 | −2.4 |
-| GPT-5.4-mini | gated | 0.234 | 0.234 | 0.218 | 0.218 | 0.218 | −1.6 |
-| GPT-5.4-nano | gate-free | 0.161 | 0.218 | 0.202 | 0.202 | 0.194 | **+3.2** |
-| GPT-5.4-nano | gated | 0.202 | 0.202 | 0.202 | 0.202 | 0.202 | −0.0 |
-
-<sub>LiveMath's training split has fewer than 50 tasks, so at 10 new tasks/night it completes 4 nights (N0–N4).</sub>
-
-#### SpreadsheetBench — 280-item held-out test, executed-code cell-value compare
-
-| Target | Mode | N0 | N1 | N2 | N3 | N4 | N5 | Δ |
-|---|---|---|---|---|---|---|---|---|
-| GPT-5.5 | gate-free | 0.650 | 0.639 | 0.639 | 0.539 | 0.646 | 0.639 | −1.1 |
-| GPT-5.5 | gated | 0.636 | 0.636 | 0.636 | 0.618 | 0.618 | 0.618 | −1.8 |
-| GPT-5.4-mini | gate-free | 0.339 | 0.336 | 0.329 | 0.346 | 0.318 | 0.343 | +0.4 |
-| GPT-5.4-mini | gated | 0.339 | 0.339 | 0.339 | 0.339 | 0.339 | 0.339 | +0.0 |
-| GPT-5.4-nano | gate-free | 0.293 | 0.300 | 0.293 | 0.293 | 0.296 | 0.339 | **+4.6** |
-| GPT-5.4-nano | gated | 0.318 | 0.318 | 0.325 | 0.325 | 0.325 | 0.325 | +0.7 |
-
-**Aggregate over all 18 cells: mean Δ +0.5, range [−2.4, +5.1]; 7 cells improve >+0.5,
-none worse than −2.4 with the gate-bounded column.**
-
-**Analysis.** Gains concentrate exactly where theory predicts — tasks with a
-**clean, checkable correctness signal and real headroom**: SearchQA on GPT-5.5
-(+5.1 / +4.4), SpreadsheetBench on the weak nano model (+4.6), LiveMath on nano
-(+3.2). Where the signal is **noisy or the model is already near ceiling**
-(LiveMath / SpreadsheetBench on strong GPT-5.5), the trajectories sit flat inside
-run-to-run noise. The night-by-night columns also show the gains are **stable, not
-lucky single readings** — gated cells reach a level and hold it (e.g. SearchQA
-GPT-5.5 0.841 from N2 on; SpreadsheetBench mini 0.339 throughout). Critically, the
-**gated worst case is −2.4** (bounded), whereas Section 1 showed the *ungated*
-worst case is unbounded (−52.8).
-
----
-
-## 3. Experience replay turns a one-time bump into a climb
+## 2. Experience replay turns a one-time bump into a climb

 The plugin's two opt-in knobs (`recall_k`, `dream_rollouts`) are what produce the
 gains. On the cleanest signal — **SearchQA, GPT-5.5, gated** — the gain rises
@@ -141,13 +83,14 @@ Recall captures most of cumulative's benefit at a fraction of the per-night cost

 ---

-## 4. Why these gains exist — the dream-diversity fix (and a rigor note)
+## 3. Why these gains exist — the dream-diversity fix (and a rigor note)

 Reflection learns from the **contrast** between good and bad rollouts of the same
 task, which requires the K dream rollouts to be *independent samples*. An early
 version of the engine collapsed them to one cached sample, so contrastive
-reflection never fired. Fixing that, then adding recall, is exactly what produced
-the grid above. The same 18-cell grid under three engine configurations:
+reflection never fired. Fixing that, then adding recall, is what produces the
+gains in Sections 1–2. Measured across an 18-cell deployment sweep (3 benchmarks ×
+3 targets × 2 modes), under three engine configurations:

 | Engine configuration | mean Δ | worst-cell Δ | cells > +0.5 | cells < −0.5 |
 |---|---|---|---|---|
@@ -164,7 +107,7 @@ slips through.

 ---

-## 5. End-to-end on real agents
+## 4. End-to-end on real agents

 On the public [gbrain-evals](https://github.com/garrytan/gbrain-evals) `skillopt-v1`
 benchmark — designed for exactly this learnable-gap setting — deficient seed skills
@@ -174,7 +117,7 @@ cross-verify each other's consolidated skills.

 ---

-## 6. Honest scope & limitations
+## 5. Honest scope & limitations

 - **Where it helps:** recurring tasks with a checkable correctness signal and real
  headroom. That is the plugin's actual use case (your repeated daily tasks and