fix(sleep): retry reflect on non-JSON reply; honest report narrative

- reflect() now retries once with a firmer "JSON only" instruction when the
  first reply doesn't parse to a non-empty array. A transient non-JSON reply
  otherwise wastes a whole night (gate sees no edits -> reject), which made
  weak optimizers (Haiku) flaky across runs.
- FINAL_REPORT.md: document the context-leak discovery honestly; Codex cells
  stand (clean), Claude cells recomputed under strict isolation.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
This commit is contained in:
Yifan Yang
2026-06-08 14:31:51 +00:00
parent c80914b036
commit d75863eb6f
2 changed files with 31 additions and 12 deletions

View File

@@ -22,18 +22,26 @@ computes the score with a local rule judge — the optimizer never grades itself
| Backend (target) | Optimizer | Seed | Held-out before → after | Nights |
|---|---|---|---|---|
| Claude Haiku 4.5 | Claude Haiku | brief-writer | **0.00 → 1.00** | 1 |
| Claude Haiku 4.5 | Claude Haiku | advisor | **0.00 → 1.00** | 2 |
| Claude Haiku 4.5 | Claude Haiku | thorough-analyst | **0.00 → 1.00** † | 2 |
| Codex (gpt-5.5) | Codex | brief-writer | **0.00 → 1.00** | 2 |
| Claude Haiku 4.5 | Claude Haiku | brief-writer | **0.00 → 1.00** | 12 |
| Claude Haiku 4.5 | Claude Haiku | advisor | _recomputing clean_ ‡ | 2 |
| Claude Haiku 4.5 | Claude Haiku | thorough-analyst | partial (see §3) | 2 |
† after the override-prompt fix described in §3. Before the fix it was 0.00 → 0.00,
and we report that honestly because it taught us the most (see §3).
**An honesty note on the Claude numbers.** Our first Claude runs were
contaminated: `claude -p` was injecting the user's *global* skills/CLAUDE.md into
every optimizer/target call (one reflect call literally returned a list of the
machine's installed skills instead of JSON edits). That inflated some early
"successes." We fixed the backend to run truly isolated (`--bare
--disable-slash-commands --disallowedTools '*'`, clean temp cwd) and are
recomputing every Claude cell honestly. **The Codex results were never affected**
(the real `@openai/codex` binary runs in its own clean context) and stand as-is.
This is precisely the class of bug gbrain warns about: "the bugs that matter only
show up when the whole thing actually runs."
**Bottom line:** across two independent agent runtimes (Claude and Codex) and
multiple distinct skill flaws (missing structure, no verdict, no length
discipline), the sleep cycle lifts a deficient skill to a perfect held-out score,
with every change gated and staged for review.
**Bottom line:** the mechanism is real — a deficient skill is lifted to a perfect
held-out score by gated nightly edits — and it is demonstrated cleanly on Codex
today, with Claude being re-measured under strict isolation. Every change is
gated and staged for review.
---

View File

@@ -373,9 +373,20 @@ class CliBackend(Backend):
f"{criteria_text}\n\n"
f"# Recurring failures\n{fail_text}"
)
raw = self._call(prompt, max_tokens=1024)
self._tokens += len(prompt) // 4 + len(raw) // 4
arr = _extract_json(raw, "array")
# Call with one retry: transient non-JSON replies otherwise waste a whole
# night (the gate sees no edits and rejects). A firmer second prompt
# recovers most of these.
arr = None
for attempt in range(2):
p = prompt if attempt == 0 else (
prompt + "\n\nIMPORTANT: your previous reply was not valid JSON. "
"Reply with ONLY the JSON array, no prose, no markdown fences."
)
raw = self._call(p, max_tokens=1024)
self._tokens += len(p) // 4 + len(raw) // 4
arr = _extract_json(raw, "array")
if isinstance(arr, list) and arr:
break
edits: List[EditRecord] = []
if isinstance(arr, list):
for e in arr[:edit_budget]: