Address Codex review: "API key" was too generic, so a model response
about configuring API keys would trigger a false auth warning. Now:
- Use specific phrases ("Invalid API key", "Unauthorized: invalid x-api-key")
- Only check short stdout (<300 chars) to skip real model responses
- Still check stderr unconditionally
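A minimal sketch of the tightened check, assuming case-insensitive substring matching. The function name, constant names, and return convention here are illustrative, not the shipped code:

```python
from typing import Optional

# Specific phrases only; a bare "API key" would match ordinary model output.
AUTH_ERROR_PHRASES = (
    "invalid api key",
    "unauthorized: invalid x-api-key",
)
SHORT_STDOUT_LIMIT = 300  # stdout at or above this length is treated as a real response


def detect_cli_error(stdout: str, stderr: str) -> Optional[str]:
    """Return the matched auth-error phrase, or None if nothing matched."""
    haystacks = [stderr]  # stderr is always checked
    if len(stdout) < SHORT_STDOUT_LIMIT:
        haystacks.append(stdout)  # only short stdout can be an error banner
    for text in haystacks:
        lowered = text.lower()
        for phrase in AUTH_ERROR_PHRASES:
            if phrase in lowered:
                return phrase
    return None
```

A long answer that merely discusses API keys is skipped on stdout, while the same phrase on stderr is still reported.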
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
ClaudeCliBackend._call() and attempt_with_tools() hardcoded --bare,
which skips Claude CLI's credential resolution. This broke subscription-
token auth: every model call silently returned "Not logged in" and
scored 0 — the user saw "baseline 0.0 → candidate 0.0, gate reject"
with no indication of an auth failure.
Fix: only pass --bare when ANTHROPIC_API_KEY is set. The remaining
isolation flags (--disable-slash-commands, --disallowedTools,
--exclude-dynamic-system-prompt-sections, clean temp cwd) already
provide the needed isolation without --bare.
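The conditional can be sketched as follows. The helper name and the shape of the flag list are assumptions; --disallowedTools is left out because its argument is not stated here:

```python
import os
from typing import List, Mapping, Optional

# Isolation flags named in this commit that take no argument.
ISOLATION_FLAGS = [
    "--disable-slash-commands",
    "--exclude-dynamic-system-prompt-sections",
]


def build_cli_flags(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Build the flag list, adding --bare only when an API key is present."""
    env = os.environ if env is None else env
    flags = list(ISOLATION_FLAGS)
    if env.get("ANTHROPIC_API_KEY"):
        # With an explicit key, skipping credential resolution is harmless.
        flags.append("--bare")
    return flags
```

An unset or empty ANTHROPIC_API_KEY leaves --bare off, so subscription-token auth can still resolve credentials.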
Also adds _detect_cli_error() to log a warning when CLI output matches
known auth error patterns, so auth failures surface loudly instead of
deflating every score to 0.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- harvest: tighten sub-3s filter to also require prompt < 200 chars,
avoiding false positives on fast real one-shot questions
- openclaw schedule_cmd: add docstring clarifying it schedules the
shared engine, not the OpenClaw-native runner
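For the harvest change, the tightened filter amounts to a conjunction of the two conditions. A sketch with hypothetical names and parameters:

```python
def is_trivial_session(duration_s: float, prompt: str,
                       max_duration_s: float = 3.0,
                       max_prompt_chars: int = 200) -> bool:
    """True only when the session is both very fast and has a short prompt."""
    return duration_s < max_duration_s and len(prompt) < max_prompt_chars
```

A fast session with a long prompt is no longer filtered out, which is the false-positive case the change targets.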
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The advertised backend choices in scripts/train.py use 'azure_openai',
not 'openai'; align the inputSchema description hint accordingly.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds honest tool-call detection for CopilotCliBackend, mirroring the
Claude/Codex backends. Writes per-tool executable shims into the work dir
and detects real invocations from a calllog (not self-reported markers).
The Copilot backend is Windows-validated, so shims are cross-platform:
a .cmd batch shim on Windows and a chmod'd bash shim on POSIX, with an
OS-specific tool hint. Mirrors _call's flags/env (isolated COPILOT_HOME,
--allow-all-tools, MCP/instruction disabling) and the UTF-8 subprocess fix.
Adds test_attempt_with_tools_honest_detection: a CI-friendly, OS-aware
stub stands in for the CLI, runs the shim, and asserts both JSONL parsing
and log-based detection. Validated live on Windows (real Copilot call) and
on Linux/WSL (POSIX path).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add CopilotCliBackend that drives the GitHub Copilot CLI in
non-interactive mode (copilot -p ... --output-format json) and parses the
JSONL event stream for assistant.message content. Registered as the
'copilot' backend (with aliases) and wired through the CLI, config,
experiment harness, and the Copilot MCP server's backend enum.
- Force UTF-8 decoding of CLI output (fixes cp1252 UnicodeDecodeError on
Windows when responses contain non-cp1252 bytes).
- Minimise per-call startup: isolated COPILOT_HOME with built-in MCPs and
custom instructions disabled, so user MCP servers are not spawned per
call (~5x faster: 36s -> 7.4s). Override via SKILLOPT_SLEEP_COPILOT_HOME
/ SKILLOPT_SLEEP_COPILOT_MODEL / SKILLOPT_SLEEP_COPILOT_FULL_ENV.
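A sketch of the JSONL parsing with forced UTF-8 decoding. The event shape used here (a "type" field plus "data.content") is an assumption for illustration; the actual Copilot CLI schema may differ:

```python
import json
from typing import List


def extract_assistant_text(raw: bytes) -> str:
    """Concatenate the content of assistant.message events from a JSONL stream."""
    # Decode explicitly as UTF-8 instead of relying on the platform default
    # (cp1252 on many Windows setups).
    text = raw.decode("utf-8", errors="replace")
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate non-JSON noise in the stream
        if not isinstance(event, dict) or event.get("type") != "assistant.message":
            continue
        data = event.get("data")
        content = data.get("content") if isinstance(data, dict) else None
        if isinstance(content, str):
            chunks.append(content)
    return "".join(chunks)
```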
Validated end-to-end on real held-out tasks (researcher persona:
0.42 -> 1.00 lift; gate correctly rejects non-improving edits).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Exposes scripts/train.py and scripts/eval_only.py as Copilot MCP tools
(skillopt_list_configs, skillopt_train, skillopt_eval) via a stdlib-only
stdio server, mirroring the existing SkillOpt-Sleep plugin layout.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove the per-cell full deployment grid section; keep the gate-safety stress
test, experience-replay scaling + night-by-night climb, the dream-diversity
ablation, the gbrain end-to-end result, and the scope/limitations. Renumber
sections; update the README pointer accordingly.
Replace the compact baseline->after grid with three grouped per-benchmark tables
(SearchQA / LiveMath / SpreadsheetBench), each showing all 3 targets x both modes
across every night (N0..N5) + Δ. This makes the trajectory visible: gains reach a
level and hold rather than being single lucky readings. It also presents the full
18-cell evidence in a more readable form. Adds a footnote on LiveMath's 4-night run
(train split <50 tasks). Numbers are unchanged; only the presentation is richer.
Adds docs/sleep/RESULTS.md — the complete deployment-scale study behind
SkillOpt-Sleep, presented rigorously (named benchmarks, test sizes, metrics,
baseline->after, single shared protocol):
1. Gate-safety stress test: ungated nano SearchQA collapses 0.554->0.026
(-52.8); the gated twin holds 0.570 — the core argument for the design.
2. Full 18-cell deployment grid (3 benchmarks x 3 targets x gate/free),
shipped config: mean +0.5, range [-2.4, +5.1], nothing hidden.
3. Experience-replay scaling (recall_k 10->20->full: +3.1->+4.5->+5.6) and
the night-by-night climb (0.798->...->0.858, gate accepts as late as N5).
4. Dream-diversity fix as defense-in-depth: 3-config grid comparison
(-2.66/-52.8 -> +0.24/-4.0 -> +0.53/-2.4); the -52.8 cell becomes +2.7
from the dream fix alone.
5. gbrain end-to-end 0.00->1.00 on real Claude + Codex.
6. Honest scope: where it helps vs flat-in-noise, single-seed caveat with a
seed-robustness spot check, keep-the-gate-on.
README Results section now links prominently to it. Docs only; numbers are
self-contained with reproduce commands (no raw run dumps committed).
Label each result with its benchmark, test size, metric, target model, and gate
mode; show absolute baseline→after (not just Δ); state the single shared protocol
once. Includes the SearchQA recall-scaling table (1400-item test, SQuAD-EM, GPT-5.5,
gated), the SpreadsheetBench confirmation (280-item, cell-value compare, nano,
gate-free), and the gbrain end-to-end line. Keeps the single-seed / flat-on-noisy caveats.
Adds docs/sleep/README.md — a concise intro to the SkillOpt-Sleep plugin (what
it is, how to use it across the three agents, the opt-in experience-replay /
dream-rollout knobs, and headline results), linking to the full guide section.
Adds a News bullet pointing to it. No code changes.
Per maintainer request:
- Remove the internal/scratch docs/sleep/ tree (reports, raw logs, blog run
JSON, sweep.jsonl) — 23 files — and the root PUBLISHING.md. These were
working notes, not reference docs.
- Take the dedicated SkillOpt-Sleep content out of the main README (News bullet
+ section) and host it in the rendered guide instead: new section 9 in
docs/guideline.html (deployment companion, the three plugins, opt-in
experience replay / dream rollouts) with a sidebar entry.
- Fix the README's opening reference so "Documentation & Reproduction Guide"
links directly to the rendered GitHub Pages page, not the raw .html source.
- Repoint the now-removed docs/sleep links in the plugin READMEs to the guide
section.
The plugin code (plugins/, skillopt_sleep/) is unchanged; only docs move.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Wires two consolidation mechanisms, experience replay and dream rollouts, into the
shipped nightly cycle. All knobs default to OFF, so existing behavior is unchanged:
- dream_rollouts (>1): multi-rollout contrastive reflection per task
- recall_k (>0): associative recall of the K most-similar past tasks (from a
capped task_archive persisted in state.json) into tonight's dream
- dream_factor (>0): synthetic task variants
New shared engine module skillopt_sleep/dream.py (recall_similar, dream_augment,
dream_consolidate) is called by both the plugin cycle and the experiment harness,
so reported numbers exercise the exact shipped code. Built on the existing
rollouts_k/sample_id support already in consolidate.py/rollout.py.
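To illustrate the associative-recall step, here is a sketch that ranks archived tasks by token-set Jaccard similarity. The similarity measure, the function name, and the signature are assumptions; the shipped recall_similar may score tasks differently:

```python
from typing import List, Set


def _tokens(text: str) -> Set[str]:
    """Lowercased whitespace tokens of a task description."""
    return set(text.lower().split())


def recall_similar_sketch(query: str, archive: List[str], k: int) -> List[str]:
    """Return the k archived tasks most similar to the query (k <= 0 disables recall)."""
    if k <= 0:
        return []
    q = _tokens(query)

    def score(task: str) -> float:
        t = _tokens(task)
        union = q | t
        return len(q & t) / len(union) if union else 0.0

    # sorted() is stable, so equally scored tasks keep archive order.
    return sorted(archive, key=score, reverse=True)[:k]
```

With recall_k at 0 the function returns nothing, matching the default-OFF behavior described above.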
Validated (5 nights x 10 real tasks/night, full held-out test, GPT-5.5, gated):
the gain scales with recall depth on a clean signal —
SearchQA recall_k=10 +3.1, recall_k=20 +4.5, full-history reference +5.6;
SpreadsheetBench (nano, gate-free) +3.6. Flat within noise on saturated/noisy
cells. See docs/sleep/EXPERIENCE_REPLAY.md (+ raw runs under blog_runs/v2_port/).
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
All six adapters carried their own copy of a reflect() that delegates to
run_minibatch_reflect. The copies had drifted: OfficeQA/DocVQA silently
dropped meta_skill_context and ALFWorld dropped update_mode, so those
analysts ran without inputs that every other benchmark receives (active under
the default use_meta_skill: true).
Move the delegation into EnvAdapter.reflect as one default that forwards
all kwargs uniformly, and delete the six overrides. reflect is no longer
abstract — adapters inherit it and override only for custom logic.
Net -225 lines. Behavior change: OfficeQA/DocVQA/ALFWorld reflect now
receive the kwargs they previously dropped; the three already-correct
benchmarks are unaffected.
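The shape of the refactor, sketched with a stand-in for run_minibatch_reflect. The helper body, the adapter class names, and the example kwarg values are illustrative only:

```python
from typing import Any, Dict


def run_minibatch_reflect(adapter: "EnvAdapter", **kwargs: Any) -> Dict[str, Any]:
    """Stand-in for the real helper; it just echoes what it received."""
    return {"adapter": type(adapter).__name__, "kwargs": kwargs}


class EnvAdapter:
    def reflect(self, **kwargs: Any) -> Dict[str, Any]:
        # Single default: forward every kwarg so nothing is silently dropped.
        return run_minibatch_reflect(self, **kwargs)


class OfficeQAAdapter(EnvAdapter):
    pass  # inherits reflect; no per-adapter copy that can drift


class CustomAdapter(EnvAdapter):
    def reflect(self, **kwargs: Any) -> Dict[str, Any]:
        # Override only for custom logic, then defer to the shared default.
        kwargs.setdefault("update_mode", "append")  # hypothetical value
        return super().reflect(**kwargs)
```

Forwarding **kwargs in one place means a newly added argument reaches every adapter without touching each subclass.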
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Updates the SkillOpt-Sleep plugin on top of the current main. User-facing and
engine improvements since the initial drop:
* Command renamed /sleep -> /skillopt-sleep across Claude Code + Codex shells;
refreshed plugin READMEs and install scripts.
* Built-in scheduling (skillopt_sleep/scheduler.py + __main__): schedule /
unschedule the nightly cycle without external cron wiring.
* Backend robustness: bounded retry with backoff (no more silent empty-string
result on a transient 429/timeout), content-filter-safe rollout prompt, an
output-contract guardrail that rejects edits violating the task's required
format, and a per-sample cache key so repeated dream rollouts are independent
samples (fixes degenerate single-sample reflection).
* consolidate / rollout / replay: parallel multi-rollout dreaming, gate-mode
controls, TaskRecord.system framing field.
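The bounded retry with backoff mentioned above could look roughly like this. The exception type, defaults, and exponential schedule are assumptions rather than the engine's actual values:

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class TransientBackendError(Exception):
    """Hypothetical marker for retryable failures such as a 429 or a timeout."""


def call_with_retry(call: Callable[[], T],
                    max_attempts: int = 3,
                    base_delay_s: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry a backend call a bounded number of times with exponential backoff."""
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientBackendError as exc:
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(base_delay_s * (2 ** attempt))  # 1s, 2s, 4s, ...
    # Fail loudly instead of returning an empty string.
    raise RuntimeError(f"backend call failed after {max_attempts} attempts") from last_error
```

Raising after the last attempt keeps a transient outage from being scored as an empty answer.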
Scope: this commit ships only the plugin engine + shells. Research/benchmark
harnesses and their data are intentionally not included; the public package
has no dependency on them (the one research-evaluator import is now guarded).
Marked as an early preview in the README; we'll keep iterating.
99/99 unit tests pass.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Adds a thin OpenClaw shell wrapping the SkillOpt-Sleep engine. Enables
nightly validation-gated skill improvement cycles for OpenClaw agents.
Components:
- skillopt_sleep_openclaw.py: DeepSeek V4 Pro + Ollama nomic-embed-text
backend, mirroring the Claude/Codex/Copilot backend pattern.
- run_sleep.py: CLI entry point supporting dry-run and pre-built task files.
- run_sleep_cron.sh: bash wrapper for nightly cron invocation.
- slash_sleep.py: /sleep command (status / run / adopt / reject / cost).
- config.json: engine config tuned for our stack.
- SKILL.md: OpenClaw skill manifest.
- tests/: 14 held-out tasks across 3 categories (research-cron, devops, wiki).
OpenClaw is the 4th ecosystem in which SkillOpt-Sleep can be deployed,
joining Claude Code, Codex, and Copilot. The shell follows the same
single-engine / thin-shell pattern as the existing three plugins.
End-to-end tested: pipeline runs against real OpenClaw session transcripts,
gate correctly rejects non-improvements, staging artifacts land in
~/.skillopt-sleep/staging/<night>/. Cost: ~$0.02/night on DeepSeek V4 Pro.
- Move Quick Start (now §3) ahead of the data chapter; renumber and fix
cross-references and the sidebar nav.
- Add §3.1 'Your First Demo': states plainly that data/ ships ID manifests
only, gives the one benchmark that runs out of the box (ALFWorld with its
bundled path split), and points other benchmarks to the data/README.md
materialization step. Also offers eval-only with ckpt/ skills as a
lighter sanity check.
- Reframe the data chapter as 'Run on Your Own Data' (§4) with a three-step
lead-in (split dir -> item schema -> --split_dir) and a pointer to §7.2
for new task shapes.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>