Review fixes:
- Path bug: SKILLOPT_DEVIN_CLAUDE_HOME (and SKILLOPT_SLEEP_REPO) read from the
env are now wrapped in os.path.expanduser, so the documented "~/..." config
no longer passes a literal ~ to --claude-home (which yielded zero mined
sessions). expanduser on an absolute default is a no-op.
- tests/test_devin_plugin.py: tool-schema completeness, action→subcommand map,
backend enum, the CLAUDE_HOME expansion regression, and an ATIF-v1.7 harvest
shape test against a bundled fixture.
- plugins/devin/fixtures/devin_sample.json: sample ATIF-v1.7 transcript.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Wires the skillopt_sleep engine into Devin (Cognition) via an MCP server,
following the same thin-shell pattern as plugins/copilot.
- mcp_server.py: stdlib-only stdio MCP server exposing the standard sleep_*
tools (status, dry-run, run, adopt, harvest). REPO_ROOT defaults to ../.. so
it finds skillopt_sleep automatically when run from plugins/devin/.
- harvest_devin.py: converts Devin ATIF-v1.7 transcripts, agentmemory, and
.devin/skills/*/SKILL.md into the Claude Code-compatible JSONL the engine
consumes; enriches with taskKey + outcome envelopes (hard test/build signal
or judge rubric). Workspace auto-detection; cross-platform paths.
- judge.py, mcp-config.example.json, devin-rules.snippet.md, README.md.
- plugins/README.md: add Devin to the platform + install tables.
No changes to skillopt_sleep; shells out to `python -m skillopt_sleep` like the
other plugins. Pure stdlib; default backend mock (no API spend).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Follow-up to the string-aware brace scan: that change only skipped
double-quoted prose, so brace-shaped text in single quotes, backticks, or
bare prose (e.g. `{op: delete}`, '{x: 1}') still reached json_repair and was
fabricated into a bogus dict — strictly worse than None, since extract_json
feeds the optimizer's skill edits.
Add a _looks_json_like() guard before repair: a genuine JSON object's first
non-space char after `{` is `"` (a key) or `}` (empty). Prose pseudo-objects
start with a bare word and are rejected, while legitimate repair targets
(trailing commas, unescaped quotes inside string values) all begin with `"`
and pass — including objects whose string VALUES contain single quotes or
backticks, which must not be rejected.
Found by an independent GPT-5.5 re-review of the merged #79 code. Adds
regression tests for single-quoted / backticked / bare prose (-> None) and
for legitimate objects with quote/backtick string values (still repaired).
Tests: 30 pass (+3 skip) without json_repair, 33 pass with it, both clean
under -W error::RuntimeWarning.
Co-authored-by: Claude <noreply@anthropic.com>
The contributor is already credited via the Co-authored-by trailer carried
into main by #79; a dedicated README section is unnecessary.
Co-authored-by: Claude <noreply@anthropic.com>
Add an Acknowledgements section crediting @samuelgoofus-boop for the
Windows-robustness work on the Claude/Codex backends (originally #77,
merged via #79).
Co-authored-by: Claude <noreply@anthropic.com>
* Robustness for the claude/codex backends on Windows: argv overflow, subprocess encoding, tolerant JSON, test-eval dirs
Fixes surfaced running SkillOpt end-to-end on the bundled `claude` backend
(local Claude CLI) on Windows. None changes the OpenAI/GPT happy path.
1. skillopt/engine/trainer.py — the final test-eval directory
(test_eval_final/) is written to before being created; add
os.makedirs(..., exist_ok=True), matching the two sibling test-eval dirs.
Without it, summary.json raises FileNotFoundError when a rollout yields
zero predictions.
2. skillopt/model/claude_backend.py
a. Pass the prompt via stdin (not argv): on Windows the whole command line
is capped at ~32 KB and a large optimizer prompt (the success-analyst
minibatch carrying several report trajectories) overflows it with
[WinError 206], killing the run after retries.
b. Pass the system prompt via --append-system-prompt-file (a temp file),
not argv. The system prompt here is the skill being optimized, which
SkillOpt grows over training; since the ~32 KB cap applies to the SUM of
all argv, a grown skill would re-hit [WinError 206] even with the prompt
on stdin.
c. Pin the subprocess encoding to utf-8 (errors="replace"). With text=True
and no encoding=, stdin is encoded with the system codepage; on a zh-CN
box (cp936/GBK) a prompt containing an emoji or some Latin-1 characters
raises UnicodeEncodeError before the CLI even starts, failing every retry.
3. skillopt/model/codex_backend.py — the same utf-8 encoding pin on its
subprocess.run(input=...) call (identical unpinned-encoding pattern).
4. skillopt/utils/json_utils.py — extract_json() returned None for valid-
looking JSON that strict json.loads rejects (unescaped ASCII quotes inside
CJK string values, trailing commas), silently dropping the analyst's edits
on non-schema backends (Claude/Qwen): reflect produces N edits, 0 applied.
Add a json_repair fallback, but only on a single unambiguous object — a
balanced-brace extractor plus a refuse-on-multiple-objects guard — so a
chain-of-thought "scratch + final" response can't make repair silently
return the wrong (discarded) object, which would be worse than None (None is
detectable and retryable; a wrong-but-valid edit is applied blind). Declare
json_repair in requirements.txt and the claude/qwen optional extras so the
fallback is actually present (it otherwise no-ops, dropping edits silently).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(cherry picked from commit dca74a683e)
* fix(json_utils): harden tolerant JSON fallback from PR #77
Follow-up fixes on top of the cherry-picked Windows-robustness change:
1. Make _top_level_brace_objects() fully string-aware in its OUTER scan, not
just inside an object. A '{' inside quoted prose (e.g. '"set it to {x}"')
no longer starts a candidate object, so extract_json() returns None for
prose pseudo-JSON instead of repairing it into a bogus dict — which would
be strictly worse than dropping the edit, since extract_json feeds the
optimizer's skill edits.
2. Pick the repair candidate BEFORE importing json_repair, so the missing-
dependency RuntimeWarning only fires when there is genuinely a single
malformed object that could have been repaired. Ordinary no-JSON / prose
replies (the common case) now return None silently instead of warning on
every call.
3. Resolve dependency-metadata inconsistency: json_repair is optional, so add
it to the `all` extra (it was already in `claude`/`qwen`) and demote it
from a hard requirement to an optional/commented entry in requirements.txt,
matching the project's convention for backend-specific deps.
Adds regression tests for prose-with-braces (-> None), no-warning-on-plain-
text, single-object repair, and multi-object ambiguity. Existing 22 json
tests still pass with and without json_repair installed.
Co-Authored-By: Claude <noreply@anthropic.com>
---------
Co-authored-by: samuelgoofus-boop <260247789+samuelgoofus-boop@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codex round 3: argparse default=0 made every CLI invocation without
--lookback-hours clobber the config's 72h default. Now default=None;
only explicit --lookback-hours N (including 0) overrides config.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- harvest.py: revert break to continue — mtime ordering can diverge
from embedded ended_at timestamps (copy/touch), so we must check all
files rather than early-exiting on the first old one
- cycle.py: use `is not None and > 0` so lookback_hours=0 means
"scan full history" (opt-out of the cutoff)
- __main__.py: propagate --lookback-hours 0 to config as explicit 0
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- cycle.py: use supplied `clock` parameter (not wall time) for the
lookback cutoff, so deterministic tests/experiments get reproducible
harvest windows
- harvest.py: break (not continue) when a file is older than since_iso,
since files are sorted newest-first by mtime — avoids scanning the
entire transcript directory for quiet projects with large histories
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Two fixes from issue #57 feedback:
1. run-sleep.sh: support SKILLOPT_SLEEP_PYTHON env var to explicitly set
the Python interpreter. Useful on macOS where system Python is 3.9 but
a newer Python is available elsewhere (e.g. Codex Desktop's bundled
Python 3.12). Applied to both the shared runner and the bundled
Claude Code plugin copy.
2. cycle.py: on first run (no prior harvest recorded), apply the
lookback_hours config (default 72h) as a time cutoff. Previously,
first run scanned the entire transcript history, which could trigger
massive LLM mining on users with months of session data.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Address codex review: "API key" was too generic — a model response
about configuring API keys would trigger a false auth warning. Now:
- Use specific phrases ("Invalid API key", "Unauthorized: invalid x-api-key")
- Only check short stdout (<300 chars) to skip real model responses
- Still check stderr unconditionally
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
ClaudeCliBackend._call() and attempt_with_tools() hardcoded --bare,
which skips Claude CLI's credential resolution. This broke subscription-
token auth: every model call silently returned "Not logged in" and
scored 0 — the user saw "baseline 0.0 → candidate 0.0, gate reject"
with no indication of an auth failure.
Fix: only pass --bare when ANTHROPIC_API_KEY is set. The remaining
isolation flags (--disable-slash-commands, --disallowedTools,
--exclude-dynamic-system-prompt-sections, clean temp cwd) already
provide the needed isolation without --bare.
Also adds _detect_cli_error() to log a warning when CLI output matches
known auth error patterns, so auth failures surface loudly instead of
deflating every score to 0.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- harvest: tighten sub-3s filter to also require prompt < 200 chars,
avoiding false positives on fast real one-shot questions
- openclaw schedule_cmd: add docstring clarifying it schedules the
shared engine, not the OpenClaw-native runner
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The advertised backend choices in scripts/train.py use 'azure_openai',
not 'openai'; align the inputSchema description hint accordingly.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds honest tool-call detection for CopilotCliBackend, mirroring the
Claude/Codex backends. Writes per-tool executable shims into the work dir
and detects real invocations from a calllog (not self-reported markers).
The Copilot backend is Windows-validated, so shims are cross-platform:
a .cmd batch shim on Windows and a chmod'd bash shim on POSIX, with an
OS-specific tool hint. Mirrors _call's flags/env (isolated COPILOT_HOME,
--allow-all-tools, MCP/instruction disabling) and the UTF-8 subprocess fix.
Adds test_attempt_with_tools_honest_detection: a CI-friendly, OS-aware
stub stands in for the CLI, runs the shim, and asserts both JSONL parsing
and log-based detection. Validated live on Windows (real Copilot call) and
on Linux/WSL (POSIX path).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add CopilotCliBackend that drives the GitHub Copilot CLI in
non-interactive mode (copilot -p ... --output-format json) and parses the
JSONL event stream for assistant.message content. Registered as the
'copilot' backend (with aliases) and wired through the CLI, config,
experiment harness, and the Copilot MCP server's backend enum.
- Force UTF-8 decoding of CLI output (fixes cp1252 UnicodeDecodeError on
Windows when responses contain non-cp1252 bytes).
- Minimise per-call startup: isolated COPILOT_HOME with built-in MCPs and
custom instructions disabled, so user MCP servers are not spawned per
call (~5x faster: 36s -> 7.4s). Override via SKILLOPT_SLEEP_COPILOT_HOME
/ SKILLOPT_SLEEP_COPILOT_MODEL / SKILLOPT_SLEEP_COPILOT_FULL_ENV.
Validated end-to-end on real held-out tasks (researcher persona:
0.42 -> 1.00 lift; gate correctly rejects non-improving edits).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Exposes scripts/train.py and scripts/eval_only.py as Copilot MCP tools
(skillopt_list_configs, skillopt_train, skillopt_eval) via a stdlib-only
stdio server, mirroring the existing SkillOpt-Sleep plugin layout.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove the per-cell full deployment grid section; keep the gate-safety stress
test, experience-replay scaling + night-by-night climb, the dream-diversity
ablation, the gbrain end-to-end result, and the scope/limitations. Renumber
sections; update the README pointer accordingly.
Replace the compact baseline->after grid with three grouped per-benchmark tables
(SearchQA / LiveMath / SpreadsheetBench), each showing all 3 targets x both modes
across every night (N0..N5) + Δ. Makes the trajectory visible — gains reach a
level and hold rather than being single lucky readings — and presents the full
18-cell evidence in a more solid, readable form. Footnotes LiveMath's 4-night run
(train split <50 tasks). Numbers unchanged; just richer presentation.
Adds docs/sleep/RESULTS.md — the complete deployment-scale study behind
SkillOpt-Sleep, presented rigorously (named benchmarks, test sizes, metrics,
baseline->after, single shared protocol):
1. Gate-safety stress test: ungated nano SearchQA collapses 0.554->0.026
(-52.8); the gated twin holds 0.570 — the core argument for the design.
2. Full 18-cell deployment grid (3 benchmarks x 3 targets x gate/free),
shipped config: mean +0.5, range [-2.4, +5.1], nothing hidden.
3. Experience-replay scaling (recall_k 10->20->full: +3.1->+4.5->+5.6) and
the night-by-night climb (0.798->...->0.858, gate accepts as late as N5).
4. Dream-diversity fix as defense-in-depth: 3-config grid comparison
(-2.66/-52.8 -> +0.24/-4.0 -> +0.53/-2.4); the -52.8 cell becomes +2.7
from the dream fix alone.
5. gbrain end-to-end 0.00->1.00 on real Claude + Codex.
6. Honest scope: where it helps vs flat-in-noise, single-seed caveat with a
seed-robustness spot check, keep-the-gate-on.
README Results section now links prominently to it. Docs only; numbers are
self-contained with reproduce commands (no raw run dumps committed).
Label each result with its benchmark, test size, metric, target model, and gate
mode; show absolute baseline→after (not just Δ); state the single shared protocol
once. SearchQA recall-scaling table (1400-item test, SQuAD-EM, GPT-5.5, gated) +
SpreadsheetBench confirmation (280-item, cell-value compare, nano, gate-free) +
the gbrain end-to-end line. Keeps the single-seed / flat-on-noisy caveats.
Adds docs/sleep/README.md — a concise intro to the SkillOpt-Sleep plugin (what
it is, how to use it across the three agents, the opt-in experience-replay /
dream-rollout knobs, and headline results), linking to the full guide section.
Adds a News bullet pointing to it. No code changes.
Per maintainer request:
- Remove the internal/scratch docs/sleep/ tree (reports, raw logs, blog run
JSON, sweep.jsonl) — 23 files — and the root PUBLISHING.md. These were
working notes, not reference docs.
- Take the dedicated SkillOpt-Sleep content out of the main README (News bullet
+ section) and host it in the rendered guide instead: new section 9 in
docs/guideline.html (deployment companion, the three plugins, opt-in
experience replay / dream rollouts) with a sidebar entry.
- Fix the README's opening reference so "Documentation & Reproduction Guide"
links directly to the rendered GitHub Pages page, not the raw .html source.
- Repoint the now-removed docs/sleep links in the plugin READMEs to the guide
section.
The plugin code (plugins/, skillopt_sleep/) is unchanged; only docs move.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Wires two consolidation mechanisms into the shipped nightly cycle, both default
OFF so existing behavior is unchanged:
- dream_rollouts (>1): multi-rollout contrastive reflection per task
- recall_k (>0): associative recall of the K most-similar past tasks (from a
capped task_archive persisted in state.json) into tonight's dream
- dream_factor (>0): synthetic task variants
New shared engine module skillopt_sleep/dream.py (recall_similar, dream_augment,
dream_consolidate) is called by both the plugin cycle and the experiment harness,
so reported numbers exercise the exact shipped code. Built on the existing
rollouts_k/sample_id support already in consolidate.py/rollout.py.
Validated (5 nights x 10 real tasks/night, full held-out test, GPT-5.5, gated):
the gain scales with recall depth on a clean signal —
SearchQA recall_k=10 +3.1, recall_k=20 +4.5, full-history reference +5.6;
SpreadsheetBench (nano, gate-free) +3.6. Flat within noise on saturated/noisy
cells. See docs/sleep/EXPERIENCE_REPLAY.md (+ raw runs under blog_runs/v2_port/).
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>