Commit Graph

193 Commits

Author SHA1 Message Date
Yifan Yang
c2e47c50fb docs(readme): acknowledge community contributor @samuelgoofus-boop (#80)
Add an Acknowledgements section crediting @samuelgoofus-boop for the
Windows-robustness work on the Claude/Codex backends (originally #77,
merged via #79).

Co-authored-by: Claude <noreply@anthropic.com>
2026-06-23 19:03:30 +08:00
Yifan Yang
14c045f04f Windows robustness for claude/codex backends (+ hardened JSON fallback) (#79)
* Robustness for the claude/codex backends on Windows: argv overflow, subprocess encoding, tolerant JSON, test-eval dirs

Fixes surfaced running SkillOpt end-to-end on the bundled `claude` backend
(local Claude CLI) on Windows. None changes the OpenAI/GPT happy path.

1. skillopt/engine/trainer.py — the final test-eval directory
   (test_eval_final/) is written to before being created; add
   os.makedirs(..., exist_ok=True), matching the two sibling test-eval dirs.
   Without it, writing summary.json raises FileNotFoundError when a rollout
   yields zero predictions.

2. skillopt/model/claude_backend.py
   a. Pass the prompt via stdin (not argv): on Windows the whole command line
      is capped at ~32 KB and a large optimizer prompt (the success-analyst
      minibatch carrying several report trajectories) overflows it with
      [WinError 206], killing the run after retries.
   b. Pass the system prompt via --append-system-prompt-file (a temp file),
      not argv. The system prompt here is the skill being optimized, which
      SkillOpt grows over training; since the ~32 KB cap applies to the SUM of
      all argv, a grown skill would re-hit [WinError 206] even with the prompt
      on stdin.
   c. Pin the subprocess encoding to utf-8 (errors="replace"). With text=True
      and no encoding=, stdin is encoded with the system codepage; on a zh-CN
      box (cp936/GBK) a prompt containing an emoji or some Latin-1 characters
      raises UnicodeEncodeError before the CLI even starts, failing every retry.

3. skillopt/model/codex_backend.py — the same utf-8 encoding pin on its
   subprocess.run(input=...) call (identical unpinned-encoding pattern).

4. skillopt/utils/json_utils.py — extract_json() returned None for valid-
   looking JSON that strict json.loads rejects (unescaped ASCII quotes inside
   CJK string values, trailing commas), silently dropping the analyst's edits
   on non-schema backends (Claude/Qwen): reflect produces N edits, 0 applied.
   Add a json_repair fallback, but only on a single unambiguous object — a
   balanced-brace extractor plus a refuse-on-multiple-objects guard — so a
   chain-of-thought "scratch + final" response can't make repair silently
   return the wrong (discarded) object, which would be worse than None (None is
   detectable and retryable; a wrong-but-valid edit is applied blind). Declare
   json_repair in requirements.txt and the claude/qwen optional extras so the
   fallback is actually present (it otherwise no-ops, dropping edits silently).
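
The stdin and encoding points in item 2 can be illustrated with a minimal sketch. It is not the code in claude_backend.py; the helper name `run_cli` and the echo child process are assumptions made for illustration. It only shows the general pattern of sending a large prompt on stdin with the encoding pinned to utf-8.

```python
import subprocess
import sys


def run_cli(cmd: list[str], prompt: str, timeout: float = 60.0) -> str:
    """Run `cmd`, feeding `prompt` on stdin instead of argv (hypothetical helper)."""
    completed = subprocess.run(
        cmd,
        input=prompt,          # stdin is not subject to the Windows command-line cap
        capture_output=True,
        text=True,
        encoding="utf-8",      # pin the codec instead of using the system codepage
        errors="replace",      # undecodable bytes become U+FFFD rather than raising
        timeout=timeout,
        check=False,
    )
    return completed.stdout


if __name__ == "__main__":
    # Stand-in child that echoes stdin back; a real CLI invocation would go here.
    echo = [sys.executable, "-c",
            "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())"]
    reply = run_cli(echo, "caf\u00e9 \U0001F642 " + "x" * 50_000)
    print(len(reply))
```

Without `encoding=`, `text=True` uses the locale's preferred encoding, which is the failure mode item 2c describes.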

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(cherry picked from commit dca74a683e)

* fix(json_utils): harden tolerant JSON fallback from PR #77

Follow-up fixes on top of the cherry-picked Windows-robustness change:

1. Make _top_level_brace_objects() fully string-aware in its OUTER scan, not
   just inside an object. A '{' inside quoted prose (e.g. '"set it to {x}"')
   no longer starts a candidate object, so extract_json() returns None for
   prose pseudo-JSON instead of repairing it into a bogus dict — which would
   be strictly worse than dropping the edit, since extract_json feeds the
   optimizer's skill edits.

2. Pick the repair candidate BEFORE importing json_repair, so the missing-
   dependency RuntimeWarning only fires when there is genuinely a single
   malformed object that could have been repaired. Ordinary no-JSON / prose
   replies (the common case) now return None silently instead of warning on
   every call.

3. Resolve dependency-metadata inconsistency: json_repair is optional, so add
   it to the `all` extra (it was already in `claude`/`qwen`) and demote it
   from a hard requirement to an optional/commented entry in requirements.txt,
   matching the project's convention for backend-specific deps.

Adds regression tests for prose-with-braces (-> None), no-warning-on-plain-
text, single-object repair, and multi-object ambiguity. Existing 22 json
tests still pass with and without json_repair installed.
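
A rough sketch of the string-aware scan and the single-object guard described in items 1 and 2 follows. It is an illustration under assumptions, not the code in json_utils.py: the function names are hypothetical, only double-quoted strings are tracked, and the repair step assumes the `json_repair.loads` entry point of the optional json_repair package.

```python
import json
import warnings
from typing import Any, Optional


def top_level_brace_objects(text: str) -> list[str]:
    """Return balanced top-level {...} spans, ignoring braces inside "..." strings."""
    objects: list[str] = []
    depth, start = 0, -1
    in_string = escaped = False
    for i, ch in enumerate(text):
        if in_string:
            # String state is tracked in the outer scan too, so quoted prose
            # such as "set it to {x}" never opens a candidate object.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                objects.append(text[start:i + 1])
    return objects


def extract_json_sketch(text: str) -> Optional[Any]:
    """Parse only when exactly one top-level object is present (hypothetical)."""
    candidates = top_level_brace_objects(text)
    if len(candidates) != 1:
        return None                      # none, or ambiguous "scratch + final"
    candidate = candidates[0]
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        pass
    try:
        import json_repair               # optional dependency, imported last
    except ImportError:
        warnings.warn("json_repair is not installed; edit dropped", RuntimeWarning)
        return None
    return json_repair.loads(candidate)
```

Because the candidate is chosen before the import, plain prose replies return None without touching json_repair or emitting a warning.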

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: samuelgoofus-boop <260247789+samuelgoofus-boop@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 19:00:23 +08:00
carpedkm
2841f82428 Fix ALFWorld gamefile paths relative to ALFWORLD_DATA
2026-06-23 10:32:38 +00:00
Yifan Yang
64c6dda105 Merge pull request #78 from Yif-Yang/main
docs(readme): add Trendshift daily/weekly badges
2026-06-23 16:52:42 +08:00
Yifan Yang
c98eac18c7 docs(readme): add Trendshift daily/weekly badges (#1)
Add the microsoft/SkillOpt Trendshift badges (daily + weekly) side by
side in the README header.

Co-authored-by: Claude <noreply@anthropic.com>
2026-06-23 16:50:47 +08:00
Yifan Yang
fc1f827f07 Merge pull request #74 from Yif-Yang/fix/python-path-and-lookback
fix: SKILLOPT_SLEEP_PYTHON override + lookback_hours first-run fallback
2026-06-20 22:26:43 +08:00
carpedkm
01b3e01804 fix: use None default for --lookback-hours to distinguish omitted vs 0
Codex round 3: argparse default=0 made every CLI invocation without
--lookback-hours clobber the config's 72h default. Now default=None;
only explicit --lookback-hours N (including 0) overrides config.
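
The omitted-versus-zero distinction can be sketched as below. The function name `resolve_lookback` and the `float` argument type are assumptions; the flag name and the 72h config default come from the message above.

```python
import argparse
from typing import Sequence

CONFIG_LOOKBACK_HOURS = 72.0  # config default named in the commit message


def resolve_lookback(argv: Sequence[str],
                     config_value: float = CONFIG_LOOKBACK_HOURS) -> float:
    """Return the effective lookback: the CLI value if given, else the config value."""
    parser = argparse.ArgumentParser()
    # default=None lets "flag omitted" be told apart from an explicit 0.
    parser.add_argument("--lookback-hours", type=float, default=None)
    args = parser.parse_args(argv)
    if args.lookback_hours is None:
        return config_value
    return args.lookback_hours
```

With `default=0`, the first and second assertions in a check like `resolve_lookback([])` versus `resolve_lookback(["--lookback-hours", "0"])` would be indistinguishable, which is the bug this commit describes.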

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-20 14:23:17 +00:00
carpedkm
01075c90d3 fix: address codex round 2 — revert harvest break + allow lookback 0
- harvest.py: revert break to continue — mtime ordering can diverge
  from embedded ended_at timestamps (copy/touch), so we must check all
  files rather than early-exiting on the first old one
- cycle.py: use `is not None and > 0` so lookback_hours=0 means
  "scan full history" (opt-out of the cutoff)
- __main__.py: propagate --lookback-hours 0 to config as explicit 0
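
A small sketch of the two behaviors above, under assumed names (`lookback_cutoff`, `select_sessions`) and an assumed session shape of (name, ended_at) pairs. It is not the code in cycle.py or harvest.py.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Iterable, Optional, Tuple


def lookback_cutoff(lookback_hours: Optional[float],
                    clock: Callable[[], datetime]) -> Optional[datetime]:
    """Return the cutoff time, or None when the cutoff is disabled (None or 0)."""
    if lookback_hours is not None and lookback_hours > 0:
        # Use the injected clock, not wall time, so runs are reproducible.
        return clock() - timedelta(hours=lookback_hours)
    return None


def select_sessions(sessions: Iterable[Tuple[str, datetime]],
                    cutoff: Optional[datetime]) -> list[str]:
    """Keep sessions that ended at or after the cutoff."""
    kept = []
    for name, ended_at in sessions:
        if cutoff is not None and ended_at < cutoff:
            continue  # not break: file order may not match ended_at order
        kept.append(name)
    return kept


if __name__ == "__main__":
    fixed = datetime(2026, 6, 20, 12, 0, tzinfo=timezone.utc)
    print(lookback_cutoff(72, lambda: fixed))
```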

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-20 14:21:18 +00:00
carpedkm
6cc1cd2e95 fix: address codex review — use clock for cutoff + early-exit harvest
- cycle.py: use supplied `clock` parameter (not wall time) for the
  lookback cutoff, so deterministic tests/experiments get reproducible
  harvest windows
- harvest.py: break (not continue) when a file is older than since_iso,
  since files are sorted newest-first by mtime — avoids scanning the
  entire transcript directory for quiet projects with large histories

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-20 14:11:58 +00:00
carpedkm
889238b234 fix: add SKILLOPT_SLEEP_PYTHON override + lookback_hours first-run fallback
Two fixes from issue #57 feedback:

1. run-sleep.sh: support SKILLOPT_SLEEP_PYTHON env var to explicitly set
   the Python interpreter. Useful on macOS where system Python is 3.9 but
   a newer Python is available elsewhere (e.g. Codex Desktop's bundled
   Python 3.12). Applied to both the shared runner and the bundled
   Claude Code plugin copy.

2. cycle.py: on first run (no prior harvest recorded), apply the
   lookback_hours config (default 72h) as a time cutoff. Previously,
   first run scanned the entire transcript history, which could trigger
   massive LLM mining on users with months of session data.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-20 14:07:50 +00:00
Yifan Yang
b5a1c2b317 Merge pull request #73 from Yif-Yang/fix/bare-subscription-auth
fix(sleep): make --bare conditional on ANTHROPIC_API_KEY (#68)
2026-06-20 21:46:09 +08:00
carpedkm
552ddefd74 fix: narrow CLI error markers to avoid false positives
Address codex review: "API key" was too generic — a model response
about configuring API keys would trigger a false auth warning. Now:
- Use specific phrases ("Invalid API key", "Unauthorized: invalid x-api-key")
- Only check short stdout (<300 chars) to skip real model responses
- Still check stderr unconditionally
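
The narrowed check might look roughly like this. The two phrases and the 300-character limit come from the message above; the function name, the return type, and case-sensitive matching are assumptions.

```python
from typing import Optional

# Specific phrases named in the commit message.
AUTH_ERROR_MARKERS = ("Invalid API key", "Unauthorized: invalid x-api-key")
SHORT_STDOUT_LIMIT = 300  # longer stdout is treated as a real model response


def detect_cli_error(stdout: str, stderr: str) -> Optional[str]:
    """Return the matched marker, or None if the output does not look like an auth error."""
    for marker in AUTH_ERROR_MARKERS:
        if marker in stderr:          # stderr is always checked
            return marker
    if len(stdout) < SHORT_STDOUT_LIMIT:
        for marker in AUTH_ERROR_MARKERS:
            if marker in stdout:
                return marker
    return None
```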

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-20 13:32:43 +00:00
carpedkm
bfa53bc46d fix(sleep): make --bare conditional on ANTHROPIC_API_KEY (#68)
ClaudeCliBackend._call() and attempt_with_tools() hardcoded --bare,
which skips Claude CLI's credential resolution. This broke subscription-
token auth: every model call silently returned "Not logged in" and
scored 0 — the user saw "baseline 0.0 → candidate 0.0, gate reject"
with no indication of an auth failure.

Fix: only pass --bare when ANTHROPIC_API_KEY is set. The remaining
isolation flags (--disable-slash-commands, --disallowedTools,
--exclude-dynamic-system-prompt-sections, clean temp cwd) already
provide the needed isolation without --bare.

Also adds _detect_cli_error() to log a warning when CLI output matches
known auth error patterns, so auth failures surface loudly instead of
deflating every score to 0.
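
As an illustration of the conditional flag, the sketch below builds a flag list from an environment mapping. The helper name is hypothetical, `--disallowedTools` is left out because its arguments are not given here, and treating an empty ANTHROPIC_API_KEY as unset is an assumption.

```python
import os
from typing import Mapping, Optional


def isolation_flags(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return CLI isolation flags, adding --bare only when an API key is set."""
    env = os.environ if env is None else env
    flags = [
        "--disable-slash-commands",
        "--exclude-dynamic-system-prompt-sections",
    ]
    if env.get("ANTHROPIC_API_KEY"):
        # --bare skips credential resolution, so it is only safe with an API key.
        flags.insert(0, "--bare")
    return flags
```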

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-20 13:28:34 +00:00
Yifan Yang
24b5a25ba8 Merge pull request #72 from Yif-Yang/feat/plugin-feature-sync
feat: sync all 4 runtime plugins with full engine surface + fix #52 #58 #62
2026-06-20 20:42:24 +08:00
carpedkm
0d648b2580 fix: address codex+gpt-5.5 review findings
- harvest: tighten sub-3s filter to also require prompt < 200 chars,
  avoiding false positives on fast real one-shot questions
- openclaw schedule_cmd: add docstring clarifying it schedules the
  shared engine, not the OpenClaw-native runner

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-20 12:40:34 +00:00
carpedkm
7d36b1d592 fix: address review findings in plugin sync PR
- OpenClaw schedule_cmd: pass project as required positional arg
- OpenClaw schedule_cmd/unschedule_cmd: unpack Tuple[bool, str] return
- OpenClaw schedule_cmd: propagate failure status (return 1 on not ok)
- OpenClaw unschedule_cmd: pass project to avoid silent no-op
- OpenClaw --minute default: 17 (consistent with engine and MCP)
- harvest.py: move datetime import to module level

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-20 12:04:07 +00:00
carpedkm
0be780052a feat: sync all 4 runtime plugins with full engine surface + fix #52 #58 #62
Bug fixes:
- #52: bundle run-sleep.sh in Claude Code plugin + 4-level fallback
- #58: add skillopt-sleep console script entry point in pyproject.toml
- #62: filter headless claude -p replay sessions from harvest

Plugin sync (Claude Code / Codex / Copilot / OpenClaw):
- Document all 22 CLI flags, 7 actions, 4 backends across all SKILL.md files
- Document config keys (preferences, gate_mode, dream_rollouts, etc.)
- Document memory consolidation (evolve_memory / evolve_skill)
- Add schedule/unschedule to all plugins
- Copilot MCP: expand schema from 3 → 16 params + schedule tools
- OpenClaw: add schedule/unschedule subcommands via shared scheduler

Tests:
- Cross-plugin parity test (prevents future feature drift)
- MCP schema completeness test

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-20 11:31:09 +00:00
carpedkm
0b5b9a4296 Merge pull request #60 from Kirchberg/codex/reviewed-task-files-cwd
Add reviewed task-file flow for Codex sleep runs
2026-06-20 08:59:02 +00:00
Kirill Kostarev
05cdc26beb Add reviewed task-file flow for Codex sleep runs
2026-06-20 08:58:48 +00:00
Yifan Yang
382811ddcc Merge pull request #50 from Dongbumlee/Dongbumlee/copilot-sleep-backend
Add Copilot as a SkillOpt-Sleep model backend (CopilotCliBackend) + research-engine MCP plugin
2026-06-20 16:57:53 +08:00
DB Lee
d367ae1eea docs(plugins): list copilot in the cross-tool backend overview
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-06-17 17:38:10 -07:00
DB Lee
2c0980bda3 docs(copilot): correct backend hint in research MCP plugin (openai -> azure_openai)
The advertised backend choices in scripts/train.py use 'azure_openai',
not 'openai'; align the inputSchema description hint accordingly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-06-17 17:25:50 -07:00
DB Lee
5799695951 feat(copilot): implement attempt_with_tools with cross-platform tool shims
Adds honest tool-call detection for CopilotCliBackend, mirroring the
Claude/Codex backends. Writes per-tool executable shims into the work dir
and detects real invocations from a calllog (not self-reported markers).
The Copilot backend is Windows-validated, so shims are cross-platform:
a .cmd batch shim on Windows and a chmod'd bash shim on POSIX, with an
OS-specific tool hint. Mirrors _call's flags/env (isolated COPILOT_HOME,
--allow-all-tools, MCP/instruction disabling) and the UTF-8 subprocess fix.

Adds test_attempt_with_tools_honest_detection: a CI-friendly, OS-aware
stub stands in for the CLI, runs the shim, and asserts both JSONL parsing
and log-based detection. Validated live on Windows (real Copilot call) and
on Linux/WSL (POSIX path).
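
The shim-plus-calllog idea can be sketched as follows. This is a simplified illustration, not the backend's code: the function names and log line format are assumptions, and the POSIX shim uses /bin/sh rather than bash to keep the sketch portable.

```python
import os
import stat
from pathlib import Path


def write_tool_shim(work_dir: Path, tool: str, calllog: Path) -> Path:
    """Write an executable shim that appends one line to `calllog` per invocation."""
    if os.name == "nt":
        shim = work_dir / f"{tool}.cmd"
        # Redirection goes first so trailing digits in arguments are not
        # misread as file-handle numbers by cmd.exe.
        shim.write_text(f'@echo off\n>> "{calllog}" echo {tool} %*\n', encoding="utf-8")
    else:
        shim = work_dir / tool
        shim.write_text(f'#!/bin/sh\necho "{tool} $*" >> "{calllog}"\n', encoding="utf-8")
        mode = shim.stat().st_mode
        shim.chmod(mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return shim


def tool_was_called(calllog: Path, tool: str) -> bool:
    """Detect a real invocation from the log, not from the model's own claims."""
    if not calllog.exists():
        return False
    lines = calllog.read_text(encoding="utf-8").splitlines()
    return any(line.split(" ", 1)[0] == tool for line in lines)
```

Reading the log rather than the model output is what makes the detection "honest": a response that merely says it used a tool leaves the calllog empty.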

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-06-17 17:25:50 -07:00
DB Lee
013a7cd83a test: add unit tests for CopilotCliBackend (parsing + alias + isolated home)
Covers _parse_jsonl_response (multi-message concat, junk-line skipping,
empty/non-assistant events), get_backend alias resolution, and the
isolated-COPILOT_HOME / full-env opt-out behavior. Pure logic, no CLI required.
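
The parsing cases these tests cover can be illustrated with a sketch. The event shape used here, `{"type": "assistant.message", "data": {"content": ...}}`, is a guess made for illustration and is not confirmed by this log; the empty-string join is also an assumption.

```python
import json


def parse_jsonl_response(raw: str) -> str:
    """Concatenate assistant message content from a JSONL event stream."""
    parts = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip junk lines such as banners or progress output
        if not isinstance(event, dict) or event.get("type") != "assistant.message":
            continue  # ignore non-assistant events
        data = event.get("data")
        content = data.get("content") if isinstance(data, dict) else None
        if isinstance(content, str) and content:
            parts.append(content)
    return "".join(parts)
```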

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-06-17 17:25:50 -07:00
DB Lee
21f93c16c7 Add GitHub Copilot backend to SkillOpt-Sleep
Add CopilotCliBackend that drives the GitHub Copilot CLI in
non-interactive mode (copilot -p ... --output-format json) and parses the
JSONL event stream for assistant.message content. Registered as the
'copilot' backend (with aliases) and wired through the CLI, config,
experiment harness, and the Copilot MCP server's backend enum.

- Force UTF-8 decoding of CLI output (fixes cp1252 UnicodeDecodeError on
  Windows when responses contain non-cp1252 bytes).
- Minimise per-call startup: isolated COPILOT_HOME with built-in MCPs and
  custom instructions disabled, so user MCP servers are not spawned per
  call (~5x faster: 36s -> 7.4s). Override via SKILLOPT_SLEEP_COPILOT_HOME
  / SKILLOPT_SLEEP_COPILOT_MODEL / SKILLOPT_SLEEP_COPILOT_FULL_ENV.

Validated end-to-end on real held-out tasks (researcher persona:
0.42 -> 1.00 lift; gate correctly rejects non-improving edits).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-06-17 17:25:50 -07:00
DB Lee
5dc894715f Add SkillOpt research-engine MCP server plugin for Copilot
Exposes scripts/train.py and scripts/eval_only.py as Copilot MCP tools
(skillopt_list_configs, skillopt_train, skillopt_eval) via a stdlib-only
stdio server, mirroring the existing SkillOpt-Sleep plugin layout.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-06-17 17:24:00 -07:00
Yifan Yang
6940e46f4e Merge pull request #65 from summerview1997/codex/searchqa-materialize-splits
Add SearchQA split materialization helper
2026-06-17 23:50:38 +08:00
Yifan Yang
0e962219f5 Merge pull request #64 from summerview1997/codex/searchqa-rollout-failfast
Fail fast on systemic SearchQA rollout failures
2026-06-17 23:49:55 +08:00
Yifan Yang
fc42e6bf72 Merge pull request #63 from summerview1997/codex/webui-env-backend-preflight
Add WebUI env loading and backend preflight
2026-06-17 23:49:50 +08:00
summerview1997
c755792049 Add SearchQA materialization tests
2026-06-16 09:27:09 +08:00
summerview1997
e591a28242 Add SearchQA split materialization helper
2026-06-16 09:26:56 +08:00
summerview1997
c04467a428 Add SearchQA materialization dependency extra
2026-06-16 09:26:46 +08:00
summerview1997
d5ae8c8e66 Document SearchQA split materialization
2026-06-16 09:26:35 +08:00
summerview1997
923becb00f Add SearchQA rollout fail-fast tests
2026-06-16 09:21:08 +08:00
summerview1997
da799620ba Fail fast on systemic SearchQA rollout failures
2026-06-16 09:20:57 +08:00
summerview1997
30cc8a3ed3 Add WebUI env preflight tests
2026-06-16 09:04:30 +08:00
summerview1997
d05851bd7f Add WebUI env loading and backend preflight
2026-06-16 09:04:19 +08:00
Yifan Yang
46b3207b96 docs(sleep): trim RESULTS to the headline results (remove the full grid)
Remove the per-cell full deployment grid section; keep the gate-safety stress
test, experience-replay scaling + night-by-night climb, the dream-diversity
ablation, the gbrain end-to-end result, and the scope/limitations. Renumber
sections; update the README pointer accordingly.
2026-06-15 17:08:51 +00:00
Yifan Yang
d43e8dba1a docs(sleep): expand the grid into per-benchmark night-by-night tables
Replace the compact baseline->after grid with three grouped per-benchmark tables
(SearchQA / LiveMath / SpreadsheetBench), each showing all 3 targets x both modes
across every night (N0..N5) + Δ. Makes the trajectory visible — gains reach a
level and hold rather than being single lucky readings — and presents the full
18-cell evidence in a more solid, readable form. Footnotes LiveMath's 4-night run
(train split <50 tasks). Numbers unchanged; just richer presentation.
2026-06-15 16:54:01 +00:00
Yifan Yang
d02098ffc4 docs(sleep): add full Results & Analysis (RESULTS.md); link from README
Adds docs/sleep/RESULTS.md — the complete deployment-scale study behind
SkillOpt-Sleep, presented rigorously (named benchmarks, test sizes, metrics,
baseline->after, single shared protocol):
  1. Gate-safety stress test: ungated nano SearchQA collapses 0.554->0.026
     (-52.8); the gated twin holds 0.570 — the core argument for the design.
  2. Full 18-cell deployment grid (3 benchmarks x 3 targets x gate/free),
     shipped config: mean +0.5, range [-2.4, +5.1], nothing hidden.
  3. Experience-replay scaling (recall_k 10->20->full: +3.1->+4.5->+5.6) and
     the night-by-night climb (0.798->...->0.858, gate accepts as late as N5).
  4. Dream-diversity fix as defense-in-depth: 3-config grid comparison
     (-2.66/-52.8 -> +0.24/-4.0 -> +0.53/-2.4); the -52.8 cell becomes +2.7
     from the dream fix alone.
  5. gbrain end-to-end 0.00->1.00 on real Claude + Codex.
  6. Honest scope: where it helps vs flat-in-noise, single-seed caveat with a
     seed-robustness spot check, keep-the-gate-on.
README Results section now links prominently to it. Docs only; numbers are
self-contained with reproduce commands (no raw run dumps committed).
2026-06-15 16:49:13 +00:00
Yifan Yang
ea4ff459d7 docs(sleep): make the results section rigorous (named benchmarks, baseline→after)
Label each result with its benchmark, test size, metric, target model, and gate
mode; show absolute baseline→after (not just Δ); state the single shared protocol
once. SearchQA recall-scaling table (1400-item test, SQuAD-EM, GPT-5.5, gated) +
SpreadsheetBench confirmation (280-item, cell-value compare, nano, gate-free) +
the gbrain end-to-end line. Keeps the single-seed / flat-on-noisy caveats.
2026-06-15 16:42:43 +00:00
Yifan Yang
de3be75bac docs(sleep): add a SkillOpt-Sleep module readme + News mention
Adds docs/sleep/README.md — a concise intro to the SkillOpt-Sleep plugin (what
it is, how to use it across the three agents, the opt-in experience-replay /
dream-rollout knobs, and headline results), linking to the full guide section.
Adds a News bullet pointing to it. No code changes.
2026-06-15 16:31:15 +00:00
Yifan Yang
b701d9b6d9 docs: move SkillOpt-Sleep into the guide; clean docs/sleep; fix guide link
Per maintainer request:
- Remove the internal/scratch docs/sleep/ tree (reports, raw logs, blog run
  JSON, sweep.jsonl) — 23 files — and the root PUBLISHING.md. These were
  working notes, not reference docs.
- Take the dedicated SkillOpt-Sleep content out of the main README (News bullet
  + section) and host it in the rendered guide instead: new section 9 in
  docs/guideline.html (deployment companion, the three plugins, opt-in
  experience replay / dream rollouts) with a sidebar entry.
- Fix the README's opening reference so "Documentation & Reproduction Guide"
  links directly to the rendered GitHub Pages page, not the raw .html source.
- Repoint the now-removed docs/sleep links in the plugin READMEs to the guide
  section.

The plugin code (plugins/, skillopt_sleep/) is unchanged; only docs move.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-15 16:20:50 +00:00
Yifan Yang
722ce646d4 feat(sleep): experience replay + dream rollouts in the cycle (opt-in)
Wires two consolidation mechanisms into the shipped nightly cycle, both default
OFF so existing behavior is unchanged:
  - dream_rollouts (>1): multi-rollout contrastive reflection per task
  - recall_k (>0): associative recall of the K most-similar past tasks (from a
    capped task_archive persisted in state.json) into tonight's dream
  - dream_factor (>0): synthetic task variants

New shared engine module skillopt_sleep/dream.py (recall_similar, dream_augment,
dream_consolidate) is called by both the plugin cycle and the experiment harness,
so reported numbers exercise the exact shipped code. Built on the existing
rollouts_k/sample_id support already in consolidate.py/rollout.py.
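
Associative recall of the K most-similar past tasks could be sketched as below. The similarity measure (Jaccard overlap of lowercase word sets) and the signature are assumptions chosen for illustration; the log does not say how skillopt_sleep/dream.py scores similarity.

```python
def recall_similar(query: str, archive: list[str], k: int) -> list[str]:
    """Return up to k archived tasks ranked by word-set overlap with `query`."""
    if k <= 0:
        return []  # recall is opt-in: k == 0 leaves the dream unchanged
    query_words = set(query.lower().split())

    def jaccard(task: str) -> float:
        task_words = set(task.lower().split())
        union = query_words | task_words
        return len(query_words & task_words) / len(union) if union else 0.0

    # sorted() is stable, so ties keep their archive order.
    return sorted(archive, key=jaccard, reverse=True)[:k]


if __name__ == "__main__":
    past = [
        "sum a column in a spreadsheet",
        "who wrote hamlet",
        "average a column in a spreadsheet",
    ]
    print(recall_similar("sum a column", past, k=2))
```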

Validated (5 nights x 10 real tasks/night, full held-out test, GPT-5.5, gated):
the gain scales with recall depth on a clean signal —
SearchQA recall_k=10 +3.1, recall_k=20 +4.5, full-history reference +5.6;
SpreadsheetBench (nano, gate-free) +3.6. Flat within noise on saturated/noisy
cells. See docs/sleep/EXPERIENCE_REPLAY.md (+ raw runs under blog_runs/v2_port/).

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-15 15:58:27 +00:00
Yifan Yang
576f2f8bad Merge pull request #59 from Elzlxx/feat/openclaw-skillopt-sleep
feat(plugins): add OpenClaw shell for SkillOpt-Sleep
2026-06-15 18:26:12 +08:00
carpedkm
00d07bc59a Merge pull request #48 from Kirchberg/codex/codex-desktop-harvest
Add Codex Desktop transcript harvesting
2026-06-15 10:23:18 +00:00
Kirill Kostarev
31715a8b43 Add Codex Desktop transcript harvesting
2026-06-15 10:23:08 +00:00
carpedkm
e8c3e10b30 Merge pull request #49 from Kirchberg/codex/codex-skill-first-upstream
Make Codex integration skill-first
2026-06-15 10:21:43 +00:00
Kirill Kostarev
d31e9d9407 Back up legacy Codex prompt during install
2026-06-15 10:21:30 +00:00
Kirill Kostarev
1953484822 Make Codex integration skill-first
2026-06-15 10:21:30 +00:00