Commit Graph

183 Commits

Author SHA1 Message Date
Yifan Yang
b5a1c2b317 Merge pull request #73 from Yif-Yang/fix/bare-subscription-auth
fix(sleep): make --bare conditional on ANTHROPIC_API_KEY (#68)
2026-06-20 21:46:09 +08:00
carpedkm
552ddefd74 fix: narrow CLI error markers to avoid false positives
Address codex review: "API key" was too generic — a model response
about configuring API keys would trigger a false auth warning. Now:
- Use specific phrases ("Invalid API key", "Unauthorized: invalid x-api-key")
- Only check short stdout (<300 chars) to skip real model responses
- Still check stderr unconditionally

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-20 13:32:43 +00:00
carpedkm
bfa53bc46d fix(sleep): make --bare conditional on ANTHROPIC_API_KEY (#68)
ClaudeCliBackend._call() and attempt_with_tools() hardcoded --bare,
which skips Claude CLI's credential resolution. This broke subscription-
token auth: every model call silently returned "Not logged in" and
scored 0 — the user saw "baseline 0.0 → candidate 0.0, gate reject"
with no indication of an auth failure.

Fix: only pass --bare when ANTHROPIC_API_KEY is set. The remaining
isolation flags (--disable-slash-commands, --disallowedTools,
--exclude-dynamic-system-prompt-sections, clean temp cwd) already
provide the needed isolation without --bare.

Also adds _detect_cli_error() to log a warning when CLI output matches
known auth error patterns, so auth failures surface loudly instead of
deflating every score to 0.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-20 13:28:34 +00:00
Yifan Yang
24b5a25ba8 Merge pull request #72 from Yif-Yang/feat/plugin-feature-sync
feat: sync all 4 runtime plugins with full engine surface + fix #52 #58 #62
2026-06-20 20:42:24 +08:00
carpedkm
0d648b2580 fix: address codex+gpt-5.5 review findings
- harvest: tighten sub-3s filter to also require prompt < 200 chars,
  avoiding false positives on fast real one-shot questions
- openclaw schedule_cmd: add docstring clarifying it schedules the
  shared engine, not the OpenClaw-native runner

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-20 12:40:34 +00:00
carpedkm
7d36b1d592 fix: address review findings in plugin sync PR
- OpenClaw schedule_cmd: pass project as required positional arg
- OpenClaw schedule_cmd/unschedule_cmd: unpack Tuple[bool, str] return
- OpenClaw schedule_cmd: propagate failure status (return 1 on not ok)
- OpenClaw unschedule_cmd: pass project to avoid silent no-op
- OpenClaw --minute default: 17 (consistent with engine and MCP)
- harvest.py: move datetime import to module level

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-20 12:04:07 +00:00
carpedkm
0be780052a feat: sync all 4 runtime plugins with full engine surface + fix #52 #58 #62
Bug fixes:
- #52: bundle run-sleep.sh in Claude Code plugin + 4-level fallback
- #58: add skillopt-sleep console script entry point in pyproject.toml
- #62: filter headless claude -p replay sessions from harvest

Plugin sync (Claude Code / Codex / Copilot / OpenClaw):
- Document all 22 CLI flags, 7 actions, 4 backends across all SKILL.md files
- Document config keys (preferences, gate_mode, dream_rollouts, etc.)
- Document memory consolidation (evolve_memory / evolve_skill)
- Add schedule/unschedule to all plugins
- Copilot MCP: expand schema from 3 → 16 params + schedule tools
- OpenClaw: add schedule/unschedule subcommands via shared scheduler

Tests:
- Cross-plugin parity test (prevents future feature drift)
- MCP schema completeness test

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-20 11:31:09 +00:00
carpedkm
0b5b9a4296 Merge pull request #60 from Kirchberg/codex/reviewed-task-files-cwd
Add reviewed task-file flow for Codex sleep runs
2026-06-20 08:59:02 +00:00
Kirill Kostarev
05cdc26beb Add reviewed task-file flow for Codex sleep runs 2026-06-20 08:58:48 +00:00
Yifan Yang
382811ddcc Merge pull request #50 from Dongbumlee/Dongbumlee/copilot-sleep-backend
Add Copilot as a SkillOpt-Sleep model backend (CopilotCliBackend) + research-engine MCP plugin
2026-06-20 16:57:53 +08:00
DB Lee
d367ae1eea docs(plugins): list copilot in the cross-tool backend overview
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-06-17 17:38:10 -07:00
DB Lee
2c0980bda3 docs(copilot): correct backend hint in research MCP plugin (openai -> azure_openai)
The advertised backend choices in scripts/train.py use 'azure_openai',
not 'openai'; align the inputSchema description hint accordingly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-06-17 17:25:50 -07:00
DB Lee
5799695951 feat(copilot): implement attempt_with_tools with cross-platform tool shims
Adds honest tool-call detection for CopilotCliBackend, mirroring the
Claude/Codex backends. Writes per-tool executable shims into the work dir
and detects real invocations from a calllog (not self-reported markers).
The Copilot backend is Windows-validated, so shims are cross-platform:
a .cmd batch shim on Windows and a chmod'd bash shim on POSIX, with an
OS-specific tool hint. Mirrors _call's flags/env (isolated COPILOT_HOME,
--allow-all-tools, MCP/instruction disabling) and the UTF-8 subprocess fix.

Adds test_attempt_with_tools_honest_detection: a CI-friendly, OS-aware
stub stands in for the CLI, runs the shim, and asserts both JSONL parsing
and log-based detection. Validated live on Windows (real Copilot call) and
on Linux/WSL (POSIX path).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-06-17 17:25:50 -07:00
DB Lee
013a7cd83a test: add unit tests for CopilotCliBackend (parsing + alias + isolated home)
Covers _parse_jsonl_response (multi-message concat, junk-line skipping,
empty/non-assistant events), get_backend alias resolution, and the
isolated-COPILOT_HOME / full-env opt-out behavior. Pure logic, no CLI required.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-06-17 17:25:50 -07:00
DB Lee
21f93c16c7 Add GitHub Copilot backend to SkillOpt-Sleep
Add CopilotCliBackend that drives the GitHub Copilot CLI in
non-interactive mode (copilot -p ... --output-format json) and parses the
JSONL event stream for assistant.message content. Registered as the
'copilot' backend (with aliases) and wired through the CLI, config,
experiment harness, and the Copilot MCP server's backend enum.

- Force UTF-8 decoding of CLI output (fixes cp1252 UnicodeDecodeError on
  Windows when responses contain non-cp1252 bytes).
- Minimise per-call startup: isolated COPILOT_HOME with built-in MCPs and
  custom instructions disabled, so user MCP servers are not spawned per
  call (~5x faster: 36s -> 7.4s). Override via SKILLOPT_SLEEP_COPILOT_HOME
  / SKILLOPT_SLEEP_COPILOT_MODEL / SKILLOPT_SLEEP_COPILOT_FULL_ENV.

Validated end-to-end on real held-out tasks (researcher persona:
0.42 -> 1.00 lift; gate correctly rejects non-improving edits).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-06-17 17:25:50 -07:00
DB Lee
5dc894715f Add SkillOpt research-engine MCP server plugin for Copilot
Exposes scripts/train.py and scripts/eval_only.py as Copilot MCP tools
(skillopt_list_configs, skillopt_train, skillopt_eval) via a stdlib-only
stdio server, mirroring the existing SkillOpt-Sleep plugin layout.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-06-17 17:24:00 -07:00
Yifan Yang
6940e46f4e Merge pull request #65 from summerview1997/codex/searchqa-materialize-splits
Add SearchQA split materialization helper
2026-06-17 23:50:38 +08:00
Yifan Yang
0e962219f5 Merge pull request #64 from summerview1997/codex/searchqa-rollout-failfast
Fail fast on systemic SearchQA rollout failures
2026-06-17 23:49:55 +08:00
Yifan Yang
fc42e6bf72 Merge pull request #63 from summerview1997/codex/webui-env-backend-preflight
Add WebUI env loading and backend preflight
2026-06-17 23:49:50 +08:00
summerview1997
c755792049 Add SearchQA materialization tests 2026-06-16 09:27:09 +08:00
summerview1997
e591a28242 Add SearchQA split materialization helper 2026-06-16 09:26:56 +08:00
summerview1997
c04467a428 Add SearchQA materialization dependency extra 2026-06-16 09:26:46 +08:00
summerview1997
d5ae8c8e66 Document SearchQA split materialization 2026-06-16 09:26:35 +08:00
summerview1997
923becb00f Add SearchQA rollout fail-fast tests 2026-06-16 09:21:08 +08:00
summerview1997
da799620ba Fail fast on systemic SearchQA rollout failures 2026-06-16 09:20:57 +08:00
summerview1997
30cc8a3ed3 Add WebUI env preflight tests 2026-06-16 09:04:30 +08:00
summerview1997
d05851bd7f Add WebUI env loading and backend preflight 2026-06-16 09:04:19 +08:00
Yifan Yang
46b3207b96 docs(sleep): trim RESULTS to the headline results (remove the full grid)
Remove the per-cell full deployment grid section; keep the gate-safety stress
test, experience-replay scaling + night-by-night climb, the dream-diversity
ablation, the gbrain end-to-end result, and the scope/limitations. Renumber
sections; update the README pointer accordingly.
2026-06-15 17:08:51 +00:00
Yifan Yang
d43e8dba1a docs(sleep): expand the grid into per-benchmark night-by-night tables
Replace the compact baseline->after grid with three grouped per-benchmark tables
(SearchQA / LiveMath / SpreadsheetBench), each showing all 3 targets x both modes
across every night (N0..N5) + Δ. Makes the trajectory visible — gains reach a
level and hold rather than being single lucky readings — and presents the full
18-cell evidence in a more solid, readable form. Footnotes LiveMath's 4-night run
(train split <50 tasks). Numbers unchanged; just richer presentation.
2026-06-15 16:54:01 +00:00
Yifan Yang
d02098ffc4 docs(sleep): add full Results & Analysis (RESULTS.md); link from README
Adds docs/sleep/RESULTS.md — the complete deployment-scale study behind
SkillOpt-Sleep, presented rigorously (named benchmarks, test sizes, metrics,
baseline->after, single shared protocol):
  1. Gate-safety stress test: ungated nano SearchQA collapses 0.554->0.026
     (-52.8); the gated twin holds 0.570 — the core argument for the design.
  2. Full 18-cell deployment grid (3 benchmarks x 3 targets x gate/free),
     shipped config: mean +0.5, range [-2.4, +5.1], nothing hidden.
  3. Experience-replay scaling (recall_k 10->20->full: +3.1->+4.5->+5.6) and
     the night-by-night climb (0.798->...->0.858, gate accepts as late as N5).
  4. Dream-diversity fix as defense-in-depth: 3-config grid comparison
     (-2.66/-52.8 -> +0.24/-4.0 -> +0.53/-2.4); the -52.8 cell becomes +2.7
     from the dream fix alone.
  5. gbrain end-to-end 0.00->1.00 on real Claude + Codex.
  6. Honest scope: where it helps vs flat-in-noise, single-seed caveat with a
     seed-robustness spot check, keep-the-gate-on.
README Results section now links prominently to it. Docs only; numbers are
self-contained with reproduce commands (no raw run dumps committed).
2026-06-15 16:49:13 +00:00
Yifan Yang
ea4ff459d7 docs(sleep): make the results section rigorous (named benchmarks, baseline→after)
Label each result with its benchmark, test size, metric, target model, and gate
mode; show absolute baseline→after (not just Δ); state the single shared protocol
once. SearchQA recall-scaling table (1400-item test, SQuAD-EM, GPT-5.5, gated) +
SpreadsheetBench confirmation (280-item, cell-value compare, nano, gate-free) +
the gbrain end-to-end line. Keeps the single-seed / flat-on-noisy caveats.
2026-06-15 16:42:43 +00:00
Yifan Yang
de3be75bac docs(sleep): add a SkillOpt-Sleep module readme + News mention
Adds docs/sleep/README.md — a concise intro to the SkillOpt-Sleep plugin (what
it is, how to use it across the three agents, the opt-in experience-replay /
dream-rollout knobs, and headline results), linking to the full guide section.
Adds a News bullet pointing to it. No code changes.
2026-06-15 16:31:15 +00:00
Yifan Yang
b701d9b6d9 docs: move SkillOpt-Sleep into the guide; clean docs/sleep; fix guide link
Per maintainer request:
- Remove the internal/scratch docs/sleep/ tree (reports, raw logs, blog run
  JSON, sweep.jsonl) — 23 files — and the root PUBLISHING.md. These were
  working notes, not reference docs.
- Take the dedicated SkillOpt-Sleep content out of the main README (News bullet
  + section) and host it in the rendered guide instead: new section 9 in
  docs/guideline.html (deployment companion, the three plugins, opt-in
  experience replay / dream rollouts) with a sidebar entry.
- Fix the README's opening reference so "Documentation & Reproduction Guide"
  links directly to the rendered GitHub Pages page, not the raw .html source.
- Repoint the now-removed docs/sleep links in the plugin READMEs to the guide
  section.

The plugin code (plugins/, skillopt_sleep/) is unchanged; only docs move.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-15 16:20:50 +00:00
Yifan Yang
722ce646d4 feat(sleep): experience replay + dream rollouts in the cycle (opt-in)
Wires two consolidation mechanisms into the shipped nightly cycle, both default
OFF so existing behavior is unchanged:
  - dream_rollouts (>1): multi-rollout contrastive reflection per task
  - recall_k (>0): associative recall of the K most-similar past tasks (from a
    capped task_archive persisted in state.json) into tonight's dream
  - dream_factor (>0): synthetic task variants

New shared engine module skillopt_sleep/dream.py (recall_similar, dream_augment,
dream_consolidate) is called by both the plugin cycle and the experiment harness,
so reported numbers exercise the exact shipped code. Built on the existing
rollouts_k/sample_id support already in consolidate.py/rollout.py.

Validated (5 nights x 10 real tasks/night, full held-out test, GPT-5.5, gated):
the gain scales with recall depth on a clean signal —
SearchQA recall_k=10 +3.1, recall_k=20 +4.5, full-history reference +5.6;
SpreadsheetBench (nano, gate-free) +3.6. Flat within noise on saturated/noisy
cells. See docs/sleep/EXPERIENCE_REPLAY.md (+ raw runs under blog_runs/v2_port/).

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-15 15:58:27 +00:00
Yifan Yang
576f2f8bad Merge pull request #59 from Elzlxx/feat/openclaw-skillopt-sleep
feat(plugins): add OpenClaw shell for SkillOpt-Sleep
2026-06-15 18:26:12 +08:00
carpedkm
00d07bc59a Merge pull request #48 from Kirchberg/codex/codex-desktop-harvest
Add Codex Desktop transcript harvesting
2026-06-15 10:23:18 +00:00
Kirill Kostarev
31715a8b43 Add Codex Desktop transcript harvesting 2026-06-15 10:23:08 +00:00
carpedkm
e8c3e10b30 Merge pull request #49 from Kirchberg/codex/codex-skill-first-upstream
Make Codex integration skill-first
2026-06-15 10:21:43 +00:00
Kirill Kostarev
d31e9d9407 Back up legacy Codex prompt during install 2026-06-15 10:21:30 +00:00
Kirill Kostarev
1953484822 Make Codex integration skill-first 2026-06-15 10:21:30 +00:00
carpedkm
1b2652c6f8 Merge pull request #44 from imshunsuke/refactor/reflect-default-base
refactor: make EnvAdapter.reflect a shared default (fixes dropped reflect kwargs)
2026-06-15 09:06:38 +00:00
Shunsuke
98d0430bee refactor: make EnvAdapter.reflect a shared default (fixes dropped reflect kwargs)
All six adapters duplicated an identical reflect() that delegates to
run_minibatch_reflect. The copies had drifted: OfficeQA/DocVQA silently
dropped meta_skill_context and ALFWorld dropped update_mode, so those
analysts ran without inputs every other benchmark receives (active under
the default use_meta_skill: true).

Move the delegation into EnvAdapter.reflect as one default that forwards
all kwargs uniformly, and delete the six overrides. reflect is no longer
abstract — adapters inherit it and override only for custom logic.

Net -225 lines. Behavior change: OfficeQA/DocVQA/ALFWorld reflect now
receive the kwargs they previously dropped; the three already-correct
benchmarks are unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 09:06:00 +00:00
Yifan Yang
eef4805b25 Merge pull request #43 from imshunsuke/docs/fix-benchmark-loader-naming
docs: align benchmark guide and template with dataloader.py naming
2026-06-15 17:00:45 +08:00
Yifan Yang
86bad36ffe feat(sleep): SkillOpt-Sleep plugin update (preview) — engine robustness + scheduling
Updates the SkillOpt-Sleep plugin on top of the current main. User-facing and
engine improvements since the initial drop:

* Command renamed /sleep -> /skillopt-sleep across Claude Code + Codex shells;
  refreshed plugin READMEs and install scripts.
* Built-in scheduling (skillopt_sleep/scheduler.py + __main__): schedule /
  unschedule the nightly cycle without external cron wiring.
* Backend robustness: bounded retry with backoff (no more silent empty-string
  on transient 429/timeout), content-filter-safe rollout prompt, an
  output-contract guardrail that rejects edits violating the task's required
  format, and a per-sample cache key so repeated dream rollouts are independent
  samples (fixes degenerate single-sample reflection).
* consolidate / rollout / replay: parallel multi-rollout dreaming, gate-mode
  controls, TaskRecord.system framing field.

Scope: this commit ships only the plugin engine + shells. Research/benchmark
harnesses and their data are intentionally not included; the public package
has no dependency on them (the one research-evaluator import is now guarded).
Marked as an early preview in the README; we'll keep iterating.

99/99 unit tests pass.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-14 16:12:00 +00:00
elzlxx
553446575a feat(plugins): add OpenClaw shell for SkillOpt-Sleep
Adds a thin OpenClaw shell wrapping the SkillOpt-Sleep engine. Enables
nightly validation-gated skill improvement cycles for OpenClaw agents.

Components:
- skillopt_sleep_openclaw.py: DeepSeek V4 Pro + Ollama nomic-embed-text
  backend, mirroring the Claude/Codex/Copilot backend pattern.
- run_sleep.py: CLI entry point supporting dry-run and pre-built task files.
- run_sleep_cron.sh: bash wrapper for nightly cron invocation.
- slash_sleep.py: /sleep command (status / run / adopt / reject / cost).
- config.json: engine config tuned for our stack.
- SKILL.md: OpenClaw skill manifest.
- tests/: 14 held-out tasks across 3 categories (research-cron, devops, wiki).

OpenClaw is the 4th ecosystem in which SkillOpt-Sleep can be deployed,
joining Claude Code, Codex, and Copilot. The shell follows the same
single-engine / thin-shell pattern as the existing three plugins.

End-to-end tested: pipeline runs against real OpenClaw session transcripts,
gate correctly rejects non-improvements, staging artifacts land in
~/.skillopt-sleep/staging/<night>/. Cost: ~$0.02/night on DeepSeek V4 Pro.
2026-06-14 23:27:54 +08:00
Cuzyoung
c1ac570d94 docs(guideline): make SearchQA the first demo — copy-paste materialization snippet + train command
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 13:51:20 +00:00
Cuzyoung
d8023a47c9 docs(guideline): novice-first restructure — Quick Start before data, honest first-demo path, own-data narrative
- Move Quick Start (now §3) ahead of the data chapter; renumber and fix
  cross-references and the sidebar nav.
- Add §3.1 'Your First Demo': states plainly that data/ ships ID manifests
  only, gives the one benchmark that runs out of the box (ALFWorld with its
  bundled path split), and points other benchmarks to the data/README.md
  materialization step. Also offers eval-only with ckpt/ skills as a
  lighter sanity check.
- Reframe the data chapter as 'Run on Your Own Data' (§4) with a three-step
  lead-in (split dir -> item schema -> --split_dir) and a pointer to §7.2
  for new task shapes.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 13:42:50 +00:00
Cuzyoung
b0b62fcb86 docs(readme): slim README — move install/quick-start/data/config details to the guideline page
README now: badges + one-line pointer to docs/guideline.html, overview,
demo, sleep section, extensibility pointers, WebUI launch, citation.
All run-the-demo commands live in the guideline (which already covered
install, credentials, training, eval, outputs, data prep, and config).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 13:27:36 +00:00
Cuzyoung
3308c4c5dc docs(guideline): add PyPI install option and skill-aware reflection config rows
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 13:27:12 +00:00
Cuzyoung
0d5b331cd5 Merge branch 'docs/guideline' into feat/skill-aware-reflection
# Conflicts:
#	README.md
2026-06-10 13:27:12 +00:00