21 Commits

Author SHA1 Message Date
CharlesYang030
e4ea6a6771 chore(release): v0.2.0
Highlights since v0.1.0:
- feat: SkillOpt-Sleep engine — nightly offline self-evolution
  (harvest -> mine -> replay -> consolidate behind a validation gate),
  with multi-objective reward, experience replay + dream rollouts,
  slow-update long-term memory, and secret redaction in cycle diagnostics.
  Shipped as the `skillopt-sleep` CLI.
- feat: cross-tool backends & plugin shells — Claude, Codex (+Desktop
  harvest), Copilot, Devin, and OpenClaw.
- feat: SearchQA split materialization + rollout fail-fast.
- fix: Windows robustness for claude/codex backends, hardened JSON
  fallback, Qwen timeout/thinking gating, Codex failure surfacing.

Packaging:
- Bump pyproject / skillopt / skillopt_sleep to 0.2.0.
- Restore skillopt_webui to the packaged wheel.

See CHANGELOG.md for the full changelog and contributor acknowledgements.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-07-02 22:11:10 +08:00
Yif Yang
5487e2c426 fix(skillopt-sleep): redact secrets before persisting cycle diagnostics
PR #92 added a per-cycle diagnostics.json that surfaces backend stderr,
optimizer replies, and task responses so a 0.0 night is self-diagnosing.
Those free-text fields can carry credentials (e.g. a codex 401 stderr dump
containing an auth token), so persisting them verbatim was a new on-disk
leak surface.

- Add a shared redact_secrets() in staging.py and route diagnostics.json's
  call_error / reflect_raw_head / holdout_detail through it before writing.
- Redact the codex and Claude auth-error log lines too (a secondary sink
  when a file log handler is attached); last_call_error stays raw in memory
  so _AUTH_MARKERS matching is unaffected.
- Centralize _SECRET_PATTERNS in staging.py (harvest_codex now reuses them)
  and extend coverage to AWS / GitHub / Slack / Google / JWT token shapes.
- Tests: secret-shape coverage, private-key blocks, recursive/scalar
  passthrough, no over-redaction of plain prose, fail-fast auth-error log
  redaction, and an end-to-end check that diagnostics.json has no secret.

Observability-only; the gate and learning algorithm are unchanged.

Co-Authored-By: Claude <noreply@anthropic.com>
2026-06-30 19:47:36 +00:00
Daniel Martinez
9fa0716c72 fix(skillopt-sleep): also surface codex failures on the tool-call rollout path
Follow-up from a fresh-context review of the prior commit: CodexCliBackend.attempt_with_tools
(the rollout path for tool-requiring tasks) ran codex exec inline, swallowed all exceptions,
and never set last_call_error — so an auth/model/version failure on the tool path still
produced a silent empty->0 with no diagnostic signal, the exact failure class the prior commit
fixed for the _call path. Now it surfaces timeout/exception/non-zero-exit via last_call_error
(response stays empty; never leaks the CLI error text), so a failed tool rollout shows up in
diagnostics.json. Adds a regression test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 23:56:11 -05:00
Daniel Martinez
9fcf5868c3 fix(skillopt-sleep): surface codex auth/model/version failures instead of silently scoring 0
A nightly sleep cycle could run for weeks emitting held-out 0.0 -> 0.0 (gate reject, zero
edits), indistinguishable from "nothing to learn", when the real cause was the codex backend
returning an error (expired auth / model unsupported on the account / outdated CLI) that got
scored as a failed rollout.

backend (CodexCliBackend):
- split _call into _call_once + a retry wrapper: transient empties/timeouts are retried
  instead of silently returning "" (mirrors AzureOpenAIBackend's guard);
- on a non-zero exit, surface the reason via last_call_error and return "" rather than
  leaking the CLI error text as if it were a model response;
- fail fast (no retries) on fatal auth/model/version errors (401, refresh_token_reused,
  token_expired, "not supported when using Codex with a ChatGPT account",
  "requires a newer version of Codex").
backend (CliBackend.reflect): retain last_reflect_raw so a no-edits night is diagnosable.
consolidate: ConsolidationResult now carries per-task held-out detail (response, hard/soft,
  fail_reason) + reflect_raw + call_error.
cycle: write diagnostics.json per cycle so a 0.0 night self-explains instead of being a black box.
tests: 4 new (retry-not-silent-zero, auth-error-surfaced-not-scored, holdout-detail, reflect-raw).

Also gitignore the .skillopt-sleep/ runtime dir.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 22:26:20 -05:00
carpedkm
01b3e01804 fix: use None default for --lookback-hours to distinguish omitted vs 0
Codex round 3: argparse default=0 made every CLI invocation without
--lookback-hours clobber the config's 72h default. Now default=None;
only explicit --lookback-hours N (including 0) overrides config.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-20 14:23:17 +00:00
carpedkm
01075c90d3 fix: address codex round 2 — revert harvest break + allow lookback 0
- harvest.py: revert break to continue — mtime ordering can diverge
  from embedded ended_at timestamps (copy/touch), so we must check all
  files rather than early-exiting on the first old one
- cycle.py: use `is not None and > 0` so lookback_hours=0 means
  "scan full history" (opt-out of the cutoff)
- __main__.py: propagate --lookback-hours 0 to config as explicit 0

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-20 14:21:18 +00:00
carpedkm
6cc1cd2e95 fix: address codex review — use clock for cutoff + early-exit harvest
- cycle.py: use supplied `clock` parameter (not wall time) for the
  lookback cutoff, so deterministic tests/experiments get reproducible
  harvest windows
- harvest.py: break (not continue) when a file is older than since_iso,
  since files are sorted newest-first by mtime — avoids scanning the
  entire transcript directory for quiet projects with large histories

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-20 14:11:58 +00:00
carpedkm
889238b234 fix: add SKILLOPT_SLEEP_PYTHON override + lookback_hours first-run fallback
Two fixes from issue #57 feedback:

1. run-sleep.sh: support SKILLOPT_SLEEP_PYTHON env var to explicitly set
   the Python interpreter. Useful on macOS where system Python is 3.9 but
   a newer Python is available elsewhere (e.g. Codex Desktop's bundled
   Python 3.12). Applied to both the shared runner and the bundled
   Claude Code plugin copy.

2. cycle.py: on first run (no prior harvest recorded), apply the
   lookback_hours config (default 72h) as a time cutoff. Previously,
   first run scanned the entire transcript history, which could trigger
   massive LLM mining on users with months of session data.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-20 14:07:50 +00:00
carpedkm
552ddefd74 fix: narrow CLI error markers to avoid false positives
Address codex review: "API key" was too generic — a model response
about configuring API keys would trigger a false auth warning. Now:
- Use specific phrases ("Invalid API key", "Unauthorized: invalid x-api-key")
- Only check short stdout (<300 chars) to skip real model responses
- Still check stderr unconditionally

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-20 13:32:43 +00:00
carpedkm
bfa53bc46d fix(sleep): make --bare conditional on ANTHROPIC_API_KEY (#68)
ClaudeCliBackend._call() and attempt_with_tools() hardcoded --bare,
which skips Claude CLI's credential resolution. This broke subscription-
token auth: every model call silently returned "Not logged in" and
scored 0 — the user saw "baseline 0.0 → candidate 0.0, gate reject"
with no indication of an auth failure.

Fix: only pass --bare when ANTHROPIC_API_KEY is set. The remaining
isolation flags (--disable-slash-commands, --disallowedTools,
--exclude-dynamic-system-prompt-sections, clean temp cwd) already
provide the needed isolation without --bare.

Also adds _detect_cli_error() to log a warning when CLI output matches
known auth error patterns, so auth failures surface loudly instead of
deflating every score to 0.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-20 13:28:34 +00:00
carpedkm
0d648b2580 fix: address codex+gpt-5.5 review findings
- harvest: tighten sub-3s filter to also require prompt < 200 chars,
  avoiding false positives on fast real one-shot questions
- openclaw schedule_cmd: add docstring clarifying it schedules the
  shared engine, not the OpenClaw-native runner

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-20 12:40:34 +00:00
carpedkm
7d36b1d592 fix: address review findings in plugin sync PR
- OpenClaw schedule_cmd: pass project as required positional arg
- OpenClaw schedule_cmd/unschedule_cmd: unpack Tuple[bool, str] return
- OpenClaw schedule_cmd: propagate failure status (return 1 on not ok)
- OpenClaw unschedule_cmd: pass project to avoid silent no-op
- OpenClaw --minute default: 17 (consistent with engine and MCP)
- harvest.py: move datetime import to module level

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-20 12:04:07 +00:00
carpedkm
0be780052a feat: sync all 4 runtime plugins with full engine surface + fix #52 #58 #62
Bug fixes:
- #52: bundle run-sleep.sh in Claude Code plugin + 4-level fallback
- #58: add skillopt-sleep console script entry point in pyproject.toml
- #62: filter headless claude -p replay sessions from harvest

Plugin sync (Claude Code / Codex / Copilot / OpenClaw):
- Document all 22 CLI flags, 7 actions, 4 backends across all SKILL.md files
- Document config keys (preferences, gate_mode, dream_rollouts, etc.)
- Document memory consolidation (evolve_memory / evolve_skill)
- Add schedule/unschedule to all plugins
- Copilot MCP: expand schema from 3 → 16 params + schedule tools
- OpenClaw: add schedule/unschedule subcommands via shared scheduler

Tests:
- Cross-plugin parity test (prevents future feature drift)
- MCP schema completeness test

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-20 11:31:09 +00:00
Kirill Kostarev
05cdc26beb Add reviewed task-file flow for Codex sleep runs 2026-06-20 08:58:48 +00:00
DB Lee
5799695951 feat(copilot): implement attempt_with_tools with cross-platform tool shims
Adds honest tool-call detection for CopilotCliBackend, mirroring the
Claude/Codex backends. Writes per-tool executable shims into the work dir
and detects real invocations from a calllog (not self-reported markers).
The Copilot backend is Windows-validated, so shims are cross-platform:
a .cmd batch shim on Windows and a chmod'd bash shim on POSIX, with an
OS-specific tool hint. Mirrors _call's flags/env (isolated COPILOT_HOME,
--allow-all-tools, MCP/instruction disabling) and the UTF-8 subprocess fix.

Adds test_attempt_with_tools_honest_detection: a CI-friendly, OS-aware
stub stands in for the CLI, runs the shim, and asserts both JSONL parsing
and log-based detection. Validated live on Windows (real Copilot call) and
on Linux/WSL (POSIX path).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-06-17 17:25:50 -07:00
DB Lee
21f93c16c7 Add GitHub Copilot backend to SkillOpt-Sleep
Add CopilotCliBackend that drives the GitHub Copilot CLI in
non-interactive mode (copilot -p ... --output-format json) and parses the
JSONL event stream for assistant.message content. Registered as the
'copilot' backend (with aliases) and wired through the CLI, config,
experiment harness, and the Copilot MCP server's backend enum.

- Force UTF-8 decoding of CLI output (fixes cp1252 UnicodeDecodeError on
  Windows when responses contain non-cp1252 bytes).
- Minimise per-call startup: isolated COPILOT_HOME with built-in MCPs and
  custom instructions disabled, so user MCP servers are not spawned per
  call (~5x faster: 36s -> 7.4s). Override via SKILLOPT_SLEEP_COPILOT_HOME
  / SKILLOPT_SLEEP_COPILOT_MODEL / SKILLOPT_SLEEP_COPILOT_FULL_ENV.

Validated end-to-end on real held-out tasks (researcher persona:
0.42 -> 1.00 lift; gate correctly rejects non-improving edits).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-06-17 17:25:50 -07:00
Yifan Yang
722ce646d4 feat(sleep): experience replay + dream rollouts in the cycle (opt-in)
Wires two consolidation mechanisms into the shipped nightly cycle, both default
OFF so existing behavior is unchanged:
  - dream_rollouts (>1): multi-rollout contrastive reflection per task
  - recall_k (>0): associative recall of the K most-similar past tasks (from a
    capped task_archive persisted in state.json) into tonight's dream
  - dream_factor (>0): synthetic task variants

New shared engine module skillopt_sleep/dream.py (recall_similar, dream_augment,
dream_consolidate) is called by both the plugin cycle and the experiment harness,
so reported numbers exercise the exact shipped code. Built on the existing
rollouts_k/sample_id support already in consolidate.py/rollout.py.

Validated (5 nights x 10 real tasks/night, full held-out test, GPT-5.5, gated):
the gain scales with recall depth on a clean signal —
SearchQA recall_k=10 +3.1, recall_k=20 +4.5, full-history reference +5.6;
SpreadsheetBench (nano, gate-free) +3.6. Flat within noise on saturated/noisy
cells. See docs/sleep/EXPERIENCE_REPLAY.md (+ raw runs under blog_runs/v2_port/).

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-15 15:58:27 +00:00
Kirill Kostarev
31715a8b43 Add Codex Desktop transcript harvesting 2026-06-15 10:23:08 +00:00
Yifan Yang
86bad36ffe feat(sleep): SkillOpt-Sleep plugin update (preview) — engine robustness + scheduling
Updates the SkillOpt-Sleep plugin on top of the current main. User-facing and
engine improvements since the initial drop:

* Command renamed /sleep -> /skillopt-sleep across Claude Code + Codex shells;
  refreshed plugin READMEs and install scripts.
* Built-in scheduling (skillopt_sleep/scheduler.py + __main__): schedule /
  unschedule the nightly cycle without external cron wiring.
* Backend robustness: bounded retry with backoff (no more silent empty-string
  on transient 429/timeout), content-filter-safe rollout prompt, an
  output-contract guardrail that rejects edits violating the task's required
  format, and a per-sample cache key so repeated dream rollouts are independent
  samples (fixes degenerate single-sample reflection).
* consolidate / rollout / replay: parallel multi-rollout dreaming, gate-mode
  controls, TaskRecord.system framing field.

Scope: this commit ships only the plugin engine + shells. Research/benchmark
harnesses and their data are intentionally not included; the public package
has no dependency on them (the one research-evaluator import is now guarded).
Marked as an early preview in the README; we'll keep iterating.

99/99 unit tests pass.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-14 16:12:00 +00:00
Yifan Yang
dae974a5e3 chore(sleep): English-only across the engine, plugins, and docs
Remove every non-ASCII/CJK character for a professional open-source repo:
  - harvest.py: drop hardcoded Chinese feedback phrases; add an env-based
    extensibility hook (SKILLOPT_SLEEP_NEG_FEEDBACK / _POS_FEEDBACK) so any
    locale can be added without baking one in. Verified with a German example.
  - rollout.py / consolidate.py: English comments.
  - README.md section heading + anchor, CONTROLLABLE_DREAMING.md, plugin.json,
    marketplace.json (also fixed stale path skillopt-sleep-plugin ->
    plugins/claude-code), SKILL.md: English only.
  - Remove the internal WAKE_UP_SUMMARY.md note (not user-facing, not referenced).

Verified: zero CJK chars remain anywhere; 29 tests pass.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-08 14:31:52 +00:00
Yifan Yang
b02ffc2c99 refactor(sleep): decouple engine to top-level skillopt_sleep/ (zero research dep)
Open-source-tool / research-code separation:
  - git mv skillopt/sleep/ -> skillopt_sleep/ (top-level, sibling to the research
    skillopt/ package). History preserved as renames.
  - All imports skillopt.sleep.* -> skillopt_sleep.*.
  - Vendor the validation gate into skillopt_sleep/gate.py (a self-contained copy
    of skillopt.evaluation.gate). The engine now has ZERO dependency on the
    research package — verified: grep finds no `from skillopt.` in skillopt_sleep/,
    and consolidate's gate resolves to skillopt_sleep.gate.
  - Plugin scripts/commands/skill call `-m skillopt_sleep`.

29 tests pass; `python -m skillopt_sleep` runs standalone.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-08 14:31:52 +00:00