213 Commits

Author SHA1 Message Date
CharlesYang030
e4ea6a6771 chore(release): v0.2.0
Highlights since v0.1.0:
- feat: SkillOpt-Sleep engine — nightly offline self-evolution
  (harvest -> mine -> replay -> consolidate behind a validation gate),
  with multi-objective reward, experience replay + dream rollouts,
  slow-update long-term memory, and secret redaction in cycle diagnostics.
  Shipped as the `skillopt-sleep` CLI.
- feat: cross-tool backends & plugin shells — Claude, Codex (+Desktop
  harvest), Copilot, Devin, and OpenClaw.
- feat: SearchQA split materialization + rollout fail-fast.
- fix: Windows robustness for claude/codex backends, hardened JSON
  fallback, Qwen timeout/thinking gating, Codex failure surfacing.

Packaging:
- Bump pyproject / skillopt / skillopt_sleep to 0.2.0.
- Restore skillopt_webui to the packaged wheel.

See CHANGELOG.md for the full changelog and contributor acknowledgements.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
v0.2.0
2026-07-02 22:11:10 +08:00
Yif Yang
5487e2c426 fix(skillopt-sleep): redact secrets before persisting cycle diagnostics
PR #92 added a per-cycle diagnostics.json that surfaces backend stderr,
optimizer replies, and task responses so a 0.0 night is self-diagnosing.
Those free-text fields can carry credentials (e.g. a codex 401 stderr dump
containing an auth token), so persisting them verbatim was a new on-disk
leak surface.

- Add a shared redact_secrets() in staging.py and route diagnostics.json's
  call_error / reflect_raw_head / holdout_detail through it before writing.
- Redact the codex and Claude auth-error log lines too (a secondary sink
  when a file log handler is attached); last_call_error stays raw in memory
  so _AUTH_MARKERS matching is unaffected.
- Centralize _SECRET_PATTERNS in staging.py (harvest_codex now reuses them)
  and extend coverage to AWS / GitHub / Slack / Google / JWT token shapes.
- Tests: secret-shape coverage, private-key blocks, recursive/scalar
  passthrough, no over-redaction of plain prose, fail-fast auth-error log
  redaction, and an end-to-end check that diagnostics.json has no secret.

Observability-only; the gate and learning algorithm are unchanged.

Co-Authored-By: Claude <noreply@anthropic.com>
2026-06-30 19:47:36 +00:00
Yifan Yang
b9142bad24 fix(skillopt-sleep): surface codex auth/model/version failures instead of silently scoring 0 (#92)
Splits CodexCliBackend._call into _call_once + a retry wrapper so transient empties/timeouts are retried instead of silently scored 0, and fails fast on fatal auth/model/version errors (401, refresh_token_reused, token_expired, ChatGPT-account-unsupported, newer-Codex-required). On non-zero exit the CLI error text is surfaced via last_call_error instead of being returned as a model response. Adds per-cycle diagnostics.json (observability only; gate and learning algorithm unchanged) so a 0.0 night self-explains.
2026-07-01 03:20:08 +08:00
Yifan Yang
95a9e959fe test(sleep): add verifier-discipline stress test for the validation gate (#87)
Adds a reward-hacking stress test ensuring the consolidation gate rejects skill edits that game train/replay tasks while degrading held-out behavior. Also wires the minimax_chat backend into scripts/eval_only.py (coexisting with the qwen wiring from #85). Closes #67.
2026-07-01 02:40:24 +08:00
Tanmay9223
680dd28f5a fix(tests): move TestVerifierDiscipline above main block
(Addresses PR review feedback by ensuring python file-run execution discovers the test class)
2026-06-30 13:05:01 +05:30
Tanmay9223
fccc21f3f6 test(sleep): add verifier-discipline stress test (closes #67)
Add a regression test to ensure the validation gate correctly rejects
reward-hacking skill edits. It has been observed that optimizers
sometimes propose shortcuts that improve train/replay metrics but fail
to improve held-out behavior. This test codifies that the gate blocks
such artifacts.

Add TestVerifierDiscipline to the test_sleep_engine.py suite:
- Create MockRewardHackingBackend that simulates a reward-hacking rule
  which passes the train set but degrades the held-out tasks.
- Assert that the proposed edit is rejected by the gate.
2026-06-30 13:04:22 +05:30
Yifan Yang
6849e609a3 feat(eval): add missing minimax backend configuration
Add missing configuration setup in scripts/eval_only.py to properly
support the minimax_chat backend, which was entirely omitted.

Fix the following coverage gaps in eval_only.py:
- Add minimax CLI arguments
- Include the minimax config mappings in _MAP
- Update the backend parsing logic
- Call configure_minimax_chat
2026-06-30 13:04:22 +05:30
Daniel Martinez
9fa0716c72 fix(skillopt-sleep): also surface codex failures on the tool-call rollout path
Follow-up from a fresh-context review of the prior commit: CodexCliBackend.attempt_with_tools
(the rollout path for tool-requiring tasks) ran codex exec inline, swallowed all exceptions,
and never set last_call_error — so an auth/model/version failure on the tool path still
produced a silent empty->0 with no diagnostic signal, the exact failure class the prior commit
fixed for the _call path. Now it surfaces timeout/exception/non-zero-exit via last_call_error
(response stays empty; never leaks the CLI error text), so a failed tool rollout shows up in
diagnostics.json. Adds a regression test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 23:56:11 -05:00
Daniel Martinez
9fcf5868c3 fix(skillopt-sleep): surface codex auth/model/version failures instead of silently scoring 0
A nightly sleep cycle could run for weeks emitting held-out 0.0 -> 0.0 (gate reject, zero
edits), indistinguishable from "nothing to learn", when the real cause was the codex backend
returning an error (expired auth / model unsupported on the account / outdated CLI) that got
scored as a failed rollout.

backend (CodexCliBackend):
- split _call into _call_once + a retry wrapper: transient empties/timeouts are retried
  instead of silently returning "" (mirrors AzureOpenAIBackend's guard);
- on a non-zero exit, surface the reason via last_call_error and return "" rather than
  leaking the CLI error text as if it were a model response;
- fail fast (no retries) on fatal auth/model/version errors (401, refresh_token_reused,
  token_expired, "not supported when using Codex with a ChatGPT account",
  "requires a newer version of Codex").
backend (CliBackend.reflect): retain last_reflect_raw so a no-edits night is diagnosable.
consolidate: ConsolidationResult now carries per-task held-out detail (response, hard/soft,
  fail_reason) + reflect_raw + call_error.
cycle: write diagnostics.json per cycle so a 0.0 night self-explains instead of being a black box.
tests: 4 new (retry-not-silent-zero, auth-error-surfaced-not-scored, holdout-detail, reflect-raw).

Also gitignore the .skillopt-sleep/ runtime dir.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 22:26:20 -05:00
Yifan Yang
9969a8f393 Add Devin plugin (plugins/devin): MCP server + ATIF-v1.7 harvest (#88)
Wires skillopt_sleep into Devin via a stdlib-only MCP server and an ATIF-v1.7 harvester, following the plugins/copilot thin-shell pattern. Includes path-expansion fix, tests + ATIF fixture, schema/tool parity with copilot, and a harvest fix so single-turn sessions aren't dropped by the <3s replay filter.
2026-06-26 11:04:23 +08:00
Yif Yang
26e5338def Update citation from @misc to @article format
Co-Authored-By: Claude <noreply@anthropic.com>
2026-06-26 02:54:46 +00:00
khashayar
1a70e4c9cd devin harvest: space turns >=5s so single-turn sessions aren't dropped
A harvested single-turn Devin session spanned only 1s (reply written 1000ms
after the prompt), which the engine's harvest filter conservatively classifies
as a <3s headless replay (skillopt_sleep Issue #62) and skips — so a real
single-turn session mined 0 tasks. Widen the prompt->reply gap to 5s. With this,
an end-to-end dry-run mines the task: "night 1: 1 sessions -> 1 tasks".

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 22:03:15 +02:00
khashayar
9799c41461 devin plugin: full schema/tool parity with plugins/copilot
Mirror the copilot MCP server: same rich _TOOL_SCHEMA (source, model,
tasks_file, target_skill_path, max_sessions, max_tasks, lookback_hours,
auto_adopt, json, edit_budget, hour, minute) and generic flag forwarding, plus
sleep_schedule / sleep_unschedule. Devin specifics retained: the ATIF-v1.7
harvest step (run before data-reading actions, engine pointed at it via
--claude-home, default --source claude) and post-adopt sync into .devin/skills/.
Tests + README + rules snippet updated for the 7-tool interface.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 21:56:42 +02:00
khashayar
e51eb7c4be devin plugin: expand ~ in CLAUDE_HOME from env + add tests & ATIF fixture
Review fixes:
- Path bug: SKILLOPT_DEVIN_CLAUDE_HOME (and SKILLOPT_SLEEP_REPO) read from the
  env are now wrapped in os.path.expanduser, so the documented "~/..." config
  no longer passes a literal ~ to --claude-home (which yielded zero mined
  sessions). expanduser on an absolute default is a no-op.
- tests/test_devin_plugin.py: tool-schema completeness, action→subcommand map,
  backend enum, the CLAUDE_HOME expansion regression, and an ATIF-v1.7 harvest
  shape test against a bundled fixture.
- plugins/devin/fixtures/devin_sample.json: sample ATIF-v1.7 transcript.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 21:49:21 +02:00
Yifan Yang
99ccb93945 fix(eval-only): configure qwen_chat/minimax backends so local LLM endpoints work (#85)
Replicates the trainer's backend setup in scripts/eval_only.py so eval-only no longer silently falls back to an unconfigured local endpoint. Closes #84.
2026-06-26 02:55:18 +08:00
Yifan Yang
9de9220214 docs(sleep): add cross-model scaling results (nano +11.9) and hyperparam ablation (#89)
Update RESULTS.md with:
- §2: GPT-5.4-nano target yields +11.9 pt (0.560→0.679) on SearchQA —
  2× the GPT-5.5 gain, demonstrating bigger benefit where headroom exists
- §4: Hyperparameter sweep confirms shipped defaults are optimal

Co-authored-by: Claude Opus 4 <noreply@anthropic.com>
2026-06-26 01:40:58 +08:00
khashayar
bec23ed020 Add Devin plugin (plugins/devin): MCP server + ATIF-v1.7 harvest
Wires the skillopt_sleep engine into Devin (Cognition) via an MCP server,
following the same thin-shell pattern as plugins/copilot.

- mcp_server.py: stdlib-only stdio MCP server exposing the standard sleep_*
  tools (status, dry-run, run, adopt, harvest). REPO_ROOT defaults to ../.. so
  it finds skillopt_sleep automatically when run from plugins/devin/.
- harvest_devin.py: converts Devin ATIF-v1.7 transcripts, agentmemory, and
  .devin/skills/*/SKILL.md into the Claude Code-compatible JSONL the engine
  consumes; enriches with taskKey + outcome envelopes (hard test/build signal
  or judge rubric). Workspace auto-detection; cross-platform paths.
- judge.py, mcp-config.example.json, devin-rules.snippet.md, README.md.
- plugins/README.md: add Devin to the platform + install tables.

No changes to skillopt_sleep; shells out to `python -m skillopt_sleep` like the
other plugins. Pure stdlib; default backend mock (no API spend).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 10:42:52 +02:00
Gergely Imreh
8559308361 fix(eval-only): call configure_qwen_chat so itslocal LLM endpoints can be used
The eval-only tool skipped configuring some of the backend types, that
the training did configure. Because of this, the eval is silently
fell back to a local endpoint that wasn't actually configured, and
all evaluations runs failed.

Replicate the backend setup based on the trainer's code, and eval-only
can run with the qwen_chat backends.

Co-authored-by: Qwen-Coder <noreply@qwen.ai>
2026-06-24 15:31:19 +08:00
Yifan Yang
2d7e37a395 fix(json_utils): reject prose pseudo-JSON in single quotes/backticks (#82)
Follow-up to the string-aware brace scan: that change only skipped
double-quoted prose, so brace-shaped text in single quotes, backticks, or
bare prose (e.g. `{op: delete}`, '{x: 1}') still reached json_repair and was
fabricated into a bogus dict — strictly worse than None, since extract_json
feeds the optimizer's skill edits.

Add a _looks_json_like() guard before repair: a genuine JSON object's first
non-space char after `{` is `"` (a key) or `}` (empty). Prose pseudo-objects
start with a bare word and are rejected, while legitimate repair targets
(trailing commas, unescaped quotes inside string values) all begin with `"`
and pass — including objects whose string VALUES contain single quotes or
backticks, which must not be rejected.

Found by an independent GPT-5.5 re-review of the merged #79 code. Adds
regression tests for single-quoted / backticked / bare prose (-> None) and
for legitimate objects with quote/backtick string values (still repaired).
Tests: 30 pass (+3 skip) without json_repair, 33 pass with it, both clean
under -W error::RuntimeWarning.

Co-authored-by: Claude <noreply@anthropic.com>
2026-06-23 20:31:39 +08:00
Yifan Yang
baad64a3b9 docs(readme): remove Acknowledgements section (#81)
The contributor is already credited via the Co-authored-by trailer carried
into main by #79; a dedicated README section is unnecessary.

Co-authored-by: Claude <noreply@anthropic.com>
2026-06-23 19:13:16 +08:00
Yifan Yang
c2e47c50fb docs(readme): acknowledge community contributor @samuelgoofus-boop (#80)
Add an Acknowledgements section crediting @samuelgoofus-boop for the
Windows-robustness work on the Claude/Codex backends (originally #77,
merged via #79).

Co-authored-by: Claude <noreply@anthropic.com>
2026-06-23 19:03:30 +08:00
Yifan Yang
14c045f04f Windows robustness for claude/codex backends (+ hardened JSON fallback) (#79)
* Robustness for the claude/codex backends on Windows: argv overflow, subprocess encoding, tolerant JSON, test-eval dirs

Fixes surfaced running SkillOpt end-to-end on the bundled `claude` backend
(local Claude CLI) on Windows. None changes the OpenAI/GPT happy path.

1. skillopt/engine/trainer.py — the final test-eval directory
   (test_eval_final/) is written to before being created; add
   os.makedirs(..., exist_ok=True), matching the two sibling test-eval dirs.
   Without it, summary.json raises FileNotFoundError when a rollout yields
   zero predictions.

2. skillopt/model/claude_backend.py
   a. Pass the prompt via stdin (not argv): on Windows the whole command line
      is capped at ~32 KB and a large optimizer prompt (the success-analyst
      minibatch carrying several report trajectories) overflows it with
      [WinError 206], killing the run after retries.
   b. Pass the system prompt via --append-system-prompt-file (a temp file),
      not argv. The system prompt here is the skill being optimized, which
      SkillOpt grows over training; since the ~32 KB cap applies to the SUM of
      all argv, a grown skill would re-hit [WinError 206] even with the prompt
      on stdin.
   c. Pin the subprocess encoding to utf-8 (errors="replace"). With text=True
      and no encoding=, stdin is encoded with the system codepage; on a zh-CN
      box (cp936/GBK) a prompt containing an emoji or some Latin-1 characters
      raises UnicodeEncodeError before the CLI even starts, failing every retry.

3. skillopt/model/codex_backend.py — the same utf-8 encoding pin on its
   subprocess.run(input=...) call (identical unpinned-encoding pattern).

4. skillopt/utils/json_utils.py — extract_json() returned None for valid-
   looking JSON that strict json.loads rejects (unescaped ASCII quotes inside
   CJK string values, trailing commas), silently dropping the analyst's edits
   on non-schema backends (Claude/Qwen): reflect produces N edits, 0 applied.
   Add a json_repair fallback, but only on a single unambiguous object — a
   balanced-brace extractor plus a refuse-on-multiple-objects guard — so a
   chain-of-thought "scratch + final" response can't make repair silently
   return the wrong (discarded) object, which would be worse than None (None is
   detectable and retryable; a wrong-but-valid edit is applied blind). Declare
   json_repair in requirements.txt and the claude/qwen optional extras so the
   fallback is actually present (it otherwise no-ops, dropping edits silently).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(cherry picked from commit dca74a683e)

* fix(json_utils): harden tolerant JSON fallback from PR #77

Follow-up fixes on top of the cherry-picked Windows-robustness change:

1. Make _top_level_brace_objects() fully string-aware in its OUTER scan, not
   just inside an object. A '{' inside quoted prose (e.g. '"set it to {x}"')
   no longer starts a candidate object, so extract_json() returns None for
   prose pseudo-JSON instead of repairing it into a bogus dict — which would
   be strictly worse than dropping the edit, since extract_json feeds the
   optimizer's skill edits.

2. Pick the repair candidate BEFORE importing json_repair, so the missing-
   dependency RuntimeWarning only fires when there is genuinely a single
   malformed object that could have been repaired. Ordinary no-JSON / prose
   replies (the common case) now return None silently instead of warning on
   every call.

3. Resolve dependency-metadata inconsistency: json_repair is optional, so add
   it to the `all` extra (it was already in `claude`/`qwen`) and demote it
   from a hard requirement to an optional/commented entry in requirements.txt,
   matching the project's convention for backend-specific deps.

Adds regression tests for prose-with-braces (-> None), no-warning-on-plain-
text, single-object repair, and multi-object ambiguity. Existing 22 json
tests still pass with and without json_repair installed.

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: samuelgoofus-boop <260247789+samuelgoofus-boop@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-23 19:00:23 +08:00
carpedkm
2841f82428 Fix ALFWorld gamefile paths relative to ALFWORLD_DATA 2026-06-23 10:32:38 +00:00
Yifan Yang
64c6dda105 Merge pull request #78 from Yif-Yang/main
docs(readme): add Trendshift daily/weekly badges
2026-06-23 16:52:42 +08:00
Yifan Yang
c98eac18c7 docs(readme): add Trendshift daily/weekly badges (#1)
Add the microsoft/SkillOpt Trendshift badges (daily + weekly) side by
side in the README header.

Co-authored-by: Claude <noreply@anthropic.com>
2026-06-23 16:50:47 +08:00
Yifan Yang
fc1f827f07 Merge pull request #74 from Yif-Yang/fix/python-path-and-lookback
fix: SKILLOPT_SLEEP_PYTHON override + lookback_hours first-run fallback
2026-06-20 22:26:43 +08:00
carpedkm
01b3e01804 fix: use None default for --lookback-hours to distinguish omitted vs 0
Codex round 3: argparse default=0 made every CLI invocation without
--lookback-hours clobber the config's 72h default. Now default=None;
only explicit --lookback-hours N (including 0) overrides config.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-20 14:23:17 +00:00
carpedkm
01075c90d3 fix: address codex round 2 — revert harvest break + allow lookback 0
- harvest.py: revert break to continue — mtime ordering can diverge
  from embedded ended_at timestamps (copy/touch), so we must check all
  files rather than early-exiting on the first old one
- cycle.py: use `is not None and > 0` so lookback_hours=0 means
  "scan full history" (opt-out of the cutoff)
- __main__.py: propagate --lookback-hours 0 to config as explicit 0

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-20 14:21:18 +00:00
carpedkm
6cc1cd2e95 fix: address codex review — use clock for cutoff + early-exit harvest
- cycle.py: use supplied `clock` parameter (not wall time) for the
  lookback cutoff, so deterministic tests/experiments get reproducible
  harvest windows
- harvest.py: break (not continue) when a file is older than since_iso,
  since files are sorted newest-first by mtime — avoids scanning the
  entire transcript directory for quiet projects with large histories

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-20 14:11:58 +00:00
carpedkm
889238b234 fix: add SKILLOPT_SLEEP_PYTHON override + lookback_hours first-run fallback
Two fixes from issue #57 feedback:

1. run-sleep.sh: support SKILLOPT_SLEEP_PYTHON env var to explicitly set
   the Python interpreter. Useful on macOS where system Python is 3.9 but
   a newer Python is available elsewhere (e.g. Codex Desktop's bundled
   Python 3.12). Applied to both the shared runner and the bundled
   Claude Code plugin copy.

2. cycle.py: on first run (no prior harvest recorded), apply the
   lookback_hours config (default 72h) as a time cutoff. Previously,
   first run scanned the entire transcript history, which could trigger
   massive LLM mining on users with months of session data.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-20 14:07:50 +00:00
Yifan Yang
b5a1c2b317 Merge pull request #73 from Yif-Yang/fix/bare-subscription-auth
fix(sleep): make --bare conditional on ANTHROPIC_API_KEY (#68)
2026-06-20 21:46:09 +08:00
carpedkm
552ddefd74 fix: narrow CLI error markers to avoid false positives
Address codex review: "API key" was too generic — a model response
about configuring API keys would trigger a false auth warning. Now:
- Use specific phrases ("Invalid API key", "Unauthorized: invalid x-api-key")
- Only check short stdout (<300 chars) to skip real model responses
- Still check stderr unconditionally

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-20 13:32:43 +00:00
carpedkm
bfa53bc46d fix(sleep): make --bare conditional on ANTHROPIC_API_KEY (#68)
ClaudeCliBackend._call() and attempt_with_tools() hardcoded --bare,
which skips Claude CLI's credential resolution. This broke subscription-
token auth: every model call silently returned "Not logged in" and
scored 0 — the user saw "baseline 0.0 → candidate 0.0, gate reject"
with no indication of an auth failure.

Fix: only pass --bare when ANTHROPIC_API_KEY is set. The remaining
isolation flags (--disable-slash-commands, --disallowedTools,
--exclude-dynamic-system-prompt-sections, clean temp cwd) already
provide the needed isolation without --bare.

Also adds _detect_cli_error() to log a warning when CLI output matches
known auth error patterns, so auth failures surface loudly instead of
deflating every score to 0.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-20 13:28:34 +00:00
Yifan Yang
24b5a25ba8 Merge pull request #72 from Yif-Yang/feat/plugin-feature-sync
feat: sync all 4 runtime plugins with full engine surface + fix #52 #58 #62
2026-06-20 20:42:24 +08:00
carpedkm
0d648b2580 fix: address codex+gpt-5.5 review findings
- harvest: tighten sub-3s filter to also require prompt < 200 chars,
  avoiding false positives on fast real one-shot questions
- openclaw schedule_cmd: add docstring clarifying it schedules the
  shared engine, not the OpenClaw-native runner

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-20 12:40:34 +00:00
carpedkm
7d36b1d592 fix: address review findings in plugin sync PR
- OpenClaw schedule_cmd: pass project as required positional arg
- OpenClaw schedule_cmd/unschedule_cmd: unpack Tuple[bool, str] return
- OpenClaw schedule_cmd: propagate failure status (return 1 on not ok)
- OpenClaw unschedule_cmd: pass project to avoid silent no-op
- OpenClaw --minute default: 17 (consistent with engine and MCP)
- harvest.py: move datetime import to module level

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-20 12:04:07 +00:00
carpedkm
0be780052a feat: sync all 4 runtime plugins with full engine surface + fix #52 #58 #62
Bug fixes:
- #52: bundle run-sleep.sh in Claude Code plugin + 4-level fallback
- #58: add skillopt-sleep console script entry point in pyproject.toml
- #62: filter headless claude -p replay sessions from harvest

Plugin sync (Claude Code / Codex / Copilot / OpenClaw):
- Document all 22 CLI flags, 7 actions, 4 backends across all SKILL.md files
- Document config keys (preferences, gate_mode, dream_rollouts, etc.)
- Document memory consolidation (evolve_memory / evolve_skill)
- Add schedule/unschedule to all plugins
- Copilot MCP: expand schema from 3 → 16 params + schedule tools
- OpenClaw: add schedule/unschedule subcommands via shared scheduler

Tests:
- Cross-plugin parity test (prevents future feature drift)
- MCP schema completeness test

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-20 11:31:09 +00:00
carpedkm
0b5b9a4296 Merge pull request #60 from Kirchberg/codex/reviewed-task-files-cwd
Add reviewed task-file flow for Codex sleep runs
2026-06-20 08:59:02 +00:00
Kirill Kostarev
05cdc26beb Add reviewed task-file flow for Codex sleep runs 2026-06-20 08:58:48 +00:00
Yifan Yang
382811ddcc Merge pull request #50 from Dongbumlee/Dongbumlee/copilot-sleep-backend
Add Copilot as a SkillOpt-Sleep model backend (CopilotCliBackend) + research-engine MCP plugin
2026-06-20 16:57:53 +08:00
DB Lee
d367ae1eea docs(plugins): list copilot in the cross-tool backend overview
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-06-17 17:38:10 -07:00
DB Lee
2c0980bda3 docs(copilot): correct backend hint in research MCP plugin (openai -> azure_openai)
The advertised backend choices in scripts/train.py use 'azure_openai',
not 'openai'; align the inputSchema description hint accordingly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-06-17 17:25:50 -07:00
DB Lee
5799695951 feat(copilot): implement attempt_with_tools with cross-platform tool shims
Adds honest tool-call detection for CopilotCliBackend, mirroring the
Claude/Codex backends. Writes per-tool executable shims into the work dir
and detects real invocations from a calllog (not self-reported markers).
The Copilot backend is Windows-validated, so shims are cross-platform:
a .cmd batch shim on Windows and a chmod'd bash shim on POSIX, with an
OS-specific tool hint. Mirrors _call's flags/env (isolated COPILOT_HOME,
--allow-all-tools, MCP/instruction disabling) and the UTF-8 subprocess fix.

Adds test_attempt_with_tools_honest_detection: a CI-friendly, OS-aware
stub stands in for the CLI, runs the shim, and asserts both JSONL parsing
and log-based detection. Validated live on Windows (real Copilot call) and
on Linux/WSL (POSIX path).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-06-17 17:25:50 -07:00
DB Lee
013a7cd83a test: add unit tests for CopilotCliBackend (parsing + alias + isolated home)
Covers _parse_jsonl_response (multi-message concat, junk-line skipping,
empty/non-assistant events), get_backend alias resolution, and the
isolated-COPILOT_HOME / full-env opt-out behavior. Pure logic, no CLI required.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-06-17 17:25:50 -07:00
DB Lee
21f93c16c7 Add GitHub Copilot backend to SkillOpt-Sleep
Add CopilotCliBackend that drives the GitHub Copilot CLI in
non-interactive mode (copilot -p ... --output-format json) and parses the
JSONL event stream for assistant.message content. Registered as the
'copilot' backend (with aliases) and wired through the CLI, config,
experiment harness, and the Copilot MCP server's backend enum.

- Force UTF-8 decoding of CLI output (fixes cp1252 UnicodeDecodeError on
  Windows when responses contain non-cp1252 bytes).
- Minimise per-call startup: isolated COPILOT_HOME with built-in MCPs and
  custom instructions disabled, so user MCP servers are not spawned per
  call (~5x faster: 36s -> 7.4s). Override via SKILLOPT_SLEEP_COPILOT_HOME
  / SKILLOPT_SLEEP_COPILOT_MODEL / SKILLOPT_SLEEP_COPILOT_FULL_ENV.

Validated end-to-end on real held-out tasks (researcher persona:
0.42 -> 1.00 lift; gate correctly rejects non-improving edits).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-06-17 17:25:50 -07:00
DB Lee
5dc894715f Add SkillOpt research-engine MCP server plugin for Copilot
Exposes scripts/train.py and scripts/eval_only.py as Copilot MCP tools
(skillopt_list_configs, skillopt_train, skillopt_eval) via a stdlib-only
stdio server, mirroring the existing SkillOpt-Sleep plugin layout.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-06-17 17:24:00 -07:00
Yifan Yang
6940e46f4e Merge pull request #65 from summerview1997/codex/searchqa-materialize-splits
Add SearchQA split materialization helper
2026-06-17 23:50:38 +08:00
Yifan Yang
0e962219f5 Merge pull request #64 from summerview1997/codex/searchqa-rollout-failfast
Fail fast on systemic SearchQA rollout failures
2026-06-17 23:49:55 +08:00
Yifan Yang
fc42e6bf72 Merge pull request #63 from summerview1997/codex/webui-env-backend-preflight
Add WebUI env loading and backend preflight
2026-06-17 23:49:50 +08:00
summerview1997
c755792049 Add SearchQA materialization tests 2026-06-16 09:27:09 +08:00