microsoft-SkillOpt

mirror of https://github.com/microsoft/SkillOpt.git synced 2026-07-03 14:02:58 +08:00

Author	SHA1	Message	Date
Yifan Yang	b5a1c2b317	Merge pull request #73 from Yif-Yang/fix/bare-subscription-auth fix(sleep): make --bare conditional on ANTHROPIC_API_KEY (#68)	2026-06-20 21:46:09 +08:00
carpedkm	552ddefd74	fix: narrow CLI error markers to avoid false positives Address codex review: "API key" was too generic — a model response about configuring API keys would trigger a false auth warning. Now: - Use specific phrases ("Invalid API key", "Unauthorized: invalid x-api-key") - Only check short stdout (<300 chars) to skip real model responses - Still check stderr unconditionally Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-20 13:32:43 +00:00
carpedkm	bfa53bc46d	fix(sleep): make --bare conditional on ANTHROPIC_API_KEY (#68) ClaudeCliBackend._call() and attempt_with_tools() hardcoded --bare, which skips Claude CLI's credential resolution. This broke subscription-token auth: every model call silently returned "Not logged in" and scored 0 — the user saw "baseline 0.0 → candidate 0.0, gate reject" with no indication of an auth failure. Fix: only pass --bare when ANTHROPIC_API_KEY is set. The remaining isolation flags (--disable-slash-commands, --disallowedTools, --exclude-dynamic-system-prompt-sections, clean temp cwd) already provide the needed isolation without --bare. Also adds _detect_cli_error() to log a warning when CLI output matches known auth error patterns, so auth failures surface loudly instead of deflating every score to 0. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-20 13:28:34 +00:00
Yifan Yang	24b5a25ba8	Merge pull request #72 from Yif-Yang/feat/plugin-feature-sync feat: sync all 4 runtime plugins with full engine surface + fix #52 #58 #62	2026-06-20 20:42:24 +08:00
carpedkm	0d648b2580	fix: address codex+gpt-5.5 review findings - harvest: tighten sub-3s filter to also require prompt < 200 chars, avoiding false positives on fast real one-shot questions - openclaw schedule_cmd: add docstring clarifying it schedules the shared engine, not the OpenClaw-native runner Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-20 12:40:34 +00:00
carpedkm	7d36b1d592	fix: address review findings in plugin sync PR - OpenClaw schedule_cmd: pass project as required positional arg - OpenClaw schedule_cmd/unschedule_cmd: unpack Tuple[bool, str] return - OpenClaw schedule_cmd: propagate failure status (return 1 on not ok) - OpenClaw unschedule_cmd: pass project to avoid silent no-op - OpenClaw --minute default: 17 (consistent with engine and MCP) - harvest.py: move datetime import to module level Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-20 12:04:07 +00:00
carpedkm	0be780052a	feat: sync all 4 runtime plugins with full engine surface + fix #52 #58 #62 Bug fixes: - #52: bundle run-sleep.sh in Claude Code plugin + 4-level fallback - #58: add skillopt-sleep console script entry point in pyproject.toml - #62: filter headless claude -p replay sessions from harvest Plugin sync (Claude Code / Codex / Copilot / OpenClaw): - Document all 22 CLI flags, 7 actions, 4 backends across all SKILL.md files - Document config keys (preferences, gate_mode, dream_rollouts, etc.) - Document memory consolidation (evolve_memory / evolve_skill) - Add schedule/unschedule to all plugins - Copilot MCP: expand schema from 3 → 16 params + schedule tools - OpenClaw: add schedule/unschedule subcommands via shared scheduler Tests: - Cross-plugin parity test (prevents future feature drift) - MCP schema completeness test Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-20 11:31:09 +00:00
carpedkm	0b5b9a4296	Merge pull request #60 from Kirchberg/codex/reviewed-task-files-cwd Add reviewed task-file flow for Codex sleep runs	2026-06-20 08:59:02 +00:00
Kirill Kostarev	05cdc26beb	Add reviewed task-file flow for Codex sleep runs	2026-06-20 08:58:48 +00:00
Yifan Yang	382811ddcc	Merge pull request #50 from Dongbumlee/Dongbumlee/copilot-sleep-backend Add Copilot as a SkillOpt-Sleep model backend (CopilotCliBackend) + research-engine MCP plugin	2026-06-20 16:57:53 +08:00
DB Lee	d367ae1eea	docs(plugins): list copilot in the cross-tool backend overview Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-06-17 17:38:10 -07:00
DB Lee	2c0980bda3	docs(copilot): correct backend hint in research MCP plugin (openai -> azure_openai) The advertised backend choices in scripts/train.py use 'azure_openai', not 'openai'; align the inputSchema description hint accordingly. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-06-17 17:25:50 -07:00
DB Lee	5799695951	feat(copilot): implement attempt_with_tools with cross-platform tool shims Adds honest tool-call detection for CopilotCliBackend, mirroring the Claude/Codex backends. Writes per-tool executable shims into the work dir and detects real invocations from a calllog (not self-reported markers). The Copilot backend is Windows-validated, so shims are cross-platform: a .cmd batch shim on Windows and a chmod'd bash shim on POSIX, with an OS-specific tool hint. Mirrors _call's flags/env (isolated COPILOT_HOME, --allow-all-tools, MCP/instruction disabling) and the UTF-8 subprocess fix. Adds test_attempt_with_tools_honest_detection: a CI-friendly, OS-aware stub stands in for the CLI, runs the shim, and asserts both JSONL parsing and log-based detection. Validated live on Windows (real Copilot call) and on Linux/WSL (POSIX path). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-06-17 17:25:50 -07:00
DB Lee	013a7cd83a	test: add unit tests for CopilotCliBackend (parsing + alias + isolated home) Covers _parse_jsonl_response (multi-message concat, junk-line skipping, empty/non-assistant events), get_backend alias resolution, and the isolated-COPILOT_HOME / full-env opt-out behavior. Pure logic, no CLI required. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-06-17 17:25:50 -07:00
DB Lee	21f93c16c7	Add GitHub Copilot backend to SkillOpt-Sleep Add CopilotCliBackend that drives the GitHub Copilot CLI in non-interactive mode (copilot -p ... --output-format json) and parses the JSONL event stream for assistant.message content. Registered as the 'copilot' backend (with aliases) and wired through the CLI, config, experiment harness, and the Copilot MCP server's backend enum. - Force UTF-8 decoding of CLI output (fixes cp1252 UnicodeDecodeError on Windows when responses contain non-cp1252 bytes). - Minimise per-call startup: isolated COPILOT_HOME with built-in MCPs and custom instructions disabled, so user MCP servers are not spawned per call (~5x faster: 36s -> 7.4s). Override via SKILLOPT_SLEEP_COPILOT_HOME / SKILLOPT_SLEEP_COPILOT_MODEL / SKILLOPT_SLEEP_COPILOT_FULL_ENV. Validated end-to-end on real held-out tasks (researcher persona: 0.42 -> 1.00 lift; gate correctly rejects non-improving edits). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-06-17 17:25:50 -07:00
DB Lee	5dc894715f	Add SkillOpt research-engine MCP server plugin for Copilot Exposes scripts/train.py and scripts/eval_only.py as Copilot MCP tools (skillopt_list_configs, skillopt_train, skillopt_eval) via a stdlib-only stdio server, mirroring the existing SkillOpt-Sleep plugin layout. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-06-17 17:24:00 -07:00
Yifan Yang	6940e46f4e	Merge pull request #65 from summerview1997/codex/searchqa-materialize-splits Add SearchQA split materialization helper	2026-06-17 23:50:38 +08:00
Yifan Yang	0e962219f5	Merge pull request #64 from summerview1997/codex/searchqa-rollout-failfast Fail fast on systemic SearchQA rollout failures	2026-06-17 23:49:55 +08:00
Yifan Yang	fc42e6bf72	Merge pull request #63 from summerview1997/codex/webui-env-backend-preflight Add WebUI env loading and backend preflight	2026-06-17 23:49:50 +08:00
summerview1997	c755792049	Add SearchQA materialization tests	2026-06-16 09:27:09 +08:00
summerview1997	e591a28242	Add SearchQA split materialization helper	2026-06-16 09:26:56 +08:00
summerview1997	c04467a428	Add SearchQA materialization dependency extra	2026-06-16 09:26:46 +08:00
summerview1997	d5ae8c8e66	Document SearchQA split materialization	2026-06-16 09:26:35 +08:00
summerview1997	923becb00f	Add SearchQA rollout fail-fast tests	2026-06-16 09:21:08 +08:00
summerview1997	da799620ba	Fail fast on systemic SearchQA rollout failures	2026-06-16 09:20:57 +08:00
summerview1997	30cc8a3ed3	Add WebUI env preflight tests	2026-06-16 09:04:30 +08:00
summerview1997	d05851bd7f	Add WebUI env loading and backend preflight	2026-06-16 09:04:19 +08:00
Yifan Yang	46b3207b96	docs(sleep): trim RESULTS to the headline results (remove the full grid) Remove the per-cell full deployment grid section; keep the gate-safety stress test, experience-replay scaling + night-by-night climb, the dream-diversity ablation, the gbrain end-to-end result, and the scope/limitations. Renumber sections; update the README pointer accordingly.	2026-06-15 17:08:51 +00:00
Yifan Yang	d43e8dba1a	docs(sleep): expand the grid into per-benchmark night-by-night tables Replace the compact baseline->after grid with three grouped per-benchmark tables (SearchQA / LiveMath / SpreadsheetBench), each showing all 3 targets x both modes across every night (N0..N5) + Δ. Makes the trajectory visible — gains reach a level and hold rather than being single lucky readings — and presents the full 18-cell evidence in a more solid, readable form. Footnotes LiveMath's 4-night run (train split <50 tasks). Numbers unchanged; just richer presentation.	2026-06-15 16:54:01 +00:00
Yifan Yang	d02098ffc4	docs(sleep): add full Results & Analysis (RESULTS.md); link from README Adds docs/sleep/RESULTS.md — the complete deployment-scale study behind SkillOpt-Sleep, presented rigorously (named benchmarks, test sizes, metrics, baseline->after, single shared protocol): 1. Gate-safety stress test: ungated nano SearchQA collapses 0.554->0.026 (-52.8); the gated twin holds 0.570 — the core argument for the design. 2. Full 18-cell deployment grid (3 benchmarks x 3 targets x gate/free), shipped config: mean +0.5, range [-2.4, +5.1], nothing hidden. 3. Experience-replay scaling (recall_k 10->20->full: +3.1->+4.5->+5.6) and the night-by-night climb (0.798->...->0.858, gate accepts as late as N5). 4. Dream-diversity fix as defense-in-depth: 3-config grid comparison (-2.66/-52.8 -> +0.24/-4.0 -> +0.53/-2.4); the -52.8 cell becomes +2.7 from the dream fix alone. 5. gbrain end-to-end 0.00->1.00 on real Claude + Codex. 6. Honest scope: where it helps vs flat-in-noise, single-seed caveat with a seed-robustness spot check, keep-the-gate-on. README Results section now links prominently to it. Docs only; numbers are self-contained with reproduce commands (no raw run dumps committed).	2026-06-15 16:49:13 +00:00
Yifan Yang	ea4ff459d7	docs(sleep): make the results section rigorous (named benchmarks, baseline→after) Label each result with its benchmark, test size, metric, target model, and gate mode; show absolute baseline→after (not just Δ); state the single shared protocol once. SearchQA recall-scaling table (1400-item test, SQuAD-EM, GPT-5.5, gated) + SpreadsheetBench confirmation (280-item, cell-value compare, nano, gate-free) + the gbrain end-to-end line. Keeps the single-seed / flat-on-noisy caveats.	2026-06-15 16:42:43 +00:00
Yifan Yang	de3be75bac	docs(sleep): add a SkillOpt-Sleep module readme + News mention Adds docs/sleep/README.md — a concise intro to the SkillOpt-Sleep plugin (what it is, how to use it across the three agents, the opt-in experience-replay / dream-rollout knobs, and headline results), linking to the full guide section. Adds a News bullet pointing to it. No code changes.	2026-06-15 16:31:15 +00:00
Yifan Yang	b701d9b6d9	docs: move SkillOpt-Sleep into the guide; clean docs/sleep; fix guide link Per maintainer request: - Remove the internal/scratch docs/sleep/ tree (reports, raw logs, blog run JSON, sweep.jsonl) — 23 files — and the root PUBLISHING.md. These were working notes, not reference docs. - Take the dedicated SkillOpt-Sleep content out of the main README (News bullet + section) and host it in the rendered guide instead: new section 9 in docs/guideline.html (deployment companion, the three plugins, opt-in experience replay / dream rollouts) with a sidebar entry. - Fix the README's opening reference so "Documentation & Reproduction Guide" links directly to the rendered GitHub Pages page, not the raw .html source. - Repoint the now-removed docs/sleep links in the plugin READMEs to the guide section. The plugin code (plugins/, skillopt_sleep/) is unchanged; only docs move. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-15 16:20:50 +00:00
Yifan Yang	722ce646d4	feat(sleep): experience replay + dream rollouts in the cycle (opt-in) Wires two consolidation mechanisms into the shipped nightly cycle, both default OFF so existing behavior is unchanged: - dream_rollouts (>1): multi-rollout contrastive reflection per task - recall_k (>0): associative recall of the K most-similar past tasks (from a capped task_archive persisted in state.json) into tonight's dream - dream_factor (>0): synthetic task variants New shared engine module skillopt_sleep/dream.py (recall_similar, dream_augment, dream_consolidate) is called by both the plugin cycle and the experiment harness, so reported numbers exercise the exact shipped code. Built on the existing rollouts_k/sample_id support already in consolidate.py/rollout.py. Validated (5 nights x 10 real tasks/night, full held-out test, GPT-5.5, gated): the gain scales with recall depth on a clean signal — SearchQA recall_k=10 +3.1, recall_k=20 +4.5, full-history reference +5.6; SpreadsheetBench (nano, gate-free) +3.6. Flat within noise on saturated/noisy cells. See docs/sleep/EXPERIENCE_REPLAY.md (+ raw runs under blog_runs/v2_port/). Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-15 15:58:27 +00:00
Yifan Yang	576f2f8bad	Merge pull request #59 from Elzlxx/feat/openclaw-skillopt-sleep feat(plugins): add OpenClaw shell for SkillOpt-Sleep	2026-06-15 18:26:12 +08:00
carpedkm	00d07bc59a	Merge pull request #48 from Kirchberg/codex/codex-desktop-harvest Add Codex Desktop transcript harvesting	2026-06-15 10:23:18 +00:00
Kirill Kostarev	31715a8b43	Add Codex Desktop transcript harvesting	2026-06-15 10:23:08 +00:00
carpedkm	e8c3e10b30	Merge pull request #49 from Kirchberg/codex/codex-skill-first-upstream Make Codex integration skill-first	2026-06-15 10:21:43 +00:00
Kirill Kostarev	d31e9d9407	Back up legacy Codex prompt during install	2026-06-15 10:21:30 +00:00
Kirill Kostarev	1953484822	Make Codex integration skill-first	2026-06-15 10:21:30 +00:00
carpedkm	1b2652c6f8	Merge pull request #44 from imshunsuke/refactor/reflect-default-base refactor: make EnvAdapter.reflect a shared default (fixes dropped reflect kwargs)	2026-06-15 09:06:38 +00:00
Shunsuke	98d0430bee	refactor: make EnvAdapter.reflect a shared default (fixes dropped reflect kwargs) All six adapters duplicated an identical reflect() that delegates to run_minibatch_reflect. The copies had drifted: OfficeQA/DocVQA silently dropped meta_skill_context and ALFWorld dropped update_mode, so those analysts ran without inputs every other benchmark receives (active under the default use_meta_skill: true). Move the delegation into EnvAdapter.reflect as one default that forwards all kwargs uniformly, and delete the six overrides. reflect is no longer abstract — adapters inherit it and override only for custom logic. Net -225 lines. Behavior change: OfficeQA/DocVQA/ALFWorld reflect now receive the kwargs they previously dropped; the three already-correct benchmarks are unaffected. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-15 09:06:00 +00:00
Yifan Yang	eef4805b25	Merge pull request #43 from imshunsuke/docs/fix-benchmark-loader-naming docs: align benchmark guide and template with dataloader.py naming	2026-06-15 17:00:45 +08:00
Yifan Yang	86bad36ffe	feat(sleep): SkillOpt-Sleep plugin update (preview) — engine robustness + scheduling Updates the SkillOpt-Sleep plugin on top of the current main. User-facing and engine improvements since the initial drop: * Command renamed /sleep -> /skillopt-sleep across Claude Code + Codex shells; refreshed plugin READMEs and install scripts. * Built-in scheduling (skillopt_sleep/scheduler.py + __main__): schedule / unschedule the nightly cycle without external cron wiring. * Backend robustness: bounded retry with backoff (no more silent empty-string on transient 429/timeout), content-filter-safe rollout prompt, an output-contract guardrail that rejects edits violating the task's required format, and a per-sample cache key so repeated dream rollouts are independent samples (fixes degenerate single-sample reflection). * consolidate / rollout / replay: parallel multi-rollout dreaming, gate-mode controls, TaskRecord.system framing field. Scope: this commit ships only the plugin engine + shells. Research/benchmark harnesses and their data are intentionally not included; the public package has no dependency on them (the one research-evaluator import is now guarded). Marked as an early preview in the README; we'll keep iterating. 99/99 unit tests pass. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-14 16:12:00 +00:00
elzlxx	553446575a	feat(plugins): add OpenClaw shell for SkillOpt-Sleep Adds a thin OpenClaw shell wrapping the SkillOpt-Sleep engine. Enables nightly validation-gated skill improvement cycles for OpenClaw agents. Components: - skillopt_sleep_openclaw.py: DeepSeek V4 Pro + Ollama nomic-embed-text backend, mirroring the Claude/Codex/Copilot backend pattern. - run_sleep.py: CLI entry point supporting dry-run and pre-built task files. - run_sleep_cron.sh: bash wrapper for nightly cron invocation. - slash_sleep.py: /sleep command (status / run / adopt / reject / cost). - config.json: engine config tuned for our stack. - SKILL.md: OpenClaw skill manifest. - tests/: 14 held-out tasks across 3 categories (research-cron, devops, wiki). OpenClaw is the 4th ecosystem in which SkillOpt-Sleep can be deployed, joining Claude Code, Codex, and Copilot. The shell follows the same single-engine / thin-shell pattern as the existing three plugins. End-to-end tested: pipeline runs against real OpenClaw session transcripts, gate correctly rejects non-improvements, staging artifacts land in ~/.skillopt-sleep/staging/<night>/. Cost: ~$0.02/night on DeepSeek V4 Pro.	2026-06-14 23:27:54 +08:00
Cuzyoung	c1ac570d94	docs(guideline): make SearchQA the first demo — copy-paste materialization snippet + train command Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 13:51:20 +00:00
Cuzyoung	d8023a47c9	docs(guideline): novice-first restructure — Quick Start before data, honest first-demo path, own-data narrative - Move Quick Start (now §3) ahead of the data chapter; renumber and fix cross-references and the sidebar nav. - Add §3.1 'Your First Demo': states plainly that data/ ships ID manifests only, gives the one benchmark that runs out of the box (ALFWorld with its bundled path split), and points other benchmarks to the data/README.md materialization step. Also offers eval-only with ckpt/ skills as a lighter sanity check. - Reframe the data chapter as 'Run on Your Own Data' (§4) with a three-step lead-in (split dir -> item schema -> --split_dir) and a pointer to §7.2 for new task shapes. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 13:42:50 +00:00
Cuzyoung	b0b62fcb86	docs(readme): slim README — move install/quick-start/data/config details to the guideline page README now: badges + one-line pointer to docs/guideline.html, overview, demo, sleep section, extensibility pointers, WebUI launch, citation. All run-the-demo commands live in the guideline (which already covered install, credentials, training, eval, outputs, data prep, and config). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 13:27:36 +00:00
Cuzyoung	3308c4c5dc	docs(guideline): add PyPI install option and skill-aware reflection config rows Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 13:27:12 +00:00
Cuzyoung	0d5b331cd5	Merge branch 'docs/guideline' into feat/skill-aware-reflection # Conflicts: # README.md	2026-06-10 13:27:12 +00:00
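The two auth-related fixes in the log above (bfa53bc46d and 552ddefd74) can be illustrated with a minimal sketch. This is not the repository's code: the function names `build_isolation_flags` and `detect_cli_error`, the marker tuple, and treating "Not logged in" as a marker are assumptions made for illustration. Only the flag names, the two quoted error phrases, the 300-character stdout limit, and the unconditional stderr check come from the commit messages.

```python
from typing import List, Mapping, Optional

# Phrases narrow enough that an ordinary model answer about API keys
# should not match. "Not logged in" is an assumed addition (hypothetical).
AUTH_ERROR_MARKERS = (
    "Invalid API key",
    "Unauthorized: invalid x-api-key",
    "Not logged in",
)
SHORT_STDOUT_LIMIT = 300  # stdout at or above this length is treated as a real response


def build_isolation_flags(env: Mapping[str, str]) -> List[str]:
    """Return CLI flags, adding --bare only when an API key is present."""
    flags = [
        "--disable-slash-commands",
        "--exclude-dynamic-system-prompt-sections",
    ]
    # --bare skips credential resolution, so it is only safe with an API key.
    if env.get("ANTHROPIC_API_KEY"):
        flags.append("--bare")
    return flags


def detect_cli_error(stdout: str, stderr: str) -> Optional[str]:
    """Return the matched auth-error marker, or None if nothing matched."""
    haystacks = [stderr]  # stderr is always checked
    if len(stdout) < SHORT_STDOUT_LIMIT:
        haystacks.append(stdout)  # short stdout is likely a CLI message
    for text in haystacks:
        for marker in AUTH_ERROR_MARKERS:
            if marker in text:
                return marker
    return None
```

A caller would presumably log a warning whenever `detect_cli_error` returns a marker, so that an auth failure is reported instead of being scored as 0.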


183 Commits
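The refactor described in commit 98d0430bee (one shared `EnvAdapter.reflect` default that forwards all keyword arguments) can be sketched roughly as follows. The real signature of `run_minibatch_reflect` is not shown in the log, so the stand-in below, the `batch` parameter, and the subclass names are hypothetical. Only the idea of a non-abstract default that forwards every kwarg uniformly is taken from the commit message.

```python
from typing import Any, Dict, List


def run_minibatch_reflect(adapter: "EnvAdapter", batch: List[Any], **kwargs: Any) -> Dict[str, Any]:
    """Hypothetical stand-in: report which kwargs actually arrived."""
    return {
        "adapter": type(adapter).__name__,
        "n_items": len(batch),
        "kwargs": dict(kwargs),
    }


class EnvAdapter:
    def reflect(self, batch: List[Any], **kwargs: Any) -> Dict[str, Any]:
        # Shared default: forward every kwarg, so no adapter can silently
        # drop meta_skill_context or update_mode.
        return run_minibatch_reflect(self, batch, **kwargs)


class OfficeQAAdapter(EnvAdapter):
    """Inherits reflect unchanged; no per-adapter copy left to drift."""


class CustomAdapter(EnvAdapter):
    def reflect(self, batch: List[Any], **kwargs: Any) -> Dict[str, Any]:
        # Override only for custom logic, then defer to the shared default.
        kwargs.setdefault("update_mode", "append")
        return super().reflect(batch, **kwargs)


result = OfficeQAAdapter().reflect(
    ["task-1", "task-2"], meta_skill_context="ctx", update_mode="patch"
)
```

Because the delegation lives in one place, adding a new keyword argument later would need no change in any adapter that does not override `reflect`.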