microsoft-SkillOpt

mirror of https://github.com/microsoft/SkillOpt.git synced 2026-07-03 14:02:58 +08:00

Author	SHA1	Message	Date
CharlesYang030	e4ea6a6771	chore(release): v0.2.0 Highlights since v0.1.0: - feat: SkillOpt-Sleep engine — nightly offline self-evolution (harvest -> mine -> replay -> consolidate behind a validation gate), with multi-objective reward, experience replay + dream rollouts, slow-update long-term memory, and secret redaction in cycle diagnostics. Shipped as the `skillopt-sleep` CLI. - feat: cross-tool backends & plugin shells — Claude, Codex (+Desktop harvest), Copilot, Devin, and OpenClaw. - feat: SearchQA split materialization + rollout fail-fast. - fix: Windows robustness for claude/codex backends, hardened JSON fallback, Qwen timeout/thinking gating, Codex failure surfacing. Packaging: - Bump pyproject / skillopt / skillopt_sleep to 0.2.0. - Restore skillopt_webui to the packaged wheel. See CHANGELOG.md for the full changelog and contributor acknowledgements. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-07-02 22:11:10 +08:00
Yifan Yang	9de9220214	docs(sleep): add cross-model scaling results (nano +11.9) and hyperparam ablation (#89) Update RESULTS.md with: - §2: GPT-5.4-nano target yields +11.9 pt (0.560→0.679) on SearchQA — 2× the GPT-5.5 gain, demonstrating bigger benefit where headroom exists - §4: Hyperparameter sweep confirms shipped defaults are optimal Co-authored-by: Claude Opus 4 <noreply@anthropic.com>	2026-06-26 01:40:58 +08:00
Yifan Yang	46b3207b96	docs(sleep): trim RESULTS to the headline results (remove the full grid) Remove the per-cell full deployment grid section; keep the gate-safety stress test, experience-replay scaling + night-by-night climb, the dream-diversity ablation, the gbrain end-to-end result, and the scope/limitations. Renumber sections; update the README pointer accordingly.	2026-06-15 17:08:51 +00:00
Yifan Yang	d43e8dba1a	docs(sleep): expand the grid into per-benchmark night-by-night tables Replace the compact baseline->after grid with three grouped per-benchmark tables (SearchQA / LiveMath / SpreadsheetBench), each showing all 3 targets x both modes across every night (N0..N5) + Δ. Makes the trajectory visible — gains reach a level and hold rather than being single lucky readings — and presents the full 18-cell evidence in a more solid, readable form. Footnotes LiveMath's 4-night run (train split <50 tasks). Numbers unchanged; just richer presentation.	2026-06-15 16:54:01 +00:00
Yifan Yang	d02098ffc4	docs(sleep): add full Results & Analysis (RESULTS.md); link from README Adds docs/sleep/RESULTS.md — the complete deployment-scale study behind SkillOpt-Sleep, presented rigorously (named benchmarks, test sizes, metrics, baseline->after, single shared protocol): 1. Gate-safety stress test: ungated nano SearchQA collapses 0.554->0.026 (-52.8); the gated twin holds 0.570 — the core argument for the design. 2. Full 18-cell deployment grid (3 benchmarks x 3 targets x gate/free), shipped config: mean +0.5, range [-2.4, +5.1], nothing hidden. 3. Experience-replay scaling (recall_k 10->20->full: +3.1->+4.5->+5.6) and the night-by-night climb (0.798->...->0.858, gate accepts as late as N5). 4. Dream-diversity fix as defense-in-depth: 3-config grid comparison (-2.66/-52.8 -> +0.24/-4.0 -> +0.53/-2.4); the -52.8 cell becomes +2.7 from the dream fix alone. 5. gbrain end-to-end 0.00->1.00 on real Claude + Codex. 6. Honest scope: where it helps vs flat-in-noise, single-seed caveat with a seed-robustness spot check, keep-the-gate-on. README Results section now links prominently to it. Docs only; numbers are self-contained with reproduce commands (no raw run dumps committed).	2026-06-15 16:49:13 +00:00
Yifan Yang	ea4ff459d7	docs(sleep): make the results section rigorous (named benchmarks, baseline→after) Label each result with its benchmark, test size, metric, target model, and gate mode; show absolute baseline→after (not just Δ); state the single shared protocol once. SearchQA recall-scaling table (1400-item test, SQuAD-EM, GPT-5.5, gated) + SpreadsheetBench confirmation (280-item, cell-value compare, nano, gate-free) + the gbrain end-to-end line. Keeps the single-seed / flat-on-noisy caveats.	2026-06-15 16:42:43 +00:00
Yifan Yang	de3be75bac	docs(sleep): add a SkillOpt-Sleep module readme + News mention Adds docs/sleep/README.md — a concise intro to the SkillOpt-Sleep plugin (what it is, how to use it across the three agents, the opt-in experience-replay / dream-rollout knobs, and headline results), linking to the full guide section. Adds a News bullet pointing to it. No code changes.	2026-06-15 16:31:15 +00:00
Yifan Yang	b701d9b6d9	docs: move SkillOpt-Sleep into the guide; clean docs/sleep; fix guide link Per maintainer request: - Remove the internal/scratch docs/sleep/ tree (reports, raw logs, blog run JSON, sweep.jsonl) — 23 files — and the root PUBLISHING.md. These were working notes, not reference docs. - Take the dedicated SkillOpt-Sleep content out of the main README (News bullet + section) and host it in the rendered guide instead: new section 9 in docs/guideline.html (deployment companion, the three plugins, opt-in experience replay / dream rollouts) with a sidebar entry. - Fix the README's opening reference so "Documentation & Reproduction Guide" links directly to the rendered GitHub Pages page, not the raw .html source. - Repoint the now-removed docs/sleep links in the plugin READMEs to the guide section. The plugin code (plugins/, skillopt_sleep/) is unchanged; only docs move. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-15 16:20:50 +00:00
Yifan Yang	722ce646d4	feat(sleep): experience replay + dream rollouts in the cycle (opt-in) Wires two consolidation mechanisms into the shipped nightly cycle, both default OFF so existing behavior is unchanged: - dream_rollouts (>1): multi-rollout contrastive reflection per task - recall_k (>0): associative recall of the K most-similar past tasks (from a capped task_archive persisted in state.json) into tonight's dream - dream_factor (>0): synthetic task variants New shared engine module skillopt_sleep/dream.py (recall_similar, dream_augment, dream_consolidate) is called by both the plugin cycle and the experiment harness, so reported numbers exercise the exact shipped code. Built on the existing rollouts_k/sample_id support already in consolidate.py/rollout.py. Validated (5 nights x 10 real tasks/night, full held-out test, GPT-5.5, gated): the gain scales with recall depth on a clean signal — SearchQA recall_k=10 +3.1, recall_k=20 +4.5, full-history reference +5.6; SpreadsheetBench (nano, gate-free) +3.6. Flat within noise on saturated/noisy cells. See docs/sleep/EXPERIENCE_REPLAY.md (+ raw runs under blog_runs/v2_port/). Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-15 15:58:27 +00:00
Kirill Kostarev	31715a8b43	Add Codex Desktop transcript harvesting	2026-06-15 10:23:08 +00:00
Kirill Kostarev	d31e9d9407	Back up legacy Codex prompt during install	2026-06-15 10:21:30 +00:00
Kirill Kostarev	1953484822	Make Codex integration skill-first	2026-06-15 10:21:30 +00:00
Yifan Yang	f64a41397c	docs(sleep): add PR draft (title + body) for the upstream PR Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:52 +00:00
Yifan Yang	d6c4ca3f6e	docs(sleep): load-test all 3 plugin shells on a fresh (non-gbrain) example Actually exercised every plugin shell end to end on a brand-new "SQL must always include LIMIT" analyst persona: - Claude Code shell: harvest (2 real crafted transcripts -> 2 tasks), full run (stages a proposal), adopt (honors the no-op-when-nothing-accepted contract). - Codex: install.sh places ~/.codex/prompts/sleep.md + ~/.agents/skills correctly. - Copilot: MCP server initialize -> tools/list -> tools/call returns engine output. Genuine improvement on the fresh persona, both backends: held-out TEST 0.00 -> 1.00 (Sonnet->Haiku and Codex), the optimizer learning the user's LIMIT house rule and generalizing to unseen queries. Honest finding: the first split left too few train tasks (no-op night) — re-balancing fixed it; motivates a small-train-pool warning. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:52 +00:00
Yifan Yang	dae974a5e3	chore(sleep): English-only across the engine, plugins, and docs Remove every non-ASCII/CJK character for a professional open-source repo: - harvest.py: drop hardcoded Chinese feedback phrases; add an env-based extensibility hook (SKILLOPT_SLEEP_NEG_FEEDBACK / _POS_FEEDBACK) so any locale can be added without baking one in. Verified with a German example. - rollout.py / consolidate.py: English comments. - README.md section heading + anchor, CONTROLLABLE_DREAMING.md, plugin.json, marketplace.json (also fixed stale path skillopt-sleep-plugin -> plugins/claude-code), SKILL.md: English only. - Remove the internal WAKE_UP_SUMMARY.md note (not user-facing, not referenced). Verified: zero CJK chars remain anywhere; 29 tests pass. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:52 +00:00
Yifan Yang	e2de84d36f	docs(sleep): real Claude<->Codex cross-validation of the new features Three live runs exercise the new code paths on both runtimes: A) Claude Sonnet->Haiku, gate=OFF + rollouts_k=2: brief-writer test 0->1.00, action 'greedy_improved', val & test both reported (3-way split works). B) Codex, gate=ON + rollouts_k=2: brief-writer test 0->1.00 in 2 nights. C) Claude Sonnet->Haiku, thorough-analyst, 3 nights: slow-update fires and distils a durable cross-night meta-rule (general, not task-specific). Confirms gate-off greedy path, 3-way val/test split, multi-rollout, and the gate-independent slow-update all work with real models on Claude AND Codex. Raw logs under docs/sleep/raw/crosscheck_*.txt. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	9379e494bf	docs(sleep): document the controllable dreaming architecture Captures the four-stage refactor: train(dream)/val(real)/test(real) splits, optional gate, gate-independent slow-update long-term memory, token/time budget, multi-rollout contrastive reflection, multi-objective reward (accuracy/tokens/ latency), and user-preference priors — with a one-command example composing them. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	99ec2caf6b	docs(sleep): complete 4/4 gbrain parity on Claude AND Codex (tool loop incl.) benchmark_report.md now 7/7 direct + 4/4 transfer, all 0->1.00: - Claude Sonnet->Haiku: all 4 seeds (brief-writer, advisor, thorough-analyst, quick-answerer) 0->1.00 - Codex self-optimized: brief-writer, advisor, quick-answerer 0->1.00 - quick-answerer uses the real ./search tool loop on both runtimes. This matches gbrain's own "4/4 skills 0->1.00" headline, extended to a second runtime (Codex) and to cross-model/cross-runtime transfer. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	acf4545c00	docs(sleep): full 4/4 gbrain parity — quick-answerer 0->1.00 via real tool loop quick-answerer (judge: tool_called=search) reaches 0.00 -> 1.00 with Sonnet optimizer -> Haiku target: the optimizer wrote an OVERRIDE of the "never use tools" instruction and the Haiku target genuinely invoked the ./search shim. All 4 gbrain skillopt-v1 seeds now at 0->1.00, matching gbrain's own headline. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	b1f41a7506	docs(sleep): full sweep — 5/5 direct + 4/4 transfer all 0->1.00 Machine-generated benchmark_report.md from a 9-config sweep: - Direct (Sonnet->Haiku): brief-writer/advisor/thorough-analyst 0->1.00 - Direct (Codex): brief-writer/advisor 0->1.00 - Transfer (4/4 positive, incl. cross-runtime Codex<->Claude): all 0->1.00 Cross-model transfer confirms the price-difference value prop: a skill optimized on a cheap model deploys for free on an expensive one, and skills move between Codex and Claude. sweep.jsonl is the committed source data. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	4186e5bb73	docs(sleep): definitive clean results — Sonnet->Haiku 3/3 seeds 0->1.00 Strong-optimizer/weak-target (Sonnet -> Haiku), fully isolated: brief-writer, advisor, thorough-analyst all 0.00 -> 1.00 on held-out. thorough-analyst shows 2-night convergence (0.33 -> 1.00). Codex self-optimized brief-writer also 0 -> 1.00. Key finding answering the optimizer/target-split request: the OPTIMIZER MODEL is decisive — weak Haiku-as-optimizer is flaky (0 or 1.0 across runs), strong Sonnet-as-optimizer reliably hits 1.0 on every seed. Raw logs under docs/sleep/raw/. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	d75863eb6f	fix(sleep): retry reflect on non-JSON reply; honest report narrative - reflect() now retries once with a firmer "JSON only" instruction when the first reply doesn't parse to a non-empty array. A transient non-JSON reply otherwise wastes a whole night (gate sees no edits -> reject), which made weak optimizers (Haiku) flaky across runs. - FINAL_REPORT.md: document the context-leak discovery honestly; Codex cells stand (clean), Claude cells recomputed under strict isolation. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	233b619555	feat(sleep): marketplace manifest, install docs, final report shell, sweep flush - skillopt-sleep-plugin/.claude-plugin/marketplace.json so the plugin is installable via `/plugin marketplace add ./skillopt-sleep-plugin`. - README install section (clone -> add marketplace -> install -> /sleep status). - docs/sleep/FINAL_REPORT.md: the consolidated presented results doc (real Claude+Codex, transfer, and the honest thorough-analyst failure + fix). - sweep.py flushes stdout for live monitoring. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	63c79b3602	docs(sleep): record real Claude+Codex gbrain results; both reach 0->1.00 Codex with the directive reflect prompt + 2 nights converges 0.00 -> 1.00 (up from 0.67 single-night); its night-2 edit diagnoses its own residual failure ("preserve required sections even when keeping the brief short"). Claude (Haiku) reaches 1.00 in one night. Update plugin README + skill to reference --backend claude|codex (was anthropic) and surface the benchmark. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	4203086899	feat(sleep): real claude + codex backends, gbrain-evals benchmark, rule judges Upgrade from mock-only to REAL multi-backend validation: Backends (skillopt/sleep/backend.py): - CliBackend base: shared attempt/judge/reflect prompts, response cache, token accounting. Subclasses implement only _call(). - ClaudeCliBackend: drives `claude -p --output-format text`. - CodexCliBackend: drives the REAL @openai/codex `exec -o <file>` for clean output; resolve_codex_path() skips the hermes wrapper at ~/.local/bin/codex. - reflect() now aggregates the exact failing judge criteria into the prompt (gbrain's lesson: tell the optimizer what the scorer rewards). Rule judges (skillopt/sleep/judges.py): gbrain-compatible local scorers (section_present / regex / max_chars / contains / tool_called) — held-out scoring with no judge-API spend. TaskRecord gains a `judge` field + reference_kind="rule". gbrain-evals adapter (experiments/gbrain_bench.py, run_gbrain.py): load garrytan/gbrain-evals skillopt-v1 deficient skills + train/held-out task sets and run our consolidate() loop against the SAME suite gbrain scores. REAL results (docs/sleep/real_api_results.md), brief-writer seed, 1 night: - Claude (Haiku): held-out 0.00 -> 1.00 - Codex: held-out 0.00 -> 0.67 Both proposed a correct, general format rule into the protected LEARNED block. CLI: --backend {mock,claude,codex}, --codex-path, --model; experiment + gbrain runners gain --limit-* cost controls. 17 tests pass. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	309f3141d4	docs(sleep): add wake-up summary of the overnight build Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	4e7add899d	feat(sleep): nightly offline self-evolution engine + Claude Code plugin Add skillopt/sleep — a deployment-time companion to SkillOpt that gives a local Claude agent a nightly "sleep cycle": harvest ~/.claude transcripts -> mine recurring tasks -> replay offline -> consolidate (reflect -> bounded edit -> held-out GATE) -> stage -> adopt Synthesizes SkillOpt (validation-gated bounded text optimization, reusing skillopt.evaluation.gate verbatim), Claude Dreams (offline consolidation; input never mutated; review-then-adopt), and the agent-sleep paper (short-term experience -> long-term competence). Engine (skillopt/sleep/, import-light, py>=3.10): - harvest.py read-only parse of session JSONL + history.jsonl - mine.py sessions -> TaskRecords (heuristic miner + LLM hook) - backend.py MockBackend (deterministic, no API) + AnthropicBackend - replay.py offline re-run -> (hard, soft) scores - consolidate.py one SkillOpt epoch behind a held-out gate - memory.py protected-region edits to SKILL.md / CLAUDE.md - staging.py stage proposals; adopt with backup (Dreams safety contract) - cycle.py + __main__.py orchestrator + CLI (run/dry-run/status/adopt/harvest) Plugin (skillopt-sleep-plugin/): plugin.json, /sleep command, skillopt-sleep skill, SessionEnd hook, bundled runner + cron generator. Validation (deterministic, no API): persona experiment proves held-out lift (researcher 0.33->1.0, programmer 0.32->1.0) AND that the gate rejects an injected harmful edit. 13 stdlib-unittest tests pass, incl. full cycle + adopt-with-backup and parsing of real on-disk transcripts. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
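
Commit 4203086899 in the log above describes gbrain-compatible rule judges (section_present / regex / max_chars / contains / tool_called) that score held-out output locally, with no judge-API spend. The following is a minimal sketch of how such scorers might look. The rule schema (`kind`, `value`), the function names `score_rule` and `score_all`, and the convention that a "section" is a Markdown heading are all illustrative assumptions; the log does not show the real skillopt/sleep/judges.py interface.

```python
import re


def score_rule(rule, output, tools_called=()):
    """Return 1.0 if `output` satisfies one rule, else 0.0.

    `rule` is a hypothetical dict: {"kind": <rule name>, "value": <argument>}.
    """
    kind, value = rule["kind"], rule["value"]
    if kind == "section_present":
        # Assumed convention: a section is a Markdown heading with that title.
        pattern = r"^#+\s*" + re.escape(value) + r"\s*$"
        ok = re.search(pattern, output, re.MULTILINE | re.IGNORECASE) is not None
    elif kind == "regex":
        ok = re.search(value, output) is not None
    elif kind == "max_chars":
        ok = len(output) <= value
    elif kind == "contains":
        ok = value in output
    elif kind == "tool_called":
        # `tools_called` is the list of tool names the agent invoked.
        ok = value in tools_called
    else:
        raise ValueError(f"unknown rule kind: {kind}")
    return 1.0 if ok else 0.0


def score_all(rules, output, tools_called=()):
    """Mean score over all rules; 0.0 when there are no rules."""
    if not rules:
        return 0.0
    total = sum(score_rule(r, output, tools_called) for r in rules)
    return total / len(rules)
```

Because every rule is a pure function of the output text and the tool-call list, a scorer of this shape can run offline and deterministically, which is consistent with the "no judge-API spend" note in the commit message.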

27 Commits
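
Commit d75863eb6f in the log above states that reflect() retries once with a firmer "JSON only" instruction when the first reply does not parse to a non-empty array, so that a transient non-JSON reply does not waste a whole night. A rough sketch of that retry pattern follows. The `call_model` callable, the helper names, and the wording of the firmer instruction are assumptions made for illustration; they are not taken from the repository.

```python
import json


def parse_edits(reply):
    """Return the reply as a non-empty list, or None if it does not qualify."""
    try:
        data = json.loads(reply)
    except (json.JSONDecodeError, TypeError):
        return None
    if isinstance(data, list) and data:
        return data
    return None


def reflect_with_retry(call_model, prompt):
    """Ask for edits; retry exactly once with a firmer instruction.

    `call_model` is a hypothetical callable: prompt text in, reply text out.
    Returns a list of edits, or an empty list if both attempts fail.
    """
    edits = parse_edits(call_model(prompt))
    if edits is not None:
        return edits
    # Assumed wording; the real retry instruction is not shown in the log.
    firmer = prompt + "\n\nReply with a JSON array only, no prose."
    return parse_edits(call_model(firmer)) or []
```

Returning an empty list after the second failure leaves the caller free to treat the night as a no-op, which matches the "gate sees no edits -> reject" behavior the commit message describes.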