microsoft-SkillOpt

mirror of https://github.com/microsoft/SkillOpt.git synced 2026-07-03 14:02:58 +08:00

Author	SHA1	Message	Date
CharlesYang030	e4ea6a6771	chore(release): v0.2.0 Highlights since v0.1.0: - feat: SkillOpt-Sleep engine — nightly offline self-evolution (harvest -> mine -> replay -> consolidate behind a validation gate), with multi-objective reward, experience replay + dream rollouts, slow-update long-term memory, and secret redaction in cycle diagnostics. Shipped as the `skillopt-sleep` CLI. - feat: cross-tool backends & plugin shells — Claude, Codex (+Desktop harvest), Copilot, Devin, and OpenClaw. - feat: SearchQA split materialization + rollout fail-fast. - fix: Windows robustness for claude/codex backends, hardened JSON fallback, Qwen timeout/thinking gating, Codex failure surfacing. Packaging: - Bump pyproject / skillopt / skillopt_sleep to 0.2.0. - Restore skillopt_webui to the packaged wheel. See CHANGELOG.md for the full changelog and contributor acknowledgements. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-07-02 22:11:10 +08:00
Yifan Yang	2d7e37a395	fix(json_utils): reject prose pseudo-JSON in single quotes/backticks (#82 ) Follow-up to the string-aware brace scan: that change only skipped double-quoted prose, so brace-shaped text in single quotes, backticks, or bare prose (e.g. `{op: delete}`, '{x: 1}') still reached json_repair and was fabricated into a bogus dict — strictly worse than None, since extract_json feeds the optimizer's skill edits. Add a _looks_json_like() guard before repair: a genuine JSON object's first non-space char after `{` is `"` (a key) or `}` (empty). Prose pseudo-objects start with a bare word and are rejected, while legitimate repair targets (trailing commas, unescaped quotes inside string values) all begin with `"` and pass — including objects whose string VALUES contain single quotes or backticks, which must not be rejected. Found by an independent GPT-5.5 re-review of the merged #79 code. Adds regression tests for single-quoted / backticked / bare prose (-> None) and for legitimate objects with quote/backtick string values (still repaired). Tests: 30 pass (+3 skip) without json_repair, 33 pass with it, both clean under -W error::RuntimeWarning. Co-authored-by: Claude <noreply@anthropic.com>	2026-06-23 20:31:39 +08:00
Yifan Yang	14c045f04f	Windows robustness for claude/codex backends (+ hardened JSON fallback) (#79 ) * Robustness for the claude/codex backends on Windows: argv overflow, subprocess encoding, tolerant JSON, test-eval dirs Fixes surfaced running SkillOpt end-to-end on the bundled `claude` backend (local Claude CLI) on Windows. None changes the OpenAI/GPT happy path. 1. skillopt/engine/trainer.py — the final test-eval directory (test_eval_final/) is written to before being created; add os.makedirs(..., exist_ok=True), matching the two sibling test-eval dirs. Without it, summary.json raises FileNotFoundError when a rollout yields zero predictions. 2. skillopt/model/claude_backend.py a. Pass the prompt via stdin (not argv): on Windows the whole command line is capped at ~32 KB and a large optimizer prompt (the success-analyst minibatch carrying several report trajectories) overflows it with [WinError 206], killing the run after retries. b. Pass the system prompt via --append-system-prompt-file (a temp file), not argv. The system prompt here is the skill being optimized, which SkillOpt grows over training; since the ~32 KB cap applies to the SUM of all argv, a grown skill would re-hit [WinError 206] even with the prompt on stdin. c. Pin the subprocess encoding to utf-8 (errors="replace"). With text=True and no encoding=, stdin is encoded with the system codepage; on a zh-CN box (cp936/GBK) a prompt containing an emoji or some Latin-1 characters raises UnicodeEncodeError before the CLI even starts, failing every retry. 3. skillopt/model/codex_backend.py — the same utf-8 encoding pin on its subprocess.run(input=...) call (identical unpinned-encoding pattern). 4. skillopt/utils/json_utils.py — extract_json() returned None for valid- looking JSON that strict json.loads rejects (unescaped ASCII quotes inside CJK string values, trailing commas), silently dropping the analyst's edits on non-schema backends (Claude/Qwen): reflect produces N edits, 0 applied. Add a json_repair fallback, but only on a single unambiguous object — a balanced-brace extractor plus a refuse-on-multiple-objects guard — so a chain-of-thought "scratch + final" response can't make repair silently return the wrong (discarded) object, which would be worse than None (None is detectable and retryable; a wrong-but-valid edit is applied blind). Declare json_repair in requirements.txt and the claude/qwen optional extras so the fallback is actually present (it otherwise no-ops, dropping edits silently). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> (cherry picked from commit `dca74a683e`) * fix(json_utils): harden tolerant JSON fallback from PR #77 Follow-up fixes on top of the cherry-picked Windows-robustness change: 1. Make _top_level_brace_objects() fully string-aware in its OUTER scan, not just inside an object. A '{' inside quoted prose (e.g. '"set it to {x}"') no longer starts a candidate object, so extract_json() returns None for prose pseudo-JSON instead of repairing it into a bogus dict — which would be strictly worse than dropping the edit, since extract_json feeds the optimizer's skill edits. 2. Pick the repair candidate BEFORE importing json_repair, so the missing- dependency RuntimeWarning only fires when there is genuinely a single malformed object that could have been repaired. Ordinary no-JSON / prose replies (the common case) now return None silently instead of warning on every call. 3. Resolve dependency-metadata inconsistency: json_repair is optional, so add it to the `all` extra (it was already in `claude`/`qwen`) and demote it from a hard requirement to an optional/commented entry in requirements.txt, matching the project's convention for backend-specific deps. Adds regression tests for prose-with-braces (-> None), no-warning-on-plain- text, single-object repair, and multi-object ambiguity. Existing 22 json tests still pass with and without json_repair installed. Co-Authored-By: Claude <noreply@anthropic.com> --------- Co-authored-by: samuelgoofus-boop <260247789+samuelgoofus-boop@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 19:00:23 +08:00
carpedkm	2841f82428	Fix ALFWorld gamefile paths relative to ALFWORLD_DATA	2026-06-23 10:32:38 +00:00
summerview1997	da799620ba	Fail fast on systemic SearchQA rollout failures	2026-06-16 09:20:57 +08:00
Shunsuke	98d0430bee	refactor: make EnvAdapter.reflect a shared default (fixes dropped reflect kwargs) All six adapters duplicated an identical reflect() that delegates to run_minibatch_reflect. The copies had drifted: OfficeQA/DocVQA silently dropped meta_skill_context and ALFWorld dropped update_mode, so those analysts ran without inputs every other benchmark receives (active under the default use_meta_skill: true). Move the delegation into EnvAdapter.reflect as one default that forwards all kwargs uniformly, and delete the six overrides. reflect is no longer abstract — adapters inherit it and override only for custom logic. Net -225 lines. Behavior change: OfficeQA/DocVQA/ALFWorld reflect now receive the kwargs they previously dropped; the three already-correct benchmarks are unaffected. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-15 09:06:00 +00:00
Yifan Yang	eef4805b25	Merge pull request #43 from imshunsuke/docs/fix-benchmark-loader-naming docs: align benchmark guide and template with dataloader.py naming	2026-06-15 17:00:45 +08:00
Cuzyoung	44043d4ae5	docs(trainer): drop the stale skill-aware comments (claimed best_skill carries no appendix; it does) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 13:10:08 +00:00
Cuzyoung	7dcd612361	fix(trainer): flush appendix notes on skip branches — lapse-only steps no longer drop them A step whose minibatches yield ONLY execution-lapse notes produces no body patches (analysts return empty-edits carriers, dropped by _normalise_patches), so skip_no_patches / skip_no_rewrite would `continue` before the appendix flush and silently discard every note of the step. This hit exactly the feature's target regime (mature skill body, failures classified as lapses): in c1_searchqa_def_g55_sar, 10/40 steps skipped this way and lost 95 notes total. Extract the flush block into _flush_skill_aware_appendix() and call it on the normal update path (unchanged behavior) AND on both skip branches before `continue`, so notes persist and appendix_notes.json / step_rec counters are recorded for skipped steps too. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 13:10:08 +00:00
Cuzyoung	0dc84162dc	feat(optimizer): skill-aware reflection (EmbodiSkill S_app), config-controlled and env-independent Split failure reflections into SKILL_DEFECT (body edit) vs EXECUTION_LAPSE (protected appendix note that re-emphasizes an existing rule, never edited by step-level analysts). Toggle: optimizer.use_skill_aware_reflection (default false; baseline byte-identical when off). - optimizer/appendix.py: protected APPENDIX region (inject/extract/append with dedup), mirrors the slow_update protected-field pattern - optimizer/skill_aware.py: analyst prompt augmentation, appendix_notes parsing, threshold-gated LLM consolidation, and a process-wide runtime switch (configure_skill_aware_reflection) set once by the trainer - gradient/reflect.py: augment error/success analyst prompts at runtime; None-sentinel kwargs resolve from the global switch, so env adapters need no per-benchmark wiring (works for all envs, present and future) - optimizer/skill.py: generalize the protected-region check to (slow_update, appendix); edits inside any protected region are skipped - engine/trainer.py: inject appendix at init, flush per-step EXECUTION_LAPSE notes after the gate settles, optional consolidation - tests: regression suite incl. toggle-off byte-identical guarantee and env-independent global-switch resolution (6/6 passing + live smoke) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 13:10:08 +00:00
Cuzyoung	ffe581098b	feat(trainer): final-skill val + best promotion; keep best unpolluted by slow_update - slow_update force-inject now writes current_skill ONLY (best_skill stays a faithful val-best snapshot, never receives un-validated slow_update content) - after training, run one val on the final skill; if its gate score beats the incumbent best, promote final to best (updates best_skill/best_step/best_origin) - trainer now evaluates final skill on test itself (reuses best test result when final==best); records final_selection_* and final_test_* in summary.json - spreadsheetbench: head+tail truncate the post-execution verification report at source to fix multi-MB conversation bloat Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-06-10 13:03:17 +00:00
Cuzyoung	372fd56c1e	fix(spreadsheetbench)+optimizer: fix verify-feedback bloat, drop optimizer-side truncation, soft-disable gate A. SpreadsheetBench verification-feedback bloat - rollout.py _auto_verify_output: use official _compare_cell_value (was repr() equality, which falsely flagged 5 vs 5.0 / None vs ""); collapse correct-and-empty cells into a count so large sparse answer ranges no longer flood feedback with MBs of None=None noise. - codegen_agent.py _build_eval_feedback: only list WRONG cells, collapse correct ones into a count. Scoring is unaffected (evaluate() is independent); this only fixes the target model's multi-turn solving feedback. B. Remove optimizer-side truncation (bloat source now fixed) - reflect.py: drop _MAX_TRAJ_CHARS cap and all per-field clips. - update_modes.py / clip.py / lr_autonomous.py: describe_item / short_item_summary no longer truncate; raise ranking/lr token budget. - trainer.py _format_step_buffer: full task_ids / target. - slow_update.py: full comparison samples. C. Soft-disable gate - config.py / trainer.py: use_gate=false no longer raises; validation still runs but candidates are force-accepted (new force_accept branch + log). Misc: aggregate.py merge token budget 4096 -> 16384. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-06-10 13:03:17 +00:00
Shunsuke	54e4b3eafb	docs: align benchmark guide and template with dataloader.py naming The new-benchmark guide and the env template README referred to the data loader file as loader.py, but all six built-in benchmarks name it dataloader.py (skillopt/envs/<name>/dataloader.py). Update the docs and the template rename step to match the actual convention. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-09 12:20:01 +08:00
Yifan Yang	b02ffc2c99	refactor(sleep): decouple engine to top-level skillopt_sleep/ (zero research dep) Open-source-tool / research-code separation: - git mv skillopt/sleep/ -> skillopt_sleep/ (top-level, sibling to the research skillopt/ package). History preserved as renames. - All imports skillopt.sleep.* -> skillopt_sleep.*. - Vendor the validation gate into skillopt_sleep/gate.py (a self-contained copy of skillopt.evaluation.gate). The engine now has ZERO dependency on the research package — verified: grep finds no `from skillopt.` in skillopt_sleep/, and consolidate's gate resolves to skillopt_sleep.gate. - Plugin scripts/commands/skill call `-m skillopt_sleep`. 29 tests pass; `python -m skillopt_sleep` runs standalone. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:52 +00:00
Yifan Yang	a29201adc4	feat(sleep): multi-objective reward (accuracy/tokens/latency) + user preferences - ReplayResult records per-rollout tokens + latency_ms; replay_one measures them (approximated from text length when the backend doesn't track tokens, e.g. mock). - replay.multi_objective_reward(w_acc, w_tokens, w_latency): weighted reward so a skill can be optimized to be cheaper/faster, not only more accurate (cost terms normalized vs a reference, default = accuracy-only / backward compatible). - Backend.preferences (free text) injected into reflect as a prior; build_backend attaches it (to the optimizer for dual backends). run_gbrain gains --preferences. 3 new tests (multi-objective ordering, preference injection, cost recording). 29 tests pass; mock gates + 3.8/3.12 compile green. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	77ac33e8bf	feat(sleep): multi-rollout contrastive reflection + token/time budget The "脑补推演" core the user described — re-run the same task many times and learn from the contrast between good and bad rollouts: - rollout.py: multi_rollout(task, k) runs K scored attempts; RolloutSet exposes best/worst/spread/pass_rate. contrastive_reflect picks the highest-spread tasks (some attempts passed, some failed — most informative) and asks the optimizer what the GOOD attempts did that the BAD ones didn't, distilling a general rule. Far stronger signal than a single failure. - consolidate(rollouts_k>1) uses contrastive reflection (falls back to single-shot reflect if it yields nothing). - budget.py: Budget(max_tokens\|max_minutes) tracks spend; plan_depth() derives (nights, rollouts_k) from a token budget. run_gbrain gains --rollouts-k, --budget-tokens, --budget-minutes (auto-plans depth). 3 new tests (rollout stats, budget+plan, contrastive stub). 26 tests pass. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	c179a24c45	feat(sleep): slow-update long-term memory field (runs even with gate off) Bring SkillOpt's epoch-wise slow/meta update (paper §3.6) into the sleep engine as skillopt/sleep/slow_update.py — import-light, driven through the Backend abstraction (mock/claude/codex): - Reuses the main repo's protected-field markers <!-- SLOW_UPDATE_START --> ... <!-- SLOW_UPDATE_END --> so the artifact is compatible; step-level edits never touch this field. - run_slow_update compares behavior under the first-night vs final skill across the val tasks, groups into improved/regressed/persistent/stable, and asks the optimizer to distill durable longitudinal guidance (refining prior text). - Wired into run_gbrain.run_seed AFTER the nights loop, gated by slow_update=True and run REGARDLESS of gate_mode — this is what preserves long-term memory even when the user turns the hard gate OFF (the user's slot_date=slow-update intent). 2 new tests (protected-field round-trip, stub-backend synthesis). 23 tests pass. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	6f1351edb9	feat(sleep): 3-way train/val/test split + gate_mode on\|off Data-split refactor (the anti-overfitting foundation the user asked for): - TaskRecord gains split∈{train,val,test} and origin∈{real,dream}. - assign_splits: real tasks deterministically split into val/test (disjoint); DREAM-augmented tasks (origin='dream') NEVER enter val/test — they only go to train. val gates updates; test is the final held-out measure. - gbrain loader maps its held-out.jsonl -> test, benchmark.jsonl -> train/val, so the gbrain held-out stays the true final score. - consolidate(): train drives reflect, val gates; adds gate_mode='off' (greedy, no hard filter) reporting val movement (greedy_improved/regressed/flat). - run_gbrain/transfer/experiment score on test (val fallback); run_gbrain gains --gate on\|off. Legacy replay/holdout names normalized. New test proves dream tasks never land in val/test. 21 tests pass; mock experiment + gate=off both green. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	1d20e9db14	chore(sleep): include quick-answerer (tool loop) in the sweep direct plan All 4 gbrain skillopt-v1 seeds are now in the sweep, matching gbrain's full scorecard coverage. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	937bc1ec4d	feat(sleep): real tool-loop replay for gbrain quick-answerer (tool_called judge) The 4th gbrain seed (quick-answerer) is judged by tool_called=search: the agent must ACTUALLY call a search tool. Add an honest tool loop: - Backend.attempt_with_tools(task, skill, memory, tools) -> (response, tools_called) - Claude: exposes a real ./search shell shim, runs with --allowedTools Bash in a clean cwd; detects the call from the shim's log (not a self-reported marker). - Codex: same shim under `exec --sandbox workspace-write`. - Mock: deterministic — "calls" a tool iff skill/memory instructs it (for CI). - replay_one routes tasks with a tool_called check through the tool loop and feeds detected calls to the rule judge; ReplayResult gains tools_called. Verified live (Claude haiku): deficient skill -> tools_called=[] hard=0; learned "must run ./search" rule -> tools_called=['search'] hard=1.0. 20 tests pass. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	023950a291	feat(sleep): sweep 'direct' plan uses strong-optimizer/weak-target dual config The default sweep direct plan now uses a DualBackend (Sonnet optimizer proposes edits, Haiku target runs tasks) — the SkillOpt-faithful and more reliable setup, since a weak self-optimizing model (Haiku-as-optimizer) produced flaky JSON. report.py renders the optimizer->target pairing in the direct table. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	d75863eb6f	fix(sleep): retry reflect on non-JSON reply; honest report narrative - reflect() now retries once with a firmer "JSON only" instruction when the first reply doesn't parse to a non-empty array. A transient non-JSON reply otherwise wastes a whole night (gate sees no edits -> reject), which made weak optimizers (Haiku) flaky across runs. - FINAL_REPORT.md: document the context-leak discovery honestly; Codex cells stand (clean), Claude cells recomputed under strict isolation. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	c80914b036	fix(sleep): disable global skills in claude calls (--bare --disable-slash-commands) The clean-cwd + --disallowedTools isolation was NOT enough: the user's GLOBAL skills (~/.claude/skills) are injected regardless of cwd, so reflect/attempt still sometimes replied with a list of installed skills instead of JSON edits (advisor reflect returned 21KB of skill descriptions, n_edits=0 -> gate reject). Add --bare (skip hooks/LSP/plugins) and --disable-slash-commands (disable all skills). Verified: the optimizer now returns clean JSON. Re-validating all seeds with the truly-isolated backend; prior Claude numbers are being recomputed honestly (some earlier "successes" were partly leak-assisted). Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	defb4566ea	fix(sleep): isolate claude CLI calls; concrete+override-aware reflect; honor hard constraints Critical correctness fix found by debugging the thorough-analyst failure: * `claude -p` was running with the AMBIENT Claude Code project context (the repo's CLAUDE.md, installed skills, tools). The optimizer/target calls were polluted — reflect once replied with a list of the user's installed skills instead of JSON edits. Now ClaudeCliBackend._call runs ISOLATED: a clean temp cwd, --disallowedTools '', --exclude-dynamic-system-prompt-sections. This is essential for the backend to be trustworthy and reproducible. reflect prompt: translate failing rule-judge criteria into plain English (max_chars=1200 -> "the ENTIRE response must be at most 1200 characters") and require CONCRETE, verbatim thresholds in proposed rules (not "respect limits"). * attempt prompt: treat the Learned-preferences block as HARD CONSTRAINTS that override earlier conflicting skill text. Earlier Claude results predate this fix and are being re-validated clean; the Codex backend was never affected (it runs in its own exec context). Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	233b619555	feat(sleep): marketplace manifest, install docs, final report shell, sweep flush - skillopt-sleep-plugin/.claude-plugin/marketplace.json so the plugin is installable via `/plugin marketplace add ./skillopt-sleep-plugin`. - README install section (clone -> add marketplace -> install -> /sleep status). - docs/sleep/FINAL_REPORT.md: the consolidated presented results doc (real Claude+Codex, transfer, and the honest thorough-analyst failure + fix). - sweep.py flushes stdout for live monitoring. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	a0419bfdbb	feat(sleep): benchmark sweep + report tooling; override-aware reflect prompt - sweep.py: run many (backend, model, seed, transfer-pair) configs sequentially, append each result to JSONL incrementally (resumable, interrupt-safe). - report.py: render the sweep JSONL into a presented Markdown scorecard with direct-improvement and cross-model-transfer tables. - reflect prompt now tells the optimizer its edits are APPENDED (can't delete the base skill text), so on a conflict it must write a forceful OVERRIDE rule. Diagnosed from a real failure: thorough-analyst (needs <=1200 chars) kept its edits rejected because the base "be exhaustive" line won; a verified override ("HARD LIMIT ... supersedes") makes Haiku obey (1194/880 chars -> hard=1.0). Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	7d9900b6af	feat(sleep): optimizer/target model split, transfer experiment, LLM miner Three additions driven by the goal of price-aware, model-flexible sleep: 1. DualBackend + build_backend(): route attempt->TARGET model and reflect/judge->OPTIMIZER model (SkillOpt's target-vs-optimizer split). gbrain runner gains --optimizer-backend/-model + --target-backend/-model. 2. run_transfer.py: sleep-scenario cross-model transfer. Optimize a skill on a SOURCE model (e.g. cheap haiku), freeze it, evaluate held-out on a TARGET model (e.g. expensive sonnet) with no further optimization — plus a direct reference. Mirrors the SkillOpt paper's transfer table; quantifies the "optimize cheap overnight, deploy anywhere" value prop. 3. llm_miner.py: turn real harvested transcripts into TaskRecords WITH checkable rule/rubric judges, wired into the cycle for non-mock backends, so real-data lift becomes measurable (heuristic miner remains the no-API fallback). Fixed a str.format brace bug the new unit test caught. 19 tests pass. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	4203086899	feat(sleep): real claude + codex backends, gbrain-evals benchmark, rule judges Upgrade from mock-only to REAL multi-backend validation: Backends (skillopt/sleep/backend.py): - CliBackend base: shared attempt/judge/reflect prompts, response cache, token accounting. Subclasses implement only _call(). - ClaudeCliBackend: drives `claude -p --output-format text`. - CodexCliBackend: drives the REAL @openai/codex `exec -o <file>` for clean output; resolve_codex_path() skips the hermes wrapper at ~/.local/bin/codex. - reflect() now aggregates the exact failing judge criteria into the prompt (gbrain's lesson: tell the optimizer what the scorer rewards). Rule judges (skillopt/sleep/judges.py): gbrain-compatible local scorers (section_present / regex / max_chars / contains / tool_called) — held-out scoring with no judge-API spend. TaskRecord gains a `judge` field + reference_kind="rule". gbrain-evals adapter (experiments/gbrain_bench.py, run_gbrain.py): load garrytan/gbrain-evals skillopt-v1 deficient skills + train/held-out task sets and run our consolidate() loop against the SAME suite gbrain scores. REAL results (docs/sleep/real_api_results.md), brief-writer seed, 1 night: - Claude (Haiku): held-out 0.00 -> 1.00 - Codex: held-out 0.00 -> 0.67 Both proposed a correct, general format rule into the protected LEARNED block. CLI: --backend {mock,claude,codex}, --codex-path, --model; experiment + gbrain runners gain --limit-* cost controls. 17 tests pass. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	4e7add899d	feat(sleep): nightly offline self-evolution engine + Claude Code plugin Add skillopt/sleep — a deployment-time companion to SkillOpt that gives a local Claude agent a nightly "sleep cycle": harvest ~/.claude transcripts -> mine recurring tasks -> replay offline -> consolidate (reflect -> bounded edit -> held-out GATE) -> stage -> adopt Synthesizes SkillOpt (validation-gated bounded text optimization, reusing skillopt.evaluation.gate verbatim), Claude Dreams (offline consolidation; input never mutated; review-then-adopt), and the agent-sleep paper (short-term experience -> long-term competence). Engine (skillopt/sleep/, import-light, py>=3.10): - harvest.py read-only parse of session JSONL + history.jsonl - mine.py sessions -> TaskRecords (heuristic miner + LLM hook) - backend.py MockBackend (deterministic, no API) + AnthropicBackend - replay.py offline re-run -> (hard, soft) scores - consolidate.py one SkillOpt epoch behind a held-out gate - memory.py protected-region edits to SKILL.md / CLAUDE.md - staging.py stage proposals; adopt with backup (Dreams safety contract) - cycle.py + __main__.py orchestrator + CLI (run/dry-run/status/adopt/harvest) Plugin (skillopt-sleep-plugin/): plugin.json, /sleep command, skillopt-sleep skill, SessionEnd hook, bundled runner + cron generator. Validation (deterministic, no API): persona experiment proves held-out lift (researcher 0.33->1.0, programmer 0.32->1.0) AND that the gate rejects an injected harmful edit. 13 stdlib-unittest tests pass, incl. full cycle + adopt-with-backup and parsing of real on-disk transcripts. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Matt Van Horn	c31c50be51	fix(model): forward Qwen timeout and only set enable_thinking when true Two bugs made local vLLM targets score acc=0.000: the router did not forward 'timeout' to the Qwen backend (so runs used the 300s default), and qwen_backend always injected chat_template_kwargs.enable_thinking, which non-Qwen vLLM servers reject or answer with <think> output and no <answer> tag. Forward timeout and only set the field when enabled. Closes #28 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-07 07:41:35 -07:00
Yifan Yang	4eb4c64b2a	envs/_template: make template instantiable against real EnvAdapter ABC The shipped env_template.py and loader_template.py described the same fictional async execute / evaluate / build_prompt API documented in docs/reference/api.md. As a result TemplateBenchmarkEnv(cfg) raised 'TypeError: Can't instantiate abstract class' for every copy-and-paste user who followed the in-tree scaffold. Rewrite the template so it's a working starting point: - env_template.py: TemplateBenchmarkEnv(EnvAdapter) now implements all five real abstract methods (build_train_env, build_eval_env, rollout, reflect, get_task_types) with no-op defaults documented as TODO. Instantiable today; pytest 60/60 still passes. - loader_template.py: TemplateBenchmarkLoader(SplitDataLoader) implements load_split_items for .json / .jsonl input and explains the optional load_raw_items override for split_mode="ratio". - README.md: usage steps now point at scripts/train.py's _ENV_REGISTRY (the real registry) instead of a non-existent BENCHMARK_REGISTRY in skillopt/envs/__init__.py, and link to the rewritten new-benchmark guide. - config_template.yaml: _base_ is a string path (not a list, which the loader rejects); skill_init is commented out with a note so the template config doesn't reference a file the user hasn't created. Verified locally: 'from skillopt.envs._template.env_template import TemplateBenchmarkEnv; TemplateBenchmarkEnv()' succeeds. Refs microsoft/SkillOpt#30. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-01 20:15:12 +00:00
kaikai-macbook	41012e2d5e	Support Qwen chat as optimizer backend	2026-06-01 16:44:49 +08:00
Yif Yang	b4850ce418	fix(minimax): wire YAML / CLI config through to backend PR #26 added a MiniMax chat backend but left three loose ends that silently dropped any YAML / CLI configuration of minimax_* keys: only the environment-variable path worked. - skillopt/config.py: add 6 model.minimax_* entries to _FLATTEN_MAP so the keys declared in configs/_base_/default.yaml actually survive flatten_config() (mirroring the existing model.qwen_chat_* block). - skillopt/engine/trainer.py: import configure_minimax_chat and call it alongside configure_qwen_chat, so cfg-supplied credentials, temperature, max_tokens, and enable_thinking reach the backend. Also apply cfg["minimax_model"] via set_target_deployment when the active target backend is minimax_chat. - scripts/train.py: add 6 --minimax_* CLI flags + the corresponding _CLI_TO_YAML entries, add 'minimax' / 'minimax_chat' to the --backend choices, auto-route to target_backend=minimax_chat, and pick the right default target_model for the new backend. Default behavior on existing backends (openai, claude, qwen, codex, claude_code_exec) is unchanged; all 8 shipped configs continue to load with gate_metric falling back to 'hard' for paper reproduction.	2026-05-31 08:22:20 +00:00
Yif Yang	643346c9f3	Merge pull request #26 from KovaForge/minimax-backend feat: add MiniMax as first-class chat backend Adds skillopt/model/minimax_backend.py (clean port of qwen_backend.py targeting MiniMax-M2.7 via https://api.minimax.io/v1) and registers it in the router, backend_config, and common defaults. Existing backends (openai_chat, claude_chat, qwen_chat, codex_exec, claude_code_exec) remain bit-for-bit unchanged. Verified via 10 import / routing / parity subtests; backward-compat sweep across the 8 shipped configs passes with no regression. A follow-up commit completes the YAML / CLI plumbing that this PR left half-wired (FLATTEN_MAP entries, trainer-level configure_minimax_chat call, and --minimax_* CLI args).	2026-05-31 08:20:39 +00:00
Cuzyoung	00602df9e9	feat(slow-update): add config-controlled gated / force-injected modes Add optimizer.slow_update_gate_with_selection to control how epoch-boundary slow-update guidance is applied: - false (default): force-injected - inject guidance into current & best unconditionally (unchanged behavior). - true: gated - evaluate the slow-update candidate on the selection set and accept/reject via the same validation gate as step-level updates (logic follows the SkillReflection ablation). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-31 02:02:23 +00:00
Declan Murphy	c6da31df44	fix: use correct MiniMax endpoint, model name, and add .venv to gitignore	2026-05-31 05:27:50 +08:00
Declan Murphy	309ea64ff4	feat: integrate MiniMax into model router, backend config, and common common.py: - Add minimax_chat → MiniMax/MiniMax-Text-01 to _BACKEND_DEFAULT_MODELS - Add minimax/minimax_chat aliases to _BACKEND_ALIASES backend_config.py: - Add minimax_chat to set_optimizer_backend() valid set - Add minimax_chat to set_target_backend() valid set - Add minimax_chat to is_optimizer_chat_backend() - Add minimax_chat to is_target_chat_backend() __init__.py: - Import minimax_backend as _minimax - Add minimax_chat to set_backend() legacy handler - Add minimax_chat to get_backend_name() reporting - Route chat_target() and chat_target_messages() to _minimax - Update NotImplementedError messages to list minimax_chat - Aggregate _minimax into get_token_summary() - Add _minimax.reset_token_tracker() - Add configure_minimax_chat() delegator - Add _minimax to set_reasoning_effort() and set_target_deployment()	2026-05-31 05:22:33 +08:00
Declan Murphy	d224d425f9	feat: add MiniMax chat backend module Port qwen_backend.py pattern to minimax_backend.py as a new OpenAI-compatible urllib-based backend. Includes: - BASE_URL defaulting to https://api.minimax.chat/v1 - API_KEY, TIMEOUT_SECONDS, MAX_TOKENS, TEMPERATURE env vars - ENABLE_THINKING support (MiniMax thinking mode) - configure_minimax_chat() runtime configurator - chat_target() and chat_target_messages() functions - TokenTracker integration and get_token_summary() - set_target_deployment() support - Default model: MiniMax/MiniMax-Text-01	2026-05-31 05:22:29 +08:00
hwq	1f75d022a5	y	2026-05-30 15:01:34 +00:00
Yif Yang	d190bf37c1	Merge pull request #25 from lvbaocheng/feature/gate-soft-metric Add configurable gate metric (hard / soft / mixed) for skill validation Default is `hard`, preserving exact pre-PR behavior — verified by 22 unit assertions on the gate module plus an end-to-end 8-step trainer-trajectory test that produces a bit-for-bit identical accept/reject sequence between the pre-PR and post-PR code paths under `gate_metric: hard`. Paper- reproduction results are unaffected. `soft` and `mixed` are opt-in via `evaluation.gate_metric` in the config and address small-selection-set runs where discrete hard accuracy is too coarse to distinguish candidate skills.	2026-05-30 08:01:39 +00:00
Yif Yang	02695bd813	Merge pull request #24 from lvbaocheng/fix/claude-cli-effort-flag fix(claude): use --effort instead of deprecated --thinking flag	2026-05-30 15:31:00 +08:00
lvbaocheng	5d7875cb2e	Add configurable gate metric (hard / soft / mixed) for skill validation The training gate currently always compares candidate vs. current/best using hard exact-match accuracy. On environments with a small held-out selection set (e.g. 3-6 items) or partial-credit scoring, hard accuracy is too coarse: candidate skills that meaningfully improve per-item soft scores get rejected because the discrete hard count does not move. Add three opt-in metrics so users can pick the one that matches their scoring function: - `gate_metric: hard` — original behavior (default, fully backward compatible). - `gate_metric: soft` — gate on the soft / F1 / partial-credit score. - `gate_metric: mixed` — `(1 - w) * hard + w * soft`, where `w` is set by `gate_mixed_weight` (default 0.5). Changes ------- - `skillopt/evaluation/gate.py`: extend `evaluate_gate` with `cand_soft`, `metric`, and `mixed_weight` keyword arguments; add a pure helper `select_gate_score(hard, soft, metric, mixed_weight)`. Defaults preserve the original `metric="hard"` behavior — existing callers that only pass `cand_hard` keep working unchanged. - `skillopt/evaluation/__init__.py`: export the new helper / type. - `skillopt/engine/trainer.py`: read `evaluation.gate_metric` and `evaluation.gate_mixed_weight` from the config (with safe defaults), pass both metrics into `evaluate_gate`, and project the baseline `current_score` / `best_score` into metric space so subsequent comparisons are consistent. Print the gate metric on the `[6/6 EVALUATE]` line so logs make the decision basis explicit. The selection cache still records both `(hard, soft)` so a metric change on resume is non-destructive. - `configs/_base_/default.yaml`: document and ship the new keys with backward-compatible defaults (`hard`, `0.5`). Backward compatibility ---------------------- - Default config does not change behavior: `gate_metric` defaults to `hard`, exactly matching the previous gate. - `evaluate_gate(...)` keeps its existing positional signature; the new parameters are keyword-only with safe defaults. - `step_record.json` gains optional `gate_metric` and `candidate_gate_score` fields; old records still load. Tested ------ - Unit-tested all three metrics + boundary `mixed_weight` values (0.0 / 1.0) and rejection of unknown metric strings. All six cases pass. - Verified `skillopt.engine.trainer` imports cleanly after the refactor.	2026-05-30 14:45:27 +08:00
lvbaocheng	2532043d25	fix(claude): use --effort instead of deprecated --thinking flag Claude Code CLI v2.x renamed the flag; passing --thinking low causes all rollout calls to fail on CLI 2.1.87+. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-30 11:24:13 +08:00
zq	41be2f1803	fix(scoring): use float() instead of int() for continuous reward scores int() truncates smoothed composite scores (0.0-1.0) to 0, making all continuous reward values appear as failures. This broke SkillOpt training pipelines using SmoothedCompositeReward.	2026-05-30 07:47:41 +08:00
zq	a62ec857f1	fix(reflect): support continuous reward scores in failure filtering not r.get("hard") treats non-zero floats as success. Add explicit float threshold check (< 1e-9). Backward compatible with binary hard=0/1.	2026-05-29 19:04:42 +08:00
zq	afb552008b	fix(trainer): support continuous reward scores in bucket aggregation int() truncates any float in [0,1) to 0. Replace with float(). Also fix falsy float check in failure detection. Backward compatible with binary hard=0/1.	2026-05-29 19:03:52 +08:00
Yif Yang	75b5c7f31c	Merge pull request #16 from guilhermeleste/feat/pioneer-ai-provider-integration Add OpenAI-compatible backend support for Pioneer.ai and other providers	2026-05-29 10:14:32 +08:00
hwq	786d57b5cf	Make rollout completion tokens configurable	2026-05-28 09:45:47 +00:00
guilhermeleste	d5c5b61830	Add OpenAI-compatible backend support for Pioneer.ai and other providers - Add 'openai_compatible', 'compat', and 'openai' auth modes to azure_openai.py - Modify _make_client() to use OpenAI client (not AzureOpenAI) for compatible endpoints - Update type hints to support both AzureOpenAI and OpenAI clients - Auto-configure API version sentinel when using compatible modes - Add .env template for Pioneer.ai configuration This allows users to use Pioneer.ai or any OpenAI-compatible API endpoint as both optimizer and target backend without requiring Azure OpenAI. Resolves: Support for non-Azure OpenAI-compatible providers	2026-05-28 05:54:43 -03:00
Cuzyoung	f55a26414e	cleanup: remove unused benchmarks, deep_probe, meta_reflect Remove sealqa, babyvision, mathverse, mmrb, swebench envs and configs. Remove deep_probe, deep_reflect, meta_reflect modules and prompts. Remove download_babyvision script. These are not part of the core released benchmarks. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-24 19:36:48 +00:00

1 2

53 Commits