microsoft-SkillOpt

mirror of https://github.com/microsoft/SkillOpt.git synced 2026-07-03 14:02:58 +08:00

Author	SHA1	Message	Date
Yif Yang	5487e2c426	fix(skillopt-sleep): redact secrets before persisting cycle diagnostics PR #92 added a per-cycle diagnostics.json that surfaces backend stderr, optimizer replies, and task responses so a 0.0 night is self-diagnosing. Those free-text fields can carry credentials (e.g. a codex 401 stderr dump containing an auth token), so persisting them verbatim was a new on-disk leak surface. - Add a shared redact_secrets() in staging.py and route diagnostics.json's call_error / reflect_raw_head / holdout_detail through it before writing. - Redact the codex and Claude auth-error log lines too (a secondary sink when a file log handler is attached); last_call_error stays raw in memory so _AUTH_MARKERS matching is unaffected. - Centralize _SECRET_PATTERNS in staging.py (harvest_codex now reuses them) and extend coverage to AWS / GitHub / Slack / Google / JWT token shapes. - Tests: secret-shape coverage, private-key blocks, recursive/scalar passthrough, no over-redaction of plain prose, fail-fast auth-error log redaction, and an end-to-end check that diagnostics.json has no secret. Observability-only; the gate and learning algorithm are unchanged. Co-Authored-By: Claude <noreply@anthropic.com>	2026-06-30 19:47:36 +00:00
Yifan Yang	b9142bad24	fix(skillopt-sleep): surface codex auth/model/version failures instead of silently scoring 0 (#92) Splits CodexCliBackend._call into _call_once + a retry wrapper so transient empties/timeouts are retried instead of silently scored 0, and fails fast on fatal auth/model/version errors (401, refresh_token_reused, token_expired, ChatGPT-account-unsupported, newer-Codex-required). On non-zero exit the CLI error text is surfaced via last_call_error instead of being returned as a model response. Adds per-cycle diagnostics.json (observability only; gate and learning algorithm unchanged) so a 0.0 night self-explains.	2026-07-01 03:20:08 +08:00
Tanmay9223	680dd28f5a	fix(tests): move TestVerifierDiscipline above main block (Addresses PR review feedback by ensuring python file-run execution discovers the test class)	2026-06-30 13:05:01 +05:30
Tanmay9223	fccc21f3f6	test(sleep): add verifier-discipline stress test (closes #67) Add a regression test to ensure the validation gate correctly rejects reward-hacking skill edits. It has been observed that optimizers sometimes propose shortcuts that improve train/replay metrics but fail to improve held-out behavior. This test codifies that the gate blocks such artifacts. Add TestVerifierDiscipline to the test_sleep_engine.py suite: - Create MockRewardHackingBackend that simulates a reward-hacking rule which passes the train set but degrades the held-out tasks. - Assert that the proposed edit is rejected by the gate.	2026-06-30 13:04:22 +05:30
Daniel Martinez	9fa0716c72	fix(skillopt-sleep): also surface codex failures on the tool-call rollout path Follow-up from a fresh-context review of the prior commit: CodexCliBackend.attempt_with_tools (the rollout path for tool-requiring tasks) ran codex exec inline, swallowed all exceptions, and never set last_call_error — so an auth/model/version failure on the tool path still produced a silent empty->0 with no diagnostic signal, the exact failure class the prior commit fixed for the _call path. Now it surfaces timeout/exception/non-zero-exit via last_call_error (response stays empty; never leaks the CLI error text), so a failed tool rollout shows up in diagnostics.json. Adds a regression test. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-27 23:56:11 -05:00
Daniel Martinez	9fcf5868c3	fix(skillopt-sleep): surface codex auth/model/version failures instead of silently scoring 0 A nightly sleep cycle could run for weeks emitting held-out 0.0 -> 0.0 (gate reject, zero edits), indistinguishable from "nothing to learn", when the real cause was the codex backend returning an error (expired auth / model unsupported on the account / outdated CLI) that got scored as a failed rollout. backend (CodexCliBackend): - split _call into _call_once + a retry wrapper: transient empties/timeouts are retried instead of silently returning "" (mirrors AzureOpenAIBackend's guard); - on a non-zero exit, surface the reason via last_call_error and return "" rather than leaking the CLI error text as if it were a model response; - fail fast (no retries) on fatal auth/model/version errors (401, refresh_token_reused, token_expired, "not supported when using Codex with a ChatGPT account", "requires a newer version of Codex"). backend (CliBackend.reflect): retain last_reflect_raw so a no-edits night is diagnosable. consolidate: ConsolidationResult now carries per-task held-out detail (response, hard/soft, fail_reason) + reflect_raw + call_error. cycle: write diagnostics.json per cycle so a 0.0 night self-explains instead of being a black box. tests: 4 new (retry-not-silent-zero, auth-error-surfaced-not-scored, holdout-detail, reflect-raw). Also gitignore the .skillopt-sleep/ runtime dir. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-27 22:26:20 -05:00
khashayar	9799c41461	devin plugin: full schema/tool parity with plugins/copilot Mirror the copilot MCP server: same rich _TOOL_SCHEMA (source, model, tasks_file, target_skill_path, max_sessions, max_tasks, lookback_hours, auto_adopt, json, edit_budget, hour, minute) and generic flag forwarding, plus sleep_schedule / sleep_unschedule. Devin specifics retained: the ATIF-v1.7 harvest step (run before data-reading actions, engine pointed at it via --claude-home, default --source claude) and post-adopt sync into .devin/skills/. Tests + README + rules snippet updated for the 7-tool interface. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-25 21:56:42 +02:00
khashayar	e51eb7c4be	devin plugin: expand ~ in CLAUDE_HOME from env + add tests & ATIF fixture Review fixes: - Path bug: SKILLOPT_DEVIN_CLAUDE_HOME (and SKILLOPT_SLEEP_REPO) read from the env are now wrapped in os.path.expanduser, so the documented "~/..." config no longer passes a literal ~ to --claude-home (which yielded zero mined sessions). expanduser on an absolute default is a no-op. - tests/test_devin_plugin.py: tool-schema completeness, action→subcommand map, backend enum, the CLAUDE_HOME expansion regression, and an ATIF-v1.7 harvest shape test against a bundled fixture. - plugins/devin/fixtures/devin_sample.json: sample ATIF-v1.7 transcript. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-25 21:49:21 +02:00
Yifan Yang	2d7e37a395	fix(json_utils): reject prose pseudo-JSON in single quotes/backticks (#82) Follow-up to the string-aware brace scan: that change only skipped double-quoted prose, so brace-shaped text in single quotes, backticks, or bare prose (e.g. `{op: delete}`, '{x: 1}') still reached json_repair and was fabricated into a bogus dict, which is strictly worse than None, since extract_json feeds the optimizer's skill edits. Add a _looks_json_like() guard before repair: a genuine JSON object's first non-space char after `{` is `"` (a key) or `}` (empty). Prose pseudo-objects start with a bare word and are rejected, while legitimate repair targets (trailing commas, unescaped quotes inside string values) all begin with `"` and pass, including objects whose string VALUES contain single quotes or backticks, which must not be rejected. Found by an independent GPT-5.5 re-review of the merged #79 code. Adds regression tests for single-quoted / backticked / bare prose (-> None) and for legitimate objects with quote/backtick string values (still repaired). Tests: 30 pass (+3 skip) without json_repair, 33 pass with it, both clean under -W error::RuntimeWarning. Co-authored-by: Claude <noreply@anthropic.com>	2026-06-23 20:31:39 +08:00
Yifan Yang	14c045f04f	Windows robustness for claude/codex backends (+ hardened JSON fallback) (#79) * Robustness for the claude/codex backends on Windows: argv overflow, subprocess encoding, tolerant JSON, test-eval dirs Fixes surfaced running SkillOpt end-to-end on the bundled `claude` backend (local Claude CLI) on Windows. None changes the OpenAI/GPT happy path. 1. skillopt/engine/trainer.py: the final test-eval directory (test_eval_final/) is written to before being created; add os.makedirs(..., exist_ok=True), matching the two sibling test-eval dirs. Without it, summary.json raises FileNotFoundError when a rollout yields zero predictions. 2. skillopt/model/claude_backend.py a. Pass the prompt via stdin (not argv): on Windows the whole command line is capped at ~32 KB and a large optimizer prompt (the success-analyst minibatch carrying several report trajectories) overflows it with [WinError 206], killing the run after retries. b. Pass the system prompt via --append-system-prompt-file (a temp file), not argv. The system prompt here is the skill being optimized, which SkillOpt grows over training; since the ~32 KB cap applies to the SUM of all argv, a grown skill would re-hit [WinError 206] even with the prompt on stdin. c. Pin the subprocess encoding to utf-8 (errors="replace"). With text=True and no encoding=, stdin is encoded with the system codepage; on a zh-CN box (cp936/GBK) a prompt containing an emoji or some Latin-1 characters raises UnicodeEncodeError before the CLI even starts, failing every retry. 3. skillopt/model/codex_backend.py: the same utf-8 encoding pin on its subprocess.run(input=...) call (identical unpinned-encoding pattern). 4. skillopt/utils/json_utils.py: extract_json() returned None for valid-looking JSON that strict json.loads rejects (unescaped ASCII quotes inside CJK string values, trailing commas), silently dropping the analyst's edits on non-schema backends (Claude/Qwen): reflect produces N edits, 0 applied. Add a json_repair fallback, but only on a single unambiguous object (a balanced-brace extractor plus a refuse-on-multiple-objects guard), so a chain-of-thought "scratch + final" response can't make repair silently return the wrong (discarded) object, which would be worse than None (None is detectable and retryable; a wrong-but-valid edit is applied blind). Declare json_repair in requirements.txt and the claude/qwen optional extras so the fallback is actually present (it otherwise no-ops, dropping edits silently). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> (cherry picked from commit `dca74a683e`) * fix(json_utils): harden tolerant JSON fallback from PR #77 Follow-up fixes on top of the cherry-picked Windows-robustness change: 1. Make _top_level_brace_objects() fully string-aware in its OUTER scan, not just inside an object. A '{' inside quoted prose (e.g. '"set it to {x}"') no longer starts a candidate object, so extract_json() returns None for prose pseudo-JSON instead of repairing it into a bogus dict, which would be strictly worse than dropping the edit, since extract_json feeds the optimizer's skill edits. 2. Pick the repair candidate BEFORE importing json_repair, so the missing-dependency RuntimeWarning only fires when there is genuinely a single malformed object that could have been repaired. Ordinary no-JSON / prose replies (the common case) now return None silently instead of warning on every call. 3. Resolve dependency-metadata inconsistency: json_repair is optional, so add it to the `all` extra (it was already in `claude`/`qwen`) and demote it from a hard requirement to an optional/commented entry in requirements.txt, matching the project's convention for backend-specific deps. Adds regression tests for prose-with-braces (-> None), no-warning-on-plain-text, single-object repair, and multi-object ambiguity. Existing 22 json tests still pass with and without json_repair installed. Co-Authored-By: Claude <noreply@anthropic.com> --------- Co-authored-by: samuelgoofus-boop <260247789+samuelgoofus-boop@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 19:00:23 +08:00
carpedkm	2841f82428	Fix ALFWorld gamefile paths relative to ALFWORLD_DATA	2026-06-23 10:32:38 +00:00
carpedkm	bfa53bc46d	fix(sleep): make --bare conditional on ANTHROPIC_API_KEY (#68) ClaudeCliBackend._call() and attempt_with_tools() hardcoded --bare, which skips Claude CLI's credential resolution. This broke subscription-token auth: every model call silently returned "Not logged in" and scored 0; the user saw "baseline 0.0 → candidate 0.0, gate reject" with no indication of an auth failure. Fix: only pass --bare when ANTHROPIC_API_KEY is set. The remaining isolation flags (--disable-slash-commands, --disallowedTools, --exclude-dynamic-system-prompt-sections, clean temp cwd) already provide the needed isolation without --bare. Also adds _detect_cli_error() to log a warning when CLI output matches known auth error patterns, so auth failures surface loudly instead of deflating every score to 0. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-20 13:28:34 +00:00
carpedkm	0be780052a	feat: sync all 4 runtime plugins with full engine surface + fix #52 #58 #62 Bug fixes: - #52: bundle run-sleep.sh in Claude Code plugin + 4-level fallback - #58: add skillopt-sleep console script entry point in pyproject.toml - #62: filter headless claude -p replay sessions from harvest Plugin sync (Claude Code / Codex / Copilot / OpenClaw): - Document all 22 CLI flags, 7 actions, 4 backends across all SKILL.md files - Document config keys (preferences, gate_mode, dream_rollouts, etc.) - Document memory consolidation (evolve_memory / evolve_skill) - Add schedule/unschedule to all plugins - Copilot MCP: expand schema from 3 → 16 params + schedule tools - OpenClaw: add schedule/unschedule subcommands via shared scheduler Tests: - Cross-plugin parity test (prevents future feature drift) - MCP schema completeness test Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-20 11:31:09 +00:00
Kirill Kostarev	05cdc26beb	Add reviewed task-file flow for Codex sleep runs	2026-06-20 08:58:48 +00:00
DB Lee	5799695951	feat(copilot): implement attempt_with_tools with cross-platform tool shims Adds honest tool-call detection for CopilotCliBackend, mirroring the Claude/Codex backends. Writes per-tool executable shims into the work dir and detects real invocations from a calllog (not self-reported markers). The Copilot backend is Windows-validated, so shims are cross-platform: a .cmd batch shim on Windows and a chmod'd bash shim on POSIX, with an OS-specific tool hint. Mirrors _call's flags/env (isolated COPILOT_HOME, --allow-all-tools, MCP/instruction disabling) and the UTF-8 subprocess fix. Adds test_attempt_with_tools_honest_detection: a CI-friendly, OS-aware stub stands in for the CLI, runs the shim, and asserts both JSONL parsing and log-based detection. Validated live on Windows (real Copilot call) and on Linux/WSL (POSIX path). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-06-17 17:25:50 -07:00
DB Lee	013a7cd83a	test: add unit tests for CopilotCliBackend (parsing + alias + isolated home) Covers _parse_jsonl_response (multi-message concat, junk-line skipping, empty/non-assistant events), get_backend alias resolution, and the isolated-COPILOT_HOME / full-env opt-out behavior. Pure logic, no CLI required. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-06-17 17:25:50 -07:00
Yifan Yang	6940e46f4e	Merge pull request #65 from summerview1997/codex/searchqa-materialize-splits Add SearchQA split materialization helper	2026-06-17 23:50:38 +08:00
Yifan Yang	0e962219f5	Merge pull request #64 from summerview1997/codex/searchqa-rollout-failfast Fail fast on systemic SearchQA rollout failures	2026-06-17 23:49:55 +08:00
summerview1997	c755792049	Add SearchQA materialization tests	2026-06-16 09:27:09 +08:00
summerview1997	923becb00f	Add SearchQA rollout fail-fast tests	2026-06-16 09:21:08 +08:00
summerview1997	30cc8a3ed3	Add WebUI env preflight tests	2026-06-16 09:04:30 +08:00
Kirill Kostarev	31715a8b43	Add Codex Desktop transcript harvesting	2026-06-15 10:23:08 +00:00
Cuzyoung	0dc84162dc	feat(optimizer): skill-aware reflection (EmbodiSkill S_app), config-controlled and env-independent Split failure reflections into SKILL_DEFECT (body edit) vs EXECUTION_LAPSE (protected appendix note that re-emphasizes an existing rule, never edited by step-level analysts). Toggle: optimizer.use_skill_aware_reflection (default false; baseline byte-identical when off). - optimizer/appendix.py: protected APPENDIX region (inject/extract/append with dedup), mirrors the slow_update protected-field pattern - optimizer/skill_aware.py: analyst prompt augmentation, appendix_notes parsing, threshold-gated LLM consolidation, and a process-wide runtime switch (configure_skill_aware_reflection) set once by the trainer - gradient/reflect.py: augment error/success analyst prompts at runtime; None-sentinel kwargs resolve from the global switch, so env adapters need no per-benchmark wiring (works for all envs, present and future) - optimizer/skill.py: generalize the protected-region check to (slow_update, appendix); edits inside any protected region are skipped - engine/trainer.py: inject appendix at init, flush per-step EXECUTION_LAPSE notes after the gate settles, optional consolidation - tests: regression suite incl. toggle-off byte-identical guarantee and env-independent global-switch resolution (6/6 passing + live smoke) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 13:10:08 +00:00
Yifan Yang	b02ffc2c99	refactor(sleep): decouple engine to top-level skillopt_sleep/ (zero research dep) Open-source-tool / research-code separation: - git mv skillopt/sleep/ -> skillopt_sleep/ (top-level, sibling to the research skillopt/ package). History preserved as renames. - All imports skillopt.sleep.* -> skillopt_sleep.*. - Vendor the validation gate into skillopt_sleep/gate.py (a self-contained copy of skillopt.evaluation.gate). The engine now has ZERO dependency on the research package — verified: grep finds no `from skillopt.` in skillopt_sleep/, and consolidate's gate resolves to skillopt_sleep.gate. - Plugin scripts/commands/skill call `-m skillopt_sleep`. 29 tests pass; `python -m skillopt_sleep` runs standalone. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:52 +00:00
Yifan Yang	a29201adc4	feat(sleep): multi-objective reward (accuracy/tokens/latency) + user preferences - ReplayResult records per-rollout tokens + latency_ms; replay_one measures them (approximated from text length when the backend doesn't track tokens, e.g. mock). - replay.multi_objective_reward(w_acc, w_tokens, w_latency): weighted reward so a skill can be optimized to be cheaper/faster, not only more accurate (cost terms normalized vs a reference, default = accuracy-only / backward compatible). - Backend.preferences (free text) injected into reflect as a prior; build_backend attaches it (to the optimizer for dual backends). run_gbrain gains --preferences. 3 new tests (multi-objective ordering, preference injection, cost recording). 29 tests pass; mock gates + 3.8/3.12 compile green. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	77ac33e8bf	feat(sleep): multi-rollout contrastive reflection + token/time budget The "mental simulation" core the user described: re-run the same task many times and learn from the contrast between good and bad rollouts: - rollout.py: multi_rollout(task, k) runs K scored attempts; RolloutSet exposes best/worst/spread/pass_rate. contrastive_reflect picks the highest-spread tasks (some attempts passed, some failed; most informative) and asks the optimizer what the GOOD attempts did that the BAD ones didn't, distilling a general rule. Far stronger signal than a single failure. - consolidate(rollouts_k>1) uses contrastive reflection (falls back to single-shot reflect if it yields nothing). - budget.py: Budget(max_tokens|max_minutes) tracks spend; plan_depth() derives (nights, rollouts_k) from a token budget. run_gbrain gains --rollouts-k, --budget-tokens, --budget-minutes (auto-plans depth). 3 new tests (rollout stats, budget+plan, contrastive stub). 26 tests pass. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	c179a24c45	feat(sleep): slow-update long-term memory field (runs even with gate off) Bring SkillOpt's epoch-wise slow/meta update (paper §3.6) into the sleep engine as skillopt/sleep/slow_update.py — import-light, driven through the Backend abstraction (mock/claude/codex): - Reuses the main repo's protected-field markers <!-- SLOW_UPDATE_START --> ... <!-- SLOW_UPDATE_END --> so the artifact is compatible; step-level edits never touch this field. - run_slow_update compares behavior under the first-night vs final skill across the val tasks, groups into improved/regressed/persistent/stable, and asks the optimizer to distill durable longitudinal guidance (refining prior text). - Wired into run_gbrain.run_seed AFTER the nights loop, gated by slow_update=True and run REGARDLESS of gate_mode — this is what preserves long-term memory even when the user turns the hard gate OFF (the user's slot_date=slow-update intent). 2 new tests (protected-field round-trip, stub-backend synthesis). 23 tests pass. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	6f1351edb9	feat(sleep): 3-way train/val/test split + gate_mode on|off Data-split refactor (the anti-overfitting foundation the user asked for): - TaskRecord gains split∈{train,val,test} and origin∈{real,dream}. - assign_splits: real tasks deterministically split into val/test (disjoint); DREAM-augmented tasks (origin='dream') NEVER enter val/test; they only go to train. val gates updates; test is the final held-out measure. - gbrain loader maps its held-out.jsonl -> test, benchmark.jsonl -> train/val, so the gbrain held-out stays the true final score. - consolidate(): train drives reflect, val gates; adds gate_mode='off' (greedy, no hard filter) reporting val movement (greedy_improved/regressed/flat). - run_gbrain/transfer/experiment score on test (val fallback); run_gbrain gains --gate on|off. Legacy replay/holdout names normalized. New test proves dream tasks never land in val/test. 21 tests pass; mock experiment + gate=off both green. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	937bc1ec4d	feat(sleep): real tool-loop replay for gbrain quick-answerer (tool_called judge) The 4th gbrain seed (quick-answerer) is judged by tool_called=search: the agent must ACTUALLY call a search tool. Add an honest tool loop: - Backend.attempt_with_tools(task, skill, memory, tools) -> (response, tools_called) - Claude: exposes a real ./search shell shim, runs with --allowedTools Bash in a clean cwd; detects the call from the shim's log (not a self-reported marker). - Codex: same shim under `exec --sandbox workspace-write`. - Mock: deterministic — "calls" a tool iff skill/memory instructs it (for CI). - replay_one routes tasks with a tool_called check through the tool loop and feeds detected calls to the rule judge; ReplayResult gains tools_called. Verified live (Claude haiku): deficient skill -> tools_called=[] hard=0; learned "must run ./search" rule -> tools_called=['search'] hard=1.0. 20 tests pass. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	7d9900b6af	feat(sleep): optimizer/target model split, transfer experiment, LLM miner Three additions driven by the goal of price-aware, model-flexible sleep: 1. DualBackend + build_backend(): route attempt->TARGET model and reflect/judge->OPTIMIZER model (SkillOpt's target-vs-optimizer split). gbrain runner gains --optimizer-backend/-model + --target-backend/-model. 2. run_transfer.py: sleep-scenario cross-model transfer. Optimize a skill on a SOURCE model (e.g. cheap haiku), freeze it, evaluate held-out on a TARGET model (e.g. expensive sonnet) with no further optimization — plus a direct reference. Mirrors the SkillOpt paper's transfer table; quantifies the "optimize cheap overnight, deploy anywhere" value prop. 3. llm_miner.py: turn real harvested transcripts into TaskRecords WITH checkable rule/rubric judges, wired into the cycle for non-mock backends, so real-data lift becomes measurable (heuristic miner remains the no-API fallback). Fixed a str.format brace bug the new unit test caught. 19 tests pass. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	4203086899	feat(sleep): real claude + codex backends, gbrain-evals benchmark, rule judges Upgrade from mock-only to REAL multi-backend validation: Backends (skillopt/sleep/backend.py): - CliBackend base: shared attempt/judge/reflect prompts, response cache, token accounting. Subclasses implement only _call(). - ClaudeCliBackend: drives `claude -p --output-format text`. - CodexCliBackend: drives the REAL @openai/codex `exec -o <file>` for clean output; resolve_codex_path() skips the hermes wrapper at ~/.local/bin/codex. - reflect() now aggregates the exact failing judge criteria into the prompt (gbrain's lesson: tell the optimizer what the scorer rewards). Rule judges (skillopt/sleep/judges.py): gbrain-compatible local scorers (section_present / regex / max_chars / contains / tool_called) — held-out scoring with no judge-API spend. TaskRecord gains a `judge` field + reference_kind="rule". gbrain-evals adapter (experiments/gbrain_bench.py, run_gbrain.py): load garrytan/gbrain-evals skillopt-v1 deficient skills + train/held-out task sets and run our consolidate() loop against the SAME suite gbrain scores. REAL results (docs/sleep/real_api_results.md), brief-writer seed, 1 night: - Claude (Haiku): held-out 0.00 -> 1.00 - Codex: held-out 0.00 -> 0.67 Both proposed a correct, general format rule into the protected LEARNED block. CLI: --backend {mock,claude,codex}, --codex-path, --model; experiment + gbrain runners gain --limit-* cost controls. 17 tests pass. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	4e7add899d	feat(sleep): nightly offline self-evolution engine + Claude Code plugin Add skillopt/sleep — a deployment-time companion to SkillOpt that gives a local Claude agent a nightly "sleep cycle": harvest ~/.claude transcripts -> mine recurring tasks -> replay offline -> consolidate (reflect -> bounded edit -> held-out GATE) -> stage -> adopt Synthesizes SkillOpt (validation-gated bounded text optimization, reusing skillopt.evaluation.gate verbatim), Claude Dreams (offline consolidation; input never mutated; review-then-adopt), and the agent-sleep paper (short-term experience -> long-term competence). Engine (skillopt/sleep/, import-light, py>=3.10): - harvest.py read-only parse of session JSONL + history.jsonl - mine.py sessions -> TaskRecords (heuristic miner + LLM hook) - backend.py MockBackend (deterministic, no API) + AnthropicBackend - replay.py offline re-run -> (hard, soft) scores - consolidate.py one SkillOpt epoch behind a held-out gate - memory.py protected-region edits to SKILL.md / CLAUDE.md - staging.py stage proposals; adopt with backup (Dreams safety contract) - cycle.py + __main__.py orchestrator + CLI (run/dry-run/status/adopt/harvest) Plugin (skillopt-sleep-plugin/): plugin.json, /sleep command, skillopt-sleep skill, SessionEnd hook, bundled runner + cron generator. Validation (deterministic, no API): persona experiment proves held-out lift (researcher 0.33->1.0, programmer 0.32->1.0) AND that the gate rejects an injected harmful edit. 13 stdlib-unittest tests pass, incl. full cycle + adopt-with-backup and parsing of real on-disk transcripts. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Matt Van Horn	c31c50be51	fix(model): forward Qwen timeout and only set enable_thinking when true Two bugs made local vLLM targets score acc=0.000: the router did not forward 'timeout' to the Qwen backend (so runs used the 300s default), and qwen_backend always injected chat_template_kwargs.enable_thinking, which non-Qwen vLLM servers reject or answer with <think> output and no <answer> tag. Forward timeout and only set the field when enabled. Closes #28 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-07 07:41:35 -07:00
Claude Code Agent	dd8cd993b5	test: add unit test suite for core utility modules Add initial test infrastructure covering: - skillopt/utils/scoring.py (compute_score, skill_hash) - skillopt/utils/json_utils.py (extract_json, extract_json_array) - skillopt/types.py (Edit, Patch dataclass serialization) All tested functions are pure/deterministic with no LLM dependencies. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-01 02:04:22 +08:00

34 Commits
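
The guard described in commit 2d7e37a395 (accept a candidate only if the first non-space character after `{` is `"` or `}`) can be illustrated with a minimal sketch. This is not the repository's `_looks_json_like()`; the function name `looks_json_like` and its exact behavior on edge cases are assumptions made for illustration, based only on the rule stated in the commit message.

```python
def looks_json_like(candidate: str) -> bool:
    """Hypothetical sketch of a pre-repair guard for brace-shaped text.

    Returns True only when the text starts like a real JSON object:
    an opening brace followed (after optional whitespace) by either a
    double-quoted key or an immediate closing brace.
    """
    text = candidate.lstrip()
    if not text.startswith("{"):
        return False
    rest = text[1:].lstrip()
    # Tuple membership (not substring search), so an empty remainder is rejected.
    return rest[:1] in ('"', "}")


if __name__ == "__main__":
    samples = [
        '{"op": "delete"}',        # genuine object
        '{"a": 1, }',              # trailing comma: still a legitimate repair target
        '{"note": "it\'s `ok`"}',  # quotes/backticks inside a string VALUE are fine
        "{op: delete}",            # bare-word prose pseudo-object
        "{'x': 1}",                # single-quoted pseudo-object
    ]
    for sample in samples:
        print(f"{sample!r}: {looks_json_like(sample)}")
```

Under this sketch, a caller would run the guard before handing text to a repair library and return None when it fails, which matches the commit's stated preference for a detectable None over a fabricated dict.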