A step whose minibatches yield ONLY execution-lapse notes produces no body
patches (analysts return empty-edits carriers, dropped by
_normalise_patches), so skip_no_patches / skip_no_rewrite would `continue`
before the appendix flush and silently discard every note of the step.
This hit exactly the feature's target regime (mature skill body, failures
classified as lapses): in c1_searchqa_def_g55_sar, 10/40 steps skipped
this way and lost 95 notes total.
Extract the flush block into _flush_skill_aware_appendix() and call it on
the normal update path (unchanged behavior) AND on both skip branches
before `continue`, so notes persist and appendix_notes.json /
step_rec counters are recorded for skipped steps too.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Split failure reflections into SKILL_DEFECT (body edit) vs EXECUTION_LAPSE
(protected appendix note that re-emphasizes an existing rule, never edited
by step-level analysts). Toggle: optimizer.use_skill_aware_reflection
(default false; baseline byte-identical when off).
- optimizer/appendix.py: protected APPENDIX region (inject/extract/append
with dedup), mirrors the slow_update protected-field pattern
- optimizer/skill_aware.py: analyst prompt augmentation, appendix_notes
parsing, threshold-gated LLM consolidation, and a process-wide runtime
switch (configure_skill_aware_reflection) set once by the trainer
- gradient/reflect.py: augment error/success analyst prompts at runtime;
None-sentinel kwargs resolve from the global switch, so env adapters
need no per-benchmark wiring (works for all envs, present and future)
- optimizer/skill.py: generalize the protected-region check to
(slow_update, appendix); edits inside any protected region are skipped
- engine/trainer.py: inject appendix at init, flush per-step
EXECUTION_LAPSE notes after the gate settles, optional consolidation
- tests: regression suite incl. toggle-off byte-identical guarantee and
env-independent global-switch resolution (6/6 passing + live smoke)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- slow_update force-inject now writes current_skill ONLY (best_skill stays a
faithful val-best snapshot, never receives un-validated slow_update content)
- after training, run one val on the final skill; if its gate score beats the
incumbent best, promote final to best (updates best_skill/best_step/best_origin)
- trainer now evaluates final skill on test itself (reuses best test result when
final==best); records final_selection_* and final_test_* in summary.json
- spreadsheetbench: head+tail truncate the post-execution verification report at
source to fix multi-MB conversation bloat
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
A. SpreadsheetBench verification-feedback bloat
- rollout.py _auto_verify_output: use official _compare_cell_value (was
repr() equality, which falsely flagged 5 vs 5.0 / None vs ""); collapse
correct-and-empty cells into a count so large sparse answer ranges no
longer flood feedback with MBs of None=None noise.
- codegen_agent.py _build_eval_feedback: only list WRONG cells, collapse
correct ones into a count.
Scoring is unaffected (evaluate() is independent); this only fixes the
target model's multi-turn solving feedback.
B. Remove optimizer-side truncation (bloat source now fixed)
- reflect.py: drop _MAX_TRAJ_CHARS cap and all per-field clips.
- update_modes.py / clip.py / lr_autonomous.py: describe_item /
short_item_summary no longer truncate; raise ranking/lr token budget.
- trainer.py _format_step_buffer: full task_ids / target.
- slow_update.py: full comparison samples.
C. Soft-disable gate
- config.py / trainer.py: use_gate=false no longer raises; validation still
runs but candidates are force-accepted (new force_accept branch + log).
Misc: aggregate.py merge token budget 4096 -> 16384.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Per-platform install (Claude Code marketplace, Codex install.sh, Copilot MCP
server) plus optional wider-distribution steps (GitHub Release, official Claude
plugin marketplace PR, PyPI) and release-verification commands.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Actually exercised every plugin shell end to end on a brand-new "SQL must always
include LIMIT" analyst persona:
- Claude Code shell: harvest (2 real crafted transcripts -> 2 tasks), full run
(stages a proposal), adopt (honors the no-op-when-nothing-accepted contract).
- Codex: install.sh places ~/.codex/prompts/sleep.md + ~/.agents/skills correctly.
- Copilot: MCP server initialize -> tools/list -> tools/call returns engine output.
Genuine improvement on the fresh persona, both backends: held-out TEST 0.00 -> 1.00
(Sonnet->Haiku and Codex), the optimizer learning the user's LIMIT house rule and
generalizing to unseen queries. Honest finding: the first split left too few train
tasks (no-op night) — re-balancing fixed it; motivates a small-train-pool warning.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Remove every non-ASCII/CJK character for a professional open-source repo:
- harvest.py: drop hardcoded Chinese feedback phrases; add an env-based
extensibility hook (SKILLOPT_SLEEP_NEG_FEEDBACK / _POS_FEEDBACK) so any
locale can be added without baking one in. Verified with a German example.
- rollout.py / consolidate.py: English comments.
- README.md section heading + anchor, CONTROLLABLE_DREAMING.md, plugin.json,
marketplace.json (also fixed stale path skillopt-sleep-plugin ->
plugins/claude-code), SKILL.md: English only.
- Remove the internal WAKE_UP_SUMMARY.md note (not user-facing, not referenced).
Verified: zero CJK chars remain anywhere; 29 tests pass.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Restructure into plugins/{claude-code,codex,copilot}/ — one engine, three thin
shells, all calling the shared plugins/run-sleep.sh -> python -m skillopt_sleep.
- claude-code/: existing plugin moved here; runner delegates to the shared
launcher (fixes repo-root resolution after the move).
- codex/: ~/.codex/prompts/sleep.md custom prompt + ~/.agents/skills SKILL.md +
install.sh + AGENTS.md hint — Codex's documented, stable extension surfaces.
- copilot/: a stdlib-only MCP server (mcp_server.py) exposing sleep_* tools,
plus mcp-config.example.json and a copilot-instructions snippet. Verified end
to end (initialize -> tools/list -> tools/call returns real engine output).
- plugins/README.md overview table; main README News + a dedicated SkillOpt-Sleep
section; pyproject lists skillopt_sleep as a first-class package.
Decoupling emphasized throughout: open-source tool (skillopt_sleep/) with zero
dependency on the research package. 29 tests pass; all three shells resolve.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Open-source-tool / research-code separation:
- git mv skillopt/sleep/ -> skillopt_sleep/ (top-level, sibling to the research
skillopt/ package). History preserved as renames.
- All imports skillopt.sleep.* -> skillopt_sleep.*.
- Vendor the validation gate into skillopt_sleep/gate.py (a self-contained copy
of skillopt.evaluation.gate). The engine now has ZERO dependency on the
research package — verified: grep finds no `from skillopt.` in skillopt_sleep/,
and consolidate's gate resolves to skillopt_sleep.gate.
- Plugin scripts/commands/skill call `-m skillopt_sleep`.
29 tests pass; `python -m skillopt_sleep` runs standalone.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Three live runs exercise the new code paths on both runtimes:
A) Claude Sonnet->Haiku, gate=OFF + rollouts_k=2: brief-writer test 0->1.00,
action 'greedy_improved', val & test both reported (3-way split works).
B) Codex, gate=ON + rollouts_k=2: brief-writer test 0->1.00 in 2 nights.
C) Claude Sonnet->Haiku, thorough-analyst, 3 nights: slow-update fires and
distils a durable cross-night meta-rule (general, not task-specific).
Confirms gate-off greedy path, 3-way val/test split, multi-rollout, and the
gate-independent slow-update all work with real models on Claude AND Codex.
Raw logs under docs/sleep/raw/crosscheck_*.txt.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
- ReplayResult records per-rollout tokens + latency_ms; replay_one measures them
(approximated from text length when the backend doesn't track tokens, e.g. mock).
- replay.multi_objective_reward(w_acc, w_tokens, w_latency): weighted reward so a
skill can be optimized to be cheaper/faster, not only more accurate (cost terms
normalized vs a reference, default = accuracy-only / backward compatible).
- Backend.preferences (free text) injected into reflect as a prior; build_backend
attaches it (to the optimizer for dual backends). run_gbrain gains --preferences.
3 new tests (multi-objective ordering, preference injection, cost recording).
29 tests pass; mock gates + 3.8/3.12 compile green.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
The "脑补推演" core the user described — re-run the same task many times and
learn from the contrast between good and bad rollouts:
- rollout.py: multi_rollout(task, k) runs K scored attempts; RolloutSet exposes
best/worst/spread/pass_rate. contrastive_reflect picks the highest-spread
tasks (some attempts passed, some failed — most informative) and asks the
optimizer what the GOOD attempts did that the BAD ones didn't, distilling a
general rule. Far stronger signal than a single failure.
- consolidate(rollouts_k>1) uses contrastive reflection (falls back to
single-shot reflect if it yields nothing).
- budget.py: Budget(max_tokens|max_minutes) tracks spend; plan_depth() derives
(nights, rollouts_k) from a token budget. run_gbrain gains --rollouts-k,
--budget-tokens, --budget-minutes (auto-plans depth).
3 new tests (rollout stats, budget+plan, contrastive stub). 26 tests pass.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Bring SkillOpt's epoch-wise slow/meta update (paper §3.6) into the sleep engine
as skillopt/sleep/slow_update.py — import-light, driven through the Backend
abstraction (mock/claude/codex):
- Reuses the main repo's protected-field markers
<!-- SLOW_UPDATE_START --> ... <!-- SLOW_UPDATE_END --> so the artifact is
compatible; step-level edits never touch this field.
- run_slow_update compares behavior under the first-night vs final skill across
the val tasks, groups into improved/regressed/persistent/stable, and asks the
optimizer to distill durable longitudinal guidance (refining prior text).
- Wired into run_gbrain.run_seed AFTER the nights loop, gated by slow_update=True
and run REGARDLESS of gate_mode — this is what preserves long-term memory even
when the user turns the hard gate OFF (the user's slot_date=slow-update intent).
2 new tests (protected-field round-trip, stub-backend synthesis). 23 tests pass.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Data-split refactor (the anti-overfitting foundation the user asked for):
- TaskRecord gains split∈{train,val,test} and origin∈{real,dream}.
- assign_splits: real tasks deterministically split into val/test (disjoint);
DREAM-augmented tasks (origin='dream') NEVER enter val/test — they only go to
train. val gates updates; test is the final held-out measure.
- gbrain loader maps its held-out.jsonl -> test, benchmark.jsonl -> train/val,
so the gbrain held-out stays the true final score.
- consolidate(): train drives reflect, val gates; adds gate_mode='off' (greedy,
no hard filter) reporting val movement (greedy_improved/regressed/flat).
- run_gbrain/transfer/experiment score on test (val fallback); run_gbrain gains
--gate on|off. Legacy replay/holdout names normalized.
New test proves dream tasks never land in val/test. 21 tests pass; mock
experiment + gate=off both green.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
benchmark_report.md now 7/7 direct + 4/4 transfer, all 0->1.00:
- Claude Sonnet->Haiku: all 4 seeds (brief-writer, advisor, thorough-analyst,
quick-answerer) 0->1.00
- Codex self-optimized: brief-writer, advisor, quick-answerer 0->1.00
- quick-answerer uses the real ./search tool loop on both runtimes.
This matches gbrain's own "4/4 skills 0->1.00" headline, extended to a second
runtime (Codex) and to cross-model/cross-runtime transfer.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
quick-answerer (judge: tool_called=search) reaches 0.00 -> 1.00 with Sonnet
optimizer -> Haiku target: the optimizer wrote an OVERRIDE of the "never use
tools" instruction and the Haiku target genuinely invoked the ./search shim.
All 4 gbrain skillopt-v1 seeds now at 0->1.00, matching gbrain's own headline.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
The 4th gbrain seed (quick-answerer) is judged by tool_called=search: the agent
must ACTUALLY call a search tool. Add an honest tool loop:
- Backend.attempt_with_tools(task, skill, memory, tools) -> (response, tools_called)
- Claude: exposes a real ./search shell shim, runs with --allowedTools Bash in a
clean cwd; detects the call from the shim's log (not a self-reported marker).
- Codex: same shim under `exec --sandbox workspace-write`.
- Mock: deterministic — "calls" a tool iff skill/memory instructs it (for CI).
- replay_one routes tasks with a tool_called check through the tool loop and
feeds detected calls to the rule judge; ReplayResult gains tools_called.
Verified live (Claude haiku): deficient skill -> tools_called=[] hard=0;
learned "must run ./search" rule -> tools_called=['search'] hard=1.0.
20 tests pass.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Machine-generated benchmark_report.md from a 9-config sweep:
- Direct (Sonnet->Haiku): brief-writer/advisor/thorough-analyst 0->1.00
- Direct (Codex): brief-writer/advisor 0->1.00
- Transfer (4/4 positive, incl. cross-runtime Codex<->Claude): all 0->1.00
Cross-model transfer confirms the price-difference value prop: a skill
optimized on a cheap model deploys for free on an expensive one, and skills
move between Codex and Claude. sweep.jsonl is the committed source data.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Strong-optimizer/weak-target (Sonnet -> Haiku), fully isolated:
brief-writer, advisor, thorough-analyst all 0.00 -> 1.00 on held-out.
thorough-analyst shows 2-night convergence (0.33 -> 1.00). Codex self-optimized
brief-writer also 0 -> 1.00.
Key finding answering the optimizer/target-split request: the OPTIMIZER MODEL is
decisive — weak Haiku-as-optimizer is flaky (0 or 1.0 across runs), strong
Sonnet-as-optimizer reliably hits 1.0 on every seed. Raw logs under docs/sleep/raw/.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
The default sweep direct plan now uses a DualBackend (Sonnet optimizer proposes
edits, Haiku target runs tasks) — the SkillOpt-faithful and more reliable setup,
since a weak self-optimizing model (Haiku-as-optimizer) produced flaky JSON.
report.py renders the optimizer->target pairing in the direct table.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
- reflect() now retries once with a firmer "JSON only" instruction when the
first reply doesn't parse to a non-empty array. A transient non-JSON reply
otherwise wastes a whole night (gate sees no edits -> reject), which made
weak optimizers (Haiku) flaky across runs.
- FINAL_REPORT.md: document the context-leak discovery honestly; Codex cells
stand (clean), Claude cells recomputed under strict isolation.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
The clean-cwd + --disallowedTools isolation was NOT enough: the user's GLOBAL
skills (~/.claude/skills) are injected regardless of cwd, so reflect/attempt
still sometimes replied with a list of installed skills instead of JSON edits
(advisor reflect returned 21KB of skill descriptions, n_edits=0 -> gate reject).
Add --bare (skip hooks/LSP/plugins) and --disable-slash-commands (disable all
skills). Verified: the optimizer now returns clean JSON. Re-validating all
seeds with the truly-isolated backend; prior Claude numbers are being recomputed
honestly (some earlier "successes" were partly leak-assisted).
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Critical correctness fix found by debugging the thorough-analyst failure:
* `claude -p` was running with the AMBIENT Claude Code project context (the
repo's CLAUDE.md, installed skills, tools). The optimizer/target calls were
polluted — reflect once replied with a list of the user's installed skills
instead of JSON edits. Now ClaudeCliBackend._call runs ISOLATED: a clean temp
cwd, --disallowedTools '*', --exclude-dynamic-system-prompt-sections. This is
essential for the backend to be trustworthy and reproducible.
* reflect prompt: translate failing rule-judge criteria into plain English
(max_chars=1200 -> "the ENTIRE response must be at most 1200 characters") and
require CONCRETE, verbatim thresholds in proposed rules (not "respect limits").
* attempt prompt: treat the Learned-preferences block as HARD CONSTRAINTS that
override earlier conflicting skill text.
Earlier Claude results predate this fix and are being re-validated clean; the
Codex backend was never affected (it runs in its own exec context).
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
- skillopt-sleep-plugin/.claude-plugin/marketplace.json so the plugin is
installable via `/plugin marketplace add ./skillopt-sleep-plugin`.
- README install section (clone -> add marketplace -> install -> /sleep status).
- docs/sleep/FINAL_REPORT.md: the consolidated presented results doc (real
Claude+Codex, transfer, and the honest thorough-analyst failure + fix).
- sweep.py flushes stdout for live monitoring.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
- sweep.py: run many (backend, model, seed, transfer-pair) configs sequentially,
append each result to JSONL incrementally (resumable, interrupt-safe).
- report.py: render the sweep JSONL into a presented Markdown scorecard with
direct-improvement and cross-model-transfer tables.
- reflect prompt now tells the optimizer its edits are APPENDED (can't delete the
base skill text), so on a conflict it must write a forceful OVERRIDE rule.
Diagnosed from a real failure: thorough-analyst (needs <=1200 chars) kept its
edits rejected because the base "be exhaustive" line won; a verified override
("HARD LIMIT ... supersedes") makes Haiku obey (1194/880 chars -> hard=1.0).
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Three additions driven by the goal of price-aware, model-flexible sleep:
1. DualBackend + build_backend(): route attempt->TARGET model and
reflect/judge->OPTIMIZER model (SkillOpt's target-vs-optimizer split).
gbrain runner gains --optimizer-backend/-model + --target-backend/-model.
2. run_transfer.py: sleep-scenario cross-model transfer. Optimize a skill on a
SOURCE model (e.g. cheap haiku), freeze it, evaluate held-out on a TARGET
model (e.g. expensive sonnet) with no further optimization — plus a direct
reference. Mirrors the SkillOpt paper's transfer table; quantifies the
"optimize cheap overnight, deploy anywhere" value prop.
3. llm_miner.py: turn real harvested transcripts into TaskRecords WITH checkable
rule/rubric judges, wired into the cycle for non-mock backends, so real-data
lift becomes measurable (heuristic miner remains the no-API fallback).
Fixed a str.format brace bug the new unit test caught.
19 tests pass.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Codex with the directive reflect prompt + 2 nights converges 0.00 -> 1.00
(up from 0.67 single-night); its night-2 edit diagnoses its own residual
failure ("preserve required sections even when keeping the brief short").
Claude (Haiku) reaches 1.00 in one night. Update plugin README + skill to
reference --backend claude|codex (was anthropic) and surface the benchmark.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Upgrade from mock-only to REAL multi-backend validation:
Backends (skillopt/sleep/backend.py):
- CliBackend base: shared attempt/judge/reflect prompts, response cache,
token accounting. Subclasses implement only _call().
- ClaudeCliBackend: drives `claude -p --output-format text`.
- CodexCliBackend: drives the REAL @openai/codex `exec -o <file>` for clean
output; resolve_codex_path() skips the hermes wrapper at ~/.local/bin/codex.
- reflect() now aggregates the exact failing judge criteria into the prompt
(gbrain's lesson: tell the optimizer what the scorer rewards).
Rule judges (skillopt/sleep/judges.py): gbrain-compatible local scorers
(section_present / regex / max_chars / contains / tool_called) — held-out
scoring with no judge-API spend. TaskRecord gains a `judge` field +
reference_kind="rule".
gbrain-evals adapter (experiments/gbrain_bench.py, run_gbrain.py): load
garrytan/gbrain-evals skillopt-v1 deficient skills + train/held-out task
sets and run our consolidate() loop against the SAME suite gbrain scores.
REAL results (docs/sleep/real_api_results.md), brief-writer seed, 1 night:
- Claude (Haiku): held-out 0.00 -> 1.00
- Codex: held-out 0.00 -> 0.67
Both proposed a correct, general format rule into the protected LEARNED block.
CLI: --backend {mock,claude,codex}, --codex-path, --model; experiment +
gbrain runners gain --limit-* cost controls. 17 tests pass.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Design for a nightly offline self-evolution plugin that synthesizes
SkillOpt (validation-gated bounded text optimizer), Claude Dreams
(offline memory consolidation), and the Agent-Sleep paper (short-term
to long-term experience). Harvests local ~/.claude transcripts, mines
recurring tasks, replays them offline, and consolidates memory+skills
behind a held-out gate.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Two bugs made local vLLM targets score acc=0.000: the router did not
forward 'timeout' to the Qwen backend (so runs used the 300s default),
and qwen_backend always injected chat_template_kwargs.enable_thinking,
which non-Qwen vLLM servers reject or answer with <think> output and no
<answer> tag. Forward timeout and only set the field when enabled.
Closes#28
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The shipped env_template.py and loader_template.py described the same
fictional async execute / evaluate / build_prompt API documented in
docs/reference/api.md. As a result TemplateBenchmarkEnv(cfg) raised
'TypeError: Can't instantiate abstract class' for every copy-and-paste
user who followed the in-tree scaffold.
Rewrite the template so it's a working starting point:
- env_template.py: TemplateBenchmarkEnv(EnvAdapter) now implements all
five real abstract methods (build_train_env, build_eval_env, rollout,
reflect, get_task_types) with no-op defaults documented as TODO.
Instantiable today; pytest 60/60 still passes.
- loader_template.py: TemplateBenchmarkLoader(SplitDataLoader)
implements load_split_items for .json / .jsonl input and explains the
optional load_raw_items override for split_mode="ratio".
- README.md: usage steps now point at scripts/train.py's _ENV_REGISTRY
(the real registry) instead of a non-existent BENCHMARK_REGISTRY in
skillopt/envs/__init__.py, and link to the rewritten new-benchmark
guide.
- config_template.yaml: _base_ is a string path (not a list, which the
loader rejects); skill_init is commented out with a note so the
template config doesn't reference a file the user hasn't created.
Verified locally: 'from skillopt.envs._template.env_template import
TemplateBenchmarkEnv; TemplateBenchmarkEnv()' succeeds. Refs
microsoft/SkillOpt#30.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
docs/reference/api.md previously documented a fictional EnvAdapter API
(execute / evaluate / build_prompt + DataItem / TaskResult) and a
BENCHMARK_REGISTRY that never existed in code. Anyone following the
documented contract would hit ImportError or TypeError on the first
instantiation.
Replace both pages with the real shape from skillopt/envs/base.py and
skillopt/datasets/base.py:
- EnvAdapter: build_train_env, build_eval_env, rollout, reflect,
get_task_types (the 5 actual abstract methods).
- Rollout dicts: id / hard / soft required; everything else preserved
into RolloutResult.extras.
- Reflect dicts: {patch, source_type} schema as consumed by
run_minibatch_reflect.
- BatchSpec: slotted-but-mutable dataclass matching the actual
definition (payload defaults to None, metadata to dict()).
- SplitDataLoader.load_split_items as the one mandatory loader method.
- Registry: _ENV_REGISTRY in scripts/train.py (lazy try/except
ImportError block), not a non-existent BENCHMARK_REGISTRY in
skillopt/envs/__init__.py.
- _base_: documented as a string path, since the current YAML loader
only accepts strings.
The new-benchmark.md guide now walks through a docfaithful worked
example with a real rollout helper (chat_target + scorer) instead of
hand-waving over the rollout step. Refs microsoft/SkillOpt#30.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>