Commit Graph

45 Commits

Author SHA1 Message Date
Cuzyoung
44043d4ae5 docs(trainer): drop the stale skill-aware comments (claimed best_skill carries no appendix; it does)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 13:10:08 +00:00
Cuzyoung
7dcd612361 fix(trainer): flush appendix notes on skip branches — lapse-only steps no longer drop them
A step whose minibatches yield ONLY execution-lapse notes produces no body
patches (analysts return empty-edits carriers, dropped by
_normalise_patches), so skip_no_patches / skip_no_rewrite would `continue`
before the appendix flush and silently discard every note of the step.
This hit exactly the feature's target regime (mature skill body, failures
classified as lapses): in c1_searchqa_def_g55_sar, 10/40 steps skipped
this way and lost 95 notes total.

Extract the flush block into _flush_skill_aware_appendix() and call it on
the normal update path (unchanged behavior) AND on both skip branches
before `continue`, so notes persist and appendix_notes.json /
step_rec counters are recorded for skipped steps too.
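
The control-flow change described above can be sketched as follows. This is a minimal illustration, not the trainer's code: `flush_notes`, `run_steps`, and the step dictionaries are hypothetical stand-ins for `_flush_skill_aware_appendix()` and the real step loop.

```python
from typing import Dict, List


def flush_notes(appendix: List[str], notes: List[str]) -> int:
    """Append notes that are not already present; return how many were added."""
    added = 0
    for note in notes:
        if note not in appendix:
            appendix.append(note)
            added += 1
    return added


def run_steps(steps: List[Dict], appendix: List[str]) -> List[str]:
    """Hypothetical step loop: notes are flushed on every exit path."""
    outcomes = []
    for step in steps:
        patches = step.get("patches", [])
        notes = step.get("lapse_notes", [])
        if not patches:
            # Skip branch: flush BEFORE `continue`, otherwise a lapse-only
            # step silently loses all of its notes.
            flush_notes(appendix, notes)
            outcomes.append("skip_no_patches")
            continue
        # Normal update path (body patches would be applied here).
        flush_notes(appendix, notes)
        outcomes.append("updated")
    return outcomes
```

With the flush moved into a helper, each skip branch needs only one extra call before `continue`.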

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 13:10:08 +00:00
Cuzyoung
0dc84162dc feat(optimizer): skill-aware reflection (EmbodiSkill S_app), config-controlled and env-independent
Split failure reflections into SKILL_DEFECT (body edit) vs EXECUTION_LAPSE
(protected appendix note that re-emphasizes an existing rule, never edited
by step-level analysts). Toggle: optimizer.use_skill_aware_reflection
(default false; baseline byte-identical when off).

- optimizer/appendix.py: protected APPENDIX region (inject/extract/append
  with dedup), mirrors the slow_update protected-field pattern
- optimizer/skill_aware.py: analyst prompt augmentation, appendix_notes
  parsing, threshold-gated LLM consolidation, and a process-wide runtime
  switch (configure_skill_aware_reflection) set once by the trainer
- gradient/reflect.py: augment error/success analyst prompts at runtime;
  None-sentinel kwargs resolve from the global switch, so env adapters
  need no per-benchmark wiring (works for all envs, present and future)
- optimizer/skill.py: generalize the protected-region check to
  (slow_update, appendix); edits inside any protected region are skipped
- engine/trainer.py: inject appendix at init, flush per-step
  EXECUTION_LAPSE notes after the gate settles, optional consolidation
- tests: regression suite incl. toggle-off byte-identical guarantee and
  env-independent global-switch resolution (6/6 passing + live smoke)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 13:10:08 +00:00
Cuzyoung
ffe581098b feat(trainer): final-skill val + best promotion; keep best unpolluted by slow_update
- slow_update force-inject now writes current_skill ONLY (best_skill stays a
  faithful val-best snapshot, never receives un-validated slow_update content)
- after training, run one val on the final skill; if its gate score beats the
  incumbent best, promote final to best (updates best_skill/best_step/best_origin)
- trainer now evaluates final skill on test itself (reuses best test result when
  final==best); records final_selection_* and final_test_* in summary.json
- spreadsheetbench: head+tail truncate the post-execution verification report at
  source to fix multi-MB conversation bloat
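
Head+tail truncation, as mentioned in the last bullet, might look like the sketch below. The function name, the default budget, and the marker text are assumptions made for illustration; the actual spreadsheetbench code may differ.

```python
def head_tail_truncate(text: str, max_chars: int = 2000) -> str:
    """Keep the first and last parts of `text`, dropping the middle.

    Hypothetical helper: the head keeps the report's preamble, the tail
    keeps the most recent lines, and a marker records how much was cut.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if len(text) <= max_chars:
        return text
    head = max_chars // 2
    tail = max_chars - head
    omitted = len(text) - max_chars
    return text[:head] + f"\n...[{omitted} chars omitted]...\n" + text[-tail:]
```

Truncating at the source keeps every downstream consumer of the report small without each one needing its own clipping logic.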

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-06-10 13:03:17 +00:00
Cuzyoung
372fd56c1e fix(spreadsheetbench)+optimizer: fix verify-feedback bloat, drop optimizer-side truncation, soft-disable gate
A. SpreadsheetBench verification-feedback bloat
   - rollout.py _auto_verify_output: use official _compare_cell_value (was
     repr() equality, which falsely flagged 5 vs 5.0 / None vs ""); collapse
     correct-and-empty cells into a count so large sparse answer ranges no
     longer flood feedback with MBs of None=None noise.
   - codegen_agent.py _build_eval_feedback: only list WRONG cells, collapse
     correct ones into a count.
   Scoring is unaffected (evaluate() is independent); this only fixes the
   target model's multi-turn solving feedback.

B. Remove optimizer-side truncation (bloat source now fixed)
   - reflect.py: drop _MAX_TRAJ_CHARS cap and all per-field clips.
   - update_modes.py / clip.py / lr_autonomous.py: describe_item /
     short_item_summary no longer truncate; raise ranking/lr token budget.
   - trainer.py _format_step_buffer: full task_ids / target.
   - slow_update.py: full comparison samples.

C. Soft-disable gate
   - config.py / trainer.py: use_gate=false no longer raises; validation still
     runs but candidates are force-accepted (new force_accept branch + log).

Misc: aggregate.py merge token budget 4096 -> 16384.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-06-10 13:03:17 +00:00
Yifan Yang
b02ffc2c99 refactor(sleep): decouple engine to top-level skillopt_sleep/ (zero research dep)
Open-source-tool / research-code separation:
  - git mv skillopt/sleep/ -> skillopt_sleep/ (top-level, sibling to the research
    skillopt/ package). History preserved as renames.
  - All imports skillopt.sleep.* -> skillopt_sleep.*.
  - Vendor the validation gate into skillopt_sleep/gate.py (a self-contained copy
    of skillopt.evaluation.gate). The engine now has ZERO dependency on the
    research package — verified: grep finds no `from skillopt.` in skillopt_sleep/,
    and consolidate's gate resolves to skillopt_sleep.gate.
  - Plugin scripts/commands/skill call `-m skillopt_sleep`.

29 tests pass; `python -m skillopt_sleep` runs standalone.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-08 14:31:52 +00:00
Yifan Yang
a29201adc4 feat(sleep): multi-objective reward (accuracy/tokens/latency) + user preferences
- ReplayResult records per-rollout tokens + latency_ms; replay_one measures them
  (approximated from text length when the backend doesn't track tokens, e.g. mock).
- replay.multi_objective_reward(w_acc, w_tokens, w_latency): weighted reward so a
  skill can be optimized to be cheaper/faster, not only more accurate (cost terms
  normalized vs a reference, default = accuracy-only / backward compatible).
- Backend.preferences (free text) injected into reflect as a prior; build_backend
  attaches it (to the optimizer for dual backends). run_gbrain gains --preferences.
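
A weighted reward of this kind could be sketched as below. The signature is an assumption: the commit only names the three weights, so the per-rollout arguments, the reference values, and the linear penalty form are illustrative choices.

```python
def multi_objective_reward(
    accuracy: float,
    tokens: float,
    latency_ms: float,
    *,
    w_acc: float = 1.0,
    w_tokens: float = 0.0,
    w_latency: float = 0.0,
    ref_tokens: float = 1000.0,      # assumed normalisation reference
    ref_latency_ms: float = 1000.0,  # assumed normalisation reference
) -> float:
    """Accuracy minus weighted, reference-normalised cost terms."""
    token_cost = tokens / ref_tokens
    latency_cost = latency_ms / ref_latency_ms
    return w_acc * accuracy - w_tokens * token_cost - w_latency * latency_cost
```

With the default weights the reward equals the accuracy, which matches the backward-compatible default described above; raising `w_tokens` or `w_latency` ranks cheaper or faster rollouts higher at equal accuracy.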

3 new tests (multi-objective ordering, preference injection, cost recording).
29 tests pass; mock gates + 3.8/3.12 compile green.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-08 14:31:51 +00:00
Yifan Yang
77ac33e8bf feat(sleep): multi-rollout contrastive reflection + token/time budget
The "mental rehearsal" core the user described, i.e. re-running the same task many times and
learning from the contrast between good and bad rollouts:

  - rollout.py: multi_rollout(task, k) runs K scored attempts; RolloutSet exposes
    best/worst/spread/pass_rate. contrastive_reflect picks the highest-spread
    tasks (some attempts passed, some failed — most informative) and asks the
    optimizer what the GOOD attempts did that the BAD ones didn't, distilling a
    general rule. Far stronger signal than a single failure.
  - consolidate(rollouts_k>1) uses contrastive reflection (falls back to
    single-shot reflect if it yields nothing).
  - budget.py: Budget(max_tokens|max_minutes) tracks spend; plan_depth() derives
    (nights, rollouts_k) from a token budget. run_gbrain gains --rollouts-k,
    --budget-tokens, --budget-minutes (auto-plans depth).
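
The rollout statistics and the highest-spread selection might be sketched like this. `RolloutSet` and its four properties are named in the commit; the field layout, the pass threshold of 1.0, and `pick_contrastive` are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RolloutSet:
    """Scores from K attempts at one task (scores must be non-empty)."""
    task_id: str
    scores: List[float]

    @property
    def best(self) -> float:
        return max(self.scores)

    @property
    def worst(self) -> float:
        return min(self.scores)

    @property
    def spread(self) -> float:
        return self.best - self.worst

    @property
    def pass_rate(self) -> float:
        # Assumption: a score of 1.0 or more counts as a pass.
        return sum(s >= 1.0 for s in self.scores) / len(self.scores)


def pick_contrastive(sets: List[RolloutSet], top_n: int = 1) -> List[RolloutSet]:
    """Return the tasks whose attempts disagree most (largest spread first)."""
    mixed = [s for s in sets if s.spread > 0]
    return sorted(mixed, key=lambda s: s.spread, reverse=True)[:top_n]
```

Tasks where every attempt passed or every attempt failed have zero spread and offer no good-versus-bad contrast, so they are filtered out.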

3 new tests (rollout stats, budget+plan, contrastive stub). 26 tests pass.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-08 14:31:51 +00:00
Yifan Yang
c179a24c45 feat(sleep): slow-update long-term memory field (runs even with gate off)
Bring SkillOpt's epoch-wise slow/meta update (paper §3.6) into the sleep engine
as skillopt/sleep/slow_update.py — import-light, driven through the Backend
abstraction (mock/claude/codex):

  - Reuses the main repo's protected-field markers
    <!-- SLOW_UPDATE_START --> ... <!-- SLOW_UPDATE_END --> so the artifact is
    compatible; step-level edits never touch this field.
  - run_slow_update compares behavior under the first-night vs final skill across
    the val tasks, groups into improved/regressed/persistent/stable, and asks the
    optimizer to distill durable longitudinal guidance (refining prior text).
  - Wired into run_gbrain.run_seed AFTER the nights loop, gated by slow_update=True
    and run REGARDLESS of gate_mode — this is what preserves long-term memory even
    when the user turns the hard gate OFF (the user's slot_date=slow-update intent).
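
A protected-field round trip with the markers quoted above could look like the following sketch. The two marker strings come from the commit; the helper names and the choice to append the block at the end of the skill are assumptions.

```python
import re

START = "<!-- SLOW_UPDATE_START -->"
END = "<!-- SLOW_UPDATE_END -->"
_REGION = re.compile(re.escape(START) + r"(.*?)" + re.escape(END), re.DOTALL)


def extract_protected(skill: str) -> str:
    """Return the text between the markers, or '' if the field is absent."""
    match = _REGION.search(skill)
    return match.group(1).strip() if match else ""


def inject_protected(skill: str, guidance: str) -> str:
    """Replace the protected field, or append one if none exists yet."""
    block = f"{START}\n{guidance.strip()}\n{END}"
    if _REGION.search(skill):
        # A callable replacement avoids backslash-escape surprises in `guidance`.
        return _REGION.sub(lambda _m: block, skill, count=1)
    return skill.rstrip("\n") + "\n\n" + block + "\n"
```

Because step-level edits never touch the region between the markers, the field can be rewritten wholesale at each slow update without disturbing the rest of the skill.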

2 new tests (protected-field round-trip, stub-backend synthesis). 23 tests pass.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-08 14:31:51 +00:00
Yifan Yang
6f1351edb9 feat(sleep): 3-way train/val/test split + gate_mode on|off
Data-split refactor (the anti-overfitting foundation the user asked for):
  - TaskRecord gains split∈{train,val,test} and origin∈{real,dream}.
  - assign_splits: real tasks deterministically split into val/test (disjoint);
    DREAM-augmented tasks (origin='dream') NEVER enter val/test — they only go to
    train. val gates updates; test is the final held-out measure.
  - gbrain loader maps its held-out.jsonl -> test, benchmark.jsonl -> train/val,
    so the gbrain held-out stays the true final score.
  - consolidate(): train drives reflect, val gates; adds gate_mode='off' (greedy,
    no hard filter) reporting val movement (greedy_improved/regressed/flat).
  - run_gbrain/transfer/experiment score on test (val fallback); run_gbrain gains
    --gate on|off. Legacy replay/holdout names normalized.
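
One way to get a deterministic split that keeps dream tasks out of val/test is sketched below. The hashing scheme, the fractions, and the reduced `TaskRecord` are assumptions; only the `split` and `origin` fields and the dream-to-train rule come from the commit.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass
class TaskRecord:
    task_id: str
    origin: str = "real"   # "real" or "dream"
    split: str = "train"   # "train", "val", or "test"


def assign_splits(
    tasks: List[TaskRecord], val_frac: float = 0.25, test_frac: float = 0.25
) -> List[TaskRecord]:
    """Hash each real task id into a split; dream tasks always go to train."""
    for task in tasks:
        if task.origin == "dream":
            task.split = "train"
            continue
        digest = hashlib.sha256(task.task_id.encode("utf-8")).digest()
        u = int.from_bytes(digest[:8], "big") / 2**64  # stable value in [0, 1)
        if u < val_frac:
            task.split = "val"
        elif u < val_frac + test_frac:
            task.split = "test"
        else:
            task.split = "train"
    return tasks
```

Hashing the task id rather than shuffling means a task keeps its split across runs and as new tasks are harvested.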

New test proves dream tasks never land in val/test. 21 tests pass; mock
experiment + gate=off both green.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-08 14:31:51 +00:00
Yifan Yang
1d20e9db14 chore(sleep): include quick-answerer (tool loop) in the sweep direct plan
All 4 gbrain skillopt-v1 seeds are now in the sweep, matching gbrain's full
scorecard coverage.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-08 14:31:51 +00:00
Yifan Yang
937bc1ec4d feat(sleep): real tool-loop replay for gbrain quick-answerer (tool_called judge)
The 4th gbrain seed (quick-answerer) is judged by tool_called=search: the agent
must ACTUALLY call a search tool. Add an honest tool loop:

  - Backend.attempt_with_tools(task, skill, memory, tools) -> (response, tools_called)
  - Claude: exposes a real ./search shell shim, runs with --allowedTools Bash in a
    clean cwd; detects the call from the shim's log (not a self-reported marker).
  - Codex: same shim under `exec --sandbox workspace-write`.
  - Mock: deterministic — "calls" a tool iff skill/memory instructs it (for CI).
  - replay_one routes tasks with a tool_called check through the tool loop and
    feeds detected calls to the rule judge; ReplayResult gains tools_called.

Verified live (Claude haiku): deficient skill -> tools_called=[] hard=0;
learned "must run ./search" rule -> tools_called=['search'] hard=1.0.
20 tests pass.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-08 14:31:51 +00:00
Yifan Yang
023950a291 feat(sleep): sweep 'direct' plan uses strong-optimizer/weak-target dual config
The default sweep direct plan now uses a DualBackend (Sonnet optimizer proposes
edits, Haiku target runs tasks) — the SkillOpt-faithful and more reliable setup,
since a weak self-optimizing model (Haiku-as-optimizer) produced flaky JSON.
report.py renders the optimizer->target pairing in the direct table.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-08 14:31:51 +00:00
Yifan Yang
d75863eb6f fix(sleep): retry reflect on non-JSON reply; honest report narrative
- reflect() now retries once with a firmer "JSON only" instruction when the
  first reply doesn't parse to a non-empty array. A transient non-JSON reply
  otherwise wastes a whole night (gate sees no edits -> reject), which made
  weak optimizers (Haiku) flaky across runs.
- FINAL_REPORT.md: document the context-leak discovery honestly; Codex cells
  stand (clean), Claude cells recomputed under strict isolation.
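
The retry-once behaviour can be sketched as below. `call` stands in for whatever sends a prompt to the optimizer backend, and the wording of the firmer instruction is an assumption.

```python
import json
from typing import Callable, List


def parse_edits(reply: str) -> List[dict]:
    """Return the reply as a list of edits, or [] if it is not a JSON array."""
    try:
        data = json.loads(reply)
    except (json.JSONDecodeError, TypeError):
        return []
    return data if isinstance(data, list) else []


def reflect_with_retry(call: Callable[[str], str], prompt: str) -> List[dict]:
    """Ask once; if no non-empty array comes back, ask once more, firmly."""
    edits = parse_edits(call(prompt))
    if edits:
        return edits
    firmer = prompt + "\n\nReply with a JSON array only. No prose, no code fences."
    return parse_edits(call(firmer))
```

A single retry is cheap compared with losing a whole night to an empty edit list.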

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-08 14:31:51 +00:00
Yifan Yang
c80914b036 fix(sleep): disable global skills in claude calls (--bare --disable-slash-commands)
The clean-cwd + --disallowedTools isolation was NOT enough: the user's GLOBAL
skills (~/.claude/skills) are injected regardless of cwd, so reflect/attempt
still sometimes replied with a list of installed skills instead of JSON edits
(advisor reflect returned 21KB of skill descriptions, n_edits=0 -> gate reject).

Add --bare (skip hooks/LSP/plugins) and --disable-slash-commands (disable all
skills). Verified: the optimizer now returns clean JSON. Re-validating all
seeds with the truly-isolated backend; prior Claude numbers are being recomputed
honestly (some earlier "successes" were partly leak-assisted).

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-08 14:31:51 +00:00
Yifan Yang
defb4566ea fix(sleep): isolate claude CLI calls; concrete+override-aware reflect; honor hard constraints
Critical correctness fix found by debugging the thorough-analyst failure:

* `claude -p` was running with the AMBIENT Claude Code project context (the
  repo's CLAUDE.md, installed skills, tools). The optimizer/target calls were
  polluted — reflect once replied with a list of the user's installed skills
  instead of JSON edits. Now ClaudeCliBackend._call runs ISOLATED: a clean temp
  cwd, --disallowedTools '*', --exclude-dynamic-system-prompt-sections. This is
  essential for the backend to be trustworthy and reproducible.

* reflect prompt: translate failing rule-judge criteria into plain English
  (max_chars=1200 -> "the ENTIRE response must be at most 1200 characters") and
  require CONCRETE, verbatim thresholds in proposed rules (not "respect limits").

* attempt prompt: treat the Learned-preferences block as HARD CONSTRAINTS that
  override earlier conflicting skill text.

Earlier Claude results predate this fix and are being re-validated clean; the
Codex backend was never affected (it runs in its own exec context).

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-08 14:31:51 +00:00
Yifan Yang
233b619555 feat(sleep): marketplace manifest, install docs, final report shell, sweep flush
- skillopt-sleep-plugin/.claude-plugin/marketplace.json so the plugin is
  installable via `/plugin marketplace add ./skillopt-sleep-plugin`.
- README install section (clone -> add marketplace -> install -> /sleep status).
- docs/sleep/FINAL_REPORT.md: the consolidated presented results doc (real
  Claude+Codex, transfer, and the honest thorough-analyst failure + fix).
- sweep.py flushes stdout for live monitoring.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-08 14:31:51 +00:00
Yifan Yang
a0419bfdbb feat(sleep): benchmark sweep + report tooling; override-aware reflect prompt
- sweep.py: run many (backend, model, seed, transfer-pair) configs sequentially,
  append each result to JSONL incrementally (resumable, interrupt-safe).
- report.py: render the sweep JSONL into a presented Markdown scorecard with
  direct-improvement and cross-model-transfer tables.
- reflect prompt now tells the optimizer its edits are APPENDED (can't delete the
  base skill text), so on a conflict it must write a forceful OVERRIDE rule.
  Diagnosed from a real failure: thorough-analyst (needs <=1200 chars) kept having
  its edits rejected because the base "be exhaustive" line won; a verified override
  ("HARD LIMIT ... supersedes") makes Haiku obey (1194/880 chars -> hard=1.0).
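
The incremental, resumable JSONL pattern used by the sweep might look like this sketch. The `config_id` key, the record shape, and both function names are hypothetical; the real sweep.py may key its results differently.

```python
import json
import os
from typing import Callable, Dict, List, Set


def load_done(path: str) -> Set[str]:
    """Collect the config ids already recorded in the results file."""
    done: Set[str] = set()
    if not os.path.exists(path):
        return done
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip lines that do not parse
            if isinstance(rec, dict) and "config_id" in rec:
                done.add(rec["config_id"])
    return done


def run_sweep(configs: List[Dict], run_one: Callable[[Dict], float], path: str) -> List[str]:
    """Run configs not yet in the file, appending each result as it finishes."""
    done = load_done(path)
    ran = []
    for cfg in configs:
        if cfg["config_id"] in done:
            continue
        result = run_one(cfg)
        with open(path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps({"config_id": cfg["config_id"], "result": result}) + "\n")
            fh.flush()
        ran.append(cfg["config_id"])
    return ran
```

Writing one line per finished config means an interrupted sweep loses at most the config that was running.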

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-08 14:31:51 +00:00
Yifan Yang
7d9900b6af feat(sleep): optimizer/target model split, transfer experiment, LLM miner
Three additions driven by the goal of price-aware, model-flexible sleep:

1. DualBackend + build_backend(): route attempt->TARGET model and
   reflect/judge->OPTIMIZER model (SkillOpt's target-vs-optimizer split).
   gbrain runner gains --optimizer-backend/-model + --target-backend/-model.

2. run_transfer.py: sleep-scenario cross-model transfer. Optimize a skill on a
   SOURCE model (e.g. cheap haiku), freeze it, evaluate held-out on a TARGET
   model (e.g. expensive sonnet) with no further optimization — plus a direct
   reference. Mirrors the SkillOpt paper's transfer table; quantifies the
   "optimize cheap overnight, deploy anywhere" value prop.

3. llm_miner.py: turn real harvested transcripts into TaskRecords WITH checkable
   rule/rubric judges, wired into the cycle for non-mock backends, so real-data
   lift becomes measurable (heuristic miner remains the no-API fallback).
   Fixed a str.format brace bug the new unit test caught.

19 tests pass.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-08 14:31:51 +00:00
Yifan Yang
4203086899 feat(sleep): real claude + codex backends, gbrain-evals benchmark, rule judges
Upgrade from mock-only to REAL multi-backend validation:

Backends (skillopt/sleep/backend.py):
  - CliBackend base: shared attempt/judge/reflect prompts, response cache,
    token accounting. Subclasses implement only _call().
  - ClaudeCliBackend: drives `claude -p --output-format text`.
  - CodexCliBackend: drives the REAL @openai/codex `exec -o <file>` for clean
    output; resolve_codex_path() skips the hermes wrapper at ~/.local/bin/codex.
  - reflect() now aggregates the exact failing judge criteria into the prompt
    (gbrain's lesson: tell the optimizer what the scorer rewards).

Rule judges (skillopt/sleep/judges.py): gbrain-compatible local scorers
  (section_present / regex / max_chars / contains / tool_called) — held-out
  scoring with no judge-API spend. TaskRecord gains a `judge` field +
  reference_kind="rule".
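
A local rule judge over the five check kinds listed above could be sketched as follows. The check names come from the commit; the `{"kind": ..., "value": ...}` layout, the Markdown-heading test for `section_present`, and the all-or-nothing score are assumptions.

```python
import re
from typing import Iterable, Sequence


def rule_judge(response: str, checks: Iterable[dict], tools_called: Sequence[str] = ()) -> float:
    """Return 1.0 only if every check passes, else 0.0 (no API calls)."""
    results = []
    for check in checks:
        kind, value = check["kind"], check["value"]
        if kind == "max_chars":
            ok = len(response) <= int(value)
        elif kind == "contains":
            ok = str(value) in response
        elif kind == "regex":
            ok = re.search(str(value), response) is not None
        elif kind == "section_present":
            # Assumption: a section is a Markdown heading with this exact title.
            pattern = r"^#+\s*" + re.escape(str(value)) + r"\s*$"
            ok = re.search(pattern, response, re.MULTILINE) is not None
        elif kind == "tool_called":
            ok = value in tools_called
        else:
            raise ValueError(f"unknown check kind: {kind!r}")
        results.append(ok)
    return 1.0 if results and all(results) else 0.0
```

Because every check is a pure string or list test, held-out scoring stays deterministic and costs nothing per evaluation.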

gbrain-evals adapter (experiments/gbrain_bench.py, run_gbrain.py): load
  garrytan/gbrain-evals skillopt-v1 deficient skills + train/held-out task
  sets and run our consolidate() loop against the SAME suite gbrain scores.

REAL results (docs/sleep/real_api_results.md), brief-writer seed, 1 night:
  - Claude (Haiku): held-out 0.00 -> 1.00
  - Codex:          held-out 0.00 -> 0.67
  Both proposed a correct, general format rule into the protected LEARNED block.

CLI: --backend {mock,claude,codex}, --codex-path, --model; experiment +
gbrain runners gain --limit-* cost controls. 17 tests pass.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-08 14:31:51 +00:00
Yifan Yang
4e7add899d feat(sleep): nightly offline self-evolution engine + Claude Code plugin
Add skillopt/sleep — a deployment-time companion to SkillOpt that gives a
local Claude agent a nightly "sleep cycle":

  harvest ~/.claude transcripts -> mine recurring tasks -> replay offline
    -> consolidate (reflect -> bounded edit -> held-out GATE) -> stage -> adopt

Synthesizes SkillOpt (validation-gated bounded text optimization, reusing
skillopt.evaluation.gate verbatim), Claude Dreams (offline consolidation;
input never mutated; review-then-adopt), and the agent-sleep paper
(short-term experience -> long-term competence).

Engine (skillopt/sleep/, import-light, py>=3.10):
  - harvest.py   read-only parse of session JSONL + history.jsonl
  - mine.py      sessions -> TaskRecords (heuristic miner + LLM hook)
  - backend.py   MockBackend (deterministic, no API) + AnthropicBackend
  - replay.py    offline re-run -> (hard, soft) scores
  - consolidate.py  one SkillOpt epoch behind a held-out gate
  - memory.py    protected-region edits to SKILL.md / CLAUDE.md
  - staging.py   stage proposals; adopt with backup (Dreams safety contract)
  - cycle.py + __main__.py  orchestrator + CLI (run/dry-run/status/adopt/harvest)

Plugin (skillopt-sleep-plugin/): plugin.json, /sleep command, skillopt-sleep
skill, SessionEnd hook, bundled runner + cron generator.

Validation (deterministic, no API): persona experiment proves held-out lift
(researcher 0.33->1.0, programmer 0.32->1.0) AND that the gate rejects an
injected harmful edit. 13 stdlib-unittest tests pass, incl. full cycle +
adopt-with-backup and parsing of real on-disk transcripts.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-08 14:31:51 +00:00
Matt Van Horn
c31c50be51 fix(model): forward Qwen timeout and only set enable_thinking when true
Two bugs made local vLLM targets score acc=0.000: the router did not
forward 'timeout' to the Qwen backend (so runs used the 300s default),
and qwen_backend always injected chat_template_kwargs.enable_thinking,
which non-Qwen vLLM servers reject or answer with <think> output and no
<answer> tag. Forward timeout and only set the field when enabled.
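
The "only set the field when enabled" half of the fix can be sketched as below. `build_payload` and its parameters are hypothetical; only the `chat_template_kwargs.enable_thinking` field comes from the commit message.

```python
from typing import Dict, List


def build_payload(
    model: str,
    messages: List[Dict[str, str]],
    *,
    enable_thinking: bool = False,
    max_tokens: int = 1024,
) -> dict:
    """Build a chat request body, adding the thinking flag only when asked."""
    payload = {"model": model, "messages": messages, "max_tokens": max_tokens}
    if enable_thinking:
        # Omitted entirely when False, so servers that do not know the
        # field never see it.
        payload["chat_template_kwargs"] = {"enable_thinking": True}
    return payload
```

Leaving the key out, rather than sending `False`, is what keeps the request valid for servers that reject unknown template arguments.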

Closes #28

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 07:41:35 -07:00
Yifan Yang
4eb4c64b2a envs/_template: make template instantiable against real EnvAdapter ABC
The shipped env_template.py and loader_template.py described the same
fictional async execute / evaluate / build_prompt API documented in
docs/reference/api.md. As a result TemplateBenchmarkEnv(cfg) raised
'TypeError: Can't instantiate abstract class' for every copy-and-paste
user who followed the in-tree scaffold.

Rewrite the template so it's a working starting point:

- env_template.py: TemplateBenchmarkEnv(EnvAdapter) now implements all
  five real abstract methods (build_train_env, build_eval_env, rollout,
  reflect, get_task_types) with no-op defaults documented as TODO.
  Instantiable today; pytest 60/60 still passes.
- loader_template.py: TemplateBenchmarkLoader(SplitDataLoader)
  implements load_split_items for .json / .jsonl input and explains the
  optional load_raw_items override for split_mode="ratio".
- README.md: usage steps now point at scripts/train.py's _ENV_REGISTRY
  (the real registry) instead of a non-existent BENCHMARK_REGISTRY in
  skillopt/envs/__init__.py, and link to the rewritten new-benchmark
  guide.
- config_template.yaml: _base_ is a string path (not a list, which the
  loader rejects); skill_init is commented out with a note so the
  template config doesn't reference a file the user hasn't created.

Verified locally: 'from skillopt.envs._template.env_template import
TemplateBenchmarkEnv; TemplateBenchmarkEnv()' succeeds. Refs
microsoft/SkillOpt#30.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-01 20:15:12 +00:00
kaikai-macbook
41012e2d5e Support Qwen chat as optimizer backend
2026-06-01 16:44:49 +08:00
Yif Yang
b4850ce418 fix(minimax): wire YAML / CLI config through to backend
PR #26 added a MiniMax chat backend but left three loose ends that
silently dropped any YAML / CLI configuration of minimax_* keys: only
the environment-variable path worked.

- skillopt/config.py: add 6 model.minimax_* entries to _FLATTEN_MAP so
  the keys declared in configs/_base_/default.yaml actually survive
  flatten_config() (mirroring the existing model.qwen_chat_* block).
- skillopt/engine/trainer.py: import configure_minimax_chat and call
  it alongside configure_qwen_chat, so cfg-supplied credentials,
  temperature, max_tokens, and enable_thinking reach the backend. Also
  apply cfg["minimax_model"] via set_target_deployment when the active
  target backend is minimax_chat.
- scripts/train.py: add 6 --minimax_* CLI flags + the corresponding
  _CLI_TO_YAML entries, add 'minimax' / 'minimax_chat' to the --backend
  choices, auto-route to target_backend=minimax_chat, and pick the
  right default target_model for the new backend.

Default behavior on existing backends (openai, claude, qwen, codex,
claude_code_exec) is unchanged; all 8 shipped configs continue to load
with gate_metric falling back to 'hard' for paper reproduction.
2026-05-31 08:22:20 +00:00
Yif Yang
643346c9f3 Merge pull request #26 from KovaForge/minimax-backend
feat: add MiniMax as first-class chat backend

Adds skillopt/model/minimax_backend.py (clean port of qwen_backend.py
targeting MiniMax-M2.7 via https://api.minimax.io/v1) and registers it
in the router, backend_config, and common defaults. Existing backends
(openai_chat, claude_chat, qwen_chat, codex_exec, claude_code_exec)
remain bit-for-bit unchanged.

Verified via 10 import / routing / parity subtests; backward-compat
sweep across the 8 shipped configs passes with no regression.

A follow-up commit completes the YAML / CLI plumbing that this PR left
half-wired (FLATTEN_MAP entries, trainer-level configure_minimax_chat
call, and --minimax_* CLI args).
2026-05-31 08:20:39 +00:00
Cuzyoung
00602df9e9 feat(slow-update): add config-controlled gated / force-injected modes
Add optimizer.slow_update_gate_with_selection to control how epoch-boundary
slow-update guidance is applied:
- false (default): force-injected - inject guidance into current & best
  unconditionally (unchanged behavior).
- true: gated - evaluate the slow-update candidate on the selection set and
  accept/reject via the same validation gate as step-level updates
  (logic follows the SkillReflection ablation).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-31 02:02:23 +00:00
Declan Murphy
c6da31df44 fix: use correct MiniMax endpoint, model name, and add .venv to gitignore
2026-05-31 05:27:50 +08:00
Declan Murphy
309ea64ff4 feat: integrate MiniMax into model router, backend config, and common
common.py:
- Add minimax_chat → MiniMax/MiniMax-Text-01 to _BACKEND_DEFAULT_MODELS
- Add minimax/minimax_chat aliases to _BACKEND_ALIASES

backend_config.py:
- Add minimax_chat to set_optimizer_backend() valid set
- Add minimax_chat to set_target_backend() valid set
- Add minimax_chat to is_optimizer_chat_backend()
- Add minimax_chat to is_target_chat_backend()

__init__.py:
- Import minimax_backend as _minimax
- Add minimax_chat to set_backend() legacy handler
- Add minimax_chat to get_backend_name() reporting
- Route chat_target() and chat_target_messages() to _minimax
- Update NotImplementedError messages to list minimax_chat
- Aggregate _minimax into get_token_summary()
- Add _minimax.reset_token_tracker()
- Add configure_minimax_chat() delegator
- Add _minimax to set_reasoning_effort() and set_target_deployment()
2026-05-31 05:22:33 +08:00
Declan Murphy
d224d425f9 feat: add MiniMax chat backend module
Port qwen_backend.py pattern to minimax_backend.py as a new
OpenAI-compatible urllib-based backend. Includes:
- BASE_URL defaulting to https://api.minimax.chat/v1
- API_KEY, TIMEOUT_SECONDS, MAX_TOKENS, TEMPERATURE env vars
- ENABLE_THINKING support (MiniMax thinking mode)
- configure_minimax_chat() runtime configurator
- chat_target() and chat_target_messages() functions
- TokenTracker integration and get_token_summary()
- set_target_deployment() support
- Default model: MiniMax/MiniMax-Text-01
2026-05-31 05:22:29 +08:00
hwq
1f75d022a5 y
2026-05-30 15:01:34 +00:00
Yif Yang
d190bf37c1 Merge pull request #25 from lvbaocheng/feature/gate-soft-metric
Add configurable gate metric (hard / soft / mixed) for skill validation

Default is `hard`, preserving exact pre-PR behavior — verified by 22 unit
assertions on the gate module plus an end-to-end 8-step trainer-trajectory
test that produces a bit-for-bit identical accept/reject sequence between
the pre-PR and post-PR code paths under `gate_metric: hard`. Paper-
reproduction results are unaffected.

`soft` and `mixed` are opt-in via `evaluation.gate_metric` in the config
and address small-selection-set runs where discrete hard accuracy is too
coarse to distinguish candidate skills.
2026-05-30 08:01:39 +00:00
Yif Yang
02695bd813 Merge pull request #24 from lvbaocheng/fix/claude-cli-effort-flag
fix(claude): use --effort instead of deprecated --thinking flag
2026-05-30 15:31:00 +08:00
lvbaocheng
5d7875cb2e Add configurable gate metric (hard / soft / mixed) for skill validation
The training gate currently always compares candidate vs. current/best
using *hard* exact-match accuracy. On environments with a small
held-out selection set (e.g. 3-6 items) or partial-credit scoring,
hard accuracy is too coarse: candidate skills that meaningfully
improve per-item soft scores get rejected because the discrete hard
count does not move.

Add three opt-in metrics so users can pick the one that matches their
scoring function:

- `gate_metric: hard`  — original behavior (default, fully backward
  compatible).
- `gate_metric: soft`  — gate on the soft / F1 / partial-credit score.
- `gate_metric: mixed` — `(1 - w) * hard + w * soft`, where `w` is
  set by `gate_mixed_weight` (default 0.5).
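
The three metrics reduce to a small pure function. The sketch below follows the `select_gate_score(hard, soft, metric, mixed_weight)` signature named later in this message, but the body is a reconstruction from the formulas above, not the merged code.

```python
def select_gate_score(
    hard: float, soft: float, metric: str = "hard", mixed_weight: float = 0.5
) -> float:
    """Project (hard, soft) scores onto the configured gate metric."""
    if metric == "hard":
        return hard
    if metric == "soft":
        return soft
    if metric == "mixed":
        # (1 - w) * hard + w * soft, so w = 0 is hard-only and w = 1 is soft-only.
        return (1.0 - mixed_weight) * hard + mixed_weight * soft
    raise ValueError(f"unknown gate metric: {metric!r}")
```

For example, with `hard=0.5` and `soft=0.8` a candidate scores 0.5 under `hard`, 0.8 under `soft`, and about 0.65 under `mixed` with the default weight.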

Changes
-------
- `skillopt/evaluation/gate.py`: extend `evaluate_gate` with
  `cand_soft`, `metric`, and `mixed_weight` keyword arguments; add a
  pure helper `select_gate_score(hard, soft, metric, mixed_weight)`.
  Defaults preserve the original `metric="hard"` behavior — existing
  callers that only pass `cand_hard` keep working unchanged.
- `skillopt/evaluation/__init__.py`: export the new helper / type.
- `skillopt/engine/trainer.py`: read `evaluation.gate_metric` and
  `evaluation.gate_mixed_weight` from the config (with safe defaults),
  pass both metrics into `evaluate_gate`, and project the baseline
  `current_score` / `best_score` into metric space so subsequent
  comparisons are consistent. Print the gate metric on the
  `[6/6 EVALUATE]` line so logs make the decision basis explicit. The
  selection cache still records both `(hard, soft)` so a metric change
  on resume is non-destructive.
- `configs/_base_/default.yaml`: document and ship the new keys with
  backward-compatible defaults (`hard`, `0.5`).

Backward compatibility
----------------------
- Default config does not change behavior: `gate_metric` defaults to
  `hard`, exactly matching the previous gate.
- `evaluate_gate(...)` keeps its existing positional signature; the
  new parameters are keyword-only with safe defaults.
- `step_record.json` gains optional `gate_metric` and
  `candidate_gate_score` fields; old records still load.

Tested
------
- Unit-tested all three metrics + boundary `mixed_weight` values
  (0.0 / 1.0) and rejection of unknown metric strings. All six cases
  pass.
- Verified `skillopt.engine.trainer` imports cleanly after the
  refactor.
2026-05-30 14:45:27 +08:00
lvbaocheng
2532043d25 fix(claude): use --effort instead of deprecated --thinking flag
Claude Code CLI v2.x renamed the flag; passing --thinking low causes
all rollout calls to fail on CLI 2.1.87+.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-30 11:24:13 +08:00
zq
41be2f1803 fix(scoring): use float() instead of int() for continuous reward scores
int() truncates smoothed composite scores (0.0-1.0) to 0,
making all continuous reward values appear as failures.
This broke SkillOpt training pipelines using SmoothedCompositeReward.
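
The truncation is easy to reproduce in isolation. The helper names and sample scores below are made up for illustration and are not taken from the scoring code.

```python
from typing import List


def mean_score_int(scores: List[float]) -> float:
    """Buggy aggregation: int() drops everything below 1.0 to 0."""
    return sum(int(s) for s in scores) / len(scores)


def mean_score_float(scores: List[float]) -> float:
    """Fixed aggregation: float() keeps partial credit."""
    return sum(float(s) for s in scores) / len(scores)


scores = [0.0, 0.35, 0.9, 1.0]
truncated = [int(s) for s in scores]  # only the exact 1.0 survives
```

Binary `hard=0/1` scores are unaffected by the switch, since `int()` and `float()` agree on 0 and 1.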
2026-05-30 07:47:41 +08:00
zq
a62ec857f1 fix(reflect): support continuous reward scores in failure filtering
not r.get("hard") treats non-zero floats as success.
Add explicit float threshold check (< 1e-9).
Backward compatible with binary hard=0/1.
2026-05-29 19:04:42 +08:00
zq
afb552008b fix(trainer): support continuous reward scores in bucket aggregation
int() truncates any float in [0,1) to 0. Replace with float().
Also fix falsy float check in failure detection.
Backward compatible with binary hard=0/1.
2026-05-29 19:03:52 +08:00
Yif Yang
75b5c7f31c Merge pull request #16 from guilhermeleste/feat/pioneer-ai-provider-integration
Add OpenAI-compatible backend support for Pioneer.ai and other providers
2026-05-29 10:14:32 +08:00
hwq
786d57b5cf Make rollout completion tokens configurable
2026-05-28 09:45:47 +00:00
guilhermeleste
d5c5b61830 Add OpenAI-compatible backend support for Pioneer.ai and other providers
- Add 'openai_compatible', 'compat', and 'openai' auth modes to azure_openai.py
- Modify _make_client() to use OpenAI client (not AzureOpenAI) for compatible endpoints
- Update type hints to support both AzureOpenAI and OpenAI clients
- Auto-configure API version sentinel when using compatible modes
- Add .env template for Pioneer.ai configuration

This allows users to use Pioneer.ai or any OpenAI-compatible API endpoint
as both optimizer and target backend without requiring Azure OpenAI.

Resolves: Support for non-Azure OpenAI-compatible providers
2026-05-28 05:54:43 -03:00
Cuzyoung
f55a26414e cleanup: remove unused benchmarks, deep_probe, meta_reflect
Remove sealqa, babyvision, mathverse, mmrb, swebench envs and configs.
Remove deep_probe, deep_reflect, meta_reflect modules and prompts.
Remove download_babyvision script.
These are not part of the core released benchmarks.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-24 19:36:48 +00:00
Cuzyoung
cff7ff6846 fix: rename remaining teacher/student refs, remove .gradio from repo
- Fix teacher/student in deep_reflect, meta_reflect, sealqa, babyvision,
  mathverse, mmrb, swebench envs and prompt templates
- Remove .gradio/certificate.pem from tracked files
- Add .gradio/ to .gitignore

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-24 19:22:20 +00:00
Cuzyoung
4a1b984d87 refactor: rename teacher/student to optimizer/target, remove best skills, fix slow update
- Rename teacher -> optimizer, student -> target across all code, configs, docs, prompts
- CLI: --teacher_model -> --optimizer_model, --student_model -> --target_model
- Remove best_skill files, keep only initial skills
- Fix slow update gate (force write into skill)
- Fix SLOW_UPDATE marker stripping
- Remove deep_reflect and meta_reflect mechanisms
- Update .env.example with export prefix and azure_cli docs
- Add endpoint empty validation in azure_openai.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-24 19:15:10 +00:00
CharlesYang030
244e346b83 SkillOpt v0.1.0: initial release
- Skill optimization framework with training loop analogy
- 11 benchmarks, 4 model backends (Azure OpenAI, Claude, Codex, Qwen)
- WebUI for browser-based training control
- Pluggable architecture for extending benchmarks and backends
2026-05-21 17:22:04 +00:00