microsoft-SkillOpt

mirror of https://github.com/microsoft/SkillOpt.git synced 2026-07-03 14:02:58 +08:00

Author	SHA1	Message	Date
Yifan Yang	99ec2caf6b	docs(sleep): complete 4/4 gbrain parity on Claude AND Codex (tool loop incl.) benchmark_report.md now 7/7 direct + 4/4 transfer, all 0->1.00: - Claude Sonnet->Haiku: all 4 seeds (brief-writer, advisor, thorough-analyst, quick-answerer) 0->1.00 - Codex self-optimized: brief-writer, advisor, quick-answerer 0->1.00 - quick-answerer uses the real ./search tool loop on both runtimes. This matches gbrain's own "4/4 skills 0->1.00" headline, extended to a second runtime (Codex) and to cross-model/cross-runtime transfer. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	acf4545c00	docs(sleep): full 4/4 gbrain parity — quick-answerer 0->1.00 via real tool loop quick-answerer (judge: tool_called=search) reaches 0.00 -> 1.00 with Sonnet optimizer -> Haiku target: the optimizer wrote an OVERRIDE of the "never use tools" instruction and the Haiku target genuinely invoked the ./search shim. All 4 gbrain skillopt-v1 seeds now at 0->1.00, matching gbrain's own headline. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	1d20e9db14	chore(sleep): include quick-answerer (tool loop) in the sweep direct plan All 4 gbrain skillopt-v1 seeds are now in the sweep, matching gbrain's full scorecard coverage. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	937bc1ec4d	feat(sleep): real tool-loop replay for gbrain quick-answerer (tool_called judge) The 4th gbrain seed (quick-answerer) is judged by tool_called=search: the agent must ACTUALLY call a search tool. Add an honest tool loop: - Backend.attempt_with_tools(task, skill, memory, tools) -> (response, tools_called) - Claude: exposes a real ./search shell shim, runs with --allowedTools Bash in a clean cwd; detects the call from the shim's log (not a self-reported marker). - Codex: same shim under `exec --sandbox workspace-write`. - Mock: deterministic — "calls" a tool iff skill/memory instructs it (for CI). - replay_one routes tasks with a tool_called check through the tool loop and feeds detected calls to the rule judge; ReplayResult gains tools_called. Verified live (Claude haiku): deficient skill -> tools_called=[] hard=0; learned "must run ./search" rule -> tools_called=['search'] hard=1.0. 20 tests pass. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	b1f41a7506	docs(sleep): full sweep — 5/5 direct + 4/4 transfer all 0->1.00 Machine-generated benchmark_report.md from a 9-config sweep: - Direct (Sonnet->Haiku): brief-writer/advisor/thorough-analyst 0->1.00 - Direct (Codex): brief-writer/advisor 0->1.00 - Transfer (4/4 positive, incl. cross-runtime Codex<->Claude): all 0->1.00 Cross-model transfer confirms the price-difference value prop: a skill optimized on a cheap model deploys for free on an expensive one, and skills move between Codex and Claude. sweep.jsonl is the committed source data. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	4186e5bb73	docs(sleep): definitive clean results — Sonnet->Haiku 3/3 seeds 0->1.00 Strong-optimizer/weak-target (Sonnet -> Haiku), fully isolated: brief-writer, advisor, thorough-analyst all 0.00 -> 1.00 on held-out. thorough-analyst shows 2-night convergence (0.33 -> 1.00). Codex self-optimized brief-writer also 0 -> 1.00. Key finding answering the optimizer/target-split request: the OPTIMIZER MODEL is decisive — weak Haiku-as-optimizer is flaky (0 or 1.0 across runs), strong Sonnet-as-optimizer reliably hits 1.0 on every seed. Raw logs under docs/sleep/raw/. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	023950a291	feat(sleep): sweep 'direct' plan uses strong-optimizer/weak-target dual config The default sweep direct plan now uses a DualBackend (Sonnet optimizer proposes edits, Haiku target runs tasks) — the SkillOpt-faithful and more reliable setup, since a weak self-optimizing model (Haiku-as-optimizer) produced flaky JSON. report.py renders the optimizer->target pairing in the direct table. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	d75863eb6f	fix(sleep): retry reflect on non-JSON reply; honest report narrative - reflect() now retries once with a firmer "JSON only" instruction when the first reply doesn't parse to a non-empty array. A transient non-JSON reply otherwise wastes a whole night (gate sees no edits -> reject), which made weak optimizers (Haiku) flaky across runs. - FINAL_REPORT.md: document the context-leak discovery honestly; Codex cells stand (clean), Claude cells recomputed under strict isolation. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	c80914b036	fix(sleep): disable global skills in claude calls (--bare --disable-slash-commands) The clean-cwd + --disallowedTools isolation was NOT enough: the user's GLOBAL skills (~/.claude/skills) are injected regardless of cwd, so reflect/attempt still sometimes replied with a list of installed skills instead of JSON edits (advisor reflect returned 21KB of skill descriptions, n_edits=0 -> gate reject). Add --bare (skip hooks/LSP/plugins) and --disable-slash-commands (disable all skills). Verified: the optimizer now returns clean JSON. Re-validating all seeds with the truly-isolated backend; prior Claude numbers are being recomputed honestly (some earlier "successes" were partly leak-assisted). Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	defb4566ea	fix(sleep): isolate claude CLI calls; concrete+override-aware reflect; honor hard constraints Critical correctness fix found by debugging the thorough-analyst failure: * `claude -p` was running with the AMBIENT Claude Code project context (the repo's CLAUDE.md, installed skills, tools). The optimizer/target calls were polluted — reflect once replied with a list of the user's installed skills instead of JSON edits. Now ClaudeCliBackend._call runs ISOLATED: a clean temp cwd, --disallowedTools '', --exclude-dynamic-system-prompt-sections. This is essential for the backend to be trustworthy and reproducible. reflect prompt: translate failing rule-judge criteria into plain English (max_chars=1200 -> "the ENTIRE response must be at most 1200 characters") and require CONCRETE, verbatim thresholds in proposed rules (not "respect limits"). * attempt prompt: treat the Learned-preferences block as HARD CONSTRAINTS that override earlier conflicting skill text. Earlier Claude results predate this fix and are being re-validated clean; the Codex backend was never affected (it runs in its own exec context). Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	233b619555	feat(sleep): marketplace manifest, install docs, final report shell, sweep flush - skillopt-sleep-plugin/.claude-plugin/marketplace.json so the plugin is installable via `/plugin marketplace add ./skillopt-sleep-plugin`. - README install section (clone -> add marketplace -> install -> /sleep status). - docs/sleep/FINAL_REPORT.md: the consolidated presented results doc (real Claude+Codex, transfer, and the honest thorough-analyst failure + fix). - sweep.py flushes stdout for live monitoring. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	a0419bfdbb	feat(sleep): benchmark sweep + report tooling; override-aware reflect prompt - sweep.py: run many (backend, model, seed, transfer-pair) configs sequentially, append each result to JSONL incrementally (resumable, interrupt-safe). - report.py: render the sweep JSONL into a presented Markdown scorecard with direct-improvement and cross-model-transfer tables. - reflect prompt now tells the optimizer its edits are APPENDED (can't delete the base skill text), so on a conflict it must write a forceful OVERRIDE rule. Diagnosed from a real failure: thorough-analyst (needs <=1200 chars) kept its edits rejected because the base "be exhaustive" line won; a verified override ("HARD LIMIT ... supersedes") makes Haiku obey (1194/880 chars -> hard=1.0). Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	7d9900b6af	feat(sleep): optimizer/target model split, transfer experiment, LLM miner Three additions driven by the goal of price-aware, model-flexible sleep: 1. DualBackend + build_backend(): route attempt->TARGET model and reflect/judge->OPTIMIZER model (SkillOpt's target-vs-optimizer split). gbrain runner gains --optimizer-backend/-model + --target-backend/-model. 2. run_transfer.py: sleep-scenario cross-model transfer. Optimize a skill on a SOURCE model (e.g. cheap haiku), freeze it, evaluate held-out on a TARGET model (e.g. expensive sonnet) with no further optimization — plus a direct reference. Mirrors the SkillOpt paper's transfer table; quantifies the "optimize cheap overnight, deploy anywhere" value prop. 3. llm_miner.py: turn real harvested transcripts into TaskRecords WITH checkable rule/rubric judges, wired into the cycle for non-mock backends, so real-data lift becomes measurable (heuristic miner remains the no-API fallback). Fixed a str.format brace bug the new unit test caught. 19 tests pass. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	63c79b3602	docs(sleep): record real Claude+Codex gbrain results; both reach 0->1.00 Codex with the directive reflect prompt + 2 nights converges 0.00 -> 1.00 (up from 0.67 single-night); its night-2 edit diagnoses its own residual failure ("preserve required sections even when keeping the brief short"). Claude (Haiku) reaches 1.00 in one night. Update plugin README + skill to reference --backend claude|codex (was anthropic) and surface the benchmark. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	4203086899	feat(sleep): real claude + codex backends, gbrain-evals benchmark, rule judges Upgrade from mock-only to REAL multi-backend validation: Backends (skillopt/sleep/backend.py): - CliBackend base: shared attempt/judge/reflect prompts, response cache, token accounting. Subclasses implement only _call(). - ClaudeCliBackend: drives `claude -p --output-format text`. - CodexCliBackend: drives the REAL @openai/codex `exec -o <file>` for clean output; resolve_codex_path() skips the hermes wrapper at ~/.local/bin/codex. - reflect() now aggregates the exact failing judge criteria into the prompt (gbrain's lesson: tell the optimizer what the scorer rewards). Rule judges (skillopt/sleep/judges.py): gbrain-compatible local scorers (section_present / regex / max_chars / contains / tool_called) — held-out scoring with no judge-API spend. TaskRecord gains a `judge` field + reference_kind="rule". gbrain-evals adapter (experiments/gbrain_bench.py, run_gbrain.py): load garrytan/gbrain-evals skillopt-v1 deficient skills + train/held-out task sets and run our consolidate() loop against the SAME suite gbrain scores. REAL results (docs/sleep/real_api_results.md), brief-writer seed, 1 night: - Claude (Haiku): held-out 0.00 -> 1.00 - Codex: held-out 0.00 -> 0.67 Both proposed a correct, general format rule into the protected LEARNED block. CLI: --backend {mock,claude,codex}, --codex-path, --model; experiment + gbrain runners gain --limit-* cost controls. 17 tests pass. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	309f3141d4	docs(sleep): add wake-up summary of the overnight build Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	4e7add899d	feat(sleep): nightly offline self-evolution engine + Claude Code plugin Add skillopt/sleep — a deployment-time companion to SkillOpt that gives a local Claude agent a nightly "sleep cycle": harvest ~/.claude transcripts -> mine recurring tasks -> replay offline -> consolidate (reflect -> bounded edit -> held-out GATE) -> stage -> adopt Synthesizes SkillOpt (validation-gated bounded text optimization, reusing skillopt.evaluation.gate verbatim), Claude Dreams (offline consolidation; input never mutated; review-then-adopt), and the agent-sleep paper (short-term experience -> long-term competence). Engine (skillopt/sleep/, import-light, py>=3.10): - harvest.py read-only parse of session JSONL + history.jsonl - mine.py sessions -> TaskRecords (heuristic miner + LLM hook) - backend.py MockBackend (deterministic, no API) + AnthropicBackend - replay.py offline re-run -> (hard, soft) scores - consolidate.py one SkillOpt epoch behind a held-out gate - memory.py protected-region edits to SKILL.md / CLAUDE.md - staging.py stage proposals; adopt with backup (Dreams safety contract) - cycle.py + __main__.py orchestrator + CLI (run/dry-run/status/adopt/harvest) Plugin (skillopt-sleep-plugin/): plugin.json, /sleep command, skillopt-sleep skill, SessionEnd hook, bundled runner + cron generator. Validation (deterministic, no API): persona experiment proves held-out lift (researcher 0.33->1.0, programmer 0.32->1.0) AND that the gate rejects an injected harmful edit. 13 stdlib-unittest tests pass, incl. full cycle + adopt-with-backup and parsing of real on-disk transcripts. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	0ac2b35daa	docs: add SkillOpt-Sleep Claude Code plugin design Design for a nightly offline self-evolution plugin that synthesizes SkillOpt (validation-gated bounded text optimizer), Claude Dreams (offline memory consolidation), and the Agent-Sleep paper (short-term to long-term experience). Harvests local ~/.claude transcripts, mines recurring tasks, replays them offline, and consolidates memory+skills behind a held-out gate. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-08 14:31:51 +00:00
Yifan Yang	b5328e8b22	Merge pull request #40 from mvanhorn/fix/28-qwen-chat-timeout-and-thinking-tag fix: forward Qwen target timeout and gate enable_thinking for vLLM targets	2026-06-08 01:42:50 +08:00
Matt Van Horn	c31c50be51	fix(model): forward Qwen timeout and only set enable_thinking when true Two bugs made local vLLM targets score acc=0.000: the router did not forward 'timeout' to the Qwen backend (so runs used the 300s default), and qwen_backend always injected chat_template_kwargs.enable_thinking, which non-Qwen vLLM servers reject or answer with <think> output and no <answer> tag. Forward timeout and only set the field when enabled. Closes #28 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-07 07:41:35 -07:00
Yif Yang	ee9931ec01	docs: add SkillOpt integration news	2026-06-03 16:07:56 +00:00
CharlesYang030	3f194d58e5	docs: trim News entry wording Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-06-02 23:12:40 +08:00
CharlesYang030	c7513d54f3	docs: update News section to match LLM2CLIP style Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-06-02 23:09:10 +08:00
CharlesYang030	abc9acd82e	docs: add fire emoji to News section heading Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-06-02 22:59:06 +08:00
CharlesYang030	46cc2efd8a	docs: add News section, PyPI install instructions, and PyPI badge to README Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-06-02 22:54:54 +08:00
Ziyang Gong	25da7cb2dd	Merge pull request #32 from Yif-Yang/fix/issue-30-docs-and-template Fix/issue 30 docs and template v0.1.0	2026-06-02 10:12:48 +08:00
Yifan Yang	4eb4c64b2a	envs/_template: make template instantiable against real EnvAdapter ABC The shipped env_template.py and loader_template.py described the same fictional async execute / evaluate / build_prompt API documented in docs/reference/api.md. As a result TemplateBenchmarkEnv(cfg) raised 'TypeError: Can't instantiate abstract class' for every copy-and-paste user who followed the in-tree scaffold. Rewrite the template so it's a working starting point: - env_template.py: TemplateBenchmarkEnv(EnvAdapter) now implements all five real abstract methods (build_train_env, build_eval_env, rollout, reflect, get_task_types) with no-op defaults documented as TODO. Instantiable today; pytest 60/60 still passes. - loader_template.py: TemplateBenchmarkLoader(SplitDataLoader) implements load_split_items for .json / .jsonl input and explains the optional load_raw_items override for split_mode="ratio". - README.md: usage steps now point at scripts/train.py's _ENV_REGISTRY (the real registry) instead of a non-existent BENCHMARK_REGISTRY in skillopt/envs/__init__.py, and link to the rewritten new-benchmark guide. - config_template.yaml: _base_ is a string path (not a list, which the loader rejects); skill_init is commented out with a note so the template config doesn't reference a file the user hasn't created. Verified locally: 'from skillopt.envs._template.env_template import TemplateBenchmarkEnv; TemplateBenchmarkEnv()' succeeds. Refs microsoft/SkillOpt#30. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-01 20:15:12 +00:00
Yifan Yang	2ca2910649	docs: align API reference and Add-a-Benchmark guide with real EnvAdapter ABC docs/reference/api.md previously documented a fictional EnvAdapter API (execute / evaluate / build_prompt + DataItem / TaskResult) and a BENCHMARK_REGISTRY that never existed in code. Anyone following the documented contract would hit ImportError or TypeError on the first instantiation. Replace both pages with the real shape from skillopt/envs/base.py and skillopt/datasets/base.py: - EnvAdapter: build_train_env, build_eval_env, rollout, reflect, get_task_types (the 5 actual abstract methods). - Rollout dicts: id / hard / soft required; everything else preserved into RolloutResult.extras. - Reflect dicts: {patch, source_type} schema as consumed by run_minibatch_reflect. - BatchSpec: slotted-but-mutable dataclass matching the actual definition (payload defaults to None, metadata to dict()). - SplitDataLoader.load_split_items as the one mandatory loader method. - Registry: _ENV_REGISTRY in scripts/train.py (lazy try/except ImportError block), not a non-existent BENCHMARK_REGISTRY in skillopt/envs/__init__.py. - _base_: documented as a string path, since the current YAML loader only accepts strings. The new-benchmark.md guide now walks through a docfaithful worked example with a real rollout helper (chat_target + scorer) instead of hand-waving over the rollout step. Refs microsoft/SkillOpt#30. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-06-01 20:14:54 +00:00
Yifan Yang	fb1a76371d	Merge pull request #29 from LifeIsSoSolong/codex/qwen-chat-optimizer-backend Support qwen_chat as optimizer backend	2026-06-02 03:27:50 +08:00
Yifan Yang	47063e1ceb	Merge pull request #27 from Oxygen56/test/add-core-utility-tests test: add unit test suite for core utility modules	2026-06-02 03:27:26 +08:00
hwq	181d71b737	Release data split manifests	2026-06-01 16:02:14 +00:00
kaikai-macbook	41012e2d5e	Support Qwen chat as optimizer backend	2026-06-01 16:44:49 +08:00
Claude Code Agent	dd8cd993b5	test: add unit test suite for core utility modules Add initial test infrastructure covering: - skillopt/utils/scoring.py (compute_score, skill_hash) - skillopt/utils/json_utils.py (extract_json, extract_json_array) - skillopt/types.py (Edit, Patch dataclass serialization) All tested functions are pure/deterministic with no LLM dependencies. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-01 02:04:22 +08:00
Yif Yang	8ebede0efd	Refine README for clarity on optimization results Removed redundant wording about math benchmarks.	2026-05-31 18:20:00 +08:00
Yif Yang	266fca72ab	docs: clarify optional features and ckpt artifacts	2026-05-31 09:36:25 +00:00
Yif Yang	9265545c45	docs: clarify README and paper-aligned skill artifacts	2026-05-31 09:23:07 +00:00
Cuzyoung	8acc2dd03e	docs: add self-contained reproduction & usage guideline page Add docs/guideline.html, a single self-contained documentation guide (left-nav + content + on-this-page TOC) covering installation, data preparation, training/eval, full configuration reference, framework internals, and an API reference. Link it from the README with local, htmlpreview, and GitHub Pages access instructions. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-31 09:01:25 +00:00
Yif Yang	b4850ce418	fix(minimax): wire YAML / CLI config through to backend PR #26 added a MiniMax chat backend but left three loose ends that silently dropped any YAML / CLI configuration of minimax_* keys: only the environment-variable path worked. - skillopt/config.py: add 6 model.minimax_* entries to _FLATTEN_MAP so the keys declared in configs/_base_/default.yaml actually survive flatten_config() (mirroring the existing model.qwen_chat_* block). - skillopt/engine/trainer.py: import configure_minimax_chat and call it alongside configure_qwen_chat, so cfg-supplied credentials, temperature, max_tokens, and enable_thinking reach the backend. Also apply cfg["minimax_model"] via set_target_deployment when the active target backend is minimax_chat. - scripts/train.py: add 6 --minimax_* CLI flags + the corresponding _CLI_TO_YAML entries, add 'minimax' / 'minimax_chat' to the --backend choices, auto-route to target_backend=minimax_chat, and pick the right default target_model for the new backend. Default behavior on existing backends (openai, claude, qwen, codex, claude_code_exec) is unchanged; all 8 shipped configs continue to load with gate_metric falling back to 'hard' for paper reproduction.	2026-05-31 08:22:20 +00:00
Yif Yang	643346c9f3	Merge pull request #26 from KovaForge/minimax-backend feat: add MiniMax as first-class chat backend Adds skillopt/model/minimax_backend.py (clean port of qwen_backend.py targeting MiniMax-M2.7 via https://api.minimax.io/v1) and registers it in the router, backend_config, and common defaults. Existing backends (openai_chat, claude_chat, qwen_chat, codex_exec, claude_code_exec) remain bit-for-bit unchanged. Verified via 10 import / routing / parity subtests; backward-compat sweep across the 8 shipped configs passes with no regression. A follow-up commit completes the YAML / CLI plumbing that this PR left half-wired (FLATTEN_MAP entries, trainer-level configure_minimax_chat call, and --minimax_* CLI args).	2026-05-31 08:20:39 +00:00
Cuzyoung	00602df9e9	feat(slow-update): add config-controlled gated / force-injected modes Add optimizer.slow_update_gate_with_selection to control how epoch-boundary slow-update guidance is applied: - false (default): force-injected - inject guidance into current & best unconditionally (unchanged behavior). - true: gated - evaluate the slow-update candidate on the selection set and accept/reject via the same validation gate as step-level updates (logic follows the SkillReflection ablation). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-31 02:02:23 +00:00
Declan Murphy	c6da31df44	fix: use correct MiniMax endpoint, model name, and add .venv to gitignore	2026-05-31 05:27:50 +08:00
Declan Murphy	e4201074aa	docs: add MiniMax config to default.yaml and .env.example default.yaml: - Add minimax_base_url, minimax_api_key, minimax_model, minimax_temperature, minimax_max_tokens, minimax_enable_thinking settings - Add optimizer_minimax_base_url, target_minimax_base_url per-role overrides - Add optimizer_minimax_api_key, target_minimax_api_key per-role overrides .env.example: - Add MINIMAX_BASE_URL, MINIMAX_API_KEY, MINIMAX_MODEL env var docs	2026-05-31 05:22:35 +08:00
Declan Murphy	309ea64ff4	feat: integrate MiniMax into model router, backend config, and common common.py: - Add minimax_chat → MiniMax/MiniMax-Text-01 to _BACKEND_DEFAULT_MODELS - Add minimax/minimax_chat aliases to _BACKEND_ALIASES backend_config.py: - Add minimax_chat to set_optimizer_backend() valid set - Add minimax_chat to set_target_backend() valid set - Add minimax_chat to is_optimizer_chat_backend() - Add minimax_chat to is_target_chat_backend() __init__.py: - Import minimax_backend as _minimax - Add minimax_chat to set_backend() legacy handler - Add minimax_chat to get_backend_name() reporting - Route chat_target() and chat_target_messages() to _minimax - Update NotImplementedError messages to list minimax_chat - Aggregate _minimax into get_token_summary() - Add _minimax.reset_token_tracker() - Add configure_minimax_chat() delegator - Add _minimax to set_reasoning_effort() and set_target_deployment()	2026-05-31 05:22:33 +08:00
Declan Murphy	d224d425f9	feat: add MiniMax chat backend module Port qwen_backend.py pattern to minimax_backend.py as a new OpenAI-compatible urllib-based backend. Includes: - BASE_URL defaulting to https://api.minimax.chat/v1 - API_KEY, TIMEOUT_SECONDS, MAX_TOKENS, TEMPERATURE env vars - ENABLE_THINKING support (MiniMax thinking mode) - configure_minimax_chat() runtime configurator - chat_target() and chat_target_messages() functions - TokenTracker integration and get_token_summary() - set_target_deployment() support - Default model: MiniMax/MiniMax-Text-01	2026-05-31 05:22:29 +08:00
hwq	42e555d28e	Update eval-only README example	2026-05-30 15:28:17 +00:00
hwq	933c0a4ab5	Add GPT-5.5 benchmark skills	2026-05-30 15:15:15 +00:00
hwq	1f75d022a5	y	2026-05-30 15:01:34 +00:00
Yif Yang	4f3a9bc055	docs: scope PR #25 gate_metric as opt-in example, not default Move the soft/mixed gate-metric configuration introduced in PR #25 out of the base default config and into a standalone example config so that default SkillOpt runs (and paper reproduction) remain bit-for-bit on the original hard gate. - configs/_base_/default.yaml: drop gate_metric / gate_mixed_weight keys. The trainer's cfg.get("gate_metric", "hard") fallback preserves the original behavior unchanged. - configs/examples/soft_gate.yaml: new standalone reference config with a header explaining when to consider it (small selection split with continuous rewards) and when not to (paper reproduction, large or binary-reward settings). - README.md: add a short "Community-contributed configs" section that clearly flags this as user-contributed and non-default.	2026-05-30 08:09:03 +00:00
Yif Yang	d190bf37c1	Merge pull request #25 from lvbaocheng/feature/gate-soft-metric Add configurable gate metric (hard / soft / mixed) for skill validation Default is `hard`, preserving exact pre-PR behavior — verified by 22 unit assertions on the gate module plus an end-to-end 8-step trainer-trajectory test that produces a bit-for-bit identical accept/reject sequence between the pre-PR and post-PR code paths under `gate_metric: hard`. Paper- reproduction results are unaffected. `soft` and `mixed` are opt-in via `evaluation.gate_metric` in the config and address small-selection-set runs where discrete hard accuracy is too coarse to distinguish candidate skills.	2026-05-30 08:01:39 +00:00
Yif Yang	02695bd813	Merge pull request #24 from lvbaocheng/fix/claude-cli-effort-flag fix(claude): use --effort instead of deprecated --thinking flag	2026-05-30 15:31:00 +08:00

213 Commits
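
Note on commit 4203086899: the rule judges it lists (section_present / regex / max_chars / contains / tool_called) might look roughly like the sketch below. This is not the code in skillopt/sleep/judges.py. The `score_rule` function, the dict-shaped judge spec, the heading-based reading of section_present, and the all-or-nothing 0.0/1.0 score are all assumptions made for illustration; only the criterion names come from the commit message.

```python
import re
from typing import Any, Iterable, Mapping


def score_rule(judge: Mapping[str, Any], response: str,
               tools_called: Iterable[str] = ()) -> float:
    """Return 1.0 if every criterion in `judge` holds, else 0.0.

    Hypothetical spec format: keys are the criterion names from the commit
    message; the value shapes used here are assumptions.
    """
    called = set(tools_called)
    checks = []

    # Assumed meaning: each named section starts a line, optionally after '#'.
    for name in judge.get("section_present", []):
        pattern = rf"^\s*#*\s*{re.escape(name)}\b"
        checks.append(re.search(pattern, response, re.I | re.M) is not None)

    # The response must match the given regular expression somewhere.
    if "regex" in judge:
        checks.append(re.search(judge["regex"], response) is not None)

    # The entire response must fit within the character budget.
    if "max_chars" in judge:
        checks.append(len(response) <= judge["max_chars"])

    # Every listed substring must appear verbatim.
    for needle in judge.get("contains", []):
        checks.append(needle in response)

    # The named tool must be among the calls detected during replay.
    if "tool_called" in judge:
        checks.append(judge["tool_called"] in called)

    return 1.0 if all(checks) else 0.0


judge = {"max_chars": 1200, "tool_called": "search"}
print(score_rule(judge, "short answer", tools_called=["search"]))  # 1.0
print(score_rule(judge, "short answer"))  # 0.0
```

Passing the detected tool calls in separately, rather than parsing them out of the response text, follows the point made in commit 937bc1ec4d that the call is detected from the shim's log and not from a self-reported marker.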