* Robustness for the claude/codex backends on Windows: argv overflow, subprocess encoding, tolerant JSON, test-eval dirs
Fixes surfaced running SkillOpt end-to-end on the bundled `claude` backend
(local Claude CLI) on Windows. None changes the OpenAI/GPT happy path.
1. skillopt/engine/trainer.py — the final test-eval directory
(test_eval_final/) is written to before being created; add
os.makedirs(..., exist_ok=True), matching the two sibling test-eval dirs.
Without it, writing summary.json raises FileNotFoundError when a rollout
yields zero predictions.
2. skillopt/model/claude_backend.py
a. Pass the prompt via stdin (not argv): on Windows the whole command line
is capped at ~32 KB and a large optimizer prompt (the success-analyst
minibatch carrying several report trajectories) overflows it with
[WinError 206], killing the run after retries.
b. Pass the system prompt via --append-system-prompt-file (a temp file),
not argv. The system prompt here is the skill being optimized, which
SkillOpt grows over training; since the ~32 KB cap applies to the SUM of
all argv, a grown skill would re-hit [WinError 206] even with the prompt
on stdin.
c. Pin the subprocess encoding to utf-8 (errors="replace"). With text=True
and no encoding=, stdin is encoded with the system codepage; on a zh-CN
box (cp936/GBK) a prompt containing an emoji or some Latin-1 characters
raises UnicodeEncodeError before the CLI even starts, failing every retry.
3. skillopt/model/codex_backend.py — the same utf-8 encoding pin on its
subprocess.run(input=...) call (identical unpinned-encoding pattern).
4. skillopt/utils/json_utils.py — extract_json() returned None for valid-
looking JSON that strict json.loads rejects (unescaped ASCII quotes inside
CJK string values, trailing commas), silently dropping the analyst's edits
on non-schema backends (Claude/Qwen): reflect produces N edits, 0 applied.
Add a json_repair fallback, but only on a single unambiguous object — a
balanced-brace extractor plus a refuse-on-multiple-objects guard — so a
chain-of-thought "scratch + final" response can't make repair silently
return the wrong (discarded) object, which would be worse than None (None is
detectable and retryable; a wrong-but-valid edit is applied blind). Declare
json_repair in requirements.txt and the claude/qwen optional extras so the
fallback is actually present (it otherwise no-ops, dropping edits silently).
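
Illustrative sketch for items 2a, 2c and 3 (not the backend code itself): the
prompt goes to the child on stdin and the encoding is pinned to utf-8. The
helper name run_with_stdin and the echo child ECHO_CMD are hypothetical; the
real backends invoke their own CLIs.

```python
import subprocess
import sys


def run_with_stdin(cmd, prompt, timeout=60):
    """Send `prompt` to a child process on stdin with a pinned utf-8 encoding."""
    result = subprocess.run(
        cmd,
        input=prompt,          # prompt travels on stdin, so argv stays short
        capture_output=True,
        text=True,
        encoding="utf-8",      # do not fall back to the system codepage
        errors="replace",      # undecodable output bytes become U+FFFD
        timeout=timeout,
    )
    return result.stdout


# Hypothetical stand-in child: echoes stdin bytes unchanged.
ECHO_CMD = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())",
]
```

With the encoding pinned, a prompt containing an emoji round-trips regardless
of the parent's locale, e.g. run_with_stdin(ECHO_CMD, "ok \U0001F600").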
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(cherry picked from commit dca74a683e)
* fix(json_utils): harden tolerant JSON fallback from PR #77
Follow-up fixes on top of the cherry-picked Windows-robustness change:
1. Make _top_level_brace_objects() fully string-aware in its OUTER scan, not
just inside an object. A '{' inside quoted prose (e.g. '"set it to {x}"')
no longer starts a candidate object, so extract_json() returns None for
prose pseudo-JSON instead of repairing it into a bogus dict — which would
be strictly worse than dropping the edit, since extract_json feeds the
optimizer's skill edits.
2. Pick the repair candidate BEFORE importing json_repair, so the missing-
dependency RuntimeWarning only fires when there is genuinely a single
malformed object that could have been repaired. Ordinary no-JSON / prose
replies (the common case) now return None silently instead of warning on
every call.
3. Resolve dependency-metadata inconsistency: json_repair is optional, so add
it to the `all` extra (it was already in `claude`/`qwen`) and demote it
from a hard requirement to an optional/commented entry in requirements.txt,
matching the project's convention for backend-specific deps.
Adds regression tests for prose-with-braces (-> None), no-warning-on-plain-
text, single-object repair, and multi-object ambiguity. Existing 22 json
tests still pass with and without json_repair installed.
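
Sketch of the string-aware outer scan and the single-object guard described in
items 1 and 2. This is a simplified stand-in, not the code in json_utils.py:
the names below are hypothetical, and only double-quoted strings are tracked.

```python
def top_level_brace_objects(text):
    """Return balanced top-level {...} spans that start outside quoted text."""
    objects = []
    depth = 0
    start = None
    in_string = False
    escaped = False
    for i, ch in enumerate(text):
        if in_string:
            # Inside a quoted run: braces are ignored, only the closing
            # quote (respecting backslash escapes) matters.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                objects.append(text[start:i + 1])
    return objects


def pick_repair_candidate(text):
    """Return the one unambiguous object, or None for zero or several."""
    objects = top_level_brace_objects(text)
    return objects[0] if len(objects) == 1 else None
```

Choosing the candidate first means a repair library only needs to be imported
when pick_repair_candidate() returns something; plain prose never reaches it.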
Co-Authored-By: Claude <noreply@anthropic.com>
---------
Co-authored-by: samuelgoofus-boop <260247789+samuelgoofus-boop@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A step whose minibatches yield ONLY execution-lapse notes produces no body
patches (analysts return empty-edits carriers, dropped by
_normalise_patches), so skip_no_patches / skip_no_rewrite would `continue`
before the appendix flush and silently discard every note of the step.
This hit exactly the feature's target regime (mature skill body, failures
classified as lapses): in c1_searchqa_def_g55_sar, 10/40 steps skipped
this way and lost 95 notes total.
Extract the flush block into _flush_skill_aware_appendix() and call it on
the normal update path (unchanged behavior) AND on both skip branches
before `continue`, so notes persist and appendix_notes.json /
step_rec counters are recorded for skipped steps too.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Split failure reflections into SKILL_DEFECT (body edit) vs EXECUTION_LAPSE
(protected appendix note that re-emphasizes an existing rule, never edited
by step-level analysts). Toggle: optimizer.use_skill_aware_reflection
(default false; baseline byte-identical when off).
- optimizer/appendix.py: protected APPENDIX region (inject/extract/append
with dedup), mirrors the slow_update protected-field pattern
- optimizer/skill_aware.py: analyst prompt augmentation, appendix_notes
parsing, threshold-gated LLM consolidation, and a process-wide runtime
switch (configure_skill_aware_reflection) set once by the trainer
- gradient/reflect.py: augment error/success analyst prompts at runtime;
None-sentinel kwargs resolve from the global switch, so env adapters
need no per-benchmark wiring (works for all envs, present and future)
- optimizer/skill.py: generalize the protected-region check to
(slow_update, appendix); edits inside any protected region are skipped
- engine/trainer.py: inject appendix at init, flush per-step
EXECUTION_LAPSE notes after the gate settles, optional consolidation
- tests: regression suite incl. toggle-off byte-identical guarantee and
env-independent global-switch resolution (6/6 passing + live smoke)
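
Rough sketch of a protected appendix region with inject/extract/append and
dedup, to illustrate the idea behind optimizer/appendix.py. The marker strings,
function names, the "- " note format and the case-insensitive dedup key are all
assumptions made for this example; it also assumes the appendix sits at the end
of the skill text.

```python
BEGIN = "<!-- APPENDIX:BEGIN -->"  # hypothetical marker
END = "<!-- APPENDIX:END -->"      # hypothetical marker


def extract_appendix(skill):
    """Split a skill into (body, notes); notes are the "- " lines in the region."""
    start = skill.find(BEGIN)
    end = skill.find(END)
    if start == -1 or end == -1 or end < start:
        return skill.rstrip("\n"), []
    body = skill[:start].rstrip("\n")
    inner = skill[start + len(BEGIN):end]
    notes = [line[2:] for line in inner.splitlines() if line.startswith("- ")]
    return body, notes


def inject_appendix(body, notes):
    """Rebuild the skill text with exactly one appendix region after the body."""
    lines = [body.rstrip("\n"), "", BEGIN]
    lines += ["- " + note for note in notes]
    lines.append(END)
    return "\n".join(lines) + "\n"


def append_notes(skill, new_notes):
    """Add notes to the appendix, skipping blanks and case-insensitive repeats."""
    body, notes = extract_appendix(skill)
    seen = {note.strip().lower() for note in notes}
    for note in new_notes:
        key = note.strip().lower()
        if key and key not in seen:
            seen.add(key)
            notes.append(note.strip())
    return inject_appendix(body, notes)
```

Because the body is split off before notes are appended, the body text is
carried through unchanged and repeated flushes never duplicate the region.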
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- slow_update force-inject now writes current_skill ONLY (best_skill stays a
faithful val-best snapshot, never receives un-validated slow_update content)
- after training, run one val on the final skill; if its gate score beats the
incumbent best, promote final to best (updates best_skill/best_step/best_origin)
- trainer now evaluates final skill on test itself (reuses best test result when
final==best); records final_selection_* and final_test_* in summary.json
- spreadsheetbench: head+tail truncate the post-execution verification report at
source to fix multi-MB conversation bloat
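
Minimal sketch of head+tail truncation as used for the verification report.
The function name, default sizes and the omission marker are assumptions for
illustration, not the values used in the spreadsheetbench code.

```python
def head_tail_truncate(text, head_chars=2000, tail_chars=2000):
    """Keep the first and last parts of `text`, replacing the middle with a marker."""
    if len(text) <= head_chars + tail_chars:
        return text  # short enough, nothing to cut
    omitted = len(text) - head_chars - tail_chars
    marker = f"\n... [{omitted} characters omitted] ...\n"
    # Slice the tail from an explicit index so tail_chars=0 yields an empty tail.
    return text[:head_chars] + marker + text[len(text) - tail_chars:]
```

Keeping both ends preserves the report header and the final verdict, which are
usually the most informative parts of a long report.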
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
A. SpreadsheetBench verification-feedback bloat
- rollout.py _auto_verify_output: use official _compare_cell_value (was
repr() equality, which falsely flagged 5 vs 5.0 / None vs ""); collapse
correct-and-empty cells into a count so large sparse answer ranges no
longer flood feedback with MBs of None=None noise.
- codegen_agent.py _build_eval_feedback: only list WRONG cells, collapse
correct ones into a count.
Scoring is unaffected (evaluate() is independent); this only fixes the
target model's multi-turn solving feedback.
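
Sketch of the feedback shape described in A: compare by value rather than
repr(), list only wrong cells, and collapse correct ones into a count. The
helper names are hypothetical, and values_equal is a simplified stand-in for
the official _compare_cell_value, whose exact rules are not reproduced here.

```python
def values_equal(a, b):
    """Simplified comparison: None and "" both count as empty; otherwise use ==."""
    empty = (None, "")
    if a in empty and b in empty:
        return True
    return a == b  # 5 == 5.0 is True, unlike repr(5) == repr(5.0)


def build_cell_feedback(expected, actual):
    """expected/actual map cell names to values; returns a compact text report."""
    wrong = []
    correct = 0
    for cell in sorted(expected):
        want = expected[cell]
        got = actual.get(cell)
        if values_equal(want, got):
            correct += 1
        else:
            wrong.append(f"{cell}: expected {want!r}, got {got!r}")
    # Correct cells are summarized as a single count instead of one line each.
    return "\n".join([f"{correct} cell(s) correct (not listed)."] + wrong)
```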
B. Remove optimizer-side truncation (bloat source now fixed)
- reflect.py: drop _MAX_TRAJ_CHARS cap and all per-field clips.
- update_modes.py / clip.py / lr_autonomous.py: describe_item /
short_item_summary no longer truncate; raise ranking/lr token budget.
- trainer.py _format_step_buffer: full task_ids / target.
- slow_update.py: full comparison samples.
C. Soft-disable gate
- config.py / trainer.py: use_gate=false no longer raises; validation still
runs but candidates are force-accepted (new force_accept branch + log).
Misc: aggregate.py merge token budget 4096 -> 16384.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PR #26 added a MiniMax chat backend but left three loose ends that
silently dropped any YAML / CLI configuration of minimax_* keys: only
the environment-variable path worked.
- skillopt/config.py: add 6 model.minimax_* entries to _FLATTEN_MAP so
the keys declared in configs/_base_/default.yaml actually survive
flatten_config() (mirroring the existing model.qwen_chat_* block).
- skillopt/engine/trainer.py: import configure_minimax_chat and call
it alongside configure_qwen_chat, so cfg-supplied credentials,
temperature, max_tokens, and enable_thinking reach the backend. Also
apply cfg["minimax_model"] via set_target_deployment when the active
target backend is minimax_chat.
- scripts/train.py: add 6 --minimax_* CLI flags + the corresponding
_CLI_TO_YAML entries, add 'minimax' / 'minimax_chat' to the --backend
choices, auto-route to target_backend=minimax_chat, and pick the
right default target_model for the new backend.
Default behavior on existing backends (openai, claude, qwen, codex,
claude_code_exec) is unchanged; all 8 shipped configs continue to load
with gate_metric falling back to 'hard' for paper reproduction.
Add optimizer.slow_update_gate_with_selection to control how epoch-boundary
slow-update guidance is applied:
- false (default): force-injected - inject guidance into current & best
unconditionally (unchanged behavior).
- true: gated - evaluate the slow-update candidate on the selection set and
accept/reject via the same validation gate as step-level updates
(logic follows the SkillReflection ablation).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add configurable gate metric (hard / soft / mixed) for skill validation
Default is `hard`, preserving exact pre-PR behavior — verified by 22 unit
assertions on the gate module plus an end-to-end 8-step trainer-trajectory
test that produces a bit-for-bit identical accept/reject sequence between
the pre-PR and post-PR code paths under `gate_metric: hard`. Paper-
reproduction results are unaffected.
`soft` and `mixed` are opt-in via `evaluation.gate_metric` in the config
and address small-selection-set runs where discrete hard accuracy is too
coarse to distinguish candidate skills.
The training gate currently always compares candidate vs. current/best
using *hard* exact-match accuracy. On environments with a small
held-out selection set (e.g. 3-6 items) or partial-credit scoring,
hard accuracy is too coarse: candidate skills that meaningfully
improve per-item soft scores get rejected because the discrete hard
count does not move.
Add three opt-in metrics so users can pick the one that matches their
scoring function:
- `gate_metric: hard` — original behavior (default, fully backward
compatible).
- `gate_metric: soft` — gate on the soft / F1 / partial-credit score.
- `gate_metric: mixed` — `(1 - w) * hard + w * soft`, where `w` is
set by `gate_mixed_weight` (default 0.5).
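
The three metrics can be sketched as a pure function. This follows the formula
above and the `select_gate_score(hard, soft, metric, mixed_weight)` signature,
but the body is an illustration rather than the code in gate.py; raising
ValueError for unknown metrics is an assumption.

```python
def select_gate_score(hard, soft, metric="hard", mixed_weight=0.5):
    """Project (hard, soft) scores onto the single score the gate compares."""
    if metric == "hard":
        return float(hard)
    if metric == "soft":
        return float(soft)
    if metric == "mixed":
        # (1 - w) * hard + w * soft, with w = mixed_weight
        return (1.0 - mixed_weight) * hard + mixed_weight * soft
    raise ValueError(f"unknown gate metric: {metric!r}")
```

With `mixed_weight` at 0.0 or 1.0 the mixed score reduces to the hard or soft
score respectively.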
Changes
-------
- `skillopt/evaluation/gate.py`: extend `evaluate_gate` with
`cand_soft`, `metric`, and `mixed_weight` keyword arguments; add a
pure helper `select_gate_score(hard, soft, metric, mixed_weight)`.
Defaults preserve the original `metric="hard"` behavior — existing
callers that only pass `cand_hard` keep working unchanged.
- `skillopt/evaluation/__init__.py`: export the new helper / type.
- `skillopt/engine/trainer.py`: read `evaluation.gate_metric` and
`evaluation.gate_mixed_weight` from the config (with safe defaults),
pass both metrics into `evaluate_gate`, and project the baseline
`current_score` / `best_score` into metric space so subsequent
comparisons are consistent. Print the gate metric on the
`[6/6 EVALUATE]` line so logs make the decision basis explicit. The
selection cache still records both `(hard, soft)` so a metric change
on resume is non-destructive.
- `configs/_base_/default.yaml`: document and ship the new keys with
backward-compatible defaults (`hard`, `0.5`).
Backward compatibility
----------------------
- Default config does not change behavior: `gate_metric` defaults to
`hard`, exactly matching the previous gate.
- `evaluate_gate(...)` keeps its existing positional signature; the
new parameters are keyword-only with safe defaults.
- `step_record.json` gains optional `gate_metric` and
`candidate_gate_score` fields; old records still load.
Tested
------
- Unit-tested all three metrics + boundary `mixed_weight` values
(0.0 / 1.0) and rejection of unknown metric strings. All six cases
pass.
- Verified `skillopt.engine.trainer` imports cleanly after the
refactor.
int() truncates any float in [0, 1) to 0, which discards fractional scores.
Replace it with float(). Also fix the falsy-float check in failure detection.
Backward compatible with binary hard=0/1.
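
Small sketch of the two pitfalls. The function names are hypothetical, and the
failure rule shown (anything below full credit counts as a failure) is one
plausible reading of the fix, not a quote of the actual code.

```python
def parse_hard_score(raw):
    """Keep fractional scores: int(0.75) collapses to 0, float(0.75) stays 0.75."""
    return float(raw)


def is_failure(hard):
    """Compare explicitly; `not hard` would treat 0.5 as a success."""
    return float(hard) < 1.0
```

Binary inputs behave as before: parse_hard_score(1) is 1.0 and is_failure(0)
is True.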
- Skill optimization framework with training loop analogy
- 11 benchmarks, 4 model backends (Azure OpenAI, Claude, Codex, Qwen)
- WebUI for browser-based training control
- Pluggable architecture for extending benchmarks and backends