* Robustness for the claude/codex backends on Windows: argv overflow, subprocess encoding, tolerant JSON, test-eval dirs
Fixes surfaced running SkillOpt end-to-end on the bundled `claude` backend
(local Claude CLI) on Windows. None changes the OpenAI/GPT happy path.
1. skillopt/engine/trainer.py — the final test-eval directory
(test_eval_final/) is written to before being created; add
os.makedirs(..., exist_ok=True), matching the two sibling test-eval dirs.
Without it, summary.json raises FileNotFoundError when a rollout yields
zero predictions.
2. skillopt/model/claude_backend.py
a. Pass the prompt via stdin (not argv): on Windows the whole command line
is capped at ~32 KB and a large optimizer prompt (the success-analyst
minibatch carrying several report trajectories) overflows it with
[WinError 206], killing the run after retries.
b. Pass the system prompt via --append-system-prompt-file (a temp file),
not argv. The system prompt here is the skill being optimized, which
SkillOpt grows over training; since the ~32 KB cap applies to the SUM of
all argv, a grown skill would re-hit [WinError 206] even with the prompt
on stdin.
c. Pin the subprocess encoding to utf-8 (errors="replace"). With text=True
and no encoding=, stdin is encoded with the system codepage; on a zh-CN
box (cp936/GBK) a prompt containing an emoji or some Latin-1 characters
raises UnicodeEncodeError before the CLI even starts, failing every retry.
3. skillopt/model/codex_backend.py — the same utf-8 encoding pin on its
subprocess.run(input=...) call (identical unpinned-encoding pattern).
4. skillopt/utils/json_utils.py — extract_json() returned None for valid-
looking JSON that strict json.loads rejects (unescaped ASCII quotes inside
CJK string values, trailing commas), silently dropping the analyst's edits
on non-schema backends (Claude/Qwen): reflect produces N edits, 0 applied.
Add a json_repair fallback, but only on a single unambiguous object — a
balanced-brace extractor plus a refuse-on-multiple-objects guard — so a
chain-of-thought "scratch + final" response can't make repair silently
return the wrong (discarded) object, which would be worse than None (None is
detectable and retryable; a wrong-but-valid edit is applied blind). Declare
json_repair in requirements.txt and the claude/qwen optional extras so the
fallback is actually present (it otherwise no-ops, dropping edits silently).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(cherry picked from commit dca74a683e)
* fix(json_utils): harden tolerant JSON fallback from PR #77
Follow-up fixes on top of the cherry-picked Windows-robustness change:
1. Make _top_level_brace_objects() fully string-aware in its OUTER scan, not
just inside an object. A '{' inside quoted prose (e.g. '"set it to {x}"')
no longer starts a candidate object, so extract_json() returns None for
prose pseudo-JSON instead of repairing it into a bogus dict — which would
be strictly worse than dropping the edit, since extract_json feeds the
optimizer's skill edits.
2. Pick the repair candidate BEFORE importing json_repair, so the missing-
dependency RuntimeWarning only fires when there is genuinely a single
malformed object that could have been repaired. Ordinary no-JSON / prose
replies (the common case) now return None silently instead of warning on
every call.
3. Resolve dependency-metadata inconsistency: json_repair is optional, so add
it to the `all` extra (it was already in `claude`/`qwen`) and demote it
from a hard requirement to an optional/commented entry in requirements.txt,
matching the project's convention for backend-specific deps.
Adds regression tests for prose-with-braces (-> None), no-warning-on-plain-
text, single-object repair, and multi-object ambiguity. Existing 22 json
tests still pass with and without json_repair installed.
Co-Authored-By: Claude <noreply@anthropic.com>
---------
Co-authored-by: samuelgoofus-boop <260247789+samuelgoofus-boop@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two bugs made local vLLM targets score acc=0.000: the router did not
forward 'timeout' to the Qwen backend (so runs used the 300s default),
and qwen_backend always injected chat_template_kwargs.enable_thinking,
which non-Qwen vLLM servers reject or answer with <think> output and no
<answer> tag. Forward timeout and only set the field when enabled.
Closes#28
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Port qwen_backend.py pattern to minimax_backend.py as a new
OpenAI-compatible urllib-based backend. Includes:
- BASE_URL defaulting to https://api.minimax.chat/v1
- API_KEY, TIMEOUT_SECONDS, MAX_TOKENS, TEMPERATURE env vars
- ENABLE_THINKING support (MiniMax thinking mode)
- configure_minimax_chat() runtime configurator
- chat_target() and chat_target_messages() functions
- TokenTracker integration and get_token_summary()
- set_target_deployment() support
- Default model: MiniMax/MiniMax-Text-01
Claude Code CLI v2.x renamed the flag; passing --thinking low causes
all rollout calls to fail on CLI 2.1.87+.
Co-authored-by: Cursor <cursoragent@cursor.com>
- Add 'openai_compatible', 'compat', and 'openai' auth modes to azure_openai.py
- Modify _make_client() to use OpenAI client (not AzureOpenAI) for compatible endpoints
- Update type hints to support both AzureOpenAI and OpenAI clients
- Auto-configure API version sentinel when using compatible modes
- Add .env template for Pioneer.ai configuration
This allows users to use Pioneer.ai or any OpenAI-compatible API endpoint
as both optimizer and target backend without requiring Azure OpenAI.
Resolves: Support for non-Azure OpenAI-compatible providers
- Skill optimization framework with training loop analogy
- 11 benchmarks, 4 model backends (Azure OpenAI, Claude, Codex, Qwen)
- WebUI for browser-based training control
- Pluggable architecture for extending benchmarks and backends