PR #92 added a per-cycle diagnostics.json that surfaces backend stderr,
optimizer replies, and task responses so a 0.0 night is self-diagnosing.
Those free-text fields can carry credentials (e.g. a codex 401 stderr dump
containing an auth token), so persisting them verbatim was a new on-disk
leak surface.
- Add a shared redact_secrets() in staging.py and route diagnostics.json's
call_error / reflect_raw_head / holdout_detail through it before writing.
- Redact the codex and Claude auth-error log lines too (a secondary sink
when a file log handler is attached); last_call_error stays raw in memory
so _AUTH_MARKERS matching is unaffected.
- Centralize _SECRET_PATTERNS in staging.py (harvest_codex now reuses them)
and extend coverage to AWS / GitHub / Slack / Google / JWT token shapes.
- Tests: secret-shape coverage, private-key blocks, recursive/scalar
passthrough, no over-redaction of plain prose, fail-fast auth-error log
redaction, and an end-to-end check that diagnostics.json has no secret.
Observability-only; the gate and learning algorithm are unchanged.
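
A minimal sketch of the redaction pass described above, for illustration
only. The regexes below are assumptions about common token shapes, not the
actual _SECRET_PATTERNS in staging.py, and the "[REDACTED]" placeholder is
hypothetical.

```python
import re

# Illustrative token shapes only; the real _SECRET_PATTERNS may differ.
_SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                       # AWS access key ID
    re.compile(r"gh[pousr]_[A-Za-z0-9]{36,}"),             # GitHub token
    re.compile(r"xox[baprs]-[A-Za-z0-9-]{10,}"),           # Slack token
    re.compile(r"eyJ[\w-]+\.eyJ[\w-]+\.[\w-]+"),           # JWT (three segments)
    re.compile(r"(?i)bearer\s+[A-Za-z0-9._~+/-]{16,}=*"),  # bearer credential
]


def redact_secrets(value):
    """Recursively mask secret-shaped substrings; other scalars pass through."""
    if isinstance(value, str):
        for pattern in _SECRET_PATTERNS:
            value = pattern.sub("[REDACTED]", value)
        return value
    if isinstance(value, dict):
        return {key: redact_secrets(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [redact_secrets(item) for item in value]
    return value  # int, float, bool, None: unchanged
```

Applying the function to the whole diagnostics payload right before it is
serialized keeps the in-memory values raw, which is what lets marker matching
on last_call_error keep working.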
Co-Authored-By: Claude <noreply@anthropic.com>
Splits CodexCliBackend._call into _call_once + a retry wrapper so transient empties/timeouts are retried instead of silently scored 0, and fails fast on fatal auth/model/version errors (401, refresh_token_reused, token_expired, ChatGPT-account-unsupported, newer-Codex-required). On non-zero exit the CLI error text is surfaced via last_call_error instead of being returned as a model response. Adds per-cycle diagnostics.json (observability only; gate and learning algorithm unchanged) so a 0.0 night self-explains.
Add a regression test to ensure the validation gate rejects
reward-hacking skill edits. Optimizers sometimes propose shortcuts that
improve train/replay metrics without improving held-out behavior. This
test codifies that the gate blocks such edits.
Add TestVerifierDiscipline to the test_sleep_engine.py suite:
- Create MockRewardHackingBackend that simulates a reward-hacking rule
which passes the train set but degrades the held-out tasks.
- Assert that the proposed edit is rejected by the gate.
Follow-up from a fresh-context review of the prior commit: CodexCliBackend.attempt_with_tools
(the rollout path for tool-requiring tasks) ran codex exec inline, swallowed all exceptions,
and never set last_call_error — so an auth/model/version failure on the tool path still
produced a silent empty->0 with no diagnostic signal, the exact failure class the prior commit
fixed for the _call path. Now it surfaces timeout/exception/non-zero-exit via last_call_error
(response stays empty; never leaks the CLI error text), so a failed tool rollout shows up in
diagnostics.json. Adds a regression test.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A nightly sleep cycle could run for weeks emitting held-out 0.0 -> 0.0 (gate reject, zero
edits), indistinguishable from "nothing to learn", when the real cause was the codex backend
returning an error (expired auth / model unsupported on the account / outdated CLI) that got
scored as a failed rollout.
backend (CodexCliBackend):
- split _call into _call_once + a retry wrapper: transient empties/timeouts are retried
instead of silently returning "" (mirrors AzureOpenAIBackend's guard);
- on a non-zero exit, surface the reason via last_call_error and return "" rather than
leaking the CLI error text as if it were a model response;
- fail fast (no retries) on fatal auth/model/version errors (401, refresh_token_reused,
token_expired, "not supported when using Codex with a ChatGPT account",
"requires a newer version of Codex").
backend (CliBackend.reflect): retain last_reflect_raw so a no-edits night is diagnosable.
consolidate: ConsolidationResult now carries per-task held-out detail (response, hard/soft,
fail_reason) + reflect_raw + call_error.
cycle: write diagnostics.json per cycle so a 0.0 night self-explains instead of being a black box.
tests: 4 new (retry-not-silent-zero, auth-error-surfaced-not-scored, holdout-detail, reflect-raw).
Also gitignore the .skillopt-sleep/ runtime dir.
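
The retry / fail-fast split above can be sketched roughly as follows. This is
not the CodexCliBackend code: RetryingBackend, the (returncode, stdout,
stderr) shape of call_once, max_attempts, and the choice to retry non-fatal
non-zero exits are all assumptions made for the example. Only the marker
strings come from this change.

```python
FATAL_MARKERS = (
    "401",
    "refresh_token_reused",
    "token_expired",
    "not supported when using Codex with a ChatGPT account",
    "requires a newer version of Codex",
)


class RetryingBackend:
    """Hypothetical stand-in for the _call_once + retry-wrapper split."""

    def __init__(self, call_once, max_attempts=3):
        self._call_once = call_once      # returns (returncode, stdout, stderr)
        self.max_attempts = max_attempts
        self.last_call_error = None

    def call(self, prompt):
        self.last_call_error = None
        for _ in range(self.max_attempts):
            try:
                code, out, err = self._call_once(prompt)
            except TimeoutError as exc:
                self.last_call_error = f"timeout: {exc}"
                continue                 # transient: try again
            if code != 0:
                # Surface the reason, never return it as a model response.
                self.last_call_error = err.strip() or f"exit code {code}"
                if any(marker in err for marker in FATAL_MARKERS):
                    break                # fatal: retrying cannot help
                continue
            if out.strip():
                self.last_call_error = None
                return out
            self.last_call_error = "empty response"
        return ""
```

The caller can then tell a genuinely failed rollout from a backend failure by
checking last_call_error whenever the response is empty.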
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Mirror the copilot MCP server: same rich _TOOL_SCHEMA (source, model,
tasks_file, target_skill_path, max_sessions, max_tasks, lookback_hours,
auto_adopt, json, edit_budget, hour, minute) and generic flag forwarding, plus
sleep_schedule / sleep_unschedule. Devin specifics retained: the ATIF-v1.7
harvest step (run before data-reading actions, engine pointed at it via
--claude-home, default --source claude) and post-adopt sync into .devin/skills/.
Tests + README + rules snippet updated for the 7-tool interface.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Review fixes:
- Path bug: SKILLOPT_DEVIN_CLAUDE_HOME (and SKILLOPT_SLEEP_REPO) read from the
env are now wrapped in os.path.expanduser, so the documented "~/..." config
no longer passes a literal ~ to --claude-home (which yielded zero mined
sessions). expanduser on an absolute default is a no-op.
- tests/test_devin_plugin.py: tool-schema completeness, action→subcommand map,
backend enum, the CLAUDE_HOME expansion regression, and an ATIF-v1.7 harvest
shape test against a bundled fixture.
- plugins/devin/fixtures/devin_sample.json: sample ATIF-v1.7 transcript.
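
For reference, the path fix in the first bullet amounts to something like the
sketch below. resolve_claude_home and its default path are hypothetical names
used only for this example; the environment variable name is the one from
this change.

```python
import os


def resolve_claude_home(env, default="/opt/devin/claude-home"):
    """Read the configured home and expand a leading '~' before use."""
    raw = env.get("SKILLOPT_DEVIN_CLAUDE_HOME", default)
    # expanduser leaves absolute paths untouched, so the default is safe.
    return os.path.expanduser(raw)
```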
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Follow-up to the string-aware brace scan: that change only skipped
double-quoted prose, so brace-shaped text in single quotes, backticks, or
bare prose (e.g. `{op: delete}`, '{x: 1}') still reached json_repair and was
fabricated into a bogus dict — strictly worse than None, since extract_json
feeds the optimizer's skill edits.
Add a _looks_json_like() guard before repair: a genuine JSON object's first
non-space char after `{` is `"` (a key) or `}` (empty). Prose pseudo-objects
start with a bare word and are rejected, while legitimate repair targets
(trailing commas, unescaped quotes inside string values) all begin with `"`
and pass — including objects whose string VALUES contain single quotes or
backticks, which must not be rejected.
Found by an independent GPT-5.5 re-review of the merged #79 code. Adds
regression tests for single-quoted / backticked / bare prose (-> None) and
for legitimate objects with quote/backtick string values (still repaired).
Tests: 30 pass (+3 skip) without json_repair, 33 pass with it, both clean
under -W error::RuntimeWarning.
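
The guard described above reduces to a check on the first significant
character after the opening brace. A minimal sketch, assuming the candidate
has already been isolated as a single brace-delimited span (the function name
here drops the leading underscore and is not the code in json_utils.py):

```python
def looks_json_like(candidate):
    """True if the first non-space char after '{' is '"' (a key) or '}'."""
    stripped = candidate.lstrip()
    if not stripped.startswith("{"):
        return False
    rest = stripped[1:].lstrip()
    return rest[:1] in ('"', "}")


# Prose pseudo-objects start with a bare word or a single quote: rejected.
assert not looks_json_like("{op: delete}")
assert not looks_json_like("{'x': 1}")
# Legitimate repair targets begin with a double-quoted key: accepted.
assert looks_json_like('{"note": "it\'s `fine`",}')
assert looks_json_like("{ }")
```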
Co-authored-by: Claude <noreply@anthropic.com>
* Robustness for the claude/codex backends on Windows: argv overflow, subprocess encoding, tolerant JSON, test-eval dirs
Fixes surfaced running SkillOpt end-to-end on the bundled `claude` backend
(local Claude CLI) on Windows. None of them changes the OpenAI/GPT happy path.
1. skillopt/engine/trainer.py — the final test-eval directory
(test_eval_final/) is written to before being created; add
os.makedirs(..., exist_ok=True), matching the two sibling test-eval dirs.
Without it, summary.json raises FileNotFoundError when a rollout yields
zero predictions.
2. skillopt/model/claude_backend.py
a. Pass the prompt via stdin (not argv): on Windows the whole command line
is capped at ~32 KB and a large optimizer prompt (the success-analyst
minibatch carrying several report trajectories) overflows it with
[WinError 206], killing the run after retries.
b. Pass the system prompt via --append-system-prompt-file (a temp file),
not argv. The system prompt here is the skill being optimized, which
SkillOpt grows over training; since the ~32 KB cap applies to the SUM of
all argv, a grown skill would re-hit [WinError 206] even with the prompt
on stdin.
c. Pin the subprocess encoding to utf-8 (errors="replace"). With text=True
and no encoding=, stdin is encoded with the system codepage; on a zh-CN
box (cp936/GBK) a prompt containing an emoji or some Latin-1 characters
raises UnicodeEncodeError before the CLI even starts, failing every retry.
3. skillopt/model/codex_backend.py — the same utf-8 encoding pin on its
subprocess.run(input=...) call (identical unpinned-encoding pattern).
4. skillopt/utils/json_utils.py — extract_json() returned None for valid-
looking JSON that strict json.loads rejects (unescaped ASCII quotes inside
CJK string values, trailing commas), silently dropping the analyst's edits
on non-schema backends (Claude/Qwen): reflect produces N edits, 0 applied.
Add a json_repair fallback, but only on a single unambiguous object — a
balanced-brace extractor plus a refuse-on-multiple-objects guard — so a
chain-of-thought "scratch + final" response can't make repair silently
return the wrong (discarded) object, which would be worse than None (None is
detectable and retryable; a wrong-but-valid edit is applied blind). Declare
json_repair in requirements.txt and the claude/qwen optional extras so the
fallback is actually present (it otherwise no-ops, dropping edits silently).
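
Items 2a, 2c and 3 share one pattern: send the prompt on stdin and pin the
pipe encoding. A small sketch of that pattern follows. run_cli_with_stdin and
the ECHO stand-in command are hypothetical, and a Python child process is
used in place of the real CLI so the example runs anywhere.

```python
import subprocess
import sys


def run_cli_with_stdin(argv, prompt, timeout=60):
    """Pass the prompt on stdin (not argv) with a pinned UTF-8 encoding."""
    return subprocess.run(
        argv,
        input=prompt,          # avoids the Windows command-line length cap
        capture_output=True,
        text=True,
        encoding="utf-8",      # do not fall back to the system codepage
        errors="replace",      # undecodable output bytes never raise
        timeout=timeout,
    )


# Stand-in child that echoes its stdin bytes back unchanged.
ECHO = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())",
]

result = run_cli_with_stdin(ECHO, "总结 ✅ café " * 5000)
assert result.returncode == 0
```

The prompt in the usage lines is larger than 32 KB and contains CJK text and
an emoji, which is the combination that failed before the fix.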
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(cherry picked from commit dca74a683e)
* fix(json_utils): harden tolerant JSON fallback from PR #77
Follow-up fixes on top of the cherry-picked Windows-robustness change:
1. Make _top_level_brace_objects() fully string-aware in its OUTER scan, not
just inside an object. A '{' inside quoted prose (e.g. '"set it to {x}"')
no longer starts a candidate object, so extract_json() returns None for
prose pseudo-JSON instead of repairing it into a bogus dict — which would
be strictly worse than dropping the edit, since extract_json feeds the
optimizer's skill edits.
2. Pick the repair candidate BEFORE importing json_repair, so the missing-
dependency RuntimeWarning only fires when there is genuinely a single
malformed object that could have been repaired. Ordinary no-JSON / prose
replies (the common case) now return None silently instead of warning on
every call.
3. Resolve dependency-metadata inconsistency: json_repair is optional, so add
it to the `all` extra (it was already in `claude`/`qwen`) and demote it
from a hard requirement to an optional/commented entry in requirements.txt,
matching the project's convention for backend-specific deps.
Adds regression tests for prose-with-braces (-> None), no-warning-on-plain-
text, single-object repair, and multi-object ambiguity. Existing 22 json
tests still pass with and without json_repair installed.
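
A string-aware scan of the kind described in item 1 might look like the
sketch below. It is an illustration under simplifying assumptions (only
double-quoted strings are tracked, and the function names are stand-ins for
the private helpers), not the code in json_utils.py.

```python
def top_level_brace_objects(text):
    """Return balanced top-level {...} spans, ignoring braces inside
    double-quoted strings both outside and inside an object."""
    objects = []
    depth = 0
    start = None
    in_string = False
    escaped = False
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                objects.append(text[start:i + 1])
    return objects


def single_repair_candidate(text):
    """Pick a repair target only when exactly one object is present."""
    objects = top_level_brace_objects(text)
    return objects[0] if len(objects) == 1 else None
```

Choosing the candidate first, and importing the repair library only when this
returns a span, matches the ordering described in item 2.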
Co-Authored-By: Claude <noreply@anthropic.com>
---------
Co-authored-by: samuelgoofus-boop <260247789+samuelgoofus-boop@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ClaudeCliBackend._call() and attempt_with_tools() hardcoded --bare,
which skips Claude CLI's credential resolution. This broke subscription-
token auth: every model call silently returned "Not logged in" and
scored 0 — the user saw "baseline 0.0 → candidate 0.0, gate reject"
with no indication of an auth failure.
Fix: only pass --bare when ANTHROPIC_API_KEY is set. The remaining
isolation flags (--disable-slash-commands, --disallowedTools,
--exclude-dynamic-system-prompt-sections, clean temp cwd) already
provide the needed isolation without --bare.
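
The conditional can be sketched as below. build_claude_argv is a hypothetical
helper and the argument list is abbreviated to flags named in this log; the
real _call() passes more.

```python
def build_claude_argv(env):
    """Add --bare only when an API key is present in the environment."""
    argv = ["claude", "-p", "--disable-slash-commands"]
    if env.get("ANTHROPIC_API_KEY"):
        # API-key auth does not need the CLI's credential resolution.
        argv.append("--bare")
    return argv
```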
Also adds _detect_cli_error() to log a warning when CLI output matches
known auth error patterns, so auth failures surface loudly instead of
deflating every score to 0.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Adds honest tool-call detection for CopilotCliBackend, mirroring the
Claude/Codex backends. Writes per-tool executable shims into the work dir
and detects real invocations from a calllog (not self-reported markers).
The Copilot backend is Windows-validated, so shims are cross-platform:
a .cmd batch shim on Windows and a chmod'd bash shim on POSIX, with an
OS-specific tool hint. Mirrors _call's flags/env (isolated COPILOT_HOME,
--allow-all-tools, MCP/instruction disabling) and the UTF-8 subprocess fix.
Adds test_attempt_with_tools_honest_detection: a CI-friendly, OS-aware
stub stands in for the CLI, runs the shim, and asserts both JSONL parsing
and log-based detection. Validated live on Windows (real Copilot call) and
on Linux/WSL (POSIX path).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Split failure reflections into SKILL_DEFECT (body edit) vs EXECUTION_LAPSE
(protected appendix note that re-emphasizes an existing rule, never edited
by step-level analysts). Toggle: optimizer.use_skill_aware_reflection
(default false; baseline byte-identical when off).
- optimizer/appendix.py: protected APPENDIX region (inject/extract/append
with dedup), mirrors the slow_update protected-field pattern
- optimizer/skill_aware.py: analyst prompt augmentation, appendix_notes
parsing, threshold-gated LLM consolidation, and a process-wide runtime
switch (configure_skill_aware_reflection) set once by the trainer
- gradient/reflect.py: augment error/success analyst prompts at runtime;
None-sentinel kwargs resolve from the global switch, so env adapters
need no per-benchmark wiring (works for all envs, present and future)
- optimizer/skill.py: generalize the protected-region check to
(slow_update, appendix); edits inside any protected region are skipped
- engine/trainer.py: inject appendix at init, flush per-step
EXECUTION_LAPSE notes after the gate settles, optional consolidation
- tests: regression suite incl. toggle-off byte-identical guarantee and
env-independent global-switch resolution (6/6 passing + live smoke)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Open-source-tool / research-code separation:
- git mv skillopt/sleep/ -> skillopt_sleep/ (top-level, sibling to the research
skillopt/ package). History preserved as renames.
- All imports skillopt.sleep.* -> skillopt_sleep.*.
- Vendor the validation gate into skillopt_sleep/gate.py (a self-contained copy
of skillopt.evaluation.gate). The engine now has ZERO dependency on the
research package — verified: grep finds no `from skillopt.` in skillopt_sleep/,
and consolidate's gate resolves to skillopt_sleep.gate.
- Plugin scripts/commands/skill call `-m skillopt_sleep`.
29 tests pass; `python -m skillopt_sleep` runs standalone.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
- ReplayResult records per-rollout tokens + latency_ms; replay_one measures them
(approximated from text length when the backend doesn't track tokens, e.g. mock).
- replay.multi_objective_reward(w_acc, w_tokens, w_latency): weighted reward so a
skill can be optimized to be cheaper/faster, not only more accurate (cost terms
normalized vs a reference, default = accuracy-only / backward compatible).
- Backend.preferences (free text) injected into reflect as a prior; build_backend
attaches it (to the optimizer for dual backends). run_gbrain gains --preferences.
3 new tests (multi-objective ordering, preference injection, cost recording).
29 tests pass; mock gates + 3.8/3.12 compile green.
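
One plausible shape for the weighted reward, as a sketch. The linear form,
the reference values, and the parameter names other than w_acc / w_tokens /
w_latency are assumptions; the log only states that cost terms are normalized
against a reference and that the default is accuracy-only.

```python
def multi_objective_reward(accuracy, tokens, latency_ms,
                           w_acc=1.0, w_tokens=0.0, w_latency=0.0,
                           ref_tokens=1000.0, ref_latency_ms=1000.0):
    """Accuracy minus weighted, reference-normalized cost terms."""
    token_cost = tokens / ref_tokens
    latency_cost = latency_ms / ref_latency_ms
    return w_acc * accuracy - w_tokens * token_cost - w_latency * latency_cost


# Default weights ignore cost, so existing accuracy-only behavior is kept.
assert multi_objective_reward(0.8, tokens=5000, latency_ms=9000) == 0.8
# With a token weight, the cheaper of two equally accurate rollouts wins.
cheap = multi_objective_reward(0.8, 500, 1000, w_tokens=0.1)
pricey = multi_objective_reward(0.8, 2000, 1000, w_tokens=0.1)
assert cheap > pricey
```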
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
The "mental simulation" core the user described: re-run the same task many times and
learn from the contrast between good and bad rollouts:
- rollout.py: multi_rollout(task, k) runs K scored attempts; RolloutSet exposes
best/worst/spread/pass_rate. contrastive_reflect picks the highest-spread
tasks (some attempts passed, some failed — most informative) and asks the
optimizer what the GOOD attempts did that the BAD ones didn't, distilling a
general rule. Far stronger signal than a single failure.
- consolidate(rollouts_k>1) uses contrastive reflection (falls back to
single-shot reflect if it yields nothing).
- budget.py: Budget(max_tokens|max_minutes) tracks spend; plan_depth() derives
(nights, rollouts_k) from a token budget. run_gbrain gains --rollouts-k,
--budget-tokens, --budget-minutes (auto-plans depth).
3 new tests (rollout stats, budget+plan, contrastive stub). 26 tests pass.
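
The rollout statistics and the highest-spread selection can be illustrated
with the sketch below. The field names, the pass threshold of 1.0, and the
most_informative helper are assumptions for the example, not the contents of
rollout.py.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RolloutSet:
    task_id: str
    scores: List[float]
    pass_threshold: float = 1.0   # assumption: a rollout passes at full score

    @property
    def best(self):
        return max(self.scores)

    @property
    def worst(self):
        return min(self.scores)

    @property
    def spread(self):
        return self.best - self.worst

    @property
    def pass_rate(self):
        passed = sum(1 for s in self.scores if s >= self.pass_threshold)
        return passed / len(self.scores)


def most_informative(rollout_sets, top_n=1):
    """Keep tasks with mixed outcomes, highest spread first."""
    mixed = [r for r in rollout_sets if 0.0 < r.pass_rate < 1.0]
    return sorted(mixed, key=lambda r: r.spread, reverse=True)[:top_n]
```

Tasks where every attempt passed or every attempt failed carry no contrast,
which is why the sketch filters them out before ranking.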
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Bring SkillOpt's epoch-wise slow/meta update (paper §3.6) into the sleep engine
as skillopt/sleep/slow_update.py — import-light, driven through the Backend
abstraction (mock/claude/codex):
- Reuses the main repo's protected-field markers
<!-- SLOW_UPDATE_START --> ... <!-- SLOW_UPDATE_END --> so the artifact is
compatible; step-level edits never touch this field.
- run_slow_update compares behavior under the first-night vs final skill across
the val tasks, groups into improved/regressed/persistent/stable, and asks the
optimizer to distill durable longitudinal guidance (refining prior text).
- Wired into run_gbrain.run_seed AFTER the nights loop, gated by slow_update=True
and run REGARDLESS of gate_mode — this is what preserves long-term memory even
when the user turns the hard gate OFF (the user's slot_date=slow-update intent).
2 new tests (protected-field round-trip, stub-backend synthesis). 23 tests pass.
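
A round-trip over the protected field could look like this sketch. The two
marker strings are the ones quoted above; the helper names and the choice to
append the block at the end when it is missing are assumptions.

```python
import re

START = "<!-- SLOW_UPDATE_START -->"
END = "<!-- SLOW_UPDATE_END -->"
_BLOCK = re.compile(re.escape(START) + r"(.*?)" + re.escape(END), re.DOTALL)


def extract_slow_update(skill):
    """Return the guidance inside the protected field, or '' if absent."""
    match = _BLOCK.search(skill)
    return match.group(1).strip() if match else ""


def inject_slow_update(skill, guidance):
    """Replace the protected field, or append one if the skill has none."""
    block = f"{START}\n{guidance.strip()}\n{END}"
    if _BLOCK.search(skill):
        # A callable replacement avoids backslash processing in the guidance.
        return _BLOCK.sub(lambda _m: block, skill, count=1)
    return skill.rstrip("\n") + "\n\n" + block + "\n"
```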
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Data-split refactor (the anti-overfitting foundation the user asked for):
- TaskRecord gains split∈{train,val,test} and origin∈{real,dream}.
- assign_splits: real tasks deterministically split into val/test (disjoint);
DREAM-augmented tasks (origin='dream') NEVER enter val/test — they only go to
train. val gates updates; test is the final held-out measure.
- gbrain loader maps its held-out.jsonl -> test, benchmark.jsonl -> train/val,
so the gbrain held-out stays the true final score.
- consolidate(): train drives reflect, val gates; adds gate_mode='off' (greedy,
no hard filter) reporting val movement (greedy_improved/regressed/flat).
- run_gbrain/transfer/experiment score on test (val fallback); run_gbrain gains
--gate on|off. Legacy replay/holdout names normalized.
New test proves dream tasks never land in val/test. 21 tests pass; mock
experiment + gate=off both green.
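
The split invariant can be sketched as follows. This is a simplification:
assign_split, the hash-based bucketing, and val_fraction are assumptions, and
the sketch only covers the rule that dream tasks stay in train while real
tasks are split deterministically between val and test.

```python
import hashlib


def assign_split(task_id, origin, val_fraction=0.5):
    """Dream tasks always train; real tasks hash into val or test."""
    if origin == "dream":
        return "train"
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    bucket = digest[0] / 256.0          # stable across runs and machines
    return "val" if bucket < val_fraction else "test"
```

Hashing the task id rather than using a random draw keeps the assignment
reproducible, so a task cannot drift between val and test across nights.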
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
The 4th gbrain seed (quick-answerer) is judged by tool_called=search: the agent
must ACTUALLY call a search tool. Add an honest tool loop:
- Backend.attempt_with_tools(task, skill, memory, tools) -> (response, tools_called)
- Claude: exposes a real ./search shell shim, runs with --allowedTools Bash in a
clean cwd; detects the call from the shim's log (not a self-reported marker).
- Codex: same shim under `exec --sandbox workspace-write`.
- Mock: deterministic — "calls" a tool iff skill/memory instructs it (for CI).
- replay_one routes tasks with a tool_called check through the tool loop and
feeds detected calls to the rule judge; ReplayResult gains tools_called.
Verified live (Claude haiku): deficient skill -> tools_called=[] hard=0;
learned "must run ./search" rule -> tools_called=['search'] hard=1.0.
20 tests pass.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Three additions driven by the goal of price-aware, model-flexible sleep:
1. DualBackend + build_backend(): route attempt->TARGET model and
reflect/judge->OPTIMIZER model (SkillOpt's target-vs-optimizer split).
gbrain runner gains --optimizer-backend/-model + --target-backend/-model.
2. run_transfer.py: sleep-scenario cross-model transfer. Optimize a skill on a
SOURCE model (e.g. cheap haiku), freeze it, evaluate held-out on a TARGET
model (e.g. expensive sonnet) with no further optimization — plus a direct
reference. Mirrors the SkillOpt paper's transfer table; quantifies the
"optimize cheap overnight, deploy anywhere" value prop.
3. llm_miner.py: turn real harvested transcripts into TaskRecords WITH checkable
rule/rubric judges, wired into the cycle for non-mock backends, so real-data
lift becomes measurable (heuristic miner remains the no-API fallback).
Fixed a str.format brace bug the new unit test caught.
19 tests pass.
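
The routing in item 1 can be pictured with the sketch below. The method
signatures and the StubBackend class are hypothetical and exist only to show
which model receives which call.

```python
class StubBackend:
    """Hypothetical backend that tags each reply with its own name."""

    def __init__(self, name):
        self.name = name

    def attempt(self, task, skill):
        return f"{self.name}:attempt"

    def reflect(self, failures, skill):
        return f"{self.name}:reflect"

    def judge(self, task, response):
        return f"{self.name}:judge"


class DualBackend:
    """Route attempt to the target model, reflect/judge to the optimizer."""

    def __init__(self, target, optimizer):
        self.target = target
        self.optimizer = optimizer

    def attempt(self, task, skill):
        return self.target.attempt(task, skill)

    def reflect(self, failures, skill):
        return self.optimizer.reflect(failures, skill)

    def judge(self, task, response):
        return self.optimizer.judge(task, response)
```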
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Upgrade from mock-only to REAL multi-backend validation:
Backends (skillopt/sleep/backend.py):
- CliBackend base: shared attempt/judge/reflect prompts, response cache,
token accounting. Subclasses implement only _call().
- ClaudeCliBackend: drives `claude -p --output-format text`.
- CodexCliBackend: drives the REAL @openai/codex `exec -o <file>` for clean
output; resolve_codex_path() skips the hermes wrapper at ~/.local/bin/codex.
- reflect() now aggregates the exact failing judge criteria into the prompt
(gbrain's lesson: tell the optimizer what the scorer rewards).
Rule judges (skillopt/sleep/judges.py): gbrain-compatible local scorers
(section_present / regex / max_chars / contains / tool_called) — held-out
scoring with no judge-API spend. TaskRecord gains a `judge` field +
reference_kind="rule".
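
A local scorer for these rule kinds might be sketched as below. The judge
spec shape ({"kind": ..., plus one argument key}) and the heading-based
reading of section_present are assumptions; only the five kind names come
from this change.

```python
import re


def _has_section(response, name):
    """True if a markdown-style heading with this title is present."""
    for line in response.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            if stripped.lstrip("#").strip().lower() == name.lower():
                return True
    return False


def rule_judge(judge, response, tools_called=()):
    """Score a response locally: 1.0 if the rule holds, else 0.0."""
    kind = judge["kind"]
    if kind == "section_present":
        ok = _has_section(response, judge["section"])
    elif kind == "regex":
        ok = re.search(judge["pattern"], response) is not None
    elif kind == "max_chars":
        ok = len(response) <= judge["limit"]
    elif kind == "contains":
        ok = judge["text"] in response
    elif kind == "tool_called":
        ok = judge["tool"] in tools_called
    else:
        raise ValueError(f"unknown judge kind: {kind}")
    return 1.0 if ok else 0.0
```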
gbrain-evals adapter (experiments/gbrain_bench.py, run_gbrain.py): load
garrytan/gbrain-evals skillopt-v1 deficient skills + train/held-out task
sets and run our consolidate() loop against the SAME suite gbrain scores.
REAL results (docs/sleep/real_api_results.md), brief-writer seed, 1 night:
- Claude (Haiku): held-out 0.00 -> 1.00
- Codex: held-out 0.00 -> 0.67
Both proposed a correct, general format rule into the protected LEARNED block.
CLI: --backend {mock,claude,codex}, --codex-path, --model; experiment +
gbrain runners gain --limit-* cost controls. 17 tests pass.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Two bugs made local vLLM targets score acc=0.000: the router did not
forward 'timeout' to the Qwen backend (so runs used the 300s default),
and qwen_backend always injected chat_template_kwargs.enable_thinking,
which non-Qwen vLLM servers reject or answer with <think> output and no
<answer> tag. Forward timeout and only set the field when enabled.
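
In outline, the two fixes look like the sketch below. build_chat_request and
the layout of the returned dict (including the extra_body key) are
assumptions for illustration, not the qwen_backend code.

```python
def build_chat_request(model, messages, timeout=None, enable_thinking=False):
    """Forward the timeout and add the thinking field only when enabled."""
    request = {"model": model, "messages": messages}
    if timeout is not None:
        request["timeout"] = timeout    # previously dropped by the router
    if enable_thinking:
        # Omitted entirely otherwise, so non-Qwen servers never see it.
        request["extra_body"] = {
            "chat_template_kwargs": {"enable_thinking": True}
        }
    return request
```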
Closes #28
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add initial test infrastructure covering:
- skillopt/utils/scoring.py (compute_score, skill_hash)
- skillopt/utils/json_utils.py (extract_json, extract_json_array)
- skillopt/types.py (Edit, Patch dataclass serialization)
All tested functions are pure/deterministic with no LLM dependencies.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>