Commit Graph

30 Commits

Author SHA1 Message Date
Shunsuke
98d0430bee refactor: make EnvAdapter.reflect a shared default (fixes dropped reflect kwargs)
All six adapters duplicated an identical reflect() that delegates to
run_minibatch_reflect. The copies had drifted: OfficeQA/DocVQA silently
dropped meta_skill_context and ALFWorld dropped update_mode, so those
analysts ran without inputs every other benchmark receives (active under
the default use_meta_skill: true).

Move the delegation into EnvAdapter.reflect as one default that forwards
all kwargs uniformly, and delete the six overrides. reflect is no longer
abstract — adapters inherit it and override only for custom logic.

Net -225 lines. Behavior change: OfficeQA/DocVQA/ALFWorld reflect now
receive the kwargs they previously dropped; the three already-correct
benchmarks are unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 09:06:00 +00:00
Yifan Yang
eef4805b25 Merge pull request #43 from imshunsuke/docs/fix-benchmark-loader-naming
docs: align benchmark guide and template with dataloader.py naming
2026-06-15 17:00:45 +08:00
Cuzyoung
c1ac570d94 docs(guideline): make SearchQA the first demo — copy-paste materialization snippet + train command
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 13:51:20 +00:00
Cuzyoung
d8023a47c9 docs(guideline): novice-first restructure — Quick Start before data, honest first-demo path, own-data narrative
- Move Quick Start (now §3) ahead of the data chapter; renumber and fix
  cross-references and the sidebar nav.
- Add §3.1 'Your First Demo': states plainly that data/ ships ID manifests
  only, gives the one benchmark that runs out of the box (ALFWorld with its
  bundled path split), and points other benchmarks to the data/README.md
  materialization step. Also offers eval-only with ckpt/ skills as a
  lighter sanity check.
- Reframe the data chapter as 'Run on Your Own Data' (§4) with a three-step
  lead-in (split dir -> item schema -> --split_dir) and a pointer to §7.2
  for new task shapes.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 13:42:50 +00:00
Cuzyoung
3308c4c5dc docs(guideline): add PyPI install option and skill-aware reflection config rows
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 13:27:12 +00:00
Cuzyoung
0d5b331cd5 Merge branch 'docs/guideline' into feat/skill-aware-reflection
# Conflicts:
#	README.md
2026-06-10 13:27:12 +00:00
Cuzyoung
1c6a0e75c8 docs(guide): document skill-aware reflection options in the configuration guide
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 13:19:27 +00:00
Shunsuke
54e4b3eafb docs: align benchmark guide and template with dataloader.py naming
The new-benchmark guide and the env template README referred to the data
loader file as loader.py, but all six built-in benchmarks name it
dataloader.py (skillopt/envs/<name>/dataloader.py). Update the docs and
the template rename step to match the actual convention.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 12:20:01 +08:00
Yifan Yang
f64a41397c docs(sleep): add PR draft (title + body) for the upstream PR
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-08 14:31:52 +00:00
Yifan Yang
d6c4ca3f6e docs(sleep): load-test all 3 plugin shells on a fresh (non-gbrain) example
Actually exercised every plugin shell end to end on a brand-new "SQL must always
include LIMIT" analyst persona:
  - Claude Code shell: harvest (2 real crafted transcripts -> 2 tasks), full run
    (stages a proposal), adopt (honors the no-op-when-nothing-accepted contract).
  - Codex: install.sh places ~/.codex/prompts/sleep.md + ~/.agents/skills correctly.
  - Copilot: MCP server initialize -> tools/list -> tools/call returns engine output.

Genuine improvement on the fresh persona, both backends: held-out TEST 0.00 -> 1.00
(Sonnet->Haiku and Codex), the optimizer learning the user's LIMIT house rule and
generalizing to unseen queries. Honest finding: the first split left too few train
tasks (no-op night) — re-balancing fixed it; motivates a small-train-pool warning.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-08 14:31:52 +00:00
Yifan Yang
dae974a5e3 chore(sleep): English-only across the engine, plugins, and docs
Remove every non-ASCII/CJK character for a professional open-source repo:
  - harvest.py: drop hardcoded Chinese feedback phrases; add an env-based
    extensibility hook (SKILLOPT_SLEEP_NEG_FEEDBACK / _POS_FEEDBACK) so any
    locale can be added without baking one in. Verified with a German example.
  - rollout.py / consolidate.py: English comments.
  - README.md section heading + anchor, CONTROLLABLE_DREAMING.md, plugin.json,
    marketplace.json (also fixed stale path skillopt-sleep-plugin ->
    plugins/claude-code), SKILL.md: English only.
  - Remove the internal WAKE_UP_SUMMARY.md note (not user-facing, not referenced).

Verified: zero CJK chars remain anywhere; 29 tests pass.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-08 14:31:52 +00:00
Yifan Yang
e2de84d36f docs(sleep): real Claude<->Codex cross-validation of the new features
Three live runs exercise the new code paths on both runtimes:
  A) Claude Sonnet->Haiku, gate=OFF + rollouts_k=2: brief-writer test 0->1.00,
     action 'greedy_improved', val & test both reported (3-way split works).
  B) Codex, gate=ON + rollouts_k=2: brief-writer test 0->1.00 in 2 nights.
  C) Claude Sonnet->Haiku, thorough-analyst, 3 nights: slow-update fires and
     distils a durable cross-night meta-rule (general, not task-specific).

Confirms gate-off greedy path, 3-way val/test split, multi-rollout, and the
gate-independent slow-update all work with real models on Claude AND Codex.
Raw logs under docs/sleep/raw/crosscheck_*.txt.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-08 14:31:51 +00:00
Yifan Yang
9379e494bf docs(sleep): document the controllable dreaming architecture
Captures the four-stage refactor: train(dream)/val(real)/test(real) splits,
optional gate, gate-independent slow-update long-term memory, token/time budget,
multi-rollout contrastive reflection, multi-objective reward (accuracy/tokens/
latency), and user-preference priors — with a one-command example composing them.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-08 14:31:51 +00:00
Yifan Yang
99ec2caf6b docs(sleep): complete 4/4 gbrain parity on Claude AND Codex (tool loop incl.)
benchmark_report.md now 7/7 direct + 4/4 transfer, all 0->1.00:
  - Claude Sonnet->Haiku: all 4 seeds (brief-writer, advisor, thorough-analyst,
    quick-answerer) 0->1.00
  - Codex self-optimized: brief-writer, advisor, quick-answerer 0->1.00
  - quick-answerer uses the real ./search tool loop on both runtimes.

This matches gbrain's own "4/4 skills 0->1.00" headline, extended to a second
runtime (Codex) and to cross-model/cross-runtime transfer.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-08 14:31:51 +00:00
Yifan Yang
acf4545c00 docs(sleep): full 4/4 gbrain parity — quick-answerer 0->1.00 via real tool loop
quick-answerer (judge: tool_called=search) reaches 0.00 -> 1.00 with Sonnet
optimizer -> Haiku target: the optimizer wrote an OVERRIDE of the "never use
tools" instruction and the Haiku target genuinely invoked the ./search shim.
All 4 gbrain skillopt-v1 seeds now at 0->1.00, matching gbrain's own headline.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-08 14:31:51 +00:00
Yifan Yang
b1f41a7506 docs(sleep): full sweep — 5/5 direct + 4/4 transfer all 0->1.00
Machine-generated benchmark_report.md from a 9-config sweep:
  - Direct (Sonnet->Haiku): brief-writer/advisor/thorough-analyst 0->1.00
  - Direct (Codex): brief-writer/advisor 0->1.00
  - Transfer (4/4 positive, incl. cross-runtime Codex<->Claude): all 0->1.00

Cross-model transfer confirms the price-difference value prop: a skill
optimized on a cheap model deploys for free on an expensive one, and skills
move between Codex and Claude. sweep.jsonl is the committed source data.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-08 14:31:51 +00:00
Yifan Yang
4186e5bb73 docs(sleep): definitive clean results — Sonnet->Haiku 3/3 seeds 0->1.00
Strong-optimizer/weak-target (Sonnet -> Haiku), fully isolated:
  brief-writer, advisor, thorough-analyst all 0.00 -> 1.00 on held-out.
thorough-analyst shows 2-night convergence (0.33 -> 1.00). Codex self-optimized
brief-writer also 0 -> 1.00.

Key finding answering the optimizer/target-split request: the OPTIMIZER MODEL is
decisive — weak Haiku-as-optimizer is flaky (0 or 1.0 across runs), strong
Sonnet-as-optimizer reliably hits 1.0 on every seed. Raw logs under docs/sleep/raw/.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-08 14:31:51 +00:00
Yifan Yang
d75863eb6f fix(sleep): retry reflect on non-JSON reply; honest report narrative
- reflect() now retries once with a firmer "JSON only" instruction when the
  first reply doesn't parse to a non-empty array. A transient non-JSON reply
  otherwise wastes a whole night (gate sees no edits -> reject), which made
  weak optimizers (Haiku) flaky across runs.
- FINAL_REPORT.md: document the context-leak discovery honestly; Codex cells
  stand (clean), Claude cells recomputed under strict isolation.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-08 14:31:51 +00:00
Yifan Yang
233b619555 feat(sleep): marketplace manifest, install docs, final report shell, sweep flush
- skillopt-sleep-plugin/.claude-plugin/marketplace.json so the plugin is
  installable via `/plugin marketplace add ./skillopt-sleep-plugin`.
- README install section (clone -> add marketplace -> install -> /sleep status).
- docs/sleep/FINAL_REPORT.md: the consolidated presented results doc (real
  Claude+Codex, transfer, and the honest thorough-analyst failure + fix).
- sweep.py flushes stdout for live monitoring.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-08 14:31:51 +00:00
Yifan Yang
63c79b3602 docs(sleep): record real Claude+Codex gbrain results; both reach 0->1.00
Codex with the directive reflect prompt + 2 nights converges 0.00 -> 1.00
(up from 0.67 single-night); its night-2 edit diagnoses its own residual
failure ("preserve required sections even when keeping the brief short").
Claude (Haiku) reaches 1.00 in one night. Update plugin README + skill to
reference --backend claude|codex (was anthropic) and surface the benchmark.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-08 14:31:51 +00:00
Yifan Yang
4203086899 feat(sleep): real claude + codex backends, gbrain-evals benchmark, rule judges
Upgrade from mock-only to REAL multi-backend validation:

Backends (skillopt/sleep/backend.py):
  - CliBackend base: shared attempt/judge/reflect prompts, response cache,
    token accounting. Subclasses implement only _call().
  - ClaudeCliBackend: drives `claude -p --output-format text`.
  - CodexCliBackend: drives the REAL @openai/codex `exec -o <file>` for clean
    output; resolve_codex_path() skips the hermes wrapper at ~/.local/bin/codex.
  - reflect() now aggregates the exact failing judge criteria into the prompt
    (gbrain's lesson: tell the optimizer what the scorer rewards).

Rule judges (skillopt/sleep/judges.py): gbrain-compatible local scorers
  (section_present / regex / max_chars / contains / tool_called) — held-out
  scoring with no judge-API spend. TaskRecord gains a `judge` field +
  reference_kind="rule".

gbrain-evals adapter (experiments/gbrain_bench.py, run_gbrain.py): load
  garrytan/gbrain-evals skillopt-v1 deficient skills + train/held-out task
  sets and run our consolidate() loop against the SAME suite gbrain scores.

REAL results (docs/sleep/real_api_results.md), brief-writer seed, 1 night:
  - Claude (Haiku): held-out 0.00 -> 1.00
  - Codex:          held-out 0.00 -> 0.67
  Both proposed a correct, general format rule into the protected LEARNED block.

CLI: --backend {mock,claude,codex}, --codex-path, --model; experiment +
gbrain runners gain --limit-* cost controls. 17 tests pass.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-08 14:31:51 +00:00
Yifan Yang
309f3141d4 docs(sleep): add wake-up summary of the overnight build
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-08 14:31:51 +00:00
Yifan Yang
4e7add899d feat(sleep): nightly offline self-evolution engine + Claude Code plugin
Add skillopt/sleep — a deployment-time companion to SkillOpt that gives a
local Claude agent a nightly "sleep cycle":

  harvest ~/.claude transcripts -> mine recurring tasks -> replay offline
    -> consolidate (reflect -> bounded edit -> held-out GATE) -> stage -> adopt

Synthesizes SkillOpt (validation-gated bounded text optimization, reusing
skillopt.evaluation.gate verbatim), Claude Dreams (offline consolidation;
input never mutated; review-then-adopt), and the agent-sleep paper
(short-term experience -> long-term competence).

Engine (skillopt/sleep/, import-light, py>=3.10):
  - harvest.py   read-only parse of session JSONL + history.jsonl
  - mine.py      sessions -> TaskRecords (heuristic miner + LLM hook)
  - backend.py   MockBackend (deterministic, no API) + AnthropicBackend
  - replay.py    offline re-run -> (hard, soft) scores
  - consolidate.py  one SkillOpt epoch behind a held-out gate
  - memory.py    protected-region edits to SKILL.md / CLAUDE.md
  - staging.py   stage proposals; adopt with backup (Dreams safety contract)
  - cycle.py + __main__.py  orchestrator + CLI (run/dry-run/status/adopt/harvest)

Plugin (skillopt-sleep-plugin/): plugin.json, /sleep command, skillopt-sleep
skill, SessionEnd hook, bundled runner + cron generator.

Validation (deterministic, no API): persona experiment proves held-out lift
(researcher 0.33->1.0, programmer 0.32->1.0) AND that the gate rejects an
injected harmful edit. 13 stdlib-unittest tests pass, incl. full cycle +
adopt-with-backup and parsing of real on-disk transcripts.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-08 14:31:51 +00:00
Yifan Yang
0ac2b35daa docs: add SkillOpt-Sleep Claude Code plugin design
Design for a nightly offline self-evolution plugin that synthesizes
SkillOpt (validation-gated bounded text optimizer), Claude Dreams
(offline memory consolidation), and the Agent-Sleep paper (short-term
to long-term experience). Harvests local ~/.claude transcripts, mines
recurring tasks, replays them offline, and consolidates memory+skills
behind a held-out gate.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-08 14:31:51 +00:00
Yifan Yang
2ca2910649 docs: align API reference and Add-a-Benchmark guide with real EnvAdapter ABC
docs/reference/api.md previously documented a fictional EnvAdapter API
(execute / evaluate / build_prompt + DataItem / TaskResult) and a
BENCHMARK_REGISTRY that never existed in code. Anyone following the
documented contract would hit ImportError or TypeError on the first
instantiation.

Replace both pages with the real shape from skillopt/envs/base.py and
skillopt/datasets/base.py:

- EnvAdapter: build_train_env, build_eval_env, rollout, reflect,
  get_task_types (the 5 actual abstract methods).
- Rollout dicts: id / hard / soft required; everything else preserved
  into RolloutResult.extras.
- Reflect dicts: {patch, source_type} schema as consumed by
  run_minibatch_reflect.
- BatchSpec: slotted-but-mutable dataclass matching the actual
  definition (payload defaults to None, metadata to dict()).
- SplitDataLoader.load_split_items as the one mandatory loader method.
- Registry: _ENV_REGISTRY in scripts/train.py (lazy try/except
  ImportError block), not a non-existent BENCHMARK_REGISTRY in
  skillopt/envs/__init__.py.
- _base_: documented as a string path, since the current YAML loader
  only accepts strings.

The new-benchmark.md guide now walks through a docfaithful worked
example with a real rollout helper (chat_target + scorer) instead of
hand-waving over the rollout step. Refs microsoft/SkillOpt#30.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-01 20:14:54 +00:00
kaikai-macbook
41012e2d5e Support Qwen chat as optimizer backend 2026-06-01 16:44:49 +08:00
Cuzyoung
8acc2dd03e docs: add self-contained reproduction & usage guideline page
Add docs/guideline.html, a single self-contained documentation guide
(left-nav + content + on-this-page TOC) covering installation, data
preparation, training/eval, full configuration reference, framework
internals, and an API reference. Link it from the README with local,
htmlpreview, and GitHub Pages access instructions.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-31 09:01:25 +00:00
yongjin
657b987de6 docs: add local environment smoke test guide 2026-05-29 09:26:38 +09:00
Cuzyoung
4a1b984d87 refactor: rename teacher/student to optimizer/target, remove best skills, fix slow update
- Rename teacher -> optimizer, student -> target across all code, configs, docs, prompts
- CLI: --teacher_model -> --optimizer_model, --student_model -> --target_model
- Remove best_skill files, keep only initial skills
- Fix slow update gate (force write into skill)
- Fix SLOW_UPDATE marker stripping
- Remove deep_reflect and meta_reflect mechanisms
- Update .env.example with export prefix and azure_cli docs
- Add endpoint empty validation in azure_openai.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-24 19:15:10 +00:00
CharlesYang030
244e346b83 SkillOpt v0.1.0: initial release
- Skill optimization framework with training loop analogy
- 11 benchmarks, 4 model backends (Azure OpenAI, Claude, Codex, Qwen)
- WebUI for browser-based training control
- Pluggable architecture for extending benchmarks and backends
2026-05-21 17:22:04 +00:00