mirror of https://github.com/microsoft/SkillOpt.git synced 2026-07-04 06:29:56 +08:00

Files

Yifan Yang d6c4ca3f6e docs(sleep): load-test all 3 plugin shells on a fresh (non-gbrain) example

Actually exercised every plugin shell end to end on a brand-new "SQL must always
include LIMIT" analyst persona:
  - Claude Code shell: harvest (2 real crafted transcripts -> 2 tasks), full run
    (stages a proposal), adopt (honors the no-op-when-nothing-accepted contract).
  - Codex: install.sh places ~/.codex/prompts/sleep.md + ~/.agents/skills correctly.
  - Copilot: MCP server initialize -> tools/list -> tools/call returns engine output.

Genuine improvement on the fresh persona, both backends: held-out TEST 0.00 -> 1.00
(Sonnet->Haiku and Codex), the optimizer learning the user's LIMIT house rule and
generalizing to unseen queries. Honest finding: the first split left too few train
tasks (no-op night) — re-balancing fixed it; motivates a small-train-pool warning.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

2026-06-08 14:31:52 +00:00

3.4 KiB

Raw Blame History

SkillOpt-Sleep — plugin load-test (fresh examples)

This records an actual end-to-end load-test of all three plugin shells on a brand-new example (not the gbrain benchmark seeds), run on 2026-06-08.

The fresh persona

A data analyst whose SQL queries must always include a LIMIT clause — built from scratch for this test. Two forms were used:

Real transcripts — crafted Claude Code session JSONL where the analyst asks for SQL, the agent forgets LIMIT, and the user complains ("you forgot a LIMIT again", "always cap results"). This exercises the real harvest → mine pipeline.
Checkable tasks — the same intent with a rule judge (regex: (?i)LIMIT\s+100), so the optimizer can be scored on whether future SQL follows the house rule.

Results

Shell plumbing (all three drive the engine)

Shell	What was run	Result
Claude Code (`scripts/sleep.sh`)	`harvest`, full `run`, `adopt`	harvest found 2 sessions → 2 tasks; `run` staged a proposal; `adopt` honored the safety contract (no live change when nothing was accepted)
Codex (`install.sh` + shared runner)	`install.sh` into a temp HOME	placed `~/.codex/prompts/sleep.md` and `~/.agents/skills/skillopt-sleep/SKILL.md` correctly
Copilot (`mcp_server.py`)	`initialize` → `tools/list` → `tools/call sleep_harvest`	5 tools listed; `sleep_harvest` returned real engine output (2 sessions → 2 tasks)

Genuine improvement (real model, fresh persona)

Optimizer Claude Sonnet 4.6 → target Claude Haiku 4.5, 3-way split (5 train / 2 val / 5 test), scored on the held-out test queries; and the same fresh persona self-optimized on Codex:

Backend	Held-out test (fraction of SQL with `LIMIT 100`) before → after
Claude (Sonnet → Haiku)	0.00 → 1.00
Codex	0.00 → 1.00

In one night each optimizer wrote, into the protected learned block, a rule like:

"OVERRIDE: Every SQL query you generate MUST include LIMIT 100 …" (Claude) "Hard requirement: every SQL query response must include …" (Codex)

and the target then applied it to the unseen test queries. This is the whole claim on a task family the engine had never seen: it learned the user's house rule from their failures and generalized it — confirmed on both backends.

An honest finding from load-testing

The first attempt used val_fraction=0.34, test_fraction=0.34, which left only 1 train task for an 8-task set — too little signal — so reflect produced nothing and the night was a no-op (val already 0.75). Re-balancing the split to a real train pool (5 train) fixed it and produced the 0 → 1.00 result above. This is exactly the kind of issue that only surfaces when you actually run the thing, and it motivates a future guardrail: warn when the train pool is too small for the chosen split fractions.

Reproduce

The checkable persona run (real Claude):

# see the snippet in docs/sleep/plugin_load_test.md history, or run:
python -m skillopt_sleep.experiments.run_experiment --persona programmer --assert-improves  # deterministic

Shell checks:

# Copilot MCP server
printf '%s\n' '{"jsonrpc":"2.0","id":1,"method":"tools/list"}' \
  | SKILLOPT_SLEEP_REPO="$(pwd)" python3 plugins/copilot/mcp_server.py
# Codex installer (into a throwaway HOME)
HOME=$(mktemp -d) bash plugins/codex/install.sh

3.4 KiB Raw Blame History