diff --git a/README.md b/README.md
index 7b4fbcd..cae5438 100644
--- a/README.md
+++ b/README.md
@@ -210,14 +210,13 @@ Re-running the same command auto-resumes from the last completed step.
 
 ### Pretrained Skill Artifacts
 
-The paper-aligned GPT-5.5 optimized skills are shipped in
-[`ckpt//gpt5.5_skill.md`](ckpt/) (one per benchmark — SearchQA,
-ALFWorld, DocVQA, LiveMathematicianBench, OfficeQA, SpreadsheetBench). Use
-them with `scripts/eval_only.py` to evaluate the paper-aligned skills on a
-matching data split without re-running training. See [`ckpt/README.md`](ckpt/README.md)
-for the full per-benchmark command. This is the first artifact batch; we
-plan to continue uploading the remaining optimized skills and benchmark
-split manifests as they are cleaned and verified.
+We provide a subset of the paper's main Table 1 GPT-5.5 optimized skills in
+[`ckpt/`](ckpt/) as reference artifacts. Use them with `scripts/eval_only.py`
+to evaluate the provided skills on a matching data split without re-running
+training. See [`ckpt/README.md`](ckpt/README.md) for the full per-benchmark
+command. This is the first artifact batch; we plan to continue uploading
+the remaining optimized skills and benchmark split manifests as they are
+cleaned and verified.
 
 ---
 
@@ -249,7 +248,7 @@ Each JSON file is an array of task items. The required fields depend on the benc
 
 See `skillopt/envs//dataloader.py` for the exact format each benchmark expects.
 
-> **Note:** Most benchmark datasets are not included in this repository. Prepare your own data following the format above. The exact SearchQA split used in the paper is shipped at [`data/searchqa_id_split/`](data/searchqa_id_split) (400 train / 200 val / 1400 test). We are preparing the remaining benchmark split manifests for upload.
+> **Note:** Most benchmark datasets are not included in this repository. Prepare your own data following the format above. The exact SearchQA split used in the paper is provided at [`data/searchqa_id_split/`](data/searchqa_id_split) (400 train / 200 val / 1400 test). We are preparing the remaining benchmark split manifests for upload.
 
 ### Supported Benchmarks
 
@@ -269,14 +268,14 @@ See `skillopt/envs//dataloader.py` for the exact format each benchmar
 ### Default settings and paper-reproduction knobs
 
 `configs/_base_/default.yaml` is the single source of truth for SkillOpt's
-runtime knobs. Out of the box, every shipped benchmark config inherits
+runtime knobs. Out of the box, every included benchmark config inherits
 from it and keeps the paper protocol visible: 4 epochs, rollout batch 40,
 reflection minibatch 8, textual learning rate 4 with cosine decay, strict
-hard validation gating, and slow-update + meta-skill enabled. The slow-update
-acceptance policy is now explicit because `main` has moved forward from
-the paper snapshot: the shipped `ckpt/` skills were produced with the gated
-semantics described in paper Section 3.6, while the current `main` default
-uses the post-submission force-accept behavior.
+hard validation gating, and slow-update + meta-skill enabled. One detail to
+watch is slow-update acceptance: the current `main` default is the newer
+post-submission force-accept mode, while the paper protocol and the
+paper-aligned skills under `ckpt/` use the gated semantics described in
+paper Section 3.6.
 
 ### Slow-update acceptance mode
 
@@ -292,11 +291,11 @@ optimizer:
   slow-update guidance is injected into both `current_skill` and
   `best_skill` unconditionally at the epoch boundary. This is the newer
   post-submission behavior on `main`.
-- **`true`** *(paper / shipped-skill reproduction)*: gated, matching paper
+- **`true`** *(paper / ckpt-skill reproduction)*: gated, matching paper
   Section 3.6 verbatim. The slow-update candidate is evaluated on the
   selection split and accepted only if it passes the same validation
   gate as a step-level edit. Use this setting when re-running optimization to
-  match the paper protocol and the provenance of the shipped `ckpt/` skills.
+  match the paper protocol and the provenance of the provided `ckpt/` skills.
 
 The trainer prints which mode is active at startup
 (`[slow update] acceptance=...`). See issue #22 for the discussion that
@@ -315,15 +314,15 @@ split using `gate_metric`:
 - **`mixed`**: weighted average, `(1 - w) * hard + w * soft`, with `w`
   set by `gate_mixed_weight` (default `0.5`).
 
-Default is `hard`. Use the example config below to switch.
+Default is `hard`. Use the optional feature config below to switch.
 
-### Community-contributed examples
+### Optional feature configs
 
-These are **not** default SkillOpt settings — they are reference configs
+These are **not** default SkillOpt settings — they are optional feature configs
 contributed by users for specific scenarios. The paper-reported numbers
 were obtained with the default settings, not these.
 
-- **[`configs/examples/soft_gate.yaml`](configs/examples/soft_gate.yaml)**
+- **[`configs/features/soft_gate.yaml`](configs/features/soft_gate.yaml)**
   *(PR #25, contributed by [@lvbaocheng](https://github.com/lvbaocheng))* —
   switches `gate_metric` to `soft` (or `mixed`). See the comment at
   the top of the file for when to use and when not to.
diff --git a/ckpt/README.md b/ckpt/README.md
index 5b506a9..b79f766 100644
--- a/ckpt/README.md
+++ b/ckpt/README.md
@@ -1,9 +1,9 @@
-# Paper-aligned optimized SkillOpt skills (GPT-5.5)
+# Paper-aligned SkillOpt reference skills (GPT-5.5)
 
-This folder ships the GPT-5.5 best skills exported from SkillOpt training
-runs — one `gpt5.5_skill.md` per benchmark. You can plug them into
-`scripts/eval_only.py` to evaluate the paper-aligned optimized skills on a
-given split without re-running the training loop.
+This folder provides a subset of the paper's main Table 1 GPT-5.5 optimized
+skills as reference artifacts — one `gpt5.5_skill.md` per currently included
+benchmark. You can plug them into `scripts/eval_only.py` to evaluate the
+provided skills on a given split without re-running the training loop.
 
 > These are checkpoints associated with the paper, not a general-purpose
 > tool. They're here so you can verify the reported numbers and use the
@@ -29,7 +29,7 @@ Each file is a plain Markdown skill document (~2k–13k chars). It contains
 a protected `SLOW_UPDATE` section at the end that holds epoch-wise
 longitudinal guidance — that's expected, not a formatting issue.
 
-## How to evaluate a shipped skill
+## How to evaluate a provided skill
 
 `scripts/eval_only.py` runs a single skill against a data split without
 invoking the optimizer. Example for SearchQA against the test split:
@@ -62,7 +62,7 @@ for upload — see issues #14 and #21.
 
 ## Why force-accept vs. gated slow-update matters
 
-The shipped skills were produced with the gated slow-update semantics
+These `ckpt/` skills were produced with the gated slow-update semantics
 described in paper Section 3.6:
 
 ```yaml
diff --git a/configs/examples/soft_gate.yaml b/configs/features/soft_gate.yaml
similarity index 96%
rename from configs/examples/soft_gate.yaml
rename to configs/features/soft_gate.yaml
index 2f83b3f..7b622d3 100644
--- a/configs/examples/soft_gate.yaml
+++ b/configs/features/soft_gate.yaml
@@ -1,5 +1,5 @@
 # ─────────────────────────────────────────────────────────────────────────────
-# Example: soft / mixed validation-gate metric (community-contributed, PR #25)
+# Feature: soft / mixed validation-gate metric (community-contributed, PR #25)
 # ─────────────────────────────────────────────────────────────────────────────
 #
 # This is NOT a default SkillOpt setting and was NOT used to produce the
@@ -28,7 +28,7 @@
 # and matches the design described in the paper.
 #
 # To use: inherit your env config from this file, e.g.
-# _base_: ../examples/soft_gate.yaml
+# _base_: ../features/soft_gate.yaml
 # or copy the `evaluation:` block below into your config.
 # ─────────────────────────────────────────────────────────────────────────────
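
Reviewer note: the README hunk around `gate_metric` describes three gate modes, and the `mixed` formula may be easier to check with a small sketch. The snippet below is an illustration only, not SkillOpt's implementation. The function name `gate_score` and its signature are hypothetical; only the mode names (`hard`, `soft`, `mixed`), the formula `(1 - w) * hard + w * soft`, and the default weight `0.5` are taken from the README text in this diff.

```python
def gate_score(hard: float, soft: float,
               gate_metric: str = "hard",
               gate_mixed_weight: float = 0.5) -> float:
    """Hypothetical helper: combine hard and soft validation scores.

    Mirrors the README description only; the real trainer may compute
    and apply the gate differently.
    """
    if gate_metric == "hard":
        return hard
    if gate_metric == "soft":
        return soft
    if gate_metric == "mixed":
        w = gate_mixed_weight
        # Weighted average as stated in the README: (1 - w) * hard + w * soft
        return (1 - w) * hard + w * soft
    raise ValueError(f"unknown gate_metric: {gate_metric!r}")


if __name__ == "__main__":
    # With the default weight 0.5, mixed is the plain average of the two scores.
    print(gate_score(0.5, 1.0, gate_metric="mixed"))
```

With `gate_mixed_weight` at `0.0` the mixed score reduces to the hard score, and at `1.0` to the soft score, so the weight interpolates between the two existing modes.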