docs: clarify optional features and ckpt artifacts

Yif Yang
2026-05-31 09:36:25 +00:00
parent 9265545c45
commit 266fca72ab
3 changed files with 29 additions and 30 deletions

@@ -210,14 +210,13 @@ Re-running the same command auto-resumes from the last completed step.
### Pretrained Skill Artifacts
The paper-aligned GPT-5.5 optimized skills are shipped in
[`ckpt/<benchmark>/gpt5.5_skill.md`](ckpt/) (one per benchmark — SearchQA,
ALFWorld, DocVQA, LiveMathematicianBench, OfficeQA, SpreadsheetBench). Use
them with `scripts/eval_only.py` to evaluate the paper-aligned skills on a
matching data split without re-running training. See [`ckpt/README.md`](ckpt/README.md)
for the full per-benchmark command. This is the first artifact batch; we
plan to continue uploading the remaining optimized skills and benchmark
split manifests as they are cleaned and verified.
We provide a subset of the paper's main Table 1 GPT-5.5 optimized skills in
[`ckpt/`](ckpt/) as reference artifacts. Use them with `scripts/eval_only.py`
to evaluate the provided skills on a matching data split without re-running
training. See [`ckpt/README.md`](ckpt/README.md) for the full per-benchmark
command. This is the first artifact batch; we plan to continue uploading
the remaining optimized skills and benchmark split manifests as they are
cleaned and verified.
---
@@ -249,7 +248,7 @@ Each JSON file is an array of task items. The required fields depend on the benc
See `skillopt/envs/<benchmark>/dataloader.py` for the exact format each benchmark expects.
> **Note:** Most benchmark datasets are not included in this repository. Prepare your own data following the format above. The exact SearchQA split used in the paper is shipped at [`data/searchqa_id_split/`](data/searchqa_id_split) (400 train / 200 val / 1400 test). We are preparing the remaining benchmark split manifests for upload.
> **Note:** Most benchmark datasets are not included in this repository. Prepare your own data following the format above. The exact SearchQA split used in the paper is provided at [`data/searchqa_id_split/`](data/searchqa_id_split) (400 train / 200 val / 1400 test). We are preparing the remaining benchmark split manifests for upload.
### Supported Benchmarks
@@ -269,14 +268,14 @@ See `skillopt/envs/<benchmark>/dataloader.py` for the exact format each benchmar
### Default settings and paper-reproduction knobs
`configs/_base_/default.yaml` is the single source of truth for SkillOpt's
runtime knobs. Out of the box, every shipped benchmark config inherits
runtime knobs. Out of the box, every included benchmark config inherits
from it and keeps the paper protocol visible: 4 epochs, rollout batch 40,
reflection minibatch 8, textual learning rate 4 with cosine decay, strict
hard validation gating, and slow-update + meta-skill enabled. The slow-update
acceptance policy is now explicit because `main` has moved forward from
the paper snapshot: the shipped `ckpt/` skills were produced with the gated
semantics described in paper Section 3.6, while the current `main` default
uses the post-submission force-accept behavior.
hard validation gating, and slow-update + meta-skill enabled. One detail to
watch is slow-update acceptance: the current `main` default is the newer
post-submission force-accept mode, while the paper protocol and the
paper-aligned skills under `ckpt/` use the gated semantics described in
paper Section 3.6.
### Slow-update acceptance mode
@@ -292,11 +291,11 @@ optimizer:
slow-update guidance is injected into both `current_skill` and
`best_skill` unconditionally at the epoch boundary. This is the newer
post-submission behavior on `main`.
- **`true`** *(paper / shipped-skill reproduction)*: gated, matching paper
- **`true`** *(paper / ckpt-skill reproduction)*: gated, matching paper
Section 3.6 verbatim. The slow-update candidate is evaluated on the
selection split and accepted only if it passes the same validation gate
as a step-level edit. Use this setting when re-running optimization to
match the paper protocol and the provenance of the shipped `ckpt/` skills.
match the paper protocol and the provenance of the provided `ckpt/` skills.
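
The difference between the two modes comes down to one acceptance decision at the epoch boundary. The following is a minimal sketch, not the trainer's actual code: the function name, the `gated` parameter, and the "strictly better than the best score so far" gate are all assumptions made for illustration, since this section shows neither the config key nor the exact gate rule.

```python
def accept_slow_update(gated: bool, candidate_score: float, best_score: float) -> bool:
    """Decide whether a slow-update candidate is kept (illustrative only).

    gated=False models force-accept: the candidate is always kept.
    gated=True models the paper Section 3.6 behavior: the candidate must
    pass a validation gate. The gate used here (strict improvement over
    the best selection-split score) is an assumption, not the real rule.
    """
    if not gated:
        # Force-accept: no evaluation result can block the update.
        return True
    # Gated: keep the candidate only if it beats the current best score.
    return candidate_score > best_score


if __name__ == "__main__":
    # A candidate that scores worse than the current best.
    print(accept_slow_update(gated=False, candidate_score=0.1, best_score=0.9))
    print(accept_slow_update(gated=True, candidate_score=0.1, best_score=0.9))
```

Under force-accept a regressing candidate still lands in the skill, which is why a re-run in that mode may not reproduce the `ckpt/` skills.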
The trainer prints which mode is active at startup
(`[slow update] acceptance=...`). See issue #22 for the discussion that
@@ -315,15 +314,15 @@ split using `gate_metric`:
- **`mixed`**: weighted average, `(1 - w) * hard + w * soft`, with `w`
set by `gate_mixed_weight` (default `0.5`).
Default is `hard`. Use the example config below to switch.
Default is `hard`. Use the optional feature config below to switch.
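
The `gate_metric` options above can be sketched as a single scoring function. This is a hedged illustration rather than the repository's implementation: the function name `gate_score` is hypothetical, and it assumes that `hard` and `soft` each return their own score unchanged, which is consistent with the `mixed` formula at `w = 0` and `w = 1`.

```python
def gate_score(hard: float, soft: float,
               gate_metric: str = "hard",
               gate_mixed_weight: float = 0.5) -> float:
    """Combine the hard and soft validation scores into one gate score.

    The `mixed` branch follows the documented formula
    (1 - w) * hard + w * soft; the other branches are assumptions.
    """
    if gate_metric == "hard":
        return hard
    if gate_metric == "soft":
        return soft
    if gate_metric == "mixed":
        w = gate_mixed_weight
        return (1 - w) * hard + w * soft
    raise ValueError(f"unknown gate_metric: {gate_metric!r}")


if __name__ == "__main__":
    # With the default weight of 0.5 the mixed score is the plain average.
    print(gate_score(hard=0.5, soft=1.0, gate_metric="mixed"))
```

With `gate_mixed_weight` at its default of `0.5`, a hard score of 0.5 and a soft score of 1.0 give a mixed gate score of 0.75.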
### Community-contributed examples
### Optional feature configs
These are **not** default SkillOpt settings — they are reference configs
These are **not** default SkillOpt settings — they are optional feature configs
contributed by users for specific scenarios. The paper-reported numbers
were obtained with the default settings, not these.
- **[`configs/examples/soft_gate.yaml`](configs/examples/soft_gate.yaml)**
- **[`configs/features/soft_gate.yaml`](configs/features/soft_gate.yaml)**
*(PR #25, contributed by [@lvbaocheng](https://github.com/lvbaocheng))*
switches `gate_metric` to `soft` (or `mixed`). See the comment at the
top of the file for when to use and when not to.

@@ -1,9 +1,9 @@
# Paper-aligned optimized SkillOpt skills (GPT-5.5)
# Paper-aligned SkillOpt reference skills (GPT-5.5)
This folder ships the GPT-5.5 best skills exported from SkillOpt training
runs — one `gpt5.5_skill.md` per benchmark. You can plug them into
`scripts/eval_only.py` to evaluate the paper-aligned optimized skills on a
given split without re-running the training loop.
This folder provides a subset of the paper's main Table 1 GPT-5.5 optimized
skills as reference artifacts — one `gpt5.5_skill.md` per currently included
benchmark. You can plug them into `scripts/eval_only.py` to evaluate the
provided skills on a given split without re-running the training loop.
> These are checkpoints associated with the paper, not a general-purpose
> tool. They're here so you can verify the reported numbers and use the
@@ -29,7 +29,7 @@ Each file is a plain Markdown skill document (~2k–13k chars). It contains a
protected `SLOW_UPDATE` section at the end that holds epoch-wise
longitudinal guidance — that's expected, not a formatting issue.
## How to evaluate a shipped skill
## How to evaluate a provided skill
`scripts/eval_only.py` runs a single skill against a data split without
invoking the optimizer. Example for SearchQA against the test split:
@@ -62,7 +62,7 @@ for upload — see issues #14 and #21.
## Why force-accept vs. gated slow-update matters
The shipped skills were produced with the gated slow-update semantics
These `ckpt/` skills were produced with the gated slow-update semantics
described in paper Section 3.6:
```yaml

@@ -1,5 +1,5 @@
# ─────────────────────────────────────────────────────────────────────────────
# Example: soft / mixed validation-gate metric (community-contributed, PR #25)
# Feature: soft / mixed validation-gate metric (community-contributed, PR #25)
# ─────────────────────────────────────────────────────────────────────────────
#
# This is NOT a default SkillOpt setting and was NOT used to produce the
@@ -28,7 +28,7 @@
# and matches the design described in the paper.
#
# To use: inherit your env config from this file, e.g.
# _base_: ../examples/soft_gate.yaml
# _base_: ../features/soft_gate.yaml
# or copy the `evaluation:` block below into your config.
# ─────────────────────────────────────────────────────────────────────────────