diff --git a/docs/guideline.html b/docs/guideline.html
index 1c0d1d3..4029e6d 100644
--- a/docs/guideline.html
+++ b/docs/guideline.html
@@ -244,18 +244,19 @@ Verify installation
-python -c "import skillopt; print('SkillOpt ready!')"
-
- With env.split_mode: split_dir (the recommended, deterministic mode), SkillOpt reads a directory containing train/, val/, and test/ subfolders, each holding a JSON array of task items:
data/my_split/
- ├─ train/items.json # used for rollout (the "train split")
- ├─ val/items.json # selection split → validation gate (valid_seen)
- └─ test/items.json # held-out final eval (valid_unseen)
- Internally the splits are referred to as train, valid_seen (validation/selection), and valid_unseen (test). The --split flag of eval_only.py uses these names.
What ships in this repo: ready-to-use configs and
+ pretrained skills (ckpt/) for six benchmarks, plus
+ lightweight ID manifests under data/. The manifests
+ list which examples each split uses but do not contain
+ the example contents — so for most benchmarks you materialize the data
+ once before training (see below).
Fastest out-of-the-box run — ALFWorld. Its bundled
+ split (data/alfworld_path_split) is directly usable; you
+ only need the ALFWorld game files:
pip install -e ".[alfworld]"
+alfworld-download
+export ALFWORLD_DATA=~/.cache/alfworld # data root containing json_2.1.1
+
+python scripts/train.py \
+ --config configs/alfworld/default.yaml \
+ --split_dir data/alfworld_path_split \
+ --azure_openai_endpoint https://your-resource.openai.azure.com/ \
+ --optimizer_model gpt-5.5 \
+ --target_model gpt-5.5
+ Other benchmarks (e.g. SearchQA) require a one-time
+ data materialization step: download the raw dataset from the source
+ listed in data/README.md,
+ match the manifest IDs to raw examples (the README documents the lookup
+ key per benchmark), and write the resulting
+ train/val/test item files into a split directory. Then run
+ the commands in §3.2 with --split_dir pointing at it. The
+ required item fields are documented in §4.2.
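The materialization step described above can be sketched in Python. This is a minimal sketch under stated assumptions, not SkillOpt's own tooling: the function name materialize_split, the default key="id" lookup key, and the in-memory list of raw examples are all hypothetical. The real lookup key for each benchmark is whatever data/README.md documents, and the raw dataset must be loaded by your own code first.

```python
import json
from pathlib import Path


def materialize_split(raw_examples, manifest_ids, out_dir, split, key="id"):
    """Write the raw examples named by manifest_ids to <out_dir>/<split>/items.json.

    Hypothetical helper: assumes each raw example is a dict and that `key`
    is the lookup key (check data/README.md for the real key per benchmark).
    """
    by_id = {str(ex[key]): ex for ex in raw_examples}
    missing = [i for i in manifest_ids if str(i) not in by_id]
    if missing:
        raise KeyError(f"{len(missing)} manifest IDs not found, e.g. {missing[:3]}")
    # Keep the manifest's order so repeated runs produce identical files.
    items = [by_id[str(i)] for i in manifest_ids]
    path = Path(out_dir) / split / "items.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(items, ensure_ascii=False, indent=2), encoding="utf-8")
    return path
```

Calling it once per split (train, val, test) with the same out_dir yields a directory that can be passed to --split_dir. Failing loudly on missing IDs is a deliberate choice here: a silently smaller split would make results hard to compare.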
To sanity-check your setup without training, evaluate a
+ packaged pretrained skill instead (§3.3 uses
+ ckpt/searchqa/gpt5.5_skill.md), or launch the monitoring
+ WebUI (§8.4).
Required fields depend on the benchmark; consult skillopt/envs/<benchmark>/dataloader.py for the exact contract. A SearchQA item, for example:
[
- {
- "id": "unique_item_id",
- "question": "Who wrote the novel ...",
- "context": "[DOC] relevant passage text ...",
- "answers": ["expected answer"]
- }
-]
- This repository ships no benchmark data. Prepare your own splits in the format above before training.
-env.split_mode | Behavior |
|---|---|
split_dir | Use a pre-built directory with explicit train/val/test folders (set env.split_dir). Deterministic and reproducible. |
ratio | Build a deterministic split on the fly from a single env.data_path, using split_seed (and a train:val:test ratio). Convenient for quick experiments. |
# Minimal SearchQA run
python scripts/train.py \
--config configs/searchqa/default.yaml \
@@ -504,7 +500,7 @@ skillopt/ # the package
Evaluate any skill document (a packaged reference skill, or a trained run's best_skill.md) without training:
# Evaluate the packaged GPT-5.5 SearchQA skill on the test split
python scripts/eval_only.py \
@@ -525,7 +521,7 @@ skillopt/ # the package
outputs/<run_name>/
├─ config.json # flattened runtime config
├─ history.json # per-step training history
@@ -538,10 +534,58 @@ skillopt/ # the package
Each completed step persists its state to runtime_state.json and a steps/step_XXXX/ directory. Re-running the same command against the same out_root detects finished work and continues from the last completed step — including epoch-boundary slow-update and meta-skill stages.
Bringing your own dataset takes three steps:
+ (1) create a split directory with train/ val/ test/ item
+ files in the format below; (2) make sure each item carries the fields
+ the closest existing benchmark adapter expects (§4.2); (3) point
+ --split_dir at it and train with that benchmark's config.
+ If no existing adapter matches your task shape (different rollout or
+ scoring logic), write a new benchmark adapter instead — see §7.2.
With env.split_mode: split_dir (the recommended, deterministic mode), SkillOpt reads a directory containing train/, val/, and test/ subfolders, each holding a JSON array of task items:
data/my_split/
+ ├─ train/items.json # used for rollout (the "train split")
+ ├─ val/items.json # selection split → validation gate (valid_seen)
+ └─ test/items.json # held-out final eval (valid_unseen)
+ Internally the splits are referred to as train, valid_seen (validation/selection), and valid_unseen (test). The --split flag of eval_only.py uses these names.
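Before training, it can help to confirm that a split directory matches the layout above. The following is a minimal sketch, assuming only what the text states (three subfolders, each with an items.json holding a JSON array); check_split_dir and SPLIT_NAMES are hypothetical names and not part of SkillOpt.

```python
import json
from pathlib import Path

# Folder name -> internal split name, as described in the text above.
SPLIT_NAMES = {"train": "train", "val": "valid_seen", "test": "valid_unseen"}


def check_split_dir(split_dir):
    """Return {internal_split_name: item_count}, raising if the layout is off."""
    counts = {}
    for folder, internal in SPLIT_NAMES.items():
        path = Path(split_dir) / folder / "items.json"
        if not path.is_file():
            raise FileNotFoundError(f"missing {path}")
        items = json.loads(path.read_text(encoding="utf-8"))
        if not isinstance(items, list):
            raise ValueError(f"{path} must contain a JSON array")
        counts[internal] = len(items)
    return counts
```

The check covers the directory structure only. Per-item fields vary by benchmark and are not validated here.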
Required fields depend on the benchmark; consult skillopt/envs/<benchmark>/dataloader.py for the exact contract. A SearchQA item, for example:
[
+ {
+ "id": "unique_item_id",
+ "question": "Who wrote the novel ...",
+ "context": "[DOC] relevant passage text ...",
+ "answers": ["expected answer"]
+ }
+]
+ This repository ships no benchmark data. Prepare your own splits in the format above before training.
+env.split_mode | Behavior |
|---|---|
split_dir | Use a pre-built directory with explicit train/val/test folders (set env.split_dir). Deterministic and reproducible. |
ratio | Build a deterministic split on the fly from a single env.data_path, using split_seed (and a train:val:test ratio). Convenient for quick experiments. |
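To illustrate what a seeded, deterministic ratio split can look like, here is a minimal sketch. It is not SkillOpt's implementation, which may shuffle or round differently; ratio_split and its parameters are hypothetical, with seed standing in for split_seed.

```python
import random


def ratio_split(items, ratio=(8, 1, 1), seed=0):
    """Shuffle with a fixed seed, then cut into train/val/test by ratio."""
    order = list(items)
    # A private Random instance keeps the result independent of global state.
    random.Random(seed).shuffle(order)
    total = sum(ratio)
    n_train = len(order) * ratio[0] // total
    n_val = len(order) * ratio[1] // total
    return {
        "train": order[:n_train],
        "val": order[n_train:n_train + n_val],
        # The test split takes the remainder, so no item is dropped.
        "test": order[n_train + n_val:],
    }
```

Given the same items, ratio, and seed, the function returns the same partition on every call, which is the property that makes a ratio-mode run repeatable.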
name | (… searchqa, docvqa, alfworld, …). Selects the env module. |
skill_init | … |
-split_mode | ratio or split_dir (see §3.3). |
+split_mode | ratio or split_dir (see §4.3). |
split_dir | (… split_mode = split_dir). |
data_path | (… split_mode = ratio). |
split_seed | … |