From 7ae2d8766e8424bd7ccce82e21e577aea70bb0f7 Mon Sep 17 00:00:00 2001
From: Cuzyoung <gongzy23@mail2.sysu.edu.cn>
Date: Sun, 24 May 2026 19:19:12 +0000
Subject: [PATCH] docs: restore clean README with
 Install/Data/QuickStart/WebUI/Citation only

Keep the remote project page header (badges, video) and replace the body
with our streamlined five-section README focused on reproducibility.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 README.md | 407 +++++++++++++++---------------------------------------
 1 file changed, 109 insertions(+), 298 deletions(-)

diff --git a/README.md b/README.md
index 2d1edde..7266b05 100644
--- a/README.md
+++ b/README.md
@@ -13,105 +13,18 @@
 
 ---
 
-## What is SkillOpt?
+## Install
 
-SkillOpt is a framework for optimizing a natural-language **skill document** through iterative rollout, reflection, editing, and gated validation.
-
-It does **not** fine-tune model parameters. Instead, it treats the skill document as the optimization target:
-
-- The **target** model executes tasks with the current skill
-- The **optimizer** model analyzes trajectories and proposes edits
-- The framework merges, ranks, applies, and validates those edits
-- Only validated skill updates are kept
-
-| Deep Learning | SkillOpt |
-|---|---|
-| Model weights | Skill document (Markdown) |
-| Forward pass | Rollout (target executes tasks) |
-| Loss computation | Reflect (optimizer analyzes trajectories) |
-| Gradient | Edit patches (proposed skill improvements) |
-| Gradient clipping | Edit ranking & selection (`learning_rate`) |
-| Weight update | Patch application to skill document |
-| Validation | Gated evaluation on held-out split |
-| Learning rate schedule | `lr_scheduler`: cosine, linear decay |
-| Epochs | Multi-epoch training with slow update & meta skill |
-
----
-
-## Method Overview
-
-### Optimization Target
-
-Each run maintains a mutable markdown skill document. The framework repeatedly improves that document instead of changing model parameters.
-
-This gives a training-style loop for prompt / policy optimization:
-
-1. Roll out the current skill on a batch of tasks.
-2. Reflect on failures and successes.
-3. Merge patch proposals into a coherent candidate update.
-4. Rank and select a bounded number of edits.
-5. Apply those edits to produce a candidate skill.
-6. Validate the candidate skill on a held-out selection split.
-7. Keep the update only if the gate accepts it.
-
-### Per-Step Pipeline
-
-Every training step executes the following pipeline in `skillopt/engine/trainer.py`:
-
-1. **Rollout**
-   The target model runs a batch of tasks using the current skill.
-
-2. **Reflect**
-   The optimizer analyzes minibatches of trajectories and emits raw patches.
-   Failure-driven and success-driven patches are tracked separately.
-
-3. **Aggregate**
-   Raw patches are merged hierarchically. Metadata such as `support_count` and `source_type` is carried into the merged patch so later ranking can use it.
-
-4. **Select**
-   The optimizer ranks the merged edit pool and keeps up to `edit_budget` edits.
-
-5. **Update**
-   The selected edits are applied to the skill document. The framework records an `edit_apply_report.json` so you can see which edits actually landed, which were skipped, and why.
-
-6. **Evaluate / Gate**
-   The candidate skill is evaluated on the selection split. A candidate update is accepted only if it improves over the current selection score; a new global best is tracked separately.
-
-### Within-Epoch Memory
-
-Inside an epoch, the trainer maintains a step buffer containing:
-
-- Compact failure-pattern summaries from previous steps
-- Rejected edits and their score deltas
-
-That context is fed back into later reflection calls so the optimizer can avoid repeating ineffective edits and can focus on unsolved error patterns.
-
-### Epoch-Level Mechanisms
-
-#### Slow Update
-
-At the end of each epoch, `slow_update` compares the previous epoch's terminal skill and current epoch's terminal skill on a sampled train subset. It then writes longitudinal guidance into a protected slow-update region inside the skill document.
-
-This guidance is **not** blindly written through — it is converted into a candidate skill and sent through the same selection gate as step-level updates.
-
-#### Meta Skill
-
-`meta_skill` is optimizer-side cross-epoch memory. It does not directly edit the current skill. Instead, it writes a compact memory artifact describing longer-term patterns across adjacent epochs. That memory is loaded into later reflection / merge / ranking calls as extra context.
-
-#### Meta Reflect
-
-`meta_reflect` runs at epoch end over the step history of the current epoch. It looks at accepted and rejected directions from the whole epoch, proposes higher-level patch edits, applies them to a meta candidate, and then sends that candidate through the same selection gate.
-
----
-
-## Quick Start
-
-### Install
+**Requirements:** Python 3.10+
 
 ```bash
-git clone https://github.com/AgenticOpt/SkillOpt.git
+git clone https://github.com/microsoft/SkillOpt.git
 cd SkillOpt
 pip install -e .
+
+# For ALFWorld benchmark (optional):
+pip install -e ".[alfworld]"
+alfworld-download
 ```
 
 ### Configure API Credentials
@@ -122,13 +35,17 @@ cp .env.example .env
 source .env
 ```
 
-**Azure OpenAI** (API key or managed identity):
+**Azure OpenAI** (recommended):
 ```bash
 export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
+# Option 1: API key auth
 export AZURE_OPENAI_API_KEY="your-key"
-# Or use managed identity: set azure_openai_auth_mode=managed_identity in config
+# Option 2: Azure CLI auth (no API key needed)
+export AZURE_OPENAI_AUTH_MODE="azure_cli"
 ```
 
+> **Note:** `AZURE_OPENAI_ENDPOINT` is always required. Without it, all LLM calls will fail.
+
 **OpenAI** directly:
 ```bash
 export OPENAI_API_KEY="sk-..."
@@ -145,237 +62,139 @@ export QWEN_CHAT_BASE_URL="http://localhost:8000/v1"
 export QWEN_CHAT_MODEL="Qwen/Qwen3.5-4B"
 ```
 
-### Run Training
-
-```bash
-python scripts/train.py --config configs/searchqa/default.yaml
-```
-
 ---
 
-## Configuration
+## Data Preparation
 
-SkillOpt uses a hierarchical YAML configuration system. Each benchmark config inherits from `configs/_base_/default.yaml`.
+SkillOpt expects data in a **split directory** with `train/`, `val/`, `test/` subdirectories, each containing a JSON file (e.g., `items.json`).
 
-### Configuration Structure
-
-```yaml
-model:
-  optimizer_backend: openai_chat     # openai_chat | claude_chat | qwen_chat
-  target_backend: openai_chat     # openai_chat | claude_chat | codex_exec | qwen_chat
-  optimizer: gpt-5.5                 # optimizer model deployment name
-  target: gpt-5.5                 # target model deployment name
-  reasoning_effort: medium         # low | medium | high
-
-train:
-  num_epochs: 4
-  batch_size: 40
-  seed: 42
-
-gradient:
-  minibatch_size: 8                # trajectories per reflection call
-  analyst_workers: 16              # parallel reflection workers
-  use_deep_reflect: false          # deep multi-turn probing
-  deep_reflect_failures: 4
-  deep_reflect_successes: 2
-
-optimizer:
-  learning_rate: 4                 # max edits per step (edit_budget)
-  min_learning_rate: 2             # min edits for decay schedulers
-  lr_scheduler: cosine             # constant | linear | cosine | autonomous
-  skill_update_mode: patch         # patch | rewrite_from_suggestions | full_rewrite_minibatch
-  use_slow_update: true
-  use_meta_skill: true
-  use_meta_reflect: false
-
-evaluation:
-  use_gate: true                   # gated validation (always recommended)
-
-env:
-  name: ""                         # benchmark name
-  skill_init: ""                   # path to initial skill document
-  split_mode: ratio                # ratio | split_dir
-  split_ratio: "2:1:7"            # train:val:test
+```
+data/my_split/
+├── train/items.json
+├── val/items.json
+└── test/items.json
 ```
 
-### CLI Overrides
+Each JSON file is an array of task items. The required fields depend on the benchmark. For example, SearchQA items look like:
 
-Override any config key from the command line:
-
-```bash
-python scripts/train.py \
-  --config configs/searchqa/default.yaml \
-  --cfg-options model.optimizer_backend=openai_chat \
-                model.target_backend=codex_exec \
-                train.batch_size=40 \
-                optimizer.learning_rate=4
-
-# Legacy flat overrides also work for common keys:
-python scripts/train.py \
-  --config configs/searchqa/default.yaml \
-  --backend azure_openai \
-  --optimizer_model gpt-5.5 \
-  --target_model gpt-5.5 \
-  --reasoning_effort medium
+```json
+[
+  {
+    "id": "unique_item_id",
+    "question": "Who wrote the novel ...",
+    "context": "[DOC] relevant passage text ...",
+    "answers": ["expected answer"]
+  }
+]
 ```
 
----
+See `skillopt/envs/<benchmark>/dataloader.py` for the exact format each benchmark expects.
 
-## Model Backends
+> **Note:** Benchmark datasets are not included in this repository. Prepare your own data following the format above.
 
-All model access goes through the unified backend router in `skillopt/model/`.
-
-| Backend | Use case | Config key |
-|---|---|---|
-| `openai_chat` | Azure OpenAI / OpenAI API | optimizer / target |
-| `claude_chat` | Anthropic Claude | optimizer / target |
-| `codex_exec` | Codex execution harness | target only |
-| `qwen_chat` | Local Qwen via vLLM | optimizer / target |
-
-Separate optimizer/target endpoints are supported:
-
-```yaml
-model:
-  optimizer_backend: openai_chat
-  target_backend: codex_exec
-  optimizer: gpt-5.5
-  target: gpt-5.5-codex
-```
-
----
-
-## Data Splits
-
-SkillOpt supports two split modes:
-
-**Ratio split** — auto-generate from raw data:
-```bash
-python scripts/train.py \
-  --config configs/searchqa/default.yaml \
-  --split_mode ratio \
-  --data_path /path/to/searchqa_data.json
-```
-
-**Pre-split directory** — consume prepared splits:
-```bash
-python scripts/train.py \
-  --config configs/searchqa/default.yaml \
-  --split_mode split_dir \
-  --split_dir /path/to/searchqa_split
-```
-
----
-
-## Supported Benchmarks
+### Supported Benchmarks
 
 | Benchmark | Type | Config |
 |---|---|---|
 | SearchQA | QA | `configs/searchqa/default.yaml` |
-| SpreadsheetBench | Code generation | `configs/spreadsheetbench/default.yaml` |
 | ALFWorld | Embodied agent | `configs/alfworld/default.yaml` |
 | DocVQA | Document QA | `configs/docvqa/default.yaml` |
-| OfficeQA | Tool-augmented QA | `configs/officeqa/default.yaml` |
-| SealQA | Tool-augmented QA | `configs/sealqa/default.yaml` |
-| BabyVision | Vision QA | `configs/babyvision/default.yaml` |
 | LiveMathematicianBench | Math | `configs/livemathematicianbench/default.yaml` |
-| MathVerse | Multimodal math | `configs/mathverse/default.yaml` |
-| MMRB | Multimodal reasoning | `configs/mmrb/default.yaml` |
-| SWEBench | Software engineering | `configs/swebench/default.yaml` |
+| SpreadsheetBench | Code generation | `configs/spreadsheetbench/default.yaml` |
+| OfficeQA | Tool-augmented QA | `configs/officeqa/default.yaml` |
 
 ---
 
-## Running Training
+## Quick Start
 
-Basic training:
-
-```bash
-python scripts/train.py --config configs/searchqa/default.yaml
-```
-
-Exec harness (Codex target):
+### Training
 
 ```bash
+# Minimal example — train on SearchQA:
 python scripts/train.py \
-  --config configs/searchqa/default.yaml \
-  --optimizer_backend openai_chat \
-  --target_backend codex_exec \
-  --optimizer_model gpt-5.5 \
-  --target_model gpt-5.5-codex \
-  --use_deep_reflect true \
-  --skill_update_mode rewrite_from_suggestions
-```
+    --config configs/searchqa/default.yaml \
+    --split_dir /path/to/your/searchqa_split \
+    --azure_openai_endpoint https://your-resource.openai.azure.com/ \
+    --optimizer_model gpt-5.5 \
+    --target_model gpt-5.5
 
-SWEBench:
-
-```bash
+# Train on LiveMathematicianBench:
 python scripts/train.py \
-  --config configs/swebench/default.yaml \
-  --cfg-options env.dataset_name=lite env.split_ratio=2:1:7
+    --config configs/livemathematicianbench/default.yaml \
+    --split_dir /path/to/your/livemath_split \
+    --azure_openai_endpoint https://your-resource.openai.azure.com/ \
+    --optimizer_model gpt-5.5 \
+    --target_model gpt-5.5
+
+# Train on ALFWorld:
+python scripts/train.py \
+    --config configs/alfworld/default.yaml \
+    --split_dir /path/to/your/alfworld_split \
+    --azure_openai_endpoint https://your-resource.openai.azure.com/ \
+    --optimizer_model gpt-5.5 \
+    --target_model gpt-5.5
 ```
 
+Key CLI arguments:
+
+| Argument | Description | Example |
+|---|---|---|
+| `--config` | Benchmark config YAML | `configs/searchqa/default.yaml` |
+| `--split_dir` | Path to data split directory | `/path/to/split` |
+| `--azure_openai_endpoint` | Azure OpenAI endpoint URL | `https://your-resource.openai.azure.com/` |
+| `--optimizer_model` | Optimizer model deployment name | `gpt-5.5` |
+| `--target_model` | Target model deployment name | `gpt-5.5` |
+| `--num_epochs` | Number of training epochs | `4` |
+| `--batch_size` | Batch size per step | `40` |
+| `--workers` | Parallel rollout workers | `8` |
+| `--out_root` | Output directory | `outputs/my_run` |
+
 ### Eval Only
 
-Evaluate a specific skill without training:
+Evaluate a trained skill on specific data splits without training:
 
 ```bash
+# Evaluate on test set only:
 python scripts/eval_only.py \
   --config configs/searchqa/default.yaml \
-  --skill skillopt/envs/searchqa/skills/initial.md
+  --skill outputs/my_run/best_skill.md \
+  --split valid_unseen \
+  --split_dir /path/to/searchqa_split \
+  --azure_openai_endpoint https://your-resource.openai.azure.com/
+
+# Evaluate on all splits (train + val + test):
+python scripts/eval_only.py \
+  --config configs/searchqa/default.yaml \
+  --skill outputs/my_run/best_skill.md \
+  --split all \
+  --split_dir /path/to/searchqa_split \
+  --azure_openai_endpoint https://your-resource.openai.azure.com/
 ```
 
----
+| Split | Description |
+|---|---|
+| `valid_unseen` | Test set |
+| `valid_seen` | Validation set |
+| `train` | Training set |
+| `all` | All splits combined (default) |
 
-## Output Structure
+### Output Structure
 
-Each run writes a structured output directory:
+Each run writes to a structured output directory:
 
 ```
 outputs/<run_name>/
 ├── config.json              # Flattened runtime config
-├── history.json             # Per-step history records
-├── runtime_state.json       # Resume state (for auto-resume)
-├── best_skill.md            # Current best validated skill
+├── history.json             # Per-step training history
+├── runtime_state.json       # Resume checkpoint
+├── best_skill.md            # Best validated skill document
 ├── skills/skill_vXXXX.md   # Skill snapshot per step
-├── steps/step_XXXX/        # Per-step artifacts
-│   ├── merged_patch.json
-│   ├── ranked_edits.json
-│   ├── candidate_skill.md
-│   ├── edit_apply_report.json
-│   ├── rewrite_result.json  # when rewrite mode is enabled
-│   └── selection_eval/
-├── slow_update/epoch_XX/
-├── meta_skill/epoch_XX/
-└── meta_reflect/epoch_XX/
+├── steps/step_XXXX/        # Per-step artifacts (patches, evals)
+├── slow_update/epoch_XX/   # Slow update logs
+└── meta_skill/epoch_XX/    # Meta skill logs
 ```
 
-### Resume Behavior
-
-The trainer resumes from `runtime_state.json` when present. That state tracks:
-
-- Last completed step
-- Current skill path and score
-- Best skill path and score
-- Origin tags for current and best skill
-
----
-
-## Extending SkillOpt
-
-### Add a New Benchmark
-
-1. Create `skillopt/envs/<your_env>/` with:
-   - `adapter.py` — implements `EnvAdapter`
-   - `dataloader.py` — data loading logic
-   - `rollout.py` — target execution logic
-   - `skills/initial.md` — initial skill document
-2. Add a config at `configs/<your_env>/default.yaml`
-3. Register in `skillopt/envs/__init__.py`
-
-See `skillopt/envs/_template/` for a scaffold.
-
-### Add a New Model Backend
-
-Implement a backend in `skillopt/model/` following the interface in `skillopt/model/common.py`, then register it in `skillopt/model/router.py`.
+Re-running the same command auto-resumes from the last completed step.
 
 ---
 
@@ -388,34 +207,26 @@ pip install -e ".[webui]"
 python -m skillopt_webui.app
 ```
 
-Provides browser-based config selection, training launch, and real-time log monitoring.
-
----
-
-## Minimal Setup
+| Flag | Default | Description |
+|---|---|---|
+| `--port` | 7860 | Server port |
+| `--host` | `0.0.0.0` | Bind address |
+| `--share` | off | Create a public Gradio share link |
 
 ```bash
-conda create -n skillopt python=3.11
-conda activate skillopt
-pip install -e .
+# With public share link (useful for remote servers)
+python -m skillopt_webui.app --share
 ```
 
-Depending on the benchmark, you may also need:
-
-```bash
-pip install datasets gymnasium numpy
-```
-
-For SWEBench, you also need a working Docker environment plus the SWE-bench harness dependencies.
-
 ---
 
 ## Citation
 
 ```bibtex
 @article{skillopt2026,
-  title={SkillOpt: Executive Strategy for Self-Evolving Agent Skills},
+  title={SKILLOPT: Executive Strategy for Self-Evolving Agent Skills},
   author={SkillOpt Team},
   year={2026}
 }
 ```
+