Merge pull request #49 from Kirchberg/codex/codex-skill-first-upstream

Make Codex integration skill-first
2026-07-03 14:02:58 +08:00 · 2026-06-15 10:21:43 +00:00
parent 1b2652c6f8 d31e9d9407
commit e8c3e10b30
8 changed files with 112 additions and 68 deletions
--- a/README.md
+++ b/README.md
@@ -80,7 +80,7 @@ harvest session transcripts → mine recurring tasks → replay offline
 | Platform | Folder | Install |
 |---|---|---|
 | **Claude Code** | [`plugins/claude-code`](plugins/claude-code) | `/plugin marketplace add ./plugins/claude-code` → `/skillopt-sleep` |
-| **Codex** | [`plugins/codex`](plugins/codex) | `bash plugins/codex/install.sh` → `/skillopt-sleep` |
+| **Codex** | [`plugins/codex`](plugins/codex) | `bash plugins/codex/install.sh` → `skillopt-sleep` skill |
 | **Copilot** | [`plugins/copilot`](plugins/copilot) | register `plugins/copilot/mcp_server.py` as an MCP server |

 **Validated on real models.** On the public
--- a/docs/sleep/PR_DRAFT.md
+++ b/docs/sleep/PR_DRAFT.md
@@ -15,7 +15,7 @@ Synthesizes SkillOpt (validation-gated bounded text edits), Claude Dreams
 Shipped as plugins for **three agents**, one engine + three thin shells:

 - **Claude Code** — `.claude-plugin` + `/sleep` command + skill + hooks
-- **Codex** — `~/.codex/prompts/sleep.md` + `~/.agents/skills` + `install.sh`
+- **Codex** — user-level `skillopt-sleep` skill + shared runner + `install.sh`
 - **Copilot** — a stdlib-only MCP server exposing `sleep_*` tools

 ## Design notes
--- a/docs/sleep/plugin_load_test.md
+++ b/docs/sleep/plugin_load_test.md
@@ -23,7 +23,7 @@ from scratch for this test. Two forms were used:
 | Shell | What was run | Result |
 |---|---|---|
 | **Claude Code** (`scripts/sleep.sh`) | `harvest`, full `run`, `adopt` | harvest found 2 sessions → 2 tasks; `run` staged a proposal; `adopt` honored the safety contract (no live change when nothing was accepted) |
-| **Codex** (`install.sh` + shared runner) | `install.sh` into a temp HOME | placed `~/.codex/prompts/sleep.md` and `~/.agents/skills/skillopt-sleep/SKILL.md` correctly |
+| **Codex** (`install.sh` + shared runner) | `install.sh` into a temp HOME | placed the user-level `~/.agents/skills/skillopt-sleep/SKILL.md` skill correctly and moved any legacy custom prompt aside instead of installing one |
 | **Copilot** (`mcp_server.py`) | `initialize` → `tools/list` → `tools/call sleep_harvest` | 5 tools listed; `sleep_harvest` returned real engine output (2 sessions → 2 tasks) |

 ### Genuine improvement (real model, fresh persona)
@@ -71,6 +71,6 @@ Shell checks:
 # Copilot MCP server
 printf '%s\n' '{"jsonrpc":"2.0","id":1,"method":"tools/list"}' \
  | SKILLOPT_SLEEP_REPO="$(pwd)" python3 plugins/copilot/mcp_server.py
-# Codex installer (into a throwaway HOME)
+# Codex skill installer (into a throwaway HOME)
 HOME=$(mktemp -d) bash plugins/codex/install.sh
 ```
--- a/plugins/README.md
+++ b/plugins/README.md
@@ -20,6 +20,12 @@ sleep** idea (short-term experience → long-term competence).

 ---

+| Platform | Folder | Mechanism | Status |
+|---|---|---|---|
+| **Claude Code** | [`claude-code/`](claude-code) | `.claude-plugin` + `/skillopt-sleep` command + skill + hooks | full, installable |
+| **Codex** | [`codex/`](codex) | user-level `skillopt-sleep` skill + shared runner | full |
+| **Copilot** | [`copilot/`](copilot) | MCP server (`sleep_*` tools) + `copilot-instructions` | full (MCP) |
+
 ## Install (pick your agent)

 | Platform | Install | Then |
--- a/plugins/codex/README.md
+++ b/plugins/codex/README.md
@@ -14,28 +14,35 @@ as the Claude Code plugin (`skillopt_sleep`), wrapped for Codex.
 ## What Codex supports (and what we use)

 Codex (`@openai/codex`) extends via **`AGENTS.md`** instructions, **skills** at
-`~/.agents/skills/<name>/SKILL.md`, and **custom prompts** at
-`~/.codex/prompts/<name>.md` (invoked as `/<name>`). This integration ships all
-three, plus a shared runner.
+`~/.agents/skills/<name>/SKILL.md`, and plugins that can distribute skills.
+Custom prompts are deprecated in Codex, so this integration is skill-first: the
+installed `skillopt-sleep` skill contains the launch commands and operating
+rules. The shared runner remains a plain shell entrypoint that the skill calls.

 ## Install

 ```bash
 git clone <repo-url> SkillOpt-Sleep
 cd SkillOpt-Sleep
-bash plugins/codex/install.sh          # installs the /skillopt-sleep prompt + skill
+bash plugins/codex/install.sh          # installs the skill
 export SKILLOPT_SLEEP_REPO="$(pwd)"    # so the runner is found from anywhere
 ```

+If a previous install created `~/.codex/prompts/sleep.md`, the installer moves
+that deprecated prompt aside with a `.skillopt-legacy*.bak` suffix.
+
 Requires Python ≥ 3.10 and the `codex` CLI on PATH.

 ## Use

+Mention `$skillopt-sleep` where Codex supports explicit skill mentions, or ask
+Codex in natural language:
+
 ```text
-/skillopt-sleep status      # what's happened
-/skillopt-sleep dry-run     # safe preview, stages nothing
-/skillopt-sleep run         # full cycle, stages a reviewed proposal (no live edits)
-/skillopt-sleep adopt       # apply the staged proposal (with backup)
+Use the skillopt-sleep skill to run status for this project.
+Use the skillopt-sleep skill to run a dry-run for this project.
+Use the skillopt-sleep skill to run the full cycle for this project with the Codex backend.
+Use the skillopt-sleep skill to adopt the latest staged proposal.
 ```

 Or call the engine directly:
@@ -53,7 +60,7 @@ identically — see [`../../docs/sleep/CONTROLLABLE_DREAMING.md`](../../docs/sle

 - Codex's `exec` runs shell, so the real-tool-loop replay (e.g. the
  `tool_called: search` benchmark seed) works natively.
-- Codex's standalone *plugin-package manifest* format is not yet a stable public
-  spec; this integration uses the documented `AGENTS.md` + skills + prompts
-  mechanisms, which are stable. If/when a `codex plugin` package format ships,
-  we'll add a one-file manifest.
+- This integration no longer installs a `.codex/prompts` slash command. Skills
+  are the reusable Codex workflow surface; mention `skillopt-sleep` explicitly
+  or ask for a sleep/dream/offline self-improvement run and Codex can load the
+  skill.
--- a/plugins/codex/install.sh
+++ b/plugins/codex/install.sh
@@ -1,24 +1,30 @@
 #!/usr/bin/env bash
-# Install the SkillOpt-Sleep Codex integration into the user's ~/.codex and
-# ~/.agents directories. Idempotent; prints what it does.
+# Install the SkillOpt-Sleep Codex integration as a user-level Codex skill.
+# Idempotent; prints what it does.
 set -euo pipefail

 REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
 CODEX_HOME="${CODEX_HOME:-$HOME/.codex}"
 AGENTS_SKILLS="${HOME}/.agents/skills"
+LEGACY_PROMPT="$CODEX_HOME/prompts/sleep.md"

 echo "[install] repo: $REPO_ROOT"

-# 1) custom /skillopt-sleep prompt
-mkdir -p "$CODEX_HOME/prompts"
-cp "$REPO_ROOT/plugins/codex/prompts/skillopt-sleep.md" "$CODEX_HOME/prompts/skillopt-sleep.md"
-echo "[install] /skillopt-sleep prompt   -> $CODEX_HOME/prompts/skillopt-sleep.md"
-
-# 2) user-level skill
+# 1) user-level skill
 mkdir -p "$AGENTS_SKILLS/skillopt-sleep"
 cp "$REPO_ROOT/plugins/codex/skills/skillopt-sleep/SKILL.md" "$AGENTS_SKILLS/skillopt-sleep/SKILL.md"
 echo "[install] skill           -> $AGENTS_SKILLS/skillopt-sleep/SKILL.md"

+# 2) retire the old custom prompt entrypoint from previous installs
+if [ -f "$LEGACY_PROMPT" ]; then
+  backup="${LEGACY_PROMPT}.skillopt-legacy.bak"
+  if [ -e "$backup" ]; then
+    backup="${LEGACY_PROMPT}.skillopt-legacy.$(date +%Y%m%d%H%M%S).bak"
+  fi
+  mv "$LEGACY_PROMPT" "$backup"
+  echo "[install] legacy prompt  -> $backup"
+fi
+
 # 3) record the repo location so the runner is found from anywhere
 echo "[install] add to your shell profile:"
 echo "    export SKILLOPT_SLEEP_REPO=\"$REPO_ROOT\""
@@ -29,8 +35,10 @@ cat <<EOF
 [install] Optional — add this to ~/.codex/AGENTS.md so Codex always knows the tool:

  ## SkillOpt-Sleep
-  An offline self-improvement cycle is available. To run it:
-  \`bash "$REPO_ROOT/plugins/run-sleep.sh" status\`. Use \`/skillopt-sleep\` for the guided flow.
+  Use the skillopt-sleep skill when I ask to run a sleep/dream/offline
+  self-improvement cycle. The runner is:
+  \`bash "$REPO_ROOT/plugins/run-sleep.sh" status --project "\$(pwd)"\`.

-Done. Try:  /skillopt-sleep status
+Done. Try asking Codex:
+  Use the skillopt-sleep skill to run status for this project.
 EOF
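
Review note on step 2 of `install.sh`: the backup naming can be exercised in isolation. The sketch below wraps the same logic in a helper named `retire_legacy_prompt` (a hypothetical name, not part of the repo) and runs it against a throwaway `CODEX_HOME`. It takes the prompt path as an argument because the removed lines above copied the prompt to `prompts/skillopt-sleep.md`, while `LEGACY_PROMPT` points at `prompts/sleep.md`. Whether both names exist on real installs is an assumption drawn only from this diff.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper mirroring step 2 of install.sh: move a legacy prompt
# aside rather than deleting it, and never overwrite an earlier backup.
retire_legacy_prompt() {
  local prompt="$1"
  [ -f "$prompt" ] || return 0   # nothing to retire
  local backup="${prompt}.skillopt-legacy.bak"
  if [ -e "$backup" ]; then
    # A backup from an earlier run exists; add a timestamp to keep both.
    backup="${prompt}.skillopt-legacy.$(date +%Y%m%d%H%M%S).bak"
  fi
  mv "$prompt" "$backup"
  echo "[install] legacy prompt  -> $backup"
}

# Exercise it in a throwaway CODEX_HOME with both file names seen in this diff.
CODEX_HOME="$(mktemp -d)"
mkdir -p "$CODEX_HOME/prompts"
touch "$CODEX_HOME/prompts/sleep.md" "$CODEX_HOME/prompts/skillopt-sleep.md"
for name in sleep.md skillopt-sleep.md; do
  retire_legacy_prompt "$CODEX_HOME/prompts/$name"
done
ls "$CODEX_HOME/prompts"
```

Calling the helper on a missing file is a no-op, which keeps the installer idempotent as its header comment says.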
--- a/plugins/codex/prompts/skillopt-sleep.md
+++ b/plugins/codex/prompts/skillopt-sleep.md
@@ -1,21 +0,0 @@
-# /skillopt-sleep — SkillOpt-Sleep for Codex
-#
-# Custom prompt: copy this file to ~/.codex/prompts/skillopt-sleep.md and invoke with
-# `/skillopt-sleep` in the Codex CLI. ($ARGUMENTS is the text after /skillopt-sleep.)
-
-Run the SkillOpt-Sleep offline self-evolution cycle. Action: $ARGUMENTS
-(empty → "status").
-
-Use the bundled runner via shell:
-
-    bash "${SKILLOPT_SLEEP_REPO:?set SKILLOPT_SLEEP_REPO to the repo root}/plugins/run-sleep.sh" $ARGUMENTS --project "$(pwd)"
-
-Then:
- For `run`/`dry-run`: read the staged `report.md` and show the held-out
-  baseline → candidate score and the proposed edits. `run` only stages a
-  proposal; nothing live changes until `adopt`.
- For `adopt`: confirm which files were updated and that a backup was written.
- Never edit the user's AGENTS.md / skills yourself; only `adopt` does that.
-
-Default backend is `mock` (no API spend). Add `--backend codex` for real
-improvement on the user's Codex budget.
--- a/plugins/codex/skills/skillopt-sleep/SKILL.md
+++ b/plugins/codex/skills/skillopt-sleep/SKILL.md
@@ -1,49 +1,93 @@
 ---
 name: skillopt-sleep
-description: Nightly offline self-evolution for a Codex agent. Reviews past sessions, replays recurring tasks, and consolidates validated memory + skills behind a held-out gate. Use when the user wants Codex to learn from past usage, run a "sleep"/"dream" cycle, or schedule offline self-optimization.
+description: "Use when the user wants Codex to self-improve from past usage, asks about a nightly/offline 'sleep' or 'dream' cycle, wants Codex to review past sessions, learn preferences, consolidate memory/skills, run dry-run/run/adopt/status for SkillOpt-Sleep, or schedule offline self-optimization. Drives the skillopt_sleep engine: harvest past sessions -> mine recurring tasks -> replay offline -> consolidate validated memory + skills behind a held-out gate."
 ---

-# SkillOpt-Sleep (Codex skill)
+# SkillOpt-Sleep: offline self-evolution for a local Codex agent

-This skill drives the `skillopt_sleep` engine — an offline "sleep cycle" that
-makes a Codex agent better at the user's recurring work without retraining.
+SkillOpt-Sleep gives the user's Codex agent a sleep cycle. While the user is
+offline or on demand, it reviews past local sessions, re-runs recurring tasks
+on the user's own budget, and consolidates what it learns into memory and
+skills. It keeps only changes that pass a held-out validation gate, and live
+files change only after the user explicitly adopts a staged proposal. There is
+no model-weight training.

 ## When to use

-Trigger when the user wants to: review past sessions, learn their preferences,
-consolidate feedback into long-term memory/skills, run a nightly/offline
-self-improvement cycle, or adopt a staged proposal.
+Trigger when the user wants any of:

-## How to run it
+- Codex to learn from past sessions or get better the more they use it;
+- a nightly/scheduled or on-demand sleep/dream/offline self-improvement run;
+- to review past sessions and distill recurring tasks;
+- to consolidate feedback into memory or managed skills;
+- to run `status`, `harvest`, `dry-run`, `run`, or `adopt` for SkillOpt-Sleep.
+
+## The cycle
+
+1. **Harvest** - read local session transcripts according to the engine
+   configuration and normalize them into session digests.
+2. **Mine** - turn digests into recurring `TaskRecord`s with outcomes and
+   checkable references where possible.
+3. **Replay** - re-run mined tasks offline under the current skill and memory.
+4. **Consolidate** - reflect on failures and propose bounded edits.
+5. **Gate** - accept edits only when the held-out validation score improves.
+6. **Stage** - write the proposal under
+   `<project>/.skillopt-sleep/staging/<date>/`; nothing live changes.
+7. **Adopt** - only after explicit user approval, copy staged files over live
+   files with backups.
+
+## How to drive it

 Invoke the bundled runner via shell (Codex `exec` has shell access). The runner
-finds the engine and a Python ≥ 3.10 automatically:
+finds the engine and a Python >= 3.10 automatically.

 ```bash
 # point at the repo if it isn't auto-detected from CWD:
 export SKILLOPT_SLEEP_REPO=/path/to/SkillOpt-Sleep
-bash "$SKILLOPT_SLEEP_REPO/plugins/run-sleep.sh" <action> --project "$(pwd)"
+
+bash "$SKILLOPT_SLEEP_REPO/plugins/run-sleep.sh" status --project "$(pwd)"
+bash "$SKILLOPT_SLEEP_REPO/plugins/run-sleep.sh" harvest --project "$(pwd)"
+bash "$SKILLOPT_SLEEP_REPO/plugins/run-sleep.sh" dry-run --project "$(pwd)" --backend mock
+bash "$SKILLOPT_SLEEP_REPO/plugins/run-sleep.sh" run --project "$(pwd)" --backend codex
+bash "$SKILLOPT_SLEEP_REPO/plugins/run-sleep.sh" adopt --project "$(pwd)"
 ```

-`<action>` ∈ `status | dry-run | run | adopt | harvest`. Use `--backend codex`
-for real improvement on the user's own Codex budget (default `mock` = no spend).
+Actions are `status`, `harvest`, `dry-run`, `run`, and `adopt`.
+
+- Default backend is `mock`, which is deterministic and spends no API budget.
+- `--backend codex` uses the user's Codex budget for real improvement.
+- Keep `dry-run --backend mock` as the first smoke check unless the user
+  explicitly asked for a real optimization run.

 ## Steps

 1. Run the requested action; capture stdout.
-2. For `run`/`dry-run`: read the staged `report.md` it prints and show the user
-   the held-out baseline → candidate score and the exact proposed edits.
-3. `run` only **stages** a proposal under `<project>/.skillopt-sleep/staging/`;
-   nothing live changes until `adopt`. Offer `/skillopt-sleep adopt`.
-4. Never hand-edit the user's `AGENTS.md` / skills yourself — only `adopt` does,
-   and it backs up first.
+2. For `dry-run` and `run`, report the held-out baseline -> candidate score,
+   gate action, task count, session count, and exact proposed edits.
+3. If a staging directory is printed, read `report.md` before summarizing.
+4. `run` only stages a proposal; nothing live changes until `adopt`.
+5. Offer adoption only after the user has reviewed the staged proposal.
+6. Never hand-edit the user's `AGENTS.md`, memory, or skills as a substitute
+   for `adopt`; adoption is the safety boundary and writes backups first.
+
+## Hard rules
+
+- Harvest is read-only. Do not edit archived sessions or raw transcripts.
+- Keep raw secrets, credentials, private user data, and unsanitized transcript
+  contents out of messages, logs, generated artifacts, and commits.
+- Show validation evidence before recommending adoption.
+- Treat generated edits as proposals, not as source of truth.
+- Do not rely on deprecated custom prompts or `/sleep` slash commands for this
+  Codex integration. This skill is the entrypoint.

 ## Validate

 ```bash
+python -m skillopt_sleep dry-run --project "$(pwd)" --backend mock --json
 python -m skillopt_sleep.experiments.run_gbrain --backend codex \
  --seeds brief-writer --data-root /path/to/gbrain-evals/eval/data/skillopt-v1 \
  --nights 2 --limit-replay 3 --limit-holdout 3
 ```
-A deficient skill goes 0.00 → 1.00 on a held-out set; the optimizer's edits are
-gated on real-task performance.
+
+A deficient skill goes 0.00 -> 1.00 on a held-out set; the optimizer's edits
+are gated on real-task performance.
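
Review note on the Steps and Hard rules in the `SKILL.md` diff above: the staging/adoption boundary can be sketched as two separate shell functions, so that adoption never happens in the same call that stages a proposal. The runner path and the `status`/`dry-run`/`run`/`adopt` arguments come from the diff. The function names, the `SLEEP_RUNNER` override, and the literal `yes` confirmation are assumptions made for illustration only.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Runner location as documented in SKILL.md.
# SLEEP_RUNNER is a hypothetical override, handy for testing with a stub.
RUNNER="${SLEEP_RUNNER:-${SKILLOPT_SLEEP_REPO:-.}/plugins/run-sleep.sh}"

# Stage only: smoke-check with the mock backend, then stage a real proposal.
stage_proposal() {
  local project="$1"
  bash "$RUNNER" dry-run --project "$project" --backend mock
  bash "$RUNNER" run --project "$project" --backend codex
}

# Adopt only when the caller passes the literal confirmation "yes".
adopt_proposal() {
  local project="$1" confirm="${2:-}"
  if [ "$confirm" != "yes" ]; then
    echo "[sleep] not adopting: review the staged report.md first" >&2
    return 1
  fi
  bash "$RUNNER" adopt --project "$project"
}
```

A typical use would be `stage_proposal "$(pwd)"`, a manual read of the staged `report.md`, and then `adopt_proposal "$(pwd)" yes`.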