Merge pull request #59 from Elzlxx/feat/openclaw-skillopt-sleep

feat(plugins): add OpenClaw shell for SkillOpt-Sleep
2026-07-03 14:02:58 +08:00 · 2026-06-15 18:26:12 +08:00
parent 00d07bc59a 553446575a
commit 576f2f8bad
10 changed files with 1244 additions and 0 deletions
--- a/plugins/openclaw/README.md
+++ b/plugins/openclaw/README.md
@@ -0,0 +1,112 @@
+# OpenClaw Plugin for SkillOpt-Sleep
+
+Thin shell for running [SkillOpt-Sleep](https://github.com/microsoft/SkillOpt) on [OpenClaw](https://github.com/openclaw/openclaw).
+
+## What it does
+
+Adds a nightly "sleep cycle" to any OpenClaw agent. The cycle:
+
+1. **Harvests** recent session transcripts from `~/.openclaw/agents/<name>/sessions/*.jsonl`
+2. **Mines** recurring task patterns using the optimizer LLM
+3. **Replays** each pattern with the current `SKILL.md` (baseline) and a candidate `SKILL.md` (with proposed edits)
+4. **Gates** the candidate against the held-out score (rejects regressions)
+5. **Stages** the accepted proposal in `~/.skillopt-sleep/staging/<night>/`
+6. Leaves adoption to the operator (Ethan)
+
+Nothing live changes until you adopt. Every adopt backs up first.
+
+## Install
+
+The plugin is a thin wrapper around the engine at `~/.openclaw/workspace/SkillOpt/skillopt_sleep/`:
+
+```bash
+# 1. Clone the engine (one-time)
+cd ~/.openclaw/workspace
+git clone https://github.com/microsoft/SkillOpt.git
+
+# 2. Install the OpenClaw skill (this folder)
+ln -s /path/to/openclaw ~/.openclaw/workspace/skills/skillopt-sleep
+
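+# 3. Configure
+mkdir -p ~/.skillopt-sleep
+cp ~/.openclaw/workspace/skills/skillopt-sleep/config.json ~/.skillopt-sleep/config.json
+$EDITOR ~/.skillopt-sleep/config.json
+# Set backend = "openclaw-deepseek"
+# Set model = "deepseek-v4-pro" (or "deepseek-v4-flash" for budget)
+# run_sleep.py reads the skill folder's config.json by default;
+# pass --config ~/.skillopt-sleep/config.json to use this copy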
+
+# 4. Set API key
+echo 'export DEEPSEEK_API_KEY="sk-..."' >> ~/.openclaw/.env
+
+# 5. Add the nightly cron
+(crontab -l 2>/dev/null; echo "0 3 * * * cd ~/.openclaw/workspace/skills/skillopt-sleep && bash run_sleep_cron.sh >> ~/.skillopt-sleep/nightly.log 2>&1") | crontab -
+```
+
+## Use
+
+### Manual trigger
+
+```bash
+# Run one cycle now
+python3 ~/.openclaw/workspace/skills/skillopt-sleep/run_sleep.py
+
+# Dry run (report only)
+python3 ~/.openclaw/workspace/skills/skillopt-sleep/run_sleep.py --dry-run
+
+# One category only
+python3 ~/.openclaw/workspace/skills/skillopt-sleep/run_sleep.py --tasks tests/research-cron-tasks.json
+```
+
+### Slash command
+
+```bash
+# In any OpenClaw session
+/sleep status
+/sleep run
+/sleep run research-cron
+/sleep dry-run
+/sleep adopt              # adopt most recent accepted proposal
+/sleep reject             # discard most recent
+/sleep cost
+```
+
+## Architecture
+
+```
+plugins/openclaw/
+├── README.md                       # this file
+├── run_sleep_cron.sh               # wrapper for cron invocation
+├── run_sleep.py                    # main entry point
+├── slash_sleep.py                  # /sleep command implementation
+├── skillopt_sleep_openclaw.py      # DeepSeek + Ollama backend
+├── config.json                     # engine config
+├── SKILL.md                        # OpenClaw skill manifest
+└── tests/                          # held-out test sets
+    ├── research-cron-tasks.json
+    ├── devops-tasks.json
+    └── wiki-tasks.json
+```
+
+The OpenClaw shell is one engine (skillopt_sleep/) + one backend (DeepSeek/Ollama) + four thin wrappers (cron, slash, skill, tests).
+
+## Why this matters for OpenClaw
+
+OpenClaw currently has no built-in "self-evolving skills" mechanism. The community has:
+
+- **Manual skills** — Ethan writes them
+- **LLM-generated skills** — one-shot, no validation
+- **Self-revision** — unbounded, no quality bar
+
+SkillOpt-Sleep adds a fourth option: **validated self-evolution**. The skill is the training target, the engine is the optimizer, the gate is the quality bar, and the operator is the human in the loop.
+
+## Validation
+
+Validated on the public [gbrain-evals](https://github.com/garrytan/gbrain-evals) `skillopt-v1` benchmark with real Claude and Codex (deficient skills 0.00 → 1.00 on held-out, all 4 seeds).
+
+End-to-end test on our own 14-task held-out set: pipeline runs, gate correctly rejects non-improvements, staging artifacts land in `~/.skillopt-sleep/staging/<night>/`.
+
+## Cost
+
+Measured: ~$0.02/night with `deepseek-v4-pro` at 12 tasks/night. ~$0.59/month, $7.18/year.
+
+## License
+
+MIT (same as SkillOpt core).
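
Note on the README above: the gate step ("rejects regressions", "rejects non-improvements") can be pictured as a strict comparison of held-out scores. The following is a minimal sketch under that reading, not the engine's code; `gate` and `min_gain` are hypothetical names introduced here for illustration.

```python
def gate(baseline: float, candidate: float, min_gain: float = 0.0) -> bool:
    """Accept the candidate only if it strictly beats the baseline.

    `min_gain` is a hypothetical margin (not a SkillOpt setting); with the
    default of 0.0, ties and regressions are both rejected.
    """
    return candidate - baseline > min_gain


# Hypothetical (baseline, candidate) held-out scores for three nights.
nights = {"improved": (0.50, 0.75), "tie": (0.50, 0.50), "regressed": (0.50, 0.25)}
decisions = {name: gate(b, c) for name, (b, c) in nights.items()}
print(decisions)
```

Only the "improved" night would be staged under this rule; the other two leave the current `SKILL.md` untouched.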
--- a/plugins/openclaw/SKILL.md
+++ b/plugins/openclaw/SKILL.md
@@ -0,0 +1,96 @@
+---
+name: skillopt-sleep
+description: Validate and refine agent skills through nightly sleep cycles with held-out gates. Wraps Microsoft's SkillOpt-Sleep engine for the OpenClaw/DeepSeek stack.
+---
+
+# skillopt-sleep — OpenClaw Adaptation of Microsoft SkillOpt-Sleep
+
+A nightly self-improvement loop that reads our session transcripts, mines recurring workflow patterns, replays them with proposed skill edits, and gates the proposals against a held-out test set. Only improvements that beat baseline are staged for human adoption.
+
+## When To Use
+
+- After Hermes's Weekly Skill Review (or as its replacement)
+- When a skill is being used 10+ times/week and could be tighter
+- Before promoting a new skill from `skill-proposals/` to `skills/`
+- When a skill regresses in observed quality
+
+## What It Does (One Cycle)
+
+```
+harvest session transcripts  ->  mine recurring task patterns
+                              ->  replay each pattern (current skill vs proposed)
+                              ->  GATE: must improve held-out score
+                              ->  stage proposal
+                              ->  Ethan adopts (manual)
+```
+
+Nothing live changes until Ethan adopts. Every adopt backs up first.
+
+## Architecture
+
+```
+skills/skillopt-sleep/
+├── SKILL.md                          # this file
+├── config.json                       # engine config (backend, budgets, etc.)
+├── run_sleep.py                      # entry point
+└── skillopt_sleep_openclaw.py        # DeepSeek/Ollama backend
+```
+
+The engine itself is at `~/.openclaw/workspace/SkillOpt/skillopt_sleep/` (cloned from microsoft/SkillOpt).
+
+## Usage
+
+```bash
+# Run one cycle with current config
+cd ~/.openclaw/workspace/skills/skillopt-sleep
+python3 run_sleep.py
+
+# Dry run (report only, no staging)
+python3 run_sleep.py --dry-run
+
+# Use a pre-built task set (recommended for testing)
+python3 run_sleep.py --tasks tests/research-cron-tasks.json
+```
+
+## Config (config.json)
+
+Key knobs:
+- `backend: "openclaw-deepseek"` — our custom backend
+- `model: "deepseek-v4-pro"` — optimizer model
+- `edit_budget: 3` — max bounded edits per night
+- `gate_mode: "on"` — validation-gated (rejects regressions)
+- `auto_adopt: false` — require Ethan to adopt manually
+- `max_tasks_per_night: 12` — cap to control cost
+
+## Cost Estimate
+
+Per night: 12 tasks × (1 attempt + 1 judge + 1 reflect) × ~$0.005/1K tokens × ~3K tokens/call ≈ $0.54 at these figures; budget **$0.50-2.00/night**.
+
+## Outputs
+
+- Report: `~/.skillopt-sleep/state.json` (running totals)
+- Staging: `~/.skillopt-sleep/staging/<night>/`
+  - `report.md` — readable summary
+  - `best_skill.md` — proposed skill
+  - `edits.json` — bounded edit list
+  - `before.md` / `after.md` — diffs
+
+## Held-Out Test Sets (Phase 2)
+
+Located at `tests/<category>-tasks.json`. Each task has:
+- `prompt` — the recurring task
+- `reference` — exact-match gold answer
+- `rubric` — soft score rubric (0-1)
+- `domain` — research/devops/wiki/etc.
+
+Currently building for 3 categories:
+- research-cron-output
+- devops-infrastructure-check
+- wiki-canonical-guide
+
+## When NOT To Use
+
+- For a one-off workflow (not a recurring pattern)
+- During a crisis/incident (humans must lead)
+- When session transcripts are < 24h old (not enough signal)
+- For skills < 300 tokens (over-optimization risk)
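
Note on the Cost Estimate in SKILL.md above: the per-night figure is a straight product of four factors. The sketch below only redoes that arithmetic with the numbers quoted in the section; `nightly_cost` is a hypothetical helper, and the token price is the section's rough figure rather than a published rate.

```python
def nightly_cost(tasks: int, calls_per_task: int,
                 usd_per_1k_tokens: float, k_tokens_per_call: float) -> float:
    """Estimated USD per night: tasks x calls x price per 1K tokens x K tokens per call."""
    return tasks * calls_per_task * usd_per_1k_tokens * k_tokens_per_call


# 12 tasks, 3 calls each (attempt + judge + reflect), ~$0.005/1K tokens, ~3K tokens/call
cost = nightly_cost(tasks=12, calls_per_task=3, usd_per_1k_tokens=0.005, k_tokens_per_call=3)
print(f"${cost:.2f}/night")  # $0.54/night
```

The product lands at the low end of the quoted $0.50-2.00 range.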
--- a/plugins/openclaw/config.json
+++ b/plugins/openclaw/config.json
@@ -0,0 +1,30 @@
+{
+  "_comment": "OpenClaw adaptation of skillopt-sleep. Edit and run via run_sleep.py",
+
+  "claude_home": "/home/ethanclaw/.openclaw/agents",
+  "invoked_project": "/home/ethanclaw/.openclaw/workspace",
+  "projects": "invoked",
+  "lookback_hours": 168,
+
+  "max_tasks_per_night": 12,
+  "max_tokens_per_night": 800000,
+  "holdout_fraction": 0.34,
+  "val_fraction": 0.34,
+  "test_fraction": 0.0,
+
+  "backend": "openclaw-deepseek",
+  "model": "deepseek-v4-pro",
+  "gate_mode": "on",
+  "edit_budget": 3,
+  "gate_metric": "mixed",
+  "gate_mixed_weight": 0.5,
+  "replay_mode": "fresh",
+  "evolve_memory": true,
+  "evolve_skill": true,
+  "llm_mine": false,
+
+  "auto_adopt": false,
+  "managed_skill_name": "skillopt-sleep-learned",
+  "redact_secrets": true,
+  "seed": 42
+}
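
Note on config.json above: `"gate_metric": "mixed"` together with `"gate_mixed_weight": 0.5` suggests the gate blends the hard (exact-match) and soft (rubric) scores. The engine's actual formula is not part of this PR, so the sketch below assumes a plain linear blend with the weight applied to the hard score; `mixed_score` is a hypothetical name.

```python
def mixed_score(hard: float, soft: float, weight: float = 0.5) -> float:
    """Assumed blend: `weight` on the hard score, the remainder on the soft score."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    return weight * hard + (1.0 - weight) * soft


# A response that misses the exact match but satisfies most of the rubric.
blended = mixed_score(hard=0.0, soft=0.8)
print(blended)
```

Under this assumption a weight of 1.0 reduces to exact-match scoring and 0.0 to rubric-only scoring.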
--- a/plugins/openclaw/run_sleep.py
+++ b/plugins/openclaw/run_sleep.py
@@ -0,0 +1,122 @@
+#!/usr/bin/env python3
+"""run_sleep.py — OpenClaw entry point for SkillOpt-Sleep.
+
+Runs one nightly sleep cycle:
+  1. harvest recent session transcripts
+  2. mine recurring task patterns
+  3. replay tasks with current skill (baseline) + candidate skill (with proposed edit)
+  4. gate candidate vs baseline on held-out accuracy
+  5. stage the proposal in ~/.skillopt-sleep/staging/<night>/
+  6. leave adoption to Ethan (auto_adopt=false)
+
+Usage:
+  python3 run_sleep.py                  # one cycle, default config
+  python3 run_sleep.py --dry-run        # compute report only, no staging
+  python3 run_sleep.py --tasks path.json  # use a pre-built task file
+"""
+from __future__ import annotations
+
+import argparse
+import json
+import os
+import sys
+from pathlib import Path
+
+# Ensure the skillopt_sleep package is importable (it lives in the cloned repo)
+REPO = Path("/home/ethanclaw/.openclaw/workspace/SkillOpt")
+sys.path.insert(0, str(REPO))
+
+# Register our backend before importing cycle
+from skillopt_sleep_openclaw import OpenClawDeepSeekBackend
+import skillopt_sleep.backend as _b
+_b._BACKENDS = getattr(_b, "_BACKENDS", {})
+_b._BACKENDS["openclaw-deepseek"] = OpenClawDeepSeekBackend
+
+# Patch get_backend to know about our backend
+_orig_get_backend = _b.get_backend
+
+def get_backend(name, model="", codex_path=""):
+    if name == "openclaw-deepseek":
+        return OpenClawDeepSeekBackend(model=model or "deepseek-v4-pro")
+    return _orig_get_backend(name, model=model, codex_path=codex_path)
+
+_b.get_backend = get_backend
+
+from skillopt_sleep.cycle import run_sleep_cycle
+from skillopt_sleep.config import load_config
+
+
+def main() -> int:
+    ap = argparse.ArgumentParser(description="OpenClaw SkillOpt-Sleep nightly cycle")
+    ap.add_argument("--dry-run", action="store_true", help="Compute but don't stage")
+    ap.add_argument("--config", default="/home/ethanclaw/.openclaw/workspace/skills/skillopt-sleep/config.json")
+    ap.add_argument("--tasks", default=None, help="Path to pre-built tasks JSON")
+    ap.add_argument("--verbose", action="store_true")
+    args = ap.parse_args()
+
+    # Load the config file (if present) and pass its values to load_config as overrides
+    overrides = {}
+    if os.path.exists(args.config):
+        with open(args.config) as f:
+            overrides.update(json.load(f))
+    overrides.pop("_comment", None)
+
+    cfg = load_config(**overrides)
+
+    seed_tasks = None
+    if args.tasks:
+        from skillopt_sleep.types import TaskRecord
+        with open(args.tasks) as f:
+            raw = json.load(f)
+        # Translate our test-set fields → TaskRecord fields
+        seed_tasks = []
+        for t in raw:
+            seed_tasks.append(TaskRecord(
+                id=t['id'],
+                project=t.get('project', 'openclaw'),
+                intent=t.get('intent') or t.get('prompt', ''),
+                context_excerpt=t.get('context_excerpt', ''),
+                attempted_solution=t.get('attempted_solution', ''),
+                outcome=t.get('outcome', 'unknown'),
+                reference_kind=t.get('reference_kind', 'rubric'),
+                reference=t.get('reference', ''),
+                judge=t.get('judge', {}),
+                tags=t.get('tags', []),
+                source_sessions=t.get('source_sessions', []),
+                split=t.get('split', 'train'),
+            ))
+
+    print(f"[skillopt-sleep] starting cycle...")
+    print(f"  backend: {cfg.get('backend')}")
+    print(f"  project: {cfg.get('invoked_project')}")
+    print(f"  max tasks: {cfg.get('max_tasks_per_night')}")
+    print(f"  edit budget: {cfg.get('edit_budget')}")
+    print(f"  dry_run: {args.dry_run}")
+
+    outcome = run_sleep_cycle(cfg, seed_tasks=seed_tasks, dry_run=args.dry_run)
+
+    r = outcome.report
+    print(f"\n=== Report — night {r.night} ===")
+    print(f"  sessions harvested: {r.n_sessions}")
+    print(f"  tasks mined: {r.n_tasks}  (replayed: {r.n_replayed})")
+    print(f"  baseline: {r.baseline_score:.3f}  ->  candidate: {r.candidate_score:.3f}")
+    print(f"  gate: {r.gate_action}  accepted={r.accepted}")
+    print(f"  tokens: {r.tokens_used}")
+    if r.edits:
+        print(f"  applied edits ({len(r.edits)}):")
+        for e in r.edits:
+            print(f"    [{e.target}/{e.op}] {e.content[:80]}...")
+    if r.rejected_edits:
+        print(f"  rejected edits ({len(r.rejected_edits)}) — kept as negative feedback")
+    if r.notes:
+        for n in r.notes:
+            print(f"  note: {n}")
+    if outcome.staging_dir:
+        print(f"\n  STAGED at: {outcome.staging_dir}")
+        print(f"  Review with: ls {outcome.staging_dir}")
+
+    return 0 if r.accepted or r.candidate_score >= r.baseline_score else 1
+
+
+if __name__ == "__main__":
+    sys.exit(main())
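
Note on run_sleep.py above: the `--tasks` path translates each test-set entry into `TaskRecord` fields, filling defaults and falling back from `intent` to `prompt`. The sketch below mirrors a subset of those defaults with a plain dict so it runs without the engine installed; `to_task_fields` and the sample entry are hypothetical.

```python
from typing import Any, Dict


def to_task_fields(t: Dict[str, Any]) -> Dict[str, Any]:
    """Apply the same defaults run_sleep.py uses, returning a dict instead of a TaskRecord."""
    return {
        "id": t["id"],  # required: a missing id raises KeyError
        "project": t.get("project", "openclaw"),
        "intent": t.get("intent") or t.get("prompt", ""),  # `prompt` is the fallback
        "reference_kind": t.get("reference_kind", "rubric"),
        "reference": t.get("reference", ""),
        "split": t.get("split", "train"),
    }


# Hypothetical test-set entry that only sets `prompt` and `reference`.
entry = {"id": "wiki-001", "prompt": "Summarize the canonical guide",
         "reference": "Covers setup, usage, and rollback."}
fields = to_task_fields(entry)
print(fields["intent"], "|", fields["reference_kind"], "|", fields["split"])
```

One consequence of these defaults is that an entry without `reference_kind` is treated as rubric-scored, so its `reference` text doubles as the rubric in the backend's judge.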
--- a/plugins/openclaw/run_sleep_cron.sh
+++ b/plugins/openclaw/run_sleep_cron.sh
@@ -0,0 +1,76 @@
+#!/bin/bash
+# run_sleep_cron.sh — wrapper for cron-driven nightly sleep cycle
+#
+# Usage: bash run_sleep_cron.sh [category1 category2 ...]
+#   No args: run on all categories in tests/
+#   With args: run only on listed categories (research-cron, devops, wiki)
+#
+# Cron (3am MYT daily):
+#   0 3 * * * cd /home/ethanclaw/.openclaw/workspace/skills/skillopt-sleep && bash run_sleep_cron.sh >> ~/.skillopt-sleep/nightly.log 2>&1
+
+set -euo pipefail
+
+SKILL_DIR="/home/ethanclaw/.openclaw/workspace/skills/skillopt-sleep"
+TESTS_DIR="$SKILL_DIR/tests"
+LOG_DIR="$HOME/.skillopt-sleep/logs"
+mkdir -p "$LOG_DIR"
+
+TIMESTAMP=$(date +%Y%m%d-%H%M%S)
+LOG_FILE="$LOG_DIR/night-$TIMESTAMP.log"
+
+# category → test file map
+declare -A CATEGORIES=(
+    ["research-cron"]="research-cron-tasks.json"
+    ["devops"]="devops-tasks.json"
+    ["wiki"]="wiki-tasks.json"
+)
+
+# Determine which categories to run
+if [ $# -eq 0 ]; then
+    CATS=("research-cron" "devops" "wiki")
+else
+    CATS=("$@")
+fi
+
+{
+    echo "=========================================="
+    echo "SkillOpt-Sleep nightly — $TIMESTAMP"
+    echo "Categories: ${CATS[*]}"
+    echo "=========================================="
+} | tee -a "$LOG_FILE"
+
+# Pre-flight: check DeepSeek API key (environment first, then ~/.openclaw/.env)
+if [ -z "${DEEPSEEK_API_KEY:-}" ] && ! grep -q "DEEPSEEK_API_KEY=" "$HOME/.openclaw/.env" 2>/dev/null; then
+    echo "ERROR: DEEPSEEK_API_KEY not set and not found in ~/.openclaw/.env" | tee -a "$LOG_FILE"
+    exit 1
+fi
+
+EXIT_CODE=0
+for cat in "${CATS[@]}"; do
+    tasks_file="$TESTS_DIR/${CATEGORIES[$cat]:-}"
+    if [ ! -f "$tasks_file" ]; then
+        echo "SKIP: $cat (no tasks file: $tasks_file)" | tee -a "$LOG_FILE"
+        continue
+    fi
+
+    echo "" | tee -a "$LOG_FILE"
+    echo "--- [$cat] starting cycle ---" | tee -a "$LOG_FILE"
+
+    cd "$SKILL_DIR"
+    if python3 run_sleep.py --tasks "$tasks_file" 2>&1 | tee -a "$LOG_FILE"; then
+        echo "--- [$cat] OK ---" | tee -a "$LOG_FILE"
+    else
+        EC=$?
+        echo "--- [$cat] FAILED (exit $EC) ---" | tee -a "$LOG_FILE"
+        EXIT_CODE=$EC
+    fi
+done
+
+{
+    echo ""
+    echo "=========================================="
+    echo "Done. Exit: $EXIT_CODE"
+    echo "=========================================="
+} | tee -a "$LOG_FILE"
+
+exit $EXIT_CODE
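
Note on run_sleep_cron.sh above: the pre-flight `grep` matches any line containing `DEEPSEEK_API_KEY=`, including the `export DEEPSEEK_API_KEY="sk-..."` form the README writes to `~/.openclaw/.env`. Code that goes on to read the value has to tolerate the `export` prefix and the quotes. A minimal sketch of such a reader, with `read_api_key` as a hypothetical helper and a made-up sample key:

```python
import re
from typing import Iterable, Optional

# Optional leading whitespace, optional `export `, then the assignment.
_KEY_LINE = re.compile(r"^\s*(?:export\s+)?DEEPSEEK_API_KEY=(.*)$")


def read_api_key(lines: Iterable[str]) -> Optional[str]:
    """Return the first DEEPSEEK_API_KEY value found, without surrounding quotes."""
    for line in lines:
        m = _KEY_LINE.match(line.rstrip("\n"))
        if m:
            value = m.group(1).strip().strip("'\"")
            return value or None
    return None


sample = ["# secrets\n", 'export DEEPSEEK_API_KEY="sk-example"\n']
key = read_api_key(sample)
print(key)
```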
--- a/plugins/openclaw/skillopt_sleep_openclaw.py
+++ b/plugins/openclaw/skillopt_sleep_openclaw.py
@@ -0,0 +1,275 @@
+"""OpenClaw backend for SkillOpt-Sleep.
+
+Adapts the skillopt_sleep Backend protocol to our DeepSeek + Ollama stack:
+  - attempt/judge/reflect  ->  DeepSeek V4 Pro (or Flash for cost)
+  - embeddings              ->  Ollama nomic-embed-text (already configured)
+
+This backend NEVER mutates live state. It only returns text + EditRecord
+proposals that the gate stages for human review.
+"""
+from __future__ import annotations
+
+import json
+import os
+import re
+import subprocess
+from typing import Any, Dict, List, Optional, Tuple
+
+from skillopt_sleep.backend import Backend, _normalize, exact_score
+from skillopt_sleep.types import EditRecord, ReplayResult, TaskRecord
+
+
+# ── DeepSeek + Ollama HTTP API clients (urllib-based, no extra deps) ──
+
+
+def _chat(messages: List[Dict[str, str]], *, model: str, temperature: float = 0.2, max_tokens: int = 1500) -> str:
+    """Call the DeepSeek chat-completions endpoint via urllib (stdlib only).
+
+    Returns the reply text, or a "[BACKEND_ERROR] ..." string on failure.
+    """
+    import json as _json
+    import urllib.request
+
+    api_key = os.environ.get("DEEPSEEK_API_KEY", "")
+    if not api_key:
+        # try loading from .env (tolerates a leading `export ` and surrounding quotes)
+        env_path = os.path.expanduser("~/.openclaw/.env")
+        if os.path.exists(env_path):
+            with open(env_path) as f:
+                for line in f:
+                    m = re.match(r"\s*(?:export\s+)?DEEPSEEK_API_KEY=(.*)", line)
+                    if m:
+                        api_key = m.group(1).strip().strip("'\"")
+                        break
+
+    base = os.environ.get("DEEPSEEK_BASE_URL", "https://api.deepseek.com/v1")
+
+    payload = {
+        "model": model,
+        "messages": messages,
+        "temperature": temperature,
+        "max_tokens": max_tokens,
+        "stream": False,
+    }
+    req = urllib.request.Request(
+        f"{base}/chat/completions",
+        data=_json.dumps(payload).encode("utf-8"),
+        headers={
+            "Content-Type": "application/json",
+            "Authorization": f"Bearer {api_key}",
+        },
+    )
+    try:
+        with urllib.request.urlopen(req, timeout=180) as resp:
+            data = _json.loads(resp.read().decode("utf-8"))
+            return data["choices"][0]["message"]["content"]
+    except Exception as e:
+        return f"[BACKEND_ERROR] {type(e).__name__}: {str(e)[:200]}"
+
+
+def _embed(text: str) -> List[float]:
+    """Call Ollama for embeddings. Uses the configured nomic-embed-text model."""
+    import json as _json
+    import urllib.request
+
+    try:
+        req = urllib.request.Request(
+            "http://127.0.0.1:11434/api/embeddings",
+            data=_json.dumps({"model": "nomic-embed-text:latest", "prompt": text[:2000]}).encode("utf-8"),
+            headers={"Content-Type": "application/json"},
+        )
+        with urllib.request.urlopen(req, timeout=30) as resp:
+            data = _json.loads(resp.read().decode("utf-8"))
+            return data.get("embedding", [])
+    except Exception:
+        return []
+
+
+# ── Backend implementation ────────────────────────────────────────────────────
+
+
+class OpenClawDeepSeekBackend(Backend):
+    """Use DeepSeek V4 Pro for attempt/judge/reflect, Ollama for embeddings.
+
+    - "model" passed to constructor = optimizer model (default: deepseek-v4-pro)
+    - "judge_model" = judge model (default: deepseek-v4-pro for quality)
+    - "cheap_model" = budget-fallback (deepseek-v4-flash)
+    """
+
+    name = "openclaw-deepseek"
+
+    def __init__(
+        self,
+        model: str = "deepseek-v4-pro",
+        judge_model: str = "deepseek-v4-pro",
+        cheap_model: str = "deepseek-v4-flash",
+    ):
+        self._model = model
+        self._judge_model = judge_model
+        self._cheap_model = cheap_model
+        self._tokens = 0  # rough estimate
+
+    def tokens_used(self) -> int:
+        return self._tokens
+
+    # ── 1. attempt: produce a response given the task + skill + memory ──
+    def attempt(self, task: TaskRecord, skill: str, memory: str) -> str:
+        sys = (
+            "You are an OpenClaw agent (Kobe ecosystem). Use the skill and memory below to complete the task. "
+            "If the task asks for a structured output, follow the rubric exactly. "
+            "Be concise. No preamble, no explanation unless the task asks for it."
+        )
+        usr = f"""## SKILL
+{skill or '(no skill yet)'}
+
+## MEMORY
+{memory or '(no memory yet)'}
+
+## TASK
+{task.intent}
+
+## CONTEXT (if any)
+{task.context_excerpt or '(none)'}
+
+## RESPONSE
+"""
+        out = _chat(
+            [{"role": "system", "content": sys}, {"role": "user", "content": usr}],
+            model=self._model,
+            temperature=0.2,
+        )
+        self._tokens += len(usr) // 4 + 200
+        return out
+
+    # ── 2. judge: score the response ──
+    def judge(self, task: TaskRecord, response: str) -> Tuple[float, float, str]:
+        # Hard score: exact-match against task.reference (if available)
+        hard = exact_score(task.reference or "", response)
+
+        # Soft score: LLM judge against rubric (reference if reference_kind=='rubric')
+        rubric_text = task.reference if task.reference_kind == "rubric" else ""
+        if rubric_text:
+            judge_prompt = f"""You are a strict grader. Score the response 0.0-1.0 against the rubric.
+
+## TASK
+{task.intent}
+
+## REFERENCE
+{task.reference or '(none)'}
+
+## RUBRIC
+{rubric_text}
+
+## RESPONSE
+{response[:3000]}
+
+## INSTRUCTIONS
+Return ONLY a single float 0.0-1.0 on one line. No explanation. No markdown.
+"""
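+            try:
+                j_out = _chat(
+                    [{"role": "user", "content": judge_prompt}],
+                    model=self._judge_model,
+                    temperature=0.0,
+                    max_tokens=20,
+                ).strip()
+                if j_out.startswith("[BACKEND_ERROR]"):
+                    # don't parse digits out of an error message (e.g. "HTTP Error 401")
+                    raise RuntimeError(j_out)
+                soft = float(re.search(r"[\d.]+", j_out.splitlines()[0]).group())
+                soft = max(0.0, min(1.0, soft))
+            except Exception:
+                soft = hard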
+            self._tokens += 600
+        else:
+            soft = hard
+
+        rationale = f"hard={hard:.2f} soft={soft:.2f}"
+        return hard, soft, rationale
+
+    # ── 3. reflect: produce bounded EditRecord proposals ──
+    def reflect(
+        self,
+        failures: List[Tuple[TaskRecord, ReplayResult]],
+        successes: List[Tuple[TaskRecord, ReplayResult]],
+        skill: str,
+        memory: str,
+        *,
+        edit_budget: int,
+        evolve_skill: bool,
+        evolve_memory: bool,
+    ) -> List[EditRecord]:
+        # Compact digest of failures + successes
+        fail_digest = "\n".join(
+            f"- TASK: {t.intent[:200]}\n  RESPONSE: {r.response[:300]}\n  WHY FAIL: {r.judge_rationale or r.fail_reason or 'unknown'}\n  REFERENCE: {t.reference[:200]}"
+            for t, r in failures[:5]
+        ) or "(none)"
+        succ_digest = "\n".join(
+            f"- TASK: {t.intent[:150]} -> OK ({r.judge_rationale or 'high score'})"
+            for t, r in successes[:3]
+        ) or "(none)"
+
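+        rubric_text = ""
+        if failures:
+            # Built outside the f-string: a backslash inside an f-string expression
+            # is a SyntaxError before Python 3.12.
+            qa = "\n".join(f"Q: {t.intent[:120]}\nA: {t.reference}" for t, _ in failures[:3] if t.reference)
+            rubric_text = f"\n\n## REFERENCE ANSWERS\n{qa}"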
+
+        sys = (
+            "You are SkillOpt-Sleep's bounded-edit optimizer. Your job is to propose 1-4 MINIMAL text edits to a skill or memory document "
+            "that, if applied, would help future agents do better on the failed tasks. "
+            "NEVER propose adding new sections wholesale. NEVER delete entire sections. "
+            "Edit primitives: ADD (append a step/rule at end), DELETE (remove a specific line by exact match), REPLACE (swap a specific line for another by exact match). "
+            "If you cannot identify a clear, minimal improvement, return an empty list."
+        )
+        usr = f"""## CURRENT SKILL
+{skill or '(empty)'}
+
+## CURRENT MEMORY
+{memory or '(empty)'}
+
+## FAILED TASKS
+{fail_digest}
+
+## SUCCESSFUL TASKS
+{succ_digest}
+{rubric_text}
+
+## CONSTRAINTS
+- max {edit_budget} edits total
+- edits go to {"skill + memory" if (evolve_skill and evolve_memory) else ("skill" if evolve_skill else "memory")}
+- if evolve_skill=False, target="memory" only; if evolve_memory=False, target="skill" only
+- target must be "skill" or "memory"
+
+## OUTPUT FORMAT (JSON, no markdown)
+{{"edits": [{{"op": "ADD"|"DELETE"|"REPLACE", "target": "skill"|"memory", "content": "the text to add or replace with", "old_text": "for REPLACE/DELETE, the exact line to find", "rationale": "one short sentence why"}}]}}
+"""
+        out = _chat(
+            [{"role": "system", "content": sys}, {"role": "user", "content": usr}],
+            model=self._model,
+            temperature=0.4,
+            max_tokens=2000,
+        )
+        self._tokens += len(usr) // 3 + 1500
+
+        # parse
+        try:
+            # strip markdown fences if any
+            cleaned = out.strip()
+            if cleaned.startswith("```"):
+                cleaned = re.sub(r"^```[a-z]*\n?", "", cleaned)
+                cleaned = re.sub(r"\n?```$", "", cleaned)
+            data = json.loads(cleaned)
+            edits: List[EditRecord] = []
+            for e in data.get("edits", [])[:edit_budget]:
+                if e.get("op") not in ("ADD", "DELETE", "REPLACE"):
+                    continue
+                target = e.get("target", "skill")
+                if target not in ("skill", "memory"):
+                    continue
+                if not evolve_skill and target == "skill":
+                    continue
+                if not evolve_memory and target == "memory":
+                    continue
+                edits.append(EditRecord(
+                    op=e["op"],
+                    target=target,
+                    content=e.get("content", ""),
+                    old_text=e.get("old_text", ""),
+                    rationale=e.get("rationale", ""),
+                ))
+            return edits
+        except Exception:
+            # unparseable output: return an empty list (no edit is better than a bad edit)
+            return []
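
Note on `reflect` in skillopt_sleep_openclaw.py above: the parsing tail strips an optional markdown fence, decodes JSON, caps the list at the edit budget, and drops edits with an unknown op or a disallowed target. The sketch below reproduces that filtering on plain dicts so it runs without the engine; `parse_edits` is a hypothetical stand-in, not the backend's method.

```python
import json
import re
from typing import Any, Dict, List

FENCE = "`" * 3  # three backticks, built indirectly so this example nests cleanly
ALLOWED_OPS = {"ADD", "DELETE", "REPLACE"}


def parse_edits(raw: str, edit_budget: int,
                evolve_skill: bool = True, evolve_memory: bool = True) -> List[Dict[str, Any]]:
    """Return the valid edits from a model reply, or [] if the reply is not usable JSON."""
    cleaned = raw.strip()
    if cleaned.startswith(FENCE):
        cleaned = re.sub(r"^`{3}[a-z]*\n?", "", cleaned)  # opening fence + language tag
        cleaned = re.sub(r"\n?`{3}$", "", cleaned)        # closing fence
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, dict):
        return []
    targets = {t for t, on in (("skill", evolve_skill), ("memory", evolve_memory)) if on}
    # As in the backend, the budget cap is applied before filtering.
    return [e for e in data.get("edits", [])[:edit_budget]
            if e.get("op") in ALLOWED_OPS and e.get("target", "skill") in targets]


reply = FENCE + 'json\n{"edits": [' \
    '{"op": "ADD", "target": "skill", "content": "Cite sources."}, ' \
    '{"op": "RENAME", "target": "skill"}, ' \
    '{"op": "ADD", "target": "memory", "content": "x"}]}\n' + FENCE
kept = parse_edits(reply, edit_budget=3, evolve_memory=False)
print(kept)
```

Because the cap is applied before filtering, a reply whose first entries are invalid can yield fewer edits than the budget allows.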
--- a/plugins/openclaw/slash_sleep.py
+++ b/plugins/openclaw/slash_sleep.py
@@ -0,0 +1,289 @@
+#!/usr/bin/env python3
+"""slash_sleep.py — OpenClaw slash command equivalent of SkillOpt's /sleep.
+
+Use from the main session as a /sleep command:
+  /sleep status    — show state dir, recent staging entries, and the last recorded night
+  /sleep run       — trigger one cycle (all categories) right now
+  /sleep run research-cron  — one cycle, single category
+  /sleep adopt [night]      — adopt the most recent (or specified) staged proposal
+  /sleep reject [night]     — discard the most recent (or specified) staging dir
+  /sleep dry-run   — report-only cycle
+  /sleep cost      — estimate per-night cost for current config
+
+This script is a thin shell over run_sleep.py. It can be invoked either
+manually from the main session or by an OpenClaw command handler.
+"""
+from __future__ import annotations
+
+import argparse
+import json
+import os
+import shutil
+import sys
+from pathlib import Path
+from datetime import datetime
+
+SKILL_DIR = Path("/home/ethanclaw/.openclaw/workspace/skills/skillopt-sleep")
+STATE_DIR = Path(os.path.expanduser("~/.skillopt-sleep"))  # default
+STAGING_ROOT = STATE_DIR
+
+def _resolve_state_dir():
+    """Find the actual state dir.
+
+    Candidates, in priority order:
+      1. ~/.skillopt-sleep/                 (default)
+      2. /home/ethanclaw/.openclaw/workspace/.skillopt-sleep/  (when staging is there)
+      3. /home/ethanclaw/.openclaw/.skillopt-sleep/            (parent of overridden claude_home)
+    Return the first candidate that has a state.json; failing that, the first
+    that has a staging dir; failing that, the default.
+    """
+    candidates = [
+        Path(os.path.expanduser("~/.skillopt-sleep")),
+        Path("/home/ethanclaw/.openclaw/workspace/.skillopt-sleep"),
+        Path("/home/ethanclaw/.openclaw/.skillopt-sleep"),
+    ]
+    # Prefer the one with state.json
+    for c in candidates:
+        if (c / "state.json").exists():
+            return c
+    # Then the one with staging
+    for c in candidates:
+        if (c / "staging").exists():
+            return c
+    return candidates[0]
+
+TESTS_DIR = SKILL_DIR / "tests"
+
+
+def status() -> int:
+    state_dir = _resolve_state_dir()
+    state_file = state_dir / "state.json"
+    staging_dir = state_dir / "staging"
+    print(f"=== SkillOpt-Sleep status ===")
+    print(f"  state dir: {state_dir}")
+    print(f"  staging dir: {staging_dir}")
+    if staging_dir.exists():
+        stages = sorted(staging_dir.iterdir(), key=lambda p: p.stat().st_mtime, reverse=True)
+        print(f"  staging entries: {len(stages)}")
+        for s in stages[:3]:
+            print(f"    {s.name}")
+    if not state_file.exists():
+        print("  no state.json — run a cycle first (state is written at end of each non-dry-run)")
+        return 0
+
+    with open(state_file) as f:
+        state = json.load(f)
+
+    nights = state.get("history") or state.get("nights", [])
+    print(f"  total nights: {len(nights)}")
+    print(f"  accepted: {sum(1 for n in nights if n.get('accepted'))}")
+    print(f"  rejected: {sum(1 for n in nights if not n.get('accepted'))}")
+    if nights:
+        last = nights[-1]
+        print(f"  last night: {last.get('night')}")
+        print(f"    accepted: {last.get('accepted')}")
+        print(f"    baseline: {last.get('baseline'):.3f}  ->  candidate: {last.get('candidate'):.3f}")
+        print(f"    staging: {last.get('staging') or '(none)'}")
+    return 0
+
+
+def run_category(category: str, *, dry_run: bool = False) -> int:
+    cat_to_file = {
+        "research-cron": "research-cron-tasks.json",
+        "devops": "devops-tasks.json",
+        "wiki": "wiki-tasks.json",
+    }
+    tasks_file = TESTS_DIR / cat_to_file.get(category, f"{category}-tasks.json")
+    if not tasks_file.exists():
+        print(f"ERROR: no tasks file for category '{category}': {tasks_file}")
+        return 1
+
+    cmd = [sys.executable, str(SKILL_DIR / "run_sleep.py")]
+    if dry_run:
+        cmd.append("--dry-run")
+    cmd.extend(["--tasks", str(tasks_file)])
+
+    print(f"=== /sleep run {category}{' (dry-run)' if dry_run else ''} ===")
+    print(f"  cmd: {' '.join(cmd)}")
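+    # subprocess.run avoids shell quoting and returns the real exit code
+    # (os.system returns a wait status on POSIX, e.g. 256 for exit code 1,
+    # which sys.exit would truncate to 0)
+    import subprocess
+    rc = subprocess.run(cmd).returncode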
+    return rc
+
+
+def run_all(*, dry_run: bool = False) -> int:
+    rc = 0
+    for cat in ("research-cron", "devops", "wiki"):
+        r = run_category(cat, dry_run=dry_run)
+        if r != 0:
+            rc = r
+    return rc
+
+
+def adopt(night: str = None) -> int:
+    state_dir = _resolve_state_dir()
+    state_file = state_dir / "state.json"
+    if not state_file.exists():
+        print("ERROR: no state to adopt from")
+        return 1
+    with open(state_file) as f:
+        state = json.load(f)
+    nights = state.get("history") or state.get("nights", [])
+    if not nights:
+        print("ERROR: no nights recorded")
+        return 1
+
+    target = None
+    if night:
+        target = next((n for n in nights if str(n.get("night")) == night), None)
+        if not target:
+            print(f"ERROR: night '{night}' not found")
+            return 1
+    else:
+        # most recent accepted
+        candidates = [n for n in nights if n.get("accepted") and n.get("staging")]
+        if not candidates:
+            print("ERROR: no accepted nights with staging to adopt")
+            return 1
+        target = candidates[-1]
+
+    staging = target["staging"]
+    if not os.path.isdir(staging):
+        print(f"ERROR: staging dir missing: {staging}")
+        return 1
+
+    print(f"=== /sleep adopt night {target['night']} ===")
+    print(f"  staging: {staging}")
+    print(f"  baseline: {target.get('baseline'):.3f}  candidate: {target.get('candidate'):.3f}")
+
+    # Read proposed skill from staging
+    manifest = Path(staging) / "manifest.json"
+    if manifest.exists():
+        with open(manifest) as f:
+            m = json.load(f)
+        proposed = m.get("proposed_skill")
+        if proposed and Path(proposed).exists():
+            # Use the same resolved directory that state.json was read from.
+            live = state_dir / "live_skill.md"
+            backup = state_dir / f"live_skill.md.bak-{target['night']}"
+            if live.exists():
+                shutil.copy2(live, backup)
+                print(f"  backed up current live skill → {backup}")
+            shutil.copy2(proposed, live)
+            print(f"  adopted proposed skill → {live}")
+            print()
+            print("✅ Adoption complete. Next cycle will use the new skill.")
+            return 0
+
+    print("ERROR: manifest.json is missing or its proposed_skill file does not exist")
+    return 1
+
+
+def reject(night: str = None) -> int:
+    state_dir = _resolve_state_dir()
+    state_file = state_dir / "state.json"
+    if not state_file.exists():
+        print("ERROR: no state")
+        return 1
+    with open(state_file) as f:
+        state = json.load(f)
+    nights = state.get("history") or state.get("nights", [])
+    target = None
+    if night:
+        target = next((n for n in nights if str(n.get("night")) == night), None)
+    else:
+        candidates = [n for n in reversed(nights) if n.get("staging")]
+        target = candidates[0] if candidates else None
+
+    if not target or not target.get("staging"):
+        print("ERROR: nothing to reject")
+        return 1
+
+    staging = target["staging"]
+    if os.path.isdir(staging):
+        shutil.rmtree(staging)
+        print(f"🗑️  Removed staging: {staging}")
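+    # Remove the night from whichever key it was read from ("history" or the
+    # legacy "nights"), so the rejected entry does not linger in the state file.
+    history_key = "history" if state.get("history") else "nights"
+    state[history_key] = [n for n in nights if n.get("night") != target["night"]]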
+    with open(state_file, "w") as f:
+        json.dump(state, f, indent=2)
+    print("✅ Rejected. State updated.")
+    return 0
+
+
+def cost() -> int:
+    """Estimate per-night cost based on the actual measurement from Phase 2.
+
+    From the real dry-run: 5 devops tasks used 14,427 tokens total.
+    That is ~2,885 tokens per task (all 3 phases combined).
+    """
+    cfg_path = SKILL_DIR / "config.json"
+    cfg = {}
+    if cfg_path.exists():
+        cfg = json.loads(cfg_path.read_text())
+    cfg.pop("_comment", None)
+
+    max_tasks = cfg.get("max_tasks_per_night", 12)
+    model = cfg.get("model", "deepseek-v4-pro")
+    # DeepSeek V4 pricing
+    if "pro" in model:
+        cost_in = 0.435  # per 1M
+        cost_out = 0.87
+    elif "flash" in model:
+        cost_in = 0.14
+        cost_out = 0.28
+    else:
+        cost_in, cost_out = 0.5, 1.0
+
+    # Measured: ~2,900 tokens per task, 30% output / 70% input
+    toks_per_task = 2900
+    input_toks = int(toks_per_task * 0.7)
+    output_toks = int(toks_per_task * 0.3)
+
+    cost_in_total = (input_toks * max_tasks / 1_000_000) * cost_in
+    cost_out_total = (output_toks * max_tasks / 1_000_000) * cost_out
+    # Named nightly_cost so the local does not shadow the cost() function.
+    nightly_cost = cost_in_total + cost_out_total
+
+    print("=== Cost estimate (per actual measurement) ===")
+    print(f"  model: {model}")
+    print(f"  max tasks/night: {max_tasks}")
+    print(f"  ~tokens/night: {toks_per_task * max_tasks:,}")
+    print(f"  cost/night: ${nightly_cost:.3f}")
+    print(f"  cost/month (30 nights): ${nightly_cost*30:.2f}")
+    print(f"  cost/year (365 nights): ${nightly_cost*365:.2f}")
+    return 0
+
+
+def main():
+    ap = argparse.ArgumentParser(description="OpenClaw /sleep command")
+    sub = ap.add_subparsers(dest="cmd", required=True)
+
+    sub.add_parser("status", help="show state + last 5 nights")
+    p_run = sub.add_parser("run", help="trigger one cycle")
+    p_run.add_argument("category", nargs="?", default=None,
+                        choices=["research-cron", "devops", "wiki"])
+    p_run.add_argument("--dry-run", action="store_true")
+    sub.add_parser("dry-run", help="report-only cycle (all categories)")
+    p_adopt = sub.add_parser("adopt", help="adopt most recent accepted staging")
+    p_adopt.add_argument("night", nargs="?", default=None)
+    p_reject = sub.add_parser("reject", help="discard most recent staging")
+    p_reject.add_argument("night", nargs="?", default=None)
+    sub.add_parser("cost", help="estimate cost")
+
+    args = ap.parse_args()
+
+    if args.cmd == "status":
+        return status()
+    if args.cmd == "run":
+        if args.category:
+            return run_category(args.category, dry_run=args.dry_run)
+        return run_all(dry_run=args.dry_run)
+    if args.cmd == "dry-run":
+        return run_all(dry_run=True)
+    if args.cmd == "adopt":
+        return adopt(args.night)
+    if args.cmd == "reject":
+        return reject(args.night)
+    if args.cmd == "cost":
+        return cost()
+    return 1
+
+
+if __name__ == "__main__":
+    sys.exit(main())
--- a/plugins/openclaw/tests/devops-tasks.json
+++ b/plugins/openclaw/tests/devops-tasks.json
@@ -0,0 +1,87 @@
+[
+  {
+    "id": "do-01",
+    "reference": "[STATUS] devops-agent | Site Uptime \u2192 geoxylia.com OK (200) | 14/06 22:30 MYT",
+    "rubric": "Score 1.0 if output matches the exact format [STATUS] devops-agent | Site Uptime \u2192 geoxylia.com OK (200) | DD/MM HH:MM MYT, with a real current time. Score 0.5 if format is close but missing one field. Score 0.0 if wrong format or hallucinated values.",
+    "project": "devops-infrastructure-check",
+    "intent": "Site Uptime check. Run: `curl -o /dev/null -s -w '%{http_code}' https://geoxylia.com`. Interpret the result 200, and report in our standard format: 'STATUS | TASK \u2192 RESULT | TIME'. If not 200, escalate.",
+    "context_excerpt": "",
+    "attempted_solution": "",
+    "outcome": "unknown",
+    "reference_kind": "rubric",
+    "judge": {},
+    "tags": [
+      "devops-infrastructure-check"
+    ],
+    "source_sessions": [],
+    "split": "val"
+  },
+  {
+    "id": "do-02",
+    "reference": "Backup complete. Files: 87, Size: 1.2G, Last: 2026-06-14 22:00:00 MYT",
+    "rubric": "Score 1.0 if output includes the exact 'Backup complete. Files: N, Size: X, Last: timestamp' structure with plausible values. Score 0.5 if structure is close but one field missing. Score 0.0 if hallucinated or wrong structure.",
+    "project": "devops-infrastructure-check",
+    "intent": "Daily Memory Backup. Confirm this ran successfully by checking: `ls -t ~/backups/memory/memory-backup-*.tar.gz | head -3`. Report the file count, total size, and most recent backup time. Use format: 'Backup complete. Files: [N], Size: [X], Last: [timestamp]'.",
+    "context_excerpt": "",
+    "attempted_solution": "",
+    "outcome": "unknown",
+    "reference_kind": "rubric",
+    "judge": {},
+    "tags": [
+      "devops-infrastructure-check"
+    ],
+    "source_sessions": [],
+    "split": "val"
+  },
+  {
+    "id": "do-03",
+    "reference": "1) Vercel CSP missing frame-ancestors: MEDIUM. Allows clickjacking if anyone embeds our pages; not exploitable for our content, but best-practice gap.\n2) OpenClaw plaintext API keys: LOW. The config is chmod 600, loopback-only, not in git. Standard OpenClaw behavior. Rotating would add zero real security given current exposure.",
+    "rubric": "Score 1.0 if both are classified correctly (MEDIUM and LOW respectively) and justifications are accurate (not panicky, not dismissive). Score 0.5 if classifications are wrong by one tier or justifications are weak. Score 0.0 if both over-classified as CRITICAL or both wrong.",
+    "project": "devops-infrastructure-check",
+    "intent": "Security Check daily run. Two findings: 1) Vercel CSP header missing 'frame-ancestors' directive, 2) OpenClaw config has 3 plaintext API keys. Classify each as: CRITICAL / HIGH / MEDIUM / LOW / INFO. Justify each in 1 sentence.",
+    "context_excerpt": "",
+    "attempted_solution": "",
+    "outcome": "unknown",
+    "reference_kind": "rubric",
+    "judge": {},
+    "tags": [
+      "devops-infrastructure-check"
+    ],
+    "source_sessions": [],
+    "split": "train"
+  },
+  {
+    "id": "do-04",
+    "reference": "[INCIDENT] supabase.audit_results: anon role has no RLS policy \u2014 anyone with the URL can read all audit results. Fix: add policy 'audit_results_select_own' granting SELECT WHERE user_id = auth.uid(). Severity: HIGH (data exposure). Estimated 2-min fix.",
+    "rubric": "Score 1.0 if: (a) severity correctly identified as HIGH, (b) fix is a real RLS policy (not just 'enable RLS' since it's already enabled), (c) under 50 words, (d) Telegram-friendly format. Score 0.5 if severity right but fix is generic. Score 0.0 if missing severity or wrong fix.",
+    "project": "devops-infrastructure-check",
+    "intent": "Incident Check. The Supabase RLS check returned: 'table public.audit_results: rls enabled but policy missing for anon role'. Interpret severity, propose fix, and format as a Telegram alert (max 50 words).",
+    "context_excerpt": "",
+    "attempted_solution": "",
+    "outcome": "unknown",
+    "reference_kind": "rubric",
+    "judge": {},
+    "tags": [
+      "devops-infrastructure-check"
+    ],
+    "source_sessions": [],
+    "split": "val"
+  },
+  {
+    "id": "do-05",
+    "reference": "\ud83d\udee1\ufe0f Weekly security digest:\n\n\u2022 0 critical incidents, 1 high resolved (Supabase RLS policy added)\n\u2022 22 plaintext secrets: expected OpenClaw behavior, no action\n\u2022 1 medium open: Vercel CSP frame-ancestors, schedule for next sprint\n\nTrend: stable. No regressions vs last week.",
+    "rubric": "Score 1.0 if all 3 priority tiers mentioned with correct counts, ends with a trend statement, Telegram-friendly. Score 0.5 if structure is right but one tier wrong. Score 0.0 if missing a tier or wrong format.",
+    "project": "devops-infrastructure-check",
+    "intent": "Weekly security digest. Synthesize this week's findings: 22 plaintext secrets in openclaw.json (expected), 0 critical incidents, 1 high (Supabase RLS), 1 medium (CSP frame-ancestors), 0 low. Output a 3-bullet Telegram status.",
+    "context_excerpt": "",
+    "attempted_solution": "",
+    "outcome": "unknown",
+    "reference_kind": "rubric",
+    "judge": {},
+    "tags": [
+      "devops-infrastructure-check"
+    ],
+    "source_sessions": [],
+    "split": "train"
+  }
+]
--- a/plugins/openclaw/tests/research-cron-tasks.json
+++ b/plugins/openclaw/tests/research-cron-tasks.json
@@ -0,0 +1,87 @@
+[
+  {
+    "id": "rc-01",
+    "reference": "COMPETITOR MOVES: Otterly adds Perplexity tracker, joining Profound and LLMRefs in multi-platform citations.\nBACKLINK OPPORTUNITIES: 3 SEO directories (G2, Capterra, GetApp) have not been claimed.\nAGENCY BLUEPRINT: Top 2 agency sites bundle GEO audit + content refresh as $3K/mo tier.\nACTION ITEMS: Build Perplexity citation test into GeoXylia audit; claim G2 listing by Friday.",
+    "rubric": "Score 1.0 if all 4 section headings present in correct order, each with a substantive (not generic) 1-sentence content. Score 0.5 if headings present but content is generic. Score 0.0 if any heading missing or order wrong.",
+    "project": "research-cron-output",
+    "intent": "Weekly Competitive Deep Dive for GeoXylia. The competitor otterly.ai just added a Perplexity citation tracker. Produce the report header (top section) in our standard format: COMPETITOR MOVES, BACKLINK OPPORTUNITIES, AGENCY BLUEPRINT, ACTION ITEMS. Keep it to 4 lines, one per section heading with a 1-sentence placeholder.",
+    "context_excerpt": "",
+    "attempted_solution": "",
+    "outcome": "unknown",
+    "reference_kind": "rubric",
+    "judge": {},
+    "tags": [
+      "research-cron-output"
+    ],
+    "source_sessions": [],
+    "split": "train"
+  },
+  {
+    "id": "rc-02",
+    "reference": "1. 'ai seo audit tool': 420 imp, pos 8.2, on page 1 \u2014 needs CTR lift (snippet/schema).\n2. 'geo audit tool': 230 imp, pos 12.5, page 2 \u2014 target blog post could push to page 1.\n3. 'llm optimization': 85 imp, pos 18.3, deep page-2 \u2014 fresh content with answer capsule could compete.",
+    "rubric": "Score 1.0 if the response correctly identifies 'ai seo audit tool', 'geo audit tool', and 'llm optimization' as the top 3 (NOT 'best free seo audit' which is already converting well, NOT 'free audit tool' which has too few impressions). Each must have correct impression count, position, and a substantive rationale. Score 0.5 if correct 3 keywords but rationale is weak. Score 0.0 if wrong keywords selected.",
+    "project": "research-cron-output",
+    "intent": "GSC keyword opportunity scan. From this snippet of GSC data, identify the top 3 keyword opportunities (high impressions, low CTR, position 5-15):\n\n1. 'ai seo audit tool' \u2014 420 imp, 12 clicks, pos 8.2\n2. 'best free seo audit' \u2014 1100 imp, 95 clicks, pos 4.1\n3. 'geo audit tool' \u2014 230 imp, 4 clicks, pos 12.5\n4. 'llm optimization' \u2014 85 imp, 1 click, pos 18.3\n5. 'free audit tool' \u2014 50 imp, 0 clicks, pos 22.0\n\nOutput: one line per opportunity, format 'KEYWORD: impressions, position, why-it-matters (1 short clause)'.",
+    "context_excerpt": "",
+    "attempted_solution": "",
+    "outcome": "unknown",
+    "reference_kind": "rubric",
+    "judge": {},
+    "tags": [
+      "research-cron-output"
+    ],
+    "source_sessions": [],
+    "split": "train"
+  },
+  {
+    "id": "rc-03",
+    "reference": "Google AI Overviews now show source links more prominently + author bylines. For GeoXylia: this favors pages with clear authorship (add author schema to blog posts). Action: this week, add author + E-E-A-T schema markup to top 10 blog posts. Source: Google Search Central blog.",
+    "rubric": "Score 1.0 if: (a) under 60 words, (b) names the change, (c) gives GeoXylia-specific implication, (d) gives a concrete action item, (e) cites the source. Score 0.5 if missing 1-2 of these. Score 0.0 if over 60 words or missing 3+.",
+    "project": "research-cron-output",
+    "intent": "Daily Industry News scan. The Google Search Central blog just announced: 'AI Overviews now showing source links more prominently, with author bylines for E-E-A-T-heavy content.' Write a 1-paragraph Telegram alert (max 60 words) for Ethan. Include: 1) what changed, 2) what it means for GeoXylia, 3) any action item.",
+    "context_excerpt": "",
+    "attempted_solution": "",
+    "outcome": "unknown",
+    "reference_kind": "rubric",
+    "judge": {},
+    "tags": [
+      "research-cron-output"
+    ],
+    "source_sessions": [],
+    "split": "val"
+  },
+  {
+    "id": "rc-04",
+    "reference": "Hi [Name], I saw seo-skill.com's resources page is one of the most-respected SEO learning hubs in the industry \u2014 your 2026 algorithm breakdown was spot-on. We just published a free 2026 AI SEO Audit comparison that your readers would find genuinely useful (no paywall, no signup). It covers the 8 leading AI-audit tools with hands-on screenshots and a clear feature matrix. GeoXylia is the only fully-free option in the comparison, so it's a natural fit for a 'tools to know' section if you're open to including the link.",
+    "rubric": "Score 1.0 if exactly 4 sentences, all four functional pieces present (compliment / mention resource / audience benefit / GeoXylia one-liner), conversational tone, no aggressive sales language. Score 0.5 if 3 of 4 pieces present or tone is too salesy. Score 0.0 if more than 5 sentences or missing 2+ pieces.",
+    "project": "research-cron-output",
+    "intent": "Backlink Outreach draft for the blog post 'Free AI SEO Audit Tool: 2026 Comparison'. The prospect is seo-skill.com (a popular SEO training site with a 'resources' page). Write a 4-sentence outreach email: 1) compliment, 2) mention our resource, 3) explain audience benefit, 4) one-line about GeoXylia.",
+    "context_excerpt": "",
+    "attempted_solution": "",
+    "outcome": "unknown",
+    "reference_kind": "rubric",
+    "judge": {},
+    "tags": [
+      "research-cron-output"
+    ],
+    "source_sessions": [],
+    "split": "train"
+  },
+  {
+    "id": "rc-05",
+    "reference": "1) DO MORE: AI citation / LLM-mention topics \u2014 the 0.9% CTR at position 9.4 means we're visible but need richer answer capsules to lift CTR. Target 2x posts/week on this cluster.\n2) PAUSE: Pure schema-markup how-tos \u2014 'Schema Markup for SEO' has 0 clicks at position 41, the audience isn't searching this way. Rework as 'How to appear in AI answers' framing.\n3) TEST: 'Perplexity vs ChatGPT citation rates for [niche]' \u2014 unexplored angle, could capture comparison-intent traffic.",
+    "rubric": "Score 1.0 if all 3 are specific (not generic), cite actual data from the prompt, and contain a clear actionable change. Score 0.5 if 2 of 3 are specific. Score 0.0 if generic advice or no data citations.",
+    "project": "research-cron-output",
+    "intent": "Performance \u2192 Strategy feedback loop. Last week's top blog post was 'AI Citation Audit: Does Your Site Appear in ChatGPT?' with 4,200 impressions and 38 clicks (CTR 0.9%, position 9.4). The bottom post was 'Schema Markup for SEO: A 2026 Guide' with 110 impressions and 0 clicks (CTR 0%, position 41). Write 3 specific strategy adjustments: 1) what to do more of, 2) what to pause, 3) what new topic to test.",
+    "context_excerpt": "",
+    "attempted_solution": "",
+    "outcome": "unknown",
+    "reference_kind": "rubric",
+    "judge": {},
+    "tags": [
+      "research-cron-output"
+    ],
+    "source_sessions": [],
+    "split": "val"
+  }
+]
--- a/plugins/openclaw/tests/wiki-tasks.json
+++ b/plugins/openclaw/tests/wiki-tasks.json
@@ -0,0 +1,70 @@
+[
+  {
+    "id": "wk-01",
+    "reference": "1. What GEO is and isn't (define vs SEO/AEO, dispel the 'just add FAQ' myth)\n2. The 3 citation mechanisms LLMs use (RAG, fine-tuning, in-context; weight each)\n3. The 2026 citation data (real statistics from Profound/Otterly/Peec; what % of queries get citations)\n4. The action framework (a 5-step audit-and-fix process, concrete)\n5. Measurement (which metrics actually predict citation lift; vanity vs real)",
+    "rubric": "Score 1.0 if 5 sections, in a logical order, each with a substantive (not generic) purpose, and the section content is GEO-specific (not generic SEO). Score 0.5 if 5 sections but 1-2 are generic. Score 0.0 if wrong number of sections or wrong order.",
+    "project": "wiki-canonical-guide",
+    "intent": "Wiki canonical guide: 'GEO 2026 Standards'. Audience: a mid-level SEO specialist who has heard of GEO but not done it. Tone: technical, evidence-driven, no fluff. Length target: 1500-2200 words. Outline the 5 sections that should appear in order. For each, give a 1-sentence sub-purpose.",
+    "context_excerpt": "",
+    "attempted_solution": "",
+    "outcome": "unknown",
+    "reference_kind": "rubric",
+    "judge": {},
+    "tags": [
+      "wiki-canonical-guide"
+    ],
+    "source_sessions": [],
+    "split": "val"
+  },
+  {
+    "id": "wk-02",
+    "reference": "Yes, add inbound links. (1) geo-2026-standards.md \u2192 '## Action Framework' section, anchor: 'platform-specific citation rules' \u2014 natural since GEO standards reference ChatGPT/Perplexity behavior. (2) seo-2026-standards.md \u2192 '## AI Overviews' section, anchor: 'AI platform citations' \u2014 links to the mechanism guide. (3) content-strategy.md \u2192 '## Content Types' section, anchor: 'per-platform citation' \u2014 content strategy needs to know which platform favors which content.",
+    "rubric": "Score 1.0 if all 3 inbound links proposed with specific section + natural anchor text, demonstrating the link solves a real navigational gap (not just SEO-link-building). Score 0.5 if 2 of 3 are well-placed. Score 0.0 if generic anchors like 'click here' or no specific sections named.",
+    "project": "wiki-canonical-guide",
+    "intent": "Cross-link audit. The wiki page 'ai-platform-citation-guide.md' has 4 outbound links to other wiki pages, but no inbound links from: 'geo-2026-standards.md', 'seo-2026-standards.md', 'content-strategy.md'. Should we add inbound links? In which page should each inbound link go, and what anchor text would be natural?",
+    "context_excerpt": "",
+    "attempted_solution": "",
+    "outcome": "unknown",
+    "reference_kind": "rubric",
+    "judge": {},
+    "tags": [
+      "wiki-canonical-guide"
+    ],
+    "source_sessions": [],
+    "split": "val"
+  },
+  {
+    "id": "wk-03",
+    "reference": "Priorities:\n1. Refresh 'geo-glossary.md' (last update 2026-04-12, 63 days) \u2014 add new terms like RAG, in-context citation, agentic SEO.\n2. Refresh 'competitor-pricing.md' (last update 2026-05-01, 44 days) \u2014 Profound raised enterprise tier.\n3. No structural fixes needed.\n\nTelegram: 'Wiki lint: 2 stale pages flagged (geo-glossary 63d, competitor-pricing 44d). No broken links. Both need refresh this week.'",
+    "rubric": "Score 1.0 if both stale pages correctly identified with specific (not generic) refresh notes, and Telegram summary is under 40 words with the right action. Score 0.5 if stale pages identified but refresh notes are vague. Score 0.0 if missing stale pages or Telegram over 40 words.",
+    "project": "wiki-canonical-guide",
+    "intent": "Wiki lint report. Today's scan: 14 wiki pages, 2 with 'Updated' dates > 30 days old ('geo-glossary.md' and 'competitor-pricing.md'), 0 broken internal links, 0 missing YAML frontmatter. Output: 1) prioritized action list, 2) Telegram summary (max 40 words).",
+    "context_excerpt": "",
+    "attempted_solution": "",
+    "outcome": "unknown",
+    "reference_kind": "rubric",
+    "judge": {},
+    "tags": [
+      "wiki-canonical-guide"
+    ],
+    "source_sessions": [],
+    "split": "train"
+  },
+  {
+    "id": "wk-04",
+    "reference": "Index rebuilt: 14 wiki pages registered in _index.md (was 12 \u2014 added competitor-pricing-rev2 and citations-q2-2026).\nQuestion for Ethan: should 'competitor-pricing.md' and 'competitor-pricing-rev2.md' be merged? They're 78% similar in content.",
+    "rubric": "Score 1.0 if both sentences are accurate (count matches, names are plausible) and the question identifies a real consolidation opportunity (not a fabricated one). Score 0.5 if structure is right but content vague. Score 0.0 if wrong format or no question.",
+    "project": "wiki-canonical-guide",
+    "intent": "Index rebuild check. Run `python3 ~/agent-shared/scripts/update-index.py` (assume it works). After the run, the new wiki/_index.md should list all 14 pages. Generate a 2-sentence confirmation message + 1 question for Ethan to verify.",
+    "context_excerpt": "",
+    "attempted_solution": "",
+    "outcome": "unknown",
+    "reference_kind": "rubric",
+    "judge": {},
+    "tags": [
+      "wiki-canonical-guide"
+    ],
+    "source_sessions": [],
+    "split": "train"
+  }
+]