Merge pull request #59 from Elzlxx/feat/openclaw-skillopt-sleep

feat(plugins): add OpenClaw shell for SkillOpt-Sleep
Yifan Yang
2026-06-15 18:26:12 +08:00
committed by GitHub
10 changed files with 1244 additions and 0 deletions

plugins/openclaw/README.md Normal file

@@ -0,0 +1,112 @@
# OpenClaw Plugin for SkillOpt-Sleep
Thin shell for running [SkillOpt-Sleep](https://github.com/microsoft/SkillOpt) on [OpenClaw](https://github.com/openclaw/openclaw).
## What it does
Adds a nightly "sleep cycle" to any OpenClaw agent. The cycle:
1. **Harvests** recent session transcripts from `~/.openclaw/agents/<name>/sessions/*.jsonl`
2. **Mines** recurring task patterns using the optimizer LLM
3. **Replays** each pattern with the current `SKILL.md` (baseline) and a candidate `SKILL.md` (with proposed edits)
4. **Gates** the candidate against the held-out score (rejects regressions)
5. **Stages** the accepted proposal in `~/.skillopt-sleep/staging/<night>/`
6. Leaves adoption to the operator (Ethan)
Nothing live changes until you adopt. Every adopt backs up first.
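The gate in step 4 can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the engine's code: `gate_accepts` and `min_gain` are hypothetical names, and the strict "must improve" comparison is inferred from the description above (regressions and ties are not staged).

```python
def gate_accepts(baseline: float, candidate: float, min_gain: float = 0.0) -> bool:
    """Hypothetical gate: accept only if the candidate beats the baseline by more than min_gain."""
    return candidate - baseline > min_gain


print(gate_accepts(0.50, 0.75))  # improvement: True
print(gate_accepts(0.50, 0.50))  # tie: False
print(gate_accepts(0.50, 0.25))  # regression: False
```

A stricter operator could raise `min_gain` so that only clear wins reach staging.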
## Install
The plugin is a thin wrapper around the engine at `~/.openclaw/workspace/SkillOpt/skillopt_sleep/`:
```bash
# 1. Clone the engine (one-time)
cd ~/.openclaw/workspace
git clone https://github.com/microsoft/SkillOpt.git
# 2. Install the OpenClaw skill (this folder)
ln -s /path/to/openclaw ~/.openclaw/workspace/skills/skillopt-sleep
# 3. Configure
mkdir -p ~/.skillopt-sleep   # the state dir may not exist yet on a fresh install
cp ~/.openclaw/workspace/skills/skillopt-sleep/config.json ~/.skillopt-sleep/config.json
$EDITOR ~/.skillopt-sleep/config.json
# Set backend = "openclaw-deepseek"
# Set model = "deepseek-v4-pro" (or "deepseek-v4-flash" for budget)
# 4. Set API key
# Plain KEY=value, no `export` and no quotes: the backend parses this file line by line
echo 'DEEPSEEK_API_KEY=sk-...' >> ~/.openclaw/.env
# 5. Add the nightly cron
(crontab -l 2>/dev/null; echo "0 3 * * * cd ~/.openclaw/workspace/skills/skillopt-sleep && bash run_sleep_cron.sh >> ~/.skillopt-sleep/nightly.log 2>&1") | crontab -
```
## Use
### Manual trigger
```bash
# Run one cycle now
python3 ~/.openclaw/workspace/skills/skillopt-sleep/run_sleep.py
# Dry run (report only)
python3 ~/.openclaw/workspace/skills/skillopt-sleep/run_sleep.py --dry-run
# One category only
python3 ~/.openclaw/workspace/skills/skillopt-sleep/run_sleep.py --tasks tests/research-cron-tasks.json
```
### Slash command
```bash
# In any OpenClaw session
/sleep status
/sleep run
/sleep run research-cron
/sleep dry-run
/sleep adopt # adopt most recent accepted proposal
/sleep reject # discard most recent
/sleep cost
```
## Architecture
```
plugins/openclaw/
├── README.md # this file
├── run_sleep_cron.sh # wrapper for cron invocation
├── run_sleep.py # main entry point
├── slash_sleep.py # /sleep command implementation
├── skillopt_sleep_openclaw.py # DeepSeek + Ollama backend
├── config.json # engine config
├── SKILL.md # OpenClaw skill manifest
└── tests/ # held-out test sets
    ├── research-cron-tasks.json
    ├── devops-tasks.json
    └── wiki-tasks.json
```
The OpenClaw shell is one engine (skillopt_sleep/) + one backend (DeepSeek/Ollama) + four thin wrappers (cron, slash, skill, tests).
## Why this matters for OpenClaw
OpenClaw currently has no built-in "self-evolving skills" mechanism. The community has:
- **Manual skills** — Ethan writes them
- **LLM-generated skills** — one-shot, no validation
- **Self-revision** — unbounded, no quality bar
SkillOpt-Sleep adds a 4th option: **validated self-evolution**. The skill is the training target, the engine is the optimizer, the gate is the quality bar, the operator is the human-in-the-loop.
## Validation
Validated on the public [gbrain-evals](https://github.com/garrytan/gbrain-evals) `skillopt-v1` benchmark with real Claude and Codex (deficient skills 0.00 → 1.00 on held-out, all 4 seeds).
End-to-end test on our own 14-task held-out set: pipeline runs, gate correctly rejects non-improvements, staging artifacts land in `~/.skillopt-sleep/staging/<night>/`.
## Cost
Measured: ~$0.02/night with `deepseek-v4-pro` at 12 tasks/night. ~$0.59/month, $7.18/year.
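The figures above can be reproduced with a short calculation. This is a sketch, not an official price list: the per-task token count, the 70/30 input/output split, and the per-million prices are copied from the `cost()` helper in `slash_sleep.py`, and `nightly_cost` is a hypothetical name.

```python
# Assumed inputs, taken from slash_sleep.py's cost() for deepseek-v4-pro.
TOKENS_PER_TASK = 2900
PRICE_IN_PER_M = 0.435   # USD per 1M input tokens
PRICE_OUT_PER_M = 0.87   # USD per 1M output tokens


def nightly_cost(tasks: int) -> float:
    """Estimated USD per night, splitting tokens 70% input / 30% output."""
    input_toks = TOKENS_PER_TASK * 70 // 100 * tasks        # integer math avoids float rounding
    output_toks = (TOKENS_PER_TASK - TOKENS_PER_TASK * 70 // 100) * tasks
    return (input_toks * PRICE_IN_PER_M + output_toks * PRICE_OUT_PER_M) / 1_000_000


night = nightly_cost(12)
print(f"night: ${night:.3f}  month: ${night * 30:.2f}  year: ${night * 365:.2f}")
```

At 12 tasks this gives about $0.020 per night, which matches the monthly and yearly numbers quoted above.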
## License
MIT (same as SkillOpt core).

plugins/openclaw/SKILL.md Normal file

@@ -0,0 +1,96 @@
---
name: skillopt-sleep
description: Validate and refine agent skills through nightly sleep cycles with held-out gates. Wraps Microsoft's SkillOpt-Sleep engine for the OpenClaw/DeepSeek stack.
---
# skillopt-sleep — OpenClaw Adaptation of Microsoft SkillOpt-Sleep
A nightly self-improvement loop that reads our session transcripts, mines recurring workflow patterns, replays them with proposed skill edits, and gates the proposals against a held-out test set. Only improvements that beat baseline are staged for human adoption.
## When To Use
- After Hermes's Weekly Skill Review (or as its replacement)
- When a skill is being used 10+ times/week and could be tighter
- Before promoting a new skill from `skill-proposals/` to `skills/`
- When a skill regresses in observed quality
## What It Does (One Cycle)
```
harvest session transcripts -> mine recurring task patterns
-> replay each pattern (current skill vs proposed)
-> GATE: must improve held-out score
-> stage proposal
-> Ethan adopts (manual)
```
Nothing live changes until Ethan adopts. Every adopt backs up first.
## Architecture
```
skills/skillopt-sleep/
├── SKILL.md # this file
├── config.json # engine config (backend, budgets, etc.)
├── run_sleep.py # entry point
└── skillopt_sleep_openclaw.py # DeepSeek/Ollama backend
```
The engine itself is at `~/.openclaw/workspace/SkillOpt/skillopt_sleep/` (cloned from microsoft/SkillOpt).
## Usage
```bash
# Run one cycle with current config
cd ~/.openclaw/workspace/skills/skillopt-sleep
python3 run_sleep.py
# Dry run (report only, no staging)
python3 run_sleep.py --dry-run
# Use a pre-built task set (recommended for testing)
python3 run_sleep.py --tasks tests/research-cron-tasks.json
```
## Config (config.json)
Key knobs:
- `backend: "openclaw-deepseek"` — our custom backend
- `model: "deepseek-v4-pro"` — optimizer model
- `edit_budget: 3` — max bounded edits per night
- `gate_mode: "on"` — validation-gated (rejects regressions)
- `auto_adopt: false` — require Ethan to adopt manually
- `max_tasks_per_night: 12` — cap to control cost
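A pre-flight check on these knobs might look like the sketch below. `config_warnings` and `SAMPLE` are hypothetical, and the warning texts describe what the knobs are documented to do above, not verified engine behavior.

```python
import json
from typing import List

# Hypothetical sample mirroring the key knobs listed above.
SAMPLE = (
    '{"backend": "openclaw-deepseek", "gate_mode": "on", '
    '"auto_adopt": false, "edit_budget": 3, "max_tasks_per_night": 12}'
)


def config_warnings(cfg: dict) -> List[str]:
    """Flag settings that loosen the documented defaults."""
    warnings = []
    if cfg.get("auto_adopt"):
        warnings.append("auto_adopt is true: proposals would go live without review")
    if cfg.get("gate_mode") != "on":
        warnings.append("gate_mode is not 'on': the validation gate may be bypassed")
    if cfg.get("max_tasks_per_night", 0) > 12:
        warnings.append("max_tasks_per_night exceeds 12: nightly cost will rise")
    return warnings


print(config_warnings(json.loads(SAMPLE)))  # → []
```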
## Cost Estimate
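Back-of-envelope upper bound: 12 tasks × 3 calls (attempt, judge, reflect) × ~3K tokens/call × ~$0.005/1K tokens ≈ **$0.54/night**. Measured usage is far lower: `/sleep cost` works from ~2,900 tokens per task across all three calls, which puts `deepseek-v4-pro` at roughly $0.02/night.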
## Outputs
- Report: `~/.skillopt-sleep/state.json` (running totals)
- Staging: `~/.skillopt-sleep/staging/<night>/`
  - `report.md` — readable summary
  - `best_skill.md` — proposed skill
  - `edits.json` — bounded edit list
  - `before.md` / `after.md` — diffs
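To review the newest proposal, something like the following sketch could locate it. `latest_staging` is a hypothetical helper, and it assumes `<night>` directory names sort chronologically (for example ISO dates), which the layout above does not guarantee.

```python
from pathlib import Path
from typing import Optional


def latest_staging(state_dir: Path) -> Optional[Path]:
    """Return the last staging/<night> directory by name, or None if there is none."""
    staging = state_dir / "staging"
    if not staging.is_dir():
        return None
    nights = sorted(p for p in staging.iterdir() if p.is_dir())
    return nights[-1] if nights else None


newest = latest_staging(Path.home() / ".skillopt-sleep")
if newest is None:
    print("no staged proposals yet")
else:
    print("read first:", newest / "report.md")
```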
## Held-Out Test Sets (Phase 2)
Located at `tests/<category>-tasks.json`. Each task has:
- `prompt` — the recurring task (`run_sleep.py` reads `intent` first, then falls back to `prompt`)
- `reference` — exact-match gold answer
- `rubric` — soft score rubric (0-1)
- `domain` — research/devops/wiki/etc.
Currently building for 3 categories:
- research-cron-output
- devops-infrastructure-check
- wiki-canonical-guide
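Before running a cycle on a new task file, a quick shape check can catch records the loader would choke on. This sketch reflects only what `run_sleep.py` does when translating tasks (it indexes `id` directly and takes `intent` or else `prompt`); `task_problems` is a hypothetical helper.

```python
from typing import List


def task_problems(task: dict) -> List[str]:
    """List the problems that would break or blank out a task in run_sleep.py's loader."""
    problems = []
    if "id" not in task:
        problems.append("missing 'id'")  # the loader reads t['id'] and would raise KeyError
    if not (task.get("intent") or task.get("prompt")):
        problems.append("missing 'intent'/'prompt'")  # the task text would be empty
    return problems


print(task_problems({"id": "do-01", "intent": "Site Uptime check."}))  # → []
print(task_problems({"prompt": "Site Uptime check."}))  # → ["missing 'id'"]
```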
## When NOT To Use
- For a one-off workflow (not a recurring pattern)
- During a crisis/incident (humans must lead)
- When session transcripts are < 24h old (not enough signal)
- For skills < 300 tokens (over-optimization risk)


@@ -0,0 +1,30 @@
{
"_comment": "OpenClaw adaptation of skillopt-sleep. Edit and run via run_sleep.py",
"claude_home": "/home/ethanclaw/.openclaw/agents",
"invoked_project": "/home/ethanclaw/.openclaw/workspace",
"projects": "invoked",
"lookback_hours": 168,
"max_tasks_per_night": 12,
"max_tokens_per_night": 800000,
"holdout_fraction": 0.34,
"val_fraction": 0.34,
"test_fraction": 0.0,
"backend": "openclaw-deepseek",
"model": "deepseek-v4-pro",
"gate_mode": "on",
"edit_budget": 3,
"gate_metric": "mixed",
"gate_mixed_weight": 0.5,
"replay_mode": "fresh",
"evolve_memory": true,
"evolve_skill": true,
"llm_mine": false,
"auto_adopt": false,
"managed_skill_name": "skillopt-sleep-learned",
"redact_secrets": true,
"seed": 42
}

plugins/openclaw/run_sleep.py Executable file

@@ -0,0 +1,122 @@
#!/usr/bin/env python3
"""run_sleep.py — OpenClaw entry point for SkillOpt-Sleep.
Runs one nightly sleep cycle:
1. harvest recent session transcripts
2. mine recurring task patterns
3. replay tasks with current skill (baseline) + candidate skill (with proposed edit)
4. gate candidate vs baseline on held-out accuracy
5. stage the proposal in ~/.skillopt-sleep/staging/<night>/
6. leave adoption to Ethan (auto_adopt=false)
Usage:
python3 run_sleep.py # one cycle, default config
python3 run_sleep.py --dry-run # compute report only, no staging
python3 run_sleep.py --tasks path.json # use a pre-built task file
"""
from __future__ import annotations
import argparse
import json
import os
import sys
from pathlib import Path
# Ensure the skillopt_sleep package is importable (it lives in the cloned repo)
REPO = Path.home() / ".openclaw" / "workspace" / "SkillOpt"  # resolves per user instead of a fixed /home path
sys.path.insert(0, str(REPO))
# Register our backend before importing cycle
from skillopt_sleep_openclaw import OpenClawDeepSeekBackend
import skillopt_sleep.backend as _b
_b._BACKENDS = getattr(_b, "_BACKENDS", {})
_b._BACKENDS["openclaw-deepseek"] = OpenClawDeepSeekBackend
# Patch get_backend to know about our backend
_orig_get_backend = _b.get_backend
def get_backend(name, model="", codex_path=""):
if name == "openclaw-deepseek":
return OpenClawDeepSeekBackend(model=model or "deepseek-v4-pro")
return _orig_get_backend(name, model=model, codex_path=codex_path)
_b.get_backend = get_backend
from skillopt_sleep.cycle import run_sleep_cycle
from skillopt_sleep.config import load_config
def main() -> int:
ap = argparse.ArgumentParser(description="OpenClaw SkillOpt-Sleep nightly cycle")
ap.add_argument("--dry-run", action="store_true", help="Compute but don't stage")
    ap.add_argument("--config", default=str(Path.home() / ".openclaw/workspace/skills/skillopt-sleep/config.json"))
ap.add_argument("--tasks", default=None, help="Path to pre-built tasks JSON")
ap.add_argument("--verbose", action="store_true")
args = ap.parse_args()
# Load config from file then override with our defaults
overrides = {}
if os.path.exists(args.config):
with open(args.config) as f:
overrides.update(json.load(f))
overrides.pop("_comment", None)
cfg = load_config(**overrides)
seed_tasks = None
if args.tasks:
from skillopt_sleep.types import TaskRecord
with open(args.tasks) as f:
raw = json.load(f)
# Translate our test-set fields → TaskRecord fields
seed_tasks = []
for t in raw:
seed_tasks.append(TaskRecord(
id=t['id'],
project=t.get('project', 'openclaw'),
intent=t.get('intent') or t.get('prompt', ''),
context_excerpt=t.get('context_excerpt', ''),
attempted_solution=t.get('attempted_solution', ''),
outcome=t.get('outcome', 'unknown'),
reference_kind=t.get('reference_kind', 'rubric'),
reference=t.get('reference', ''),
judge=t.get('judge', {}),
tags=t.get('tags', []),
source_sessions=t.get('source_sessions', []),
split=t.get('split', 'train'),
))
print(f"[skillopt-sleep] starting cycle...")
print(f" backend: {cfg.get('backend')}")
print(f" project: {cfg.get('invoked_project')}")
print(f" max tasks: {cfg.get('max_tasks_per_night')}")
print(f" edit budget: {cfg.get('edit_budget')}")
print(f" dry_run: {args.dry_run}")
outcome = run_sleep_cycle(cfg, seed_tasks=seed_tasks, dry_run=args.dry_run)
r = outcome.report
print(f"\n=== Report — night {r.night} ===")
print(f" sessions harvested: {r.n_sessions}")
print(f" tasks mined: {r.n_tasks} (replayed: {r.n_replayed})")
print(f" baseline: {r.baseline_score:.3f} -> candidate: {r.candidate_score:.3f}")
print(f" gate: {r.gate_action} accepted={r.accepted}")
print(f" tokens: {r.tokens_used}")
if r.edits:
print(f" applied edits ({len(r.edits)}):")
for e in r.edits:
print(f" [{e.target}/{e.op}] {e.content[:80]}...")
if r.rejected_edits:
print(f" rejected edits ({len(r.rejected_edits)}) — kept as negative feedback")
if r.notes:
for n in r.notes:
print(f" note: {n}")
if outcome.staging_dir:
print(f"\n STAGED at: {outcome.staging_dir}")
print(f" Review with: ls {outcome.staging_dir}")
return 0 if r.accepted or r.candidate_score >= r.baseline_score else 1
if __name__ == "__main__":
sys.exit(main())


@@ -0,0 +1,76 @@
#!/bin/bash
# run_sleep_cron.sh — wrapper for cron-driven nightly sleep cycle
#
# Usage: bash run_sleep_cron.sh [category1 category2 ...]
# No args: run on all categories in tests/
# With args: run only on listed categories (research-cron, devops, wiki)
#
# Cron (3am MYT daily):
# 0 3 * * * cd /home/ethanclaw/.openclaw/workspace/skills/skillopt-sleep && bash run_sleep_cron.sh >> ~/.skillopt-sleep/nightly.log 2>&1
set -euo pipefail
SKILL_DIR="$HOME/.openclaw/workspace/skills/skillopt-sleep"
TESTS_DIR="$SKILL_DIR/tests"
LOG_DIR="$HOME/.skillopt-sleep/logs"
mkdir -p "$LOG_DIR"
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
LOG_FILE="$LOG_DIR/night-$TIMESTAMP.log"
# category → test file map
declare -A CATEGORIES=(
["research-cron"]="research-cron-tasks.json"
["devops"]="devops-tasks.json"
["wiki"]="wiki-tasks.json"
)
# Determine which categories to run
if [ $# -eq 0 ]; then
CATS=("research-cron" "devops" "wiki")
else
CATS=("$@")
fi
{
echo "=========================================="
echo "SkillOpt-Sleep nightly — $TIMESTAMP"
echo "Categories: ${CATS[*]}"
echo "=========================================="
} | tee -a "$LOG_FILE"
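# Pre-flight: check DeepSeek API key (environment first, then ~/.openclaw/.env, as the backend does)
if [ -z "${DEEPSEEK_API_KEY:-}" ] && ! grep -q "DEEPSEEK_API_KEY=" "$HOME/.openclaw/.env" 2>/dev/null; then
  echo "ERROR: DEEPSEEK_API_KEY not set and not found in ~/.openclaw/.env" | tee -a "$LOG_FILE"
  exit 1
fi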
EXIT_CODE=0
for cat in "${CATS[@]}"; do
tasks_file="$TESTS_DIR/${CATEGORIES[$cat]:-}"
if [ ! -f "$tasks_file" ]; then
echo "SKIP: $cat (no tasks file: $tasks_file)" | tee -a "$LOG_FILE"
continue
fi
echo "" | tee -a "$LOG_FILE"
echo "--- [$cat] starting cycle ---" | tee -a "$LOG_FILE"
cd "$SKILL_DIR"
if python3 run_sleep.py --tasks "$tasks_file" 2>&1 | tee -a "$LOG_FILE"; then
echo "--- [$cat] OK ---" | tee -a "$LOG_FILE"
else
EC=$?
echo "--- [$cat] FAILED (exit $EC) ---" | tee -a "$LOG_FILE"
EXIT_CODE=$EC
fi
done
{
echo ""
echo "=========================================="
echo "Done. Exit: $EXIT_CODE"
echo "=========================================="
} | tee -a "$LOG_FILE"
exit $EXIT_CODE


@@ -0,0 +1,275 @@
"""OpenClaw backend for SkillOpt-Sleep.
Adapts the skillopt_sleep Backend protocol to our DeepSeek + Ollama stack:
- attempt/judge/reflect -> DeepSeek V4 Pro (or Flash for cost)
- embeddings -> Ollama nomic-embed-text (already configured)
This backend NEVER mutates live state. It only returns text + EditRecord
proposals that the gate stages for human review.
"""
from __future__ import annotations
import json
import os
import re
import subprocess
from typing import Any, Dict, List, Optional, Tuple
from skillopt_sleep.backend import Backend, _normalize, exact_score
from skillopt_sleep.types import EditRecord, ReplayResult, TaskRecord
# ── DeepSeek + Ollama OpenAI-compatible API client (urllib-based, no extra deps) ──
def _chat(messages: List[Dict[str, str]], *, model: str, temperature: float = 0.2, max_tokens: int = 1500) -> str:
    """Call the DeepSeek chat-completions endpoint via urllib. No extra Python deps needed."""
    import json as _json
    import urllib.request
    api_key = os.environ.get("DEEPSEEK_API_KEY", "")
    if not api_key:
        # Fall back to ~/.openclaw/.env; accept both `KEY=value` and `export KEY="value"`
        env_path = os.path.expanduser("~/.openclaw/.env")
        if os.path.exists(env_path):
            with open(env_path) as f:
                for line in f:
                    line = line.strip()
                    if line.startswith("export "):
                        line = line[len("export "):].lstrip()
                    if line.startswith("DEEPSEEK_API_KEY="):
                        api_key = line.split("=", 1)[1].strip().strip("\"'")
                        break
    base = os.environ.get("DEEPSEEK_BASE_URL", "https://api.deepseek.com/v1")
    payload = {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
        "stream": False,
    }
    req = urllib.request.Request(
        f"{base}/chat/completions",
        data=_json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    try:
        with urllib.request.urlopen(req, timeout=180) as resp:
            data = _json.loads(resp.read().decode("utf-8"))
            return data["choices"][0]["message"]["content"]
    except Exception as e:
        # Errors come back as text, never raised; callers should check the "[BACKEND_ERROR]" prefix
        return f"[BACKEND_ERROR] {type(e).__name__}: {str(e)[:200]}"
def _embed(text: str) -> List[float]:
"""Call Ollama for embeddings. Uses the configured nomic-embed-text model."""
import json as _json
import urllib.request
try:
req = urllib.request.Request(
"http://127.0.0.1:11434/api/embeddings",
data=_json.dumps({"model": "nomic-embed-text:latest", "prompt": text[:2000]}).encode("utf-8"),
headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req, timeout=30) as resp:
data = _json.loads(resp.read().decode("utf-8"))
return data.get("embedding", [])
except Exception:
return []
# ── Backend implementation ────────────────────────────────────────────────────
class OpenClawDeepSeekBackend(Backend):
"""Use DeepSeek V4 Pro for attempt/judge/reflect, Ollama for embeddings.
- "model" passed to constructor = optimizer model (default: deepseek-v4-pro)
- "judge_model" = judge model (default: deepseek-v4-pro for quality)
- "cheap_model" = budget-fallback (deepseek-v4-flash)
"""
name = "openclaw-deepseek"
def __init__(
self,
model: str = "deepseek-v4-pro",
judge_model: str = "deepseek-v4-pro",
cheap_model: str = "deepseek-v4-flash",
):
self._model = model
self._judge_model = judge_model
self._cheap_model = cheap_model
self._tokens = 0 # rough estimate
def tokens_used(self) -> int:
return self._tokens
# ── 1. attempt: produce a response given the task + skill + memory ──
def attempt(self, task: TaskRecord, skill: str, memory: str) -> str:
sys = (
"You are an OpenClaw agent (Kobe ecosystem). Use the skill and memory below to complete the task. "
"If the task asks for a structured output, follow the rubric exactly. "
"Be concise. No preamble, no explanation unless the task asks for it."
)
usr = f"""## SKILL
{skill or '(no skill yet)'}
## MEMORY
{memory or '(no memory yet)'}
## TASK
{task.intent}
## CONTEXT (if any)
{task.context_excerpt or '(none)'}
## RESPONSE
"""
out = _chat(
[{"role": "system", "content": sys}, {"role": "user", "content": usr}],
model=self._model,
temperature=0.2,
)
self._tokens += len(usr) // 4 + 200
return out
    # ── 2. judge: score the response ──
    def judge(self, task: TaskRecord, response: str) -> Tuple[float, float, str]:
        # Hard score: exact-match against task.reference (if available)
        hard = exact_score(task.reference or "", response)
        # Soft score: LLM judge against rubric (reference if reference_kind=='rubric')
        rubric_text = task.reference if task.reference_kind == "rubric" else ""
        if rubric_text:
            judge_prompt = f"""You are a strict grader. Score the response 0.0-1.0 against the rubric.
## TASK
{task.intent}
## REFERENCE
{task.reference or '(none)'}
## RUBRIC
{rubric_text}
## RESPONSE
{response[:3000]}
## INSTRUCTIONS
Return ONLY a single float 0.0-1.0 on one line. No explanation. No markdown.
"""
            try:
                j_out = _chat(
                    [{"role": "user", "content": judge_prompt}],
                    model=self._judge_model,
                    temperature=0.0,
                    max_tokens=20,
                ).strip()
                if j_out.startswith("[BACKEND_ERROR]"):
                    # Never parse digits out of an error message: "HTTP Error 401" would clamp to 1.0
                    raise ValueError(j_out)
                soft = float(re.search(r"\d*\.?\d+", j_out.splitlines()[0]).group())
                soft = max(0.0, min(1.0, soft))
            except Exception:
                # Empty output, no number, or a backend error: fall back to the hard score
                soft = hard
            self._tokens += 600
        else:
            soft = hard
        rationale = f"hard={hard:.2f} soft={soft:.2f}"
        return hard, soft, rationale
# ── 3. reflect: produce bounded EditRecord proposals ──
def reflect(
self,
failures: List[Tuple[TaskRecord, ReplayResult]],
successes: List[Tuple[TaskRecord, ReplayResult]],
skill: str,
memory: str,
*,
edit_budget: int,
evolve_skill: bool,
evolve_memory: bool,
) -> List[EditRecord]:
# Compact digest of failures + successes
fail_digest = "\n".join(
f"- TASK: {t.intent[:200]}\n RESPONSE: {r.response[:300]}\n WHY FAIL: {r.judge_rationale or r.fail_reason or 'unknown'}\n REFERENCE: {t.reference[:200]}"
for t, r in failures[:5]
) or "(none)"
succ_digest = "\n".join(
f"- TASK: {t.intent[:150]} -> OK ({r.judge_rationale or 'high score'})"
for t, r in successes[:3]
) or "(none)"
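        # Reference answers for up to three failed tasks. Built outside the f-string because a
        # backslash inside an f-string expression is a SyntaxError before Python 3.12; this also
        # emits a real newline between Q and A instead of a literal "\n".
        rubric_text = ""
        ref_lines = [f"Q: {t.intent[:120]}\nA: {t.reference}" for t, _ in failures[:3] if t.reference]
        if ref_lines:
            rubric_text = "\n\n## REFERENCE ANSWERS\n" + "\n".join(ref_lines)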
sys = (
"You are SkillOpt-Sleep's bounded-edit optimizer. Your job is to propose 1-4 MINIMAL text edits to a skill or memory document "
"that, if applied, would help future agents do better on the failed tasks. "
"NEVER propose adding new sections wholesale. NEVER delete entire sections. "
"Edit primitives: ADD (append a step/rule at end), DELETE (remove a specific line by exact match), REPLACE (swap a specific line for another by exact match). "
"If you cannot identify a clear, minimal improvement, return an empty list."
)
usr = f"""## CURRENT SKILL
{skill or '(empty)'}
## CURRENT MEMORY
{memory or '(empty)'}
## FAILED TASKS
{fail_digest}
## SUCCESSFUL TASKS
{succ_digest}
{rubric_text}
## CONSTRAINTS
- max {edit_budget} edits total
- edits go to {"skill + memory" if (evolve_skill and evolve_memory) else ("skill" if evolve_skill else "memory")}
- if evolve_skill=False, target="memory" only; if evolve_memory=False, target="skill" only
- target must be "skill" or "memory"
## OUTPUT FORMAT (JSON, no markdown)
{{"edits": [{{"op": "ADD"|"DELETE"|"REPLACE", "target": "skill"|"memory", "content": "the text to add or replace with", "old_text": "for REPLACE/DELETE, the exact line to find", "rationale": "one short sentence why"}}]}}
"""
out = _chat(
[{"role": "system", "content": sys}, {"role": "user", "content": usr}],
model=self._model,
temperature=0.4,
max_tokens=2000,
)
self._tokens += len(usr) // 3 + 1500
# parse
try:
# strip markdown fences if any
cleaned = out.strip()
if cleaned.startswith("```"):
cleaned = re.sub(r"^```[a-z]*\n?", "", cleaned)
cleaned = re.sub(r"\n?```$", "", cleaned)
data = json.loads(cleaned)
edits: List[EditRecord] = []
for e in data.get("edits", [])[:edit_budget]:
if e.get("op") not in ("ADD", "DELETE", "REPLACE"):
continue
target = e.get("target", "skill")
if target not in ("skill", "memory"):
continue
if not evolve_skill and target == "skill":
continue
if not evolve_memory and target == "memory":
continue
edits.append(EditRecord(
op=e["op"],
target=target,
content=e.get("content", ""),
old_text=e.get("old_text", ""),
rationale=e.get("rationale", ""),
))
return edits
except Exception as e:
# log + return empty list (no edit is better than a bad edit)
return []

plugins/openclaw/slash_sleep.py Executable file

@@ -0,0 +1,289 @@
#!/usr/bin/env python3
"""slash_sleep.py — OpenClaw slash command equivalent of SkillOpt's /sleep.
Use from the main session as a /sleep command:
/sleep status — show current state, recent staging entries, and the most recent night
/sleep run — trigger one cycle (all categories) right now
/sleep run research-cron — one cycle, single category
/sleep adopt [night] — adopt the most recent (or specified) staged proposal
/sleep reject [night] — discard the most recent (or specified) staging dir
/sleep dry-run — report-only cycle
/sleep cost — estimate per-night cost for current config
This script is a thin shell over run_sleep.py. It can be invoked either
manually from the main session or by an OpenClaw command handler.
"""
from __future__ import annotations
import argparse
import json
import os
import shutil
import sys
from pathlib import Path
from datetime import datetime
SKILL_DIR = Path.home() / ".openclaw" / "workspace" / "skills" / "skillopt-sleep"
STATE_DIR = Path(os.path.expanduser("~/.skillopt-sleep")) # default
STAGING_ROOT = STATE_DIR
def _resolve_state_dir():
"""Find the actual state dir.
Priority: scan in order:
1. ~/.skillopt-sleep/ (default)
2. /home/ethanclaw/.openclaw/workspace/.skillopt-sleep/ (when staging is there)
3. /home/ethanclaw/.openclaw/.skillopt-sleep/ (parent of overridden claude_home)
Pick the first one that has a state.json OR staging dir.
"""
candidates = [
Path(os.path.expanduser("~/.skillopt-sleep")),
Path("/home/ethanclaw/.openclaw/workspace/.skillopt-sleep"),
Path("/home/ethanclaw/.openclaw/.skillopt-sleep"),
]
# Prefer the one with state.json
for c in candidates:
if (c / "state.json").exists():
return c
# Then the one with staging
for c in candidates:
if (c / "staging").exists():
return c
return candidates[0]
TESTS_DIR = SKILL_DIR / "tests"
def status() -> int:
state_dir = _resolve_state_dir()
state_file = state_dir / "state.json"
staging_dir = state_dir / "staging"
print(f"=== SkillOpt-Sleep status ===")
print(f" state dir: {state_dir}")
print(f" staging dir: {staging_dir}")
if staging_dir.exists():
stages = sorted(staging_dir.iterdir(), key=lambda p: p.stat().st_mtime, reverse=True)
print(f" staging entries: {len(stages)}")
for s in stages[:3]:
print(f" {s.name}")
if not state_file.exists():
print(" no state.json — run a cycle first (state is written at end of each non-dry-run)")
return 0
with open(state_file) as f:
state = json.load(f)
nights = state.get("history") or state.get("nights", [])
print(f" total nights: {len(nights)}")
print(f" accepted: {sum(1 for n in nights if n.get('accepted'))}")
print(f" rejected: {sum(1 for n in nights if not n.get('accepted'))}")
if nights:
last = nights[-1]
print(f" last night: {last.get('night')}")
print(f" accepted: {last.get('accepted')}")
print(f" baseline: {last.get('baseline'):.3f} -> candidate: {last.get('candidate'):.3f}")
print(f" staging: {last.get('staging') or '(none)'}")
return 0
def run_category(category: str, *, dry_run: bool = False) -> int:
    import subprocess  # local import keeps this function self-contained
    cat_to_file = {
        "research-cron": "research-cron-tasks.json",
        "devops": "devops-tasks.json",
        "wiki": "wiki-tasks.json",
    }
    tasks_file = TESTS_DIR / cat_to_file.get(category, f"{category}-tasks.json")
    if not tasks_file.exists():
        print(f"ERROR: no tasks file for category '{category}': {tasks_file}")
        return 1
    cmd = [sys.executable, str(SKILL_DIR / "run_sleep.py")]
    if dry_run:
        cmd.append("--dry-run")
    cmd.extend(["--tasks", str(tasks_file)])
    print(f"=== /sleep run {category}{' (dry-run)' if dry_run else ''} ===")
    print(f" cmd: {' '.join(cmd)}")
    # subprocess.run reports the child's real exit code. os.system returns a raw wait status
    # (exit code << 8 on POSIX), so a failing cycle would reach sys.exit() as 256 and exit 0.
    return subprocess.run(cmd).returncode
def run_all(*, dry_run: bool = False) -> int:
rc = 0
for cat in ("research-cron", "devops", "wiki"):
r = run_category(cat, dry_run=dry_run)
if r != 0:
rc = r
return rc
def adopt(night: str = None) -> int:
state_dir = _resolve_state_dir()
state_file = state_dir / "state.json"
if not state_file.exists():
print("ERROR: no state to adopt from")
return 1
with open(state_file) as f:
state = json.load(f)
nights = state.get("history") or state.get("nights", [])
if not nights:
print("ERROR: no nights recorded")
return 1
target = None
if night:
target = next((n for n in nights if str(n.get("night")) == night), None)
if not target:
print(f"ERROR: night '{night}' not found")
return 1
else:
# most recent accepted
candidates = [n for n in nights if n.get("accepted") and n.get("staging")]
if not candidates:
print("ERROR: no accepted nights with staging to adopt")
return 1
target = candidates[-1]
staging = target["staging"]
if not os.path.isdir(staging):
print(f"ERROR: staging dir missing: {staging}")
return 1
print(f"=== /sleep adopt night {target['night']} ===")
print(f" staging: {staging}")
print(f" baseline: {target.get('baseline'):.3f} candidate: {target.get('candidate'):.3f}")
# Read proposed skill from staging
manifest = Path(staging) / "manifest.json"
if manifest.exists():
with open(manifest) as f:
m = json.load(f)
proposed = m.get("proposed_skill")
if proposed and Path(proposed).exists():
live = STATE_DIR / "live_skill.md"
backup = STATE_DIR / f"live_skill.md.bak-{target['night']}"
if live.exists():
shutil.copy2(live, backup)
print(f" backed up current live skill → {backup}")
shutil.copy2(proposed, live)
print(f" adopted proposed skill → {live}")
print()
print("✅ Adoption complete. Next cycle will use the new skill.")
return 0
print("ERROR: no proposed_skill in manifest")
return 1
def reject(night: str = None) -> int:
state_dir = _resolve_state_dir()
state_file = state_dir / "state.json"
if not state_file.exists():
print("ERROR: no state")
return 1
with open(state_file) as f:
state = json.load(f)
nights = state.get("history") or state.get("nights", [])
target = None
if night:
target = next((n for n in nights if str(n.get("night")) == night), None)
else:
candidates = [n for n in reversed(nights) if n.get("staging")]
target = candidates[0] if candidates else None
if not target or not target.get("staging"):
print("ERROR: nothing to reject")
return 1
staging = target["staging"]
if os.path.isdir(staging):
shutil.rmtree(staging)
print(f"🗑️ Removed staging: {staging}")
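    # remove from state, writing back to whichever key the state file actually uses
    key = "history" if state.get("history") else "nights"
    state[key] = [n for n in nights if n.get("night") != target["night"]]
    with open(state_file, "w") as f:
        json.dump(state, f, indent=2)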
print("✅ Rejected. State updated.")
return 0
def cost() -> int:
"""Estimate per-night cost based on the actual measurement from Phase 2.
From the real dry-run: 5 devops tasks used 14,427 tokens total.
That is ~2,885 tokens per task (all 3 phases combined).
"""
cfg_path = SKILL_DIR / "config.json"
cfg = {}
if cfg_path.exists():
cfg = json.loads(cfg_path.read_text())
cfg.pop("_comment", None)
max_tasks = cfg.get("max_tasks_per_night", 12)
model = cfg.get("model", "deepseek-v4-pro")
# DeepSeek V4 pricing
if "pro" in model:
cost_in = 0.435 # per 1M
cost_out = 0.87
elif "flash" in model:
cost_in = 0.14
cost_out = 0.28
else:
cost_in, cost_out = 0.5, 1.0
# Measured: ~2,900 tokens per task, 30% output / 70% input
toks_per_task = 2900
input_toks = int(toks_per_task * 0.7)
output_toks = int(toks_per_task * 0.3)
cost_in_total = (input_toks * max_tasks / 1_000_000) * cost_in
cost_out_total = (output_toks * max_tasks / 1_000_000) * cost_out
cost = cost_in_total + cost_out_total
print(f"=== Cost estimate (per actual measurement) ===")
print(f" model: {model}")
print(f" max tasks/night: {max_tasks}")
print(f" ~tokens/night: {toks_per_task * max_tasks:,}")
print(f" cost/night: ${cost:.3f}")
print(f" cost/month (30 nights): ${cost*30:.2f}")
print(f" cost/year (365 nights): ${cost*365:.2f}")
return 0
def main():
ap = argparse.ArgumentParser(description="OpenClaw /sleep command")
sub = ap.add_subparsers(dest="cmd", required=True)
    sub.add_parser("status", help="show state + most recent night")
p_run = sub.add_parser("run", help="trigger one cycle")
p_run.add_argument("category", nargs="?", default=None,
choices=["research-cron", "devops", "wiki", None])
p_run.add_argument("--dry-run", action="store_true")
sub.add_parser("dry-run", help="report-only cycle (all categories)")
p_adopt = sub.add_parser("adopt", help="adopt most recent accepted staging")
p_adopt.add_argument("night", nargs="?", default=None)
p_reject = sub.add_parser("reject", help="discard most recent staging")
p_reject.add_argument("night", nargs="?", default=None)
sub.add_parser("cost", help="estimate cost")
args = ap.parse_args()
if args.cmd == "status":
return status()
if args.cmd == "run":
if args.category:
return run_category(args.category, dry_run=args.dry_run)
return run_all(dry_run=args.dry_run)
if args.cmd == "dry-run":
return run_all(dry_run=True)
if args.cmd == "adopt":
return adopt(args.night)
if args.cmd == "reject":
return reject(args.night)
if args.cmd == "cost":
return cost()
return 1
if __name__ == "__main__":
sys.exit(main())


@@ -0,0 +1,87 @@
[
{
"id": "do-01",
"reference": "[STATUS] devops-agent | Site Uptime \u2192 geoxylia.com OK (200) | 14/06 22:30 MYT",
"rubric": "Score 1.0 if output matches the exact format [STATUS] devops-agent | Site Uptime \u2192 geoxylia.com OK (200) | DD/MM HH:MM MYT, with a real current time. Score 0.5 if format is close but missing one field. Score 0.0 if wrong format or hallucinated values.",
"project": "devops-infrastructure-check",
"intent": "Site Uptime check. Run: `curl -o /dev/null -s -w '%{http_code}' https://geoxylia.com`. Interpret the result 200, and report in our standard format: 'STATUS | TASK \u2192 RESULT | TIME'. If not 200, escalate.",
"context_excerpt": "",
"attempted_solution": "",
"outcome": "unknown",
"reference_kind": "rubric",
"judge": {},
"tags": [
"devops-infrastructure-check"
],
"source_sessions": [],
"split": "val"
},
{
"id": "do-02",
"reference": "Backup complete. Files: 87, Size: 1.2G, Last: 2026-06-14 22:00:00 MYT",
"rubric": "Score 1.0 if output includes the exact 'Backup complete. Files: N, Size: X, Last: timestamp' structure with plausible values. Score 0.5 if structure is close but one field missing. Score 0.0 if hallucinated or wrong structure.",
"project": "devops-infrastructure-check",
"intent": "Daily Memory Backup. Confirm this ran successfully by checking: `ls -t ~/backups/memory/memory-backup-*.tar.gz | head -3`. Report the file count, total size, and most recent backup time. Use format: 'Backup complete. Files: [N], Size: [X], Last: [timestamp]'.",
"context_excerpt": "",
"attempted_solution": "",
"outcome": "unknown",
"reference_kind": "rubric",
"judge": {},
"tags": [
"devops-infrastructure-check"
],
"source_sessions": [],
"split": "val"
},
{
"id": "do-03",
"reference": "1) Vercel CSP missing frame-ancestors: MEDIUM. Allows clickjacking if anyone embeds our pages; not exploitable for our content, but best-practice gap.\n2) OpenClaw plaintext API keys: LOW. The config is chmod 600, loopback-only, not in git. Standard OpenClaw behavior. Rotating would add zero real security given current exposure.",
"rubric": "Score 1.0 if both are classified correctly (MEDIUM and LOW respectively) and justifications are accurate (not panicky, not dismissive). Score 0.5 if classifications are wrong by one tier or justifications are weak. Score 0.0 if both over-classified as CRITICAL or both wrong.",
"project": "devops-infrastructure-check",
"intent": "Security Check daily run. Two findings: 1) Vercel CSP header missing 'frame-ancestors' directive, 2) OpenClaw config has 3 plaintext API keys. Classify each as: CRITICAL / HIGH / MEDIUM / LOW / INFO. Justify each in 1 sentence.",
"context_excerpt": "",
"attempted_solution": "",
"outcome": "unknown",
"reference_kind": "rubric",
"judge": {},
"tags": [
"devops-infrastructure-check"
],
"source_sessions": [],
"split": "train"
},
{
"id": "do-04",
"reference": "[INCIDENT] supabase.audit_results: anon role has no RLS policy \u2014 anyone with the URL can read all audit results. Fix: add policy 'audit_results_select_own' granting SELECT WHERE user_id = auth.uid(). Severity: HIGH (data exposure). Estimated 2-min fix.",
"rubric": "Score 1.0 if: (a) severity correctly identified as HIGH, (b) fix is a real RLS policy (not just 'enable RLS' since it's already enabled), (c) under 50 words, (d) Telegram-friendly format. Score 0.5 if severity right but fix is generic. Score 0.0 if missing severity or wrong fix.",
"project": "devops-infrastructure-check",
"intent": "Incident Check. The Supabase RLS check returned: 'table public.audit_results: rls enabled but policy missing for anon role'. Interpret severity, propose fix, and format as a Telegram alert (max 50 words).",
"context_excerpt": "",
"attempted_solution": "",
"outcome": "unknown",
"reference_kind": "rubric",
"judge": {},
"tags": [
"devops-infrastructure-check"
],
"source_sessions": [],
"split": "val"
},
{
"id": "do-05",
"reference": "\ud83d\udee1\ufe0f Week security digest:\n\n\u2022 0 critical incidents, 1 high resolved (Supabase RLS policy added)\n\u2022 22 plaintext secrets: expected OpenClaw behavior, no action\n\u2022 1 medium open: Vercel CSP frame-ancestors, schedule for next sprint\n\nTrend: stable. No regressions vs last week.",
"rubric": "Score 1.0 if all 3 priority tiers mentioned with correct counts, ends with a trend statement, Telegram-friendly. Score 0.5 if structure is right but one tier wrong. Score 0.0 if missing a tier or wrong format.",
"project": "devops-infrastructure-check",
"intent": "Weekly security digest. Synthesize this week's findings: 22 plaintext secrets in openclaw.json (expected), 0 critical incidents, 1 high (Supabase RLS), 1 medium (CSP frame-ancestors), 0 low. Output a 3-bullet Telegram status.",
"context_excerpt": "",
"attempted_solution": "",
"outcome": "unknown",
"reference_kind": "rubric",
"judge": {},
"tags": [
"devops-infrastructure-check"
],
"source_sessions": [],
"split": "train"
}
]

@@ -0,0 +1,87 @@
[
{
"id": "rc-01",
"reference": "COMPETITOR MOVES: Otterly adds Perplexity tracker, joining Profound and LLMRefs in multi-platform citations.\nBACKLINK OPPORTUNITIES: 3 SEO directories (G2, Capterra, GetApp) have not been claimed.\nAGENCY BLUEPRINT: Top 2 agency sites bundle GEO audit + content refresh as $3K/mo tier.\nACTION ITEMS: Build Perplexity citation test into GeoXylia audit; claim G2 listing by Friday.",
"rubric": "Score 1.0 if all 4 section headings present in correct order, each with a substantive (not generic) 1-sentence content. Score 0.5 if headings present but content is generic. Score 0.0 if any heading missing or order wrong.",
"project": "research-cron-output",
"intent": "Weekly Competitive Deep Dive for GeoXylia. The competitor otterly.ai just added a Perplexity citation tracker. Produce the report header (top section) in our standard format: COMPETITOR MOVES, BACKLINK OPPORTUNITIES, AGENCY BLUEPRINT, ACTION ITEMS. Keep it to 4 lines, one per section heading with a 1-sentence placeholder.",
"context_excerpt": "",
"attempted_solution": "",
"outcome": "unknown",
"reference_kind": "rubric",
"judge": {},
"tags": [
"research-cron-output"
],
"source_sessions": [],
"split": "train"
},
{
"id": "rc-02",
"reference": "1. 'ai seo audit tool': 420 imp, pos 8.2, on page 1 \u2014 needs CTR lift (snippet/schema).\n2. 'geo audit tool': 230 imp, pos 12.5, page 2 \u2014 target blog post could push to page 1.\n3. 'llm optimization': 85 imp, pos 18.3, deep page-2 \u2014 fresh content with answer capsule could compete.",
"rubric": "Score 1.0 if the response correctly identifies 'ai seo audit tool', 'geo audit tool', and 'llm optimization' as the top 3 (NOT 'best free seo audit' which is already converting well, NOT 'free audit tool' which has too few impressions). Each must have correct impression count, position, and a substantive rationale. Score 0.5 if correct 3 keywords but rationale is weak. Score 0.0 if wrong keywords selected.",
"project": "research-cron-output",
"intent": "GSC keyword opportunity scan. From this snippet of GSC data, identify the top 3 keyword opportunities (high impressions, low CTR, position 5-15):\n\n1. 'ai seo audit tool' \u2014 420 imp, 12 clicks, pos 8.2\n2. 'best free seo audit' \u2014 1100 imp, 95 clicks, pos 4.1\n3. 'geo audit tool' \u2014 230 imp, 4 clicks, pos 12.5\n4. 'llm optimization' \u2014 85 imp, 1 click, pos 18.3\n5. 'free audit tool' \u2014 50 imp, 0 clicks, pos 22.0\n\nOutput: one line per opportunity, format 'KEYWORD: impressions, position, why-it-matters (1 short clause)'.",
"context_excerpt": "",
"attempted_solution": "",
"outcome": "unknown",
"reference_kind": "rubric",
"judge": {},
"tags": [
"research-cron-output"
],
"source_sessions": [],
"split": "train"
},
{
"id": "rc-03",
"reference": "Google AI Overviews now show source links more prominently + author bylines. For GeoXylia: this favors pages with clear authorship (add author schema to blog posts). Action: this week, add author + E-E-A-T schema markup to top 10 blog posts. Source: Google Search Central blog.",
"rubric": "Score 1.0 if: (a) under 60 words, (b) names the change, (c) gives GeoXylia-specific implication, (d) gives a concrete action item, (e) cites the source. Score 0.5 if missing 1-2 of these. Score 0.0 if over 60 words or missing 3+.",
"project": "research-cron-output",
"intent": "Daily Industry News scan. The Google Search Central blog just announced: 'AI Overviews now showing source links more prominently, with author bylines for E-E-A-T-heavy content.' Write a 1-paragraph Telegram alert (max 60 words) for Ethan. Include: 1) what changed, 2) what it means for GeoXylia, 3) any action item.",
"context_excerpt": "",
"attempted_solution": "",
"outcome": "unknown",
"reference_kind": "rubric",
"judge": {},
"tags": [
"research-cron-output"
],
"source_sessions": [],
"split": "val"
},
{
"id": "rc-04",
"reference": "Hi [Name], I saw seo-skill.com's resources page is one of the most-respected SEO learning hubs in the industry \u2014 your 2026 algorithm breakdown was spot-on. We just published a free 2026 AI SEO Audit comparison that your readers would find genuinely useful (no paywall, no signup). It covers the 8 leading AI-audit tools with hands-on screenshots and a clear feature matrix. GeoXylia is the only fully-free option in the comparison, so it's a natural fit for a 'tools to know' section. Mind if I share the link for inclusion?",
"rubric": "Score 1.0 if exactly 4 sentences, all four functional pieces present (compliment / mention resource / audience benefit / GeoXylia one-liner), conversational tone, no aggressive sales language. Score 0.5 if 3 of 4 pieces present or tone is too salesy. Score 0.0 if more than 5 sentences or missing 2+ pieces.",
"project": "research-cron-output",
"intent": "Backlink Outreach draft for the blog post 'Free AI SEO Audit Tool: 2026 Comparison'. The prospect is seo-skill.com (a popular SEO training site with a 'resources' page). Write a 4-sentence outreach email: 1) compliment, 2) mention our resource, 3) explain audience benefit, 4) one-line about GeoXylia.",
"context_excerpt": "",
"attempted_solution": "",
"outcome": "unknown",
"reference_kind": "rubric",
"judge": {},
"tags": [
"research-cron-output"
],
"source_sessions": [],
"split": "train"
},
{
"id": "rc-05",
"reference": "1) DO MORE: AI citation / LLM-mention topics \u2014 the 0.9% CTR at position 9.4 means we're visible but need richer answer capsules to lift CTR. Target 2x posts/week on this cluster.\n2) PAUSE: Pure schema-markup how-tos \u2014 'Schema Markup for SEO' has 0 clicks at position 41, the audience isn't searching this way. Rework as 'How to appear in AI answers' framing.\n3) TEST: 'Perplexity vs ChatGPT citation rates for [niche]' \u2014 unexplored angle, could capture comparison-intent traffic.",
"rubric": "Score 1.0 if all 3 are specific (not generic), cite actual data from the prompt, and contain a clear actionable change. Score 0.5 if 2 of 3 are specific. Score 0.0 if generic advice or no data citations.",
"project": "research-cron-output",
"intent": "Performance \u2192 Strategy feedback loop. Last week's top blog post was 'AI Citation Audit: Does Your Site Appear in ChatGPT?' with 4,200 impressions and 38 clicks (CTR 0.9%, position 9.4). The bottom post was 'Schema Markup for SEO: A 2026 Guide' with 110 impressions and 0 clicks (CTR 0%, position 41). Write 3 specific strategy adjustments: 1) what to do more of, 2) what to pause, 3) what new topic to test.",
"context_excerpt": "",
"attempted_solution": "",
"outcome": "unknown",
"reference_kind": "rubric",
"judge": {},
"tags": [
"research-cron-output"
],
"source_sessions": [],
"split": "val"
}
]

@@ -0,0 +1,70 @@
[
{
"id": "wk-01",
"reference": "1. What GEO is and isn't (define vs SEO/AEO, dispel the 'just add FAQ' myth)\n2. The 3 citation mechanisms LLMs use (RAG, fine-tuning, in-context; weight each)\n3. The 2026 citation data (real statistics from Profound/Otterly/Peec; what % of queries get citations)\n4. The action framework (a 5-step audit-and-fix process, concrete)\n5. Measurement (which metrics actually predict citation lift; vanity vs real)",
"rubric": "Score 1.0 if 5 sections, in a logical order, each with a substantive (not generic) purpose, and the section content is GEO-specific (not generic SEO). Score 0.5 if 5 sections but 1-2 are generic. Score 0.0 if wrong number of sections or wrong order.",
"project": "wiki-canonical-guide",
"intent": "Wiki canonical guide: 'GEO 2026 Standards'. Audience: a mid-level SEO specialist who has heard of GEO but not done it. Tone: technical, evidence-driven, no fluff. Length target: 1500-2200 words. Outline the 5 sections that should appear in order. For each, give a 1-sentence sub-purpose.",
"context_excerpt": "",
"attempted_solution": "",
"outcome": "unknown",
"reference_kind": "rubric",
"judge": {},
"tags": [
"wiki-canonical-guide"
],
"source_sessions": [],
"split": "val"
},
{
"id": "wk-02",
"reference": "Yes, add inbound links. (1) geo-2026-standards.md \u2192 '## Action Framework' section, anchor: 'platform-specific citation rules' \u2014 natural since GEO standards reference ChatGPT/Perplexity behavior. (2) seo-2026-standards.md \u2192 '## AI Overviews' section, anchor: 'AI platform citations' \u2014 links to the mechanism guide. (3) content-strategy.md \u2192 '## Content Types' section, anchor: 'per-platform citation' \u2014 content strategy needs to know which platform favors which content.",
"rubric": "Score 1.0 if all 3 inbound links proposed with specific section + natural anchor text, demonstrating the link solves a real navigational gap (not just SEO-link-building). Score 0.5 if 2 of 3 are well-placed. Score 0.0 if generic anchors like 'click here' or no specific sections named.",
"project": "wiki-canonical-guide",
"intent": "Cross-link audit. The wiki page 'ai-platform-citation-guide.md' has 4 outbound links to other wiki pages, but no inbound links from: 'geo-2026-standards.md', 'seo-2026-standards.md', 'content-strategy.md'. Should we add inbound links? In which page should each inbound link go, and what anchor text would be natural?",
"context_excerpt": "",
"attempted_solution": "",
"outcome": "unknown",
"reference_kind": "rubric",
"judge": {},
"tags": [
"wiki-canonical-guide"
],
"source_sessions": [],
"split": "val"
},
{
"id": "wk-03",
"reference": "Priorities:\n1. Refresh 'geo-glossary.md' (last update 2026-04-12, 63 days) \u2014 add new terms like RAG, in-context citation, agentic SEO.\n2. Refresh 'competitor-pricing.md' (last update 2026-05-01, 44 days) \u2014 Profound raised enterprise tier.\n3. No structural fixes needed.\n\nTelegram: 'Wiki lint: 2 stale pages flagged (geo-glossary 63d, competitor-pricing 44d). No broken links. Both need refresh this week.'",
"rubric": "Score 1.0 if both stale pages correctly identified with specific (not generic) refresh notes, and Telegram summary is under 40 words with the right action. Score 0.5 if stale pages identified but refresh notes are vague. Score 0.0 if missing stale pages or Telegram over 40 words.",
"project": "wiki-canonical-guide",
"intent": "Wiki lint report. Today's scan: 14 wiki pages, 2 with 'Updated' dates > 30 days old ('geo-glossary.md' and 'competitor-pricing.md'), 0 broken internal links, 0 missing YAML frontmatter. Output: 1) prioritized action list, 2) Telegram summary (max 40 words).",
"context_excerpt": "",
"attempted_solution": "",
"outcome": "unknown",
"reference_kind": "rubric",
"judge": {},
"tags": [
"wiki-canonical-guide"
],
"source_sessions": [],
"split": "train"
},
{
"id": "wk-04",
"reference": "Index rebuilt: 14 wiki pages registered in _index.md (was 12 \u2014 added competitor-pricing-rev2 and citations-q2-2026).\nQuestion for Ethan: should 'competitor-pricing.md' and 'competitor-pricing-rev2.md' be merged? They're 78% similar in content.",
"rubric": "Score 1.0 if both sentences are accurate (count matches, names are plausible) and the question identifies a real consolidation opportunity (not a fabricated one). Score 0.5 if structure is right but content vague. Score 0.0 if wrong format or no question.",
"project": "wiki-canonical-guide",
"intent": "Index rebuild check. Run `python3 ~/agent-shared/scripts/update-index.py` (assume it works). After the run, the new wiki/_index.md should list all 14 pages. Generate a 2-sentence confirmation message + 1 question for Ethan to verify.",
"context_excerpt": "",
"attempted_solution": "",
"outcome": "unknown",
"reference_kind": "rubric",
"judge": {},
"tags": [
"wiki-canonical-guide"
],
"source_sessions": [],
"split": "train"
}
]
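
The three fixture files above share one entry shape, so a small sanity check can catch a missing field or a mistyped `split` before a sleep cycle consumes them. The following is a minimal sketch, not part of the plugin: `validate_tasks`, `validate_file`, `REQUIRED_KEYS`, and `ALLOWED_SPLITS` are hypothetical names, and treating every key seen in these fixtures as required is an assumption rather than something the engine is known to enforce.

```python
import json

# Keys observed in the fixture entries above. Treating all of them as
# required is an assumption made for this sketch.
REQUIRED_KEYS = {
    "id", "reference", "rubric", "project", "intent", "context_excerpt",
    "attempted_solution", "outcome", "reference_kind", "judge", "tags",
    "source_sessions", "split",
}
# The fixtures only ever use these two split values.
ALLOWED_SPLITS = {"train", "val"}


def validate_tasks(tasks):
    """Return a list of problem descriptions; an empty list means no issues found."""
    problems = []
    seen_ids = set()
    for index, task in enumerate(tasks):
        label = task.get("id", f"<entry {index}>")
        missing = REQUIRED_KEYS - task.keys()
        if missing:
            problems.append(f"{label}: missing keys {sorted(missing)}")
        if task.get("split") not in ALLOWED_SPLITS:
            problems.append(f"{label}: unexpected split {task.get('split')!r}")
        if label in seen_ids:
            problems.append(f"{label}: duplicate id")
        seen_ids.add(label)
    return problems


def validate_file(path):
    """Hypothetical helper: load one fixture file and validate its entries."""
    with open(path, encoding="utf-8") as handle:
        return validate_tasks(json.load(handle))
```

Calling `validate_file` on each fixture path and printing the returned list would be enough for a pre-commit check; the function returns problems instead of raising so that all issues in a file surface in one pass.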