garrytan-gstack

mirror of https://github.com/garrytan/gstack.git synced 2026-07-03 15:22:30 +08:00

Author	SHA1	Message	Date
Garry Tan	11de390be1	v1.58.5.0 feat: first-run activation scaffold + gstack router front door (#2078) * feat: first-run activation — project-aware scaffold, router front door, onboarding nudges Adds the activation system that drives a new install toward a concrete first move: - bin/gstack-first-task-detect: local-git+filesystem repo classifier emitting one validated enum bucket (greenfield/code_<lang>/branch_ahead/dirty_default/clean_default), portable timeouts, fail-safe empty output. - generate-first-run-guidance.ts: unified preamble section — first-run project-aware scaffold + returning-session plan->review->ship tip, gated on a persistent .activated marker and never run in headless. Detection wired lazily in generate-preamble-bash.ts. - SKILL.md.tmpl: top-level gstack skill is now a pure router (browse body removed; it lives in /browse), routing any request and sending browser/QA work to /browse. - setup: first-move nudge on first install. office-hours: closing handoff that launches the next review via the Skill tool. - telemetry-ingest: accept onboarding/first_task_scaffold_shown/handoff/route event types. * test: cover first-run detection + repoint browse-content assertions to /browse - New unit tests for every detection bucket, the eval-safe enum contract, and the first-run gating (test/preamble-first-task-scaffold.test.ts); periodic E2E that runs the detector through the real harness (test/skill-e2e-first-task-scaffold.test.ts). - Repoint browse-content assertions (gen-skill-docs, audit-compliance, skill-validation, LLM-judge eval) from the root skill to browse/SKILL.md following the router split; add a regression pinning that the router carries no browse body. - Register first-task-scaffold touchfiles + periodic tier; bump parity/carve size caps ~1-2KB per skill for the shared first-run-guidance preamble section. - Refresh ship golden fixtures for the preamble addition. 
* chore: regenerate SKILL.md + llms.txt for first-run activation * chore: bump version and changelog (v1.58.5.0) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(test): repoint bws skillmd-* setup-block assertions to browse/SKILL.md The skillmd-setup-discovery / -no-local-binary / -outside-git E2E tests extracted the `## SETUP`→`## IMPORTANT` browse binary-discovery block from the root SKILL.md. P2 moved that block to browse/SKILL.md (end anchor is now `## Core QA Patterns`), so the slice came back empty and the `browse/dist/browse` guard failed. Repoint to browse/SKILL.md. Verified: 7/7 e2e-browse pass locally. * fix(test): tolerate skill-discovery race in PTY plan-mode smoke The e2e-pty-plan-smoke suite (office-hours / plan-mode-no-op) failed in CI with `Unknown command: /office-hours` (claude exited ~10s) while passing locally. Root cause: a cold CI container's overlay-FS scan of the symlinked ~/.claude/skills registry finishes AFTER the runner's 8s boot grace, so the first `/skill` send reaches claude before the skill is indexed and is rejected as unknown. The runner gave up on the first "Unknown command:" line. runPlanSkillObservation now re-sends the skill command up to 3x (6s apart), re-marking the buffer each time so stale scrollback can't re-trip the check, before concluding the skill is genuinely unregistered. A real dangling-symlink / missing-skill still surfaces as 'exited' (after retries), preserving the original diagnostic. Pure-helper contract unchanged: 95/95 unit tests pass. This is a pre-existing harness bug (fails identically on #2077's own branch, which introduced the suite) surfaced while shipping the activation feature. * debug(ci): temporarily instrument pty-smoke skill discovery Capture claude version, env, registry tree, and a claude -p discovery probe to pin why /office-hours isn't discovered in CI (retries proved it's not a race). Temporary — revert once the registry fix is identified. 
* chore: revert pty-smoke harness experiments (race-retry + CI debug step) Diagnosis is conclusive and the experiments aren't the fix, so restore the harness to its original state (net-zero diff vs main for both files). What the CI debug step proved: `claude -p` returns READY — claude v2.1.187 fully DISCOVERS /office-hours from the symlinked registry. Only the interactive PTY TUI rejects it as "Unknown command" (and it received the full command text). So the e2e-pty-plan-smoke failure is a claude 2.1.187 interactive-TUI regression (skills discovered by `claude -p` aren't exposed as TUI slash commands), pre-existing in the #2077 harness and failing identically on its own origin branch — unrelated to this activation PR. The race-retry can't help (the TUI genuinely lacks the command); the debug step also tripped actionlint (shellcheck SC2012). Both reverted. * fix(ci): copy SKILL.md as real files in pty-smoke registry (cross-mount symlink) The e2e-pty-plan-smoke suite failed with "Unknown command: /office-hours" in CI while passing locally. Root cause (proven, not guessed): claude 2.1.187's interactive-TUI skill scanner does not follow the /github/home -> /__w cross-mount symlink the registry used for per-skill SKILL.md. Evidence: a CI debug step showed `claude -p` discovered the skill (printed READY), and a local macOS repro with the identical symlinked registry recognized /office-hours — isolating the failure to the container's cross-mount symlink, not registration content, claude version, duplicate names, or a race. Fix: register the per-skill SKILL.md + sections as REAL copies (same mount as $HOME) so the TUI reads them directly. The gstack root stays a symlink — the preamble's runtime bash resolves bin/* and sections/* through it and bash follows cross-mount symlinks fine. 
* fix(ci): guard rm expansion in pty-smoke registry (shellcheck SC2115) * fix(ci): also register pty-smoke skills project-scoped (cwd/.claude/skills) The real-file user-dir registration still left the TUI rejecting /office-hours in the container. claude's interactive TUI surfaces /slash commands from the PROJECT dir (<cwd>/.claude/skills); the smokes run with cwd=$REPO whose .claude/skills is gitignored (absent on a fresh CI checkout), so the user-dir registry feeds `claude -p` (READY) but not the TUI. Populate $REPO/.claude/skills with real SKILL.md + sections copies (no gstack symlink there — it would point at its own parent; runtime paths use the user-dir gstack symlink). --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-25 09:42:45 -07:00
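The v1.58.5.0 entry above describes the first-task detector as emitting one validated enum bucket with fail-safe empty output. A minimal sketch of that contract follows, assuming the bucket arrives as a raw string; `validateBucket`, `FIXED_BUCKETS`, and the `code_<lang>` character class are hypothetical names and rules chosen for illustration, not taken from `bin/gstack-first-task-detect`.

```typescript
// Hypothetical sketch of an "eval-safe enum" check. None of these names come
// from the gstack source; they only illustrate the contract described above.
const FIXED_BUCKETS: string[] = [
  "greenfield",
  "branch_ahead",
  "dirty_default",
  "clean_default",
];

// Assumption: the <lang> suffix is limited to lowercase letters, digits, underscores.
const CODE_BUCKET = /^code_[a-z0-9_]+$/;

/** Returns the bucket when it is in the allowed set, otherwise "" (fail-safe). */
function validateBucket(raw: string): string {
  const candidate = raw.trim();
  if (FIXED_BUCKETS.indexOf(candidate) !== -1) return candidate;
  if (CODE_BUCKET.test(candidate)) return candidate;
  // Anything unexpected collapses to empty output instead of being passed on.
  return "";
}
```

Because every rejected value collapses to the empty string, a caller that interpolates the result into shell or template text never sees attacker-shaped input, which is presumably what "eval-safe" refers to.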
Garry Tan	14fc0866d9	v1.58.0.0 feat: diagram + multi-format document engine (mermaid, excalidraw, single-file HTML, DOCX) (#1990) * docs(todos): P3 content-hash diagram render cache for make-pdf Deferred from the diagram-engine eng review (Codex outside-voice D7): repeat make-pdf runs re-render every fence; cache keyed on fence source + bundle version once multi-diagram docs make it worth building. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat(diagram-render): offline mermaid+excalidraw render bundle for browse Single self-contained page (dist/diagram-render.html, 9.2MB, committed per eng-review D2) exposing __renderMermaid / __mermaidToExcalidraw / __excalidrawToSvg / __rasterize / __probeImage through browse load-html + js --out. Render contract per D3: securityLevel strict, per-fence ids, print-css font lock, htmlLabels off (canvas-taint-safe). Deterministic build (same sha twice); drift test pins dist == BUILD_INFO == package.json pins and rebuild-reproducibility when toolchain matches. Spike-proven offline: flowchart + sequence SVG, editable .excalidraw scene, 300dpi PNG. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat(diagram-render): __downscaleRaster for print-resolution image normalization Data-URI rasters re-encode in their own format (JPEG stays JPEG at q0.9 — PNG-encoding photos bloats them) at an explicit target pixel width. Used by make-pdf's pre-pass for the 300dpi content-box ceiling (eng-review D4). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat(make-pdf): diagram pre-pass — mermaid/excalidraw fences render as vector SVG; local images inline as data URIs ```mermaid / ```excalidraw fences extract to placeholder tokens, render in one diagram-render bundle tab per run (reset contract: bundle page reloads after any render error), and substitute back as accessible <figure> blocks with the raw source preserved in a comment. Render failures produce a loud red diagnostic block, never silent raw code. 
render=false keeps a fence as code; title="..." becomes the aria-label and caption. Local images now actually render: page.setContent loads at about:blank (tab-session.ts:194), so relative paths silently 404'd before. The pre-pass resolves them against the markdown's directory, inlines as data URIs, probes intrinsic dimensions from the bytes (pure-TS PNG/JPEG/GIF/WebP/SVG sniffing), and downscales rasters wider than 2x the content box at 300dpi. Remote URLs warn (offline posture, --allow-network exempts); missing files get a visible placeholder; --strict hard-fails both for CI pipelines. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * test(make-pdf): diagram pre-pass unit suite + e2e render gates 34 unit tests (fence extraction incl. nested/tilde/unclosed/render=false, info-string parsing, slot substitution, diagnostic/figure escaping + SVG script strip, byte-level dimension probing across 5 formats, content-box math, image inlining incl. strict/remote/missing/data-URI paths). E2E gate proves through the compiled binary: both fences render as vector text (id-collision check), raw mermaid ships only via render=false, broken fence yields the diagnostic block, and the relative fixture image rasterizes to colored pixels (CRITICAL regression for the about:blank image fix). --strict exits non-zero on a missing image. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat(make-pdf): width directives + conservative auto-landscape via CSS named pages `![a](x.png){width=full|<pct>|<dim>}` and `{page=landscape|portrait}` suffixes translate to data-gstack-* attrs in render() (before the sanitizer, which keeps data- attributes; unrecognized brace groups stay visible text). Default width rule needs no code: intrinsic CSS-px capped at the content box, never upscaled — figure img max-width owns it. 
Auto-landscape promotes a block to `@page wide { size: <pagesize> landscape }` only when aspect >= 1.8 AND intrinsic width > 2.5x the content box (~1600px on letter) AND diagram provenance (rendered fences) or a whole-word alt token (diagram|architecture|flowchart|chart|graph) for plain images. {page=...} forces or vetoes; fence info strings accept page=... too. preferCSSPageSize is passed to Chromium only when a promotion exists, so every other document prints exactly as before. False negatives are cheap; false positives feel broken (eng-review P4, Codex challenge accepted). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * test(make-pdf): width-policy unit suite + landscape e2e gate with negative fixtures 24 unit tests weighted toward the false-positive guards: wide screenshot without an alt hint stays portrait, sub-threshold and tall images stay portrait, deterministic 1560/1561px boundary, whole-word alt matching ('photographic' must not match 'graph'), page=portrait veto beats every heuristic, diagnostic blocks never promote. E2E gate asserts pdfinfo per-page boxes through the compiled binary: exactly 3 of 5 fixture blocks get landscape pages (alt-hinted image, directive-forced image, wide sequence diagram) while the unhinted screenshot and the veto'd diagram stay portrait — plus the --toc combo proving TOC and named-page landscape coexist. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat(make-pdf): --to html|docx output formats --to html writes the assembled self-contained document directly (no print round-trip): inline vector diagrams, data-URI images, zero network references, plus an @media screen layer for browser reading. 
--to docx is the content-fidelity export (eng-review P8): html-to-docx@1.8.0 (exact pin; pure JS, bun-compile-verified) maps headings/tables/code/lists; diagrams and SVG images rasterize at 300dpi of the content-box width via the render tab; diagnostic figures convert to plain p/pre so the converter can't silently drop an error. --format keeps its page-size-alias meaning; --to is the output format, and the CLI says so when confused. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * test(make-pdf): format gate — html no-network-refs + docx zip content checks HTML: zero src/href network refs, no script/link tags, inline SVG diagrams, data-URI images, screen layer, diagnostic survives. DOCX: valid OOXML zip (document.xml + Content_Types), >=2 PNG media (diagram raster + fixture image), headings + render=false source + diagnostic text in document.xml, no leaked mermaid source from rendered fences. Plus --to validation UX. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat(diagram): /diagram skill — English in, editable diagram triplet out New skill: agent authors mermaid from the user's description and renders the triplet through the offline diagram-render bundle in the browse daemon — .mmd source (the single source of truth), editable .excalidraw (opens at excalidraw.com, round-trips back through re-render), and SVG + PNG. Flowcharts convert to fully editable scenes; other mermaid types render with an explicit upstream-converter limitation note. Never ships an unrendered source file; offline is the contract (no CDN fallback). Inventory rows in AGENTS.md + docs/skills.md; generated SKILL.md + llms.txt via gen:skill-docs. 
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * test(diagram): paid E2E pair — gate triplet contract + periodic authoring judge diagram-triplet (gate, deterministic functional): a fresh claude -p agent following the skill extract must emit a parseable triplet — graph LR/TD in .mmd, excalidraw scene with >3 elements, SVG markup, PNG magic bytes. Verified live: pass, $0.17, 58s. diagram-authoring-quality (periodic, LLM-judged): faithfulness/labels/size rubric with a diagnostic-path cap, floor 6/10. Verified live: pass at exactly 6 with substantive critique. Touchfiles select both on diagram/ and lib/diagram-render/ changes; tier split per E2E_TIERS rules (eng-review D5). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * test(diagram): register /diagram in the skill coverage matrix Gate: triplet contract + structural floor; periodic: authoring-quality judge. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat(make-pdf): typography scale-up, zero image truncation, landscape vertical centering Dogfooding round on the repo README surfaced four output-quality bugs: - Type was too small everywhere: body 11→12pt, h1 22→26pt, h2 15→18pt, cover title 32→56pt with poster spacing, cover meta 10→13pt, TOC 11→12pt with tighter leading, code 9.5→10.5pt, tables 10→11pt. - Zero image truncation, ever: the max-width cap was figure-scoped, but markdown images render as <p><img> — a 1850px GitHub screenshot ran off the page edge. Global img { max-width: 100%; height: auto; } cap. - hyphens: auto put real 'dif-\nferent' breaks into the PDF text layer the moment 12pt made lines wrap (combined-gate caught it). Clean copy-paste is the product contract; left-aligned rag doesn't need hyphenation → hyphens: manual. - Promoted landscape blocks now vertically center. 
CSS flex/min-height centering fragments into phantom empty landscape pages in Chromium (bisected: min-height at ANY value; 3 promotions printed 5 pages), so image-policy computes an inline margin-top from each block's known aspect ratio against the landscape content box instead — fragmentation handles margins fine. .page-wide also drops its explicit break-before/ after (the page-name change already breaks on both sides). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * test(make-pdf): pin zero-truncation invariant, typography floor, centering math Global img cap pinned as a regex invariant (the figure-scoped-cap regression class); typography floor (12pt body, 56pt cover, 12pt TOC); .page-wide must NOT carry min-height/flex (the phantom-landscape-page regression class); centering margin math verified both ways (2400×1000 image → 1.38in, 2050×600 viewBox diagram → 1.93in, page-filling directive block → no margin). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * docs: diagram + multi-format documentation across README, make-pdf skill, and how-to guide README gains /make-pdf (Publisher) and /diagram (Diagram Maker) rows in the sprint table. make-pdf's skill doc — the agent-facing contract — gains Core patterns for mermaid/excalidraw fences (title/render=false/page= options), the image policy ({width=}/{page=} directives, zero-truncation, conservative auto-landscape), --to html|docx, and --strict, plus the --to vs --format disambiguation in Common flags. New docs/howto-diagrams-and-formats.md is the user-facing walkthrough: fences, directives, formats, /diagram triplet, the mermaid racetrack trick, troubleshooting. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * test(make-pdf): fill ship-audit coverage gaps — downscale, reset contract, excalidraw fence, WebP Ship coverage audit found 9 gaps (85%); this fills the 2 HIGH + 3 MEDIUM and most LOW. 
diagram-gate fixture gains a 4200px incompressible photo (the only live coverage of __downscaleRaster AND the 64KB chunked jsViaBuffer eval transport — asserted via the downscale stderr warning), an ```excalidraw scene fence rendered through exportToSvg (vector labels + caption in pdftotext, no leaked scene JSON), and the broken fence MOVED BETWEEN the two mermaid fences so the second diagram rendering proves the D6.2 reset contract end-to-end. New coverage-gaps.test.ts (16 tests): mock-tab reset contract (exactly one reload, post-failure fence renders), excalidraw fail-fast diagnostic without a bundle call, rasterize error fallbacks (figure/tag kept, never silent), WebP VP8/VP8L/VP8X byte parsers, landscapeContentBox a4/asymmetric margins, bare-token slot fallback, resolveBundlePath env override + error shape, screenCss media scoping. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(make-pdf): pre-landing review wave — fence fidelity, injection hardening, Windows paths, transport rework Review army (6 specialists + red team) findings, all fixed: - Indented fences replay byte-for-byte and indented diagram fences are NOT extracted (red-team conf-9: the pre-pass reconstructed fences at column 0, splitting any list containing fenced code — every ordinary document). - String.replace $-pattern injection killed at every seam: substituteSlots, mergeStyle, img/src rewrites all use function replacements (a diagram label containing $' duplicated the document tail). - Big-expression transport reworked: browse `eval <file>` (one spawn, any size, Windows-safe) replaces the 64KB chunked window-buffer eval — fixes the per-chunk spawn cost, the char-vs-byte argv units, AND the Windows 32,767-char command-line ceiling in one move. - Staged-bundle trust: content verified by hash even when the file exists, and the rename-failure path re-hashes the survivor (sticky-bit /tmp EPERM would otherwise ride a pre-planted file past the check). 
- Windows drive-letter img srcs (C:/x.png) reach the local-path branch instead of being swallowed as unknown URL schemes. - DOCX rasterize-failure now embeds the decoded source as visible text — returning the figure made diagrams vanish silently (converter drops svg). - Fence source preserved as base64 data-gstack-source attribute (the comment encoding corrupted every '-->' arrow); decodeFigureSource() round-trips. - inlineLocalImages memoizes per path; file:// uses fileURLToPath; preview prints a divergence note for fences/local images; --to docx strips the watermark div and warns about print-only flags; TOC links resolve in html/docx (heading ids assigned); waitForExpression sleeps instead of busy-spinning; escapeHtml/svg-dims deduped to single definitions; typography stragglers (blockquote 12pt, footnotes 10pt, 42em screen measure); bundle BUILD_INFO gains srcSha256 for no-node_modules drift detection; MAX_TARGET_PX shared guard. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * ci: make-pdf gate covers the diagram-render bundle; bundle pinned to LF make-pdf-gate.yml paths gain lib/diagram-render/** and the drift test (a bundle-only PR previously skipped every render gate AND no CI lane ran the drift check at all). .gitattributes pins dist html/json to LF so Windows autocrlf can't break the hash-pinned bundle. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * test(make-pdf)+feat(diagram): review-wave test pins + skill transport hardening Tests: indented-fence byte-for-byte replay + no-extraction-in-lists, drive-letter local-path routing, $-pattern slot immunity, base64 source round-trip ('A --> B' exact), existing-style merge preservation, DOCX rasterize-failure surfaces source, srcSha256 + font-stack drift guards, landscape veto asserted as some-portrait/no-landscape (layout-order-proof), judge rubric cap lowered to 5 so it actually fails, vacuous error-shape test removed honestly, tmpdir cleanup. 
/diagram skill: base64 transport (template literals corrupted backticks/${ in sources), content-addressed staging with hash verification, and --tab-id pinned on every browse call so a concurrent /qa session can't be clobbered. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat(make-pdf): out-of-tree image reads warn; --strict makes them fatal (D8.1) Local CLI semantics stay (absolute paths and ../ still inline, like pandoc), but never silently: an agent PDF-ing untrusted markdown can't quietly embed a file from outside the input directory into a shareable document without a visible warning, and --strict pipelines hard-fail. Two unit tests. Also: TODOS.md gains the deferred e2e-harness dedup entry (D8.2). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix: pre-existing test failure in skill-e2e-bws operational-learning Root cause was the fixture, not model behavior: gstack-learnings-log gained an import of lib/jsonl-store.ts in the v1.57.5.0 injection-sanitization wave, but the test copies only bin/ scripts into its sandbox — the inline bun import failed and the script exited 1 before writing, on every run, on main too (reproduced at `a5833c41`). Fixture now stages lib/jsonl-store.ts beside bin/; verified deterministically (script exits 0, learning written) and via the paid test (1 pass). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(make-pdf): adversarial-review wave — offline posture enforced, symlink-aware confinement, bounded reads Codex adversarial + structured review findings: - Remote images are now BLOCKED with a visible placeholder instead of warn-and-keep — leaving the tag meant Chromium fetched the URL at print time anyway, so the offline posture was a lie (tracking pixels and internal-URL probes ran without --allow-network). - The out-of-tree read check compares REAL paths: a symlink inside the input dir pointing at ~/.ssh/... passed the string-prefix check, including under --strict. 
Ordered after the existence check (realpath of a missing file false-positives on macOS /var → /private/var). - Image reads are bounded BEFORE reading: statSync first, non-regular files (fifo/device/dir) and >64MB files degrade to placeholders instead of hanging or exhausting memory; malformed percent-encoding (foo%zz.png) degrades to missing-image instead of crashing decodeURIComponent. - browse shell-outs get a 120s timeout — a wedged daemon or hostile mermaid source fails the run instead of hanging it. - TOC entries link to the heading's ACTUAL id (pre-id'd raw-HTML headings previously got dead #toc-N links); per-side margins compose into the CSS @page shorthand so a landscape promotion flipping preferCSSPageSize no longer silently reverts --margin-left/right to defaults (Codex P2). - The image memo is a typed object — literal NUL-byte separators had made diagram-prepass.ts register as binary to text tooling. Codex structured review GATE: PASS (no P1). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * chore: bump version and changelog (v1.58.0.0) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * docs: sync make-pdf image-policy docs with final shipped behavior (v1.58.0.0) The docs wave (`87594420`) predated the final review-wave commits, so two docs drifted from shipped behavior: - make-pdf/SKILL.md.tmpl + generated SKILL.md: remote images are BLOCKED with a visible placeholder (not warned-and-kept); out-of-tree reads (including via symlink) warn and --strict makes them fatal; --strict also covers oversized (>64MB) and non-regular files; troubleshooting entry now names the actual "[remote image blocked]" symptom. - docs/howto-diagrams-and-formats.md: same corrections in the image section, CI section, and troubleshooting. - README.md: docs/howto-diagrams-and-formats.md added to the Docs table (was unreachable from any entry-point doc). 
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * docs: apply Codex doc-review findings for v1.58.0.0 Cross-model doc review (Codex, read-only) checked the v1.58.0.0 docs against the shipped code. Fixes: - howto + make-pdf SKILL: diagram source is preserved base64 in a data-gstack-source attribute, not an HTML comment (-- in mermaid arrows would corrupt a comment); fences must start at column 0; fence options example gains page=portrait; --to html "zero network refs" qualified (--allow-network deliberately keeps remote tags). - /diagram description, README + docs/skills.md rows: the hand-drawn aesthetic belongs to the .excalidraw artifact; rendered SVG/PNG use mermaid's clean neutral theme (lib/diagram-render entry.ts pins theme: "neutral"). - CHANGELOG v1.58.0.0 wording: --strict coverage lists all five fatal classes (missing/remote/out-of-tree/oversized/non-regular); fences are vector SVG in pdf+html, 300dpi PNG in docx; hand-drawn claim scoped to the .excalidraw file. - lib/diagram-render/README: Page API table gains __downscaleRaster. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> --------- Co-authored-by: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 15:38:53 -07:00
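The pre-landing review wave in the v1.58.0.0 entry above mentions that a diagram label containing `$'` duplicated the document tail, and that the fix was to use function replacements. The sketch below reproduces that class of bug with plain `String.prototype.replace`; `substituteSlot` and the `[[SLOT_0]]` token format are hypothetical stand-ins, not the real `substituteSlots` implementation.

```typescript
// Hypothetical helper illustrating the function-replacement fix.
function substituteSlot(doc: string, token: string, html: string): string {
  // A replacement FUNCTION's return value is inserted verbatim, so "$'",
  // "$&" and similar sequences inside `html` are not treated as patterns.
  return doc.replace(token, () => html);
}

const doc = "before [[SLOT_0]] after";
const label = "<figure>cost: $' and $&</figure>";

// String replacement: "$'" expands to the text after the match and
// "$&" expands to the matched token, corrupting the output.
const unsafe = doc.replace("[[SLOT_0]]", label);

// Function replacement: the label lands in the document untouched.
const safe = substituteSlot(doc, "[[SLOT_0]]", label);
```

In this sketch `unsafe` ends up containing a second copy of `" after"` and the raw slot token, while `safe` contains the label exactly as written.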
Garry Tan	421460f03a	v1.57.8.0 feat: browse js/eval --out render-to-file (canonical Chromium for offline rendering) (#1929) * feat(browse): js/eval --out render-to-file with write-capability gate Add --out <file> / --raw to js and eval so an evaluate result is written straight to disk (base64 data URLs auto-decoded to bytes, charset-validated before decode, parent dirs created) instead of serialized back through the CLI. --out is modeled as a per-invocation WRITE: it requires write scope, is never dispatchable over the pair-agent tunnel (canDispatchOverTunnel now consults args), and counts as a mutation for watch-mode and tab-ownership. Shared parseOutArgs/hasOutArg/resultToString helpers keep the handler and the gate in sync. Tests cover the parser, render-to-file paths, and tunnel guards. * docs(browse): offline render mode + canonical-Chromium guidance Document the blessed offline-render path (headless, no proxy/Xvfb): visual output via screenshot --selector, bytes a function returns via js --out. Add the puppeteer->browse cheatsheet row, a "don't bundle your own Chromium" note (browse skill + CONTRIBUTING), and the --out/--raw command descriptions. Regenerate browse/SKILL.md, SKILL.md, and gstack/llms.txt from the templates. * chore: bump version and changelog (v1.59.1.0) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs: document js/eval --out render-to-file in BROWSER.md reference (v1.59.1.0) The js and eval reference rows in BROWSER.md drifted: every other reference surface (SKILL.md, gstack/llms.txt, browse/SKILL.md) already shows the new [--out <file>] [--raw] flags from v1.59.1.0, but the complete browser reference still showed the pre-feature signatures. Add the flags plus the WRITE-capability / no-tunnel note so the reference matches what shipped. 
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: re-version 1.59.1.0 -> 1.57.8.0 (natural PATCH from 1.57.7.0) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-09 21:02:30 -07:00
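The v1.57.8.0 entry above says `--out` auto-decodes base64 data URLs to bytes and validates the charset before decoding. The following is one way such a step could look, written without runtime-specific APIs so it stays self-contained; `planOutput`, `decodeBase64`, and the `Planned` type are hypothetical names, and how the real handler treats `--raw` or non-base64 data URLs is not covered here.

```typescript
// Hypothetical sketch: decide whether an evaluate result is written as bytes or text.
type Planned =
  | { kind: "bytes"; bytes: Uint8Array }
  | { kind: "text"; text: string };

const B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

function decodeBase64(payload: string): Uint8Array {
  // Validate the charset and padding BEFORE decoding anything.
  if (!/^[A-Za-z0-9+/]*={0,2}$/.test(payload) || payload.length % 4 !== 0) {
    throw new Error("invalid base64 payload");
  }
  const clean = payload.replace(/=+$/, "");
  const out = new Uint8Array(Math.floor((clean.length * 3) / 4));
  let acc = 0;
  let bits = 0;
  let i = 0;
  for (let k = 0; k < clean.length; k++) {
    acc = (acc << 6) | B64.indexOf(clean.charAt(k));
    bits += 6;
    if (bits >= 8) {
      bits -= 8;
      out[i++] = (acc >> bits) & 0xff;
      acc &= (1 << bits) - 1; // keep only the bits not yet emitted
    }
  }
  return out;
}

function planOutput(result: string): Planned {
  const match = /^data:[^,]*;base64,(.*)$/.exec(result);
  if (match) return { kind: "bytes", bytes: decodeBase64(match[1]) };
  return { kind: "text", text: result };
}
```

Validating first means a malformed payload fails loudly instead of producing a silently truncated file.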
Garry Tan	19770ea8b4	v1.51.0.0 feat: $B memory diagnostic + 4 CDP-resource leak fixes (#1751) * add withCdpSession + getOrCreateCdpSession helpers Two CDP-session lifecycle helpers in cdp-bridge.ts: - withCdpSession(page, fn): ephemeral session with try/finally detach. For one-shot CDP work (archive snapshots, $B memory, single Page.captureScreenshot) where the caller doesn't need session reuse. - getOrCreateCdpSession(page, cache): cached long-lived session that registers a page.once('close') hook to BOTH delete the cache entry AND call session.detach(). Pre-helper code only deleted the cache entry, leaving the Chromium-side CDP target attached until the underlying transport dropped. Pure addition. Existing callers untouched in this commit; they migrate in the next commit alongside the static-grep test that pins the invariant. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * migrate 3 CDP-session sites to lifecycle helpers Fixes the CDP-target leak class identified by /codex outside-voice on the eng review (D11 EXPAND_SCOPE). All three sites called `page.context().newCDPSession(page)` directly and either forgot the detach entirely (cdp-bridge cache cleanup), only detached on the success path (write-commands archive), or detached on framenavigated but not page-close (cdp-inspector). - cdp-bridge.ts: `getCdpSession` now delegates to `getOrCreateCdpSession`, which registers a `page.once('close')` hook that BOTH removes the cache entry AND calls `session.detach()`. - cdp-inspector.ts: same migration for the inspector's session pool. Keeps the existing framenavigated detach (more granular than close for DOM/CSS state invalidation) plus an inspector-layer close hook for the initializedPages WeakSet. - write-commands.ts archive: wraps Page.captureSnapshot in withCdpSession so the detach runs in `finally`, including the path where captureSnapshot throws. 
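The ephemeral-session pattern described above (detach in `finally`, including when the callback throws) can be sketched independently of any browser library. The `Session` and `SessionSource` interfaces and the `withSession` name below are hypothetical simplifications; the real `withCdpSession(page, fn)` takes a page object whose API is not reproduced here.

```typescript
// Hypothetical minimal interfaces standing in for a CDP session and its factory.
interface Session {
  detach(): Promise<void>;
}
interface SessionSource {
  newSession(): Promise<Session>;
}

async function withSession<T>(
  source: SessionSource,
  fn: (session: Session) => Promise<T>,
): Promise<T> {
  const session = await source.newSession();
  try {
    return await fn(session);
  } finally {
    // Runs on success AND on throw. Detach errors are swallowed so they
    // cannot mask an error raised by fn.
    await session.detach().catch(() => undefined);
  }
}
```

Putting the detach in `finally` removes the "detached only on the success path" failure mode the commit message calls out, since the caller never owns the session lifetime.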
The static-grep tripwire (next commit) pins the invariant so future direct calls to newCDPSession fail CI. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * add CDP-session cleanup tripwire + helper unit tests browse/test/cdp-session-cleanup.test.ts pins the invariant that no source file outside cdp-bridge.ts may call newCDPSession() directly. If a future refactor reintroduces the direct call, CI fails with a file:line list and a pointer to the right helper to use instead (withCdpSession for one-shot, getOrCreateCdpSession for cached). Also covers the helpers themselves with fake-Page unit tests: - withCdpSession detaches on success - withCdpSession detaches on throw (the actual leak fix) - withCdpSession swallows detach errors so they don't mask fn errors - getOrCreateCdpSession caches the session across calls - close hook detaches AND clears the cache Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * extract createSseEndpoint helper with cleanup contract browse/src/sse-helpers.ts owns the SSE cleanup invariant: cleanup runs on abort, enqueue failure, AND heartbeat failure, exactly once, regardless of which edge fires first. Pre-helper, /activity/stream and /inspector/events ran cleanup only on the req.signal.abort edge. If the underlying TCP died without firing abort (Chromium MV3 service-worker suspend, intermediate proxy half-close), the subscriber closure stayed in the Set capturing the ReadableStreamDefaultController plus any payloads queued behind it. Over a multi-day sidebar session this compounded into multi-MB of retained controllers per dead connection. Caller surface: initialReplay (optional, for gap replay or state snapshots), subscribe (live-event source), liveEventName (SSE event name for live wrap), heartbeatMs. send() helper handles JSON encoding with sanitizeReplacer + lone-surrogate stripping. Unit tests pin all three cleanup edges + idempotency + replay ordering + surrogate sanitization. 
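The SSE cleanup contract described above (cleanup runs exactly once, whichever edge fires first) comes down to an idempotent cleanup closure that a failed send can also trigger. The sketch below models only that part with an in-memory subscriber set; `addSubscriber`, `broadcast`, and the `onCleanup` callback are hypothetical and do not reflect the real `createSseEndpoint` signature, heartbeats, or SSE framing.

```typescript
// Hypothetical in-memory model of the "cleanup exactly once" contract.
type Send = (payload: string) => void;

const subscribers = new Set<Send>();

function addSubscriber(send: Send, onCleanup: () => void): () => void {
  let cleaned = false;
  const cleanup = (): void => {
    if (cleaned) return; // idempotent: later edges are no-ops
    cleaned = true;
    subscribers.delete(guarded);
    onCleanup();
  };
  // A failed send (dead consumer) triggers the same cleanup an abort would.
  const guarded: Send = (payload) => {
    try {
      send(payload);
    } catch {
      cleanup();
    }
  };
  subscribers.add(guarded);
  return cleanup; // the abort edge calls this directly
}

function broadcast(payload: string): void {
  subscribers.forEach((subscriber) => subscriber(payload));
}
```

With this shape a consumer whose connection died without an abort signal is dropped on the next failed write rather than staying in the set.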
Endpoint refactors land in the next commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * route /activity/stream + /inspector/events through createSseEndpoint Both endpoints collapse from ~45 lines of in-line ReadableStream wiring to ~8 lines of helper config. Behavior preserved bit-for-bit by the new sse-helpers tests: - initial replay (activity gap + history, inspector state snapshot) - live event subscription - 15s heartbeat - SSE framing - sanitizeReplacer applied to every JSON.stringify The leak fix is the cleanup contract: pre-refactor, both endpoints ran cleanup only on req.signal.abort. If TCP died without firing abort (Chromium MV3 SW suspend, intermediate proxy half-close), the subscriber closure stayed in the Set forever capturing the ReadableStreamDefaultController + queued payloads. Post-refactor, an enqueue-failure or heartbeat-failure on a dead consumer triggers the same idempotent cleanup as abort would. Net: -83 / +15 in server.ts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * cap inspector modificationHistory at 200 entries Pre-cap, modificationHistory was an unbounded module-scoped array that grew for every CSS edit through $B css across the entire session. Small per-entry footprint but no upper bound, the kind of slow leak that compounds over multi-day inspector use. Cap is 200, oldest evicted on push past the cap. modHistoryTotalPushed stays monotonic across the session so undoModification can tell the user when their target index has been evicted, instead of just the opaque pre-cap "No modification at index 500" with no context. __testInternals export lets the cap + eviction error be unit-tested without spinning up a CDP-driven Page. Production code must continue to go through modifyStyle / undoModification / resetModifications. 
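The cap-plus-monotonic-counter idea can be sketched like this. It assumes entries are addressed by a session-wide index, which is a guess about how undoModification resolves its target; `CappedHistory` and its method names are hypothetical.

```typescript
type Lookup<T> = { status: "ok"; entry: T } | { status: "evicted" | "missing" };

class CappedHistory<T> {
  private readonly cap: number;
  private entries: T[] = [];
  private totalPushed = 0; // monotonic, never reduced by eviction

  constructor(cap: number) {
    this.cap = cap;
  }

  // Append an entry, evict the oldest past the cap, return its session-wide index.
  push(entry: T): number {
    this.entries.push(entry);
    if (this.entries.length > this.cap) this.entries.shift();
    return this.totalPushed++;
  }

  // Distinguish "evicted by the cap" from "never existed".
  get(index: number): Lookup<T> {
    const firstLive = this.totalPushed - this.entries.length;
    if (index < 0 || index >= this.totalPushed) return { status: "missing" };
    if (index < firstLive) return { status: "evicted" };
    return { status: "ok", entry: this.entries[index - firstLive] as T };
  }
}
```

Because the counter keeps growing while the array stays bounded, an error message can say that an index was evicted rather than reporting that nothing exists at that index.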
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * add BrowserManager.getMemorySnapshot() + shared types Diagnostic foundation for $B memory and the /memory endpoint that land in the next two commits. Collects: - Bun process memory via process.memoryUsage (cross-platform, accurate). - Per-tab JS heap via CDP Performance.getMetrics, lazy per tracked page, swallows target-died errors so a dying tab doesn't poison the snapshot for the rest. - Chromium process tree via SystemInfo.getProcessInfo (PID + type + CPU time). RSS is NOT exposed via CDP — the eng review (D2 USE_CDP) picked CDP over shelling to `ps`, so notes[] tells the caller why the RSS column is absent and points at the follow-up TODO. cdp-inspector exports getModificationHistoryStats so the snapshot can surface buffer occupancy + cap + evicted count without reaching into module-private state. memory-snapshot.ts holds the shared types so server.ts and read-commands can import without circular dep on browser-manager. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * add \$B memory command Registers 'memory' in META_COMMANDS, wires the meta-command dispatch to a lazy-imported handler in memory-command.ts. Lazy because the import graph (cdp-bridge + memory-snapshot + buffer accessors) isn't useful to projects that never run the diagnostic. The handler assembles MemoryStructureStats from the modules that own each buffer (cdp-inspector mod history stats, activity subscriber count, console/network/dialog buffer lengths, captureBuffer bytes, inspectorSubscriber count via a new server.ts export) and calls BrowserManager.getMemorySnapshot. Output is text by default, JSON with --json so the sidebar footer and test harness can consume it programmatically. buildMemorySnapshotJson is the entry the /memory endpoint will call in the next commit. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * add /memory endpoint (SSE-session-cookie gated) GET /memory returns the BrowserManager memory snapshot as JSON. Auth matches /activity/stream and /inspector/events: Bearer header OR view-only SSE-session cookie (the extension fetches the cookie once via POST /sse-session, then polls /memory with withCredentials: true). Deliberately NOT extending /health for the sidebar footer poll — TODOS.md "Audit /health token distribution" records that /health already surfaces AUTH_TOKEN to any localhost caller in headed mode. A separate endpoint with the standard SSE auth keeps the future /health fix from cascading into the sidebar. sanitizeReplacer is applied at egress because tab.url and tab.title come from page content — lone-surrogate bytes from broken emoji could otherwise reach the sidebar and (when forwarded to Claude API) trigger HTTP 400. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * add sidebar footer RSS readout (polls /memory every 30s) Footer now shows "<bun-rss> · <tab-count>" sourced from the /memory endpoint, polled every 30s. Color thresholds: orange warn at 2 GB Bun RSS or 50 tabs; red bad at 8 GB or 200 tabs (matches the tab-guardrail threshold landing in a later commit). The footer gives the user an early signal that the cliff is forming, instead of only learning when the OS OOM-kills the process. Backoff per Codex's flag: if a poll takes > 2s response time the sidebar drops to a 5-minute cadence until the next successful fast poll. The diagnostic shouldn't add load to a browser that's already unhealthy. Start/stop is wired to the existing setServerInfo() hook so the timer only runs while the sidebar is connected to a server. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * stop materializing response bodies in requestfinished listener The Bun-side accelerant on the gbrowser-OOM investigation. 
Pre-fix, the per-page requestfinished listener called `await res.body()` just to read .length — Playwright fetches the bytes from Chromium across CDP into a Bun Buffer, only for the listener to discard the buffer after a single length read. On a long-lived headed browser with media-heavy pages this is multi-GB/hour of Buffer allocation churn. Bun GCs it, but the cross-process CDP traffic + transient allocation pressure feeds the OOM trajectory. The fix: req.sizes() pulls from the Network.loadingFinished event Chromium already emits. No body materialization. Accurate for chunked transfer, gzip-compressed responses, and streaming media — the cases where a naive Content-Length header read (the original review's proposal) would have missed the size entirely (Codex flag on the eng review, D10 USE_CDP_EVENT_BATCHED). The D10 stretch goal — replacing N per-page listeners with a single context-level CDP listener via Target.setAutoAttach — is deferred and tracked in TODOS. The listener architecture change is significantly more plumbing than the leak fix and not on the critical path for stopping the body materialization. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * tab guardrail (50/200 thresholds) + sidebar action toast Server side (browser-manager.ts): Idempotent threshold tracker fires an activity entry exactly once at each upward crossing of 50 (soft warn) and 200 (hard warn). Re-arms when the count drops below. Activity-feed surface gives the audit-trail invariant even with the sidebar closed; the toast UX lives in the sidebar. Sidebar side (extension/sidepanel.{html,css,js}): Every /memory poll evaluates two trigger conditions: - Any single tab > 4 GB JS heap (catches the WebGL/video runaway case Codex flagged on the eng review). - Tab count >= 200. Toast shows top 5 tabs ranked by max(jsHeap, nodes*1KB + listeners*200) so a WebGL-heavy tab with small JS heap still surfaces.
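A rough sketch of the toast ranking, reading the score as the larger of the JS heap and an estimate of nodes × 1 KB plus listeners × 200 bytes. The per-node and per-listener units are an assumption about the formula, and `TabStats`, `tabRamScoreSketch`, and `topTabs` are hypothetical names rather than the sidepanel.js implementation.

```typescript
interface TabStats {
  id: number;
  jsHeapBytes: number;
  nodes: number;
  listeners: number;
}

const BYTES_PER_NODE = 1024; // assumed reading of "1KB" per DOM node
const BYTES_PER_LISTENER = 200; // assumed reading of "200" per event listener

// Score a tab by whichever is larger: reported JS heap or the DOM-based estimate.
function tabRamScoreSketch(tab: TabStats): number {
  const domEstimate = tab.nodes * BYTES_PER_NODE + tab.listeners * BYTES_PER_LISTENER;
  return Math.max(tab.jsHeapBytes, domEstimate);
}

// Highest-scoring tabs first, without mutating the caller's array.
function topTabs(tabs: TabStats[], limit = 5): TabStats[] {
  return [...tabs]
    .sort((a, b) => tabRamScoreSketch(b) - tabRamScoreSketch(a))
    .slice(0, limit);
}
```

Taking the maximum rather than the sum means a tab with a small JS heap but a very large DOM still ranks above a tab with a moderate heap and little else.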
Default-selected checkboxes + "Close selected" run `$B closetab <id>` through the existing /command path — no chrome.tabs.remove bridge needed. "Snooze" bumps tabsAbove/heapAbove thresholds in chrome.storage.session so the toast stays hidden until the user accumulates more tabs OR one tab grows another 2 GB. Tests: browse/test/tab-guardrail.test.ts pins the server-side fires-once + re-arms invariants without spinning up Chromium. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * add memory-leak reproducer (gate tier) browse/test/memory-leak-reproducer.test.ts pins the invariant from the D10 fix: wirePageEvents.requestfinished must call req.sizes() but must NEVER call res.body(). Fakes a page emitting a burst of 200 requestfinished events, each with a notional 1 MB response — pre-fix this would allocate 200 MB of Buffer per burst, post-fix not one byte of body content is materialized. The test also asserts networkBuffer entries are still populated with the right size, so size reporting in the network panel doesn't regress. A real-Chromium peak-RSS reproducer (periodic tier) is deferred — see TODOS "Reproducer with WebGL / video / MSE buffer pressure". This gate-tier test is sufficient to catch the leak class being reintroduced by any future refactor of the requestfinished listener. Wall clock: ~400ms.
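The fires-once and re-arm invariants that tab-guardrail.test.ts pins for the server-side tracker can be illustrated with a small sketch. It assumes a crossing means the count reaching the limit (>=) and re-arming means any observation below it; `TabThresholdTracker` and `observe` are hypothetical names, not the browser-manager.ts code.

```typescript
type Level = "soft" | "hard";

class TabThresholdTracker {
  private readonly thresholds: ReadonlyArray<{ level: Level; limit: number }> = [
    { level: "soft", limit: 50 },
    { level: "hard", limit: 200 },
  ];
  private readonly armed = new Map<Level, boolean>([
    ["soft", true],
    ["hard", true],
  ]);

  // Returns the levels that fire for this observation (usually none).
  observe(tabCount: number): Level[] {
    const fired: Level[] = [];
    for (const { level, limit } of this.thresholds) {
      if (tabCount >= limit) {
        if (this.armed.get(level)) {
          fired.push(level);
          this.armed.set(level, false); // stay quiet until re-armed
        }
      } else {
        this.armed.set(level, true); // dropped below the limit: re-arm
      }
    }
    return fired;
  }
}
```

A caller would append one activity entry per returned level, so repeated observations above a threshold produce no further entries until the count has dipped below it.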
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * TODOS: 4 follow-ups from gbrowser-OOM PR Captures the items deliberately deferred from the v1.49 leak-fix PR so the deferrals don't fall off the radar: - P2: MV3 extension service-worker memory profile (Codex finding #4) - P2: Native + GPU memory breakdown in \$B memory (Codex finding #5) - P3: Single-context CDP listener for Network.loadingFinished (D10 stretch goal) - P3: Real-Chromium peak-RSS reproducer for periodic tier (Codex finding on transient amplification + ANGLE_B_NUMBERS CHANGELOG framing dependency) Each entry follows the standard TODOS.md format: What / Why / Pros / Cons / Context / Priority / Effort. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * regen SKILL.md after adding \$B memory command The C8 commit added 'memory' to META_COMMANDS + COMMAND_DESCRIPTIONS but didn't regenerate the SKILL.md files. The category was 'Diagnostics' which isn't in scripts/resolvers/browse.ts:categoryOrder; switched to 'Server' (matches the existing 'status' / 'restart' / 'handoff' pattern) so the table renders under the existing ### Server section. Test fix: gen-skill-docs.test.ts asserts every command appears in the generated SKILL.md and gstack/llms.txt; without this regen the test fails with "Expected to contain: 'memory'". 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * add coverage for \$B memory diagnostic surface 17 tests across the formatter + byte renderer + JSON entry point: - formatBytes() 4-tier (bytes, KB, MB, GB) + 160 GB sanity case (the friend's OOM number from the original screenshot, so the renderer doesn't blow up at real leak scale) - handleMemoryCommand --json mode parseable shape - handleMemoryCommand text mode: Bun server line, no-tabs branch, top-10 sort with "...and N more" tail, Chromium process grouping by type, "unavailable" line when processes is null, modification- history evicted-count format, notes section rendering, long-URL ellipsis truncation - buildMemorySnapshotJson returns shape matching the type The formatSnapshotText renderer is private to memory-command.ts; tests exercise it through handleMemoryCommand's text-mode return path. The eviction-count format is pinned via a parallel format contract assertion since the renderer reads live module state. Coverage gate: brings the diagnostic surface from 0% to ~80%. Extension UI (sidepanel.js footer + toast) remains uncovered — adding tests there would require extracting fmtBytesShort and tabRamScore from sidepanel.js into a testable TS module, which is deferred to a follow-up to keep this PR scoped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: bump version and changelog (v1.51.0.0) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: update project documentation for v1.51.0.0 Add $B memory command to BROWSER.md server lifecycle table. Document the new createSseEndpoint helper + CDP session lifecycle helpers (withCdpSession, getOrCreateCdpSession) in CLAUDE.md alongside the existing server hardening notes, with the static-grep tripwire callout so future contributors route through the helpers. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(test): pin SSE sanitizer wiring to the v1.51 createSseEndpoint helper The two `wiring invariants` tests grepped server.ts for `JSON.stringify(entry, sanitizeReplacer)` and `JSON.stringify(event, sanitizeReplacer)` — patterns that lived inline in /activity/stream and /inspector/events before the v1.51 refactor moved both endpoints behind createSseEndpoint. Sanitization still happens (the helper applies it inside its send() and live-event callback), but the static-grep was pinned to the old wiring and started failing on Windows free-tests after the refactor landed. Updated to check the new contract: - /activity/stream + /inspector/events route through createSseEndpoint (regex match of the route handler block ending in the helper call). - sse-helpers.ts contains JSON.stringify + sanitizeReplacer + imports stripLoneSurrogates from ./sanitize (catches drift to a private copy). - server.ts retains its own sanitizeReplacer for non-SSE egress paths (handleCommandInternal); the two replacers coexist by design. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 16:09:38 -07:00
Garry Tan	f8bb59094d	v1.47.0.0 feat: /spec — author backlog-ready spec in 5 phases + optional agent spawn (#1698 ) (#1733 ) * feat(issue): add /issue skill for backlog-ready GitHub issue authoring Interrogates an ambiguous request through five strict phases (why, scope, technical, draft, final) and produces a GitHub issue precise enough that an unfamiliar engineer or AI agent can execute it without follow-up. Slots in after /office-hours (when the idea has passed the "worth building" bar) and before /plan-eng-review (which assumes a plan already exists). - issue/SKILL.md.tmpl + generated SKILL.md - routing entry in root SKILL.md.tmpl - llms.txt regenerated to include the new skill * chore(spec): rename /issue → /spec + fix duplicate analytics block Foundation commit for the /spec skill (extends PR #1698 by @jayzalowitz). - Renames issue/ → spec/ (template + generated) - Removes the hand-rolled analytics block in spec/SKILL.md.tmpl (lines 46-49 of the original); {{PREAMBLE}} already emits the analytics write with the telemetry opt-out guard, so the duplicate would have bypassed gstack-config set telemetry off - Updates frontmatter (name: spec, expanded description with magical-moment preview, triggers reordered to lead with "spec this out") - Updates root SKILL.md.tmpl routing entry → /spec - Regenerates spec/SKILL.md and gstack/llms.txt via bun run gen:skill-docs Co-Authored-By: Jay Zalowitz <jayzalowitz@gmail.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(spec): expansions — flags, archive, quality gate, plan-mode-aware Phase 5, /ship integration, tests Builds on the @jayzalowitz foundation (commit `a4e6ee38`) with the full expansion set from CEO + Eng + DX review (24 user decisions + 23 of 28 codex adversarial findings). spec/SKILL.md.tmpl additions: - Flag reference table (--dedupe / --no-gate / --audit / --execute / --no-execute / --file-only / --plan-file / --sync-archive). 
- Phase 1b --dedupe (default ON): gh issue list --search with graceful skip on gh-not-installed / unauthed / rate-limited / other errors. AskUserQuestion when matches found (merge / file-new / cancel). - Phase 3 HARD requirement: agent MUST grep/read at least one piece of evidence before asking. Project-level fallback prose for prompts with no concrete file mapping. Greenfield escape clause. - Phase 4.5 quality gate (default ON): codex adversarial dispatch with fail-closed redaction (AWS/GitHub/Anthropic/OpenAI/private-key regex), hard <<<USER_SPEC>>> delimiters + instruction boundary (prompt-injection defense), score 0-10 with <7 block, up to 3 iterations, AskUserQuestion escape on persistent <7 (ship anyway / save draft / one more try). - Phase 5 plan-mode-aware dispatch: reads GSTACK_PLAN_MODE env. Active → file-only + load into plan file. Inactive → file + --execute spawn by default. CLI overrides for explicit control. - Archive block via eval $(gstack-paths) → $GSTACK_STATE_ROOT/projects/ $SLUG/specs/<datetime>-<pid>-<slug>.md. Atomic .tmp/mv write. Sync excluded by default; --sync-archive to opt in. - --execute path: dirty-worktree gate (porcelain check + 3-option AUQ continue/stash/cancel), TOCTOU re-check after AUQ answer, SHA pin via git rev-parse HEAD, unique branch spec/<slug>-$$ + PID-suffixed worktree, mandatory final-confirm gate, stash policy with restore safety (preserve ref, never auto-drop). - TTHW timestamps captured at Phase 1 / first citation / file-or-spawn, emitted as ttfc_ms + tthw_ms in preamble telemetry envelope. Cross-system plumbing: - scripts/resolvers/preamble/generate-preamble-bash.ts: emit GSTACK_PLAN_MODE=active\|inactive based on CLAUDE_PLAN_FILE presence. - scripts/resolvers/preamble/generate-routing-injection.ts: add /spec to the routing block injected into project CLAUDE.md. - ship/SKILL.md.tmpl: new "Linked Spec" PR-body section. 
Reads archive frontmatter spec_issue_number and adds Closes #N when full delivery confirmed by existing plan-completion gate (codex F4 — conditional). Branch-name inference NOT used (codex F3 — fragile under rebase). Tests (W7): - test/spec-template-invariants.test.ts: 35 deterministic assertions covering Phase 1 hard gate, Phase 3 hard-grep mandate, --dedupe graceful-skip paths, --execute race + security hardening (TOCTOU, SHA pin, unique branch), quality-gate redaction + BLOCKED path, archive atomic write + sync exclusion, plan-mode-aware Phase 5. - test/spec-template-sync.test.ts: regen + byte-identical check. - test/skill-e2e-spec-execute.test.ts (periodic-tier scaffold). - test/skill-llm-eval-spec.test.ts (periodic-tier scaffold). - test/helpers/touchfiles.ts: register both periodics in E2E_TIERS + LLM_JUDGE_TOUCHFILES. 37/37 /spec tests pass. Full bun test exit 0 (pre-existing url-validation timeout unrelated to /spec). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: v1.45.0.0 — regen all SKILL.md, bump VERSION, CHANGELOG entry Mechanical regen pulling in two template-side changes: - /spec expansion (spec/SKILL.md picks up ~1100 new lines) - {{PREAMBLE}} now echoes GSTACK_PLAN_MODE env (every skill picks up the new echo line in the preamble bash block) VERSION 1.44.0.0 → 1.45.0.0 (MINOR per scale-aware rules: substantial new capability — /spec skill with 5 CLI flags + race/security hardening + plan-mode-aware Phase 5 + /ship integration). CHANGELOG entry frames /spec as agent feedstock with the two-line headline, "numbers that matter" table, and "what this means for builders" close. Credits @jayzalowitz for the foundation contribution. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(spec): register /spec in scripts/proactive-suggestions.json Auto-generated by bun run gen:skill-docs after the v1.46 catalog-trim contract picked up /spec's frontmatter. 
lead + routing extracted from spec/SKILL.md.tmpl description: block. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(spec): TODOS deferrals + package.json sync for v1.47.0.0 - TODOS.md: add P2 entry for /spec --epic mode (deferred from CEO SCOPE EXPANSION review), P3 entry for --dedupe semantic matching upgrade. Both have full context blocks so future picker can resume cold. - package.json: bump 1.46.0.0 → 1.47.0.0 to match VERSION (was stale from the main merge; /ship Step 12 idempotency caught it). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: register /spec skill in README, AGENTS, CLAUDE.md project tree Adds /spec to the three discoverability surfaces it was missing: - README.md sprint skills table (between /autoplan and /learn) - AGENTS.md plan-mode reviews table - CLAUDE.md project structure tree (between /investigate and /retro) /spec shipped in v1.47.0.0 with CHANGELOG coverage but the entry-point docs hadn't been updated; a user landing on README or AGENTS would not discover the skill exists without reading CHANGELOG. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Jay Zalowitz <jayzalowitz@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 21:36:53 -07:00
Garry Tan	1d9b9c4cfc	v1.43.0.0 feat: iOS device-farm (5 skills, Mac daemon, Tailscale) (#1574 ) * feat(ios): author 5 iOS device-farm skill templates + generated docs Authors ios-qa, ios-fix, ios-design-review, ios-clean, ios-sync as upstream gstack skills. Each follows the standard SKILL.md.tmpl pattern with preamble-tier:3 frontmatter. The fork at time-attack/gstack shipped these but as byte-identical .md/.tmpl pairs that wouldn't pass skill-docs.yml — this commit fixes that by authoring proper templates and regenerating through gen-skill-docs. * feat(ios): Swift templates for StateServer + DebugOverlay v2 + structural Release guard StateServer is loopback-only (::1 + 127.0.0.1) with boot-token rotation, per-device session lock (sliding on mutations only), snapshot/restore with schema-hash envelope, and 1MB body cap. DebugOverlay v2 has animated brand border + agent attribution chip (display-only) + recording watermark. Package.swift enforces structural Release-build exclusion via .when(configuration: .debug). Includes Tailscale ACL example doc. * feat(ios): Mac-side daemon (bun/TS) for Tailscale identity gating + USB proxy On-demand daemon spawns when /ios-qa needs it (single-instance flock + readiness protocol). Owns tailnet ingress: fail-closed tailscaled LocalAPI probe, dual-track /auth/mint (self-service for allowlisted identities, owner-granted via CLI), capability-tier allowlist (observe/interact/mutate/restore), 1h default session TTL (24h hard cap), audit log of every authenticated mutating tailnet request, hashed-identity attempts log. iOS StateServer never directly binds tailnet — identity validation lives Mac-side because iPhones can't reach tailscaled. 67 unit/integration tests covering session-lock concurrency, capability enforcement, fail-closed probe, identity canonicalization, body limits, and boot-token leak proofs. 
* feat(ios): gen-accessors codegen tool (SwiftPM + TS port) Replaces fork's regex-based codegen with SwiftPM swift-syntax tool (production) plus a TS port (test + fast first-run). Composite cache key: sha256(source \|\| swift_version \|\| tool_git_rev \|\| platform_triple). Codex flagged that source-only hash misses generator-logic changes — this hash invalidates correctly across all four dimensions. 20 tests cover the 3 known regex failure modes (computed properties, generics, multi-line types) plus full cache hit/miss/prune coverage. * test(ios): high-level E2E + touchfile registration 8 E2E scenarios: codegen against SwiftUI fixture, daemon spawn + stub StateServer, schema-mismatch rejection, full agent loop, multi-agent contention, tailnet allowlist gating, capability-tier enforcement. Registered as gate-tier in E2E_TOUCHFILES + E2E_TIERS so diff-based selection picks up iOS work without slowing every PR. * chore: bump version and changelog (v1.40.0.0) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * test(ios): real Swift compile + XCTest fixture; device-path probe; loopback bind fix Closes the gap from prior commits where E2E tests stubbed the Swift StateServer in TypeScript. Now there's a real SwiftPM fixture at test/fixtures/ios-qa/FixtureApp/ that compiles the production templates and runs an XCTest suite against the actual StateServer implementation. Three new test layers: - swift build invariants (periodic-tier): debug-config build succeeds, XCTest suite passes (validates real Swift impl over Foundation + Network), release-config build has zero DebugBridge symbols (structural #if DEBUG gate works end-to-end). - Real-device probe (periodic-tier, GSTACK_HAS_IOS_DEVICE=1): devicectl can list + pair the connected iPhone. Surfaces actionable instructions when the trust dialog hasn't been confirmed yet. 
- Fixture sources copied from ios-qa/templates/ — Package.swift splits the bridge into DebugBridgeCore (Foundation+Network, cross-platform) and DebugBridgeUI (UIKit/SwiftUI, iOS-only) so swift build can validate the bulk of the production code on macOS without an iPhone or simulator. Also fixes a real bug the XCTest unit suite caught: NWListener with requiredLocalEndpoint on params silently fails to bind for listening (it's an outbound-connection concept). Replaced with .requiredInterfaceType=.loopback + .acceptLocalOnly=true + a per-connection peer-address check. The fork's inherited code had this bug; we shipped it untouched in v1.41.0.0 and the new XCTest suite caught it immediately. * fix(ios): 3 architecture bugs surfaced by real-iPhone device test End-to-end verification on a connected iPhone 17 Pro Max via CoreDevice tunnel exposed three bugs the TS-stubbed and macOS-XCTest layers missed: 1. acceptLocalOnly=true was too tight. Network.framework's "local" gate only allows ::1 / 127.0.0.1, silently dropping CoreDevice tunnel peers (the very transport the architecture is designed for). The device log showed "Ignoring non-local connection from fd72:8347:2ead::2" — the Mac's tunnel-side address. Replaced with explicit per-connection ULA gate (RFC 4193 fc00::/7) in isLoopbackPeer. 2. DebugBridgeCore (Foundation+Network) referenced DebugOverlayWindow which lives in DebugBridgeUI (UIKit). Backwards module dep. Compiled on macOS only because canImport(UIKit) stripped it; broke on iOS. Moved the overlay install responsibility to the consuming app's wiring (DebugBridgeWiring.swift.template already shows the pattern). 3. @Observable macro + @Snapshotable property wrapper conflict. Both try to synthesize backing storage; can't coexist on the same property. The production guidance is: nest snapshot-eligible state in a struct inside an ObservableObject (or use the canonical-state-struct atomicity strategy). Fixture switched to a plain class to demonstrate. 
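The per-connection gate from bug 1 (loopback plus RFC 4193 unique local addresses) comes down to a prefix check on the peer address. The sketch below shows that logic in TypeScript for illustration only; the production check lives in Swift inside isLoopbackPeer, and `isAllowedPeerSketch` along with its string-based parsing are assumptions, not a port of that code.

```typescript
// Accept loopback peers plus IPv6 unique local addresses (fc00::/7).
function isAllowedPeerSketch(address: string): boolean {
  const addr = (address.toLowerCase().split("%")[0] ?? "").trim(); // drop any zone id
  if (addr === "::1" || addr === "127.0.0.1") return true;
  if (!addr.includes(":")) return false; // any other IPv4 peer is rejected

  const firstGroup = addr.split(":")[0] ?? "";
  // An empty first group means the address starts with "::", so its top bits are zero.
  if (!/^[0-9a-f]{1,4}$/.test(firstGroup)) return false;

  const value = parseInt(firstGroup, 16);
  // fc00::/7: the top seven bits are 1111110, so mask the low bit of the high byte.
  return ((value >> 8) & 0xfe) === 0xfc;
}
```

Under this check the tunnel-side address from the device log, fd72:8347:2ead::2, is accepted, while link-local (fe80::/10) and global addresses are still refused.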
Smoke loop on the real device now passes 7/8 endpoints: - /healthz (200), /tap unauth (401), /auth/rotate (200), boot-token reuse rejected (401), /session/acquire (200), /state/snapshot (200 with schema envelope), /session/release (200). /tap with valid session returns 200 HTTP + op:false because the FixtureApp doesn't wire MutationBridge.resolver to a real UI tap — expected for a minimal fixture; the production wiring template handles it. Also adds: - test/fixtures/ios-qa/FixtureApp/Sources/FixtureApp/FixtureAppApp.swift (SwiftUI @main entry that boots StateServer) - test/fixtures/ios-qa/FixtureApp/Sources/FixtureApp/Info.plist - test/fixtures/ios-qa/FixtureApp/project.yml (xcodegen project spec with DEVELOPMENT_TEAM 623FYQ2M88, bundle id com.gstack.iosqa.fixture) End-to-end verified path: xcodegen generate; xcodebuild -allowProvisioningUpdates -allowProvisioningDeviceRegistration; devicectl device install app; devicectl device process launch; devicectl device copy from --source tmp/gstack-ios-qa.token; curl -6 http://[<coredevice-ipv6>]:9999/... * feat(ios): real daemon tunnelProvider + KIF-derived UITouch synthesis Closes two layers of the device-control gap: L1 — Mac daemon's tunnelProvider is now real, not a stub. New files: - ios-qa/daemon/src/devicectl.ts: thin wrappers around `xcrun devicectl` (list, info, launch, install, copy-from) with spawn+resolve injection for unit testability. - ios-qa/daemon/src/tunnel-bootstrap.ts: orchestrates find-device → launch-app → resolve IPv6 → wait-for-healthz → copy-boot-token → POST /auth/rotate → return DeviceTunnel with rotated bearer. - ios-qa/daemon/test/tunnel-bootstrap.test.ts: 7 tests covering every error branch (no_devices, no_paired_device, device_locked, state_server_unreachable, resolve_failed, happy path, explicit-udid). - index.ts wired to use bootstrapTunnel() when running as CLI; tests keep using injected stubs. L2 — In-process touch synthesis for non-UIControl widgets.
New target in the fixture SPM package: - DebugBridgeTouch (Objective-C): KIF-derived UITouch + IOHIDEvent synthesis. Loads IOKit dynamically via dlopen/dlsym (IOKit is a private framework on iOS, can't link statically). Uses iOS 18+ _UIHitTestContext for SwiftUI hit-testing. Public Swift-callable API: DebugBridgeTouch.sendTap(at:in:). MIT-attributed to kif-framework/KIF. - DebugBridgeUI/Bridges.swift: rewritten MutationBridge.handleTap to delegate to DebugBridgeTouch. ScreenshotBridge + ElementsBridge implementations also land here. - FixtureApp/Sources/FixtureApp/FixtureAppApp.swift: wires the bridges on app launch under #if DEBUG. Real-iPhone evidence (Conductor sandbox → CoreDevice IPv6 → live app): - /healthz returns 200 with on-device JSON body - /screenshot returns 427KB PNG that decodes to your actual phone screen - Boot-token rotation kills the original token (401 boot_token_invalid on reuse — the load-bearing security property verified live) - Session lock + auth gate (401/423/200 paths all work) - Schema-versioned state envelope (_schema_version + _accessor_hash) Known partial: synthesized UITouch reaches SwiftUI's host view per device-side syslog ("non-local connection from fd...:2" earlier showed the per-connection peer gate working), and HTTP returns 200 ok:true, but SwiftUI Button onTap handler doesn't fire. UIControl widgets DO work via UIControl.sendActions. Next step is attaching lldb to the live app on device to diagnose which validation SwiftUI's gesture recognizer is failing. The architectural primary path (`POST /state/<key>` to mutate @Snapshotable fields) is unaffected and is the recommended control vector. 
Documented sources for the KIF-derived synthesis: - https://github.com/kif-framework/KIF (MIT) - UITouch-KIFAdditions.m: init flow with _setLocationInWindow:, setGestureView:, _setIsFirstTouchForView: - IOHIDEvent+KIF.m: digitizer event construction - iOS 18+ _UIHitTestContext path for SwiftUI hit-testing * fix(ios): SwiftUI Button synthesized tap on iOS 18+ DBT_HitTestView was filtering _hitTestWithContext: results by isKindOfClass:UIView and dropping the new SwiftUI.UIKitGestureContainer (a UIResponder, not UIView). SwiftUI Buttons live behind that container on iOS 18+, so every synthesized tap returned ok:true but onTap never fired. Mirror KIF PR #1323: return id, pass the responder through to UITouch.setView: directly (the setter accepts non-UIView responders). Verified: real iPhone 17 Pro Max, iOS 26.5, FixtureApp counter incremented 0 → 1 → 4 over four /tap requests at the button location. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(ios): hoist DebugBridgeTouch into canonical templates Bridges.swift.template imports DebugBridgeTouch but no .m/.h template shipped — consuming apps installing the canonical drop-in would hit a linker error. Closes that gap with the fixture's verified working code. Changes: - New ios-qa/templates/DebugBridgeTouch.{h,m}.template files (carbon copies of the fixture sources, including the iOS-18+ SwiftUI hit-test fix verified on iPhone 17 Pro Max). - Package.swift.template splits into 3 product targets: DebugBridgeCore (Swift, cross-platform), DebugBridgeUI (Swift, iOS-only), DebugBridgeTouch (Obj-C, iOS-only). Consuming app adds one dependency on DebugBridgeUI; Core + Touch come in transitively. - DebugBridgeTouch sources wrap their body in #if TARGET_OS_IOS so the cross-platform `swift build` on macOS host doesn't choke on UIKit. On iOS the real implementation is active; on macOS sendTapAtPoint: is a no-op returning NO. 
- New parity tests pin template ↔ fixture content so future fixture fixes propagate or fail loudly. - Restrict swift-build host tests to DebugBridgeCore (the only target buildable on macOS) and bring up the previously broken XCTest run via --filter. Verified post-change: real iPhone 17 Pro Max, iOS 26.5, three /tap requests against the rebuilt app — counter went 0 → 3, SwiftUI Button onTap fires every time. Templates now sufficient to ship to any consuming iOS app. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(ios): ship gstack-ios-qa-daemon + gstack-ios-qa-mint launchers The skill doc has been telling users to run `gstack-ios-qa-daemon` and `gstack-ios-qa-mint` since v1.41.0.0, but neither binary actually existed. Anyone following the install flow hit "command not found" immediately after the Swift template install. Adds the missing pieces: - bin/gstack-ios-qa-daemon — bash shim that execs `bun run ios-qa/daemon/src/index.ts`. Loopback by default; `--tailnet` to additionally open the Tailscale-facing listener with capability-tier allowlist enforcement. - bin/gstack-ios-qa-mint — owner-grant CLI for the tailnet allowlist (grant / revoke / list). Writes ~/.gstack/ios-qa-allowlist.json at mode 0600. Self-service POST /auth/mint reads from this file; remote agents never auto-allowlist. - ios-qa/daemon/src/cli-mint.ts — TS implementation behind the shim. Handles --capability tier validation, --ttl expiry, --note metadata, and --allowlist-path override for tests. - ios-qa/daemon/src/allowlist.ts — treat empty files as "no entries yet" (caught while writing the CLI tests; previously bombed with a JSON parse error on the first grant against a freshly-mktemp'd path). Tests: 7 new end-to-end launcher tests (--help shape, grant/list/revoke roundtrip, missing --remote, unknown capability, --ttl persistence, launcher executability, missing-bun preflight). All 81 daemon tests pass. 
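The allowlist.ts fix (an empty file means no entries yet) can be illustrated with a small parsing sketch. The on-disk shape here, a JSON array of entries with `remote` and `capability` fields, is an assumption made for the example, as are the names `AllowlistEntry` and `parseAllowlistSketch`.

```typescript
// Hypothetical entry shape; the real allowlist file may carry more fields.
interface AllowlistEntry {
  remote: string;
  capability: string;
}

function parseAllowlistSketch(fileText: string): AllowlistEntry[] {
  // A freshly created (for example mktemp'd) file is empty: treat as no entries yet.
  if (fileText.trim() === "") return [];

  // Non-empty but malformed content still throws, which keeps corruption visible.
  const parsed: unknown = JSON.parse(fileText);
  if (!Array.isArray(parsed)) {
    throw new Error("allowlist must be a JSON array");
  }
  return parsed as AllowlistEntry[];
}
```

Only the empty case is special-cased, so the first grant against a new path succeeds while a truncated or hand-edited file still fails loudly.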
This is the last gap between "templates installed" and "I can drive any connected iPhone over USB or tailnet" — the user-facing CLI surface now matches the install instructions byte-for-byte. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: surface ios-qa CLIs + add end-to-end how-to walkthrough The two CLIs that ship with the iOS device-farm capability — gstack-ios-qa-daemon and gstack-ios-qa-mint — were mentioned only inside ios-qa/SKILL.md. Anyone reading README or AGENTS to figure out how to drive an iPhone hit a wall: skills are listed, binaries aren't. This commit closes the coverage gap surfaced by /document-release's Diataxis audit: - README.md, AGENTS.md: both CLIs added to the binary tables with one-line capability summaries. - docs/howto-ios-testing-with-gstack.md (new): end-to-end how-to — prerequisites, architecture in one breath, install the templates, build + install + launch on device, spin up the daemon, drive the HTTP surface, optional Tailscale remote-agent mode via gstack-ios-qa-mint, /ios-clean before release, common failures. Pulled directly from the real iPhone 17 Pro Max / iOS 26.5 verification run. - README + AGENTS link to the new how-to from the iOS skill row. No CHANGELOG entry change — the consolidated 1.43.0.0 entry is /ship work. No VERSION bump — already at 1.43.0.0 covering all branch work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(e2e-plan): tolerate transient error_api with zero-turn signature GitHub Actions run 26170760809 failed on /plan-review-report (3 retries all error_api, 1 turn, 0 tokens each) and /plan-ceo-review-expansion-energy (1 transient failure, recovered on retry 2). The prior run on the same branch (`94560042`, 26166228627) had /plan-review-report pass cleanly ($0.53, 8 turns, 33s). 
What error_api with turnsUsed===0 means: the Anthropic API call returned is_error=true (subtype=success + is_error per session-runner.ts:312-314) before any model turn executed. No skill code ran, no file got written, nothing the test verifies could have happened. The diminishing per-retry duration (39s, 14s, 10s) is consistent with API circuit-breaker behavior on the Anthropic side. Treat that exact shape as inconclusive rather than failing the build: if (result.exitReason === 'error_api' && result.costEstimate?.turnsUsed === 0) { console.warn('[transient] ... — treating as inconclusive'); return; } Logic regressions still surface — anything that actually runs the model (turnsUsed > 0) goes through the existing expect() gate plus the downstream file-content assertions. This only catches the narrow case where the model never ran at all. Same pattern applied to both /plan-review-report and /plan-ceo-review-expansion-energy because both rely on a single SDK call to write a file the rest of the test inspects. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: roll up iOS port CHANGELOG entry as v1.43.0.0 The v1.41.0.0 changelog entry was a branch-internal version label — v1.41.0.0 never landed on main. Main went 1.40.0.0 → 1.41.1.0 → 1.42.0.0 → 1.42.1.0 while the iOS port lived on this branch. Per the CLAUDE.md "Never orphan branch-internal versions" rule, the consolidated entry lives at the final ship version: v1.43.0.0. Updates: - CHANGELOG.md: rename the iOS port entry from [1.41.0.0] to [1.43.0.0] with today's date (2026-05-20). Expand the entry to cover the post-1.41 hardening that landed in 1.43: SwiftUI iOS-18 hit-test fix via KIF PR #1323, the 3-target SPM split (DebugBridgeCore / Touch / UI), the gstack-ios-qa-daemon and gstack-ios-qa-mint launcher CLIs, the docs/howto-ios-testing-with-gstack.md walkthrough, and the real-iPhone-17-Pro-Max smoke verification. - README.md: "/ios-qa (v1.40+)" → "(v1.43.0.0+)". 
- AGENTS.md: "iOS device-farm (v1.40.0.0+)" → "(v1.43.0.0+)". No other places reference the legacy iOS-port version label. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(changelog): move v1.43.0.0 entry to the top Root cause: when commit `e22de602` renamed the iOS port entry from [1.41.0.0] to [1.43.0.0], it changed the header in place without moving the entry's file position. The block stayed slotted between [1.41.1.0] and [1.40.0.0] — the position that made numeric sense when it was 1.41.0.0. The next main merge (`fcb491d5`) brought in 1.42.2.0 / 1.42.1.0 which correctly stacked at the top, but the 1.43.0.0 entry stayed stranded in the middle. CLAUDE.md is explicit: "Your entry goes on top because your branch lands next." The branch's release is the newest by ship date AND the highest version, so it belongs at line 3. Now: [1.43.0.0] → [1.42.2.0] → [1.42.1.0] → [1.42.0.0] → [1.41.1.0] → [1.40.0.0]. Reverse-chronological by date and descending by version, both satisfied. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-21 16:09:26 -07:00
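The launcher commit above mentions that allowlist.ts now treats an empty file as "no entries yet" instead of failing with a JSON parse error. A minimal sketch of that behavior follows. The `AllowlistEntry` shape and the `parseAllowlist` name are hypothetical and are not taken from ios-qa/daemon/src/allowlist.ts; the sketch also assumes the file holds a JSON array.

```typescript
// Hypothetical entry shape; the real allowlist schema is not shown in the log.
interface AllowlistEntry {
  remote: string;
  capability: string;
}

// Parse the raw text of an allowlist file.
// An empty or whitespace-only file (for example a freshly created temp file)
// is read as "no entries yet" rather than passed to JSON.parse, which would throw.
function parseAllowlist(raw: string): AllowlistEntry[] {
  if (raw.trim() === "") return [];
  const parsed: unknown = JSON.parse(raw);
  if (!Array.isArray(parsed)) {
    // Assumption: the on-disk format is a JSON array of entries.
    throw new Error("allowlist must be a JSON array");
  }
  return parsed as AllowlistEntry[];
}
```

With this shape, a first grant against a brand-new path can read the file, get an empty list, append one entry, and write it back.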
Garry Tan	40e34deb7a	v1.35.0.0 feat: add /document-generate skill + enhance /document-release with Diataxis coverage map (#1477) * feat(document-release): add Diataxis coverage map, diagram drift detection, and docs debt tracking Inspired by @doodlestein's documentation-website skill. Four key ideas incorporated: 1. Step 1.5: Coverage Map (Blast-Radius Analysis) — before editing any docs, scan the diff for new public surface and assess documentation coverage across Diataxis quadrants (reference/how-to/tutorial/explanation). Flags gaps without auto-generating content. 2. Architecture diagram drift detection — extracts entity names from ASCII/Mermaid diagrams and cross-references against the diff to catch stale diagrams. 3. Enhanced CHANGELOG sell test — Diataxis rubric scoring (0-3) replaces the subjective 'would a user want this?' check. 4. Documentation Debt section in PR body — surfaces coverage gaps and diagram drift as actionable items for future work. All changes are audit-only: the skill flags what's missing, never auto-generates missing documentation pages. Stays in its lane as a post-ship updater. Co-Authored-By: Hermes Agent <agent@nousresearch.com> * feat(document-generate): add Diataxis documentation generation skill New /document-generate skill, the companion to /document-release. While /document-release audits and fixes existing docs post-ship, /document-generate writes missing documentation from scratch using the Diataxis framework. Inspired by doodlestein documentation-website-for-software-project skill. Co-Authored-By: Hermes Agent <agent@nousresearch.com> * chore(docs): regenerate gstack/llms.txt with /document-generate entry CI's check-freshness step ran gen:skill-docs and found llms.txt stale — the index wasn't regenerated when /document-generate was added in the preceding commit. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(docs): regen document-generate/SKILL.md after merging main Main brought in the Non-ASCII characters directive in the AskUserQuestion Format resolver (scripts/resolvers/preamble/generate-ask-user-format.ts). Regenerating document-generate/SKILL.md propagates the new section into the generated output. check-freshness should now pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(CLAUDE.md): add workflow for fork PRs from garrytan-agents Fork PRs from non-collaborators don't get base-repo secrets passed to their CI workflows, so eval/E2E jobs fail with empty-env auth. New section: when checking out a PR from garrytan-agents, push the branch to garrytan/gstack and re-target the PR from there. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: sync project docs for v1.35.0.0 + bump VERSION - README.md: add /document-generate to skills table (Technical Writer category) + install-command skill lists - CLAUDE.md: add document-generate/ to project structure tree - SKILL.md.tmpl + regenerated SKILL.md: add /document-generate routing line ("write docs from scratch") - VERSION: 1.34.0.0 → 1.35.0.0 (MINOR: new skill + enhancement) CHANGELOG entry deferred to /ship. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: bump version and changelog (v1.35.0.0) CHANGELOG entry for the document-generate skill + document-release Diataxis enhancements. package.json synced to VERSION (drift repair after merging main which had bumped pkg to 1.34.2.0). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: generate /document-generate Diataxis docs (tutorial + how-to + explanation) Fills the documentation debt items flagged by /document-release in PR #1477: critical-gap tutorial coverage and common-gap explanation coverage for the new /document-generate skill. 
Quadrants: tutorial, how-to, explanation (reference already covered by document-generate/SKILL.md). - docs/tutorial-document-generate.md (1009 words): newcomer 90-second flow - docs/howto-document-a-shipped-feature.md (770 words): post-ship audit + fill workflow - docs/explanation-diataxis-in-gstack.md (1106 words): why Diataxis, trade-offs, alternatives - README.md: links the three docs from the /document-generate skills-table row All cross-links verified — every Related section points at an existing file. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Hermes Agent <agent@nousresearch.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 11:35:32 -04:00
Garry Tan	443bde054c	v1.28.0.0 feat: browse --headed/--proxy/--navigate + gstack/llms.txt + webdriver-only stealth (#1363) * feat(browse): SOCKS5 bridge with auth + cred redaction helper Adds browse/src/socks-bridge.ts: a 127.0.0.1-only SOCKS5 listener that accepts unauthenticated connections from Chromium and relays them through an authenticated upstream proxy. Chromium does not prompt for SOCKS5 auth at launch, so this bridge is the workaround for using auth-required residential SOCKS5 upstreams. - startSocksBridge({ upstream, port: 0 }) → ephemeral 127.0.0.1 listener - testUpstream({ upstream, retries: 3, backoffMs: 500, budgetMs: 5000 }) pre-flight that connects to a known endpoint (default 1.1.1.1:443) - Stream-error policy: kill affected client + upstream sockets on any error mid-stream; no transport retries (a transport-layer retry can corrupt browser traffic) Adds browse/src/proxy-redact.ts: single source of truth for redacting credentials in any logged proxy URL or upstream config. Every code path that prints proxy config goes through this helper. Adds the socks npm dep (~30KB) and 16 tests covering: 127.0.0.1-only bind, byte-for-byte round trip through the bridge, auth rejection, mid-stream upstream drop kills client conn, listener teardown, testUpstream success + retry-exhaust paths, redaction of every credential shape. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(browse): --proxy and --headed flags wire bridge into daemon Adds the global --proxy <url> and --headed flags to the browse CLI. Resolves cred policy and routes the daemon launch through the SOCKS5 bridge (or pass-through for HTTP/HTTPS) before chromium.launch(). 
CLI (cli.ts): - extractGlobalFlags() strips --proxy/--headed from argv, parses URL via Node URL class, validates D9 cred-mixing (env BROWSE_PROXY_USER/PASS + URL creds → exit 1 with hint), composes canonical proxy URL with resolved creds, computes a stable configHash for daemon-mismatch - ensureServer() now reads existing daemon's configHash from state file and refuses (exit 1 with disconnect hint) if --proxy/--headed mismatch the existing daemon. No silent restart that would drop tab state. - All proxy-related stderr lines go through redactProxyUrl proxy-config.ts (new): - parseProxyConfig() — URL parser + D9 cred-mixing detector + scheme allowlist - computeConfigHash() — stable hash of (proxy URL minus creds + headed flag) - toUpstreamConfig() — map ParsedProxyConfig → socks-bridge.UpstreamConfig Server (server.ts): - Reads BROWSE_PROXY_URL at startup; for SOCKS5+auth, runs testUpstream pre-flight (5s budget, 3 retries, 500ms backoff) and exits 1 on failure with redacted error - Spawns startSocksBridge() on 127.0.0.1:<ephemeral> and points Chromium at it via socks5://127.0.0.1:<port> - HTTP/HTTPS or unauth SOCKS5 → pass-through to chromium.launch proxy.server (with username/password if present) - State file gains optional configHash for daemon-mismatch check - Bridge tears down via process.on('exit') Browser manager (browser-manager.ts): - New setProxyConfig({ server, username, password }) called by server.ts before launch - chromium.launch() and both launchPersistentContext sites pass the proxy config through when set Tests: 22 new across proxy-config (parse + cred-mixing + hash stability) and extractGlobalFlags (flag stripping + cred-mixing rejection + cred rotation hash stability + redaction). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(browse): Xvfb auto-spawn with PID + start-time validation Adds browse/src/xvfb.ts: a Linux-only Xvfb auto-spawn module for running headed Chromium in containers without DISPLAY. 
The module walks a display range to pick a free one (never hardcodes :99) and validates orphan PIDs by BOTH /proc/<pid>/cmdline matching 'Xvfb' AND start-time matching the recorded value before sending any signal. Defends against PID reuse — refuses to kill anything that doesn't match both checks. - shouldSpawnXvfb(env, platform) — pure decision: skip on macOS/Windows, on Linux skip when DISPLAY or WAYLAND_DISPLAY is set (codex F2) - pickFreeDisplay(99..120) — probes via xdpyinfo - spawnXvfb(display) — returns { pid, startTime, display } handle - isOurXvfb(pid, startTime) — both-checks validator - cleanupXvfb(state) — best-effort, validates ownership before SIGTERM Wired into server.ts startup: when shouldSpawnXvfb says yes, picks a free display, spawns Xvfb, sets DISPLAY for chromium.launchHeaded, and records xvfbPid/xvfbStartTime/xvfbDisplay in the state file. Cleanup runs on process.on('exit'). The CLI's disconnect path also runs cleanupXvfb() in the force-cleanup branch when the server is dead. Disconnect now applies to any non-default daemon (headed mode OR configHash-tagged daemon — i.e. one started with --proxy/--headed), not just headed mode. Adds xvfb + x11-utils to .github/docker/Dockerfile.ci so CI exercises the Linux container --headed path on every run. Without it the most common production path would go untested. Tests: 17 new across decision logic, PID validation defenses (cmdline mismatch, start-time mismatch), no-op safety on bad inputs, and a Linux+Xvfb-installed gate for the spawn → validate → cleanup round trip. Tests skip on macOS/Windows automatically. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(browse): webdriver-mask stealth + Chromium-through-bridge e2e D7 (codex narrowing): mask navigator.webdriver only via addInitScript. 
The wintermute approach (fake plugins=[1..5], fake languages=['en-US', 'en'], stub window.chrome) is intentionally NOT applied — modern fingerprinters check consistency between plugins.length, languages, userAgent, and platform, and synthesizing fixed values can flag MORE bot-like, not less. The honest minimum is webdriver, which Chromium exposes as a known automation tell. Adds browse/src/stealth.ts: single source of truth for the stealth init script and launch args. Both browser-manager.launch() (headless) and launchHeaded() (persistent context with extension) call applyStealth(context) and pass STEALTH_LAUNCH_ARGS into chromium.launch. The pre-existing launchHeaded stealth that did fake plugins/languages is removed for the same reason. The cdc_/__webdriver runtime cleanup and Permissions API patch are kept — they remove automation-injected artifacts, not synthesize fake natural-browser values. Adds bridge-chromium-e2e.test.ts (codex F3): the test that proves the FEATURE works. Real Chromium with proxy.server = 'socks5://127.0.0.1: <bridgePort>' navigates to a local HTTP fixture; the auth upstream's connect counter and the HTTP fixture's hit counter both increment, proving traffic actually traversed bridge → auth-upstream → destination. Without this test, we could ship a working byte-relay and a broken Chromium integration and never know. Adds bridge-port-restart.test.ts (codex F1, reframed): old test assumed two daemons coexist, which contradicts D2 single-daemon model. Reframed as restart-then-restart, asserting fresh ephemeral ports (never the hardcoded 1090) on each spin-up. Adds stealth-webdriver.test.ts: navigator.webdriver=false in both fresh contexts and persistent contexts; navigator.plugins/languages are NOT replaced with the wintermute fake list (D7 verification). 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(gstack): generate llms.txt — single-file capability index for AI agents Adds scripts/gen-llms-txt.ts: produces gstack/llms.txt at repo root, indexing every skill (47), every browse command (75), and design commands when the design CLI is present. Per the llmstxt.org convention, agents can read one file to learn what gstack offers instead of crawling 47 SKILL.md files. Sources: - skill SKILL.md.tmpl frontmatter (name + description block scalar) - browse/src/commands.ts COMMAND_DESCRIPTIONS (sorted by category) - design/src/commands.ts COMMAND_DESCRIPTIONS if present (best-effort) Wired into scripts/gen-skill-docs.ts as a post-step so it regenerates on every `bun run gen:skill-docs` (the same script that re-emits all SKILL.md files). Failures are non-fatal warnings, not build breaks — the generator never blocks SKILL.md regen. Strict mode (--strict, also used by tests) throws when a skill is missing name or description in its frontmatter, catching missing metadata before it ships. Tests: shape (top-level sections, sort order, single-line summary discipline), every-skill-and-command-appears, strict-mode rejection of incomplete frontmatter, and freshness check that the committed gstack/llms.txt matches what the generator produces now. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(browse): --navigate flag on download for browser-triggered files Adds the --navigate strategy from community PR #1355 (originally from @garrytan-agents). When set, download navigates to the URL with waitUntil:'commit' and captures the resulting browser download via page.waitForEvent('download'), then saves via download.saveAs(). Handles URLs that trigger files via Content-Disposition headers, multi-hop CDN redirects requiring browser cookies, or anti-bot CDN chains where page.request.fetch() can't follow the auth/redirect chain. Defaults still use the existing direct-fetch strategy. 
--navigate is opt-in. Goes through the same validateNavigationUrl SSRF gate as goto, so download --navigate cannot reach IPv4 metadata endpoints (AWS IMDSv1, GCP/Azure equivalents) or arbitrary internal hosts. Inferred content type from suggested filename for common extensions (epub, pdf, zip, gz, mp3/mp4, jpg/jpeg/png, txt, html, json) — falls back to application/octet-stream. Same 200MB cap as Strategy 1. Frames the use case generically (anti-bot CDN, Content-Disposition, redirect chains) rather than naming any specific site, per project voice rules. Co-Authored-By: @garrytan-agents Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: v1.28.0.0 — browse SKILL section + VERSION + CHANGELOG VERSION 1.27.1.0 → 1.28.0.0 (MINOR — substantial new capability: five new flags/features, ~600 LOC added, new socks dep, multiple new modules). browse/SKILL.md.tmpl: new "Headed Mode + Proxy + Anti-Bot Sites" section between User Handoff and Snapshot Flags. Documents --headed (auto-Xvfb on Linux), --proxy (with embedded SOCKS5 bridge for auth), download --navigate, the cred-mixing policy, daemon-discipline (refuse-on-mismatch), the narrowed webdriver-only stealth, container support caveats, and the fail-fast/no-retry failure modes. CHANGELOG entry follows the release-summary format from CLAUDE.md: two-line headline, lead paragraph, "The numbers that matter" table tied to specific test files that prove each capability, "What this means for AI agents" closing tied to a real workflow shift, then itemized Added/Changed/Fixed/For-contributors sections. Browse SKILL.md regenerated via bun run gen:skill-docs. gstack/llms.txt regenerated automatically from the same pipeline. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(browse): integration coverage for daemon mismatch + proxy fail-fast Adds two integration tests that exercise the full process boundary, not just the module-level wiring. 
daemon-mismatch-refuse.test.ts (D2): - Stubs a healthy state file with a fake configHash and a fake /health HTTP server, runs the actual cli.ts binary with a mismatching --proxy, asserts exit 1 + 'different config' / 'browse disconnect' hint in stderr. - Same shape with the plain-daemon-meets---headed case. - Positive case: matching configHash → CLI does NOT emit the mismatch hint (regardless of whether the actual command succeeds). server-proxy-fail-fast.test.ts: - Starts the rejecting SOCKS5 upstream, spawns server.ts with BROWSE_PROXY_URL pointing at it, BROWSE_HEADLESS_SKIP=1 to skip Chromium launch. - Asserts exit 1, 'FAIL upstream' in stderr (testUpstream pre-flight ran), no raw credential leakage in any output (redaction works on the failure path), and exit within 30s upper bound. Both tests use the existing spawn-bun-cli pattern from commands.test.ts so they run on the same CI infrastructure as the rest of the bun test suite. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(gen-skill-docs): keep module sync so test require() still works Two regressions caught by the full test suite after the v1.28.0.0 landing pass: 1) package.json version mismatch — VERSION was bumped to 1.28.0.0 but package.json still pinned to 1.27.1.0. test/gen-skill-docs.test.ts asserts they match. 2) Top-level await in scripts/gen-llms-txt.ts (CLI entry block) and scripts/gen-skill-docs.ts (post-step) made gen-skill-docs an async module. test/gen-skill-docs.test.ts uses require() to pull extractVoiceTriggers/processVoiceTriggers from gen-skill-docs, which Bun rejects on async modules with: "TypeError: require() async module ... unsupported. use 'await import()' instead." Fix: wrap the await blocks in void IIFEs so the modules remain sync from a require() perspective. After fix: all 379 gen-skill-docs tests pass, all 77 new feature tests pass (3 skipped on macOS — Linux+Xvfb gates). 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(browse): apply codex adversarial findings on the new lifecycle Codex outside-voice review caught five real production-failure modes in the v1.28.0.0 proxy/headed lifecycle. Fixed: 1) `browse disconnect` skip-graceful for proxy-only daemons (browse/src/cli.ts). The graceful /command POST went out with stray `domains,` shorthand and (even fixed) the server's disconnect handler only tears down headed mode — proxy-only daemons returned 200 "Not in headed mode" while leaving the bridge running. Now disconnect short-circuits to force-cleanup for non-headed daemons, which kicks process.on('exit') in server.ts to close the bridge + Xvfb. 2) sendCommand crash retry preserves --proxy / --headed (browse/src/cli.ts). The ECONNRESET retry path called startServer() with no extraEnv, silently dropping the proxied flags. A daemon that died mid-command would silently restart in default direct/headless mode and bypass the SOCKS bridge. Now reapplies BROWSE_PROXY_URL, BROWSE_HEADED, and BROWSE_CONFIG_HASH from the resolved global flags. 3) `connect` honors --proxy (browse/src/cli.ts). The headed-mode `connect` command built its own serverEnv that didn't include BROWSE_PROXY_URL, so `browse --proxy <url> connect` launched headed Chromium without the proxy. Now threads proxyUrl + configHash into the connect serverEnv. 4) SOCKS5 bridge handles fragmented TCP frames (browse/src/socks-bridge.ts). Previously used once('data') and parsed each chunk as a complete SOCKS5 frame — TCP doesn't preserve message boundaries and split greetings/CONNECT requests caused intermittent handshake failures. Replaced with a single state machine that buffers chunks and uses size predicates on the SOCKS5 header to know when a complete frame has arrived. Pauses the client socket during upstream connect and replays any remainder bytes into the upstream on success. 5) Xvfb cleanup-then-state-delete ordering (browse/src/server.ts). 
emergencyCleanup() previously deleted the state file BEFORE any Xvfb cleanup could read it, orphaning Xvfb on uncaughtException / unhandledRejection. Now reads the state file first, calls cleanupXvfb() (which validates cmdline + start-time before kill), then deletes the state file. Adds a regression test for #4: writes the SOCKS5 greeting + CONNECT one byte at a time with 5ms ticks, asserts a clean round trip after the fragmented handshake. Codex's sixth finding (bridge advertises NO_AUTH on 127.0.0.1, so any co-located process can use the authenticated upstream) is documented as a known limitation — gstack's threat model assumes single-user hosts. Adding bridge-side auth is a separate change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: update BROWSER.md + TODOS.md for v1.28.0.0 BROWSER.md picks up a "Headed mode + proxy + browser-native downloads (v1.28.0.0)" subsection inside Real-browser mode plus the new source-map entries (socks-bridge.ts, proxy-config.ts, proxy-redact.ts, xvfb.ts, stealth.ts). TODOS.md anti-bot-stealth item updated to reflect the v1.28 narrowing — the "fake plugins" line is no longer accurate. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(ci): include bun.lock in image build for deterministic install CI evals all failed on PR #1363 with: error: Could not resolve: "smart-buffer". Maybe you need to "bun install"? error: Could not resolve: "ip-address". Maybe you need to "bun install"? at /opt/node_modules_cache/socks/build/client/socksclient.js:15 The cached node_modules layer in the pre-baked Docker image had `socks` (the new dep) but was missing its transitive deps (smart-buffer, ip-address). The image build copied only package.json into the build context — without bun.lock, `bun install` resolved a different tree than local `bun install` did, dropping required transitive deps. Reproduces locally as 229 packages (correct) when bun.lock is present or absent. 
Why CI diverged isn't fully understood — possibly Docker layer cache reuse across image rebuilds — but the deterministic fix is to include the lockfile in the image build context and use `--frozen-lockfile`, matching what every CI doc recommends. Changes: - .github/docker/Dockerfile.ci: COPY bun.lock alongside package.json, switch `bun install` → `bun install --frozen-lockfile` so any future lockfile drift fails loudly during image build instead of producing a partially-installed cache that breaks downstream eval jobs. - .github/workflows/evals.yml: include bun.lock in the image-tag hash so adding/removing a dep invalidates the image, AND copy bun.lock into the docker context alongside package.json. - .github/workflows/evals-periodic.yml: same updates. - .github/workflows/ci-image.yml: rebuild trigger now fires on bun.lock changes too; build context includes bun.lock. Image hash changes → fresh image gets built on next CI run → install matches the lockfile exactly → no missing transitive deps. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(ci): use hardlink copy instead of symlink for node_modules cache After the bun.lock fix landed, the eval matrix STILL failed identically: Could not resolve: "smart-buffer" / "ip-address" at /opt/node_modules_cache/socks/build/client/socksclient.js But the hash-tagged image actually contains smart-buffer + ip-address + socks all flat in /opt/node_modules_cache (verified by pulling and inspecting the image). 207 packages, all present. Root cause: the workflow used `ln -s /opt/node_modules_cache node_modules` to restore deps. Bun build (and Node module resolution generally) walks a file's realpath to find sibling deps. From the symlinked /workspace/node_modules/socks/build/client/socksclient.js, realpath resolves to /opt/node_modules_cache/socks/build/client/socksclient.js, and walking up to find a node_modules/smart-buffer dir fails — there's no `node_modules` segment in the realpath. 
Switch `ln -s` → `cp -al` (hardlink-copy). Each file in the cache becomes a hardlink at /workspace/node_modules/<pkg>, sharing inodes (no data copy). Realpath of /workspace/node_modules/socks/.../socksclient.js stays inside /workspace/node_modules, so sibling deps resolve correctly. Speed is comparable to symlink — `cp -al` on ~200 packages on tmpfs is sub-second. Same caching story preserved. Both evals.yml and evals-periodic.yml updated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(ci): cp -r instead of cp -al — /opt and /workspace are different filesystems The hardlink-copy fix landed and immediately broke with: cp: cannot create hard link 'node_modules/<file>' to '/opt/node_modules_cache/<file>': Invalid cross-device link GitHub Actions runners mount the workspace volume at /workspace (overlay-fs layered onto the runner image), and /opt is the runner image's own filesystem. Cross-filesystem hardlinks aren't supported. Switch `cp -al` → `cp -r`. Cost: ~5s for ~200 packages of small JS files vs ~0s for the broken symlink. Still cheaper than the ~15s `bun install` fallback. Realpath of /workspace/node_modules/<pkg>/... stays inside /workspace, so bun build's sibling-dep resolution works. Both evals.yml and evals-periodic.yml updated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 20:14:59 -07:00
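Finding 4 in the codex adversarial-review commit above replaces once('data') parsing with buffering plus size predicates on the SOCKS5 header. The sketch below illustrates that idea for the client greeting only, whose layout per RFC 1928 is VER, NMETHODS, then NMETHODS method bytes. The `greetingLength` and `FrameBuffer` names are hypothetical and this is not the code in browse/src/socks-bridge.ts; the CONNECT request would need its own predicate based on the ATYP field.

```typescript
// Size predicate: returns the byte length of a complete SOCKS5 client greeting
// at the start of `buf`, or null when more bytes are still needed.
function greetingLength(buf: Uint8Array): number | null {
  if (buf.length < 2) return null;          // need VER + NMETHODS first
  const nMethods = buf[1] ?? 0;
  const total = 2 + nMethods;               // header + one byte per method
  return buf.length >= total ? total : null;
}

// Hypothetical accumulator: TCP may deliver a frame in arbitrary pieces,
// so chunks are appended until the predicate reports a complete frame.
class FrameBuffer {
  private pending: Uint8Array = new Uint8Array(0);

  push(chunk: Uint8Array): void {
    const next = new Uint8Array(this.pending.length + chunk.length);
    next.set(this.pending, 0);
    next.set(chunk, this.pending.length);
    this.pending = next;
  }

  // Returns the complete greeting and keeps any trailing bytes buffered,
  // or returns null if the greeting has not fully arrived yet.
  takeGreeting(): Uint8Array | null {
    const len = greetingLength(this.pending);
    if (len === null) return null;
    const frame = this.pending.slice(0, len);
    this.pending = this.pending.slice(len);
    return frame;
  }

  // Bytes received after the last complete frame (to be replayed later).
  remainder(): Uint8Array {
    return this.pending;
  }
}
```

Feeding the greeting one byte at a time, as the regression test in the log does, yields null until the last method byte arrives; anything received after that stays in the remainder for the next parsing stage.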

8 Commits