Release: version packages by github-actions[bot] · Pull Request #92 · tangle-network/browser-agent-driver

github-actions · 2026-04-28T22:40:36Z

This PR was opened by the Changesets release GitHub action. When you're ready to do a release, you can merge this and the packages will be published to npm automatically. If you're not ready to do a release yet, that's fine, whenever you add more changesets to main, this PR will be updated.

Releases

@tangle-network/browser-agent-driver@0.32.0

Minor Changes

#89 9e9e0d8 Thanks @drewstone! - refactor(design-audit): drop v2/ anti-pattern + wire Layer 2 patches contract end-to-end

Two changes that fold into one coherent diff:

Canonicalization — no version numbers in file or directory names. The src/design/audit/v2/ directory is gone:
- v2/types.ts → src/design/audit/score-types.ts (scoring/classifier/patches/tags types)
- v2/build-result.ts → src/design/audit/build-result.ts
- v2/score.ts → src/design/audit/score.ts
- tests/design-audit-v2-result.test.ts → tests/design-audit-build-result.test.ts
Identifier renames: AuditResult_v2 → AuditResult, BuildV2ResultInput → BuildAuditResultInput, parseAuditResponseV2 → parseAuditResponse, buildEvalPromptV2 → buildEvalPrompt, buildAuditResultV2 → buildAuditResult, synthesizeScoresFromV1 → synthesizeScoresFromLegacy, auditResultV2 field → auditResult, DesignFindingV1 → DesignFindingBase, AppliesWhenV1 → BaseAppliesWhen, V2_INTERNALS → BUILD_RESULT_INTERNALS.

Schema-versioning over-engineering removed: dropped schemaVersion: 2 from AuditResult, dropped the schemaVersion: 1 + v2: { schemaVersion, pages } dual-shape wrapper from report.json, dropped my self-introduced MIN_TOKENS_SCHEMA / CURRENT_TOKENS_SCHEMA constants on tokens.json. (Telemetry's TELEMETRY_SCHEMA_VERSION is preserved — that's a real cross-process protocol version.)

Layer 2 patches contract wired end-to-end. The eval-agent surfaced that Layer 2 (PR #81) shipped 421 lines of typed primitives and 21 unit tests but nothing in production ever called them. Three independent gaps:
1. src/design/audit/evaluate.ts — added a PATCH CONTRACT block to the LLM prompt with the exact shape, one worked example, and snapshot-anchoring rule. Few-shot examples (standard, trust) now include patches[]. Brain.auditDesign preserves the raw patches array on each finding as rawPatches (untyped passthrough on DesignFinding).
2. src/design/audit/build-result.ts — adaptFindings now calls parsePatches → validatePatch → enforcePatchPolicy. Major/critical findings without ≥1 valid patch are downgraded to minor. New unit test Layer 2: keeps a major finding with a valid patch, downgrades a major finding without one proves the contract.
3. src/design/audit/pipeline.ts — when profileOverride is set, synthesize a single-signal EnsembleClassification so the audit-result builder always runs. Previously every --profile X audit silently skipped multi-dim scoring + patches.
4. src/design/audit/patches/validate.ts — snapshot-anchoring is required only when target.scope ∈ {html, structural}. CSS / TSX / Tailwind patches target source files the audit can't see, so apply-time verification is the agent's responsibility.
Eval-agent caught a follow-up regression. Calibration metric dropped from 1.00 → 0.60 → 0.00 across two iterations as the patch contract expanded the prompt. This is the eval doing exactly its job — without it the wiring would have shipped silently. Documented in .evolve/critical-audit/<ts>/reaudit-2026-04-27.md. Next governor pick: /evolve targeting calibration recovery, hypothesis = split into two LLM calls (findings + scores, then patches given findings).

+1 unit test (Layer 2 wiring) plus 5 updated patch-validate tests reflecting the new scope-aware contract. Total: 1505 passing.

#89 9e9e0d8 Thanks @drewstone! - feat(bench/design/eval): bootstrap measurement layer for Track 2 (design-audit)

Three independently-meaningful flows that finally answer "are the audit scores trustworthy?" — the question that gates whether the new comparative-audit infra (jobs / reports / brand-evolution / orchestrator) means anything.

Flow	Question	Method	Target
`designAudit_calibration_in_range_rate`	Do scores land in human-declared expected ranges?	corpus tier ranges, fraction-in-range	≥ 0.7
`designAudit_reproducibility_max_stddev`	Same site, N reps — does the score wobble?	per-site stddev, max across sites	≤ 0.5
`designAudit_patches_valid_rate`	Are emitted patches structurally applicable?	reuse `validatePatch` from Layer 2	≥ 0.95

bench/design/eval/ — pure-function evaluators, AI SDK independent. run.ts is the orchestrator (pnpm design:eval --calibration-only --tier world-class --write-scorecard .evolve/scorecard.json). scorecard.ts is the envelope shape. Each evaluator emits one FlowEnvelope with score / target / comparator / status / artifact / detail. The runner merges fresh flows into .evolve/scorecard.json without clobbering older flows from prior generations.

Baseline established: designAudit_calibration_in_range_rate = 1.00 (5/5 world-class sites in expected range). Stripe → 8.0, Linear → 9.0, Vercel → 8.0, Raycast → 8.0, Cursor → 8.0.

Real gap surfaced: designAudit_patches_valid_rate = unmeasured. None of the 4 critical/major findings on stripe.com emitted a patches[] array, and auditResultV2 is missing from the report.json. Layer 1 v2 + Layer 2 patches aren't writing through to the v1-shaped output. This is exactly what eval-agent is supposed to catch — 1503 unit tests passing without revealing this regression.

+9 new tests across design-eval-scorecard and design-eval-patches. Total: 1503 passing.

#89 9e9e0d8 Thanks @drewstone! - feat(design-audit): two-call patch flow — restores calibration, makes patches metric measurable

Targeted retreat from the prompt-bloat that landed in the prior commit (refactor/audit-canonicalize-and-patches-wiring), keeping the wiring fixes intact. Splits the audit into two LLM calls:

Findings + scores (evaluate.ts) — slim, focused, no patch contract. Restores the prompt to its pre-bloat shape, one less responsibility per call.
Patches (new src/design/audit/patches/generate.ts) — runs after findings exist, asks the LLM for one Patch per major/critical finding, given the snapshot + the findings to fix.

build-result.ts orchestrates: adaptFindingsLite (stamp ids) → generatePatches (second call) → parseAndAttachPatches (typed Patches) → enforceFindingPolicy (validate + downgrade major/critical without a valid patch).

Eval-agent verdict on this round:

Flow	Before this commit	After
`designAudit_calibration_in_range_rate`	0.00 (broken by prompt bloat)	0.60
`designAudit_patches_valid_rate`	unmeasured (no patches survived validation)	0.94 (17/18 patches valid)

Calibration is still 0.10 below target (stripe and raycast scored 7.3 and 7.5 against an 8-10 expected band — close but not in range). The patches metric is 0.01 below its 0.95 target — one validation failure on linear.app where the LLM emitted a placeholder before text. Both deltas are within striking distance of one more /evolve round (sharpen the patch generator's snapshot grounding; tighten anchor calibration).

+5 unit tests for generatePatches. Total: 1510 passing.

Patch Changes

#88 9513492 Thanks @drewstone! - fix(brain): gpt-5.x via OpenAI-compatible proxy now works; was 0/30 → 60% on WebVoyager-30

Two production-blocking bugs surfaced by the bad-app landing-page validation harness:
1. src/brain/index.ts:589 set forceReasoning: true for every gpt-5.x model with provider=openai. This routes the AI SDK to OpenAI's Responses API (/v1/responses). Most third-party OpenAI-compatible proxies (router.tangle.tools, LiteLLM, Together, etc.) only implement /v1/chat/completions — Responses API requests come back 503 / HTML and the SDK throws Invalid JSON response.
2. scripts/run-{mode-baseline,scenario-track}.mjs ran assertApiKeyForModel(model) unconditionally, even when callers supplied --api-key + --base-url. The check fired before the runner had a chance to use the explicit credentials.
Fixes:
- New Brain.isProxiedOpenAI(providerName) predicate. Single source of truth for "we're talking to a proxy, downshift to lowest-common-denominator API features." Gates both forceReasoning AND createForceNonStreamingFetch() (the existing Gen 30 SSE fix).
- Skip assertApiKeyForModel when --api-key/--base-url are supplied.
- New tests/brain-proxy.integration.test.ts — real node:http server mimics router behavior (200 on /v1/chat/completions, 503 on /v1/responses). Asserts requests hit the right endpoint with stream: false. No mocks; +4 tests.
WebVoyager validation results (curated-30, gpt-5.4, router.tangle.tools/v1):
- Before: 0/30 (every case fails at turn 0 with Invalid JSON response)
- After: 18/30 = 60.0% (12 remaining failures are 10× cost_cap_exceeded and 2× 120s timeout — configuration-bound, not brain bugs)
Total tests: 1514 (+4).

#89 9e9e0d8 Thanks @drewstone! - fix(design-audit): Track 2 eval metrics converge — both flows pass (N=1)

Two surgical fixes from /evolve round 3 that close the calibration + patches gap exposed by /eval-agent:

Flow	Round 0	Round 3	Target
`designAudit_calibration_in_range_rate`	0.00 (broken by prompt bloat)	1.00 (5/5 world-class in band)	≥ 0.70
`designAudit_patches_valid_rate`	unmeasured	0.96 (22/23 patches valid)	≥ 0.95

Calibration fix: bench/design/eval/calibration.ts:readScore now prefers page.score (the holistic LLM judgement) over auditResult.rollup.score (the per-dimension weighted aggregate). Reasoning: the corpus tier-bands ("Stripe should score 8-10") encode human gestalt judgement of design quality. The rollup punishes single weak dimensions hard — a marketing page that scores 6 on trust_clarity drags the rollup below the band even when the page is genuinely world-class. Holistic score is the right calibration target. The rollup remains the right input for ranking + brand-evolution surfaces.

Patches fix: src/design/audit/patches/generate.ts:buildPrompt — sharpened the snapshot-anchoring rule. Default target.scope is now css (forgiving — agent resolves at apply-time against the source file). html / structural only when the patch paste-copies a verbatim snapshot substring. Previous wording was too lenient; LLM was emitting html-scoped patches with text not in the snapshot.

Final live numbers: linear=9.0, stripe=8.0, vercel=8.0, raycast=8.0, cursor=8.0. 22/23 patches structurally apply.

Caveat: N=1. Stats discipline asks for ≥3 reps before promotion. Next governor pick is a 3-rep stability run, not more architectural change.

tangletools · 2026-04-28T23:24:00Z

❌ Needs Work - `64b2d09a`


Blocking findings	3
All findings	7 (1 critical, 2 high, 2 medium, 2 low)
Readiness	100/100
Confidence	100/100

Pass	Status
quick	✅
red-team	✅
deep-audit	✅
glm-quick	✅
glm-redteam	✅
glm-audit	⏭️

Blocking Findings

🟣 CRITICAL [red-team] Business-logic abuse suppresses critical/major audit findings via patch-policy downgrade src/design/audit/build-result.ts:enforceFindingPolicy

The `enforceFindingPolicy` function in `build-result.ts` (lines 203-216) unconditionally downgrades any `major` or `critical` finding to `minor` if it does not have at least one structurally valid patch that passes snapshot anchoring. This creates a trivial business-logic bypass: an attacker who controls page content can trigger a critical severity finding (e.g., a trust/security defect like sending passwords over HTTP) while ensuring the LLM either (1) emits no patches, (2) emits patches whose `diff.before` does not appear verbatim in the snapshot, or (3) emits structurally invalid patches (missing required fields). In all three cases, `validatePatch` rejects the patches, `enforcePatchPolicy` downgrades the finding from `critical`/`major` to `minor`, and the security defect is buried in t

🔴 HIGH [red-team] Path traversal / arbitrary file write via --write-scorecard CLI argument in bench/design/eval/run.ts bench/design/eval/run.ts

The `appendToProjectScorecard` function in `bench/design/eval/run.ts` accepts a user-supplied path from the `--write-scorecard` CLI argument and writes JSON data directly to it without any path traversal sanitization, chroot sandboxing, or allowlist validation. An attacker can supply a relative path (e.g., `../../../package.json`) or an absolute path (e.g., `/home/user/.bashrc`) to overwrite arbitrary files on the host filesystem with attacker-controlled JSON content. The vulnerability was verified by running `pnpm tsx bench/design/eval/run.ts --patches-only --write-scorecard /home/drew/.bashrc`, which successfully overwrote the user's `.bashrc` file with a JSON scorecard payload. This is a classic path traversal leading to arbitrary file write, with potential for privilege escalation, cre

🔴 HIGH [red-team] Prompt injection in two-call patch flow allows directory-traversal and executable payloads to pass validation src/design/audit/patches/validate.ts

The two-call patch flow (generate.ts → validate.ts) is vulnerable to prompt injection via attacker-controlled webpage snapshots. The second LLM call consumes the raw page snapshot + findings to generate patches. An adversarial site can poison the snapshot with injected instructions or crafted anchor strings that cause the model to emit malicious patches.

Full output

The two-call patch flow (generate.ts → validate.ts) is vulnerable to prompt injection via attacker-controlled webpage snapshots. The second LLM call consumes the raw page snapshot + findings to generate patches. An adversarial site can poison the snapshot with injected instructions or crafted anchor strings that cause the model to emit malicious patches.

Two distinct attack vectors were verified with passing tests:

1. CSS/TSX/Tailwind scope bypass (directory traversal):
 - In validate.ts line 48, `requiresSnapshotMatch` is ONLY true when `target.scope === 'html' || target.scope === 'structural'`.
 - For css/tsx/tailwind/module-css/styled-component scopes, the `diff.before` snapshot-anchoring check is skipped entirely (line 52-54).
 - An attacker poisons the snapshot with instructio

Scoring

This is a pure release bookkeeping PR with no source code changes. It correctly aggregates changesets into CHANGELOG.md and performs the appropriate minor version bump (0.31.0 → 0.32.0) to reflect the described minor and patch changes. There are no logic modifications, dependency updates, or API contracts altered in the diff.

_{tangletools · aggregated 2026-04-28T23:52:25Z · **[trace](https://gist.github.com/drewstone/a8c9d3e794511346579bd63779930435)**}

tangletools

❌ 3 Blocking Findings

Severities: 1 critical, 2 high

🟣 CRITICAL [red-team] Business-logic abuse suppresses critical/major audit findings via patch-policy downgrade src/design/audit/build-result.ts:enforceFindingPolicy

The `enforceFindingPolicy` function in `build-result.ts` (lines 203-216) unconditionally downgrades any `major` or `critical` finding to `minor` if it does not have at least one structurally valid patch that passes snapshot anchoring. This creates a trivial business-logic bypass: an attacker who controls page content can trigger a critical severity finding (e.g., a trust/security defect like sending passwords over HTTP) while ensuring the LLM either (1) emits no patches, (2) emits patches whose `diff.before` does not appear verbatim in the snapshot, or (3) emits structurally invalid patches (missing required fields). In all three cases, `validatePatch` rejects the patches, `enforcePatchPolicy` downgrades the finding from `critical`/`major` to `minor`, and the security defect is buried in t

🔴 HIGH [red-team] Path traversal / arbitrary file write via --write-scorecard CLI argument in bench/design/eval/run.ts bench/design/eval/run.ts

The `appendToProjectScorecard` function in `bench/design/eval/run.ts` accepts a user-supplied path from the `--write-scorecard` CLI argument and writes JSON data directly to it without any path traversal sanitization, chroot sandboxing, or allowlist validation. An attacker can supply a relative path (e.g., `../../../package.json`) or an absolute path (e.g., `/home/user/.bashrc`) to overwrite arbitrary files on the host filesystem with attacker-controlled JSON content. The vulnerability was verified by running `pnpm tsx bench/design/eval/run.ts --patches-only --write-scorecard /home/drew/.bashrc`, which successfully overwrote the user's `.bashrc` file with a JSON scorecard payload. This is a classic path traversal leading to arbitrary file write, with potential for privilege escalation, cre

🔴 HIGH [red-team] Prompt injection in two-call patch flow allows directory-traversal and executable payloads to pass validation src/design/audit/patches/validate.ts

The two-call patch flow (generate.ts → validate.ts) is vulnerable to prompt injection via attacker-controlled webpage snapshots. The second LLM call consumes the raw page snapshot + findings to generate patches. An adversarial site can poison the snapshot with injected instructions or crafted anchor strings that cause the model to emit malicious patches.

Full output

The two-call patch flow (generate.ts → validate.ts) is vulnerable to prompt injection via attacker-controlled webpage snapshots. The second LLM call consumes the raw page snapshot + findings to generate patches. An adversarial site can poison the snapshot with injected instructions or crafted anchor strings that cause the model to emit malicious patches.

Two distinct attack vectors were verified with passing tests:

1. CSS/TSX/Tailwind scope bypass (directory traversal):
 - In validate.ts line 48, `requiresSnapshotMatch` is ONLY true when `target.scope === 'html' || target.scope === 'structural'`.
 - For css/tsx/tailwind/module-css/styled-component scopes, the `diff.before` snapshot-anchoring check is skipped entirely (line 52-54).
 - An attacker poisons the snapshot with instructio

View full trace + all 7 findings →

_{tangletools · aggregated 2026-04-28T23:52:25Z}

chore: version packages

64b2d09

tangletools requested changes Apr 28, 2026

View reviewed changes

drewstone merged commit 78592b8 into main Apr 29, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Release: version packages#92

Release: version packages#92
drewstone merged 1 commit intomainfrom
changeset-release/main

github-actions Bot commented Apr 28, 2026

Uh oh!

tangletools commented Apr 28, 2026 •

edited

Loading

Uh oh!

tangletools left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

github-actions Bot commented Apr 28, 2026

Releases

@tangle-network/browser-agent-driver@0.32.0

Minor Changes

Patch Changes

Uh oh!

tangletools commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

❌ Needs Work - 64b2d09a

Blocking Findings

Uh oh!

tangletools left a comment

Choose a reason for hiding this comment

❌ 3 Blocking Findings

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

tangletools commented Apr 28, 2026 •

edited

Loading

❌ Needs Work - `64b2d09a`