research(wv590): adaptive max-turns extension + cycle-1 hot-spots rejection record by drewstone · Pull Request #91 · tangle-network/browser-agent-driver

drewstone · 2026-04-28T22:38:11Z

TL;DR

First research cycle from `bench/research/wv590-hot-spots.json` against the 2026-04-28 WebVoyager-590 baseline (90.8%). Tested two prompt rules and one runner change as a combined treatment on the 79 booking + flights cases. The combined treatment was net-negative, so the prompt rules are reverted. The runner change is kept as a no-op safety net, and the research history is persisted for cycle 2.

What's in this PR

| Change | Status | Rationale |
|---|---|---|
| `src/run-state.ts`: `RunState.lastProgressTurn` tracker | kept | Innocuous; only used by the extension below. |
| `src/runner/runner.ts`: adaptive max-turns extension (15 → up to 25 when the last 3 turns showed URL/snapshot progress) | kept | Inconclusive: fired 1× in 57 cases because most fails happen before the cap. No negative impact observed. Future-proofs cycle 2 once the earlier bail-out is fixed. |
| `src/brain/index.ts`: HEAVY_PAGE_RULES rules 25-27 | reverted | Combined treatment caused -17.5pp on booking and -19.0pp on flights. Rule #27 (state-regression detection) is the likely culprit: it gave the agent permission to bail without offering a working alternative strategy. |
| `bench/external/webvoyager/convert-tasks.mjs`: default timeout 120s → 300s | kept | Locks in the timeout fix shipped earlier in this WV-590 work; prevents `pnpm webbench:import` from regenerating cases.json with the old 120s floor. |
| `bench/research/wv590-hot-spots.json`: queue with cycle-1 annotations | kept | Public research history. Future cycles read `priority: 99` (rejected), `priority: 50` (inconclusive), and `result` blocks to avoid re-running. |
| `tests/run-state.test.ts` (+2 tests) | kept | Predicate behavior for the extension gate. |
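
The extension gate in the table above can be sketched roughly as follows. This is a minimal illustration built from the numbers stated in this PR (3-turn lookback, +5 turns, absolute cap of 25, vision mode excluded, one grant per run), not the code in `src/runner/runner.ts`. `RunStateLike`, `ExtensionDecision`, and `decideExtension` are hypothetical names.

```typescript
// Hypothetical shapes: only lastProgressTurn and extensionGranted are named in the PR.
interface RunStateLike {
  lastProgressTurn: number; // initialised to -Infinity, i.e. "no progress seen yet"
}

interface ExtensionDecision {
  maxTurns: number; // the cap to use from now on
  granted: number;  // turns actually added (0 when the gate stays closed)
}

const LOOKBACK_TURNS = 3;  // progress must be this recent
const EXTENSION_TURNS = 5; // one-time grant
const ABSOLUTE_CAP = 25;   // never extend beyond this

function decideExtension(
  state: RunStateLike,
  maxTurns: number,
  opts: { extensionGranted: boolean; visionMode: boolean },
): ExtensionDecision {
  const recentProgress = state.lastProgressTurn >= maxTurns - LOOKBACK_TURNS;
  if (opts.extensionGranted || opts.visionMode || !recentProgress) {
    return { maxTurns, granted: 0 };
  }
  // Clamp to the absolute cap, but never shrink an already larger budget.
  const extended = Math.max(maxTurns, Math.min(maxTurns + EXTENSION_TURNS, ABSOLUTE_CAP));
  return { maxTurns: extended, granted: extended - maxTurns };
}
```

With the default of 15 turns the grant is always the full +5. The cap only shortens it when a run is configured above 20 turns, for example 22 → 25 grants 3. Returning `granted` alongside the new cap lets a caller report the number of turns actually borrowed.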

Cycle-1 result data

```
                 Baseline          Treatment         Δ
booking          25/40 = 62.5%     18/40 = 45.0%     -17.5pp
google-flights   12/39 = 30.8%      2/17 = 11.8%     -19.0pp
combined         37/79 = 46.8%     20/57 = 35.1%     -11.7pp

(1 cred-fail aborted run at 57/79; cost ~$32)
```

Both subsets are clearly negative, beyond the run-to-run variance band on this benchmark. REJECT per the /research skill criteria.
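
For readers who want to re-derive the deltas, the arithmetic behind the scorecard is sketched below. The tallies are copied from the table above; `Tally`, `passRate`, and `deltaPp` are illustrative names, not part of the repo. Note that the treatment denominators are smaller (17 flights cases, 57 combined) because the run aborted at 57/79, so each rate is computed over the cases that actually ran.

```typescript
interface Tally {
  passed: number;
  total: number;
}

// Pass rate in percent.
const passRate = (t: Tally): number => (100 * t.passed) / t.total;

// Round to one decimal place, matching the precision used in the scorecard.
const round1 = (x: number): number => Math.round(x * 10) / 10;

// Treatment minus baseline, in percentage points.
function deltaPp(baseline: Tally, treatment: Tally): number {
  return round1(passRate(treatment) - passRate(baseline));
}

const booking = deltaPp({ passed: 25, total: 40 }, { passed: 18, total: 40 });
const flights = deltaPp({ passed: 12, total: 39 }, { passed: 2, total: 17 });
const combined = deltaPp({ passed: 37, total: 79 }, { passed: 20, total: 57 });

console.log(booking, flights, combined); // -17.5 -19 -11.7
```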

Diagnosis

Fail-verdict patterns make the cause unambiguous: the agent bails more cleanly and earlier than baseline instead of pushing through. Recurring lines include:

"Booking.com repeatedly failed to navigate off the homepage"
"Date dialog is open, could not complete within remaining turns"

Rule #27 told the agent "if state regressed, switch strategy" but the only documented fallback (URL-direct) is blocked by booking. Net effect: more cleanly-bailed cases. Lesson: prompt rules that tell the agent when to give up MUST pair with concrete actions the agent can take.

Cycle-2 plan (queued, not in this PR)

Architectural — not prompt — fix:

When booking-style state-regression is detected (URL bounced to root after progressed flow), the runner spawns a fresh browser context (new cookies, new fingerprint) and retries via URL-direct in the new context. The brain stays out of recovery decisions; the runner does it deterministically.

~150 LOC across the runner's recovery system, separate session from this PR.
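
To make the cycle-2 idea concrete, here is a rough sketch of the two pieces it needs: a deterministic regression check and a fresh-context retry. Everything here is an assumption for illustration. `PageLike`, `ContextLike`, and `BrowserLike` are hypothetical minimal interfaces standing in for whatever browser API the runner uses, and the "bounced to root" test is one possible reading of the rule above. Fingerprint rotation is out of scope for the sketch.

```typescript
// Hypothetical minimal browser surface; the runner's real types are not shown in this PR.
interface PageLike {
  goto(url: string): Promise<void>;
}
interface ContextLike {
  newPage(): Promise<PageLike>;
  close(): Promise<void>;
}
interface BrowserLike {
  newContext(): Promise<ContextLike>;
}

const isRoot = (u: URL): boolean => u.pathname === "/" && u.search === "";

// True when the latest URL is the site root but an earlier URL on the same
// origin was deeper in the flow, i.e. the site bounced the agent back home.
function isStateRegression(urlHistory: string[]): boolean {
  if (urlHistory.length < 2) return false;
  const current = new URL(urlHistory[urlHistory.length - 1]);
  if (!isRoot(current)) return false;
  return urlHistory.slice(0, -1).some((raw) => {
    const prev = new URL(raw);
    return prev.origin === current.origin && !isRoot(prev);
  });
}

// Open a brand-new context (fresh cookies) and load the direct URL there.
// The caller owns the returned context and must close it when done.
async function retryInFreshContext(
  browser: BrowserLike,
  directUrl: string,
): Promise<{ context: ContextLike; page: PageLike }> {
  const context = await browser.newContext();
  try {
    const page = await context.newPage();
    await page.goto(directUrl);
    return { context, page };
  } catch (err) {
    await context.close(); // do not leak the context if navigation fails
    throw err;
  }
}
```

Keeping the check as a pure function over URL history is what makes the recovery deterministic: the runner can decide to retry without asking the brain.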

Verification

- `pnpm lint` clean
- `pnpm test` 1516/1516 pass (+2 new for lastProgressTurn predicate)
- `pnpm check:boundaries` clean (157 files)
- Cycle-1 result data persisted in `bench/research/wv590-hot-spots.json` (full per-hypothesis annotations + scorecard)

Why merge a net-neutral PR

Two reasons:

1. The research history is the durable artifact. Future cycles must not re-run rejected hypotheses, and the annotated queue ensures that. It is lost if the branch is closed.
2. The runner's adaptive max-turns extension is innocuous and might pay off in cycle 2. Once the earlier bail-out is fixed, this safety net could convert the cases that currently bail at turn 14 into successes by granting +5 turns.
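
As an illustration of the first point, a future cycle could filter the annotated queue along these lines. The JSON schema is not shown in this PR, so the `Hypothesis` shape is an assumption: only the `priority` and `result` fields and the 99/50 conventions come from the text. The `url-direct-site-profiles` id is made up for the example.

```typescript
// Assumed shape of one queue entry; the real schema may differ.
interface Hypothesis {
  id: string;
  priority: number;             // lower runs first; 50 = inconclusive, 99 = rejected
  result?: { verdict: string }; // present once a cycle has annotated the entry
}

const INCONCLUSIVE = 50;

// Entries still worth running: not demoted and not yet annotated with a result.
function runnable(queue: Hypothesis[]): Hypothesis[] {
  return queue
    .filter((h) => h.priority < INCONCLUSIVE && h.result === undefined)
    .sort((a, b) => a.priority - b.priority);
}

const queue: Hypothesis[] = [
  { id: "calendar-month-nav-macro", priority: 99, result: { verdict: "rejected" } },
  { id: "state-regression-detection", priority: 99, result: { verdict: "rejected" } },
  { id: "max-turns-by-flow-complexity", priority: 50, result: { verdict: "inconclusive" } },
  { id: "url-direct-site-profiles", priority: 2 }, // hypothetical id
];

console.log(runnable(queue).map((h) => h.id)); // [ 'url-direct-site-profiles' ]
```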

If the team prefers to throw away the runner change and keep only the research queue, I can split the PR. Default is merging as-is.

… guidance

Closes 2 of 7 hypotheses from bench/research/wv590-hot-spots.json, the queue derived from the 2026-04-28 WebVoyager-590 baseline (536/590 = 90.8%; 78% of fails on booking + google-flights date pickers).

Hypothesis #5: adaptive max-turns (priority 5, parameter-tuning, expected +2-4pp)

21 of 54 fails were "agent_gave_up_at_max_turns" mid-flow on booking + google-flights, where agent-was-progressing reads unambiguously from the trace. Static maxTurns=15 cut them off.

Implementation

- src/run-state.ts: new RunState.lastProgressTurn (init -Infinity)
- src/runner/runner.ts: progress detection in the observe-completed emit path (URL change OR snapshot byte delta > 5%)
- src/runner/runner.ts: maxTurns is now `let` not `const`; at the cap boundary, if lastProgressTurn ≥ maxTurns - 3, grant a one-time +5 extension (capped absolute at 25). Vision-mode runs are excluded (already get +5 baseline). Cascading extensions blocked via extensionGranted.
- bus emits a recovery-fired event with strategy "max-turns-extension" so traces are honest about borrowed turns.

Anti-overfitting

The 5% byte-delta floor was chosen so decorative animations and dynamic-id reshuffles don't trip the predicate. The extension requires recent (≤3 turn) progress, not just a one-shot DOM change at turn 1, so it doesn't reward stuck loops.

Hypothesis #1: dialog/calendar nav guidance (priority 1, prompt change, expected +5-7pp)

27/27 google-flights fails involve the date-picker; ~10/15 booking fails are calendar-month-navigation-stuck. Both share the same failure mode: agent clicks "next month" once per turn, burning the turn budget navigating from "April 2026" to "December 2026".

Implementation

src/brain/index.ts HEAVY_PAGE_RULES adds rules 25-27:

25. CALENDAR/DATE-PICKER: chain N "next month" clicks via nextActions (micro-plan) in ONE turn instead of one click per turn
26. DIALOG-STATE AWARENESS: complete or dismiss the dialog; don't waste turns clicking outside it
27. STATE-REGRESSION DETECTION: if search results disappeared and you're back at the homepage, switch strategy

These rules are added to SYSTEM_PROMPT (the per-turn agent prompt). URL_FIRST_RULES is only in the planner prompt and doesn't reach per-turn decisions.

Tests

tests/run-state.test.ts: +2 tests for lastProgressTurn predicate behavior (initial -Infinity; lookback-window matching). Existing 1514 tests unchanged; 1516/1516 total pass. Boundary check 157/157 files clean.

Next steps (separate cycles)

Run --two-stage screen on the 79 booking+flights cases (~$200) to validate combined +5-10pp signal. If wins, full WebVoyager-590 re-baseline (~$200) confirms. Then implement #2 (URL-direct site profiles) as the next bigger architectural lever, which could push toward 96%+.
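
The progress predicate from the first Implementation list (URL change OR snapshot byte delta > 5%) might look like the sketch below. It assumes the delta is the relative change in snapshot size between consecutive observations, which the commit message does not spell out. `Observation` and `madeProgress` are hypothetical names.

```typescript
const SNAPSHOT_DELTA_FLOOR = 0.05; // the 5% floor from the commit message

// Hypothetical per-turn observation: current URL plus snapshot size in bytes.
interface Observation {
  url: string;
  snapshotBytes: number;
}

function madeProgress(prev: Observation, next: Observation): boolean {
  if (prev.url !== next.url) return true; // navigation always counts
  if (prev.snapshotBytes === 0) return next.snapshotBytes > 0; // avoid dividing by zero
  const delta = Math.abs(next.snapshotBytes - prev.snapshotBytes) / prev.snapshotBytes;
  return delta > SNAPSHOT_DELTA_FLOOR; // strictly greater, so exactly 5% does not count
}
```

Under this reading, a 1000-byte snapshot growing to 1050 bytes is not progress, while 1051 bytes is. A runner would set `lastProgressTurn` to the current turn whenever the predicate returns true.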

…ep adaptive max-turns

Cycle 1 ran the combined treatment (rules 25-27 added to SYSTEM_PROMPT + adaptive max-turns runner change) on the 79 hot-spot cases.

Result:

- booking 18/40 = 45.0% vs 25/40 = 62.5% baseline, -17.5pp
- google-flights 2/17 = 11.8% vs 12/39 = 30.8% baseline, -19.0pp
- combined 20/57 = 35.1% vs 37/79 = 46.8% baseline, -11.7pp

(1 cred-fail aborted run at 57/79)

REJECT: both subsets clearly negative beyond run-to-run variance.

Diagnosis from fail verdicts: agent is bailing CLEANER and EARLIER than baseline, not grinding through. Verdict patterns like "Booking.com repeatedly failed to navigate off the homepage" and "date dialog is open, could not complete" recur. Rule #27 (state-regression detection) likely gave the agent permission to give up when the only fallback strategy (URL-direct) is blocked by the target site.

Reverted

src/brain/index.ts: HEAVY_PAGE_RULES rules 25-27 reverted to baseline (no calendar/dialog/state-regression guidance). Prompt rules that tell the agent when to bail without offering a working alternative are net-negative.

Kept

src/run-state.ts + src/runner/runner.ts adaptive max-turns extension. Fired 1× in 57 cases (most fails happen before the cap), so it's INCONCLUSIVE rather than negative. The safety net is innocuous in the worst case and may help in cycle 2 once we fix the earlier bail-out.

Annotated

bench/research/wv590-hot-spots.json with the cycle-1 results: calendar-month-nav-macro and state-regression-detection demoted to priority=99 (rejected), max-turns-by-flow-complexity demoted to priority=50 (inconclusive). Future cycles read these annotations to avoid re-running rejected hypotheses.

Cycle-2 hypothesis (queued, not yet implemented):

The fundamental problem isn't prompt-level guidance; it's that booking redirects to homepage with no working recovery. Real fix needs a new "fresh-session retry" action: when state regresses to homepage, the runner spawns a new browser context and retries the search via URL-direct on the fresh session. Architectural change, not prompt change. Pairs with adaptive max-turns since retries consume turns.

tangletools · 2026-04-28T23:00:44Z

❌ Needs Work - `e4df01ee`


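| Metric | Value |
|---|---|
| Blocking findings | 4 |
| All findings | 9 (2 critical, 2 high, 4 medium, 1 low) |
| Readiness | 85/100 |
| Confidence | 85/100 |

| Pass | Status |
|---|---|
| quick | ✅ |
| red-team | ✅ |
| deep-audit | ✅ |
| glm-quick | ✅ |
| glm-redteam | ✅ |
| glm-audit | ⏭️ |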

Blocking Findings

🟣 CRITICAL [deep-audit] Hardcoded secrets in runner core src/runner/runner.ts

Pre-scan detected hardcoded-secret(2) in runner.ts. Embedded credentials in source control bypass secret rotation and expose keys to anyone with repo or package access.

🟣 CRITICAL [deep-audit] SQL injection surface in runner core src/runner/runner.ts

Pre-scan flagged sql-injection(1). Even a single unsanitized query in a browser-automation runner—often used for telemetry, local state, or task queues—can lead to data exfiltration or remote code execution.

🔴 HIGH [deep-audit] Excessive authentication surface violates architecture boundaries src/runner/runner.ts

auth-surface(118) indicates an unusually large number of authentication-related code paths in a single file. This concentration increases the probability of bypasses, session fixation, or privilege escalation and likely fails the check:boundaries gate.

🔴 HIGH [deep-audit] Excessive cryptographic surface in runner core src/runner/runner.ts

crypto-surface(107) signals extensive direct use of low-level cryptographic primitives. This raises the risk of algorithm misuse, nonce reuse, weak key derivation, or side-channel leakage.


1 additional finding

🟡 LOW [deep-audit] Extended default timeout widens resource-exposure window src/runner.ts (entry script)

PR #91 raised the default timeout from 120s to 300s. While this reduces false negatives on long-page sites, it also increases the window for resource exhaustion and orphaned browser contexts if cleanup is imperfect.

Scoring

The adaptive max-turns extension is cleanly gated and well-commented, and the new RunState predicate is unit-tested. However, the final failure reason in runner.ts hardcodes '+5' even when the hard cap reduces the actual extension, which will produce inaccurate benchmark telemetry.

tangletools · aggregated 2026-04-28T23:35:04Z · **[trace](https://gist.github.com/drewstone/6136a664d613a2b5662eb3ab72c86097)**

tangletools

❌ 4 Blocking Findings

Severities: 2 critical, 2 high

tangletools · aggregated 2026-04-28T23:35:04Z

drewstone added 2 commits April 28, 2026 15:35

drewstone merged commit 2109e0d into main Apr 28, 2026
5 checks passed

tangletools requested changes Apr 28, 2026

drewstone merged 2 commits into main from research/wv590-adaptive-turns-and-calendar-nav
