… guidance

Closes 2 of 7 hypotheses from bench/research/wv590-hot-spots.json, the queue derived from the 2026-04-28 WebVoyager-590 baseline (536/590 = 90.8%; 78% of fails on booking + google-flights date pickers).

Hypothesis #5: adaptive max-turns (priority 5, parameter-tuning, expected +2-4pp)

21 of 54 fails were "agent_gave_up_at_max_turns" mid-flow on booking + google-flights, where agent-was-progressing reads unambiguously from the trace. Static maxTurns=15 cut them off.

Implementation
- src/run-state.ts: new RunState.lastProgressTurn (init -Infinity)
- src/runner/runner.ts: progress detection in the observe-completed emit path (URL change OR snapshot byte delta > 5%)
- src/runner/runner.ts: maxTurns is now `let` not `const`; at the cap boundary, if lastProgressTurn ≥ maxTurns - 3, grant a one-time +5 extension (capped absolute at 25). Vision-mode runs are excluded (already get +5 baseline). Cascading extensions blocked via extensionGranted.
- bus emits a recovery-fired event with strategy "max-turns-extension" so traces are honest about borrowed turns.

Anti-overfitting

The 5% byte-delta floor was chosen so decorative animations and dynamic-id reshuffles don't trip the predicate. The extension requires recent (≤3 turn) progress, not just a one-shot DOM change at turn 1, so it doesn't reward stuck loops.

Hypothesis #1: dialog/calendar nav guidance (priority 1, prompt change, expected +5-7pp)

27/27 google-flights fails involve the date-picker; ~10/15 booking fails are calendar-month-navigation-stuck. Both share the same failure mode: agent clicks "next month" once per turn, burning the turn budget navigating from "April 2026" to "December 2026".

Implementation
- src/brain/index.ts HEAVY_PAGE_RULES adds rules 25-27:
  25. CALENDAR/DATE-PICKER: chain N "next month" clicks via nextActions (micro-plan) in ONE turn instead of one click per turn
  26. DIALOG-STATE AWARENESS: complete or dismiss the dialog; don't waste turns clicking outside it
  27. STATE-REGRESSION DETECTION: if search results disappeared and you're back at the homepage, switch strategy

These rules are added to SYSTEM_PROMPT (the per-turn agent prompt); URL_FIRST_RULES is only in the planner prompt and doesn't reach per-turn decisions.

Tests
- tests/run-state.test.ts: +2 tests for lastProgressTurn predicate behavior (initial -Infinity; lookback-window matching).
- Existing 1514 tests unchanged; 1516/1516 total pass.
- Boundary check 157/157 files clean.

Next steps (separate cycles)

Run --two-stage screen on the 79 booking+flights cases (~$200) to validate combined +5-10pp signal. If wins, full WebVoyager-590 re-baseline (~$200) confirms. Then implement #2 (URL-direct site profiles) as the next bigger architectural lever; could push toward 96%+.
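The adaptive max-turns gate described in this commit can be sketched roughly as below. This is a minimal illustration, not the code in src/runner/runner.ts: only `lastProgressTurn`, `extensionGranted`, and `maxTurns` are named in the commit message. The `Observation` shape, the helper names `isProgress` and `maybeExtend`, and the choice to measure the byte delta relative to the previous snapshot size are assumptions.

```typescript
// Hypothetical shapes; only lastProgressTurn and extensionGranted come from the commit text.
interface RunState {
  lastProgressTurn: number; // -Infinity until the first progress signal
  extensionGranted: boolean; // blocks cascading extensions
}

interface Observation {
  url: string;
  snapshotBytes: number;
}

const PROGRESS_DELTA = 0.05; // 5% byte-delta floor
const LOOKBACK_TURNS = 3; // progress must be within the last 3 turns
const EXTENSION_TURNS = 5; // one-time +5
const ABSOLUTE_CAP = 25; // hard ceiling on maxTurns

// Progress = URL change OR snapshot size change of more than 5%.
// Assumption: the delta is relative to the previous snapshot size.
function isProgress(prev: Observation, next: Observation): boolean {
  if (prev.url !== next.url) return true;
  if (prev.snapshotBytes === 0) return next.snapshotBytes > 0;
  const delta =
    Math.abs(next.snapshotBytes - prev.snapshotBytes) / prev.snapshotBytes;
  return delta > PROGRESS_DELTA;
}

// Called at the cap boundary; returns the (possibly extended) maxTurns.
function maybeExtend(
  state: RunState,
  maxTurns: number,
  visionMode: boolean,
): number {
  // Vision-mode runs are excluded, and only one extension is ever granted.
  if (visionMode || state.extensionGranted) return maxTurns;
  // Require recent progress; -Infinity (no progress yet) never qualifies.
  if (state.lastProgressTurn < maxTurns - LOOKBACK_TURNS) return maxTurns;
  const extended = Math.min(maxTurns + EXTENSION_TURNS, ABSOLUTE_CAP);
  if (extended <= maxTurns) return maxTurns; // already at or above the cap
  state.extensionGranted = true;
  return extended;
}
```

With the static default of 15 turns, progress at turn 12 or later qualifies for the extension to 20; a run whose only progress was at turn 1 does not.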
…ep adaptive max-turns

Cycle 1 ran the combined treatment (rules 25-27 added to SYSTEM_PROMPT + adaptive max-turns runner change) on the 79 hot-spot cases.

Result:
  booking         18/40 = 45.0%  vs  25/40 = 62.5% baseline  -17.5pp
  google-flights   2/17 = 11.8%  vs  12/39 = 30.8% baseline  -19.0pp
  combined        20/57 = 35.1%  vs  37/79 = 46.8% baseline  -11.7pp
  (1 cred-fail aborted run at 57/79)

REJECT: both subsets clearly negative beyond run-to-run variance.

Diagnosis from fail verdicts: agent is bailing CLEANER and EARLIER than baseline, not grinding through. Verdict patterns like "Booking.com repeatedly failed to navigate off the homepage" and "date dialog is open, could not complete" recur. Rule #27 (state-regression detection) likely gave the agent permission to give up when the only fallback strategy (URL-direct) is blocked by the target site.

Reverted
- src/brain/index.ts: HEAVY_PAGE_RULES rules 25-27 reverted to baseline (no calendar/dialog/state-regression guidance). Prompt rules that tell the agent when to bail without offering a working alternative are net-negative.

Kept
- src/run-state.ts + src/runner/runner.ts adaptive max-turns extension. Fired 1× in 57 cases (most fails happen before the cap), so it's INCONCLUSIVE rather than negative. The safety net is innocuous in the worst case and may help in cycle 2 once we fix the earlier bail-out.

Annotated bench/research/wv590-hot-spots.json with the cycle-1 results: calendar-month-nav-macro and state-regression-detection demoted to priority=99 (rejected), max-turns-by-flow-complexity demoted to priority=50 (inconclusive). Future cycles read these annotations to avoid re-running rejected hypotheses.

Cycle-2 hypothesis (queued, not yet implemented):

The fundamental problem isn't prompt-level guidance; it's that booking redirects to homepage with no working recovery. Real fix needs a new "fresh-session retry" action: when state regresses to homepage, the runner spawns a new browser context and retries the search via URL-direct on the fresh session. Architectural change, not prompt change. Pairs with adaptive max-turns since retries consume turns.
❌ Needs Work

| | |
|---|---|
| Blocking findings | 4 |
| All findings | 9 (2 critical, 2 high, 4 medium, 1 low) |
| Readiness | 85/100 |
| Confidence | 85/100 |

| Pass | Status |
|---|---|
| quick | ✅ |
| red-team | ✅ |
| deep-audit | ✅ |
| glm-quick | ✅ |
| glm-redteam | ✅ |
| glm-audit | ⏭️ |
Blocking Findings
🟣 CRITICAL [deep-audit] Hardcoded secrets in runner core src/runner/runner.ts
Pre-scan detected hardcoded-secret(2) in runner.ts. Embedded credentials in source control bypass secret rotation and expose keys to anyone with repo or package access.
🟣 CRITICAL [deep-audit] SQL injection surface in runner core src/runner/runner.ts
Pre-scan flagged sql-injection(1). Even a single unsanitized query in a browser-automation runner—often used for telemetry, local state, or task queues—can lead to data exfiltration or remote code execution.
🔴 HIGH [deep-audit] Excessive authentication surface violates architecture boundaries src/runner/runner.ts
auth-surface(118) indicates an unusually large number of authentication-related code paths in a single file. This concentration increases the probability of bypasses, session fixation, or privilege escalation and likely fails the check:boundaries gate.
🔴 HIGH [deep-audit] Excessive cryptographic surface in runner core src/runner/runner.ts
crypto-surface(107) signals extensive direct use of low-level cryptographic primitives. This raises the risk of algorithm misuse, nonce reuse, weak key derivation, or side-channel leakage.
1 additional finding
🟡 LOW [deep-audit] Extended default timeout widens resource-exposure window src/runner.ts (entry script)
PR #91 raised the default timeout from 120s to 300s. While this reduces false negatives on long-page sites, it also increases the window for resource exhaustion and orphaned browser contexts if cleanup is imperfect.
Scoring
The adaptive max-turns extension is cleanly gated and well-commented, and the new RunState predicate is unit-tested. However, the final failure reason in runner.ts hardcodes '+5' even when the hard cap reduces the actual extension, which will produce inaccurate benchmark telemetry.
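One way to address the telemetry note above is to derive the reported extension from the capped value instead of a literal '+5'. The sketch below is illustrative only: the function names and the failure-reason wording are hypothetical, and the constants simply restate the +5 extension and the absolute cap of 25 from the commit message.

```typescript
const EXTENSION_TURNS = 5; // nominal one-time extension
const ABSOLUTE_CAP = 25; // hard ceiling on maxTurns

// Turns actually granted once the absolute cap is applied.
function grantedExtension(baseMaxTurns: number): number {
  const extended = Math.min(baseMaxTurns + EXTENSION_TURNS, ABSOLUTE_CAP);
  return Math.max(0, extended - baseMaxTurns);
}

// Hypothetical failure-reason formatter; the message format is an assumption.
function failureReason(baseMaxTurns: number): string {
  const granted = grantedExtension(baseMaxTurns);
  const total = baseMaxTurns + granted;
  return `agent_gave_up_at_max_turns (base ${baseMaxTurns}, +${granted} extension, ${total} total)`;
}
```

For a run configured with maxTurns=22, this reports a +3 extension (22 to 25) rather than +5, so benchmark traces match the turns the run actually received.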
tangletools
❌ 4 Blocking Findings
Severities: 2 critical, 2 high
tangletools · aggregated 2026-04-28T23:35:04Z
TL;DR
First research cycle from `bench/research/wv590-hot-spots.json` against the 2026-04-28 WebVoyager-590 baseline (90.8%). Tested two prompt rules + one runner change as a combined treatment on the 79 booking + flights cases. The combined treatment was net-negative: the prompt rules are reverted, the runner change is kept as a no-op safety net, and the research history is persisted for cycle 2.
What's in this PR
Cycle-1 result data
```
                 Baseline          Treatment         Δ
booking          25/40 = 62.5%     18/40 = 45.0%     -17.5pp
google-flights   12/39 = 30.8%      2/17 = 11.8%     -19.0pp
combined         37/79 = 46.8%     20/57 = 35.1%     -11.7pp
(1 cred-fail aborted run at 57/79; cost ~$32)
```
Both subsets are clearly negative, beyond the run-to-run variance band on this benchmark. REJECT per the /research skill criteria.
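The percentage-point deltas can be reproduced from the raw pass counts. The snippet below is a small worked check; the `Tally` type and helper names are made up for illustration. Note that the treatment denominators are smaller than the baseline ones (17 vs 39 for google-flights, 57 vs 79 combined) because the treatment run aborted at 57/79.

```typescript
interface Tally {
  passed: number;
  total: number;
}

// Pass rate as a percentage.
const pct = (t: Tally): number => (100 * t.passed) / t.total;

// Round to one decimal place, matching the precision used in the table.
const round1 = (x: number): number => Math.round(x * 10) / 10;

// Treatment minus baseline, in percentage points.
function deltaPp(baseline: Tally, treatment: Tally): number {
  return round1(pct(treatment) - pct(baseline));
}

const booking = deltaPp({ passed: 25, total: 40 }, { passed: 18, total: 40 });
const flights = deltaPp({ passed: 12, total: 39 }, { passed: 2, total: 17 });
const combined = deltaPp({ passed: 37, total: 79 }, { passed: 20, total: 57 });

console.log({ booking, flights, combined });
```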
Diagnosis
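Fail-verdict patterns make the cause unambiguous: the agent bails CLEANER and EARLIER than baseline rather than pushing through. Recurring lines like:
- "Booking.com repeatedly failed to navigate off the homepage"
- "date dialog is open, could not complete"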
Rule #27 told the agent "if state regressed, switch strategy" but the only documented fallback (URL-direct) is blocked by booking. Net effect: more cleanly-bailed cases. Lesson: prompt rules that tell the agent when to give up MUST pair with concrete actions the agent can take.
Cycle-2 plan (queued, not in this PR)
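Architectural fix, not a prompt fix: a new "fresh-session retry" action. When state regresses to the homepage, the runner spawns a new browser context and retries the search via URL-direct on the fresh session. It pairs with adaptive max-turns since retries consume turns.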
~150 LOC across the runner's recovery system, separate session from this PR.
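The decision side of the proposed fresh-session retry could look something like the sketch below. Nothing here exists in the codebase yet and every name is hypothetical. The retry limit and the per-retry turn cost are assumptions chosen for illustration; the real runner would also need the browser-context plumbing that this pure function leaves out. In line with the cycle-1 lesson, the fallback is to continue rather than to bail.

```typescript
// Every name below is hypothetical; the PR describes this idea only in prose.
type RecoveryAction =
  | { kind: "continue" }
  | { kind: "fresh-session-retry"; url: string };

interface RecoveryInput {
  regressedToHomepage: boolean; // state-regression signal from the runner
  directUrl: string | null; // URL-direct search link, if one can be built
  turnsLeft: number; // remaining turn budget, including any extension
  retriesUsed: number; // fresh-session retries already spent in this run
}

const MAX_FRESH_RETRIES = 1; // assumption: at most one retry per run
const RETRY_TURN_COST = 2; // assumption: new context + navigation

// Decide whether to spawn a fresh browser context and retry via URL-direct.
function planRecovery(input: RecoveryInput): RecoveryAction {
  if (!input.regressedToHomepage) return { kind: "continue" };
  if (input.directUrl === null) return { kind: "continue" };
  if (input.retriesUsed >= MAX_FRESH_RETRIES) return { kind: "continue" };
  if (input.turnsLeft < RETRY_TURN_COST) return { kind: "continue" };
  return { kind: "fresh-session-retry", url: input.directUrl };
}
```

Keeping the decision separate from the browser calls makes it unit-testable in the same style as the lastProgressTurn predicate, and makes the interaction with the turn budget explicit.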
Verification
Why merge a net-neutral PR
Two reasons:
If the team prefers to throw away the runner change and keep only the research queue, I can split the PR. Default is merging as-is.