
research(wv590): adaptive max-turns extension + cycle-1 hot-spots rejection record#91

Merged
drewstone merged 2 commits into main from research/wv590-adaptive-turns-and-calendar-nav
Apr 28, 2026

@drewstone

TL;DR

First research cycle from `bench/research/wv590-hot-spots.json` against the 2026-04-28 WebVoyager-590 baseline (90.8%). Tested two prompt rules plus one runner change as a combined treatment on the 79 booking + flights cases. The combined treatment was net-negative, so the prompt rules are reverted, the runner change is kept as a no-op safety net, and the research history is persisted for cycle 2.

What's in this PR

| Change | Status | Rationale |
|---|---|---|
| `src/run-state.ts`: `RunState.lastProgressTurn` tracker | kept | Innocuous; only used by the extension below |
| `src/runner/runner.ts`: adaptive max-turns extension (15 → up-to-25 when last 3 turns showed URL/snapshot progress) | kept | Inconclusive (fired 1× in 57 cases; most fails happen before the cap), no negative impact observed. Future-proofs cycle 2 once the earlier bail-out is fixed. |
| `src/brain/index.ts`: HEAVY_PAGE_RULES rules 25-27 | reverted | Combined treatment caused -17.5pp on booking and -19pp on flights. Rule #27 (state-regression detection) is the likely culprit: it gave the agent permission to bail without offering a working alternative strategy. |
| `bench/external/webvoyager/convert-tasks.mjs`: default timeout 120s → 300s | kept | Locks in the timeout fix shipped earlier in this WV-590 work; prevents `pnpm webbench:import` from regenerating cases.json with the old 120s floor. |
| `bench/research/wv590-hot-spots.json`: queue with cycle-1 annotations | kept | Public research history. Future cycles read `priority: 99` (rejected), `priority: 50` (inconclusive), and `result` blocks to avoid re-running. |
| `tests/run-state.test.ts` (+2 tests) | kept | Predicate behavior for the extension gate. |
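
How a future cycle might consume the annotated queue can be sketched as below. This is only an illustration: the `Hypothesis` shape, the `nextCandidates` helper, and the assumption that a lower `priority` number means "run sooner" are not confirmed by this PR; only the `priority: 99` / `priority: 50` conventions and the `result` block come from the description above.

```typescript
// Hypothetical schema for entries in bench/research/wv590-hot-spots.json.
// Only `priority` and `result` are named in the PR; `id` and `verdict` are assumed.
interface Hypothesis {
  id: string;
  priority: number; // assumed: lower number runs sooner
  result?: { verdict: string };
}

const REJECTED_PRIORITY = 99; // from the PR: priority 99 marks a rejected hypothesis

// Drop rejected hypotheses and order the rest, so inconclusive ones
// (priority 50) sink below untested ones.
function nextCandidates(queue: Hypothesis[]): Hypothesis[] {
  return queue
    .filter((h) => h.priority !== REJECTED_PRIORITY)
    .sort((a, b) => a.priority - b.priority);
}

const exampleQueue: Hypothesis[] = [
  { id: "calendar-month-nav-macro", priority: 99, result: { verdict: "rejected" } },
  { id: "max-turns-by-flow-complexity", priority: 50, result: { verdict: "inconclusive" } },
  { id: "hypothetical-untested-idea", priority: 2 }, // illustrative id, not from the repo
];

const candidateIds = nextCandidates(exampleQueue).map((h) => h.id);
console.log(candidateIds);
```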

Cycle-1 result data

```
                 Baseline          Treatment         Δ
booking          25/40 = 62.5%     18/40 = 45.0%     -17.5pp
google-flights   12/39 = 30.8%      2/17 = 11.8%     -19.0pp
combined         37/79 = 46.8%     20/57 = 35.1%     -11.7pp

(1 cred-fail aborted run at 57/79; cost ~$32)
```

Both subsets clearly negative beyond the run-to-run variance band on this benchmark. REJECT per /research skill criteria.
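
For anyone re-deriving the deltas, a small sketch of the arithmetic follows. The `Tally` type and helper names are made up for this example; the pass counts are the ones reported above. Note that the treatment denominators are smaller than baseline because the run aborted at 57/79.

```typescript
// Pass counts as reported in the cycle-1 table; helper names are illustrative.
interface Tally {
  passed: number;
  total: number;
}

// Pass rate in percent.
const rate = (t: Tally): number => (100 * t.passed) / t.total;

// Treatment minus baseline, in percentage points, rounded to one decimal.
const deltaPp = (baseline: Tally, treatment: Tally): number =>
  Math.round((rate(treatment) - rate(baseline)) * 10) / 10;

const bookingDelta = deltaPp({ passed: 25, total: 40 }, { passed: 18, total: 40 });
const flightsDelta = deltaPp({ passed: 12, total: 39 }, { passed: 2, total: 17 });
const combinedDelta = deltaPp({ passed: 37, total: 79 }, { passed: 20, total: 57 });

console.log(bookingDelta, flightsDelta, combinedDelta); // -17.5 -19 -11.7
```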

Diagnosis

Fail-verdict patterns make the cause unambiguous: the agent bails more cleanly and earlier than baseline instead of pushing through. Recurring lines like:

"Booking.com repeatedly failed to navigate off the homepage"
"Date dialog is open, could not complete within remaining turns"

Rule #27 told the agent "if state regressed, switch strategy", but the only documented fallback (URL-direct) is blocked by Booking.com. Net effect: more cleanly bailed cases. Lesson: prompt rules that tell the agent when to give up must be paired with concrete actions the agent can take.

Cycle-2 plan (queued, not in this PR)

Architectural — not prompt — fix:

When booking-style state-regression is detected (URL bounced to root after progressed flow), the runner spawns a fresh browser context (new cookies, new fingerprint) and retries via URL-direct in the new context. The brain stays out of recovery decisions; the runner does it deterministically.

~150 LOC across the runner's recovery system, separate session from this PR.
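
A rough sketch of what the runner-side recovery could look like, under stated assumptions: `BrowserLike` and `ContextLike` are placeholder interfaces invented here (not the project's or any library's real API), and "bounced to root after progressed flow" is interpreted as "the current URL is the site root while an earlier URL on the same origin was not". The actual cycle-2 design may differ.

```typescript
// Placeholder abstractions; the real runner's browser API is not shown in this PR.
interface ContextLike {
  goto(url: string): Promise<void>;
  close(): Promise<void>;
}
interface BrowserLike {
  newContext(): Promise<ContextLike>; // assumed to yield fresh cookies/fingerprint
}

// True when the URL is the bare site root (no path, no query).
function isRoot(url: string): boolean {
  const { pathname, search } = new URL(url);
  return pathname === "/" && search === "";
}

// Assumed regression signal: we are at the root now, but an earlier URL
// on the same origin had progressed past it.
function regressedToRoot(history: string[]): boolean {
  const current = history[history.length - 1];
  if (current === undefined || history.length < 2) return false;
  const origin = new URL(current).origin;
  return (
    isRoot(current) &&
    history.slice(0, -1).some((u) => new URL(u).origin === origin && !isRoot(u))
  );
}

// Deterministic recovery: open a fresh context and retry via the direct URL.
// The caller owns the returned context; it is closed here only on failure.
async function retryInFreshContext(
  browser: BrowserLike,
  directUrl: string,
): Promise<ContextLike> {
  const ctx = await browser.newContext();
  try {
    await ctx.goto(directUrl);
    return ctx;
  } catch (err) {
    await ctx.close();
    throw err;
  }
}
```

Keeping the predicate pure (URL history in, boolean out) would let it be unit-tested the same way as the `lastProgressTurn` predicate.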

Verification

  • `pnpm lint` clean
  • `pnpm test` 1516/1516 pass (+2 new for lastProgressTurn predicate)
  • `pnpm check:boundaries` clean (157 files)
  • Cycle-1 result data persisted in `bench/research/wv590-hot-spots.json` (full per-hypothesis annotations + scorecard)

Why merge a net-neutral PR

Two reasons:

  1. The research history is the durable artifact. Future cycles must not re-run rejected hypotheses, and the annotated queue ensures that. It would be lost if the branch were closed unmerged.
  2. The runner's adaptive-max-turns extension is innocuous + might pay off in cycle 2. Once the earlier-bail-out is fixed, this safety net could convert the cases that currently bail at turn 14 into successes by granting +5 turns.

If the team prefers to throw away the runner change and keep only the research queue, I can split the PR. Default is merging as-is.

… guidance

Closes 2 of 7 hypotheses from bench/research/wv590-hot-spots.json,
the queue derived from the 2026-04-28 WebVoyager-590 baseline
(536/590 = 90.8%; 78% of fails on booking + google-flights date
pickers).

Hypothesis #5 — adaptive max-turns (priority 5, parameter-tuning,
expected +2-4pp)
  21 of 54 fails were "agent_gave_up_at_max_turns" mid-flow on
  booking + google-flights, where agent-was-progressing reads
  unambiguously from the trace. Static maxTurns=15 cut them off.

  Implementation
    src/run-state.ts: new RunState.lastProgressTurn (init -Infinity)
    src/runner/runner.ts: progress detection in the observe-completed
      emit path (URL change OR snapshot byte delta > 5%)
    src/runner/runner.ts: maxTurns is now `let` not `const`; at the
      cap boundary, if lastProgressTurn ≥ maxTurns - 3, grant a
      one-time +5 extension (capped absolute at 25). Vision-mode
      runs are excluded (already get +5 baseline). Cascading
      extensions blocked via extensionGranted.
    bus emits a recovery-fired event with strategy
      "max-turns-extension" so traces are honest about borrowed
      turns.
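
The gate described above can be sketched roughly as follows. This is not the code in `src/runner/runner.ts`; the `TurnBudget` type and `maybeExtend` function are illustrative names, and only `maxTurns`, `lastProgressTurn`, `extensionGranted`, and the 3 / +5 / 25 constants come from the commit message.

```typescript
// Illustrative shape; the real runner keeps these as local variables.
interface TurnBudget {
  maxTurns: number;          // current cap (15 in the static baseline)
  extensionGranted: boolean; // blocks cascading extensions
}

const EXTENSION_TURNS = 5; // one-time bonus
const LOOKBACK_TURNS = 3;  // progress must be at least this recent
const ABSOLUTE_CAP = 25;   // hard ceiling on maxTurns

// Called at the cap boundary. Returns the budget unchanged unless the
// run qualifies for the one-time extension.
function maybeExtend(
  budget: TurnBudget,
  lastProgressTurn: number, // -Infinity until progress is first observed
  visionMode: boolean,
): TurnBudget {
  if (visionMode || budget.extensionGranted) return budget;
  if (lastProgressTurn < budget.maxTurns - LOOKBACK_TURNS) return budget;
  return {
    maxTurns: Math.min(budget.maxTurns + EXTENSION_TURNS, ABSOLUTE_CAP),
    extensionGranted: true,
  };
}

const base: TurnBudget = { maxTurns: 15, extensionGranted: false };
console.log(maybeExtend(base, 13, false).maxTurns);        // 20
console.log(maybeExtend(base, -Infinity, false).maxTurns); // 15
```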

  Anti-overfitting
    The 5% byte-delta floor was chosen so decorative animations and
    dynamic-id reshuffles don't trip the predicate. The extension
    requires recent (≤3 turn) progress, not just a one-shot DOM
    change at turn 1, so it doesn't reward stuck loops.
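
A minimal sketch of the progress predicate, assuming the 5% byte delta is measured relative to the previous snapshot's size (the commit message does not say which side is the denominator). `Observation` and `madeProgress` are names invented for this example.

```typescript
// Illustrative observation record; field names are assumptions.
interface Observation {
  url: string;
  snapshotBytes: number; // size of the page snapshot in bytes
}

const BYTE_DELTA_FLOOR = 0.05; // 5% floor from the commit message

// Progress = URL changed OR the snapshot size moved by more than 5%.
function madeProgress(prev: Observation, next: Observation): boolean {
  if (prev.url !== next.url) return true;
  if (prev.snapshotBytes === 0) return next.snapshotBytes > 0;
  const delta =
    Math.abs(next.snapshotBytes - prev.snapshotBytes) / prev.snapshotBytes;
  return delta > BYTE_DELTA_FLOOR;
}

const before: Observation = { url: "https://example.com/search", snapshotBytes: 10_000 };

// A 2% wobble (for example an animation) does not count as progress.
console.log(madeProgress(before, { ...before, snapshotBytes: 10_200 })); // false
// A 10% change does.
console.log(madeProgress(before, { ...before, snapshotBytes: 11_000 })); // true
```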

Hypothesis #1 — dialog/calendar nav guidance (priority 1, prompt
change, expected +5-7pp)
  27/27 google-flights fails involve the date-picker; ~10/15 booking
  fails are calendar-month-navigation-stuck. Both share the same
  failure mode: agent clicks "next month" once per turn, burning the
  turn budget navigating from "April 2026" to "December 2026".

  Implementation
    src/brain/index.ts HEAVY_PAGE_RULES adds rules 25-27:
      25. CALENDAR/DATE-PICKER: chain N "next month" clicks via
          nextActions (micro-plan) in ONE turn instead of one click
          per turn
      26. DIALOG-STATE AWARENESS: complete or dismiss the dialog;
          don't waste turns clicking outside it
      27. STATE-REGRESSION DETECTION: if search results disappeared
          and you're back at the homepage, switch strategy

  These rules are added to SYSTEM_PROMPT (the per-turn agent
  prompt) — URL_FIRST_RULES is only in the planner prompt and
  doesn't reach per-turn decisions.
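
To make rule 25 concrete, here is a sketch of the micro-plan it asks the agent to emit. The `ClickAction` shape and the "next month" target label are hypothetical; only the `nextActions` name and the April 2026 → December 2026 example come from the text above.

```typescript
// Hypothetical action shape for entries in nextActions.
interface ClickAction {
  action: "click";
  target: string;
}

interface YearMonth {
  year: number;
  month: number; // 1-12
}

// Number of "next month" clicks needed to get from `from` to `to`.
function monthsBetween(from: YearMonth, to: YearMonth): number {
  return (to.year - from.year) * 12 + (to.month - from.month);
}

// Build all the clicks for ONE turn instead of spending one turn per click.
function planNextMonthClicks(from: YearMonth, to: YearMonth): ClickAction[] {
  const clicks = Math.max(0, monthsBetween(from, to));
  return Array.from(
    { length: clicks },
    (): ClickAction => ({ action: "click", target: "next month" }),
  );
}

const nextActions = planNextMonthClicks(
  { year: 2026, month: 4 },  // April 2026
  { year: 2026, month: 12 }, // December 2026
);
console.log(nextActions.length); // 8
```

At one click per turn, the same navigation would consume 8 of the 15 baseline turns, which is the budget drain this rule was meant to avoid.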

Tests
  tests/run-state.test.ts: +2 tests for lastProgressTurn predicate
    behavior (initial -Infinity; lookback-window matching).
  Existing 1514 tests unchanged; 1516/1516 total pass.
  Boundary check 157/157 files clean.

Next steps (separate cycles)
  Run --two-stage screen on the 79 booking+flights cases (~$200)
  to validate combined +5-10pp signal.
  If wins, full WebVoyager-590 re-baseline (~$200) confirms.
  Then implement #2 (URL-direct site profiles) as the next bigger
  architectural lever — could push toward 96%+.
…ep adaptive max-turns

Cycle 1 ran the combined treatment (rules 25-27 added to SYSTEM_PROMPT
+ adaptive max-turns runner change) on the 79 hot-spot cases. Result:

  booking         18/40 = 45.0%   vs 25/40 = 62.5% baseline   -17.5pp
  google-flights   2/17 = 11.8%   vs 12/39 = 30.8% baseline   -19.0pp
  combined        20/57 = 35.1%   vs 37/79 = 46.8% baseline   -11.7pp
  (1 cred-fail aborted run at 57/79)

REJECT — both subsets clearly negative beyond run-to-run variance.

Diagnosis from fail verdicts: agent is bailing CLEANER and EARLIER
than baseline, not grinding through. Verdict patterns like
"Booking.com repeatedly failed to navigate off the homepage" and
"date dialog is open, could not complete" recur. Rule #27 (state-
regression detection) likely gave the agent permission to give up
when the only fallback strategy (URL-direct) is blocked by the
target site.

Reverted
  src/brain/index.ts: HEAVY_PAGE_RULES rules 25-27 reverted to
  baseline (no calendar/dialog/state-regression guidance). Prompt
  rules that tell the agent when to bail without offering a working
  alternative are net-negative.

Kept
  src/run-state.ts + src/runner/runner.ts adaptive max-turns
  extension. Fired 1× in 57 cases (most fails happen before the
  cap), so it's INCONCLUSIVE rather than negative. The safety net
  is innocuous in the worst case and may help in cycle 2 once we
  fix the earlier bail-out.

Annotated bench/research/wv590-hot-spots.json with the cycle-1
results — calendar-month-nav-macro and state-regression-detection
demoted to priority=99 (rejected), max-turns-by-flow-complexity
demoted to priority=50 (inconclusive). Future cycles read these
annotations to avoid re-running rejected hypotheses.

Cycle-2 hypothesis (queued, not yet implemented):
  The fundamental problem isn't prompt-level guidance — it's that
  booking redirects to homepage with no working recovery. Real fix
  needs a new "fresh-session retry" action: when state regresses
  to homepage, the runner spawns a new browser context and retries
  the search via URL-direct on the fresh session. Architectural
  change, not prompt change. Pairs with adaptive max-turns since
  retries consume turns.
@drewstone drewstone merged commit 2109e0d into main Apr 28, 2026
5 checks passed

tangletools commented Apr 28, 2026

❌ Needs Work - e4df01ee

| Metric | Value |
|---|---|
| Blocking findings | 4 |
| All findings | 9 (2 critical, 2 high, 4 medium, 1 low) |
| Readiness | 85/100 |
| Confidence | 85/100 |
Pass Status
quick
red-team
deep-audit
glm-quick
glm-redteam
glm-audit ⏭️

Blocking Findings

🟣 CRITICAL [deep-audit] Hardcoded secrets in runner core src/runner/runner.ts

Pre-scan detected hardcoded-secret(2) in runner.ts. Embedded credentials in source control bypass secret rotation and expose keys to anyone with repo or package access.

🟣 CRITICAL [deep-audit] SQL injection surface in runner core src/runner/runner.ts

Pre-scan flagged sql-injection(1). Even a single unsanitized query in a browser-automation runner—often used for telemetry, local state, or task queues—can lead to data exfiltration or remote code execution.

🔴 HIGH [deep-audit] Excessive authentication surface violates architecture boundaries src/runner/runner.ts

auth-surface(118) indicates an unusually large number of authentication-related code paths in a single file. This concentration increases the probability of bypasses, session fixation, or privilege escalation and likely fails the check:boundaries gate.

🔴 HIGH [deep-audit] Excessive cryptographic surface in runner core src/runner/runner.ts

crypto-surface(107) signals extensive direct use of low-level cryptographic primitives. This raises the risk of algorithm misuse, nonce reuse, weak key derivation, or side-channel leakage.

1 additional finding

🟡 LOW [deep-audit] Extended default timeout widens resource-exposure window src/runner.ts (entry script)

PR #91 raised the default timeout from 120s to 300s. While this reduces false negatives on long-page sites, it also increases the window for resource exhaustion and orphaned browser contexts if cleanup is imperfect.


Scoring

The adaptive max-turns extension is cleanly gated and well-commented, and the new RunState predicate is unit-tested. However, the final failure reason in runner.ts hardcodes '+5' even when the hard cap reduces the actual extension, which will produce inaccurate benchmark telemetry.
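
One way to address the telemetry concern is to report the turns actually granted rather than a literal "+5". The sketch below is a suggestion only: `grantedTurns` and `failureReason` are hypothetical helpers, and the reason-string format is invented for illustration (only the `agent_gave_up_at_max_turns` label appears in this PR).

```typescript
// Turns actually added once the absolute cap is applied.
function grantedTurns(maxTurns: number, bonus = 5, cap = 25): number {
  return Math.max(0, Math.min(maxTurns + bonus, cap) - maxTurns);
}

// Hypothetical failure-reason formatter that reflects the real extension size.
function failureReason(maxTurns: number): string {
  const granted = grantedTurns(maxTurns);
  return `agent_gave_up_at_max_turns (extension +${granted}, final cap ${maxTurns + granted})`;
}

console.log(grantedTurns(15)); // 5
console.log(grantedTurns(22)); // 3
console.log(failureReason(22));
```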

tangletools · aggregated 2026-04-28T23:35:04Z · **[trace](https://gist.github.com/drewstone/6136a664d613a2b5662eb3ab72c86097)**


@tangletools tangletools left a comment


❌ 4 Blocking Findings

Severities: 2 critical, 2 high

