fix(agents): pull-wake runner resilience — fetch cache stacking, hung connections, xstate lifecycle #4542

Open
KyleAMathews wants to merge 3 commits into main from fix/idempotent-fetch-cache

@KyleAMathews KyleAMathews commented Jun 9, 2026

Contributor

Summary

Fixes two bugs that left the desktop agents runtime unable to pick up sessions until a full app restart, then ports the pull-wake runner's lifecycle to an xstate v5 state machine so this class of bug can't recur silently. No public API changes — all 21 pre-existing behavioral tests pass unmodified.

Root cause

Two independent failures combined into "runner looks alive but never picks up work, and restarting the runtime doesn't help":

  1. Hung connection couldn't be aborted. The heartbeat-failure → stream-reset path (requestStreamReconnect) was a no-op unless streamConnected was true. If the wake stream hung while (re)connecting (e.g. after an agents-server restart), heartbeat failures counted up but nothing could cancel the stuck streamFactory call — the runner sat in connecting forever.
  2. Runtime restart stacked cache interceptors. Every BuiltinAgentsServer.start() composed a new undici cache interceptor onto the global dispatcher without a guard. Restarting the runtime in-process stacked SQLite-backed cache layers over the same file; only a full app restart got a clean dispatcher.
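The stacking in item 2, and the kind of module-level guard that prevents it, can be sketched without any real HTTP client. This is a minimal model, not the PR's code: `Dispatcher`, `compose`, `installCacheUnguarded`, `installCacheGuarded`, and `resetForDemo` are hypothetical stand-ins, and the layer is just a string rather than an undici interceptor.

```typescript
// Stand-in for a dispatcher that supports interceptor composition.
// This models the idea only; it is not undici's API.
type Dispatcher = { layers: string[] };

let globalDispatcher: Dispatcher = { layers: [] };

function compose(base: Dispatcher, layer: string): Dispatcher {
  return { layers: [...base.layers, layer] };
}

// Unguarded install: every call wraps the global dispatcher again,
// so an in-process restart ends up with two cache layers.
function installCacheUnguarded(): void {
  globalDispatcher = compose(globalDispatcher, "sqlite-cache");
}

// Guarded install: a module-level flag turns repeat calls into a warning.
let installed = false;

function installCacheGuarded(): boolean {
  if (installed) {
    console.warn("fetch cache already installed; ignoring repeat call");
    return false;
  }
  installed = true;
  globalDispatcher = compose(globalDispatcher, "sqlite-cache");
  return true;
}

// Demo helper (hypothetical): restore a clean dispatcher and clear the flag.
function resetForDemo(): void {
  globalDispatcher = { layers: [] };
  installed = false;
}
```

Calling the unguarded installer twice leaves two layers on the dispatcher; calling the guarded one twice leaves one and logs a warning on the second call.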

Approach

  • installDurableStreamsFetchCache gets a module-level installed guard (warns on repeat calls).
  • The runner lifecycle moves into an xstate machine (pull-wake-machine.ts): stopped → running.{connecting, streaming, reconnecting} → stopping. The structural win: invoked actors are auto-aborted on state exit, so a STREAM_RESET event during connecting cancels the in-flight connect via the promise actor's signal — no manual connectAbort bookkeeping, which is how the original bug slipped through. Backoff timers (after) cancel the same way.
  • Diagnostics, heartbeat coalescing, and claim processing stay in the runner closure as effects the machine triggers (PullWakeMachineEffects). The machine decides when; the closure does what.
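The "work is cancelled when its state is exited" property can be illustrated without xstate. The sketch below is a dependency-free model under stated assumptions: only `STREAM_RESET` and the state names come from this PR; `createLifecycle`, the other event names, and the chosen transition targets are hypothetical, and the `stopping` state is omitted for brevity.

```typescript
type State = "stopped" | "connecting" | "streaming" | "reconnecting";
// Only STREAM_RESET is named in the PR; the other events are illustrative.
type LifecycleEvent = "START" | "CONNECTED" | "STREAM_RESET" | "RETRY" | "STOP";

function createLifecycle(connect: (signal: AbortSignal) => void) {
  let state: State = "stopped";
  // One AbortController per state visit; aborted whenever that state is left.
  let scope = new AbortController();

  function enter(next: State): void {
    scope.abort(); // exit: cancel whatever the old state started
    scope = new AbortController();
    state = next;
    if (next === "connecting") connect(scope.signal);
  }

  function send(event: LifecycleEvent): void {
    if (event === "STOP") {
      if (state !== "stopped") enter("stopped");
    } else if (state === "stopped" && event === "START") {
      enter("connecting");
    } else if (state === "connecting" && event === "CONNECTED") {
      // Aborting a connect that already finished is harmless.
      enter("streaming");
    } else if (
      (state === "connecting" || state === "streaming") &&
      event === "STREAM_RESET"
    ) {
      enter("reconnecting"); // assumed target; cancels an in-flight connect
    } else if (state === "reconnecting" && event === "RETRY") {
      enter("connecting");
    }
    // Any other (state, event) pair is ignored in this sketch.
  }

  return { send, getState: (): State => state };
}
```

Because cancellation lives in `enter`, no transition can forget to abort the previous connect attempt, which is the bookkeeping the description says the hand-rolled version got wrong.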

Key invariants

  • Every (state × event) pair is pinned by an exhaustive 35-case transition matrix test — adding a state or event forces a deliberate decision.
  • Repeated heartbeat failures (≥2) reset the stream from any running substate.
  • Concurrent stop() calls share a single shutdown sequence; stop() rejects if draining reports errors.
  • Claim actors survive stream reconnects and only drain (1s grace) at stop.
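One way to make a transition matrix "force a deliberate decision" is to type it as a total record, so a new state or event fails to compile until every cell is filled. The sketch below assumes that approach; it is not the PR's test. Only the state names and `STREAM_RESET` come from the description, the other events and all transition targets are illustrative, and it uses 3 events (15 pairs) rather than the 35 pairs pinned in the PR.

```typescript
const STATES = ["stopped", "connecting", "streaming", "reconnecting", "stopping"] as const;
// START and STOP are hypothetical event names; STREAM_RESET is from the PR.
const EVENTS = ["START", "STREAM_RESET", "STOP"] as const;

type State = (typeof STATES)[number];
type LifecycleEvent = (typeof EVENTS)[number];

// null means "deliberately ignored in this state".
// Record<State, Record<LifecycleEvent, ...>> makes a missing cell a type error.
const MATRIX: Record<State, Record<LifecycleEvent, State | null>> = {
  stopped:      { START: "connecting", STREAM_RESET: null,           STOP: null },
  connecting:   { START: null,         STREAM_RESET: "reconnecting", STOP: "stopping" },
  streaming:    { START: null,         STREAM_RESET: "reconnecting", STOP: "stopping" },
  reconnecting: { START: null,         STREAM_RESET: "reconnecting", STOP: "stopping" },
  stopping:     { START: null,         STREAM_RESET: null,           STOP: null },
};

function transition(state: State, event: LifecycleEvent): State {
  return MATRIX[state][event] ?? state;
}

// Runtime counterpart of the type check: count the pinned (state, event) pairs.
function countPinnedPairs(): number {
  let pinned = 0;
  for (const s of STATES) {
    for (const e of EVENTS) {
      if (e in MATRIX[s]) pinned++;
    }
  }
  return pinned;
}
```

A test can then iterate `STATES × EVENTS` and compare the real machine's next state against `MATRIX`, failing on any pair the table does not cover.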

Non-goals

  • No behavioral changes to claiming, heartbeat cadence, or backoff timing (backoff still starts at 1s, doubles on each retry, and caps at 30s).
  • Heartbeat send/coalescing logic was not moved into the machine — it's effect plumbing, not lifecycle.
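For reference, the preserved backoff schedule (1s, doubling, 30s cap) corresponds to a function like the one below. This is a sketch of the arithmetic only; `backoffDelayMs` and the zero-based `attempt` parameter are assumptions, not names from the runner.

```typescript
const INITIAL_DELAY_MS = 1_000;
const MAX_DELAY_MS = 30_000;

// attempt is zero-based: 0 is the first retry.
function backoffDelayMs(attempt: number): number {
  return Math.min(INITIAL_DELAY_MS * 2 ** attempt, MAX_DELAY_MS);
}

// First six delays: 1000, 2000, 4000, 8000, 16000, 30000
const firstDelays = [0, 1, 2, 3, 4, 5].map(backoffDelayMs);
```

The sixth delay would be 32s uncapped, so the cap first applies on that attempt.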

Verification

pnpm --filter @electric-ax/agents-runtime exec vitest run test/pull-wake-runner.test.ts   # 61 tests
pnpm --filter @electric-ax/agents exec vitest run test/durable-streams-cache.test.ts      # 2 tests
pnpm --filter @electric-ax/agents-runtime build

Full agents-runtime suite: 852 passed; 2 pre-existing environment failures unrelated to this PR (sandbox-docker Docker-timing assertion, tool-providers unbuilt workspace dep).

Files changed

  • packages/agents-runtime/src/pull-wake-machine.ts — new: xstate v5 lifecycle machine + effects interface
  • packages/agents-runtime/src/pull-wake-runner.ts — rewritten as a thin adapter: same public API, closure keeps diagnostics/heartbeat/claims, machine owns lifecycle
  • packages/agents-runtime/test/pull-wake-runner.test.ts — +1 hung-connection regression test, +40 machine transition tests; 21 original tests untouched
  • packages/agents/src/durable-streams-cache.ts — idempotency guard + warning
  • packages/agents/test/durable-streams-cache.test.ts — idempotency test
  • packages/agents-runtime/package.json — adds xstate as a dependency (xstate itself has zero runtime deps)
  • .changeset/pull-wake-runner-resilience.md — patch bumps for both packages

🤖 Generated with Claude Code

…nections

The durable-streams fetch cache interceptor was composed onto the global
undici dispatcher on every BuiltinAgentsServer.start() call without a
guard, so restarting the runtime stacked duplicate SQLite-backed cache
layers on the same file — breaking fetch after a restart.

The pull-wake runner's heartbeat-failure → stream-reconnect path only
worked while the stream was already connected. If the stream factory
hung during (re)connection, heartbeat failures couldn't abort it because
requestStreamReconnect was a no-op when streamConnected was false. Now
each connection attempt gets its own AbortController that the heartbeat
failure path can trigger.
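A minimal sketch of the per-attempt AbortController idea from this commit, assuming a reset threshold of two consecutive heartbeat failures (the "≥2" figure from the PR description). `createStreamSupervisor`, `beginConnectAttempt`, and `onHeartbeatResult` are hypothetical names, and resetting the failure counter after a reconnect request is an assumption.

```typescript
const HEARTBEAT_FAILURE_THRESHOLD = 2; // "≥2" in the PR description

function createStreamSupervisor() {
  // One controller per connection attempt, whether or not it ever connects.
  let attemptAbort: AbortController | null = null;
  let consecutiveFailures = 0;

  // Called at the start of every (re)connect; the signal goes to the stream factory.
  function beginConnectAttempt(): AbortSignal {
    attemptAbort = new AbortController();
    return attemptAbort.signal;
  }

  // No "is the stream connected?" precondition, so a hung connect
  // can be cancelled just like a live stream.
  function requestStreamReconnect(): boolean {
    if (attemptAbort === null || attemptAbort.signal.aborted) return false;
    attemptAbort.abort();
    return true;
  }

  function onHeartbeatResult(ok: boolean): void {
    consecutiveFailures = ok ? 0 : consecutiveFailures + 1;
    if (consecutiveFailures >= HEARTBEAT_FAILURE_THRESHOLD) {
      consecutiveFailures = 0; // assumed: start counting afresh after a reset
      requestStreamReconnect();
    }
  }

  return { beginConnectAttempt, requestStreamReconnect, onHeartbeatResult };
}
```

With this shape, two failed heartbeats abort the signal handed to a connect attempt even if that attempt never reached the connected state.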

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jun 9, 2026

Contributor

Electric Agents Desktop Builds

Build artifacts for commit 9dbb4d2.

| Platform | Status | Artifact |
| --- | --- | --- |
| macOS Apple Silicon | Passed | DMG |
| macOS Intel | Passed | DMG |
| Windows x64 | Passed | Installer |
| Linux x64 | Passed | AppImage / deb |

Workflow run

@codecov

codecov Bot commented Jun 9, 2026


Codecov Report

❌ Patch coverage is 98.54015% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.78%. Comparing base (916f6cd) to head (9dbb4d2).
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| packages/agents-runtime/src/pull-wake-machine.ts | 98.85% | 2 Missing ⚠️ |
| packages/agents-runtime/src/pull-wake-runner.ts | 97.87% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4542      +/-   ##
==========================================
+ Coverage   54.80%   56.78%   +1.97%     
==========================================
  Files         317      361      +44     
  Lines       36681    39500    +2819     
  Branches    10466    11099     +633     
==========================================
+ Hits        20104    22431    +2327     
- Misses      16544    16998     +454     
- Partials       33       71      +38     
| Flag | Coverage Δ |
| --- | --- |
| packages/agents | 70.64% <100.00%> (+0.10%) ⬆️ |
| packages/agents-mcp | 77.54% <ø> (?) |
| packages/agents-mobile | 71.42% <ø> (ø) |
| packages/agents-runtime | 80.32% <98.51%> (+0.07%) ⬆️ |
| packages/agents-server | 74.16% <ø> (+0.21%) ⬆️ |
| packages/agents-server-ui | 5.67% <ø> (ø) |
| packages/electric-ax | 46.42% <ø> (ø) |
| packages/experimental | 87.73% <ø> (?) |
| packages/react-hooks | 86.48% <ø> (?) |
| packages/start | 82.83% <ø> (?) |
| packages/typescript-client | 91.83% <ø> (?) |
| packages/y-electric | 56.05% <ø> (?) |
| typescript | 56.78% <98.54%> (+1.97%) ⬆️ |
| unit-tests | 56.78% <98.54%> (+1.97%) ⬆️ |


KyleAMathews and others added 2 commits June 9, 2026 15:18
Replace the hand-rolled state machine (23 mutable closure variables, manual
AbortController wiring) with an xstate v5 machine that owns the lifecycle:
stopped → running.{connecting,streaming,reconnecting} → stopping.

The machine structurally prevents the class of bug fixed in the previous
commit: invoked actors are auto-aborted on state exit, so a heartbeat-driven
STREAM_RESET during the connecting phase cancels the in-flight connection
without any manual connectAbort bookkeeping. Backoff timers auto-cancel the
same way.

Diagnostics, heartbeat coalescing, and claim processing stay in the runner
closure as effects the machine triggers; the public PullWakeRunner API is
unchanged and all 21 existing behavioral tests pass unmodified.

Adds an exhaustive (state × event) transition matrix test — all 35 pairs
pinned — so adding a state or event forces a deliberate decision.

Also warn when installDurableStreamsFetchCache is called more than once,
per review feedback.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@KyleAMathews KyleAMathews force-pushed the fix/idempotent-fetch-cache branch from af7fda5 to 9dbb4d2 Compare June 9, 2026 21:20
@KyleAMathews KyleAMathews changed the title from "fix(agents): prevent fetch cache stacking and recover hung stream connections" to "fix(agents): pull-wake runner resilience — fetch cache stacking, hung connections, xstate lifecycle" on Jun 9, 2026
@github-actions

github-actions Bot commented Jun 9, 2026

Contributor

Electric Agents Mobile Build

Local mobile checks ran for commit 9dbb4d2.

The EAS Android preview build was skipped because the mobile-eas-build label is not present.
Add the mobile-eas-build label to this PR to produce an installable preview build.

Workflow run
