fix(agents): pull-wake runner resilience - fetch cache stacking, hung connections, xstate lifecycle #4542
Open
KyleAMathews wants to merge 3 commits into
…nections

The durable-streams fetch cache interceptor was composed onto the global undici dispatcher on every BuiltinAgentsServer.start() call without a guard, so restarting the runtime stacked duplicate SQLite-backed cache layers on the same file, breaking fetch after a restart.

The pull-wake runner's heartbeat-failure → stream-reconnect path only worked while the stream was already connected. If the stream factory hung during (re)connection, heartbeat failures couldn't abort it because requestStreamReconnect was a no-op when streamConnected was false. Now each connection attempt gets its own AbortController that the heartbeat failure path can trigger.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
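The per-attempt AbortController idea from this commit message can be illustrated with a minimal sketch. It is not the runner's actual code: `ConnectionAttempts`, `StreamFactory`, `abortInFlight`, and `hangingFactory` are hypothetical names chosen for illustration, and the real runner may wire this differently.

```typescript
// Hypothetical names throughout; a sketch of the technique, not the runner's code.
type StreamFactory = (signal: AbortSignal) => Promise<void>;

class ConnectionAttempts {
  // Controller for the attempt currently in flight, if any.
  private current: AbortController | null = null;

  // Start a connection attempt that owns its own AbortController.
  connect(factory: StreamFactory): Promise<void> {
    this.current?.abort(); // never leave an older attempt running
    const controller = new AbortController();
    this.current = controller;
    return factory(controller.signal).finally(() => {
      // Only clear the slot if no newer attempt has replaced this one.
      if (this.current === controller) this.current = null;
    });
  }

  // Called from the heartbeat-failure path. It targets the attempt itself,
  // so it works even if the stream never reached a "connected" state.
  abortInFlight(): boolean {
    if (!this.current) return false;
    this.current.abort();
    this.current = null;
    return true;
  }
}

// Stand-in for a stuck connect: never settles until its signal aborts.
const hangingFactory: StreamFactory = (signal) =>
  new Promise<void>((_resolve, reject) => {
    signal.addEventListener("abort", () => reject(new Error("aborted")), {
      once: true,
    });
  });
```

The design point is that cancellation is keyed to the attempt rather than to a "connected" flag, so a hung connect is always reachable from the failure path.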
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4542      +/-   ##
==========================================
+ Coverage   54.80%   56.78%   +1.97%
==========================================
  Files         317      361      +44
  Lines       36681    39500    +2819
  Branches    10466    11099     +633
==========================================
+ Hits        20104    22431    +2327
- Misses      16544    16998     +454
- Partials       33       71      +38
Replace the hand-rolled state machine (23 mutable closure variables, manual
AbortController wiring) with an xstate v5 machine that owns the lifecycle:
stopped → running.{connecting,streaming,reconnecting} → stopping.

The machine structurally prevents the class of bug fixed in the previous
commit: invoked actors are auto-aborted on state exit, so a heartbeat-driven
STREAM_RESET during the connecting phase cancels the in-flight connection
without any manual connectAbort bookkeeping. Backoff timers auto-cancel the
same way.

Diagnostics, heartbeat coalescing, and claim processing stay in the runner
closure as effects the machine triggers; the public PullWakeRunner API is
unchanged and all 21 existing behavioral tests pass unmodified.

Adds an exhaustive (state × event) transition matrix test (all 35 pairs
pinned), so adding a state or event forces a deliberate decision.

Also warn when installDurableStreamsFetchCache is called more than once,
per review feedback.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
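One way to get the "adding a state or event forces a deliberate decision" property is to pin every pair in a typed table. The sketch below is an illustration under assumptions, not the PR's test: it uses plain TypeScript rather than xstate, only `STREAM_RESET` and the state names come from the commit message, `START` and `STOP` are hypothetical events, and every target state in the table is a guess made for the example. It covers 15 pairs, not the 35 the commit pins.

```typescript
// State names follow the lifecycle in the commit message (flattened).
const STATES = ["stopped", "connecting", "streaming", "reconnecting", "stopping"] as const;
// STREAM_RESET appears in the commit message; START and STOP are hypothetical.
const EVENTS = ["START", "STOP", "STREAM_RESET"] as const;

type RunnerState = (typeof STATES)[number];
type RunnerEvent = (typeof EVENTS)[number];

// Record<RunnerState, Record<RunnerEvent, ...>> is exhaustive at compile time:
// adding a state or event fails type-checking until every new pair is filled in.
// null means the event is deliberately ignored in that state.
// All target states below are illustrative assumptions.
const TRANSITIONS: Record<RunnerState, Record<RunnerEvent, RunnerState | null>> = {
  stopped:      { START: "connecting", STOP: null,       STREAM_RESET: null },
  connecting:   { START: null,         STOP: "stopping", STREAM_RESET: "reconnecting" },
  streaming:    { START: null,         STOP: "stopping", STREAM_RESET: "reconnecting" },
  reconnecting: { START: null,         STOP: "stopping", STREAM_RESET: null },
  stopping:     { START: null,         STOP: null,       STREAM_RESET: null },
};

// Look up the next state; an ignored event leaves the state unchanged.
function transition(state: RunnerState, event: RunnerEvent): RunnerState {
  return TRANSITIONS[state][event] ?? state;
}

// Enumerate every (state, event) pair, the way a matrix test would walk them.
function allPairs(): Array<[RunnerState, RunnerEvent, RunnerState]> {
  const pairs: Array<[RunnerState, RunnerEvent, RunnerState]> = [];
  for (const s of STATES) {
    for (const e of EVENTS) pairs.push([s, e, transition(s, e)]);
  }
  return pairs;
}
```

A matrix test can then assert the expected target for each entry returned by `allPairs()`, so an unreviewed pair cannot slip in unnoticed.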
af7fda5 to 9dbb4d2
Electric Agents Mobile Build
Local mobile checks ran for commit The EAS Android preview build was skipped because the
Summary
Fixes two bugs that left the desktop agents runtime unable to pick up sessions until a full app restart, then ports the pull-wake runner's lifecycle to an xstate v5 state machine so this class of bug can't recur silently. No public API changes — all 21 pre-existing behavioral tests pass unmodified.
Root cause
Two independent failures combined into "runner looks alive but never picks up work, and restarting the runtime doesn't help":
- The heartbeat-failure reconnect path (requestStreamReconnect) was a no-op unless streamConnected was true. If the wake stream hung while (re)connecting (e.g. after an agents-server restart), heartbeat failures counted up but nothing could cancel the stuck streamFactory call, so the runner sat in connecting forever.
- BuiltinAgentsServer.start() composed a new undici cache interceptor onto the global dispatcher without a guard. Restarting the runtime in-process stacked SQLite-backed cache layers over the same file; only a full app restart got a clean dispatcher.

Approach
- installDurableStreamsFetchCache gets a module-level installed guard (warns on repeat calls).
- The runner lifecycle moves to an xstate v5 machine (pull-wake-machine.ts): stopped → running.{connecting, streaming, reconnecting} → stopping. The structural win: invoked actors are auto-aborted on state exit, so a STREAM_RESET event during connecting cancels the in-flight connect via the promise actor's signal. No manual connectAbort bookkeeping is needed, which is how the original bug slipped through. Backoff timers (after) cancel the same way.
- Diagnostics, heartbeat coalescing, and claim processing stay in the runner closure as effects the machine triggers (PullWakeMachineEffects). The machine decides when; the closure does what.

Key invariants
stop() calls; stop() rejects with drain errors.

Non-goals
Verification
Full agents-runtime suite: 852 passed; 2 pre-existing environment failures unrelated to this PR (sandbox-docker Docker-timing assertion, tool-providers unbuilt workspace dep).

Files changed
- packages/agents-runtime/src/pull-wake-machine.ts: new; xstate v5 lifecycle machine + effects interface
- packages/agents-runtime/src/pull-wake-runner.ts: rewritten as a thin adapter; same public API, closure keeps diagnostics/heartbeat/claims, machine owns lifecycle
- packages/agents-runtime/test/pull-wake-runner.test.ts: +1 hung-connection regression test, +40 machine transition tests; 21 original tests untouched
- packages/agents/src/durable-streams-cache.ts: idempotency guard + warning
- packages/agents/test/durable-streams-cache.test.ts: idempotency test
- packages/agents-runtime/package.json: adds xstate (zero runtime deps)
- .changeset/pull-wake-runner-resilience.md: patch bumps for both packages

🤖 Generated with Claude Code
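The idempotency guard listed for durable-streams-cache.ts can be pictured with a minimal sketch. This is an assumption-laden illustration rather than the PR's code: `installCacheOnce`, `Interceptor`, and the `interceptors` array are hypothetical stand-ins for composing a cache interceptor onto a global dispatcher, and the warning text is invented.

```typescript
// Hypothetical stand-in for an interceptor composed onto a global dispatcher.
type Interceptor = (url: string) => string;

// Stand-in for the global dispatcher's interceptor chain.
const interceptors: Interceptor[] = [];

// Module-level flag: it survives repeated start() calls within one process,
// which is exactly the in-process restart case described above.
let installed = false;

// Install the interceptor at most once; warn and skip on repeat calls.
function installCacheOnce(
  interceptor: Interceptor,
  warn: (message: string) => void = (message) => console.warn(message),
): boolean {
  if (installed) {
    warn("fetch cache already installed; skipping duplicate install");
    return false;
  }
  interceptors.push(interceptor);
  installed = true;
  return true;
}
```

Returning a boolean makes the repeat call observable to callers and tests without relying on the warning side effect.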