Keep SSE chat stream alive with 20s heartbeat#56
Merged
mgoldsborough merged 2 commits intomainfrom Apr 20, 2026
Merged
Conversation
Fixes the "network error" red toast during long agent turns (deep research, Typst compile, large delegation). The backend run completed fine — the LB just killed the silent TCP connection. Root cause: `/v1/chat/stream` only emits frames on engine events, so a multi-minute tool call leaves the SSE pipe idle. AWS ALB defaults to a 60s idle timeout and drops the connection. `/v1/events` and `/v1/conversations/:id/events` already heartbeat via their managers; this handler was the only unprotected SSE endpoint. Change: emit `: ping\n\n` SSE comments every 20s (3× safety margin below the 60s default, 45× below the 900s ceiling we'll set on ALB in the deploy-config PR). SSE comment lines are ignored by the client per spec, so the existing consumer code needs no change. Heartbeat logic lives in `src/api/sse-heartbeat.ts` as a reusable helper — isolated from the chat handler's existing lifecycle so the test surface is a pure timer + controller. `handleChatStream` wires `start()` into its ReadableStream's `start()` and `stop()` into both `finish()` (normal completion, error paths) and `cancel()` (client disconnect). Controller-already-closed enqueue errors are swallowed — the timer is cleared promptly but there's a benign race where one tick can fire after close. Integration test for the handler wiring is deferred — the wiring itself is two lines over a unit-tested helper; the real integration case needs either multi-second waits or env-var plumbing and neither was worth it. Next: deploy-config PR sets `alb.ingress.kubernetes.io/load-balancer-attributes: idle_timeout.timeout_seconds=900` on staging + prod base-values.
- Drop unused export of HEARTBEAT_INTERVAL_MS — no consumer imports it; tests use literal intervals. - Collapse the dead defensive-default on the heartbeat handle. WHATWG guarantees cancel() can't fire before start() returns, so markClosed() always sees the real handle. heartbeat is now a `const` assigned before markClosed is wired up. - Rewrite the "stop() cancels further ticks" test: drain pre-stop pings and assert >=1; then sleep, drain again, assert exactly 0. Easier to read than the before/after math. - Doc-comment on sse-heartbeat.ts distinguishes this request-scoped helper from SseEventManager / ConversationEventManager's broadcast-channel heartbeat. Same word, different paradigm — the note prevents a future reader asking "why two implementations?".
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes the red "network error" toast during long agent turns. Backend runs complete — the LB just silently killed the idle SSE connection.
/v1/chat/streamnow emits a 20s heartbeat (SSE comment) so ALB never sees it as idle.Why
`/v1/chat/stream` only emits frames on engine events. A multi-minute tool call (Typst compile, MCP task-augmented research, long delegate) leaves the SSE pipe idle → ALB's 60s default kills the TCP connection → browser shows a red toast even though conversation events persist cleanly. `/v1/events` and `/v1/conversations/:id/events` already heartbeat via their event managers; this handler was the only unprotected site.
Changes
Scope
This is Change 1 of 3 from the spec. The other two are:
Test plan
Deferred