Skip to content

Keep SSE chat stream alive with 20s heartbeat#56

Merged
mgoldsborough merged 2 commits intomainfrom
fix/platform-sse-keepalive
Apr 20, 2026
Merged

Keep SSE chat stream alive with 20s heartbeat#56
mgoldsborough merged 2 commits intomainfrom
fix/platform-sse-keepalive

Conversation

@mgoldsborough
Copy link
Copy Markdown
Contributor

Summary

Fixes the red "network error" toast during long agent turns. Backend runs complete — the LB just silently killed the idle SSE connection. /v1/chat/stream now emits a 20s heartbeat (SSE comment) so ALB never sees it as idle.

Why

`/v1/chat/stream` only emits frames on engine events. A multi-minute tool call (Typst compile, MCP task-augmented research, long delegate) leaves the SSE pipe idle → ALB's 60s default kills the TCP connection → browser shows a red toast even though conversation events persist cleanly. `/v1/events` and `/v1/conversations/:id/events` already heartbeat via their event managers; this handler was the only unprotected site.

Changes

  • `src/api/sse-heartbeat.ts` — new reusable helper. `startSseHeartbeat(controller, intervalMs)` returns a handle with `stop()`; enqueues `: ping\n\n` per tick; swallows enqueue errors after close.
  • `src/api/handlers.ts` — wires the heartbeat into `handleChatStream`'s `ReadableStream.start()`; stops it in `finish()` and `cancel()`.
  • Interval: 20s (3× safety below 60s default ALB idle, 45× below the 900s we'll set via deploy-config).

Scope

This is Change 1 of 3 from the spec. The other two are:

  • Change 2 (infra): ALB idle-timeout annotation in `deployments/agent-platform/*/base-values.yaml`. Separate PR on the deployments submodule, stage → prod rollout.
  • Change 3 (client): soft "Reconnecting…" + switch to `/v1/conversations/:id/events` on transport error. Proposed as a separate follow-up PR — defense-in-depth for the tail case where Changes 1+2 fail. Not urgent once 1+2 ship.

Test plan

  • `bun run verify` — unit (new: 4 heartbeat cases covering tick emission, stop idempotence, after-close enqueue) + integration + smoke all green.
  • Staging: run a >60s tool call (Typst compile / research), confirm no red toast and the stream stays open with periodic 6-byte frames in DevTools Network tab.

Deferred

  • Integration test for `handleChatStream` wiring — the wiring is 2 lines over a unit-tested helper; the real integration case needs either multi-second waits or env-var plumbing. Can add if QA requests.

Fixes the "network error" red toast during long agent turns (deep
research, Typst compile, large delegation). The backend run
completed fine — the LB just killed the silent TCP connection.

Root cause: `/v1/chat/stream` only emits frames on engine events,
so a multi-minute tool call leaves the SSE pipe idle. AWS ALB
defaults to a 60s idle timeout and drops the connection.
`/v1/events` and `/v1/conversations/:id/events` already heartbeat
via their managers; this handler was the only unprotected SSE
endpoint.

Change: emit `: ping\n\n` SSE comments every 20s (3× safety margin
below the 60s default, 45× below the 900s ceiling we'll set on ALB
in the deploy-config PR). SSE comment lines are ignored by the
client per spec, so the existing consumer code needs no change.

Heartbeat logic lives in `src/api/sse-heartbeat.ts` as a reusable
helper — isolated from the chat handler's existing lifecycle so
the test surface is a pure timer + controller. `handleChatStream`
wires `start()` into its ReadableStream's `start()` and `stop()`
into both `finish()` (normal completion, error paths) and
`cancel()` (client disconnect). Controller-already-closed enqueue
errors are swallowed — the timer is cleared promptly but there's a
benign race where one tick can fire after close.

Integration test for the handler wiring is deferred — the wiring
itself is two lines over a unit-tested helper; the real integration
case needs either multi-second waits or env-var plumbing and
neither was worth it.

Next: deploy-config PR sets
`alb.ingress.kubernetes.io/load-balancer-attributes:
idle_timeout.timeout_seconds=900` on staging + prod base-values.
- Drop unused export of HEARTBEAT_INTERVAL_MS — no consumer imports
  it; tests use literal intervals.
- Collapse the dead defensive-default on the heartbeat handle.
  WHATWG guarantees cancel() can't fire before start() returns, so
  markClosed() always sees the real handle. heartbeat is now a
  `const` assigned before markClosed is wired up.
- Rewrite the "stop() cancels further ticks" test: drain
  pre-stop pings and assert >=1; then sleep, drain again, assert
  exactly 0. Easier to read than the before/after math.
- Doc-comment on sse-heartbeat.ts distinguishes this
  request-scoped helper from SseEventManager /
  ConversationEventManager's broadcast-channel heartbeat. Same
  word, different paradigm — the note prevents a future reader
  asking "why two implementations?".
@mgoldsborough mgoldsborough added the qa-reviewed QA review completed with no critical issues label Apr 20, 2026
@mgoldsborough mgoldsborough merged commit faed5c5 into main Apr 20, 2026
4 checks passed
@mgoldsborough mgoldsborough deleted the fix/platform-sse-keepalive branch April 20, 2026 23:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

qa-reviewed QA review completed with no critical issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant