
[codex] Add realtime agents support #4543

Draft
samwillis wants to merge 25 commits into main from codex/realtime-agents-plan


samwillis (Contributor) commented Jun 9, 2026

Summary

Adds Horton realtime voice mode on top of Electric Agents using durable streams as the client/server IO path and OpenAI Realtime as the initial provider. The implementation lets Horton drop into an active realtime session from either an existing conversation or the new-session screen, receive durable microphone audio, stream assistant audio back, keep typed messages usable during voice mode, run the existing tool loop, and persist transcripts/audio stream metadata so sessions can be replayed later.

This PR is still a draft, but it now includes the end-to-end runtime, server, UI, OpenAI provider, timeline, and test coverage needed to exercise realtime mode in the desktop app.

Why

The goal is not just a WebSocket passthrough to OpenAI. Electric Agents needs a reusable realtime API for app builders where audio/control IO is durable, inspectable, and eventually replayable. The server/runtime owns provider connections and tool execution; clients write/read durable streams for audio and control frames. OpenAI is the first provider, with the runtime API leaving room for other providers later.

Architecture

  • Clients create realtime sessions through the agents server and receive stream refs for:
    • audio_in: durable PCM audio from browser/client to runtime.
    • audio_out: durable PCM assistant audio from runtime/provider to client.
    • control_in: durable JSON commands from client to runtime.
    • control_out: durable JSON provider/runtime events from runtime to client.
  • The runtime bridges these durable streams to a provider session.
  • Horton continues to use its normal context/tool stack. In realtime mode, Horton configures ctx.useRealtime(...) instead of the normal text-only ctx.useAgent(...) path.
  • Tool calls still run through the composed Electric/Pi tool system, with Horton’s realtime policy allowing direct use of safe tools and worker delegation where appropriate.
  • Realtime sessions are recorded in the manifest with durable stream refs, so the UI can discover active/replayable sessions.
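
The four-stream layout can be made concrete with a minimal TypeScript sketch of a session's stream refs, a manifest entry that records them, and which side appends to each stream. All of the names here (StreamRef, RealtimeSessionStreams, RealtimeManifestEntry, writableBy, readableBy) and the placeholder URLs are hypothetical; the PR does not show the runtime's actual types.

```typescript
// Hypothetical shapes for illustration only; the runtime's real types are not shown in the PR.
interface StreamRef {
  url: string; // location of one durable stream
}

interface RealtimeSessionStreams {
  audioIn: StreamRef; // PCM audio, client -> runtime
  audioOut: StreamRef; // assistant PCM audio, runtime -> client
  controlIn: StreamRef; // JSON commands, client -> runtime
  controlOut: StreamRef; // JSON provider/runtime events, runtime -> client
}

// What a manifest row might hold so the UI can discover active or replayable sessions.
interface RealtimeManifestEntry {
  sessionId: string;
  provider: "openai";
  status: "active" | "ended";
  streams: RealtimeSessionStreams;
}

type StreamName = keyof RealtimeSessionStreams;
type Party = "client" | "runtime";

// Each stream has exactly one writer; the other party only reads it.
const WRITER: Record<StreamName, Party> = {
  audioIn: "client",
  audioOut: "runtime",
  controlIn: "client",
  controlOut: "runtime",
};

function writableBy(party: Party): StreamName[] {
  return (Object.keys(WRITER) as StreamName[]).filter((name) => WRITER[name] === party);
}

function readableBy(party: Party): StreamName[] {
  return (Object.keys(WRITER) as StreamName[]).filter((name) => WRITER[name] !== party);
}

// Example manifest entry with placeholder stream URLs.
const example: RealtimeManifestEntry = {
  sessionId: "session-1",
  provider: "openai",
  status: "active",
  streams: {
    audioIn: { url: "/streams/session-1/audio_in" },
    audioOut: { url: "/streams/session-1/audio_out" },
    controlIn: { url: "/streams/session-1/control_in" },
    controlOut: { url: "/streams/session-1/control_out" },
  },
};
```

Keeping a single writer per stream means the order of entries within each stream is decided by one producer, which is what lets the manifest refs double as a record for later replay.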

Runtime and SDK API

  • Adds first-class realtime runtime types and provider hooks for:
    • realtime audio formats
    • input transcription config
    • OpenAI/server VAD config
    • realtime provider events
    • tool call streaming/results
    • transcript callbacks
  • Adds RealtimeTurnDetectionConfig, supporting:
    • server_vad
    • semantic_vad
    • explicit manual mode via false or { type: "none" }
  • Adds inputTranscription.delay so callers can request low-latency streaming transcription when the provider supports it.
  • Exports the new realtime types from @electric-ax/agents-runtime for app builders.
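
A minimal sketch of how the turn-detection options above could be typed, including the two spellings of manual mode, plus the input transcription config. Field names such as prefixPaddingMs and eagerness, and the normalizeTurnDetection helper, are assumptions for illustration; the exported RealtimeTurnDetectionConfig may look different.

```typescript
// Hypothetical field names; only the three modes come from the PR description.
interface ServerVadConfig {
  type: "server_vad";
  threshold?: number; // activation level in 0..1
  prefixPaddingMs?: number;
  silenceDurationMs?: number;
  createResponse?: boolean;
  interruptResponse?: boolean;
}

interface SemanticVadConfig {
  type: "semantic_vad";
  eagerness?: "low" | "medium" | "high" | "auto";
  createResponse?: boolean;
  interruptResponse?: boolean;
}

// false and { type: "none" } both mean explicit manual mode.
type RealtimeTurnDetectionConfig =
  | ServerVadConfig
  | SemanticVadConfig
  | false
  | { type: "none" };

interface RealtimeInputTranscriptionConfig {
  model: string;
  delay?: string; // e.g. "minimal"; accepted values depend on the provider
}

type TurnDetectionMode =
  | { mode: "provider_vad"; config: ServerVadConfig | SemanticVadConfig }
  | { mode: "manual" };

// Collapse both manual spellings into one case so callers branch only once.
function normalizeTurnDetection(cfg: RealtimeTurnDetectionConfig): TurnDetectionMode {
  if (cfg === false || cfg.type === "none") return { mode: "manual" };
  return { mode: "provider_vad", config: cfg };
}

const transcription: RealtimeInputTranscriptionConfig = {
  model: "gpt-realtime-whisper",
  delay: "minimal",
};
```

Normalizing once lets the durable-stream bridge pick between its two input modes (streaming audio_in directly in provider VAD mode, or buffering for manual commits) from a single value.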

OpenAI Realtime Provider

  • Adds OpenAI Realtime WebSocket provider support.
  • Sends GA-style session.update payloads with nested session.audio.input / session.audio.output config.
  • Maps OpenAI audio, transcript, response, error, and tool events into runtime RealtimeProviderEvents.
  • Supports provider tools/function calls and sends tool results back to the active realtime response.
  • Handles cancellation races by avoiding stale tool-result responses after response cancellation/epoch changes.
  • Deduplicates provider events by event_id where available.
  • Supports output audio truncation through conversation.item.truncate for WebSocket playback interruption.
  • Keeps manual input commit support for future push-to-talk or non-VAD providers.
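
For reference, a GA-style session.update event with the nested audio sections might be assembled as below, using the Horton turn-detection values listed under Horton Behavior, together with a conversation.item.truncate event for playback interruption. The helper names and the options type are hypothetical, the model, instructions, and voice in the usage lines are placeholders, and the wire mapping for inputTranscription.delay is left out because the PR does not show it.

```typescript
// Hypothetical helpers; event and field names follow OpenAI's GA Realtime client events.
interface HortonSessionOptions {
  model: string;
  instructions: string;
  voice: string;
  transcriptionModel: string;
}

function buildHortonSessionUpdate(opts: HortonSessionOptions) {
  const pcm24k = { type: "audio/pcm", rate: 24000 }; // PCM16 at 24 kHz
  return {
    type: "session.update",
    session: {
      type: "realtime",
      model: opts.model,
      instructions: opts.instructions,
      audio: {
        input: {
          format: pcm24k,
          // How inputTranscription.delay reaches the provider is not shown in the PR, so it is omitted.
          transcription: { model: opts.transcriptionModel },
          turn_detection: {
            type: "server_vad",
            threshold: 0.55,
            prefix_padding_ms: 300,
            silence_duration_ms: 500,
            create_response: true, // automatic response creation
            interrupt_response: true, // provider-side interruption
          },
        },
        output: { format: pcm24k, voice: opts.voice },
      },
    },
  };
}

// Tell the provider how much of an assistant audio item was actually played.
function buildTruncate(itemId: string, playedMs: number) {
  return {
    type: "conversation.item.truncate",
    item_id: itemId,
    content_index: 0,
    audio_end_ms: Math.max(0, Math.floor(playedMs)),
  };
}

const update = buildHortonSessionUpdate({
  model: "gpt-realtime", // placeholder
  instructions: "You are Horton.", // placeholder
  voice: "marin", // placeholder
  transcriptionModel: "gpt-realtime-whisper",
});
// On an open Realtime WebSocket: ws.send(JSON.stringify(update))
```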

Horton Behavior

  • Horton realtime mode currently supports OpenAI only.
  • Horton uses OpenAI server_vad for turn detection:
    • threshold 0.55
    • prefix padding 300ms
    • silence duration 500ms
    • automatic response creation enabled
    • provider-side interruption enabled
  • Horton uses gpt-realtime-whisper with delay: "minimal" for input transcription so user transcript deltas stream while the user is still speaking, rather than only after the turn ends.
  • Typed text during an active realtime session routes into the realtime provider session instead of starting a separate text run.
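
Routing typed text into the live provider session instead of a separate text run can be sketched with OpenAI's conversation.item.create and response.create client events. The routeTypedText helper and its return type are hypothetical, and whether Horton sends an explicit response.create after each typed message is an assumption.

```typescript
// Hypothetical routing helper; event shapes follow OpenAI Realtime client events.
type ClientEvent = Record<string, unknown>;

type TypedTextRoute =
  | { kind: "realtime"; events: ClientEvent[] }
  | { kind: "text_run"; text: string };

function routeTypedText(text: string, realtimeActive: boolean): TypedTextRoute {
  // Without an active realtime session, fall back to the normal text run.
  if (!realtimeActive) return { kind: "text_run", text };
  return {
    kind: "realtime",
    events: [
      {
        type: "conversation.item.create",
        item: {
          type: "message",
          role: "user",
          content: [{ type: "input_text", text }],
        },
      },
      // Ask the model to answer the typed message in the same session.
      { type: "response.create" },
    ],
  };
}
```

Keeping typed input inside the provider session means the model sees spoken and typed turns in one conversation history.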

Durable Stream IO

  • Audio and control IO use durable streams with write batching enabled.
  • The browser no longer sends manual input_audio.commit commands in provider-VAD mode.
  • The runtime bridge has two input modes:
    • provider VAD mode streams durable audio_in chunks directly to OpenAI.
    • manual commit mode buffers exact byte ranges and commits only requested audio spans.
  • Short manual audio commits below OpenAI’s minimum input size are skipped and cleared to avoid provider errors.
  • control_out carries lightweight JSON event summaries for UI state, while raw assistant PCM goes to audio_out.
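
Manual commit mode can be sketched as a buffer of durable audio_in chunks keyed by stream offset: a commit request slices out exactly the requested byte span, and spans shorter than the provider minimum are skipped. The ManualCommitBuffer class is hypothetical, and the 100 ms minimum (4,800 bytes of mono PCM16 at 24 kHz) is an assumed value for OpenAI's input buffer limit.

```typescript
// Assumed audio format and provider minimum; adjust if the provider's limit differs.
const SAMPLE_RATE = 24_000;
const BYTES_PER_SAMPLE = 2; // PCM16 mono
const MIN_COMMIT_MS = 100;
const MIN_COMMIT_BYTES = (SAMPLE_RATE * BYTES_PER_SAMPLE * MIN_COMMIT_MS) / 1000; // 4800

// Hypothetical buffer for manual-commit mode.
class ManualCommitBuffer {
  private chunks: { offset: number; bytes: Uint8Array }[] = [];

  // offset is the chunk's byte position in the durable audio_in stream.
  append(offset: number, bytes: Uint8Array): void {
    this.chunks.push({ offset, bytes });
  }

  /**
   * Returns the bytes for [start, end), or null when the span is shorter than
   * the provider minimum. Chunks ending at or before `end` are discarded either
   * way; missing ranges inside the span are left as zeros (silence).
   */
  commit(start: number, end: number): Uint8Array | null {
    const out = new Uint8Array(Math.max(0, end - start));
    for (const { offset, bytes } of this.chunks) {
      const from = Math.max(start, offset);
      const to = Math.min(end, offset + bytes.length);
      if (from < to) out.set(bytes.subarray(from - offset, to - offset), from - start);
    }
    this.chunks = this.chunks.filter((c) => c.offset + c.bytes.length > end);
    return out.length >= MIN_COMMIT_BYTES ? out : null;
  }
}
```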

Browser Audio Capture and Playback

  • Uses AudioWorklet for lower-overhead microphone capture where available, with a ScriptProcessorNode fallback.
  • Captures mono 24 kHz PCM16 for OpenAI realtime.
  • Adds a local transport gate to reduce durable stream write volume while still relying on OpenAI VAD for actual turn detection:
    • keeps a short pre-roll buffer so speech starts are not clipped
    • sends active speech chunks
    • sends a trailing silence tail so provider VAD can detect the end of the turn
    • sends nothing while idle
  • Does not locally cancel assistant responses on local gate-open. Interruption is driven by OpenAI input_audio_buffer.speech_started events.
  • On provider speech-start events, the UI stops queued playback and sends truncation metadata without also sending a redundant response.cancel.
  • Fixes PCM16 output chunk alignment so odd or split byte chunks do not create static/noise.
  • Adds a realtime voice control using a non-dictation icon and a live input-level visual indicator.
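
The local transport gate and the PCM16 alignment fix can both be sketched as pure functions over sample buffers, independent of AudioWorklet or ScriptProcessorNode. TransportGate, Pcm16Aligner, and every threshold and frame count below are hypothetical; the gate only decides what gets written to audio_in, while turn detection stays with OpenAI VAD.

```typescript
// RMS level of a PCM16 frame in 0..1; usable for both the gate and a level meter.
function rmsLevel(frame: Int16Array): number {
  if (frame.length === 0) return 0;
  let sum = 0;
  for (const s of frame) sum += (s / 32768) ** 2;
  return Math.sqrt(sum / frame.length);
}

// Hypothetical transport gate: pre-roll, active speech, trailing silence tail, idle.
class TransportGate {
  private preRoll: Int16Array[] = [];
  private open = false;
  private tailLeft = 0;
  private readonly threshold: number;
  private readonly preRollFrames: number;
  private readonly tailFrames: number;

  constructor(threshold = 0.02, preRollFrames = 3, tailFrames = 10) {
    this.threshold = threshold;
    this.preRollFrames = preRollFrames;
    this.tailFrames = tailFrames;
  }

  /** Returns the frames to append to audio_in for this input frame (often none). */
  push(frame: Int16Array): Int16Array[] {
    if (rmsLevel(frame) >= this.threshold) {
      // On gate open, flush the pre-roll so the start of speech is not clipped.
      const out = this.open ? [frame] : [...this.preRoll, frame];
      this.preRoll = [];
      this.open = true;
      this.tailLeft = this.tailFrames;
      return out;
    }
    if (this.open && this.tailLeft > 0) {
      // Trailing silence lets provider VAD detect the end of the turn.
      this.tailLeft--;
      if (this.tailLeft === 0) this.open = false;
      return [frame];
    }
    // Idle: remember a short pre-roll and send nothing.
    this.open = false;
    this.preRoll.push(frame);
    if (this.preRoll.length > this.preRollFrames) this.preRoll.shift();
    return [];
  }
}

// Hypothetical output aligner: carries an odd trailing byte into the next chunk
// so every Int16 sample is built from its own two bytes.
class Pcm16Aligner {
  private carry = new Uint8Array(0);

  push(chunk: Uint8Array): Int16Array {
    const bytes = new Uint8Array(this.carry.length + chunk.length);
    bytes.set(this.carry, 0);
    bytes.set(chunk, this.carry.length);
    const even = bytes.length - (bytes.length % 2);
    this.carry = bytes.slice(even);
    const view = new DataView(bytes.buffer, bytes.byteOffset, even);
    const samples = new Int16Array(even / 2);
    for (let i = 0; i < samples.length; i++) samples[i] = view.getInt16(i * 2, true); // little-endian
    return samples;
  }
}
```

Reading samples through a DataView also sidesteps constructing an Int16Array view at an odd byte offset, which throws a RangeError.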

UI Flow

  • Adds realtime start/stop controls to the normal message input.
  • Keeps the prompt send button available during realtime mode so users can type while voice mode is active.
  • Adds a realtime button to the new Horton session screen. Clicking it creates a new session and navigates into it.
  • Realtime session status and stream refs are represented in the timeline/manifest rather than hidden client-only state.

Transcript and Timeline Handling

  • Persists realtime input/output transcripts as first-class realtime transcript rows.
  • Streams transcript text as textDeltas chunks instead of repeatedly rewriting the whole transcript text over the wire.
  • Interleaves realtime transcripts, normal assistant output, and tool calls in visible timeline order.
  • Splits assistant output transcript segments around later user speech so long assistant streams do not visually float above later user turns.
  • Uses OpenAI item/response identifiers to group input/output transcript deltas.
  • Reconciles final transcript text against streamed deltas without duplicating content.
  • Avoids seeding active-session realtime transcripts back into provider history when reconnecting the active session.
  • Generates session titles/descriptions from the first finalized user realtime transcript.
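
Delta streaming with final-text reconciliation can be sketched as one row per stable transcript ID whose text is the concatenation of its textDeltas. The TranscriptRows class is hypothetical, and the rule for a final text that disagrees with the streamed deltas (replace the deltas with the final text) is an assumed policy, not necessarily the PR's.

```typescript
// Hypothetical transcript row keyed by a stable ID (e.g. derived from an OpenAI item ID).
type Role = "user" | "assistant";

interface TranscriptRow {
  id: string;
  role: Role;
  textDeltas: string[];
  final: boolean;
}

class TranscriptRows {
  private rows = new Map<string, TranscriptRow>();

  private row(id: string, role: Role): TranscriptRow {
    let r = this.rows.get(id);
    if (!r) {
      r = { id, role, textDeltas: [], final: false };
      this.rows.set(id, r);
    }
    return r;
  }

  // Append a streamed delta; late deltas after finalization are ignored.
  delta(id: string, role: Role, text: string): void {
    const r = this.row(id, role);
    if (!r.final) r.textDeltas.push(text);
  }

  // Reconcile the provider's final text against what was already streamed.
  finalize(id: string, role: Role, finalText: string): void {
    const r = this.row(id, role);
    const streamed = r.textDeltas.join("");
    if (finalText.startsWith(streamed)) {
      const rest = finalText.slice(streamed.length);
      if (rest) r.textDeltas.push(rest); // append only the missing suffix
    } else {
      r.textDeltas = [finalText]; // deltas diverged: final text wins
    }
    r.final = true;
  }

  text(id: string): string {
    return this.rows.get(id)?.textDeltas.join("") ?? "";
  }
}
```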

Server Routing

  • Adds realtime session start plumbing in agents-server.
  • Adds stream routing support for realtime audio/control streams.
  • Allows the producer-seq header in CORS/preflight handling so durable stream writes from the UI succeed.
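
A minimal sketch of preflight handling that admits the producer-seq header. The preflightHeaders helper, the rest of the allow-list, and the method list are assumptions; the sketch also assumes the origin has already been checked against an allow-list before it is echoed back.

```typescript
// Only "producer-seq" comes from the PR; the other entries are assumptions.
const ALLOWED_HEADERS = ["content-type", "producer-seq"];

function preflightHeaders(
  origin: string, // assumed to be pre-validated by the caller
  requested: string | null, // value of Access-Control-Request-Headers
): Record<string, string> | null {
  const asked = (requested ?? "")
    .split(",")
    .map((h) => h.trim().toLowerCase())
    .filter(Boolean);
  // Reject the preflight if the browser asks for a header outside the allow-list.
  if (asked.some((h) => !ALLOWED_HEADERS.includes(h))) return null;
  return {
    "Access-Control-Allow-Origin": origin,
    "Access-Control-Allow-Methods": "GET, POST, OPTIONS", // assumed method set
    "Access-Control-Allow-Headers": ALLOWED_HEADERS.join(", "),
  };
}
```

Without producer-seq in Access-Control-Allow-Headers, the browser fails the preflight and never sends the durable stream write.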

Error Handling and Reliability Fixes

  • Ignores OpenAI inactive cancellation errors such as response_cancel_not_active when they are caused by a stale or already-cancelled response.
  • Ignores stale output truncation errors when the local client’s playback/truncation state is behind provider state.
  • Avoids committing empty or too-short provider input buffers in manual mode.
  • Prevents duplicate transcript rows by streaming deltas under stable realtime transcript IDs.
  • Keeps response/tool result sending guarded by response epoch so cancelled/replaced responses do not get stale tool follow-ups.
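
Three of the guards above can be sketched together: event_id deduplication, a response epoch that invalidates tool results from cancelled or replaced responses, and a filter for benign cancellation errors. The RealtimeGuards class is hypothetical; response_cancel_not_active is the only error code taken from the PR, and the stale-truncation error is left out because its code is not given.

```typescript
// Only response_cancel_not_active is named in the PR.
const BENIGN_ERROR_CODES = new Set(["response_cancel_not_active"]);

// Hypothetical guard state for one realtime provider session.
class RealtimeGuards {
  private seenEventIds = new Set<string>();
  private epoch = 0;

  /** True the first time an event_id is seen; events without an ID always pass. */
  firstSeen(eventId: string | undefined): boolean {
    if (eventId === undefined) return true;
    if (this.seenEventIds.has(eventId)) return false;
    this.seenEventIds.add(eventId);
    return true;
  }

  /** Call when a response is cancelled or replaced; returns the new epoch. */
  bumpEpoch(): number {
    this.epoch += 1;
    return this.epoch;
  }

  currentEpoch(): number {
    return this.epoch;
  }

  /** A tool result is sent only if the response it belongs to is still current. */
  canSendToolResult(startedAtEpoch: number): boolean {
    return startedAtEpoch === this.epoch;
  }

  isBenignError(code: string | undefined): boolean {
    return code !== undefined && BENIGN_ERROR_CODES.has(code);
  }
}
```

The seen-ID set grows for the life of the session; a long-running session would want to bound it, for example by keeping only recent IDs.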

Validation

Ran the focused checks under the repo-pinned Node 24.11.1 toolchain:

  • pnpm --dir packages/agents-runtime exec tsc --noEmit --pretty false
  • pnpm --dir packages/agents-server-ui exec tsc --noEmit --pretty false
  • pnpm --dir packages/agents exec tsc --noEmit --pretty false
  • pnpm --dir packages/agents-runtime exec vitest run test/openai-realtime.test.ts test/realtime-context.test.ts
  • pnpm --dir packages/agents-runtime exec vitest run test/realtime-context.test.ts test/openai-realtime.test.ts test/timeline-context.test.ts test/entity-timeline.test.ts
  • pnpm --dir packages/agents exec vitest run test/horton-tool-composition.test.ts
  • pnpm --dir packages/agents-runtime run build
  • git diff --check

Also tested manually in the desktop app through multiple realtime voice sessions, including:

  • initial realtime session startup
  • microphone input streaming
  • assistant audio playback
  • live input-level indicator
  • typed messages during realtime mode
  • tool calls from realtime mode
  • realtime transcript rendering/order
  • long user turns with live transcript deltas
  • interruption behavior while assistant audio is playing

Notes and Follow-ups

  • This PR intentionally starts with OpenAI Realtime only, but the runtime/provider boundary is not OpenAI-specific.
  • Replay/scrub UI is not implemented here, but durable session stream refs are persisted in the manifest so a replay UI can be built on top.
  • Mobile/native scoped stream token refresh remains out of scope for this draft.
  • Risk area to keep testing: long-running voice sessions with many interruptions and tool calls, because those stress provider response cancellation, transcript reconciliation, and timeline ordering together.


netlify Bot commented Jun 9, 2026


Deploy Preview for electric-next ready!

  • 🔨 Latest commit: 242aca6
  • 🔍 Latest deploy log: https://app.netlify.com/projects/electric-next/deploys/6a2825e8c87469000860818d
  • 😎 Deploy Preview: https://deploy-preview-4543--electric-next.netlify.app

To edit notification comments on pull requests, go to your Netlify project configuration.

samwillis force-pushed the codex/realtime-agents-plan branch from 3ffd31d to d54bb62 on June 9, 2026 18:39

github-actions Bot (Contributor) commented Jun 9, 2026

Electric Agents Desktop Builds

Build artifacts for commit d54bb62.

  • macOS Apple Silicon: Passed (DMG)
  • macOS Intel: Passed (DMG)
  • Windows x64: Passed (Installer)
  • Linux x64: Passed (AppImage / deb)



codecov Bot commented Jun 9, 2026


❌ 2 Tests Failed:

  • Tests completed: 1168 (Failed: 2, Passed: 1166, Skipped: 41)

Top 2 failed tests by shortest run time:
test/process-wake.test.ts > processWake > applies SIGINT that arrives before the handler run controller is created
Stack trace (0.114s run time):
Error: [agent-runtime] entity timeline requires collection "realtimeTranscripts" but it was not registered
 ❯ getOrderableCollection src/entity-timeline.ts:623:11
 ❯ Module.buildEntityTimelineData src/entity-timeline.ts:1048:5
 ❯ timelineMessages src/timeline-context.ts:519:37
 ❯ Module.timelineToMessages src/timeline-context.ts:540:10
 ❯ Object.run src/context-factory.ts:1398:39
 ❯ Object.handler test/process-wake.test.ts:1067:9
 ❯ Module.processWake src/process-wake.ts:2147:9
 ❯ test/process-wake.test.ts:1089:5
test/process-wake.test.ts > processWake > aborts an active run for server-handled SIGKILL without rewriting the signal
Stack trace (5s run time):
Error: Test timed out in 5000ms.
If this is a long-running test, pass a timeout value as the last argument or configure it globally with "testTimeout".
 ❯ test/process-wake.test.ts:1183:3



github-actions Bot (Contributor) commented Jun 9, 2026

Electric Agents Mobile Build

Local mobile checks ran for commit d54bb62.

The EAS Android preview build was skipped because the mobile-eas-build label is not present.
Add the mobile-eas-build label to this PR to produce an installable preview build.



netlify Bot commented Jun 9, 2026


Deploy Preview for electric-next ready!

  • 🔨 Latest commit: d54bb62
  • 🔍 Latest deploy log: https://app.netlify.com/projects/electric-next/deploys/6a285def22d91d0008c570f8
  • 😎 Deploy Preview: https://deploy-preview-4543--electric-next.netlify.app



Labels: None yet

Projects: None yet


1 participant