A reference implementation showing how to integrate AI into modern applications: real-time voice, streaming chat, narration, transcription, voice cloning, and music generation in one codebase.
It is meant to be read, modified, and copied from. It is not a library and not a hosted service.
Most AI tutorials stop at openai.chat.completions.create(...). Production systems need much more: streaming, real-time voice, interrupts, branching, persistent memory, multiple providers, and a UI built around all of them. This repository puts those concepts in one place so engineers can study a working example instead of stitching one together from blog posts.
| Capability | What you get |
|---|---|
| Real-time voice | Two-way spoken conversation with barge-in, pause, and resume. |
| Streaming chat | Token-level streaming into the React UI, sharing memory with voice mode. |
| Branching | Edit any past message and re-run from that point; the original branch is preserved. |
| Persistent memory | Conversations and user preferences survive restarts. |
| Tool calling | Web search, news, weather, exchange rates, tasks, notes. |
| Provider swapping | One environment variable selects between OpenAI, Anthropic, Ollama, NVIDIA NIM, or anything else init_chat_model supports. |
| Narration | Text input to a clean audio file with chunked synthesis and capability-driven UI. |
| Transcription | Audio uploads (WAV, MP3, M4A, WebM, OGG, FLAC, MP4) to text. |
| Voice cloning | Reference clip to a custom voice usable across the app. |
| Music generation | Text prompt to audio file with background model warmup. |
You speak → Silero VAD detects end-of-turn → Whisper transcribes →
LangGraph agent reasons + streams tokens → Pocket / Kokoro TTS → You hear
Every feature is plain code with no hidden SaaS at the bottom of the call stack.
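The pipeline above runs as decoupled asyncio stages connected by queues. A minimal sketch of that shape — stage names and the fake processing functions are placeholders, not the repository's actual code:

```python
import asyncio

async def stage(name, fn, inbox, outbox):
    """Generic pipeline stage: pull from inbox, process, push to outbox."""
    while True:
        item = await inbox.get()
        if item is None:              # sentinel: propagate shutdown downstream
            await outbox.put(None)
            break
        await outbox.put(fn(item))

async def main():
    # mic -> vad -> stt -> llm -> tts, each pair decoupled by an asyncio.Queue
    q = [asyncio.Queue() for _ in range(5)]
    tasks = [
        asyncio.create_task(stage("vad", lambda pcm: pcm, q[0], q[1])),
        asyncio.create_task(stage("stt", lambda pcm: f"text({pcm})", q[1], q[2])),
        asyncio.create_task(stage("llm", lambda t: f"reply({t})", q[2], q[3])),
        asyncio.create_task(stage("tts", lambda t: f"audio({t})", q[3], q[4])),
    ]
    await q[0].put("chunk0")
    await q[0].put(None)
    await asyncio.gather(*tasks)
    out = []
    while not q[4].empty():
        item = q[4].get_nowait()
        if item is not None:          # drop the shutdown sentinel
            out.append(item)
    return out

print(asyncio.run(main()))  # ['audio(reply(text(chunk0)))']
```

Because each stage only touches its own queues, a slow stage (the LLM) never blocks capture, and an interrupt can drain a single queue without tearing down the rest.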
| To learn how to... | Open... |
|---|---|
| Run a voice pipeline (mic, VAD, STT, LLM, TTS, speaker) | backend/api/routes/v1/voice.py. Four asyncio tasks decoupled by queues. |
| Stream LLM tokens to a browser over WebSocket | Same file (response_orchestrator, tts_streamer). |
| Integrate Silero VAD with proper hysteresis | backend/services/vad/silero.py |
| Detect barge-in cleanly | Two-phase confirmation in voice.py plus the BufferSource cleanup in client/src/hooks/useVoiceAgent.ts |
| Build a LangGraph agent with tools, skills, memory | agent/assistant/graph.py and the middleware stack |
| Switch system prompts between voice and chat | agent/assistant/prompt.py and select_prompt in graph.py |
| Run Whisper for low-latency STT on CPU | backend/services/stt/whisper.py (greedy decoding, warmup) |
| Put two TTS providers behind a capability flag | backend/services/tts/base.py plus the Kokoro and Pocket TTS files |
| Auto 401, refresh, retry with RTK Query | client/src/services/auth/baseQuery.ts |
| Design a WebSocket protocol for a voice agent | docs/ARCHITECTURE.md |
| Gate UI controls on backend capabilities | The page components read /voice/config flags before rendering. |
| Reverse-proxy a multi-service stack with WebSocket upgrade | nginx/config/nginx-dev.conf |
| Run Python and Node services in Docker with hot reload | docker-compose.dev.yml |
The full walkthrough of the voice pipeline (VAD state machine, AEC, latency budget, barge-in, agent integration, complete WebSocket protocol) lives in docs/ARCHITECTURE.md.
All features run end-to-end: voice, chat, narration, transcription, music, voice cloning. The repository is intended for learning rather than direct production deployment. The API surface and configuration schema may still change. Known limitations:
- Multi-user isolation is partial. The LangGraph thread store is shared.
- The personality system is wired through the UI but not yet plumbed into the agent prompt.
- A few legacy environment entries (e.g. STT_PROVIDER=kyutai) are not active.
If you adopt code from this repository, pin to a commit and review the production checklist below.
| Service | Stack | Purpose |
|---|---|---|
| backend/ | FastAPI, uvicorn | HTTP and WebSocket APIs for voice, narrate, transcribe, music, voice cloning, auth, file storage. |
| agent/ | LangGraph, DeepAgents | Reasoning agent with tools, skills, persistent memory, provider-agnostic LLM selection. |
| client/ | React 19, Vite, RTK Query | Real-time UI: streaming chat, voice with waveform, capability-driven settings. |
| nginx/ | nginx 1.27 | Reverse proxy with WebSocket upgrade. |
| postgres | PostgreSQL 16 | LangGraph checkpoints, threads, accounts. |
| redis | Redis 7 | Token blacklist, rate-limit counters. |
git clone https://github.com/<you>/voiceagent.git
cd voiceagent
# Copy every env example. None of the .env files are tracked by git.
cp .env.example .env
cp backend/.env.example backend/.env
cp agent/.env.example agent/.env
cp client/.env.example client/.env
# Minimum edit: open agent/.env and set OPENAI_API_KEY (or any other
# provider supported by LangChain init_chat_model).
docker compose -f docker-compose.dev.yml up -d

Open http://localhost:8080.
First boot pulls model weights (~2 GB Whisper plus ~80 MB Kokoro or ~500 MB Pocket TTS). Subsequent starts use the cache and complete in seconds.
Rebuild from scratch after dependency changes:
docker compose -f docker-compose.dev.yml build --no-cache && docker compose -f docker-compose.dev.yml up -d --build

Each service has its own .env. The project-root .env is intentionally minimal; service configuration lives in backend/.env and agent/.env.
| Variable | Default | Purpose |
|---|---|---|
| STT_PROVIDER | whisper | Speech-to-text. Only whisper is wired today. |
| WHISPER_MODEL | distil-large-v3 | tiny, base, small, medium, large-v3, distil-large-v3. distil-large-v3 is the best speed/accuracy tradeoff on CPU. |
| WHISPER_DEVICE | cpu | cpu or cuda. |
| TTS_PROVIDER | kokoro | kokoro (8 voices, supports speed and language) or pocket_tts (27 voices, supports cloning). |
| KOKORO_VOICE | af_heart | Default voice for Kokoro. |
| POCKET_TTS_VOICE | alba | Default voice for Pocket TTS. |
| POCKET_TTS_LANGUAGE | english | Language is selected at model load. english, french, german, spanish, italian, portuguese. |
| MUSIC_PROVIDER | ace_step | ace_step (uses MusicGen-small) or disabled. |
| VAD_THRESHOLD | 0.5 | Silero activation threshold (range 0.0 to 1.0). |
| VAD_MIN_SILENCE_DURATION_MS | 300 | Required silence before VAD calls end-of-turn. Lower is snappier; higher reduces premature cutoffs. |
| LANGGRAPH_URL | http://voiceagent-agent-dev:8000 | URL the backend uses to reach the agent. |
| SECRET_KEY | dev placeholder | Set in production. 32+ random characters. |
| JWT_EXPIRATION_MINUTES | 30 | Access token lifetime. |
| HF_TOKEN | empty | Required for voice cloning only. Generate at https://huggingface.co/settings/tokens and accept the gated model terms at https://huggingface.co/kyutai/pocket-tts. |
| API_KEY | empty | If set, every API call requires Authorization: Bearer <key>. |
| CORS_ORIGINS | localhost:5173, :3000, :8080, :80 | Add your production domain here. |
| Variable | Default | Purpose |
|---|---|---|
| AGENT_MODEL | openai:gpt-4o-mini | LangChain init_chat_model spec. Examples: openai:gpt-4o-mini, anthropic:claude-3-5-haiku-20241022, ollama:llama3. |
| AGENT_TEMPERATURE | 0.7 | Sampling temperature. |
| MAX_RESULTS | 5 | Default cap on Tavily and web-search results. |
| OPENAI_API_KEY | empty | Required for OpenAI models. |
| TAVILY_API_KEY | empty | Required for web_search. |
| LANGSMITH_API_KEY | empty | Optional. Enables LangSmith tracing. |
The full annotated lists live in each .env.example.
┌─────────────────────────────────────────────────────────────────┐
│ Browser (React) │
│ AudioWorklet capture (16 kHz int16) ◀──── Web Audio playback│
└────────────────────────────┬────────────────────────────────────┘
│ WebSocket + HTTP
▼
┌─────────────────┐
│ nginx :8080 │
└────────┬────────┘
┌────────────────────┼────────────────────┐
│ │ │
/api/v1/* (HTTP) /api/v1/voice/ws /api/* (HTTP, LangGraph SDK)
│ │ │
▼ ▼ │
┌─────────────────────────────────────────────┐ │
│ backend (FastAPI :8000) │ │
│ ─────────────────────────────────────────── │ │
│ │ │
│ Voice WebSocket pipeline: │ │
│ │ │
│ ┌────────┐ │ │
│ │ mic │ ─PCM─┐ │ │
│ └────────┘ ▼ │ │
│ ┌─────────┐ is_echo flag │ │
│ │ AEC │ ◀──────┐ │ │
│ │ (state) │ │ │ │
│ └────┬────┘ │ │ │
│ ▼ │ │ │
│ ┌─────────┐ │ │ │
│ │ VAD │ │ │ │
│ │(Silero) │ │ │ │
│ └────┬────┘ │ │ │
│ ▼ end-of-turn │ │ │
│ ┌─────────┐ │ │ │
│ │ STT │ │ │ │
│ │(Whisper)│ │ │ │
│ └────┬────┘ │ │ │
│ ▼ │ │ │
│ agent.stream_events ─┼──────────┼──▶│
│ │ │ │ │
│ ▼ │ │ ◀─┤ tokens
│ ┌─────────┐ │ │ │
│ │ TTS │ ───────┘ │ │
│ │(Kokoro/ │ (mark playing) │ │
│ │ Pocket) │ │ │
│ └────┬────┘ │ │
│ └────── PCM out ─────────┼───┼──▶ browser plays
│ │ │
│ HTTP routes: │ │
│ /narrate /transcribe /music /clone │ │
│ /auth /personality │ │
└─────────────────────────────────────────────┘ │
▼
┌──────────────────┐
│ agent :8000 │
│ (LangGraph) │
│ ──────────────── │
│ tools: │
│ web, news, │
│ weather, │
│ finance, │
│ memory │
│ skills: │
│ time, task, │
│ info, ... │
└────────┬─────────┘
│
┌───────────────────────────────────────┘
▼
┌──────────────────────────┐
│ Postgres + Redis │
│ users, threads, │
│ checkpoints, blobs │
└──────────────────────────┘
The AEC box is a state flag rather than a DSP filter. It tracks whether TTS is currently playing so that VAD can decide whether incoming microphone audio is the agent's own voice bleeding back through the speaker. Real echo cancellation runs in the browser via getUserMedia({ echoCancellation: true }). The full explanation lives in docs/ARCHITECTURE.md.
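Because the "AEC" is a playback-state gate rather than a filter, the core logic fits in a few lines. A hedged sketch of the idea — class and parameter names are hypothetical, not the contents of backend/services/audio/aec.py:

```python
class EchoGate:
    """Flags mic audio that arrives while TTS is playing as probable echo.

    This is a playback-state gate, not signal-processing echo cancellation;
    real AEC happens in the browser via getUserMedia({echoCancellation: true}).
    """

    def __init__(self, tail_ms=200, frame_ms=100):
        # Keep flagging for a short tail after playback stops, to cover
        # speaker-to-mic latency.
        self.tail_frames = tail_ms // frame_ms
        self.playing = False
        self.tail = 0

    def on_tts(self, playing):
        """Called when TTS playback starts or stops."""
        if playing:
            self.playing = True
        else:
            self.playing = False
            self.tail = self.tail_frames

    def is_echo(self):
        """Call once per mic frame; True while TTS plays, plus a short tail."""
        if self.playing:
            return True
        if self.tail > 0:
            self.tail -= 1
            return True
        return False
```

Downstream, the VAD can discount frames where is_echo() is True unless the barge-in check (sustained, confident speech) overrides the gate.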
user starts speaking
│
▼
AudioWorklet captures 16-kHz int16 PCM in 100 ms chunks
│
▼
WebSocket sends each chunk to the backend
│
▼
per-connection Silero VAD updates state
(SILENCE → SPEECH_START → SPEAKING → SPEECH_END)
│
▼ end-of-utterance after 300 ms of silence
▼
Whisper (greedy, beam=1) transcribes accumulated audio
│
▼
semantic-turn check: if the last word is hanging (modal,
conjunction, preposition, disfluency) hold for 700 ms
│
▼
agent.stream_events(thread_id, text, mode="voice", voice_name=...)
│
▼ tokens stream back as the LLM produces them
▼
sentence-boundary splitter pushes complete sentences
(or first comma break for low first-byte latency) into a TTS queue
│
▼
Pocket TTS or Kokoro streams audio chunks
│
▼
WebSocket sends int16 PCM back to the browser
│
▼
AudioWorklet decodes and plays via the Web Audio scheduler
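The sentence-boundary splitter in the middle of this flow can be sketched as a small buffer over the token stream, where only the first flush is allowed to break at a comma to cut first-byte latency. This is a simplified illustration, not the repository's splitter:

```python
import re

def split_for_tts(tokens):
    """Yield TTS-ready chunks from an LLM token stream.

    The first chunk may flush at a comma to reduce time-to-first-audio;
    later chunks wait for a full sentence boundary.
    """
    buf = ""
    first = True
    for tok in tokens:
        buf += tok
        boundary = r"[,.!?](\s|$)" if first else r"[.!?](\s|$)"
        m = re.search(boundary, buf)
        if m:
            yield buf[:m.end()].strip()
            buf = buf[m.end():]
            first = False
    if buf.strip():
        yield buf.strip()       # flush whatever remains when the stream ends

tokens = ["Sure", ", ", "here is ", "the forecast", ". ", "Rain later", "."]
print(list(split_for_tts(tokens)))
# ['Sure,', 'here is the forecast.', 'Rain later.']
```

The early comma flush means the user hears "Sure," while the rest of the reply is still being generated and synthesized.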
While the agent is speaking, the same VAD watches for the user starting again. After 200 ms of confirmed speech, an interrupt message is sent, the TTS queue drains, the audio context is recreated, and the orchestrator stops.
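Two-phase confirmation means a single hot VAD frame never interrupts playback; speech must persist for the whole confirmation window first. A minimal sketch of that idea — window sizes and names are illustrative, not the voice.py implementation:

```python
class BargeInDetector:
    """Require sustained speech during TTS playback before firing an interrupt."""

    def __init__(self, confirm_ms=200, frame_ms=100):
        self.frames_needed = confirm_ms // frame_ms
        self.hot_frames = 0

    def feed(self, is_speech, tts_playing):
        """Call once per mic frame; return True exactly when barge-in confirms."""
        if not tts_playing:
            self.hot_frames = 0        # nothing to barge into
            return False
        if is_speech:
            self.hot_frames += 1       # phase 1: candidate speech detected
            if self.hot_frames == self.frames_needed:
                return True            # phase 2: sustained speech -> interrupt
        else:
            self.hot_frames = 0        # the candidate was a blip; reset
        return False

d = BargeInDetector()
frames = [(True, True), (False, True), (True, True), (True, True)]
print([d.feed(s, p) for s, p in frames])
# [False, False, False, True]
```

The reset on any silent frame is what filters out coughs and echo spikes, so only a genuine second utterance drains the TTS queue.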
Endpoint: ws://<host>/api/v1/voice/ws
| Message | Notes |
|---|---|
| binary frame (Int16Array) | 16-kHz mono PCM. The browser captures at 48 kHz and resamples client-side. |
| {type: "config", voice_id, personality_id, speed, language} | Sent on connect and on any change. speed and language are honored only if the active TTS provider supports them. |
| {type: "text_input", text} | Bypasses STT and sends text directly to the agent. |
| Message | When |
|---|---|
| {type: "thread", thread_id} | Once, immediately after connect. |
| {type: "thread_title", thread_id, title} | After the first user transcript; auto-titles the conversation. |
| {type: "vad", state, probability, is_speaking, is_echo, is_responding} | On VAD state changes (rate-limited). |
| {type: "partial_transcript", text, is_final} | Live transcript updates. Final fires when end-of-turn is confirmed. |
| {type: "text_stream", text, done} | LLM token stream for the in-progress AI message bubble. |
| {type: "spoken_text", text} | Sentence-by-sentence record of what TTS will speak (post-sanitization). |
| {type: "audio_info", sample_rate} | Once per response; the sample rate of the upcoming audio frames. |
| binary frame (Int16Array) | TTS audio. |
| {type: "interrupt"} | Server confirmed barge-in. The client should drop its playback queue. |
| {type: "error", message} | Pipeline error. |
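A client consuming this protocol mostly needs to distinguish binary PCM frames from JSON control messages and dispatch on type. A minimal sketch of that dispatch in Python — handler names are illustrative; the real client lives in client/src/hooks/useVoiceAgent.ts:

```python
import json

def dispatch(frame, on_audio, on_event):
    """Route one WebSocket frame: bytes are PCM audio, text is a typed JSON event."""
    if isinstance(frame, (bytes, bytearray)):
        on_audio(frame)               # int16 PCM at the rate from audio_info
        return "audio"
    msg = json.loads(frame)
    on_event(msg["type"], msg)        # e.g. vad, text_stream, interrupt, error
    return msg["type"]

events = []
dispatch(b"\x00\x01", lambda pcm: events.append(len(pcm)), lambda t, m: None)
dispatch('{"type": "interrupt"}', lambda pcm: None,
         lambda t, m: events.append(t))
print(events)
# [2, 'interrupt']
```

On "interrupt" the event handler should drop its playback queue, matching the contract in the table above.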
Hot-reload stack:
docker compose -f docker-compose.dev.yml up -d
docker compose -f docker-compose.dev.yml logs -f backend

Restart a single service:
docker compose -f docker-compose.dev.yml restart backend
docker compose -f docker-compose.dev.yml restart agent

Tests:
docker exec voiceagent-backend-dev pytest
docker exec voiceagent-client-dev npm test
docker exec voiceagent-client-dev npm run type-check

For a non-Docker setup, see CONTRIBUTING.md.
The repository is shaped for learning. To adapt it for production:
- Replace docker-compose.dev.yml with a production compose file: no bind mounts, only nginx exposed, and ENVIRONMENT=production set.
- Generate a real SECRET_KEY with python -c 'import secrets; print(secrets.token_urlsafe(32))'.
- Use a strong Postgres password and managed Redis.
- Put nginx behind TLS (Let's Encrypt via Certbot, Cloudflare, or a load balancer).
- Set CORS_ORIGINS to your real domain.
- Pin Docker image tags so deploys are reproducible.
The full hardening checklist lives in SECURITY.md.
voiceagent/
├── agent/ LangGraph reasoning service
│ ├── assistant/
│ │ ├── graph.py agent factory + middleware stack
│ │ ├── prompt.py CHAT_SYSTEM_PROMPT, VOICE_SYSTEM_PROMPT
│ │ ├── config.py
│ │ ├── tools/ web, news, weather, finance, memory
│ │ └── skills/ time-management, task-management, ...
│ └── pyproject.toml
├── backend/ FastAPI voice/narrate/music/auth API
│ ├── api/routes/v1/
│ │ ├── voice.py voice WS + narrate + transcribe + clone
│ │ ├── agent.py LangGraph thread CRUD passthrough
│ │ ├── music.py music generation
│ │ ├── auth.py login / register / refresh
│ │ └── personality.py
│ ├── services/
│ │ ├── voice_pipeline.py STT + TTS + VAD orchestration
│ │ ├── stt/whisper.py
│ │ ├── tts/kokoro.py
│ │ ├── tts/pocket_tts.py
│ │ ├── vad/silero.py
│ │ ├── audio/aec.py
│ │ ├── agent/client.py LangGraph SDK wrapper
│ │ └── music/ace_step.py
│ └── pyproject.toml
├── client/ React 19 + Vite + RTK Query
│ ├── src/
│ │ ├── pages/ Converse, Narrate, Transcribe, Music
│ │ ├── components/
│ │ ├── hooks/ useVoiceAgent, useChat, useAudioPlayer
│ │ ├── services/ RTK Query APIs (auth, voice, music, personality)
│ │ ├── store/ Redux slice
│ │ └── context/ VoiceConfigContext
│ └── package.json
├── nginx/
│ ├── Dockerfile
│ └── config/ nginx-dev.conf, nginx.conf, proxy.conf
├── docker-compose.dev.yml
├── docs/
│ └── ARCHITECTURE.md deep dive on the voice pipeline
├── CONTRIBUTING.md
├── SECURITY.md
└── LICENSE
Pull requests are welcome. See CONTRIBUTING.md for setup, branch conventions, and the test harness.
Vulnerability reports go through SECURITY.md.
This project depends on work by several open-source teams:
- LangChain and LangGraph. The agent runtime, checkpointing, and tool framework that the entire agent/ service is built on.
- Kyutai. Pocket TTS provides 27 voices, voice cloning, and sub-200 ms first-byte latency on CPU.
- Hexgrad / Kokoro. The 82M-parameter ONNX TTS model used as the fast default voice without a GPU.
- SYSTRAN / faster-whisper. CTranslate2-optimized Whisper, roughly six times faster than the reference implementation on CPU.
- Silero Team. The 1.5 MB neural voice activity detector used for turn-taking and barge-in.
- Open-Meteo. Free, no-key weather API used by the agent's get_weather tool.
- Tavily. Web search and content extraction.
The research and models above are theirs. This repository shows one way to integrate them.
MIT.