Skip to content

feat: add TTS engine with Kokoro ONNX backend#2

Merged
mahimairaja merged 3 commits intomainfrom
claude/add-tts-kokoro-foundation-BUl5X
Apr 20, 2026
Merged

feat: add TTS engine with Kokoro ONNX backend#2
mahimairaja merged 3 commits intomainfrom
claude/add-tts-kokoro-foundation-BUl5X

Conversation

@mahimairaja
Copy link
Copy Markdown
Contributor

No description provided.

…ndpoint

Implements the TTS foundation (M4) with a Kokoro-backed engine that
satisfies the ModalityEngine protocol. Includes a thread-safe LRU
speaker cache, stdlib-only WAV/PCM audio encoding, and a real
`/v1/audio/speech` router that is mounted conditionally when
kokoro-onnx is importable, matching the pattern already used for STT.
Adds `voicequant tts speak|voices|benchmark-quick` CLI commands and a
`tts` optional dependency extra.
@mahimairaja mahimairaja self-assigned this Apr 20, 2026
Copilot AI review requested due to automatic review settings April 20, 2026 19:56
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 20, 2026

Warning

Rate limit exceeded

@mahimairaja has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 47 minutes and 2 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 47 minutes and 2 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0c6420bb-d33e-4e61-b6f3-271b7054201c

📥 Commits

Reviewing files that changed from the base of the PR and between 5267637 and 8ec6bdb.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (31)
  • examples/inference_gateway.py
  • examples/livekit_agent.py
  • examples/pipecat_agent.py
  • examples/quick_bench.py
  • pyproject.toml
  • src/voicequant/benchmarks/__init__.py
  • src/voicequant/benchmarks/report.py
  • src/voicequant/benchmarks/runner.py
  • src/voicequant/benchmarks/scenarios/quality.py
  • src/voicequant/benchmarks/scenarios/ttfb.py
  • src/voicequant/benchmarks/visualize.py
  • src/voicequant/cli.py
  • src/voicequant/core/llm/codebook.py
  • src/voicequant/core/tts/__init__.py
  • src/voicequant/core/tts/audio.py
  • src/voicequant/core/tts/config.py
  • src/voicequant/core/tts/engine.py
  • src/voicequant/core/tts/speaker_cache.py
  • src/voicequant/deploy/__init__.py
  • src/voicequant/deploy/runpod_handler.py
  • src/voicequant/server/app.py
  • src/voicequant/server/engine.py
  • src/voicequant/server/routes/capacity.py
  • src/voicequant/server/routes/tts.py
  • tests/core/llm/conftest.py
  • tests/core/tts/__init__.py
  • tests/core/tts/test_audio.py
  • tests/core/tts/test_speaker_cache.py
  • tests/core/tts/test_tts_config.py
  • tests/core/tts/test_tts_engine.py
  • tests/server/test_tts_routes.py
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch claude/add-tts-kokoro-foundation-BUl5X

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@mahimairaja
Copy link
Copy Markdown
Contributor Author

@CodeRabbit review full

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 20, 2026

@mahimairaja Sure! I'll redo the full review of the PR now.

✅ Actions performed

Full review triggered.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a Kokoro (ONNX) text-to-speech modality to VoiceQuant, including a real OpenAI-compatible-ish TTS HTTP route, a CLI surface, and supporting core utilities (audio encoding + speaker embedding cache), with accompanying tests.

Changes:

  • Introduces TTSEngine (lazy-loading Kokoro backend), TTSConfig, SpeakerCache, and audio conversion utilities (wav/pcm + optional mp3/opus).
  • Mounts a real /v1/audio/speech router when kokoro_onnx is available, otherwise falls back to an existing 501 stub; adds /v1/audio/speech/voices.
  • Adds CLI commands under voicequant tts plus a comprehensive new test suite for TTS core + server routing.

Reviewed changes

Copilot reviewed 14 out of 15 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
tests/server/test_tts_routes.py Adds coverage for stub vs real TTS routing, voices endpoint, and CLI integration.
tests/core/tts/test_tts_engine.py Validates lazy loading, synthesis outputs, speaker cache behavior, metrics, shutdown.
tests/core/tts/test_tts_config.py Covers defaults and pydantic validation for new TTS config.
tests/core/tts/test_speaker_cache.py Tests LRU eviction, stats, and thread safety of speaker cache.
tests/core/tts/test_audio.py Tests wav/pcm encoding, duration calculation, and optional-encoder ImportErrors.
tests/core/tts/init.py Marks TTS test package.
src/voicequant/server/routes/tts.py Adds real TTS FastAPI router (+ voices listing).
src/voicequant/server/app.py Adds conditional TTS availability detection, startup initialization, and router mounting.
src/voicequant/core/tts/speaker_cache.py Implements a thread-safe LRU cache for speaker embeddings.
src/voicequant/core/tts/engine.py Implements Kokoro-backed TTSEngine, synthesis, metrics, and protocol methods.
src/voicequant/core/tts/config.py Introduces TTSConfig (device auto-resolution, defaults).
src/voicequant/core/tts/audio.py Adds wav/pcm encoders, mp3/opus optional encoders, duration helper.
src/voicequant/core/tts/init.py Exposes the new TTS public API surface.
src/voicequant/cli.py Adds voicequant tts commands (voices, speak, quick benchmark).
pyproject.toml Adds tts optional dependency extra and includes it in voice/all.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +37 to +53
engine = get_engine()
if engine is None:
raise HTTPException(status_code=503, detail="TTS engine not ready")

fmt = (request.response_format or "wav").lower()
try:
result = engine.synthesize(
request.input,
voice=request.voice,
output_format=fmt,
)
except ImportError as e:
raise HTTPException(status_code=501, detail=str(e)) from e
except Exception as e:
raise HTTPException(
status_code=500, detail=f"Synthesis failed: {e}"
) from e
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The stub TTS route returns an OpenAI-style {error: {message, type}} payload, but this router raises HTTPException, which yields FastAPI's default {detail: ...} body. If the intent is OpenAI compatibility, consider returning a consistent error schema here (including for 503/500 paths) instead of HTTPException defaults.

Copilot uses AI. Check for mistakes.
Comment on lines +20 to +61
_CONTENT_TYPES = {
"wav": "audio/wav",
"mp3": "audio/mpeg",
"opus": "audio/opus",
"pcm": "audio/pcm",
}


def build_router(get_engine) -> APIRouter:
"""Build the real TTS router.

`get_engine` returns the live TTSEngine instance (lazy).
"""
router = APIRouter(prefix="/v1", tags=["tts"])

@router.post("/audio/speech")
async def speech(request: SpeechRequest) -> Any:
engine = get_engine()
if engine is None:
raise HTTPException(status_code=503, detail="TTS engine not ready")

fmt = (request.response_format or "wav").lower()
try:
result = engine.synthesize(
request.input,
voice=request.voice,
output_format=fmt,
)
except ImportError as e:
raise HTTPException(status_code=501, detail=str(e)) from e
except Exception as e:
raise HTTPException(
status_code=500, detail=f"Synthesis failed: {e}"
) from e

media_type = _CONTENT_TYPES.get(result.format, "application/octet-stream")
filename = f"speech.{result.format}"
return Response(
content=result.audio_bytes,
media_type=media_type,
headers={"Content-Disposition": f'attachment; filename="{filename}"'},
)
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_CONTENT_TYPES/filename generation treat opus as a standard .opus asset (audio/opus, speech.opus), but the current encoder produces raw frames (no Ogg container). Either change the advertised format/content-type/extension to match what is returned, or emit a proper .opus (Ogg Opus) file.

Copilot uses AI. Check for mistakes.
Comment on lines +69 to +81
def wav_to_opus(wav_bytes: bytes) -> bytes:
"""Encode WAV bytes as Opus. Requires opuslib."""
try:
import opuslib # noqa: F401
except ImportError as e:
raise ImportError(
"opus encoding requires opuslib. pip install opuslib"
) from e

# opuslib gives you a codec; full container framing is out of scope here.
# Decode WAV then pass raw PCM to opuslib. Callers needing a file container
# should install a full encoder.
import numpy as np
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wav_to_opus currently returns raw Opus frames without an Ogg/Opus container (the comment notes this), but the rest of the code treats this as an .opus file. Clients expecting a standard .opus file will likely fail to play it; either emit an Ogg Opus container (and consider audio/ogg), or rename/advertise this as a raw stream format instead of opus.

Copilot uses AI. Check for mistakes.
Comment on lines +91 to +94
frame_size = sample_rate // 50 # 20ms frames
out = bytearray()
for i in range(0, len(pcm) - frame_size + 1, frame_size):
chunk = pcm[i : i + frame_size].tobytes()
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wav_to_opus computes frame_size = sample_rate // 50 (samples per channel for 20ms) but then slices the interleaved pcm buffer by frame_size, which is incorrect for multi-channel audio (it should account for n_channels). This will produce wrong chunk sizes and can break encoding for stereo WAVs.

Suggested change
frame_size = sample_rate // 50 # 20ms frames
out = bytearray()
for i in range(0, len(pcm) - frame_size + 1, frame_size):
chunk = pcm[i : i + frame_size].tobytes()
frame_size = sample_rate // 50 # 20ms frames, in samples per channel
samples_per_frame = frame_size * n_channels
out = bytearray()
for i in range(0, len(pcm) - samples_per_frame + 1, samples_per_frame):
chunk = pcm[i : i + samples_per_frame].tobytes()

Copilot uses AI. Check for mistakes.
Comment on lines +126 to +137
def synthesize(
self,
text: str,
voice: str | None = None,
output_format: str | None = None,
) -> SynthesisResult:
if not self._model_loaded:
self.load_model()

voice_id = voice or self.config.default_voice
fmt = (output_format or self.config.output_format).lower()

Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

max_text_length exists in TTSConfig but synthesize() does not enforce it. Adding a length check here (and returning a clear error) would prevent unbounded inputs from hitting the model and potentially causing excessive latency/memory use.

Copilot uses AI. Check for mistakes.
Comment on lines +141 to +147
embedding = self._get_voice_embedding(voice_id)
samples, sample_rate = self._model.create(
text,
voice=embedding,
speed=1.0,
lang="en-us",
)
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SpeechRequest includes speed (and model) but synthesize() hardcodes speed=1.0 and a fixed lang, so the request fields are currently ignored. Either plumb these parameters through (router -> engine) or remove them from the request model to avoid implying unsupported behavior.

Copilot uses AI. Check for mistakes.
Comment on lines +93 to +100
if tts_available:
from voicequant.core.tts.config import TTSConfig
from voicequant.core.tts.engine import TTSEngine

tts_cfg = (
TTSConfig(**config.tts_config) if config.tts_config else TTSConfig()
)
tts_engine = TTSEngine(tts_cfg)
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TTSConfig(**config.tts_config) assumes the dict keys match the new TTSConfig fields (e.g., model_name). Existing configs/tests in this repo use tts_config={"model": ...}; consider supporting backward-compatible keys (e.g., map model -> model_name or add a Pydantic alias) to avoid startup-time validation errors for existing users.

Copilot uses AI. Check for mistakes.
Comment on lines +1 to +5
"""Audio format conversion utilities for TTS output.

wav and pcm use stdlib only. mp3 and opus are stretch goals that require
optional packages and raise a helpful ImportError when they are missing.
"""
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The module docstring says WAV/PCM use stdlib only, but _to_int16 depends on NumPy. Either update the docstring to reflect the NumPy dependency, or rework the conversion to avoid NumPy if keeping a stdlib-only claim is important.

Copilot uses AI. Check for mistakes.
Comment on lines +67 to +79
def load_model(self) -> None:
with self._lock:
if self._model_loaded:
return
from kokoro_onnx import Kokoro

if self.config.model_path:
self._model = Kokoro(self.config.model_path)
else:
# Let kokoro-onnx resolve its default model.
self._model = Kokoro()
self._speaker_cache = SpeakerCache(self.config.speaker_cache_size)
self._model_loaded = True
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TTSConfig.device is resolved (cpu/cuda) but never used when constructing the Kokoro model. If kokoro-onnx supports device selection, pass self.config.device through in Kokoro(...); otherwise consider removing the field to avoid a misleading configuration knob.

Copilot uses AI. Check for mistakes.
claude added 2 commits April 20, 2026 20:10
Adds ruff, mypy, fastapi, httpx, and python-multipart to the dev
dependency group so ci.yml's ruff/mypy/pytest steps and
test-coverage.yml's pytest step can resolve them.

Fixes 25 pre-existing ruff errors (mostly auto-fix: f-string prefixes,
import sort, UP035) and 3 real mypy errors (torch total_mem ->
total_memory, dict key typing in ttfb scenario). Disables
warn_return_any globally and excludes voicequant.core.llm.* from mypy
checks per CLAUDE.md frozen-module policy.
test_end_to_end.py builds Q/K/V with torch.randn on the unseeded global
RNG and asserts tight cosine-similarity bounds. Different torch builds
produce different inputs, so 3-bit cases occasionally fell below the
0.80 threshold in CI. Add a scoped autouse fixture that seeds the
global RNG only for that module, keeping inputs deterministic without
disturbing other LLM tests that rely on unseeded randomness.
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 20, 2026

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

ℹ️ You can also turn on project coverage checks and project coverage reporting on Pull Request comment

Thanks for integrating Codecov - We've got you covered ☂️

@mahimairaja
Copy link
Copy Markdown
Contributor Author

@CodeRabbit review full

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 20, 2026

@mahimairaja Sure! I'll redo the full review of the PR now.

✅ Actions performed

Full review triggered.

@mahimairaja mahimairaja merged commit 1fe79c0 into main Apr 20, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants