feat: add TTS engine with Kokoro ONNX backend#2
Conversation
…ndpoint Implements the TTS foundation (M4) with a Kokoro-backed engine that satisfies the ModalityEngine protocol. Includes a thread-safe LRU speaker cache, stdlib-only WAV/PCM audio encoding, and a real `/v1/audio/speech` router that is mounted conditionally when kokoro-onnx is importable, matching the pattern already used for STT. Adds `voicequant tts speak|voices|benchmark-quick` CLI commands and a `tts` optional dependency extra.
|
Warning Rate limit exceeded
Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 47 minutes and 2 seconds. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: ⛔ Files ignored due to path filters (1)
📒 Files selected for processing (31)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@CodeRabbit review full |
|
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Pull request overview
Adds a Kokoro (ONNX) text-to-speech modality to VoiceQuant, including a real OpenAI-compatible-ish TTS HTTP route, a CLI surface, and supporting core utilities (audio encoding + speaker embedding cache), with accompanying tests.
Changes:
- Introduces
TTSEngine(lazy-loading Kokoro backend),TTSConfig,SpeakerCache, and audio conversion utilities (wav/pcm + optional mp3/opus). - Mounts a real
/v1/audio/speechrouter whenkokoro_onnxis available, otherwise falls back to an existing 501 stub; adds/v1/audio/speech/voices. - Adds CLI commands under
voicequant ttsplus a comprehensive new test suite for TTS core + server routing.
Reviewed changes
Copilot reviewed 14 out of 15 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/server/test_tts_routes.py | Adds coverage for stub vs real TTS routing, voices endpoint, and CLI integration. |
| tests/core/tts/test_tts_engine.py | Validates lazy loading, synthesis outputs, speaker cache behavior, metrics, shutdown. |
| tests/core/tts/test_tts_config.py | Covers defaults and pydantic validation for new TTS config. |
| tests/core/tts/test_speaker_cache.py | Tests LRU eviction, stats, and thread safety of speaker cache. |
| tests/core/tts/test_audio.py | Tests wav/pcm encoding, duration calculation, and optional-encoder ImportErrors. |
| tests/core/tts/init.py | Marks TTS test package. |
| src/voicequant/server/routes/tts.py | Adds real TTS FastAPI router (+ voices listing). |
| src/voicequant/server/app.py | Adds conditional TTS availability detection, startup initialization, and router mounting. |
| src/voicequant/core/tts/speaker_cache.py | Implements a thread-safe LRU cache for speaker embeddings. |
| src/voicequant/core/tts/engine.py | Implements Kokoro-backed TTSEngine, synthesis, metrics, and protocol methods. |
| src/voicequant/core/tts/config.py | Introduces TTSConfig (device auto-resolution, defaults). |
| src/voicequant/core/tts/audio.py | Adds wav/pcm encoders, mp3/opus optional encoders, duration helper. |
| src/voicequant/core/tts/init.py | Exposes the new TTS public API surface. |
| src/voicequant/cli.py | Adds voicequant tts commands (voices, speak, quick benchmark). |
| pyproject.toml | Adds tts optional dependency extra and includes it in voice/all. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| engine = get_engine() | ||
| if engine is None: | ||
| raise HTTPException(status_code=503, detail="TTS engine not ready") | ||
|
|
||
| fmt = (request.response_format or "wav").lower() | ||
| try: | ||
| result = engine.synthesize( | ||
| request.input, | ||
| voice=request.voice, | ||
| output_format=fmt, | ||
| ) | ||
| except ImportError as e: | ||
| raise HTTPException(status_code=501, detail=str(e)) from e | ||
| except Exception as e: | ||
| raise HTTPException( | ||
| status_code=500, detail=f"Synthesis failed: {e}" | ||
| ) from e |
There was a problem hiding this comment.
The stub TTS route returns an OpenAI-style {error: {message, type}} payload, but this router raises HTTPException, which yields FastAPI's default {detail: ...} body. If the intent is OpenAI compatibility, consider returning a consistent error schema here (including for 503/500 paths) instead of HTTPException defaults.
| _CONTENT_TYPES = { | ||
| "wav": "audio/wav", | ||
| "mp3": "audio/mpeg", | ||
| "opus": "audio/opus", | ||
| "pcm": "audio/pcm", | ||
| } | ||
|
|
||
|
|
||
| def build_router(get_engine) -> APIRouter: | ||
| """Build the real TTS router. | ||
|
|
||
| `get_engine` returns the live TTSEngine instance (lazy). | ||
| """ | ||
| router = APIRouter(prefix="/v1", tags=["tts"]) | ||
|
|
||
| @router.post("/audio/speech") | ||
| async def speech(request: SpeechRequest) -> Any: | ||
| engine = get_engine() | ||
| if engine is None: | ||
| raise HTTPException(status_code=503, detail="TTS engine not ready") | ||
|
|
||
| fmt = (request.response_format or "wav").lower() | ||
| try: | ||
| result = engine.synthesize( | ||
| request.input, | ||
| voice=request.voice, | ||
| output_format=fmt, | ||
| ) | ||
| except ImportError as e: | ||
| raise HTTPException(status_code=501, detail=str(e)) from e | ||
| except Exception as e: | ||
| raise HTTPException( | ||
| status_code=500, detail=f"Synthesis failed: {e}" | ||
| ) from e | ||
|
|
||
| media_type = _CONTENT_TYPES.get(result.format, "application/octet-stream") | ||
| filename = f"speech.{result.format}" | ||
| return Response( | ||
| content=result.audio_bytes, | ||
| media_type=media_type, | ||
| headers={"Content-Disposition": f'attachment; filename="{filename}"'}, | ||
| ) |
There was a problem hiding this comment.
_CONTENT_TYPES/filename generation treat opus as a standard .opus asset (audio/opus, speech.opus), but the current encoder produces raw frames (no Ogg container). Either change the advertised format/content-type/extension to match what is returned, or emit a proper .opus (Ogg Opus) file.
| def wav_to_opus(wav_bytes: bytes) -> bytes: | ||
| """Encode WAV bytes as Opus. Requires opuslib.""" | ||
| try: | ||
| import opuslib # noqa: F401 | ||
| except ImportError as e: | ||
| raise ImportError( | ||
| "opus encoding requires opuslib. pip install opuslib" | ||
| ) from e | ||
|
|
||
| # opuslib gives you a codec; full container framing is out of scope here. | ||
| # Decode WAV then pass raw PCM to opuslib. Callers needing a file container | ||
| # should install a full encoder. | ||
| import numpy as np |
There was a problem hiding this comment.
wav_to_opus currently returns raw Opus frames without an Ogg/Opus container (the comment notes this), but the rest of the code treats this as an .opus file. Clients expecting a standard .opus file will likely fail to play it; either emit an Ogg Opus container (and consider audio/ogg), or rename/advertise this as a raw stream format instead of opus.
| frame_size = sample_rate // 50 # 20ms frames | ||
| out = bytearray() | ||
| for i in range(0, len(pcm) - frame_size + 1, frame_size): | ||
| chunk = pcm[i : i + frame_size].tobytes() |
There was a problem hiding this comment.
wav_to_opus computes frame_size = sample_rate // 50 (samples per channel for 20ms) but then slices the interleaved pcm buffer by frame_size, which is incorrect for multi-channel audio (it should account for n_channels). This will produce wrong chunk sizes and can break encoding for stereo WAVs.
| frame_size = sample_rate // 50 # 20ms frames | |
| out = bytearray() | |
| for i in range(0, len(pcm) - frame_size + 1, frame_size): | |
| chunk = pcm[i : i + frame_size].tobytes() | |
| frame_size = sample_rate // 50 # 20ms frames, in samples per channel | |
| samples_per_frame = frame_size * n_channels | |
| out = bytearray() | |
| for i in range(0, len(pcm) - samples_per_frame + 1, samples_per_frame): | |
| chunk = pcm[i : i + samples_per_frame].tobytes() |
| def synthesize( | ||
| self, | ||
| text: str, | ||
| voice: str | None = None, | ||
| output_format: str | None = None, | ||
| ) -> SynthesisResult: | ||
| if not self._model_loaded: | ||
| self.load_model() | ||
|
|
||
| voice_id = voice or self.config.default_voice | ||
| fmt = (output_format or self.config.output_format).lower() | ||
|
|
There was a problem hiding this comment.
max_text_length exists in TTSConfig but synthesize() does not enforce it. Adding a length check here (and returning a clear error) would prevent unbounded inputs from hitting the model and potentially causing excessive latency/memory use.
| embedding = self._get_voice_embedding(voice_id) | ||
| samples, sample_rate = self._model.create( | ||
| text, | ||
| voice=embedding, | ||
| speed=1.0, | ||
| lang="en-us", | ||
| ) |
There was a problem hiding this comment.
SpeechRequest includes speed (and model) but synthesize() hardcodes speed=1.0 and a fixed lang, so the request fields are currently ignored. Either plumb these parameters through (router -> engine) or remove them from the request model to avoid implying unsupported behavior.
| if tts_available: | ||
| from voicequant.core.tts.config import TTSConfig | ||
| from voicequant.core.tts.engine import TTSEngine | ||
|
|
||
| tts_cfg = ( | ||
| TTSConfig(**config.tts_config) if config.tts_config else TTSConfig() | ||
| ) | ||
| tts_engine = TTSEngine(tts_cfg) |
There was a problem hiding this comment.
TTSConfig(**config.tts_config) assumes the dict keys match the new TTSConfig fields (e.g., model_name). Existing configs/tests in this repo use tts_config={"model": ...}; consider supporting backward-compatible keys (e.g., map model -> model_name or add a Pydantic alias) to avoid startup-time validation errors for existing users.
| """Audio format conversion utilities for TTS output. | ||
|
|
||
| wav and pcm use stdlib only. mp3 and opus are stretch goals that require | ||
| optional packages and raise a helpful ImportError when they are missing. | ||
| """ |
There was a problem hiding this comment.
The module docstring says WAV/PCM use stdlib only, but _to_int16 depends on NumPy. Either update the docstring to reflect the NumPy dependency, or rework the conversion to avoid NumPy if keeping a stdlib-only claim is important.
| def load_model(self) -> None: | ||
| with self._lock: | ||
| if self._model_loaded: | ||
| return | ||
| from kokoro_onnx import Kokoro | ||
|
|
||
| if self.config.model_path: | ||
| self._model = Kokoro(self.config.model_path) | ||
| else: | ||
| # Let kokoro-onnx resolve its default model. | ||
| self._model = Kokoro() | ||
| self._speaker_cache = SpeakerCache(self.config.speaker_cache_size) | ||
| self._model_loaded = True |
There was a problem hiding this comment.
TTSConfig.device is resolved (cpu/cuda) but never used when constructing the Kokoro model. If kokoro-onnx supports device selection, pass self.config.device through in Kokoro(...); otherwise consider removing the field to avoid a misleading configuration knob.
Adds ruff, mypy, fastapi, httpx, and python-multipart to the dev dependency group so ci.yml's ruff/mypy/pytest steps and test-coverage.yml's pytest step can resolve them. Fixes 25 pre-existing ruff errors (mostly auto-fix: f-string prefixes, import sort, UP035) and 3 real mypy errors (torch total_mem -> total_memory, dict key typing in ttfb scenario). Disables warn_return_any globally and excludes voicequant.core.llm.* from mypy checks per CLAUDE.md frozen-module policy.
test_end_to_end.py builds Q/K/V with torch.randn on the unseeded global RNG and asserts tight cosine-similarity bounds. Different torch builds produce different inputs, so 3-bit cases occasionally fell below the 0.80 threshold in CI. Add a scoped autouse fixture that seeds the global RNG only for that module, keeping inputs deterministic without disturbing other LLM tests that rely on unseeded randomness.
Welcome to Codecov 🎉Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests. ℹ️ You can also turn on project coverage checks and project coverage reporting on Pull Request comment Thanks for integrating Codecov - We've got you covered ☂️ |
|
@CodeRabbit review full |
|
✅ Actions performedFull review triggered. |
No description provided.