Add Kokoro ONNX TTS backend with CLI and server integration #1

mahimairaja wants to merge 4 commits into main
Conversation
📝 Walkthrough

A comprehensive text-to-speech feature set has been added to the system, introducing Kokoro ONNX-based synthesis with configuration, CLI commands, FastAPI routes, speaker caching, audio format conversion, metrics tracking, and tests. Dependencies have been updated to include TTS-related packages.
Sequence Diagram

```mermaid
sequenceDiagram
    actor Client
    participant CLI/Server
    participant TTSEngine
    participant SpeakerCache
    participant KokoroONNX
    Client->>CLI/Server: Request synthesis (text, voice, format)
    CLI/Server->>TTSEngine: synthesize(text, voice, output_format)
    TTSEngine->>TTSEngine: Validate text length
    TTSEngine->>TTSEngine: Load model (once, lazy)
    TTSEngine->>SpeakerCache: get(voice_id)
    alt Cache Hit
        SpeakerCache-->>TTSEngine: speaker_embedding
    else Cache Miss
        TTSEngine->>KokoroONNX: get_speaker_embedding(voice_id)
        KokoroONNX-->>TTSEngine: speaker_embedding
        TTSEngine->>SpeakerCache: put(voice_id, embedding)
    end
    TTSEngine->>KokoroONNX: synthesize(text, speaker_embedding)
    KokoroONNX-->>TTSEngine: audio_samples, sample_rate
    TTSEngine->>TTSEngine: Encode audio (WAV/PCM/MP3/Opus)
    TTSEngine->>TTSEngine: Compute duration & latency metrics
    TTSEngine-->>CLI/Server: SynthesisResult
    CLI/Server-->>Client: Audio bytes + metadata
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ❌ 1 failed (1 warning), ✅ 2 passed
Pull request overview
Adds a Kokoro ONNX-backed text-to-speech (TTS) modality to VoiceQuant, exposing synthesis and voice listing through both the FastAPI server and the Typer CLI, while keeping optional-dependency environments importable.
Changes:
- Introduced the `voicequant.core.tts` package (engine/config/audio helpers + speaker embedding cache).
- Added `/v1/audio/speech` and `/v1/audio/speech/voices` FastAPI routes with conditional mounting (real vs stub).
- Added a `voicequant tts` CLI group (`speak`, `voices`, `benchmark-quick`) and a `tts` optional dependency extra; added unit/server/CLI tests.
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `src/voicequant/core/tts/audio.py` | Audio encoding/conversion helpers for TTS outputs. |
| `src/voicequant/core/tts/config.py` | Pydantic configuration for the Kokoro TTS backend. |
| `src/voicequant/core/tts/engine.py` | Lazy-loading Kokoro ONNX engine implementing the modality protocol. |
| `src/voicequant/core/tts/speaker_cache.py` | Thread-safe LRU cache for speaker embeddings. |
| `src/voicequant/core/tts/__init__.py` | Re-exports of TTS public API. |
| `src/voicequant/server/routes/tts.py` | Real OpenAI-compatible TTS endpoints returning audio bytes and voice list. |
| `src/voicequant/server/app.py` | Conditional TTS engine initialization and router mounting in `create_app`. |
| `src/voicequant/core/__init__.py` | Makes `voicequant.core` imports tolerant of missing heavy optional deps. |
| `src/voicequant/cli.py` | Adds `tts` Typer command group and subcommands. |
| `pyproject.toml` | Adds `tts` optional dependency group and includes it in `voice`/`all` extras. |
| `tests/core/tts/*` | Unit tests for audio helpers, config validation, engine behavior, and speaker cache. |
| `tests/server/test_tts_routes.py` | Server route tests for stub/real TTS routing and conditional mounting. |
| `tests/server/test_tts_cli.py` | CLI test ensuring the `tts` group is present. |
```python
bytes_per_sample = 2
samples = len(audio_bytes) / bytes_per_sample
return samples / sample_rate if sample_rate else 0.0
```

`get_audio_duration()` returns 0.0 for formats other than wav/pcm, but `TTSEngine.synthesize()` uses it to populate `duration_seconds` for mp3/opus outputs. This makes `duration_seconds` incorrect for those formats; compute duration from the raw samples/sample_rate before encoding (or add duration support for encoded formats).

Suggested addition for encoded formats:

```python
if fmt == "mp3":
    from mutagen.mp3 import MP3
    audio = MP3(io.BytesIO(audio_bytes))
    return float(audio.info.length) if audio.info else 0.0
if fmt == "opus":
    from mutagen.oggopus import OggOpus
    audio = OggOpus(io.BytesIO(audio_bytes))
    return float(audio.info.length) if audio.info else 0.0
```
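Alternatively, the duration can be derived once from the raw samples before any encoding, which is correct for every output format and avoids the mutagen dependency. A minimal sketch; `duration_from_samples` is an illustrative helper name, not the PR's actual API:

```python
# Sketch: compute duration from raw (pre-encoding) samples so the
# value is correct regardless of the encoded output format.
from typing import Sequence


def duration_from_samples(samples: Sequence[float], sample_rate: int) -> float:
    """Seconds of audio represented by `samples` at `sample_rate` Hz."""
    return len(samples) / sample_rate if sample_rate else 0.0
```

The engine would compute this right after synthesis and carry it into the result, keeping byte-level probing as a fallback for wav/pcm only.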
```python
finally:
    elapsed_ms = (time.time() - t0) * 1000
    self._latency_sum_ms += elapsed_ms
    self._syntheses_total += 1
    self._decr_active()
```

TTSEngine updates `_latency_sum_ms` and `_syntheses_total` in `finally` without any synchronization. Under concurrent FastAPI requests this can lose updates and produce inconsistent metrics/capacity values; protect these counters with a lock (or use thread-safe/atomic primitives) similar to `_active_lock`.
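The locking discipline this comment asks for can be sketched as follows. The class and field names mirror the review's description but are assumptions, not the PR's actual code:

```python
# Sketch: all mutations and reads of the metrics counters go through
# one lock, so concurrent requests cannot lose updates and snapshots
# are internally consistent.
import threading


class Metrics:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._latency_sum_ms = 0.0
        self._syntheses_total = 0

    def record(self, elapsed_ms: float) -> None:
        # Called from the synthesize() finally-block equivalent.
        with self._lock:
            self._latency_sum_ms += elapsed_ms
            self._syntheses_total += 1

    def snapshot(self) -> tuple[float, int]:
        # Reads take the same lock so the pair is consistent.
        with self._lock:
            return self._latency_sum_ms, self._syntheses_total
```

The same pattern answers the `capacity()` comment below: reading `_active` while holding `_active_lock` gives a snapshot that cannot interleave with an in-flight increment/decrement.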
```python
def capacity(self) -> CapacityReport:
    active = self._active
    headroom = max(0, self.config.max_concurrent - active)
    saturated = active >= self.config.max_concurrent
    avg_ms = (
```

`CapacityReport.active` is read from `self._active` without acquiring `_active_lock`, while increments/decrements are locked. This can report stale/inconsistent values under concurrency; read `_active` under the same lock (or use an atomic counter).
```python
for _name in ("engine", "codebook", "config", "constants", "wrapper", "validator"):
    try:
        _sys.modules[f"voicequant.core.{_name}"] = _importlib.import_module(
            f"voicequant.core.llm.{_name}"
        )
    except ImportError:
        # Keep package importable in minimal environments where optional
        # heavy deps (torch/scipy) are not installed yet.
        pass
```

The top-level alias imports swallow any ImportError from `voicequant.core.llm.*`, which can hide real bugs inside those modules (not just missing optional deps). Consider narrowing this to `ModuleNotFoundError` for known optional packages (or checking `exc.name`) so genuine import failures still surface when dependencies are installed.
```python
def wav_to_opus(wav_bytes: bytes) -> bytes:
    """Convert WAV bytes to Opus bytes (optional dependency)."""
    try:
        import opuslib  # noqa: F401
    except ImportError as exc:
        raise ImportError(
            "opus encoding requires opuslib. pip install opuslib"
        ) from exc

    raise NotImplementedError("Opus conversion path is not implemented yet")
```

`wav_to_opus()` unconditionally raises NotImplementedError after the optional opuslib import, but the rest of the stack (TTSConfig validator, CLI help, and TTSEngine `_encode_audio`) advertises opus as a supported format. This will cause runtime failures for `response_format="opus"`; either implement Opus encoding or remove opus from supported formats and user-facing options until it's implemented.
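If Opus encoding is deferred, the second option the comment suggests can be sketched by gating the advertised formats so validation fails early with a clear error instead of a 500 at encode time. `SUPPORTED_FORMATS` and `validate_format` are illustrative names, not the PR's actual identifiers:

```python
# Sketch: drop "opus" from the advertised formats until the encoder
# lands, so config/CLI validation rejects it up front.
SUPPORTED_FORMATS = frozenset({"wav", "pcm", "mp3"})


def validate_format(fmt: str) -> str:
    fmt = fmt.lower()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(
            f"unsupported output format {fmt!r}; "
            f"choose one of {sorted(SUPPORTED_FORMATS)}"
        )
    return fmt
```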
Actionable comments posted: 9
🧹 Nitpick comments (6)

pyproject.toml (1)

53-57: Minor: numpy lower bound inconsistency.

The `tts` extra pins `numpy>=1.24.0` while the `bench` extra uses `numpy>=1.26.0` and the Python floor is `>=3.12`. For consistency, and to avoid accidentally resolving an older wheel when someone installs only `voicequant[tts]`, consider aligning to `numpy>=1.26.0` (which is also the minimum with stable Python 3.12 wheels).

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pyproject.toml` around lines 53-57: the tts extra currently lists "numpy>=1.24.0", which is inconsistent with the bench extra and Python 3.12 wheel compatibility; update the tts extras array in pyproject.toml to use "numpy>=1.26.0" (replace the existing numpy entry in the tts list) so the tts extra aligns with the bench extra and the project's supported Python wheels.

src/voicequant/core/__init__.py (1)

8-16: Silently swallowing `ImportError` can mask real bugs.

A `ModuleNotFoundError` from a missing optional dep (e.g., `torch`) and an `ImportError` caused by a genuine bug inside `voicequant.core.llm.<name>` (e.g., a broken relative import) are indistinguishable here. Consider narrowing to `ModuleNotFoundError` and checking `exc.name` against the known heavy deps, or at least emitting a `logging.debug` with the exception so failures are traceable.

Proposed tweak:

```diff
-    try:
-        _sys.modules[f"voicequant.core.{_name}"] = _importlib.import_module(
-            f"voicequant.core.llm.{_name}"
-        )
-    except ImportError:
-        # Keep package importable in minimal environments where optional
-        # heavy deps (torch/scipy) are not installed yet.
-        pass
+    try:
+        _sys.modules[f"voicequant.core.{_name}"] = _importlib.import_module(
+            f"voicequant.core.llm.{_name}"
+        )
+    except ModuleNotFoundError as exc:
+        # Keep package importable in minimal environments where optional
+        # heavy deps (torch/scipy) are not installed yet.
+        import logging as _logging
+        _logging.getLogger(__name__).debug(
+            "Skipping voicequant.core.%s alias: %s", _name, exc
+        )
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/voicequant/core/__init__.py` around lines 8-16: the import loop that currently catches all ImportError for voicequant.core.llm.<name> can mask real bugs; update the try/except around the _importlib.import_module call to only catch ModuleNotFoundError (not ImportError), and when caught inspect the exception's name (exc.name) against known heavy deps (e.g., "torch", "scipy") before skipping, otherwise re-raise; additionally emit a debug log with the exception (using the package logger) when skipping so failures are traceable. Refer to the loop variable _name, the call to _importlib.import_module, and the assignment to _sys.modules[f"voicequant.core.{_name}"] to locate where to implement this change.

src/voicequant/core/tts/audio.py (2)
71: Nit: parameter `format` shadows the builtin.

Consider renaming to `fmt` for consistency with `TTSConfig.output_format` callers and to avoid shadowing `builtins.format`.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/voicequant/core/tts/audio.py` at line 71: rename the parameter named format in the function get_audio_duration to fmt to avoid shadowing the built-in format and to match TTSConfig.output_format naming; update the function signature to get_audio_duration(audio_bytes: bytes, fmt: str, sample_rate: int) and change all internal references and all call sites that pass TTSConfig.output_format (or any variable named format) to pass fmt instead, so callers and implementation are consistent.

11-16: Consider vectorizing `_to_int16_bytes` with NumPy.

`numpy` is declared in the `tts` extra, so it's available whenever this module is actually used. A pure-Python loop over each sample is ~50–100× slower than a vectorized `np.clip` + `(* 32767).astype(np.int16).tobytes()` and becomes noticeable on multi-second utterances at 24 kHz.

Proposed refactor:

```diff
-def _to_int16_bytes(samples: Iterable[float]) -> bytes:
-    pcm = array("h")
-    for s in samples:
-        v = max(-1.0, min(1.0, float(s)))
-        pcm.append(int(v * 32767.0))
-    return pcm.tobytes()
+def _to_int16_bytes(samples: Iterable[float]) -> bytes:
+    import numpy as np
+    arr = np.asarray(list(samples) if not hasattr(samples, "__array__") else samples,
+                     dtype=np.float32)
+    np.clip(arr, -1.0, 1.0, out=arr)
+    return (arr * 32767.0).astype(np.int16).tobytes()
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/voicequant/core/tts/audio.py` around lines 11-16: replace the slow Python loop in _to_int16_bytes with a NumPy vectorized path: import numpy as np (top-level or lazy), convert samples to an ndarray via np.asarray(samples, dtype=np.float32), apply np.clip(arr, -1.0, 1.0), multiply by 32767, cast to np.int16 with astype(np.int16), and return .tobytes(); keep the current array("h") approach as a fallback if NumPy is unavailable, or add a single import since numpy is provided in the tts extra.

tests/core/tts/test_tts_engine.py (1)
59-69: Tighten the cache assertion; `>=` silently tolerates a broken cache.

If speaker caching regresses and every lookup misses, `first["speaker_cache_hit_rate"]` and `second["speaker_cache_hit_rate"]` both stay at `0.0` and the test still passes. With the current mock and two calls for the same voice, the expected hit rate after the second synthesis is exactly `0.5`, so the check can be made strict.

🧪 Proposed tighter assertion:

```diff
-    assert second["speaker_cache_hit_rate"] >= first["speaker_cache_hit_rate"]
+    assert first["speaker_cache_hit_rate"] == 0.0
+    assert second["speaker_cache_hit_rate"] > first["speaker_cache_hit_rate"]
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/core/tts/test_tts_engine.py` around lines 59-69: the test test_speaker_cache_used_across_syntheses currently uses a non-strict assertion that allows a broken cache to pass; update the assertion after the second synthesize call to assert the exact expected hit rate (0.5) for "speaker_cache_hit_rate" instead of using ">=", so that TTSEngine(TTSConfig(device="cpu")) with synthesize("hello", voice="af_heart") then synthesize("again", voice="af_heart") produces metrics()["speaker_cache_hit_rate"] == 0.5; locate the check in test_speaker_cache_used_across_syntheses and change the assertion on second to strict equality comparing second["speaker_cache_hit_rate"] to 0.5.

src/voicequant/cli.py (1)
222-250: Consider error-handling UX for `tts speak`.

`engine.synthesize(...)` can raise `ImportError` (missing `kokoro_onnx`), `ValueError` (text too long / unsupported format), or `RuntimeError` from model load. Today these bubble up as raw Python tracebacks to the user. A small `try/except` that maps these to a short red error message + `typer.Exit(1)` would match the UX of the rest of the CLI (e.g. the `bench` command's error handling on lines 89-90).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/voicequant/cli.py` around lines 222 - 250, Wrap the call to engine.synthesize(...) inside a try/except in the tts_speak function to catch ImportError, ValueError, and RuntimeError; on each exception print a concise red error message via console.print (include the exception message) and then exit with typer.Exit(1) to match existing CLI UX (similar to the bench command). Ensure you still compute elapsed_ms only if synthesis succeeds, and keep writing result.audio_bytes and printing success info untouched when no exception occurs.
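The mapping this suggests can be sketched framework-free; `speak_with_errors` and `run_synthesis` are illustrative stand-ins for the CLI command body and the real `engine.synthesize` call, and the CLI would print the message in red and raise `typer.Exit(1)` with the returned code:

```python
# Sketch: map expected synthesis failures to a short message and a
# nonzero exit code instead of a raw traceback.
def speak_with_errors(run_synthesis) -> tuple[int, str]:
    """Return (exit_code, message) for a synthesis attempt."""
    try:
        result = run_synthesis()
    except (ImportError, ValueError, RuntimeError) as exc:
        # Concise user-facing error, mirroring the bench command's UX.
        return 1, f"error: {exc}"
    return 0, f"wrote {len(result)} bytes"
```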
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/voicequant/cli.py`:
- Around line 289-304: The average latency is skewed by the TTSEngine lazy model
load inside TTSEngine.synthesize; warm the engine before timing by invoking
engine.synthesize once (e.g., a short warmup call with one of the samples or a
dummy string) prior to starting the timed loop over samples, or alternatively
ignore/discard the first result from latencies/durations (remove the first
element) before computing avg_latency and avg_duration so only steady-state
synthesis times are averaged; reference TTSEngine, TTSConfig, synthesize,
samples, latencies and durations when making the change.
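The warm-before-timing fix described above can be sketched as follows; `benchmark` and `synthesize` are stand-ins for the CLI's benchmark loop and the engine call, not the PR's real functions:

```python
# Sketch: run one untimed warmup synthesis so the lazy model load
# does not skew the averaged latencies.
import time


def benchmark(synthesize, samples: list[str]) -> float:
    """Return average synthesis latency in ms over `samples`."""
    synthesize(samples[0])  # warmup: triggers lazy model load, untimed
    latencies = []
    for text in samples:
        t0 = time.perf_counter()
        synthesize(text)
        latencies.append((time.perf_counter() - t0) * 1000)
    return sum(latencies) / len(latencies)
```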
In `@src/voicequant/core/tts/audio.py`:
- Around line 71-85: get_audio_duration currently returns 0.0 for encoded
formats like mp3/opus which causes SynthesisResult.duration_seconds to report
0.00s; change the flow so duration is computed from raw float samples in the TTS
engine and passed through rather than derived from encoded bytes. Update the API
by adding an explicit duration argument (e.g., pass duration_seconds or samples
+ sample_rate) from the synthesizer into get_audio_duration or into
SynthesisResult so callers use the precomputed len(samples)/sample_rate value,
and keep get_audio_duration as a fallback for wav/pcm; ensure references to
get_audio_duration and SynthesisResult.duration_seconds are updated accordingly
and documented.
- Around line 59-68: The function wav_to_opus currently always raises
NotImplementedError even when opuslib is installed; either implement the
encoding or stop advertising Opus support — update code to remove "opus" from
any advertised formats/CLI choices and from any engine/server output_format
choices (references: wav_to_opus and any use of output_format="opus"), and if
you keep wav_to_opus stub, make it consistently raise a clear
ImportError/NotImplementedError with a message directing users that Opus support
is not yet available so requests won't 500.
In `@src/voicequant/core/tts/engine.py`:
- Around line 49-53: The metrics (_syntheses_total, _latency_sum_ms and reads of
_active in capacity()) must be protected by the same concurrency discipline as
_active: wrap all mutations and reads of _syntheses_total and _latency_sum_ms
with _active_lock, and change capacity() to read _active while holding
_active_lock (or add thread-safe accessor methods). Update all sites that
increment/accumulate metrics (including the sections around the current init and
the blocks around the referenced ranges) to acquire _active_lock before
modifying or reading these fields, or replace those direct accesses with new
synchronized helper methods (e.g., _inc_syntheses(), _add_latency_ms(),
get_metrics()) that acquire _active_lock internally to ensure atomic, consistent
metrics and capacity reporting.
- Around line 166-168: The duration is being computed after encoding which calls
get_audio_duration that only supports 'wav' and 'pcm', causing mp3/opus outputs
to get 0.0; modify the logic in the method that calls _encode_audio and
get_audio_duration (around SynthesisResult construction) to compute duration
from raw samples and sample_rate before encoding (e.g., duration_seconds =
len(samples) / sample_rate) and use that value for non-wav/pcm formats, while
still calling get_audio_duration for 'wav'/'pcm'; update the code paths around
_encode_audio, selected_format, get_audio_duration and SynthesisResult to use
the precomputed duration for mp3/opus.
- Around line 112-114: The current `_synthesize_samples` uses an `or` chain to
pick `samples` from `out` which triggers ambiguous truth-value errors for
multi-element NumPy arrays; change the selection logic to explicit None checks
(e.g., prefer `out["audio"]` if it is present and is not None, else
`out["samples"]` if not None, else `out["waveform"]`) so evaluating array values
is not used as a boolean; keep the `sample_rate` fallback to
`self.config.sample_rate` as before and update the code paths around the `out`
handling in the `_synthesize_samples` function accordingly to avoid implicit
truthiness on `samples`.
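The explicit-None selection described above can be sketched like this; the `"audio"`/`"samples"`/`"waveform"` keys come from the review's description of the model output, and `pick_samples` is an illustrative helper:

```python
# Sketch: pick the samples value with explicit None checks. Using
# `value or fallback` on a multi-element NumPy array raises
# "truth value of an array is ambiguous", so never rely on truthiness.
def pick_samples(out: dict):
    for key in ("audio", "samples", "waveform"):
        value = out.get(key)
        if value is not None:  # never `if value:` on an ndarray
            return value
    raise KeyError("no samples in model output")
```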
- Around line 62-73: The kokoro-onnx API changed: update TTSConfig to include a
voices_path field and stop passing a device or kwargs; instantiate the model
using the positional signature expected by kokoro-onnx by calling the retrieved
model class (model_cls or Kokoro/KokoroOnnx) with (model_path, voices_path) and
assign to self._model, remove any attempt to pass device; replace any calls to
synthesize()/generate()/get_speaker_embedding()/load_voice() with the single
create(text, voice) method and use get_voices() (not list_voices()) to enumerate
available voices, and ensure the voice argument passed to create is either a
voice name string or a numpy array (no separate embedding-loading step).
In `@src/voicequant/server/routes/tts.py`:
- Around line 40-62: The handler currently leaks raw exception text and echoes
user-controlled fields into response headers; update the synthesize error
handling around engine.synthesize to log the full exception server-side
(including stack trace) and raise HTTPException with a generic message like
"Synthesis failed" (no raw e). Also ensure voice/format values are validated or
sanitized before using them in headers and filenames: enforce an allowlist/enum
or regex on SpeechRequest.voice and SpeechRequest.response_format (or sanitize
result.voice and result.format) to strip CR/LF and non-safe bytes, constrain to
ASCII letters/digits/_- and map unknown values to safe defaults (e.g., "unknown"
and "bin"), then use those sanitized values for "X-VoiceQuant-Voice",
"X-VoiceQuant-Sample-Rate" and Content-Disposition filename while keeping
content-type via _content_type_for(ext).
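The sanitization step can be sketched as a small helper; the `"unknown"`/`"bin"` defaults follow the suggestion above, and `sanitize_header_value` is an illustrative name:

```python
# Sketch: strip CR/LF (header-injection vectors) and anything outside
# ASCII letters, digits, '_' and '-', falling back to a safe default,
# before the value reaches a response header or filename.
import re

_SAFE = re.compile(r"[^A-Za-z0-9_-]")


def sanitize_header_value(value: str, default: str = "unknown") -> str:
    cleaned = _SAFE.sub("", value or "")
    return cleaned or default
```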
In `@tests/core/tts/test_audio.py`:
- Around line 33-40: Tests assume third-party packages are absent; instead
monkeypatch import detection so the code takes the ImportError branch: in
test_wav_to_mp3_importerror_message and test_wav_to_opus_importerror_message use
the pytest monkeypatch fixture to stub importlib.util.find_spec (save original,
then set it to a function that returns None when name == "lameenc" or "opuslib"
respectively and otherwise calls the original) before calling wav_to_mp3(...) /
wav_to_opus(...), then assert the ImportError message as before; alternatively,
if you prefer skipping, call importlib.util.find_spec("lameenc") /
find_spec("opuslib") at start and pytest.skip when the package is absent/present
per the desired behavior, but prefer the monkeypatch approach above to force the
ImportError branch for wav_to_mp3 and wav_to_opus.
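The stub-`find_spec` approach can be sketched without pytest's fixture machinery. `wav_to_mp3_stub` and `force_missing` are illustrative stand-ins that mirror the guard (the module's real `wav_to_mp3` may use a try/import guard instead of `find_spec`):

```python
# Sketch: force the ImportError branch by making find_spec report a
# package as uninstalled, regardless of the test environment.
import importlib.util


def wav_to_mp3_stub(wav_bytes: bytes) -> bytes:
    # Mirrors the optional-dependency guard (illustrative only).
    if importlib.util.find_spec("lameenc") is None:
        raise ImportError("mp3 encoding requires lameenc. pip install lameenc")
    raise NotImplementedError  # real encoding elided in this sketch


def force_missing(module_name):
    """Patch find_spec so `module_name` looks uninstalled; return undo fn."""
    real = importlib.util.find_spec

    def fake(name, *args, **kwargs):
        if name == module_name:
            return None
        return real(name, *args, **kwargs)

    importlib.util.find_spec = fake
    return lambda: setattr(importlib.util, "find_spec", real)
```

With pytest, the same patch is a one-liner via the `monkeypatch` fixture's `setattr`.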
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 95691e2d-fb8a-4b69-93c5-5624ee716f3d
📒 Files selected for processing (17)
- pyproject.toml
- src/voicequant/cli.py
- src/voicequant/core/__init__.py
- src/voicequant/core/tts/__init__.py
- src/voicequant/core/tts/audio.py
- src/voicequant/core/tts/config.py
- src/voicequant/core/tts/engine.py
- src/voicequant/core/tts/speaker_cache.py
- src/voicequant/server/app.py
- src/voicequant/server/routes/tts.py
- tests/core/tts/__init__.py
- tests/core/tts/test_audio.py
- tests/core/tts/test_speaker_cache.py
- tests/core/tts/test_tts_config.py
- tests/core/tts/test_tts_engine.py
- tests/server/test_tts_cli.py
- tests/server/test_tts_routes.py
```python
def get_audio_duration(audio_bytes: bytes, format: str, sample_rate: int) -> float:
    """Estimate duration in seconds from audio byte payload."""
    fmt = format.lower()
    if fmt == "wav":
        with wave.open(io.BytesIO(audio_bytes), "rb") as wf:
            frames = wf.getnframes()
            rate = wf.getframerate()
            return frames / rate if rate else 0.0

    if fmt == "pcm":
        bytes_per_sample = 2
        samples = len(audio_bytes) / bytes_per_sample
        return samples / sample_rate if sample_rate else 0.0

    return 0.0
```
get_audio_duration silently returns 0.0 for mp3/opus.
Callers (e.g., SynthesisResult.duration_seconds surfaced in CLI/server output) will report 0.00s whenever the output format is anything other than wav/pcm. Since the engine already has the raw float samples before encoding, prefer computing duration from len(samples)/sample_rate in the engine and passing it in, rather than trying to re-derive it from encoded bytes. At minimum, document the limitation or fall back to len(samples)/sample_rate so metrics aren't misleading.
Motivation
Description
- `voicequant.core.tts` including `audio.py`, `config.py`, `engine.py`, `speaker_cache.py`, and a package `__init__.py` that re-exports key symbols and types such as `TTSEngine`, `TTSConfig`, and `SynthesisResult`.
- `tts` command group with `speak`, `voices`, and `benchmark-quick` commands, implemented in `src/voicequant/cli.py` and wired into the main `voicequant` Typer app.
- `src/voicequant/server/routes/tts.py`, conditional TTS mounting in `create_app` in `src/voicequant/server/app.py`, and a real route that returns audio payloads and a voices list.
- `pyproject.toml` updated to add a `tts` optional dependency group (`kokoro-onnx`, `soundfile`, `numpy`), include `tts` in the `voice` and `all` extras, and make `voicequant.core` imports tolerant of missing heavy dependencies.

Testing
- `tests/core/tts` covering audio helpers, `TTSConfig`, `TTSEngine` behavior with a mocked `kokoro_onnx`, and `SpeakerCache` concurrency semantics; all were executed via `pytest` and passed.
- `tests/server` including `test_tts_routes.py` and a CLI test `test_tts_cli.py`, which were executed via `pytest` and passed.
- `typer.testing.CliRunner` is used in the test suite and validated to exist and run as expected.

Codex Task
Summary by CodeRabbit
Release Notes
- `tts` extra.