
Add Kokoro ONNX TTS backend with CLI and server integration#1

Closed
mahimairaja wants to merge 4 commits into main from feat/assess-project-status-and-potential-features

Conversation

@mahimairaja (Contributor) commented Apr 20, 2026

Motivation

  • Add a Kokoro ONNX-based text-to-speech modality to provide speech synthesis alongside existing LLM/STT functionality.
  • Expose TTS functionality via the CLI and the HTTP server so users can synthesize audio and list voices programmatically.
  • Make core package imports resilient in minimal environments where heavy optional deps may be missing.

Description

  • Introduces a new TTS core package under voicequant.core.tts including audio.py, config.py, engine.py, speaker_cache.py, and a package __init__.py that re-exports key symbols and types such as TTSEngine, TTSConfig, and SynthesisResult.
  • Adds CLI support under the tts command group with speak, voices, and benchmark-quick commands implemented in src/voicequant/cli.py and wired into the main voicequant Typer app.
  • Integrates TTS into the FastAPI server by adding src/voicequant/server/routes/tts.py, conditional TTS mounting in create_app in src/voicequant/server/app.py, and a real route that returns audio payloads and a voices list.
  • Updates pyproject.toml to add a tts optional dependency group (kokoro-onnx, soundfile, numpy) and includes tts in the voice and all extras, and makes voicequant.core imports tolerant of missing heavy dependencies.
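For orientation, the WAV leg of such audio helpers can be built with the standard library alone. The following is an illustrative sketch under assumed names (`pcm_to_wav_bytes`, a 24 kHz mono default), not the PR's actual audio.py:

```python
import io
import wave
from array import array

def pcm_to_wav_bytes(samples: list[float], sample_rate: int = 24_000) -> bytes:
    """Clamp float samples to [-1, 1], quantize to int16 PCM, and wrap in a WAV container."""
    pcm = array("h", (int(max(-1.0, min(1.0, s)) * 32767) for s in samples))
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm.tobytes())
    return buf.getvalue()
```

The mp3/opus legs need third-party encoders, which is presumably why they live behind the optional `tts` extra.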

Testing

  • Added unit tests under tests/core/tts covering audio helpers, TTSConfig, TTSEngine behavior with a mocked kokoro_onnx, and SpeakerCache concurrency semantics, and all were executed via pytest and passed.
  • Added server tests in tests/server including test_tts_routes.py and a CLI test test_tts_cli.py, which were executed via pytest and passed.
  • The CLI commands were exercised using typer.testing.CliRunner in the test suite and validated to exist and run as expected.

Codex Task

Summary by CodeRabbit

Release Notes

  • New Features
    • Added text-to-speech synthesis with support for multiple voices and audio formats (WAV, PCM, MP3, Opus).
    • Added CLI commands for TTS: speak text to audio, list available voices, and run performance benchmarks.
    • Added REST API endpoints for speech synthesis and voice listing.
    • Introduced optional TTS dependencies installable via the new tts extra.

Copilot AI review requested due to automatic review settings April 20, 2026 19:00
coderabbitai Bot commented Apr 20, 2026

Warning

Rate limit exceeded

@mahimairaja has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 29 minutes and 59 seconds before requesting another review.


⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9ad07d38-c24c-4f07-ada6-07fb6b881bbc

📥 Commits

Reviewing files that changed from the base of the PR and between 0bfbf60 and f64de58.

📒 Files selected for processing (10)
  • pyproject.toml
  • src/voicequant/cli.py
  • src/voicequant/core/__init__.py
  • src/voicequant/core/tts/__init__.py
  • src/voicequant/core/tts/audio.py
  • src/voicequant/core/tts/config.py
  • src/voicequant/core/tts/engine.py
  • src/voicequant/server/routes/tts.py
  • tests/core/tts/test_audio.py
  • tests/core/tts/test_tts_engine.py
📝 Walkthrough

A comprehensive text-to-speech feature set has been added to the system, introducing Kokoro ONNX-based synthesis with configuration, CLI commands, FastAPI routes, speaker caching, audio format conversion, metrics tracking, and tests. Dependencies have been updated to include TTS-related packages.

Changes

  • Dependencies & Configuration (pyproject.toml): Added a new tts optional dependency group with kokoro-onnx, soundfile, and numpy. Updated the voice and all extras to include TTS components.
  • Core TTS Implementation (src/voicequant/core/tts/__init__.py, config.py, engine.py, audio.py, speaker_cache.py): Introduced the TTS package with a Pydantic-based configuration model, a Kokoro ONNX synthesis engine with lazy model loading and metrics, a thread-safe LRU speaker embedding cache, and audio format conversion utilities (WAV, PCM, MP3, Opus).
  • Core Module Lazy Loading (src/voicequant/core/__init__.py): Wrapped dynamic imports of core modules in try/except to support lazy loading and graceful handling of missing optional dependencies.
  • Server Integration (src/voicequant/server/app.py, src/voicequant/server/routes/tts.py): Extended the FastAPI app with TTS engine lifecycle management. Added a new router with /v1/audio/speech and /v1/audio/speech/voices endpoints supporting an OpenAI-compatible TTS interface.
  • CLI Commands (src/voicequant/cli.py): Added a new tts command group with three subcommands: speak (synthesis with latency measurement), voices (list available voices), and benchmark-quick (performance benchmarking).
  • Test Suite (tests/core/tts/*, tests/server/test_tts_*.py): Added comprehensive test coverage for audio utilities, speaker cache LRU semantics and concurrency, TTS configuration validation, engine synthesis and metrics, CLI integration, and HTTP route behavior.
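The thread-safe LRU speaker cache summarized above can be pictured with a standard-library sketch; the get/put/hit_rate API below is an assumption for illustration, not the actual speaker_cache.py:

```python
import threading
from collections import OrderedDict

class SpeakerCache:
    """LRU cache for speaker embeddings; every operation holds a single lock."""

    def __init__(self, max_size: int = 32):
        self._max_size = max_size
        self._lock = threading.Lock()
        self._entries: OrderedDict[str, object] = OrderedDict()
        self._hits = 0
        self._misses = 0

    def get(self, voice_id: str):
        with self._lock:
            if voice_id in self._entries:
                self._entries.move_to_end(voice_id)  # mark most recently used
                self._hits += 1
                return self._entries[voice_id]
            self._misses += 1
            return None

    def put(self, voice_id: str, embedding) -> None:
        with self._lock:
            self._entries[voice_id] = embedding
            self._entries.move_to_end(voice_id)
            if len(self._entries) > self._max_size:
                self._entries.popitem(last=False)  # evict least recently used

    @property
    def hit_rate(self) -> float:
        with self._lock:
            total = self._hits + self._misses
            return self._hits / total if total else 0.0
```

OrderedDict plus one lock is the common minimal shape for this; the PR's tests for "LRU semantics and concurrency" would exercise exactly the eviction and locking behavior shown here.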

Sequence Diagram

```mermaid
sequenceDiagram
    actor Client
    participant CLI/Server
    participant TTSEngine
    participant SpeakerCache
    participant KokoroONNX

    Client->>CLI/Server: Request synthesis (text, voice, format)
    CLI/Server->>TTSEngine: synthesize(text, voice, output_format)
    TTSEngine->>TTSEngine: Validate text length
    TTSEngine->>TTSEngine: Load model (once, lazy)
    TTSEngine->>SpeakerCache: get(voice_id)
    alt Cache Hit
        SpeakerCache-->>TTSEngine: speaker_embedding
    else Cache Miss
        TTSEngine->>KokoroONNX: get_speaker_embedding(voice_id)
        KokoroONNX-->>TTSEngine: speaker_embedding
        TTSEngine->>SpeakerCache: put(voice_id, embedding)
    end
    TTSEngine->>KokoroONNX: synthesize(text, speaker_embedding)
    KokoroONNX-->>TTSEngine: audio_samples, sample_rate
    TTSEngine->>TTSEngine: Encode audio (WAV/PCM/MP3/Opus)
    TTSEngine->>TTSEngine: Compute duration & latency metrics
    TTSEngine-->>CLI/Server: SynthesisResult
    CLI/Server-->>Client: Audio bytes + metadata
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 Hop hop, hear the voice so clear!
New TTS makes synthesized cheer,
Kokoro's embeddings cached with care,
Audio bytes float through the air!
Server speaks now, CLI too—
One thousand voices, just for you! 🎙️

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 17.81%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed — Check skipped; CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed — The title accurately and concisely summarizes the main change: adding a Kokoro ONNX TTS backend with CLI and server integration, which aligns with the primary objectives and changeset.


Copilot AI left a comment


Pull request overview

Adds a Kokoro ONNX-backed text-to-speech (TTS) modality to VoiceQuant, exposing synthesis and voice listing through both the FastAPI server and the Typer CLI, while keeping optional-dependency environments importable.

Changes:

  • Introduced voicequant.core.tts package (engine/config/audio helpers + speaker embedding cache).
  • Added /v1/audio/speech and /v1/audio/speech/voices FastAPI routes with conditional mounting (real vs stub).
  • Added voicequant tts CLI group (speak, voices, benchmark-quick) and a tts optional dependency extra; added unit/server/CLI tests.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 5 comments.

Summary per file:

  • src/voicequant/core/tts/audio.py — Audio encoding/conversion helpers for TTS outputs.
  • src/voicequant/core/tts/config.py — Pydantic configuration for the Kokoro TTS backend.
  • src/voicequant/core/tts/engine.py — Lazy-loading Kokoro ONNX engine implementing the modality protocol.
  • src/voicequant/core/tts/speaker_cache.py — Thread-safe LRU cache for speaker embeddings.
  • src/voicequant/core/tts/__init__.py — Re-exports of the TTS public API.
  • src/voicequant/server/routes/tts.py — Real OpenAI-compatible TTS endpoints returning audio bytes and a voice list.
  • src/voicequant/server/app.py — Conditional TTS engine initialization and router mounting in create_app.
  • src/voicequant/core/__init__.py — Makes voicequant.core imports tolerant of missing heavy optional deps.
  • src/voicequant/cli.py — Adds the tts Typer command group and subcommands.
  • pyproject.toml — Adds the tts optional dependency group and includes it in the voice/all extras.
  • tests/core/tts/* — Unit tests for audio helpers, config validation, engine behavior, and the speaker cache.
  • tests/server/test_tts_routes.py — Server route tests for stub/real TTS routing and conditional mounting.
  • tests/server/test_tts_cli.py — CLI test ensuring the tts group is present.


```python
bytes_per_sample = 2
samples = len(audio_bytes) / bytes_per_sample
return samples / sample_rate if sample_rate else 0.0
```

Copilot AI Apr 20, 2026

get_audio_duration() returns 0.0 for formats other than wav/pcm, but TTSEngine.synthesize() uses it to populate duration_seconds for mp3/opus outputs. This makes duration_seconds incorrect for those formats; compute duration from the raw samples/sample_rate before encoding (or add duration support for encoded formats).

Suggested change

```python
if fmt == "mp3":
    from mutagen.mp3 import MP3
    audio = MP3(io.BytesIO(audio_bytes))
    return float(audio.info.length) if audio.info else 0.0
if fmt == "opus":
    from mutagen.oggopus import OggOpus
    audio = OggOpus(io.BytesIO(audio_bytes))
    return float(audio.info.length) if audio.info else 0.0
```
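Alternatively, per the comment's first suggestion, the duration can be derived from the raw model output before encoding, which keeps it correct for mp3/opus without new dependencies (a sketch; `samples` is assumed to be the float array the model returns):

```python
def duration_from_samples(samples, sample_rate: int) -> float:
    """Length of mono audio in seconds, independent of the output container."""
    return len(samples) / sample_rate if sample_rate else 0.0

# 24_000 samples at 24 kHz is one second, regardless of mp3/opus encoding
assert duration_from_samples([0.0] * 24_000, 24_000) == 1.0
```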

Comment on lines +175 to +179:

```python
finally:
    elapsed_ms = (time.time() - t0) * 1000
    self._latency_sum_ms += elapsed_ms
    self._syntheses_total += 1
    self._decr_active()
```
Copilot AI Apr 20, 2026

TTSEngine updates _latency_sum_ms and _syntheses_total in finally without any synchronization. Under concurrent FastAPI requests this can lose updates and produce inconsistent metrics/capacity values; protect these counters with a lock (or use thread-safe/atomic primitives) similar to _active_lock.
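A minimal lock-protected version of those counters could look like this (a sketch mirroring the field names in the comment, not the engine's actual code):

```python
import threading

class SynthesisMetrics:
    """Aggregates latency under a lock so concurrent requests never lose updates."""

    def __init__(self):
        self._lock = threading.Lock()
        self._latency_sum_ms = 0.0
        self._syntheses_total = 0

    def record(self, elapsed_ms: float) -> None:
        with self._lock:
            self._latency_sum_ms += elapsed_ms
            self._syntheses_total += 1

    def average_ms(self) -> float:
        with self._lock:
            if self._syntheses_total == 0:
                return 0.0
            return self._latency_sum_ms / self._syntheses_total
```

Because CPython's `+=` on a float attribute is a read-modify-write, two overlapping requests can each read the old sum and one update silently vanishes; a single small lock around both counters removes that window.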

Comment thread: src/voicequant/core/tts/engine.py (outdated)
Comment on lines +195 to +199:

```python
def capacity(self) -> CapacityReport:
    active = self._active
    headroom = max(0, self.config.max_concurrent - active)
    saturated = active >= self.config.max_concurrent
    avg_ms = (
```
Copilot AI Apr 20, 2026

CapacityReport.active is read from self._active without acquiring _active_lock, while increments/decrements are locked. This can report stale/inconsistent values under concurrency; read _active under the same lock (or use an atomic counter).

Comment thread: src/voicequant/core/__init__.py (outdated)
Comment on lines +8 to +16:

```python
for _name in ("engine", "codebook", "config", "constants", "wrapper", "validator"):
    try:
        _sys.modules[f"voicequant.core.{_name}"] = _importlib.import_module(
            f"voicequant.core.llm.{_name}"
        )
    except ImportError:
        # Keep package importable in minimal environments where optional
        # heavy deps (torch/scipy) are not installed yet.
        pass
```
Copilot AI Apr 20, 2026

The top-level alias imports swallow any ImportError from voicequant.core.llm.*, which can hide real bugs inside those modules (not just missing optional deps). Consider narrowing this to ModuleNotFoundError for known optional packages (or checking exc.name) so genuine import failures still surface when dependencies are installed.
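A narrowed handler along those lines might look like this (a sketch; the helper name and the torch/scipy allowlist are illustrative, not from the PR):

```python
import importlib

# Heavy dependencies that may legitimately be absent in minimal installs.
_OPTIONAL_DEPS = {"torch", "scipy"}

def import_if_available(module_name: str):
    """Import a module, tolerating only missing *optional* dependencies.

    Returns None when the failure is a known optional dep being absent;
    re-raises any other ModuleNotFoundError so real import bugs still surface.
    """
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        # exc.name identifies the module that was actually missing, which may
        # be a transitive import rather than module_name itself.
        if exc.name in _OPTIONAL_DEPS:
            return None
        raise
```

The key detail is checking `exc.name`: when `voicequant.core.llm.engine` fails because `torch` is missing, the exception's `name` is `"torch"`, so a broken relative import inside the module still raises instead of being swallowed.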

Comment on lines +59 to +69:

```python
def wav_to_opus(wav_bytes: bytes) -> bytes:
    """Convert WAV bytes to Opus bytes (optional dependency)."""
    try:
        import opuslib  # noqa: F401
    except ImportError as exc:
        raise ImportError(
            "opus encoding requires opuslib. pip install opuslib"
        ) from exc

    raise NotImplementedError("Opus conversion path is not implemented yet")
```
Copilot AI Apr 20, 2026

wav_to_opus() unconditionally raises NotImplementedError after the optional opuslib import, but the rest of the stack (TTSConfig validator, CLI help, and TTSEngine _encode_audio) advertises opus as a supported format. This will cause runtime failures for response_format="opus"; either implement Opus encoding or remove opus from supported formats and user-facing options until it’s implemented.

@coderabbitai Bot left a comment

Actionable comments posted: 9

🧹 Nitpick comments (6)
pyproject.toml (1)

53-57: Minor: numpy lower bound inconsistency.

tts pins numpy>=1.24.0 while the bench extra uses numpy>=1.26.0 and the Python floor is >=3.12. For consistency and to avoid accidentally resolving an older wheel when someone installs only voicequant[tts], consider aligning to numpy>=1.26.0 (which is also the minimum with stable Python 3.12 wheels).

src/voicequant/core/__init__.py (1)

8-16: Silently swallowing ImportError can mask real bugs.

ModuleNotFoundError from a missing optional dep (e.g., torch) and an ImportError caused by a genuine bug inside voicequant.core.llm.<name> (e.g., a broken relative import) are indistinguishable here. Consider narrowing to ModuleNotFoundError and checking exc.name against the known heavy deps, or at least emitting a logging.debug with the exception so failures are traceable.

Proposed tweak

```diff
-    try:
-        _sys.modules[f"voicequant.core.{_name}"] = _importlib.import_module(
-            f"voicequant.core.llm.{_name}"
-        )
-    except ImportError:
-        # Keep package importable in minimal environments where optional
-        # heavy deps (torch/scipy) are not installed yet.
-        pass
+    try:
+        _sys.modules[f"voicequant.core.{_name}"] = _importlib.import_module(
+            f"voicequant.core.llm.{_name}"
+        )
+    except ModuleNotFoundError as exc:
+        # Keep package importable in minimal environments where optional
+        # heavy deps (torch/scipy) are not installed yet.
+        import logging as _logging
+        _logging.getLogger(__name__).debug(
+            "Skipping voicequant.core.%s alias: %s", _name, exc
+        )
```
src/voicequant/core/tts/audio.py (2)

71-71: Nit: parameter format shadows the builtin.

Consider renaming to fmt for consistency with TTSConfig.output_format callers and to avoid shadowing builtins.format.


11-16: Consider vectorizing _to_int16_bytes with NumPy.

numpy is declared in the tts extra, so it's available whenever this module is actually used. A pure-Python loop over each sample is roughly 50–100× slower than a vectorized np.clip followed by multiplying by 32767 and .astype(np.int16).tobytes(), and the difference becomes noticeable on multi-second utterances at 24 kHz.

Proposed refactor
```diff
-def _to_int16_bytes(samples: Iterable[float]) -> bytes:
-    pcm = array("h")
-    for s in samples:
-        v = max(-1.0, min(1.0, float(s)))
-        pcm.append(int(v * 32767.0))
-    return pcm.tobytes()
+def _to_int16_bytes(samples: Iterable[float]) -> bytes:
+    import numpy as np
+    arr = np.asarray(list(samples) if not hasattr(samples, "__array__") else samples,
+                     dtype=np.float32)
+    np.clip(arr, -1.0, 1.0, out=arr)
+    return (arr * 32767.0).astype(np.int16).tobytes()
```
tests/core/tts/test_tts_engine.py (1)

59-69: Tighten the cache assertion — >= silently tolerates a broken cache.

If speaker caching regresses and every lookup misses, first["speaker_cache_hit_rate"] and second["speaker_cache_hit_rate"] both stay at 0.0 and the test still passes. With the current mock and two calls for the same voice, the expected hit rate after the second synthesis is exactly 0.5, so the check can be made strict.

🧪 Proposed tighter assertion
```diff
-    assert second["speaker_cache_hit_rate"] >= first["speaker_cache_hit_rate"]
+    assert first["speaker_cache_hit_rate"] == 0.0
+    assert second["speaker_cache_hit_rate"] > first["speaker_cache_hit_rate"]
```
src/voicequant/cli.py (1)

222-250: Consider error-handling UX for tts speak.

engine.synthesize(...) can raise ImportError (missing kokoro_onnx), ValueError (text too long / unsupported format), or RuntimeError from model load. Today these bubble up as raw Python tracebacks to the user. A small try/except that maps these to a short red error message + typer.Exit(1) would match the UX of the rest of the CLI (e.g. the bench command's error handling on line 89-90).
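The suggested error mapping can be factored as a small wrapper (a sketch only; SystemExit stands in for typer.Exit so the pattern reads framework-neutrally, and run_synthesis is a hypothetical helper, not part of the PR):

```python
import sys

def run_synthesis(synthesize, text: str):
    """Call a synthesis function, turning known failure modes into a short
    stderr message and exit code 1 instead of a raw traceback."""
    try:
        return synthesize(text)
    except (ImportError, ValueError, RuntimeError) as exc:
        print(f"error: {exc}", file=sys.stderr)
        raise SystemExit(1)
```

In the actual Typer command the except body would use console.print for the red message and raise typer.Exit(1), matching the bench command's handling.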

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/voicequant/cli.py`:
- Around line 289-304: The average latency is skewed by the TTSEngine lazy model
load inside TTSEngine.synthesize; warm the engine before timing by invoking
engine.synthesize once (e.g., a short warmup call with one of the samples or a
dummy string) prior to starting the timed loop over samples, or alternatively
ignore/discard the first result from latencies/durations (remove the first
element) before computing avg_latency and avg_duration so only steady-state
synthesis times are averaged; reference TTSEngine, TTSConfig, synthesize,
samples, latencies and durations when making the change.

In `@src/voicequant/core/tts/audio.py`:
- Around line 71-85: get_audio_duration currently returns 0.0 for encoded
formats like mp3/opus which causes SynthesisResult.duration_seconds to report
0.00s; change the flow so duration is computed from raw float samples in the TTS
engine and passed through rather than derived from encoded bytes. Update the API
by adding an explicit duration argument (e.g., pass duration_seconds or samples
+ sample_rate) from the synthesizer into get_audio_duration or into
SynthesisResult so callers use the precomputed len(samples)/sample_rate value,
and keep get_audio_duration as a fallback for wav/pcm; ensure references to
get_audio_duration and SynthesisResult.duration_seconds are updated accordingly
and documented.
- Around line 59-68: The function wav_to_opus currently always raises
NotImplementedError even when opuslib is installed; either implement the
encoding or stop advertising Opus support — update code to remove "opus" from
any advertised formats/CLI choices and from any engine/server output_format
choices (references: wav_to_opus and any use of output_format="opus"), and if
you keep wav_to_opus stub, make it consistently raise a clear
ImportError/NotImplementedError with a message directing users that Opus support
is not yet available so requests won't 500.

In `@src/voicequant/core/tts/engine.py`:
- Around line 49-53: The metrics (_syntheses_total, _latency_sum_ms and reads of
_active in capacity()) must be protected by the same concurrency discipline as
_active: wrap all mutations and reads of _syntheses_total and _latency_sum_ms
with _active_lock, and change capacity() to read _active while holding
_active_lock (or add thread-safe accessor methods). Update all sites that
increment/accumulate metrics (including the sections around the current init and
the blocks around the referenced ranges) to acquire _active_lock before
modifying or reading these fields, or replace those direct accesses with new
synchronized helper methods (e.g., _inc_syntheses(), _add_latency_ms(),
get_metrics()) that acquire _active_lock internally to ensure atomic, consistent
metrics and capacity reporting.
- Around line 166-168: The duration is being computed after encoding which calls
get_audio_duration that only supports 'wav' and 'pcm', causing mp3/opus outputs
to get 0.0; modify the logic in the method that calls _encode_audio and
get_audio_duration (around SynthesisResult construction) to compute duration
from raw samples and sample_rate before encoding (e.g., duration_seconds =
len(samples) / sample_rate) and use that value for non-wav/pcm formats, while
still calling get_audio_duration for 'wav'/'pcm'; update the code paths around
_encode_audio, selected_format, get_audio_duration and SynthesisResult to use
the precomputed duration for mp3/opus.
- Around line 112-114: The current `_synthesize_samples` uses an `or` chain to
pick `samples` from `out` which triggers ambiguous truth-value errors for
multi-element NumPy arrays; change the selection logic to explicit None checks
(e.g., prefer `out["audio"]` if it is present and is not None, else
`out["samples"]` if not None, else `out["waveform"]`) so evaluating array values
is not used as a boolean; keep the `sample_rate` fallback to
`self.config.sample_rate` as before and update the code paths around the `out`
handling in the `_synthesize_samples` function accordingly to avoid implicit
truthiness on `samples`.
- Around line 62-73: The kokoro-onnx API changed: update TTSConfig to include a
voices_path field and stop passing a device or kwargs; instantiate the model
using the positional signature expected by kokoro-onnx by calling the retrieved
model class (model_cls or Kokoro/KokoroOnnx) with (model_path, voices_path) and
assign to self._model, remove any attempt to pass device; replace any calls to
synthesize()/generate()/get_speaker_embedding()/load_voice() with the single
create(text, voice) method and use get_voices() (not list_voices()) to enumerate
available voices, and ensure the voice argument passed to create is either a
voice name string or a numpy array (no separate embedding-loading step).

In `@src/voicequant/server/routes/tts.py`:
- Around line 40-62: The handler currently leaks raw exception text and echoes
user-controlled fields into response headers; update the synthesize error
handling around engine.synthesize to log the full exception server-side
(including stack trace) and raise HTTPException with a generic message like
"Synthesis failed" (no raw e). Also ensure voice/format values are validated or
sanitized before using them in headers and filenames: enforce an allowlist/enum
or regex on SpeechRequest.voice and SpeechRequest.response_format (or sanitize
result.voice and result.format) to strip CR/LF and non-safe bytes, constrain to
ASCII letters/digits/_- and map unknown values to safe defaults (e.g., "unknown"
and "bin"), then use those sanitized values for "X-VoiceQuant-Voice",
"X-VoiceQuant-Sample-Rate" and Content-Disposition filename while keeping
content-type via _content_type_for(ext).

In `@tests/core/tts/test_audio.py`:
- Around line 33-40: Tests assume third-party packages are absent; instead
monkeypatch import detection so the code takes the ImportError branch: in
test_wav_to_mp3_importerror_message and test_wav_to_opus_importerror_message use
the pytest monkeypatch fixture to stub importlib.util.find_spec (save original,
then set it to a function that returns None when name == "lameenc" or "opuslib"
respectively and otherwise calls the original) before calling wav_to_mp3(...) /
wav_to_opus(...), then assert the ImportError message as before; alternatively,
if you prefer skipping, call importlib.util.find_spec("lameenc") /
find_spec("opuslib") at start and pytest.skip when the package is absent/present
per the desired behavior, but prefer the monkeypatch approach above to force the
ImportError branch for wav_to_mp3 and wav_to_opus.

---

Nitpick comments:
In `@pyproject.toml`:
- Around line 53-57: The tts extra currently lists "numpy>=1.24.0" which is
inconsistent with the bench extra and Python 3.12 wheel compatibility; update
the tts extras array in pyproject.toml to use "numpy>=1.26.0" (replace the
existing numpy entry in the tts list) so the package extras (tts) align with the
bench extra and the project's supported Python wheels.

In `@src/voicequant/cli.py`:
- Around line 222-250: Wrap the call to engine.synthesize(...) inside a
try/except in the tts_speak function to catch ImportError, ValueError, and
RuntimeError; on each exception print a concise red error message via
console.print (include the exception message) and then exit with typer.Exit(1)
to match existing CLI UX (similar to the bench command). Ensure you still
compute elapsed_ms only if synthesis succeeds, and keep writing
result.audio_bytes and printing success info untouched when no exception occurs.

In `@src/voicequant/core/__init__.py`:
- Around line 8-16: The import loop that currently catches all ImportError for
voicequant.core.llm._name can mask real bugs; update the try/except around the
_importlib.import_module call to only catch ModuleNotFoundError (not
ImportError), and when caught inspect the exception's name (exc.name) against
known heavy deps (e.g., "torch", "scipy") before skipping, otherwise re-raise;
additionally emit a debug log with the exception (using the package logger) when
skipping so failures are traceable; refer to the loop variable _name, the call
to _importlib.import_module, and the assignment to
_sys.modules[f"voicequant.core.{_name}"] to locate where to implement this
change.

In `@src/voicequant/core/tts/audio.py`:
- Line 71: Rename the format parameter of get_audio_duration to fmt so it no
longer shadows the built-in format and matches the TTSConfig.output_format
naming. Update the signature to get_audio_duration(audio_bytes: bytes, fmt:
str, sample_rate: int), then update all internal references and every call
site that passes TTSConfig.output_format (or any variable named format) so
callers and implementation stay consistent.
- Around lines 11-16: Replace the slow Python loop in _to_int16_bytes with a
NumPy vectorized path: convert samples to an ndarray via np.asarray(samples,
dtype=np.float32), apply np.clip(arr, -1.0, 1.0), multiply by 32767, cast
with astype(np.int16), and return .tobytes(). Keep the current array("h")
approach as a fallback if NumPy is unavailable, or import numpy
unconditionally since it is provided in the tts extra.

In `@tests/core/tts/test_tts_engine.py`:
- Around lines 59-69: test_speaker_cache_used_across_syntheses uses a
non-strict (>=) assertion that would let a broken cache pass. With
TTSEngine(TTSConfig(device="cpu")), calling synthesize("hello",
voice="af_heart") and then synthesize("again", voice="af_heart") should yield
metrics()["speaker_cache_hit_rate"] == 0.5, so change the assertion after the
second synthesize call to strict equality: second["speaker_cache_hit_rate"]
== 0.5.
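
The expected 0.5 follows from one miss then one hit over two lookups. A toy cache illustrating that arithmetic (this stands in for voicequant's SpeakerCache, whose real API may differ):

```python
class ToyVoiceCache:
    """Counts lookups and hits so a hit rate can be asserted exactly."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.lookups = 0

    def get(self, voice, load):
        self.lookups += 1
        if voice in self._store:
            self.hits += 1
        else:
            self._store[voice] = load(voice)  # miss: load and memoize
        return self._store[voice]

    @property
    def hit_rate(self) -> float:
        return self.hits / self.lookups if self.lookups else 0.0


cache = ToyVoiceCache()
cache.get("af_heart", lambda v: object())  # first synthesis: miss
cache.get("af_heart", lambda v: object())  # second synthesis: hit
assert cache.hit_rate == 0.5  # the exact value the strict test should check
```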
📥 Commits

Reviewing files that changed from the base of the PR and between 5267637 and 0bfbf60.

📒 Files selected for processing (17)
  • pyproject.toml
  • src/voicequant/cli.py
  • src/voicequant/core/__init__.py
  • src/voicequant/core/tts/__init__.py
  • src/voicequant/core/tts/audio.py
  • src/voicequant/core/tts/config.py
  • src/voicequant/core/tts/engine.py
  • src/voicequant/core/tts/speaker_cache.py
  • src/voicequant/server/app.py
  • src/voicequant/server/routes/tts.py
  • tests/core/tts/__init__.py
  • tests/core/tts/test_audio.py
  • tests/core/tts/test_speaker_cache.py
  • tests/core/tts/test_tts_config.py
  • tests/core/tts/test_tts_engine.py
  • tests/server/test_tts_cli.py
  • tests/server/test_tts_routes.py

Comment on lines +71 to +85
```python
def get_audio_duration(audio_bytes: bytes, format: str, sample_rate: int) -> float:
    """Estimate duration in seconds from audio byte payload."""
    fmt = format.lower()
    if fmt == "wav":
        with wave.open(io.BytesIO(audio_bytes), "rb") as wf:
            frames = wf.getnframes()
            rate = wf.getframerate()
            return frames / rate if rate else 0.0

    if fmt == "pcm":
        bytes_per_sample = 2
        samples = len(audio_bytes) / bytes_per_sample
        return samples / sample_rate if sample_rate else 0.0

    return 0.0
```

⚠️ Potential issue | 🟡 Minor

get_audio_duration silently returns 0.0 for mp3/opus.

Callers (e.g., SynthesisResult.duration_seconds surfaced in CLI/server output) will report 0.00s whenever the output format is anything other than wav/pcm. Since the engine already has the raw float samples before encoding, prefer computing duration from len(samples)/sample_rate in the engine and passing it in, rather than trying to re-derive it from encoded bytes. At minimum, document the limitation or fall back to len(samples)/sample_rate so metrics aren't misleading.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/voicequant/core/tts/audio.py` around lines 71-85, get_audio_duration
returns 0.0 for encoded formats like mp3/opus, which makes
SynthesisResult.duration_seconds report 0.00s. Compute the duration from the
raw float samples in the TTS engine and pass it through, rather than
re-deriving it from encoded bytes: add an explicit duration argument (e.g.,
duration_seconds, or samples plus sample_rate) from the synthesizer into
SynthesisResult so callers use the precomputed len(samples)/sample_rate value.
Keep get_audio_duration as a fallback for wav/pcm, and update and document the
references to get_audio_duration and SynthesisResult.duration_seconds
accordingly.
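
A sketch of the engine-side computation the comment suggests; `duration_from_samples` and `SynthResultSketch` are illustrative names standing in for voicequant's actual engine code and SynthesisResult:

```python
from dataclasses import dataclass


def duration_from_samples(samples, sample_rate: int) -> float:
    """Exact duration, independent of the output encoding (mp3/opus included)."""
    return len(samples) / sample_rate if sample_rate else 0.0


@dataclass
class SynthResultSketch:  # stand-in for voicequant's SynthesisResult
    audio_bytes: bytes
    duration_seconds: float


# In the engine, before encoding, duration is known exactly from the floats:
samples = [0.0] * 24000  # one second of silence at 24 kHz
result = SynthResultSketch(
    audio_bytes=b"...",  # whatever the encoder produced
    duration_seconds=duration_from_samples(samples, 24000),
)
```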
