feat(tts): M5 streaming, Orpheus adapter, and headline benchmarks #3
Adds the M5 milestone: TTS chunked streaming, Orpheus TTS model support
wired through the TurboQuant KV compression engine, and the TTS benchmark
suite with the cross-modality hero chart.
- Core streaming: StreamingSynthesizer emits StreamingChunk objects from
either a pre-generated waveform (Kokoro) or a token-streaming backend
(Orpheus), with first-chunk min-size and TTFA tracking.
- Streaming endpoint: POST /v1/audio/speech/stream supports chunked
transfer and SSE (when Accept: text/event-stream). POST /v1/audio/speech
honors the `stream` field for the OpenAI-style pattern (a client sketch
follows this list).
- Orpheus adapter: OrpheusAdapter imports TurboQuantEngine from
core/llm/engine and applies KV compression per step on Orpheus' LLaMA
backbone. This is the only cross-modality import; core/llm remains
unaware of TTS.
- Multi-backend TTSEngine: model_name routes to Kokoro or Orpheus. New
tq_bits/tq_enabled config fields flow through to the Orpheus adapter.
- Optional extra: pyproject.toml adds `tts-orpheus` (orpheus-tts +
phonemizer). Base `tts` extra is unchanged.
- Benchmarks: 5 new TTS scenarios (ttfa, streaming_jitter, mos_quality,
concurrent, speaker_cache_hit) registered in the shared runner; TTS
markdown section added to the report. 15+ test sentences fixture.
- Visualization: 5 TTS charts plus the cross-modality hero chart
("Voice Agent Sessions per GPU"). CLI gains `--modality` and `--all`.
- Tests: 53 new tests across streaming, orpheus_adapter, multi_backend,
tts benchmarks, and the streaming route. All 291 tests pass.
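As a quick orientation, here is how a client might consume the chunked endpoint. This is a sketch, not code from the PR: the request fields mirror the OpenAI-style `/v1/audio/speech` schema described above, and the voice name and output path are placeholders.

```python
# Hypothetical client for POST /v1/audio/speech/stream (chunked transfer).
# Field names follow the OpenAI-style pattern; adjust to the real schema.
import requests

resp = requests.post(
    "http://localhost:8000/v1/audio/speech/stream",
    json={
        "input": "Hello from the M5 milestone.",
        "voice": "tara",           # placeholder voice id
        "response_format": "pcm",  # streaming emits raw PCM chunks
    },
    stream=True,
)
resp.raise_for_status()
with open("out.pcm", "wb") as f:
    for chunk in resp.iter_content(chunk_size=None):
        f.write(chunk)  # write chunks as they arrive; no per-chunk WAV headers
```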
📝 Walkthrough

This PR adds comprehensive TTS support with multi-backend synthesis (Kokoro and Orpheus), streaming audio output, TurboQuant KV compression, five benchmark scenarios measuring performance metrics, visualization charts, CLI modality controls, and server streaming endpoints, alongside supporting test coverage.
Sequence Diagram(s)

```mermaid
sequenceDiagram
actor Client
participant Engine as TTSEngine
participant Detector as _detect_backend()
participant Kokoro as Kokoro Backend
participant Orpheus as OrpheusAdapter
Client->>Engine: synthesize(text, voice, model="orpheus-fp16")
Engine->>Detector: _detect_backend("orpheus-fp16")
Detector-->>Engine: "orpheus"
Engine->>Orpheus: load_model()
Engine->>Orpheus: synthesize(text, voice)
Orpheus->>Orpheus: generate_speech_tokens(text, voice)
Orpheus->>Orpheus: decode_tokens_to_audio(tokens)
Orpheus-->>Engine: SynthesisResult(wav/pcm)
Engine-->>Client: audio bytes
Client->>Engine: synthesize(text, voice, model="kokoro")
Engine->>Detector: _detect_backend("kokoro")
Detector-->>Engine: "kokoro"
Engine->>Kokoro: create(text, speaker_id, output_format)
Kokoro-->>Engine: audio bytes
Engine-->>Client: audio bytes
```
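The routing step in the diagram, sketched in Python. Substring-based detection is an assumption; the real rules live in `_detect_backend()`.

```python
# Sketch only: assumes substring-based backend detection.
def detect_backend(model_name: str) -> str:
    name = model_name.lower()
    if "orpheus" in name:  # e.g. "orpheus-fp16" in the diagram above
        return "orpheus"
    return "kokoro"        # default backend, e.g. "kokoro"
```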
```mermaid
sequenceDiagram
actor Client
participant Synthesizer as StreamingSynthesizer
participant Engine as TTSEngine
participant Orpheus as OrpheusAdapter
participant Encoder as _encode_chunk()
Client->>Synthesizer: stream(text, voice)
Synthesizer->>Synthesizer: detect backend support
Synthesizer->>Engine: stream_samples(text, voice)
Engine->>Orpheus: stream_samples(text, voice)
loop For each sample batch
Orpheus->>Orpheus: generate_speech_tokens()
Orpheus->>Orpheus: decode_tokens_to_audio(batch)
Orpheus-->>Engine: float32 samples
Engine-->>Synthesizer: float32 samples
Synthesizer->>Encoder: encode chunk to PCM/WAV
Encoder-->>Synthesizer: audio_bytes
Synthesizer->>Synthesizer: emit StreamingChunk
Synthesizer-->>Client: StreamingChunk
end
Synthesizer->>Synthesizer: mark final chunk
Synthesizer-->>Client: StreamingChunk(is_final=true)
Synthesizer->>Synthesizer: record last_ttfa_ms, last_total_chunks
```
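The chunk objects and the TTFA bookkeeping at the end of this diagram could look roughly like the sketch below. Only `is_final`, `audio_bytes`, `last_ttfa_ms`, and `last_total_chunks` appear in the diagram; the other names are assumptions.

```python
import time
from dataclasses import dataclass
from typing import Iterator

@dataclass
class StreamingChunk:
    audio_bytes: bytes     # encoded PCM for this chunk
    index: int             # 0-based chunk position
    is_final: bool = False

def timed_stream(chunks: Iterator[StreamingChunk], stats: dict) -> Iterator[StreamingChunk]:
    """Pass chunks through while recording TTFA and the total chunk count."""
    start = time.perf_counter()
    count = 0
    for chunk in chunks:
        if count == 0:
            stats["last_ttfa_ms"] = (time.perf_counter() - start) * 1000.0
        count += 1
        yield chunk
    stats["last_total_chunks"] = count
```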
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Codecov Report: ❌ Patch coverage is …
Actionable comments posted: 15
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/voicequant/core/tts/engine.py (1)
Lines 300-307 | ⚠️ Potential issue | 🟠 Major

Release the Orpheus adapter during shutdown.

`shutdown()` clears `_model`, but `_orpheus` can still hold the loaded Orpheus model/tokenizer/decoder, keeping GPU memory referenced after shutdown.

🧹 Proposed fix

```diff
 def shutdown(self) -> None:
     with self._lock:
+        if self._orpheus is not None:
+            self._orpheus.shutdown()
+            self._orpheus = None
         self._model = None
         self._model_loaded = False
         if self._speaker_cache is not None:
             self._speaker_cache.clear()
             self._speaker_cache = None
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/voicequant/core/tts/engine.py` around lines 300 - 307, shutdown() currently clears _model but leaves self._orpheus alive, which can keep GPU-resident Orpheus tensors; update shutdown() to, inside the existing with self._lock block, check if self._orpheus is not None and first call any cleanup API available (e.g. call self._orpheus.close() or self._orpheus.release() if present via getattr) and then set self._orpheus = None, and also ensure _model_loaded is set to False and _speaker_cache cleared as already done so the Orpheus adapter is fully released.

src/voicequant/cli.py (1)
Lines 282-304 | ⚠️ Potential issue | 🟠 Major

Resolve the `orpheus` shorthand to the full Hugging Face model ID before constructing `TTSConfig`.

The CLI advertises `--model orpheus` as valid, but `orpheus-tts` does not recognize "orpheus" as a model name alias; it requires the full Hugging Face model ID. Passing the bare string will cause a runtime failure.

Fix

```diff
-    cfg = TTSConfig(
-        model_name=model,
+    resolved_model = (
+        "canopylabs/orpheus-3b-0.1-ft"
+        if model.lower() == "orpheus"
+        else model
+    )
+    cfg = TTSConfig(
+        model_name=resolved_model,
         device=device,
         default_voice=voice,
         output_format=fmt,
         tq_bits=tq_bits,
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/voicequant/cli.py` around lines 282 - 304, The CLI allows the shorthand "orpheus" but you must translate that to the full Hugging Face model ID before creating the TTSConfig; update the code around the model variable (in the CLI function where model is read and before the TTSConfig(...) call) to check if model == "orpheus" and replace it with the canonical HF model id (e.g. ai-forever/orpheus-tts or your project's exact HF identifier) so that TTSEngine/TTSConfig receives the full model name; ensure this mapping happens prior to instantiating TTSConfig(model_name=model, ...).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/voicequant/benchmarks/report.py`:
- Around line 366-417: The TTS generator _generate_tts_section is missing output
for the scenarios keyed as "tts_streaming_jitter" and "tts_speaker_cache_hit",
so add branches that mirror the existing pattern (like "tts_ttfa",
"tts_concurrent", "tts_mos_quality"): check if each key is in the present list,
pull its data from results (e.g., results["tts_streaming_jitter"] and
results["tts_speaker_cache_hit"]), and append human-readable subsection headers
and Markdown tables or summaries using the scenario's .get("results", []) or
summary fields; ensure field names used match those produced by the scenario
(e.g., jitter statistics, packet/drop rates, cache hit ratio) so that running
only those scenarios still renders a non-empty TTS section and be consistent
with other TTS blocks.
In `@src/voicequant/benchmarks/runner.py`:
- Around line 76-79: The tts_concurrent benchmark registration doesn't receive
the user-provided max_sessions value like the LLM "concurrent" path does; update
runner.py so when creating or invoking the scenario for
"voicequant.benchmarks.scenarios.tts.concurrent" / ConcurrentTTSScenario you
forward the same max_sessions parameter used for the LLM "concurrent" scenario
(e.g., include max_sessions in the scenario instantiation/config or run call),
ensuring TTS concurrent runs respect the provided concurrency limit; apply the
same change to the other similar registration block referenced around lines
312-317.
In `@src/voicequant/benchmarks/scenarios/tts/concurrent.py`:
- Around line 67-104: The summary's selection of rows can include ladder samples
above the hardware cap because it only checks p95; update the rows filter
comprehension in the summary loop (the one that builds rows for each gpu and
model) to also enforce the hardware cap by requiring r["concurrency"] <=
r["max_concurrency"] (or equivalently r["max_concurrency"] >= r["concurrency"])
so best_n cannot exceed the computed GPU capacity; keep the rest of the logic
(p95 <= _LATENCY_BUDGET_MS and matching gpu/model) unchanged.
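A sketch of the corrected filter. The row keys (`gpu`, `model`, `p95`, `concurrency`, `max_concurrency`) come from the comment above; the budget value here is a placeholder, not the scenario's real constant.

```python
_LATENCY_BUDGET_MS = 250  # placeholder, not the scenario's real constant


def best_concurrency(results: list[dict], gpu: str, model: str) -> int:
    """Highest ladder step that meets the latency budget AND the hardware cap."""
    rows = [
        r for r in results
        if r["gpu"] == gpu
        and r["model"] == model
        and r["p95"] <= _LATENCY_BUDGET_MS
        and r["concurrency"] <= r["max_concurrency"]  # new: enforce the cap
    ]
    return max((r["concurrency"] for r in rows), default=0)
```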
In `@src/voicequant/benchmarks/scenarios/tts/speaker_cache_hit.py`:
- Around line 19-28: The _cache_hit_rate function incorrectly returns 0.0 for
unique_voices == 1; change it so the one-voice case returns a warm cache hit
rate (1.0) after the first miss. Update the function _cache_hit_rate to
explicitly handle unique_voices == 1 (return 1.0) and leave the existing logic
for other values (the branch for unique_voices <= cache_size and the cache_size
/ unique_voices fallback) unchanged, so benchmarks report a warm cache for
repeated single-voice synthesis.
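The fixed helper might look like this sketch; the two lower branches restate the existing logic described above and are assumptions rather than code copied from the source.

```python
def _cache_hit_rate(unique_voices: int, cache_size: int) -> float:
    if unique_voices <= 0:
        return 0.0
    if unique_voices == 1:
        return 1.0  # warm cache after the first miss
    if unique_voices <= cache_size:
        return 1.0  # every voice fits; only cold misses
    return cache_size / unique_voices  # uniform-access approximation
```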
In `@src/voicequant/cli.py`:
- Around line 124-128: The visualization modality defaults to "llm" when
--modality is omitted, causing TTS scenarios to produce LLM charts; update the
logic before calling generate_charts_by_modality in cli.py to infer modality
from the selected scenarios: set viz_mod to "all" if all_scenarios is true,
otherwise use modality if provided, else detect if any selected scenario string
starts with "tts_" and set viz_mod to "tts" (falling back to "llm" if no TTS
scenarios are present); reference the viz_mod variable and
generate_charts_by_modality function to implement this conditional inference.
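Sketched as a small helper; the parameter names are from the comment, while the function boundary is an assumption about how cli.py is organized.

```python
def infer_viz_modality(all_scenarios: bool, modality: str | None,
                       selected: list[str]) -> str:
    """Pick the chart modality when --modality is omitted."""
    if all_scenarios:
        return "all"
    if modality:
        return modality
    if any(s.startswith("tts_") for s in selected):
        return "tts"
    return "llm"  # no TTS scenarios selected
```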
In `@src/voicequant/core/tts/config.py`:
- Around line 18-19: TTSConfig currently accepts any integer for tq_bits which
can later break the Orpheus adapter; add Pydantic validation at the config
boundary (either by replacing the raw fields with a dedicated TurboQuantConfig
model or by adding a validator on TTSConfig) that enforces tq_bits is one of the
supported TurboQuant bit widths and raises a clear ValueError otherwise; keep
tq_enabled default behavior and update ServerConfig to carry the new
TurboQuantConfig if you introduce it.
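A minimal validator sketch, assuming Pydantic v2 and the (2, 3, 4) bit widths the author confirms in the review response below:

```python
from pydantic import BaseModel, field_validator

_SUPPORTED_TQ_BITS = {2, 3, 4}


class TTSConfig(BaseModel):
    tq_enabled: bool = False
    tq_bits: int = 4

    @field_validator("tq_bits")
    @classmethod
    def _check_tq_bits(cls, v: int) -> int:
        if v not in _SUPPORTED_TQ_BITS:
            raise ValueError(
                f"tq_bits must be one of {sorted(_SUPPORTED_TQ_BITS)}, got {v}"
            )
        return v
```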
In `@src/voicequant/core/tts/engine.py`:
- Around line 124-128: list_voices() currently returns KOKORO_VOICES when the
engine is configured for Orpheus but _orpheus hasn't been initialized, exposing
wrong voice IDs; change list_voices in the TTS engine to return Orpheus voice
identifiers whenever self._backend == "orpheus" regardless of self._orpheus
being None by delegating to the Orpheus voice registry or a static
ORPHEUS_VOICES mapping (or calling a class/static helper in the Orpheus loader)
so that cold engines report correct Orpheus voice IDs; reference the list_voices
method, the _backend and _orpheus attributes, and the load_model initialization
path to locate where to replace the fallback to KOKORO_VOICES.
In `@src/voicequant/core/tts/orpheus_adapter.py`:
- Around line 302-312: get_compression_stats() currently fabricates a 0.99
cosine similarity when self._last_cosine is unset; update the method
(get_compression_stats) to stop reporting a made-up value by either omitting the
"cosine_similarity" key or returning None for it when self._last_cosine is None,
i.e., use the real _last_cosine only when it's been computed and otherwise leave
that metric absent/None (also ensure any callers tolerate a missing/None
"cosine_similarity" instead of assuming a numeric value).
- Around line 155-164: In the token-generation loop in OrpheusAdapter (the block
that calls self._sample and currently yields token_id before checking
self._is_eos), change the flow so you call self._sample(), compute token_id,
then check self._is_eos(token_id) and break without yielding if it's EOS; only
yield non-EOS token_ids so decode_tokens_to_audio never receives a tokenizer
control token. Ensure you still update input_ids/position_ids and n_generated
appropriately for the next step when breaking (a sketch follows this file's comments).
- Around line 216-222: The Orpheus adapter's format branch currently only
handles wav/pcm (see float32_to_wav, float32_to_pcm and self.sample_rate) but
must accept mp3 and opus like the public TTSEngine/CLI; update the fmt handling
in the method in OrpheusAdapter to add branches for "mp3" and "opus" that encode
the float32 samples to those formats (e.g., call or add helper functions
encode_float32_to_mp3(samples, self.sample_rate) and
encode_float32_to_opus(samples, self.sample_rate) or invoke your project's
ffmpeg/pydub wrapper), returning audio_bytes for mp3/opus, and keep the existing
wav/pcm branches and the ValueError for unknown formats.
- Around line 113-130: The generate_speech_tokens method accepts a voice but
never applies it to the prompt; prepend the voice to the text (format "{voice}:
{text}" when voice is provided) before calling self._tokenizer so the model
receives voice-conditioned input. Update generate_speech_tokens to build a
prompt variable from the voice and text, then pass that prompt into
self._tokenizer(...).to(self.config.device) (leaving load_model and device
handling unchanged).
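The EOS fix from the token-generation comment above, reduced to a self-contained sketch; the `step` and `is_eos` callables stand in for `self._sample` and `self._is_eos`, and the tensor bookkeeping is elided.

```python
from typing import Callable, Iterator


def stream_speech_tokens(
    step: Callable[[], int],        # samples the next token id
    is_eos: Callable[[int], bool],  # EOS / control-token test
    max_new_tokens: int,
) -> Iterator[int]:
    """Yield only real speech tokens; consume EOS without yielding it."""
    for _ in range(max_new_tokens):
        token_id = step()
        if is_eos(token_id):
            break  # the audio decoder never receives the control token
        yield token_id
```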
In `@src/voicequant/core/tts/streaming.py`:
- Around line 186-215: The loop currently yields the last real chunk with
is_final=False and then _flush(True) emits an empty final chunk; fix by
preventing emission of that zero-length final chunk and ensuring the real last
chunk is marked final: (1) make _flush return None when there are no buffered
samples (do not create a zero-sample chunk), and (2) after assembling head/tail
in the sample_iter loop (function using buffer, buffered, _make_chunk and
first_emitted), detect when tail.size == 0 and no more input will arrive and set
the chunk's is_final=True before yielding (or rely on _flush not emitting and
only call _flush when buffer has data). Update references: modify _flush, the
sample_iter consumption loop that builds chunk via self._make_chunk, and the
final = _flush(True) call so no empty final chunk is emitted (see the sketch after this file's comments).
- Around line 47-53: The _encode_chunk function currently allows "wav" which
calls float32_to_wav and emits a full WAV header per chunk, producing an invalid
concatenated stream; change the logic in _encode_chunk to disallow or map "wav"
to PCM for streaming: when output_format (or fmt) equals "wav" either raise a
ValueError indicating WAV is not supported for per-chunk streaming or treat it
as "pcm" by calling float32_to_pcm(samples, sample_rate) instead; update
references to _encode_chunk, float32_to_wav, float32_to_pcm and output_format
handling so streaming only emits raw PCM chunks (or explicitly rejects "wav")
consistent with the server/routes/tts streaming behavior.
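Both streaming.py fixes in one sketch: "wav" downgraded to raw PCM (no per-chunk RIFF headers), and a one-chunk lookahead so the real last chunk carries `is_final=True` with no zero-length terminator. Helper names and the dict-based chunk shape are assumptions.

```python
from typing import Iterator

import numpy as np


def encode_chunk(samples: np.ndarray, fmt: str) -> bytes:
    """Encode one chunk; 'wav' is treated as 'pcm' so streams concatenate cleanly."""
    if fmt in ("pcm", "wav"):
        return (np.clip(samples, -1.0, 1.0) * 32767).astype("<i2").tobytes()
    raise ValueError(f"unsupported streaming format: {fmt}")


def finalize_last(chunks: Iterator[dict]) -> Iterator[dict]:
    """One-chunk lookahead: mark the real last chunk final, never emit an empty one."""
    prev = None
    for chunk in chunks:
        if prev is not None:
            yield prev
        prev = chunk
    if prev is not None:
        prev["is_final"] = True
        yield prev
```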
In `@src/voicequant/server/routes/tts.py`:
- Around line 69-78: The _streaming_format helper currently returns the input
lowercased format (so "wav" is returned) which contradicts the comment about
downgrading WAV for streaming; update the function _streaming_format(fmt: str)
so that any "wav" input is converted to "pcm" (i.e. return "pcm" when low ==
"wav"), preserve "pcm" for that case, and keep the default fallback as "pcm" for
unknown/empty formats to ensure streaming always uses raw PCM.
- Around line 117-136: The import of StreamingSynthesizer/TTSStreamingConfig and
the creation/first iteration of the generator must be moved inside the try so
ImportError and generator-start exceptions are caught; specifically, inside the
try block import StreamingSynthesizer and TTSStreamingConfig, instantiate
StreamingSynthesizer(engine, TTSStreamingConfig(...)), call
synth.stream(request.input, voice=request.voice) to get chunk_iter and perform
the first next(chunk_iter) (or otherwise trigger the generator start) so errors
surface there, then return a StreamingResponse that yields the already-read
first chunk followed by the remaining chunk_iter; ensure the except ImportError
and generic except blocks remain the same to map to 501/500.
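Putting the two route fixes together as a sketch: the import and the first `next()` sit inside the `try`, and the response body is always raw PCM. The FastAPI wiring, the `TTSStreamingConfig` field, and the `audio_bytes` attribute are assumptions about the surrounding code.

```python
from fastapi import HTTPException
from fastapi.responses import StreamingResponse


def speech_stream(request, engine):
    try:
        # Import inside the try so a missing extra maps to 501, not a crash.
        from voicequant.core.tts.streaming import (
            StreamingSynthesizer,
            TTSStreamingConfig,
        )

        synth = StreamingSynthesizer(engine, TTSStreamingConfig(output_format="pcm"))
        chunk_iter = synth.stream(request.input, voice=request.voice)
        first = next(chunk_iter)  # start the generator so setup errors surface here
    except ImportError as exc:
        raise HTTPException(status_code=501, detail=str(exc)) from exc
    except Exception as exc:
        raise HTTPException(status_code=500, detail=str(exc)) from exc

    def body():
        yield first.audio_bytes
        for chunk in chunk_iter:
            yield chunk.audio_bytes

    return StreamingResponse(body(), media_type="application/octet-stream")
```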
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1edb011e-763a-4e6a-be0a-9626e790ea82
📒 Files selected for processing (23)

- pyproject.toml
- src/voicequant/benchmarks/prompts/tts/test_sentences.json
- src/voicequant/benchmarks/report.py
- src/voicequant/benchmarks/runner.py
- src/voicequant/benchmarks/scenarios/tts/__init__.py
- src/voicequant/benchmarks/scenarios/tts/concurrent.py
- src/voicequant/benchmarks/scenarios/tts/mos_quality.py
- src/voicequant/benchmarks/scenarios/tts/speaker_cache_hit.py
- src/voicequant/benchmarks/scenarios/tts/streaming_jitter.py
- src/voicequant/benchmarks/scenarios/tts/ttfa.py
- src/voicequant/benchmarks/visualize.py
- src/voicequant/cli.py
- src/voicequant/core/tts/config.py
- src/voicequant/core/tts/engine.py
- src/voicequant/core/tts/orpheus_adapter.py
- src/voicequant/core/tts/streaming.py
- src/voicequant/server/routes/tts.py
- src/voicequant/server/routes/tts_stub.py
- tests/benchmarks/test_tts_benchmarks.py
- tests/core/tts/test_multi_backend.py
- tests/core/tts/test_orpheus_adapter.py
- tests/core/tts/test_streaming.py
- tests/server/test_tts_streaming_route.py
Review response for M5. Verified each finding against the code and applied fixes where the issue was real.

- report.py: `_generate_tts_section` now renders jitter and speaker-cache subsections so those scenarios render their own tables when run alone.
- runner.py: forward `max_sessions` to tts_concurrent like the LLM concurrent scenario.
- tts/concurrent.py: summary rows now require concurrency <= max_concurrency so best_n cannot exceed the hardware cap when a user passes a very large --max-sessions.
- tts/speaker_cache_hit.py: single-voice workloads return hit_rate=1.0 (warm cache after the first miss) instead of the degenerate 0.0.
- cli.py visualize: infer modality from scenario names when --modality is omitted; the "orpheus" shorthand on `tts speak` expands to the canonical HF id before TTSConfig is created.
- tts/config.py: Pydantic validator rejects tq_bits outside the supported (2, 3, 4) set at the boundary.
- tts/engine.py: list_voices() returns ORPHEUS_VOICES for cold orpheus engines; shutdown() now releases the Orpheus adapter (calls shutdown/close if available, clears _orpheus).
- tts/orpheus_adapter.py: ORPHEUS_VOICES exposed at module scope; get_compression_stats no longer fabricates a 0.99 cosine similarity when none was measured (returns None until real data); EOS tokens are consumed but not yielded so the decoder never receives control tokens; voice is prepended to the prompt when provided; synthesize now accepts mp3/opus via the existing wav_to_mp3/wav_to_opus path.
- tts/streaming.py: _encode_chunk downgrades "wav" to PCM so concatenated streams aren't corrupted by per-chunk WAV headers; the iterable path uses one-chunk lookahead so the real last chunk is marked is_final=True and no zero-length terminator is emitted.
- server/routes/tts.py: _streaming_format always returns "pcm" (the previous wav passthrough contradicted the docstring); the streaming endpoint imports StreamingSynthesizer inside the try and pulls the first chunk eagerly so ImportError/generator-start errors map to 501/500 cleanly.

All 291 tests still pass.
The inline list of model config dicts mixes str and int values, so mypy inferred the value type as `object` and flagged every value access (`m['name']`, `m['per_session_mb']`) in ConcurrentTTSScenario. Adding the explicit `list[dict[str, Any]]` annotation keeps lookups as `Any` and clears the 5 errors without reshaping the data.
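For illustration, a minimal reproduction of the pattern; the entries and numbers are placeholders, not the scenario's real model table.

```python
from typing import Any

# Without the annotation, mypy infers list[dict[str, object]] and flags
# the int/str lookups below; the explicit type keeps values as Any.
_MODELS: list[dict[str, Any]] = [
    {"name": "orpheus-fp16", "per_session_mb": 1200},  # placeholder values
    {"name": "kokoro", "per_session_mb": 350},
]

for m in _MODELS:
    print(m["name"], m["per_session_mb"])
```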
Summary by CodeRabbit

New Features

- `--modality` option for benchmark and visualization filtering (llm, tts, all).

Configuration

- New TTS configuration fields (`tq-bits`, `tq-enabled`).