
feat(tts): M5 streaming, Orpheus adapter, and headline benchmarks #3

Merged
mahimairaja merged 3 commits into main from claude/tts-streaming-orpheus-m5-Sqe94 on Apr 20, 2026
Conversation

@mahimairaja (Contributor) commented Apr 20, 2026

Adds the M5 milestone: TTS chunked streaming, Orpheus TTS model support
wired through the TurboQuant KV compression engine, and the TTS benchmark
suite with the cross-modality hero chart.

  • Core streaming: StreamingSynthesizer emits StreamingChunk objects from
    either a pre-generated waveform (Kokoro) or a token-streaming backend
    (Orpheus), with first-chunk min-size and TTFA tracking.
  • Streaming endpoint: POST /v1/audio/speech/stream supports chunked
    transfer and SSE (when Accept: text/event-stream). POST /v1/audio/speech
    honors the stream field for the OpenAI-style pattern.
  • Orpheus adapter: OrpheusAdapter imports TurboQuantEngine from
    core/llm/engine and applies KV compression per step on Orpheus' LLaMA
    backbone. This is the only cross-modality import; core/llm remains
    unaware of TTS.
  • Multi-backend TTSEngine: model_name routes to Kokoro or Orpheus. New
    tq_bits/tq_enabled config fields flow through to the Orpheus adapter.
  • Optional extra: pyproject.toml adds tts-orpheus (orpheus-tts +
    phonemizer). Base tts extra is unchanged.
  • Benchmarks: 5 new TTS scenarios (ttfa, streaming_jitter, mos_quality,
    concurrent, speaker_cache_hit) registered in the shared runner; TTS
    markdown section added to the report. 15+ test sentences fixture.
  • Visualization: 5 TTS charts plus the cross-modality hero chart
    ("Voice Agent Sessions per GPU"). CLI gains --modality and --all.
  • Tests: 53 new tests across streaming, orpheus_adapter, multi_backend,
    tts benchmarks, and the streaming route. All 291 tests pass.
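The first-chunk minimum-size rule from the streaming bullet can be sketched as a standalone model. Field and class names here are illustrative; the PR's StreamingChunk likely carries more metadata (e.g. timestamps for TTFA tracking):

```python
from dataclasses import dataclass


@dataclass
class StreamingChunk:
    """Minimal stand-in for the PR's StreamingChunk (illustrative fields)."""
    audio_bytes: bytes
    index: int
    is_final: bool


def chunk_samples(pcm: bytes, chunk_size: int, first_chunk_min: int) -> list[StreamingChunk]:
    """Split raw audio into chunks; the first chunk is at least first_chunk_min bytes."""
    pieces: list[bytes] = []
    pos = 0
    while pos < len(pcm):
        # Enforce the first-chunk minimum so playback can start without underrun.
        size = max(chunk_size, first_chunk_min) if not pieces else chunk_size
        pieces.append(pcm[pos:pos + size])
        pos += size
    return [
        StreamingChunk(p, i, i == len(pieces) - 1) for i, p in enumerate(pieces)
    ]
```

The last real chunk carries `is_final=True`, so consumers never see a zero-length terminator.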

Summary by CodeRabbit

  • New Features

    • Added TTS benchmarking with five scenarios: Time-to-First-Audio, streaming jitter, audio quality, concurrent streams, and speaker cache performance.
    • Introduced support for Orpheus TTS backend alongside Kokoro.
    • Added TTS audio streaming support with configurable chunk sizing.
    • Expanded CLI with --modality option for benchmark and visualization filtering (llm, tts, all).
    • Added TTS analytics charts and cross-modality performance visualization.
  • Configuration

    • Extended TTS config with TurboQuant compression options (tq-bits, tq-enabled).

Copilot AI review requested due to automatic review settings April 20, 2026 22:45
coderabbitai (bot) commented Apr 20, 2026

Warning

Rate limit exceeded

@mahimairaja has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 39 minutes and 51 seconds before requesting another review.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ac2c36ff-a87d-4c58-9864-83d2c247f16d

📥 Commits

Reviewing files that changed from the base of the PR and between 112547a and f342915.

📒 Files selected for processing (10)
  • src/voicequant/benchmarks/report.py
  • src/voicequant/benchmarks/runner.py
  • src/voicequant/benchmarks/scenarios/tts/concurrent.py
  • src/voicequant/benchmarks/scenarios/tts/speaker_cache_hit.py
  • src/voicequant/cli.py
  • src/voicequant/core/tts/config.py
  • src/voicequant/core/tts/engine.py
  • src/voicequant/core/tts/orpheus_adapter.py
  • src/voicequant/core/tts/streaming.py
  • src/voicequant/server/routes/tts.py
📝 Walkthrough

This PR adds comprehensive TTS support with multi-backend synthesis (Kokoro and Orpheus), streaming audio output, TurboQuant KV compression, five benchmark scenarios measuring performance metrics, visualization charts, CLI modality controls, and server streaming endpoints alongside supporting test coverage.

Changes

• Dependencies (pyproject.toml): Added optional dependency groups tts-orpheus (orpheus-tts, phonemizer, torch, transformers, soundfile, numpy) and tts-all; updated the all group to include tts-orpheus.
• Benchmark Test Data (src/voicequant/benchmarks/prompts/tts/test_sentences.json): New JSON fixture defining 15 test sentences across short, medium, and long categories, each with expected duration ranges for TTS benchmarking.
• TTS Benchmark Scenarios (src/voicequant/benchmarks/scenarios/tts/__init__.py, ttfa.py, streaming_jitter.py, mos_quality.py, concurrent.py, speaker_cache_hit.py): Five new analytical TTS benchmark scenario classes: TTFAScenario (time-to-first-audio), StreamingJitterScenario (gap metrics), MOSQualityScenario (PESQ/STOI scores), ConcurrentTTSScenario (session concurrency modeling), and SpeakerCacheHitScenario (cache hit rate estimation).
• Benchmark Infrastructure (src/voicequant/benchmarks/runner.py, report.py, visualize.py): Extended the scenario registry with five TTS entries; added TTS report section generation with TTFA, concurrency, and quality subsections; added TTS data collection, five chart generators (ttfa, jitter, concurrent, quality, cache), the cross-modality hero chart, and new public functions for modality-based chart generation.
• Core TTS, Multi-Backend & Streaming (src/voicequant/core/tts/config.py, engine.py, orpheus_adapter.py, streaming.py): Added tq_bits and tq_enabled config fields; implemented backend detection and multi-backend loading (Kokoro/Orpheus); added OrpheusAdapter for Orpheus speech-token generation with TurboQuant compression; introduced StreamingSynthesizer and TTSStreamingConfig for chunked audio streaming with TTFA tracking.
• Server TTS Routes (src/voicequant/server/routes/tts.py, tts_stub.py): Added a stream field to SpeechRequest; extended the /v1/audio/speech handler to route to the new streaming path; added the /v1/audio/speech/stream endpoint supporting both raw audio and Server-Sent Events response modes; added a stub endpoint for disabled-TTS environments.
• CLI Commands (src/voicequant/cli.py): Added the --modality option (llm/tts/all) to the bench() and visualize() commands; updated scenario selection and visualization routing; extended the tts speak command with --model (backend selector) and --tq-bits (KV compression bits) options.
• Test Coverage (tests/benchmarks/test_tts_benchmarks.py, tests/core/tts/test_multi_backend.py, test_orpheus_adapter.py, test_streaming.py, tests/server/test_tts_streaming_route.py): Comprehensive test suite covering TTS benchmark scenario registration, execution, and output validation; multi-backend engine switching with dependency checks; Orpheus adapter config and token generation with TQ compression; streaming chunk generation and TTFA tracking; and server streaming endpoints with SSE and raw-audio response modes.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor Client
    participant Engine as TTSEngine
    participant Detector as _detect_backend()
    participant Kokoro as Kokoro Backend
    participant Orpheus as OrpheusAdapter

    Client->>Engine: synthesize(text, voice, model="orpheus-fp16")
    Engine->>Detector: _detect_backend("orpheus-fp16")
    Detector-->>Engine: "orpheus"
    Engine->>Orpheus: load_model()
    Engine->>Orpheus: synthesize(text, voice)
    Orpheus->>Orpheus: generate_speech_tokens(text, voice)
    Orpheus->>Orpheus: decode_tokens_to_audio(tokens)
    Orpheus-->>Engine: SynthesisResult(wav/pcm)
    Engine-->>Client: audio bytes

    Client->>Engine: synthesize(text, voice, model="kokoro")
    Engine->>Detector: _detect_backend("kokoro")
    Detector-->>Engine: "kokoro"
    Engine->>Kokoro: create(text, speaker_id, output_format)
    Kokoro-->>Engine: audio bytes
    Engine-->>Client: audio bytes
```
```mermaid
sequenceDiagram
    actor Client
    participant Synthesizer as StreamingSynthesizer
    participant Engine as TTSEngine
    participant Orpheus as OrpheusAdapter
    participant Encoder as _encode_chunk()

    Client->>Synthesizer: stream(text, voice)
    Synthesizer->>Synthesizer: detect backend support
    Synthesizer->>Engine: stream_samples(text, voice)
    Engine->>Orpheus: stream_samples(text, voice)
    loop For each sample batch
        Orpheus->>Orpheus: generate_speech_tokens()
        Orpheus->>Orpheus: decode_tokens_to_audio(batch)
        Orpheus-->>Engine: float32 samples
        Engine-->>Synthesizer: float32 samples
        Synthesizer->>Encoder: encode chunk to PCM/WAV
        Encoder-->>Synthesizer: audio_bytes
        Synthesizer->>Synthesizer: emit StreamingChunk
        Synthesizer-->>Client: StreamingChunk
    end
    Synthesizer->>Synthesizer: mark final chunk
    Synthesizer-->>Client: StreamingChunk(is_final=true)
    Synthesizer->>Synthesizer: record last_ttfa_ms, last_total_chunks
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Poem

🐰 Hops through the speech tokens with glee,
Orpheus and Kokoro dance in harmony,
Streaming chunks flow like garden streams,
TurboQuant compresses our audio dreams,
Five benchmarks measure what voices can be!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

• Docstring Coverage (⚠️ Warning): docstring coverage is 40.43%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

• Description Check (✅ Passed): check skipped; CodeRabbit's high-level summary is enabled.
• Title check (✅ Passed): the PR title accurately summarizes the primary changes: M5 streaming implementation, Orpheus adapter addition, and headline benchmarks for TTS, all core objectives reflected in the changeset.


coderabbitai (bot) left a comment

Actionable comments posted: 15

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/voicequant/core/tts/engine.py (1)

300-307: ⚠️ Potential issue | 🟠 Major

Release the Orpheus adapter during shutdown.

shutdown() clears _model, but _orpheus can still hold the loaded Orpheus model/tokenizer/decoder, keeping GPU memory referenced after shutdown.

🧹 Proposed fix
 def shutdown(self) -> None:
     with self._lock:
+        if self._orpheus is not None:
+            self._orpheus.shutdown()
+            self._orpheus = None
         self._model = None
         self._model_loaded = False
         if self._speaker_cache is not None:
             self._speaker_cache.clear()
             self._speaker_cache = None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/voicequant/core/tts/engine.py` around lines 300 - 307, shutdown()
currently clears _model but leaves self._orpheus alive, which can keep
GPU-resident Orpheus tensors; update shutdown() to, inside the existing with
self._lock block, check if self._orpheus is not None and first call any cleanup
API available (e.g. call self._orpheus.close() or self._orpheus.release() if
present via getattr) and then set self._orpheus = None, and also ensure
_model_loaded is set to False and _speaker_cache cleared as already done so the
Orpheus adapter is fully released.
src/voicequant/cli.py (1)

282-304: ⚠️ Potential issue | 🟠 Major

Resolve the orpheus shorthand to the full Hugging Face model ID before constructing TTSConfig.

The CLI advertises --model orpheus as valid, but orpheus-tts does not recognize "orpheus" as a model name alias—it requires the full Hugging Face model ID. Passing the bare string will cause a runtime failure.

Fix
-    cfg = TTSConfig(
-        model_name=model,
+    resolved_model = (
+        "canopylabs/orpheus-3b-0.1-ft"
+        if model.lower() == "orpheus"
+        else model
+    )
+    cfg = TTSConfig(
+        model_name=resolved_model,
         device=device,
         default_voice=voice,
         output_format=fmt,
         tq_bits=tq_bits,
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/voicequant/cli.py` around lines 282 - 304, The CLI allows the shorthand
"orpheus" but you must translate that to the full Hugging Face model ID before
creating the TTSConfig; update the code around the model variable (in the CLI
function where model is read and before the TTSConfig(...) call) to check if
model == "orpheus" and replace it with the canonical HF model id (e.g.
ai-forever/orpheus-tts or your project's exact HF identifier) so that
TTSEngine/TTSConfig receives the full model name; ensure this mapping happens
prior to instantiating TTSConfig(model_name=model, ...).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/voicequant/benchmarks/report.py`:
- Around line 366-417: The TTS generator _generate_tts_section is missing output
for the scenarios keyed as "tts_streaming_jitter" and "tts_speaker_cache_hit",
so add branches that mirror the existing pattern (like "tts_ttfa",
"tts_concurrent", "tts_mos_quality"): check if each key is in the present list,
pull its data from results (e.g., results["tts_streaming_jitter"] and
results["tts_speaker_cache_hit"]), and append human-readable subsection headers
and Markdown tables or summaries using the scenario's .get("results", []) or
summary fields; ensure field names used match those produced by the scenario
(e.g., jitter statistics, packet/drop rates, cache hit ratio) so that running
only those scenarios still renders a non-empty TTS section and be consistent
with other TTS blocks.

In `@src/voicequant/benchmarks/runner.py`:
- Around line 76-79: The tts_concurrent benchmark registration doesn't receive
the user-provided max_sessions value like the LLM "concurrent" path does; update
runner.py so when creating or invoking the scenario for
"voicequant.benchmarks.scenarios.tts.concurrent" / ConcurrentTTSScenario you
forward the same max_sessions parameter used for the LLM "concurrent" scenario
(e.g., include max_sessions in the scenario instantiation/config or run call),
ensuring TTS concurrent runs respect the provided concurrency limit; apply the
same change to the other similar registration block referenced around lines
312-317.

In `@src/voicequant/benchmarks/scenarios/tts/concurrent.py`:
- Around line 67-104: The summary's selection of rows can include ladder samples
above the hardware cap because it only checks p95; update the rows filter
comprehension in the summary loop (the one that builds rows for each gpu and
model) to also enforce the hardware cap by requiring r["concurrency"] <=
r["max_concurrency"] (or equivalently r["max_concurrency"] >= r["concurrency"])
so best_n cannot exceed the computed GPU capacity; keep the rest of the logic
(p95 <= _LATENCY_BUDGET_MS and matching gpu/model) unchanged.

In `@src/voicequant/benchmarks/scenarios/tts/speaker_cache_hit.py`:
- Around line 19-28: The _cache_hit_rate function incorrectly returns 0.0 for
unique_voices == 1; change it so the one-voice case returns a warm cache hit
rate (1.0) after the first miss. Update the function _cache_hit_rate to
explicitly handle unique_voices == 1 (return 1.0) and leave the existing logic
for other values (the branch for unique_voices <= cache_size and the cache_size
/ unique_voices fallback) unchanged, so benchmarks report a warm cache for
repeated single-voice synthesis.

In `@src/voicequant/cli.py`:
- Around line 124-128: The visualization modality defaults to "llm" when
--modality is omitted, causing TTS scenarios to produce LLM charts; update the
logic before calling generate_charts_by_modality in cli.py to infer modality
from the selected scenarios: set viz_mod to "all" if all_scenarios is true,
otherwise use modality if provided, else detect if any selected scenario string
starts with "tts_" and set viz_mod to "tts" (falling back to "llm" if no TTS
scenarios are present); reference the viz_mod variable and
generate_charts_by_modality function to implement this conditional inference.

In `@src/voicequant/core/tts/config.py`:
- Around line 18-19: TTSConfig currently accepts any integer for tq_bits which
can later break the Orpheus adapter; add Pydantic validation at the config
boundary (either by replacing the raw fields with a dedicated TurboQuantConfig
model or by adding a validator on TTSConfig) that enforces tq_bits is one of the
supported TurboQuant bit widths and raises a clear ValueError otherwise; keep
tq_enabled default behavior and update ServerConfig to carry the new
TurboQuantConfig if you introduce it.

In `@src/voicequant/core/tts/engine.py`:
- Around line 124-128: list_voices() currently returns KOKORO_VOICES when the
engine is configured for Orpheus but _orpheus hasn't been initialized, exposing
wrong voice IDs; change list_voices in the TTS engine to return Orpheus voice
identifiers whenever self._backend == "orpheus" regardless of self._orpheus
being None by delegating to the Orpheus voice registry or a static
ORPHEUS_VOICES mapping (or calling a class/static helper in the Orpheus loader)
so that cold engines report correct Orpheus voice IDs; reference the list_voices
method, the _backend and _orpheus attributes, and the load_model initialization
path to locate where to replace the fallback to KOKORO_VOICES.

In `@src/voicequant/core/tts/orpheus_adapter.py`:
- Around line 302-312: get_compression_stats() currently fabricates a 0.99
cosine similarity when self._last_cosine is unset; update the method
(get_compression_stats) to stop reporting a made-up value by either omitting the
"cosine_similarity" key or returning None for it when self._last_cosine is None,
i.e., use the real _last_cosine only when it's been computed and otherwise leave
that metric absent/None (also ensure any callers tolerate a missing/None
"cosine_similarity" instead of assuming a numeric value).
- Around line 155-164: In the token-generation loop in OrpheusAdapter (the block
that calls self._sample and currently yields token_id before checking
self._is_eos), change the flow so you call self._sample(), compute token_id,
then check self._is_eos(token_id) and break without yielding if it's EOS; only
yield non-EOS token_ids so decode_tokens_to_audio never receives a tokenizer
control token. Ensure you still update input_ids/position_ids and n_generated
appropriately for the next step when breaking.
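The reordered loop can be sketched generically; here sample_fn and is_eos stand in for the adapter's _sample/_is_eos, stripped of the tensor bookkeeping:

```python
from typing import Callable, Iterator


def generate_tokens(
    sample_fn: Callable[[], int],
    is_eos: Callable[[int], bool],
    max_tokens: int,
) -> Iterator[int]:
    """Sample, check EOS, then yield, so the audio decoder never sees control tokens."""
    for _ in range(max_tokens):
        token_id = sample_fn()
        if is_eos(token_id):
            break  # consume the EOS token without yielding it
        yield token_id
```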
- Around line 216-222: The Orpheus adapter's format branch currently only
handles wav/pcm (see float32_to_wav, float32_to_pcm and self.sample_rate) but
must accept mp3 and opus like the public TTSEngine/CLI; update the fmt handling
in the method in OrpheusAdapter to add branches for "mp3" and "opus" that encode
the float32 samples to those formats (e.g., call or add helper functions
encode_float32_to_mp3(samples, self.sample_rate) and
encode_float32_to_opus(samples, self.sample_rate) or invoke your project's
ffmpeg/pydub wrapper), returning audio_bytes for mp3/opus, and keep the existing
wav/pcm branches and the ValueError for unknown formats.
- Around line 113-130: The generate_speech_tokens method accepts a voice but
never applies it to the prompt; prepend the voice to the text (format "{voice}:
{text}" when voice is provided) before calling self._tokenizer so the model
receives voice-conditioned input. Update generate_speech_tokens to build a
prompt variable from the voice and text, then pass that prompt into
self._tokenizer(...).to(self.config.device) (leaving load_model and device
handling unchanged).

In `@src/voicequant/core/tts/streaming.py`:
- Around line 186-215: The loop currently yields the last real chunk with
is_final=False and then _flush(True) emits an empty final chunk; fix by
preventing emission of that zero-length final chunk and ensuring the real last
chunk is marked final: (1) make _flush return None when there are no buffered
samples (do not create a zero-sample chunk), and (2) after assembling head/tail
in the sample_iter loop (function using buffer, buffered, _make_chunk and
first_emitted), detect when tail.size == 0 and no more input will arrive and set
the chunk's is_final=True before yielding (or rely on _flush not emitting and
only call _flush when buffer has data). Update references: modify _flush, the
sample_iter consumption loop that builds chunk via self._make_chunk, and the
final = _flush(True) call so no empty final chunk is emitted.
- Around line 47-53: The _encode_chunk function currently allows "wav" which
calls float32_to_wav and emits a full WAV header per chunk, producing an invalid
concatenated stream; change the logic in _encode_chunk to disallow or map "wav"
to PCM for streaming: when output_format (or fmt) equals "wav" either raise a
ValueError indicating WAV is not supported for per-chunk streaming or treat it
as "pcm" by calling float32_to_pcm(samples, sample_rate) instead; update
references to _encode_chunk, float32_to_wav, float32_to_pcm and output_format
handling so streaming only emits raw PCM chunks (or explicitly rejects "wav")
consistent with the server/routes/tts streaming behavior.

In `@src/voicequant/server/routes/tts.py`:
- Around line 69-78: The _streaming_format helper currently returns the input
lowercased format (so "wav" is returned) which contradicts the comment about
downgrading WAV for streaming; update the function _streaming_format(fmt: str)
so that any "wav" input is converted to "pcm" (i.e. return "pcm" when low ==
"wav"), preserve "pcm" for that case, and keep the default fallback as "pcm" for
unknown/empty formats to ensure streaming always uses raw PCM.
- Around line 117-136: The import of StreamingSynthesizer/TTSStreamingConfig and
the creation/first iteration of the generator must be moved inside the try so
ImportError and generator-start exceptions are caught; specifically, inside the
try block import StreamingSynthesizer and TTSStreamingConfig, instantiate
StreamingSynthesizer(engine, TTSStreamingConfig(...)), call
synth.stream(request.input, voice=request.voice) to get chunk_iter and perform
the first next(chunk_iter) (or otherwise trigger the generator start) so errors
surface there, then return a StreamingResponse that yields the already-read
first chunk followed by the remaining chunk_iter; ensure the except ImportError
and generic except blocks remain the same to map to 501/500.
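Framework details aside, the eager-start pattern looks like this; a generic sketch where the returned tuples stand in for the route's HTTP responses:

```python
from typing import Callable, Iterator


def start_stream(make_iter: Callable[[], Iterator[bytes]]):
    """Pull the first chunk inside the try so setup failures surface as a
    status code instead of breaking mid-stream."""
    try:
        chunk_iter = make_iter()   # imports/construction may raise here
        first = next(chunk_iter)   # generator body actually starts here
    except ImportError:
        return ("error", 501)      # optional backend not installed
    except StopIteration:
        return ("ok", iter(()))    # empty stream is still a success
    except Exception:
        return ("error", 500)      # any other generator-start failure

    def body() -> Iterator[bytes]:
        yield first                # replay the eagerly-read chunk
        yield from chunk_iter
    return ("ok", body())
```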

---

Outside diff comments:
In `@src/voicequant/cli.py`:
- Around line 282-304: The CLI allows the shorthand "orpheus" but you must
translate that to the full Hugging Face model ID before creating the TTSConfig;
update the code around the model variable (in the CLI function where model is
read and before the TTSConfig(...) call) to check if model == "orpheus" and
replace it with the canonical HF model id (e.g. ai-forever/orpheus-tts or your
project's exact HF identifier) so that TTSEngine/TTSConfig receives the full
model name; ensure this mapping happens prior to instantiating
TTSConfig(model_name=model, ...).

In `@src/voicequant/core/tts/engine.py`:
- Around line 300-307: shutdown() currently clears _model but leaves
self._orpheus alive, which can keep GPU-resident Orpheus tensors; update
shutdown() to, inside the existing with self._lock block, check if self._orpheus
is not None and first call any cleanup API available (e.g. call
self._orpheus.close() or self._orpheus.release() if present via getattr) and
then set self._orpheus = None, and also ensure _model_loaded is set to False and
_speaker_cache cleared as already done so the Orpheus adapter is fully released.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1edb011e-763a-4e6a-be0a-9626e790ea82

📥 Commits

Reviewing files that changed from the base of the PR and between 1fe79c0 and 112547a.

📒 Files selected for processing (23)
  • pyproject.toml
  • src/voicequant/benchmarks/prompts/tts/test_sentences.json
  • src/voicequant/benchmarks/report.py
  • src/voicequant/benchmarks/runner.py
  • src/voicequant/benchmarks/scenarios/tts/__init__.py
  • src/voicequant/benchmarks/scenarios/tts/concurrent.py
  • src/voicequant/benchmarks/scenarios/tts/mos_quality.py
  • src/voicequant/benchmarks/scenarios/tts/speaker_cache_hit.py
  • src/voicequant/benchmarks/scenarios/tts/streaming_jitter.py
  • src/voicequant/benchmarks/scenarios/tts/ttfa.py
  • src/voicequant/benchmarks/visualize.py
  • src/voicequant/cli.py
  • src/voicequant/core/tts/config.py
  • src/voicequant/core/tts/engine.py
  • src/voicequant/core/tts/orpheus_adapter.py
  • src/voicequant/core/tts/streaming.py
  • src/voicequant/server/routes/tts.py
  • src/voicequant/server/routes/tts_stub.py
  • tests/benchmarks/test_tts_benchmarks.py
  • tests/core/tts/test_multi_backend.py
  • tests/core/tts/test_orpheus_adapter.py
  • tests/core/tts/test_streaming.py
  • tests/server/test_tts_streaming_route.py

claude added 2 commits April 20, 2026 23:00
Review response for M5. Verified each finding against the code and
applied fixes where the issue was real.

- report.py: _generate_tts_section now renders jitter and speaker-cache
  subsections so scenarios render their own tables when run alone.
- runner.py: forward max_sessions to tts_concurrent like the LLM
  concurrent scenario.
- tts/concurrent.py: summary rows now require concurrency <=
  max_concurrency so best_n cannot exceed the hardware cap when a user
  passes a very large --max-sessions.
- tts/speaker_cache_hit.py: single-voice workloads return hit_rate=1.0
  (warm cache after the first miss) instead of the degenerate 0.0.
- cli.py visualize: infer modality from scenario names when --modality
  is omitted; "orpheus" shorthand on `tts speak` expands to the
  canonical HF id before TTSConfig is created.
- tts/config.py: Pydantic validator rejects tq_bits outside the
  supported (2, 3, 4) set at the boundary.
- tts/engine.py: list_voices() returns ORPHEUS_VOICES for cold orpheus
  engines; shutdown() now releases the Orpheus adapter (calls
  shutdown/close if available, clears _orpheus).
- tts/orpheus_adapter.py: ORPHEUS_VOICES exposed at module scope;
  get_compression_stats no longer fabricates a 0.99 cosine similarity
  when none was measured (returns None until real data); EOS tokens
  are consumed but not yielded so the decoder never receives control
  tokens; voice is prepended to the prompt when provided; synthesize
  now accepts mp3/opus via the existing wav_to_mp3/wav_to_opus path.
- tts/streaming.py: _encode_chunk downgrades "wav" to PCM so
  concatenated streams aren't corrupted by per-chunk WAV headers; the
  iterable path uses one-chunk lookahead so the real last chunk is
  marked is_final=True and no zero-length terminator is emitted.
- server/routes/tts.py: _streaming_format always returns "pcm" (the
  previous wav passthrough contradicted the docstring); streaming
  endpoint imports StreamingSynthesizer inside the try and pulls the
  first chunk eagerly so ImportError/generator-start errors map to
  501/500 cleanly.

All 291 tests still pass.
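The tq_bits boundary check described above, expressed as plain Python for illustration (the PR wires it up as a Pydantic validator; the (2, 3, 4) set comes from the commit message):

```python
_SUPPORTED_TQ_BITS = (2, 3, 4)  # supported TurboQuant bit widths per the commit above


def validate_tq_bits(bits: int) -> int:
    """Reject unsupported bit widths at the config boundary with a clear error."""
    if bits not in _SUPPORTED_TQ_BITS:
        raise ValueError(
            f"tq_bits must be one of {_SUPPORTED_TQ_BITS}, got {bits}"
        )
    return bits
```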
The inline list of model config dicts mixes str and int values, so
mypy inferred the value type as 'object' and flagged every attribute
access (m['name'], m['per_session_mb']) in ConcurrentTTSScenario.
Adding the explicit list[dict[str, Any]] annotation keeps lookups as
Any and clears the 5 errors without reshaping the data.
@mahimairaja mahimairaja merged commit ec52e54 into main Apr 20, 2026
5 checks passed
@mahimairaja mahimairaja review requested due to automatic review settings April 20, 2026 23:09