feat(tts): M5 streaming, Orpheus adapter, and headline benchmarks #3
Adds the M5 milestone: TTS chunked streaming, Orpheus TTS model support
wired through the TurboQuant KV compression engine, and the TTS benchmark
suite with the cross-modality hero chart.
- Core streaming: StreamingSynthesizer emits StreamingChunk objects from
either a pre-generated waveform (Kokoro) or a token-streaming backend
(Orpheus), with first-chunk min-size and TTFA tracking.
- Streaming endpoint: POST /v1/audio/speech/stream supports chunked
transfer and SSE (when Accept: text/event-stream). POST /v1/audio/speech
honors the `stream` field for the OpenAI-style pattern (a client sketch
follows this list).
- Orpheus adapter: OrpheusAdapter imports TurboQuantEngine from
core/llm/engine and applies KV compression per step on Orpheus' LLaMA
backbone. This is the only cross-modality import; core/llm remains
unaware of TTS.
- Multi-backend TTSEngine: model_name routes to Kokoro or Orpheus. New
tq_bits/tq_enabled config fields flow through to the Orpheus adapter.
- Optional extra: pyproject.toml adds `tts-orpheus` (orpheus-tts +
phonemizer). Base `tts` extra is unchanged.
- Benchmarks: 5 new TTS scenarios (ttfa, streaming_jitter, mos_quality,
concurrent, speaker_cache_hit) registered in the shared runner; TTS
markdown section added to the report. 15+ test sentences fixture.
- Visualization: 5 TTS charts plus the cross-modality hero chart
("Voice Agent Sessions per GPU"). CLI gains `--modality` and `--all`.
- Tests: 53 new tests across streaming, orpheus_adapter, multi_backend,
tts benchmarks, and the streaming route. All 291 tests pass.
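As a quick orientation, here is how a client might consume the chunked endpoint. This is a sketch, not code from the PR: the request fields mirror the OpenAI-style `/v1/audio/speech` schema described above, and the voice name and output path are placeholders.

```python
# Hypothetical client for POST /v1/audio/speech/stream (chunked transfer).
# Field names follow the OpenAI-style pattern; adjust to the real schema.
import requests

resp = requests.post(
    "http://localhost:8000/v1/audio/speech/stream",
    json={
        "input": "Hello from the M5 milestone.",
        "voice": "tara",           # placeholder voice id
        "response_format": "pcm",  # streaming emits raw PCM chunks
    },
    stream=True,
)
resp.raise_for_status()
with open("out.pcm", "wb") as f:
    for chunk in resp.iter_content(chunk_size=None):
        f.write(chunk)  # write chunks as they arrive; no per-chunk WAV headers
```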
📝 Walkthrough

This PR adds comprehensive TTS support with multi-backend synthesis (Kokoro and Orpheus), streaming audio output, TurboQuant KV compression, five benchmark scenarios measuring performance metrics, visualization charts, CLI modality controls, and server streaming endpoints, alongside supporting test coverage.
Sequence Diagram(s)

```mermaid
sequenceDiagram
actor Client
participant Engine as TTSEngine
participant Detector as _detect_backend()
participant Kokoro as Kokoro Backend
participant Orpheus as OrpheusAdapter
Client->>Engine: synthesize(text, voice, model="orpheus-fp16")
Engine->>Detector: _detect_backend("orpheus-fp16")
Detector-->>Engine: "orpheus"
Engine->>Orpheus: load_model()
Engine->>Orpheus: synthesize(text, voice)
Orpheus->>Orpheus: generate_speech_tokens(text, voice)
Orpheus->>Orpheus: decode_tokens_to_audio(tokens)
Orpheus-->>Engine: SynthesisResult(wav/pcm)
Engine-->>Client: audio bytes
Client->>Engine: synthesize(text, voice, model="kokoro")
Engine->>Detector: _detect_backend("kokoro")
Detector-->>Engine: "kokoro"
Engine->>Kokoro: create(text, speaker_id, output_format)
Kokoro-->>Engine: audio bytes
Engine-->>Client: audio bytes
```
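The routing step in the diagram, sketched in Python. Substring-based detection is an assumption; the real rules live in `_detect_backend()`.

```python
# Sketch only: assumes substring-based backend detection.
def detect_backend(model_name: str) -> str:
    name = model_name.lower()
    if "orpheus" in name:  # e.g. "orpheus-fp16" in the diagram above
        return "orpheus"
    return "kokoro"        # default backend, e.g. "kokoro"
```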
```mermaid
sequenceDiagram
actor Client
participant Synthesizer as StreamingSynthesizer
participant Engine as TTSEngine
participant Orpheus as OrpheusAdapter
participant Encoder as _encode_chunk()
Client->>Synthesizer: stream(text, voice)
Synthesizer->>Synthesizer: detect backend support
Synthesizer->>Engine: stream_samples(text, voice)
Engine->>Orpheus: stream_samples(text, voice)
loop For each sample batch
Orpheus->>Orpheus: generate_speech_tokens()
Orpheus->>Orpheus: decode_tokens_to_audio(batch)
Orpheus-->>Engine: float32 samples
Engine-->>Synthesizer: float32 samples
Synthesizer->>Encoder: encode chunk to PCM/WAV
Encoder-->>Synthesizer: audio_bytes
Synthesizer->>Synthesizer: emit StreamingChunk
Synthesizer-->>Client: StreamingChunk
end
Synthesizer->>Synthesizer: mark final chunk
Synthesizer-->>Client: StreamingChunk(is_final=true)
Synthesizer->>Synthesizer: record last_ttfa_ms, last_total_chunks
```
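The chunk objects and the TTFA bookkeeping at the end of this diagram could look roughly like the sketch below. Only `is_final`, `audio_bytes`, `last_ttfa_ms`, and `last_total_chunks` appear in the diagram; the other names are assumptions.

```python
import time
from dataclasses import dataclass
from typing import Iterator

@dataclass
class StreamingChunk:
    audio_bytes: bytes     # encoded PCM for this chunk
    index: int             # 0-based chunk position
    is_final: bool = False

def timed_stream(chunks: Iterator[StreamingChunk], stats: dict) -> Iterator[StreamingChunk]:
    """Pass chunks through while recording TTFA and the total chunk count."""
    start = time.perf_counter()
    count = 0
    for chunk in chunks:
        if count == 0:
            stats["last_ttfa_ms"] = (time.perf_counter() - start) * 1000.0
        count += 1
        yield chunk
    stats["last_total_chunks"] = count
```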
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Codecov Report: ❌ Patch coverage is …
Actionable comments posted: 15
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/voicequant/core/tts/engine.py (1)
Lines 300-307 | ⚠️ Potential issue | 🟠 Major

Release the Orpheus adapter during shutdown.

`shutdown()` clears `_model`, but `_orpheus` can still hold the loaded Orpheus model/tokenizer/decoder, keeping GPU memory referenced after shutdown.

🧹 Proposed fix

```diff
 def shutdown(self) -> None:
     with self._lock:
+        if self._orpheus is not None:
+            self._orpheus.shutdown()
+            self._orpheus = None
         self._model = None
         self._model_loaded = False
         if self._speaker_cache is not None:
             self._speaker_cache.clear()
             self._speaker_cache = None
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/voicequant/core/tts/engine.py` around lines 300 - 307, shutdown() currently clears _model but leaves self._orpheus alive, which can keep GPU-resident Orpheus tensors; update shutdown() to, inside the existing with self._lock block, check if self._orpheus is not None and first call any cleanup API available (e.g. call self._orpheus.close() or self._orpheus.release() if present via getattr) and then set self._orpheus = None, and also ensure _model_loaded is set to False and _speaker_cache cleared as already done so the Orpheus adapter is fully released.

src/voicequant/cli.py (1)
Lines 282-304 | ⚠️ Potential issue | 🟠 Major

Resolve the `orpheus` shorthand to the full Hugging Face model ID before constructing `TTSConfig`.

The CLI advertises `--model orpheus` as valid, but `orpheus-tts` does not recognize "orpheus" as a model name alias; it requires the full Hugging Face model ID. Passing the bare string will cause a runtime failure.

Fix

```diff
-    cfg = TTSConfig(
-        model_name=model,
+    resolved_model = (
+        "canopylabs/orpheus-3b-0.1-ft"
+        if model.lower() == "orpheus"
+        else model
+    )
+    cfg = TTSConfig(
+        model_name=resolved_model,
         device=device,
         default_voice=voice,
         output_format=fmt,
         tq_bits=tq_bits,
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/voicequant/cli.py` around lines 282 - 304, The CLI allows the shorthand "orpheus" but you must translate that to the full Hugging Face model ID before creating the TTSConfig; update the code around the model variable (in the CLI function where model is read and before the TTSConfig(...) call) to check if model == "orpheus" and replace it with the canonical HF model id (e.g. ai-forever/orpheus-tts or your project's exact HF identifier) so that TTSEngine/TTSConfig receives the full model name; ensure this mapping happens prior to instantiating TTSConfig(model_name=model, ...).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/voicequant/benchmarks/report.py`:
- Around line 366-417: The TTS generator _generate_tts_section is missing output
for the scenarios keyed as "tts_streaming_jitter" and "tts_speaker_cache_hit",
so add branches that mirror the existing pattern (like "tts_ttfa",
"tts_concurrent", "tts_mos_quality"): check if each key is in the present list,
pull its data from results (e.g., results["tts_streaming_jitter"] and
results["tts_speaker_cache_hit"]), and append human-readable subsection headers
and Markdown tables or summaries using the scenario's .get("results", []) or
summary fields; ensure field names used match those produced by the scenario
(e.g., jitter statistics, packet/drop rates, cache hit ratio) so that running
only those scenarios still renders a non-empty TTS section and be consistent
with other TTS blocks.
In `@src/voicequant/benchmarks/runner.py`:
- Around line 76-79: The tts_concurrent benchmark registration doesn't receive
the user-provided max_sessions value like the LLM "concurrent" path does; update
runner.py so when creating or invoking the scenario for
"voicequant.benchmarks.scenarios.tts.concurrent" / ConcurrentTTSScenario you
forward the same max_sessions parameter used for the LLM "concurrent" scenario
(e.g., include max_sessions in the scenario instantiation/config or run call),
ensuring TTS concurrent runs respect the provided concurrency limit; apply the
same change to the other similar registration block referenced around lines
312-317.
In `@src/voicequant/benchmarks/scenarios/tts/concurrent.py`:
- Around line 67-104: The summary's selection of rows can include ladder samples
above the hardware cap because it only checks p95; update the rows filter
comprehension in the summary loop (the one that builds rows for each gpu and
model) to also enforce the hardware cap by requiring r["concurrency"] <=
r["max_concurrency"] (or equivalently r["max_concurrency"] >= r["concurrency"])
so best_n cannot exceed the computed GPU capacity; keep the rest of the logic
(p95 <= _LATENCY_BUDGET_MS and matching gpu/model) unchanged.
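A sketch of the corrected filter. The row keys (`gpu`, `model`, `p95`, `concurrency`, `max_concurrency`) come from the comment above; the budget value here is a placeholder, not the scenario's real constant.

```python
_LATENCY_BUDGET_MS = 250  # placeholder, not the scenario's real constant


def best_concurrency(results: list[dict], gpu: str, model: str) -> int:
    """Highest ladder step that meets the latency budget AND the hardware cap."""
    rows = [
        r for r in results
        if r["gpu"] == gpu
        and r["model"] == model
        and r["p95"] <= _LATENCY_BUDGET_MS
        and r["concurrency"] <= r["max_concurrency"]  # new: enforce the cap
    ]
    return max((r["concurrency"] for r in rows), default=0)
```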
In `@src/voicequant/benchmarks/scenarios/tts/speaker_cache_hit.py`:
- Around line 19-28: The _cache_hit_rate function incorrectly returns 0.0 for
unique_voices == 1; change it so the one-voice case returns a warm cache hit
rate (1.0) after the first miss. Update the function _cache_hit_rate to
explicitly handle unique_voices == 1 (return 1.0) and leave the existing logic
for other values (the branch for unique_voices <= cache_size and the cache_size
/ unique_voices fallback) unchanged, so benchmarks report a warm cache for
repeated single-voice synthesis.
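The fixed helper might look like this sketch; the two lower branches restate the existing logic described above and are assumptions rather than code copied from the source.

```python
def _cache_hit_rate(unique_voices: int, cache_size: int) -> float:
    if unique_voices <= 0:
        return 0.0
    if unique_voices == 1:
        return 1.0  # warm cache after the first miss
    if unique_voices <= cache_size:
        return 1.0  # every voice fits; only cold misses
    return cache_size / unique_voices  # uniform-access approximation
```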
In `@src/voicequant/cli.py`:
- Around line 124-128: The visualization modality defaults to "llm" when
--modality is omitted, causing TTS scenarios to produce LLM charts; update the
logic before calling generate_charts_by_modality in cli.py to infer modality
from the selected scenarios: set viz_mod to "all" if all_scenarios is true,
otherwise use modality if provided, else detect if any selected scenario string
starts with "tts_" and set viz_mod to "tts" (falling back to "llm" if no TTS
scenarios are present); reference the viz_mod variable and
generate_charts_by_modality function to implement this conditional inference.
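Sketched as a small helper; the parameter names are from the comment, while the function boundary is an assumption about how cli.py is organized.

```python
def infer_viz_modality(all_scenarios: bool, modality: str | None,
                       selected: list[str]) -> str:
    """Pick the chart modality when --modality is omitted."""
    if all_scenarios:
        return "all"
    if modality:
        return modality
    if any(s.startswith("tts_") for s in selected):
        return "tts"
    return "llm"  # no TTS scenarios selected
```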
In `@src/voicequant/core/tts/config.py`:
- Around line 18-19: TTSConfig currently accepts any integer for tq_bits which
can later break the Orpheus adapter; add Pydantic validation at the config
boundary (either by replacing the raw fields with a dedicated TurboQuantConfig
model or by adding a validator on TTSConfig) that enforces tq_bits is one of the
supported TurboQuant bit widths and raises a clear ValueError otherwise; keep
tq_enabled default behavior and update ServerConfig to carry the new
TurboQuantConfig if you introduce it.
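A minimal validator sketch, assuming Pydantic v2 and the (2, 3, 4) bit widths the author confirms in the review response below:

```python
from pydantic import BaseModel, field_validator

_SUPPORTED_TQ_BITS = {2, 3, 4}


class TTSConfig(BaseModel):
    tq_enabled: bool = False
    tq_bits: int = 4

    @field_validator("tq_bits")
    @classmethod
    def _check_tq_bits(cls, v: int) -> int:
        if v not in _SUPPORTED_TQ_BITS:
            raise ValueError(
                f"tq_bits must be one of {sorted(_SUPPORTED_TQ_BITS)}, got {v}"
            )
        return v
```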
In `@src/voicequant/core/tts/engine.py`:
- Around line 124-128: list_voices() currently returns KOKORO_VOICES when the
engine is configured for Orpheus but _orpheus hasn't been initialized, exposing
wrong voice IDs; change list_voices in the TTS engine to return Orpheus voice
identifiers whenever self._backend == "orpheus" regardless of self._orpheus
being None by delegating to the Orpheus voice registry or a static
ORPHEUS_VOICES mapping (or calling a class/static helper in the Orpheus loader)
so that cold engines report correct Orpheus voice IDs; reference the list_voices
method, the _backend and _orpheus attributes, and the load_model initialization
path to locate where to replace the fallback to KOKORO_VOICES.
In `@src/voicequant/core/tts/orpheus_adapter.py`:
- Around line 302-312: get_compression_stats() currently fabricates a 0.99
cosine similarity when self._last_cosine is unset; update the method
(get_compression_stats) to stop reporting a made-up value by either omitting the
"cosine_similarity" key or returning None for it when self._last_cosine is None,
i.e., use the real _last_cosine only when it's been computed and otherwise leave
that metric absent/None (also ensure any callers tolerate a missing/None
"cosine_similarity" instead of assuming a numeric value).
- Around line 155-164: In the token-generation loop in OrpheusAdapter (the block
that calls self._sample and currently yields token_id before checking
self._is_eos), change the flow so you call self._sample(), compute token_id,
then check self._is_eos(token_id) and break without yielding if it's EOS; only
yield non-EOS token_ids so decode_tokens_to_audio never receives a tokenizer
control token. Ensure you still update input_ids/position_ids and n_generated
appropriately for the next step when breaking (a sketch follows this file's comments).
- Around line 216-222: The Orpheus adapter's format branch currently only
handles wav/pcm (see float32_to_wav, float32_to_pcm and self.sample_rate) but
must accept mp3 and opus like the public TTSEngine/CLI; update the fmt handling
in the method in OrpheusAdapter to add branches for "mp3" and "opus" that encode
the float32 samples to those formats (e.g., call or add helper functions
encode_float32_to_mp3(samples, self.sample_rate) and
encode_float32_to_opus(samples, self.sample_rate) or invoke your project's
ffmpeg/pydub wrapper), returning audio_bytes for mp3/opus, and keep the existing
wav/pcm branches and the ValueError for unknown formats.
- Around line 113-130: The generate_speech_tokens method accepts a voice but
never applies it to the prompt; prepend the voice to the text (format "{voice}:
{text}" when voice is provided) before calling self._tokenizer so the model
receives voice-conditioned input. Update generate_speech_tokens to build a
prompt variable from the voice and text, then pass that prompt into
self._tokenizer(...).to(self.config.device) (leaving load_model and device
handling unchanged).
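The EOS fix from the token-generation comment above, reduced to a self-contained sketch; the `step` and `is_eos` callables stand in for `self._sample` and `self._is_eos`, and the tensor bookkeeping is elided.

```python
from typing import Callable, Iterator


def stream_speech_tokens(
    step: Callable[[], int],        # samples the next token id
    is_eos: Callable[[int], bool],  # EOS / control-token test
    max_new_tokens: int,
) -> Iterator[int]:
    """Yield only real speech tokens; consume EOS without yielding it."""
    for _ in range(max_new_tokens):
        token_id = step()
        if is_eos(token_id):
            break  # the audio decoder never receives the control token
        yield token_id
```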
In `@src/voicequant/core/tts/streaming.py`:
- Around line 186-215: The loop currently yields the last real chunk with
is_final=False and then _flush(True) emits an empty final chunk; fix by
preventing emission of that zero-length final chunk and ensuring the real last
chunk is marked final: (1) make _flush return None when there are no buffered
samples (do not create a zero-sample chunk), and (2) after assembling head/tail
in the sample_iter loop (function using buffer, buffered, _make_chunk and
first_emitted), detect when tail.size == 0 and no more input will arrive and set
the chunk's is_final=True before yielding (or rely on _flush not emitting and
only call _flush when buffer has data). Update references: modify _flush, the
sample_iter consumption loop that builds chunk via self._make_chunk, and the
final = _flush(True) call so no empty final chunk is emitted (see the sketch after this file's comments).
- Around line 47-53: The _encode_chunk function currently allows "wav" which
calls float32_to_wav and emits a full WAV header per chunk, producing an invalid
concatenated stream; change the logic in _encode_chunk to disallow or map "wav"
to PCM for streaming: when output_format (or fmt) equals "wav" either raise a
ValueError indicating WAV is not supported for per-chunk streaming or treat it
as "pcm" by calling float32_to_pcm(samples, sample_rate) instead; update
references to _encode_chunk, float32_to_wav, float32_to_pcm and output_format
handling so streaming only emits raw PCM chunks (or explicitly rejects "wav")
consistent with the server/routes/tts streaming behavior.
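Both streaming.py fixes in one sketch: "wav" downgraded to raw PCM (no per-chunk RIFF headers), and a one-chunk lookahead so the real last chunk carries `is_final=True` with no zero-length terminator. Helper names and the dict-based chunk shape are assumptions.

```python
from typing import Iterator

import numpy as np


def encode_chunk(samples: np.ndarray, fmt: str) -> bytes:
    """Encode one chunk; 'wav' is treated as 'pcm' so streams concatenate cleanly."""
    if fmt in ("pcm", "wav"):
        return (np.clip(samples, -1.0, 1.0) * 32767).astype("<i2").tobytes()
    raise ValueError(f"unsupported streaming format: {fmt}")


def finalize_last(chunks: Iterator[dict]) -> Iterator[dict]:
    """One-chunk lookahead: mark the real last chunk final, never emit an empty one."""
    prev = None
    for chunk in chunks:
        if prev is not None:
            yield prev
        prev = chunk
    if prev is not None:
        prev["is_final"] = True
        yield prev
```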
In `@src/voicequant/server/routes/tts.py`:
- Around line 69-78: The _streaming_format helper currently returns the input
lowercased format (so "wav" is returned) which contradicts the comment about
downgrading WAV for streaming; update the function _streaming_format(fmt: str)
so that any "wav" input is converted to "pcm" (i.e. return "pcm" when low ==
"wav"), preserve "pcm" for that case, and keep the default fallback as "pcm" for
unknown/empty formats to ensure streaming always uses raw PCM.
- Around line 117-136: The import of StreamingSynthesizer/TTSStreamingConfig and
the creation/first iteration of the generator must be moved inside the try so
ImportError and generator-start exceptions are caught; specifically, inside the
try block import StreamingSynthesizer and TTSStreamingConfig, instantiate
StreamingSynthesizer(engine, TTSStreamingConfig(...)), call
synth.stream(request.input, voice=request.voice) to get chunk_iter and perform
the first next(chunk_iter) (or otherwise trigger the generator start) so errors
surface there, then return a StreamingResponse that yields the already-read
first chunk followed by the remaining chunk_iter; ensure the except ImportError
and generic except blocks remain the same to map to 501/500.
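Putting the two route fixes together as a sketch: the import and the first `next()` sit inside the `try`, and the response body is always raw PCM. The FastAPI wiring, the `TTSStreamingConfig` field, and the `audio_bytes` attribute are assumptions about the surrounding code.

```python
from fastapi import HTTPException
from fastapi.responses import StreamingResponse


def speech_stream(request, engine):
    try:
        # Import inside the try so a missing extra maps to 501, not a crash.
        from voicequant.core.tts.streaming import (
            StreamingSynthesizer,
            TTSStreamingConfig,
        )

        synth = StreamingSynthesizer(engine, TTSStreamingConfig(output_format="pcm"))
        chunk_iter = synth.stream(request.input, voice=request.voice)
        first = next(chunk_iter)  # start the generator so setup errors surface here
    except ImportError as exc:
        raise HTTPException(status_code=501, detail=str(exc)) from exc
    except Exception as exc:
        raise HTTPException(status_code=500, detail=str(exc)) from exc

    def body():
        yield first.audio_bytes
        for chunk in chunk_iter:
            yield chunk.audio_bytes

    return StreamingResponse(body(), media_type="application/octet-stream")
```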
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1edb011e-763a-4e6a-be0a-9626e790ea82
📒 Files selected for processing (23)

- pyproject.toml
- src/voicequant/benchmarks/prompts/tts/test_sentences.json
- src/voicequant/benchmarks/report.py
- src/voicequant/benchmarks/runner.py
- src/voicequant/benchmarks/scenarios/tts/__init__.py
- src/voicequant/benchmarks/scenarios/tts/concurrent.py
- src/voicequant/benchmarks/scenarios/tts/mos_quality.py
- src/voicequant/benchmarks/scenarios/tts/speaker_cache_hit.py
- src/voicequant/benchmarks/scenarios/tts/streaming_jitter.py
- src/voicequant/benchmarks/scenarios/tts/ttfa.py
- src/voicequant/benchmarks/visualize.py
- src/voicequant/cli.py
- src/voicequant/core/tts/config.py
- src/voicequant/core/tts/engine.py
- src/voicequant/core/tts/orpheus_adapter.py
- src/voicequant/core/tts/streaming.py
- src/voicequant/server/routes/tts.py
- src/voicequant/server/routes/tts_stub.py
- tests/benchmarks/test_tts_benchmarks.py
- tests/core/tts/test_multi_backend.py
- tests/core/tts/test_orpheus_adapter.py
- tests/core/tts/test_streaming.py
- tests/server/test_tts_streaming_route.py
Review response for M5. Verified each finding against the code and applied fixes where the issue was real.

- report.py: `_generate_tts_section` now renders jitter and speaker-cache subsections so those scenarios render their own tables when run alone.
- runner.py: forward `max_sessions` to tts_concurrent like the LLM concurrent scenario.
- tts/concurrent.py: summary rows now require concurrency <= max_concurrency so best_n cannot exceed the hardware cap when a user passes a very large --max-sessions.
- tts/speaker_cache_hit.py: single-voice workloads return hit_rate=1.0 (warm cache after the first miss) instead of the degenerate 0.0.
- cli.py visualize: infer modality from scenario names when --modality is omitted; the "orpheus" shorthand on `tts speak` expands to the canonical HF id before TTSConfig is created.
- tts/config.py: Pydantic validator rejects tq_bits outside the supported (2, 3, 4) set at the boundary.
- tts/engine.py: list_voices() returns ORPHEUS_VOICES for cold orpheus engines; shutdown() now releases the Orpheus adapter (calls shutdown/close if available, clears _orpheus).
- tts/orpheus_adapter.py: ORPHEUS_VOICES exposed at module scope; get_compression_stats no longer fabricates a 0.99 cosine similarity when none was measured (returns None until real data); EOS tokens are consumed but not yielded so the decoder never receives control tokens; voice is prepended to the prompt when provided; synthesize now accepts mp3/opus via the existing wav_to_mp3/wav_to_opus path.
- tts/streaming.py: _encode_chunk downgrades "wav" to PCM so concatenated streams aren't corrupted by per-chunk WAV headers; the iterable path uses one-chunk lookahead so the real last chunk is marked is_final=True and no zero-length terminator is emitted.
- server/routes/tts.py: _streaming_format always returns "pcm" (the previous wav passthrough contradicted the docstring); the streaming endpoint imports StreamingSynthesizer inside the try and pulls the first chunk eagerly so ImportError/generator-start errors map to 501/500 cleanly.

All 291 tests still pass.
The inline list of model config dicts mixes str and int values, so mypy inferred the value type as `object` and flagged every value access (`m['name']`, `m['per_session_mb']`) in ConcurrentTTSScenario. Adding the explicit `list[dict[str, Any]]` annotation keeps lookups as `Any` and clears the 5 errors without reshaping the data.
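For illustration, a minimal reproduction of the pattern; the entries and numbers are placeholders, not the scenario's real model table.

```python
from typing import Any

# Without the annotation, mypy infers list[dict[str, object]] and flags
# the int/str lookups below; the explicit type keeps values as Any.
_MODELS: list[dict[str, Any]] = [
    {"name": "orpheus-fp16", "per_session_mb": 1200},  # placeholder values
    {"name": "kokoro", "per_session_mb": 350},
]

for m in _MODELS:
    print(m["name"], m["per_session_mb"])
```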
Summary by CodeRabbit

New Features

- `--modality` option for benchmark and visualization filtering (llm, tts, all).

Configuration

- New TTS configuration fields (`tq-bits`, `tq-enabled`).