
ASR: fix Parakeet TDT v3 emitting Cyrillic for short Latin-script utterances (#512)#515

Open
Alex-Wengg wants to merge 34 commits into main from feat/script-filtering-issue-512

Conversation

@Alex-Wengg
Member

@Alex-Wengg Alex-Wengg commented Apr 12, 2026

Fixes #512.

TL;DR

Parakeet TDT v3 transcribed short Polish utterances like "Wpisz Google kropka com" as Cyrillic (Впиш Гугл к ком.) because the joint decoder's top-1 pick drifts to Cyrillic tokens under low acoustic confidence. This PR adds an opt-in script filter: when a caller passes language: .polish (or any other language with a declared script), the decoder rejects top-1 if it's the wrong script and walks top-K to the highest-probability candidate matching the expected script.

  • Opt-in: language: defaults to nil — zero behavior change for existing callers.
  • No acoustic-model changes — this is purely a decoder-side post-processing step over the joint logits.
  • Requires JointDecisionv3.mlmodelc (exposes top-K outputs). Auto-downloaded from HuggingFace alongside the other v3 files; falls back to standard argmax when absent.

Empirical validation — reporter's own audio

Samples pulled via gdown --folder <link-from-issue-#512-comment> from @tajchert's Drive folder. JointDecisionv3.mlmodelc is loaded in both columns — this isolates the Swift filter as the mechanism, not a model swap.

| sample | ground truth | language: nil (current) | language: .polish (this PR) |
| --- | --- | --- | --- |
| pl | Wpisz Google kropka com | Впиш Гугл к ком. | Wpis Google.com. |
| pl2 | Wpisz Google kropka com | Впиш Гугл крокаком. | Wpish Google, Com. |
| pl3 | Wpisz Google kropka com | Впишь куглькрабком. | VP Kugl.com. |
| pl4 | Wpisz Google kropka com | Впиш гугл к ком. | Wpish gugl c. |
| pl5 | Wpisz Google kropka com | Впиш гугл кракаком. | Wpish Google Croca kom. |
| pl6 | Wpisz Google kropka com | Впиш, гугл крокаком. | Wpish, Google, Com. |
| pl_complex | Cały spichlarz jest ze spiżu | Cały spichlarz jest ze spiżu. | Cały spichlarz jest ze spiżu. |

6/6 short samples flip Cyrillic → Latin. pl_complex was never broken (long context → high joint confidence → no drift) and is unchanged.

Scope & limitations (important — please don't overclaim)

This PR fixes the script the tokens are drawn from. It does NOT fix per-word acoustic accuracy.

| | language: nil | language: .polish |
| --- | --- | --- |
| Script correct (Latin, not Cyrillic) | ✗ | ✓ (6/6) |
| Word spelling matches ground truth | ✗ | ✗ (still 6/7 wrong on short) |

The residual errors — Wpisz → Wpish/Wpis, kropka → Croca / dropped — are Parakeet TDT v3 acoustic weaknesses on short Polish commands. No amount of output post-processing can turn Wpish into Wpisz; that needs better acoustic modeling, a Polish LM rescorer, or more training data. Out of scope here.

What users actually get by merging:

  • Output is visually Polish (Latin script), not pseudo-Russian — works with locale-aware post-processing, spell-check, and UI rendering
  • Locale-strict WER evaluators no longer penalize Cyrillic-vs-Latin substitution
  • Opt-in; zero risk for callers who don't pass language:

What users do not get:

  • Higher word accuracy on short Polish/Slavic Latin utterances
  • Support for languages outside the Language enum (Greek, Maltese, Hungarian, Turkish, Baltic — their characters fit the Latin Unicode ranges but aren't exposed; easy follow-up)
  • A meaningful FLEURS WER delta — see Documentation/fleurs-script-filtering-comparison.md; full sentences aren't in the failure regime

Implementation

New

  • Sources/FluidAudio/Shared/ScriptDetection.swift (new, +112)
    • public enum Language — 13 Latin (en, es, fr, de, it, pt, ro, pl, cs, sk, sl, hr, bs) + 5 Cyrillic (ru, uk, be, bg, sr)
    • public enum Script { case latin, cyrillic }
    • matches(_:script:) over Unicode ranges: ASCII (0x20–0x7F), Latin-1 (0xA0–0xFF), Latin Extended-A (0x100–0x17F), Latin Extended-B (0x180–0x24F — Romanian ș/ț), Latin Extended Additional (0x1E00–0x1EFF — Vietnamese), Cyrillic (0x400–0x4FF). Strips SentencePiece boundary marker U+2581 before checking.
    • filterTopK(topKIds:topKLogits:vocabulary:preferredScript:) -> (tokenId, probability)? — returns the highest-probability top-K candidate matching the target script; probability via softmax over the top-K subset with the max-logit stability trick; guarded against top-K array length mismatch.
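Conceptually, the filter is a two-step check: a Unicode-range script test on each candidate token, then an argmax over the in-script subset of the top-K, with a stability-corrected softmax for the returned probability. A minimal sketch — the names mirror the PR's `Script` / `matches` / `filterTopK`, but the Cyrillic punctuation allowance and other details are simplified assumptions, not the shipped code:

```swift
import Foundation

enum Script { case latin, cyrillic }

func matches(_ token: String, script: Script) -> Bool {
    // Strip the SentencePiece boundary marker U+2581 before checking.
    let stripped = token.unicodeScalars.filter { $0.value != 0x2581 }
    guard !stripped.isEmpty else { return true }  // boundary-only tokens are script-neutral
    return stripped.allSatisfy { scalar in
        let v = scalar.value
        switch script {
        case .latin:
            // ASCII, Latin-1, Latin Extended-A/B, Latin Extended Additional
            return (0x20...0x7F).contains(v) || (0xA0...0xFF).contains(v)
                || (0x100...0x24F).contains(v) || (0x1E00...0x1EFF).contains(v)
        case .cyrillic:
            // Cyrillic block, plus punctuation/digits (assumed allowance)
            return (0x400...0x4FF).contains(v) || (0x20...0x40).contains(v)
        }
    }
}

func filterTopK(topKIds: [Int], topKLogits: [Float],
                vocabulary: [Int: String],
                preferredScript: Script) -> (tokenId: Int, probability: Float)? {
    let n = min(topKIds.count, topKLogits.count)  // length-mismatch guard
    guard n > 0 else { return nil }
    var bestIdx = -1
    for i in 0..<n where matches(vocabulary[topKIds[i]] ?? "", script: preferredScript) {
        if bestIdx < 0 || topKLogits[i] > topKLogits[bestIdx] { bestIdx = i }
    }
    guard bestIdx >= 0 else { return nil }
    // Softmax over the top-K subset with the max-logit stability trick.
    let maxLogit = topKLogits[0..<n].max()!
    let exps = topKLogits[0..<n].map { exp($0 - maxLogit) }
    return (topKIds[bestIdx], exps[bestIdx] / exps.reduce(0, +))
}
```

The `bestIdx < 0 ||` form means the first in-script candidate always wins over the "no candidate yet" state, even at a logit of negative infinity.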

Changed

  • TdtJointDecision — optional topKIds / topKLogits fields (populated by JointDecisionv3 only)
  • TdtDecoderV3 — script filter runs only when top-1 is already wrong script; both decode sites feed filtered.probability (a real [0,1]) into TdtDurationMapping.clampProbability, not raw logits
  • AsrManager.transcribe(...) — language: Language? = nil plumbed through all three overloads: [Float], URL, AVAudioPCMBuffer
  • AsrModels + ModelNames — requiredModelsV3 set includes JointDecisionv3.mlmodelc so the download utility fetches it on fresh installs and also backfills it for existing users on next .v3 load
  • CLI — fluidaudiocli transcribe <file> --language {en|pl|cs|sk|sl|hr|bs|ro|es|fr|de|it|pt|ru|uk|be|bg|sr}

How to try it

swift run -c release fluidaudiocli transcribe sample.wav --language pl

Model dependency

JointDecisionv3.mlmodelc must be present in FluidInference/parakeet-tdt-0.6b-v3-coreml on HuggingFace. It exposes top_k_ids / top_k_logits outputs (K=64 in our export) alongside the standard argmax. When absent, AsrModels falls back to JointDecision.mlmodelc and the script filter becomes a no-op — backward compatible.
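The fallback described above reduces to a path check. A hedged sketch — the real logic lives in `AsrModels` and isn't shown in this description, so treat the function name and shape as illustrative assumptions:

```swift
import Foundation

// Prefer the top-K-capable v3 joint; fall back to the argmax-only legacy joint.
// When the fallback is taken, the script filter becomes a no-op downstream.
func jointModelURL(in modelDir: URL) -> URL {
    let v3 = modelDir.appendingPathComponent("JointDecisionv3.mlmodelc")
    if FileManager.default.fileExists(atPath: v3.path) { return v3 }
    return modelDir.appendingPathComponent("JointDecision.mlmodelc")
}
```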

Cache-upgrade verified: removed JointDecisionv3.mlmodelc from a populated cache, re-ran --language pl; the file was auto-fetched and Polish output was Latin. Existing users pick up the fix on next .v3 load without manual intervention.

Review notes / risky bits

  • Softmax over top-K subset, not the full vocab — probabilities won't exactly match a true full-softmax, but K=64 captures ~all the mass when the model is anywhere near confident. If you prefer, we can expose the raw top-K logits to callers and let them compute confidence however they want.
  • Top-1 escape hatch: filter is only triggered when top-1 fails matches(_, script:). When top-1 is already correct, nothing is changed — so we can't regress the common case.
  • Length-mismatch guard in filterTopK uses min(topKIds.count, topKLogits.count). If CoreML output arrays ever diverge, we iterate the common prefix instead of crashing.
  • Latin Extended-B (0x0180–0x024F) was added specifically so Romanian ș/ț aren't rejected as non-Latin. Latin Extended Additional (0x1E00–0x1EFF) was added for free — helps Vietnamese should anyone want it later.
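To see why the top-K-subset softmax is a reasonable stand-in for a full-vocab softmax, here is a small numeric illustration with made-up logits: a long tail of low-logit tokens barely moves the top candidate's probability.

```swift
import Foundation

func softmax(_ logits: [Float]) -> [Float] {
    let m = logits.max()!                 // max-logit stability trick
    let exps = logits.map { exp($0 - m) }
    let sum = exps.reduce(0, +)
    return exps.map { $0 / sum }
}

let topK: [Float] = [5.0, 3.0, 1.0]       // hypothetical top-3 logits
// Simulate the rest of the vocabulary as a tail of 100 low-logit tokens.
let fullVocab = topK + [Float](repeating: -4.0, count: 100)

let pTopK = softmax(topK)[0]              // subset estimate
let pFull = softmax(fullVocab)[0]         // "true" full-vocab value
```

The subset estimate slightly overstates the probability (it ignores the tail mass), but the gap shrinks as the model gets more confident in the top candidates.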

Tests

  • ScriptDetectionTests — 37 tests: Unicode range coverage (Latin-1 / Extended-A / Extended-B / Extended Additional / Cyrillic), SentencePiece boundary-marker stripping, filterTopK happy path, length-mismatch guard, probability-range invariant, Czech/Slovak/Slovenian/Croatian/Romanian token coverage, cross-script rejection
  • Build clean; swift format lint clean on all touched files
  • A/B end-to-end run against reporter's actual Polish audio (table above)

Follow-ups (not blocking)

  • Expose more Latin languages in the enum (Hungarian, Turkish, Baltic, Maltese) — all character ranges already supported, just need enum cases
  • Add Script.greek for el_gr (separate Unicode range)
  • Short-utterance benchmark dataset (FLEURS is the wrong tool — it's all long sentences where drift doesn't happen)
  • Optional: publish a Polish LM rescorer to address the underlying acoustic-accuracy issue the script filter cannot fix

@github-actions

github-actions bot commented Apr 12, 2026

PocketTTS Smoke Test ✅

Check Result
Build
Model download
Model load
Synthesis pipeline
Output WAV ✅ (191.3 KB)

Runtime: 0m23s

Note: PocketTTS uses CoreML MLState (macOS 15) KV cache + Mimi streaming state. CI VM lacks physical GPU — audio quality and performance may differ from Apple Silicon.


@github-actions

github-actions bot commented Apr 12, 2026

Kokoro TTS Smoke Test ✅

Check Result
Build
Model download
Model load
Synthesis pipeline
Output WAV ✅ (634.8 KB)

Runtime: 0m51s

Note: Kokoro TTS uses CoreML flow matching + Vocos vocoder. CI VM lacks physical ANE — performance may differ from Apple Silicon.

@github-actions

github-actions bot commented Apr 12, 2026

Parakeet EOU Benchmark Results ✅

Status: Benchmark passed
Chunk Size: 320ms
Files Tested: 100/100

Performance Metrics

Metric Value Description
WER (Avg) 7.03% Average Word Error Rate
WER (Med) 4.17% Median Word Error Rate
RTFx 12.34x Real-time factor (higher = faster)
Total Audio 470.6s Total audio duration processed
Total Time 38.7s Total processing time

Streaming Metrics

Metric Value Description
Avg Chunk Time 0.039s Average chunk processing time
Max Chunk Time 0.077s Maximum chunk processing time
EOU Detections 0 Total End-of-Utterance detections

Test runtime: 0m44s • 04/21/2026, 01:10 PM EST

RTFx = Real-Time Factor (higher is better) • Processing includes: Model inference, audio preprocessing, state management, and file I/O

@github-actions

github-actions bot commented Apr 12, 2026

Qwen3-ASR int8 Smoke Test ✅

Check Result
Build
Model download
Model load
Transcription pipeline
Decoder size 571 MB (vs 1.1 GB f32)

Performance Metrics

Metric CI Value Expected on Apple Silicon
Median RTFx 0.06x ~2.5x
Overall RTFx 0.06x ~2.5x

Runtime: 3m27s

Note: CI VM lacks physical GPU — CoreML MLState (macOS 15) KV cache produces degraded results on virtualized runners. On Apple Silicon: ~1.3% WER / 2.5x RTFx.

@github-actions

github-actions bot commented Apr 12, 2026

VAD Benchmark Results

Performance Comparison

Dataset Accuracy Precision Recall F1-Score RTFx Files
MUSAN 92.0% 86.2% 100.0% 92.6% 596.8x faster 50
VOiCES 92.0% 86.2% 100.0% 92.6% 542.0x faster 50

Dataset Details

  • MUSAN: Music, Speech, and Noise dataset - standard VAD evaluation
  • VOiCES: Voices Obscured in Complex Environmental Settings - tests robustness in real-world conditions

✅: Average F1-Score above 70%

@github-actions

github-actions bot commented Apr 12, 2026

ASR Benchmark Results ✅

Status: All benchmarks passed

Parakeet v3 (multilingual)

Dataset WER Avg WER Med RTFx Status
test-clean 0.57% 0.00% 4.92x
test-other 1.80% 0.00% 2.67x

Parakeet v2 (English-optimized)

Dataset WER Avg WER Med RTFx Status
test-clean 0.80% 0.00% 5.49x
test-other 1.00% 0.00% 3.43x

Streaming (v3)

Metric Value Description
WER 0.00% Word Error Rate in streaming mode
RTFx 0.64x Streaming real-time factor
Avg Chunk Time 1.529s Average time to process each chunk
Max Chunk Time 2.317s Maximum chunk processing time
First Token 1.862s Latency to first transcription token
Total Chunks 31 Number of chunks processed

Streaming (v2)

Metric Value Description
WER 0.00% Word Error Rate in streaming mode
RTFx 0.64x Streaming real-time factor
Avg Chunk Time 1.418s Average time to process each chunk
Max Chunk Time 1.734s Maximum chunk processing time
First Token 1.440s Latency to first transcription token
Total Chunks 31 Number of chunks processed

Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming

25 files per dataset • Test runtime: 5m56s • 04/21/2026, 01:09 PM EST

RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Processing time includes: Model inference on Apple Neural Engine, audio preprocessing, state resets between files, token-to-text conversion, and file I/O
Example: RTFx of 2.0x means 10 seconds of audio processed in 5 seconds (2x faster than real-time)

Expected RTFx Performance on Physical M1 Hardware:

• M1 Mac: ~28x (clean), ~25x (other)
• CI shows ~0.5-3x due to virtualization limitations

Testing methodology follows HuggingFace Open ASR Leaderboard

@github-actions

✅ Japanese ASR Benchmark Results (CTC)

Status: Passed

Metric Value
CER 9.94%
Samples 50
Avg RTFx 2.5x
Decoder CTC

✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly.

View benchmark log

@github-actions

github-actions bot commented Apr 12, 2026

Speaker Diarization Benchmark Results

Speaker Diarization Performance

Evaluating "who spoke when" detection accuracy

Metric Value Target Status Description
DER 15.1% <30% Diarization Error Rate (lower is better)
JER 24.9% <25% Jaccard Error Rate
RTFx 23.24x >1.0x Real-Time Factor (higher is faster)

Diarization Pipeline Timing Breakdown

Time spent in each stage of speaker diarization

Stage Time (s) % Description
Model Download 9.871 21.9 Fetching diarization models
Model Compile 4.230 9.4 CoreML compilation
Audio Load 0.089 0.2 Loading audio file
Segmentation 13.542 30.0 Detecting speech regions
Embedding 22.571 50.0 Extracting speaker voices
Clustering 9.028 20.0 Grouping same speakers
Total 45.160 100 Full pipeline

Speaker Diarization Research Comparison

Research baselines typically achieve 18-30% DER on standard datasets

Method DER Notes
FluidAudio 15.1% On-device CoreML
Research baseline 18-30% Standard dataset performance

Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:

  • M2 MacBook Air (2022): Runs at 150 RTFx real-time
  • Performance scales with Apple Neural Engine capabilities

🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 45.1s diarization time • Test runtime: 2m 7s • 04/21/2026, 01:10 PM EST

@github-actions

github-actions bot commented Apr 12, 2026

Sortformer High-Latency Benchmark Results

ES2004a Performance (30.4s latency config)

Metric Value Target Status
DER 33.4% <35%
Miss Rate 24.4% - -
False Alarm 0.2% - -
Speaker Error 8.8% - -
RTFx 8.1x >1.0x
Speakers 4/4 - -

Sortformer High-Latency • ES2004a • Runtime: 3m 12s • 2026-04-21T17:09:44.168Z

@github-actions

github-actions bot commented Apr 12, 2026

Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode)

Optimal clustering with Hungarian algorithm for maximum accuracy

Metric Value Target Status Description
DER 10.4% <20% Diarization Error Rate (lower is better)
RTFx 9.94x >1.0x Real-Time Factor (higher is faster)

Offline VBx Pipeline Timing Breakdown

Time spent in each stage of batch diarization

Stage Time (s) % Description
Model Download 12.426 11.8 Fetching diarization models
Model Compile 5.325 5.0 CoreML compilation
Audio Load 0.057 0.1 Loading audio file
Segmentation 21.676 20.5 VAD + speech detection
Embedding 105.243 99.7 Speaker embedding extraction
Clustering (VBx) 0.115 0.1 Hungarian algorithm + VBx clustering
Total 105.576 100 Full VBx pipeline

Speaker Diarization Research Comparison

Offline VBx achieves competitive accuracy with batch processing

Method DER Mode Description
FluidAudio (Offline) 10.4% VBx Batch On-device CoreML with optimal clustering
FluidAudio (Streaming) 17.7% Chunk-based First-occurrence speaker mapping
Research baseline 18-30% Various Standard dataset performance

Pipeline Details:

  • Mode: Offline VBx with Hungarian algorithm for optimal speaker-to-cluster assignment
  • Segmentation: VAD-based voice activity detection
  • Embeddings: WeSpeaker-compatible speaker embeddings
  • Clustering: PowerSet with VBx refinement
  • Accuracy: Higher than streaming due to optimal post-hoc mapping

🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 127.0s processing • Test runtime: 2m 12s • 04/21/2026, 01:16 PM EST

Alex-Wengg added a commit that referenced this pull request Apr 12, 2026
Complete baseline benchmark results for 24 languages (2,400 samples total):
- Establishes baseline WER/CER before script filtering implementation
- Polish: 8.98% WER (target for issue #512 improvement)
- All languages maintain real-time performance (avg 62.6x RTFx)
- Best: Italian 3.46% WER, Worst: Greek 38.91% WER

Related to issue #512 (Polish Cyrillic confusion) and PR #515 (script filtering).
Next step: Re-run on feat/script-filtering-issue-512 branch to measure improvement.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Alex-Wengg added a commit that referenced this pull request Apr 12, 2026
**Issue 1: Language parameter silently dropped for long audio (CRITICAL)**
- Thread language parameter through ChunkProcessor.process() and transcribeChunk()
- Script filtering now works correctly for audio >15 seconds
- Before: ChunkProcessor ignored language, disabling filtering for real-world recordings
- After: Language parameter flows through full chunked transcription pipeline

**Issue 2: SentencePiece word boundary marker not handled (CRITICAL)**
- Strip ▁ (U+2581 LOWER ONE EIGHTH BLOCK) before script detection
- This character prefixes most vocabulary tokens but doesn't indicate script
- Before: allSatisfy() check failed because ▁ outside all Unicode ranges
- After: Strip marker first, then check actual content

**Issue 3: Token confidence not updated after filtering (MEDIUM)**
- Update `score` variable with filtered token's logit in both main loop and inner loop
- Before: Stale probability from original top-1 token persisted through results
- After: Confidence reflects actual selected token after script filtering

**Issue 4: Missing unit tests (HIGH)**
- Add comprehensive ScriptDetectionTests with 28 tests covering:
  - Script property tests for Language enum
  - Basic script matching (Latin, Cyrillic, mixed scripts)
  - SentencePiece boundary marker handling
  - Polish language support (issue #512 specific tests)
  - Punctuation and whitespace handling
  - filterTopK() functionality and edge cases
  - Unicode range validation
- All tests pass

**Additional improvements:**
- Improved Cyrillic script detection to reject Latin letters while allowing
  punctuation, spaces, and digits (prevents "hello" matching Cyrillic)
- Fixed existing TdtRefactoredComponentsTests to use new TdtJointDecision signature

Fixes identified by Devin AI in PR review #4094445719.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@github-actions

✅ Japanese ASR Benchmark Results (CTC)

Status: Passed

Metric Value
CER 9.94%
Samples 50
Avg RTFx 2.2x
Decoder CTC

✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly.

View benchmark log

@github-actions

✅ Japanese ASR Benchmark Results (CTC)

Status: Passed

Metric Value
CER 9.94%
Samples 50
Avg RTFx 2.4x
Decoder CTC

✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly.

View benchmark log

@Alex-Wengg Alex-Wengg force-pushed the feat/script-filtering-issue-512 branch 2 times, most recently from 14d1926 to bbf98df on April 12, 2026 at 03:27
@github-actions

✅ Japanese ASR Benchmark Results (CTC)

Status: Passed

Metric Value
CER 9.94%
Samples 50
Avg RTFx 2.2x
Decoder CTC

✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly.

View benchmark log

@github-actions

✅ Japanese ASR Benchmark Results (CTC)

Status: Passed

Metric Value
CER 9.94%
Samples 50
Avg RTFx 2.5x
Decoder CTC

✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly.

View benchmark log

@github-actions

✅ Japanese ASR Benchmark Results (CTC)

Status: Passed

Metric Value
CER 9.94%
Samples 50
Avg RTFx 2.6x
Decoder CTC

✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly.

View benchmark log



@Alex-Wengg Alex-Wengg force-pushed the feat/script-filtering-issue-512 branch from 923412f to 7721530 on April 21, 2026 at 00:33
@github-actions

✅ Japanese ASR Benchmark Results (CTC)

Status: Passed

Metric Value
CER 9.07%
Samples 50
Avg RTFx 3.0x
Decoder CTC

✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly.

View benchmark log


@Alex-Wengg Alex-Wengg changed the title Add script filtering for Cyrillic/Latin disambiguation (fixes #512) ASR: fix Parakeet TDT v3 emitting Cyrillic for short Latin-script utterances (#512) Apr 21, 2026

…ernal

`matches` and `filterTopK` only make sense when you have raw top-K ids/logits
plus a CoreML vocab dictionary — i.e. inside the TDT decoder path. No external
consumer has a realistic use case, so drop them from the public surface.
`Language` and `Script` stay public since they're on the transcribe APIs.

Also hoist the SentencePiece word-boundary marker (U+2581) to a named
constant, `sentencePieceBoundary`, instead of leaving the scalar literal
inline — reads cleaner and is easier to find if another path ever needs
the same stripping.
Trim to a minimal 24-language FLEURS regression runner:
- Drop parallel LANG_NAMES array and stale WER-tier comments.
- Drop caffeinate (only inhibits sleep under AC power anyway).
- Drop verify_models step; CLI auto-downloads on first use.
- Soft-fail per language with `|| log "WARN"` so one bad language
  doesn't abort the suite.
- Use printf (not echo) so tabs render correctly on macOS.
- Write machine-readable summary CSV alongside the per-lang JSONs.
- Preflight python3 check and surface swift-build failures.
- SAMPLES env override for quick smoke tests.
Address code-review findings on TdtDecoderV3:

- #10 (correctness): final-chunk flushing loop now runs
  `applyScriptFilter` on each emitted token. Previously the last few
  tokens of any utterance bypassed script filtering, which could leak
  wrong-script candidates in multilingual runs where the tail fell in
  this path.
- #28: update the now-stale "No filtering at decoder level" comment to
  reflect that script filtering happens per-step above.
- #9: remove dead `iterCount` / `innerLoopCount` counters.
- #15: delete unused `MLMultiArray.l2Normf()` helper (kept
  `shapeString`, which is used by TdtModelInference for error output).
Address review feedback on `TdtJointDecision`:

- Clarify the scale mismatch between `probability` (full-vocab softmax)
  and `topKLogits` (raw pre-softmax logits). Consumers that want a
  comparable probability should go through
  `TokenLanguageFilter.filterTopK`, which returns the top-K softmax.
- Add an `assert(topKIds?.count == topKLogits?.count)` in the init to
  catch schema drift if a future model ever returns mismatched arrays.
- Add `: Sendable` conformance. All stored properties are value types,
  so conformance synthesizes for free and matches the surrounding
  `TdtDecoderV3: Sendable` posture.
- Sharpen the rationale comment on why the top-K fields aren't given
  stored-property defaults: Swift excludes stored `let` properties with
  default values from the synthesized memberwise initializer because
  `let` + default is a compile-time-initialized constant, not a
  parameter.
Address review feedback on the joint top-K path:

- Thread `needsTopK: Bool` through `runJointPrepared`. TdtDecoderV3
  computes it once as `language != nil`. When no language is provided
  (the common path), v3 joint runs no longer allocate K-length Swift
  arrays on every decoded step just to drop them.
- Factor the per-array extraction into `extractInt32Array` /
  `extractFloat32Array` helpers. Both validate the CoreML dtype before
  the `bindMemory` cast, so an export that switches to Int64 / Float16
  fails loudly instead of silently reinterpreting bit patterns.
- Enforce top-K outputs as a present-or-absent pair with matching
  lengths. Catches export-schema drift before TokenLanguageFilter has
  to defend against it downstream.
Address code-review feedback on `AsrManager.tdtDecodeWithTimings` and
the public `transcribe(...)` overloads:

- Split the `.v3, .tdtJa` switch arm. Route `.tdtJa` independently and
  drop the `language` / `vocabulary` forwarding: the Japanese model
  emits Kanji / Hiragana / Katakana tokens, none of which are covered
  by the current Latin/Cyrillic `TokenLanguageFilter`. Propagating a
  hint there would either no-op or — worse — silently filter out
  valid Japanese output. Log at debug when a caller-supplied hint is
  dropped.
- Log at debug on the `.v2, .tdtCtc110m` path when `language` is
  non-nil. Previously the hint was swallowed silently.
- Drop the `language != nil ? vocabulary : nil` conditional on the
  `.v3` path. `TdtDecoderV3.applyScriptFilter` already short-circuits
  when `language` is nil; forwarding the vocab unconditionally is
  clearer and has no runtime cost.
- Document the `language:` parameter on the `URL`,
  `transcribeDiskBacked`, and `[Float]` public overloads, and note the
  silent-ignore behavior on the already-documented buffer overload.
…back

v3 now loads `JointDecisionv3.mlmodelc` exclusively. The opportunistic
try/fallback to `JointDecision.mlmodelc` was transitional scaffolding
for the HF upload period and is no longer needed now that v3 joint is
stable on the remote.

- ModelNames: add `requiredModelsV3` using `jointV3File`; keep
  `requiredModels` for v2/legacy; split `.parakeet` vs `.parakeetV2` in
  `getRequiredModelNames`.
- AsrModels: `getModelFileNames(.v3)` returns `jointV3File`;
  `getRequiredModels(.v3)` returns `requiredModelsV3`; joint-load path
  becomes a single unconditional download (dies loud if missing).
- AsrManager: drop "requires JointDecisionv3.mlmodelc" hedge in
  `language:` doc comments — always present for v3.

Existing v3 users with only the legacy `JointDecision.mlmodelc` cached
will fetch `JointDecisionv3.mlmodelc` on next launch (single ~50MB file;
the other models remain cache-hit).
Follow-ups from the AsrModels.swift review on the v3-joint-only PR:

- `inferredVersion`: add `.tdtJa` to `knownVersions` so
  `modelsExist(at:)` without an explicit version returns the right
  answer for Japanese model directories. `.ctcZhCn` is intentionally
  omitted — it's rejected at the top of `load(...)` and uses a
  dedicated loader.
- `getModelFileNames(.v3)`: correct the "zero-cost when disabled"
  comment. Top-K is always computed in the CoreML graph; only the
  Swift-side extraction is gated via `needsTopK`.
- v3 joint-load failure: append a diagnostic hint pointing at stale
  caches so users who previously had only the legacy
  `JointDecision.mlmodelc` can recover.
`846924a1d` removed CTC-only inference for Japanese, but the comments on
`parakeetJa` and the `TDTJa` enum still described the repo as containing
"both CTC and TDT models" and referred to the files as "newly converted
... uploaded to CTC repo." Only the TDT path exists now; the CTC
preprocessor+encoder files from the repo are reused as the acoustic
frontend for the TDT decoder+joint.

Comment-only; no behavior change.
The `parakeet` case predates the v2/v3 split and was the last remaining
ambiguous name in the `Repo` enum — every other Parakeet case is
explicitly versioned (`parakeetV2`, `parakeetCtc110m`, etc.). Now that
`.v3` is the only repo binding to this case and has a distinct required
model set (JointDecisionv3), the implicit "parakeet means v3" convention
is more confusing than helpful.

Renames the enum case across Sources/ and Tests/. The HuggingFace remote
path is unchanged — this is Swift-side naming only.
FleursBenchmark.mapToLanguageEnum previously returned nil for cs_cz,
sk_sk, hr_hr, sl_si, ro_ro — silently skipping script-aware filtering
for the exact Latin-script languages that the Language enum flagged as
"prone to Cyrillic confusion." The corresponding enum cases already
exist; this wires the benchmark up to use them.
@Alex-Wengg Alex-Wengg force-pushed the feat/script-filtering-issue-512 branch from 96d2a90 to 379cd89 on April 21, 2026 at 08:24
@github-actions

✅ Japanese ASR Benchmark Results (CTC)

Status: Passed

Metric Value
CER 9.07%
Samples 50
Avg RTFx 2.5x
Decoder CTC

✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly.

View benchmark log

…tModel

- Drop single-use MLMultiArray.shapeString extension; inline the
  'x'-joined shape string directly at the one error-message call site
  in TdtModelInference.
- Rename the guard-let binding unwrappedJointModel -> jointModel in
  AsrModels.load; the 'unwrapped' prefix was noise given the variable
  is a plain non-optional MLModel after the guard.
@github-actions

✅ Japanese ASR Benchmark Results (CTC)

Status: Passed

Metric Value
CER 9.07%
Samples 50
Avg RTFx 2.6x
Decoder CTC

✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly.

View benchmark log

…ogit, inline softmax

- matches(): accept U+0300-U+036F (Combining Diacritical Marks) as Latin
  so NFD-decomposed forms like 'e' + U+0301 aren't rejected. SentencePiece
  vocabs usually emit precomposed characters, but handling both keeps the
  filter robust across export variants.
- matches(): empty / boundary-only tokens now return true (script-neutral)
  instead of false. filterTopK's argmax can then rank them alongside real
  candidates rather than skipping them outright.
- filterTopK(): use 'bestIdx < 0 || logit > bestLogit' so the first in-
  script candidate wins unconditionally. Previously a match with logit
  == -.infinity never beat the -.infinity sentinel and filterTopK returned
  nil despite having a valid candidate.
- Inline the private softmaxProbability helper into its sole caller; drop
  the helper.

Tests: flipped testBoundaryMarkerOnly / testEmptyString to assert the new
neutral behavior and added testCombiningDiacriticsRange +
testFilterTopKPicksNegativeInfinityLogit. 40/40 pass.
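The `-.infinity` argmax fix above can be sketched in isolation (names here are illustrative; the real `filterTopK` operates over the joint decoder's top-K outputs):

```swift
// Minimal sketch of the argmax fix: 'bestIdx < 0' lets the first
// in-script candidate win unconditionally, even when its logit is
// -.infinity. The old comparison against a -.infinity sentinel could
// never accept such a candidate, so the function returned nil despite
// having a valid match.
func bestMatchingToken(
    tokenIds: [Int], logits: [Float], matches: (Int) -> Bool
) -> Int? {
    var bestIdx = -1
    var bestLogit = -Float.infinity
    for (i, id) in tokenIds.enumerated() where matches(id) {
        if bestIdx < 0 || logits[i] > bestLogit {
            bestIdx = i
            bestLogit = logits[i]
        }
    }
    return bestIdx >= 0 ? tokenIds[bestIdx] : nil
}
```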

Empirical testing against the 7 Polish audio samples in issue #512
(tajchert's Google Drive folder) showed the last-chunk flush loop
emits 0 script-filter swaps, while the main loop fires 7 times and
the inner silence-skip loop fires 21 times across the same samples.

The flush loop is blank/punct-dominated on short utterances, so the
defensive filter call added during code review was dead code.
Removing it and replacing the successor comment with an honest note
about the empirical finding.

Add Documentation/ASR/TokenLanguageFilter.md covering: the cross-script
leakage problem (issue #512), where the filter runs in the TDT v3
decoder, the asymmetric Unicode guards (Latin vs Cyrillic), the top-K
vs full-vocab softmax caveat, and the handled edge cases including
the -infinity logit argmax.

Trim TokenLanguageFilter.swift inline comments from 221 -> 149 lines,
keeping the non-obvious ones (asymmetric guards, -inf edge case, top-K
probability caveat) and dropping idiom restatements and redundant doc.

Matches the TokenLanguageFilter type name (following the earlier
ScriptDetection -> TokenLanguageFilter rename in 2794797).
… + trim doc

Drop the 'apply' prefix and compress the helper's doc comment: the
behavior paragraph restated the guard chain, and the double-clamp note
covered a one-line defensive call. Keep the blank-token rationale — the
'label != blankId' guard is load-bearing and the 'why' isn't obvious.
Contributor

@devin-ai-integration devin-ai-integration bot left a comment

Devin Review found 1 new potential issue.

Comment on lines +87 to +92
if value >= 0x0020 && value <= 0x007F {
if (value >= 0x41 && value <= 0x5A) || (value >= 0x61 && value <= 0x7A) {
return false
}
return true
}
🟡 Nested if statement violates AGENTS.md control-flow rule

AGENTS.md mandates: "Nested if statements should be absolutely avoided." The Cyrillic branch of matches at Sources/FluidAudio/Shared/TokenLanguageFilter.swift:87-91 contains a nested if — the outer checks for the ASCII range and the inner checks for ASCII letters. This can be flattened by checking letters first (they're a subset of the ASCII range), then accepting the rest of ASCII separately.

Suggested change
if value >= 0x0020 && value <= 0x007F {
if (value >= 0x41 && value <= 0x5A) || (value >= 0x61 && value <= 0x7A) {
return false
}
return true
}
if (value >= 0x41 && value <= 0x5A) || (value >= 0x61 && value <= 0x7A) {
return false
}
if value >= 0x0020 && value <= 0x007F {
return true
}


"cross-script", "out-of-script", "in-script" were jargon. The mechanism
is actually "is this token from the right alphabet for the target
language?" — plain English works better for reviewers and future
maintainers.

- "cross-script leakage" -> "wrong-language leakage"
- "wrong-script candidates" -> "wrong-language tokens"
- "out-of-script" -> "wrong-language"
- "in-script" -> "right-language"
- "writing script" (user-facing) -> "alphabet"

Kept as-is: `Script` enum, `script:` parameter, `matches(_:script:)`
signature — implementation details where "script" is the correct Unicode
term. "Latin-script Slavic" also kept as legitimate linguistic grouping.

No behavior change. Tests pass (40/40).

Contributor

@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 1 new potential issue.


Comment on lines +592 to +593
label = filtered.tokenId
score = TdtDurationMapping.clampProbability(filtered.probability)

🔴 filterTopK can return blankId as replacement, silently converting speech tokens to silence

The doc comment at TdtDecoderV3.swift:564-566 states "Blanks are excluded from replacement" but neither tokenLanguageFilter nor filterTopK actually excludes the blank token from the candidate pool. The guard at line 576 (label != blankId) only prevents blank from being the trigger — it doesn't prevent blank from being the replacement.

The blank token's vocabulary text is typically empty or a boundary marker, so TokenLanguageFilter.matches returns true at TokenLanguageFilter.swift:69 (guard !cleanedText.isEmpty else { return true }), making blank a valid script-neutral candidate in filterTopK. If blank has the highest logit among matching candidates, it wins the argmax and gets returned.

This is especially harmful in the inner blank-processing loop (TdtDecoderV3.swift:333-345): after the filter sets label = blankId, blankMask becomes true, and advanceMask = activeMask && blankMask keeps the loop running — the speech token is swallowed and the decoder continues advancing through frames as if they were silence, instead of exiting the inner loop and emitting the token. The intended nil-return fallback path ("no right-language candidates → keep original token") is defeated because blank passes matches as script-neutral.

Suggested change
label = filtered.tokenId
score = TdtDurationMapping.clampProbability(filtered.probability)
guard filtered.tokenId != blankId else { return }
label = filtered.tokenId
score = TdtDurationMapping.clampProbability(filtered.probability)


Successfully merging this pull request may close these issues.

Short utterances in Latin-script languages transcribed as Cyrillic [Parakeet TDT v3]