
ASR: fix Parakeet TDT v3 emitting Cyrillic for short Latin-script utterances (#512)#515

Open
Alex-Wengg wants to merge 34 commits into main from feat/script-filtering-issue-512

Conversation

@Alex-Wengg
Member

@Alex-Wengg Alex-Wengg commented Apr 12, 2026

Fixes #512.

TL;DR

Parakeet TDT v3 transcribed short Polish utterances like "Wpisz Google kropka com" as Cyrillic (Впиш Гугл к ком.) because the joint decoder's top-1 pick drifts to Cyrillic tokens under low acoustic confidence. This PR adds an opt-in script filter: when a caller passes language: .polish (or any other language with a declared script), the decoder rejects top-1 if it's the wrong script and walks top-K to the highest-probability candidate matching the expected script.

  • Opt-in: language: defaults to nil — zero behavior change for existing callers.
  • No acoustic-model changes — this is purely a decoder-side post-processing step over the joint logits.
  • Requires JointDecisionv3.mlmodelc (exposes top-K outputs). Auto-downloaded from HuggingFace alongside the other v3 files; falls back to standard argmax when absent.

Empirical validation — reporter's own audio

Samples pulled via gdown --folder <link-from-issue-#512-comment> from @tajchert's Drive folder. JointDecisionv3.mlmodelc is loaded in both columns — this isolates the Swift filter as the mechanism, not a model swap.

| sample | ground truth | language: nil (current) | language: .polish (this PR) |
| --- | --- | --- | --- |
| pl | Wpisz Google kropka com | Впиш Гугл к ком. | Wpis Google.com. |
| pl2 | Wpisz Google kropka com | Впиш Гугл крокаком. | Wpish Google, Com. |
| pl3 | Wpisz Google kropka com | Впишь куглькрабком. | VP Kugl.com. |
| pl4 | Wpisz Google kropka com | Впиш гугл к ком. | Wpish gugl c. |
| pl5 | Wpisz Google kropka com | Впиш гугл кракаком. | Wpish Google Croca kom. |
| pl6 | Wpisz Google kropka com | Впиш, гугл крокаком. | Wpish, Google, Com. |
| pl_complex | Cały spichlarz jest ze spiżu | Cały spichlarz jest ze spiżu. | Cały spichlarz jest ze spiżu. |

6/6 short samples flip Cyrillic → Latin. pl_complex was never broken (long context → high joint confidence → no drift) and is unchanged.

Scope & limitations (important — please don't overclaim)

This PR fixes the script the tokens are drawn from. It does NOT fix per-word acoustic accuracy.

| | language: nil | language: .polish |
| --- | --- | --- |
| Script correct (Latin, not Cyrillic) | ✗ | ✓ (6/6) |
| Word spelling matches ground truth | ✗ | ✗ (still 6/7 wrong on short) |

The residual errors — Wpisz → Wpish/Wpis, kropka → Croca / dropped — are Parakeet TDT v3 acoustic weaknesses on short Polish commands. No amount of output post-processing can turn Wpish into Wpisz; that needs better acoustic modeling, a Polish LM rescorer, or more training data. Out of scope here.

What users actually get by merging:

  • Output is visually Polish (Latin script), not pseudo-Russian — works with locale-aware post-processing, spell-check, and UI rendering
  • Locale-strict WER evaluators no longer penalize Cyrillic-vs-Latin substitution
  • Opt-in; zero risk for callers who don't pass language:

What users do not get:

  • Higher word accuracy on short Polish/Slavic Latin utterances
  • Support for languages outside the Language enum (Greek, Maltese, Hungarian, Turkish, Baltic — their characters fit the Latin Unicode ranges but aren't exposed; easy follow-up)
  • A meaningful FLEURS WER delta — see Documentation/fleurs-script-filtering-comparison.md; full sentences aren't in the failure regime

Implementation

New

  • Sources/FluidAudio/Shared/ScriptDetection.swift (new, +112)
    • public enum Language — 13 Latin (en, es, fr, de, it, pt, ro, pl, cs, sk, sl, hr, bs) + 5 Cyrillic (ru, uk, be, bg, sr)
    • public enum Script { case latin, cyrillic }
    • matches(_:script:) over Unicode ranges: ASCII (0x20–0x7F), Latin-1 (0xA0–0xFF), Latin Extended-A (0x100–0x17F), Latin Extended-B (0x180–0x24F — Romanian ș/ț), Latin Extended Additional (0x1E00–0x1EFF — Vietnamese), Cyrillic (0x400–0x4FF). Strips SentencePiece boundary marker U+2581 before checking.
    • filterTopK(topKIds:topKLogits:vocabulary:preferredScript:) -> (tokenId, probability)? — returns the highest-probability top-K candidate matching the target script; probability via softmax over the top-K subset with the max-logit stability trick; guarded against top-K array length mismatch.
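Conceptually, the filter is a two-step check: a Unicode-range script test on each candidate token, then an argmax over the in-script subset of the top-K, with a stability-corrected softmax for the returned probability. A minimal sketch — the names mirror the PR's `Script` / `matches` / `filterTopK`, but the Cyrillic punctuation allowance and other details are simplified assumptions, not the shipped code:

```swift
import Foundation

enum Script { case latin, cyrillic }

func matches(_ token: String, script: Script) -> Bool {
    // Strip the SentencePiece boundary marker U+2581 before checking.
    let stripped = token.unicodeScalars.filter { $0.value != 0x2581 }
    guard !stripped.isEmpty else { return true }  // boundary-only tokens are script-neutral
    return stripped.allSatisfy { scalar in
        let v = scalar.value
        switch script {
        case .latin:
            // ASCII, Latin-1, Latin Extended-A/B, Latin Extended Additional
            return (0x20...0x7F).contains(v) || (0xA0...0xFF).contains(v)
                || (0x100...0x24F).contains(v) || (0x1E00...0x1EFF).contains(v)
        case .cyrillic:
            // Cyrillic block, plus punctuation/digits (assumed allowance)
            return (0x400...0x4FF).contains(v) || (0x20...0x40).contains(v)
        }
    }
}

func filterTopK(topKIds: [Int], topKLogits: [Float],
                vocabulary: [Int: String],
                preferredScript: Script) -> (tokenId: Int, probability: Float)? {
    let n = min(topKIds.count, topKLogits.count)  // length-mismatch guard
    guard n > 0 else { return nil }
    var bestIdx = -1
    for i in 0..<n where matches(vocabulary[topKIds[i]] ?? "", script: preferredScript) {
        if bestIdx < 0 || topKLogits[i] > topKLogits[bestIdx] { bestIdx = i }
    }
    guard bestIdx >= 0 else { return nil }
    // Softmax over the top-K subset with the max-logit stability trick.
    let maxLogit = topKLogits[0..<n].max()!
    let exps = topKLogits[0..<n].map { exp($0 - maxLogit) }
    return (topKIds[bestIdx], exps[bestIdx] / exps.reduce(0, +))
}
```

The `bestIdx < 0 ||` form means the first in-script candidate always wins over the "no candidate yet" state, even at a logit of negative infinity.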

Changed

  • TdtJointDecision — optional topKIds / topKLogits fields (populated by JointDecisionv3 only)
  • TdtDecoderV3 — script filter runs only when top-1 is already wrong script; both decode sites feed filtered.probability (a real [0,1]) into TdtDurationMapping.clampProbability, not raw logits
  • AsrManager.transcribe(...) — language: Language? = nil plumbed through all three overloads: [Float], URL, AVAudioPCMBuffer
  • AsrModels + ModelNames — requiredModelsV3 set includes JointDecisionv3.mlmodelc so the download utility fetches it on fresh installs and also backfills it for existing users on next .v3 load
  • CLI — fluidaudiocli transcribe <file> --language {en|pl|cs|sk|sl|hr|bs|ro|es|fr|de|it|pt|ru|uk|be|bg|sr}

How to try it

swift run -c release fluidaudiocli transcribe sample.wav --language pl

Model dependency

JointDecisionv3.mlmodelc must be present in FluidInference/parakeet-tdt-0.6b-v3-coreml on HuggingFace. It exposes top_k_ids / top_k_logits outputs (K=64 in our export) alongside the standard argmax. When absent, AsrModels falls back to JointDecision.mlmodelc and the script filter becomes a no-op — backward compatible.
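The fallback described above reduces to a path check. A hedged sketch — the real logic lives in `AsrModels` and isn't shown in this description, so treat the function name and shape as illustrative assumptions:

```swift
import Foundation

// Prefer the top-K-capable v3 joint; fall back to the argmax-only legacy joint.
// When the fallback is taken, the script filter becomes a no-op downstream.
func jointModelURL(in modelDir: URL) -> URL {
    let v3 = modelDir.appendingPathComponent("JointDecisionv3.mlmodelc")
    if FileManager.default.fileExists(atPath: v3.path) { return v3 }
    return modelDir.appendingPathComponent("JointDecision.mlmodelc")
}
```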

Cache-upgrade verified: removed JointDecisionv3.mlmodelc from a populated cache, re-ran --language pl; the file was auto-fetched and Polish output was Latin. Existing users pick up the fix on next .v3 load without manual intervention.

Review notes / risky bits

  • Softmax over top-K subset, not the full vocab — probabilities won't exactly match a true full-softmax, but K=64 captures ~all the mass when the model is anywhere near confident. If you prefer, we can expose the raw top-K logits to callers and let them compute confidence however they want.
  • Top-1 escape hatch: filter is only triggered when top-1 fails matches(_, script:). When top-1 is already correct, nothing is changed — so we can't regress the common case.
  • Length-mismatch guard in filterTopK uses min(topKIds.count, topKLogits.count). If CoreML output arrays ever diverge, we iterate the common prefix instead of crashing.
  • Latin Extended-B (0x0180–0x024F) was added specifically so Romanian ș/ț aren't rejected as non-Latin. Latin Extended Additional (0x1E00–0x1EFF) was added for free — helps Vietnamese should anyone want it later.
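To see why the top-K-subset softmax is a reasonable stand-in for a full-vocab softmax, here is a small numeric illustration with made-up logits: a long tail of low-logit tokens barely moves the top candidate's probability.

```swift
import Foundation

func softmax(_ logits: [Float]) -> [Float] {
    let m = logits.max()!                 // max-logit stability trick
    let exps = logits.map { exp($0 - m) }
    let sum = exps.reduce(0, +)
    return exps.map { $0 / sum }
}

let topK: [Float] = [5.0, 3.0, 1.0]       // hypothetical top-3 logits
// Simulate the rest of the vocabulary as a tail of 100 low-logit tokens.
let fullVocab = topK + [Float](repeating: -4.0, count: 100)

let pTopK = softmax(topK)[0]              // subset estimate
let pFull = softmax(fullVocab)[0]         // "true" full-vocab value
```

The subset estimate slightly overstates the probability (it ignores the tail mass), but the gap shrinks as the model gets more confident in the top candidates.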

Tests

  • ScriptDetectionTests — 37 tests: Unicode range coverage (Latin-1 / Extended-A / Extended-B / Extended Additional / Cyrillic), SentencePiece boundary-marker stripping, filterTopK happy path, length-mismatch guard, probability-range invariant, Czech/Slovak/Slovenian/Croatian/Romanian token coverage, cross-script rejection
  • Build clean; swift format lint clean on all touched files
  • A/B end-to-end run against reporter's actual Polish audio (table above)

Follow-ups (not blocking)

  • Expose more Latin languages in the enum (Hungarian, Turkish, Baltic, Maltese) — all character ranges already supported, just need enum cases
  • Add Script.greek for el_gr (separate Unicode range)
  • Short-utterance benchmark dataset (FLEURS is the wrong tool — it's all long sentences where drift doesn't happen)
  • Optional: publish a Polish LM rescorer to address the underlying acoustic-accuracy issue the script filter cannot fix

@github-actions

github-actions bot commented Apr 12, 2026

PocketTTS Smoke Test ✅

Check Result
Build
Model download
Model load
Synthesis pipeline
Output WAV ✅ (191.3 KB)

Runtime: 0m23s

Note: PocketTTS uses CoreML MLState (macOS 15) KV cache + Mimi streaming state. CI VM lacks physical GPU — audio quality and performance may differ from Apple Silicon.


@github-actions

github-actions bot commented Apr 12, 2026

Kokoro TTS Smoke Test ✅

Check Result
Build
Model download
Model load
Synthesis pipeline
Output WAV ✅ (634.8 KB)

Runtime: 0m51s

Note: Kokoro TTS uses CoreML flow matching + Vocos vocoder. CI VM lacks physical ANE — performance may differ from Apple Silicon.

@github-actions

github-actions bot commented Apr 12, 2026

Parakeet EOU Benchmark Results ✅

Status: Benchmark passed
Chunk Size: 320ms
Files Tested: 100/100

Performance Metrics

Metric Value Description
WER (Avg) 7.03% Average Word Error Rate
WER (Med) 4.17% Median Word Error Rate
RTFx 12.34x Real-time factor (higher = faster)
Total Audio 470.6s Total audio duration processed
Total Time 38.7s Total processing time

Streaming Metrics

Metric Value Description
Avg Chunk Time 0.039s Average chunk processing time
Max Chunk Time 0.077s Maximum chunk processing time
EOU Detections 0 Total End-of-Utterance detections

Test runtime: 0m44s • 04/21/2026, 01:10 PM EST

RTFx = Real-Time Factor (higher is better) • Processing includes: Model inference, audio preprocessing, state management, and file I/O

@github-actions

github-actions bot commented Apr 12, 2026

Qwen3-ASR int8 Smoke Test ✅

Check Result
Build
Model download
Model load
Transcription pipeline
Decoder size 571 MB (vs 1.1 GB f32)

Performance Metrics

Metric CI Value Expected on Apple Silicon
Median RTFx 0.06x ~2.5x
Overall RTFx 0.06x ~2.5x

Runtime: 3m27s

Note: CI VM lacks physical GPU — CoreML MLState (macOS 15) KV cache produces degraded results on virtualized runners. On Apple Silicon: ~1.3% WER / 2.5x RTFx.

@github-actions

github-actions bot commented Apr 12, 2026

VAD Benchmark Results

Performance Comparison

Dataset Accuracy Precision Recall F1-Score RTFx Files
MUSAN 92.0% 86.2% 100.0% 92.6% 596.8x faster 50
VOiCES 92.0% 86.2% 100.0% 92.6% 542.0x faster 50

Dataset Details

  • MUSAN: Music, Speech, and Noise dataset - standard VAD evaluation
  • VOiCES: Voices Obscured in Complex Environmental Settings - tests robustness in real-world conditions

✅: Average F1-Score above 70%

@github-actions

github-actions bot commented Apr 12, 2026

ASR Benchmark Results ✅

Status: All benchmarks passed

Parakeet v3 (multilingual)

Dataset WER Avg WER Med RTFx Status
test-clean 0.57% 0.00% 4.92x
test-other 1.80% 0.00% 2.67x

Parakeet v2 (English-optimized)

Dataset WER Avg WER Med RTFx Status
test-clean 0.80% 0.00% 5.49x
test-other 1.00% 0.00% 3.43x

Streaming (v3)

Metric Value Description
WER 0.00% Word Error Rate in streaming mode
RTFx 0.64x Streaming real-time factor
Avg Chunk Time 1.529s Average time to process each chunk
Max Chunk Time 2.317s Maximum chunk processing time
First Token 1.862s Latency to first transcription token
Total Chunks 31 Number of chunks processed

Streaming (v2)

Metric Value Description
WER 0.00% Word Error Rate in streaming mode
RTFx 0.64x Streaming real-time factor
Avg Chunk Time 1.418s Average time to process each chunk
Max Chunk Time 1.734s Maximum chunk processing time
First Token 1.440s Latency to first transcription token
Total Chunks 31 Number of chunks processed

Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming

25 files per dataset • Test runtime: 5m56s • 04/21/2026, 01:09 PM EST

RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Processing time includes: Model inference on Apple Neural Engine, audio preprocessing, state resets between files, token-to-text conversion, and file I/O
Example: RTFx of 2.0x means 10 seconds of audio processed in 5 seconds (2x faster than real-time)

Expected RTFx Performance on Physical M1 Hardware:

• M1 Mac: ~28x (clean), ~25x (other)
• CI shows ~0.5-3x due to virtualization limitations

Testing methodology follows HuggingFace Open ASR Leaderboard

@github-actions

✅ Japanese ASR Benchmark Results (CTC)

Status: Passed

Metric Value
CER 9.94%
Samples 50
Avg RTFx 2.5x
Decoder CTC

✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly.

View benchmark log

@github-actions

github-actions bot commented Apr 12, 2026

Speaker Diarization Benchmark Results

Speaker Diarization Performance

Evaluating "who spoke when" detection accuracy

Metric Value Target Status Description
DER 15.1% <30% Diarization Error Rate (lower is better)
JER 24.9% <25% Jaccard Error Rate
RTFx 23.24x >1.0x Real-Time Factor (higher is faster)

Diarization Pipeline Timing Breakdown

Time spent in each stage of speaker diarization

Stage Time (s) % Description
Model Download 9.871 21.9 Fetching diarization models
Model Compile 4.230 9.4 CoreML compilation
Audio Load 0.089 0.2 Loading audio file
Segmentation 13.542 30.0 Detecting speech regions
Embedding 22.571 50.0 Extracting speaker voices
Clustering 9.028 20.0 Grouping same speakers
Total 45.160 100 Full pipeline

Speaker Diarization Research Comparison

Research baselines typically achieve 18-30% DER on standard datasets

Method DER Notes
FluidAudio 15.1% On-device CoreML
Research baseline 18-30% Standard dataset performance

Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:

  • M2 MacBook Air (2022): Runs at 150 RTFx real-time
  • Performance scales with Apple Neural Engine capabilities

🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 45.1s diarization time • Test runtime: 2m 7s • 04/21/2026, 01:10 PM EST

@github-actions

github-actions bot commented Apr 12, 2026

Sortformer High-Latency Benchmark Results

ES2004a Performance (30.4s latency config)

Metric Value Target Status
DER 33.4% <35%
Miss Rate 24.4% - -
False Alarm 0.2% - -
Speaker Error 8.8% - -
RTFx 8.1x >1.0x
Speakers 4/4 - -

Sortformer High-Latency • ES2004a • Runtime: 3m 12s • 2026-04-21T17:09:44.168Z

@github-actions

github-actions bot commented Apr 12, 2026

Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode)

Optimal clustering with Hungarian algorithm for maximum accuracy

Metric Value Target Status Description
DER 10.4% <20% Diarization Error Rate (lower is better)
RTFx 9.94x >1.0x Real-Time Factor (higher is faster)

Offline VBx Pipeline Timing Breakdown

Time spent in each stage of batch diarization

Stage Time (s) % Description
Model Download 12.426 11.8 Fetching diarization models
Model Compile 5.325 5.0 CoreML compilation
Audio Load 0.057 0.1 Loading audio file
Segmentation 21.676 20.5 VAD + speech detection
Embedding 105.243 99.7 Speaker embedding extraction
Clustering (VBx) 0.115 0.1 Hungarian algorithm + VBx clustering
Total 105.576 100 Full VBx pipeline

Speaker Diarization Research Comparison

Offline VBx achieves competitive accuracy with batch processing

Method DER Mode Description
FluidAudio (Offline) 10.4% VBx Batch On-device CoreML with optimal clustering
FluidAudio (Streaming) 17.7% Chunk-based First-occurrence speaker mapping
Research baseline 18-30% Various Standard dataset performance

Pipeline Details:

  • Mode: Offline VBx with Hungarian algorithm for optimal speaker-to-cluster assignment
  • Segmentation: VAD-based voice activity detection
  • Embeddings: WeSpeaker-compatible speaker embeddings
  • Clustering: PowerSet with VBx refinement
  • Accuracy: Higher than streaming due to optimal post-hoc mapping

🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 127.0s processing • Test runtime: 2m 12s • 04/21/2026, 01:16 PM EST

Alex-Wengg added a commit that referenced this pull request Apr 12, 2026
Complete baseline benchmark results for 24 languages (2,400 samples total):
- Establishes baseline WER/CER before script filtering implementation
- Polish: 8.98% WER (target for issue #512 improvement)
- All languages maintain real-time performance (avg 62.6x RTFx)
- Best: Italian 3.46% WER, Worst: Greek 38.91% WER

Related to issue #512 (Polish Cyrillic confusion) and PR #515 (script filtering).
Next step: Re-run on feat/script-filtering-issue-512 branch to measure improvement.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Alex-Wengg added a commit that referenced this pull request Apr 12, 2026
**Issue 1: Language parameter silently dropped for long audio (CRITICAL)**
- Thread language parameter through ChunkProcessor.process() and transcribeChunk()
- Script filtering now works correctly for audio >15 seconds
- Before: ChunkProcessor ignored language, disabling filtering for real-world recordings
- After: Language parameter flows through full chunked transcription pipeline

**Issue 2: SentencePiece word boundary marker not handled (CRITICAL)**
- Strip ▁ (U+2581 LOWER ONE EIGHTH BLOCK) before script detection
- This character prefixes most vocabulary tokens but doesn't indicate script
- Before: allSatisfy() check failed because ▁ outside all Unicode ranges
- After: Strip marker first, then check actual content

**Issue 3: Token confidence not updated after filtering (MEDIUM)**
- Update `score` variable with filtered token's logit in both main loop and inner loop
- Before: Stale probability from original top-1 token persisted through results
- After: Confidence reflects actual selected token after script filtering

**Issue 4: Missing unit tests (HIGH)**
- Add comprehensive ScriptDetectionTests with 28 tests covering:
  - Script property tests for Language enum
  - Basic script matching (Latin, Cyrillic, mixed scripts)
  - SentencePiece boundary marker handling
  - Polish language support (issue #512 specific tests)
  - Punctuation and whitespace handling
  - filterTopK() functionality and edge cases
  - Unicode range validation
- All tests pass

**Additional improvements:**
- Improved Cyrillic script detection to reject Latin letters while allowing
  punctuation, spaces, and digits (prevents "hello" matching Cyrillic)
- Fixed existing TdtRefactoredComponentsTests to use new TdtJointDecision signature

Fixes identified by Devin AI in PR review #4094445719.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@github-actions

✅ Japanese ASR Benchmark Results (CTC)

Status: Passed

Metric Value
CER 9.94%
Samples 50
Avg RTFx 2.2x
Decoder CTC

✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly.

View benchmark log

@github-actions

✅ Japanese ASR Benchmark Results (CTC)

Status: Passed

Metric Value
CER 9.94%
Samples 50
Avg RTFx 2.4x
Decoder CTC

✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly.

View benchmark log

@Alex-Wengg Alex-Wengg force-pushed the feat/script-filtering-issue-512 branch 2 times, most recently from 14d1926 to bbf98df on April 12, 2026 at 03:27
@github-actions

✅ Japanese ASR Benchmark Results (CTC)

Status: Passed

Metric Value
CER 9.94%
Samples 50
Avg RTFx 2.2x
Decoder CTC

✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly.

View benchmark log

@github-actions

✅ Japanese ASR Benchmark Results (CTC)

Status: Passed

Metric Value
CER 9.94%
Samples 50
Avg RTFx 2.5x
Decoder CTC

✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly.

View benchmark log

@github-actions

✅ Japanese ASR Benchmark Results (CTC)

Status: Passed

Metric Value
CER 9.94%
Samples 50
Avg RTFx 2.6x
Decoder CTC

✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly.

View benchmark log



@Alex-Wengg Alex-Wengg force-pushed the feat/script-filtering-issue-512 branch from 923412f to 7721530 on April 21, 2026 at 00:33
@github-actions

✅ Japanese ASR Benchmark Results (CTC)

Status: Passed

Metric Value
CER 9.07%
Samples 50
Avg RTFx 3.0x
Decoder CTC

✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly.

View benchmark log


@Alex-Wengg Alex-Wengg changed the title Add script filtering for Cyrillic/Latin disambiguation (fixes #512) ASR: fix Parakeet TDT v3 emitting Cyrillic for short Latin-script utterances (#512) Apr 21, 2026

…ernal

`matches` and `filterTopK` only make sense when you have raw top-K ids/logits
plus a CoreML vocab dictionary — i.e. inside the TDT decoder path. No external
consumer has a realistic use case, so drop them from the public surface.
`Language` and `Script` stay public since they're on the transcribe APIs.

Also hoist the SentencePiece word-boundary marker (U+2581) to a named
constant, `sentencePieceBoundary`, instead of leaving the scalar literal
inline — reads cleaner and is easier to find if another path ever needs
the same stripping.
Trim to a minimal 24-language FLEURS regression runner:
- Drop parallel LANG_NAMES array and stale WER-tier comments.
- Drop caffeinate (only inhibits sleep under AC power anyway).
- Drop verify_models step; CLI auto-downloads on first use.
- Soft-fail per language with `|| log "WARN"` so one bad language
  doesn't abort the suite.
- Use printf (not echo) so tabs render correctly on macOS.
- Write machine-readable summary CSV alongside the per-lang JSONs.
- Preflight python3 check and surface swift-build failures.
- SAMPLES env override for quick smoke tests.
Address code-review findings on TdtDecoderV3:

- #10 (correctness): final-chunk flushing loop now runs
  `applyScriptFilter` on each emitted token. Previously the last few
  tokens of any utterance bypassed script filtering, which could leak
  wrong-script candidates in multilingual runs where the tail fell in
  this path.
- #28: update the now-stale "No filtering at decoder level" comment to
  reflect that script filtering happens per-step above.
- #9: remove dead `iterCount` / `innerLoopCount` counters.
- #15: delete unused `MLMultiArray.l2Normf()` helper (kept
  `shapeString`, which is used by TdtModelInference for error output).
Address review feedback on `TdtJointDecision`:

- Clarify the scale mismatch between `probability` (full-vocab softmax)
  and `topKLogits` (raw pre-softmax logits). Consumers that want a
  comparable probability should go through
  `TokenLanguageFilter.filterTopK`, which returns the top-K softmax.
- Add an `assert(topKIds?.count == topKLogits?.count)` in the init to
  catch schema drift if a future model ever returns mismatched arrays.
- Add `: Sendable` conformance. All stored properties are value types,
  so conformance synthesizes for free and matches the surrounding
  `TdtDecoderV3: Sendable` posture.
- Sharpen the rationale comment on why the top-K fields aren't given
  stored-property defaults: Swift excludes stored `let` properties with
  default values from the synthesized memberwise initializer because
  `let` + default is a compile-time-initialized constant, not a
  parameter.
Address review feedback on the joint top-K path:

- Thread `needsTopK: Bool` through `runJointPrepared`. TdtDecoderV3
  computes it once as `language != nil`. When no language is provided
  (the common path), v3 joint runs no longer allocate K-length Swift
  arrays on every decoded step just to drop them.
- Factor the per-array extraction into `extractInt32Array` /
  `extractFloat32Array` helpers. Both validate the CoreML dtype before
  the `bindMemory` cast, so an export that switches to Int64 / Float16
  fails loudly instead of silently reinterpreting bit patterns.
- Enforce top-K outputs as a present-or-absent pair with matching
  lengths. Catches export-schema drift before TokenLanguageFilter has
  to defend against it downstream.
Address code-review feedback on `AsrManager.tdtDecodeWithTimings` and
the public `transcribe(...)` overloads:

- Split the `.v3, .tdtJa` switch arm. Route `.tdtJa` independently and
  drop the `language` / `vocabulary` forwarding: the Japanese model
  emits Kanji / Hiragana / Katakana tokens, none of which are covered
  by the current Latin/Cyrillic `TokenLanguageFilter`. Propagating a
  hint there would either no-op or — worse — silently filter out
  valid Japanese output. Log at debug when a caller-supplied hint is
  dropped.
- Log at debug on the `.v2, .tdtCtc110m` path when `language` is
  non-nil. Previously the hint was swallowed silently.
- Drop the `language != nil ? vocabulary : nil` conditional on the
  `.v3` path. `TdtDecoderV3.applyScriptFilter` already short-circuits
  when `language` is nil; forwarding the vocab unconditionally is
  clearer and has no runtime cost.
- Document the `language:` parameter on the `URL`,
  `transcribeDiskBacked`, and `[Float]` public overloads, and note the
  silent-ignore behavior on the already-documented buffer overload.
…back

v3 now loads `JointDecisionv3.mlmodelc` exclusively. The opportunistic
try/fallback to `JointDecision.mlmodelc` was transitional scaffolding
for the HF upload period and is no longer needed now that v3 joint is
stable on the remote.

- ModelNames: add `requiredModelsV3` using `jointV3File`; keep
  `requiredModels` for v2/legacy; split `.parakeet` vs `.parakeetV2` in
  `getRequiredModelNames`.
- AsrModels: `getModelFileNames(.v3)` returns `jointV3File`;
  `getRequiredModels(.v3)` returns `requiredModelsV3`; joint-load path
  becomes a single unconditional download (dies loud if missing).
- AsrManager: drop "requires JointDecisionv3.mlmodelc" hedge in
  `language:` doc comments — always present for v3.

Existing v3 users with only the legacy `JointDecision.mlmodelc` cached
will fetch `JointDecisionv3.mlmodelc` on next launch (single ~50MB file;
the other models remain cache-hit).
Follow-ups from the AsrModels.swift review on the v3-joint-only PR:

- `inferredVersion`: add `.tdtJa` to `knownVersions` so
  `modelsExist(at:)` without an explicit version returns the right
  answer for Japanese model directories. `.ctcZhCn` is intentionally
  omitted — it's rejected at the top of `load(...)` and uses a
  dedicated loader.
- `getModelFileNames(.v3)`: correct the "zero-cost when disabled"
  comment. Top-K is always computed in the CoreML graph; only the
  Swift-side extraction is gated via `needsTopK`.
- v3 joint-load failure: append a diagnostic hint pointing at stale
  caches so users who previously had only the legacy
  `JointDecision.mlmodelc` can recover.
`846924a1d` removed CTC-only inference for Japanese, but the comments on
`parakeetJa` and the `TDTJa` enum still described the repo as containing
"both CTC and TDT models" and referred to the files as "newly converted
... uploaded to CTC repo." Only the TDT path exists now; the CTC
preprocessor+encoder files from the repo are reused as the acoustic
frontend for the TDT decoder+joint.

Comment-only; no behavior change.
The `parakeet` case predates the v2/v3 split and was the last remaining
ambiguous name in the `Repo` enum — every other Parakeet case is
explicitly versioned (`parakeetV2`, `parakeetCtc110m`, etc.). Now that
`.v3` is the only repo binding to this case and has a distinct required
model set (JointDecisionv3), the implicit "parakeet means v3" convention
is more confusing than helpful.

Renames the enum case across Sources/ and Tests/. The HuggingFace remote
path is unchanged — this is Swift-side naming only.
FleursBenchmark.mapToLanguageEnum previously returned nil for cs_cz,
sk_sk, hr_hr, sl_si, ro_ro — silently skipping script-aware filtering
for the exact Latin-script languages that the Language enum flagged as
"prone to Cyrillic confusion." The corresponding enum cases already
exist; this wires the benchmark up to use them.
@Alex-Wengg Alex-Wengg force-pushed the feat/script-filtering-issue-512 branch from 96d2a90 to 379cd89 on April 21, 2026 at 08:24
@github-actions

✅ Japanese ASR Benchmark Results (CTC)

Status: Passed

Metric Value
CER 9.07%
Samples 50
Avg RTFx 2.5x
Decoder CTC

✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly.

View benchmark log

…tModel

- Drop single-use MLMultiArray.shapeString extension; inline the
  'x'-joined shape string directly at the one error-message call site
  in TdtModelInference.
- Rename the guard-let binding unwrappedJointModel -> jointModel in
  AsrModels.load; the 'unwrapped' prefix was noise given the variable
  is a plain non-optional MLModel after the guard.
@github-actions

✅ Japanese ASR Benchmark Results (CTC)

Status: Passed

Metric Value
CER 9.07%
Samples 50
Avg RTFx 2.6x
Decoder CTC

✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly.

View benchmark log

…ogit, inline softmax

- matches(): accept U+0300-U+036F (Combining Diacritical Marks) as Latin
  so NFD-decomposed forms like 'e' + U+0301 aren't rejected. SentencePiece
  vocabs usually emit precomposed characters, but handling both keeps the
  filter robust across export variants.
- matches(): empty / boundary-only tokens now return true (script-neutral)
  instead of false. filterTopK's argmax can then rank them alongside real
  candidates rather than skipping them outright.
- filterTopK(): use 'bestIdx < 0 || logit > bestLogit' so the first in-
  script candidate wins unconditionally. Previously a match with logit
  == -.infinity never beat the -.infinity sentinel and filterTopK returned
  nil despite having a valid candidate.
- Inline the private softmaxProbability helper into its sole caller; drop
  the helper.

Tests: flipped testBoundaryMarkerOnly / testEmptyString to assert the new
neutral behavior and added testCombiningDiacriticsRange +
testFilterTopKPicksNegativeInfinityLogit. 40/40 pass.
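The `-.infinity` argmax fix above can be sketched in isolation (names here are illustrative; the real `filterTopK` operates over the joint decoder's top-K outputs):

```swift
// Minimal sketch of the argmax fix: 'bestIdx < 0' lets the first
// in-script candidate win unconditionally, even when its logit is
// -.infinity. The old comparison against a -.infinity sentinel could
// never accept such a candidate, so the function returned nil despite
// having a valid match.
func bestMatchingToken(
    tokenIds: [Int], logits: [Float], matches: (Int) -> Bool
) -> Int? {
    var bestIdx = -1
    var bestLogit = -Float.infinity
    for (i, id) in tokenIds.enumerated() where matches(id) {
        if bestIdx < 0 || logits[i] > bestLogit {
            bestIdx = i
            bestLogit = logits[i]
        }
    }
    return bestIdx >= 0 ? tokenIds[bestIdx] : nil
}
```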

Empirical testing against the 7 Polish audio samples in issue #512
(tajchert's Google Drive folder) showed the last-chunk flush loop
emits 0 script-filter swaps, while the main loop fires 7 times and
the inner silence-skip loop fires 21 times across the same samples.

The flush loop is blank/punct-dominated on short utterances, so the
defensive filter call added during code review was dead code.
Removing it and replacing the successor comment with an honest note
about the empirical finding.

Add Documentation/ASR/TokenLanguageFilter.md covering: the cross-script
leakage problem (issue #512), where the filter runs in the TDT v3
decoder, the asymmetric Unicode guards (Latin vs Cyrillic), the top-K
vs full-vocab softmax caveat, and the handled edge cases including
the -infinity logit argmax.

Trim TokenLanguageFilter.swift inline comments from 221 -> 149 lines,
keeping the non-obvious ones (asymmetric guards, -inf edge case, top-K
probability caveat) and dropping idiom restatements and redundant doc.

Matches the TokenLanguageFilter type name (following the earlier
ScriptDetection -> TokenLanguageFilter rename in 2794797).
… + trim doc

Drop the 'apply' prefix and compress the helper's doc comment: the
behavior paragraph restated the guard chain, and the double-clamp note
covered a one-line defensive call. Keep the blank-token rationale — the
'label != blankId' guard is load-bearing and the 'why' isn't obvious.
Contributor

@devin-ai-integration devin-ai-integration bot left a comment

Devin Review found 1 new potential issue.

Comment on lines +87 to +92
if value >= 0x0020 && value <= 0x007F {
if (value >= 0x41 && value <= 0x5A) || (value >= 0x61 && value <= 0x7A) {
return false
}
return true
}
🟡 Nested if statement violates AGENTS.md control-flow rule

AGENTS.md mandates: "Nested if statements should be absolutely avoided." The Cyrillic branch of matches at Sources/FluidAudio/Shared/TokenLanguageFilter.swift:87-91 contains a nested if — the outer checks for the ASCII range and the inner checks for ASCII letters. This can be flattened by checking letters first (they're a subset of the ASCII range), then accepting the rest of ASCII separately.

Suggested change
if value >= 0x0020 && value <= 0x007F {
if (value >= 0x41 && value <= 0x5A) || (value >= 0x61 && value <= 0x7A) {
return false
}
return true
}
if (value >= 0x41 && value <= 0x5A) || (value >= 0x61 && value <= 0x7A) {
return false
}
if value >= 0x0020 && value <= 0x007F {
return true
}


"cross-script", "out-of-script", "in-script" were jargon. The mechanism
is actually "is this token from the right alphabet for the target
language?" — plain English works better for reviewers and future
maintainers.

- "cross-script leakage" -> "wrong-language leakage"
- "wrong-script candidates" -> "wrong-language tokens"
- "out-of-script" -> "wrong-language"
- "in-script" -> "right-language"
- "writing script" (user-facing) -> "alphabet"

Kept as-is: `Script` enum, `script:` parameter, `matches(_:script:)`
signature — implementation details where "script" is the correct Unicode
term. "Latin-script Slavic" also kept as legitimate linguistic grouping.

No behavior change. Tests pass (40/40).

Contributor

@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 1 new potential issue.


Comment on lines +592 to +593
label = filtered.tokenId
score = TdtDurationMapping.clampProbability(filtered.probability)

🔴 filterTopK can return blankId as replacement, silently converting speech tokens to silence

The doc comment at TdtDecoderV3.swift:564-566 states "Blanks are excluded from replacement" but neither tokenLanguageFilter nor filterTopK actually excludes the blank token from the candidate pool. The guard at line 576 (label != blankId) only prevents blank from being the trigger — it doesn't prevent blank from being the replacement.

The blank token's vocabulary text is typically empty or a boundary marker, so TokenLanguageFilter.matches returns true at TokenLanguageFilter.swift:69 (guard !cleanedText.isEmpty else { return true }), making blank a valid script-neutral candidate in filterTopK. If blank has the highest logit among matching candidates, it wins the argmax and gets returned.

This is especially harmful in the inner blank-processing loop (TdtDecoderV3.swift:333-345): after the filter sets label = blankId, blankMask becomes true, and advanceMask = activeMask && blankMask keeps the loop running — the speech token is swallowed and the decoder continues advancing through frames as if they were silence, instead of exiting the inner loop and emitting the token. The intended nil-return fallback path ("no right-language candidates → keep original token") is defeated because blank passes matches as script-neutral.

Suggested change
label = filtered.tokenId
score = TdtDurationMapping.clampProbability(filtered.probability)
guard filtered.tokenId != blankId else { return }
label = filtered.tokenId
score = TdtDurationMapping.clampProbability(filtered.probability)


Successfully merging this pull request may close these issues.

Short utterances in Latin-script languages transcribed as Cyrillic [Parakeet TDT v3]