ASR: fix Parakeet TDT v3 emitting Cyrillic for short Latin-script utterances (#512) #515
Alex-Wengg wants to merge 34 commits into main from
Conversation
PocketTTS Smoke Test ✅
Runtime: 0m23s. Note: PocketTTS uses CoreML MLState (macOS 15) KV cache + Mimi streaming state. CI VM lacks a physical GPU — audio quality and performance may differ from Apple Silicon.
Kokoro TTS Smoke Test ✅
Runtime: 0m51s. Note: Kokoro TTS uses CoreML flow matching + Vocos vocoder. CI VM lacks a physical ANE — performance may differ from Apple Silicon.
Parakeet EOU Benchmark Results ✅ — Status: Benchmark passed
Performance Metrics
Streaming Metrics
Test runtime: 0m44s • 04/21/2026, 01:10 PM EST
RTFx = Real-Time Factor (higher is better) • Processing includes: model inference, audio preprocessing, state management, and file I/O
Qwen3-ASR int8 Smoke Test ✅
Performance Metrics
Runtime: 3m27s. Note: CI VM lacks a physical GPU — CoreML MLState (macOS 15) KV cache produces degraded results on virtualized runners. On Apple Silicon: ~1.3% WER / 2.5x RTFx.
VAD Benchmark Results — Performance Comparison
Dataset Details
✅: Average F1-Score above 70%
ASR Benchmark Results ✅ — Status: All benchmarks passed
Parakeet v3 (multilingual)
Parakeet v2 (English-optimized)
Streaming (v3)
Streaming (v2)
Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming. 25 files per dataset • Test runtime: 5m56s • 04/21/2026, 01:09 PM EST
RTFx = Real-Time Factor (higher is better) • Calculated as: total audio duration ÷ total processing time
Expected RTFx performance on physical M1 hardware: ~28x (clean), ~25x (other)
Testing methodology follows the HuggingFace Open ASR Leaderboard
✅ Japanese ASR Benchmark Results (CTC) — Status: Passed
Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly. View benchmark log
Speaker Diarization Benchmark Results
Speaker Diarization Performance — evaluating "who spoke when" detection accuracy
Diarization Pipeline Timing Breakdown — time spent in each stage of speaker diarization
Speaker Diarization Research Comparison — research baselines typically achieve 18-30% DER on standard datasets
Note: RTFx shown above is from the GitHub Actions runner. On Apple Silicon with ANE:
🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 45.1s diarization time • Test runtime: 2m 7s • 04/21/2026, 01:10 PM EST
Sortformer High-Latency Benchmark Results — ES2004a Performance (30.4s latency config)
Sortformer High-Latency • ES2004a • Runtime: 3m 12s • 2026-04-21T17:09:44.168Z
Offline VBx Pipeline Results
Speaker Diarization Performance (VBx Batch Mode) — optimal clustering with Hungarian algorithm for maximum accuracy
Offline VBx Pipeline Timing Breakdown — time spent in each stage of batch diarization
Speaker Diarization Research Comparison — offline VBx achieves competitive accuracy with batch processing
Pipeline Details:
🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 127.0s processing • Test runtime: 2m 12s • 04/21/2026, 01:16 PM EST
Complete baseline benchmark results for 24 languages (2,400 samples total):
- Establishes baseline WER/CER before the script filtering implementation
- Polish: 8.98% WER (target for issue #512 improvement)
- All languages maintain real-time performance (avg 62.6x RTFx)
- Best: Italian 3.46% WER; worst: Greek 38.91% WER

Related to issue #512 (Polish Cyrillic confusion) and PR #515 (script filtering). Next step: re-run on the feat/script-filtering-issue-512 branch to measure improvement.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
**Issue 1: Language parameter silently dropped for long audio (CRITICAL)**
- Thread the language parameter through ChunkProcessor.process() and transcribeChunk()
- Script filtering now works correctly for audio >15 seconds
- Before: ChunkProcessor ignored language, disabling filtering for real-world recordings
- After: the language parameter flows through the full chunked transcription pipeline

**Issue 2: SentencePiece word boundary marker not handled (CRITICAL)**
- Strip ▁ (U+2581 LOWER ONE EIGHTH BLOCK) before script detection
- This character prefixes most vocabulary tokens but doesn't indicate script
- Before: the allSatisfy() check failed because ▁ falls outside all Unicode ranges
- After: strip the marker first, then check the actual content

**Issue 3: Token confidence not updated after filtering (MEDIUM)**
- Update the `score` variable with the filtered token's logit in both the main loop and the inner loop
- Before: the stale probability from the original top-1 token persisted through results
- After: confidence reflects the actually selected token after script filtering

**Issue 4: Missing unit tests (HIGH)**
- Add comprehensive ScriptDetectionTests with 28 tests covering:
  - Script property tests for the Language enum
  - Basic script matching (Latin, Cyrillic, mixed scripts)
  - SentencePiece boundary marker handling
  - Polish language support (issue #512 specific tests)
  - Punctuation and whitespace handling
  - filterTopK() functionality and edge cases
  - Unicode range validation
- All tests pass

**Additional improvements:**
- Improved Cyrillic script detection to reject Latin letters while allowing punctuation, spaces, and digits (prevents "hello" matching Cyrillic)
- Fixed existing TdtRefactoredComponentsTests to use the new TdtJointDecision signature

Fixes identified by Devin AI in PR review #4094445719.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
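The Issue 2 fix can be sketched in a few lines. This is an illustrative standalone version with hypothetical helper names, not the shipped filter (which lives in ScriptDetection.swift and uses finer-grained ranges):

```swift
import Foundation

// Hypothetical sketch of the Issue 2 fix: strip the SentencePiece
// word-boundary marker U+2581 before script detection, since it prefixes
// most vocabulary tokens but falls outside every Latin/Cyrillic range.
func stripBoundaryMarker(_ token: String) -> String {
    token.replacingOccurrences(of: "\u{2581}", with: "")
}

func isLatinToken(_ token: String) -> Bool {
    let cleaned = stripBoundaryMarker(token)
    // A token that is empty after stripping is script-neutral, not a mismatch.
    guard !cleaned.isEmpty else { return true }
    // Coarse check: ASCII through Latin Extended-B. The real filter uses
    // several distinct ranges rather than one span.
    return cleaned.unicodeScalars.allSatisfy { (0x0020...0x024F).contains($0.value) }
}
```

The key point is the order of operations: strip the marker, treat an empty remainder as neutral, then range-check what's left.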
Force-pushed 14d1926 to bbf98df
Force-pushed 923412f to 7721530
…ernal `matches` and `filterTopK` only make sense when you have raw top-K ids/logits plus a CoreML vocab dictionary — i.e. inside the TDT decoder path. No external consumer has a realistic use case, so drop them from the public surface. `Language` and `Script` stay public since they're on the transcribe APIs.

Also hoist the SentencePiece word-boundary marker (U+2581) to a named constant, `sentencePieceBoundary`, instead of leaving the scalar literal inline — reads cleaner and is easier to find if another path ever needs the same stripping.
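The internal walk `filterTopK` performs can be sketched like this (illustrative names and signature, not the shipped API):

```swift
// Keep top-1 when it already matches the expected script; otherwise walk the
// top-K candidates and take the best-scoring one that does. Names here are
// illustrative stand-ins for the decoder-internal path.
func chooseToken(
    top1: Int,
    topKIds: [Int],
    topKLogits: [Float],
    vocab: [Int: String],
    matchesScript: (String) -> Bool
) -> Int {
    // Common path: top-1 is already the right script, so it is untouched.
    if let text = vocab[top1], matchesScript(text) { return top1 }
    var best: (id: Int, logit: Float)?
    for (id, logit) in zip(topKIds, topKLogits) {  // zip guards length mismatch
        guard let text = vocab[id], matchesScript(text) else { continue }
        if best == nil || logit > best!.logit { best = (id, logit) }
    }
    // No right-script candidate in top-K: keep the original token.
    return best?.id ?? top1
}
```

Both helpers only fire given raw top-K arrays plus a vocab dictionary, which is why dropping them from the public surface loses nothing for external callers.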
Trim to a minimal 24-language FLEURS regression runner:
- Drop the parallel LANG_NAMES array and stale WER-tier comments.
- Drop caffeinate (only inhibits sleep under AC power anyway).
- Drop the verify_models step; the CLI auto-downloads on first use.
- Soft-fail per language with `|| log "WARN"` so one bad language doesn't abort the suite.
- Use printf (not echo) so tabs render correctly on macOS.
- Write a machine-readable summary CSV alongside the per-language JSONs.
- Preflight python3 check and surface swift-build failures.
- SAMPLES env override for quick smoke tests.
Address code-review findings on TdtDecoderV3:
- #10 (correctness): the final-chunk flushing loop now runs `applyScriptFilter` on each emitted token. Previously the last few tokens of any utterance bypassed script filtering, which could leak wrong-script candidates in multilingual runs where the tail fell in this path.
- #28: update the now-stale "No filtering at decoder level" comment to reflect that script filtering happens per-step above.
- #9: remove dead `iterCount` / `innerLoopCount` counters.
- #15: delete the unused `MLMultiArray.l2Normf()` helper (kept `shapeString`, which is used by TdtModelInference for error output).
Address review feedback on `TdtJointDecision`:
- Clarify the scale mismatch between `probability` (full-vocab softmax) and `topKLogits` (raw pre-softmax logits). Consumers that want a comparable probability should go through `TokenLanguageFilter.filterTopK`, which returns the top-K softmax.
- Add an `assert(topKIds?.count == topKLogits?.count)` in the init to catch schema drift if a future model ever returns mismatched arrays.
- Add `: Sendable` conformance. All stored properties are value types, so conformance synthesizes for free and matches the surrounding `TdtDecoderV3: Sendable` posture.
- Sharpen the rationale comment on why the top-K fields aren't given stored-property defaults: Swift excludes stored `let` properties with default values from the synthesized memberwise initializer, because a `let` with a default is a compile-time-initialized constant, not a parameter.
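The memberwise-initializer rule in the last bullet is easy to demonstrate with a toy struct (the real `TdtJointDecision` has different fields; this is only an illustration):

```swift
// A stored `let` with a default value is a compile-time constant, so Swift
// drops it from the synthesized memberwise initializer entirely.
struct Decision: Sendable {
    let tokenId: Int
    let topKIds: [Int]?      // no default: stays a memberwise-init parameter
    // let topKIds: [Int]? = nil   // with a default, this `let` would vanish
    //                             // from the memberwise init altogether
}

// Both parameters must be passed; `topKIds` is not defaulted.
let d = Decision(tokenId: 7, topKIds: nil)
```

This is why keeping the top-K fields default-free preserves the ergonomic memberwise init while still allowing explicit `nil`.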
Address review feedback on the joint top-K path:
- Thread `needsTopK: Bool` through `runJointPrepared`. TdtDecoderV3 computes it once as `language != nil`. When no language is provided (the common path), v3 joint runs no longer allocate K-length Swift arrays on every decoded step just to drop them.
- Factor the per-array extraction into `extractInt32Array` / `extractFloat32Array` helpers. Both validate the CoreML dtype before the `bindMemory` cast, so an export that switches to Int64 / Float16 fails loudly instead of silently reinterpreting bit patterns.
- Enforce top-K outputs as a present-or-absent pair with matching lengths. Catches export-schema drift before TokenLanguageFilter has to defend against it downstream.
Address code-review feedback on `AsrManager.tdtDecodeWithTimings` and the public `transcribe(...)` overloads:
- Split the `.v3, .tdtJa` switch arm. Route `.tdtJa` independently and drop the `language` / `vocabulary` forwarding: the Japanese model emits Kanji / Hiragana / Katakana tokens, none of which are covered by the current Latin/Cyrillic `TokenLanguageFilter`. Propagating a hint there would either no-op or — worse — silently filter out valid Japanese output. Log at debug when a caller-supplied hint is dropped.
- Log at debug on the `.v2, .tdtCtc110m` path when `language` is non-nil. Previously the hint was swallowed silently.
- Drop the `language != nil ? vocabulary : nil` conditional on the `.v3` path. `TdtDecoderV3.applyScriptFilter` already short-circuits when `language` is nil; forwarding the vocab unconditionally is clearer and has no runtime cost.
- Document the `language:` parameter on the `URL`, `transcribeDiskBacked`, and `[Float]` public overloads, and note the silent-ignore behavior on the already-documented buffer overload.
…back v3 now loads `JointDecisionv3.mlmodelc` exclusively. The opportunistic try/fallback to `JointDecision.mlmodelc` was transitional scaffolding for the HF upload period and is no longer needed now that the v3 joint is stable on the remote.
- ModelNames: add `requiredModelsV3` using `jointV3File`; keep `requiredModels` for v2/legacy; split `.parakeet` vs `.parakeetV2` in `getRequiredModelNames`.
- AsrModels: `getModelFileNames(.v3)` returns `jointV3File`; `getRequiredModels(.v3)` returns `requiredModelsV3`; the joint-load path becomes a single unconditional download (dies loud if missing).
- AsrManager: drop the "requires JointDecisionv3.mlmodelc" hedge in `language:` doc comments — always present for v3.

Existing v3 users with only the legacy `JointDecision.mlmodelc` cached will fetch `JointDecisionv3.mlmodelc` on next launch (a single ~50MB file; the other models remain cache hits).
Follow-ups from the AsrModels.swift review on the v3-joint-only PR:
- `inferredVersion`: add `.tdtJa` to `knownVersions` so `modelsExist(at:)` without an explicit version returns the right answer for Japanese model directories. `.ctcZhCn` is intentionally omitted — it's rejected at the top of `load(...)` and uses a dedicated loader.
- `getModelFileNames(.v3)`: correct the "zero-cost when disabled" comment. Top-K is always computed in the CoreML graph; only the Swift-side extraction is gated via `needsTopK`.
- v3 joint-load failure: append a diagnostic hint pointing at stale caches so users who previously had only the legacy `JointDecision.mlmodelc` can recover.
`846924a1d` removed CTC-only inference for Japanese, but the comments on `parakeetJa` and the `TDTJa` enum still described the repo as containing "both CTC and TDT models" and referred to the files as "newly converted ... uploaded to CTC repo." Only the TDT path exists now; the CTC preprocessor+encoder files from the repo are reused as the acoustic frontend for the TDT decoder+joint. Comment-only; no behavior change.
The `parakeet` case predates the v2/v3 split and was the last remaining ambiguous name in the `Repo` enum — every other Parakeet case is explicitly versioned (`parakeetV2`, `parakeetCtc110m`, etc.). Now that `.v3` is the only repo binding to this case and has a distinct required model set (JointDecisionv3), the implicit "parakeet means v3" convention is more confusing than helpful. Renames the enum case across Sources/ and Tests/. The HuggingFace remote path is unchanged — this is Swift-side naming only.
FleursBenchmark.mapToLanguageEnum previously returned nil for cs_cz, sk_sk, hr_hr, sl_si, ro_ro — silently skipping script-aware filtering for the exact Latin-script languages that the Language enum flagged as "prone to Cyrillic confusion." The corresponding enum cases already exist; this wires the benchmark up to use them.
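The fix amounts to adding the missing cases to the locale-code switch. A sketch, with a hypothetical enum subset standing in for the real `Language` cases:

```swift
// Illustrative subset; the shipped Language enum has more cases.
enum Language: String {
    case polish, czech, slovak, croatian, slovenian, romanian
}

// FLEURS locale codes for the Latin-script Slavic languages now resolve to
// their existing enum cases instead of nil.
func mapToLanguageEnum(_ fleursCode: String) -> Language? {
    switch fleursCode {
    case "pl_pl": return .polish
    case "cs_cz": return .czech      // previously nil: filtering silently skipped
    case "sk_sk": return .slovak
    case "hr_hr": return .croatian
    case "sl_si": return .slovenian
    case "ro_ro": return .romanian
    default: return nil              // unknown code: no script hint applied
    }
}
```

A nil return is still the correct behavior for languages without a declared script; the bug was returning nil for languages that had one.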
Force-pushed 96d2a90 to 379cd89
…tModel
- Drop the single-use `MLMultiArray.shapeString` extension; inline the 'x'-joined shape string directly at the one error-message call site in TdtModelInference.
- Rename the guard-let binding `unwrappedJointModel` -> `jointModel` in `AsrModels.load`; the 'unwrapped' prefix was noise given the variable is a plain non-optional MLModel after the guard.
…ogit, inline softmax
- matches(): accept U+0300–U+036F (Combining Diacritical Marks) as Latin so NFD-decomposed forms like 'e' + U+0301 aren't rejected. SentencePiece vocabs usually emit precomposed characters, but handling both keeps the filter robust across export variants.
- matches(): empty / boundary-only tokens now return true (script-neutral) instead of false. filterTopK's argmax can then rank them alongside real candidates rather than skipping them outright.
- filterTopK(): use `bestIdx < 0 || logit > bestLogit` so the first in-script candidate wins unconditionally. Previously a match with logit == -.infinity never beat the -.infinity sentinel, and filterTopK returned nil despite having a valid candidate.
- Inline the private softmaxProbability helper into its sole caller; drop the helper.

Tests: flipped testBoundaryMarkerOnly / testEmptyString to assert the new neutral behavior and added testCombiningDiacriticsRange + testFilterTopKPicksNegativeInfinityLogit. 40/40 pass.
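The argmax seed and the inlined softmax combine into roughly this shape (illustrative helper, not the shipped `filterTopK`):

```swift
import Foundation

// Sketch of two behaviors from the commit above: the `bestIdx < 0` argmax
// seed that lets a candidate with a -infinity logit win, and the inlined
// softmax with the max-logit shift so expf() cannot overflow.
func pickBest(logits: [Float], matches: (Int) -> Bool) -> (index: Int, probability: Float)? {
    var bestIdx = -1
    var bestLogit = -Float.infinity
    for (i, logit) in logits.enumerated() where matches(i) {
        // The first matching candidate wins unconditionally: -inf > -inf is
        // false, so a bare -infinity sentinel comparison would skip it.
        if bestIdx < 0 || logit > bestLogit {
            bestIdx = i
            bestLogit = logit
        }
    }
    guard bestIdx >= 0 else { return nil }
    // Softmax over the candidate set, shifted by the max logit for stability.
    let m = logits.max() ?? 0
    let exps = logits.map { expf($0 - m) }
    return (bestIdx, exps[bestIdx] / exps.reduce(0, +))
}
```

With all-finite logits the shift changes nothing numerically; it only prevents `expf` overflow on large values.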
Empirical testing against the 7 Polish audio samples in issue #512 (tajchert's Google Drive folder) showed the last-chunk flush loop emits 0 script-filter swaps, while the main loop fires 7 times and the inner silence-skip loop fires 21 times across the same samples. The flush loop is blank/punct-dominated on short utterances, so the defensive filter call added during code review was dead code. Removing it and replacing the successor comment with an honest note about the empirical finding.
Add Documentation/ASR/TokenLanguageFilter.md covering: the cross-script leakage problem (issue #512), where the filter runs in the TDT v3 decoder, the asymmetric Unicode guards (Latin vs Cyrillic), the top-K vs full-vocab softmax caveat, and the handled edge cases including the -infinity logit argmax. Trim TokenLanguageFilter.swift inline comments from 221 -> 149 lines, keeping the non-obvious ones (asymmetric guards, -inf edge case, top-K probability caveat) and dropping idiom restatements and redundant doc.
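The asymmetric Latin-vs-Cyrillic guards the new doc describes can be sketched as follows. The ranges come from this PR's description; the shipped `matches(_:script:)` may differ in detail:

```swift
// Latin accepts a set of Unicode ranges; Cyrillic rejects ASCII letters
// outright (so "hello" can't pass) but tolerates digits, punctuation,
// and spaces. Illustrative sketch, not the shipped code.
enum Script { case latin, cyrillic }

func scalarMatches(_ value: UInt32, script: Script) -> Bool {
    switch script {
    case .latin:
        let ranges: [ClosedRange<UInt32>] = [
            0x0020...0x007F,  // ASCII
            0x00A0...0x00FF,  // Latin-1 Supplement
            0x0100...0x017F,  // Latin Extended-A (Polish ż, ś, ł)
            0x0180...0x024F,  // Latin Extended-B (Romanian ș, ț)
            0x0300...0x036F,  // Combining Diacritical Marks (NFD forms)
            0x1E00...0x1EFF,  // Latin Extended Additional
        ]
        return ranges.contains { $0.contains(value) }
    case .cyrillic:
        // Reject ASCII letters first...
        if (0x41...0x5A).contains(value) || (0x61...0x7A).contains(value) {
            return false
        }
        // ...then allow the rest of ASCII (digits, punctuation, space)
        // plus the Cyrillic block itself.
        return (0x0020...0x007F).contains(value) || (0x0400...0x04FF).contains(value)
    }
}
```

The asymmetry is deliberate: Latin's ranges exclude Cyrillic by construction, while Cyrillic needs the explicit letter rejection because shared ASCII punctuation and digits must remain acceptable.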
Matches the TokenLanguageFilter type name (following the earlier ScriptDetection -> TokenLanguageFilter rename in 2794797).
… + trim doc

Drop the 'apply' prefix and compress the helper's doc comment: the behavior paragraph restated the guard chain, and the double-clamp note covered a one-line defensive call. Keep the blank-token rationale — the `label != blankId` guard is load-bearing and the 'why' isn't obvious.
```swift
if value >= 0x0020 && value <= 0x007F {
    if (value >= 0x41 && value <= 0x5A) || (value >= 0x61 && value <= 0x7A) {
        return false
    }
    return true
}
```
🟡 Nested if statement violates AGENTS.md control-flow rule
AGENTS.md mandates: "Nested if statements should be absolutely avoided." The Cyrillic branch of matches at Sources/FluidAudio/Shared/TokenLanguageFilter.swift:87-91 contains a nested if — the outer checks for the ASCII range and the inner checks for ASCII letters. This can be flattened by checking letters first (they're a subset of the ASCII range), then accepting the rest of ASCII separately.
Suggested change:

```diff
-if value >= 0x0020 && value <= 0x007F {
-    if (value >= 0x41 && value <= 0x5A) || (value >= 0x61 && value <= 0x7A) {
-        return false
-    }
-    return true
-}
+if (value >= 0x41 && value <= 0x5A) || (value >= 0x61 && value <= 0x7A) {
+    return false
+}
+if value >= 0x0020 && value <= 0x007F {
+    return true
+}
```
"cross-script", "out-of-script", "in-script" were jargon. The mechanism is actually "is this token from the right alphabet for the target language?" — plain English works better for reviewers and future maintainers. - "cross-script leakage" -> "wrong-language leakage" - "wrong-script candidates" -> "wrong-language tokens" - "out-of-script" -> "wrong-language" - "in-script" -> "right-language" - "writing script" (user-facing) -> "alphabet" Kept as-is: `Script` enum, `script:` parameter, `matches(_:script:)` signature — implementation details where "script" is the correct Unicode term. "Latin-script Slavic" also kept as legitimate linguistic grouping. No behavior change. Tests pass (40/40).
```swift
label = filtered.tokenId
score = TdtDurationMapping.clampProbability(filtered.probability)
```
🔴 filterTopK can return blankId as replacement, silently converting speech tokens to silence
The doc comment at TdtDecoderV3.swift:564-566 states "Blanks are excluded from replacement" but neither tokenLanguageFilter nor filterTopK actually excludes the blank token from the candidate pool. The guard at line 576 (label != blankId) only prevents blank from being the trigger — it doesn't prevent blank from being the replacement.
The blank token's vocabulary text is typically empty or a boundary marker, so TokenLanguageFilter.matches returns true at TokenLanguageFilter.swift:69 (guard !cleanedText.isEmpty else { return true }), making blank a valid script-neutral candidate in filterTopK. If blank has the highest logit among matching candidates, it wins the argmax and gets returned.
This is especially harmful in the inner blank-processing loop (TdtDecoderV3.swift:333-345): after the filter sets label = blankId, blankMask becomes true, and advanceMask = activeMask && blankMask keeps the loop running — the speech token is swallowed and the decoder continues advancing through frames as if they were silence, instead of exiting the inner loop and emitting the token. The intended nil-return fallback path ("no right-language candidates → keep original token") is defeated because blank passes matches as script-neutral.
Suggested change:

```diff
-label = filtered.tokenId
-score = TdtDurationMapping.clampProbability(filtered.probability)
+guard filtered.tokenId != blankId else { return }
+label = filtered.tokenId
+score = TdtDurationMapping.clampProbability(filtered.probability)
```
Fixes #512.
TL;DR
Parakeet TDT v3 transcribed short Polish utterances like "Wpisz Google kropka com" as Cyrillic (`Впиш Гугл к ком.`) because the joint decoder's top-1 pick drifts to Cyrillic tokens under low acoustic confidence. This PR adds an opt-in script filter: when a caller passes `language: .polish` (or any other language with a declared script), the decoder rejects top-1 if it's the wrong script and walks top-K to the highest-probability candidate matching the expected script.

- `language:` defaults to `nil` — zero behavior change for existing callers.
- `JointDecisionv3.mlmodelc` (exposes top-K outputs). Auto-downloaded from HuggingFace alongside the other v3 files; falls back to standard argmax when absent.

Empirical validation — reporter's own audio
Samples pulled via `gdown --folder <link-from-issue-#512-comment>` from @tajchert's Drive folder. `JointDecisionv3.mlmodelc` is loaded in both columns — this isolates the Swift filter as the mechanism, not a model swap.

| `language: nil` (current) | `language: .polish` (this PR) |

6/6 short samples flip Cyrillic → Latin.
`pl_complex` was never broken (long context → high joint confidence → no drift) and is unchanged.

Scope & limitations (important — please don't overclaim)
This PR fixes the script the tokens are drawn from. It does NOT fix per-word acoustic accuracy.
| `language: nil` | `language: .polish` |

The residual errors — `Wpisz` → `Wpish` / `Wpis`, `kropka` → `Croca` / dropped — are Parakeet TDT v3 acoustic weaknesses on short Polish commands. No amount of output post-processing can turn `Wpish` into `Wpisz`; that needs better acoustic modeling, a Polish LM rescorer, or more training data. Out of scope here.

What users actually get by merging:
- `language:`

What users do not get:

- `Language` enum (Greek, Maltese, Hungarian, Turkish, Baltic — their characters fit the Latin Unicode ranges but aren't exposed; easy follow-up)

Implementation
New

`Sources/FluidAudio/Shared/ScriptDetection.swift` (new, +112)
- `public enum Language` — 13 Latin (en, es, fr, de, it, pt, ro, pl, cs, sk, sl, hr, bs) + 5 Cyrillic (ru, uk, be, bg, sr)
- `public enum Script { case latin, cyrillic }`
- `matches(_:script:)` over Unicode ranges: ASCII (0x20–0x7F), Latin-1 (0xA0–0xFF), Latin Extended-A (0x100–0x17F), Latin Extended-B (0x180–0x24F — Romanian ș/ț), Latin Extended Additional (0x1E00–0x1EFF — Vietnamese), Cyrillic (0x400–0x4FF). Strips the SentencePiece boundary marker U+2581 before checking.
- `filterTopK(topKIds:topKLogits:vocabulary:preferredScript:) -> (tokenId, probability)?` — returns the highest-probability top-K candidate matching the target script; probability via softmax over the top-K subset with the max-logit stability trick; guarded against top-K array length mismatch.

Changed

- `TdtJointDecision` — optional `topKIds` / `topKLogits` fields (populated by JointDecisionv3 only)
- `TdtDecoderV3` — the script filter runs only when top-1 is already the wrong script; both decode sites feed `filtered.probability` (a real [0,1]) into `TdtDurationMapping.clampProbability`, not raw logits
- `AsrManager.transcribe(...)` — `language: Language? = nil` plumbed through all three overloads: `[Float]`, `URL`, `AVAudioPCMBuffer`
- `AsrModels` + `ModelNames` — the `requiredModelsV3` set includes `JointDecisionv3.mlmodelc` so the download utility fetches it on fresh installs and also backfills it for existing users on next `.v3` load
- fluidaudio CLI: `transcribe <file> --language {en|pl|cs|sk|sl|hr|bs|ro|es|fr|de|it|pt|ru|uk|be|bg|sr}`

How to try it
Model dependency

`JointDecisionv3.mlmodelc` must be present in `FluidInference/parakeet-tdt-0.6b-v3-coreml` on HuggingFace. It exposes `top_k_ids` / `top_k_logits` outputs (K=64 in our export) alongside the standard argmax. When absent, `AsrModels` falls back to `JointDecision.mlmodelc` and the script filter becomes a no-op — backward compatible.

Cache-upgrade verified: removed `JointDecisionv3.mlmodelc` from a populated cache, re-ran `--language pl`; the file was auto-fetched and Polish output was Latin. Existing users pick up the fix on next `.v3` load without manual intervention.

Review notes / risky bits
- `matches(_, script:)`. When top-1 is already correct, nothing is changed — so we can't regress the common case.
- `filterTopK` uses `min(topKIds.count, topKLogits.count)`. If CoreML output arrays ever diverge, we iterate the common prefix instead of crashing.

Tests
- `ScriptDetectionTests` — 37 tests: Unicode range coverage (Latin-1 / Extended-A / Extended-B / Extended Additional / Cyrillic), SentencePiece boundary-marker stripping, `filterTopK` happy path, length-mismatch guard, probability-range invariant, Czech/Slovak/Slovenian/Croatian/Romanian token coverage, cross-script rejection
- `swift format lint` clean on all touched files

Checklist

- Builds (`swift build`, `swift build -c release`)
- `swift format lint` clean on touched files
- `ScriptDetectionTests` 37/37 pass

Follow-ups (not blocking)
- `Script.greek` for `el_gr` (separate Unicode range)