fix(benchmark): repair 3 pre-existing script/download bugs#534
fix(benchmark): repair 3 pre-existing script/download bugs#534Alex-Wengg merged 1 commit intomainfrom
Conversation
1. Scripts/parakeet_subset_benchmark.sh: update Japanese TDT folder reference from parakeet-tdt-ja to parakeet-ja to match Repo.parakeetJa.folderName (renamed in 4ef33f0). 2. ParakeetEouCommand: default to Application Support cache directory. The legacy default wrote to $CWD/Models/<chunk>/<chunk>/ (double-nested and relative to CWD) while DownloadUtils wrote to $CWD/Models/parakeet- eou-streaming/<chunk>/, causing a load-path mismatch and silent failure. --use-cache kept as a no-op for backward compatibility. 3. DatasetDownloader.downloadEarnings22KWS: switch dataset id from the discontinued argmaxinc/earnings22-kws-golden to argmaxinc/ contextual-earnings22. HF consolidated the dataset; the old id returns 404 from the Datasets-Server API. New dataset has identical feature schema (audio, file_id, text, dictionary). Validated end-to-end via Scripts/parakeet_subset_benchmark.sh --download: all 3 previously broken paths now complete without error.
| // Determine models path. Default: Application Support cache directory | ||
| // (matches how every other CoreML model in FluidAudio is stored). | ||
| // `--use-cache` is retained as a no-op for backward compatibility. | ||
| _ = useCache |
There was a problem hiding this comment.
🟡 --use-cache not actually a no-op: still triggers unnecessary download path
The comment on line 101 says --use-cache is retained as a no-op, and _ = useCache on line 102 is written as if to suppress an unused-variable warning. However, useCache is still referenced in the download condition on line 113: if download || useCache || !FileManager.default.fileExists(atPath: modelsUrl.path). This means passing --use-cache still unconditionally enters the download block (logging "Downloading models to: ..." and calling downloadModels), even when models already exist at the target path. While downloadModels (ParakeetEouCommand.swift:184-192) has its own file-existence check and will return early, the behavior contradicts the stated intent of making --use-cache a no-op. The _ = useCache is misleading — it signals the value is intentionally discarded, but it's actually still used.
Prompt for agents
The intent of this PR is to make --use-cache a no-op for backward compatibility (since the default path is now always Application Support). However, useCache is still referenced on line 113 in the download condition: `if download || useCache || !FileManager.default.fileExists(atPath: modelsUrl.path)`. This means --use-cache still has an effect: it unconditionally enters the download code path.
To fix: remove `useCache` from the condition on line 113 so it reads `if download || !FileManager.default.fileExists(atPath: modelsUrl.path)`. Keep the `_ = useCache` on line 102 to suppress the unused variable warning, since the variable is still set by the argument parser on line 68.
File: Sources/FluidAudioCLI/Commands/ASR/Parakeet/Streaming/ParakeetEouCommand.swift
Line 113: change `if download || useCache || !FileManager.default.fileExists(atPath: modelsUrl.path)` to `if download || !FileManager.default.fileExists(atPath: modelsUrl.path)`
Was this helpful? React with 👍 or 👎 to provide feedback.
Kokoro TTS Smoke Test ✅
Runtime: 0m31s Note: Kokoro TTS uses CoreML flow matching + Vocos vocoder. CI VM lacks physical ANE — performance may differ from Apple Silicon. |
Parakeet EOU Benchmark Results ✅Status: Benchmark passed Performance Metrics
Streaming Metrics
Test runtime: 0m58s • 04/21/2026, 03:50 AM EST RTFx = Real-Time Factor (higher is better) • Processing includes: Model inference, audio preprocessing, state management, and file I/O |
PocketTTS Smoke Test ✅
Runtime: 0m47s Note: PocketTTS uses CoreML MLState (macOS 15) KV cache + Mimi streaming state. CI VM lacks physical GPU — audio quality and performance may differ from Apple Silicon. |
VAD Benchmark ResultsPerformance Comparison
Dataset Details
✅: Average F1-Score above 70% |
Speaker Diarization Benchmark ResultsSpeaker Diarization PerformanceEvaluating "who spoke when" detection accuracy
Diarization Pipeline Timing BreakdownTime spent in each stage of speaker diarization
Speaker Diarization Research ComparisonResearch baselines typically achieve 18-30% DER on standard datasets
Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:
🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 62.1s diarization time • Test runtime: 3m 21s • 04/21/2026, 03:59 AM EST |
Sortformer High-Latency Benchmark ResultsES2004a Performance (30.4s latency config)
Sortformer High-Latency • ES2004a • Runtime: 4m 11s • 2026-04-21T08:01:24.683Z |
Qwen3-ASR int8 Smoke Test ✅
Performance Metrics
Runtime: 5m39s Note: CI VM lacks physical GPU — CoreML MLState (macOS 15) KV cache produces degraded results on virtualized runners. On Apple Silicon: ~1.3% WER / 2.5x RTFx. |
Offline VBx Pipeline ResultsSpeaker Diarization Performance (VBx Batch Mode)Optimal clustering with Hungarian algorithm for maximum accuracy
Offline VBx Pipeline Timing BreakdownTime spent in each stage of batch diarization
Speaker Diarization Research ComparisonOffline VBx achieves competitive accuracy with batch processing
Pipeline Details:
🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 159.0s processing • Test runtime: 2m 36s • 04/21/2026, 04:03 AM EST |
ASR Benchmark Results ✅Status: All benchmarks passed Parakeet v3 (multilingual)
Parakeet v2 (English-optimized)
Streaming (v3)
Streaming (v2)
Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming 25 files per dataset • Test runtime: 7m8s • 04/21/2026, 04:04 AM EST RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time Expected RTFx Performance on Physical M1 Hardware:• M1 Mac: ~28x (clean), ~25x (other) Testing methodology follows HuggingFace Open ASR Leaderboard |
Summary
Three unrelated pre-existing bugs surfaced while validating PR #515. All of them block
Scripts/parakeet_subset_benchmark.sh --downloadfrom succeeding, but none are related to the v3 script-filtering work. Consolidating into one PR since each fix is ~1–3 lines.1. Japanese TDT folder-name mismatch
Scripts/parakeet_subset_benchmark.shverifies the Japanese TDT model at$MODELS_DIR/parakeet-tdt-ja/, but the folder was renamed toparakeet-jain 4ef33f0 (Repo.parakeetJa.folderName = "parakeet-ja"). Result:verify_assets()always reported missing assets even on a fully provisioned machine. One-line rename to match.2. EOU streaming CLI writes to wrong path
ParakeetEouCommandhad a default /--use-cachesplit where the default branch produced$CWD/Models/<chunk>/<chunk>/(double-nested, relative to CWD) as the load path, whiledownloadModels()calleddeletingLastPathComponent().deletingLastPathComponent()thenDownloadUtils.downloadRepo(repo, to:)which appendedfolderName = "parakeet-eou-streaming/<chunk>". Net effect: files landed at$CWD/Models/parakeet-eou-streaming/<chunk>/whileloadModels()looked at$CWD/Models/<chunk>/<chunk>/— model load failed silently.Unified on Application Support (matches every other CoreML model in FluidAudio).
--use-cacheretained as a no-op flag for backward compatibility.3. earnings22-kws dataset 404
HuggingFace consolidated
argmaxinc/earnings22-kws-goldenintoargmaxinc/contextual-earnings22. The old id now returns 404 from the Datasets-Server REST API (no redirect follow). The new dataset has the same feature schema (audio,file_id,text,dictionary, ...), so swapping the id is sufficient — no downstream consumer changes needed.Test plan
Ran
Scripts/parakeet_subset_benchmark.sh --downloadend-to-end:verify_assetscorrectly resolvesparakeet-ja/(all 5 expected files present)Models downloaded to ~/Library/Application Support/FluidAudio/Models/parakeet-eou-streaming/320ms, 0.00% WER on warmup fileswift buildpassesOut of scope but observed (pre-existing, unrelated):
ctc-earnings-benchmark --auto-downloaddoes not actually auto-download CTC-110m model