
Whisper STT: re-architect for real-time or defer to post-processing #43

@BraedenBDev

Description


Status

Whisper on-device STT is functional but too slow for production. Current architecture:

  • 3-second audio chunks buffered via a ScriptProcessorNode (a deprecated API)
  • Single-threaded WASM inference, roughly 3-5 s per chunk
  • Chrome MV3's CSP blocks multi-threaded ONNX Runtime (its threaded build spawns workers from blob: URLs, which script-src disallows)
  • Net result: ~6-8 s latency per utterance, too slow for real-time intent matching
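The chunk-buffering step above can be sketched as a pure helper. This is illustrative only: makeChunker and its parameters are hypothetical names, not the extension's actual code. In the real pipeline, push would be fed from a ScriptProcessorNode's onaudioprocess handler.

```javascript
// Accumulate incoming Float32 sample blocks into fixed-size chunks and
// invoke onChunk each time a full chunk (e.g. 3 s of audio) is ready.
function makeChunker(sampleRate, chunkSeconds, onChunk) {
  const target = Math.round(sampleRate * chunkSeconds);
  let buffer = new Float32Array(target);
  let filled = 0;
  return function push(samples) {
    let offset = 0;
    while (offset < samples.length) {
      const take = Math.min(target - filled, samples.length - offset);
      buffer.set(samples.subarray(offset, offset + take), filled);
      filled += take;
      offset += take;
      if (filled === target) {
        onChunk(buffer);                  // hand a full chunk to Whisper
        buffer = new Float32Array(target); // fresh buffer for the next chunk
        filled = 0;
      }
    }
  };
}

// Example: 16 kHz mono, 3 s chunks → a chunk completes at 48 000 samples
const chunks = [];
const push = makeChunker(16000, 3, (c) => chunks.push(c.length));
push(new Float32Array(40000));
push(new Float32Array(10000)); // crosses the 48 000-sample boundary
// chunks is now [48000]
```

Samples left over after a full chunk stay buffered, so no audio is dropped at chunk boundaries.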

The Web Speech API (backed by Google's server-side recognizer in Chrome) handles real-time transcription well, but it requires an internet connection.
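For reference, the real-time path can be wired roughly as below. This is a hedged sketch: the SpeechRecognition constructor is injected so the wiring is testable outside a browser (in Chrome it would be window.webkitSpeechRecognition), and onText is a hypothetical callback, not an existing API.

```javascript
// Configure a SpeechRecognition instance for continuous listening with
// interim (partial) results, forwarding each hypothesis to onText.
function startRealtimeSTT(SpeechRecognitionCtor, onText) {
  const rec = new SpeechRecognitionCtor();
  rec.continuous = true;     // keep listening across utterances
  rec.interimResults = true; // surface partial hypotheses immediately
  rec.onresult = (event) => {
    for (let i = event.resultIndex; i < event.results.length; i++) {
      const result = event.results[i];
      onText(result[0].transcript, result.isFinal);
    }
  };
  rec.start();
  return rec;
}
```

Interim results are what make intent matching feel instantaneous; they arrive well before the final, corrected transcript.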

Options

A: Streaming Whisper — 500ms chunks, lower accuracy, still WASM-limited
B: WebGPU backend — significant speedup, not all extension contexts support it
C: Post-processing — Web Speech for real-time, Whisper refines after capture ends
D: Accept Web Speech as default — Whisper is opt-in private mode with known latency

Recommendation: D for now, C as next evolution.
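Option C amounts to a two-phase pipeline, sketched below under stated assumptions: liveTranscribe, whisperRefine, and onTranscript are hypothetical callbacks standing in for the Web Speech path, the Whisper path, and the consumer, not existing code.

```javascript
// Phase 1: surface Web Speech text immediately for intent matching.
// Phase 2: once capture ends, re-transcribe the buffered audio with
// Whisper and replace the live transcript with the refined result.
async function transcribeWithRefinement({ liveTranscribe, whisperRefine, onTranscript }) {
  // liveTranscribe resolves with the captured audio when the user stops;
  // along the way it pushes interim text through the callback.
  const audio = await liveTranscribe((text) => onTranscript(text, /* final */ false));
  // Whisper is slower but more accurate; its output becomes the final text.
  const refined = await whisperRefine(audio);
  onTranscript(refined, /* final */ true);
  return refined;
}
```

The consumer sees the same event stream in both modes: interim text first, a single final correction later, so downstream intent matching needs no mode-specific logic.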

Technical debt

  • Migrate ScriptProcessorNode to AudioWorklet
  • numThreads=1 forced by CSP — revisit if Chrome relaxes MV3
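The numThreads constraint is a small config fragment in onnxruntime-web; a sketch of why it is forced, assuming the onnxruntime-web package (the env flags below are its real API, the rationale comments are ours):

```javascript
import * as ort from "onnxruntime-web";

// The multi-threaded WASM build spawns workers from blob: URLs and needs
// SharedArrayBuffer (cross-origin isolation); MV3's script-src disallows
// blob:, so only the single-threaded build loads inside the extension.
ort.env.wasm.numThreads = 1;
// SIMD needs no extra workers, so it still works and is worth keeping on.
ort.env.wasm.simd = true;
```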

🤖 Generated with Claude Code

Metadata


Labels: enhancement (New feature or request)
