
VTT MLX

Turn speech into text instantly — wherever you work, no cloud, no subscription.


The Problem

You spend hours typing notes, emails, and docs. Speaking is roughly 3x faster.

Cloud transcription costs $10–20/month and sends your data elsewhere. Built-in dictation is slow, unreliable, and doesn't paste where you need it.

The Solution

VTT sits in your Mac menu bar. Press Option+Space, speak, text appears where your cursor is.

Runs locally on Apple Silicon (offline) or on your own server via Tailscale. No subscription. 99 languages, auto-detected. Works in any app.

Results

  • Before: 5 min typing a 2-min voice note, or $20/mo for cloud ASR, or 3.5 GB RAM for local models
  • After: 2 min voice → instant text. Local mode: ~42x real-time. Remote mode: ~120 MB RAM on Mac (model on your server)

Quick Start

Requirements

  • Mac with Apple Silicon (M1, M2, M3, M4 — any variant)
  • macOS 13+ (Ventura or later)
  • 8 GB RAM minimum (see model selection below)
  • uv package manager

Install

# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and set up
git clone https://github.com/FUYOH666/VoiceToText.git
cd VoiceToText
uv sync

With local mlx_whisper, the model downloads on first use (~6 GB for large-v3); after that, transcription works fully offline. With remote_asr (the default engine in this repo's config.yaml), the Mac never loads MLX; transcription requires your ASR server to be reachable (e.g. via Tailscale).

Run

uv run python src/vtt2/main.py

A microphone icon appears in your menu bar. Press Option+Space to record.

Run as a background service

To start automatically on login and restart on crash:

# Install
uv run python src/vtt2/main.py --install

# Check status
uv run python src/vtt2/main.py --status

# Remove
uv run python src/vtt2/main.py --uninstall

macOS permissions

On first launch, macOS asks for three permissions. All three are required:

Permission       | Why                          | Where to grant
Microphone       | Record your voice            | Privacy & Security > Microphone
Accessibility    | Global hotkey (Option+Space) | Privacy & Security > Accessibility
Input Monitoring | Auto-paste text (Cmd+V)      | Privacy & Security > Input Monitoring

If hotkeys don't work, add your terminal app (Terminal, iTerm, Cursor) to Accessibility and Input Monitoring, then restart the app.


Deploy This For Your Business

This is open-source. You can run it yourself.

Or I can deploy, customize, and integrate it for your team in 2 weeks — custom voice workflows, enterprise integrations, deployment on your infrastructure.

Free consultation — tell me your use case, I'll tell you if it fits and how fast we can move.

Email: iamfuyoh@gmail.com
Telegram: @ScanovichAI


Tech Stack

How it works

  1. Press Option+Space to start recording
  2. Speak (any language — auto-detected)
  3. Press Option+Space again to stop
  4. Text is transcribed and pasted into the active app
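The four steps above amount to a press-to-toggle state machine. A minimal sketch (hypothetical names, not the repo's actual code):

```python
class HotkeyToggle:
    """Press once to start recording, press again to stop and transcribe."""

    def __init__(self, start_recording, stop_and_transcribe):
        self.recording = False
        self._start = start_recording     # e.g. open the audio stream
        self._stop = stop_and_transcribe  # e.g. close stream, run ASR, return text

    def on_hotkey(self):
        if self.recording:
            self.recording = False
            return self._stop()           # transcribed text, ready to paste
        self.recording = True
        self._start()
        return None
```

Auto-paste would then put the returned text on the clipboard and send Cmd+V to the frontmost app.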

Local mode: MLX Whisper on Apple Silicon — ~42x faster than real-time on M4 Max.
Remote mode: Whisper on your Linux GPU server via Tailscale.

Transcription engines

Mode                | RAM on Mac | Where it runs
mlx_whisper (local) | ~3.5 GB    | Your Mac (Apple Silicon)
remote_asr          | ~120 MB    | Linux GPU server via Tailscale

With remote_asr, the model runs on your server and the Mac stays light. Lazy imports ensure MLX is never loaded when the remote engine is selected.
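The lazy-import pattern can be sketched like this (a simplified illustration; engine names follow the config, but this is not the repo's code):

```python
import sys

def get_transcriber(engine: str):
    """Return a transcribe(path) callable; heavy backends import only on demand."""
    if engine == "mlx_whisper":
        import mlx_whisper  # multi-GB model stack, imported only for local mode
        return lambda path: mlx_whisper.transcribe(path)["text"]
    if engine == "remote_asr":
        def transcribe(path):
            # POST the audio file to the server's /v1/audio/transcriptions
            # endpoint here; only a lightweight HTTP client is needed.
            raise NotImplementedError
        return transcribe
    raise ValueError(f"unknown engine: {engine}")
```

Selecting remote_asr never executes the `import mlx_whisper` line, so the Mac process stays near the ~120 MB baseline.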

Models (defaults in config.yaml):

Engine      | Model / artifact
remote_asr  | cstr/whisper-large-v3-turbo-int8_float32 (server-side; override via transcription.remote_asr.model)
mlx_whisper | mlx-community/whisper-large-v3-mlx
whisper_cpp | GGML file path, e.g. models/ggml-medium-q5_0.bin (transcription.whisper_cpp.model_path)

Tail-end subtitle-style hallucinations are stripped before paste; see docs/WHISPER_ARTIFACTS.md.
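The stripping step can be approximated with a small regex pass (the phrases below are illustrative assumptions; the project's real, language-aware list is documented in docs/WHISPER_ARTIFACTS.md):

```python
import re

# Illustrative tail phrases Whisper sometimes hallucinates from subtitle
# training data; the project's actual pattern list is per-language (ru, en).
TAIL_PATTERNS = [
    re.compile(r"thanks for watching[.!]*\s*$", re.IGNORECASE),
    re.compile(r"subscribe to (?:my|the) channel[.!]*\s*$", re.IGNORECASE),
]

def strip_tail_artifacts(text: str) -> str:
    """Remove known subtitle-style artifacts from the end of a transcript."""
    text = text.rstrip()
    stripped = True
    while stripped:              # repeat in case several artifacts stack up
        stripped = False
        for pattern in TAIL_PATTERNS:
            new_text = pattern.sub("", text).rstrip()
            if new_text != text:
                text, stripped = new_text, True
    return text
```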

Switch mode:

# Remote ASR (matches default engine in bundled config.yaml)
VTT2_TRANSCRIPTION_ENGINE=remote_asr uv run python src/vtt2/main.py

# Local MLX on Mac — downloads model, ~3.5 GB RAM for large-v3
VTT2_TRANSCRIPTION_ENGINE=mlx_whisper uv run python src/vtt2/main.py

To use remote ASR, set in config.yaml:

transcription:
  engine: remote_asr
  remote_asr:
    host: "YOUR_TAILSCALE_IP"  # Tailscale IP of your server
    port: 8001
    path: "/v1/audio/transcriptions"
    model: "cstr/whisper-large-v3-turbo-int8_float32"

Or override via env: VTT2_TRANSCRIPTION_ENGINE=remote_asr, LOCAL_AI_ASR_BASE_URL=http://host:8001.

Local setup (keep your IP private): Create .env.vtt2 (gitignored) before running --install. The service will inject these into the launchd plist:

# .env.vtt2 (copy from .env.vtt2.example)
VTT2_TRANSCRIPTION_ENGINE=remote_asr
LOCAL_AI_ASR_BASE_URL=http://100.x.x.x:8001

Then run uv run python src/vtt2/main.py --install. After reboot, VTT will use your server automatically.
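A sketch of what --install does with that file, per the description above (the parsing and plist keys shown are assumptions, not the repo's exact code):

```python
import plistlib

def parse_env_file(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

def build_plist(env: dict, program_args: list) -> bytes:
    """Embed the env vars in a launchd agent plist (start on login, keep alive)."""
    return plistlib.dumps({
        "Label": "ai.vtt2",
        "ProgramArguments": program_args,
        "RunAtLoad": True,
        "KeepAlive": True,
        "EnvironmentVariables": env,
    })
```

launchd passes EnvironmentVariables to the process, so the service sees the same overrides as a manual run.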

Choose a model (local MLX only)

Edit config.yaml to pick a model that fits your Mac:

Model                | RAM needed | Quality | Speed
whisper-tiny-mlx     | 2 GB       | Basic   | Fastest
whisper-small-mlx    | 4 GB       | Good    | Fast
whisper-medium-mlx   | 6 GB       | Great   | Fast
whisper-large-v3-mlx | 10 GB      | Best    | Fast

All models are from mlx-community on Hugging Face. The full model name uses the prefix mlx-community/, for example:

transcription:
  mlx_whisper:
    model_name: "mlx-community/whisper-large-v3-mlx"

Default is whisper-large-v3-mlx (best quality). If you have 8 GB RAM, use whisper-medium-mlx.
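The table above maps to a simple chooser; a hypothetical helper, not part of the repo:

```python
# (min RAM in GB, model name) pairs from the table above, largest first
MODELS = [
    (10, "mlx-community/whisper-large-v3-mlx"),
    (6, "mlx-community/whisper-medium-mlx"),
    (4, "mlx-community/whisper-small-mlx"),
    (2, "mlx-community/whisper-tiny-mlx"),
]

def pick_model(ram_gb: float) -> str:
    """Return the best-quality model that fits the given amount of RAM."""
    for min_ram, name in MODELS:
        if ram_gb >= min_ram:
            return name
    raise ValueError("whisper-tiny-mlx needs at least 2 GB of RAM")
```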

Configuration

All settings are in config.yaml. Shape (simplified):

transcription:
  engine: remote_asr  # or mlx_whisper | whisper_cpp
  mlx_whisper:
    model_name: "mlx-community/whisper-large-v3-mlx"
    language: "auto"  # or "en", "ru", "zh", "ja", …

audio:
  max_recording_duration: 7200  # seconds (2 hours)

ui:
  hotkey: "option+space"
  auto_paste_enabled: true

text_processing:
  strip_whisper_tail_artifacts: true
  whisper_artifact_languages: [ru, en]

You can also override settings with environment variables using the VTT2_ prefix (see .env.example).
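The override rule can be sketched as follows (the split-on-first-underscore mapping is an assumption that matches the documented VTT2_TRANSCRIPTION_ENGINE example; the repo may resolve keys differently):

```python
def apply_env_overrides(config: dict, environ: dict, prefix: str = "VTT2_") -> dict:
    """Copy config, overriding section.key from PREFIX_SECTION_KEY variables."""
    result = {section: dict(values) for section, values in config.items()}
    for name, value in environ.items():
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):].lower()
        if "_" not in rest:
            continue  # need at least SECTION_KEY after the prefix
        section, key = rest.split("_", 1)
        result.setdefault(section, {})[key] = value
    return result
```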

Troubleshooting

Hotkey not working: Add your terminal app to System Settings > Privacy & Security > Accessibility and Input Monitoring. Restart the app.

Hotkey stopped responding after recording (stuck): Restart the service:

launchctl unload ~/Library/LaunchAgents/ai.vtt2.plist && launchctl load ~/Library/LaunchAgents/ai.vtt2.plist

If a zombie process remains:

  1. Kill it in Activity Monitor (python … main.py) or with pkill -9 -f vtt2/main.py
  2. Remove ~/.local/state/vtt2/vtt2.pid
  3. Load the agent again

(v1.2.1+ stops the audio stream more safely; v1.2.6+ drains the audio chunk queue without blocking after stop. Update if you still see hangs.)

"Model not found" on first run: The model downloads from Hugging Face on first use. Make sure you have internet for the initial download. After that, everything works offline.

High memory usage:

  • Best option: switch to remote_asr in config.yaml — drops from ~3.5 GB to ~120 MB (model runs on server).
  • Or use a smaller local model (see table above). Memory auto-cleanup is enabled by default.

Check everything at once:

uv run python src/vtt2/main.py --health

Logs

Logs are at ~/Library/Logs/vtt2/ (vtt2.stdout.log, vtt2.stderr.log, vtt2.log).

For verbose output: uv run python src/vtt2/main.py --verbose

Supported languages

Whisper supports 99 languages including English, Russian, Chinese, Japanese, Spanish, French, German, Arabic, Hindi, and many more. Set language: "auto" in config (default) and it detects automatically.

Development

# Run tests
uv run pytest

# Benchmark transcription speed
uv run python test_transcription_speed.py

License

MIT


Built with MLX, Apple's machine-learning framework for Apple Silicon.

About

Cross-platform Voice-to-Text application with support for macOS, Linux, and Apple Silicon (MLX). Fully offline, private, and free.
