Two of the most useful AI voice tools of the last few years live on opposite ends of the same loop. ElevenLabs handles output: clone a voice, generate speech, export audio. WisprFlow handles input: hold a hotkey, speak, text appears in whatever app you're in.
Both are paid cloud services. Both send your voice data to their servers.
Voicebox does both, locally, for free.
It is an open source AI voice studio built by Jamie Pine (also the creator of Spacedrive). Voicebox clones voices from a few seconds of audio, generates speech across 23 languages with seven different TTS engines, provides a global dictation hotkey that pastes into any app, runs a local LLM for personality rewrites, and exposes an MCP server so Claude, Cursor, and any other AI agent can speak to you in a voice you've cloned.
31,000+ GitHub stars. MIT license. Runs entirely on your machine. Nothing leaves without your permission.
The Problem It Solves
Voice AI tooling in 2026 has a fragmentation problem. You need one subscription for voice cloning, another for dictation, a separate API for your agents, and a different integration for each. You are paying per character of speech, per minute of transcription, and per API call — with your voice data living on someone else's infrastructure.
Voicebox's design answer is vertical integration, locally. One app, one model cache, one GPU footprint, covering:
- Voice cloning — zero-shot cloning from a reference audio sample
- TTS — 7 engines, 23 languages, post-processing effects
- STT — Whisper (all sizes, plus Turbo) for transcription
- Dictation — global hotkey, pastes into any text field on macOS
- Local LLM — Qwen3 for personality rewrites and dictation cleanup
- MCP server — agents can speak and transcribe via standard tool protocol
- REST API — everything accessible programmatically
Voice Cloning: How It Works
Voicebox supports zero-shot voice cloning — you provide a reference audio sample (a few seconds of clean speech), and Voicebox creates a voice profile that matches it. No training, no fine-tuning, no wait time. The cloning happens at inference time: the TTS engine receives your reference audio and conditions its output to match the speaker.
To clone a voice:
- Open Voicebox → Voices tab
- Click "New Profile"
- Record or import a reference audio clip (10–30 seconds of clean speech works best)
- Name the profile; this is the name you pass to voicebox.speak over MCP
Multi-sample support is available: upload several clips from the same speaker for higher quality cloning.
Profiles are exportable and importable, so you can share voice profiles or move them between machines.
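Before importing a clip, it can help to confirm that it falls inside the 10–30 second window recommended above. The following is a minimal sketch using Python's standard `wave` module; it assumes an uncompressed PCM WAV file, and the helper names are hypothetical rather than part of Voicebox.

```python
import wave

# The 10-30 second window comes from the guidance above.
# The constant and function names are illustrative, not a Voicebox API.
MIN_SECONDS = 10.0
MAX_SECONDS = 30.0


def clip_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_good_reference_length(path: str) -> bool:
    """True when the clip falls inside the recommended 10-30 second window."""
    return MIN_SECONDS <= clip_duration_seconds(path) <= MAX_SECONDS
```

Duration is only one proxy for quality; background noise and clipping matter too, and this check does not look at either.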
The Seven TTS Engines
Voicebox ships seven TTS engines, each with different strengths. You can switch engines per generation — the UI lets you select which engine to use before generating.
| Engine | Languages | Strengths |
|---|---|---|
| Qwen3-TTS (0.6B / 1.7B) | 10 | High-quality multilingual cloning, delivery instructions ("speak slowly", "whisper") |
| Qwen CustomVoice | 10 | 9 preset voices with natural-language delivery control, no reference audio needed |
| LuxTTS | English + 10 more | Lightweight (~1GB VRAM), 48kHz output, 150x real-time on Apple Silicon; best for Finnish, Greek, Hebrew, Hindi, Norwegian, Polish, Swahili |
| Chatterbox Turbo | English | Fast 350M model, supports paralinguistic emotion tags like [laugh] [sigh] [gasp] |
| HumeAI TADA (1B / 3B) | 10 | 700s+ of coherent continuous audio, text-acoustic dual alignment |
| Kokoro | 8 | 82M model, 50 curated preset voices, fast CPU inference — the lightweight default |
| Chatterbox Multilingual | Multiple | Multilingual Chatterbox base model |
The paralinguistic tag support in Chatterbox Turbo deserves a callout. Type / in the text input to open the tag inserter and add:
"That's [laugh] actually quite funny, you know. [sigh] But here we are."
The model speaks the text with the indicated emotional cues inline. ElevenLabs offers something similar in its "Expressive" tier; Voicebox's implementation is free and local.
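If you generate tagged text from a script, a small pre-check can catch misspelled tags before they reach the engine. This sketch assumes the bracket syntax shown above and lists only the three tags this article mentions; the engine may accept others, and the function names are hypothetical.

```python
import re

# Only the three tags named in this article; the real tag set may be larger.
KNOWN_TAGS = {"laugh", "sigh", "gasp"}
TAG_PATTERN = re.compile(r"\[([a-z]+)\]")


def split_tags(text: str) -> tuple[list[str], list[str]]:
    """Return (recognized, unrecognized) bracket tags in order of appearance."""
    recognized, unrecognized = [], []
    for name in TAG_PATTERN.findall(text):
        (recognized if name in KNOWN_TAGS else unrecognized).append(name)
    return recognized, unrecognized


def strip_tags(text: str) -> str:
    """Remove bracket tags and collapse the whitespace they leave behind."""
    return re.sub(r"\s{2,}", " ", TAG_PATTERN.sub("", text)).strip()
```

`strip_tags` is useful when the same script also feeds an engine that does not understand the tags and would otherwise read them aloud.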
Global Dictation: Voice Input Anywhere
The input half of Voicebox is a global dictation system backed by OpenAI Whisper (running locally). Hold a hotkey anywhere on your system, speak, and the transcript pastes directly into the focused text field — terminal, editor, browser, any app.
Setup on macOS:
- Open Voicebox → Settings → Dictation
- Set your push-to-talk chord (default: hold Fn)
- Grant Accessibility and Input Monitoring permissions (Voicebox walks you through this with deep links to System Settings)
- Hold the hotkey → speak → release → text appears
On macOS, Voicebox inserts text through the Accessibility API, writing directly into the focused element instead of pasting from the clipboard, so your clipboard is not overwritten. Any clipboard save/restore that does occur is atomic.
Two modes:
- Push-to-talk — hold chord to record, release to transcribe and paste
- Toggle — tap chord to start, tap again to stop. Hold the push-to-talk chord and tap Space mid-hold to upgrade to toggle without a gap in recording.
LLM refinement is optional: before paste, Voicebox's bundled Qwen3 LLM cleans up filler words, stutters, and false starts. Toggle this per-dictation or set it as a default in Settings.
On-screen pill: A floating overlay shows recording, transcribing, refining, and speaking states — the same pill agents use when they speak to you, so there is one mental model for both directions of the voice loop.
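The two modes and the mid-hold upgrade can be pictured as a small state machine. This is a sketch of the behavior described above, not Voicebox's implementation: the state names, the event names (`ptt_down`, `ptt_up`, `space_tap`, `toggle_tap`), and the choice to ignore the chord release after an upgrade are all assumptions.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    HOLDING = "holding"  # push-to-talk: recording while the chord is held
    LATCHED = "latched"  # toggle: recording continues hands-free


# (state, event) -> (next state, whether to transcribe and paste now)
TRANSITIONS = {
    (State.IDLE, "ptt_down"): (State.HOLDING, False),
    (State.HOLDING, "ptt_up"): (State.IDLE, True),
    (State.HOLDING, "space_tap"): (State.LATCHED, False),  # upgrade mid-hold
    (State.LATCHED, "ptt_up"): (State.LATCHED, False),     # release is ignored
    (State.IDLE, "toggle_tap"): (State.LATCHED, False),
    (State.LATCHED, "toggle_tap"): (State.IDLE, True),
}


def step(state: State, event: str) -> tuple[State, bool]:
    """Advance one event; unknown combinations leave the state unchanged."""
    return TRANSITIONS.get((state, event), (state, False))


def run(events: list[str]) -> tuple[State, int]:
    """Replay events from IDLE and count how many pastes were triggered."""
    state, pastes = State.IDLE, 0
    for event in events:
        state, paste = step(state, event)
        pastes += int(paste)
    return state, pastes
```

Because the upgrade is a transition out of `HOLDING` rather than a stop followed by a restart, recording never pauses, which matches the "without a gap" behavior described above.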
Connecting Claude (and Other Agents) via MCP
This is the feature that most sets Voicebox apart from other TTS tools. When Voicebox is running, it exposes an MCP server at http://127.0.0.1:17493/mcp. Any MCP-aware agent can use four tools:
- voicebox.speak: synthesize text in a cloned voice
- voicebox.transcribe: convert an audio file to text
- voicebox.list_captures: browse your recording history
- voicebox.list_profiles: browse your voice profiles
Connect Claude Code
claude mcp add --transport http voicebox \
http://127.0.0.1:17493/mcp \
--header "X-Voicebox-Client-Id: claude-code"
Connect Cursor / Windsurf / VS Code
{
"mcpServers": {
"voicebox": {
"url": "http://127.0.0.1:17493/mcp",
"headers": { "X-Voicebox-Client-Id": "cursor" }
}
}
}
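If your editor's MCP config already lists other servers, the entry has to be merged in, not pasted over the file. A minimal sketch of that merge follows; the entry itself is copied from the JSON above, while the function name and the `other` server in the usage note are made up for illustration. Where the config file lives depends on the editor and is not covered here.

```python
from copy import deepcopy

# Copied from the JSON shown above.
VOICEBOX_ENTRY = {
    "url": "http://127.0.0.1:17493/mcp",
    "headers": {"X-Voicebox-Client-Id": "cursor"},
}


def add_voicebox(config: dict) -> dict:
    """Return a copy of config with the voicebox server added.

    Existing servers are kept, and the input dict is not modified.
    """
    merged = deepcopy(config)
    merged.setdefault("mcpServers", {})["voicebox"] = deepcopy(VOICEBOX_ENTRY)
    return merged
```

For example, `add_voicebox({"mcpServers": {"other": {"url": "http://127.0.0.1:9000/mcp"}}})` returns a config containing both `other` and `voicebox`; serialize the result with `json.dumps(..., indent=2)` before writing it back.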
Connect Claude Desktop (stdio fallback)
{
"mcpServers": {
"voicebox": {
"command": "/Applications/Voicebox.app/Contents/MacOS/voicebox-mcp",
"env": { "VOICEBOX_CLIENT_ID": "claude-desktop" }
}
}
}
Using the speak tool in an agent session
Once connected, Claude can call:
voicebox.speak({
text: "Tests passing. Ready to merge.",
profile: "Morgan"
})
Claude speaks in Morgan's voice. You hear it through your speakers. The on-screen pill shows "Speaking" with the profile name.
Per-client voice binding: In Voicebox → Settings → MCP, pin each connected agent to a specific voice. Claude Code → Morgan. Cursor → Scarlett. When those agents call voicebox.speak without specifying a profile, Voicebox uses the bound voice. This means you can tell which agent is talking without looking.
Voice personalities over MCP: Add a persona description to any voice profile ("a calm senior engineer who explains things simply"). Then call:
voicebox.speak({
text: "The build failed on line 42.",
profile: "Morgan",
personality: true
})
The text passes through the local Qwen3 LLM which rewrites it in Morgan's character before TTS. The agent's output becomes Morgan's voice in personality as well as sound.
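On the wire, an MCP tool invocation is a JSON-RPC 2.0 request with the method `tools/call`. The sketch below builds that request for the two calls shown above. The argument names (`text`, `profile`, `personality`) are taken from those calls; the helper name is hypothetical, and a real client would complete the MCP `initialize` handshake before sending it.

```python
from typing import Optional


def speak_call(
    text: str,
    profile: Optional[str] = None,
    personality: bool = False,
    request_id: int = 1,
) -> dict:
    """Build an MCP tools/call request for the voicebox.speak tool."""
    arguments: dict = {"text": text}
    if profile is not None:
        # Omit the profile to let Voicebox fall back to the bound voice.
        arguments["profile"] = profile
    if personality:
        arguments["personality"] = True
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "voicebox.speak", "arguments": arguments},
    }
```

In practice an agent framework or MCP client library assembles this message for you; the point is only that `profile` and `personality` are ordinary tool arguments.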
The REST API
Everything in Voicebox is accessible via REST at http://127.0.0.1:17493. Full docs at http://127.0.0.1:17493/docs when the app is running.
# Generate speech — returns audio file
curl -X POST http://127.0.0.1:17493/generate \
-H "Content-Type: application/json" \
-d '{"text": "Hello world", "profile_id": "abc123", "language": "en"}'
# Agent voice output
curl -X POST http://127.0.0.1:17493/speak \
-H "Content-Type: application/json" \
-H "X-Voicebox-Client-Id: my-script" \
-d '{"text": "Deploy complete.", "profile": "Morgan"}'
# Transcribe audio
curl -X POST http://127.0.0.1:17493/transcribe \
-F "[email protected]" \
-F "model=whisper-turbo"
# List voice profiles
curl http://127.0.0.1:17493/profiles
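The same /speak call can be made from Python's standard library. This is a minimal sketch that mirrors the curl example; the function names and the 30-second timeout are arbitrary choices, and because the response body format is not described here, the helper returns only the HTTP status.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://127.0.0.1:17493"


def build_speak_request(
    text: str, profile: Optional[str] = None, client_id: str = "my-script"
) -> urllib.request.Request:
    """Build (but do not send) a POST /speak request."""
    body = {"text": text}
    if profile is not None:
        body["profile"] = profile
    return urllib.request.Request(
        f"{BASE_URL}/speak",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-Voicebox-Client-Id": client_id,
        },
        method="POST",
    )


def speak(text: str, profile: Optional[str] = None, timeout: float = 30.0) -> int:
    """Send the request and return the HTTP status. Needs Voicebox running."""
    request = build_speak_request(text, profile)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Separating request construction from sending keeps the payload easy to inspect or log without the app running.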
The /speak endpoint accepts profile as a name (case-insensitive) or ID, resolving in the same order as the MCP tool: explicit argument → per-client binding → default capture voice.
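That resolution order can be sketched as a short function. The data shapes here are assumptions made for illustration: `profiles` maps profile ID to display name, and `bindings` maps client ID to a profile name or ID. What Voicebox does when an explicit profile matches nothing is not stated, so this sketch simply returns `None` in that case.

```python
from typing import Optional


def resolve_profile(
    explicit: Optional[str],
    client_id: Optional[str],
    bindings: dict,
    default_voice: Optional[str],
    profiles: dict,
) -> Optional[str]:
    """Resolve a profile ID: explicit argument, then binding, then default."""
    wanted = explicit or bindings.get(client_id) or default_voice
    if wanted is None:
        return None
    if wanted in profiles:  # matched by ID
        return wanted
    for profile_id, name in profiles.items():
        if name.lower() == wanted.lower():  # case-insensitive name match
            return profile_id
    return None
```

With `profiles = {"abc123": "Morgan", "def456": "Scarlett"}` and `bindings = {"cursor": "Scarlett"}`, a call from `cursor` with no explicit profile resolves to `def456`, while an explicit `"morgan"` resolves to `abc123` whichever client sends it.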
Post-Processing Effects
After generating speech, apply audio effects non-destructively. Preview in real time, build presets per voice profile.
| Effect | Options |
|---|---|
| Pitch Shift | ±12 semitones |
| Reverb | Room size, damping, wet/dry |
| Delay | Time, feedback, mix |
| Chorus / Flanger | Modulated delay |
| Compressor | Dynamic range |
| Gain | -40 to +40 dB |
| High-Pass Filter | Remove low frequencies |
| Low-Pass Filter | Remove high frequencies |
Four built-in presets: Robotic, Radio, Echo Chamber, Deep Voice. Custom presets are saveable per profile.
Each generation creates a version — original TTS output is always preserved. Effects versions branch from the original, so you can apply different chains without losing the clean source.
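To give the ranges in the effects table some intuition: in equal temperament, a shift of n semitones multiplies frequency by 2^(n/12), and a gain of g dB multiplies amplitude by 10^(g/20). The sketch below applies those standard formulas; whether Voicebox's effects use exactly these conversions internally is an assumption.

```python
def semitones_to_ratio(semitones: float) -> float:
    """Frequency ratio for a pitch shift in equal-tempered semitones."""
    return 2.0 ** (semitones / 12.0)


def db_to_amplitude(gain_db: float) -> float:
    """Linear amplitude multiplier for a gain expressed in decibels."""
    return 10.0 ** (gain_db / 20.0)


def clamp(value: float, low: float, high: float) -> float:
    """Keep a setting inside its allowed range, e.g. -12..+12 semitones."""
    return max(low, min(high, value))
```

So the ±12 semitone range spans one octave down (ratio 0.5) to one octave up (ratio 2.0), and the -40 to +40 dB gain range spans amplitude multipliers from 0.01 to 100.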
Download and Install
macOS (Apple Silicon or Intel): Download the DMG from github.com/jamiepine/voicebox/releases. Drag to Applications, open it, grant the Accessibility and Microphone permissions when prompted.
Windows: Download the MSI installer from the same Releases page.
Docker:
docker compose up
Linux: Pre-built binaries are not yet published. Build from source — see voicebox.sh/linux-install for instructions.
On first launch, Voicebox downloads the base models (Whisper, Kokoro as the default TTS engine, and the Qwen3 LLM for dictation refinement). Storage requirements depend on which models you download: Kokoro is the lightest (~82M parameters) and TADA is the heaviest (1B or 3B parameters).
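Parameter counts translate to disk space only roughly, but a back-of-envelope estimate is parameters times bytes per parameter. The sketch below assumes 16-bit weights by default; the precision each model actually ships in is not stated here, and real downloads also include tokenizers and other files, so treat the numbers as order-of-magnitude only.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weights_size_gb(params: float, precision: str = "fp16") -> float:
    """Approximate size of the raw weights in gigabytes (10^9 bytes)."""
    return params * BYTES_PER_PARAM[precision] / 1e9
```

Under that assumption, an 82M-parameter model needs about 0.16 GB for its weights and a 3B-parameter model about 6 GB.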
GPU Support
Voicebox auto-detects and uses the best available inference backend for your hardware:
| Platform | Backend | Notes |
|---|---|---|
| macOS Apple Silicon | MLX (Metal) | 4–5x faster on the Apple GPU via Metal |
| Windows / Linux NVIDIA | PyTorch (CUDA) | Auto-downloads from within the app |
| Linux AMD | PyTorch (ROCm) | Auto-configures HSA_OVERRIDE_GFX_VERSION |
| Windows (any GPU) | DirectML | Universal Windows GPU support |
| Intel Arc | IPEX/XPU | Intel discrete GPU acceleration |
| Any | CPU | Works everywhere, slower |
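The table can be read as a lookup from platform and GPU vendor to backend. The sketch below encodes it as a pure function. The priority order, the `gpu_vendor` labels, and the function name are assumptions made for illustration, and Voicebox's actual detection logic may differ.

```python
from typing import Optional


def pick_backend(system: str, machine: str, gpu_vendor: Optional[str] = None) -> str:
    """Map a platform description to a backend name from the table above.

    system and machine follow platform.system() / platform.machine();
    gpu_vendor is one of "nvidia", "amd", "intel_arc", "other", or None.
    """
    if system == "Darwin" and machine == "arm64":
        return "MLX (Metal)"
    if system in ("Windows", "Linux") and gpu_vendor == "nvidia":
        return "PyTorch (CUDA)"
    if system == "Linux" and gpu_vendor == "amd":
        return "PyTorch (ROCm)"
    if gpu_vendor == "intel_arc":
        return "IPEX/XPU"
    if system == "Windows" and gpu_vendor is not None:
        return "DirectML"
    return "CPU"  # works everywhere, slower
```

In a script, the first two arguments would come from `platform.system()` and `platform.machine()`; detecting the GPU vendor is platform-specific and is left out of this sketch.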
Privacy: The Non-Negotiable Default
Every part of Voicebox runs locally:
- Voice cloning models run on your GPU/CPU
- Generated audio never leaves your machine
- Whisper STT runs locally — your speech is not sent to any API
- The local Qwen3 LLM for personality and dictation cleanup runs on-device
- The MCP server and REST API are localhost-only
There are no usage metrics, no voice data collection, and no cloud processing unless you explicitly configure an external integration.
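If you script against the API, one cheap safeguard is to confirm that the endpoint you are about to send audio or text to really is a loopback address. This is a minimal sketch using the standard library; the function name is hypothetical, and treating the hostname `localhost` as loopback is an assumption about your hosts configuration.

```python
import ipaddress
from urllib.parse import urlparse


def is_loopback_url(url: str) -> bool:
    """True when the URL's host is localhost or a loopback IP address."""
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal, e.g. a public hostname.
        return False
```

A wrapper script could refuse to call the endpoint when `is_loopback_url` returns `False`, which guards against a mistyped or remote base URL.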
Quick-Start Summary
- Download: github.com/jamiepine/voicebox/releases → macOS DMG or Windows MSI
- Clone a voice: Voices tab → New Profile → record or upload reference audio
- Enable dictation: Settings → Dictation → set hotkey → grant permissions
- Connect Claude Code: claude mcp add --transport http voicebox http://127.0.0.1:17493/mcp --header "X-Voicebox-Client-Id: claude-code"
- Ask Claude to speak: voicebox.speak({ text: "...", profile: "YourVoice" })