What is VoxCPM2 and how is it different from traditional TTS models?

VoxCPM2 is a 2 billion parameter tokenizer-free Text-to-Speech model that directly generates continuous speech representations via an end-to-end diffusion autoregressive architecture, bypassing discrete tokenization entirely. Unlike traditional TTS models that convert text to discrete tokens and then to audio, VoxCPM2 operates entirely in the latent space of AudioVAE V2, achieving more natural and expressive synthesis.

How many languages does VoxCPM2 support?

VoxCPM2 supports 30 languages including Arabic, Burmese, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, and Vietnamese. It also supports Chinese dialects like Sichuan, Cantonese, Wu, Northeastern, Henan, Shaanxi, Shandong, Tianjin, and Minnan.

What is Voice Design in VoxCPM2?

Voice Design is a unique feature that allows you to create a brand-new voice from a natural-language description alone, with no reference audio required. You simply describe the voice characteristics (gender, age, tone, emotion, pace) in parentheses at the start of your text, like '(A young woman, gentle and sweet voice)Hello, welcome to VoxCPM2!' and the model generates a voice matching that description.

What are the three types of voice cloning in VoxCPM2?

VoxCPM2 offers three voice cloning modes: (1) Controllable Voice Cloning—clone a voice from a short reference clip with optional style guidance to control emotion and pace, (2) Ultimate Cloning—provide both reference audio and its transcript for continuation-based cloning that reproduces every vocal nuance, and (3) Voice Design—create voices from text descriptions without any reference audio.

What audio quality does VoxCPM2 produce?

VoxCPM2 directly outputs 48kHz studio-quality audio via AudioVAE V2's asymmetric encode/decode design with built-in super-resolution. It accepts 16kHz reference audio and upsamples to 48kHz without needing external upsamplers, delivering professional-grade audio quality.

Is VoxCPM2 open-source and can I use it commercially?

Yes, VoxCPM2 is fully open-source and released under the Apache 2.0 license, making it free for commercial use. Both the model weights and code are available on Hugging Face and ModelScope.

What is the inference speed of VoxCPM2?

VoxCPM2 achieves an RTF (Real-Time Factor) of ~0.30 on an NVIDIA RTX 4090 with the standard PyTorch implementation, and ~0.13 when accelerated by Nano-vLLM. Since speed relative to playback is 1/RTF, that is roughly 3.3× to 7.7× faster than real time.

How much VRAM does VoxCPM2 require?

VoxCPM2 requires approximately 8 GB of VRAM for inference on NVIDIA GPUs. It can also run on Apple Silicon Macs using MPS (Metal Performance Shaders) with similar memory requirements.

VoxCPM2: The 2B Parameter Tokenizer-Free TTS Model That | explainx.ai Blog


In April 2026, OpenBMB released VoxCPM2—a 2 billion parameter tokenizer-free Text-to-Speech (TTS) model trained on over 2 million hours of multilingual speech data. Unlike traditional TTS systems that rely on discrete audio tokens, VoxCPM2 directly generates continuous speech representations via an end-to-end diffusion autoregressive architecture, achieving highly natural and expressive synthesis across 30 languages.

Figure: VoxCPM2 tokenizer-free TTS model architecture

VoxCPM2 operates entirely in the latent space of AudioVAE V2, following a four-stage pipeline: LocEnc → TSLM → RALM → LocDiT for rich expressiveness and 48kHz native audio output.

VoxCPM2 introduces three groundbreaking capabilities:

Voice Design: Create brand-new voices from natural-language descriptions alone (gender, age, tone, emotion, pace), no reference audio required
Controllable Voice Cloning: Clone any voice from a short reference clip with optional style guidance to steer emotion and pace
Ultimate Cloning: Provide both reference audio and its transcript for continuation-based cloning that reproduces every vocal nuance

Built on a MiniCPM-4 backbone and Apache 2.0 licensed, VoxCPM2 is fully open-source and free for commercial use, making it one of the most accessible and powerful TTS models available in 2026.

This article covers VoxCPM2's architecture, multilingual capabilities, voice cloning features, performance benchmarks, fine-tuning guide, and how it compares to commercial alternatives like OpenAI's GPT-Realtime 2.0 and ElevenLabs.

TL;DR

Topic	Takeaway
VoxCPM2	2B parameter tokenizer-free TTS model; directly generates continuous speech (no discrete tokens); trained on 2M+ hours of multilingual data
Languages	30 languages + Chinese dialects; automatic language detection (no language tags needed)
Voice Design	Create voices from text descriptions alone: "(Young woman, gentle voice)Hello world!" → generates matching voice
Voice Cloning	Three modes: (1) Controllable (reference audio + style control), (2) Ultimate (reference + transcript for perfect reproduction), (3) Design (text-only)
Audio Quality	48kHz studio-quality output; accepts 16kHz reference, outputs 48kHz via built-in super-resolution
Performance	RTF ~0.30 (RTX 4090 PyTorch), ~0.13 (Nano-vLLM); 8GB VRAM; supports streaming
License	Apache 2.0—fully open-source, free for commercial use
Fine-Tuning	Supports LoRA (5-10 min audio) and full SFT; WebUI for training/inference
Deployment	Python API, CLI, Gradio WebUI, Nano-vLLM (high-throughput), vLLM-Omni (OpenAI-compatible API)

What Makes VoxCPM2 Revolutionary

1. Tokenizer-Free Architecture

Traditional TTS Pipeline: Text → Phonemes/Tokens → Discrete Audio Tokens → Audio Waveform

VoxCPM2 Pipeline: Text → Continuous Latent Representations → Audio Waveform

By eliminating discrete tokenization, VoxCPM2 avoids:

Information loss from quantization
Prosody artifacts from token boundaries
Unnatural pauses at token transitions
Limited expressiveness of finite codebooks

Result: More natural, expressive, and human-like speech synthesis.
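To make the quantization point concrete, here is a toy sketch in plain Python. It is not VoxCPM2 code: the latent values and the five-entry scalar codebook are invented for illustration. It snaps each continuous value to its nearest codebook entry and measures what is lost, which is the error a continuous representation avoids by construction.

```python
def quantize(values, codebook):
    """Map each continuous value to its nearest codebook entry."""
    return [min(codebook, key=lambda c: abs(c - v)) for v in values]


def mean_squared_error(a, b):
    """Average squared difference between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Hypothetical continuous "latent" values and a tiny finite codebook.
latents = [-0.92, -0.31, 0.05, 0.48, 0.77]
codebook = [-1.0, -0.5, 0.0, 0.5, 1.0]

quantized = quantize(latents, codebook)
loss = mean_squared_error(latents, quantized)

print("quantized:", quantized)
print("information lost (MSE):", round(loss, 5))
```

Real audio codebooks are vector-valued and far larger, so the error per frame is smaller, but it is never zero as long as the codebook is finite.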

2. Diffusion Autoregressive Paradigm

VoxCPM2 combines two powerful generative approaches:

Autoregressive Generation: Models long-range dependencies and coherent structure (like language models)

Diffusion Modeling: Captures fine-grained acoustic details and smooth transitions

The model operates entirely in the latent space of AudioVAE V2, following a four-stage pipeline:

LocEnc (Local Encoder): Encodes text and reference audio into latent representations
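TSLM (Text-Semantic Language Model): Plans the semantic and prosodic content of the speech from text, as continuous representations rather than discrete tokens
RALM (Residual Acoustic Language Model): Recovers fine-grained acoustic detail, including reference voice timbre and style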
LocDiT (Local Diffusion Transformer): Refines acoustic details via diffusion

This architecture enables context-aware synthesis—the model automatically infers appropriate prosody, emotion, and expressiveness from text content.
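The combination of an autoregressive outer loop with a diffusion-style inner loop can be sketched with a toy example. Everything here is an assumption made for illustration: the "language model" is a fixed formula, the denoiser is perfect, and the inner loop is a simple straight-line refinement standing in for real diffusion sampling. A trained model such as LocDiT would predict the refinement direction with a neural network instead.

```python
import math
import random


def refine(noise, target, num_steps):
    """Move a noisy vector toward `target` in `num_steps` equal-progress steps.

    Stand-in for a diffusion sampler with a perfect denoiser (an assumption).
    """
    x = list(noise)
    for t in range(num_steps):
        remaining = num_steps - t
        # Close 1/remaining of the gap, so the last step lands on the target.
        x = [xi + (ti - xi) / remaining for xi, ti in zip(x, target)]
    return x


def generate_patches(num_patches, dim, num_steps=10, seed=0):
    """Autoregressive outer loop: each patch is conditioned on the previous one."""
    rng = random.Random(seed)
    previous = [0.0] * dim
    patches = []
    for _ in range(num_patches):
        # Hypothetical "language model" step: derive a clean target from history.
        target = [math.tanh(0.9 * p + 0.1) for p in previous]
        noise = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        patch = refine(noise, target, num_steps)
        patches.append(patch)
        previous = patch
    return patches


patches = generate_patches(num_patches=3, dim=4)
```

The `num_steps` argument plays the same role as `inference_timesteps` in the article's examples: more inner steps per patch means more refinement work for each piece of generated audio.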

3. 30-Language Multilingual Support

VoxCPM2 supports 30 languages with automatic language detection:

Major Languages: Arabic, Chinese, Dutch, English, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Korean, Malay, Polish, Portuguese, Russian, Spanish, Thai, Turkish, Vietnamese

Less Common Languages: Burmese, Danish, Finnish, Khmer, Lao, Norwegian, Swahili, Swedish, Tagalog

Chinese Dialects: 四川话 (Sichuan), 粤语 (Cantonese), 吴语 (Wu), 东北话 (Northeastern), 河南话 (Henan), 陕西话 (Shaanxi), 山东话 (Shandong), 天津话 (Tianjin), 闽南话 (Minnan)

Key Advantage: No language tags required—just input text in any of the 30 languages and VoxCPM2 automatically detects and synthesizes.

Voice Design: Create Voices from Text Descriptions

Overview

Voice Design is VoxCPM2's most innovative feature: create a brand-new voice from a natural-language description alone, with no reference audio required.

How It Works:

Put your voice description in parentheses at the start of the text:

python

from voxcpm import VoxCPM
import soundfile as sf

model = VoxCPM.from_pretrained("openbmb/VoxCPM2", load_denoiser=False)

wav = model.generate(
    text="(A young woman, gentle and sweet voice)Hello, welcome to VoxCPM2!",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("voice_design.wav", wav, model.tts_model.sample_rate)

Supported Attributes

You can describe voices using:

Demographic:

Gender: "male," "female," "non-binary"
Age: "young child," "teenager," "middle-aged," "elderly"

Vocal Quality:

Tone: "warm," "cold," "bright," "dark," "raspy," "smooth"
Pitch: "high-pitched," "low-pitched," "baritone," "soprano"

Emotion and Style:

Emotion: "cheerful," "sad," "angry," "calm," "excited," "nervous"
Style: "professional," "casual," "formal," "playful," "serious"

Speech Characteristics:

Pace: "fast," "slow," "moderate," "rushed," "leisurely"
Expression: "expressive," "monotone," "animated," "subdued"
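Because the description is just a parenthesized prefix on the input text, it can be assembled programmatically. The helper names below (`describe_voice`, `build_voice_design_text`) are hypothetical and not part of the `voxcpm` package; this is a minimal sketch of the string format shown in the example above.

```python
def describe_voice(*attributes):
    """Join free-form attributes such as gender, age, tone, and pace."""
    return ", ".join(a.strip() for a in attributes if a and a.strip())


def build_voice_design_text(description, text):
    """Prefix `text` with a parenthesized voice description."""
    description = description.strip().strip("()").strip()
    if not description:
        raise ValueError("voice description must not be empty")
    return f"({description}){text}"


prompt = build_voice_design_text(
    describe_voice("Young female", "warm and cheerful", "slightly fast pace"),
    "Hello, welcome to VoxCPM2!",
)
print(prompt)
```

The resulting string can be passed as the `text` argument of `model.generate(...)` exactly as in the earlier example.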

Example Descriptions

Professional Narrator:

"(Middle-aged male, deep voice, calm and authoritative tone)"

Friendly Assistant:

"(Young female, warm and cheerful, slightly fast pace)"

Dramatic Character:

"(Elderly man, raspy voice, slow and ominous)"

Customer Service:

"(Female, professional tone, clear and friendly)"

Use Cases

1. Content Creation: Generate unique character voices for audiobooks, podcasts, or video narration without hiring voice actors.

2. Rapid Prototyping: Test different voice styles for your application before recording custom samples.

3. Accessibility: Create personalized TTS voices for users who prefer specific vocal characteristics (e.g., gender-affirming voices for transgender users).

4. Localization: Generate culturally appropriate voices for different markets (e.g., "British male, posh accent" vs "American male, casual Southern accent").

Voice Cloning: Three Modes for Different Use Cases

Mode 1: Controllable Voice Cloning

Overview: Clone a voice from a short reference audio clip, with optional style guidance to control emotion, pace, and expression while preserving the original timbre.

How It Works:

python

# Basic voice cloning (reuses `model` and `sf` from the Voice Design example above)
wav = model.generate(
    text="This is a cloned voice generated by VoxCPM2.",
    reference_wav_path="path/to/voice.wav",
)
sf.write("clone.wav", wav, model.tts_model.sample_rate)

# Cloning with style control
wav = model.generate(
    text="(slightly faster, cheerful tone)This is a cloned voice with style control.",
    reference_wav_path="path/to/voice.wav",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("controllable_clone.wav", wav, model.tts_model.sample_rate)

Key Features:

Minimal reference audio: 3-10 seconds is enough
Style control: Add text instructions to modify emotion, pace, or expression
Timbre preservation: Core voice identity remains consistent
Flexible output: Same voice, different styles
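Before cloning, it can help to confirm that a reference clip falls in the 3-10 second range mentioned above. The sketch below uses only the standard-library `wave` module, so it handles uncompressed PCM WAV files only. The function names are hypothetical, and treating 10 seconds as an upper bound is an assumption rather than a documented limit.

```python
import wave


def clip_info(path):
    """Return (duration_seconds, sample_rate) for a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        duration = wav_file.getnframes() / sample_rate
    return duration, sample_rate


def is_usable_reference(path, min_seconds=3.0, max_seconds=10.0):
    """Check the clip length against the 3-10 second guideline."""
    duration, _ = clip_info(path)
    return min_seconds <= duration <= max_seconds
```

A call such as `is_usable_reference("path/to/voice.wav")` could gate the `model.generate(...)` call so that very short clips are rejected early.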

Use Cases:

Customer support: Clone brand voice but vary emotion (calm, apologetic, enthusiastic)
Audiobook narration: One voice, multiple character emotions
Localization: Clone voice across languages with culturally appropriate prosody

Mode 2: Ultimate Cloning

Overview: Provide both reference audio and its exact transcript for continuation-based cloning that reproduces every vocal nuance—timbre, rhythm, emotion, and style.

How It Works:

python

# Reuses `model` and `sf` from the Voice Design example above
wav = model.generate(
    text="This is an ultimate cloning demonstration using VoxCPM2.",
    prompt_wav_path="path/to/voice.wav",
    prompt_text="The transcript of the reference audio.",
    reference_wav_path="path/to/voice.wav",  # optional, for better similarity
)
sf.write("ultimate_clone.wav", wav, model.tts_model.sample_rate)

Key Features:

Maximum fidelity: Reproduces every vocal detail
Seamless continuation: Model continues from the reference as if it's the same recording
Transcript-aware: Uses reference transcript to match prosody and rhythm
Best for: Highest-quality cloning when you have both audio and transcript

Use Cases:

Podcast editing: Insert new segments that sound identical to original recording
Video dubbing: Match original actor's voice exactly
Personalized TTS: Clone your own voice for assistive technology

Mode 3: Voice Design (Covered Above)

Create voices from text descriptions with no reference audio.

48kHz Studio-Quality Audio Output

AudioVAE V2: Asymmetric Encode/Decode

VoxCPM2 uses AudioVAE V2 with asymmetric encode/decode architecture:

Encoding: Accepts 16kHz reference audio

Decoding: Outputs 48kHz studio-quality audio

Built-in Super-Resolution: The model internally upsamples from 16kHz to 48kHz during generation—no external upsampler needed.

Audio Quality Metrics

Model	Sample Rate	Quality Tier	Use Case
VoxCPM2	48kHz	Studio quality	Professional content, music, high-fidelity
VoxCPM1.5	44.1kHz	CD quality	Audio production
VoxCPM-0.5B	16kHz	Telephony	Voice assistants, accessibility

48kHz is the professional audio standard used in:

Film and video production
Music recording and mastering
Broadcasting (TV, streaming)
High-fidelity audio applications

Benefit: VoxCPM2 output is already at the 48kHz rate these workflows use, so it needs no separate upsampling step.

Performance Benchmarks

Inference Speed (RTF)

Real-Time Factor (RTF): Lower is better. RTF < 1.0 means faster than real-time.

Model	RTF (RTX 4090 PyTorch)	RTF (Nano-vLLM)	VRAM
VoxCPM2	~0.30	~0.13	~8 GB
VoxCPM1.5	~0.15	~0.08	~6 GB
VoxCPM-0.5B	~0.17	~0.10	~5 GB

Interpretation:

RTF 0.30: Generate 10 seconds of audio in 3 seconds (about 3.3× real-time speed)
RTF 0.13 (Nano-vLLM): Generate 10 seconds of audio in 1.3 seconds (about 7.7× real-time speed)

Production Deployment: For high-throughput serving, Nano-vLLM achieves ~0.13 RTF with concurrent request support and async API.
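The arithmetic behind these figures is simple enough to check directly. The helper names below are illustrative only; the sketch just encodes the definition RTF = generation time / audio duration and its inverse.

```python
def real_time_factor(generation_seconds, audio_seconds):
    """RTF: seconds of compute spent per second of audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


def speedup(rtf):
    """How many times faster than playback the generation runs."""
    return 1.0 / rtf


def generation_time(rtf, audio_seconds):
    """Expected compute time for a clip of the given length."""
    return rtf * audio_seconds


for rtf in (0.30, 0.13):
    print(f"RTF {rtf}: {generation_time(rtf, 10):.1f} s per 10 s of audio, "
          f"{speedup(rtf):.1f}x real time")
```

To measure RTF on your own hardware, time a `model.generate(...)` call and divide by the length of the returned audio in seconds (number of samples divided by the sample rate).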

Multilingual WER (Word Error Rate)

VoxCPM2 achieves state-of-the-art or comparable results on public zero-shot TTS benchmarks:

Seed-TTS-eval (English):

VoxCPM2: WER competitive with Seed-TTS and GPT-SoVITS
SIM (speaker similarity) scores comparable to commercial models

CV3-eval (Multilingual):

Low WER/CER across 30 languages
Particularly strong on Chinese, English, Spanish, French, German, Japanese

Minimax-MLS-test:

WER competitive with leading multilingual TTS models
High SIM scores indicating strong voice cloning fidelity

Internal 30-Language ASR Benchmark:

Tested on 30 languages × 500 samples
ASR transcription evaluated via Gemini 3.1 Flash Lite API
Consistently low WER across all supported languages
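For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and the ASR transcript of the synthesized audio, divided by the number of reference words. The sketch below is a generic implementation of that definition, not the evaluation code used for these benchmarks; it does no text normalization (case, punctuation), which real evaluations usually apply first.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

CER (character error rate), used for languages such as Chinese, follows the same recipe with characters in place of words.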

Instruction-Guided Voice Design

InstructTTSEval (Voice Design from text instructions):

VoxCPM2 outperforms or matches commercial models in following voice description instructions
High accuracy in generating voices matching gender, age, tone, and emotion specifications

Fine-Tuning: Adapt to Your Use Case

VoxCPM2 supports both LoRA fine-tuning (parameter-efficient) and full fine-tuning (SFT).

LoRA Fine-Tuning (Recommended)

Data Requirements: As little as 5-10 minutes of audio

Command:

bash

python scripts/train_voxcpm_finetune.py \
    --config_path conf/voxcpm_v2/voxcpm_finetune_lora.yaml

Use Cases:

Adapt to a specific speaker (personal voice cloning)
Fine-tune for domain-specific vocabulary (medical, legal, technical)
Adjust prosody for brand voice consistency

Benefits:

Fast training: Hours, not days
Low VRAM: Can train on consumer GPUs
Hot-swapping: Load/unload LoRA adapters at runtime without restarting
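The speed and memory benefits follow from how few parameters LoRA trains. Instead of updating a full weight matrix, it learns two thin matrices of rank r. The sketch below counts both; the 2048-wide layer and rank 8 are hypothetical values chosen for illustration, not VoxCPM2's actual dimensions or its default LoRA configuration.

```python
def full_update_params(d_in, d_out):
    """Trainable values when the whole d_out x d_in weight matrix is updated."""
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    """Trainable values for LoRA: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


d_in = d_out = 2048  # hypothetical layer width
rank = 8             # hypothetical LoRA rank

full = full_update_params(d_in, d_out)
lora = lora_params(d_in, d_out, rank)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```

Under these assumptions LoRA trains under 1% of the values per adapted layer, which is also why adapters are small enough to load and unload at runtime.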

Full Fine-Tuning (SFT)

Data Requirements: 1-10 hours of audio (more data = better results)

Command:

bash

python scripts/train_voxcpm_finetune.py \
    --config_path conf/voxcpm_v2/voxcpm_finetune_all.yaml

Use Cases:

Add new languages not in the original 30
Adapt to specialized domains (e.g., medical terminology, regional accents)
Create custom voice models for commercial products

WebUI for Training & Inference

VoxCPM2 includes a Gradio WebUI for visual training and inference:

bash

python lora_ft_webui.py   # then open http://localhost:7860

Features:

Upload training data via drag-and-drop
Configure LoRA parameters visually
Monitor training progress with live loss charts
Test fine-tuned models immediately after training
Export LoRA adapters for production deployment

Deployment Options

1. Python API (Simple)

Installation:

bash

pip install voxcpm

Basic Usage:

python

from voxcpm import VoxCPM
import soundfile as sf

model = VoxCPM.from_pretrained("openbmb/VoxCPM2", load_denoiser=False)

wav = model.generate(
    text="VoxCPM2 is the current recommended release for realistic multilingual speech synthesis.",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("demo.wav", wav, model.tts_model.sample_rate)

Use Cases: Prototyping, local experimentation, Jupyter notebooks

2. CLI (Command-Line)

Voice Design:

bash

voxcpm design \
  --text "VoxCPM2 brings studio-quality multilingual speech synthesis." \
  --output out.wav

Voice Cloning:

bash

voxcpm clone \
  --text "This is a voice cloning demo." \
  --reference-audio path/to/voice.wav \
  --output out.wav

Batch Processing:

bash

voxcpm batch --input examples/input.txt --output-dir outs

Use Cases: Scripting, automation, CI/CD pipelines

3. Gradio WebUI (Interactive)

bash

python app.py --port 8808  # then open http://localhost:8808

Device Selection:

bash

python app.py --device auto  # auto, cpu, mps, cuda, cuda:N

Use Cases: Demos, non-technical users, rapid prototyping

4. Nano-vLLM (Production High-Throughput)

Installation:

bash

pip install nano-vllm-voxcpm

Usage:

python

from nanovllm_voxcpm import VoxCPM
import numpy as np, soundfile as sf

server = VoxCPM.from_pretrained(model="/path/to/VoxCPM", devices=[0])
chunks = list(server.generate(target_text="Hello from VoxCPM!"))
sf.write("out.wav", np.concatenate(chunks), 48000)
server.stop()

Features:

RTF ~0.13 (vs ~0.30 PyTorch)
Concurrent requests: Batched processing
Async API: Non-blocking inference
FastAPI server: HTTP endpoint for microservices

Use Cases: Production APIs, high-throughput serving, microservices

5. vLLM-Omni (OpenAI-Compatible API)

Installation:

bash

uv pip install vllm==0.19.0 --torch-backend=auto
git clone https://github.com/vllm-project/vllm-omni.git && cd vllm-omni
uv pip install -e .

Launch Server:

bash

vllm serve openbmb/VoxCPM2 --omni --port 8000

Call from any OpenAI client:

bash

curl http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"openbmb/VoxCPM2","input":"Hello from VoxCPM2 on vLLM-Omni!","voice":"default"}' \
  --output out.wav

Features:

OpenAI-compatible /v1/audio/speech endpoint
PagedAttention KV cache: Efficient memory management
Continuous batching: Automatic request batching
Multi-GPU deployment: Scale across multiple GPUs
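The same request can be issued from Python with only the standard library. This sketch mirrors the curl call above (same URL, JSON fields, and header); the function names are hypothetical, and it assumes the server from the launch command is listening on localhost:8000.

```python
import json
import urllib.request


def build_speech_request(text, base_url="http://localhost:8000",
                         model="openbmb/VoxCPM2", voice="default"):
    """Build a POST request mirroring the curl example."""
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text, output_path, timeout=120):
    """Send the request and write the returned audio bytes to disk."""
    request = build_speech_request(text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        audio_bytes = response.read()
    with open(output_path, "wb") as out_file:
        out_file.write(audio_bytes)


# Example (requires a running server):
# synthesize_to_file("Hello from VoxCPM2 on vLLM-Omni!", "out.wav")
```

Keeping request construction separate from the network call makes the payload easy to inspect or unit-test without a live server.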

Use Cases: Enterprise deployments, OpenAI API drop-in replacement, multi-tenant serving

Comparison: VoxCPM2 vs Alternatives

VoxCPM2 vs GPT-Realtime 2.0 (OpenAI)

Feature	VoxCPM2	GPT-Realtime 2.0
License	Apache 2.0 (open-source)	Proprietary (API-only)
Languages	30 languages	Not disclosed (likely many via GPT-5 base)
Voice Design	✅ (text descriptions)	❌ (no reported feature)
Voice Cloning	✅ (3 modes)	Limited (reference audio support unclear)
Audio Quality	48kHz studio	Likely comparable
Deployment	Self-hosted or cloud	Cloud-only (OpenAI API)
Pricing	Free (self-hosted)	$32/1M input tokens, $64/1M output
RTF	~0.13 (Nano-vLLM)	Not disclosed
Use Case	Production TTS, voice cloning	Real-time voice agents, conversational AI

Winner: VoxCPM2 for cost-sensitive production TTS and voice cloning; GPT-Realtime 2.0 for real-time conversational voice agents.

VoxCPM2 vs ElevenLabs

Feature	VoxCPM2	ElevenLabs
License	Apache 2.0 (open-source)	Proprietary (API-only)
Languages	30 languages	29 languages (as of 2026)
Voice Design	✅ (text descriptions)	✅ (Voice Lab)
Voice Cloning	✅ (3 modes, free)	✅ (Professional Voice Cloning, paid)
Audio Quality	48kHz studio	High-quality (exact specs undisclosed)
Deployment	Self-hosted	Cloud-only (ElevenLabs API)
Pricing	Free (self-hosted)	~$22/month (Creator), ~$99/month (Pro)
Emotional Range	High (via Voice Design & cloning)	Very high (ElevenLabs specialty)

Winner: VoxCPM2 for self-hosted, cost-free production; ElevenLabs for managed service with exceptional emotional expressiveness.

VoxCPM2 vs Coqui TTS (Open-Source)

Feature	VoxCPM2	Coqui TTS
Architecture	Tokenizer-free diffusion autoregressive	Traditional (VITS, Tacotron2, etc.)
Languages	30 languages	~10 languages (varies by model)
Voice Cloning	✅ (3 modes)	✅ (VITS, YourTTS)
Audio Quality	48kHz studio	22kHz standard
Active Development	✅ (2026 release)	⚠️ (Coqui AI shut down in 2023, community-maintained)
License	Apache 2.0	MPL 2.0 (some models proprietary)

Winner: VoxCPM2 for modern architecture, active development, and higher audio quality; Coqui TTS for legacy compatibility and mature ecosystem.

Risks and Limitations

1. Potential for Misuse

Risk: VoxCPM2's voice cloning can generate highly realistic synthetic speech, enabling:

Impersonation: Mimicking public figures or individuals for fraud
Disinformation: Generating fake audio for propaganda or scams
Deepfakes: Creating misleading audio content

Mitigation:

Ethical use guidelines: OpenBMB's documentation includes responsible AI usage recommendations
Watermarking: Consider embedding audio watermarks for provenance tracking
Disclosure: Clearly mark AI-generated content as synthetic

OpenBMB's Statement:

"It is strictly forbidden to use VoxCPM for impersonation, fraud, or disinformation. We strongly recommend clearly marking any AI-generated content."

2. Controllable Generation Stability

Challenge: Voice Design and Controllable Voice Cloning results can vary between runs—same input may produce slightly different voices.

Implication: You may need to generate 1-3 times to obtain the desired voice or style.

Mitigation:

Use fixed random seeds for reproducibility during development
Generate multiple samples and select the best
Fine-tune with LoRA for consistent results in production
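The "generate several and pick the best" mitigation can be wrapped in a small helper. Both callables here are hypothetical placeholders: `generate(seed)` would seed the random number generator (for example with `torch.manual_seed(seed)`) and then call `model.generate(...)`, and `score(sample)` would be whatever quality measure you trust, such as a speaker-similarity metric or a manual rating. Whether seeding makes output fully repeatable on your hardware is an assumption to verify.

```python
def best_of_n(generate, score, n=3):
    """Call `generate(seed)` for seeds 0..n-1 and return the best (seed, sample)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [(seed, generate(seed)) for seed in range(n)]
    return max(candidates, key=lambda pair: score(pair[1]))


# Toy stand-ins so the sketch runs without a model.
toy_generate = lambda seed: seed * 10
toy_score = lambda sample: -abs(sample - 10)

best_seed, best_sample = best_of_n(toy_generate, toy_score, n=3)
print(best_seed, best_sample)
```

Returning the winning seed alongside the sample lets you regenerate the same voice later, provided generation is deterministic for a fixed seed.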

3. Language Coverage

Current: 30 languages officially supported

Limitation: Languages not on the list may:

Produce lower-quality synthesis
Require fine-tuning on custom data
Not work at all

Mitigation:

Test unsupported languages directly (model may generalize)
Collect 1-10 hours of target language audio and fine-tune
OpenBMB plans to expand language coverage in future releases

4. Production Safety and Testing

Recommendation: Before deploying VoxCPM2 in production:

Conduct thorough testing on your specific use case
Evaluate safety: Test for inappropriate outputs, bias, or unintended behavior
Monitor in production: Track output quality, latency, and error rates
Have fallback mechanisms: Graceful degradation if TTS fails
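One way to implement the fallback recommendation is a thin wrapper that retries the primary engine and then degrades to a backup. This is a generic sketch: `primary` and `fallback` are hypothetical callables (for instance, a function wrapping `model.generate` and a function calling a simpler backup TTS), and the retry count is arbitrary.

```python
import time


def synthesize_with_fallback(text, primary, fallback, retries=2, delay_seconds=0.0):
    """Try `primary(text)` up to `retries` times, then use `fallback(text)`.

    Returns (audio, source) where source is "primary" or "fallback".
    """
    for _ in range(retries):
        try:
            return primary(text), "primary"
        except Exception:  # broad on purpose: any failure triggers a retry
            if delay_seconds:
                time.sleep(delay_seconds)
    try:
        return fallback(text), "fallback"
    except Exception as error:
        raise RuntimeError("primary and fallback TTS both failed") from error
```

Logging the returned source value gives a simple production metric for how often the primary path fails.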

Ecosystem & Community Projects

VoxCPM2 has a growing ecosystem of community-contributed tools and integrations:

Project	Description
Nano-vLLM-VoxCPM	High-throughput GPU serving with async API
vLLM-Omni	Official vLLM omni-modal serving with OpenAI-compatible API
VoxCPM.cpp	GGML/GGUF: CPU, CUDA, Vulkan inference
VoxCPM-ONNX	ONNX export for CPU inference
VoxCPMANE	Apple Neural Engine backend for Mac
voxcpm_rs	Rust re-implementation
ComfyUI-VoxCPM	ComfyUI node-based workflows
ComfyUI_RH_VoxCPM	Feature-complete ComfyUI workflow with multi-speaker, LoRA, auto-ASR
ComfyUI-VoxCPMTTS	ComfyUI TTS extension
TTS WebUI	Browser-based TTS extension

Note: Community projects are not officially maintained by OpenBMB. Check individual repositories for support and documentation.

Getting Started with VoxCPM2

Step 1: Installation

Requirements:

Python ≥ 3.10 (<3.13)
PyTorch ≥ 2.5.0
CUDA ≥ 12.0 (for NVIDIA GPUs) or MPS (for Apple Silicon)

bash

pip install voxcpm

Step 2: Download the Model

From Hugging Face:

python

from voxcpm import VoxCPM
model = VoxCPM.from_pretrained("openbmb/VoxCPM2", load_denoiser=False)

From ModelScope (China-friendly mirror):

bash

pip install modelscope

python

from modelscope import snapshot_download
snapshot_download("OpenBMB/VoxCPM2", local_dir='./pretrained_models/VoxCPM2')

from voxcpm import VoxCPM
model = VoxCPM.from_pretrained("./pretrained_models/VoxCPM2", load_denoiser=False)

Step 3: Generate Your First Audio

Text-to-Speech:

python

import soundfile as sf

wav = model.generate(
    text="VoxCPM2 is a revolutionary tokenizer-free TTS model.",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("demo.wav", wav, model.tts_model.sample_rate)

Voice Design:

python

wav = model.generate(
    text="(Young female, warm and friendly)Welcome to VoxCPM2!",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("voice_design.wav", wav, model.tts_model.sample_rate)

Voice Cloning:

python

wav = model.generate(
    text="This is a cloned voice.",
    reference_wav_path="path/to/reference.wav",
)
sf.write("clone.wav", wav, model.tts_model.sample_rate)

Step 4: Experiment with Parameters

cfg_value (Classifier-Free Guidance):

Higher values (2.0-3.0): More adherence to instructions, less natural
Lower values (1.0-1.5): More natural, less controllable
Default: 2.0 (balanced)

inference_timesteps:

Higher values (50-100): Better quality, slower inference
Lower values (10-20): Faster inference, slight quality loss
Default: 10 (fast, good quality)
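A small grid makes it easier to compare these two settings systematically. The sketch below only builds the parameter combinations; the specific values are arbitrary picks from the ranges above, and `parameter_grid` is a hypothetical helper rather than part of the `voxcpm` package.

```python
from itertools import product


def parameter_grid(cfg_values, timestep_values):
    """Return every combination of the two settings as keyword-argument dicts."""
    return [
        {"cfg_value": cfg, "inference_timesteps": steps}
        for cfg, steps in product(cfg_values, timestep_values)
    ]


grid = parameter_grid([1.5, 2.0, 2.5], [10, 20])
for params in grid:
    # With a loaded model, each entry could be used as:
    #   wav = model.generate(text="...", **params)
    print(params)
```

Writing each result to a file named after its parameters makes side-by-side listening straightforward.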

Step 5: Fine-Tune (Optional)

If you want to adapt VoxCPM2 to your specific voice or domain:

bash

# Prepare your audio files in a directory
# Run LoRA fine-tuning
python scripts/train_voxcpm_finetune.py \
    --config_path conf/voxcpm_v2/voxcpm_finetune_lora.yaml

See the Fine-tuning Guide for full instructions.

Bottom Line: Open-Source TTS Reaches New Heights

VoxCPM2 represents a major leap forward in open-source Text-to-Speech technology, matching or exceeding commercial alternatives in:

Audio quality (48kHz studio-grade)
Multilingual support (30 languages)
Voice cloning fidelity (three modes for different use cases)
Innovation (Voice Design from text descriptions)

Key Takeaways:

Tokenizer-free architecture eliminates information loss from discrete tokenization, achieving more natural and expressive synthesis
Voice Design enables creating voices from text descriptions alone—no reference audio required
48kHz output matches the professional audio standard, with no external upsampler required
Apache 2.0 license makes it free for commercial use, unlike proprietary alternatives
RTF ~0.13 (Nano-vLLM) enables real-time production deployment
30 languages with automatic detection cover most global markets
Fine-tuning support (LoRA, SFT) allows domain adaptation with minimal data

Who Should Care:

Product teams: Replace expensive TTS APIs (ElevenLabs, Google, AWS) with self-hosted VoxCPM2
Content creators: Generate high-quality narration, character voices, and multilingual content
Researchers: Build on state-of-the-art open-source TTS architecture
Accessibility advocates: Deploy cost-free, high-quality TTS for assistive technology
Enterprises: Self-host for data privacy, cost savings, and customization

VoxCPM2 proves that open-source AI can match or exceed commercial offerings—and with Apache 2.0 licensing, it's ready for production use today.

Disclosure: This post is editorial analysis based on the VoxCPM2 GitHub repository, official documentation, and public benchmarks as of May 31, 2026. Performance metrics and pricing details are accurate at time of writing but may change. For the latest information, visit the official VoxCPM GitHub repository and documentation.
