Paolo Perrone (@paoloap):

You can now clone a human voice in real time without tokenization.

It's called VoxCPM.

Most TTS models convert speech to discrete tokens.

Tokens lose information. Create artifacts. Cause unnatural pauses.

VoxCPM generates speech in continuous space.

End-to-end diffusion. No tokenizer. No information loss.
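Why discrete tokens lose information can be seen in a toy example. The sketch below is only an illustration, not how VoxCPM or any real speech tokenizer works: it assumes a simple uniform quantizer (the hypothetical `quantize` helper) and a sine wave standing in for a continuous speech feature. Snapping values to a small set of levels leaves a reconstruction error that shrinks as levels are added but does not reach zero.

```python
import math


def quantize(samples, levels):
    """Snap each sample in [-1, 1] to the nearest of `levels` evenly spaced values."""
    step = 2.0 / (levels - 1)
    return [round((s + 1.0) / step) * step - 1.0 for s in samples]


def rms_error(original, reconstructed):
    """Root-mean-square difference between two equal-length sequences."""
    total = sum((a - b) ** 2 for a, b in zip(original, reconstructed))
    return math.sqrt(total / len(original))


# One cycle of a sine wave as a stand-in for a continuous speech feature.
signal = [math.sin(2 * math.pi * n / 64) for n in range(64)]

coarse_error = rms_error(signal, quantize(signal, 8))    # small "vocabulary"
fine_error = rms_error(signal, quantize(signal, 256))    # larger "vocabulary"

print(f"8 levels:   RMS error {coarse_error:.4f}")
print(f"256 levels: RMS error {fine_error:.4f}")
```

A model that works directly in continuous space skips this rounding step, which is the point the post is making.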

What you get:

1️⃣ Context-aware generation

→ Model reads text, infers appropriate prosody

→ Adapts style based on content automatically

2️⃣ Zero-shot voice cloning

→ Short reference clip is all you need

→ Captures timbre, accent, emotion, rhythm, pacing

3️⃣ Real-time synthesis

→ Real-time factor (RTF) of 0.15 on an RTX 4090, i.e. about 0.15 s of compute per second of audio

→ Streaming supported
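What the RTF figure means in practice, as a quick back-of-the-envelope sketch. RTF is synthesis time divided by audio duration, so anything below 1.0 is faster than real time. The helper names are made up for illustration, and the 0.15 figure is the one quoted above for an RTX 4090; other hardware will differ.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent generating / duration of the audio produced."""
    return synthesis_seconds / audio_seconds


def estimated_synthesis_seconds(audio_seconds, rtf):
    """Rough compute time needed for a clip of the given length."""
    return audio_seconds * rtf


# One minute of speech at the quoted RTF of 0.15.
estimate = estimated_synthesis_seconds(60.0, 0.15)
print(f"60 s of audio needs roughly {estimate:.1f} s of compute")
```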

The specs:

→ 800M parameters

→ 44.1 kHz output

→ 1.8M hours of training data

→ Supports LoRA fine-tuning
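What 44.1 kHz means concretely: 44,100 samples for every second of audio. A minimal sketch of storing a waveform at that rate, using only the Python standard library. It assumes mono float samples in the range -1.0 to 1.0; the `write_wav` helper and the `tone.wav` file name are made up for illustration, and a sine tone stands in for generated speech.

```python
import math
import struct
import wave

SAMPLE_RATE = 44_100  # samples per second, matching the 44.1 kHz spec


def write_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, s))          # guard against overshoot
        frames += struct.pack("<h", int(clipped * 32767))
    with wave.open(path, "wb") as f:
        f.setnchannels(1)       # mono
        f.setsampwidth(2)       # 2 bytes = 16 bits per sample
        f.setframerate(sample_rate)
        f.writeframes(bytes(frames))


# Half a second of a 440 Hz tone as a stand-in for generated speech.
tone = [
    0.3 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
    for n in range(SAMPLE_RATE // 2)
]
write_wav("tone.wav", tone)
```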

pip install voxcpm

from voxcpm import VoxCPM

model = VoxCPM.from_pretrained("openbmb/VoxCPM1.5")

wav = model.generate(text="Your text here")
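For the voice cloning case, a hedged sketch of passing a reference clip. The keyword names `prompt_wav_path` and `prompt_text` are recalled from the project README and may differ between versions, so check them against the release you install. The `clone_voice` wrapper, the file name, and the transcript are hypothetical.

```python
def clone_voice(model, text, reference_wav, reference_transcript):
    """Generate `text` in the voice of a short reference clip.

    Assumes `model.generate` accepts `prompt_wav_path` and `prompt_text`;
    verify these names against the installed voxcpm version.
    """
    return model.generate(
        text=text,
        prompt_wav_path=reference_wav,      # path to the reference audio
        prompt_text=reference_transcript,   # what is said in that audio
    )


# Usage, assuming voxcpm is installed and reference.wav exists:
# from voxcpm import VoxCPM
# model = VoxCPM.from_pretrained("openbmb/VoxCPM1.5")
# wav = clone_voice(model, "Your text here", "reference.wav",
#                   "Transcript of the reference clip.")
```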

5.5k GitHub stars. Apache 2.0.

💾 Save for when you need TTS that doesn't sound like TTS

♻️ Repost if you've been burned by robotic voice cloning

github.com

VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning - OpenBMB/VoxCPM

Feb 12

2:03 AM