
Today’s paper introduces Voxtral TTS, a multilingual text-to-speech system that can generate natural, expressive speech from just 3 seconds of reference audio. The model combines autoregressive generation for semantic content with flow-matching for acoustic details, achieving a 68.4% preference rate over ElevenLabs Flash v2.5 in human evaluations for voice cloning tasks. The system supports 9 languages and is designed for low-latency streaming inference.
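The summary does not say how the flow-matching stage turns noise into acoustic detail. The sketch below is a generic illustration of flow-matching sampling, not the paper's method. It starts from Gaussian noise at t = 0 and integrates dx/dt = v(x, t) to t = 1 with Euler steps. Everything here is a hypothetical assumption: `toy_velocity`, the target `frame`, and `num_steps`. A real system would use a learned network, conditioned on the autoregressively generated semantic content, in place of `toy_velocity`.

```python
import random


def toy_velocity(x, t, target):
    """Hypothetical stand-in for a learned velocity network.

    For a single known target, the conditional flow-matching velocity along
    the straight path x_t = (1 - t) * x0 + t * target is (target - x) / (1 - t).
    A real model would have to predict the velocity from context instead.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def flow_matching_sample(target, num_steps=8, seed=0):
    """Euler-integrate dx/dt = v(x, t) from Gaussian noise at t=0 to t=1."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # initial noise sample
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt  # t stays below 1, so toy_velocity never divides by zero
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical "acoustic frame" that the toy velocity field points toward.
frame = [0.5, -1.2, 3.0]
print(flow_matching_sample(frame))
```

With this idealized field, every seed ends at `frame` up to floating-point rounding. A learned field is only approximate, so in practice the number of integration steps trades sample quality against latency.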

Mar 27 at 6:27 PM