Today’s paper introduces Voxtral TTS, a multilingual text-to-speech system that generates natural, expressive speech from just 3 seconds of reference audio. The model combines autoregressive generation of semantic content with flow matching for acoustic detail. In human evaluations of voice cloning, it achieved a 68.4% preference rate over ElevenLabs Flash v2.5. The system supports 9 languages and is designed for low-latency streaming inference.
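
To make the two-stage idea concrete, here is a toy sketch of how an autoregressive stage and a flow-matching stage can fit together. It is not the paper's implementation: `toy_semantic_tokens`, `toy_velocity`, `flow_matching_sample`, `synthesize`, and the codebook are hypothetical names invented for illustration, and the learned velocity network is replaced by a closed-form "oracle" field for straight-line paths from noise to a known target.

```python
import numpy as np

VOCAB_SIZE = 16  # assumption: a tiny toy vocabulary, unrelated to the real model


def toy_semantic_tokens(text, vocab_size=VOCAB_SIZE):
    """Hypothetical stand-in for the autoregressive stage.

    Emits one token per character; each token depends on the previous
    token, which mimics the left-to-right dependency of AR decoding.
    """
    tokens = []
    prev = 0
    for ch in text:
        prev = (prev * 31 + ord(ch)) % vocab_size
        tokens.append(prev)
    return tokens


def toy_velocity(x, t, target):
    """Oracle velocity for the straight path x_t = (1 - t) * x_0 + t * target.

    A real system would use a trained network here; this closed form is
    only available because the toy target is known in advance.
    """
    return (target - x) / (1.0 - t)


def flow_matching_sample(target, n_steps=8, seed=0):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 with Euler steps."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(target.shape)  # start from Gaussian noise
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt  # t stays strictly below 1, so the division is safe
        x = x + dt * toy_velocity(x, t, target)
    return x


def synthesize(text, dim=4):
    """Map text to toy 'acoustic frames': AR tokens first, then one flow per token."""
    # Assumption: each token has a fixed target vector in a made-up codebook.
    codebook = np.linspace(-1.0, 1.0, VOCAB_SIZE)[:, None] * np.ones((1, dim))
    tokens = toy_semantic_tokens(text)
    frames = [flow_matching_sample(codebook[tok], seed=i) for i, tok in enumerate(tokens)]
    return np.stack(frames)
```

Calling `synthesize("hi")` returns an array with one row per input character. The point of the split is that the sequential, token-by-token stage decides what is said, while the flow stage fills in continuous detail in a small, fixed number of integration steps, which is one plausible reason such a design suits low-latency streaming.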