Voxtral Realtime

📰 ArXiv cs.AI

Voxtral Realtime is a natively streaming automatic speech recognition (ASR) model that matches offline transcription quality at sub-second latency.

advanced Published 7 Apr 2026

Action Steps

Train the model end-to-end for streaming using the Delayed Streams Modeling framework
Introduce explicit alignment between audio and text streams to improve transcription accuracy
Implement causal audio encoding to reduce latency
Evaluate the model on streaming audio to confirm that latency stays below one second and that transcription quality is comparable to an offline model
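
The alignment step above can be illustrated with a small sketch. It assumes word-level start times are available and places each word on a fixed frame grid, shifted later by a constant delay so the model has heard the word before it must emit it. The names `build_delayed_text_stream`, `PAD`, and the frame rate and delay values are hypothetical, chosen for illustration; this is not the paper's implementation.

```python
from typing import List, Tuple

PAD = "<pad>"  # hypothetical padding token for frames with no word


def build_delayed_text_stream(
    words: List[Tuple[str, float]],  # (word, start time in seconds)
    num_frames: int,
    frame_rate_hz: float,
    delay_s: float,
) -> List[str]:
    """Place each word on the audio frame grid, shifted later by delay_s."""
    stream = [PAD] * num_frames
    for word, start_s in words:
        frame = int((start_s + delay_s) * frame_rate_hz)
        # If the target frame is already taken, use the next free frame
        # so that no word is silently overwritten.
        while frame < num_frames and stream[frame] != PAD:
            frame += 1
        if frame < num_frames:
            stream[frame] = word
    return stream


if __name__ == "__main__":
    # Assumed values: 10 frames per second and a 0.5 s delay.
    stream = build_delayed_text_stream(
        [("hello", 0.0), ("world", 1.0)],
        num_frames=20,
        frame_rate_hz=10.0,
        delay_s=0.5,
    )
    print(stream)
```

In this sketch the delay is the latency budget: a larger `delay_s` gives the model more audio context per word at the cost of later output.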

Who Needs to Know This

Speech recognition engineers and researchers who need high-quality transcription in real time, for example in live captioning and voice assistants

Key Insight

💡 Voxtral Realtime's end-to-end training and causal audio encoding enable high-quality, real-time speech recognition
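
To make "causal" concrete, here is a minimal sketch of the property itself, not of the paper's encoder (whose design is not described in this summary). A frame may only use itself and earlier frames, so outputs already produced never change when more audio arrives. The function names and the optional `left_context` window are assumptions made for illustration.

```python
from typing import List, Optional


def causal_mask(num_frames: int, left_context: Optional[int] = None) -> List[List[bool]]:
    """mask[i][j] is True when frame i may look at frame j (only j <= i)."""
    mask = []
    for i in range(num_frames):
        row = []
        for j in range(num_frames):
            allowed = j <= i
            if left_context is not None:
                # Optionally limit how far back a frame may look.
                allowed = allowed and (i - j) <= left_context
            row.append(allowed)
        mask.append(row)
    return mask


def causal_mean(frames: List[float], left_context: Optional[int] = None) -> List[float]:
    """Toy 'encoder': average each frame with the past frames it may see."""
    mask = causal_mask(len(frames), left_context)
    out = []
    for i in range(len(frames)):
        visible = [frames[j] for j in range(len(frames)) if mask[i][j]]
        out.append(sum(visible) / len(visible))
    return out


if __name__ == "__main__":
    short = causal_mean([1.0, 2.0, 3.0])
    longer = causal_mean([1.0, 2.0, 3.0, 4.0])
    # Outputs for the first three frames are unchanged by the fourth frame.
    print(short, longer[:3])
```

The same mask shape is what a causal attention layer would apply; the averaging here only stands in for a real encoder so the prefix-invariance property is easy to check.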

Full Article

Title: Voxtral Realtime

Abstract:
arXiv:2602.11298v3

We introduce Voxtral Realtime, a natively streaming automatic speech recognition model that matches offline transcription quality at sub-second latency. Unlike approaches that adapt offline models through chunking or sliding windows, Voxtral Realtime is trained end-to-end for streaming, with explicit alignment between audio and text streams. Our architecture builds on the Delayed Streams Modeling framework, introducing a new causal audio encode…
