Why TTS Models Now Look Like LLMs — Samuel Humeau, Mistral
The dominant architecture pattern for text-to-speech in 2026 looks a lot like an LLM: an autoregressive transformer generating sequences of tokens, one frame of audio at a time. Samuel Humeau from Mistral walks through why the field converged there, how neural audio codecs solve the information-density problem (raw audio carries roughly 200 kbps of signal, far too much to feed a transformer directly), and the streaming trick that makes voice agents feel responsive before the full audio has even finished generating.
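To see why the information-density problem forces a codec into the pipeline, a quick back-of-envelope helps. The numbers below are illustrative assumptions (a 16 kHz, 16-bit mono signal, and a hypothetical codec emitting 75 frames per second from 8 codebooks of 1024 entries each), not Mistral's actual configuration:

```python
import math

# Back-of-envelope: why raw audio can't be fed to a transformer directly.
# All concrete numbers here are illustrative assumptions, not Mistral's codec.

def raw_bitrate_bps(sample_rate_hz: int, bits_per_sample: int, channels: int = 1) -> int:
    """Bitrate of raw PCM audio in bits per second."""
    return sample_rate_hz * bits_per_sample * channels

def codec_bitrate_bps(frames_per_sec: int, codebooks: int, codebook_size: int) -> float:
    """Bitrate of a neural codec emitting discrete tokens:
    frames/s x codebooks x bits-per-token."""
    return frames_per_sec * codebooks * math.log2(codebook_size)

raw = raw_bitrate_bps(16_000, 16)        # 256_000 bps, the ~200 kbps ballpark
codec = codec_bitrate_bps(75, 8, 1024)   # 75 fps x 8 codebooks x 10 bits = 6_000 bps
print(f"raw: {raw/1000:.0f} kbps, codec: {codec/1000:.0f} kbps, "
      f"compression: {raw/codec:.0f}x")
```

At these assumed settings the codec shrinks the stream by more than an order of magnitude, turning audio into a token sequence short enough for an autoregressive transformer to model.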
The talk uses Mistral's just-released open-weight TTS model as a running example — live demos of voice cloning from a few seconds of reference audio, a voice agent answering real conference schedule questions, and a breakdown of the codec-to-backbone-to-decoder pipeline that produces it all. There's also a frank section on what's still unsettled: how to handle streaming text input (tokens arriving from an LLM in real time rather than a fixed block of text) and why getting that right is the next meaningful latency win in agent pipelines.
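The codec-to-backbone-to-decoder structure and the streaming behavior described above can be sketched in a few lines. Everything here is a hypothetical placeholder (the class name, the dummy next-token logic), meant only to show the shape of the loop: the backbone emits one audio-codec token per step, and yielding tokens as they are produced is what lets playback begin before generation finishes:

```python
from dataclasses import dataclass, field

@dataclass
class TTSPipeline:
    """Toy sketch of the backbone stage: text tokens in, audio tokens out,
    one frame at a time. A real system would decode each frame to waveform."""
    generated: list = field(default_factory=list)

    def backbone_step(self, text_tokens, audio_so_far):
        # Stand-in for an autoregressive transformer step: predicts the next
        # audio-codec token conditioned on the text and all tokens so far.
        return (len(audio_so_far) * 31 + len(text_tokens)) % 1024  # dummy logic

    def stream(self, text_tokens, n_frames):
        # Yield audio tokens frame by frame -- this is the streaming trick:
        # a decoder can start synthesizing audio from the first frames
        # while later frames are still being generated.
        for _ in range(n_frames):
            tok = self.backbone_step(text_tokens, self.generated)
            self.generated.append(tok)
            yield tok

tokens = list(TTSPipeline().stream(text_tokens=[1, 2, 3], n_frames=5))
```

The still-open question the talk flags, streaming *text* input, would mean `text_tokens` itself arrives incrementally from an LLM rather than as a fixed list, so the loop can no longer condition on the full text up front.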
It's the kind of talk that makes the system feel less like a black box — not by oversimplifying, but by showing exactly which engineering choices are load-bearing and which are still open problems.
Speaker info:
- https://x.com/DrSamuelBHume
- https://www.linkedin.com/in/samuelhumeau/
Watch on YouTube