Why TTS Models Now Look Like LLMs — Samuel Humeau, Mistral

AI Engineer · Intermediate · 🧠 Large Language Models · 4d ago
The dominant architecture for text-to-speech in 2026 looks a lot like an LLM: an autoregressive transformer generating sequences of tokens, one frame of audio at a time. Samuel Humeau from Mistral walks through why the field converged there, how neural audio codecs solve the information-density problem (raw audio carries roughly 200 kbps of signal, far too much to feed directly to a transformer), and the streaming trick that makes voice agents feel responsive before the full audio has finished generating.

The talk uses Mistral's just-released open-weight TTS model as a running example: live demos of voice cloning from a few seconds of reference audio, a voice agent answering real conference schedule questions, and a breakdown of the codec-to-backbone-to-decoder pipeline that produces it all. There is also a frank section on what remains unsettled: how to handle streaming text input (tokens arriving from an LLM in real time rather than as a fixed block of text) and why getting that right is the next meaningful latency win in agent pipelines. It's the kind of talk that makes the system feel less like a black box, not by oversimplifying, but by showing exactly which engineering choices are load-bearing and which are still open problems.

Speaker info:
- https://x.com/DrSamuelBHume
- https://www.linkedin.com/in/samuelhumeau/
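To see why the information-density point matters, here is a back-of-the-envelope sketch of raw PCM bitrate versus a neural-codec token stream. All parameters here (16 kHz 16-bit audio, 12.5 frames per second, 8 codebooks, 1024-entry tables) are illustrative assumptions, not the actual numbers from Mistral's codec:

```python
def raw_bitrate_kbps(sample_rate_hz: int, bit_depth: int, channels: int = 1) -> float:
    """Bitrate of uncompressed PCM audio, in kilobits per second."""
    return sample_rate_hz * bit_depth * channels / 1000

def codec_bitrate_kbps(frames_per_sec: float, codebooks: int, bits_per_code: int) -> float:
    """Bitrate of a neural-codec token stream: each frame is `codebooks`
    discrete indices, each index drawn from a 2**bits_per_code entry table."""
    return frames_per_sec * codebooks * bits_per_code / 1000

# Hypothetical numbers, chosen only to show the order-of-magnitude gap:
raw = raw_bitrate_kbps(16_000, 16)            # 16 kHz, 16-bit mono -> 256.0 kbps
tokens = codec_bitrate_kbps(12.5, 8, 10)      # 12.5 Hz frames, 8 x 10-bit codes -> 1.0 kbps
print(f"raw: {raw} kbps, tokens: {tokens} kbps, ratio: {raw / tokens:.0f}x")
```

A compression ratio in the hundreds is what makes it feasible for an autoregressive transformer to model audio one frame of discrete tokens at a time, the same way an LLM models text.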
Watch on YouTube ↗

Related AI Lessons

I Asked AI to Teach Algebra. The First Result Was Slop. Here’s How We Fixed It.
Learn how to improve AI-generated educational content by refining prompts and fine-tuning models, as demonstrated by a project to create an AI-generated algebra course
Medium · Machine Learning
AI Is Like a Super Smart Toy Box — But It Still Needs You
Discover how AI can augment human capabilities, but still requires human input and oversight to function effectively
Medium · AI
OpenAI Prompt Caching in 2026: When You'll Save 75% (And When You Won't)
Learn how OpenAI prompt caching can save you 75% of costs in 2026 and when it's not applicable
Dev.to · Leolionel221
Up next
5 Levels of AI Agents - From Simple LLM Calls to Multi-Agent Systems
Dave Ebbelaar (LLM Eng)
Watch →