Why TTS Models Now Look Like LLMs — Samuel Humeau, Mistral
The dominant architecture pattern for text-to-speech in 2026 looks a lot like an LLM: an autoregressive transformer generating sequences of tokens, one frame of audio at a time. Samuel Humeau from Mistral walks through why the field converged there, how neural audio codecs solve the information-density problem (raw audio carries roughly 200 kbps of signal, far too much to feed a transformer directly), and the streaming trick that makes voice agents feel responsive before the full audio has even finished generating.
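To see why the information-density problem forces a codec into the pipeline, a quick back-of-envelope helps. The numbers below are illustrative assumptions (a 16 kHz, 16-bit mono signal, and a hypothetical codec emitting 75 frames per second from 8 codebooks of 1024 entries each), not Mistral's actual configuration:

```python
import math

# Back-of-envelope: why raw audio can't be fed to a transformer directly.
# All concrete numbers here are illustrative assumptions, not Mistral's codec.

def raw_bitrate_bps(sample_rate_hz: int, bits_per_sample: int, channels: int = 1) -> int:
    """Bitrate of raw PCM audio in bits per second."""
    return sample_rate_hz * bits_per_sample * channels

def codec_bitrate_bps(frames_per_sec: int, codebooks: int, codebook_size: int) -> float:
    """Bitrate of a neural codec emitting discrete tokens:
    frames/s x codebooks x bits-per-token."""
    return frames_per_sec * codebooks * math.log2(codebook_size)

raw = raw_bitrate_bps(16_000, 16)        # 256_000 bps, the ~200 kbps ballpark
codec = codec_bitrate_bps(75, 8, 1024)   # 75 fps x 8 codebooks x 10 bits = 6_000 bps
print(f"raw: {raw/1000:.0f} kbps, codec: {codec/1000:.0f} kbps, "
      f"compression: {raw/codec:.0f}x")
```

At these assumed settings the codec shrinks the stream by more than an order of magnitude, turning audio into a token sequence short enough for an autoregressive transformer to model.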
The talk uses Mistral's just-released open-weight TTS model as a running example — live demos of voice cloning from a few seconds of reference audio, a voice agent answering real conference schedule questions, and a breakdown of the codec-to-backbone-to-decoder pipeline that produces it all. There's also a frank section on what's still unsettled: how to handle streaming text input (tokens arriving from an LLM in real time rather than a fixed block of text) and why getting that right is the next meaningful latency win in agent pipelines.
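The codec-to-backbone-to-decoder structure and the streaming behavior described above can be sketched in a few lines. Everything here is a hypothetical placeholder (the class name, the dummy next-token logic), meant only to show the shape of the loop: the backbone emits one audio-codec token per step, and yielding tokens as they are produced is what lets playback begin before generation finishes:

```python
from dataclasses import dataclass, field

@dataclass
class TTSPipeline:
    """Toy sketch of the backbone stage: text tokens in, audio tokens out,
    one frame at a time. A real system would decode each frame to waveform."""
    generated: list = field(default_factory=list)

    def backbone_step(self, text_tokens, audio_so_far):
        # Stand-in for an autoregressive transformer step: predicts the next
        # audio-codec token conditioned on the text and all tokens so far.
        return (len(audio_so_far) * 31 + len(text_tokens)) % 1024  # dummy logic

    def stream(self, text_tokens, n_frames):
        # Yield audio tokens frame by frame -- this is the streaming trick:
        # a decoder can start synthesizing audio from the first frames
        # while later frames are still being generated.
        for _ in range(n_frames):
            tok = self.backbone_step(text_tokens, self.generated)
            self.generated.append(tok)
            yield tok

tokens = list(TTSPipeline().stream(text_tokens=[1, 2, 3], n_frames=5))
```

The still-open question the talk flags, streaming *text* input, would mean `text_tokens` itself arrives incrementally from an LLM rather than as a fixed list, so the loop can no longer condition on the full text up front.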
It's the kind of talk that makes the system feel less like a black box — not by oversimplifying, but by showing exactly which engineering choices are load-bearing and which are still open problems.
Speaker info:
- https://x.com/DrSamuelBHume
- https://www.linkedin.com/in/samuelhumeau/
Watch on YouTube