Voice AI: when is the "Her" moment? — Neil Zeghidour, Gradium AI

AI Engineer · Intermediate · Large Language Models
The "Her" moment has been promised so many times it's become a joke: every new demo, every smooth-sounding voice agent gets the label. Neil Zeghidour, CEO of Gradium AI and one of the researchers behind Moshi, the first full-duplex voice model, uses this talk to be honest about where the gap actually is and why it keeps not closing.

The core tension: cascaded systems (speech-to-text → LLM → text-to-speech) are practical and getting smarter, but they are architecturally incapable of feeling like a real conversation. Latency from tool calls alone can run 500 ms to 4 seconds, while humans process and respond in around 200 ms total. Speech-to-speech models solve some of that but trade it for a different problem: they are still half-duplex, meaning they are either listening or talking but never both, which makes backchanneling impossible and makes the interaction feel robotic in a different way.

Moshi showed that full-duplex is solvable. What it didn't solve was making the model useful. And cost is a wall hiding behind the latency problem: TTS at scale is expensive enough that some teams burn through their funding before they can grow a user base.

The most underrated thread in the talk is paralinguistic understanding: voice carries tone, hesitation, discomfort, and cultural signals that are stripped out entirely the moment you transcribe to text. Getting to "Her" means building models that don't just produce natural-sounding speech but actually understand what the voice is carrying, and that's a science problem, not a prompt-engineering one.

Speaker info:
- https://x.com/neilzegh
- https://www.linkedin.com/in/neil-zeghidour-a838aaa7/
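To make the latency argument concrete, here is a minimal back-of-the-envelope sketch of a cascaded pipeline's turn latency. Only two figures come from the talk: tool calls adding 500 ms to 4 s, and humans responding in roughly 200 ms. The per-stage numbers for ASR, LLM time-to-first-token, and TTS are illustrative assumptions, not measurements of any real system.

```python
# Back-of-the-envelope latency budget for a cascaded voice agent.
# Stages run strictly in sequence, so their latencies add up.
CASCADE_MS = {
    "speech_to_text": 300,    # assumed streaming-ASR finalization delay
    "llm_first_token": 400,   # assumed LLM time-to-first-token
    "tool_call": 500,         # from the talk: tool calls add 500 ms to 4 s
    "text_to_speech": 200,    # assumed TTS time-to-first-audio
}

HUMAN_RESPONSE_MS = 200  # from the talk: humans respond in ~200 ms total


def total_latency(stages: dict[str, int]) -> int:
    """Sum per-stage latencies of a strictly sequential (cascaded) pipeline."""
    return sum(stages.values())


if __name__ == "__main__":
    total = total_latency(CASCADE_MS)
    ratio = total / HUMAN_RESPONSE_MS
    print(f"cascaded total: {total} ms (~{ratio:.0f}x human response time)")
```

Even with this optimistic tool-call figure, the sequential sum lands several multiples above the human baseline, which is the architectural point: a full-duplex model attacks the structure of the pipeline, not the speed of any one stage.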
