Any-to-Any: Building Native Multimodal Agents - Patrick Löber, Google DeepMind

AI Engineer · Intermediate · 🤖 AI Agents & Automation · 1h ago
Draw arrows on a map and ask Gemini to generate a picture of what you see. It produces the Golden Gate Bridge, not because it matched pixels, but because the image generation model is built on top of Gemini's world understanding and knows what those arrows are pointing at.

Patrick Löber walks through the full any-to-any stack: multimodal understanding, where Gemini ingests PDFs, video, and audio (nine-plus hours at once); native image and speech generation, called as tools from an agentic loop; and a live audio model, where audio goes in and audio comes out through a single architecture with no cascaded pipeline. The session ends with the building blocks for a NotebookLM clone in which a reasoning agent, rather than a hardcoded workflow, decides what to generate.

Speaker info:
- https://x.com/patloeber
- https://linkedin.com/in/patrick-l%C3%B6ber-403022137
- https://github.com/patrickloeber
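
To make the "generation called as tools from an agentic loop" idea in the description concrete, here is a minimal sketch in Python. It is not the code from the talk and does not call any Gemini API: `generate_image`, `generate_speech`, `stub_planner`, and `run_agent` are hypothetical names, the tools return placeholder strings, and the planner is a keyword-matching stand-in for the model's reasoning step. The point it illustrates is only the control flow: the loop asks a planner which tool to run next instead of following a fixed sequence.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class ToolCall:
    """A request to run one named tool with a single text argument."""
    name: str
    argument: str


# Hypothetical stand-ins for native generation tools.
# A real agent would call an image or speech model here.
def generate_image(prompt: str) -> str:
    return f"<image for: {prompt}>"


def generate_speech(text: str) -> str:
    return f"<audio for: {text}>"


TOOLS: Dict[str, Callable[[str], str]] = {
    "generate_image": generate_image,
    "generate_speech": generate_speech,
}


def stub_planner(request: str, artifacts: List[str]) -> Optional[ToolCall]:
    """Assumed placeholder for the model's reasoning step.

    Picks the next tool based on keywords in the request, or returns
    None once every wanted artifact has been produced.
    """
    text = request.lower()
    wanted: List[str] = []
    if "picture" in text or "image" in text:
        wanted.append("generate_image")
    if "podcast" in text or "audio" in text:
        wanted.append("generate_speech")
    if len(artifacts) < len(wanted):
        return ToolCall(wanted[len(artifacts)], request)
    return None


def run_agent(
    request: str,
    planner: Callable[[str, List[str]], Optional[ToolCall]] = stub_planner,
    max_steps: int = 5,
) -> List[str]:
    """Run the loop: ask the planner for a tool call, execute it, repeat."""
    artifacts: List[str] = []
    for _ in range(max_steps):
        call = planner(request, artifacts)
        if call is None:  # the planner decided nothing more is needed
            break
        artifacts.append(TOOLS[call.name](call.argument))
    return artifacts


if __name__ == "__main__":
    for item in run_agent("Make a podcast and a picture about bridges"):
        print(item)
```

Swapping `stub_planner` for a function that queries a real model is the only change needed to turn the hardcoded keyword rules into a model-driven decision; the `max_steps` cap keeps a misbehaving planner from looping indefinitely.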