Any-to-Any: Building Native Multimodal Agents - Patrick Löber, Google DeepMind

Uploaded: 2026-05-20T17:00:07Z
Channel: AI Engineer

Draw arrows on a map and ask Gemini to generate a picture of what you see. It produces the Golden Gate Bridge, not because it matched pixels, but because the image generation model is built on top of Gemini's world understanding and knows what those arrows are pointing at.

Patrick Löber walks through the full any-to-any stack: multimodal understanding, where Gemini ingests PDFs, video, and audio of up to nine-plus hours at once; native image and speech generation, called as tools from an agentic loop; and a live audio model, where audio goes in and audio comes out through a single architecture with no cascaded pipeline. The session ends with the building blocks for a NotebookLM clone in which a reasoning agent, rather than a hardcoded workflow, decides what to generate.

Speaker info:
- https://x.com/patloeber
- https://linkedin.com/in/patrick-l%C3%B6ber-403022137
- https://github.com/patrickloeber
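
The "agent decides what to generate" idea from the description can be pictured with a small sketch. This is not code from the talk and it does not call Gemini: `generate_image`, `generate_speech`, `toy_planner`, and `run_agent` are hypothetical names, the tools return placeholder strings, and the keyword-based planner is only a stand-in for a reasoning model that picks tools on its own.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


# Hypothetical stand-ins for generation tools. A real agent would call
# image and speech generation models here; these only return placeholders.
def generate_image(prompt: str) -> str:
    return f"<image: {prompt}>"


def generate_speech(text: str) -> str:
    return f"<audio: {text}>"


TOOLS: Dict[str, Callable[[str], str]] = {
    "generate_image": generate_image,
    "generate_speech": generate_speech,
}


@dataclass
class ToolCall:
    name: str
    argument: str


def toy_planner(request: str, used: List[str]) -> Optional[ToolCall]:
    """Assumed stand-in for the reasoning model: choose the next tool, or stop."""
    text = request.lower()
    if "podcast" in text and "generate_speech" not in used:
        return ToolCall("generate_speech", request)
    if "diagram" in text and "generate_image" not in used:
        return ToolCall("generate_image", request)
    return None  # nothing left to generate


def run_agent(
    request: str,
    planner: Callable[[str, List[str]], Optional[ToolCall]] = toy_planner,
    max_steps: int = 5,
) -> List[str]:
    """Agentic loop: the planner, not a fixed pipeline, decides each step."""
    used: List[str] = []
    outputs: List[str] = []
    for _ in range(max_steps):
        call = planner(request, used)
        if call is None:
            break
        outputs.append(TOOLS[call.name](call.argument))
        used.append(call.name)
    return outputs


if __name__ == "__main__":
    for item in run_agent("Make a podcast and a diagram about bridges"):
        print(item)
```

The point of the design is that the set of outputs is not fixed in advance: a request that mentions neither a podcast nor a diagram produces no tool calls at all, while a hardcoded workflow would run every stage regardless.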
