Any-to-Any: Building Native Multimodal Agents - Patrick Löber, Google DeepMind
Skills: Multimodal LLMs (85%)
Draw arrows on a map and ask Gemini to generate a picture of what you would see from that spot. It produces the Golden Gate Bridge, not because it matched pixels, but because the image generation model is built on top of Gemini's world understanding and knows what those arrows are pointing at.
Patrick Löber walks through the full any-to-any stack: multimodal understanding, where Gemini ingests PDFs, video, and audio up to nine-plus hours at once; native image and speech generation, called as tools from an agentic loop; and a live audio model in which audio goes in and audio comes out through a single architecture, with no cascaded pipeline. The session ends with the building blocks for a NotebookLM clone in which a reasoning agent, rather than a hardcoded workflow, decides what to generate.
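
To make the "agent decides what to generate" idea concrete, here is a minimal sketch of an agentic loop that dispatches to generation tools. It is not the code from the talk. The tool functions, the `Step` type, and the keyword-based `decide` function are all hypothetical stand-ins; in a real agent, `decide` would be a call to a reasoning model and the tools would call image and speech generation models.

```python
from dataclasses import dataclass
from typing import Callable, Optional


# Hypothetical stand-ins for native generation tools. A real agent would call
# an image or speech model here; these only return a labelled string.
def generate_image(prompt: str) -> str:
    return f"[image: {prompt}]"


def generate_speech(text: str) -> str:
    return f"[audio: {text}]"


# Tool registry: the agent picks a tool by name and the loop dispatches to it.
TOOLS: dict[str, Callable[[str], str]] = {
    "generate_image": generate_image,
    "generate_speech": generate_speech,
}


@dataclass
class Step:
    tool: Optional[str]  # None signals that the agent is finished
    argument: str = ""


def decide(request: str, artifacts: list[str]) -> Step:
    """Assumed placeholder for the reasoning model (keyword rules, no LLM)."""
    text = request.lower()
    wanted = []
    if "diagram" in text or "picture" in text:
        wanted.append("generate_image")
    if "podcast" in text or "audio" in text:
        wanted.append("generate_speech")
    # Request the next wanted artifact that has not been produced yet.
    if len(artifacts) < len(wanted):
        return Step(wanted[len(artifacts)], request)
    return Step(None)


def run_agent(request: str, max_steps: int = 5) -> list[str]:
    """Loop: ask the decider for the next step, run the tool, repeat."""
    artifacts: list[str] = []
    for _ in range(max_steps):
        step = decide(request, artifacts)
        if step.tool is None:
            break
        artifacts.append(TOOLS[step.tool](step.argument))
    return artifacts


if __name__ == "__main__":
    print(run_agent("Turn these notes into a podcast with a picture"))
```

The point of this shape is that no fixed sequence of outputs is written into `run_agent`: which artifacts are produced depends entirely on what the decider returns at each step, in contrast to a hardcoded workflow.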
Speaker info:
- https://x.com/patloeber
- https://linkedin.com/in/patrick-l%C3%B6ber-403022137
- https://github.com/patrickloeber