Let's go Bananas with GenMedia — Guillaume Vernade, Google DeepMind
Guillaume Vernade from Google DeepMind takes a public domain book and runs it live through the full generative media stack. Gemini reads the whole text and writes image prompts for each character and chapter. Imagen generates the portraits. Veo animates them into video clips, using those images as first frames. Lyria composes a distinct piece of music for each chapter, with or without lyrics. The TTS model reads dialogue from the book using a trick that makes two voices sound like four distinct characters.
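The chain described above can be sketched as plain data flow. This is a minimal offline sketch under stated assumptions: every function (`gemini_write_prompts`, `imagen_generate`, `veo_animate`, `lyria_compose`) is a hypothetical stub that returns placeholder strings instead of calling the real Gemini, Imagen, Veo, or Lyria APIs, and prompts are written per chapter only (the talk also writes per-character prompts). What it shows is the ordering: Gemini writes every downstream prompt from the full book, and each Imagen output is handed to Veo as the clip's first frame.

```python
from dataclasses import dataclass

# All functions below are HYPOTHETICAL stubs standing in for real model calls.
# They return placeholder strings so the data flow runs offline.


def gemini_write_prompts(book_text: str, chapters: list[str]) -> dict[str, tuple[str, str]]:
    """Stub for Gemini as prompt engineer: (visual prompt, music prompt) per chapter."""
    # A real call would send book_text to Gemini and request structured output.
    return {
        ch: (f"Painterly illustration of {ch}", f"Orchestral score for {ch}")
        for ch in chapters
    }


def imagen_generate(prompt: str) -> str:
    """Stub for Imagen: returns a fake image handle."""
    return f"image[{prompt}]"


def veo_animate(prompt: str, first_frame: str) -> str:
    """Stub for Veo: the generated image is used as the first frame."""
    return f"video[start={first_frame}]"


def lyria_compose(prompt: str, with_lyrics: bool) -> str:
    """Stub for Lyria: one piece of music per chapter."""
    kind = "song" if with_lyrics else "instrumental"
    return f"{kind}[{prompt}]"


@dataclass
class ChapterMedia:
    visual_prompt: str
    image: str
    video: str
    music: str


def run_pipeline(book_text: str, chapters: list[str], with_lyrics: bool = False) -> dict[str, ChapterMedia]:
    # Step 1: Gemini reads the whole book once and writes prompts for the other models.
    prompts = gemini_write_prompts(book_text, chapters)
    media: dict[str, ChapterMedia] = {}
    for chapter, (visual_prompt, music_prompt) in prompts.items():
        image = imagen_generate(visual_prompt)                 # Step 2: still image
        video = veo_animate(visual_prompt, first_frame=image)  # Step 3: image becomes frame 0
        music = lyria_compose(music_prompt, with_lyrics)       # Step 4: per-chapter music
        media[chapter] = ChapterMedia(visual_prompt, image, video, music)
    return media


if __name__ == "__main__":
    for chapter, m in run_pipeline("Once upon a time", ["Chapter 1", "Chapter 2"]).items():
        print(chapter, "->", m.video, m.music)
```

Swapping a stub for a real SDK call changes only that one function; the ordering constraint (image before video, prompts before everything) stays the same.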
The interesting layer underneath all of it is that Gemini acts as the prompt engineer for every other model. This works well partly because the generative media models were trained on prompts written by Gemini. The workshop also covers the Lyria Realtime model, which generates music continuously and, like a DJ, responds to new prompts mid-stream. Finally, it introduces a new interactions API that makes chained multi-turn calls cheaper by caching context server-side instead of resending the full book on every turn.
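The cost argument for server-side context can be illustrated with a toy model. This sketch assumes a hypothetical `ToyStatefulServer` that stores each conversation's context under an ID; it is not the real interactions API, and it counts characters sent by the client as a rough proxy for cost, ignoring model replies and the size of the ID itself. With a 1,000-character book and three 10-character follow-up questions, resending everything costs 3,060 characters, while referencing the stored context costs 1,030.

```python
import uuid
from typing import Optional

# Toy illustration only: a HYPOTHETICAL server that keeps each conversation's
# context under an ID. Not the real interactions API; names are illustrative.
# Model replies are omitted; in practice they would also accumulate in context.


class ToyStatefulServer:
    def __init__(self) -> None:
        self._contexts: dict[str, list[str]] = {}

    def send(self, messages: list[str], previous_id: Optional[str] = None) -> str:
        """Append messages to the stored context and return a new interaction ID."""
        context = list(self._contexts[previous_id]) if previous_id else []
        context.extend(messages)
        new_id = uuid.uuid4().hex
        self._contexts[new_id] = context
        return new_id

    def context(self, interaction_id: str) -> list[str]:
        return self._contexts[interaction_id]


def chars_sent_stateless(book: str, turns: list[str]) -> int:
    """Client resends the whole book plus every earlier turn on each call."""
    history = [book]
    total = 0
    for turn in turns:
        history.append(turn)
        total += sum(len(m) for m in history)
    return total


def chars_sent_stateful(book: str, turns: list[str]) -> tuple[int, list[str]]:
    """Book is sent once; later calls send only the new turn plus an ID (ID size ignored)."""
    server = ToyStatefulServer()
    interaction_id: Optional[str] = None
    total = 0
    for i, turn in enumerate(turns):
        payload = [book, turn] if i == 0 else [turn]
        interaction_id = server.send(payload, previous_id=interaction_id)
        total += sum(len(m) for m in payload)
    final_context = server.context(interaction_id) if interaction_id else []
    return total, final_context


if __name__ == "__main__":
    book, turns = "x" * 1000, ["q" * 10] * 3
    print(chars_sent_stateless(book, turns), chars_sent_stateful(book, turns)[0])
```

The gap widens with every extra turn, because the stateless client pays for the whole book again on each call while the server still ends up holding the same full context. Actual billing depends on the API's own pricing for cached versus fresh tokens.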
Speaker info:
- https://x.com/Giom_V
- https://www.linkedin.com/in/guillaumevernade
- https://github.com/Giom-V