Self host Gemma 4: Deploy LLMs on Cloud Run GPUs

Google Cloud Tech · Beginner ·🧠 Large Language Models ·2w ago
GCP credit → https://goo.gle/handson-ep7-lab1 Lab → https://goo.gle/guardians In this episode, we deploy Google's Gemma 4 model to Cloud Run two completely different ways, each with real trade-offs you need to understand before choosing one for production. 🔨 Ollama — model baked into the container. Instant cold starts. Rebuild to update. ⚡ vLLM — model mounted from Cloud Storage via FUSE. Slower first boot, but swap models without redeploying. Both use Cloud Run GPUs, scale to zero, and ship through automated CI/CD with Cloud Build. We build both. You decide which fits. 👇 📦 CI/CD with Cloud Build 🖥️ GPU accelerated serverless inference 🔄 Baked in vs. decoupled model architecture 🚀 Scale to zero ⚖️ Cold start speed vs. production agility Chapters: 0:00 - Intro 6:08 - Getting started with Agentverse lab 7:57 - Laying the foundations of the citadel 16:07 - Forging the power core: Self hosted LLMs 28:02 - Forging the citadel's central core: Deploy vLLM 43:59 - Summary More resources: Cloud Run GPU documentation → https://goo.gle/4sEbTvG Ollama documentation → https://goo.gle/3Qdi64w vLLM documentation → https://goo.gle/4cvvxE9 Cloud Storage FUSE → https://goo.gle/4cQAb0V Watch more Hands on AI → https://www.youtube.com/watch?v=qCBreTfjFHQ&list=PLIivdWyY5sqKnJOvP89yF8t9mWuzMTcbM 🔔 Subscribe to Google Cloud Tech → https://goo.gle/GoogleCloudTech #Gemma4 #CloudRun Speakers: Ayo Adedeji, Annie Wang Products Mentioned: Agent Development Kit, Gemini API, Cloud Run
Watch on YouTube ↗ (saves to browser)
Sign in to unlock AI tutor explanation · ⚡30

Related AI Lessons

On deleting what the model gave you
Learn to refine AI-generated content by deleting and editing, a crucial step in creative work with AI models
Dev.to AI
The Wall Every AI Has Been Hitting And the Startup That Claims to Have Broken Through
Startup SubQ claims to have broken through the context length constraint in AI, solving a decade-long limitation
Medium · LLM
Unsolved AI Mystery Is Solved Along With Lessons Learned On Why ChatGPT Became Oddly Obsessed With Gremlins And Goblins
Discover why ChatGPT became obsessed with gremlins and goblins, and learn vital lessons from this AI mystery
Forbes Innovation
The Human Element in AI-Based Emotion Recognition: When Machines Start to Read Our Faces
Learn how AI-based emotion recognition systems use facial analysis to understand human emotions and the importance of the human element in these systems
Medium · Deep Learning

Chapters (6)

Intro
6:08 Getting started with Agentverse lab
7:57 Laying the foundations of the citadel
16:07 Forging the power core: Self hosted LLMs
28:02 Forging the citadel's central core: Deploy vLLM
43:59 Summary
Up next
5 Levels of AI Agents - From Simple LLM Calls to Multi-Agent Systems
Dave Ebbelaar (LLM Eng)
Watch →