A closer look at Gemma 4 with Baseten and NVIDIA

Google Cloud · Intermediate · 🧠 Large Language Models · 10h ago
Inference isn't just one thing: it's the entire stack. Live from Google Cloud Next '26, Jason Davenport (Google Cloud), Jay Rodge (NVIDIA), and Philip Kiely (Baseten) break down the "Full Stack Seating Chart" of modern AI, from the silicon powering the models to the frameworks scaling them to millions of users. This session dives into the day-zero support for Gemma 4, Google's most capable open model family, and how the partnership between Google, NVIDIA, and Baseten is solving the hypergrowth problem for AI applications.

Key Highlights:
- Next-gen hardware: Jay Rodge announces the arrival of NVIDIA Blackwell (RTX PRO 6000) and the future Vera Rubin GPUs on Google Cloud, featuring 96 GB of VRAM, enough to pack multiple massive models onto a single chip.
- Gemma 4 & MoE: A look at the new Gemma 4 26B A4B Mixture-of-Experts model, which activates only 4B parameters per token to deliver 27B-class intelligence at lightning-fast speeds.
- Inference engineering: Philip Kiely discusses his new book, "Inference Engineering," and explains why inference is a holistic challenge spanning CUDA, infrastructure, distributed systems, and tight latency SLAs.
- Scaling at Baseten: A live demo showing how Baseten uses GKE and L4 GPUs to provide one-click deployments of Gemma 4, with auto-scaling that absorbs traffic spikes without sacrificing response time.
- Precision & optimization: Why NVFP4 and TensorRT-LLM are the "secret sauce" for getting the highest possible performance out of Gemma on NVIDIA hardware.

"If you have a GPU that costs twice as much but handles three times the volume, you've actually lowered your TCO. In inference engineering, cheap isn't always the goal; efficiency is."

Get Started: Explore the Gemma 4 family on Hugging Face, check out Baseten for model serving, and join the NVIDIA & Google Cloud developer community to start building.

#Gemma4 #NVIDIABlackwell #InferenceEngineering #GoogleCloudNext #Baseten #OpenModels #VeraRubin
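The Mixture-of-Experts efficiency described above comes from routing: each token is sent to only a few experts, so only a small slice of the total weights participates in each forward pass. A minimal sketch of top-k expert routing; the expert count and k below are hypothetical illustration values, not Gemma's actual configuration:

```python
import numpy as np

# Toy illustration of why an MoE model with a large total parameter count
# can run with far fewer active parameters: a router scores the experts
# and only the top-k participate for a given token.
rng = np.random.default_rng(0)
n_experts, k = 16, 2                          # hypothetical: 16 experts, top-2 routing
router_logits = rng.normal(size=n_experts)    # router scores for one token
top_k = np.argsort(router_logits)[-k:]        # experts selected for this token
active_fraction = k / n_experts               # share of expert weights actually used

print(f"experts used: {sorted(top_k.tolist())}, active fraction: {active_fraction:.0%}")
```

With top-2 routing over 16 experts, only 12.5% of the expert weights are touched per token, which is the same shape of saving as running a 26B-parameter model with roughly 4B parameters active.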
Watch on YouTube ↗

Related AI Lessons

How to Use the Hy3 Preview API for Free
Learn how to use the Hy3 Preview API for free and explore its capabilities in AI and LLMs
Dev.to AI
I built a Python module to A/B test prompts inside Claude Code, and you can run it on yours
Learn how to A/B test prompts inside Claude Code using a Python module and improve your AI model's performance
Dev.to · Frank Brsrk
10 Benefits of Learning Generative AI in 2026 (Complete Guide for Beginners & Professionals)
Unlock 10 benefits of learning Generative AI in 2026 and boost your career across industries
Medium · AI
Talk to Your Data: A New Approach Using JavaScript Instead of SQL Yields Better Results
Learn how to use JavaScript instead of SQL to interact with your data, yielding better results with a new approach
Medium · JavaScript
Up next
5 Levels of AI Agents - From Simple LLM Calls to Multi-Agent Systems
Dave Ebbelaar (LLM Eng)
Watch →