How is hardware reshaping LLM design?

Julia Turc · Advanced · 🧠 Large Language Models · 2mo ago
Why can an NVIDIA H100 GPU theoretically generate 62,000 tokens per second, while in practice even the best inference engines struggle to reach 200 tokens per second? The answer lies in the memory wall. In this video, we break down why traditional auto-regressive LLMs are memory-bound. Using the roofline model, we analyze how GPU memory bandwidth, arithmetic intensity, and model architecture determine real-world LLM inference performance.

We'll cover:
• Why an 8B-parameter LLM must stream all of its weights from memory for every output token
• How HBM (High Bandwidth Memory) becomes the bottleneck
• What the roofline model tells us about GPU utilization (see the back-of-envelope sketch after this list)
• Why prefill is compute-bound but decode is memory-bound
• How inference engines like vLLM, SGLang, and TensorRT-LLM optimize batching and KV caching (see the KV-cache sketch below)
• How speculative decoding increases arithmetic intensity (see the speedup sketch below)
• Why diffusion-based LLMs shift the workload from memory-bound to compute-bound
• How new architectures like block diffusion combine the best of both worlds

We'll also discuss recent advances in diffusion LLM inference, including work like FOCUS and block-wise decoding strategies.

This video is sponsored by Inception (inceptionlabs.ai), who are building ultra-fast diffusion-based language models and the infrastructure to serve them efficiently. They recently introduced Mercury 2, a diffusion reasoning model designed for high-throughput inference.

📚 My full reading list (free): https://www.patreon.com/c/JuliaTurc
▶️ Other diffusion videos: https://youtube.com/playlist?list=PL4bm2lr9UVG3SN79Y6WBe4OOlEiO88vie&si=Tg4IgNkZdV9HSaGU
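
To make the 62,000 vs. 200 tokens/s gap concrete, here is a back-of-envelope roofline sketch in Python. The H100 SXM figures (≈989 TFLOP/s dense BF16 compute, ≈3.35 TB/s HBM3 bandwidth) are published peak specs, and the 2-FLOPs-per-parameter-per-token rule for a dense decoder-only transformer is a standard approximation; these are illustrative ceilings, not measurements.

```python
# Back-of-envelope roofline estimate for single-stream (batch-1) decoding.
# Peak specs are published H100 SXM numbers; 2 FLOPs/param/token for a
# dense decoder-only transformer is a rule-of-thumb approximation.

PEAK_FLOPS = 989e12       # H100 SXM dense BF16, FLOP/s
PEAK_BW    = 3.35e12      # H100 SXM HBM3 bandwidth, bytes/s

params       = 8e9                        # 8B-parameter model
weight_bytes = params * 2                 # BF16 weights: ~16 GB per token read
flops_per_token = 2 * params              # ~16 GFLOPs per decoded token

compute_bound_tps = PEAK_FLOPS / flops_per_token   # ~61,800 tokens/s
memory_bound_tps  = PEAK_BW / weight_bytes         # ~209 tokens/s

# Arithmetic intensity of batch-1 decode vs. the GPU's ridge point:
intensity   = flops_per_token / weight_bytes       # ~1 FLOP/byte
ridge_point = PEAK_FLOPS / PEAK_BW                 # ~295 FLOP/byte

print(f"compute-bound ceiling: {compute_bound_tps:,.0f} tokens/s")
print(f"memory-bound ceiling:  {memory_bound_tps:,.0f} tokens/s")
print(f"arithmetic intensity:  {intensity:.1f} FLOP/byte (ridge: {ridge_point:.0f})")
```

Batch-1 decode sits at roughly 1 FLOP per byte, far to the left of the ~295 FLOP/byte ridge point, so the GPU spends almost all of its time waiting on HBM rather than computing.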
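Batching raises arithmetic intensity because one weight read serves every sequence in the batch, but each sequence also carries a KV cache that competes with the weights for HBM. A minimal sketch of that ceiling, assuming a hypothetical Llama-3-8B-like geometry (32 layers, 8 KV heads, head dim 128, FP16 cache); the exact numbers depend on the model and serving engine:

```python
# How the KV cache caps batch size: each sequence stores keys and values
# for every layer and every past token. Geometry below is an assumed
# Llama-3-8B-like configuration; numbers are illustrative.

hbm_bytes    = 80e9               # H100 HBM capacity
weight_bytes = 16e9               # 8B params in BF16

layers, kv_heads, head_dim = 32, 8, 128
bytes_per_elem = 2                # FP16/BF16 cache entries
seq_len = 8192                    # tokens of context per sequence

# Factor of 2 covers keys and values:
kv_bytes_per_seq = 2 * layers * kv_heads * head_dim * bytes_per_elem * seq_len

max_batch = int((hbm_bytes - weight_bytes) // kv_bytes_per_seq)
print(f"KV cache per sequence: {kv_bytes_per_seq / 1e9:.2f} GB")   # ~1.07 GB
print(f"max batch size at {seq_len} tokens: {max_batch}")          # ~59
```

Paged KV caching (as in vLLM) exists precisely to pack this per-sequence memory more tightly so that batches can grow closer to this ceiling.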
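Speculative decoding lets the target model verify k drafted tokens in a single forward pass, so one full weight read can yield several accepted tokens. A small sketch using the standard geometric formula from the speculative decoding literature (Leviathan et al., 2023), where alpha is the per-token acceptance rate of the draft model (an assumed parameter here; in practice you would measure it):

```python
# Expected tokens produced per target-model forward pass with speculative
# decoding (geometric formula from Leviathan et al., 2023). alpha is the
# per-token draft acceptance rate (assumed, not measured).

def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """E[tokens] when k draft tokens are verified in one target pass."""
    if alpha == 1.0:
        return k + 1.0
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

for alpha in (0.6, 0.8, 0.9):
    for k in (4, 8):
        e = expected_tokens_per_pass(alpha, k)
        print(f"alpha={alpha:.1f}, k={k}: ~{e:.2f} tokens per weight read")
```

Each target pass still streams the full weights once, so arithmetic intensity, and with it decode throughput, scales roughly with this expectation.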
Watch on YouTube ↗

Chapters (12)

0:00 Intro
1:13 High-level GPU architecture
4:14 The memory bottleneck
5:51 The roofline model
8:34 Auto-regressive LLMs are memory-bound
10:40 Batching
12:55 How KV caching caps batching
14:25 Speculative decoding
16:00 Diffusion LLMs are compute-bound
18:53 Reducing wasted diffusion computations
20:34 Block diffusion
21:42 Diffusion inference engines