How is hardware reshaping LLM design?
Why can an NVIDIA H100 GPU theoretically generate 62,000 tokens per second, when in practice even the best inference engines struggle to reach 200 tokens per second for a single sequence?
The answer lies in the memory wall.
In this video, we break down why traditional auto-regressive LLMs are memory-bound. Using the roofline model, we analyze how GPU memory bandwidth, arithmetic intensity, and model architecture determine real-world LLM inference performance.
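To make those two numbers concrete, here is a back-of-the-envelope roofline calculation. The hardware figures (~989 TFLOPS dense BF16 compute, ~3.35 TB/s HBM3 bandwidth) are published H100 SXM specs, and the 8B-parameter model with 16-bit weights is an illustrative assumption, not a measurement:

```python
# Back-of-the-envelope roofline math for single-sequence decoding on an H100.
# Assumptions: ~989 TFLOPS dense BF16 and ~3.35 TB/s HBM3 (H100 SXM specs),
# an 8B-parameter model with 16-bit weights, ~2 FLOPs per parameter per token.

peak_flops = 989e12           # dense BF16 compute, FLOP/s
hbm_bandwidth = 3.35e12       # HBM3 bandwidth, bytes/s

params = 8e9                  # 8B parameters
bytes_per_param = 2           # FP16/BF16 weights
flops_per_token = 2 * params  # one multiply-add per parameter per token

# Compute-bound ceiling: what the math units alone could sustain.
compute_limit = peak_flops / flops_per_token   # ~61,800 tokens/s

# Memory-bound ceiling: each decode step streams all 16 GB of weights.
weight_bytes = params * bytes_per_param
memory_limit = hbm_bandwidth / weight_bytes    # ~209 tokens/s

print(f"compute-bound limit: {compute_limit:,.0f} tokens/s")
print(f"memory-bound limit:  {memory_limit:,.0f} tokens/s")
```

The two ceilings differ by roughly 300x, and closing that gap is what the rest of the video is about.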
We'll cover:
• Why an 8B-parameter LLM must stream all of its weights from memory for every output token
• How HBM (High Bandwidth Memory) becomes the bottleneck
• What the roofline model tells us about GPU utilization
• Why prefill is compute-bound but decode is memory-bound
• How inference engines like vLLM, SGLang, and TensorRT-LLM optimize batching and KV caching (sketched below)
• How speculative decoding increases arithmetic intensity
• Why diffusion-based LLMs shift the workload from memory-bound to compute-bound
• How new architectures like block diffusion combine the best of both worlds
We’ll also discuss recent advances in diffusion LLM inference, including work like FOCUS and block-wise decoding strategies.
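As a preview of the batching argument, here is a minimal sketch of how decode-time arithmetic intensity grows with batch size. It reuses the H100 figures from the calculation above and ignores KV-cache and activation traffic, so the crossover point is illustrative rather than measured:

```python
# Batching amortizes weight reads: one pass over the weights serves B tokens,
# so arithmetic intensity (FLOPs per byte moved) grows roughly linearly in B.
# KV-cache and activation traffic are ignored here for simplicity.

params = 8e9
bytes_per_param = 2

def arithmetic_intensity(batch_size: int) -> float:
    flops = 2 * params * batch_size          # B tokens per weight pass
    bytes_moved = params * bytes_per_param   # weights streamed once
    return flops / bytes_moved

# Roofline "ridge point" for an H100: peak FLOPs / peak bandwidth, ~295.
ridge = 989e12 / 3.35e12

for b in (1, 8, 64, 512):
    ai = arithmetic_intensity(b)
    regime = "compute-bound" if ai >= ridge else "memory-bound"
    print(f"batch {b:>3}: ~{ai:4.0f} FLOPs/byte -> {regime}")
```

Speculative decoding raises intensity the same way for a single sequence: verifying k drafted tokens in one forward pass lets each weight read serve up to k tokens. In practice, KV-cache memory caps the batch size well before the ridge point, which is exactly the tension the video explores.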
This video is sponsored by Inception (inceptionlabs.ai), who are building ultra-fast diffusion-based language models and the infrastructure to serve them efficiently. They recently introduced Mercury 2, a diffusion reasoning model designed for high-throughput inference.
📚 My full reading list (free): https://www.patreon.com/c/JuliaTurc
▶️ Other diffusion videos: https://youtube.com/playlist?list=PL4bm2lr9UVG3SN79Y6WBe4OOlEiO88vie&si=Tg4IgNkZdV9HSaGU
00:00 Intro
01:13 High-level GPU architecture
04:14 The memory bottleneck
05:51 The roofline model
08:34 Auto-regressive LLMs are memory-bound
10:40 Batching
12:55 How KV caching caps batching
14:25 Speculative decoding
16:00 Diffusion LLMs are compute-bound
18:53 Reducing wasted diffusion computations
20:34 Block diffusion
21:42 Diffusion inference engines