How is hardware reshaping LLM design?
Why can an NVIDIA H100 GPU theoretically generate 62,000 tokens per second when in practice even the best inference engines struggle to reach 200 tokens per second?
The answer lies in the memory wall.
In this video, we break down why traditional auto-regressive LLMs are memory-bound. Using the roofline model, we analyze how GPU memory bandwidth, arithmetic intensity, and model architecture determine real-world LLM inference performance.
We'll cover:
• Why an 8B-parameter LLM must stream all of its model weights from memory for every output token
• How HBM (High Bandwidth Memory) becomes the bottleneck …
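The 62,000 vs. ~200 tokens-per-second gap from the intro can be reproduced with a back-of-envelope roofline calculation. The sketch below assumes round H100 SXM figures (~989 dense bf16 TFLOP/s, ~3.35 TB/s HBM3 bandwidth), fp16 weights, and the common ~2 FLOPs-per-parameter-per-token rule of thumb; exact numbers vary by GPU variant.

```python
# Back-of-envelope roofline bounds for batch-1 decoding of an 8B model on an H100.
# All constants are assumed round numbers, not measured values.

PARAMS = 8e9                  # 8B parameters
BYTES_PER_PARAM = 2           # fp16/bf16 weights
FLOPS_PER_TOKEN = 2 * PARAMS  # ~2 FLOPs per parameter per generated token

PEAK_FLOPS = 989e12           # H100 SXM dense bf16, ~989 TFLOP/s (assumed)
HBM_BANDWIDTH = 3.35e12       # H100 SXM HBM3, ~3.35 TB/s (assumed)

weight_bytes = PARAMS * BYTES_PER_PARAM           # 16 GB streamed per token

compute_bound_tps = PEAK_FLOPS / FLOPS_PER_TOKEN  # ceiling if compute-limited
memory_bound_tps = HBM_BANDWIDTH / weight_bytes   # ceiling if bandwidth-limited

# Arithmetic intensity at batch size 1, in FLOPs per byte moved,
# compared against the machine balance point of the GPU.
intensity = FLOPS_PER_TOKEN / weight_bytes        # ~1 FLOP/byte
machine_balance = PEAK_FLOPS / HBM_BANDWIDTH      # ~295 FLOP/byte

print(f"compute-bound ceiling: {compute_bound_tps:,.0f} tok/s")  # ~62,000
print(f"memory-bound ceiling:  {memory_bound_tps:,.0f} tok/s")   # ~209
print(f"intensity {intensity:.1f} FLOP/B << balance {machine_balance:.0f} FLOP/B -> memory-bound")
```

Because the per-token arithmetic intensity (~1 FLOP/byte) sits far below the machine's balance point (~295 FLOP/byte), the memory-bound ceiling of roughly 200 tokens/s is the one that binds, which is exactly the gap the video explains.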
Chapters (12)
Intro · 1:13
High-level GPU architecture · 4:14
The memory bottleneck · 5:51
The roofline model · 8:34
Auto-regressive LLMs are memory-bound · 10:40
Batching · 12:55
How KV caching caps batching · 14:25
Speculative decoding · 16:00
Diffusion LLMs are compute-bound · 18:53
Reducing wasted diffusion computations · 20:34
Block diffusion · 21:42
Diffusion inference engines
DeepCamp AI