PagedAttention: Behind vLLM's Insane Speed
PagedAttention is the “virtual memory” idea applied to LLM inference: instead of storing each request’s KV cache in one big contiguous chunk, vLLM breaks it into fixed-size blocks and maps logical tokens to physical GPU memory with a block table. The result is far less fragmentation, smarter reuse, and higher throughput under real, mixed-length traffic.
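The block-table idea above can be sketched in a few lines. This is an illustrative toy, not vLLM's actual classes (`BlockAllocator`, `BlockTable`, and `BLOCK_SIZE` here are hypothetical names): a fixed block size, a pool of free physical blocks, and a per-request table that translates a logical token position into a (physical block, offset) slot.

```python
BLOCK_SIZE = 16  # tokens per KV block (vLLM's default block size is 16)

class BlockAllocator:
    """Pool of physical GPU-memory blocks, handed out in any order."""
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))

    def allocate(self):
        return self.free.pop()  # any free block will do: no contiguity needed

    def release(self, block):
        self.free.append(block)

class BlockTable:
    """Per-request map from logical block index to physical block id."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.blocks = []      # physical block ids, in logical order
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the last one fills up,
        # so waste is bounded by one partially filled block per request.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.blocks.append(self.allocator.allocate())
        self.num_tokens += 1

    def slot(self, token_pos):
        # Translate a logical token position into (physical block, offset).
        return self.blocks[token_pos // BLOCK_SIZE], token_pos % BLOCK_SIZE
```

A request holding 20 tokens occupies just two blocks, wasting at most the unused tail of the last block, which is exactly how fragmentation stays bounded under mixed-length traffic.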
In this video, we visualize the problem with a Tetris-style fragmentation demo, then build up the exact data structures (KV blocks + block tables), walk through how prefill vs. decode works, show how block sharing helps parallel sampling and beam search, and what happ…
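The sharing mentioned above rests on reference counting plus copy-on-write. Here is a hedged sketch (hypothetical names, not vLLM's actual API): forking a sequence for parallel sampling or beam search just bumps refcounts on the parent's blocks, and a private copy is made only when a shared block is about to be written.

```python
class SharedBlock:
    """A physical KV block with a reference count for sharing."""
    def __init__(self, block_id):
        self.block_id = block_id
        self.ref_count = 1

def fork(parent_blocks):
    # A child sequence (e.g. one beam or one sample) shares every
    # parent block instead of copying the prompt's KV cache.
    for b in parent_blocks:
        b.ref_count += 1
    return list(parent_blocks)  # same physical blocks, new logical table

def copy_on_write(table, idx, allocate):
    # Before a sequence writes into a block it shares with others,
    # give it a private copy; a real system would also copy the KV data.
    b = table[idx]
    if b.ref_count > 1:
        b.ref_count -= 1
        table[idx] = SharedBlock(allocate())
    return table[idx]
```

The prompt's blocks are stored once no matter how many samples branch off it; only the blocks a branch actually diverges on get duplicated.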
Chapters (8)
The Problem of Memory Fragmentation
0:41
Memory Waste in Traditional Serving
1:23
Introduction to PagedAttention
2:08
Learning from Operating Systems (Virtual Memory)
2:49
How PagedAttention Divides the KV Cache
3:25
Walking Through a Single Request
4:00
Prefix Sharing: Saving Memory on Shared Prompts
4:41
Continuous
DeepCamp AI