PagedAttention: Behind vLLM's Insane Speed

Tales Of Tensors · Beginner · 🧠 Large Language Models · 3mo ago
PagedAttention is the “virtual memory” idea applied to LLM inference: instead of storing each request’s KV cache in one big contiguous chunk, vLLM breaks it into fixed-size blocks and maps logical tokens to physical GPU memory with a block table. The result is far less fragmentation, smarter reuse, and higher throughput under real, mixed-length traffic. In this video, we visualize the problem with a Tetris-style fragmentation demo, then build up the exact data structures (KV blocks + block tables), walk through how prefill vs. decode works, show how sharing helps for parallel sampling/beam search, and what happ…
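The block-table idea described above can be sketched in a few lines of Python. This is a toy illustration, not vLLM's actual code: the class names (`BlockAllocator`, `Request`), the block size, and the methods are all made up for the example. The point is the mapping: logical token positions index into a per-request block table, which resolves to (physical block, offset) pairs, so blocks need not be contiguous.

```python
# Toy sketch of PagedAttention-style bookkeeping (illustrative names, not vLLM's API).
# Logical token positions map to fixed-size physical KV blocks via a block table.

BLOCK_SIZE = 4  # tokens per KV block (vLLM commonly uses larger blocks, e.g. 16)

class BlockAllocator:
    """Hands out physical block ids from a free pool."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        return self.free.pop()

    def release(self, block_id):
        self.free.append(block_id)

class Request:
    """Tracks one sequence's block table: logical block index -> physical block id."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the last one is full,
        # so waste is bounded by one partially filled block per request.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def physical_slot(self, token_pos):
        # Translate a logical token position into (physical block, offset).
        return self.block_table[token_pos // BLOCK_SIZE], self.num_tokens and token_pos % BLOCK_SIZE

allocator = BlockAllocator(num_blocks=8)
req = Request(allocator)
for _ in range(6):           # "prefill" 6 prompt tokens
    req.append_token()
print(len(req.block_table))  # 6 tokens need only 2 blocks of size 4
print(req.physical_slot(5))  # second block, offset 1
```

Decode then just calls `append_token()` once per generated token; a new block is grabbed only on block boundaries, which is exactly why mixed-length traffic stops fragmenting memory.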
Watch on YouTube ↗

Chapters (8)

0:00 The Problem of Memory Fragmentation
0:41 Memory Waste in Traditional Serving
1:23 Introduction to PagedAttention
2:08 Learning from Operating Systems (Virtual Memory)
2:49 How PagedAttention Divides the KV Cache
3:25 Walking Through a Single Request
4:00 Prefix Sharing: Saving Memory on Shared Prompts
4:41 Continuous
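The prefix-sharing chapter above can also be sketched with reference counting: sequences that share a prompt (as in parallel sampling or beam search) point their block tables at the same physical blocks, and a block is copied only when a sequence needs to write into a shared one (copy-on-write). A minimal illustration, with made-up names that are not vLLM's actual API:

```python
# Illustrative copy-on-write sharing of KV blocks (hypothetical names, not vLLM's code).

class SharedBlockPool:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.ref_count = {}            # physical block id -> number of sequences using it

    def allocate(self):
        b = self.free.pop()
        self.ref_count[b] = 1
        return b

    def fork(self, block_table):
        # A new sequence shares the parent's prompt blocks: just bump ref counts.
        for b in block_table:
            self.ref_count[b] += 1
        return list(block_table)       # child gets its own table, same physical blocks

    def copy_on_write(self, block_table, idx):
        # Before writing into a shared block, give this sequence a private copy.
        b = block_table[idx]
        if self.ref_count[b] > 1:
            self.ref_count[b] -= 1
            block_table[idx] = self.allocate()
            # (a real system would also copy the KV data into the new block)
        return block_table[idx]

pool = SharedBlockPool(num_blocks=8)
parent = [pool.allocate(), pool.allocate()]  # prompt fills two blocks
child = pool.fork(parent)                    # parallel sample shares the whole prefix
pool.copy_on_write(child, 1)                 # child diverges, copies only the last block
print(parent[0] == child[0], parent[1] == child[1])  # prints: True False
```

The saving is that two samples over a long shared prompt keep one physical copy of the prompt's KV cache and pay only for the blocks where they diverge.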