PagedAttention: Behind vLLM's Insane Speed

Tales Of Tensors · Beginner · 🧠 Large Language Models · 3mo ago
PagedAttention is the “virtual memory” idea applied to LLM inference: instead of storing each request’s KV cache in one big contiguous chunk, vLLM breaks it into fixed-size blocks and maps logical tokens to physical GPU memory with a block table. The result is far less fragmentation, smarter reuse, and higher throughput under real, mixed-length traffic. In this video, we visualize the problem with a Tetris-style fragmentation demo, then build up the exact data structures (KV blocks + block tables), walk through how prefill vs. decode works, show how sharing helps for parallel sampling/beam search, and what happ…
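The block-table idea described above can be sketched in a few lines of Python. This is a toy illustration, not vLLM's actual code: the class names (`BlockAllocator`, `Request`), the block size, and the methods are all made up for the example. The point is the mapping: logical token positions index into a per-request block table, which resolves to (physical block, offset) pairs, so blocks need not be contiguous.

```python
# Toy sketch of PagedAttention-style bookkeeping (illustrative names, not vLLM's API).
# Logical token positions map to fixed-size physical KV blocks via a block table.

BLOCK_SIZE = 4  # tokens per KV block (vLLM commonly uses larger blocks, e.g. 16)

class BlockAllocator:
    """Hands out physical block ids from a free pool."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        return self.free.pop()

    def release(self, block_id):
        self.free.append(block_id)

class Request:
    """Tracks one sequence's block table: logical block index -> physical block id."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the last one is full,
        # so waste is bounded by one partially filled block per request.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def physical_slot(self, token_pos):
        # Translate a logical token position into (physical block, offset).
        return self.block_table[token_pos // BLOCK_SIZE], self.num_tokens and token_pos % BLOCK_SIZE

allocator = BlockAllocator(num_blocks=8)
req = Request(allocator)
for _ in range(6):           # "prefill" 6 prompt tokens
    req.append_token()
print(len(req.block_table))  # 6 tokens need only 2 blocks of size 4
print(req.physical_slot(5))  # second block, offset 1
```

Decode then just calls `append_token()` once per generated token; a new block is grabbed only on block boundaries, which is exactly why mixed-length traffic stops fragmenting memory.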
Watch on YouTube ↗

Chapters (8)

0:00 The Problem of Memory Fragmentation
0:41 Memory Waste in Traditional Serving
1:23 Introduction to PagedAttention
2:08 Learning from Operating Systems (Virtual Memory)
2:49 How PagedAttention Divides the KV Cache
3:25 Walking Through a Single Request
4:00 Prefix Sharing: Saving Memory on Shared Prompts
4:41 Continuous
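The prefix-sharing chapter above can also be sketched with reference counting: sequences that share a prompt (as in parallel sampling or beam search) point their block tables at the same physical blocks, and a block is copied only when a sequence needs to write into a shared one (copy-on-write). A minimal illustration, with made-up names that are not vLLM's actual API:

```python
# Illustrative copy-on-write sharing of KV blocks (hypothetical names, not vLLM's code).

class SharedBlockPool:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.ref_count = {}            # physical block id -> number of sequences using it

    def allocate(self):
        b = self.free.pop()
        self.ref_count[b] = 1
        return b

    def fork(self, block_table):
        # A new sequence shares the parent's prompt blocks: just bump ref counts.
        for b in block_table:
            self.ref_count[b] += 1
        return list(block_table)       # child gets its own table, same physical blocks

    def copy_on_write(self, block_table, idx):
        # Before writing into a shared block, give this sequence a private copy.
        b = block_table[idx]
        if self.ref_count[b] > 1:
            self.ref_count[b] -= 1
            block_table[idx] = self.allocate()
            # (a real system would also copy the KV data into the new block)
        return block_table[idx]

pool = SharedBlockPool(num_blocks=8)
parent = [pool.allocate(), pool.allocate()]  # prompt fills two blocks
child = pool.fork(parent)                    # parallel sample shares the whole prefix
pool.copy_on_write(child, 1)                 # child diverges, copies only the last block
print(parent[0] == child[0], parent[1] == child[1])  # prints: True False
```

The saving is that two samples over a long shared prompt keep one physical copy of the prompt's KV cache and pay only for the blocks where they diverge.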