DeepSeek Sparse Attention Explained: 80% Cheaper Long-Context AI
00:00:00 Introduction to DeepSeek Sparse Attention
00:00:52 Standard Attention and the KV Cache Bottleneck
00:03:16 Prefilling vs. Decoding Bottlenecks
00:03:51 The Intuition Behind Sparse Attention
00:05:06 The Lightning Indexer
00:05:29 Multi-Head Latent Attention (MLA) Explained
00:07:20 How DeepSeek Sparse Attention (DSA) Works
00:08:26 Lightning Indexer Deep Dive
00:09:29 Attention Computation on Selected Tokens
00:10:18 Full Architecture Overview
00:11:17 Two-Stage Training Process
00:12:34 Complexity Analysis
00:14:05 GPU Hardware Mapping and Execution
00:15:35 DeepSeek v3.2: MoE and Sparse Attention
00:16:27 Limitations of DeepSeek Sparse Attention
00:17:25 Summary and Conclusion
DeepCamp AI