DeepSeek Sparse Attention Explained: 80% Cheaper Long-Context AI

Tales Of Tensors · Beginner · 🧠 Large Language Models · 1w ago
Watch on YouTube ↗

Chapters (16)

0:00 Introduction to DeepSeek Sparse Attention
0:52 Standard Attention and the KV Cache Bottleneck
3:16 Prefilling vs. Decoding Bottlenecks
3:51 The Intuition Behind Sparse Attention
5:06 The Lightning Indexer
5:29 Multi-Head Latent Attention (MLA) Explained
7:20 How DeepSeek Sparse Attention (DSA) Works
8:26 Lightning Indexer Deep Dive
9:29 Attention Computation on Selected Tokens
10:18 Full Architecture Overview
11:17 Two-Stage Training Process
12:34 Complexity Analysis
14:05 GPU Hardware Mapping and Execution
15:35 DeepSeek v3.2: MoE and Sparse Attention
16:27 Limitations of DeepSeek Sparse Attention
17:25 Summary and Conclusion
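
The chapters from the Lightning Indexer (5:06) through the attention computation on selected tokens (9:29) describe the core mechanism: a lightweight indexer scores every cached token, a top-k selection keeps only the highest-scoring ones, and full attention then runs on that small subset instead of the whole context. A minimal NumPy sketch of that idea is below; the function name, shapes, and toy sizes are illustrative assumptions, not DeepSeek's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sparse_attention_decode(q, K, V, idx_q, idx_K, idx_w, k=4):
    """One decoding step of DSA-style sparse attention (toy sketch).

    A small "lightning indexer" scores every cached token, the top-k
    scores pick which tokens the main attention may read, and standard
    scaled dot-product attention runs only on that subset.
    Shapes (single head, toy sizes): q (d,), K/V (T, d),
    idx_q (h_i, d_i), idx_K (T, h_i, d_i), idx_w (h_i,).
    """
    # Indexer score per cached token: weighted sum over indexer heads of
    # ReLU(q_idx . k_idx) -- cheap because h_i and d_i are tiny.
    scores = np.einsum("hd,thd->th", idx_q, idx_K)   # (T, h_i)
    scores = np.maximum(scores, 0.0) @ idx_w         # (T,)

    # Keep only the k highest-scoring cached tokens.
    top = np.argsort(scores)[-k:]

    # Full attention restricted to the selection: O(k*d) per step, not O(T*d).
    att = softmax(q @ K[top].T / np.sqrt(q.shape[-1]))
    return att @ V[top]

# Toy usage: 16 cached tokens, model dim 8, 2 indexer heads of dim 4.
rng = np.random.default_rng(0)
T, d, h_i, d_i = 16, 8, 2, 4
out = sparse_attention_decode(
    rng.normal(size=d), rng.normal(size=(T, d)), rng.normal(size=(T, d)),
    rng.normal(size=(h_i, d_i)), rng.normal(size=(T, h_i, d_i)),
    np.abs(rng.normal(size=h_i)), k=4,
)
print(out.shape)  # (8,)
```

The savings discussed in the complexity-analysis chapter (12:34) come from the last two steps: once k is fixed, each decoding step touches k cache entries rather than all T, so per-token attention cost stops growing with context length, while the indexer pass stays cheap because its heads and dimensions are much smaller than the model's.
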