DeepSeek Sparse Attention Explained: 80% Cheaper Long-Context AI

Tales Of Tensors · Beginner · 🧠 Large Language Models · 1w ago
Watch on YouTube ↗

Chapters (16)

0:00 Introduction to DeepSeek Sparse Attention
0:52 Standard Attention and the KV Cache Bottleneck
3:16 Prefilling vs. Decoding Bottlenecks
3:51 The Intuition Behind Sparse Attention
5:06 The Lightning Indexer
5:29 Multi-Head Latent Attention (MLA) Explained
7:20 How DeepSeek Sparse Attention (DSA) Works
8:26 Lightning Indexer Deep Dive
9:29 Attention Computation on Selected Tokens
10:18 Full Architecture Overview
11:17 Two-Stage Training Process
12:34 Complexity Analysis
14:05 GPU Hardware Mapping and Execution
15:35 DeepSeek v3.2: MoE and Sparse Attention
16:27 Limitations of DeepSeek Sparse Attention
17:25 Summary and Conclusion
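
The chapters from the Lightning Indexer (5:06) through the attention computation on selected tokens (9:29) describe the core mechanism: a lightweight indexer scores every cached token, a top-k selection keeps only the highest-scoring ones, and full attention then runs on that small subset instead of the whole context. A minimal NumPy sketch of that idea is below; the function name, shapes, and toy sizes are illustrative assumptions, not DeepSeek's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sparse_attention_decode(q, K, V, idx_q, idx_K, idx_w, k=4):
    """One decoding step of DSA-style sparse attention (toy sketch).

    A small "lightning indexer" scores every cached token, the top-k
    scores pick which tokens the main attention may read, and standard
    scaled dot-product attention runs only on that subset.
    Shapes (single head, toy sizes): q (d,), K/V (T, d),
    idx_q (h_i, d_i), idx_K (T, h_i, d_i), idx_w (h_i,).
    """
    # Indexer score per cached token: weighted sum over indexer heads of
    # ReLU(q_idx . k_idx) -- cheap because h_i and d_i are tiny.
    scores = np.einsum("hd,thd->th", idx_q, idx_K)   # (T, h_i)
    scores = np.maximum(scores, 0.0) @ idx_w         # (T,)

    # Keep only the k highest-scoring cached tokens.
    top = np.argsort(scores)[-k:]

    # Full attention restricted to the selection: O(k*d) per step, not O(T*d).
    att = softmax(q @ K[top].T / np.sqrt(q.shape[-1]))
    return att @ V[top]

# Toy usage: 16 cached tokens, model dim 8, 2 indexer heads of dim 4.
rng = np.random.default_rng(0)
T, d, h_i, d_i = 16, 8, 2, 4
out = sparse_attention_decode(
    rng.normal(size=d), rng.normal(size=(T, d)), rng.normal(size=(T, d)),
    rng.normal(size=(h_i, d_i)), rng.normal(size=(T, h_i, d_i)),
    np.abs(rng.normal(size=h_i)), k=4,
)
print(out.shape)  # (8,)
```

The savings discussed in the complexity-analysis chapter (12:34) come from the last two steps: once k is fixed, each decoding step touches k cache entries rather than all T, so per-token attention cost stops growing with context length, while the indexer pass stays cheap because its heads and dimensions are much smaller than the model's.
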