How Attention Got So Efficient [GQA/MLA/DSA]
Attention mechanisms have been a key driver behind the recent AI boom. But what happened after the multi-head attention introduced in the seminal 2017 Transformer paper?
In this video, we break down several core ideas that make attention efficient and scalable.
00:00 Introduction
00:35 Tokenization
01:21 Attention (vector form)
04:26 Attention (matrix form)
07:07 Key-Value caching
09:42 Multi-Query Attention (MQA)
11:03 Grouped Query Attention (GQA)
13:32 Multi-head Latent Attention (MLA)
15:37 MLA at inference time
18:15 Applying RoPE to MLA (decoupled RoPE)
22:18 DeepSeek Sparse Attention (DSA)
23:57 Quantization and rotation in DSA
27:44 DSA training
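The chapters above walk from vanilla multi-head attention to its cheaper variants. As a rough companion, here is a minimal NumPy sketch (not the video's code; all names and shapes are illustrative) of one decode step of Grouped Query Attention with a KV cache. MHA and MQA fall out as the special cases `n_kv_heads == n_heads` and `n_kv_heads == 1`.

```python
# Illustrative sketch of GQA with a KV cache, in plain NumPy.
# MHA: n_kv_heads == n_heads.  MQA: n_kv_heads == 1.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gqa_step(q, k_new, v_new, k_cache, v_cache):
    """One decode step of grouped-query attention.

    q:       (n_heads, d)        queries for the current token
    k_new:   (n_kv_heads, d)     keys for the current token
    v_new:   (n_kv_heads, d)     values for the current token
    k_cache: (n_kv_heads, t, d)  cached keys for previous tokens
    v_cache: (n_kv_heads, t, d)  cached values

    Returns the attention output (n_heads, d) and the updated caches.
    """
    n_heads, d = q.shape
    n_kv_heads = k_new.shape[0]
    assert n_heads % n_kv_heads == 0
    group = n_heads // n_kv_heads  # query heads sharing one KV head

    # Append this step's K/V to the cache (the "Key-Value caching" chapter).
    k_cache = np.concatenate([k_cache, k_new[:, None, :]], axis=1)
    v_cache = np.concatenate([v_cache, v_new[:, None, :]], axis=1)

    out = np.empty_like(q)
    for h in range(n_heads):
        kv = h // group                            # which shared KV head
        scores = k_cache[kv] @ q[h] / np.sqrt(d)   # (t+1,)
        out[h] = softmax(scores) @ v_cache[kv]     # (d,)
    return out, k_cache, v_cache
```

Sharing each KV head across `n_heads // n_kv_heads` query heads shrinks the KV cache by that same factor, which is the main inference-time memory saving MQA and GQA buy; MLA pushes further by caching a compressed latent instead of full K/V.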
DeepCamp AI