How Attention Got So Efficient [GQA/MLA/DSA]

Jia-Bin Huang · Beginner · 🧠 Large Language Models · 4mo ago
Attention mechanisms have been the key driver behind the recent AI boom. What happened after the multi-head attention of the seminal 2017 Transformer paper? In this video, we break down several core ideas that make attention efficient and scalable; see the chapter list below for the full outline.
Watch on YouTube ↗

Chapters (13)

0:00 Introduction
0:35 Tokenization
1:21 Attention (vector form)
4:26 Attention (matrix form)
7:07 Key-Value caching
9:42 Multi-Query Attention (MQA)
11:03 Grouped Query Attention (GQA)
13:32 Multi-head Latent Attention (MLA)
15:37 MLA at inference time
18:15 Applying RoPE to MLA (decoupled RoPE)
22:18 DeepSeek Sparse Attention (DSA)
23:57 Quantization and rotation in DSA
27:44 DSA training
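To give a flavor of the efficiency ideas the video covers, here is a minimal NumPy sketch of Grouped-Query Attention (GQA): several query heads share one key/value head, which shrinks the KV cache that must be kept at inference time. All shapes, names, and sizes below are illustrative, not taken from the video.

```python
import numpy as np

def gqa(q, k, v, n_groups):
    """Grouped-Query Attention sketch.
    q: (n_heads, T, d); k, v: (n_groups, T, d).
    Each group of n_heads // n_groups query heads shares one K/V head,
    shrinking the KV cache by a factor of n_heads / n_groups.
    n_groups == n_heads recovers standard multi-head attention;
    n_groups == 1 recovers Multi-Query Attention (MQA)."""
    n_heads, T, d = q.shape
    heads_per_group = n_heads // n_groups
    out = np.empty_like(q)
    for h in range(n_heads):
        g = h // heads_per_group  # which shared KV group this head reads
        scores = q[h] @ k[g].T / np.sqrt(d)
        # causal mask: token t may only attend to positions <= t
        mask = np.triu(np.ones((T, T), dtype=bool), k=1)
        scores[mask] = -np.inf
        # numerically stable softmax over the key dimension
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[h] = w @ v[g]
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 5, 16))  # 8 query heads
k = rng.normal(size=(2, 5, 16))  # only 2 KV groups: 4x smaller KV cache
v = rng.normal(size=(2, 5, 16))
print(gqa(q, k, v, n_groups=2).shape)  # (8, 5, 16)
```

The output keeps the per-query-head shape; only the cached K/V tensors shrink, which is why GQA (and its extreme case, MQA) speeds up memory-bound decoding.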