How DeepSeek's Multi-Head Latent Attention Changed the Game

Tales Of Tensors · Advanced · 🧠 Large Language Models · 4mo ago
What if you could cut your transformer’s KV cache by over 90% without touching your GPU? In this video, we break down how DeepSeek’s Multi-Head Latent Attention (MLA) completely changes the game for long-context LLMs by aggressively compressing keys and values into a tiny latent space—while keeping model quality essentially unchanged. We’ll start from the real bottleneck: why the KV cache explodes as sequence length grows, and why older tricks like MQA and GQA help on memory but often pay in quality. Then we dive into MLA’s core idea: low-rank compression of K and V into latents, regenerating…
Watch on YouTube ↗
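To make the core idea concrete, here is a minimal NumPy sketch of MLA-style low-rank KV compression: instead of caching full per-head keys and values, each token's hidden state is projected down to a small latent vector, and K and V are regenerated from the cached latents at attention time. The dimensions, weight names (`W_down`, `W_up_k`, `W_up_v`), and random initialization are illustrative assumptions, not DeepSeek's actual configuration.

```python
import numpy as np

# Hypothetical dimensions for illustration (not DeepSeek's real config)
d_model, n_heads, d_head, d_latent = 512, 8, 64, 32

rng = np.random.default_rng(0)
W_down = rng.standard_normal((d_model, d_latent)) * 0.02          # compress hidden state to latent
W_up_k = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02  # regenerate keys from latent
W_up_v = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02  # regenerate values from latent

def step(h, latent_cache):
    """Process one token: cache only its latent, then rebuild K/V for all tokens."""
    c = h @ W_down                  # (d_latent,) — this small vector is all we store
    latent_cache.append(c)
    C = np.stack(latent_cache)      # (seq_len, d_latent)
    K = C @ W_up_k                  # (seq_len, n_heads * d_head), regenerated on the fly
    V = C @ W_up_v
    return K, V

latent_cache = []
for _ in range(10):
    K, V = step(rng.standard_normal(d_model), latent_cache)

# Cached floats per token: d_latent for MLA vs. 2 * n_heads * d_head for standard MHA
print(d_latent, 2 * n_heads * d_head)
```

With these toy numbers the cache stores 32 floats per token instead of 1024 (keys plus values across 8 heads), a roughly 97% reduction, which is the kind of saving the video's ">90%" claim refers to; the trade-off is extra matrix multiplies to regenerate K and V each step.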

Chapters (5)

0:00 Intro: The Cost of Global Attention
0:40 The KV Cache Memory Bottleneck
1:22 Comparing MHA, MQA, and GQA
2:06 The Core Concept: Low-Rank Compression
2:56 Latent Space Projections vs. Standard Atte
Next Up
5 Levels of AI Agents - From Simple LLM Calls to Multi-Agent Systems
Dave Ebbelaar (LLM Eng)