How DeepSeek's Multi-Head Latent Attention Changed the Game

Tales Of Tensors · Advanced · 🧠 Large Language Models · 4mo ago
What if you could cut your transformer’s KV cache by over 90% without touching your GPU? In this video, we break down how DeepSeek’s Multi-Head Latent Attention (MLA) completely changes the game for long-context LLMs by aggressively compressing keys and values into a tiny latent space—while keeping model quality essentially unchanged. We’ll start from the real bottleneck: why the KV cache explodes as sequence length grows, and why older tricks like MQA and GQA help on memory but often pay in quality. Then we dive into MLA’s core idea: low-rank compression of K and V into latents, regenerating…
Watch on YouTube ↗
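To make the core idea concrete, here is a minimal NumPy sketch of MLA-style low-rank KV compression: instead of caching full per-head keys and values, each token's hidden state is projected down to a small latent vector, and K and V are regenerated from the cached latents at attention time. The dimensions, weight names (`W_down`, `W_up_k`, `W_up_v`), and random initialization are illustrative assumptions, not DeepSeek's actual configuration.

```python
import numpy as np

# Hypothetical dimensions for illustration (not DeepSeek's real config)
d_model, n_heads, d_head, d_latent = 512, 8, 64, 32

rng = np.random.default_rng(0)
W_down = rng.standard_normal((d_model, d_latent)) * 0.02          # compress hidden state to latent
W_up_k = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02  # regenerate keys from latent
W_up_v = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02  # regenerate values from latent

def step(h, latent_cache):
    """Process one token: cache only its latent, then rebuild K/V for all tokens."""
    c = h @ W_down                  # (d_latent,) — this small vector is all we store
    latent_cache.append(c)
    C = np.stack(latent_cache)      # (seq_len, d_latent)
    K = C @ W_up_k                  # (seq_len, n_heads * d_head), regenerated on the fly
    V = C @ W_up_v
    return K, V

latent_cache = []
for _ in range(10):
    K, V = step(rng.standard_normal(d_model), latent_cache)

# Cached floats per token: d_latent for MLA vs. 2 * n_heads * d_head for standard MHA
print(d_latent, 2 * n_heads * d_head)
```

With these toy numbers the cache stores 32 floats per token instead of 1024 (keys plus values across 8 heads), a roughly 97% reduction, which is the kind of saving the video's ">90%" claim refers to; the trade-off is extra matrix multiplies to regenerate K and V each step.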

Chapters (5)

0:00 Intro: The Cost of Global Attention
0:40 The KV Cache Memory Bottleneck
1:22 Comparing MHA, MQA, and GQA
2:06 The Core Concept: Low-Rank Compression
2:56 Latent Space Projections vs. Standard Atte
Next Up
5 Levels of AI Agents - From Simple LLM Calls to Multi-Agent Systems
Dave Ebbelaar (LLM Eng)