Flash Attention: The Fastest Attention Mechanism?
This video explains FlashAttention-1, FlashAttention-2, and FlashAttention-3 in a clear, visual, step-by-step way. We look at why standard attention is memory-bound, how GPU memory hierarchy creates bottlenecks, and how FlashAttention fixes the problem with three core ideas: tiling, online softmax, and recomputation. You’ll learn how FA2 improves parallelism, how FA3 uses Hopper’s new hardware features for even higher utilization, and why all modern LLM frameworks now use FlashAttention by default. We cover training and inference speedups, memory savings, context expansion, and how to enable F…
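The tiling and online-softmax ideas summarized above can be sketched in plain NumPy. This is an illustrative single-head, unmasked sketch, not the real fused CUDA kernel: function names, the block size, and the float64 accumulator are choices made here for clarity.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Standard attention: materializes the full N x N score matrix in memory."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def flash_attention(Q, K, V, block_size=4):
    """Tiled attention with online softmax: never forms the full N x N matrix.

    For each query tile we stream over key/value tiles, carrying a running
    row max `m`, a running softmax denominator `l`, and an unnormalized
    output accumulator, rescaling the old partial results whenever a new
    tile raises the running max.
    """
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((N, d), dtype=np.float64)
    for qs in range(0, N, block_size):
        q = Q[qs:qs + block_size]                    # (Bq, d) query tile
        m = np.full(q.shape[0], -np.inf)             # running row max
        l = np.zeros(q.shape[0])                     # running softmax denom
        acc = np.zeros((q.shape[0], d))              # unnormalized output
        for ks in range(0, N, block_size):
            k = K[ks:ks + block_size]
            v = V[ks:ks + block_size]
            s = (q @ k.T) * scale                    # (Bq, Bk) score tile
            m_new = np.maximum(m, s.max(axis=-1))
            p = np.exp(s - m_new[:, None])           # tile softmax numerator
            correction = np.exp(m - m_new)           # rescale earlier stats
            l = l * correction + p.sum(axis=-1)
            acc = acc * correction[:, None] + p @ v
            m = m_new
        O[qs:qs + block_size] = acc / l[:, None]
    return O
```

Despite visiting the keys one tile at a time, the result matches the naive version exactly (up to floating-point error), which is the whole point: peak memory drops from O(N^2) to O(N * block_size) without changing the output.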
Watch on YouTube ↗
Chapters (12)
The Problem: Memory Bound vs. Compute Bound (1:12)
GPU Memory Hierarchy: HBM vs. SRAM (1:44)
Counting Memory Accesses in Standard Attention (2:12)
Insight 1: Tiling and Processing in Blocks (2:49)
Insight 2: Online Softmax for Incremental Updates (3:26)
Insight 3: Trading Recomputation for Bandwidth (4:05)
Walking Through the Flash Attention Algorithm (4:51)
Flash Attention 2: Parallelism and Optimization (5:27)
Flash Attention 3: H100 Hopper Optimizations (6:20)
Real-World Impact: Training and Inference Speedups (7:05)
How to Use Flash Attention (PyTorch & Hugging Face) (7:46)
Recap: Three Key Insights
DeepCamp AI
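For the "How to Use" chapter, here is a minimal sketch of the PyTorch 2.x route via `torch.nn.functional.scaled_dot_product_attention`, which dispatches to a FlashAttention kernel on supported GPUs and falls back to the math implementation elsewhere. The Hugging Face lines are left commented because they assume the `flash-attn` package, a supported GPU, and a placeholder model id.

```python
import torch
import torch.nn.functional as F

# scaled_dot_product_attention picks the fastest available backend
# (FlashAttention on supported CUDA GPUs, math fallback on CPU).
q = torch.randn(1, 4, 128, 64)   # (batch, heads, seq_len, head_dim)
k = torch.randn(1, 4, 128, 64)
v = torch.randn(1, 4, 128, 64)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 4, 128, 64])

# In Hugging Face Transformers, the FlashAttention-2 backend is requested
# at load time (requires flash-attn installed and a supported GPU;
# "model-id" is a placeholder):
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(
#     "model-id",
#     attn_implementation="flash_attention_2",
#     torch_dtype=torch.bfloat16,
#     device_map="cuda",
# )
```

Note that no attention module needs to be rewritten: the speedup comes entirely from which kernel backs the same mathematical operation.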