Flash Attention derived and coded from first principles with Triton (Python)

Umar Jamil · Beginner · 📐 ML Fundamentals · 1y ago
In this video, I'll derive and code Flash Attention from scratch. I'll derive every operation in Flash Attention using only pen and "paper", and I'll explain the CUDA programming model and Triton from zero, so no prior knowledge of either is required. To code the backwards pass, I'll first explain how the autograd system works in PyTorch, then derive the Jacobians of the matrix multiplication and the Softmax operations, and use them to code the backwards pass. All the code is written in Python with Triton. Full chapter timestamps are listed below.

This video won't only teach you one of the most influential algorithms in deep learning history; it'll also give you the knowledge you need to solve any new problem that involves writing CUDA or Triton kernels. Moreover, it'll give you the mathematical foundations to derive backwards passes! As usual, the code is available on GitHub: https://github.com/hkproj/triton-flash-attention

🚀 Join Writer 🚀 If you're an ML researcher who wants to do research at the hottest AI startup in Silicon Valley, consider applying to Writer an
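As a quick orientation for the Safe Softmax and Online Softmax chapters, here is a minimal NumPy sketch of the one-pass softmax idea that Flash Attention tiles over blocks. This is my own illustration, not code from the video or from the linked repository, and the function name online_softmax is hypothetical.

```python
import numpy as np

def online_softmax(x):
    """One-pass ("online") softmax statistics: keep a running maximum m and a
    running normalizer l, rescaling l whenever a larger maximum appears."""
    m = float("-inf")  # running maximum of the scores seen so far
    l = 0.0            # running sum of exp(x_j - m)
    for xj in x:
        m_new = max(m, xj)
        # Rescale the old normalizer to the new maximum, then add the new term.
        l = l * np.exp(m - m_new) + np.exp(xj - m_new)
        m = m_new
    # A final pass produces the probabilities; Flash Attention instead folds this
    # rescaling into its output accumulator, block by block.
    return np.exp(np.asarray(x, dtype=np.float64) - m) / l

x = np.array([3.0, 1.0, 0.2])
reference = np.exp(x - x.max()) / np.exp(x - x.max()).sum()  # standard "safe" softmax
assert np.allclose(online_softmax(x), reference)
```

Flash Attention keeps exactly these two running statistics (a per-row maximum and normalizer) so the attention output can be accumulated one key/value block at a time without materializing the full score matrix.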

Related AI Lessons

The GenAI Honeymoon is Over: The Brutal Realities of Production AI
The GenAI honeymoon is over, highlighting the importance of MLOps in production AI
Medium · Data Science
Quantization From First Principles: Build Your Own INT8 Inference Engine
Learn to build an INT8 inference engine from scratch and understand the fundamentals of quantization in machine learning
Medium · Machine Learning

Chapters (23)

0:00 Introduction
3:10 Multi-Head Attention
9:06 Why Flash Attention
12:50 Safe Softmax
27:03 Online Softmax
39:44 Online Softmax (Proof)
47:26 Block Matrix Multiplication
1:28:38 Flash Attention forward (by hand)
1:44:01 Flash Attention forward (paper)
1:50:53 Intro to CUDA with examples
2:26:28 Tensor Layouts
2:40:48 Intro to Triton with examples
2:54:26 Flash Attention forward (coding)
4:22:11 LogSumExp trick in Flash Attention 2
4:32:53 Derivatives, gradients, Jacobians
4:45:54 Autograd
5:00:00 Jacobian of the MatMul operation
5:16:14 Jacobian through the Softmax
5:47:33 Flash Attention backwards (paper)
6:13:11 Flash Attention backwards (coding)
7:21:10 Triton Autotuning
7:23:29 Triton tricks: software pipelining
7:33:38 Running the code