Mamba and S4 Explained: Architecture, Parallel Scan, Kernel Fusion, Recurrent, Convolution, Math
Explanation of the paper Mamba: Linear-Time Sequence Modeling with Selective State Spaces
In this video I will explain Mamba, a new sequence modeling architecture that can compete with the Transformer. I will start by introducing the main sequence modeling architectures (RNN, CNN and Transformer) and then take a deep dive into State Space Models. Fully understanding State Space Models requires some background in differential equations, so I will give a brief introduction to differential equations (in 5 minutes!) and then derive the recurrent formula and the convolutional formula from first principles. I will also prove mathematically (with the help of visual diagrams) why State Space Models can be run as a convolution. I will explain what the HiPPO matrix is and how it helps the model "memorize" the input history in a finite state.
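As a rough illustration of the recurrent and convolutional views mentioned above, here is a minimal NumPy sketch. It is not taken from the video or the paper's code. It assumes a diagonal state matrix (`a_diag`), a scalar input channel, and zero-order-hold discretization; the function names and the toy values are hypothetical.

```python
import numpy as np

def discretize_zoh(a_diag, b, delta):
    """Zero-order-hold discretization, assuming A is diagonal and nonzero."""
    a_bar = np.exp(delta * a_diag)           # A_bar = exp(delta * A)
    b_bar = (a_bar - 1.0) / a_diag * b       # B_bar = A^-1 (exp(delta * A) - I) B
    return a_bar, b_bar

def ssm_recurrent(a_bar, b_bar, c, x):
    """Recurrent view: h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t."""
    h = np.zeros_like(a_bar)
    ys = []
    for x_t in x:
        h = a_bar * h + b_bar * x_t
        ys.append(np.dot(c, h))
    return np.array(ys)

def ssm_kernel(a_bar, b_bar, c, length):
    """Convolution kernel K_k = C A_bar^k B_bar for k = 0 .. length - 1."""
    powers = a_bar[None, :] ** np.arange(length)[:, None]  # shape (length, N)
    return powers @ (c * b_bar)

def ssm_convolution(a_bar, b_bar, c, x):
    """Convolutional view: y = causal convolution of x with the kernel K."""
    k = ssm_kernel(a_bar, b_bar, c, len(x))
    return np.convolve(x, k)[: len(x)]

# Toy parameters (illustrative values only).
a_diag = np.array([-1.0, -2.0, -3.0])
b = np.ones(3)
c = np.array([0.5, -1.0, 2.0])
x = np.sin(np.linspace(0.0, 3.0, 16))

a_bar, b_bar = discretize_zoh(a_diag, b, delta=0.1)
y_rec = ssm_recurrent(a_bar, b_bar, c, x)
y_conv = ssm_convolution(a_bar, b_bar, c, x)
print(np.allclose(y_rec, y_conv))
```

Both functions compute the same outputs because unrolling the recurrence from a zero initial state gives each output as a weighted sum of past inputs, with the weights given by the kernel.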
In the second part of the video, I will explore Mamba, and in particular the Selective Scan algorithm. I will first explain what the scan operation is and how it can be parallelized, and then show how the authors further improved the algorithm with kernel fusion and activation recomputation. I will also give a brief lesson on the GPU memory hierarchy and why some operations may be IO-bound.
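To make the scan idea concrete, here is a minimal sketch of a linear recurrence h_t = a_t * h_{t-1} + b_t computed two ways: one step at a time, and with a logarithmic number of vectorized sweeps. The sweep version uses a simple doubling scheme (Hillis-Steele style) chosen for readability; that is an assumption made here, not necessarily the variant shown in the video or used in the Mamba kernels. The function names are hypothetical.

```python
import numpy as np

def sequential_scan(a, b):
    """Compute h_t = a_t * h_{t-1} + b_t one step at a time, with h_{-1} = 0."""
    h = 0.0
    out = np.empty(len(b), dtype=float)
    for t in range(len(b)):
        h = a[t] * h + b[t]
        out[t] = h
    return out

def parallel_style_scan(a, b):
    """Same recurrence in O(log n) vectorized sweeps.

    Each position holds a pair (a, b) describing the map h -> a * h + b.
    Composing the map at position t - shift with the map at position t gives
    (a_t * a_prev, a_t * b_prev + b_t), which is an associative operation.
    """
    a = np.asarray(a, dtype=float).copy()
    b = np.asarray(b, dtype=float).copy()
    n = len(b)
    shift = 1
    while shift < n:
        new_a = a.copy()
        new_b = b.copy()
        # Every position t >= shift absorbs the pair stored at t - shift.
        new_b[shift:] = a[shift:] * b[:-shift] + b[shift:]
        new_a[shift:] = a[shift:] * a[:-shift]
        a, b = new_a, new_b
        shift *= 2
    return b

# With a_t = 1 the recurrence reduces to a plain prefix sum.
ones = np.ones(8)
values = np.arange(1.0, 9.0)
prefix = parallel_style_scan(ones, values)

# With varying a_t the two versions still agree.
rng = np.random.default_rng(0)
a_rand = rng.uniform(0.5, 1.0, size=13)
b_rand = rng.normal(size=13)
h_seq = sequential_scan(a_rand, b_rand)
h_par = parallel_style_scan(a_rand, b_rand)
print(np.allclose(h_seq, h_par))
```

On a GPU, each sweep of the loop could be executed by many threads at once, which is what makes the scan parallelizable even though the recurrence looks inherently sequential.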
In the last part of the video we will explore the architecture of Mamba and some performance results to compare it with the Transformer.
Slides PDF and Parallel Scan (excel file): https://github.com/hkproj/mamba-notes
Chapters
00:00:00 - Introduction
00:01:46 - Sequence modeling
00:07:12 - Differential equations (basics)
00:11:38 - State Space Models
00:13:53 - Discretization
00:23:08 - Recurrent computation
00:26:32 - Convolutional computation
00:34:18 - Skip connection term
00:35:21 - Multidimensional SSM
00:37:44 - The HIPPO theory
00:43:30 - The motivation behind Mamba
00:46:56 - Selective Scan algorithm
00:51:34 - The Scan operation
00:54:24 - Parallel Scan
00:57:20 - Innovations in Selec