Beyond Softmax: The Future of Attention Mechanisms
Linear attention and its variants have emerged as promising techniques for sequence modeling. Unlike standard softmax attention in Transformers, these models decode faster and keep a constant memory footprint regardless of sequence length. Such methods may hold the key to unlocking long-context processing.
In this video, let's explore what comes after softmax attention.
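To make the memory contrast concrete, here is a minimal sketch (not from the video; the feature map, names, and dimensions are illustrative assumptions): softmax decoding attends over a KV cache that grows with every step, while linear-attention decoding folds each key/value pair into a fixed-size state and reads it out with the query.

```python
# Sketch only: contrasts softmax-attention decoding (KV cache grows with t)
# with linear-attention decoding (fixed-size state S and normalizer z).
# The feature map phi = ELU + 1 is one common choice, assumed here.
import numpy as np

d = 4  # head dimension (illustrative)

def phi(x):
    # Positive feature map: ELU(x) + 1.
    return np.where(x > 0, x + 1.0, np.exp(x))

def softmax_decode_step(q, K_cache, V_cache):
    # Softmax attention: attends over the full cache -> O(t) time and memory per step.
    scores = K_cache @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V_cache

def linear_decode_step(q, k, v, S, z):
    # Linear attention: fold (k, v) into a d x d state S and a d-dim normalizer z,
    # then read out with the query -> O(d^2) time, constant memory per step.
    fk = phi(k)
    S = S + np.outer(fk, v)
    z = z + fk
    out = (phi(q) @ S) / (phi(q) @ z + 1e-6)
    return out, S, z

# Decode a short sequence with both mechanisms.
rng = np.random.default_rng(0)
S, z = np.zeros((d, d)), np.zeros(d)
K_cache, V_cache = np.empty((0, d)), np.empty((0, d))
for t in range(8):
    q = k = v = rng.standard_normal(d)
    K_cache = np.vstack([K_cache, k])   # cache grows with t
    V_cache = np.vstack([V_cache, v])
    y_soft = softmax_decode_step(q, K_cache, V_cache)
    y_lin, S, z = linear_decode_step(q, k, v, S, z)  # state stays (d x d) + (d,)
```

Per decoded token, the softmax path costs O(t·d) time and memory, while the linear path costs O(d²) time with a state whose size is independent of t.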
00:00 Introduction
00:13 Softmax attention - Review
02:23 Softmax attention - Matrix form
03:29 KV caching
05:29 Linear attention
10:15 Chunkwise parallel training
14:41 Gating in linear attention
17:02 Test-time regression perspective
21:29 Delta update rule
23:51 Efficient training of DeltaNet
29:12 Better optimization for test-time regression
31:13 More expressive regressors