Beyond Softmax: The Future of Attention Mechanisms
Linear attention and its variants have emerged as promising techniques for sequence modeling. Compared to standard softmax attention in Transformers, they decode faster and keep a constant memory footprint regardless of sequence length, which may hold the key to unlocking long-context processing.
In this video, let's explore what comes after softmax attention.
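To make the constant-memory claim concrete, here is a minimal NumPy sketch (not from the video) of linear attention in its unnormalized recurrent form: the state S_t = S_{t-1} + k_t v_t^T has a fixed size, and each decoded token reads out o_t = S_t^T q_t without revisiting earlier keys and values. All names and dimensions below are illustrative assumptions, and the feature map / normalization used by some variants is omitted.

import numpy as np

d_k, d_v = 4, 4                      # head dimensions (assumed for illustration)
S = np.zeros((d_k, d_v))             # recurrent state; its size does not depend on sequence length

def linear_attention_step(q, k, v, S):
    """One decoding step: S_t = S_{t-1} + k_t v_t^T, then o_t = S_t^T q_t."""
    S = S + np.outer(k, v)           # accumulate the key-value outer product into the state
    o = S.T @ q                      # read out with the current query
    return o, S

# Decode a few tokens; memory stays O(d_k * d_v) no matter how long the sequence grows.
rng = np.random.default_rng(0)
for _ in range(8):
    q, k, v = rng.standard_normal(d_k), rng.standard_normal(d_k), rng.standard_normal(d_v)
    o, S = linear_attention_step(q, k, v, S)

By contrast, softmax attention must keep the full KV cache, which grows linearly with the sequence during decoding.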
00:00 Introduction
00:13 Softmax attention - Review
02:23 Softmax attention - Matrix form
03:29 KV caching
05:29 Linear attention
10:15 Chunkwise parallel training
14:41 Gating in linear attention
17:02 Test-time regression perspective
21:29 Delta update rule
23:51 Efficient training of DeltaNet
29:12 Better optimization for test-time regression
31:13 More expressive regressors
References:
[Linear Attention and Beyond] https://www.youtube.com/watch?v=d0HJvGSWw8A
(by Songlin Yang)
[Test-time Regression] https://www.youtube.com/watch?v=C7KnW8VFp4U
(by Alex Wang)
[Beyond Standard LLMs] https://magazine.sebastianraschka.com/p/beyond-standard-llms
(by Sebastian Raschka)
[Linear Attention] https://arxiv.org/abs/2006.16236
[Chunkwise parallel training] https://arxiv.org/abs/2202.10447
[Gated Linear Attention] https://arxiv.org/abs/2312.06635
[Lightning Attention] https://arxiv.org/abs/2405.17381
[Mamba 2] https://arxiv.org/abs/2405.21060
[Test-time Regression] https://arxiv.org/abs/2501.12352
[Fast Weight Programmer] https://proceedings.mlr.press/v139/schlag21a
[Gated Delta Networks] https://arxiv.org/abs/2412.06464
[Qwen-Next] https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d27cd
[RWKV-6] https://arxiv.org/abs/2404.05892
[RWKV-7] https://arxiv.org/abs/2503.14456
[Kimi-Linear] https://arxiv.org/abs/2510.26692
[DeltaProduct] https://arxiv.org/abs/2502.10297
[LongHorn] https://arxiv.org/abs/2407.14207
[Mesa layer] https://arxiv.org/abs/2309.05858
[MesaNet] https://arxiv.org/abs/2506.05233
[Test Time Training]
[Titans] https://arxiv.org/abs/2501.00663