Beyond Softmax: The Future of Attention Mechanisms
Linear attention and its variants have emerged as promising techniques for sequence modeling. Compared to standard softmax attention in Transformers, they decode faster and keep a constant memory footprint regardless of sequence length, which may hold the key to unlocking long-context processing.
In this video, let's explore what comes after softmax attention.
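To make the constant-memory claim concrete, here is a minimal NumPy sketch (not from the video) of linear attention in its unnormalized recurrent form: the state S_t = S_{t-1} + k_t v_t^T has a fixed size, and each decoded token reads out o_t = S_t^T q_t without revisiting earlier keys and values. All names and dimensions below are illustrative assumptions, and the feature map / normalization used by some variants is omitted.

import numpy as np

d_k, d_v = 4, 4                      # head dimensions (assumed for illustration)
S = np.zeros((d_k, d_v))             # recurrent state; its size does not depend on sequence length

def linear_attention_step(q, k, v, S):
    """One decoding step: S_t = S_{t-1} + k_t v_t^T, then o_t = S_t^T q_t."""
    S = S + np.outer(k, v)           # accumulate the key-value outer product into the state
    o = S.T @ q                      # read out with the current query
    return o, S

# Decode a few tokens; memory stays O(d_k * d_v) no matter how long the sequence grows.
rng = np.random.default_rng(0)
for _ in range(8):
    q, k, v = rng.standard_normal(d_k), rng.standard_normal(d_k), rng.standard_normal(d_v)
    o, S = linear_attention_step(q, k, v, S)

By contrast, softmax attention must keep the full KV cache, which grows linearly with the sequence during decoding.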
00:00 Introduction
00:13 Softmax attention - Review
02:23 Softmax attention - Matrix form
03:29 KV caching
05:29 Linear attention
10:15 Chunkwise parallel training
14:41 Gating in linear attention
17:02 Test-time regression perspective
21:29 Delta update rule
23:51 Efficient training of DeltaNet
29:12 Better optimization for test-time regression
31:13 More expressive regressors
References:
[Linear Attention and Beyond] https://www.youtube.com/watch?v=d0HJvGSWw8A
(by Songlin Yang)
[Test-time Regression] https://www.youtube.com/watch?v=C7KnW8VFp4U
(by Alex Wang)
[Beyond Standard LLMs] https://magazine.sebastianraschka.com/p/beyond-standard-llms
(by Sebastian Raschka)
[Linear Attention] https://arxiv.org/abs/2006.16236
[Chunkwise parallel training] https://arxiv.org/abs/2202.10447
[Gated Linear Attention] https://arxiv.org/abs/2312.06635
[Lightning Attention] https://arxiv.org/abs/2405.17381
[Mamba 2] https://arxiv.org/abs/2405.21060
[Test-time Regression] https://arxiv.org/abs/2501.12352
[Fast Weight Programmer] https://proceedings.mlr.press/v139/schlag21a
[Gated Delta Networks] https://arxiv.org/abs/2412.06464
[Qwen-Next] https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d27cd
[RWKV-6] https://arxiv.org/abs/2404.05892
[RWKV-7] https://arxiv.org/abs/2503.14456
[Kimi-Linear] https://arxiv.org/abs/2510.26692
[DeltaProduct] https://arxiv.org/abs/2502.10297
[LongHorn] https://arxiv.org/abs/2407.14207
[Mesa layer] https://arxiv.org/abs/2309.05858
[MesaNet] https://arxiv.org/abs/2506.05233
[Test Time Training]
[Titans] https://arxiv.org/abs/2501.00663