How Attention Residuals Rewire Modern LLMs

Jia-Bin Huang · Beginner · 🧠 Large Language Models · 1w ago
Attention Residuals replaces the standard fixed residual accumulation with depth-wise softmax attention over all preceding layer outputs. This enables each layer to selectively combine earlier representations using learned, input-dependent weights.
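The mechanism described above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the query/key scoring scheme and the name `attention_residual_input` are assumptions made for clarity. The key idea it shows is replacing the fixed sum of residuals with a softmax-weighted, input-dependent combination of all earlier layer outputs.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_residual_input(history, W_q, W_k):
    """Hypothetical sketch of an attention residual.

    history: list of (d,) outputs h_0 .. h_{l-1} from preceding layers.
    Instead of the fixed residual sum, the next layer's input is a
    softmax-weighted mix of all earlier outputs.
    """
    H = np.stack(history)                 # (l, d) stacked layer outputs
    q = W_q @ H[-1]                       # query from the latest output
    K = H @ W_k.T                         # a key per earlier output
    scores = K @ q / np.sqrt(len(q))      # scaled dot-product scores, (l,)
    w = softmax(scores)                   # learned, input-dependent weights
    return w @ H                          # weighted combination, (d,)

# Toy usage with random weights standing in for learned projections.
rng = np.random.default_rng(0)
d = 8
history = [rng.standard_normal(d) for _ in range(4)]
W_q = rng.standard_normal((d, d))
W_k = rng.standard_normal((d, d))
mixed = attention_residual_input(history, W_q, W_k)
```

Because the softmax weights depend on the current representation, different inputs can emphasize different depths, unlike a fixed residual stream that always adds every layer equally.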
Watch on YouTube ↗

Chapters (9)

0:00 Intro to residual connections
3:27 Intuition behind attention residuals
4:43 Full attention residuals
9:43 Block attention residuals
13:07 Parallelism
14:21 Infrastructure design for efficient training
20:03 Infrastructure design for efficient inference
21:02 Related work
22:01 Discussions
Next Up
5 Levels of AI Agents - From Simple LLM Calls to Multi-Agent Systems
Dave Ebbelaar (LLM Eng)