How Attention Residuals Rewire Modern LLMs
Attention Residuals replace the standard fixed residual accumulation with depth-wise softmax attention over all preceding layer outputs, letting each layer selectively combine earlier representations through learned, input-dependent weights.
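To make the idea concrete, here is a minimal PyTorch sketch of a depth-wise attention residual, assuming a standard transformer stack; the module name `DepthwiseAttentionResidual`, the projection sizes, and the choice of querying from the current layer's output are illustrative assumptions, not the exact formulation from the video.

```python
# A minimal sketch, not the video's exact formulation: instead of
# x_{l+1} = x_l + f_l(x_l), each layer attends over the outputs of all
# preceding layers along the depth axis and mixes them with softmax weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthwiseAttentionResidual(nn.Module):
    def __init__(self, d_model: int, d_key: int = 64):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_key)  # query from the current layer's output
        self.k_proj = nn.Linear(d_model, d_key)  # keys from each preceding layer's output
        self.scale = d_key ** -0.5

    def forward(self, current: torch.Tensor, history: list[torch.Tensor]) -> torch.Tensor:
        # current: [batch, seq, d_model]; history: outputs of layers 0..l-1, same shape each
        stacked = torch.stack(history + [current], dim=2)    # [batch, seq, depth, d_model]
        q = self.q_proj(current).unsqueeze(2)                # [batch, seq, 1, d_key]
        k = self.k_proj(stacked)                             # [batch, seq, depth, d_key]
        scores = (q * k).sum(-1) * self.scale                # [batch, seq, depth]
        weights = F.softmax(scores, dim=-1)                  # attention over depth
        return (weights.unsqueeze(-1) * stacked).sum(dim=2)  # [batch, seq, d_model]
```

In a full model, each transformer block would append its output to `history` and pass the mixed representation on as the residual stream for the next block, in place of the plain running sum.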
00:00 Intro to residual connections
03:27 Intuition behind attention residuals
04:43 Full attention residuals
09:43 Block attention residuals
13:07 Parallelism
14:21 Infrastructure design for efficient training
20:03 Infrastructure design for efficient inference
22:01 Discussions
21:02 Related work
Watch on YouTube ↗
DeepCamp AI