Moonshot AI’s AttnRes: Replacing Residual Connections to End Data Dilution (Kimi Attention Residuals)
In the world of Large Language Models, we’ve been building taller and taller skyscrapers, but we’re starting to realize the plumbing is leaky.
Since 2015, we’ve relied on residual connections, the standard way layers pass information to one another. They were a brilliant fix at the time, but as models reach 40, 70, or even 100 billion parameters, that simple addition is starting to fail us. We’re facing a 'dilution' problem: important signal from the early layers gets buried under the accumulated output of every later layer, and the model's internal activations grow without bound.
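To make the mechanism concrete, here is a minimal sketch (not Moonshot's code; a toy NumPy model with an assumed tanh sub-layer) of the standard residual connection the video is critiquing: each layer's output is simply added to its input, so the stream that reaches deep layers is the sum of everything produced so far.

```python
import numpy as np

def sub_layer(x, W):
    # Toy sub-layer standing in for attention/MLP: linear map + nonlinearity.
    return np.tanh(x @ W)

def residual_step(x, W):
    # The standard residual connection: output = input + layer(input).
    # Deep layers receive the raw sum of all earlier updates, which is
    # where the "dilution" concern comes from.
    return x + sub_layer(x, W)

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))
W = rng.normal(size=(4, 4)) * 0.1

h = x
for _ in range(8):          # stack 8 residual steps
    h = residual_step(h, W)
```

After a few steps, `h` is the original `x` plus eight accumulated updates; nothing re-weights the early-layer contribution against the pile of later ones, which is the behavior AttnRes is proposed to replace.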
Today, we’re looking at a massive…
DeepCamp AI