Why Scaling by the Square Root of Dimensions Matters in Attention | Transformers in Deep Learning

Learn With Jay · Beginner · 🧠 Large Language Models · 1y ago
Why do we divide by the square root of the key dimension in Scaled Dot-Product Attention? 🤔 In this video, we dive deep into the intuition and mathematics behind this crucial step. Understand: 🔹 How scaling prevents extreme attention scores. 🔹 The impact of dimensionality on softmax. 🔹 Why this scaling makes models more stable and efficient. If you've ever wondered about this subtle yet vital detail, this video is for you: we go in depth into why this scaling is so important for the model's training stability. 📕 NOTES: https://github.com/Coding-Lane/Transformer-notes/bl…
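The core claim above can be checked numerically. Below is a minimal pure-Python sketch (not from the video; the sample count and `d_k = 512` are illustrative choices) showing that the dot product of two vectors with zero-mean, unit-variance components has variance roughly equal to the dimension `d_k`, and that dividing by √d_k restores unit variance:

```python
import math
import random

random.seed(0)
d_k = 512   # key dimension (illustrative value)
n = 5000    # number of sampled query/key pairs

def dot(q, k):
    return sum(qi * ki for qi, ki in zip(q, k))

# Draw query/key vectors with zero-mean, unit-variance components
# and record their raw dot-product scores.
scores = []
for _ in range(n):
    q = [random.gauss(0, 1) for _ in range(d_k)]
    k = [random.gauss(0, 1) for _ in range(d_k)]
    scores.append(dot(q, k))

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

print(variance(scores))                                # ≈ d_k: variance grows linearly with dimension
print(variance([s / math.sqrt(d_k) for s in scores]))  # ≈ 1: scaling by sqrt(d_k) restores unit variance
```

Each component product qᵢ·kᵢ has variance 1, and the dot product sums d_k independent such terms, so the score variance is about d_k; dividing the scores by √d_k divides the variance by d_k.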
Watch on YouTube ↗

Chapters (7)

0:00 Intro
1:12 Recap of Self-Attention
4:39 Increase in variance
8:05 Why does variance increase?
12:41 Why is high variance a problem in Deep Learning?
15:30 Why divide by square root of dimension
19:07 Outro
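The chapters walk from variance growth to why it hurts training: large-variance scores saturate the softmax. A minimal pure-Python sketch (the eight scores and seed are illustrative, not from the video) of how unscaled scores push softmax toward a near one-hot output, while √d_k scaling keeps it smooth:

```python
import math
import random

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(1)
d_k = 512  # key dimension (illustrative value)

# Eight raw attention scores with std sqrt(d_k) ≈ 22.6,
# the spread unscaled dot products would typically have.
scores = [random.gauss(0, math.sqrt(d_k)) for _ in range(8)]

p_raw = softmax(scores)                                   # typically near one-hot: tiny gradients
p_scaled = softmax([s / math.sqrt(d_k) for s in scores])  # smoother: gradients flow to all positions

print(max(p_raw), max(p_scaled))
```

Dividing the scores by √d_k acts like raising the softmax temperature: the largest probability shrinks and the distribution spreads out, which is exactly the training-stability benefit the video describes.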