Why Scaling by the Square Root of Dimensions Matters in Attention | Transformers in Deep Learning
Why do we divide by the square root of the key dimension in Scaled Dot-Product Attention? 🤔 In this video, we dive deep into the intuition and mathematics behind this crucial step.
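For reference, the operation in question is the standard Scaled Dot-Product Attention:

Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V

where d_k is the dimension of the key (and query) vectors, and the division by √d_k is the scaling step this video examines.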
Understand:
🔹 How scaling prevents extreme attention scores.
🔹 The impact of dimensionality on softmax.
🔹 Why this scaling makes models more stable and efficient (see the numerical sketch below).
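Before watching, here is a minimal NumPy sketch of the variance argument (the dimension d_k = 512 and the softmax helper are illustrative assumptions, not taken from the video):

import numpy as np

rng = np.random.default_rng(0)
d_k = 512            # key/query dimension (illustrative)
n_samples = 10_000

# Dot products of random vectors with zero-mean, unit-variance components.
q = rng.standard_normal((n_samples, d_k))
k = rng.standard_normal((n_samples, d_k))
scores = np.sum(q * k, axis=1)

# Each dot product is a sum of d_k terms, each with mean 0 and variance 1,
# so the variance of the raw scores grows linearly with d_k.
print(np.var(scores))                 # ~ d_k (about 512)
print(np.var(scores / np.sqrt(d_k)))  # ~ 1 after scaling

def softmax(x):
    # Numerically stable softmax: subtract the max before exponentiating.
    e = np.exp(x - x.max())
    return e / e.sum()

# Unscaled scores have a typical magnitude of sqrt(d_k), which saturates
# the softmax into a near-one-hot distribution with vanishing gradients.
logits = rng.standard_normal(8) * np.sqrt(d_k)
print(softmax(logits).round(3))                 # nearly one-hot
print(softmax(logits / np.sqrt(d_k)).round(3))  # much smoother

Dividing by √d_k restores roughly unit variance, which keeps the softmax in a regime where gradients still flow and training stays stable.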
If you've ever wondered about this subtle yet vital detail, this video is for you: we go in depth into why this scaling is so important for stable training.
📝 NOTES: https://github.com/Coding-Lane/Transformer-notes/bl…
Chapters (7)
Intro (1:12)
Recap of Self-Attention (4:39)
Increase in variance (8:05)
Why variance increases (12:41)
Why high variance is a problem in Deep Learning (15:30)
Why divide by square root of dimension (19:07)
Outro
DeepCamp AI