Why Scaling by the Square Root of Dimensions Matters in Attention | Transformers in Deep Learning

Learn With Jay · Beginner · 🧠 Large Language Models · 1y ago
Why do we divide by the square root of the key dimension in Scaled Dot-Product Attention? 🤔 In this video, we dive deep into the intuition and mathematics behind this crucial step. Understand: 🔹 How scaling prevents extreme attention scores. 🔹 The impact of dimensionality on softmax. 🔹 Why this scaling makes models more stable and efficient. If you've ever wondered about this subtle yet vital detail, this video is for you: we go in depth into why this scaling is so important for the model's training stability. 📕 NOTES: https://github.com/Coding-Lane/Transformer-notes/bl…
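The core claim above can be checked numerically. Below is a minimal pure-Python sketch (not from the video; the sample count and `d_k = 512` are illustrative choices) showing that the dot product of two vectors with zero-mean, unit-variance components has variance roughly equal to the dimension `d_k`, and that dividing by √d_k restores unit variance:

```python
import math
import random

random.seed(0)
d_k = 512   # key dimension (illustrative value)
n = 5000    # number of sampled query/key pairs

def dot(q, k):
    return sum(qi * ki for qi, ki in zip(q, k))

# Draw query/key vectors with zero-mean, unit-variance components
# and record their raw dot-product scores.
scores = []
for _ in range(n):
    q = [random.gauss(0, 1) for _ in range(d_k)]
    k = [random.gauss(0, 1) for _ in range(d_k)]
    scores.append(dot(q, k))

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

print(variance(scores))                                # ≈ d_k: variance grows linearly with dimension
print(variance([s / math.sqrt(d_k) for s in scores]))  # ≈ 1: scaling by sqrt(d_k) restores unit variance
```

Each component product qᵢ·kᵢ has variance 1, and the dot product sums d_k independent such terms, so the score variance is about d_k; dividing the scores by √d_k divides the variance by d_k.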
Watch on YouTube ↗

Chapters (7)

0:00 Intro
1:12 Recap of Self-Attention
4:39 Increase in variance
8:05 Why does variance increase?
12:41 Why is high variance a problem in Deep Learning?
15:30 Why divide by square root of dimension
19:07 Outro
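The chapters walk from variance growth to why it hurts training: large-variance scores saturate the softmax. A minimal pure-Python sketch (the eight scores and seed are illustrative, not from the video) of how unscaled scores push softmax toward a near one-hot output, while √d_k scaling keeps it smooth:

```python
import math
import random

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(1)
d_k = 512  # key dimension (illustrative value)

# Eight raw attention scores with std sqrt(d_k) ≈ 22.6,
# the spread unscaled dot products would typically have.
scores = [random.gauss(0, math.sqrt(d_k)) for _ in range(8)]

p_raw = softmax(scores)                                   # typically near one-hot: tiny gradients
p_scaled = softmax([s / math.sqrt(d_k) for s in scores])  # smoother: gradients flow to all positions

print(max(p_raw), max(p_scaled))
```

Dividing the scores by √d_k acts like raising the softmax temperature: the largest probability shrinks and the distribution spreads out, which is exactly the training-stability benefit the video describes.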