SCALED Dot-Product Attention Explained
About this lesson
This video provides a detailed conceptual and mathematical justification for the scaling factor, √d_k, used in Scaled Dot-Product Attention.

The Problem: When high-dimensional vectors (such as 512D or 1000D) are used to calculate the raw attention scores (QK^T), the variance of those scores grows linearly with the vector dimension d_k, assuming the components of the query and key vectors are independent with zero mean and unit variance. Because the Softmax function exponentiates its inputs, this high variance turns it into a "winner-take-all" mechanism that outputs extreme probabilities (some near 1, the rest near 0).

The Instability: These extreme probabilities destabilize neural network training. A saturated Softmax has gradients close to zero, which leads to the vanishing gradient problem: updates are skewed toward the high-probability paths, while the low-probability paths are effectively ignored.

The Solution (√d_k): Dividing the scores by √d_k directly counteracts the linear growth in variance. Dividing a random variable by √d_k divides its variance by d_k, so the variance of the scaled scores stays roughly constant (about 1) regardless of d_k. This stability lets the Softmax function produce balanced probabilities, leading to stable training and robust parameter updates. The resulting mechanism is essential for using the high-dimensional embeddings that modern models need to extract useful information.
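The variance argument above can be checked numerically. The following is a minimal sketch, not code from the video: it assumes queries and keys are drawn as independent standard normal vectors (zero mean, unit variance per component), and the function names `softmax` and `score_stats` are hypothetical helpers chosen for illustration. Under that assumption each product q_i * k_i has variance 1, so a dot product over d_k components has variance d_k, and dividing by the square root of d_k brings it back to about 1.

```python
import numpy as np


def softmax(x):
    # Subtract the row-wise max before exponentiating for numerical stability.
    shifted = x - x.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)


def score_stats(d_k, n_queries=2000, n_keys=16, seed=0):
    """Compare raw and scaled attention scores for random queries and keys.

    Assumption (for illustration only): every component of every query and
    key is an independent draw from a standard normal distribution.
    """
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((n_queries, d_k))
    k = rng.standard_normal((n_keys, d_k))

    raw = q @ k.T                 # raw scores, shape (n_queries, n_keys)
    scaled = raw / np.sqrt(d_k)   # scaled scores

    return {
        "raw_var": raw.var(),
        "scaled_var": scaled.var(),
        # Average of the largest attention weight in each row:
        # values near 1 indicate winner-take-all behaviour.
        "raw_max_prob": softmax(raw).max(axis=-1).mean(),
        "scaled_max_prob": softmax(scaled).max(axis=-1).mean(),
    }


if __name__ == "__main__":
    for d_k in (4, 64, 512):
        s = score_stats(d_k)
        print(
            f"d_k={d_k:4d}  raw var={s['raw_var']:8.2f}  "
            f"scaled var={s['scaled_var']:5.2f}  "
            f"raw max prob={s['raw_max_prob']:.2f}  "
            f"scaled max prob={s['scaled_max_prob']:.2f}"
        )
```

With these assumptions, the raw variance should track d_k while the scaled variance stays near 1, and the largest raw Softmax weight should move toward 1 as d_k grows while the scaled one stays moderate. Trained models do not have exactly Gaussian queries and keys, so this illustrates the scaling argument rather than the behaviour of any particular network.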
DeepCamp AI