Why Batch Normalization Fails in Transformers: The Padding Problem Explained
About this lesson
Ever wondered why Batch Normalization (BatchNorm), the gold standard of Computer Vision, fails when applied to Transformers and NLP tasks? In this video, we take a deep dive into the technical reasons behind this architectural choice. The core of the issue lies in sequential data and the requirement for padding: because sentences in a batch have varying lengths, we must add artificial "padding zeros" to align them.

What you will learn:
• The BatchNorm Flaw: How BatchNorm calculates statistics "vertically" across the batch, so padding zeros skew the mean and variance of every feature.
• Statistical Distortion: Why including these zeros means the statistics no longer represent the real tokens in your data.
• Training Instability: The impact of Training Inference Discrepancy (TID), the mismatch between batch statistics used in training and the statistics used at inference, and why it leads to performance degradation in sequence models.
• The LayerNorm Solution: Why Layer Normalization is better suited to Transformers: it operates "horizontally" across the features of each token, making it independent of padding in other sequences.

If you're interested in mastering Transformer architectures and understanding the nuances of Deep Learning optimization, this video is for you!
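To make the padding effect concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not code from the video: the toy batch shape, the helper names (`batch_stats`, `masked_batch_stats`, `layer_norm`), and the choice of real-token values centered at 5 are all hypothetical, and the learnable scale and shift of real normalization layers are left out.

```python
import numpy as np

# Toy batch (hypothetical shapes): 2 sequences, max length 4, 3 features per token.
# Sequence 0 has 4 real tokens; sequence 1 has 2 real tokens and 2 padding rows.
rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=1.0, size=(2, 4, 3))
mask = np.array([[True, True, True, True],
                 [True, True, False, False]])
batch[~mask] = 0.0  # padding zeros

def batch_stats(x):
    """BatchNorm-style statistics: one mean/variance per feature,
    computed "vertically" over every position of every sequence."""
    flat = x.reshape(-1, x.shape[-1])
    return flat.mean(axis=0), flat.var(axis=0)

def masked_batch_stats(x, mask):
    """Same statistics, but computed over real (non-padding) tokens only."""
    real = x[mask]  # shape: (num_real_tokens, features)
    return real.mean(axis=0), real.var(axis=0)

def layer_norm(x, eps=1e-5):
    """LayerNorm-style normalization: each token is normalized
    "horizontally" over its own features (no scale/shift here)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

naive_mean, naive_var = batch_stats(batch)
true_mean, true_var = masked_batch_stats(batch, mask)

# LayerNorm output for sequence 0 does not change when sequence 1 changes.
ln_original = layer_norm(batch)[0]
changed = batch.copy()
changed[1] = rng.normal(size=(4, 3))  # replace the other sequence entirely
ln_changed = layer_norm(changed)[0]

print("mean with padding:   ", naive_mean)
print("mean without padding:", true_mean)
print("var with padding:    ", naive_var)
print("var without padding: ", true_var)
print("LayerNorm unaffected:", np.allclose(ln_original, ln_changed))
```

In this toy setup, 2 of the 8 positions are padding, so the per-feature mean computed over the whole batch is exactly 6/8 of the mean over real tokens, and the variance is inflated by the gap between the zeros and the real values. The LayerNorm output for the first sequence is identical no matter what the second sequence contains, because each token only looks at its own features.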
DeepCamp AI