Batch Norm At Test Time (C2W3L07)

DeepLearningAI · Beginner ·📐 ML Fundamentals ·8y ago

Skills: ML Pipelines 80%, Supervised Learning 60%

Key Takeaways

At test time, batch normalization uses exponentially weighted averages of the mean and variance collected during training, so the network can process a single example at a time.

Full Transcript

Batch norm processes your data one mini-batch at a time, but at test time you might need to process the examples one at a time. Let's see how you can adapt your network to do that.

Recall that during training, here are the equations you'd use to implement batch norm. Within a single mini-batch, you sum over that mini-batch of the z(i) values to compute the mean. So here you're just summing over the examples in one mini-batch; I'm using m to denote the number of examples in the mini-batch, not in the whole training set. Then you compute the variance, and then you compute z norm by scaling by the mean and standard deviation, with epsilon added for numerical stability. And then z tilde is taking z norm and rescaling by gamma and beta.

So notice that mu and sigma squared, which you need for this scaling calculation, are computed on the entire mini-batch. But at test time you might not have a mini-batch of 64, 128, or 256 examples to process at the same time, so you need some different way of coming up with mu and sigma squared. And if you have just one example, taking the mean and variance of that one example doesn't make sense.

So what's actually done, in order to apply your neural network at test time, is to come up with some separate estimate of mu and sigma squared. In typical implementations of batch norm, what you do is estimate this using an exponentially weighted average, where the average is across the mini-batches.

To be very concrete, here's what I mean. Let's pick some layer l, and let's say you're going through mini-batches X1, X2, and so on, together with the corresponding values of Y. When training on X1 for that layer l, you get some mu for layer l; in fact, I'm going to write this as the mu for the first mini-batch and that layer. Then when you train on the second mini-batch, for that layer and that mini-batch you end up with some second value of mu. And then for the third mini-batch, in this hidden layer, you end up with some third value for mu. So just as we saw how to use an exponentially weighted average to compute the mean of theta 1, theta 2, theta 3 when you were trying to compute an exponentially weighted average of the current temperature, you would do that here to keep track of the latest average value of this mean vector you've seen. That exponentially weighted average becomes your estimate for what the mean of the z's is for that hidden layer. Similarly, you'd use an exponentially weighted average to keep track of the values of sigma squared that you see on the first mini-batch in that layer, the sigma squared that you see on the second mini-batch, and so on. So you keep a running average of the mu and the sigma squared that you're seeing for each layer as you train the neural network across different mini-batches.

Then finally, at test time, what you do is, in place of this equation, you would just compute z norm using whatever value of z you have and using your exponentially weighted average of mu and sigma squared, whatever was the latest value you have, to do the scaling here. And then you would compute z tilde on your one test example using that z norm we just computed and using the beta and gamma parameters that you have learned during your neural network training process.

So the takeaway from this is that during training time, mu and sigma squared are computed on an entire mini-batch of, say, 64 or 128 or some number of examples. But at test time you might need to process a single example at a time. The way to do that is to estimate mu and sigma squared from your training set, and there are many ways to do that. You could in theory run your whole training set through your final network to get mu and sigma squared. But in practice, what people usually do is implement an exponentially weighted average, where you just keep track of the mu and sigma squared values you've seen during training, and use an exponentially weighted average, also sometimes called a running average, to get a rough estimate of mu and sigma squared. Then you use those values of mu and sigma squared at test time to do the scaling you need of the hidden unit values z.

In practice, this process is pretty robust to the exact way you estimate mu and sigma squared, so I wouldn't worry too much about exactly how you do this. If you're using a deep learning framework, it will usually have some default way to estimate mu and sigma squared that should work reasonably well. But in practice, any reasonable way to estimate the mean and variance of your hidden unit values z should work fine at test time.

So that's it for batch norm. Using it, I think you'll be able to train much deeper networks and get your learning algorithm to run much more quickly. Before we wrap up for this video, I want to share with you some thoughts on deep learning frameworks as well. Let's start to talk about that in the next video.
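The procedure described in the transcript can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the lecture's or any framework's implementation: the class name `BatchNormLayer`, the `decay` value of 0.9, the `eps` value, and the initial running statistics (mean 0, variance 1) are all hypothetical choices, and plain Python lists are used only to keep the example dependency-free.

```python
import math


class BatchNormLayer:
    """Batch norm for one hidden layer (illustrative sketch, names are hypothetical)."""

    def __init__(self, num_units, decay=0.9, eps=1e-5):
        self.gamma = [1.0] * num_units  # learned scale (left at its initial value here)
        self.beta = [0.0] * num_units   # learned shift (left at its initial value here)
        self.decay = decay              # assumed weight on the old running value
        self.eps = eps                  # small constant for numerical stability
        self.running_mean = [0.0] * num_units  # assumed initial estimate of mu
        self.running_var = [1.0] * num_units   # assumed initial estimate of sigma squared

    def _scale(self, row, mean, var):
        # z_norm = (z - mu) / sqrt(sigma^2 + eps); z_tilde = gamma * z_norm + beta
        return [
            g * (z - mu) / math.sqrt(v + self.eps) + b
            for z, mu, v, g, b in zip(row, mean, var, self.gamma, self.beta)
        ]

    def forward_train(self, batch):
        """batch: list of m examples, each a list of num_units values of z."""
        m = len(batch)
        columns = list(zip(*batch))  # one tuple of m values per hidden unit
        mean = [sum(col) / m for col in columns]
        var = [sum((z - mu) ** 2 for z in col) / m for col, mu in zip(columns, mean)]
        # Update the exponentially weighted averages that test time will rely on.
        k = self.decay
        self.running_mean = [k * r + (1 - k) * mu for r, mu in zip(self.running_mean, mean)]
        self.running_var = [k * r + (1 - k) * v for r, v in zip(self.running_var, var)]
        # During training, normalize with the statistics of this mini-batch.
        return [self._scale(row, mean, var) for row in batch]

    def forward_test(self, example):
        """example: one list of num_units values of z; uses the running estimates."""
        return self._scale(example, self.running_mean, self.running_var)


# Usage with two made-up mini-batches of two examples each.
layer = BatchNormLayer(num_units=2)
mini_batches = [
    [[1.0, 10.0], [3.0, 14.0]],
    [[2.0, 11.0], [4.0, 13.0]],
]
for _ in range(200):
    for mini_batch in mini_batches:
        layer.forward_train(mini_batch)

# A single test example is normalized without needing a mini-batch.
print(layer.forward_test([2.5, 12.0]))
```

Keeping `forward_train` and `forward_test` as separate methods makes the one difference explicit: both apply the same scaling, and only the source of mu and sigma squared changes.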

Original Description

Take the Deep Learning Specialization: http://bit.ly/2vBGGmD Check out all our courses: https://www.deeplearning.ai Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch Follow us: Twitter: https://twitter.com/deeplearningai_ Facebook: https://www.facebook.com/deeplearningHQ/ Linkedin: https://www.linkedin.com/company/deeplearningai

Batch normalization normalizes the hidden unit values z at each layer of a neural network. At test time there may be no mini-batch to compute statistics from, so the mean and variance of the hidden unit values are estimated with an exponentially weighted average kept during training. This allows us to process single examples at a time. In this video, we learn how to implement batch normalization at test time using an exponentially weighted average, and how to use deep learning frameworks to make this pro

Key Takeaways

During training, compute the mean and variance of the hidden unit values z on each mini-batch
Keep an exponentially weighted (running) average of those means and variances across mini-batches, for each layer
At test time, normalize z with the running mean and variance instead of mini-batch statistics, then rescale with the learned gamma and beta
This lets the network process a single example at a time
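
The steps above can be written out as equations for one layer and one mini-batch of m examples. The symbols follow the lecture, with two additions that are assumptions made here for illustration: the decay rate is written as \rho (to avoid clashing with the batch norm parameter \beta; the lecture does not name a value), and the subscript "run" marks the running estimates.

```latex
% Training: statistics of the current mini-batch t
\mu^{\{t\}} = \frac{1}{m}\sum_{i=1}^{m} z^{(i)}, \qquad
\sigma^{2\,\{t\}} = \frac{1}{m}\sum_{i=1}^{m}\bigl(z^{(i)} - \mu^{\{t\}}\bigr)^{2}

z_{\text{norm}}^{(i)} = \frac{z^{(i)} - \mu^{\{t\}}}{\sqrt{\sigma^{2\,\{t\}} + \varepsilon}}, \qquad
\tilde{z}^{(i)} = \gamma\, z_{\text{norm}}^{(i)} + \beta

% Training: exponentially weighted averages across mini-batches
\mu_{\text{run}} \leftarrow \rho\,\mu_{\text{run}} + (1-\rho)\,\mu^{\{t\}}, \qquad
\sigma^{2}_{\text{run}} \leftarrow \rho\,\sigma^{2}_{\text{run}} + (1-\rho)\,\sigma^{2\,\{t\}}

% Test time: a single example z
z_{\text{norm}} = \frac{z - \mu_{\text{run}}}{\sqrt{\sigma^{2}_{\text{run}} + \varepsilon}}, \qquad
\tilde{z} = \gamma\, z_{\text{norm}} + \beta
```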

💡 Using an exponentially weighted average to estimate the mean and variance of the hidden unit values is a robust and efficient way to implement batch normalization at test time
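
As a rough illustration of why the running average is a convenient estimate, the sketch below compares it with a plain average of the same per-mini-batch values. The numbers are made up, the `decay` of 0.9 is an assumed setting, and the plain average is only a simple stand-in for comparison, not the full pass of the training set through the final network that the lecture mentions as an alternative.

```python
def exponentially_weighted_average(values, decay=0.9):
    """Running average of a sequence, seeded with its first value (assumed choice)."""
    average = values[0]
    for value in values[1:]:
        average = decay * average + (1 - decay) * value
    return average


def plain_average(values):
    """Ordinary mean; needs every value, unlike the running average."""
    return sum(values) / len(values)


# Hypothetical per-mini-batch means of one hidden unit, alternating around 5.0.
batch_means = [5.0 + (0.5 if t % 2 == 0 else -0.5) for t in range(200)]

running_estimate = exponentially_weighted_average(batch_means)
reference = plain_average(batch_means)

# The two estimates land close together even though the running average
# only ever stores a single number per statistic.
print(running_estimate, reference)
```

The running average updates in constant memory as training proceeds, so no extra pass over the data is needed once training finishes.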
