Batch Norm At Test Time (C2W3L07)
Key Takeaways
At test time, batch normalization replaces the per-mini-batch mean and variance with exponentially weighted averages of those statistics collected during training, so the network can process a single example at a time.
Full Transcript
Batch norm processes your data one mini-batch at a time, but at test time you might need to process the examples one at a time. Let's see how you can adapt your network to do that.

Recall that during training, here are the equations you'd use to implement batch norm. Within a single mini-batch, you sum over that mini-batch of the z(i) values to compute the mean. So here you're just summing over the examples in one mini-batch; I'm using m to denote the number of examples in the mini-batch, not in the whole training set. Then you compute the variance, and then you compute z norm by scaling by the mean and standard deviation, with epsilon added for numerical stability. And then z tilde is taking z norm and rescaling by gamma and beta.

So notice that mu and sigma squared, which you need for this scaling calculation, are computed on the entire mini-batch. But at test time you might not have a mini-batch of 64, 128, or 256 examples to process at the same time, so you need some different way of coming up with mu and sigma squared. And if you have just one example, taking the mean and variance of that one example doesn't make sense.

So what's actually done, in order to apply your neural network at test time, is to come up with some separate estimate of mu and sigma squared. In typical implementations of batch norm, what you do is estimate this using an exponentially weighted average, where the average is across the mini-batches.

To be very concrete, here's what I mean. Let's pick some layer l, and let's say you're going through mini-batches X1, X2, together with the corresponding values of Y, and so on. When training on X1 for that layer l, you get some mu for layer l, and in fact I'm going to write this as mu for the first mini-batch and that layer. Then when you train on the second mini-batch, for that layer and that mini-batch you end up with some second value of mu. And then for the third mini-batch in this hidden layer, you end up with some third value for mu.

So just as we saw how to use the exponentially weighted average to compute the mean of theta 1, theta 2, theta 3 when you were trying to compute an exponentially weighted average of the current temperature, you would do that here to keep track of the latest average value of this mean vector you've seen. That exponentially weighted average becomes your estimate for what the mean of the z's is for that hidden layer. Similarly, you'd use an exponentially weighted average to keep track of the values of sigma squared that you see on the first mini-batch in that layer, the sigma squared that you see on the second mini-batch, and so on. So you keep a running average of the mu and the sigma squared that you're seeing for each layer as you train the neural network across different mini-batches.

Then finally, at test time, what you do is, in place of this equation, you would just compute z norm using whatever value of z you have, and using your exponentially weighted average of mu and sigma squared, whatever was the latest value you have, to do the scaling here. And then you would compute z tilde on your one test example using that z norm that we just computed on the left and using the beta and gamma parameters that you have learned during your neural network training process.

So the takeaway from this is that during training time, mu and sigma squared are computed on an entire mini-batch of, say, 64, 128, or some number of examples. But at test time you might need to process a single example at a time. The way to do that is to estimate mu and sigma squared from your training set, and there are many ways to do that. You could in theory run your whole training set through your final network to get mu and sigma squared. But in practice what people usually do is implement an exponentially weighted average, where you just keep track of the mu and sigma squared values you've seen during training and use an exponentially weighted average, also sometimes called a running average, to get a rough estimate of mu and sigma squared. Then you use those values of mu and sigma squared at test time to do the scaling you need of the hidden unit values z.

In practice, this process is pretty robust to the exact way you estimate mu and sigma squared, so I wouldn't worry too much about exactly how you do this. If you're using a deep learning framework, it will usually have some default way to estimate mu and sigma squared that should work reasonably well. But in practice, any reasonable way to estimate the mean and variance of your hidden unit values z should work fine at test time.

So that's it for batch norm. Using it, I think you'll be able to train much deeper networks and get your learning algorithm to run much more quickly. Before we wrap up for this video, I want to share with you some thoughts on deep learning frameworks as well. Let's start to talk about that in the next video.
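The procedure described in the transcript can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the code used in the course: the class name BatchNorm1D, the method names, the decay rate of 0.9, the epsilon of 1e-8, and the synthetic data are all assumptions made for the example. Gamma and beta are left at their initial values here because no gradient step is included.

```python
import numpy as np


class BatchNorm1D:
    """Hypothetical batch norm for one layer; z has shape (m, n_units)."""

    def __init__(self, n_units, decay=0.9, eps=1e-8):
        self.gamma = np.ones(n_units)   # scale parameter (normally learned)
        self.beta = np.zeros(n_units)   # shift parameter (normally learned)
        self.decay = decay              # assumed weight of the running average
        self.eps = eps                  # added for numerical stability
        self.running_mu = np.zeros(n_units)
        self.running_var = np.ones(n_units)

    def forward_train(self, z):
        # Statistics come from the current mini-batch only.
        mu = z.mean(axis=0)
        var = z.var(axis=0)
        # Exponentially weighted averages across mini-batches.
        self.running_mu = self.decay * self.running_mu + (1 - self.decay) * mu
        self.running_var = self.decay * self.running_var + (1 - self.decay) * var
        z_norm = (z - mu) / np.sqrt(var + self.eps)
        return self.gamma * z_norm + self.beta

    def forward_test(self, z):
        # Statistics come from the running averages, so m = 1 works.
        z_norm = (z - self.running_mu) / np.sqrt(self.running_var + self.eps)
        return self.gamma * z_norm + self.beta


rng = np.random.default_rng(0)
bn = BatchNorm1D(n_units=3)

# Simulate training: 500 mini-batches of 64 examples, mean 5, std 2.
for _ in range(500):
    z_batch = rng.normal(loc=5.0, scale=2.0, size=(64, 3))
    bn.forward_train(z_batch)

# A single test example is normalized with the running estimates.
z_single = np.array([[5.0, 5.0, 5.0]])
print("running mean:", bn.running_mu)
print("running variance:", bn.running_var)
print("z tilde for one example:", bn.forward_test(z_single))
```

With the synthetic data above, the running mean and variance should settle near 5 and 4, so a test example sitting at the mean maps to a value close to beta. Frameworks differ in the details (for instance the default decay and whether the variance estimate is bias-corrected), which matches the point made in the lecture that the exact estimation method matters little.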
Original Description
Take the Deep Learning Specialization: http://bit.ly/2vBGGmD
Check out all our courses: https://www.deeplearning.ai
Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch
Follow us:
Twitter: https://twitter.com/deeplearningai_
Facebook: https://www.facebook.com/deeplearningHQ/
Linkedin: https://www.linkedin.com/company/deeplearningai