Mini Batch Gradient Descent (C2W2L01)
Key Takeaways
This video introduces mini-batch gradient descent, which on large training sets runs much faster than batch gradient descent, where the entire training set must be processed before every update. The training set is split into smaller mini-batches, and each gradient descent step uses a vectorized implementation to process all the examples in one mini-batch at once.
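As a rough illustration of the splitting step, here is a minimal NumPy sketch. The function name `make_mini_batches` and the toy sizes (5,000 examples instead of 5 million, 3 input features) are assumptions made for this example; they are not from the video. The column-wise layout, with X of shape (n_x, m) and Y of shape (1, m), follows the convention used in the lecture.

```python
import numpy as np

def make_mini_batches(X, Y, batch_size):
    """Split X (n_x, m) and Y (1, m) column-wise into a list of (X_t, Y_t) pairs.

    Hypothetical helper for illustration; examples are taken in order,
    so mini-batch t holds columns (t-1)*batch_size .. t*batch_size - 1.
    """
    m = X.shape[1]
    batches = []
    for start in range(0, m, batch_size):
        end = start + batch_size
        batches.append((X[:, start:end], Y[:, start:end]))
    return batches

# Toy sizes (assumed): 5,000 examples with 3 features, mini-batches of 1,000.
n_x, m, batch_size = 3, 5000, 1000
X = np.arange(n_x * m, dtype=float).reshape(n_x, m)
Y = np.zeros((1, m))

batches = make_mini_batches(X, Y, batch_size)
X_1, Y_1 = batches[0]
print(len(batches), X_1.shape, Y_1.shape)  # 5 (3, 1000) (1, 1000)
```

With the lecture's numbers (m = 5,000,000 and a mini-batch size of 1,000) the same slicing would give 5,000 mini-batches, each of shape (n_x, 1000) for X and (1, 1000) for Y.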
Full Transcript
Hello, and welcome back. In this week, you'll learn about optimization algorithms that will enable you to train your neural networks much faster. You've heard me say before that applying machine learning is a highly empirical, highly iterative process, in which you just have to train a lot of models to find one that works really well. So it really helps to be able to train models quickly. One thing that makes this more difficult is that deep learning tends to work best in the regime of big data, when you're able to train your neural network on a huge dataset, and training on large datasets is just slow. So what you find is that having fast optimization algorithms, having good optimization algorithms, can really speed up the efficiency of you and your team. So let's get started by talking about mini-batch gradient descent.

You've learned previously that vectorization allows you to efficiently compute on all m examples, which lets you process your whole training set without an explicit for loop. That's why we would take our training examples and stack them into this huge matrix, capital X: x^(1), x^(2), x^(3), and eventually up to x^(m), if you have m training examples. Similarly for Y: this is y^(1), y^(2), y^(3), and so on up to y^(m). So the dimension of X was n_x by m, and Y was 1 by m. Vectorization allows you to process all m examples relatively quickly, but if m is very large, it can still be slow. For example, what if m was 5 million, or 50 million, or even bigger? With the implementation of gradient descent on your whole training set, you have to process your entire training set before you take one little step of gradient descent. And then you have to process your entire training set of 5 million training examples again before you take another little step of gradient descent. It turns out that you can get a faster algorithm if you let gradient descent start to make some progress even before you finish processing your entire giant training set of 5 million examples.

In particular, here's what you can do. Let's say that you split up your training set into smaller, little baby training sets, and these baby training sets are called mini-batches. Let's say each of your baby training sets has just 1,000 examples. So you take x^(1) through x^(1000) and call that your first little baby training set, also called a mini-batch. Then you take the next 1,000 examples, x^(1001) through x^(2000), and call that the next one, and so on. I'm going to introduce a new notation: I'm going to call this X superscript with curly braces 1, X^{1}, and I'm going to call this X superscript with curly braces 2, X^{2}. Now, if you have 5 million training examples total and each of these little mini-batches has 1,000 examples, that means you have 5,000 of these, because 5,000 times 1,000 equals 5 million. So altogether you would have 5,000 of these mini-batches, and it ends with X^{5000}. Similarly, you do the same thing for Y: you split up your training data for Y accordingly. So y^(1) through y^(1000) is called Y^{1}, then y^(1001) through y^(2000) is called Y^{2}, and so on, until you have Y^{5000}. So mini-batch number t is going to be comprised of X^{t} and Y^{t}, and that is 1,000 training examples with the corresponding input-output pairs.

Before moving on, just to make sure the notation is clear: we have previously used superscript round brackets (i) to index into the training set, so x^(i) is the i-th training example. We use superscript square brackets [l] to index into the different layers of a neural network, so z^[l] comes from the z values for the l-th layer of the neural network. And here we're introducing curly brackets {t} to index into different mini-batches, so you have X^{t}, Y^{t}. To check your understanding of these, what are the dimensions of X^{t} and Y^{t}? Well, X is n_x by m. So if X^{1} holds the x values for 1,000 training examples, then its dimension should be n_x by 1,000, and X^{2} should also be n_x by 1,000, and so on. So all of the X^{t} should have dimension n_x by 1,000, and all of the Y^{t} should have dimension 1 by 1,000.

To explain the name of this algorithm: batch gradient descent refers to the gradient descent algorithm we've been talking about previously, where you process your entire training set all at the same time. The name comes from viewing that as processing your entire batch of training examples all at the same time. I know it's not such a great name, but that's just what it's called. Mini-batch gradient descent, in contrast, refers to the algorithm which we'll talk about on the next slide, in which you process a single mini-batch X^{t}, Y^{t} at a time, rather than processing your entire training set X, Y at the same time.

So let's see how mini-batch gradient descent works. To run mini-batch gradient descent on your training set, you would run for t = 1 to 5,000, because we had 5,000 mini-batches of size 1,000 each. What you're going to do inside the for loop is basically implement one step of gradient descent using X^{t}, Y^{t}. It's as if you had a training set of size 1,000 examples, and you were to implement the algorithm you're already familiar with, but just on this little training set of size m = 1,000. Rather than having an explicit for loop over all 1,000 examples, you would use vectorization to process all 1,000 examples sort of all at the same time.

Let's write this out. First, you implement forward prop on the inputs, so just on X^{t}. You do that by implementing Z^[1] = W^[1] X^{t} + b^[1]. Previously we would just have X there, but now you aren't processing the entire training set, you're just processing one mini-batch, so this becomes X^{t} when you're processing mini-batch t. Then you would have A^[1] = g^[1](Z^[1]), with a capital Z since this is actually a vectorized implementation, and so on until you end up with A^[L] = g^[L](Z^[L]), and this is your prediction. You'll notice that here you should use a vectorized implementation; it's just that this vectorized implementation processes 1,000 examples at a time rather than 5 million examples.

Next, you compute the cost function J, which I'm going to write as 1 over 1,000, since 1,000 is the size of your little training set, times the sum from i = 1 through 1,000 of the loss L(ŷ^(i), y^(i)), where this notation, for clarity, refers to examples from the mini-batch X^{t}, Y^{t}. And if you're using regularization, you can also have this regularization term: lambda over 2 times 1,000, times the sum over l of the squared Frobenius norm of the weight matrix W^[l]. Because this is really the cost on just one mini-batch, I'm going to index this cost J with a superscript t in curly braces, J^{t}. So you'll notice that everything we're doing is exactly the same as when we were previously implementing gradient descent, except that instead of doing it on X, Y, you're now doing it on X^{t}, Y^{t}.

Next, you implement back prop to compute the gradients with respect to J^{t}, so you're still using only X^{t}, Y^{t}. Then you update the weights: every W^[l] gets updated as W^[l] minus alpha times dW^[l], and similarly for b. So this is one pass through your training set using mini-batch gradient descent. The code I've written down here is also called doing one epoch of training, and epoch is a word that just means a single pass through the training set. So whereas with batch gradient descent a single pass through the training set allows you to take only one gradient descent step, with mini-batch gradient descent a single pass through the training set, that is, one epoch, allows you to take 5,000 gradient descent steps. Now, of course, you usually want to take multiple passes through the training set, so you might want another for loop or another while loop out there, so that you keep taking passes through the training set until, hopefully, you converge or approximately converge.

When you have a large training set, mini-batch gradient descent runs much faster than batch gradient descent, and it's pretty much what everyone in deep learning will use when training on a large dataset. In the next video, let's delve deeper into mini-batch gradient descent, so you can get a better understanding of what it's doing and why it works so well.
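The loop described in the transcript (forward prop, cost J^{t}, back prop, and parameter update for each mini-batch, repeated over several epochs) might be sketched as follows. This is a simplified illustration under stated assumptions, not the code shown in the video: a single sigmoid unit (logistic regression) stands in for the L-layer network, and the names `sigmoid` and `run_epoch`, the synthetic data, and the values of `alpha` and `lambd` are all chosen for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def run_epoch(X, Y, W, b, batch_size=1000, alpha=0.1, lambd=0.0):
    """One epoch: one gradient descent step per mini-batch.

    Assumption: a one-layer sigmoid model replaces the deep network.
    Returns the updated W, b and the list of per-mini-batch costs J^{t}.
    """
    m = X.shape[1]
    costs = []
    for start in range(0, m, batch_size):
        X_t = X[:, start:start + batch_size]   # shape (n_x, m_t)
        Y_t = Y[:, start:start + batch_size]   # shape (1, m_t)
        m_t = X_t.shape[1]

        # Forward prop, vectorized over the m_t examples of this mini-batch.
        Z = W @ X_t + b
        A = sigmoid(Z)

        # Cost on this mini-batch: mean cross-entropy loss plus the
        # L2 term lambd / (2 * m_t) * ||W||_F^2.
        eps = 1e-12  # guards against log(0)
        loss = -(Y_t * np.log(A + eps) + (1 - Y_t) * np.log(1 - A + eps))
        J_t = loss.sum() / m_t + lambd / (2 * m_t) * np.sum(W ** 2)
        costs.append(J_t)

        # Back prop: gradients of J^{t} with respect to W and b.
        dZ = A - Y_t
        dW = dZ @ X_t.T / m_t + lambd / m_t * W
        db = dZ.sum() / m_t

        # Gradient descent update.
        W = W - alpha * dW
        b = b - alpha * db
    return W, b, costs

# Synthetic toy data (assumed): 5,000 examples, 2 features, separable labels.
rng = np.random.default_rng(0)
n_x, m = 2, 5000
X = rng.standard_normal((n_x, m))
Y = (X[0:1, :] + X[1:2, :] > 0).astype(float)
W, b = np.zeros((1, n_x)), 0.0

# Outer loop over epochs; each epoch takes 5 steps here (5,000 / 1,000).
for epoch in range(3):
    W, b, costs = run_epoch(X, Y, W, b)
    print(epoch, len(costs), round(float(costs[-1]), 4))
```

Batch gradient descent corresponds to calling the same function with `batch_size=m`, which yields exactly one update per epoch.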
Original Description
Take the Deep Learning Specialization: http://bit.ly/2x6x2J9
Check out all our courses: https://www.deeplearning.ai
Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch