Mini Batch Gradient Descent (C2W2L01)

DeepLearningAI · Beginner ·📐 ML Fundamentals ·8y ago

Key Takeaways

This video covers mini-batch gradient descent, which is faster on large training sets than processing the entire training set for every gradient step. It splits the training set into smaller mini-batches and uses a vectorized implementation to process all examples in a mini-batch at once.
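As a minimal sketch of the splitting step, assuming NumPy and the video's convention that X has shape (n_x, m) and Y has shape (1, m): the helper name `split_into_mini_batches` and the small synthetic data are illustrative choices, not something shown in the video. No shuffling is done here because the video does not mention it.

```python
import numpy as np

def split_into_mini_batches(X, Y, batch_size=1000):
    """Split X (n_x, m) and Y (1, m) column-wise into a list of (X_t, Y_t) pairs."""
    m = X.shape[1]
    mini_batches = []
    for start in range(0, m, batch_size):
        end = start + batch_size  # slicing clips at m, so a final batch may be smaller
        mini_batches.append((X[:, start:end], Y[:, start:end]))
    return mini_batches

# Small stand-in for the 5,000,000-example set in the video: 5,000 examples, 3 features.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5000))
Y = (rng.random((1, 5000)) > 0.5).astype(float)

batches = split_into_mini_batches(X, Y, batch_size=1000)
print(len(batches))         # 5
print(batches[0][0].shape)  # (3, 1000), i.e. n_x by 1,000
print(batches[0][1].shape)  # (1, 1000)
```

Each pair in `batches` plays the role of one mini-batch X^{t}, Y^{t} from the lecture.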

Full Transcript

Hello, and welcome back. This week you'll learn about optimization algorithms that will enable you to train your neural networks much faster. You've heard me say before that applying machine learning is a highly empirical process, a highly iterative process, in which you just have to train a lot of models to find one that works really well. So it really helps to be able to train models quickly. One thing that makes it more difficult is that deep learning tends to work best in the regime of big data, when you're able to train your neural network on a huge data set, and training on large data sets is just slow. So what you find is that having fast optimization algorithms, having good optimization algorithms, can really speed up the efficiency of you and your team. So let's get started by talking about mini-batch gradient descent.

You've learned previously that vectorization allows you to efficiently compute on all m examples; it allows you to process your whole training set without an explicit for loop. That's why we would take our training examples and stack them into this huge matrix capital X, so x^(1), x^(2), x^(3), and eventually it goes up to x^(m) if you have m training examples. Similarly for Y: this is y^(1), y^(2), y^(3), and so on up to y^(m). So the dimension of X was n_x by m, and Y is 1 by m. Vectorization allows you to process all m examples relatively quickly, but if m is very large, it can still be slow. For example, what if m was 5 million, or 50 million, or even bigger? With the implementation of gradient descent on your whole training set, you have to process your entire training set before you take one little step of gradient descent, and then you have to process your entire training set of five million training examples again before you take another little step of gradient descent. So it turns out that you can get a faster algorithm if you let gradient descent start to make some progress even before you finish processing your entire
giant training set of five million examples. In particular, here's what you can do. Let's say that you split up your training set into smaller, little baby training sets, and these baby training sets are called mini-batches. Let's say each of your baby training sets has just 1,000 examples. So you take x^(1) through x^(1000) and you call that your first little baby training set, also called a mini-batch. Then you take the next 1,000 examples, x^(1001) through x^(2000), and call that the next one, and so on. I'm going to introduce a new notation: I'm going to call this X superscript with curly braces 1, X^{1}, and I'm going to call this X superscript with curly braces 2, X^{2}. Now, if you have five million training examples total and each of these little mini-batches has a thousand examples, that means you have 5,000 of these, because 5,000 times 1,000 equals 5 million. So altogether you would have 5,000 of these mini-batches, and it ends with X^{5000}. Similarly, you do the same thing for Y: you'd also split up your training data for Y accordingly. So you call this Y^{1}, then y^(1001) through y^(2000) is called Y^{2}, and so on until you have Y^{5000}. So mini-batch number t is going to be comprised of X^{t} and Y^{t}, and that is a thousand training examples with the corresponding input-output pairs.

Before moving on, just to make sure the notation is clear: we have previously used superscript round brackets (i) to index into the training set, so x^(i) is the i-th training example. We use superscript square brackets [l] to index into the different layers of a neural network, so Z^[l] comes from the Z values for the l-th layer of the neural network. And here we're introducing the curly brackets {t} to index into different mini-batches, so you have X^{t}, Y^{t}. To check your understanding of these, what is the dimension of X^{t} and Y^{t}? Well, X is n_x by m, so if X^{1} is the x values for a thousand training examples, then this
dimension should be n_x by 1,000, and X^{2} should also be n_x by 1,000, and so on. So all of the X^{t} should have dimension n_x by 1,000, and all of the Y^{t} should have dimension 1 by 1,000.

To explain the name of this algorithm: batch gradient descent refers to the gradient descent algorithm we've been talking about previously, where you process your entire training set all at the same time. The name comes from viewing that as processing your entire batch of training examples all at the same time. I know it's not such a great name, but that's just what it's called. Mini-batch gradient descent, in contrast, refers to the algorithm which we'll talk about on the next slide, in which you process a single mini-batch X^{t}, Y^{t} at a time, rather than processing your entire training set X, Y at the same time.

So let's see how mini-batch gradient descent works. To run mini-batch gradient descent on your training set, you would run for t equals 1 to 5,000, because we had 5,000 mini-batches of size 1,000 each. What you're going to do inside the for loop is basically implement one step of gradient descent using X^{t}, Y^{t}. It's as if you had a training set of size 1,000 examples, and as if you were to implement the algorithm you're already familiar with, but just on this little training set of size m equals 1,000. Rather than having an explicit for loop over all 1,000 examples, you would use vectorization to process all 1,000 examples sort of all at the same time.

So let's write this out. First, you implement forward prop on the inputs, so just on X^{t}, and you do that by implementing Z^[1] = W^[1] X^{t} + b^[1]. Previously we would just have X there, but now you're not processing the entire training set, you're just processing the first mini-batch, so this becomes X^{t} when you're processing mini-batch t. Then you would have A^[1] = g^[1](Z^[1]), with a capital Z since this is actually a vectorized implementation, and so on until you end up with A^[L] = g^[L](Z^[L]), and then this is your
prediction. You notice that here you should use a vectorized implementation; it's just that this vectorized implementation processes 1,000 examples at a time rather than 5 million examples. Next, you compute the cost function J, which I'm going to write as 1 over 1,000, since here 1,000 is the size of your little training set, times the sum from i equals 1 through 1,000 of the loss of y_hat^(i), y^(i), where this notation, for clarity, refers to examples from the mini-batch X^{t}, Y^{t}. If you're using regularization, you can also have this regularization term: lambda over 2 times 1,000, times the sum over l of the squared Frobenius norm of the weight matrix W^[l]. Because this is really the cost on just one mini-batch, I'm going to index this cost J with a superscript t in curly braces, J^{t}. So you notice that everything we're doing is exactly the same as when we were previously implementing gradient descent, except that instead of doing it on X, Y, you're now doing it on X^{t}, Y^{t}.

Next, you implement back prop to compute gradients with respect to J^{t}, so you're still using only X^{t}, Y^{t}. Then you update the weights: every W^[l] gets updated as W^[l] minus alpha dW^[l], and similarly for b. So this is one pass through your training set using mini-batch gradient descent. The code I've written down here is also called doing one epoch of training, and epoch is a word that just means a single pass through the training set. So whereas with batch gradient descent a single pass through the training set allows you to take only one gradient descent step, with mini-batch gradient descent a single pass through the training set, that is, one epoch, allows you to take 5,000 gradient descent steps. Now, of course, you usually want to take multiple passes through the training set, so you might want another for loop or another while loop out there. So you keep taking passes through the training set until hopefully you converge, or approximately converge. When you have a
large training set, mini-batch gradient descent runs much faster than batch gradient descent, and it's pretty much what everyone in deep learning will use when training on a large data set. In the next video, let's delve deeper into mini-batch gradient descent so you can get a better understanding of what it is doing and why it works so well.
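The loop described in the transcript (an outer loop over epochs, an inner loop over mini-batches, and inside it forward prop, cost, back prop, and a weight update) can be sketched roughly as follows. To keep it short, this sketch assumes a one-layer model (logistic regression with a sigmoid output) instead of the L-layer network in the lecture, and it omits the regularization term. The function name `train_mini_batch_gd`, the hyperparameter values, and the synthetic data are all illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_mini_batch_gd(X, Y, batch_size=1000, alpha=0.1, num_epochs=5):
    """Mini-batch gradient descent for a one-layer (logistic regression) model.

    X has shape (n_x, m) and Y has shape (1, m), following the lecture's convention.
    Returns the learned W, b and the list of per-mini-batch costs J^{t}.
    """
    n_x, m = X.shape
    W = np.zeros((1, n_x))
    b = 0.0
    costs = []
    eps = 1e-12  # keeps log() away from zero
    for epoch in range(num_epochs):            # outer loop: multiple passes (epochs)
        for start in range(0, m, batch_size):  # inner loop: one gradient step per mini-batch
            X_t = X[:, start:start + batch_size]
            Y_t = Y[:, start:start + batch_size]
            m_t = X_t.shape[1]

            # Forward prop, vectorized over the examples in this mini-batch only
            A = sigmoid(W @ X_t + b)

            # Cost J^{t}: average cross-entropy loss over the mini-batch
            cost = -np.sum(Y_t * np.log(A + eps) + (1 - Y_t) * np.log(1 - A + eps)) / m_t
            costs.append(cost)

            # Back prop: gradients of J^{t} with respect to W and b
            dZ = A - Y_t
            dW = dZ @ X_t.T / m_t
            db = np.sum(dZ) / m_t

            # Gradient descent update
            W = W - alpha * dW
            b = b - alpha * db
    return W, b, costs

# Synthetic, linearly separable data: 5,000 examples with 2 features.
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 5000))
Y = (X[0:1, :] + X[1:2, :] > 0).astype(float)

W, b, costs = train_mini_batch_gd(X, Y, batch_size=1000, alpha=0.5, num_epochs=3)
print(len(costs))  # 15 gradient steps: 5 mini-batches per epoch times 3 epochs
print(costs[0] > costs[-1])
```

With batch gradient descent, the same three passes over the data would yield only three parameter updates; here they yield fifteen.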

Original Description

Take the Deep Learning Specialization: http://bit.ly/2x6x2J9
Check out all our courses: https://www.deeplearning.ai
Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch
Follow us:
Twitter: https://twitter.com/deeplearningai_
Facebook: https://www.facebook.com/deeplearningHQ/
Linkedin: https://www.linkedin.com/company/deeplearningai

This video teaches mini-batch gradient descent, a faster alternative to batch gradient descent on large training sets, and how to implement each step (vectorized forward prop, cost computation, back prop, and the weight update) on one mini-batch at a time. It is what practitioners typically use for deep learning on large datasets.
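For reference, the per-mini-batch cost and update described in the video can be written out as below, for a mini-batch of 1,000 examples. The regularization coefficient is written as \lambda and the factor of 1/2 follows the usual L2 form; both are assumptions here, since the transcript garbles that part of the formula.

```latex
J^{\{t\}} = \frac{1}{1000} \sum_{i=1}^{1000} \mathcal{L}\big(\hat{y}^{(i)}, y^{(i)}\big)
          + \frac{\lambda}{2 \cdot 1000} \sum_{l} \big\lVert W^{[l]} \big\rVert_F^2

W^{[l]} := W^{[l]} - \alpha \, dW^{[l]}, \qquad b^{[l]} := b^{[l]} - \alpha \, db^{[l]}
```

Here the sum over i runs only over the examples in mini-batch X^{t}, Y^{t}, and the gradients dW^{[l]}, db^{[l]} are taken with respect to J^{t}.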

Key Takeaways
  1. Split the training set into mini-batches (1,000 examples each in the video's example)
  2. Index mini-batches with a curly-brace superscript: X^{t}, Y^{t}
  3. Implement vectorized forward prop on the mini-batch inputs X^{t}
  4. Compute the cost J^{t} as 1/1000 * sum from i=1 to 1000 of loss(y_hat^(i), y^(i)) over the examples in mini-batch X^{t}, Y^{t}
  5. Implement back prop to compute gradients with respect to J^{t}
  6. Update the weights: every W^[l] is updated as W^[l] - alpha * dW^[l], and similarly for b^[l]
  7. Take multiple passes (epochs) through the training set using an outer for loop or while loop until you approximately converge
💡 Mini-Batch Gradient Descent allows for faster convergence and is commonly used in deep learning for large datasets.
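A small arithmetic check of why more progress is made per pass: the number of gradient steps in one epoch is the number of mini-batches. The helper name `gradient_steps_per_epoch` is hypothetical, and the numbers are the ones used in the video.

```python
import math

def gradient_steps_per_epoch(m, batch_size):
    """Number of parameter updates in one pass over m training examples."""
    return math.ceil(m / batch_size)

m = 5_000_000
print(gradient_steps_per_epoch(m, batch_size=m))     # 1    (batch gradient descent)
print(gradient_steps_per_epoch(m, batch_size=1000))  # 5000 (mini-batch gradient descent)
```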
