Understanding Mini-Batch Gradient Dexcent (C2W2L02)

DeepLearningAI · Beginner ·📐 ML Fundamentals ·8y ago

Key Takeaways

The video explains mini-batch gradient descent: why its cost curve is noisier than that of batch gradient descent, how it compares with batch gradient descent and stochastic gradient descent, and what it gains from vectorization. It provides guidance on choosing a mini-batch size and notes that a smaller or slowly decreasing learning rate reduces the oscillation around the minimum.

Full Transcript

In the previous video, you saw how you can use mini-batch gradient descent to start making progress, to start taking gradient descent steps, even when you're just partway through processing your training set, even for the first time. In this video, you'll learn more details of how to implement gradient descent and gain a better understanding of what it's doing and why it works.

With batch gradient descent, on every iteration you go through the entire training set, and you would expect the cost to go down on every single iteration. So if we plot the cost function J as a function of different iterations, it should decrease on every single iteration. And if it ever goes up, even on one iteration, then something's wrong; maybe your learning rate is too big.

On mini-batch gradient descent, though, if you plot progress on your cost function, then it may not decrease on every iteration. In particular, on every iteration you're processing some X{t}, Y{t}, and so if you plot the cost function J{t}, which is computed using just X{t}, Y{t}, then it's as if on every iteration you're training on a different training set, or really training on a different mini-batch. So if you plot the cost function J, you're more likely to see something that looks like this: it should trend downwards, but it's also going to be a little bit noisier. So if you plot J{t} as you're training with mini-batch gradient descent, maybe over multiple epochs, you might expect to see a curve like this. So it's okay if it doesn't go down on every iteration, but it should trend downwards.

And the reason it'll be a little bit noisy is that maybe X{1}, Y{1} is just a relatively easy mini-batch, so your cost might be a bit lower, but then maybe just by chance X{2}, Y{2} is just a harder mini-batch, maybe even with some mislabeled examples in it, in which case the cost would be a bit higher, and so on. So that's why you get these oscillations as you plot the cost when you're running mini-batch gradient descent.

Now, one of the parameters you need to choose is the size of your mini-batch.
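The noisy but downward-trending cost curve described above can be reproduced with a small simulation. This is a minimal sketch, not code from the video: the synthetic data (y = 3x plus noise), the one-parameter linear model, the learning rate of 0.5, and the helper names make_data, cost_and_grad, and minibatch_costs are all assumptions chosen for illustration.

```python
import random


def make_data(m, seed=0):
    """Synthetic 1-D regression data, y = 3x + noise (illustrative assumption)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(m)]
    ys = [3.0 * x + rng.gauss(0.0, 0.5) for x in xs]
    return xs, ys


def cost_and_grad(w, xs, ys):
    """Mean squared error cost (with a 1/2 factor) and its gradient w.r.t. w."""
    n = len(xs)
    errs = [w * x - y for x, y in zip(xs, ys)]
    cost = sum(e * e for e in errs) / (2 * n)
    grad = sum(e * x for e, x in zip(errs, xs)) / n
    return cost, grad


def minibatch_costs(xs, ys, batch_size, lr=0.5, epochs=3):
    """Run mini-batch gradient descent and record the cost J{t} at every step."""
    w, history = 0.0, []
    for _ in range(epochs):
        for start in range(0, len(xs), batch_size):
            xb = xs[start:start + batch_size]
            yb = ys[start:start + batch_size]
            cost, grad = cost_and_grad(w, xb, yb)  # cost on this mini-batch only
            history.append(cost)
            w -= lr * grad  # one gradient descent step per mini-batch
    return w, history


xs, ys = make_data(m=2048)
w, history = minibatch_costs(xs, ys, batch_size=64)

# Count the steps on which the recorded cost went up instead of down.
increases = sum(1 for a, b in zip(history, history[1:]) if b > a)
print(f"steps={len(history)}, steps where J went up={increases}")
print(f"mean of first 10 costs={sum(history[:10]) / 10:.3f}")
print(f"mean of last 10 costs={sum(history[-10:]) / 10:.3f}")
```

Each recorded cost is computed on a different mini-batch, so individual steps can go up even though the average over the last few steps is lower than over the first few.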
So m was the training set size. On one extreme, if the mini-batch size is equal to m, then you just end up with batch gradient descent. So in this extreme you would just have one mini-batch, X{1}, Y{1}, and this mini-batch is equal to your entire training set. So setting the mini-batch size to m just gives you batch gradient descent.

The other extreme would be if your mini-batch size were equal to 1. This gives you an algorithm called stochastic gradient descent, and here every example is its own mini-batch. So what you do in this case is you look at the first mini-batch, X{1}, Y{1}, but when your mini-batch size is 1, this just has your first training example, and you take a gradient descent step with your first training example. And then you next take a look at your second mini-batch, which is just your second training example, and take your gradient descent step with that, and then you do it with the third training example, and so on, looking at just one single training example at a time.

So let's look at what these two extremes will do on optimizing this cost function. If these are the contours of the cost function you're trying to minimize, so your minimum is there, then batch gradient descent might start somewhere and be able to take relatively low-noise, relatively large steps, and just keep marching to the minimum. In contrast, with stochastic gradient descent, if you start somewhere, let's pick a different starting point, then on every iteration you're taking a gradient descent step with just a single training example. So most of the time you head toward the global minimum, but sometimes you head in the wrong direction, if that one example happens to point you in a bad direction. So stochastic gradient descent can be extremely noisy, and on average it'll take you in a good direction, but sometimes it'll head in the wrong direction as well. And stochastic gradient descent won't ever converge; it'll always just kind of oscillate and wander around the region of the minimum, but it won't ever just head to the minimum and stay there.

In practice, the mini-batch size you use will be somewhere in between 1 and m, and 1 and m are respectively too small and too large, and here's why. If you use batch gradient descent, so this is your mini-batch size equals m, then you're processing a huge training set on every iteration. So the main disadvantage of this is that it takes too much time, too long per iteration, assuming you have a very large training set. If you have a small training set, then batch gradient descent is fine. If you go to the opposite, if you use stochastic gradient descent, then it's nice that you get to make progress after processing just one example; that's actually not a problem. And the noisiness can be ameliorated, or can be reduced, by just using a smaller learning rate. But a huge disadvantage to stochastic gradient descent is that you lose almost all your speedup from vectorization, because here you're processing a single training example at a time. The way you process each example is going to be very inefficient.

So what works best in practice is something in between, where you have some mini-batch size that's not too big or too small, and this gives you, in practice, the fastest learning. And you notice that this has two good things going for it. One is that you do get a lot of vectorization. So in the example we used in the previous video, if your mini-batch size was 1,000 examples, then you might be able to vectorize across 1,000 examples, which is going to be much faster than processing the examples one at a time. And second, you can also make progress without needing to wait till you process the entire training set. So again using the numbers we had in the previous video, each epoch, each pass through your training set, allows you to take 5,000 gradient descent steps.

So in practice there'll be some in-between mini-batch size that works best. And so with mini-batch gradient descent, you start here, maybe one iteration does this, two iterations, three, four. And it's not guaranteed to always head toward the minimum, but it tends to head more consistently in the direction of the minimum than stochastic gradient descent. And then it doesn't always exactly converge; it may oscillate in a very small region. If that's an issue, you can always reduce the learning rate slowly. We'll talk more about learning rate decay, or how to reduce the learning rate, in a later video.

So if the mini-batch size should not be m and should not be 1, but should be something in between, how do you go about choosing it? Well, here are some guidelines. First, if you have a small training set, just use batch gradient descent. If you have a small training set, then there's no point using mini-batch gradient descent; you can process the whole training set quite fast, so you might as well use batch gradient descent. What does a small training set mean? I would say if it's less than maybe 2,000 examples, it'd be perfectly fine to just use batch gradient descent. Otherwise, if you have a bigger training set, typical mini-batch sizes would be anything from 64 up to maybe 512; these are quite typical. And because of the way computer memory is laid out and accessed, sometimes your code runs faster if your mini-batch size is a power of 2. So 64 is 2 to the 6th, then 2 to the 7th, 2 to the 8th, 2 to the 9th. So often I'll implement my mini-batch size to be a power of 2. I know that in the previous video I used a mini-batch size of 1,000. If you really wanted to do that, I would recommend you just use 1,024, which is 2 to the power of 10. And you do see mini-batch sizes of 1,024; it is a bit more rare. This range of mini-batch sizes is a little bit more common.

One last tip is to make sure that your mini-batch, all of your X{t}, Y{t}, fits in CPU/GPU memory. And this really depends on your application and how large a single training example is. But if you ever process a mini-batch that doesn't actually fit in CPU/GPU memory, whatever you're using to process the data, then you'll find that the performance suddenly falls off a cliff and is suddenly much worse.

So I hope this gives you a sense of the typical range of mini-batch sizes that people use in practice. Of course, the mini-batch size is actually another hyperparameter that you might do a quick search over to try to figure out which one is most efficient at reducing the cost function J. So what I would do is just try several different values, try a few different powers of 2, and then see if you can pick one that makes your gradient descent optimization algorithm as efficient as possible. But hopefully this gives you a set of guidelines for how to get started with that hyperparameter search.

You now know how to implement mini-batch gradient descent and make your algorithm run much faster, especially when you're training on a large training set. But it turns out there are even more efficient algorithms than gradient descent or mini-batch gradient descent. Let's start talking about them in the next few videos.
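The relationship between mini-batch size and the number of gradient descent steps per epoch can be checked with a few lines of arithmetic. The sketch below is illustrative rather than taken from the video: the training set size of 5,000,000 is assumed from the previous video's example (it is what makes a mini-batch size of 1,000 yield the 5,000 steps mentioned in the transcript), the helper names steps_per_epoch and minibatches are hypothetical, and shuffling before partitioning is a common practice that this lecture does not discuss.

```python
import math
import random


def steps_per_epoch(m, batch_size):
    """Number of gradient descent steps in one pass over m examples."""
    return math.ceil(m / batch_size)


def minibatches(examples, batch_size, seed=0):
    """Shuffle the examples, then yield consecutive slices of batch_size.

    The last mini-batch is smaller when batch_size does not divide m evenly.
    Shuffling is an assumption here, not something the lecture specifies.
    """
    order = list(examples)
    random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]


m = 5_000_000  # assumed training set size from the previous video's example
for name, size in [("batch", m), ("mini-batch", 1000), ("stochastic", 1)]:
    steps = steps_per_epoch(m, size)
    print(f"{name:>10}: batch_size={size:>9,} -> {steps:,} steps per epoch")

# Small demonstration of the partition itself: 10 examples, mini-batch size 4.
batches = list(minibatches(range(10), batch_size=4))
print([len(b) for b in batches])  # mini-batch sizes: [4, 4, 2]
```

The two extremes fall out directly: a mini-batch size of m gives one step per epoch (batch gradient descent), and a mini-batch size of 1 gives m steps per epoch (stochastic gradient descent).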

Original Description

Take the Deep Learning Specialization: http://bit.ly/2PWDKrR Check out all our courses: https://www.deeplearning.ai Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch Follow us: Twitter: https://twitter.com/deeplearningai_ Facebook: https://www.facebook.com/deeplearningHQ/ Linkedin: https://www.linkedin.com/company/deeplearningai


This video teaches the fundamentals of mini-batch gradient descent, including its benefits, how to choose the mini-batch size, and how it compares with batch and stochastic gradient descent. By understanding how to choose a mini-batch size, and when to reduce the learning rate, viewers can train models on large training sets more efficiently.

Key Takeaways
  1. Choose a mini-batch size between 64 and 512 for larger training sets, preferring powers of 2 (64, 128, 256, 512), which sometimes run faster
  2. Use batch gradient descent for small training sets (fewer than about 2,000 examples)
  3. Use mini-batch gradient descent, rather than batch or stochastic gradient descent, for large training sets
  4. Reduce the learning rate slowly if the optimizer keeps oscillating around the minimum
  5. Treat the mini-batch size as a hyperparameter: try several values, such as a few powers of 2, and pick the one that reduces the cost most efficiently
💡 The ideal mini-batch size is a hyperparameter that needs to be searched over, and it should fit in CPU/GPU memory
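The takeaways above can be turned into a small helper that proposes mini-batch sizes to search over. This is a hedged sketch: the function name candidate_batch_sizes and the parameter max_examples_in_memory are hypothetical, and the thresholds (2,000 examples, 64 to 512) are the rules of thumb quoted in the video, not hard limits.

```python
def candidate_batch_sizes(m, max_examples_in_memory=None):
    """Suggest mini-batch sizes to try for a training set of m examples.

    max_examples_in_memory is a hypothetical, caller-supplied estimate of how
    many examples fit in CPU/GPU memory at once (None means no known limit).
    """
    # Small training set: batch gradient descent, one mini-batch of size m.
    if m < 2000:
        return [m]

    # Larger training set: powers of 2 in the typical range 64..512.
    candidates = [2 ** k for k in range(6, 10)]  # 64, 128, 256, 512

    if max_examples_in_memory is not None:
        fitting = [b for b in candidates if b <= max_examples_in_memory]
        if fitting:
            return fitting
        # Nothing in the typical range fits: fall back to the largest
        # power of 2 that does fit (at least 1).
        limit = max(1, max_examples_in_memory)
        return [2 ** (limit.bit_length() - 1)]

    return candidates


print(candidate_batch_sizes(1500))          # [1500]
print(candidate_batch_sizes(100_000))       # [64, 128, 256, 512]
print(candidate_batch_sizes(100_000, 300))  # [64, 128, 256]
print(candidate_batch_sizes(100_000, 40))   # [32]
```

Each returned value would then be tried in a short training run, keeping whichever reduces the cost function fastest.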
