Understanding Mini-Batch Gradient Descent (C2W2L02)
Key Takeaways
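The video explains how mini-batch gradient descent behaves during training, why its cost curve is noisier than that of batch gradient descent, and how it compares with batch gradient descent and stochastic gradient descent. It gives guidelines for choosing a mini-batch size: use batch gradient descent for small training sets, otherwise try powers of two from 64 to 512, and make sure each mini-batch fits in CPU/GPU memory.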
Full Transcript
In the previous video, you saw how you can use mini-batch gradient descent to start making progress, to start taking gradient descent steps, even when you're just partway through processing your training set, even for the first time. In this video, you'll learn more details of how to implement gradient descent and gain a better understanding of what it's doing and why it works.

With batch gradient descent, on every iteration you go through the entire training set, and you'd expect the cost to go down on every single iteration. So if we plot the cost function J as a function of different iterations, it should decrease on every single iteration. And if it ever goes up, even on one iteration, then something's wrong; maybe your learning rate is too big.

On mini-batch gradient descent, though, if you plot progress on your cost function, then it may not decrease on every iteration. In particular, on every iteration you're processing some X^{t}, Y^{t}, and so if you plot the cost function J^{t}, which is computed using just X^{t}, Y^{t}, then it's as if on every iteration you're training on a different training set, or really training on a different mini-batch. So if you plot the cost function J, you're more likely to see something that looks like this: it should trend downwards, but it's also going to be a little bit noisier. So if you plot J^{t} as you're training with mini-batch gradient descent, maybe over multiple epochs, you might expect to see a curve like this. So it's okay if it doesn't go down on every iteration, but it should trend downwards. And the reason it'll be a little bit noisy is that maybe X^{1}, Y^{1} is just a relatively easy mini-batch, so your cost might be a bit lower, but then maybe just by chance X^{2}, Y^{2} is just a harder mini-batch, maybe even with some mislabeled examples in it, in which case the cost would be a bit higher, and so on. So that's why you get these oscillations as you plot the cost when you're running mini-batch gradient descent.

Now, one of the parameters you need to choose is the size of your mini-batch.
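The difference between the two cost curves described above can be reproduced with a small sketch. Everything here is an illustrative assumption, not something from the video: the synthetic one-parameter regression data, the helper names `make_data`, `cost_and_grad`, and `train`, and the learning rate of 0.1. The script records the cost at every step, first with the mini-batch size equal to the full training set (batch gradient descent) and then with a mini-batch size of 64.

```python
import random

def make_data(m, seed=0):
    """Synthetic 1-D regression data, y = 3x + noise (illustrative assumption)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(m)]
    ys = [3.0 * x + rng.gauss(0.0, 0.5) for x in xs]
    return xs, ys

def cost_and_grad(w, xs, ys):
    """Squared-error cost J and its derivative dJ/dw for the model y_hat = w * x."""
    n = len(xs)
    errs = [w * x - y for x, y in zip(xs, ys)]
    cost = sum(e * e for e in errs) / (2 * n)
    grad = sum(e * x for e, x in zip(errs, xs)) / n
    return cost, grad

def train(xs, ys, batch_size, epochs, lr=0.1, seed=1):
    """Run gradient descent; return the final weight and the cost seen at each step."""
    rng = random.Random(seed)
    w = 0.0
    costs = []
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)  # new random mini-batches on every epoch
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            bx = [xs[i] for i in batch]
            by = [ys[i] for i in batch]
            cost, grad = cost_and_grad(w, bx, by)  # J^{t} on this mini-batch only
            costs.append(cost)
            w -= lr * grad
    return w, costs

xs, ys = make_data(m=2048)

# Mini-batch size = m: one step per epoch, cost computed on the whole set.
_, batch_costs = train(xs, ys, batch_size=len(xs), epochs=50)

# Mini-batch size = 64: 32 steps per epoch, cost computed on a different batch each step.
_, mini_costs = train(xs, ys, batch_size=64, epochs=2)

ups_batch = sum(b > a for a, b in zip(batch_costs, batch_costs[1:]))
ups_mini = sum(b > a for a, b in zip(mini_costs, mini_costs[1:]))
print("steps where cost went up (batch):", ups_batch)
print("steps where cost went up (mini-batch):", ups_mini)
```

Plotting `batch_costs` and `mini_costs` against the step index should show a smooth curve for the first and a jagged but downward-trending curve for the second.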
So m was the training set size. On one extreme, if the mini-batch size is equal to m, then you just end up with batch gradient descent. In this extreme, you would just have one mini-batch, X^{1}, Y^{1}, and this mini-batch is equal to your entire training set. So setting the mini-batch size to m just gives you batch gradient descent. The other extreme would be if your mini-batch size were equal to 1. This gives you an algorithm called stochastic gradient descent, and here every example is its own mini-batch. So what you do in this case is you look at the first mini-batch, X^{1}, Y^{1}, but when your mini-batch size is 1, this just has your first training example, and you take a gradient descent step with your first training example. And then you next take a look at your second mini-batch, which is just your second training example, and take your gradient descent step with that, and then you do it with the third training example, and so on, looking at just one single training example at a time.

So let's look at what these two extremes will do on optimizing this cost function. If these are the contours of the cost function you're trying to minimize, so your minimum is there, then batch gradient descent might start somewhere and be able to take relatively low-noise, relatively large steps, and just keep marching to the minimum. In contrast, with stochastic gradient descent, if you start somewhere, let's pick a different starting point, then on every iteration you're taking a gradient descent step with just a single training example. So most of the time you head toward the global minimum, but sometimes you head in the wrong direction, if that one example happens to point you in a bad direction. So stochastic gradient descent can be extremely noisy. On average it'll take you in a good direction, but sometimes it'll head in the wrong direction as well. And stochastic gradient descent won't ever converge; it'll always just kind of oscillate and wander around the region of the minimum, but it won't ever just head to the minimum and stay there.

In practice, the mini-batch size you use will be somewhere in between, somewhere between 1 and m; 1 and m are respectively too small and too large. And here's why. If you use batch gradient descent, so this is your mini-batch size equals m, then you're processing a huge training set on every iteration. So the main disadvantage of this is that it takes too much time, too long per iteration, assuming you have a very large training set. If you have a small training set, then batch gradient descent is fine. If you go to the opposite, if you use stochastic gradient descent, then it's nice that you get to make progress after processing just one example; that's actually not a problem. And the noisiness can be ameliorated, or can be reduced, by just using a smaller learning rate. But the huge disadvantage of stochastic gradient descent is that you lose almost all your speedup from vectorization, because here you're processing a single training example at a time. The way you process each example is going to be very inefficient.

So what works best in practice is something in between, where you have some mini-batch size that's not too big or too small. And this gives you, in practice, the fastest learning. And you notice that this has two good things going for it. One is that you do get a lot of vectorization. So in the example we used in the previous video, if your mini-batch size was 1,000 examples, then you might be able to vectorize across 1,000 examples, so it's going to be much faster than processing the examples one at a time. And second, you can also make progress without needing to wait until you process the entire training set. So again using the numbers we had in the previous video, each epoch, each pass through your training set, allows you to take 5,000 gradient descent steps. So in practice there'll be some in-between mini-batch size that works best. And so with mini-batch gradient descent we'd start here; maybe one iteration does this, two iterations, three, four. And it's not guaranteed to always head toward the minimum, but it tends to head more consistently in the direction of the minimum than stochastic gradient descent. And then it doesn't always exactly converge, or it oscillates in a very small region. If that's an issue, you can always reduce the learning rate slowly. We'll talk more about learning rate decay, or how to reduce the learning rate, in a later video.

So if the mini-batch size should not be m and should not be 1, but should be something in between, how do you go about choosing it? Well, here are some guidelines. First, if you have a small training set, just use batch gradient descent. If you have a small training set, then there's no point using mini-batch gradient descent; you can process the whole training set quite fast, so you might as well use batch gradient descent. What does a small training set mean? I would say if it's less than maybe 2,000, it'd be perfectly fine to just use batch gradient descent. Otherwise, if you have a bigger training set, typical mini-batch sizes would be anything from 64 up to maybe 512; those are quite typical. And because of the way computer memory is laid out and accessed, sometimes your code runs faster if your mini-batch size is a power of 2. So 64 is 2 to the 6th, then 2 to the 7th, 2 to the 8th, 2 to the 9th. So often I'll implement my mini-batch size to be a power of 2. I know in the previous video I used a mini-batch size of 1,000; if you really wanted to do that, I would recommend you just use 1,024, which is 2 to the power of 10. And you do see mini-batch sizes of 1,024; it is a bit more rare. This range of mini-batch sizes is a little bit more common.

One last tip is to make sure that your mini-batch, all of your X^{t}, Y^{t}, fits in CPU/GPU memory. And this really depends on your application and how large a single training example is. But if you ever process a mini-batch that doesn't actually fit in CPU/GPU memory, whatever you're using to process the data, then you'll find that the performance suddenly falls off a cliff and is suddenly much worse.

So I hope this gives you a sense of the typical range of mini-batch sizes that people use in practice. Of course, the mini-batch size is actually another hyperparameter that you might do a quick search over to try to figure out which one is most efficient at reducing the cost function J. So what I would do is just try several different values, try a few different powers of 2, and then see if you can pick one that makes your gradient descent optimization algorithm as efficient as possible. But hopefully this gives you a set of guidelines for how to get started with that hyperparameter search. You now know how to implement mini-batch gradient descent and make your algorithm run much faster, especially when you're training on a large training set. But it turns out there are even more efficient algorithms than gradient descent or mini-batch gradient descent. Let's start talking about them in the next few videos.
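The sizing guidelines in the transcript can be written down as a small helper. This is a rough sketch, not code from the course: the function names `candidate_batch_sizes`, `steps_per_epoch`, and `nearest_power_of_two` are hypothetical, and the training set size of 5,000,000 is inferred from the transcript's figures of 1,000 examples per mini-batch and 5,000 steps per epoch. The thresholds (fewer than about 2,000 examples, sizes from 64 to 512) are the speaker's rules of thumb, not hard limits.

```python
import math

SMALL_TRAINING_SET = 2000            # "less than maybe 2,000" in the transcript
TYPICAL_SIZES = [64, 128, 256, 512]  # 2**6 through 2**9

def candidate_batch_sizes(m):
    """Return the mini-batch sizes worth trying for a training set of m examples."""
    if m < SMALL_TRAINING_SET:
        return [m]  # small training set: plain batch gradient descent
    return list(TYPICAL_SIZES)

def steps_per_epoch(m, batch_size):
    """Number of gradient descent steps taken in one pass through the training set."""
    return math.ceil(m / batch_size)

def nearest_power_of_two(n):
    """Round n to the closest power of two, for example 1000 -> 1024."""
    lower = 2 ** math.floor(math.log2(n))
    upper = lower * 2
    return lower if n - lower < upper - n else upper

m = 5_000_000  # inferred: 1,000 examples per mini-batch * 5,000 steps per epoch

# The two extremes, the size used in the previous video, and its power-of-two neighbor.
for size in [m, 1, 1000, nearest_power_of_two(1000)]:
    print(f"mini-batch size {size:>9,}: {steps_per_epoch(m, size):>9,} steps per epoch")

# The typical range to search over when the training set is large.
for size in candidate_batch_sizes(m):
    print(f"candidate size {size:>4}: {steps_per_epoch(m, size):,} steps per epoch")
```

A quick search would then train for a fixed budget with each candidate size and keep the one that reduces the cost J fastest, while checking that a mini-batch of that size still fits in CPU/GPU memory.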
Original Description
Take the Deep Learning Specialization: http://bit.ly/2PWDKrR
Check out all our courses: https://www.deeplearning.ai
Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch