Understanding Mini-Batch Gradient Descent (C2W2L02)

DeepLearningAI · Beginner ·📐 ML Fundamentals ·8y ago


Key Takeaways

The video explains mini-batch gradient descent: how its cost curve behaves, how it compares with batch gradient descent and stochastic gradient descent, and how to choose a mini-batch size. It also notes that slowly reducing the learning rate can dampen oscillation around the minimum.

Full Transcript

In the previous video, you saw how you can use mini-batch gradient descent to start making progress, to start taking gradient descent steps, even when you're just partway through processing your training set, even for the first time. In this video, you'll learn more details of how to implement gradient descent and gain a better understanding of what it's doing and why it works.

With batch gradient descent, on every iteration you go through the entire training set, and you'd expect the cost to go down on every single iteration. So if we plot the cost function J as a function of different iterations, it should decrease on every single iteration, and if it ever goes up, even on one iteration, then something's wrong. Maybe your learning rate is too big.

On mini-batch gradient descent, though, if you plot progress on your cost function, then it may not decrease on every iteration. In particular, on every iteration you're processing some X^{t}, Y^{t}, and so if you plot the cost function J^{t}, which is computed using just X^{t}, Y^{t}, then it's as if on every iteration you're training on a different training set, or really training on a different mini-batch. So if you plot the cost function J, you're more likely to see something that looks like this: it should trend downwards, but it's also going to be a little bit noisier. So if you plot J^{t} as you're training with mini-batch gradient descent, maybe over multiple epochs, you might expect to see a curve like this. So it's okay if it doesn't go down on every iteration, but it should trend downwards. And the reason it'll be a little bit noisy is that maybe X^{1}, Y^{1} is just a relatively easy mini-batch, so your cost might be a bit lower, but then maybe just by chance X^{2}, Y^{2} is a harder mini-batch, maybe even with some mislabeled examples in it, in which case the cost will be a bit higher, and so on. So that's why you get these oscillations as you plot the cost when you're running mini-batch gradient descent.

Now, one of the parameters you need to choose is the size of your mini-batch.
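
The noisy but downward-trending cost curve described above can be reproduced with a small sketch. Everything here is an illustrative assumption rather than something from the lecture: the toy data (y = 3x plus Gaussian noise), the single parameter w, the mini-batch size of 50, and the learning rate of 0.5. The list costs holds J^{t}, the cost measured on each mini-batch just before the update.

```python
import random

random.seed(0)

# Hypothetical toy data: y = 3x + noise, with m = 1000 examples.
m, batch_size, learning_rate = 1000, 50, 0.5
xs = [random.uniform(-1.0, 1.0) for _ in range(m)]
ys = [3.0 * x + random.gauss(0.0, 0.5) for x in xs]

w = 0.0
costs = []  # J^{t}, one value per mini-batch iteration
for epoch in range(5):
    for start in range(0, m, batch_size):
        xb = xs[start:start + batch_size]
        yb = ys[start:start + batch_size]
        errors = [w * x - y for x, y in zip(xb, yb)]
        # Cost on this mini-batch only, measured before the update.
        costs.append(sum(e * e for e in errors) / (2 * len(xb)))
        grad = sum(e * x for e, x in zip(errors, xb)) / len(xb)
        w -= learning_rate * grad

# Count the iterations where the mini-batch cost rose instead of falling.
increases = sum(1 for a, b in zip(costs, costs[1:]) if b > a)
print(f"first cost {costs[0]:.3f}, last cost {costs[-1]:.3f}")
print(f"iterations where the cost went up: {increases} of {len(costs) - 1}")
```

Plotting costs against the iteration index should give the kind of curve the lecture sketches: a clear downward trend with small ups and downs from one mini-batch to the next.
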
So m was the training set size. On one extreme, if the mini-batch size is equal to m, then you just end up with batch gradient descent. In this extreme, you would just have one mini-batch, X^{1}, Y^{1}, and this mini-batch is equal to your entire training set. So setting the mini-batch size to m just gives you batch gradient descent. The other extreme would be if your mini-batch size were equal to 1. This gives you an algorithm called stochastic gradient descent, and here every example is its own mini-batch. So what you do in this case is you look at the first mini-batch, X^{1}, Y^{1}, but when your mini-batch size is 1, this just has your first training example, and you take a gradient descent step with your first training example. Then you next take a look at your second mini-batch, which is just your second training example, and take your gradient descent step with that, and then you do it with the third training example, and so on, looking at just one single training example at a time.

So let's look at what these two extremes will do on optimizing this cost function. If these are the contours of the cost function you're trying to minimize, so your minimum is there, then batch gradient descent might start somewhere and be able to take relatively low-noise, relatively large steps, and just keep marching to the minimum. In contrast, with stochastic gradient descent, if you start somewhere, let's pick a different starting point, then on every iteration you're taking a gradient descent step with just a single training example. So most of the time you head toward the global minimum, but sometimes you head in the wrong direction, if that one example happens to point you in a bad direction. So stochastic gradient descent can be extremely noisy. On average it'll take you in a good direction, but sometimes it'll head in the wrong direction as well. And stochastic gradient descent won't ever converge. It'll always just kind of oscillate and wander around the region of the minimum, but it won't ever just head to the minimum and stay there.

In practice, the mini-batch size you use will be somewhere in between, somewhere between 1 and m, since 1 and m are respectively too small and too large. And here's why. If you use batch gradient descent, so this is your mini-batch size equals m, then you're processing a huge training set on every iteration. So the main disadvantage of this is that it takes too much time, too long per iteration, assuming you have a very large training set. If you have a small training set, then batch gradient descent is fine. If you go to the opposite, if you use stochastic gradient descent, then it's nice that you get to make progress after processing just one example. That's actually not a problem, and the noisiness can be ameliorated, or can be reduced, by just using a smaller learning rate. But a huge disadvantage of stochastic gradient descent is that you lose almost all your speedup from vectorization, because here you're processing a single training example at a time. The way you process each example is going to be very inefficient.

So what works best in practice is something in between, where you have some mini-batch size that's not too big or too small. And this gives you, in practice, the fastest learning. And you notice that this has two good things going for it. One is that you do get a lot of vectorization. So in the example we used in the previous video, if your mini-batch size was 1,000 examples, then you might be able to vectorize across 1,000 examples, which is going to be much faster than processing the examples one at a time. And second, you can also make progress without needing to wait until you process the entire training set. So again using the numbers we had from the previous video, each epoch, each pass through your training set, allows you to take 5,000 gradient descent steps. So in practice there'll be some in-between mini-batch size that works best. And so with mini-batch gradient descent, we'll start here: maybe one iteration does this, two iterations, three, four.
It's not guaranteed to always head toward the minimum, but it tends to head more consistently in the direction of the minimum than stochastic gradient descent. And then it doesn't always exactly converge or oscillate in a very small region. If that's an issue, you can always reduce the learning rate slowly. We'll talk more about learning rate decay, or how to reduce the learning rate, in a later video.

So if the mini-batch size should not be m and should not be 1, but should be something in between, how do you go about choosing it? Well, here are some guidelines. First, if you have a small training set, just use batch gradient descent. If you have a small training set, then there's no point using mini-batch gradient descent; you can process the whole training set quite fast, so you might as well use batch gradient descent. What does a small training set mean? I would say if it's less than maybe 2,000, it would be perfectly fine to just use batch gradient descent. Otherwise, if you have a bigger training set, typical mini-batch sizes would be anything from 64 up to maybe 512. And because of the way computer memory is laid out and accessed, sometimes your code runs faster if your mini-batch size is a power of 2. So 64 is 2 to the 6th, then 2 to the 7th, 2 to the 8th, 2 to the 9th, so often I'll implement my mini-batch size to be a power of 2. I know in the previous video I used a mini-batch size of 1,000. If you really wanted to do that, I would recommend you just use 1,024, which is 2 to the power of 10. And you do see mini-batch sizes of 1,024, but it is a bit more rare. This range of mini-batch sizes is a little bit more common.

One last tip is to make sure that your mini-batch, all of your X^{t}, Y^{t}, fits in CPU/GPU memory. This really depends on your application and how large a single training example is. But if you ever process a mini-batch that doesn't actually fit in CPU/GPU memory, whatever you're using to process the data, then you'll find that the performance suddenly falls off a cliff and is suddenly much worse.

So I hope this gives you a sense of the typical range of mini-batch sizes that people use in practice. Of course, the mini-batch size is actually another hyperparameter that you might do a quick search over to try to figure out which one is most efficient at reducing the cost function J. So what I would do is just try several different values, try a few different powers of 2, and then see if you can pick one that makes your gradient descent optimization algorithm as efficient as possible. But hopefully this gives you a set of guidelines for how to get started with that hyperparameter search.

You now know how to implement mini-batch gradient descent and make your algorithm run much faster, especially when you're training on a large training set. But it turns out there are even more efficient algorithms than gradient descent or mini-batch gradient descent. Let's start talking about them in the next few videos.
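
To make the two extremes and the in-between case concrete, here is a minimal sketch of how the mini-batch size determines the number of gradient descent steps per epoch. The helper names steps_per_epoch and make_mini_batches are hypothetical, the shuffling step is an assumption that the transcript does not mention, and m = 5,000,000 is an assumed training set size chosen so that a mini-batch size of 1,000 gives the 5,000 steps per epoch quoted above.

```python
import math
import random

def steps_per_epoch(m, batch_size):
    """Number of gradient descent steps in one pass through m examples."""
    return math.ceil(m / batch_size)

def make_mini_batches(examples, batch_size, seed=0):
    """Shuffle (an assumption, not from the transcript) and split into mini-batches."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]

m = 5_000_000  # assumed, to match the 5,000 steps per epoch quoted for size 1,000
print(steps_per_epoch(m, m))      # batch gradient descent: 1 step per epoch
print(steps_per_epoch(m, 1000))   # mini-batch gradient descent: 5000 steps per epoch
print(steps_per_epoch(m, 1))      # stochastic gradient descent: 5000000 steps per epoch

examples = list(range(10))  # stand-ins for (x, y) training pairs
batches = make_mini_batches(examples, batch_size=4)
print([len(b) for b in batches])  # [4, 4, 2]: the last mini-batch may be smaller
```

With a batch size of m there is a single large step per epoch, and with a batch size of 1 there are m tiny steps. An in-between size gives many steps per epoch while each step can still be vectorized across the whole mini-batch.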

Original Description

Take the Deep Learning Specialization: http://bit.ly/2PWDKrR Check out all our courses: https://www.deeplearning.ai Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch Follow us: Twitter: https://twitter.com/deeplearningai_ Facebook: https://www.facebook.com/deeplearningHQ/ Linkedin: https://www.linkedin.com/company/deeplearningai

This video teaches the fundamentals of mini-batch gradient descent, including its benefits, how it compares with batch and stochastic gradient descent, and how to choose the mini-batch size. Choosing a sensible mini-batch size, and reducing the learning rate slowly when oscillation is a problem, can make training noticeably faster on large training sets.

Key Takeaways

Choose a mini-batch size between 64 and 512 for larger training sets; powers of 2 (64, 128, 256, 512) often run faster
Use batch gradient descent for small training sets (fewer than about 2,000 examples)
Prefer mini-batch gradient descent over stochastic gradient descent for large training sets, since a mini-batch size of 1 loses almost all the speedup from vectorization
Reduce the learning rate slowly if the algorithm keeps oscillating around the minimum
Try several different mini-batch sizes, such as a few powers of 2, and pick the one that reduces the cost function J most efficiently

💡 The ideal mini-batch size is a hyperparameter that needs to be searched over, and it should fit in CPU/GPU memory
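
The takeaways above can be sketched as a quick search over candidate mini-batch sizes. This is only an illustration under stated assumptions: candidate_batch_sizes and train_and_score are hypothetical helpers, the toy regression data is made up, and candidates are compared by the full-data cost J after the same number of epochs. In a real search you would more likely compare wall-clock time to reach a given cost, since that is what vectorization and memory limits affect.

```python
import random

def candidate_batch_sizes(m, small_set_threshold=2000):
    """Guideline from the lesson: batch GD for small sets, else powers of 2 from 64 to 512."""
    if m < small_set_threshold:
        return [m]
    return [2 ** k for k in range(6, 10)]  # 64, 128, 256, 512

def train_and_score(xs, ys, batch_size, epochs=2, learning_rate=0.5):
    """Fit y ~ w * x with mini-batch gradient descent; return the full-data cost J."""
    w = 0.0
    for _ in range(epochs):
        for start in range(0, len(xs), batch_size):
            xb, yb = xs[start:start + batch_size], ys[start:start + batch_size]
            grad = sum((w * x - y) * x for x, y in zip(xb, yb)) / len(xb)
            w -= learning_rate * grad
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * len(xs))

# Hypothetical toy data: y = 3x + noise, with m = 4096 examples.
rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(4096)]
ys = [3.0 * x + rng.gauss(0.0, 0.5) for x in xs]

scores = {size: train_and_score(xs, ys, size) for size in candidate_batch_sizes(len(xs))}
best = min(scores, key=scores.get)
for size, cost in scores.items():
    print(f"batch size {size:4d}: J = {cost:.4f}")
print("lowest cost after the same number of epochs:", best)
```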
