Mini Batch Gradient Descent (C2W2L01)

DeepLearningAI · Beginner · 📐 ML Fundamentals · 8y ago

Skills: ML Maths Basics 80% · Supervised Learning 70% · ML Pipelines 60%

Key Takeaways

This video covers mini-batch gradient descent, which trains faster than batch gradient descent on large datasets. The training set is split into smaller mini-batches, and one gradient descent step is taken per mini-batch, with a vectorized implementation processing all examples in that mini-batch at once.

Full Transcript

Hello, and welcome back. This week, you'll learn about optimization algorithms that will enable you to train your neural networks much faster. You've heard me say before that applied machine learning is a highly empirical process, a highly iterative process, in which you just have to train a lot of models to find one that works really well. So it really helps to be able to train models quickly. One thing that makes it more difficult is that deep learning tends to work best in the regime of big data, when you're able to train your neural network on a huge dataset, and training on large datasets is just slow. So what you find is that having fast optimization algorithms, having good optimization algorithms, can really speed up the efficiency of you and your team. So let's get started by talking about mini-batch gradient descent.

You've learned previously that vectorization allows you to efficiently compute on all m examples, which allows you to process your whole training set without an explicit for loop. That's why we would take our training examples and stack them into this huge matrix capital X: x^(1), x^(2), x^(3), and eventually it goes up to x^(m), if you have m training examples. And similarly for Y: this is y^(1), y^(2), y^(3), and so on up to y^(m). So the dimension of X was n_x by m, and this [Y] is 1 by m. Vectorization allows you to process all m examples relatively quickly, but if m is very large then it can still be slow. For example, what if m was 5 million, or 50 million, or even bigger? With the implementation of gradient descent on your whole training set, you have to process your entire training set before you take one little step of gradient descent. And then you have to process your entire training set of five million training examples again before you take another little step of gradient descent. So it turns out that you can get a faster algorithm if you let gradient descent start to make some progress even before you finish processing your entire giant training set of five million examples.

In particular, here's what you can do. Let's say that you split up your training set into smaller, little baby training sets, and these baby training sets are called mini-batches. And let's say each of your baby training sets has just 1,000 examples. So you take x^(1) through x^(1000) and you call that your first little baby training set, also called a mini-batch. Then you take the next 1,000 examples, x^(1001) through x^(2000), and call that the next one, and so on. I'm going to introduce a new notation: I'm going to call this X superscript with curly braces 1, X^{1}, and I'm going to call this X superscript with curly braces 2, X^{2}. Now, if you have five million training examples in total and each of these little mini-batches has a thousand examples, that means you have 5,000 of these, because 5,000 times 1,000 equals 5 million. So altogether you would have 5,000 of these mini-batches, and it ends with X^{5000}. Similarly, you do the same thing for Y: you'd also split up your training data for Y accordingly. So you call that Y^{1}, then this is y^(1001) through y^(2000), which becomes Y^{2}, and so on until you have Y^{5000}. So mini-batch number t is going to be comprised of X^{t} and Y^{t}, and that is a thousand training examples with the corresponding input-output pairs.

Before moving on, just to make sure my notation is clear: we have previously used superscript round brackets (i) to index into the training set, so x^(i) is the i-th training example. We use superscript square brackets [l] to index into the different layers of a neural network, so z^[l] comes from the z values for the l-th layer of the neural network. And here we're introducing the curly brackets {t} to index into different mini-batches, so you have X^{t}, Y^{t}. And to check your understanding of these, what is the dimension of X^{t} and Y^{t}? Well, X is n_x by m. So if X^{1} holds the x values for a thousand examples, then its dimension should be n_x by 1,000, and X^{2} should also be n_x by 1,000, and so on. So all of the X^{t} should have dimension n_x by 1,000, and all of the Y^{t} should have dimension 1 by 1,000.

To explain the name of this algorithm: batch gradient descent refers to the gradient descent algorithm we've been talking about previously, where you process your entire training set all at the same time. The name comes from viewing that as processing your entire batch of training examples all at the same time. I know it's not a great name, but that's just what it's called. Mini-batch gradient descent, in contrast, refers to the algorithm which we'll talk about on the next slide, in which you process a single mini-batch X^{t}, Y^{t} at a time, rather than processing your entire training set X, Y at the same time.

So let's see how mini-batch gradient descent works. To run mini-batch gradient descent on your training set, you would run for t = 1 to 5,000, because we had 5,000 mini-batches of size 1,000 each. What you're going to do inside the for loop is basically implement one step of gradient descent using X^{t}, Y^{t}. It's as if you had a training set of 1,000 examples, and as if you were to implement the algorithm you're already familiar with, but just on this little training set of size m = 1,000. Rather than having an explicit for loop over all 1,000 examples, you would use vectorization to process all 1,000 examples sort of all at the same time.

So let's write this out. First, you implement forward prop on the inputs, so just on X^{t}. You do that by implementing Z^[1] = W^[1] X^{t} + b^[1]. Previously we just had X there, right? But now you're not processing the entire training set, you're just processing one mini-batch, so this becomes X^{t} when you're processing mini-batch t. Then you would have A^[1] = g^[1](Z^[1]), with a capital Z since this is actually a vectorized implementation, and so on until you end up with A^[L] = g^[L](Z^[L]), and this is your prediction. You'll notice that here you should use a vectorized implementation; it's just that this vectorized implementation processes 1,000 examples at a time rather than 5 million examples.

Next, you compute the cost function J, which I'm going to write as 1 over 1,000, since here 1,000 is the size of your little training set, times the sum from i = 1 through 1,000 of the loss of y_hat^(i), y^(i). This notation, for clarity, refers to examples from the mini-batch X^{t}, Y^{t}. And if you're using regularization, you can also have this regularization term: lambda over 2 times 1,000, times the sum over l of the squared Frobenius norm of the weight matrix. Because this is really the cost on just one mini-batch, I'm going to index this cost J with a superscript t in curly braces, J^{t}. So you'll notice that everything we're doing is exactly the same as when we were previously implementing gradient descent, except that instead of doing it on X, Y, you're now doing it on X^{t}, Y^{t}.

Next, you implement back prop to compute gradients with respect to J^{t}, so you're still using only X^{t}, Y^{t}. And then you update the weights: every W^[l] gets updated as W^[l] minus alpha times dW^[l], and similarly for b. So this is one pass through your training set using mini-batch gradient descent. The code I've written down here is also called doing one epoch of training, and epoch is a word that just means a single pass through the training set. So whereas with batch gradient descent a single pass through the training set allows you to take only one gradient descent step, with mini-batch gradient descent a single pass through the training set, that is one epoch, allows you to take 5,000 gradient descent steps. Now, of course, you usually want to take multiple passes through the training set, so you might want another for loop or another while loop out there. So you keep taking passes through the training set until hopefully you converge, or approximately converge.

When you have a large training set, mini-batch gradient descent runs much faster than batch gradient descent, and it's pretty much what everyone in deep learning will use when training on a large dataset. In the next video, let's delve deeper into mini-batch gradient descent so you can get a better understanding of what it's doing and why it works so well.
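The mini-batch split described in the lecture can be sketched in NumPy as follows. This is a minimal illustration and not code from the course: the helper name `make_mini_batches` is hypothetical, and the sizes are scaled down (m = 5,000 instead of 5 million) so it runs instantly. It assumes the lecture's layout, with examples stacked as columns, so X is n_x by m and Y is 1 by m.

```python
import numpy as np

def make_mini_batches(X, Y, batch_size):
    """Split the columns of X (n_x, m) and Y (1, m) into consecutive mini-batches.

    Returns a list of (X_t, Y_t) pairs. If m is not a multiple of batch_size,
    the last pair is simply smaller.
    """
    m = X.shape[1]
    return [
        (X[:, start:start + batch_size], Y[:, start:start + batch_size])
        for start in range(0, m, batch_size)
    ]

# Scaled-down stand-in for the lecture's example (5,000,000 examples, batches of 1,000).
n_x, m, batch_size = 3, 5000, 1000
X = np.arange(n_x * m, dtype=float).reshape(n_x, m)
Y = np.zeros((1, m))

batches = make_mini_batches(X, Y, batch_size)
print(len(batches))            # number of mini-batches: m / batch_size
print(batches[0][0].shape)     # shape of the first X mini-batch: (n_x, batch_size)
print(batches[0][1].shape)     # shape of the first Y mini-batch: (1, batch_size)
```

Slicing columns this way returns views rather than copies, so the split itself costs almost nothing even for a large X.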

Original Description

Take the Deep Learning Specialization: http://bit.ly/2x6x2J9 Check out all our courses: https://www.deeplearning.ai Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch Follow us: Twitter: https://twitter.com/deeplearningai_ Facebook: https://www.facebook.com/deeplearningHQ/ Linkedin: https://www.linkedin.com/company/deeplearningai

Playlist

Uploads from DeepLearningAI · DeepLearningAI · 13 of 60

Forward and Backward Propagation (C1W4L06)

deeplearning.ai's Heroes of Deep Learning: Yuanqing Lin

deeplearning.ai's Heroes of Deep Learning: Ruslan Salakhutdinov

deeplearning.ai's Heroes of Deep Learning: Yoshua Bengio

deeplearning.ai's Heroes of Deep Learning: Pieter Abbeel

deeplearning.ai's Heroes of Deep Learning: Ian Goodfellow

deeplearning.ai's Heroes of Deep Learning: Andrej Karpathy

Using an Appropriate Scale (C2W3L02)

Gradient Checking (C2W1L13)

Gradient Checking Implementation Notes (C2W1L14)

Learning Rate Decay (C2W2L09)

Understanding Mini-Batch Gradient Dexcent (C2W2L02)

Mini Batch Gradient Descent (C2W2L01)

The Problem of Local Optima (C2W3L10)

Exponentially Weighted Averages (C2W2L03)

Tuning Process (C2W3L01)

Understanding Exponentially Weighted Averages (C2W2L04)

Bias Correction of Exponentially Weighted Averages (C2W2L05)

Gradient Descent With Momentum (C2W2L06)

Normalizing Activations in a Network (C2W3L04)

Hyperparameter Tuning in Practice (C2W3L03)

Adam Optimization Algorithm (C2W2L08)

RMSProp (C2W2L07)

Fitting Batch Norm Into Neural Networks (C2W3L05)

Why Does Batch Norm Work? (C2W3L06)

Batch Norm At Test Time (C2W3L07)

Softmax Regression (C2W3L08)

Deep Learning Frameworks (C2W3L10)

Neural Network Overview (C1W3L01)

Training Softmax Classifier (C2W3L09)

Why Deep Representations? (C1W4L04)

Gradient Descent For Neural Networks (C1W3L09)

Neural Network Representations (C1W3L02)

TensorFlow (C2W3L11)

Activation Functions (C1W3L06)

Explanation For Vectorized Implementation (C1W3L05)

Getting Matrix Dimensions Right (C1W4L03)

Understanding Dropout (C2W1L07)

Building Blocks of a Deep Neural Network (C1W4L05)

Why Non-linear Activation Functions (C1W3L07)

Computing Neural Network Output (C1W3L03)

Backpropagation Intuition (C1W3L10)

Train/Dev/Test Sets (C2W1L01)

Deep L-Layer Neural Network (C1W4L01)

Random Initialization (C1W3L11)

Other Regularization Methods (C2W1L08)

Normalizing Inputs (C2W1L09)

Derivatives Of Activation Functions (C1W3L08)

Parameters vs Hyperparameters (C1W4L07)

Vectorizing Across Multiple Examples (C1W3L04)

What does this have to do with the brain? (C1W4L08)

Dropout Regularization (C2W1L06)

Vanishing/Exploding Gradients (C2W1L10)

Basic Recipe for Machine Learning (C2W1L03)

Bias/Variance (C2W1L02)

Forward Propagation in a Deep Network (C1W4L02)

Weight Initialization in a Deep Network (C2W1L11)

Numerical Approximations of Gradients (C2W1L12)

Regularization (C2W1L04)

Why Regularization Reduces Overfitting (C2W1L05)

This video teaches mini-batch gradient descent, which runs much faster than batch gradient descent on large training sets, and shows how to implement it: vectorized forward prop on each mini-batch, a per-mini-batch cost function, back prop, and a weight update. It is the approach nearly everyone in deep learning uses when training on a large dataset.
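Written out, the per-mini-batch cost and the update rule from the lecture look roughly like this for mini-batches of 1,000 examples. This is a reconstruction from the spoken description, not a copy of the slide: the L2 regularization term is optional, and the symbols λ (regularization strength), L (number of layers), and α (learning rate) are assumed to follow the notation of the earlier lectures in the course.

```latex
J^{\{t\}} = \frac{1}{1000} \sum_{i=1}^{1000} \mathcal{L}\big(\hat{y}^{(i)}, y^{(i)}\big)
          + \frac{\lambda}{2 \cdot 1000} \sum_{l=1}^{L} \big\lVert W^{[l]} \big\rVert_F^2

W^{[l]} := W^{[l]} - \alpha \, dW^{[l]}, \qquad
b^{[l]} := b^{[l]} - \alpha \, db^{[l]}
```

Here the sum over i runs only over the examples in mini-batch t, and dW^{[l]}, db^{[l]} are the gradients of J^{\{t\}} rather than of the cost on the full training set.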

Key Takeaways

Split the training set into smaller mini-batches, for example 1,000 examples each
Index mini-batches with a superscript in curly braces: X^{t}, Y^{t}
Implement forward prop on the mini-batch inputs X^{t}
Compute the cost J^{t} as 1/1000 * sum from i=1 to 1000 of loss(y_hat^(i), y^(i)) over the examples in mini-batch X^{t}, Y^{t}
Implement back prop to compute gradients with respect to the mini-batch cost J^{t}
Update the weights of every layer: W^[l] = W^[l] - alpha * dW^[l], and similarly for b^[l]
Take multiple passes (epochs) through the training set using an outer for loop or while loop until training approximately converges
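The steps above can be sketched end to end in NumPy. This is a minimal sketch under stated assumptions, not the course's implementation: it uses a single sigmoid layer (logistic regression) instead of a deep network, leaves out regularization, and trains on small synthetic data. The names `run_one_epoch` and `train`, the stopping tolerance `tol`, and all hyperparameter values are hypothetical choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def run_one_epoch(X, Y, W, b, batch_size, alpha):
    """One pass over the training set: one gradient step per mini-batch."""
    m = X.shape[1]
    costs = []
    for start in range(0, m, batch_size):
        X_t = X[:, start:start + batch_size]   # X^{t}, shape (n_x, batch)
        Y_t = Y[:, start:start + batch_size]   # Y^{t}, shape (1, batch)
        m_t = X_t.shape[1]

        # Forward prop, vectorized over the mini-batch.
        A = sigmoid(W @ X_t + b)

        # Cost J^{t}: mean cross-entropy loss over this mini-batch only.
        eps = 1e-12  # guards against log(0)
        J_t = -np.mean(Y_t * np.log(A + eps) + (1 - Y_t) * np.log(1 - A + eps))
        costs.append(float(J_t))

        # Back prop for the sigmoid + cross-entropy pair.
        dZ = A - Y_t
        dW = (dZ @ X_t.T) / m_t
        db = np.sum(dZ, axis=1, keepdims=True) / m_t

        # Gradient descent update.
        W = W - alpha * dW
        b = b - alpha * db
    return W, b, costs

def train(X, Y, batch_size=100, alpha=0.5, max_epochs=50, tol=1e-4):
    """Outer loop: repeat epochs until the mean mini-batch cost stops improving."""
    W = np.zeros((1, X.shape[0]))
    b = np.zeros((1, 1))
    history = []
    for _ in range(max_epochs):
        W, b, costs = run_one_epoch(X, Y, W, b, batch_size, alpha)
        history.append(float(np.mean(costs)))
        if len(history) > 1 and abs(history[-2] - history[-1]) < tol:
            break
    return W, b, history

# Synthetic, linearly separable data: 1,000 examples with 2 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 1000))
Y = (X[0:1, :] + X[1:2, :] > 0).astype(float)

W, b, history = train(X, Y)
print(len(history), history[0], history[-1])
```

With 1,000 examples and a batch size of 100, each epoch takes 10 gradient steps, whereas batch gradient descent would take one step per pass over the same data.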

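💡 Mini-batch gradient descent takes one gradient step per mini-batch (5,000 steps per epoch in the lecture's example, versus one step per epoch for batch gradient descent), so it runs much faster on large datasets and is what nearly everyone in deep learning uses for them.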
