Gradient Descent For Neural Networks (C1W3L09)

DeepLearningAI · Beginner · 📐 ML Fundamentals · 8y ago

Skills: ML Maths Basics 80% · Supervised Learning 70%

Key Takeaways

The video demonstrates how to implement gradient descent for a neural network with one hidden layer, providing the necessary equations for forward and back propagation. It covers the basics of neural network parameters, cost functions, and gradient descent updates.
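For reference, the forward and back propagation equations the video refers to can be written out as below. This is a sketch in the course's usual notation, assuming a sigmoid output unit with the cross-entropy loss averaged over m examples. The transcript itself does not spell out the expressions for dW or the 1/m factors (they were written on the slide), so those are filled in here from the standard result rather than from the spoken text. The symbol \odot denotes the element-wise product.

```latex
% Forward propagation (vectorized over m examples)
Z^{[1]} = W^{[1]} X + b^{[1]}, \qquad A^{[1]} = g^{[1]}\big(Z^{[1]}\big)
Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}, \qquad A^{[2]} = g^{[2]}\big(Z^{[2]}\big) = \sigma\big(Z^{[2]}\big)

% Back propagation
dZ^{[2]} = A^{[2]} - Y
dW^{[2]} = \tfrac{1}{m}\, dZ^{[2]} A^{[1]\top}
db^{[2]} = \tfrac{1}{m} \sum_{i=1}^{m} dZ^{[2](i)}
dZ^{[1]} = W^{[2]\top} dZ^{[2]} \odot g^{[1]\prime}\big(Z^{[1]}\big)
dW^{[1]} = \tfrac{1}{m}\, dZ^{[1]} X^{\top}
db^{[1]} = \tfrac{1}{m} \sum_{i=1}^{m} dZ^{[1](i)}

% Gradient descent update, for l = 1, 2
W^{[l]} := W^{[l]} - \alpha\, dW^{[l]}, \qquad b^{[l]} := b^{[l]} - \alpha\, db^{[l]}
```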

Full Transcript

All right, I think this will be an exciting video. In this video you'll see how to implement gradient descent for your neural network with one hidden layer. I'm going to just give you the equations you need to implement in order to get back propagation, or gradient descent, working, and then in the video after this one I'll give some more intuition about why these particular equations are the accurate equations, or the correct equations, for computing the gradients you need for your neural network.

So your neural network, with a single hidden layer for now, will have parameters W1, b1, W2 and b2. As a reminder, if you have nx, or alternatively n0, input features, n1 hidden units, and n2 output units (in our examples so far, n2 equals 1), then the matrix W1 will be n1 by n0, and b1 will be an n1-dimensional vector, so you can write it down as an n1 by 1 dimensional matrix, really a column vector. The dimensions of W2 will be n2 by n1, and the dimension of b2 will be n2 by 1, where again, so far we've only seen examples where n2 is equal to 1, where you have just a single output unit.

You also have a cost function for a neural network, and for now I'm just going to assume that you're doing binary classification. In that case, the cost of your parameters is going to be 1 over m times the sum, that is the average, of the loss function over your m training examples. L here is the loss when your neural network predicts ŷ (this is really a2) when the ground truth label is equal to y. If you're doing binary classification, the loss function can be exactly what you used for logistic regression earlier.

So to train the parameters of your algorithm, you need to perform gradient descent. When training a neural network, it is important to initialize the parameters randomly rather than to all zeros; we'll see later why that's the case. But after initializing the parameters to something, each loop of gradient descent would compute the predictions, so you basically compute ŷ(i)
for i equals 1 through m, say. Then you need to compute the derivatives. So you need to compute dW1, which is the derivative of the cost function with respect to the parameter W1; you need to compute another variable, which I'm going to call db1, which is the derivative, or the slope, of your cost function with respect to the variable b1; and so on similarly for the other parameters W2 and b2. Then finally the gradient descent update would be to update W1 as W1 minus alpha, the learning rate, times dW1; b1 gets updated as b1 minus the learning rate times db1; and similarly for W2 and b2. Sometimes I use colon-equals (:=) and sometimes equals (=); either notation works fine. So this would be one iteration of gradient descent, and then you repeat this some number of times until your parameters look like they're converging.

In previous videos we talked about how to compute predictions, how to compute the outputs, and we saw how to do that in a vectorized way as well. So the key is to know how to compute these partial derivative terms, dW1 and db1, as well as the derivatives dW2 and db2. What I'd like to do is just give you the equations you need in order to compute these derivatives, and I'll defer to the next video, which is an optional video, to go into greater depth about how we came up with those formulas.

So let me just summarize again the equations for forward propagation. You have Z1 equals W1 X plus b1, and then A1 equals the activation function in that layer, g1, applied element-wise to Z1; then Z2 equals W2 A1 plus b2; and then finally (this is all vectorized across your training set) A2 is equal to g2 of Z2. Again, for now, if we assume you're doing binary classification, then this activation function really should be the sigmoid function, so I'll just throw that in here. So that's the forward propagation, or the left-to-right forward computation, for your neural network.

Next, let's compute the derivatives. So this is the back propagation step.
We're going to compute dZ2 equals A2 minus the ground truth Y. Just as a reminder, all this is vectorized across examples, so the matrix Y is this 1 by m matrix that lists all of your m examples stacked horizontally. Then it turns out dW2 is equal to this, and db2 is computed with np.sum using axis=1, keepdims=True. In fact, these first three equations are very similar to gradient descent for logistic regression.

Just a little detail: np.sum is a Python NumPy command for summing across one dimension of a matrix, in this case summing horizontally. What keepdims does is prevent Python from outputting one of those funny rank-1 arrays, where the dimensions are (n,). So having keepdims=True ensures that Python outputs for db2 a vector that is n by 1. Technically this will be, I guess, n2 by 1; in this case it's just a 1 by 1 number, so maybe it doesn't matter, but later on we'll see when it really matters.

So far, what we've done is very similar to logistic regression. But now, as you continue to run back propagation, you would compute dZ1 as W2 transpose times dZ2, times g1 prime of Z1. This quantity g1 prime is the derivative of whatever activation function you used for the hidden layer. For the output layer, I assume that you're doing binary classification with the sigmoid function, so that's already baked into that formula for dZ2. This times is an element-wise product: W2 transpose times dZ2 is going to be an n1 by m matrix, and this element-wise derivative thing, g1 prime of Z1, is also going to be an n1 by m matrix, so this times there is an element-wise product of two matrices.

Then finally, dW1 is equal to that, and db1 is equal to this, using np.sum of dZ1 with axis=1, keepdims=True. Whereas previously the keepdims maybe mattered less if n2 is equal to 1, so the result is just a 1 by 1 thing, a real number, here db1 will be an n1 by 1 vector, and so you want np.sum to output
something of this dimension rather than a funny rank-1 array of that dimension, which could end up messing up some of your later calculations. The other way would be to not have the keepdims parameter but to explicitly call reshape to reshape the output of np.sum into this dimension, which you would like db to have.

So that was forward propagation in, I guess, four equations, and back propagation in, I guess, six equations. I know I just wrote down these equations, but in the next optional video, let's go over some intuitions for how the six equations for the back propagation algorithm were derived. Please feel free to watch that or not, but either way, if you implement these algorithms, you will have a correct implementation of forward prop and back prop, and you'll be able to compute the derivatives you need in order to apply gradient descent to learn the parameters of your neural network. It is possible to implement this algorithm and get it to work without deeply understanding the calculus; a lot of successful deep learning practitioners do so. But if you want, you can also watch the next video, just to get a bit more intuition about the derivation of these equations.
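The keepdims detail from the transcript can be checked with a few lines of NumPy. This is a minimal sketch: the sizes (n1 = 4 hidden units, m = 5 examples), the random dZ1 matrix, and the 0.1 learning rate are made up for illustration and do not come from the video.

```python
import numpy as np

# Hypothetical sizes for illustration only: 4 hidden units, 5 examples.
n1, m = 4, 5
rng = np.random.default_rng(0)
dZ1 = rng.standard_normal((n1, m))

# Without keepdims, summing over the examples gives a rank-1 array.
db1_rank1 = np.sum(dZ1, axis=1) / m
# With keepdims=True, the result stays a column vector of shape (n1, 1).
db1_col = np.sum(dZ1, axis=1, keepdims=True) / m
# The alternative mentioned in the transcript: sum, then reshape explicitly.
db1_reshaped = (np.sum(dZ1, axis=1) / m).reshape(n1, 1)

print(db1_rank1.shape)     # (4,)
print(db1_col.shape)       # (4, 1)
print(db1_reshaped.shape)  # (4, 1)

# Why it matters: a column vector minus a rank-1 array broadcasts to a
# full (n1, n1) matrix, silently changing the shape of the bias.
b1 = np.zeros((n1, 1))
print((b1 - 0.1 * db1_rank1).shape)  # (4, 4)
print((b1 - 0.1 * db1_col).shape)    # (4, 1)
```

Both keepdims=True and the explicit reshape give the same column vector; the choice between them is a matter of style.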

Original Description

Take the Deep Learning Specialization: http://bit.ly/32KQSWb Check out all our courses: https://www.deeplearning.ai Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch Follow us: Twitter: https://twitter.com/deeplearningai_ Facebook: https://www.facebook.com/deeplearningHQ/ Linkedin: https://www.linkedin.com/company/deeplearningai

Playlist

Uploads from DeepLearningAI · DeepLearningAI · 32 of 60

Forward and Backward Propagation (C1W4L06)

deeplearning.ai's Heroes of Deep Learning: Yuanqing Lin

deeplearning.ai's Heroes of Deep Learning: Ruslan Salakhutdinov

deeplearning.ai's Heroes of Deep Learning: Yoshua Bengio

deeplearning.ai's Heroes of Deep Learning: Pieter Abbeel

deeplearning.ai's Heroes of Deep Learning: Ian Goodfellow

deeplearning.ai's Heroes of Deep Learning: Andrej Karpathy

Using an Appropriate Scale (C2W3L02)

Gradient Checking (C2W1L13)

Gradient Checking Implementation Notes (C2W1L14)

Learning Rate Decay (C2W2L09)

Understanding Mini-Batch Gradient Dexcent (C2W2L02)

Mini Batch Gradient Descent (C2W2L01)

The Problem of Local Optima (C2W3L10)

Exponentially Weighted Averages (C2W2L03)

Tuning Process (C2W3L01)

Understanding Exponentially Weighted Averages (C2W2L04)

Bias Correction of Exponentially Weighted Averages (C2W2L05)

Gradient Descent With Momentum (C2W2L06)

Normalizing Activations in a Network (C2W3L04)

Hyperparameter Tuning in Practice (C2W3L03)

Adam Optimization Algorithm (C2W2L08)

RMSProp (C2W2L07)

Fitting Batch Norm Into Neural Networks (C2W3L05)

Why Does Batch Norm Work? (C2W3L06)

Batch Norm At Test Time (C2W3L07)

Softmax Regression (C2W3L08)

Deep Learning Frameworks (C2W3L10)

Neural Network Overview (C1W3L01)

Training Softmax Classifier (C2W3L09)

Why Deep Representations? (C1W4L04)

Gradient Descent For Neural Networks (C1W3L09)

Neural Network Representations (C1W3L02)

TensorFlow (C2W3L11)

Activation Functions (C1W3L06)

Explanation For Vectorized Implementation (C1W3L05)

Getting Matrix Dimensions Right (C1W4L03)

Understanding Dropout (C2W1L07)

Building Blocks of a Deep Neural Network (C1W4L05)

Why Non-linear Activation Functions (C1W3L07)

Computing Neural Network Output (C1W3L03)

Backpropagation Intuition (C1W3L10)

Train/Dev/Test Sets (C2W1L01)

Deep L-Layer Neural Network (C1W4L01)

Random Initialization (C1W3L11)

Other Regularization Methods (C2W1L08)

Normalizing Inputs (C2W1L09)

Derivatives Of Activation Functions (C1W3L08)

Parameters vs Hyperparameters (C1W4L07)

Vectorizing Across Multiple Examples (C1W3L04)

What does this have to do with the brain? (C1W4L08)

Dropout Regularization (C2W1L06)

Vanishing/Exploding Gradients (C2W1L10)

Basic Recipe for Machine Learning (C2W1L03)

Bias/Variance (C2W1L02)

Forward Propagation in a Deep Network (C1W4L02)

Weight Initialization in a Deep Network (C2W1L11)

Numerical Approximations of Gradients (C2W1L12)

Regularization (C2W1L04)

Why Regularization Reduces Overfitting (C2W1L05)

This video teaches how to implement gradient descent for a neural network with one hidden layer, covering forward and back propagation, and providing the necessary equations for computing partial derivatives. It is a fundamental concept in machine learning and deep learning.

Key Takeaways

Initialize parameters randomly
Compute predictions
Compute partial derivatives
Update parameters using gradient descent
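The four steps above can be sketched as one NumPy training loop. This is an illustrative sketch rather than the course's reference code: the function name `train`, the tanh hidden activation, the 0.01 weight scale, zero-initialized biases, the learning rate, and the toy dataset are all assumptions made here. The only parts taken from the video are the one-hidden-layer structure, the sigmoid output, and the update rule.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, n1=4, alpha=0.5, iterations=1000, seed=0):
    """Hypothetical helper. X has shape (n0, m); Y has shape (1, m) with 0/1 labels."""
    n0, m = X.shape
    n2 = Y.shape[0]
    rng = np.random.default_rng(seed)

    # Step 1: initialize weights randomly, not to all zeros.
    # Zero biases and the 0.01 scale are assumptions of this sketch.
    W1 = rng.standard_normal((n1, n0)) * 0.01
    b1 = np.zeros((n1, 1))
    W2 = rng.standard_normal((n2, n1)) * 0.01
    b2 = np.zeros((n2, 1))

    costs = []
    for _ in range(iterations):
        # Step 2: compute predictions (forward propagation).
        Z1 = W1 @ X + b1
        A1 = np.tanh(Z1)          # hidden activation g1, assumed tanh here
        Z2 = W2 @ A1 + b2
        A2 = sigmoid(Z2)          # output activation for binary classification

        # Cross-entropy cost, clipped to avoid log(0); tracked for monitoring.
        A2c = np.clip(A2, 1e-12, 1 - 1e-12)
        costs.append(float(-np.mean(Y * np.log(A2c) + (1 - Y) * np.log(1 - A2c))))

        # Step 3: compute partial derivatives (back propagation).
        dZ2 = A2 - Y
        dW2 = dZ2 @ A1.T / m
        db2 = np.sum(dZ2, axis=1, keepdims=True) / m
        dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)   # element-wise; tanh'(z) = 1 - tanh(z)^2
        dW1 = dZ1 @ X.T / m
        db1 = np.sum(dZ1, axis=1, keepdims=True) / m

        # Step 4: gradient descent update.
        W1 -= alpha * dW1
        b1 -= alpha * db1
        W2 -= alpha * dW2
        b2 -= alpha * db2

    return (W1, b1, W2, b2), costs

# Toy, made-up dataset: label is 1 when the two features sum to more than 0.
data_rng = np.random.default_rng(1)
X = data_rng.standard_normal((2, 200))
Y = (X[0:1] + X[1:2] > 0).astype(float)
params, costs = train(X, Y)
print(costs[0], costs[-1])   # the cost should go down over the iterations
```

A fixed iteration count stands in for "repeat until the parameters look like they're converging"; in practice one might watch the recorded costs instead.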

💡 Understanding the derivation of the back propagation equations is not necessary to implement them, but it can provide valuable intuition for debugging and optimizing neural networks.
