Gradient Descent For Neural Networks (C1W3L09)

DeepLearningAI · Beginner ·📐 ML Fundamentals ·8y ago

Key Takeaways

The video demonstrates how to implement gradient descent for a neural network with one hidden layer, providing the necessary equations for forward and back propagation. It covers the basics of neural network parameters, cost functions, and gradient descent updates.

Full Transcript

All right, I think this will be an exciting video. In this video you'll see how to implement gradient descent for your neural network with one hidden layer. I'm going to just give you the equations you need to implement in order to get backpropagation, or gradient descent, working, and then in the video after this one I'll give some more intuition about why these particular equations are the accurate equations, or the correct equations, for computing the gradients you need for your neural network.

So your neural network with a single hidden layer, for now, will have parameters W1, b1, W2 and b2. As a reminder, if you have n_x, or alternatively n0, input features, n1 hidden units, and n2 output units (in our examples so far, n2 equals 1), then the matrix W1 will be n1 by n0. b1 will be an n1-dimensional vector, so you can write that down as an n1 by 1 matrix, really a column vector. The dimensions of W2 will be n2 by n1, and the dimension of b2 will be n2 by 1, where again, so far we've only seen examples where n2 is equal to 1, where you have just a single output unit.

You also have a cost function for a neural network, and for now I'm just going to assume that you're doing binary classification. In that case the cost J of your parameters is going to be 1 over m times the sum, that is the average, of the loss function L(ŷ, y). L here is the loss when your neural network predicts ŷ (this is really A2) when the ground truth label is equal to y. If you're doing binary classification, the loss function can be exactly what you used for logistic regression earlier.

To train the parameters of your algorithm, you need to perform gradient descent. When training a neural network it is important to initialize the parameters randomly rather than to all zeros; we'll see later why that's the case. After initializing the parameters to something, each loop of gradient descent would compute the predictions, so you basically compute ŷ(i) for i equals 1 through m, say. Then you need to compute the derivatives. So you need to compute dW1, which is the derivative of the cost function with respect to the parameter W1; you need to compute another variable, which I'm going to call db1, which is the derivative, or the slope, of your cost function with respect to the variable b1; and so on similarly for the other parameters W2 and b2. Then finally the gradient descent update would be to update W1 as W1 minus alpha, the learning rate, times dW1; b1 gets updated as b1 minus the learning rate times db1; and similarly for W2 and b2. Sometimes I use colon-equals and sometimes equals; either notation works fine. So this would be one iteration of gradient descent, and then you repeat this some number of times until your parameters look like they're converging.

In previous videos we talked about how to compute the predictions, how to compute the outputs, and we saw how to do that in a vectorized way as well. So the key is to know how to compute these partial derivative terms, dW1 and db1, as well as the derivatives dW2 and db2. What I'd like to do is just give you the equations you need in order to compute these derivatives, and I'll defer to the next video, which is an optional video, to go into greater depth about how we came up with those formulas.

So let me just summarize again the equations for forward propagation. You have Z1 = W1 X + b1, and then A1 = g1(Z1), the activation function in that layer applied element-wise to Z1, and then Z2 = W2 A1 + b2, and then finally, this is all vectorized across your training set, A2 = g2(Z2). Again, for now, if we assume you're doing binary classification, then this activation function really should be the sigmoid function, so I'll just throw that in here: A2 = sigmoid(Z2). So that's the forward propagation, or the left-to-right forward computation, for your neural network. Next let's compute the derivatives, so this is the backpropagation step.
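The parameter shapes, the four forward-propagation equations, and the binary-classification cost described above can be sketched in NumPy roughly as follows. This is a minimal illustration, not code from the video: the function names, the tanh hidden activation (the transcript leaves g1 unspecified), the 0.01 scaling of the random weights, and the toy data are all assumptions.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation, used for the output layer in binary classification."""
    return 1.0 / (1.0 + np.exp(-z))

def init_params(n0, n1, n2, seed=0):
    """Random weights (scaled by 0.01, an assumed choice) and zero biases."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.standard_normal((n1, n0)) * 0.01,  # n1 by n0
        "b1": np.zeros((n1, 1)),                     # n1 by 1 column vector
        "W2": rng.standard_normal((n2, n1)) * 0.01,  # n2 by n1
        "b2": np.zeros((n2, 1)),                     # n2 by 1
    }

def forward(X, p):
    """Vectorized forward propagation; X holds one example per column."""
    Z1 = p["W1"] @ X + p["b1"]
    A1 = np.tanh(Z1)            # g1 is assumed to be tanh here
    Z2 = p["W2"] @ A1 + p["b2"]
    A2 = sigmoid(Z2)            # g2 is the sigmoid, as in the video
    return Z1, A1, Z2, A2

def cost(A2, Y):
    """Average logistic-regression loss over the m examples."""
    m = Y.shape[1]
    return float(-np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / m)

# Hypothetical toy data: n0 = 3 features, m = 5 examples.
X = np.random.default_rng(1).standard_normal((3, 5))
Y = np.array([[0.0, 1.0, 1.0, 0.0, 1.0]])

params = init_params(n0=3, n1=4, n2=1)
Z1, A1, Z2, A2 = forward(X, params)
print(A1.shape, A2.shape)  # hidden activations are n1 by m, predictions n2 by m
print(cost(A2, Y))
```

With weights this small the predictions all start close to 0.5, so the initial cost is close to ln 2; that is a property of this particular initialization, not of the equations themselves.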
We're going to compute dZ2 = A2 - Y, the ground truth. Just as a reminder, all this is vectorized across examples, so the matrix Y is this 1 by m matrix that lists all of your m examples stacked horizontally. Then it turns out dW2 is equal to this (in fact, these first three equations are very similar to gradient descent for logistic regression), and db2 is computed with np.sum of dZ2, with axis=1 and keepdims=True. Just a little detail: np.sum is a Python NumPy command for summing across one dimension of a matrix, in this case summing horizontally, and what keepdims does is prevent Python from outputting one of those funny rank 1 arrays, where the dimensions were (n,). So by having keepdims=True, this ensures that Python outputs for db2 a vector that is n by 1. Technically this will be, I guess, n2 by 1; in this case it's just a 1 by 1 number, so maybe it doesn't matter, but later on we'll see when it really matters.

So far what we've done is very similar to logistic regression. But now, as you continue to run backpropagation, you would compute dZ1 as W2 transpose times dZ2, times g1 prime of Z1. This quantity g1 prime is the derivative of whatever activation function you used for the hidden layer. For the output layer, I assume that you're doing binary classification with the sigmoid function, so that's already baked into that formula for dZ2. This "times" is an element-wise product: this here, W2 transpose times dZ2, is going to be an n1 by m matrix, and this element-wise derivative thing is also going to be an n1 by m matrix, so this "times" is an element-wise product of two matrices. Then finally dW1 is equal to that, and db1 is equal to this: np.sum(dZ1, axis=1, keepdims=True). Whereas previously keepdims maybe mattered less, if n2 is equal to 1, so it's just a 1 by 1 thing, just a real number, here db1 will be an n1 by 1 vector, and so you want np.sum to output something of this dimension rather than a funny rank 1 array of that dimension, which could end up messing up some of your later calculations. The other way would be to not have the keepdims parameter but to explicitly call reshape, to reshape the output of np.sum into the dimension you would like db to have.

So that was forward propagation in, I guess, four equations, and backpropagation in, I guess, six equations. I know I just wrote down these equations, but in the next, optional video, let's go over some intuitions for how the six equations for the backpropagation algorithm were derived. Please feel free to watch that or not. Either way, if you implement these algorithms, you will have a correct implementation of forward prop and backprop, and you'll be able to compute the derivatives you need in order to apply gradient descent to learn the parameters of your neural network. It is possible to implement this algorithm and get it to work without deeply understanding the calculus; a lot of successful deep learning practitioners do so. But if you want, you can also watch the next video, just to get a bit more intuition about the derivation of these equations.
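The six backpropagation equations are shown on the slides rather than spoken, so the sketch below writes them out in NumPy as they are usually stated for this network, including the 1/m averaging factor that the transcript does not mention aloud. The function name `backward`, the tanh hidden activation (so that g1 prime is 1 - tanh squared), and the toy data are assumptions made for illustration.

```python
import numpy as np

def backward(X, Y, W2, Z1, A1, A2):
    """Backpropagation for one hidden layer with a sigmoid output.

    Assumes the hidden activation g1 is tanh, so g1'(Z1) = 1 - tanh(Z1)**2.
    """
    m = X.shape[1]
    dZ2 = A2 - Y                                   # n2 by m
    dW2 = (dZ2 @ A1.T) / m                         # n2 by n1
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m   # n2 by 1
    g1_prime = 1.0 - np.tanh(Z1) ** 2              # n1 by m
    dZ1 = (W2.T @ dZ2) * g1_prime                  # element-wise product
    dW1 = (dZ1 @ X.T) / m                          # n1 by n0
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m   # n1 by 1
    return dW1, db1, dW2, db2

# Hypothetical toy setup: n0 = 3 features, n1 = 4 hidden units, m = 5 examples.
rng = np.random.default_rng(0)
n0, n1, n2, m = 3, 4, 1, 5
X = rng.standard_normal((n0, m))
Y = rng.integers(0, 2, size=(n2, m)).astype(float)
W1 = rng.standard_normal((n1, n0)) * 0.01
b1 = np.zeros((n1, 1))
W2 = rng.standard_normal((n2, n1)) * 0.01
b2 = np.zeros((n2, 1))

# Forward pass, needed to get the quantities that backprop consumes.
Z1 = W1 @ X + b1
A1 = np.tanh(Z1)
A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))

dW1, db1, dW2, db2 = backward(X, Y, W2, Z1, A1, A2)
print(dW1.shape, db1.shape, dW2.shape, db2.shape)  # (4, 3) (4, 1) (1, 4) (1, 1)

# The keepdims detail: without it, np.sum returns a rank 1 array.
dZ1 = (W2.T @ (A2 - Y)) * (1.0 - np.tanh(Z1) ** 2)
print(np.sum(dZ1, axis=1).shape)                 # (4,)
print(np.sum(dZ1, axis=1, keepdims=True).shape)  # (4, 1)
print(np.sum(dZ1, axis=1).reshape(n1, 1).shape)  # (4, 1), the reshape alternative
```

Each gradient has the same shape as the parameter it updates, which is a quick sanity check worth running whenever the layer sizes change.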

Original Description

Take the Deep Learning Specialization: http://bit.ly/32KQSWb Check out all our courses: https://www.deeplearning.ai Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch Follow us: Twitter: https://twitter.com/deeplearningai_ Facebook: https://www.facebook.com/deeplearningHQ/ Linkedin: https://www.linkedin.com/company/deeplearningai

This video teaches how to implement gradient descent for a neural network with one hidden layer, covering forward and back propagation, and providing the necessary equations for computing partial derivatives. It is a fundamental concept in machine learning and deep learning.

Key Takeaways
  1. Initialize parameters randomly
  2. Compute predictions
  3. Compute partial derivatives
  4. Update parameters using gradient descent
💡 Understanding the derivation of the back propagation equations is not necessary to implement them, but it can provide valuable intuition for debugging and optimizing neural networks.
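
The four steps above could be combined into a single training loop roughly as follows. This is a sketch under stated assumptions rather than the course's reference implementation: the toy dataset, the hidden-layer size, the tanh hidden activation, the 0.01 weight scale, the learning rate, and the iteration count are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 2 features, 200 examples stacked as columns,
# labeled 1 when the two features sum to a positive number.
X = rng.standard_normal((2, 200))
Y = (X[0] + X[1] > 0).astype(float).reshape(1, -1)
m = X.shape[1]
n0, n1, n2 = X.shape[0], 4, 1
alpha = 0.5  # learning rate (assumed value)

# 1. Initialize parameters randomly (biases can start at zero).
W1 = rng.standard_normal((n1, n0)) * 0.01
b1 = np.zeros((n1, 1))
W2 = rng.standard_normal((n2, n1)) * 0.01
b2 = np.zeros((n2, 1))

costs = []
for _ in range(1000):
    # 2. Compute predictions (forward propagation).
    Z1 = W1 @ X + b1
    A1 = np.tanh(Z1)                      # hidden activation assumed to be tanh
    Z2 = W2 @ A1 + b2
    A2 = 1.0 / (1.0 + np.exp(-Z2))        # sigmoid output

    # Track the cost; clipping only guards the logarithm against log(0).
    A2c = np.clip(A2, 1e-12, 1 - 1e-12)
    costs.append(float(-np.mean(Y * np.log(A2c) + (1 - Y) * np.log(1 - A2c))))

    # 3. Compute partial derivatives (backpropagation).
    dZ2 = A2 - Y
    dW2 = (dZ2 @ A1.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (1.0 - A1 ** 2)  # tanh'(Z1) = 1 - A1**2
    dW1 = (dZ1 @ X.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m

    # 4. Update parameters using gradient descent.
    W1 -= alpha * dW1
    b1 -= alpha * db1
    W2 -= alpha * dW2
    b2 -= alpha * db2

print(costs[0], costs[-1])  # the cost should end well below where it started
```

Recording the cost at every iteration is optional, but it gives a simple way to see whether the parameters look like they are converging.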
