Gradient Descent For Neural Networks (C1W3L09)
Key Takeaways
The video demonstrates how to implement gradient descent for a neural network with one hidden layer, providing the necessary equations for forward and back propagation. It covers the basics of neural network parameters, cost functions, and gradient descent updates.
Full Transcript
All right, I think this will be an exciting video. In this video, you'll see how to implement gradient descent for your neural network with one hidden layer. I'm going to just give you the equations you need to implement in order to get back propagation, or gradient descent, working, and then in the video after this one I'll give some more intuition about why these particular equations are the correct equations for computing the gradients you need for your neural network.
So your neural network with a single hidden layer, for now, will have parameters W1, b1, W2, and b2. As a reminder, if you have n_x, or alternatively n0, input features, n1 hidden units, and n2 output units (in our examples so far, n2 = 1), then the matrix W1 will be n1 by n0, and b1 will be an n1-dimensional vector, so you can write it down as an n1-by-1 matrix, really a column vector. The dimensions of W2 will be n2 by n1, and the dimension of b2 will be n2 by 1, where again, so far we've only seen examples where n2 is equal to 1, where you have just a single [output] unit.
You also have a cost function for a neural network, and for now I'm just going to assume that you're doing binary classification. In that case, the cost of your parameters, J(W1, b1, W2, b2), is going to be 1/m times the sum of the loss function over your m training examples, that is, the average loss. L here is the loss when your neural network predicts ŷ (this is really A2) when the ground-truth label is equal to y. If you're doing binary classification, the loss function can be exactly what you used for logistic regression earlier.
So to train the parameters of your algorithm, you need to perform gradient descent. When training a neural network, it is important to initialize the parameters randomly rather than to all zeros; we'll see later why that's the case. After initializing the parameters to something, each loop of gradient descent would compute the predictions, so you basically compute ŷ(i)
for i = 1 through m, say. Then you need to compute the derivatives. You need to compute dW1, which is the derivative of the cost function with respect to the parameter W1, and another variable, which I'm going to call db1, which is the derivative, or the slope, of your cost function with respect to the variable b1, and similarly for the other parameters W2 and b2. Then finally, the gradient descent update would be to update W1 as W1 minus alpha, the learning rate, times dW1; b1 gets updated as b1 minus the learning rate times db1; and similarly for W2 and b2. Sometimes I use colon-equals (:=) and sometimes equals (=); either notation works fine. So this would be one iteration of gradient descent, and then you repeat this some number of times until your parameters look like they're converging.
In previous videos, we talked about how to compute the predictions, how to compute the outputs, and we saw how to do that in a vectorized way as well. So the key is to know how to compute these partial derivative terms, dW1 and db1, as well as the derivatives dW2 and db2. What I'd like to do is just give you the equations you need in order to compute these derivatives, and I'll defer to the next video, which is an optional video, to go into greater depth about how we came up with those formulas.
So let me just summarize again the equations for forward propagation. You have Z1 = W1 X + b1, and then A1 = g1(Z1), the activation function in that layer applied element-wise to Z1; then Z2 = W2 A1 + b2; and finally, with all of this vectorized across your training set, A2 = g2(Z2). Again, for now, if we assume you're doing binary classification, then this activation function really should be the sigmoid function, so I'll just throw that in here. So that's the forward propagation, or the left-to-right forward computation, for your neural network. Next, let's compute the derivatives; this is the back propagation step.
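Before moving on to the derivatives, the four forward-propagation equations summarized above could be sketched in NumPy roughly as follows. This is only an illustration: the function names (`sigmoid`, `forward_propagation`), the parameter dictionary, the layer sizes, and the choice of tanh for the hidden activation g1 are assumptions made for this example and are not specified in the lecture.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, used here as the output activation g2."""
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagation(X, params):
    """Vectorized forward pass for a network with one hidden layer.

    X has shape (n0, m): one column per training example.
    params holds W1 (n1, n0), b1 (n1, 1), W2 (n2, n1), b2 (n2, 1).
    """
    Z1 = params["W1"] @ X + params["b1"]   # (n1, m); b1 broadcasts across the m columns
    A1 = np.tanh(Z1)                       # g1 = tanh is an assumption for this sketch
    Z2 = params["W2"] @ A1 + params["b2"]  # (n2, m)
    A2 = sigmoid(Z2)                       # g2 = sigmoid for binary classification
    return Z1, A1, Z2, A2

# Hypothetical sizes: n0 = 3 input features, n1 = 4 hidden units, n2 = 1, m = 5 examples.
rng = np.random.default_rng(0)
params = {
    "W1": rng.standard_normal((4, 3)) * 0.01,  # small random values, not all zeros
    "b1": np.zeros((4, 1)),
    "W2": rng.standard_normal((1, 4)) * 0.01,
    "b2": np.zeros((1, 1)),
}
X = rng.standard_normal((3, 5))
Z1, A1, Z2, A2 = forward_propagation(X, params)
print(A2.shape)  # (1, 5)
```

Each column of A2 is the prediction ŷ for one training example, which is why its shape matches the 1-by-m label matrix Y.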
We're going to compute dZ2 = A2 − Y, the ground truth. And just as a reminder, all this is vectorized across examples, so the matrix Y is this 1-by-m matrix that lists all of your m examples stacked horizontally. Then it turns out dW2 is equal to this [(1/m) dZ2 A1^T], and db2 is equal to this [(1/m) np.sum(dZ2, axis=1, keepdims=True)]. In fact, these first three equations are very similar to gradient descent for logistic regression. The axis=1, keepdims=True is just a little detail: this np.sum is a Python NumPy command for summing across one dimension of a matrix, in this case summing horizontally, and what keepdims does is prevent Python from outputting one of those funny rank-1 arrays, where the dimensions are (n,). So having keepdims=True ensures that Python outputs for db2 a vector that is some n by 1. Technically this will be, I guess, n2 by 1; in this case it's just a 1-by-1 number, so maybe it doesn't matter, but later on we'll see when it really matters.
So far, what we've done is very similar to logistic regression. But now, as you continue to run back propagation, you would compute dZ1 = [W2^T] dZ2 * g1'(Z1). This quantity g1' is the derivative of whatever activation function you used for the hidden layer; for the output layer, I assume that you're doing binary classification with the sigmoid function, so that's already baked into that formula for dZ2. And this times is an element-wise product. So this here [W2^T dZ2] is going to be an n1-by-m matrix, and this element-wise derivative thing is also going to be an n1-by-m matrix, and so this times is an element-wise product of two matrices. Then finally, dW1 is equal to that [(1/m) dZ1 X^T], and db1 is equal to this [(1/m)] np.sum(dZ1, axis=1, keepdims=True). Whereas previously keepdims maybe mattered less if n2 is equal to 1, so it's just a 1-by-1 thing, just a real number, here db1 will be an n1-by-1 vector, and so you want Python, you want np.sum, to output
something of this dimension rather than a funny rank-1 array of that dimension, which could end up messing up some of your later calculations. The other way would be to not use the keepdims parameter, but to explicitly call reshape to reshape the output of np.sum into this dimension, which you would like db to have.
So that was forward propagation in, I guess, four equations, and back propagation in, I guess, six equations. I know I just wrote down these equations, but in the next optional video, let's go over some intuitions for how the six equations for the back propagation algorithm were derived. Please feel free to watch that or not, but either way, if you implement these algorithms, you will have a correct implementation of forward prop and back prop, and you'll be able to compute the derivatives you need in order to apply gradient descent to learn the parameters of your neural network. It is possible to implement this algorithm and get it to work without deeply understanding the calculus; a lot of successful deep learning practitioners do so. But if you want, you can also watch the next video just to get a bit more intuition about the derivation of these equations.
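The point about keepdims and reshape can be seen in a small NumPy experiment. The matrix `dZ1` below is made-up data with a hypothetical shape of (n1, m) = (4, 3), and the 0.1 learning rate is arbitrary; only `np.sum`, `keepdims`, and `reshape` come from the lecture.

```python
import numpy as np

dZ1 = np.arange(12, dtype=float).reshape(4, 3)  # hypothetical (n1, m) = (4, 3) matrix
m = dZ1.shape[1]

db1_rank1 = np.sum(dZ1, axis=1) / m                      # shape (4,): a rank-1 array
db1_keep = np.sum(dZ1, axis=1, keepdims=True) / m        # shape (4, 1): a column vector
db1_reshaped = (np.sum(dZ1, axis=1) / m).reshape(-1, 1)  # same result via explicit reshape

print(db1_rank1.shape, db1_keep.shape, db1_reshaped.shape)  # (4,) (4, 1) (4, 1)

# Why it matters: combining an (n1, 1) column with a rank-1 array broadcasts to (n1, n1).
b1 = np.zeros((4, 1))
print((b1 - 0.1 * db1_rank1).shape)  # (4, 4), silently the wrong shape
print((b1 - 0.1 * db1_keep).shape)   # (4, 1)
```

NumPy raises no error in the rank-1 case, so the update quietly produces a matrix of the wrong shape, which is the kind of downstream problem the lecture warns about.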
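Putting the pieces together, one way to sketch the whole procedure (random initialization, the four forward-propagation equations, the six back-propagation equations, and the parameter update) is shown below. Several things here are assumptions rather than content from the transcript: the helper names, tanh as the hidden activation (so g1'(Z1) = 1 − A1²), the 0.01 scaling of the initial weights, the toy dataset, the learning rate, and the iteration count. Where the speaker only says "this", the gradient formulas used are the standard vectorized forms with the 1/m factor.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(n0, n1, n2, rng):
    """Small random weights (not all zeros) and zero biases; the 0.01 scale is an assumption."""
    return {
        "W1": rng.standard_normal((n1, n0)) * 0.01,
        "b1": np.zeros((n1, 1)),
        "W2": rng.standard_normal((n2, n1)) * 0.01,
        "b2": np.zeros((n2, 1)),
    }

def cost(A2, Y):
    """Average logistic (cross-entropy) loss over the m examples."""
    m = Y.shape[1]
    eps = 1e-12  # guards against log(0); an implementation detail, not from the lecture
    return float(-np.sum(Y * np.log(A2 + eps) + (1 - Y) * np.log(1 - A2 + eps)) / m)

def gradient_descent_step(X, Y, p, alpha):
    """One iteration: forward prop, back prop, update. Returns the cost before the update."""
    m = X.shape[1]
    # Forward propagation (four equations).
    Z1 = p["W1"] @ X + p["b1"]
    A1 = np.tanh(Z1)                          # g1 = tanh (assumed)
    Z2 = p["W2"] @ A1 + p["b2"]
    A2 = sigmoid(Z2)
    # Back propagation (six equations).
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (p["W2"].T @ dZ2) * (1 - A1 ** 2)   # element-wise product with tanh'(Z1)
    dW1 = dZ1 @ X.T / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    # Gradient descent update.
    p["W1"] -= alpha * dW1
    p["b1"] -= alpha * db1
    p["W2"] -= alpha * dW2
    p["b2"] -= alpha * db2
    return cost(A2, Y)

# Toy binary classification problem (made up for illustration): label is 1 when x0 + x1 > 0.
rng = np.random.default_rng(1)
X = rng.standard_normal((2, 200))
Y = (X[0:1, :] + X[1:2, :] > 0).astype(float)  # shape (1, m)

params = init_params(n0=2, n1=4, n2=1, rng=rng)
costs = [gradient_descent_step(X, Y, params, alpha=0.5) for _ in range(1000)]
print("first cost:", costs[0], "last cost:", costs[-1])
```

In practice you would stop when the cost stops changing appreciably rather than after a fixed number of iterations; the fixed count here just keeps the sketch short.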
Original Description
Take the Deep Learning Specialization: http://bit.ly/32KQSWb
Check out all our courses: https://www.deeplearning.ai
Subscribe to The Batch, our weekly newsletter: https://www.deeplearning.ai/thebatch
Follow us:
Twitter: https://twitter.com/deeplearningai_
Facebook: https://www.facebook.com/deeplearningHQ/
Linkedin: https://www.linkedin.com/company/deeplearningai