Neural Network from scratch - Part 3 (Backward Propagation)
Key Takeaways
The video walks through the backward propagation calculations for a two-layer feedforward neural network with a ReLU activation and a softmax cross-entropy loss, using a computational graph, the chain rule, and partial derivatives. It shows step by step how to compute the derivatives of the loss with respect to the weights and biases, which are later used to update them with gradient descent.
Full Transcript
After doing all of the calculations for forward propagation, we can view them as a computational graph, just like this. We have the computation for the linear part of layer one, which uses the weights and the biases, then the activation function applied to that linear calculation, then the second layer, and lastly we compute the loss. So we can view all of the steps in the computation as a computational graph.

Now we want to do the calculations for backward propagation, which is really what the neural network learns from. We start by working our way from the back to the front. What we really want are the gradients for the weights and the biases, but to get to that point we need to start with the derivative of the loss with respect to this node, z2.

First of all, remember that the loss for a particular training example i is minus the log of e raised to z of the correct label, so our computed score for the correct label, divided by the sum over the classes c = 1 to capital C of e raised to z of that particular class: L_i = -log( e^(z_{y_i}) / sum_{c=1..C} e^(z_c) ). If we remember the log rules, log(x / y) equals log x minus log y. Using that, we first get minus the log of e raised to z_{y_i}, and then minus the log of the bottom part. But there is already a minus in front of that minus, so we get plus the log of the sum over c = 1 to capital C of e raised to z_c. So this is how we can write the loss: L_i = -log( e^(z_{y_i}) ) + log( sum_{c=1..C} e^(z_c) ).

Now what we want to compute is the derivative of the loss, the thing we just wrote here, with respect to z2 for a particular node k. This arrow is what we're trying to compute right now. We write the derivative first, and we can recognize right off the bat that the first term is the natural log of e raised to something, so the log and the e cancel and we get minus z_{y_i}. So we have the partial with respect to z_k of layer 2 of minus z_{y_i}, plus the partial of the log of the sum. For that second part we're going to have to use the chain rule, first with respect to the log and then with respect to the inner part.

First, simplifying a little bit: the first term will be zero if y_i is not equal to the node k that we're looking at. Here k is just any node of the output, and y_i is the correct label, the node which is the correct one for this particular training example i. So this term will be minus one if y_i is equal to k. This indicator notation means that if k is equal to y_i it is one, and here specifically minus one. For example, if y_i is 0 and k is 1, then "0 equals 1" is false, so the entire term is zero.

Moving on to the second part, we first have the outer derivative with respect to the log, so we just have 1 divided by the sum over c = 1 to capital C of e raised to z_c, and then the inner derivative, with respect to z_k, of the sum. I'm going to swap the order of the two terms, since one is minus and the other is plus, and write the 1 over the sum first. The inner sum is e raised to z_1 plus e raised to z_2, et cetera, all the way up to e raised to z_C, and all of these z's belong to layer 2. Somewhere in between there is the term for k. When we take the partial with respect to z_k, this term becomes zero, this one becomes zero, and all the others become zero, except the k term, whose derivative is just e raised to z_k. So we have e raised to z_k divided by the sum, and, remembering the first part, minus one if y_i is equal to k.

Now we can recognize that this part right here is just the softmax that we computed in the forward propagation, and the only thing we need to add is the minus one if y_i equals k: dL_i/dz2_k = softmax(z2)_k - 1[y_i = k].

What we just computed was really this arrow. But we want to know how we should update our weights so that the next iteration will be better for our network. For W of layer 2, we need the derivative of z2 with respect to the weights, so we need to move backwards in this direction. And remember, if we want the derivative of the loss with respect to W2, we need to multiply those two derivatives, by the chain rule. So what we want now is the derivative of z2 with respect to W of layer 2. Remember that z2 is just a1 W2 plus b2. This derivative is just equal to a1: the derivative of b2 with respect to W2 is zero, and the derivative of W2 with respect to itself is one, so we have a1 left.

We've computed this derivative as well. Now we want b2, so we go backwards to b2. This one is quite simple. Similarly, we take the derivative with respect to b2 of a1 W2 plus b2. The a1 W2 part is zero, and the last part is an identity matrix, since we're really doing matrix calculus, but it's just one: we're taking the derivative of a variable with respect to the exact same variable.

One thing to keep in mind is that the biases are local to specific nodes, whereas in z2 we do the computation for a lot of different examples simultaneously. All the rows of z2 are the examples, images of handwritten digits for instance. When we do the gradient descent part and update the weights and the biases later on, this derivative right here will be of size (examples, features in layer 2), but the biases are independent of the number of examples, so they are of size (1, features in layer 2), and we're going to subtract one from the other. Obviously this doesn't work, since the dimensions don't match. The thing to keep in mind is that when several gradients come in to one particular node, as in this case where we have several examples with gradients to a single node, the solution is to add the gradients. That's one thing to remember when we do the implementation.

We've calculated all three of these. Now we need to move backwards again in the computational graph, from z2 to a1, in this direction. So we take the derivative of z2 with respect to a1. We have z2 equals a1 W2 plus b2, which is the calculation from the forward propagation, and we're just moving backwards. There's one tricky part here: we're not just doing normal derivatives, we're doing matrix derivatives, because all of these are matrices and this one is a vector. The derivative will be W2, which is exactly what we expect, but the tricky part is that it will be W2 transpose. I won't go into exactly why it's the transpose. It can be shown quite easily, but it would take some time and distract from the point.

So we've just calculated this part. Now we want to move backwards again and calculate this one, and this is the last tricky part. We want to compute the partial of a1 with respect to z1. Remember that a1 is just a maximum, max(0, z1), because that is the activation function. There's one part here that's particularly tricky, since we're doing derivatives with matrices. If we have an input of size 1 by 1000, some type of function, and an output of size 1 by 1000, then in theory each value of the input vector could have impacted all of the outputs. So each output needs a derivative with respect to every input. If we call the function capital F and the input something else, let's call it A, not to be confused with this a, then the partial of F with respect to A will be a 1000 by 1000 matrix. That's a lot of numbers.

The thing about the ReLU is that it's applied element-wise, so we know that not every value of the input will impact every value of the output. It's actually one-to-one: the first value of our input will impact exactly the first value of our output, because the function is applied element-wise. Not to expand too much on this, but we can save a lot of compute by doing the derivative element-wise. The reason we can do it is that if F is an element-wise function, all the values of the derivative except those on the diagonal will be zero. In fact, even some values on the diagonal will be exactly zero, because the ReLU function acts as a gradient router. If z1 is greater than zero, it's passed through with the exact same value. If z1 is negative, it's set to zero, so there's really nothing impacting the value of the output. It's just a router: it checks whether the value is greater than zero, then lets it pass, and otherwise sets it to zero. So when we do the implementation, we can do this calculation element-wise instead of forming the complete Jacobian matrix, which will save a lot of compute.

We've done all the tricky parts now. We've computed this one, this one, this one, this one, and this one, and the only thing left is to compute these ones, which are in essence the exact same computations that we've already done. It will be exactly the same, just with the numbers from layer two changed to layer one. So that's it for the computation of the backward propagation. In the next video we will see how we can actually use these calculations to implement a neural network from scratch in NumPy. Thank you for watching.
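The steps walked through in the transcript can be sketched in NumPy roughly as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the function names (`forward`, `loss_fn`, `backward`), the layer sizes, the random data, and the choice to average the loss over the examples are all hypothetical. It shows the softmax-minus-indicator gradient, the summing of bias gradients over examples, the transpose of W2, and the element-wise ReLU mask used in place of a full Jacobian.

```python
import numpy as np


def forward(X, W1, b1, W2, b2):
    """Forward pass: linear -> ReLU -> linear -> softmax."""
    z1 = X @ W1 + b1                               # (examples, hidden)
    a1 = np.maximum(0, z1)                         # ReLU, applied element-wise
    z2 = a1 @ W2 + b2                              # (examples, classes)
    shifted = z2 - z2.max(axis=1, keepdims=True)   # stabilises exp, same softmax
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=1, keepdims=True)   # softmax per example
    return z1, a1, probs


def loss_fn(probs, y):
    """Cross-entropy loss, averaged over examples (averaging is an assumption)."""
    n = y.shape[0]
    return -np.log(probs[np.arange(n), y]).mean()


def backward(X, y, W2, z1, a1, probs):
    """Backward pass, moving from the loss back towards the input."""
    n = X.shape[0]

    # dL/dz2 = softmax(z2) - 1[y_i = k]; divided by n because the loss is a mean.
    dz2 = probs.copy()
    dz2[np.arange(n), y] -= 1
    dz2 /= n

    # dz2/dW2 is a1; in matrix form the chain rule becomes a1^T @ dz2.
    dW2 = a1.T @ dz2
    # Biases are shared by all examples, so their gradients are added up.
    db2 = dz2.sum(axis=0, keepdims=True)           # (1, classes)

    # dz2/da1 is W2, which shows up transposed in matrix form.
    da1 = dz2 @ W2.T
    # ReLU routes the gradient: pass it where z1 > 0, zero it elsewhere.
    dz1 = da1 * (z1 > 0)

    # Layer 1 repeats the layer 2 computations with the input X in place of a1.
    dW1 = X.T @ dz1
    db1 = dz1.sum(axis=0, keepdims=True)           # (1, hidden)
    return dW1, db1, dW2, db2


# Tiny made-up problem: 5 examples, 4 input features, 6 hidden units, 3 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
y = rng.integers(0, 3, size=5)
W1 = rng.normal(size=(4, 6)) * 0.1
b1 = np.zeros((1, 6))
W2 = rng.normal(size=(6, 3)) * 0.1
b2 = np.zeros((1, 3))

z1, a1, probs = forward(X, W1, b1, W2, b2)
dW1, db1, dW2, db2 = backward(X, y, W2, z1, a1, probs)
print(dW1.shape, db1.shape, dW2.shape, db2.shape)
```

Each gradient has the same shape as the parameter it belongs to, which is what makes a later update step of the form "parameter minus learning rate times gradient" well defined. Multiplying by the mask `(z1 > 0)` is the element-wise shortcut described above: it gives the same result as multiplying by the diagonal Jacobian of the ReLU without ever building that matrix.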
Original Description
In this video we go through backward propagation calculations for a feedforward neural network!
Playlist
Uploads from Aladdin Persson · Aladdin Persson · 5 of 60