Neural Network from scratch - Part 3 (Backward Propagation)

Aladdin Persson · Advanced ·📐 ML Fundamentals ·7y ago

Key Takeaways

The video walks through the backward propagation calculations for a two-layer feedforward neural network, using a computational graph, the chain rule, and partial derivatives. It shows step by step how to compute the derivatives of the loss with respect to the weights and biases, which are the gradients later used to update them.

Full Transcript

After doing all of the calculations for forward propagation, we can view it as a computational graph, just like this. We have the output from our first layer: the linear part of layer one, z1, which uses the weights and the biases, and then the activation function applied to that linear calculation, a1. Then we have the second layer, z2, and lastly we compute the loss. So we can view all of the steps in the computation as a computational graph.

Now we want to do the calculations for backward propagation, which is really what the neural network learns from. We work our way from the back to the front. Our real goal is to calculate the gradients for the weights and the biases, but to get to that point we need to start with the derivative of the loss with respect to this node, z2.

First of all, remember that the loss for a particular training example i is minus the log of e raised to z of the correct label y_i, divided by the sum over all classes c = 1 to capital C of e raised to z of that class. Remember the log rule: log(x / y) = log x - log y. Using it, we first get minus the log of e raised to z_{y_i}, and then minus the log of the bottom part, but since there is already a minus in front, that becomes plus the log of the sum over c = 1 to C of e raised to z_c. So this is how we can write the loss.

What we want to compute now is the derivative of the loss, the thing we just wrote, with respect to z2 for a particular node k. This arrow is what we're trying to compute right now. We write out the derivative, and we can recognize right away that the log here is the natural log, so the log and the e cancel and the first term becomes minus z_{y_i}. So we have the partial derivative with respect to z_k of layer 2 of minus z_{y_i}, plus the partial of the log of the sum. For the second term we're going to have to use the chain rule: first with respect to the log, and then with respect to the inner part.

Simplifying the first term: it is 0 if y_i is not equal to the node k we're looking at. Here k is just any node of the output, and y_i is the correct label, the node that is the correct one for this particular training example i. The term is minus 1 if y_i is equal to k. This notation means that it is 1 if k is equal to y_i, and with the sign in front, minus 1. For example, if y_i is 0 and k is 1, then we are asking whether 0 is equal to 1, which is false, so the entire thing is 0.

Moving on to the second part, we first have the outer derivative, with respect to the log, which is just 1 divided by the sum over c = 1 to C of e raised to z_c. Then we have the inner derivative, the derivative of the sum. I'm going to swap the order of the two terms, this one is minus and this one is plus, so I write the 1 divided by the sum first. The inner part is a sum: e raised to z_1 plus e raised to z_2 and so on, all the way up to e raised to z_C. All of these z are specific to this layer, so they should all carry the layer 2 index. If we write out some of the values to make this clear, somewhere in between there is the term for k, and at the end we have e raised to z_C. What happens when we differentiate with respect to z_k? This term becomes 0, this term becomes 0, all the others become 0, and only the term for k remains: the partial of e raised to z_k with respect to z_k is just e raised to z_k.

Putting it together, and remembering the last part, we get e raised to z_k divided by the sum over all classes of e raised to z_c, minus 1 if y_i is equal to k. We can recognize that the first part is just the softmax that we computed in the forward propagation, and the only thing we need to add is the minus 1 when y_i equals k.

What we just computed was really this arrow. But we want to know how we should update our weights so that the network is better on the next iteration. For W of layer 2, we need the derivative of z2 with respect to the weights, so we move backwards in this direction. Remember that if we want the derivative of the loss with respect to W2, we need to multiply those two derivatives together, by the chain rule.

So what we want now is the derivative of z2 with respect to W of layer 2. Remember that z2 is a1 W2 plus b2. This derivative is just a1, because the derivative of b2 with respect to W2 is 0 and the derivative of W2 with respect to itself is 1, so we have a1 left. So we've computed this derivative as well.

Now we want b2, so we go backwards to b2. This one is quite simple. Similarly, we take the derivative with respect to b2 of a1 W2 plus b2. The first part is 0, and the last part is an identity matrix, since we're really doing matrix calculus, but it's just one: we are taking the derivative of a variable with respect to the exact same variable.

One thing to keep in mind is that the biases are local to specific nodes, while in z2 we do the computation for a lot of different examples simultaneously: the rows of z2 are the examples, images of handwritten digits for instance. When we do the gradient descent part and update the weights or the biases later on, the derivative we use here will be of size (examples, features in layer 2). But the biases are independent of the number of examples, so b2 is of size (1, features in layer 2), and we're going to subtract one from the other. Obviously this doesn't work, the dimensions don't match. The thing to keep in mind is that when there are several incoming gradients to one particular node, as in this case where several examples send gradients to a single node, the solution is to add the gradients. That's one thing to remember when we're actually doing the implementation.

Okay, so we've calculated all three of these. Now we need to move backwards again in the computational graph, from z2 to a1, in this direction. So we take the derivative of z2 with respect to a1. We have z2 equals a1 W2 plus b2; these are the calculations from the forward propagation, and we're just moving backwards. The derivative of z2 with respect to a1 is just this part, W2, which is what we expect. But there's one tricky part: we're not doing ordinary derivatives, we're doing matrix derivatives, because all of these are matrices and this is a vector, so the result will be W2 transpose. I won't go into exactly why it's the transpose. It can be shown quite easily, but it would take some time and distract from the point.

So we've just calculated this part. Now we want to move backwards again and calculate this one, which is the last tricky part: the partial of a1 with respect to z1. Remember that a1 is just the maximum of 0 and z1, because that is the activation function. There's one part here that's particularly tricky, since we're doing derivatives with matrices. If we have an input of size 1 by 1,000, some function, and an output of size 1 by 1,000, then in theory each value of the input vector could have impacted every value of the output. So each output needs a derivative with respect to every input. If we call the output capital F and the input a (not to be confused with the activation a), then the partial of F with respect to a is a 1,000 by 1,000 matrix. That's a lot of numbers.

The thing about the ReLU is that it's applied element-wise, so we know that not every value of the input impacts every value of the output. It's actually one-to-one: the first value of the input impacts exactly the first value of the output. Without expanding too much on this, we can save a lot of compute by doing the derivative element-wise. The reason we can do this is that if F is applied element-wise, all the values of the derivative except those on the diagonal are zero. In fact, even some values on the diagonal will be exactly zero, because the ReLU function acts as a gradient router. If z1 is greater than zero, the gradient is passed through with the exact same value; if z1 is negative, it is set to zero, so that input has no impact on the output. It's just a router: it checks whether the value is greater than zero, lets it pass if so, and otherwise sets it to zero. So when we do the implementation, we can do this calculation element-wise instead of building the complete Jacobian matrix, which saves a lot of compute.

Okay, so we've done all the tricky parts. We've computed this one, this one, this one, this one and this one. The only thing left is to compute these remaining ones, which are in essence the exact same computations we've already done, just with layer 2 replaced by layer 1. So that's it for the computation of backward propagation. In the next video we will see how we can use these calculations to implement a neural network from scratch in NumPy. Thank you for watching.
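The steps in the transcript can be sketched in NumPy roughly as follows. This is a minimal illustration and not the code from the next video: the function names, the row-per-example layout of X, and the choice to average the loss over the N examples are all assumptions made here.

```python
import numpy as np

def softmax(Z):
    # Subtract the row-wise max for numerical stability (does not change the result).
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def forward(X, W1, b1, W2, b2):
    Z1 = X @ W1 + b1          # linear part of layer 1
    A1 = np.maximum(0, Z1)    # ReLU activation
    Z2 = A1 @ W2 + b2         # linear part of layer 2 (class scores)
    return Z1, A1, Z2

def mean_loss(Z2, y):
    # Cross-entropy of the softmax, averaged over examples (assumed normalization).
    P = softmax(Z2)
    return -np.log(P[np.arange(len(y)), y]).mean()

def backward(X, y, W1, b1, W2, b2):
    N = X.shape[0]
    Z1, A1, Z2 = forward(X, W1, b1, W2, b2)

    # dL/dZ2: softmax output, minus 1 at the correct class of each example.
    dZ2 = softmax(Z2)
    dZ2[np.arange(N), y] -= 1.0
    dZ2 /= N

    dW2 = A1.T @ dZ2                       # dZ2/dW2 is A1
    db2 = dZ2.sum(axis=0, keepdims=True)   # add the gradients over examples
    dA1 = dZ2 @ W2.T                       # dZ2/dA1 involves W2 transposed
    dZ1 = dA1 * (Z1 > 0)                   # ReLU routes the gradient element-wise
    dW1 = X.T @ dZ1                        # same pattern as layer 2
    db1 = dZ1.sum(axis=0, keepdims=True)
    return dW1, db1, dW2, db2
```

Summing over axis 0 for the bias gradients is the "add the gradients" step: it turns an (examples, features) array into a (1, features) array that matches the shape of the bias. A gradient descent step would then subtract a small multiple of each gradient from the matching parameter.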

Original Description

In this video we go through backward propagation calculations for a feedforward neural network!

This video teaches how to perform backward propagation calculations for a feedforward neural network, which is a crucial step in training neural networks. By following the steps outlined in the video, viewers can learn how to compute derivatives of loss with respect to weights and biases, and how to update them using gradients.
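For reference, the derivatives described in the video can be collected in one place. This is a sketch under assumptions not stated explicitly in the source: X denotes the input matrix with one example per row, the total loss is the sum of the per-example losses L_i (divide by the number of examples for a mean), and the delta symbols are shorthand introduced here.

```latex
\begin{aligned}
Z^{[1]} &= X W^{[1]} + b^{[1]}, \qquad A^{[1]} = \max\!\left(0, Z^{[1]}\right), \qquad Z^{[2]} = A^{[1]} W^{[2]} + b^{[2]} \\[4pt]
L_i &= -\log \frac{e^{z^{[2]}_{y_i}}}{\sum_{c=1}^{C} e^{z^{[2]}_{c}}}
     = -z^{[2]}_{y_i} + \log \sum_{c=1}^{C} e^{z^{[2]}_{c}} \\[4pt]
\frac{\partial L_i}{\partial z^{[2]}_{k}} &= \frac{e^{z^{[2]}_{k}}}{\sum_{c=1}^{C} e^{z^{[2]}_{c}}} - \mathbb{1}[y_i = k] \\[4pt]
\delta^{[2]} &= \frac{\partial L}{\partial Z^{[2]}}, \qquad
\frac{\partial L}{\partial W^{[2]}} = \left(A^{[1]}\right)^{\top} \delta^{[2]}, \qquad
\frac{\partial L}{\partial b^{[2]}} = \sum_{i} \delta^{[2]}_{i} \\[4pt]
\frac{\partial L}{\partial A^{[1]}} &= \delta^{[2]} \left(W^{[2]}\right)^{\top}, \qquad
\delta^{[1]} = \frac{\partial L}{\partial A^{[1]}} \odot \mathbb{1}\!\left[Z^{[1]} > 0\right] \\[4pt]
\frac{\partial L}{\partial W^{[1]}} &= X^{\top} \delta^{[1]}, \qquad
\frac{\partial L}{\partial b^{[1]}} = \sum_{i} \delta^{[1]}_{i}
\end{aligned}
```

Here the sums over i run over the rows (examples) of the delta matrices, which is the "add the gradients" rule for the biases, and the element-wise product with the indicator is the ReLU derivative.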

Key Takeaways
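  1. View the forward pass as a computational graph
  2. Write the loss for each training example as the negative log of the softmax of the correct class
  3. Start at the loss and move backward through the graph
  4. Compute the derivative of the loss with respect to Z2: the softmax output, minus 1 at the correct class
  5. Compute the derivative of Z2 with respect to W2, which is A1
  6. Compute the derivative of Z2 with respect to B2, which is 1, and add the gradients over the examples
  7. Compute the derivative of Z2 with respect to A1, which gives W2 transposed
  8. Compute the derivative of A1 with respect to Z1 element-wise: ReLU passes the gradient where Z1 is greater than zero
  9. Repeat the same computations for layer 1, then update weights and biases using the gradients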
💡 Backward propagation multiplies partial derivatives along the computational graph using the chain rule. A full Jacobian can be large (1,000 by 1,000 for a 1,000-value activation), but because ReLU is applied element-wise, only the diagonal is non-zero, so its derivative can be computed element-wise at much lower cost.
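To make the saving concrete, the short sketch below compares the two ways of backpropagating through ReLU for a single example: building the full diagonal Jacobian and multiplying by it, versus applying an element-wise mask. The function names are hypothetical and chosen only for this illustration.

```python
import numpy as np

def relu_backward_jacobian(z, upstream):
    # Full n-by-n Jacobian of ReLU: zero everywhere except the diagonal,
    # where the entry is 1 if z > 0 and 0 otherwise.
    J = np.diag((z > 0).astype(z.dtype))
    return upstream @ J

def relu_backward_elementwise(z, upstream):
    # Same result without ever building the n-by-n matrix.
    return upstream * (z > 0)

z = np.array([-1.0, 2.0, 0.5, -3.0])
upstream = np.array([10.0, 20.0, 30.0, 40.0])
print(relu_backward_jacobian(z, upstream))
print(relu_backward_elementwise(z, upstream))
```

Both calls give the same gradient, but the element-wise version needs n multiplications instead of storing and multiplying an n-by-n matrix.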
