Neural Network from scratch - Part 3 (Backward Propagation)

Aladdin Persson · Advanced · 📐 ML Fundamentals · 7y ago

Skills: ML Maths Basics 90% · Supervised Learning 60% · ML Pipelines 50%

Key Takeaways

The video demonstrates the backward propagation calculations for a feedforward neural network, covering computational graphs, the chain rule, and partial derivatives. It gives a step-by-step guide to computing the derivatives of the loss with respect to the weights and biases, which are then used to update them.

Full Transcript

After doing all of the calculations for forward propagation, we can view them as a computational graph, just like this. We have the output from our first layer, we have the z computed for the linear part of layer one, which uses the weights and the biases, then we have the activation function applied to that linear calculation, and lastly we have the second layer, and then we compute the loss. So we can view all of the steps in the computation as a computational graph.

Now we want to do the calculations for backward propagation, which is really what the neural network learns from. We start by working our way from the back to the front. Our goal is to compute first of all this one, but really we want to calculate the gradients for the weights and the biases. To get to that point, we need to start with the loss with respect to this node.

First of all, we need to remember that the loss for a particular training example i is L_i = -log( e^(z_yi) / sum over c = 1 to C of e^(z_c) ). That is minus the log of e raised to z of the correct label, so the z computed for the correct label, divided by the sum over all of the classes, from class 1 to capital C, of e raised to z of that particular class. Remember the log rule: log(x / y) = log x - log y. If we use that, we first have -log( e^(z_yi) ), and then minus the log of the bottom part, but minus times minus gives plus, so we have + log( sum over c = 1 to C of e^(z_c) ). So this is how we can write the loss.

Now what we want to compute is the derivative of the loss, the thing we just wrote here, with respect to z2 for a particular node k. So we want this arrow; this is what we are trying to compute right now. We just write the derivative first and then the minus log, and we can actually
recognize right off the bat that the log and the exponential cancel, because this is the natural log of e raised to something. So in the first term we are left with the partial with respect to z_k of layer 2 of minus z_yi, and in the second term we have the partial with respect to z_k of layer 2 of the log of the sum. Recognize that for the second term we are going to have to use the chain rule, first with respect to the log and then with respect to the inner part.

First, simplifying a little bit, we see that the first term will be 0 if y_i is not equal to the node k that we are looking at. Here k is just any node of the output, and y_i is the correct label, the node which is the correct one for this particular training example i. So the first term will be -1 if y_i is equal to k. This notation, -1[y_i = k], means that if k is equal to y_i then the indicator is 1, so the term is -1. For example, if y_i is 0 and k is 1, then we have 0 = 1, which is false, so this entire thing is 0.

If we move on to the second part, we first have the outer derivative with respect to the log, which is 1 divided by the sum over all of our classes, c = 1 to capital C, of e^(z_c), and then we have the inner derivative of the sum. I'm going to swap the order of the two terms, since one is minus and one is plus, so I write the second part first: 1 divided by the sum from c = 1 to capital C, times the derivative of the sum. Recognize that the sum is e^(z_1) + e^(z_2) and so on, all the way up to e^(z_C), and that all of these z are specific to layer 2. To be very clear, if we write out some of the terms, somewhere in between there will be the k-th term, e^(z_k), and at the end we will have e^(z_C). So what happens when we take the derivative with respect to z_k? What happens is that this term will
vanish and become 0, this one will become 0, and all the others will become 0 as well, because they do not depend on z_k. The only term that survives is the k-th one, and the partial of e^(z_k) with respect to z_k is just e^(z_k). So the derivative of the sum is e^(z_k), and everything else is 0. What we have, remembering the first part, is

dL_i/dz_k = e^(z_k) / ( sum over c = 1 to C of e^(z_c) ) - 1[y_i = k]

Now we can recognize that the first part is just the softmax that we computed in forward propagation, and the only thing we need to add is the minus 1 if y_i equals k.

All right, so what we just computed was really this arrow. But we want to know how we should update our weights so that the next iteration is better for our network. For the W of layer 2, we need the derivative of z2 with respect to the weights, so we need to move backwards, in this direction. And remember, if we want the derivative of the loss with respect to W2, then we need to multiply those two derivatives together, by the chain rule.

So what we want now is the derivative of z2 with respect to W of layer 2. Remember that z2 = a1 W2 + b2. This derivative is just a1: b2 does not depend on W2, so its derivative is 0, and the derivative of W2 with respect to itself is 1, so we have a1 left.

So we have computed this derivative as well. Now we want b2, so we go backwards to b2, and this one is quite simple. Similarly to before, we take the derivative with respect to b2 of a1 W2 + b2. The a1 W2 term does not depend on b2, so it becomes 0, and the last term is an identity matrix, since we are really doing matrix calculus, but it is just one: we are taking the derivative of a variable with respect to the exact same variable, so it is just
one.

One thing to keep in mind here is that the biases are local to their specific nodes, whereas in z2 we do the computation for a lot of different examples simultaneously: all the rows of z2 are the examples, the images of handwritten digits for instance. When we do the gradient descent part and update our weights and biases later on, the derivative with respect to z2 will be of size (examples, features in layer 2). But the biases are local; they are independent of the number of examples, so they are of size (1, features in layer 2). We are going to subtract one from the other, and obviously this doesn't work, because the dimensions don't match. So the thing to keep in mind is that when several gradients flow into one particular node, as in this case, where several examples each contribute a gradient to a single node, the solution is to add the gradients. That's one thing to remember when we actually do the implementation.

OK, so we have calculated all three of these. Now we need to move backwards again in the computational graph, going from z2 to a1, in this direction. So we take the derivative of z2 with respect to a1. We have z2 = a1 W2 + b2; these are the calculations from forward propagation, and we are just moving backwards. Taking the derivative of z2 with respect to a1 leaves just this part, W2, which is exactly what we expect. But there is one tricky part here. Remember that we are not just doing normal derivatives per se; we are doing matrix derivatives, because all of these are matrices and this is a vector. So the result will be W2 transpose. I won't go
into exactly why it is the transpose; it can be shown quite easily, but it would take some time, which would distract from the point. So this is the derivative, and we have just calculated this part.

Now we want to move backwards again and calculate this one, and this is the last tricky part. We want to compute the partial of a1 with respect to z1. Remember that a1 is just a maximum, a1 = max(0, z1), because that is the activation function. There is one part here that is particularly tricky, since we are taking derivatives with matrices. If we have an input of size 1 by 1000, some type of function, and an output of size 1 by 1000, then in theory each of the values of the input vector could have affected all of the outputs. So what we need is a derivative of every output with respect to every input. If we call the function capital F and we call the input something else, let's call it a (not to be confused with this a1), then the partial of F with respect to a will be a 1000 by 1000 matrix. That's a lot of numbers.

The thing about the ReLU is that it is applied element-wise, so we know that not every value of the input will affect every value of the output. It is actually just one-to-one: the first value of our input affects exactly the first value of our output, because the function is applied element-wise. Not to expand too much on this, but we can save a lot of compute by doing the derivative element-wise. The reason we can do that is that, if F is an element-wise function, all the values of the derivative except the values on the diagonal will be zero. In fact, even some values on the diagonal will be exactly zero, because the ReLU function acts as a gradient router: if the input is greater than zero, then
it is passed through with the exact same value; if z1 is negative, it is set to zero, so there is really nothing affecting the value of the output. It is just a router: it checks whether the value is greater than zero and lets it pass, otherwise it sets it to zero. So when we do the implementation, we can do this calculation element-wise instead of building the complete Jacobian matrix, which will save a lot of compute.

OK, so we have done all the tricky parts. We have computed this one, this one, this one, this one and this one, and the only thing left is to compute all of these, which are in essence the exact same computations that we have already done; the only change is that the numbers from layer two become layer one.

So that's it for the computation of backward propagation. In the next video we will see how we can actually use these calculations to implement a neural network from scratch in NumPy. Thank you for watching.
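The derivatives walked through in the transcript can be put together as a rough NumPy sketch of one backward pass. This is only an illustration under assumptions: the function names, the layer sizes, and the choice to average the loss over the batch are not from the video, and the implementation shown in the next part of the series may differ.

```python
import numpy as np


def forward(x, W1, b1, W2, b2):
    """Forward pass: linear -> ReLU -> linear -> softmax."""
    z1 = x @ W1 + b1                               # (N, H)
    a1 = np.maximum(0.0, z1)                       # ReLU, applied element-wise
    z2 = a1 @ W2 + b2                              # (N, C)
    shifted = z2 - z2.max(axis=1, keepdims=True)   # numerical stability only
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=1, keepdims=True)   # softmax
    return z1, a1, probs


def loss_fn(probs, y):
    """Mean cross-entropy over the batch (averaging is an assumption)."""
    n = y.shape[0]
    return -np.log(probs[np.arange(n), y]).mean()


def backward(x, y, W2, z1, a1, probs):
    """Gradients of the mean loss, moving from the back of the graph to the front."""
    n = y.shape[0]
    dz2 = probs.copy()
    dz2[np.arange(n), y] -= 1.0            # softmax minus 1 at the correct label
    dz2 /= n                               # because the loss is averaged
    dW2 = a1.T @ dz2                       # uses dz2/dW2 = a1
    db2 = dz2.sum(axis=0, keepdims=True)   # add the gradients of all examples
    da1 = dz2 @ W2.T                       # uses dz2/da1 = W2 transpose
    dz1 = da1 * (z1 > 0)                   # ReLU routes the gradient element-wise
    dW1 = x.T @ dz1                        # same computation as layer 2, one layer down
    db1 = dz1.sum(axis=0, keepdims=True)
    return dW1, db1, dW2, db2


rng = np.random.default_rng(0)
N, D, H, C = 5, 4, 3, 2                    # hypothetical sizes
x = rng.normal(size=(N, D))
y = rng.integers(0, C, size=N)
W1, b1 = rng.normal(size=(D, H)), np.zeros((1, H))
W2, b2 = rng.normal(size=(H, C)), np.zeros((1, C))

z1, a1, probs = forward(x, W1, b1, W2, b2)
dW1, db1, dW2, db2 = backward(x, y, W2, z1, a1, probs)
print(dW1.shape, db1.shape, dW2.shape, db2.shape)
```

Each gradient has the same shape as the parameter it belongs to, which is what makes the later subtraction in gradient descent well defined.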

Original Description

In this video we go through backward propagation calculations for a feedforward neural network!

This video teaches how to perform backward propagation calculations for a feedforward neural network, which is a crucial step in training neural networks. By following the steps outlined in the video, viewers can learn how to compute derivatives of loss with respect to weights and biases, and how to update them using gradients.
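As a small illustration of the update step mentioned above, the sketch below applies one gradient descent step to the layer 2 parameters. All numbers, including the learning rate, are made up for the example. Note how the bias gradient is summed over the examples so that its shape matches the bias.

```python
import numpy as np

learning_rate = 0.1                     # hypothetical value

W2 = np.ones((3, 2))                    # hypothetical weights: 3 hidden units, 2 classes
b2 = np.zeros((1, 2))                   # one bias per output node

a1 = np.array([[1.0, 0.0, 2.0],
               [0.5, 1.0, 0.0]])        # activations for 2 examples
dz2 = np.array([[0.2, -0.2],
                [-0.1, 0.1]])           # hypothetical dLoss/dz2, one row per example

dW2 = a1.T @ dz2                        # (3, 2), same shape as W2
db2 = dz2.sum(axis=0, keepdims=True)    # (1, 2): gradients of all examples added up

W2 -= learning_rate * dW2               # step against the gradient
b2 -= learning_rate * db2
print(W2)
print(b2)
```

Without the sum over `axis=0`, the bias gradient would have one row per example and could not be applied directly to a bias of shape (1, 2).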

Key Takeaways

View the forward pass as a computational graph
Compute the loss
Start at the loss and move backward through the graph
Compute the derivative of the loss with respect to Z2
Recognize the result as the softmax output minus 1 at the correct label
Compute the derivative of Z2 with respect to W2
Compute the derivative of Z2 with respect to B2
Apply the chain rule to get the derivative of the loss with respect to the weights
Update the weights and biases using the gradients
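The step of differentiating the loss with respect to Z2 can be sanity-checked numerically. The sketch below compares the analytic result from the transcript (softmax minus 1 at the correct label) with central finite differences for a single example. The scores, the label, and the helper names are hypothetical.

```python
import numpy as np


def softmax(z):
    """Softmax of a 1-D vector of scores."""
    exp = np.exp(z - z.max())   # shifting by the max does not change the result
    return exp / exp.sum()


def loss(z, y):
    """Cross-entropy for one example: -z[y] + log(sum_c exp(z[c]))."""
    return -z[y] + np.log(np.exp(z).sum())


def loss_grad(z, y):
    """Analytic gradient: softmax(z) minus 1 at the correct label."""
    grad = softmax(z)
    grad[y] -= 1.0
    return grad


z = np.array([2.0, -1.0, 0.5])  # hypothetical scores for C = 3 classes
y = 0                           # hypothetical correct label

# Central finite differences, one output node k at a time.
eps = 1e-6
numeric = np.zeros_like(z)
for k in range(z.size):
    step = np.zeros_like(z)
    step[k] = eps
    numeric[k] = (loss(z + step, y) - loss(z - step, y)) / (2 * eps)

print(np.allclose(numeric, loss_grad(z, y), atol=1e-6))
```

A check like this is a common way to catch sign or indexing mistakes before trusting a hand-derived gradient.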

💡 Backward propagation involves calculating partial derivatives of each layer's output with respect to its inputs, which can be computationally expensive. However, by using element-wise derivatives and the chain rule, it is possible to efficiently compute these derivatives.
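To make the saving concrete, here is a minimal sketch that compares the full ReLU Jacobian with the element-wise shortcut, using the 1000-value example from the transcript. The random inputs and variable names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000                                 # matches the 1 by 1000 example
z = rng.normal(size=n)                   # hypothetical pre-activations
upstream = rng.normal(size=n)            # hypothetical gradient from the next layer

# Full Jacobian of relu(z): n x n, zero everywhere except the diagonal.
jacobian = np.diag((z > 0).astype(float))
via_jacobian = upstream @ jacobian       # about n * n multiplications

# Element-wise version: only the diagonal is ever needed.
via_mask = upstream * (z > 0)            # n multiplications

print(jacobian.shape, np.allclose(via_jacobian, via_mask))
```

Both routes give the same gradient, but the mask version never builds the million-entry matrix.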
