ResNet - Explained!
Key Takeaways
The video explains the ResNet network, its advantages over shallower networks, and its implementation in code, using resources such as the ResNet paper and code on GitHub.
Full Transcript
Greetings, fellow learners. In this video we are going to talk about ResNet: the what, the why, and the how. So what is ResNet? It is a network that makes use of residual, or skip, connections. That's these connections over here. So why and how do we use it? To understand this, let's talk about the object recognition pipeline and build out the logic of how we get to the ResNet that we see.
First, we have the object recognition pipeline, which takes in an image and determines its classification. In 2012, we had AlexNet, which was the state of the art for object recognition. It was basically a network with a sequence of convolution, activation, and pooling layers, along with some feed-forward layers, to map an image to an object category. Over the years, a few more architectures were introduced to make this more performant. For example, in 2014 we had VGGNet, a suite of architectures that used stacks of smaller 3×3 convolutions to make the network deeper and hence more performant than AlexNet. Around the same time we also had the Inception architecture, a network that was much deeper and wider in order to simulate sparse connections, and which used pointwise, or 1×1, convolutions. This too was more performant than AlexNet. So the general consensus around this time was that deeper networks can increase performance.
Knowing this, researchers at Microsoft asked: what happens if we go deeper still? At this point, let's take a look at some code to see what's happening. In my Colab notebook over here I have six models, and we're going to go through the first four right now. The first model is just a basic AlexNet-style network: a sequence of convolutions, activations, and max poolings. Each of these convolutions can be of a different size.
Some are 7×7 and some are 5×5. Training this model, you'll see a final accuracy of, let's say, 64% on the CIFAR-10 dataset for image classification. Now what happens if we use some form of VGGNet, which basically means we replace the larger 7×7 and 5×5 convolutions with stacks of just 3×3 convolutions? In doing so, you find that the network isn't even learning. We now have a very deep network that has stopped training.
The reason for this is the vanishing gradient problem. I can show this in code here: in the later layers we have gradients that are much larger, whereas in the earliest layers we have gradients on the order of 10^-5. This means the updates at the beginning of the network are a hundred or even a thousand times smaller than those at the end of the network. Hence the network does not learn, and we run into this issue.
To solve this vanishing gradient problem, we can add batch normalization between the layers. So I add batch normalization after every single convolution layer, keeping the network otherwise exactly the same. With batch normalization, we get the most performant architecture so far, with 80% accuracy. And if you look at the gradient flow, paying attention to the weights over here, you can see that it's much more palatable: around 10^-2 instead of the 10^-5 it used to be. This is also more comparable to what we see in the later layers of the network. So there is no vanishing gradient problem, and we can see that reflected in the network's performance.
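The gap between early-layer and late-layer gradients can be reproduced on a toy model. The sketch below is not the notebook's code: it assumes a hypothetical chain of ten scalar sigmoid layers with every weight set to 1.0, and backpropagates by hand to get the gradient of the output with respect to each layer's weight. Each sigmoid contributes a local derivative of at most 0.25, so the factors multiply and the first layer's gradient ends up orders of magnitude smaller than the last layer's.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def weight_gradients(x, weights):
    """Gradient of the final output w.r.t. each weight in a chain of
    scalar layers a_{i+1} = sigmoid(w_i * a_i) (toy model, an assumption)."""
    # Forward pass: remember every activation for the backward pass.
    activations = [x]
    for w in weights:
        activations.append(sigmoid(w * activations[-1]))

    # Backward pass: walk from the last layer to the first.
    grads = [0.0] * len(weights)
    upstream = 1.0  # d(output) / d(output)
    for i in reversed(range(len(weights))):
        out = activations[i + 1]
        local = out * (1.0 - out)                     # sigmoid'(z_i)
        grads[i] = upstream * local * activations[i]  # d(output) / d(w_i)
        upstream = upstream * local * weights[i]      # d(output) / d(a_i)
    return grads


grads = weight_gradients(x=1.0, weights=[1.0] * 10)
print(f"first layer gradient: {grads[0]:.2e}")
print(f"last layer gradient:  {grads[-1]:.2e}")
```

The same per-layer comparison is what the transcript describes reading off the real network: tiny values at the start of the network, much larger ones at the end.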
So what happens now if we try to go even deeper, adding more convolution and activation layers, for example? We have model 4 over here, where we did just that. We took the exact same network as before, but we added 10 convolution, batch normalization, and activation layers in sequence. When we train this, we notice that the network is definitely training, as the accuracy does get better, but the accuracy is not better than in the previous, shallower case. We also notice that the train accuracy, while it does improve, improves much more slowly. The same goes for the validation accuracy: it improves, but much more slowly than for the shallower counterpart. Here it was 29, 49, and 60 for the first three epochs; if you scroll over here to the shallower network, it's 44, 63, and 70. So the shallower network learns much more quickly.
So why does this happen? You can see here that it's not really a vanishing gradient problem, because the gradients are still quite healthy, around 10^-1 and 10^-2, which seems par for the course compared to all the other cases that we see here. So we don't have a vanishing gradient problem, but training accuracy is much slower to improve and so is test accuracy, and this is what we call performance degradation. Let's now see in theory what this is and why it occurs, and then we'll see how ResNet can potentially solve this issue.
So, performance degradation: what is it? It is a phenomenon where a deeper network's training and testing error is worse than that of its shallower counterpart. This is exactly what we saw in code. Now why does performance degradation happen? To illustrate this, let's take a simple example.
I'm going to take two blocks of convolution, batch normalization, and activation, right over here and here. We're going to train an image classifier, an object recognizer, so it will take this image and output an object category. Let's assume that this small network is powerful enough to capture all the nonlinearities required to map the image to this output category. So you can imagine that this network's performance is 95% or something, and we can't really go much higher than that.
Now let's consider another network. Let's say we made it deeper by doubling the number of layers to four. In theory, a deeper network should be at least as performant as a shallower network, as the deeper layers should be able to learn a pass-through, or identity, function. So you can imagine that these added layers have parameters such that, if there's a tensor here, it is mapped to the exact same tensor here, effectively mimicking the shallower network. So it should perform the same, but in practice it doesn't.
In theory, as we wrote out here, a tensor should be mapped to the exact same tensor. But in practice, it is actually mapped to a slightly distorted tensor. This happens because these networks learn through backpropagation, an estimation technique that updates the weights, or configuration parameters, little by little. The set of parameters that maps a tensor to the exact same tensor is a very specific solution that is very difficult for our optimizer to find. And because it's so difficult to find, what you end up with instead is a slightly distorted tensor.
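To make the "slightly distorted tensor" concrete, here is a toy sketch, not the video's network. It assumes each plain layer computes `relu(W v)` and that training lands every `W` close to the identity matrix but off by a small, made-up error `eps` in the off-diagonal entries. A single such layer barely changes its input, yet each further layer moves the result a little further from the original.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def near_identity_layer(v, eps):
    """relu(W v) where W has 1.0 on the diagonal and eps elsewhere
    (eps is a hypothetical 'optimizer missed the identity' error)."""
    total = sum(v)
    return relu([x + eps * (total - x) for x in v])


def distortion(a, b):
    """Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


x = [1.0, 2.0, 3.0]
v = x
distortions = []
for depth in range(1, 9):
    v = near_identity_layer(v, eps=0.02)
    distortions.append(distortion(v, x))
    print(f"depth {depth}: distortion {distortions[-1]:.3f}")
```

Giving every layer the same fixed error is a simplification chosen so the growth is easy to see; in a trained network the per-layer errors would differ.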
And you can imagine that as you get deeper and deeper, with multiple layers like this, the intermediate tensors might each be only slightly distorted, but the distortions can keep adding up to the point that you get a very distorted tensor over here. When you make an output from this tensor, you might even get the wrong object category. So we effectively have a deeper network with lower performance, and this is performance degradation and why it occurs.
Now that we know why it can occur, how do we solve this problem? Researchers at Microsoft thought: why don't we modify the network structure to make it easier to emulate a pass-through, or identity, function? They did this using skip connections. In the original case, we have this tensor and we get a slightly distorted tensor over here. What if we now create this residual, or skip, connection over here? We take the activation from here, and perform the ReLU activation only after we take the sum of these two arms.
This is great because it is now far easier to simulate a pass-through, as the last convolution or batch normalization can just learn to be zero. The batch normalization can learn to output zeros, or the convolution can have filters such that its output becomes zero. In either case you get zero for this arm, and the tensor essentially passes through along the other arm. So it's far easier to get a very similar, or even the same, tensor after this operation. And now this residual arm, this connection over here, is only going to learn anything if it is beneficial to the network.
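A minimal sketch of the skip connection just described, using small dense layers as a stand-in for the convolution and batch normalization arm (an assumption made to keep the example framework-free; `residual_block` and its parameter names are hypothetical). When the last layer of the arm has all-zero weights and bias, the arm outputs zeros and the block returns its input unchanged, provided the input is non-negative, as it would be after an earlier ReLU.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def dense(v, weights, bias):
    """Matrix-vector product plus bias; stands in for conv + batch norm."""
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, bias)]


def residual_block(v, w1, b1, w2, b2):
    """relu(v + F(v)), where F(v) = dense(relu(dense(v, w1, b1)), w2, b2)."""
    arm = dense(relu(dense(v, w1, b1)), w2, b2)
    # The ReLU is applied only after the two arms are summed.
    return relu([x + f for x, f in zip(v, arm)])


# Input that is already non-negative, e.g. the output of an earlier ReLU.
x = [0.5, 1.5, 0.0]

# First layer of the arm: arbitrary illustrative weights.
w1 = [[0.2, -0.4, 0.1], [0.7, 0.3, -0.5], [-0.6, 0.9, 0.8]]
b1 = [0.1, -0.2, 0.3]

# Last layer of the arm has "learned to be zero".
w2 = [[0.0] * 3 for _ in range(3)]
b2 = [0.0] * 3

passthrough = residual_block(x, w1, b1, w2, b2)
print(passthrough)
```

Note that the first layer's weights do not matter here: zeroing only the last layer of the arm is enough to make the whole block a pass-through.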
So if this part is not capturing all the nonlinearities needed to perform the image classification, only the extra information will be learned here. But if it is powerful enough, then this arm will almost simulate a pass-through. We can then repeat these skip connections throughout the network, as we see here. And that's it. With skip connections, it is trivial to model a pass-through function, and hence a network of any depth can be at least as performant as its shallower counterpart.
Let's now go back to our code and see this in action. We have our fifth model, where we create ResNet by adding skip connections to our previous VGGNet-style architecture, which has a bunch of convolution, batch normalization, and activation layers. We take the exact same network and just add residual connections, which I'm denoting by "residuals" over here. I'll share the code later so you can see exactly how it's coded. If we do this, you'll see the performance is as good as, well, better than, any other performance we had seen previously. So we have now gotten rid of this performance, or training, degradation issue. You can see that these training values are much higher now. At the same time, skip connections also help mitigate vanishing gradients, so there is an added benefit there.
Now, as an added bonus, let's say we go even deeper than this. Honestly, all I did was add 10 more convolution blocks to this, and I just wanted to see what would happen. If you do this and train it, you'll see that you still get really good performance, with very minimal performance or training degradation.
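Why extra depth stops hurting once skip connections are in place can be illustrated with a toy experiment. This is a sketch under assumptions, not the notebook's models: 20 dense layers with small random weights (standard deviation 0.01, a made-up value) are applied once as a plain stack and once with a skip connection around every layer. With weights near zero, the plain stack wipes out the signal, while the residual stack stays close to a pass-through of its input.

```python
import random


def relu(v):
    return [max(0.0, x) for x in v]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def random_matrix(n, scale, rng):
    """n x n matrix of small Gaussian weights (illustrative values only)."""
    return [[rng.gauss(0.0, scale) for _ in range(n)] for _ in range(n)]


def plain_stack(v, layers):
    """Each layer replaces its input: v <- relu(W v)."""
    for w in layers:
        v = relu(matvec(w, v))
    return v


def residual_stack(v, layers):
    """Each layer adds to its input: v <- relu(v + W v)."""
    for w in layers:
        v = relu([x + f for x, f in zip(v, matvec(w, v))])
    return v


def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


rng = random.Random(0)
x = [1.0, 2.0, 0.5, 1.5]
layers = [random_matrix(4, 0.01, rng) for _ in range(20)]

plain_out = plain_stack(x, layers)
residual_out = residual_stack(x, layers)
print("plain stack distance from input:   ", distance(plain_out, x))
print("residual stack distance from input:", distance(residual_out, x))
```

The same weights are used for both stacks, so the only difference is the skip connection. The residual arm then only has to learn whatever correction actually helps, which is the argument made in the transcript.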
And so I'm going to share all of this code in the description below, so feel free to play around with it. I hope everything here made sense.
Quiz time. Have you been paying attention? Let's quiz you to find out. Why use residual connections? A, to avoid the dying ReLU problem. B, so the performance of deeper networks can at least match their shallower counterparts. C, to avoid performance degradation. Or D, to mitigate vanishing gradients. Multiple options may be correct here, and I'll give you a few seconds to answer this question. The correct options are B, C, and D. Did you get them right? Comment your reasoning down in the comments below and let's have a discussion. And at this point, if you do think I deserve it, please do consider giving this video a like, because it will help me out a lot. That's going to do it for quiz time and for this video.
But before we go, let's generate a summary. In this video, we took a look at the what, why, and how of ResNet. We started with the definition: it is a network that makes use of residual, or skip, connections. Then we understood, through VGGNet and Inception, that deeper networks can increase performance. But what happens if you go even deeper than that? The issue we run into is performance degradation. It's a phenomenon where, for a deeper network, the training and test error is worse than that of its shallower counterpart. We also took a look at how and why this happens. Because of the nature of optimization during backpropagation, it is very difficult for this chunk of network to represent a pass-through function. These distortions add up layer after layer to the point where we might even get worse predictions, and hence performance degrades. To solve this, we add a residual connection, or skip connection, over here.
This means that this part of the network will only learn something if it is beneficial to the network; otherwise it will just simulate a pass-through, if the network is already powerful enough to capture the mapping between the image and the object category. Hence we add them throughout the network. With skip connections, it is trivial to model a pass-through function, and hence a network of any depth can be at least as performant as its shallower counterpart. We also took a look at a few models to demonstrate that this is the case in practice too.
And that's all that I have for you today. If you think I deserve it, please do consider giving this video a like. All resources are going to be down in the description below: the code, the paper, and the slides. So do check them out. To continue your AI journey, do click on this video right over here. And I will see you in the next one.
Original Description
In this video, we take a look at the ResNet network. What is it? Why is it better than some of the shallower networks that came before it? How do we implement this in code?
ABOUT ME
⭕ Subscribe: https://www.youtube.com/c/CodeEmporium?sub_confirmation=1
📚 Medium Blog: https://medium.com/@dataemporium
💻 Github: https://github.com/ajhalthor
👔 LinkedIn: https://www.linkedin.com/in/ajay-halthor-477974bb/
RESOURCES
[1 📚] Slides used in the video: https://link.excalidraw.com/p/readonly/Oj623wJMmvUZxfF5dyXl
[2 📚] Main paper of the video: https://arxiv.org/pdf/1512.03385
[3 📚] Code for ResNet network: https://github.com/ajhalthor/computer-vision-101
PLAYLISTS FROM MY CHANNEL
⭕ Reinforcement Learning: https://youtube.com/playlist?list=PLTl9hO2Oobd9kS--NgVz0EPNyEmygV1Ha&si=AuThDZJwG19cgTA8
⭕ Natural Language Processing: https://youtube.com/playlist?list=PLTl9hO2Oobd_bzXUpzKMKA3liq2kj6LfE&si=LsVy8RDPu8jeO-cc
⭕ Transformers from Scratch: https://youtube.com/playlist?list=PLTl9hO2Oobd_bzXUpzKMKA3liq2kj6LfE
⭕ ChatGPT Playlist: https://youtube.com/playlist?list=PLTl9hO2Oobd9coYT6XsTraTBo4pL1j4HJ
⭕ Convolutional Neural Networks: https://youtube.com/playlist?list=PLTl9hO2Oobd9U0XHz62Lw6EgIMkQpfz74
⭕ The Math You Should Know : https://youtube.com/playlist?list=PLTl9hO2Oobd-_5sGLnbgE8Poer1Xjzz4h
⭕ Probability Theory for Machine Learning: https://youtube.com/playlist?list=PLTl9hO2Oobd9bPcq0fj91Jgk_-h1H_W3V
⭕ Coding Machine Learning: https://youtube.com/playlist?list=PLTl9hO2Oobd82vcsOnvCNzxrZOlrz3RiD
MATH COURSES (7 day free trial)
📕 Mathematics for Machine Learning: https://imp.i384100.net/MathML
📕 Calculus: https://imp.i384100.net/Calculus
📕 Statistics for Data Science: https://imp.i384100.net/AdvancedStatistics
📕 Bayesian Statistics: https://imp.i384100.net/BayesianStatistics
📕 Linear Algebra: https://imp.i384100.net/LinearAlgebra
📕 Probability: https://imp.i384100.net/Probability
OTHER RELATED COURSES (7 day free trial)
📕 ⭐ Deep Learning Specialization: https://imp.i