VGGNet - Explained!

CodeEmporium · Advanced ·🔢 Mathematical Foundations ·8mo ago

Key Takeaways

The video explains the VGGNet architecture, its components, and how it achieves high performance in object recognition tasks by replacing larger convolutions with stacks of smaller 3×3 convolutions and increasing network depth. It also discusses the trade-off between parameter count and performance, and how scaling the network can improve results.

Full Transcript

Greetings, fellow learners. In this video, we are going to take a look at VGGNet, which was the state of the art for object recognition in 2014. Object recognition is the problem where we take an image and output a category for that image. In 2012, AlexNet was the state of the art for object recognition, mostly because of the rise of the internet and hence the creation of large datasets like ImageNet, and because of GPU availability: Nvidia's GeForce was released in 1999 but became more prevalent later in the 2000s. But can we do better? One way to think about increasing performance is to increase network depth. So researchers at the Visual Geometry Group at the University of Oxford created a suite of architectures with 11 to 19 layers: A, A-LRN, B, C, D, and E. Each of these is a different network configuration, and this suite of networks became known as VGGNet.
The main contribution was based on the following: during a convolution operation, we can approximate large filters with smaller 3×3 convolutions. Specifically, a 5×5 convolution has the same receptive field as two 3×3 convolutions, and a 7×7 convolution has the same receptive field as three 3×3 convolutions. To see why this is the case, let's illustrate it with an example in one dimension. Here we have an input, which is a one-dimensional signal of floating-point values, and let's say the size of the input is seven. Next we have a one-dimensional filter of size five. If you apply convolution, the first output value is the sum of products of the filter with the first five input elements. That gives one floating-point value, and then we keep sliding the window to get the full convolution output.
Now each output cell has a receptive field of five. What that means is that this floating-point number is some function of these five inputs. Similarly, the second one is a function of the next five inputs, and the third one is a function of the five after that. Each cell is also a linear function of its five inputs, because it is just a sum of products with no nonlinearities.
Now let's look at the second case, where we use size-three filters. We have the same input, and we use a 3×1 filter to get an output. In each case we take a sum of products of just three items. Keep sliding the window and you get the total output. Each output cell now has a receptive field of three, so it is a linear function of three inputs: the first cell sees the first three inputs, the second cell sees the next three, and so on. Now let's say we apply another 3×1 filter to that output. We take the sum of products of three of these intermediate values to get one output, and slide the window to get the rest. What's interesting is that each cell of this second output is a linear function of five input cells: it is a function of three intermediate values, which are in turn a function of five inputs. This is exactly like the 5×1 case, so the receptive field of the 5×1 filter is the same as the receptive field of two stacked 3×1 filters.
We can even extend this. If we add another 3×1 filter and take the sum of products, we get a single final output value, and this value is a linear function of all seven inputs. So each output cell is a linear function of seven input cells, which is similar to a single 7×1 convolution.
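The one-dimensional argument above can be checked numerically. The following is a minimal sketch in plain Python, not the code from the video: the input values, the filter values, and the helper names `slide` and `compose` are all made up for illustration. It shows that two (or three) stacked size-three filters with no nonlinearity behave exactly like one size-five (or size-seven) filter, and that putting a ReLU between them breaks that equivalence.

```python
def slide(signal, kernel):
    """Sum of products of `kernel` over every window of `signal` (stride 1, no padding)."""
    k = len(kernel)
    return [sum(s * w for s, w in zip(signal[i:i + k], kernel))
            for i in range(len(signal) - k + 1)]


def compose(a, b):
    """Single kernel equivalent to sliding `a` and then `b` (full convolution of the two)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


# Hypothetical values chosen only for illustration.
x = [1, 2, 0, -1, 3, 1, -2]   # input of size 7, as in the example
f1 = [1, 0, -1]               # first size-3 filter
f2 = [1, 2, 1]                # second size-3 filter
f3 = [1, 1, 1]                # third size-3 filter

h = slide(x, f1)              # 5 values, each sees 3 inputs
y = slide(h, f2)              # 3 values, each sees 5 inputs
z = slide(y, f3)              # 1 value, sees all 7 inputs

f5 = compose(f1, f2)          # equivalent size-5 filter
f7 = compose(f5, f3)          # equivalent size-7 filter
y_single = slide(x, f5)
z_single = slide(x, f7)

# With a ReLU between the two filters the result is no longer
# the output of the single linear size-5 filter.
relu_h = [max(v, 0) for v in h]
y_relu = slide(relu_h, f2)

print(y, y_single)            # [4, -5, -2] [4, -5, -2]
print(z, z_single)            # [-3] [-3]
print(y_relu)                 # [7, 3, 5]
```

Note that the equivalence only says the stacked filters cover the same inputs and, without activations, compute some size-five filter. A stack of two size-three filters cannot represent every possible size-five filter, which is why the transcript speaks of approximating larger filters.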
So I hope you can see how a 7×1 convolution can be represented by a sequence of three 3×1 filters, and we would just need to extend this to two dimensions for the case I mentioned. Hence, during a convolution operation, we can approximate larger filters with smaller 3×3 convolutions.
Now why would we want to do this? First, it decreases the number of learnable parameters. Second, it increases the number of layers, which lets us add a ReLU after each of those layers, and that increases the discriminative power of the network. Let's look at exactly what we mean with an example. In case one, we take a 5×5 convolution and apply it to a tensor with 64 channels (the 64 is the depth), and we want to transform it into another tensor with 128 channels. To do this, each filter has size 5×5×64. We apply a convolution by sliding it across the entire tensor, which gives one 32×32 feature map, and we need 128 such filters to get the full output. Hence the number of parameters is 5 × 5 × 64 (the number of input channels) × 128 (the number of output channels), which gives 204,800 learnable parameters for this single 5×5 convolution.
Now let's move on to the second case, where we have a sequence of two 3×3 convolutions. Let's say that from 64 channels we go to 32 channels, as a kind of bottleneck, and then transform that to 128 channels. The number of parameters is 3 × 3 × 64 × 32 plus 3 × 3 × 32 × 128, which is 55,296 learnable parameters. So the number of parameters has decreased by roughly 4x (about 3.7x) compared to the previous case.
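The arithmetic above is easy to reproduce. Here is a small sketch, assuming bias terms are ignored as in the transcript; the helper name `conv_weights` is hypothetical. It also includes an equal-width comparison (64 channels throughout), which is an added illustration rather than something from the video, to show how much of the saving comes from the 32-channel bottleneck.

```python
def conv_weights(kernel, c_in, c_out):
    """Number of weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * c_in * c_out


# Case 1: one 5x5 convolution, 64 -> 128 channels.
single_5x5 = conv_weights(5, 64, 128)

# Case 2: two 3x3 convolutions, 64 -> 32 -> 128 channels (bottleneck from the transcript).
stacked_3x3 = conv_weights(3, 64, 32) + conv_weights(3, 32, 128)

ratio = single_5x5 / stacked_3x3
print(single_5x5, stacked_3x3, round(ratio, 2))   # 204800 55296 3.7

# Added illustration (assumption: 64 channels everywhere, no bottleneck).
equal_single = conv_weights(5, 64, 64)            # 25 * C^2
equal_stacked = 2 * conv_weights(3, 64, 64)       # 18 * C^2
print(equal_single, equal_stacked)                # 102400 73728
```

With equal widths the stack still has fewer weights (18C² against 25C²), but the reduction is about 1.4x rather than 3.7x, so the size of the saving depends on the intermediate channel count.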
The second point I mentioned before was that we can now add two nonlinearities, two ReLUs, versus one in the case of a single 5×5 convolution. This makes the network more discriminative and more powerful. So I hope those two points make sense. Instead of larger convolutions, like the 5×5 and 11×11 convolutions from AlexNet, we can use stacks of 3×3 convolutions, and we can then create deeper networks that are more performant. When I say more performant, I'm looking at the ILSVRC competition in 2014. If you look at the different categories there and scroll down, you will see that for the classification and localization task, VGGNet came out on top. So it performs well.
To put these numbers in perspective, I've also created some code, so let's take a look at that. This code trains a simplified version of a VGG network and a simplified version of AlexNet and then compares the two. We train on the CIFAR-10 dataset, which is an object recognition dataset with 10 output classes. Then we have the code for training and evaluation, and then we create an architecture for AlexNet. It is simplified from the original, but it has the same ideas. We take the input image and transform it with a 7×7 convolution to create 64 output channels. In the second convolution block we use a 5×5 convolution to go from 64 to 192 channels, then another 5×5 convolution to go from 192 to 384 channels, then a sequence of 3×3 convolutions, and then we flatten it out to get the final prediction. Now, in this exact network, we're going to replace the 7×7 convolution with a sequence of three 3×3 convolutions.
And we're going to replace each of the two 5×5 convolutions with a sequence of two 3×3 convolutions. I've coded that out right here. We transform three channels to 64 like before, but we do it with three convolution operations of kernel size 3. Similarly, we transform 64 to 192 channels with two 3×3 kernels to mimic a 5×5 convolution, and we do the same to mimic the 5×5 convolution from 192 to 384. The rest of the network is exactly as it was before, so nothing else has changed. This is just to give you a basis for comparison. If we compare the number of parameters, we see that this VGGNet has about half the number of parameters of AlexNet. Yet when you train both networks on the exact same data, with the same configuration, for the same amount of time, you get slightly better performance in the VGGNet case. So I hope this shows how the deeper VGGNet lets us add more nonlinearities while the smaller convolutions decrease the parameter count. You can imagine that if we create the true VGGNet architectures with 11 to 19 layers, we can eke out even more performance. So I hope this all makes sense.
Quiz time. Have you been paying attention? Let's quiz you to find out. Which of the following is true about VGGNet? A. It uses stacks of 3×3 convolutions with 2×2 max pooling. B. It introduces residual or skip connections. C. The network is sparsely connected and hence has a lower parameter count. D. It removes the fully connected layers of the original model. I'll give you a few seconds to answer this question. The correct option is A. But did you get it right? Comment your reasoning down below and let's have a discussion. And at this point, if you think I deserve it, please do consider giving this video a like, because it will help me out a lot.
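For the code comparison described before the quiz, the layer substitution can be sketched as plain data. This is not the notebook from the video. The outer channel counts (3 to 64, 64 to 192, 192 to 384) come from the transcript, while the intermediate widths inside each 3×3 stack, the one-ReLU-per-convolution convention, and the names below are assumptions. Pooling and strides between blocks are ignored, so the receptive fields are per block and along one axis.

```python
def receptive_field(kernel_sizes):
    """Receptive field along one axis of stacked stride-1, undilated convolutions."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf


# Each layer is (kernel, in_channels, out_channels); each inner list is one block.
alexnet_like = [
    [(7, 3, 64)],
    [(5, 64, 192)],
    [(5, 192, 384)],
]

# Intermediate widths (64, 192, 384) are assumed, not taken from the video.
vgg_like = [
    [(3, 3, 64), (3, 64, 64), (3, 64, 64)],
    [(3, 64, 192), (3, 192, 192)],
    [(3, 192, 384), (3, 384, 384)],
]

alex_rf = [receptive_field([k for k, _, _ in block]) for block in alexnet_like]
vgg_rf = [receptive_field([k for k, _, _ in block]) for block in vgg_like]

# Assuming one ReLU after every convolution, count activations per block.
alex_relus = [len(block) for block in alexnet_like]
vgg_relus = [len(block) for block in vgg_like]

print(alex_rf, vgg_rf)        # [7, 5, 5] [7, 5, 5]
print(alex_relus, vgg_relus)  # [1, 1, 1] [3, 2, 2]
```

Each block covers the same input window in both variants, while the 3×3 version has more places to apply a nonlinearity. The resulting parameter count depends on the intermediate widths actually used in the notebook, which this sketch does not try to reproduce.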
And that's going to do it for quiz time. Before we go, let's generate a summary. In this video we took a look at VGGNet: the what, the why, and the how. We started with the fact that AlexNet was the state of the art for object recognition, and we tried to see how we could do better. VGGNet does this by increasing the network depth, and specifically it replaces larger convolutions with much smaller 3×3 convolutions: a 5×5 convolution has the same receptive field as two 3×3 convolutions, and a 7×7 convolution has the same receptive field as three 3×3 convolutions. We also saw how this is borne out with a simple example. We mentioned the reasons for wanting to do this: it decreases the number of learnable parameters while increasing the number of layers and ReLU activations, hence increasing the discriminative power of the network. We then saw exactly this in code, where we compared a simplified AlexNet architecture with a VGGNet architecture in which the larger convolutions were replaced with 3×3 convolutions. What we saw was that with half the number of parameters, the network was even more performant. We also introduced the idea that scaling this network could potentially improve performance further, despite the increased parameter count. That's all I have for today. Thank you all so much for watching. All the resources, the code, and the slides will be in the description below, so please do check it out. I will see you in the next one. Bye-bye.

Original Description

In this video, we take a look at the VGG network architecture. What is it? Why is it so deep? How do we code it out? How well does it perform?
ABOUT ME
⭕ Subscribe: https://www.youtube.com/c/CodeEmporium?sub_confirmation=1
📚 Medium Blog: https://medium.com/@dataemporium
💻 Github: https://github.com/ajhalthor
👔 LinkedIn: https://www.linkedin.com/in/ajay-halthor-477974bb/
RESOURCES
[1 📚] Slides used in the video: https://link.excalidraw.com/p/readonly/wW80S4xsMV1c5kvilNfo
[2 📚] Main paper of the video: https://arxiv.org/pdf/1409.1556
[3 📚] Code for VGG network: https://github.com/ajhalthor/computer-vision-101/blob/main/VGGNetwork.ipynb
[4 📚] ILSVRC ImageNet competition in 2014, models ranked: https://image-net.org/challenges/LSVRC/2014/results
PLAYLISTS FROM MY CHANNEL
⭕ Reinforcement Learning: https://youtube.com/playlist?list=PLTl9hO2Oobd9kS--NgVz0EPNyEmygV1Ha&si=AuThDZJwG19cgTA8
Natural Language Processing: https://youtube.com/playlist?list=PLTl9hO2Oobd_bzXUpzKMKA3liq2kj6LfE&si=LsVy8RDPu8jeO-cc
⭕ Transformers from Scratch: https://youtube.com/playlist?list=PLTl9hO2Oobd_bzXUpzKMKA3liq2kj6LfE
⭕ ChatGPT Playlist: https://youtube.com/playlist?list=PLTl9hO2Oobd9coYT6XsTraTBo4pL1j4HJ
⭕ Convolutional Neural Networks: https://youtube.com/playlist?list=PLTl9hO2Oobd9U0XHz62Lw6EgIMkQpfz74
⭕ The Math You Should Know : https://youtube.com/playlist?list=PLTl9hO2Oobd-_5sGLnbgE8Poer1Xjzz4h
⭕ Probability Theory for Machine Learning: https://youtube.com/playlist?list=PLTl9hO2Oobd9bPcq0fj91Jgk_-h1H_W3V
⭕ Coding Machine Learning: https://youtube.com/playlist?list=PLTl9hO2Oobd82vcsOnvCNzxrZOlrz3RiD
MATH COURSES (7 day free trial)
📕 Mathematics for Machine Learning: https://imp.i384100.net/MathML
📕 Calculus: https://imp.i384100.net/Calculus
📕 Statistics for Data Science: https://imp.i384100.net/AdvancedStatistics
📕 Bayesian Statistics: https://imp.i384100.net/BayesianStatistics
📕 Linear Algebra: https://imp.i384100.net/LinearAlgebra
📕 Probability: https://imp.i384100.ne

This video teaches the fundamentals of the VGGNet architecture and why it performs well on object recognition tasks. It covers key concepts such as convolutional neural networks, receptive fields, network depth, and discriminative power, and walks through a code comparison against a simplified AlexNet.

Key Takeaways
  1. Apply a convolution operation by sliding a filter across a tensor
  2. Use a sequence of two 3×3 convolutions in place of a 5×5 convolution to reduce the number of learnable parameters
  3. Replace larger convolutions with smaller 3×3 convolutions
  4. Increase the network depth
  5. Use a sequence of three 3×3 convolutions, in place of a single 7×7 convolution, to transform three channels to 64 channels
💡 Replacing larger convolutions with smaller 3×3 convolutions can decrease the number of learnable parameters and increase the discriminative power of the network, leading to improved performance in object recognition tasks.
