Deconvolution - what do networks learn? (visualization + code)
Key Takeaways
The video demonstrates the use of deconvolution to visualize what convolutional neural networks learn, using the AlexNet architecture as an example, and provides code snippets in PyTorch to illustrate the process.
Full Transcript
Greetings, fellow learners. In this video, we are going to visualize the deep layers of convolutional neural networks. We'll begin by understanding why we want to do this.
This here is AlexNet. In 2012 it was the state of the art for object recognition: you pass in an image and you get a class categorization of what the image represents. Throughout the network we have a sequence of convolution, activation, and max pooling layers, followed by some feed-forward, densely connected layers.
To see what these layers represent, let's say we pass an image, like an MNIST digit 7, through the first layers of the network and simply visualize the activations of the feature maps. If the convolution layer has, say, eight feature maps, you might get something that looks like this. Very red and very blue mean that the filter has picked up on these patterns. So you can see that some of these filters have picked up on the stroke of the seven, while others have picked up on inner edges and inner shadows of the seven, and so on. This is shown for the convolution layer, the activation layer, and the pooling layer, and at this point you can see visually what each filter has learned.
But as we go to the next block, the next set of convolution, activation, and pooling layers, it becomes really hard to understand what's being learned. So the question is: is there a better way to visualize what the internals of AlexNet and other deeper architectures have learned? That can give us insight into whether the training has actually happened correctly or not.
As an added benefit, it can also guide future neural network research, instead of leaving the design of network components to trial and error. All of this is why we want to effectively visualize the internal components of a network.
Now that we understand why we want to visualize convolutional networks, let's understand how to visualize them. Consider this layered architecture of a convolutional neural network, with convolution, activation, pooling, and feed-forward layers, trained for visual object recognition. Let's say we want to visualize what this layer has learned. Before, we would just look at the activations at this point over here. Instead, from the forward pass, we pass this output into a max unpooling layer to undo the max pooling. We then pass it through a ReLU, and then we undo the convolution operation with a transpose convolution operation. We keep working our way back to the beginning of the network until eventually we get a pixel-space activation from layer N. This represents a visualization of what that layer has actually learned to focus on. That's our goal and how we want to visualize.
Now let's get into the details of the computations. Consider a very simple network with a 5 × 5 input and two 3 × 3 filters. After you apply the convolution, you get two output feature maps. Then we apply a ReLU activation, and then max pooling to get the pooled outputs. After that come the next layers, which we're not concerned with for now, because we just want to visualize the components here. So let's say we have an input here, and let's assume that these are trained filters.
These filters hold their final weights, and the input is an image that looks like an A. When we pass it through the convolution layer and apply the convolution operation, you might get something that looks like this. It's a sliding-window approach: we place the filter over here, take the sum of the products, and put the result over here. Then we slide the window and do the same thing. That's the fundamentals of the convolution operation.
Next, we perform a ReLU activation, which clamps all the negative values to zero, just to highlight which neurons are active. Then we perform max pooling with a window size of two and a stride of one, and the outputs are over here.
Now let's say we want to visualize what this max pooling layer has actually learned. First we create something called switches, which will eventually be used to unpool, that is, to reverse the pooling operation. The pooling outputs are 0.6, 0.4, 0.6, and 0.2, and during the forward pass we would have recorded which position produced each of them. It's kind of like an argmax. The first 0.6 was created from position 0, 0.4 was created from position 1, the second 0.6 was created from position 6, and 0.2 was created from position 4.
Now that we have these switches and the max pooling output, we can try to reconstruct its predecessor. You get this 3 × 3 grid with 0.6, 0.4, 0.2, and 0.6 placed at the recorded positions.
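The pooling, switch recording, and unpooling steps above can be sketched in plain Python. This is a minimal illustration, not the video's notebook code: the function names are made up for this example, and the 3 × 3 post-ReLU feature map is hypothetical, chosen only so that its pooled values (0.6, 0.4, 0.6, 0.2) and switches (0, 1, 6, 4) match the numbers in the walkthrough.

```python
def max_pool_with_switches(fmap, size=2, stride=1):
    """Max-pool a 2-D list and record the flat index (switch) of each maximum."""
    rows, cols = len(fmap), len(fmap[0])
    pooled, switches = [], []
    for r in range(0, rows - size + 1, stride):
        pooled_row, switch_row = [], []
        for c in range(0, cols - size + 1, stride):
            # Scan the window and remember where the maximum came from.
            best_val, best_pos = fmap[r][c], r * cols + c
            for dr in range(size):
                for dc in range(size):
                    val = fmap[r + dr][c + dc]
                    if val > best_val:
                        best_val, best_pos = val, (r + dr) * cols + (c + dc)
            pooled_row.append(best_val)
            switch_row.append(best_pos)
        pooled.append(pooled_row)
        switches.append(switch_row)
    return pooled, switches


def max_unpool(pooled, switches, rows, cols):
    """Place each pooled value back at its recorded position; the rest stays zero."""
    out = [[0.0] * cols for _ in range(rows)]
    for pooled_row, switch_row in zip(pooled, switches):
        for val, pos in zip(pooled_row, switch_row):
            out[pos // cols][pos % cols] = val
    return out


# Hypothetical post-ReLU feature map (assumption, not taken from the slides).
feature_map = [
    [0.6, 0.4, 0.1],
    [0.0, 0.2, 0.0],
    [0.6, 0.0, 0.1],
]

pooled, switches = max_pool_with_switches(feature_map, size=2, stride=1)
unpooled = max_unpool(pooled, switches, rows=3, cols=3)

print(pooled)    # pooled maxima, one per window
print(switches)  # flat positions in the 3 x 3 map that produced them
print(unpooled)  # lossy reconstruction: non-maximal values are gone
```

Note that the two 0.1 entries of the hypothetical map do not survive the round trip, which is what makes the reconstruction lossy.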
We can reconstruct it using these switches because we know which positions the values came from. Everything else is zeroed out, so this is a lossy reconstruction of what this layer was.
Next, we apply a ReLU activation here. It is effectively a no-op, since nothing is below zero.
Then we perform the transpose convolution. Take this neuron at 0.6: it would have been created from the neurons in this window over here. To determine what values they should get, you multiply each of the filter weights by 0.6. So 0.6 × 0.2 is 0.12, negative or positive depending on the sign of the weight. Next you do the same with 0.4. It affects the neurons in the next window over, so we take the product of 0.4 with each of the filter weights and add it to or subtract it from the neurons there, and you get something like this. Everything else is zero, and plus or minus zero changes nothing. So we go to the next nonzero neuron, which is 0.2. If you multiply 0.2 by all of these plus or minus 0.2 weights, you get plus or minus 0.04, which you add to or subtract from the neurons over here. Finally, you take the other 0.6 and multiply it by the plus or minus 0.2 weights to get plus or minus 0.12, which affects the neurons over here, and when you add or subtract those values you get this.
So effectively this is the pixel-space activation for this max pooling layer. We've represented it in pixel space, which makes it easier for us to visualize.
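The scatter-and-accumulate arithmetic described above can be written as a short plain-Python sketch of a stride-1, no-padding transpose convolution. The function name and the 3 × 3 kernel of plus or minus 0.2 weights are assumptions for illustration; the transcript only says the weights are plus or minus 0.2, not how the signs are arranged. The input is the unpooled grid from the walkthrough.

```python
def conv_transpose2d_single(fmap, kernel):
    """Stride-1, no-padding transpose convolution of one 2-D map with one kernel.

    Each nonzero activation stamps a scaled copy of the kernel onto the
    output window it would have been computed from; overlaps are summed.
    """
    rows, cols = len(fmap), len(fmap[0])
    k = len(kernel)
    out = [[0.0] * (cols + k - 1) for _ in range(rows + k - 1)]
    for r in range(rows):
        for c in range(cols):
            activation = fmap[r][c]
            if activation == 0.0:
                continue  # plus or minus zero changes nothing
            for dr in range(k):
                for dc in range(k):
                    out[r + dr][c + dc] += activation * kernel[dr][dc]
    return out


# Unpooled 3 x 3 grid from the walkthrough (0.6, 0.4, 0.2, 0.6 at positions 0, 1, 4, 6).
unpooled = [
    [0.6, 0.4, 0.0],
    [0.0, 0.2, 0.0],
    [0.6, 0.0, 0.0],
]

# Hypothetical trained filter: only the magnitude 0.2 comes from the transcript,
# the sign pattern is an assumption.
kernel = [
    [0.2, -0.2, 0.2],
    [-0.2, 0.2, -0.2],
    [0.2, -0.2, 0.2],
]

pixel_space = conv_transpose2d_single(unpooled, kernel)
for row in pixel_space:
    print([round(v, 2) for v in row])
```

The output is 5 × 5, the same size as the original input, which is what lets the result be read as a pixel-space activation.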
If you code this out with torch functions like max pool 2D, max unpool 2D, ReLU, and transpose convolution 2D, and print them all out, you'll notice that the final output is exactly as we computed it. So I hope this makes sense if you also want to verify it with code.
Now we can use those core components in torch, take AlexNet, the trained architecture, pass an image into it, and visualize each of the convolution layers along the way, mapping them to pixel space using this deconvolution computation. This is the original image of the tiger, with some grass and some water in the back. If you pass this into the first convolution layer and try to visualize what its feature maps have learned, there are many output feature maps. I'm just going to display 16 of them, where the green marks the points of activation. You can see that this convolution layer focuses a lot on the outline of the tiger. Some of the feature maps focus on the river in the back. There are also a lot of edges being highlighted on the tiger, whether it's the stripes or the actual contours of the tiger itself. So this first layer has detected some sense of edges. And this is just the feature map without the overlay of the image itself, so you can actually see the edges being detected.
Now for the next convolution layer, after the convolution, activation, and pooling, we can see that there is some emphasis on detecting the background, and some of the feature maps are detecting the face of the tiger itself and the foot contours. A lot of it is detecting parts of the actual object too, so that makes sense. But as you go to the higher layers, admittedly it gets a little trickier to see what the network is focusing on. There are parts of the background that I think these specific convolution layers are focusing on, and it also gets more abstract as you go further down.
So while this is a nice visual representation for a single input sample, it only gives some general, anecdotal guidance on whether the network is learning well. We can actually do a little bit better. This here is a visualization from the main paper that introduced deconvolution for visualization specifically, and I'm going to walk through it to get an understanding of what the layers are truly representing.
Let's say you have AlexNet and it is a fully trained network. We take an entire validation set, which may be tens of thousands of images, and pass it into the network. For each image we record the output feature activations for every filter in every layer. For each convolution layer there are many feature activations because there are many filters, so we choose a subsample of, let's say, nine or 16 of them. For each of those feature maps we choose the images that led to the largest activations and paste them on screen here.
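The selection step described above, picking the images that most strongly activate each feature map, can be sketched as follows. This is only an illustration under stated assumptions: each feature map is summarized by a single score per image (for example its peak activation), and the image ids, scores, and function name are hypothetical rather than taken from the paper or the video's notebook.

```python
import heapq


def top_k_images(scores_by_image, k=9):
    """Return, for each feature map, the ids of the k highest-scoring images.

    scores_by_image maps an image id to a list with one activation score
    per feature map (assumed here to be that map's peak activation).
    """
    n_maps = len(next(iter(scores_by_image.values())))
    top = []
    for m in range(n_maps):
        best = heapq.nlargest(
            k, scores_by_image.items(), key=lambda item: item[1][m]
        )
        top.append([image_id for image_id, _ in best])
    return top


# Hypothetical recorded scores for three validation images and two feature maps.
recorded = {
    "img_a": [0.1, 0.9],
    "img_b": [0.7, 0.2],
    "img_c": [0.4, 0.5],
}

top_images = top_k_images(recorded, k=2)
print(top_images)  # one list of image ids per feature map, strongest first
```

Each selected image would then be pushed back through the unpooling, ReLU, and transpose convolution chain to get the pixel-space map shown next to it.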
So if we look at this here, for the layer 2 convolution layer we are viewing a four-by-four grid, so that's 16 randomly selected feature maps. For each of these feature maps we are looking at the top nine images that activated that feature map the most. These are nine images in the inference or validation data set that activated this specific feature map the most. The idea is that if we look at these images for one feature activation of the convolution layer, and we also look at the corresponding pixel-space maps, we'll get a good sense of what this specific filter or feature map is specialized in recognizing.
Let's go through some examples. In the first convolution layer, we're looking at nine random feature activations. For each feature activation, it looks like it's paying attention to edges, and specifically all of the edges for a given feature map are oriented in the same way. These are all 45° diagonals. This one has roughly 60° diagonal edges. This one has diagonal edges in the opposite direction. So they all pay attention to some sense of orientation. You can also see that some sense of color is picked up by the layer 1 convolution feature activations.
Now let's move on to layer 2. Here we are visualizing 16 feature activations, and for each of these 16 we're looking at the top nine images. We can see that a good sense of color is detected by these specific feature maps. Some of these feature maps are also good at detecting some form of pattern or texture. There are also some that detect directional edges, like vertical edges of some sort, and there are also circular patterns.
So you can see at a high level that these layer 2 feature activations specialize in detecting colors as well as patterns.
Now let's go on to layer 3. Here we're visualizing 12 feature maps. This first feature map looks like it's detecting patterns like this mesh-type structure. This one over here is detecting more circular objects. And this one is highlighting this entire structure here. That looks like a back, but it's not actually recognizing this as the back of a human, more just the contour or the curvature right here. So it is still paying attention to general curved shapes. Looking at these patterns over here, this feature map is activated by words or barcode symbols, so it is starting to detect some form of texture as a whole. There's a lot of texture and certain shapes being detected in layer 3, which is slightly higher level conceptually than the color or edges detected in the first two layers. So that looks pretty nice.
Now going to layer 4, things get interesting, because this is where we start to see some more conceptual understanding. This first feature activation looks like it's detecting a dog face, or what looks like two eyes, a nose, and a tongue. If you go over here to this part of the image, it looks like it's activating on the reflections that you see in water. This feature map looks like it's been activated by two legs, and it has learned to recognize two legs.
So overall we can see that layer 4 tends to have a much higher level of conceptual understanding of objects than the previous layers.
Now moving on to layer 5. This one's pretty interesting. Here we see that maybe this feature map is looking at a lot of different kinds of patterns, like a dotted pattern, and it's activated by that too. What's interesting is that there's probably nothing that really looks in common for this particular feature map. All these images look very different, so it's hard to tell what this specific feature map has learned. But if you look at the pixel activations here, you can see that this is the grass over here, and similarly this center part is also grass. So it looks like this feature map was actually picking up on the background grass itself. Then we have this feature map that looks like it's picking up on some signs. And this one over here looks like it's picking up on the general contour of faces, where we have an oval face along with two eyes and a mouth. That's probably why it groups humans and dogs together, but it is also pretty good at detecting these dogs with pointy ears and at detecting eyes themselves.
So overall, layer 5 looks like it has an even higher conceptual understanding of objects than the layers that came before it. These layer 5 filters are looking at much higher-level data compared to the colors and orientations of layer 1, the edges and patterns of layer 2, the textures and some color of layer 3, and the higher-level conceptual visuals of layer 4, like water or legs. Layer 5 picks up on full objects, or sometimes on the background, like grass.
I hope this gives some intuition on what exactly these feature maps are learning in the internal layers, and also on how these features change as we go deeper.
Quiz time. Have you been paying attention? Let's quiz you to find out. Which of these is false about convolutional networks?
A. Deeper layers better understand objects conceptually.
B. Earlier layers detect low-level features like edges and color.
C. Earlier layers better understand objects conceptually.
D. Earlier layers are more invariant to image changes.
I'll give you a few seconds to answer this question. The correct options are C and D. Did you get them right? Leave your reasoning down in the comments below and let's have a discussion. And at this point, if you do think I deserve it, please do consider giving this video a like, because it will help me out a lot. Now, that's going to do it for quiz time and this video. But before we go, let's generate a summary.
We started out this video by looking at the output activations of a neural network and understanding that some of the later-layer activations are not easily interpretable, so we don't really know if the neural network was trained properly. To get around this, if we want to visualize the output of layer N, we can perform unpooling, ReLU activation, and transpose convolution in series until we get pixel-space activations from layer N. This pixel-space activation visualizes what layer N has learned to focus on. We also looked more closely at the computations involved in computing it. We then took a look at all of this in action by taking a trained AlexNet, passing an image into the network, and mapping every convolution layer back to pixel space so that we can actually see which parts of the image are being activated.
While this doesn't give the entire picture, it can at least directionally give you a sense of whether the network is being trained well, even with a single example. We then concluded by understanding what every convolution layer of an AlexNet architecture is learning, from the edge-level features of layer 1 to much higher-level conceptual features in the later layers. And that's all that we have for today. Some of the code and the resources for this video are going to be available in the description below, including some of the slides and the computation, so if the numbers are too small, you can look at them later. Thank you all so much for watching, and I will see you in the next one.
Original Description
In this video, we take a look at what the deep layers of a convolutional neural network actually learn, with some neat visuals. We illustrate this with deconvolution.
ABOUT ME
⭕ Subscribe: https://www.youtube.com/c/CodeEmporium?sub_confirmation=1
📚 Medium Blog: https://medium.com/@dataemporium
💻 Github: https://github.com/ajhalthor
👔 LinkedIn: https://www.linkedin.com/in/ajay-halthor-477974bb/
RESOURCES
[1 📚] Slides used in the video: https://link.excalidraw.com/p/readonly/KAFfYYmtoLUZuWisoJH8
[2 📚] Main paper of the video: https://arxiv.org/pdf/1311.2901
[3 📚] Code for visualizing deep layers of the network using deconvolution layers: https://github.com/ajhalthor/computer-vision-101/blob/main/visualization_of_deep_layers.ipynb
[4 📚] Main paper that introduced AlexNet: https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf
[5 📚] Code for visualizing the feature maps of convolution networks: https://github.com/ajhalthor/computer-vision-101/blob/main/Visualizing_Convolutions_and_Pooling.ipynb
PLAYLISTS FROM MY CHANNEL
⭕ Reinforcement Learning: https://youtube.com/playlist?list=PLTl9hO2Oobd9kS--NgVz0EPNyEmygV1Ha&si=AuThDZJwG19cgTA8
⭕ Natural Language Processing: https://youtube.com/playlist?list=PLTl9hO2Oobd_bzXUpzKMKA3liq2kj6LfE&si=LsVy8RDPu8jeO-cc
⭕ Transformers from Scratch: https://youtube.com/playlist?list=PLTl9hO2Oobd_bzXUpzKMKA3liq2kj6LfE
⭕ ChatGPT Playlist: https://youtube.com/playlist?list=PLTl9hO2Oobd9coYT6XsTraTBo4pL1j4HJ
⭕ Convolutional Neural Networks: https://youtube.com/playlist?list=PLTl9hO2Oobd9U0XHz62Lw6EgIMkQpfz74
⭕ The Math You Should Know : https://youtube.com/playlist?list=PLTl9hO2Oobd-_5sGLnbgE8Poer1Xjzz4h
⭕ Probability Theory for Machine Learning: https://youtube.com/playlist?list=PLTl9hO2Oobd9bPcq0fj91Jgk_-h1H_W3V
⭕ Coding Machine Learning: https://youtube.com/playlist?list=PLTl9hO2Oobd82vcsOnvCNzxrZOlrz3RiD