Semantic Segmentation in PyTorch | Neural Style Transfer #7
Key Takeaways
This video covers the basics of semantic segmentation in PyTorch, including the theory and an implementation using DeepLab v3 with a ResNet-101 backbone. It demonstrates how to instantiate the model, apply transforms to frames, create a data loader, and perform inference with the model.
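A minimal sketch of the inference step summarized above, under stated assumptions. The video instantiates `torchvision.models.segmentation.deeplabv3_resnet101(pretrained=True)` (newer torchvision releases take a `weights` argument instead); to keep the sketch runnable without downloading weights, a hypothetical `FakeSegmenter` stands in for it here. The function name `extract_person_masks` is illustrative, not the repository's code. The value `PERSON_CHANNEL_INDEX = 15` assumes the standard PASCAL VOC label order.

```python
from collections import OrderedDict

import numpy as np
import torch

PERSON_CHANNEL_INDEX = 15  # assumption: "person" in the standard 21-class PASCAL VOC order


def extract_person_masks(model, batch, device):
    """Return an (N, H, W) uint8 array: 255 where 'person' wins the argmax, else 0."""
    model = model.to(device).eval()   # eval mode so batch norm and dropout behave correctly
    with torch.no_grad():             # inference only, so no autograd graph is built
        out = model(batch.to(device))["out"]    # OrderedDict -> (N, 21, H, W) scores
    labels = out.cpu().numpy().argmax(axis=1)   # most probable class for every pixel
    return ((labels == PERSON_CHANNEL_INDEX) * 255).astype(np.uint8)


class FakeSegmenter(torch.nn.Module):
    """Hypothetical stand-in with the same output format as the torchvision model."""

    def forward(self, x):
        n, _, h, w = x.shape
        scores = torch.zeros(n, 21, h, w)
        scores[:, 0] = 1.0                                   # background everywhere
        scores[:, PERSON_CHANNEL_INDEX, :, : w // 2] = 2.0   # person on the left half
        return OrderedDict(out=scores)


masks = extract_person_masks(FakeSegmenter(), torch.rand(2, 3, 4, 6), torch.device("cpu"))
print(masks.shape, masks.dtype)
```

With the real model, the batch would first be resized, converted to a tensor, and normalized with ImageNet statistics, as the transcript describes.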
Full Transcript
In this video you're going to learn about semantic segmentation. I'll first cover some basic theory, and then we'll jump straight into coding in PyTorch, in a project I recently open-sourced. So let's dig into it.

So what exactly is semantic segmentation? It's a computer vision task where the goal is to assign a single class to every single pixel. You can see on the image there that certain pixels belong to the road, some pixels belong to the sidewalk, pedestrians, trees, etc. If you have all of this information, it's really easy to also do, for example, object detection: you just take the pedestrian, you draw a bounding box around the pedestrian, and you also know it's a pedestrian, so you already did classification.

The main metric for this task is something called mean intersection over union, and it's pretty simple to understand. Basically, you want to see the prediction your model made, for example for a single pedestrian, and you want to see the intersection between the prediction and the ground truth, i.e. the true labels. The bigger the intersection, the better the metric, and it intuitively makes sense; you can see the nice image with the squares there.

It's used in all kinds of applications. The two main ones that come to my mind are autonomous vehicles, and the second one would be maybe mixed reality. I'm probably biased because I worked on HoloLens.

The neural network that we'll be using in the code is DeepLab v3, so let me just briefly mention the DeepLab family. It was developed by Google, and it goes from v1 all the way to v3+, which is the newest model, but we are using v3 because it has an official PyTorch implementation, so it's really simple to use. On the screen, on the top left you can see the input image, and in the bottom right you can see the actual output I got with DeepLab v3.

The model itself was trained on the PASCAL VOC 2012 dataset, which has 21 classes, so 20
foreground classes and one background class. The classes encompass, as I already mentioned, background, and we also have person, airplane, bird, etc. Because it was trained on 21 classes, the output has 21 channels, and the spatial resolution is the same as the input image. That's how the semantic segmentation problem works.

How you get the most probable class for a single pixel is this: you take a certain (x, y) coordinate and you get 21 numbers, because you have 21 channels, and wherever the highest number is, that's the most probable class for that specific pixel. Say channel 0 had the biggest probability; that means background is the highest-probability class for that specific pixel. It's as easy as that. The model you can see on the screen is actually FCN and not DeepLab, but the output shape is the thing I want you to see here: it's 21 channels, and the spatial resolution is the same as the input image, as I already mentioned.

Okay, enough theory, let's jump into the code. This code is actually a whole neural style transfer video creation pipeline, but we'll be focusing only on a single component, and that's the segmentation. So let's see what the code looks like.

The "extract person mask from frames" function basically takes an input frame and just extracts the pixels where the person is present. On line 55, like a regular Python thing, we just figure out whether the user has a GPU or a CPU; GPUs are always preferable when training or using neural networks. Then we just instantiate the DeepLab v3 model with the ResNet-101 backbone, and we set pretrained to true because we want to have a pre-trained model, obviously. We put it on the GPU if we have one, and we set the model to evaluation mode, because certain layers like batch normalization and dropout will behave differently if we don't set this and we'll get some wrong outputs.

Next up, we
create the transforms that will be applied to every single frame. We may want to specify a certain height and width, because maybe our GPU doesn't have enough VRAM, and it will just give you a CUDA out-of-memory exception if the frame is too big. Then we convert it to a PyTorch tensor and we do normalization using ImageNet's statistics; this is just because of the way the PyTorch models were trained, so we have to do this preprocessing step. Then we just create an ImageFolder out of the frames that we want to process, also a standard Python thing, and we create a data loader and set the batch size to, for example, 4, because we want to use the parallel processing power of GPUs.

The next steps come after figuring out whether the output directories are empty or full. This is like a cache mechanism: if they already have some frames, we want to just skip this stage in the pipeline, but if they are empty, we want to proceed. We wrap all of this into the torch.no_grad context because we are doing inference; otherwise PyTorch will create computational graphs by default, which will allocate lots of memory and eat up your VRAM. So you want to do this step, it's really important.

The next important thing is that we iterate through the data loader and we get image batches. On line 84 we place the batch onto the GPU, because the model is also on the GPU; you want to have the tensors as well as the model on the same device, otherwise you'll get an error. Then we do the inference here: we pass the image batch into the DeepLab v3 segmentation model, and because the output is actually an ordered dictionary, this is just a thing you've got to do: you extract the actual output using the "out" key. Then we place the resulting batch on the CPU and convert it to NumPy.

Afterwards, we iterate through the result batch, which, as I already mentioned, has dimensions N,
where N is the batch size, then 21 because we have 21 output channels, and then height and width, the same as the input frames. The "out cpu" variable has the shape I mentioned: 21 channels, and height and width as the input frame.

And this is the main step: we do the argmax, which finds the thing I mentioned in the theoretical part, so we find the channel where we have the highest probability for that specific pixel. Then this "equal equal person channel index" comparison figures out the pixels where the person class was the most probable one in the image. Doing this, we get a person mask, i.e. we'll have a boolean value true on those pixels where the person was present. As simple as that. Then just some bureaucracy here: times 255 converts the booleans into a binary image with values 0 and 255, and we explicitly convert it to the NumPy unsigned 8-bit integer type.

After this we do the post-processing step, but before we dig into that part of the code, I want to briefly cover some theory behind the heuristics that are used in that specific function. What it does is clean up some components that the model spuriously outputted, which are erroneous. That's pretty common in computer vision: you usually have these hybrid approaches where the deep learning pipeline produces something and you just want to do some cleaning afterwards. There are two things you want to know here: the first one is the connected components algorithm, and the second one is morphological filtering operations.

Connected components are pretty simple. We as humans can easily tell that the square is not connected to the circle, i.e. there doesn't exist a path of white pixels connecting them. What the algorithm should do here is just assign a different label to
every one of these components, like label 0 for background, label 1 for the square, label 2 for the circle. Having that information, we can easily extract the circle or whatever component we want. The colored image on the right just visualizes the labels.

The second thing you need to know about is morphological filtering. You basically take the binary image as the input and process it with something called the structuring element, or the kernel, which is also a simple binary mask. You can either do erosion, where you get a smaller area, like you can see the letter J got smaller there, or you can do something called dilation, where you get a bigger area. The way you implement this, if you know something about logic gates, is pretty much a multi-input AND gate for the erosion or a multi-input OR gate for the dilation. Pretty simple.

Finally, opening is something we'll actually be using, and that's just a combination: you sequentially apply first erosion and then dilation. It makes sense, because if you look at the input image before the opening operation, you have those small dots; after doing erosion they will disappear, and after doing dilation you'll be left with just the letter J, which gets back to its initial size.

In this slide you can see a concrete example of the person mask I got using the DeepLab v3 model. You can see that by doing the opening we kind of split off that small component that isn't supposed to be there. Then, after figuring out the connected components and finding the second biggest one, the first being background and the second being person, we'll be able to keep only the person pixels. Simple as that.

Back to the code: we can now figure out what the post-processing method actually does. If we jump here, you can see that on line 26 we just create a kernel, so
that's the structuring element I mentioned on the morphological filtering slide, and it's just a simple square of ones. Using OpenCV's morphological function, we apply it to the mask and we get this thing called the opened mask, which is the initial mask from the DeepLab model after applying the opening operation. Then we do connected components on the opened mask and we get the labeled image out.

Now, given that labeled image from the connected components algorithm, we need to figure out which component belongs to the background. So we take the upper part of the labeled image, the first 10 percent of the image, and count to see what's the most frequent value in that space. I call that the discriminant subspace, and the most common value is something we assume to be the background component's label. We get the background index here on line 37.

The next step is to create a list of tuples, where each tuple contains a connected component's label and area; that's line 43. After sorting those according to the size of the areas and filtering out the background label that we found above using the discriminant subspace, we are left with the biggest component after the background, which we assume to be the person. This here takes the biggest leftover component, and the zero just grabs the label, so I'm left with the person index. After checking which pixels contain exactly this label, I'm left only with person pixels. It's pretty simple.

If I go ahead and visualize only the mask that came out directly from the model, we get a result like this, and you can notice certain components here which do not belong to the person mask and which should obviously be removed. If I go inside the post-processing method, after doing the morphological operations we get this, and you can see on the
right that the erosion already fixed this particular mask, but in some other cases we'll need connected components to remove the other components which do not belong to the person.

So that covers the semantic segmentation theory and code. Hopefully you found this video useful. If you did, consider subscribing and sharing this video, and see you next time.
Original Description
❤️ Become The AI Epiphany Patreon ❤️ ► https://www.patreon.com/theaiepiphany
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
In this video, I cover semantic segmentation: both the basic theory and the PyTorch implementation.
You'll learn about:
✔️ What is semantic segmentation
✔️ How to implement it in PyTorch using DeepLab V3
✔️ What are connected components and morph filters
✔️ How to post-process the raw model masks
Note: I'll cover the whole pipeline in the next video.
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
✅ GitHub code: https://github.com/gordicaleksa/pytorch-naive-video-nst
✅ DeepLab V3 paper: https://arxiv.org/abs/1706.05587
✅ Morph filtering: https://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_imgproc/py_morphological_ops/py_morphological_ops.html
✅ Useful blog: https://www.learnopencv.com/pytorch-for-beginners-semantic-segmentation-using-torchvision/
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
⌚️ Timetable:
0:00 Semantic Segmentation (Basic Theory)
3:00 Semantic Segmentation (Code-Walkthrough)
8:25 Digital Image Processing (Basic Theory)
10:47 Mask post-processing (Code-Walkthrough)
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
💰 BECOME A PATREON OF THE AI EPIPHANY ❤️
If these videos, GitHub projects, and blogs help you,
consider helping me out by supporting me on Patreon!
The AI Epiphany ► https://www.patreon.com/theaiepiphany
One-time donations: https://www.paypal.com/paypalme/theaiepiphany
Much love! ❤️
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
💡 The AI Epiphany is a channel dedicated to simplifying the field of AI using creative visualizations and in general, a stronger focus on geometrical and visual intuition, rather than the algebraic and numerical "intuition".
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
👋 CONNECT WITH ME ON SOCIAL
LinkedIn ► https://www.linkedin.com/in/aleksagordic/
Twitter ► https://twitter.com/gordic_aleksa
Instagram ► https://www.instagram.com/aiepiphany/
Facebook ► https://www.facebook.com/aiepiphany/
👨👩👧👦 JOIN OUR DISCORD COMMUNITY:
Discord ► https://discord.gg/peBrCphe