Semantic Segmentation in PyTorch | Neural Style Transfer #7

Aleksa Gordić - The AI Epiphany · Beginner ·👁️ Computer Vision ·5y ago

Key Takeaways

This video covers the basics of semantic segmentation in PyTorch, including the theory and implementation using DeepLab v3 and ResNet 101. The video demonstrates how to instantiate the model, apply transforms to frames, create a data loader, and perform inference using the model.

Full Transcript

In this video you're going to learn about semantic segmentation. I'll first cover some basic theory, and then we'll jump straight into coding in PyTorch, in a project I recently open-sourced. So let's dig into it.

So what exactly is semantic segmentation? It's a computer vision task where the goal is to assign a single class to every single pixel. You can see on the image that certain pixels belong to the road, some pixels belong to the sidewalk, pedestrians, trees, etc. If you have all of this information, it's really easy to also do, for example, object detection: you just take the pedestrian, draw a bounding box around it, and you also know it's a pedestrian, so you already did classification.

The main metric for this task is something called mean intersection over union, and it's pretty simple to understand. Basically, you take the prediction your model made, for example for a single pedestrian, and you look at the intersection between the prediction and the ground truth, i.e. the true labels. The bigger the intersection relative to the union, the better the metric, which intuitively makes sense, and you can see the nice image with the squares there.

It's used in all kinds of applications. The two main ones that come to my mind are autonomous vehicles and, second, maybe mixed reality; I'm probably biased because I worked on HoloLens.

The neural network that we'll be using in the code is DeepLab v3, so let me just briefly mention the DeepLab family. It was developed by Google, and it goes from v1 all the way to v3+, which is the newest model, but we are using v3 because it has an official PyTorch implementation, so it's really simple to use. On the screen, in the top left you can see the input image, and in the bottom right you can see the actual output I got with DeepLab v3.

The model itself was trained on the Pascal VOC 2012 dataset, which has 21 classes: 20 foreground classes and one background class. Besides the background, which I already mentioned, the classes include person, airplane, bird, etc. Because it was trained on 21 classes, the output has 21 channels, and the spatial resolution is the same as the input image. That's how the semantic segmentation problem works.

How you get the most probable class for a single pixel is this: you take a certain (x, y) coordinate and you get 21 numbers, because you have 21 channels, and wherever the highest number is, that's the most probable class for that specific pixel. Say channel 0 had the biggest probability; that means background is the highest-probability class for that pixel. It's as easy as that. The model you can see on the screen is actually FCN and not DeepLab, but the output shape is the thing I want you to see here: it's 21 channels, and the spatial resolution is the same as the input image, as I already mentioned.

Okay, enough theory, let's jump into the code. This code is actually a whole neural style transfer video creation pipeline, but we'll be focusing only on a single component, and that's the segmentation. So let's see what the code looks like.

The extract person mask from frames function basically takes an input frame and extracts the pixels where the person is present. On line 55, a regular Python thing, we just figure out whether the user has a GPU or a CPU; GPUs are always preferable when using neural networks. Then we instantiate the DeepLab v3 model with the ResNet-101 backbone, and we set pretrained to true because we obviously want a pre-trained model. We put it on the GPU if we have one, and we set the model to evaluation mode, because certain layers like batch normalization and dropout behave differently if we don't set this, and we'd get wrong outputs.

Next up, we create the transforms which will be applied to every single frame. We may want to specify a certain height and width, because maybe our GPU doesn't have enough VRAM and it will give you a CUDA out-of-memory exception if the frame is too big. Then we convert the frame to a PyTorch tensor and normalize it using ImageNet's statistics; we have to do this processing step because of the way the PyTorch models were trained. Then we create an ImageFolder out of the frames we want to process, also a standard Python thing, and we create a data loader and set the batch size to, for example, 4, because we want to use the parallel processing power of GPUs.

The next step comes after figuring out whether the output directories are empty or full. This is a cache mechanism: if they already contain some frames, we just skip this stage in the pipeline, but if they are empty, we proceed. We wrap all of this in the torch.no_grad() context because we are doing inference; otherwise PyTorch will build computational graphs by default, which will allocate lots of memory and eat up your VRAM. So you want to do this step; it's really important.

The next important thing is that we iterate through the data loader and get image batches. On line 84, we place the batch onto the GPU because the model is also on the GPU; you want the tensors and the model on the same device, otherwise you'll get an error. Then we do the inference: we pass the image batch into the DeepLab v3 segmentation model, and because the output is actually an ordered dictionary, and this is just a thing you've got to do, we extract the actual output using the "out" key. Then we place the resulting batch on the CPU and convert it to NumPy.

Afterwards we iterate through the result batch, whose dimension is N, where N is the batch size, then 21, because we have 21 output channels, and then the height and width, the same as the input frames. Each out cpu has the shape I mentioned: 21 channels, with the height and width of the input frame. And this is the main step: we do the argmax, which does the thing I mentioned in the theoretical part, so we find the channel with the highest probability for each specific pixel. Then the "== person channel index" comparison figures out the pixels where the person class was the most probable one, so we get a person mask, i.e. a boolean value of true on those pixels where the person was present. As simple as that. Then there's just some bureaucracy here: times 255 converts the booleans into a binary image with values 0 and 255, and we explicitly convert it to the NumPy unsigned 8-bit integer type.

After this we do the post-processing step, but before we dig into that part of the code, I want to briefly cover some theory behind the heuristics used in that function. What it does is clean up some erroneous components that the model spuriously output. That's pretty common in computer vision: you usually have these hybrid approaches where the deep learning pipeline produces something and you just want to do some cleaning afterwards. There are two things you want to know here: the first is the connected components algorithm, and the second is morphological filtering operations.

Connected components are pretty simple. We as humans can easily tell that the square is not connected to the circle, i.e. there is no path of white pixels connecting them. What the algorithm should do here is assign a different label to every one of these components, like label 0 for the background, label 1 for the square, and label 2 for the circle. Having that information, we can easily extract the circle or whatever component we want. The colored image on the right just visualizes the labels.

The second thing you need to know about is morphological filtering. You take a binary image as the input and process it with something called the structuring element, or the kernel, which is also a simple binary mask. You can either do erosion, where you get a smaller area, like you can see the letter J got smaller there, or you can do dilation, where you get a bigger area. If you know something about logic gates, the way you implement this is pretty much a multi-input AND gate for the erosion and a multi-input OR gate for the dilation. Pretty simple.

Finally, opening is what we'll actually be using, and that's just a sequential combination: you first apply erosion and then dilation. It makes sense, because the input image has those small dots before the opening operation; after erosion they disappear, and after dilation you are left with just the letter J, which gets back to its initial size.

In this slide you can see a concrete example of the person mask I got using the DeepLab v3 model. You can see that the opening splits off a small component that isn't supposed to be there, and then, after figuring out the connected components and finding the second-biggest one, the first being the background and the second being the person, we'll be able to keep only the person pixels. Simple as that.

Back to the code, we can now figure out what the post-processing method actually does. If we jump here, you can see that on line 26 we create a kernel, which is the structuring element I mentioned in the morphological filtering slide, and it's just a simple square of ones. Using OpenCV's morphological function, we apply it to the mask and get this thing called the opened mask, which is the initial mask from the DeepLab model after applying the opening operation. Then we run connected components on the opened mask and get the labeled image out.

Now, given that labeled image from the connected components algorithm, we need to figure out which component belongs to the background. So we take the upper part of the labeled image, the first 10 percent of the image, and count to see what the most frequent value is in that space, which I call the discriminant subspace. The most common value is what we assume to be the background label, and we get the background index here on line 37.

The next step, on line 43, is to create a list of tuples where each tuple contains a connected component's label and area. After sorting those by area and filtering out the background label we found above using the discriminant subspace, we are left with the biggest component after the background, which we assume to be the person. So this expression takes the biggest leftover component, and the zero index grabs its label, so I'm left with the person index. After checking which pixels contain exactly this label, I'm left with only the person pixels. It's pretty simple.

If I go ahead and visualize only the mask that came directly out of the model, we get a result like this, and you can notice certain components here which do not belong to the person mask and which should obviously be removed. If I go inside the post-processing method, after doing the morphological operations we get this, and you can see on the right that the erosion already fixed this concrete mask, but in some other cases we'll need connected components to remove the other components which do not belong to the person.

So that covers the semantic segmentation theory and code. Hopefully you found this video useful. If you did, consider subscribing and sharing this video, and see you next time.
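
Two ideas from the transcript, picking the most probable class per pixel and scoring a prediction with mean intersection over union, can be illustrated with a small NumPy sketch. This is an illustration under stated assumptions, not the code shown in the video: the function names are hypothetical, the scores are taken to have shape (classes, height, width), and index 15 for the person class assumes the usual Pascal VOC ordering.

```python
import numpy as np

PERSON_CHANNEL_INDEX = 15  # assumption: 'person' in the usual Pascal VOC class ordering


def class_map_from_scores(scores):
    """scores: (C, H, W) per-class scores -> (H, W) index of the best class per pixel."""
    return np.argmax(scores, axis=0)


def person_mask_from_scores(scores):
    """Binary uint8 image: 255 where 'person' is the most probable class, else 0."""
    is_person = class_map_from_scores(scores) == PERSON_CHANNEL_INDEX
    return (is_person * 255).astype(np.uint8)


def mean_iou(pred, target, num_classes):
    """Average of per-class intersection / union, skipping classes absent from both maps."""
    ious = []
    for c in range(num_classes):
        pred_c, target_c = pred == c, target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class appears in neither map, so it has no defined IoU
        ious.append(np.logical_and(pred_c, target_c).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")
```

For example, calling `person_mask_from_scores` on one element of a model's output batch gives the 0/255 mask that the post-processing step then cleans up.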

Original Description

❤️ Become The AI Epiphany Patreon ❤️ ► https://www.patreon.com/theaiepiphany

In this video, I cover semantic segmentation - both basic theory and we also dig into the PyTorch implementation.

You'll learn about:
✔️ What is semantic segmentation
✔️ How to implement it in PyTorch using DeepLab V3
✔️ What are connected components and morph filters
✔️ How to post-process the raw model masks

Note: I'll cover the whole pipeline in the next video.

✅ GitHub code: https://github.com/gordicaleksa/pytorch-naive-video-nst
✅ DeepLab V3 paper: https://arxiv.org/abs/1706.05587
✅ Morph filtering: https://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_imgproc/py_morphological_ops/py_morphological_ops.html
✅ Useful blog: https://www.learnopencv.com/pytorch-for-beginners-semantic-segmentation-using-torchvision/

⌚️ Timetable:
0:00 Semantic Segmentation (Basic Theory)
3:00 Semantic Segmentation (Code-Walkthrough)
8:25 Digital Image Processing (Basic Theory)
10:47 Mask post-processing (Code-Walkthrough)

💰 BECOME A PATREON OF THE AI EPIPHANY ❤️
If these videos, GitHub projects, and blogs help you, consider helping me out by supporting me on Patreon!
The AI Epiphany ► https://www.patreon.com/theaiepiphany
One-time donations: https://www.paypal.com/paypalme/theaiepiphany
Much love! ❤️

💡 The AI Epiphany is a channel dedicated to simplifying the field of AI using creative visualizations and in general, a stronger focus on geometrical and visual intuition, rather than the algebraic and numerical "intuition".

👋 CONNECT WITH ME ON SOCIAL
LinkedIn ► https://www.linkedin.com/in/aleksagordic/
Twitter ► https://twitter.com/gordic_aleksa
Instagram ► https://www.instagram.com/aiepiphany/
Facebook ► https://www.facebook.com/aiepiphany/

👨‍👩‍👧‍👦 JOIN OUR DISCORD COMMUNITY:
Discord ► https://discord.gg/peBrCphe

This video teaches the basics of semantic segmentation in PyTorch, including the theory and implementation using DeepLab v3 and ResNet 101. The video demonstrates how to instantiate the model, apply transforms to frames, create a data loader, and perform inference using the model. By the end of this video, viewers will be able to implement semantic segmentation in PyTorch and understand the basics of computer vision.

Key Takeaways
  1. Instantiate DeepLab v3 model with ResNet 101 backbone
  2. Set the model to evaluation mode
  3. Create a function to extract person mask from frames
  4. Apply transforms to frames, including resizing and normalization
  5. Create a data loader with batch size 4 and use GPU for parallel processing
  6. Do inference using DeepLab model and extract output using arg max
  7. Apply post-processing to clean up erroneous components
💡 The video highlights the importance of post-processing in semantic segmentation, including the use of morphological filtering operations and connected components algorithm to clean up erroneous components.
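
The post-processing idea in the last takeaway (opening followed by connected components) can be sketched without OpenCV as below. This is a NumPy-only stand-in, not the function from the video, which uses OpenCV's morphological and connected components routines. All function names are hypothetical, and the sketch makes three simplifying assumptions: pixels outside the image count as background, components are 4-connected, and zero pixels are treated as background instead of finding the background label from the top 10 percent of the image.

```python
from collections import deque

import numpy as np


def erode(mask, k=3):
    """Erosion with a k x k square of ones: AND over each pixel's neighbourhood."""
    h, w = mask.shape
    padded = np.pad(mask, k // 2, constant_values=False)
    out = np.ones((h, w), dtype=bool)
    for dy in range(k):
        for dx in range(k):
            out &= padded[dy:dy + h, dx:dx + w]
    return out


def dilate(mask, k=3):
    """Dilation with a k x k square of ones: OR over each pixel's neighbourhood."""
    h, w = mask.shape
    padded = np.pad(mask, k // 2, constant_values=False)
    out = np.zeros((h, w), dtype=bool)
    for dy in range(k):
        for dx in range(k):
            out |= padded[dy:dy + h, dx:dx + w]
    return out


def label_components(mask):
    """Breadth-first labelling of 4-connected foreground regions; 0 stays background."""
    h, w = mask.shape
    labels = np.zeros((h, w), dtype=np.int32)
    count = 0
    for y, x in zip(*np.nonzero(mask)):
        if labels[y, x]:
            continue  # already reached from an earlier seed
        count += 1
        labels[y, x] = count
        queue = deque([(y, x)])
        while queue:
            cy, cx = queue.popleft()
            for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not labels[ny, nx]:
                    labels[ny, nx] = count
                    queue.append((ny, nx))
    return labels, count


def keep_largest_component(mask, k=3):
    """Opening (erosion, then dilation), then keep only the largest foreground component."""
    opened = dilate(erode(mask.astype(bool), k), k)
    labels, count = label_components(opened)
    if count == 0:
        return np.zeros(mask.shape, dtype=np.uint8)
    areas = np.bincount(labels.ravel())[1:]  # drop the background count at index 0
    largest = 1 + int(np.argmax(areas))
    return ((labels == largest) * 255).astype(np.uint8)
```

Opening removes specks smaller than the kernel, and the largest-component step then drops any bigger blob that survived but is not the main subject.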

Chapters (4)

0:00 Semantic Segmentation (Basic Theory)
3:00 Semantic Segmentation (Code-Walkthrough)
8:25 Digital Image Processing (Basic Theory)
10:47 Mask post-processing (Code-Walkthrough)