DETR - Explained!

CodeEmporium · Advanced ·🔢 Mathematical Foundations ·5mo ago

Key Takeaways

The video explains Detection Transformers (DETR) and how it compares to Faster R-CNN, covering its architecture, training, inference, and benchmark performance.

Full Transcript

Greetings, fellow learners. In this video we are going to look at DETR, the detection transformer. So let's get to it.

We're going to start our discussion with what object detection is. It involves the localization and classification of objects in an image. Localization is drawing bounding boxes around objects, and classification is determining what those objects represent.

In 2015, Faster R-CNN emerged as the state of the art for object detection. It looked something like this: we would take an image and generate anchors. These are basically prior bounding boxes, where we choose different aspect ratios and sizes for thousands of boxes over the image. Then we slightly adjust each of these to get predictions of whether they actually contain objects. Then we perform non-maximum suppression (NMS) to remove overlapping bounding boxes and eventually get a set of object proposals, or region proposals. These are where objects can be present. We then adjust these and classify them into the object classes we care about, to perform true object detection, and then perform NMS again to get the final set of localizations and classifications. For more information on this network, I have an entire video on Faster R-CNN, so I'll delegate the details to that.

The main issue with this architecture is that it's pretty complex. For example, we have to design anchors by hand, which involves deciding what the sizes and aspect ratios of the prior boxes should be. There's also a bunch of post-processing, particularly non-maximum suppression, which requires us to choose a threshold: if two bounding boxes overlap, at what amount of overlap should we remove one of them? These design choices can greatly affect performance. So how do we deal with this?
Well, seeing the recent success of transformer-based architectures in NLP tasks, from "Attention Is All You Need" in 2017 through 2020 with models like BERT and GPT, researchers thought of using this transformer-based architecture in object detection, and so we have DETR. DETR is an object detection framework that's built on transformers. It's simpler than Faster R-CNN, yet it achieves similar performance at a similar speed. Let's take a look at exactly how this happens and how it's trained.

In order to train DETR, we first pre-train a ResNet on image recognition. You can imagine each of these as convolution, activation, and pooling blocks with residual connections, which is why this network can be extremely deep. We pass in an image and get an object classification, and so we can train this ResNet architecture. Then we remove the last FC layer to create the convolutional backbone for DETR. So if we pass in an image of H cross W with three RGB channels, we get a tensor that represents the image, where the little h is 32 times smaller than the big H, the little w is 32 times smaller than the big W, and there are 2,048 channels. We're now going to use it in our DETR architecture.

For training our end-to-end model, DETR comprises three major components. One is this convolution block section, another is this transformer block section, and then we have something called Hungarian matching. Let's talk about each of these components.

We first take our input image and pass it to our ResNet backbone to get the tensor that represents the image. We then perform a pointwise convolution to adjust the number of channels to, let's say, 256. This is done to prepare the data for passing into the transformer block. We then take this tensor and flatten it such that the final matrix is hw cross 256, with one row per location of the feature map.
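The shape bookkeeping described above (backbone output, pointwise convolution, flattening) can be sketched in plain Python. This is only an illustration on toy sizes, not the DETR implementation: the function names are hypothetical, nested lists stand in for tensors, and the integer division by 32 is an approximation of the backbone's downsampling.

```python
def backbone_output_shape(height, width, stride=32, channels=2048):
    """Approximate (h, w, C) of the backbone feature map for an H x W image."""
    return (height // stride, width // stride, channels)


def pointwise_conv(feature_map, weight):
    """1x1 convolution: the same linear map applied to every pixel's channels.

    feature_map is a nested list [h][w][c_in]; weight is [c_out][c_in].
    """
    return [
        [[sum(wk * x for wk, x in zip(w_row, pixel)) for w_row in weight]
         for pixel in row]
        for row in feature_map
    ]


def flatten_spatial(feature_map):
    """Turn [h][w][d] into [h*w][d], one row per image location."""
    return [pixel for row in feature_map for pixel in row]


# Toy sizes so the example runs instantly: 2 x 3 locations, 4 -> 2 channels.
# (In the video the real sizes are 2,048 -> 256 channels.)
toy_map = [[[1.0] * 4 for _ in range(3)] for _ in range(2)]
toy_weight = [[1.0] * 4, [0.5] * 4]
tokens = flatten_spatial(pointwise_conv(toy_map, toy_weight))

print(backbone_output_shape(640, 480))  # (20, 15, 2048)
print(len(tokens), len(tokens[0]))      # 6 2
```

Each row of `tokens` plays the role of one of the vectors that is later fed to the transformer encoder.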
The idea here is that each of these 256-dimensional vectors is a rich representation of the features of a specific section of the image. So this could be one of the 256-dimensional vectors, representing this section, and over here we might have another 256-dimensional vector that represents this other section of the image, probably the edges, the curves, the texture and so on. Because they now represent different sections of the image, we also add positional encodings. These are non-learnable parameters, which means they're not going to be updated by backpropagation; they are sine and cosine functions. This is basically saying that this 256-dimensional vector is position one, this 256-dimensional vector is position two, and so on.

Next, we pass it into the encoder-decoder architecture of our transformer. You can see that this architecture is very similar to the one in the "Attention Is All You Need" paper from 2017. Again, I've done a video on this topic too, so I will link to that in the description. Effectively, the transformer encoder performs self-attention. You can imagine each of these 256-dimensional vectors passing through the encoder and thus encoding some global context, so they become richer representations of pieces of the image.

Next, in our transformer decoder, the input is a bunch of vectors known as object queries. You can think of these as placeholders for objects that will eventually be detected in the image. Each of these is a set of learnable parameters, which means they're going to be updated via backpropagation. They can be randomly initialized to start with, but they are learnable parameters, and we have 100 of these vectors. This is because, for now, let's assume that we want to detect at most 100 objects in a given image.
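As a rough illustration of the fixed sine and cosine positional encodings mentioned above, here is a minimal sketch that follows the formula from "Attention Is All You Need" over flattened positions. The function names are hypothetical, and this is the simple one-dimensional picture used in the video; the DETR paper itself uses a two-dimensional variant of the same idea.

```python
import math


def sinusoidal_encoding(num_positions, dim=256, base=10000.0):
    """Fixed (non-learned) table of shape [num_positions][dim].

    Even indices hold sin(pos / base**(i / dim)), odd indices the matching cos.
    """
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(0, dim, 2):
            angle = pos / (base ** (i / dim))
            row.append(math.sin(angle))
            row.append(math.cos(angle))
        table.append(row[:dim])
    return table


def add_positions(tokens, table):
    """Element-wise sum of each feature vector with its position's encoding."""
    return [[t + p for t, p in zip(tok, pos)] for tok, pos in zip(tokens, table)]


pe = sinusoidal_encoding(num_positions=4, dim=8)
encoded = add_positions([[0.0] * 8 for _ in range(4)], pe)
print(pe[0][:2])  # [0.0, 1.0]
```

Because the table depends only on the position and the dimension, nothing in it is updated during training.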
Hence the object queries form a 100 cross 256 matrix. We pass this as input to the transformer decoder. Through cross-attention, it gets information about the image pieces from the encoder. Once it encodes that information through self- and cross-attention, there are going to be 100 output vectors, and each vector is now going to represent an actual object.

This 256-dimensional vector that represents an object is not very interpretable, so we pass it through some FC layers that map it to an 11-dimensional vector and a four-dimensional vector. We do this for all 100 of them: each one passes through the same FC layer to get an 11-dimensional vector and the same FC layer to get a four-dimensional vector. Why 11 dimensions? Let's assume that for this object detection task there are 10 classes of objects that are possible to detect, and the 11th one is the background class. And why four dimensions? This is for localization: they can represent the x and y coordinates of the center of the bounding box, and then the height and the width of the bounding box in pixels. So we can represent the bounding box by four numbers. Effectively, this 256-dimensional vector gives us the localization as well as the probability distribution over classes for the object.

If you actually decode this and draw it onto an image, you get predictions that look like this. There are potentially a hundred bounding boxes, and each of these bounding boxes is associated with a probability distribution. I just put "dog" for some of them, but each of the boxes actually has 11 probabilities. Now, for these predictions we have a ground truth label for the image. That's the actual bounding box around the actual object of interest.
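To make the two outputs per object query concrete, here is a small sketch of how an 11-dimensional score vector becomes a probability distribution and how a four-number box in center format becomes corner coordinates. The helper names and the (cx, cy, w, h) ordering are assumptions for illustration; the FC layers that would produce these numbers are not shown.

```python
import math


def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def center_to_corners(box):
    """Convert (cx, cy, w, h) into (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


# 10 object classes plus 1 background class, as in the video's example.
probs = softmax([0.0] * 11)
corners = center_to_corners((50, 40, 20, 10))
print(len(probs))  # 11
print(corners)     # (40.0, 35.0, 60.0, 45.0)
```

The same two conversions would be applied independently to each of the 100 decoder outputs.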
And here we perform the third phase, which is known as Hungarian matching. Hungarian matching creates a one-to-one mapping between the ground truth bounding boxes and the predictions. In this case, we want to select the optimal prediction that will correspond to this object, and we want the other 99 to be matched to "no object". This is what we see here: one of these predictions is mapped to the ground truth bounding box, and the remaining 99 are mapped to no object. This is required because we want a ground truth label for all 100 predictions, and Hungarian matching allows us to find that matching.

We can now compute a cross-entropy loss as well as a smooth L1 loss. The cross-entropy loss will have 100 terms, one per prediction. 99 of the predictions should predict no object, so for those we look at the probability of predicting no object. For the one that's matched to the actual ground truth, we look at the dog prediction, which is 0.87 here. And so we compute a cross-entropy loss. The smooth L1 loss is basically about the bounding box overlap: the larger the overlap between this box and the ground truth box, the smaller the loss. It only applies to the cases where there is an object; for the background cases there is no bounding box, so they make no contribution to the smooth L1 loss. Then we aggregate both of these losses to create the final loss. And so this is the entire end-to-end architecture, which we then train via backpropagation.
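The matching step can be illustrated with a toy example. The sketch below is not the Hungarian algorithm itself: it brute-forces every one-to-one assignment, which finds the same minimum-cost matching on tiny inputs but would be far too slow for 100 predictions. The cost function (negative class probability plus L1 box distance) is a simplified assumption, and all names and numbers are made up for illustration.

```python
from itertools import permutations


def match_cost(pred, target):
    """Lower is better: confident in the right class and close to the box."""
    class_cost = -pred["probs"][target["label"]]
    box_cost = sum(abs(p - t) for p, t in zip(pred["box"], target["box"]))
    return class_cost + box_cost


def best_assignment(preds, targets):
    """Return a tuple where entry j is the prediction index matched to target j."""
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(preds)), len(targets)):
        cost = sum(match_cost(preds[i], targets[j]) for j, i in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return best


def assign_labels(num_preds, assignment, targets, background):
    """Every unmatched prediction gets the background ("no object") label."""
    labels = [background] * num_preds
    for j, i in enumerate(assignment):
        labels[i] = targets[j]["label"]
    return labels


# Three predictions, classes: 0 = dog, 1 = cat, 2 = background.
preds = [
    {"probs": [0.10, 0.10, 0.80], "box": (0.2, 0.2, 0.1, 0.1)},
    {"probs": [0.87, 0.06, 0.07], "box": (0.5, 0.5, 0.4, 0.4)},
    {"probs": [0.30, 0.30, 0.40], "box": (0.9, 0.9, 0.1, 0.1)},
]
targets = [{"label": 0, "box": (0.5, 0.5, 0.4, 0.3)}]

assignment = best_assignment(preds, targets)
labels = assign_labels(len(preds), assignment, targets, background=2)
print(assignment)  # (1,)
print(labels)      # [2, 0, 2]
```

The resulting `labels` list is what gives every prediction a ground truth label, so a classification loss can be computed over all of them while the box loss only uses the matched ones.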
This loss is effectively going to update all the parameters of the FC layers and the transformer decoder. The object queries are going to be learned. The loss backpropagates through the encoder. The positional encodings are not going to be learned; these are fixed. But the convolution layers over here and the ResNet architecture are also going to be learned. And hence training happens via backpropagation.

Now, for inference the architecture looks a little simpler. It no longer has the Hungarian matching, and there is no loss computation because there are no ground truth labels. We take an image and pass it through the convolution block to get the hw cross 256 matrix, add the sine and cosine positional encodings, and pass it through the transformer encoder to get global context encoded in these vectors. Each vector represents some section of the image. We then take the learned object queries and pass them into the decoder. These are placeholders for objects, so they take in information about the image, and each of them now represents an object of interest. We then transform each 256-dimensional vector into an 11-dimensional vector that represents the probability distribution over classes and a four-dimensional vector that represents the localization of the bounding box.

The idea here is that if, as in this case, there's only one object of interest, then for 99 of these rows the background prediction should have the highest probability, and so we don't need to pay attention to the corresponding localization. But for one of them, ideally, the class probability of dog will be, what does it say, 87%, the class probability of cat is 6%, and so on.
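The inference-time reading of the outputs described above can be sketched as a simple filter: keep a row only when its most likely class is not background. The function name, the three-class setup, and the numbers are hypothetical; they mirror the dog example from the video on a smaller scale.

```python
def keep_detections(class_probs, boxes, background_index):
    """Return (label, score, box) for every row whose top class is an object."""
    detections = []
    for probs, box in zip(class_probs, boxes):
        label = max(range(len(probs)), key=probs.__getitem__)
        if label != background_index:
            detections.append((label, probs[label], box))
    return detections


# Classes: 0 = dog, 1 = cat, 2 = background. Two rows instead of 100.
class_probs = [
    [0.87, 0.06, 0.07],  # confident dog
    [0.10, 0.10, 0.80],  # background, so its box is ignored
]
boxes = [(0.5, 0.5, 0.4, 0.4), (0.2, 0.2, 0.1, 0.1)]

print(keep_detections(class_probs, boxes, background_index=2))
# [(0, 0.87, (0.5, 0.5, 0.4, 0.4))]
```

In practice one might also apply a minimum score before keeping a row; that threshold would be an extra design choice not discussed in the video.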
And so it has the dog prediction over here. We look at its corresponding localization to get the bounding box, which would look like this. So we don't really need non-maximum suppression or any other post-processing here, which is fantastic, as it simplifies our architecture.

Let's now look at performance. This is a snapshot of a table from the paper, where we have different versions of Faster R-CNN and different versions of DETR. FPS is frames per second, which indicates how fast these are at inference: the higher the number, the higher the speed. AP indicates average precision, so higher numbers indicate higher performance. What you can see here is that, overall, DETR and Faster R-CNN have similar speeds and comparable performance. One thing to note is that DETR is a simpler architecture, but Faster R-CNN is better at detecting smaller objects. APS here is the performance on small objects, where you can see that there is a bit of a bigger difference, whereas for large objects DETR seems to be pretty good almost across the board. So yeah, I hope all of this now makes sense.

Quiz time. Have you been paying attention? Let's quiz you to find out. Which of these statements is false about the vanilla DETR? (A) DETR has no heavy post-processing. (B) DETR requires longer training time than Faster R-CNN. (C) DETR detects small objects better than Faster R-CNN. (D) Hungarian matching is used to match predictions to ground truth boxes. I'll give you a few seconds to answer this question. The correct option is C. Did you get it right? Please comment your thoughts and reasoning down in the comments below and let's have a discussion. And at this point, if you think I do deserve it, please do consider giving this video a like, because it will help me out a lot. That's going to do it for quiz time.
But before we go, let's generate a summary. In this video, we looked at DETR, the detection transformer. We started with a discussion of object detection and then looked at the SOTA of 2015, which was Faster R-CNN. However, this architecture was very complex, as it had many design choices that could greatly affect performance. To deal with this, and seeing the recent success of transformer-based architectures, DETR was born. DETR is an object detection framework built on transformers. It's simpler than Faster R-CNN, yet it achieves similar performance. We then took a look at how DETR is trained via this architecture and how inference is made as well. And we ended with a comparison of performance and speed between DETR and Faster R-CNN. So that's all I have today. Thank you all so much for watching. All the resources, including the slides, the papers, and references to other videos, will be down in the description below. If you think I deserve it, please do consider giving this video a like, and I will see you in the next one. Bye-bye.

Original Description

In this video, we take a look at Detection Transformers (DETR). What is it? Why do we have it? How do we train it? How does it compare to Faster R-CNN?

ABOUT ME
⭕ Subscribe: https://www.youtube.com/c/CodeEmporium?sub_confirmation=1
📚 Medium Blog: https://medium.com/@dataemporium
💻 Github: https://github.com/ajhalthor
👔 LinkedIn: https://www.linkedin.com/in/ajay-halthor-477974bb/

RESOURCES
[1 📚] Main Paper: https://arxiv.org/pdf/2005.12872
[2 📚] Slides: https://link.excalidraw.com/p/readonly/1OzfsMt78e1BuqDMBYJO
[3 📚] My video on resnet: https://youtu.be/gyhCfjixLV0?si=N-NTU4Y4228KOUSt
[4 📚] Video on the transformer architecture: https://youtu.be/TQQlZhbC5ps?si=rACu5O4FGRKQwaKl
[5 📚] Playlist of Transformers from scratch: https://youtu.be/QCJQG4DuHT0?si=UllVN6odQKC-nsvb

PLAYLISTS FROM MY CHANNEL
⭕ Reinforcement Learning: https://youtube.com/playlist?list=PLTl9hO2Oobd9kS--NgVz0EPNyEmygV1Ha&si=AuThDZJwG19cgTA8
Natural Language Processing: https://youtube.com/playlist?list=PLTl9hO2Oobd_bzXUpzKMKA3liq2kj6LfE&si=LsVy8RDPu8jeO-cc
⭕ Transformers from Scratch: https://youtube.com/playlist?list=PLTl9hO2Oobd_bzXUpzKMKA3liq2kj6LfE
⭕ ChatGPT Playlist: https://youtube.com/playlist?list=PLTl9hO2Oobd9coYT6XsTraTBo4pL1j4HJ
⭕ Convolutional Neural Networks: https://youtube.com/playlist?list=PLTl9hO2Oobd9U0XHz62Lw6EgIMkQpfz74
⭕ The Math You Should Know : https://youtube.com/playlist?list=PLTl9hO2Oobd-_5sGLnbgE8Poer1Xjzz4h
⭕ Probability Theory for Machine Learning: https://youtube.com/playlist?list=PLTl9hO2Oobd9bPcq0fj91Jgk_-h1H_W3V
⭕ Coding Machine Learning: https://youtube.com/playlist?list=PLTl9hO2Oobd82vcsOnvCNzxrZOlrz3RiD

MATH COURSES (7 day free trial)
📕 Mathematics for Machine Learning: https://imp.i384100.net/MathML
📕 Calculus: https://imp.i384100.net/Calculus
📕 Statistics for Data Science: https://imp.i384100.net/AdvancedStatistics
📕 Bayesian Statistics: https://imp.i384100.net/BayesianStatistics
📕 Linear Algebra: https://imp.i384100.net/LinearAlgebra
📕 Pr


This video explains the basics of Detection Transformers (DETR): the architecture (ResNet backbone, transformer encoder-decoder, and object queries), training with Hungarian matching, inference without non-maximum suppression, and a comparison with Faster R-CNN on speed and average precision.

Key Takeaways
  1. Understand the basics of DETR
  2. Learn the architecture of DETR
  3. Compare DETR to Faster R-CNN
  4. Understand how DETR is trained with Hungarian matching
  5. Understand how DETR runs inference without non-maximum suppression
💡 DETR builds object detection on a transformer, removing hand-designed anchors and non-maximum suppression while reaching performance and speed similar to Faster R-CNN.
