DETR - Explained!
Key Takeaways
The video explains the Detection Transformer (DETR): its architecture, training, inference, and how it compares to Faster R-CNN.
Full Transcript
Greetings fellow learners. In this video we are going to look at DETR, the detection transformer. So let's get to it. We're going to start our discussion with what object detection is. It involves the localization and classification of objects in an image. Localization is drawing bounding boxes around objects, and classification is determining what those objects represent. In 2015, Faster R-CNN emerged as the state of the art for object detection. It looked something like this: we would take an image and generate anchors. These are basically prior bounding boxes, where we choose different aspect ratios and sizes for thousands of boxes over the image. Then we slightly adjust each of these and predict whether they actually encapsulate objects. Then we perform non-maximum suppression (NMS) to remove overlapping bounding boxes and eventually get a set of object proposals, or region proposals. These are where objects can be present. We then adjust these and classify them into the object classes we care about, to perform true object detection, and then perform NMS again to get the final set of localizations and classifications. For more information on this network, I have an entire video on Faster R-CNN, so I'll delegate the details to that. The main issue with this architecture, though, is that it's pretty complex. For example, we have to design anchors by hand, which involves determining what the sizes and aspect ratios of the prior boxes should be. There's also a bunch of post-processing, particularly non-maximum suppression, which requires us to choose a threshold: if two bounding boxes overlap, at what overlap should we remove one of them? These design choices can greatly affect performance. So how do we deal with this?
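To make the thresholding choice in non-maximum suppression concrete, here is a minimal sketch of greedy NMS in plain Python. The corner box format `(x1, y1, x2, y2)`, the function names, and the default threshold of 0.5 are assumptions for illustration, not details taken from the video or from Faster R-CNN's implementation.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score, keep a box only if it
    does not overlap an already kept box by more than the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping boxes and one separate box (made-up values).
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores, iou_threshold=0.5)
```

With the threshold at 0.5 the second box is suppressed, while a threshold of 0.9 would keep all three. That sensitivity is the hand-tuned design choice the transcript refers to.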
Well, seeing the recent success of transformer-based architectures in NLP tasks, from "Attention Is All You Need" in 2017 through 2020 with BERT and GPT, researchers thought of using this transformer-based architecture in object detection, and so we have DETR. DETR is an object detection framework that's built on transformers. It's simpler than Faster R-CNN, yet it achieves similar performance at a similar speed. Let's take a look at exactly how this happens and how it's trained. In order to train DETR, we first pre-train ResNet on image recognition. You can imagine each of these as convolution, activation, and pooling blocks with residual connections, which is why this network can be extremely deep. We pass in an image and get an object classification, and so we can train this ResNet architecture. Then we remove the last FC layer to create the convolutional backbone for DETR. So if we pass in an image of H × W with three RGB channels, we get a tensor that represents the image, where this little h is 32 times smaller than the big H, this little w is 32 times smaller than the big W, and there are 2,048 channels. We're now going to use it in our DETR architecture. For training our end-to-end model, DETR comprises three major components. One is this convolution block section, another is this transformer block section, and then we have something called Hungarian matching. Let's talk about each of these components. We first take our input image and pass it to our ResNet backbone to get the tensor that represents the image. We then perform a pointwise (1 × 1) convolution to adjust the number of channels to, let's say, 256. This prepares the data for passing into the transformer block. We then take this tensor and flatten it such that the final matrix is hw × 256.
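The shape bookkeeping above (downsample by 32, 2,048 channels, pointwise convolution, flatten to hw × 256) can be sketched in plain Python. This is only an illustration under stated assumptions: the 640 × 800 image size is hypothetical, the dimensions are assumed to be divisible by 32, and the tiny nested-list "convolution" stands in for what a deep learning framework would do on real tensors.

```python
def backbone_output_shape(height, width, stride=32, channels=2048):
    """Shape (C, h, w) of the backbone feature map, assuming a total stride of 32."""
    return channels, height // stride, width // stride


def pointwise_conv(feature_map, weight):
    """1x1 convolution: the same linear map applied to the channel vector at every pixel.

    feature_map is a nested list [C_in][h][w]; weight is a nested list [C_out][C_in].
    """
    c_in = len(feature_map)
    h, w = len(feature_map[0]), len(feature_map[0][0])
    return [
        [[sum(weight[o][c] * feature_map[c][y][x] for c in range(c_in))
          for x in range(w)] for y in range(h)]
        for o in range(len(weight))
    ]


def flatten_to_tokens(feature_map):
    """Turn [C][h][w] into h*w row vectors of length C (row-major over pixels)."""
    c = len(feature_map)
    h, w = len(feature_map[0]), len(feature_map[0][0])
    return [[feature_map[ch][y][x] for ch in range(c)]
            for y in range(h) for x in range(w)]


# Shape bookkeeping for a hypothetical 640 x 800 RGB image.
c, h, w = backbone_output_shape(640, 800)
token_count, token_dim = h * w, 256  # number of rows and columns after the 1x1 conv and flatten

# Tiny numeric example: 3 input channels, 2 output channels, 2 x 2 feature map.
fmap = [[[1, 2], [3, 4]], [[0, 1], [0, 1]], [[2, 2], [2, 2]]]
weight = [[1, 0, 0], [0, 1, 1]]  # output 0 copies input 0; output 1 sums inputs 1 and 2
tokens = flatten_to_tokens(pointwise_conv(fmap, weight))
```

Each row of `tokens` is the feature vector for one spatial location, which is the form the transformer encoder consumes.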
The idea here is that each of these 256-dimensional vectors is a rich representation of the features of the image in a specific section. So this could be one of the 256-dimensional vectors, and over here we might have another 256-dimensional vector that represents this section of the image: probably the edges, the curves, the texture, and so on. Because they now represent different sections of the image, we also add positional encodings. These are non-learnable parameters, which means they're not going to be updated by backpropagation; they are sine and cosine functions. This is basically saying that this 256-dimensional vector is position one, this 256-dimensional vector is position two, and so on. Next, we pass it into the encoder-decoder architecture of our transformer. You can see that this architecture is very similar to the one in the "Attention Is All You Need" paper from 2017. Again, I've done a video on this topic too, so I will link to that in the description. Effectively, this transformer encoder performs self-attention. Each of these 256-dimensional vectors passes through the encoder and will thus encode some global context, so they become richer representations of pieces of the image. Next, in our transformer decoder, the input is a set of vectors known as object queries. You can think of these as placeholders for objects that will eventually be detected in the image. Each of these is effectively a set of learnable parameters, which means they're going to be updated via backpropagation. They can be randomly initialized to start with, but they are learnable parameters, and we have 100 of these vectors. This is because, for now, let's assume that we want to detect at most 100 objects in a given image.
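The fixed sine and cosine positional encodings mentioned above can be sketched as follows. This is the one-dimensional formulation from "Attention Is All You Need", indexing the flattened positions 0, 1, 2, and so on as the transcript describes; the DETR paper uses a two-dimensional variant of the same idea, which this sketch does not reproduce. The function names and the 500-position example are assumptions.

```python
import math


def sinusoidal_encoding(num_positions, dim):
    """Fixed (non-learned) encodings: sine on even indices, cosine on odd indices."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(0, dim, 2):
            # Each sine/cosine pair shares one frequency; frequencies fall as i grows.
            angle = pos / (10000 ** (i / dim))
            row.append(math.sin(angle))
            row.append(math.cos(angle))
        table.append(row[:dim])  # trims the extra value when dim is odd
    return table


def add_positions(tokens, encodings):
    """Element-wise sum of each token with the encoding for its position."""
    return [[t + e for t, e in zip(tok, enc)] for tok, enc in zip(tokens, encodings)]


# One encoding per flattened image location, matching the 256-dimensional tokens.
pe = sinusoidal_encoding(num_positions=500, dim=256)
encoded = add_positions([[0.0] * 256 for _ in range(500)], pe)
```

Because the table is computed from a formula rather than stored as trainable weights, nothing here is updated during backpropagation.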
So we have a 100 × 256 matrix of object queries over here. We pass this as input to the transformer decoder. Through cross-attention, it gets information about the image pieces over here. Once it encodes that information through self- and cross-attention, we get the output vectors; there are a hundred of them, and each one is now going to represent an actual object. This 256-dimensional vector that represents an object is not very interpretable, so we pass it through some FC layers that map it into an 11-dimensional vector and a four-dimensional vector. We do this for all 100 of them: each one passes through the same FC layer to get an 11-dimensional vector, and through the same FC layer over here to get a four-dimensional vector. Why 11 dimensions? Well, let's assume that for this object detection task there are 10 classes of objects that are possible to detect, and the 11th one is the background class. And why four dimensions over here? This is for localization: they can represent the x and y coordinates of the center of the bounding box, and then the height and the width in pixels of the bounding box. So we can represent the bounding box by four numbers. Effectively, this 256-dimensional vector can give us the localization as well as the probability distribution over classes for the object. If you actually draw these onto an image, you'll get predictions that look like this. There are potentially a hundred bounding boxes, and each of them is associated with some probability distribution. I just put "dog" for some of them, but each of the boxes will actually have 11 of these predicted probabilities. Now, for these predictions we have a ground truth label for the image: the actual bounding box around the actual object of interest.
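As a rough illustration of what the two output heads produce for a single object query, the sketch below turns 11 raw class scores into a probability distribution with a softmax and converts a center-format box into corner coordinates for drawing. The logit values, the choice of index 3 for "dog", the `(cx, cy, w, h)` ordering, and the helper names are all assumptions made for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def center_to_corners(box):
    """Convert (cx, cy, w, h) to (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


NUM_CLASSES = 10  # real object classes; one extra slot is the background class

# Hypothetical raw scores from the classification head for one object query.
logits = [0.0] * (NUM_CLASSES + 1)
logits[3] = 4.0  # pretend index 3 is "dog"
probs = softmax(logits)  # 11 probabilities that sum to 1

# Hypothetical output of the box head: center (100, 80), width 40, height 20.
corners = center_to_corners((100.0, 80.0, 40.0, 20.0))
```

The same two heads are applied to every one of the 100 decoder outputs, so a full forward pass yields 100 probability vectors and 100 boxes.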
And here we perform the third phase, which is known as Hungarian matching. Hungarian matching involves creating a one-to-one mapping between each ground truth bounding box and one of these predictions. In this case, we want to select the optimal prediction that will correspond to this object, and we want the other 99 to be matched to "no object". This is what we see here: one of these predictions is mapped to the ground truth bounding box, and the remaining 99 are mapped to no object. This is required because we want a ground truth label for all 100 predictions, and Hungarian matching allows us to find that matching. We can now compute a cross-entropy loss as well as a smooth L1 loss. The cross-entropy loss will have 100 terms, one for each prediction. 99 of these should predict no object, so we look at the probability of predicting no object for those 99. But for the one that's matched to the actual ground truth, we look at the actual dog prediction, which is 0.87 here. And so we compute a cross-entropy loss. The smooth L1 loss is basically about the bounding box overlap: the larger the overlap between this box and the ground truth box, the smaller this loss. It's only computed for the cases where there is an object, because for the background cases there is no bounding box, so there is no contribution to the smooth L1 loss. Then we aggregate both of these losses to create the final loss. And so this is the entire end-to-end architecture, which we then train via backpropagation.
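The matching and loss steps can be sketched on a toy example. Two simplifications are assumed here and should be read as such. First, the one-to-one assignment is found by brute force over permutations instead of the Hungarian algorithm; it gives the same optimal assignment on tiny inputs, and in practice one would typically call `scipy.optimize.linear_sum_assignment`. Second, the box term is a plain L1 distance standing in for the smooth L1 loss described in the video; the DETR paper additionally uses a generalized IoU term, omitted here. All values and names are made up.

```python
import math
from itertools import permutations


def l1(box_a, box_b):
    """Sum of absolute coordinate differences between two boxes."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b))


def match(pred_probs, pred_boxes, gt_labels, gt_boxes):
    """Return {gt_index: pred_index} minimising the total matching cost.

    Cost of pairing prediction p with ground truth g is
    -P(p predicts g's class) + L1 distance between their boxes.
    Brute force over assignments, so only feasible for tiny inputs.
    """
    best, best_cost = None, math.inf
    for perm in permutations(range(len(pred_boxes)), len(gt_boxes)):
        cost = sum(-pred_probs[p][gt_labels[g]] + l1(pred_boxes[p], gt_boxes[g])
                   for g, p in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return {g: p for g, p in enumerate(best)}


def detection_loss(pred_probs, pred_boxes, gt_labels, gt_boxes, background):
    """Cross entropy over every prediction plus L1 box loss over matched ones."""
    assignment = match(pred_probs, pred_boxes, gt_labels, gt_boxes)
    pred_to_gt = {p: g for g, p in assignment.items()}
    class_loss, box_loss = 0.0, 0.0
    for p, probs in enumerate(pred_probs):
        if p in pred_to_gt:
            g = pred_to_gt[p]
            class_loss += -math.log(probs[gt_labels[g]])
            box_loss += l1(pred_boxes[p], gt_boxes[g])
        else:
            # Unmatched prediction: its target is the background class, no box term.
            class_loss += -math.log(probs[background])
    return class_loss + box_loss, assignment


# Three predictions over [dog, cat, background]; one ground truth dog.
pred_probs = [[0.05, 0.05, 0.90], [0.87, 0.06, 0.07], [0.10, 0.10, 0.80]]
pred_boxes = [(0.2, 0.2, 0.1, 0.1), (0.5, 0.5, 0.4, 0.4), (0.8, 0.8, 0.1, 0.1)]
gt_labels, gt_boxes = [0], [(0.5, 0.5, 0.4, 0.5)]
loss, assignment = detection_loss(pred_probs, pred_boxes, gt_labels, gt_boxes, background=2)
```

Here the second prediction is matched to the dog, and the other two contribute only their "no object" cross-entropy terms, mirroring the 1-versus-99 split in the transcript.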
So this loss is effectively going to update all the parameters of the FC layers and the transformer decoder. The object queries are going to be learned. The loss backpropagates through the encoder. The positional encodings are not going to be learned; they are fixed. But the convolution layers over here and the ResNet architecture are also going to be learned, and hence training happens via backpropagation. For inference, the architecture looks a little more simplified. It no longer has the Hungarian matching, and there is no loss computation because there are no ground truth labels. We take an image and pass it through the convolution block to get this hw × 256 matrix, add the sine and cosine positional encodings, and pass it through the transformer encoder to get global context encoded in these vectors; each vector represents some section of the image. We then take the learned object queries and pass them into the decoder. These are placeholders for objects, so they take in the information about the image from the encoder, and each of them will then represent an object of interest. We transform each 256-dimensional vector into an 11-dimensional vector to represent the probability distribution over classes, and a four-dimensional vector to represent the localization of the bounding box. The idea here is that, if there's only one object of interest as in this case, then in 99 of those probability distributions the background prediction should be the highest, and so we don't need to pay attention to the corresponding localization. But for one of them, ideally, the class probability of dog will be, what does it say, 87%, the class probability of cat is 6%, and so on.
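The inference-time rule described above (ignore any prediction whose most likely class is background, keep the rest) can be sketched like this. The background class is assumed to sit at the last index of each probability vector, and the function name and example values are hypothetical.

```python
def decode_predictions(pred_probs, pred_boxes, class_names):
    """Keep predictions whose most likely class is not the background class.

    class_names lists the real classes; background is assumed to be the
    last index of every probability vector.
    """
    background = len(class_names)
    detections = []
    for probs, box in zip(pred_probs, pred_boxes):
        best = max(range(len(probs)), key=lambda i: probs[i])
        if best != background:
            detections.append((class_names[best], probs[best], box))
    return detections


# Three of the (normally 100) predictions over [dog, cat, background].
pred_probs = [[0.05, 0.05, 0.90], [0.87, 0.06, 0.07], [0.10, 0.10, 0.80]]
pred_boxes = [(0.2, 0.2, 0.1, 0.1), (0.5, 0.5, 0.4, 0.4), (0.8, 0.8, 0.1, 0.1)]
detections = decode_predictions(pred_probs, pred_boxes, ["dog", "cat"])
```

No overlap threshold appears anywhere in this step, which is the simplification over the NMS-based pipeline that the video highlights.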
And so it has the dog prediction over here. We look at its corresponding localization to get the bounding box, which would look like this. So we don't really need non-maximum suppression or any other post-processing here, which is fantastic, as it simplifies our architecture. Let's now look at performance. This is a snapshot of a table from the paper, where we have different versions of Faster R-CNN and different versions of DETR. FPS is frames per second, which indicates how fast these are at inference; higher numbers mean higher speed. AP indicates average precision, so higher numbers indicate better performance. What you can see here is that, overall, DETR and Faster R-CNN have similar speeds and comparable performance. One thing to note is that DETR is a simpler architecture, but Faster R-CNN is better at detecting smaller objects. APS here is the performance on small objects, where you can see a somewhat bigger difference, whereas for large objects DETR seems to be pretty good almost across the board. So yeah, I hope all of this now makes sense. Quiz time. Have you been paying attention? Let's quiz you to find out. Which of these statements is false about the vanilla DETR? A: DETR has no heavy post-processing. B: DETR requires longer training time than Faster R-CNN. C: DETR detects small objects better than Faster R-CNN. Or D: Hungarian matching is used to match predictions to ground truth boxes. I'll give you a few seconds to answer this question. The correct option is C. Did you get it right? Please share your thoughts and reasoning in the comments below and let's have a discussion. And at this point, if you think I do deserve it, please do consider giving this video a like, because it will help me out a lot. That's going to do it for quiz time.
But before we go, let's generate a summary. In this video, we looked at DETR, the detection transformer. We started with a discussion of object detection and then looked at the SOTA of 2015, which was Faster R-CNN. However, that architecture was very complex, as it had many design choices that could greatly affect performance. To deal with this, and seeing the recent success of transformer-based architectures, DETR was born. DETR is an object detection framework built on transformers. It's simpler than Faster R-CNN, yet it achieves similar performance. We then took a look at how DETR is trained via this architecture and how inference is made. And we ended with a comparison of performance and speed between DETR and Faster R-CNN. So that's all I have today. Thank you all so much for watching. All the resources, including the slides, the papers, and references to other videos, will be down in the description below. If you think I deserve it, please do consider giving this video a like, and I will see you in the next one. Bye-bye.
Original Description
In this video, we take a look at Detection Transformers (DETR). What is it? Why do we have it? How do we train it? How does it compare to Faster R-CNN?
ABOUT ME
⭕ Subscribe: https://www.youtube.com/c/CodeEmporium?sub_confirmation=1
📚 Medium Blog: https://medium.com/@dataemporium
💻 Github: https://github.com/ajhalthor
👔 LinkedIn: https://www.linkedin.com/in/ajay-halthor-477974bb/
RESOURCES
[1 📚] Main Paper: https://arxiv.org/pdf/2005.12872
[2 📚] Slides: https://link.excalidraw.com/p/readonly/1OzfsMt78e1BuqDMBYJO
[3 📚] My video on resnet: https://youtu.be/gyhCfjixLV0?si=N-NTU4Y4228KOUSt
[4 📚] Video on the transformer architecture: https://youtu.be/TQQlZhbC5ps?si=rACu5O4FGRKQwaKl
[5 📚] Playlist of Transformers from scratch: https://youtu.be/QCJQG4DuHT0?si=UllVN6odQKC-nsvb