DeepMind DetCon: Efficient Visual Pretraining with Contrastive Detection | Paper Explained

Aleksa Gordić - The AI Epiphany · Beginner · 📄 Research Papers Explained · 4y ago
📢 SUBSCRIBE TO MY MONTHLY AI NEWSLETTER: Substack ► https://aiepiphany.substack.com/ 👨‍👩‍👧‍👦 JOIN OUR DISCORD COMMUNITY: Discord ► https://discord.gg/peBrCpheKE ❤️ Become The AI Epiphany Patreon ❤️ ► https://www.patreon.com/theaiepiphany In this video, I cover DetCon: Efficient Visual Pretraining with Contrastive Detection, a novel self-supervised method that achieves SOTA results on various transfer learning tasks. The main idea is to add semantic segmentation information into the contrastive objective (they used various heuristics to obtain semantic segmentation information in…
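The core idea summarized above, pooling features within segmentation masks and contrasting matching masks across two augmented views, can be sketched in a few lines of NumPy. This is a hypothetical illustration of a mask-pooled InfoNCE loss, not DeepMind's actual implementation; all function names are made up here.

```python
# Hypothetical sketch of a DetCon-style mask-pooled contrastive objective.
# Feature maps from two augmented views are average-pooled within each
# segmentation mask; matching masks across views form positive pairs,
# every other mask acts as a negative (InfoNCE).
import numpy as np

def mask_pool(features, masks):
    """Average-pool an (H, W, C) feature map within each binary (H, W) mask,
    then L2-normalize the resulting per-mask vectors."""
    pooled = []
    for m in masks:
        w = m[..., None] / max(m.sum(), 1)        # normalized spatial weights
        pooled.append((features * w).sum(axis=(0, 1)))
    v = np.stack(pooled)                          # (num_masks, C)
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def detcon_loss(feats_a, feats_b, masks_a, masks_b, temperature=0.1):
    """InfoNCE over mask-pooled features: mask i in view A is positive with
    mask i in view B and negative with all other masks in view B."""
    za = mask_pool(feats_a, masks_a)
    zb = mask_pool(feats_b, masks_b)
    logits = za @ zb.T / temperature              # (num_masks, num_masks)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))            # positives on the diagonal
```

In the paper the masks come from cheap heuristics (e.g. spatial grids or unsupervised segmentation) rather than ground-truth labels, which is what keeps the pretraining "efficient".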

Chapters (9)

0:00 News: Discord and AI newsletter!
1:24 Self-supervised learning, BYOL, and SimCLR
3:50 DetCon method overview
11:35 Semantic segmentation heuristics
12:45 Overview of BYOL and SimCLR
20:15 Results
24:10 Impact of segmentation heuristics
27:20 Outro
28:10 Turn on the notification bell, much love!

Playlist

Uploads from Aleksa Gordić - The AI Epiphany · 59 of 60

1 Intro | Neural Style Transfer #1
2 Basic Theory | Neural Style Transfer #2
3 Optimization method | Neural Style Transfer #3
4 Advanced Theory | Neural Style Transfer #4
5 Anyone can make deepfakes now!
6 What is Computer Vision? | The Art of Creating Seeing Machines
7 Feed-forward method | Neural Style Transfer #5
8 Alan Turing | Computing Machinery and Intelligence
9 Feed-forward method (training) | Neural Style Transfer #6
10 What is Google Deep Dream? (Basic Theory) | Deep Dream Series #1
11 Semantic Segmentation in PyTorch | Neural Style Transfer #7
12 How to get started with Machine Learning
13 How to learn PyTorch? (3 easy steps) | 2021
14 PyTorch or TensorFlow?
15 3 Machine Learning Projects For Beginners (Highly visual) | 2021
16 Machine Learning Projects (Intermediate level) | 2021
17 Cheapest (0$) Deep Learning Hardware Options | 2021
18 How to learn deep learning? (Transformers Example)
19 How do transformers work? (Attention is all you need)
20 Developing a deep learning project (case study on transformer)
21 Vision Transformer (ViT) - An image is worth 16x16 words | Paper Explained
22 GPT-3 - Language Models are Few-Shot Learners | Paper Explained
23 Google DeepMind's AlphaFold 2 explained! (Protein folding, AlphaFold 1, a glimpse into AlphaFold 2)
24 Attention Is All You Need (Transformer) | Paper Explained
25 Graph Attention Networks (GAT) | GNN Paper Explained
26 Graph Convolutional Networks (GCN) | GNN Paper Explained
27 Graph SAGE - Inductive Representation Learning on Large Graphs | GNN Paper Explained
28 PinSage - Graph Convolutional Neural Networks for Web-Scale Recommender Systems | Paper Explained
29 OpenAI CLIP - Connecting Text and Images | Paper Explained
30 Temporal Graph Networks (TGN) | GNN Paper Explained
31 Graph Neural Network Project Update! (I'm coding GAT from scratch)
32 Graph Attention Network Project Walkthrough
33 How to get started with Graph ML? (Blog walkthrough)
34 DQN - Playing Atari with Deep Reinforcement Learning | RL Paper Explained
35 AlphaGo - Mastering the game of Go with deep neural networks and tree search | RL Paper Explained
36 DeepMind's AlphaGo Zero and AlphaZero | RL paper explained
37 OpenAI - Solving Rubik's Cube with a Robot Hand | RL paper explained
38 MuZero - Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model | RL Paper explained
39 EfficientNetV2 - Smaller Models and Faster Training | Paper explained
40 Implementing DeepMind's DQN from scratch! | Project Update
41 MLP-Mixer: An all-MLP Architecture for Vision | Paper explained
42 DeepMind's Android RL Environment - AndroidEnv
43 When Vision Transformers Outperform ResNets without Pretraining | Paper Explained
44 Non-Parametric Transformers | Paper explained
45 Chip Placement with Deep Reinforcement Learning | Paper Explained
46 Text Style Brush - Transfer of text aesthetics from a single example | Paper Explained
47 Graphormer - Do Transformers Really Perform Bad for Graph Representation? | Paper Explained
48 GANs N' Roses: Stable, Controllable, Diverse Image to Image Translation | Paper Explained
49 VQ-VAEs: Neural Discrete Representation Learning | Paper + PyTorch Code Explained
50 VQ-GAN: Taming Transformers for High-Resolution Image Synthesis | Paper Explained
51 Multimodal Few-Shot Learning with Frozen Language Models | Paper Explained
52 Focal Transformer: Focal Self-attention for Local-Global Interactions in Vision Transformers
53 AudioCLIP: Extending CLIP to Image, Text and Audio | Paper Explained
54 RMA: Rapid Motor Adaptation for Legged Robots | Paper Explained
55 DALL-E: Zero-Shot Text-to-Image Generation | Paper Explained
56 Channel Update: vacation, leaving Microsoft, approaching 10k subs and more!
57 DETR: End-to-End Object Detection with Transformers | Paper Explained
58 DINO: Emerging Properties in Self-Supervised Vision Transformers | Paper Explained!
▶ 59 DeepMind DetCon: Efficient Visual Pretraining with Contrastive Detection | Paper Explained
60 Do Vision Transformers See Like Convolutional Neural Networks? | Paper Explained