DINO: Emerging Properties in Self-Supervised Vision Transformers | Paper Explained!

Aleksa Gordić - The AI Epiphany · Beginner ·📄 Research Papers Explained ·4y ago

Skills: Reading ML Papers90%Research Methods80%RAG Basics70%Vector Stores60%RAG Evaluation60%

Key Takeaways

The video explains the DINO model, a self-supervised vision transformer introduced in the 'Emerging Properties in Self-Supervised Vision Transformers' paper by Facebook AI, which uses a teacher-student framework to learn visual features without labels.

Full Transcript

what's cing guys in this video I'm covering emerging properties in self-supervised Vision Transformers paper and introducing Dino model uh from Facebook AI research in Ria and sbone University uh the main motivation behind uh the paper was uh investigating why haven't Transformers uh brought such an like a competitive advantage to computer vision as they did in NLP world and the main hypothesis is that using supervised learning as we've have so far with vision Transformers have been kind of dampering um damping down their performance and so they they they propose to use self supervised learning which is U which provides us s a richer uh training signal and so they mention it here nicely the resulting V are competitive with comets but they have not yet delivered clear benefits over them they're computationally more demanding require more training data and their features do not exhibit unique properties so on on a small tangent this part of requiring more training data is not correct anymore so I I recorded a video where I explained uh this novel technique where they use sharpness aware minimization objective to kind of smooth out smooth out the Lost landscape of Vats and thus um it requires much less data now so yeah that that part is not true anymore but anyways returning back to the paper so the reason they want to use self supervised learning is first because labeling is expensive and secondly because it's a richer training signal uh because suppose you have a sentence and if you want to predict just a single label for that sentence that's a like a poor training signal compared to just trying to predict every single token which is a language modeling objective and so yeah that's that's the that's the that's the idea let's kind of try and use that self-supervised learning and see whether we get some like new properties popping up for Transformers uh like for vision Transformers and they say here that in this paper we question if self-supervised learning provides new properties to Vats that stand compared to CNN and as we'll soon see uh that is the case I.E they do have some properties which um even when they use the same uh training procedure so dyo procedure for comets they do not achieve the same uh they do not gain the same properties as as vat so first of those properties is these are these um segmentation masks as you see these are the attention maps that we get from the last layer of vat after we train it with dyo and you can see that the objects are segmented which are Salient for our humans so we see a bird here we see a boat here we see a bicycle toothbrush Etc and that's not something we explicitly um like told a v to to like predict and so it's just an emergent property and uh if you're by the way if you're not if you don't understand how these are calculated let me just kind of briefly walk you through that part uh they see here we look at the self attention of the CLS token on the heads of the last layer so again I recorded a video on Vision Transformer so if you want to know more details you can check it out I'm going to briefly kind of explain it here so basically it's a Transformer so you have a Transformer like blob here and what it do in order to kind of parse images is so first you have your CLS token which is a learnable token and then what you do is you kind of uh segment your image so you kind of add these patches you kind of create all of these patches and then you for examp example take it take these patches in a rester order you flatten them out and you kind of put them here okay and so now what happens is how they actually calculate this attention is uh how they calculate how they made this uh attention map is the following so you kind of process all of these tokens using Transformer and then out comes a representation of this CLS token and what you do is and also obviously for every single token here you'll have a corresponding output representation and they just take a this uh this token here they create a query Vector so this this will be query and they take uh these tokens and they just going of map them to key vectors so we'll have key one here we'll have key2 here Etc so same here and just basically doing do product between this query so that corresponds to the CLS token and these Keys you'll get scores and then you'll just take these scores you'll have a you'll get a score for this one you'll get a score for this one and you just take that flattened out structure and you wrap it back into the Matrix structure and that's how you get these like these these images so basically these are just scores of these dot products between queries and keys pretty much so that's as simple as that um now that was the that was uh one of the emerging properties the second one that is super important is that the features that come out of the vision Transformer train by Dino H have they they perform really well using this K nearest neighbor classification and we'll see that a bit later but those are the two properties that emerge and now let me slowly start explaining the dyo architecture itself uh on a high level what you do you have an input image X and you kind of do different augmentations so you do geometric augmentation such as crop you do uh photometric augmentation such as color Jitter and you apply one set of those augmentations and you get X1 and you apply a distinct set of augmentations and you get X too okay so now you pass those augmentations through these networks which are so this one will be just Vision Transformer so that's the whole point of this paper right we have Vision Transformer we want to kind of test it how it works with self superise learning and uh you form you form the teacher waights by kind of using this EMA procedure so exponentially moving average so you'll be taking uh past snapshots of of student uh weights and you'll be using them to form the teacher weights and so now just keeping it on a high level what happens here we will we'll output some distribution here which will be a categorical distribution I'm just going to draw it as continuous one because it's easier but it's soft Max it's categorical and the whole point will be to kind of match these two distributions so we'll have one coming from the teacher as well and the whole goal will be to kind of make sure these two are the same and why is that so the intuition behind this procedure is so no matter how you kind of deform how you augment your image uh you want to make sure that the student and teacher learn how to extract such representations so that they are invariant to those augmentations so again even though these two uh images uh are completely different because there are different crops different augmentations we'll end up with the same representations because we'll have same distributions and so that means we're extracting relevant information from the image that's the whole point and those features turn out to be super valuable on the Downstream tasks now going in a bit more detail first how we enforce the uh these two distributions to be the same is via cross centri Bloss so that's this thing here uh secondly uh important uh details here is we are using a temperature coefficient here in softmax if you're not familiar what the temperature does is basically let's say you have a distribution something like this and basically so this will be your mode of the distribution if you apply now the temperature and if we let the temperature go down to zero will'll basically end up with something like this so we'll have a probability of one for this mode and everything else will drop down to zero that's also called sharpening if you see that expression now you know what it is basically as you drop down the temperature you'll be sharpening distribution so that only the mode Peaks and everything else drops down so that's the idea there uh and on the other hand we have this this centering which is vital for this method otherwise we'd got we would get a like a collapse of the representations and I'll explain those in a bit but basically um we need uh we need these this centering to to avoid collapse and um so briefly like what collapse is is um the the easiest thing the easiest thing this this this network could do this system could do in order to make sure that the the output distributions are the same is to just kind of output the uniform distribution for every single input so they'll just kind of learn to output something like this so this is your distribution and they'll just learn how to Output a uniform one and this one will also learn how to uh output a uniform one and then trivially the loss goes to zero but that's obviously not what we want because that's that's a collapse a second type of collapse would be uh if you instead have just uh like a peak on a specific uh position on spe specific Dimension so here for example you having a peak and on the same position from the teacher Network as well so these are two types of collapses that happen and in order to avoid them what they do is they apply this centering and what centering is is you um just basically um so uh let me start like this we have image here and it's usually a batch of images so you'll have a like a b number of of these images and so the vector here the tensor here will be B comma K okay where K is just the dimensionality of the representation and we have B of those so what you do is you'll you'll be calculating this C Vector so the centering vector by just kind of taking a mean from all of these representations so you just take a mean over over over the B Dimension over the zero Dimension and so you'll end up with a K one vector and you'll additionally be using exponential moving average to calculate this C Vector so that means that also the previous batches will be influencing the value of this C vector and then what you do is you just subtract C from all of these uh B uh representations and then you pass them through softmax that also has temperature coefficient U second important detail here is this stop gradient operator so that basically means we'll be propagating uh the loss uh like uh the errors through the student Network only because that makes sense teacher is only is formed by using this exponentially moving average um that's that's it hopefully uh that made that made sense um now now I'm going to walk you through the pseudo code that will maybe help you further understand this um and yeah so here it is we have the teacher Network we have the student Network we iterate through the data set so X is just a batch of images you apply some augmentations here you apply different set of augmentations here you end up with X1 and X2 okay so those are two batches with different augmentations applied to them uh then you do a feed forward through the student Network and you do a feed forward through the teacher Network and you end up with these Logics here okay and finally uh what you do here you can see it's kind of symmetric so we're using T1 which is a teacher logit from the this X1 batch of augmented images and we use this uh S2 which is a which which came from this uh second group of augmented images uh being feed through the the student Network so basically just symmetry stuff um so let's see let's see how how H is implemented basically here we can see T gets detached so the logits of the teacher gets detached and that's just an implementation of the stop gradient in py torch uh then we apply softmax and as you can see here so TPS is just a temperature parameter for the student um that that's that's what makes it sharper and uh then we just apply softmax and that's the the output distribution we had here so that that's this one and on the other hand we had here we we have the C Vector so we'll be subtracting it as I said so that's the centering part and then we have then we apply the sharpening part so that's the this part here and we get T which is the output distribution so the the categorical distribution that came out from teacher and finally we just apply as you can see here cross entropy and we do a mean because it's a batch of different uh images okay um that's pretty much it hopefully this helped a bit backward we'll calculate the actual gradients for all of the weights in the network update just updates the student Network as you see so we only update student Network and then here are the two exponentially moving average updates One updates the teacher Network as you can see here L times the teacher networks parameters and Yus L times the student Network parameters uh on the other hand we have here the the centering Vector so T1 and T2 remember those are just logits so these are BK tensors and they'll just concatenate those so that means they'll form basically if you have uh let's let's make it like this so we have B here and K is here okay so that's T1 and then we have T2 which they'll just concatenate here and then they do like a simple mean across this Dimension okay and that's how they get the the the the the the the C Vector plus they have as you see the exponentially moving part so they'll be using C's from other batches from the previous batches that's how they form the C vector and that's it in a nutshell that's those are all of the nitty-gritty details um but again the main picture is very very simple the idea is to form such representations so that we have same or similar distributions at the output uh like it doesn't matter what kind of augmentations we applied those are still the same okay um couple more interesting points to make here is uh they mentioned here that all crops are passed through the student while only the global views are passed through the teacher therefore encouraging local to Global correspondences so that means the following so let's say we have an image here and we have let me draw something inside like maybe a Chad programmer okay like a guy with glasses he's buffed okay like this and what you'll do is some of the crops will be uh maybe uh consisting out of more than 50% of the image something like this some of those crops may be smaller maybe like taking only one part one smaller part of the image and you'll be passing the big ones through the that's why they say here you'll be passing the big ones through the teacher Network and you'll be passing both types of crops through the student Network and again the idea is if you can kind of infer from this small crop if you can infer uh like relevant information that contains the same information as this bigger crop uh that means you you got your yourself a robust representation okay and that's the whole idea that this this uh uh like multicrop strategy enforces uh robust representations which turn out to to to do quite well on those K&N evaluations as we'll soon see now let's see some more details about the teacher Network here uh they mentioned that unlike knowledge distillation we do not have a teacher uh given a priority and hence we build it from past iterations of the student Network so as we saw we are using that uh exponential moving average to form at the teacher Network so usually in computer vision you have a big uh like CNN maybe something like this so that's a CNN and you train it on a data set maybe something like image net and the whole idea is now have having trained that one you freeze the weights and then you have a smaller CNN something like this and so what you do is the following so you take an image from that data set you feed it both here into the big one as well as to your smaller yet to be train in CNN and now instead of using the one hot uh encoding of the label uh as the target you'll actually be using whatever the output distribution from this one is so whatever the output distribution here is uh we'll want to mimic in the smaller one so here we'll want to mimic that one uh by just applying uh K Divergence or or whatnot uh so that's the usual knowledge distillation and here we are not using exactly that because as you as you saw the teacher network is not frozen we're just updating it uh exponentially as the time goes by okay that's that's the first detail uh secondly they they mentioned here that indeed we observe that this teacher performs a form of model assembling similar to polyak rert a averaging with an exponential decay uh that leads us to this thing and that's we observe that this teacher has better performance then the student throughout the training enhance guides the training of the student by providing Target features of high quality uh so yeah I mean just doing assembling is a is a is a known method where whereby you can improve the the performance of your model by just taking a bunch of models and then averaging uh their outputs but here we are doing the averaging in the in the parameter space and that also uh helps obviously uh increase the performance we'll see some curves later on where they quantitatively show that this does happen indeed so that the teacher is always better than the student and so student is can can actually learn something from the teacher because if yeah yeah I guess that makes sense um okay so a couple more details worth mentioning is that this whole uh like system this Dino system will be uh free of B normalization which is a trend we've seen recently with ANF Nets if you if you if you saw that paper uh the it came from Deep Mind the idea was to kind of ditch the B tralization because it has some nasty properties um and yeah but it ended up um compl Lia in the whole procedure so I'm not sure it uh it got so popular but yeah still um just if you're able to kind of ditch P normalization that that that's cool and here they managed it because uh they say here that therefore when applying dinno to V we do not use any B tralization in the projection hats uh vat does not have BN by default and they just made made sure not to add BN layers uh on top of those linear layers in the projection head um okay I already mentioned that uh these centering and sharpening operations are vital uh and they say here so as shown experimentally centering prevents one dimension to dominate but encourages collapse to the uniform distribution so that's this first type of uh like collapse I mentioned the uniform one so this one and then they mention um while the sharpening has the opposite effect okay so it kind of makes those uh one dimension explode so applying both operations balances their effects which is sufficient to avoid collapse in presence of a momentum teacher okay that's that's it those are those are all of the details of the architecture and how the system works now let's see for the results um first what they do is they take uh reset as the as the as the B Baseline and they train it using Dyno so this is this method here and they show that uh compared to all of the previous uh like self-supervised methods such as I don't know like bootstrap your own Laten buol or or Moco or or Sim clear Etc so they show that they are they perform better using both types of evaluations so the first one is linear one and the second one is KN andn evaluation and just uh briefly if you don't understand how how the km works it's fairly simple so what you do is you train your your V so you train your vat here and then you take uh you basically take uh your training set so this is your training set you freeze these weights so these are frozen so let me just you know like f you froze these and now you just input an image outcomes a representation and because it's from the training set you'll have a label associated with it and you'll you'll just repeat this for a subset of your training set and you'll store these representation label topples okay and next thing how you actually evaluate now this system is the following so now you have now you have uh like a validation data set and you pass in an image from the validation set out comes representation but this time you don't have a label so what you do instead is you search through these representations which you've collected from your training set and you just find the K nearest ones and they were using like 20 uh experimentally they found that number works well so you'll just find 20 closest representations and you'll see what their labels are and by majority voting for example you you'll just figure out what the represent what the label should be for this one okay that's the that's the whole idea so you just find the closest ones where closeness is defin by I guess they use L2 metric and so yeah that that's it that's how you find a label for for this one you just find the closest ones okay and now the interesting part actually here is that uh once you have bit you can see that the knnn performance is like the gap between these two is severely like uh decreased whereas here we had a bigger Gap that's the the important thing to notice is that using Transformers and SSL has something that's that's um qualitatively different comparing to using CNN that's the first thing worth mentioning and uh yeah in general they compare with other baselines so here so here they compensate for the architecture here they use whatever architecture just the best methods out there and they still show that uh using linear evaluation as well as the K&N they they achieve the best results compared to previous like B Sim clear Etc previous SSL baselines uh those are the initial results now let's see some other stuff here so these two results just support their claim that the features are um really of high quality and uh since they perform really well for K nearest neighbor classification they also should perform well for image retrieval and copy detection and they just kind of uh yeah show that uh that's the case um interesting thing here is that uh cool thing about SSL methods is you can use them on whatever data set you have you don't have to have labels and so they just pre-trained uh dino on this Google Larks uh data set which turns out to to boost performance severely compared to just pre-training dyo on on image net and yeah they get results which are comparable to even a supervised Baseline and that's cool okay let's continue here uh that was the that was the first part so those were the the the the that was the emerging property of nice K features that are good for K&N focusing on these results here we can see that um different colors just you know different heads of Transformer heads attention heads from the vat uh final layer and we can see that the red head focuses on horses on this zebras head we can see that the yellow head is focusing on on on the neck as well as on the ears and the blue head is focusing on on this bonds here okay and um what's interesting is they show here that um even some some occluded objects such as these bushes here are detected and segmented uh like I mean it's hard even for me to kind of detect these and the network learns how to recognize the bush and segment that information okay that's super cool okay continuing further they show that uh if you take supervised Baseline and plot the segmentation Maps uh from from that one and from dyo uh on on a different data set on Pascal so they pre-trained this on imet so both Dino and the supervis method and then they tested on Pascal data set and you can see there's bunch of random like sparse Dots here so the segmentation masks are not of the same quality as as as in Dino and they also show quantitatively that uh you can see that this is I think uh how they call it uh jard similarity or IOU just a fancy name for intersection of a union they show it's much higher compared to supervised Baseline so again this testifies that um uh the dyo the V with dyo learns how to extract these segmentation masks uh high quality segmentation masks okay in this table they show a general applicability of these features to various Downstream benchmarks such as cyppher 10 Cipher 100 Etc so just comparing um Dino with the supervised Baseline we can see uh it pretty much achieves outperforms supervised Baseline on all of these Downstream tasks which further testifies that the features learned using Dino and vat are of super high quality uh some oblations here uh maybe the most important row I want you to focus on is when we uh expel when we don't use the momentum encoder so that the teacher Network being formed from exponen like by the exponential moving average strategy we can see that it completely fails to train so the performance just drops down to almost zero and it does not work so we need we so this method so doyo needs momentum encoder and yeah um other oblations like uh multi not using multi crop reduces result reduces the performance not using cross entropy reduces performance Etc in this uh plot they showed that not that using smaller patches is more important than using a bigger model so if you focus on on this thing here so if you focus on on the 8 by8 uh patches and the vat small uh as you can see it has better performance than using um V8 big so the bigger model but with 16 x 16 patches so the trade-off here obviously exists and um it's the following basically when you're using 8 by8 you need you need a bigger memory footprint and so uh you also kind of uh reduce the throughput of number of images you can feed through the model in a second um but yeah you get some additional performance but all in all you can see these two curves overlap so basically it boils down to how much performance you want and what's the throughput your desired throughput and that's how you can pick the best uh tradeoff between the size of the model and the patch size okay nice um here is the quantitative results I mentioned that the student is always so that the teacher is always better than student you can see as the epoch progress during the training the validation accuracy of the teacher is always higher compared to the student um and they just kind of try different strategies of how they update the teacher and it turns out that using momentum is the best strategy uh using previous Epoch so that's something similar to dqn if you watch my video on dqn or you know what dqn is uh it also uses um it also freezes the target Network and updates it every couple of uh EPO or whatever every every end steps in general and so uh that method also works but it yields somewhat uh worse performance and using uh updating the teacher too often so that would be previous iteration or student copy uh you can see that the method just fails uh all completely so yeah um there are other ways to kind of um build up this teacher network but the best thing seems to be so far this momentum update uh these results are pretty interesting let me just kind of explain them again there are two forms of collapse regardless of the input the model output is uniform along all Dimensions or dominated by one dimension if we uh take the cross entropy we saw in the loss so we take the cross entropy loss and kind of split it down into the entropy component and the K Divergence component uh and plotted here they show The Following so if you're using sharpening only and you just ignore you you kind of ditch the centering part in the teacher Network this is what happens basically the entropy drops down to zero which means what which means that the network learns to Output uh like a something like this you basically have a peak a one dimension and the probability is one and everything else is zero and so the entropy of this thing is zero so that means that uh using only sharpening leads to this mode of collapse and using only centering also collapses but as you can see the collapse has a non-zero value so the the the the the the entropy has non-zero value which means we have something like this which means we have a uniform distribution and uh this has a nonzero entropy and so that's what we end up with here and did did mention it somewhere here let me just check uh the entropy converges to different values so zero with no centering and minus log one over K with no sharpening uh so that's that's the value that we'll have here so here this value will be minus log of 1 / K the reason being is the support has K values so this distribution support is has K Dimensions so that means this will be 1 / K so that the integral amounts to one and this means if you just calculate the entropy you'll end up with with this this value here um so yeah using both as you can see here the entropy converges to some value but it's neither zero neither this value of minus log 1 over K and that's it that's pretty much it uh taking a look at the KL Divergence uh the KL becomes zero for these two cases um because both the student and the teacher learn to Output either uniform or this kind of distribution and when you're using both then KL Divergence does not collapse to zero those are just two different perspectives on the same thing uh and that's the fact that they collapse to either of these modes depending on what what do we omit either Shar sharpening or centering um okay that's that's pretty much it um I think I explained everything I wanted finally they mention it here in the future we plan to explore if pre-training a large V8 model with dyo on random uncurated images could push the limits of visual features and uh I can just imagine uh something along the gpt3 dimensions being uh trained uh and this time with vision Transformers so that will be exciting I guess uh final thing I want to show you are the features that uh like the some semantics of the features of of V uh trained with dyo as you can see here just uh extracting those representations and then using tne to kind of plot them in 2D you can see that Dino vat V Dino learns how to Cluster similar objects so here we can see like Model T that's a car then car wheel ambulance mini bus so we have some kind of a Vehicles cluster here and then we have some like gter snake king snake ring neck snakes so we have some snakes here we have like orangutan some monkeys here so all in all all of these emerge by training dino by training vat using dyo so we get this nice property even though we train vat uh only using SSL objective and we also got those segmentation masks and and we also got uh the features which are of high quality as shown by K&N performance Etc so hopefully you you like this video if you did uh consider subscribing and sharing and until next time bye-bye [Music]

Original Description

❤️ Become The AI Epiphany Patreon ❤️ ► https://www.patreon.com/theaiepiphany In this video I cover DINO (self DIstillation with NO labels) introduced in the "Emerging Properties in Self-Supervised Vision Transformers" paper by Facebook AI. The idea is to see whether using supervised learning was preventing transformers from showing the same kind of results in CV as they demonstrated in the NLP world (where we use self-supervised learning objectives such as (masked) language modeling). It turns out some nice properties emerge such as: * DINO-ViT learns to predict segmentation masks * features are especially of high quality for the k-NN classification ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ ✅ Paper: https://arxiv.org/abs/2104.14294 ✅ Code: https://github.com/facebookresearch/dino ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ ⌚️ Timetable: 00:00 DINO main ideas, attention maps explained 05:05 DINO explained in depth 10:30 Pseudocode walk-through 13:55 Multi-crop and local-to-global correspondence 15:15 More details on the teacher network 19:00 Results 25:00 Ablations 27:40 Collapse analysis 30:40 Features visualized and outro ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ 💰 BECOME A PATREON OF THE AI EPIPHANY ❤️ If these videos, GitHub projects, and blogs help you, consider helping me out by supporting me on Patreon! The AI Epiphany ► https://www.patreon.com/theaiepiphany One-time donation: https://www.paypal.com/paypalme/theaiepiphany Much love! ❤️ Huge thank you to these AI Epiphany patreons: Eli Mahler Petar Veličković Zvonimir Sabljic ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ 💡 The AI Epiphany is a channel dedicated to simplifying the field of AI using creative visualizations and in general, a stronger focus on geometrical and visual intuition, rather than the algebraic and numerical "intuition". ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ 👋 CONNECT WITH ME ON SOCIAL LinkedIn ► https://www.linkedin.com/in/aleksagordic/ Twitter ► https://twitter.com/gordic_aleksa Instagram ► https://www.instagram.com/aiepiphany/ Facebook ► h

Watch on YouTube ↗ (saves to browser)

Sign in to unlock AI tutor explanation · ⚡30

Playlist

Uploads from Aleksa Gordić - The AI Epiphany · Aleksa Gordić - The AI Epiphany · 57 of 60

← Previous Next →

Intro | Neural Style Transfer #1

Intro | Neural Style Transfer #1

Aleksa Gordić - The AI Epiphany

Basic Theory | Neural Style Transfer #2

Basic Theory | Neural Style Transfer #2

Aleksa Gordić - The AI Epiphany

Optimization method | Neural Style Transfer #3

Optimization method | Neural Style Transfer #3

Aleksa Gordić - The AI Epiphany

Advanced Theory | Neural Style Transfer #4

Advanced Theory | Neural Style Transfer #4

Aleksa Gordić - The AI Epiphany

Anyone can make deepfakes now!

Anyone can make deepfakes now!

Aleksa Gordić - The AI Epiphany

What is Computer Vision? | The Art of Creating Seeing Machines

What is Computer Vision? | The Art of Creating Seeing Machines

Aleksa Gordić - The AI Epiphany

Feed-forward method | Neural Style Transfer #5

Feed-forward method | Neural Style Transfer #5

Aleksa Gordić - The AI Epiphany

Alan Turing | Computing Machinery and Intelligence

Alan Turing | Computing Machinery and Intelligence

Aleksa Gordić - The AI Epiphany

Feed-forward method (training) | Neural Style Transfer #6

Feed-forward method (training) | Neural Style Transfer #6

Aleksa Gordić - The AI Epiphany

What is Google Deep Dream? (Basic Theory) | Deep Dream Series #1

What is Google Deep Dream? (Basic Theory) | Deep Dream Series #1

Aleksa Gordić - The AI Epiphany

Semantic Segmentation in PyTorch | Neural Style Transfer #7

Semantic Segmentation in PyTorch | Neural Style Transfer #7

Aleksa Gordić - The AI Epiphany

How to get started with Machine Learning

How to get started with Machine Learning

Aleksa Gordić - The AI Epiphany

How to learn PyTorch? (3 easy steps) | 2021

How to learn PyTorch? (3 easy steps) | 2021

Aleksa Gordić - The AI Epiphany

PyTorch or TensorFlow?

PyTorch or TensorFlow?

Aleksa Gordić - The AI Epiphany

3 Machine Learning Projects For Beginners (Highly visual) | 2021

3 Machine Learning Projects For Beginners (Highly visual) | 2021

Aleksa Gordić - The AI Epiphany

Machine Learning Projects (Intermediate level) | 2021

Machine Learning Projects (Intermediate level) | 2021

Aleksa Gordić - The AI Epiphany

Cheapest (0$) Deep Learning Hardware Options | 2021

Cheapest (0$) Deep Learning Hardware Options | 2021

Aleksa Gordić - The AI Epiphany

How to learn deep learning? (Transformers Example)

How to learn deep learning? (Transformers Example)

Aleksa Gordić - The AI Epiphany

How do transformers work? (Attention is all you need)

How do transformers work? (Attention is all you need)

Aleksa Gordić - The AI Epiphany

Developing a deep learning project (case study on transformer)

Developing a deep learning project (case study on transformer)

Aleksa Gordić - The AI Epiphany

Vision Transformer (ViT) - An image is worth 16x16 words | Paper Explained

Vision Transformer (ViT) - An image is worth 16x16 words | Paper Explained

Aleksa Gordić - The AI Epiphany

GPT-3 - Language Models are Few-Shot Learners | Paper Explained

GPT-3 - Language Models are Few-Shot Learners | Paper Explained

Aleksa Gordić - The AI Epiphany

Google DeepMind's AlphaFold 2 explained! (Protein folding, AlphaFold 1, a glimpse into AlphaFold 2)

Google DeepMind's AlphaFold 2 explained! (Protein folding, AlphaFold 1, a glimpse into AlphaFold 2)

Aleksa Gordić - The AI Epiphany

Attention Is All You Need (Transformer) | Paper Explained

Attention Is All You Need (Transformer) | Paper Explained

Aleksa Gordić - The AI Epiphany

Graph Attention Networks (GAT) | GNN Paper Explained

Graph Attention Networks (GAT) | GNN Paper Explained

Aleksa Gordić - The AI Epiphany

Graph Convolutional Networks (GCN) | GNN Paper Explained

Graph Convolutional Networks (GCN) | GNN Paper Explained

Aleksa Gordić - The AI Epiphany

Graph SAGE - Inductive Representation Learning on Large Graphs | GNN Paper Explained

Graph SAGE - Inductive Representation Learning on Large Graphs | GNN Paper Explained

Aleksa Gordić - The AI Epiphany

PinSage - Graph Convolutional Neural Networks for Web-Scale Recommender Systems | Paper Explained

PinSage - Graph Convolutional Neural Networks for Web-Scale Recommender Systems | Paper Explained

Aleksa Gordić - The AI Epiphany

OpenAI CLIP - Connecting Text and Images | Paper Explained

OpenAI CLIP - Connecting Text and Images | Paper Explained

Aleksa Gordić - The AI Epiphany

Temporal Graph Networks (TGN) | GNN Paper Explained

Temporal Graph Networks (TGN) | GNN Paper Explained

Aleksa Gordić - The AI Epiphany

Graph Neural Network Project Update! (I'm coding GAT from scratch)

Graph Neural Network Project Update! (I'm coding GAT from scratch)

Aleksa Gordić - The AI Epiphany

Graph Attention Network Project Walkthrough

Graph Attention Network Project Walkthrough

Aleksa Gordić - The AI Epiphany

How to get started with Graph ML? (Blog walkthrough)

How to get started with Graph ML? (Blog walkthrough)

Aleksa Gordić - The AI Epiphany

DQN - Playing Atari with Deep Reinforcement Learning | RL Paper Explained

DQN - Playing Atari with Deep Reinforcement Learning | RL Paper Explained

Aleksa Gordić - The AI Epiphany

AlphaGo - Mastering the game of Go with deep neural networks and tree search | RL Paper Explained

AlphaGo - Mastering the game of Go with deep neural networks and tree search | RL Paper Explained

Aleksa Gordić - The AI Epiphany

DeepMind's AlphaGo Zero and AlphaZero | RL paper explained

DeepMind's AlphaGo Zero and AlphaZero | RL paper explained

Aleksa Gordić - The AI Epiphany

OpenAI - Solving Rubik's Cube with a Robot Hand | RL paper explained

OpenAI - Solving Rubik's Cube with a Robot Hand | RL paper explained

Aleksa Gordić - The AI Epiphany

MuZero - Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model | RL Paper explained

MuZero - Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model | RL Paper explained

Aleksa Gordić - The AI Epiphany

EfficientNetV2 - Smaller Models and Faster Training | Paper explained

EfficientNetV2 - Smaller Models and Faster Training | Paper explained

Aleksa Gordić - The AI Epiphany

Implementing DeepMind's DQN from scratch! | Project Update

Implementing DeepMind's DQN from scratch! | Project Update

Aleksa Gordić - The AI Epiphany

MLP-Mixer: An all-MLP Architecture for Vision | Paper explained

MLP-Mixer: An all-MLP Architecture for Vision | Paper explained

Aleksa Gordić - The AI Epiphany

DeepMind's Android RL Environment - AndroidEnv

DeepMind's Android RL Environment - AndroidEnv

Aleksa Gordić - The AI Epiphany

When Vision Transformers Outperform ResNets without Pretraining | Paper Explained

When Vision Transformers Outperform ResNets without Pretraining | Paper Explained

Aleksa Gordić - The AI Epiphany

Non-Parametric Transformers | Paper explained

Non-Parametric Transformers | Paper explained

Aleksa Gordić - The AI Epiphany

Chip Placement with Deep Reinforcement Learning | Paper Explained

Chip Placement with Deep Reinforcement Learning | Paper Explained

Aleksa Gordić - The AI Epiphany

Text Style Brush - Transfer of text aesthetics from a single example | Paper Explained

Text Style Brush - Transfer of text aesthetics from a single example | Paper Explained

Aleksa Gordić - The AI Epiphany

Graphormer - Do Transformers Really Perform Bad for Graph Representation? | Paper Explained

Graphormer - Do Transformers Really Perform Bad for Graph Representation? | Paper Explained

Aleksa Gordić - The AI Epiphany

GANs N' Roses: Stable, Controllable, Diverse Image to Image Translation | Paper Explained

GANs N' Roses: Stable, Controllable, Diverse Image to Image Translation | Paper Explained

Aleksa Gordić - The AI Epiphany

VQ-VAEs: Neural Discrete Representation Learning | Paper + PyTorch Code Explained

VQ-VAEs: Neural Discrete Representation Learning | Paper + PyTorch Code Explained

Aleksa Gordić - The AI Epiphany

VQ-GAN: Taming Transformers for High-Resolution Image Synthesis | Paper Explained

VQ-GAN: Taming Transformers for High-Resolution Image Synthesis | Paper Explained

Aleksa Gordić - The AI Epiphany

Multimodal Few-Shot Learning with Frozen Language Models | Paper Explained

Multimodal Few-Shot Learning with Frozen Language Models | Paper Explained

Aleksa Gordić - The AI Epiphany

Focal Transformer: Focal Self-attention for Local-Global Interactions in Vision Transformers

Focal Transformer: Focal Self-attention for Local-Global Interactions in Vision Transformers

Aleksa Gordić - The AI Epiphany

AudioCLIP: Extending CLIP to Image, Text and Audio | Paper Explained

AudioCLIP: Extending CLIP to Image, Text and Audio | Paper Explained

Aleksa Gordić - The AI Epiphany

RMA: Rapid Motor Adaptation for Legged Robots | Paper Explained

RMA: Rapid Motor Adaptation for Legged Robots | Paper Explained

Aleksa Gordić - The AI Epiphany

DALL-E: Zero-Shot Text-to-Image Generation | Paper Explained

DALL-E: Zero-Shot Text-to-Image Generation | Paper Explained

Aleksa Gordić - The AI Epiphany

DETR: End-to-End Object Detection with Transformers | Paper Explained

DETR: End-to-End Object Detection with Transformers | Paper Explained

Aleksa Gordić - The AI Epiphany

DINO: Emerging Properties in Self-Supervised Vision Transformers | Paper Explained!

DINO: Emerging Properties in Self-Supervised Vision Transformers | Paper Explained!

Aleksa Gordić - The AI Epiphany

DeepMind DetCon: Efficient Visual Pretraining with Contrastive Detection | Paper Explained

DeepMind DetCon: Efficient Visual Pretraining with Contrastive Detection | Paper Explained

Aleksa Gordić - The AI Epiphany

Do Vision Transformers See Like Convolutional Neural Networks? | Paper Explained

Do Vision Transformers See Like Convolutional Neural Networks? | Paper Explained

Aleksa Gordić - The AI Epiphany

Fastformer: Additive Attention Can Be All You Need | Paper Explained

Fastformer: Additive Attention Can Be All You Need | Paper Explained

Aleksa Gordić - The AI Epiphany

The DINO model uses a teacher-student framework to learn visual features without labels, achieving state-of-the-art results on linear evaluation and KNN evaluation. The model uses a multicrop strategy, centering, and sharpening operations to prevent collapse and encourage uniform distribution.

Key Takeaways

Implement a teacher-student framework for self-supervised learning
Apply a multicrop strategy to the student network
Use centering and sharpening operations to prevent collapse
Evaluate performance using linear evaluation and KNN evaluation
Implement a retrieval augmented generation framework
Apply RAG to vision transformers
Evaluate RAG performance
Implement vector stores for self-supervised learning
Optimize vector stores for performance
Evaluate vector store effectiveness

💡 The DINO model's use of a teacher-student framework, multicrop strategy, centering, and sharpening operations allows it to learn high-quality visual features without labels, achieving state-of-the-art results on linear evaluation and KNN evaluation.

🔒 Pro feature: Ask AI to explain this lesson →

More on: Reading ML Papers

View skill →

Automatic Literature Review with GPT-3 - I embedded and indexed all of arXiv into a search engine!

Automatic Literature Review with GPT-3 - I embedded and indexed all of arXiv into a search engine!

Marcos Lopez Caniego - ESASky's JupyterLab widget| JupyterCon 2020

Marcos Lopez Caniego - ESASky's JupyterLab widget| JupyterCon 2020

Obsidian Zotero Integration Plugin | Streamline Your Research Paper Workflow 📝️

Obsidian Zotero Integration Plugin | Streamline Your Research Paper Workflow 📝️

This FULLY FREE Research Agent can BUILD Reports in Minutes!!!

This FULLY FREE Research Agent can BUILD Reports in Minutes!!!

Claude 3.7 Sonnet API | Build a Research Assistant

Claude 3.7 Sonnet API | Build a Research Assistant

I Built An Obsidian AI Research Assistant with Oz...

I Built An Obsidian AI Research Assistant with Oz...

Related Reads

On July 1, 2026, arXiv will spin out from Cornell University, its home for the past 25 years, to become an independent nonprofit organization. Major funding support from Simons Foundation and Schmidt Sciences. Ditching the red for their website. [N]

arXiv is becoming an independent nonprofit organization after 25 years at Cornell University, backed by major funding, which will impact the future of research and academia

Reddit r/MachineLearning

CS-NRRM™ Official Publications: Paper 1 and Paper 2 Are Now Available

Learn about the CS-NRRM's official publications on a 12-year longitudinal human observation archive and its significance in research and development

Medium · Data Science

Found a potential mistake in an ICLR 2026 blogpost [D]

Verify a potential mistake in an ICLR 2026 blog post and learn how to effectively report errors in academic publications

Reddit r/MachineLearning

Rebuttals Move Peer-Review Scores, but Initial-Review Structure Bounds the Movement

Learn how author rebuttals impact peer-review scores and the factors that influence their effectiveness in ICLR 2024-2025, using LLMs for measurement

Chapters (9)

DINO main ideas, attention maps explained

5:05 DINO explained in depth

10:30 Pseudocode walk-through

13:55 Multi-crop and local-to-global correspondence

15:15 More details on the teacher network

19:00 Results

25:00 Ablations

27:40 Collapse analysis

30:40 Features visualized and outro

Understanding Roy's Adaptation Model (10 Minutes)

Microlearning Daily