DINO -- Self-supervised ViT

Machine Learning Studio · Advanced ·🧬 Deep Learning ·2y ago

Key Takeaways

This video covers DINO, a self-supervised training method for Vision Transformers based on self-distillation with no labels.

Full Transcript

Hello everyone. Today we have a very exciting paper called "Emerging Properties in Self-Supervised Vision Transformers". So far in this series we have covered various forms of Vision Transformers, but this is the first method in the series that uses self-supervised training on ViTs, and as we'll see, self-supervised training of ViTs gives us some unexpected advantages. So let's dive in and see how this works.

This paper proposes a method called DINO, which is a fully self-supervised framework applicable to both ViTs and CNNs. Besides the simplified approach proposed by the paper for self-supervised learning, the authors also investigated the features obtained from the pre-trained models and observed that two properties come with these self-supervised ViTs. The first property is that features from a self-supervised ViT contain very useful semantic layouts of the scene, as shown in this figure, and the second property is that these features work great with k-NN classifiers.

Before we discuss the DINO methodology, let's clarify two related concepts in the literature: self-training versus knowledge distillation. Self-training in machine learning is a task where we are given a small set of labeled data, and the goal is to learn features based on the initial labeled data set and improve their quality by incorporating a larger unlabeled set of data. Knowledge distillation, on the other hand, refers to the task of transferring learned features from one network, usually referred to as the teacher network, to another network called the student. In this context there is a paper called Noisy Student, which actually leverages both of these: after training on the labeled data set, it learns to propagate soft pseudo-labels to an unlabeled data set and then iteratively improves these pseudo-labels.

There are several algorithms for self-supervised learning in the literature for vision. The first category of algorithms is instance classification, where each image is treated as a different class and the objective is to learn to 
discriminate between individual images. The drawback of this approach is that it does not scale well with the number of images. The second approach is called Bootstrap Your Own Latent, or BYOL for short. This approach is a metric learning framework that works with two networks, an online network and a target network. It generates different views of the input image and trains the online network to match the output of the target network. The proposed DINO algorithm is also inspired by BYOL, with some minor alterations, so we are going to cover BYOL first.

BYOL uses two networks: an online network, shown in the top branch, and a target network, shown in the bottom. The online network, which is parameterized by θ, is composed of a backbone encoder f_θ followed by two heads in sequence, a projector g_θ and a predictor q_θ. The target network is parameterized by ξ and also has an encoder f_ξ, followed by just one head, which is a projector g_ξ. The inputs to the online and target networks are different views generated from the same input image x. After feeding the different views to the online and target networks, we get q_θ(z_θ) as the output of the online network and z'_ξ as the output of the target network. These are two representations obtained from different views of the same image, and therefore the goal is to match them. The loss function for the online network is essentially the mean squared error between the normalized representations obtained from the two networks, as shown here. Then, for updating the target network, we use an exponential moving average of the parameters of the online network, as shown in this equation.

Now that we have reviewed BYOL, we can move on to the DINO framework, which is very similar to BYOL. It has two networks: a student parameterized by θ_s and a teacher network parameterized by θ_t. Then, given an input image x, we generate different views of x called x' and x''. We 
feed x' to the student network and get g_θs(x') as output, and similarly we feed the view x'' to the teacher network and get g_θt(x''). We can then apply a softmax function to convert these logits, g_θs and g_θt, to probabilities P_s and P_t, as shown in these two equations. Here τ_s and τ_t are the temperatures that control the sharpness of the softmax operation. The loss function for training the student is based on the cross-entropy loss between the probabilities of the student and the teacher, which are obtained from views of the same input image x. The exact form of this loss has more details, which we will see next.

So let's look at the details of self-supervised learning with knowledge distillation as proposed in DINO. Given an input image x, we generate a set of views V that contains two global views, x_1^g and x_2^g, as well as multiple local views. The global views have higher resolution, like 224x224, and they are made from crops that cover 50% or more of the input image. The local views are made from crops that cover less than 50% of the original image, and they have lower resolution. We feed only the global views to the teacher network, while the student receives all views, including the local ones. This way the student has to generate representations from the local views that match the representations of the global views from the teacher network. So we use this loss function, which considers all pairs of a teacher global view and a different student view, to train the student network.

The teacher network is also trained from scratch at the same time as the student network. For this, we build the teacher network from past iterations of the student using this exponential moving average update rule. This way of training the teacher with a momentum encoder gives us a mean teacher, which has an ensembling and model-averaging effect, and as a result it achieves better performance than the student model. The student and teacher networks have the same 
exact architecture, including a backbone encoder and a projection head. For the backbone architecture we can use either ViT or ResNet-50, with the details shown in this table, and the projection head is composed of a three-layer MLP followed by L2 normalization and finally a weight-normalized fully connected layer. So the final model can be represented mathematically as g_θ(x) = h(f(x)). Also note that the ViT model originally does not have batch norm, so the authors also avoid using batch norm in the projection head.

In such self-supervised settings that involve training two networks simultaneously, we should always be concerned about mode collapse and consider techniques to avoid it. In DINO, the authors use centering and sharpening operations, which have opposite effects, and the combination of the two effectively prevents mode collapse.

Now we can finally look at the experiments conducted in the DINO paper. For training, the models are pre-trained on the ImageNet data set without using the labels. For evaluation, there are two standard ways of evaluating SSL models: linear evaluation or fine-tuning. But since both of these evaluations are sensitive to hyperparameters, the authors also included a k-NN evaluation, which has minimal hyperparameters. The results of the linear and k-NN evaluations using different learning methods, including supervised as well as various SSL methods, are shown in this table. When using ResNet-50 as the backbone, we can see that the performance of DINO is on par with other SSL methods. This shows that DINO is applicable to standard CNN models. But using a ViT model, we see that DINO is even better than other SSL models by at least 3 to 4%.

Furthermore, as we mentioned at the beginning of this video, there are some emerging properties that come with the self-supervised ViT models. The first one is the nearest-neighbor retrieval capability, which is important for applications like image retrieval and copy detection. For both 
of these applications, the authors conducted experiments on relevant data sets, like the Oxford and Paris image retrieval data sets and the Copydays data set from INRIA, and the results show that DINO performs better than the supervised models. The second property is that the features from a self-supervised ViT contain very useful information related to the semantic layout of the scene. On the left you can see the self-attention maps, which clearly contain semantic information about their inputs. This information is useful for weakly supervised semantic segmentation. Furthermore, the authors conducted experiments on video instance segmentation on the DAVIS 2017 data set, and without any fine-tuning DINO shows superior performance compared to other SSL methods.

So that brings us to the end of this video on pre-training ViT models with DINO, a self-supervised learning framework with knowledge distillation. Two important properties emerge from the models trained with DINO: first, DINO models provide highly informative features that work well in k-NN classification, and second, DINO features contain semantic information. I hope you enjoyed this video. In the next video we will cover the CLIP model from OpenAI, so stay tuned, and thanks for watching.
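The training step described in the transcript (temperature softmax, cross-entropy between teacher and student, centering plus sharpening, and the exponential moving average update) can be sketched in a few lines. This is only an illustrative sketch in plain Python, not the authors' implementation: the function names, the 3-dimensional toy outputs, the temperatures (0.1 and 0.04) and the momentum values (0.9 and 0.996) are all assumptions chosen for the example, and real networks are replaced by hand-written lists of logits.

```python
import math

def log_softmax(logits, temperature):
    """Log of a temperature-scaled softmax over a list of floats."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(v - m) for v in scaled))
    return [v - lse for v in scaled]

def softmax(logits, temperature):
    """Temperature-scaled softmax; a lower temperature gives a sharper output."""
    return [math.exp(v) for v in log_softmax(logits, temperature)]

def dino_loss(teacher_logits, student_logits, center, tau_s=0.1, tau_t=0.04):
    """Mean cross-entropy over (teacher view, student view) pairs.

    teacher_logits: outputs for the global views only.
    student_logits: outputs for all views, global views first, so that
    index i in both lists refers to the same view.
    tau_s and tau_t are illustrative values, not taken from the transcript.
    """
    total, n_pairs = 0.0, 0
    for i, t in enumerate(teacher_logits):
        # Centering (subtract the running center), then sharpening (low tau_t).
        p_t = softmax([v - c for v, c in zip(t, center)], tau_t)
        for j, s in enumerate(student_logits):
            if i == j:
                continue  # skip pairs where both networks saw the same view
            log_p_s = log_softmax(s, tau_s)
            total += -sum(pt * lp for pt, lp in zip(p_t, log_p_s))
            n_pairs += 1
    return total / n_pairs

def ema_update(old, new, momentum):
    """Exponential moving average: momentum * old + (1 - momentum) * new."""
    return [momentum * o + (1.0 - momentum) * n for o, n in zip(old, new)]

# Toy example: 2 global views (teacher and student) plus 2 local views (student only).
teacher_out = [[2.0, 0.5, -1.0], [1.5, 0.2, -0.5]]
student_out = [[1.8, 0.4, -0.9], [1.2, 0.1, -0.3], [0.9, 0.8, 0.0], [0.3, 1.1, -0.2]]
center = [0.0, 0.0, 0.0]

loss = dino_loss(teacher_out, student_out, center)

# The center tracks the mean teacher output (hypothetical momentum of 0.9).
batch_mean = [sum(col) / len(teacher_out) for col in zip(*teacher_out)]
center = ema_update(center, batch_mean, momentum=0.9)

# The teacher weights track the student weights (hypothetical momentum of 0.996).
student_params = [0.5, -0.2]
teacher_params = [0.4, -0.1]
teacher_params = ema_update(teacher_params, student_params, momentum=0.996)
```

In a real setup only the student would receive gradients from this loss, while the teacher changes only through the moving-average update, which is what makes the teacher a slowly moving average of past students.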

Original Description

In this video, we cover a very exciting paper called “Emerging Properties in Self-Supervised Vision Transformers”. The proposed method, DINO (self-distillation with no labels), is a simplified approach to self-supervised learning in the vision domain. Similar to self-supervised transformers in NLP, pre-training ViTs with DINO also leads to some emerging properties beyond what the models were trained for.
