Vision Transformer from Scratch Tutorial
Key Takeaways
This video tutorial demonstrates how to build a Vision Transformer from scratch, covering the basics of computer vision, self-attention, and Transformer architecture, and utilizing tools such as Hugging Face's Transformers Library and CLIP model.
Full Transcript
Vision Transformers are reshaping computer vision by bringing the power of self-attention to image processing. In this course, Tunga Barak, an experienced machine learning instructor, will guide you through building a Vision Transformer from scratch. He'll cover everything from patch embeddings and multi-head attention to assembling a full Transformer model, while also exploring how Vision Transformers compare to models like CLIP and SigLIP. By the end of this course, you'll have a deeper understanding of how AI models process visual data.

Welcome. In this video we will code a Vision Transformer from scratch. But what is a Vision Transformer? You are probably familiar with Transformers from large language models. A Vision Transformer is also a type of Transformer, but it can take an image as input and turn it into an embedding that captures information from that image. Because Vision Transformers let us embed the information from images, we can use them with language models and give the language models the ability to see and take images as input. Our language models become vision language models: you can give them images and ask questions about them, so your language model can now pretty much see instead of just reading text.

But how do these Vision Transformers work? Right now I'm going to give just a very brief overview, and we're going to dive deeper into it in a minute. Vision Transformers use a Transformer-like architecture, but instead of processing text tokens like LLMs do, they work on image patches, and you can think of each image patch as a token. You can see the image patches here, taken from a bigger image. You can think of an image as a puzzle where each piece is a patch from the image, and we embed each patch into a vector that captures its meaning and information. Then we add position embeddings so the model knows where each patch belongs in the image. After that we apply self-attention, and that
allows the patches to interact and understand the relationships within the entire image. This makes each patch context-aware, meaning it understands the other patches and their spatial relationships. To get the final understanding of the image, we use a special CLS token. Instead of directly combining all patches, this token attends to all patches through self-attention and learns to represent the entire image in a single embedding. Finally, through self-attention layers and an MLP, our Vision Transformer produces a high-level representation of the whole image.

In this video we will be coding the SigLIP Vision Transformer, which I will just show here. This Vision Transformer model has 400 million parameters, and it was built by Google as an improvement to a very famous model called CLIP, which was built by OpenAI. But what makes SigLIP special? CLIP was originally trained with a softmax loss function, but the softmax loss forces the model to compare every image with all the text descriptions in the whole batch, which is computationally pretty expensive. Instead, SigLIP replaces the softmax loss with a sigmoid loss, which means the model focuses on individual image-text pairs instead of the whole batch, without comparing everything at once. This makes training way more efficient and flexible, because the model learns to match images with their correct descriptions without being distracted by unrelated pairs in the batch. SigLIP is a pretty famous Vision Transformer, and it's used in Google's VLMs like PaliGemma and PaliGemma 2; you can see it here as the vision encoder.

All right, to understand SigLIP I think we first need to understand what CLIP is, because SigLIP is an improvement over CLIP with a better loss function. We have an image encoder and we have a text encoder. We have a list of images and a list of corresponding texts, and we feed the images through this image encoder, and as a result, for these images we get image
embeddings. This dog is represented by the embedding here, I1, and in the same way this image goes through the image encoder and is now represented by an embedding. The same thing applies to the text as well: the running dog, the bird on a branch, a camel in a desert all go through this text encoder, and as a result they have their own embeddings. So now we have a list of text embeddings and a list of image embeddings. What we're going to do is take the dot product of all of these image and text embeddings, and train the text encoder and image encoder to give us high dot products for matching pairs and low dot products for non-corresponding image-text pairs. When you look at I1, it's the image of a dog, and when you look at text one, it's the text of a dog, so we want them to have a high dot product. For the image of a dog and the text of a bird, we want a low dot product. So pretty much, for the image of a dog, anything that doesn't correspond to its text we want to have a low dot
product. The same applies to the image of a bird as well: we have the embedding of the image of a bird and the embedding of the text of a bird, and we want them to have a high dot product, because this says bird and this is also an image of a bird. Everything else we want to have a low dot product, because they're not corresponding. This pattern holds over all multiplications of these embeddings: for everything that matches we want a high dot product of the embeddings, and for everything that doesn't match we want a low dot product.

Now we want to find a loss function that forces these dot products to be high and the rest to be low, and that loss function is going to be cross-entropy loss. This cross-entropy loss trains the image encoder and text encoder to give a high value to corresponding image and text and a low value to non-corresponding image and text.

But there's a problem with CLIP, and it comes from the fact that it uses the softmax function. When we use softmax, we take this image of a dog and all the texts whose dot products we take with it, and we want to give this one a high probability and the rest pretty much zero. We want to do that for all the image-text pairs here, and that is pretty problematic, because you need all the text and image embeddings in the batch calculated together to be able to find this one as the highest and the others as the lowest. You have to do the same on the columns as well: when you have the text "running dog", you look at this whole column and compare it to all these images. After that you get a distribution of probabilities, and you try to make this one higher and all the rest lower. And that's the problem, pretty
much, with CLIP: you cannot parallelize this easily, and you have to look row by row and column by column to be able to calculate it. I will show you the math behind this right now. This is the math for CLIP and this is the math for SigLIP. In the image-to-text softmax we have one image here, the image of a dog, let's say, and we compare it to all the text inputs that we have. That is going to be one row, and we take the negative log-likelihood of its probability. Next we also have the text-to-image softmax. Here we have one text, let's say we choose the bird, T2, and we compare it to all the images, and find that only the corresponding one should have a high dot product and the rest should have low dot product values. So we have to train it against all the other images comparatively. That is the problem with CLIP: you have to do this twice, and it's also not symmetric. The reason it's not symmetric is this: you might have I1 T2, where I1 corresponds to the dog and T2 is the bird on a branch. Their dot product is different from I2 T1, where I2 is the image of the bird on a branch and T1 is the text of the running dog. So it's not symmetric, and that's why we have to do it twice. But if we use SigLIP we only need to do it once: we treat this as one big matrix and find the values from there. I will show it in a second.

So inside SigLIP, instead of comparing every row like this... Now, this image is from SigLIP's paper and this one is from CLIP, and they pretty much explain the same thing: these are the image embeddings and the text embeddings. When you look here, the text embeddings are here and the image embeddings are here. So in the sigmoid
loss, what we're doing is looking at each dot product individually and asking: hey, is this supposed to be zero or is this supposed to be one? That's pretty much it. In the paper implementation the corresponding values are plus one and the rest are minus one, but for the sake of understanding it more simply we'll talk about one and zero. Instead of taking this whole row, taking text one and comparing it to all the other images, we just look at this entry and ask: is this supposed to be one? If so, we train the embeddings that way. Then we look at this one: is it going to be one or zero? It's going to be zero, because it's not the corresponding pair. We can go through all the rest and ask, for text six with image four, what is it supposed to be? It's supposed to be zero, because they're not corresponding. And image six with text six is going to be one, because they are the correct image-text pair. So the sigmoid loss lets us decide individually whether each dot product has to be zero or one, and because of this partitionability we can split the matrix into blocks, give each block to a different device, and parallelize the process. This allows us to be faster.

All right, let's get to it. We will now start coding the SigLIP Vision Transformer model. First things first, we need to get our image, the one we will turn into an embedding. This image you see right here, the image of a cat, is going to run through our model, SigLIP, and at the end it is going to have an embedding that captures all the information from this image. We also need to get our model to actually run this image, so what we're going to do is import what we need from the Transformers library.
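Before the coding starts, the two losses compared above can be put side by side in a few lines. This is a minimal sketch, not the CLIP or SigLIP training code: the function names, the toy embeddings, and the `scale` and `bias` defaults are assumptions made for illustration, and CLIP's temperature is left out. The softmax version needs whole rows and columns of the similarity matrix, while the sigmoid version scores every entry on its own with +1/-1 labels.

```python
import torch
import torch.nn.functional as F

def clip_softmax_loss(image_emb, text_emb):
    """Batch-wise softmax (cross-entropy) loss over rows and columns."""
    logits = image_emb @ text_emb.T                      # (B, B) matrix of dot products
    targets = torch.arange(image_emb.shape[0])           # matching pairs lie on the diagonal
    image_to_text = F.cross_entropy(logits, targets)     # each image vs. all texts
    text_to_image = F.cross_entropy(logits.T, targets)   # each text vs. all images
    return (image_to_text + text_to_image) / 2

def siglip_sigmoid_loss(image_emb, text_emb, scale=1.0, bias=0.0):
    """Pairwise sigmoid loss: each dot product is its own binary decision."""
    logits = scale * (image_emb @ text_emb.T) + bias
    labels = 2 * torch.eye(image_emb.shape[0]) - 1       # +1 for matching pairs, -1 otherwise
    return -F.logsigmoid(labels * logits).sum() / image_emb.shape[0]

torch.manual_seed(0)
image_emb = F.normalize(torch.randn(4, 8), dim=-1)       # 4 toy image embeddings
text_emb = F.normalize(torch.randn(4, 8), dim=-1)        # 4 toy text embeddings
print(clip_softmax_loss(image_emb, text_emb).item())
print(siglip_sigmoid_loss(image_emb, text_emb).item())
```

Because every entry of the sigmoid loss is independent, the matrix can be cut into blocks and summed block by block, which is the property the transcript refers to when it talks about spreading the computation over devices.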
From the Transformers library we're going to import the AutoProcessor, which is going to process our input, the image. We will import SiglipVisionModel, which holds the model weights and is the fully pre-trained model from Hugging Face, and SiglipVisionConfig, which is the configuration and architecture of the model; with it we can tweak some parameters if we want to, like how many layers there are and the embedding dimension. Right here we are loading the processor from the Hugging Face checkpoint. google/siglip-base-patch16-224 is the checkpoint we are using; the name means that our patch size is 16 and that we are going to be using 224 by 224 images. And right here we are getting the full model itself from the same checkpoint; this model is the full Transformer with the weights from Hugging Face.

Let's check what our model structure looks like. As you can see, the model from Hugging Face has the vision Transformer and the vision embeddings. This is the contents of the vision model, but it's not in sequential order. First of all we can see the SigLIP vision embeddings here: the patch embeddings and position embeddings. When we look at the SigLIP architecture, the embeddings are this green box, with the patch embeddings and the position embeddings. Then we have the encoder, and this encoder is this blue box over here. What you can see here is the layer normalization, multi-head attention, another layer normalization, and the multi-layer perceptron, MLP. When we look at the printed contents, we have the layers, and we have the scaled dot
product attention, which is the self-attention; we can see it over here as the multi-head attention. Then we can see layer norm one, which is the one right before multi-head attention, and layer norm two, which is exactly here. We also have the multi-layer perceptron: it has two fully connected layers, fc1 and fc2, with the GELU nonlinearity function in between, so it's inside here. At the very end we have the post layer normalization. As I said, this is not in sequential order; it just shows what the model contains.

So now we have our image and we know the architecture of our model, but we cannot give the image directly to the model. We need to pre-process it so that we can feed it to the model in a form it can understand. Let's get to this part. We have our imports, and we are pre-processing the image. The image gets resized to 224 by 224 pixels, then it is turned into a tensor, and then it is normalized. You might wonder where these numbers come from. They actually come from the ImageNet dataset. ImageNet is a dataset of more than 14 million images, and these are the per-channel mean and standard deviation values computed over that dataset. Using them is pretty much an industry standard. When we run the image through the pre-processing here, we give it the image and get back a tensor, and after that we unsqueeze that tensor. Why do we unsqueeze it? The reason is that our image right now is 224 by 224.
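The resize, to-tensor, normalize, and unsqueeze steps just described can be sketched with plain PyTorch. This is only an approximation of what is on screen: the video's own code is not part of the transcript, so the function body, the use of `F.interpolate` for resizing, and the random stand-in image are assumptions. The mean and standard deviation are the commonly used ImageNet per-channel statistics.

```python
import torch
import torch.nn.functional as F

# Widely used ImageNet per-channel statistics (RGB order).
IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

def preprocess_image(image, image_size=224):
    """image: uint8 tensor of shape (H, W, 3). Returns a float tensor (1, 3, S, S)."""
    x = image.permute(2, 0, 1).float() / 255.0                   # (3, H, W), values in [0, 1]
    x = F.interpolate(x.unsqueeze(0), size=(image_size, image_size),
                      mode="bilinear", align_corners=False)[0]   # resize to (3, S, S)
    x = (x - IMAGENET_MEAN) / IMAGENET_STD                       # normalize each channel
    return x.unsqueeze(0)                                        # add the batch dimension

fake_image = torch.randint(0, 256, (300, 400, 3), dtype=torch.uint8)  # stand-in for a real photo
image_tensor = preprocess_image(fake_image)
print(image_tensor.shape)  # torch.Size([1, 3, 224, 224])
```

The final `unsqueeze(0)` is what turns a single image into a batch of one, which is the shape the model expects.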
It has three channels, RGB: red, green, and blue. But if we want to turn our image into a form that the Transformer can understand, we should add the batch dimension, because every time we send images into Transformers we have to send them in batches, meaning there may be several images. So this is the batch dimension that we have to add: even if we have one image, we add a batch dimension that says, hey, there's only one element here. We unsqueeze the tensor to get the dimensions we need, and then we just return that image tensor. So our image tensor is what we get by running our image through this preprocess image function.

The values here are the ones that the PaliGemma 2 configuration uses for SigLIP. PaliGemma 2 is a vision language model, and these are the parameters it uses, so for this SigLIP implementation I'll be using those parameters. The embed dim is 768, which means every image patch is going to be turned into a vector of size 768, so it has 768 dimensions to be embedded into. The patch size is 16, which means each little puzzle piece from the image is 16 by 16 pixels. The image size is 224 by 224 pixels; that's the big image. Now we need to find how many patches we get from that image. We take the image size, 224 in height and width, and the patch size, 16, and we divide to get 14. That means there are going to be 14 patches along the height and 14 along the width, so when we multiply these two we get 196 patches from the whole image. What we do after that runs without calculating gradients: here we have no_grad, which means we are not in training mode.
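The patch-count arithmetic above is small enough to check directly. The variable names below are illustrative choices, not necessarily the ones used in the video.

```python
image_size = 224   # height and width of the input image, in pixels
patch_size = 16    # height and width of one patch, in pixels
embed_dim = 768    # length of the vector each patch is embedded into

assert image_size % patch_size == 0, "patches must tile the image exactly"
patches_per_side = image_size // patch_size   # 224 / 16 = 14
num_patches = patches_per_side ** 2           # 14 * 14 = 196
print(patches_per_side, num_patches)          # 14 196
print((num_patches, embed_dim))               # (196, 768): one row per patch
```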
Gradients aren't saved here, for efficiency. Simply put, our model won't save the feedback from the calculations that would be used to update the parameters. What happens here is that with this 2D convolution we run the image tensor through the layer to get the patch embeddings. Our kernel size is our patch size, so the kernel size is pretty much how big our patches are going to be. The stride here is also equal to the patch size, which means the patches we get from the image through convolution are not going to overlap. This way we have separate, individual patches that don't share any pixels. Now let's see what our dimensions are. As you can see, we have a batch dimension of one, our embed dimension is 768, and we have 14 by 14, meaning 14 patches along the height and 14 along the width, so in total our number of patches is 196.

All right, let's add the position embeddings. As we can see here, we have the number of patches and the embed dimension: the number of patches is 196 and our embedding dimension is 768. Here nn.Embedding creates a lookup table for all the positions that we have and their embeddings. Then we create the position IDs for all those patches, from 0 to 195, and we expand them, adding the batch dimension we need, because otherwise they wouldn't be compatible with the rest of the patch embeddings. So we add this batch dimension, and our position IDs now have the shape 1 by 196. Let's check: this is a batch size of one with 196 positions inside it.

Now it's time to add the position embeddings to the patch embeddings. We have our patch embeddings, but first we need to flatten them.
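The convolution and position-ID steps described above can be sketched as follows. The layer and variable names are assumptions chosen for readability, and a random tensor stands in for the pre-processed cat image.

```python
import torch
import torch.nn as nn

embed_dim, patch_size, image_size = 768, 16, 224
num_patches = (image_size // patch_size) ** 2              # 196
image_tensor = torch.randn(1, 3, image_size, image_size)   # stand-in for the pre-processed image

# kernel_size == stride == patch_size, so each output position sees one non-overlapping patch.
patch_embedding = nn.Conv2d(in_channels=3, out_channels=embed_dim,
                            kernel_size=patch_size, stride=patch_size)
with torch.no_grad():                                      # inference only, no gradients stored
    patch_embeds = patch_embedding(image_tensor)
print(patch_embeds.shape)                                  # torch.Size([1, 768, 14, 14])

position_embedding = nn.Embedding(num_patches, embed_dim)  # lookup table: position -> vector
position_ids = torch.arange(num_patches).expand((1, -1))   # 0..195 with a batch dimension
print(position_ids.shape)                                  # torch.Size([1, 196])
```

Using a convolution whose stride equals its kernel size is a compact way to cut the image into patches and apply the same linear projection to each one in a single call.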
We flatten them so that the 14 by 14 grid becomes a single dimension, giving a batch size of one, our embed dimension, and the number of patches. After this flatten operation we have these dimensions, but they are not compatible for addition with the position embeddings. Our position ID shape is 1 by 196, and when we feed these position IDs to the position embedding, what we get is 1 by 196 by 768. So our patch embeddings need to be the same size, and that's why we transpose them. They were 1 by 768 by 196, and when we transpose these two dimensions, the patch embedding dimensions and the position embedding dimensions match. Right here the patch embeddings, 1 by 196 by 768, have the same shape as the position embeddings, and we add them together. So now we have all of our embeddings, and their shape should be: batch size one, number of patches 196, and embed dimension 768. And yes, we are correct; you can see it here.

So what we've done so far: we had our position embeddings and our patch embeddings, and if you look at the architecture, we add the patch embeddings to the position embeddings. Our embeddings layer is done once these two are added together. Let's visualize what we have right now. I'm going to get the embeddings from a randomly initialized embedding layer, so it should all look random: our embeddings are not trained, and that gives us this not-so-smooth image. What does this image mean? The patch number is here, from 0 to 196; these are the patches.
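The matrix being visualized here comes out of the flatten, transpose, and add steps described above. A minimal sketch of those steps, with a random tensor standing in for the convolution output and with illustrative variable names:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_dim, num_patches = 768, 196
patch_embeds = torch.randn(1, embed_dim, 14, 14)           # stand-in for the Conv2d output
position_embedding = nn.Embedding(num_patches, embed_dim)  # randomly initialized, untrained
position_ids = torch.arange(num_patches).expand((1, -1))   # (1, 196)

flat = patch_embeds.flatten(start_dim=2)                   # (1, 768, 196): grid -> one axis
flat = flat.transpose(1, 2)                                # (1, 196, 768): patches first
embeddings = flat + position_embedding(position_ids)       # shapes match, so we can add

heatmap = embeddings[0].detach()                           # (196, 768) matrix to plot
print(embeddings.shape, heatmap.shape)
```

In the plot, each row of `heatmap` is one patch and each column is one embedding dimension, which is why the axes run over the patch index and the 768 dimensions.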
These are the small puzzle pieces from the big image, and the embedding dimensions go from 0 to 768. What this means is that each patch has this many dimensions in its embedding. When you look at this image overall, you can see that it's not trained, because it's not smooth and it's not uniform, so the dimensions and embeddings here don't actually mean anything yet. Now compare it to the trained embeddings. What we're doing here is taking the inputs, running them through the processor, and using the vision model. And what is this vision model? It's the one from the Hugging Face checkpoint, so these are the trained embeddings. When we run it we see a different result: the embeddings that we got from the Hugging Face checkpoint are smoother, and you can see that they are normalized. The columns you see here are specific dimensions; they have some relations and they represent some specific features. So the untrained patch embeddings that we just randomly initialized differ from the embeddings we took from the Hugging Face checkpoint in that they're not uniform and smooth, which means they are not representing the image and its patches as intended.

Right now let's make everything look neater and wrap everything in a class. Here I create a data class to store the vision configuration: our number of channels is three, our embed dimension is 768, our image size is 224 by 224, and each of our patches is 16 by 16. We can change this configuration if we want to; I'm just putting it in the config class. And this class over here, SiglipVisionEmbeddings, is pretty much the vision embeddings that you can see here: it has the patch embeddings and the position embeddings. Let's look into this one.
So what we have right now is the initialization. We initialize the config values: the number of channels was three, the embed dimension we get from here is 768, and the image size and patch size are the same as we defined here. We are also assigning the patch embedding from the 2D convolution, which was pretty much getting all the patches from our image. The number of patches we found by dividing the image size by the patch size, which gives 14, and multiplying height by width. The number of positions on the image is the number of patches: however many patches there are, there are that many positions. All of this we have already done above; we're just wrapping everything into a class. The position embedding is then an nn.Embedding sized by how many positions there are and how many embedding dimensions there are. So we initialize that, and we create the position IDs with compatible dimensions, including the batch dimension, so the position embeddings can be added to the patch embeddings. We did all of this above, actually: these were the position embeddings, here we were pre-processing the image, and the patch embeddings we got from the 2D convolution here are the same thing we're doing right over here. So we're pretty much just wrapping everything in a class; we're not doing anything new.

In the forward, we get the pixel values, and their shape is batch, channel, height, and width. Our patch embeddings are the pixel values after we run them through the patch embedding layer. We then flatten those patch embeddings, and after flattening we transpose them so that their dimensions match and we can add the position embeddings on top. As you can see, we have the position embedding addition here: after that we add the patch embeddings to the position embeddings.
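Wrapped into a class, the steps walked through above might look like the sketch below. The transcript names the embeddings class but does not show its source, so the config class name `VisionConfig`, its field names, and the non-persistent `position_ids` buffer are assumptions; the values are the ones quoted in the video.

```python
from dataclasses import dataclass

import torch
import torch.nn as nn

@dataclass
class VisionConfig:          # hypothetical name for the data class described above
    num_channels: int = 3
    embed_dim: int = 768
    image_size: int = 224
    patch_size: int = 16

class SiglipVisionEmbeddings(nn.Module):
    def __init__(self, config: VisionConfig):
        super().__init__()
        self.config = config
        # Non-overlapping patches: kernel size and stride both equal the patch size.
        self.patch_embedding = nn.Conv2d(config.num_channels, config.embed_dim,
                                         kernel_size=config.patch_size,
                                         stride=config.patch_size)
        self.num_patches = (config.image_size // config.patch_size) ** 2
        self.num_positions = self.num_patches            # one position per patch
        self.position_embedding = nn.Embedding(self.num_positions, config.embed_dim)
        # Kept out of the state dict so only learned weights are saved or loaded.
        self.register_buffer("position_ids",
                             torch.arange(self.num_positions).expand((1, -1)),
                             persistent=False)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        patch_embeds = self.patch_embedding(pixel_values)               # (B, D, H/P, W/P)
        embeddings = patch_embeds.flatten(start_dim=2).transpose(1, 2)  # (B, N, D)
        return embeddings + self.position_embedding(self.position_ids)  # add positions

embd = SiglipVisionEmbeddings(VisionConfig())
out = embd(torch.randn(1, 3, 224, 224))   # random stand-in for the image tensor
print(out.shape)                          # torch.Size([1, 196, 768])
```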
We have our final embeddings, and what that means is that now we can feed these embeddings into the encoder layer and get our image actually worked on inside the Transformer. So let's try it. We create our SiglipVisionEmbeddings from our vision configuration, which you can see here, and we have our image tensor, the image tensor after pre-processing, which we are also feeding in. So we feed that image tensor into the embeddings, and now we look at the shape of the result. What do you think this shape is going to look like? Our batch size is one, the number of patches is 196, and the embedding dimension is 768. Since we added the patch embeddings to the position embeddings, and we transposed the patch embeddings so the two have the same dimensions, we are expecting 1 by 196 by 768. And exactly: this means our embeddings are actually getting added and it's working.

Right now I think we should do a sanity check. Let's see if we initialized our embeddings properly, and if we are able to load the Hugging Face weights and get the same answer. Here we have our state dictionary for the embeddings that we built; getting the state dictionary means getting the weights and biases. And we have the Hugging Face state dictionary: the weights from Hugging Face for the vision model that we defined at the very top, the thing we got from Hugging Face. So these are Hugging Face's own weights. We're trying to check whether we initialized and created everything properly, so that we can actually take the Hugging Face weights.
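The idea behind this check can be shown without downloading anything. In the sketch below a second, independently initialized module plays the role of the pretrained Hugging Face embeddings; that substitution, the class name `PatchPositionEmbeddings`, and the key filtering are all assumptions made so the example runs offline. The pattern is the same as described: copy the reference state dictionary into our module, run both on the same input, and look at the difference.

```python
import torch
import torch.nn as nn

class PatchPositionEmbeddings(nn.Module):
    """Compact stand-in for the embeddings module described above."""
    def __init__(self, channels=3, embed_dim=768, image_size=224, patch_size=16):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        self.patch_embedding = nn.Conv2d(channels, embed_dim, patch_size, stride=patch_size)
        self.position_embedding = nn.Embedding(num_patches, embed_dim)
        self.register_buffer("position_ids",
                             torch.arange(num_patches).expand((1, -1)), persistent=False)

    def forward(self, pixel_values):
        x = self.patch_embedding(pixel_values).flatten(2).transpose(1, 2)
        return x + self.position_embedding(self.position_ids)

torch.manual_seed(0)
reference = PatchPositionEmbeddings()   # plays the role of the pretrained embeddings
ours = PatchPositionEmbeddings()        # freshly initialized, so its weights differ

# Keep only the entries whose names exist in our module, then load them.
our_state = ours.state_dict()
reference_state = {k: v for k, v in reference.state_dict().items() if k in our_state}
ours.load_state_dict(reference_state)

image_tensor = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    max_diff = (ours(image_tensor) - reference(image_tensor)).abs().max().item()
print(max_diff)   # should be zero, or within floating-point tolerance
```

If the parameter names or shapes in our module did not line up with the reference, `load_state_dict` would raise an error, which is itself a useful part of the check.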
What happens here is that we get the difference between our output and the Hugging Face output; it's pretty much a sanity check. When we run it, it says the difference between our output and the Hugging Face output is zero, which means we implemented it correctly, so we are on the right track.

All right, now that we have the embeddings, it's time to actually go into the encoder. As you can see in our architecture, we finished implementing the embeddings: we had our patch embeddings, we flattened them, we transposed them, and then we added the position embeddings. Now our embeddings are complete and we need a way to process them, and that is going to be through the encoder layer. Inside the encoder layer, we're going to start with the multi-head attention, so let's get into it.

We have our multi-head attention here, and we're going to implement it in a vectorized way to match the Hugging Face attention weights. But to make it simpler, we can start by implementing a single attention head first, since that is a little more intuitive. Here we are setting the head size; this is the size each head projects our tokens into. After that we create three linear transformations: key, query, and value. These are the core components of self-attention, and they are what let patches interact with each other, so a patch from any part of the image can interact with any other through these key, query, and value transformations. In the forward pass, from the shape of the image embeddings we get the batch size, how many tokens there are, and the embedding size, or the hidden dimension, you could say. But here's where the magic happens: every token, every one of the 196 patches, is going to emit three different vectors. So the first
vector is the key vector, and it represents the information that this patch, this token, contains. The second vector is the query vector, and it represents what information this token is looking for. The last one is the value vector, and it's what actually gets aggregated if there's a match. So how we can think of it is: with the key, each patch says "I have this, I have that", and with the query, each patch is looking for something. If these give a high value when they are matrix multiplied, that means those patches attend to each other more.

So now we can calculate the attention scores, through matrix multiplication of the queries and the transposed keys. We transpose the keys to be able to do the matrix multiplication; otherwise the dimensions don't match. Through this matrix multiplication we get the affinities between all tokens, and we scale them down by the square root of the head size to prevent the dot products from growing too large. You can think of this scaling as pretty much a normalization. Then we have the softmax, which turns the raw scores into probabilities, like it did in CLIP: it takes a vector and turns it into a probability distribution. After that each row sums to one, which gives us a proper attention distribution for each patch. Finally we use these attention weights to aggregate the values, and this weighted sum allows information to flow between tokens. This output is what we return from the forward pass.

Now we can implement the full multi-head attention. We just looked at what happens in a single attention head: all the tokens attend to each other, they have their key, query, and value vectors, and their dot products give the affinities between them. Now, that was one head, but we're going to run many of them.
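The single attention head described above can be sketched as a small module. The class name `AttentionHead`, the head size of 64, and the use of bias terms in the linear layers are assumptions; the computation follows the steps in the narration: project to key, query, and value, take scaled dot products, apply softmax over each row, and aggregate the values.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionHead(nn.Module):
    def __init__(self, embed_dim: int, head_size: int):
        super().__init__()
        self.head_size = head_size
        self.key = nn.Linear(embed_dim, head_size)     # what each patch contains
        self.query = nn.Linear(embed_dim, head_size)   # what each patch is looking for
        self.value = nn.Linear(embed_dim, head_size)   # what gets aggregated on a match

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        k, q, v = self.key(x), self.query(x), self.value(x)             # each (B, T, head_size)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_size)    # (B, T, T) affinities
        weights = F.softmax(scores, dim=-1)                             # each row sums to one
        return weights @ v                                              # weighted sum of values

torch.manual_seed(0)
head = AttentionHead(embed_dim=768, head_size=64)
x = torch.randn(1, 196, 768)      # stand-in for the patch embeddings
out = head(x)
print(out.shape)                  # torch.Size([1, 196, 64])
```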
We do a lot of these in parallel, and that's what we call multi-head attention. Here we have the head size and the number of heads initialized, and here we create multiple attention heads, as many as the number of heads. Each head can learn to focus on different types of relationships within the image: one can focus on some specific colors, another on some specific textures, or whatever they want. So each head can focus on specific features and different relationships between the patches. After that we concatenate all the heads, and we need to project back to our input dimension, so whatever input dimension we got, we have to output the same. This lets us connect back to the residual pathway in the Transformer. What the residual pathway is, you can see here: we have our multi-head attention, and we have the residual coming from right after the embeddings, and we want to be able to add them. So the output of multi-head attention has to have the same shape as the residual that's coming in, and that's what the projection does: for this linear layer, the input and output dimensions are the same.

So what we've done so far is implement a single attention head, then create multiple attention heads and concatenate them together. The reason we use multiple attention heads is that the more heads we have, the more interesting features and relationships we can capture. To match the output dimension we expect, we concatenated those attention heads, and now we can use multi-head attention and make it work with our model. But this is not how Hugging Face does it, and to be able to use the Hugging Face weights we need to implement it the way Hugging Face implements it. If you look at our configuration, it's pretty much the same; nothing changed here. Let's look at the SiglipAttention module.
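For reference before the vectorized rewrite, the separate-heads version built up above can be sketched as follows. The class names and the choice of 12 heads are assumptions (the transcript does not state the head count); the structure follows the narration: several independent heads, concatenation along the feature axis, and a final projection so the output shape matches the residual input.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionHead(nn.Module):
    """One self-attention head, as in the single-head walkthrough."""
    def __init__(self, embed_dim: int, head_size: int):
        super().__init__()
        self.head_size = head_size
        self.key = nn.Linear(embed_dim, head_size)
        self.query = nn.Linear(embed_dim, head_size)
        self.value = nn.Linear(embed_dim, head_size)

    def forward(self, x):
        k, q, v = self.key(x), self.query(x), self.value(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_size)
        return F.softmax(scores, dim=-1) @ v

class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim: int = 768, num_heads: int = 12):  # 12 heads is an assumption
        super().__init__()
        head_size = embed_dim // num_heads
        self.heads = nn.ModuleList(
            [AttentionHead(embed_dim, head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(embed_dim, embed_dim)    # project back to the input dimension

    def forward(self, x):
        out = torch.cat([head(x) for head in self.heads], dim=-1)   # (B, T, embed_dim)
        return self.proj(out)

torch.manual_seed(0)
mha = MultiHeadAttention()
x = torch.randn(1, 196, 768)      # stand-in for the patch embeddings
out = mha(x)
residual = x + out                # shapes match, so the residual connection works
print(out.shape, residual.shape)
```

Looping over a list of heads is easy to read but runs the heads one after another, which is the inefficiency the vectorized version addresses.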
Here we are implementing the multi-head attention mechanism in a vectorized way. This is different from the previous implementation, where we had separate heads; here we will process all the attention heads in parallel. We do it this way because it's more efficient, and because that is how Hugging Face does it and we want to be able to use their weights. While we could actually keep the implementation cleaner with some other multi-head attention implementation, it's definitely better to keep it the same as the original implementation, so we can load the weights from the Hugging Face model more easily and don't have any problems with that.

So here you can see the key, value and query projections. Unlike our previous implementation with separate heads, we are using single projection layers this time, and we will later reshape their outputs to handle multiple heads in parallel. This is actually more memory efficient, and it follows the standard Transformer implementations. And again, you can load the weights from Hugging Face exactly only if you implement it the way they do, vectorized.

In the forward pass, the hidden states, which are again the embeddings of the patches, are given to those variables. What we do first is project our input to query, key and value vectors, just like before. Each patch in our image will again emit these three vectors, and they will be used for attention just like before. But now the different part, and this is the important part, is that we are reshaping for multi-head attention here. Instead of having separate heads like before, we are splitting our embedding dimension: imagine we're taking our large embedding vector and splitting it into num_heads pieces. So here we have our batch size, how many patches we have, the number of heads, and the embedding dimension, which was 768, divided by the number of heads. So we have one big vector of query and
key and value, and we divide it between the heads. We take one big vector and divide it, instead of concatenating all those different attention heads together. This is what we do differently, and this allows us to match the Hugging Face implementation and their weights.

After that we have the key and value states, and then we take the matrix multiplication of query and key just like before; this gives us the attention scores of each patch. We again apply softmax, just like before, to get the probability distribution. After that we also have some dropout here. This dropout is for regularization during training: it helps prevent our attention patterns from becoming too rigid, and it also helps us avoid overfitting. After that we have our attention output, our attention weights multiplied with our value states, and this is just the same as before.

Finally, we can reshape everything back to our original dimensions. We are transposing it just like we did in the previous implementation, and we also make sure our tensor is contiguous in memory for efficient computation; that's what we do here. This is important because operations like transpose can make the memory layout very inefficient, and this contiguous call helps us with memory and computation efficiency. We have one last projection to mix the information from all heads together, which lets the model combine the different patterns each of these heads found, and then we return the attention output.

Right now let's test our implementation with a realistic example. We will use typical Vision Transformer dimensions: a batch size of one, so one example, one image; our num patches is the same, and our embedding dimension is the same. So let's create an example: here we create a random batch of embedded patches to test our attention mechanism. So we
set up our vision configuration with the standard Transformer parameters again. We're going to use 12 attention heads, and our hidden size is the same as the embedding dimension from before. Finally, let's run our test input through the attention layer here. We give it our config for the attention, it takes the hidden states that we provide at the beginning of this encoder, and it gives us the output. If everything works, our output should have the same shape as our input. Let's run it. What we can see here is that our input shape was 1 by 196 by 768 and the output shape is the same: our number of patches stayed the same and the embedding dimension didn't change. So that means our implementation was successful.

All right, so far we implemented multi-head attention as well, and now we want to do another sanity check: see whether our model parameters match Hugging Face's parameters, load them, and try to get the same output. So we get the state dictionary from the Hugging Face model, specifically the vision model state dictionary; this is the pre-trained model from Hugging Face. Then we get our own state dictionary, the attention state dictionary for the multi-head attention. Next we create a mapping between our keys and the Hugging Face keys, because we want to load the weights from this address into this key in our implementation, and we want to make sure the weights go into the correct places. That's why we say that our k_proj.weight is going to take the weights from the encoder.layers.0.self_attn.k_proj.weight key in the Hugging Face implementation. This is to make sure the keys match.

Then we just copy the weights over from the Hugging Face state dictionary to our own state dictionary, and we load the state dictionary into our model here. After that we verify that the output matches. Our output is just the output tensor from our module, and we also get the Hugging Face output from Hugging Face's vision model, and we compare the difference. Here we just check whether the results are close enough with this closeness check. It has a parameter called atol, which is an upper bound for the discrepancy: if the difference is lower than this number, we treat it as zero. Let's see if we managed to load the parameters properly. As we can see, the maximum difference between our output and the Hugging Face output is pretty much zero, which means we correctly loaded the weights into our model.

The rest of the encoder is rather easy. We just need to implement the MLP layer, which projects the hidden states to some other space, but it will require us to update the config class, so let me just write it here. What's happening here is that we're creating the class SiglipMLP. If you look at our architecture, we have completed multi-head attention, and after that we have another layer norm and then the MLP, the multi-layer perceptron. To write the MLP layer we have the configuration here: we have the hidden size as 768 and the intermediate size as 3072. Where does this come from? By mapping the parameters to a higher dimension, the model can learn more complex relations. In fact, the representations that this MLP layer works with are the ones we got from multi-head attention and even from the first residuals back there.
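As a rough sketch of an MLP block with these sizes (768 in, 3072 in the middle, 768 out), the following may help. It is an illustration under assumptions rather than the tutorial's own code: the class name `MLP`, the attribute names `fc1` and `fc2`, and the use of PyTorch's default GELU variant are all choices made here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MLP(nn.Module):
    """Two-layer feed-forward block: expand, apply a nonlinearity, project back."""

    def __init__(self, hidden_size: int = 768, intermediate_size: int = 3072):
        super().__init__()
        self.fc1 = nn.Linear(hidden_size, intermediate_size)  # 768 -> 3072
        self.fc2 = nn.Linear(intermediate_size, hidden_size)  # 3072 -> 768

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        hidden_states = self.fc1(hidden_states)
        # Default (exact) GELU is an assumption; a reference implementation
        # may use a different GELU approximation.
        hidden_states = F.gelu(hidden_states)
        return self.fc2(hidden_states)


mlp = MLP()
x = torch.randn(1, 196, 768)  # one image, 196 patches, hidden size 768
y = mlp(x)
print(y.shape)
```

Because the second layer maps back to the hidden size, the output can be added directly to the residual that entered the block.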
So this allows the encoder to learn richer representations. As you can see here, the input to the MLP comes from right after multi-head attention, and it also comes from the very beginning of the encoder, from the residuals here, so this MLP gets both of them as input. For the initialization, as you can see here, we have fully connected layer one: its input is the hidden size, which is 768, and it maps it to the intermediate size, which is 3072, so now we have a richer representation. Then in fully connected layer two we take that intermediate size and map it back to the hidden size, so that our output has the same dimension as our input. In the forward pass we pretty much run the hidden states through fully connected layer one, and then we have our nonlinearity. This GELU nonlinearity allows the model to capture more complex relationships in the data, and that's why we use it here. After that we have fully connected layer two, which maps it back to the original size that we started with at the beginning of the MLP.

So we can just try this one out. We have our vision config and our multi-layer perceptron layer, and we run a random tensor with the embedding dimensions through it, and what we should get is hopefully the same size. As we can see, we indeed get the same size: batch size of 1, patch number of 196, and embedding dimension, or we can also call it hidden size, of 768.

Finally we can implement the whole Vision Transformer, so let me just paste it here. What you can see here is our vision configuration again, and there's a new thing here called layer norm epsilon. The paper implementation of SigLIP uses 1e-6 as the epsilon in the layer normalization, so to match the exact implementation we add this to the config class and define it ourselves; that's the only reason. Apart from that, the rest of the SiglipVisionEmbeddings class
is the same: we have our initialization, our patch embeddings, the calculation of the position embeddings and all the rest. In the forward pass we have the patch embeddings; we flatten them and transpose them so we can add them to the position embeddings, and at the very end we add the patch embeddings and position embeddings together and return. So this is pretty much the embeddings, the green box that we can see: patch embeddings and position embeddings getting added together.

Now we can look into the rest of this, which is the encoder layer. So let me get my camera. All right, we already have the embeddings, and now we will be writing the class for this encoder. What we will do here is paste the code for a single encoder layer. So what happens here? In this one encoder layer, if you look at the architecture, we can see that we have layer norm, multi-head attention, layer norm and MLP, and the code is exactly the same thing: we have our embedding dimension, our self-attention, layer norm 1, the MLP and layer norm 2. After the initialization we have our forward pass. In this forward pass, let me separate it like this, we first save the residual, then we have layer norm 1 and self-attention, and then we add the residual back in. So it's the same as here: residual, layer norm, multi-head attention, and we add the residual back in. If you look at the rest of the forward pass, we again have the residual, layer norm 2 and the MLP, and then we add the residual back in. And if you look at the diagram, the residual comes in from here, we have layer norm 2 and the MLP, and we add the residual back in.

So if we get back to our code and run this encoder layer, we are giving it an image tensor and running it through the encoder layer, and we expect the output to have the same shape. We can try it, and we get an error saying there is no layer norm epsilon. But we do indeed have layer norm epsilon, so that means I
forgot to run this. So we initialize the layer norm epsilon here, and then I'm running this one again, and yes: our input to the model was 1 by 196 by 768, and our output from this encoder layer indeed has the same dimensions.

Now let's take a look at our vision model again and check out the architecture. In our vision model we had the SiglipVisionTransformer, which is the big picture of everything that is happening there. We had our vision embeddings, which was the green box over here, and we had the SigLIP encoder layer, which is one of these blue boxes. It's one encoder layer, and we have 12 of those, but we haven't implemented the rest of these encoder layers, so we need to do that and wrap them up in a class. As for the rest: we have our layer norm, our GELU nonlinearity, fully connected layer one and fully connected layer two, our second layer normalization, and the last one, the post layer normalization. So the post layer normalization and the encoder layers are the two things that we need to do right now.

We already implemented one of these encoder layers, and what is left is implementing all 12. If we look back at our architecture, we created one encoder layer but there are 12 of them, so we can just create a class and have all these encoder layers under one class. The class that we just wrote, called SiglipEncoderLayer, was meant for one of those encoder layers. Since we have 12 of them, we can create a new class, call it SiglipEncoder, and this class encapsulates all 12 layers at once. In the configuration we have a new parameter, num hidden layers, and it's set to 12; 12 is the number of layers that we have in this encoder, so that makes sense. Let me show myself again. Yeah. Here we have the class, its initialization takes the configuration, and we have the ModuleList. So we have all these 12
layers, and we initialize them inside the ModuleList; that's pretty much it. In the forward pass we are just running a for loop over each layer in self.layers, and there are 12 layers. What's happening is that we input the hidden states, take the output from that layer and feed it to the next layer. This way we are running the image embedding through 12 different encoder layers. We can take a look here: we have our encoder including all 12 layers, and we input a random image embedding tensor. We expect the output shape to be the same as the initial shape, so that we know our encoder is working properly, and when we run it we can see that our shape is indeed still the same.

Now that we have the class for the whole encoder, let's create the Vision Transformer class and put everything together. Our SiglipVisionTransformer class here takes the configuration, and in the initialization we have the vision embeddings class, the encoder and the post layer norm. In the forward pass we just give it pixel values, it turns them into embeddings, and then it feeds them into the encoder. After that we have the post layer norm, and it returns the embeddings from right after the post layer norm. If you look at the architecture that we drew, we have the embeddings, then the layer norms and the rest, which is pretty much the encoder, and then the post layer norm. So here we have combined all these separate classes that we wrote into a bigger class that encapsulates everything: embeddings, encoder and post layer norm. The output, the last hidden state, is the image embedding that includes all the information and all the meaning from that image, with a vector for each patch.

So let's take a look at it. Here we have our SigLIP model built from our SigLIP vision config, and it's the SigLIP Vision
Transformer that we just created; it's an instance of this class. We give it an image tensor, and we expect it to give us the embedding that encapsulates the meaning of that image. When we run this, we get the shape of the output as 1 by 196 by 768, and this is what we expected. This shows that our model is doing the calculations in the desired dimensions.

So we're pretty much done: we have our Vision Transformer. You can use this Vision Transformer as an image classifier, giving it a lot of images and classifying them into some text classes; you can use it as the vision head for a language model and turn it into a vision language model; or you can use it for some other vision and image-text tasks.

The last thing we're going to do: in the Hugging Face implementation there's another class called SiglipVisionModel, which wraps the Transformer in another class to make it even simpler to use. What happens here is that we just have the SiglipVisionTransformer in our initialization, and in the forward pass we run the pixel values through the vision model, and it returns the image embedding with the information and the meaning that it captures, after running it through the embeddings, encoder and post layer norm. As you can see, we have our SigLIP model and our image tensor, we run it through, and we expect the same size as output, and yes, we can see that it is.

So right now we can see that, from a dimension perspective, our calculations are correct. But we also want to check that our keys are actually correct as well. So what we're going to do is get the state dictionary from the Hugging Face implementation of SigLIP and the state dictionary from our implementation of SigLIP, and we're going to
check whether our keys match correctly and whether we are loading the weights properly. We can just run it like this, and we see that all keys match successfully, so that was a successful implementation of the SigLIP model, and this is pretty much it.

So just to recap everything: using a Vision Transformer, we took an image and embedded it into a vector that captures the information from that image. We can use a Vision Transformer for tasks ranging from image classification to turning a language model into a vision language model by adding visual understanding. I want to thank alar Bai for helping me create this tutorial, and I hope you enjoyed it. Thank you.
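To tie the pieces of the walkthrough together, here is a compact sketch of the vectorized attention, a pre-norm encoder layer with two residual connections, a stack of such layers, and the final layer norm. It rests on several assumptions and is not the tutorial's notebook code: all class names are hypothetical stand-ins, the patch and position embeddings are left out (the input is treated as already-embedded patches), the post layer norm is folded into the stack class for brevity, PyTorch's default GELU is used, and the demo builds only 2 layers instead of 12 so that it runs quickly.

```python
from dataclasses import dataclass

import torch
import torch.nn as nn
import torch.nn.functional as F


@dataclass
class VisionConfig:  # hypothetical stand-in for the tutorial's config class
    hidden_size: int = 768
    intermediate_size: int = 3072
    num_attention_heads: int = 12
    num_hidden_layers: int = 12
    layer_norm_eps: float = 1e-6
    attention_dropout: float = 0.0


class Attention(nn.Module):
    """Vectorized multi-head attention: one projection each for q, k and v."""

    def __init__(self, cfg: VisionConfig):
        super().__init__()
        self.num_heads = cfg.num_attention_heads
        self.head_dim = cfg.hidden_size // cfg.num_attention_heads
        self.dropout = cfg.attention_dropout
        self.q_proj = nn.Linear(cfg.hidden_size, cfg.hidden_size)
        self.k_proj = nn.Linear(cfg.hidden_size, cfg.hidden_size)
        self.v_proj = nn.Linear(cfg.hidden_size, cfg.hidden_size)
        self.out_proj = nn.Linear(cfg.hidden_size, cfg.hidden_size)

    def _split_heads(self, t: torch.Tensor) -> torch.Tensor:
        # (B, N, C) -> (B, num_heads, N, head_dim)
        b, n, _ = t.shape
        return t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, c = x.shape
        q = self._split_heads(self.q_proj(x))
        k = self._split_heads(self.k_proj(x))
        v = self._split_heads(self.v_proj(x))
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5  # (B, H, N, N)
        weights = F.softmax(scores, dim=-1)
        weights = F.dropout(weights, p=self.dropout, training=self.training)
        out = weights @ v  # (B, H, N, head_dim)
        # Merge heads back; transpose breaks contiguity, so restore it before view.
        out = out.transpose(1, 2).contiguous().view(b, n, c)
        return self.out_proj(out)


class EncoderLayer(nn.Module):
    """Layer norm -> attention -> residual, then layer norm -> MLP -> residual."""

    def __init__(self, cfg: VisionConfig):
        super().__init__()
        self.layer_norm1 = nn.LayerNorm(cfg.hidden_size, eps=cfg.layer_norm_eps)
        self.self_attn = Attention(cfg)
        self.layer_norm2 = nn.LayerNorm(cfg.hidden_size, eps=cfg.layer_norm_eps)
        self.fc1 = nn.Linear(cfg.hidden_size, cfg.intermediate_size)
        self.fc2 = nn.Linear(cfg.intermediate_size, cfg.hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.self_attn(self.layer_norm1(x))  # residual around attention
        x = x + self.fc2(F.gelu(self.fc1(self.layer_norm2(x))))  # residual around MLP
        return x


class EncoderStack(nn.Module):
    """A ModuleList of encoder layers followed by a final (post) layer norm."""

    def __init__(self, cfg: VisionConfig):
        super().__init__()
        self.layers = nn.ModuleList(
            [EncoderLayer(cfg) for _ in range(cfg.num_hidden_layers)]
        )
        self.post_layernorm = nn.LayerNorm(cfg.hidden_size, eps=cfg.layer_norm_eps)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:  # each layer's output feeds the next layer
            x = layer(x)
        return self.post_layernorm(x)


cfg = VisionConfig(num_hidden_layers=2)  # 2 layers only to keep the demo fast
patches = torch.randn(1, 196, cfg.hidden_size)  # already-embedded patches
model = EncoderStack(cfg).eval()
with torch.no_grad():
    out = model(patches)
print(out.shape)
```

Matching parameter names and shapes to a reference implementation matters only if you intend to load its pretrained weights; for a shape check like this one, any consistent naming works.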
Original Description
Vision Transformers (ViTs) are reshaping computer vision by bringing the power of self-attention to image processing. In this tutorial you will learn how to build a Vision Transformer from scratch. By the end of the course, you'll have a deeper understanding of how AI models process visual data.
Course developed by @tungabayrak9765.
💻 Code: https://colab.research.google.com/drive/1Q6bfCG5UZ7ypBWft9auptcD4Pz5zQQQb?usp=sharing#scrollTo=1EaWO-aNOk3v
❤️ Try interactive Python courses we love, right in your browser: https://scrimba.com/freeCodeCamp-Python (Made possible by a grant from our friends at Scrimba)
⭐️ Contents ⭐️
(0:00:00) Intro to Vision Transformer
(0:03:48) CLIP Model
(0:08:16) SigLIP vs CLIP
(0:12:09) Image Preprocessing
(0:15:32) Patch Embeddings
(0:20:48) Position Embeddings
(0:23:51) Embeddings Visualization
(0:26:11) Embeddings Implementation
(0:32:03) Multi-Head Attention
(0:46:19) MLP Layers
(0:49:18) Assembling the Full Vision Transformer
(0:59:36) Recap
🎉 Thanks to our Champion and Sponsor supporters:
👾 Drake Milly
👾 Ulises Moralez
👾 Goddard Tan
👾 David MG
👾 Matthew Springman
👾 Claudio
👾 Oscar R.
👾 jedi-or-sith
👾 Nattira Maneerat
👾 Justin Hual
--
Learn to code for free and get a developer job: https://www.freecodecamp.org
Read hundreds of articles on programming: https://freecodecamp.org/news