CLIP - Explained!
Key Takeaways
The video explains CLIP (Contrastive Language-Image Pretraining), its purpose, and implementation, with accompanying code examples.
Full Transcript
Greetings, fellow learners. In this video we are going to talk about CLIP: the what, the why, and the how. CLIP stands for Contrastive Language-Image Pre-training. Contrastive learning is the technique with which the network learns, it operates on text as well as images, and pre-training is how this architecture is trained. So that's CLIP. If you want a more high-level definition, I would phrase it like this: CLIP is a neural network that jointly trains an image encoder and a text encoder to map their respective modalities, image or text, into the same embedding space.
Now, let's understand how and why we even do this. For training CLIP, let's say that we have a dataset of images along with free text that we found on the internet. There are typically around 400 million of these pairs, because they're quite easy to come by. Now let's create batches of size n, which could be a few thousand examples, and then we train CLIP with this pipeline over here. So let's zoom in. One thing to note here is that CLIP is composed of an image encoder, which could be a vision transformer or a convolutional network like a ResNet, and a text encoder like GPT. The image encoder takes an image and encodes it into a 512-dimensional vector. The text encoder takes a piece of text and encodes it into a 512-dimensional vector as well.
Now, the goal of this entire pipeline is to ensure that if an image and a piece of text correspond to each other semantically, then their vectors should be as close to each other as possible. And if they don't correspond to each other, then the image vector and the text vector should be as far away from each other as possible. This is the contrastive learning approach. So we basically take all of our images and encode them into 512-dimensional vectors, and do the same for all the text. We then perform L2 normalization, which is a pre-processing step for cosine similarity: once every vector has unit length, the cosine similarity of two vectors is just their dot product.
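The normalization and similarity steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the random arrays stand in for encoder outputs (no real CLIP encoder is loaded), and the helper names `l2_normalize` and `cosine_similarity_matrix` are hypothetical, not taken from the video's notebook.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale each vector along `axis` to unit L2 norm."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def cosine_similarity_matrix(image_emb, text_emb):
    """Return the n x n matrix of cosine similarities (images as rows, texts as columns)."""
    # After L2 normalization, a plain dot product equals the cosine similarity.
    return l2_normalize(image_emb) @ l2_normalize(text_emb).T

rng = np.random.default_rng(0)
n, d = 4, 512                        # toy batch size; 512 is the embedding size from the video
image_emb = rng.normal(size=(n, d))  # assumption: stand-in for image encoder outputs
text_emb = rng.normal(size=(n, d))   # assumption: stand-in for text encoder outputs
sims = cosine_similarity_matrix(image_emb, text_emb)
```

Entry `sims[i, j]` compares image i with text j, so the matching pairs sit on the diagonal.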
So we'll now take the cosine similarity between the n images and the n texts in order to get an n × n matrix of values that lie between negative 1 and positive 1. What we want to do next is transform this matrix into probabilities. One intermediate step is to compute logits. This involves multiplying each of these values by e to the power of a temperature parameter t, which is a learned parameter. That gives us a bunch of values that can lie anywhere between negative infinity and positive infinity. From here we compute two different functions. One takes a softmax across each image, creating a probability distribution for each image. That is, every row becomes a probability distribution that sums to one, and you get this matrix. For the softmax along the text, we do the exact orthogonal operation, where we take the softmax across every single column to get a probability distribution for every single text example. So we end up with two matrices of probabilities.
We also have a ground truth here, because we know exactly which image corresponded to which piece of text. So we have predictions and we have ground truth, and we can compute a loss, which we do with a cross-entropy loss. This is the full architecture, and it learns through backpropagation of errors. Through backpropagation the temperature parameter is learned over time, and the parameters within each of these encoders are also learned over time, since they were initialized from scratch.
Next, let's talk about inference. Let's say that we have this random example of an image. We pass it into our trained image encoder to get a 512-dimensional vector. We then normalize it, which still gives a 512-dimensional vector, and that can later be passed into a cosine similarity.
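Stepping back to the training objective described above, the logits, the row-wise and column-wise softmax, and the cross-entropy loss can be sketched as follows. The embeddings are random stand-ins for encoder outputs, the function name `clip_loss` is hypothetical, and averaging the image-side and text-side losses is an assumption that follows the pseudocode in the CLIP paper rather than anything stated in the video.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_loss(image_emb, text_emb, t):
    """Symmetric cross-entropy over an n x n similarity matrix.

    Row i of image_emb and row i of text_emb are assumed to be a matching pair,
    so the ground-truth label for row (or column) i is index i.
    """
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (image_emb @ text_emb.T) * np.exp(t)  # scale cosine similarities by e^t
    idx = np.arange(logits.shape[0])
    loss_images = -log_softmax(logits, axis=1)[idx, idx].mean()  # softmax over each row
    loss_texts = -log_softmax(logits, axis=0)[idx, idx].mean()   # softmax over each column
    return (loss_images + loss_texts) / 2

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 512))
t = np.log(1 / 0.07)  # assumption: a common starting value; in CLIP t is learned
aligned = clip_loss(emb, emb, t)                       # every pair matches
shuffled = clip_loss(emb, np.roll(emb, 1, axis=0), t)  # every pair is mismatched
```

In this toy setup the aligned batch gives a much smaller loss than the shuffled one, which is the signal that pushes matching pairs together and mismatched pairs apart during training.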
Now, at the text end, let's say that during inference we are trying to perform a classification task, and let's say that there are 10 classes, like antelope, zebra, car, and so on. What we do first is convert each of these classes into natural language prompts, for example "a photo of a car", "a photo of an antelope", "a photo of a zebra". We encode each of these with our trained text encoder to get a 10 × 512 matrix. We normalize the rows and then compute the similarity scores between the image vector and each of the text vectors. So you get 10 values, and these lie between negative 1 and positive 1. We now want to compute probabilities, so we scale these values into logits and then apply a softmax operation to ensure this is a probability distribution. The result could look something like this: "a photo of an antelope" could have a 97% chance of being the correct label, "a photo of a zebra" 2%, and "a photo of a car" 0%. So we were able to perform a classification task on our unseen dataset, and it did it pretty well.
Why is this zero-shot? Because the image encoder and the text encoder were not trained on our downstream classification dataset. The model didn't see any examples from it during training, and it gets no labeled examples at inference either, hence it is zero-shot inference.
So now let's try to understand why we even need CLIP. Well, for one, task-specific labels can be pretty hard to come by. These are the gold-standard labels for every single image. A good example is ImageNet, which is supposed to be this very large ontology and itself only has about a million labeled images. Whereas what we trained on was natural language text that looked kind of like this.
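The zero-shot classification procedure described above can be sketched like this. Everything here is a toy under stated assumptions: the text and image vectors are random stand-ins rather than outputs of a trained CLIP encoder, the helpers `make_prompt` and `zero_shot_probs` are hypothetical names, and the logit scale of 100 is an illustrative value, not one given in the video.

```python
import numpy as np

def make_prompt(class_name):
    """Turn a class name into a prompt such as 'a photo of an antelope'."""
    article = "an" if class_name[0].lower() in "aeiou" else "a"
    return f"a photo of {article} {class_name}"

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def zero_shot_probs(image_vec, text_mat, scale):
    """Cosine similarity of one image vector to each text vector, then softmax."""
    image_vec = image_vec / np.linalg.norm(image_vec)
    text_mat = text_mat / np.linalg.norm(text_mat, axis=1, keepdims=True)
    sims = text_mat @ image_vec   # one value in [-1, 1] per class prompt
    return softmax(scale * sims)  # scaled similarities act as logits

classes = ["antelope", "zebra", "car"]
prompts = [make_prompt(c) for c in classes]

rng = np.random.default_rng(0)
# Assumption: random vectors stand in for the text encoder applied to `prompts`.
text_mat = rng.normal(size=(len(classes), 512))
# Assumption: a toy "image" embedding placed near the antelope prompt vector.
image_vec = text_mat[0] + 0.5 * rng.normal(size=512)

probs = zero_shot_probs(image_vec, text_mat, scale=100.0)
predicted = classes[int(np.argmax(probs))]
```

No classifier is trained here: changing the task only means changing the list of class names and re-encoding the prompts.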
CLIP was trained on around 400 million such examples because they were far easier to come by. Now, another reason why we would use CLIP is that natural language supervision allows the image encoder to create rich vectors that better encode the meaning of the image. So, let's take a look at some code to understand exactly how this is the case.
I have some code in this Colab notebook which takes two input images. One is a picture of me, and the other is just me with a hat. Let's walk through the code. First I load a CLIP model. This is effectively the image and text model that we saw previously. We encode and normalize the input images, which gives a 512-dimensional vector in each case. Now I take the difference between these image vectors to get a delta vector, and normalize that as well. Then I compare this delta vector to four word vectors that represent hat, cup, cat, and boat. So I've created those text vectors, normalized them, and I'm computing a cosine similarity between each of these four text vectors and the delta vector. I also convert the similarities into probabilities.
On doing so, interestingly enough, we can see that the difference between the two images comes out as about 60% hat, 16% boat, and very small probabilities for cup and cat. Semantically, the difference between the images was just me wearing a hat versus not wearing a hat, so the difference is the hat itself, and that's also reflected in the vectors themselves. So what we can see here is that natural language supervision allowed the image encoder to create rich vectors that better encode the meaning of an image.
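The arithmetic of that demo can be reproduced on synthetic data. This sketch does not load CLIP or the two photos: the vectors are random stand-ins, the "hat" image is built by hand by adding the hat word direction to the plain image, and the scale of 100 is an illustrative value. It shows only the sequence of operations (normalize, subtract, normalize, cosine similarity, softmax), not the numbers reported in the video.

```python
import numpy as np

def normalize(x):
    """Scale vectors along the last axis to unit L2 norm."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

rng = np.random.default_rng(0)
words = ["hat", "cup", "cat", "boat"]
# Assumption: random unit vectors stand in for the four text embeddings.
word_vecs = normalize(rng.normal(size=(len(words), 512)))

# Assumption: synthetic image embeddings; the second one is the first plus
# a hand-added component along the "hat" direction.
person = rng.normal(size=512)
image_plain = normalize(person)
image_hat = normalize(person + 8.0 * word_vecs[0])

delta = normalize(image_hat - image_plain)  # what changed between the two images
sims = word_vecs @ delta                    # cosine similarity to each word vector
probs = softmax(100.0 * sims)               # convert similarities to probabilities
best = words[int(np.argmax(probs))]
```

With real CLIP embeddings the hat direction is not added by hand, of course; the point of the demo in the video is that the trained encoders produce vectors where this kind of difference still lines up with the right word.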
So I hope that's a little bit more clear with this example. Now, as far as performance is concerned, during inference zero-shot CLIP actually outperforms other networks, like convolutional or transformer networks, which were trained with gold labels. The one caveat, interestingly enough, is that if you have a larger number of training examples per class, linear probe CLIP performs even better. So let's talk about that really quick.
Linear probing is a method to evaluate the visual representations of the CLIP encoder, and it involves training a linear model, the probe, on top of the frozen CLIP encoder. What that really means is this: let's say that we now have a dataset for a downstream classification task with actual labels. Here we have 10 classes and a labeled dataset. What we can do is take the image, pass it into our trained image encoder to generate a 512-dimensional vector, and add a linear fully connected layer as our probe to get a prediction. We train the network in this way, and by training with backpropagation we update the weights of the probe but keep all of the encoder weights frozen. So it's only this layer that's actually going to be updated. Once trained, the probe can help us understand how well the CLIP encoder performs on a new dataset. And as we saw, going back to our performance plot, if we have enough examples it can even improve on the performance of zero-shot CLIP.
So I hope all of this makes sense. Quiz time. Have you been paying attention? Let's quiz you to find out. Which of the following is true about CLIP? A. Image and text are embedded in the same embedding space. B. CLIP uses images plus free text during its training. C. CLIP learns via contrastive learning. D. CLIP's image encoder is usually a convolutional or a transformer architecture.
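The linear probe idea above can be sketched as a softmax classifier trained on fixed feature vectors. Assumptions are labeled in the code: the features are synthetic stand-ins for frozen CLIP image embeddings, three classes are used instead of ten to keep the toy small, and the names `train_linear_probe` and `predict` as well as the learning rate and step count are illustrative choices.

```python
import numpy as np

def train_linear_probe(features, labels, num_classes, lr=0.5, steps=200):
    """Fit one linear layer with softmax cross-entropy by gradient descent.

    `features` are treated as constants, which mirrors keeping the encoder frozen:
    only the probe's weights W and bias b are updated.
    """
    n, d = features.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = features @ W + b
        logits = logits - logits.max(axis=1, keepdims=True)
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (features.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b

def predict(features, W, b):
    return np.argmax(features @ W + b, axis=1)

rng = np.random.default_rng(0)
num_classes, d = 3, 512
# Assumption: synthetic class clusters stand in for frozen CLIP image embeddings.
centers = rng.normal(size=(num_classes, d))
labels = rng.integers(0, num_classes, size=150)
features = centers[labels] + 0.5 * rng.normal(size=(150, d))
features /= np.linalg.norm(features, axis=1, keepdims=True)

W, b = train_linear_probe(features, labels, num_classes)
accuracy = (predict(features, W, b) == labels).mean()
```

Because the encoder never receives gradients, the probe's accuracy reflects how linearly separable the classes already are in the embedding space, which is why it works as an evaluation of the representations.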
Note that multiple options may be correct, and I'll give you a few seconds to answer this question. The correct options are all of them. Did you get them right? Please share your reasoning in the comments below and let's have a discussion. And at this point, if you think I deserve it, please do consider giving this video a like, because it will help me out a lot.
Now, that's going to do it for quiz time. But before we go, let's generate a summary. We looked at CLIP, which is Contrastive Language-Image Pre-training. It is essentially a neural network that jointly trains an image encoder and a text encoder to map their respective modalities to the same embedding space. We saw exactly how we can train the image encoder and text encoder with this contrastive learning technique, and we also saw how we can perform zero-shot inference. We then took a look at reasons why CLIP exists: task-specific labels are hard to come by, whereas natural language ones are much easier to come by, and natural language supervision allows the image encoder to learn rich representations. We also looked at some code that could help us understand this. Then we took a look at performance and how zero-shot CLIP can perform better than convolutional and transformer-based architectures trained on labels. And we concluded our discussion by looking a little bit at linear probe CLIP as well.
That's all that we have for today. I'm going to leave some resources in the description below, along with links to all the slides and the paper. Thank you all so much for watching, and I will see you in the next one. Bye-bye.
Original Description
In this video, we take a look at CLIP (contrastive language image pretraining). What is it? Why do we have it? How does it look? And some code!
ABOUT ME
⭕ Subscribe: https://www.youtube.com/c/CodeEmporium?sub_confirmation=1
📚 Medium Blog: https://medium.com/@dataemporium
💻 Github: https://github.com/ajhalthor
👔 LinkedIn: https://www.linkedin.com/in/ajay-halthor-477974bb/
RESOURCES
[1 📚] Main Paper: https://openai.com/index/clip/
[2 📚] Slides: https://link.excalidraw.com/p/readonly/STU1Z0GcInkQNvA8naKM
[3 📚] Code: https://github.com/ajhalthor/computer-vision-101/tree/main/CLIP
PLAYLISTS FROM MY CHANNEL
⭕ Reinforcement Learning: https://youtube.com/playlist?list=PLTl9hO2Oobd9kS--NgVz0EPNyEmygV1Ha&si=AuThDZJwG19cgTA8
Natural Language Processing: https://youtube.com/playlist?list=PLTl9hO2Oobd_bzXUpzKMKA3liq2kj6LfE&si=LsVy8RDPu8jeO-cc
⭕ Transformers from Scratch: https://youtube.com/playlist?list=PLTl9hO2Oobd_bzXUpzKMKA3liq2kj6LfE
⭕ ChatGPT Playlist: https://youtube.com/playlist?list=PLTl9hO2Oobd9coYT6XsTraTBo4pL1j4HJ
⭕ Convolutional Neural Networks: https://youtube.com/playlist?list=PLTl9hO2Oobd9U0XHz62Lw6EgIMkQpfz74
⭕ The Math You Should Know : https://youtube.com/playlist?list=PLTl9hO2Oobd-_5sGLnbgE8Poer1Xjzz4h
⭕ Probability Theory for Machine Learning: https://youtube.com/playlist?list=PLTl9hO2Oobd9bPcq0fj91Jgk_-h1H_W3V
⭕ Coding Machine Learning: https://youtube.com/playlist?list=PLTl9hO2Oobd82vcsOnvCNzxrZOlrz3RiD
MATH COURSES (7 day free trial)
📕 Mathematics for Machine Learning: https://imp.i384100.net/MathML
📕 Calculus: https://imp.i384100.net/Calculus
📕 Statistics for Data Science: https://imp.i384100.net/AdvancedStatistics
📕 Bayesian Statistics: https://imp.i384100.net/BayesianStatistics
📕 Linear Algebra: https://imp.i384100.net/LinearAlgebra
📕 Probability: https://imp.i384100.net/Probability
OTHER RELATED COURSES (7 day free trial)
📕 ⭐ Deep Learning Specialization: https://imp.i384100.net/Deep-Learning
📕 Python for Everybody: https://imp.i