CLIP - Explained!

CodeEmporium · Advanced · 📐 ML Fundamentals · 4mo ago

Key Takeaways

The video explains CLIP (Contrastive Language-Image Pretraining), its purpose, and implementation, with accompanying code examples.

Full Transcript

Greetings, fellow learners. In this video we are going to talk about CLIP: the what, the why, and the how. CLIP stands for Contrastive Language-Image Pre-training. Contrastive learning is the technique with which the network learns, it operates on text as well as images, and it's used for pre-training this architecture. So that's CLIP. If you want a more high-level definition, I would phrase it like this: CLIP is a neural network that jointly trains an image encoder and a text encoder to map the respective modalities, that is image or text, into the same embedding space.

Now, let's understand how and why we even do this. For training CLIP, let's say that we have a data set of images along with free text that we found on the internet. There's typically like 440 million of these because they're quite easy to come by. Now let's create batches of size n, which could be a few thousand examples, and then we train CLIP with this pipeline over here. So let's zoom in.

One thing to note here is that CLIP consists of an image encoder, which could be a vision transformer or a convolutional network like a ResNet, and a text encoder like GPT. The image encoder takes an image and encodes it into a 512-dimensional vector. The text encoder takes a piece of text and encodes it into a 512-dimensional vector. The goal of this entire pipeline is to ensure that if an image and a text correspond to each other semantically, then their vectors should be as close to each other as possible. And if they don't correspond to each other, then the image vector and the text vector should be as far away from each other as possible. This is the contrastive learning approach.

So we take all of our images and encode them into 512-dimensional vectors, and do the same for all the text. We then perform L2 normalization, which is a pre-processing step for cosine similarity: once the vectors have unit length, their cosine similarity is simply their dot product.
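As a minimal sketch of the normalization and similarity step described above, the following uses plain Python so that it has no dependencies. The function names and the toy 3-dimensional vectors are assumptions made for illustration; real CLIP embeddings would be 512-dimensional outputs of the two encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector so that its L2 norm (Euclidean length) is 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity_matrix(image_vecs, text_vecs):
    """Return the n x n matrix of cosine similarities (rows: images, columns: texts).

    After L2 normalization, cosine similarity reduces to a dot product.
    """
    images = [l2_normalize(v) for v in image_vecs]
    texts = [l2_normalize(v) for v in text_vecs]
    return [
        [sum(a * b for a, b in zip(img, txt)) for txt in texts]
        for img in images
    ]


# Toy stand-ins for encoder outputs (hypothetical values, not real embeddings).
image_vecs = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
text_vecs = [[3.0, 0.0, 0.0], [0.0, -1.0, 0.0]]

sims = cosine_similarity_matrix(image_vecs, text_vecs)
for row in sims:
    print(row)
```

Every entry of `sims` lies between -1 and 1 regardless of the original vector lengths, which is why the normalization is done before comparing images with text.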
So we'll now take the cosine similarity between the n images and the n texts in order to get an n × n matrix of values that lie between negative 1 and positive 1. Now what we want to do is transform this matrix into probabilities. One step within this is to compute logits. This involves multiplying each of these values by e to the power of some temperature parameter t, which is a learned parameter. We then get a bunch of values that lie between negative infinity and positive infinity.

From here we compute two different functions. One takes a softmax across each image: for every row we get a probability distribution, so each row sums to one, and you get this matrix. For the softmax along the text, we do the orthogonal operation, where we take the softmax across every single column to get a probability distribution for every single text example. So we end up with two matrices of probabilities. We also have a ground truth here, because we know exactly which image corresponded to which piece of text. So we have predictions and we have ground truth, and so we can compute a loss, which we do with a cross-entropy loss.

This is the full architecture, and it learns through backpropagation of errors. Through backpropagation, the temperature parameter is learned over time, and the parameters within each of the encoders are also learned over time, as these were initialized from scratch.

Next let's talk about inference. Let's say that we have this random example of an image. We pass it into our trained image encoder to get a 512-dimensional vector. We then normalize this, which still gives a 512-dimensional vector, so that it can later be passed into a cosine similarity.
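The training objective described above (scale the similarities by e to the power of t, take a softmax over rows and over columns, then apply cross-entropy against the known pairing) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the name `clip_loss` is made up here, the i-th image is assumed to pair with the i-th text, and averaging the two cross-entropy terms follows the pseudocode in the CLIP paper rather than anything said in the video.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of numbers."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def clip_loss(sims, t):
    """Symmetric cross-entropy loss over an n x n cosine-similarity matrix.

    sims[i][j] is the similarity between image i and text j.
    t is the learned temperature parameter; logits are sims * e**t.
    """
    n = len(sims)
    scale = math.exp(t)
    logits = [[s * scale for s in row] for row in sims]

    # Softmax across each row: for every image, a distribution over texts.
    row_probs = [softmax(row) for row in logits]
    # Softmax across each column: for every text, a distribution over images.
    col_probs = [softmax([logits[i][j] for i in range(n)]) for j in range(n)]

    # Ground truth is the diagonal: image i belongs with text i.
    loss_images = -sum(math.log(row_probs[i][i]) for i in range(n)) / n
    loss_texts = -sum(math.log(col_probs[j][j]) for j in range(n)) / n
    return (loss_images + loss_texts) / 2


t = math.log(10.0)  # hypothetical temperature value, chosen for illustration
matched = [[1.0, -1.0], [-1.0, 1.0]]     # correct pairs are most similar
mismatched = [[-1.0, 1.0], [1.0, -1.0]]  # correct pairs are least similar

loss_matched = clip_loss(matched, t)
loss_mismatched = clip_loss(mismatched, t)
print(loss_matched, loss_mismatched)
```

The loss is small when the diagonal entries dominate their rows and columns and large when they do not, which is the signal that pushes matching image and text vectors together during training.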
Now at the text end, let's say that during inference we are trying to perform a classification task with 10 classes, the classes being antelope, zebra, car, and the sort. What we'll do is first convert each of these classes into natural language prompts, for example "a photo of a car", "a photo of an antelope", "a photo of a zebra". We encode each of these with our trained text encoder to get a 10 × 512 matrix. We normalize them and then compute the similarity scores between the image vector and each of the text vectors. So you get 10 values, and these lie between negative 1 and positive 1. We now want to compute probabilities, so we compute the logits and then apply a softmax operation to ensure this is a probability distribution. The result could look something like this: "a photo of an antelope" could have a 97% chance of being the correct label, then "a photo of a zebra" 2%, "a photo of a car" 0%. So we were able to perform a classification task on our unseen data set, and it did it pretty well.

Why is this zero-shot? Because the image encoder and the text encoder were not trained on our downstream classification data set. The model didn't see any labelled examples from it during training, nor does it need any during inference, and hence it is zero-shot inference.

So now let's try to understand why we even need CLIP. For one, task-specific labels can be pretty hard to come by. These are the gold-standard labels for every single image. A good example is ImageNet, which is supposed to be this very large ontology, yet itself only has like a million images and text labels. Whereas what we trained on was natural language text that looked kind of like this.
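The zero-shot classification procedure described earlier in the transcript can be sketched as follows. Since a trained CLIP model is not available in a dependency-free snippet, `toy_text_encoder` and all the embedding values are hypothetical stand-ins, and the `logit_scale` of 100 is an assumed value for e to the power of t; only the overall flow (prompt, encode, normalize, similarity, softmax) mirrors the description.

```python
import math


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def make_prompt(label):
    """Turn a class name into a natural language prompt."""
    article = "an" if label[0].lower() in "aeiou" else "a"
    return f"a photo of {article} {label}"


# Hypothetical stand-in for the trained text encoder: hand-picked toy vectors.
TOY_TEXT_EMBEDDINGS = {
    "a photo of an antelope": [0.9, 0.1, 0.2],
    "a photo of a zebra": [0.6, 0.7, 0.1],
    "a photo of a car": [0.0, 0.2, 0.9],
}


def toy_text_encoder(prompt):
    return TOY_TEXT_EMBEDDINGS[prompt]


def zero_shot_classify(image_vec, labels, logit_scale=100.0):
    """Return a dict mapping each label to its probability for this image."""
    image = l2_normalize(image_vec)
    texts = [l2_normalize(toy_text_encoder(make_prompt(lbl))) for lbl in labels]
    sims = [sum(a * b for a, b in zip(image, txt)) for txt in texts]
    probs = softmax([logit_scale * s for s in sims])
    return dict(zip(labels, probs))


image_vec = [1.0, 0.1, 0.2]  # toy output of the image encoder (made up)
probs = zero_shot_classify(image_vec, ["antelope", "zebra", "car"])
best_label = max(probs, key=probs.get)
print(best_label, probs)
```

No classifier is trained on the downstream classes here: changing the task only means changing the list of labels that get turned into prompts.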
CLIP was trained on 440 million such examples because they were far easier to come by. Another reason why we would use CLIP is that natural language supervision allows the image encoder to create rich vectors that better encode the meaning of the image.

So, let's take a look at some code to understand exactly how this is the case. I have some code in this Colab notebook which takes an input image, a picture of me, and another image, which is just me with a hat. Let's walk through the code. First I load a CLIP model; this is effectively the image and text model that we saw previously. We encode the input images and normalize the result, which gives a 512-dimensional vector in each case. Then I take the difference between these image vectors to get this delta vector, and normalize that vector too. So we have this vector stored over here.

Next I compare this image vector to word vectors: four word vectors that represent hat, cup, cat, and boat. I've created those text vectors over here, normalized them, and I'm computing a cosine similarity between each of these four text vectors and the image vector. I also convert the similarities into probabilities. On doing so, interestingly enough, we can see that the difference between the two images is 60% hat and 16% boat, with very small probabilities for cup and cat. Semantically, the difference between the images was just me wearing a hat versus not wearing a hat, so the difference is just the hat itself, and that's also reflected in the vectors themselves. So what we can see here is that natural language supervision allowed the image encoder to create rich vectors that better encode the meaning of an image.
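The notebook itself is not reproduced in this transcript, so the following is only a sketch of the arithmetic in the hat experiment, with made-up 4-dimensional vectors standing in for real CLIP embeddings and an assumed logit scale of 10. It shows the steps named above: normalize the two image vectors, subtract, normalize the delta, then compare it with normalized word vectors.

```python
import math


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical embeddings; a real run would take these from a CLIP model.
image_without_hat = [0.9, 0.1, 0.0, 0.2]
image_with_hat = [0.9, 0.1, 0.8, 0.2]
word_vectors = {
    "hat": [0.1, 0.0, 0.9, 0.1],
    "cup": [0.2, 0.9, 0.1, 0.0],
    "cat": [0.9, 0.2, 0.0, 0.1],
    "boat": [0.0, 0.1, 0.3, 0.9],
}

# Normalize both image vectors, take their difference, normalize the delta.
with_hat = l2_normalize(image_with_hat)
without_hat = l2_normalize(image_without_hat)
delta = l2_normalize([a - b for a, b in zip(with_hat, without_hat)])

# Cosine similarity between the delta vector and each normalized word vector.
words = list(word_vectors)
sims = [
    sum(d * w for d, w in zip(delta, l2_normalize(word_vectors[word])))
    for word in words
]

# Convert similarities to probabilities (the scale of 10 is an assumption).
probs = dict(zip(words, softmax([10.0 * s for s in sims])))
best_word = max(probs, key=probs.get)
print(best_word, probs)
```

With these toy values the delta vector points mostly along the direction shared with the "hat" word vector, which is the same qualitative effect the video reports for the real embeddings.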
So I hope that's a little bit more clear with this example. Now, as far as performance is concerned, during inference zero-shot CLIP actually outperforms other networks, like convolutional or transformer networks, which were trained with gold-standard labels. The only thing here, interestingly enough, is that if you have a larger number of training examples per class, you have this linear probe CLIP that actually performs pretty well. So let's talk about that really quick.

Linear probing is a method to evaluate the visual representations of the CLIP encoder. It involves training a linear model, which is the probe, on top of the frozen CLIP encoder. What that really means is this: let's say that we now have a data set for a downstream classification task with actual labels. Here we have 10 classes and a labelled data set. We take the image, pass it into our trained image encoder to generate a 512-dimensional vector, and create a linear fully connected (FC) layer as our probe, from which we get a prediction. We can train the network in this way: by training with backpropagation we update the weights of the probe but keep all of the encoder weights frozen. So it's only this layer that actually gets updated. Once trained, the probe helps us understand how well the CLIP encoder performs on a new data set. And as we saw, going back to our performance here, it can also improve on the performance of CLIP overall if we have enough examples.

So I hope all of this makes sense. Quiz time. Have you been paying attention? Let's quiz you to find out. Which of the following is true about CLIP? A. Image and text are embedded in the same embedding space. B. CLIP uses images plus free text during its training. C. CLIP learns via contrastive learning. D. CLIP's image encoder is usually a convolutional or a transformer architecture.
Note that multiple options may be correct, and I'll give you a few seconds to answer this question. The correct options are all of them. Did you get them right? Please comment your reasoning down in the comments below and let's have a discussion. And at this point, if you think I deserve it, please do consider giving this video a like because it will help me out a lot.

Now, that's going to do it for quiz time. But before we go, let's generate a summary. We looked at CLIP, which is Contrastive Language-Image Pre-training. It is essentially a neural network that jointly trains an image encoder and a text encoder to map the respective modalities to the same embedding space. We saw exactly how we can train the image encoder and text encoder with this contrastive learning technique, and we also saw how we can perform zero-shot inference. We then took a look at the reasons why CLIP exists: task-specific labels are hard to come by, whereas natural language ones are much easier to come by, and natural language supervision allows the image encoder to learn rich representations. We also looked at some code that could help us understand this. Then we took a look at performance and how zero-shot CLIP can perform better than convolutional and transformer-based architectures. And we concluded our discussion by looking a little bit at linear probe CLIP as well.

That's all that we have for today. I'm going to leave some resources down in the description below, along with the link to all the slides and the paper. Thank you all so much for watching, and I will see you in the next one. Bye-bye.
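The linear probing setup described in the transcript (a single trainable linear layer on top of frozen encoder outputs) can be sketched in plain Python as below. The feature vectors are made-up 2-dimensional stand-ins for the frozen 512-dimensional CLIP image embeddings, and the function names, learning rate, and epoch count are assumptions chosen for illustration. Only the probe's weights and biases are updated; the features are treated as fixed inputs.

```python
import math


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def probe_logits(weights, biases, x):
    """Output of the linear (fully connected) probe for one feature vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def train_linear_probe(features, labels, num_classes, lr=0.5, epochs=200):
    """Fit the probe with SGD on cross-entropy; the encoder is never touched."""
    dim = len(features[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    biases = [0.0] * num_classes
    for _ in range(epochs):
        for x, y in zip(features, labels):
            probs = softmax(probe_logits(weights, biases, x))
            for c in range(num_classes):
                # Gradient of cross-entropy w.r.t. logit c: p_c - 1[c == y].
                grad = probs[c] - (1.0 if c == y else 0.0)
                for d in range(dim):
                    weights[c][d] -= lr * grad * x[d]
                biases[c] -= lr * grad
    return weights, biases


def predict(weights, biases, x):
    logits = probe_logits(weights, biases, x)
    return max(range(len(logits)), key=logits.__getitem__)


# Hypothetical frozen-encoder outputs for a labelled two-class data set.
frozen_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]

weights, biases = train_linear_probe(frozen_features, labels, num_classes=2)
predictions = [predict(weights, biases, x) for x in frozen_features]
print(predictions)
```

Because only a small linear layer is trained, the probe's accuracy mostly reflects how much class information is already present in the frozen embeddings.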

Original Description

In this video, we take a look at CLIP (contrastive language image pretraining). What is it? Why do we have it? How does it look? And some code!

ABOUT ME
⭕ Subscribe: https://www.youtube.com/c/CodeEmporium?sub_confirmation=1
📚 Medium Blog: https://medium.com/@dataemporium
💻 Github: https://github.com/ajhalthor
👔 LinkedIn: https://www.linkedin.com/in/ajay-halthor-477974bb/

RESOURCES
[1 📚] Main Paper: https://openai.com/index/clip/
[2 📚] Slides: https://link.excalidraw.com/p/readonly/STU1Z0GcInkQNvA8naKM
[3 📚] Code: https://github.com/ajhalthor/computer-vision-101/tree/main/CLIP

PLAYLISTS FROM MY CHANNEL
⭕ Reinforcement Learning: https://youtube.com/playlist?list=PLTl9hO2Oobd9kS--NgVz0EPNyEmygV1Ha&si=AuThDZJwG19cgTA8
Natural Language Processing: https://youtube.com/playlist?list=PLTl9hO2Oobd_bzXUpzKMKA3liq2kj6LfE&si=LsVy8RDPu8jeO-cc
⭕ Transformers from Scratch: https://youtube.com/playlist?list=PLTl9hO2Oobd_bzXUpzKMKA3liq2kj6LfE
⭕ ChatGPT Playlist: https://youtube.com/playlist?list=PLTl9hO2Oobd9coYT6XsTraTBo4pL1j4HJ
⭕ Convolutional Neural Networks: https://youtube.com/playlist?list=PLTl9hO2Oobd9U0XHz62Lw6EgIMkQpfz74
⭕ The Math You Should Know : https://youtube.com/playlist?list=PLTl9hO2Oobd-_5sGLnbgE8Poer1Xjzz4h
⭕ Probability Theory for Machine Learning: https://youtube.com/playlist?list=PLTl9hO2Oobd9bPcq0fj91Jgk_-h1H_W3V
⭕ Coding Machine Learning: https://youtube.com/playlist?list=PLTl9hO2Oobd82vcsOnvCNzxrZOlrz3RiD

MATH COURSES (7 day free trial)
📕 Mathematics for Machine Learning: https://imp.i384100.net/MathML
📕 Calculus: https://imp.i384100.net/Calculus
📕 Statistics for Data Science: https://imp.i384100.net/AdvancedStatistics
📕 Bayesian Statistics: https://imp.i384100.net/BayesianStatistics
📕 Linear Algebra: https://imp.i384100.net/LinearAlgebra
📕 Probability: https://imp.i384100.net/Probability

OTHER RELATED COURSES (7 day free trial)
📕 ⭐ Deep Learning Specialization: https://imp.i384100.net/Deep-Learning
📕 Python for Everybody: https://imp.i

This video explains the concept of CLIP, its architecture, and how it can be used for image-text matching tasks, with code examples and resources for further learning.

Key Takeaways
  1. Understand the basics of contrastive learning
  2. Learn the architecture of CLIP
  3. Explore CLIP embeddings with Python code
  4. Evaluate CLIP on a labelled data set with a linear probe
💡 CLIP can be used for image-text matching tasks such as zero-shot image classification.
