CLIP - Explained!

CodeEmporium · Advanced ·📐 ML Fundamentals ·4mo ago

Skills: Modern CV Models 90% · CV Basics 80% · ML Pipelines 60%

Key Takeaways

The video explains CLIP (Contrastive Language-Image Pretraining), its purpose, and implementation, with accompanying code examples.

Full Transcript

Greetings, fellow learners. In this video we are going to talk about CLIP: the what, the why, and the how. CLIP stands for Contrastive Language-Image Pre-training. Contrastive learning is the technique with which the network learns, it operates on text as well as images, and pre-training refers to how this architecture is trained. So that's CLIP. If you want a more high-level definition, I would phrase it like this: CLIP is a neural network that jointly trains an image encoder and a text encoder to map their respective modalities, that is image or text, into the same embedding space.

Now, let's understand how and why we even do this. For training CLIP, let's say that we have a data set of images along with free text that we found on the internet. There are typically around 440 million of these pairs because they're quite easy to come by. Let's create batches of size n, which could be a few thousand examples, and then train CLIP with this pipeline over here. So let's zoom in.

One thing to note here is that CLIP is composed of an image encoder, which could be a vision transformer or a convolutional network like a ResNet, and a text encoder like GPT. The image encoder takes an image and encodes it into a 512-dimensional vector. The text encoder takes a piece of text and encodes it into a 512-dimensional vector. The goal of this entire pipeline is to ensure that if an image and a text correspond to each other semantically, then their vectors should be as close to each other as possible, and if they don't correspond to each other, then the image vector and the text vector should be as far away from each other as possible. This is the contrastive learning approach.

So we take all of our images and encode them into 512-dimensional vectors, and do the same for all the text. We then perform L2 normalization, which is a pre-processing step for cosine similarity.
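Editor's sketch: the normalization step above is easier to see in code. Once every vector has unit L2 norm, a plain dot product between two vectors equals their cosine similarity. The snippet below is a minimal NumPy illustration, not the code from the video; the random arrays are stand-ins for encoder outputs, and the batch size of 4 and the helper name `l2_normalize` are assumptions chosen for readability.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale each vector along `axis` to unit L2 norm."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

rng = np.random.default_rng(0)
n, d = 4, 512  # tiny batch for illustration; 512 matches the embedding size in the transcript

# Stand-ins for the encoder outputs (hypothetical data, not real CLIP embeddings).
image_emb = l2_normalize(rng.normal(size=(n, d)))
text_emb = l2_normalize(rng.normal(size=(n, d)))

# With unit-length rows, the matrix product gives all pairwise cosine similarities.
similarity = image_emb @ text_emb.T  # shape (n, n), entries in [-1, 1]
```

Row i of `similarity` compares image i with every text in the batch, and the diagonal holds the scores for the matching image-text pairs.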
So we'll now take the cosine similarity between the n images and the n texts to get an n × n matrix of values that lie between negative 1 and positive 1. What we want to do now is transform this matrix into probabilities. One step within this is to compute logits. This involves multiplying each of these values by e to the power of a temperature parameter t, which is a learned parameter. We then get a bunch of values that lie between negative infinity and positive infinity.

From here we compute two different functions. One takes a softmax across each image: for every row we get a probability distribution, so each row sums to one, and you get this matrix. For the softmax along the text, we do the orthogonal operation, taking the softmax down every single column to get a probability distribution for every single text example. So we end up with two matrices of probabilities. We also have a ground truth here, because we know exactly which image corresponded to which piece of text. We have predictions and we have ground truth, so we can compute a loss, which we do with a cross-entropy loss.

This is the full architecture, and it learns through backpropagation of errors. Through backpropagation the temperature parameter is learned over time, and the parameters within each of these encoders are also learned over time, since they were initialized from scratch.

Next, let's talk about inference. Let's say that we have this random example of an image. We pass it into our trained image encoder to get a 512-dimensional vector. We then normalize it, which still gives a 512-dimensional vector, so that it can later be passed into a cosine similarity.
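Editor's sketch: the training objective described above (scale by e^t, softmax over rows and over columns, cross-entropy against the known pairing) can be written compactly. This is a minimal NumPy version under stated assumptions, not the code from the video: it only computes the loss value (no gradients or encoder updates), and it averages the row-wise and column-wise losses, which follows the pseudocode in the CLIP paper rather than anything said in the transcript. The function names are hypothetical.

```python
import numpy as np

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_loss(similarity, t):
    """Symmetric cross-entropy over an (n, n) cosine-similarity matrix.

    similarity[i, j] compares image i with text j; the correct pair for
    image i is text i, so the targets are the diagonal entries.
    """
    n = similarity.shape[0]
    logits = similarity * np.exp(t)  # temperature scaling
    idx = np.arange(n)
    loss_images = -log_softmax(logits, axis=1)[idx, idx].mean()  # softmax across each row
    loss_texts = -log_softmax(logits, axis=0)[idx, idx].mean()   # softmax down each column
    return (loss_images + loss_texts) / 2  # averaging is an assumption (see lead-in)

# Two sanity checks on hand-made similarity matrices:
uninformative = clip_loss(np.zeros((4, 4)), t=0.0)       # all pairs look equally similar
aligned = clip_loss(np.eye(4), t=np.log(100.0))          # only matching pairs are similar
```

When every pair looks equally similar the loss equals log(n), and when only the matching pairs score highly the loss approaches zero, which is the behavior the contrastive objective rewards.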
Now at the text end, let's say that during inference we are trying to perform a classification task, and let's say there are 10 classes, such as antelope, zebra, car, and the like. What we do first is convert each of these classes into a natural language prompt, for example "a photo of a car", "a photo of an antelope", "a photo of a zebra". We encode each of these with our trained text encoder to get a 10 × 512 matrix. We normalize the vectors and then compute the similarity scores between the image vector and each of the text vectors. So you get 10 values, and these lie between negative 1 and positive 1. We now want to compute probabilities, so we scale these values into logits and then apply a softmax operation to ensure this is a probability distribution. The output could look something like this: "a photo of an antelope" could have a 97% chance of being the correct label, then "a photo of a zebra" 2%, "a photo of a car" 0%. So we were able to perform a classification task on our unseen data set, and it did it pretty well.

Why is this zero-shot? Because we have an image encoder and a text encoder that were not trained on our downstream classification data set. The model didn't see any examples of it during training, nor during inference, and hence it is zero-shot inference.

So now let's try to understand why we even need CLIP. For one, task-specific labels can be pretty hard to come by; these are the gold-standard labels for every single image. A good example is ImageNet, which is supposed to be a very large ontology, yet has only about a million labeled images. Whereas what we trained on is natural language text that looked kind of like this.
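Editor's sketch: looking back at the zero-shot classification steps described above (turn class names into prompts, encode, normalize, compare, softmax), a minimal NumPy version might look like the following. No real CLIP model is loaded here: the text vectors are random stand-ins and the "image" vector is deliberately built close to the antelope prompt vector, so the example only illustrates the arithmetic. The `logit_scale` value of 100 and all function names are assumptions.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def make_prompt(name):
    """Turn a class name into a natural language prompt."""
    article = "an" if name[0] in "aeiou" else "a"
    return f"a photo of {article} {name}"

def zero_shot_probs(image_vec, text_vecs, logit_scale):
    """Softmax over scaled cosine similarities between one image and K prompts."""
    sims = l2_normalize(text_vecs) @ l2_normalize(image_vec)  # shape (K,)
    return softmax(logit_scale * sims)

class_names = ["antelope", "zebra", "car"]
prompts = [make_prompt(name) for name in class_names]

rng = np.random.default_rng(0)
# Hypothetical stand-ins for text_encoder(prompts) and image_encoder(image).
text_vecs = rng.normal(size=(len(prompts), 512))
image_vec = text_vecs[0] + 0.5 * rng.normal(size=512)  # synthetic "antelope" image

probs = zero_shot_probs(image_vec, text_vecs, logit_scale=100.0)  # scale is an assumed value
predicted = class_names[int(np.argmax(probs))]
```

No classifier weights are trained at any point: adding a class only requires writing one more prompt and encoding it.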
CLIP was trained on 440 million such examples because they were far easier to come by. Another reason why we would use CLIP is that natural language supervision allows the image encoder to create rich vectors that better encode the meaning of the image. Let's take a look at some code to understand exactly how this is the case.

I have some code in this Colab notebook which takes two input images: a picture of me, and another picture of me wearing a hat. Let's walk through the code. First I load a CLIP model, which is effectively the image and text model that we saw previously. We encode and normalize the input images, which gives a 512-dimensional vector in each case. Then I take the difference between these image vectors to get a delta vector, and normalize that vector. So we have this vector stored over here, and I'm going to compare it to word vectors: four text vectors that represent hat, cup, cat, and boat. I've created those text vectors over here, normalized them, and I compute the cosine similarity between each of these four text vectors and the delta vector. I also convert the result into probabilities.

On doing so, interestingly enough, we can see that the difference between the two images is about 60% hat, 16% boat, and very small probabilities for cup and cat. Semantically, the difference between the images is just me wearing a hat versus not wearing a hat, so the difference is the hat itself, and that is reflected in the vectors themselves. So what we can see here is that natural language supervision allowed the image encoder to create rich vectors that better encode the meaning of an image.
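Editor's sketch: the delta-vector experiment described above follows a short recipe: normalize two image vectors, subtract, normalize the difference, and compare it to a handful of word vectors. The snippet below reproduces only that arithmetic with NumPy. It is not the notebook from the video, and it loads no CLIP model: all vectors are random stand-ins, and the "with hat" vector is constructed by adding the hat word direction, so the outcome is built in by design. The mixing weight of 8.0 and the scale of 100 are assumed values.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
words = ["hat", "cup", "cat", "boat"]

# Hypothetical stand-ins for the normalized text embeddings of the four words.
text_vecs = l2_normalize(rng.normal(size=(len(words), 512)))

# Hypothetical stand-ins for the two image embeddings.
face = rng.normal(size=512)
image_plain = l2_normalize(face)
image_hat = l2_normalize(face + 8.0 * text_vecs[0])  # synthetic: adds the "hat" direction

# Difference between the two images, normalized for cosine similarity.
delta = l2_normalize(image_hat - image_plain)

sims = text_vecs @ delta          # cosine similarity to each word
probs = softmax(100.0 * sims)     # assumed logit scale
best_word = words[int(np.argmax(probs))]
```

With a real model, `text_vecs`, `image_plain`, and `image_hat` would come from the trained text and image encoders instead of a random generator.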
So I hope that's a little clearer with this example. As far as performance is concerned, during inference zero-shot CLIP actually outperforms other networks, like convolutional or transformer networks, that were trained with gold labels. The one thing, interestingly enough, is that if you have a larger number of training examples per class, linear-probe CLIP performs pretty well. So let's talk about that really quick.

Linear probing is a method to evaluate the visual representations of the CLIP encoder. It involves training a linear model, the probe, on top of the frozen CLIP encoder. What that really means is this: let's say that we now have a data set for a downstream classification task with actual labels. Here we have 10 classes and a labeled data set. We take the image, pass it into our trained image encoder to generate a 512-dimensional vector, and add a linear fully connected (FC) layer as our probe, which gives a prediction. We train the network in this way: with backpropagation we update the weights of this layer but keep all of the encoder weights frozen. So it's only this layer that is actually updated. Once trained, the probe can help us understand how well the CLIP encoder performs on a new data set. And as we saw, going back to our performance results, with enough examples it can even improve on the performance of CLIP overall.

So I hope all of this makes sense.

Quiz time. Have you been paying attention? Let's quiz you to find out. Which of the following is true about CLIP? A. Image and text are embedded in the same embedding space. B. CLIP uses images plus free text during its training. C. CLIP learns via contrastive learning. D. CLIP's image encoder is usually a convolutional or a transformer architecture.
Note that multiple options may be correct, and I'll give you a few seconds to answer this question. The correct options are all of them. Did you get them right? Please share your reasoning in the comments below and let's have a discussion. And at this point, if you think I deserve it, please do consider giving this video a like because it will help me out a lot.

That's going to do it for quiz time. But before we go, let's generate a summary. We looked at CLIP, which is contrastive language-image pre-training. It is essentially a neural network that jointly trains an image encoder and a text encoder to map their respective modalities to the same embedding space. We saw exactly how we can train the image encoder and text encoder with this contrastive learning technique, and we also saw how we can perform zero-shot inference. We then took a look at the reasons why CLIP exists: task-specific labels are hard to come by, whereas natural language ones are much easier to come by, and natural language supervision allows the image encoder to learn rich representations. We also looked at some code that could help us understand this. Then we took a look at performance, and how zero-shot CLIP can perform better than convolutional and transformer-based architectures. And we concluded our discussion by looking a little bit at linear-probe CLIP as well.

That's all that we have for today. I'm going to leave some resources down in the description below, along with the link to all the slides and the paper. Thank you all so much for watching, and I will see you in the next one. Bye-bye.
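Editor's sketch: the linear probe covered in the transcript trains a single linear layer on top of frozen image embeddings. The NumPy snippet below is a minimal illustration under stated assumptions, not the author's code: the "frozen encoder" is a hypothetical function that returns synthetic class-clustered vectors, three classes are used instead of ten, and the learning rate and step count are arbitrary choices. Only `W` and `b` (the probe) are ever updated.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, dim = 3, 512
class_centers = rng.normal(size=(num_classes, dim))

def frozen_features(labels):
    """Hypothetical stand-in for normalized embeddings from a frozen image encoder."""
    x = class_centers[labels] + rng.normal(size=(labels.size, dim))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def softmax_rows(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

train_labels = np.repeat(np.arange(num_classes), 40)
test_labels = np.repeat(np.arange(num_classes), 20)
train_x = frozen_features(train_labels)
test_x = frozen_features(test_labels)

# The probe: one linear (fully connected) layer, trained with cross-entropy.
W = np.zeros((dim, num_classes))
b = np.zeros(num_classes)
one_hot = np.eye(num_classes)[train_labels]

learning_rate = 1.0  # assumed value
for _ in range(200):
    probs = softmax_rows(train_x @ W + b)
    grad_logits = (probs - one_hot) / train_labels.size  # gradient of mean cross-entropy
    W -= learning_rate * (train_x.T @ grad_logits)
    b -= learning_rate * grad_logits.sum(axis=0)

# Held-out accuracy indicates how linearly separable the frozen features are.
test_accuracy = (np.argmax(test_x @ W + b, axis=1) == test_labels).mean()
```

Because the encoder never changes, its embeddings can be computed once and cached, which keeps probe training cheap compared with fine-tuning the whole network.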

Original Description

In this video, we take a look at CLIP (contrastive language image pretraining). What is it? Why do we have it? How does it look? And some code!

ABOUT ME
⭕ Subscribe: https://www.youtube.com/c/CodeEmporium?sub_confirmation=1
📚 Medium Blog: https://medium.com/@dataemporium
💻 Github: https://github.com/ajhalthor
👔 LinkedIn: https://www.linkedin.com/in/ajay-halthor-477974bb/

RESOURCES
[1 📚] Main Paper: https://openai.com/index/clip/
[2 📚] Slides: https://link.excalidraw.com/p/readonly/STU1Z0GcInkQNvA8naKM
[3 📚] Code: https://github.com/ajhalthor/computer-vision-101/tree/main/CLIP

PLAYLISTS FROM MY CHANNEL
⭕ Reinforcement Learning: https://youtube.com/playlist?list=PLTl9hO2Oobd9kS--NgVz0EPNyEmygV1Ha&si=AuThDZJwG19cgTA8
Natural Language Processing: https://youtube.com/playlist?list=PLTl9hO2Oobd_bzXUpzKMKA3liq2kj6LfE&si=LsVy8RDPu8jeO-cc
⭕ Transformers from Scratch: https://youtube.com/playlist?list=PLTl9hO2Oobd_bzXUpzKMKA3liq2kj6LfE
⭕ ChatGPT Playlist: https://youtube.com/playlist?list=PLTl9hO2Oobd9coYT6XsTraTBo4pL1j4HJ
⭕ Convolutional Neural Networks: https://youtube.com/playlist?list=PLTl9hO2Oobd9U0XHz62Lw6EgIMkQpfz74
⭕ The Math You Should Know : https://youtube.com/playlist?list=PLTl9hO2Oobd-_5sGLnbgE8Poer1Xjzz4h
⭕ Probability Theory for Machine Learning: https://youtube.com/playlist?list=PLTl9hO2Oobd9bPcq0fj91Jgk_-h1H_W3V
⭕ Coding Machine Learning: https://youtube.com/playlist?list=PLTl9hO2Oobd82vcsOnvCNzxrZOlrz3RiD

MATH COURSES (7 day free trial)
📕 Mathematics for Machine Learning: https://imp.i384100.net/MathML
📕 Calculus: https://imp.i384100.net/Calculus
📕 Statistics for Data Science: https://imp.i384100.net/AdvancedStatistics
📕 Bayesian Statistics: https://imp.i384100.net/BayesianStatistics
📕 Linear Algebra: https://imp.i384100.net/LinearAlgebra
📕 Probability: https://imp.i384100.net/Probability

OTHER RELATED COURSES (7 day free trial)
📕 ⭐ Deep Learning Specialization: https://imp.i384100.net/Deep-Learning
📕 Python for Everybody: https://imp.i

This video explains the concept of CLIP, its architecture, and how it can be used for image-text matching tasks, with code examples and resources for further learning.

Key Takeaways

Understand the basics of contrastive learning
Learn the architecture of CLIP
Implement CLIP using Python
Fine-tune CLIP for specific tasks

💡 CLIP can be used for a variety of image-text matching tasks, such as image classification and captioning.
