Multimodal Few-Shot Learning with Frozen Language Models | Paper Explained

Aleksa Gordić - The AI Epiphany · Beginner ·👁️ Computer Vision ·4y ago

Key Takeaways

The video discusses the paper 'Multimodal Few-Shot Learning with Frozen Language Models' by DeepMind, which introduces a model called Frozen that handles both visual and textual inputs and can pick up new tasks from a few prompted examples. Frozen keeps a large pre-trained language model frozen (the video names GPT-2 as an example and later notes the model has 7 billion parameters) and trains only a vision encoder that converts images into embeddings compatible with the language model's tokens.
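The image-to-token step can be sketched in a few lines. This is a minimal illustration in plain Python, not the paper's implementation: it assumes the vision encoder has already produced one pooled feature vector, applies a single linear layer, and reshapes the result into n embeddings of size d, as the transcript describes. The names `linear` and `image_to_prefix`, the toy sizes, and the random weights are all hypothetical.

```python
import random

def linear(x, weight, bias):
    """Compute y = W x + b for one input vector (weight is a list of rows)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def image_to_prefix(pooled, weight, bias, n_tokens, d_model):
    """Project a pooled image feature into n_tokens embeddings of size d_model."""
    flat = linear(pooled, weight, bias)  # length n_tokens * d_model
    return [flat[i * d_model:(i + 1) * d_model] for i in range(n_tokens)]

# Toy sizes chosen only for illustration (assumption, not from the paper).
FEATURE_DIM, N_TOKENS, D_MODEL = 8, 2, 4
rng = random.Random(0)
weight = [[rng.gauss(0, 0.1) for _ in range(FEATURE_DIM)]
          for _ in range(N_TOKENS * D_MODEL)]
bias = [0.0] * (N_TOKENS * D_MODEL)

# Stand-ins for the vision encoder output and for three embedded text tokens.
pooled_feature = [rng.random() for _ in range(FEATURE_DIM)]
text_embeddings = [[rng.random() for _ in range(D_MODEL)] for _ in range(3)]

prefix = image_to_prefix(pooled_feature, weight, bias, N_TOKENS, D_MODEL)
lm_input = prefix + text_embeddings  # image tokens first, then the text tokens
print(len(lm_input), len(lm_input[0]))
```

The only hard requirement the sketch encodes is that each image token has the same width d as a text embedding, so the two can be concatenated into one input sequence for the language model.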

Full Transcript

What's up! In this video I'm covering this novel paper called "Multimodal Few-Shot Learning with Frozen Language Models" by Maria Tsimpoukelli, Jacob Menick, Serkan Cabi, Ali Eslami, Oriol Vinyals, and Felix Hill of DeepMind. What this paper shows, in a nutshell, is that they have this model called Frozen, where they take a huge pre-trained language model such as GPT-2 and freeze it, hence "Frozen", and they train a vision encoder so that the model can parse images as well. It just converts images into tokens which are compatible with the language model. Then they show that this model can do few-shot learning across different tasks which involve both visual cues and linguistic cues.

To make it a bit less abstract, let me show you an example of an input that goes into this Frozen model. We have an image, which will be tokenized, then we have "This person is like" and a smiley, because this girl is happy. Then "This person is like" again, with a sad face. Finally, we prompt the model with this image and "This person is like", and as we can see the model generates this terrified smiley, a dot, and an end-of-sentence token, which is what we expected. Really cool.

One thing to keep in mind during this whole video is that these examples are curated; they mention that multiple times. So again, the whole point is that the model has this few-shot learning capability, rather than that it is highly performant on all of these tasks.

Nonetheless, let me show you this one: "This was invented by Zacharias Janssen", "This was invented by Thomas Edison", and so on. The point here is that just by looking at this image you can't deduce the answer, whereas here you could have. As you can see, the output here is "the Wright brothers", a dot, and an end-of-sentence token. So the whole point here, as we'll soon see, is that the language model can
now pull in the factual knowledge, and we can answer questions like this because of the language model, not because of the vision portion of the model.

So that was a high-level glimpse of this paper. Basically, they can handle both modalities, and the more examples you give the model, the better it becomes. That's really cool. Let's now dive deeper.

"When trained at sufficient scale, autoregressive language models exhibit the notable ability to learn a new language task after being prompted with just a few examples." This should be familiar to you. If you haven't watched my video on GPT-3 or the original Transformer paper, do check them out; I'll link them somewhere here. What those models showed, especially the GPT family of models from OpenAI, is the following: even though the model was trained purely as a language model, so you have an unsupervised task of next-token prediction and that's everything, you can actually do machine translation, for example.

Here's an example of how it looks. We have a one-shot setting where we prompt the model with "Translate English to French", then we give it one example, "sea otter" to "loutre de mer" (I don't speak French), and then you prompt with "cheese" and this symbol. The model will actually learn how to translate this, so it performs machine translation even though it has never seen such a task during its training. And here we additionally have a few-shot setup, where we have multiple examples and then prompt the model to translate. They show that with more examples the performance just gets better and better. It obviously saturates after a certain point, but the trend is clear.

So that's the first thing. Now, here we present a simple yet effective approach for transferring this few-shot learning
ability to a multimodal setting, and in particular they focus on vision, as we'll soon see.

Finally, here's the motivation behind all this: "Despite these impressive capabilities, such large-scale language models are blind to modalities other than text, preventing us from communicating visual tasks, questions or concepts to them." That's the reason they integrated this vision component.

Let me explain how this whole system looks and how it works. The system as a whole is pretty simple. We have a language model, as you can see here, and it's frozen, so the parameters are all frozen, and we have this vision encoder. Just to be a bit more specific, they were using GPT-2 for the language model and an NF-ResNet for the vision encoder, but that's not that crucial. You can pick some other language model, you can pick a different vision encoder, maybe something like a Vision Transformer. What matters is the fact that these components exist and that they are wired the way they are.

So first they need to adapt the image into the input that the transformer model is expecting, and that's these tokens. They have this vision encoder, and after some pooling layer it outputs a vector. Then they take a linear layer and project this into a new space, let me draw it like this, and the dimension will be n times d. They found the best value for n was two, but that's not important. You can see here they have only two tokens, but it can be an arbitrary number in general. What is important is d: it's the same dimension as the tokens that go into the language model. Obviously that's the prerequisite for feeding the tokens that come from the image into the language model. So that's the image part.

Now, how they train this is fairly simple. I think it would
be wiser to write this down. We have a start-of-sentence token here, and all of these other words will be shifted over here, so "a" will get here, "small" will be here, etc. What you're trying to do is predict the target sequence. Let me just focus on predicting the word "small". The word "small" will have as its context these image tokens that came in here, as well as the start-of-sentence token, as well as "a", because we have causal masking. Obviously, if the model saw "small" in the input, it would be cheating: it's easy to predict the output "small" when you can just copy the value, and the transformer would learn to just copy-paste the values. That's not what we want. We want to predict tokens.

So that's how the task looks. Here you output a distribution the usual way, and you're just trying to maximize the likelihood. Maybe we have some distribution here, and we find the token that corresponds to the word "small", maybe this one, and we want to push its probability up to one and push all of the other probabilities down to zero. We do that with a simple cross-entropy loss, so it will be minus log of p: when p goes to one, the loss goes to zero.

In a nutshell, that's how the system works and how it's trained. The gradients are backpropagated through these language-model weights, which are frozen, and these vision-encoder weights are then tweaked, so that the vision encoder learns representations which are useful for doing this captioning task. And that's the whole system. It's fairly trivial.

Let me now show you how they use this thing. In the first example, we have an image, the vision encoder encodes it into these two tokens, and then we prompt it with this text: "Question: What colour is the car? Answer:". And the model generates "Blue"
and then an end-of-sentence token.

By the way, just a short remark here. Even though you see here one word and a single token, in practice they use this tokenizer called SentencePiece. So this word "small" may be separated into subwords: maybe this "sma" part will have one token associated with it and "ll" will have a second. It's just an example, but in general you'll have more tokens than you have words in your sentence. Just a minor detail.

So this is one of the tasks they're going to evaluate this model on. The second example here requires the model to have some knowledge base. Additionally, I forgot to mention that during this captioning training all of the named entities are masked. So if you have a name like Aleksa or something, you'll just put "person" there instead of the name. That means that this vision encoder cannot learn those named entities, obviously, because it never saw them. So in the examples you see here, like the model generating "Steve Jobs", that certainly didn't come from the vision encoder, and that's an important thing I want you to notice. The knowledge that Steve Jobs was the guy who invented the iPhone actually came from the language model itself.

The third task they evaluate this model on is this fast-binding task, where you present the model with an image and you have this made-up word called "dax", and you want the model to associate the visual category of an apple with the novel word "dax". You do the same thing here: you have an orange, a "blicket". Finally, you prompt it with an image and you say "Question: What is this? Answer:", and the model generates "This is a dax". We saw that a dax is an apple, so it properly generated the "This is a dax" sentence. So having said that, let
me now show you the quantitative results they got, because that's interesting.

Before that, just a short remark. They say here: "In contrast, our work enables strong generalization to new multimodal tasks", blah blah blah. What I want to say is that I don't like this part, because we don't even have a strong definition of what strong generalization is. The closest thing I could think of was François Chollet's paper "On the Measure of Intelligence", where he describes this terminology. He basically says that pretty much all of the current models we have in the deep learning field, in machine learning, are only able to do local generalization, whereas broad generalization and extreme generalization are something that only humans can currently do. I pretty much agree with him on this one, because the language model they use in Frozen has seen a lot of data. That means it has a lot of experience, which we need to account for when we assess these generalization capabilities. So arguably, because of that immense amount of experience, you cannot actually claim that it's generalizing strongly. It does generalize, but just the level... yeah. Rant over, pretty much.

These are just some details I already explained: they have a huge 7-billion-parameter pre-trained language model, they freeze it, and that's how they train the model. I also explained this one, how we map the images into tokens which the language model can then parse. I'll just skip all of those small details. They're using positional encodings, which helped them.

Let me focus on this. An important thing I want you to notice is that they have a bunch of different ways to prompt this model so that it can generate the answers much better. Here's one example. They have pretty intricate terminology here: two ways, zero repeats, two inner
shots. Let's see what it means. First they have this task induction, which they quantitatively show helps a lot. "Answer with dax or blicket" kind of tunes the model into answering with these two words. Then they have the two shots; basically this is the first part. The reason it's two ways is that we have two made-up words here, "blicket" and "dax", and the reason it's two inner shots is that they have two independent examples here, as you can see. Finally they prompt the model, and "blicket" would be the correct answer because, as you can see, the blicket is a lion.

The last task they'll evaluate is this one, where again you have to associate a visual category with a novel word. But here it's not just recognition. In the previous one you recognize it's a lion and you answer "blicket", but here you have to reason, because the question says "What is the dax made of?". You first need to understand that the dax is a table, and then you need to understand what it is made of, and the answer is wood. So this is the final task they evaluate this Frozen model on. Now let's see the quantitative results.

Here are the results for visual question answering, and there are these baselines which I'll shortly explain, especially this blind baseline. "The strength of the pre-trained language model is a double-edged sword. It powers the generalization abilities of Frozen but also enables the model to perform surprisingly well without considering the visual input at all." So the model can learn to ignore the visual input and still answer the question. "To guard against this possibility we also train blind baselines, in which the image presented to the visual encoder is blacked out, but the convnet weights are still trained. This amounts to prefix tuning."

Just as a short reminder of how it works: instead of an image, for these baselines what they'll
do is just black out, or blue out in my case, these images. What happens is that the model has to learn some representation which is constant, which won't change, because it's only ever presented with black images. That constant needs to help in this captioning task, and that's why they call it prefix tuning: it finds some representation that helps the overall system do a better job at captioning. So that's one of the baselines, the blind baseline. Having explained that one, let's now focus on the quantitative results.

Here we are. "Frozen" here is the version we were talking about. Here is the version trained from scratch, which means you just ditch all of the weights in the pre-trained transformer and train it from scratch. Here are the various setups they have: zero-shot, one-shot, four-shot, etc. We can see that as we add more examples there is this trend that the accuracy improves, which is desirable. We see that the from-scratch model totally fails. We see that the fine-tuned model is worse. With fine-tuning, they do not freeze the language model, but they initialize it with the pre-trained weights. The blind baseline I've just explained as well. Again, Oscar is a dedicated baseline, and you can see it's much better than Frozen, but the whole point here is that we have this ability to improve with more examples, which was also characteristic of GPT-3 and language models in general. So that's cool. Here you can see that when you additionally fine-tune on this very dataset, the visual question answering dataset, the performance obviously ramps up, and that's expected, I guess.

The second thing here is the test on the OK-VQA dataset. This dataset contains examples where you need some additional knowledge base in order to answer the questions. Again, they show that the
model improves with additional examples. As a new baseline they use this 400-million-parameter model, and by the way this one has 7 billion. Again, obviously, the bigger the model, the better the performance, which is not that surprising in 2021, I guess. And again this baseline is much better. They just want to stress that this improvement with more examples transfers to the multimodal setup, and that's cool. So those were the first experiments they did.

Let me just reiterate this because it's really important. The Conceptual Captions dataset is hypernymed, meaning that, for example, proper names are replaced with a general word like "person". "This enables us to rigorously study the transfer of factual knowledge because all knowledge of named entities comes from language model pre-training. Consequently, when we show the model an image of an airplane and ask 'who invented this?', the visual encoder has determined that the image contains an airplane, and the language model has used this to retrieve the factual knowledge that airplanes were invented by the Wright brothers."

Finally, jumping to fast concept binding; these are the last tasks they tested this Frozen model on. Let me connect this table with the actual tasks. This is the task at hand, the visual binding task. This is two-way binding because we again have two novel made-up words, "dax" and "blicket". They'll also show a five-way binding task, where we now have five made-up words, and we'll soon see that it fails on that one, but it succeeds on the two-way binding task. On the two-way binding you can see again that they have a bunch of different ways to prompt the model so that they can elicit better outputs, and again you can see it's improving with more examples. The second thing they tried is using real names instead of those made-up words. They do this so that they can
quantify how much harder it is for the model to learn the binding, and how hard the task itself is. This kind of bugs me, and I'm not sure whether this is a typo, but you can see it has better performance and that it kind of saturates after three examples. Again, this ANIL baseline is better, as I mentioned.

Five-way binding fails. Here it's literally the same as random chance, 20%, and it kind of improves here and then goes back, so it's inconclusive. They mention it somewhere here, just a sec: "In Table 4, we show that the observed effects on Open-Ended miniImageNet do not transfer to the 5-way setting, where Frozen is not significantly above chance. This shows that learning to bind five new names to five visual categories in a single forward pass is beyond the current capabilities of Frozen." So it kind of fails there, and they leave it for future research.

The final task is the one I showed you where, aside from fast binding, you need to reason. Not many new conclusions can be made here; again it's improving with more examples. The interesting part maybe is this: if we focus on the blind baseline, we can see that even repeating helps. Again, remember the image is blacked out, so we just repeat the text a couple of times, and those linguistic cues help boost the performance of this blind model. That means the improvements above are a combination of both linguistic cues and visual cues.

I think that's pretty much it. I like this paper a lot. I like this inclusion of visual information into the whole pipeline. It slowly starts resembling the way we humans operate: we have vision, obviously, and we somehow represent that information, and then we have this computation engine inside our head. By the way, one of the previous papers, called "Pretrained Transformers as Universal Computation Engines" or something, they
showed that if you take a huge pre-trained language model and you just tweak some layer-norm parameters and some embedding weights, you can fine-tune it very fast onto novel tasks, and that's cool. I guess this work is a follow-up in a way. Having said that, I hope you liked this video. If you did, consider subscribing and sharing, and until next time, bye-bye!

Original Description

❤️ Become The AI Epiphany Patreon ❤️ ► https://www.patreon.com/theaiepiphany

In this video I cover "Multimodal Few-Shot Learning with Frozen Language Models" from DeepMind. They introduce Frozen - which is able to handle both visual and textual inputs and shows good generalization capabilities to novel visual question answering datasets combined with fast binding mechanisms even though it was only trained on image captioning.

✅ Paper: https://arxiv.org/abs/2106.13884

⌚️ Timetable:
00:00 Intro
02:20 GPT-3 and emerging few-shot properties
04:20 Training procedure for Frozen
07:45 Inference
10:15 Strong generalization?
11:55 Prompting mechanisms and the hardest task
13:25 Quantitative results
19:50 Outro

💰 BECOME A PATREON OF THE AI EPIPHANY ❤️
If these videos, GitHub projects, and blogs help you, consider helping me out by supporting me on Patreon!
The AI Epiphany ► https://www.patreon.com/theaiepiphany
One-time donation: https://www.paypal.com/paypalme/theaiepiphany
Much love! ❤️

Huge thank you to these AI Epiphany patreons:
Eli Mahler
Petar Veličković
Zvonimir Sabljic

💡 The AI Epiphany is a channel dedicated to simplifying the field of AI using creative visualizations and in general, a stronger focus on geometrical and visual intuition, rather than the algebraic and numerical "intuition".

👋 CONNECT WITH ME ON SOCIAL
LinkedIn ► https://www.linkedin.com/in/aleksagordic/
Twitter ► https://twitter.com/gordic_aleksa
Instagram ► https://www.instagram.com/aiepiphany/
Facebook ► https://www.facebook.com/aiepiphany/

👨‍👩‍👧‍👦 JOIN OUR DISCORD COMMUNITY:
Discord ► https://discord.gg/peBrCpheKE

📢 SUBSCRIBE TO MY MONTHLY AI NEWSLETTER:
Substack ► https://aiepiphany.substack.com/

💻 FOLLOW ME ON GITHUB FOR COOL PROJECTS:
GitHub ► https://github.com/gordicaleksa

📚 FOLLOW ME ON MEDIUM:
Medium ► https://gordicaleksa.m


This video explains the paper 'Multimodal Few-Shot Learning with Frozen Language Models' and its application to few-shot learning tasks. The model uses a frozen language model and trains a vision encoder to parse images into tokens compatible with the language model. The video discusses the model's performance on various tasks, including visual question answering and binding tasks.
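The binding tasks are easier to picture with the prompt laid out explicitly. Below is a rough sketch, under my own assumptions, of how a prompt with task induction, several "ways" (made-up names), "inner shots" (examples per name), and "repeats" might be assembled. The function `build_binding_prompt`, the `("image", id)` placeholder tuples, the image ids, and the grouping of examples by name are all hypothetical; the paper's actual prompt format and ordering may differ.

```python
def build_binding_prompt(examples, query_image, repeats=0, task_induction=True):
    """Interleave image placeholders and captions, then append the query.

    examples maps each made-up name to the image ids shown for it, so
    len(examples) is the number of ways and each list holds the inner shots.
    """
    names = list(examples)
    segments = []
    if task_induction:
        # Tell the model up front which words it is expected to answer with.
        segments.append("Answer with " + " or ".join(names) + ".")
    for _ in range(repeats + 1):  # repeats=0 shows every example exactly once
        for name, image_ids in examples.items():
            for image_id in image_ids:
                segments.append(("image", image_id))  # slot for the image tokens
                segments.append(f"This is a {name}.")
    segments.append(("image", query_image))
    segments.append("Q: What is this? A: This is a")
    return segments

# 2 ways ("dax", "blicket") with 2 inner shots each; the ids are placeholders.
examples = {"dax": ["img_a1", "img_a2"], "blicket": ["img_b1", "img_b2"]}
prompt = build_binding_prompt(examples, "img_query")
n_images = sum(1 for segment in prompt if isinstance(segment, tuple))
print(n_images)
```

In a real system each `("image", id)` slot would be replaced by the embeddings produced by the vision encoder, and the text segments by their token embeddings.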

Key Takeaways
  1. Use a frozen pre-trained language model (the video gives GPT-2 as an example)
  2. Train a vision encoder to parse images into tokens compatible with the language model
  3. Project the vision encoder's pooled output through a linear layer into n token embeddings of the language model's dimension d (n = 2 worked best)
  4. Evaluate the model on tasks such as visual question answering and binding tasks
  5. Optionally fine-tune on the visual question answering dataset, which raises performance further
  6. Use a tokenizer, such as SentencePiece, to split words into subwords
💡 The model's performance improves with additional prompted examples and with fine-tuning on the visual question answering dataset. On two-way binding with real names it saturates after about three examples, while on five-way binding Frozen is not significantly above chance.
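The training signal described in the transcript, "minus log of p" for the correct next token, can be checked with a small worked example. This is only a toy sketch with a four-word vocabulary and hand-picked logits (both assumptions); in Frozen the logits would come from the frozen language model, and only the vision encoder would receive parameter updates.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits, target_index):
    """Cross-entropy for one position: minus log of the target's probability."""
    return -math.log(softmax(logits)[target_index])

vocab = ["a", "small", "red", "boat"]  # toy vocabulary, for illustration only
target = vocab.index("small")

# Uniform logits give p = 1/4 for every token, so the loss is log(4).
unsure = next_token_loss([0.0, 0.0, 0.0, 0.0], target)
# A large logit on the target pushes p towards one and the loss towards zero.
confident = next_token_loss([0.0, 10.0, 0.0, 0.0], target)
print(unsure > confident)
```

Summing this quantity over every caption position gives the captioning loss that is minimized during training.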


Chapters (8)

0:00 Intro
2:20 GPT-3 and emerging few-shot properties
4:20 Training procedure for Frozen
7:45 Inference
10:15 Strong generalization?
11:55 Prompting mechanisms and the hardest task
13:25 Quantitative results
19:50 Outro