Multimodal Few-Shot Learning with Frozen Language Models | Paper Explained

Aleksa Gordić - The AI Epiphany · Beginner ·👁️ Computer Vision ·4y ago


Key Takeaways

The video discusses the paper 'Multimodal Few-Shot Learning with Frozen Language Models' by DeepMind, which introduces a model called Frozen that handles both visual and textual inputs and can pick up new multimodal tasks from a few in-context examples. The language model is kept frozen (the video mentions GPT-2 as an example and later refers to a 7-billion-parameter model), and only a vision encoder is trained, on image captioning, to map each image to a short sequence of embeddings in the language model's token space.
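
As a rough illustration of the wiring summarized above, the sketch below projects a pooled image feature through a linear layer into n vectors of size D and places them in front of the text embeddings. It is a minimal sketch, not the paper's code: the function names, the tiny sizes, and the random weights are all hypothetical, and the real vision encoder and language model are not included.

```python
import random


def project_to_prefix(pooled, weight, bias, n_tokens, d_model):
    """Apply a linear layer to a pooled image feature, then split the result
    into n_tokens vectors of size d_model (the "visual prefix")."""
    flat = [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weight, bias)
    ]
    return [flat[i * d_model:(i + 1) * d_model] for i in range(n_tokens)]


def build_input(prefix, text_embeddings):
    """Put the visual prefix in front of the text token embeddings."""
    return prefix + text_embeddings


random.seed(0)
feat_dim, n_tokens, d_model = 8, 2, 4  # toy sizes, assumed for illustration

# Stand-in for the pooled output of a vision encoder (hypothetical values).
pooled = [random.gauss(0, 1) for _ in range(feat_dim)]

# Linear layer with n_tokens * d_model output units.
weight = [[random.gauss(0, 0.1) for _ in range(feat_dim)]
          for _ in range(n_tokens * d_model)]
bias = [0.0] * (n_tokens * d_model)

# Stand-in for the embeddings of three text tokens.
text_embeddings = [[random.gauss(0, 1) for _ in range(d_model)]
                   for _ in range(3)]

prefix = project_to_prefix(pooled, weight, bias, n_tokens, d_model)
sequence = build_input(prefix, text_embeddings)
print(len(prefix), len(sequence), len(sequence[0]))  # 2 5 4
```

The only constraint that matters here is the one stressed in the video: every prefix vector must have the same dimension D as the language model's token embeddings, otherwise the two cannot be concatenated.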

Full Transcript

What's up! In this video I'm covering a novel paper called "Multimodal Few-Shot Learning with Frozen Language Models" by Maria Tsimpoukelli, Jacob Menick, Serkan Cabi, S. M. Ali Eslami, Oriol Vinyals, and Felix Hill of DeepMind. What this paper shows, in a nutshell, is that they have this model called Frozen, where they take a huge pre-trained language model such as GPT-2, they freeze it (hence "Frozen"), and they train a vision encoder so that the model can parse images as well. The encoder converts images into tokens that are compatible with the language model, and then they show that this model can do few-shot learning across different tasks which involve both visual cues and linguistic cues.

To make it a bit less abstract, let me show you an example of an input that goes into this Frozen model. We have an image, which will be tokenized, then we have "This person is like" followed by a happy smiley, because this girl is happy. Then "This person is like" with a sad smiley. Finally, we prompt the model with this image and "This person is like", and as we can see the model generates a terrified smiley, a dot, and an end-of-sentence token, which is what we expected. Really cool. One thing to keep in mind during this whole video is that these examples are curated; they mention that multiple times. So again, the whole point is that they have this few-shot learning capability, rather than a highly performant model on all of these tasks.

Nonetheless, let me show you this one: "This was invented by Zacharias Janssen", "This was invented by Thomas Edison", and so on. The point here is that just by looking at this image you can't deduce the answer, whereas in the previous example you could. As you can see, the output here is "the Wright brothers", a dot, and an end-of-sentence token. The whole point here, as we'll soon see, is that the language model can
now pull in the factual knowledge, and we can answer questions like this because of the language model, not because of the vision portion of the model.

That was a high-level glimpse of this paper: the model can handle both modalities, and the more examples you give it, the better it becomes. That's really cool. Let's now dive deeper.

"When trained at sufficient scale, auto-regressive language models exhibit the notable ability to learn a new language task after being prompted with just a few examples." This should be familiar to you. If you haven't watched my videos on GPT-3 or the original Transformer paper, do check them out; I'll link them somewhere here. Basically, what those models showed, especially the GPT family of models from OpenAI, is that even though the model was trained only as a language model, with an unsupervised task of next-token prediction and nothing else, you can actually do machine translation, for example. Here's an example of how it looks. We have a one-shot setting where we prompt the model with "Translate English to French", then we give it one example, "sea otter => loutre de mer" (I don't speak French), and then we prompt with "cheese" followed by this symbol, "=>". The model will actually learn to perform machine translation, even though it has never seen such a task during training. Here we additionally have a few-shot setup where we give multiple examples and then prompt the model to translate, and they show that with multiple examples the performance just gets better and better. It obviously saturates after a certain point, but the trend is clear.

So that's the first thing. Now, they say that here they present a simple yet effective approach for transferring this few-shot learning
ability to a multimodal setting, and in particular they focus on vision, as we'll soon see. Finally, here's the motivation behind all this: "Despite these impressive capabilities, such large-scale language models are 'blind' to modalities other than text, preventing us from communicating visual tasks, questions or concepts to them." That's the reason they integrated this vision component.

Let me explain what this whole system looks like and how it works. The system as a whole is pretty simple. We have a language model, as you can see here, and it's frozen, so all of its parameters are frozen, and we have this vision encoder. To be a bit more specific, they were using GPT-2 for the language model and an NF-ResNet for the vision encoder, but that's not crucial. You can pick some other language model, and you can pick a different vision encoder, maybe something like a Vision Transformer. What's important is that these components exist and that they are wired the way they are.

They first need to adapt the image into the input that the transformer model is expecting, which is these tokens. They have this vision encoder, and after a pooling layer it outputs a vector. Then they take a linear layer and project this vector into a new space (let me draw it like this) whose dimension is n times D. They found the best value of n to be two, but that's not important; you can see here they have only two tokens, but in general it can be an arbitrary number. What is important is D: it's the same dimension as the token embeddings that go into the language model. That's obviously the prerequisite for feeding the tokens that come from the image into the language model. So that's the image part.

How they train this is fairly simple. Basically, I think it would
be wiser to write this down. We have a start-of-sentence token here, and all of these other words will be shifted over here: "a" goes here, "small" goes here, and so on. What you're trying to do is predict the target sequence. Let me just focus on predicting the word "small". The word "small" will have as context the image tokens that came in here, the start-of-sentence token, and "a", because we have causal masking. If the model saw "small" in the input, that would be cheating: it would be easy to predict the output "small", since the transformer would just learn to copy-paste the values, and that's not what we want. We want to predict tokens.

So that's how the task looks. At the output you get a distribution in the usual way, and you're just trying to maximize the likelihood. Maybe we have some distribution here, and we find the token that corresponds to the word "small", maybe this one. We want to push its probability up to one and push all of the other probabilities down to zero, and we do that with simple cross-entropy: the loss is minus log of p, so when p goes to one the loss goes to zero.

In a nutshell, that's how the system works and how it's trained. The gradients are backpropagated through the language model weights, which are frozen, and the vision encoder weights are then tweaked, so that the vision encoder learns representations that are useful for this captioning task. That's the whole system; it's fairly trivial.

Let me now show you how they use this thing. In the first example we have an image, the vision encoder encodes it into these two tokens, and then we prompt the model with this text: "Question: What color is the car?". The model generates "blue"
and then an end-of-sentence token. By the way, a short remark: even though you see a single token per word here, in practice they use a tokenizer called SentencePiece. So this word "boat", or this word "small", may be separated into subwords: maybe the "sma" part will have one token associated with it and "ll" will have a second. It's just an example, but in general you'll have more tokens than words in your sentence. Just a minor detail. So this is one of the tasks they're going to evaluate the model on.

The second example requires the model to have some knowledge base. Additionally, I forgot to mention that during the captioning training all named entities are masked. If you have a name like "Aleksa" or something, you replace it with the word "person". That means the vision encoder cannot learn those named entities, because it never saw them. So in the examples you see here, where the model generated "Steve Jobs", that certainly didn't come from the vision encoder. That's an important thing I want you to notice: the knowledge that Steve Jobs was the guy who invented the iPhone actually came from the language model itself.

The third task they evaluate the model on is this fast-binding task. You present the model with an image and a made-up word, "dax", and you want the model to associate the visual category of an apple with the novel word "dax". You do the same thing here: an orange is a "blicket". Finally, you prompt it with an image and "Question: What is this? Answer:", and the model generates "This is a dax". We saw that a dax is an apple, so it properly generated the sentence "This is a dax". Having said that, let
me now show you the quantitative results they got, because that's interesting. Before that, just a short remark. They say here: "In contrast, our work enables strong generalization to new multimodal tasks", and so on. I don't like this part, because we don't even have a strong definition of what strong generalization is. The closest thing I could think of is François Chollet's paper "On the Measure of Intelligence", where he introduces terminology and basically says that pretty much all of the current models we have in deep learning and machine learning are only able to do local generalization, whereas broad generalization and extreme generalization are something only humans can currently do. I pretty much agree with him on this one. The language model they use in Frozen has seen a lot of data, which means it has a lot of experience, and we need to account for that when we assess these generalization capabilities. Arguably, because of that immense amount of experience, you cannot actually claim that it's generalizing strongly. It does generalize, but just the level... yeah. Okay, rant over, pretty much.

These are just some details I already explained: they have a huge 7-billion-parameter pre-trained language model, they freeze it, and that's how they train the model. I also explained how they map images into tokens that the language model can then parse. I'll skip all of those small details; they're using positional encodings, which helped them. Let me focus on this instead.

The important thing I want you to notice is that they have a bunch of different ways to prompt this model so that it generates much better answers. Here's one example. They have pretty intricate terminology here: two-way, zero repeats, two inner
shots. Let's see what that means. First they have this task induction, which they quantitatively show helps a lot: "Answer with dax or blicket" kind of tunes the model into answering with these two words. Then this is the first part, the shots. The reason it's two-way is that we have two made-up words, "blicket" and "dax". The reason it's two inner shots is that they have two independent examples here, as you can see. Finally they prompt the model, and "blicket" would be the correct answer, because as you can see a blicket is a lion.

The last task they evaluate is this one, where again you have to associate a visual category with a novel word. In the previous case it's basically recognition: you recognize it's a lion, and the answer is "blicket". Here you also have to reason, because the question says "What is the dax made of?". You first need to understand that a dax is a table, and then you need to understand what it is made of, and the answer is "wood". This is the final task they evaluate the Frozen model on. Now let's see the quantitative results.

Here are the results for visual question answering. There are these baselines, which I'll briefly explain now, especially the blind baseline. "The strength of the pre-trained language model is a double-edged sword. It powers the generalization abilities of Frozen, but also enables the model to perform surprisingly well without considering the visual input at all." So the model can learn to ignore the visual input and still answer the question. "To guard against this possibility we also train blind baselines, in which the image presented to the visual encoder is blacked out, but the convnet weights are still trained. This amounts to prefix tuning." Just as a short reminder of how it works: for these baselines, instead of an image, what they'll
do is just black out (or blue out, in my drawing) these images. What happens is that the model has to learn some representation that is constant, which won't change, because it is only ever presented with black images. That constant needs to help in the captioning task, and that's why they call it prefix tuning: it finds some representation that helps the overall system do a better job at captioning. So that's one of the baselines, the blind baseline. Having explained that one, let's now focus on the quantitative results.

Here we are. "Frozen" is the version we were talking about. Here is the version trained from scratch, which means you ditch all of the weights in the pre-trained transformer and train it from scratch. Here are the setups they have: zero-shot, one-shot, four-shot, and so on, and we can see the trend that accuracy improves as we add more examples, which is desirable. We see that the from-scratch model totally fails. We see that the fine-tuned model is worse; for fine-tuning, they do not freeze the language model, but they initialize it with the pre-trained weights. The blind baseline I've just explained as well. Oscar is a dedicated baseline; you can see it's much better than Frozen, but the whole point here is that we have this ability to improve with more examples, which is also characteristic of GPT-3 and language models in general. That's cool. Here you can see that when you additionally fine-tune on this very dataset, the visual question answering dataset, the performance obviously ramps up, which is expected, I guess.

The second thing here is the test on the OK-VQA dataset. This dataset contains examples where you need some additional knowledge base in order to answer the questions. Again, they show that
the model improves with additional examples. As a new baseline they use this 400-million-parameter model; by the way, this one has 7 billion. Obviously, the bigger the model, the better the performance, which is not that surprising in 2021, I guess. Again, this baseline is much better; they just want to stress that this improvement with more examples transfers to the multimodal setup, and that's cool. So those were the first experiments they did.

Let me reiterate this because it's really important. The Conceptual Captions dataset is hypernymed, meaning that, for example, proper names are replaced with a general word like "person". "This enables us to rigorously study the transfer of factual knowledge, because all knowledge of named entities comes from language model pre-training." Consequently, when we show the model an image of an airplane and ask who invented this, the visual encoder has determined that the image contains an airplane, and the language model has used this to retrieve the factual knowledge that airplanes were invented by the Wright brothers.

Finally, jumping to fast concept binding; these are the last tasks they tested the Frozen model on. Let me connect this table with the actual tasks. This is the task at hand, the visual binding task. This is two-way binding, because again we have two novel made-up words, "dax" and "blicket". They'll also show a five-way binding task, where we now have five made-up words, and we'll soon see that the model fails on that one but succeeds on the two-way binding task. On the two-way binding you can see that again they have a bunch of different ways to prompt the model to elicit better outputs, and again it's improving with more examples. The second thing they tried is using real names instead of the made-up words. They do this so that they can
quantify how much harder it is for the model to learn the binding, and how hard the task itself is. And this kind of bugs me, I'm not sure whether it's a typo, but you can see it has better performance, and it saturates after three examples. Again, this ANIL baseline is better, as I mentioned. Five-way binding fails: here it's literally the same as random chance, 20%, then it improves a bit here and then goes back, so it's inconclusive. They mention it somewhere here, just a sec: "In Table 4, we show that the observed effects on Open-Ended miniImageNet do not transfer to the 5-way setting, where Frozen is not significantly above chance. This shows that learning to bind five new names to five visual categories in a single forward pass is beyond the current capabilities of Frozen." So it fails there, and they leave it for future research.

The final task is the one I showed you, where aside from fast binding you also need to reason. Not many new conclusions can be drawn here; again, it's improving with more examples. The interesting part, maybe, is the blind baseline. Remember, the image is blacked out, so we just repeat the text a couple of times, and those linguistic cues help boost the performance of this blind model. That means the improvements above are a combination of linguistic cues and visual cues.

I think that's pretty much it. I like this paper a lot. I like this inclusion of visual information into the whole pipeline; it slowly starts resembling the way we humans operate. We have vision, we somehow represent that information, and then we have these computation engines inside our heads. By the way, in one of the previous papers, called "Pretrained Transformers as Universal Computation Engines" or something, they
showed that if you take a huge pre-trained language model and just tweak some layer-norm parameters and some embedding weights, you can fine-tune it very fast onto novel tasks. That's cool, and I guess this work is a follow-up in a way. Having said that, I hope you liked this video. If you did, consider subscribing and sharing, and until next time, bye bye!
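
The training objective described in the transcript (maximize the likelihood of each caption token, i.e. minimize minus log p) can be sketched in a few lines. This is a toy illustration under stated assumptions: the four-word vocabulary and the logits are made up, and the function names are hypothetical. In the real system the logits would come from the frozen language model conditioned on the image prefix, and only the vision encoder would receive parameter updates.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def caption_loss(logits_per_step, target_ids):
    """Mean cross-entropy over caption positions: -log p(target token)."""
    losses = [
        -math.log(softmax(logits)[target])
        for logits, target in zip(logits_per_step, target_ids)
    ]
    return sum(losses) / len(losses)


# Hypothetical toy vocabulary: 0 = <eos>, 1 = "a", 2 = "small", 3 = "boat".
uniform = [[0.0, 0.0, 0.0, 0.0]]      # model has no idea: p("small") = 0.25
confident = [[0.0, 0.0, 10.0, 0.0]]   # model strongly prefers "small"

print(caption_loss(uniform, [2]))     # equals log(4), about 1.386
print(caption_loss(confident, [2]))   # close to 0, since p is close to 1
```

The two calls mirror the point made in the video: as the probability assigned to the correct token goes to one, the loss goes to zero.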

Original Description

❤️ Become The AI Epiphany Patreon ❤️ ► https://www.patreon.com/theaiepiphany

In this video I cover "Multimodal Few-Shot Learning with Frozen Language Models" from DeepMind. They introduce Frozen - which is able to handle both visual and textual inputs and shows good generalization capabilities to novel visual question answering datasets combined with fast binding mechanisms even though it was only trained on image captioning.

✅ Paper: https://arxiv.org/abs/2106.13884

⌚️ Timetable:
00:00 Intro
02:20 GPT-3 and emerging few-shot properties
04:20 Training procedure for Frozen
07:45 Inference
10:15 Strong generalization?
11:55 Prompting mechanisms and the hardest task
13:25 Quantitative results
19:50 Outro

💰 BECOME A PATREON OF THE AI EPIPHANY ❤️
If these videos, GitHub projects, and blogs help you, consider helping me out by supporting me on Patreon!
The AI Epiphany ► https://www.patreon.com/theaiepiphany
One-time donation: https://www.paypal.com/paypalme/theaiepiphany
Much love! ❤️

Huge thank you to these AI Epiphany patreons: Eli Mahler, Petar Veličković, Zvonimir Sabljic

💡 The AI Epiphany is a channel dedicated to simplifying the field of AI using creative visualizations and in general, a stronger focus on geometrical and visual intuition, rather than the algebraic and numerical "intuition".

👋 CONNECT WITH ME ON SOCIAL
LinkedIn ► https://www.linkedin.com/in/aleksagordic/
Twitter ► https://twitter.com/gordic_aleksa
Instagram ► https://www.instagram.com/aiepiphany/
Facebook ► https://www.facebook.com/aiepiphany/

👨‍👩‍👧‍👦 JOIN OUR DISCORD COMMUNITY:
Discord ► https://discord.gg/peBrCpheKE

📢 SUBSCRIBE TO MY MONTHLY AI NEWSLETTER:
Substack ► https://aiepiphany.substack.com/

💻 FOLLOW ME ON GITHUB FOR COOL PROJECTS:
GitHub ► https://github.com/gordicaleksa

📚 FOLLOW ME ON MEDIUM:
Medium ► https://gordicaleksa.m

This video explains the paper 'Multimodal Few-Shot Learning with Frozen Language Models' and its application to few-shot learning tasks. The model uses a frozen language model and trains a vision encoder to parse images into tokens compatible with the language model. The video discusses the model's performance on various tasks, including visual question answering and binding tasks.

Key Takeaways

Use a frozen language model, such as GPT-2, and keep its weights fixed during training
Train a vision encoder to parse images into tokens compatible with the language model
Use relative positional encoding so that prompts can place images at any position and contain more than one image
Evaluate the model on tasks such as visual question answering and binding tasks
Fine-tune the model on the visual question answering data set
Use a tokenizer, such as SentencePiece, to split words into subwords
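
The first two takeaways can be pictured with a small sketch. It is not the paper's implementation: the sizes, the toy vocabulary, and the names image_to_prefix, embed_text and build_input are all made up for illustration, and a single linear map stands in for the vision encoder. It only shows the idea that image features become a short prefix of vectors with the same width as the frozen model's word embeddings, and that the prefix is placed in front of the text.

```python
import random

# All sizes below are illustrative assumptions, not values from the paper.
D_MODEL = 8      # embedding width of the (hypothetical) frozen language model
N_PREFIX = 2     # how many image "tokens" are placed in front of the text
FEATURE_DIM = 4  # size of the (hypothetical) image feature vector

random.seed(0)


def make_matrix(rows, cols):
    """Small random matrix stored as a list of rows."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# Frozen part: a toy embedding table that is never updated.
VOCAB = {"this": 0, "person": 1, "is": 2, "like": 3}
frozen_embeddings = make_matrix(len(VOCAB), D_MODEL)

# Trainable part: stand-in for the vision encoder's output projection.
vision_projection = make_matrix(FEATURE_DIM, N_PREFIX * D_MODEL)


def image_to_prefix(features):
    """Project image features to N_PREFIX vectors of width D_MODEL."""
    flat = [
        sum(features[i] * vision_projection[i][j] for i in range(FEATURE_DIM))
        for j in range(N_PREFIX * D_MODEL)
    ]
    return [flat[k * D_MODEL:(k + 1) * D_MODEL] for k in range(N_PREFIX)]


def embed_text(words):
    """Look up each word in the frozen embedding table."""
    return [frozen_embeddings[VOCAB[w]] for w in words]


def build_input(features, words):
    """Image prefix first, then the text embeddings."""
    return image_to_prefix(features) + embed_text(words)


sequence = build_input([0.5, -1.0, 0.25, 2.0], ["this", "person", "is", "like"])
print(len(sequence), len(sequence[0]))  # 6 vectors, each of width 8
```

In a real training loop only the vision side would receive gradient updates, while the language model's embeddings and layers stay untouched.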

💡 The model's performance improves with additional examples and fine-tuning on the visual question answering data set, but saturates after three examples in the five-way binding task.
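
To make "additional examples" concrete, here is a minimal sketch of how a few-shot prompt could be assembled as an interleaved list of image and text segments, following the smiley example from the video. The function build_few_shot_prompt and the file names are hypothetical; the real model consumes image embeddings and text tokens rather than these placeholder tuples.

```python
def build_few_shot_prompt(support, query_image, question):
    """Interleave (image, text) support pairs, then add the query image and open prompt."""
    segments = []
    for image, text in support:
        segments.append(("image", image))
        segments.append(("text", text))
    segments.append(("image", query_image))
    segments.append(("text", question))
    return segments


# Hypothetical file names standing in for the images shown in the video.
support_examples = [
    ("happy_face.jpg", "This person is like 😀."),
    ("sad_face.jpg", "This person is like 😞."),
]

prompt = build_few_shot_prompt(support_examples, "scared_face.jpg", "This person is like")
for kind, value in prompt:
    print(kind, value)
```

Adding another support pair to support_examples turns the two-shot prompt into a three-shot one without changing anything else.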

Chapters (8)

0:00 Intro

2:20 GPT-3 and emerging few-shot properties

4:20 Training procedure for Frozen

7:45 Inference

10:15 Strong generalization?

11:55 Prompting mechanisms and the hardest task

13:25 Quantitative results

19:50 Outro
