Chat with your Image! BLIP-2 connects Q-Former w/ VISION-LANGUAGE models (ViT & T5 LLM)

Discover AI · Beginner ·🧠 Large Language Models ·3y ago

Key Takeaways

The video discusses BLIP-2, a method that uses a Q-Former (Querying Transformer) to connect a frozen Vision Transformer (ViT) image encoder to a frozen LLM such as Flan-T5, producing a multimodal vision-language model, and shows its use for image-to-text tasks such as image captioning and visual question answering.

Full Transcript

Hello community! Today we combine our Vision Transformer knowledge and our LLM knowledge, so we have vision and language. Today I show you that there's a beautiful technique with which we can now pre-train vision-language models based on the Transformer architecture. And here we go: this beauty is called BLIP-2, "Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models". This is exactly what we wanted. As you can see here from the authors, it is from Salesforce Research, published January 30, 2023. Let's just jump right into it.

You might ask: what is the use case for this technology? If I upload a thumbnail of one of my last videos, let's have a look. We have here our chat input, the question "What elements can you distinguish here on this picture?", and after some processing the answer is: a helicopter, a green screen, and the words "Vision Transformer".

Now, the image-to-text tasks that vision-language models can tackle are: image captioning (for the visually impaired, great; useful for product descriptions; identifying inappropriate content beyond the text); image-text retrieval, which can be applied in multimodal search as well as autonomous driving; and of course visual question answering, which is what we are interested in. This will make our ChatGPT-style chatbots multimodal, and yippee, here we go.

This is now kind of a culmination of what we learned about the Transformer architecture: on the one hand our large language models that I showed you (T5, Flan-T5, BLOOM, everything you know, either the decoder side or the encoder side), and on the other hand Vision Transformers. Both have the same architecture, and this is the most important point: both are Transformer architectures, so we compare apples to apples. And you're not going to believe it: to bridge this "modality gap", as they call it, between vision on this side and language on this side, they now add a Transformer. Now isn't this a surprise? So we have a Transformer that connects
to a Transformer, which connects to another Transformer, and so we know that our technique is absolutely compatible across the board. This Transformer we call a Querying Transformer (you will understand in a second why), or Q-Former. So whenever you read about the Q-Former from February or March 2023 onward, you know exactly what we're talking about: the interface from the Vision Transformer to the language Transformer.

If you want to know more about LLMs, I show you here how you can use your Flan-T5 XXL, or, if you're interested, how to get your BLOOM 176-billion-parameter model operational on the AWS infrastructure; those are my two videos for you. If you want to learn about Transformers in general, this is my video, and my latest videos on Vision Transformers are here on the left side.

So, beautiful. The main problem is that an LLM is already a huge model. A Vision Transformer is not as huge, but they are coming up to 22 billion parameters. My goodness, these are monsters. So if you want to combine them, how do you want to train this if it hardly fits in a supercomputer center? This is now the beautiful idea of the authors of this paper (please have a look at the original paper): we freeze the Vision Transformer, and we completely freeze all layers of the LLM. Yes, we have an interface between those, we have an open I/O channel to these models, but the layers and the weights are frozen. The only thing we train sits between the frozen Vision Transformer and the frozen LLM, like ChatGPT, You.com chat, whatever GPT-3-based methodology you like to apply, or Flan-T5. The only object we train here is our Q-Former, the Querying Transformer. This is the hot topic.

So how do we do this? You're not going to believe it: since on the one side we have a vision object and on the other side a language object, the Q-Former itself has two modules. It has an image Transformer that interacts with the frozen
image encoder for visual feature extraction, and on the other side it also has a text Transformer that can function as both an encoder and a decoder for language. So you see, more or less we copy the external I/O structure that it will attach itself to, and we put some submodules within our Q-Former. And yes, you guessed it: we have shared self-attention layers within our Q-Former. If you want it in a little more detail, here it is for you. To initialize our Q-Former we use the weights from a BERT-base model. You might say, my goodness, this simple BERT-base model? Yes, exactly. Now it all comes together: we combine the vision technology with the language technology. BERT experts, Sentence Transformers, this is it.

"Can you describe the elements in the image?" Now this is going to be fun, so let's have a look. "A fighter jet flying in the sky with the words Q&A on ChatGPT." Absolutely right, you identified the fighter plane. OK, now: which fighter plane? "F-35." Not bad, not bad, the F-35.

So how does the pre-training happen? In two stages. In the first stage, the vision-language representation learning stage as we call it, we connect our Q-Former, which sits in the middle, with the frozen image encoder, so with only one side, and perform the pre-training using some very specific image-text pairs: you have an image and a text description of the contents of this image, another image and another text description of its content, and so on. Then you have a loss function, and they found out that the best way to do this is to have three different loss functions when you train in the first stage: an image-text contrastive loss, like the one our normal BERT systems were also trained on; an image-grounded text generation loss; and an image-text matching loss. All the details you find in their original research paper. Just to give you an idea, we have here in, let's call it
yellow, this [unclear] monster here, and we dock our Q-Former, in the very first pre-training step, to our image encoder that is frozen. So here is the image encoder, and as I told you we have two modules within our Q-Former: an image Transformer and a language Transformer. On the self-attention layers you have sharing, and this is more or less the main idea behind it. The image Transformer extracts a fixed number of output features from the image encoder, independent of the image resolution of course, and receives learnable query embeddings as input. This here is our input, I will show you in a second. Then it runs through, we have our attention masking, and we are able to calculate our three different losses, our three different objectives. And you have it beautifully described here how the self-attention mechanism works across the modules.

Now you might say: OK, and the second step? Well, in the second step, you're not going to believe it, we take the Q-Former that sits in the middle and connect it to the other part. This is the vision-to-language generative learning stage, where we connect to the frozen LLM; in our case it will be a Flan-T5 XXL model. The query embeddings that we get out now carry the visual information most relevant to the text, since they pass through an information bottleneck. These embeddings are now used as a visual prefix (I'll show you in a second what I mean) to the input of the large language model, and this pre-training phase effectively involves an image-grounded text generation task using a causal language modeling loss.

Let's have a visualization of this. The people who wrote the research paper decided to cover two different types of LLMs. For example, you have here a decoder-only LLM. But you know, if we're working with Flan-T5, we have a full Transformer stack, so we have an encoder stack and a decoder stack. So of course we're going to
use the second option here for Flan-T5. We have our input image that comes into our image encoder, which is frozen. Then the first step is done and we have the output of our Q-Former. This now runs through a fully connected layer that linearly projects the output query embeddings Z into the same dimension as the text embeddings of the LLM. And this is the beauty: if you only have a decoder, it goes in directly; if you have to feed an encoder and a decoder stack, you have to take care of this. The projected embeddings function as a soft visual prompt and condition the LLM on the visual representation extracted by the Q-Former. The details you find in the paper; I just want you to have an understanding of the two-step process. First it goes into the Q-Former, then you have a fully connected layer as input to the LLM, and then the next-word generation, the generative AI, takes place.

Now the absolute beauty of this methodology from the end of January 2023 is this: when they published BLIP-2 they used a Vision Transformer, and for the large language model, as I showed you, you can use the Flan-T5 model. But you are not restricted to this in the pre-training approach. Since you freeze both the vision model and the language model, you can combine almost any visual backbone with any large language model for this specific vision-language model development, where you only train the Q-Former. So we get a complete pre-trained vision-language model out of the pipeline.

Now you might say, isn't that beautiful? And I agree with you. But you know what, this was the theory, so that you understand what we're going to code next time, when we will code every single step. We will build our own app, maybe we even do a Gradio interface, and we will end up with an operational vision-language Transformer model you can use and apply. When you have an image, you use the image as an input, and then you can start to chat: the system automatically understands the content of the image and will respond to you in this
chat accordingly. I say thank you, I hope you enjoyed it a little bit, and see you in my next video.
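
The data flow described in the transcript (frozen image features, a fixed set of learnable queries, a fully connected projection, and a visual prefix in front of the text embeddings) can be illustrated with a toy sketch. This is not the authors' implementation: it uses plain Python lists, a single cross-attention step without learned key/value projections, and made-up toy sizes (`D_FEAT`, `D_LLM`, `NUM_QUERIES`). All function and variable names are hypothetical. It only shows why the number of visual tokens handed to the LLM does not depend on the image resolution.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng):
    """Small random matrix standing in for features or weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def cross_attend(queries, image_feats):
    """Each query attends over all image patches (simplified: one head,
    no learned key/value/query projections)."""
    d = len(queries[0])
    keys_t = [list(col) for col in zip(*image_feats)]   # (d, num_patches)
    scores = matmul(queries, keys_t)                    # (num_queries, num_patches)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, image_feats)                 # (num_queries, d)


def visual_prefix(image_feats, queries, proj):
    """Query outputs Z, linearly projected to the LLM embedding width."""
    z = cross_attend(queries, image_feats)
    return matmul(z, proj)                              # (num_queries, d_llm)


rng = random.Random(0)
D_FEAT, D_LLM, NUM_QUERIES = 8, 16, 4                   # toy sizes (assumptions)

queries = rand_matrix(NUM_QUERIES, D_FEAT, rng)         # learnable query embeddings
proj = rand_matrix(D_FEAT, D_LLM, rng)                  # fully connected layer

low_res = rand_matrix(9, D_FEAT, rng)                   # 9 patches from the frozen encoder
high_res = rand_matrix(36, D_FEAT, rng)                 # 36 patches from the frozen encoder
text_embeds = rand_matrix(5, D_LLM, rng)                # 5 text-token embeddings

prefix_low = visual_prefix(low_res, queries, proj)
prefix_high = visual_prefix(high_res, queries, proj)
llm_input = prefix_low + text_embeds                    # visual prefix, then text

print(len(prefix_low), len(prefix_high), len(llm_input))
```

Both images yield the same number of prefix vectors (one per query), which is the "fixed number of output features" and "information bottleneck" the transcript refers to. In a real system the queries and the projection would be trained while the image encoder and the LLM stay frozen.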

Original Description

Combined Vision-Language Transformers, interlinked w/ a Q-Former, a Querying Transformer! BLIP 2. BLIP-2! The financial resources for pre-training both systems (Vision and Language) are astronomical? Let me introduce you to a clever, new training method: BLIP-2. Multimodal Large Language Models for visual QA or perception-language tasks, multimodal dialogue, or image captioning, and image recognition with verbal content descriptions, plus a Chat function. Visual Perception and Large Language Models: The new combination in Transformers. Multi-modal Large Language Models for visual QA or image captioning. All rights and credits w/: BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models https://arxiv.org/abs/2301.12597 #ai #machinelearning #chatgpt #vision #llm #BLIP2 #QFormer

This video introduces BLIP-2, a method that connects a frozen image encoder and a frozen LLM through a trainable Q-Former to build multimodal large language models, and shows its application to image-to-text tasks. It covers the architecture and the two-stage pre-training method, in which only the Q-Former is trained. The video is theory only; coding each step is announced for a follow-up video.
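
To make the effect of freezing concrete, here is a minimal bookkeeping sketch. The `Module` class and the parameter counts are illustrative assumptions (round numbers for a roughly 1B-parameter image encoder, a 188M-parameter Q-Former, and an 11B-parameter LLM), not measurements from the video. It only shows how small the trained share of the pipeline becomes once both large models are frozen.

```python
from dataclasses import dataclass


@dataclass
class Module:
    """Hypothetical stand-in for one model in the pipeline."""
    name: str
    num_params: int
    trainable: bool = True


def freeze(module):
    """Return a copy of the module whose weights are excluded from training."""
    return Module(module.name, module.num_params, trainable=False)


def trainable_fraction(modules):
    """Share of all parameters that would receive gradient updates."""
    total = sum(m.num_params for m in modules)
    trained = sum(m.num_params for m in modules if m.trainable)
    return trained / total


# Illustrative sizes only (assumptions, not figures from the video).
pipeline = [
    freeze(Module("image_encoder", 1_000_000_000)),
    Module("q_former", 188_000_000),
    freeze(Module("llm", 11_000_000_000)),
]

frac = trainable_fraction(pipeline)
print(f"trainable share of parameters: {frac:.1%}")
```

With these assumed sizes, well under 2% of the parameters are trained, which is why the approach is so much cheaper than training the vision and language models end to end.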

Key Takeaways
  1. Pre-train the Q-Former on image-text pairs using three loss functions (image-text contrastive, image-grounded text generation, image-text matching)
  2. Connect the Q-Former to the frozen LLM for vision-to-language generative learning
  3. Freeze both the vision model and the language model during pre-training; train only the Q-Former
  4. Combine almost any visual backbone with any large language model for vision-language model development
💡 The BLIP-2 method makes pre-training efficient by placing a trainable Q-Former between a frozen image encoder and a frozen LLM, enabling the development of multimodal large language models for image-to-text tasks.
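
The first of the three stage-one losses, the image-text contrastive loss, can be sketched as follows. This is a simplified illustration, not the authors' code: it assumes unit-length toy vectors, a fixed temperature of 0.1, and takes the image-text similarity as the highest dot product between any query output and the text vector. All names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def image_text_similarity(query_outputs, text_vec):
    """Best match between any query output of one image and the text vector."""
    return max(dot(q, text_vec) for q in query_outputs)


def cross_entropy(logits, target):
    """Negative log-softmax of the target entry (numerically stable)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]


def contrastive_loss(batch_queries, batch_texts, temperature=0.1):
    """Symmetric contrastive loss: image i should match text i, and vice versa."""
    n = len(batch_texts)
    sim = [
        [image_text_similarity(batch_queries[i], batch_texts[j]) / temperature
         for j in range(n)]
        for i in range(n)
    ]
    image_to_text = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    text_to_image = sum(
        cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2


# Toy batch: two images, each with two query outputs, and two text vectors.
queries = [
    [[1.0, 0.0], [0.6, 0.8]],    # image 0
    [[0.0, 1.0], [-0.6, 0.8]],   # image 1
]
texts_matched = [[1.0, 0.0], [0.0, 1.0]]   # text i describes image i
texts_swapped = [[0.0, 1.0], [1.0, 0.0]]   # captions deliberately mismatched

loss_matched = contrastive_loss(queries, texts_matched)
loss_swapped = contrastive_loss(queries, texts_swapped)
print(loss_matched < loss_swapped)
```

Correctly paired captions give a much lower loss than swapped ones, which is the signal that pulls matching image and text representations together during the first pre-training stage.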
