Chat with your Image! BLIP-2 connects Q-Former w/ VISION-LANGUAGE models (ViT & T5 LLM)

Discover AI · Beginner ·🧠 Large Language Models ·3y ago

Skills: LLM Foundations 90% · Multimodal LLMs 90% · Prompt Craft 80% · Fine-tuning LLMs 80% · LLM Engineering 70%

Key Takeaways

The video introduces BLIP-2, a pre-training method in which a Q-Former (Querying Transformer) connects a frozen image encoder (a Vision Transformer, ViT) to a frozen large language model such as Flan-T5. It shows how the resulting multimodal model handles image-to-text tasks such as image captioning and visual question answering.

Full Transcript

Hello community. Today we combine our Vision Transformer knowledge and our LLM knowledge, so we have vision and language, and I will show you a beautiful technique with which we can now pre-train vision-language models based on the Transformer architecture. Here we go: this beauty is called BLIP-2, "Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models". This is exactly what we wanted. As you can see here, the authors are from Salesforce Research, and the paper was published on January 30, 2023. Let's jump right into it.

You might ask: what is the use case for this technology? If I upload a thumbnail of one of my last videos and enter the chat question "What elements can you distinguish here on this picture?", then after some processing the answer is "a helicopter, a green screen and the words Vision Transformer".

Image-to-text tasks that vision-language models can tackle are image captioning (for the visually impaired, for useful product descriptions, or to identify inappropriate content beyond the text), image-text retrieval (which can be applied in multimodal search as well as autonomous driving) and, of course, visual question answering, which is what we are interested in. This will make our ChatGPT-style chatbots multimodal, and yippee, here we go.

First, and this is kind of a culmination of what we have learned: the Transformer architecture of the large language models I showed you (T5, Flan-T5, BLOOM, everything you know, whether on the decoder side or the encoder side) and the architecture of Vision Transformers are the same. This is the most important point here: both are Transformer architectures, so we compare apples to apples. And you are not going to believe it: to bridge this modality gap, as they call it, between vision on this side and language on that side, they now add a Transformer. Isn't this a surprise? So we have a Transformer that connects to a Transformer that connects to another Transformer, and we know that our technique is absolutely compatible across the board. This Transformer is called a Querying Transformer (you will understand in a second why), or Q-Former. So whenever you read about the Q-Former from February or March 2023 onward, you know exactly what we are talking about: the interface between the Vision Transformer and the language Transformer.

If you want to know more about LLMs, I show you here how you can use your Flan-T5 XXL, or, if you are interested, how to get your BLOOM 176-billion-parameter model operational on the AWS infrastructure; those are my two videos for you. If you want to learn about Transformers in general, this is my video, and my latest videos on Vision Transformers are here on the left side.

Now the main problem: an LLM is already a huge model. A Vision Transformer is not quite as huge, but it is coming up to 22 billion parameters. My goodness, these are monsters. So if you want to combine them, how do you want to train when this hardly fits in a supercomputer center? This is now the beautiful idea of the authors of this paper (please have a look at the original paper): we freeze the Vision Transformer and we completely freeze all layers of the LLM. Yes, we have an interface between them, an open I/O channel to these models, but the layers and the weights are frozen. The only thing we train sits between the frozen Vision Transformer and the frozen LLM, be it ChatGPT, You.com chat, whatever GPT-3-based methodology you like to apply, or Flan-T5. The only object we train is our Q-Transformer, the Q-Former or Querying Transformer. This is the hot topic.

So how do we do this? You are not going to believe it: since we have a vision object on the one side and a language object on the other side, the Q-Former itself has two modules. It has an image Transformer that interacts with the frozen image encoder for visual feature extraction, and on the other side it has a text Transformer that can function as both an encoder and a decoder for language. So you see, we more or less copy the external I/O structure that it attaches itself to and put some submodules within our Q-Former, and yes, you guessed it, we have common self-attention layers within our Q-Former. If you want it in a little more detail, this is the Q-Former for you. You see here that to initialize our Q-Former we use the weights of a BERT base model. You might say: my goodness, this simple BERT base model? Yes, exactly. Now it all comes together: we combine the vision technology with the language technology. BERT, SBERT, Sentence Transformers: this is it.

"Can you describe the elements in the image?" Now this is going to be fun, so let's have a look: "A fighter jet flying in the sky with the words Q&A on ChatGPT". Absolutely right, you identified the fighter plane. Okay, now which fighter plane? "F-35". Not bad, not bad, the F-35.

So how does the pre-training happen? In two stages. In the first stage, the vision-language representation learning stage, as we call it, we connect our Q-Former, which sits in the middle, to the frozen image encoder, so only to one side, and perform the pre-training using image-text pairs: you have an image and a text description of its content, another image and another text description of its content, and so on. Then you have a loss function, and they found that the best way to do this is to have three different loss functions when you train in the first stage: an image-text contrastive loss, like the one our normal BERT systems were also trained on, an image-grounded text generation loss, and an image-text matching loss. All the details you find in their original research paper.

Just to give you an idea: we have here, in let's call it yellow, this monster, and in the very first pre-training step we dock our Q-Former to our image encoder, which is frozen. Now, as I told you, we have two modules within our Q-Former, an image Transformer and a language Transformer, and they share the self-attention layers; this is more or less the main idea behind it. The image Transformer extracts a fixed number of output features from the image encoder, independent of the image resolution of course, and receives learnable query embeddings as input. This here is our input; I will show you in a second. Then it runs through, we have our attention masking, and we are able to calculate our three different losses, our three different objectives. And it is beautifully described here how the self-attention mechanism works across the modules.

Now you might say: okay, and the second step? Well, in the second step, you are not going to believe it, we take the Q-Former, which sits in the middle, and connect it to the other side for the vision-to-language generative learning, where it is now connected to the frozen LLM; in our case it will be a Flan-T5 XXL model. The query embeddings that we get out now carry the visual information relevant to the text, since it passes through an information bottleneck, and these embeddings are now used as a visual prefix (I will show you in a second what I mean) to the input of the large language model. This pre-training phase effectively involves an image-grounded text generation task using a causal language modeling loss.

Let's have a visualization of this. The people who wrote the research paper decided to cover two different kinds of LLMs. You can have a decoder-only LLM, but if we are working with Flan-T5 we have a full Transformer stack, so an encoder stack and a decoder stack, and of course we are going to use the second option here for Flan-T5. So our input image comes into our image encoder, which is frozen; the first step is now done and we have the output of our Q-Former. This runs through a fully connected layer that linearly projects the output query embeddings Z into the same dimension as the text embeddings of the LLM. And this is the beauty: if you only have a decoder, it goes in directly; if you have to feed an encoder and a decoder stack, you have to take care of this. The projected embeddings function as a soft visual prompt and condition the LLM on the visual representation extracted by the Q-Former. The details you find in the paper; I just want you to understand that it is a two-step process: first it goes into the Q-Former, then you have a fully connected layer as input to the LLM, and there the next-word generation, the generative AI, takes place.

Now the absolute beauty of this methodology from the end of January 2023: when they published BLIP-2 they used a Vision Transformer, and for the large language model, as I showed you, you can use the Flan-T5 model, but you are not restricted to this in this pre-training approach. Since you freeze both the vision model and the language model, you can combine almost any visual backbone with any large language model for this specific vision-language model development, where you train only this Q-Former. So we get a complete pre-trained vision-language model out of the pipeline.

Now you might say, isn't that beautiful? And I agree with you. But this was the theory, so that you understand what we are going to code next time, when we code every single step. We will build our own app, maybe we will even do a Gradio interface, and we will end up with an operational vision-language Transformer model you can use and apply: you use an image as input, then you can start to chat, and the system automatically understands the content of the image and responds to you in this chat correspondingly. I say thank you, I hope you enjoyed it a little bit, and see you in my next video.
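The data flow described in the transcript (learnable queries read a frozen encoder's features, and a fully connected layer projects the result into the LLM's embedding dimension as a visual prefix) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the sizes, the random weights, the single attention head without learned key/value projections, and the helper names `cross_attention`, `visual_prefix` and `build_llm_input`. It is not the BLIP-2 implementation, only a small model of why the number of output vectors does not depend on the image resolution.

```python
import math
import random

random.seed(0)


def rand_matrix(rows, cols, scale=0.1):
    """Random matrix as a list of rows (stand-in for real weights or features)."""
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    """(n x k) @ (k x m) -> (n x m) for lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, image_feats):
    """Single-head scaled dot-product attention: queries attend to image patches.

    Simplification: no learned key/value projections, and queries and image
    features share one dimension.
    """
    d = len(queries[0])
    keys_t = [list(col) for col in zip(*image_feats)]          # d x n_patches
    scores = matmul(queries, keys_t)                            # n_queries x n_patches
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, image_feats)                         # n_queries x d


# Toy sizes (hypothetical, far smaller than any real model).
NUM_QUERIES, D_VISION, D_LLM = 4, 8, 16

learned_queries = rand_matrix(NUM_QUERIES, D_VISION)  # trainable in the real method
projection = rand_matrix(D_VISION, D_LLM)             # trainable fully connected layer, bias omitted


def visual_prefix(image_feats):
    """Frozen-encoder features -> fixed number of vectors in the LLM's dimension."""
    z = cross_attention(learned_queries, image_feats)
    return matmul(z, projection)


def build_llm_input(image_feats, text_embeds):
    """Prepend the projected query outputs to the text embeddings (soft visual prompt)."""
    return visual_prefix(image_feats) + text_embeds


low_res = rand_matrix(10, D_VISION)    # stand-in for 10 patch features from the frozen encoder
high_res = rand_matrix(50, D_VISION)   # more patches, e.g. a larger image
question = rand_matrix(5, D_LLM)       # stand-in for 5 embedded text tokens

print(len(visual_prefix(low_res)), len(visual_prefix(high_res)))  # same count for both
print(len(build_llm_input(high_res, question)))                   # prefix rows + text rows
```

Because the queries, not the patches, determine the number of output rows, the frozen LLM always receives a prefix of the same length regardless of how many patch features the image encoder produced.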

Original Description

Combined Vision-Language Transformers, interlinked w/ a Q-Former, a Querying Transformer! BLIP 2. BLIP-2! The financial resources for pre-training both systems (Vision and Language) are astronomical? Let me introduce you to a clever, new training method: BLIP-2. Multimodal Large Language Models for visual QA or perception-language tasks, multimodal dialogue, or image captioning, and image recognition with verbal content descriptions, plus a Chat function. Visual Perception and Large Language Models: The new combination in Transformers. Multi-modal Large Language Models for visual QA or image captioning. All rights and credits w/: BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models https://arxiv.org/abs/2301.12597 #ai #machinelearning #chatgpt #vision #llm #BLIP2 #QFormer
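To make the cost argument concrete, here is a minimal bookkeeping sketch of the "freeze both sides, train only the bridge" idea. The parameter counts are rough placeholders chosen only to show orders of magnitude (the 11-billion figure echoes the Flan-T5 XXL size named in the playlist; the other two are assumptions, not numbers taken from the paper), and the `Module` class and helper functions are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Module:
    name: str
    num_params: int
    trainable: bool


def modules_to_train(modules):
    """Names of the modules whose weights would receive gradient updates."""
    return [m.name for m in modules if m.trainable]


def trainable_fraction(modules):
    """Share of all parameters that is actually being trained."""
    total = sum(m.num_params for m in modules)
    trainable = sum(m.num_params for m in modules if m.trainable)
    return trainable / total


# Placeholder sizes for illustration only.
pipeline = [
    Module("frozen image encoder", 1_000_000_000, trainable=False),
    Module("Q-Former", 200_000_000, trainable=True),
    Module("frozen LLM", 11_000_000_000, trainable=False),
]

print(modules_to_train(pipeline))
print(f"{trainable_fraction(pipeline):.1%} of all parameters are trained")
```

With these placeholder sizes only a small percentage of the combined parameters is updated, which is the reason the description calls the method clever compared with pre-training both large models.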

Playlist

Uploads from Discover AI · Discover AI · 54 of 60

Step Into the Unknown (by YouChat) - May 2023 be your best year yet
Wishing you all an amazing 2023 filled with Love, Laughter, and Happiness!
Create a Smarter Future!
The Art of Text to Vector Transformation: A Comprehensive Look at AI and NLP Transformers
Feature Vectors: The Key to Unlocking the Power of BERT and SBERT Transformer Models
Domain-Specific AI Models: How to Create Customized BERT and SBERT Models for Your Business
Achieve Unimaginable Levels of Domain Knowledge through SBERT Extreme in 3D (SBERT 48)
Unlocking Scientific Domain Knowledge w/ BPE Tokenizer: An Amazing Journey! (SBERT 49)
SBERT Extreme 3D: Train a BERT Tokenizer on your (scientific) Domain Knowledge (SBERT 50)
Discover Vision Transformer (ViT) Tech in 2023
Pre-Train BERT from scratch: Solution for Company Domain Knowledge Data | PyTorch (SBERT 51)
Flan-T5-XL model on a free COLAB | A free LLM - that explains itself w/ reasoning /write essay | AI
BERT and GPT in Language Models like ChatGPT or BLOOM | EASY Tutorial on Large Language Models LLM
Free Alternative to ChatGPT: Flan-T5-XL GUI (open-source) #shorts
From T5 to T5X: A Game-Changing Evolution with JAX & FLAX
How to start with ChatGPT? | Short Introduction to OpenAI API #shorts
The Future of Conversational AI? Google's PaLM w/ RLHF | LLM ChatGPT Competitor
Microsoft and ChatGPU
From Zero to FLAN-T5 XL Model GUI with Gradio: A Step-by-Step Guide on Free COLAB Notebook PyTorch
Google's 2nd Answer to "BING ChatGPT": Sparrow | after BARD w/ LaMDA | 2nd Gen Conversational AI
TF2: Pre-Train BERT from scratch (a Transformer), fine-tune & run inference on text | KERAS NLP
3D Visualization for BERT: How to Pre-Train with a New Layer & Fine-Tune with Downstream Task Layer
FLAN-T5-XXL on NVIDIA A100 GPU w/ HF Inference Endpoints, let's explore 11b models!
ChatGPT - Can it Lie to you?
ChatGPT Alternative: Perplexity by Perplexity.AI
2023 KerasNLP Tutorial: Explore Latest KERAS Toolbox & NLP Processing Library for BERT - TF2
Self-aware AI: You.com/chat vs Perplexity.ai | Live Demo, LLMs show Future of ChatGPT w/ BING
BLOOM 176B Inference on AWS | Bigger than GPT-3 for more Power!
Fine-tune ChatGPT? Buy Embeddings /OpenAI? What are Embeddings? My own ChatGPT? | Visual Q+A
Unleashing the Power of BLOOM 176B with AWS ml.p4de.24xlarge, DJL & DeepSpeed: The Ultimate Boost!
After ChatGPT: NEW BioGPT by Microsoft | Do YOU trust Microsoft for your Medication?
Improve ChatGPT: Modular, Adaptive, Smart LLM | Inside ChatGPT
Fine-tune ChatGPT w/ in-context learning ICL - Chain of Thought, AMA, reasoning & acting: ReAct
The Intersection of Copyright Law and Human Faces: Exploring Virtual K-Pop with MAVE
New TECH: Vision Transformer 2023 on Image Classification | AI
PyTorch code Vision Transformer: Apply ViT models pre-trained and fine-tuned | AI Tech
New BING ChatGPT: Unlock the Power of Emotions in your Search Engine!
New BING ChatGPT loses its mind
Self-Attention Heads of last Layer of Vision Transformer (ViT) visualized (pre-trained with DINO)
Visualizing the Self-Attention Head of the Last Layer in DINO ViT: A Unique Perspective on Vision AI
Microsoft strongly restricts access to ChatGPT on new BING - WHY?
PyTorch ViT: The Ultimate Guide to Fine-Tuning for Object Identification (COLAB)
New BING Chat AGGRESSIVE
Panoptic Image Segmentation: Mask2Former explained | Identify all objects!
Code Panoptic Image Segmentation w/ Vision Transformer & Mask2Former - A PyTorch tutorial
Dream Job Alert: AI Prompt Engineer - $335K | AI Prompt Design: A Crash Course
Streamlining Similar Image Detection with ViT in PyTorch: A Step-by-Step Guide
Microsoft's CEO in Trouble #shorts
Why wait for KOSMOS-1? Code a VISION - LLM w/ ViT, Flan-T5 LLM and BLIP-2: Multimodal LLMs (MLLM)
OpenAI's ChatGPT can NOW summarize external Sources on the Internet?
ChatGPT polarizes
Hospital /Clinic AI Decision Models: Performance of 12 AI LLM Systems (incl $$) Radiology, Biomed
ChatGPT Prompt Engineering w/ in-context learning (ICL) - 7 Examples | Tutorial
Chat with your Image! BLIP-2 connects Q-Former w/ VISION-LANGUAGE models (ViT & T5 LLM)
ChatGPT: Multidimensional Prompts
ChatGPT: In-context Retrieval-Augmented Learning (IC-RALM) | In-context Learning (ICL) Examples
Code your BLIP-2 APP: VISION Transformer (ViT) + Chat LLM (Flan-T5) = MLLM
Buy Microsoft "Azure OpenAI Service" or buy from OpenAI its API for ChatGPT access & tuning?
Pretraining vs Fine-tuning vs In-context Learning of LLM (GPT-x) EXPLAINED | Ultimate Guide ($)
Reversible Transformer: ReFORMER for GPU Memory Optimization! Reversible Residual Layers?

This video introduces BLIP-2, a pre-training method in which a Q-Former (Querying Transformer) bridges a frozen image encoder and a frozen large language model, and demonstrates its use in image-to-text tasks such as visual question answering. It covers the architecture of the Q-Former and the two-stage training method: vision-language representation learning against the frozen image encoder, followed by vision-to-language generative learning against the frozen LLM. The video covers the theory; the coding of each step is announced for a follow-up video.

Key Takeaways

Pre-train the Q-Former on image-text pairs using three loss functions (image-text contrastive, image-grounded text generation, image-text matching)
Connect the Q-Former to a frozen LLM for vision-to-language generative learning
Freeze both the vision model and the language model during pre-training and train only the Q-Former
Combine almost any visual backbone with any large language model for vision-language model development
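
Of the three first-stage losses mentioned in the video, the image-text contrastive loss is the easiest to illustrate. The sketch below is a simplified, assumption-laden version: each image is represented by a handful of query output vectors, the image-text similarity is taken as the highest cosine similarity between any query output and the text vector, and a symmetric cross-entropy pulls matching pairs together. The fixed temperature of 0.07, the toy vectors and all function names are illustrative choices, not values or code from the paper.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def image_text_similarity(query_outputs, text_vec):
    """Highest cosine similarity between any query output and the text vector."""
    t = normalize(text_vec)
    return max(dot(normalize(q), t) for q in query_outputs)


def cross_entropy(logits, target):
    """Negative log-softmax of the target entry (numerically stable)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]


def contrastive_loss(images, texts, temperature=0.07):
    """Symmetric image-to-text and text-to-image loss; pair i matches text i."""
    n = len(images)
    sim = [[image_text_similarity(img, txt) / temperature for txt in texts]
           for img in images]
    i2t = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    t2i = sum(cross_entropy([sim[j][i] for j in range(n)], i) for i in range(n)) / n
    return (i2t + t2i) / 2


# Two toy "images", each given as two query output vectors.
images = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.1, 0.9]],
]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]    # text i describes image i
shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]   # descriptions swapped

print(contrastive_loss(images, matched_texts))
print(contrastive_loss(images, shuffled_texts))
```

The loss is close to zero when each description sits next to its own image and large when the descriptions are swapped, which is the signal that trains the queries to extract text-relevant visual features.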

💡 BLIP-2 keeps the pre-trained image encoder and the pre-trained LLM frozen and trains only the Q-Former between them, which makes it far cheaper to build multimodal large language models for image-to-text tasks.
