Code your BLIP-2 APP: VISION Transformer (ViT) + Chat LLM (Flan-T5) = MLLM
Key Takeaways
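This video demonstrates how to run a multimodal large language model (MLLM) with BLIP-2, which connects a frozen Vision Transformer (ViT) image encoder to a frozen LLM through a Q-Former, enabling image captioning and question answering about an image. The coded demo uses the Salesforce BLIP-2 OPT checkpoint with 2.7 billion parameters and shows the Flan-T5 checkpoints as a drop-in alternative. The model is loaded from the Hugging Face Hub with the Transformers library.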
Full Transcript
Hello community! So today we're gonna code BLIP-2, a vision-language model. Remember the frozen image encoder and the large language model, and our BLIP-2 in between those two? Yes, this is exactly what we're going to do.

So, runtime: we need a GPU. GPU class premium? No, I'm broke. A GPU accelerator? Yes, please. So here we go: pip install transformers and datasets, the libraries from Hugging Face. Beautiful.

Then, from transformers, we have the BLIP-2 vision configuration (Blip2VisionConfig) and the BLIP-2 Q-Former configuration (Blip2QFormerConfig), then, because we're gonna use a specific model, a configuration for it, and the BLIP-2 configuration (Blip2Config). Downloading, downloading, downloading, yes, beautiful. And the most important one, Blip2ForConditionalGeneration: this is what we're gonna use.

Just to show you what we need: we have the configuration, then we initialize a model from it with some random weights, and I want to show you the model configuration. So here we go. This is not nice. Error: cannot import name Blip2VisionConfig from transformers. Oh no, don't tell me it is so brand new that it is not included yet in the official release. Okay, just a second, I'll be back in a second.

So here we are again. We now install the bleeding-edge latest version, the nightly version or whatever: pip install with git+https. So let's do this. Yes, here we go, git clone, beautiful. So we are now getting the latest version, and I hope BLIP-2 is already included, otherwise we're gonna have a problem coding this thing. Let's have a look. Ah, RAM, this is nice: the resources are now shown here in this little thing, that's cute. Resolved to commit... what's happening? Installing dependencies, yes. Building wheels, yes, here we go. Successfully built transformers, the latest version. It uninstalled 4.26.1, and you must restart the runtime. Yes, beautiful.
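The import error above comes down to the installed release: 4.26.1 did not ship the BLIP-2 classes, while the 4.27.0.dev0 build installed from source did. Below is a minimal sketch of a version guard, assuming BLIP-2 support begins with the 4.27 series. The helper name supports_blip2 is hypothetical, the parsing is deliberately simple, and the install command in the comment is the usual source-install form, not a line quoted from the video.

```python
import re

# Assumption: BLIP-2 classes are available from the 4.27 series onward
# (4.26.1 failed in the video, 4.27.0.dev0 worked).
MIN_BLIP2_VERSION = (4, 27)


def supports_blip2(version: str) -> bool:
    """Return True if a transformers version string is new enough for BLIP-2."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= MIN_BLIP2_VERSION


# Usual source install when the released version is too old
# (run in a shell or notebook cell, then restart the runtime):
#   pip install git+https://github.com/huggingface/transformers
```

In a notebook you could call supports_blip2(transformers.__version__) right after importing transformers and only fall back to the source install when it returns False.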
Everything is back again. Ladies and gentlemen, let's try the import again. Are we online? Yes. From transformers, Blip2ForConditionalGeneration, this is what we need, and with a little bit of luck we are able to run it now. Come on, show me how beautiful you are. You can do it, baby. Come on, you have access to it, I know, it's not that complicated. Oh, it takes some time. Here we are. Yes, successful!

So what have we achieved? Our configuration, the BLIP-2 configuration: now we have everything here. If you want to have a look at this, you can now choose every parameter, you can modify every parameter: output attentions, hidden states, cross-attention, the text configuration, dropout rate, early stopping, my goodness. So everything of two complete systems, a complete vision system and a complete language system, plus the interface, our Q-Former. So whenever you felt there were not enough parameters that you could manipulate: welcome.

If you use it, I hope you can install it with the normal pip install transformers command; otherwise you have to do it like I do here.

It's a class, of course, the BLIP-2 configuration, and you have the vision configuration, the Q-Former configuration, the text configuration, the number of query tokens, and any additional keyword arguments, of course. I don't have to show you this. And this is now the BLIP-2 vision configuration, the pure vision part. You can have a look at the vision configuration in detail, so you focus on just one component, but I guess I don't have to show you this, you get it. The parameters of the BLIP-2 vision configuration are a little bit more: attention dropout, dropout, initializer, initializer factor. Yes.

So we are using transformers version 4.27.0.dev0, so this seems to be the latest version. And let's do a pip install accelerate, our little accelerator.
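The configuration walk-through above (vision configuration, Q-Former configuration, text configuration, number of query tokens) can be sketched as follows. This is an illustration under assumptions, not the notebook from the video: the function name build_blip2_config is hypothetical, OPT is assumed as the language model, and the attention_dropout override in the usage note is only an example of changing one parameter.

```python
def build_blip2_config(vision_overrides=None, num_query_tokens=32):
    """Compose a BLIP-2 configuration from its three sub-configurations."""
    # Imported lazily so this module can be loaded without transformers installed.
    from transformers import (
        Blip2Config,
        Blip2QFormerConfig,
        Blip2VisionConfig,
        OPTConfig,
    )

    # Each part keeps its defaults unless overridden.
    vision_config = Blip2VisionConfig(**(vision_overrides or {}))
    qformer_config = Blip2QFormerConfig()
    text_config = OPTConfig()  # assumption: OPT as the language model

    return Blip2Config.from_vision_qformer_text_configs(
        vision_config,
        qformer_config,
        text_config,
        num_query_tokens=num_query_tokens,
    )
```

Calling build_blip2_config({"attention_dropout": 0.1}) returns a configuration object whose vision_config, qformer_config and text_config attributes can be inspected individually. Passing the result to Blip2ForConditionalGeneration would create a model with random weights, which allocates a large model in RAM, so the sketch stops at the configuration.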
It's just nonsense, because we have just one GPU, but never mind. So here we go. We have our processor, and we have chosen a specific model, the Salesforce BLIP-2 OPT model with 2.7 billion parameters, and our model is of course loaded with from_pretrained. We do not want to train anything; we just say, give me the pre-trained version, Hugging Face, thank you, I just want to use it. This is our little image here, and then I want model.generate and processor.batch_decode. So I want the system to analyze the picture and tell me what it sees in the picture, what the objects are, and then we can have a conversation with an LLM about the content.

Oops, RAM. Oh, this does not look good. Okay, we have to download here. I can make this a little bit smaller. System RAM spiked already, this is not good. As I told you, it's about, I think, 12 gigabytes, 15, 17 gigabytes for this small version with 2.7 billion parameters. If you go higher, to seven billion parameters, you have to have at least, I don't know. Oh, 12 gigabytes of system RAM: this could be close, this could be really close, because normally I use about 40 gigabytes of system RAM. So let's see. GPU RAM, does it help at all here? I don't know. It takes some time to load, and I guess I will be back with you when the 15 gigabytes are downloaded.

And we are back, and you can see we have 15 gigabytes here, and the speed is 200 megabytes per second, which is not bad. So the 15 gigs are done, but now comes the interesting part: system RAM. I have a feeling there might be a problem approaching real fast. So let's see, we have the download, and we will know within minutes. 15 gigabytes are down, great. And now what's happening? Thinking. Still thinking. Okay, good sign, still alive. We're fine with RAM, no problem at all, maybe I was wrong. Beautiful. What are you doing? Come on, tell me. No grad norm? Where are you? Still thinking. I just want to hear one line of text that tells me what the system could identify as objects in the picture, and then with the large language model it makes a nice sentence or a nice story or a nice whatever.

Oh, system RAM: nine gigabytes, going up. Nine is okay, everything below 11 and we are fine. 9.9, come on, you can do it. 9.9, yes, we are stable. 10.7, okay, but now we have to be stable. 10.7, 11.5, oh, we have red. Here you see the importance of RAM. Some viewers of mine ask, hey, how important is RAM? Absolutely everything. 11.8. Uh-oh. Loading checkpoint shards, yes, yes, we made it... Your session crashed. Gone. Your session crashed after using all available RAM. If you are interested in access to high-RAM runtimes, you may want to check out Colab Pro.

Colab Pro. Do they give us a trial, a free trial? Now it's 11 per month, and fifty dollars per month for Pro+. If you have a small job, maybe think about just buying 100 compute units for 11.10, use those units up and get an idea. And if you see that you really need this and that the infrastructure is enough, then I would say the next step is to go for Colab Pro and pay 11.10 euros per month. So you get faster GPUs and more memory. It's interesting that they don't tell you how much more memory, what the limit is there. And for European users, prices are displayed tax inclusive.

So this means the system crashed. This is great, because now what do I do? Do I have an A100 somewhere? No, everything is busy. I have an old laptop with 64 gigabytes of RAM, so if you give me a second I will switch to my laptop.

So now on my old laptop. My laptop has 64 gigabytes of RAM and no CUDA cores, I'm running on an AMD GPU here, so we'll have to do everything without a GPU: everything will be executed on the CPU. As I told you, we import transformers, beautiful. From transformers we have our AutoProcessor and Blip2ForConditionalGeneration. We are working in PyTorch; of course the whole system functions also in TensorFlow 2. And then it is easy: we have our processor, our BLIP-2 processor, and we have the BLIP-2 model from Salesforce, where they go for a 2.7 billion parameter model, and the model is of course from_pretrained. So we take a pre-trained model and our processor, and we are running on a CPU. Let's execute this. I'm just gonna show you in real time what to expect if you do this. I hope you can do it with a GPU somewhere in the cloud, where you have a computer with CUDA cores, but currently I have to do everything on the CPU. Loading checkpoint shards, yes. By the way, I already downloaded my Salesforce model about half an hour ago. Just so you know, it's about 15 gigabytes, so it took me half an hour to load.

And if you now want to see the model that we downloaded from Hugging Face, here we go. My goodness. So we have a Blip2ForConditionalGeneration, we have a vision model, which starts here with the vision embeddings, and then you can just go on. Please dive right into it and have a look at the complexity of the model. In my last video I explained in detail each layer, how they are connected and which layers are frozen, and here we now have the code implementation from the Salesforce team, and we are gonna use this model as it is. Where is it? Yes, the language model, here we go. You got the idea.

So what do we need? We need an image. Here is the classical image, you will always see this image, so we go with the standard. You can download it, it is available for everybody. As for the type of the image, it's a Pillow image. Beautiful. And now we have our processor, and here my image. Oops, yes, thank you, I want to go with the official image. Then we have model.generate on the inputs, then processor.batch_decode, and the final answer is what the transformer sees.

Now let's execute the model to get an answer. What do the vision transformer and the language model combined see in this picture? What is the output that the language model will generate? The final answer, what the vision transformer sees, is: two cats laying on the couch. Beautiful. I think it's okay. But now you might ask, hey, what about this TV remote control? Must the vision transformer identify the remote control, or only that there are two cats?

So we can of course have image captioning while providing a text prompt. I now have my prompt here, and I say: "Question: how many cats are there? Answer:". That is the typical clear structure for our vision-language model. It is thinking, it is still thinking. Same image, my cats. So the question was how many cats are there, and the answer is: two. Yep, two cats. Okay.

What else can I show you? Oh yeah, if we have prompts, we can use a prompt like this. Here we go with a prompt that is a strange prompt, it has nothing to do with the image. We say: "No, a car is". The vision transformer is now identifying what it sees, and then our language model must make a sentence out of this. So let's try this. Somehow we have "No, a car is" and something with cats, this is what we expect. But let's have a look. Here max_new_tokens is 50.
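The three calls described so far (a plain caption, a "Question: ... Answer:" prompt, and a free-form prompt to continue) can be sketched as below. This is a rough reconstruction, not the notebook shown in the video: the helper names are hypothetical, the image URL is an assumption (the COCO two-cats picture commonly used in Hugging Face examples), and demo() downloads about 15 gigabytes of weights, so it is defined but not called.

```python
def build_vqa_prompt(question: str) -> str:
    """Format a question in the 'Question: ... Answer:' style used in the video."""
    return f"Question: {question.strip()} Answer:"


def generate_text(model, processor, image, prompt=None, max_new_tokens=50):
    """Caption an image (prompt=None) or continue a text prompt about it."""
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()


def demo():
    """Run the three prompts from the video on CPU (slow, large download)."""
    # Heavy imports live here so the helpers above load without them.
    import requests
    from PIL import Image
    from transformers import AutoProcessor, Blip2ForConditionalGeneration

    model_id = "Salesforce/blip2-opt-2.7b"
    processor = AutoProcessor.from_pretrained(model_id)
    model = Blip2ForConditionalGeneration.from_pretrained(model_id)

    # Assumption: the standard two-cats example image.
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

    print(generate_text(model, processor, image))  # plain caption
    print(generate_text(model, processor, image,
                        build_vqa_prompt("how many cats are there?")))
    print(generate_text(model, processor, image, "No, a car is"))
```

Keeping the prompt formatting in its own function makes it easy to try other questions without touching the generation code.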
Well, it comes back with the answer: "No, a car is not a cat." Okay. But you see exactly what you can do here with your prompts: it responds to the prompts that you are giving the system. So remember I asked, hey, what about the remote control? Let's do this with a prompt. My prompt is "The cat is lying next to", give me an answer. What I'm hoping for is that there might now be either the sofa, the couch, or the remote control in the answer that is generated by the system. Let's see, it takes some time to think. "Cat is lying next to the remote control." Yes, exactly, this is what I wanted. Funny, this is kind of a mirror image, so both cats are lying right next to their own remote control. So yes, this is nice. But you see the general concept here if you combine a language model and a vision model.

And just so I don't show you only this image, let's change this a little bit. I now have my own image. Let me just show you my image. You know this image, this is our pyramid. Then I ask for the final answer, what the vision transformer sees, and I would expect a pyramid. The system is thinking. Remember, this is a laptop that is working only on a CPU, so if you have a GPU in the cloud you are much faster. And the final answer, what the transformer sees, is: the Great Pyramid of Giza. Yes, absolutely. I don't know if it's really that one, but it's one of those three. Beautiful. So you see, this works great.

[Music] The system is thinking, still thinking. Question: when were they built? Answer: the pyramids were built between 2550 and 2500 BC, before the common era. So you see, we now have the power of a language model here. It's not really a ChatGPT, it's just a small model, as I showed you which model I've chosen: the Salesforce BLIP-2 OPT with 2.7 billion parameters. But of course now you might say, hey, wait a minute, but you did not... We have Hugging Face here, so let's have a look. Come on, let's explore this on the model side.
We have Salesforce here, and we're going for a BLIP-2 model. Do we have a T5? Here we go, hey, look, this is so beautiful: we have a Salesforce BLIP-2 Flan-T5 XL. Oh yes, we can use this. So either you go with the OPT model, or, what I would recommend, you go with a Flan-T5 XL, or even, if you have the GPU and the memory for it, what about the Flan-T5 XXL? Oh yes. Salesforce was so great, they put up the BLIP-2 Flan-T5 XXL model here. Now this is nice. And as you can see, BLIP-2 consists of three models: a CLIP-like image encoder, a Querying Transformer or Q-Former, and an LLM. And you know which LLM it is here, it's a Flan-T5 XXL. Everything we know: image captioning, visual question answering, chat-like conversations. Yes.

No, this was the other one. Where's mine? This one here. I just want to show you: this is the T5 XXL, this is the T5 XL. Let's take the T5 XL. So if you wanna have fun, you just go here and, instead of the Salesforce BLIP-2 model from before, you insert the BLIP-2 Flan-T5 XL. And of course here too, sorry, of course for both our processor and our model. Then you're ready to go, and you can do this now with a Flan-T5 XXL, depending on what memory you have available on your cluster.

So finally we managed that I could at least show you what it looks like. And you know what, if we want to go with a Flan-T5 XXL, maybe Hugging Face has it operational somewhere. Oh, "Inference API has been turned off for this model". Oh come on, this is not okay. Since it is an XXL model, inference has been turned off. But you know what? Salesforce BLIP-2.
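Swapping checkpoints means changing the same model id for both the processor and the model, and which one is feasible depends on memory. The sketch below gives a back-of-the-envelope estimate of weight size only. All numbers are approximations and assumptions: the language model sizes are the commonly quoted parameter counts, the 1.1 billion figure for the image encoder plus Q-Former is a rough estimate, and the function names are hypothetical.

```python
# Hub ids of the checkpoints named in the video, with approximate
# language-model sizes in billions of parameters (not measurements).
CHECKPOINTS = {
    "Salesforce/blip2-opt-2.7b": 2.7,
    "Salesforce/blip2-flan-t5-xl": 3.0,
    "Salesforce/blip2-flan-t5-xxl": 11.0,
}

# Assumption: frozen image encoder plus Q-Former add roughly 1.1B parameters.
VISION_AND_QFORMER_B = 1.1


def estimate_weight_gb(model_id: str, bytes_per_param: int = 4) -> float:
    """Rough size of the weights alone (no activations), in gigabytes."""
    total_billion = CHECKPOINTS[model_id] + VISION_AND_QFORMER_B
    # billions of params * bytes per param = gigabytes (1 GB taken as 1e9 bytes)
    return total_billion * bytes_per_param


def largest_that_fits(ram_gb: float, bytes_per_param: int = 4):
    """Pick the largest listed checkpoint whose weights fit in ram_gb, or None."""
    fitting = [m for m in CHECKPOINTS
               if estimate_weight_gb(m, bytes_per_param) <= ram_gb]
    return max(fitting, key=lambda m: CHECKPOINTS[m]) if fitting else None


def load_blip2(model_id: str):
    """Load processor and model from the same id, as stressed in the video."""
    # Imported lazily so the estimators above work without transformers.
    from transformers import AutoProcessor, Blip2ForConditionalGeneration

    processor = AutoProcessor.from_pretrained(model_id)
    model = Blip2ForConditionalGeneration.from_pretrained(model_id)
    return processor, model
```

With 4 bytes per parameter the OPT checkpoint comes out at roughly 15 gigabytes, which is consistent with the download size mentioned in the video and with the crash on a runtime that had about 12 gigabytes of system RAM. Actual memory use is higher than this estimate, because loading and generation need working memory on top of the weights.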
Hey, what do you think, we go with Spaces. On Spaces we have Salesforce BLIP-2, and here we have our XXL model. Chat input: "tell me a story about". Here we have beam search, yep, temperature, length penalty. Length penalty: set to larger for longer sequences. Yes, let's do this. So we have our picture, we have our vision transformer, we have our LLM, and now we say in the chat input: tell me a story about the content, which the system must now deduce. It has to analyze the image and come up with a nice story created by our LLM. Nobody is there, it's an empty queue.

"A time when you were a little girl and you wanted to be a ballerina, but you could not afford to go to the ballet." Yes, this is exactly how I feel: I was a little girl and could not afford to go to the ballet. This is why I became a theoretical physicist and a computer scientist. Hey, this is nice.

So, chat input, submit. I have no idea what I should ask the system. "Photo of a woman with a light bulb in her hair." Beautiful. So you get the idea. You see, with Salesforce BLIP-2 you can code your own vision-language transformer model, whatever you want to use it for, whatever picture you are gonna take. This is nice: just put in an image and start chatting about the content of this image with your LLM. I hope you enjoyed it, and I'll see you in my next video.
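The controls seen in the Spaces demo (beam search, temperature, length penalty) correspond to keyword arguments of the generate method in Transformers. A minimal sketch of how such settings might be assembled when running the model yourself follows. The function name and the default values are hypothetical and are not the settings used by the demo.

```python
def build_generation_kwargs(use_beam_search=True, num_beams=5, temperature=1.0,
                            length_penalty=1.0, max_new_tokens=50):
    """Assemble keyword arguments for model.generate from demo-style controls."""
    kwargs = {"max_new_tokens": max_new_tokens}
    if use_beam_search:
        # length_penalty applies to beam-based generation;
        # larger values favour longer sequences.
        kwargs.update(num_beams=num_beams, do_sample=False,
                      length_penalty=length_penalty)
    else:
        # temperature only has an effect when sampling is enabled.
        kwargs.update(do_sample=True, temperature=temperature)
    return kwargs
```

For a longer story you might call model.generate(**inputs, **build_generation_kwargs(length_penalty=2.0, max_new_tokens=100)), where inputs comes from the processor as in the earlier steps of the video.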
Original Description
BLIP-2: Upload an image, the vision transformer will analyze the content of the image, and an LLM will tell you a story about it or answer your questions about the picture. We'll use Flan-T5 and a Vision Transformer, interlinked w/ a Q-Former (BLIP-2). Multimodal LLM w/ BLIP-2.
Example: if you upload a picture of the Great Pyramid in Egypt and you prompt (ask) the system "When was it built?", the ViT will tell the LLM that the image shows the pyramids of Giza, and therefore the LLM (ChatGPT or T5) will tell you: "The Great Pyramid was built about 2500 BCE, when the pharaoh ....... "
Easy, simple. Your own Vision-Language Transformer system with a Q-Former, utilizing the idea of BLIP-2.
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi
https://arxiv.org/abs/2301.12597
#ai
#vision
#naturallanguageprocessing
#machinelearning
#languagemodel