Let's train Vision Language Models (VLMs) from scratch using just text-only LLMs!
This is a video about multimodal Vision Language Models, in which we take a simple text-only large language model (LLM) and give it vision capabilities. We visually explain the Querying Transformer (Q-Former) introduced in the BLIP-2 paper, cover all the code, and present a thorough step-by-step guide to training these VLMs yourself!
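At a high level, the recipe follows BLIP-2: a frozen Vision Transformer encodes the image into patch features, a small Q-Former with learned query tokens cross-attends to those features, and the query outputs are projected into the LLM's embedding space as "visual tokens". Below is a minimal PyTorch sketch of that bridge; the class name, dimensions, and single-block depth are illustrative assumptions, not the code from the video.

```python
# Minimal sketch (not the video's actual code) of the BLIP-2-style idea:
# a small set of learned query tokens cross-attends to frozen ViT patch
# features, and the result is projected into the embedding space of a
# frozen text-only LLM. All names and sizes here are illustrative.
import torch
import torch.nn as nn

class QFormerBridge(nn.Module):
    def __init__(self, num_queries=32, vit_dim=768, qformer_dim=768, llm_dim=2048):
        super().__init__()
        # Learned query tokens that "ask questions" of the image features.
        self.queries = nn.Parameter(torch.randn(1, num_queries, qformer_dim) * 0.02)
        self.img_proj = nn.Linear(vit_dim, qformer_dim)
        # One cross-attention + feed-forward block; BLIP-2 stacks many such
        # blocks (initialized from BERT) and interleaves self-attention.
        self.cross_attn = nn.MultiheadAttention(qformer_dim, num_heads=8, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(qformer_dim, 4 * qformer_dim), nn.GELU(),
                                 nn.Linear(4 * qformer_dim, qformer_dim))
        self.norm1 = nn.LayerNorm(qformer_dim)
        self.norm2 = nn.LayerNorm(qformer_dim)
        # Final projection into the LLM's token-embedding space.
        self.to_llm = nn.Linear(qformer_dim, llm_dim)

    def forward(self, vit_patch_features):          # (B, num_patches, vit_dim)
        kv = self.img_proj(vit_patch_features)      # (B, num_patches, qformer_dim)
        q = self.queries.expand(vit_patch_features.size(0), -1, -1)
        attn_out, _ = self.cross_attn(q, kv, kv)    # queries attend to image patches
        q = self.norm1(q + attn_out)
        q = self.norm2(q + self.ffn(q))
        return self.to_llm(q)                       # (B, num_queries, llm_dim) "visual tokens"
```

These visual tokens would then be concatenated in front of the text token embeddings before the (largely frozen) language model processes the sequence.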
To join our Patreon and support this channel financially, visit: https://www.patreon.com/NeuralBreakdownwithAVB
Members get access to all the behind-the-scenes material that goes into producing my videos, including code. Plus, it supports the channel in a big way and…
Chapters (9)
0:00 Intro
5:45 Vision Transformers
6:52 Coding ViT
8:52 Q-Former models
11:45 Coding Q-Former from a BERT
12:36 Cross Attention in Transformers
17:52 Coding Q-Formers
21:33 LoRA finetune Language Model
27:12 Summary
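The final chapter before the summary fine-tunes the language model with LoRA, so it learns to consume the Q-Former's visual tokens without updating all of its weights. Here is a from-scratch sketch of the idea; the layer sizes, rank, and choice of which projection to wrap are hypothetical examples, not the video's settings.

```python
# Minimal from-scratch LoRA sketch (illustrative, not the video's code):
# freeze a pretrained linear layer and learn a low-rank update on top,
# so only a tiny number of parameters is trained when adapting the LLM.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)       # frozen pretrained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))  # zero-init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen path plus the trainable low-rank correction.
        return self.base(x) + self.scale * (x @ self.lora_a.T) @ self.lora_b.T

# Hypothetical usage: wrap, say, one attention projection of the LLM.
layer = nn.Linear(2048, 2048)
adapted = LoRALinear(layer, rank=8, alpha=16)
out = adapted(torch.randn(2, 10, 2048))              # (2, 10, 2048)
```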