Let's train Vision Language Models (VLMs) from scratch using just Text-Only LLMs!

Neural Breakdown with AVB · Advanced · 🧠 Large Language Models · 1mo ago
This is a video about multimodal Vision Language Models, in which we take a simple text-only large language model (LLM) and give it vision capabilities. We visually explain the Querying Transformer (Q-Former) introduced in the BLIP-2 paper, cover all the code, and present a thorough step-by-step guide to training these VLMs yourself! To join our Patreon and support this channel financially, visit https://www.patreon.com/NeuralBreakdownwithAVB. Members get access to everything behind-the-scenes that goes into producing my videos, including code. Plus, it supports the channel in a big way and…
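
As a rough illustration of the idea in the description, here is a minimal, hypothetical sketch of a Q-Former-style bridge in PyTorch: a small set of learnable query vectors self-attends, cross-attends to frozen ViT patch features, and is projected into the LLM's embedding space to act as soft visual tokens. The dimensions, layer count, and number of queries are illustrative assumptions, not the video's exact code.

```python
# Minimal sketch of a Q-Former-style bridge (illustrative, not the video's code).
# Learnable queries pull information from frozen ViT patch features via
# cross-attention, then get projected into the LLM's token-embedding space.
import torch
import torch.nn as nn

class TinyQFormer(nn.Module):
    def __init__(self, vit_dim=768, llm_dim=2048, num_queries=32, num_heads=8):
        super().__init__()
        # Fixed set of learnable query vectors (assumed size: 32 queries)
        self.queries = nn.Parameter(torch.randn(num_queries, vit_dim) * 0.02)
        self.self_attn = nn.MultiheadAttention(vit_dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(vit_dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(vit_dim)
        self.norm2 = nn.LayerNorm(vit_dim)
        self.proj = nn.Linear(vit_dim, llm_dim)  # map queries into the LLM embedding space

    def forward(self, image_feats):
        # image_feats: (batch, num_patches, vit_dim) from a frozen ViT
        b = image_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        # Queries attend to each other (self-attention) ...
        q = self.norm1(q + self.self_attn(q, q, q)[0])
        # ... then attend to the frozen image features (cross-attention)
        q = self.norm2(q + self.cross_attn(q, image_feats, image_feats)[0])
        # (batch, num_queries, llm_dim): soft "visual tokens" fed to the LLM
        return self.proj(q)
```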

Chapters (9)

0:00 Intro
5:45 Vision Transformers
6:52 Coding ViT
8:52 Q-Former models
11:45 Coding Q-Former from a BERT
12:36 Cross Attention in Transformers
17:52 Coding Q-Formers
21:33 LoRA fine-tuning the Language Model
27:12 Summary
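
The chapter at 21:33 fine-tunes the language model with LoRA, so the frozen base weights stay untouched while small low-rank adapters learn to work with the new visual tokens. The snippet below is an assumed setup using Hugging Face's `peft` library with GPT-2 as a stand-in base model; the actual model and target modules used in the video may differ.

```python
# Illustrative LoRA setup (assumed, not taken from the video): only small
# low-rank adapter matrices are trained; the base LLM weights stay frozen.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_llm = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in model name
lora_cfg = LoraConfig(
    r=16,                       # rank of the low-rank update
    lora_alpha=32,              # scaling factor for the update
    target_modules=["c_attn"],  # GPT-2's attention projection; varies by model
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
llm = get_peft_model(base_llm, lora_cfg)
llm.print_trainable_parameters()  # only the LoRA adapters are trainable
```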