Let's train Vision Language Models (VLMs) from scratch using just text-only LLMs!
This is a video about multimodal Vision Language Models, in which we take a simple text-only large language model (LLM) and give it vision capabilities. We visually explain the Querying Transformer (Q-Former) introduced in the BLIP-2 paper, cover all the code, and present a thorough step-by-step guide to training these VLMs yourself!
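At a high level, the recipe follows BLIP-2: a frozen Vision Transformer encodes the image into patch features, a small Q-Former with learned query tokens cross-attends to those features, and the query outputs are projected into the LLM's embedding space as "visual tokens". Below is a minimal PyTorch sketch of that bridge; the class name, dimensions, and single-block depth are illustrative assumptions, not the code from the video.

```python
# Minimal sketch (not the video's actual code) of the BLIP-2-style idea:
# a small set of learned query tokens cross-attends to frozen ViT patch
# features, and the result is projected into the embedding space of a
# frozen text-only LLM. All names and sizes here are illustrative.
import torch
import torch.nn as nn

class QFormerBridge(nn.Module):
    def __init__(self, num_queries=32, vit_dim=768, qformer_dim=768, llm_dim=2048):
        super().__init__()
        # Learned query tokens that "ask questions" of the image features.
        self.queries = nn.Parameter(torch.randn(1, num_queries, qformer_dim) * 0.02)
        self.img_proj = nn.Linear(vit_dim, qformer_dim)
        # One cross-attention + feed-forward block; BLIP-2 stacks many such
        # blocks (initialized from BERT) and interleaves self-attention.
        self.cross_attn = nn.MultiheadAttention(qformer_dim, num_heads=8, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(qformer_dim, 4 * qformer_dim), nn.GELU(),
                                 nn.Linear(4 * qformer_dim, qformer_dim))
        self.norm1 = nn.LayerNorm(qformer_dim)
        self.norm2 = nn.LayerNorm(qformer_dim)
        # Final projection into the LLM's token-embedding space.
        self.to_llm = nn.Linear(qformer_dim, llm_dim)

    def forward(self, vit_patch_features):          # (B, num_patches, vit_dim)
        kv = self.img_proj(vit_patch_features)      # (B, num_patches, qformer_dim)
        q = self.queries.expand(vit_patch_features.size(0), -1, -1)
        attn_out, _ = self.cross_attn(q, kv, kv)    # queries attend to image patches
        q = self.norm1(q + attn_out)
        q = self.norm2(q + self.ffn(q))
        return self.to_llm(q)                       # (B, num_queries, llm_dim) "visual tokens"
```

These visual tokens would then be concatenated in front of the text token embeddings before the (largely frozen) language model processes the sequence.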
To join our Patreon and support this channel financially, visit: https://www.patreon.com/NeuralBreakdownwithAVB
Members get access to all the behind-the-scenes material that goes into producing my videos, including code. Plus, it supports the channel in a big way and…
Chapters (9)
0:00 Intro
5:45 Vision Transformers
6:52 Coding ViT
8:52 Q-Former models
11:45 Coding Q-Former from a BERT
12:36 Cross Attention in Transformers
17:52 Coding Q-Formers
21:33 LoRA finetune Language Model
27:12 Summary
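The final chapter before the summary fine-tunes the language model with LoRA, so it learns to consume the Q-Former's visual tokens without updating all of its weights. Here is a from-scratch sketch of the idea; the layer sizes, rank, and choice of which projection to wrap are hypothetical examples, not the video's settings.

```python
# Minimal from-scratch LoRA sketch (illustrative, not the video's code):
# freeze a pretrained linear layer and learn a low-rank update on top,
# so only a tiny number of parameters is trained when adapting the LLM.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)       # frozen pretrained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))  # zero-init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen path plus the trainable low-rank correction.
        return self.base(x) + self.scale * (x @ self.lora_a.T) @ self.lora_b.T

# Hypothetical usage: wrap, say, one attention projection of the LLM.
layer = nn.Linear(2048, 2048)
adapted = LoRALinear(layer, rank=8, alpha=16)
out = adapted(torch.randn(2, 10, 2048))              # (2, 10, 2048)
```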