Fine-Tune Vision Language Models (VLMs) Like a Pro: Live Demo + Benchmarks | Predibase Webinar
Multimodal AI is no longer optional—it's the future. In this in-depth webinar, the ML experts at Predibase break down everything you need to know about Vision Language Models (VLMs)—from architectures and use cases to training, inference, and real-world performance.
✅ Learn why fine-tuning open-source VLMs often beats closed models like GPT-4V
✅ See a live demo of fine-tuning a Pokémon card captioning model
✅ Get benchmark results showing performance boosts over GPT-4
✅ Discover real-world use cases: healthcare, retail, drive-thrus, content moderation & more
✅ Understand the challenges in training and serving VLMs
Watch on YouTube ↗
Chapters (19)
Intro & Speakers – 1:45
Why Multimodal AI Matters – 4:20
Real-World Multimodal Use Cases (Amazon, Duolingo, Converse Now) – 7:10
Developer Interest in Open-Source VLMs – 9:40
What Are Vision Language Models (VLMs)? – 11:05
VLM Architecture: Encoder, Projector, Decoder – 14:00
Popular Components: CLIP, LLaMA, Qwen – 16:00
Prompting & Use Cases (Image QA, Captioning, Video Analysis) – 19:15
Strengths & Limitations of VLMs – 21:00
VLMs vs Humans: Reading Handwritten Text – 23:00
Biases & Benchmark Failures in VLMs – 25:30
Open vs Closed Source: Who Wins in Vision? – 28:00
Fine-Tuning for Accuracy in Specialized Tasks – 31:20
Impact of Image Resolution on Token Count, Latency, Accuracy – 34:00
Latency vs Resolution Trade-offs – 36:00
VLM Fine-Tuning = Better Accuracy, Lower Cost – 38:00
Challenges in Training & Serving VLMs – 40:45
How Predibase Simplifies VLM Training & Inference – 43:00
Live Demo: Fine-Tuning a Pokémon Card Captioning Model