Fine-Tune Vision Language Models (VLMs) Like a Pro: Live Demo + Benchmarks | Predibase Webinar

Predibase by Rubrik · Advanced · 🧠 Large Language Models · 7mo ago
Multimodal AI is no longer optional; it's the future. In this in-depth webinar, the ML experts at Predibase break down everything you need to know about Vision Language Models (VLMs), from architectures and use cases to training, inference, and real-world performance.
✅ Learn why fine-tuning open-source VLMs often beats closed models like GPT-4V
✅ See a live demo of fine-tuning a Pokémon card captioning model
✅ Get benchmark results showing performance boosts over GPT-4
✅ Discover real-world use cases: healthcare, retail, drive-thrus, content moderation & more
✅ Understand the challenges in training & serving VLMs
Watch on YouTube ↗

Chapters (19)

0:00 Intro & Speakers
1:45 Why Multimodal AI Matters
4:20 Real-World Multimodal Use Cases (Amazon, Duolingo, Converse Now)
7:10 Developer Interest in Open-Source VLMs
9:40 What Are Vision Language Models (VLMs)?
11:05 VLM Architecture: Encoder, Projector, Decoder
14:00 Popular Components: CLIP, LLaMA, Qwen
16:00 Prompting & Use Cases (Image QA, Captioning, Video Analysis)
19:15 Strengths & Limitations of VLMs
21:00 VLMs vs Humans: Reading Handwritten Text
23:00 Biases & Benchmark Failures in VLMs
25:30 Open vs Closed Source: Who Wins in Vision?
28:00 Fine-Tuning for Accuracy in Specialized Tasks
31:20 Impact of Image Resolution on Token Count, Latency, Accuracy
34:00 Latency vs Resolution Trade-offs
36:00 VLM Fine-Tuning = Better Accuracy, Lower Cost
38:00 Challenges in Training & Serving VLMs
40:45 How Predibase Simplifies VLM Training & Inference
43:00 Live Demo: Fine-Tuning a Pokémon Card Captioning Model