The Future of Vision in ML | Merve Noyan | HF Podcast #1
In this episode, we sit down with Merve to talk about where vision AI is heading: from early computer vision systems to modern multimodal models, world models, robotics, and open source AI.
We discuss LLaVA, IDEFICS, Vision Transformers, CNNs, JEPA, V-JEPA, Genie 3, OpenClaw, IMCP, PaliGemma, ColPali, ColQwen, and why Hugging Face has become such a central part of the open ecosystem.
## Connect with Merve Noyan, the open-sourceress 👇
- X (twitter): https://x.com/mervenoyann
- LinkedIn: https://www.linkedin.com/in/merve-noyan-28b1a113a/
- Personal Site: https://merveenoyan.github.io/me/
- GitHub: https://github.com/merveenoyan
## Chapters
00:00 Intro: vision, Hugging Face, and the future of AI
00:31 Why vision feels different now
03:58 LLaVA, IDEFICS, and multimodal training
08:56 CNNs, ViTs, and older vision architectures
15:46 How vision models could reach everyday users
16:50 World models, JEPA, V-JEPA, Genie 3, and robotics
25:44 OpenClaw, IMCP, and agent safety
28:01 Small vision models, fine-tuning, and getting started
34:39 Why Hugging Face matters in open source AI
42:49 PaliGemma, ColPali, ColQwen, and vision retrieval
47:26 Before Hugging Face: how models were shared
49:48 Mentors, culture, and closing thoughts
If you enjoyed the episode, subscribe for more conversations about open models, multimodal systems, and the future of AI.
Watch on YouTube ↗