The Future of Vision in ML | Merve Noyan | HF Podcast #1

Hugging Face · Beginner · 👁️ Computer Vision · 1mo ago
In this episode, we sit down with Merve to talk about where vision AI is heading: from early computer vision systems to modern multimodal models, world models, robotics, and open-source AI. We discuss LLaVA, IDEFICS, Vision Transformers, CNNs, JEPA, V-JEPA, Genie 3, OpenClaw, IMCP, PaliGemma, ColPali, ColQwen, and why Hugging Face has become such a central part of the open ecosystem.

## Connect with Merve Noyan, the open-sourceress 👇

- X (Twitter): https://x.com/mervenoyann
- LinkedIn: https://www.linkedin.com/in/merve-noyan-28b1a113a/
- Personal site: https://merveenoyan.github.io/me/
- GitHub: https://github.com/merveenoyan

## Chapters

- 00:00 Intro: vision, Hugging Face, and the future of AI
- 00:31 Why vision feels different now
- 03:58 LLaVA, IDEFICS, and multimodal training
- 08:56 CNNs, ViTs, and older vision architectures
- 15:46 How vision models could reach everyday users
- 16:50 World models, JEPA, V-JEPA, Genie 3, and robotics
- 25:44 OpenClaw, IMCP, and agent safety
- 28:01 Small vision models, fine-tuning, and getting started
- 34:39 Why Hugging Face matters in open source AI
- 42:49 PaliGemma, ColPali, ColQwen, and vision retrieval
- 47:26 Before Hugging Face: how models were shared
- 49:48 Mentors, culture, and closing thoughts

If you enjoyed the episode, subscribe for more conversations about open models, multimodal systems, and the future of AI.

