How Transformers Finally Ate Vision – Isaac Robinson, Roboflow

AI Engineer · Beginner · 👁️ Computer Vision · 6d ago
Vision used to belong to CNNs. This talk explains why that changed, and why transformers only recently started winning for vision despite looking like the less natural fit for images. The answer runs through pretraining, scaling, borrowed infrastructure from the LLM world, and the long arc back to the simple architecture that scales best. Tracing the evolution from ViT and Swin through ConvNeXt, Hiera, SAM, and RF-DETR, Isaac Robinson walks through what actually made transformer vision systems practical, where the tradeoffs still are, and why deployment flexibility now matters as much as raw benchmark wins. He closes with what comes next for VLMs, world models, and physical AI.

Speaker info: https://www.linkedin.com/in/robinsonish/
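The talk's starting point, ViT, reconciles transformers with images by slicing each image into fixed-size patches and projecting every patch to a token, so a standard text-style transformer can consume it. As a rough illustration (my own sketch, not code from the talk), here is the patch-embedding step with the common ViT-Base defaults of 224×224 inputs, 16×16 patches, and 768-dimensional tokens:

```python
# Minimal ViT-style patch embedding sketch (assumed ViT-Base defaults,
# not the speaker's implementation).
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Turn (B, 3, H, W) images into (B, N, D) patch tokens."""
    def __init__(self, img_size=224, patch_size=16, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided conv is the standard trick: one kernel application per
        # patch is exactly a linear projection of that patch's pixels.
        self.proj = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos = nn.Parameter(torch.zeros(1, self.num_patches, dim))

    def forward(self, x):
        x = self.proj(x)                  # (B, D, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)  # (B, N, D) token sequence
        return x + self.pos               # learned positional embeddings

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768]): 14x14 patches, ready for attention
```

Everything past this step is an off-the-shelf transformer encoder, which is precisely why the LLM world's pretraining and scaling infrastructure carried over.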

Related AI Lessons

Demystifying CNNs: How Convolutional Filters and Max-Pooling Actually Work
Learn how Convolutional Neural Networks (CNNs) use convolutional filters and max-pooling to recognize images
Medium · Data Science
Your "Biometric Age Check" Isn't Verifying Identity — And Defense Lawyers Know It
Biometric age checks don't verify identity, a crucial distinction for developers in computer vision and biometrics
Dev.to · AI
MoCapAnything V2: End-to-End Motion Capture for Arbitrary Skeletons
Learn about MoCapAnything V2, an end-to-end motion capture system for arbitrary skeletons, and its applications in 3D animation
Medium · Machine Learning
How I Built a Perceptual Color Quantization Engine for LEGO Mosaics
Learn how to build a perceptual color quantization engine for LEGO mosaics and improve image conversion
Dev.to · BMBrick