An image is worth NxN words | Diffusion Transformers (ViT, DiT, MMDiT)

Julia Turc · Beginner · 👁️ Computer Vision · 3mo ago
This video covers the Vision Transformer (ViT), the Diffusion Transformer (DiT), and the Multimodal Diffusion Transformer (MMDiT). Together they trace the architectural evolution that enabled the Transformer (originally designed for machine translation and language modeling) to replace the Convolutional Neural Network (CNN) as the de facto model for vision.

▶️ Companion videos:
- Transformers in language: https://youtu.be/SFi9KsnidNc?si=XmQpBqd0_KH7Vmcl
- Diffusion fundamentals: https://youtu.be/R0uMcXsfo2o?si=LvBqX2-A1wm66iLJ
- How the Transformer replaced CNNs: https://youtu.be/KnCRTP11p5U?si=2RrAya_2LU5I1Ms-

📚 Papers:
- ViT: https://arxiv.org/abs/2010.11929
- DiT: https://arxiv.org/abs/2212.09748
- MMDiT: https://arxiv.org/abs/2403.03206
- FiLM: https://arxiv.org/abs/1709.07871

My full reading list: https://www.patreon.com/c/JuliaTurc

Chapters:
00:00 Intro
01:13 Transformer recap
02:24 Image classification
03:35 Vision Transformer (ViT)
05:37 Image generation
07:54 Diffusion Transformer (DiT)
10:07 DiT in-context learning
10:38 DiT cross-attention
11:15 DiT adaLN (and FiLM inspiration)
14:26 DiT adaLN-Zero
16:03 Pixart-alpha
16:43 Multimodal Diffusion Transformer (MMDiT)

