An image is worth NxN words | Diffusion Transformers (ViT, DiT, MMDiT)
This video covers the Vision Transformer (ViT), Diffusion Transformer (DiT) and Multimodal Diffusion Transformer (MMDiT).
Together they trace the architectural evolution that let the Transformer, originally designed for machine translation and language modeling, replace the Convolutional Neural Network (CNN) as the de facto model for vision.
▶️ Companion videos:
- Transformers in language: https://youtu.be/SFi9KsnidNc?si=XmQpBqd0_KH7Vmcl
- Diffusion fundamentals: https://youtu.be/R0uMcXsfo2o?si=LvBqX2-A1wm66iLJ
- How the Transformer replaced CNNs: https://youtu.be/KnCRTP11p5U?si=2RrAya_2LU5I1Ms-
📚 Papers
ViT: https://arxiv.org/abs/2010.11929
DiT: https://arxiv.org/abs/2212.09748
MMDiT: https://arxiv.org/abs/2403.03206
FiLM: https://arxiv.org/abs/1709.07871
My full reading list: https://www.patreon.com/c/JuliaTurc
00:00 Intro
01:13 Transformer recap
02:24 Image classification
03:35 Vision Transformer (ViT)
05:37 Image generation
07:54 Diffusion Transformer (DiT)
10:07 DiT in-context learning
10:38 DiT cross-attention
11:15 DiT adaLN (and FiLM inspiration)
14:26 DiT adaLN-Zero
16:03 PixArt-alpha
16:43 Multimodal Diffusion Transformer (MMDiT)