An image is worth NxN words | Diffusion Transformers (ViT, DiT, MMDiT)

Julia Turc · Beginner · 🎨 Image & Video AI · 1mo ago
This video covers the Vision Transformer (ViT), the Diffusion Transformer (DiT), and the Multimodal Diffusion Transformer (MMDiT): the architectural evolution that enabled the Transformer (originally designed for machine translation and language modeling) to replace the Convolutional Neural Network (CNN), the de facto model for vision.

▶️ Companion videos:
- Transformers in language: https://youtu.be/SFi9KsnidNc?si=XmQpBqd0_KH7Vmcl
- Diffusion fundamentals: https://youtu.be/R0uMcXsfo2o?si=LvBqX2-A1wm66iLJ
- How the Transformer replaced CNNs: https://youtu.be/KnCRTP11p5U?si=2RrA…

Chapters (12)

0:00 Intro
1:13 Transformer recap
2:24 Image classification
3:35 Vision Transformer (ViT)
5:37 Image generation
7:54 Diffusion Transformer (DiT)
10:07 DiT in-context learning
10:38 DiT cross-attention
11:15 DiT adaLN (and FiLM inspiration)
14:26 DiT adaLN-Zero
16:03 Pixart-alpha
16:43 Multimodal Diffusion Transformer (MMDiT)