An image is worth NxN words | Diffusion Transformers (ViT, DiT, MMDiT)

Julia Turc · Beginner · 👁️ Computer Vision · 3mo ago

Skills: Modern CV Models 90% · CV Basics 80% · Generative CV 80%

This video covers the Vision Transformer (ViT), the Diffusion Transformer (DiT), and the Multimodal Diffusion Transformer (MMDiT). Together they trace the architectural evolution that let the Transformer, originally designed for machine translation and language modeling, replace the Convolutional Neural Network (CNN) as the de facto model for vision.

▶️ Companion videos:
- Transformers in language: https://youtu.be/SFi9KsnidNc?si=XmQpBqd0_KH7Vmcl
- Diffusion fundamentals: https://youtu.be/R0uMcXsfo2o?si=LvBqX2-A1wm66iLJ
- How the Transformer replaced CNNs: https://youtu.be/KnCRTP11p5U?si=2RrAya_2LU5I1Ms-

📚 Papers
ViT: https://arxiv.org/abs/2010.11929
DiT: https://arxiv.org/abs/2212.09748
MMDiT: https://arxiv.org/abs/2403.03206
FiLM: https://arxiv.org/abs/1709.07871
My full reading list: https://www.patreon.com/c/JuliaTurc

00:00 Intro
01:13 Transformer recap
02:24 Image classification
03:35 Vision Transformer (ViT)
05:37 Image generation
07:54 Diffusion Transformer (DiT)
10:07 DiT in-context learning
10:38 DiT cross-attention
11:15 DiT adaLN (and FiLM inspiration)
14:26 DiT adaLN-Zero
16:03 Pixart-alpha
16:43 Multimodal Diffusion Transformer (MMDiT)
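The "NxN words" in the title refers to the ViT idea of cutting an image into a grid of patches and treating each patch as a token. The sketch below is only an illustration of that step, not code from the video: the function names `patchify` and `embed_patches`, the 224x224 input, the patch size of 16, and the model width of 64 are all assumptions chosen for the example, and the projection weights are random rather than learned.

```python
import numpy as np


def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into a sequence of flattened patches.

    Returns an array of shape (rows * cols, patch_size * patch_size * C),
    where rows = H // patch_size and cols = W // patch_size.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = h // patch_size, w // patch_size
    # (rows, p, cols, p, C) -> (rows, cols, p, p, C) -> (rows * cols, p * p * C)
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch_size * patch_size * c)


def embed_patches(patches: np.ndarray, weight: np.ndarray, pos_embed: np.ndarray) -> np.ndarray:
    """Linearly project each flattened patch and add a positional embedding."""
    return patches @ weight + pos_embed


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))      # hypothetical RGB input
patches = patchify(image, patch_size=16)

d_model = 64                            # hypothetical model width
weight = rng.normal(scale=0.02, size=(patches.shape[1], d_model))
pos_embed = rng.normal(scale=0.02, size=(patches.shape[0], d_model))
tokens = embed_patches(patches, weight, pos_embed)

print(patches.shape)  # (196, 768): a 14x14 grid of patches, each 16*16*3 values
print(tokens.shape)   # (196, 64): one token per patch, ready for a Transformer
```

With these assumed sizes the image becomes a sequence of 14x14 = 196 tokens, which a standard Transformer encoder can then process much as it would a sentence.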

More on: Modern CV Models

YOLOE: Real-time Zero-shot Object Detection | Visual Prompting | Live Coding & Q&A (Mar 14th)

RF-DETR: How to Train SOTA for Object Detection on a Custom Dataset | Step-by-step guide

Build a Deep Facial Recognition App // Part 8 - Kivy Computer Vision App with OpenCV and Tensorflow

Nicholas Renotte

Deep Learning with PyTorch : Image Segmentation

Mesh Optimization Using FlexiCubes with NVIDIA Kaolin Library v0.15.0

NVIDIA Developer

Code Panoptic Image Segmentation w/ Vision Transformer & Mask2Former - A PyTorch tutorial

Related AI Lessons

Inside SAM 3D: how Meta turns a single image into 3D

Learn how Meta's SAM 3D technology turns a single image into 3D, revolutionizing the field of computer vision

Medium · Machine Learning

Demystifying CNNs: How Convolutional Filters and Max-Pooling Actually Work

Learn how Convolutional Neural Networks (CNNs) use convolutional filters and max-pooling to recognize images

Medium · Data Science

Your "Biometric Age Check" Isn't Verifying Identity — And Defense Lawyers Know It

Biometric age checks don't verify identity, a crucial distinction for developers in computer vision and biometrics

Chapters (12)

0:00 Intro

1:13 Transformer recap

2:24 Image classification

3:35 Vision Transformer (ViT)

5:37 Image generation

7:54 Diffusion Transformer (DiT)

10:07 DiT in-context learning

10:38 DiT cross-attention

11:15 DiT adaLN (and FiLM inspiration)

14:26 DiT adaLN-Zero

16:03 Pixart-alpha

16:43 Multimodal Diffusion Transformer (MMDiT)
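The adaLN and adaLN-Zero chapters concern how a DiT block injects conditioning: scale, shift, and gate values are regressed from a conditioning vector and applied around a normalized sublayer, with the regression layer initialized to zero so each block starts as the identity. The sketch below is a minimal illustration under stated assumptions, not the video's or the paper's code: `adaln_zero_residual`, `w_mod`, `b_mod`, and the toy linear `sublayer` (standing in for attention or an MLP) are hypothetical names, and all sizes are arbitrary.

```python
import numpy as np


def layer_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Layer normalization over the feature axis, without learned affine terms."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def silu(x: np.ndarray) -> np.ndarray:
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))


def adaln_zero_residual(x, cond, w_mod, b_mod, sublayer):
    """Apply one adaLN-Zero-style conditioned residual branch.

    x:      (T, D) token sequence
    cond:   (D_c,) conditioning vector (e.g. timestep plus class embedding)
    w_mod:  (D_c, 3 * D) modulation weights, zero-initialized in adaLN-Zero
    b_mod:  (3 * D,) modulation bias, zero-initialized in adaLN-Zero
    """
    shift, scale, gate = np.split(silu(cond) @ w_mod + b_mod, 3)
    h = layer_norm(x) * (1.0 + scale) + shift   # FiLM-like scale and shift
    return x + gate * sublayer(h)               # gate controls the residual branch


rng = np.random.default_rng(0)
T, D, D_c = 4, 8, 6                              # hypothetical sizes
x = rng.normal(size=(T, D))
cond = rng.normal(size=(D_c,))
w_sub = rng.normal(size=(D, D))
sublayer = lambda h: h @ w_sub                   # toy stand-in for attention/MLP

# At initialization the modulation layer is all zeros, so gate == 0.
out_init = adaln_zero_residual(x, cond, np.zeros((D_c, 3 * D)), np.zeros(3 * D), sublayer)

# After training the weights are nonzero; random values imitate that here.
w_mod = rng.normal(scale=0.1, size=(D_c, 3 * D))
out_trained = adaln_zero_residual(x, cond, w_mod, np.zeros(3 * D), sublayer)

print(np.allclose(out_init, x))      # the block is the identity at initialization
print(np.allclose(out_trained, x))   # nonzero modulation changes the output
```

The zero initialization means the residual branch contributes nothing at the start of training, so the network begins as a stack of identity blocks and learns how strongly to apply each conditioned sublayer.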
