Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 5 - Architectures
Learn more details about this course: https://online.stanford.edu/courses/cme296-diffusion-and-large-vision-models
To follow along with the course schedule and syllabus, visit: https://cme296.stanford.edu/syllabus/
Chapters:
00:00:00 Introduction
00:05:26 Objective
00:09:58 Convolutions, filters
00:14:44 Receptive field
00:17:14 Pooling
00:19:06 U-Net
00:27:52 Timestep representation
00:30:31 Class label representation
00:33:21 Timeline of U-Net models
00:35:43 Diffusion Transformer (DiT)
00:48:08 Adaptive layer normalization (adaLN)
01:02:30 DiT end-to-end example
01:12:57 Multimodal DiT (MM-DiT)
01:23:33 Qwen-Image, Z-Image, FLUX.1
01:24:27 Timeline of DiT models
01:25:25 Absolute position embeddings
01:38:48 Rotary position embeddings (RoPE)
01:39:59 2D RoPE variants
For more information about Stanford’s graduate programs, visit: https://online.stanford.edu/graduate-education
Afshine Amidi is an Adjunct Lecturer at Stanford University.
Shervine Amidi is an Adjunct Lecturer at Stanford University.
View the course playlist: https://www.youtube.com/playlist?list=PLoROMvodv4rNdy8rt2rZ4T2xM0OjADnfu