Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 5 - Architectures

Stanford Online · Beginner ·🎨 Image & Video AI ·2d ago
Learn more details about this course: https://online.stanford.edu/courses/cme296-diffusion-and-large-vision-models

To follow along with the course schedule and syllabus, visit: https://cme296.stanford.edu/syllabus/

Chapters:
00:00:00 Introduction
00:05:26 Objective
00:09:58 Convolutions, filters
00:14:44 Receptive field
00:17:14 Pooling
00:19:06 U-Net
00:27:52 Timestep representation
00:30:31 Class label representation
00:33:21 Timeline of U-Net models
00:35:43 Diffusion Transformer (DiT)
00:48:08 Adaptive layer normalization (adaLN)
01:02:30 DiT end-to-end example
01:12:57 Multimodal DiT (MM-DiT)
01:23:33 Qwen-Image, Z-Image, FLUX.1
01:24:27 Timeline of DiT models
01:25:25 Absolute position embeddings
01:38:48 Rotary position embeddings (RoPE)
01:39:59 2D RoPE variants

For more information about Stanford’s graduate programs, visit: https://online.stanford.edu/graduate-education

Afshine Amidi is an Adjunct Lecturer at Stanford University.
Shervine Amidi is an Adjunct Lecturer at Stanford University.

View the course playlist: https://www.youtube.com/playlist?list=PLoROMvodv4rNdy8rt2rZ4T2xM0OjADnfu

