Video Generation with Diffusion Transformers | Generative AI
In this video, we dive deep into Latte, a latent diffusion transformer for video generation. The model combines diffusion techniques with a transformer architecture and is trained on latent representations of video frames.
We start with a quick recap of diffusion transformers, since Latte's core building block closely follows the adaptive layer norm block variant from the DiT (Diffusion Transformer) paper.
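To make the recap concrete, here is a minimal sketch of a DiT-style block with adaptive layer norm (adaLN-Zero), written in PyTorch. All class and layer names are illustrative, not taken from the Latte codebase: the conditioning vector (e.g. a timestep embedding) regresses the shift, scale, and gate parameters that modulate the block's two sub-layers.

```python
import torch
import torch.nn as nn

class AdaLNDiTBlock(nn.Module):
    """Sketch of a DiT block with adaptive layer norm (adaLN-Zero).

    A conditioning vector c (timestep/class embedding) regresses the
    shift/scale/gate parameters modulating attention and MLP sub-layers.
    Names are illustrative, not from the Latte codebase.
    """
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Regress 6 modulation params: shift/scale/gate for attn and mlp.
        self.ada_mod = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
        # adaLN-Zero: zero-init so the block starts as the identity map.
        nn.init.zeros_(self.ada_mod[1].weight)
        nn.init.zeros_(self.ada_mod[1].bias)

    def forward(self, x: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim), c: (batch, dim)
        s1, sc1, g1, s2, sc2, g2 = self.ada_mod(c).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + sc1.unsqueeze(1)) + s1.unsqueeze(1)
        x = x + g1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + sc2.unsqueeze(1)) + s2.unsqueeze(1)
        x = x + g2.unsqueeze(1) * self.mlp(h)
        return x
```

Because of the zero-initialized modulation layer, the residual branches are gated off at initialization, which is the key stability trick of the adaLN-Zero variant.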
Next, we explore specific features of the Latte model, including video patch embedding for processing latent frames, spatial and temporal attention, model variants, and temporal position embeddings, before walking through the implementation and training.
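The two ingredients above can be sketched in a few lines of PyTorch: a patch embedding that turns each latent frame into tokens, followed by factorized attention that attends over patches within each frame (spatial) and then over frames at each patch location (temporal). This is a simplified illustration under assumed shapes, and the names are not from the Latte codebase.

```python
import torch
import torch.nn as nn

class VideoPatchEmbed(nn.Module):
    """Sketch: latent frames (b, t, c, h, w) -> tokens (b, t, patches, dim)."""
    def __init__(self, in_ch: int, dim: int, patch: int):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = z.shape
        x = self.proj(z.reshape(b * t, c, h, w))  # (b*t, dim, h/p, w/p)
        x = x.flatten(2).transpose(1, 2)          # (b*t, patches, dim)
        return x.reshape(b, t, x.shape[1], x.shape[2])

class FactorizedSTBlock(nn.Module):
    """Sketch of factorized space-time attention over video tokens."""
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, s, d = x.shape
        # Spatial attention: fold frames into the batch, attend over patches.
        xs = x.reshape(b * t, s, d)
        h = self.norm_s(xs)
        xs = xs + self.spatial_attn(h, h, h, need_weights=False)[0]
        # Temporal attention: fold patches into the batch, attend over frames.
        xt = xs.reshape(b, t, s, d).permute(0, 2, 1, 3).reshape(b * s, t, d)
        h = self.norm_t(xt)
        xt = xt + self.temporal_attn(h, h, h, need_weights=False)[0]
        return xt.reshape(b, s, t, d).permute(0, 2, 1, 3)
```

Factorizing attention this way keeps the cost linear in frames × patches per attention call, instead of attending over all space-time tokens jointly.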
Watch on YouTube ↗
Chapters (15)
Intro
0:57
Diffusion Transformers recap
3:50
Patch Embedding for Video Generation Model
8:48
Spatial and Temporal Attention for Video Generation
12:40
Variants of Latent Diffusion Transformer for Video
16:16
Temporal Position Embeddings for Latte Model
17:24
Experiments Comparing Video Model Design Choices
23:46
Implementation Details of Latent Video Diffusion Transformer
24:58
Autoencoder Training for Video Diffusion Model
29:34
Autoencoder Results
32:08
VideoDataset for training Video Diffusion Transformer
36:32
Video Diffusion Transformer Model Class
44:34
Training Code for Latte Model
46:30
Video Diffusion Transformer Results
47:25
Up Next on Video Generation
DeepCamp AI