AdaptiveLoad: Towards Efficient Video Diffusion Transformer Training

📰 ArXiv cs.AI

arXiv:2605.17923v1
Announce Type: cross

Abstract: In video generation models, particularly world models, training large-scale video diffusion Transformers (such as DiT and MMDiT) poses significant computational challenges due to the extreme variance in sequence lengths within mixed-mode datasets. Existing bucket-based data loading strategies typically rely on "equal token length" constraints. This approach fails to account for the quadratic complexity of self-attention mechanisms, leading to sev
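The abstract's point about "equal token length" buckets can be illustrated with a small calculation. The sketch below is not the paper's method; it only assumes that self-attention cost per sequence grows roughly with the square of its length, and the function name `bucket_cost` and the token budget are hypothetical choices made for illustration. It compares two buckets that hold the same number of tokens but differ in sequence length.

```python
def bucket_cost(seq_len: int, token_budget: int) -> dict:
    """Estimate the load of one batch filled under an equal-token constraint.

    Assumption (for illustration only): attention cost per sequence is
    proportional to seq_len ** 2, with constants and non-attention layers ignored.
    """
    batch_size = token_budget // seq_len      # sequences that fit in the budget
    tokens = batch_size * seq_len             # total tokens in the batch
    attention_cost = batch_size * seq_len ** 2  # quadratic proxy for attention work
    return {
        "batch_size": batch_size,
        "tokens": tokens,
        "attention_cost": attention_cost,
    }


TOKEN_BUDGET = 65_536  # hypothetical per-batch token budget

short_bucket = bucket_cost(seq_len=1_024, token_budget=TOKEN_BUDGET)
long_bucket = bucket_cost(seq_len=16_384, token_budget=TOKEN_BUDGET)

# Both buckets carry the same number of tokens...
print(short_bucket["tokens"], long_bucket["tokens"])
# ...but the quadratic proxy differs by the ratio of the sequence lengths.
print(long_bucket["attention_cost"] / short_bucket["attention_cost"])
```

Under this proxy, a batch of 4 sequences of 16,384 tokens carries 16 times the attention work of a batch of 64 sequences of 1,024 tokens, even though both satisfy the same token budget. A worker given the long-sequence batch would therefore take longer per step than one given the short-sequence batch.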

Published 19 May 2026
