Learning World Models for Interactive Video Generation

📰 ArXiv cs.AI

arXiv:2505.21996v3

Abstract: Foundational world models must be both interactive and preserve spatiotemporal coherence for effective future planning with action choices. However, present models for long video generation have limited inherent world modeling capabilities due to two main challenges: compounding errors and insufficient memory mechanisms. We enhance image-to-video models with interactive capabilities through additional action conditioning and autoregressiv

Published 14 Apr 2026