Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models

📰 ArXiv cs.AI

arXiv:2602.20981v3. Abstract: Scaling multimodal alignment between video and audio is challenging, particularly because of limited data and the mismatch between text descriptions and frame-level video information. In this work, we address the scaling challenge in multimodal-to-audio generation by examining whether models trained on short instances can generalize to longer ones at test time. To this end, we present multimodal hierarchical networks, so-called MMH

Published 16 Apr 2026