PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation

📰 ArXiv cs.AI

PhyAVBench is a benchmark that evaluates whether text-to-audio-video (T2AV) generation models produce physically plausible sounds.

advanced Published 8 Apr 2026

Action Steps

Identify the limitations of current text-to-audio-video generation models in producing physically plausible sounds
Develop a benchmark that evaluates audio-physics grounding in generated audio-visual content
Use PhyAVBench to assess the performance of different models and identify areas for improvement
Apply the insights from PhyAVBench to fine-tune and improve the physical plausibility of generated audio-visual content

Who Needs to Know This

AI researchers and engineers working on text-to-audio-video generation models can use PhyAVBench to evaluate the physical plausibility of their models' audio. Product managers can use it to assess the quality of generated audio-visual content.

Key Insight

💡 Evaluating the physical plausibility of generated audio-visual content is crucial for realistic text-to-audio-video generation

Key Takeaways

Previous benchmarks primarily focus on audio-video temporal synchronization and largely overlook explicit evaluation of audio-physics grounding; PhyAVBench is presented to address that gap.

Full Article

Title: PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation

Abstract:
arXiv:2512.23994v2
Text-to-audio-video (T2AV) generation is central to applications such as filmmaking and world modeling. However, current models often fail to produce physically plausible sounds. Previous benchmarks primarily focus on audio-video temporal synchronization, while largely overlooking explicit evaluation of audio-physics grounding, thereby limiting the study of physically plausible audio-visual generation. To address this issue, we present Ph
