Rethinking Video Human-Object Interaction: Set Prediction over Time for Unified Detection and Anticipation

📰 ArXiv cs.AI

arXiv:2604.10397v1 Announce Type: cross

Abstract: Video-based human-object interaction (HOI) understanding requires both detecting ongoing interactions and anticipating their future evolution. However, existing methods usually treat anticipation as a downstream forecasting task built on externally constructed human-object pairs, limiting joint reasoning between detection and prediction. In addition, sparse keyframe annotations in current benchmarks can temporally misalign nominal future labels f

Published 14 Apr 2026
