STORM: End-to-End Referring Multi-Object Tracking in Videos
arXiv cs.AI
arXiv:2604.10527v1 Announce Type: cross Abstract: Referring multi-object tracking (RMOT) is the task of associating all objects in a video that semantically match given textual queries or referring expressions. Existing RMOT approaches decompose object grounding and tracking into separate modules and exhibit limited performance due to the scarcity of training videos, ambiguous annotations, and restricted domains. In this work, we introduce STORM, an end-to-end MLLM that jointly performs