VSI: Visual Subtitle Integration for Keyframe Selection to enhance Long Video Understanding
📰 ArXiv cs.AI
arXiv:2508.06869v4 Abstract: Multimodal large language models (MLLMs) demonstrate exceptional performance in vision-language tasks, yet their processing of long videos is constrained by input context length and high computational costs. Sparse frame sampling thus becomes a necessary preprocessing step, with sampled frame quality directly impacting downstream performance. Existing keyframe search algorithms achieve a balance between efficiency and sampled frame quality...