VSI: Visual Subtitle Integration for Keyframe Selection to enhance Long Video Understanding

📰 ArXiv cs.AI

arXiv:2508.06869v4. Abstract: Multimodal large language models (MLLMs) demonstrate exceptional performance in vision-language tasks, yet their processing of long videos is constrained by input context length and high computational costs. Sparse frame sampling thus becomes a necessary preprocessing step, with sampled frame quality directly impacting downstream performance. Existing keyframe search algorithms achieve a balance between efficiency and sampled frame qualit

Published 13 Apr 2026