MODIX: A Training-Free Multimodal Information-Driven Positional Index Scaling for Vision-Language Models
📰 ArXiv cs.AI
arXiv:2604.12537v1 Announce Type: cross Abstract: Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet their positional encoding mechanisms remain suboptimal. Existing approaches uniformly assign positional indices to all tokens, overlooking variations in information density within and across modalities; as a result, attention is allocated inefficiently, with redundant visual regions dominating while informative content is underrepresented.
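The core idea in the abstract, assigning positional indices non-uniformly according to per-token information density rather than the usual 0, 1, 2, … scheme, can be sketched as follows. This is an illustrative, training-free toy sketch only: the paper's actual MODIX scaling rule is not given in this excerpt, and the function name and the use of a normalized cumulative sum are assumptions.

```python
import numpy as np

def density_scaled_indices(densities, total_span=None):
    """Assign positional indices whose increments scale with each token's
    information density, instead of a uniform 0, 1, 2, ... assignment.

    High-density tokens advance the index more, while redundant tokens are
    compressed onto nearby positions. Illustrative sketch only; this is not
    the actual MODIX formulation.
    """
    d = np.asarray(densities, dtype=float)
    if total_span is None:
        # Keep the same overall index range as the uniform assignment.
        total_span = len(d)
    # Each token's index increment is proportional to its share of the
    # total information density.
    increments = d / d.sum() * total_span
    # The cumulative sum gives each token a (fractional) positional index,
    # starting at 0 for the first token.
    return np.concatenate([[0.0], np.cumsum(increments)[:-1]])
```

With uniform densities this reduces to the standard integer indices, e.g. `density_scaled_indices([1, 1, 1, 1])` gives `[0, 1, 2, 3]`; skewed densities compress low-information runs into a narrow index range so they consume less of the positional span.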