Bottleneck Tokens for Unified Multimodal Retrieval

📰 ArXiv cs.AI

arXiv:2604.11095v1 Announce Type: cross Abstract: Adapting decoder-only multimodal large language models (MLLMs) for unified multimodal retrieval faces two structural gaps. First, existing methods rely on implicit pooling, which overloads the hidden state of a standard vocabulary token as the sequence-level representation, a mechanism never designed for information aggregation. Second, contrastive fine-tuning specifies what the embedding should match but provides no token-level guidance…
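As context for the first gap the abstract names, a minimal sketch of "implicit pooling" may help: the last token's hidden state is reused as the whole sequence's embedding, then compared to other embeddings by cosine similarity for retrieval. This is an illustrative toy in plain Python, not the paper's method; the function names and toy vectors are invented for the example.

```python
import math

def last_token_embedding(hidden_states):
    """Implicit pooling as critiqued in the abstract: take the hidden
    state of the final (e.g., end-of-sequence) token as the whole
    sequence's embedding, L2-normalized for cosine similarity."""
    v = hidden_states[-1]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Retrieval score between two L2-normalized embeddings."""
    return sum(x * y for x, y in zip(a, b))

# Toy hidden states: a 3-token sequence with 4-dimensional states.
query_states = [[0.1, 0.2, 0.0, 0.3],
                [0.4, 0.1, 0.2, 0.0],
                [0.3, 0.4, 0.0, 0.0]]
doc_states = [[0.0, 0.1, 0.5, 0.2],
              [0.3, 0.4, 0.1, 0.0]]

q = last_token_embedding(query_states)
d = last_token_embedding(doc_states)
score = cosine(q, d)  # higher score -> better retrieval match
```

Nothing in the pooled vector is explicitly trained to aggregate the sequence; it is simply whatever the last token's hidden state happens to be, which is the structural gap the paper targets.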

Published 14 Apr 2026