LITTA: Late-Interaction and Test-Time Alignment for Visually-Grounded Multimodal Retrieval

📰 ArXiv cs.AI

LITTA is a framework for visually-grounded multimodal retrieval that improves the retrieval of evidence pages from visually rich documents without retraining the retriever.

Published 31 Mar 2026
Action Steps
  1. Generate complementary queries to expand the user's query
  2. Align the query and document representations at test time
  3. Use late interaction to improve the retrieval of relevant evidence pages
  4. Evaluate the performance of LITTA on multimodal document retrieval tasks
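The late-interaction scoring in step 3 can be sketched in a ColBERT-style MaxSim form: each query token embedding is matched against its most similar document (page) embedding, and the maxima are summed. The function names, the averaging of scores over expanded queries, and all shapes below are illustrative assumptions, not LITTA's actual implementation:

```python
import numpy as np

def late_interaction_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """MaxSim late interaction: for each query token embedding, take its
    maximum similarity over all document embeddings, then sum over tokens.
    query_emb: (num_query_tokens, dim); doc_emb: (num_doc_tokens, dim)."""
    sim = query_emb @ doc_emb.T          # pairwise similarities (q_tokens, d_tokens)
    return float(sim.max(axis=1).sum())  # best match per query token, summed

def retrieve(query_embs: list, page_embs: list, top_k: int = 3) -> list:
    """Rank pages by late-interaction score. Averaging over the expanded
    queries (step 1) is an assumption made here for illustration."""
    scores = [
        np.mean([late_interaction_score(q, p) for q in query_embs])
        for p in page_embs
    ]
    return np.argsort(scores)[::-1][:top_k].tolist()  # indices of best pages
```

For example, a page whose embeddings align with the query tokens outscores one whose embeddings point away from them, so it is ranked first.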
Who Needs to Know This

Researchers and engineers building multimodal retrieval and question-answering systems can benefit from LITTA, as it improves the retrieval of relevant evidence from visually rich documents.

Key Insight

💡 LITTA's query-expansion-centric approach and test-time alignment enable effective retrieval of relevant evidence pages from visually rich documents.
