IRPAPERS Explained!
AI systems have achieved remarkable success in processing text and relational data; visual document processing, however, remains relatively underexplored. Traditional systems require OCR transcription to convert visual documents into text and metadata, but recent advances in multimodal foundation models offer an alternative path: retrieval and generation directly from document images. This raises a timely and important question: how do image-based systems compare to established text-based methods?
To answer this question, we present IRPAPERS, a benchmark of 3,230 pages sourced from 166 scientific papers, with both an image and an OCR transcription for each page. We curate 180 needle-in-the-haystack questions for evaluating retrieval and question-answering systems on this corpus.

We begin by comparing image- and text-based retrieval with open-source models, as well as multimodal hybrid search. For image retrieval, we evaluate the ColModernVBERT multi-vector embedding model. For text retrieval, we evaluate Arctic 2.0 dense single-vector embeddings, BM25, and their combination in hybrid text search. Text-based methods achieved 46% Recall@1, 78% Recall@5, and 91% Recall@20, while image-based retrieval achieved 43% Recall@1, 78% Recall@5, and 93% Recall@20. These retrieval systems exhibit complementary failures, each succeeding on queries where the other fails, which enables multimodal fusion to exceed either modality alone. Multimodal hybrid search achieved the highest performance: 49% Recall@1, 81% Recall@5, and 95% Recall@20.

We additionally evaluate the efficiency-performance tradeoff of MUVERA encoding with varying levels of ef, as well as the performance of the ColPali and ColQwen2 multi-vector image embedding models. To contextualize open-source performance, we further evaluate leading closed-source models. Cohere Embed v4 page image embeddings reached 58% Recall@1, 87% Recall@5, and 97% Recall@20, outperforming Voyage 3 Large text
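The Recall@k numbers above measure, for each needle-in-the-haystack question, whether the single relevant page appears among the system's top-k results, averaged over all questions. A minimal sketch (the page IDs below are illustrative, not from the benchmark):

```python
def recall_at_k(ranked_page_ids, relevant_page_id, k):
    """Return 1.0 if the relevant page appears in the top-k results, else 0.0."""
    return 1.0 if relevant_page_id in ranked_page_ids[:k] else 0.0

# Average over all queries to get the corpus-level Recall@k.
queries = [
    (["p3", "p1", "p9", "p4", "p6"], "p1"),  # relevant page ranked 2nd
    (["p7", "p4", "p2", "p9", "p3"], "p8"),  # relevant page missed entirely
]
recall_at_1 = sum(recall_at_k(r, rel, 1) for r, rel in queries) / len(queries)
recall_at_5 = sum(recall_at_k(r, rel, 5) for r, rel in queries) / len(queries)
# recall_at_1 → 0.0, recall_at_5 → 0.5
```

With a single relevant page per question, Recall@k is equivalent to hit rate, which is why the scores climb monotonically from Recall@1 to Recall@20.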
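The abstract does not specify how multimodal hybrid search fuses the text and image ranked lists; reciprocal rank fusion (RRF) is one standard rank-based technique used for this kind of combination, sketched here as an assumption (IDs are illustrative):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists by summing 1/(k + rank) per document.

    k=60 is the conventional RRF constant; higher ranks in any list
    contribute more to a document's fused score.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

text_ranking = ["p2", "p5", "p1"]   # e.g. from BM25 + dense text search
image_ranking = ["p2", "p7", "p5"]  # e.g. from page-image embeddings
fused = reciprocal_rank_fusion([text_ranking, image_ranking])
# "p2", ranked first by both lists, leads the fused ranking
```

Because RRF operates on ranks rather than raw scores, it needs no calibration between the incompatible score scales of lexical, dense, and multi-vector retrievers, which makes it a natural fit when the two modalities fail on complementary queries.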
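ColModernVBERT, ColPali, and ColQwen2 are ColBERT-style multi-vector models: each query and each page is represented by many token-level vectors, and relevance is typically scored by MaxSim late interaction. A minimal sketch of that scoring rule, under the assumption of unit-normalized vectors:

```python
import numpy as np

def maxsim_score(query_vecs, doc_vecs):
    """ColBERT-style late interaction: for each query token vector,
    take its maximum similarity over all document token vectors, then sum."""
    sims = query_vecs @ doc_vecs.T       # (n_query_tokens, n_doc_tokens)
    return sims.max(axis=1).sum()        # best match per query token, summed

# Toy example: 2 query vectors, 2 document vectors in 2-D.
q = np.array([[1.0, 0.0], [0.0, 1.0]])
d = np.array([[1.0, 0.0], [0.5, 0.5]])
score = maxsim_score(q, d)               # 1.0 + 0.5 = 1.5
```

This per-token matching is what makes multi-vector retrieval expensive at scale, and it is the cost that MUVERA-style fixed-dimensional encodings are designed to approximate more cheaply.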