IRPAPERS Explained!

Name: IRPAPERS Explained!
Uploaded: 2026-02-24T14:59:58+00:00
Channel: Weaviate vector database

Skills: Reading ML Papers 90%, Research Methods 90%, Paper Reproduction 80%, RAG Basics 80%, Vector Stores 80%

AI systems have achieved remarkable success in processing text and relational data; however, visual document processing remains relatively underexplored. Whereas traditional systems require OCR transcriptions to convert these visual documents into text and metadata, recent advances in multimodal foundation models offer an alternative path: retrieval and generation directly from document images. This raises a timely and important question: How do image-based systems compare to established text-based methods? To answer this question, we present IRPAPERS, a benchmark totaling 3,230 pages sourced from 166 scientific papers, with both an image and an OCR transcription for each page. We present a curation of 180 needle-in-the-haystack questions for evaluating retrieval and question answering systems with this corpus.

We begin by comparing image- and text-based retrieval with open-source models, as well as multimodal hybrid search. For image retrieval, we evaluate the ColModernVBERT multi-vector embedding model. For text retrieval, we evaluate Arctic 2.0 dense single-vector embeddings, BM25, and their combination in hybrid text search. Text-based methods achieved 46% Recall@1, 78% Recall@5, and 91% Recall@20, while image-based retrieval achieved 43% Recall@1, 78% Recall@5, and 93% Recall@20. These retrieval systems exhibit complementary failures, each succeeding on queries where the other fails, enabling multimodal fusion to exceed either modality alone. Multimodal hybrid search achieved the highest performance with 49% Recall@1, 81% Recall@5, and 95% Recall@20.

We additionally evaluate the efficiency-performance tradeoff of MUVERA encoding with varying levels of ef, as well as the performance of the ColPali and ColQwen2 multi-vector image embedding models. To contextualize open-source performance, we further evaluate leading closed-source models. Cohere Embed v4 page image embeddings reached 58% Recall@1, 87% Recall@5, and 97% Recall@20, outperforming Voyage 3 Large text
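
To make the Recall@k figures and the "complementary failures" argument concrete, here is a minimal Python sketch. It rests on several assumptions that the description does not confirm: each question has exactly one relevant page, so Recall@k is simply a hit rate, and the text and image rankings are fused with reciprocal rank fusion (RRF), a common choice swapped in here because the description does not name the fusion method used. The function names, the constant c=60, and the toy page IDs are all hypothetical and are not taken from the benchmark.

```python
from collections import defaultdict


def recall_at_k(ranked_ids, relevant_id, k):
    """Return 1.0 if the relevant page is among the top-k results, else 0.0.

    Assumes one relevant page per question (an assumption, see lead-in).
    """
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0


def mean_recall_at_k(runs, gold, k):
    """Average Recall@k over every question in `gold`."""
    return sum(recall_at_k(runs[q], gold[q], k) for q in gold) / len(gold)


def rrf_fuse(rankings, c=60):
    """Reciprocal rank fusion: score(d) = sum of 1 / (c + rank) over rankings.

    RRF is an assumed stand-in for the unspecified fusion method.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (c + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Toy data (made up for illustration): each retriever misses rank 1 on a
# different question, mimicking the complementary failures described above.
gold = {"q1": "p1", "q2": "p7"}
text_runs = {"q1": ["p1", "p2", "p3"], "q2": ["p6", "p7", "p8"]}
image_runs = {"q1": ["p4", "p1", "p5"], "q2": ["p7", "p9", "p6"]}

fused_runs = {q: rrf_fuse([text_runs[q], image_runs[q]]) for q in gold}

for name, runs in [("text", text_runs), ("image", image_runs), ("fused", fused_runs)]:
    print(name, "Recall@1 =", mean_recall_at_k(runs, gold, 1))
```

On this toy data each single-modality retriever ranks the right page first for only one of the two questions, while the fused ranking gets both. That is the effect the description attributes to multimodal hybrid search, although real gains depend on the corpus and on the fusion method actually used.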
