IRPAPERS Explained!
Key Takeaways
The video explains the IRPAPERS dataset, a collection of 166 information retrieval papers with 3,230 pages, and its use in testing image and text-based representations for retrieval and question answering. The study compares several retrieval approaches, including ColModernVBERT, MUVERA, Arctic 2.0, and BM25, and discusses the benefits of multimodal hybrid search.
Full Transcript
[music] >> Hey everyone, I'm super excited to present IRPAPERS, a deep dive into building AI systems with visual documents like PDFs. There have been all sorts of recent advances in multimodal embedding and foundation models that motivated us to revisit whether we still need OCR transcription, and how these systems compare when you represent the pages as images compared to extracted text transcriptions. So we tested all sorts of models and I'm super excited to share what we found. Let's dive into it. We present a comparative study between image and text-based representations for retrieval and question answering over a collection of PDFs, in this case a collection of scientific papers. We introduce IRPAPERS, a new dataset of information retrieval papers. They're sourced from the citations of "Large Language Models for Information Retrieval: A Survey," which was originally published in 2023 and most recently updated in September 2025. This constitutes 166 papers and 3,230 pages. It's a great example of a particular knowledge base of PDFs, and it's something I'm personally really excited to work with. Hopefully other information retrieval researchers will find this interesting. More broadly, this is about the idea of developing a knowledge base that you might either use for your own purposes or, say, provide to your AI agent to give it a particular domain of expertise like information retrieval. We then derive a set of 180 questions to test these systems through the needle-in-the-haystack question generation philosophy. The idea is to extract a question for each document in your corpus that targets retrieving that particular document or answering a question on that particular document. We consider each of our pages as a document, so we have 3,230 pages, but we only use a subset of 180 of these pages to derive these questions.
So the prompt that we used to create these questions: we give an entire paper from our 166 papers to the Claude GUI and ask, "Please write an information seeking question for each page of this document that it uniquely contains the answer to." There are more details on our prompt in the paper. It then produces questions like this: in HyDE, which is one of these LLM query writing methods, what specific instruction-following models and contrastive encoders were used for English versus non-English retrieval tasks? The answer: InstructGPT, Contriever, and mContriever. So these are highly targeted questions, also called factoid questions, that we use to evaluate these systems. We start off by testing different open-source embedding models and retrieval strategies, starting with the ColModernVBERT multi-vector image embedding model as well as with MUVERA encoding, Arctic 2.0 single-vector text embeddings, BM25 keyword scoring, and then hybrid text search combining Arctic 2.0 with BM25 and multimodal hybrid search combining ColModernVBERT, Arctic 2.0, and BM25. So a super quick primer on why we chose these models. We're super excited about ColBERT late interaction multi-vector methods. To give a super quick helicopter view of the evolution of this, the first iteration of vector search was primarily based on single-vector representations, where the embedding model produces a single vector for the query as well as for each candidate document. And then we built efficient indexes like HNSW to efficiently calculate the distances between the query and all the documents in your database and find the nearest neighbors. What's happening now with late interaction multi-vector models is that instead of pooling all the vectors and representing each document with just a single vector, you have all of these different token vectors for the documents and all these different token vectors for the query.
Then you apply this MaxSim operator to get a more fine-grained late interaction. So it's like using a cross encoder or a higher-capacity technique for modeling the similarity between the query and all the candidate documents. Another thing that's really exciting about ColBERT is the way it translated into ColPali. In ColPali, the idea is that instead of having token vectors, you have image patch vectors. One thing that's really neat about this is the visualization you get. In this heat map, the bright spots are where you have high maximum similarities between the query tokens and the particular vector that makes up that image patch. This is a particular reason why we're testing these multi-vector image embedding models on the page images of our visual documents. And then finally, MUVERA encoding. As you can probably tell, doing this MaxSim is a ton of dot products. So we need efficient ways to calculate it, as well as efficient ways to avoid keeping all of the document vectors for our entire database in memory. That's the idea of MUVERA. It reduces the problem to a two-stage algorithm, where first you derive a single-vector representation by applying a hashing algorithm over the multi-vector representation. You can then shortlist candidates and rerank them with full-precision MaxSim rescoring. So here are the results of testing these different open-source retrieval models on our IRPAPERS benchmark. We find generally pretty similar performance between the ColModernVBERT multi-vector page image embeddings and the Arctic 2.0 text transcription embeddings, BM25 keyword scoring on the text transcriptions, and hybrid text search. So we see 0.43, generally around 0.45 at recall at one, around 0.78 at recall at five, and around 0.9 at recall at 20.
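The late interaction scoring and the shortlist-then-rerank idea described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in the study: the function names, the toy two-dimensional vectors, and especially the mean pooling in the first stage are assumptions. MUVERA derives its single vector with a hashing-based encoding; mean pooling here is only a simple stand-in so the two-stage shape is visible.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """Late interaction score: for each query vector, take the best-matching
    document vector, then sum those maxima over the query vectors."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def mean_pool(vecs: List[Vector]) -> List[float]:
    """Collapse a multi-vector representation into one vector.
    ASSUMPTION: a stand-in for MUVERA's hashing-based single-vector encoding."""
    return [sum(column) / len(vecs) for column in zip(*vecs)]


def two_stage_search(query_vecs: List[Vector],
                     corpus: Dict[str, List[Vector]],
                     ef: int, k: int) -> List[str]:
    """Stage 1: shortlist `ef` candidates with a cheap single-vector score.
    Stage 2: rerank only the shortlist with full MaxSim and return the top k."""
    q_single = mean_pool(query_vecs)
    shortlist = sorted(
        corpus,
        key=lambda doc_id: dot(q_single, mean_pool(corpus[doc_id])),
        reverse=True,
    )[:ef]
    reranked = sorted(
        shortlist,
        key=lambda doc_id: maxsim(query_vecs, corpus[doc_id]),
        reverse=True,
    )
    return reranked[:k]


# Hypothetical toy data: two query vectors, three "pages" with two vectors each.
query = [[1.0, 0.0], [0.0, 1.0]]
corpus = {
    "page_a": [[1.0, 0.0], [0.0, 1.0]],
    "page_b": [[0.5, 0.5], [0.5, 0.5]],
    "page_c": [[-1.0, 0.0], [0.0, -1.0]],
}

print(maxsim(query, corpus["page_a"]))
print(two_stage_search(query, corpus, ef=2, k=1))
```

A smaller `ef` means fewer MaxSim computations per query, but a page dropped in the first stage can never be recovered by the rerank, which is the source of the accuracy loss at lower `ef`.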
And as a quick primer if you're not familiar, the idea of measuring recall is seeing whether the source document that was used to create the question is contained in the top K results, for K equals 1, 5, and 20. Most excitingly, we find the best performance with multimodal hybrid search, combining ColModernVBERT multi-vector page image embeddings with Arctic 2.0 and BM25 on the text representations. So that's combining the page image with the OCR-extracted text representation. Some more details on the multimodal hybrid search: we explored different ways of combining the scores or rankings from the hybrid text search with those from ColModernVBERT, using either reciprocal rank fusion or relative score fusion. So this is a deeper dive into the hyperparameter tuning over the particular score fusion strategy and how much to weight the contribution from each of the search methods. We also tested MUVERA encoding. Again, the idea of MUVERA is to make late interaction multi-vector retrieval efficient by decomposing the problem into a two-stage retrieval method, where you first retrieve ef candidates that you then rerank with full-precision MaxSim rescoring. So this ef parameter controls how many candidates are passed into the full-precision MaxSim rescoring. We find that without MUVERA, we're at 43% recall at one, 78% recall at five, and 93% recall at 20. At ef equals 1,024, we drop to 41%, 75%, and 88%. At 512, it's 37%, 68%, and 78%. So hopefully this gives you a little more context on the downstream performance degradation to expect depending on how much MUVERA encoding and how much reranking you're using. And of course, this is a tradeoff between the accuracy of the search results and the system efficiency: how many queries per second you can achieve and how much computation and memory you need for running these queries. So that's the tradeoff with MUVERA encoding for multi-vector retrieval.
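To make the recall metric and the two fusion strategies mentioned above concrete, here is a small sketch. It is an illustration under stated assumptions rather than the code behind the reported numbers: the function names and toy scores are hypothetical, the constant `k=60` in reciprocal rank fusion is a commonly used default and not a value taken from the talk, and relative score fusion is shown as a weighted sum of min-max normalized scores.

```python
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> Dict[str, float]:
    """Fuse ranked lists using ranks only: each list contributes 1 / (k + rank).
    ASSUMPTION: k=60 is a common default, not a value from the talk."""
    fused: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return fused


def relative_score_fusion(score_lists: List[Dict[str, float]],
                          weights: List[float]) -> Dict[str, float]:
    """Fuse raw scores: min-max normalize each retriever's scores to [0, 1],
    then take a weighted sum across retrievers."""
    fused: Dict[str, float] = {}
    for scores, weight in zip(score_lists, weights):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid dividing by zero when all scores tie
        for doc_id, score in scores.items():
            fused[doc_id] = fused.get(doc_id, 0.0) + weight * (score - lo) / span
    return fused


def top_k(fused: Dict[str, float], k: int) -> List[str]:
    """Return the k highest-scoring document ids."""
    return sorted(fused, key=fused.get, reverse=True)[:k]


def recall_at_k(ranked_lists: List[List[str]], gold_ids: List[str], k: int) -> float:
    """Fraction of queries whose gold document appears in the top k results."""
    hits = sum(1 for ranked, gold in zip(ranked_lists, gold_ids) if gold in ranked[:k])
    return hits / len(gold_ids)


# Hypothetical scores for one query from an image retriever and a text retriever.
image_scores = {"p1": 0.9, "p2": 0.5, "p3": 0.1}
text_scores = {"p1": 2.0, "p2": 8.0, "p3": 5.0}

rsf_ranking = top_k(relative_score_fusion([image_scores, text_scores], [0.5, 0.5]), 3)
rrf_ranking = top_k(reciprocal_rank_fusion([["p1", "p2", "p3"], ["p2", "p3", "p1"]]), 3)
print(rsf_ranking, rrf_ranking)

# Hypothetical results for two queries, with the gold page for each.
results = [["p2", "p1", "p3"], ["p1", "p3", "p2"]]
gold = ["p1", "p2"]
print(recall_at_k(results, gold, k=1), recall_at_k(results, gold, k=2))
```

Rank fusion ignores how far apart the raw scores are, while score fusion keeps that information but depends on the normalization; the weights are the knob for how much each search method contributes.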
In addition to ColModernVBERT, we also tested the ColPali and ColQwen2 multi-vector image embedding models. The key difference here is that ColModernVBERT is about 250 million parameters and these two models are about 2.5 billion parameters, so about one order of magnitude larger. We do see slightly higher performance with these other two models, particularly at recall at one, with the gain being smaller at recall at five and recall at 20. So this gives us a little more insight into choosing the 250 million parameter model and the tradeoffs we can expect there. Then we explored some of the leading closed-source models: Cohere Embed v4 on the page images and Voyage 3 Large embeddings on the text transcriptions. And we find really remarkable results from the Cohere Embed v4 image embeddings. So kudos to them. It is a fantastic embedding model for processing visual documents. And we find the same benefit with multimodal hybrid search, combining Cohere Embed v4 with Voyage 3 Large and BM25. Our retrieval leaderboard summarizing all the different methods we've tested is on github.com/weaviate/irpapers. We plan to add more methods to this, so feel free to open a PR if you're testing a method, or an issue for things you want us to test. And this is where the leaderboard will be. So next up, aside from information retrieval, what about question answering? All these LLMs these days, or at least most of them, are multimodal. They let you pass in either text or images as inputs. But how does that impact the answer quality? Does it make a difference if you pass in the page image compared to the text transcription? Before we present the results of that, we have three different baselines that we covered to ground the benchmark and sanity test this. No retrieval means the LLM just answers the question from its parametric knowledge.
It's also a good test of whether our benchmark is useful, to see if the LLMs need this private knowledge base in order to answer these questions or not. Next up is hard negative retrieval. Hard negatives are often used in training these search models, where you take the top-ranked document from the search system that's not the gold document. This might indeed be the top-ranked document, because recall at one is around 45%, so most of the time the number one ranked document isn't the gold document. But if it is, you'd give the model, say, the second-ranked document. So you're seeing whether that hard negative, a document that's probably pretty related to the query, can answer the question. I think that is a great way to explore the needle-in-the-haystack philosophy: whether the question can be answered based on that particular document or not. And then oracle retrieval: what is the question answering quality when the model is given the perfect gold document as the context? So here are the results of our question answering tests, starting off with the baselines. The no retrieval baseline gets an alignment score of 0.6. Sorry, let me back up and explain alignment score. We're using an LLM as judge to compare the output answer from the system with the ground truth that was extracted when we were creating the questions. The alignment score assesses whether the system answer is aligned with the ground truth answer. We use three different LLM-as-judge inferences to calibrate that a little bit and take a majority vote. And then we do that for each of the 180 questions. So the no retrieval baseline achieves an alignment score of 0.6, which shows that you can't answer these questions without the private knowledge base. The hard negative image context actually performs worse at 0.12.
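The majority-vote alignment score described above can be sketched as follows. This is a hedged illustration: it assumes each judge inference returns a binary aligned or not-aligned verdict and that the reported score is the fraction of questions whose majority verdict is "aligned". The talk does not spell out either detail, and the function names and toy verdicts are hypothetical.

```python
from typing import List


def majority_aligned(votes: List[bool]) -> bool:
    """True when strictly more than half of the judge verdicts say 'aligned'."""
    return sum(votes) * 2 > len(votes)


def alignment_score(per_question_votes: List[List[bool]]) -> float:
    """ASSUMPTION: the score is the fraction of questions whose majority
    verdict is 'aligned'; the talk does not define the aggregation exactly."""
    aligned = sum(1 for votes in per_question_votes if majority_aligned(votes))
    return aligned / len(per_question_votes)


# Hypothetical verdicts: three judge inferences for each of four questions.
votes = [
    [True, True, False],    # majority: aligned
    [False, False, True],   # majority: not aligned
    [True, True, True],     # majority: aligned
    [False, False, False],  # majority: not aligned (e.g. the model refused)
]
print(alignment_score(votes))
```

Using an odd number of judge inferences means a tie cannot occur, so every question gets a definite verdict.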
An interesting thing about the hard negative result is that in that particular case the LLM often refuses to answer. That's a bit of an interesting decoupling with this kind of alignment score, because sometimes when the model doesn't have the correct context, it knows that and refuses to answer. So you kind of have to account for that, and it would make for some interesting future work as we look at, say, newer methods with LLM as judge or explore rubric scoring methods. That could be an interesting way to extend this going forward. But anyway, the more interesting results are the image RAG and text RAG. This is where we take k equals 1 from the search system, and that's what we give to the question answering system. We see a huge difference, 0.40 versus 0.62, when providing the text transcription to the question answering LLM compared to the page image. So there's a pretty big gap there, but we do see that the gap is smaller with oracle retrieval. With perfect context, we don't see as much of a gap. And interestingly, we do see a gain in question answering when we extend k equals 1 to k equals 5, up to 0.71 and 0.82. You can also see the input tokens, which is the cost of doing that. You're going to slow it down and the inference is going to cost more, but you are going to get better answers. It'll probably be really interesting in future experiments to keep scaling that k and see what we get at k equals 10 or 20, and whether we hit that lost-in-the-middle problem. But generally what we find in our tests is that k equals 5 results in better answers than k equals 1. And of course, the input tokens are what you have to think about with that tradeoff.
But the key takeaway from this is probably that the LLMs perform better at question answering with the text transcriptions compared to the page images. Again, we'll have a leaderboard on github.com/weaviate/ir_papers highlighting these different strategies for answering these questions from IRPAPERS. One of the most interesting findings from our study was the success of multimodal hybrid search. We've been huge fans of hybrid search at Weaviate, combining BM25 with dense single-vector text embeddings or, say, multi-vector text embeddings. But seeing how you can now expand the scope of multimodal search for visual documents to also capture and jointly fuse ranking signals from images is just a super interesting concept. In addition to the aggregate performance gain, we find that out of the 180 queries, 25 queries exclusively succeeded with the text retrieval methods, Arctic 2.0 and BM25, and 15 queries exclusively succeeded with the image retrieval model, ColModernVBERT. That motivates us to ask the question: what kinds of questions uniquely require either images or text to answer? And then more broadly, what are the tradeoffs? How can we combine these systems to use both image and text representations? We started trying to understand this by constructing an adversarial question set, similarly providing the Claude GUI with the paper and then giving it the prompt, "Please write an information seeking question for each visual element in this document that would be impossible to answer with only an OCR text transcription of the image. If you cannot imagine any such questions, please say so." Out of the 63 visuals from the 19 papers we create questions from, we end up with only 30 questions. For a lot of these visuals, Claude is just saying, "There is no way that there is a question that the text transcription won't capture."
And in this first set of questions, we still find better question answering results with the text transcription inputs compared to the page image inputs. Digging a little more into why we think this is, the visuals in IRPAPERS are generally pretty easily transcribed into text. Out of the 63 visuals, there are 32 data charts, 10 architectural diagrams, and 21 conceptual or abstract visuals. And generally the accompanying prose will describe these things well. So it is hard to derive these questions from this particular set of visual documents. However, we do find there's one perfect visual, which is the t-SNE visualization. Take this visualization and imagine trying to transcribe it to text. You might say, "The query seems farther from the HyDE embedding than the positive documents. There's a slight cluster of positive documents." It's really hard to transcribe this into text. This is the perfect example of something that would resist transcription and still be better suited to an image representation for answering questions about it. In this case, it's a small set of questions because we're only deriving questions from this one image, but we do find higher performance in question answering when directly looking at this image compared to the text transcription. So here are some concluding thoughts on the limitations of unimodal representations, where you're either only using text or only using images, and the benefits of using both. Firstly, this is a preliminary study of our adversarial tag methodology, and we're definitely looking to improve it going forward and better figure out how to design questions that target image representations or text representations and highlight their benefit. But there's one obvious benefit of text, which is exact string matching.
So, if you're asking "what is HyDE," you can apply a text constraint on a text representation to guarantee that the top-ranked results definitely contain the string HyDE. Say you're searching through your emails, or looking through laws or things like this, and you have an exact string that you want contained in the answer. That's a huge benefit of text representations that images don't have an analog for. And then with our multimodal hybrid search, an interesting thing is that we can just use the embeddings from the images in our retrieval scoring, and we don't actually need to keep the page images around anymore. We can leave the images on disk or in the cloud, or just throw them away, and still use their embeddings for our information retrieval calculations. Another idea could be to have an agentic layer that, based on the query, determines whether the image should be sent to the reader as well as the text, or only the text, or only the image, as well as how to weight the alpha in multimodal hybrid search. So there are a lot of interesting decisions as you bring in that agentic layer and design a compound retrieval system around both text and image representations. So let's conclude on this question: to OCR or not to OCR? Sorry for borrowing "to be or not to be"; that's ChatGPT's take on Hamlet from Shakespeare. But anyway, we've made this argument for multimodal hybrid search, as well as these agentic search layer ideas, like having that agentic layer decide that the results must contain some exact text string match, or that a query requires the actual page image for producing the answer. So there are definitely arguments for doing both, but let's look at each option, OCR or not, on its own, and then of course you could combine both with the same arguments. Let's look at this through two axes, preprocessing and storage.
So, starting with preprocessing, we begin with to OCR. In our paper, we use OCR transcription with large language models, particularly OpenAI's GPT-4.1. On average for our 3,230 pages of scientific papers, we find about 1,000 input tokens and about 1,100 output tokens per page. So let's say that's 2,200 tokens. Starting with the time it takes to transcribe all these papers: at the entry tier of OpenAI, you're going to be blocked by their API rate limit, which is 30,000 tokens per minute. So even with parallel requests, you're only going to be able to transcribe 13 pages per minute, and it's going to take 4 hours to transcribe the entire corpus. It's 25 seconds per page, but really you're blocked by the rate limit, because it's easily parallelizable. And then in addition to the time, you have the cost. At $3 per 1 million input tokens and $12 per 1 million output tokens, you're looking at a little under 2 cents per page, and the 3,230 pages come out to about $54. Now, reducing the cost of this OCR transcription layer is one of the hottest areas of innovation for AI startups, I would say. There are open-source models where you could host the inference yourself. So there are definitely ways to bring this cost down, but this is probably the easiest way to do it out of the box and get a sense of maybe the upper bound of the cost: 4 hours and $54. Then on the other side, not to OCR, and this is an obvious win. It wins by so much here, because it takes 130 milliseconds to split the PDF into its constituent pages and encode them as base64 strings, and then it's off to the database. Even sequential processing would take 7 minutes, but because this is easily parallelizable, you can encode all 3,230 pages in under a minute. So images definitely win the preprocessing argument over text, but then images lose the argument on storage.
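The preprocessing arithmetic above is easy to reproduce. The sketch below plugs in only the figures quoted in the talk; the variable names are hypothetical, and because the inputs are rounded, the totals land near the quoted numbers rather than exactly on them (the dollar total comes out slightly below the quoted figure).

```python
# Figures as quoted in the talk (rounded); variable names are illustrative.
PAGES = 3_230
INPUT_TOKENS_PER_PAGE = 1_000
OUTPUT_TOKENS_PER_PAGE = 1_100
TOKENS_PER_PAGE_BUDGET = 2_200        # the rounded per-page total used in the talk
RATE_LIMIT_TOKENS_PER_MIN = 30_000    # entry-tier rate limit cited in the talk
USD_PER_MILLION_INPUT = 3.0
USD_PER_MILLION_OUTPUT = 12.0
IMAGE_SECONDS_PER_PAGE = 0.130        # split the PDF and base64-encode one page

# OCR time: throughput is capped by the token rate limit, not by parallelism.
pages_per_minute = RATE_LIMIT_TOKENS_PER_MIN // TOKENS_PER_PAGE_BUDGET
ocr_hours = PAGES / pages_per_minute / 60

# OCR cost: input and output tokens are priced separately.
cost_per_page = (
    INPUT_TOKENS_PER_PAGE * USD_PER_MILLION_INPUT
    + OUTPUT_TOKENS_PER_PAGE * USD_PER_MILLION_OUTPUT
) / 1_000_000
total_cost = PAGES * cost_per_page

# Image path: no model call, only splitting and encoding.
image_minutes_sequential = PAGES * IMAGE_SECONDS_PER_PAGE / 60

print(f"OCR: {pages_per_minute} pages/min, {ocr_hours:.1f} hours, "
      f"${cost_per_page:.4f} per page, ${total_cost:.2f} total")
print(f"Images: {image_minutes_sequential:.1f} minutes sequentially")
```

Swapping in a different rate limit or price per million tokens shows how quickly the OCR side of the comparison moves, which is why the talk frames these numbers as something like an upper bound.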
So, with to OCR, you're looking at 1,100 output tokens per page, and text is really cheap to store. It's 4.4 kilobytes per page with the text transcription and only 14 megabytes for the full corpus. Whereas one page image is 1.3 megabytes. Of course, this depends on the resolution, but we use 300 DPI, so about an average resolution for a PDF page image. That's 1.3 megabytes per page, and the full corpus is 4.2 gigabytes. So you're looking at a much higher storage cost with not to OCR. Summarizing to OCR or not to OCR on these two axes of preprocessing and storage, it's such a huge win for the images on preprocessing, whereas storage is another big win, this time for text. So these are the kinds of tradeoffs, in addition to retrieval and question answering, that you're looking at around this whole argument of whether to represent your visual documents as images or text. Thank you so much for watching this video explaining our new paper, IRPAPERS: A Visual Document Benchmark for Scientific Retrieval and Question Answering. To see more details about the different retrieval and question answering tests we ran, as well as the arguments for the new dataset, the to OCR or not to OCR arguments, and contextualization against previous works on visual document benchmarking and scientific literature mining, please check out our paper. I really hope you find it interesting, and of course, I'm more than happy to discuss any of these ideas with you further or answer any questions.
Original Description
AI systems have achieved remarkable success in processing text and relational data, however, visual document processing remains relatively underexplored. Whereas traditional systems require OCR transcriptions to convert these visual documents into text and metadata, recent advances in multimodal foundation models offer an alternative path: retrieval and generation directly from document images. This raises a timely and important question: How do image-based systems compare to established text-based methods?
To answer this question, we present IRPAPERS, a benchmark totaling 3,230 pages sourced from 166 scientific papers, with both an image and OCR transcription for each page. We present a curation of 180 needle-in-the-haystack questions for evaluating retrieval and question answering systems with this corpus. We begin by comparing image- and text-based retrieval with open-source models, as well as multimodal hybrid search. For image retrieval, we evaluate the ColModernVBERT multi-vector embedding model. For text retrieval, we evaluate Arctic 2.0 dense single-vector embeddings, BM25, and their combination in hybrid text search. Text-based methods achieved 46% Recall@1, 78% Recall@5, and 91% Recall@20, while image-based retrieval achieved 43% Recall@1, 78% Recall@5, and 93% Recall@20. These retrieval systems exhibit complementary failures, each succeeding on queries where the other fails, enabling multimodal fusion to exceed either modality alone. Multimodal hybrid search achieved the highest performance with 49% Recall@1, 81% Recall@5, and 95% Recall@20. We additionally evaluate the efficiency-performance tradeoff of MUVERA encoding with varying levels of ef, as well as the performance of the ColPali and ColQwen2 multi-vector image embedding models. To contextualize open-source performance, we further evaluate leading closed-source models. Cohere Embed v4 page image embeddings reached 58% Recall@1, 87% Recall@5, and 97% Recall@20, outperforming Voyage 3 Large text