OCR vs. Image Embeddings for PDF RAG: Which One is Better?
My colleagues at Weaviate released IRPapers, a benchmark comparing image-based and text-based retrieval over 3,230 pages from 166 scientific papers.
The setup: take the same PDFs and process them two ways. For text, run OCR with GPT-4.1, then index with Arctic 2.0 embeddings plus BM25 hybrid search. For images, embed the raw page images with ColModernVBERT multi-vector embeddings. Test both on 180 needle-in-the-haystack questions.
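Under the hood, the image side uses late interaction: a ColBERT-family model like ColModernVBERT emits one vector per query token and one per page patch, and a page is scored with MaxSim. Here is a minimal sketch of that scoring, with random unit vectors standing in for real embeddings (the shapes and the maxsim_score helper are illustrative, not the benchmark's actual code):

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, page_vecs: np.ndarray) -> float:
    """ColBERT-style MaxSim: for each query-token vector, take its best
    cosine match among the page's patch vectors, then sum over tokens."""
    sims = query_vecs @ page_vecs.T        # (n_tokens, n_patches) similarities
    return float(sims.max(axis=1).sum())   # best patch per token, summed

# Random unit vectors stand in for real ColModernVBERT embeddings;
# the shapes (8 query tokens, 256 page patches, dim 128) are illustrative.
rng = np.random.default_rng(0)
q = rng.normal(size=(8, 128))
q /= np.linalg.norm(q, axis=1, keepdims=True)
p = rng.normal(size=(256, 128))
p /= np.linalg.norm(p, axis=1, keepdims=True)
print(f"MaxSim score: {maxsim_score(q, p):.3f}")
```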
The results:
Text edges out images at the top rank: 46% vs 43% Recall@1
But images pull ahead at deeper recall: 93% vs 91% Recall@20
More interestingly, text- and image-based methods fail on different queries.
At Recall@1:
โข 22 queries succeed with text but fail with images
โข 18 queries succeed with images but fail with text
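For reference, Recall@k is the fraction of queries whose target page appears somewhere in the top k retrieved pages. A minimal sketch, with hypothetical toy rankings:

```python
def recall_at_k(rankings: list[list[str]], gold: list[str], k: int) -> float:
    """Fraction of queries whose gold page appears in the top-k results."""
    hits = sum(g in ranked[:k] for ranked, g in zip(rankings, gold))
    return hits / len(gold)

# Hypothetical toy data: 3 queries, each with one gold page.
rankings = [["p7", "p2"], ["p9", "p1"], ["p4", "p9"]]
gold = ["p7", "p1", "p4"]
print(recall_at_k(rankings, gold, k=1))  # 2/3 of queries hit at rank 1
print(recall_at_k(rankings, gold, k=2))  # all 3 gold pages within top 2
```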
This complementarity between the two failure sets is what makes Multimodal Hybrid Search work. By fusing scores from both text and image retrieval, they achieved:
โข 49% Recall@1 (beating either modality alone)
โข 81% Recall@5
โข 95% Recall@20
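One common way to fuse two ranked lists is reciprocal rank fusion; the benchmark's exact fusion method may differ, so treat this as an illustrative sketch with hypothetical page IDs:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Each page scores sum(1 / (k + rank)) across the input ranked lists."""
    scores: defaultdict[str, float] = defaultdict(float)
    for ranked in rankings:
        for rank, page in enumerate(ranked, start=1):
            scores[page] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical top-3 lists from the text and image retrievers.
text_hits = ["p12", "p3", "p44"]
image_hits = ["p12", "p9", "p3"]
print(reciprocal_rank_fusion([text_hits, image_hits]))
# p12 leads (top of both lists); pages found by only one retriever still rank.
```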
00:00 - Intro
00:08 - Visual- vs Text-based methods
01:04 - The IRPapers dataset
01:59 - The 6 different search strategies
03:43 - The results
04:30 - The paper's most interesting finding...
05:11 - Conclusion
Watch on YouTube →