From PDF to RAG-Ready: Evaluating Document Conversion Frameworks for Domain-Specific Question Answering
📰 ArXiv cs.AI
Evaluating PDF-to-Markdown conversion frameworks for RAG-based question answering accuracy
Action Steps
- Select a PDF conversion framework (e.g., Docling, MinerU, Marker, DeepSeek OCR)
- Configure the framework for text and content extraction
- Evaluate the framework's impact on downstream question-answering accuracy
- Compare results across different frameworks and pipeline configurations
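The steps above can be sketched as a small comparison harness. This is a minimal illustration, not the paper's method: it assumes each framework has already produced a Markdown string, and it swaps the LLM answering and grading stage for a simple proxy (does the expected answer string appear in the top retrieved chunk?). All names here (`chunk_markdown`, `retrieve`, `qa_accuracy`, `compare_configs`, and the toy data) are hypothetical and are not part of Docling, MinerU, Marker, or DeepSeek OCR.

```python
import re
from typing import Dict, List, Set, Tuple


def chunk_markdown(markdown: str, max_words: int = 50) -> List[str]:
    """Split converted Markdown into fixed-size word windows."""
    words = markdown.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, chunks: List[str], k: int = 1) -> List[str]:
    """Rank chunks by token overlap with the question (a stand-in for a real retriever)."""
    q_tokens = tokenize(question)
    ranked = sorted(chunks, key=lambda c: len(q_tokens & tokenize(c)), reverse=True)
    return ranked[:k]


def qa_accuracy(markdown: str, qa_pairs: List[Tuple[str, str]], k: int = 1) -> float:
    """Fraction of questions whose expected answer appears in the retrieved context.

    Assumption: substring match is only a proxy for generating an answer
    with an LLM and grading it.
    """
    chunks = chunk_markdown(markdown)
    if not chunks or not qa_pairs:
        return 0.0
    hits = 0
    for question, answer in qa_pairs:
        context = " ".join(retrieve(question, chunks, k)).lower()
        if answer.lower() in context:
            hits += 1
    return hits / len(qa_pairs)


def compare_configs(outputs: Dict[str, str], qa_pairs: List[Tuple[str, str]]) -> Dict[str, float]:
    """Score every conversion output against the same question set."""
    return {name: qa_accuracy(md, qa_pairs) for name, md in outputs.items()}


# Toy data (hypothetical): one clean conversion and one with broken word spacing.
toy_outputs = {
    "config_a": "The reactor operates at 350 degrees Celsius.",
    "config_b": "Th e re actor oper ates at 3 5 0 degrees Celsius.",
}
toy_qa = [("At what temperature does the reactor operate?", "350 degrees")]

scores = compare_configs(toy_outputs, toy_qa)
```

In a real evaluation, `toy_outputs` would hold the Markdown produced by each framework and pipeline configuration for the same PDFs, and the substring check would be replaced by the actual retrieval and answer-grading stack.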
Who Needs to Know This
NLP engineers and researchers building RAG systems can use this study to choose a PDF conversion framework based on its measured effect on downstream question-answering accuracy.
Key Insight
💡 The quality of document preprocessing significantly affects RAG-based question-answering accuracy
Full Article
Abstract:
arXiv:2604.04948v1 (announce type: cross)
Retrieval-Augmented Generation (RAG) systems depend critically on the quality of document preprocessing, yet no prior study has evaluated PDF processing frameworks by their impact on downstream question-answering accuracy. We address this gap through a systematic comparison of four open-source PDF-to-Markdown conversion frameworks, Docling, MinerU, Marker, and DeepSeek OCR, across 19 pipeline configurations for extracting text and other contents
DeepCamp AI