From PDF to RAG-Ready: Evaluating Document Conversion Frameworks for Domain-Specific Question Answering

📰 ArXiv cs.AI

Evaluating PDF-to-Markdown conversion frameworks for RAG-based question answering accuracy

advanced Published 8 Apr 2026

Action Steps

Select a PDF conversion framework (e.g., Docling, MinerU, Marker, DeepSeek OCR)
Configure the framework for text and content extraction
Evaluate the framework's impact on downstream question-answering accuracy
Compare results across different frameworks and pipeline configurations
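
The steps above can be sketched as a small comparison harness. This is a minimal illustration, not the paper's method. The converters here are hypothetical stand-in functions that return Markdown strings, where a real setup would wrap Docling, MinerU, Marker, or DeepSeek OCR behind the same callable signature. The retriever is a toy keyword-overlap ranker. The score is a simple proxy for question-answering accuracy: whether the gold answer string appears in the retrieved chunk. All names and sample data below are assumptions.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class QAItem:
    question: str
    answer: str  # gold answer string expected to appear in the source document


def chunk_markdown(markdown: str, max_words: int = 40) -> List[str]:
    """Split Markdown on blank lines, then cap each chunk at max_words words."""
    chunks = []
    for paragraph in re.split(r"\n\s*\n", markdown):
        words = paragraph.split()
        for start in range(0, len(words), max_words):
            chunks.append(" ".join(words[start:start + max_words]))
    return chunks


def tokenize(text: str) -> set:
    """Lowercased alphanumeric tokens, used for keyword overlap."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, chunks: List[str], k: int = 1) -> List[str]:
    """Toy retriever: rank chunks by the number of tokens shared with the question."""
    query = tokenize(question)
    ranked = sorted(chunks, key=lambda c: len(query & tokenize(c)), reverse=True)
    return ranked[:k]


def qa_accuracy(markdown: str, qa_items: List[QAItem], k: int = 1) -> float:
    """Proxy score: fraction of questions whose gold answer is in the retrieved text."""
    chunks = chunk_markdown(markdown)
    hits = 0
    for item in qa_items:
        context = " ".join(retrieve(item.question, chunks, k)).lower()
        hits += item.answer.lower() in context
    return hits / len(qa_items)


def compare_converters(
    converters: Dict[str, Callable[[str], str]],
    pdf_path: str,
    qa_items: List[QAItem],
) -> Dict[str, float]:
    """Run every converter on the same PDF and score each output on the same questions."""
    return {name: qa_accuracy(convert(pdf_path), qa_items)
            for name, convert in converters.items()}


# Hypothetical stand-ins: one converter keeps a value, the other loses it
# (as might happen if a table cell or inline figure is dropped).
def clean_converter(_pdf_path: str) -> str:
    return "## Dosage\n\nThe maximum daily dose is 40 mg.\n\n## Storage\n\nStore below 25 C."


def lossy_converter(_pdf_path: str) -> str:
    return "## Dosage\n\nThe maximum daily dose is\n\n## Storage\n\nStore below 25 C."


qa_items = [
    QAItem("What is the maximum daily dose?", "40 mg"),
    QAItem("Below what temperature should you store it?", "25 C"),
]

results = compare_converters(
    {"clean": clean_converter, "lossy": lossy_converter},
    "leaflet.pdf",  # placeholder path; the stand-in converters ignore it
    qa_items,
)
print(results)
```

Holding the questions, chunking, and retrieval fixed while swapping only the converter isolates the effect of conversion quality on the final score. A fuller evaluation would replace the substring check with an actual generator and answer grader.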

Who Needs to Know This

NLP engineers and researchers building RAG systems. The study helps them choose a PDF conversion framework based on its measured effect on downstream question-answering accuracy.

Key Insight

💡 The quality of document preprocessing significantly affects RAG-based question-answering accuracy

Full Article

Abstract (arXiv:2604.04948v1, cross-listed):
Retrieval-Augmented Generation (RAG) systems depend critically on the quality of document preprocessing, yet no prior study has evaluated PDF processing frameworks by their impact on downstream question-answering accuracy. We address this gap through a systematic comparison of four open-source PDF-to-Markdown conversion frameworks, Docling, MinerU, Marker, and DeepSeek OCR, across 19 pipeline configurations for extracting text and other contents
