Why Your PDF Breaks RAG (And How to Fix It)

Shane | LLM Implementation · Intermediate · 🧠 Large Language Models · 2w ago
Your RAG system is only as good as your document processing. If your PDF parser destroys table structure, retrieval starts from broken text, and a chunking strategy that cuts words or context in half makes it worse. In this video, we fix bad text extraction: we compare PyMuPDF and LlamaParse for clean markdown output, build a page-level chunking strategy with overlap, and run a proper experiment that tests 128-, 256-, and 512-token chunks on hard queries using LLM-as-judge evaluation. 📚 This is Module 2 of a 10-part RAG course.
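To make the page-level chunking idea concrete, here is a minimal sketch of one way it could work. It is not the code from the video: the function name `chunk_pages`, its parameters, and the dictionary layout are hypothetical, and whitespace-separated words stand in for real tokenizer tokens.

```python
from typing import Dict, List


def chunk_pages(pages: List[str], chunk_size: int = 256, overlap: int = 32) -> List[Dict]:
    """Split each page into overlapping windows of at most `chunk_size` tokens.

    Assumption: "tokens" are whitespace-separated words, a stand-in for a
    real tokenizer. Chunks never cross a page boundary, and consecutive
    chunks on the same page share `overlap` tokens.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[Dict] = []
    for page_number, text in enumerate(pages, start=1):
        tokens = text.split()
        start = 0
        while start < len(tokens):
            window = tokens[start:start + chunk_size]
            chunks.append(
                {"page": page_number, "start": start, "text": " ".join(window)}
            )
            # Stop once this window already reaches the end of the page.
            if start + chunk_size >= len(tokens):
                break
            start += step
    return chunks


if __name__ == "__main__":
    # Toy pages; real input would be the per-page markdown from the parser.
    pages = ["alpha " * 300, "beta " * 100]
    for size in (128, 256, 512):
        n_chunks = len(chunk_pages(pages, chunk_size=size, overlap=size // 8))
        print(f"chunk_size={size}: {n_chunks} chunks")
```

Keeping the page number on every chunk means a retrieved chunk can be traced back to its source page, and looping over the three sizes mirrors the shape of the chunk-size experiment described above. The overlap of one eighth of the chunk size is an arbitrary choice for illustration.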

Chapters (11)

0:00 The Problem with Real-World PDFs
0:50 Why RAG Pipelines Fail
1:54 Colab Setup & API Keys
2:56 Naive Extraction (PyMuPDF)
3:52 Clean Extraction (LlamaParse)
4:34 Why Naive Chunking Breaks
5:26 Page-Level Chunking Strategy
7:06 Experiment: Testing Chunk Sizes
8:00 LLM-as-Judge Evaluation
9:53 Results: What Won
10:47 What's Next