Python Libraries to Extract Tables from PDFs

NeuralNine · Beginner ·⚡ Algorithms & Data Structures ·1y ago

Key Takeaways

The video compares Python libraries for extracting tables from PDFs, including Camelot, Tabula, PDFPlumber, LLMWhisperer, PyPDF2, and Unstract.
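
A minimal sketch of that comparison for the three open-source table extractors, using the parameters mentioned in the transcript. The helper names (`camelot_tables`, `tabula_tables`, `pdfplumber_tables`, `compare`) and the report format are hypothetical, introduced only for illustration. Each library is imported lazily, so a missing package is reported instead of crashing the run.

```python
from typing import Callable, Dict, List


def camelot_tables(path: str) -> List[object]:
    """Run Camelot once per flavor and collect the pandas DataFrames."""
    import camelot  # pip install "camelot-py[cv]"

    frames = []
    for flavor in ("lattice", "stream"):
        for table in camelot.read_pdf(path, pages="all", flavor=flavor):
            frames.append(table.df)  # .df is the DataFrame for this table
    return frames


def tabula_tables(path: str) -> List[object]:
    """Call tabula-py with the settings tried in the video."""
    import tabula  # pip install tabula-py (requires a Java runtime)

    return tabula.read_pdf(
        path,
        pages="all",
        multiple_tables=True,
        lattice=True,
        stream=True,
        guess=False,
        pandas_options={"header": None},
    )


def pdfplumber_tables(path: str) -> List[object]:
    """Collect non-empty tables from every page with text-based strategies."""
    import pdfplumber  # pip install pdfplumber

    settings = {
        "vertical_strategy": "text",
        "horizontal_strategy": "text",
        "intersection_x_tolerance": 10,
        "intersection_y_tolerance": 10,
    }
    found = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            found.extend(t for t in page.extract_tables(table_settings=settings) if t)
    return found


EXTRACTORS: Dict[str, Callable[[str], List[object]]] = {
    "camelot": camelot_tables,
    "tabula": tabula_tables,
    "pdfplumber": pdfplumber_tables,
}


def compare(path: str, extractors: Dict[str, Callable[[str], List[object]]] = EXTRACTORS) -> Dict[str, str]:
    """Return a one-line summary per library for the given PDF."""
    report = {}
    for name, extract in extractors.items():
        try:
            report[name] = f"{len(extract(path))} table(s)"
        except ImportError:
            report[name] = "not installed"
        except Exception as exc:  # e.g. missing file, missing Java, scanned PDF
            report[name] = f"failed: {exc}"
    return report
```

Calling `compare("safari.pdf")` would return a dictionary keyed by library name. A table count says nothing about extraction quality, so the resulting frames still need to be inspected by eye, as the video does.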

Full Transcript

What is going on, guys? Welcome back. In this video today, we're going to learn how to parse and extract tables from PDF documents in Python. For this, we're going to compare multiple approaches and packages. If you like this video, let me know by hitting the like button and subscribing, but now let us get right into it. [Music]

All right, so when it comes to document extraction, working with PDF documents is oftentimes quite challenging, because the format is not optimized for being parsed. It is not optimized for structure and readability; it's optimized for portability, as the name Portable Document Format already says. So unlike HTML, XML, or JSON files, we don't have any predictable structure, any predictable tags or keywords in PDF documents. We have to work with whatever we get. Sometimes this is going to be images, sometimes it's going to be text; sometimes the tables are going to be delineated by drawn lines, and sometimes it's just the spacing that tells you that this is a table.

What I want to do in this video today is compare different approaches for extracting structured table data from PDF documents in Python, and for this I have prepared three different example PDFs. We have the DigitalOcean PDF here, where, for example, a table can look like this: we have some lines, we have some spacing here as well, we have another table down here, and maybe you could consider this to be a table too. Then we have the Safari PDF, where we have a table like this, and we also have a table like this down here.

So basically there are two kinds of tables. We have stream tables that look like this, so they are delineated or separated by spacing, margins, padding, and so on. And then we have tables like, for example, this one here, which are called lattice tables. These are tables separated by drawn lines, so you can see we have a structure here indicated by actual lines, not just by spacing. That would be a lattice table; the other one would be a stream table. And you can also see that this document
here is a scan, so I cannot even select anything here, because that is not text. This is a scan; it is basically just an image, and that is quite challenging to work with.

If I go into my terminal, navigate to this directory, and open one of these files in a text editor, you can see we don't have anything like a table tag. We don't have raw text here. We have some stuff that is optimized for looking the same in every operating system and in every editor, not for being easily parsed. So we need to use parsers, we need to use packages, to do that.

I want to get started right away with the first package, which is Camelot. We're going to install it by saying pip or pip3 install camelot-py, in square brackets we're going to specify cv, and we're going to put all of that into quotation marks. Camelot is a Python package made specifically for extracting table data from PDF documents, and it is also compatible with pandas.

So let us open up a new file, extractor camelot.py, and here I'm just going to say import camelot. The extraction process is quite straightforward. We just have to say camelot.read_pdf, then we need to specify, of course, the PDF, safari.pdf for example. Then we need to specify the pages; we can say pages is equal to "all" to extract the tables from all pages if we have multiple pages. Then we're going to say flavor is equal to, and now we can choose either stream or lattice. Stream is for space-separated tables, or padding- and margin-separated tables, and lattice is for line-separated tables, so when we have clear delineations. Here I'm just going to go with lattice now, and we're going to do the same thing with stream in a second. And then I also want to say suppress_stdout equals False. So I'm going to copy that, paste it, and change this to stream as well, and let's sort this into two different variables: lattice_tables is equal to that, stream_tables is equal to that. Then we can say for table in lattice_tables, print "lattice table" and then print the table, and we can do the same thing with stream_tables, "stream table". And that's it; that is all you need to try to extract tables with Camelot.

Just to see how well it performs, let's open up the Safari PDF document again. That is it, so we would expect it to extract this part, and maybe this part here as well. So let's go and say python3 extract camelot.py. We can see, okay, I think... oh sorry, I made one little mistake, because of course we don't want to just print the table object, we want to print the data frame based on that table object. So the data frame, df, means we're taking the table and turning it into a pandas data frame, and that is what we actually want to display here.

What happens now is it finds one stream table, and that's actually the one down here. Maybe let me move my camera to the bottom left for this video. What we can see is that it finds this particular table here, at least parts of it: it finds invoice totals, gross amount, discount, subtotal, additional notes, tax. So it finds basically this table down here, which is fine, but it doesn't find this table up here.

If I now go to the Camelot extractor, change the PDF from safari to digital-ocean, and do the same thing here, let's see if something changes. Let's clear this and run it again. Now I get a bit more. Let's open up that document as well. Now it finds a bunch of stuff that I'm probably not really interested in. It finds a stream table; this is the stuff up here, I guess. But all in all, it doesn't seem to find the table I'm most interested in, which is the summary table here. That's the most obvious table to me as a human. This up here could be considered a table because of the spacing, a stream table definitely, but I wouldn't consider this to be a very successful attempt.

And then finally the scan. I assume it's not going to be able to do anything here, so let's go and apply this to the scan-biogenx PDF. Let's do the same thing here, scan PDF like this, and my assumption is that I have a syntax error, yeah, because I have "pdf" twice here. But now we get "page 1 is image-based, Camelot only works on text-based pages", so it doesn't work. So it does work somewhat well for this one, it extracted this table down
here, it did not work too well on this one, and it didn't work at all on this one. So Camelot, even though it can sometimes be a good choice, in this case, for our three examples, performed okay at best when it comes to this one, and even here it didn't find this table. Besides that, it didn't perform too well.

So let us go to the next package, which is Tabula. Tabula is based on Java, and here we're going to say pip or pip3 install tabula-py. Let's see if we can get better results using that. We're going to create the file extractor tabula.py, and here we're just going to say import tabula and import pandas as pd; we're going to work with pandas again. All we have to do is say tables is equal to tabula.read_pdf, quite similar to Camelot, but we're going to provide a couple of additional parameters. First of all we specify the Safari PDF path, then pages equals "all" again, that's the same. Then we're going to say multiple_tables equals True to allow for multiple tables at once, lattice equals True, and stream equals True. What you can see right away is that this doesn't work like Camelot, where we specify lattice or stream; here we can set both lattice and stream to True in the same function call, so it extracts both kinds of tables. Then we're going to say guess equals False, and pandas_options is equal to header None. You can change these settings and play around with them; there's more to play around with here. I'm just trying a default setting that's not too complicated, where we want to extract lattice and stream tables. Now all we have to do is say for i, table in, or actually let's not do enumerate, just for table in tables, print the table.

So let's run this. What I get here now is not very good; I don't really get anything interesting. Remember, that is the Safari PDF, so that
is this document here, and it seems to have found "charge detail"; where is that? Okay, it seems to find the header here, so this charge detail, service period, subtotal, and so on, but it doesn't find anything else. Now, I'm not saying that this is the best performance it can get. There's probably some more stuff we can tweak here to get at least some better results, but it doesn't seem to work that well out of the box, and I'm not sure if I can even make it work better. So let's try a different file, the digital-ocean PDF. Here we don't get anything, so it doesn't extract anything at all. That's not good. And I assume the scan is going to be the same issue as before; the scan-biogenx PDF is probably not going to give anything either. Okay, so Tabula, at least with this code here, doesn't seem to work at all. It doesn't seem to extract any of these tables properly. It seems to be able to extract the header here, even though I said header None, but yeah, it doesn't work too well.

The next package I want to take a look at is pdfplumber. This is a package that, as far as I understand it, is well known for accuracy and customizability, for being able to adjust certain parameters. So we're going to try an approach here. I'm going to say pip or pip3 install pdfplumber; in my case it's again already installed. I'm going to open up a file here, extractor pdfplumber.py, and here we're going to say import pdfplumber, and tables is going to start as an empty list. Then with pdfplumber.open, and here we put our file, so safari.pdf, as f, and then for page in f.pages; or maybe let's call this pdf, so pdf.pages. This is going to open up the PDF and iterate over the pages, and on each page I want to get the tables: tables_on_page is going to be equal to page.extract_tables.

Now, this is important: pdfplumber is not a package that is only focused on extracting tables. The previous two, Camelot and Tabula, were made specifically for extracting tables, which is why they just had a simple read_pdf method that extracted tables. Here we have different methods, and one of them is extract_tables. What we want to do is pass some settings here as a dictionary, but first of all let's do it without, so let's see what happens with the default settings. Then we're going to say if tables_on_page, for table in tables_on_page, if table, tables.append table. Or actually, let's do this differently so we can turn this into a pandas data frame. We're going to also say import pandas as pd; by the way, if you don't have pandas, of course you need to install it, so pip or pip3 install pandas. Now we're going to append a dictionary here: page is going to be the page number, so just pdf.pages.index of the page plus one, and data is the table itself, just so we have this as a structured output. Then we're going to say for table in tables, just print them, so print pd.DataFrame of table, not tables, and then data, like this. And maybe above that we also want to print something like "page" and the table's page. There you go. So it's going to print the page and the table.

Let's go and run this now. We do get something similar to Tabula. Interestingly, we get a table here with a lot of None values, so this doesn't look too professional, but it seems to be extracting something; there's probably some data in here. And then we have this header here again, but nothing more. So let's try to add some parameters here, some more settings. What we want to do in the dictionary is say vertical_strategy is "text", and the same is going to be done for the horizontal strategy. I do have some issues with autocomplete here; something is wrong with my setup. Then we're going to say intersection_x_tolerance, so we're going to tolerate intersections, and we're going to pass 10 here, and we're going to do the same thing for y, so [Music] intersection_y_tolerance.

Now this should improve the results. Let's run this, and we get some more stuff. Maybe let's do this in full screen. Let's get out of this, python3 pdfplumber, and now we get some more interesting stuff. This seems to be, as of right now, the best result that we have gotten; it's way better than Camelot, way better than Tabula. If we compare this to the PDF document, we have, what is it, oh, up here, this seems to be the first table. Then we have charge detail and service period, and we actually get the information here. That's quite impressive. Or actually, do we? We get this here, but we don't really... no, actually we do, we do get the information here, so that's quite impressive already. We do have some formatting issues here; it's not exactly the same as before, but you can see it basically extracts line by line. It is decent; it's definitely better than what we had before. Let's see if it also works with
the other documents. So I'm going to go with the digital-ocean PDF, that is this one here. I mean, we're not displaying everything here, but it seems to be extracting too much. It extracts basically everything, and it somehow just extracts the text, and it still has some formatting issues. I would say it's definitely the best up until now, definitely better than the other two. Let's see if it can also work with a scan. I don't believe so... no, okay, for the scan it doesn't work at all. For these two it extracts a lot, so it's definitely the best solution up until now. It's not perfect, I would say there are some formatting issues, but it's definitely better.

Now, the fourth approach that we're going to take a look at is a little bit different and a little bit more unique, but it can be the best solution depending on your use case, and that is to use LLMWhisperer. LLMWhisperer and Unstract are sponsoring this video today; however, you can follow along and do everything we do here for free. You don't have to pay for anything; you can do exactly the same thing that I'm doing in the video for free and test it out. In order to use LLMWhisperer, we just have to say pip or pip3 install llmwhisperer-client. This is going to install the package, and we're going to need an API key, but as I said, you don't have to pay for anything. You just have to go to unstract.com and start for free, then you can go to the LLMWhisperer login. This is going to get you to the dashboard, and here all you have to do is go to API keys and copy your API key into a file, or into your script directly if you want to. For security reasons I have copied it into a file called .env.

Then you can just create extractor llmwhisperer.py, and we can say import time, and then from unstract.llmwhisperer import LLMWhispererClientV2. Then all we have to do is say client is equal to LLMWhispererClientV2 with base_url; we need to target the API, since this is an API request that we're sending here: https://llmwhisperer-api.us-central.unstract.com/api/v2. That is the API endpoint. In addition to that, we need to specify our API key, so we just need to say api_key is equal to, and in your case you can just copy-paste the key if you want to. I'm going to load it from a .env file, so I have a .env file where I have API_KEY is equal to and then the string that I copied from my Unstract account. So just go here and copy the API key. In my case, what I have to do now is say from dotenv import load_dotenv. You don't need to do that; as I said, you can just copy-paste your key, but since I'm recording I'm going to do it like this. Then I'm just going to say load_dotenv, and I'm also going to import os so I can say api_key is equal to os.getenv and then API_KEY. So you don't need os, you don't need dotenv, you don't need load_dotenv; you could just paste your API key here if you want to.

That is now the client that we're going to use, and all we need to do now is a very simple thing: we say result is equal to client.whisper, and we just have to specify the file path, in my case safari.pdf. What we're going to get as a result of that is an identifier, a hash that identifies the task that we submitted, and once the task is processed, once the PDF is processed, we're going to get the result. So we're going to say while True, status is equal to client.whisper_status with whisper_hash equal to result whisper_hash. That is just the status, and I need to close this of course. Then we're going to say if status status is equal to processed, then result x, let's call it, is equal to client.whisper_retrieve, and we're going to get the results, which is again based on the whisper hash, so result whisper_hash. Then we break out of the loop. Otherwise, if this is not the case, we just wait for 5 seconds so that we don't ask too often. All we need to do then is say extracted_text is equal to result x, extraction, and result_text.

So let's run this; or actually we also need to print the result, otherwise we're not going to see anything. Let's run this, and then I'm going to explain why this is a more unique approach and why it can be the best approach of all the available options. You can see we're sending this, we're waiting, and after some time we get the output. What you see here is not a table entity like in Camelot or Tabula, for example, where we get table objects and can iterate over the tables. This is essentially the PDF document extracted altogether, so this is the entirety of the PDF file extracted here. What you notice is that the structure is preserved. You can see we have the invoice date up here, the address up here, the information here, the table here, and also the table at the bottom. All of this is preserved in terms of the layout, and that is the power of LLMWhisperer, because, as the name already suggests, this is a first step to then feed this into a large language model. So large
language models can extract information from structured text, so from something like this, but large language models oftentimes have a hard time extracting it from chaotic text. If you just take something that doesn't extract the text properly, the model is going to have difficulty working with it. But if you give it a perfect layout like here, which basically looks like the PDF but in text form here in the command line, it's way easier to work with.

I can actually show you a comparison between this and another Python package called PyPDF2, which is also for text extraction but doesn't have this layout preservation. So I can create another file, extract pypdf2.py. You don't need to do that if you don't want to, but basically we can import PyPDF2 and extract the information here without calling an API, and you're going to see the difference in quality. If I say with open safari.pdf, for example, in reading bytes mode, as file, I can then create an instance, pdf_reader is equal to PyPDF2.PdfReader, and pass the file. I can say num_pages is equal to the length of pdf_reader.pages, and then for page_number in range num_pages, page is equal to pdf_reader.pages at page_number. Actually, I could probably also just iterate over the pages. Then we can say page_text is equal to page.extract_text, like this, and then text plus [Music] equals page_text plus two line breaks, and then we can print the text. So that would be text extraction without LLMWhisperer.

I can close this and run it for the same document, and you're going to see this doesn't look very good. That is the kind of stuff you get if you do it without the layout preservation. Even if all of the content is included here, it is not the same structure. So first of all, for us humans it's more readable, but also for large
language models that's much easier to process. If I take the other one, for example, the digital-ocean PDF, let's see what it looks like in LLMWhisperer. I just have to say digital-ocean PDF, and let's see. There you go: this is again all of the content that we have here, summary, total usage charges, and so on. Everything is included, and that is the perfect step for taking that and feeding it into a large language model to extract information out of it. So it's not exactly the same approach as the other packages, but when it comes to table extraction, that might actually be your best bet, because you get all the information, you get all the structure, and you can take that and give it to an intelligent large language model to make decisions based on it.

This brings us now to the other tool, which I just briefly want to mention, which is the Unstract cloud. Unstract is open source; LLMWhisperer is not. So LLMWhisperer is their proprietary approach for extracting the structured information, but Unstract itself uses LLMWhisperer and is open source, so you can host it yourself or you can use it here in the cloud. Basically what we can do here, just as a quick example: test PDF, let's call it like this, PDF PDF, whatever. Let's create a new project here. In the Prompt Studio, I can define a document parser. For example, I can say "extract the taxes", then add another prompt and say "extract the total amount", and then maybe "extract the taxes as a percentage of the total amount". So I can define certain fields here; I can also say I want to have this as a number, I want to have this as text, or whatever. What I can do then is add a document, for example the Safari PDF, and that is going to load the PDF here. I have GPT-4o connected, so I have set this up with my API key, and I can just run all prompts. This is going to run through all the prompts given this PDF, and what's important is behind
the scenes this is actually using LLMWhisperer. So this is taking the document and parsing it the way I showed you in the Python script, so we're getting the structured text. What you can see now is we get zero taxes here, we get 199 as the total, which is true, and we have zero as the percentage of taxes. Let's try a different document; let's go to manage documents and go with digital-ocean.

By the way, one thing that I forgot to show you, and this is important because it's the only approach that's actually able to do it: LLMWhisperer can also extract the information from scanned documents. If I go to the LLMWhisperer extractor and use the scan that no other package was able to handle, I can actually get results using LLMWhisperer. That's not a problem; it's able to extract the whole structured content from the scan, which is just an image. We get all the information, we get the table here, everything works, and that's the only approach of the ones we looked at today that is able to do that.

And we can also do that here. So now I have digital-ocean loaded; let's take a look at that. Actually, we don't have taxes here, right? So let's maybe change this. Or is this tax? GST is probably tax, so let's keep that. It's going to be 18%, and it's going to be 0.56. Let's run all of these. The prompts are going to go through the document, and that is then going to give us the structured output. We specified one as a number here; here we have text. It doesn't really matter; this is probably just going to give us the dollar sign as well. In this case, since it's text, it gives us more than what we asked for: taxes gives us the identifier, the percentage, and the value; here we get the total amount, and here we get the taxes as a percentage.

Now let's go to manage documents and index the whole thing. Indexed successfully. If I now go to the raw view, you can see this is actually LLMWhisperer behind the scenes: it takes the document, parses
it as text, and then we can run this through GPT-4o to extract information. So again, it's a different approach, but I think that's the most powerful approach when it comes to extracting table data from PDF documents, especially the combination of LLMWhisperer and Unstract. So make sure you check them out in the description down below. It helps the channel, and it's free; you can just try it out. You don't have to pay for anything, you don't have to submit a credit card or anything, you can just play around with it.

But this was the comparison for today, so that's it for today's video. I hope you enjoyed it and I hope you learned something. Don't forget to check out Unstract and LLMWhisperer in the description down below; they're sponsoring this video today, and again, you're helping out the channel if you also visit the sponsors. Besides that, let me know in the comment section down below if you know other packages or other approaches to extract tables from PDF documents; maybe we can find other interesting approaches as well. And don't forget to hit the like button and subscribe to this channel to not miss a single future video for free. Other than that, thank you very much for watching, see you in the next video, and bye for
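
The LLMWhisperer flow described in the transcript (submit a file, poll the status every five seconds, then retrieve the text) could be sketched roughly as follows. The client class, method names, endpoint, and response keys are the ones spoken in the video and should be checked against the current client documentation. The function names `wait_for_text` and `make_client`, the `max_polls` safeguard, and the injectable `sleep` are assumptions added here, not part of the video's script.

```python
import time
from typing import Any, Callable


def wait_for_text(
    client: Any,
    file_path: str,
    poll_seconds: float = 5.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Submit a document, poll until it is processed, return the extracted text."""
    job = client.whisper(file_path=file_path)
    whisper_hash = job["whisper_hash"]  # identifier of the submitted task

    for _ in range(max_polls):
        status = client.whisper_status(whisper_hash=whisper_hash)
        if status["status"] == "processed":
            result = client.whisper_retrieve(whisper_hash=whisper_hash)
            return result["extraction"]["result_text"]
        sleep(poll_seconds)  # wait so the API is not queried too often

    # The video loops forever; bounding the loop is an addition in this sketch.
    raise TimeoutError(f"{file_path} not processed after {max_polls} polls")


def make_client(api_key: str) -> Any:
    """Build the client as described in the transcript (lazy import)."""
    from unstract.llmwhisperer import LLMWhispererClientV2  # pip install llmwhisperer-client

    return LLMWhispererClientV2(
        base_url="https://llmwhisperer-api.us-central.unstract.com/api/v2",
        api_key=api_key,
    )
```

With a real key, `print(wait_for_text(make_client(api_key), "safari.pdf"))` would print the layout-preserved text. Passing the client in as an argument keeps the polling logic testable without network access.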

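The transcript distinguishes stream tables, where columns are separated only by spacing, from lattice tables with drawn lines. As a rough illustration of the stream idea only, and not of how Camelot, Tabula, or pdfplumber detect columns internally, layout-preserved text can be split on runs of two or more spaces. The function name `split_stream_rows` and the sample text are invented for this sketch.

```python
import re
from typing import List


def split_stream_rows(text: str, min_gap: int = 2) -> List[List[str]]:
    """Split each non-empty line into cells wherever at least min_gap spaces occur."""
    gap = re.compile(r" {%d,}" % min_gap)
    rows = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            rows.append(gap.split(stripped))
    return rows


# Invented sample in the style of layout-preserved invoice text.
sample = (
    "Description      Hours    Amount\n"
    "Droplet usage    744      $5.00\n"
    "Subtotal                  $5.00\n"
)
rows = split_stream_rows(sample)
```

Here the first two rows split into three cells each, while the "Subtotal" row yields only two, because an empty cell is indistinguishable from a wide gap. That ambiguity is one reason spacing-based extraction is fragile on real invoices.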
Original Description

In this video we compare different packages and strategies for extracting tables from PDF documents in Python.

LLMWhisperer: https://unstract.com/llmwhisperer/?utm_source=nn
Unstract: https://unstract.com/?utm_source=nn
Unstract GitHub: https://github.com/Zipstack/unstract
Code: https://github.com/NeuralNine/youtube-tutorials/tree/main/PDF%20Table%20Extraction

◾◾◾◾◾◾◾◾◾◾◾◾◾◾◾◾◾
📚 Programming Books & Merch 📚
🐍 The Python Bible Book: https://www.neuralnine.com/books/
💻 The Algorithm Bible Book: https://www.neuralnine.com/books/
👕 Programming Merch: https://www.neuralnine.com/shop

💼 Services 💼
💻 Freelancing & Tutoring: https://www.neuralnine.com/services

🌐 Social Media & Contact 🌐
📱 Website: https://www.neuralnine.com/
📷 Instagram: https://www.instagram.com/neuralnine
🐦 Twitter: https://twitter.com/neuralnine
🤵 LinkedIn: https://www.linkedin.com/company/neuralnine/
📁 GitHub: https://github.com/NeuralNine
🎙 Discord: https://discord.gg/JU4xr8U3dm

Timestamps:
(0:00) Intro
(0:23) PDF Documents
(2:43) Camelot
(7:46) Tabula
(10:55) PDFPlumber
(17:16) LLMWhisperer
(23:32) PyPDF2
(26:40) Unstract

Neural Networks Simply Explained (Theory)
NeuralNine
53 Motion Filtering with OpenCV in Python
Motion Filtering with OpenCV in Python
NeuralNine
54 Top 5 Programming Languages To Learn in 2020
Top 5 Programming Languages To Learn in 2020
NeuralNine
55 Simple TCP Chat Room in Python
Simple TCP Chat Room in Python
NeuralNine
56 Image Classification with Neural Networks in Python
Image Classification with Neural Networks in Python
NeuralNine
57 Edge Detection with OpenCV in Python
Edge Detection with OpenCV in Python
NeuralNine
58 S&P 500 Web Scraping with Python
S&P 500 Web Scraping with Python
NeuralNine
59 Simple Sentiment Text Analysis in Python
Simple Sentiment Text Analysis in Python
NeuralNine
60 Introduction - Algorithms & Data Structures #1
Introduction - Algorithms & Data Structures #1
NeuralNine

This video shows how to extract tables from PDF documents in Python using several libraries, including Camelot, Tabula, PDFPlumber, LLMWhisperer, and Unstract. It compares the approaches on example PDFs that contain both lattice tables (cells delineated by drawn lines) and stream tables (columns separated only by spacing), and weighs the strengths and weaknesses of each so that viewers can choose the library that fits their documents.

Key Takeaways
  1. Install the required libraries
  2. Import the libraries
  3. Load the PDF document
  4. Extract tables using Camelot
  5. Extract tables using Tabula
  6. Extract tables using PDFPlumber
  7. Use LLMWhisperer for table extraction
  8. Use Unstract for table extraction
💡 The choice of library depends on how the PDF encodes its tables (drawn lines versus spacing, text versus scanned images) and on the specific requirements of the project.
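
The first six steps above can be sketched roughly as follows. This is a minimal sketch, not the code from the video: the function names (`choose_flavor`, `extract_with_camelot`, `extract_with_tabula`, `extract_with_pdfplumber`) and the `has_ruling_lines` flag are illustrative assumptions, and the three libraries are third-party packages (`camelot-py`, `tabula-py`, which also needs Java, and `pdfplumber`) that have to be installed separately. LLMWhisperer and Unstract are left out because their client calls are not shown in this section.

```python
def choose_flavor(has_ruling_lines: bool) -> str:
    """Map the table kind to a parsing mode name.

    Lattice tables are delineated by drawn lines; stream tables are
    separated only by spacing.
    """
    return "lattice" if has_ruling_lines else "stream"


def extract_with_camelot(pdf_path: str, has_ruling_lines: bool, pages: str = "1"):
    """Return one pandas DataFrame per table that Camelot detects."""
    import camelot  # imported lazily so the helper above works without it

    tables = camelot.read_pdf(
        pdf_path, pages=pages, flavor=choose_flavor(has_ruling_lines)
    )
    return [table.df for table in tables]


def extract_with_tabula(pdf_path: str, has_ruling_lines: bool, pages="all"):
    """Return a list of pandas DataFrames extracted by tabula-py."""
    import tabula  # provided by the tabula-py package

    return tabula.read_pdf(
        pdf_path,
        pages=pages,
        lattice=has_ruling_lines,
        stream=not has_ruling_lines,
    )


def extract_with_pdfplumber(pdf_path: str):
    """Return every table on every page as a list of rows (lists of cells)."""
    import pdfplumber

    all_tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            all_tables.extend(page.extract_tables())
    return all_tables
```

For a PDF whose tables are drawn with lines, a call such as `extract_with_camelot("example.pdf", has_ruling_lines=True)` would be the starting point, where `example.pdf` is a placeholder path. Running the same file through more than one function and comparing the results mirrors the comparison the video makes.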
