LlamaIndex Sessions: Practical challenges of building a Legal Chatbot over your PDFs

LlamaIndex · Intermediate ·🧠 Large Language Models ·3y ago

Key Takeaways

This video discusses the practical challenges of building a legal chatbot over PDFs, including parsing Supreme Court decisions and extracting text from both scanned and digital PDF files. It explores parsing strategies and how to build a retrieval-augmented system using classical NLP, GPT-4, and the Weaviate vector database.

Full Transcript

Jerry: All right, hey everybody, this is Jerry here. We're super excited today on the LlamaIndex webinar to bring on a guest. His name is Sam Yu, he's the co-founder at the honorable.ai, and he's here to tell you all about the challenges of building a legal chatbot. So just to start, maybe Sam, you could give a quick introduction.

Sam: Hi, I'm Sam. I'm currently working as a software engineer. Because of the recent explosion of LLM chatbots, I decided to use my knowledge to build a chatbot that focuses on legal matters, especially Supreme Court decisions and opinions. Because of the recent decisions there's a lot of interest in this field, but there aren't many resources that are easily accessible to people without a legal background.

Jerry: That's a super interesting and relevant use case. Maybe to give some context, would you be able to describe a little bit about what you're building?

Sam: OK, so basically the first iteration is quite simple. I'm pulling all the legal opinions from the Supreme Court, dating back to basically the first ones, I think eighteen-something, and all the legal documents that exist as PDF files on the Supreme Court website and the Library of Congress. I get all that information together, do some data extraction, pre-processing, metadata, and embeddings, and then feed that into the LLM, so that when the user asks a question it will accurately retrieve information relevant to their query.

Jerry: Got it. As an end-user UX, is it just to allow users to understand Supreme Court cases, or are there other high-level goals that you have in mind?

Sam: This is just the first iteration. It's beginning just as a chatbot for users to ask any question they're interested in about the Supreme Court. You can be a lawyer or you can be a layman; it works both ways. But
down the road I do plan to add, let's say, the transcripts, the audio files, or the law, the code itself, embedded into the LLM, so that when the user asks a question it will retrieve more relevant results.

Jerry: Sweet, that's awesome. And then maybe just to dive into the nature of these documents a little bit more, can you tell me a little bit about the format of a Supreme Court case and what the overall structure of this data is?

Sam: Across over 200 years there are so many different formats of documents. You have the more modern documents, which are pretty much very nicely formatted PDF files. But when you go back to, let's say, pre-1980s, a lot of the documents are scans of old paper documents, which are sometimes not very nicely formatted, and some of them have handwriting on them, so it can be tricky. Sometimes they weren't scanned well, so you have missing pieces here and there, or maybe the paper wasn't preserved well. So once you convert the PDF file to a text file, it can have lots of extra artifacts, or missing pieces here and there, or text that was not recognized well.

Jerry: Got it. What about the nature of the text itself? I think you mentioned briefly that there are probably different judges writing concurring opinions and dissenting opinions. What are some of the processing challenges there?

Sam: It's the formatting of the documents themselves. As you said, you have the opinion, the majority, so you have a judge writing the opinion piece, and then you have additional judges maybe writing concurrences to the opinion, which are not the same as the
opinion, and then you have the dissenting judges, and there could be multiple dissents. So in one PDF you might have several different pieces of documents. When you process the PDF as a single document, you might put the judges in as metadata, but if you put the metadata all together it may not reflect that what, let's say, Judge Sotomayor said is different from what Judge Roberts said. So in order to get an accurate response, you want to separate the document into different pieces, but at the same time you want metadata that really reflects where each judge stands on the case, so that all the information is readily available when you do the retrieval.

Jerry: Super interesting. So basically you want to associate the relevant text and opinions with the judge who's actually associated with that specific section, right? Because you could have multiple judges each writing their own opinion on the case.

Sam: Yeah, and there are some, I would call them edge cases. There are times when one judge agrees with an opinion but says, OK, I agree with all of the opinion except the first sentence. So when you retrieve those, how are you going to process that one sentence? These are the edge cases, and I haven't found a really good solution. But this is something you really need to think about when you do the embedding: how you process these documents.

Jerry: Gotcha. And maybe the overall idea of how you assign the text to a person: what are some of the processing strategies that you have tried for that? Because that's pretty relevant, right? Even if you process a chat history, for instance, it's a different use case but it seems pretty relevant, and
in this case it's a Supreme Court opinion with a bunch of different opinions floating around in the document.

Sam: Yeah. Initially I would just process the documents straight, without considering it too much. But once I went into the details, I started using, let's say, NLP instead of the LLM, just using NLP to extract the judges' names from the text itself and then assign them to each piece based on whether they're dissenting or concurring. So NLP was one way of doing it, but because of the edge cases, some documents do not work well with NLP. So I've been discovering that GPT-4 is really good at classification. You can basically do a zero-shot prompt: just give the whole chunk to GPT-4 and it will tell you, by category, who is dissenting, who is agreeing, and who is concurring, for every single one.

Jerry: Very nice. So you actually use GPT-4 to process the document and extract the relevant metadata, and you can just give it a piece of the document?

Sam: Yeah, I just give it a piece of it. So basically it's like self-querying retrieval: just extracting the entities given the conditions.

Jerry: Got it. And then you use this processed information and store it in a vector database that you use for later retrieval?

Sam: Yeah, exactly. And I'm trying to get additional metadata from the court website to understand what the case is about, what the area of interest is, and which lower court it came from. These data are going to be added into the metadata as well, so when you do a retrieval you can give it more keywords to more precisely locate whether this case is relevant to your interest
or not.

Jerry: Gotcha, that's awesome. If you're interested in contributing a Supreme Court case loader to LlamaHub, which is our site for data loaders for LLMs, I think that would be awesome, because this seems like a pretty relevant and useful domain-specific use case, right? For Supreme Court cases there's a certain way you can process and extract this information and create a nice document representation.

Sam: Yeah, definitely. Once I've finalized my end of how to best process these data and put them all together, I'm definitely interested in contributing to LlamaIndex.

Jerry: No, that's awesome. And maybe the step beyond this: now that you have this document and you've extracted some of this information, what is the way you're representing the document? Is it still text chunks with metadata for each text chunk, or are you thinking about a slightly different approach?

Sam: So basically I'm converting the PDF into a text file, making some additional processing adjustments, and then putting the metadata into the document file itself. This way it stays with the document file, so when we do the retrieval it will still be there.

Jerry: Got it, so you're basically inserting the metadata into the file and then storing that somewhere. Awesome. What is the later part of your stack? What kind of vector database are you using, how are you doing initial retrieval, and how are you thinking about some of the failure cases of the initial retrieval stack?

Sam: I've been testing all kinds of stores to see which one is the best match for my use case. I think each store is very different these days. The current one I'm picking is Weaviate, because Weaviate has this
hybrid search method: you can provide it with keywords and then do the query, so it will be more precise. Because I've embedded the metadata, I definitely want the retrieval to have a more precise outcome.

Jerry: Got it, makes sense. Maybe before hybrid search, just thinking about semantic search: did you try that, and if so, what were some of its failure cases?

Sam: Sometimes it doesn't capture everything. When you do a semantic search, if your sentence has multiple judges, sometimes it doesn't capture well who is the most prominent in the sentence.

Jerry: Does that change even after you have added the metadata about the corresponding judge?

Sam: So basically when I'm pulling out those judges, I put a weight on the judge. The one that appears first has a higher weight, because basically that's the person who wrote it, and the others agree with the first judge. This way I'm telling the data to focus more on the first judge.

Jerry: I see. Was that the solution that made sure it fetches more relevant results, or even with that was it still fetching all that other information?

Sam: It's not 100% working. Sometimes it works well, but sometimes it doesn't. I wish it could work 100% of the time, but it doesn't at this time. I'm trying to figure out how to, maybe, add more metadata, but at the same time I have to be aware that users may not put in very long queries. Users may just put in very simple sentences and they want to get the best result.

Jerry: Got it, makes sense. And along those lines, you mentioned the user might enter different types of queries to ask these types of questions.
What is the class of questions that you're looking to answer, and could you give some examples of those?

Sam: Let's say I wanted to know whether Judge Kavanaugh has written anything about a specific area, like interstate commerce. First it will process the query to pick out that the area of interest is interstate commerce and the judge is Kavanaugh. Then you put these two into the keyword search to retrieve the relevant documents. Then you can use either a re-ranker or other processing, and then you run the query through the embedding retrieval to pull out the relevant information.

Jerry: Got it. Have you noticed a substantial difference once you go from pure top-k semantic search to adding in some sort of hybrid search or keyword filtering component?

Sam: Oh, it definitely increased the accuracy. It's not 100% there yet, but early on it was maybe around 30, and now you can get to maybe 60 or 70, a higher score.

Jerry: Super interesting. Maybe just for our listeners, could you give a sense of what exactly hybrid search is doing in this case that helps improve the accuracy?

Sam: So basically, like I said earlier, you have the metadata. The first step is a metadata search to find the documents that are embedded with that metadata, so the first retrieval gets those documents back. Then the second step uses the query, through the embedding retrieval, to find the more relevant information within those documents, instead of going through everything or going through a summary of your documents.

Jerry: Got it, so it's a way of increasing the precision of your retrieved documents, right? Because
without the metadata maybe you're getting back stuff that matches the semantic search part, the embedding similarity, but you're not necessarily getting back documents that fit the keywords exactly. Got it. The other part that you mentioned is this idea of a second-stage re-ranking module. Have you tried that, and if so, could you give a description of how it works and the stuff that you tried?

Sam: Yeah, I've tried that, but the results are kind of mixed right now for me. I'm still in the process of refining it, trying to find the sweet spot for how to best use it. Basically, you retrieve the documents first, then the re-ranker gives each document a score, and through those scores you put the more relevant documents first, and then you do the second step, the embedding retrieval from the queries.

Jerry: Oh, I see. Are you doing the embedding-based retrieval first or as a second stage?

Sam: I forgot my own pipeline. I think I'm doing the embedding retrieval first, then the re-ranking, and then it goes through the prompt.

Jerry: I see, and then you use the LLM to do some sort of re-ranking. Got it, makes sense. Cool. Taking a step back, what are your favorite PDF OCR packages? What are some existing parsers that you think are pretty good?

Sam: Personally, if you want to do it locally, Tesseract is probably the best. I can show you some examples; let me share my screen.

Jerry: Sounds great.

Sam: All right, can you see my screen?

Jerry: Yep.

Sam: OK, so basically this is just a test on some of the documents I'm processing. On the right is one of the Supreme Court opinions from 1968, which you can see is a scan of the paper document.

Jerry: Oh nice, it's literally just a scan, right?
OK, all right.

Sam: Yeah, so you have handwriting all over the place and you have watermarks, so it's everything everywhere, and it can be really confusing for some of the PDF loaders. I can show you the results I got. The first one is from Tesseract. As you can see on the left, the result is not too bad, but some of the words are definitely missing, and it doesn't really know exactly what it's looking at. Another one is pdfminer.six. It does fairly well: as you can see, the first sentence is "Supreme Court of the United States", whereas Tesseract didn't pick that up, so this one has more text at the top. So each document is different and each parser is different. Sometimes a document will work well with one parser but not with another. Let me show a worse example: this is PyPDF2. It gives out something, but as you can see at the bottom, it doesn't work well.

Jerry: Oh, interesting. So this is trying to do OCR over this document, right? Maybe the characters are just not in the right format.

Sam: Yeah. And I want to show you one method I'm using to clean up documents, which is using GPT-4. I basically feed the OCR text of the scan into GPT-4 and ask it to correct it, and it actually does a really decent job. As you can see, it captures pretty much everything that's on the page. I don't really know if GPT-4 has this document in its training data or not, but this is a very impressive result. It doesn't capture everything: one thing I think it's missing is the page number. But it pretty much captures the essence of the document.

Jerry: This is awesome. I actually think this would be a super useful notebook to share.

Sam: Yeah, so
this is one way of processing documents, but because GPT-4 is so costly it may not be feasible to run it over your entire document base. But it's something you can think about.

Jerry: That's fair. Oh, I see you have all the other PDF libraries in there too. If you just have a comparison of all the different PDF parsers and showcase some of the texts, I actually think that would be a great comparison tool. You know how there's nat.dev for LLMs, so you can compare the inputs and outputs of different algorithms? The same for PDF parsers: just throw in different PDFs and take a look at how the output differs. I think that would be super useful.

Sam: Yeah, that's a great idea, definitely. I can show you another thing you want to think about: tables. A lot of PDFs have tables, but they don't really process well. Let's say I'm using PDFMiner: it does extract the text, but it's not really useful. If you just put it into an LLM it may not process nicely, and I ran into that problem a lot. So you might want to change it into a data frame. This way you have more structured data, and there are so many packages out there that can process a data frame very nicely with an LLM. So this is something you might want to think about when you're running, let's say, a table-heavy PDF.

Jerry: Very interesting. So this table parsing that you described is for parsing tables within a PDF, right?

Sam: Yes.

Jerry: And for other text, does it parse hybrid data, or only tables within a PDF?

Sam: I think this one only extracts tables; it doesn't take out the text. But I'm kind of searching the research field, trying to find a mixed parser to put everything together,
basically extracting tables and extracting text and putting them all together at the end.

Jerry: Yeah, that's the next question I was going to ask, actually. If you think about a PDF that has a lot of unstructured text, which you could get from OCR or from the text directly in the PDF, and it also has structured data, how are you thinking about creating a unified document representation that contains both elements, or merging the structured and unstructured data somehow?

Sam: I'm thinking you probably need to have a mixed format. You have the data frame, or CSV format, and you have the text format. But I'm still struggling with how to reference the other documents in your document so they know where that information is located. That's an issue I haven't really worked out yet.

Jerry: I see, so being able to reference other sections within the document and model those relationships. Makes sense.

Sam: Yeah. And lastly I just want to showcase real quick handwritten PDFs; sometimes they don't work well. The first one, PDFMiner, doesn't recognize any text at all for some reason. So if you're processing hundreds of documents at once, you may not even realize you have missing documents, because some of the parsers may just not process them, while other parsers process them well. Tesseract does a reasonably decent job. And there are benchmarks out there: in the handwritten category, some of the parsers do very poorly. If your company or organization has some specific kind of documents, you might even want to consider training on your own data, so the parser can understand your data way better than the existing parsers out there.

Jerry: I
see. So is this a public benchmark, or did you create this?

Sam: Oh, this is a public benchmark.

Jerry: So it's comparing the quality of different OCR tools across different categories of data. Could you help distill those results a little bit more? What are these tools doing well, and where are they not?

Sam: Yeah. Category one is basically random Wikipedia and Google search pages, so it's a very clear HTML format; even though it's saved as a PDF file, I believe it still retains the text very nicely. So pretty much every parser out there can get that text with close to 100% accuracy. The problem starts when you deal with handwriting, which is category two. As in the example I showed, some parsers do very poorly. Here the Azure parser is below 20, ABBYY is about 50, and you have a higher rate with AWS and GCP. So it can really vary, and you have to be very careful about which parser you're using if your documents contain any handwritten information.

Jerry: Wow, the discrepancy is huge. GCP is pretty good and Azure is pretty low. Got it. It seems to me there's an interesting challenge with OCR itself, where if the document is well formatted it seems to do pretty well, and then handwriting just seems very volatile, right? That creates a lot more variability in the performance.

Sam: Yeah, and you have to consider that your company's format may be different from the training data those models have. In cases like that you might want to train your own model on your data to get a more accurate representation when parsing.

Jerry: Awesome, this is great. I think the next question I was just
wondering about is: there are a lot of these PDF parsing and OCR packages and tools that you've played around with, and this is a great analysis. Is there anything that you wish these PDF parsers had that just doesn't exist right now?

Sam: I would say it has to be à la carte. Consider that some of the PDFs you download from the internet may have a security setting, so your default PDF parser may not be able to get past that security. That's one issue you might face. The second is, if your PDF has a lot of images or tables, how is that processed? Maybe Tesseract or another parser can parse straight-text PDFs really well, but once it encounters PDFs with mixed data, how does it handle that information? I think that's a key factor.

Jerry: Yeah, makes a lot of sense. And looking more broadly at LLM-based retrieval-augmented generation applications, and your application also falls in this category, what are some of the challenges that you think still exist, and what are some potential future directions that you're excited about and want to tackle?

Sam: I think domain-specific document processing is still kind of lacking. The existing products out there are mostly tailored toward "we can process any document", but they fail if you have more mixed-format documents or a foreign language. Sometimes they just don't process at all. I've tried some parsers that just don't handle, let's say, Asian languages well. So this is something to definitely consider when you're doing document processing.

Jerry: Got it, so more
multilingual capabilities.

Sam: Yeah, exactly. And more domain-specific: let's say you want to process just a transcript, or a Supreme Court opinion, or a court document. Those have to be handled very specifically; they may not work well with the current PDF parser that's in one of the packages.

Jerry: And on the LLM side, you talked a little bit about two-stage re-ranking and hybrid search. Are there any other potential directions that you'd be interested in exploring, beyond some of the existing hybrid search and re-ranking directions that you've tried?

Sam: I think it comes down to how to parse the query. I think that's another key to picking up the relevant information. Basically, the process of using an LLM to improve the query quality is something I really need to look into, to see what can be done to go beyond the limited information that's been provided.

Jerry: OK, makes a lot of sense. Well Sam, thanks so much for being here today. I think this was a really educational podcast for me. I learned a lot about PDF parsing and a lot of the challenges, and it really does seem like, to build something useful, this is one of the key things you'd have to solve for your domain-specific chat app. So it was great learning about some of the thoughts that you have in this area. Thanks so much.

Sam: All right, thank you. Thank you for having me.
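
The two-stage retrieval Sam describes (first narrow the candidates by metadata such as the judge, then rank what is left by embedding similarity to the query) can be sketched roughly as follows. This is only an illustration under stated assumptions, not Sam's pipeline or the Weaviate API: the `Chunk` class, the `two_stage_retrieve` function, the metadata keys, and the sample texts are all hypothetical, and `embed` is a toy bag-of-words stand-in for a real embedding model.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """One piece of an opinion plus the metadata extracted for it (hypothetical schema)."""
    text: str
    metadata: dict = field(default_factory=dict)


def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: bag-of-words term counts.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def two_stage_retrieve(chunks, query, filters, top_k=2):
    # Stage 1: keep only chunks whose metadata matches every filter exactly.
    candidates = [c for c in chunks if all(c.metadata.get(k) == v for k, v in filters.items())]
    # Stage 2: rank the surviving chunks by similarity to the query text.
    q = embed(query)
    ranked = sorted(candidates, key=lambda c: cosine(q, embed(c.text)), reverse=True)
    return ranked[:top_k]


# Made-up sample data, for illustration only.
chunks = [
    Chunk("the commerce clause permits congress to regulate interstate trade",
          {"judge": "Kavanaugh", "role": "majority"}),
    Chunk("the commerce clause does not reach this purely local activity",
          {"judge": "Sotomayor", "role": "dissent"}),
    Chunk("the fourth amendment requires a warrant here",
          {"judge": "Kavanaugh", "role": "concurrence"}),
]

results = two_stage_retrieve(chunks, "interstate commerce clause", {"judge": "Kavanaugh"}, top_k=1)
print(results[0].text)
```

The metadata filter is what keeps the Sotomayor chunk out even though its wording overlaps with the query, which is the precision gain discussed in the conversation. In a real system the filter values would come from parsing the user's query, and the ranking would use a proper embedding model.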

Original Description

In this video, we chat with Sam Yu on practical challenges of 1) parsing Supreme Court decisions, and 2) building an LLM-powered chatbot over them. A lot of challenges in building a retrieval augmented system boil down to challenges in parsing the data. We talk about different strategies for parsing, the pros/cons of different PDF parsing/OCR packages, and also different retrieval strategies. Background: Sam is an AI product engineer currently developing an application with AI capabilities. His goal is to utilize an LLM in order to democratize specialized domain knowledge, making it accessible to everyone.
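
One parsing pitfall raised in the conversation is that a parser can return no text at all for a scanned or handwritten PDF, so a batch of hundreds of documents may silently lose some. A minimal sketch of a check for that is shown below. The function name `find_suspect_outputs`, the `min_chars` threshold, and the two fake parsers are assumptions made for illustration; in practice the callables would wrap real extractors such as the ones compared in the video.

```python
from typing import Callable, Dict, List


def find_suspect_outputs(
    paths: List[str],
    parsers: Dict[str, Callable[[str], str]],
    min_chars: int = 50,
) -> Dict[str, List[str]]:
    """Return, per parser, the paths whose extracted text is empty or very short."""
    suspects: Dict[str, List[str]] = {name: [] for name in parsers}
    for name, extract in parsers.items():
        for path in paths:
            try:
                text = extract(path)
            except Exception:
                text = ""  # treat a parser crash the same as an empty result
            if len(text.strip()) < min_chars:
                suspects[name].append(path)
    return suspects


# Hypothetical stand-ins for real extractors, so the sketch runs without any PDFs.
def fake_text_layer_parser(path: str) -> str:
    # Pretends that scanned files have no embedded text layer.
    return "" if "scan" in path else "Supreme Court of the United States " * 5


def fake_ocr_parser(path: str) -> str:
    # Pretends that OCR always recovers some text.
    return "Supreme Court of the United States " * 5


report = find_suspect_outputs(
    ["modern_opinion.pdf", "1968_scan.pdf"],
    {"text_layer": fake_text_layer_parser, "ocr": fake_ocr_parser},
)
print(report)
```

Running every parser over the same files and comparing the flagged paths also gives a crude version of the side-by-side parser comparison suggested in the discussion.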

LlamaIndex Webinar: Building LLM Apps for Production, Part 1 (co-hosted with Anyscale)
LlamaIndex
38 LlamaIndex Webinar: Learn about Fine-tuning + RAG (w/ Victoria Lin, author of RA-DIT)
LlamaIndex Webinar: Learn about Fine-tuning + RAG (w/ Victoria Lin, author of RA-DIT)
LlamaIndex
39 LlamaIndex Webinar: What's next for AI after OpenAI Dev Day?
LlamaIndex Webinar: What's next for AI after OpenAI Dev Day?
LlamaIndex
40 Introducing create-llama
Introducing create-llama
LlamaIndex
41 LlamaIndex Webinar: PrivateGPT - Production RAG with Local Models
LlamaIndex Webinar: PrivateGPT - Production RAG with Local Models
LlamaIndex
42 Multi-modal Retrieval Augmented Generation with LlamaIndex
Multi-modal Retrieval Augmented Generation with LlamaIndex
LlamaIndex
43 LlamaIndex Webinar: LLaVa Deep Dive
LlamaIndex Webinar: LLaVa Deep Dive
LlamaIndex
44 A deep dive into Retrieval-Augmented Generation with Llamaindex
A deep dive into Retrieval-Augmented Generation with Llamaindex
LlamaIndex
45 LlamaIndex Workshop: Multimodal + Advanced RAG Workhop with Gemini
LlamaIndex Workshop: Multimodal + Advanced RAG Workhop with Gemini
LlamaIndex
46 LlamaIndex Webinar: Efficient Parallel Function Calling Agents with LLMCompiler
LlamaIndex Webinar: Efficient Parallel Function Calling Agents with LLMCompiler
LlamaIndex
47 Introduction to Query Pipelines (Building Advanced RAG, Part 1)
Introduction to Query Pipelines (Building Advanced RAG, Part 1)
LlamaIndex
48 LLMs for Advanced Question-Answering over Tabular/CSV/SQL Data (Building Advanced RAG, Part 2)
LLMs for Advanced Question-Answering over Tabular/CSV/SQL Data (Building Advanced RAG, Part 2)
LlamaIndex
49 LlamaIndex Webinar: Advanced Tabular Data Understanding with LLMs
LlamaIndex Webinar: Advanced Tabular Data Understanding with LLMs
LlamaIndex
50 Ollama X LlamaIndex Multi-Modal
Ollama X LlamaIndex Multi-Modal
LlamaIndex
51 Build Agents from Scratch (Building Advanced RAG, Part 3)
Build Agents from Scratch (Building Advanced RAG, Part 3)
LlamaIndex
52 LlamaIndex Webinar: Build No-Code RAG with Flowise
LlamaIndex Webinar: Build No-Code RAG with Flowise
LlamaIndex
53 LlamaIndex Sessions: Practical Tips and Tricks for Productionizing RAG (feat. Sisil @ Jasper)
LlamaIndex Sessions: Practical Tips and Tricks for Productionizing RAG (feat. Sisil @ Jasper)
LlamaIndex
54 Introduction to LlamaIndex v0.10
Introduction to LlamaIndex v0.10
LlamaIndex
55 Build SELF-DISCOVER from Scratch with LlamaIndex
Build SELF-DISCOVER from Scratch with LlamaIndex
LlamaIndex
56 Introducing LlamaCloud (and LlamaParse)
Introducing LlamaCloud (and LlamaParse)
LlamaIndex
57 LlamaIndex Sessions: 12 RAG Pain Points and Solutions
LlamaIndex Sessions: 12 RAG Pain Points and Solutions
LlamaIndex
58 LlamaIndex Webinar: RAG Beyond Basic Chatbots
LlamaIndex Webinar: RAG Beyond Basic Chatbots
LlamaIndex
59 A Comprehensive Cookbook for Claude 3
A Comprehensive Cookbook for Claude 3
LlamaIndex
60 LlamaIndex Webinar: RAPTOR - Tree-Structured Indexing and Retrieval
LlamaIndex Webinar: RAPTOR - Tree-Structured Indexing and Retrieval
LlamaIndex

This video covers the practical challenges of building a legal chatbot over PDFs with LLMs, in particular parsing Supreme Court decisions and extracting data from PDF files. It discusses strategies for parsing documents and building a retrieval-augmented system using NLP tooling, GPT-4, and the Weaviate vector database.

Key Takeaways
  1. Extract data from PDF files using NLP and GPT-4
  2. Implement retrieval-augmented generation using the Weaviate vector database
  3. Use hybrid search and re-ranking to improve query quality
  4. Fine-tune LLMs for domain-specific tasks
  5. Improve accuracy of PDF parsing and OCR using tools like Tesseract and PDFMiner
💡 Domain-specific document processing is lacking in current products, and two-stage re-ranking and hybrid search are potential directions for improving query quality.
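
To make the hybrid search and two-stage re-ranking idea concrete, here is a minimal sketch in plain Python. It is not the pipeline from the talk: the term-overlap scorer stands in for a real keyword index such as BM25, the toy vectors stand in for real embeddings, reciprocal rank fusion is one common way (assumed here) to merge the two rankings, and the `rerank` callback is a hypothetical placeholder for a cross-encoder or LLM-based re-ranker. All function names are illustrative.

```python
import math
from collections import Counter


def keyword_scores(query, docs):
    """Crude term-overlap score; a stand-in for a real keyword index (e.g. BM25)."""
    q_terms = query.lower().split()
    scores = {}
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        scores[doc_id] = sum(counts[t] for t in q_terms)
    return scores


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def vector_scores(query_vec, doc_vecs):
    """Similarity of the query embedding to each document embedding."""
    return {doc_id: cosine(query_vec, v) for doc_id, v in doc_vecs.items()}


def ranked(scores):
    """Doc ids sorted by descending score, ties broken by id."""
    return sorted(scores, key=lambda d: (-scores[d], d))


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several rankings: each doc earns 1 / (k + rank) per ranking."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return ranked(fused)


def hybrid_search(query, query_vec, docs, doc_vecs, top_k=2, rerank=None):
    """Stage 1: fuse keyword and vector rankings. Stage 2: optional re-rank."""
    candidates = reciprocal_rank_fusion([
        ranked(keyword_scores(query, docs)),
        ranked(vector_scores(query_vec, doc_vecs)),
    ])[:top_k]
    if rerank is not None:
        # Hypothetical second stage: a slower, more precise scorer
        # applied only to the short candidate list.
        candidates = sorted(candidates, key=lambda d: -rerank(query, docs[d]))
    return candidates


# Toy corpus with hand-made 2-d "embeddings" (assumptions for illustration).
docs = {
    "a": "miranda rights warning custody",
    "b": "commerce clause interstate trade",
    "c": "custody interrogation suspect",
}
doc_vecs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1]}

first_stage = hybrid_search("custody warning", [1.0, 0.0], docs, doc_vecs)
second_stage = hybrid_search(
    "custody warning", [1.0, 0.0], docs, doc_vecs,
    rerank=lambda q, text: 1.0 if "interrogation" in text else 0.0,
)
```

The design point is that the first stage is cheap and recall-oriented, so it can run over the whole corpus, while the re-ranker only sees the `top_k` survivors and can therefore afford to be expensive.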
