LlamaIndex Sessions: Practical Tips and Tricks for Productionizing RAG (feat. Sisil @ Jasper)

LlamaIndex · Intermediate · 🧠 Large Language Models · 2y ago


Key Takeaways

The video discusses practical tips and tricks for productionizing RAG systems using LlamaIndex abstractions, including PDF parsing, chunking, and indexing, as well as techniques for improving retrieval and synthesis.

Full Transcript

Everyone, welcome back to a special LlamaIndex session. We're excited to feature Sisil Mehta, ML engineer at Jasper AI, and we're featuring some very, very practical tips and tricks for productionizing RAG. We've mentioned this a lot, but building production-grade RAG is actually quite hard. There are a lot of choices that developers have to make, from the data parsing and ingestion side all the way to retrieval, reranking, querying, and response synthesis. And so this is actually a real example of where these tips and tricks helped in building a live RAG application in production, powered by some aspects of LlamaIndex abstractions. So without further ado, Sisil will take us through some of these slides. Sisil, welcome, and feel free to introduce yourself.

Yeah, thanks for having me on the podcast, Jerry, and hey everyone. Like Jerry mentioned, I'll be talking about some tips and tricks to productionize RAG systems. So I'll just share the presentation and then we'll kick it off from there.

Sounds great.

All right. So for the better part of last year I've been working on building out a RAG system at Jasper, and as we were building this out we had a lot of questions about how to make good design choices, or system design choices, and there were a lot of learnings along the way, and I thought it would be great to share that with the community at large. So we'll walk you through the different parts of the RAG system and how we weighed the pros and cons, what choices we made and why we made them. We'll talk about really understanding the customer data, which actually does influence a lot of decisions on how you build your system, and also what kind of questions the customer will actually be asking your system, and that also
helps you decide how you actually ingest, how you store, and how you serve those queries. Then we'll walk through data processing, retrieval, reranking, synthesis, user experience, and the tradeoff of latency versus accuracy in RAG systems, and towards the end we'll cover feedback and continuous learning.

All right. Since we are a marketing AI company, a lot of our documents or content is marketing focused. So often we see that this marketing content contains brand names and new, made-up nouns, novel language, phrases, or terms, and a lot of these tend to be out of domain. That means standard LLMs would not have seen these, or standard embedding models would not have seen these terms before, because they were basically invented as new brand names that didn't exist before. The other thing we had to think about was that a lot of our documents are PDFs, so the way we ingest and process PDFs had to be really spot-on to get good outputs. And generally content is a mix of text, images, and tables as well; tables because a lot of content could be around analysis or analytics of the specific company in question.

The other thing we had to really think about was the user intent. Of course, naturally people want to query your knowledge store to get the right kind of documents as a result, but a lot of times people are using this in ways we did not expect. For example, they would want to summarize a specific document, or they would want to summarize a list of specific documents. Also they would want to compare different concepts from different documents, so they
might want to do a comparison between some campaigns across one or two documents. Keeping that in mind, we had to make sure that we could retrieve multiple documents but also synthesize answers across a host of these documents, or, if they were trying to summarize documents, that we could not only retrieve them but summarize answers across a bunch of documents. I think one of the nice things was that keeping the first two aspects in mind, which is the data and the user, meant we had to test a lot of different techniques, which is why initially when we started building this out we wanted to have an abstraction layer, and so we picked LlamaIndex, right, obviously. So we could actually try a lot of different techniques really quickly before we decided which ones we wanted to settle on. That was nice. It also helped structure our code really well and segregate the different aspects of retrieval, which I'll talk about in the user experience section, and that was really useful for us.

Okay, so I think one of the key issues, like I mentioned, was that we have a lot of PDF documents, but initially when we started this work we noticed that our PDF-to-text parsing actually had a lot of problems and had significantly impacted the quality of our retrieval, because since the PDF-to-text parsing wasn't that great, we could never recover a lot of documents during the retrieval process. At that point we were using a bunch of open source PDF parsers, and one of the issues we were seeing was that the semantic structure of the PDF would be lost, which means we couldn't figure out whether some text was grouped under a heading, or whether a particular piece of text belonged to a subheading, or a slide, or a page, so we
couldn't figure those things out, and so when users were trying to reference a particular topic or heading we weren't able to fetch all the content related to that heading. The other issue we were seeing was the parsing of tables. Since there are so many tables in these documents, often you'd see that when tables were parsed they would just get parsed as generic text, so you couldn't figure out which column had which values under it. And so if somebody asked to, say, summarize some text or some numerical data from a particular column, you couldn't do that anymore because there was no relationship between the column and the text underneath it. We also saw a lot of complex tables where a table would be split into two parts, or a particular table had sub-columns underneath it, and I show examples of that in the following slide. Also, a lot of times there would be lots of infographics in these PDFs. You would have some numbers, for example 10% of value recovered from some X process, and you would have some data related to that number right underneath it, and we would lose the context of the fact that the number is related to the data underneath, just because of the quality of the parsing outputs that we were getting.

Here are some examples that I managed to pull out. The first one right here is this table, and you can see there are sub-columns underneath the last results column. Often when a table like this got parsed, you would lose the fact that accuracy is a separate column, a sub-column underneath results, and that it had some numbers, so you lose that association. And in some of the parsers there was no association between the headers and the text underneath,
so all the headers would get parsed first as pure text, and then the content underneath would get parsed as pure text. This is challenging because the LLM then can't figure out any structure, and so you can't really do Q&A against this document. The other example was around infographics. You have these numbers like 20% and 50%, and those are clubbed with some information related to that number. Often these parsers would just take the 20% and 50% and output that as a single line, and then the next pieces of text would get output below that as single lines, and so you couldn't associate the fact that the 20% is related to the text, like "models are bad" or "they're undertrained". And a very common theme that we often see is this issue of sections in PDF documents. You could have two columns, like shown here, and then you could have separate sections within a column. So you would have benefits, and that would have a section on coverage and a section on intent. Ideally you want to club the benefits, coverage, and intent content together, and that should be separate from the distribution and regions content, because those are two semantically different pieces of text. So this is challenging.

Eventually what we ended up doing was, we tried a lot of these open source options and we ended up opting for Adobe's API, which is pretty good at PDF processing, and there were some issues we were able to circumvent. We now actually got a semantic structure on the information, and text from images and tables was correctly extracted. It also had a really interesting feature, which is that tables would now be represented as, say, CSV or markdown, so we could pick what kind of option we wanted for tables and
then we would inject that as XML or markdown within the text, at the location where the table was present in the original PDF. So we had a nice continuity of information, but we could also represent tables as XML, which LLMs are able to understand really well, because LLMs have seen a lot of HTML data, so they're really good at understanding or parsing XML. A lot of tables were images, and we would extract text from those as well. So this started giving us really good outputs, just because our input data was really good now.

All right, the next issue we did see: now that you have the data in the correct text format, the next thing you would naturally want to do is break that data down into smaller chunks and then index it, so you can do a query against these chunks to retrieve the original document later. One of the issues we did see around this was: take a long document, say a giant article on Wikipedia about Alan Turing, and start breaking it down into chunks. You'll notice that, say, the first three chunks actually have a mention of the name Alan Turing, so when you later try to retrieve it, you know that if I say "Alan Turing" it's going to get me the first three chunks. But the last three chunks, which correspond to the latter half of the document, may not have a mention of Alan Turing, right? So if you try retrieval now, the last three chunks will never be retrieved. This was a common theme that we saw, where concepts referenced at the beginning of the document were referenced with pronouns like "it", or in some other way, towards the end of the document, so we could never retrieve those concepts in the standard retrieval process. So one workaround we did for that was
every chunk would have semantic information about what that chunk is referencing. One way we did that was, we would basically get a summary of the document, and the summary would be very small, say 20 tokens or 50 tokens, and we would append that to a chunk, and the chunk is, say, 500 tokens, around that ballpark, right? You essentially want to be under the length limit of your embedding model, otherwise you truncate the embedding. So we would append a small, few-token summary to the chunk, and then we would embed that whole thing, and that would be one embedding. Now we were basically injecting information about what this chunk is about, so every chunk from, say, the Alan Turing Wikipedia page would tell us that we're trying to summarize Alan Turing's life. And now if you do a query on Alan Turing you would fetch all the chunks, and then you can figure out which of these are the right chunks to answer that query. Here I do reference this concept of subdoc 1 and subdoc 2, but I'll talk about that in later slides and why we have this concept.

Just by adding the summary we saw really good improvements in retrieval. A lot of times you actually won't see this improvement in standard benchmarks, because standard benchmarks contain small chunks which stand alone by themselves and have enough information to retrieve, so you never see this issue in normal standard benchmarks. But in our case, since we have big documents that we get from customers, we hit this all the time. So when we initially tested our system with standard benchmarks it seemed fine, but when you see this in production you realize that, hey,
you know what, this is actually a problem, and we had to figure out a solution to solve this. The second issue we saw as we started appending these summaries to a chunk is that often documents are long, say 10 pages or 20 pages, and they have different themes across the document, so one single summary doesn't really cover the different themes in the different parts of the document. For example, say the top half of this article only talks about Alan Turing but the next half only talks about, say, Alan Turing's mother. That information is lost in the summary, and so we'll have a hard time retrieving that kind of theme or topic. So that was one issue we saw with chunking with summaries.

The other issue is around doing synthesis across a lot of different documents. In retrieval, traditionally people fetch small chunks, say 500-token chunks, and then you'll try to synthesize an answer from, say, ten 500-token chunks. However, I think chunks are usually too small to actually represent all the context of a document. Either some chunks are never retrieved, or you might retrieve only parts of chunks and not really get the full context of the document, and so your final synthesized answer may not really be that good. So we started thinking, how would we actually solve this problem in an ideal situation? Ideally what I would love to do is take the entire document, give the entire document to the LLM, and ask it to retrieve the answer. But that has problems around latency, because if you have, say, retrieved five documents, processing five documents through LLMs could be challenging and can be slow, and that affects user experience. So we have to simulate this aspect of fetching entire
documents so we could do synthesis as well. That brought us to this concept of creating subdocuments. We would basically take an entire document and break it down into smaller subdocuments. Here you can assume that a subdocument is, say, 5,000 tokens, so one document gets broken down into five or six different subdocuments, and then every subdocument gets broken down into chunks with their summaries. This summary is related to the subdocument, and we assume, though it may not be true, that one subdocument contains a subset of the themes across the whole document. So now we have two different indexes: a standard vector index, which consists of chunks, and a bigger index of subdocuments. Now the thing is, most standard embedding models end up having length restrictions, so we decided not to vectorize this, and instead we just keep it as a lexical index. Part of the thinking around that was, there's this paper from 2021 around the BEIR benchmark, and I'm guessing it's kind of old now, because it's been three years and stuff has moved. But I think one of the interesting learnings there was that if you use a lexical baseline with reranking, it outperforms on most datasets, especially in the case where there's out-of-domain data in those datasets that the embedding models have not seen, and so it would end up outperforming dense models every time. A lot of our data also tends to be out of domain because, like I was mentioning before, a lot of data contains made-up nouns or phrases that these embedding models would have never seen, and so it made a lot of sense for us to have
a secondary index that is just lexical, just for that additional performance. So now that we have two different indexes, a lexical index and a vector index, how do we combine results from both of these?

A really quick question before you get to the next slide, because I'm just really curious: how did you create the subdocs?

Oh, we just did overlapping chunks of 5K tokens.

I see, so you just had big chunks that overlapped a little bit. Okay.

All right, great. So now when a user query comes in, we first query index one, which is the lexical index, and we get a subset of subdocuments, and then we query the vector index and we get a bunch of chunks which are associated with the subdocuments. In this case we had a couple of options on how to do the hybrid combination of these two different results, and so we investigated a couple of techniques: we tried max score, we tried voting, and we tried linear combination. When we tried to explain the results we were seeing from linear combination, we couldn't really explain them, because it's basically a linear combination of two different scores, and we couldn't justify why a particular combination constant made sense versus not. So then we decided to either go with a max score approach, where we just take the maximum score and say that is the best match, or a voting approach, where you have a bunch of these chunks or subdocuments cast their vote toward the subdocument, and the vote is basically the score that they bring. The documents that were getting the maximum scores were considered the best matches, because across both techniques those were the documents that stood out compared to the other documents. So the voting was very easy to explain, the max score technique was very easy to explain, and so
we basically picked one of these. We would either go with max score or voting, and then pick the top-k subdocuments. That was very easy to explain and worked really well, and so we just went with that.

All right, so once you've retrieved, say, the top k subdocuments, the next challenge is around reranking them. We have use cases where we don't want to show all k documents to the user; you want to get the best top n, and that's why we introduced reranking into the process. A lot of these subdocuments, like I showed before, are really large, right? 5K tokens is pretty big for a subdocument, and if you have k of them, that's a lot of data to sift through. Traditional rerankers, for example Cohere or other rerankers you've seen out there, have input document size limits, so at a certain size limit they truncate the document. So we couldn't just take the 5K tokens and use any document as is. The other interesting nugget of information here was, if you look at traditional cross-encoders and compare those with LLMs, the LLMs are actually trained on a lot more world knowledge, so I feel like they just have a better understanding of concepts than, I would say, a standard cross-encoder, which is not trained on that much information. One example: if you have a document that has a mention of, say, Instagram Reels, standard cross-encoders may not know that Reels basically creates video content, but LLMs, which are trained on data scraped from the web, have this notion that, yeah, Instagram Reels is for creating video content, right? So
that kind of real-world information is present in LLMs, and we wanted to take advantage of that to figure out how we could do reranking really well. So what we did was, we introduced a new index, and this new index is called a summary index. All it does is take your 5K-token subdocument and pare it down into a small summary which has all the concepts from that subdocument. It's not exactly like a chunk, because a chunk only has a part of your subdocument, but a summary has all the important facts from your subdocument. So we take our retrieved subdocuments, we fetch the summaries for them, then we use an LLM reranker, and in one or two shots you basically rerank all your summaries, and given the rankings of the summaries we rank the subdocuments in that fashion. In cases where we needed this kind of real-world context, this worked really well. And again, there are no benchmarks that test on real-world concepts. If I try this on traditional benchmarks, this does work pretty well and it will still be at par with any cross-encoder, but if I start introducing things like the fact that Reels creates video content, a cross-encoder wouldn't do as well as just using an LLM to do reranking.

All right, so here we just use the reranking component that I think LlamaIndex has, and we tweaked the prompt a little bit. This came from some interesting learnings we had from before, where we started introducing structured prompting. The output format was structured as XML, so we could easily parse it. The input is also structured as XML; that way, even if the input prompt is long, the model can pay attention really well if your input
context has some kind of structure. We've done this a lot with XML; I haven't tried other kinds of structure like JSON, but with XML the model definitely does much, much better if your inputs are structured, especially if your prompt is really long. And then we added a couple of prompting quirks that perform better. Nobody knows why, but I feel like I've seen this in a couple of papers, where you introduce a statement like "take a deep breath" or "ensure the scores are correct" and the model tends to do this. So these are some prompting quirks we picked up from a couple of papers, just to get consistent outputs.

Great, so we've reranked the documents, and we know the top, say, five documents that we want to use for a final answer. So the last step we have left is synthesis. I feel like maybe until two days ago this was a problem, but now people are targeting million-token context windows, so maybe synthesis has become an easier problem now. But before, we had to pick and choose between latency and accuracy. If the final fetched documents were small, we would just use the entire documents and then use a tree synthesizer from LlamaIndex to do the synthesis. However, if the documents were too large, then we would just pick the subdoc and the surrounding subdocuments, so subdoc n, n minus one, and n plus one, and then we would do synthesis from those documents and get the final answer.

All right, so coming back to what I mentioned before, we use these LlamaIndex concepts to really create modules in our pipelines, so we could use different parts of the modules for different product features, right? So for example, in cases where we
wanted really low latency, really fast responses, but were okay with an accuracy tradeoff, we would basically just get results from the retrieval part of the pipeline and then users would pick one of the answers. There are cases where we couldn't use everything we got from retrieval, and so we would pass it through the reranking phase, which would add a little more latency, but it also meant we'd get a more curated list back, so then we would pick results after the reranking stage. And then there were of course chat experience cases where you want to extract the answer instead of just giving back the documents, and that's even more expensive, so that happens after the synthesis stage.

Once you put this into production, we want to keep improving it. Right now I think we just want to collect a lot of data before we figure out what improvement looks like, but we want to put in this cycle of feedback or continuous learning where we show users results, they pick the right one, we log the metrics, and then given that we can fine-tune our embedding models and keep improving the retrieval system.

There are some open questions that I think we still have not answered, and that we definitely want to explore going forward. One of them is: the fact that we use lexical search is great for out-of-domain search, but how do you support multiple languages? Traditionally I think lexical search works well with a single language, and then you have to figure out how to route it to different language indexes. So I think that's an unanswered question for now. And also, how do you analyze images from documents really well, because not all images in documents
are useful, so how do you figure out which ones are useful and which can add value for retrieval? And the third one, which I think has been a pain point, is that a lot of BEIR datasets are really, really large, like five million documents or three million documents, and indexing these or running them through the pipeline for evaluation is extremely expensive, especially if you make a mistake after indexing them and you have to reindex them again. So I think one thing for us to definitely think about is how you get subsets of these BEIR datasets so we can evaluate them quickly and also at reasonable cost. So yeah, I think that's about it.

Awesome. Yeah, this is a fantastic presentation. I'm sure a lot of people are going to get a lot of value out of this, considering it is very comprehensive and covers the space from data processing to retrieval, reranking, and synthesis. And maybe just for a few minutes of questions, I figured we could have a little bit of discussion. The first question I have is, if you're running evaluation, how did you set that up in your current system? Did you do retrieval evaluation, LLM-based evals, and what were some of the datasets that you used?

Yeah, so currently we're actually doing a lot of the evaluation locally, and so we had to be very mindful of which datasets we would use. I ended up using the smaller datasets from the BEIR benchmarks, like NFCorpus [unclear]; those have a couple thousand documents, and those are also some of the more challenging datasets in the benchmark, so we actually use those. That was on the retrieval side. Then for reranking, I think there's the MTEB benchmark, which also has some reranking datasets, and so we ran with those. And I
know LlamaIndex also has a bunch of metrics around checking for faithfulness and correctness, so you basically run against those metrics.

Oh cool. Did you use any LLM-based generation metrics, or were they primarily retrieval metrics?

Which ones? LLM, like, generation?

Yeah, using the LLM as a judge to evaluate responses, that type of stuff.

Yeah. So for retrieval it's the standard metrics like NDCG, and then for the reranking stuff, yes, we use the LLM-based metrics, because it's hard to know exactly what the right answer even is, because the answer might be hidden inside the synthesized answer post reranking.

Yeah, for sure. The next question I had is going back all the way to the data. You mentioned a lot of techniques and tricks for parsing PDFs, especially parsing tables, headings, all these things. Were most of your data sources in PDFs, or do you have other data sources to consider as well?

Yeah, so we had text, PDFs, and then a lot of Word documents, interestingly. I actually haven't done an analysis on Word documents. Oh, that's a good thing I should probably check. But I think those come with similar problems as PDFs, because they also have tables and images in them. I think those are the three categories we see the most of.

I see. Because I was about to ask, do you find that you use similar strategies for parsing these documents regardless of the document type, or were there unique characteristics or challenges for certain document types?

Right. For text and PDFs I think I talked about the approaches. For Word documents I don't think we paid as much attention, because PDFs were the bigger problem.

Got it. The next question I had was on the reranking side. It's very interesting that you're actually using the
LLMs for reranking, and you made a point that LLMs have more generalization properties than a cross-encoder, because they generalize better to out-of-domain data. But I feel like the space of using LLMs for ranking is pretty early in the minds of a lot of people, so I'm curious to hear your thoughts. Maybe one immediate consideration is latency and cost. How did you think about latency and cost when using LLMs for reranking? Did you use a much smaller model? Did you use OpenAI? How did you think about that?

Yeah, so latency was a concern, which is why on this slide I basically say that if you're doing reranking, we do pay a cost in terms of time. We ended up using OpenAI models, and the 3.5 one wasn't as consistent with reranking as a GPT-4 model is. The model also tends to be very verbose in its outputs, so one of the tricks, like I said here, is that we have to make sure the output format is very compressed, so it doesn't spit out anything other than this. Because of this, we could actually get answers back very quickly.

I see, so you made sure it was in some sort of concise XML representation.

Yeah.

Speaking of XML, you mentioned the model has a much better understanding of these structured tags versus unstructured text. I'm curious if there were specific techniques along these lines. For instance, did you find that wrapping the question in a tag helped, or wrapping few-shot examples in a tag helped? I'm curious how you thought about where to place these XML tags for optimal performance.

Yeah, so there's amazing work done by the Anthropic team, and I reference their blogs a lot. They had a very interesting blog
on how to effectively use large context, where they talk about how the placement of the different parts of the prompt really affects performance. For me at least, the takeaway has been that I keep the instructions at the very bottom of the prompt, so that's the last thing the model sees before giving an output, and all the other context, including formatting examples, goes at the very top of the prompt. The performance of the model is more stable and really good if you keep the structure that way, and they prove this in their blog using actual data and metrics, so it's very interesting to read. The other good finding has been using structure wherever you can. Breaking the prompt out into structure, keeping the parts inside XML tags, gives really good performance, because then I think the model knows when a particular concept in the prompt has ended and when a new concept has started. If we didn't do this, we would see cases where part of the prompt would leak into the outputs, because the model doesn't know that that's an example and not a part of the instruction.

Awesome. I think that's a great place to end, actually. You gave really great answers across a lot of these different components. Any additional comments or thoughts that you want to add before we conclude?

No, not much else, but I can talk about why it's been so valuable to have these abstraction frameworks. When we built the V1 of this, we did not use a framework, so we hand-built a lot of it. The code got very messy, and we couldn't iterate fast enough because we couldn't try a lot of the new concepts that are coming out. I've been seeing,
at least on Twitter, that there's a new concept that comes out every couple of weeks, and you have to try it quickly, and we couldn't do that. So I think the abstraction has been really helpful for us to structure our code well and try a lot of new techniques. For instance, I didn't know we could do a summary index before, and that has been key for us to rank well. I also know that's going to be key for a lot of different user-style queries: if they want to do summarization across a lot of documents, it's going to be key to have this concept of a summary index as well. So yeah, it's been really helpful to use this.

That's awesome. I think the underrated piece of a lot of these frameworks is that, even if some people point out that some of the implementations are light, having the right code abstractions actually matters quite a bit, because if you don't have the right abstractions you need to refactor things every time a new technique comes out. We've definitely run into that on the framework side too, and our goal is to make sure that developers don't have to do this, so that we absorb some of the challenges.

Great. Well, Sisil, thanks so much for your time. For those listening, if you have questions, feel free to drop them in the comments, and we'll see you guys next time.

Great, thanks for having me.
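
For readers unfamiliar with the retrieval metric named in the Q&A, here is a minimal sketch of NDCG in plain Python. It is not the evaluation code used at Jasper or anything from LlamaIndex; the function names are hypothetical, the gain is the raw relevance label (some variants use 2^rel - 1 instead), and the ideal ordering is taken from the retrieved list only, which is a common simplification.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k results (linear gain)."""
    # rank is 0-based, so the discount for the top result is log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int) -> float:
    """DCG divided by the DCG of the same labels sorted best-first."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

Here `relevances` holds the graded relevance label of each retrieved item, in the order the retriever returned them; a score of 1.0 means the retriever's order already matches the best possible order of those items.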

Original Description

In this video, Sisil Mehta (ML eng @ Jasper) walks through practical tips and tricks that his team implemented for productionizing a RAG system at Jasper.ai, backed by LlamaIndex abstractions. These tricks include the following:
1. Picking a proper PDF parser that can maintain semantic structure, parse text from tables/images, and represent the output as XML or Markdown.
2. Adding the right "layers" of metadata; besides global document context, also inject summary context from "sub-documents" to more precisely localize context.
3. Hybrid fusion between different retrieval methods.
4. LLM-powered reranking. Reduce token usage by reranking summaries that reference underlying chunks.
5. Using XML and emotion prompting to get well-structured outputs free of hallucinations.
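
As a rough illustration of the "layers" of metadata in item 2, the sketch below attaches both a document-level summary and a sub-document summary to every chunk before indexing. The class names, the tag names, and the choice to prepend the summaries as XML-style tags are all assumptions made for this example; the description does not specify how Jasper stores or formats this context.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    text: str


@dataclass
class SubDocument:
    summary: str          # summary of one section of the document
    chunks: List[Chunk]


@dataclass
class Document:
    summary: str          # global, document-level summary
    sub_documents: List[SubDocument]


def contextualized_chunks(doc: Document) -> List[str]:
    """Return one string per chunk, prefixed with both layers of summary context."""
    out: List[str] = []
    for sub in doc.sub_documents:
        for chunk in sub.chunks:
            out.append(
                f"<document_summary>{doc.summary}</document_summary>\n"
                f"<section_summary>{sub.summary}</section_summary>\n"
                f"<chunk>{chunk.text}</chunk>"
            )
    return out
```

Each returned string can then be embedded or indexed in place of the bare chunk, so that a chunk retrieved in isolation still carries its surrounding context.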

The video teaches practical tips and tricks for productionizing RAG systems using LlamaIndex abstractions, including techniques for improving retrieval and synthesis. It covers topics such as PDF parsing, chunking, and indexing, as well as re-ranking and summarization techniques.

Key Takeaways

Use LlamaIndex abstractions for building RAG systems
Choose a proper PDF parser
Break down long documents into smaller chunks
Append a summary to each chunk
Use lexical and vector indexing for subdocument retrieval
Combine results using a max-score or voting approach
Rerank the retrieved sub-documents
Use LLM ranker to rerank summaries
Tweak the ranking component and prompt to improve performance
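
The "max score or voting" step in the list above can be sketched as follows. This is an illustrative plain-Python version, not the implementation from the talk or a LlamaIndex API: the function names are hypothetical, and the min-max normalization (used so that lexical and vector scores are comparable) is an assumption.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Scored = List[Tuple[str, float]]  # (sub_document_id, raw retriever score)


def min_max_normalize(results: Scored) -> Dict[str, float]:
    """Rescale one retriever's scores to [0, 1] so retrievers can be compared."""
    if not results:
        return {}
    scores = [score for _, score in results]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return {doc_id: 1.0 for doc_id, _ in results}
    return {doc_id: (score - lo) / (hi - lo) for doc_id, score in results}


def fuse_max_score(lexical: Scored, vector: Scored, top_k: int = 5) -> List[str]:
    """Keep each item's best normalized score across the two retrievers."""
    fused: Dict[str, float] = defaultdict(float)
    for results in (lexical, vector):
        for doc_id, score in min_max_normalize(results).items():
            fused[doc_id] = max(fused[doc_id], score)
    # Sort by score descending, then by id so ties are deterministic.
    return sorted(fused, key=lambda d: (-fused[d], d))[:top_k]


def fuse_by_votes(lexical: Scored, vector: Scored, top_k: int = 5) -> List[str]:
    """Prefer items returned by more retrievers; break ties by best score."""
    votes: Dict[str, int] = defaultdict(int)
    best: Dict[str, float] = defaultdict(float)
    for results in (lexical, vector):
        for doc_id, score in min_max_normalize(results).items():
            votes[doc_id] += 1
            best[doc_id] = max(best[doc_id], score)
    return sorted(votes, key=lambda d: (-votes[d], -best[d], d))[:top_k]
```

Max-score fusion favors items that one retriever is very confident about, while voting favors items that both retrievers agree on, so the two can produce different orderings on the same inputs.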

💡 Having the right code abstractions matters for efficient implementation of new techniques, and using structured prompting with XML can improve model attention and performance.
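
A minimal sketch of the prompt layout described in the Q&A: context and the formatting example at the top, each part wrapped in XML tags, instructions last, and a deliberately compressed output format. The tag names, the `<ranking>` output format, and both function names are assumptions made for illustration; the actual prompt shown on the speaker's slide is not reproduced in the transcript, and no LLM call is made here.

```python
import re
from typing import List
from xml.sax.saxutils import escape


def build_rerank_prompt(question: str, summaries: List[str]) -> str:
    """Assemble a rerank prompt: context first, instructions at the very bottom."""
    candidates = "\n".join(
        f'<candidate id="{i}">{escape(summary)}</candidate>'
        for i, summary in enumerate(summaries, start=1)
    )
    return (
        f"<candidates>\n{candidates}\n</candidates>\n\n"
        "<output_format_example>\n<ranking>3,1,2</ranking>\n</output_format_example>\n\n"
        f"<question>{escape(question)}</question>\n\n"
        "<instructions>\n"
        "Rank the candidates by relevance to the question. Reply with a single "
        "<ranking> tag containing comma-separated candidate ids, most relevant "
        "first, and nothing else.\n"
        "</instructions>"
    )


def parse_ranking(reply: str, n_candidates: int) -> List[int]:
    """Extract valid, de-duplicated candidate ids from the model's reply."""
    match = re.search(r"<ranking>(.*?)</ranking>", reply, flags=re.DOTALL)
    if match is None:
        return []
    ranked: List[int] = []
    for token in match.group(1).split(","):
        token = token.strip()
        if token.isdigit():
            idx = int(token)
            if 1 <= idx <= n_candidates and idx not in ranked:
                ranked.append(idx)
    return ranked
```

Because the expected reply is only a short list of ids, the response stays small, which is in the spirit of the latency point made in the interview; the parser ignores anything outside the `<ranking>` tag in case the model adds extra text anyway.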
