LlamaIndex Webinar: Make RAG Production-Ready

LlamaIndex · Beginner · 🔍 RAG & Vector Search · 2y ago

Skills: RAG Basics 90% · LLM Foundations 80% · Vector Stores 80% · RAG Evaluation 70% · Advanced RAG 60%

Key Takeaways

The LlamaIndex webinar discusses making retrieval augmented generation (RAG) production-ready. It covers retrieval strategies, fine-tuning, and vector stores, with tools such as Haystack, ChatGPT, and Weaviate. Skills covered: llm_foundations, rag_basics, and vector_stores.
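
As a rough illustration of the retrieve-then-prompt flow the panel describes, here is a minimal Python sketch. It is not taken from the webinar: the toy keyword-overlap retriever, the function names (`tokenize`, `retrieve`, `build_prompt`), the sample documents, and the prompt wording are all assumptions made for illustration. A production pipeline would swap in a real retriever (keyword, embedding, or hybrid) and send the prompt to an actual model.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, documents, top_k=2):
    """Toy keyword retriever (assumption): rank documents by the number of
    tokens they share with the query and drop documents with no overlap."""
    query_tokens = tokenize(query)
    scored = [
        (len(query_tokens & tokenize(doc)), index, doc)
        for index, doc in enumerate(documents)
    ]
    scored = [item for item in scored if item[0] > 0]
    # Highest overlap first; ties keep the original document order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [doc for _, _, doc in scored[:top_k]]


def build_prompt(query, context_docs):
    """Inject the retrieved context into an instruction-style prompt."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return (
        "Answer the query based only on the provided context.\n"
        f"Context:\n{context}\n"
        f"Query: {query}\n"
        "Answer:"
    )


# Hypothetical external data source; in practice this would be a vector
# database or another document store.
documents = [
    "The LlamaIndex panel features speakers from Haystack, SID.ai, and Weaviate.",
    "BM25 is a keyword-based ranking function.",
    "Central Park is in New York.",
]

query = "Who are the speakers at the LlamaIndex panel?"
prompt = build_prompt(query, retrieve(query, documents))
print(prompt)
```

The printed prompt is what would be sent to the language model in place of the bare question. The retriever is the part the panel spends most time on, since the quality of the answer largely depends on the quality of the context it selects.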

Full Transcript

All right, welcome everyone. Welcome back to another episode of the LlamaIndex webinar series. This is Jerry here, and today we're super excited to be talking about production retrieval augmented generation. This is the first time that we're actually doing a panel, and so this will be a panel with Tuana from Haystack, Max from SID.ai, as well as Bob from Weaviate. As for the agenda, we'll start off with some brief presentations from each of the folks, around five minutes or so, just to give an overview of the respective companies as well as some basic concepts, and then we'll jump into the panel discussion. So let's go from there. Tuana, do you want to kick us off?

Yes, let me hopefully share my screen successfully, and hopefully you're seeing this. Yes? Okay. So as we have five minutes, just a fair warning to everyone: these are very, very basic slides about my experiences with retrieval augmented generation and some things that I think are interesting discussions that we can maybe have in the panel discussion. So quickly about myself: I'm Tuana Çelik, I'm one of the developer advocates at deepset, but I mainly focus on our open source LLM framework called Haystack, and obviously with that comes building a lot of RAG pipelines and some experience dealing with large language models. So without further ado, I'm just going to start off by talking about why we even discuss RAG, what RAG is, and what retrieval augmentation really achieves for us, so that if there is anyone who doesn't know, we just briefly go over it. I boiled it down to three very basic bullet points. The large language models that we currently use are trained up to a certain point, so they don't really have information after that point. They also don't have any information on our confidential data. So RAG is just a methodology we use to help large language models get the relevant context to
answer any query, and I love to show a very simple example. I reuse this example very frequently, but I think it depicts it quite well. I use the example of this webinar this evening (well, evening for me). I put this question into ChatGPT just a few hours ago: I asked, "Who are the speakers at the LlamaIndex panel tonight?" And as this is ChatGPT, and it's actually quite a performant model, I got a pretty nice response admitting that it just doesn't know. So what we mean by retrieval augmented generation, or RAG, is basically a technique that allows us to transform this instruction into something like this: instead of just inputting the query as is, we instruct the model to answer the query based on the provided context. That's the instruction in this case, but obviously this could be a plethora of instructions, which is why I think RAG is actually quite exciting. Then we provide it with some relevant context before we ask the same question again. The important part here is how we fill in this context. I cheated and just copy-pasted from your event page today, but what a RAG architecture achieves is to find the relevant context and instruct the model that we choose to use with that relevant context. From a very, very high level, that looks a bit like this: you might have external data, which could be your vector database like Weaviate, or it could even be the web, and we have something we call a retriever component that acts as a sort of filter. It goes through that external data source and selects the most relevant context for any given question. Then we inject that into a prompt, which we then send to a large language model. In my view, what could be interesting to discuss today is some of these things here. Some important building blocks, from my experience, for building a RAG pipeline for production: first, retrieving the relevant documents, so all the context, and we can do this in various ways. We can do that with keyword retrieval,
embedding retrieval, or even hybrid retrieval. We can also play around with the ordering of the retrieved documents. And then of course the important things are the large language model we use and the actual prompt itself. I'm really going to focus on the top part: what they contribute to the quality of the RAG pipeline at the end is largely the quality of the context provided to the large language model. If we don't provide it with the right context, then the large language model doesn't really have much to do. And the last two are, given the context, the manner in which the large language model ends up transforming that into some sort of answer. So, the first step: retrieval. I think we can probably discuss keyword retrieval, embedding retrieval, and hybrid retrieval. Each has its own strengths, and each has its reasons why you would pick it for a production use case or not. I think it's especially interesting to discuss hybrid retrieval. It has positives and negatives, but embedding retrievers, or semantic search, aren't necessarily great at retrieving anything that might require us to do some keyword search. I think we can already think of some examples here: if you have product IDs you want to look up on an e-commerce platform or something like that, or company names that you have to look up, embedding retrieval might not be great at this, but you still might need semantic search to happen. So in this case you might need to look into combining two retrieval methodologies. And then the thing that I'm most excited about these days is re-ranking. This is an effective way to retrieve a diverse set of documents, and this is quite new. It can help us combine multiple resources, and, although this is just going to be a brief intro in this slide deck, it can help with "lost in the middle". I'm going to show a paper that came out quite recently about that. But quickly, about diversity: one thing that we're seeing that can
be a problem in certain production applications of RAG pipelines is this: imagine a similarity ranker. Here I've basically ordered documents, and they're all in a color palette. Imagine the blue ones: they're all slightly different colors, so they are different documents, but they are quite similar to each other. If I end up using a large language model whose context window only allows me to have this many documents, then I maybe have a problem, because in cases where I want to do long-form question answering based on a broad topic, only having a subset of documents that talk about a very limited number of topics might be an issue. This is where we talk about a diversity ranker, which simply introduces higher diversity in the order of documents that we pass to large language models. So I think this might be an interesting topic to discuss. And then finally, another thing that is quite new to me is "lost in the middle". This is a paper, I've linked it down below, and what it effectively tells us is that large language models really concentrate on what they see at the beginning and the end of context windows. So what do we do in scenarios where we might lose a lot of valuable context and information that simply happened to be in the middle of the context window? We are now looking into re-ranking techniques that will allow us to shuffle those around so that we lose as little relevant information as possible. And that's it from me. I think these would be some of the topics that I'd really like to discuss in the panel session, and hopefully I didn't go over five minutes.

No worries, thanks Tuana for your time. Next up is Max from SID.ai, if you want to share some slides.

Yes, let me quickly set up the screen sharing here. There we go, give me a second. Here we are. Okay, hi, good morning. I'm Max, I'm the CEO of SID.ai. We're in YC's current summer batch, and I'll quickly talk about our experience
with RAG and trying to get this to work in production. The subtitle of this story is "11 months of pain". I think this is covered already, so I'll jump over it quickly. We started building LLM apps for consumers quite early, but they kept failing, and the failure always looked something like this: I'd ask it to write a slogan for my company SID, and OpenAI would think SID stands for sudden infant death syndrome, which it doesn't. The quick answer is that LLMs are a bit like first-day interns: they're super smart and eager, but they know nothing about the person or company that they're working for. And the solution seems quite simple: you just add data. Especially coming in with the classical engineering mindset, the feeling is: how hard can it be? I'll just take some documents, I'll chunk them into fixed-length chunks, I'll then use an embedder (I can use OpenAI's or I can use something else), I'll throw it into a vector database, and at the end I'll create an internal or an external API endpoint to actually get this to work. But once you go into production, you realize that there's a lot more that can go wrong and that will go wrong. How do you handle changes to documents? If you're pulling from a live source like Google Drive, Gmail, Notion, or any of the others, how do you handle those updates? How do you reflect those in the vectors that you have stored? And if you have something that's public on the web, and especially if you ingest from Google Drive, someone will ingest, or try to ingest, a three-gigabyte PDF and will send you an email at two in the morning if that somehow failed. And I think one of the classical problems, and one of the approaches that we'll get into a bit more later, is that at some point you have a lot of chunks, and every single retrieval result seems really, really bad, because you somehow have a finite set of granularity but you have an almost infinite set of
data, and then you have to somehow think about cross-encoders and re-rankers to bring that back up and improve that again. And then you have to think about things like Markdown's data distribution being different from, for example, PowerPoint slides like these, just because it's a different writing style, and if you want it to be represented well in the same latent space, then you're going to struggle. Then there's of course the compliance question as soon as you're going more into the enterprise world: can I even use OpenAI, or do I have to self-host everything, with the complexities that come with LLM hosting? And then emails are a very, very tricky individual one, and I'll get into some approaches that we've used for that later, because there's just so much fluff and so little information. And then there are things like Google's API CASA certification and processes like these that you actually have to go through before you can get up and running. I'll quickly get into these two a bit more, and one approach that we found worked well for us there. You have a limited amount of floating points, a limited amount of precision in your embeddings, and it's really important to make them count. In the classical email sense, most of it is "looking forward", "great to hear from you", "let's circle back", et cetera, and not actual information on the content. So what we've done is we've tried to reduce the fluff and the writing style as much as possible, and to do that we actually have a fine-tuned summarizer that is designed just to summarize email threads, and then we have an embedder fine-tuned on these summaries to afford us a larger degree of separation in the latent space. So instead of going through the original text, which is very fluffy and not that easily separable, we create this different kind of latent space
just on the summaries, and that makes search much, much better, and means we need less cross-encoding and re-ranking at the other end to make it work. And yeah, that pretty much brings us to the end. The retrieval architecture depends mostly on the data type, more than the usage type, and there's a long road from prototype to production. Thank you very much.

Awesome, thanks Max. And last but not least, Bob from Weaviate.

Hey, well, thanks for having me, Jerry, and thanks for the wonderful decks that I just saw. I'm not going to show a deck; I'm actually going to show a demo. So I'm going to bring this together and actually demo this live for you. Let me quickly share my screen, one sec. Here we go. So let me give you a little bit of context on what you're about to look at: this is the Weaviate console. Weaviate is a vector database, an open source vector database, for those who do not know. On some of the slides you saw the term vector database mentioned, and basically what Weaviate takes care of is storing these embeddings and storing the data objects. That could be the emails that Max just showed, or the examples of the meetups that Tuana showed. In this case I have a demo data set. One of my colleagues, Connor, made this data set, and he wrote a very nice blog post about it, so I can share it afterwards. Basically what we're going to do is try to do the RAG thing in real time, and not only that, but we're also going to do something new on top of that, which we call generative feedback loops in the data. I've prepared things a little bit, and the query language that I'm going to use is GraphQL, but if you interact with it you can use Python, JavaScript, whatever you want, in the clients. I just think this is an easy query language for everybody to read and understand what's happening here.
So we have a database, we have a query language, and we have the listings in this database, and what these listings contain are Airbnb listings. So here you see the name of the listing, the neighborhood, the price per night, the room type, and those kinds of things. But one of the things that you might notice is that there's no description. So if we now want to do semantic search, or the hybrid search that was just mentioned, over these listings, we can't, because we don't have any descriptions. So one of the things that we can do is use a form of RAG to create a generative feedback loop. What we're basically going to do, and I'll show it in a bit, is inject this data into a prompt, and what comes out of the prompt will be stored in the database. So we have a little prompt here, and I have that in a notebook, where it says "write a short Airbnb description for the listing", et cetera. So I'm now hitting run. For this demo I'm using OpenAI embeddings. What it's basically doing is querying Weaviate and saying: okay, show me all the listings where there's no description, take that information from the listing, and it will generate not only a description for it but also a vector embedding for it. I'm not sure if it's done yet; it's still running, but it's almost done. So if I run it, you already see a few. Now you see the same listings, so this was the listing that we just saw, but now it has a description. And what we've done is, by importing it, we told it we want to create a vector embedding for the description. So if we limit that to the first one, so just the first result, this one, we can request this additional vector. So this is basically the vector embedding that in this case we received from OpenAI, all the way down
in the system. So now one of the things that we can do, based on the data that was generated from the information in the listing, is a hybrid search, for example. So we say "place in New York to walk my dog". What this is going to do when I hit search is create a vector embedding for this query, and it's going to do a hybrid search (thank you, Tuana, for the inspiration, let's go for hybrid search). It's going to do a vector-based search on the query, and it's going to do a BM25 search on the individual words in the query simultaneously. And now the cool thing is that we try to retrieve a data object based on the description that was just generated using a form of RAG, right? So here we go, I run this query, and it says description: null. Of course, it's a demo. Why is it saying null? Let's see. Oh, here you go. So here it returns this result, with this description. I don't know why it didn't give the description for this one, but you know, live demo. So basically what you see here is that it says "Welcome to Lara's spacious studio, located near Central Park", et cetera. And now we can do RAG on top of that again. So we can say generate, single result, and I can give it another prompt and say "why is this listing a good place to walk my dog?" What I'm going to do is basically inject the description. So what this is going to do is query the database: it's going to create a vector embedding for this query, it's going to return the description, the host name, the name, the neighborhood, and so on, and then we're going to do RAG again on the generated content, which is this, and we're basically going to ask it to explain why that's actually a good result. This one takes a little bit longer because we need
to send two results to the API, but now you see here we have that result. So this is what we generated, "Welcome to Lara's spacious studio, located near Central Park", and so on, and the RAG result says this listing is a good place to walk your dog because it's located near Central Park in East Harlem. The nice thing with these kinds of things is that you have basically seen two concepts here. One is just pure RAG, so that's this, where we are basically injecting that information into the prompt, and you can do that no matter how big your data set is. That is kind of the power of vector databases. I mean, this data set is like 25 data objects or something, but we see users that go literally into the billions, so you can quickly search through them. And the second thing you can do is actually generate content like this and store it back in the database with a vector embedding. So now you can not only use RAG for better search, but you can also use RAG to create data in your data set, modify data in your data set, and sometimes you can delete data. So that was my demo.

Thanks all for the amazing presentations, and with that, let's get into the fun part: let's get into the panel. I have some basic questions to kick us off, and we'll probably do this for the next 25 minutes or so. If there are any questions from the audience, please feel free to chime in, and I'll probably interleave some of them with the basic questions I had prepped as well. Let's kick it off with an introductory question. A lot of users have built these types of retrieval augmented systems, I'm going to call it RAG for short, in the prototyping phase. In your minds, what are the key considerations that users would need to take into account when actually trying to build RAG in production? So everything from performance, cost,
latency, scalability, security. And we can start with Tuana.

So the way I see it, there are actually two parts where a lot of considerations go in, kind of separately from each other, because there's the retrieval step of the retrieval augmented generation pipeline and then there's the generation step. In the retrieval step you have to consider your retrieval models, but these can actually be relatively small and cheap; they tend to be a lot more lightweight. Whereas in the generation part I see tons of considerations. I live in the EU, and it can even start from considering, security-wise, what kind of models you want to use. We see this very often here in Europe: for example, OpenAI models are very, very performant, but sometimes that's just not an option given the legal circumstances. Then you have to consider whether you are going with an open source model, and whether you're hosting it yourself or going for a hosted service such as Azure or SageMaker. This of course brings cost considerations with it as well, and so on. And then the second part, which I briefly mentioned with hybrid retrieval: I think it goes a bit into latency considerations, and this is really dependent on your use case. For some people a simple embedding retrieval with a very lightweight model is fine, if you're lucky enough to have a sentence similarity model that works on your language. But Bob's example with New York was a really good one: where you need to combine both keyword search and embedding search, you are basically doing two things at once. Although keyword search can be very, very fast, you now have two things to think about that run in parallel, so there are some latency considerations. Those are the main facets that my mind immediately goes to.

Thanks. Max?

Yeah, I think the main part, or what in our experience ends up eating a lot of the time, is actually getting the
data syncing part right. That is just because the APIs for the services are so different, and depending on how you ingest information, you actually have to keep quite a complex state of what you already have and what you still need to add, remove, or modify. That can, out of experience, eat up a lot of your time. For a lot of the stuff here, you have to be quite conscious of how much data you actually want to ingest and what the search sample size is. If it's quite small, then you're probably well served by an easy solution and you won't run into that many issues, but as soon as you're looking at a few million chunks, you're going to have to get more creative, and hybrid search is a great way to actually get there. And also, with OpenAI, the legal considerations are one thing, and the other is just speed, right? By self-hosting these models and co-locating them with your vector database and the other things, you can actually get much, much faster, and have turnaround times in the high tens of milliseconds or low hundreds of milliseconds for the entire part, including hybrid search. If you try something like that with OpenAI's embeddings, you're oftentimes looking at a P99 of above a second.

Yeah, so what's interesting, to build on top of Max's point, is that there's this big distinction between a demo like I just gave and production. I mean, I had 25 data objects, and the example that I ran took about 25 seconds to actually execute. For a demo that's great, right? That's not so great if you have a real-time use case and you have 100 million or more data objects, because then it's going to take quite some time to index all that stuff. So how you work with these models in these production systems is completely different from the examples that I just showed. So I agree with what
everybody said there. What's super interesting to see from my perspective is actually the development when it comes to CPU-based inference, because at some point you need to make estimations about cost. Again, the example I just gave was perfectly fine for this demo, but it's going to be very expensive in production, right? So, okay, we have a user that has like 20 billion data objects in Weaviate. Even if they wanted to use, for example, OpenAI, that's like two million bucks to create vector embeddings. So then an open source model is way more interesting, along with how you operate it, et cetera. The good news is that there's so much work happening in that stack, so the database with the models and those kinds of things. And secondly, something that I find very interesting, and that goes for what Jerry, you're working on, and also Max and Tuana, is the tooling that starts to arrive around the ecosystem: basically the solutions to help people actually figure out how to store the information and how to chunk the information. Well, long story short, what I'm very excited about is how these things are now starting to come together. And one more idea that I want to add, just for people who are listening here: if I have to make a prediction, it's that we're going to see this combination of not only doing RAG for search results, as in the example I just gave of "why is this a good place to walk your dog", but also asking "is this a good representation for my data set?" So take the example that Max gave: if you have all these emails stored and one just says "okay, yeah, thanks for the info", then the model could actually say "no, it's way too generic", and we start to use RAG, so the models, the database, and the
tooling, to achieve that, to actually clean up the data and make sense of it. And you can go very far with it. So, if I may use your example, Max, you can even say: okay, this email is not containing enough information in vector space, let's go to the previous email and see what was in the previous email, and then generate content based on that. So I think that's something we will see in the near future too: people also start to use the models to query the data and then modify the data set, and there's a harmony of the models, the vector database, and the tooling like you all are creating. So I'm excited about that.

Yeah, one thing that we're also exploring is the notion of something like lazy information retrieval. So if someone asks a question, or if there's a query, and there are no really good answers by whatever metric we find, we can actually go in and try to collect more and more information, and have intermediate synthesis steps to actually find an answer for that question or query once it comes again, right? So trying to find and collect more information in an intermediate step, something that wouldn't work at query time but that we can do in more of a static kind of regime.

Interesting, so like dynamic information lookups, basically. This is awesome. Maybe the next thing that we should talk about is data: everything from ETL (extraction, transformation, loading) into a vector store. This is a pretty common step that users face when they first build a RAG system: loading data, let's say a PDF, chunking it up, and putting it into a vector store. Max and Bob, both of you have talked about some of the key considerations you need to think about around the data: around performance, around costs, around latency and scalability.
What are some of the key pitfalls that users commonly run into, if we drill down into this a little bit more, and what are the key considerations that they might have to experiment with? So for instance everything from chunking strategy and document transformations to how to account for scalability. Maybe we could start with Max.

Sorry, my internet connection was bad, I didn't catch the last 20 seconds.

Okay, no worries. The high-level idea is just: what are the key pitfalls that users commonly run into in this data ETL layer? You were talking about this on some of your slides as well as in your answer too, everything from chunk sizes to scalability, and I'm curious to get your thoughts on how they should robustify their data to create better RAG systems.

Yeah, I think there's the one part which is just: follow the standard practices, because you're effectively handling untrusted text. There's always stuff that can go wrong and will go wrong. And then set reasonable limits. For example, in our production system, someone tried to link a Google Drive that was a few terabytes large, and you need to make sure that the pipeline is actually engineered to handle something like that, to batch it properly, and not just return an out-of-memory error and break the entire system. And especially if you're offering this to multiple, disjoint users, you have to think about how you actually provide some sort of exclusion, so that one person can't break the servers for everyone else. I think the most difficult part to engineer from that perspective, apart from of course the retrieval pipeline, is exactly the data syncing component, and making sure that works well. And then handling all of those keys, well, actually all of those API access tokens. I'm not sure how many of you have
actually tried, but for example getting API access to Google's Drive API and email API is a huge and very lengthy process. It's something like 600 emails back and forth with Google to actually get permission to do that, and then having a CASA assessment, et cetera, that you have to go through. I think these are definitely the underrated challenges: stuff that you look at and think, yeah, this sounds kind of reasonable, but then you go about it and you realize, hey, this is actually a lot of work.

Yeah, certainly. So one of the things that we see a lot, and I'm going to assume that a lot of people on the Zoom here are building something or playing around with something: again, like my example with 25 data objects, that's fine. Even if you think it's pretty fast, that you've optimized it, and in your mind 200 milliseconds is great, that may be true if you don't have a large data set, but with a large data set 200 milliseconds is unworkable. So what I would recommend people do is take a sheet of paper or a whiteboard, and from the moment of ingestion all the way to the query, just draw out what's happening there. How long was inference time on the model? How long was retrieval time on the database? Et cetera. Then you get to a number. So let's say that in my case, where I use these OpenAI endpoints, I get to maybe 150 or 200 milliseconds. Just do that times the number of data objects that you have, and then you will quickly be shocked about the number that's there. So you really need to start to think from the perspective of performance if you're really building a business, or whatever you're building, that has a large data set. What I recommend doing is that, for example, if you use an open source model or a model that's
in SageMaker or those kinds of things, validate whether that model generates the results that you want, and if that's a check in the box, build a pipeline as easily as you can. It's perfectly fine to do it with the latency from OpenAI, et cetera, and then start to optimize the pieces one by one: just take a building block out, replace it with something else, and start to minimize the time. From what we know of the big production use cases, where retrieval from the database, embedding generation, et cetera, play a role, if you really optimize it well for big data sets you should be able to get to between 20 and 30 milliseconds today, end to end, if you do that very well. But it's really a profession, right? There are people who are really good at optimizing these kinds of things. Just keep it in mind, because I've seen a lot of people enthusiastically build amazing prototypes, and then they said, "but now, to make this into an affordable business, we need to do this times 100 million," and then they were shocked by the infrastructure and embedding prices and times associated with that. That shouldn't hold you back, but bear in mind that these things take time. What people might like to know if they're new to this is that a lot of work has already been done there. For example, in Weaviate we have a Spark connector with which people literally pump in millions of embeddings and data objects per minute. So these tools are there, but make sure to think about it, and if you have a data science background, I would highly recommend that you also start to think about the DevOps side, or the engineering side, of bringing this stuff to production.

Right, sorry, I just want to add some things on this data ingestion part, based on what I see from our community, which I think Max has probably also experienced a lot with his product. Especially if you want your RAG pipeline to be sort
of up to date in real time, there are actually two types of data sets that you do RAG on. There are some cases where you have a data set that is not necessarily going to change that much, so you don't have to think about data ingestion more than once, maybe. But there are other cases, for example where you want to be able to do continuous RAG on emails that are constantly changing, or Notion pages that are constantly changing. That requires a completely different architectural setup: maybe the same pipeline, but a completely different setup for scalability. So this is one of the main considerations I see when we're talking about data ingestion for vector databases.

Thanks for your thoughts. I think this is a related question on data, and it will tie into the next section on retrieval as well. It's actually a question from the audience: what is the best way to chunk your data, and how do you think about optimal chunk sizes as well as chunking strategies? I think we could start with Tuana, if you want to go first.

I find this question quite difficult, because it feels like in a lot of cases it can be trial and error, but it's also going to depend on the embedding model that you decide you want to use. With a lot of embedding models, if you want to do retrieval with a given embedding model, you're not going to go beyond a certain number of words for your chunk sizes. But at the end of the day there's also a chunking strategy. I'm going to show it with my hands, because I can't really describe it any other way, but for example we have two paragraphs following each other, and there's this chunk and there's this chunk immediately after it, but potentially something just at the end of the first one would have been relevant to the sentence that starts the next one. So we see a lot of people trying out chunks that actually overlap
each other, so that you don't lose context that might be relevant in a scenario where one chunk is retrieved and the other is not. So there's a lot of pre-processing thought that goes into chunking, and obviously the sizes of the chunks matter. I'm going to go back to this, but I think chunk size is probably going to be something we discuss more if it does turn out that diversity is very important for LFQA, because we have a limited context length we can fill, and so a limited number of chunks we can add into our context. If you want to include a lot more diversity, maybe smaller chunks are better. It depends on the type of data you're doing RAG on; that's my overall answer.

Bob, did you want to respond?

So this is a question that I don't have a good answer to, because the moment the data hits the database it's already chunked. Tuana, that's of course your expertise, how that's done. I'm not sure if I'm allowed to break the format a little bit here, but I would love to hear your answer to that question, Jerry, because you see this stuff a lot, of course.

Oh wow, I thought I was interviewing you guys. I think it's basically what Tuana mentioned, and I was actually curious to learn some of the tips and tricks that you guys had, because I was going to see if I could implement some of this in the framework itself. The two things that I typically see in terms of chunks: I think smaller chunks tend to lead to better embedding-based retrieval, just because you're not averaging out the relevant context with a bunch of random stuff before and after the actual piece of text. However, the downside with smaller chunks is that when you actually feed this to the
language model for synthesis, it doesn't actually have enough context to give you a detailed answer to the question. And then I think the other piece here is that a lot of people, pretty much everybody building LLM apps for a specific vertical on specific types of data, build their own custom parsers, as opposed to using any of the out-of-the-box parsers from, for instance, LlamaIndex or LangChain. But Max, I was curious to get your thoughts on this too, because you also talked a little bit about this in the slides.

Yeah, I mainly have two points. First of all, I think you have to separate the chunk size at embedding time and the chunk size at retrieval time, because these can actually be different, and this is also how we approach it. Sometimes we'll retrieve something quite small, something that's inside the distribution of the models that we use for embedding, and then we'll try to retrieve the previous and the following chunks to provide more context, as you said, for the query. So these are sometimes two different things, and depending on what you're working on, splitting the chunk size at embedding and the chunk size at retrieval makes sense. That's the first consideration. The other one, from the slides, is that you can quite freely choose to search and retrieve over a different latent space than just the raw text. For us, for emails, that was the summary space. We have a quantized MPT-7B fine-tuned on email summarization, and all that model does is take in a thread and try to output a fairly fixed-size summary of that text. Then we have an embedder that is fine-tuned on just those summaries, and that usually means we lose a lot of the
context on the writing style and those kinds of components, but if you're building RAG for information retrieval and not for style retrieval, that is actually what you want, and you have the ability to engineer it that way. And then, more generally, to measure the end-to-end performance: I come more from a research background, so I love thinking of synthetic benchmarks and ideas to test this, et cetera, but the most important thing is to capture that quality from your users and then try to evaluate different approaches in your pipeline that way.

Yeah, and if I may quickly add something, because something popped into my mind, so I actually do have a response to this, thanks to Max and what he said. From the database perspective, which is kind of where we sit: in the demo that I just gave, you saw me do, in this case, hybrid search, but it could also be vector search; the thing that you also have is filters. You load all the data into the database. Let's say that we take these emails, and we have 100 million emails, but you know that you're searching for an answer based on something that's in an email thread. You can actually store that structure in the database: you can say, this is the body of the email, this is the vector embedding of the body of that email, it's coming from this email address, and it was part of this thread. Then your database query looks something like: do a vector search for something very specific, but limited to that email thread within those 100 million emails. And that works very well. So as a tip, bear in mind that you don't have to solve it 100% at ingestion time. In the database you can also be smart about how you're structuring the
data that you're storing, and then build filters on top of the vector search that you're doing. So that's also an option.

I actually wanted to add something here, because the conversation on chunks and what Max said made me think of one thing that maybe we didn't discuss, but which could have an effect on the type of chunking you decide to do: RAG can mean many things, based simply on the prompt you provide. So initially, what even is your task? Why are you building RAG? I think we often talk about question answering, but in Max's example it could be summarizing emails, which are very short, and in another example it could be summarizing something about a whole topic, and then you have your retriever set to retrieve maybe the top few most relevant documents out of your database. So this is a good consideration for whether your chunks should be larger or smaller, because if, for example, you are asking it to summarize a whole topic, maybe a larger size of context and fewer documents embedded into your prompt makes more sense. So I think the purpose of your RAG pipeline matters a lot in how you design your chunking as well.

Yeah, that's a good point, and here is one more idea to add to that, for people playing around with this. Sorry, again from the database perspective, but that's what I do every day, so it's just where I live. You can even say: we have email threads and we have a summarization. We use RAG to create the summarization of the whole email thread, and we store that along with the individual emails. Then you basically have two queries. Query number one is "show me the email that was about X", and then when you've retrieved that and you have the ID of the email thread, you search through the individual emails of that thread: set a filter for the email thread and drill down into an
email. So let's say that you have a thread with 10 emails: you basically have 11 data objects, one data object with a summarization of the whole thread and 10 data objects with embeddings of the individual email bodies that you've stored. So you can get very smart about this kind of stuff, based on your use case. You can do a lot, like using RAG to generate that kind of content, but also storing it and retrieving it in a smart way, and that combination brings you very far.

Awesome, thanks everyone for the thoughts. I do want to talk a little bit about retrieval, and this leads into some of the later conversations around how to structure the data, adding metadata filters, hybrid search, and re-ranking. In your minds, what are generally the top retrieval issues that users run into when they first build RAG with, for instance, top-k embedding search, and how exactly do things like hybrid search, re-ranking, and adding metadata filters help them? Tuana, if you want to go first.

So, off the top of my head: this is actually not a problem with retrieval, but potentially a problem with the way we view RAG or evaluate RAG, and it's the first thing that pops into my mind when you say retrieval. I feel like we often overlook the retrieval step when we think about the performance of an entire RAG pipeline. Someone also asked about this in the chat and I replied to it. Often we see the result of a RAG pipeline and the mind goes straight to the large language model we're using, rather than saying, as Bob said earlier, let's drill down: is my retrieval step actually performing well? Am I getting the relevant context to begin with? And then obviously this becomes an issue with the retrieval methodology you decide to use, and one of the reasons I brought up
hybrid search, and I saw that there was some chat going on about it, is, again to reference what Bob and Max brought up: let's imagine a scenario where we have emails and we need to allow our users to ask "what was the topic of email ID such-and-such", or let's assume we sell shoes and we've got a specific shoe name or ID, and users ask "what are the specifications of shoe XYZ". Unfortunately, embedding retrieval, which we really often use for RAG, and often stop there, is not great at retrieving for these types of queries. It's not great at doing keyword search, but we still want to be able to do semantic search to get a more comprehensive answer as well. So this is where we see people use a combination of keyword and embedding search, basically what Bob was showing in his demo at the beginning. And when we talk about metadata filtering, I actually posted this in the chat as well, and I'd love to hear what your experiences have been. The addition of metadata to the context you provide to your RAG pipeline is something I do very often, maybe even misuse, but it allows us to provide some sort of labeling information for each of the documents we retrieve, and to provide extra context with the replies from our large language models. For example, if I've got a website I've crawled, or a set of websites I've crawled sitting in a static data set, I tend to keep the URL in the metadata. I love this for documentation search, because if I have the URL of the piece of documentation I retrieved, then the reply of the large language model can simply reference the actual place that the answer was generated from. In use cases like this I see metadata as very, very useful, but I'd love to hear your thoughts as well.

That's a very good point, Tuana. There are three things that I want to respond to. The first one is: a
lot of people ask questions about what we call explainable AI, and using the metadata in these results is actually very helpful. Going back to the example of the emails, you can basically say that whatever you're generating with the generative model came from this source, and the metadata that you give might, for example, be the email thread ID. So you can say, okay, that's coming from that email that was sent on that date, for example. That allows you to make it explainable and give that kind of context. The second usage: in the example that I gave with the Airbnb data set, the listing information is a form of metadata. We just call it the data object, but it's a form of metadata, so you use the metadata to generate the content that you then create an embedding for, to search over. Staying with the emails, another use case to think about is that the email body is a form of metadata: you can basically ask, what should be the right response to this email, and then that's what you store. And the third thing that I want to say, and Tuana already mentioned this, is that we see it coming back more and more often that pure vector search, pure vector-based retrieval, is often not enough for your case. I'm now sharing a link to a visual in a blog post in the chat window. One way that we often explain this internally is that if you have a big sea with fish and you want to catch a specific type of fish, you throw in the net, and the net is basically your vector search. So you know where in the space to search, but then the net comes out of the water, and it does not mean
that the first thing that you pick out of the net is ac

Original Description

If you’re building LLM apps, you may already know that RAG is easy to set up but hard to iterate + make prod-ready. In this webinar, we host a panel of experts to discuss the ways you can make RAG production-ready:
- Bob (co-founder/CEO, Weaviate)
- Max (co-founder/CEO, sid.ai)
- Tuana (dev rel, Haystack)

The LlamaIndex Webinar teaches how to make RAG production-ready, covering key concepts like retrieval augmented generation, fine-tuning, and vector stores, and providing practical steps for building and optimizing RAG pipelines. By following the webinar, viewers can learn how to create effective RAG systems and apply them to various use cases. The webinar also discusses important considerations like cost estimations, data processing, and chunking strategies, making it a valuable resource for tho

Key Takeaways

Draw out the pipeline from ingestion to query
Calculate inference time and retrieval time
Validate models before building a pipeline
Optimize one building block at a time
Engineer the pipeline to handle large data sets
Use standard practices for handling untrusted text
Set reasonable limits for data handling
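
The first few takeaways (draw out the pipeline, add up the per-stage times, then multiply by the size of the data set) can be sketched as a back-of-the-envelope calculation. This is a minimal sketch, not anything shown in the webinar: the stage names and millisecond figures below are hypothetical placeholders, chosen only so that the per-object total lands in the 150 to 200 millisecond range mentioned in the panel, and the `parallelism` parameter is an assumption about how the work might be spread across workers.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    """One building block of the pipeline, from ingestion to query."""
    name: str
    latency_ms: float  # measured or estimated time per data object


def per_object_latency_ms(stages):
    """Total time spent on a single data object across all stages."""
    return sum(stage.latency_ms for stage in stages)


def total_hours(stages, num_objects, parallelism=1):
    """Wall-clock hours to process num_objects, assuming perfect parallelism."""
    total_ms = per_object_latency_ms(stages) * num_objects / parallelism
    return total_ms / 1000 / 3600


def slowest_stage(stages):
    """The building block to try replacing first."""
    return max(stages, key=lambda stage: stage.latency_ms)


# Hypothetical numbers for illustration only; measure your own pipeline.
stages = [
    Stage("parsing", 5.0),
    Stage("embedding_api_call", 150.0),
    Stage("vector_db_write", 20.0),
]

if __name__ == "__main__":
    print(f"per object: {per_object_latency_ms(stages):.0f} ms")
    print(f"100M objects, 1 worker: {total_hours(stages, 100_000_000):.0f} h")
    print(f"optimize first: {slowest_stage(stages).name}")
```

Swapping one `Stage` at a time for a faster alternative and re-running the numbers mirrors the "optimize one building block at a time" advice.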

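💡 RAG pipelines can become slow at scale because per-object latency multiplies across large data sets, and data syncing is hard to engineer well. Replacing building blocks one at a time can reduce latency, while hybrid search helps with keyword-style queries that embedding retrieval handles poorly. Chunking strategies and metadata filtering also matter for retrieval quality and explainability.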
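
Two of the ideas discussed in the panel, overlapping chunks and metadata filters such as an email thread ID, can be sketched together in a few lines. This is a minimal illustration under stated assumptions, not the API of Haystack, Weaviate, or LlamaIndex: the function names, the word-based splitting, and the `thread_id` field are all hypothetical, and a real system would apply the filter inside the vector database query rather than over a Python list.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into word-based chunks where neighbours share `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def build_records(text, metadata, chunk_size=50, overlap=10):
    """Attach the same metadata (e.g. a thread ID or URL) to every chunk."""
    return [
        {"text": chunk, "chunk_index": index, **metadata}
        for index, chunk in enumerate(chunk_words(text, chunk_size, overlap))
    ]


def filter_records(records, **conditions):
    """Keep only records whose metadata matches every given condition."""
    return [
        record for record in records
        if all(record.get(key) == value for key, value in conditions.items())
    ]


if __name__ == "__main__":
    # Hypothetical data: two "email threads" stored as chunk records.
    records = build_records("one two three four five six", {"thread_id": "t1"},
                            chunk_size=4, overlap=2)
    records += build_records("a short reply", {"thread_id": "t2"},
                             chunk_size=4, overlap=2)
    for record in filter_records(records, thread_id="t1"):
        print(record["chunk_index"], record["text"])
```

Narrowing the candidates by metadata first and only then running a similarity search over them corresponds to the "vector search limited to one email thread" pattern described in the transcript.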
