LlamaIndex Workshop: Evaluation-Driven Development (EDD)

LlamaIndex · Intermediate · 🧠 Large Language Models · 2y ago

Skills: LLM Foundations 90% · Prompt Craft 80% · Fine-tuning LLMs 80% · LLM Engineering 70% · Multimodal LLMs 60%

Key Takeaways

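The video demonstrates Evaluation-Driven Development (EDD) for building LLM apps with LlamaIndex. It compares GPT-3.5 Turbo and Zephyr 7B Alpha on a multi-document retrieval-augmented generation (RAG) pipeline, then compares two retrieval strategies, scoring each with GPT-4-based faithfulness and relevancy evaluators over a fixed set of 30 generated questions.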

Full Transcript

Hey everyone, welcome back to another edition of the LlamaIndex webinar series. Today's a special workshop hosted by Wenqi Glantz, who's a key contributor to LlamaIndex and has contributed a ton of really great blog posts and educational material on both the basic and advanced functionality that LlamaIndex has to offer. Today we're really excited to be doing a workshop on this idea of evaluation-driven development, which is basically this core idea that as you're building RAG, everybody should bake evals into the loop. Not only do you try out the basic stuff and leave it at that; you actually set up some sort of evaluation benchmark, and then as you try out more advanced techniques or ways to improve and optimize your RAG system or LLM app, you can iterate in a more principled manner, trying out different techniques and comparing the performance against the baseline.

So this will be a notebook workshop. I think someone in the comments asked whether or not this link will be shared. Wenqi, if you want to just paste the link in the chat, let's do that. This recording will also be up on YouTube, so if you joined a little bit later, that's totally fine; we'll have this on YouTube along with the notebook. So without further ado, thanks Wenqi for coming, and let's get started whenever you're ready.

Yeah, I'm going to paste the link first so people can follow. Okay, there you go. Thanks Jerry, I really appreciate it. Evaluation-driven development: this topic is actually new to me as well. The first time I encountered evals was really from Logan's webinar, the Discover LlamaIndex series, part three, focused on evaluation. I learned a great deal from there, so I highly recommend, if you're new to evals, starting with that for an in-depth dive into how eval is done using LlamaIndex
evaluation modules, the ins and outs. This is highly recommended: 16 minutes, well worth your time. And then also from Simon's tweet, where he mentioned evaluation-driven development; that's really the first time I heard this term. Coming from a traditional programming background, most of us are familiar with TDD, test-driven development, right? In the AI application development world this is sort of similar to that, integration-test-wise: where you had test-driven development, this is evaluation-driven development.

Today's notebook is going to focus on eval for a multi-document pipeline that I developed; I actually use it as a sort of sample app for many of my evaluation experiments. What this does is, I have a multi-document data folder here with three DevOps self-service PDF documents, which are actually extracted from my own blogs. We load and process them using GPT-3.5 Turbo, and we're also going to experiment with the new Zephyr 7B Alpha, which came out only last week. It's brand new and very exciting, just experimenting with it to see how it compares with GPT-3.5, and the finding is actually pretty inspiring. So we'll go into all those details, and if time permits we can jump to another notebook after this; let's see, hopefully.

So the bottom line is evaluation-driven development; EDD, we call it. Why do we have to do EDD? Let me see if the diagram is too big; I'll just shrink it a little bit. This is about the benefits of EDD. From the TDD perspective we understand the benefits of that, and in the EDD world it has these six benefits, really, to highlight. Enhancing accuracy and relevance: I also have a detailed explanation here, so feel free to explore that as well.
Identifying weaknesses and opportunities: obviously we can compare a lot of different tools and techniques, evaluate them with this approach, and find out what works best for our particular use case. Guiding model selection and parameter tuning: in this particular use case we're actually going to use EDD to guide our model selection, and parameter tuning as well. I also wanted to mention that Ravi has a wonderful post, published a little while ago, on using eval to handle chunk size selection; that's well worth it, highly recommended. All of this is part of the EDD methodology. You can also use it to ensure robustness and generalization, to help make our pipeline more robust and able to generalize. Aligning with user expectations: obviously we want to make sure our pipeline is as accurate as possible, so we use this tool to help reach that goal and to fine-tune; there are a lot of areas where we can use EDD to fine-tune. Continuous improvement and iteration: that again is like using it as a regression test suite for your pipeline. So that's just a handful of benefits, but there must be more, I'm sure.

Starting from there, how to implement EDD: there are four main steps. I also want to mention this is all from the LlamaIndex evaluation module. They have response evaluation and retrieval evaluation, so I highly recommend going there if you haven't already checked it out. They've just gone through massive enhancements and included a lot more evaluators, so I highly recommend checking out the details. Faithfulness and relevancy are the ones we're going to use for this particular pipeline, but if you have ground truth you can use the correctness evaluator, or just explore
all these different evaluators and use them to your advantage. First, we use the dataset generator to auto-generate a set of evaluation questions; we'll go into that in detail later, this is high level only. Second, we define a set of evaluators; right now, for this RAG application, we define faithfulness and relevancy, but again, you can add additional ones as you see fit for your use case. The third step is to use a batch eval runner to asynchronously run the evaluation of the responses. Here you'll see this BatchEvalRunner; this is a very cool idea, because you used to have to run evals one by one, but with batch you can combine one or multiple evaluators together, run them, and define the number of workers to asynchronously process the responses. And then, fourth, you compare the evaluation results: this spits out the results and you compare.

Let me just show the whole diagram. On the diagram you may notice the dataset generator: for this particular application we're using GPT-3.5 to generate the eval dataset. We generated a bunch of questions but randomly selected 30 of them, and then we feed those into the batch eval runner. In the batch eval runner we're currently using the faithfulness evaluator and the relevancy evaluator, because I don't have ground truth, so I don't do correctness, and I don't have guidelines either. There's also semantic similarity, so I encourage you to explore those, understand what they're all about, and see if they apply to your use case. One thing to point out with this evaluation is that we want to use a superior model to evaluate our existing model. As we all know, GPT-4 is the most powerful model out there right now, so we're going to use GPT-4 to evaluate. For the questions you can use GPT-4 as well; someone asked, can we use GPT-4 for the question generation? That's
totally fine. In this case we're using 3.5, but the evaluation is the main part; you do want to use a superior model to handle that. The evaluators get passed into the batch eval runner during the construction of the runner, and then the evaluation dataset gets passed in, and also the query engine. The query engine is whichever one you're evaluating, depending on which of the two approaches; we'll go into the details there. So the query engine gets passed in, it produces the results, and then we compare the results. That's pretty much the high level.

Now let's dive into the details of the code. Hopefully the screen is big enough for you to see. Here we install a bunch of libraries. Obviously LlamaIndex; last I checked it was 45, and 46 came out, I think. It's amazing; they develop at the speed of light, I always say. pypdf, obviously, since what we load is PDF documents. sentence-transformers we're using, and transformers, accelerate, and bitsandbytes; these are all used for the embedding model as well as Zephyr. accelerate and bitsandbytes are really used for the quantization, to shrink the model to 4-bit so that it can run on Colab. As you may notice, I have this T4 with high RAM. Initially I tried the T4 on the free tier, and it actually failed during download, so be aware: you can try it with the free version, but if you run into an issue I suggest upgrading to Pro, which is like $9.99 a month and well worth it. And even if you upgrade to Pro, make sure you have high RAM on; go to the runtime settings and make sure high RAM is turned on. Without it I also had a failed experience, so make sure you have the T4 and high RAM; that should work. I have that mentioned in my blog, so you can check out the details there.

So we install all the libraries. For this multi-document RAG pipeline, there are
many different strategies. For this one we're going to pick metadata replacement plus node sentence window. Again, I just wanted to point out the production RAG page: LlamaIndex has put together a wonderful page with all sorts of techniques and strategies for a production-grade RAG pipeline, so I highly recommend checking it out. In this case we're picking metadata replacement plus node sentence window for multiple reasons. We did this for one of the previous pipelines, and that one used recursive document agents, but in this case we cannot really handle agents because of one limitation of Zephyr: LlamaIndex did a benchmark showing that on all the advanced RAG tasks it really outperforms all the other 7B models, but with data agents it's still struggling. For that reason we're not using agents, multi-document agents, or recursive document agents for this particular pipeline; we're just going to stick with metadata replacement plus node sentence window. What that does is make sure that, for large documents, retrieval gets the more fine-grained details.

I also wanted to quickly touch on the table of contents here. How we're going to approach this is, we do the common tasks first, like loading the documents and setting up the node parser, service context, and all that. Then we implement the two different approaches. One is using GPT-3.5 Turbo; all the steps are the same between this and Zephyr, which is: extract the nodes, build the index, define the query engine, and run the test queries. So that's the two sets of implementation. After both sets are complete, we do the evaluation. Evaluation here is the four steps we mentioned above: generate the eval questions first, define
the evaluators, then define the runner, and then do the final comparison. So that's pretty much the high level there. Going into the details: here is loading the documents, the three PDF documents as mentioned. We load them using the SimpleDirectoryReader, and set up the node parser, service context, and so on.

Wenqi, sorry, this is just a quick question. I think someone in the audience was asking, is there a way to download these PDF documents?

Yes, it's in my GitHub repo. If you go to my blog, it's listed at the bottom; I have my repo link right here, so if you go there, all the stuff is there.

Okay, how about this: I'll go ahead and link the... Yeah, feel free to just put it in chat. Yes, great.

Okay, so we're setting up the service context here and also the node parser, and then we move on to the GPT-3.5 Turbo implementation. In this case we're providing the API key, so make sure you put your OpenAI API key here, and we define the LLM and embedding model. The service context is where we define our large language model as well as the embedding model, among many other things. In this case we're defining the LLM, which is defined right above, right here; it's GPT-3.5 Turbo. For the embedding model I'm picking BAAI bge-base-en-v1.5. I wanted to mention that this is currently ranked number two on the leaderboard for embedding models, and number one is really the large version of it. Because I'm running on Colab, even though I have the Pro version, I figured I'd just use the base one; it's safer, and it really is a very powerful embedding model, ranked number two. OpenAI's ada is number 14, so that gives you an idea of how good these open embedding models are. That's why we picked this one. You see I'm downloading it to local here, and with that
executing, you'll see it gets downloaded; it takes a little bit of time, 44 seconds to download the whole thing. That extracts the nodes and builds the index, a VectorStoreIndex with the nodes passed in. The documents, obviously, we read and put into this document list, get the nodes from the documents, and then construct the vector store index from there. So now I have the sentence index, and it's time for me to move on to the query engine. The query engine is pretty straightforward; the beauty of LlamaIndex is that it's just so straightforward to construct query engines and so on. as_query_engine, that's it: all you need to do is set similarity top-k and define the node postprocessor. So that is the part where we define the index as a query engine.

Now let's run a test query. Notice I'm naming my query engine with a metadata prefix, so this one is the default, for GPT-3.5. I give it a question and I get an answer pretty quickly, two to three seconds, and this one is two seconds: "What is Harden-Runner in DevOps self-service-centric pipeline security and guardrails?" I'm happy with the result, so let's move on.

Now let's implement Zephyr 7B Alpha. For this one the implementation is slightly different in constructing the service context. First of all there's the quantization, to make it 4-bit instead of the default 32, which requires a lot more hardware. This process allows us to shrink the memory footprint of this particular model and make it downloadable to local, in this case Colab. You also lose a little bit of accuracy, so be aware of that, but it's totally manageable in that you can tweak it, whether to make it 4-bit, 8-bit, and so on. If you have the hardware, you probably don't need to
do that. The rest is really from LlamaIndex's documentation; this is a code snippet from there, which defines messages_to_prompt. These are standard prompts. And here is where we use HuggingFaceLLM; again, it's a LlamaIndex class that we pass the model name into, so this is our Zephyr 7B Alpha, along with the tokenizer name and the query wrapper prompt, and again the standard context window, max new tokens, and all that. That is how we get our Zephyr LLM defined and constructed. By the way, it took three minutes to download, as you can see here, so it does take a little while. This is the step where, if you're on the free tier, you'll see an error message. Maybe you're lucky, if you happen to have a T4 with high RAM available to you at the moment, so you can try, but a lot of the time it fails, and I'd just encourage you to upgrade to Pro to test it out.

So that's downloading the model, done. From there we construct the service context, again passing the LLM and embedding model. For the embedding model we're using the same one; we don't want to compare too many variables here, we just want to focus on the LLM. So we use the same embedding model, and you see the LLM is named with an underscore-zephyr suffix, making sure we're picking the right LLM here. Then the same concept: extract the nodes, build the index, define the query engine, as we mentioned earlier. In this case you'll notice we're using the service context for Zephyr, which is defined specifically for Zephyr, and the naming convention just makes sure we're not mixing the two together, so that they're pointing to the right query engines.

Here I'm running the test query. Notice right here you'll see 13 seconds, compared to the previous one, which was only two to three seconds. That's one thing to keep in mind: even though the response is reasonably good, it took a lot longer. Same
question, by the way. So these are the test questions; this one took seven seconds. These are all things you get an idea of from eyeballing it. So now we have both of them up and running; both query engines are functional.

Now let's move on to the evaluation piece. The first step is to call the dataset generator to auto-generate evaluation questions. This is also a code snippet from the LlamaIndex documentation. What we're doing here is generating the questions, and in this case I'm generating the whole set; I was just curious to see how many questions my three PDF documents would generate. Interestingly, you see here: 490 questions. Okay, good to know. By default, how that works is that 10 questions get generated per chunk. That's the default parameter, and you can customize it however you want; that's the beauty of LlamaIndex, with so much customization available for your needs. But that's the default behavior. So I generated 490 questions. I really don't need that many, so I randomly pick 30; I use random sampling to pick 30 questions, and then I save them into a .txt file so that next time I run this I don't have to regenerate them.

Actually, this is a best practice: you should save the set, tweak it, and use it as your golden set. Why? Because you want to compare apples to apples, right? If you make changes to your implementation strategy, or your LLM or embedding models, and you evaluated with one question dataset, you want to use the same set for the next evaluation, just to make sure the outcome is totally comparable. To do that, you save it in a .txt file. When you process it the first time it will generate the questions and save them for you, but I encourage you to go in
there, look at the details of the questions, and see if they all make sense. You can eyeball it, and you can also manually add additional questions or revise existing questions to make them fully relevant, so you know you have very good questions there. Because AI is only good to some extent: we're using GPT-3.5 to generate these questions, and even with GPT-4 you could still run into one out of, whatever, 100 questions that may not be relevant. So it's a good idea to manually make adjustments, use that as your golden set, and save it aside. So here I randomly picked the 30 questions.

With that, we move on to defining the evaluators. Right now we're doing faithfulness and relevancy. What are faithfulness and relevancy? Again, the documentation has all the details, but I have it summarized here on this page. Faithfulness measures whether the response from a query engine matches any source nodes or not, so it really focuses on the response matching the source nodes. Relevancy focuses on whether the response and source nodes match the query. So notice the difference: the query is in the picture for relevancy, while faithfulness is really looking at the response and the source nodes. Again, that's extracted from the LlamaIndex documentation, just put in one place.

We want to pass in GPT-4 to evaluate, right? So we define another service context here, which we call the GPT-4 service context, and again it's very similar: we pass in GPT-4, and then we pass that service context to the evaluators. There are other parameters that can be customized, but in this case we're just customizing the service context to make sure it's using GPT-4 for the evaluation. So we've defined the two evaluators, and we move on. In this case I'm going to use
these two evaluators to test out my initially generated questions, because I want to fine-tune the set and make sure there are no bad questions in it. That's a technique you can use with these evaluators. This is also extracted from the LlamaIndex documentation, on displaying the eval data frame. You basically want to display the query, response, source, and evaluation results. This is what it looks like: you have the query column here, then the response it returned, the source nodes it's extracted from, and then the score. This one is for faithfulness, so it tested for faithfulness here and it passed. If you see some type of response saying the context was not found in the provided documents, or something along that line, that is a red flag; you probably want to change that question and revise it to something a little more meaningful. So keep that golden set handy, make sure you have it updated, and make sure it covers all the relevant documents.

Once you have all that finalized, let's move on to the batch eval runner. The construction is extremely simple, as you can see. All you need to do is pass the evaluators in key-value format: you have faithfulness with the faithfulness evaluator, and relevancy with the relevancy evaluator. Workers is where you define how many workers you want to run asynchronously to process your evaluation. I initially tested with a pretty big number and ran into, guess what, the rate limit issue with GPT-4. So be aware: don't set this number too high. The default is two; I would say under 10 is probably reasonable, but if you go over 10, especially testing with the OpenAI APIs, you are probably going to run into rate
limit issues, so adjust that to a reasonable number. show_progress just shows the progress bar. And this, again, is from the LlamaIndex documentation: get_eval_results. This helps us get the results and calculate how many passed compared to the total number of questions, and that gives the score; it would be zero point something, and one would be a perfect score. So that's how it's done.

Then let's move on to run the evaluation on GPT-3.5. This calls the runner's async aevaluate_queries function, passing the metadata query engine. Remember, this one is defined for GPT-3.5; you can name it something a little more obvious, which I highly recommend so there's no confusion, because we want to make sure we're evaluating the right query engine. Then for the queries, you pass the question dataset, which is loaded above from the file or from the initial generation of the questions. Then call get_eval_results, passing the key and also the eval results. The results have a combination of both faithfulness and relevancy, so here we're splitting them up and displaying the scores.

On the scoreboard here, you see it took 10 seconds to run faithfulness for GPT-3.5 and 4 seconds to run relevancy. For faithfulness it scored 29 out of 30 questions, so it's like 96.66%, which is pretty good, and is kind of what you'd expect from GPT-3.5, right? Relevancy is 28 questions out of 30, so that's a little lower; good to know. I just want to mention that I only selected 30 questions here. In reality, if you want to bump that number up to 50 or whatever you think is valuable and reasonable for your use case, feel free to experiment, if you don't mind spending a little token money on GPT-4. That's the only thing
I'd just keep in mind.

Now we move on to the evaluation for Zephyr. Very similarly, we call the runner's aevaluate_queries, and this is Zephyr's query engine; remember, it's defined with the Zephyr suffix here. Same question dataset, to ensure an apples-to-apples comparison again. So what did we get? Faithfulness is 28 out of 30 questions, and relevancy is also 28 out of 30. Look at that: the score in this case is very close to GPT-3.5, which is encouraging. But the big but is that GPT-3.5 took 15 seconds, and in this case it actually took 14 minutes. Well, I just ran it right before the webinar; when I initially experimented it took four minutes, but still, that's compared to seconds for GPT-3.5. So just keep that in mind: the accuracy may be there, but on latency there's definitely still room to improve. Putting them on a diagram, this is what it looks like: very close, which is good news, but keep the latency part in mind as well. So that's pretty much it for the Zephyr comparison with GPT-3.5.

Jerry, do you want me to move on to the other one? Do we have time?

Yeah, it's totally up to you. We could talk about the other notebook; you've got up to like 20 minutes.

Okay. So we just talked about LLM selection using EDD. Now we move on to the retrieval method: using eval to figure out which is the most ideal retrieval method for your RAG pipeline. As I said, on this production RAG page, my goodness, there are so many different options. It's wonderful; LlamaIndex has covered so much ground for all the different use cases, for the open source community, for us. So it's just a matter of us learning, understanding, and evaluating the most fitting strategy for our particular use case. Again, I stress your particular use case, for obvious reasons: one particular strategy might work for my use
case but may not fit your use case. So really, use this EDD method to find out which one works best for you. In this notebook we're demonstrating using EDD to decide between two retrieval strategies: one is recursive retrieval plus document agents, and the other is metadata replacement plus node sentence window. It's the exact same three PDFs, and the question generation and all the logic around that is pretty much the same. We just want to compare and see which strategy actually works best for us. So I'm going to go a little quicker here; it's very similar.

We can go through the table of contents first. The common part is to load the documents first, then implement recursive retrieval plus document agents, with the detailed steps here, and then metadata replacement plus node sentence window, which we've just gone through in detail. Once both are implemented, we do the evaluation, with really the same steps. The thing is, once you do one evaluation, the evaluation for another RAG pipeline will be very similar, so you get the hang of it, and then you can't do without it, basically. For your development it's an indispensable and essential part; I would call it the Swiss Army knife for your RAG pipeline.

Okay, so load the documents and implement recursive retrieval plus document agents. Because we have a multi-document pipeline, this approach is based on creating a document agent for each document, and each document agent has two query engines associated with it: one is a list query engine, which is really for summarization, and the other is a vector query engine, which is for Q&A. So for each document you have this set of query engines attached to the document agent. Then, before the traffic arrives at the document agent, we create these index nodes, really a high-level summarization of
the document, to sort of direct the traffic from the recursive retrieval query engine. A question comes through and goes to the engine; the engine looks at the index nodes and figures out which index node it needs to route the traffic to. The index node then goes to the document agent; the document agent looks at whether the query is for summarization or Q&A, routes the traffic accordingly, and arrives at the answer. That's the high-level architecture. I think Jerry did that in like one weekend; he got this created, and it's just amazing. It's a very effective query retrieval method for a multi-document RAG pipeline.

So let's go into the detailed implementation. Again we define the LLM; in this case we're using GPT-3.5, so we're not changing the LLM here. The LLM stays the same; what we're changing is the retrieval strategy, so keep that in mind. We're using the same service context across both. We create the document agents: again, you build your vector index, build your summary index, define your query engines, and define the tools. This is the document agent, as we call it, and then we build the agent here. On top of that, we need to create the index nodes for the recursive retrieval engine to point traffic to. So in this case we create the nodes and have an index node here with the document summary. For the summary portion, you just have to come up with something brief but still very specific to that particular document. I noticed in my use case that the title makes a big difference, so I use the title in my prompt here. Then you define the recursive retriever and query engine: we construct the recursive retriever, get the response synthesizer, and construct the recursive query engine, passing
the retriever, the synthesizer, and the service context. Then we run some test queries with similar questions from the previous notebook. So that's pretty much it for the recursive retriever plus document agent. Now moving on, we do the metadata replacement plus node sentence window, which is what we've just gone through, very much similar. So two types of... oh, by the way, here by default we're using all-mpnet-base-v2 as the embedding model, so just keep that in mind. And then define the query engine and run the test. Okay, so that's all done. Now move on to evaluation. Same steps here: we first generate the questions, define the evaluators, define the batch eval runner, and then compare the results. This block you can totally reuse, because we use the same block to first check if the document exists; if so, we load the questions, and if not, we generate the questions using the dataset generator. Same way as before, I randomly pick 30. If you have tons of documents, it's not realistic to generate thousands or tens of thousands of questions, so just pass a number into generate_questions_from_nodes, whichever you want, 50, 100 questions, whatever, and it will generate only the requested number of questions for you. So keep that in mind. We print out all our questions, which in this case is our golden set. How I know is because I sprinkled a couple of summary questions in there too, so that's my golden set. That's why you can do such a combination of normal Q&A and summarization questions, or whatever other questions you need for your particular use case. The evaluator definition here again uses GPT-4 to evaluate, very similar, and we define the eval batch runner the same way. In this case I'm using 10 workers; this didn't give me trouble, so this one was
good. I'm telling you just to be mindful not to set the workers too high, otherwise you run into rate limit issues. Same way here to get the eval results and calculate the score. Then we run it: we call the runner's aevaluate_queries, passing the recursive query engine, and next the metadata query engine. So we can see here, for the recursive retriever plus document agent, the faithfulness is perfect: it got 30 out of 30 correct, that's 1.0, which is a hard number to get. Relevance is 29, so it only missed one question, not bad. So GPT-3.5 for our... oh no, I'm not talking about GPT, I'm talking about the recursive retriever plus the document agent. So this gives us a really good result. We're using GPT-3.5 for both scenarios, so that's not in question. Now looking at the metadata replacement, for that particular retrieval method, what we got is 24 out of 30 correct, so that's 0.8, and the relevance is 26 out of 30, that's 0.86. So the diagram shows metadata replacement scores definitely a little less than the recursive retriever plus document agent. So that is how you use, I call it TDD, EDD, to determine... now you evaluate this, you have the confidence, you have the insights to know this is definitely a better option for my particular use case. So that's the point of why EDD is important: it gives us insights into what strategies to pick, what LLM to pick, what chunk size to define, and a whole bunch of other things. So that's pretty much it. Any questions, feel free to ask. This is great. I think I actually responded to most of the questions in the chat. But yeah, just a shout-out to Wenqi for the great content. The repo is linked in the chat, and we'll also put it in the YouTube video description as well, so we'll have
the two notebooks. And so, yeah, thanks for running this amazing workshop. I think this is super comprehensive, I think people are going to find a lot of value out of this, and I think this really helps instill great practices for users to start off with experimenting with the simple things. I liked how you started off with experimenting with the LLMs and the embedding model, because that's probably the first thing that people should start with, and then as you go to more advanced stuff, start thinking about more interesting, complex retrieval methods and seeing if that actually helps to improve performance. So, great. Let's see if there are any other questions. Okay, it seems like we covered most of the questions in the chat, actually. But Wenqi, thanks so much for your time, and hope the audience enjoyed it. If you have any questions, feel free to hop on the Discord, and we'll have a recording up in a day or two, so feel free to comment there as well. Great, thanks Jerry. Thanks, thanks. All right, cool, let's end it there then. All right, thanks everyone, thanks for coming, and we'll have an update soon. Bye. Thanks.
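
The scores quoted in the comparison are pass counts divided by the 30 evaluation questions. As a minimal sketch of that arithmetic, the Python below recomputes them from the counts given in the talk. The dictionary layout, the strategy keys, and the `score` and `mean_score` helpers are assumptions made for illustration and are not taken from the workshop notebook. Note that 26 out of 30 rounds to 0.867, which the talk reads out as 0.86.

```python
TOTAL_QUESTIONS = 30  # size of the evaluation set used in the workshop

# Pass counts as quoted in the talk; the layout and key names are illustrative.
passes = {
    "recursive_retriever_document_agent": {"faithfulness": 30, "relevancy": 29},
    "metadata_replacement_sentence_window": {"faithfulness": 24, "relevancy": 26},
}


def score(passed: int, total: int = TOTAL_QUESTIONS) -> float:
    """Fraction of questions the evaluator marked as passing."""
    if total <= 0:
        raise ValueError("total must be positive")
    return passed / total


def mean_score(metrics: dict) -> float:
    """Unweighted mean over all metrics (an assumed way to rank strategies)."""
    return sum(metrics.values()) / len(metrics)


scores = {
    name: {metric: score(count) for metric, count in counts.items()}
    for name, counts in passes.items()
}
best = max(scores, key=lambda name: mean_score(scores[name]))

for name, metrics in scores.items():
    print(name, {metric: round(value, 3) for metric, value in metrics.items()})
print("best:", best)
```

Ranking by an unweighted mean is only one option; a use case that cares more about faithfulness than relevancy could weight the metrics differently.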

Original Description

In this workshop, we teach you how to do "Evaluation Driven Development" (EDD) to build LLM apps for production. This consists of the following:

1. Defining evaluation metrics (performance metrics like faithfulness/relevancy or system metrics like latency/cost)
2. Creating an evaluation dataset
3. Defining a baseline
4. Trying out different approaches

We're excited to feature Wenqi Glantz, an open-source evangelist who has a series of wonderful blogs on this topic:

https://levelup.gitconnected.com/evaluation-driven-development-the-swiss-army-knife-for-rag-pipelines-dba24218d47e
https://levelup.gitconnected.com/exploring-zephyr-7b-alpha-through-the-lens-of-evaluation-driven-development-faf69e9d9ec7
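
The four steps in the description can be sketched as a small loop in plain Python. This is an illustration under stated assumptions, not the workshop's code: the `Pipeline` type, the toy dataset, the `keyword_hit` metric (a crude stand-in for an LLM-judged metric such as faithfulness), and both pipelines are hypothetical.

```python
import time
from typing import Callable, Dict, List, Tuple

Pipeline = Callable[[str], str]  # hypothetical interface: question in, answer out

# Step 2: a tiny hand-written evaluation dataset of (question, expected keyword).
DATASET: List[Tuple[str, str]] = [
    ("What does EDD stand for?", "evaluation"),
    ("Which metric flags answers unsupported by the context?", "faithfulness"),
    ("Which system metric measures response time?", "latency"),
]


# Step 1: one performance metric (keyword hit) and one system metric (latency).
def keyword_hit(answer: str, expected: str) -> float:
    return 1.0 if expected.lower() in answer.lower() else 0.0


def evaluate(pipeline: Pipeline) -> Dict[str, float]:
    """Run every question through the pipeline and aggregate both metrics."""
    hits = 0.0
    start = time.perf_counter()
    for question, expected in DATASET:
        hits += keyword_hit(pipeline(question), expected)
    elapsed = time.perf_counter() - start
    return {"hit_rate": hits / len(DATASET), "avg_latency_s": elapsed / len(DATASET)}


# Step 3: a deliberately weak baseline that gives the same answer every time.
def baseline(question: str) -> str:
    return "Evaluation-Driven Development"


# Step 4: a different approach to compare against the baseline.
def candidate(question: str) -> str:
    lookup = {
        "edd": "Evaluation-Driven Development",
        "unsupported": "Faithfulness",
        "response time": "Latency",
    }
    for key, answer in lookup.items():
        if key in question.lower():
            return answer
    return "unknown"


pipelines = {"baseline": baseline, "candidate": candidate}
report = {name: evaluate(fn) for name, fn in pipelines.items()}
winner = max(report, key=lambda name: report[name]["hit_rate"])
print(report)
print("winner:", winner)
```

The point of the structure is that the dataset and metrics stay fixed while only the pipeline changes, so any difference in the report can be attributed to the approach being tried.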

This video teaches Evaluation-Driven Development (EDD) for building LLM apps, covering key concepts and tools like LlamaIndex, GPT-3.5, and Zephyr 7B Alpha. By following the steps outlined in the video, viewers can learn how to evaluate and fine-tune LLMs, compare retrieval strategies against a baseline, and build effective LLM pipelines.

Key Takeaways

Define evaluation metrics
Create an evaluation dataset
Use a dataset generator to autogenerate evaluation questions
Define a set of evaluators
Use a batch evaluator runner to asynchronously run the evaluation of responses
Compare the evaluation results
Implement recursive retrieval plus document agent and metadata replacement plus node sentence window
Evaluate both retrieval strategies using EDD
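
Two of the takeaways above, reusing generated questions and running evaluations asynchronously with a capped number of workers, can be sketched with the standard library alone. The workshop uses LlamaIndex's own dataset generator and batch eval runner; the code below does not call them. `load_or_generate_questions`, `run_batch`, and the fake answer and evaluator functions are hypothetical stand-ins that only mirror the pattern.

```python
import asyncio
import json
import random
import tempfile
from pathlib import Path
from typing import Awaitable, Callable, Dict, List

Evaluator = Callable[[str, str], Awaitable[bool]]  # (question, response) -> pass?


def load_or_generate_questions(
    path: Path, generate: Callable[[], List[str]], num: int, seed: int = 0
) -> List[str]:
    """Reuse cached questions if the file exists; otherwise generate, sample, cache."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    candidates = generate()
    sampled = random.Random(seed).sample(candidates, min(num, len(candidates)))
    path.write_text(json.dumps(sampled, indent=2), encoding="utf-8")
    return sampled


async def run_batch(
    questions: List[str],
    answer: Callable[[str], Awaitable[str]],
    evaluators: Dict[str, Evaluator],
    workers: int = 10,
) -> Dict[str, float]:
    """Evaluate all questions with at most `workers` in flight at once."""
    if not questions:
        raise ValueError("questions must not be empty")
    semaphore = asyncio.Semaphore(workers)  # caps concurrency to avoid rate limits

    async def evaluate_one(question: str) -> Dict[str, bool]:
        async with semaphore:
            response = await answer(question)
            return {name: await ev(question, response) for name, ev in evaluators.items()}

    results = await asyncio.gather(*(evaluate_one(q) for q in questions))
    return {name: sum(r[name] for r in results) / len(results) for name in evaluators}


# Stand-ins for a query engine and LLM-based evaluators (no network calls).
async def fake_answer(question: str) -> str:
    await asyncio.sleep(0)
    return f"Answer to: {question}"


async def always_pass(question: str, response: str) -> bool:
    return True


async def mentions_question(question: str, response: str) -> bool:
    return question in response


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "questions.json"
    questions = load_or_generate_questions(
        cache, lambda: [f"Question {i}?" for i in range(100)], num=30
    )
    cached_again = load_or_generate_questions(cache, lambda: [], num=30)
    scores = asyncio.run(
        run_batch(
            questions,
            fake_answer,
            {"faithfulness": always_pass, "relevancy": mentions_question},
            workers=10,
        )
    )
print(len(questions), scores)
```

Caching the sampled questions keeps the evaluation set identical across runs, which is what makes scores from different strategies comparable; the semaphore plays the role of the worker count mentioned in the talk.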

💡 The key insight from this video is that Evaluation-Driven Development (EDD) is a crucial practice for building effective LLM apps: measuring faithfulness and relevancy on a fixed question set, with tools like LlamaIndex and GPT-3.5, shows which model or retrieval strategy actually works better for your use case.
