How to evaluate an LLM-powered RAG application automatically.

Underfitted · Intermediate · 🧠 Large Language Models · 2y ago

Skills: LLM Foundations 90% · Prompt Craft 80% · RAG Basics 80% · Vector Stores 70% · RAG Evaluation 70%

Key Takeaways

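The video demonstrates how to evaluate an LLM-powered RAG application automatically with the open-source Giskard library and LangChain: it scrapes a website into an in-memory vector store, generates test cases from a knowledge base, runs them through a simple RAG chain, and scores the answers per component and per question type.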

Full Transcript

How can you test a RAG application? This is a question that, unfortunately, not a lot of people are trying to answer right now. They build this huge system that's supposed to trust the results of an LLM, and they have no idea how they should structure that system so they can actually test it, so they can evaluate it to ensure the results are good. So that's the question I want to answer today. I'm going to show you the code of a simple RAG system, and I'm going to show you one way to evaluate that system: one way to create test cases that you can use to test your system continuously. Even better, I'm going to show you a way to evaluate different models doing the same work. Imagine you built this RAG application: I want you to have an automated way to test whether GPT-4 is better than an open-source model, and to do that systematically, in a way that does not involve you trying different things by hand. So far, what I've seen is that most people just do the entire integration, they have a couple of pet examples, they try those examples, and that's it. That's the extent of their testing. So hopefully, by the end of this video, you'll have a better approach, and you'll have the tools, all of them open source, to implement robust testing for your RAG application.

Now, before I keep going: if you like this type of content, just give me a like below. That tells the algorithm that I should keep doing this type of video. It's free, so if you enjoy it, just like the video. Let me show you what I have here. All of the code that I'm showing you is going to be linked down below, so you can follow along: you can install it on your computer and use it. This is a notebook; I'm going to do everything in a notebook. It's a very simple notebook,
and the first thing that you see here, in the first cell, is just loading the environment variables into the notebook so I have access to them. I'm creating this OpenAI API key variable and reading it from an environment variable. I created this environment variable before, off camera, so I have it set here. That's obviously your OpenAI API key. It comes from my .env file, which I'm not going to show you because my keys are there, but you are going to need to set that environment variable. I'm going to be using Giskard here, which is an open-source library that's going to help me evaluate my RAG application, and Giskard uses that OpenAI API key environment variable to do its job. So make sure you do this.

For my RAG application in particular, I'm going to be using GPT-3.5 because it's cheaper. You can change this to an open-source model if you want to, or you can just use GPT-4; it doesn't really matter. That is what this variable is for: later on, when I create my model, I'm going to use this variable to select GPT-3.5.

All right, so let's start with what really matters. My RAG application is going to answer questions from a website. Or actually, it's going to answer any question, and it's going to answer it using the information from a website. I teach this class called Building Machine Learning Systems That Don't Suck, and I have a website with a ton of information on it. There are testimonials, there is information about the program, different characteristics like how many hours it takes to finish the program and how many assignments there are, who the program is for, what you will learn, how much the program costs, and the syllabus. Again, it's just a ton of information about the program.
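A minimal sketch of the setup described above, assuming only the Python standard library. The video reads the key from a .env file; this sketch reads the process environment instead, and the `RAG_MODEL` variable and the `load_settings` helper are hypothetical names used for illustration, not part of the video's notebook.

```python
import os


def load_settings(env=None):
    """Return the API key and model name used by the rest of a notebook."""
    # Default to the real process environment; a plain dict can be passed for testing.
    env = os.environ if env is None else env

    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        # Fail early: the evaluation library needs this variable later on as well.
        raise RuntimeError("OPENAI_API_KEY is not set")

    # RAG_MODEL is a hypothetical override; the video sets the model name
    # in a notebook variable and uses GPT-3.5 because it is cheaper.
    model = env.get("RAG_MODEL", "gpt-3.5-turbo")
    return {"api_key": api_key, "model": model}
```

Keeping the model name in one place is what later makes it cheap to rerun the same evaluation against a different model.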
So what I want to do is build a RAG system by scraping this website. I'm going to gather all of the information on this website, store it, and then answer any question from the user using that content. That's the setup for this app. By the way, I'm going to build my RAG application using LangChain. You don't have to use LangChain; you could use LlamaIndex if you wanted to, and that's fine. I'm going to use LangChain because that's the one I prefer.

In this cell right here, you can see how easy it is to scrape the website with LangChain. Here is the URL of my website, ml.school, and here is what's happening: I'm importing a couple of libraries and creating a text splitter. This splitter is just a class that tells LangChain how I want to split the content that I'm scraping off of the website. It's a ton of content; let's say I'm going to scrape, I don't know, 10 pages of content. This splitter is a RecursiveCharacterTextSplitter, and it's telling LangChain that I want chunks that are no longer than 1,000 characters, with an overlap of 20 characters between them. So the splitter is going to go through all the content and grab up to the first 1,000 characters, which become one chunk. Then it moves on to the next 1,000 characters with a 20-character overlap: it takes the last 20 characters from the first chunk, starts there, and grabs another thousand characters. It keeps doing that on and on.

Now, why do I need to split all of the content? Because my RAG system requires sending context to the model. I'm going to be telling the model, "Hey, answer this user question using the following context," and I want to include some context. Now, I do not want to send the entire
website as the context, because I'm probably going to exceed the context size: there is a limit on how much text I can send. By splitting my website into smaller chunks, I have a way to send only a few of these chunks at a time to answer any question. That's pretty important whenever you're using a model that has a constraint on how much context you can send. I recorded a video, it's on my channel, that goes into a lot of detail about how the context size works, how all of these models treat the context size, the splitting, the recursive character text splitter, and all of that good stuff. It's going to be linked somewhere here; if not, you can find it on my channel if you want more information.

Okay, so I'm defining my splitter, and now I'm going to use a WebBaseLoader. A WebBaseLoader is just a class that, behind the scenes, uses Beautiful Soup to go to that URL and scrape all of its content. It's very simple, as you can see: I'm setting up the loader right here, giving it the URL, and then I'm calling the function load_and_split and passing the text splitter that I just created, so the loader knows exactly how I want to split that content. Then I'm just printing out all of the documents that I get out of my website.

As you can see here, these are all of the documents that I get. I'm not going to go through all the details, but you can see that "Building Machine Learning Systems That Don't Suck" is how the first document starts. If I go all the way to the end, it ends on "we'll use this time to"; that's the final sentence here. Now let's look at the second document. It starts with "we use this time to discuss the first principles behind building". See how there is an overlap there? That's
what's happening here: my text splitter is asking for a 20-character overlap. So this is awesome, this is working. Now I have a list of, I don't know how many; let me try len(documents) here. I have 10 different documents from my website, which is awesome.

What we need to do right now is load all of those documents into a database, and that database is going to help us find the individual documents that are most relevant to answer any question. I already talked about this in that other video, but this database is going to be a vector store. For this particular example, I'm just going to use a vector store that lives in memory. In my other video I also use Pinecone, which is an actual vector store, but here in-memory is fine.

Now, there is something very important about a vector store database: when I store the documents, I'm also going to generate embeddings for each one of them. So what is an embedding? I'm not going to go too deep into this, but an embedding is basically a semantic identifier for the document. It's like coordinates in space, and depending on what the document talks about, we generate different coordinates for it. Imagine a document that talks about cars or automobiles: its location is going to be over here. But if a document talks about boats, maybe the embedding is going to point over there. So anything related to cars goes this way, and anything related to boats goes that way. The reason this is important is that later on, if we want to answer a question about cars, about an Audi or about a Tesla, we can find documents in the location where all of the automobile documents are stored. That's what embeddings are going to give us. So in order to create this, or
load this data into a vector store, we need to specify a class that's going to take care of generating those embeddings, those locations in space. In this case, I'm using the OpenAIEmbeddings class, as you can see here. So whenever I create this DocArrayInMemorySearch (again, this is just a vector store that keeps everything in memory, to keep it simple), I say, "Hey, create this database from the list of documents that I have, and use this OpenAIEmbeddings class to generate the embeddings that you need to store those documents." That's what's happening on this line. After I run it, I'm going to have my database with all of my documents inside, and for each one of those documents I'm going to have embeddings generated. That means that if I have a query, I can find all of the documents that are similar to that query. So if I'm asking how fast a BMW can go, the documents that come back from the database are all related to cars: BMWs, Audis, and that type of stuff.

After doing this, I have my vector store, and I'm going to create a knowledge base. This is key, and here we start entering the area of "how do I test my system?" I have all of these documents, and I'm going to be building a RAG application that answers questions using them. How do I test that? The steps we're going to go through will make sense in a second: we're going to automatically generate a bunch of test cases. You can do that manually, but that's a lot of work.

Here's what happens if you're trying to test a classification system, something that classifies, let's say, patients into sick or healthy. A classification system is pretty simple to test because you know the ground truth. You know whether a patient is sick or not, you look at the response from the system, and if the response
matches the ground truth, the system got it right; if not, it's wrong. That's it: it's just comparing the output with the correct label. The problem with a RAG system, when you're using it for text generation, is that it's really subjective. It's really hard to compare. Say I give you a document and I want my RAG system to summarize it, for example a summary of this page here. How do you know if that summary truly reflects what the page says? It's less clear how you can test these systems, and that's the challenge we have here.

So, to start with, we need to generate a bunch of test cases. Before we generate them, I'm going to create what we call a knowledge base. The knowledge base is just going to contain all of the documents that we have, all of the documents that I just stored in the database. To create that knowledge base I need a data frame, a pandas DataFrame, which is just a table structure where I'm going to organize all of those documents. You can see here that I'm creating that data frame from the documents we just loaded into the vector store. Nothing fancy: I'm putting them in a column called text, because that's the input I'm going to need to create my knowledge base. I'm printing out the 10 documents that I have, which go from 0 to 9, and you can see the content right there in that column.

So this is awesome, and here is where we start. I'm going to be using Giskard, which is a library that's going to help me evaluate my RAG system. Giskard has a class called KnowledgeBase that's going to wrap all of these documents, and the reason Giskard needs this knowledge base is that Giskard is going to help me generate
automatic test cases; we're going to see them in just a second. All right, so I'm going to wrap my data frame into this KnowledgeBase class. After doing this, I have my knowledge base, and I'm going to use it to do everything else from here on out.

So let's generate test cases. This is key. By the way, if you wanted to create your own test cases, you can do that. What is a test case? A test case is going to have a sample question, it's going to have what the answer should look like, and it's going to contain the document where the system should find that answer. So if I ask, "What is the price of the class?", the sample answer should say, "The class costs $450," and it should come together with the document, let's say document seven, where the price is specified. We can do that manually, but generating sample test cases is a ton of work, as you may imagine. So behind the scenes, Giskard is going to use that OpenAI environment variable that I told you at the beginning was important to set, connect to GPT-4, and use GPT-4, with specific prompts of its own, to automatically generate test cases for my knowledge base.

Here is how that looks in code. I'm using the generate_testset function from Giskard. I'm going to pass the knowledge base, which is all of the content that we're using to power our RAG system. I'm going to specify how many test cases I want: 60 in this case. That's the number of questions, so I want to automatically generate 60 test cases. If you want 100, just set 100, or 120; it doesn't matter. You can generate many, many test cases: the larger your knowledge base is, the more content you have, and the more test cases you can generate. And then I'm going to specify a description for this agent that's
going to help in the generation of test cases. After I run this, and it's going to take a minute to finish, I have my test cases. Remember, this is connecting to GPT-4 using your API key; you have to understand that it's going to be using your API key to generate all of these test cases.

After doing this, I'm printing them out here just so you see them. I'm also saving this to a file, and we're going to open that file in a moment, but just so you see them in the notebook, I'm printing out three of these test cases. As you can see, I'm printing out the question number with the question. The first question was, "What does the Machine Learning System course offer?" That was the first automatically generated question. Then I'm printing out the reference answer, what GPT-4 thinks a good answer would be: "The Machine Learning System course offers 18 hours of live interactive sessions. It is a practical, hands-on..." and so on. And I'm also printing out the reference context; in other words, which document or documents answer this question. For this particular one it's document zero, so the first document should answer this question.

Now look at the second question: "Who is the instructor of the machine learning program?" That is a test case that GPT-4 came up with. Giskard asked GPT-4 to come up with these questions, and that was one of the test cases I can use to test my system. Reference answer: "The instructor of the program is Santiago." That's me. Reference context, where is this answer coming from? It says there are two documents that can be used to answer this question, document five and document nine. And it goes on and on.

Here in this cell, I'm just saving the test set to a JSONL file. Let's open that test set, and you can see all of the questions that were generated automatically by Giskard. So this is
great: these are my test cases, and now I can use them to test my system. This is a very valuable step that we don't have to go through manually, which takes a ton of time if you think about it. Let me see a question. Look at this one: "Hello, I'm considering enrolling in the machine learning school program..." This is simulating a user asking my system a question, which is great; that's exactly the type of test case I need. So you get the question, you get the reference answer, which is what the correct answer should be, and you get the context. There we go. Again, the context is just the list of documents containing that answer, or the list of documents that the system should use to answer that question. This is awesome: I have 60 test cases now.

The next step is to run those test cases, to actually validate my system. But I need to build the system first, because I don't have a system right now. Let's prepare the prompt. This is going to be my simple chain, my simple RAG system, and it's going to work like this: I'm going to grab a question from the user, hopefully find the context in my vector store, put them together, and ask the model to answer that question. If the model cannot answer the question, it can say, "I don't know." That's what my prompt is. This is a very simple prompt for a RAG system. By the way, if you really want to build a RAG system for something serious, there are much better prompts that help the model answer better; this is a very, very simple one.

I'm creating a prompt template, which is a class from LangChain that allows me to parameterize a prompt. You can see I have two variables here, the context variable and the question variable. Whenever I use this prompt as part of a bigger RAG system, I'm going to have to pass values for those two variables, the context and the question. So I'm
creating this prompt template from the text that I just put in here, and I'm printing out what the template looks like after I format it with the two variables. You can see I'm passing "Here is some context" for the context variable and "Here is a question" for the question variable. This is what I get: "Answer the question based on the context below. If you can't answer the question, reply 'I don't know'. Context: Here is some context. Question: Here is a question." Okay, so that works.

Let's now create the RAG chain. Of course, I'm not spending a ton of time here. There are a ton of ideas and steps we have to go through to come up with this RAG system, and I'm not going to go through all of them right now, because I'm assuming you only care about evaluating these RAG systems. Again, the video on my channel that I mentioned before goes through all of those ideas. So let me try to explain what's happening in this RAG chain.

First of all, I'm creating the model, and like I told you before, I'm using the GPT-3.5 model. That's the model that's going to be answering the questions from my knowledge base. I'm initializing my model with the ChatOpenAI class from LangChain, passing the API key and the name of the model, which is gpt-3.5-turbo. I could be using GPT-4 here as well if I wanted to. That would actually be a very interesting test, because what's going to happen at the end of this is that I'm going to have GPT-3.5 answering questions, and those answers will be evaluated by GPT-4, because GPT-4 was the one generating the test cases in the first place. That's just the way it happens here.

All right, so I'm going to create my chain. A chain is what the name says: a string of components where the output of one component becomes the input of the next component in the chain. That's how you build here in LangChain, and
that's one of the reasons I like it a lot. I'm going to start with the first component in my chain, which is this map, or dictionary, that you see. Notice that there are two keys in this dictionary: the first key is context and the second key is question. The reason I have this map here is that the second component is the prompt we created. Let me scroll up to that prompt. Remember, that prompt requires two variables, so the input to that prompt is a map with two variables inside, context and question. Because of that, the first component of this chain is a map that gets fed into the prompt, which is the second component.

Now let's see where these values are coming from. The first value is the context. Where is the context coming from? It has to come from my vector store. My vector store contains all of the documents, and some of those documents are going to be the context that I need to send to the model to answer a particular question. How do we know which documents? We need to pass a question to the vector store and tell it, "Give me any documents that are similar to this question." Remember how embeddings work: if I ask the vector store what the price of the course is, the vector store should look through all of those embeddings in space and return any that are around the location that talks about prices and costs. If such a location exists, any documents that are very close to that point are going to get returned to me, and hopefully those documents will answer the question I asked, which is how much the course costs. The way I accomplish that in code is by taking the vector store that we created and generating a retriever from it.
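A rough sketch of the retrieval step and the first chain component described above, using only the Python standard library. The keyword-counting `toy_embed` function, the sample documents, and the helper names are illustrative assumptions; real embeddings come from a model (the video uses OpenAI embeddings), and this is not how LangChain's retriever is implemented.

```python
import math
from operator import itemgetter

# Hypothetical keyword lists standing in for the "car" and "boat" regions of space.
CAR_WORDS = {"car", "cars", "audi", "tesla", "bmw"}
BOAT_WORDS = {"boat", "boats", "sail", "sailboat"}

DOCUMENTS = [
    "A BMW is a fast car.",
    "A sailboat is a boat with a sail.",
    "Tesla and Audi build cars.",
]


def toy_embed(text):
    """Map text to 2-D coordinates: (car-word count, boat-word count)."""
    words = [w.strip(".,?!").lower() for w in text.split()]
    return (sum(w in CAR_WORDS for w in words), sum(w in BOAT_WORDS for w in words))


def cosine(a, b):
    """Cosine similarity between two 2-D vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.hypot(*a) * math.hypot(*b)
    return dot / norm if norm else 0.0


def retrieve(question, documents, k=2):
    """Return the k documents whose toy embedding is closest to the question's."""
    q = toy_embed(question)
    ranked = sorted(documents, key=lambda d: cosine(q, toy_embed(d)), reverse=True)
    return ranked[:k]


def first_component(inputs, documents):
    """Build the {"context", "question"} map that the prompt expects."""
    question = itemgetter("question")(inputs)
    return {"context": retrieve(question, documents), "question": question}
```

Calling `first_component({"question": "How fast can a BMW car go?"}, DOCUMENTS)` returns a dictionary whose `context` holds the two car-related sentences and whose `question` is the original string, which is the shape the prompt needs.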
Maybe this is going to make it a little bit clearer. I'm going to create a retriever: I'm going to say, "Hey, vector store, just give me a retriever." And let's see what we can do with that retriever. You probably know this, but if you use the function dir, it returns all of the functionality of that retriever. So look at this: these are all of the functions that we can call on that retriever. That's a bunch of stuff. Let's see: get_prompts, get_relevant_documents. Okay, that sounds cool. There is invoke as well. So let's try get_relevant_documents. Let's comment this out and call retriever.get_relevant_documents with "What is the machine learning school?" Okay, there's a top-k option; I'm not going to pass that. Let's see what happens when I do this. Did that even work? Let's go up. Oh, this is awesome. When I call get_relevant_documents on a retriever and pass a string, what happens is exactly what you're imagining right now: the retriever returns the top four documents that are related to that question. And that is exactly what we need to accomplish here as part of the LangChain chain.

In this case we are calling as_retriever here, but we could just use the retriever variable; it doesn't matter. We are passing the question through this itemgetter. I'm going to let you figure that out in detail, but itemgetter is just a function from the operator package, and it's basically going to grab the question out of the input that you apply it to. In other words, in plain English: when I invoke that chain, and you can see it here, I'm going to be invoking it with a variable
called question. I'm going to pass a dictionary with an attribute inside that's called question. This itemgetter is going to grab the value of that question and pass it to the vector store retriever that we created right here. Just to make it clear, let's use the retriever variable here, if I can spell retriever. So it's going to pass that question to the retriever, and we already know that this returns the relevant documents. So now the context will hold a list of relevant documents, the same list that you see here. That is the list we're going to be passing to that context variable.

The second one is pretty straightforward. I'm saying I also need an attribute called question, so let's just put in the same value that we invoked this chain with. The same value of question is going to go here. And that is the first component of my chain, unfortunately the most complicated one to understand, because everything else is going to be pretty simple.

The next component of the chain is the prompt, which is just the prompt that we defined before. We're going to be injecting the context and the question into that prompt. The output of that prompt, which is a well-formatted prompt, is going to go into our model. So now we're going to be invoking our model, our GPT-3.5 (it always takes me a second to say GPT-3.5). We take that prompt, invoke the model with it, and the model returns an answer. Now, I'm not going to get into too many details, but we can look at that model in action. I'm going to add another line; I'm going to say model.
invoke: let's just invoke the model with "tell me a joke". Oh, actually, I cannot do that here; I'm going to do it here. Oh, of course not, because I have not executed this cell. Let me execute this, and let me ask the model, "tell me a joke." And notice that, yeah, I'm getting a joke back from GPT-3.5. This is my joke; it's a bad joke, obviously. Notice that this is not clean string text: it comes wrapped in an AIMessage. The reason is that this is a chat model, so it's supposed to have a system message and a human message, and in this case this is an AI message, a message coming from the AI. I don't want that; I want clean strings. So I'm going to pass a parser, which is just a string output parser, and that's going to make this go away, so the output of the chain is actually going to be a string. All right, let me remove this. That explains why you see the prompt, then the model, then the string output parser: just to clean that up and get clean, beautiful strings.

And then here is the test of tests, where I invoke my chain just to make sure it works. I'm saying, okay, invoke my chain and pass a question: "What is the machine learning school?" And look at the answer. It's beautiful; the answer is just a string. Just to make sure I did not lie (I need you to trust me), look what happens when I invoke this chain without the parser: see, AIMessage. Horrible; we don't want that. So let me re-execute this. Beautiful, just a clean string. That is what we need.

All right: we have our knowledge base, we created test cases, we have a chain, we have a RAG system. Now we need to test that RAG system. How good is it? To do this evaluation we're still going to use Giskard, because Giskard is going to take care of running every single test case through my chain, taking the answer, and evaluating that answer: is it a good answer
or not? That is the tricky part. Remember, this is not a classification model. You cannot just compare strings and say, "Yeah, this string is exactly like this string." You have to use a model to look at two answers and say, "Yeah, I think they're hitting the same points; I think both of those answers are answering the same question." That is what Giskard is going to do behind the scenes.

So how do I use this? Giskard has a function called evaluate. Very simple. That function requires a test set, which we already created; it requires the knowledge base, which is the original data; and it requires a function that is going to call the model. In this case I'm calling it answer function, but you can call it whatever you want. This function is very simple. It receives a question and an optional history, if you want to enable history for your chat application. In this case I'm not enabling it, just to keep this simple. The goal of that function is just to answer that question, and what I'm doing here is invoking the chain, passing that question. Within this evaluate function, Giskard is going to repeatedly call my function, passing the different questions that it needs to evaluate. So that's pretty cool: you call this evaluate function and it gives you back a report.

When you run this, it's going to take a while, and remember, this is going to be using GPT-4 behind the scenes. Without looking at the source code, I imagine that what's happening is that it goes through all of the test cases: it grabs the first test case, sends that question to my chain, my RAG system, grabs the answer, and then uses GPT-4 to compare the answer from my chain with the reference answer that we generated before. It's going to compare
those two, and if they are similar, if they look correct, it's going to give me a point, and if they don't, I don't get any points. At the end we can determine how accurate my system is: how many questions did I get right? It should do that behind the scenes.

So what is in that report? If you're working in a notebook, you can just display the report and see what it looks like. If not, you can also open a web page. I'm displaying the report here, but I'm going to go to the web page because it looks a little bit better. I opened the report after running it, and here's what you get. First, there is a UMAP representation of my knowledge base. Again, the more questions I have and the larger my knowledge base is, the more interesting stuff you're going to get here. Here you get where the incorrect answers are located. If my knowledge base were bigger, this would give me more information about what areas of my knowledge base are not well covered or where I'm having problems. Remember, I only have 10 documents here, so that's why you have so few points.

You get a component analysis here. We're going to talk about that in a second, but this is giving me a score for every single component of my RAG system. We're going to go through all of them in a second. There are some recommendations, and some correctness by topic. I have only one topic in my website; this can get really complex with a larger knowledge base. And the overall correctness score is 73.33%. That's how good my system was overall right now.

OK, so let's talk about this component analysis. Let's go back here so I can show you. I have here the individual score for each one of the components of a RAG system. The first one is the generator, and if you put your mouse on top of it, you can see what that component is
about. In this case, this is the large language model that we used in the chain to generate the answers. So the way Giskard is evaluating my system is that, depending on what the test case looks like, it's trying to evaluate all of these components separately. Now, in my simple chain I don't have a rewriter and I don't have a router. The rewriter would be a component that you add to your chain to rewrite the question. When the user asks something that doesn't look right, you could have a component that rewrites that question in a way that makes it easier to answer, so it becomes more relevant. I don't have a rewriter here, so obviously I'm not doing that great on questions that should be rewritten. I'm not doing too hot here. The retriever is the part that gets the most relevant documents, so I should work a little more on those embeddings, on how the similarity gets computed, and on how I get the relevant documents. So this breakdown is great to tell you exactly what you should be focusing on.

Let's go down a little bit. Here's my recommendation. I'm saving, or you can save, that report to HTML; that's the HTML document that I showed you. You can also print the correctness based on question type. That's what you get here. You see that complex questions are 90% correct. Conversational questions, 50% correct. This makes sense: I did not include a history in my chat. Remember that when I'm using a ChatOpenAI model, it supports a conversation, so it supports keeping context. I did not use that, so I'm sure that by using it I can improve the conversational aspect of my RAG system, which I did not implement. Distracting elements, only 50%. Questions that were generated with distracting elements did not score well, so I'm going to have to do better there. Double questions, simple questions,
situational questions: what, 100%. So this is gold, because this tells me how my system is doing and where I should focus to fix my system. By the way, there are no topics here, but Giskard has the ability to automatically generate topics based on your documents. If I had a bigger knowledge base, Giskard is able to recognize and generate different topics and then give you scores on those topics. So you know, OK, anything related to price, the LLM is doing great; anything related to this other topic, it is not doing great.

I can also get the failures, if you want to know exactly which questions the system failed. I can get the list of failures, I can save it, I can do whatever I want. So let me see. You cannot read the whole question here, because it is "what does the machine learning system course, blah blah blah," reference answer, conversation. Look at this conversation history. See how there is conversation history for some of the questions? I'm not supporting that right now, so I'm not surprised the system is not doing great on those.

OK, so all of that is awesome. If you stop the video right now, you already have a ton of value just by doing this. But there is more. This is great to run an evaluation of your system one time. How is it doing? Great. But I want to actually automate this. I want to do this every time I push a change or every time I'm ready to make a deployment. I want to just run a test suite with all of my test cases. By the way, on top of those test cases that were autogenerated, you can add your own as well. You can fix them, you can do whatever you want with them. But the key here is that I want to automate my tests. So how do we do that? Let's take a look. Here I'm just loading the test set from the JSONL file, just loading it in memory. Very simple. And I can just create a test
suite. It just takes the test set and generates a test suite. That's the name of the test suite, which will be referenced later: whenever you run multiple test suites, you know exactly which one it is. One line creates a test suite for me, and then I can run that test suite. So how do I do that? In order to run this test suite, I'm going to wrap my chain in a Giskard model. This is a class that's going to provide all of the information that Giskard needs to run my tests. Look at this. The Giskard model requires a prediction function, the model that's going to be solving the test suite, and we're going to see that in a second. What is the type of model? In this case it is text generation. You can also use Giskard to do classification or that type of stuff. What is the name? What is the description? Always specify these two parameters; they help their model make decisions. And what is the feature name that I care about? In this case it's going to be the question. That's the feature that we care about, the question that we need to answer.

Now look at this prediction function. I call it batch prediction function. This is very similar to the answer function we created before to run that evaluation report, but in this case I'm answering questions in batches. When running the test suite, Giskard is not going to go question by question; that would take a long time. It's actually doing this in batches. What's cool about LangChain is that I can invoke a chain with a batch of inputs, and that's exactly what's happening here. This is my chain, and now I'm passing a batch, so it's an array of questions. Same thing as invoke before, but now it's in batches, so we can send multiple questions to the model at the same time. We don't have to wait for one answer before sending the next question. That makes this really fast. So, very similar to before, I receive a data frame, I go through that
data frame, all of the values of the question column, and I pass them as an array of maps with one attribute that's called question. Now that I have this model, I can get the test suite and say run, passing the Giskard model, and my test suite is going to run. Look at this: it says that it succeeded with 62%. That's the metric that I'm getting back when I run this test suite. And that is awesome, because now I can automate the process of running this test suite. By the way, I can obviously get from the results what the metric was and what the result was. I can get that information here to automate something like: before deploying the model, make sure the test suite passed, and if it didn't pass, don't deploy the model. That would be the way to automate this. One more thing: notice here I'm displaying the results of the test. You can see the test suite passed, the metric was 61.667%, which is a pass, that's the name of the test suite, all of that good stuff.

The final thing that I have to show you is how we integrate this with pytest. Why pytest? Because if you're not using pytest, you're not doing it correctly. pytest, in my opinion, is the best unit test library there is for Python, so of course I jumped all over this when I saw that you could actually integrate this with pytest. In my example here, I'm using something that I don't see many people using, because, guess what, people who use notebooks are not thinking about testing their code. They should, but they're not. I'm using here the ipytest library, which is a library that's going to allow me to run pytest tests directly from my notebook, and it's great. So I'm going to install ipytest (ipytest, not pytest), and then I can use a cell magic, as you can see here: %%ipytest. And this cell became runnable: if I run this cell, it's going to run all of the test cases inside, just as if I were running this with pytest, which is awesome. So
look at this code here. It's very simple. I have only one test. I have a fixture here that returns the dataset, and the dataset is just me loading the test set from the hard drive. I'm loading my test set, my 60 test cases, turning that test set into a dataset, and returning that. Then I have a model fixture that is going to return the Giskard model that I created before. I could also create it here inside, but I decided to just reference the one that's outside. And then I have a single test case. This single test case receives both fixtures, the dataset and the model, and it's going to use a function that's called test_llm_correctness. There are a bunch of functions inside Giskard that you can use to test different aspects of a system. In this particular case, I just care about whether the LLM is correct. I pass the model, I pass the dataset, and, very important, I pass a threshold. That threshold indicates how high my results need to be in order to declare that this was successful. In this particular case the test passes, because my threshold is under 62%, and 62% is the metric that I'm getting. I'm not going to run it here on screen because it takes a little bit of time to answer all of those 60 questions, but trust me, when I run this, it's going to succeed. Now, if I set that threshold to 70 or 80, these tests are going to fail.

With this, you can see how you can integrate with your system. If you're using pytest, now you can integrate unit tests for your LLM application. You can evaluate your LLM application automatically, not just by calling John Doe or Mary Black and asking them, "Can you try a few questions and see if it works?", which is what I've been seeing. That is bad. With this you can actually do it automatically. So hopefully this makes sense. Hopefully this helps you. If you got all the way to the end, please like this video. It helps me understand whether this type of content is useful
for you, and I will see you in the next one. Bye-bye.
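
The evaluation loop the speaker guesses at (send each test question to the chain, compare the answer with the reference answer, score a point when they match, then break correctness down by question type and list the failures) can be sketched in plain Python. This is a minimal sketch under stated assumptions, not Giskard's implementation: the test cases, the canned answers, and the names `overlap_judge` and `evaluate_answers` are hypothetical, and a simple word-overlap check stands in for the GPT-4 comparison so the example runs offline.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, List, Optional

# Made-up test cases for illustration only; a real test set would be
# generated from the knowledge base.
TEST_CASES: List[Dict[str, str]] = [
    {
        "question": "What is the program about?",
        "reference_answer": "The program teaches how to build machine learning systems",
        "question_type": "simple",
    },
    {
        "question": "Is the program live?",
        "reference_answer": "Yes the program is live and interactive",
        "question_type": "simple",
    },
    {
        "question": "What about its price?",
        "reference_answer": "The price is listed on the website",
        "question_type": "conversational",
    },
]

# Canned answers standing in for the RAG chain (hypothetical).
CANNED_ANSWERS = {
    "What is the program about?": "It teaches how to build production machine learning systems",
    "Is the program live?": "Yes it is live and interactive",
}


def answer_fn(question: str, history: Optional[list] = None) -> str:
    """Receive a question (and an optional, unused history) and answer it."""
    return CANNED_ANSWERS.get(question, "I don't know.")


def words(text: str) -> set:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_judge(answer: str, reference: str, min_overlap: float = 0.5) -> bool:
    """Toy stand-in for an LLM judge: enough reference words appear in the answer."""
    reference_words = words(reference)
    if not reference_words:
        return False
    shared = reference_words & words(answer)
    return len(shared) / len(reference_words) >= min_overlap


def evaluate_answers(
    answer_fn: Callable[..., str],
    test_cases: List[Dict[str, str]],
    judge: Callable[[str, str], bool],
) -> dict:
    """Run every test case through answer_fn and score it with the judge."""
    per_type = defaultdict(lambda: [0, 0])  # question_type -> [correct, total]
    failures = []
    for case in test_cases:
        answer = answer_fn(case["question"])
        correct = judge(answer, case["reference_answer"])
        stats = per_type[case["question_type"]]
        stats[0] += int(correct)
        stats[1] += 1
        if not correct:
            failures.append(case["question"])
    total = sum(t for _, t in per_type.values())
    correct_total = sum(c for c, _ in per_type.values())
    return {
        "correctness": correct_total / total if total else 0.0,
        "by_question_type": {k: c / t for k, (c, t) in per_type.items()},
        "failures": failures,
    }


report = evaluate_answers(answer_fn, TEST_CASES, overlap_judge)
print(report)
```

Swapping `overlap_judge` for a function that asks a language model whether two answers make the same points would bring the sketch closer to what the transcript describes, at the cost of needing an API key and network access.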

Original Description

Source code of this example: https://github.com/svpino/llm/tree/main/evaluation
Giskard library: https://github.com/Giskard-AI/giskard

I teach a live, interactive program that'll help you build production-ready machine learning systems from the ground up. Check it out here: https://www.ml.school

To keep up with the content I create:
• Twitter/X: https://www.twitter.com/svpino
• LinkedIn: https://www.linkedin.com/in/svpino

This video teaches how to evaluate an LLM-powered RAG application automatically using tools like the Giskard library and LangChain, and techniques such as fine-tuning and vector store database integration. It covers building a RAG system, creating a knowledge base, generating test cases, and evaluating the RAG application using Giskard.

Key Takeaways

Load environment variables into a notebook
Create a RAG system using LangChain
Use a vector store to store documents
Generate embeddings for documents
Use Giskard to evaluate the RAG system

💡 Automating the evaluation of LLM-powered RAG applications can be achieved using tools like Giskard and techniques such as fine-tuning and vector store database integration.
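
Automating the evaluation, as the transcript describes it, comes down to comparing a correctness metric against a threshold and refusing to deploy when the check fails. The following is a minimal sketch of that gate written as a pytest-style test. It does not use Giskard: `correctness`, `check_correctness`, and `can_deploy` are hypothetical helper names, the 0.5 threshold is an assumption (the transcript only says the threshold was under 62%), and the 37-of-60 result is an assumed outcome chosen to reproduce the roughly 62% metric quoted in the video.

```python
from typing import Sequence


def correctness(results: Sequence[bool]) -> float:
    """Fraction of test cases judged correct (0.0 for an empty run)."""
    return sum(results) / len(results) if results else 0.0


def check_correctness(results: Sequence[bool], threshold: float) -> dict:
    """Compare the correctness metric against a minimum threshold."""
    metric = correctness(results)
    return {"passed": metric >= threshold, "metric": metric}


# Assumed outcome: 37 of 60 test cases correct, roughly the 62% in the transcript.
RESULTS = [True] * 37 + [False] * 23


def test_correctness_meets_threshold():
    """pytest-style test: fails the run when correctness drops below 0.5."""
    outcome = check_correctness(RESULTS, threshold=0.5)
    assert outcome["passed"], f"correctness {outcome['metric']:.3f} is below 0.5"


def can_deploy(results: Sequence[bool], threshold: float) -> bool:
    """Deployment gate: only deploy when the correctness check passes."""
    return check_correctness(results, threshold)["passed"]
```

With these assumed numbers the check passes at a threshold of 0.5 and fails at 0.7, which mirrors the behavior described for thresholds of 70 or 80 in the video.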
