Validating Machine Learning Model and Avoiding Common Challenges | Community Webinar
Key Takeaways
This video teaches how to validate machine learning models and avoid common challenges
Full Transcript
So, this is a topic I really love talking about, because it's very close to my heart as a data scientist. I've seen many different types of problems in models and data, knowing that I could have avoided them much earlier, and after facing these types of things in my career, I'm happy to have the opportunity to talk about it, think about it, and also work to develop that field.
So maybe I'll introduce myself shortly. My original background is in the IDF. There's a program called Talpiot for technological leadership; that's where I started my first degree. I worked in cyber research initially and then transitioned to data science, and worked in the Prime Minister's Office. As you've mentioned, I'm the co-founder and CTO of Deepchecks. After working on various data science problems and seeing the challenges, both in their adoption and in the different things we faced during research, and how we could have avoided them or just done it more efficiently and methodically, at Deepchecks we work on what we call continuous validation for machine learning systems, meaning models and data. I'll touch on that a bit more later. Apart from my professional life, I train not only models: I also love training in more conventional ways, and a bit in less conventional ways, and if you wander around Tel Aviv you may see me on my electric or non-electric unicycle.
So, we're going to talk a bit about the motivation for this field: what am I talking about, what am I trying to avoid, and why is this so common, problematic, or important? Then we'll drill down a bit into what it means to validate ML, or to test it. I will use these terms more or less interchangeably. ML validation is... I see a raised hand, by the way.
So, ML validation is the concept, what we want to do for the model and data, while testing is the term we usually use for running the various checks. We're going to see various typical problems and tips on how to avoid them, and in the last part I'll introduce the Deepchecks package, an open-source framework that you can use for some of these tests. Feel free to run the code alongside me, or later if you prefer. Those of you who want to can already get an environment up and running: just a Python environment, preferably a clean one, and install deepchecks, with -U in order to upgrade if you already have it installed. That will make it easier to run later on. We'll share the link a bit later. It's just a Jupyter notebook, so as long as you have deepchecks on your system, you can run the cells with me.
Cool. So, a bit of motivation for what we're talking about. These are just a few public examples of what can happen with machine learning without sufficient validation. Of course, these are the things we've heard about many times.
There are much smaller things that we face, or have already solved, at some point in our pipeline, either... yeah, sorry, the raised hands. Okay, I'm looking at the Q&A, so if there's something, feel free to write there. So, these are a few of the public ones. Fatima, can you please share the link? We have a blog post about some of these and some more public ML failures, and you can read more about each of them. Anyway, in general it's something that we face and want to solve, or really, not reach in the first place.
So, a bit of motivation: okay, we want to solve problems, we don't want bugs in production, we don't want to discover them super late in our research process, we want to find them early. So why not just use software testing methodologies? Software development has been around for a long time, and there are a lot of testing best practices. The thing is that in some aspects it's much more challenging to test machine learning. One thing is that it changes in a very different manner. In software we can track our changes and see whether we've tested the different things we changed, while in machine learning it's not that clear what we are tracking and what is actually changing. Another thing is that it's harder to understand whether we've checked everything. In software, too, code coverage won't really solve everything, but it does give us some understanding of which use cases we've covered or not. In machine learning, what are we trying to cover? Is it the input distribution space, or is it, you know, the activations of the neural network?
So it's really hard to understand whether we've tested various flows or various edge cases. And in my opinion, one of the most troubling or challenging things is that it's super common to have silent failures. What do I mean? A model will still give a prediction even when it's given data that is just not relevant, whether it's nulls or just something that changed. The number will still be there. We won't know; in many cases it just won't break.
So incorporating the practice of testing is, first of all, important because of the various challenges of the system, and it also has the potential to improve our model. Many times we may have some type of bug; for example, we have some leakage in the way we split the data. Maybe the model works fine and everything looks okay, but if we actually drill down, we understand that if we had planned more and split the data better, then the model performance would be much better. So it's a potential to improve not only our practice as data scientists but also the actual results. And I think being able to back our models, be more professional, and be more sure of the results and the products that we're delivering is of course essential.
I think you'll agree with me that it's important. One thing that we saw is that it's not very common. What do I mean? Many places have various ways of testing manually or inspecting. Probably all of you, when you build a model, check its performance metrics. But how deeply do you do it? Do you check it in different segments? Do you know that everything in your pipeline works as you think it does? When the model goes out into production, in what manner do you monitor it?
So continuously validating and testing is quite a wide topic, it's very hard to cover it thoroughly, and it's very rare to see companies that have really adopted the whole workflow. That is one of the reasons we believe it's important both to help incorporate the processes and to help build tools to enable that.
So when we come to test machine learning models, what do we have? Essentially it's quite simple, right? We have the data, we have the model that we've built, we have the process of taking the data, training on it, and building the model, and of course some environmental aspects: various constraints, like business constraints, and the dynamics of the environment, for example a specific trend due to COVID, or different changes in the system. We'll go over a few examples soon. And in reality, it looks a bit more like this. So this is us understanding that we have to really plan and think: what do we want to test, and how can we do this in as comprehensive and well-built a manner as possible?
So we'll view some of the challenges in order to understand what we are trying to monitor, look for, or test for. When we talk about the different aspects, we can split it into three phases. First of all, we have the data, and we wonder: is it okay? Do we have any unknown nulls? Do we have data from different data sources that represents the same thing but is built completely differently, and we're maybe not aware of it? Maybe we have older samples along with newer samples, and the newer samples have more updated labels that may conflict with the older samples, and so forth.
So: does the data represent what we think it should, is it built the way we think it should be, and is it correct? Then we have data distributions, which is everything that has to do with how the data behaves internally. Are there any extreme outliers? When we compare different data batches, does it still behave the same? And we have model behavior, which is how the model behaves. These are just a few examples, in each of these fields, of the various problems that we may have. So I gave examples of dirty data and label integrity, and of drifts, and in model behavior, things like overfitting or fairness and bias. We'll see a few practical examples.
I'll say in general that I'm going to demonstrate mainly on tabular data. The concepts, the types of things that we're trying to catch and that are common, are very similar. Many times the algorithms for how to actually implement and check them are different, because of course the data is different and we process it differently. If you want, feel free to ask later, and I can also give a computer vision example. But again, the concepts are the same, and we'll jump into a tabular example.
So I have here a sample for a model that is trying to predict whether or not to approve a loan. We see here things like income, sex, loan amount, and various additional metadata, and the model is a binary classification: between zero and one, whether to approve the loan or not. Now let's say another sample came down the pipe, and we see here that something changed. In this case, you see that the only thing that changed is the way that United States is spelled, which may be due to another data source, or just an ambiguous spelling, because maybe it's a field that the user typed in as input.
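A spelling variation like the one just described can be surfaced by normalizing each raw value and grouping the values that collapse to the same form. The following is a minimal sketch using only the Python standard library; the function names and the sample country values are hypothetical, and this is not necessarily how Deepchecks implements its own check.

```python
import re
from collections import defaultdict


def normalize(value: str) -> str:
    """Lowercase the value and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", value.lower())


def find_string_mismatches(values):
    """Group raw values that collapse to the same normalized form."""
    groups = defaultdict(set)
    for value in values:
        groups[normalize(value)].add(value)
    # Keep only normalized forms that appear under more than one raw spelling.
    return {base: sorted(variants) for base, variants in groups.items() if len(variants) > 1}


# Hypothetical values for a free-text "country" field.
countries = ["United States", "united-states", "United_States", "Canada", "Canada"]
mismatches = find_string_mismatches(countries)
print(mismatches)
```

Note that this simple normalization only catches differences in case, spacing, and punctuation. An abbreviation such as "US" would need a domain-specific mapping.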
We may have caught it earlier in the pipeline and made it all uniform. But if we haven't, then it's likely that in the processing the model will treat this as a different category. Obviously, we would expect the model to give the exact same predictions for these two. In this case we call this a string mismatch. It's just a small example of small changes that can silently change the way your model behaves, and probably make it perform worse than it would have without these kinds of problems.
Another example, on a different note: here we see that all of the details are the same except for the sex, and here we also got, in this case, a higher score. This basically comes down to us thinking about our policy. Is this something we're either required to or want to check for, monitor, verify? And obviously, no one will tell us, "Okay, there are some parameters that you don't want to affect your model." This, by the way, is a very direct one, but many times we won't actually include this as a feature, yet we may have other proxy features that are affecting our model's predictions and making various demographic differences in them. Many times it's just based on a sampling bias in the data we've trained on. These are things, again, that we want to define and make sure we test, so that they behave as we'd want them to.
To connect this a bit, I want to give a real use case. I'll say that it's real in the sense that it represents the full end-to-end story. I will describe all of its phases, and not all of these faults happened on the same project at the same time, but it's just to give you a more practical feel. So I took many cases that we faced or found in different settings and joined them into the story of this one project.
So, we're going to look at a model that is trying to classify, for a specific access to a website, whether it's a bot or a human being. This is interesting and relevant because we want to avoid scraping attempts, credit card fraud, and things like that, and there's lots of bot traffic trying to do these types of things. Classically, what will happen is that if it's identified as a bot, or potentially a bot, it will be referred, for example, to a CAPTCHA and then onward. And we wouldn't want to just refer all the traffic there and harm conversion rates and things like that. One second. Cool.
So I'll describe the pipeline of this model a bit. It's working on data like fields from the packet of the access, maybe things from the firewall or router. Let's say there's a company in the field of cybersecurity that is trying to identify this, and it has various teams and tools. In this case it has various tools sitting on the web: the web browser, traffic, firewall, et cetera. Not a firewall, I mean a router. These all extract data fields that analysts develop and choose, and they're later passed on to the model. The machine learning team did a train-test split, preprocessed the data, and everything like that, and afterwards it was deployed to production.
So here we have the research part, and then once the model is already up and running, it's automatically retrained every time on eight days of data. That's nice because, well, all the samples are weighted equally, or not really, it's oversampled for the bot activity in this case, but the same day of the week is given twice, because the same days of the week sometimes have more similar behavior. And it's trained on a very wide dataset, and it runs both on websites that it was directly trained on and on some new websites that are also clients of that company. Sounds good, I hope. And still, what can go wrong, right? It sounds like a great pipeline.
So let's look at the different phases. For the data that comes in: one of the teams did an API update, and they changed the result of one of the fields from being between one and three, where one was low and three was high, to one to five. You can imagine what happens with the model, which on one hand continues to work but on the other hand treats three as the highest, and so forth. So we still have the same field, but it's actually representing different values.
For the split: when the data science team did it, they didn't notice that, even though they had an identifier and knew when the traffic came from the same device or IP, some of the website accesses from a device were in the training data and the others were in the test data. That obviously skews the results, because the model may learn various features that are more relevant to the device, which is usually consistently either a human being or a bot, and so forth.
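One way to avoid the device-level leakage described above is to split by group, so that all accesses from one device land entirely in train or entirely in test. Below is a minimal standard-library sketch; the `device_id` field, the `group_split` helper, and the toy rows are assumptions for illustration, and a library utility such as scikit-learn's `GroupShuffleSplit` serves the same purpose.

```python
import random
from collections import defaultdict


def group_split(rows, group_key, test_fraction=0.25, seed=0):
    """Split rows so that every group is entirely in train or entirely in test."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)

    group_ids = sorted(groups)                 # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(group_ids)
    n_test = max(1, round(len(group_ids) * test_fraction))
    test_ids = set(group_ids[:n_test])

    train = [row for gid in group_ids if gid not in test_ids for row in groups[gid]]
    test = [row for gid in group_ids if gid in test_ids for row in groups[gid]]
    return train, test


# Hypothetical traffic log: 8 devices with 5 requests each.
rows = [{"device_id": f"dev{i % 8}", "request": i} for i in range(40)]
train, test = group_split(rows, "device_id")

# No device should appear on both sides of the split.
overlap = {r["device_id"] for r in train} & {r["device_id"] for r in test}
print(len(train), len(test), overlap)
```

The same overlap computation can be run as a standalone check on an existing split to detect this kind of leakage after the fact.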
Then, every time we train it on the past data. But we weren't notified, as a data science team, that the marketing team of the company whose website we're trying to help protect has just launched a campaign, and they're now targeting Android users and promoting a specific product on a specific page. Obviously this very much changes the performance on the day the campaign was launched and the days after. Let's say on the second or third day, most of the training data is still from well before the campaign, so we're still likely interpreting these new activities as something very different, and maybe referring more of them than necessary to, for example, a CAPTCHA.
And we talked about various websites: some that we know better because we've had them in our training set, and some that we know less. Well, we don't necessarily know on which of these it's going to behave similarly, so that the performance will be as we expect. Maybe we have some websites from other domains where the typical behavior is super different from what we know, and then maybe we have some websites with very, very bad performance. But if we're just looking at the overall numbers, we won't necessarily notice it.
So looking at all of these together (and again, these are various examples, but all of them actually happened in different projects), we can understand that when we trained our model, retrained it, and happily deployed it to production, we were super happy with the results, and so was the management. But in real life things were very, very different, and we received quite a few not-so-happy phone calls.
So let's start testing. What do I mean when I say testing, how should we start, and what should we think about?
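The point about weak websites hiding behind a good overall number can be illustrated by computing the same metric per segment. This is a small sketch with made-up records; the site names, field names, and counts are hypothetical.

```python
from collections import defaultdict


def accuracy_by_segment(records, segment_key):
    """Return (overall accuracy, accuracy per segment value)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        segment = record[segment_key]
        totals[segment] += 1
        hits[segment] += int(record["label"] == record["prediction"])
    overall = sum(hits.values()) / sum(totals.values())
    per_segment = {segment: hits[segment] / totals[segment] for segment in totals}
    return overall, per_segment


# Hypothetical predictions: one large, well-known site and one small, unfamiliar site.
records = (
    [{"site": "shop-a", "label": 1, "prediction": 1}] * 90
    + [{"site": "news-b", "label": 1, "prediction": 0}] * 6
    + [{"site": "news-b", "label": 0, "prediction": 0}] * 4
)
overall, per_site = accuracy_by_segment(records, "site")
print(f"overall={overall:.2f}", per_site)
```

In this toy data the overall accuracy looks healthy while the smaller site performs far worse, which is the situation the overall number alone would not reveal.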
This is an introduction, in general, to how to approach it, and then we're going to dig in a bit deeper. So how should we start testing? We have various things: the inputs, meaning what we actually want to explore, for example the data or the model; the checks, meaning what types of problems we're checking for; when we do it, so in research, in different phases of the research, in production, in between, or all of the above; and the methodology of how we incorporate this in the process. We're going to talk mainly about the first three and explore each of them.
For the methodology part, first of all, I encourage you to ask any specific questions or share ideas if you want. It really depends on your phase and what you're working on now. Is it an academic research use case? Are you working in a big company where you already have, for example, CI/CD workflows, and if so, are your data science models incorporated there or not? Obviously, for each of the phases, whether it's in research, during deployment (for the decision whether or not to deploy a model), or when the model is already deployed, there are different tools and best practices to incorporate. And I think that once you have the awareness, the tools, and the thoughts about what can go wrong, it's more a matter of deciding to actually implement it.
So, about the input. We viewed these different types of problems before: integrity, distribution, and model behavior. If we look at where these problems occur, or what we should actually inspect, we see three things. For data integrity and data distributions, it's everything that has to do with the input data, whether it's the data that we've already trained on or new data that's coming in production. Batches is a name for various different datasets.
So for example, if I split my data into train and test, then these are two batches. Or if I train my model and then have new inference data, I can also treat these two as batches and compare between them: do I have drifts or leakages? And then there's everything that has to do with the model. Obviously, the model is also coupled with the data, because in order to understand, for example, its performance, we want to run it either on this data or on other data. So in order to evaluate, we also use the data. But that's the concept: really being able to inspect the model's predictions.
So what do we want to check? Here is a very high-level list of the types of things that can go wrong: the types of the data, for example, correlation between different features, distribution differences, model robustness, and things like that. About this part I'll say two things. There's quite a long generic list that you can build and extend as you go on; that is one of the tips. And I would recommend that you also check out Deepchecks (I'll talk a bit about it later), where you can get inspiration for many of the different types of things that you can test for. I would say that what you really want is a long list that is partly domain specific, so there may be some things that are relevant only for you and not in general, but there's also a very wide range of general types of problems, as you saw before. So the idea is first to know what you want to check, and then make sure you do it.
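Comparing two batches, as described at the start of this part, can be as simple as measuring how far apart the category shares of one feature are. The sketch below uses total variation distance as one straightforward measure; the feature values and the threshold are hypothetical, and other drift measures are equally valid choices.

```python
from collections import Counter


def category_shares(values):
    """Map each category to its share of the batch."""
    counts = Counter(values)
    return {category: count / len(values) for category, count in counts.items()}


def total_variation_distance(batch_a, batch_b):
    """Half the sum of absolute share differences: 0 means identical, 1 means disjoint."""
    p, q = category_shares(batch_a), category_shares(batch_b)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in set(p) | set(q))


# Hypothetical "operating system" feature before and after a campaign aimed at Android users.
train_os = ["android"] * 30 + ["ios"] * 40 + ["desktop"] * 30
new_os = ["android"] * 60 + ["ios"] * 20 + ["desktop"] * 20

drift = total_variation_distance(train_os, new_os)
DRIFT_THRESHOLD = 0.2  # hypothetical threshold; tune per feature and use case
print(f"drift={drift:.2f}", "FAIL" if drift > DRIFT_THRESHOLD else "PASS")
```

Running the same comparison on every new batch is what turns a one-off exploration step into a repeatable test.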
I will say many of us do this quite naturally. When we do data exploration, we check different things about our distributions and about the data's integrity. But first of all, we don't necessarily cover everything. If we had reusable code that you can just plug and play and run, then it would be much more efficient, and also much more thorough, to actually check everything. And also, we usually do it manually and only a small number of times, like once or twice. We probably won't inspect it manually every day or on every new batch of data. So this is the main tip. Again, this is a super high-level list; from here I guess you have a list of at least 50 different checks that you can prioritize accordingly.
Moving on to when we should test. The quick answer is as soon as possible, as early as possible in the pipeline. If we talk about the pipeline, we have everything that is done in research, the first time, before we've even deployed our model; then when deploying it, when deciding whether we want to deploy it or not; and then when it's in production and serving. In each of these phases there are relevant tests that can already be done. When monitoring, it's continuously checking it. But for example, when we start our research, as soon as we've got the first data and haven't yet split it or built any model, we really want to extensively check its integrity. And once we start preprocessing and working with it, we may already have batches that we can compare and verify.
And when we have a model, whether it's our most initial model or also later iterations, that's when we can already inspect, compare, and so forth. And of course we continue: in both CI/CD and in monitoring we would probably want to inspect the same kinds of things that we've checked before, just together, and also verify whether we should continue working with the model or send it back to the research bench.
So how to start? I tried to Google this, and there are no real well-built methodologies or results, because it's really a less established field. This is one of the things we have been thinking about a lot at Deepchecks: how should we approach it, what should we check (as I said, the checklist), and what do we run it on. We're going to drill down a bit into the structure now and then run an example. I'll say in general that we built Deepchecks in a very customizable manner. The idea, for me, is that first of all it's open source, so you're free to try it out and use it, and also to take ideas. Maybe you want to implement different checks or add custom checks, or use the framework and methodology in whichever way matches your form of work and organization.
One thing I will say now is that in a few minutes, maybe five minutes, we'll reach the live code example, in which I'll go over a notebook with a dataset, and we'll see how it runs and explore a few of the checks and things like that. So for those of you who want it, Fatima, if you can send the Bitly link. It's just a link to a Google Drive where I put both the notebook that I'm going to go over and the dataset. The dataset is just downloaded from Kaggle, so you can also download it from there.
Anyway, that will be in a few minutes. So if you want, you can download it and open it. I recommend a new virtual environment for Python; install deepchecks, and then you'll be ready. You can also just follow me and run it later, as you wish.
Cool. So as I said, Deepchecks is... well, what you see here is a visualization of some of the results. I'll talk a bit about the structure, how it looks, and how it works. As for the presentation, maybe I'll be able to upload a PDF of some of the slides, but I'm not sure I'll be able to share the PDF of the whole presentation. Anyway, feel free to take screenshots of specific things if you want, and you'll also be able to see it in the video, so it should be accessible anyway.
So, okay. What we saw here is a few of the checks, but let's talk about how Deepchecks is built. The base unit is called a check. A check comes to verify a certain issue. So for example, it checks whether two similar samples have different labels, or whether there's a drift between two datasets. Every issue will be one check. A check has both a display value, the output, which is many times a graphic one, and a result value, which you can process in code, for example the actual numbers of the check. It also has a condition, so pass, fail, or warning. The condition is optional, so you can also just have an exploratory check without a condition.
In order to be able to run many checks efficiently together, we have the concept of a suite. We have a few built-in suites that you can just run as is, or you can take any combination of checks and build your own test suite. The idea is that in one line of code you run many checks and also receive the output report.
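The check, condition, and suite structure described above can be pictured with a toy model. The classes below are hypothetical illustrations written for this transcript, not the Deepchecks API; they only show how a result value, an optional condition, and a suite that runs many checks fit together.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class CheckResult:
    name: str
    value: Any                     # result value that can be processed in code
    passed: Optional[bool] = None  # None: exploratory check, no condition attached


@dataclass
class Check:
    name: str
    compute: Callable[[List[dict]], Any]
    condition: Optional[Callable[[Any], bool]] = None

    def run(self, rows: List[dict]) -> CheckResult:
        value = self.compute(rows)
        passed = self.condition(value) if self.condition is not None else None
        return CheckResult(self.name, value, passed)


@dataclass
class Suite:
    name: str
    checks: List[Check] = field(default_factory=list)

    def run(self, rows: List[dict]) -> List[CheckResult]:
        return [check.run(rows) for check in self.checks]


def income_null_ratio(rows):
    """Share of rows whose (hypothetical) income field is missing."""
    return sum(row["income"] is None for row in rows) / len(rows)


suite = Suite("toy integrity suite", [
    Check("income null ratio", income_null_ratio, condition=lambda ratio: ratio <= 0.1),
    Check("row count", len),  # exploratory: no condition
])

rows = [{"income": 10}, {"income": None}, {"income": 30}, {"income": 40}]
results = suite.run(rows)  # one call runs every check in the suite
for result in results:
    status = {True: "PASS", False: "FAIL", None: "n/a"}[result.passed]
    print(f"{result.name}: value={result.value} condition={status}")
```

Keeping the raw value separate from the pass/fail outcome is what allows the same check to be used both for exploration and as an automated gate.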
So you can view it, save it as HTML, view it inline if you're using, for example, Jupyter, or export it in various convenient manners. You can check out our docs for more information in general, but I'm explaining the overview here. And the way these look, GUI-wise, is like this. As I said, every check has an output. In this case we're seeing a few checks from the computer vision package. So this is something about inspecting outliers for the labels; that's the check. The condition is just whether it passed or not, for each of the checks. And this is an example of code for how to build your own custom suite: just define a few checks and basically put them together.
And remember we saw before about testing as soon as possible, and data, batches, and your model. Oh yeah, that's what we see here. In order to help you in each of these phases, or at least give a quick start for each of them, we have a built-in suite encompassing all of the relevant checks for that phase. So for example, the data integrity suite runs only on a single dataset. The train-test validation suite runs on two datasets, and the model validation suite runs on two datasets and on the model. For the model, you need a model that supports the scikit-learn API for prediction. That's for the tabular case. Again, of course, for computer vision we also have a solution, which you can check out.
So that's a bit about the structure. We're going to go now to the live code example, so if there are any specific questions or anything you want to go over before, ask now, and if not, I'll dive in. You got this link before, also in the chat, so if you want, feel free to open it, and I'll just launch my browser. Cool. I hope this is big enough. Maybe I'll zoom in a bit more. Yeah.
So, those of you who didn't install it can uncomment this install line. And I'll explain the use case a bit. Remember we have data integrity, train-test validation, and model evaluation. I wanted to work on something completely clean, so in this case I'm going to demonstrate the data integrity suite and a few of its checks on a dataset from Kaggle. It's a CDC dataset of chronic disease indicators. It's a very interesting dataset: you can use it to explore various risk factors and different topics and areas, and build lots of models based on it. Here, on purpose, I didn't build a model for any specific purpose, but rather said, okay, I have here an interesting dataset that I can explore and build models from for various interesting indications, so let's see what we find out of the box, without doing any exploration.
Probably, if you do this on almost any dataset, you'll find various problems. As I mentioned before, in your process as a data scientist, when doing exploration, you may find many or maybe even all of these problems yourself. So first of all, it can help you in the initial exploration, and it's also the idea of being sure to cover lots of checks. That's why I thought it could be cool to just run it on an unprocessed dataset and see how it works.
Great. So I downloaded it locally; it's also in the Drive folder. I think it's about 400K rows. I'm working here on just 10K rows, but Deepchecks runs quickly, so you can also run it on more, or on all of the data. Let's just have a look. In general, the dataset has 34 features, and it has different topics. For example, now we don't see it here... oh, yeah: things like alcohol and heart disease, in different locations around the US, from the CDC, with responses to the different topics.
So that's the structure in general, and feel free to check out Kaggle for some more exploratory notebooks and to learn more about this dataset. Let's just see. Okay, so I already said it has 34 features, and I loaded 10,000 rows. These are the different features: there are topics, and it's kind of question and response, so about alcohol, about cardiac stuff, and so forth. Just to have a quick look, I'll zoom out a bit. I hope you can see. Okay, it's just a bit too big for me. Cool. So this is just a super quick overview of the different types of features.
I use this mainly to build this list. The only thing I need to give Deepchecks in this case in order to analyze my dataset is a bit of metadata. Actually, for categorical features I don't even need to supply it; if I don't, it will just infer them automatically. But if I already know in advance what the categorical features are, then it's more accurate to state them explicitly, so I'm just stating them. I also saw before that there are quite a lot of nulls in the dataset; yeah, you can see it here. So I'm just saying that if there are any columns in which all of the values are null, let's drop them. This will be relevant later. I tried doing that; well, all of my samples are still here, so let's get going.
So what I'm doing now is importing; this is the phase where we run Deepchecks. Okay, so we loaded the data and defined the categorical features. If we had a label column, for example, I would have defined its name as well. If I had a datetime column, I would also give its name, and the same for the index, and so forth. Actually, those are the main ones: categorical features, label, time, and index. You can also see all of these here: if I check here, you can see the various parameters I can give the Dataset.
This is mainly relevant because once it knows this metadata, Deepchecks can use it to analyze your dataset better, and some checks run only if you have it. For example, a check like time leakage will run only if I actually have a time column and I've defined it explicitly. We don't try to guess what your label is, for example, if you don't supply it. In this case I don't supply a label, and therefore some of the checks won't run. The Dataset is just a wrapper object containing the data and the metadata, so that's all I needed to do in order to run. Now I'm going to run the data integrity suite, as I said. The checks are running in the background, and in a second we'll see the results. As I said, this is a default suite, so it already comes with a list of checks, and almost all of the checks have default conditions on them. That means the different results are already analyzed, with default threshold values deciding whether each one passed or not. So let's see what didn't pass. Okay. We have "Single value in column", so we have some columns that contain a single value. Then there's the condition of the feature-feature correlation check, and special characters. Let's dive into the first one for now; some of them are also related. It says there's a list of columns which have only one unique value. Obviously, a single unique value basically means the column adds no information for me, right? Maybe if I had taken the whole dataset it wouldn't have been like that. But let's say I'm using this actual dataset to train the model: this is not a recommended state. It's obviously irrelevant data. Let's try to understand what the problem was.
Again, there are a few more checks that I'll show a bit later, but for now let's explore this one. And, well, there are many more checks that ran as well. So, going on to the next check. I did notice before that I have many nulls, right? So let's see first of all... oh wait, I'll go back here just to show you where I took it from. Out of the columns, I just took the first column and said, let's check that one; there are additional columns with a single value. And by the way, the single value I saw looks like something empty. I'm not really sure, but we'll see in a second. Continuing on, let's see what the non-null values in this column are. And really, if I check where response is not null, I see that it looks kind of empty. So it means all the rest are nulls, and what I understand now is that the response field has only nulls and seven empty strings.

Okay. Remember, I had various columns with that problem, not only the response one. So let's say I want to rerun just that specific check; in this case it's called IsSingleValue. I just import the check from checks. In the docs you can see all the checks, examples of how to run them, their names, and so forth, and you can also explore using inline autocompletion and see what completions you get. So, IsSingleValue. To run a check, just like a suite, I initialize it, and I can give it various parameters, as you'll see soon. For example, a random state, if I'm using sampling. This is the default; we can also change it to None in order to use all of the samples. And should I ignore nulls or not? The default is to ignore them. By the way, that explains the result.
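As a rough illustration of what a single-value check computes, and of why ignoring nulls produces the result described above, here is a minimal pure-Python sketch. It is not Deepchecks' implementation: the function names `count_unique_values` and `single_value_columns`, the use of `None` as a stand-in for null, and the sample rows are all assumptions made for illustration.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Optional[Any]]


def count_unique_values(rows: List[Row], ignore_nulls: bool = True) -> Dict[str, int]:
    """Count distinct values per column; None stands in for a null."""
    columns = list(rows[0].keys()) if rows else []
    counts: Dict[str, int] = {}
    for column in columns:
        values = {row.get(column) for row in rows}
        if ignore_nulls:
            values.discard(None)
        counts[column] = len(values)
    return counts


def single_value_columns(rows: List[Row], ignore_nulls: bool = True) -> List[str]:
    """Return the columns that hold exactly one distinct value."""
    counts = count_unique_values(rows, ignore_nulls)
    return [name for name, n_unique in counts.items() if n_unique == 1]


# Hypothetical sample: "Response" holds only nulls and empty strings.
sample_rows: List[Row] = [
    {"YearStart": 2015, "Topic": "Alcohol", "Response": None},
    {"YearStart": 2016, "Topic": "Diabetes", "Response": ""},
    {"YearStart": 2017, "Topic": "Alcohol", "Response": None},
]

print(single_value_columns(sample_rows))                      # nulls ignored
print(single_value_columns(sample_rows, ignore_nulls=False))  # nulls counted
```

With nulls ignored, a column containing only nulls and empty strings is left with the empty string as its single distinct value, so it is flagged. When nulls are counted as a value, the same column has two distinct values and is not flagged.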
In this case we again see the result from before, and we can also get the value itself. Here I showed the result, and now I'm printing the value, and I can see the number of unique values for each of the fields. For the categorical fields... actually, sorry, it checks the numeric ones as well. For example, year start is a numeric field. It just shows the number for each of them, and I can see that a few have a single one. I'm just extracting the names of all these columns. Okay, so these are all the columns with a single value; I just wrote a list comprehension to get their names. And I can see that all of them have exactly seven of these empty values, and as you remember, this is the shape of my data frame. So for now, what I'm going to do is accept that this empty string says nothing; it's basically a null value. I'm going to fix it by replacing the empty strings with nulls, and let's see what happens. Remember I ran this line before, where I said, let's drop any irrelevant columns that are all null. So I'm running it again, and as you can see, this time all of those exact features really were dropped, because they had only empty strings and nulls. Now I'm continuing with a much smaller dataset.

I'm going to show one more check that failed (if you remember, there were a few), then how to explore a check a bit, and then I'll be happy to show you some additional resources and answer any questions. So, I dropped the columns on the original data frame, and remember we have the Dataset, so I want to create a new Dataset. By the way, I can also just access the data inside.
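The cleanup step described here (treat empty strings as nulls, then drop columns that are entirely null) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the notebook's code: the helper names `blank_to_null` and `drop_all_null_columns` are hypothetical, `None` stands in for null, and treating whitespace-only strings as empty is an extra assumption beyond what was said in the talk.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Optional[Any]]


def blank_to_null(rows: List[Row]) -> List[Row]:
    """Return copies of the rows with empty or whitespace-only strings set to None."""
    cleaned: List[Row] = []
    for row in rows:
        cleaned.append({
            key: None if isinstance(value, str) and value.strip() == "" else value
            for key, value in row.items()
        })
    return cleaned


def drop_all_null_columns(rows: List[Row]) -> List[Row]:
    """Drop every column whose value is None in all rows."""
    if not rows:
        return []
    keep = [c for c in rows[0] if any(row.get(c) is not None for row in rows)]
    return [{c: row.get(c) for c in keep} for row in rows]


# Hypothetical sample: "Response" mixes nulls and empty strings.
raw_rows: List[Row] = [
    {"Topic": "Alcohol", "Response": ""},
    {"Topic": "Diabetes", "Response": None},
]

# Dropping first changes nothing, because "" is not a null.
print(list(drop_all_null_columns(raw_rows)[0].keys()))
# After converting blanks to nulls, the column is all null and gets dropped.
print(list(drop_all_null_columns(blank_to_null(raw_rows))[0].keys()))
```

This mirrors the order of events in the demo: the first attempt to drop all-null columns removed nothing, and only after the empty strings were converted to nulls did the same step remove the affected columns. With pandas, the same two steps are commonly written with `DataFrame.replace` followed by `DataFrame.dropna(axis=1, how="all")`.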
Okay, so this was the original Dataset, and this is its data. There are also various additional completions I can use, like the columns info, for example, to see the different columns and so forth. The Dataset itself holds the various metadata we described, and it's all accessible. Anyway, in this case I want to create a new one, because I dropped some of the features, and I can use the built-in copy. Instead of recreating or defining a new Dataset like I did before, I'm just copying the metadata, because it's the same. So that's one way I can do it. Okay, this is the new columns info, and I'm running the data integrity suite again. Let's have a quick look at the results. If you remember, before we had the single value check failing. We don't see it failing anymore; we'll probably see it here in the passed ones. "Single value in column". Cool. And what didn't pass? We have special characters, so let's see what that means. We have some columns where many of the samples, or maybe not that many, contain only special characters. We see here the columns and the amount of samples in them that contain only special characters. The default condition is that if more than 0.1% of the samples have only special characters, then maybe something is weird. By the way, that's why this one is a warning and not an error. For every condition, we can define whether a failure counts as an error or a warning. So this was one thing, and we can look into it if it's interesting. The other thing is that we have very high correlation between some of the features. In this case we see lots of features, or actually feature pairs, with very high correlation, and here we see the correlation matrix. So let's look at this one a bit deeper.
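To make the special-characters condition concrete, here is a small pure-Python sketch of the idea: compute, per column, the share of samples made up only of special characters, and flag columns above the 0.1% threshold mentioned above. It is not Deepchecks' implementation. The function names are hypothetical, and the definition of "special" used here (a non-empty string with no letters or digits) is an assumption for illustration.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Optional[Any]]


def is_only_special(value: Any) -> bool:
    """True for a non-empty string that contains no letters or digits (assumed definition)."""
    return isinstance(value, str) and len(value) > 0 and not any(ch.isalnum() for ch in value)


def special_char_ratios(rows: List[Row]) -> Dict[str, float]:
    """Share of rows, per column, whose value consists only of special characters."""
    if not rows:
        return {}
    return {
        column: sum(is_only_special(row.get(column)) for row in rows) / len(rows)
        for column in rows[0]
    }


def columns_over_threshold(rows: List[Row], max_ratio: float = 0.001) -> List[str]:
    """Columns where more than max_ratio of the samples are special-character-only."""
    return [c for c, ratio in special_char_ratios(rows).items() if ratio > max_ratio]


# Hypothetical sample: "Footnote" has two special-character-only values out of four.
footnote_rows: List[Row] = [
    {"Topic": "Alcohol", "Footnote": "--"},
    {"Topic": "Diabetes", "Footnote": "see note 1"},
    {"Topic": "Asthma", "Footnote": None},
    {"Topic": "Cancer", "Footnote": "***"},
]

print(special_char_ratios(footnote_rows))
print(columns_over_threshold(footnote_rows))
```

A threshold this low means that even a handful of such values in a large dataset is surfaced, which fits its role as a warning that prompts a closer look rather than a hard failure.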
By the way, you can also just access the result from the suite itself, but I find it more convenient to run the check itself, because then you can also change the parameters. So here again I see the exact same result. By the way, I have many more features, and the default display is for a smaller number, because we want the display to be convenient. So what I'm going to do now is rerun this exact same check with different parameters. As you remember, each check has various parameters. n_samples is how many samples it runs on, and show_n_top_columns is how many are displayed. The default is 10, which is why we previously saw 10. Now I put 24, because that's all the columns I have. I also added a condition. Remember that in the suite some of the things passed or didn't pass; that's defined by the default conditions. If I use autocomplete, I can see which conditions I can add, and I added a condition that there are no more than zero pairs with over 0.7 correlation. Now if I run it, I'll be able to see much more, and see exactly which features are correlated. For example, year start and year end are totally correlated; they're probably all the same. Not necessarily, obviously, but in this case I think they are. We can also see some others, like data value footnote and footnote symbol, or the fact that every topic has both a value and an ID. So many of them can probably be dropped or combined.
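The condition described here ("no more than zero pairs with over 0.7 correlation") can be sketched in plain Python. This is a simplified illustration, not how Deepchecks computes feature-feature correlation: it covers numeric columns only, uses Pearson correlation, treats a constant column as having correlation 0, and all function names and sample values are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation; returns 0.0 when either column is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant column
    return cov / math.sqrt(var_x * var_y)


def correlated_pairs(
    columns: Dict[str, List[float]], threshold: float = 0.7
) -> List[Tuple[str, str, float]]:
    """All column pairs whose absolute correlation exceeds the threshold."""
    pairs = []
    for a, b in combinations(columns, 2):
        strength = abs(pearson(columns[a], columns[b]))
        if strength > threshold:
            pairs.append((a, b, strength))
    return pairs


def condition_passes(
    columns: Dict[str, List[float]], threshold: float = 0.7, max_pairs: int = 0
) -> bool:
    """Pass only if at most max_pairs pairs are above the threshold."""
    return len(correlated_pairs(columns, threshold)) <= max_pairs


# Hypothetical numeric columns: the two year columns are identical.
numeric_columns = {
    "YearStart": [2011.0, 2012.0, 2013.0, 2014.0, 2015.0],
    "YearEnd": [2011.0, 2012.0, 2013.0, 2014.0, 2015.0],
    "DataValue": [3.0, 9.5, 1.2, 7.7, 4.1],
}

for first, second, strength in correlated_pairs(numeric_columns):
    print(first, second, round(strength, 3))
print("condition passes:", condition_passes(numeric_columns))
```

Using the absolute value means strongly negative correlations are flagged as well, which matches the remark that the display shows absolute values regardless of sign.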
So it really depends, and this is more of a manual process. I'll usually just want to make sure that there wasn't a very severe change, or that I'm aware of the features that are highly correlated. Also, if I want, I can use the result value, which is just a data frame of all the actual values. In the display they're shown as absolute values, as we wrote, whether it's a positive or a negative correlation, and here you have the values themselves. You can process them and see what you might want to drop, like topic ID and topic location ID, question ID, and so forth.

So that was a brief overview. I'll show one more thing. If you go into the docs and want to try it out, a good place to start is the tabular quickstarts. Here you have examples for the different suites: data integrity, which is the one we did now, just on a different dataset; train-test validation, for finding drifts and leakages; and model evaluation. There are also things like how to export the results to HTML, or how to create a custom check, a custom suite, and so forth. And here you have examples for each of the checks, explaining them and showing how they look, and also an API reference, which can be very useful, for example if you want to change the specifics of some check. And of course, you can always check the sources on GitHub if you want to drill down further or help us fix a bug if you found one. One second, I'll stop... oh, I won't stop sharing my screen. I'll just move this.
So, we talked about ML validation and why it's important, and how it can help you deal with a challenging system and improve your model and your work as a data scientist or ML engineer. I showed quite a few of the recurring problem types, both by their title and by following them through a use case. What should we think about when we start testing? As I said, I didn't really delve into the methodology part, but I have a lot to say about it: what we want to check and also when we want to check it, obviously in various phases, and in the research phase we kind of delved deeper into that part. I do want to use this opportunity to say thank you to the data science community. I also want to encourage you to keep contributing your feature requests, and if you have any ideas, code contributions, or documentation contributions, these are things that vastly help us improve the package and come up with more ideas. It's greatly appreciated, and I'm sure it helped the very quick and wide adoption of the package. Also, if you like what we're doing, I'd love it if you could give us a star on GitHub; it's really important for open source initiatives. And yeah, I'd like to thank you, and hopefully see you develop more tested and safer models.

All right, thank you so much, Shir. I know we only have about 10 minutes left for questions. So, while people are thinking of those, remember that if you would like to raise your hand, you can. We're going to test out live mics and see how that goes. Otherwise, feel free to put your question in the Q&A tab. Shir, we do have a question about any recommendations you have for books or resources on methodologies. I'm not sure which methodologies this person is referring to, but for validation, let's say, do you have any recommended resources that you like?

My microphone is fine? Yeah.
In general, in the world of software testing there's a very wide variety of books and ideas. I didn't really go into it, but there are the concepts of unit testing and integration testing, how to approach them, when to run them, and so forth. That's more the high-level material. I have some recommendations that I'll be happy to send over later, but it's also very easy to find those kinds of materials, and they've been around for a long time. Specifically in the area of testing machine learning, I think there are lots of things you can dive into for each of the different challenges, but I don't know of a very comprehensive one to date. I do have one example. I don't have it here, so it would take me a minute or two to find, but there's a nice GitHub repo that connects concepts of software testing with how to implement them in code in machine learning, for example using pytest and frameworks like that. It's somewhere between practicality and theory. So I'll be happy to send you the link later; Nathan, feel free to pass it along.

And then there's someone named John who wants to compare Deepchecks versus, I think it's Deequ. But I don't want to turn this into a sales call, so, John, maybe you can reach out to them individually.

Actually, Nathan, they are all open source tools.

Oh, okay.

So yeah, it's a question we get quite a lot, and we see them fitting very nicely together in the pipeline. So they're not competing companies or anything like that. Deequ is an AWS tool, and Great Expectations is also a company; they have a great open source project for data validation. I'd say the main difference is that both Deequ and Great Expectations focus on testing data.
So, like unit tests for data. They talk about data integrity, but specifically data integrity in databases, and less so in the context of machine learning. They don't have concepts like what a label is, or the fact that these are samples that are then fit into a model, and therefore some of our checks are very different. I would say that generally those kinds of tools fit more into the data engineering pipeline, and once the data transitions to the machine learning phase, there's a small overlap: they come before us, and Deepchecks, or any other testing you would do later on, is the next phase.

Okay. And then this question is from Sky; I think Sky is on one of our live streams. So, how do you check drift?

Okay, great. Yeah, it's a great question, because drift is a very widely discussed topic, and even more so, by the way, in unstructured data, in both NLP and computer vision, where there's quite a wide discussion. It's a big field, so I can just drop a few words on it. First of all, I invite you to check it out in our docs. It depends on whether it's a categorical variable or a numerical one: we use different algorithms for each of them, and we also tune the threshold to the relevant level for when it counts as a higher drift. Oh, I'll also send another link, to our checks demo, where you can actually see it. So yeah, we just run statistical tests between the two datasets, depending on the data type, and that's it.

That's it. Perfect. I'm not seeing any other questions at the moment, but let me go through. So many people are saying they love the webinar; they love the talk.
And I'm not seeing any other questions. Okay. So, Shir, thank you so much for being here.
Original Description
Learn how to validate your machine learning model and how to avoid common challenges.
Building a good and stable machine learning model is hard. The wide variety of challenges in the process includes biases when building or splitting the datasets, data leakages, data quality, and integrity issues, drifts, model performance stability, and many more.
In this session we’ll explore these types of challenges, give real-life examples of such faults, and suggest a structure for building tests for these types of issues, to enable validating them efficiently. We’ll include a hands-on demonstration of running validation tests during the ML research phase (which you can follow along by running it locally). By the end of this session, you’ll have the knowledge about which issues to look out for to avoid critical problems, along with the tools for how to do so efficiently.
Table of Contents:
00:00 Introduction
03:15 ML Failures and Motivation
18:10 ML Validation/Testing
24:35 Deepcheck Packages
29:15 Live Code Example
46:32 QnA
--
Download Live Code: https://bit.ly/3Of8am0
Deepchecks Github: https://github.com/deepchecks/deepchecks
Deepchecks Docs: https://docs.deepchecks.com/
Deepchecks Checks Demo: https://checks-demo.deepchecks.com/
Deepcheck Slack Community: https://www.deepchecks.com/slack
Deepchecks Website: https://www.deepchecks.com/
--
For more captivating community talks featuring renowned speakers, check out this playlist: https://youtube.com/playlist?list=PL8eNk_zTBST-EBv2LDSW9Wx_V4Gy5OPFT
For further tutorials on the fundamentals of machine learning, check out this exclusive playlist: https://youtube.com/playlist?list=PL8eNk_zTBST-RTog7CPYvRfs1pYRWkPHG
--
At Data Science Dojo, we believe data science is for everyone. Our data science trainings have been attended by more than 10,000 employees from over 2,500 companies globally, including many leaders in tech like Microsoft, Google, and Facebook. For more information please visit: https://hubs.la/Q01Z-13k0