Full Machine Learning Project — Detecting Outliers in Sensor Data (Part 4)
Key Takeaways
This video demonstrates how to identify and handle outliers in sensor data using three different methods in Python: the IQR method, Chauvenet's criterion, and the Local Outlier Factor (LOF) from the scikit-learn library. It covers outlier detection, data preprocessing, and machine learning fundamentals using tools like scikit-learn, pandas, and matplotlib.
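As a rough illustration of the two distribution-based methods named above, here is a minimal sketch. It is not the code used in the video: the function names `mark_outliers_iqr` and `mark_outliers_chauvenet` and the example column name `acc_x` are assumptions for illustration, the 1.5 times IQR fences are the conventional box-plot rule, and the normal tail probability is computed with the standard library's `math.erfc` as a stand-in for SciPy. The `_outlier` column suffix and the default `C=2` follow what the transcript describes.

```python
import math

import pandas as pd


def mark_outliers_iqr(dataset: pd.DataFrame, col: str) -> pd.DataFrame:
    """Return a copy with a boolean `<col>_outlier` column (IQR rule)."""
    dataset = dataset.copy()
    q1 = dataset[col].quantile(0.25)
    q3 = dataset[col].quantile(0.75)
    iqr = q3 - q1
    # Conventional box-plot fences: 1.5 * IQR beyond the quartiles.
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    dataset[col + "_outlier"] = (dataset[col] < lower_bound) | (
        dataset[col] > upper_bound
    )
    return dataset


def mark_outliers_chauvenet(dataset: pd.DataFrame, col: str, C: float = 2) -> pd.DataFrame:
    """Return a copy with a boolean `<col>_outlier` column (Chauvenet's criterion).

    Assumes the column is roughly normally distributed.
    """
    dataset = dataset.copy()
    mean = dataset[col].mean()
    std = dataset[col].std()
    n = len(dataset)
    # Reject a value when its probability of being observed is below 1 / (C * n).
    criterion = 1.0 / (C * n)
    deviation = (dataset[col] - mean).abs() / std
    # Two-sided normal tail probability: P(|Z| >= d) = erfc(d / sqrt(2)).
    prob = deviation.apply(lambda d: math.erfc(d / math.sqrt(2)))
    dataset[col + "_outlier"] = prob < criterion
    return dataset
```

Both functions return a new data frame with one extra true/false column, so the same plotting code can be reused for either method.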
Full Transcript
Hey everyone, and welcome back to part 4 of this series, where we create a fitness tracker with Python. Today, part 4 is all about detecting outliers in sensor data. So the goal for this video is to check whether there are any outliers, extreme values, in our data that we want to remove, using various methods. All right, so by now you know the drill: make sure you have watched all the previous videos here in the playlist before continuing with this one. After you've done that, we can continue by downloading the Python file, and make sure to put it in your VS Code workspace directory, in the features folder that is within the source folder. So source, features, remove outliers dot py: this is the file we will be working in today.
All right, then a quick overview of what we will be covering in today's video. We will first briefly discuss what outliers, extreme values, are and what they look like. Then we'll go over box plots and the interquartile range to determine outliers. We're going to look at a function that can plot outliers in time. Then we're going to mark outliers using three different methods: the first one is the IQR method, then we're going to look at Chauvenet's criterion, and we're going to look at the local outlier factor. Then we're also going to check outliers by first grouping on the label, and eventually we will pick the method that we find suits our problem best, and then we will replace the outliers, or remove them, better to say, and export the new data frame. Let's get into it.
All right, let's start as usual in VS Code by importing all the libraries. It could be that you still have to install sklearn; this is a new library that we haven't used before. So if this gives you an error, use pip install scikit-learn. You can do that straight from the interactive session by putting an exclamation mark in front of it and then running this line, or you can do it via the terminal: open up a terminal and then do pip install scikit-learn. Next,
we're going to import the data as usual. So we're going to start by defining the data frame as pd dot read, and then we're going to point to our data directory in interim and select the 01 data processed file. Run this line and make sure that the data frame looks like this.
Now, before we continue, I want to briefly look at what outliers are and what they look like, so we're all on the same page. Coming back to our document over here, there is a resource, "What are outliers", and if we have a look at this page there is a brief description, and if we scroll down we get a more visual representation of what outliers could potentially look like. But basically, to summarize: an outlier is an extreme value in a data set that is much higher or lower than the majority of the values in the data set. So these are just weird, messy values that can introduce noise in our data set. Coming back to our example, where we were measuring the movement of a participant doing an exercise, this could for example be the case when someone is performing, let's say, a squat, and somewhere mid-movement the participant gets a twitch or something doesn't feel right, and he or she adjusts the position, so steps back or places the feet somewhere else. This could introduce a movement pattern that you typically would not see during a squat. Then the question is: do you want to include this kind of data within your data set? That is often a pretty hard question to answer. Determining whether an extreme value is really an outlier and should be removed from a data set always depends on the problem at hand, as with almost everything in data science and machine learning. But by keeping these extreme values within our data set, we are also going to train our models with this data. So data that you would typically not see during a squat, the adjustment of movement for example, is still labeled as being a squat. So then the model encounters this data and thinks, oh,
this is a squat. But then, for example, another similar movement pattern occurs during a bench press, where a participant adjusts his movement, and then the model might think, oh, this is a squat, because I've seen this before. So you can already start to imagine how this could affect our model. That's why it's really important to really understand the underlying data, so you can make better, more informed decisions about when to remove an outlier and when to leave it in the data set. That is what we will focus on today.
Okay, so now we know what outliers are, let's look at a few methods that we can use to first determine outliers and then visualize them as well. We're going to start off with box plots and the interquartile range, and for that I'm going to open up this document over here. Here we're looking at a box plot, which can be used to visualize outliers as well as show you some information about the distribution of a data set. I'm not going to dive into all the technical and statistical details about box plots and distributions; you can look that up if you want to learn more, and there are also some links over here. But just know that for the first method of determining outliers we're going to look at box plots and then use the interquartile range, the IQR, which has two cut-off points, a minimum and a maximum, and basically anything that lies outside of this range is considered to be an outlier.
So let's start off by creating some box plots to have a look at the data and potential outliers. The first thing we're going to do is change some style settings again for matplotlib, just like we did in the last episode, but now instead of the seaborn theme we're going to use the fivethirtyeight theme, which is also a very nice theme in my opinion. So start off with this, and then we can continue to the box plots. Now, to create a box plot we can use a method from the pandas library that we can
directly call on a data frame, and that is the box plot method. So we can select a column of the data frame, let's look at the X acceleration, and then say dot boxplot. When we run this we get a pretty wide image, and that is because we are using the figsize over here that is ideal for plotting time series data over time, but for a box plot it gives a not so nice image. But we can already see that we have a box plot over here, and that anything outside of the line over here and the line over here is considered to be an outlier. So for this column, looking at the total data set using the IQR method, we can already see that there are some outliers.
Now let's make this a bit prettier by also including the label. So now we have the data frame with acceleration X and the label, and then in the box plot we say by, and then we specify the label column. So we're going to split up the data and basically create a grouped box plot, and then also change the figsize to be a little bit taller. Let's check this out. All right, it already looks better. So now we have the figure split up based on the label, we're looking at the acceleration for the X, and we can get a better look at the potential outliers. Here for the squat we see a lot, actually, and for the resting period I think only one. We can also switch this up: for example, say we want to look at the Y acceleration, then we get a different image. If we compare the overhead press, for example, now we get a lot more potential outliers on the y-axis. And then let's, for example, look at the gyroscope data, and that also gives us a different image. This is a nice starting point to better understand the data.
Now the next step is to add some additional data, so we're not only going to look at a single column, but at multiple columns at the same time. Coming back to our data over here, what we can do is define the outlier columns that we want to
look at. So if we have a look at the data over here, basically all the numerical columns should be considered when looking at outliers. We're going to make a selection from data frame dot columns, and then we say we want the first six, I believe, let me check. Yes, so the first six of the columns are all the numerical values, and we're going to store that in a variable, because we are going to use it regularly to loop over. We're also going to turn it into a list: when you return a data frame's columns object, it is considered to be an index, and we just want to turn that into a list, so we get a nice Python list, and we store that in outlier columns.
What we can then do is come back here and put in the outlier columns over here, but then make one more split based on the accelerometer and gyroscope data, like we always do, and then add the label in there as well. So what I'm doing right now is making a selection of the data frame: I want to take the columns over here, so this will be the acceleration data, and we're going to add the label to that as well. Oh, sorry, I see that I have a comma over here. Let's have one more look at our list over here: here we can see that we're using the accelerometer data and we have a label. So just to show that one more time visually in the data frame, what am I missing here? A few extra brackets that we don't need. Okay, so here we can see we now have a selection of the data frame with acceleration and the label, and we're going to use that to create a box plot grouped by label. The final thing that we're going to add is the layout, which determines how the box plots will be visualized, and for that we're going to use a layout of one by three. Let's have a look at that. Now we get a better understanding of the data from this one figure alone, and we don't have to scroll up and down to compare everything. We can do the same thing for the gyroscope data, and
then we're going to move the three and put it in front of the colon. So just to show you that one more time, this will be the gyroscope columns plus the label, which will result in the following data frame, and then we're also going to create a plot for that. Let me just clear that up and then run this line after line. So now we have, on one screen with two images, a really good understanding of all the individual parameters, also split by the label. We can already tell, just by looking at these figures, that using the interquartile range there are potentially a lot of outliers within this data.
Let's now look into the data a bit deeper and visualize these outliers, or potential outliers, over time, because now everything is just on one big pile and we can't really tell visually whether the outlier marked over here, for example, is actually a really extreme value or something that is pretty normal and shouldn't be considered an outlier. For that we're going to use a custom function to visualize outliers in time, and I have a really awesome function for you that we're going to use for this. I have already put this function in the document over here, because this is not an episode about data visualization, so we're not going to spend a lot of time creating this whole function, because it's quite long. You can go to the document, click on the resources on plotting outliers in time, then click over here, and here we get the preview of the whole function that we're going to use. You can click here on copy, then come back to Visual Studio Code, and we'll paste the function over here. First let's make sure this is saved, and now I will quickly go over what we're doing in this function. First of all, part of this function comes from this GitHub page over here, which is from the official Machine Learning for the Quantified Self book. So this is all open source, and we've used some code from this function, and later also from
another function within this project, but I've adjusted it to better fit my style. Just know that this is the official source.
Then, briefly covering the function: what are we doing? This is a function that can plot outliers in case of a binary outlier score, so true or false values. We basically insert a data frame and a column, and that same column again but then marked with true/false values for whether it is an outlier or not. It will then create a time series plot, and it will plot the non-outliers in blue using a plus notation, and it will plot values marked as an outlier in red, also with a plus notation. The rest is all just some styling. This is a really awesome function, and now we're going to insert another function, and then we're going to put this into action and actually show you how to use it. That is because before we can actually use this, we first have to mark values as outliers or not using a true/false column. The box plot figure is nice to quickly visualize potential outliers, but it does not help us in creating a column that marks outliers, if you get what I'm saying.
So in order to do that, we're going to use another function that is available in the document over here. Let's go back, come back over here, and then go to marking outliers using IQR. Here's another function that we can copy, and then we come back to Visual Studio Code. Here it says insert IQR function; we can insert that over here, and now we can run this as well. Basically, what this will do, and this is necessary before we can visualize them, is take a data set as an input and also a column specified as a string. It will then determine Q1 and Q3, calculate the interquartile range, and use the formula over here to calculate the lower bound and the upper bound. This was also explained in the document, so here you can see what we're basically doing, but this is just
the same, but translated into Python code. Then we're going to add a new column to the data frame, and that column will have the same name as the column that we are evaluating, but with underscore outlier added to it. So for example, if we're looking at acceleration for the y-axis, we will return a data frame which also has a Y acceleration underscore outlier column, which is marked with true and false values. Needless to say, when there is an outlier it will be marked as true, and otherwise it will be marked as false. That is how this function works, and it is also a really awesome function, especially when you combine it with this function over here.
So now let's see what this looks like when we apply it to a single column and then visualize the results. Let's start off with the acceleration for the X and save that as col, and then let's say data set is, and then let's call the mark outliers IQR function and insert the data frame and then also the column. Looking at this function over here, we take two inputs, data set and column. Looking good. Make sure this is stored in memory, this as well, and then we can check out what we get if we run this function. Let's now have a look at data set, and what we can see is that data set is now just the original data frame, that is this one over here, but we have one extra column. So data frame has 10 columns, data set has 11 columns, and that final column is the acceleration X outlier column. Here you can see false, false, false, false. It could be the case that somewhere hidden in this data set of 9000 rows there is a true, so let's see if that is the case using our plot binary outliers function.
Let's have a look at what we have to input. We have a data set, a column, an outlier column, and a reset index, which is optional, sorry, not optional, which is a Boolean that can be used to reset the index, like we've seen in the previous episode. It is a time series data frame, and usually when we're plotting the data over
here we want to reset the index first to better visually represent the data, because there are time gaps. That is why the reset index is in there. So let's have a look and say plot binary outliers. We put in our new data set, which is data set, then our column is col, and then for our outlier column we take col, say plus, and then underscore outlier. For reset index, let me first set it to false, then I'll show you why this is important. I think we're good. Wow, look at that. Okay, first of all, what we can see is that we have non-outlier values in blue and outlier values in red, and you can already see that the values over here look pretty extreme, and they're also marked in red here as well. But like I stated, here we can see that we have a time frame of about two weeks, and that is not ideal to visualize this data. So let me turn that to true and run that again, and now we can already see what's going on at a more granular level. The mark outliers IQR function is definitely doing its job: we can clearly see that the red dots are only marked at what appear to be pretty extreme values, or at least they're not in the middle, so they're either on top or here all the way at the bottom. Just by visually looking at this we can already tell, yeah, I could see why these are outliers. But then over here it seems like a bit too much: we would be throwing away a lot of data if we were to accept what we're seeing over here.
All right, let's now continue by creating a loop to loop over all of the outlier columns that we've defined earlier. For col in outlier columns, we're basically going to do the same thing, so we can copy this over here, and since we've already denoted the column as col in this code over here, this should work straight out of the box. Let me clear this up. So what we are now doing: for all the outlier columns, we will first run the mark outliers IQR function, get the
outliers stored in the data set, and then plot that column, and then do that six times. So let's have a look. All right, first we have the acceleration data and then we have the gyroscope data, and we can clearly see that the gyroscope data has a lot more red dots, red crosses, than the accelerometer data. So we're already starting to see some pretty interesting things over here. For example, for this Z acceleration I think there's one over here, but that's it, and then over here it's way too extreme, so we would be throwing away a lot of data. So this is a good starting point, but we definitely have to tweak some things.
The main problem that we are dealing with right now, and which we are going to solve later, is that we're looking at all the data on basically one big pile, and we're not differentiating between the different exercises. We can have a look over here, and we can clearly see that the data over here is very different from the data over here. The IQR method is a distribution-based method of determining outliers. When the majority of the data looks like this, and there are a few sets within the data that look like this, then statistically, looking at the distribution, these are underrepresented, meaning that the mean value and the standard deviation are mainly determined by the majority of the data, and values like these are then identified as extreme values because they are larger and don't appear that often in the data set. From looking at the data earlier, I know that the periods over here are periods of rest, and during a period of rest the participant had no limitations on what they could do. So they could walk around, stand up, drink some water, and you can imagine how that results in a movement pattern that is very different from the very limited movement during an exercise. But it is also important for a model that it can differentiate between periods of rest and periods of performing an exercise. Then the question arises:
okay, so what do we want to do with this data? Later we will see that it is much better to split the data, or group the data, by exercise, by label, and then apply this method; that will drastically improve the results. But in order to do that, I'm first going to introduce two different methods, and then we're going to split by exercise. That is basically the same as what we're always doing: building block, building block, building block, and then we bring everything together.
With that out of the way, let's continue to Chauvenet's criterion. Chauvenet's criterion is also a distribution-based method to look at outliers, but it tackles it a bit differently than the IQR method. Coming back to the document, we can have a look at Chauvenet's criterion. Chauvenet's criterion is a bit more complex in terms of how it is actually calculated, so I won't really dive into that, but know that there is some extra information over here, and that this is from the original book, Machine Learning for the Quantified Self. You can also just look up Chauvenet's criterion. But basically, to sum it up: according to Chauvenet's criterion, we reject a measurement, meaning that we identify it as an outlier, from a data set of size N when its probability of observation is less than 1 divided by 2 times N, where N is the length of the data set. A generalization is to replace the value 2 with a parameter C, and that is also what we will see in the also awesome function that we have over here. So we're basically going to take a data set again, then a column, and then this value C, which we default to 2, and then it will calculate Chauvenet's criterion for us. Like I said, the calculation of the probability of an observation is pretty complex and we won't really go into that, but that is basically what we're doing here, and we're using the SciPy library for that. This as well comes
from the GitHub repository over here, where you can have a look at the outlier detection. There are a few different methods in here as well, but here you can see Chauvenet, so these are the calculations that we are using. So I also didn't come up with this, just so you know.
Now, coming back to the function, it's the same step as before: we copy the function, and then we have a comment over here that says insert Chauvenet's function, and we place it over here. One more thing to note, coming back to Chauvenet's criterion as well, is that it assumes a normal distribution of the data. That is really important, otherwise the results can be messed up or wrong. So we assume that the data is normally distributed, and there are plenty of ways to check whether data is normally distributed, but the most straightforward ways are to look at a histogram or a box plot. For a histogram, the question is: do we see a bell-shaped curve? And for a box plot, the question is: is the box symmetrical, meaning that the whiskers at the ends have somewhat the same length compared to the center box for Q3 and Q1?
So let's have a quick check of our data again to see whether it is actually normally distributed. In order to create histograms, I'm going to scroll up a little bit and take the code where we created the box plots, then come down to over here, so Chauvenet's criterion, check for normal distribution, and insert this again. Then we're going to replace boxplot with plot dot hist, and we're going to make the figsize a little larger and do a three by three layout, and you will see in a bit why that is the case. So again, boxplot becomes plot dot hist, and then again. All right, so let's, oh, where are we going? Let's have a look at the accelerometer data. Awesome. So we're creating a grouped plot, grouped by label, for acceleration X, Y and Z, and we're also creating separate plots for all the labels. So looking at this, coming back to our explanation, or our
check, better to say: do we see a bell-shaped curve? That is basically the check. In general, I would say I see bell-shaped curves for most of the data. There is some data, especially the rest data over here, and especially the Y acceleration, that is far from normally distributed, so that could result in problems when looking at the resting data. But other than that it looks kind of normal to me. Let's have a look at the gyroscope data. Oh, this is even more normally distributed. So we're just looking at the bell-shaped curve; it's not perfect, but for the case of this demonstration we will just assume that the data, or at least most of the data, is normally distributed, and we can continue to use Chauvenet's criterion for outlier detection.
From a high-level overview, this function works exactly the same as the mark outliers IQR function in terms of what the output of the function is, namely a data set with an additional column marked as outlier, but the calculation is a bit more complex in here, and as I said, we won't cover this part. For now we can make sure this function is stored in memory, and what we can now do is take this part of the code over here that we've used to loop over all the columns using the mark outliers IQR method, and just change it into Chauvenet. That is what is nice about these functions: they are standardized to result in the same output. Let me clear that up and look at Chauvenet's criterion for determining outliers. We can already tell there are a lot fewer outliers, which is good in my opinion, because, as I just mentioned, looking at the IQR method there was a lot of data, especially also here in the beginning, that was marked red. I remember, for example, the gyroscope Z data: this was all marked as outliers. Now we can see that it's not so hard on the data anymore, as in it's leaving a lot more values untouched. We can see problems here
in the rest data, but that could also be the case because we've just seen that the resting data is actually not normally distributed, and that's why it's resulting in a lot of outliers here as well. So, very interesting results.
All right, let's now continue to the local outlier factor function. As I've mentioned, we first want to create all the different functions and then actually start comparing them to check what the best approach is. So we're quickly jumping over this part without really going into the details, but that's to bring it all together later in this video, when we have all the results. So, local outlier factor: we can come back to our document, and there is another awesome function over here that we can use to calculate outliers based on the local outlier factor. This one is a bit different from the previous ones, because it is a distance-based approach to determining outliers versus a distribution-based approach, and it's also an unsupervised learning method, because we're basically going to train a model and then make predictions, and we're going to use those predictions to mark outliers or not. Again, I'm not going to cover all the details; you can look at the scikit-learn documentation over here to better check out how this works and how the values are calculated. But it's basically looking at the local density deviation of a given data point with respect to its neighbors. Another key difference over here is that before we were looking at individual columns, and now we're going to look at an individual row. So we are going to consider the six data points within a row and use that to compare it to all of its neighbors, or the number of neighbors that we specify, and we're going to go with the recommendation of the scikit-learn library of setting the total number of neighbors to 20.
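The row-wise, distance-based approach described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the function from the video: the name `mark_outliers_lof` and its signature are hypothetical, and the flag column is called `outlier_lof` after the name mentioned in the transcript. The scikit-learn calls used (`LocalOutlierFactor`, `fit_predict`, `negative_outlier_factor_`) are part of its public API.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor


def mark_outliers_lof(dataset: pd.DataFrame, columns: list, n: int = 20):
    """Flag whole rows as outliers using the local outlier factor.

    Returns a copy of the data with a boolean `outlier_lof` column,
    the raw predictions (-1 = outlier, 1 = inlier), and the scores.
    """
    dataset = dataset.copy()
    lof = LocalOutlierFactor(n_neighbors=n)
    data = dataset[columns]
    # fit_predict compares every row with its n nearest neighbours.
    outliers = lof.fit_predict(data)
    # Close to -1 means inlier; much lower values mean more abnormal.
    X_scores = lof.negative_outlier_factor_
    dataset["outlier_lof"] = outliers == -1
    return dataset, outliers, X_scores
```

Unlike the column-wise methods, this is called once on all sensor columns together, for example `mark_outliers_lof(df, outlier_columns)`, where `df` and `outlier_columns` stand for the data frame and the list of numerical columns.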
For each of the rows that we have in our data frame, with six columns, we're going to look at the 20 closest neighbors and then check whether values are isolated. So again, copy the function and insert it at insert LOF function over here, and make sure to run this as well. The only difference over here is that we don't have to loop over all the individual columns, because we're just going to take the whole data set as an input. Then, as you can see, what happens over here is the regular way you use scikit-learn classifiers: we define an object from the classifier, and then we define the data, which is in this case our data set, specified with the columns, which are all of our outlier columns in this case, so not looping over them one by one. Then we're going to do a fit predict on the data itself, and we also compute the negative outlier factor scores, which we are not even using in this example. We're just going to set a data set outlier LOF column to either true or false based on whether the outliers that we fit predict over here are set to negative one. So negative one is an outlier and one is not an outlier, and this will result in the same true/false column that we've seen earlier. That is how this works.
In order to create this loop we have to make one small adjustment. If I take the previous loop that we've created, I'm just going to take this over here, and we are starting with the for loop at the data visualization part. So we can visualize the outlier columns in the same manner that we've done over here, that is by a loop, but we're going to mark the outliers using the mark outliers LOF function not in a for loop, but by just putting in the whole data frame. And then this should be the outlier columns. So the outlier columns and the data frame, that is what we throw into the function over here. Then let's have a look at it, and then we get a data set. Let's check. I see one small mistake here:
so we are returning the data set, the outliers, and the X scores, that is what we're seeing over here, and we also have to specify that over here in the output. We're not really doing anything with them, but we're just going to store them. Let me just run this one more time. Now we have a data set with one column, outlier LOF, which considers all of the six rows, sorry, all of the six columns for every row in the data frame. You can also see how outliers is just a list of ones and minus ones, and that's how we basically specify the outlier LOF column over here and turn it into a Boolean: we say where outliers equals -1, that would be true, otherwise false. Then we also get the X scores, which you can basically see as the certainty of whether it is an outlier or not. In this case we're using the negative outlier factor, so the more negative a value is, the more likely it is to be an outlier, if that makes sense. We're not really using it. You can also look it up in the documentation of the scikit-learn library; there they explain how they use it. They basically use it to create these circles around here, where the larger the circle, the more certain the model is that it is an outlier, and it uses that value for it.
All right, so now we have the data set and we can loop over the columns again, so let's see what that looks like. Okay, I see we've made a tiny mistake, and that is because I haven't adjusted this over here. This is correct, because now we're looking at the outlier, what was it called, outlier LOF; that is the column that we're considering right now. That should result, yeah, let's check. So that was the error before. Okay, interesting. What we can see right now is that we're looking at all the individual columns, and outliers are now starting to be identified more within the data itself. Before, we could basically draw a straight line and anything underneath that would be marked as an outlier, or on the top
everything above a certain line would be marked as an outlier. But now we're starting to see that data points within what seems to be a regular movement pattern are marked as outliers. What's also very interesting: you can see that these values here on the bottom are fine, but in between, the method that we're using right now thinks that these are outliers. Here we can really see the difference between distribution-based and distance-based methods. You can see that this point over here is really isolated, and when it looks at its 20 nearest neighbors, the local outlier factor model thinks this is a very isolated data point. If we compare that to a distribution-based method, where you look at the data as a whole, that method says: this value is basically on this line over here, and there's a lot of data around that point, so this value is fine. The other way around, a distribution-based method says this is an extreme value, but the local outlier factor says: no, I have a cluster of data over here, a couple of points, and this point is not that strange because it's surrounded by 20 or so neighbors that are nearby. So, really interesting: two very different approaches to looking at outliers, and now it is up to us as data scientists to determine what the best approach is for this particular problem. Okay, next step: check outliers grouped by label. I identified this earlier: it's probably better in this scenario to first split the data and then check it, so I think that's a fairer assessment of the data. Let's have a look at how to do that. I'm going to quickly copy a block of code, because this is basically all stuff that we've already done. We're going to look at the different methods, so first mark outliers IQR, and we're going to look at the bench press. You can just copy this over here or type it on your own. Start with a label, bench, and then we create a for loop: for column in outlier
columns, and then we basically do the same as we did before. Please note that the difference is that we're not putting in the whole data frame, but the data frame with a selection on the label. In this case we've specified that the label is bench press, so if I check this out, it results in a data frame with only bench press data. Now if we run this with the IQR method, we can see that the patterns are much more similar, because this is all bench press data, and the whole distribution of the underlying data has changed. What we can also see is that the IQR method is even stricter in deciding whether a point is an outlier or not: it's marking a lot of points over here, even clusters of data, as outliers, and we can basically draw some straight lines where anything above or below a certain line is marked as an outlier. Let's have a look at the squat, for example. Okay, also some areas where there are lots of outliers. So, in my opinion, this is still a bit strict, and we would be throwing away a lot of data using this method. Let's apply the same approach, but now with mark outliers Chauvenet, and again to the bench press. Okay, what are we looking at here? Here we can see that Chauvenet is treating the bench press data very nicely. We only see a couple of outliers over here, and this seems fairly reasonable to me. Especially look at this over here: we've got a nice blue area and one red point, which in my opinion could definitely be an outlier. Here we have a few more, but again, looking at the whole data set, these are some pretty weird-looking points. So I like the Chauvenet approach. We can also visually inspect the squat one more time. Again, I like the results over here: a few points, not too strict, looking good. Now let's have a final look at the local outlier factor.
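A minimal sketch of the local outlier factor marking discussed above, assuming scikit-learn's `LocalOutlierFactor`. The function name `mark_outliers_lof`, the column names, and the synthetic data are assumptions for illustration and not necessarily the exact code from the video; only the 20 neighbors, the three returned values, and the Boolean `outlier_lof` column follow the transcript.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor


def mark_outliers_lof(dataset, columns, n=20):
    """Flag rows as outliers with LOF, using all given columns at once."""
    dataset = dataset.copy()
    lof = LocalOutlierFactor(n_neighbors=n)
    outliers = lof.fit_predict(dataset[columns])  # 1 = inlier, -1 = outlier
    X_scores = lof.negative_outlier_factor_  # near -1 = normal, lower = more abnormal
    dataset["outlier_lof"] = outliers == -1
    return dataset, outliers, X_scores


# Synthetic stand-in for the sensor data (column names are assumed).
rng = np.random.default_rng(0)
outlier_columns = ["acc_x", "acc_y", "acc_z", "gyr_x", "gyr_y", "gyr_z"]
df = pd.DataFrame(rng.normal(size=(200, 6)), columns=outlier_columns)
df.loc[len(df)] = [50.0] * 6  # one clearly isolated point

dataset, outliers, X_scores = mark_outliers_lof(df, outlier_columns)
print(dataset["outlier_lof"].sum(), "rows flagged")
```

Because LOF scores each row by how sparse its neighborhood is compared with its neighbors' neighborhoods, it can flag points inside the bulk of a single column's range, which matches the behavior seen in the plots.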
Let's just take this, yeah, let's run this. Okay, local outlier factor: again we see this behavior where we are now marking points within the bulk of the data and not necessarily above or below a certain point. Very interesting results. Okay, on to the next part, almost the final part of this video already: choose a method and deal with outliers. As of right now we have visually inspected all of the outliers using the plots over here, but if I look at my data frame, this is still the same data frame that I've imported over here. So now we have to make a decision: first, what approach do we want to use, and second, what do we want to do with the outliers? Do we want to remove them, mark them, or impute them? Let's focus on that right now. In order to do this, we're going to first test on a single column; like usual, we start with the building blocks. So let's say, for example, the column is gyroscope y, and now let's create a data set where we call mark outliers Chauvenet, and we insert our data frame, and we have to input our column as well. Let's run that and check it out. Now we have our data set, which has an additional column for the gyroscope z outlier. The next step is to translate what we're seeing here, this Boolean column, into an actual transformation of the column over here. So let's first have a look at the data set, where we say data set and then take this new column that we've created. Remember, with pandas we can do Boolean indexing. Let me put that in quotes. This is a Boolean series, so false, and there are some trues somewhere in here. We can insert this between square brackets after the data frame, and it will do Boolean indexing and return only the rows where this is true. So by running this we can
see where all the outliers are, and we have a couple. All of the values over here are marked as true. So that is the first step: now we know these are all of the values that we want to adjust. What we're going to do right now is replace these values with NaN, and we're going to take care of these missing values later. For now we're just going to remove these values from the data frame. We're not going to remove the entire row; we're just going to set the values over here to np.nan. In order to do this, we take the data set, use loc, make the Boolean selection, specify the column over here, and set those values to np.nan. This is pretty advanced, but we're basically using the loc indexer of the data frame to first make a selection based on Boolean indexing, and then, where this is true, we set the values of this column to np.nan. When we run this we don't get an output, so this happens in place. Note that we don't have to write data set equals; this updates the data frame in place. Now if we look at this again, we can see that everywhere the gyroscope z outlier was set to true, so marked as an outlier, the value is now set to NaN. And if you have a look at the data set, it still contains all of the 9000 rows and still has that column, but wherever the outlier column is true, the value is now NaN. We can check that by running this. So that was the first piece of this puzzle: testing this on a single column. But now we want to create a for loop, as usual, and loop over all of the outlier columns and perform this transformation in a loop. In order to do this, I'm going to first create a copy of the original data frame that we call outliers removed df. So let's check this out: the original data frame is just the same as we've imported up here, nothing's changed, and we're going to create a copy now, meaning that outliers
removed df is exactly the same. We do this to create a new, final version of the data frame that we will export later. Now we're going to create a for loop, just as usual, to loop over the data. So we say: for column in outlier columns (remember, the outlier columns are just the six columns over here), and I'm going to loop over them one by one. Then we say: for label in df label dot unique, which basically means that we're going to loop over all the individual labels in the data frame. So we're applying the approach where we first group the data, as we've seen that this results in an overall fairer assessment of whether a value is an outlier, especially for the distribution-based methods. We loop over the columns and then the labels, so it's a nested for loop. The next step is to actually label the outliers based on what we've done here already. So let me copy this line over here from the Chauvenet's criterion part, and on the next line we say: data set equals mark outliers Chauvenet, and here we enter the data frame and filter by label. Next, we're going to basically do what we've done here, so we actually set the values to np.nan, but now we make this adjustable for the for loop by replacing the hard-coded values, so that would be data set, column plus outlier. So we're first selecting the column that has the true/false values, the Boolean series, then we use that to make a selection, update the column within the for loop, and set it to np.nan. To add a comment over here: replace values marked as outliers with NaN. The next step is an additional step that we haven't seen before, because now we're storing everything in the variable data set, and the variable data set is a subset, a selection based on the label, but eventually we of course want to update these
values in the outliers removed data frame as well. So we're going to update the column in the copy of the original data frame. In order to do this, we take the outliers removed data frame and use the loc notation again, where we create a subset where outliers removed df label equals the label that we are using in the for loop. So what we're saying right here is: we first create a subset based on the label, so in the loop iteration where we have bench press, we say, where the label is bench press, that is the selection. Then we use the same notation to update the values within that column: we say this is the column that we want to change, so the column from the for loop as well, and we set that to data set and then the column that we've just updated. This is a pretty advanced mechanism, but I'm not sure if there is a more straightforward way to do this. So to summarize: we first create a subset of the original data frame based on the label in the for loop; then we mark the outliers using Chauvenet's criterion as usual, but that results in a data set that is a subset of the original data frame, so when we overwrite the values there, we still don't overwrite them in our target output data frame. So the next step is that we take a subset of our output data frame and specify the column that we've just defined over here in the loop. Please let me know in the comments if you think there's a better or more straightforward way to do this, because I feel like there is one, but I'm not sure. Anyway, let's continue. We update the data frame, and now the final cool thing that we can do is create a print statement that also lets us know how many values are removed. So we can, for example, say n outliers equals, and then we take the length of the original data frame and subtract the length of outliers removed df, and then we take the subset column
and then we say dropna. What we're doing here is: remember, when there is an outlier we replace it with np.nan, which means there is a missing value, and by dropping all those missing values we get a data frame that has fewer than 9900 records. By subtracting that from the length of the original data frame, we know how many outliers are in that specific column for that specific exercise. Now all we have to do is create a print statement: we open quotes and create an f-string, and we say removed n outliers from column for label. So: removed X outliers from acceleration y for bench press, and that should do the job. Now let me clear this up and check our inputs again. We have our original data frame, nothing's changed; we create a copy and store it in outliers removed df. Let me clear it up, run the for loop, loop over everything, and replace the outliers with missing values. What do we get? Removed zero from acceleration x for bench. Awesome, so it looks like this is working, and we can see that especially for the gyroscope z data over here we get a lot of values that are removed. We can do a quick check using Chauvenet again. This is for the squat, so let's say, for example, we take the deadlift, because we can see that for the deadlift there are 40 outliers removed in gyroscope z. So let's set the label like this and have a look. Ah, I see one mistake. When calculating the length, I was initially using the data frame and outliers removed df, but that is wrong; we have to use the data set here. That's also why you were seeing these values increase over time, because it was counting all the values that had already been removed from that column so far. We could have seen that earlier by looking at these values over here and seeing that they were increasing. It should be data set and data set; that is how we calculate the values.
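The nested loop described above can be sketched as follows. This is a hedged reconstruction, not the exact code from the video: `mark_outliers_chauvenet` is a stand-in written here for the function defined earlier in the series, and the column names, labels, and synthetic data are assumptions. The count uses the subset `dataset` on both sides, which is the correction made at the end of the passage.

```python
import math

import numpy as np
import pandas as pd


def mark_outliers_chauvenet(dataset, col, C=2):
    """Stand-in for the Chauvenet function from earlier in the series (assumed)."""
    dataset = dataset.copy()
    mean, std = dataset[col].mean(), dataset[col].std()
    n = len(dataset.index)
    deviation = (dataset[col] - mean).abs() / std
    # Two-sided tail probability of each point under a fitted normal distribution.
    prob = deviation.apply(lambda d: math.erfc(d / math.sqrt(2)))
    dataset[col + "_outlier"] = prob < 1.0 / (C * n)
    return dataset


# Synthetic stand-in data; labels and column names are assumptions.
rng = np.random.default_rng(1)
outlier_columns = ["acc_x", "gyr_z"]
df = pd.DataFrame(rng.normal(size=(200, 2)), columns=outlier_columns)
df["label"] = ["bench"] * 100 + ["squat"] * 100
df.loc[10, "acc_x"] = 25.0  # injected extreme value in the bench data

outliers_removed_df = df.copy()
for col in outlier_columns:
    for label in df["label"].unique():
        # Mark outliers within one exercise only.
        dataset = mark_outliers_chauvenet(df[df["label"] == label], col)
        # Replace values marked as outliers with NaN.
        dataset.loc[dataset[col + "_outlier"], col] = np.nan
        # Write the cleaned column back into the output data frame.
        outliers_removed_df.loc[outliers_removed_df["label"] == label, col] = dataset[col]
        # Count on the subset itself, not on the full data frame.
        n_outliers = len(dataset) - len(dataset[col].dropna())
        print(f"Removed {n_outliers} from {col} for {label}")
```

The write-back relies on pandas aligning the right-hand Series on the index, so it assumes the data frame index has no duplicate entries.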
Let's run that again. Now we get a different picture over here: we can see zero, a couple of values, a couple of values. And then for the gyroscope z data for the deadlift, if we run that again, so deadlift, Chauvenet, let's go. Is this the deadlift? This doesn't look like the deadlift. Let me make sure this is the deadlift and run that one more time. Yes, now we're looking at the deadlift data again. We can see that we have one, two, three, four, five, and if you count all of them, there are 14. So we have 14 red dots; that is looking good. And now if we take our outliers removed df (let me clear this up), we can see that this is a data set with the same length as the original data frame and the same number of columns, but if you call info, we can see that some columns have missing values, meaning that we have successfully replaced the values marked as outliers by Chauvenet's criterion with np.nan. That brings us to the final line of code that we have to write for this video today: we take the data frame and export it. So we say to pickle, and just like usual we go into the data folder, interim, and we're going to call this 02 outliers removed, and then Chauvenets, dot pkl. Let's run this and validate that in the data interim folder we have the file over here. And now we're done. We now have a data frame that we can continue working on in the next part of the series, where we're going to add additional features, which will be very interesting, and we're also going to tackle how to deal with these missing values. So we have covered a lot in this episode. As of right now, you saw me using Chauvenet's criterion, and from looking at the data, in my opinion Chauvenet's criterion looked the fairest, the best approach for dealing with outliers on this data. But to be honest, I'm not sure; I can't really tell from looking at this, and the test that we eventually have to perform is: will it improve model results? We have now exported this data frame
as 02 outliers removed Chauvenets. In later parts of the series we will look at different approaches to removing outliers, generating features, and imputing missing values, and then check what the effect is on the overall performance of the classification model. We will look at accuracy, for example, and precision and recall, and we will validate which approach is the best, because, as I said, to be honest, you don't know for sure what the impact is of removing the values that we've just identified as outliers. Removing outliers is always an interesting part of a data science project. Like any part, it requires some special attention and domain knowledge from a data scientist to really validate whether an approach is correct or not. And that brings us to the end of this episode. We've covered everything; there are resources over here where you can find additional information. I want to thank you guys for watching. If you have made it all the way to the end, please like this video and subscribe to the channel. I put a lot of time and effort into making these videos for you, and by liking and subscribing you really help me out, help the channel grow, and let YouTube know that you want to see more content like this, so it's basically a win-win. Thank you very much for watching, and I'll see you next week in part 5, where we dive into feature engineering.
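The export and validation step mentioned near the end of the transcript can be sketched like this. It is an illustration under assumptions: the tiny data frame is made up, the file name spelling is inferred from how it is spoken in the video, and a temporary directory replaces the project's data/interim folder so the snippet runs anywhere.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Small stand-in for outliers_removed_df (values are made up).
outliers_removed_df = pd.DataFrame(
    {"acc_x": [0.1, np.nan, 0.3], "gyr_z": [1.0, 2.0, 3.0], "label": ["bench"] * 3}
)

# The video writes into the project's data/interim folder; a temporary
# directory is used here instead (an assumption for portability).
out_path = Path(tempfile.mkdtemp()) / "02_outliers_removed_chauvenets.pkl"
outliers_removed_df.to_pickle(out_path)

# Reload and check which columns now contain missing values.
reloaded = pd.read_pickle(out_path)
reloaded.info()  # non-null counts reveal the NaN values
missing_per_column = reloaded.isna().sum()
print(missing_per_column)
```

Counting missing values per column after reloading is a quick way to confirm that the outliers were replaced rather than the rows dropped.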
Original Description
Want to get started with freelancing? Let me help: https://www.datalumina.com/data-freelancer
Need help with a project? Work with me: https://www.datalumina.com/solutions
In this video, we will learn how to identify and handle outliers in sensor data using three different methods in Python: the interquartile range (IQR) method, Chauvenet's criterion, and the local outlier factor (LOF).
👉🏻 Source material for this week: https://docs.datalumina.io/jD1BSJCAPYKSwh
⏱️ Timestamps
00:00 Introduction
01:38 Loading the data
02:38 What are outliers
05:03 Boxplots and interquartile range (IQR)
24:04 Chauvenet's criterion
30:55 Local outlier factor (LOF)
42:30 Choose a method and deal with outliers
55:02 Export data
55:37 Conclusion
Project overview (what you will learn)
Part 1 — Introduction, goal, quantified self, MetaMotion sensor, dataset
Part 2 — Converting raw data, reading CSV files, splitting data, cleaning
Part 3 — Visualizing data, plotting time series data
Part 4 — Outlier detection, Chauvenet’s criterion, local outlier factor
Part 5 — Feature engineering, frequency, low pass filter, PCA, clustering
Part 6 — Predictive modelling, Naive Bayes, SVMs, random forest, neural network
Part 7 — Counting repetitions, creating a custom algorithm
Link to playlist: https://youtube.com/playlist?list=PL-Y17yukoyy0sT2hoSQxn1TdV0J7-MX4K
If you find these videos helpful, consider subscribing @daveebbelaar