Full Machine Learning Project — Predictive Modelling (Part 6)
Key Takeaways
This video teaches predictive modeling in a full machine learning project
Full Transcript
Hey everyone, welcome to part six of this series, where we create a fitness tracker with Python. Today is all about predictive modeling, the exciting stuff, right? Up until this point, everything has been preparation: preparing the data and getting it into the right shape for us to try out some different models. That is what we will focus on in this episode. If you're new here, make sure you have watched all the previous videos, because we will be building upon them. The link to this document will be in the description as usual; there you can find all the information. So let's get into it.

The specific goal for this episode: we are going to experiment with feature selection, model selection, and hyperparameter tuning using a grid search to find the optimal combination that results in the highest classification accuracy. Remember, we're working on a classification problem where we are trying to predict one of the following labels, given a large data set of sensor values that was captured during workouts with a wrist sensor. Furthermore, we are going to dive into the following 10 subjects today, starting off by creating a training and a test set.

But before we can get started, as always, make sure you download all the Python files. There is a train_model.py file, which you can have a look at over here. It's all empty as usual, with just the headers. Make sure that it is in your Visual Studio Code workspace. We're working under src, and today we're working in models over here, so make sure the train_model.py file that you're seeing right here is in the models folder. Again, you can download the file here or just copy and paste it and put it in there. Another file that we will be using is LearningAlgorithms.py, so make sure to add this to the directory as well. This is a neat little piece of code that is available as part of the Machine Learning for the Quantified Self book, and this was taken from the original
GitHub page. As you can see, I've made some adjustments to it, but just know that this is not my code; it belongs to the authors that you see over here. Basically, what this class contains is a bunch of classification algorithms that we are going to use. They are standardized in such a way, and also based on the sklearn library as you can see, that we can easily loop over the different algorithms, apply a grid search, and then get back the results. So this just streamlines the whole process of doing that grid search. What that exactly is, we will get into in a bit if you're new to it, but just know that this is the file with all the cool algorithms, the machine learning models that we will be using.

With that out of the way, we can get started by creating a training and a test set. Of course, before we can do that, we first have to load the data. As usual, we'll load the pickle file that we dumped in the previous episode. We'll define it as df, and we're going to use pd.read_pickle. Then we go to the data folder, so two steps back, data, then the interim folder, and from there 03_data_features.pkl. Once that's in there, we can fire up an interactive Python session, see if everything loads correctly, and then preview the data frame. Awesome, so your data frame should look like this.

Now the next step is to actually create a training and a test set, but before we do that, we're going to remove some of the columns from this data frame that we won't be using. We'll define our df_train and set that equal to our data frame where we drop a certain set of columns. That will be the participant, which we won't be needing for now, and we are also going to get rid of the category and the set. These are some columns within the data frame that won't be adding value right now for the predictive models.
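The column-dropping step described here can be sketched roughly as follows. This is only an illustration: the toy data frame and its values are made up so the snippet runs on its own, and the sensor column names (acc_x, gyr_x) are assumptions. In the project itself the data frame comes from the pickle file in the interim folder.

```python
import pandas as pd

# Toy stand-in for the features data frame (values are made up).
# In the project, df is loaded with pd.read_pickle(...) instead.
df = pd.DataFrame(
    {
        "acc_x": [0.1, 0.2, 0.3],
        "gyr_x": [1.0, 1.1, 1.2],
        "participant": ["A", "A", "B"],
        "label": ["bench", "squat", "row"],
        "category": ["heavy", "medium", "heavy"],
        "set": [1, 2, 3],
    }
)

# axis=1 tells pandas to drop columns rather than rows.
df_train = df.drop(["participant", "category", "set"], axis=1)

print(df_train.columns.tolist())
```

The label column is kept on purpose, since it is the target for the classification models.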
So it's easier to get rid of them. We can do that with the data frame's drop method, and then we only have to specify axis=1, meaning that we're dropping column-wise. We can first get a preview of what this looks like by running it like this, and we can see that we have three fewer columns than our original data frame, and that is correct. So now let's store it in the df_train variable. That was to get rid of the extra columns.

Now the next step is to split up the data into the so-called X and y of the training set that we will be using. We'll start off by declaring our X, and we're going to set that equal to our df_train and then drop the label that is in there. Remember, if we have a look at our df_train and take a look at the label, that is the column that contains all of our exercise labels that we want to train the classification model on, so this will be our y. (Sorry, I typed X_train, but we're going to split this into train and test later; for now it's just the X variable.) A nice way to split your data into X and y is to take all of your data and drop the label, and then for y we say we want the data frame, but just the label. So now if I look at this, this will be the whole data frame without the label, and this will be just the label. Now we can store that in the respective variables and continue to the next step.

In order to actually create our train-test split, we're going to leverage train_test_split from sklearn.model_selection. This is a nice and convenient function to easily split everything up, and it takes the X and the y as input. So now we can start defining our X_train, X_test, y_train, and y_test, and then we can define our train_test_split. In here we fill in the X, the y, and also a test size, and we can even specify a random state. By specifying the random state, we will make sure that
we will all get the same results, so if you are following along, you will get the same split. If we leave out the random state, the split will be different every time. This can be any integer, but let's just leave it at 42 for now. By setting it like this, we take control over the stochastic process that is happening in the background when you're doing a split. For the test size, let's just set it at 0.25, meaning that 75% of the data will be used for the training set and the leftover 25% will be used for testing. We can run this, and then let's have a look at our variables. We can now see that we have X_train, which is 75% of the original data frame in size. We also have X_test, which is the test set, so that is 25%. Then we have y_train, which holds the labels for the training set, and for the test set we also have our labels. So that is looking good.

But we're going to add one more parameter to train_test_split, and that is the stratify option, which we're going to set equal to y. Let me save this real quick in the meantime so Black will do its formatting and we can see what's going on. Since we are using a labeled data set, we of course want to make sure that our train and test sets are split in such a way that they both contain enough instances of all the labels. What we don't want is that our training set contains, for example, only bench press and squat data, and our test set then contains all the rowing data, which is not seen in the training set. We of course want to ensure that this does not happen. The train_test_split function comes with a shuffle parameter that is set to True by default, and this kind of counters the effect, but the stratify parameter is specifically made for these kinds of problems, where you want to ensure that all of your labels are distributed in the same proportions across both sets. So that's why we'll add the stratify
parameter as well and then run everything again to make sure it still looks as we expect. This is all looking good, so we can continue and create a quick plot to see stratify in action. I'm going to copy and paste a block of code over here that we will use to see how the values in the different sets, so y_train and y_test, compare to the total and how they add up. If I run this block of code, we get a beautiful illustration over here. You can just follow along with this code; there's nothing special going on in here. It's all the same thing that we did in the data visualization episode, so in order to speed things up I just dump it in here. What we can see here is a nice illustration of the total number of instances for each label. We can see that bench press and overhead press are all around 700, and for the row and the rest there seem to be slightly fewer records, as we can see. But all in all, we can see that there is a nice, even distribution of all the various labels across the training and the test set, meaning that we train on all the labels and we are also going to test on all the labels. That is the main goal of creating a train-test split like this and using the stratify option.

All right, next up we're going to split the different features into subsets. We are going to do this in order to later check whether the additional features that we added in the feature engineering phase are actually beneficial to the predictive performance of the models. We are going to split them up starting with the basic features, and for this we start out with the original parameters that were in the data set. As you see here, these would be the acceleration in x, y, and z, and also the gyroscope x, y, and z. So those were the original features in the data set.

Just to give you a little heads-up: if you see code appearing here in my editor, that is because I'm trying something new,
and that is GitHub Copilot within VS Code. As you can see, this is your AI pair programmer. I've been using it for only two days, but as far as I can tell, this thing is really awesome. As you are typing, it gives you suggestions, and it was probably trained on the whole code base of GitHub. It's similar to how ChatGPT works, in the sense that it can recommend next steps. So that's a little sidestep; I will probably make another video just on using GitHub Copilot. But if you see the code appearing and I press Tab and it autocompletes, that's what's going on.

Next we have the square features, and here you can see it's already trying to predict what I want to do. These are the two features that we created, the accelerometer r and the gyroscope r. So those are the square features. Then we have the principal component analysis features, and these are named PCA 1, 2, and 3.

All right, then we continue with the time features. For this we're going to use a list comprehension to loop over the columns in the data frame and do a match on a specific string. If you remember from the feature engineering episode, we labeled all the features related to time so that they have _temp_ in the name. So what we can do is a list comprehension where we say f for f in df_train.columns, and then an if condition: not time, it's temp, so if "_temp_" is in f. Let me do a quick check whether this is all right. As you can see, these are the columns that are within the data set, and all of the temporal abstraction features have this notation of _temp_ in the name. By setting it up like this in a list comprehension, we can extract them in a nice little way. We are basically going to do the same for the frequency features, only there we have to check for the frequency string instead
of the temp string. We can also add an or statement, because there are a couple more. Let me just check df_train.columns, because there are actually quite a lot. So there are some more: you also have the PSE over here that we should add. In order to make that work correctly, we have to put it between parentheses and also include the in f over here. So we're basically going to create an or statement where we say, okay, we want f from all the columns if it contains _freq_ or _pse. If we have a look at this, we can see that we get all of the frequency features that we have created, so that is great. Then one more: we have the cluster features, and this is pretty straightforward because that is just one column, and that is the cluster.

Make sure to run everything, and then, to make sure that we are all on the same page, I am going to put in a print statement over here. Oh, I see I made some mistakes here: square and frequency should be followed by features in this case. Let's run that one more time and make sure this is correct. Now we can do the print statements and check how many features we have. And I can immediately see why it's important to print these, because I see that I have 80 frequency features over here, but it should be 88. I noticed that for the frequency check we don't need the underscore at the end, because some of the features end with freq and then it doesn't work properly. If we run it like this, we should be good. So yeah, then we have 88 frequency features.

Another thing, if you've been paying attention, and this is one drawback of using GitHub Copilot: it suggested the columns nicely, but it added an o in columns over here. So you've got to be careful with what you accept and always test everything properly. Sorry for that correction, but we have all the
accelerometer and all the gyroscope values, then the sum of squares, PCA, time features, frequency, and the clusters. If we run everything and print it, it should look like this.

Now, with that correct, we are going to create four different feature sets, so feature set 1 up until feature set 4. Let me just write out one, two, three, and four. The first set will be just the basic features, so we set that like this. For the second set, we will be using the basic features plus the sum of squares. In order to do this properly, we're going to create a set out of this. A set is a data type in Python that is used to avoid duplicates. So if, for example, we have a problem in our code and we select a column twice, a set will get rid of that. I can just show you what that looks like: first we have a list, which you can see by the square brackets, and if we turn it into a set, it turns into curly brackets. Then, to convert it back to a list, we wrap it in a list() call again. For the first one that doesn't really make sense, but we'll add it anyway to keep everything nice and clean.

So for the second one, we combine the basic features and the square features, and we are also going to add the PCA features to this subset. For feature set 3, we're going to start from feature set 2 and add the time features. Let's run this first and see what it looks like. We're stacking features on top of features, to the point where we now have the basic features, the square features, PCA, and also the time features. Finally, we are going to use the same setup again: we take feature set 3 and add the frequency features and also the cluster features. So let's run that, and then make
sure to run this first. Now we should have feature set 4, which contains all the features that we have, while feature set 1 contains only the original six sensor values. You can already tell how we can use this later on to make different data selections and see what the performance of the model is, and then iterate over one, two, three, and four to see whether our feature engineering efforts have actually been beneficial or maybe were just a waste of time.

Okay, so the next thing we're going to do involves feature selection, and more specifically we're going to perform a forward feature selection using a simple decision tree. Basically, what this means is that we loop over all the individual features and start small, which is what forward selection means. We try one individual feature at a time and see, using a simple decision tree, what the accuracy is on predicting our labels. Once we have the feature that results in the highest accuracy, we start this process over and again try all the remaining features, but now adding each of them to the originally best-performing feature. So now we have two features that we're trying out.

What this will typically result in is that, as you add more features, the accuracy will start to increase, because we give the model more information, meaning that the model can learn from more data and usually becomes better at predicting the label. But typically what you see is that at some point, once you have introduced enough features, you start to see diminishing returns, meaning that the slope of the accuracy curve that we are going to plot is going to decrease. Adding more features, and in that sense more complexity, to the model will not necessarily result in a much better accuracy or a much better model. And the golden rule, of course, is that a simple model is better than a complex model: Occam's razor. So that is basically what we are trying to check for in this
forward feature selection method. Using that, we're going to select, in this case, the 10 best-performing features and create another subset, which will be feature set 5, that we're going to evaluate as well. So let's get into it.

For this we're going to define a learner object that we set equal to the ClassificationAlgorithms class that we have imported from the learning algorithms file that you can see over here. All the way at the top of the class you can see the forward selection algorithm, so to say, and this basically does what I've just explained. You can check out this file over here and go through it to see if you can understand it. It is basically looping up to the max features, which we'll define as 10. It checks how many features we have left, and in the beginning this is everything. It trains a simple decision tree and stores the performance, and if the performance is better than the previous one, it keeps that feature, and so on. We keep stacking features until we have the 10 best-performing ones. So really cool stuff; let's see how it works.

We'll initiate the learner, and we're going to set max_features equal to 10. Now, in order to fire up this forward selection, we have to capture everything that it returns as output. So we're going to start off with these variables over here, and we say learner.forward_selection. Let's check what we need to input: max_features, X_train, and y_train. Well, that's good, because we have those. Now let's save this in order for Black to format it. All right, once that is all stored, we can start running this script. This will take some time, because remember, it's going to loop over all the individual columns in the data frame, and we have, I think, 117 in total, and it's going to train a
decision tree. So initially it's going to train a model 117 times, and then it has the best feature. In the next iteration it's going to loop over the 116 columns that are left besides the best-performing one, and it's going to do all the training again, so another round of 116 training cycles for the decision tree. You can see here at the bottom that it did the first cycle, which is zero-indexed, so zero is first, and now it's on the second cycle. It's going to do this up until nine in this case, because then we have 10 features.

If we look at what this function returns, we have the selected features and the ordered features, which I think are the same in this function (we have to check that), and the scores. That means we can check what the best feature was that was added and also what the total accuracy score was. So we are going to wait for this to finish, and then we'll get back and plot the accuracy graph.

All right, and we're done. So let's have a look at what's inside these variables. As I've said, it seems that the selected features and ordered features return the same thing, so we can just ignore one. Here we can see the ordered scores. We can already see some interesting things going on here, but we really get a good understanding of what's happening if we create a simple plot. So again, I just paste in a simple line plot that we can create, and look what we have over here.

Let's break this down. What we are looking at right here is the total number of features selected via forward feature selection against the accuracy on the training data. Remember, this is not using the train-test split; this is only on the training data. So we train the model on the training set, and then we also make predictions on the training set. In a sense that is cheating, because it's making predictions based on data it has already seen, but
initially, for feature selection, it can give us a sense of direction as to which features contribute most to the accuracy. Here we get a nice view: the first one is the acceleration z in the frequency domain, and we can see that by using this column alone we get 88% accuracy, which is already amazing, since that is just one feature. So here you can see how powerful the Fourier transformations are that we applied. The second-best feature is also in the frequency domain, only this one is from the x-axis, also of the accelerometer. Now you can see that with two features we get an increase up to, if we round it, 99%. That is absolutely amazing: we have 99% accuracy with just two features on the training set. So here we can see that our efforts in feature engineering and applying the Fourier transformations really paid off. Almost all of the features over here are in the frequency domain.

If we look at the graph over here, we can clearly see the diminishing returns that I was talking about. We get a big increase from one to two, and going to three is also a leap, but then you can see that we're already past 99% and heading towards 1, so almost a perfect score. You can also see that towards the end the values don't really change that much, and that is really interesting to see.

We are going to use the variables over here, so the ordered or selected features, and we will add them later on when we are looping over the data. Just a best practice: because this took some time to run, what we can do is store those features like this, and then we don't lose them. If our kernel crashes or something happens, we don't have to run this part again; we can just access them via the list over here. And just to be clear, I think there is also somewhat of a stochastic process going on in the background here, so it could be that you get different results from running the
forward selection. I'm looking at my previous code example and I see that they are different, but that doesn't really matter for now, because most of them still overlap. Just so you know, you might get different results over here, but you can leave it like this and continue.

Now the next thing we are going to do is a grid search for the best hyperparameter and model combinations. If you don't know what a grid search is, it's basically a way to come up with the set of optimal hyperparameters for your models. Each of the models, in this case the scikit-learn models that we will be using, can have different parameters, and you can set different values for them. In order to find the optimal combination, we can define a grid over all of the different values that we want to test, and then GridSearchCV, also from scikit-learn, will loop over every single combination. You can see that this also scales up in computation time, because it's not additive but multiplicative: here this would be five times three times two, so 30 combinations in total for which we have to train and test a model.

Typically, what you also do during a grid search is validate your results using k-fold cross-validation, and in this case we'll be using 5-fold cross-validation. This is basically another way of splitting your data into essentially a training and a test set, but without touching the original test set that we defined previously. If you're not familiar with k-fold cross-validation, I would recommend looking it up. It is what we will be applying here, and it's a very important component of machine learning, so you will probably encounter it a lot.

So that is what we will focus on right now. Coming back to the code, we are first going to define two lists over here: one is called possible feature sets and the other is feature names.
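To make the grid search idea more concrete, here is a minimal sketch of a 5 x 3 x 2 grid evaluated with 5-fold cross-validation. The parameter names and values, the random forest as estimator, and the iris data set are all assumptions picked for illustration; they are not the grid that is defined in the learning algorithms file.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Hypothetical grid: 5 x 3 x 2 = 30 combinations (illustrative values only).
param_grid = {
    "n_estimators": [5, 10, 20, 30, 40],
    "min_samples_leaf": [1, 2, 5],
    "criterion": ["gini", "entropy"],
}
n_combinations = len(ParameterGrid(param_grid))

# Stand-in data set so the snippet runs on its own.
X, y = load_iris(return_X_y=True)

# cv=5 means every combination is scored with 5-fold cross-validation,
# so the held-out test set is never touched during the search.
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print(n_combinations)
print(search.best_params_)
print(round(search.best_score_, 3))
```

With 30 combinations and 5 folds, this trains 150 models plus one final refit on the full data, which shows why adding values to a grid gets expensive quickly.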
The first list will contain all the feature sets, so make sure that these are still stored in memory correctly. We have the basic ones, then two and three, then four with everything, and then here we also have our 10 best-performing features. So that's the first list. We will also give them a name, basically, so we store the names in this feature names list, and we will later use that for plotting. So that is the first step.

Then we're also going to define an iterations variable, and we're going to set it equal to one for now. I will explain this in a bit, but some of the models that we will be using have a stochastic nature, meaning that every time you run them, they train and optimize a little differently because the initialization is somewhat random. Typically, what you then do is train the models a couple of times, for example five times, and then you take all the accuracies and average them: five accuracies, summed and divided by five. But we'll start out with one iteration just to get a sense of how everything works. We're also going to define a score data frame that we just set to an empty data frame for now.

The next step is to copy and paste a huge piece of code, the grid search code that you see over here. Let me refresh that and get this one over here. I'm putting this in the document for you to copy, so you can just click on the right corner over here to copy it, because this is quite an extensive block of code and it would take a lot of time to actually code it out right now. Let me just show you in the file over here: we're basically doing the same thing five times, so you can see it's quite an extensive list. And I can see that it should be iterations and not iteration, so
let's just quickly update that. Basically, what we are doing here is looping over the possible feature sets and the feature names, the lists that we have just defined. Then we select a train set and a test set based on the possible feature set, which we index by the loop number. It's quite complex what's going on here, but to break it down: we loop over the length of the possible feature sets, and I use that loop index to index the feature sets and make a subset. So in the first iteration we take feature set 1, in the second iteration we take feature set 2, and like we've just seen, these build up, so we are one by one adding more information to the model.

Then we first run the non-deterministic classifiers and average their scores. This is the effect that I was just talking about: there is a stochastic element in the random forest and also in the neural network. But since iterations is now set to one, we will just do one run. We'll first train a neural network and then a random forest. This is written in such a way that it works with the learning algorithms that are in the ClassificationAlgorithms class, so this is the file we looked at in the beginning. Here you can see how the neural network is defined and also the grid search that it will perform, and the same goes for basically everything else. We have some other models in here that we won't be using right now, but these are basically the most common scikit-learn classification models, and they are standardized in such a way that they work seamlessly with this code. Again, this is all from the GitHub page of Machine Learning for the Quantified Self. I've made some adjustments to make it work for this specific project and my style of coding, but know that most of it just comes from that GitHub page.

Another quick thing: for the neural network, initially
we will leave grid search off, because it takes quite some time with the neural network, and initially we just want to get a quick and dirty estimation of all the models, so to say, to see the baseline of where everything's at. Later on we can try to tweak the parameters. So for the neural network we leave this off, but as you can see, for the random forest it's True, for K-nearest neighbors it's True, and for the decision tree it's True. For naive Bayes there are no parameters that we tweak, since it's a probabilistic model, so there is no grid search.

Coming back to the learning algorithms, for example, let me get to the random forest. Where is it? I think it's all the way at the bottom. Here you can also see the grid that we will be performing the grid search over. Feel free to add, for example, another setting, 200 or even 500, and that will add additional values to the grid search to loop over. So have a look at this code, see how the models are built up, and also understand what the different parameters are that we're testing.

I know that I'm going quite fast here, and we're copying and pasting a lot of code, but what's going on here is actually pretty damn cool. Like we've said, we are going to train five models in one go while also performing a grid search, so this is pretty advanced, pretty next level. Let me just go ahead and make sure that everything is stored in memory, and then we can start the training. This also takes some time. There are some nice print statements over here, so you can see that we've started on feature set zero, meaning the basic features, and it's now training the neural network. That went quite fast, because we don't use the grid search there. Now it's at the random forest, and that takes some time because it is looping over all the parameters that you're seeing over here. So this takes a while, and then once
it's finished it will continue with the k-nearest neighbors, the decision tree and the naive Bayes, and those go really fast, almost instantly. Here you can see we're already on the next iteration of the feature sets, so now it's looping over feature set two, and we have extra information that we can provide to the model.
Now if we come all the way to the bottom, this is where we actually save the results to a data frame. For every iteration over a feature set, so here it now completes feature set one, oh sorry, this is actually feature set two, and now it's feature set three, it stores all the models as defined over here with their abbreviations, the feature set it was on, and the accuracy of each model, which is stored after each training run. We use accuracy_score from sklearn.metrics to calculate the accuracy based on the original y from the test set that we split off from the training data, and we compare that with the predicted class labels, the class test y, which are the results of the models.
We can have a look at what this returns. I'm inside the learning algorithms again, and I can see that we train the model over here, run the grid search, print the results and then return. We get predictions for the training data and for the test data, and we also get the probabilities. We are not looking into those right now, and we're also not looking into the accuracy on the training data. I am just storing the accuracy on the test data, because that is essentially what we want to evaluate our model on. I am not that interested in the accuracy on the training data, although that can also be a good metric to look at to see how your model is generalizing. If you have a really good score on your training set but a significant drop-off on your test set, it means that you are overfitting. But since the scores
were already pretty high on the training set, I'm curious to see what the performance will be on the test set, and then we can go from there. Just know that the training accuracy can also be really useful, but we are skipping that for now. Okay, so it's on feature set four and doing its thing. Let's just wait for it to finish and then we'll come back.
All right, the training is finished, so let's have a quick look at our score data frame and see what's in here. We have a model, a feature set and an accuracy. Now let's sort these values: we call sort_values by accuracy with ascending set to False, and we can get rid of the inplace here and do it like this. And wow, now we can see that the random forest is performing at 99 percent accuracy if we round it, so it's almost perfect. And this is on the test set. Initially, when we were looking at the accuracy graph that we plotted during the feature selection, we were evaluating on the training data, but these results come from unseen data, so this is really impressive.
To get a better understanding of how each of the feature sets contributes to the accuracy of each individual model, we can create a quick bar plot. I'm using sns for this because it has a nice feature, and that is hue, so we can use color to differentiate between the feature sets. It should be in the imports already, import seaborn as sns. I think we've used it in a previous episode, so it should already be in your environment; if not, you can pip install seaborn. Running this should give us a fancy bar chart that looks like this. Here we can see how each of the models is performing and also how the feature sets are performing, so this is really cool. A few things that we see. We can
see that feature set 4, so all of the features, performs the best for all of the models, so that is cool. We can also see that the selected features perform much better in all cases compared to just feature set 1 and feature set 2, so that selection of only 10 columns actually performs pretty well. Feature set 3 also performs really well, but the features from set 4, those frequency features, are what really push the last few percentage points for all of the models. We can also see that KNN is not performing that well, and that for the neural network and the random forest we have pretty similar results, especially if we look at feature set 3 and feature set 4. Really good scores, and we can also see over here that we have 99 percent accuracy.
Now let's take the best performing model, in this case the random forest, train it on the data set one more time, and then have a look at a confusion matrix, because as you might know, accuracy is not everything when it comes to classification. Although when we're dealing with scores like this you can't really be wrong, let's just have a look at the confusion matrix to see what's going on and how it comes to an accuracy score of 99 percent. So I'm going to scroll up, grab the random forest call and copy and paste it over here, making sure all the indentation is correct. We can still leave the grid search on, and now we're going to change it up to use X train, but with our best performing feature set. So remember, if I come up, where is it, feature set 4 is the best performing one, and if we have a look at what that was called again, it's feature set 4 in this case. So we take X train with feature set 4, and we also want y train, which is just the label, so we don't have to change that, and we also want X test with the same subset of feature
set 4. Let's have a look at this. We have everything, so let me check one more time: we're missing one parenthesis over here, and then Black will automatically format this. So this should be the code. We have our outputs, and it's looking kind of weird because of how Black formats everything, but as you can see it makes sure everything fits nicely on the screen and it standardizes our code, so no matter how you write it, Black will put it in this format. In that sense it makes things easier, although sometimes it is a bit confusing. What we're doing is calling the learning class again and then the random forest. We train using X train and y train, we validate on X test, our test set, and we use feature set 4, which basically means all of the features, and we perform a grid search. Let's do that one more time; it shouldn't take that long.
Okay, it's finished. It took a bit longer than expected, I think about one minute on my machine. We have basically just completed the steps that we did earlier, but now we can continue and use the output variables to calculate an accuracy, using the scikit-learn accuracy metric, accuracy_score. We call it with y test and our class test y, which is our prediction. These are stored over here, and remember, y test is just the test set that we split off all the way up here, and now we're using it again to compute an accuracy score. So we have the predictions and the original series, and we can calculate an accuracy.
Remember, as I've said, the random forest that we are currently using, and also the neural network, have some stochastic elements, meaning some form of randomness in their initialization and in how they optimize the model. So every time you
train this model it will be a little bit different. Before, we could see that we had 99.5 percent accuracy, something like that; let's see what we have now. We have stored the accuracy and we can run it again. Beautiful, again a really high accuracy score, that is really awesome.
But as I've said, we wanted to look at the confusion matrix to get a sense of what it's doing right and where those few cases are where it predicted the wrong label. Let's first start with the classes, and we define those from the class probabilities output. These are the probabilities, and we can just take the columns and then we have all our labels. It's a quick hack to get all the labels from this data. The next step is to define a confusion matrix, and for this we pass in y test, class test y, and the classes that we've just created. Make sure to store this and then run it. Now we get a confusion matrix, but it doesn't really tell us anything yet because we still have to show the labels. That is what we can do with our confusion matrix plotting code, which I also stored in the document for you. So you can go to the confusion matrix code, copy this part, come back to the code and paste it. Again, I've put this in the document because there are some quite complex things going on here, and you have to get it just right to make it pretty. But basically what we can do now is plot a beautiful confusion matrix.
This is really awesome, because from this graph we can see what's going on with our predictions. Our model is almost perfect, but we can see that in some instances, four in total, there is a wrong prediction. Here we see the true label and here the predicted label. So for the row: when someone was performing a row, there were two instances in the test set for which
the model classified it as a deadlift. We have seen before, when looking at the raw accelerometer and gyroscope data, that a row and a deadlift look pretty similar, because some parts of the movements are actually the same. So it amazes me how well the model is able to differentiate between the two and get almost everything right except for two instances. The same holds true the other way around: we have a deadlift that it classified as a row, and we have one instance where it classified a bench press as an overhead press. Again, bench press and overhead press are very similar movement patterns if you consider them from the perspective of the wrist, where the sensor was measuring. So again, amazing results.
There's one final test that we still have to do. We are currently creating a stratified train/test split, but based on the whole data set. With all the data, you can imagine there is a lot of overlap between the records. We tried to get rid of 50 percent of that in the previous episodes with the feature engineering, but since we've added so many time variables with the window size, we have a lot of records that look very similar. So if you pick a random record from the test set, there is a very high chance that somewhere in the training set there is a record that is almost identical, because we are using the windows. In that sense you are still kind of cheating: yes, you are providing unseen data to the model for testing, but you are taking that data out of a larger data set that has a lot of overlap in it. So I could see how it's easy for the model to determine that something was a bench press if it has been trained on data from that same set from that same participant.
So the next step, and the final test to validate our approach, is to select the train and test split based on the participants. Remember, we have five
participants in total. What we are now going to do is remove participant A from the training data, meaning that we will train on all but participant A. In that way we will provide the trained models with data from a participant who was performing the exercises at a different time and is a completely different person, so maybe his or her way of doing a bench press or deadlift is slightly different from what the model has seen and what the model has been able to generalize to. So this is the ultimate test.
Let's first create the train and test split again. For this I'm first going to create a participant data frame and set it equal to the data frame, but I'm going to drop, and that should be in a list, the set and the category, because we won't be needing those. So let's get rid of this, add one parenthesis over here, and then we define the axis as one. Let's store that. The next thing we do is select our training data, so our X train, based on participant df where participant is not A in this case, and this should be in quotes. I'm really using GitHub Copilot over here, so it's actually pretty convenient to autocomplete it like that. Basically we want to select our X train, and then of course we also drop our label and specify the axis as well. So we create a subset of the data frame where participant is not participant A, and then we drop the label. For y train we basically do the same thing, and see, like this, see how cool this is: GitHub Copilot already knows what I want, so I can just press Tab, and it says we create a subset based on participant again, not A, but then just the label. So let's check this out: y train, awesome, X train, awesome. Okay, and now we're going to do the same thing for the test set, but the only difference is that, so this will be
X test and y test, but now we set this equal to A, meaning that we do the same thing but we filter by A. Now let's have a look at our X train and our y train. We can see we have a training size of 2500 and a test size of 1200, almost 1300 rows, so this is a nice way to split the data. There is a little more data in the test set compared to the original split, but that is even better for testing things out. Then, just to clean things up, I'm going to drop the participant column again. It's not strictly needed because we are selecting a subset of features anyway, but it's best practice to keep it out of the data set because we are not going to use it.
One more quick check we can do is create another bar chart, where we can again see how many records, how many instances, there are for each label in the data set. When we split based on participant A, we can see that there are not many rows left in the test set for one of the labels, but we can leave that for now. It's just important that there are still records in there so we can evaluate the predictive performance for that label. Overall we get a nice split that we can use to build our models on again, and then we basically apply the same trick that we've used previously. In order to use the best model again and evaluate the results, we can literally scroll up, copy and paste everything that we've done over here and put it in this block of code. Now everything is the same; the only difference is that X train, y train and X test are now the split based on the participant.
Again, we're doing this to really put it to the test. From a practical point of view, we want to create a fitness tracker that can generalize to new people, new participants. Of course you have an initial training set of people that work with you to collect the data and train the model, but once
the model is trained we want to ship it to our applications and to our wearables, so whenever someone buys the product it will work for them as well. That is why generalizability across different people is very important in this case. So let's just run everything again; it should take another minute, and then we get back to the results.
All right, the results are in. Let me first show the accuracy, and we can see 99.4 percent again. It's performing extremely well, and this is with the participant-based train/test split. That is amazing: it suggests that our model can generalize to new people. We can have a look here at the overhead press that is predicted six times as a bench press, so again the problem we also saw in the previous run, and the same for the deadlift. So in some cases it predicts the wrong label, but again we have a magnificent accuracy over here, almost perfect. This is a really exciting approach, a really exciting method to develop even further, I would say. I don't believe there is a consumer product on the market already, like an Apple Watch for example, that does this well enough to be actually practical. As I explained in the beginning, the Apple Watch can detect when you're walking or cycling, the more cardiovascular activities, so it's good at that, but not at lifting. And of course the fitness industry, people that actually do weight training, that market is huge, so I could see a huge potential for this. If you want to go ahead and develop this into a product, do it. I would buy it.
Now the next step, and this is quite funny, says to try a simpler model with the selected features. The previous time I ran this, the neural network came out on top, but only by a tenth of a percent. Like we've seen in the beginning, the neural network and the random forest perform almost the same. I stated there to try a simpler model because, in a sense, a
random forest is simpler than a neural network, and also the selected features. But now I'm just going to turn it around, so we're actually going to try a more complex model using fewer features. I'm going to copy and paste the whole thing again, and here you can also see how convenient it is that the code is structured this way. For now I'm going to set grid search to false. The neural network that we're going to use is a feedforward neural network, so we just swap out the model, and instead of feature set 4 I'm going to use the selected features, like this. So the selected features are the 10 best performing ones, with the neural network but without the grid search; we could even try that later on. Let's see how this works. We can run this again, and since it's not performing a grid search it should finish quite quickly.
All right, here we can see something interesting: we definitely have a lot of errors over here. Let's see what that did to the accuracy. Okay, so the accuracy is still good, but there is a significant drop compared to the earlier methods. This is actually quite interesting, because I'm now also wondering why the set of 10 features is so different from the set I was getting initially. So, and this is just an experiment that I'm going to try, I'm going to take my initial selected features. Initially when I ran this script I got these values. I don't know, maybe I missed something, maybe I made an error, maybe you can spot it, or maybe there is a stochastic process in here. I thought there was not, so it should be pretty straightforward, but it could be in the way the features are arranged when you're creating a set. Maybe something happens there that switches the order in which the columns are evaluated, and I could see how that could end up in a
different order. But these were the values initially, and I got a better score. What was also funny is that I got the original acceleration z in here, and also a principal component, and some temporal and some frequency data. If we look at the selected features that I got this time, it's all frequency, whereas this was a much nicer mix of original, PCA, temporal and frequency features, so it should provide us with more accuracy. Let's just overwrite the selected features, run this one more time and see if we can improve. Okay, so now we're seeing a couple more errors in the row, but it did increase the overall accuracy. Still, it does not beat the random forest, which it did this morning, so this is actually quite interesting.
I can also just take this one more time and train it again, and then you can see, look, now we have a lot more errors over here. This is the stochastic nature of a neural network, and that's why it's important to train your models multiple times when there is a stochastic element like this. Just for the sake of it, I'm going to try this a couple more times; I think I was just lucky this morning. Here it improved, and we can have a look at the accuracy again. Overall we're touching 99 percent, but it's still not as high as the random forest, which is quite interesting. I guess I was just lucky on my run this morning, because the neural network also achieved the 99 percent. But maybe it is because I'm switching things around now, so maybe that's the thing. I'm just kind of freestyling here as we're heading towards the end of this lecture. Let me swap the selected features, put in feature set 4 and do it like this. I think this should result in, yes, here we're getting that 99 percent accuracy again. So we can also see how the neural network benefits from lots of features, and you can see
that we have very few misclassified instances. Let me get rid of this and run it again just to check once more: also good results. So we see a general pattern: regardless of whether we use the neural network or the random forest, with all of the features there are some problems with the overhead press and the row, and here we have an overhead press that is classified as rest. In general we see this pattern where there are a few instances that the models get wrong, here again, and sometimes one more.
That brings us to the end of this classification lecture. I think it was really exciting. I've shown a lot, including lots of very useful blocks of code that you can repurpose later in your own projects. But let me get back to our discussion of the results. I've talked about this a lot already during this episode, but these results are really cool, quite fascinating, and as I've said, I think this is still quite groundbreaking stuff. When I did this project two or three years ago, something like that, I also performed a literature review, because I had to write a paper about this, to give an overview of what had already been done in the market and in the research world regarding this topic. There have been some researchers playing around with this, but I think there hasn't been a product that fully utilizes this potential. I was also looking on Google Scholar, where you can look up something like classifying fitness exercises barbell. There's one study over here that goes into a method, and you can see they classify four different exercises, also quite interesting, and they also get really good results. I show you this to illustrate what we are currently working on, what we have been working on, and if you have been following along up until this point, congratulations, really good job. We are working on actual problems that could be published as peer-reviewed research that you can
literally get publications for under your name. So that is the level we are operating at right now with this piece of code. Now of course, to really make it solid research you have to dig into a lot more nuances, really validate and evaluate everything, and do a lot more experiments, but the code structure would still remain the same. This is the setup that you can use to publish a paper, so I think that is really exciting. I found this a really interesting, exciting project.
It's not the end, though. I'm talking about it like this is the last episode, but there are some more things that we are going to do. As far as the classification part goes, this was it. Next we are going to create a custom algorithm that focuses on counting repetitions. That is another interesting add-on, and it would also be really beneficial in a fitness tracker: first classify which exercise you are doing, and then count how many repetitions you are doing, so you can log it in a tracker, in a file. That way you don't have to manually write everything down; you only have to input the weight.
If you are still watching, if you have been following along, I want to thank you for your time and attention. As always, I would really appreciate it if you like this video. By liking this video you show YouTube that you want to see more content like this, and it helps me, so it's a win-win. If you're not subscribed, make sure to subscribe to the channel; there will be a lot more exciting videos to come. And then I'll see you in the next one.
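To make the participant-based split from this episode concrete, here is a minimal sketch of the idea: train on everyone except one participant, test on that participant, then look at the accuracy and the confusion matrix. Everything in it is an assumption for illustration and not the code from the video: the toy data, the column names `feat_1`, `feat_2`, `participant` and `label`, the helper names `make_toy_data` and `split_by_participant`, and the plain scikit-learn `RandomForestClassifier` without grid search.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix


def make_toy_data(seed: int = 0) -> pd.DataFrame:
    """Build a small synthetic stand-in for the sensor feature table (hypothetical)."""
    rng = np.random.default_rng(seed)
    labels = ["bench", "ohp", "row"]
    blocks = []
    for participant in ["A", "B", "C"]:
        for i, label in enumerate(labels):
            # Each label gets its own cluster centre so the task is learnable.
            feats = rng.normal(loc=3.0 * i, scale=0.5, size=(40, 2))
            block = pd.DataFrame(feats, columns=["feat_1", "feat_2"])
            block["participant"] = participant
            block["label"] = label
            blocks.append(block)
    return pd.concat(blocks, ignore_index=True)


def split_by_participant(df: pd.DataFrame, held_out: str):
    """Train on everyone except `held_out`, test on `held_out` only."""
    train = df[df["participant"] != held_out]
    test = df[df["participant"] == held_out]
    drop_cols = ["participant", "label"]  # keep the identifier out of the features
    return (
        train.drop(columns=drop_cols),
        train["label"],
        test.drop(columns=drop_cols),
        test["label"],
    )


df = make_toy_data()
X_train, y_train, X_test, y_test = split_by_participant(df, held_out="A")

# Fixed random_state so repeated runs give the same forest.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)

accuracy = accuracy_score(y_test, pred)
classes = list(model.classes_)
# Rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_test, pred, labels=classes)

print(f"accuracy on held-out participant: {accuracy:.3f}")
print(pd.DataFrame(cm, index=classes, columns=classes))
```

Wrapping the matrix in a data frame with the class names as index and columns is a quick way to read off which true label was confused with which predicted label before making a prettier plot. Repeating the split with a different `held_out` participant each time would give a fuller picture than holding out a single person.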
Original Description
Want to get started with freelancing? Let me help: https://www.datalumina.com/data-freelancer
Need help with a project? Work with me: https://www.datalumina.com/solutions
Welcome back! In this video, we are going to code the actual classification models and make predictions on the data. We'll create a train/test split, select feature sets, perform a grid search for hyperparameter tuning, select the model, and finally, evaluate the result.
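As a rough illustration of the workflow described above, the following sketch loops over a few nested feature sets and a few scikit-learn classifiers, runs a small grid search where a parameter grid is given, and collects the test accuracy in a data frame. It is a sketch under stated assumptions, not the code from the video: the synthetic data, the column names `f0` to `f5`, the three feature sets, and the parameter grids are all made up for the example, and it uses scikit-learn's `GridSearchCV` directly instead of the learning algorithms file from the course.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the sensor features (assumption).
X_arr, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Nested feature sets: each one adds columns to the previous one.
feature_sets = {
    "Feature Set 1": ["f0", "f1"],
    "Feature Set 2": ["f0", "f1", "f2", "f3"],
    "Feature Set 3": ["f0", "f1", "f2", "f3", "f4", "f5"],
}

# Each model comes with an optional parameter grid; None means no grid search.
models = {
    "KNN": (KNeighborsClassifier(), {"n_neighbors": [1, 5, 10]}),
    "DT": (DecisionTreeClassifier(random_state=0), {"min_samples_leaf": [2, 10, 50]}),
    "NB": (GaussianNB(), None),
}

rows = []
for set_name, cols in feature_sets.items():
    for model_name, (estimator, grid) in models.items():
        if grid is not None:
            # Cross-validated search over the grid, refit on the best setting.
            estimator = GridSearchCV(estimator, grid, cv=3, scoring="accuracy")
        estimator.fit(X_train[cols], y_train)
        acc = accuracy_score(y_test, estimator.predict(X_test[cols]))
        rows.append({"model": model_name, "feature_set": set_name, "accuracy": acc})

score_df = pd.DataFrame(rows).sort_values(by="accuracy", ascending=False)
print(score_df)
```

A table shaped like `score_df` can then be passed to seaborn's `barplot` with `x="model"`, `y="accuracy"` and `hue="feature_set"` to get a grouped bar chart of models against feature sets.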
👉🏻 Source material for this week: https://docs.datalumina.io/GQBbHfi4hJA0FV
⏱️ Timestamps
00:00:00 Introduction
00:03:21 Loading the data
00:03:52 Create a training and test set
00:11:01 Split feature subsets
00:19:46 Perform forward feature selection using simple decision tree
00:28:57 Grid search for best hyperparameters and model selection
00:40:06 Create a grouped bar plot to compare the results
00:43:11 Select best model and evaluate results
00:50:04 Select train and test data based on participant
00:58:02 Try a more complex model with the selected features
01:03:45 Discussion of results
Project overview (what you will learn)
Part 1 — Introduction, goal, quantified self, MetaMotion sensor, dataset
Part 2 — Converting raw data, reading CSV files, splitting data, cleaning
Part 3 — Visualizing data, plotting time series data
Part 4 — Outlier detection, Chauvenet’s criterion, local outlier factor
Part 5 — Feature engineering, frequency, low pass filter, PCA, clustering
Part 6 — Predictive modelling, Naive Bayes, SVMs, random forest, neural network
Part 7 — Counting repetitions, creating a custom algorithm
Link to playlist: https://youtube.com/playlist?list=PL-Y17yukoyy0sT2hoSQxn1TdV0J7-MX4K
If you find these videos helpful, consider subscribing @daveebbelaar