Automating Supervised Machine Learning Pipeline Development | Machine Learning | Community Webinar

Data Science Dojo · Beginner ·📰 AI News & Updates ·4y ago

Key Takeaways

The video discusses automating supervised machine learning pipeline development, covering data preparation, modeling, and rollout, with a focus on production and business impact. It highlights the importance of mastering machine learning basics and using techniques like dimensionality reduction, feature scaling, and correlation analysis to improve model performance and interpretability.
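
As a rough illustration of three of the preparation steps named above (category encoding, feature scaling, and correlation analysis), here is a minimal Python sketch that uses only the standard library. The function names are hypothetical and do not come from the webinar; the speakers mention scikit-learn and SciPy, which would normally be used for this in practice.

```python
from math import sqrt


def one_hot_drop_first(values):
    """One-hot encode category labels, dropping one column (n categories -> n-1 columns)."""
    categories = sorted(set(values))
    kept = categories[1:]  # the dropped category is implied by a row of all zeros
    rows = [[1 if v == c else 0 for c in kept] for v in values]
    return kept, rows


def min_max_scale(column):
    """Rescale a numeric column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant column: avoid dividing by zero
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


if __name__ == "__main__":
    print(one_hot_drop_first(["red", "green", "blue", "green"]))
    print(min_max_scale([10, 20, 30]))
    print(pearson([1, 2, 3], [2, 4, 6]))
```

A correlation close to +1 or -1 between two features would flag them as candidates for the collinearity checks discussed in the talk, while a strong correlation between a feature and the label is desirable.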

Full Transcript

Hi everyone, my name is Nathan and I am the marketing manager at Data Science Dojo. I have Tom Ives and Gaith Tankari with me. Tom is the lead data scientist at UL Prospector and Gaith is the digital health platform advisor at Bayer, and they're going to be presenting "Automating Supervised Machine Learning Pipelines a Little Bit at a Time." So Tom and Gaith, why don't you go ahead and get started?

Right, and I should qualify: I am no longer with UL, but I appreciate it. I'm even more proud of being founder of Integrated Machine Learning and AI; that's a much more important role.

Got it, I'll update your bio. Thank you. Sorry for that, that was my mistake; we planned this quite a while ago.

Well guys, we're really honored to be here. In case you can't tell, Gaith and I are brothers; we were separated at birth, so we ended up being raised differently. But seriously, we're very close friends, we love to present together, and we're even writing a book together on this material, and we hope this talk will help you. In our experience there are a lot of smart people in data science, and when they are exposed to this material many of them have an attitude of "oh, finally, I'm seeing it all as one cohesive set of methodologies," and we hope you have that reaction too.

I want to read this out loud to you, because I think it's a powerful bit of wisdom that we can use in many areas of our lives: "I've encountered many great mentors in many forms. One powerfully recurring principle taught by all of them: the greatest practitioners in any field take the time to master the basics. The basics are the foundation of any great art in life. Masters literally practice the basics tens of thousands of times." Gaith and I are all for the latest cool tools and techniques, but they will crumble like a house with a poor foundation if they're not built on good basics, on fundamentals.

So this is what we're going to talk about, and I hope this looks a little overwhelming... well, I don't hope it looks overwhelming, but I
suspect for many of you it might look overwhelming, but we're going to break it down piece by piece. Really, this is just trying to get the best data so that we can do good predictions. We liken it to picking fruit: we want to pick really good fruit so we can eat healthy fruit, and we want to throw away the bad fruit. Same thing with data. Well, that was a fancy long transition, sorry for that. Gaith, why don't you take over from here?

Okay. Before going to what the pipeline is, first I have to thank Tom for this introduction and for our brotherhood, let's call it. The second point is about the pipeline. What is the pipeline? In general, let's consider it as an assembly line. Since we are looking at production, the target of our talk today is to think about how to push our machine learning models into production, what kind of activities are needed to do that, and how to support the usage and get the benefit from this huge boom of machine learning during the last years in production, in the real life of people and in real business scenarios.

Now, the difference between a machine learning experiment in general and what we are looking for is the assembly line. We always use an analogy: the automobile was invented by Karl Benz at the end of the 19th century, but the real boom of the automobile industry appeared with Henry Ford, when he created the assembly line for these machines. According to that, we can consider what Karl Benz did as the pure knowledge, the vertical growth of this field, and what Henry Ford did as the horizontal growth. Our talk today is about, not only but mainly, how to focus on the horizontal growth to get the best benefit from machine learning techniques.

So in our pipeline, let's just try to explain the
basics. What do we mean by features? Features are the input to our model: the information, the pieces of data that we are using to create a prediction or analytic model in general. This can be performed in a data science model or in a machine learning model. So what we call a feature is, in the end, the input, and the input has three major properties. First, it is a sample from the overall population, because it's not possible to have the whole population in one model, one modeling process, or one pipeline. Second, we always have to use inputs that really have predictive value, reduced to the essentials, because any correlation, any relationship between the inputs, might lead to a mistake or to bias in the output; we will explain how this appears later. And third, sometimes performing some kind of engineering on these inputs will be needed, because this engineering will help the math machine that we are running, which is the model that we are building, to perform in a better way.

That is the feature in general. What is the output of our model? It's what we call labels. Labels are what we are expecting. When we are talking about supervised machine learning, the labels will be provided in our first training data set. We have to find these labels, and sometimes we have to perform some kind of engineering on them, because this will tell the model what to expect and what kind of relationship exists between these labels and the features, or inputs, that we are using.

What is the model? The model is our math machine. There are multiple ways to find the model, and we have to find a methodology to compare the models with each other. And for sure all of this has to be performed under human overwatch, because there is always some risk, let's say some bias or some drift, that
might be discoverable, or sometimes it might be hidden and will only be discovered over the production life. So this is our highest level of thinking about the pipeline. What is the pipeline? It's a set of processes performed over the inputs to find the outputs, using models, or math machines, let's call them for now, under the overwatch of the engineers, the machine learning engineers or data scientists in our scenario. Let's go to the next slide, please. So this graph, Tom, if you want to explain it; I think your view of it will be clearer than mine.

Yes, I'll take this one. We just wanted to give you the highest-level view of what happens when we're getting a model ready to make predictions. We have our inputs, and we run them through the model, let's call it a math machine, and it makes some predictions. The very first ones are bad, because it's just using some random parameters for the model; we can call them weights. Then we take those predictions and compare them to our known outputs, like I said, and we get an error. We use that error to change the weights in the model and make the next iteration's prediction, and we keep repeating this until the error gets as small as possible. One thing we like to say about this is: all machine learning problems are math problems. And getting the data ready for this kind of model training can take a lot of work, especially with real-world data.

Here I just want to add something. The previous chart is how we build our model. But again, when we are talking about the supervised machine learning pipeline, we are talking about how to build the model and how to make the full pipeline, the full assembly line, ready to be used in production, and that includes some other activities we might discuss during the session. In the end we are looking at this
pipeline at the high level. Let's go to the next slide. So the features are the data, our input; they are what's driving our model performance. But what kind of action is needed for them to be ready, and which kind of data are we able to use? This is decided by our data preparation pipeline, let's call it. At a high level, this data preparation pipeline starts with processing the missing values, then cleaning the data, including text, because, let's remember what we said on the previous slide, all machine learning problems are mathematical problems, and text has to be represented in a mathematically friendly way. So we encode the text, and we have multiple methods to do that. Then we have to normalize these inputs so they are better controlled and in the range needed. Then we have to reduce the dimensionality; let's discuss this later, because once we start with this concept we cannot stop. And then we perform some kind of engineering on these features. At that point we can consider that our data is ready to be fed into the model, and we train the model accordingly. That means training the model is one step, one iterative step; it has a lot of parameters to be controlled, but it's one step. All of these steps come before it, and we also have other steps after it. That's why we used to say that 80 percent, or more than 80 percent, of machine learning model building is data preparation, while the model training itself will be only 20 percent of the work, let's put it that way. Anything to add here, Tom? Also I think it's good to check the questions for the previous part, just to keep it synchronized.

Quickly: yeah, no questions are coming in yet. But basically, in summary, our features require much attention, and sometimes even our labels, and some of that attention we
have to give is to missing values. This is a good chart from Medium that we stole, because it's a good chart. To us, missing values are a travesty; we hate them. Sometimes they're so bad that we find we have to either delete rows, or delete complete columns, which would typically be a feature, or there's pairwise deletion. We would only do this pairwise deletion when we're comparing a pair of columns, when we're looking for correlation coefficients between features; we'll talk about that more in a minute. But our preferred way to deal with missing values is to impute what they most likely were, and depending on whether we have a categorical or a continuous problem, there are different methods for predicting. Sometimes the mean or the median or the mode are okay to use for replacement, but more often we'll predict the values that are missing, and that actually ends up working pretty well. Time series has its own special considerations when we do that.

Then we also have to clean the data, because just because we replaced the missing ones doesn't mean we're out of the woods or that there are no more problems. We can have things like too much data, meaning too many features; we can have outliers that we need to go study, or inconsistencies; we can see strange patterns, etc. So we want to diagnose these and figure out how we want to deal with them. If you remove outliers, the main thing is to go study them and document why you removed them, because those outliers can tell you a great story sometimes, or a horrible story that you need to understand and deal with. But a lot of what we encounter is that perhaps someone didn't program the data entry system correctly, and a value that should be a float or an integer is being entered as a string, or maybe a date wasn't formatted correctly. These are all things whose cleaning we can automate.

Exactly, Tom, this is a huge point that
you made at the end, which is that when we have a clear understanding of how the data is collected, we can check our routines, our mechanisms of data cleansing, to select what's suitable from them. So we are mentioning here a new term, which is the mechanism. A mechanism is any kind of function or procedure or routine that you build to perform a specific task, but we have to write it in a generalized way that will help us to use it later. This is the major step and the basic step for automation. So how can we... yes?

Oh sorry, Nylu's asking a good question: "Is the removing of missing values a data cleaning process?" Sort of. If we're going through and making sure we get a value in place of the fields that have missing values, sure, we could call that cleaning, but we usually call it a special type of problem where we're dealing with missing values. Gaith, go on, sorry to interrupt you.

Sorry, I want to interrupt as well. We have a question on YouTube from Xiaom: "What are the ways for handling new categories that appear after some time when our model pipeline is in production?"

Ah, okay, that's great, but it will come next. Okay, will you remind us to answer that later in the presentation? That's a great question. For sure. Yes, exactly.

I just have one point related to the missing values. We need to mention that, since we are talking about supervised learning, the given labels that we have in our training data set can help us to predict the missing values, for example when we have categorical missing values and we need to do the imputation. The label itself is part of our data set; we know that it's our label, but it can help even in processing the missing values, and honestly it plays a major role here sometimes.

Right, okay, great. I think on the encoding it's your game.
Okay, I'll go ahead with that one. In case y'all can't tell, we kind of switch it up each time we present on who's going to do which slide. Like we said early on, sometimes we have data that comes in the form of text, but we need numbers, because all machine learning problems are math problems. So here's one way we do it; it's called one-hot encoding. In other words, we want to take each of these words and encode them to a value. When we have this type of categorical data, such as these colors, we create completely new features for each color, and then if that color appears it's a one; if it doesn't, it's a zero for that column. Because in a case like this example, where green occurs, red and blue will be zero, we want to remove one of the columns. So if we have n categories, we're going to end up with n minus one columns, because that last one, or the first one, or whichever one you remove, is described by all the other columns being zero. To make a long story short, it just makes the math cleaner; it creates fewer problems in the math, in the solving for the models.

Now, the other form can be ordinal, where it's clear we had some kind of scale of quality or goodness indicated by the words. So in this case "bad" is zero, etcetera, and "excellent" is four. That's ordinal encoding.

Now we can get fancier and fancier, and natural language processing deals with a lot of very intricate, beautiful, powerful encoding schemes. But for this one, let's say we had a corpus of documents, and we record all the words that occur in all those documents; we get a vocabulary for the corpus. Then we say each word has a number. Now we go back through each of the documents and we ask: how many times did word zero occur in doc zero, and how many times did word one occur in doc zero? So we're getting an occurrence rate for each of the words, and then we do the same for the next doc and the next. This ends up being a very sparse, gigantic matrix
depending on the size of your corpus and vocabulary, but thanks to SciPy and its sparse matrix routines you can take a huge matrix and find close matches of one document to another, say a new document coming in closely matching one of these other ones, in less than a second, even for very large matrices. So these are just some examples of how we encode words to numbers.

Yeah, Tom, sorry for the interruption, just one question that always comes to the mind of a data scientist: when to use, for example, one-hot encoding, and when to go to ordinal encoding? You already made it clear that it depends on the nature of the values, of the input itself. Besides that, sometimes, even if it's categorical information just like the colors here, which can be classified easily, the size of this matrix can be very large. At that point it might be wise to go to ordinal encoding rather than one-hot encoding, because it will be easier to represent from a practical point of view. So this is what I want to add here.

Yes. We always want to reduce the dimensionality of our data set if we can, so if possible use ordinal encoding, which is a one-to-one replacement, versus one-hot encoding, which means that for the number of different words we had, we had to come up with that many new features. So that's an excellent point, Gaith. Just so you know, a lot of the reason we're talking through all these principles is that we use them to help make our data set as small as practically possible. Now the next thing we're going into is normalizing our feature set. Gaith, do you want to talk on this, or do you want me to do it?

No, no, please go ahead. I will comment; I will go for the reduce.

Excellent. So what's important about scaling is that it gets all our numerical ranges within an equal range, so that we don't have one feature being a much larger magnitude than another. This helps us later when we're doing the
modeling, because some of the types of models we use give weights on the features; they tell which feature is more important than another to the predictions. But we can't rely on those weights unless the features are scaled within similar ranges. So this is really important for that reason, and for other reasons, numerically speaking. Okay, now Gaith's going to talk about reducing our model to make it as simple as possible.

Okay. So yes, these images here exactly explain our situation as data scientists: we are always looking to reduce the dimensionality. But what is the dimensionality itself? What does this term mean here? Let's consider a line equation, for example, or a circle equation, which has only two dimensions, x and y. Now in our case, if you want to explain what the dimensions are in our data set, you can consider that the number of your inputs, the number of your features, is the dimensionality that we have. According to that, we are talking about a space with many, many dimensions. Whenever we are able to reduce the dimension, it will give us more clarity about how the model will perform and where the gap is, if there is any gap. I have a translated saying about that: the higher the dimensionality, the less you can state about your model; you cannot say a lot about it. This is the scenario we face in deep learning, for example; that is not totally different, but it has its own management methods.

So how to reduce the dimensionality, and why is high dimensionality bad? First I will follow the slides: redundancy, and the individual effects and accuracy on the model rates. Let's consider that, for example, we have two inputs having the same role toward the output. What does that mean? For us, we can consider them as one input. That means one of them can be removed, because they are
collinear to each other. That means we need non-collinear inputs to be used in our production model. Why? Because the collinear features, just like in the chart on the right, behave the same toward the output, and that means we are giving this feature a higher value, let's put it in plain English like this, a higher value than other features. So once we have this collinearity, it's very wise to consider removing one of them, and there are specific methods for it. How to do that? We have multiple methods: one is the correlation matrix, which explains the relationship between each feature and the others, and we have other methods as well. We can discuss the solution later, but let's go to the mechanisms that we are using for reducing the dimensionality. Tom, if you want to go on that.

Yeah. We're constantly trying to improve this slide set, and this one sometimes confuses even those that have used it many times. But basically, going back to our initial mind map, we're trying to reduce our data set. Why? Well, because it might have collinearity; we just covered that. We don't need two features playing the same role. It'd be like two guys on a soccer team trying to play the same position; well, what happens to the empty position? It's just better to have one feature serving each task. So we keep the strongest of those collinear ones, like I said, but at the same time we want strong correlation between the label and the features. So what are our methods for that? Well, we can loop through the features and see which one gives the greatest accuracy for the model when we can use just one feature; then we eliminate that from our list and go back and look for the second most important one, and then the third most important one. And we can do that in reverse too. So that's a very strong method, after
we've gotten rid of collinear features, or even before, to see which features are most important; but if you don't get rid of the collinear features, that can confuse that analysis. Let's see, we just want to show you what it means to be correlated. This would be perfect positive correlation, this would be perfect negative correlation, with everything in between, where you can see there's not even a clear correlation at this point. Now, negative doesn't mean bad; it just means that this particular feature tends to reduce the value of the label. There are different ways to calculate these correlations. Then there are the other methods. We talked about looping. Lasso, which is a penalization method used in linear regression, can help you: when you're using lasso, it will drive the coefficients of those features that aren't as important down to zero while you train a linear regression model. And while we're doing these methods, we're always using metrics on the models. We're looking not just for accuracy but for how well we generalize, meaning: it trained well on the training set, but how well will it do with new data? We have techniques to test that with our training data. But always there's this fancy word, parsimony, or this American acronym that's been around for my whole lifetime, KISS: keep it simple, stupid. We're looking for the set of features that are essential; we don't want too many, we don't want too few.

And then there's a soft method that we call principal component analysis. If you are familiar with eigenspace, eigenvectors, eigenvalues, this is what this is, and we love going into this space when we need to, because if you have a lot of collinearity that's hard to get rid of, it will be decoupled in PCA. However, we're always trying to explain what's going on. Once you move to that new space, to the eigenspace, that special space where all the features are decoupled, it's got some great tools to
tell you which features you can drop and such, but don't get sucked into thinking that that means it changes the features in your original space. Oh no, it doesn't. But if you use the eigenvectors, they can help you to describe the relation between the eigenspace features and the original-space features. And if you want to connect with or follow Jonathan Papworth, my son, Gaith's brother, he can explain very clearly how to do that visualization of the relation between the two spaces. Gaith, what would you add here?

Yes, thank you. Maybe a couple of things. The first one is related to the correlation. Yes, we want to get rid of collinearity, but we have to do it carefully, because sometimes we need to use two collinear features that we cannot get rid of. One example is the negative correlation that we saw on the previous slide: sometimes, because the correlation is negative and because of the nature of our task, we need to keep both features even if they are correlated, unless we find maybe a third feature or something else to tell us otherwise. I will give one example. If you are predicting anything related to your health and you are using cholesterol information, there is always a correlation between low-density cholesterol and high-density cholesterol, the bad cholesterol and the good cholesterol, but you cannot remove one of them, because each one of them plays a different role in your prediction. This is the first point.

The second point is related to PCA. Here too we have to be careful about the nature of our task, because sometimes PCA is not a good mechanism to follow. I have faced that in my own experience mainly in unsupervised learning tasks, specifically clustering, because PCA moves from one space to another. So even if we understand the contribution of each original feature to the eigen
feature, or PCA feature, it will not be enough to understand how the clusters are created on the PCA features. So sometimes we have to be careful not to continue the modeling with the PCA features, and just use PCA for deciding whether we are able to remove some features or inputs or not. So this is a practical example of what we are talking about.

This is an excellent point Gaith's making. Actually, sometimes you have collinearity and you don't want to get rid of it; in that case you could go to PCA, or you can use one of the modeling algorithms that's not sensitive to collinearity. So we should point that out too. The nature of what you need to communicate from your work to your organization will dictate whether you leave collinearity in, because it doesn't affect clustering, and perhaps we want to see the relation of those features. A similar thing happens in diamond analysis, where there are different measurements and they all closely correlate to carat size, but you may want to keep all those features in as a description of the model. Then you can go to PCA, or if you're using another algorithm you don't even need to worry about the collinearity as much.

Okay, maybe before the engineering let's check the questions. I think there's a question coming from YouTube. Nathan, if you are here, are you able to help us with that?

Yeah, I have a couple. The first is: "How about if we do not have ordinality in our feature and it is hard to do one-hot encoding because of reducing dimensionality? Let's say we have 1,000 different categories for a feature." And I posted these in the chat in case you want them.

Yeah. Okay, how to answer this briefly. Remember, this becomes very binary at this point, and if you have a lot of one-hot encoded features you're going to have big areas of your feature set that are zeros and ones. In that case, for the features that were originally numerical, I would choose
to scale them from zero to one; that's usually min-max scaling. But just because your data set gets really large due to one-hot encoding doesn't mean it's going to be a horrible problem. For example, many of the scikit-learn algorithms will accept sparse matrices, and that makes a huge difference in your training speed. So again, it's a goal to reduce the number of features, but that doesn't mean you should shy away from using, you know, a thousand one-hot encoded features if you need to; you're just being careful to look for other methods to deal with that rapidly.

And then this next question from YouTube: "What about correcting a correlated feature for its correlation, as is commonly done by Wall Street with the U.S. stock market and the price of Boeing stock?"

Now, I'm not sure what you mean by correcting a correlated feature for its correlation. Do you mean when there's collinearity? Do a follow-up question for that, and then we'll move on and look again.

So now, once we have this data set reduced... and by the way, sometimes you want to engineer and then reduce, sometimes you want to reduce and then engineer; it just kind of depends, but typically I've found you're safe to reduce first and then add engineered features. What do we mean by engineered features? Gaith, do you want to take this one? Or, go ahead, please.

Oh sure. There is just one point, sorry for the interruption, before starting on the engineering. We didn't highlight clearly why we need to reduce the correlation or the dimensionality. There are two clear reasons, which are very well known, and there are also other practical reasons. The first one is raising the explainability of your model: once you have a controlled and suitable number of inputs, explaining the model will be easier. The second one is the cost: with higher dimensionality you are going to use more resources, and this can grow in a very dramatic way, which is
where it can end up in deep learning scenarios, which we are not running away from; but in the end, once you have the possibility to make it simpler, it's faster, more cost-effective, and easier to explain.

And to Kareem's question in the Q&A, "Is it a rule of thumb to avoid features that have some sort of dependency on each other?": it's more than a rule of thumb. Depending on your modeling algorithm, it can create very bad numerical issues: singularity in your matrices, underdetermined matrices, etc. Then you ask whether somehow combining features is a preferred method to reduce. Possibly; again, it depends on the features. I can imagine situations where you've got collinear features and a combination of those can be the replacement for the collinear ones. And that takes us nicely to this engineered example: it could be that we engineer a feature to replace those collinear features. I have never thought of answering in this area that way.

And then Jonathan, my son, he's asking: "Outliers are often removed. Can you share some good examples where outliers would be welcome to train a model?" Yes, I'll come to that in a minute; Jonathan, make sure we answer it.

Okay, so here's an example where I have feature x and labels y, and this is a scatter plot between the two. This is some simple univariate problem, but we see that if I'm looking for a linear relationship between these two, I don't get it. But if I engineer a feature and then fit the model, oh, it does so much better. And frankly, if I had plotted x squared here, this would have looked linear, and I should have done that; we'll update the presentation with that.

Oh, Mustafa, great question: "Having the collinearity, how to prove the causality?" It's not that you prove the causality. But let's say, how do you find out that you have collinearity? You look at that matrix of correlation values, and what you're doing is asking how feature A correlates to feature B, to feature C, to
feature D, and so on. That way, when you have high correlation numbers between the features, that's bad; when you have high correlation between any of the features and the label, that's good. But how do you prove the causality? You'd have to get into each specific domain for that and ask: does one cause the other, or do they just relate intrinsically? It's more like that.

Oh, and Jonathan, I thought of a good way to at least give an initial answer to your question. If you know those outliers are going to fly in through your pipeline when you're using the model in production, it'd be great to leave something in place, like either a scaling mechanism or a regularization routine like lasso or ridge, that helps to reduce their effect. But it also could be that you've trained a deep neural network and it somehow learned to deal with those very well.

All right, so this is just very simple feature engineering. You can get very complicated with it and engineer a lot of additional features. Now, some of you may know more (and Ghaith, please jump in at any point here), and some of you may be thinking: well, deep neural networks are good at figuring out their own feature engineering. Yes, but our point is that we still want to reduce. And yes, deep neural networks can figure out which features aren't important. Again, it's about the training, the parsimony, the simplicity of the model: every bit of hand work that we can automate to reduce the dimensionality up front, before we go to test deep neural networks or deep learning, can help us in the deep learning too. Ghaith, what would you add there?

Yes, actually, I want to describe deep learning: it is just like filling a glass of water from a waterfall. That means you are using a lot of data with a huge structure to perform one task. In the end, you are sure this task will be performed, but the wasted resources will be high. So even with deep neural networks, with deep learning scenarios, yes, it helps us; by its own
let's say, methodology, it will find the suitable features to use. But currently we are in the age, let's say, where we are looking even to simplify the training methodology of the neural networks. For example, there are the sparse brain papers, if any of you want to check them, which create a kind of mask of zeros for some neurons in the hidden layers, just to check whether that is enough to reduce the size of the neural network, and this will help them run multiple experiments with the same resources. So even with deep neural networks, we are running away from a huge cost, to reduce the size of the neural networks and the training time needed. So yes, reducing and engineering the features will always be helpful, even in deep learning scenarios.

Exactly. So as you've seen, we say 80 percent; it's probably higher, right, Ghaith? Yes. That's the amount of work we need to put in on the features, and this doesn't count all the code we're trying to refactor and make better, and everything else.

Now, labels: do we ever engineer the labels? You bet we do. Sometimes that simplifies the whole problem. Do we ever scale the label? Well, I've never seen it, but we always want to remain open and explore when we're doing this, and again, it's because we're trying to get the best fruit.

Now, just so we have plenty of time for Q&A, Ghaith and I are going to rush through the rest of this. These are the different types of models we have in machine learning. We have supervised learning and unsupervised learning. In supervised learning we can have classification problems or regression (continuous) problems, and in unsupervised learning it's stuff like clustering, understanding groupings. But there's even a higher branch up here called reinforcement learning. And then, boy, I guess transformers appear in both; they can simultaneously appear in unsupervised and supervised. But how
do we train these supervised models? Well, we randomly shuffle our data and then we split it, mostly into a training group and the rest into a test group, and we prefer to do that in what's called k-fold cross-validation; we'll explain that on the next slide. But we're also exploring hyperparameters that are specific to each modeling algorithm. And as we try different algorithms and different hyperparameters, we have to have a way to measure the difference, so we have these metrics. And again, we're looking for generalization: yes, it works on current samples, but how can we get a guess of how it will do on future samples of data, and then improve on growing data? And then we've got to change the model as needed. And this gets back to that question that was asked earlier, what about...

Excuse me, about the new category that might appear in the categorical data: yes, sorry, I would like to answer this. Once we train the model and it's in production, we always keep tracking the model performance to detect two kinds of issues that might appear: first, data drift, and second, concept drift. Now, a new category might appear. This is considered data drift, and in this case it might be suitable to rebuild your data set and retrain the model; maybe the same model, meaning the same algorithm, can be used. Concept drift means there is a total change in the meaning of the input itself. That means something was describing the height, or the distance between two cities, and now it has changed totally; there is a different meaning to that input, which might happen in real life. At that point we have to run the full pipeline once again, starting from building the data set and engineering the data; everything has to be performed once again according to the new issue identification, let's name it like this. So this is just to answer the question about a new category that might appear, because always in
production, we keep monitoring for one of these two kinds of issues that might appear in the model performance.

And another good thing to point out: we were really glad you asked that question, because it deals with concept drift. But data drift can actually be tracked by asking: hey, am I approaching the central limit theorem? So you use Google to see, is the overall distribution of my sample means approaching some steady-state shape? That can help too. So in this whole process, though, with these metrics we're driving again toward parsimony. We have a motivation to automate all of this: the training, the metrics, and the choice of algorithm.

Okay, now this is what cross-validation is, and let me just say it in the simplest way we've figured out: we want each portion of the data to get an opportunity to be the testing data. So you see, each time we're really just changing which group of data is going to be used for testing, and the rest of it is used for training. But by going across the folds, what are we looking for? Yes, we want good accuracy on each fold, but we also want a tight distribution of accuracy. That tells us the model is generalizing well, at least as well as we can do with the current data we have, for new data that would come in.

And then always we're doing human oversight. This is why I jumped to this slide while Ghaith was giving his excellent explanation of how we have to monitor models that are in production, or being used, to see: is there concept drift? Is there data drift? Is there a different model that we could put in production that would do better based on our current data assets?

Well, after going through all that (oh, by the way, always visualize; that's why this histogram is here; across all of this work we're always visualizing everything we can), we hope this doesn't look so overwhelming anymore. And quite frankly, this doesn't tell the whole story; it's just kind of a first-layer overview of everything. We'll leave it on this slide and
take questions now, or more questions. Thanks for the questions.

Just maybe one thing I want to mention here: why this brief explanation? Because all of us, when we are in production life for machine learning, face some tools, orchestrators or something like that, which help us automate what can be automated and build our workflow, our pipeline, to perform our task. So in the end, these are the concepts, these are the basics that we are looking for. We are trying to be tool-agnostic: any tool can help us perform that, as long as we know exactly where we started.

Exactly. And by the way, think of the advances. I mean, Henry Ford was brilliant when he applied assembly lines to making automobiles, but think of the improvements in manufacturing that have been brought to us by quality engineers with their Lean Six Sigma and their black belt analysis. I could go on and on about the cool things that they do. Nathan and I are rigorously planning, over the next few years, to take every bit of wisdom in those processes, abstract it, and help find ways to improve this horizontal genius in creating pipelines.

Oh, another thing we're trying to remind each other to say, when we get great opportunities to talk to smart people like you guys, is this: for that 80 percent or more of the work that we're doing before we get to training the model, we want to get better, as a whole, as a community, at communicating the insights we're gaining from that back to our business stakeholders, our organizational stakeholders, our domain experts, because we feel that 80 percent of the insights come from that 80 percent of the work. We can react to, we can be proactive about, those things, whereas when we get a model prediction, it's just at the reaction stage. So it's not to say the predictions aren't important; it's to say that all this stuff we showed you here on the right is huge and important. For example, if you're finding missing values, go find where the
data is being entered and make sure that can't happen anymore. If you find the source of the dirty data, go find a way to make sure that type of data doesn't keep happening. The encoding: once you know how you need to encode something, go ahead and automate that outside your machine learning work. And normalization.

Since we are talking about that, I think it's good to check this question from YouTube that Nathan posted in the chat: can you talk about some of the model retraining approaches in production? Do you want to do it, or me? You go; you can start. I'll start.

So this is probably a big point we may not have made very clear. The code mechanisms that we've developed are automated, but when we're building an automation to test different models outside of production, that's different from the actual pipeline. Imagine that this study code we've created has a bunch of dials or settings we can change: which scaling mechanisms we use, which encoding we're using, which features we're removing, what dirty data we're cleaning, all of that. We have a lot of different things we can try outside of production, but we're putting a fixed version of the pipeline into production. It's outside of production that we're constantly looking for new alternatives. I hope that helped. Ghaith, how would you explain it further?

On the other hand, we have to consider that we are explaining here the first two levels of the overall pipeline that we are looking for, which are data preparation and modeling. The third level is related to the rollout, which is hugely supported by software engineering concepts and, let's say, modern software engineering life-cycle concepts like continuous integration, continuous deployment, and tracking. We can even go to the architecture part, for example microservices. And beyond microservices, sometimes you are unable to share your data; the data privacy requirements are very high, and you have to follow some learning methodologies, just like learning
at the edge, for example for autonomous cars or something. I can say that this third part, the rollout of the model, will help you to always retrain once the model is in production. That's first. Second, no model will be valid forever; all models have to be retrained. But it's our task to define how to retrain and which level of retraining we are following, and the level of automation we are looking for. We can choose to go manually and decide what the drop time is for us until we push the second model. And the policies here, project management concepts, and most importantly application life-cycle concepts will help us define exactly which action is needed at which stage.

Yes, yes. And by the way, someone out there may be thinking: what about AutoML? AutoML is awesome. We're actually teaching you the beginnings, not all, of how to approach making your own AutoML. But AutoML doesn't necessarily do feature selection or feature engineering. You could add it, but you also need to be agile to change for weirder problems. Again, this is about mastering the basics for basic, typical machine learning, but a lot of the things we work on can get kind of weird, and the better you know these basics, the better you'll do in those weird situations. Let's see, some more... I think there are no more questions, and I think we hand back to Nathan.

Yeah, I think we're a little over time here, so I'm going to cut us off. Thank you, Thom and Ghaith; you guys are great. It was great having you, and your presentation was amazing. And thank you to everyone who joined; I think you all learned a lot, based on the questions that you were all asking. I have one last thing to say: next week, November 3rd at noon, so the same time as today, we're going to be having Meena, principal architect at Microsoft, on for a causal behavioral modeling framework: discrete choice modeling of consumer demand. Love to see you all there. I've
posted the link to the webinar in the chat as well as on YouTube. So yeah, thank you again, Thom and Ghaith, and I hope everyone has a great rest of their day.

Thank you, Nathan. Thanks, everyone. We'd just like to ask everyone to follow us on LinkedIn if you would like to join our community. It's not a competitor to Data Science Dojo; it's a complement. We're a community of data scientists that just want to grow more together. Our mentoring fees are the most expensive on the planet, in that we ask you not to pay us but to pay it forward by helping other new people, like we're doing with each other.
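
Two ideas from the talk lend themselves to a small worked example: engineering an x squared feature so that a linear fit works, and using k-fold cross-validation to look for both good scores and a tight spread across folds. The sketch below is only an illustration, not the presenters' code. The data (y = 2x² + 1 plus noise), the helper names, and the choice of R² as the metric are all assumptions made for this example, and it uses only the Python standard library so it stays self-contained.

```python
import random
from statistics import mean, pstdev


def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def r2(xs, ys, slope, intercept):
    """Coefficient of determination of the fitted line on the given samples."""
    my = mean(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def k_fold_scores(xs, ys, k=5, seed=0):
    """Shuffle once, then let each of the k folds take a turn as the test set."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        slope, intercept = fit_line([xs[i] for i in train_idx],
                                    [ys[i] for i in train_idx])
        scores.append(r2([xs[i] for i in test_idx],
                         [ys[i] for i in test_idx], slope, intercept))
    return scores


# Hypothetical data: a quadratic relationship with a little noise.
rng = random.Random(42)
x = [i / 10 for i in range(-30, 31)]
y = [2.0 * v * v + 1.0 + rng.gauss(0.0, 0.2) for v in x]

raw_scores = k_fold_scores(x, y)                   # linear fit on raw x
eng_scores = k_fold_scores([v * v for v in x], y)  # linear fit on engineered x**2

for name, scores in (("raw x", raw_scores), ("x squared", eng_scores)):
    print(f"{name}: mean R2 = {mean(scores):.3f}, spread = {pstdev(scores):.3f}")
```

On data like this, the raw feature should score poorly on every fold, while the engineered feature should score close to 1 with very little spread, which is the "tight distribution" the speakers describe as a sign of good generalization.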

Original Description

Is there such a set of methodologies for data science and machine learning? YES! This talk will walk you through a high-level coverage of these methods and show you how to automate a supervised machine-learning pipeline.

About the presenters:

Thom Ives founded Integrated Machine Learning & AI, which is a very large group of data scientists who seek to grow and learn MORE TOGETHER. He is a leading data scientist and has developed a wide range of analytical models using multi-physics, data, and experiments. While Thom loves predictive modeling, his real passion for the data science space is making sure that data is clean from collection to storage for achieving the greatest overall return on data for all from retrieval. Thom is married and has 9 kids = 4 bios + 5 internationally adopted. He also has an awesome son-in-law that he is close to, AND he also regularly adopts amazing people from around the world! He lives in Eagle, Idaho, USA.

Ghaith Sankari has an AI bachelor's degree from the University of Aleppo, and he is the founder of AI-HUT (Dubai, UAE). He is also a mentor in AI4Medicine Specialisation.

Table of Contents:
0:00 – Introduction
2:50 – The highest level
9:26 – Supervised machine learning essence
12:21 – Features - Missing values
13:45 – Features - Data cleaning perception Vs reality
17:36 – Features - Encoding
22:26 – Features - Normalize
32:13 – Correlation - Methods
39:29 – Engineering features
50:02 – Human oversight
50:28 – Complete pipeline

For further tutorials on the fundamentals of machine learning, check out this exclusive playlist: https://youtube.com/playlist?list=PL8eNk_zTBST-RTog7CPYvRfs1pYRWkPHG

At Data Science Dojo, we believe data science is for everyone. Our data science trainings have been attended by more than 10,000 employees from over 2,500 companies globally, including many leaders in tech like Microsoft, Google, and Facebook. For more information please visit: https://hubs.la/Q01Z-13k0


The video teaches how to automate supervised machine learning pipeline development, covering data preparation, modeling, and rollout. It emphasizes the importance of mastering machine learning basics and using techniques like dimensionality reduction and feature scaling to improve model performance and interpretability. By following the steps outlined in the video, viewers can build and deploy their own supervised machine learning pipelines.

Key Takeaways
  1. Prepare data by handling missing values and encoding categorical variables
  2. Scale numerical features and reduce dimensionality
  3. Split data into training and test sets and evaluate model performance
  4. Use correlation analysis and feature selection to improve model interpretability
  5. Automate data preparation and modeling using machine learning pipelines
💡 Automating supervised machine learning pipeline development can improve model performance and reduce the time and effort required to deploy machine learning models in production.
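
The data-preparation steps in the takeaways above can be sketched in a few lines. This is a minimal, dependency-free illustration and not the presenters' pipeline: the toy columns (`height_cm`, `height_in`, `age`, `city`), the helper names, and the 0.95 correlation threshold are all hypothetical choices made for the example. In practice a library such as scikit-learn offers equivalents of these steps.

```python
from math import sqrt


def impute_mean(column):
    """Replace missing values (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in column]


def min_max_scale(column):
    """Scale a numeric column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def one_hot(name, column):
    """Turn one categorical column into one 0/1 column per category."""
    return {f"{name}={c}": [1.0 if v == c else 0.0 for v in column]
            for c in sorted(set(column))}


def pearson(a, b):
    """Pearson correlation; returns 0.0 if either column is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return 0.0 if va == 0 or vb == 0 else cov / sqrt(va * vb)


def drop_collinear(features, threshold=0.95):
    """Keep a feature only if it is not highly correlated with one already kept."""
    kept = {}
    for name, col in features.items():
        if all(abs(pearson(col, other)) < threshold for other in kept.values()):
            kept[name] = col
    return kept


# Hypothetical toy data: height appears twice (cm and inches), so it is collinear.
numeric = {
    "height_cm": [150.0, 160.0, None, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [34.0, 51.0, 27.0, 45.0, 38.0],
}
city = ["Boise", "Dubai", "Boise", "Seattle", "Dubai"]

# Impute, then scale each numeric column; encode the categorical column.
features = {name: min_max_scale(impute_mean(col)) for name, col in numeric.items()}
features.update(one_hot("city", city))

kept = drop_collinear(features)
print("removed as near-duplicates:", sorted(set(features) - set(kept)))
print("kept:", list(kept))
```

The pairwise check only catches features that duplicate each other one-to-one; it would not detect a dependency that involves three or more columns, which is one reason the talk stresses human oversight on top of automation.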
