Data Science 101: Overview of Machine Learning Model Building Process
Key Takeaways
The video covers the machine learning model building process, including data understanding, data cleaning, feature selection, and model selection, using tools like Random Forest, Decision Trees, and Support Vector Machine. It also discusses model evaluation metrics such as r-squared, mean squared error, and accuracy.
Full Transcript
Welcome back to the Data Professor YouTube channel. If you're new here, my name is Chanin Nantasenamat, and on this YouTube channel we cover data science concepts and practical tutorials, so if you're into this kind of content, please consider subscribing.
A week ago, on January 1st of this year, 2020, you might have noticed that I shared an infographic entitled "A one-page summary of the machine learning model building process" on the Data Professor Facebook page, and I've also made it available on figshare. I posted this one-page infographic in some Facebook groups, and it received wide interest, with more than 100 likes. That gave me an idea: why don't I make a video about the machine learning model building process? And so this video was born.
Okay, so let's begin. You can go to the Facebook fan page of the Data Professor by typing in facebook.com/dataprofessor. If you haven't yet liked this page, please go ahead and like it, follow it, and share it with friends who you think would be interested in data science. Scroll down a bit and click on the figshare link, and you will come to a page where you can download the infographic. So this is the infographic; let's zoom in and talk about the machine learning model building process.
In every data mining or data science project, you start with your initial data set. Your data set could be structured or unstructured. It could be numerical, that is, quantitative, or it could be qualitative. The data set could be clean, meaning it has no missing values, or it could have missing values, in which case you will have to clean, pre-process, and curate it. Oftentimes the variables might also be redundant, and therefore you have to perform some sort of feature selection.
To get a rough understanding of your data set, you do data understanding. You can do that by performing exploratory data analysis using PCA (principal component analysis) or a self-organizing map, and you can also use basic statistics: looking at the distributions, the scatter plots of the variables, the histograms, the minimum and maximum values, the standard deviation, the mean, the median, and the mode. That's the standard statistical approach.
Once you have cleaned and curated the data, you remove any redundant features, primarily those with low standard deviation: variables that are useless will have a very low or zero standard deviation. In some of the projects that we do, we set a rule that if the standard deviation is less than 0.01, we delete that variable. You can implement something similar to remove constant or otherwise useless variables.
The data set will essentially look like this: you have the input variables, and then you have the output variable. Your output variable could be either quantitative or qualitative. That is if you have an output variable; if you don't, then you only have the input variables. As for the selection of a suitable learning algorithm, we're going to cover that in just a moment.
Once you have your pre-processed data set, you might be ready to start the data mining process, but not just yet. The question is, will you use all of your data set? Oftentimes it's more economical to subset the data into a smaller unit that you think is relevant to your hypothesis. Talk to your stakeholders, the people who wanted you to develop the model in the first place, and come to an agreement on the scope of the prediction model. Let's say you want to develop a prediction model for people who are aged 60 and above. Your data set then has to be filtered so that you keep the rows where age is 60 or above, and if age is less than 60, you don't use them. That will significantly reduce the number of rows, the volume of your data set. So subset your data set depending on the stakeholders' input.
Once you have the data set that you want to use, you split it into two portions: one to use as the training set and another to use as the testing set. A good ratio to use is 80/20. Why 80/20? Well, it's just an arbitrary number, borrowed from the Pareto principle. In the Pareto principle, 20 percent of the effort accounts for 80 percent of the productivity, 80 percent of the world's GDP is accounted for by 20 percent of the world's countries, and 80 percent of the profit comes from 20 percent of a company's products. So it's just an arbitrary number that we use: 80 percent to create the training set and 20 percent to use as the test set.
Now let's come to the selection of the learning algorithm. There are a lot of algorithms available out there, and the fanciest and most popular right now is deep learning. But do you require deep learning to develop your model? Maybe, or maybe not; a simpler model might be more suitable for your data set. You might want to try simpler models before you invest in deep learning, because deep learning has a high compute cost. If you can use a simpler approach to model your data set, try that first and then work your way up.
One of the criteria for selecting the learning algorithm is whether you have an output variable or not, because learning algorithms are divided into supervised learning and unsupervised learning. Supervised learning means that you have an output variable that you want to predict. It's like having a teacher who teaches the students: the output variable teaches the algorithm how to classify the data objects, and the error is then used to adjust the parameters, and so on, until we have a model that can accurately predict the output variable. What if you don't have an output variable? Then you can use unsupervised learning, which typically means algorithms such as PCA and SOM, that is, principal component analysis and the self-organizing map. These are popular unsupervised learning approaches.
Supervised learning is what we use for most of the projects in our research program, and there are quite a lot of supervised approaches: support vector machines, deep learning, GBM (gradient boosting machines), k-nearest neighbors, decision trees, and random forests. We like random forests and decision trees a lot because they allow us to interpret the important underlying features by means of the Gini index.
Learning algorithm selection also depends on whether the algorithm can do classification or regression. In supervised learning you have the output variable, and whether it is suitable for classification or regression depends on whether it is a numerical or a qualitative value. If it's numerical or quantitative, you use regression; if it is a categorical or qualitative label, you use classification, which can be binary or multi-class. Support vector machines can handle both regression and classification, and so can deep learning, GBM, decision trees, and random forests.
Okay, so that's the learning algorithm. Now let's hop on to the concept of hyperparameter optimization. Every learning algorithm has parameters that you can adjust in order to improve accuracy. For example, in random forest you adjust the mtry and ntree parameters. For a support vector machine, you decide whether to use a linear kernel, a polynomial kernel, or a radial basis function kernel, and you also adjust the C parameter, the gamma parameter, and the epsilon value. You can do this hyperparameter optimization in a grid-wise manner.
You could also do some form of feature selection to further reduce the features used during modeling. But approaches like random forest have built-in feature selection, so typically we don't have to do any separate feature selection; we just use the entire set of features that contain information, meaning that we have already removed the redundant ones.
Now let's talk about how we use the training set. We use it to create the trained model, and we also use it to create the cross-validation model. Cross-validation essentially partitions the training set into n folds: if you specify n to be 10, it is separated into ten folds; if you specify five, into five folds. A fold means a partition, and each partition has roughly the same number of data samples. Let's say you have 150 iris flowers. With 10-fold cross-validation, each fold contains 15 flowers, so 15 random flowers are assigned to each of the ten partitions. In one iteration, one partition is left out, the remaining nine partitions are used to create the prediction model, and the prediction model is then applied to the left-out partition. That concludes iteration one. The next iteration moves the left-out partition back in, takes a new partition out, uses the remaining nine to create a prediction model, and applies it to predict the values of the left-out partition. We do this over and over until each partition has been left out once, and then the prediction accuracy is averaged over the ten iterations. So that's cross-validation.
Once we have used the training set to develop the trained model, the trained model can be used to predict the y values of the training set and also of the test set. Depending on whether it is a classification or a regression problem, the model evaluation uses different metrics. Regression has metrics such as R-squared, mean squared error, and root mean squared error; classification has metrics such as accuracy, sensitivity, specificity, and the Matthews correlation coefficient. On the basis of these classification or regression performance metrics, you then decide whether your model is valid and robust enough. If it is, you can decide whether to deploy your model by talking to your stakeholders, and if you are ready to deploy, please refer to our other videos on how you can deploy your machine learning model.
So this is the one-page summary of the machine learning model building process, made into a video. Comment down below on whether you would like to see more of this type of video and, if so, what kinds of topics you would like to see. I'm currently working on an infographic about how to handle missing data, and I might create a video out of that as well. It's also going to be part of our data pre-processing series, where we will have many videos covering different aspects of how you can pre-process your data set. Thank you for watching. Please like, subscribe, and share, and I'll see you in the next one. In the meantime, please check out these videos.
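Editor's sketch of the data preparation steps from the transcript: dropping variables whose standard deviation is below 0.01, keeping only rows with age 60 or above, and making an 80/20 train/test split. This is a minimal illustration in plain Python, not code from the video. The column names (`age`, `marker`, `constant`), the synthetic data, and the helper names are all assumptions made for the example.

```python
import random
import statistics

def drop_low_variance(rows, threshold=0.01):
    """Drop numeric columns whose standard deviation is below the threshold."""
    columns = list(rows[0].keys())
    kept = [c for c in columns
            if statistics.pstdev([r[c] for r in rows]) >= threshold]
    return kept, [{c: r[c] for c in kept} for r in rows]

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Synthetic, hypothetical data: one constant column that should be removed.
rng = random.Random(0)
data = [{"age": rng.randint(40, 90),
         "marker": rng.gauss(5.0, 1.0),
         "constant": 1.0}
        for _ in range(200)]

kept, filtered = drop_low_variance(data)           # removes "constant"
seniors = [r for r in filtered if r["age"] >= 60]  # scope agreed with stakeholders
train, test = train_test_split(seniors)            # 80 percent train, 20 percent test

print(kept)
print(len(seniors), len(train), len(test))
```

The filter is applied before the split so that the training and test sets both reflect the agreed scope of the model.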
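Editor's sketch of the grid-wise hyperparameter optimization mentioned in the transcript: every combination of candidate values is scored and the best one is kept. The scoring function here is a hypothetical stand-in; in a real project it would be something like the cross-validated accuracy of a model trained with those settings. The grid values for `C` and `gamma` are assumptions chosen for illustration.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Score every combination in param_grid and return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(C, gamma):
    """Hypothetical stand-in for a model's validation score; peaks at C=10, gamma=0.1."""
    return 1.0 - abs(C - 10) / 100 - abs(gamma - 0.1)

grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)
```

The cost of a grid search grows with the product of the grid sizes (here 4 x 4 = 16 evaluations), which is one reason to start with coarse grids.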
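Editor's sketch of the 10-fold cross-validation procedure described in the transcript: 150 samples are dealt into ten partitions of 15, each partition is left out once, and the accuracy is averaged over the ten iterations. The data below is synthetic (three well-separated classes of 50 samples, not the real iris data), and the nearest-mean classifier is a toy model chosen only to keep the example self-contained.

```python
import random

def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Shuffle sample indices and deal them into n_folds near-equal partitions."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]

def fit_nearest_mean(xs, ys):
    """Toy classifier: remember the mean feature value of each class."""
    means = {}
    for label in set(ys):
        values = [x for x, y in zip(xs, ys) if y == label]
        means[label] = sum(values) / len(values)
    return means

def predict_nearest_mean(means, x):
    """Predict the class whose stored mean is closest to x."""
    return min(means, key=lambda label: abs(x - means[label]))

def cross_validate(xs, ys, n_folds=10):
    """Leave each fold out once, train on the rest, and average the accuracy."""
    folds = k_fold_indices(len(xs), n_folds)
    scores = []
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in range(len(xs)) if i not in held]
        model = fit_nearest_mean([xs[i] for i in train_idx],
                                 [ys[i] for i in train_idx])
        correct = sum(predict_nearest_mean(model, xs[i]) == ys[i]
                      for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)

# Synthetic stand-in for 150 flowers: one feature, three classes of 50.
rng = random.Random(1)
xs, ys = [], []
for label, center in [("a", 1.0), ("b", 5.0), ("c", 9.0)]:
    for _ in range(50):
        xs.append(rng.gauss(center, 0.5))
        ys.append(label)

folds = k_fold_indices(len(xs))
mean_accuracy = cross_validate(xs, ys)
print([len(f) for f in folds], mean_accuracy)
```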
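Editor's sketch of the evaluation metrics named in the transcript: R-squared, mean squared error, and root mean squared error for regression, and accuracy, sensitivity, specificity, and the Matthews correlation coefficient for binary classification. The formulas are the standard textbook definitions; the function names and the small example vectors are assumptions for illustration, and the classification function assumes both classes are present in the true labels.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return R-squared, MSE, and RMSE for paired true and predicted values."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    mse = ss_res / n
    return {"r2": 1 - ss_res / ss_tot, "mse": mse, "rmse": math.sqrt(mse)}

def binary_classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, sensitivity, specificity, and MCC from the confusion counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }

reg = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.5, 7.0, 9.0])
clf = binary_classification_metrics([1, 1, 1, 1, 0, 0, 0, 0],
                                    [1, 1, 1, 0, 0, 0, 0, 1])
print(reg)
print(clf)
```

Computing the same metrics on both the training set and the test set, as the transcript suggests, makes it easier to spot a model that fits the training data well but generalizes poorly.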
Original Description
Are you just starting out in data science and looking for an introductory video on what it takes to build a machine learning model? Look no further: in this video we cover the basic concepts of the machine learning model building process. This video started out as a hand-drawn infographic and has now been converted to video format.
🌟 Buy me a coffee: https://www.buymeacoffee.com/dataprofessor
Inspired by our own infographic "1 page summary of the machine learning model building process" and a comment suggested by Bazi Ahmed
📎INFOGRAPHIC: https://doi.org/10.6084/m9.figshare.11492316.v1
⭕ Playlist:
Check out our other videos in the following playlists.
✅ Data Science 101: https://bit.ly/dataprofessor-ds101
✅ Data Science YouTuber Podcast: https://bit.ly/datascience-youtuber-podcast
✅ Data Science Virtual Internship: https://bit.ly/dataprofessor-internship
✅ Bioinformatics: http://bit.ly/dataprofessor-bioinformatics
✅ Data Science Toolbox: https://bit.ly/dataprofessor-datasciencetoolbox
✅ Streamlit (Web App in Python): https://bit.ly/dataprofessor-streamlit
✅ Shiny (Web App in R): https://bit.ly/dataprofessor-shiny
✅ Google Colab Tips and Tricks: https://bit.ly/dataprofessor-google-colab
✅ Pandas Tips and Tricks: https://bit.ly/dataprofessor-pandas
✅ Python Data Science Project: https://bit.ly/dataprofessor-python-ds
✅ R Data Science Project: https://bit.ly/dataprofessor-r-ds
⭕ Subscribe:
If you're new here, it would mean the world to me if you would consider subscribing to this channel.
✅ Subscribe: https://www.youtube.com/dataprofessor?sub_confirmation=1
⭕ Recommended Tools:
Kite is a FREE AI-powered coding assistant that will help you code faster and smarter. The Kite plugin integrates with all the top editors and IDEs to give you smart completions and documentation while you’re typing. I've been using Kite and I love it!
✅ Check out Kite: https://www.kite.com/get-kite/?utm_medium=referral&utm_source=youtube&utm_campaign=d