Data Science 101: Overview of Machine Learning Model Building Process

Data Professor · Beginner · 📐 ML Fundamentals · 6y ago

Skills: ML Maths Basics 80% · Supervised Learning 70% · Unsupervised Learning 60% · ML Pipelines 50%

Key Takeaways

The video covers the machine learning model building process, including data understanding, data cleaning, feature selection, and model selection, using tools like Random Forest, Decision Trees, and Support Vector Machine. It also discusses model evaluation metrics such as r-squared, mean squared error, and accuracy.
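
As a rough illustration of the evaluation metrics named in this summary and in the transcript, the sketch below computes R-squared, mean squared error and root mean squared error for a regression example, and accuracy, sensitivity, specificity and the Matthews correlation coefficient for a binary classification example. It uses only the Python standard library. The function names and the toy values are assumptions made for this example and do not come from the video.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return R-squared, MSE and RMSE for two equal-length sequences."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    mse = ss_res / n
    return {"r2": 1 - ss_res / ss_tot, "mse": mse, "rmse": math.sqrt(mse)}

def classification_metrics(y_true, y_pred, positive=1):
    """Return binary classification metrics; assumes both classes occur in y_true and y_pred."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    tn = sum(t != positive and p != positive for t, p in pairs)  # true negatives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }

# Toy values (hypothetical), chosen so the arithmetic is easy to check by hand.
reg = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.0, 7.5, 9.0])
clf = classification_metrics([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0, 0, 1])
print(reg)
print(clf)
```

For these toy values R-squared works out to 0.975 and the mean squared error to 0.125, while accuracy, sensitivity and specificity are each 0.75 and the Matthews correlation coefficient is 0.5.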

Full Transcript

Welcome back to the Data Professor YouTube channel. If you're new here, my name is Chanin Nantasenamat, and on this YouTube channel we cover data science concepts and practical tutorials. So if you're into this kind of content, please consider subscribing.

A week ago, on January 1st of this year, 2020, you might have noticed that I shared an infographic entitled "A one-page summary of the machine learning model building process" on the Data Professor Facebook page, and I've also made it available on figshare. If you click on it, you will see a one-page summary infographic. I posted this infographic in some Facebook groups and it received wide interest, with more than 100 likes. That gave me an idea: why don't I make a video about the machine learning model building process? And thus this video was born.

Okay, so let's begin. You can go to the Facebook fan page of the Data Professor by typing in facebook.com/dataprofessor. If you haven't yet liked this page, please go ahead and like it, follow it, and share it with friends who you think would be interested in data science. Scroll down a bit and click on the figshare link, and you will come to the page where you can download the infographic. So this is the infographic. Let's zoom in and talk about the machine learning model building process.

In every data mining or data science project, you start with your initial data set. Your data set could be structured or unstructured, and it could be numerical, quantitative or qualitative. The data set could be clean, meaning that it has no missing values, or it could have missing values, in which case you will have to clean, pre-process and curate the data set. Oftentimes the variables might also be redundant, and therefore you have to
perform some sort of feature selection.

In order to get a rough understanding of your data set, you have to do data understanding. You can do that by performing exploratory data analysis using PCA (principal component analysis) or a self-organizing map, and you can also use basic statistics: looking at the distributions, the scatter plots of the variables, the histograms, the minimum and maximum values, the standard deviation, the mean, the median and the mode. That is the standard statistical approach.

Okay, so once you have cleaned and curated the data, you remove any redundant features, primarily those with a low standard deviation, because variables that are useless will have a very low or zero standard deviation. In some of the projects that we do, we set a rule that if the standard deviation is less than 0.01, we delete that variable. You can implement something similar in order to remove constant variables or variables that are useless.

The data set will then essentially look like this: you have the input variables and then you have the output variable. Your output variable could be either quantitative or qualitative. That is if you have an output variable; if you don't, then you're going to have only the input variables. As for the selection of a suitable learning algorithm, we're going to cover that in just a moment.

Once you have your pre-processed data set, you might be ready to start the data mining process, but not just yet. The question is, will you use all of your data set? Oftentimes it's more economical to subset the data set into a smaller unit that you think is relevant to answering your hypothesis. Talk to your stakeholders, the people who wanted you to develop the model in the first place, and come to an agreement on the scope of the prediction model. So let's say that you want
to develop a prediction model for people who are aged 60 and above. Your data set will have to be filtered so that you keep only the rows where age is 60 or above; if age is less than 60, you're not going to use that row. Subsetting the data this way will significantly reduce the number of rows, the volume of your data set. So subset your data set depending on the stakeholders' input.

Okay, so once you have the data set that you want to use, you want to split it into two portions: one portion to use as the training set and another to use as the testing set. A good ratio to use is 80/20. Why 80/20? Well, it's just an arbitrary number, and it echoes the Pareto principle. In the Pareto principle, 20% of the effort accounts for 80% of the results, or 80% of the world's GDP is accounted for by 20% of the world's countries, or 80% of the profit comes from 20% of a company's products. So it's just an arbitrary number that we use to create our training set and testing set: we're going to use 80% to create the training set and the remaining 20% as the test set.

Now let's come to the selection of the learning algorithm. There are a lot of algorithms available out there, and the fanciest and most popular one right now is deep learning. So do you require deep learning in order to develop your model? Maybe, or maybe not. A simpler model might be more suitable for your data set. You might want to try simpler models before you invest in deep learning, because deep learning incurs high compute costs. If you can use a simpler approach to model your data set, try that first and then work your way up.

Okay, so the selection of the learning
algorithm depends on whether you have an output variable or not. That is one of the criteria, because learning algorithms are divided into supervised learning and unsupervised learning. Supervised learning means that you have an output variable that you want to predict. It's like having a teacher who teaches the students: the output variable teaches the algorithm how to classify the data objects, and the prediction error is then used to adjust the parameters until we have a predictive model that can accurately predict the output variable.

So in supervised learning you have the output variable. What if you don't have an output variable? Then you can use unsupervised learning. Typical unsupervised learning algorithms are PCA (principal component analysis) and SOM (self-organizing map); these are popular unsupervised learning approaches.

Supervised learning is what we use for most of the projects in our research program, and there are quite a lot of supervised learning approaches: support vector machine, deep learning, GBM (gradient boosted machines), k-nearest neighbors, decision trees and random forests. We like random forests and decision trees a lot because they allow us to interpret the important underlying features by means of the Gini index.

Learning algorithm selection also depends on which algorithms can do classification and which can do regression. In supervised learning you have the output variable, and whether that output variable is suited to classification or regression depends on whether it is a numerical or a qualitative value. If it's a numerical or quantitative value, you use regression; if it's a categorical or qualitative label, you use classification. The classification could be binary classification or it could be
multi-class. Support vector machine can handle both regression and classification, and so can deep learning, GBM, decision trees and random forests.

So that's the learning algorithm part. Now let's hop on to the concept of hyperparameter optimization. Every learning algorithm has parameters that you can adjust in order to improve accuracy. For example, in random forest you adjust the mtry and the ntree parameters. For support vector machine you can decide whether to use a linear kernel, a polynomial kernel or a radial basis function kernel, and you can adjust the C parameter, the gamma parameter and the epsilon value. You can do this hyperparameter optimization in a grid-wise manner.

You can also do some form of feature selection in order to further reduce the features that you use during the modeling process. But approaches like random forest have built-in feature selection, so typically we don't have to do any separate feature selection; we just use the entire set of features that contain information, meaning that we have already removed the redundant features.

Okay, so let's talk about how we're going to use the training set. We use the training set to create the trained model, and we also use it to create the cross-validation model. Cross-validation essentially partitions the training set into n folds. If you specify n to be 10, it separates the data into ten folds; if you specify 5, it separates it into five folds. "Fold" means partition, and each partition has roughly the same number of data samples.

Let's say that you have 150 iris flowers. With 10-fold cross-validation, each fold contains 15 flowers, so 15 random flowers are assigned to each of the ten partitions. In one iteration, one partition is left out and the remaining nine partitions are
used to create the prediction model, which is then applied to the left-out partition. That concludes iteration one. The next iteration takes the left-out partition, moves it back in, takes a new partition out, uses the remaining nine to create a prediction model, and applies that model to predict the values of the left-out partition. We do this over and over until each partition has been left out exactly once, and then the prediction accuracy is averaged over the ten iterations. So that's cross-validation.

Okay, so once we have used the training set to develop the trained model, the trained model can be used to predict the y values of the training set and also of the test set. Depending on whether it is a classification or a regression problem, the model evaluation uses different metrics. Regression has metrics such as R-squared, mean squared error and root mean squared error, and classification has metrics such as accuracy, sensitivity, specificity and the Matthews correlation coefficient. On the basis of these classification or regression performance metrics, you then decide whether your prediction is valid and robust enough. If it is, you can decide whether to deploy your model by talking to your stakeholders, and if you are ready to deploy, please refer to our other videos on how you can deploy your machine learning model.

Okay, and so this is the one-page summary of the machine learning model building process made into a video. Comment down below on whether you would like to see more of this type of video and, if so, what kind of topics you would like to see. I'm currently working on an infographic about how to handle missing data, and I might create a video out of that as
well. It's also going to be part of our data pre-processing series, where we will have many videos covering different aspects of how you can pre-process your data set.

Thank you for watching. Please like, subscribe and share, and I'll see you in the next one. But in the meantime, please check out these videos.
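
The 10-fold cross-validation walk-through in the transcript (150 samples, ten partitions of 15, each partition left out once, accuracy averaged over the iterations) can be sketched roughly as follows. This is a minimal standard-library illustration, not the code used in the video: the one-feature toy data set and the midpoint-threshold "model" are hypothetical stand-ins for the iris data and for a real learning algorithm.

```python
import random

# Toy data (hypothetical): 150 samples, one feature, two classes split at x = 75.
X = list(range(150))
y = [1 if x >= 75 else 0 for x in X]

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and deal them into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def fit_threshold(train_idx):
    """Stand-in model: the midpoint between the two class means of the training samples."""
    class0 = [X[i] for i in train_idx if y[i] == 0]
    class1 = [X[i] for i in train_idx if y[i] == 1]
    return (sum(class0) / len(class0) + sum(class1) / len(class1)) / 2

def accuracy(threshold, idx):
    """Fraction of samples in idx whose predicted class matches the true class."""
    hits = sum((1 if X[i] >= threshold else 0) == y[i] for i in idx)
    return hits / len(idx)

folds = k_fold_indices(len(X), k=10)
fold_scores = []
for held_out in folds:
    # Train on the nine remaining folds, then score on the one left out.
    train_idx = [i for fold in folds if fold is not held_out for i in fold]
    model = fit_threshold(train_idx)
    fold_scores.append(accuracy(model, held_out))

mean_accuracy = sum(fold_scores) / len(fold_scores)
print("fold sizes:", [len(fold) for fold in folds])
print("mean cross-validated accuracy:", round(mean_accuracy, 3))
```

Each sample lands in exactly one fold, so every sample is used for testing once and for training nine times.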

Original Description

Are you just starting out in data science and looking for an introductory video on what it takes to build a machine learning model? Look no further: in this video we cover the basic concepts of the machine learning model building process. The concept of this video first started out as a drawn infographic and has now been converted to a video format.

🌟 Buy me a coffee: https://www.buymeacoffee.com/dataprofessor

Inspired by our own infographic "1 page summary of the machine learning model building process" and the suggested comment from Bazi Ahmed.

📎 INFOGRAPHIC: https://doi.org/10.6084/m9.figshare.11492316.v1

⭕ Playlist: Check out our other videos in the following playlists.
✅ Data Science 101: https://bit.ly/dataprofessor-ds101
✅ Data Science YouTuber Podcast: https://bit.ly/datascience-youtuber-podcast
✅ Data Science Virtual Internship: https://bit.ly/dataprofessor-internship
✅ Bioinformatics: http://bit.ly/dataprofessor-bioinformatics
✅ Data Science Toolbox: https://bit.ly/dataprofessor-datasciencetoolbox
✅ Streamlit (Web App in Python): https://bit.ly/dataprofessor-streamlit
✅ Shiny (Web App in R): https://bit.ly/dataprofessor-shiny
✅ Google Colab Tips and Tricks: https://bit.ly/dataprofessor-google-colab
✅ Pandas Tips and Tricks: https://bit.ly/dataprofessor-pandas
✅ Python Data Science Project: https://bit.ly/dataprofessor-python-ds
✅ R Data Science Project: https://bit.ly/dataprofessor-r-ds

⭕ Subscribe: If you're new here, it would mean the world to me if you would consider subscribing to this channel.
✅ Subscribe: https://www.youtube.com/dataprofessor?sub_confirmation=1

⭕ Recommended Tools: Kite is a FREE AI-powered coding assistant that will help you code faster and smarter. The Kite plugin integrates with all the top editors and IDEs to give you smart completions and documentation while you're typing. I've been using Kite and I love it!
✅ Check out Kite: https://www.kite.com/get-kite/?utm_medium=referral&utm_source=youtube&utm_campaign=d

Playlist

Uploads from Data Professor · Data Professor · 27 of 60

How a Biologist became a Data Scientist
WEKA Tutorial #1.1 - How to Build a Data Mining Model from Scratch
WEKA Tutorial #1.2 - How to Build a Data Mining Model from Scratch
WEKA Tutorial #1.3 - How to Build a Data Mining Model from Scratch
Computational Drug Discovery: Machine Learning for Making Sense of Big Data in Drug Discovery
Quotes #1 on Big Data and Data Science
Quotes #2 on Big Data and Data Science
Quotes #3 on Big Data and Data Science
Quotes #4 on Big Data and Data Science
Quotes #5 on Big Data and Data Science
Data Science 101: Starting a Data Science / Data Mining Project
Data Science 101: CRISP-DM - Data Mining / Data Science in 6 Steps
R Programming 101: How to Define Variables
R Programming 101: Read and Write CSV files
Data Science 101: Basic Command-Line for Data Science
Strategies for Learning Data Science in 2020 (Data Science 101)
Building your Data Science Portfolio with GitHub (Data Science 101)
R Programming 101: Setting up R programming environment (R, RStudio and RStudio.cloud)
Exploratory Data Analysis in R: Towards Data Understanding
Exploratory Data Analysis in R: Quick Dive into Data Visualization
Machine Learning in R: Building a Classification Model
Machine Learning in R: Repurpose Machine Learning Code for New Data
Data Science 101: Deploying your Machine Learning Model
Machine Learning in R: Deploy Machine Learning Model using RDS
Data Pre-processing in R: Handling Missing Data
Machine Learning in R: Speed up Model Building with Parallel Computing
Data Science 101: Overview of Machine Learning Model Building Process
Web Apps in R: Building your First Web Application in R | Shiny Tutorial Ep 1
Web Apps in R: Build Interactive Histogram Web Application in R | Shiny Tutorial Ep 2
Web Apps in R: Building Data-Driven Web Application in R | Shiny Tutorial Ep 3
Web Apps in R: Building the Machine Learning Web Application in R | Shiny Tutorial Ep 4
Web Apps in R: Build BMI Calculator web application in R for health monitoring | Shiny Tutorial Ep 5
Machine Learning in R: Building a Linear Regression Model
What programming language to learn for Data Science? R versus Python
How to Become a Data Scientist (Learning Path and Skill Sets Needed)
Using Python in R
Interpretable Machine Learning Models
Making Scatter Plots in R [Data Visualisation in R series]
Machine Learning in Python: Building a Classification Model
Compare Machine Learning Classifiers in Python
Hyperparameter Tuning of Machine Learning Model in Python
Practical Introduction to Google Colab for Data Science
File Handling in Google Colab for Data Science
Pandas for Data Science: Create and Combine DataFrames / Rename Columns
Machine Learning in Python: Building a Linear Regression Model
Machine Learning in Python: Principal Component Analysis (PCA) for Handling High-Dimensional Data
How to Plot an ROC Curve in Python | Machine Learning in Python
Installing conda on Google Colab for Data Science
Use native R on Google Colab for Data Science
How to Save and Download files from Google Colab
Easy Web Scraping in Python using Pandas for Data Science
Data Science for Computational Drug Discovery using Python (Part 1)
Pandas Profiling for Data Science (Quick and Easy Exploratory Data Analysis)
Exploratory Data Analysis in Python using pandas
Quick tour of PyCaret (a low-code machine learning library in Python)
How to Upload Files to Google Colab
How to Install and Use Pandas Profiling on Google Colab
How to Adjust the Style of Pandas DataFrame
How to use Bamboolib for Data Wrangling in Data Science
How to use Pandas Profiling on Kaggle

This video provides an overview of the machine learning model building process, covering data understanding, data cleaning, feature selection, and model selection. It also discusses model evaluation metrics and techniques. By watching this video, viewers can learn how to build and evaluate a machine learning model.

Key Takeaways

Clean and curate the data set
Perform feature selection
Split the data set into training and testing sets
Select a suitable learning algorithm
Train the model
Evaluate the model using metrics like r-squared and accuracy
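
The steps above can be sketched end to end in a few lines of standard-library Python. This is only an illustration under stated assumptions: the data set is synthetic, the 0.01 standard deviation cutoff and the 80/20 split follow the transcript, and the nearest-centroid classifier is a simple stand-in chosen for brevity rather than one of the algorithms discussed in the video.

```python
import random
import statistics

# Toy data set (hypothetical): each row is ([x1, x2, x3], label).
rows = []
for i in range(100):
    label = i % 2
    x1 = 5.0 * label + ((i * 7) % 11) / 10 - 0.5  # informative feature
    x2 = (i % 10) / 10                            # uninformative but varying feature
    x3 = 1.0                                      # constant feature, carries no information
    rows.append(([x1, x2, x3], label))
rows[3][0][1] = None   # simulate missing values
rows[10][0][0] = None

# 1. Clean: drop rows that contain missing values.
clean = [(x, y) for x, y in rows if None not in x]

# 2. Feature selection: drop columns whose standard deviation is below 0.01.
n_features = len(clean[0][0])
kept_columns = [j for j in range(n_features)
                if statistics.pstdev([x[j] for x, _ in clean]) >= 0.01]
data = [([x[j] for j in kept_columns], y) for x, y in clean]

# 3. Split: shuffle, then use 80% for training and 20% for testing.
random.Random(0).shuffle(data)
n_train = int(0.8 * len(data))
train, test = data[:n_train], data[n_train:]

# 4. Train: nearest-centroid classifier (stand-in for a real learning algorithm).
def fit_centroids(samples):
    """Return the mean feature vector of each class."""
    sums, counts = {}, {}
    for x, y in samples:
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def predict(centroids, x):
    """Return the class whose centroid is closest to x (squared Euclidean distance)."""
    return min(centroids,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))

centroids = fit_centroids(train)

# 5. Evaluate: accuracy on the held-out test set.
accuracy = sum(predict(centroids, x) == y for x, y in test) / len(test)
print("kept columns:", kept_columns)
print("train/test sizes:", len(train), len(test))
print("test accuracy:", accuracy)
```

The two classes in this toy data are well separated on purpose, so the test accuracy says nothing about how the same steps would perform on real data.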

💡 The choice of learning algorithm depends on the output variable and type of learning, and hyperparameter optimization is crucial for achieving good model performance.
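
A grid-wise hyperparameter search, as mentioned in the transcript, simply tries every combination of candidate values and keeps the best-scoring one. The sketch below shows that loop with the standard library. The parameter names C and gamma echo the support vector machine parameters named in the video, but `toy_cv_accuracy` is a hypothetical scoring function invented for this example; in a real project it would train the model with the given settings and return its cross-validated score.

```python
import math
from itertools import product

def grid_search(param_grid, score_fn):
    """Evaluate score_fn on every parameter combination and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:  # keep the first combination that reaches the top score
            best_params, best_score = params, score
    return best_params, best_score

def toy_cv_accuracy(C, gamma):
    """Hypothetical stand-in for 'train with these settings, return CV accuracy'.

    The score peaks at C = 10 and gamma = 0.01 by construction.
    """
    return 1.0 - 0.05 * abs(math.log10(C) - 1) - 0.05 * abs(math.log10(gamma) + 2)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1]}
best_params, best_score = grid_search(param_grid, toy_cv_accuracy)
print("best parameters:", best_params)
print("best score:", round(best_score, 3))
```

The cost grows with the product of the grid sizes (here 4 x 3 = 12 evaluations), which is one reason to start with a coarse grid and refine it around the best region.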
