Data Pre-processing in R: Handling Missing Data

Data Professor · Beginner ·🛠️ AI Tools & Apps ·6y ago

Key Takeaways

This video demonstrates data pre-processing in R, specifically handling missing data: deleting entries with missing data, imputing missing values with the column mean or median, and using custom functions to iterate through the data set and replace NA values. The video uses the dhfr data set, loaded from the Data Professor GitHub with the RCurl library, along with the functions is.na, sum, colSums, and complete.cases and the custom function na.gen.
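The checks named above can be sketched on a small stand-in data frame. This is a minimal illustration, not the code from the video: the `toy` data frame and its values are made up, and `na.omit()` is my choice for the deletion step, since the transcript does not name the function it uses.

```r
# Toy data frame standing in for dhfr (hypothetical values, not the real data)
toy <- data.frame(
  Y  = c("active", "inactive", "active", "inactive"),
  x1 = c(1.0, NA, 3.0, 4.0),
  x2 = c(10, 20, NA, 40),
  x3 = c(0.5, NA, 0.7, 0.8)
)

# Total number of missing cells: is.na() flags each cell, sum() counts the TRUEs
total_na <- sum(is.na(toy))

# Number of missing cells in each column
na_per_column <- colSums(is.na(toy))

# Rows with at least one NA: negate complete.cases()
missing_rows <- toy[!complete.cases(toy), ]

# Deletion option: drop every row that contains an NA (assumed function choice)
clean <- na.omit(toy)

print(total_na)
print(na_per_column)
print(nrow(missing_rows))
print(nrow(clean))
```

Comparing `nrow(clean)` with `nrow(toy)` shows directly how many samples the deletion option costs.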

Full Transcript

Welcome back to the Data Professor YouTube channel. My name is Chanin Nantasenamat and I'm an associate professor of bioinformatics. On this YouTube channel we cover data science concepts and practical tutorials, so if you're into this kind of content, please consider subscribing. One of our viewers has suggested that we cover data pre-processing, so we're going to do that in this episode. We're going to talk about how we can handle missing data. So without further ado, let's get started.

Okay, so this episode represents the first part in a multi-part series on data pre-processing in R, and we're going to start this off by talking about how we can handle missing data. When we're pre-processing data, it is very common that some of the values are missing, meaning that they're empty, they don't have any value, or they might have some other obscure value such as a question mark or minus 999. So today we're going to cover how you can handle these types of data. Let's go to the next slide.

Today we're going to use the dhfr data set, so maybe a little bit about the data set itself. DHFR stands for dihydrofolate reductase. It is an enzyme and also an anti-malarial drug target. The data set is comprised of 325 compounds, or rows, and 229 variables, or columns. Of the 229 variables, one is the target variable called Y, which represents the biological activity and can be classified as being either active or inactive. If it is active, it means that the drug has good bioactivity; if it is inactive, it means that the drug has bad bioactivity. So the objective here is to classify whether the drug molecule has good or bad activity, corresponding to active or inactive. But today we're not going to focus on the classification; we're going to focus on how you can handle the missing data. Of the 229 variables, the remaining 228 are called molecular descriptors. They represent the physicochemical properties that describe the unique characteristics of the drug molecule in terms of the charge, the molecular connectivity, the solubility, etc. And on the right here you see the protein structure of dihydrofolate reductase, or DHFR.

Okay, so here is an outline of what we will learn today. There's going to be a total of five steps. The first step is we're going to load in the dhfr data set, and we're going to load that directly from the Data Professor GitHub. Number two, we're going to check for missing data. We're going to soon find out that there is no missing data in the dhfr, so the data is clean. Therefore, we will create a function that will randomly introduce missing data into the data set. After that, we're going to check again for the missing data. Okay, so hint: there should now be missing data. And the fifth and final step, we're going to handle the missing data, so this is the highlight of this episode.

There are two options, and you will decide what to do. The first option is to simply delete all entries with missing data. This is the simplest thing to do. However, the downside of this is that you're going to lose some of the data points that are present in your data, and therefore you will reduce the number of samples or compounds. The second approach is to perform what is called imputation. Imputation is the process in which you impute, or replace, the missing value with another value, for example the column's mean or the column's median, and we'll show you how to do that in R.

Okay, so let's jump in. This code will be available on the GitHub of the Data Professor, and the links are down below in the description. We're going to start by loading the RCurl library so that we can directly load the data set from the Data Professor GitHub, and this will read it into an object called dhfr. We see that there is a total of 325 objects, or 325 compounds, and there are 229 variables. So let's have a look. As we recall, of the 229 variables, we have one of them as the bioactivity, which is either active or inactive, and then we have 228 molecular descriptors shown here. Okay, and then we can scroll to the right. So now we're finished with step one.

Let's go on to the second step. Now we're going to check for missing data, so we're going to nest two functions and use both of them. The first one is is.na, which determines whether dhfr contains any missing data, and we're going to embed this, or nest this, into a sum function. What this does is that first it will determine whether each value in your dhfr is NA, so it will return a matrix of TRUE or FALSE, and then it will apply the summation function in order to count the total number of NA that are present in your data set. Let's run that by hitting Ctrl+Enter. We see that there is no missing data, so it's zero. This means that your data set is clean. We're finished with step two.

Okay, so now we're going to go to step three. The third step is that, since the data is clean, we will now introduce randomly missing data points, or NA, to the data set. Here we create a custom function called na.gen, gen as in generate. The function takes in two arguments as input. The first one is data, which is the data object, here the dhfr data object, and the second is n, the number of NA to add to the data set. As in the example below here, we're going to apply the na.gen function to the dhfr data object and we're going to add 100 NA randomly into the data set, so we're going to create a data set that is not clean.

In a nutshell, this code will iterate while the iteration counter is less than n plus 1, where n is, let's say, 100. We add 1 because otherwise the loop would perform only 99 NA additions, so we added 1 so that it performs the actual 100. What it does is generate two indices. The first index will determine the row to randomly select and the second index will determine the column to randomly select. Let's say in the first iteration it determines that index 1 will be 10 and index 2 will be 5, so it will add NA to row number 10 and column number 5. Let's say that in iteration 2, index 1 becomes 20 and index 2 becomes 30, so it means that this is 20 and this is 30, and it will add NA to row number 20, column number 30. It does this repeatedly until the loop condition is satisfied.

Okay, so to use the na.gen function you can use either this line or this line. In this line you just say the name of the data object and then the number, how many NA you want to add. But for this one you have to put them in the proper order: you have to start with dhfr followed by the number, because it is the same order as we have put in the function definition, data comma n. However, if you don't want to follow this order, you're going to have to specify n equals and then the number, which is the same n as here, and then you have to specify data equals and then the data object name, which is the same name here, data and data. If you switch to the number 100 and then dhfr, this will not work. Okay, you could give it a try. So now let's run the na.gen function, and now we're going to run this with 100 NA additions.

Now let's go ahead and proceed to step number four, so let's now check for the number of NA. Okay, so the total number of NA is now one hundred. Next we're going to check which columns contain NA, so we're going to use the colSums function, and we're going to nest is.na into the colSums function. All right, so I haven't yet shown you what it looks like if we just apply is.na to dhfr: it will return a matrix of TRUE or FALSE, and if it's TRUE it means that there is missing data at that particular position. So we're going to embed is.na into the colSums function, and now we're going to see which columns have NA. For example, this column, or this variable, has one NA, this one has one NA, and this other variable has two NA. It might be easier to have a look at the colSums of is.na inside the View function, so we're going to paste that in, and here we can easily see which variable has how many NA.

Okay, and now we're going to look at the particular rows containing the missing data. What we want to do here is define a variable object called missing data, and we're going to put in the name of the data object, which is dhfr. Inside the brackets you will see that there are two values separated by a comma. The left value here, which is highlighted, represents the rows and the right part here represents the columns. The exclamation mark is the inverse of the function complete.cases, so it means: for the dhfr object, which cases are not complete? It will show the specific rows which are not complete, meaning the rows that contain missing data, and it will show those rows in this new data frame that we have created. Okay, so let's see the summation of the missing data of this newly created data frame, and it's equivalent to the missing data in the original dhfr.

Okay, so let's now have a look at the missing data. Let's type in the View command and hit Enter. This is the data frame of the data subset that contains missing data. We might need to scroll a bit to find the NA in here; it might be challenging to find some. Okay, right here I spot an NA, so this is one out of 100 NA, and they will be distributed randomly. Here is another NA, here is another NA, this is an NA, and this is also an NA. So there will be 100 NA distributed randomly throughout this subset of the data set, and the subset comprises 89 rows and 229 variables, which means that the NA are distributed across 89 compounds. So we're finished with the fourth step now.

We're going to move on to the fifth and final step, where we're going to handle the missing data. We have two options. The first option is to clean the data set, meaning that we're going to omit, we're going to delete, every row containing NA from the data set. Let's do that. The clean data will now have zero missing data. However, let's look at the data size. After deleting all the NA we are left with 236 compounds, because 236 plus 89 is equivalent to 325. So the deleted rows would represent roughly a quarter of your data set, which is a big chunk of data. Is there a better way?

Well, the second way is to perform imputation. Imputation is where you replace the missing value with another value such as the column's mean or median. Okay, so let's do that. We're going to create a new data object called dhfr.impute, and we're just going to dump the dhfr data set into this, so it's going to contain duplicate information. It's kind of like making a clone of the dhfr data set so that we can compare between the original data set and the new data set on which we have performed imputation. Okay, so now the new data set has the same exact dimensions as the original one. This block of code is a custom function which will iterate through the whole data set to determine which positions are NA, and at each such position it will determine the column's mean or the column's median value and replace the NA with the mean or median.

Okay, so let's try this out. And then, okay, so NA is present in Y, so maybe it's not a good idea to add NA to the Y. What we could actually have done is to skip the Y variable and perform the NA generation for the remaining X variables; we should see 228 variables. So in here we could have just put in dhfr. Okay, now Y is no longer present. We should read in the data set again and start over: determine that there is no NA, then take out the Y variable, and generate 100 random NA into the data set. Now we have a hundred NA added to it, and then we're going to do the mean imputation again. We apply this imputation of the mean, and okay, now the mean imputation gives zero missing data. If we do this again, we create the dhfr.impute data frame again and then determine how many missing data there are, so there's a hundred, and then we perform this median imputation, and let's see again: now there's zero.

Okay, so that pretty much wraps up this episode. In the next one we're going to cover how you can perform imputation using other approaches. We're going to use another data set, which might not always have numerical values. What if your NA is in a factor, or an ordinal or categorical variable? What will you do? Take categorical or ordinal variables such as low, medium, high: if you're missing the data for these variables, how will you handle this type of data? We're going to cover that in future videos. Okay, so thank you for watching. Please like, subscribe, and share, and I'll see you in the next one, but in the meantime please check out these videos.
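The two custom functions described in the transcript might look roughly like the sketch below. It is a reconstruction from the spoken description, not the author's published script: the `toy` data frame, the fixed seed, and the `impute()` helper with its `stat` argument are all assumptions made for illustration. Only the name `na.gen` and its `data`, `n` arguments come from the source.

```r
set.seed(42)  # assumption: fixed seed so the sketch is reproducible

# Hypothetical numeric data standing in for the 228 descriptor columns
toy <- as.data.frame(matrix(rnorm(200), nrow = 20, ncol = 10))

# Write NA at n randomly chosen (row, column) positions, as the transcript describes.
# Positions are drawn independently, so the same cell may be hit more than once
# and the final NA count can be slightly below n.
na.gen <- function(data, n) {
  i <- 1
  while (i < n + 1) {
    idx1 <- sample(seq_len(nrow(data)), 1)  # random row index
    idx2 <- sample(seq_len(ncol(data)), 1)  # random column index
    data[idx1, idx2] <- NA
    i <- i + 1
  }
  data
}

# Positional arguments must follow the (data, n) order; named arguments may not
dirty <- na.gen(toy, 15)
dirty_named <- na.gen(n = 15, data = toy)

# Hypothetical helper: replace each NA with a statistic of its own column.
# Only numeric columns are touched, which sidesteps the problem the video
# runs into when NA lands in the non-numeric Y column.
impute <- function(data, stat = mean) {
  for (j in which(sapply(data, is.numeric))) {
    data[is.na(data[, j]), j] <- stat(data[, j], na.rm = TRUE)
  }
  data
}

imputed_mean <- impute(dirty, mean)
imputed_median <- impute(dirty, median)

print(sum(is.na(dirty)))
print(sum(is.na(imputed_mean)))
print(sum(is.na(imputed_median)))
```

Passing the statistic as an argument keeps mean and median imputation in one function; writing two separate loops, one per statistic, would work equally well.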

Original Description

In this video, I will show you how you can handle missing data in your own data science project. This video represents the first in a multi-part series on data pre-processing in R.

🌟 Buy me a coffee: https://www.buymeacoffee.com/dataprofessor

⭕ Timeline
0:33 First part in Data pre-processing series
1:11 DHFR dataset
2:41 Outline of this episode
4:08 Open up RStudio or RStudio.cloud
4:15 Let's start
4:21 1. Load in the dataset
4:59 2. Check for missing data
5:48 3. Let's make the data dirty!
5:58 The custom function na.gen()
8:38 4. Check for missing data
9:08 How does is.na(dhfr) looks like?
10:18 Let's look at rows containing NA
11:29 Let's find the NA in the data
12:45 5. Handling the missing data
12:54 5.1 Simply delete data samples containing NA
13:30 5.2 Perform imputation
16:59 Preview of next episode of this series (on Data pre-processing)

The idea for this video was suggested in a comment by Marco Festugato

📎 DATA: https://raw.githubusercontent.com/dataprofessor/data/master/dhfr.csv
📎 CODE: https://github.com/dataprofessor/code/blob/master/dhfr/dhfr-handling-missing-data.R
📎 SLIDES: https://github.com/dataprofessor/slides/blob/master/Handling-missing-data.pdf

⭕ Playlist: Check out our other videos in the following playlists.
✅ Data Science 101: https://bit.ly/dataprofessor-ds101
✅ Data Science YouTuber Podcast: https://bit.ly/datascience-youtuber-podcast
✅ Data Science Virtual Internship: https://bit.ly/dataprofessor-internship
✅ Bioinformatics: http://bit.ly/dataprofessor-bioinformatics
✅ Data Science Toolbox: https://bit.ly/dataprofessor-datasciencetoolbox
✅ Streamlit (Web App in Python): https://bit.ly/dataprofessor-streamlit
✅ Shiny (Web App in R): https://bit.ly/dataprofessor-shiny
✅ Google Colab Tips and Tricks: https://bit.ly/dataprofessor-google-colab
✅ Pandas Tips and Tricks: https://bit.ly/dataprofessor-pandas
✅ Python Data Science Project: https://bit.ly/dataprofessor-python-ds
✅ R Data Science Project: https://bit.ly/dataprofessor-r-ds
⭕ S

This video teaches viewers how to handle missing data in R, either by deleting entries with missing data or by imputing missing values with the column mean or median. As the first part of a series on data pre-processing in R, it also demonstrates custom functions that inject NA values into a data set and that iterate through it to replace them.

Key Takeaways

Load the dhfr data set from the Data Professor GitHub
Check for missing data by nesting is.na inside sum
Introduce randomly missing data points using the custom function na.gen
Check the total number of NA values and which columns contain NA
Use the is.na function to flag missing values
Use the colSums function to count missing values per column
Identify rows with missing data using the complete.cases function
Impute missing values using the column mean or median
Create a new data object to store imputed data

💡 Handling missing data is a crucial step in data pre-processing, and R provides various techniques and functions to impute missing values, including mean and median imputation.
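None of these techniques can see a missing value that is stored as a placeholder. The transcript mentions question marks and minus 999 as examples of such codes. A minimal sketch of converting them to NA on import follows; the CSV text is made up for illustration, and the choice of placeholder strings is an assumption that depends on the data set at hand.

```r
# Hypothetical CSV text in which "?" and "-999" stand for missing values
csv_text <- "id,x1,x2
1,0.5,12
2,?,15
3,0.9,-999
4,1.1,18"

# na.strings lists the strings that read.csv should turn into NA on import
raw <- read.csv(text = csv_text, na.strings = c("NA", "?", "-999"))

print(raw)
print(sum(is.na(raw)))
```

Matching is done on the exact string in the file, so a value written as `-999.0` would not be caught by `"-999"` and would need its own entry.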

Chapters (17)

0:33 First part in Data pre-processing series

1:11 DHFR dataset

2:41 Outline of this episode

4:08 Open up RStudio or RStudio.cloud

4:15 Let's start

4:21 1. Load in the dataset

4:59 2. Check for missing data

5:48 3. Let's make the data dirty!

5:58 The custom function na.gen()

8:38 4. Check for missing data

9:08 How does is.na(dhfr) looks like?

10:18 Let's look at rows containing NA

11:29 Let's find the NA in the data

12:45 5. Handling the missing data

12:54 5.1 Simply delete data samples containing NA

13:30 5.2 Perform imputation

16:59 Preview of next episode of this series (on Data pre-processing)
