How to handle imbalanced datasets in Python

Data Professor · Beginner ·📰 AI News & Updates ·5y ago

Key Takeaways

The video demonstrates how to handle imbalanced datasets in Python using the imbalanced-learn library, specifically random undersampling and random oversampling to balance the class labels for a classification model.
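
To make the two techniques concrete, here is a minimal standard-library sketch of what random undersampling and random oversampling do to the class counts. It is not the imbalanced-learn implementation: the helper names random_undersample and random_oversample are hypothetical, and the labels simply reuse the 412 active / 166 inactive counts mentioned in the video.

```python
import random
from collections import Counter


def group_indices(labels):
    """Map each class label to the list of row indices carrying it."""
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    return by_class


def random_undersample(labels, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    by_class = group_indices(labels)
    target = min(len(idxs) for idxs in by_class.values())
    kept = []
    for idxs in by_class.values():
        kept.extend(rng.sample(idxs, target))  # drawn without replacement
    return sorted(kept)


def random_oversample(labels, seed=0):
    """Keep every row and add random duplicates until all classes match the largest."""
    rng = random.Random(seed)
    by_class = group_indices(labels)
    target = max(len(idxs) for idxs in by_class.values())
    kept = []
    for idxs in by_class.values():
        extra = rng.choices(idxs, k=target - len(idxs))  # drawn with replacement
        kept.extend(idxs + extra)
    return sorted(kept)


# Illustrative labels only, matching the class counts quoted in the video.
labels = ["active"] * 412 + ["inactive"] * 166
under = random_undersample(labels)
over = random_oversample(labels)
print(Counter(labels[i] for i in under))
print(Counter(labels[i] for i in over))
```

The functions return row indices rather than rows, so the same selection could be applied to a feature table and its label column together. Note that oversampling in this sketch only duplicates existing rows; it does not create new feature values.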

Full Transcript

In your quest to analyze datasets or build machine learning models, you have probably encountered a situation where you have an imbalanced dataset, which, as the term suggests, means that the class labels are very imbalanced: one class may have an abnormally high number of data samples, whereas another class will have a significantly lower number. For example, if you have a dataset where you're trying to predict whether a student will pass or not pass an exam on the basis of several input parameters, then an imbalanced dataset could mean that you have a significantly larger number of students who do not pass the exam. You could have 2,000 data samples for those not passing the exam and, let's say, 200 data samples for those passing. So you have 200 versus 2,000, and you can see that there are 10 times as many students not passing the exam. How can you deal with that kind of dataset? We're going to cover that in this tutorial video, and we're starting right now.

All right, let's get started. The link to this Jupyter notebook will be provided in the video description. The first thing that you need to do is make sure to install or update your imbalanced-learn library, which will allow you to handle the imbalanced dataset.

Let's proceed and read in the data. The data is from one of our research group's recent publications on hepatitis C virus inhibitors, and I'll provide you the link to the original research paper and also the GitHub repository of this research article. Let's read it in and have a look at the data frame. Here you can see that there are 578 rows, that is, 578 compounds, and there are 882 columns, where the last column is the activity class label. If we scroll to the right, you're going to see the activity class label, active and inactive. They're quite imbalanced, and I'm going to show you that in just a moment. The rest of the columns are the X variables.

So the first thing that we need to do is split the data frame into the X and y variables. We're using df.drop on the activity column for the X variables, meaning that we drop only the last column, the activity column. To assign the y variable, we select the activity column on its own. Let's have a look at the y variable: there are 412 rows having a value of active and only 166 rows having a value of inactive. If we have a look at the pie chart, we're going to see that they're pretty imbalanced. The active class has about two and a half times as many data samples as the inactive class: 412 versus 166, which accounts for 71.28% versus 28.72%.

I've provided you two versions of the code. In approach number one, you use the built-in plotting function of pandas to make the pie plot, or you could display it in the traditional way using matplotlib. You'll probably notice that both are using matplotlib, but the second example uses matplotlib explicitly, whereas approach number one uses the built-in function of pandas. You're going to get the same pie plot either way.

Now let's address the problem: how can we go from this imbalanced dataset to a balanced dataset, whereby the actives and the inactives are proportional to one another? Here we're going to use random undersampling, meaning that the majority class will be reduced so that it has the same proportion as the minority class. The terminology here is that the majority class is the one with the higher number of data samples and the minority class is the one with the lower number. So in undersampling, we reduce the size of the majority class so that it is equal to the minority class. That is one approach. Another approach is oversampling, which means that we increase the minority class so that it is equal to the majority class.

In our example, we have 412 active compounds, which is the majority class, and 166 inactive compounds, which is the minority class. In undersampling, we want to reduce 412 to become 166, and I'm going to show you that in just a moment. In oversampling, we're going to increase 166 to become 412, and in order to do that we perform resampling. Repeated resampling allows us to artificially generate additional data samples, so that the 166 original data samples become 412, because each resampling step draws in a random manner.

As shown here, we're going to use random undersampling and random oversampling. You could check out the API section of this library's documentation, which will show you more than one way to perform undersampling and more than one way to perform oversampling. In this example, I'm going to show you only the random approach to both. The link to the imbalanced-learn library is provided here: you can click on the logo, which will take you to the website, and then click on the API reference to see the other functions available to you. For undersampling, which is right here, we're going to use only RandomUnderSampler, and as you can see there are several other approaches. If you click on oversampling, there are several approaches here as well, and a predominant one is the SMOTE oversampling approach, so you could check that out. Let's head back to the tutorial.

All right, here we're going to perform random undersampling. As I mentioned, we're going to reduce the majority class so that it has the same number of samples as the minority class. The point to note here is that we import RandomUnderSampler from imblearn.under_sampling. We then create a variable called rus (r means random, u means under, s means sampling) using RandomUnderSampler, and as the input argument we set sampling_strategy equal to 1. As noted here, it can also be a floating-point number, so let me just say a numerical value. You could also comment this portion out and use the alternative approach instead, which will give you the same result. A value of 1 gives you a ratio of one to one, but you could also play around with the number, which will give you an unequal class ratio, meaning that the active and inactive classes will not be in a one-to-one ratio. I could show you that in just a moment.

Let's run it. This is actually the one generated from the previous run, and here is the new one. In X_res, res means resampled, and y_res is the resampled y. We're generating these two new variables by calling rus.fit_resample, where the input arguments are the original X and y. Then we take the newly generated y variable and have a look at the value counts. Calling value_counts on the newly generated y variable gives you the number of compounds in the active and inactive classes, and here you can see that there are equal numbers of actives and inactives. You can see clearly that the majority class, active, has been reduced from 412 to 166. The code here takes the value counts, which are 166 and 166, and then applies the plot.pie function, which, as I mentioned earlier, is a built-in pandas function for making the pie plot. Then you have %.2f, which gives you two decimal places, and here we set the title to be Undersampling.

All right, now let's head over to random oversampling. Let's move back to the original data distribution. In oversampling, we increase the size of the minority class, so the 166 inactives will become 412. For this one, we create the ros variable using RandomOverSampler from imblearn.over_sampling, and as the input argument we could use either 1 or 'not majority'; both will give you the same result. Then, in a similar fashion, we generate the new X and y variables using ros.fit_resample, taking in the original X and y. We take the newly generated y variable, compute the value counts, make the pie plot out of that, show two decimal places, and set the title to be Oversampling. You can see that the data is now equally distributed: the numbers of active and inactive compounds are the same, a one-to-one ratio, so they're both 412 compounds now. As you can see, the inactive class increased from 166 to 412.

I think it would be better that I leave this as your homework: play around with this option here. You could modify it to a value between zero and one, and please feel free to try the other oversampling or undersampling approaches mentioned in the API documentation, and drop a comment with your observations from this experimentation.

Congratulations, you have successfully balanced your dataset using undersampling or oversampling. Let me know in the comments which approach you like better, oversampling or undersampling. I hope that you're finding value in this video. Please support the channel by smashing the like button, subscribing if you haven't already, and hitting the notification bell so that you will be notified of the next video. And as always, the best way to learn data science is to do data science, and please enjoy the journey.

Original Description

In this video, you will learn how to handle imbalanced datasets, in particular the case where the class labels for your classification model are imbalanced (one class is significantly larger than the other, which gives rise to a majority class and a minority class). Here, we will use the imbalanced-learn Python library to perform random undersampling and random oversampling so that you can address this issue.

🌟 Download Kite for FREE https://www.kite.com/get-kite/?utm_medium=referral&utm_source=youtube&utm_campaign=dataprofessor&utm_content=description-only
Code: https://github.com/dataprofessor/imbalanced-data

⭕ Support my work:
🌟 Subscribe to the Coding Professor channel https://www.youtube.com/channel/UCJzlfIoF8nmWqJIv_iWQVRw?sub_confirmation=1
🌟 Subscribe to the Data Professor https://www.youtube.com/dataprofessor?sub_confirmation=1
🌟 Join the Newsletter of Data Professor http://newsletter.dataprofessor.org
🌟 Buy me a coffee https://www.buymeacoffee.com/dataprofessor

⭕ Recommended Books:
🌟 https://kit.co/dataprofessor
✅ Python Basics: A Practical Introduction to Python 3 https://amzn.to/3awdWgm
✅ Learn Python Programming (The no-nonsense, beginner's guide) https://amzn.to/2RFpSpn
✅ Learn to Program with Minecraft https://amzn.to/3x2MujZ
✅ Automate the Boring Stuff with Python, 2nd Edition: Practical Programming for Total Beginners https://amzn.to/2QzkyDs

⭕ Disclaimer: Recommended books and tools are affiliate links that give me a portion of sales at no cost to you, which will contribute to the improvement of this channel's content.

⭕ Stock photos, graphics and videos used on this channel:
✅ https://1.envato.market/c/2346717/628379/4662

#python #data #datascience #dataprofessor

This video teaches how to handle imbalanced datasets in Python using the imbalanced-learn library, covering random undersampling and random oversampling to balance the class labels for a classification model. It walks step by step through this preprocessing stage for machine learning models.
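
Before resampling, it helps to quantify how imbalanced the labels are. The short sketch below reproduces the percentages quoted in the video (412 active versus 166 inactive) using only the standard library; the variable names are illustrative.

```python
from collections import Counter

# Class counts as reported in the video for the activity label.
labels = ["active"] * 412 + ["inactive"] * 166

counts = Counter(labels)
total = sum(counts.values())

# Share of each class as a percentage, rounded to two decimals.
shares = {label: round(100 * n / total, 2) for label, n in counts.items()}

# How many majority samples exist per minority sample.
imbalance_ratio = max(counts.values()) / min(counts.values())

print(shares)
print(round(imbalance_ratio, 2))
```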

Key Takeaways

Install or update the imbalanced-learn library
Read in the data in the Jupyter notebook
Split the data set into X and y variables
Use df.drop to drop the activity column and obtain the X variables
Assign the y variable by selecting the activity column
Import RandomUnderSampler from the imbalanced-learn library (imblearn.under_sampling)
Create a variable called rus using RandomUnderSampler with sampling_strategy set to 1
Use the rus.fit_resample method to generate the new X and y variables
Use the value_counts method on the newly generated y variable to count the samples in each class
For oversampling, create ros with RandomOverSampler and call ros.fit_resample in the same way

💡 Random undersampling and oversampling can be used to balance the class labels for a classification model, which may improve the model's performance on imbalanced datasets.
