How to handle imbalanced datasets in Python
Key Takeaways
The video demonstrates how to handle imbalanced datasets in Python using the imbalanced-learn library, specifically the random undersampling and random oversampling techniques, to balance the class labels of a classification dataset.
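Before balancing anything, it helps to quantify how imbalanced the class labels are. The following is a minimal sketch using only the Python standard library. It assumes the class sizes quoted in the video (412 active, 166 inactive) and rebuilds the label list from those counts rather than loading the original data file; the variable names are illustrative.

```python
from collections import Counter

# Labels rebuilt from the counts quoted in the video (not the original file).
labels = ["active"] * 412 + ["inactive"] * 166

# Number of samples per class.
counts = Counter(labels)
total = sum(counts.values())

# Share of each class as a percentage, rounded to two decimals.
percentages = {label: round(100 * n / total, 2) for label, n in counts.items()}

# How many times larger the biggest class is than the smallest one.
imbalance_ratio = max(counts.values()) / min(counts.values())

print(counts)
print(percentages)
print(f"majority/minority ratio: {imbalance_ratio:.2f}")
```

The percentages match the 71.28 versus 28.72 split shown on the pie chart in the video, and the ratio shows that the majority class is about two and a half times the size of the minority class.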
Full Transcript
In your quest to analyze datasets or build machine learning models, you have probably encountered a situation where you have an imbalanced dataset. As the term suggests, this means that the class label is very imbalanced: one class has an abnormally high number of data samples, whereas another class has a significantly lower number. For example, if you have a dataset where you're trying to predict whether a student will pass or not pass an exam on the basis of several input parameters, an imbalanced dataset would mean that you have a significantly larger number of students who do not pass the exam. You could have 2,000 data samples for those not passing the exam and, let's say, 200 data samples for those passing. That is 200 versus 2,000, so there are ten times as many students not passing the exam. How can you deal with that dataset? We're going to cover that in this tutorial video, and we're starting right now.
All right, let's get started. The link to this particular Jupyter notebook will be provided in the video description. The first thing you need to do is make sure to update or install the imbalanced-learn library, which will allow you to handle the imbalanced dataset.
Let's proceed further and read in the data. The data is from one of our research group's recent publications on hepatitis C virus inhibitors, and I'll provide you the link to the original research paper and also the GitHub repository of this research article. Let's read it in and have a look at the data frame. Here you can see that there are 578 rows, meaning 578 compounds, and 882 columns, where the last column is the activity class label. If we scroll to the right, you're going to see the activity class label, active and inactive. They're quite imbalanced, and I'm going to show you that in just a moment. The rest of the columns are the X variables.
The first thing we need to do is split the data frame into the X and y variables. We're using df.drop on the activity column for the X variables, meaning that we drop only the last column, the activity column. Then, to assign the y variable, we select the activity column on its own. Let's have a look at the y variable. Here we see that there are 412 rows having a value of active and only 166 rows having a value of inactive.
If we have a look at the pie chart, we see that the classes are pretty imbalanced. The active class has roughly two and a half times as many data samples as the inactive class: 412 versus 166, which accounts for 71.28% versus 28.72%. I've provided you two versions of the code. In approach number one, you use the built-in function of pandas to make the pie plot, or you could display it in the traditional way using matplotlib. You'll probably notice that both are using matplotlib, but the second example explicitly uses the matplotlib approach, whereas approach number one uses the built-in function of pandas. Either way, you get the same pie plot.
Now let's address the problem: how can we go from this imbalanced dataset to a balanced dataset, whereby the actives and the inactives are in equal proportion? The terminology here is that the majority class is the one with more data samples and the minority class is the one with fewer data samples. One approach is undersampling, where we reduce the size of the majority class so that it is equal to the minority class. Another approach is oversampling, where we increase the minority class so that it is equal to the majority class. In our example we have 412 active, which is the majority class, and 166 inactive, which is the minority class. In undersampling we reduce the majority class, so we want 412 to become 166, and I'm going to show you that in just a moment. In oversampling we increase 166 to become 412, and in order to do that we perform resampling. Repeated resampling lets us artificially add data samples, drawn at random from the existing minority samples, so that the 166 original data samples become 412.
As shown here, we're going to use random undersampling and random oversampling. You could check out the API reference in the library's documentation, which provides more than one way to perform undersampling and more than one way to perform oversampling. In this particular example I'm going to show you only the random approach for both. The link to the imbalanced-learn library is provided here: you can click on the logo, which will take you to the website, and then click on the API reference to see the other functions available to you. For undersampling, which is right here, we're going to use only RandomUnderSampler, and as you can see there are several other approaches. If you click on oversampling, there are several approaches as well, and a prominent one is the SMOTE oversampling approach, so you could check that out. Let's head back to the tutorial.
All right, here we're going to perform random undersampling. As I mentioned, we're going to reduce the majority class so that it has the same number of samples as the minority class. The point of note here is that we import RandomUnderSampler from imblearn.under_sampling. We create a variable called rus (r means random, u means under, s means sampling) using RandomUnderSampler, and as the input argument we use sampling_strategy equal to 1. As you can note here, it could also be a float, or let me just say a numerical value. You could also comment this portion out and use this other approach instead, which will give you the same results. A value of 1 gives you a one-to-one ratio, but you could also play around with the number, which will give you an unequal class ratio, meaning that the active and inactive classes will not be in a one-to-one ratio.
Let's run it. This output is actually the one generated from the previous run, and here is the new one. In X_res, res means resampled, and y_res is the resampled y. We generate these two new variables using rus.fit_resample, where the input arguments are the original X and y. Then we take the newly generated y variable and look at the value counts, which gives you the number of compounds in the active and inactive classes. Here you can see that there are equal numbers of actives and inactives, so the majority class, active, has clearly been reduced from 412 to 166. The code here takes the value counts, which are 166 and 166, and then applies the plot.pie function, which, as I mentioned earlier, is a built-in function to make the pie plot. Then you have %.2f, which gives you two decimal places, and we set the title to be undersampling.
All right, now let's head over to random oversampling, and let's move back to the original data distribution. In oversampling we increase the size of the minority class, so 166 inactive will become 412. For this one we create the ros variable using RandomOverSampler from imblearn.over_sampling. As the input argument we could use either 1 or 'not majority', and both will give you the same results. Then, in a similar fashion, we generate the new X and y variables using ros.fit_resample, taking in the original X and y. We take the newly generated y variable, perform the value counts, make the pie plot out of that, show two decimal places, and set the title to be oversampling. You can see here that the data is now equally distributed: the numbers of active and inactive compounds are the same, a one-to-one ratio, so they're both 412 compounds now. As you can see, the inactive class increased from 166 to 412.
I think it would be better that I leave this as your homework: play around with this particular option here, which you could modify to be in the range between zero and one. Please feel free to try the other oversampling or undersampling approaches mentioned in the API documentation, and drop a comment with your observations from this experimentation. Congratulations, you have successfully balanced your dataset using undersampling or oversampling. Let me know in the comments which approach you like better, oversampling or undersampling.
I hope that you're finding value in this video. Please support the channel by smashing the like button, subscribing if you haven't already, and also make sure to hit the notification bell so that you will be notified of the next video. As always, the best way to learn data science is to do data science, and please enjoy the journey.
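The undersampling step walked through in the transcript can be sketched without any third-party library. This is not how imbalanced-learn implements RandomUnderSampler; it is a minimal plain-Python illustration of the idea. The helper name random_undersample, its seed parameter, and the single made-up feature per row are all hypothetical, and only the class sizes (412 active, 166 inactive) come from the video.

```python
import random
from collections import Counter


def random_undersample(rows, labels, seed=0):
    """Randomly drop rows from larger classes until all classes match the smallest."""
    rng = random.Random(seed)

    # Group row indices by their class label.
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)

    # Every class is cut down to the size of the smallest class.
    target = min(len(indices) for indices in by_class.values())

    kept = []
    for indices in by_class.values():
        # Sampling without replacement: the minority class is kept whole,
        # larger classes lose randomly chosen rows.
        kept.extend(rng.sample(indices, target))

    kept.sort()  # preserve the original row order
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Toy stand-in data: one made-up feature per row, class sizes as in the video.
y = ["active"] * 412 + ["inactive"] * 166
X = [[i] for i in range(len(y))]

X_res, y_res = random_undersample(X, y)
print(Counter(y_res))
```

After resampling, both classes contain 166 rows, so the dataset shrinks from 578 to 332 rows. The cost of undersampling is visible here: 246 majority-class rows are discarded.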
Original Description
In this video, you will learn how to handle imbalanced datasets, in particular the case where the class labels for your classification model are imbalanced (one class is significantly larger than the other, which gives rise to a majority class and a minority class). Here, we will use the imbalanced-learn Python library to perform random undersampling and random oversampling so that you can address this issue.
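To make the idea of random oversampling concrete, here is a minimal plain-Python sketch. It is an illustration under stated assumptions, not the imbalanced-learn implementation: the helper name random_oversample, its seed parameter, and the toy single-feature rows are hypothetical, and only the class sizes (412 active, 166 inactive) are taken from the video.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Repeat randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)

    # Group row indices by their class label.
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)

    # Every class is grown to the size of the largest class.
    target = max(len(indices) for indices in by_class.values())

    chosen = []
    for indices in by_class.values():
        chosen.extend(indices)  # keep every original row
        # Draw the shortfall with replacement: rows are repeated, not invented.
        chosen.extend(rng.choices(indices, k=target - len(indices)))

    return [rows[i] for i in chosen], [labels[i] for i in chosen]


# Toy stand-in data: one made-up feature per row, class sizes as in the video.
y = ["active"] * 412 + ["inactive"] * 166
X = [[i] for i in range(len(y))]

X_res, y_res = random_oversample(X, y)
print(Counter(y_res))
```

Both classes end up with 412 rows, for 824 rows in total. Because the extra minority rows are exact copies of existing ones, no new information is added, which is worth keeping in mind when evaluating a model trained on oversampled data.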
Code: https://github.com/dataprofessor/imbalanced-data