Python Tutorial: Increasing successful detections using data resampling
Key Takeaways
Applies data resampling methods in Python to increase successful detections of fraud cases
Full Transcript
In this video we're going to talk about data resampling methods. They help us to better train our models to recognize fraud cases when there are only very few cases of fraud.
The most straightforward way to adjust the imbalance of your data is to undersample the majority class, aka the non-fraud cases, or oversample the minority class, aka the fraud cases. With undersampling, you take random draws from your non-fraud observations to match the amount of fraud observations, as seen in the picture. With oversampling, you take random draws from the fraud cases and copy those observations to increase the amount of fraud samples you have in your data. Both methods lead to having a perfect balance between fraud and non-fraud data, but there are drawbacks. With random undersampling you are effectively throwing away a lot of data and information. With oversampling you are simply copying data and are therefore training your model on a lot of duplicates.
Let's see how you can implement these methods in practice. You can implement resampling methods using Python's imbalanced-learn module. It is compatible with scikit-learn and allows you to implement these methods in just two lines of code. As you can see here, I import the package and take the random oversampler and assign it to method. I simply fit the method onto my original feature set X and labels y to obtain a resampled feature set X and resampled y. I plot the datasets here side by side, such that you can see the effect of my resampling method. The darker blue color of the data points reflects that there are more identical data points now.
The synthetic minority oversampling technique, or SMOTE, is another way of adjusting the imbalance by oversampling your minority observations, aka your fraud cases. But with SMOTE we're not just copying the minority class. Instead, as you can see in this picture, SMOTE uses characteristics of nearest neighbors of fraud cases to create new synthetic fraud cases, and thereby avoids duplicating observations.
You might wonder which one of these methods is the best. Well, it depends very much on the situation. If you have very large amounts of data and also many fraud cases, you might find it computationally easier to undersample rather than to increase data even more. But in most cases throwing away data is not desirable. When it comes to oversampling, SMOTE is more sophisticated, as it does not duplicate data, but this only works well if your fraud cases are quite similar to each other. If fraud is spread out over your data and not very distinct, using nearest neighbors to create more fraud cases introduces a bit of noise in the data, as the nearest neighbors might not necessarily be fraud cases.
One thing to keep in mind when using resampling methods is to only resample on your training set. Your goal is to better train your model by giving it balanced amounts of data. Your goal is not to predict your synthetic samples. Always make sure your test data is free of duplicates or synthetic data, such that you can test your model on real data only. The way to do this is to first split the data into a train and test set, as you can see here. I then resample the training set only. I fit my model onto the resampled training data, and lastly I obtain my performance metrics by looking at my original, not resampled, test data. These steps should look familiar to you.
Original Description
Want to learn more? Take the full course at https://learn.datacamp.com/courses/fraud-detection-in-python at your own pace. More than a video, you'll learn hands-on coding & quickly apply skills to your daily work.
---
In this video we're going to talk about data resampling methods. They help us to better train our models to recognise fraud cases when there are only very few cases of fraud.
The most straightforward way to adjust the imbalance of your data is to under-sample the majority class (aka non-fraud cases), or oversample the minority class (aka the fraud cases). With under-sampling, you take random draws from your non-fraud observations to match the amount of fraud observations, as seen in the picture.
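To make the random draws concrete, here is a small hand-rolled sketch using NumPy only. The data (1000 rows, 20 of them labelled as fraud) is made up for illustration, and the index-based approach is just one possible way to do it, not the method used in the course.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 1000 observations, the first 20 labelled as fraud (1).
X = rng.normal(size=(1000, 2))
y = np.zeros(1000, dtype=int)
y[:20] = 1

fraud_idx = np.flatnonzero(y == 1)
non_fraud_idx = np.flatnonzero(y == 0)

# Under-sampling: draw as many non-fraud rows as there are fraud rows,
# without replacement, and discard the rest.
kept = rng.choice(non_fraud_idx, size=len(fraud_idx), replace=False)
under_idx = np.concatenate([fraud_idx, kept])
X_under, y_under = X[under_idx], y[under_idx]

# Over-sampling: draw fraud rows with replacement (i.e. copy them)
# until both classes have the same number of rows.
extra = rng.choice(fraud_idx, size=len(non_fraud_idx) - len(fraud_idx), replace=True)
over_idx = np.concatenate([np.arange(len(y)), extra])
X_over, y_over = X[over_idx], y[over_idx]

print(np.bincount(y_under))  # [20 20]
print(np.bincount(y_over))   # [980 980]
```

The under-sampled set keeps only 40 of the 1000 rows, while the over-sampled set contains 960 copied fraud rows, which illustrates the two drawbacks: discarded information versus duplicates.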
You can implement resampling methods using Python's imblearn module. It is compatible with scikit-learn and allows you to implement these methods in just 2 lines of code. As you can see here, I import the package and take the RandomOverSampler and assign it to "method". I simply fit the method onto my original feature set X, and labels y, to obtain a resampled feature set X, and resampled y. I plot the datasets here side by side, such that you can see the effect of my resampling method. The darker blue colour of the data points reflects that there are more identical data points now.
The Synthetic Minority Oversampling Technique, or SMOTE, is another way of adjusting the imbalance by oversampling your minority observations, aka your fraud cases.
But with SMOTE, we're not just copying the minority class; instead, as you see in this picture, SMOTE uses characteristics of nearest neighbours of fraud cases to create new synthetic fraud
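The idea of building synthetic cases from nearest neighbours can be illustrated with a simplified NumPy sketch. This is not the imbalanced-learn implementation: the function name smote_like_samples and its parameters are hypothetical, and it assumes neighbours are searched among the fraud cases only, with each synthetic point placed at a random position on the line between a fraud case and one of its neighbours.

```python
import numpy as np


def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority rows
    and their nearest minority neighbours (simplified, illustrative only).
    Requires at least two minority rows."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)

    # Pairwise Euclidean distances; a row must not be its own neighbour.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base row and one of its k neighbours for each new sample.
    base = rng.integers(0, n, size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]

    # Place the synthetic row somewhere on the segment between the two.
    gap = rng.random((n_new, 1))
    return X_minority[base] + gap * (X_minority[chosen] - X_minority[base])


# Hypothetical fraud cases: 20 rows with 2 features.
rng = np.random.default_rng(1)
X_fraud = rng.normal(loc=3.0, size=(20, 2))
X_synthetic = smote_like_samples(X_fraud, n_new=50)
print(X_synthetic.shape)  # (50, 2)
```

Because each synthetic row lies between two existing fraud cases, none of them needs to be an exact duplicate, which is the contrast with plain random oversampling.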