R Tutorial: Binning encoding: data driven
Key Takeaways
This video tutorial demonstrates data-driven approaches to reducing the space of categorical variables and creating meaningful features using binning encoding in R, specifically using the education level variable from the adult incomes data set.
Full Transcript
The encoding procedures we have discussed work well on categorical data with a manageable number of categories. However, using one-hot encoding for a variable with thousands of categories will create a thousand or more new columns, and will be complicated even if you combine similar categories based on contextual information. Let's discuss data-driven approaches to reducing the space of categorical variables and creating meaningful features.

Let's take a look at the education level variable from the adult_incomes dataset. There are 16 distinct categories we want to incorporate into our model that predicts income levels above or below 50,000 dollars. We want to reduce these categories in a meaningful way, leveraging the outcomes associated with these levels.

One approach is to look at the proportions of each category with respect to the income, which is the outcome variable. In this example, we can combine prop.table() with the table() function. The prop.table() function takes a table with cells and divides each cell value by the sum of all the cells. If you add a 1 after the table, you get the value of each cell divided by the sum of the cells in its row.

In our example, we want the proportion of income within each grade, which means dividing by the row sum. These proportions give us insight into possible relationships the categories have with the outcome. For example, we can deduce that lower grade levels are associated with making less than 50,000 dollars in a calendar year: of the folks that only completed the tenth grade, about 93 percent make less than 50K a year.

We order the proportions that correspond to making over 50,000 dollars a year using the arrange() function, passing a table that contains the education span, income, and the corresponding proportions. We can leverage this information to create meaningful categories. For example, we can group categories with similar proportions of making over 50,000 dollars a year into three ordered ranges: low education from 0 to 10 percent, medium education from 10 to 30 percent, and high education containing the rest, from 30 to 100 percent.

We can attach this ad hoc information to our existing income data by using inner_join() and attaching the proportions table results with the proportion mappings for each grade level category. An inner join takes two data frames and only combines records that have the same link, discarding records with no links from either table. The link is specified using the by argument. In our example, we are linking education from the adult_incomes table with edy_span from a proportions table.

We now have the proportions associated with our education levels to map onto our desired low, medium, and high education range categories. We create a new column containing the new mappings, where the low education category contains eight categories from preschool to twelfth grade, the medium category contains four categories after graduating high school, and the high education level contains a bachelor's degree and more.

Now it's your turn.
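The ordering, binning, and join steps described in the transcript can be sketched roughly as follows. This is a minimal illustration, not the course's own code: the data frame adult_toy, its values, and the names edu_level, prop, and edu_bin are hypothetical stand-ins, and it assumes the dplyr package is installed.

```r
library(dplyr)

# Toy stand-in for adult_incomes (hypothetical values, not the real data)
adult_toy <- data.frame(
  education = c(rep("10th", 20), rep("HS-grad", 10), rep("Bachelors", 10)),
  income = c(rep("<=50K", 19), ">50K",         # 10th: 5% over 50K
             rep("<=50K", 8), rep(">50K", 2),  # HS-grad: 20% over 50K
             rep("<=50K", 5), rep(">50K", 5)), # Bachelors: 50% over 50K
  stringsAsFactors = FALSE
)

# Row-wise proportions, flattened to one row per (education, income) pair
prop_table <- as.data.frame(
  prop.table(table(adult_toy$education, adult_toy$income), 1),
  stringsAsFactors = FALSE
)
names(prop_table) <- c("edu_level", "income", "prop")

# Keep the share earning over 50K, order it, and bin it into three ranges
over_50k <- prop_table %>%
  filter(income == ">50K") %>%
  arrange(prop) %>%
  mutate(edu_bin = cut(prop,
                       breaks = c(0, 0.10, 0.30, 1),
                       labels = c("low", "medium", "high"),
                       include.lowest = TRUE)) %>%
  select(edu_level, prop, edu_bin)

# Attach the bins to every record; `by` links education to edu_level
adult_binned <- inner_join(adult_toy, over_50k,
                           by = c("education" = "edu_level"))

head(adult_binned)
```

Note that cut() closes each interval on the right by default, so a proportion of exactly 0.10 would land in the low bin here. The transcript does not say which side the boundaries belong to, so that choice is an assumption.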
Original Description
Want to learn more? Take the full course at https://learn.datacamp.com/courses/feature-engineering-in-r at your own pace. More than a video, you'll learn hands-on coding & quickly apply skills to your daily work.
---
The encoding procedures we have discussed work well on categorical data with a manageable number of categories. However, using one-hot encoding for a variable with thousands of categories will create a thousand or more new columns, and will be complicated even if you combine similar categories based on contextual information. Let's discuss data-driven approaches to reducing the space of categorical variables and creating meaningful features.
Let's take a look at the education level variable from the adult_incomes dataset. There are 16 distinct categories we want to incorporate into our model that predicts income levels above or below 50,000 dollars. We want to reduce these categories in a meaningful way, leveraging the outcomes associated with these levels.
One approach is to look at the proportions of each category with respect to the income, which is the outcome variable. In this example, we can combine prop.table() with the table() function. The prop.table() function takes a table with cells and divides each cell value by the sum of all the cells. If you add a 1 after the table, you get the value of each cell divided by the sum of the cells in its row.
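As a small sketch of the difference between the two calls, the following uses only base R. The data frame adult_toy and its values are hypothetical stand-ins for the adult_incomes data, chosen so the proportions are easy to check by hand.

```r
# Toy stand-in for adult_incomes (hypothetical values, not the real data)
adult_toy <- data.frame(
  education = c("10th", "10th", "10th", "10th",
                "Bachelors", "Bachelors", "Bachelors", "Bachelors"),
  income = c("<=50K", "<=50K", "<=50K", ">50K",
             "<=50K", "<=50K", ">50K", ">50K")
)

# Cross-tabulate education (rows) against income (columns)
counts <- table(adult_toy$education, adult_toy$income)

# No margin: each cell is divided by the grand total, so all cells sum to 1
overall_props <- prop.table(counts)

# Margin 1: each cell is divided by its row total, so each row sums to 1
row_props <- prop.table(counts, 1)

print(row_props)
```

With the row-wise version, each education level can be read on its own: in this toy data, 3 of the 4 records with a 10th grade education fall under 50K, so that cell is 0.75.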
In our example, we want the proportion of income within each grade, which means dividing by the row sum. These proportions give us insight into possible relationships the categories have with the outcome. For example, we can deduce that lower grade levels are associated with making less than 50,000 dollars in a calendar year: of the folks that only completed the tenth grade, about 93 percent make less than 50K a year.
We order the proportions that correspond to making over 50,000 dollars a year using the arrange() function and passing a table that contains the education span, income, and the corresponding proportions.