R Tutorial: Binning encoding: data driven

DataCamp · Beginner · 🛠️ AI Tools & Apps · 6y ago

Key Takeaways

This video tutorial demonstrates data-driven approaches to reducing the space of categorical variables and creating meaningful features using binning encoding in R, specifically using the education level variable from the adult incomes data set.

Full Transcript

The encoding procedures we have discussed work well on categorical data with a manageable number of categories. However, using one-hot encoding for a variable with thousands of categories will create a thousand or more new columns, and will be complicated even if you combine similar categories based on contextual information. Let's discuss data-driven approaches to reducing the space of categorical variables and creating meaningful features.

Let's take a look at the education level variable from the adult_incomes data set. There are sixteen distinct categories we want to incorporate into our model that predicts income levels above or below fifty thousand dollars. We want to reduce these categories in a meaningful way, leveraging the outcomes associated with these levels.

One approach is to look at the proportions of each category with respect to the income, which is the outcome variable. In this example, we can combine prop.table() with the table() function. The prop.table() function takes a table with cells and divides each cell value by the sum of all the cells. If you add a 1 after the table, you get the value of each cell divided by the sum of the row cells. In our example, we want the proportion of income within each grade, so we divide by the row sum.

These proportions give us insight into possible relationships the categories have with the outcome. For example, we can deduce that lower grade levels are associated with making less than $50,000 in a calendar year: of the folks that only completed the 10th grade, about 93 percent make less than 50K a year.

We order the proportions that correspond to making over $50,000 a year using the arrange() function, passing a table that contains the education span, income, and the corresponding proportions. We can leverage this information to create meaningful categories. For example, we can group categories with similar proportions of making over $50,000 a year into three ordered ranges: low education from zero to ten percent, medium education from ten to thirty percent, and high education containing the rest, from 30 to 100 percent.

We can attach this ad hoc information to our existing income data by using inner_join() and attaching the proportions table results with the proportion mappings for each grade level category. An inner join takes two data frames and only combines records that have the same link, discarding records with no links from either table. The link is specified using the by argument. In our example, we are linking education from the adult incomes table with edy_span from the proportions table.

We now have the proportions associated with our education levels to map onto our desired low, medium, and high education range categories. We create a new column containing the new mappings, where the low education category contains eight categories from preschool to twelfth grade, the medium category contains four categories after graduating high school, and the high education level contains a bachelor's degree and more.

Now it's your turn. Let's...
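As a minimal sketch of the prop.table() step described above, the following base R snippet compares overall proportions with row-wise proportions. The `adult_toy` data frame, its rows, and the `education` and `income` column names are made up for illustration; they are assumptions, not the real adult incomes data.

```r
# Toy stand-in for the adult incomes data (hypothetical rows and column names)
adult_toy <- data.frame(
  education = c(rep("10th", 4), rep("HS-grad", 5), rep("Bachelors", 4)),
  income = c("<=50K", "<=50K", "<=50K", "<=50K",
             "<=50K", "<=50K", "<=50K", "<=50K", ">50K",
             "<=50K", "<=50K", ">50K", ">50K"),
  stringsAsFactors = FALSE
)

# Cross-tabulate education level against income group
counts <- table(adult_toy$education, adult_toy$income)

# No margin: each cell is divided by the sum of all cells
overall <- prop.table(counts)

# Margin 1: each cell is divided by the sum of its row,
# giving the income split within each education level
by_row <- prop.table(counts, 1)

print(round(by_row, 2))
```

With `1` as the second argument every row sums to 1, so the `>50K` column can be read as the share of each education level earning over $50,000.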

Original Description

Want to learn more? Take the full course at https://learn.datacamp.com/courses/feature-engineering-in-r at your own pace. More than a video, you'll learn hands-on coding & quickly apply skills to your daily work.

This video tutorial teaches data-driven approaches to reducing the space of categorical variables and creating meaningful features using binning encoding in R. It demonstrates how to use the education level variable from the adult incomes data set to create new categories based on proportions of income. By the end of this tutorial, viewers will be able to apply data-driven approaches to categorical data and create meaningful features using binning encoding.

Key Takeaways
  1. Load the adult incomes data set in R
  2. Explore the education level variable and its categories
  3. Use the prop.table() and table() functions to calculate proportions of income within each category
  4. Arrange the proportions in descending order using the arrange() function
  5. Create new categories based on proportions of making over $50,000 a year
  6. Use inner_join() to attach the proportions table results to the income data
  7. Map the new categories to the existing income data
💡 Data-driven approaches to reducing the space of categorical variables can create more meaningful features for machine learning models by leveraging the relationships between categories and the outcome variable.
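
The steps above can be sketched end to end with dplyr. This is an illustration under stated assumptions, not the course's exact code: the toy data, the `edu_level`, `prop`, and `education_bin` column names, and the half-open treatment of the 10% and 30% cut-offs are all hypothetical choices.

```r
library(dplyr)

# Toy stand-in for the adult incomes data (hypothetical rows and column names)
adult_toy <- data.frame(
  education = c(rep("10th", 4), rep("HS-grad", 5), rep("Bachelors", 4)),
  income = c("<=50K", "<=50K", "<=50K", "<=50K",
             "<=50K", "<=50K", "<=50K", "<=50K", ">50K",
             "<=50K", "<=50K", ">50K", ">50K"),
  stringsAsFactors = FALSE
)

# Row-wise proportions, reshaped to one row per education/income pair
prop_df <- as.data.frame(
  prop.table(table(adult_toy$education, adult_toy$income), 1),
  stringsAsFactors = FALSE
)
names(prop_df) <- c("edu_level", "income", "prop")

# Keep the share earning over 50K, order it, and bin it into three ranges
# (assumed boundaries: low < 10%, medium 10% to < 30%, high >= 30%)
over_50k <- prop_df %>%
  filter(income == ">50K") %>%
  arrange(desc(prop)) %>%
  mutate(education_bin = case_when(
    prop < 0.10 ~ "low",
    prop < 0.30 ~ "medium",
    TRUE ~ "high"
  )) %>%
  select(edu_level, prop, education_bin)

# Attach the proportion and the bin to every original record;
# `by` links two differently named key columns
adult_binned <- inner_join(adult_toy, over_50k,
                           by = c("education" = "edu_level"))

print(over_50k)
```

Because every education level in the toy data also appears in the proportions table, the inner join keeps all rows; a level missing from either side would be dropped.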
