R Tutorial: Measuring distance for categorical data
Want to learn more? Take the full course at https://learn.datacamp.com/courses/cluster-analysis-in-r at your own pace. More than a video, you'll learn hands-on coding & quickly apply skills to your daily work.
---
So far you have worked exclusively with one type of distance metric, the Euclidean distance.
This is a commonly used metric and is a great starting point when working with data that is continuous. But what happens if the data you have isn't continuous but is categorical?
Let's start with the most basic case of categorical features, those that are binary, meaning that the values can only be one of two possibilities.
Here you are presented with survey data; let's call it survey a.
The participants of this survey were asked whether they enjoy drinking various types of alcoholic beverages. Since they can only answer yes or no, we can code this binary response as TRUE or FALSE.
We would be interested to learn which participants are similar to one another based on their responses.
To calculate this we will use the similarity score called the Jaccard Index.
This measure of similarity captures the ratio of the size of the intersection of A and B to the size of the union of A and B.
Or, more intuitively, the ratio of the number of features that are TRUE for both observations to the number of features that are TRUE for at least one of them.
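Written as a formula, the definition above looks like this. The set notation is an assumption for illustration, since the transcript only gives the definition in words: A and B stand for the sets of features that are TRUE for each of the two observations.

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
```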
So, going back to the previous example, let us calculate the Jaccard similarity for observations one and two. They are both TRUE in only one category, beer, so the intersection has a value of one, while the number of categories in which either observation is TRUE, the union, is four.
Dividing the intersection by the union, we get a Jaccard similarity of 0.25.
But what about the distance? Remember that distance is 1 - similarity, so in this case the distance is 0.75.
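The arithmetic above can be checked with a few lines of R. This is a minimal sketch, not the course's own code: the vectors `obs_1` and `obs_2`, the category names other than beer, and the helper `jaccard_similarity()` are hypothetical. The values were chosen only so that the two observations share one TRUE (beer) and four categories are TRUE at least once.

```r
# Hypothetical responses for two participants. The values are assumptions
# chosen to match the worked example: one shared TRUE (beer) and four
# categories that are TRUE for at least one participant.
obs_1 <- c(wine = TRUE,  beer = TRUE, whiskey = FALSE, vodka = FALSE)
obs_2 <- c(wine = FALSE, beer = TRUE, whiskey = TRUE,  vodka = TRUE)

# Hypothetical helper: Jaccard similarity of two logical vectors.
# Returns NaN if neither vector contains a TRUE (0 / 0).
jaccard_similarity <- function(a, b) {
  n_both   <- sum(a & b)  # features TRUE in both (intersection)
  n_either <- sum(a | b)  # features TRUE in at least one (union)
  n_both / n_either
}

similarity <- jaccard_similarity(obs_1, obs_2)
distance   <- 1 - similarity

print(similarity)  # 0.25
print(distance)    # 0.75
```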
To learn how to do this in R, let's start with a subset of our data containing three observations, called survey a.
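One way to compute these distances for several observations at once is base R's `dist()` with `method = "binary"`, which returns the proportion of features where only one observation is "on" among the features where at least one is "on" (the Jaccard distance). The sketch below is an assumption about the approach, not necessarily the call used in the course. The data frame `survey_a` is hypothetical: its column names and values are invented for illustration, with the first two rows matching the worked example.

```r
# Hypothetical three-observation survey; names and values are assumptions.
survey_a <- data.frame(
  wine    = c(TRUE,  FALSE, TRUE),
  beer    = c(TRUE,  TRUE,  TRUE),
  whiskey = c(FALSE, TRUE,  TRUE),
  vodka   = c(FALSE, TRUE,  FALSE)
)

# Convert the logical columns to a numeric 0/1 matrix before calling dist().
survey_matrix <- as.matrix(survey_a) + 0

# method = "binary" gives the Jaccard distance between each pair of rows.
jaccard_dist <- dist(survey_matrix, method = "binary")
print(jaccard_dist)
```

The result is a `dist` object holding the pairwise distances between rows 1 and 2, 1 and 3, and 2 and 3, which can be passed on to functions such as `hclust()`.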
In order to calculate the Jaccard distan