R Tutorial: Randomized distributions
Key Takeaways
The video demonstrates statistical inference using R, specifically exploring randomized distributions and null hypothesis testing with the mutate and sample functions.
Full Transcript
The idea behind statistical inference is to understand samples from a hypothetical population where the null hypothesis is true, for example, from East and West Coasts where cola preference is the same.
As a way of summarizing each of the null samples, we calculate one statistic from each sample. Here, the statistic is the difference in the proportion of West Coast people who prefer cola as compared with the proportion of East Coast people who prefer cola, where each of the sample proportions is denoted p-hat. The difference in p-hats changes with each sample: first it's zero, then it's negative one third, and it will keep changing.
We can build a distribution of differences in proportions assuming the null hypothesis, that there is no link between location and soda preference, is true. That is, the null samples consist of randomly shuffled soda variables, so that the samples don't have any dependency between location and soda preference.
The original sample proportions are p-hat East of 0.82 and p-hat West of 0.73, a difference of negative 0.09. The first shuffle of the drink variable gives the exact same summaries as the observed data. The second shuffle, on the other hand, gives 27 people on the East Coast who prefer cola as compared with 20 on the West Coast who prefer cola. The difference in sample proportions for the second shuffle of the data is negative 0.02, which is less extreme than the original data. Note that both the original data (the red line) and the first two shuffled differences in proportions (black dots) can be plotted together.
The next few shuffles give differences in proportions centered around zero. Note that the fifth difference is negative 0.16, which is farther from zero than the original data. That is, the fifth shuffle gives more evidence of a difference in soda preference than the original data does, and we know that the fifth shuffle was created by randomly permuting the labels, so a difference of negative 0.16 is plausible under the null hypothesis. Generally, the null differences are between negative 0.2 and positive 0.2, and about a third of the differences are as or more extreme than the observed difference of negative 0.09.
Now that we have seen a visual representation of the null distribution, let's see how a null sample can be generated in R using the mutate and sample functions. The vector of soda preferences is mixed up, or permuted, such that whether someone is on the East or West Coast can't possibly be causing any difference in proportions. However, due to inherent natural variability, there's also no expectation that soda preferences are exactly the same for any sample.
After grouping by the location variable, summarize calculates the proportion of each coast that prefers cola. Note that drink equals cola produces a vector of TRUEs and FALSEs, which are then coerced to ones and zeros when the mean function is applied. Since a 1 represents an individual who prefers cola, the average of these ones and zeros represents the proportion of individuals who prefer cola. summarize is used a second time to find the difference in proportion of cola preference across the two coastal groups. The diff function is applied across the two coastal groups because the data have been summarized by location.
Notice that the output gives a permuted difference of negative 0.02 as compared to the observed difference of negative 0.09. However, the permuted difference of negative 0.02 represents only one instance of the variability of soda preference under the null model. To get a sense of the degree of variability under the null model, it's necessary to permute the drink variable many times. By repeating the permuting and difference calculations five times, the permuted differences are seen to be sometimes positive, sometimes negative, sometimes close to zero, sometimes far from zero. However, five times isn't quite enough to capture all of the variability in null differences. By repeating the permutation process 100 times, the null differences are seen to range from approximately negative 0.3 to positive 0.3, although the majority of the differences are between negative 0.1 and positive 0.1. The observed data difference of negative 0.09 doesn't seem too extreme compared to this collection of null differences.
OK, now it's your turn to practice what you've learned.
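The pipeline described in the transcript might look roughly like the sketch below, assuming the dplyr package is installed. The column names location and drink and the value "cola" follow the transcript's wording; the data frame name soda, the label "other", and the counts (28 of 34 East, 19 of 26 West) are assumptions chosen only to match the rounded proportions of 0.82 and 0.73, since the video's data set is not shown here. The helper perm_diff and the two-sided reading of "as or more extreme" are likewise illustrative choices.

```r
library(dplyr)

set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical data: counts chosen to match the rounded proportions in the
# transcript (East 28/34 = 0.82, West 19/26 = 0.73); not given in the source.
soda <- data.frame(
  location = rep(c("East", "West"), times = c(34, 26)),
  drink = c(rep(c("cola", "other"), times = c(28, 6)),
            rep(c("cola", "other"), times = c(19, 7)))
)

# One null sample: permute drink, summarize by coast, then take the
# difference (groups sort as East, West, so diff() gives West minus East).
one_perm <- soda %>%
  mutate(drink_perm = sample(drink)) %>%
  group_by(location) %>%
  summarize(prop = mean(drink == "cola"),
            prop_perm = mean(drink_perm == "cola")) %>%
  summarize(diff_orig = diff(prop),
            diff_perm = diff(prop_perm))

one_perm

# Hypothetical helper: a single permuted difference in proportions
perm_diff <- function(data) {
  data %>%
    mutate(drink_perm = sample(drink)) %>%
    group_by(location) %>%
    summarize(prop_perm = mean(drink_perm == "cola")) %>%
    summarize(diff_perm = diff(prop_perm)) %>%
    pull(diff_perm)
}

# Repeat the permutation 100 times to approximate the null distribution
null_diffs <- replicate(100, perm_diff(soda))
range(null_diffs)

# Share of null differences at least as far from zero as the observed one
# (a two-sided comparison, which is an assumption of this sketch)
mean(abs(null_diffs) >= abs(one_perm$diff_orig))
```

Because each call to sample() reshuffles the drink column, the permuted differences vary from run to run; only diff_orig, the observed difference of about negative 0.09, is fixed by the data.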
Original Description
Want to learn more? Take the full course at https://learn.datacamp.com/courses/foundations-of-inference-in-r at your own pace. More than a video, you'll learn hands-on coding & quickly apply skills to your daily work.
---
The idea behind statistical inference is to understand samples from a hypothetical population where the null hypothesis is true, for example, from East and West Coasts where cola preference is the same.
As a way of summarizing each of the null samples, we calculate one statistic from each sample. Here, the statistic is the difference in the proportion of West Coast people who prefer cola as compared with the proportion of East Coast people who prefer cola, where each of the sample proportions is denoted “p-hat”. The difference in p-hats changes with each sample. First it is 0, then it is negative one third, and it will keep changing.
We can build a distribution of differences in proportions assuming the null hypothesis, that there is no link between location and soda preference, is true. That is, the null samples consist of randomly shuffled soda variables so that the samples don’t have any dependency between location and soda preference.
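As a minimal sketch of this shuffling idea in R, the block below permutes a drink vector with sample(). The six observations are invented for illustration and are not the video's data. Called with only a vector, sample() returns a random permutation of it, so the overall count of cola drinkers is unchanged while any pairing with location is broken.

```r
set.seed(42)  # arbitrary seed, only so the shuffle is reproducible

# Hypothetical toy data, not taken from the video
location <- c("East", "East", "East", "West", "West", "West")
drink    <- c("cola", "cola", "other", "cola", "other", "other")

# sample(x) returns a random permutation of x
drink_perm <- sample(drink)

# The overall drink counts are unchanged by shuffling...
table(drink)
table(drink_perm)

# ...but the pairing of location and drink is now random
table(location, drink_perm)
```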
The original sample proportions are p-hat East of 0.82 and p-hat West of 0.73, a difference of negative 0.09.
The first shuffle of the drink variable gives the exact same summaries as the observed data! The second shuffle, on the other hand, gives 27 people on the East Coast who prefer cola as compared with 20 on the West Coast who prefer cola. The difference in sample proportions for the second shuffle of the data is negative 0.02, which is less extreme than the original data. Note that both the original data (the red line) and the first two shuffled differences in proportions (black dots) can be plotted together.
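The arithmetic behind these differences can be checked in R. The group sizes (34 East, 26 West) and the observed cola counts (28 and 19) below are assumptions: the description quotes only rounded proportions and the second shuffle's counts, and these values were chosen to be consistent with them. The helper diff_props is hypothetical.

```r
# Assumed group sizes and observed cola counts (not stated in the source;
# chosen to reproduce the quoted proportions of 0.82 and 0.73)
n        <- c(East = 34, West = 26)
cola_obs <- c(East = 28, West = 19)

# Cola counts quoted for the second shuffle
cola_perm2 <- c(East = 27, West = 20)

# Hypothetical helper: difference in sample proportions, West minus East
diff_props <- function(cola, n) {
  p_hat <- cola / n
  unname(p_hat["West"] - p_hat["East"])
}

obs_diff   <- diff_props(cola_obs, n)
perm2_diff <- diff_props(cola_perm2, n)

round(obs_diff, 2)    # -0.09
round(perm2_diff, 2)  # -0.02

# The second shuffle is closer to zero, so less extreme than the observed data
abs(perm2_diff) < abs(obs_diff)  # TRUE
```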
The next few shuffles give differences in proportions centered around zero. Note that the 5th difference is negative 0.16, which is farther from zero than the origi