R Tutorial: Randomized distributions

DataCamp · Beginner ·🛠️ AI Tools & Apps ·6y ago

Key Takeaways

The video demonstrates statistical inference using R, specifically exploring randomized distributions and null hypothesis testing with the mutate and sample functions.

Full Transcript

The idea behind statistical inference is to understand samples from a hypothetical population where the null hypothesis is true, for example, from East and West Coasts where cola preference is the same. As a way of summarizing each of the null samples, we calculate one statistic from each sample. Here, the statistic is the difference in the proportion of West Coast people who prefer cola as compared with the proportion of East Coast people who prefer cola, where each of the sample proportions is denoted p-hat. The difference in p-hats changes with each sample: first it's zero, then it's negative one third, and it will keep changing.

We can build a distribution of differences in proportions assuming the null hypothesis, that there is no link between location and soda preference, is true. That is, the null samples consist of randomly shuffled soda variables, so that the samples don't have any dependency between location and soda preference.

The original sample proportions are p-hat East of 0.82 and p-hat West of 0.73, a difference of negative 0.09. The first shuffle of the drink variable gives the exact same summaries as the observed data. The second shuffle, on the other hand, gives 27 people on the East Coast who prefer cola as compared with 20 on the West Coast who prefer cola. The difference in sample proportions for the second shuffle of the data is negative 0.02, which is less extreme than the original data. Note that both the original data (the red line) and the first two shuffled differences in proportions (black dots) can be plotted together.

The next few shuffles give differences in proportions centered around zero. Note that the fifth difference is negative 0.16, which is farther from zero than the original data. That is, the fifth shuffle gives more evidence of a difference in soda preference than the original data does, and we know that the fifth shuffle was created by randomly permuting the labels, so a difference of negative 0.16 is plausible under the null hypothesis. Generally, the null differences are between negative 0.2 and positive 0.2, and about a third of the differences are as or more extreme than the observed difference of negative 0.09.

Now that we have seen a visual representation of the null distribution, let's see how a null sample can be generated in R using the mutate and sample functions. The vector of soda preferences is mixed up, or permuted, such that whether someone is on the East or West Coast can't possibly be causing any difference in proportions. However, due to inherent natural variability, there's also no expectation that soda preferences are exactly the same for any sample.

After grouping by the location variable, summarize calculates the proportion of each coast that prefers cola. Note that testing whether drink equals cola produces a vector of TRUEs and FALSEs, which are then coerced to ones and zeros when the mean function is applied. Since a 1 represents an individual who prefers cola, the average of these ones and zeros represents the proportion of individuals who prefer cola. summarize is used a second time to find the difference in proportion of cola preference across the two coastal groups. The diff function is applied across the two coastal groups because the data have been summarized by location. Notice that the output gives a permuted difference of negative 0.02, as compared to the observed difference of negative 0.09.

However, the permuted difference of negative 0.02 represents only one instance of the variability of soda preference under the null model. To get a sense of the degree of variability under the null model, it's necessary to permute the drink variable many times. By repeating the permuting and difference calculations five times, the permuted differences are seen to be sometimes positive, sometimes negative, sometimes close to zero, sometimes far from zero. However, five times isn't quite enough to capture all of the variability in null differences. By repeating the permutation process 100 times, the null differences are seen to range from approximately negative 0.3 to positive 0.3, although the majority of the differences are between negative 0.1 and positive 0.1. The observed data difference of negative 0.09 doesn't seem too extreme compared to this collection of null differences.

Okay, now it's your turn to practice what you've learned.
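
A minimal sketch of the single shuffle described in the transcript, assuming the dplyr package is installed. The `soda` data frame, its column names, the "other" drink label, and the group counts (34 East, 26 West) are assumptions chosen so that the proportions match the quoted 0.82 and 0.73; the transcript does not show the course's actual data or code.

```r
library(dplyr)

# Hypothetical data: 28 of 34 East and 19 of 26 West respondents prefer cola,
# which reproduces the quoted proportions of about 0.82 and 0.73.
soda <- data.frame(
  location = rep(c("East", "West"), times = c(34, 26)),
  drink = c(rep(c("cola", "other"), times = c(28, 6)),
            rep(c("cola", "other"), times = c(19, 7)))
)

# Observed difference in proportions (West minus East, since groups sort alphabetically)
obs_diff <- soda %>%
  group_by(location) %>%
  summarize(prop_cola = mean(drink == "cola")) %>%
  summarize(diff_obs = diff(prop_cola))

set.seed(1)  # arbitrary seed so the shuffle is reproducible

# One null sample: shuffle the drink column, then recompute the difference
perm_diff <- soda %>%
  mutate(drink_perm = sample(drink)) %>%            # permute soda preferences
  group_by(location) %>%
  summarize(prop_cola = mean(drink_perm == "cola")) %>%  # TRUE/FALSE averaged as 1/0
  summarize(diff_perm = diff(prop_cola))            # West proportion minus East proportion

obs_diff
perm_diff
```

Calling `sample()` on a vector with no other arguments returns a random permutation of it, so the number of cola drinkers overall stays fixed while their assignment to coasts is randomized.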

Original Description

Want to learn more? Take the full course at https://learn.datacamp.com/courses/foundations-of-inference-in-r at your own pace. More than a video, you'll learn hands-on coding & quickly apply skills to your daily work.

This video teaches statistical inference using R, demonstrating how to generate null samples and calculate statistics to understand the distribution of differences in proportions under the null hypothesis. By the end of this lesson, learners will be able to apply null hypothesis testing and randomization techniques in R.

Key Takeaways

Calculate statistics from samples
Generate null samples using randomization
Calculate differences in proportions
Repeat permutation process to capture variability
Compare observed data to null distribution
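
The steps listed above could be strung together roughly as follows in base R. This is only a sketch: the vectors `location` and `drink`, the helper `diff_props`, and the group counts are hypothetical, chosen to be consistent with the proportions quoted in the video (0.82 East, 0.73 West), and the 100 repetitions mirror the number mentioned in the transcript.

```r
# Hypothetical data consistent with the quoted proportions
location <- rep(c("East", "West"), times = c(34, 26))
drink <- c(rep(c("cola", "other"), times = c(28, 6)),
           rep(c("cola", "other"), times = c(19, 7)))

# Statistic: West minus East difference in the proportion preferring cola
diff_props <- function(drink_vec, location_vec) {
  props <- tapply(drink_vec == "cola", location_vec, mean)
  unname(props["West"] - props["East"])
}

obs_diff <- diff_props(drink, location)  # observed statistic

set.seed(42)  # arbitrary seed for reproducibility

# Repeat the shuffle-and-summarize step 100 times to build a null distribution
null_diffs <- replicate(100, diff_props(sample(drink), location))

range(null_diffs)                   # spread of the null differences
hist(null_diffs, main = "Null differences in proportions")
abline(v = obs_diff, col = "red")   # observed difference for comparison
```

Only the drink labels are shuffled; the location vector is left alone, so any association between the two in a null sample is due to chance.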

💡 The null distribution of differences in proportions can be used to determine if the observed difference is statistically significant.
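
One common way to make that comparison concrete is to compute the share of null differences that are at least as extreme as the observed one. The sketch below assumes hypothetical counts consistent with the quoted proportions (28 of 34 East, 19 of 26 West) and uses 1000 shuffles rather than the 100 shown in the video; all variable names are illustrative. Whether the one-sided or two-sided share is appropriate depends on the alternative hypothesis, which the transcript does not state.

```r
# Hypothetical data: the first 34 entries are East, the last 26 are West
is_cola <- rep(c(TRUE, FALSE, TRUE, FALSE), times = c(28, 6, 19, 7))
is_west <- rep(c(FALSE, TRUE), times = c(34, 26))

# West minus East difference in the proportion preferring cola
west_minus_east <- function(cola) mean(cola[is_west]) - mean(cola[!is_west])

obs_diff <- west_minus_east(is_cola)

set.seed(2024)  # arbitrary seed for reproducibility
null_diffs <- replicate(1000, west_minus_east(sample(is_cola)))

tol <- 1e-12  # guards against floating-point ties with the observed value

# One-sided: share of null differences at or below the observed difference
p_one_sided <- mean(null_diffs <= obs_diff + tol)

# Two-sided: share of null differences at least as far from zero
p_two_sided <- mean(abs(null_diffs) >= abs(obs_diff) - tol)

c(one_sided = p_one_sided, two_sided = p_two_sided)
```

A large share means the observed difference is typical of what shuffling alone produces, which is the sense in which the transcript says negative 0.09 "doesn't seem too extreme".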
