R Tutorial: Randomized distributions

DataCamp · Beginner ·🛠️ AI Tools & Apps ·6y ago

Key Takeaways

The video demonstrates statistical inference using R, specifically exploring randomized distributions and null hypothesis testing with the mutate and sample functions.

Full Transcript

The idea behind statistical inference is to understand samples from a hypothetical population where the null hypothesis is true, for example, from East and West Coasts where cola preference is the same. As a way of summarizing each of the null samples, we calculate one statistic from each sample. Here, the statistic is the difference in the proportion of West Coast people who prefer cola as compared with the proportion of East Coast people who prefer cola, where each of the sample proportions is denoted p-hat. The difference in p-hats changes with each sample: first it's zero, then it's negative one third, and it will keep changing.

We can build a distribution of differences in proportions assuming the null hypothesis, that there is no link between location and soda preference, is true. That is, the null samples consist of randomly shuffled soda variables, so that the samples don't have any dependency between location and soda preference. The original sample proportions are p-hat East of 0.82 and p-hat West of 0.73, a difference of negative 0.09. The first shuffle of the drink variable gives the exact same summaries as the observed data. The second shuffle, on the other hand, gives 27 people on the East Coast who prefer cola as compared with 20 on the West Coast who prefer cola. The difference in sample proportions for the second shuffle of the data is negative 0.02, which is less extreme than the original data. Note that both the original data (the red line) and the first two shuffled differences in proportions (black dots) can be plotted together.

The next few shuffles give differences in proportions centered around zero. Note that the fifth difference is negative 0.16, which is farther from zero than the original data. That is, the fifth shuffle gives more evidence of a difference in soda preference than the original data does, and we know that the fifth shuffle was created by randomly permuting the labels, so a difference of negative 0.16 is plausible under the null hypothesis. Generally, the null differences are between negative 0.2 and positive 0.2, and about a third of the differences are as or more extreme than the observed difference of negative 0.09.

Now that we have seen a visual representation of the null distribution, let's see how a null sample can be generated in R using the mutate and sample functions. The vector of soda preferences is mixed up, or permuted, such that whether someone is on the East or West Coast can't possibly be causing any difference in proportions. However, due to inherent natural variability, there's also no expectation that soda preferences are exactly the same for any sample.

After grouping by the location variable, summarize calculates the proportion of each coast that prefers cola. Note that drink equals cola produces a vector of TRUEs and FALSEs, which are then coerced to ones and zeros when the mean function is applied. Since a 1 represents an individual who prefers cola, the average of these ones and zeros represents the proportion of individuals who prefer cola. Summarize is used a second time to find the difference in proportion of cola preference across the two coastal groups. The diff function is applied across the two coastal groups because the data have been summarized by location.

Notice that the output gives a permuted difference of negative 0.02, as compared to the observed difference of negative 0.09. However, the permuted difference of negative 0.02 represents only one instance of the variability of soda preference under the null model. To get a sense of the degree of variability under the null model, it's necessary to permute the drink variable many times. By repeating the permuting and difference calculations five times, the permuted differences are seen to be sometimes positive, sometimes negative, sometimes close to zero, sometimes far from zero. However, five times isn't quite enough to capture all of the variability in null differences. By repeating the permutation process 100 times, the null differences are seen to range from approximately negative 0.3 to positive 0.3, although the majority of the differences are between negative 0.1 and positive 0.1. The observed data difference of negative 0.09 doesn't seem too extreme compared to this collection of null differences.

Okay, now it's your turn to practice what you've learned.
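The steps described in the transcript (shuffle the drink variable with mutate and sample, group by location, summarize twice, take the diff) can be sketched roughly as follows. This is a minimal illustration, not the course's actual code: the data frame `soda`, the helper `diff_in_props`, the level name "orange", the seed, and the group sizes (34 East, 26 West) are all assumptions, with the counts chosen only so that the proportions round to the 0.82 and 0.73 mentioned in the video.

```r
library(dplyr)

# Hypothetical data: counts are assumed, picked so that
# p-hat East = 28/34 (about 0.82) and p-hat West = 19/26 (about 0.73).
soda <- data.frame(
  location = rep(c("East", "West"), times = c(34, 26)),
  drink = c(rep(c("cola", "orange"), times = c(28, 6)),
            rep(c("cola", "orange"), times = c(19, 7)))
)

# Hypothetical helper: difference in cola proportions, West minus East.
# group_by() orders the groups alphabetically (East, West), so diff()
# returns West minus East.
diff_in_props <- function(data) {
  data %>%
    group_by(location) %>%
    summarize(prop = mean(drink == "cola")) %>%  # TRUE/FALSE averaged as 1/0
    summarize(diff_prop = diff(prop)) %>%
    pull(diff_prop)
}

# Observed difference in the unshuffled data
obs_diff <- diff_in_props(soda)

# One null sample: shuffle the drink column, leaving location untouched
set.seed(470)  # arbitrary seed, only for reproducibility
one_perm <- soda %>% mutate(drink = sample(drink))
perm_diff <- diff_in_props(one_perm)

# Many null samples: repeat the shuffle-and-summarize step 100 times
null_diffs <- replicate(
  100,
  diff_in_props(mutate(soda, drink = sample(drink)))
)

range(null_diffs)
```

Because `sample(drink)` only reorders the existing values, every shuffled data set keeps the same total number of cola drinkers; only their assignment to a coast changes, which is what removes any dependency on location.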

Original Description

Want to learn more? Take the full course at https://learn.datacamp.com/courses/foundations-of-inference-in-r at your own pace. More than a video, you'll learn hands-on coding & quickly apply skills to your daily work. --- The idea behind statistical inference is to understand samples from a hypothetical population where the null hypothesis is true. For example, from East and West Coasts, where cola preference is the same. As a way of summarizing each of the null samples, we calculate one statistic from each sample. Here, the statistic is the difference in the proportion of West Coast people who prefer cola as compared with the proportion of East Coast people who prefer cola, where each of the sample proportions is denoted “p-hat”. The difference in p-hats changes with each sample. First it is 0, then it is negative one third, and it will keep changing. We can build a distribution of differences in proportions assuming the null hypothesis, that there is no link between location and soda preference, is true. That is, the null samples consist of randomly shuffled soda variables so that the samples don’t have any dependency between location and soda preference. The original sample proportions are p-hat East of 0.82 and p-hat West of 0.73, a difference of negative 0.09. The first shuffle of the drink variable gives the exact same summaries as the observed data! The second shuffle, on the other hand, gives 27 people on the East Coast who prefer cola as compared with 20 on the West Coast who prefer cola. The difference in sample proportions for the second shuffle of the data is negative 0.02, which is less extreme than the original data. Note that both the original data, the red line, and the first two shuffled differences in proportions, black dots, can be plotted together. The next few shuffles give differences in proportions centered around zero. Note that the 5th difference is negative 0.16, which is farther from zero than the origi

This video teaches statistical inference using R, demonstrating how to generate null samples and calculate statistics to understand the distribution of differences in proportions under the null hypothesis. By the end of this lesson, learners will be able to apply null hypothesis testing and randomization techniques in R.

Key Takeaways
  1. Calculate statistics from samples
  2. Generate null samples using randomization
  3. Calculate differences in proportions
  4. Repeat permutation process to capture variability
  5. Compare observed data to null distribution
💡 The null distribution of differences in proportions can be used to determine if the observed difference is statistically significant.
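
As a rough sketch of the last takeaway, the comparison between the observed difference and the null distribution can be expressed as the proportion of permuted differences that are at least as extreme as the observed one. The code below uses base R only; the vectors `location` and `drink`, the helper `diff_prop`, the seed, the number of permutations, and the group counts (34 East, 26 West) are assumptions chosen to be consistent with the proportions quoted in the video, not values taken from the course materials.

```r
# Hypothetical data: counts are assumed so that p-hat East is about 0.82
# and p-hat West is about 0.73.
location <- rep(c("East", "West"), times = c(34, 26))
drink <- c(rep(c("cola", "orange"), times = c(28, 6)),
           rep(c("cola", "orange"), times = c(19, 7)))

# Hypothetical helper: proportion preferring cola, West minus East
diff_prop <- function(drink, location) {
  mean(drink[location == "West"] == "cola") -
    mean(drink[location == "East"] == "cola")
}

obs_diff <- diff_prop(drink, location)

# Null distribution: shuffle drink, keep location fixed, recompute
set.seed(1)  # arbitrary seed, only for reproducibility
null_diffs <- replicate(1000, diff_prop(sample(drink), location))

# Share of null differences at or below the observed one (one-sided)
p_lower <- mean(null_diffs <= obs_diff)

# Share of null differences at least as far from zero (two-sided)
p_two_sided <- mean(abs(null_diffs) >= abs(obs_diff))

c(lower = p_lower, two_sided = p_two_sided)
```

Which of the two proportions is appropriate depends on the alternative hypothesis being considered; a large value in either case suggests the observed difference is unremarkable under the null model.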
