R Tutorial: Know your data
Key Takeaways
The video tutorial demonstrates how to use R to explore and understand a dataset, specifically the bakers data from The Great British Bake Off, using the glimpse function from the dplyr package and the skim function from the skimr package.
Full Transcript
Now that we've read our data into R, let's start getting to know it a little better. One of the most important things you can do when working with any new data is to learn about how it was collected and do an exploratory data analysis. We have been working with the bakers data from The Great British Bake Off. In each episode of the show, one baker is eliminated, one wins the technical challenge, and one is chosen as star baker. The title of star baker is based on the baker's performance across three timed challenges: the signature, the technical, and the showstopper.
Now let's have another look at the bakers data. So far, we've printed tibbles to view them, but if you have lots of columns, most will be cut off when you print. Here, when we print our bakers data with 10 columns, we see that there are 4 more variables that are hidden. To see all the columns, we use the function glimpse from the dplyr package. The argument for glimpse is the name of your tibble. The glimpse output is a transposed view of your data, where each variable appears in rows from top to bottom instead of left to right. Going across each row, glimpse prints the first few observed values for every variable. We also see the number of observations and variables at the top.
You may also want to summarize your data by looking at summary statistics for each variable. A quick way to do this is with the skim function from the skimr package. Like glimpse, the argument for skim is the name of your tibble. Skim provides statistics for every column depending on the type of variable. The results are printed horizontally with one row per variable, divided into sections for each variable type.
Let's break down the first section of output, summarizing our three character variables. For baker, there are no missing values and 10 complete observations. For each variable, the minimum and maximum values refer to string lengths. Also, each value is unique: here, there are no bakers with the same name.
The next sections of the skim output summarize dates. The variable last_date_uk is the last date that each baker appeared on the show in the UK. From the min and max values, we can tell that our data spans about two years. For the series factor, there are only three unique values across the 10 observations. Looking at the top counts, series 4 is the most common. The logical column named aired_us is TRUE if that baker appeared in a series that aired in the US and FALSE if not. The mean tells us that 70% of the bakers here were seen by US viewers.
Numeric variables are summarized last. In addition to the number of missing and complete values, skim returns the means, standard deviations, and quantiles of the variables. A mini histogram is also printed to give you a sense for the distribution of each variable. From this skim output, we know that the average age of these bakers is 34, and bakers appeared in anywhere from 1 to 10 episodes, with a median of 5. Only one of these bakers was crowned star baker in their time on the show, and they won it twice. Most bakers in this table never won the technical challenge, but one did win three technical challenges.
Now it's time to put glimpse and skim into practice with our Bake Off data.
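The reading of the logical and numeric summaries above can be checked by hand. The sketch below uses base R on made-up vectors (aired_us_demo and episodes_demo are hypothetical names and values, not the real bakers data) to show why the mean of a logical column is a proportion, and how the median and quantiles summarize a count.

```r
# Hypothetical values, chosen only to illustrate the arithmetic.
aired_us_demo <- c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)
episodes_demo <- c(1, 2, 3, 4, 5, 5, 6, 8, 9, 10)

# TRUE counts as 1 and FALSE as 0, so the mean is the share of TRUE values.
mean(aired_us_demo)      # 0.7, i.e. 70% TRUE

# Summaries of the kind skim reports for a numeric column.
median(episodes_demo)    # 5
mean(episodes_demo)
sd(episodes_demo)
quantile(episodes_demo)  # minimum, quartiles, and maximum
```

Computing these directly is a quick way to confirm what a summary table is telling you before relying on it.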
Original Description
Want to learn more? Take the full course at https://learn.datacamp.com/courses/working-with-data-in-the-tidyverse at your own pace. More than a video, you'll learn hands-on coding & quickly apply skills to your daily work.
---
Now that we have read our data into R, let's get to know it a little better.
One of the most important things you can do when working with any new data is to learn about how it was collected. We have been working with the bakers data from The Great British Bake Off.
On each episode of the show, one baker is eliminated, one wins the technical challenge, and one is chosen as star baker. The title of star baker is based on the baker's performance across three timed challenges: the signature, the technical, and the showstopper.
Now, let's have another look at the bakers data.
So far, we've printed tibbles to view them. But, if you have lots of columns, most will be cut off when you print.
Here, when we print our bakers data with 10 columns, we see that there are 4 more variables that are hidden.
To see all the columns, we use the function glimpse from the dplyr package. The argument for glimpse is the name of your tibble.
The glimpse output is a transposed view of your data, where each variable appears in rows from top to bottom instead of left to right. Going across each row, glimpse prints the first few observed values for every variable.
We also see the number of observations and variables at the top.
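As a minimal sketch of the glimpse call described above, the example below builds a small stand-in tibble. The name bakers_demo and all of its values are assumptions for illustration, not the course's bakers data, and the example assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical stand-in for the bakers data; every value here is made up.
bakers_demo <- tibble(
  series   = factor(c(3, 3, 4)),
  baker    = c("Ann", "Ben", "Cat"),
  age      = c(30, 41, 25),
  aired_us = c(FALSE, FALSE, TRUE)
)

# Prints the number of rows and columns first, then one line per column:
# its name, its type, and the first few values.
glimpse(bakers_demo)
```

Because each column gets its own line, no variable is hidden, however many columns the tibble has.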
You may also want to summarize your data by looking at summary statistics for each variable. A quick way to do this is with the skim function from the skimr package.
Like glimpse, the argument for skim is the name of your tibble.
Skim provides statistics for every column depending on the type of variable. The results are printed horizontally with one row per variable, divided into sections for each variable type.
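A minimal sketch of the skim call, again on a made-up tibble. The name bakers_demo and its values are assumptions, the column last_date_uk only mirrors a variable name mentioned in the video, and the example assumes the dplyr and skimr packages are installed.

```r
library(dplyr)
library(skimr)

# Hypothetical stand-in for the bakers data; every value here is made up.
bakers_demo <- tibble(
  series       = factor(c(3, 3, 4)),
  baker        = c("Ann", "Ben", "Cat"),
  age          = c(30, 41, 25),
  aired_us     = c(FALSE, FALSE, TRUE),
  last_date_uk = as.Date(c("2012-08-14", "2012-10-16", "2013-09-03"))
)

# Prints one section per variable type (character, Date, factor,
# logical, numeric), with one row per variable inside each section.
skim(bakers_demo)
```

The exact column labels in the printed summary depend on the installed skimr version, so the output may not match the wording in the video word for word.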
Let's break down the first section of output summarizing our three character variables.
For baker, there are no missing values and 10 complete observations.