Going deeper with dplyr: New features in 0.3 and 0.4 (tutorial)
Key Takeaways
This video tutorial covers the new features in dplyr versions 0.3 and 0.4, including the select, filter, rename, arrange, distinct, mutate, transmute, and add_row_names functions, as well as data manipulation and analysis techniques using the dplyr package in R.
Full Transcript
hello and welcome to data school this video tutorial is going to cover many of the new features in D plier versions 0.3 and 0.4 as well as some other functionality that I didn't cover in my previous deer tutorial if you haven't already watched my previous tutorial I strongly recommend you watch that first that video covered the basic functionality of the package using version 0.2 and I won't be repeating that material here I'm going to be walking through an R markdown document that I created here in our studio if you want to follow along this code is available on GitHub and there's also a rendered version of this document on our pubs that you could use as a quick reference guide you can find links to both of these below the video alternatively you can just type the commands as you go one other note before we get get started if you have any questions about anything in this video please let me know in the comments below and I'd be happy to answer them and with that let's get started uh so the first thing you should know is um that even though my last tutorial used data from the HF flights package uh Hadley Wickham the chief author of the deer package has Rewritten all the deer vignettes to use the NYC flights 133 package instead and so for the sake of consistency I'm going to use NYC flights 13 also so if you just finished my previous tutorial you'll want to run this line of code remove flights to make sure that you get rid of the flights object from your environment all right so let's go ahead and load up the deer package and then we're also going to load the NYC flights 13 data set if you don't have it you'll just want to install that now and I'll print the flights data set from NYC flights 13 you can see that there is one row of data one observation for each flight and there were 336,000 flights these are just flights from the New York City area airports so Let's uh Jump Right In to our first couple commands uh We've of course seen the select command before and this is used to uh pick columns so in this case I am just uh picking the columns carrier and flight throughout this tutorial I'm going to be using this pipe operator uh which I'm going to call then so in any case uh besides using select just to pick columns you can also use select to hide columns so let's say that you want to select all columns except for the month and day columns well I can just do that like this and as you can see I'm now showing uh 14 columns instead of 16 all right so uh you might remember that you can use select to select a range of columns using this colon operator and uh in this case I'm going to actually use the minus sign to hide a range of columns so in other words this hides all columns between and including departure time through arrival delay we can also uh use the minus sign to hide columns matching a name so let's pretend that all of the columns whose names have time in it are not of interest to me well I can say select minus contains time and it will hide those columns so for example the air time column is no longer shown now here's a piece of new functionality let's say that I have a character Vector which I'll call calls and it contains the names of some um columns that I want to show well there's now a command called one of and I can pass it to select so I say select one of calls and it will only select those columns now you might not know that select can also be used to rename columns and let's do that here and we see that I have selected the the tail num column and I've renamed it as tail note that uh when I do this all columns that I don't mention they're dropped now this probably isn't that useful to you but there's a new function called a rename that does the exact same thing except all columns not mentioned are kept which is more likely what you want to do so you can see that uh we now have the tail num column and it's uh it's been renamed to tail but all the other columns that I didn't mention are still here all right let's move along to some uh functions for filtering so uh you might remember that the filter command which uh filters rows or observations supports the use of multiple conditions so in this casee I'm filtering to only show rows where the departure time is greater than or equal to 60 and the uh departure time is less than or equal to 605 in other words this is flights that departed uh between 6:00 and 6:05 in the morning inclusive note that I use the comma to separate these conditions but I can also use Ampersand if you're doing this particular operation where you're selecting these end points uh you can actually um use a new command called between so this is a more concise alternative for determining if numeric values fall in a range and so if I say between depth time 6 605 it's the same as this command and once again I'm I'm filtering and again you can see that uh all of these departure times are between 6:00 and uh 6:05 a.m. uh as a side note on filtering I found that the isna can actually be quite useful within filtering so in this case I'm filtering flights to only show observations in which depth time is not an NA value so you can see that now we only have about 328,000 observations out of those 336,000 observations uh moving on to another new verb uh called slice slice is used to uh filter rows by position so for example here I'm going to show only rows in positions 1,000 through 1, and five now you might wonder how that might be useful uh you've already got the head command to look at rows at the top and the tail command to look at rows at the bottom well it's possible you might want to just uh slice some rows out of the middle and take a look at those and this is an easy way to do that um but this probably isn't the most compelling use case for slice uh what I found is that slice is actually most useful for uh filtering rows within groups so in this case I'm going to group by month and day and then I'm going to slice out the first three rows within each group so as you can see I've got three rows for January 1 three rows for January 2nd three rows for January thir and so on and these are just the first three rows that deeper encountered uh within those groups uh now perhaps I don't want to look at just the uh first three rows instead I want to sample three random rows from each group well there's a handy command called sample n uh and I'm going to just pass it a three and that just means within each group I want to see three random rows so uh maybe I don't want to look at the first three rows or a random selection of three rows maybe within each group I want to look at the three rows that have the top departure delay or some other variable well I can do that with this top n function and what I do is I pass it the number of n so the number of rows I want and then I pass it the variable uh on which those rows should be chosen so let's run this and what we see is that again I have one row or sorry I have three rows for each day of the year and then within those the particular three rows that were chosen were the ones with really high departure delays so as you can see earlier I was just getting arbitrary uh rows and here I'm actually getting one specifically with high departure Del now you might note that they're not actually sorted within each group um which to me is probably what you'd want to do if you were doing this particular operation so just note that you can throw on an arrange uh function at the end and if I say arrange in descending order by departure delay then I get what I actually want which is I'm getting three rows from each group uh with the top depart departure delay and then I'm sorting in descending order by that departure delay uh moving along um let's say you wanted to identify unique rows so for example I want to select uh origin and destination and I want to see all unique origin destination pairs there's a unique function from base R so let's run that and we'll see that there were 224 unique pairs of Origins and destinations in this data set now D plier actually has an alternative command called distinct and you'll see that the results are the same but it's a more efficient command so this is what uh deer recommends you use instead of unique uh and as a side note um and this this is something I just learned is that when chaining you actually don't even have to include the parentheses if there are no arguments so for example with distinct there are no arguments I'm including and thus I can actually just run it without the parenthesis um it looks kind of funny to me having you know uh being used to seeing a lot of parentheses in R um and I don't know if this is officially endorsed and whether this behav will stick around forever but it at least for now it works and it certainly makes your code a little bit shorter saves you some typing so uh let's move on to um adding new variables so we've got mutate which has been around from the beginning of deer so mutate is how you create a new variable uh and let's say I want to create a new variable called speed and have it be a function of distance distance and air time so this is speed and Miles hour I run mutate and although we can't see it here we've got a new um variable speed uh and you'll notice that mutate kept all the existing variables now there might be times when you only want to keep the variables you just created and that's the purpose of the new verb transmute so the syntax is the same and when you run it you see that uh the only variable that was kept is that new variable now let's move uh just briefly to a different data set built into R called Mt cars very famous R data set and uh we're just taking a look at the head the first six rows and you can see that Mt cars uh uses row name to store the the model of the car uh now in in kind of Tidy data principles this is not recommended it's recommended that you instead of using uh row names you make this an explicit variable in your data frame and as such uh there's this new verb called add row names so what it does is it takes any row names in a data frame and it turns them into an explicit variable with the specified name so let's do that and you'll see that I've now uh created a new variable called Model that just took the row names from Mt cars now as a side note um anytime you are creating a uh local data frame uh and you create a local data frame uh using this TBL DF command also known as tible diff if you like uh local data frames uh they used to print row names under some circumstances and not in other circumstances well D plier no longer prints row names ever so you should not uh you should not plan to see them um anytime you're using local data frames uh moving on to some grouping and Counting operations um and and the first is just a a review of summarize so let's say I want to uh Group by month and then summarize count equals n so what is this exactly it means for each month I want to I want to calculate the number of rows and one row represents a flight and store that in a variable called CNT or count so let's run this and this is telling us that in month one there were 27,0 and4 rows which represents the number of flights in February there is this number and so on so this is nothing new I wanted to uh compare that syntax with Tally and count so tally is a way that you can uh do this slightly more concisely so tally just replaces this uh summarize command and when I say tally it does the same exact thing except it automatically assigns n as the column name so I couldn't Define it as CNT it's just automatically n um but it saves me uh some typing and it's a bit cleaner to read now a new uh a new way to do this is to use this count function and this makes it even uh more concise because it automatically does the grouping for you so this line of code uh count month does exactly the same thing as above which does the same thing as here uh now I'm going to show a couple more examples of comparing summarize versus tally versus Count so here's uh let's say that I wanted to sort this data in descending order I've course use an arrange command so arrange descending count so let's run this and now we can see that July is the month with the most flights and February is the month with the least flights I can also do the same thing using tally using this optional parameter called sort and I'll say sort equals true and you see that tally does the same thing and then count also has that same parameter sort equals true and again it does the same thing so that's another comparison of summarize tally and count uh now note that um when you're summarizing you can do a lot more than just counting uh the rows in each group um you can uh for instance sum over a specific variable so in this case for each month I'm going to summarize and I'll say dist equals sum of distance which means that I want to sum up the uh distance values for each row and store it in a variable called disc so let's run this and we'll see that for example uh in January this was the number of miles that was flown and this was the number flown in February and so on uh now again with Tally and count you can do the same thing so with Tally you have this optional parameter called WT which I just call weight um so this weight parameter uh spe specifies the same thing here and it will do the sum of uh that particular column instead of just counting and you can see I get the same results and then count again has the same weight parameter now coming back to Counting uh so remember uh this command here so for each month I want to tally all right now let's say that I want to quickly get the these numbers into a vector for some reason well that's the function of this group size uh function so let's run this and you can see that I get the same exact results as here except it's now uh just been turned into a uh integer Vector okay uh now for something even simpler maybe all you care about calculating is the number of groups we know in this case that the number of groups will be 12 because they're 12 months but in your case you might have data in which all you want to calculate is the number of groups and that's what NR groups does so it just tells me I have 12 groups uh next I want to just show a little bit of behavior that that is a bit confusing to me and in case it's confusing to you I want to show you kind of a workaround so let's pretend I'm grouping by two variables in this case month and day and then I'm doing a summarize doing a count and then I uh doing an arrange okay and one thing I'm going to throw on at the end here is this print function and say n equals 40 that allows me to print 40 rows so let's go ahead and run this and let's take a look at what we've got so I grouped by month and day so of course 365 groups now there's something that's a little funny to me so I see that uh it does look like it's in descending order but notice that all of these are from January so there were uh you know 943 flights in in January 2nd uh and that was the greatest for January but if I want to see February it has go all the way down here so in other words um it's sorting only within month uh whereas what I had really wanted is sorting across all groups so I wanted to know for instance the top 10 days throughout the year in terms of number of flights not just the top ones from within each month now uh why exactly is this occurring well I did some research and it turns out that every time summarize is called that removes a grouping level so even though I group by month and day here as soon as I called summarize it actually drops a grouping level so now it's only grouping by month so when I arrange it's arranging with that in mind now there is a workaround and it turns out that there's this command called ungroup so it essentially removes any groups that have been defined and I just throw that in after the summary but before the arrange and if I do that and this time I'm just showing 10 rows you can see that it's sorting across all months uh across all months and days so I can see that the day of the year in 2013 in the New York airports that had the most flights was November 27th followed by July 11th Etc so ungrouped can come in handy if you're experiencing Behavior like this and it's not what you want moving along let's talk about creating data frames using data underscore frame now this is not data. frame this is data underscore frame which is a new thing in Deer so uh data underscore frame is a better way than data. frame for creating data frames so what are some benefits of data underscore frame well you can use previously defined columns to compute new columns and we'll see an example below uh it never coerces column types it never munges column names never adds row names it only recycles length one input which is probably what you want and it only returns a uh a local data frame which is what uh deer encourages so let's do a quick example of data underscore frame and uh what I'm doing is I'm creating a data frame um with four columns and I'm sending it to Glimpse which is uh kind of deer's version of of stir St Str so what do we see we see that a is been created 1 to six um B has been created as a * 2 which does work uh when I created c as string you'll see it it stays a character instead of being uh automatically converted to a factor and when I created D+ e as my fourth column name it uh it leaves that as is it doesn't it doesn't change it and you can't tell from this Glimpse uh but this is actually a local data frame so let's compare this to uh data do frame your standard way for creating data frames in R um I didn't bother Define trying to Define find B = A * 2 here because that would fail with data. frame since um since a doesn't yet exist at this point uh so you'll notice that when I do C equals string it gets turned into a factor uh when I say d plus e as the column name it gets turned into D.E um and for what it's worth this is not a uh a local data frame all right let's move on to uh joins or merges as they're sometimes called and there's a number of different joins um we've got four uh what are called mutating joins and then two filtering joins um if you already know joins this might be pretty um Elementary for you but I'm going to presume that you're not familiar with joins so let's create two simple data frames which I'll call a and B and take a look at them and in case you're wondering uh anytime you wrap an assignment statement in parentheses it automatically prints it out with without you having to explicitly uh tell it to print uh so what what do we have um well we have a data frame a which has color green yellow red and num one2 three and then data frame B which has color green yellow pink and size SML for small medium and large okay so let's start with uh with probably the simplest to understand join the inner join so uh that type of join only includes observations found in both A and B and deer will automatically join on variables that appear in both tables so the only common variable is color so it joins on color so what this injin results in is we see green and yellow because those appeared in both A and B and then I get the numb column for those observations from a and the size column for those observations from B okay uh next we'll do a full join sometimes known as an out a full outer join and uh what a full join does is include observations found in either A or B so this time we'll see green yellow and then red which came from a and pink which came from B we'll see both numb and size but because uh because red uh did not have a size there is no red in um in B size gets an NA value and because pink was not in a uh it has no num value so that gets marked as na okay so that's our full join our left join is for including observations found in a so anything found in a will be shown and so of course we see green yellow and red we see numb one two and three so so far this is just what we already had in a but in addition we also have this size column so size small medium and then na so it's basically these three rows but only those three rows because uh left joins only include the observations found in a okay uh We've also got a right join so when I do right join a b it only includes the observations found found in B so we've got green yellow and pink and then the numb and then the size so this is just like rows one two and four from our full join okay now uh a lot of a lot of people aren't used to right joins they just use left joins and of course you could just do a left join of ba instead of a right join a and they are indeed identical except for the column ordering so let's do a left join of b and a and you'll see that it just swaps the column ordering so it just all depends upon uh what you want in terms of your resulting column ordering uh finally let's do our kind of filtering joins so the first is called a semi join so when I do semi join a b that filters a to only show observations that match B so I don't gain any new uh columns all I did was I lost the red uh observation from a because red was not found in B now the opposite that of that is an anti-join so this filters a to only show observations that don't match B so you can see that between uh the semi join and the anti-join we have all of a okay uh now of course uh matching variables don't always have identical names in different tables so I'm just going to rename the uh the color column in B to be call instead so how do I do my join in that case well I actually just need to add on this argument by equals and I pass it a character Vector that says um color equals call and then if there were multiple I could just say comma and then indicate others but um uh by telling uh the inner join that color in a equals call in B it successfully uh completes the join even though they had different column names and finally let's wrap up with some uh tips for viewing more output so let's say you want to print more than just the default of 10 rows let's say you want to print 15 well you can use this handy little uh print function and say print nals 15 uh note that you do have to say n equals you can't just put the 15 there uh and and of course I see 15 rows uh now let's say your data frame has a couple hundred uh rows and you want to see all of them you don't want to be constrained to see only 10 um but you don't know exactly how many well you can just say print nals INF which means uh infinity or infinite now I'm not going to run this in this case because we've got 300,000 rows um but you may find this to be handy for smaller data sets um for seeing all columns this is the the trick I use most often if I say print width equals INF I get to see all the columns so I'm still limited to the 10 rows but D plier is indeed showing me all of the columns um now a different way to accomplish kind of the same objective is to use this view function which is not does not come from deer I don't believe uh it's a c capital V in view if you couldn't tell so let's run this and it opens up this uh data viewer that you might already be familiar with and as you can see uh we can see uh 1,000 rows so all the way down to row 1000 and we can see all the columns I'm sure there's a limit as to how many columns it will display but I don't know what that limit is uh so I'm going to go ahead and close that so that's another way to view more uh data without printing it out in your console here um let's say I want to set some options so it turns out maybe I like uh deer to print six rows by default uh rather than 10 well I can change deer. print Min to six and maybe I always want to see all columns uh I can put deer. width equals infinite so let's run that and now if we look at flights we'll see six rows and all of the columns without having to use this uh this print uh command uh now if I want to reset this um I can set D plier width back to null and deer. print Min back to 10 and that will reset my options to the usual option itions uh alternatively if you just close R it will reset your options um and I'm sure there's a way to uh set these options to be um permanent upon startup if you like so that's it for the code I'm just going to go through a few resources now uh the release announcements on the r Studio blog for version 0.3 uh back in October and version 0.4 for uh in January of 2015 I linked to those here they're a nice quick read um the deer reference manual in vignettes on cray uh those are certainly worthwhile uh there's a lot of vignettes um they're always being updated and uh they're quite thorough in uh covering certain aspects of deeper one particular vignette that might be of interest to you if you want to learn more about joins is this two table vignette uh specifically it also includes set operations which I did not cover during this tutorial um highly recommended is our Studio's data wrangling cheat sheet which is a two-page very well-designed uh visual cheat sheet for uh both deer and tidy R for tidying data um and finally we've got the deer GitHub repo if you want to keep up with uh development or the list of releases uh and you can read the very detailed release announcements which is a lot of how I prepare for these video tutorials is reading those in depth so thank you so much for watching if you have any comments or questions please let me know in the comments section below and if you enjoyed this video I'd love to have you as a YouTube subscriber so that I can let you know when future videos come out so please do click the Subscribe Button as well I have a Blog and an email newsletter in which I cover a much wider range of topics including R python git machine learning and data science in general so please check those out and again thank you so much for watching
Original Description
dplyr is an R package for data manipulation. Using a series of examples on a dataset you can download, this tutorial covers the new functionality in dplyr versions 0.3 and 0.4.
Watch the previous dplyr tutorial: http://youtu.be/jWjqLW-u3hc
View the R Markdown document: http://rpubs.com/justmarkham/dplyr-tutorial-part-2
Download the source document: https://github.com/justmarkham/dplyr-tutorial
Tutorial contents:
1. Introduction (starts at 0:00)
2. Loading dplyr and the nycflights13 dataset (starts at 1:12)
3. Choosing columns: `select`, `rename` (starts at 2:28)
4. Choosing rows: `filter`, `between`, `slice`, `sample_n`, `top_n`, `distinct` (starts at 5:40)
5. Adding new variables: `mutate`, `transmute`, `add_rownames` (starts at 12:38)
6. Grouping and counting: `summarise`, `tally`, `count`, `group_size`, `n_groups`, `ungroup` (starts at 15:20)
7. Creating data frames: `data_frame` (starts at 23:01)
8. Joining (merging) tables: `left_join`, `right_join`, `inner_join`, `full_join`, `semi_join`, `anti_join` (starts at 25:28)
9. Viewing more output: `print`, `View` (starts at 31:29)
10. Resources (starts at 34:41)
== RESOURCES ==
Release announcement (0.3): http://blog.rstudio.org/2014/10/13/dplyr-0-3-2/
Release announcement (0.4): http://blog.rstudio.org/2015/01/09/dplyr-0-4-0/
Reference manual and vignettes: http://cran.r-project.org/web/packages/dplyr/
Two-table vignette: http://cran.r-project.org/web/packages/dplyr/vignettes/two-table.html
Data Wrangling Cheat Sheet: http://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
GitHub repo: https://github.com/hadley/dplyr
List of releases: https://github.com/hadley/dplyr/releases
== LET'S CONNECT! ==
Newsletter: https://www.dataschool.io/subscribe/
Twitter: https://twitter.com/justmarkham
Facebook: https://www.facebook.com/DataScienceSchool/
LinkedIn: https://www.linkedin.com/in/justmarkham/
Watch on YouTube ↗
(saves to browser)
Sign in to unlock AI tutor explanation · ⚡30
Playlist
Uploads from Data School · Data School · 12 of 60
1
2
3
4
5
6
7
8
9
10
11
▶
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
Setting up Git and GitHub
Data School
Navigating a GitHub Repository - Part 1
Data School
Forking a GitHub Repository
Data School
Creating a New GitHub Repository
Data School
Copying a GitHub Repository to Your Local Computer
Data School
Committing Changes in Git and Pushing to a GitHub Repository
Data School
Syncing Your GitHub Fork
Data School
Allstate Purchase Prediction Challenge on Kaggle
Data School
Troubleshooting: Updates Rejected When Pushing to GitHub
Data School
Hands-on dplyr tutorial for faster data manipulation in R
Data School
ROC Curves and Area Under the Curve (AUC) Explained
Data School
Going deeper with dplyr: New features in 0.3 and 0.4 (tutorial)
Data School
What is machine learning, and how does it work?
Data School
Setting up Python for machine learning: scikit-learn and Jupyter Notebook
Data School
Getting started in scikit-learn with the famous iris dataset
Data School
Training a machine learning model with scikit-learn
Data School
Comparing machine learning models in scikit-learn
Data School
Data science in Python: pandas, seaborn, scikit-learn
Data School
Selecting the best model in scikit-learn using cross-validation
Data School
How to find the best model parameters in scikit-learn
Data School
How to evaluate a classifier in scikit-learn
Data School
What is pandas? (Introduction to the Q&A series)
Data School
How do I read a tabular data file into pandas?
Data School
How do I select a pandas Series from a DataFrame?
Data School
Why do some pandas commands end with parentheses (and others don't)?
Data School
How do I rename columns in a pandas DataFrame?
Data School
How do I remove columns from a pandas DataFrame?
Data School
How do I sort a pandas DataFrame or a Series?
Data School
How do I filter rows of a pandas DataFrame by column value?
Data School
How do I apply multiple filter criteria to a pandas DataFrame?
Data School
Your pandas questions answered!
Data School
How do I use the "axis" parameter in pandas?
Data School
How do I use string methods in pandas?
Data School
How do I change the data type of a pandas Series?
Data School
When should I use a "groupby" in pandas?
Data School
How do I explore a pandas Series?
Data School
How do I handle missing values in pandas?
Data School
What do I need to know about the pandas index? (Part 1)
Data School
What do I need to know about the pandas index? (Part 2)
Data School
How do I select multiple rows and columns from a pandas DataFrame?
Data School
Machine Learning with Text in scikit-learn (PyCon 2016)
Data School
When should I use the "inplace" parameter in pandas?
Data School
How do I make my pandas DataFrame smaller and faster?
Data School
How do I use pandas with scikit-learn to create Kaggle submissions?
Data School
More of your pandas questions answered!
Data School
How do I create dummy variables in pandas?
Data School
How do I work with dates and times in pandas?
Data School
How do I find and remove duplicate rows in pandas?
Data School
How do I avoid a SettingWithCopyWarning in pandas?
Data School
How do I change display options in pandas?
Data School
How do I create a pandas DataFrame from another object?
Data School
How do I apply a function to a pandas Series or DataFrame?
Data School
Getting started with machine learning in Python (webcast)
Data School
Q&A about Machine Learning with Text (online course)
Data School
Your pandas questions answered! (webcast)
Data School
Machine Learning with Text in scikit-learn (PyData DC 2016)
Data School
Write Pythonic Code for Better Data Science (webcast)
Data School
Web scraping in Python (Part 1): Getting started
Data School
Web scraping in Python (Part 2): Parsing HTML with Beautiful Soup
Data School
Web scraping in Python (Part 3): Building a dataset
Data School
More on: ML Pipelines
View skill →Related AI Lessons
⚡
⚡
⚡
⚡
The AI Moat Paradox: The Better Models Become, the Less Models Matter
Medium · AI
170,927 AI Papers Reveal the Biggest Research Shifts of the First Half of 2026
Medium · Machine Learning
170,927 AI Papers Reveal the Biggest Research Shifts of the First Half of 2026
Medium · Data Science
[PoV] When Everyone Is Smart, No One Is
Medium · AI
🎓
Tutor Explanation
DeepCamp AI