Chapter 1 Data Manipulation using dplyr
Learning Objectives
- Select columns in a data frame with the dplyr function select().
- Select rows in a data frame according to filtering conditions with the dplyr function filter().
- Direct the output of one dplyr function to the input of another function with the 'pipe' operator %>%.
- Add new columns to a data frame that are functions of existing columns with mutate().
- Understand the split-apply-combine concept for data analysis.
- Use summarize(), group_by(), and count() to split a data frame into groups of observations, apply summary statistics for each group, and then combine the results.
- Join two tables by a common variable.
Manipulation of data frames is a common task when you start exploring your data in R, and dplyr is a package that makes tabular data manipulation easier.
Brief recap: Packages in R are sets of additional functions that let you do more stuff. Functions like str() or data.frame() come built into R; packages give you access to more of them. Before you use a package for the first time you need to install it on your machine, and then you need to load it in every subsequent R session in which you want to use it.
If you haven’t, please install the tidyverse
package.
install.packages("tidyverse")
tidyverse is an "umbrella package" that installs a series of packages useful for data analysis which work well together. Some of them are considered core packages (among them tidyr, dplyr, ggplot2), because you are likely to use them in almost every analysis. Other packages that you are likely to use less often, like lubridate (to work with dates) or haven (for SPSS, Stata, and SAS data), are also installed.
If you type the following command, it will load the core tidyverse
packages.
library("tidyverse") ## load the core tidyverse packages, incl. dplyr
If you need to use functions from tidyverse
packages other than the core packages, you will need to load them separately.
1.1 What is dplyr?
dplyr is one part of the larger tidyverse that enables you to work with data in tidy data formats. "Tidy datasets are easy to manipulate, model and visualise, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table." (From Wickham, H. (2014): Tidy Data https://www.jstatsoft.org/article/view/v059i10)
The package dplyr
provides convenient tools for the most common data manipulation
tasks. It is built to work directly with data frames, with many common tasks
optimized by being written in a compiled language (C++). An additional feature is the
ability to work directly with data stored in an external database. The benefits of
doing this are that the data can be managed natively in a relational database,
queries can be conducted on that database, and only the results of the query are
returned.
This addresses a common problem with R in that all operations are conducted in-memory and thus the amount of data you can work with is limited by available memory. The database connections essentially remove that limitation in that you can have a database of many 100s GB, conduct queries on it directly, and pull back into R only what you need for analysis.
To learn more about dplyr
after the workshop, you may want to check out the handy data transformation with dplyr
cheatsheet.
1.2 Subsetting columns and rows
Let’s begin with loading our sample data into a data frame.
We will be working a small subset of the data from the Stanford Open Policing Project. It contains information about traffic stops for blacks and whites in the state of Mississippi during January 2013 to mid-July of 2016.
stops <- read_csv("data/MS_trafficstops_bw_age.csv")
#> Parsed with column specification:
#> cols(
#> id = col_character(),
#> stop_date = col_date(format = ""),
#> county_name = col_character(),
#> county_fips = col_double(),
#> police_department = col_character(),
#> driver_gender = col_character(),
#> driver_birthdate = col_date(format = ""),
#> driver_race = col_character(),
#> officer_id = col_character(),
#> driver_age = col_double(),
#> violation = col_character()
#> )
stops
#> # A tibble: 211,211 x 11
#> id stop_date county_name county_fips police_departme… driver_gender
#> <chr> <date> <chr> <dbl> <chr> <chr>
#> 1 MS-2… 2013-01-01 Jones 28067 Mississippi Hig… male
#> 2 MS-2… 2013-01-01 Lauderdale 28075 Mississippi Hig… male
#> 3 MS-2… 2013-01-01 Pike 28113 Mississippi Hig… male
#> 4 MS-2… 2013-01-01 Hancock 28045 Mississippi Hig… male
#> 5 MS-2… 2013-01-01 Holmes 28051 Mississippi Hig… male
#> 6 MS-2… 2013-01-01 Jackson 28059 Mississippi Hig… female
#> 7 MS-2… 2013-01-01 Jackson 28059 Mississippi Hig… female
#> 8 MS-2… 2013-01-01 Grenada 28043 Mississippi Hig… female
#> 9 MS-2… 2013-01-01 Holmes 28051 Mississippi Hig… male
#> 10 MS-2… 2013-01-01 Holmes 28051 Mississippi Hig… male
#> # … with 211,201 more rows, and 5 more variables: driver_birthdate <date>,
#> # driver_race <chr>, officer_id <chr>, driver_age <dbl>, violation <chr>
You may have noticed that by using read_csv we have generated an object of class tbl_df, also known as a "tibble." A tibble's data structure is very similar to a data frame. For our purposes the only differences are that
- columns of class character are never converted into factors1,
- it tries to recognize date types,
- the output displays the data type of each column under its name, and
- it only prints the first few rows of data and only as many columns as fit on one screen. If we want to print all columns we can use the print command and set the width parameter to Inf. To print the first 6 rows, for example, we would do this: print(my_tibble, n = 6, width = Inf).
To select columns of a data frame with dplyr, use select(). The first argument to this function is the data frame (stops), and the subsequent arguments are the columns to keep.
select(stops, police_department, officer_id, driver_race)
#> # A tibble: 211,211 x 3
#> police_department officer_id driver_race
#> <chr> <chr> <chr>
#> 1 Mississippi Highway Patrol J042 Black
#> 2 Mississippi Highway Patrol B026 Black
#> 3 Mississippi Highway Patrol M009 Black
#> 4 Mississippi Highway Patrol K035 White
#> 5 Mississippi Highway Patrol D028 White
#> 6 Mississippi Highway Patrol K023 White
#> 7 Mississippi Highway Patrol K032 White
#> 8 Mississippi Highway Patrol D021 White
#> 9 Mississippi Highway Patrol R021 White
#> 10 Mississippi Highway Patrol R021 White
#> # … with 211,201 more rows
It is worth knowing that dplyr is backed by the tidyselect package, which provides convenient helper functions to select columns based on their names. For example:
select(stops, starts_with("driver"))
#> # A tibble: 211,211 x 4
#> driver_gender driver_birthdate driver_race driver_age
#> <chr> <date> <chr> <dbl>
#> 1 male 1950-06-14 Black 63
#> 2 male 1967-04-06 Black 46
#> 3 male 1974-04-15 Black 39
#> 4 male 1981-03-23 White 32
#> 5 male 1992-08-03 White 20
#> 6 female 1960-05-02 White 53
#> 7 female 1953-03-16 White 60
#> 8 female 1993-06-14 White 20
#> 9 male 1947-12-11 White 65
#> 10 male 1984-07-14 White 28
#> # … with 211,201 more rows
Check out the tidyselect reference for more.
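Other tidyselect helpers follow the same pattern; here is a brief sketch (the helper names are real tidyselect functions, though which columns they match depends on your data):

```r
select(stops, ends_with("name"))   # columns whose names end in "name", e.g. county_name
select(stops, contains("driver"))  # columns whose names contain "driver"
select(stops, -police_department)  # all columns except police_department
```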
To subset rows based on specific criteria, we use filter():
filter(stops, county_name == "Yazoo")
#> # A tibble: 3,528 x 11
#> id stop_date county_name county_fips police_departme… driver_gender
#> <chr> <date> <chr> <dbl> <chr> <chr>
#> 1 MS-2… 2013-01-02 Yazoo 28163 Mississippi Hig… male
#> 2 MS-2… 2013-01-02 Yazoo 28163 Mississippi Hig… female
#> 3 MS-2… 2013-01-02 Yazoo 28163 Mississippi Hig… male
#> 4 MS-2… 2013-01-02 Yazoo 28163 Mississippi Hig… female
#> 5 MS-2… 2013-01-02 Yazoo 28163 Mississippi Hig… male
#> 6 MS-2… 2013-01-03 Yazoo 28163 Mississippi Hig… male
#> 7 MS-2… 2013-01-03 Yazoo 28163 Mississippi Hig… male
#> 8 MS-2… 2013-01-04 Yazoo 28163 Mississippi Hig… male
#> 9 MS-2… 2013-01-04 Yazoo 28163 Mississippi Hig… male
#> 10 MS-2… 2013-01-04 Yazoo 28163 Mississippi Hig… female
#> # … with 3,518 more rows, and 5 more variables: driver_birthdate <date>,
#> # driver_race <chr>, officer_id <chr>, driver_age <dbl>, violation <chr>
Here are some other ways to subset rows:
- by row number:
slice(stops, 1:3) # rows 1-3
- rows with highest or lowest values of a variable:
slice_min(stops, driver_age) # likewise slice_max()
- random rows:
slice_sample(stops, n = 5) # number of rows to select
slice_sample(stops, prop = .0001) # fraction of rows to select
To sort rows by variables, use the arrange() function:
arrange(stops, county_name, stop_date)
#> # A tibble: 211,211 x 11
#> id stop_date county_name county_fips police_departme… driver_gender
#> <chr> <date> <chr> <dbl> <chr> <chr>
#> 1 MS-2… 2013-02-09 Adams 28001 Mississippi Hig… male
#> 2 MS-2… 2013-03-02 Adams 28001 Mississippi Hig… female
#> 3 MS-2… 2013-03-16 Adams 28001 Mississippi Hig… female
#> 4 MS-2… 2013-03-20 Adams 28001 Mississippi Hig… female
#> 5 MS-2… 2013-04-06 Adams 28001 Mississippi Hig… female
#> 6 MS-2… 2013-04-13 Adams 28001 Mississippi Hig… female
#> 7 MS-2… 2013-04-19 Adams 28001 Mississippi Hig… female
#> 8 MS-2… 2013-04-21 Adams 28001 Mississippi Hig… female
#> 9 MS-2… 2013-04-24 Adams 28001 Mississippi Hig… male
#> 10 MS-2… 2013-04-24 Adams 28001 Mississippi Hig… male
#> # … with 211,201 more rows, and 5 more variables: driver_birthdate <date>,
#> # driver_race <chr>, officer_id <chr>, driver_age <dbl>, violation <chr>
1.3 Pipes
What if you wanted to filter and select on the same data? For example, let's find drivers over 85 years old and keep only the violation and gender columns. There are three ways to do this: use intermediate steps, nested functions, or pipes.
- Intermediate steps:
With intermediate steps, you essentially create a temporary data frame and use that as input to the next function. This can clutter up your workspace with lots of objects.
tmp_df <- filter(stops, driver_age > 85)
select(tmp_df, violation, driver_gender)
- Nested functions
You can also nest functions (i.e. place one function inside of another). This is handy, but can be difficult to read if too many functions are nested, as things are evaluated from the inside out.
select(filter(stops, driver_age > 85), violation, driver_gender)
- Pipes!
The last option is to use pipes. Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same dataset. Pipes in R look like %>% and are made available via the magrittr package, which is installed automatically with dplyr. If you use RStudio, you can type the pipe with Ctrl + Shift + M on a PC or Cmd + Shift + M on a Mac.
stops %>%
  filter(driver_age > 85) %>%
  select(violation, driver_gender)
In the above, we use the pipe to send the stops data first through filter() to keep rows where driver_age is greater than 85, then through select() to keep only the violation and driver_gender columns. Since %>% takes the object on its left and passes it as the first argument to the function on its right, we don't need to explicitly include the data frame as an argument to the filter() and select() functions anymore.
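To make the mechanics concrete, the piped chain is just a different way of writing nested function calls — each %>% feeds its left-hand side in as the first argument:

```r
# these two expressions produce the same result
stops %>% filter(driver_age > 85) %>% select(violation, driver_gender)
select(filter(stops, driver_age > 85), violation, driver_gender)
```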
If we wanted to create a new object with this smaller version of the data, we could do so by assigning it a new name:
senior_drivers <- stops %>%
  filter(driver_age > 85) %>%
  select(violation, driver_gender, driver_race)
senior_drivers
#> # A tibble: 3 x 3
#> violation driver_gender driver_race
#> <chr> <chr> <chr>
#> 1 Seat belt male White
#> 2 Speeding male White
#> 3 Seat belt male Black
Note that the final data frame is the leftmost part of this expression.
Challenge
Using pipes, subset the stops data to include stops in Tunica county only and retain the columns stop_date, driver_age, and violation. Bonus: sort the table by driver age.
1.4 Add new columns
Frequently you'll want to create new columns based on the values in existing columns. For this we'll use mutate(). We can also reassign values to an existing column with that function.
Be aware that new and edited columns will not be permanently added to the existing data frame unless we explicitly save the output.
So here is an example using the year()
function from the lubridate package to extract the year of the drivers’ birthdate:
library(lubridate)
stops %>%
  mutate(birth_year = year(driver_birthdate))
We can keep adding columns like this:
stops %>%
  mutate(birth_year = year(driver_birthdate),
         birth_cohort = floor(birth_year/10)*10)
We are beginning to see the power of piping. Here is a slightly expanded example, where we select the column birth_cohort
that we have created and send it to plot:
stops %>%
  mutate(birth_year = year(driver_birthdate),
         birth_cohort = floor(birth_year/10)*10,
         birth_cohort = factor(birth_cohort)) %>%
  select(birth_cohort) %>%
  plot()

Figure 1.1: Driver Birth Cohorts
mutate() can also be used in conjunction with logical conditions. For example, we could create a new column where we assign everyone born in or after the year 2000 to a group "millenial" and everyone born before to "pre-millenial."
In order to do this we take advantage of the ifelse() function:
ifelse(a_logical_condition, if_true_return_this, if_false_return_this)
In conjunction with mutate(), this works like this:
stops %>%
  mutate(cohort = ifelse(year(driver_birthdate) < 2000, "pre-millenial", "millenial")) %>%
  select(driver_birthdate, cohort)
#> # A tibble: 211,211 x 2
#> driver_birthdate cohort
#> <date> <chr>
#> 1 1950-06-14 pre-millenial
#> 2 1967-04-06 pre-millenial
#> 3 1974-04-15 pre-millenial
#> 4 1981-03-23 pre-millenial
#> 5 1992-08-03 pre-millenial
#> 6 1960-05-02 pre-millenial
#> 7 1953-03-16 pre-millenial
#> 8 1993-06-14 pre-millenial
#> 9 1947-12-11 pre-millenial
#> 10 1984-07-14 pre-millenial
#> # … with 211,201 more rows
More advanced conditional recoding can be done with case_when()
.
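For instance, case_when() checks a sequence of condition ~ value pairs and returns the value for the first condition that is true. A sketch with arbitrary cut-off years chosen purely for illustration:

```r
stops %>%
  mutate(cohort = case_when(
    year(driver_birthdate) < 1980 ~ "born before 1980",
    year(driver_birthdate) < 2000 ~ "born 1980-1999",
    TRUE                          ~ "millenial"   # everything else, incl. year 2000 and later
  )) %>%
  count(cohort)
```

Note that the conditions are evaluated in order, so the first match wins.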
Challenge
Create a new data frame from the stops data that meets the following criteria: it contains only the violation column for female drivers of age 50 that were stopped on a Sunday. For this, add a new column to your data frame called weekday_of_stop containing the number of the weekday when the stop occurred. Use the wday() function from lubridate (Sunday = 1). Think about how the commands should be ordered to produce this data frame!
1.5 What is split-apply-combine?
Many data analysis tasks can be approached using the split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results.

Figure 1.2: Split - Apply - Combine
dplyr makes this possible through the use of the group_by() function. group_by() is often used together with summarize(), which collapses each group into a single-row summary of that group. group_by() takes as arguments the column names that contain the categorical variables for which you want to calculate the summary statistics. So to view the mean age for black and white drivers:
stops %>%
  group_by(driver_race) %>%
  summarize(mean_age = mean(driver_age, na.rm=TRUE))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 2
#> driver_race mean_age
#> <chr> <dbl>
#> 1 Black 34.2
#> 2 White 36.2
#> 3 <NA> 34.5
If we wanted to remove the line with NA
we could insert a filter()
in the chain:
stops %>%
  filter(!is.na(driver_race)) %>%
  group_by(driver_race) %>%
  summarize(mean_age = mean(driver_age, na.rm=TRUE))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 2 x 2
#> driver_race mean_age
#> <chr> <dbl>
#> 1 Black 34.2
#> 2 White 36.2
Recall that is.na()
is a function that determines whether something is an NA
. The !
symbol negates the result, so we’re asking for everything that is not an NA
.
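A quick illustration of how is.na() and ! behave on a small vector:

```r
x <- c(16, NA, 85)
is.na(x)       # FALSE  TRUE FALSE
!is.na(x)      #  TRUE FALSE  TRUE
x[!is.na(x)]   # 16 85 -- only the non-missing values
```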
You can also group by multiple columns:
stops %>%
  filter(!is.na(driver_race)) %>%
  group_by(county_name, driver_race) %>%
  summarize(mean_age = mean(driver_age, na.rm=TRUE))
#> `summarise()` regrouping output by 'county_name' (override with `.groups` argument)
#> # A tibble: 163 x 3
#> # Groups: county_name [82]
#> county_name driver_race mean_age
#> <chr> <chr> <dbl>
#> 1 Adams Black 36.2
#> 2 Adams White 40.0
#> 3 Alcorn Black 34.6
#> 4 Alcorn White 33.6
#> 5 Amite Black 37.5
#> 6 Amite White 42.1
#> 7 Attala Black 36.4
#> 8 Attala White 38.6
#> 9 Benton Black 34.7
#> 10 Benton White 32.0
#> # … with 153 more rows
Once the data are grouped, you can also summarize multiple variables at the same time (and not necessarily on the same variable). For instance, we could add a column indicating the minimum age in each group (i.e. each county and race combination):
stops %>%
  filter(!is.na(driver_race)) %>%
  group_by(county_name, driver_race) %>%
  summarize(mean_age = mean(driver_age, na.rm=TRUE),
            min_age = min(driver_age, na.rm=TRUE))
#> `summarise()` regrouping output by 'county_name' (override with `.groups` argument)
#> # A tibble: 163 x 4
#> # Groups: county_name [82]
#> county_name driver_race mean_age min_age
#> <chr> <chr> <dbl> <dbl>
#> 1 Adams Black 36.2 16
#> 2 Adams White 40.0 16
#> 3 Alcorn Black 34.6 17
#> 4 Alcorn White 33.6 15
#> 5 Amite Black 37.5 17
#> 6 Amite White 42.1 15
#> 7 Attala Black 36.4 8
#> 8 Attala White 38.6 15
#> 9 Benton Black 34.7 18
#> 10 Benton White 32.0 18
#> # … with 153 more rows
1.6 Tallying
When working with data, it is also common to want to know the number of
observations found for categorical variables. For this, dplyr
provides count()
. For example, if we wanted to see how many traffic stops each officer recorded:
stops %>%
  count(officer_id)
By default, count() will name the column with the counts n. We can change this by explicitly providing a value for the name argument:
stops %>%
  count(officer_id, name = "n_stops")
We can optionally sort the results in descending order by adding sort=TRUE
:
stops %>%
  count(officer_id, name = "n_stops", sort = TRUE)
count()
calls group_by()
transparently before counting the total number of records for each category. Similarly, we can count subgroups within groups:
stops %>%
  count(officer_id, violation, name = "n_stops")
Alternatives:
stops %>%
  group_by(officer_id) %>%
  tally(sort = TRUE) # tally() requires group_by before counting
stops %>%
  group_by(officer_id) %>%
  summarize(n = n()) %>% # n() is useful when the count is needed within a calculation
  arrange(desc(n))
Challenge
Which 5 counties were the ones with the most stops in 2013? Hint: use the year() function from lubridate.
1.7 Joining two tables
It is not uncommon that we have our data spread out in different tables and need to bring those together for analysis. In this example we will combine the numbers of stops for black and white drivers per county together with the numbers of the black and white total population for these counties. The population data are the estimated values of the 5 year average from the 2011-2015 American Community Survey (ACS):
acs <- read_csv("data/MS_acs2015_bw.csv")
#> Parsed with column specification:
#> cols(
#> County = col_character(),
#> FIPS = col_double(),
#> black_pop = col_double(),
#> white_pop = col_double(),
#> bw_pop = col_double()
#> )
acs
#> # A tibble: 82 x 5
#> County FIPS black_pop white_pop bw_pop
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Jones 28067 19711 47154 66865
#> 2 Lauderdale 28075 33893 43482 77375
#> 3 Pike 28113 21028 18282 39310
#> 4 Hancock 28045 4172 39686 43858
#> 5 Holmes 28051 15498 3105 18603
#> 6 Jackson 28059 30704 101686 132390
#> 7 Grenada 28043 9417 11991 21408
#> 8 Scott 28123 10562 16920 27482
#> 9 Wayne 28153 8015 12154 20169
#> 10 Bolivar 28011 21648 11197 32845
#> # … with 72 more rows
In a first step we count all the stops per county.
stops %>%
  count(county_name, name = "n_stops")
#> # A tibble: 82 x 2
#> county_name n_stops
#> <chr> <int>
#> 1 Adams 942
#> 2 Alcorn 3345
#> 3 Amite 2921
#> 4 Attala 4203
#> 5 Benton 214
#> 6 Bolivar 4526
#> 7 Calhoun 1658
#> 8 Carroll 1788
#> 9 Chickasaw 3869
#> 10 Choctaw 613
#> # … with 72 more rows
We will then pipe this into our next operation, where we bring the two tables together. We will use left_join, which returns all rows from the left table and all columns from both the left and the right table. As the ID that uniquely identifies the corresponding records in each table, we use the county names.
stops %>%
  count(county_name, name = "n_stops") %>%
  left_join(acs, by = c("county_name" = "County"))
#> # A tibble: 82 x 6
#> county_name n_stops FIPS black_pop white_pop bw_pop
#> <chr> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 Adams 942 28001 17757 12856 30613
#> 2 Alcorn 3345 28003 4281 31563 35844
#> 3 Amite 2921 28005 5416 7395 12811
#> 4 Attala 4203 28007 8194 10649 18843
#> 5 Benton 214 28009 3078 5166 8244
#> 6 Bolivar 4526 28011 21648 11197 32845
#> 7 Calhoun 1658 28013 3991 10103 14094
#> 8 Carroll 1788 28015 3470 6702 10172
#> 9 Chickasaw 3869 28017 7549 9522 17071
#> 10 Choctaw 613 28019 2596 5661 8257
#> # … with 72 more rows
Now we can, for example, calculate the stop rate, i.e. the number of stops per population in each county.
Challenge
Which county has the highest and which one the lowest stop rate? Use the snippet from above and pipe into the additional operations to do this.
dplyr join functions are generally equivalent to merge from base R, but they have a few advantages:
- rows are kept in their existing order,
- they run faster,
- they tell you which keys you're merging by (if you don't supply them), and
- they also work with database tables.
https://groups.google.com/d/msg/manipulatr/OuAPC4VyfIc/Qnt8mDfq0WwJ
See ?dplyr::join for all the possible joins.
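To see how the join types differ, here is a small sketch using two made-up toy tibbles (the names x and y are arbitrary):

```r
x <- tibble(key = c("a", "b", "c"), n = 1:3)
y <- tibble(key = c("a", "b", "d"), m = 4:6)

left_join(x, y, by = "key")   # all rows of x; m is NA where there is no match ("c")
inner_join(x, y, by = "key")  # only keys present in both tables: "a" and "b"
full_join(x, y, by = "key")   # all keys from both tables
anti_join(x, y, by = "key")   # rows of x with no match in y: "c"
```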
This is now also true for base R's read.csv, starting with R version 4.↩︎