Chapter 2 Data Manipulation using tidyr

Learning Objectives

  • Understand the concept of a wide and a long table format and for which purpose those formats are useful.
  • Understand what key-value pairs are.
  • Reshape a data frame from long to wide format and back with the pivot_wider and pivot_longer commands from the tidyr package.
  • Export a data frame to a .csv file.

dplyr pairs nicely with tidyr which enables you to swiftly convert between different data formats for plotting and analysis.

The package tidyr addresses the common problem of wanting to reshape your data for plotting and use by different R functions. Sometimes we want data sets where we have one row per observation. Sometimes we want a data frame where each observation type has its own column, and rows are instead more aggregated groups - like surveys, where each column represents an answer. Moving back and forth between these formats is nontrivial, and tidyr gives you tools for this and more sophisticated data manipulation.

To learn more about tidyr after the workshop, you may want to check out this cheatsheet about tidyr.

2.1 About long and wide table format

The ‘long’ format is where:

  • each column is a variable
  • each row is an observation

In the ‘long’ format, you usually have 1 column for the observed variable and the other columns are ID variables.

For the ‘wide’ format a row, for example could be a reserach subject for which you have multiple observation variables containing the same type of data, for example responses to a set of survey questions, or repeated observations over time, or a mix of both. Here is an example:

You may find data input may be simpler or some other applications may prefer the ‘wide’ format. However, many of R‘s functions have been designed assuming you have ’long’ format data. This tutorial will help you efficiently transform your data regardless of original format.

Wide vs. Long Table Format

Figure 2.1: Wide vs. Long Table Format

The choice of data format affects readability. For humans, the wide format is often more intuitive, since we can often see more of the data on the screen due to its shape. However, the long format is more machine readable and is closer to the formatting of databases. The ID variables in our dataframes are similar to the fields in a database and observed variables are like the database values.

Challenge 1

Is trafficstops in a long or wide format?

2.2 Long to Wide with pivot_wider

Now let’s see this in action. First, using dplyr, let’s create a data frame with the mean age of each driver by gender and county:

#> # A tibble: 164 x 3
#> # Groups:   county_name [82]
#>    county_name   driver_gender mean_age
#>    <fct>         <fct>            <dbl>
#>  1 Adams County  female            36.7
#>  2 Adams County  male              38.4
#>  3 Alcorn County female            33.3
#>  4 Alcorn County male              34.1
#>  5 Amite County  female            38.3
#>  6 Amite County  male              40.3
#>  7 Attala County female            36.7
#>  8 Attala County male              38.1
#>  9 Benton County female            32.1
#> 10 Benton County male              34.4
#> # … with 154 more rows

Now, to make this long data wide, we use pivot_wider from tidyr to turn the driver gender into columns. In addition to our data table we provide pivot_wider with two arguments: names_from describes which column to use for name of the output column, and values_from tells it from column to get the cell values. We’ll use a pipe so we can ignore the data argument.

#> # A tibble: 82 x 3
#> # Groups:   county_name [82]
#>    county_name      female  male
#>    <fct>             <dbl> <dbl>
#>  1 Adams County       36.7  38.4
#>  2 Alcorn County      33.3  34.1
#>  3 Amite County       38.3  40.3
#>  4 Attala County      36.7  38.1
#>  5 Benton County      32.1  34.4
#>  6 Bolivar County     33.2  36.3
#>  7 Calhoun County     33.3  34.8
#>  8 Carroll County     33.8  36.0
#>  9 Chickasaw County   33.2  34.6
#> 10 Choctaw County     35.8  36.8
#> # … with 72 more rows

We can now do things like compare the mean age of men against women drivers. As example we use the age difference to find the counties with the largest and with the smallest number. (A negative number means that female drivers are on average older than male drivers, a positive number means that male drivers are on average older than women drivers.)

#> # A tibble: 2 x 4
#>   county_name      female  male agediff
#>   <fct>             <dbl> <dbl>   <dbl>
#> 1 Neshoba County     35.1  31.1   -3.94
#> 2 Yalobusha County   33.4  39.4    5.99

Note that trafficstops_ma_wide is derived from trafficstops_ma, and is a “grouped” data frame, which was created with the group_by function above. (Check class(trafficstops_ma) and class(trafficstops_ma_wide)). That means that any instruction that follows will operate on each group (in this case county) separately. That may be ok for some instances (like mutate), but if we are interested in retrieving the maximum and the minumim age difference over all counties we need to ungroup the tibble to have the filter command operate on the entire dataset instead of each group (i.e. county).

2.3 Wide to long with pivot_longer

What if we had the opposite problem, and wanted to go from a wide to long format? For that, we use pivot_longer, which will increase the number of rows and decrease the number of columns. We provide the functino with thee arguments: cols which are the columns we want to pivot into the long format, names_to, which is a string specifying the name of the column to create from the data stored in the column names, and values_to, which is also a string, specifying the name of the column to create from the data stored in cell values. So, to go backwards from trafficstops_ma_wide, and exclude county_name from the long, we would do the following:

#> # A tibble: 164 x 3
#> # Groups:   county_name [82]
#>    county_name   driver_gender mean_age
#>    <fct>         <chr>            <dbl>
#>  1 Adams County  female            36.7
#>  2 Adams County  male              38.4
#>  3 Alcorn County female            33.3
#>  4 Alcorn County male              34.1
#>  5 Amite County  female            38.3
#>  6 Amite County  male              40.3
#>  7 Attala County female            36.7
#>  8 Attala County male              38.1
#>  9 Benton County female            32.1
#> 10 Benton County male              34.4
#> # … with 154 more rows

We could also have used a specification for what columns to include. This can be useful if you have a large number of identifying columns, and it’s easier to specify what to gather than what to leave alone. And if the columns are adjacent to each other, we don’t even need to list them all out – we can use the : operator!

#> # A tibble: 164 x 3
#> # Groups:   county_name [82]
#>    county_name   driver_gender mean_age
#>    <fct>         <chr>            <dbl>
#>  1 Adams County  male              38.4
#>  2 Adams County  female            36.7
#>  3 Alcorn County male              34.1
#>  4 Alcorn County female            33.3
#>  5 Amite County  male              40.3
#>  6 Amite County  female            38.3
#>  7 Attala County male              38.1
#>  8 Attala County female            36.7
#>  9 Benton County male              34.4
#> 10 Benton County female            32.1
#> # … with 154 more rows

There are many powerful operations you can do with the pivot_* functions. To learn more review the vignette:

Challenge

1.From the trafficstops dataframe create a wide data frame tr_wide with “year” as columns, each row is a different violation (“violation_raw”), and the values are the number of traffic stops per each violation, roughly like this:

violation_raw | 2013 | 2014 | 2015 ... Improper turn | 65 | 54| 67 ... Speeding | 713 | 948| 978 ... ...

Use year() from the lubridate package. Hint: You will need to summarize and count the traffic stops before reshaping the table.

  1. Now take the data frame, and make it long again, so each row is a unique violation_raw - year combination, like this:

violation_raw | year | n of stops Improper turn | 2013 | 65 Improper turn | 2014 | 54 ... etc

2.4 Exporting data

Similar to the read.csv() function used for reading CSV files into R, there is a write.csv() function that generates CSV files from data frames.

Before using write.csv(), we are going to create a new folder, data_output, in our working directory that will store this generated dataset. We don’t want to write generated datasets in the same directory as our raw data. It’s good practice to keep them separate. The data folder should only contain the raw, unaltered data, and should be left alone to make sure we don’t delete or modify it. In contrast, our script will generate the contents of the data_output directory, so even if the files it contains are deleted, we can always re-generate them.

We can save the table generated above in our data_output folder. By default, write.csv() includes a column with row names (in our case these names are the row numbers), so we need to add row.names = FALSE so they are not included: