Chapter 2 Data Manipulation using tidyr

Learning Objectives

  • Understand the concept of a wide and a long table format and for which purpose those formats are useful.
  • Understand what key-value pairs are.
  • Reshape a data frame from long to wide format and back with the pivot_wider and pivot_longer commands from the tidyr package.
  • Export a data frame to a .csv file.

dplyr pairs nicely with tidyr which enables you to swiftly convert between different data formats for plotting and analysis.

The package tidyr addresses the common problem of wanting to reshape your data for plotting and use by different R functions. Sometimes we want data sets where we have one row per observation. Sometimes we want a data frame where each observation type has its own column, and rows are instead more aggregated groups - like surveys, where each column represents an answer. Moving back and forth between these formats is nontrivial, and tidyr gives you tools for this and more sophisticated data manipulation.

To learn more about tidyr after the workshop, you may want to check out this cheatsheet about tidyr.

2.1 About long and wide table format

The ‘long’ format is where:

  • each column is a variable
  • each row is an observation

In the ‘long’ format, you usually have 1 column for the observed variable and the other columns are ID variables.

For the ‘wide’ format a row, for example could be a reserarch subject for which you have multiple observation variables containing the same type of data, for example responses to a set of survey questions, or repeated observations over time, or a mix of both. Here is an example:

You may find data input may be simpler or some other applications may prefer the ‘wide’ format. However, many of R‘s functions have been designed assuming you have ’long’ format data. This tutorial will help you efficiently transform your data regardless of original format.

Wide vs. Long Table Format

Figure 2.1: Wide vs. Long Table Format

The choice of data format affects readability. For humans, the wide format is often more intuitive, since we can often see more of the data on the screen due to its shape. However, the long format is more machine readable and is closer to the formatting of databases. The ID variables in our dataframes are similar to the fields in a database and observed variables are like the database values.

Challenge 1

Is stops in a long or wide format?

2.2 Long to Wide with pivot_wider

Now let’s see this in action. First, using dplyr, let’s create a data frame with the counts of different violations for each county:

violations <- stops %>%
  count(county_name, violation)

violations
#>   county_name                violation   n
#> 1       Adams        Breaks-Lights-etc   7
#> 2       Adams         Careless driving  48
#> 3       Adams License-Permit-Insurance 118
#> 4       Adams         Other or unknown  35
#> 5       Adams                Seat belt 229
#> 6       Adams                 Speeding 505

Now, to make this long data wide, we use pivot_wider from tidyr to turn the driver gender into columns. In addition to our data table we provide pivot_wider with two arguments: names_from describes which column to use for name of the output column, and values_from tells it from column to get the cell values. We’ll use a pipe so we can ignore the data argument.

violations_wide <- violations %>%
  pivot_wider(names_from = violation, 
              values_from = n) 

violations_wide
#> # A tibble: 82 x 7
#>    county_name `Breaks-Lights-… `Careless drivi… `License-Permit…
#>    <chr>                  <int>            <int>            <int>
#>  1 Adams                      7               48              118
#>  2 Alcorn                    62              100              737
#>  3 Amite                     47               86              370
#>  4 Attala                    99              113              526
#>  5 Benton                     3                9               73
#>  6 Bolivar                   57              139             1034
#>  7 Calhoun                   26               38              383
#>  8 Carroll                   26               40              323
#>  9 Chickasaw                 42               53             1378
#> 10 Choctaw                    8                6               73
#> # … with 72 more rows, and 3 more variables: `Other or unknown` <int>, `Seat
#> #   belt` <int>, Speeding <int>

2.3 Wide to long with pivot_longer

What if we had the opposite problem, and wanted to go from a wide to long format? For that, we use pivot_longer, which will increase the number of rows and decrease the number of columns. We provide the functino with thee arguments: cols which are the columns we want to pivot into the long format, names_to, which is a string specifying the name of the column to create from the data stored in the column names, and values_to, which is also a string, specifying the name of the column to create from the data stored in cell values. So, to go backwards from violations_wide, and exclude county_name from the long format, we would do the following:

violations_long <- violations_wide %>%
  pivot_longer(cols = -county_name,        # exclude column with county name
               names_to = "violation",     # name is a string!
               values_to = "n")            # also a string

violations_long
#> # A tibble: 492 x 3
#>    county_name violation                    n
#>    <chr>       <chr>                    <int>
#>  1 Adams       Breaks-Lights-etc            7
#>  2 Adams       Careless driving            48
#>  3 Adams       License-Permit-Insurance   118
#>  4 Adams       Other or unknown            35
#>  5 Adams       Seat belt                  229
#>  6 Adams       Speeding                   505
#>  7 Alcorn      Breaks-Lights-etc           62
#>  8 Alcorn      Careless driving           100
#>  9 Alcorn      License-Permit-Insurance   737
#> 10 Alcorn      Other or unknown           418
#> # … with 482 more rows

We could also have used a specification for what columns to include. This can be useful if you have a large number of identifying columns, and it’s easier to specify what to gather than what to leave alone. And if the columns are adjacent to each other, we don’t even need to list them all out – we can use the : operator!

violations_wide %>%
  pivot_longer(cols = `Breaks-Lights-etc`:Speeding,      # this also works
               names_to = "violation", 
               values_to = "n")
#> # A tibble: 492 x 3
#>    county_name violation                    n
#>    <chr>       <chr>                    <int>
#>  1 Adams       Breaks-Lights-etc            7
#>  2 Adams       Careless driving            48
#>  3 Adams       License-Permit-Insurance   118
#>  4 Adams       Other or unknown            35
#>  5 Adams       Seat belt                  229
#>  6 Adams       Speeding                   505
#>  7 Alcorn      Breaks-Lights-etc           62
#>  8 Alcorn      Careless driving           100
#>  9 Alcorn      License-Permit-Insurance   737
#> 10 Alcorn      Other or unknown           418
#> # … with 482 more rows

There are many powerful operations you can do with the pivot_* functions. To learn more review the vignette:

vignette("pivot")

Challenge

1.From the stops dataframe create a wide data frame tr_wide with “year” as columns, each row is a different violation, and the values are the number of traffic stops per each violation, roughly like this:

violation | 2013 | 2014 | 2015 ... Break-Lights | 65 | 54| 67 ... Speeding | 713 | 948| 978 ... ...

Use year() from the lubridate package. Hint: You will need to summarize and count the traffic stops before reshaping the table.

  1. Now take the data frame, and make it long again, so each row is a unique violation - year combination, like this:

violation | year | n of stops Speeding | 2013 | 65 Speeding | 2014 | 54 ... etc

2.4 Exporting data

Similar to the read_csv() function used for reading CSV files into R, there is a write_csv() function that generates CSV files from data frames.

Before using write_csv(), we are going to create a new folder, data_output, in our working directory that will store this generated dataset. We don’t want to write generated datasets in the same directory as our raw data. It’s good practice to keep them separate. The data folder should only contain the raw, unaltered data, and should be left alone to make sure we don’t delete or modify it. In contrast, our script will generate the contents of the data_output directory, so even if the files it contains are deleted, we can always re-generate them.

We can now save the table generated above in our data_output folder:

write_csv(violation_wide, "data_output/county_violations.csv")