Chapter 3 Working with tabular data in R

Learning Objectives

  • Load external data from a .csv file into a data frame in R with read.csv()
  • Find basic properties of a data frames including size, class or type of the columns, names of rows and columns by using str(), nrow(), ncol(), dim(), length() , colnames(), rownames()
  • Use head() and tail() to inspect rows of a data frame.
  • Generate summary statistics for a data frame
  • Use indexing to select rows and columns
  • Use logical conditions to select rows and columns
  • Add columns and rows to a data frame
  • Manipulate categorical data with factors, levels() and as.character()
  • Change how character strings are handled in a data frame.
  • Format dates in R and calculate time differences
  • Use df$new_col <- new_col to add a new column to a data frame.
  • Use cbind() to add a new column to a data frame.
  • Use rbind() to add a new row to a data frame.
  • Use na.omit() to remove rows from a data frame with NA values.

## Loading tabular data

One the most common ways of getting data into R is to read in a table. And – you guessed it – we read it into a data frame! We will take a simple CSV file as example. What is a CSV file?

You may know about the Stanford Open Policing Project and we will be working with a sample dataset from their repository (https://openpolicing.stanford.edu/data/). The sample I extracted contains information about traffic stops for black and white drivers in the state of Mississippi during January 2013 to mid-July of 2016.

We are going to use the R function download.file() to download the CSV file that contains the traffic stop data, and we will use read.csv() to load into memory the content of the CSV file as an object of class data.frame.

To download the data into your local data/ subdirectory, run the following:

You are now ready to load the data:

This statement doesn’t produce any output because, as you might recall, assignments don’t display anything. If we want to check that our data has been loaded, we can print the variable’s value: trafficstops.

Wow… that was a lot of output. At least it means the data loaded properly. Let’s check the top (the first 6 lines) of this data frame using the function head():

#>              id state  stop_date
#> 1 MS-2013-00001    MS 2013-01-01
#> 2 MS-2013-00002    MS 2013-01-01
#> 3 MS-2013-00003    MS 2013-01-01
#> 4 MS-2013-00004    MS 2013-01-01
#> 5 MS-2013-00005    MS 2013-01-01
#> 6 MS-2013-00006    MS 2013-01-01
#>         county_name county_fips
#> 1      Jones County       28067
#> 2 Lauderdale County       28075
#> 3       Pike County       28113
#> 4    Hancock County       28045
#> 5     Holmes County       28051
#> 6    Jackson County       28059
#>            police_department
#> 1 Mississippi Highway Patrol
#> 2 Mississippi Highway Patrol
#> 3 Mississippi Highway Patrol
#> 4 Mississippi Highway Patrol
#> 5 Mississippi Highway Patrol
#> 6 Mississippi Highway Patrol
#>   driver_gender driver_birthdate
#> 1             M       1950-06-14
#> 2             M       1967-04-06
#> 3             M       1974-04-15
#> 4             M       1981-03-23
#> 5             M       1992-08-03
#> 6             F       1960-05-02
#>   driver_race
#> 1       Black
#> 2       Black
#> 3       Black
#> 4       White
#> 5       White
#> 6       White
#>                                                 violation_raw
#> 1                     Seat belt not used properly as required
#> 2                                            Careless driving
#> 3 Speeding - Regulated or posted speed limit and actual speed
#> 4 Speeding - Regulated or posted speed limit and actual speed
#> 5 Speeding - Regulated or posted speed limit and actual speed
#> 6 Speeding - Regulated or posted speed limit and actual speed
#>   officer_id
#> 1       J042
#> 2       B026
#> 3       M009
#> 4       K035
#> 5       D028
#> 6       K023

3.1 Inspecting data.frame Objects

As you may recall, a data frame in R is a special case of a list, and a representation of data where the columns are vectors that all have the same length. Because the columns are vectors, they all contain the same type of data (e.g., characters, integers, factors, etc.).

We can see this when inspecting the structure of a data frame with the function str():

#> 'data.frame':	211211 obs. of  11 variables:
#>  $ id               : chr  "MS-2013-00001" "MS-2013-00002" "MS-2013-00003" "MS-2013-00004" ...
#>  $ state            : chr  "MS" "MS" "MS" "MS" ...
#>  $ stop_date        : chr  "2013-01-01" "2013-01-01" "2013-01-01" "2013-01-01" ...
#>  $ county_name      : chr  "Jones County" "Lauderdale County" "Pike County" "Hancock County" ...
#>  $ county_fips      : int  28067 28075 28113 28045 28051 28059 28059 28043 28051 28051 ...
#>  $ police_department: chr  "Mississippi Highway Patrol" "Mississippi Highway Patrol" "Mississippi Highway Patrol" "Mississippi Highway Patrol" ...
#>  $ driver_gender    : chr  "M" "M" "M" "M" ...
#>  $ driver_birthdate : chr  "1950-06-14" "1967-04-06" "1974-04-15" "1981-03-23" ...
#>  $ driver_race      : chr  "Black" "Black" "Black" "White" ...
#>  $ violation_raw    : chr  "Seat belt not used properly as required" "Careless driving" "Speeding - Regulated or posted speed limit and actual speed" "Speeding - Regulated or posted speed limit and actual speed" ...
#>  $ officer_id       : chr  "J042" "B026" "M009" "K035" ...

We already saw how the functions head() and str() can be useful to check the content and the structure of a data frame. Here is a non-exhaustive list of functions to get a sense of the content/structure of the data. Let’s try them out!

  • Size:
    • dim(trafficstops) - returns a vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)
    • nrow(trafficstops) - returns the number of rows
    • ncol(trafficstops) - returns the number of columns
    • length(trafficstops) - returns number of columns
  • Content:
    • head(trafficstops) - shows the first 6 rows
    • tail(trafficstops) - shows the last 6 rows
  • Names:
    • names(trafficstops) - returns the column names (synonym of colnames() for data.frame objects)
    • rownames(trafficstops) - returns the row names
  • Summary:
    • str(trafficstops) - structure of the object and information about the class, length and content of each column
    • summary(trafficstops) - summary statistics for each column

Note: most of these functions are “generic”, they can be used on other types of objects besides data.frame.

Challenge

Based on the output of str(trafficstops), can you answer the following questions?

  • What is the class of the object trafficstops?
  • How many rows and how many columns are in this object?
  • How many counties have been recorded in this dataset?

3.2 Indexing and subsetting data frames

Our trafficstops data frame has rows and columns (it has 2 dimensions), if we want to extract some specific data from it, we need to specify the “coordinates” we want from it. Row numbers come first, followed by column numbers. However, note that different ways of specifying these coordinates lead to results with different classes.

As well as using numeric values to subset a data.frame (or matrix), columns can be called by name, using one of the four following notations:

For our purposes, the last three notations are equivalent. RStudio knows about the columns in your data frame, so you can take advantage of the autocompletion feature to get the full and correct column name.

Challenge

  1. Create a data.frame (trafficstops_200) containing only the observations from row 200 of the trafficstops dataset.

  2. Notice how nrow() gave you the number of rows in a data.frame?

    • Use that number to pull out just that last row in the data frame.
    • Compare that with what you see as the last row using tail() to make sure it’s meeting expectations.
    • Pull out that last row using nrow() instead of the row number.
    • Create a new data frame object (trafficstops_last) from that last row.
  3. Use nrow() to extract the row that is in the middle of the data frame. Store the content of this row in an object named trafficstops_middle.

  4. Combine nrow() with the - notation above to reproduce the behavior of head(trafficstops) keeping just the first through 6th rows of the trafficstops dataset.

## Conditional subsetting

Often times we need to extract a subset of a data frame based on certain conditions. For example, if we wanted to look at traffic stops in Webster County only we could say:

This is also a possibility (but slower):

#> [1] 156
#> 
#> Black White 
#>    59    97

These commands are from the R base package. In the R Data Wrangling workshop we will discuss a different way of subsetting using functions from the tidyverse package.

Challenge

  • Use subsetting to extract trafficstops in Hancock, Harrison, and Jackson Counties into a separate data frame coastal_counties.
  • Using coastal_counties, count the number of Black and White drivers in the three counties.
  • Bonus: How does the ratio of Black to White stops in the three coastal counties compare to the same ratio for stops in the entire state of Mississippi?

## Adding and removing rows and columns

To add a new column to the data frame we can use the cbind() function.

#>              id state  stop_date
#> 1 MS-2013-00001    MS 2013-01-01
#> 2 MS-2013-00002    MS 2013-01-01
#> 3 MS-2013-00003    MS 2013-01-01
#> 4 MS-2013-00004    MS 2013-01-01
#> 5 MS-2013-00005    MS 2013-01-01
#> 6 MS-2013-00006    MS 2013-01-01
#>         county_name county_fips
#> 1      Jones County       28067
#> 2 Lauderdale County       28075
#> 3       Pike County       28113
#> 4    Hancock County       28045
#> 5     Holmes County       28051
#> 6    Jackson County       28059
#>            police_department
#> 1 Mississippi Highway Patrol
#> 2 Mississippi Highway Patrol
#> 3 Mississippi Highway Patrol
#> 4 Mississippi Highway Patrol
#> 5 Mississippi Highway Patrol
#> 6 Mississippi Highway Patrol
#>   driver_gender driver_birthdate
#> 1             M       1950-06-14
#> 2             M       1967-04-06
#> 3             M       1974-04-15
#> 4             M       1981-03-23
#> 5             M       1992-08-03
#> 6             F       1960-05-02
#>   driver_race
#> 1       Black
#> 2       Black
#> 3       Black
#> 4       White
#> 5       White
#> 6       White
#>                                                 violation_raw
#> 1                     Seat belt not used properly as required
#> 2                                            Careless driving
#> 3 Speeding - Regulated or posted speed limit and actual speed
#> 4 Speeding - Regulated or posted speed limit and actual speed
#> 5 Speeding - Regulated or posted speed limit and actual speed
#> 6 Speeding - Regulated or posted speed limit and actual speed
#>   officer_id new_col
#> 1       J042       1
#> 2       B026       2
#> 3       M009       3
#> 4       K035       4
#> 5       D028       5
#> 6       K023       6

Alternatively, we can also add a new column adding the new column name after the $ sign then assigning the value, like below. Note that this will change the original data frame, which you may not always want to do.

There is an equivalent function, rbind() to add a new row to a data frame. I use this far less frequently than the column equivalent. The one thing to keep in mind is that the row to be added to the data frame needs to match the order and type of columns in the data frame. Remember that R’s way to store multiple different data types in one object is a list. So if we wanted to add a new row to trafficstops we would say:

#>                   id state  stop_date
#> 211207 MS-2016-24293    MS 2016-07-09
#> 211208 MS-2016-24294    MS 2016-07-10
#> 211209 MS-2016-24295    MS 2016-07-11
#> 211210 MS-2016-24296    MS 2016-07-14
#> 211211 MS-2016-24297    MS 2016-07-14
#> 211212 MS-2017-12345    MS 2017-08-24
#>                county_name county_fips
#> 211207       George County       28039
#> 211208       Copiah County       28029
#> 211209      Grenada County       28043
#> 211210       Copiah County       28029
#> 211211       Copiah County       28029
#> 211212 Tallahatchie County       12345
#>                 police_department
#> 211207 Mississippi Highway Patrol
#> 211208 Mississippi Highway Patrol
#> 211209 Mississippi Highway Patrol
#> 211210 Mississippi Highway Patrol
#> 211211 Mississippi Highway Patrol
#> 211212                       MSHP
#>        driver_gender driver_birthdate
#> 211207             M       1992-07-14
#> 211208             M       1975-12-23
#> 211209             M       1998-02-02
#> 211210             F       1970-06-14
#> 211211             M       1948-03-11
#> 211212             F       1999-06-14
#>        driver_race
#> 211207       White
#> 211208       Black
#> 211209       White
#> 211210       White
#> 211211       White
#> 211212    Hispanic
#>                                                      violation_raw
#> 211207 Speeding - Regulated or posted speed limit and actual speed
#> 211208 Speeding - Regulated or posted speed limit and actual speed
#> 211209                     Seat belt not used properly as required
#> 211210       Expired or no non-commercial driver license or permit
#> 211211                     Seat belt not used properly as required
#> 211212                                                    Speeding
#>        officer_id
#> 211207       K025
#> 211208       C033
#> 211209       D014
#> 211210       C015
#> 211211       C015
#> 211212       ABCD

A convenient function to know about is na.omit(). It will remove all rows from a data frame that have at least one column with NA values.

Challenge

  • Given the following data frame:

What would you expect the following commands to return?

3.3 Categorical data: factors

Factors are very useful and are actually something that make R particularly well suited to working with data, so we’re going to spend a little time introducing them.

Factors are used to represent categorical data. Factors can be ordered or unordered, and understanding them is necessary for statistical analysis and for plotting.

Factors are stored as integers, and have labels (text) associated with these unique integers. While factors look (and often behave) like character vectors, they are actually integers under the hood, and you need to be careful when treating them like strings.

Once created, factors can only contain a pre-defined set of values, known as levels. By default, R always sorts levels in alphabetical order. For instance, if you have a factor with 2 levels:

R will assign 1 to the level "democrat" and 2 to the level "republican" (because d comes before r, even though the first element in this vector is "republican"). You can check this by using the function levels(), and check the number of levels using nlevels():

Sometimes, the order of the factors does not matter, other times you might want to specify the order because it is meaningful (e.g., “low”, “medium”, “high”), it improves your visualization, or it is required by a particular type of analysis. Here, one way to reorder our levels in the party vector would be:

#> [1] republican democrat   democrat  
#> [4] republican
#> Levels: democrat republican
#> [1] republican democrat   democrat  
#> [4] republican
#> Levels: republican democrat

In R’s memory, these factors are represented by integers (1, 2, 3), but are more informative than integers because factors are self describing: "democrat", "republican" is more descriptive than 1, 2. Which one is “republican”? You wouldn’t be able to tell just from the integer data. Factors, on the other hand, have this information built in. It is particularly helpful when there are many levels (like the county names in our example dataset).

3.3.1 Converting factors

If you need to convert a factor to a character vector, you use as.character(x).

Converting factors where the levels appear as numbers (such as concentration levels, or years) to a numeric vector is a little trickier. One method is to convert factors to characters and then numbers. Another method is to use the levels() function. Compare:

Notice that in the levels() approach, three important steps occur:

  • We obtain all the factor levels using levels(f)
  • We convert these levels to numeric values using as.numeric(levels(f))
  • We then access these numeric values using the underlying integers of the vector f as indices inside the square brackets

3.3.2 Renaming factors

When your data is stored as a factor, you can use the plot() function to get a quick glance at the number of observations represented by each factor level. Let’s look at the number of blacks and whites in the dataset:

There seem to be a number of individuals for which the race information hasn’t been recorded.

Additionally, for these individuals, there is no label to indicate that the information is missing. Let’s rename this label to something more meaningful. Before doing that, we’re going to pull out the data on race and work with that data, so we’re not modifying the working copy of the data frame:

#> [1] Black Black Black White White White
#> Levels:  Black White
#> [1] ""      "Black" "White"
#> [1] "Missing" "Black"   "White"
#> [1] Black Black Black White White White
#> Levels: Missing Black White

Challenge

  • Rename “Black” to “African American”.
  • Now that we have renamed the factor level to “Missing”, can you recreate the barplot such that “Missing” is last (to the right)?

3.3.3 Using stringsAsFactors

As of R version 4.0.0, the stringsAsfactors argument defaults to FALSE when using read.csv().

Depending on what you want to do with the data, you may want to keep these columns as character or set stringsAsFactors = TRUE to coerce strings to factors when importing csv files. Alternatively, the read_csv() function from the dplyr library also does not covert strings to factors (more on this another day).

Compare the output of str(trafficstops) when setting stringsAsFactors = TRUE (default) and stringsAsFactors = FALSE:

Challenge

Can you predict the class for each of the columns in the following example? Check your guesses using str(country_climate): * Are they what you expected? Why? Why not? * What would have been different if we had added stringsAsFactors = FALSE to this call? * What would you need to change to ensure that each column had the accurate data type?

```
country_climate <- data.frame(
       country=c("Canada", "Panama", "South Africa", "Australia"),
       climate=c("cold", "hot", "temperate", "hot/temperate"),
       temperature=c(10, 30, 18, "15"),
       northern_hemisphere=c(TRUE, TRUE, FALSE, "FALSE"),
       has_kangaroo=c(FALSE, FALSE, FALSE, 1)
       )
```

The automatic conversion of data type is sometimes a blessing, sometimes an annoyance. Be aware that it exists, learn the rules, and double check that data you import in R are of the correct type within your data frame. If not, use it to your advantage to detect mistakes that might have been introduced during data entry (a letter in a column that should only contain numbers for instance).

3.4 Dates

One of the most common issues that new (and experienced!) R users have is converting date and time information into a variable that is appropriate and usable during analyses. If you have control over your data it might be useful to ensure that each component of your date is stored as a separate variable, i.e a separate column for day, month, and year. However, often we do not have control and the date is stored in one single column and with varying order and separating characters between its components.

Using str(), we can see that both dates in our data frame stop_date and driver_birthdate are each stored in one column.

As an example for how to work with dates let us see if there are seasonal differences in the number of traffic stops.

We’re going to be using the ymd() function from the package lubridate. This function is designed to take a vector representing year, month, and day and convert that information to a POSIXct vector. POSIXct is a class of data recognized by R as being a date or date and time. The argument that the function requires is relatively flexible, but, as a best practice, is a character vector formatted as “YYYY-MM-DD”.

Start by loading the required package:

The ymd function also has nicely taken care of the fact that the original format of the date column is a factor!

We can now easily extract year, month, and date using the respective functions: year(), month(), and day() like so:

Challenge

  • Are there more stops in certain months of the year or certain days of the month?

Challenge

  • Determine the age of the driver in years (approximate) at the time of the stop:
  • Extract driver_birthdate into a vector birth_date
  • Create a new vector age with the driver’s age at the time of the stop in years
  • Coerce age to a factor and use the plot function to check your results. What do you find?