Chapter 3 Working with tabular data in R

Learning Objectives

  • Load external data from a .csv file into a data frame in R with read.csv()
  • Find basic properties of a data frames including size, class or type of the columns, names of rows and columns by using str(), nrow(), ncol(), dim(), length() , colnames(), rownames()
  • Use head() and tail() to inspect rows of a data frame.
  • Generate summary statistics for a data frame
  • Use indexing to select rows and columns
  • Use logical conditions to select rows and columns
  • Add columns and rows to a data frame
  • Manipulate categorical data with factors, levels() and as.character()
  • Change how character strings are handled in a data frame.
  • Format dates in R and calculate time differences
  • Use df$new_col <- new_col to add a new column to a data frame.
  • Use cbind() to add a new column to a data frame.
  • Use rbind() to add a new row to a data frame.
  • Use na.omit() to remove rows from a data frame with NA values.

3.1 Loading tabular data

One the most common ways of getting data into R is to read in a table. And – you guessed it – we read it into a data frame! We will take a simple CSV file as example. What is a CSV file?

You may know about the Stanford Open Policing Project and we will be working with a sample dataset from their repository (https://openpolicing.stanford.edu/data/). The sample I extracted contains information about traffic stops for black and white drivers in the state of Mississippi during January 2013 to mid-July of 2016.

We are going to use the R function download.file() to download the CSV file that contains the traffic stop data, and we will use read.csv() to load into memory the content of the CSV file as an object of class data.frame.

To download the data into your local data/ subdirectory, run the following:

download.file("http://bit.ly/MS_trafficstops_bw", "data/MS_trafficstops_bw.csv")

You are now ready to load the data:

trafficstops <- read.csv("data/MS_trafficstops_bw.csv")

This statement doesn’t produce any output because, as you might recall, assignments don’t display anything. If we want to check that our data has been loaded, we can print the variable’s value: trafficstops.

Wow… that was a lot of output. At least it means the data loaded properly. Let’s check the top (the first 6 lines) of this data frame using the function head():

head(trafficstops)
#>              id state  stop_date       county_name county_fips
#> 1 MS-2013-00001    MS 2013-01-01      Jones County       28067
#> 2 MS-2013-00002    MS 2013-01-01 Lauderdale County       28075
#> 3 MS-2013-00003    MS 2013-01-01       Pike County       28113
#> 4 MS-2013-00004    MS 2013-01-01    Hancock County       28045
#> 5 MS-2013-00005    MS 2013-01-01     Holmes County       28051
#> 6 MS-2013-00006    MS 2013-01-01    Jackson County       28059
#>            police_department driver_gender driver_birthdate driver_race
#> 1 Mississippi Highway Patrol             M       1950-06-14       Black
#> 2 Mississippi Highway Patrol             M       1967-04-06       Black
#> 3 Mississippi Highway Patrol             M       1974-04-15       Black
#> 4 Mississippi Highway Patrol             M       1981-03-23       White
#> 5 Mississippi Highway Patrol             M       1992-08-03       White
#> 6 Mississippi Highway Patrol             F       1960-05-02       White
#>                                                 violation_raw officer_id
#> 1                     Seat belt not used properly as required       J042
#> 2                                            Careless driving       B026
#> 3 Speeding - Regulated or posted speed limit and actual speed       M009
#> 4 Speeding - Regulated or posted speed limit and actual speed       K035
#> 5 Speeding - Regulated or posted speed limit and actual speed       D028
#> 6 Speeding - Regulated or posted speed limit and actual speed       K023

3.2 Inspecting data.frame Objects

As you may recall, a data frame in R is a special case of a list, and a representation of data where the columns are vectors that all have the same length. Because the columns are vectors, they all contain the same type of data (e.g., characters, integers, factors, etc.).

We can see this when inspecting the structure of a data frame with the function str():

str(trafficstops)
#> 'data.frame':    211211 obs. of  11 variables:
#>  $ id               : Factor w/ 211211 levels "MS-2013-00001",..: 1 2 3 4 5 6 7 8 9 10 ...
#>  $ state            : Factor w/ 1 level "MS": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ stop_date        : Factor w/ 1288 levels "2013-01-01","2013-01-02",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ county_name      : Factor w/ 82 levels "Adams County",..: 34 38 57 23 26 30 30 22 26 26 ...
#>  $ county_fips      : int  28067 28075 28113 28045 28051 28059 28059 28043 28051 28051 ...
#>  $ police_department: Factor w/ 1 level "Mississippi Highway Patrol": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ driver_gender    : Factor w/ 3 levels "","F","M": 3 3 3 3 3 2 2 2 3 3 ...
#>  $ driver_birthdate : Factor w/ 21423 levels "","1930-01-11",..: 3558 9575 12137 14670 18820 7061 4504 19135 2755 15878 ...
#>  $ driver_race      : Factor w/ 3 levels "","Black","White": 2 2 2 3 3 3 3 3 3 3 ...
#>  $ violation_raw    : Factor w/ 19 levels "??","Careless driving",..: 17 2 19 19 19 19 19 19 19 19 ...
#>  $ officer_id       : Factor w/ 897 levels "","A003","A004",..: 519 52 635 560 212 550 559 205 723 723 ...

We already saw how the functions head() and str() can be useful to check the content and the structure of a data frame. Here is a non-exhaustive list of functions to get a sense of the content/structure of the data. Let’s try them out!

  • Size:
    • dim(trafficstops) - returns a vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)
    • nrow(trafficstops) - returns the number of rows
    • ncol(trafficstops) - returns the number of columns
    • length(trafficstops) - returns number of columns
  • Content:
    • head(trafficstops) - shows the first 6 rows
    • tail(trafficstops) - shows the last 6 rows
  • Names:
    • names(trafficstops) - returns the column names (synonym of colnames() for data.frame objects)
    • rownames(trafficstops) - returns the row names
  • Summary:
    • str(trafficstops) - structure of the object and information about the class, length and content of each column
    • summary(trafficstops) - summary statistics for each column

Note: most of these functions are “generic”, they can be used on other types of objects besides data.frame.

Challenge

Based on the output of str(trafficstops), can you answer the following questions?

  • What is the class of the object trafficstops?
  • How many rows and how many columns are in this object?
  • How many counties have been recorded in this dataset?

3.3 Indexing and subsetting data frames

Our trafficstops data frame has rows and columns (it has 2 dimensions), if we want to extract some specific data from it, we need to specify the “coordinates” we want from it. Row numbers come first, followed by column numbers. However, note that different ways of specifying these coordinates lead to results with different classes.

trafficstops[1, 1]   # first element in the first column of the data frame (as a vector)
trafficstops[1, 6]   # first element in the 6th column (as a vector)
trafficstops[, 1]    # first column in the data frame (as a vector)
trafficstops[1]      # first column in the data frame (as a data.frame)
trafficstops[1:3, 7] # first three elements in the 7th column (as a vector)
trafficstops[3, ]    # the 3rd row (as a data.frame)
trafficstops[1:6, ]  # the 1st to 6th rows, equivalent to head(trafficstops)
trafficstops[, -1]           # the whole data frame, excluding the first column
trafficstops[-c(7:211211),]  # equivalent to head(trafficstops)

As well as using numeric values to subset a data.frame (or matrix), columns can be called by name, using one of the four following notations:

trafficstops["violation_raw"]       # Result is a data.frame
trafficstops[, "violation_raw"]     # Result is a vector
trafficstops[["violation_raw"]]     # Result is a vector
trafficstops$violation_raw          # Result is a vector

For our purposes, the last three notations are equivalent. RStudio knows about the columns in your data frame, so you can take advantage of the autocompletion feature to get the full and correct column name.

Challenge

  1. Create a data.frame (trafficstops_200) containing only the observations from row 200 of the trafficstops dataset.

  2. Notice how nrow() gave you the number of rows in a data.frame?

    • Use that number to pull out just that last row in the data frame.
    • Compare that with what you see as the last row using tail() to make sure it’s meeting expectations.
    • Pull out that last row using nrow() instead of the row number.
    • Create a new data frame object (trafficstops_last) from that last row.
  3. Use nrow() to extract the row that is in the middle of the data frame. Store the content of this row in an object named trafficstops_middle.

  4. Combine nrow() with the - notation above to reproduce the behavior of head(trafficstops) keeping just the first through 6th rows of the trafficstops dataset.

3.4 Conditional subsetting

Often times we need to extract a subset of a data frame based on certain conditions. For example, if we wanted to look at traffic stops in Webster County only we could say:

# the condition:
# returns a logical vector of the length of the column
trafficstops$county_name == "Webster County" 

# use this vector to extract rows and all columns
# note the comma: we want *all* columns
trafficstops[trafficstops$county_name == "Webster County", ] 

# assign extract to a new data frame
Webster_trafficstops <- trafficstops[trafficstops$county_name == "Webster County", ]

This is also a possibility (but slower):

Webster_trafficstops <- subset(trafficstops, county_name == "Webster County")
nrow(Webster_trafficstops) # 393 stops in Webster County!
#> [1] 156
# and if we wanted to see the breakdown by race:
table(Webster_trafficstops$driver_race)
#> 
#>       Black White 
#>     0    59    97

These commands are from the R base package. In the R Data Wrangling workshop we will discuss a different way of subsetting using functions from the tidyverse package.

Challenge

  • Use subsetting to extract trafficstops in Hancock, Harrison, and Jackson Counties into a separate data frame coastal_counties.
  • Using coastal_counties, count the number of Black and White drivers in the three counties.
  • Bonus: How does the ratio of Black to White stops in the three coastal counties compare to the same ratio for stops in the entire state of Mississippi?

3.5 Adding and removing rows and columns

To add a new column to the data frame we can use the cbind() function.

new_col <- row.names(trafficstops)
trafficstops_withnewcol <- cbind(trafficstops, new_col)
head(trafficstops_withnewcol)
#>              id state  stop_date       county_name county_fips
#> 1 MS-2013-00001    MS 2013-01-01      Jones County       28067
#> 2 MS-2013-00002    MS 2013-01-01 Lauderdale County       28075
#> 3 MS-2013-00003    MS 2013-01-01       Pike County       28113
#> 4 MS-2013-00004    MS 2013-01-01    Hancock County       28045
#> 5 MS-2013-00005    MS 2013-01-01     Holmes County       28051
#> 6 MS-2013-00006    MS 2013-01-01    Jackson County       28059
#>            police_department driver_gender driver_birthdate driver_race
#> 1 Mississippi Highway Patrol             M       1950-06-14       Black
#> 2 Mississippi Highway Patrol             M       1967-04-06       Black
#> 3 Mississippi Highway Patrol             M       1974-04-15       Black
#> 4 Mississippi Highway Patrol             M       1981-03-23       White
#> 5 Mississippi Highway Patrol             M       1992-08-03       White
#> 6 Mississippi Highway Patrol             F       1960-05-02       White
#>                                                 violation_raw officer_id
#> 1                     Seat belt not used properly as required       J042
#> 2                                            Careless driving       B026
#> 3 Speeding - Regulated or posted speed limit and actual speed       M009
#> 4 Speeding - Regulated or posted speed limit and actual speed       K035
#> 5 Speeding - Regulated or posted speed limit and actual speed       D028
#> 6 Speeding - Regulated or posted speed limit and actual speed       K023
#>   new_col
#> 1       1
#> 2       2
#> 3       3
#> 4       4
#> 5       5
#> 6       6

Alternatively, we can also add a new column adding the new column name after the $ sign then assigning the value, like below. Note that this will change the original data frame, which you may not always want to do.

trafficstops$row_numbers <- c(1:nrow(trafficstops))
trafficstops$all_false <- FALSE  # what do you think will happen here?

There is an equivalent function, rbind() to add a new row to a data frame. I use this far less frequently than the column equivalent. The one thing to keep in mind is that the row to be added to the data frame needs to match the order and type of columns in the data frame. Remember that R’s way to store multiple different data types in one object is a list. So if we wanted to add a new row to trafficstops we would say:

new_row <- data.frame(id="MS-2017-12345", state="MS", stop_date="2017-08-24",
                county_name="Tallahatchie County", county_fips=12345,
                police_department="MSHP", driver_gender="F", driver_birthdate="1999-06-14",
                driver_race="Hispanic", violation_raw="Speeding", officer_id="ABCD")

trafficstops_withnewrow <- rbind(trafficstops, new_row)
tail(trafficstops_withnewrow)
#>                   id state  stop_date         county_name county_fips
#> 211207 MS-2016-24293    MS 2016-07-09       George County       28039
#> 211208 MS-2016-24294    MS 2016-07-10       Copiah County       28029
#> 211209 MS-2016-24295    MS 2016-07-11      Grenada County       28043
#> 211210 MS-2016-24296    MS 2016-07-14       Copiah County       28029
#> 211211 MS-2016-24297    MS 2016-07-14       Copiah County       28029
#> 211212 MS-2017-12345    MS 2017-08-24 Tallahatchie County       12345
#>                 police_department driver_gender driver_birthdate
#> 211207 Mississippi Highway Patrol             M       1992-07-14
#> 211208 Mississippi Highway Patrol             M       1975-12-23
#> 211209 Mississippi Highway Patrol             M       1998-02-02
#> 211210 Mississippi Highway Patrol             F       1970-06-14
#> 211211 Mississippi Highway Patrol             M       1948-03-11
#> 211212                       MSHP             F       1999-06-14
#>        driver_race
#> 211207       White
#> 211208       Black
#> 211209       White
#> 211210       White
#> 211211       White
#> 211212    Hispanic
#>                                                      violation_raw
#> 211207 Speeding - Regulated or posted speed limit and actual speed
#> 211208 Speeding - Regulated or posted speed limit and actual speed
#> 211209                     Seat belt not used properly as required
#> 211210       Expired or no non-commercial driver license or permit
#> 211211                     Seat belt not used properly as required
#> 211212                                                    Speeding
#>        officer_id
#> 211207       K025
#> 211208       C033
#> 211209       D014
#> 211210       C015
#> 211211       C015
#> 211212       ABCD

A convenient function to know about is na.omit(). It will remove all rows from a data frame that have at least one column with NA values.

Challenge

  • Given the following data frame:
dfr <- data.frame(col_1 = c(1:3), 
                  col_2 = c(NA, NA, "b"), 
                  col_3 = c(TRUE, NA, FALSE))

What would you expect the following commands to return?

nrow(dfr)
nrow(na.omit(dfr))

3.6 Categorical data: factors

When we did str(trafficstops) we saw that only one of the columns are numeric (county_fips), all the others are of a special class called a factor. Factors are very useful and are actually something that make R particularly well suited to working with data, so we’re going to spend a little time introducing them.

Factors are used to represent categorical data. Factors can be ordered or unordered, and understanding them is necessary for statistical analysis and for plotting.

Factors are stored as integers, and have labels (text) associated with these unique integers. While factors look (and often behave) like character vectors, they are actually integers under the hood, and you need to be careful when treating them like strings.

Once created, factors can only contain a pre-defined set of values, known as levels. By default, R always sorts levels in alphabetical order. For instance, if you have a factor with 2 levels:

party <- factor(c("republican", "democrat", "democrat", "republican"))

R will assign 1 to the level "democrat" and 2 to the level "republican" (because d comes before r, even though the first element in this vector is "republican"). You can check this by using the function levels(), and check the number of levels using nlevels():

levels(party)
nlevels(party)

Sometimes, the order of the factors does not matter, other times you might want to specify the order because it is meaningful (e.g., “low”, “medium”, “high”), it improves your visualization, or it is required by a particular type of analysis. Here, one way to reorder our levels in the party vector would be:

party # current order
#> [1] republican democrat   democrat   republican
#> Levels: democrat republican
party <- factor(party, levels = c("republican", "democrat"))
party # after re-ordering
#> [1] republican democrat   democrat   republican
#> Levels: republican democrat

In R’s memory, these factors are represented by integers (1, 2, 3), but are more informative than integers because factors are self describing: "democrat", "republican" is more descriptive than 1, 2. Which one is “republican”? You wouldn’t be able to tell just from the integer data. Factors, on the other hand, have this information built in. It is particularly helpful when there are many levels (like the county names in our example dataset).

3.6.1 Converting factors

If you need to convert a factor to a character vector, you use as.character(x).

as.character(party)

Converting factors where the levels appear as numbers (such as concentration levels, or years) to a numeric vector is a little trickier. One method is to convert factors to characters and then numbers. Another method is to use the levels() function. Compare:

f <- factor(c(1990, 1983, 1977, 1998, 1990))
as.numeric(f)               # wrong! and there is no warning...
as.numeric(as.character(f)) # works...
as.numeric(levels(f))[f]    # The recommended way.

Notice that in the levels() approach, three important steps occur:

  • We obtain all the factor levels using levels(f)
  • We convert these levels to numeric values using as.numeric(levels(f))
  • We then access these numeric values using the underlying integers of the vector f as indices inside the square brackets

3.6.2 Renaming factors

When your data is stored as a factor, you can use the plot() function to get a quick glance at the number of observations represented by each factor level. Let’s look at the number of blacks and whites in the dataset:

# bar plot of the number of black and white drivers stopped:
plot(trafficstops$driver_race)

There seem to be a number of individuals for which the race information hasn’t been recorded.

Additionally, for these individuals, there is no label to indicate that the information is missing. Let’s rename this label to something more meaningful. Before doing that, we’re going to pull out the data on race and work with that data, so we’re not modifying the working copy of the data frame:

race <- trafficstops$driver_race
head(race)
#> [1] Black Black Black White White White
#> Levels:  Black White
levels(race)
#> [1] ""      "Black" "White"
levels(race)[1] <- "Missing"
levels(race)
#> [1] "Missing" "Black"   "White"
head(race)
#> [1] Black Black Black White White White
#> Levels: Missing Black White

Challenge

  • Rename “Black” to “African American”.
  • Now that we have renamed the factor level to “Missing”, can you recreate the barplot such that “Missing” is last (to the right)?

3.6.3 Using stringsAsFactors=FALSE

By default, when building or importing a data frame with read.csv(), the columns that contain characters (i.e., text) are coerced (=converted) into the factor data type. Depending on what you want to do with the data, you may want to keep these columns as character. To do so, read.csv() and read.table() have an argument called stringsAsFactors which can be set to FALSE.

In most cases, it’s preferable to set stringsAsFactors = FALSE when importing your data, and converting as a factor only the columns that require this data type.

Compare the output of str(trafficstops) when setting stringsAsFactors = TRUE (default) and stringsAsFactors = FALSE:

# Compare the difference between when the data are being read as
# `factor`, and when they are being read as `character`.
trafficstops <- read.csv("data/MS_policing_bw.csv", stringsAsFactors = TRUE)
str(trafficstops)
trafficstops <- read.csv("data/MS_policing_bw.csv", stringsAsFactors = FALSE)
str(trafficstops)
# Convert the column "driver_race" into a factor
trafficstops$driver_race <- factor(trafficstops$driver_race)

Challenge

Can you predict the class for each of the columns in the following example? Check your guesses using str(country_climate): * Are they what you expected? Why? Why not? * What would have been different if we had added stringsAsFactors = FALSE to this call? * What would you need to change to ensure that each column had the accurate data type?

```
country_climate <- data.frame(
       country=c("Canada", "Panama", "South Africa", "Australia"),
       climate=c("cold", "hot", "temperate", "hot/temperate"),
       temperature=c(10, 30, 18, "15"),
       northern_hemisphere=c(TRUE, TRUE, FALSE, "FALSE"),
       has_kangaroo=c(FALSE, FALSE, FALSE, 1)
       )
```

The automatic conversion of data type is sometimes a blessing, sometimes an annoyance. Be aware that it exists, learn the rules, and double check that data you import in R are of the correct type within your data frame. If not, use it to your advantage to detect mistakes that might have been introduced during data entry (a letter in a column that should only contain numbers for instance).

3.7 Dates

One of the most common issues that new (and experienced!) R users have is converting date and time information into a variable that is appropriate and usable during analyses. If you have control over your data it might be useful to ensure that each component of your date is stored as a separate variable, i.e a separate column for day, month, and year. However, often we do not have control and the date is stored in one single column and with varying order and separating characters between its components.

Using str(), we can see that both dates in our data frame stop_date and driver_birthdate are each stored in one column.

str(trafficstops)

As an example for how to work with dates let us see if there are seasonal differences in the number of traffic stops.

We’re going to be using the ymd() function from the package lubridate. This function is designed to take a vector representing year, month, and day and convert that information to a POSIXct vector. POSIXct is a class of data recognized by R as being a date or date and time. The argument that the function requires is relatively flexible, but, as a best practice, is a character vector formatted as “YYYY-MM-DD”.

Start by loading the required package:

library(lubridate)
stop_date <- ymd(trafficstops$stop_date)
str(stop_date) # notice the 'date' class

The ymd function also has nicely taken care of the fact that the original format of the date column is a factor!

We can now easily extract year, month, and date using the respective functions: year(), month(), and day() like so:

plot(factor(year(stop_date))) #convert year to factor to plot

Challenge

  • Are there more stops in certain months of the year or certain days of the month?

Challenge

  • Determine the age of the driver in years (approximate) at the time of the stop:
  • Extract driver_birthdate into a vector birth_date
  • Create a new vector age with the driver’s age at the time of the stop in years
  • Coerce age to a factor and use the plot function to check your results. What do you find?