Chapter 3 Working with tabular data in R
- Load external data from a .csv file into a data frame in R with
- Find basic properties of a data frames including size, class or type of the columns, names of rows and columns by using
tail()to inspect rows of a data frame.
- Generate summary statistics for a data frame
- Use indexing to select rows and columns
- Use logical conditions to select rows and columns
- Add columns and rows to a data frame
- Manipulate categorical data with
- Change how character strings are handled in a data frame.
- Format dates in R and calculate time differences
df$new_col <- new_colto add a new column to a data frame.
cbind()to add a new column to a data frame.
rbind()to add a new row to a data frame.
na.omit()to remove rows from a data frame with
3.1 Loading tabular data
One the most common ways of getting data into R is to read in a table. And – you guessed it – we read it into a data frame! We will take a simple CSV file as example. What is a CSV file?
You may know about the Stanford Open Policing Project and we will be working with a sample dataset from their repository (https://openpolicing.stanford.edu/data/). The sample I extracted contains information about traffic stops for black and white drivers in the state of Mississippi during January 2013 to mid-July of 2016.
We are going to use the R function
download.file() to download the CSV file that contains the traffic stop data, and we will use
read.csv() to load into memory the content of the CSV file as an object of class
To download the data into your local
data/ subdirectory, run the following:
You are now ready to load the data:
trafficstops <- read.csv("data/MS_trafficstops_bw.csv")
This statement doesn’t produce any output because, as you might recall, assignments don’t display anything. If we want to check that our data has been loaded, we can print the variable’s value:
Wow… that was a lot of output. At least it means the data loaded properly. Let’s check the top (the first 6 lines) of this data frame using the function
#> id state stop_date county_name county_fips #> 1 MS-2013-00001 MS 2013-01-01 Jones County 28067 #> 2 MS-2013-00002 MS 2013-01-01 Lauderdale County 28075 #> 3 MS-2013-00003 MS 2013-01-01 Pike County 28113 #> 4 MS-2013-00004 MS 2013-01-01 Hancock County 28045 #> 5 MS-2013-00005 MS 2013-01-01 Holmes County 28051 #> 6 MS-2013-00006 MS 2013-01-01 Jackson County 28059 #> police_department driver_gender driver_birthdate driver_race #> 1 Mississippi Highway Patrol M 1950-06-14 Black #> 2 Mississippi Highway Patrol M 1967-04-06 Black #> 3 Mississippi Highway Patrol M 1974-04-15 Black #> 4 Mississippi Highway Patrol M 1981-03-23 White #> 5 Mississippi Highway Patrol M 1992-08-03 White #> 6 Mississippi Highway Patrol F 1960-05-02 White #> violation_raw officer_id #> 1 Seat belt not used properly as required J042 #> 2 Careless driving B026 #> 3 Speeding - Regulated or posted speed limit and actual speed M009 #> 4 Speeding - Regulated or posted speed limit and actual speed K035 #> 5 Speeding - Regulated or posted speed limit and actual speed D028 #> 6 Speeding - Regulated or posted speed limit and actual speed K023
As you may recall, a data frame in R is a special case of a list, and a representation of data where the columns are vectors that all have the same length. Because the columns are vectors, they all contain the same type of data (e.g., characters, integers, factors, etc.).
We can see this when inspecting the structure of a data frame with the function
#> 'data.frame': 211211 obs. of 11 variables: #> $ id : Factor w/ 211211 levels "MS-2013-00001",..: 1 2 3 4 5 6 7 8 9 10 ... #> $ state : Factor w/ 1 level "MS": 1 1 1 1 1 1 1 1 1 1 ... #> $ stop_date : Factor w/ 1288 levels "2013-01-01","2013-01-02",..: 1 1 1 1 1 1 1 1 1 1 ... #> $ county_name : Factor w/ 82 levels "Adams County",..: 34 38 57 23 26 30 30 22 26 26 ... #> $ county_fips : int 28067 28075 28113 28045 28051 28059 28059 28043 28051 28051 ... #> $ police_department: Factor w/ 1 level "Mississippi Highway Patrol": 1 1 1 1 1 1 1 1 1 1 ... #> $ driver_gender : Factor w/ 3 levels "","F","M": 3 3 3 3 3 2 2 2 3 3 ... #> $ driver_birthdate : Factor w/ 21423 levels "","1930-01-11",..: 3558 9575 12137 14670 18820 7061 4504 19135 2755 15878 ... #> $ driver_race : Factor w/ 3 levels "","Black","White": 2 2 2 3 3 3 3 3 3 3 ... #> $ violation_raw : Factor w/ 19 levels "??","Careless driving",..: 17 2 19 19 19 19 19 19 19 19 ... #> $ officer_id : Factor w/ 897 levels "","A003","A004",..: 519 52 635 560 212 550 559 205 723 723 ...
We already saw how the functions
str() can be useful to check the content and the structure of a data frame. Here is a non-exhaustive list of functions to get a sense of the content/structure of the data. Let’s try them out!
dim(trafficstops)- returns a vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)
nrow(trafficstops)- returns the number of rows
ncol(trafficstops)- returns the number of columns
length(trafficstops)- returns number of columns
head(trafficstops)- shows the first 6 rows
tail(trafficstops)- shows the last 6 rows
names(trafficstops)- returns the column names (synonym of
rownames(trafficstops)- returns the row names
str(trafficstops)- structure of the object and information about the class, length and content of each column
summary(trafficstops)- summary statistics for each column
Note: most of these functions are “generic”, they can be used on other types of objects besides
Based on the output of
str(trafficstops), can you answer the following questions?
- What is the class of the object
- How many rows and how many columns are in this object?
- How many counties have been recorded in this dataset?
3.3 Indexing and subsetting data frames
Our trafficstops data frame has rows and columns (it has 2 dimensions), if we want to extract some specific data from it, we need to specify the “coordinates” we want from it. Row numbers come first, followed by column numbers. However, note that different ways of specifying these coordinates lead to results with different classes.
trafficstops[1, 1] # first element in the first column of the data frame (as a vector) trafficstops[1, 6] # first element in the 6th column (as a vector) trafficstops[, 1] # first column in the data frame (as a vector) trafficstops # first column in the data frame (as a data.frame) trafficstops[1:3, 7] # first three elements in the 7th column (as a vector) trafficstops[3, ] # the 3rd row (as a data.frame) trafficstops[1:6, ] # the 1st to 6th rows, equivalent to head(trafficstops) trafficstops[, -1] # the whole data frame, excluding the first column trafficstops[-c(7:211211),] # equivalent to head(trafficstops)
As well as using numeric values to subset a
matrix), columns can be called by name, using one of the four following notations:
trafficstops["violation_raw"] # Result is a data.frame trafficstops[, "violation_raw"] # Result is a vector trafficstops[["violation_raw"]] # Result is a vector trafficstops$violation_raw # Result is a vector
For our purposes, the last three notations are equivalent. RStudio knows about the columns in your data frame, so you can take advantage of the autocompletion feature to get the full and correct column name.
trafficstops_200) containing only the observations from row 200 of the
nrow()gave you the number of rows in a
- Use that number to pull out just that last row in the data frame.
- Compare that with what you see as the last row using
tail()to make sure it’s meeting expectations.
- Pull out that last row using
nrow()instead of the row number.
- Create a new data frame object (
trafficstops_last) from that last row.
nrow()to extract the row that is in the middle of the data frame. Store the content of this row in an object named
-notation above to reproduce the behavior of
head(trafficstops)keeping just the first through 6th rows of the trafficstops dataset.
3.4 Conditional subsetting
Often times we need to extract a subset of a data frame based on certain conditions. For example, if we wanted to look at traffic stops in Webster County only we could say:
# the condition: # returns a logical vector of the length of the column trafficstops$county_name == "Webster County" # use this vector to extract rows and all columns # note the comma: we want *all* columns trafficstops[trafficstops$county_name == "Webster County", ] # assign extract to a new data frame Webster_trafficstops <- trafficstops[trafficstops$county_name == "Webster County", ]
This is also a possibility (but slower):
Webster_trafficstops <- subset(trafficstops, county_name == "Webster County") nrow(Webster_trafficstops) # 393 stops in Webster County!
#>  156
# and if we wanted to see the breakdown by race: table(Webster_trafficstops$driver_race)
#> #> Black White #> 0 59 97
These commands are from the R base package. In the R Data Wrangling workshop we will discuss a different way of subsetting using functions from the
- Use subsetting to extract trafficstops in Hancock, Harrison, and Jackson Counties into a separate data frame
coastal_counties, count the number of Black and White drivers in the three counties.
- Bonus: How does the ratio of Black to White stops in the three coastal counties compare to the same ratio for stops in the entire state of Mississippi?
3.5 Adding and removing rows and columns
To add a new column to the data frame we can use the
new_col <- row.names(trafficstops) trafficstops_withnewcol <- cbind(trafficstops, new_col) head(trafficstops_withnewcol)
#> id state stop_date county_name county_fips #> 1 MS-2013-00001 MS 2013-01-01 Jones County 28067 #> 2 MS-2013-00002 MS 2013-01-01 Lauderdale County 28075 #> 3 MS-2013-00003 MS 2013-01-01 Pike County 28113 #> 4 MS-2013-00004 MS 2013-01-01 Hancock County 28045 #> 5 MS-2013-00005 MS 2013-01-01 Holmes County 28051 #> 6 MS-2013-00006 MS 2013-01-01 Jackson County 28059 #> police_department driver_gender driver_birthdate driver_race #> 1 Mississippi Highway Patrol M 1950-06-14 Black #> 2 Mississippi Highway Patrol M 1967-04-06 Black #> 3 Mississippi Highway Patrol M 1974-04-15 Black #> 4 Mississippi Highway Patrol M 1981-03-23 White #> 5 Mississippi Highway Patrol M 1992-08-03 White #> 6 Mississippi Highway Patrol F 1960-05-02 White #> violation_raw officer_id #> 1 Seat belt not used properly as required J042 #> 2 Careless driving B026 #> 3 Speeding - Regulated or posted speed limit and actual speed M009 #> 4 Speeding - Regulated or posted speed limit and actual speed K035 #> 5 Speeding - Regulated or posted speed limit and actual speed D028 #> 6 Speeding - Regulated or posted speed limit and actual speed K023 #> new_col #> 1 1 #> 2 2 #> 3 3 #> 4 4 #> 5 5 #> 6 6
Alternatively, we can also add a new column adding the new column name after the
$ sign then assigning the value, like below. Note that this will change the original data frame, which you may not always want to do.
trafficstops$row_numbers <- c(1:nrow(trafficstops)) trafficstops$all_false <- FALSE # what do you think will happen here?
There is an equivalent function,
rbind() to add a new row to a data frame. I use this far less frequently than the column equivalent. The one thing to keep in mind is that the row to be added to the data frame needs to match the order and type of columns in the data frame. Remember that R’s way to store multiple different data types in one object is a
list. So if we wanted to add a new row to
trafficstops we would say:
new_row <- data.frame(id="MS-2017-12345", state="MS", stop_date="2017-08-24", county_name="Tallahatchie County", county_fips=12345, police_department="MSHP", driver_gender="F", driver_birthdate="1999-06-14", driver_race="Hispanic", violation_raw="Speeding", officer_id="ABCD") trafficstops_withnewrow <- rbind(trafficstops, new_row) tail(trafficstops_withnewrow)
#> id state stop_date county_name county_fips #> 211207 MS-2016-24293 MS 2016-07-09 George County 28039 #> 211208 MS-2016-24294 MS 2016-07-10 Copiah County 28029 #> 211209 MS-2016-24295 MS 2016-07-11 Grenada County 28043 #> 211210 MS-2016-24296 MS 2016-07-14 Copiah County 28029 #> 211211 MS-2016-24297 MS 2016-07-14 Copiah County 28029 #> 211212 MS-2017-12345 MS 2017-08-24 Tallahatchie County 12345 #> police_department driver_gender driver_birthdate #> 211207 Mississippi Highway Patrol M 1992-07-14 #> 211208 Mississippi Highway Patrol M 1975-12-23 #> 211209 Mississippi Highway Patrol M 1998-02-02 #> 211210 Mississippi Highway Patrol F 1970-06-14 #> 211211 Mississippi Highway Patrol M 1948-03-11 #> 211212 MSHP F 1999-06-14 #> driver_race #> 211207 White #> 211208 Black #> 211209 White #> 211210 White #> 211211 White #> 211212 Hispanic #> violation_raw #> 211207 Speeding - Regulated or posted speed limit and actual speed #> 211208 Speeding - Regulated or posted speed limit and actual speed #> 211209 Seat belt not used properly as required #> 211210 Expired or no non-commercial driver license or permit #> 211211 Seat belt not used properly as required #> 211212 Speeding #> officer_id #> 211207 K025 #> 211208 C033 #> 211209 D014 #> 211210 C015 #> 211211 C015 #> 211212 ABCD
A convenient function to know about is
na.omit(). It will remove all rows from a data frame that have at least one column with
- Given the following data frame:
dfr <- data.frame(col_1 = c(1:3), col_2 = c(NA, NA, "b"), col_3 = c(TRUE, NA, FALSE))
What would you expect the following commands to return?
3.6 Categorical data: factors
When we did
str(trafficstops) we saw that only one of the columns are numeric (
county_fips), all the others are of a special class called a
factor. Factors are very useful and are actually something that make R particularly well suited to working with data, so we’re going to spend a little time introducing them.
Factors are used to represent categorical data. Factors can be ordered or unordered, and understanding them is necessary for statistical analysis and for plotting.
Factors are stored as integers, and have labels (text) associated with these unique integers. While factors look (and often behave) like character vectors, they are actually integers under the hood, and you need to be careful when treating them like strings.
Once created, factors can only contain a pre-defined set of values, known as levels. By default, R always sorts levels in alphabetical order. For instance, if you have a factor with 2 levels:
party <- factor(c("republican", "democrat", "democrat", "republican"))
R will assign
1 to the level
2 to the level
d comes before
r, even though the first element in this vector is
"republican"). You can check this by using the function
levels(), and check the number of levels using
Sometimes, the order of the factors does not matter, other times you might want to specify the order because it is meaningful (e.g., “low”, “medium”, “high”), it improves your visualization, or it is required by a particular type of analysis. Here, one way to reorder our levels in the
party vector would be:
party # current order
#>  republican democrat democrat republican #> Levels: democrat republican
party <- factor(party, levels = c("republican", "democrat")) party # after re-ordering
#>  republican democrat democrat republican #> Levels: republican democrat
In R’s memory, these factors are represented by integers (1, 2, 3), but are more informative than integers because factors are self describing:
"republican" is more descriptive than
2. Which one is “republican”? You wouldn’t be able to tell just from the integer data. Factors, on the other hand, have this information built in. It is particularly helpful when there are many levels (like the county names in our example dataset).
3.6.1 Converting factors
If you need to convert a factor to a character vector, you use
Converting factors where the levels appear as numbers (such as concentration levels, or years) to a numeric vector is a little trickier. One method is to convert factors to characters and then numbers. Another method is to use the
levels() function. Compare:
f <- factor(c(1990, 1983, 1977, 1998, 1990)) as.numeric(f) # wrong! and there is no warning... as.numeric(as.character(f)) # works... as.numeric(levels(f))[f] # The recommended way.
Notice that in the
levels() approach, three important steps occur:
- We obtain all the factor levels using
- We convert these levels to numeric values using
- We then access these numeric values using the underlying integers of the vector
fas indices inside the square brackets
3.6.2 Renaming factors
When your data is stored as a factor, you can use the
plot() function to get a quick glance at the number of observations represented by each factor level. Let’s look at the number of blacks and whites in the dataset:
# bar plot of the number of black and white drivers stopped: plot(trafficstops$driver_race)
There seem to be a number of individuals for which the race information hasn’t been recorded.
Additionally, for these individuals, there is no label to indicate that the information is missing. Let’s rename this label to something more meaningful. Before doing that, we’re going to pull out the data on race and work with that data, so we’re not modifying the working copy of the data frame:
race <- trafficstops$driver_race head(race)
#>  Black Black Black White White White #> Levels: Black White
#>  "" "Black" "White"
levels(race) <- "Missing" levels(race)
#>  "Missing" "Black" "White"
#>  Black Black Black White White White #> Levels: Missing Black White
- Rename “Black” to “African American”.
- Now that we have renamed the factor level to “Missing”, can you recreate the barplot such that “Missing” is last (to the right)?
By default, when building or importing a data frame with
read.csv(), the columns that contain characters (i.e., text) are coerced (=converted) into the
factor data type. Depending on what you want to do with the data, you may want to keep these columns as
character. To do so,
read.table() have an argument called
stringsAsFactors which can be set to
In most cases, it’s preferable to set
stringsAsFactors = FALSE when importing your data, and converting as a factor only the columns that require this data type.
Compare the output of
str(trafficstops) when setting
stringsAsFactors = TRUE (default) and
stringsAsFactors = FALSE:
# Compare the difference between when the data are being read as # `factor`, and when they are being read as `character`. trafficstops <- read.csv("data/MS_policing_bw.csv", stringsAsFactors = TRUE) str(trafficstops) trafficstops <- read.csv("data/MS_policing_bw.csv", stringsAsFactors = FALSE) str(trafficstops) # Convert the column "driver_race" into a factor trafficstops$driver_race <- factor(trafficstops$driver_race)
Can you predict the class for each of the columns in the following example? Check your guesses using
str(country_climate): * Are they what you expected? Why? Why not? * What would have been different if we had added
stringsAsFactors = FALSEto this call? * What would you need to change to ensure that each column had the accurate data type?
``` country_climate <- data.frame( country=c("Canada", "Panama", "South Africa", "Australia"), climate=c("cold", "hot", "temperate", "hot/temperate"), temperature=c(10, 30, 18, "15"), northern_hemisphere=c(TRUE, TRUE, FALSE, "FALSE"), has_kangaroo=c(FALSE, FALSE, FALSE, 1) ) ```
The automatic conversion of data type is sometimes a blessing, sometimes an annoyance. Be aware that it exists, learn the rules, and double check that data you import in R are of the correct type within your data frame. If not, use it to your advantage to detect mistakes that might have been introduced during data entry (a letter in a column that should only contain numbers for instance).
One of the most common issues that new (and experienced!) R users have is converting date and time information into a variable that is appropriate and usable during analyses. If you have control over your data it might be useful to ensure that each component of your date is stored as a separate variable, i.e a separate column for day, month, and year. However, often we do not have control and the date is stored in one single column and with varying order and separating characters between its components.
str(), we can see that both dates in our data frame
driver_birthdate are each stored in one column.
As an example for how to work with dates let us see if there are seasonal differences in the number of traffic stops.
We’re going to be using the
ymd() function from the package
lubridate. This function is designed to take a vector representing year, month, and day and convert that information to a POSIXct vector. POSIXct is a class of data recognized by R as being a date or date and time. The argument that the function requires is relatively flexible, but, as a best practice, is a character vector formatted as “YYYY-MM-DD”.
Start by loading the required package:
stop_date <- ymd(trafficstops$stop_date) str(stop_date) # notice the 'date' class
ymd function also has nicely taken care of the fact that the original format of the date column is a factor!
We can now easily extract year, month, and date using the respective functions:
day() like so:
plot(factor(year(stop_date))) #convert year to factor to plot
- Are there more stops in certain months of the year or certain days of the month?
- Determine the age of the driver in years (approximate) at the time of the stop:
driver_birthdateinto a vector
- Create a new vector
agewith the driver’s age at the time of the stop in years
ageto a factor and use the
plotfunction to check your results. What do you find?