Chapter 3 Working with tabular data in R
Learning Objectives
- Load external data from a .csv file into a data frame in R with
read.csv()
- Find basic properties of a data frames including size, class or type of the columns, names of rows and columns by using
str()
,nrow()
,ncol()
,dim()
,length()
,colnames()
,rownames()
- Use
head()
andtail()
to inspect rows of a data frame.- Generate summary statistics for a data frame
- Use indexing to select rows and columns
- Use logical conditions to select rows and columns
- Add columns and rows to a data frame
- Manipulate categorical data with
factors
,levels()
andas.character()
- Change how character strings are handled in a data frame.
- Format dates in R and calculate time differences
- Use
df$new_col <- new_col
to add a new column to a data frame.- Use
cbind()
to add a new column to a data frame.- Use
rbind()
to add a new row to a data frame.- Use
na.omit()
to remove rows from a data frame withNA
values.
## Loading tabular data
One the most common ways of getting data into R is to read in a table. And – you guessed it – we read it into a data frame! We will take a simple CSV file as example. What is a CSV file?
You may know about the Stanford Open Policing Project and we will be working with a sample dataset from their repository (https://openpolicing.stanford.edu/data/). The sample I extracted contains information about traffic stops for black and white drivers in the state of Mississippi during January 2013 to mid-July of 2016.
We are going to use the R function download.file()
to download the CSV file
that contains the traffic stop data, and we will use read.csv()
to
load into memory the content of the CSV file as an object of class data.frame
.
To download the data into your local data/
subdirectory, run the following:
You are now ready to load the data:
This statement doesn’t produce any output because, as you might recall,
assignments don’t display anything. If we want to check that our data has been
loaded, we can print the variable’s value: trafficstops
.
Wow… that was a lot of output. At least it means the data loaded
properly. Let’s check the top (the first 6 lines) of this data frame using the
function head()
:
#> id state stop_date county_name county_fips
#> 1 MS-2013-00001 MS 2013-01-01 Jones County 28067
#> 2 MS-2013-00002 MS 2013-01-01 Lauderdale County 28075
#> 3 MS-2013-00003 MS 2013-01-01 Pike County 28113
#> 4 MS-2013-00004 MS 2013-01-01 Hancock County 28045
#> 5 MS-2013-00005 MS 2013-01-01 Holmes County 28051
#> 6 MS-2013-00006 MS 2013-01-01 Jackson County 28059
#> police_department driver_gender driver_birthdate driver_race
#> 1 Mississippi Highway Patrol M 1950-06-14 Black
#> 2 Mississippi Highway Patrol M 1967-04-06 Black
#> 3 Mississippi Highway Patrol M 1974-04-15 Black
#> 4 Mississippi Highway Patrol M 1981-03-23 White
#> 5 Mississippi Highway Patrol M 1992-08-03 White
#> 6 Mississippi Highway Patrol F 1960-05-02 White
#> violation_raw officer_id
#> 1 Seat belt not used properly as required J042
#> 2 Careless driving B026
#> 3 Speeding - Regulated or posted speed limit and actual speed M009
#> 4 Speeding - Regulated or posted speed limit and actual speed K035
#> 5 Speeding - Regulated or posted speed limit and actual speed D028
#> 6 Speeding - Regulated or posted speed limit and actual speed K023
3.1 Inspecting data.frame
Objects
As you may recall, a data frame in R is a special case of a list, and a representation of data where the columns are vectors that all have the same length. Because the columns are vectors, they all contain the same type of data (e.g., characters, integers, factors, etc.).
We can see this when inspecting the structure of a data frame
with the function str()
:
#> 'data.frame': 211211 obs. of 11 variables:
#> $ id : chr "MS-2013-00001" "MS-2013-00002" "MS-2013-00003" "MS-2013-00004" ...
#> $ state : chr "MS" "MS" "MS" "MS" ...
#> $ stop_date : chr "2013-01-01" "2013-01-01" "2013-01-01" "2013-01-01" ...
#> $ county_name : chr "Jones County" "Lauderdale County" "Pike County" "Hancock County" ...
#> $ county_fips : int 28067 28075 28113 28045 28051 28059 28059 28043 28051 28051 ...
#> $ police_department: chr "Mississippi Highway Patrol" "Mississippi Highway Patrol" "Mississippi Highway Patrol" "Mississippi Highway Patrol" ...
#> $ driver_gender : chr "M" "M" "M" "M" ...
#> $ driver_birthdate : chr "1950-06-14" "1967-04-06" "1974-04-15" "1981-03-23" ...
#> $ driver_race : chr "Black" "Black" "Black" "White" ...
#> $ violation_raw : chr "Seat belt not used properly as required" "Careless driving" "Speeding - Regulated or posted speed limit and actual speed" "Speeding - Regulated or posted speed limit and actual speed" ...
#> $ officer_id : chr "J042" "B026" "M009" "K035" ...
We already saw how the functions head()
and str()
can be useful to check the
content and the structure of a data frame. Here is a non-exhaustive list of
functions to get a sense of the content/structure of the data. Let’s try them out!
- Size:
dim(trafficstops)
- returns a vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)nrow(trafficstops)
- returns the number of rowsncol(trafficstops)
- returns the number of columnslength(trafficstops)
- returns number of columns
- Content:
head(trafficstops)
- shows the first 6 rowstail(trafficstops)
- shows the last 6 rows
- Names:
names(trafficstops)
- returns the column names (synonym ofcolnames()
fordata.frame
objects)rownames(trafficstops)
- returns the row names
- Summary:
str(trafficstops)
- structure of the object and information about the class, length and content of each columnsummary(trafficstops)
- summary statistics for each column
Note: most of these functions are “generic”, they can be used on other types of
objects besides data.frame
.
Challenge
Based on the output of
str(trafficstops)
, can you answer the following questions?
- What is the class of the object
trafficstops
?- How many rows and how many columns are in this object?
- How many counties have been recorded in this dataset?
3.2 Indexing and subsetting data frames
Our trafficstops data frame has rows and columns (it has 2 dimensions), if we want to extract some specific data from it, we need to specify the “coordinates” we want from it. Row numbers come first, followed by column numbers. However, note that different ways of specifying these coordinates lead to results with different classes.
trafficstops[1, 1] # first element in the first column of the data frame (as a vector)
trafficstops[1, 6] # first element in the 6th column (as a vector)
trafficstops[, 1] # first column in the data frame (as a vector)
trafficstops[1] # first column in the data frame (as a data.frame)
trafficstops[1:3, 7] # first three elements in the 7th column (as a vector)
trafficstops[3, ] # the 3rd row (as a data.frame)
trafficstops[1:6, ] # the 1st to 6th rows, equivalent to head(trafficstops)
trafficstops[, -1] # the whole data frame, excluding the first column
trafficstops[-c(7:211211),] # equivalent to head(trafficstops)
As well as using numeric values to subset a data.frame
(or matrix
), columns
can be called by name, using one of the four following notations:
trafficstops["violation_raw"] # Result is a data.frame
trafficstops[, "violation_raw"] # Result is a vector
trafficstops[["violation_raw"]] # Result is a vector
trafficstops$violation_raw # Result is a vector
For our purposes, the last three notations are equivalent. RStudio knows about the columns in your data frame, so you can take advantage of the autocompletion feature to get the full and correct column name.
Challenge
Create a
data.frame
(trafficstops_200
) containing only the observations from row 200 of thetrafficstops
dataset.Notice how
nrow()
gave you the number of rows in adata.frame
?
- Use that number to pull out just that last row in the data frame.
- Compare that with what you see as the last row using
tail()
to make sure it’s meeting expectations.- Pull out that last row using
nrow()
instead of the row number.- Create a new data frame object (
trafficstops_last
) from that last row.Use
nrow()
to extract the row that is in the middle of the data frame. Store the content of this row in an object namedtrafficstops_middle
.Combine
nrow()
with the-
notation above to reproduce the behavior ofhead(trafficstops)
keeping just the first through 6th rows of the trafficstops dataset.
## Conditional subsetting
Often times we need to extract a subset of a data frame based on certain conditions. For example, if we wanted to look at traffic stops in Webster County only we could say:
# the condition:
# returns a logical vector of the length of the column
trafficstops$county_name == "Webster County"
# use this vector to extract rows and all columns
# note the comma: we want *all* columns
trafficstops[trafficstops$county_name == "Webster County", ]
# assign extract to a new data frame
Webster_trafficstops <- trafficstops[trafficstops$county_name == "Webster County", ]
This is also a possibility (but slower):
Webster_trafficstops <- subset(trafficstops, county_name == "Webster County")
nrow(Webster_trafficstops) # 393 stops in Webster County!
#> [1] 156
#>
#> Black White
#> 59 97
These commands are from the R base package. In the R Data Wrangling workshop we will discuss a different way of subsetting using functions from the tidyverse
package.
Challenge
- Use subsetting to extract trafficstops in Hancock, Harrison, and Jackson Counties into a separate data frame
coastal_counties
.- Using
coastal_counties
, count the number of Black and White drivers in the three counties.- Bonus: How does the ratio of Black to White stops in the three coastal counties compare to the same ratio for stops in the entire state of Mississippi?
## Adding and removing rows and columns
To add a new column to the data frame we can use the cbind()
function.
new_col <- row.names(trafficstops)
trafficstops_withnewcol <- cbind(trafficstops, new_col)
head(trafficstops_withnewcol)
#> id state stop_date county_name county_fips
#> 1 MS-2013-00001 MS 2013-01-01 Jones County 28067
#> 2 MS-2013-00002 MS 2013-01-01 Lauderdale County 28075
#> 3 MS-2013-00003 MS 2013-01-01 Pike County 28113
#> 4 MS-2013-00004 MS 2013-01-01 Hancock County 28045
#> 5 MS-2013-00005 MS 2013-01-01 Holmes County 28051
#> 6 MS-2013-00006 MS 2013-01-01 Jackson County 28059
#> police_department driver_gender driver_birthdate driver_race
#> 1 Mississippi Highway Patrol M 1950-06-14 Black
#> 2 Mississippi Highway Patrol M 1967-04-06 Black
#> 3 Mississippi Highway Patrol M 1974-04-15 Black
#> 4 Mississippi Highway Patrol M 1981-03-23 White
#> 5 Mississippi Highway Patrol M 1992-08-03 White
#> 6 Mississippi Highway Patrol F 1960-05-02 White
#> violation_raw officer_id
#> 1 Seat belt not used properly as required J042
#> 2 Careless driving B026
#> 3 Speeding - Regulated or posted speed limit and actual speed M009
#> 4 Speeding - Regulated or posted speed limit and actual speed K035
#> 5 Speeding - Regulated or posted speed limit and actual speed D028
#> 6 Speeding - Regulated or posted speed limit and actual speed K023
#> new_col
#> 1 1
#> 2 2
#> 3 3
#> 4 4
#> 5 5
#> 6 6
Alternatively, we can also add a new column adding the new column name after the $
sign then assigning the value, like below. Note that this will change the original data frame, which you may not always want to do.
trafficstops$row_numbers <- c(1:nrow(trafficstops))
trafficstops$all_false <- FALSE # what do you think will happen here?
There is an equivalent function, rbind()
to add a new row to a data frame. I use this far less frequently than the column equivalent. The one thing to keep in mind is that the row to be added to the data frame needs to match the order and type of columns in the data frame. Remember that R’s way to store multiple different data types in one object is a list
. So if we wanted to add a new row to trafficstops
we would say:
new_row <- data.frame(id="MS-2017-12345", state="MS", stop_date="2017-08-24",
county_name="Tallahatchie County", county_fips=12345,
police_department="MSHP", driver_gender="F", driver_birthdate="1999-06-14",
driver_race="Hispanic", violation_raw="Speeding", officer_id="ABCD")
trafficstops_withnewrow <- rbind(trafficstops, new_row)
tail(trafficstops_withnewrow)
#> id state stop_date county_name county_fips
#> 211207 MS-2016-24293 MS 2016-07-09 George County 28039
#> 211208 MS-2016-24294 MS 2016-07-10 Copiah County 28029
#> 211209 MS-2016-24295 MS 2016-07-11 Grenada County 28043
#> 211210 MS-2016-24296 MS 2016-07-14 Copiah County 28029
#> 211211 MS-2016-24297 MS 2016-07-14 Copiah County 28029
#> 211212 MS-2017-12345 MS 2017-08-24 Tallahatchie County 12345
#> police_department driver_gender driver_birthdate driver_race
#> 211207 Mississippi Highway Patrol M 1992-07-14 White
#> 211208 Mississippi Highway Patrol M 1975-12-23 Black
#> 211209 Mississippi Highway Patrol M 1998-02-02 White
#> 211210 Mississippi Highway Patrol F 1970-06-14 White
#> 211211 Mississippi Highway Patrol M 1948-03-11 White
#> 211212 MSHP F 1999-06-14 Hispanic
#> violation_raw officer_id
#> 211207 Speeding - Regulated or posted speed limit and actual speed K025
#> 211208 Speeding - Regulated or posted speed limit and actual speed C033
#> 211209 Seat belt not used properly as required D014
#> 211210 Expired or no non-commercial driver license or permit C015
#> 211211 Seat belt not used properly as required C015
#> 211212 Speeding ABCD
A convenient function to know about is na.omit()
. It will remove all rows from a data frame that have at least one column with NA
values.
Challenge
- Given the following data frame:
What would you expect the following commands to return?
3.3 Categorical data: factors
Factors are very useful and are actually something that make R particularly well suited to working with data, so we’re going to spend a little time introducing them.
Factors are used to represent categorical data. Factors can be ordered or unordered, and understanding them is necessary for statistical analysis and for plotting.
Factors are stored as integers, and have labels (text) associated with these unique integers. While factors look (and often behave) like character vectors, they are actually integers under the hood, and you need to be careful when treating them like strings.
Once created, factors can only contain a pre-defined set of values, known as levels. By default, R always sorts levels in alphabetical order. For instance, if you have a factor with 2 levels:
R will assign 1
to the level "democrat"
and 2
to the level "republican"
(because
d
comes before r
, even though the first element in this vector is
"republican"
). You can check this by using the function levels()
, and check the
number of levels using nlevels()
:
Sometimes, the order of the factors does not matter, other times you might want
to specify the order because it is meaningful (e.g., “low”, “medium”, “high”),
it improves your visualization, or it is required by a particular type of
analysis. Here, one way to reorder our levels in the party
vector would be:
#> [1] republican democrat democrat republican
#> Levels: democrat republican
#> [1] republican democrat democrat republican
#> Levels: republican democrat
In R’s memory, these factors are represented by integers (1, 2, 3), but are more
informative than integers because factors are self describing: "democrat"
,
"republican"
is more descriptive than 1
, 2
. Which one is “republican”? You wouldn’t
be able to tell just from the integer data. Factors, on the other hand, have
this information built in. It is particularly helpful when there are many levels
(like the county names in our example dataset).
3.3.1 Converting factors
If you need to convert a factor to a character vector, you use
as.character(x)
.
Converting factors where the levels appear as numbers (such as concentration
levels, or years) to a numeric vector is a little trickier. One method is to
convert factors to characters and then numbers. Another method is to use the
levels()
function. Compare:
f <- factor(c(1990, 1983, 1977, 1998, 1990))
as.numeric(f) # wrong! and there is no warning...
as.numeric(as.character(f)) # works...
as.numeric(levels(f))[f] # The recommended way.
Notice that in the levels()
approach, three important steps occur:
- We obtain all the factor levels using
levels(f)
- We convert these levels to numeric values using
as.numeric(levels(f))
- We then access these numeric values using the underlying integers of the vector
f
as indices inside the square brackets
3.3.2 Renaming factors
When your data is stored as a factor, you can use the plot()
function to get a
quick glance at the number of observations represented by each factor
level. Let’s look at the number of blacks and whites in the dataset:
# Coerce the column "driver_race" into a factor
trafficstops$driver_race <- factor(trafficstops$driver_race)
# bar plot of the number of black and white drivers stopped:
trafficstops$driver_race <- as.factor(trafficstops$driver_race)
plot(trafficstops$driver_race)
There seem to be a number of individuals for which the race information hasn’t been recorded.
Additionally, for these individuals, there is no label to indicate that the information is missing. Let’s rename this label to something more meaningful. Before doing that, we’re going to pull out the data on race and work with that data, so we’re not modifying the working copy of the data frame:
#> [1] Black Black Black White White White
#> Levels: Black White
#> [1] "" "Black" "White"
#> [1] "Missing" "Black" "White"
#> [1] Black Black Black White White White
#> Levels: Missing Black White
Challenge
- Rename “Black” to “African American”.
- Now that we have renamed the factor level to “Missing”, can you recreate the barplot such that “Missing” is last (to the right)?
3.3.3 Using stringsAsFactors
By default, when building or importing a data frame with read.csv()
, the columns that contain characters (i.e., text) are read as such. (This was introduced in R Version 4.) Depending on what you intend to do with the data, you may want to keep these
columns as character
or convert to factor
. To do so, read.csv()
and read.table()
have an
argument called stringsAsFactors
which can be set to TRUE
.
# Compare the difference between when the data are being read as
# `factor`, and when they are being read as `character`.
trafficstops <- read.csv("data/MS_policing_bw.csv", stringsAsFactors = TRUE)
str(trafficstops)
trafficstops <- read.csv("data/MS_policing_bw.csv", stringsAsFactors = FALSE) # this is now the default.
str(trafficstops)
Challenge
Can you predict the class for each of the columns in the following example? Check your guesses using
str(country_climate)
: * Are they what you expected? Why? Why not? * What would have been different if we had addedstringsAsFactors = FALSE
to this call? * What would you need to change to ensure that each column had the accurate data type?``` country_climate <- data.frame( country=c("Canada", "Panama", "South Africa", "Australia"), climate=c("cold", "hot", "temperate", "hot/temperate"), temperature=c(10, 30, 18, "15"), northern_hemisphere=c(TRUE, TRUE, FALSE, "FALSE"), has_kangaroo=c(FALSE, FALSE, FALSE, 1) ) ```
The automatic conversion of data type is sometimes a blessing, sometimes an annoyance. Be aware that it exists, learn the rules, and double check that data you import in R are of the correct type within your data frame. If not, use it to your advantage to detect mistakes that might have been introduced during data entry (a letter in a column that should only contain numbers for instance).
3.4 Dates
One of the most common issues that new (and experienced!) R users have is converting date and time information into a variable that is appropriate and usable during analyses. If you have control over your data it might be useful to ensure that each component of your date is stored as a separate variable, i.e a separate column for day, month, and year. However, often we do not have control and the date is stored in one single column and with varying order and separating characters between its components.
Using str()
, we can see that both dates in our data frame stop_date
and driver_birthdate
are each stored in one column.
As an example for how to work with dates let us see if there are seasonal differences in the number of traffic stops.
We’re going to be using the ymd()
function from the package lubridate
. This
function is designed to take a vector representing year, month, and day and convert
that information to a POSIXct vector. POSIXct is a class of data recognized by R as
being a date or date and time. The argument that the function requires is relatively
flexible, but, as a best practice, is a character vector formatted as “YYYY-MM-DD”.
Start by loading the required package:
The ymd
function also has nicely taken care of the fact that the original format of the date column is a factor!
We can now easily extract year, month, and date using the respective functions: year()
, month()
, and day()
like so:
Challenge
- Are there more stops in certain months of the year or certain days of the month?
Challenge
- Determine the age of the driver in years (approximate) at the time of the stop:
- Extract
driver_birthdate
into a vectorbirth_date
- Create a new vector
age
with the driver’s age at the time of the stop in years- Coerce
age
to a factor and use theplot
function to check your results. What do you find?