Chapter 1 Preparing Textual Data

Learning Objectives

  • read textual data into R using readtext
  • use the stringr package to prepare strings for processing
  • use tidytext functions to tokenize texts and remove stopwords
  • use SnowballC to stem words

We’ll use several R packages in this section:

  • sotu will provide the metadata and text of State of the Union speeches ranging from George Washington to Barack Obama.
  • tidyverse is a collection of R packages designed for data science, including dplyr with a set of verbs for common data manipulations and ggplot2 for visualization.
  • tidytext provides specific functions for a “tidy” approach to working with textual data, where one row represents one “token” or meaningful unit of text, for example a word.
  • readtext provides a function well suited to reading textual data from a large number of formats into R, including metadata.

library(sotu)
library(tidyverse)
library(tidytext)
library(readtext)

1.1 Reading text into R

First, let’s look at the data in the sotu package. The metadata and texts are stored separately in this package, in sotu_meta and sotu_text respectively. We can inspect them either by typing their names or by using functions like glimpse() or str(). Below, for example, is what the metadata look like. Can you tell how many speeches there are?

# Let's take a look at the state of the union metadata
str(sotu_meta)
#> tibble [236 × 5] (S3: tbl_df/tbl/data.frame)
#>  $ president   : chr [1:236] "George Washington" "George Washington" "George Washington" "George Washington" ...
#>  $ year        : int [1:236] 1790 1790 1791 1792 1793 1794 1795 1796 1797 1798 ...
#>  $ years_active: chr [1:236] "1789-1793" "1789-1793" "1789-1793" "1789-1793" ...
#>  $ party       : chr [1:236] "Nonpartisan" "Nonpartisan" "Nonpartisan" "Nonpartisan" ...
#>  $ sotu_type   : chr [1:236] "speech" "speech" "speech" "speech" ...

In order to work with the speech texts, and to later practice reading text files from disk, we use the function sotu_dir() to write the texts out. By default this function writes to a temporary directory, with one speech per file. It returns a character vector in which each element is the path to an individual speech file. We save this vector in the file_paths variable.

# sotu_dir writes the text files to disk in a temporary dir, 
# but you could also specify a location.
file_paths <- sotu_dir()
head(file_paths)
#> [1] "/var/folders/5y/9x92pjcx2xd2h7qxqx39vpmc0000gn/T//Rtmpef14Qw/file836c737cda99/george-washington-1790a.txt"
#> [2] "/var/folders/5y/9x92pjcx2xd2h7qxqx39vpmc0000gn/T//Rtmpef14Qw/file836c737cda99/george-washington-1790b.txt"
#> [3] "/var/folders/5y/9x92pjcx2xd2h7qxqx39vpmc0000gn/T//Rtmpef14Qw/file836c737cda99/george-washington-1791.txt" 
#> [4] "/var/folders/5y/9x92pjcx2xd2h7qxqx39vpmc0000gn/T//Rtmpef14Qw/file836c737cda99/george-washington-1792.txt" 
#> [5] "/var/folders/5y/9x92pjcx2xd2h7qxqx39vpmc0000gn/T//Rtmpef14Qw/file836c737cda99/george-washington-1793.txt" 
#> [6] "/var/folders/5y/9x92pjcx2xd2h7qxqx39vpmc0000gn/T//Rtmpef14Qw/file836c737cda99/george-washington-1794.txt"

Now that we have the files on disk and a vector of filepaths, we can pass this vector directly into readtext to read the texts into a new variable.

# let's read in the files with readtext
sotu_texts <- readtext(file_paths)

readtext() generated a dataframe for us with 2 columns: doc_id, which is the name of the document, and text, which holds the actual text:

glimpse(sotu_texts)
#> Rows: 236
#> Columns: 2
#> $ doc_id <chr> "abraham-lincoln-1861.txt", "abraham-lincoln-1862.txt", "abraha…
#> $ text   <chr> "\n\n Fellow-Citizens of the Senate and House of Representative…
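
As a minimal sketch of the same step on files you control, the snippet below writes two small text files to a temporary directory and reads them back. The directory name, file names and contents are made up for illustration; only readtext() itself comes from the text above.

```r
library(readtext)

# Hypothetical demo files in a sub-directory of the session's temp dir
demo_dir <- file.path(tempdir(), "readtext_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("First demo speech.", file.path(demo_dir, "speech-a.txt"))
writeLines("Second demo speech.", file.path(demo_dir, "speech-b.txt"))

# Collect the full paths, just like the file_paths vector above
demo_paths <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)

# One row per file: doc_id holds the file name, text the file contents
demo_texts <- readtext(demo_paths)
demo_texts
```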

To work with a single table, we combine the text and metadata. The rows of sotu_texts are in alphabetical order of doc_id, so we sort the metadata in sotu_meta by president to match that order and then bind the columns.

sotu_whole <- 
  sotu_meta %>%  
  arrange(president) %>% # sort metadata
  bind_cols(sotu_texts) # combine with texts

glimpse(sotu_whole)
#> Rows: 236
#> Columns: 7
#> $ president    <chr> "Abraham Lincoln", "Abraham Lincoln", "Abraham Lincoln", …
#> $ year         <int> 1861, 1862, 1863, 1864, 1829, 1830, 1831, 1832, 1833, 183…
#> $ years_active <chr> "1861-1865", "1861-1865", "1861-1865", "1861-1865", "1829…
#> $ party        <chr> "Republican", "Republican", "Republican", "Republican", "…
#> $ sotu_type    <chr> "written", "written", "written", "written", "written", "w…
#> $ doc_id       <chr> "abraham-lincoln-1861.txt", "abraham-lincoln-1862.txt", "…
#> $ text         <chr> "\n\n Fellow-Citizens of the Senate and House of Represen…
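
Because bind_cols() matches rows purely by position, it can be worth checking that the two tables really line up. The sketch below does this on toy stand-ins for sotu_meta and sotu_texts (the toy_* objects and their values are illustrative assumptions), by testing whether each file name contains the year stored in the same row.

```r
library(dplyr)
library(stringr)

# Toy stand-ins for sotu_meta and sotu_texts (values are illustrative)
toy_meta <- tibble(president = c("George Washington", "Abraham Lincoln"),
                   year = c(1790L, 1861L))
toy_texts <- tibble(doc_id = c("abraham-lincoln-1861.txt",
                               "george-washington-1790.txt"),
                    text = c("text a", "text b"))

toy_whole <- toy_meta %>%
  arrange(president) %>%   # sort metadata alphabetically by president
  bind_cols(toy_texts)     # combine by row position

# TRUE only if every doc_id contains the year from the same row
rows_aligned <- all(str_detect(toy_whole$doc_id, as.character(toy_whole$year)))
rows_aligned
```

The same kind of check could be run on sotu_whole; a FALSE result would suggest that the sort order of the metadata and the file names differ somewhere.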

Now that we have our data combined, we can start looking at the text. Typically quite a bit of effort goes into pre-processing the text for further analysis. Depending on the quality of your data and your goal, you might for example need to:

  • replace certain characters or words,
  • remove urls or certain numbers, such as phone numbers,
  • clean up misspellings or errors,
  • etc.

There are several ways to handle this sort of cleaning; we’ll show a few examples below.

1.2 String operations

R has many functions available to manipulate strings including functions like grep and paste, which come with the R base install.

Here we will take a look at the stringr package, which is part of the tidyverse. It wraps much of the functionality of the stringi package, which is perhaps one of the most comprehensive string manipulation packages.

Below are examples for a few functions that might be useful.

1.2.1 Counting occurrences

str_count takes a character vector as input and by default counts the number of pattern matches in a string.

How many times does the word “citizen” appear in each of the speeches?

sotu_whole %>% 
    mutate(n_citizen = str_count(text, "citizen")) 
#> # A tibble: 236 × 8
#>    president        year years_active party     sotu_type doc_id text  n_citizen
#>    <chr>           <int> <chr>        <chr>     <chr>     <chr>  <chr>     <int>
#>  1 Abraham Lincoln  1861 1861-1865    Republic… written   abrah… "\n\…         9
#>  2 Abraham Lincoln  1862 1861-1865    Republic… written   abrah… "\n\…         7
#>  3 Abraham Lincoln  1863 1861-1865    Republic… written   abrah… "\n\…        15
#>  4 Abraham Lincoln  1864 1861-1865    Republic… written   abrah… "\n\…         3
#>  5 Andrew Jackson   1829 1829-1833    Democrat… written   andre… "\n\…        19
#>  6 Andrew Jackson   1830 1829-1833    Democrat… written   andre… "\n\…        14
#>  7 Andrew Jackson   1831 1829-1833    Democrat… written   andre… "\n\…        23
#>  8 Andrew Jackson   1832 1829-1833    Democrat… written   andre… "\n\…        19
#>  9 Andrew Jackson   1833 1833-1837    Democrat… written   andre… "\n\…        14
#> 10 Andrew Jackson   1834 1833-1837    Democrat… written   andre… "\n\…        25
#> # … with 226 more rows

It is possible to use regular expressions. For example, this is how we would count how many times either “citizen” or “Citizen” appears in each of the speeches:

sotu_whole %>% 
    mutate(n_citizen = str_count(text, "citizen"),
           n_cCitizen = str_count(text, "[Cc]itizen")) 
#> # A tibble: 236 × 9
#>    president        year years_active party     sotu_type doc_id text  n_citizen
#>    <chr>           <int> <chr>        <chr>     <chr>     <chr>  <chr>     <int>
#>  1 Abraham Lincoln  1861 1861-1865    Republic… written   abrah… "\n\…         9
#>  2 Abraham Lincoln  1862 1861-1865    Republic… written   abrah… "\n\…         7
#>  3 Abraham Lincoln  1863 1861-1865    Republic… written   abrah… "\n\…        15
#>  4 Abraham Lincoln  1864 1861-1865    Republic… written   abrah… "\n\…         3
#>  5 Andrew Jackson   1829 1829-1833    Democrat… written   andre… "\n\…        19
#>  6 Andrew Jackson   1830 1829-1833    Democrat… written   andre… "\n\…        14
#>  7 Andrew Jackson   1831 1829-1833    Democrat… written   andre… "\n\…        23
#>  8 Andrew Jackson   1832 1829-1833    Democrat… written   andre… "\n\…        19
#>  9 Andrew Jackson   1833 1833-1837    Democrat… written   andre… "\n\…        14
#> 10 Andrew Jackson   1834 1833-1837    Democrat… written   andre… "\n\…        25
#> # … with 226 more rows, and 1 more variable: n_cCitizen <int>
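
A note on the bracket syntax: a character class lists individual characters, so inside square brackets a | is a literal pipe, not an “or”. The toy strings below (made up for illustration) show the difference, together with a case-insensitive alternative using regex().

```r
library(stringr)

# Two made-up strings; the second contains a stray pipe character
x <- c("Citizen and citizen", "a stray |itizen")

n_pipe_class <- str_count(x, "[C|c]itizen")  # class also accepts a literal "|"
n_plain_class <- str_count(x, "[Cc]itizen")  # matches only "C" or "c"
n_ignore_case <- str_count(x, regex("citizen", ignore_case = TRUE))

n_pipe_class
n_plain_class
n_ignore_case
```

Only the first pattern counts a match in the second string.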

A full treatment of regular expressions is beyond the scope of this introduction. However, we want to point out the str_view() function, which can help you build and check an expression. Also see RegExr, an online tool to learn, build, & test regular expressions.

When used with the boundary argument str_count() can count different entities like “character”, “line_break”, “sentence”, or “word”. Here we add a new column to the dataframe indicating how many words are in each speech:

sotu_whole %>% 
  mutate(n_words = str_count(text, boundary("word"))) 
#> # A tibble: 236 × 8
#>    president        year years_active party      sotu_type doc_id  text  n_words
#>    <chr>           <int> <chr>        <chr>      <chr>     <chr>   <chr>   <int>
#>  1 Abraham Lincoln  1861 1861-1865    Republican written   abraha… "\n\…    6998
#>  2 Abraham Lincoln  1862 1861-1865    Republican written   abraha… "\n\…    8410
#>  3 Abraham Lincoln  1863 1861-1865    Republican written   abraha… "\n\…    6132
#>  4 Abraham Lincoln  1864 1861-1865    Republican written   abraha… "\n\…    5975
#>  5 Andrew Jackson   1829 1829-1833    Democratic written   andrew… "\n\…   10547
#>  6 Andrew Jackson   1830 1829-1833    Democratic written   andrew… "\n\…   15109
#>  7 Andrew Jackson   1831 1829-1833    Democratic written   andrew… "\n\…    7198
#>  8 Andrew Jackson   1832 1829-1833    Democratic written   andrew… "\n\…    7887
#>  9 Andrew Jackson   1833 1833-1837    Democratic written   andrew… "\n\…    7912
#> 10 Andrew Jackson   1834 1833-1837    Democratic written   andrew… "\n\…   13472
#> # … with 226 more rows
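
To get a feel for what the different boundary types count, it can help to try them on a short string first. The sentence below is a made-up example.

```r
library(stringr)

# A short made-up text with two sentences
s <- "This is one. This is two."

n_chars <- str_count(s, boundary("character"))
n_words <- str_count(s, boundary("word"))
n_sentences <- str_count(s, boundary("sentence"))

c(characters = n_chars, words = n_words, sentences = n_sentences)
```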

CHALLENGE: Use the code above and add another column n_sentences where you calculate the number of sentences per speech. Then create a third column avg_word_per_sentence, where you calculate the average number of words per sentence for each speech. Finally use filter to find which speeches have the shortest and the longest average sentence length, and what those average lengths are.

1.2.2 Detecting patterns

str_detect also looks for patterns, but instead of counts it returns a logical vector (TRUE/FALSE) indicating whether or not the pattern is found. So we typically want to use it with the filter “verb” from dplyr.

What are the names of the documents where the words “citizen” and “Citizen” do not occur?

sotu_whole %>% 
  filter(!str_detect(text, "[Cc]itizen")) %>% 
  select(doc_id) 
#> # A tibble: 11 × 1
#>    doc_id                      
#>    <chr>                       
#>  1 dwight-d-eisenhower-1958.txt
#>  2 gerald-r-ford-1975.txt      
#>  3 richard-m-nixon-1970.txt    
#>  4 richard-m-nixon-1971.txt    
#>  5 richard-m-nixon-1972a.txt   
#>  6 ronald-reagan-1988.txt      
#>  7 woodrow-wilson-1916.txt     
#>  8 woodrow-wilson-1917.txt     
#>  9 woodrow-wilson-1918.txt     
#> 10 woodrow-wilson-1919.txt     
#> 11 woodrow-wilson-1920.txt
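
The same filter pattern can be tried on a small table where the result is easy to verify by eye. The toy tibble below is an illustrative assumption, not part of the sotu data.

```r
library(dplyr)
library(stringr)

# Three made-up documents
toy <- tibble(doc_id = c("a.txt", "b.txt", "c.txt"),
              text = c("Citizens unite", "nothing to see here", "my fellow citizen"))

# str_detect() yields one TRUE/FALSE per row; "!" keeps the non-matches
toy_no_citizen <- toy %>%
  filter(!str_detect(text, "[Cc]itizen")) %>%
  select(doc_id)

toy_no_citizen
```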

1.2.3 Extracting words

The word function extracts words from each string in a character vector. By default it returns the first word. If, for example, we wanted to extract the first 5 words of each speech by Woodrow Wilson, we provide the end argument like this:

sotu_whole %>% 
  filter(president == "Woodrow Wilson") %>%  # sample a few speeches as demo
  pull(text) %>% # we pull out the text vector only
  word(end = 5) 
#> [1] "\n\nGentlemen of the Congress:\n\nIn pursuance"
#> [2] "\n\nGENTLEMEN OF THE CONGRESS: \n\nThe"        
#> [3] "GENTLEMEN OF THE CONGRESS: \n\nSince"          
#> [4] "\n\nGENTLEMEN OF THE CONGRESS: \n\nIn"         
#> [5] "Gentlemen of the Congress:\n\nEight months"    
#> [6] "\n\nGENTLEMEN OF THE CONGRESS: \n\nThe"        
#> [7] "\n\nTO THE SENATE AND HOUSE"                   
#> [8] "\n\nGENTLEMEN OF THE CONGRESS:\n\nWhen I"
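
The start and end arguments of word() are easier to follow on a single clean string. A minimal sketch, using a made-up input:

```r
library(stringr)

s <- "Gentlemen of the Congress"

first_word <- word(s)                       # default: the first word
middle_words <- word(s, start = 2, end = 3) # second to third word
last_word <- word(s, start = -1)            # negative positions count from the end

first_word
middle_words
last_word
```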

1.2.4 Replacing and removing characters

Now let’s take a look at text ‘cleaning’. We will first remove the newline characters (\n). We use the str_replace_all function to replace all occurrences of the \n pattern with a white space " ". In an R string the backslash must itself be escaped, so we write the pattern as "\\n"; the regular expression engine then receives \n and interprets it as a newline.

sotu_whole %>% 
  filter(president == "Woodrow Wilson") %>%  
  pull(text) %>%
  str_replace_all("\\n", " ") %>% # replace newline
  word(end = 5) 
#> [1] "  Gentlemen of the"          "  GENTLEMEN OF THE"         
#> [3] "GENTLEMEN OF THE CONGRESS: " "  GENTLEMEN OF THE"         
#> [5] "Gentlemen of the Congress: " "  GENTLEMEN OF THE"         
#> [7] "  TO THE SENATE"             "  GENTLEMEN OF THE"

This looks better, but we still cannot extract exactly 5 words, because word() splits on single spaces, so the leading and repeated whitespaces produce empty “words”. So let’s get rid of any whitespace before and after the string, as well as repeated whitespace within it, with the str_squish() function.

sotu_whole %>% 
  filter(president == "Woodrow Wilson") %>%  
  pull(text) %>%
  str_replace_all("\\n", " ") %>% 
  str_squish() %>%  # remove whitespaces
  word(end = 5) 
#> [1] "Gentlemen of the Congress: In"    "GENTLEMEN OF THE CONGRESS: The"  
#> [3] "GENTLEMEN OF THE CONGRESS: Since" "GENTLEMEN OF THE CONGRESS: In"   
#> [5] "Gentlemen of the Congress: Eight" "GENTLEMEN OF THE CONGRESS: The"  
#> [7] "TO THE SENATE AND HOUSE"          "GENTLEMEN OF THE CONGRESS: When"
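
To see the effect of str_squish() in isolation, here is a minimal sketch on one string shaped like the speech openings above. Note that str_squish() treats newlines as whitespace too, so on this input it collapses them without a separate replacement step.

```r
library(stringr)

# A string with leading newlines, a trailing space and internal newlines
raw <- "\n\nGENTLEMEN OF THE CONGRESS: \n\nThe"

squished <- str_squish(raw)     # trim both ends, collapse whitespace runs
first_four <- word(squished, end = 4)

squished
first_four
```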

(For spell checks take a look at https://CRAN.R-project.org/package=spelling or https://CRAN.R-project.org/package=hunspell)

1.3 Tokenize

A very common part of preparing your text for analysis involves tokenization. Currently each row of our data contains a single text with its metadata, so the entire speech text is the unit of observation. When we tokenize, we break the text down into “tokens” (most commonly single words), so that each row contains a single word with its metadata as the unit of observation.

tidytext provides a function unnest_tokens() to convert our speech table into one that is tokenized. It takes three arguments:

  • a tibble or data frame which contains the text;
  • the name of the newly created column that will contain the tokens;
  • the name of the column within the data frame which contains the text to be tokenized.

In the example below we name the new column to hold the tokens word. Remember that the column that holds the speech is called text.

tidy_sotu <- sotu_whole %>%
  unnest_tokens(word, text)

tidy_sotu
#> # A tibble: 1,965,212 × 7
#>    president        year years_active party      sotu_type doc_id          word 
#>    <chr>           <int> <chr>        <chr>      <chr>     <chr>           <chr>
#>  1 Abraham Lincoln  1861 1861-1865    Republican written   abraham-lincol… fell…
#>  2 Abraham Lincoln  1861 1861-1865    Republican written   abraham-lincol… citi…
#>  3 Abraham Lincoln  1861 1861-1865    Republican written   abraham-lincol… of   
#>  4 Abraham Lincoln  1861 1861-1865    Republican written   abraham-lincol… the  
#>  5 Abraham Lincoln  1861 1861-1865    Republican written   abraham-lincol… sena…
#>  6 Abraham Lincoln  1861 1861-1865    Republican written   abraham-lincol… and  
#>  7 Abraham Lincoln  1861 1861-1865    Republican written   abraham-lincol… house
#>  8 Abraham Lincoln  1861 1861-1865    Republican written   abraham-lincol… of   
#>  9 Abraham Lincoln  1861 1861-1865    Republican written   abraham-lincol… repr…
#> 10 Abraham Lincoln  1861 1861-1865    Republican written   abraham-lincol… in   
#> # … with 1,965,202 more rows
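
On a tiny table the reshaping is easy to follow. The sketch below uses two made-up documents; the other column (doc_id) is repeated for every token, just like the metadata columns above.

```r
library(dplyr)
library(tidytext)

# Two made-up documents
toy <- tibble(doc_id = c("a", "b"),
              text = c("Hello, World!", "Tidy text is fun."))

# One row per word; the text column is replaced by the new word column
toy_tokens <- toy %>%
  unnest_tokens(word, text)

toy_tokens
```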

Note that the unnest_tokens function didn’t just tokenize our texts at the word level. It also lowercased each word and stripped off the punctuation. We can tell it not to do this, by adding the following parameters:

# Word tokenization with punctuation and no lowercasing
sotu_whole %>%
  unnest_tokens(word, text, to_lower = FALSE, strip_punct = FALSE)
#> # A tibble: 2,157,777 × 7
#>    president        year years_active party      sotu_type doc_id          word 
#>    <chr>           <int> <chr>        <chr>      <chr>     <chr>           <chr>
#>  1 Abraham Lincoln  1861 1861-1865    Republican written   abraham-lincol… Fell…
#>  2 Abraham Lincoln  1861 1861-1865    Republican written   abraham-lincol… -    
#>  3 Abraham Lincoln  1861 1861-1865    Republican written   abraham-lincol… Citi…
#>  4 Abraham Lincoln  1861 1861-1865    Republican written   abraham-lincol… of   
#>  5 Abraham Lincoln  1861 1861-1865    Republican written   abraham-lincol… the  
#>  6 Abraham Lincoln  1861 1861-1865    Republican written   abraham-lincol… Sena…
#>  7 Abraham Lincoln  1861 1861-1865    Republican written   abraham-lincol… and  
#>  8 Abraham Lincoln  1861 1861-1865    Republican written   abraham-lincol… House
#>  9 Abraham Lincoln  1861 1861-1865    Republican written   abraham-lincol… of   
#> 10 Abraham Lincoln  1861 1861-1865    Republican written   abraham-lincol… Repr…
#> # … with 2,157,767 more rows
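
The effect of the two arguments is easiest to see on a single short string. A minimal sketch with a made-up input, where each punctuation mark becomes its own token:

```r
library(dplyr)
library(tidytext)

toy <- tibble(text = "Hello, World!")

# Keep the original case and keep punctuation as separate tokens
toy_raw_tokens <- toy %>%
  unnest_tokens(word, text, to_lower = FALSE, strip_punct = FALSE)

toy_raw_tokens
```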

We can also tokenize the text at the level of ngrams or sentences, if those are the best units of analysis for our work.

# Sentence tokenization
sotu_whole %>%
  unnest_tokens(sentence, text, token = "sentences", to_lower = FALSE) %>% 
  select(sentence)
#> # A tibble: 69,158 × 1
#>    sentence                                                                     
#>    <chr>                                                                        
#>  1 Fellow-Citizens of the Senate and House of Representatives:   In the midst o…
#>  2 You will not be surprised to learn that in the peculiar exigencies of the ti…
#>  3 A disloyal portion of the American people have during the whole year been en…
#>  4 A nation which endures factious domestic division is exposed to disrespect a…
#>  5 Nations thus tempted to interfere are not always able to resist the counsels…
#>  6 The disloyal citizens of the United States who have offered the ruin of our …
#>  7 If it were just to suppose, as the insurgents have seemed to assume, that fo…
#>  8 If we could dare to believe that foreign nations are actuated by no higher p…
#>  9 The principal lever relied on by the insurgents for exciting foreign nations…
#> 10 Those nations, however, not improbably saw from the first that it was the Un…
#> # … with 69,148 more rows

# N-gram tokenization as trigrams
sotu_whole %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>% 
  select(trigram)
#> # A tibble: 1,964,740 × 1
#>    trigram                 
#>    <chr>                   
#>  1 fellow citizens of      
#>  2 citizens of the         
#>  3 of the senate           
#>  4 the senate and          
#>  5 senate and house        
#>  6 and house of            
#>  7 house of representatives
#>  8 of representatives in   
#>  9 representatives in the  
#> 10 in the midst            
#> # … with 1,964,730 more rows

(Take note that the trigrams are generated by a moving 3-word window over the text.)
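
The moving window can be seen on a five-word toy sentence (made up for illustration): a text with w words yields w - 2 trigrams. This is consistent with the counts above, since 1,965,212 words minus 2 for each of the 236 speeches gives 1,964,740 trigrams.

```r
library(dplyr)
library(tidytext)

toy <- tibble(text = "the quick brown fox jumps")

# Five words give three overlapping trigrams
toy_trigrams <- toy %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3)

toy_trigrams
```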

1.4 Stopwords

Another common task when preparing text for analysis is to remove stopwords. Stopwords are highly common words that are considered to carry little information about the content of a text.

Let’s look at the stopwords that come with the tidytext package to get a sense of what they are.

stop_words
#> # A tibble: 1,149 × 2
#>    word        lexicon
#>    <chr>       <chr>  
#>  1 a           SMART  
#>  2 a's         SMART  
#>  3 able        SMART  
#>  4 about       SMART  
#>  5 above       SMART  
#>  6 according   SMART  
#>  7 accordingly SMART  
#>  8 across      SMART  
#>  9 actually    SMART  
#> 10 after       SMART  
#> # … with 1,139 more rows

These are English stopwords, pulled from different lexica (“onix”, “SMART”, or “snowball”). Depending on the type of analysis you’re doing, you might leave these words in or alternatively use your own curated list of stopwords. Stopword lists exist for many languages, see for example the stopwords package in R. For now we will remove the English stopwords as suggested here.

For this we use anti_join from dplyr. It returns all rows from our table of tokens tidy_sotu for which there is no matching value in our list of stopwords. Both tables have one column name in common, word, so by default the join will be on that column, and dplyr will tell us so.

tidy_sotu_words <- tidy_sotu %>% 
  anti_join(stop_words)

tidy_sotu_words
#> # A tibble: 778,161 × 7
#>    president        year years_active party      sotu_type doc_id          word 
#>    <chr>           <int> <chr>        <chr>      <chr>     <chr>           <chr>
#>  1 Abraham Lincoln  1861 1861-1865    Republican written   abraham-lincol… fell…
#>  2 Abraham Lincoln  1861 1861-1865    Republican written   abraham-lincol… citi…
#>  3 Abraham Lincoln  1861 1861-1865    Republican written   abraham-lincol… sena…
#>  4 Abraham Lincoln  1861 1861-1865    Republican written   abraham-lincol… house
#>  5 Abraham Lincoln  1861 1861-1865    Republican written   abraham-lincol… repr…
#>  6 Abraham Lincoln  1861 1861-1865    Republican written   abraham-lincol… midst
#>  7 Abraham Lincoln  1861 1861-1865    Republican written   abraham-lincol… unpr…
#>  8 Abraham Lincoln  1861 1861-1865    Republican written   abraham-lincol… poli…
#>  9 Abraham Lincoln  1861 1861-1865    Republican written   abraham-lincol… trou…
#> 10 Abraham Lincoln  1861 1861-1865    Republican written   abraham-lincol… grat…
#> # … with 778,151 more rows

If we compare this with tidy_sotu we see that the records with words like “of”, “the”, “and”, “in” are now removed.

We also went from 1,965,212 to 778,161 rows, which means we had a lot of stopwords in our corpus. This is a huge removal, so for serious analysis we might want to scrutinize the stopword list carefully and determine whether removing all of these words is appropriate.
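
One way to take more control over what is removed is to restrict stop_words to a single lexicon and add corpus-specific words of your own. The sketch below does this on a toy token table; the choice of the “snowball” lexicon and of “congress” as an extra stopword are arbitrary assumptions for illustration.

```r
library(dplyr)
library(tidytext)

# A made-up token table
toy_tokens <- tibble(word = c("the", "senate", "and", "house", "congress"))

# Keep only one lexicon and append a custom, corpus-specific stopword
my_stop_words <- stop_words %>%
  filter(lexicon == "snowball") %>%
  bind_rows(tibble(word = "congress", lexicon = "custom"))

# Naming the join column explicitly avoids the "Joining, by" message
toy_words <- toy_tokens %>%
  anti_join(my_stop_words, by = "word")

toy_words
```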

1.5 Word Stemming

Another way you may want to clean your data is to stem your words, that is, to reduce them to their word stem or root form, for example reducing fishing, fished, and fisher to the stem fish.

tidytext does not implement its own word stemmer. Instead it relies on separate packages like hunspell or SnowballC.

We will give an example here for the SnowballC package which comes with a function wordStem. (hunspell appears to run much slower, and it also returns a list instead of a vector, so in this context SnowballC seems to be more convenient.)

library(SnowballC)
tidy_sotu_words %>%
        mutate(word_stem = wordStem(word))
#> # A tibble: 778,161 × 8
#>    president        year years_active party     sotu_type doc_id word  word_stem
#>    <chr>           <int> <chr>        <chr>     <chr>     <chr>  <chr> <chr>    
#>  1 Abraham Lincoln  1861 1861-1865    Republic… written   abrah… fell… fellow   
#>  2 Abraham Lincoln  1861 1861-1865    Republic… written   abrah… citi… citizen  
#>  3 Abraham Lincoln  1861 1861-1865    Republic… written   abrah… sena… senat    
#>  4 Abraham Lincoln  1861 1861-1865    Republic… written   abrah… house hous     
#>  5 Abraham Lincoln  1861 1861-1865    Republic… written   abrah… repr… repres   
#>  6 Abraham Lincoln  1861 1861-1865    Republic… written   abrah… midst midst    
#>  7 Abraham Lincoln  1861 1861-1865    Republic… written   abrah… unpr… unpreced 
#>  8 Abraham Lincoln  1861 1861-1865    Republic… written   abrah… poli… polit    
#>  9 Abraham Lincoln  1861 1861-1865    Republic… written   abrah… trou… troubl   
#> 10 Abraham Lincoln  1861 1861-1865    Republic… written   abrah… grat… gratitud 
#> # … with 778,151 more rows
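
To inspect what the stemmer does to individual words, wordStem() can also be called directly on a character vector. The word list below is an illustrative assumption. Running it shows which forms this particular stemmer collapses; not every related form is necessarily reduced to the same stem.

```r
library(SnowballC)

# A few made-up test words, including two from the output above
words <- c("fishing", "fished", "fisher", "house", "senate")

# wordStem() is vectorised and returns one stem per input word
stems <- wordStem(words)

data.frame(word = words, stem = stems)
```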

Lemmatization takes this a step further. While a stemmer operates on a single word without knowledge of the context, lemmatization attempts to discriminate between words which have different meanings depending on part of speech. For example, the word “better” has “good” as its lemma, something a stemmer would not detect.

For lemmatization in R, you may want to take a look at the koRpus package, another comprehensive R package for text analysis. It allows you to use TreeTagger, a widely used part-of-speech tagger. For full functionality of the R package, a local installation of TreeTagger is recommended.