Chapter 1 Preparing Textual Data
Learning Objectives
- read textual data into R using readtext
- use the stringr package to prepare strings for processing
- use tidytext functions to tokenize texts and remove stopwords
- use SnowballC to stem words
We’ll use several R packages in this section:
- sotu will provide the metadata and text of State of the Union speeches ranging from George Washington to Barack Obama.
- tidyverse is a collection of R packages designed for data science, including dplyr with a set of verbs for common data manipulations and ggplot2 for visualization.
- tidytext provides specific functions for a “tidy” approach to working with textual data, where one row represents one “token” or meaningful unit of text, for example a word.
- readtext provides a function well suited to reading textual data from a large number of formats into R, including metadata.
library(sotu)
library(tidyverse)
library(tidytext)
library(readtext)
1.1 Reading text into R
First, let’s look at the data in the sotu
package. The metadata and texts are contained in this package separately in sotu_meta
and sotu_text
respectively. We can take a look at those by either typing the names or using functions like glimpse() or str(). Below, for example, is what the metadata looks like. Can you tell how many speeches there are?
# Let's take a look at the state of the union metadata
str(sotu_meta)
#> 'data.frame': 240 obs. of 6 variables:
#> $ X : int 1 2 3 4 5 6 7 8 9 10 ...
#> $ president : chr "George Washington" "George Washington" "George Washington" "George Washington" ...
#> $ year : int 1790 1790 1791 1792 1793 1794 1795 1796 1797 1798 ...
#> $ years_active: chr "1789-1793" "1789-1793" "1789-1793" "1789-1793" ...
#> $ party : chr "Nonpartisan" "Nonpartisan" "Nonpartisan" "Nonpartisan" ...
#> $ sotu_type : chr "speech" "speech" "speech" "speech" ...
In order to work with the speech texts, and to later practice reading text files from disk, we use the function sotu_dir() to write the texts out. By default this function writes to a temporary directory, with one speech in each file. It returns a character vector where each element is the path to an individual speech file. We save this vector in the file_paths variable.
# sotu_dir writes the text files to disk in a temporary dir,
# but you could also specify a location.
file_paths <- sotu_dir()
head(file_paths)
#> [1] "/var/folders/b5/fxcv6x555j51n30f4nqq4dqr0000gp/T//Rtmp3kRRs1/file63921d8a5c72/george-washington-1790a.txt"
#> [2] "/var/folders/b5/fxcv6x555j51n30f4nqq4dqr0000gp/T//Rtmp3kRRs1/file63921d8a5c72/george-washington-1790b.txt"
#> [3] "/var/folders/b5/fxcv6x555j51n30f4nqq4dqr0000gp/T//Rtmp3kRRs1/file63921d8a5c72/george-washington-1791.txt"
#> [4] "/var/folders/b5/fxcv6x555j51n30f4nqq4dqr0000gp/T//Rtmp3kRRs1/file63921d8a5c72/george-washington-1792.txt"
#> [5] "/var/folders/b5/fxcv6x555j51n30f4nqq4dqr0000gp/T//Rtmp3kRRs1/file63921d8a5c72/george-washington-1793.txt"
#> [6] "/var/folders/b5/fxcv6x555j51n30f4nqq4dqr0000gp/T//Rtmp3kRRs1/file63921d8a5c72/george-washington-1794.txt"
Now that we have the files on disk and a vector of filepaths, we can pass this vector directly into readtext
to read the texts into a new variable.
# let's read in the files with readtext
sotu_texts <- readtext(file_paths)
readtext()
generated a data frame for us with two columns: doc_id, which holds the name of the document, and text, which holds the actual text:
glimpse(sotu_texts)
#> Rows: 240
#> Columns: 2
#> $ doc_id <chr> "abraham-lincoln-1861.txt", "abraham-lincoln-1862.txt", "abraha…
#> $ text <chr> "\n\n Fellow-Citizens of the Senate and House of Representative…
To work with a single table, we combine the text and metadata. Our sotu_texts are sorted in alphabetical order, so we arrange our metadata in sotu_meta to match that order and then bind the columns.
sotu_whole <- sotu_meta %>%
  arrange(president) %>%     # sort metadata
  bind_cols(sotu_texts) %>%  # combine with texts
  as_tibble()                # convert to tibble for better screen viewing
glimpse(sotu_whole)
#> Rows: 240
#> Columns: 8
#> $ X <int> 73, 74, 75, 76, 41, 42, 43, 44, 45, 46, 47, 48, 77, 78, 7…
#> $ president <chr> "Abraham Lincoln", "Abraham Lincoln", "Abraham Lincoln", …
#> $ year <int> 1861, 1862, 1863, 1864, 1829, 1830, 1831, 1832, 1833, 183…
#> $ years_active <chr> "1861-1865", "1861-1865", "1861-1865", "1861-1865", "1829…
#> $ party <chr> "Republican", "Republican", "Republican", "Republican", "…
#> $ sotu_type <chr> "written", "written", "written", "written", "written", "w…
#> $ doc_id <chr> "abraham-lincoln-1861.txt", "abraham-lincoln-1862.txt", "…
#> $ text <chr> "\n\n Fellow-Citizens of the Senate and House of Represen…
Now that we have our data combined, we can start looking at the text. Typically quite a bit of effort goes into pre-processing the text for further analysis. Depending on the quality of your data and your goal, you might for example need to:
- replace certain characters or words,
- remove urls or certain numbers, such as phone numbers,
- clean up misspellings or errors,
- etc.
There are several ways to handle this sort of cleaning; we’ll show a few examples below.
1.2 String operations
R has many functions available to manipulate strings, including functions like grep and paste, which come with the base R install.
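For a flavor of these base tools, here is a quick sketch with made-up inputs: grepl() detects whether a pattern occurs, and paste() concatenates strings.
# grepl() returns TRUE/FALSE for pattern matches; paste() joins strings
grepl("citizen", c("fellow citizens", "the Senate"))
#> [1]  TRUE FALSE
paste("State", "of", "the", "Union")
#> [1] "State of the Union"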
Here we will take a look at the stringr package, which is part of the tidyverse. It is built on top of the stringi package, which is perhaps one of the most comprehensive string manipulation packages available.
Below are examples for a few functions that might be useful.
1.2.1 Counting occurrences
str_count
takes a character vector as input and by default counts the number of pattern matches in a string.
How many times does the word “citizen” appear in each of the speeches?
sotu_whole %>%
  mutate(n_citizen = str_count(text, "citizen"))
#> # A tibble: 240 × 9
#> X president year years_active party sotu_…¹ doc_id text n_cit…²
#> <int> <chr> <int> <chr> <chr> <chr> <chr> <chr> <int>
#> 1 73 Abraham Lincoln 1861 1861-1865 Republ… written abrah… "\n\… 9
#> 2 74 Abraham Lincoln 1862 1861-1865 Republ… written abrah… "\n\… 7
#> 3 75 Abraham Lincoln 1863 1861-1865 Republ… written abrah… "\n\… 15
#> 4 76 Abraham Lincoln 1864 1861-1865 Republ… written abrah… "\n\… 3
#> 5 41 Andrew Jackson 1829 1829-1833 Democr… written andre… "\n\… 19
#> 6 42 Andrew Jackson 1830 1829-1833 Democr… written andre… "\n\… 14
#> 7 43 Andrew Jackson 1831 1829-1833 Democr… written andre… "\n\… 23
#> 8 44 Andrew Jackson 1832 1829-1833 Democr… written andre… "\n\… 19
#> 9 45 Andrew Jackson 1833 1833-1837 Democr… written andre… "\n\… 14
#> 10 46 Andrew Jackson 1834 1833-1837 Democr… written andre… "\n\… 25
#> # … with 230 more rows, and abbreviated variable names ¹sotu_type, ²n_citizen
It is possible to use regular expressions. For example, this is how we would check how many times either “citizen” or “Citizen” appears in each of the speeches:
sotu_whole %>%
  mutate(n_citizen = str_count(text, "citizen"),
         n_cCitizen = str_count(text, "[Cc]itizen"))
#> # A tibble: 240 × 10
#> X president year years…¹ party sotu_…² doc_id text n_cit…³ n_cCi…⁴
#> <int> <chr> <int> <chr> <chr> <chr> <chr> <chr> <int> <int>
#> 1 73 Abraham Linco… 1861 1861-1… Repu… written abrah… "\n\… 9 10
#> 2 74 Abraham Linco… 1862 1861-1… Repu… written abrah… "\n\… 7 8
#> 3 75 Abraham Linco… 1863 1861-1… Repu… written abrah… "\n\… 15 16
#> 4 76 Abraham Linco… 1864 1861-1… Repu… written abrah… "\n\… 3 4
#> 5 41 Andrew Jackson 1829 1829-1… Demo… written andre… "\n\… 19 20
#> 6 42 Andrew Jackson 1830 1829-1… Demo… written andre… "\n\… 14 15
#> 7 43 Andrew Jackson 1831 1829-1… Demo… written andre… "\n\… 23 24
#> 8 44 Andrew Jackson 1832 1829-1… Demo… written andre… "\n\… 19 20
#> 9 45 Andrew Jackson 1833 1833-1… Demo… written andre… "\n\… 14 15
#> 10 46 Andrew Jackson 1834 1833-1… Demo… written andre… "\n\… 25 26
#> # … with 230 more rows, and abbreviated variable names ¹years_active,
#> # ²sotu_type, ³n_citizen, ⁴n_cCitizen
A full treatment of regular expressions is beyond the scope of this introduction. However, we want to point out the str_view() function, which can help you build and check your expression. Also see RegExr, an online tool to learn, build, & test regular expressions.
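For example, here is a quick sketch of str_view() on a few made-up strings; it prints the input with the matches highlighted:
# str_view() highlights where the pattern matches in each string
str_view(c("citizen", "Citizen", "senator"), "[Cc]itizen")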
When used with the boundary() function as its pattern, str_count() can count different entities like “character”, “line_break”, “sentence”, or “word”. Here we add a new column to the data frame indicating how many words are in each speech:
sotu_whole %>%
  mutate(n_words = str_count(text, boundary("word")))
#> # A tibble: 240 × 9
#> X president year years_active party sotu_…¹ doc_id text n_words
#> <int> <chr> <int> <chr> <chr> <chr> <chr> <chr> <int>
#> 1 73 Abraham Lincoln 1861 1861-1865 Republ… written abrah… "\n\… 6998
#> 2 74 Abraham Lincoln 1862 1861-1865 Republ… written abrah… "\n\… 8410
#> 3 75 Abraham Lincoln 1863 1861-1865 Republ… written abrah… "\n\… 6132
#> 4 76 Abraham Lincoln 1864 1861-1865 Republ… written abrah… "\n\… 5975
#> 5 41 Andrew Jackson 1829 1829-1833 Democr… written andre… "\n\… 10547
#> 6 42 Andrew Jackson 1830 1829-1833 Democr… written andre… "\n\… 15109
#> 7 43 Andrew Jackson 1831 1829-1833 Democr… written andre… "\n\… 7198
#> 8 44 Andrew Jackson 1832 1829-1833 Democr… written andre… "\n\… 7887
#> 9 45 Andrew Jackson 1833 1833-1837 Democr… written andre… "\n\… 7912
#> 10 46 Andrew Jackson 1834 1833-1837 Democr… written andre… "\n\… 13472
#> # … with 230 more rows, and abbreviated variable name ¹sotu_type
CHALLENGE: Use the code above and add another column n_sentences where you calculate the number of sentences per speech. Then create a third column avg_word_per_sentence, where you calculate the number of words per sentence for each speech. Finally, use filter to find which speeches have the shortest and longest average sentence length, and what those average lengths are. (One possible solution sketch follows.)
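Here is one possible solution sketch (output not shown; exact numbers may vary with how stringr detects sentence boundaries):
# count words and sentences, then compute average sentence length
sotu_whole %>%
  mutate(n_words = str_count(text, boundary("word")),
         n_sentences = str_count(text, boundary("sentence")),
         avg_word_per_sentence = n_words / n_sentences) %>%
  filter(avg_word_per_sentence %in% range(avg_word_per_sentence)) %>%
  select(doc_id, avg_word_per_sentence)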
1.2.2 Detecting patterns
str_detect also looks for patterns, but instead of counts it returns a logical vector (TRUE/FALSE) indicating whether or not the pattern is found. So we typically want to use it with the filter “verb” from dplyr.
What are the names of the documents where the words “citizen” and “Citizen” do not occur?
sotu_whole %>%
  filter(!str_detect(text, "[Cc]itizen")) %>%
  select(doc_id)
#> # A tibble: 11 × 1
#> doc_id
#> <chr>
#> 1 dwight-d-eisenhower-1958.txt
#> 2 gerald-r-ford-1975.txt
#> 3 richard-m-nixon-1970.txt
#> 4 richard-m-nixon-1971.txt
#> 5 richard-m-nixon-1972a.txt
#> 6 ronald-reagan-1988.txt
#> 7 woodrow-wilson-1916.txt
#> 8 woodrow-wilson-1917.txt
#> 9 woodrow-wilson-1918.txt
#> 10 woodrow-wilson-1919.txt
#> 11 woodrow-wilson-1920.txt
1.2.3 Extracting words
The word function extracts words from a character vector of strings. By default it returns the first word. If, for example, we wanted to extract the first 5 words of each speech by Woodrow Wilson, we would provide the end argument like this:
sotu_whole %>%
  filter(president == "Woodrow Wilson") %>% # sample a few speeches as demo
  pull(text) %>%                            # we pull out the text vector only
  word(end = 5)
#> [1] "\n\nGentlemen of the Congress:\n\nIn pursuance"
#> [2] "\n\nGENTLEMEN OF THE CONGRESS: \n\nThe"
#> [3] "GENTLEMEN OF THE CONGRESS: \n\nSince"
#> [4] "\n\nGENTLEMEN OF THE CONGRESS: \n\nIn"
#> [5] "Gentlemen of the Congress:\n\nEight months"
#> [6] "\n\nGENTLEMEN OF THE CONGRESS: \n\nThe"
#> [7] "\n\nTO THE SENATE AND HOUSE"
#> [8] "\n\nGENTLEMEN OF THE CONGRESS:\n\nWhen I"
1.2.4 Replacing and removing characters
Now let’s take a look at text ‘cleaning’. We will first remove the newline characters (\n). We use the str_replace_all function to replace all the occurrences of the \n pattern with a whitespace " ". Note that in the R string we escape the backslash ("\\n") so that the two-character pattern \n reaches the regular expression engine, which interprets it as a newline.
sotu_whole %>%
  filter(president == "Woodrow Wilson") %>%
  pull(text) %>%
  str_replace_all("\\n", " ") %>% # replace newline
  word(end = 5)
#> [1] " Gentlemen of the" " GENTLEMEN OF THE"
#> [3] "GENTLEMEN OF THE CONGRESS: " " GENTLEMEN OF THE"
#> [5] "Gentlemen of the Congress: " " GENTLEMEN OF THE"
#> [7] " TO THE SENATE" " GENTLEMEN OF THE"
This looks better, but we still cannot extract exactly 5 words, because the leading and repeated whitespaces are themselves counted as words. So let’s get rid of any whitespace before and after the string, as well as repeated whitespace within it, with the str_squish() function.
sotu_whole %>%
  filter(president == "Woodrow Wilson") %>%
  pull(text) %>%
  str_replace_all("\\n", " ") %>%
  str_squish() %>% # remove whitespaces
  word(end = 5)
#> [1] "Gentlemen of the Congress: In" "GENTLEMEN OF THE CONGRESS: The"
#> [3] "GENTLEMEN OF THE CONGRESS: Since" "GENTLEMEN OF THE CONGRESS: In"
#> [5] "Gentlemen of the Congress: Eight" "GENTLEMEN OF THE CONGRESS: The"
#> [7] "TO THE SENATE AND HOUSE" "GENTLEMEN OF THE CONGRESS: When"
(For spell checks take a look at https://CRAN.R-project.org/package=spelling or https://CRAN.R-project.org/package=hunspell)
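As a minimal sketch, assuming the hunspell package is installed, its hunspell_check() function flags words not found in its dictionary:
# hunspell_check() returns TRUE/FALSE per word; "citizzen" is a made-up typo
library(hunspell)
hunspell_check(c("citizen", "citizzen"))
#> [1]  TRUE FALSE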
1.3 Tokenize
A very common part of preparing your text for analysis involves tokenization. Currently our data contains in each row a single text with its metadata, so the entire speech text is the unit of observation. When we tokenize, we break the text down into “tokens” (most commonly single words), so that each row contains a single word with its metadata as the unit of observation.
tidytext
provides a function unnest_tokens()
to convert our speech table into one that is tokenized. It takes three arguments:
- a tibble or data frame which contains the text;
- the name of the newly created column that will contain the tokens;
- the name of the column within the data frame which contains the text to be tokenized.
In the example below we name the new column to hold the tokens word
. Remember that the column that holds the speech is called text
.
tidy_sotu <- sotu_whole %>%
  unnest_tokens(word, text)

tidy_sotu
#> # A tibble: 1,988,203 × 8
#> X president year years_active party sotu_type doc_id word
#> <int> <chr> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 73 Abraham Lincoln 1861 1861-1865 Republican written abraham-… fell…
#> 2 73 Abraham Lincoln 1861 1861-1865 Republican written abraham-… citi…
#> 3 73 Abraham Lincoln 1861 1861-1865 Republican written abraham-… of
#> 4 73 Abraham Lincoln 1861 1861-1865 Republican written abraham-… the
#> 5 73 Abraham Lincoln 1861 1861-1865 Republican written abraham-… sena…
#> 6 73 Abraham Lincoln 1861 1861-1865 Republican written abraham-… and
#> 7 73 Abraham Lincoln 1861 1861-1865 Republican written abraham-… house
#> 8 73 Abraham Lincoln 1861 1861-1865 Republican written abraham-… of
#> 9 73 Abraham Lincoln 1861 1861-1865 Republican written abraham-… repr…
#> 10 73 Abraham Lincoln 1861 1861-1865 Republican written abraham-… in
#> # … with 1,988,193 more rows
Note that the unnest_tokens function didn’t just tokenize our texts at the word level. It also lowercased each word and stripped off the punctuation. We can tell it not to do this by adding the following parameters:
# Word tokenization with punctuation and no lowercasing
sotu_whole %>%
  unnest_tokens(word, text, to_lower = FALSE, strip_punct = FALSE)
#> # A tibble: 2,184,602 × 8
#> X president year years_active party sotu_type doc_id word
#> <int> <chr> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 73 Abraham Lincoln 1861 1861-1865 Republican written abraham-… Fell…
#> 2 73 Abraham Lincoln 1861 1861-1865 Republican written abraham-… -
#> 3 73 Abraham Lincoln 1861 1861-1865 Republican written abraham-… Citi…
#> 4 73 Abraham Lincoln 1861 1861-1865 Republican written abraham-… of
#> 5 73 Abraham Lincoln 1861 1861-1865 Republican written abraham-… the
#> 6 73 Abraham Lincoln 1861 1861-1865 Republican written abraham-… Sena…
#> 7 73 Abraham Lincoln 1861 1861-1865 Republican written abraham-… and
#> 8 73 Abraham Lincoln 1861 1861-1865 Republican written abraham-… House
#> 9 73 Abraham Lincoln 1861 1861-1865 Republican written abraham-… of
#> 10 73 Abraham Lincoln 1861 1861-1865 Republican written abraham-… Repr…
#> # … with 2,184,592 more rows
We can also tokenize the text at the level of ngrams or sentences, if those are the best units of analysis for our work.
# Sentence tokenization
sotu_whole %>%
  unnest_tokens(sentence, text, token = "sentences", to_lower = FALSE) %>%
  select(sentence)
#> # A tibble: 70,761 × 1
#> sentence
#> <chr>
#> 1 Fellow-Citizens of the Senate and House of Representatives: In the midst o…
#> 2 You will not be surprised to learn that in the peculiar exigencies of the ti…
#> 3 A disloyal portion of the American people have during the whole year been en…
#> 4 A nation which endures factious domestic division is exposed to disrespect a…
#> 5 Nations thus tempted to interfere are not always able to resist the counsels…
#> 6 The disloyal citizens of the United States who have offered the ruin of our …
#> 7 If it were just to suppose, as the insurgents have seemed to assume, that fo…
#> 8 If we could dare to believe that foreign nations are actuated by no higher p…
#> 9 The principal lever relied on by the insurgents for exciting foreign nations…
#> 10 Those nations, however, not improbably saw from the first that it was the Un…
#> # … with 70,751 more rows
# N-gram tokenization as trigrams
sotu_whole %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  select(trigram)
#> # A tibble: 1,987,723 × 1
#> trigram
#> <chr>
#> 1 fellow citizens of
#> 2 citizens of the
#> 3 of the senate
#> 4 the senate and
#> 5 senate and house
#> 6 and house of
#> 7 house of representatives
#> 8 of representatives in
#> 9 representatives in the
#> 10 in the midst
#> # … with 1,987,713 more rows
(Take note that the trigrams are generated by a moving 3-word window over the text.)
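To make the moving window concrete, here is a sketch on a made-up five-word string, which yields exactly three overlapping trigrams:
# a 5-word text produces 3 trigrams:
# "of the senate", "the senate and", "senate and house"
tibble(text = "of the senate and house") %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3)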
1.4 Stopwords
Another common task in preparing text for analysis is to remove stopwords. Stopwords are highly common words that are considered to carry little relevant information about the content of a text.
Let’s look at the stopwords that come with the tidytext
package to get a sense of what they are.
stop_words
#> # A tibble: 1,149 × 2
#> word lexicon
#> <chr> <chr>
#> 1 a SMART
#> 2 a's SMART
#> 3 able SMART
#> 4 about SMART
#> 5 above SMART
#> 6 according SMART
#> 7 accordingly SMART
#> 8 across SMART
#> 9 actually SMART
#> 10 after SMART
#> # … with 1,139 more rows
These are English stopwords, pulled from different lexicons (“onix”, “SMART”, or “snowball”). Depending on the type of analysis you’re doing, you might leave these words in, or alternatively use your own curated list of stopwords. Stopword lists exist for many languages; see for example the stopwords package in R. For now we will remove the English stopwords from this combined list.
For this we use anti_join
from dplyr
. We join and return all rows from our table of tokens tidy_sotu
where there are no matching values in our list of stopwords. Both of these tables have one column name in common, word, so by default the join will be on that column, and dplyr will tell us so.
tidy_sotu_words <- tidy_sotu %>%
  anti_join(stop_words)

tidy_sotu_words
#> # A tibble: 787,851 × 8
#> X president year years_active party sotu_type doc_id word
#> <int> <chr> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 73 Abraham Lincoln 1861 1861-1865 Republican written abraham-… fell…
#> 2 73 Abraham Lincoln 1861 1861-1865 Republican written abraham-… citi…
#> 3 73 Abraham Lincoln 1861 1861-1865 Republican written abraham-… sena…
#> 4 73 Abraham Lincoln 1861 1861-1865 Republican written abraham-… house
#> 5 73 Abraham Lincoln 1861 1861-1865 Republican written abraham-… repr…
#> 6 73 Abraham Lincoln 1861 1861-1865 Republican written abraham-… midst
#> 7 73 Abraham Lincoln 1861 1861-1865 Republican written abraham-… unpr…
#> 8 73 Abraham Lincoln 1861 1861-1865 Republican written abraham-… poli…
#> 9 73 Abraham Lincoln 1861 1861-1865 Republican written abraham-… trou…
#> 10 73 Abraham Lincoln 1861 1861-1865 Republican written abraham-… grat…
#> # … with 787,841 more rows
If we compare this with tidy_sotu
we see that the records with words like “of”, “the”, “and”, “in” are now removed.
We also went from 1,988,203 to 787,851 rows, which means we had a lot of stopwords in our corpus. This is a huge removal, so for serious analysis we might want to scrutinize the stopword list carefully and determine whether removing this many words is appropriate.
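If you do curate your own list, the same anti_join() pattern works with any data frame that has a word column. A minimal sketch, using a purely hypothetical list of words:
# anti_join against a custom stopword list (these words are just examples)
my_stopwords <- tibble(word = c("applause", "ought", "shall"))
tidy_sotu %>%
  anti_join(my_stopwords, by = "word")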
1.5 Word Stemming
Another way you may want to clean your data is to stem your words, that is, to reduce them to their word stem or root form, for example reducing fishing, fished, and fisher to the stem fish.
tidytext
does not implement its own word stemmer. Instead it relies on separate packages like hunspell
or SnowballC
.
We will give an example here for the SnowballC package, which comes with the function wordStem. (hunspell appears to run much slower, and it also returns a list instead of a vector, so in this context SnowballC seems to be more convenient.)
library(SnowballC)
tidy_sotu_words %>%
  mutate(word_stem = wordStem(word))
#> # A tibble: 787,851 × 9
#> X president year years_active party sotu_…¹ doc_id word word_…²
#> <int> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 73 Abraham Lincoln 1861 1861-1865 Republ… written abrah… fell… fellow
#> 2 73 Abraham Lincoln 1861 1861-1865 Republ… written abrah… citi… citizen
#> 3 73 Abraham Lincoln 1861 1861-1865 Republ… written abrah… sena… senat
#> 4 73 Abraham Lincoln 1861 1861-1865 Republ… written abrah… house hous
#> 5 73 Abraham Lincoln 1861 1861-1865 Republ… written abrah… repr… repres
#> 6 73 Abraham Lincoln 1861 1861-1865 Republ… written abrah… midst midst
#> 7 73 Abraham Lincoln 1861 1861-1865 Republ… written abrah… unpr… unprec…
#> 8 73 Abraham Lincoln 1861 1861-1865 Republ… written abrah… poli… polit
#> 9 73 Abraham Lincoln 1861 1861-1865 Republ… written abrah… trou… troubl
#> 10 73 Abraham Lincoln 1861 1861-1865 Republ… written abrah… grat… gratit…
#> # … with 787,841 more rows, and abbreviated variable names ¹sotu_type,
#> # ²word_stem
Lemmatization takes this a step further. While a stemmer operates on a single word without knowledge of its context, lemmatization attempts to discriminate between words that have different meanings depending on part of speech. For example, the word “better” has “good” as its lemma, something a stemmer would not detect.
For lemmatization in R, you may want to take a look at the koRpus package, another comprehensive R package for text analysis. It allows you to use TreeTagger, a widely used part-of-speech tagger. For full functionality of the R package, a local installation of TreeTagger is recommended.
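If installing TreeTagger is not practical, one lighter-weight alternative (not used in this chapter; shown only as a sketch, assuming the textstem package is installed) is textstem’s dictionary-based lemmatizer:
# lemmatize_words() looks each word up in a lemma dictionary
# (lexicon::hash_lemmas by default)
library(textstem)
lemmatize_words(c("better", "fished", "fishing"))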