Chapter 2 Analyzing Texts

Learning Objectives

  • perform frequency counts and generate plots
  • use the widyr package to calculate co-occurrence
  • use igraph and ggraph to plot a co-occurrence graph
  • import and export a Document-Term Matrix to and from the tidytext format
  • use the sentiments dataset from tidytext to perform a sentiment analysis

Now that we’ve read in our text and metadata, tokenized and cleaned it a little, let’s move on to some analysis.

First, we’ll make sure we have loaded the libraries we’ll need.

library(tidyverse)
library(tidytext)

Let’s remind ourselves of what our data looks like.

tidy_sotu_words
#> # A tibble: 787,851 × 8
#>        X president        year years_active party      sotu_type doc_id    word 
#>    <int> <chr>           <int> <chr>        <chr>      <chr>     <chr>     <chr>
#>  1    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham-… fell…
#>  2    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham-… citi…
#>  3    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham-… sena…
#>  4    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham-… house
#>  5    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham-… repr…
#>  6    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham-… midst
#>  7    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham-… unpr…
#>  8    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham-… poli…
#>  9    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham-… trou…
#> 10    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham-… grat…
#> # … with 787,841 more rows

2.1 Frequencies

Since our unit of analysis at this point is a word, let’s count to determine which words occur most frequently in the corpus as a whole.

tidy_sotu_words %>%
  count(word, sort = TRUE)
#> # A tibble: 29,897 × 2
#>    word           n
#>    <chr>      <int>
#>  1 government  7597
#>  2 congress    5808
#>  3 united      5156
#>  4 people      4298
#>  5 country     3641
#>  6 public      3419
#>  7 time        3188
#>  8 american    2988
#>  9 war         2976
#> 10 world       2633
#> # … with 29,887 more rows

We can pipe this into ggplot to make a graph of the words that occur more than 2000 times. We count the words, keep those with n > 2000, and use geom_col to represent the n values.

tidy_sotu_words %>%
  count(word) %>% 
  filter(n > 2000) %>% 
  mutate(word = reorder(word, n)) %>%  # reorder values by frequency
  ggplot(aes(word, n)) +
     geom_col(fill = "gray") +
     coord_flip()  # flip x and y coordinates so we can read the words better

Now let’s look at a different question: In any given year, how often is the word ‘peace’ used and how often is the word ‘war’ used?

# steps:
# select only the words 'war' and 'peace'
# count occurrences of each per year

tidy_sotu_words %>%
  filter(word %in% c("war", "peace")) %>% 
  count(year, word)
#> # A tibble: 442 × 3
#>     year word      n
#>    <int> <chr> <int>
#>  1  1790 peace     3
#>  2  1790 war       4
#>  3  1791 peace     4
#>  4  1791 war       1
#>  5  1792 peace     5
#>  6  1792 war       2
#>  7  1793 peace     6
#>  8  1793 war       4
#>  9  1794 peace     3
#> 10  1794 war       1
#> # … with 432 more rows

Now we can plot this as a bar chart that shows, for each year, the proportion of each of the two words relative to their combined count.

# plot n by year, and use position 'fill' to show the proportion

tidy_sotu_words %>%
  filter(word %in% c("war", "peace")) %>% 
  count(year, word) %>% 
  ggplot(aes(year, n, fill = word)) +
    geom_col(position = "fill")

As another example, let us calculate the average number of words per speech for each president: How long was the average speech of each president, and who were the most ‘wordy’ presidents?

First we summarize the words per president per speech:

tidy_sotu_words %>%
  count(president, doc_id)
#> # A tibble: 240 × 3
#>    president       doc_id                       n
#>    <chr>           <chr>                    <int>
#>  1 Abraham Lincoln abraham-lincoln-1861.txt  2578
#>  2 Abraham Lincoln abraham-lincoln-1862.txt  3088
#>  3 Abraham Lincoln abraham-lincoln-1863.txt  2398
#>  4 Abraham Lincoln abraham-lincoln-1864.txt  2398
#>  5 Andrew Jackson  andrew-jackson-1829.txt   3849
#>  6 Andrew Jackson  andrew-jackson-1830.txt   5428
#>  7 Andrew Jackson  andrew-jackson-1831.txt   2612
#>  8 Andrew Jackson  andrew-jackson-1832.txt   2881
#>  9 Andrew Jackson  andrew-jackson-1833.txt   2869
#> 10 Andrew Jackson  andrew-jackson-1834.txt   4952
#> # … with 230 more rows

Then we use the output table and group it by president. That allows us to calculate the average number of words per speech.

tidy_sotu_words %>%
  count(president, doc_id)  %>% 
  group_by(president) %>% 
  summarize(avg_words = mean(n)) %>% 
  arrange(desc(avg_words))
#> # A tibble: 42 × 2
#>    president           avg_words
#>    <chr>                   <dbl>
#>  1 William Howard Taft     9126.
#>  2 William McKinley        7797 
#>  3 Jimmy Carter            7673.
#>  4 Theodore Roosevelt      7356 
#>  5 James K. Polk           6920.
#>  6 Grover Cleveland        5736.
#>  7 James Buchanan          5409 
#>  8 Benjamin Harrison       5308.
#>  9 Rutherford B. Hayes     4411 
#> 10 Martin Van Buren        4286.
#> # … with 32 more rows

2.2 Term frequency

Often a raw count of a word is less important than understanding how often that word appears relative to the total number of words in a text. This ratio is called the term frequency. We can use dplyr to calculate it like this:

tidy_sotu_words %>%
  count(doc_id, word, sort = T)  %>%  # count occurrence of word and sort descending
  group_by(doc_id) %>% 
  mutate(n_tot = sum(n),              # count total number of words per doc
         term_freq = n/n_tot)
#> # A tibble: 358,186 × 5
#> # Groups:   doc_id [240]
#>    doc_id                       word               n n_tot term_freq
#>    <chr>                        <chr>          <int> <int>     <dbl>
#>  1 harry-s-truman-1946.txt      dollars          207 12614   0.0164 
#>  2 jimmy-carter-1980b.txt       congress         204 16128   0.0126 
#>  3 harry-s-truman-1946.txt      war              201 12614   0.0159 
#>  4 william-howard-taft-1910.txt government       164 11178   0.0147 
#>  5 james-k-polk-1846.txt        mexico           158  7023   0.0225 
#>  6 richard-m-nixon-1974b.txt    federal          141  9996   0.0141 
#>  7 harry-s-truman-1946.txt      million          138 12614   0.0109 
#>  8 harry-s-truman-1946.txt      fiscal           129 12614   0.0102 
#>  9 jimmy-carter-1981.txt        administration   129 16595   0.00777
#> 10 william-howard-taft-1912.txt government       129 10215   0.0126 
#> # … with 358,176 more rows

Let’s plot the distribution of the term frequency for the speeches:

tidy_sotu_words %>%
  count(doc_id, word)  %>%  # count n for each word
  group_by(doc_id) %>% 
  mutate(n_tot = sum(n), # count total number of words per doc
         term_freq = n/n_tot) %>% 
  ggplot(aes(term_freq)) +
    geom_histogram() 

This distribution makes sense. Many words are used relatively rarely in a text. Only a few have a high term frequency.

Assuming that terms with high relative frequency are an indicator of significance, we can find the term with the highest term frequency for each president:

tidy_sotu_words %>%
  count(president, word)  %>%  # count n for each word
  group_by(president) %>% 
  mutate(n_tot = sum(n), # count total number of words per president
         term_freq = n/n_tot) %>% 
  arrange(desc(term_freq)) %>% # sort by term frequency
  top_n(1) %>%  # take the top for each president
  print(n = Inf) # print all rows
#> # A tibble: 44 × 5
#> # Groups:   president [42]
#>    president             word           n n_tot term_freq
#>    <chr>                 <chr>      <int> <int>     <dbl>
#>  1 John Adams            united        49  2768   0.0177 
#>  2 John Tyler            government   209 12596   0.0166 
#>  3 Martin Van Buren      government   256 17145   0.0149 
#>  4 William J. Clinton    people       336 22713   0.0148 
#>  5 Franklin D. Roosevelt war          283 19311   0.0147 
#>  6 William McKinley      government   452 31188   0.0145 
#>  7 Andrew Jackson        government   436 31031   0.0141 
#>  8 Donald Trump          american     135  9690   0.0139 
#>  9 Andrew Johnson        government   207 14968   0.0138 
#> 10 George Washington     united        86  6226   0.0138 
#> 11 Calvin Coolidge       government   274 20518   0.0134 
#> 12 James K. Polk         mexico       360 27679   0.0130 
#> 13 James Buchanan        government   279 21636   0.0129 
#> 14 Zachary Taylor        congress      38  2948   0.0129 
#> 15 Ulysses S. Grant      united       359 27933   0.0129 
#> 16 William Howard Taft   government   461 36506   0.0126 
#> 17 Grover Cleveland      government   574 45889   0.0125 
#> 18 Franklin Pierce       united       200 16240   0.0123 
#> 19 George Bush           world         82  6706   0.0122 
#> 20 James Monroe          united       184 15157   0.0121 
#> 21 George W. Bush        america      209 17265   0.0121 
#> 22 Millard Fillmore      government   135 11986   0.0113 
#> 23 John Quincy Adams     congress     131 11788   0.0111 
#> 24 Harry S Truman        war          308 27819   0.0111 
#> 25 Gerald R. Ford        federal       65  5879   0.0111 
#> 26 Herbert Hoover        government   121 10947   0.0111 
#> 27 Rutherford B. Hayes   congress     194 17644   0.0110 
#> 28 Chester A. Arthur     government   185 16961   0.0109 
#> 29 Lyndon B. Johnson     congress     115 11207   0.0103 
#> 30 James Madison         war           85  8327   0.0102 
#> 31 Barack Obama          america      204 20529   0.00994
#> 32 Benjamin Harrison     government   209 21230   0.00984
#> 33 Richard M. Nixon      federal      232 23701   0.00979
#> 34 Jimmy Carter          congress     518 53710   0.00964
#> 35 John F. Kennedy       world         68  7302   0.00931
#> 36 Theodore Roosevelt    government   528 58848   0.00897
#> 37 Ronald Reagan         government   133 15005   0.00886
#> 38 Ronald Reagan         people       133 15005   0.00886
#> 39 Woodrow Wilson        government   105 11982   0.00876
#> 40 Warren G. Harding     public        39  4583   0.00851
#> 41 Dwight D. Eisenhower  world        204 24410   0.00836
#> 42 Thomas Jefferson      country       58  7418   0.00782
#> 43 Abraham Lincoln       congress      81 10462   0.00774
#> 44 Abraham Lincoln       united        81 10462   0.00774

CHALLENGE: Pick one president. For each of his speeches, which is the term with highest term frequency? Create a table as output. (Hint: top_n might be useful)

2.3 Tf-idf

So far we’ve been looking at term frequency per document. What if we want to know about words that seem more important based on the contents of the entire corpus?

For this, we can use term frequency–inverse document frequency, also called tf-idf. Tf-idf measures how important a word is within a corpus by scaling its term frequency per document by the inverse of the term’s document frequency (the number of documents within the corpus in which the term appears, divided by the total number of documents).

The tf-idf value will be:

  • lower for words that appear frequently in many documents of the corpus, and lowest when the word occurs in virtually all documents.
  • higher for words that appear frequently in just a few documents of the corpus, thus lending high discriminatory power to those few documents.

The intuition here is that if a term appears frequently in a document, we think it is important; but if that word also appears in many other documents, it is not that unique and thus perhaps not that important.
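
As a minimal sketch of the arithmetic, here is the calculation for the word ‘mexico’ in James K. Polk’s 1846 speech, using the counts from the term-frequency table above; the number of speeches that contain the term is an assumption added here purely for illustration:

n        <- 158                    # occurrences of 'mexico' in this speech (from the table above)
n_total  <- 7023                   # total words in this speech (from the table above)
n_docs   <- 240                    # number of speeches in the corpus
n_docs_w <- 108                    # speeches containing 'mexico' -- assumed for illustration

tf     <- n / n_total              # term frequency
idf    <- log(n_docs / n_docs_w)   # inverse document frequency (natural log)
tf_idf <- tf * idf                 # tf-idf under these assumptions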

The tidytext package includes a function bind_tf_idf. It takes a table that contains one-row-per-term-per-document, the name of the column that contains the words (terms), the name of the column which contains the doc-id, and the name of the column that contains the document-term counts.

So below we aggregate our tibble with the word tokens to create the one-row-per-term-per-document table and then pipe it into the bind_tf_idf function.

tidy_sotu_words %>%
  count(doc_id, word, sort = TRUE)  %>%  # aggregate to count n for each word
  bind_tf_idf(word, doc_id, n) 
#> # A tibble: 358,186 × 6
#>    doc_id                       word               n      tf     idf    tf_idf
#>    <chr>                        <chr>          <int>   <dbl>   <dbl>     <dbl>
#>  1 harry-s-truman-1946.txt      dollars          207 0.0164  0.598   0.00981  
#>  2 jimmy-carter-1980b.txt       congress         204 0.0126  0.00418 0.0000528
#>  3 harry-s-truman-1946.txt      war              201 0.0159  0.0339  0.000540 
#>  4 william-howard-taft-1910.txt government       164 0.0147  0.00418 0.0000613
#>  5 james-k-polk-1846.txt        mexico           158 0.0225  0.799   0.0180   
#>  6 richard-m-nixon-1974b.txt    federal          141 0.0141  0.293   0.00414  
#>  7 harry-s-truman-1946.txt      million          138 0.0109  0.710   0.00777  
#>  8 harry-s-truman-1946.txt      fiscal           129 0.0102  0.511   0.00522  
#>  9 jimmy-carter-1981.txt        administration   129 0.00777 0.277   0.00215  
#> 10 william-howard-taft-1912.txt government       129 0.0126  0.00418 0.0000527
#> # … with 358,176 more rows

The function added three columns to the aggregated table, containing the term frequency (tf), inverse document frequency (idf), and tf-idf (tf_idf).

Let’s look at some of the words in the corpus that have the highest tf-idf scores, which means words that are particularly distinctive for their documents.

tidy_sotu_words %>%
  count(doc_id, word, sort = TRUE)  %>% 
  bind_tf_idf(word, doc_id, n) %>% 
  arrange(desc(tf_idf))
#> # A tibble: 358,186 × 6
#>    doc_id                        word          n      tf   idf tf_idf
#>    <chr>                         <chr>     <int>   <dbl> <dbl>  <dbl>
#>  1 donald-trump-2019.txt         applause    104 0.0424   2.22 0.0942
#>  2 lyndon-b-johnson-1966.txt     vietnam      32 0.0152   2.35 0.0356
#>  3 jimmy-carter-1980a.txt        soviet       31 0.0218   1.49 0.0325
#>  4 george-w-bush-2003.txt        hussein      19 0.00811  3.87 0.0314
#>  5 george-w-bush-2003.txt        saddam       19 0.00811  3.69 0.0299
#>  6 franklin-d-roosevelt-1943.txt 1942         13 0.00758  3.87 0.0294
#>  7 dwight-d-eisenhower-1961.txt  1953         23 0.00747  3.87 0.0289
#>  8 john-adams-1800.txt           gentlemen     8 0.0153   1.77 0.0270
#>  9 benjamin-harrison-1892.txt    1892         40 0.00741  3.53 0.0262
#> 10 franklin-d-roosevelt-1942.txt hitler        7 0.00527  4.79 0.0252
#> # … with 358,176 more rows

To understand why the years show up as particularly distinctive, we would need to look more closely at the speeches themselves and determine whether the years are significant or whether they should be removed from the text, either permanently during cleanup or temporarily using filter().
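
As a quick sketch of the temporary approach, assuming we simply want to drop every token that contains a digit before computing the tf-idf values:

tidy_sotu_words %>%
  filter(!str_detect(word, "[0-9]")) %>%  # drop tokens that contain digits (years, amounts)
  count(doc_id, word, sort = TRUE) %>%
  bind_tf_idf(word, doc_id, n) %>%
  arrange(desc(tf_idf))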

CHALLENGE: Pick the same president you chose above. For each of his speeches, which is the term with highest tf-idf? Create a table as output. (Hint: Remember to group by doc_id before you use top_n)

2.4 N-Grams

We mentioned n-grams in the intro, but let’s revisit them here and take a look at the most common bigrams in the speeches. Remember we can use the unnest_tokens() function on our texts and explicitly tell it to generate bigrams:

sotu_whole %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) # create bigram
#> # A tibble: 1,987,963 × 8
#>        X president        year years_active party      sotu_type doc_id   bigram
#>    <int> <chr>           <int> <chr>        <chr>      <chr>     <chr>    <chr> 
#>  1    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham… fello…
#>  2    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham… citiz…
#>  3    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham… of the
#>  4    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham… the s…
#>  5    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham… senat…
#>  6    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham… and h…
#>  7    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham… house…
#>  8    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham… of re…
#>  9    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham… repre…
#> 10    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham… in the
#> # … with 1,987,953 more rows

Let’s see the most common bigrams:

sotu_whole %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% 
  count(bigram, sort = TRUE) # count occurrences and sort descending
#> # A tibble: 475,240 × 2
#>    bigram            n
#>    <chr>         <int>
#>  1 of the        33699
#>  2 in the        12608
#>  3 to the        11684
#>  4 for the        6926
#>  5 and the        6265
#>  6 by the         5625
#>  7 of our         5219
#>  8 the united     4816
#>  9 united states  4808
#> 10 it is          4774
#> # … with 475,230 more rows

OK, so we again need to remove the stopwords. First let us separate the two words into two columns, “word1” and “word2”, with separate() from the tidyr package:

sotu_whole %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% 
  separate(bigram, c("word1", "word2"), sep = " ")
#> # A tibble: 1,987,963 × 9
#>        X president        year years_active party     sotu_…¹ doc_id word1 word2
#>    <int> <chr>           <int> <chr>        <chr>     <chr>   <chr>  <chr> <chr>
#>  1    73 Abraham Lincoln  1861 1861-1865    Republic… written abrah… fell… citi…
#>  2    73 Abraham Lincoln  1861 1861-1865    Republic… written abrah… citi… of   
#>  3    73 Abraham Lincoln  1861 1861-1865    Republic… written abrah… of    the  
#>  4    73 Abraham Lincoln  1861 1861-1865    Republic… written abrah… the   sena…
#>  5    73 Abraham Lincoln  1861 1861-1865    Republic… written abrah… sena… and  
#>  6    73 Abraham Lincoln  1861 1861-1865    Republic… written abrah… and   house
#>  7    73 Abraham Lincoln  1861 1861-1865    Republic… written abrah… house of   
#>  8    73 Abraham Lincoln  1861 1861-1865    Republic… written abrah… of    repr…
#>  9    73 Abraham Lincoln  1861 1861-1865    Republic… written abrah… repr… in   
#> 10    73 Abraham Lincoln  1861 1861-1865    Republic… written abrah… in    the  
#> # … with 1,987,953 more rows, and abbreviated variable name ¹​sotu_type

Now we use dplyr’s filter() function to keep only the rows where neither word is a stopword.

sotu_whole %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% 
  separate(bigram, c("word1", "word2"), sep = " ") %>% # separate into cols
  filter(!word1 %in% stop_words$word, # remove stopwords
         !word2 %in% stop_words$word)
#> # A tibble: 219,471 × 9
#>        X president        year years_active party     sotu_…¹ doc_id word1 word2
#>    <int> <chr>           <int> <chr>        <chr>     <chr>   <chr>  <chr> <chr>
#>  1    73 Abraham Lincoln  1861 1861-1865    Republic… written abrah… fell… citi…
#>  2    73 Abraham Lincoln  1861 1861-1865    Republic… written abrah… unpr… poli…
#>  3    73 Abraham Lincoln  1861 1861-1865    Republic… written abrah… poli… trou…
#>  4    73 Abraham Lincoln  1861 1861-1865    Republic… written abrah… abun… harv…
#>  5    73 Abraham Lincoln  1861 1861-1865    Republic… written abrah… pecu… exig…
#>  6    73 Abraham Lincoln  1861 1861-1865    Republic… written abrah… fore… nati…
#>  7    73 Abraham Lincoln  1861 1861-1865    Republic… written abrah… prof… soli…
#>  8    73 Abraham Lincoln  1861 1861-1865    Republic… written abrah… soli… chie…
#>  9    73 Abraham Lincoln  1861 1861-1865    Republic… written abrah… dome… affa…
#> 10    73 Abraham Lincoln  1861 1861-1865    Republic… written abrah… disl… port…
#> # … with 219,461 more rows, and abbreviated variable name ¹​sotu_type

Lastly, we unite the two word columns back into bigrams and save the result in a new table, sotu_bigrams.

sotu_bigrams <- sotu_whole %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% 
  separate(bigram, c("word1", "word2"), sep = " ") %>% # separate into cols
  filter(!word1 %in% stop_words$word, # remove stopwords
         !word2 %in% stop_words$word) %>% 
  unite(bigram, word1, word2, sep = " ")  # combine columns


sotu_bigrams %>% 
  count(bigram, sort = TRUE)
#> # A tibble: 131,899 × 2
#>    bigram                 n
#>    <chr>              <int>
#>  1 federal government   479
#>  2 american people      439
#>  3 june 30              325
#>  4 fellow citizens      299
#>  5 public debt          283
#>  6 public lands         256
#>  7 health care          253
#>  8 social security      233
#>  9 post office          202
#> 10 annual message       200
#> # … with 131,889 more rows

A bigram can also be treated as a term in a document, in the same way that we treated individual words. That means we can look at tf-idf values in the same way. For example, we can find the most distinctive bigrams each president used across all of their speeches taken together.

We count per president and bigram and then add the tf-idf value with the bind_tf_idf function. To get the top bigram for each president, we then group by president, sort by tf-idf, and retrieve the highest value for each.

sotu_bigrams %>%
  count(president, bigram) %>%
  bind_tf_idf(bigram, president, n) %>%
  group_by(president) %>%  
  arrange(desc(tf_idf)) %>% 
  top_n(1)
#> # A tibble: 45 × 6
#> # Groups:   president [42]
#>    president          bigram              n      tf   idf tf_idf
#>    <chr>              <chr>           <int>   <dbl> <dbl>  <dbl>
#>  1 John Adams         john adams          3 0.00510 3.74  0.0191
#>  2 George W. Bush     al qaida           35 0.00628 2.64  0.0166
#>  3 Thomas Jefferson   gun boats           7 0.00462 3.04  0.0141
#>  4 Thomas Jefferson   port towns          7 0.00462 3.04  0.0141
#>  5 Thomas Jefferson   sea port            7 0.00462 3.04  0.0141
#>  6 William J. Clinton 21st century       59 0.00830 1.66  0.0138
#>  7 Zachary Taylor     german empire       5 0.00789 1.66  0.0131
#>  8 Lyndon B. Johnson  south vietnam      13 0.00424 3.04  0.0129
#>  9 James Madison      james madison       8 0.00412 3.04  0.0125
#> 10 Harry S Truman     million dollars   119 0.0129  0.965 0.0124
#> # … with 35 more rows

CHALLENGE: Again, pick the same president you chose above. For each of his speeches, which is the bigram with highest tf-idf? Create a table as output.

2.5 Co-occurrence

Co-occurrences give us a sense of words that appear in the same text, but not necessarily next to each other.

For this section we will make use of the widyr package. The function that helps us here is pairwise_count(), which lets us count how often pairs of words co-appear within the same speech.

Behind the scenes, this function first turns our table into a wide matrix. In our case the rows and columns of that matrix are the individual words, and the cell values are the counts of how many speeches they co-occur in, like this:

#>      we thus have
#> we   NA    4    5
#> thus  4   NA    2
#> have  5    2   NA

It then turns the matrix back into a tidy form, where each row contains a word pair and the count of its co-occurrence. Since we don’t care about the order of the words, we will not count the upper triangle of the wide matrix, which leaves us with:

#>             
#>    we thus 4
#>    we have 5
#>  thus have 2
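
To see this transformation on a small scale, here is a minimal, self-contained sketch with a toy table (the data and column names are made up for illustration):

library(widyr)

toy <- tibble(doc  = c("d1", "d1", "d1", "d2", "d2"),
              word = c("we", "thus", "have", "we", "have"))

toy %>%
  pairwise_count(word, doc, upper = FALSE)  # 'we' and 'have' co-occur in 2 docs, the other pairs in 1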

Since processing the entire corpus would take too long here, we will only look at the last 100 words of each speech: which words occur most commonly together at the end of the speeches?

library(widyr)

sotu_word_pairs <- sotu_whole %>% 
  mutate(speech_end = word(text, -100, end = -1)) %>%  # extract last 100 words
  unnest_tokens(word, speech_end) %>%   # tokenize
  filter(!word %in% stop_words$word) %>%  # remove stopwords
  pairwise_count(word, doc_id, sort = TRUE, upper = FALSE) # don't include upper triangle of matrix

sotu_word_pairs
#> # A tibble: 126,853 × 3
#>    item1      item2       n
#>    <chr>      <chr>   <dbl>
#>  1 god        bless      41
#>  2 god        america    39
#>  3 bless      america    34
#>  4 people     country    27
#>  5 god        nation     24
#>  6 world      god        23
#>  7 god        people     23
#>  8 god        country    22
#>  9 people     america    22
#> 10 government people     21
#> # … with 126,843 more rows

To visualize the co-occurrence network of words that occur together at the end of 10 or more speeches, we use the igraph package to convert our table into a network graph, and the ggraph package, which adds functionality to ggplot2 to make it easier to plot a network.

library(igraph)
library(ggraph)

sotu_word_pairs %>% 
  filter(n >= 10) %>%  # only word pairs that occur 10 or more times
  graph_from_data_frame() %>% #convert to graph
  ggraph(layout = "fr") + # place nodes according to the force-directed algorithm of Fruchterman and Reingold
  geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "tomato") +
  geom_node_point(size = 5) +
  geom_node_text(aes(label = name), repel = TRUE, 
                 point.padding = unit(0.2, "lines")) +
  theme_void()

There are alternative approaches for this as well; see, for example, the findAssocs() function in the tm package.

2.6 Document-Term Matrix

A document-term matrix (DTM) is a format which is frequently used in text analysis. It is a matrix where we can see the counts of each term per document. In a DTM each row represents a document, each column represents a term, and the cell values are the counts of the occurrences of the term for the particular document.

tidytext provides functionality to convert to and from DTMs if, for example, your analysis requires specific functions from a different R package that only work with DTM objects.

The cast_dtm function can be used to create a DTM object from a tidy table.

Let’s assume that for some reason we want to use the findAssocs() function from the tm package.

First we use dplyr to create a table with the document name, the term, and the count.

# make a table with document, term, count
tidy_sotu_words %>% 
  count(doc_id, word) 
#> # A tibble: 358,186 × 3
#>    doc_id                   word               n
#>    <chr>                    <chr>          <int>
#>  1 abraham-lincoln-1861.txt 1,470,018          1
#>  2 abraham-lincoln-1861.txt 1,500              1
#>  3 abraham-lincoln-1861.txt 100,000            1
#>  4 abraham-lincoln-1861.txt 102,532,509.27     1
#>  5 abraham-lincoln-1861.txt 12,528,000         1
#>  6 abraham-lincoln-1861.txt 13,606,759.11      1
#>  7 abraham-lincoln-1861.txt 1830               1
#>  8 abraham-lincoln-1861.txt 1859               1
#>  9 abraham-lincoln-1861.txt 1860               2
#> 10 abraham-lincoln-1861.txt 1861               6
#> # … with 358,176 more rows

Now we cast it as a DTM.

sotu_dtm <- tidy_sotu_words %>% 
  count(doc_id, word) %>% 
  cast_dtm(doc_id, word, n) 

class(sotu_dtm)
#> [1] "DocumentTermMatrix"    "simple_triplet_matrix"

Finally, let’s use it with the tm package:

library(tm)

# look at the terms with tm function
Terms(sotu_dtm) %>% tail()
#> [1] "queretaro"    "refreshments" "schleswig"    "sedulous"     "subagents"   
#> [6] "transcript"
# most frequent terms
findFreqTerms(sotu_dtm, lowfreq = 5000)
#> [1] "congress"   "government" "united"
# find terms associated with "citizen"
findAssocs(sotu_dtm, "citizen", corlimit = 0.5)
#> $citizen
#>        laws citizenship  protection   contained    entitled  government 
#>        0.62        0.59        0.56        0.54        0.53        0.53 
#>    citizens  postmaster     careful    question      report       suits 
#>        0.52        0.52        0.51        0.51        0.51        0.51

Conversely, tidytext implements the tidy function (originally from the broom package) to import DocumentTermMatrix objects back into a tidy table. Note that it only takes the cells from the DTM that are not 0, so there will be no rows with 0 counts.
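
For example, a quick sketch of converting our DTM from above back into a tidy table (the result is a one-row-per-term-per-document table with document, term, and count columns):

tidy(sotu_dtm)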

2.7 Sentiment analysis

tidytext comes with a dataset sentiments which contains a sentiment lexicon, where each word is assigned a sentiment, like this:

sentiments
#> # A tibble: 6,786 × 2
#>    word        sentiment
#>    <chr>       <chr>    
#>  1 2-faces     negative 
#>  2 abnormal    negative 
#>  3 abolish     negative 
#>  4 abominable  negative 
#>  5 abominably  negative 
#>  6 abominate   negative 
#>  7 abomination negative 
#>  8 abort       negative 
#>  9 aborted     negative 
#> 10 aborts      negative 
#> # … with 6,776 more rows

Here we will take a look at how the sentiment of the speeches changes over time. We will use the lexicon from Bing Liu and collaborators, which assigns a positive or negative label to each word:

bing_lex <- get_sentiments("bing")
bing_lex
#> # A tibble: 6,786 × 2
#>    word        sentiment
#>    <chr>       <chr>    
#>  1 2-faces     negative 
#>  2 abnormal    negative 
#>  3 abolish     negative 
#>  4 abominable  negative 
#>  5 abominably  negative 
#>  6 abominate   negative 
#>  7 abomination negative 
#>  8 abort       negative 
#>  9 aborted     negative 
#> 10 aborts      negative 
#> # … with 6,776 more rows

We can take these sentiments attached to each word and join them to the words of our speeches. We will use inner_join from dplyr. It keeps all rows from tidy_sotu_words whose word matches a word in bing_lex, eliminating rows where the word cannot be found in the lexicon. Since the columns to join on have the same name (word), we don’t need to name it explicitly.

sotu_sentiments <- tidy_sotu_words %>% 
  inner_join(bing_lex)  # join to add sentiment column

sotu_sentiments
#> # A tibble: 106,649 × 9
#>        X president        year years_active party   sotu_…¹ doc_id word  senti…²
#>    <int> <chr>           <int> <chr>        <chr>   <chr>   <chr>  <chr> <chr>  
#>  1    73 Abraham Lincoln  1861 1861-1865    Republ… written abrah… trou… negati…
#>  2    73 Abraham Lincoln  1861 1861-1865    Republ… written abrah… grat… positi…
#>  3    73 Abraham Lincoln  1861 1861-1865    Republ… written abrah… unus… negati…
#>  4    73 Abraham Lincoln  1861 1861-1865    Republ… written abrah… abun… positi…
#>  5    73 Abraham Lincoln  1861 1861-1865    Republ… written abrah… pecu… negati…
#>  6    73 Abraham Lincoln  1861 1861-1865    Republ… written abrah… prof… positi…
#>  7    73 Abraham Lincoln  1861 1861-1865    Republ… written abrah… soli… negati…
#>  8    73 Abraham Lincoln  1861 1861-1865    Republ… written abrah… disl… negati…
#>  9    73 Abraham Lincoln  1861 1861-1865    Republ… written abrah… dest… negati…
#> 10    73 Abraham Lincoln  1861 1861-1865    Republ… written abrah… disr… negati…
#> # … with 106,639 more rows, and abbreviated variable names ¹​sotu_type,
#> #   ²​sentiment

Finally, we can visualize the proportion of positive sentiment (out of the total of positive and negative words) in US State of the Union Addresses over time like this:

sotu_sentiments %>% 
  count(year, sentiment) %>% # count by year and sentiment
  pivot_wider(names_from = "sentiment", values_from = "n") %>% # create columns for positive
                                                               # and negative counts
  mutate(positive_ratio = positive/(negative + positive)) %>% # calculate positive ratio
  # plot
  ggplot(aes(year, positive_ratio)) +
    geom_line(color="gray") +
    geom_smooth(span = 0.3, se = FALSE) + # smooth for easier viewing
    geom_hline(yintercept = .5, linetype="dotted", color = "orange", size = 1) + # .5 as reference
    scale_x_continuous(breaks = seq(1790, 2016, by = 10)) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))