Chapter 1 Preparing Textual Data

Learning Objectives

  • read textual data into R using readtext
  • use the stringr package to prepare strings for processing
  • use tidytext functions to tokenize texts and remove stopwords
  • use SnowballC to stem words

We’ll use several R packages in this section:

  • sotu will provide the metadata and text of State of the Union speeches ranging from George Washington to Barack Obama.
  • tidyverse is a collection of R packages designed for data science, including dplyr with a set of verbs for common data manipulations and ggplot2 for visualization.
  • tidytext provides specific functions for a “tidy” approach to working with textual data, where one row represents one “token” or meaningful unit of text, for example a word.
  • readtext provides a function well suited to reading textual data from a large number of formats into R, including metadata.
library(sotu)
library(tidyverse)
library(tidytext)
library(readtext)

1.1 Reading text into R

First, let’s look at the data in the sotu package. The metadata and texts are contained in this package separately, in sotu_meta and sotu_text respectively. We can take a look at them by typing their names or by using functions like glimpse() or str(). Below, for example, is what the metadata looks like. Can you tell how many speeches there are?

# Let's take a look at the state of the union metadata
str(sotu_meta)
#> 'data.frame':    240 obs. of  6 variables:
#>  $ X           : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ president   : chr  "George Washington" "George Washington" "George Washington" "George Washington" ...
#>  $ year        : int  1790 1790 1791 1792 1793 1794 1795 1796 1797 1798 ...
#>  $ years_active: chr  "1789-1793" "1789-1793" "1789-1793" "1789-1793" ...
#>  $ party       : chr  "Nonpartisan" "Nonpartisan" "Nonpartisan" "Nonpartisan" ...
#>  $ sotu_type   : chr  "speech" "speech" "speech" "speech" ...

In order to work with the speech texts, and to later practice reading text files from disk, we use the function sotu_dir() to write the texts out. By default this function writes to a temporary directory, with one speech in each file. It returns a character vector in which each element is the path to an individual speech file. We save this vector in the file_paths variable.

# sotu_dir writes the text files to disk in a temporary dir, 
# but you could also specify a location.
file_paths <- sotu_dir()
head(file_paths)
#> [1] "/var/folders/b5/fxcv6x555j51n30f4nqq4dqr0000gp/T//RtmpHnpobH/file6fd7c35da27/george-washington-1790a.txt"
#> [2] "/var/folders/b5/fxcv6x555j51n30f4nqq4dqr0000gp/T//RtmpHnpobH/file6fd7c35da27/george-washington-1790b.txt"
#> [3] "/var/folders/b5/fxcv6x555j51n30f4nqq4dqr0000gp/T//RtmpHnpobH/file6fd7c35da27/george-washington-1791.txt" 
#> [4] "/var/folders/b5/fxcv6x555j51n30f4nqq4dqr0000gp/T//RtmpHnpobH/file6fd7c35da27/george-washington-1792.txt" 
#> [5] "/var/folders/b5/fxcv6x555j51n30f4nqq4dqr0000gp/T//RtmpHnpobH/file6fd7c35da27/george-washington-1793.txt" 
#> [6] "/var/folders/b5/fxcv6x555j51n30f4nqq4dqr0000gp/T//RtmpHnpobH/file6fd7c35da27/george-washington-1794.txt"

Now that we have the files on disk and a vector of filepaths, we can pass this vector directly into readtext to read the texts into a new variable.

# let's read in the files with readtext
sotu_texts <- readtext(file_paths)

readtext() generated a dataframe for us with 2 columns: doc_id, which is the name of the document, and text, which holds the actual text:

glimpse(sotu_texts)
#> Rows: 240
#> Columns: 2
#> $ doc_id <chr> "abraham-lincoln-1861.txt", "abraham-lincoln-1862.txt", "abraha…
#> $ text   <chr> "\n\n Fellow-Citizens of the Senate and House of Representative…

To work with a single table, we combine the text and metadata. Our sotu_texts are in alphabetical order, so we sort the metadata in sotu_meta to match that order and then bind the columns.

sotu_whole <- 
  sotu_meta %>%  
  arrange(president) %>% # sort metadata
  bind_cols(sotu_texts) %>% # combine with texts
  as_tibble() # convert to tibble for better screen viewing

glimpse(sotu_whole)
#> Rows: 240
#> Columns: 8
#> $ X            <int> 73, 74, 75, 76, 41, 42, 43, 44, 45, 46, 47, 48, 77, 78, 7…
#> $ president    <chr> "Abraham Lincoln", "Abraham Lincoln", "Abraham Lincoln", …
#> $ year         <int> 1861, 1862, 1863, 1864, 1829, 1830, 1831, 1832, 1833, 183…
#> $ years_active <chr> "1861-1865", "1861-1865", "1861-1865", "1861-1865", "1829…
#> $ party        <chr> "Republican", "Republican", "Republican", "Republican", "…
#> $ sotu_type    <chr> "written", "written", "written", "written", "written", "w…
#> $ doc_id       <chr> "abraham-lincoln-1861.txt", "abraham-lincoln-1862.txt", "…
#> $ text         <chr> "\n\n Fellow-Citizens of the Senate and House of Represen…

Now that we have our data combined, we can start looking at the text. Typically quite a bit of effort goes into pre-processing the text for further analysis. Depending on the quality of your data and your goal, you might for example need to:

  • replace certain characters or words,
  • remove urls or certain numbers, such as phone numbers,
  • clean up misspellings or errors,
  • etc.

There are several ways to handle this sort of cleaning; we’ll show a few examples below.
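
For instance, removing URLs from a string could look like this. This is a minimal sketch on a made-up sample string, and the regular expression is a simple illustration rather than a robust URL matcher:

# remove anything that looks like an http(s) URL
str_remove_all("See https://example.com for details", "https?://\\S+")
#> [1] "See  for details"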

1.2 String operations

R has many functions available to manipulate strings including functions like grep and paste, which come with the R base install.

Here we will take a look at the stringr package, which is part of the tidyverse. It is built on top of the stringi package, perhaps one of the most comprehensive string manipulation packages available.

Below are examples for a few functions that might be useful.

1.2.1 Counting occurrences

str_count takes a character vector as input and by default counts the number of pattern matches in a string.

How many times does the word “citizen” appear in each of the speeches?

sotu_whole %>% 
    mutate(n_citizen = str_count(text, "citizen")) 
#> # A tibble: 240 × 9
#>        X president      year years_active party sotu_type doc_id text  n_citizen
#>    <int> <chr>         <int> <chr>        <chr> <chr>     <chr>  <chr>     <int>
#>  1    73 Abraham Linc…  1861 1861-1865    Repu… written   abrah… "\n\…         9
#>  2    74 Abraham Linc…  1862 1861-1865    Repu… written   abrah… "\n\…         7
#>  3    75 Abraham Linc…  1863 1861-1865    Repu… written   abrah… "\n\…        15
#>  4    76 Abraham Linc…  1864 1861-1865    Repu… written   abrah… "\n\…         3
#>  5    41 Andrew Jacks…  1829 1829-1833    Demo… written   andre… "\n\…        19
#>  6    42 Andrew Jacks…  1830 1829-1833    Demo… written   andre… "\n\…        14
#>  7    43 Andrew Jacks…  1831 1829-1833    Demo… written   andre… "\n\…        23
#>  8    44 Andrew Jacks…  1832 1829-1833    Demo… written   andre… "\n\…        19
#>  9    45 Andrew Jacks…  1833 1833-1837    Demo… written   andre… "\n\…        14
#> 10    46 Andrew Jacks…  1834 1833-1837    Demo… written   andre… "\n\…        25
#> # ℹ 230 more rows

It is possible to use regular expressions. For example, this is how we would check how many times either “citizen” or “Citizen” appears in each of the speeches:

sotu_whole %>% 
    mutate(n_citizen = str_count(text, "citizen"),
           n_cCitizen = str_count(text, "[Cc]itizen")) 
#> # A tibble: 240 × 10
#>        X president      year years_active party sotu_type doc_id text  n_citizen
#>    <int> <chr>         <int> <chr>        <chr> <chr>     <chr>  <chr>     <int>
#>  1    73 Abraham Linc…  1861 1861-1865    Repu… written   abrah… "\n\…         9
#>  2    74 Abraham Linc…  1862 1861-1865    Repu… written   abrah… "\n\…         7
#>  3    75 Abraham Linc…  1863 1861-1865    Repu… written   abrah… "\n\…        15
#>  4    76 Abraham Linc…  1864 1861-1865    Repu… written   abrah… "\n\…         3
#>  5    41 Andrew Jacks…  1829 1829-1833    Demo… written   andre… "\n\…        19
#>  6    42 Andrew Jacks…  1830 1829-1833    Demo… written   andre… "\n\…        14
#>  7    43 Andrew Jacks…  1831 1829-1833    Demo… written   andre… "\n\…        23
#>  8    44 Andrew Jacks…  1832 1829-1833    Demo… written   andre… "\n\…        19
#>  9    45 Andrew Jacks…  1833 1833-1837    Demo… written   andre… "\n\…        14
#> 10    46 Andrew Jacks…  1834 1833-1837    Demo… written   andre… "\n\…        25
#> # ℹ 230 more rows
#> # ℹ 1 more variable: n_cCitizen <int>

A full treatment of regular expressions is beyond the scope of this introduction. However, we want to point out the str_view() function, which can help you build and test a regular expression. Also see RegExr, an online tool to learn, build, & test regular expressions.
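
As a quick illustration, here is str_view() applied to a made-up sample string; it shows where in the string the pattern matches:

# highlight the matches of the pattern in the string
str_view("Citizens, fellow citizens!", "[Cc]itizen")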

When given boundary() as its pattern, str_count() can count units like “character”, “line_break”, “sentence”, or “word”. Here we add a new column to the dataframe indicating how many words are in each speech:

sotu_whole %>% 
  mutate(n_words = str_count(text, boundary("word"))) 
#> # A tibble: 240 × 9
#>        X president        year years_active party sotu_type doc_id text  n_words
#>    <int> <chr>           <int> <chr>        <chr> <chr>     <chr>  <chr>   <int>
#>  1    73 Abraham Lincoln  1861 1861-1865    Repu… written   abrah… "\n\…    6998
#>  2    74 Abraham Lincoln  1862 1861-1865    Repu… written   abrah… "\n\…    8410
#>  3    75 Abraham Lincoln  1863 1861-1865    Repu… written   abrah… "\n\…    6132
#>  4    76 Abraham Lincoln  1864 1861-1865    Repu… written   abrah… "\n\…    5975
#>  5    41 Andrew Jackson   1829 1829-1833    Demo… written   andre… "\n\…   10547
#>  6    42 Andrew Jackson   1830 1829-1833    Demo… written   andre… "\n\…   15109
#>  7    43 Andrew Jackson   1831 1829-1833    Demo… written   andre… "\n\…    7198
#>  8    44 Andrew Jackson   1832 1829-1833    Demo… written   andre… "\n\…    7887
#>  9    45 Andrew Jackson   1833 1833-1837    Demo… written   andre… "\n\…    7912
#> 10    46 Andrew Jackson   1834 1833-1837    Demo… written   andre… "\n\…   13472
#> # ℹ 230 more rows

CHALLENGE: Use the code above and add another column n_sentences, in which you calculate the number of sentences per speech. Then create a third column avg_word_per_sentence, in which you calculate the number of words per sentence for each speech. Finally, use filter to find which speeches have the shortest and longest average sentence lengths, and what those lengths are.
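
If you get stuck, here is one possible sketch (not the only solution):

# one possible approach to the challenge
sotu_whole %>% 
  mutate(n_words = str_count(text, boundary("word")),
         n_sentences = str_count(text, boundary("sentence")),
         avg_word_per_sentence = n_words / n_sentences) %>% 
  filter(avg_word_per_sentence == min(avg_word_per_sentence) |
           avg_word_per_sentence == max(avg_word_per_sentence)) %>% 
  select(doc_id, avg_word_per_sentence)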

1.2.2 Detecting patterns

str_detect also looks for patterns, but instead of counts it returns a logical vector (TRUE/FALSE) indicating whether or not the pattern is found. So we typically want to use it with the filter “verb” from dplyr.

What are the names of the documents in which neither “citizen” nor “Citizen” occurs?

sotu_whole %>% 
  filter(!str_detect(text, "[Cc]itizen")) %>% 
  select(doc_id) 
#> # A tibble: 11 × 1
#>    doc_id                      
#>    <chr>                       
#>  1 dwight-d-eisenhower-1958.txt
#>  2 gerald-r-ford-1975.txt      
#>  3 richard-m-nixon-1970.txt    
#>  4 richard-m-nixon-1971.txt    
#>  5 richard-m-nixon-1972a.txt   
#>  6 ronald-reagan-1988.txt      
#>  7 woodrow-wilson-1916.txt     
#>  8 woodrow-wilson-1917.txt     
#>  9 woodrow-wilson-1918.txt     
#> 10 woodrow-wilson-1919.txt     
#> 11 woodrow-wilson-1920.txt

1.2.3 Extracting words

The word function extracts words from a character vector. It takes ‘start’ and ‘end’ arguments, which determine the range of the words to be extracted. By default it returns the first word. If, for example, we wanted to extract the first 5 words of each speech, we could add another column to the table like this:

sotu_whole %>% 
  mutate(first_5 = word(text, end = 5)) %>% 
  select(first_5)
#> # A tibble: 240 × 1
#>    first_5                             
#>    <chr>                               
#>  1 "\n\n Fellow-Citizens of the Senate"
#>  2 "\n\n Fellow-Citizens of the Senate"
#>  3 "\n\n Fellow-Citizens of the Senate"
#>  4 "\n\n Fellow-Citizens of the Senate"
#>  5 "\n\n Fellow Citizens of the"       
#>  6 "\n\n Fellow Citizens of the"       
#>  7 "\n\n Fellow Citizens of the"       
#>  8 "\n\n Fellow Citizens of the"       
#>  9 "\n\n Fellow Citizens of the"       
#> 10 "\n\n Fellow Citizens of the"       
#> # ℹ 230 more rows

1.2.4 Replacing and removing characters

Now let’s take a look at text ‘cleaning’ and see if we can improve this output. We will first remove the newline characters (\n). We use the str_replace_all function to replace all the occurrences of the \n pattern with a whitespace " ". We need to escape the backslash with a second backslash so that the pattern reaches the regular expression engine as \n, where it is interpreted as a newline.

sotu_whole %>% 
  mutate(text_clean = str_replace_all(text, "\\n", " "), # replace newline
         first_5 = word(text_clean, end = 5)) %>% 
  select(first_5)
#> # A tibble: 240 × 1
#>    first_5                
#>    <chr>                  
#>  1 "   Fellow-Citizens of"
#>  2 "   Fellow-Citizens of"
#>  3 "   Fellow-Citizens of"
#>  4 "   Fellow-Citizens of"
#>  5 "   Fellow Citizens of"
#>  6 "   Fellow Citizens of"
#>  7 "   Fellow Citizens of"
#>  8 "   Fellow Citizens of"
#>  9 "   Fellow Citizens of"
#> 10 "   Fellow Citizens of"
#> # ℹ 230 more rows

This looks better, but we still have a problem extracting exactly 5 words, because the leading and repeated whitespaces are counted as words. So let’s get rid of any whitespace before and after the string, as well as repeated whitespace within it, with the str_squish() function.

sotu_whole %>% 
  mutate(text_clean = str_replace_all(text, "\\n", " "), # replace newline
         text_clean = str_squish(text_clean),  # remove excess whitespace
         first_5 = word(text_clean, end = 5)) %>% 
  select(first_5)
#> # A tibble: 240 × 1
#>    first_5                          
#>    <chr>                            
#>  1 Fellow-Citizens of the Senate and
#>  2 Fellow-Citizens of the Senate and
#>  3 Fellow-Citizens of the Senate and
#>  4 Fellow-Citizens of the Senate and
#>  5 Fellow Citizens of the Senate    
#>  6 Fellow Citizens of the Senate    
#>  7 Fellow Citizens of the Senate    
#>  8 Fellow Citizens of the Senate    
#>  9 Fellow Citizens of the Senate    
#> 10 Fellow Citizens of the Senate    
#> # ℹ 230 more rows

(For spell checking, take a look at https://CRAN.R-project.org/package=spelling or https://CRAN.R-project.org/package=hunspell.)
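
As a minimal sketch of what such a spell check looks like, here is hunspell() applied to a made-up string with deliberate misspellings; it returns, for each input string, the words it does not recognize:

library(hunspell)
# return the words that are not found in the dictionary
hunspell("An exmaple sentence with a mispelled word")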

1.3 Tokenize

A very common part of preparing your text for analysis involves tokenization. Currently each row of our data contains a single text with its metadata, so the entire speech is the unit of observation. When we tokenize, we break the text down into “tokens” (most commonly single words), so that each row contains a single word with its metadata as the unit of observation.

tidytext provides a function unnest_tokens() to convert our speech table into one that is tokenized. It takes three arguments:

  • a tibble or data frame which contains the text;
  • the name of the newly created column that will contain the tokens;
  • the name of the column within the data frame which contains the text to be tokenized.

In the example below we name the new column to hold the tokens word. Remember that the column that holds the speech is called text.

tidy_sotu <- sotu_whole %>%
  unnest_tokens(word, text)

tidy_sotu
#> # A tibble: 1,988,203 × 8
#>        X president        year years_active party      sotu_type doc_id    word 
#>    <int> <chr>           <int> <chr>        <chr>      <chr>     <chr>     <chr>
#>  1    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham-… fell…
#>  2    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham-… citi…
#>  3    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham-… of   
#>  4    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham-… the  
#>  5    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham-… sena…
#>  6    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham-… and  
#>  7    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham-… house
#>  8    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham-… of   
#>  9    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham-… repr…
#> 10    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham-… in   
#> # ℹ 1,988,193 more rows

Note that the unnest_tokens function didn’t just tokenize our texts at the word level. It also lowercased each word and stripped off the punctuation. We can tell it not to do this by adding the following parameters:

# Word tokenization with punctuation and no lowercasing
sotu_whole %>%
  unnest_tokens(word, text, to_lower = FALSE, strip_punct = FALSE)
#> # A tibble: 2,184,602 × 8
#>        X president        year years_active party      sotu_type doc_id    word 
#>    <int> <chr>           <int> <chr>        <chr>      <chr>     <chr>     <chr>
#>  1    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham-… Fell…
#>  2    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham-… -    
#>  3    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham-… Citi…
#>  4    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham-… of   
#>  5    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham-… the  
#>  6    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham-… Sena…
#>  7    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham-… and  
#>  8    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham-… House
#>  9    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham-… of   
#> 10    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham-… Repr…
#> # ℹ 2,184,592 more rows

We can also tokenize the text at the level of ngrams or sentences, if those are the best units of analysis for our work.

# Sentence tokenization
sotu_whole %>%
  unnest_tokens(sentence, text, token = "sentences", to_lower = FALSE) %>% 
  select(sentence)
#> # A tibble: 70,761 × 1
#>    sentence                                                                     
#>    <chr>                                                                        
#>  1 Fellow-Citizens of the Senate and House of Representatives:   In the midst o…
#>  2 You will not be surprised to learn that in the peculiar exigencies of the ti…
#>  3 A disloyal portion of the American people have during the whole year been en…
#>  4 A nation which endures factious domestic division is exposed to disrespect a…
#>  5 Nations thus tempted to interfere are not always able to resist the counsels…
#>  6 The disloyal citizens of the United States who have offered the ruin of our …
#>  7 If it were just to suppose, as the insurgents have seemed to assume, that fo…
#>  8 If we could dare to believe that foreign nations are actuated by no higher p…
#>  9 The principal lever relied on by the insurgents for exciting foreign nations…
#> 10 Those nations, however, not improbably saw from the first that it was the Un…
#> # ℹ 70,751 more rows
# N-gram tokenization as trigrams
sotu_whole %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>% 
  select(trigram)
#> # A tibble: 1,987,723 × 1
#>    trigram                 
#>    <chr>                   
#>  1 fellow citizens of      
#>  2 citizens of the         
#>  3 of the senate           
#>  4 the senate and          
#>  5 senate and house        
#>  6 and house of            
#>  7 house of representatives
#>  8 of representatives in   
#>  9 representatives in the  
#> 10 in the midst            
#> # ℹ 1,987,713 more rows

(Take note that the trigrams are generated by a moving 3-word window over the text.)
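
To make the moving-window behavior concrete, here is a minimal sketch on a single made-up sentence:

# each trigram shares two words with the next one, because the
# 3-word window advances one word at a time
tibble(text = "we hold these truths to be self evident") %>% 
  unnest_tokens(trigram, text, token = "ngrams", n = 3)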

1.4 Stopwords

Another common task in preparing text for analysis is to remove stopwords. Stopwords are highly common words that are considered to carry little relevant information about the content of a text.

Let’s look at the stopwords that come with the tidytext package to get a sense of what they are.

stop_words
#> # A tibble: 1,149 × 2
#>    word        lexicon
#>    <chr>       <chr>  
#>  1 a           SMART  
#>  2 a's         SMART  
#>  3 able        SMART  
#>  4 about       SMART  
#>  5 above       SMART  
#>  6 according   SMART  
#>  7 accordingly SMART  
#>  8 across      SMART  
#>  9 actually    SMART  
#> 10 after       SMART  
#> # ℹ 1,139 more rows

These are English stopwords, pulled from different lexica (“onix”, “SMART”, or “snowball”). Depending on the type of analysis you’re doing, you might leave these words in or alternatively use your own curated list of stopwords. Stopword lists exist for many languages; see for example the stopwords package in R. For now we will remove all the English stopwords in this table.
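
If you wanted to supplement stop_words with your own domain-specific terms, a sketch might look like this; the extra words here are made up for illustration:

# hypothetical custom additions to the stopword list
my_stop_words <- stop_words %>% 
  bind_rows(tibble(word = c("applause", "laughter"), lexicon = "custom"))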

For this we use anti_join from dplyr. We join and return all rows from our table of tokens tidy_sotu where there are no matching values in our list of stopwords. Both of these tables have one column name in common, word, so by default the join will be on that column, and dplyr will tell us so.

tidy_sotu_words <- tidy_sotu %>% 
  anti_join(stop_words)

tidy_sotu_words
#> # A tibble: 787,851 × 8
#>        X president        year years_active party      sotu_type doc_id    word 
#>    <int> <chr>           <int> <chr>        <chr>      <chr>     <chr>     <chr>
#>  1    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham-… fell…
#>  2    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham-… citi…
#>  3    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham-… sena…
#>  4    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham-… house
#>  5    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham-… repr…
#>  6    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham-… midst
#>  7    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham-… unpr…
#>  8    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham-… poli…
#>  9    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham-… trou…
#> 10    73 Abraham Lincoln  1861 1861-1865    Republican written   abraham-… grat…
#> # ℹ 787,841 more rows

If we compare this with tidy_sotu we see that the records with words like “of”, “the”, “and”, “in” are now removed.

We also went from 1,988,203 to 787,851 rows, which means our corpus contained a lot of stopwords. This is a substantial reduction, so for serious analysis we would want to scrutinize the stopword list carefully and determine whether removing all of these words is appropriate.
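
One way to scrutinize what was removed is semi_join(), the complement of anti_join(), which keeps only the rows that do match the stopword list:

# which stopwords occurred most often in the corpus?
tidy_sotu %>% 
  semi_join(stop_words, by = "word") %>% 
  count(word, sort = TRUE)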

1.5 Word Stemming

Another way you may want to clean your data is to stem your words, that is, to reduce them to their word stem or root form, for example reducing fishing, fished, and fisher to the stem fish.

tidytext does not implement its own word stemmer. Instead it relies on separate packages like hunspell or SnowballC.

We will give an example here for the SnowballC package, which comes with the function wordStem. (hunspell appears to run much slower, and it returns a list instead of a vector, so in this context SnowballC seems to be more convenient.)

library(SnowballC)
tidy_sotu_words %>%
  mutate(word_stem = wordStem(word)) %>% 
  select(word, word_stem)
#> # A tibble: 787,851 × 2
#>    word            word_stem
#>    <chr>           <chr>    
#>  1 fellow          fellow   
#>  2 citizens        citizen  
#>  3 senate          senat    
#>  4 house           hous     
#>  5 representatives repres   
#>  6 midst           midst    
#>  7 unprecedented   unpreced 
#>  8 political       polit    
#>  9 troubles        troubl   
#> 10 gratitude       gratitud 
#> # ℹ 787,841 more rows

Lemmatization takes this a step further. While a stemmer operates on a single word without knowledge of its context, lemmatization attempts to discriminate between words that have different meanings depending on their part of speech. For example, the word “better” has “good” as its lemma, something a stemmer would not detect.

For lemmatization in R, you may want to take a look at the koRpus package, another comprehensive R package for text analysis. It allows you to use TreeTagger, a widely used part-of-speech tagger. For full functionality of the R package, a local installation of TreeTagger is recommended.
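
If a local TreeTagger installation is more than you need, the textstem package offers a simpler dictionary-based lemmatizer; here is a minimal sketch of that alternative (our suggestion, not one of the packages used above):

library(textstem)
# look each word up in a lemma dictionary
lemmatize_words(c("better", "fishing", "fished"))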