How do i optimize the performance of stemming and spell check in R?

You can substantially optimize your code by performing expensive steps on the vocabulary instead of all words in the document. The quanteda package offers a really useful object class or this called tokens:

toks <- quanteda::tokens(sentence)
unclass(toks)
#> $text1
#>  [1]  1  2  3  4  5  4  6  7  8  9 10 11 12  1  2  3  4  5  4  6  7  8  9 10 11
#> [26] 12  1  2  3  4  5  4  6  7  8  9 10 11 12  1  2  3  4  5  4  6  7  8  9 10
#> [51] 11 12  1  2  3  4  5  4  6  7  8  9 10 11 12  1  2  3  4  5  4  6  7  8  9
#> [76] 10 11 12
#> 
#> attr(,"types")
#>  [1] "We"       "aree"     "drivng"   "as"       "fast"     "we"      
#>  [7] "drove"    "yestrday" "or"       "evven"    "fastter"  "zysxzw"  
#> attr(,"padding")
#> [1] FALSE
#> attr(,"what")
#> [1] "word"
#> attr(,"ngrams")
#> [1] 1
#> attr(,"skip")
#> [1] 0
#> attr(,"concatenator")
#> [1] "_"
#> attr(,"docvars")
#> data frame with 0 columns and 1 row

As you can see, text is split into vocabulary (types) and position of the words. We can use this to optimize your code by performing all steps on the types instead of the entire text:

spellAndStem_tokens <- function(sent, language = "en_US") {

  sent_t <- quanteda::tokens(sent)

  # extract types to only work on them
  types <- quanteda::types(sent_t)

  # spelling
  correct <- hunspell_check(
    words = as.character(types), 
    dict = hunspell::dictionary(language)
  )

  pattern <- types[!correct]
  replacement <- sapply(hunspell_suggest(pattern, dict = language), FUN = "[", 1)

  types <- stringi::stri_replace_all_fixed(
    types,
    pattern, 
    replacement,
    vectorize_all = FALSE
  )

  # stemming
  types <- hunspell_stem(types, dict = dictionary(language))


  # replace original tokens
  sent_t_new <- quanteda::tokens_replace(sent_t, quanteda::types(sent_t), as.character(types))

  sent_t_new <- quanteda::tokens_remove(sent_t_new, pattern = "NULL", valuetype = "fixed")

  paste(as.character(sent_t_new), collapse = " ")
}

I'm using the bench package to do the benchmarking as it also checks if the results of the two functions are identical and as I find it more comfortable in general:

res <- bench::mark(
  spellAndStem(sentence),
  spellAndStem_tokens(sentence)
)

res
#> # A tibble: 2 x 6
#>   expression                         min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 spellAndStem(sentence)           807ms    807ms      1.24     259KB        0
#> 2 spellAndStem_tokens(sentence)    148ms    150ms      6.61     289KB        0

summary(res, relative = TRUE)
#> # A tibble: 2 x 6
#>   expression                      min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                    <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
#> 1 spellAndStem(sentence)         5.44   5.37      1         1         NaN
#> 2 spellAndStem_tokens(sentence)  1      1         5.33      1.11      NaN

The new function is 5.44 times faster than the original one. Note though that the difference is getting even more pronounced the larger the input text is:

sentence <- "We aree drivng as fast as we drove yestrday or evven fastter zysxzw" %>%
  rep(times = 600) %>%
  paste(collapse = " ")

res_big <- bench::mark(
  spellAndStem(sentence),
  spellAndStem_tokens(sentence)
)

res_big
#> # A tibble: 2 x 6
#>   expression                         min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 spellAndStem(sentence)         1.27m    1.27m      0.0131  749.81KB        0
#> 2 spellAndStem_tokens(sentence)  178.26ms 182.12ms   5.51      1.94MB        0
summary(res_big, relative = TRUE)
#> # A tibble: 2 x 6
#>   expression                      min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                    <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
#> 1 spellAndStem(sentence)         428.   419.        1       1         NaN
#> 2 spellAndStem_tokens(sentence)   1      1       420.       2.65      NaN

As you can see, the time it takes to process the 100 x bigger sample is almost the same as for the smaller one. This is because the vocabulary between the two is exactly the same. We can extrapolate from this result to your entire dataset assuming this bigger sample represents 100 of your documents. The function should take less than an hour (0.17826 * 14000 / 3600 = 0.69) but the calculation is really imperfect as the actual time it takes to run it on your real data will depend almost solely on the size of the vocabulary.

Besides the programming/performance aspect, I have a few more concerns that might not be applicable in your specific case:

  1. I would suggest changing the last line in the function to sapply(as.list(sent_t_new), paste, collapse = " ") as this will not collapse all documents into one long string but keep them separate.
  2. Currently, your setup removes words where hunspell could not find any suggestions for. I copied this approach (see the tokens_remove command) but you might want to think about at least outputting the discarded words instead of removing them silently.
  3. If the function above is for preparation for some other text analysis, it would make more sense to transform the data directly into a document-term-matrix before stemming and spell checking are performed.
  4. Stemming is just an approximation to lemmatization, which is the process of actually finding the base form of a word. Additionally, stemming usually works quite poorly in German. Depending on what you are doing, you might want to do lemmatization instead (e.g., using spacyr) or simply turning it off since stemming rarely improves results in German.

hunspell_suggest is just an expensive operation since it calculates the distance between your string and every word in the dictionary (see here: https://github.com/ropensci/hunspell/issues/7). When I remove the hunspell_suggest lines, it only takes 25ms on average on my machine. So if you want to speed it up, this is the critical part. Note that it makes a difference how many incorrect words are in the actual documents. Your example with about 50% misspelled words should rather be the exception. Why don't you try the algorithm on the first couple of documents first to have a more realistic time estimate. I assume the language will matter (for your benefit) as there are more words in English than German (think dictionary size).

A simple and obvious thing to do would be to use multiple cores. Something simple as the following with the parallel package already halves the time with my four cores:

sentences <- rep(sentence, 4)
microbenchmark(lapply = lapply(sentences, spellAndStem),
               mclapply = parallel::mclapply(sentences, spellAndStem),
               times = 10)

Unit: seconds
                                        expr      min       lq     mean   median       uq      max neval cld
             lapply(sentences, spellAndStem) 1.967008 2.023291 2.045705 2.051764 2.077168 2.105420    10   b
 parallel::mclapply(sentences, spellAndStem) 1.011945 1.048055 1.078003 1.081850 1.109274 1.135508    10  a 

The suggestion by Andrew Gustar could also work. Even if you just apply the suggest function to a group of documents, this should speed up the calculation significantly. The problem is to separate the documents and put them together after stemming---I guess a "separator" for the documents would just be stemmed and be unrecognizable afterwards. Judging from your question you have already tried this or something similar.

A smaller dictionary could also help, but is probably not a good idea if you want high-quality data.

By the way, I would not consider 11 days to be long for a computation that has to be done only once. You could simply upload the script to a Server that has R installed and let it run there via Rscript from the shell (use nohup to log out again without stopping the process). This is especially true if you have access to a strong "working machine" (e.g. at a university) with many cores.


This uses the idea of only comparing on unique words. To do so, factors are used to determine the unique levels.

  words_fct <- sent %>%
    strsplit(split = " ") %>% 
    unlist(use.names = FALSE) %>%
    factor()

  correct_lvl <- words_fct%>%
    levels()%>%
    hunspell_check(dict = language)

  levels(words_fct)[!correct_lvl] %<>% 
    hunspell_suggest(dict = language) %>%
    sapply("[", 1L)

  levels(words_fct)%<>%
    hunspell_stem(dict = language)%>%
    unlist(use.names = FALSE)

  words_fct%>%
    as.character()%>%
    na.omit()%>%
    paste(collapse = " ")
}

It's slightly faster than @JBGruber's but it is also in many ways is derivative of @JBGruber's answer.

I also like the idea of of using parallel structures for all of your documents. Assuming that each document is a single string of text, this would likely work:

library(future.apply)
plan(multiprocess)
future_lapply(documents, spellAndStem_fcts, language)