Faster approach than gsub in R

I built two tokenizer functions that differ in one respect: the first uses gsub, the second uses str_replace_all from the stringr package.
Here's function number one:

tokenize_gsub <- function(df){

    require(lexicon)
    require(dplyr)
    require(tidyr)
    require(tidytext)
    myStopWords <- c(
        "ø",
        "øthe",
        "iii"
    )

    profanity <- c(
        profanity_alvarez,
        profanity_arr_bad,
        profanity_banned,
        profanity_racist,
        profanity_zac_anger
    ) %>%
        unique()

    df %>%
        mutate(text = gsub(x = text, pattern = "[0-9]+|[[:punct:]]|\\(.*\\)", replacement = "")) %>%
        unnest_tokens(word, text) %>%
        anti_join(stop_words, by = "word") %>%
        anti_join(tibble(word = profanity), by = "word") %>%
        anti_join(tibble(word = myStopWords), by = "word")

}

Here's function number two:

tokenize_stringr <- function(df){

    require(stringr)
    require(lexicon)
    require(dplyr)
    require(tidyr)
    require(tidytext)

    myStopWords <- c(
        "ø",
        "øthe",
        "iii"
    )

    profanity <- c(
        profanity_alvarez,
        profanity_arr_bad,
        profanity_banned,
        profanity_racist,
        profanity_zac_anger
    ) %>%
        unique()

    df %>%
        mutate(text = str_replace_all(text, "[0-9]+|[[:punct:]]|\\(.*\\)", "")) %>%
        unnest_tokens(word, text) %>%
        anti_join(stop_words, by = "word") %>%
        anti_join(tibble(word = profanity), by = "word") %>%
        anti_join(tibble(word = myStopWords), by = "word")

}

Then I benchmarked both functions on a dataset of 4,269,678 social media posts (Twitter, blogs, etc.):

library(microbenchmark)
mc <- microbenchmark(
    gsubOption = tokenize_gsub(englishPosts),
    stringrOption = tokenize_stringr(englishPosts)
)

mc

Here's the output:

Unit: seconds
          expr      min       lq     mean   median       uq      max neval cld
    gsubOption 161.4945 175.3040 211.6979 197.5054 240.6451 376.2927   100   b
 stringrOption 101.4138 117.0748 142.9605 132.4253 159.6291 328.1517   100  a

CONCLUSION: str_replace_all is considerably faster than gsub under the conditions described above.
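Since englishPosts is not shared, here is a minimal, self-contained sketch of the same comparison on synthetic data (the corpus size, vocabulary, and `times = 20` are my own choices, and only the substitution step is timed, since that is the only difference between the two tokenizers):

```r
library(stringr)
library(microbenchmark)

# Build a synthetic corpus standing in for englishPosts$text
set.seed(42)
posts <- replicate(
    10000,
    paste(sample(c("hello", "world", "42", "(aside)", "don't", "end."),
                 20, replace = TRUE),
          collapse = " ")
)

pattern <- "[0-9]+|[[:punct:]]|\\(.*\\)"

microbenchmark(
    gsubOption    = gsub(pattern, "", posts),
    stringrOption = str_replace_all(posts, pattern, ""),
    times = 20
)
```

One subtlety worth noting: base gsub defaults to the TRE engine, which uses POSIX leftmost-longest alternation, while stringr uses ICU, which is Perl-style leftmost-first. With an alternation pattern like the one above, the two can differ in output as well as speed, so it is worth checking the results match before trusting the timings.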


This is not a real answer, as I didn't find any method that is always faster. Apparently it depends on the length of your text/vector: with short texts gsub performs fastest, while with longer texts or vectors, sometimes gsub with perl = TRUE and sometimes stri_replace_all_regex is fastest.

Here is some test code to try out:

library(stringi)
text = "(a1,\"something (f fdd71)\");(b2,\"something else (a fa171)\");(b4,\"something else (a fa171)\")"
# text = paste(rep(text, 5), collapse = ",")
# text = rep(text, 100)
nchar(text)

a = gsub(pattern = "[()]", replacement = "", x = text)
b = gsub(pattern = "[()]", replacement = "", x = text, perl = TRUE)
c = stri_replace_all_regex(str = text, pattern = "[()]", replacement = "")
d = stri_replace(str = text, regex = "[()]", replacement = "", mode="all")

identical(a,b); identical(a,c); identical(a,d)

library(microbenchmark)
mc <- microbenchmark(
  gsub = gsub(pattern = "[()]", replacement = "", x = text),
  gsub_perl = gsub(pattern = "[()]", replacement = "", x = text, perl = TRUE),
  stringi_all = stri_replace_all_regex(str = text, pattern = "[()]", replacement = ""),
  stringi = stri_replace(str = text, regex = "[()]", replacement = "", mode="all")
)
mc

Output:
Unit: microseconds
        expr    min      lq     mean  median     uq     max neval  cld
        gsub 10.868 11.7740 13.47869 13.5840 14.490  31.394   100 a   
   gsub_perl 79.690 80.2945 82.58225 82.4070 83.312 137.043   100    d
 stringi_all 14.188 14.7920 15.58558 15.5460 16.301  17.509   100  b  
     stringi 36.828 38.0350 39.90904 38.7895 39.543 129.194   100   c
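The length dependence claimed above is easy to probe by scaling the input vector; the commented-out `rep(...)` lines in the test code hint at this. A sketch (the repetition counts and `times = 20` are arbitrary choices):

```r
library(stringi)
library(microbenchmark)

base <- "(a1,\"something (f fdd71)\");(b2,\"something else (a fa171)\")"

# Rerun the same benchmark at several vector lengths to see
# where the ranking between the methods flips
for (n in c(1, 100, 10000)) {
    text <- rep(base, n)
    cat("vector length:", n, "\n")
    print(microbenchmark(
        gsub        = gsub("[()]", "", text),
        gsub_perl   = gsub("[()]", "", text, perl = TRUE),
        stringi_all = stri_replace_all_regex(text, "[()]", ""),
        times = 20
    ))
}
```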

As mentioned by Jason, stringi is a good option for you.

Here is the performance of stringi compared with base gsub:

system.time(res <- gsub(pattern, "", sent$words))
   user  system elapsed 
 66.229   0.000  66.199 

library(stringi)
system.time(stri_replace_all_regex(sent$words, pattern, ""))
   user  system elapsed 
 21.246   0.320  21.552 

Update (thanks Arun): gsub with perl = TRUE is faster still:

system.time(res <- gsub(pattern, "", sent$words, perl = TRUE))
   user  system elapsed 
 12.290   0.000  12.281 
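Since `pattern` and `sent$words` are not shown in this answer, here is a synthetic stand-in (the data, pattern, and sizes are my own) illustrating the perl = TRUE switch; a simple character-class pattern is used so both engines provably produce identical output:

```r
# Synthetic stand-ins for sent$words and pattern
words   <- rep("foo (bar) 123 baz!", 1e5)
pattern <- "[0-9]+"

system.time(res_base <- gsub(pattern, "", words))
system.time(res_perl <- gsub(pattern, "", words, perl = TRUE))

# Confirm the two engines agree before comparing timings
identical(res_base, res_perl)
```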

Tags: regex, r