Faster approach than gsub in R
I built two tokenizer functions that differ in only one respect: the first uses gsub, while the second uses str_replace_all from the stringr package.
Here's function number one:
tokenize_gsub <- function(df){
  require(lexicon)
  require(dplyr)
  require(tidyr)
  require(tidytext)

  # custom stop words, removed in addition to tidytext's stop_words
  myStopWords <- c(
    "ø",
    "øthe",
    "iii"
  )

  # combined profanity lexicons shipped with the lexicon package
  profanity <- c(
    profanity_alvarez,
    profanity_arr_bad,
    profanity_banned,
    profanity_racist,
    profanity_zac_anger
  ) %>%
    unique()

  # strip digits, punctuation and parenthesised text, then tokenize and filter
  df %>%
    mutate(text = gsub(x = text, pattern = "[0-9]+|[[:punct:]]|\\(.*\\)", replacement = "")) %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words, by = "word") %>%
    anti_join(tibble(word = profanity), by = "word") %>%
    anti_join(tibble(word = myStopWords), by = "word")
}
Here's function number two:
tokenize_stringr <- function(df){
  require(stringr)
  require(lexicon)
  require(dplyr)
  require(tidyr)
  require(tidytext)

  myStopWords <- c(
    "ø",
    "øthe",
    "iii"
  )

  profanity <- c(
    profanity_alvarez,
    profanity_arr_bad,
    profanity_banned,
    profanity_racist,
    profanity_zac_anger
  ) %>%
    unique()

  df %>%
    mutate(text = str_replace_all(text, "[0-9]+|[[:punct:]]|\\(.*\\)", "")) %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words, by = "word") %>%
    anti_join(tibble(word = profanity), by = "word") %>%
    anti_join(tibble(word = myStopWords), by = "word")
}
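Before benchmarking, a quick sanity check that both functions return the same tokens can be useful. A minimal sketch with a made-up two-row input (only the text column matters to the functions; the sentences are invented for illustration):
library(dplyr)

# tiny made-up input; only the "text" column is used by the tokenizers
toyPosts <- tibble(
  source = c("twitter", "blogs"),
  text = c("Posting 111 from my blog (draft)!",
           "Another post, with punctuation and common stop words...")
)

out_gsub    <- tokenize_gsub(toyPosts)
out_stringr <- tokenize_stringr(toyPosts)

identical(out_gsub, out_stringr)  # should be TRUE: only the replacement engine differs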
Then I used a benchmarking function to compare the performance of the two on a dataset of 4,269,678 social media posts (Twitter, blogs, etc.):
library(microbenchmark)
mc <- microbenchmark(
  gsubOption = tokenize_gsub(englishPosts),
  stringrOption = tokenize_stringr(englishPosts)
)
mc
Here's the output:
Unit: seconds
          expr      min       lq     mean   median       uq      max neval cld
    gsubOption 161.4945 175.3040 211.6979 197.5054 240.6451 376.2927   100   b
 stringrOption 101.4138 117.0748 142.9605 132.4253 159.6291 328.1517   100  a
CONCLUSION: The function str_replace_all is considerably faster than the gsub option under the conditions explained above.
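(Side note for anyone reproducing this: microbenchmark evaluates each expression 100 times by default, which is why neval is 100 above. With calls this slow, the times argument can be lowered for a rough comparison, e.g.:)
mc <- microbenchmark(
  gsubOption = tokenize_gsub(englishPosts),
  stringrOption = tokenize_stringr(englishPosts),
  times = 10  # fewer evaluations than the default 100
)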
This is not a real answer, as I didn't find any method that is always faster. Apparently it depends on the length of your text/vector: with short texts, gsub performs fastest; with longer texts or vectors, sometimes gsub with perl = TRUE and sometimes stri_replace_all_regex is fastest.
Here is some test code to try out:
library(stringi)
text = "(a1,\"something (f fdd71)\");(b2,\"something else (a fa171)\");(b4,\"something else (a fa171)\")"
# text = paste(rep(text, 5), collapse = ",")
# text = rep(text, 100)
nchar(text)
a = gsub(pattern = "[()]", replacement = "", x = text)
b = gsub(pattern = "[()]", replacement = "", x = text, perl = TRUE)
c = stri_replace_all_regex(str = text, pattern = "[()]", replacement = "")
d = stri_replace(str = text, regex = "[()]", replacement = "", mode = "all")
identical(a,b); identical(a,c); identical(a,d)
library(microbenchmark)
mc <- microbenchmark(
  gsub = gsub(pattern = "[()]", replacement = "", x = text),
  gsub_perl = gsub(pattern = "[()]", replacement = "", x = text, perl = TRUE),
  stringi_all = stri_replace_all_regex(str = text, pattern = "[()]", replacement = ""),
  stringi = stri_replace(str = text, regex = "[()]", replacement = "", mode = "all")
)
mc
Unit: microseconds
        expr    min      lq     mean  median     uq     max neval cld
        gsub 10.868 11.7740 13.47869 13.5840 14.490  31.394   100  a
   gsub_perl 79.690 80.2945 82.58225 82.4070 83.312 137.043   100    d
 stringi_all 14.188 14.7920 15.58558 15.5460 16.301  17.509   100   b
     stringi 36.828 38.0350 39.90904 38.7895 39.543 129.194   100     c
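To see the crossover described above, the two commented-out lines in the test code can be used to scale the input up. A rough sketch (the repetition counts are arbitrary, chosen only to make one long string and one long vector from the same sample text):
long_text <- paste(rep(text, 1000), collapse = ",")  # one long string
long_vec  <- rep(text, 10000)                        # many short strings

microbenchmark(
  gsub        = gsub("[()]", "", long_text),
  gsub_perl   = gsub("[()]", "", long_text, perl = TRUE),
  stringi_all = stri_replace_all_regex(long_text, "[()]", ""),
  times = 20
)

microbenchmark(
  gsub        = gsub("[()]", "", long_vec),
  gsub_perl   = gsub("[()]", "", long_vec, perl = TRUE),
  stringi_all = stri_replace_all_regex(long_vec, "[()]", ""),
  times = 20
)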
As mentioned by Jason, stringi is a good option for you. Here is the performance of stringi:
system.time(res <- gsub(pattern, "", sent$words))
   user  system elapsed
 66.229   0.000  66.199
library(stringi)
system.time(stri_replace_all_regex(sent$words, pattern, ""))
   user  system elapsed
 21.246   0.320  21.552
Update (Thanks Arun)
system.time(res <- gsub(pattern, "", sent$words, perl = TRUE))
   user  system elapsed
 12.290   0.000  12.281
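For anyone without the original sent$words and pattern, here is a self-contained sketch of the same comparison; the vector and pattern below are invented stand-ins, not the question's actual data:
library(stringi)

# synthetic stand-in for sent$words: a million short strings with digits and punctuation
words   <- rep("some (text) with 123 numbers, punctuation... and more!", 1e6)
pattern <- "[0-9]+|[[:punct:]]"  # stand-in pattern

system.time(res_base <- gsub(pattern, "", words))
system.time(res_perl <- gsub(pattern, "", words, perl = TRUE))
system.time(res_stri <- stri_replace_all_regex(words, pattern, ""))

# the three engines (TRE, PCRE, ICU) should agree on this simple ASCII pattern
identical(res_base, res_perl); identical(res_base, res_stri)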