How to efficiently sort the characters in a string in R?
Re-implementing using stringi gives a roughly 4x speedup. I also edited sort_cat to use fixed = TRUE in the strsplit call, which makes it a little faster. And thanks to Carl for the single-loop suggestion, which speeds things up a little more.
sort_cat <- function(strings){
  tmp <- strsplit(strings, split = "", fixed = TRUE)
  tmp <- lapply(tmp, sort)
  tmp <- lapply(tmp, paste0, collapse = "")
  tmp <- unlist(tmp)
  return(tmp)
}
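As a quick sanity check that the fixed = TRUE tweak doesn't change the output, here's a toy input of my own (not from the question):

sort_cat(c("banana", "cherry"))
# [1] "aaabnn" "cehrry"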
library(stringi)

sort_stringi = function(s) {
  s = stri_split_boundaries(s, type = "character")
  s = lapply(s, stri_sort)
  s = lapply(s, stri_join, collapse = "")
  unlist(s)
}

sort_stringi_loop = function(s) {
  s = stri_split_boundaries(s, type = "character")
  for (i in seq_along(s)) {
    s[[i]] = stri_join(stri_sort(s[[i]]), collapse = "")
  }
  unlist(s)
}
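The benchmark below uses the strings vector from the original question. To make it self-contained, a hypothetical stand-in such as a million random 10-character strings behaves similarly:

# Stand-in for the question's `strings` vector (my own construction):
set.seed(1)
strings <- stri_rand_strings(1e6, 10)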
bench::mark(
  sort_cat(strings),
  sort_stringi(strings),
  sort_stringi_loop(strings)
)
# # A tibble: 3 x 13
#   expression                    min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory
#   <bch:expr>                 <bch:> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>
# 1 sort_cat(strings)          23.01s 23.01s    0.0435    31.2MB     2.17     1    50     23.01s <chr ~ <Rpro~
# 2 sort_stringi(strings)       6.16s  6.16s    0.162     30.5MB     2.11     1    13      6.16s <chr ~ <Rpro~
# 3 sort_stringi_loop(strings)  5.75s  5.75s    0.174     15.3MB     1.74     1    10      5.75s <chr ~ <Rpro~
# # ... with 2 more variables: time <list>, gc <list>
This method could also be used in parallel. Profiling the code to see which operations actually take the longest would be a good next step if you want to go even faster.
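For instance, here is one minimal sketch of the parallel idea; the chunking helper sort_stringi_parallel is my own, not part of the answer above. It assumes a Unix-alike, since parallel::mclapply relies on forking and only runs with mc.cores = 1 on Windows.

library(parallel)

sort_stringi_parallel <- function(s, cores = 2L) {
  # Split the input into `cores` contiguous chunks, sort each chunk in a
  # forked worker, then stitch the results back together in order.
  chunks <- split(s, cut(seq_along(s), cores, labels = FALSE))
  unlist(mclapply(chunks, sort_stringi_loop, mc.cores = cores),
         use.names = FALSE)
}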
You can certainly reduce time by minimizing the number of loops, and reduce it further by using the parallel package. My approach would be to split the strings once, then sort and paste inside the loop:
sort_cat <- function(strings){
  tmp <- strsplit(strings, split = "")
  tmp <- lapply(tmp, sort)
  tmp <- lapply(tmp, paste0, collapse = "")
  tmp <- unlist(tmp)
  return(tmp)
}
library(stringi)
library(parallel)  # for mcMap

sort_cat2 <- function(strings){
  unlist(mcMap(function(i){
    stri_join(sort(i), collapse = "")
  }, stri_split_regex(strings, "|", omit_empty = TRUE, simplify = FALSE), mc.cores = 8L))
}
> microbenchmark::microbenchmark(
+   old = sort_cat(strings[1:500000]),
+   new = sort_cat2(strings[1:500000]),
+   times = 1
+ )
Unit: seconds
 expr        min         lq       mean     median         uq        max neval
  old 9.62673395 9.62673395 9.62673395 9.62673395 9.62673395 9.62673395     1
  new 5.10547437 5.10547437 5.10547437 5.10547437 5.10547437 5.10547437     1
That shaves off about 4 seconds, but it's still not that fast...
Edit
Okay, I got it way down using apply. The strategy here:
1) Extract letters rather than splitting on boundaries
2) Create a matrix from the results
3) Iterate through it row-wise
4) Sort
5) Join
This avoids multiple loops and unlisting. The caveat is that if the strings have different lengths, you'll need to remove any empty or NA entries within the apply, such as i[!is.na(i) & nchar(i) > 0] (a sketch with that filter follows sort_cat3 below).
sort_cat3 <- function(strings){
  apply(stri_extract_all_regex(strings, "\\p{L}", simplify = TRUE), 1, function(i){
    stri_join(stri_sort(i), collapse = "")
  })
}
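For the unequal-length caveat above, a sketch with the suggested filter applied inside the apply call (sort_cat3_safe is a hypothetical name of mine; the filter is the one from the caveat):

sort_cat3_safe <- function(strings){
  apply(stri_extract_all_regex(strings, "\\p{L}", simplify = TRUE), 1, function(i){
    i <- i[!is.na(i) & nchar(i) > 0]  # drop NA padding from shorter strings
    stri_join(stri_sort(i), collapse = "")
  })
}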
> microbenchmark::microbenchmark(
+   old = sort_cat(strings[1:500000]),
+   mapping = sort_cat2(strings[1:500000]),
+   applying = sort_cat3(strings[1:500000]),
+   times = 1
+ )
Unit: seconds
     expr         min          lq        mean      median          uq         max neval
      old 10.35101934 10.35101934 10.35101934 10.35101934 10.35101934 10.35101934     1
  mapping  5.12771799  5.12771799  5.12771799  5.12771799  5.12771799  5.12771799     1
 applying  3.97775326  3.97775326  3.97775326  3.97775326  3.97775326  3.97775326     1
That takes us from 10.35 seconds down to 3.98.