Optimized version of grep to match vector against vector
Since you don't use regular expressions but want to find substrings in longer strings, you can use fixed = TRUE
. It is much faster.
library(microbenchmark)
microbenchmark(lapply(a, grep, b), # original
lapply(paste0("^", a), grep, b), # @flodel
lapply(a, grep, b, fixed = TRUE))
Unit: microseconds
expr min lq median uq max neval
lapply(a, grep, b) 112.633 114.2340 114.9390 116.0990 326.857 100
lapply(paste0("^", a), grep, b) 119.949 121.7380 122.7425 123.9775 191.851 100
lapply(a, grep, b, fixed = TRUE) 21.004 22.5885 23.8580 24.6110 33.608 100
Testing with longer vectors (1000 times the original length).
ar <- rep(a, 1000)
br <- rep(b, 1000)
library(microbenchmark)
microbenchmark(lapply(ar, grep, br), # original
lapply(paste0("^", ar), grep, br), # @flodel
lapply(ar, grep, br, fixed = TRUE))
Unit: seconds
expr min lq median uq max neval
lapply(ar, grep, br) 32.288139 32.564223 32.726149 32.97529 37.818299 100
lapply(paste0("^", ar), grep, br) 24.997339 25.343401 25.531138 25.71615 28.238802 100
lapply(ar, grep, br, fixed = TRUE) 2.461934 2.494759 2.513931 2.55375 4.194093 100
(This took quite a while...)
Following on my last suggestion...
The big problem with what you are asking is that, a priori, you need to do length(a) * length(b)
comparisons. However, you can take advantage of the fact that the matches here will only happen at the beginning of the strings (what I gathered from the comments.).
I suggested you first split your a
and b
vectors into lists, after looking at the first word ( "Or", "Gr", "Control", "PMT", etc.) in each item, then only look for matches in the corresponding sets. In other words, take the items in a
that start with Or_
and only look for matches in the items in b
that also start with Or_
.
To give you an idea of why this is efficient in terms of complexity. Imagine a
and b
both have length n
; that there are x
possible prefixes, uniformly distributed throughout a
and b
. Then you would only have to do x * (n/x * n/x)
comparisons versus n * n
in your case. That's x
times fewer comparisons. And you could even imagine repeating the process using the second word, third, etc. in a recursive way.
Now here is the code for it:
reduced.match <- function(a, b) {
first.word <- function(string) sub("_.*", "", string)
a.first <- first.word(a)
b.first <- first.word(b)
l.first <- unique(c(a.first, b.first))
a.first <- factor(a.first, l.first)
b.first <- factor(b.first, l.first)
a.split <- split(a, a.first)
b.split <- split(b, b.first)
a.idx.split <- split(seq_along(a), a.first)
b.idx.split <- split(seq_along(b), b.first)
unsorted.matches <-
Map(function(a, b, i) lapply(a, function(x) i[grep(x, b, fixed = TRUE)]),
a.split, b.split, b.idx.split, USE.NAMES = FALSE)
sorted.matches <-
unlist(unsorted.matches, recursive = FALSE)[
match(seq_along(a), unlist(a.idx.split))]
return(sorted.matches)
}
# sample data
set.seed(123)
n <- 10000
words <- paste0(LETTERS, LETTERS, LETTERS)
a <- paste(sample(words[-1], n, TRUE),
sample(words, n, TRUE), sep = "_")
b <- paste(sample(words[-2], n, TRUE),
sample(words, n, TRUE), sep = "_")
# testing
identical(reduced.match(a, b), lapply(a, grep, b, fixed = TRUE))
# [1] TRUE
# benchmarks
system.time(reduced.match(a, b))
# user system elapsed
# 0.187 0.000 0.187
system.time(lapply(a, grep, b, fixed = TRUE))
# user system elapsed
# 2.915 0.002 2.920
If a and b are sorted (and a unique) and one is interested in exact matches at the beginning of the string, then the following C code will usually be relatively efficient (something on the order of length(a) + length(b) string comparisons?). The R wrapper makes sure the C code and R user get appropriate data.
f3 <- local({
library(inline)
.amatch <- cfunction(c(a="character", b="character"),
includes="#include <string.h>", '
int len_a = Rf_length(a), len_b = Rf_length(b);
SEXP ans = PROTECT(allocVector(INTSXP, len_b));
memset(INTEGER(ans), 0, sizeof(int) * len_b);
int cmp, i = 0, j = 0;
while (i < len_a) {
const char *ap = CHAR(STRING_ELT(a, i));
while (j < len_b) {
cmp = strncmp(ap, CHAR(STRING_ELT(b, j)), strlen(ap));
if (cmp > 0) {
j += 1;
} else break;
}
if (j == len_b)
break;
if (cmp == 0)
INTEGER(ans)[j++] = i + 1;
else if (cmp < 0) i += 1;
}
UNPROTECT(1);
return(ans);')
function(a, b) {
locale = Sys.getlocale("LC_COLLATE")
if (locale != "C") {
warning('temporarily trying to set LC_COLLATE to "C"')
Sys.setlocale("LC_COLLATE", "C")
on.exit(Sys.setlocale("LC_COLLATE", locale))
}
a0 <- a
lvls <- unique(a)
a <- sort(lvls)
o <- order(b)
idx <- .amatch(a, b[o])[order(o)]
f <- factor(a[idx[idx != 0]], levels=lvls)
split(which(idx != 0), f)[a0]
}
})
In comparison with this semi-friendly grep
f0 <- function(a, b) {
a0 <- a
a <- unique(a)
names(a) <- a
lapply(a, grep, b, fixed=TRUE)[a0]
}
that allows for (but doesn't pay too much of a price for) duplicate 'a' values the timings for @flodel's data set are
> microbenchmark(f0(a, b), f3(a, b), times=5)
Unit: milliseconds
expr min lq median uq max neval
f0(a, b) 431.03595 431.45211 432.59346 433.96036 434.87550 5
f3(a, b) 15.70972 15.75976 15.93179 16.05184 16.06767 5
Unfortunately, this simple algorithm fails when one element is a prefix of another
> str(f0(c("a", "ab"), "abc"))
List of 2
$ : chr "abc"
$ : chr "abc"
> str(f3(c("a", "ab"), "abc"))
List of 2
$ : chr "abc"
$ : chr(0)
Contrary to a comment, for this data set (the random number seed needs to be specified for reproducibility)
set.seed(123)
categ <- c("Control", "Gr", "Or", "PMT", "P450")
genes <- paste(categ, rep(1:40, each=length(categ)), sep="_")
a0 <- paste0(genes, "_", rep(1:50, each=length(genes)), "_")
b0 <- paste0(a0, "1")
ite <- 50
lg <- 1000
b <- b0[1:lg]
a <- (a0[1:lg])[sample(seq(lg), ite)]
f3()
returns the same values as grep
> identical(unname(f3(a, b)), lapply(a, grep, b, fixed=TRUE))
[1] TRUE
The algorithms f0 and f3 have been modified to return indexes in a named list.