Optimized version of grep to match vector against vector

Since you don't use regular expressions but want to find substrings in longer strings, you can use fixed = TRUE. It is much faster.

library(microbenchmark)
microbenchmark(lapply(a, grep, b),                 # original
                 lapply(paste0("^", a), grep, b),  # @flodel
                 lapply(a, grep, b, fixed = TRUE))

Unit: microseconds
                             expr     min       lq   median       uq     max neval
               lapply(a, grep, b) 112.633 114.2340 114.9390 116.0990 326.857   100
  lapply(paste0("^", a), grep, b) 119.949 121.7380 122.7425 123.9775 191.851   100
 lapply(a, grep, b, fixed = TRUE)  21.004  22.5885  23.8580  24.6110  33.608   100

Testing with longer vectors (1000 times the original length).

ar <- rep(a, 1000)
br <- rep(b, 1000)

library(microbenchmark)
microbenchmark(lapply(ar, grep, br),               # original
               lapply(paste0("^", ar), grep, br),  # @flodel
               lapply(ar, grep, br, fixed = TRUE))

Unit: seconds
                               expr       min        lq    median       uq       max neval
               lapply(ar, grep, br) 32.288139 32.564223 32.726149 32.97529 37.818299   100
  lapply(paste0("^", ar), grep, br) 24.997339 25.343401 25.531138 25.71615 28.238802   100
 lapply(ar, grep, br, fixed = TRUE)  2.461934  2.494759  2.513931  2.55375  4.194093   100

(This took quite a while...)

Following on my last suggestion...

The big problem with what you are asking is that, a priori, you need to do length(a) * length(b) comparisons. However, you can take advantage of the fact that the matches here will only happen at the beginning of the strings (what I gathered from the comments.).

I suggested you first split your a and b vectors into lists, after looking at the first word ( "Or", "Gr", "Control", "PMT", etc.) in each item, then only look for matches in the corresponding sets. In other words, take the items in a that start with Or_ and only look for matches in the items in b that also start with Or_.

To give you an idea of why this is efficient in terms of complexity. Imagine a and b both have length n; that there are x possible prefixes, uniformly distributed throughout a and b. Then you would only have to do x * (n/x * n/x) comparisons versus n * n in your case. That's x times fewer comparisons. And you could even imagine repeating the process using the second word, third, etc. in a recursive way.

Now here is the code for it:

reduced.match <- function(a, b) {

   first.word <- function(string) sub("_.*", "", string)

   a.first <- first.word(a)
   b.first <- first.word(b)
   l.first <- unique(c(a.first, b.first))
   a.first <- factor(a.first, l.first)
   b.first <- factor(b.first, l.first)
   a.split <- split(a, a.first)
   b.split <- split(b, b.first)
   a.idx.split <- split(seq_along(a), a.first)
   b.idx.split <- split(seq_along(b), b.first)

   unsorted.matches <-
     Map(function(a, b, i) lapply(a, function(x) i[grep(x, b, fixed = TRUE)]),
         a.split, b.split, b.idx.split, USE.NAMES = FALSE)

   sorted.matches <-
     unlist(unsorted.matches, recursive = FALSE)[
       match(seq_along(a), unlist(a.idx.split))]

   return(sorted.matches)
}

# sample data
set.seed(123)
n <- 10000
words <- paste0(LETTERS, LETTERS, LETTERS)
a <- paste(sample(words[-1], n, TRUE),
           sample(words, n, TRUE), sep = "_")
b <- paste(sample(words[-2], n, TRUE),
           sample(words, n, TRUE), sep = "_")

# testing
identical(reduced.match(a, b), lapply(a, grep, b, fixed = TRUE))
# [1] TRUE

# benchmarks
system.time(reduced.match(a, b))
#    user  system elapsed 
#   0.187   0.000   0.187 
system.time(lapply(a, grep, b, fixed = TRUE))
#    user  system elapsed 
#   2.915   0.002   2.920

If a and b are sorted (and a unique) and one is interested in exact matches at the beginning of the string, then the following C code will usually be relatively efficient (something on the order of length(a) + length(b) string comparisons?). The R wrapper makes sure the C code and R user get appropriate data.

f3 <- local({
    library(inline)
    .amatch <- cfunction(c(a="character", b="character"),
             includes="#include <string.h>", '
         int len_a = Rf_length(a), len_b = Rf_length(b);
         SEXP ans = PROTECT(allocVector(INTSXP, len_b));
         memset(INTEGER(ans), 0, sizeof(int) * len_b);
         int cmp, i = 0, j = 0;
         while (i < len_a) {
             const char *ap = CHAR(STRING_ELT(a, i));
             while (j < len_b) {
                 cmp = strncmp(ap, CHAR(STRING_ELT(b, j)), strlen(ap));
                 if (cmp > 0) {
                     j += 1;
                 } else break;
             }
             if (j == len_b)
                 break;
             if (cmp == 0)
                 INTEGER(ans)[j++] = i + 1;
             else if (cmp < 0) i += 1;
         }
         UNPROTECT(1);
         return(ans);')

    function(a, b) {
        locale = Sys.getlocale("LC_COLLATE")
        if (locale != "C") {
            warning('temporarily trying to set LC_COLLATE to "C"')
            Sys.setlocale("LC_COLLATE", "C")
            on.exit(Sys.setlocale("LC_COLLATE", locale))
        }
        a0 <- a
        lvls <- unique(a)
        a <- sort(lvls)
        o <- order(b)
        idx <- .amatch(a, b[o])[order(o)]
        f <- factor(a[idx[idx != 0]], levels=lvls)
        split(which(idx != 0), f)[a0]
    }
})

In comparison with this semi-friendly grep

f0 <- function(a, b) {
    a0 <- a
    a <- unique(a)
    names(a) <- a
    lapply(a, grep, b, fixed=TRUE)[a0]
}

that allows for (but doesn't pay too much of a price for) duplicate 'a' values the timings for @flodel's data set are

> microbenchmark(f0(a, b), f3(a, b), times=5)
Unit: milliseconds
     expr       min        lq    median        uq       max neval
 f0(a, b) 431.03595 431.45211 432.59346 433.96036 434.87550     5
 f3(a, b)  15.70972  15.75976  15.93179  16.05184  16.06767     5

Unfortunately, this simple algorithm fails when one element is a prefix of another

> str(f0(c("a", "ab"), "abc"))
List of 2
 $ : chr "abc"
 $ : chr "abc"
> str(f3(c("a", "ab"), "abc"))
List of 2
 $ : chr "abc"
 $ : chr(0)

Contrary to a comment, for this data set (the random number seed needs to be specified for reproducibility)

set.seed(123)
categ <- c("Control", "Gr", "Or", "PMT", "P450")
genes <- paste(categ, rep(1:40, each=length(categ)), sep="_")
a0 <- paste0(genes, "_", rep(1:50, each=length(genes)), "_")
b0 <- paste0(a0, "1")
ite <- 50
lg <- 1000
b <- b0[1:lg]
a <- (a0[1:lg])[sample(seq(lg), ite)]

f3() returns the same values as grep

> identical(unname(f3(a, b)), lapply(a, grep, b, fixed=TRUE))
[1] TRUE

The algorithms f0 and f3 have been modified to return indexes in a named list.

Optimized version of grep to match vector against vector

Tags:

Optimization

Regex

R

Vector

Related

Recent Posts