More efficient strategy for which() or match()

The solutions offered so far all imply creating a logical(length(vec)) and doing a full or partial scan on this. As you note, the vector is sorted. We can exploit this by doing a binary search. I started thinking I'd be super-clever and implement this in C for even greater speed, but had trouble with debugging the indexing of the algorithm (which is the tricky part!). So I wrote it in R:

f3 <- function(x) {
    imin <- 1L
    imax <- length(x)
    while (imax >= imin) {
        imid <- as.integer(imin + (imax - imin) / 2)
        if (x[imid] >= 0)
            imax <- imid - 1L
        else
            imin <- imid + 1L
    }
    imax
}

For comparison with the other suggestions

f0 <- function(v) length(which(v < 0))
f1 <- function(v) sum(v < 0)
f2 <- function(v) which.min(v < 0) - 1L

and for fun

library(compiler)
f3.c <- cmpfun(f3)

Leading to

> vec <- c(seq(-100,-1,length.out=1e6), rep(0,20), seq(1,100,length.out=1e6))
> identical(f0(vec), f1(vec))
[1] TRUE
> identical(f0(vec), f2(vec))
[1] TRUE
> identical(f0(vec), f3(vec))
[1] TRUE
> identical(f0(vec), f3.c(vec))
[1] TRUE
> microbenchmark(f0(vec), f1(vec), f2(vec), f3(vec), f3.c(vec))
Unit: microseconds
      expr       min        lq     median         uq       max neval
   f0(vec) 15274.275 15347.870 15406.1430 15605.8470 19890.903   100
   f1(vec) 15513.807 15575.229 15651.2970 17064.8830 18326.293   100
   f2(vec) 21473.814 21558.989 21679.3210 22733.1710 27435.889   100
   f3(vec)    51.715    56.050    75.4495    78.5295   100.730   100
 f3.c(vec)    11.612    17.147    28.5570    31.3160    49.781   100

Probably there are some tricky edge cases that I've got wrong! Moving to C, I did

library(inline)
f4 <- cfunction(c(x = "numeric"), "
    int imin = 0, imax = Rf_length(x) - 1, imid;
    while (imax >= imin) {
        imid = imin + (imax - imin) / 2;
        if (REAL(x)[imid] >= 0)
            imax = imid - 1;
        else
            imin = imid + 1;
    }
    return ScalarInteger(imax + 1);
")

with

> identical(f3(vec), f4(vec))
[1] TRUE
> microbenchmark(f3(vec), f3.c(vec), f4(vec))
Unit: nanoseconds
      expr   min      lq  median      uq   max neval
   f3(vec) 52096 53192.0 54918.5 55539.0 69491   100
 f3.c(vec) 10924 12233.5 12869.0 13410.0 20038   100
   f4(vec)   553   796.0   893.5  1004.5  2908   100

findInterval came up when a similar question was asked on the R-help list. It is slow but safe, checking that vec is actually sorted and dealing with NA values. If one wants to live on the edge (arguably no worse that implementing f3 or f4) then

f5.i <- function(v)
    .Internal(findInterval(v, 0 - .Machine$double.neg.eps, FALSE, FALSE))

is nearly as fast as the C implementation, but likely more robust and vectorized (i.e., look up a vector of values in the second argument, for easy range-like calculations).

Use sum() and logical comparison:

sum( vec < 0 )
[1] 100

This will be pretty quick, and when you sum a logical, TRUE is 1 and FALSE is 0 so the total will be the number of negative values.

Uh oh, I feel the need for a benchmarking comparison... :-) Vector length is 2e5

library(microbenchmark)
vec<-c(seq(-100,-1,length.out=1e5), rep(0,20), seq(1,100,length.out=1e5))
microbenchmark( (which.min(vec < 0) - 1L) , (sum( vec < 0 )) )

Unit: milliseconds
                      expr      min       lq   median       uq       max neval
 (which.min(vec < 0) - 1L) 1.883847 2.130746 2.554725 3.141787 75.943911   100
            (sum(vec < 0)) 1.398100 1.500639 1.508688 1.745088  2.662164   100

You could use which.min

 which.min(vec < 0) - 1L

This will return the first FALSE value, i.e. the first 0.

More efficient strategy for which() or match()

Tags:

R

Match

Vectorization

Related

Recent Posts