Finding the index of first changes in the elements of a vector

The data.table package internally (meaning not exported yet) uses a function uniqlist (in dev 1.8.11) or alternatively duplist (in current 1.8.10 @CRAN) that does exactly what you're after. It should be quite fast. Here's a benchmark:

require(data.table)
set.seed(45)
# prepare a huge vector (sorted)
x <- sort(as.numeric(sample(1e5, 1e7, TRUE)))

require(microbenchmark)
ben <- function(v) c(1,1+which(diff(v)!=0))
matthew <- function(v) rle(v)
matteo <- function(v) firstDiff(v)
exegetic <- function(v) first.changes(v)
# if you use 1.8.10, replace uniqlist with duplist
dt <- function(v) data.table:::uniqlist(list(v))

microbenchmark( ans1 <- ben(x), matthew(x), matteo(x), 
                exegetic(x), ans2 <- dt(x), times=10)

# Unit: milliseconds
#            expr       min         lq     median         uq        max neval
#  ans1 <- ben(x)  1181.808  1229.8197  1313.2646  1357.5026  1553.9296    10
#      matthew(x)  1456.918  1496.0300  1581.0062  1660.4067  2148.1691    10
#       matteo(x) 28609.890 29305.1117 30698.5843 32706.3147 34290.9864    10
#     exegetic(x)  1365.243  1546.0652  1576.8699  1659.5488  1886.6058    10
#   ans2 <- dt(x)   113.712   114.7794   143.9938   180.3743   221.8386    10

identical(as.integer(ans1), ans2) # [1] TRUE

I'm not that familiar with Rcpp, but seems like the solution could be improved quite a bit.

Edit: Refer to Matteo's updated answer for Rcpp timings.


You're looking for rle:

rle(v)
## Run Length Encoding
##   lengths: int [1:3] 7 2 5
##   values : num [1:3] 1 1.5 2

This says that the value changes at locations 7+1, 7+2+1 (and 7+2+5+1 would be the index of the element "one off the end")


rle is a good idea, but if you only want the indices of the changepoints you can just do:

c(1,1+which(diff(v)!=0))
## 1 8 10

Tags:

R