Finding the index of first changes in the elements of a vector
The data.table
package internally (meaning not exported yet) uses a function uniqlist
(in dev 1.8.11) or alternatively duplist
(in current 1.8.10 @CRAN) that does exactly what you're after. It should be quite fast. Here's a benchmark:
require(data.table)
set.seed(45)
# prepare a huge vector (sorted)
x <- sort(as.numeric(sample(1e5, 1e7, TRUE)))
require(microbenchmark)
ben <- function(v) c(1,1+which(diff(v)!=0))
matthew <- function(v) rle(v)
matteo <- function(v) firstDiff(v)
exegetic <- function(v) first.changes(v)
# if you use 1.8.10, replace uniqlist with duplist
dt <- function(v) data.table:::uniqlist(list(v))
microbenchmark( ans1 <- ben(x), matthew(x), matteo(x),
exegetic(x), ans2 <- dt(x), times=10)
# Unit: milliseconds
# expr min lq median uq max neval
# ans1 <- ben(x) 1181.808 1229.8197 1313.2646 1357.5026 1553.9296 10
# matthew(x) 1456.918 1496.0300 1581.0062 1660.4067 2148.1691 10
# matteo(x) 28609.890 29305.1117 30698.5843 32706.3147 34290.9864 10
# exegetic(x) 1365.243 1546.0652 1576.8699 1659.5488 1886.6058 10
# ans2 <- dt(x) 113.712 114.7794 143.9938 180.3743 221.8386 10
identical(as.integer(ans1), ans2) # [1] TRUE
I'm not that familiar with Rcpp, but seems like the solution could be improved quite a bit.
Edit: Refer to Matteo's updated answer for Rcpp timings.
You're looking for rle
:
rle(v)
## Run Length Encoding
## lengths: int [1:3] 7 2 5
## values : num [1:3] 1 1.5 2
This says that the value changes at locations 7+1, 7+2+1 (and 7+2+5+1 would be the index of the element "one off the end")
rle
is a good idea, but if you only want the indices of the changepoints you can just do:
c(1,1+which(diff(v)!=0))
## 1 8 10