Index consecutive duplicates in vector
One option with data.table. There is no real reason to use this instead of lag/shift when n = 2, but for larger n it saves you from creating a large number of new lagged vectors.
library(data.table)
which(rowid(rleid(t1)) > 2)
# [1] 4 11 12
Explanation: rleid produces a unique id for each "run" of equal values, and rowid, applied to those ids, marks how many elements "into" its run each element is. What you want is the elements more than 2 "into" a run.
data.table(
  t1,
  rleid(t1),
  rowid(rleid(t1)))
#     t1 V2 V3
#  1:  1  1  1
#  2: 10  2  1
#  3: 10  2  2
#  4: 10  2  3
#  5: 14  3  1
#  6: 37  4  1
#  7:  3  5  1
#  8: 14  6  1
#  9:  8  7  1
# 10:  8  7  2
# 11:  8  7  3
# 12:  8  7  4
# 13: 39  8  1
# 14: 12  9  1
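For comparison, the same "position within a run" idea can be sketched in base R without data.table. This is only an illustration: t1 is reconstructed here from the table printed above, and run_position is a hypothetical helper name, not part of any package.

```r
# t1 as it appears in the printed table (reconstructed, an assumption)
t1 <- c(1, 10, 10, 10, 14, 37, 3, 14, 8, 8, 8, 8, 39, 12)

# Hypothetical helper: position of each element within its run of equal values
run_position <- function(x) {
  # rle() gives the length of each run; sequence() counts 1..length for each run
  sequence(rle(x)$lengths)
}

run_position(t1)
# [1] 1 1 2 3 1 1 1 1 1 2 3 4 1 1

# Elements more than 2 "into" a run
which(run_position(t1) > 2)
# [1]  4 11 12
```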
Edit: If no value appears in more than one run, or if you don't care whether the duplicates are next to each other, you can just use which(rowid(t1) > 2) instead; for the example in this question both give the same result. (This is noted by Frank in the comments.)
Hopefully this example clarifies the difference:
a <- c(1, 1, 1, 2, 2, 1)
which(rowid(a) > 2)
# [1] 3 6
which(rowid(rleid(a)) > 2)
# [1] 3
You can use dplyr::lag or data.table::shift (note that the default for shift is to lag, so shift(t1, 1) is equal to shift(t1, 1, type = "lag")):
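# Use dplyr::lag here; stats::lag leaves the values of a plain vector in place
which(t1 == dplyr::lag(t1, 1) & dplyr::lag(t1, 1) == dplyr::lag(t1, 2))
# [1]  4 11 12
# Or
which(t1 == data.table::shift(t1, 1) & data.table::shift(t1, 1) == data.table::shift(t1, 2))
# [1]  4 11 12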
If you need it to scale for several duplicates you can do the following (thanks for the tip @IceCreamToucan):
n <- 2
df1 <- sapply(0:n, function(x) shift(t1, x))
which(rowMeans(df1 == df1[,1]) == 1)
# [1]  4 11 12
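If you would rather not build the lagged columns by hand, a similar check can be sketched in base R with embed(), which lays out each element next to the ones before it. This is an alternative to the shift version, not the same code: repeats_of_previous is a hypothetical helper name, and t1 is assumed to be the vector from the question.

```r
# t1 as assumed from the question
t1 <- c(1, 10, 10, 10, 14, 37, 3, 14, 8, 8, 8, 8, 39, 12)

# Hypothetical helper: indices of elements equal to each of the n elements before them
repeats_of_previous <- function(x, n) {
  n <- as.integer(n)
  # embed() needs at least n + 1 elements
  if (length(x) <= n) return(integer(0))
  # Row i of m holds x[i + n], x[i + n - 1], ..., x[i]
  m <- embed(x, n + 1L)
  # Keep rows where every column equals the first, then map rows back to positions in x
  which(rowSums(m == m[, 1]) == n + 1L) + n
}

repeats_of_previous(t1, 2)
# [1]  4 11 12
repeats_of_previous(t1, 3)
# [1] 12
```

Rows containing NA produce NA in rowSums and are dropped by which(), so missing values never count as repeats in this sketch.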
This is usually a case where rle is useful, i.e.
v1 <- rle(t1)
i1 <- seq_along(t1)[t1 %in% v1$values[v1$lengths > 2]]
i2 <- t1[t1 %in% v1$values[v1$lengths > 2]]
tapply(i1, i2, function(i) tail(i, -2))
#$`8`
#[1] 11 12
#$`10`
#[1] 4
You can unlist and get it as a vector,
unlist(tapply(i1, i2, function(i) tail(i, -2)))
#81 82 10
#11 12 4
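One caveat: the %in% filter above selects by value rather than by run, so a value that has one long run and also appears somewhere else would have its other positions picked up too. The following sketch works from run boundaries instead. extra_in_runs and its keep argument are hypothetical names used only for illustration.

```r
# Hypothetical helper: indices beyond the first `keep` elements of each run
extra_in_runs <- function(x, keep = 2) {
  r <- rle(x)
  ends <- cumsum(r$lengths)        # last index of each run
  starts <- ends - r$lengths + 1   # first index of each run
  long <- r$lengths > keep         # runs that have something left over
  # Returns NULL when no run is longer than `keep`
  unlist(Map(function(s, e) (s + keep):e, starts[long], ends[long]))
}

# The value 1 has a run of three and then reappears on its own at position 6
a <- c(1, 1, 1, 2, 2, 1)
extra_in_runs(a)
# [1] 3
```

On this vector the value-based grouping would return both 3 and 6, because position 6 shares its value with the long run.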
There is also a function called rleid in the data.table package which we can use,
unlist(lapply(Filter(function(i) length(i) > 2, split(seq_along(t1), data.table::rleid(t1))),
function(i) tail(i, -2)))
#2 71 72
#4 11 12