Creating a sequence in a data.table depending on a column
Here is an option:
set.seed(1)
DT <- data.table(R=sample(0:1, 10000, rep=TRUE), Seq=0L)
DT[, Seq:=seq(.N), by=list(cumsum(c(0, abs(diff(R)))))]
DT
We create a counter that increments every time your 0-1 variable changes using cumsum(abs(diff(R)))
. The c(0,
part is to ensure we get the correct length vector. Then we split by it with by
. This produces:
R Seq
1: 0 1
2: 0 2
3: 1 1
4: 1 2
5: 0 1
---
9996: 1 1
9997: 0 1
9998: 1 1
9999: 1 2
10000: 1 3
EDIT: Addressing request for clarification:
lets look at the computation I'm using in by
, broken down into two new columns:
DT[, diff:=c(0, diff(R))]
DT[, cumsum:=cumsum(abs(diff))]
print(DT, topn=10)
Produces:
R Seq diff cumsum
1: 0 1 0 0
2: 0 2 0 0
3: 1 1 1 1
4: 1 2 0 1
5: 0 1 -1 2
6: 1 1 1 3
7: 1 2 0 3
8: 1 3 0 3
9: 1 4 0 3
10: 0 1 -1 4
---
9991: 1 2 0 5021
9992: 1 3 0 5021
9993: 1 4 0 5021
9994: 1 5 0 5021
9995: 0 1 -1 5022
9996: 1 1 1 5023
9997: 0 1 -1 5024
9998: 1 1 1 5025
9999: 1 2 0 5025
10000: 1 3 0 5025
You can see how the cumulative sum of the absolute of the diff increments by one each time R changes. We can then use that cumsum
column to break up the data.table
into chunks, and for each chunk, generate a sequence using seq(.N)
that counts to the number of items in the chunk (.N
represents exactly that, how many items in each by
group).