Count a sequence to include NA values
I'd make a change here
CheckNA <- rle(is.na(df$Day))
CheckNA$values <- CheckNA$lengths >= 4 & CheckNA$values == 1
CheckNA$values <- ifelse(!CheckNA$values, cumsum(CheckNA$values)+1, NA)
df$Co <- inverse.rle(CheckNA)
I kept the first two lines the same, then I used cumsum()
to assign new IDs at each break. This means you won't have to hard-code any values. With the new values, you can use inverse.rle
much in the same way you used rep()
to expand the new ID out to each of the rows.
If you turn that into a function, you can clean up the dplyr
bits
id_NA_break <- function(x) {
CheckNA <- rle(is.na(x))
CheckNA$values <- CheckNA$lengths >= 4 & CheckNA$values == 1
CheckNA$values <- ifelse(!CheckNA$values, cumsum(CheckNA$values)+1, NA)
inverse.rle(CheckNA)
}
df <- data.frame(z, Day, y)
df %>%
mutate(Co=id_NA_break(Day)) %>%
group_by(Co) %>%
mutate(CoDay = ifelse(is.na(Co), NA, seq(Co)))
Here's a data.table solution. I'm not sure how the two functions would compare. We would have to benchmark them. Typically data.table is faster, but I ended up using a lot of steps here.
library(data.table)
Day <- c(1, 2, NA, 3, 4, NA, NA, NA, NA, NA, 1, 2, 3, NA, NA, NA, NA, 1, 2, NA, NA, 3, 4, 5)
y <- rpois(length(Day), 2)
z <- seq(1:length(Day)) + 500
df <- data.frame(z, Day, y)
setDT(df)
df[ , "isNA" := ifelse(is.na(Day), 1, 0)]
df[ , "numNA" := rep(rle(isNA)$length*rle(isNA)$value, rle(isNA)$length)]
df[ , "Gap" := ifelse(numNA < 4, 0, 1)]
df[ , "Cohort" := cumsum(Gap)]
df[Gap == 1, "Cohort" := NA]
df[Gap == 0, "Cohort" := as.double(rleid(Cohort))]
> df
z Day y isNA numNA Gap Cohort
1: 501 1 1 0 0 0 1
2: 502 2 2 0 0 0 1
3: 503 NA 2 1 1 0 1
4: 504 3 1 0 0 0 1
5: 505 4 2 0 0 0 1
6: 506 NA 2 1 5 1 NA
7: 507 NA 1 1 5 1 NA
8: 508 NA 0 1 5 1 NA
9: 509 NA 4 1 5 1 NA
10: 510 NA 2 1 5 1 NA
11: 511 1 3 0 0 0 2
12: 512 2 3 0 0 0 2
13: 513 3 2 0 0 0 2
14: 514 NA 3 1 4 1 NA
15: 515 NA 1 1 4 1 NA
16: 516 NA 3 1 4 1 NA
17: 517 NA 2 1 4 1 NA
18: 518 1 4 0 0 0 3
19: 519 2 4 0 0 0 3
20: 520 NA 1 1 2 0 3
21: 521 NA 1 1 2 0 3
22: 522 3 3 0 0 0 3
23: 523 4 0 0 0 0 3
24: 524 5 3 0 0 0 3
z Day y isNA numNA Gap Cohort
to cleanup the extra columns
df[ , c("isNA", "numNA", "Gap") := NULL]
EDIT MrFlick's is faster. I ran them both through microbenchmark.
> microbenchmark(data_table_way(df))
Unit: milliseconds
expr min lq mean median uq max neval
data_table_way(df) 2.515004 2.678493 2.879678 2.770054 2.923348 4.917869 100
> microbenchmark(dplyr_way())
Unit: milliseconds
expr min lq mean median uq max neval
dplyr_way() 1.564279 1.703792 1.814998 1.765713 1.824615 2.773641 100