Count a sequence to include NA values

I'd make a change here

CheckNA        <- rle(is.na(df$Day))
CheckNA$values <- CheckNA$lengths >= 4 & CheckNA$values == 1
CheckNA$values <- ifelse(!CheckNA$values, cumsum(CheckNA$values)+1, NA)
df$Co <- inverse.rle(CheckNA)

I kept the first two lines the same, then I used cumsum() to assign new IDs at each break. This means you won't have to hard-code any values. With the new values, you can use inverse.rle much in the same way you used rep() to expand the new ID out to each of the rows.

If you turn that into a function, you can clean up the dplyr bits

id_NA_break <- function(x) {
  CheckNA        <- rle(is.na(x))
  CheckNA$values <- CheckNA$lengths >= 4 & CheckNA$values == 1
  CheckNA$values <- ifelse(!CheckNA$values, cumsum(CheckNA$values)+1, NA)
  inverse.rle(CheckNA)  
}

df  <- data.frame(z, Day, y)
df %>% 
  mutate(Co=id_NA_break(Day)) %>%
  group_by(Co) %>% 
  mutate(CoDay = ifelse(is.na(Co), NA, seq(Co)))

Here's a data.table solution. I'm not sure how the two functions would compare. We would have to benchmark them. Typically data.table is faster, but I ended up using a lot of steps here.

library(data.table)
Day <- c(1, 2, NA, 3, 4, NA, NA, NA, NA, NA, 1, 2, 3, NA, NA, NA, NA, 1, 2, NA, NA, 3, 4, 5)
y   <- rpois(length(Day), 2)
z   <- seq(1:length(Day)) + 500
df  <- data.frame(z, Day, y)

setDT(df)

df[ , "isNA" := ifelse(is.na(Day), 1, 0)]
df[ , "numNA" := rep(rle(isNA)$length*rle(isNA)$value, rle(isNA)$length)]
df[ , "Gap" := ifelse(numNA < 4, 0, 1)]
df[ , "Cohort" := cumsum(Gap)]

df[Gap == 1, "Cohort" := NA]
df[Gap == 0, "Cohort" := as.double(rleid(Cohort))]

> df
      z Day y isNA numNA Gap Cohort
 1: 501   1 1    0     0   0      1
 2: 502   2 2    0     0   0      1
 3: 503  NA 2    1     1   0      1
 4: 504   3 1    0     0   0      1
 5: 505   4 2    0     0   0      1
 6: 506  NA 2    1     5   1     NA
 7: 507  NA 1    1     5   1     NA
 8: 508  NA 0    1     5   1     NA
 9: 509  NA 4    1     5   1     NA
10: 510  NA 2    1     5   1     NA
11: 511   1 3    0     0   0      2
12: 512   2 3    0     0   0      2
13: 513   3 2    0     0   0      2
14: 514  NA 3    1     4   1     NA
15: 515  NA 1    1     4   1     NA
16: 516  NA 3    1     4   1     NA
17: 517  NA 2    1     4   1     NA
18: 518   1 4    0     0   0      3
19: 519   2 4    0     0   0      3
20: 520  NA 1    1     2   0      3
21: 521  NA 1    1     2   0      3
22: 522   3 3    0     0   0      3
23: 523   4 0    0     0   0      3
24: 524   5 3    0     0   0      3
      z Day y isNA numNA Gap Cohort

to cleanup the extra columns

df[ , c("isNA", "numNA", "Gap") := NULL]

EDIT MrFlick's is faster. I ran them both through microbenchmark.

> microbenchmark(data_table_way(df))
Unit: milliseconds
               expr      min       lq     mean   median       uq      max neval
 data_table_way(df) 2.515004 2.678493 2.879678 2.770054 2.923348 4.917869   100

> microbenchmark(dplyr_way())
Unit: milliseconds
        expr      min       lq     mean   median       uq      max neval
 dplyr_way() 1.564279 1.703792 1.814998 1.765713 1.824615 2.773641   100

Count a sequence to include NA values

Tags:

Sequence

R

Related

Recent Posts