How to identify all sequential numbers not covered by 'to' and 'from' positions?

Edit: Should have read the question better. This is basically your current approach.

You can pmap over your input with seq and unlist the result to get a vector of all covered values, then use setdiff against the full range to get the missing ones. With diff and cumsum you can build a grouping variable that assigns each run of consecutive missing values to its own group. Finally, split the missing values by that grouping variable and map over the groups to create one from-to row of output per group.

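library(purrr)

# All positions in 1:100 that no from-to range covers
# (pmap passes columns by name, so df1 needs columns `from` and `to`)
miss <- setdiff(1:100, unlist(pmap(df1, seq)))

# Group id per run of consecutive values; the count runs from the end,
# so the last run gets id 0 and the rows need arrange() afterwards
i <- 
  miss %>% 
    diff %>% 
    `>`(1) %>% 
    rev %>%
    cumsum %>% 
    rev 

# One row per run: its first and last value
map_df(split(miss, c(i, 0)), ~list(from = head(.x, 1), to = tail(.x, 1))) %>% 
  dplyr::arrange(from)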


# # A tibble: 5 x 2
#    from    to
#   <int> <int>
# 1     1     6
# 2    14    20
# 3    32    34
# 4    44    49
# 5    61   100
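
To make the diff and cumsum grouping step easier to follow, here is a minimal base R sketch on a small made-up vector. The values in `miss` are hypothetical and chosen only for illustration. It also counts the breaks from the front rather than from the end, which is a small variation on the pipeline above that leaves the groups already in ascending order.

```r
# Hypothetical missing values, for illustration only
miss <- c(1, 2, 3, 7, 8, 12)

# TRUE wherever the step to the next value is larger than 1
breaks <- diff(miss) > 1

# Running count of breaks, with a leading 0 for the first value,
# gives one group id per run of consecutive values
grp <- c(0, cumsum(breaks))

# First and last element of each run become one from-to row
runs <- split(miss, grp)
gaps <- data.frame(
  from = vapply(runs, function(x) x[1], numeric(1)),
  to   = vapply(runs, function(x) x[length(x)], numeric(1)),
  row.names = NULL
)
print(gaps)
```

Here `grp` comes out as 0 0 0 1 1 2, so the three runs are 1 to 3, 7 to 8, and the single value 12.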

Since you need a fast solution, we could attempt a base R approach using setdiff and split. We leave the vectorization to mapply, which builds the sequence for each from-to pair. To find where to split the missing values, we use findInterval against the first value of each new run. Applying range to each element of the resulting list then gives its start and end points.

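# Positions in 1:100 not covered by any from-to range
d <- setdiff(1:100, unlist(mapply(seq.default, df1[, 1], df1[, 2])))
# Split at the first value of each new run, then take min and max per run
t(sapply(split(d, findInterval(d, d[which(c(1, diff(d)) > 1)])), range))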
#   [,1] [,2]
# 0    1    6
# 1   14   20
# 2   32   34
# 3   44   49
# 4   61  100
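
For anyone who wants to try this without the original data, the following is a self-contained sketch of the same setdiff, findInterval and split idea. Both `df1` and the helper `find_gaps` (with its `lower` and `upper` arguments) are hypothetical: the intervals are an assumption picked so that the gaps match the output shown above, not the asker's actual data. It uses Map instead of mapply so the result is always a list, and returns a data frame rather than a matrix.

```r
# Hypothetical input: intervals assumed for illustration only
df1 <- data.frame(from = c(7, 21, 35, 50),
                  to   = c(13, 31, 43, 60))

# Hypothetical helper; `lower` and `upper` bound the full range to check
find_gaps <- function(df, lower = 1L, upper = 100L) {
  # Every position covered by some from-to pair
  covered <- unlist(Map(seq.int, df$from, df$to))
  # Positions in the full range that are not covered
  d <- setdiff(seq.int(lower, upper), covered)
  # First value of every run after the first one
  breaks <- d[c(FALSE, diff(d) > 1)]
  # findInterval assigns each value the index of the run it belongs to
  runs <- split(d, findInterval(d, breaks))
  data.frame(from = vapply(runs, min, numeric(1)),
             to   = vapply(runs, max, numeric(1)),
             row.names = NULL)
}

gaps <- find_gaps(df1)
print(gaps)
```

With these assumed intervals, `gaps` has five rows, running from 1 to 6 through 61 to 100, in line with the matrix above.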

Benchmark

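In the benchmark below, the base R findInterval approach has a median run time of about 273 microseconds against about 1605 for the purrr version, so it is roughly six times faster on this input.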

Unit: microseconds
         expr      min        lq      mean    median       uq      max neval cld
        purrr 1575.479 1593.2110 1634.3573 1604.9475 1634.033 2028.095   100   b
 findInterval  250.801  256.9245  276.8609  273.3815  281.673  498.285   100  a


