How to identify all sequential numbers not covered by 'to' and 'from' positions?
Edit: Should have read the question better. This is basically your current approach.
You can pmap over your input with the seq function, and unlist that to get a vector of all values. Then setdiff to get the missing values. Using diff and cumsum you can create a grouping variable for the missing values, grouping them into from-to pairs. Then split the missing value vector by the grouping variable and map over that to create one row of output for each group.
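library(purrr)

# All values in 1:100 that no from-to range covers
miss <- setdiff(1:100, unlist(pmap(df1, seq)))

# Grouping variable: for each value, the number of jumps (gaps > 1) after it
i <-
  miss %>%
  diff %>%
  `>`(1) %>%
  rev %>%
  cumsum %>%
  rev

# One row per group: first and last missing value
map_df(split(miss, c(i, 0)), ~list(from = head(.x, 1), to = tail(.x, 1))) %>%
  dplyr::arrange(from)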
# # A tibble: 5 x 2
#    from    to
#   <int> <int>
# 1     1     6
# 2    14    20
# 3    32    34
# 4    44    49
# 5    61   100
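To see why the reversed cumsum gives a usable grouping variable, here is a minimal base R sketch on a small made-up vector. The values in miss below are an assumption for illustration only, not the data from the question, and grp and res are names introduced here.

```r
# Hypothetical missing values (illustration only, not the question's data)
miss <- c(1L, 2L, 3L, 7L, 8L, 12L)

# TRUE wherever the next value is not consecutive
jump <- diff(miss) > 1            # FALSE FALSE TRUE FALSE TRUE

# Reversed cumulative sum: number of jumps from each position onwards
i <- rev(cumsum(rev(jump)))       # 2 2 2 1 1

# Append 0 for the last value so the grouping matches length(miss)
grp <- c(i, 0)                    # 2 2 2 1 1 0

# split() orders groups by name ("0", "1", "2"), so the runs come out reversed;
# sorting on the first column restores ascending order
res <- t(sapply(split(miss, grp), range))
res <- res[order(res[, 1]), , drop = FALSE]
res                               # rows: (1, 3), (7, 8), (12, 12)
```

Because the counter runs from the end of the vector, the first run gets the highest group number, which is why the pipeline above finishes with dplyr::arrange(from).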
Since you need a fast solution, we could try a base R approach using setdiff and split. We leave the vectorization to mapply. To find the break points at which to split, we use findInterval. To get the start and end point of each element of the resulting list, we finish with range.
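# Missing values: everything in 1:100 not covered by a from-to range
d <- setdiff(1:100, unlist(mapply(seq.default, df1[, 1], df1[, 2])))
# Break at the first value of each new run, split there, take each run's range
t(sapply(split(d, findInterval(d, d[which(c(1, diff(d)) > 1)])), range))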
#   [,1] [,2]
# 0    1    6
# 1   14   20
# 2   32   34
# 3   44   49
# 4   61  100
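The one-liner is dense, so here is the same idea unrolled step by step. This is only a sketch: df1 and the universe 1:12 below are made up for illustration, and breaks, grp and res are names introduced here.

```r
# Hypothetical input (illustration only): two covered ranges within 1:12
df1 <- data.frame(from = c(3, 8), to = c(5, 9))

# Values not covered by any range
d <- setdiff(1:12, unlist(mapply(seq.default, df1[, 1], df1[, 2])))
# d is 1 2 6 7 10 11 12

# First value of every run except the first one
breaks <- d[which(c(1, diff(d)) > 1)]   # 6 10

# findInterval() counts how many breaks are <= each value: 0 0 1 1 2 2 2
grp <- findInterval(d, breaks)

# Split by run and reduce each run to its start and end
res <- t(sapply(split(d, grp), range))
res                                     # rows: (1, 2), (6, 7), (10, 12)
```

Here the group numbers increase with the values, so the rows already come out in ascending order and no extra sorting step is needed.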
Benchmark
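As the benchmark shows, the findInterval approach has a median run time of about 273 microseconds versus about 1605 for the purrr approach, so it is roughly six times faster.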
Unit: microseconds
         expr      min        lq      mean    median       uq      max neval cld
        purrr 1575.479 1593.2110 1634.3573 1604.9475 1634.033 2028.095   100   b
 findInterval  250.801  256.9245  276.8609  273.3815  281.673  498.285   100  a
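If you want to collect timings like these on your own data, something along the following lines should work. It is a sketch under assumptions: the table format suggests the microbenchmark package, but that is my guess, so the code falls back to system.time when the package is not installed. gaps_base is a hypothetical wrapper name, and df1 and n are made-up inputs.

```r
# Hypothetical wrapper around the base R approach; df holds from/to in columns 1 and 2
gaps_base <- function(df, n) {
  # SIMPLIFY = FALSE always returns a list, whatever the range lengths
  covered <- unlist(mapply(seq.default, df[, 1], df[, 2], SIMPLIFY = FALSE))
  d <- setdiff(seq_len(n), covered)
  t(sapply(split(d, findInterval(d, d[which(c(1, diff(d)) > 1)])), range))
}

# Made-up input for illustration only
df1 <- data.frame(from = c(3, 8), to = c(5, 9))

if (requireNamespace("microbenchmark", quietly = TRUE)) {
  # Assumes the microbenchmark package; times = 100L mirrors neval above
  print(microbenchmark::microbenchmark(findInterval = gaps_base(df1, 12), times = 100L))
} else {
  # Coarse fallback: total time for 100 calls
  print(system.time(for (k in 1:100) gaps_base(df1, 12)))
}
```

Absolute numbers will depend on your machine and on the size of the input, so the ratio between the approaches is the more useful figure.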