What's a tidyverse approach to iterating over rows in a data frame when vectorisation is not feasible?

I think that for this sort of intrinsically iterative process it is genuinely hard to beat a for loop. The method proposed by @Shree depends on the NAs being continuous and starting in a known spot.

Here is my mild improvement on your loop, which I think is more readable and about 2.5 times the speed and will probably scale up better than your approach which combines vectorized operations with the loop. By moving out of the tidyverse altogether and embracing a rowwise loop that really works on each row one at a time, we get some efficiencies on both counts:

method_peter <- function(x){
  for(i in 2:nrow(x)){
    x[i, "a"] <- ifelse(is.na(x[i, "a"]), x[i - 1, "a"] * 1.1,       x[i, "a"])
    x[i, "b"] <- ifelse(is.na(x[i, "b"]), x[i, "a"] * x[i - 1, "b"], x[i, "b"])
    x[i, "c"] <- ifelse(is.na(x[i, "c"]), x[i, "b"] * x[i - 2, "a"], x[i, "c"])
  }
  return(x)
}

There's doubtless more efficiencies possible, and of course this is an ideal candidate to rewrite it in C++ :).

This is about twice as fast as your method as seen by this:

method_matt <- function(x){
  for(i in 1:nrow(x)) {
    x <- x %>%
      mutate(a = if_else(!is.na(a), a, lag(a, 1) * 1.1),
             b = if_else(!is.na(b), b, a * lag(b, 1)),
             c = if_else(!is.na(c), c, b * lag(a, 2)))
  }
  return(x)
}

set.seed(123)
x <- tibble(t = c(1:10),
            a = c(seq(100, 140, 10), rep(NA_real_, 5)),
            b = c(runif(5), rep(NA_real_, 5)),
            c = c(runif(5), rep(NA_real_, 5)))

stopifnot(identical(method_matt(x), method_peter(x)))

library(microbenchmark)
microbenchmark(
  method_matt(x),
  method_peter(x)
)

which returns:

Unit: milliseconds
            expr     min       lq     mean   median      uq     max neval
  method_matt(x) 24.1975 25.50925 30.64438 26.33310 31.8681 74.5093   100
 method_peter(x) 10.0005 10.56050 13.33751 11.06495 13.5913 42.0568   100

@Shree's method is much faster again and is ideal for the example data, but I'm not sure it is flexible enough to work in all your use cases.

I would like to see a tidyverse solution if there is one.

Edit: Added tidyverse approach

Heres a readable and flexible tidyverse approach. The downside is that it is very slow.

accumutate <- function(df, ...){
  df %>% group_by(row_number()) %>%
    nest() %>%
    pull(data) %>%
    accumulate(function(x,y){bind_rows(x,y) %>% mutate(!!!enquos(...)) }) %>%
    .[[length(.)]]
}

x %>%
  accumutate(a = ifelse(is.na(a), 1.1 * lag(a,1), a)) %>%
  accumutate(b = ifelse(is.na(b), a * lag(b), b)) %>%
  accumutate(c = ifelse(is.na(c),b * lag(a, 2), c))

#> # A tibble: 10 x 4
#>        t     a        b        c
#>    <int> <dbl>    <dbl>    <dbl>
#>  1     1  100  2.88e- 1 4.56e- 2
#>  2     2  110  7.88e- 1 5.28e- 1
#>  3     3  120  4.09e- 1 8.92e- 1
#>  4     4  130  8.83e- 1 5.51e- 1
#>  5     5  140  9.40e- 1 4.57e- 1
#>  6     6  154  1.45e+ 2 1.88e+ 4
#>  7     7  169. 2.45e+ 4 3.43e+ 6
#>  8     8  186. 4.57e+ 6 7.04e+ 8
#>  9     9  205. 9.37e+ 8 1.59e+11
#> 10    10  225. 2.11e+11 3.94e+13

^{Created on 2020-10-07 by the reprex package (v0.3.0)}

Here's another approach that you might find interesting. It's not concise or especially readable, but it's tidyverse (or at least functionally) inspired. And it performs fairly well.

It uses a semigroup pattern, converting the mutate expressions into binary functions, creating corresponding lists and then using accumulate.

library(tidyverse)
library(dplyr)
library(microbenchmark)
options(width =100)

set.seed(123)

# Create the data frame
x <- tibble(t = c(1:100),
            a = c(seq(100, 140, 10), rep(NA_real_,100- 5)),
            b = c(runif(5), rep(NA_real_, 100-5)),
            c = c(runif(5), rep(NA_real_, 100-5)))

a_mappend <- function(a1, a2) {
  ifelse(is.na(a2), a1 * 1.1, a2)
}

b_mappend <- function(ab1, ab2) {
    list(a = ab2$a, b =  ifelse(is.na(ab2$b), ab2$a * ab1$b,ab2$b))
}

c_mappend <- function(abc12, abc23) {
  list(abc1 = list(a = abc12$abc2$a, b = abc12$abc2$b, c = abc12$abc2$c),
       abc2 = list(a = abc23$abc2$a, b = abc23$abc2$b, c = ifelse(is.na(abc23$abc2$c),abc12$abc1$a * abc23$abc2$b,abc23$abc2$c)))
}

method_ian <- function(x) {
  x %>%
    mutate(a = accumulate(a, a_mappend)) %>%
    mutate(b = list(a, b) %>% 
                  pmap(~ list(a = .x, b = .y)) %>% 
                  accumulate(b_mappend) %>% map_dbl(~ .x$b)) %>%
    mutate(c = list(a, b, c, c(a[-1], NA), c(b[-1], NA), c(c[-1], NA)) %>%
                  pmap(~ list(abc1 = list(a = ..1, b = ..2, c = ..3),
                              abc2 = list(a = ..4, b = ..5, c = ..6))) %>% 
                  accumulate(c_mappend) %>% map_dbl(~ .x$abc1$c))
}


method_matt <- function(x){
  for(i in 1:nrow(x)) {
    x <- x %>%
      mutate(a = if_else(!is.na(a), a, lag(a, 1) * 1.1),
             b = if_else(!is.na(b), b, a * lag(b, 1)),
             c = if_else(!is.na(c), c, b * lag(a, 2)))
  }
  return(x)
}

method_peter <- function(x){
  for(i in 2:nrow(x)){
    x[i, "a"] <- ifelse(is.na(x[i, "a"]), x[i - 1, "a"] * 1.1,       x[i, "a"])
    x[i, "b"] <- ifelse(is.na(x[i, "b"]), x[i, "a"] * x[i - 1, "b"], x[i, "b"])
    x[i, "c"] <- ifelse(is.na(x[i, "c"]), x[i, "b"] * x[i - 2, "a"], x[i, "c"])
  }
  return(x)
}

stopifnot(identical(method_matt(x), method_ian(x)))

microbenchmark( method_matt(x), method_peter(x), method_ian(x))
#> Unit: milliseconds
#>             expr       min        lq      mean    median        uq       max neval
#>   method_matt(x) 324.90086 330.93192 337.46518 334.55447 338.38461 426.30457   100
#>  method_peter(x) 208.27498 211.60526 213.59438 212.66088 214.36421 242.59854   100
#>    method_ian(x)  13.06774  13.43105  14.30003  13.86428  14.32263  19.54843   100

^{Created on 2020-10-06 by the reprex package (v0.3.0)}

I don't think there's any simple way in tidyverse to do calculations with row-dependencies. Something with Reduce or gather + spread could be possible but I don't expect them to score poits on readability.

Anyways, on the bright side, your calculations are vectorizable using dplyr and zoo packages -

x %>% 
  mutate(
    a = ifelse(is.na(a), na.locf(a) * 1.1^(t-5), a),
    b = ifelse(is.na(b), na.locf(b) * c(rep(1, 5), cumprod(a[6:n()])), b),
    c = ifelse(is.na(c), b * lag(a, 2), c)
  )

 # A tibble: 10 x 4
 t     a        b        c
 <int> <dbl>    <dbl>    <dbl>
 1     1  100  1.85e- 1 9.43e- 1
 2     2  110  7.02e- 1 1.29e- 1
 3     3  120  5.73e- 1 8.33e- 1
 4     4  130  1.68e- 1 4.68e- 1
 5     5  140  9.44e- 1 5.50e- 1
 6     6  154  1.45e+ 2 1.89e+ 4
 7     7  169. 2.46e+ 4 3.45e+ 6
 8     8  186. 4.59e+ 6 7.07e+ 8
 9     9  205. 9.40e+ 8 1.59e+11
10    10  225. 2.12e+11 3.95e+13

Data -

set.seed(2)
x <- tibble(t = c(1:10),
            a = c(seq(100, 140, 10), rep(NA_real_, 5)),
            b = c(runif(5), rep(NA_real_, 5)),
            c = c(runif(5), rep(NA_real_, 5)))

What's a tidyverse approach to iterating over rows in a data frame when vectorisation is not feasible?

Tags:

R

Dplyr

Tidyverse

Purrr

Related

Recent Posts