What's a tidyverse approach to iterating over rows in a data frame when vectorisation is not feasible?
I think that for this sort of intrinsically iterative process it is genuinely hard to beat a for
loop. The method proposed by @Shree depends on the NAs being continuous and starting in a known spot.
Here is my mild improvement on your loop, which I think is more readable and about 2.5 times the speed and will probably scale up better than your approach which combines vectorized operations with the loop. By moving out of the tidyverse altogether and embracing a rowwise loop that really works on each row one at a time, we get some efficiencies on both counts:
method_peter <- function(x){
for(i in 2:nrow(x)){
x[i, "a"] <- ifelse(is.na(x[i, "a"]), x[i - 1, "a"] * 1.1, x[i, "a"])
x[i, "b"] <- ifelse(is.na(x[i, "b"]), x[i, "a"] * x[i - 1, "b"], x[i, "b"])
x[i, "c"] <- ifelse(is.na(x[i, "c"]), x[i, "b"] * x[i - 2, "a"], x[i, "c"])
}
return(x)
}
There's doubtless more efficiencies possible, and of course this is an ideal candidate to rewrite it in C++ :).
This is about twice as fast as your method as seen by this:
method_matt <- function(x){
for(i in 1:nrow(x)) {
x <- x %>%
mutate(a = if_else(!is.na(a), a, lag(a, 1) * 1.1),
b = if_else(!is.na(b), b, a * lag(b, 1)),
c = if_else(!is.na(c), c, b * lag(a, 2)))
}
return(x)
}
set.seed(123)
x <- tibble(t = c(1:10),
a = c(seq(100, 140, 10), rep(NA_real_, 5)),
b = c(runif(5), rep(NA_real_, 5)),
c = c(runif(5), rep(NA_real_, 5)))
stopifnot(identical(method_matt(x), method_peter(x)))
library(microbenchmark)
microbenchmark(
method_matt(x),
method_peter(x)
)
which returns:
Unit: milliseconds
expr min lq mean median uq max neval
method_matt(x) 24.1975 25.50925 30.64438 26.33310 31.8681 74.5093 100
method_peter(x) 10.0005 10.56050 13.33751 11.06495 13.5913 42.0568 100
@Shree's method is much faster again and is ideal for the example data, but I'm not sure it is flexible enough to work in all your use cases.
I would like to see a tidyverse solution if there is one.
Edit: Added tidyverse approach
Heres a readable and flexible tidyverse approach. The downside is that it is very slow.
accumutate <- function(df, ...){
df %>% group_by(row_number()) %>%
nest() %>%
pull(data) %>%
accumulate(function(x,y){bind_rows(x,y) %>% mutate(!!!enquos(...)) }) %>%
.[[length(.)]]
}
x %>%
accumutate(a = ifelse(is.na(a), 1.1 * lag(a,1), a)) %>%
accumutate(b = ifelse(is.na(b), a * lag(b), b)) %>%
accumutate(c = ifelse(is.na(c),b * lag(a, 2), c))
#> # A tibble: 10 x 4
#> t a b c
#> <int> <dbl> <dbl> <dbl>
#> 1 1 100 2.88e- 1 4.56e- 2
#> 2 2 110 7.88e- 1 5.28e- 1
#> 3 3 120 4.09e- 1 8.92e- 1
#> 4 4 130 8.83e- 1 5.51e- 1
#> 5 5 140 9.40e- 1 4.57e- 1
#> 6 6 154 1.45e+ 2 1.88e+ 4
#> 7 7 169. 2.45e+ 4 3.43e+ 6
#> 8 8 186. 4.57e+ 6 7.04e+ 8
#> 9 9 205. 9.37e+ 8 1.59e+11
#> 10 10 225. 2.11e+11 3.94e+13
Created on 2020-10-07 by the reprex package (v0.3.0)
Here's another approach that you might find interesting. It's not concise or especially readable, but it's tidyverse (or at least functionally) inspired. And it performs fairly well.
It uses a semigroup pattern, converting the mutate expressions into binary functions, creating corresponding lists and then using accumulate
.
library(tidyverse)
library(dplyr)
library(microbenchmark)
options(width =100)
set.seed(123)
# Create the data frame
x <- tibble(t = c(1:100),
a = c(seq(100, 140, 10), rep(NA_real_,100- 5)),
b = c(runif(5), rep(NA_real_, 100-5)),
c = c(runif(5), rep(NA_real_, 100-5)))
a_mappend <- function(a1, a2) {
ifelse(is.na(a2), a1 * 1.1, a2)
}
b_mappend <- function(ab1, ab2) {
list(a = ab2$a, b = ifelse(is.na(ab2$b), ab2$a * ab1$b,ab2$b))
}
c_mappend <- function(abc12, abc23) {
list(abc1 = list(a = abc12$abc2$a, b = abc12$abc2$b, c = abc12$abc2$c),
abc2 = list(a = abc23$abc2$a, b = abc23$abc2$b, c = ifelse(is.na(abc23$abc2$c),abc12$abc1$a * abc23$abc2$b,abc23$abc2$c)))
}
method_ian <- function(x) {
x %>%
mutate(a = accumulate(a, a_mappend)) %>%
mutate(b = list(a, b) %>%
pmap(~ list(a = .x, b = .y)) %>%
accumulate(b_mappend) %>% map_dbl(~ .x$b)) %>%
mutate(c = list(a, b, c, c(a[-1], NA), c(b[-1], NA), c(c[-1], NA)) %>%
pmap(~ list(abc1 = list(a = ..1, b = ..2, c = ..3),
abc2 = list(a = ..4, b = ..5, c = ..6))) %>%
accumulate(c_mappend) %>% map_dbl(~ .x$abc1$c))
}
method_matt <- function(x){
for(i in 1:nrow(x)) {
x <- x %>%
mutate(a = if_else(!is.na(a), a, lag(a, 1) * 1.1),
b = if_else(!is.na(b), b, a * lag(b, 1)),
c = if_else(!is.na(c), c, b * lag(a, 2)))
}
return(x)
}
method_peter <- function(x){
for(i in 2:nrow(x)){
x[i, "a"] <- ifelse(is.na(x[i, "a"]), x[i - 1, "a"] * 1.1, x[i, "a"])
x[i, "b"] <- ifelse(is.na(x[i, "b"]), x[i, "a"] * x[i - 1, "b"], x[i, "b"])
x[i, "c"] <- ifelse(is.na(x[i, "c"]), x[i, "b"] * x[i - 2, "a"], x[i, "c"])
}
return(x)
}
stopifnot(identical(method_matt(x), method_ian(x)))
microbenchmark( method_matt(x), method_peter(x), method_ian(x))
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> method_matt(x) 324.90086 330.93192 337.46518 334.55447 338.38461 426.30457 100
#> method_peter(x) 208.27498 211.60526 213.59438 212.66088 214.36421 242.59854 100
#> method_ian(x) 13.06774 13.43105 14.30003 13.86428 14.32263 19.54843 100
Created on 2020-10-06 by the reprex package (v0.3.0)
I don't think there's any simple way in tidyverse
to do calculations with row-dependencies. Something with Reduce
or gather + spread
could be possible but I don't expect them to score poits on readability.
Anyways, on the bright side, your calculations are vectorizable using dplyr
and zoo
packages -
x %>%
mutate(
a = ifelse(is.na(a), na.locf(a) * 1.1^(t-5), a),
b = ifelse(is.na(b), na.locf(b) * c(rep(1, 5), cumprod(a[6:n()])), b),
c = ifelse(is.na(c), b * lag(a, 2), c)
)
# A tibble: 10 x 4
t a b c
<int> <dbl> <dbl> <dbl>
1 1 100 1.85e- 1 9.43e- 1
2 2 110 7.02e- 1 1.29e- 1
3 3 120 5.73e- 1 8.33e- 1
4 4 130 1.68e- 1 4.68e- 1
5 5 140 9.44e- 1 5.50e- 1
6 6 154 1.45e+ 2 1.89e+ 4
7 7 169. 2.46e+ 4 3.45e+ 6
8 8 186. 4.59e+ 6 7.07e+ 8
9 9 205. 9.40e+ 8 1.59e+11
10 10 225. 2.12e+11 3.95e+13
Data -
set.seed(2)
x <- tibble(t = c(1:10),
a = c(seq(100, 140, 10), rep(NA_real_, 5)),
b = c(runif(5), rep(NA_real_, 5)),
c = c(runif(5), rep(NA_real_, 5)))