R: Which is the optimal way to compute functions over time with 3D arrays (latitude, longitude, and time)?
Usually data.table is quite fast and memory efficient. You might want to check ?data.table::frollapply and note that as.data.table provides an S3 method for class 'array'.
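For illustration, here is a minimal sketch (toy dimensions, not the data from the question) of both points: as.data.table() melts an array into long format with one row per cell and one column per dimension plus value, and frollapply() is the generic rolling counterpart of frollmean():
library(data.table)
a <- array(rnorm(2 * 3 * 4), dim = c(2, 3, 4), dimnames = list(lat = NULL, lon = NULL, time = NULL))
head(as.data.table(a)) # expect columns lat, lon, time, value - the shape the grouped operations below rely on
frollapply(1:5, n = 3, FUN = median) # expect NA NA 2 3 4 (right-aligned, NA-padded)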
The following compares the base R code snippets you provided with their data.table counterparts (I'm not really familiar with the tidyverse). I'm focusing only on the calculation itself, not on the output format, so for the data.table solution the output isn't converted back to an array. Here is some info on that - also using aperm, as you do above.
The benchmark was created using the reduced dataset (I'll update with the results of your datasets once the computing is done).
Edit: Just updated the benchmark results using the entire dataset.
library(data.table)
library(microbenchmark)
library(zoo)
lat <- seq(from = -90, to = 90, by = 0.25)
lon <- seq(from = 0, to = 359.75, by = 0.25)
time <- seq(from = as.Date('1980-01-01'), by = 'month', length.out = 480)
# data <- array(rnorm(721 * 1440 * 480), dim = c(721, 1440, 480), dimnames = list(lat = lat, lon = lon, time = time))
data <- array(rnorm(72 * 144 * 48), dim = c(72, 144, 48), dimnames = list(lat = NULL, lon = NULL, time = NULL)) # reduced dataset
DT <- as.data.table(data)
setorder(DT, time) # as per @jangorecki we need to ensure that DT is sorted by time
rolling_mean <- function(x) {
  # zoo::rollapply drops the first width - 1 incomplete windows,
  # so the result is shorter than the input series
  return(rollapply(
    data = x,
    width = 3,
    by = 1,
    FUN = mean
  ))
}
microbenchmark(
  "mean - base R" = {
    apply(data, 1:2, mean)
  },
  "mean - data.table" = {
    DT[, .("mean_value" = mean(value)), by = .(lat, lon)]
  },
  times = 1L
)
Unit: seconds
expr min lq mean median uq max neval
mean - base R 33.152048 33.152048 33.152048 33.152048 33.152048 33.152048 1
mean - data.table 8.287379 8.287379 8.287379 8.287379 8.287379 8.287379 1
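As an aside, if the grouped result is needed in wide (lat x lon) form again, one option - a sketch, not benchmarked here - is data.table::dcast():
res <- DT[, .(mean_value = mean(value)), by = .(lat, lon)]
dcast(res, lat ~ lon, value.var = "mean_value") # one row per lat, one column per lon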
microbenchmark(
  "rollmean - base R" = {
    apply(X = data,
          MARGIN = 1:2,
          FUN = rolling_mean)
  },
  "rollmean - data.table" = {
    DT[, rollmean := frollmean(value, n = 3), by = .(lat, lon)]
  },
  times = 1L
)
Unit: seconds
expr min lq mean median uq max neval
rollmean - base R 477.26644 477.26644 477.26644 477.26644 477.26644 477.26644 1
rollmean - data.table 48.80027 48.80027 48.80027 48.80027 48.80027 48.80027 1
As you can see, using data.table provides the results significantly faster.
For efficient means over the array, we can permute the array with aperm() and use colMeans(). This is faster than (my favorite) data.table and does not require coercion from an array to a data.frame-like object. Thanks to @ismirsehregal for the structure of the benchmarks.
Note, this is the smaller dataset. While there is a conversion time penalty for going from array to data.table, the data.table method would likely become faster at some point, as it can make use of parallel cores, unlike the colMeans approach. I didn't have enough RAM to use the larger dataset.
colMeans(aperm(data, c(3L, 1L, 2L)))
all.equal(apply(data, 1:2, mean), colMeans(aperm(data, c(3L, 1L, 2L))))
## [1] TRUE
microbenchmark::microbenchmark(
  "mean - base R" = apply(data, 1:2, mean),
  "mean - data.table" = DT[, .("mean_value" = mean(value)), by = .(lat, lon)],
  "mean - base colMeans" = colMeans(aperm(data, c(3L, 1L, 2L)))
)
## Unit: milliseconds
## expr min lq mean median uq max neval
## mean - base R 48.6293 52.30380 57.611552 54.8006 61.2880 146.4658 100
## mean - data.table 7.5610 8.95255 13.004569 9.7459 17.4888 28.7324 100
## mean - base colMeans 6.1713 6.45505 8.421273 6.6429 7.7849 99.3952 100
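Regarding the parallel-cores caveat above: data.table's thread usage can be inspected and pinned explicitly (a small sketch, not used in the timings shown here):
getDTthreads() # number of threads data.table currently uses
setDTthreads(4L) # e.g. restrict/raise to 4 threads before re-running the benchmarks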
For efficient rolling means, zoo::rollapply() is intuitive but not fast. We can instead use a number of package solutions such as data.table::frollmean() or RcppRoll::roll_mean(). See also here for various methods for rolling averages:
Calculating moving average
Or @JanGorecki's question on performance:
Adaptive moving average - top performance in R
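One caveat when comparing these packages (a quick sketch on a single vector, separate from the benchmark below): they align and pad differently, so their raw outputs are not identical element for element:
x <- rnorm(10)
# zoo::rollapply() drops the incomplete windows, data.table::frollmean() is
# right-aligned and pads the first n - 1 positions with NA, and
# stats::filter(..., sides = 2) centres the window and pads both ends with NA
all.equal(zoo::rollapply(x, width = 3, FUN = mean),
          as.numeric(na.omit(data.table::frollmean(x, n = 3)))) # expect TRUE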
I profiled @Matti Pastell's solution and @pipefish's solution from the first link, in addition to using data.table::frollmean() with apply:
## @Matti Pastell's solution: centred moving average via stats::filter
base_ma = function(x, n = 5){stats::filter(x, rep(1 / n, n), sides = 2)}
## @pipefish's solution: rolling sum as a difference of cumulative sums
base_ma2 = function(x, n = 5L) {
  cx <- c(0, cumsum(x))
  (cx[(n + 1):length(cx)] - cx[1:(length(cx) - n)]) / n
}
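A quick sanity check (sketch) that the cumsum trick reproduces the zoo result on a single vector:
v <- rnorm(10)
all.equal(rolling_mean(v), base_ma2(v, n = 3L)) # expect TRUE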
microbenchmark(
  "rollmean - apply zoo" = {
    apply(X = data,
          MARGIN = 1:2,
          FUN = rolling_mean)
  },
  "rollmean - apply dt.frollmean" = {
    apply(X = data,
          MARGIN = 1:2,
          FUN = frollmean, n = 3L)
  },
  "rollmean - data.table" = {
    DT[, rollmean := frollmean(value, n = 3), by = .(lat, lon)]
  },
  "rollmean - apply stats::filter" = {
    apply(X = data,
          MARGIN = 1:2,
          FUN = base_ma, n = 3L)
  },
  "rollmean - apply w_cumsum" = {
    apply(X = data,
          MARGIN = 1:2,
          FUN = base_ma2, n = 3L)
  },
  times = 1L
)
Unit: milliseconds
expr min lq mean median uq max neval
rollmean - apply zoo 2823.6218 2823.6218 2823.6218 2823.6218 2823.6218 2823.6218 1
rollmean - apply dt.frollmean 331.7787 331.7787 331.7787 331.7787 331.7787 331.7787 1
rollmean - data.table 358.0787 358.0787 358.0787 358.0787 358.0787 358.0787 1
rollmean - apply stats::filter 435.5238 435.5238 435.5238 435.5238 435.5238 435.5238 1
rollmean - apply w_cumsum 148.7428 148.7428 148.7428 148.7428 148.7428 148.7428 1
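One final note on output shapes (not part of the timings above): apply() with MARGIN = 1:2 returns the rolling series on the first dimension of the result, so restoring the original lat x lon x time layout needs an aperm(), e.g. (sketch for the reduced dataset):
res <- apply(X = data, MARGIN = 1:2, FUN = frollmean, n = 3L) # dims: time x lat x lon
res <- aperm(res, c(2, 3, 1)) # back to lat x lon x time
dim(res) # expect 72 144 48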