R: Which is the optimal way to compute functions over time with 3D arrays (latitude, longitude, and time)?
Usually data.table is quite fast and memory efficient. You might want to check ?data.table::frollapply and note that as.data.table provides an S3 method for class 'array'.
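For illustration, here is a minimal sketch (toy dimensions, not the data from the question) of both points: as.data.table() melts an array into long format with one row per cell and one column per dimension plus value, and frollapply() is the generic rolling counterpart of frollmean():
library(data.table)
a <- array(rnorm(2 * 3 * 4), dim = c(2, 3, 4), dimnames = list(lat = NULL, lon = NULL, time = NULL))
head(as.data.table(a)) # expect columns lat, lon, time, value - the shape the grouped operations below rely on
frollapply(1:5, n = 3, FUN = median) # expect NA NA 2 3 4 (right-aligned, NA-padded)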
The following compares the base R code snippets you provided with their data.table counterparts (I'm not really familiar with the tidyverse). I'm focusing only on the calculation itself, not on the output format, so for the data.table solution the output isn't converted back to an array. Here is some info on that - also using aperm, as you do above.
The benchmark was created using the reduced dataset (I'll update with the results of your datasets once the computing is done).
Edit: Just updated the benchmark results using the entire dataset.
library(data.table)
library(microbenchmark)
library(zoo)
lat <- seq(from = -90, to = 90, by = 0.25)
lon <- seq(from = 0, to = 359.75, by = 0.25)
time <- seq(from = as.Date('1980-01-01'), by = 'month', length.out = 480)
# data <- array(rnorm(721 * 1440 * 480), dim = c(721, 1440, 480), dimnames = list(lat = lat, lon = lon, time = time))
data <- array(rnorm(72 * 144 * 48), dim = c(72, 144, 48), dimnames = list(lat = NULL, lon = NULL, time = NULL)) # reduced dataset
DT <- as.data.table(data)
setorder(DT, time) # as per @jangorecki we need to ensure that DT is sorted by time
rolling_mean <- function(x) {
  # zoo::rollapply drops the first width - 1 incomplete windows,
  # so the result is shorter than the input series
  return(rollapply(
    data = x,
    width = 3,
    by = 1,
    FUN = mean
  ))
}
microbenchmark(
  "mean - base R" = {
    apply(data, 1:2, mean)
  },
  "mean - data.table" = {
    DT[, .("mean_value" = mean(value)), by = .(lat, lon)]
  },
  times = 1L
)
Unit: seconds
expr min lq mean median uq max neval
mean - base R 33.152048 33.152048 33.152048 33.152048 33.152048 33.152048 1
mean - data.table 8.287379 8.287379 8.287379 8.287379 8.287379 8.287379 1
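As an aside, if the grouped result is needed in wide (lat x lon) form again, one option - a sketch, not benchmarked here - is data.table::dcast():
res <- DT[, .(mean_value = mean(value)), by = .(lat, lon)]
dcast(res, lat ~ lon, value.var = "mean_value") # one row per lat, one column per lon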
microbenchmark(
  "rollmean - base R" = {
    apply(X = data,
          MARGIN = 1:2,
          FUN = rolling_mean)
  },
  "rollmean - data.table" = {
    DT[, rollmean := frollmean(value, n = 3), by = .(lat, lon)]
  },
  times = 1L
)
Unit: seconds
expr min lq mean median uq max neval
rollmean - base R 477.26644 477.26644 477.26644 477.26644 477.26644 477.26644 1
rollmean - data.table 48.80027 48.80027 48.80027 48.80027 48.80027 48.80027 1
As you can see, using data.table provides the results significantly faster.
For efficient means over the array, we can permute the array with aperm() and use colMeans(). This is faster than (my favorite) data.table and does not require coercion from an array to a data.frame-like object. Thanks to @ismirsehregal for the structure of the benchmarks.
Note, this is the smaller dataset. While there is a conversion time penalty for going from array to data.table, the data.table method would likely become faster at some point, as it can make use of parallel cores, unlike the colMeans approach. I didn't have enough RAM to use the larger dataset.
colMeans(aperm(data, c(3L, 1L, 2L)))
all.equal(apply(data, 1:2, mean), colMeans(aperm(data, c(3L, 1L, 2L))))
## [1] TRUE
microbenchmark::microbenchmark(
  "mean - base R" = apply(data, 1:2, mean),
  "mean - data.table" = DT[, .("mean_value" = mean(value)), by = .(lat, lon)],
  "mean - base colMeans" = colMeans(aperm(data, c(3L, 1L, 2L)))
)
## Unit: milliseconds
## expr min lq mean median uq max neval
## mean - base R 48.6293 52.30380 57.611552 54.8006 61.2880 146.4658 100
## mean - data.table 7.5610 8.95255 13.004569 9.7459 17.4888 28.7324 100
## mean - base colMeans 6.1713 6.45505 8.421273 6.6429 7.7849 99.3952 100
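Regarding the parallel-cores caveat above: data.table's thread usage can be inspected and pinned explicitly (a small sketch, not used in the timings shown here):
getDTthreads() # number of threads data.table currently uses
setDTthreads(4L) # e.g. restrict/raise to 4 threads before re-running the benchmarks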
For efficient rolling means, zoo::rollapply() is intuitive but not fast. We can instead use a number of package solutions such as data.table::frollmean() or RcppRoll::roll_mean(). See also here for various methods for rolling averages:
Calculating moving average
Or @JanGorecki's question on performance:
Adaptive moving average - top performance in R
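One caveat when comparing these packages (a quick sketch on a single vector, separate from the benchmark below): they align and pad differently, so their raw outputs are not identical element for element:
x <- rnorm(10)
# zoo::rollapply() drops the incomplete windows, data.table::frollmean() is
# right-aligned and pads the first n - 1 positions with NA, and
# stats::filter(..., sides = 2) centres the window and pads both ends with NA
all.equal(zoo::rollapply(x, width = 3, FUN = mean),
          as.numeric(na.omit(data.table::frollmean(x, n = 3)))) # expect TRUE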
I profiled @Matti Pastell's solution and @pipefish's solution from the first link, in addition to using data.table::frollmean() with apply:
## @Matti Pastell's solution: centred moving average via stats::filter
base_ma = function(x, n = 5){stats::filter(x, rep(1 / n, n), sides = 2)}
## @pipefish's solution: rolling sum as a difference of cumulative sums
base_ma2 = function(x, n = 5L) {
  cx <- c(0, cumsum(x))
  (cx[(n + 1):length(cx)] - cx[1:(length(cx) - n)]) / n
}
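A quick sanity check (sketch) that the cumsum trick reproduces the zoo result on a single vector:
v <- rnorm(10)
all.equal(rolling_mean(v), base_ma2(v, n = 3L)) # expect TRUE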
microbenchmark(
  "rollmean - apply zoo" = {
    apply(X = data,
          MARGIN = 1:2,
          FUN = rolling_mean)
  },
  "rollmean - apply dt.frollmean" = {
    apply(X = data,
          MARGIN = 1:2,
          FUN = frollmean, n = 3L)
  },
  "rollmean - data.table" = {
    DT[, rollmean := frollmean(value, n = 3), by = .(lat, lon)]
  },
  "rollmean - apply stats::filter" = {
    apply(X = data,
          MARGIN = 1:2,
          FUN = base_ma, n = 3L)
  },
  "rollmean - apply w_cumsum" = {
    apply(X = data,
          MARGIN = 1:2,
          FUN = base_ma2, n = 3L)
  },
  times = 1L
)
Unit: milliseconds
expr min lq mean median uq max neval
rollmean - apply zoo 2823.6218 2823.6218 2823.6218 2823.6218 2823.6218 2823.6218 1
rollmean - apply dt.frollmean 331.7787 331.7787 331.7787 331.7787 331.7787 331.7787 1
rollmean - data.table 358.0787 358.0787 358.0787 358.0787 358.0787 358.0787 1
rollmean - apply stats::filter 435.5238 435.5238 435.5238 435.5238 435.5238 435.5238 1
rollmean - apply w_cumsum 148.7428 148.7428 148.7428 148.7428 148.7428 148.7428 1
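One final note on output shapes (not part of the timings above): apply() with MARGIN = 1:2 returns the rolling series on the first dimension of the result, so restoring the original lat x lon x time layout needs an aperm(), e.g. (sketch for the reduced dataset):
res <- apply(X = data, MARGIN = 1:2, FUN = frollmean, n = 3L) # dims: time x lat x lon
res <- aperm(res, c(2, 3, 1)) # back to lat x lon x time
dim(res) # expect 72 144 48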