Why is the Rcpp implementation in my example much slower than the R function?
Your R code seems to be more or less optimal, i.e. all the real work is done in compiled code. For the C++ code, the main problem I could find is calling c1.ncol() in a tight loop; if I replace that with a local variable m that is read once, the C++ solution is almost as fast as R (see the sketch below). If I add RcppArmadillo to the mix, I get very compact syntax, but nothing faster than the pure Rcpp code. To me this shows that it can be really hard to beat well-written R code.
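For illustration, here is a minimal sketch of the loop-bound issue; col_sum_slow and col_sum_fast are hypothetical helpers, not part of the code below:

#include <Rcpp.h>
using namespace Rcpp;

// Anti-pattern: c1.ncol() is a method call that the compiler cannot
// always prove loop-invariant, so it may be re-evaluated every iteration.
double col_sum_slow(const NumericMatrix& c1) {
  double s = 0.0;
  for (int j = 0; j < c1.ncol(); ++j)
    s += c1(0, j);  // assumes c1 has at least one row
  return s;
}

// Better: read the bound into a plain int once, before the loop.
double col_sum_fast(const NumericMatrix& c1) {
  double s = 0.0;
  int m = c1.ncol();
  for (int j = 0; j < m; ++j)
    s += c1(0, j);
  return s;
}

With that fix applied (and an Armadillo variant added for comparison), the full code is: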
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

using namespace Rcpp;

// [[Rcpp::export]]
List arma_test(const arma::vec& Delta,
               const arma::mat& delta,
               const arma::mat& rx) {
  int l = delta.n_cols;
  List c(l);
  for (int i = 0; i < l; ++i) {
    c(i) = rx.each_col() % (Delta / (1 + delta.col(i)));
  }
  return c;
}
// [[Rcpp::export]]
List rcpp_test(NumericVector Delta,
               NumericMatrix delta,
               NumericMatrix rx) {
  int n = Delta.length();
  int m = rx.ncol();
  List c(delta.ncol());
  NumericMatrix c1;
  for (int i = 0; i < delta.ncol(); ++i) {
    c1 = NumericMatrix(n, m);
    for (int k = 0; k < n; ++k) {
      double tmp = Delta[k] / (1 + delta(k, i));
      for (int j = 0; j < m; ++j) {
        c1(k, j) = rx(k, j) * tmp;
      }
    }
    c(i) = c1;
  }
  return c;
}
/*** R
test <- function(Delta, delta, rx) {
  const <- list()
  for (i in 1:ncol(delta)) {
    const[[i]] <- rx * (Delta / (1 + delta[, i]))
  }
  const
}

n <- 50000
Delta <- exp(rnorm(n))
delta <- exp(matrix(rnorm(n * 5), nrow = n))
rx <- matrix(rnorm(n * 20), nrow = n)

bench::mark(test(Delta, delta, rx),
            arma_test(Delta, delta, rx),
            rcpp_test(Delta, delta, rx))
*/
Output:
# A tibble: 3 x 14
expression min mean median max `itr/sec` mem_alloc n_gc n_itr
<chr> <bch:t> <bch:t> <bch:t> <bch:t> <dbl> <bch:byt> <dbl> <int>
1 test(Delt… 84.3ms 85.2ms 84.9ms 86.6ms 11.7 44.9MB 2 4
2 arma_test… 106.5ms 107.7ms 107.7ms 108.9ms 9.28 38.1MB 3 2
3 rcpp_test… 101.9ms 103.2ms 102.2ms 106.6ms 9.69 38.1MB 1 4
# … with 5 more variables: total_time <bch:tm>, result <list>, memory <list>,
# time <list>, gc <list>
I have also explicitly initialized the output list to the required size, avoiding push_back, but that did not make a big difference here. With vector-like data structures from Rcpp you should definitely avoid push_back, though, since a copy of the whole vector is made every time you extend it.
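To make that cost concrete, here is a minimal sketch of the two patterns; k and mat are placeholder names, not taken from the code above:

#include <Rcpp.h>
using namespace Rcpp;

// Anti-pattern: List::push_back allocates a new list and copies all
// existing elements on every call, so filling k slots costs O(k^2).
List grow_by_push_back(int k, const NumericMatrix& mat) {
  List out;
  for (int i = 0; i < k; ++i) {
    out.push_back(mat);
  }
  return out;
}

// Better: allocate the list at its final size once, then assign by index.
List grow_preallocated(int k, const NumericMatrix& mat) {
  List out(k);
  for (int i = 0; i < k; ++i) {
    out[i] = mat;
  }
  return out;
}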
I would like to add to @RalfStubner's excellent answer.
You will notice that we are making many allocations inside the first for loop (i.e. c1 = NumericMatrix(n, m)). This can be expensive, as we are initializing every element to 0 in addition to allocating memory. We can change this to the following for increased efficiency:
NumericMatrix c1 = no_init_matrix(n, m);  // allocate without zero-filling
I also went ahead and added the const keyword where possible. It's debatable whether doing so allows the compiler to optimize certain pieces of code, but I still add it where I can for code clarity (i.e. "I don't want this variable to change"). With that, we have:
// [[Rcpp::export]]
List rcpp_test_modified(const NumericVector Delta,
                        const NumericMatrix delta,
                        const NumericMatrix rx) {
  int n = Delta.length();
  int m = rx.ncol();
  int dCol = delta.ncol();
  List c(dCol);
  for (int i = 0; i < dCol; ++i) {
    NumericMatrix c1 = no_init_matrix(n, m);
    for (int k = 0; k < n; ++k) {
      const double tmp = Delta[k] / (1 + delta(k, i));
      for (int j = 0; j < m; ++j) {
        c1(k, j) = rx(k, j) * tmp;
      }
    }
    c[i] = c1;
  }
  return c;
}
And here are some benchmarks (Armadillo solution left out):
bench::mark(test(Delta, delta, rx),
            rcpp_test_modified(Delta, delta, rx),
            rcpp_test(Delta, delta, rx))
# A tibble: 3 x 14
expression min mean median max `itr/sec` mem_alloc n_gc n_itr total_time result memory time
<chr> <bch:t> <bch:> <bch:t> <bch:> <dbl> <bch:byt> <dbl> <int> <bch:tm> <list> <list> <lis>
1 test(Delt… 12.27ms 17.2ms 14.56ms 29.5ms 58.1 41.1MB 13 8 138ms <list… <Rpro… <bch…
2 rcpp_test… 7.55ms 11.4ms 8.46ms 26ms 87.8 38.1MB 16 21 239ms <list… <Rpro… <bch…
3 rcpp_test… 10.36ms 15.8ms 13.64ms 28.9ms 63.4 38.1MB 10 17 268ms <list… <Rpro… <bch…
# … with 1 more variable: gc <list>
And we see roughly a 40% improvement in median time with the modified Rcpp version (8.46 ms vs. 14.56 ms for the R function).