Why is this naive matrix multiplication faster than base R's?
A quick glance in names.c (here in particular) points you to do_matprod, the C function that is called by %*% and which is found in the file array.c. (Interestingly, it turns out that both crossprod and tcrossprod dispatch to that same function as well.) Here is a link to the code of do_matprod.
Scrolling through the function, you can see that it takes care of a number of things your naive implementation does not, including:
- Keeps row and column names, where that makes sense.
- Allows for dispatch to alternative S4 methods when the two objects being operated on by a call to %*% are of classes for which such methods have been provided. (That's what's happening in this portion of the function.)
- Handles both real and complex matrices.
- Implements a series of rules for how to handle multiplication of a matrix and a matrix, a vector and a matrix, a matrix and a vector, and a vector and a vector. (Recall that in matrix multiplication with %*%, a vector on the LHS is treated as a row vector, whereas on the RHS it is treated as a column vector; this is the code that makes that so. A small example follows this list.)
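The first and last of those behaviours are easy to see from the R prompt. A minimal sketch (the names m and v here are mine, not the question's):

m <- matrix(1:4, 2, 2, dimnames = list(c("r1", "r2"), c("c1", "c2")))
v <- c(1, 1)
v %*% m  # v is used as a 1 x 2 row vector; the column names survive
m %*% v  # v is used as a 2 x 1 column vector; the row names survive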
Near the end of the function, it dispatches to either matprod or cmatprod. Interestingly (to me at least), in the case of real matrices, if either matrix might contain NaN or Inf values, then matprod dispatches (here) to a function called simple_matprod, which is about as simple and straightforward as your own. Otherwise, it dispatches to one of a couple of BLAS Fortran routines which, presumably, are faster if uniformly 'well-behaved' matrix elements can be guaranteed.
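You can provoke that fallback yourself. A rough sketch, assuming a reasonably recent R where the default options(matprod = "default") setting is in effect (timings will vary with your BLAS): planting a single NaN in an operand should push the product off the BLAS path and onto the simple loop.

a_ok  <- matrix(rnorm(1e6), 1000, 1000)
a_nan <- a_ok
a_nan[1, 1] <- NaN
system.time(a_ok %*% a_ok)    # eligible for the BLAS routine
system.time(a_nan %*% a_nan)  # scanned, then routed to simple_matprod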
Josh's answer explains why R's matrix multiplication is not as fast as this naive approach. I was curious to see how much one could gain using RcppArmadillo. The code is simple enough:
library(Rcpp)
arma_code <-
  "arma::vec arma_mm(const arma::mat& m, const arma::vec& v) {
     return m * v;
   }"
arma_mm <- cppFunction(code = arma_code, depends = "RcppArmadillo")
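A quick sanity check against base R, with the m and v from the question (note that RcppArmadillo may hand the arma::vec back as an n-by-1 matrix, hence the drop()):

all.equal(drop(m %*% v), drop(arma_mm(m, v)))  # should be TRUE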
Benchmark:
> microbenchmark::microbenchmark(my_mm(m,v), m %*% v, arma_mm(m,v), times = 10)
Unit: milliseconds
          expr      min       lq      mean    median        uq       max neval
   my_mm(m, v) 71.23347 75.22364  90.13766  96.88279  98.07348  98.50182    10
       m %*% v 92.86398 95.58153 106.00601 111.61335 113.66167 116.09751    10
 arma_mm(m, v) 41.13348 41.42314  41.89311  41.81979  42.39311  42.78396    10
So RcppArmadillo gives us nicer syntax and better performance (under the hood, Armadillo typically forwards m * v to a BLAS dgemv call as well, just without R's overhead).
Curiosity got the better of me, so here is a solution that calls BLAS directly:
blas_code <- "
NumericVector blas_mm(NumericMatrix m, NumericVector v){
  int nRow = m.rows();
  int nCol = m.cols();
  NumericVector ans(nRow);       // zero-initialised result
  char trans = 'N';              // 'N': use m as-is, i.e. compute m %*% v
  double one = 1.0, zero = 0.0;  // alpha and beta: ans = 1*m*v + 0*ans
  int ione = 1;                  // unit stride through v and ans
  // NB: newer R versions may expect hidden character-length arguments
  // (FCONE) for the 'trans' argument; see 'Writing R Extensions'
  F77_CALL(dgemv)(&trans, &nRow, &nCol, &one, m.begin(), &nRow, v.begin(),
                  &ione, &zero, ans.begin(), &ione);
  return ans;
}"
blas_mm <- cppFunction(code = blas_code, includes = "#include <R_ext/BLAS.h>")
Benchmark:
> microbenchmark::microbenchmark(my_mm(m,v), m %*% v, arma_mm(m,v), blas_mm(m,v), times = 10)
Unit: milliseconds
          expr      min       lq      mean    median        uq       max neval
   my_mm(m, v) 72.61298 75.40050  89.75529  96.04413  96.59283  98.29938    10
       m %*% v 95.08793 98.53650 109.52715 111.93729 112.89662 128.69572    10
 arma_mm(m, v) 41.06718 41.70331  42.62366  42.47320  43.22625  45.19704    10
 blas_mm(m, v) 41.58618 42.14718  42.89853  42.68584  43.39182  44.46577    10
Armadillo and BLAS (OpenBLAS in my case) are almost the same. And the BLAS call is what R performs in the end as well. Comparing medians, the raw dgemv wrapper takes about 42.7 ms against 111.9 ms for %*%, so roughly two thirds of what R does is error checking, dispatch and the like.
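(If you are unsure which BLAS your own R is linked against, recent versions of R will tell you; this sketch assumes R >= 3.4 for the sessionInfo() output and a slightly newer R for the BLAS entry of extSoftVersion():)

sessionInfo()             # prints the BLAS/LAPACK libraries in use
extSoftVersion()["BLAS"]  # just the BLAS shared object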
To add another point to Ralf Stubner's solution, you can use the following C++ version to:
- do multiple columns at the same time, to avoid re-reading the output vector many times;
- add __restrict__ to potentially allow for vector operations (this likely does not matter here, I guess, as it is only reads).
#include <Rcpp.h>
using namespace Rcpp;
// textbook column-major product: walk through m one column at a time,
// accumulating each column's contribution into res
inline void mat_vec_mult_vanilla
(double const * __restrict__ m,
 double const * __restrict__ v,
 double * __restrict__ const res,
 size_t const dn, size_t const dm) noexcept {
  for(size_t j = 0; j < dm; ++j, ++v){
    double * r = res;
    for(size_t i = 0; i < dn; ++i, ++r, ++m)
      *r += *m * *v;
  }
}
inline void mat_vec_mult
(double const * __restrict__ const m,
 double const * __restrict__ const v,
 double * __restrict__ const res,
 size_t const dn, size_t const dm) noexcept {
  size_t j(0L);
  double const * vj = v,
               * mi = m;
  constexpr size_t const ncl(8L); // columns handled per pass over res
  {
    double const * mvals[ncl];    // pointers into the ncl active columns
    size_t const end_j = dm - (dm % ncl),
                 inc = ncl * dn;
    for(; j < end_j; j += ncl, vj += ncl, mi += inc){
      double *r = res;
      mvals[0] = mi;
      for(size_t i = 1; i < ncl; ++i)
        mvals[i] = mvals[i - 1L] + dn;
      // one pass over the rows adds ncl columns' contributions to res
      for(size_t i = 0; i < dn; ++i, ++r)
        for(size_t ii = 0; ii < ncl; ++ii)
          *r += *(vj + ii) * *mvals[ii]++;
    }
  }
  // the dm % ncl leftover columns are handled by the vanilla version
  mat_vec_mult_vanilla(mi, vj, res, dn, dm - j);
}
// [[Rcpp::export("mat_vec_mult", rng = false)]]
NumericVector mat_vec_mult_cpp(NumericMatrix m, NumericVector v){
  size_t const dn = m.nrow(),
               dm = m.ncol();
  NumericVector res(dn);
  mat_vec_mult(&m[0], &v[0], &res[0], dn, dm);
  return res;
}

// [[Rcpp::export("mat_vec_mult_vanilla", rng = false)]]
NumericVector mat_vec_mult_vanilla_cpp(NumericMatrix m, NumericVector v){
  size_t const dn = m.nrow(),
               dm = m.ncol();
  NumericVector res(dn);
  mat_vec_mult_vanilla(&m[0], &v[0], &res[0], dn, dm);
  return res;
}
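The benchmark below was compiled with -O3. For reference, one place to set that flag is the user-level Makevars file; a minimal sketch, assuming a typical Linux setup (the exact file location and variable names may differ on your platform):

## ~/.R/Makevars
CXXFLAGS = -O3
CXX11FLAGS = -O3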
The result with -O3 in my Makevars file and gcc-8.3 is:
set.seed(1)
dn <- 10001L
dm <- 10001L
m <- matrix(rnorm(dn * dm), dn, dm)
lv <- rnorm(dm)
all.equal(drop(m %*% lv), mat_vec_mult(m = m, v = lv))
#R> [1] TRUE
all.equal(drop(m %*% lv), mat_vec_mult_vanilla(m = m, v = lv))
#R> [1] TRUE
bench::mark(
  R = m %*% lv,
  `OP's version` = my_mm(m = m, v = lv),
  `BLAS` = blas_mm(m = m, v = lv),
  `C++ vanilla` = mat_vec_mult_vanilla(m = m, v = lv),
  `C++` = mat_vec_mult(m = m, v = lv), check = FALSE)
#R> # A tibble: 5 x 13
#R>   expression        min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory                  time          gc
#R>   <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>                  <list>        <list>
#R> 1 R             147.9ms    151ms      6.57    78.2KB        0     4     0      609ms <NULL> <Rprofmem[,3] [2 × 3]>  <bch:tm [4]>  <tibble [4 × 3]>
#R> 2 OP's version   56.9ms   57.1ms     17.4     78.2KB        0     9     0      516ms <NULL> <Rprofmem[,3] [2 × 3]>  <bch:tm [9]>  <tibble [9 × 3]>
#R> 3 BLAS           90.1ms   90.7ms     11.0     78.2KB        0     6     0      545ms <NULL> <Rprofmem[,3] [2 × 3]>  <bch:tm [6]>  <tibble [6 × 3]>
#R> 4 C++ vanilla    57.2ms   57.4ms     17.4     78.2KB        0     9     0      518ms <NULL> <Rprofmem[,3] [2 × 3]>  <bch:tm [9]>  <tibble [9 × 3]>
#R> 5 C++              51ms   51.4ms     19.3     78.2KB        0    10     0      519ms <NULL> <Rprofmem[,3] [2 × 3]>  <bch:tm [10]> <tibble [10 × 3]>
So a slight improvement. The result may, though, depend heavily on the BLAS version. The version I used is the following (judging by the paths, the un-optimised reference BLAS):
sessionInfo()
#R> #...
#R> Matrix products: default
#R> BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
#R> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
#R> ...
The whole file I Rcpp::sourceCpp()ed is:
#include <Rcpp.h>
#include <R_ext/BLAS.h>
using namespace Rcpp;
// textbook column-major product: walk through m one column at a time,
// accumulating each column's contribution into res
inline void mat_vec_mult_vanilla
(double const * __restrict__ m,
 double const * __restrict__ v,
 double * __restrict__ const res,
 size_t const dn, size_t const dm) noexcept {
  for(size_t j = 0; j < dm; ++j, ++v){
    double * r = res;
    for(size_t i = 0; i < dn; ++i, ++r, ++m)
      *r += *m * *v;
  }
}
inline void mat_vec_mult
(double const * __restrict__ const m,
 double const * __restrict__ const v,
 double * __restrict__ const res,
 size_t const dn, size_t const dm) noexcept {
  size_t j(0L);
  double const * vj = v,
               * mi = m;
  constexpr size_t const ncl(8L); // columns handled per pass over res
  {
    double const * mvals[ncl];    // pointers into the ncl active columns
    size_t const end_j = dm - (dm % ncl),
                 inc = ncl * dn;
    for(; j < end_j; j += ncl, vj += ncl, mi += inc){
      double *r = res;
      mvals[0] = mi;
      for(size_t i = 1; i < ncl; ++i)
        mvals[i] = mvals[i - 1L] + dn;
      // one pass over the rows adds ncl columns' contributions to res
      for(size_t i = 0; i < dn; ++i, ++r)
        for(size_t ii = 0; ii < ncl; ++ii)
          *r += *(vj + ii) * *mvals[ii]++;
    }
  }
  // the dm % ncl leftover columns are handled by the vanilla version
  mat_vec_mult_vanilla(mi, vj, res, dn, dm - j);
}
// [[Rcpp::export("mat_vec_mult", rng = false)]]
NumericVector mat_vec_mult_cpp(NumericMatrix m, NumericVector v){
  size_t const dn = m.nrow(),
               dm = m.ncol();
  NumericVector res(dn);
  mat_vec_mult(&m[0], &v[0], &res[0], dn, dm);
  return res;
}

// [[Rcpp::export("mat_vec_mult_vanilla", rng = false)]]
NumericVector mat_vec_mult_vanilla_cpp(NumericMatrix m, NumericVector v){
  size_t const dn = m.nrow(),
               dm = m.ncol();
  NumericVector res(dn);
  mat_vec_mult_vanilla(&m[0], &v[0], &res[0], dn, dm);
  return res;
}
// [[Rcpp::export(rng = false)]]
NumericVector my_mm(NumericMatrix m, NumericVector v){
  int nRow = m.rows();
  int nCol = m.cols();
  NumericVector ans(nRow);
  double v_j;
  for(int j = 0; j < nCol; j++){
    v_j = v[j];
    for(int i = 0; i < nRow; i++){
      ans[i] += m(i,j) * v_j;
    }
  }
  return ans;
}
// [[Rcpp::export(rng = false)]]
NumericVector blas_mm(NumericMatrix m, NumericVector v){
  int nRow = m.rows();
  int nCol = m.cols();
  NumericVector ans(nRow);       // zero-initialised result
  char trans = 'N';              // 'N': use m as-is, i.e. compute m %*% v
  double one = 1.0, zero = 0.0;  // alpha and beta: ans = 1*m*v + 0*ans
  int ione = 1;                  // unit stride through v and ans
  F77_CALL(dgemv)(&trans, &nRow, &nCol, &one, m.begin(), &nRow, v.begin(),
                  &ione, &zero, ans.begin(), &ione);
  return ans;
}
/*** R
set.seed(1)
dn <- 10001L
dm <- 10001L
m <- matrix(rnorm(dn * dm), dn, dm)
lv <- rnorm(dm)
all.equal(drop(m %*% lv), mat_vec_mult(m = m, v = lv))
all.equal(drop(m %*% lv), mat_vec_mult_vanilla(m = m, v = lv))
bench::mark(
  R = m %*% lv,
  `OP's version` = my_mm(m = m, v = lv),
  `BLAS` = blas_mm(m = m, v = lv),
  `C++ vanilla` = mat_vec_mult_vanilla(m = m, v = lv),
  `C++` = mat_vec_mult(m = m, v = lv), check = FALSE)
*/
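To reproduce, save the file under any name (mat_vec.cpp, say; the name is my choice here) and run:

Rcpp::sourceCpp("mat_vec.cpp")

sourceCpp() compiles and exports the C++ functions and then runs the /*** R ... */ block at the end, producing the checks and benchmark above.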