Optimizing R objective function with Rcpp slower, why?
I can think of four potential optimizations over Ralf's and Oliver's answers.
(You should accept their answers, but I just wanted to add my 2 cents).
1) Use // [[Rcpp::export(rng = false)]] as the attribute header for the function in a separate C++ file. This leads to a ~80% speed-up on my machine. (This is the most important suggestion of the four.)
2) Prefer cmath functions such as std::exp and std::log when possible. (In this case, it doesn't seem to make a difference.)
3) Avoid allocation whenever possible, e.g. don't shift beta
into a new vector.
4) Stretch goal: use SEXP parameters rather than Rcpp vectors. (Left as an exercise to the reader; a rough sketch follows after the next paragraph.) Rcpp vectors are very thin wrappers, but they're still wrappers, and there is a small overhead.
These suggestions wouldn't be important if not for the fact that you're calling the function in a tight loop inside optim, so any per-call overhead adds up quickly.
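To illustrate suggestion #4 (using #2 as well), here is a hypothetical, unchecked sketch of what a SEXP-based variant of the v4 function below might look like. The function name is made up, and it assumes beta arrives as a double vector and Obs as an integer vector, with no validation:

#include <Rcpp.h>
#include <cmath>

// [[Rcpp::export(rng = false)]]
double llmnl_int_C_sexp(SEXP beta_s, SEXP Obs_s, SEXP n_cat_s) {
  // Hypothetical sketch: read the underlying pointers directly, skipping
  // the (small) cost of constructing Rcpp wrapper objects. Assumes beta_s
  // is a double vector and Obs_s an integer vector; no checks are done.
  const double *beta = REAL(beta_s);
  const int *Obs = INTEGER(Obs_s);
  const int n_cat = Rcpp::as<int>(n_cat_s);
  const int n_Obs = Rf_length(Obs_s);

  double expBetas_log_sum = 1.0; // std::exp(0) for the reference category
  for (int i = 1; i < n_cat; i++)
    expBetas_log_sum += std::exp(beta[i - 1]);
  expBetas_log_sum = std::log(expBetas_log_sum);

  double ll_sum = 0;
  for (int i = 0; i < n_Obs; i++) {
    if (Obs[i] == 1) continue; // reference category contributes 0
    ll_sum += beta[Obs[i] - 2];
  }
  return ll_sum - expBetas_log_sum * n_Obs;
}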
Benchmark:
microbenchmark("llmnl_int_R_v1" = optim(beta, llmnl_int, Obs = mnl_sample,
n_cat = n_cat, method = "BFGS", hessian = F,
control = list(fnscale = -1)),
"llmnl_int_R_v2" = optim(beta, llmnl_int_R_v2, Obs = mnl_sample,
n_cat = n_cat, method = "BFGS", hessian = F,
control = list(fnscale = -1)),
"llmnl_int_C_v2" = optim(beta, llmnl_int_C_v2, Obs = mnl_sample,
n_cat = n_cat, method = "BFGS", hessian = F,
control = list(fnscale = -1)),
"llmnl_int_C_v3" = optim(beta, llmnl_int_C_v3, Obs = mnl_sample,
n_cat = n_cat, method = "BFGS", hessian = F,
control = list(fnscale = -1)),
"llmnl_int_C_v4" = optim(beta, llmnl_int_C_v4, Obs = mnl_sample,
n_cat = n_cat, method = "BFGS", hessian = F,
control = list(fnscale = -1)),
times = 1000)
Unit: microseconds
expr min lq mean median uq max neval cld
llmnl_int_R_v1 9480.780 10662.3530 14126.6399 11359.8460 18505.6280 146823.430 1000 c
llmnl_int_R_v2 697.276 735.7735 1015.8217 768.5735 810.6235 11095.924 1000 b
llmnl_int_C_v2 997.828 1021.4720 1106.0968 1031.7905 1078.2835 11222.803 1000 b
llmnl_int_C_v3 284.519 295.7825 328.5890 304.0325 328.2015 9647.417 1000 a
llmnl_int_C_v4 245.650 256.9760 283.9071 266.3985 299.2090 1156.448 1000 a
v3 is Oliver's answer with rng = false. v4 is v3 with suggestions #2 and #3 included.
The function:
#include <Rcpp.h>
#include <cmath>
using namespace Rcpp;

// [[Rcpp::export(rng = false)]]
double llmnl_int_C_v4(NumericVector beta, IntegerVector Obs, int n_cat) {
  int n_Obs = Obs.size();

  //2: Calculate the log-sum only once, with std::exp / std::log:
  // (equivalent to: double expBetas_log_sum = log(sum(exp(betas)));)
  double expBetas_log_sum = 1.0; // std::exp(0) for the reference category
  for (int i = 1; i < n_cat; i++) {
    expBetas_log_sum += std::exp(beta[i - 1]);
  }
  expBetas_log_sum = std::log(expBetas_log_sum);

  double ll_sum = 0;
  //3: Use n_Obs to avoid calling Obs.size() every iteration
  for (int i = 0; i < n_Obs; i++) {
    if (Obs[i] == 1L) continue; // reference category contributes 0
    ll_sum += beta[Obs[i] - 2L];
  }

  //4: Use that the denominator is the same for all observations:
  ll_sum = ll_sum - expBetas_log_sum * n_Obs;
  return ll_sum;
}
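Assuming this is saved in its own C++ file (the file name below is just an example), it can be compiled and made callable from R with Rcpp::sourceCpp():

Rcpp::sourceCpp("llmnl_int_C_v4.cpp")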
In general, if you are able to use vectorized functions, your code will be (almost) as fast as the equivalent Rcpp. This is because many vectorized functions in R (almost all vectorized functions in base R) are written in C, C++ or Fortran, so there is often little to gain from rewriting them.
That said, there are improvements to be gained in both your R and Rcpp code. Optimization comes from carefully studying the code and removing unnecessary steps (memory allocation, sums, etc.).
Let's start with the Rcpp code optimization.
In your case the main optimization is to remove unnecessary matrix and vector calculations. The code is, in essence:
- shift beta
- calculate the log of the sum of exp(shifted beta) [log-sum-exp]
- use Obs as an index into the shifted beta and sum over all the probabilities
- subtract the log-sum-exp
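In other words, with betas = c(0, beta), the whole log-likelihood collapses to a single expression; the two loops below compute exactly this:

ll(beta) = sum(betas[Obs]) - n_Obs * log(sum(exp(betas)))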
Using this observation we can reduce the code to two for-loops. Note that sum is simply another for-loop (more or less for (i = 0; i < max; i++) { sum += x[i] }), so avoiding the sums can speed up one's code further (in most situations this is unnecessary optimization!). In addition, your input Obs is an integer vector, and we can further optimize the code by using the IntegerVector type to avoid casting the double elements to integer values (credit to Ralf Stubner's answer).
cppFunction('double llmnl_int_C_v2(NumericVector beta, IntegerVector Obs, int n_cat) {
  int n_Obs = Obs.size();
  NumericVector betas(beta.size() + 1); // betas[0] stays 0
  //1: shift beta
  for (int i = 1; i < n_cat; i++) {
    betas[i] = beta[i - 1];
  }
  //2: Calculate log sum only once:
  double expBetas_log_sum = log(sum(exp(betas)));
  // pre-allocate sum
  double ll_sum = 0;
  //3: Use n_Obs to avoid calling Obs.size() every iteration
  for (int i = 0; i < n_Obs; i++) {
    ll_sum += betas[Obs[i] - 1]; // integer index, no double cast
  }
  //4: Use that the denominator is the same for all observations:
  ll_sum = ll_sum - expBetas_log_sum * n_Obs;
  return ll_sum;
}')
Note that I have removed quite a few memory allocations and unnecessary calculations in the for-loop. I have also used the fact that denom is the same for all iterations and simply multiplied it by n_Obs for the final result.
We can perform similar optimizations in your R-code, which results in the below function:
llmnl_int_R_v2 <- function(beta, Obs, n_cat) {
  n_Obs <- length(Obs)
  betas <- c(0, beta)
  # note: denom = log(sum(exp(betas)))
  sum(betas[Obs]) - log(sum(exp(betas))) * n_Obs
}
Note that the complexity of the function has been drastically reduced, making it simpler for others to read. Just to be sure that I haven't messed up somewhere, let's check that they return the same results:
set.seed(2020)
mnl_sample <- t(rmultinom(n = 1000, size = 1, prob = c(0.3, 0.4, 0.2, 0.1)))
mnl_sample <- apply(mnl_sample, 1, function(r) which(r == 1))
beta <- c(4, 2, 1)
Obs <- mnl_sample
n_cat <- 4
xr <- llmnl_int(beta = beta, Obs = mnl_sample, n_cat = n_cat)
xr2 <- llmnl_int_R_v2(beta = beta, Obs = mnl_sample, n_cat = n_cat)
xc <- llmnl_int_C(beta = beta, Obs = mnl_sample, n_cat = n_cat)
xc2 <- llmnl_int_C_v2(beta = beta, Obs = mnl_sample, n_cat = n_cat)
all.equal(c(xr, xr2), c(xc, xc2))
#[1] TRUE
Well, that's a relief.
Performance:
I'll use microbenchmark to illustrate the performance. The optimized functions are fast, so I'll run them 1e5 times to reduce the effect of the garbage collector.
microbenchmark("llmml_int_R" = llmnl_int(beta = beta, Obs = mnl_sample, n_cat = n_cat),
"llmml_int_C" = llmnl_int_C(beta = beta, Obs = mnl_sample, n_cat = n_cat),
"llmnl_int_R_v2" = llmnl_int_R_v2(beta = beta, Obs = mnl_sample, n_cat = n_cat),
"llmml_int_C_v2" = llmnl_int_C_v2(beta = beta, Obs = mnl_sample, n_cat = n_cat),
times = 1e5)
#Output:
#Unit: microseconds
# expr min lq mean median uq max neval
# llmml_int_R 202.701 206.801 288.219673 227.601 334.301 57368.902 1e+05
# llmml_int_C 250.101 252.802 342.190342 272.001 399.251 112459.601 1e+05
# llmnl_int_R_v2 4.800 5.601 8.930027 6.401 9.702 5232.001 1e+05
# llmml_int_C_v2 5.100 5.801 8.834646 6.700 10.101 7154.901 1e+05
Here we see the same result as before. The new functions are roughly 35x (R) and 40x (Cpp) faster than their original counterparts. Interestingly enough, the optimized R function is still very slightly (0.3 microseconds, or about 4 %) faster than my optimized Cpp function. My best bet here is that there is some overhead from the Rcpp package, and if this were removed the two would be identical, or the R version marginally faster.
Similarly, we can check the performance using optim:
microbenchmark("llmnl_int" = optim(beta, llmnl_int, Obs = mnl_sample,
n_cat = n_cat, method = "BFGS", hessian = F,
control = list(fnscale = -1)),
"llmnl_int_C" = optim(beta, llmnl_int_C, Obs = mnl_sample,
n_cat = n_cat, method = "BFGS", hessian = F,
control = list(fnscale = -1)),
"llmnl_int_R_v2" = optim(beta, llmnl_int_R_v2, Obs = mnl_sample,
n_cat = n_cat, method = "BFGS", hessian = F,
control = list(fnscale = -1)),
"llmnl_int_C_v2" = optim(beta, llmnl_int_C_v2, Obs = mnl_sample,
n_cat = n_cat, method = "BFGS", hessian = F,
control = list(fnscale = -1)),
times = 1e3)
#Output:
#Unit: microseconds
# expr min lq mean median uq max neval
# llmnl_int 29541.301 53156.801 70304.446 76753.851 83528.101 196415.5 1000
# llmnl_int_C 36879.501 59981.901 83134.218 92419.551 100208.451 190099.1 1000
# llmnl_int_R_v2 667.802 1253.452 1962.875 1585.101 1984.151 22718.3 1000
# llmnl_int_C_v2 704.401 1248.200 1983.247 1671.151 2033.401 11540.3 1000
Once again the result is the same.
Conclusion:
As a short conclusion, it is worth noting that this is one example where converting your code to Rcpp is not really worth the trouble. This is not always the case, but it is often worth taking a second look at your function to see if there are areas where unnecessary calculations are performed. Especially in situations where one uses built-in vectorized functions, it is often not worth the time to convert the code to Rcpp. Larger gains are usually seen when moving for-loops over code that can't easily be vectorized into Rcpp, since that removes the per-iteration interpreter overhead.