R: Vectorize loop to create pairwise matrix
A data.table
solution :
library(data.table)
setDT(df)
setkey(df,Shop)
dcast(df[df,on=.(Shop=Shop),allow.cartesian=T][
,.(cnt=sum(i.Order<Order&i.Fruit!=Fruit)),by=.(Fruit,i.Fruit)]
,Fruit~i.Fruit,value.var='cnt')
Fruit apple orange pear
1: apple 0 0 2
2: orange 2 0 1
3: pear 1 2 0
The Shop
index isn't necessary for this example, but will probably improve performance on a larger dataset.
As the question raised many comments on performance, I decided to check what Rcpp
could bring:
library(Rcpp)
cppFunction('NumericMatrix rcppPair(DataFrame df) {
std::vector<std::string> Shop = Rcpp::as<std::vector<std::string> >(df["Shop"]);
Rcpp::NumericVector Order = df["Order"];
Rcpp::StringVector Fruit = df["Fruit"];
StringVector FruitLevels = sort_unique(Fruit);
IntegerVector FruitInt = match(Fruit, FruitLevels);
int n = FruitLevels.length();
std::string currentShop = "";
int order, fruit, i, f;
NumericMatrix result(n,n);
NumericVector fruitOrder(n);
for (i=0;i<Fruit.length();i++){
if (currentShop != Shop[i]) {
//Init counter for each shop
currentShop = Shop[i];
std::fill(fruitOrder.begin(), fruitOrder.end(), 0);
}
order = Order[i];
fruit = FruitInt[i];
fruitOrder[fruit-1] = order;
for (f=0;f<n;f++) {
if (order > fruitOrder[f] & fruitOrder[f]>0 ) {
result(fruit-1,f) = result(fruit-1,f)+1;
}
}
}
rownames(result) = FruitLevels;
colnames(result) = FruitLevels;
return(result);
}
')
rcppPair(df)
apple orange pear
apple 0 0 2
orange 2 0 1
pear 1 2 0
On the example dataset, this runs >500 times faster than the data.table
solution, probably because it doesn't have the cartesian product problem. This isn't supposed to be robust on wrong input, and expects that shops / order are in ascending order.
Considering the few minutes spent to find the 3 lines of code for the data.table
solution, compared to the much longer Rcpp
solution / debugging process, I wouldn't recommend to go for Rcpp
here unless there's a real performance bottleneck.
Interesting however to remember that if performance is a must, Rcpp
might be worth the effort.
Here is an approach that makes simple modifications to make it 5x faster.
loop.function2 <- function(df){
spl_df = split(df[, c(1L, 3L)], df[[2L]])
mat <- array(0L,
dim=c(length(spl_df), length(spl_df)),
dimnames = list(names(spl_df), names(spl_df)))
for (m in 1:(length(spl_df) - 1L)) {
xm = spl_df[[m]]
mShop = xm$Shop
for (n in ((1+m):length(spl_df))) {
xn = spl_df[[n]]
mm = match(mShop, xn$Shop)
inds = which(!is.na(mm))
mOrder = xm[inds, "Order"]
nOrder = xn[mm[inds], "Order"]
mat[m, n] <- sum(nOrder < mOrder)
mat[n, m] <- sum(mOrder < nOrder)
}
}
mat
}
There are 3 main concepts:
- The original
df[df$Fruits == fruits[m], ]
lines were inefficient as you would be making the same comparisonlength(Fruits)^2
times. Instead, we can usesplit()
which means we are only scanning the Fruits once. - There was a lot of use of
df$var
which will extract the vector during each loop. Here, we place the assignment ofxm
outside of the inner loop and we try to minimize what we need to subset / extract. - I changed it to be closer to
combn
as we can re-use ourmatch()
condition by doing bothsum(xmOrder > xnOrder)
and then switching it tosum(xmOrder < xnOrder)
.
Performance:
bench::mark(loop.function(df), loop.function2(df))
# A tibble: 2 x 13
## expression min median
## <bch:expr> <bch:tm> <bch:>
##1 loop.function(df) 3.57ms 4.34ms
##2 loop.function2(df) 677.2us 858.6us
My hunch is that for your larger dataset, @Waldi's data.table solution will be faster. But for smaller datasets, this should be pretty perfomant.
Finally, here's yet another rcpp approach that seems to be slower than @Waldi:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
IntegerMatrix loop_function_cpp(List x) {
int x_size = x.size();
IntegerMatrix ans(x_size, x_size);
for (int m = 0; m < x_size - 1; m++) {
DataFrame xm = x[m];
CharacterVector mShop = xm[0];
IntegerVector mOrder = xm[1];
int nrows = mShop.size();
for (int n = m + 1; n < x_size; n++) {
DataFrame xn = x[n];
CharacterVector nShop = xn[0];
IntegerVector nOrder = xn[1];
for (int i = 0; i < nrows; i++) {
for (int j = 0; j < nrows; j++) {
if (mShop[i] == nShop[j]) {
if (mOrder[i] > nOrder[j])
ans(m, n)++;
else
ans(n, m)++;
break;
}
}
}
}
}
return(ans);
}
loop_wrapper = function(df) {
loop_function_cpp(split(df[, c(1L, 3L)], df[[2L]]))
}
loop_wrapper(df)
``