Is there any sparse support for dist function in R?
**Update: ** You can do what qlcMatrix does quite easily in fact:
sparse.cos <- function(x, y = NULL, drop = TRUE){
if(!is.null(y)){
if(class(x) != "dgCMatrix" || class(y) != "dgCMatrix") stop ("class(x) or class(y) != dgCMatrix")
if(drop == TRUE) colnames(x) <- rownames(x) <- colnames(y) <- rownames(y) <- NULL
crossprod(
tcrossprod(
x,
Diagonal(x = as.vector(crossprod(x ^ 2, rep(1, x@Dim[1]))) ^ -0.5)
),
tcrossprod(
y,
Diagonal(x = as.vector(crossprod(y ^ 2, rep(1, x@Dim[1]))) ^ -0.5))
)
)
} else {
if(class(x) != "dgCMatrix") stop ("class(x) != dgCMatrix")
if(drop == TRUE) colnames(x) <- rownames(X) <- NULL
crossprod(
tcrossprod(
x,
Diagonal(x = as.vector(crossprod(x ^ 2, rep(1, nrow(x)))) ^ -0.5))
)
}
}
I can find no significant difference in performance between the above and qlcMatrix::cosSparse
.
qlcMatrix::cosSparse
is faster than wordspace::dist.matrix
when data is >50% sparse or the similarity is being calculated on the longest edge of the input matrix (i.e. tall format).
Performance of wordspace::dist.matrix
vs. qlcMatrix::cosSparse
on a wide matrix (1000 x 5000) of varying sparsity (10%, 50%, 90%, or 99% sparse) to calculate a 1000 x 1000 similarity:
# M1 is 10% sparse, M99 is 99% sparse
set.seed(123)
M10 <- rsparsematrix(5000, 1000, density = 1)
M50 <- rsparsematrix(5000, 1000, density = 0.5)
M90 <- rsparsematrix(5000, 1000, density = 0.1)
M99 <- rsparsematrix(5000, 1000, density = 0.01)
tM10 <- t(M10)
tM50 <- t(M50)
tM90 <- t(M90)
tM99 <- t(M99)
benchmark(
"cosSparse: 10% sparse" = cosSparse(M10),
"cosSparse: 50% sparse" = cosSparse(M50),
"cosSparse: 90% sparse" = cosSparse(M90),
"cosSparse: 99% sparse" = cosSparse(M99),
"wordspace: 10% sparse" = dist.matrix(tM10, byrow = TRUE),
"wordspace: 50% sparse" = dist.matrix(tM50, byrow = TRUE),
"wordspace: 90% sparse" = dist.matrix(tM90, byrow = TRUE),
"wordspace: 99% sparse" = dist.matrix(tM99, byrow = TRUE),
replications = 2, columns = c("test", "elapsed", "relative"))
The two functions are quite comparable, with wordspace taking a slight lead at lower sparsity, but definitely not at high sparsity:
test elapsed relative
1 cosSparse: 10% sparse 15.83 527.667
2 cosSparse: 50% sparse 4.72 157.333
3 cosSparse: 90% sparse 0.31 10.333
4 cosSparse: 99% sparse 0.03 1.000
5 wordspace: 10% sparse 15.23 507.667
6 wordspace: 50% sparse 4.28 142.667
7 wordspace: 90% sparse 0.36 12.000
8 wordspace: 99% sparse 0.09 3.000
If we flip the calculation around to compute a 5000 x 5000 matrix, then:
benchmark(
"cosSparse: 50% sparse" = cosSparse(tM50),
"cosSparse: 90% sparse" = cosSparse(tM90),
"cosSparse: 99% sparse" = cosSparse(tM99),
"wordspace: 50% sparse" = dist.matrix(M50, byrow = TRUE),
"wordspace: 90% sparse" = dist.matrix(M90, byrow = TRUE),
"wordspace: 99% sparse" = dist.matrix(M99, byrow = TRUE),
replications = 1, columns = c("test", "elapsed", "relative"))
Now the competitive advantage of cosSparse becomes very clear:
test elapsed relative
1 cosSparse: 50% sparse 10.58 151.143
2 cosSparse: 90% sparse 1.44 20.571
3 cosSparse: 99% sparse 0.07 1.000
4 wordspace: 50% sparse 11.41 163.000
5 wordspace: 90% sparse 2.39 34.143
6 wordspace: 99% sparse 0.64 9.143
The change in efficiency is not very dramatic at 50% sparsity, but at 90% sparsity, wordspace is 1.6x slower, and at 99% sparsity it's nearly 10x slower!
Compare this performance to a square matrix:
M50.square <- rsparsematrix(1000, 1000, density = 0.5)
tM50.square <- t(M50.square)
M90.square <- rsparsematrix(1000, 1000, density = 0.1)
tM90.square <- t(M90.square)
benchmark(
"cosSparse: square, 50% sparse" = cosSparse(M50.square),
"wordspace: square, 50% sparse" = dist.matrix(tM50.square, byrow = TRUE),
"cosSparse: square, 90% sparse" = cosSparse(M90.square),
"wordspace: square, 90% sparse" = dist.matrix(tM90.square, byrow = TRUE),
replications = 5, columns = c("test", "elapsed", "relative"))
cosSparse is marginally faster at both 50% sparsity, and almost twice as fast at 90% sparsity!
test elapsed relative
1 cosSparse: square, 50% sparse 2.12 9.217
3 cosSparse: square, 90% sparse 0.23 1.000
2 wordspace: square, 50% sparse 2.15 9.348
4 wordspace: square, 90% sparse 0.40 1.739
Note that the wordspace::dist.matrix
has more edge case checks than qlcMatrix::cosSparse
and also permits parallelization through openmp
in R. Also, wordspace::dist.matrix
supports euclidean and jaccard distance measures, although these are far slower. There are a number of other handy features built into that package.
That said, if you only need cosine similarity, and your matrix is >50% sparse, and you're computing the tall way, cosSparse
should be the tool of choice.
You want wordspace::dist.matrix
.
It accepts sparse matrices from the Matrix
package (which isn’t clear from the documentation) and can also do cross distances, output both Matrix
and dist
objects and more.
The default distance measure is 'cosine'
though, so be sure to specify method = 'euclidean'
if you want that.