Running cor() (or any variant) over a sparse matrix in R

EDITED ANSWER - optimized for memory use and speed.

Your error is logic, as a sparse matrix is not recognized by the cor function as a matrix, and there is -yet- no method for correlations in the Matrix package.

There is no function I am aware of that will let you calculate this, but you can easily calculate that yourself, using the matrix operators that are available in the Matrix package :

sparse.cor <- function(x){
  n <- nrow(x)
  m <- ncol(x)
  ii <- unique(x@i)+1 # rows with a non-zero element

  Ex <- colMeans(x)
  nozero <- as.vector(x[ii,]) - rep(Ex,each=length(ii))        # colmeans

  covmat <- ( crossprod(matrix(nozero,ncol=m)) +
              crossprod(t(Ex))*(n-length(ii))
            )/(n-1)
  sdvec <- sqrt(diag(covmat))
  covmat/crossprod(t(sdvec))
}

the covmat is your variance-covariance matrix, so you can calculate that one as well. The calculation is based on selecting the rows where at least one element is non-zero. to the cross product of this one, you add the colmeans multiplied by the number of all-zero rows. This is equivalent to

( X - E[X] ) times ( X - E[X] ) transposed

Divide by n-1 and you have your variance-covariance matrix. The rest is easy.

A test case :

X <- sample(0:10,1e8,replace=T,p=c(0.99,rep(0.001,10)))
xx <- Matrix(X,ncol=5)

> system.time(out1 <- sparse.cor(xx))
   user  system elapsed 
   0.50    0.09    0.59 
> system.time(out2 <- cor(as.matrix(xx)))
   user  system elapsed 
   1.75    0.28    2.05 
> all.equal(out1,out2)
[1] TRUE

This is what I ended up using. Thanks @Joris for all the help!

My x matrix is quite big. Assuming it's size is n * p, n=200k and p=10k in my case.

The trick is to maintain the sparsity of the operations and perform the calculations on p * p matrices instead of p*n x n*p.

Version 1, is more straightforward, but less efficient on time and memory, as the outer product operation is expensive:

sparse.cor2 <- function(x){
    n <- nrow(x)

    covmat <- (crossprod(x)-2*(colMeans(x) %o% colSums(x))
              +n*colMeans(x)%o%colMeans(x))/(n-1)

    sdvec <- sqrt(diag(covmat)) # standard deviations of columns
    covmat/crossprod(t(sdvec)) # correlation matrix
}

Version 2 is more efficient on time (saves several operations) and on memory. Still requires huge amounts of memory for a p=10k matrix:

sparse.cor3 <- function(x){
    memory.limit(size=10000)
    n <- nrow(x)

    cMeans <- colMeans(x)
    cSums <- colSums(x)

    # Calculate the population covariance matrix.
    # There's no need to divide by (n-1) as the std. dev is also calculated the same way.
    # The code is optimized to minize use of memory and expensive operations
    covmat <- tcrossprod(cMeans, (-2*cSums+n*cMeans))
    crossp <- as.matrix(crossprod(x))
    covmat <- covmat+crossp

    sdvec <- sqrt(diag(covmat)) # standard deviations of columns
    covmat/crossprod(t(sdvec)) # correlation matrix
}

Timing comparisons (sparse.cor is @Joris latest version):

> X <- sample(0:10,1e7,replace=T,p=c(0.9,rep(0.01,10)))
> x <- Matrix(X,ncol=10)
> 
> object.size(x)
11999472 bytes
> 
> system.time(corx <- sparse.cor(x))
   user  system elapsed 
   0.50    0.06    0.56 
> system.time(corx2 <- sparse.cor2(x))
   user  system elapsed 
   0.17    0.00    0.17 
> system.time(corx3 <- sparse.cor3(x))
   user  system elapsed 
   0.13    0.00    0.12 
> system.time(correg <-cor(as.matrix(x)))
   user  system elapsed 
   0.25    0.03    0.29 
> all.equal(c(as.matrix(corx)),c(as.matrix(correg)))
[1] TRUE
> all.equal(c(as.matrix(corx2)),c(as.matrix(correg)))
[1] TRUE
> all.equal(c(as.matrix(corx3)),c(as.matrix(correg)))
[1] TRUE

Much larger x matrix:

> X <- sample(0:10,1e8,replace=T,p=c(0.9,rep(0.01,10)))
> x <- Matrix(X,ncol=10)
> object.size(x)
120005688 bytes
> system.time(corx2 <- sparse.cor2(x))
   user  system elapsed 
   1.47    0.07    1.53 
> system.time(corx3 <- sparse.cor3(x))
   user  system elapsed 
   1.18    0.09    1.29 
> system.time(corx <- sparse.cor(x))
   user  system elapsed 
   5.43    1.26    6.71

Tags:

R