Join and sum not compatible matrices

I'd just line up the names and go to town with base R.

Here's a simple function that takes an unspecified number of matrices and adds them up by their row/column names.

add_matrices_1 <- function(...) {
  a <- list(...)
  cols <- sort(unique(unlist(lapply(a, colnames))))
  rows <- sort(unique(unlist(lapply(a, rownames))))
  out <- array(0, dim=c(length(rows), length(cols)), dimnames=list(rows,cols))
  for(M in a) { out[rownames(M), colnames(M)] <- out[rownames(M), colnames(M)] + M }
  out
}

It then works like this:

# giving them rownames and colnames
colnames(M1) <- rownames(M1) <- c(1,3,4,5,7,8)
colnames(M2) <- rownames(M2) <- c(1,3,4,5,8)

add_matrices_1(M1, M2)
#   1 3 4 5 7 8
# 1 0 0 2 0 0 0
# 3 0 0 0 0 0 0
# 4 2 0 0 0 0 0
# 5 0 0 0 0 0 0
# 7 0 0 0 0 1 0
# 8 0 0 0 0 0 0
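
If you want to try the idea without the matrices from the question, here is a minimal, self-contained sketch of the same name-alignment trick. The matrices `A` and `B` and their values are made up for illustration; their names only partly overlap, so neither is a subset of the other.

```r
# Two small matrices whose row/column names only partly overlap (made-up values).
A <- matrix(1:4, 2, 2, dimnames = list(c("a", "b"), c("a", "b")))
B <- matrix(10 * (1:4), 2, 2, dimnames = list(c("b", "c"), c("b", "c")))

# The union of the names defines the shape of the result; start from all zeros.
rows <- sort(unique(c(rownames(A), rownames(B))))
cols <- sort(unique(c(colnames(A), colnames(B))))
out <- array(0, dim = c(length(rows), length(cols)), dimnames = list(rows, cols))

# Indexing by name adds each matrix into its matching block of the result.
for (M in list(A, B)) {
  out[rownames(M), colnames(M)] <- out[rownames(M), colnames(M)] + M
}
out
#   a  b  c
# a 1  3  0
# b 2 14 30
# c 0 20 40
```

Only the cell named ("b", "b") appears in both inputs, so it is the only one that is actually summed (4 + 10).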

For bigger matrices, however, it doesn't do as well. Here's a function to make a matrix, choosing n columns out of N possibilities, and filling k spots with non-zero values. (This assumes symmetrical matrices.)

makeM <- function(N, n, k) {
  s1 <- sample(N, n)
  M1 <- array(0, dim=c(n,n), dimnames=list(s1, s1))
  r1 <- sample(n,k, replace=TRUE)
  c1 <- sample(n,k, replace=TRUE)
  M1[cbind(c(r1,c1), c(c1,r1))] <- sample(N,k)
  M1
}

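Then here's another version that uses sparse matrices from the Matrix package.

library(Matrix)  # provides Matrix(), sparseMatrix() and summary() for sparse matrices

add_matrices_3 <- function(...) {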
  a <- list(...)
  cols <- sort(unique(unlist(lapply(a, colnames))))
  rows <- sort(unique(unlist(lapply(a, rownames))))
  nrows <- length(rows)
  ncols <- length(cols)
  newms <- lapply(a, function(m) {
    s <- summary(m)
    i <- match(rownames(m), rows)[s$i]
    j <- match(colnames(m), cols)[s$j]
    ilj <- i<j
    sparseMatrix(i=ifelse(ilj, i, j),
                 j=ifelse(ilj, j, i),
                 x=s$x,
                 dims=c(nrows, ncols),
                 dimnames=list(rows, cols), symmetric=TRUE)
  })
  Reduce(`+`, newms)
}
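
The index remapping is the least obvious part, so here is a small sketch of just that step, assuming the Matrix package is installed. The names `m`, `all_names`, `pos` and `big` are made up for illustration, and `pmin`/`pmax` stand in for the two `ifelse` calls above.

```r
library(Matrix)

# A 2x2 sparse symmetric matrix with names "3" and "8" and one stored entry.
m <- sparseMatrix(i = 1, j = 2, x = 5, dims = c(2, 2),
                  dimnames = list(c("3", "8"), c("3", "8")),
                  symmetric = TRUE)

# Names of the larger frame the matrix should be placed into.
all_names <- c("1", "3", "4", "8")

s <- summary(m)                        # (i, j, x) triplets of the stored triangle
pos <- match(rownames(m), all_names)   # local index -> position in the union: 2, 4
i <- pos[s$i]
j <- pos[s$j]

# Keep every entry in the upper triangle, as symmetric = TRUE expects one triangle.
big <- sparseMatrix(i = pmin(i, j), j = pmax(i, j), x = s$x,
                    dims = c(4, 4),
                    dimnames = list(all_names, all_names),
                    symmetric = TRUE)
big
```

Once every input lives in the same 4x4 frame, adding them is a plain `+`, which is what the `Reduce` call does.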

The sparse version, add_matrices_3, is definitely faster when the matrices are large and sparse. (Note that I'm not timing the conversion to a sparse symmetric matrix, as hopefully if that's a suitable format, you'll use that format throughout your code.)

set.seed(50)
M1 <- makeM(10000, 5000, 50)
M2 <- makeM(10000, 5000, 50)
mm2 <- Matrix(M2)
mm1 <- Matrix(M1)
system.time(add_matrices_1(M1, M2))
#   user  system elapsed 
#  2.987   0.841   4.133 
system.time(add_matrices_3(mm1, mm2))
#   user  system elapsed 
#  0.042   0.012   0.504

But when the matrices are small, my first solution is still faster.

set.seed(50)
M1 <- makeM(100, 50, 20)
M2 <- makeM(100, 50, 20)
mm2 <- Matrix(M2)
mm1 <- Matrix(M1)
library(microbenchmark)
microbenchmark(add_matrices_1(M1, M2), add_matrices_3(mm1, mm2))
# Unit: microseconds
#                       expr      min       lq   median        uq       max
# 1   add_matrices_1(M1, M2)  398.495  406.543  423.825  544.0905  43077.27
# 2 add_matrices_3(mm1, mm2) 5734.623 5937.473 6044.007 6286.6675 509584.24

Moral of the story: Size and sparsity matter.

Also, getting it right is more important than saving a few microseconds. It's almost always best to use simple functions and not worry about speed unless you run into trouble. So in small cases, I'd prefer MadScone's solution, as it's easy to code and simple to understand. When that gets slow, I'd write a function like my first attempt. When that gets slow, I'd write a function like my second attempt.

Here is a data.table solution. The magic is to add the .SD components (which have identical names in both) then assign the remaining column by reference.

# a function to quickly get the non key columns
nonkey <- function(DT){ setdiff(names(DT),key(DT))}
# the columns in DT1 only
notinR <- setdiff(nonkey(DT1), nonkey(DT2))

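# calculate; .. means "up one level"
# (notinR) is wrapped in parentheses so := assigns to the columns named in notinR,
# not to a column literally called "notinR"
result <- DT2[DT1, .SD + .SD, roll= TRUE][, (notinR) := unclass(DT1[, ..notinR])]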

# re set the column order to the original (DT1) order
setcolorder(result, names(DT1))

# voila!
result

   rn 1 3 4 5 7 8
1:  1 0 0 2 0 0 0
2:  3 0 0 0 0 0 0
3:  4 2 0 0 0 0 0
4:  5 0 0 0 0 0 0
5:  7 0 0 0 0 1 0
6:  8 0 0 0 0 0 0

I'm not convinced this is a particularly stable solution: it may only be fluking the right answer because M1 and M2 are subsets of each other.

Edit: an ugly approach using eval

This is made harder because you have non-syntactic names (`1` etc.).

inBoth <- intersect(nonkey(DT1), nonkey(DT2))

backquote <- function(x){paste0('`', x, '`')}
bqBoth <- backquote(inBoth)

# builds the text of an expression such as list(`1`=`1`+ i.1, ..., `7`)
charexp <- sprintf('list(%s)',paste(c(paste0( bqBoth,'=',  bqBoth, '+ i.',inBoth), backquote(notinR)), collapse = ','))

result2 <- DT2[DT1,eval(parse(text = charexp)), roll = TRUE]
setcolorder(result2, names(DT1))

# voila!
result2


   rn 1 3 4 5 7 8
1:  1 0 0 2 0 0 0
2:  3 0 0 0 0 0 0
3:  4 2 0 0 0 0 0
4:  5 0 0 0 0 0 0
5:  7 0 0 0 0 1 0
6:  8 0 0 0 0 0 0
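
To see what the backquoting buys you, here is a base-R sketch of the same build-a-string-then-eval pattern, without data.table. The data frame `df` and the doubling operation are made up for illustration; only the `backquote` helper matches the code above.

```r
# A data frame with non-syntactic column names (check.names = FALSE keeps them as-is).
df <- data.frame(`1` = c(1, 2), `3` = c(10, 20), check.names = FALSE)
cols <- names(df)

backquote <- function(x) paste0("`", x, "`")

# Without backquotes, "1 * 2" would parse as arithmetic on the number 1,
# not as a reference to the column named "1".
expr_text <- sprintf("list(%s)",
                     paste(paste0(backquote(cols), " = ", backquote(cols), " * 2"),
                           collapse = ", "))
expr_text
# [1] "list(`1` = `1` * 2, `3` = `3` * 2)"

# Evaluate the expression with the data frame's columns in scope.
res <- eval(parse(text = expr_text), envir = df)
res[["3"]]
# [1] 20 40
```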
