What is the fastest way to slice a scipy.sparse matrix?
I've found that the advertised fast row indexing of scipy.sparse.csr_matrix
can be made a lot quicker by rolling your own row indexer. Here's the idea:
import numpy as np
from scipy import sparse

class SparseRowIndexer:
    def __init__(self, csr_matrix):
        data = []
        indices = []
        indptr = []

        # Iterating over the rows this way is significantly more efficient
        # than csr_matrix[row_index,:] and csr_matrix.getrow(row_index)
        for row_start, row_end in zip(csr_matrix.indptr[:-1], csr_matrix.indptr[1:]):
            data.append(csr_matrix.data[row_start:row_end])
            indices.append(csr_matrix.indices[row_start:row_end])
            indptr.append(row_end - row_start)  # nnz of the row

        # Object-typed arrays of per-row arrays (rows have different lengths),
        # so they can themselves be fancy-indexed with NumPy
        self.data = np.array(data, dtype=object)
        self.indices = np.array(indices, dtype=object)
        self.indptr = np.array(indptr)
        self.n_columns = csr_matrix.shape[1]

    def __getitem__(self, row_selector):
        data = np.concatenate(self.data[row_selector])
        indices = np.concatenate(self.indices[row_selector])
        indptr = np.append(0, np.cumsum(self.indptr[row_selector]))

        shape = [indptr.shape[0] - 1, self.n_columns]

        return sparse.csr_matrix((data, indices, indptr), shape=shape)
That is, it is possible to exploit the fast indexing of NumPy arrays by storing the non-zero values of each row in a separate array (with a different length for each row) and putting all of those row arrays into an object-typed array, which allows each row to have a different size and can itself be indexed efficiently. The column indices are stored the same way. This differs slightly from the standard CSR data structure, which stores all non-zero values in a single array and therefore needs look-ups to find where each row starts and ends. Those look-ups can slow down random access but are efficient for retrieving contiguous rows.
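For reference, here is a minimal usage sketch; the small random matrix and the variable names (small, rows, indexer) are just illustrative placeholders, not the matrix used in the benchmark below. It checks that the indexer returns the same rows as plain CSR fancy indexing:

import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
small = sparse.random(1000, 500, density=0.01, format='csr', random_state=0)
rows = rng.integers(0, small.shape[0], size=100)

indexer = SparseRowIndexer(small)
# Same non-zero pattern and values as the built-in row indexing
assert (indexer[rows] != small[rows]).nnz == 0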
Profiling results
My matrix mat is a 1,900,000 x 1,250,000 csr_matrix with 400,000,000 non-zero elements. ilocs is an array of 200,000 random row indices.
>>> %timeit mat[ilocs]
2.66 s ± 233 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
compared to:
>>> row_indexer = SparseRowIndexer(mat)
>>> %timeit row_indexer[ilocs]
59.9 ms ± 4.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The SparseRowIndexer seems to be faster when using fancy indexing compared to boolean masks.
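So if you start from a boolean mask over the rows, it may be worth converting it to integer indices first. A small sketch, assuming a hypothetical mask that selects every seventh row of an illustrative matrix:

import numpy as np
from scipy import sparse

mat_small = sparse.random(1000, 500, density=0.01, format='csr', random_state=1)
indexer = SparseRowIndexer(mat_small)   # class defined above

mask = np.zeros(mat_small.shape[0], dtype=bool)
mask[::7] = True                        # hypothetical row selection

rows = np.flatnonzero(mask)             # convert the mask to integer indices once
subset = indexer[rows]                  # fancy-indexing path, which appeared faster than indexer[mask]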
If you want to obtain a sparse matrix as output, the fastest way to do row slicing is to use the csr type, and for column slicing the csc type, as detailed here. In both cases you just have to do what you are currently doing:
matrix[l1:l2,c1:c2]
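As a sketch of what this looks like in practice (the matrix here is a small random one, purely for illustration), convert once to the format that matches the slicing axis:

from scipy import sparse

m = sparse.random(10000, 5000, density=0.001, format='csr', random_state=0)

rows = m[100:200, :]          # row slicing: fast on a csr_matrix
cols = m.tocsc()[:, 50:80]    # column slicing: convert to csc first, then slice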
If you want another type as output, there may be faster ways. This other answer explains many methods for slicing a matrix and compares their timings. For example, if you want an ndarray
as output, the fastest slicing is:
matrix.A[l1:l2,c1:c2]
or:
matrix.toarray()[l1:l2,c1:c2]
much faster than:
matrix[l1:l2,c1:c2].A #or .toarray()
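Keep in mind that matrix.A / matrix.toarray() densifies the whole matrix first, so this ordering only pays off when the full dense array fits in memory. A small sketch of the two orderings on an illustrative matrix:

import numpy as np
from scipy import sparse

m = sparse.random(2000, 2000, density=0.01, format='csr', random_state=0)

dense_first = m.toarray()[10:500, 20:400]    # densify everything, then slice the ndarray
sparse_first = m[10:500, 20:400].toarray()   # slice the sparse matrix, then densify

assert np.array_equal(dense_first, sparse_first)   # same result, different cost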