What is the most efficient way of setting a row to zeros in a sparse scipy matrix?
If you dislike digging around in the guts of the sparse matrices, you can also use sparse matrix multiplication with a diagonal matrix:
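import numpy
from scipy import sparse

def zero_rows(M, rows_to_zero):
    # 1 keeps a row, 0 zeroes it out
    ixs = numpy.ones(M.shape[0], int)
    ixs[rows_to_zero] = 0
    D = sparse.diags(ixs)
    # matrix product with the diagonal matrix; M itself is not modified
    res = D.dot(M)
    return res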
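To see what the diagonal-matrix trick does, here is a minimal sketch on a small matrix. The 4x4 test matrix and the name `zero_rows_demo` are made up for illustration, and `@` is used here on the assumption that you want the operator to mean matrix multiplication regardless of whether you work with sparse matrices or sparse arrays.

```python
import numpy as np
from scipy import sparse

def zero_rows_demo(M, rows_to_zero):
    # 1 on the diagonal keeps a row, 0 wipes it out.
    keep = np.ones(M.shape[0], dtype=M.dtype)
    keep[rows_to_zero] = 0
    D = sparse.diags(keep)
    # Returns a new matrix; M itself is left untouched.
    return (D @ M).tocsr()

# Hypothetical 4x4 example matrix with entries 1..16.
M = sparse.csr_matrix(np.arange(1, 17, dtype=float).reshape(4, 4))
res = zero_rows_demo(M, [1, 3])
dense = res.toarray()
```

Unlike the in-place approaches, this returns a new matrix, so it costs an extra copy but never touches the internals of `M`.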
A slightly different take on alko/seberg's approach. I find for loops disturbing, so I spent the better part of this morning figuring out a way to get rid of this one. The following is not always faster than the other approach; it performs better the more rows there are to be zeroed and the sparser the matrix:
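import numpy as np

def csr_zero_rows(csr, rows_to_zero):
    rows, cols = csr.shape
    # True for rows to keep, False for rows to zero
    mask = np.ones((rows,), dtype=bool)
    mask[rows_to_zero] = False
    nnz_per_row = np.diff(csr.indptr)

    # expand the row mask to one flag per stored entry
    mask = np.repeat(mask, nnz_per_row)
    nnz_per_row[rows_to_zero] = 0
    # drop the entries of the zeroed rows and rebuild the row pointers (in place)
    csr.data = csr.data[mask]
    csr.indices = csr.indices[mask]
    csr.indptr[1:] = np.cumsum(nnz_per_row)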
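The function works directly on the three CSR arrays, so it may help to watch them change on a tiny matrix. The sketch below assumes the input is a `csr_matrix`; it repeats the function so that it runs on its own, and the 3x3 matrix is invented for the example.

```python
import numpy as np
import scipy.sparse as sps

def csr_zero_rows(csr, rows_to_zero):
    mask = np.ones(csr.shape[0], dtype=bool)
    mask[rows_to_zero] = False
    nnz_per_row = np.diff(csr.indptr)
    # One keep/drop flag per stored entry.
    entry_mask = np.repeat(mask, nnz_per_row)
    nnz_per_row[rows_to_zero] = 0
    csr.data = csr.data[entry_mask]
    csr.indices = csr.indices[entry_mask]
    csr.indptr[1:] = np.cumsum(nnz_per_row)

# Hypothetical 3x3 matrix: row 1 holds a single stored value (3.0).
m = sps.csr_matrix(np.array([[1., 0., 2.],
                             [0., 3., 0.],
                             [4., 5., 6.]]))
indptr_before = m.indptr.copy()   # [0 2 3 6]
csr_zero_rows(m, [1])
indptr_after = m.indptr.copy()    # [0 2 2 5]: row 1 is now empty
```

The repeated value in `indptr_after` is what marks row 1 as empty: its start and end pointers coincide.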
And to test drive both approaches:
import numpy as np
import scipy.sparse as sps

rows, cols = 334863, 334863
a = sps.rand(rows, cols, density=0.00001, format='csr')
b = a.copy()
rows_to_zero = np.random.choice(np.arange(rows), size=10000, replace=False)
In [117]: a
Out[117]:
<334863x334863 sparse matrix of type '<type 'numpy.float64'>'
with 1121332 stored elements in Compressed Sparse Row format>
In [118]: %timeit -n1 -r1 csr_rows_set_nz_to_val(a, rows_to_zero)
1 loops, best of 1: 75.8 ms per loop
In [119]: %timeit -n1 -r1 csr_zero_rows(b, rows_to_zero)
1 loops, best of 1: 32.5 ms per loop
And of course:
np.allclose(a.data, b.data)
Out[122]: True
np.allclose(a.indices, b.indices)
Out[123]: True
np.allclose(a.indptr, b.indptr)
Out[124]: True
The solution I use for a similar task is due to @seberg and does not convert to LIL format:
import scipy.sparse
import numpy
import time

def csr_row_set_nz_to_val(csr, row, value=0):
    """Set all nonzero elements (elements currently in the sparsity pattern)
    to the given value. Useful to set to 0 mostly.
    """
    if not isinstance(csr, scipy.sparse.csr_matrix):
        raise ValueError('Matrix given must be of CSR format.')
    csr.data[csr.indptr[row]:csr.indptr[row+1]] = value

def csr_rows_set_nz_to_val(csr, rows, value=0):
    for row in rows:
        csr_row_set_nz_to_val(csr, row, value)
    if value == 0:
        csr.eliminate_zeros()
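Note that assigning into `csr.data` only overwrites stored values; the entries stay in the sparsity pattern until `eliminate_zeros()` is called. A small sketch of that difference, using an invented 3x3 matrix and illustrative variable names:

```python
import numpy as np
import scipy.sparse

# Hypothetical 3x3 matrix with two stored entries per row.
m = scipy.sparse.csr_matrix(np.array([[1., 2., 0.],
                                      [0., 3., 4.],
                                      [5., 0., 6.]]))
row = 1
# indptr[row]:indptr[row + 1] is the slice of data belonging to this row.
start, stop = m.indptr[row], m.indptr[row + 1]
m.data[start:stop] = 0

# The zeros are still stored explicitly at this point.
nnz_with_explicit_zeros = m.nnz
m.eliminate_zeros()
# Now the two entries of row 1 are gone from the sparsity pattern.
nnz_after_cleanup = m.nnz
```

If the same rows will be refilled later, skipping `eliminate_zeros()` keeps the pattern intact; otherwise calling it once after all rows are processed avoids repeated compaction.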
Wrap the evaluation with timing:
def evaluate(size):
    degs = [1]*size
    # range and // keep these integer index lists on Python 3
    inVs = list(range(1, size, size//25))
    outVs = list(range(5, size, size//25))
    where = numpy.hstack((inVs, outVs)).astype(int)
    start_time = time.time()
    A = scipy.sparse.csc_matrix((size, size))
    M = scipy.sparse.diags([degs], [0]) - A
    csr_rows_set_nz_to_val(M, where)
    return time.time()-start_time
and test its performance:
>>> print('elapsed %.5f seconds' % evaluate(334863))
elapsed 0.53054 seconds