Efficient way to normalize a Scipy Sparse Matrix

This has been implemented in scikit-learn sklearn.preprocessing.normalize.

from sklearn.preprocessing import normalize
w_normalized = normalize(w, norm='l1', axis=1)

axis=1 should normalize by rows, axis=0 to normalize by column. Use the optional argument copy=False to modify the matrix in place.

While Aarons answer is correct, I implemented a solution when I wanted to normalize with respect to the maximum of the absolute values, which sklearn is not offering. My method uses the nonzero entries and finds them in the csr_matrix.data array to replace values there quickly.

def normalize_sparse(csr_matrix):
    nonzero_rows = csr_matrix.nonzero()[0]
    for idx in np.unique(nonzero_rows):
        data_idx = np.where(nonzero_rows==idx)[0]
        abs_max = np.max(np.abs(csr_matrix.data[data_idx]))
        if abs_max != 0:
            csr_matrix.data[data_idx] = 1./abs_max * csr_matrix.data[data_idx]

In contrast to sunan's solution, this method does not require any casting of the matrix into dense format (which could raise memory problems) and no matrix multiplications either. I tested the method on a sparse matrix of shape (35'000, 486'000) and it took ~ 18 seconds.

here is my solution.

transpose A
calculate sum of each col
format diagonal matrix B with reciprocal of sum
A*B equals normalization

transpose C

import scipy.sparse as sp
import numpy as np
import math

minf = 0.0001

A = sp.lil_matrix((5,5))
b = np.arange(0,5)
A.setdiag(b[:-1], k=1)
A.setdiag(b)
print A.todense()
A = A.T
print A.todense()

sum_of_col = A.sum(0).tolist()
print sum_of_col
c = []
for i in sum_of_col:
    for j in i:
        if math.fabs(j)<minf:
            c.append(0)
        else:
            c.append(1/j)

print c

B = sp.lil_matrix((5,5))
B.setdiag(c)
print B.todense()

C = A*B
print C.todense()
C = C.T
print C.todense()

I found this as an elegant way of doing it without using inbuilt functions.

import scipy.sparse as sp

def normalize(W):
    #Find the row scalars as a Matrix_(n,1)
    rowSumW = sp.csr_matrix(W.sum(axis=1))
    rowSumW.data = 1/rowSumW.data

    #Find the diagonal matrix to scale the rows
    rowSumW = rowSumW.transpose()
    scaling_matrix = sp.diags(rowSumW.toarray()[0])

    return scaling_matrix.dot(W)

Efficient way to normalize a Scipy Sparse Matrix

Tags:

Python

Numpy

Scipy

Sparse Matrix

Related

Recent Posts