Compute Jaccard distances on sparse matrix
Vectorization is relatively easy if you use matrix multiplication to calculate the set intersections and then the rule |union(a, b)| == |a| + |b| - |intersection(a, b)|
to determine the unions:
# Not actually necessary for sparse matrices, but it is for
# dense matrices and ndarrays, if X.dtype is integer.
from __future__ import division
def pairwise_jaccard(X):
"""Computes the Jaccard distance between the rows of `X`.
"""
X = X.astype(bool).astype(int)
intrsct = X.dot(X.T)
row_sums = intrsct.diagonal()
unions = row_sums[:,None] + row_sums - intrsct
dist = 1.0 - intrsct / unions
return dist
Note the cast to bool and then int, because the dtype of X
must be large enough to accumulate twice the maximum row sum and that entries of X
must be either zero or one. The downside of this code is that it's heavy on RAM, because unions
and dists
are dense matrices.
If you're only interested in distances smaller than some cut-off epsilon
, the code can be tuned for sparse matrices:
from scipy.sparse import csr_matrix
def pairwise_jaccard_sparse(csr, epsilon):
"""Computes the Jaccard distance between the rows of `csr`,
smaller than the cut-off distance `epsilon`.
"""
assert(0 < epsilon < 1)
csr = csr_matrix(csr).astype(bool).astype(int)
csr_rownnz = csr.getnnz(axis=1)
intrsct = csr.dot(csr.T)
nnz_i = np.repeat(csr_rownnz, intrsct.getnnz(axis=1))
unions = nnz_i + csr_rownnz[intrsct.indices] - intrsct.data
dists = 1.0 - intrsct.data / unions
mask = (dists > 0) & (dists <= epsilon)
data = dists[mask]
indices = intrsct.indices[mask]
rownnz = np.add.reduceat(mask, intrsct.indptr[:-1])
indptr = np.r_[0, np.cumsum(rownnz)]
out = csr_matrix((data, indices, indptr), intrsct.shape)
return out
If this still takes to much RAM you could try to vectorize over one dimension and Python-loop over the other.