Compute the pairwise distance in scipy with missing values
If I understand you correctly, you want the distance computed over all dimensions for which both vectors have valid values.
Unfortunately, pdist doesn't understand masked arrays in that sense, so I modified your semi-solution so that it does not throw away information. It is, however, neither the most efficient nor the most readable solution:
np.array([pdist(data[s][:, ~np.isnan(data[s]).any(axis=0)], "euclidean") for s in map(list, itertools.combinations(range(data.shape[0]), 2))]).ravel()
The outer np.array(...) and ravel() are only there to get the result into the shape you would expect.
itertools.combinations produces all possible pairs of row indices of the data array.
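As a small illustration of what those index pairs look like, here is a sketch for a hypothetical array with three rows (`n_rows` is a made-up name standing in for `data.shape[0]`):

```python
import itertools

n_rows = 3  # hypothetical stand-in for data.shape[0]

# Every unordered pair of row indices, converted from tuple to list
pairs = [list(p) for p in itertools.combinations(range(n_rows), 2)]
print(pairs)  # [[0, 1], [0, 2], [1, 2]]
```

The pairs come out in the same order that pdist uses for its condensed result.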
I then just slice data with each pair (it must be a list and not a tuple to select the rows correctly) and filter out the nan columns pairwise, just as your code did.
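If readability matters more than brevity, the same idea can be unrolled into a loop. This is only a sketch under a few assumptions: the function name `pairwise_nan_euclidean` and the sample `data` are made up for illustration, and pairs that share no valid column at all are not handled specially.

```python
import itertools

import numpy as np
from scipy.spatial.distance import pdist


def pairwise_nan_euclidean(data):
    """Condensed Euclidean distances using only columns valid in both rows.

    Hypothetical helper; pairs with no shared valid column are not
    treated specially here.
    """
    data = np.asarray(data, dtype=float)
    distances = []
    for i, j in itertools.combinations(range(data.shape[0]), 2):
        # Index with a list: data[[i, j]] selects rows i and j,
        # whereas data[(i, j)] would pick the single element data[i, j].
        pair = data[[i, j]]
        # Keep only the columns where neither of the two rows is nan
        valid = ~np.isnan(pair).any(axis=0)
        distances.append(pdist(pair[:, valid], "euclidean")[0])
    return np.array(distances)


# Made-up sample data with missing values
data = np.array([
    [1.0, 2.0, np.nan],
    [4.0, 6.0, 3.0],
    [np.nan, 2.0, 7.0],
])
result = pairwise_nan_euclidean(data)
print(result)
```

For this sample, rows 0 and 1 share the first two columns (distance 5), rows 0 and 2 share only the second column (distance 0), and rows 1 and 2 share the last two columns (distance sqrt(32)).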