How does condensed distance matrix work? (pdist)
Condensed distance matrix to full distance matrix
A condensed distance matrix as returned by pdist can be converted to a full distance matrix by using scipy.spatial.distance.squareform
:
>>> import numpy as np
>>> from scipy.spatial.distance import pdist, squareform
>>> points = np.array([[0,1],[1,1],[3,5], [15, 5]])
>>> dist_condensed = pdist(points)
>>> dist_condensed
array([ 1. , 5. , 15.5241747 , 4.47213595,
14.56021978, 12. ])
Use squareform
to convert to full matrix:
>>> dist = squareform(dist_condensed)
array([[ 0. , 1. , 5. , 15.5241747 ],
[ 1. , 0. , 4.47213595, 14.56021978],
[ 5. , 4.47213595, 0. , 12. ],
[ 15.5241747 , 14.56021978, 12. , 0. ]])
Distance between point i,j is stored in dist[i, j]:
>>> dist[2, 0]
5.0
>>> np.linalg.norm(points[2] - points[0])
5.0
Indices to condensed index
One can convert indices used for accessing the elements of the square matrix to the index in the condensed matrix:
def square_to_condensed(i, j, n):
assert i != j, "no diagonal elements in condensed matrix"
if i < j:
i, j = j, i
return n*j - j*(j+1)//2 + i - 1 - j
Example:
>>> square_to_condensed(1, 2, len(points))
3
>>> dist_condensed[3]
4.4721359549995796
>>> dist[1,2]
4.4721359549995796
Condensed index to indices
Also the other direction is possible without sqaureform which is better in terms of runtime and memory consumption:
import math
def calc_row_idx(k, n):
return int(math.ceil((1/2.) * (- (-8*k + 4 *n**2 -4*n - 7)**0.5 + 2*n -1) - 1))
def elem_in_i_rows(i, n):
return i * (n - 1 - i) + (i*(i + 1))//2
def calc_col_idx(k, i, n):
return int(n - elem_in_i_rows(i + 1, n) + k)
def condensed_to_square(k, n):
i = calc_row_idx(k, n)
j = calc_col_idx(k, i, n)
return i, j
Example:
>>> condensed_to_square(3, 4)
(1.0, 2.0)
Runtime comparison with squareform
>>> import numpy as np
>>> from scipy.spatial.distance import pdist, squareform
>>> points = np.random.random((10**4,3))
>>> %timeit dist_condensed = pdist(points)
1 loops, best of 3: 555 ms per loop
Creating the sqaureform turns out to be really slow:
>>> dist_condensed = pdist(points)
>>> %timeit dist = squareform(dist_condensed)
1 loops, best of 3: 2.25 s per loop
If we are searching two points with maximum distance it is not surprising that searching in full matrix is O(n) while in condensed form only O(n/2):
>>> dist = squareform(dist_condensed)
>>> %timeit dist_condensed.argmax()
10 loops, best of 3: 45.2 ms per loop
>>> %timeit dist.argmax()
10 loops, best of 3: 93.3 ms per loop
Getting the inideces for the two points takes almost no time in both cases but of course there is some overhead for calculating the condensed index:
>>> idx_flat = dist.argmax()
>>> idx_condensed = dist.argmax()
>>> %timeit list(np.unravel_index(idx_flat, dist.shape))
100000 loops, best of 3: 2.28 µs per loop
>>> %timeit condensed_to_square(idx_condensed, len(points))
100000 loops, best of 3: 14.7 µs per loop
I had the same question. And I found it simpler to use numpy.triu_indices
:
import numpy as np
from scipy.spatial.distance import pdist, squareform
N = 10
# Calculate distances
X = np.random.random((N,3))
dist_condensed = pdist(X)
# Get indexes: matrix indices of dist_condensed[i] are [a[i],b[i]]
a,b = np.triu_indices(N,k=1)
# Fill distance matrix
dist_matrix = np.zeros((N,N))
for i in range(len(dist_condensed)):
dist_matrix[a[i],b[i]] = dist_condensed[i]
dist_matrix[b[i],a[i]] = dist_condensed[i]
# Compare with squareform output
np.all(dist_matrix == squareform(dist_condensed))
The vector of the compressed matrix corresponds to the bottom triangular region of the square matrix. To convert for a point in that triangular region, you need to calculate the number of points to the left in the triangle, and the number above in the column.
You can use the following function to convert:
q = lambda i,j,n: n*j - j*(j+1)/2 + i - 1 - j
Check:
import numpy as np
from scipy.spatial.distance import pdist, squareform
x = np.random.uniform( size = 100 ).reshape( ( 50, 2 ) )
d = pdist( x )
ds = squareform( d )
for i in xrange( 1, 50 ):
for j in xrange( i ):
assert ds[ i, j ] == d[ q( i, j, 50 ) ]
You can look at it this way: Suppose x
is m by n. The possible pairs of m
rows, chosen two at a time, is itertools.combinations(range(m), 2)
, e.g, for m=3
:
>>> import itertools
>>> list(combinations(range(3),2))
[(0, 1), (0, 2), (1, 2)]
So if d = pdist(x)
, the k
th tuple in combinations(range(m), 2))
gives the indices of the rows of x
associated with d[k]
.
Example:
>>> x = array([[0,10],[10,10],[20,20]])
>>> pdist(x)
array([ 10. , 22.36067977, 14.14213562])
The first element is dist(x[0], x[1])
, the second is dist(x[0], x[2])
and the third is dist(x[1], x[2])
.
Or you can view it as the elements in the upper triangular part of the square distance matrix, strung together into a 1D array.
E.g.
>>> squareform(pdist(x))
array([[ 0. , 10. , 22.361],
[ 10. , 0. , 14.142],
[ 22.361, 14.142, 0. ]])
>>> y = array([[0,10],[10,10],[20,20],[10,0]])
>>> squareform(pdist(y))
array([[ 0. , 10. , 22.361, 14.142],
[ 10. , 0. , 14.142, 10. ],
[ 22.361, 14.142, 0. , 22.361],
[ 14.142, 10. , 22.361, 0. ]])
>>> pdist(y)
array([ 10. , 22.361, 14.142, 14.142, 10. , 22.361])