DBSCAN for clustering of geographic location data
You can cluster spatial latitude-longitude data with scikit-learn's DBSCAN without precomputing a distance matrix.
db = DBSCAN(eps=2/6371., min_samples=5, algorithm='ball_tree', metric='haversine').fit(np.radians(coordinates))
This comes from this tutorial on clustering spatial data with scikit-learn DBSCAN. In particular, notice that the eps
value is still 2 km, but it is divided by 6371 (the Earth's mean radius in km) to convert it to radians. Also notice that .fit()
takes the coordinates in radians, as the haversine metric requires.
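To make the unit conversion concrete, here is a minimal sketch using only the standard library. The helper name haversine_radians and the two sample points are made up for illustration; the formula is the standard haversine formula, with each point given as (latitude, longitude) in radians, which is the order scikit-learn's haversine metric expects.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, the constant behind eps=2/6371.


def haversine_radians(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, as an angle in radians.

    All inputs are in radians. Hypothetical helper for illustration only.
    """
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * math.asin(math.sqrt(a))


# Two made-up points on the equator, 0.01 degrees of longitude apart.
p1 = (math.radians(0.0), math.radians(0.0))
p2 = (math.radians(0.0), math.radians(0.01))

angle = haversine_radians(*p1, *p2)   # distance as an angle (radians)
distance_km = angle * EARTH_RADIUS_KM  # the same distance in kilometres

eps_radians = 2 / EARTH_RADIUS_KM      # a 2 km radius expressed in radians
within_eps = angle <= eps_radians      # would DBSCAN treat them as neighbors?

print(f"{distance_km:.3f} km, within eps: {within_eps}")
```

The two points are roughly 1.1 km apart, so they fall inside the 2 km eps. Multiplying a haversine result by 6371 converts it back to kilometres, which is the inverse of the division applied to eps.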
DBSCAN is meant to be used on the raw data, with a spatial index for acceleration. The only tool I know of with index acceleration for geo distances is ELKI (Java); scikit-learn unfortunately only supports this for a few distances such as Euclidean distance (see sklearn.neighbors.NearestNeighbors).
But apparently, you can afford to precompute pairwise distances, so this is not (yet) an issue.
However, you did not read the documentation carefully enough, and your assumption that DBSCAN uses a distance matrix is wrong:
from sklearn.cluster import DBSCAN
db = DBSCAN(eps=2, min_samples=5)
db.fit_predict(distance_matrix)
uses Euclidean distance on the distance matrix rows, which obviously does not make any sense.
See the documentation of DBSCAN (emphasis added):
class sklearn.cluster.DBSCAN(eps=0.5, min_samples=5, metric='euclidean', algorithm='auto', leaf_size=30, p=None, random_state=None)
metric : string, or callable
The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by metrics.pairwise.calculate_distance for its metric parameter. If metric is “precomputed”, X is assumed to be a distance matrix and must be square. X may be a sparse matrix, in which case only “nonzero” elements may be considered neighbors for DBSCAN.
Similarly for fit_predict:
X : array or sparse (CSR) matrix of shape (n_samples, n_features), or array of shape (n_samples, n_samples)
A feature array, or array of distances between samples if metric='precomputed'.
In other words, you need to do
db = DBSCAN(eps=2, min_samples=5, metric="precomputed")
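As a rough end-to-end sketch of the precomputed route, the toy example below builds a small distance matrix by hand and passes it to DBSCAN. The 1-D points and min_samples=3 are made up so that the result is easy to check by eye; with real data the matrix would hold your own pairwise distances, and eps would be in the same unit as those distances.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up 1-D points: two tight groups and one isolated point.
points = np.array([[0.0], [0.5], [1.0], [10.0], [10.5], [11.0], [50.0]])

# Square (n_samples, n_samples) matrix of pairwise absolute differences.
distance_matrix = np.abs(points - points.T)

# metric="precomputed" tells DBSCAN that X already contains distances,
# so eps is compared directly against the matrix entries.
db = DBSCAN(eps=2, min_samples=3, metric="precomputed")
labels = db.fit_predict(distance_matrix)

print(labels)  # [ 0  0  0  1  1  1 -1]
```

The label -1 marks noise: the point at 50.0 has no neighbors within eps. Without metric="precomputed", the same call would treat each row of the matrix as a feature vector and compute Euclidean distances between rows.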