Scikit-Learn: Predicting new points with DBSCAN
Clustering is not classification.
Clustering is unlabeled. If you want to squeeze it into a prediction mindset (which is not the best idea), then it essentially predicts without learning, because there is no labeled training data available for clustering. It has to make up new labels for the data, based on what it sees. But you can't do this on a single instance; you can only "bulk predict".
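For illustration, "bulk prediction" in scikit-learn is fit_predict, which labels a whole batch in one go (the toy data here is my own, just to make the snippet runnable):

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (50, 2)), rng.normal(3, 0.2, (50, 2))])

# The whole batch is labeled at once; -1 marks noise
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)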
But there is something wrong with scikit-learn's DBSCAN:
random_state : numpy.RandomState, optional
    The generator used to initialize the centers. Defaults to numpy.random.
DBSCAN does not "initialize the centers", because there are no centers in DBSCAN.
Pretty much the only clustering algorithm where you can assign new points to the old clusters is k-means (and its many variations), because it performs a "1NN classification" using the previous iteration's cluster centers and then updates the centers. But most algorithms don't work like k-means, so you can't copy this.
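In scikit-learn this k-means behavior is exposed directly as a predict method (a sketch reusing X from the snippet above; X_new is hypothetical new data):

from sklearn.cluster import KMeans

X_new = rng.normal(3, 0.2, (5, 2))

# Each new point is assigned to its nearest learned cluster center
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
new_labels = km.predict(X_new)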
If you want to classify new points, it is best to train a classifier on your clustering result.
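One way to do that, sketched with the same toy data as above (the choice of a 1-NN classifier and the exclusion of noise points are my own assumptions, not part of the answer):

from sklearn.neighbors import KNeighborsClassifier

db = DBSCAN(eps=0.5, min_samples=5).fit(X)

# Train a classifier on the clustering result, ignoring noise (label -1)
mask = db.labels_ != -1
clf = KNeighborsClassifier(n_neighbors=1).fit(X[mask], db.labels_[mask])
new_labels = clf.predict(X_new)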
What the R version is maybe doing is using a 1NN classifier for prediction; maybe with the extra rule that points are assigned the noise label if their 1NN distance is larger than epsilon, maybe also using the core points only. Maybe not.
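That hypothesized rule could look roughly like this (a sketch of the described behavior, reusing db and X_new from the snippets above; this is not confirmed to be what the R package does):

from sklearn.neighbors import NearestNeighbors

# 1NN against the core points only; beyond eps the point becomes noise (-1)
nn = NearestNeighbors(n_neighbors=1).fit(db.components_)
dist, idx = nn.kneighbors(X_new)
core_labels = db.labels_[db.core_sample_indices_]
pred = np.where(dist[:, 0] <= db.eps, core_labels[idx[:, 0]], -1)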
Get the DBSCAN paper; it does not discuss "prediction", IIRC.
While Anony-Mousse has some good points (clustering is indeed not classification), I think the ability to assign new points has its usefulness. *
Based on the original paper on DBSCAN and robertlayton's ideas on github.com/scikit-learn, I suggest running through the core points and assigning the new point to the cluster of the first core point that is within eps of it.
Then it is guaranteed that your point will at least be a border point of the assigned cluster according to the definitions used for the clustering.
(Be aware that your point might be deemed noise and not assigned to a cluster.)
I've done a quick implementation:
import numpy as np
from scipy.spatial import distance

def dbscan_predict(dbscan_model, X_new, metric=distance.cosine):
    # Result is noise by default
    y_new = np.full(len(X_new), -1, dtype=int)

    # Iterate over all input samples for a label
    for j, x_new in enumerate(X_new):
        # Find a core sample closer than eps
        for i, x_core in enumerate(dbscan_model.components_):
            if metric(x_new, x_core) < dbscan_model.eps:
                # Assign the label of x_core to x_new
                y_new[j] = dbscan_model.labels_[dbscan_model.core_sample_indices_[i]]
                break

    return y_new
The labels obtained by clustering (dbscan_model = DBSCAN(...).fit(X)) and the labels obtained from the same model on the same data (dbscan_predict(dbscan_model, X)) sometimes differ. I'm not quite certain if this is a bug somewhere or a result of randomness.
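A quick way to quantify how often they differ (assuming the function and a fitted dbscan_model from above; the agreement metric is my own):

# Fraction of points where fit labels and re-predicted labels agree
agreement = np.mean(dbscan_model.labels_ == dbscan_predict(dbscan_model, X))
print(f"label agreement: {agreement:.3f}")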
EDIT: I think the above problem of differing prediction outcomes could stem from the possibility that a border point can be close to multiple clusters. Please update if you test this and find an answer. The ambiguity might be resolved by shuffling the core points every time or by picking the closest instead of the first core point.
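A variant that picks the closest core point instead of the first one could look like this (a sketch of that idea, reusing the imports from above; untested against the original function's behavior):

def dbscan_predict_closest(dbscan_model, X_new, metric=distance.cosine):
    # Result is noise by default
    y_new = np.full(len(X_new), -1, dtype=int)
    core_labels = dbscan_model.labels_[dbscan_model.core_sample_indices_]

    for j, x_new in enumerate(X_new):
        # Distance from the new point to every core sample
        dists = np.array([metric(x_new, x_core)
                          for x_core in dbscan_model.components_])
        i = dists.argmin()
        # Assign the label of the closest core point, but only within eps
        if dists[i] < dbscan_model.eps:
            y_new[j] = core_labels[i]

    return y_new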
*) Case at hand: I'd like to evaluate whether the clusters obtained from a subset of my data make sense for another subset or are simply a special case. If they generalise, that supports the validity of the clusters and of the earlier pre-processing steps applied.