How to cluster by trend instead of by distance in R?

This question might be better suited to stats.stackexchange.com, but here's a solution anyway.

Your question is really "How do I pick the right distance metric?" Instead of Euclidean distance between these vectors, you want a distance that measures similarity in trend.

Here's one option:

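library(cluster)  # provides clara()

# Scale each row to mean 0 and sd 1, then take successive differences
a1 <- t(apply(a, 1, scale))
a2 <- t(apply(a1, 1, diff))

# Cluster the transformed rows, but plot the original data colored by cluster
cl <- clara(a2, 2)
matplot(t(a), type = "b", pch = 20, col = cl$clustering)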

Instead of defining a new distance metric, I've accomplished essentially the same thing by transforming the data. First, I scale each row so that relative trends can be compared without differences in scale throwing us off. Next, I convert the scaled data to successive differences.

Warning: This is not necessarily going to work for all "trend" data. In particular, looking at successive differences only captures a single, limited facet of "trend". You may have to put some thought into more sophisticated metrics.
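If you would rather use an explicit distance than a transformation, one common choice is correlation distance, 1 minus the Pearson correlation between rows. The following is only a sketch under assumptions: the matrix `a` here is made-up toy data (rows are series, columns are time points), and hierarchical clustering with average linkage is just one of several methods that accept a precomputed distance.

```r
# Toy data (hypothetical): same trends at very different scales.
a <- rbind(
  up_small   = c(1, 2, 3, 4, 5),
  up_large   = c(10, 20, 30, 40, 50),
  down_small = c(5, 4, 3, 2, 1),
  down_large = c(50, 40, 30, 20, 10)
)

# cor() works on columns, so transpose to correlate rows with each other.
# Distance is near 0 for rows that move together, near 2 for opposite trends.
d <- as.dist(1 - cor(t(a)))

# Hierarchical clustering on the correlation distance, cut into two groups.
hc <- hclust(d, method = "average")
groups <- cutree(hc, k = 2)
print(groups)
```

Correlation ignores both the level and the scale of each row, so it only compares overall shape. Like successive differences, it captures just one facet of "trend".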

Do more preprocessing. In any data mining task, preprocessing is 90% of the effort.

For example, if you want to cluster by trends, then you should probably apply the clustering to the trends, not to the raw values. For instance, standardize each curve to a mean of 0 and a standard deviation of 1, then compute the differences from one value to the next, and apply the clustering to this preprocessed data.
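Those steps could be sketched as follows. The matrix `a` is hypothetical toy data (one curve per row), and `hclust` on Euclidean distances is an assumed choice of clustering method; any method that works on a numeric matrix would do.

```r
# Toy data (hypothetical): two steadily rising curves and two that peak
# in the middle, each shape at a small and a large scale.
a <- rbind(
  rise_small = c(1, 2, 3, 4, 5, 6),
  rise_large = c(100, 210, 290, 405, 500, 610),
  peak_small = c(1, 3, 5, 5, 3, 1),
  peak_large = c(100, 310, 490, 510, 300, 90)
)

# Step 1: standardize each curve (row) to mean 0 and standard deviation 1.
# scale() works on columns, hence the double transpose.
z <- t(scale(t(a)))

# Step 2: differences from one value to the next, per row.
steps <- t(apply(z, 1, diff))

# Step 3: cluster the preprocessed data instead of the raw values.
hc <- hclust(dist(steps), method = "average")
groups <- cutree(hc, k = 2)
print(groups)
```

Clustering the raw `a` with Euclidean distance would mostly separate the small curves from the large ones. After standardizing and differencing, the grouping is driven by shape.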
