How to compute distances between centroids and data matrix (for kmeans algorithm)
Your main question seems to be how to calculate distances between a data matrix and some set of points ("centers").
For this you can write a function that takes as input a data matrix and your set of points and returns distances for each row (point) in the data matrix to all the "centers".
Here is such a function:
myEuclid <- function(points1, points2) {
distanceMatrix <- matrix(NA, nrow=dim(points1)[1], ncol=dim(points2)[1])
for(i in 1:nrow(points2)) {
distanceMatrix[,i] <- sqrt(rowSums(t(t(points1)-points2[i,])^2))
}
distanceMatrix
}
points1
is the data matrix with points as rows and dimensions as columns. points2
is the matrix of centers (points as rows again). The first line of code just defines the answer matrix (which will have as many rows as there are rows in the data matrix and as many columns as there are centers). So the point i,j
in the result matrix will be the distance from the ith point to the jth center.
Then the for loop iterates over all centers. For each center it computes the euclidean distance from each point to the current center and returns the result. This line here: sqrt(rowSums(t(t(points1)-points2[i,])^2))
is euclidean distance. Inspect it closer and look up the formula if you have any troubles with that. (the transposes there are mainly done to make sure subtraction is being done row-wise).
Now you can also implement k-means algorithm:
myKmeans <- function(x, centers, distFun, nItter=10) {
clusterHistory <- vector(nItter, mode="list")
centerHistory <- vector(nItter, mode="list")
for(i in 1:nItter) {
distsToCenters <- distFun(x, centers)
clusters <- apply(distsToCenters, 1, which.min)
centers <- apply(x, 2, tapply, clusters, mean)
# Saving history
clusterHistory[[i]] <- clusters
centerHistory[[i]] <- centers
}
list(clusters=clusterHistory, centers=centerHistory)
}
As you can see it's also a very simple function - it takes data matrix, centers, your distance function (the one defined above) and number of wanted iterations.
The clusters are defined by assigning the closest center for each point. And centers are updated as a mean of the points assigned to that center. Which is a basic k-means algorithm).
Let's try it out. Define some random points (in 2d, so number of columns = 2)
mat <- matrix(rnorm(100), ncol=2)
Assign 5 random points from that matrix as initial centers:
centers <- mat[sample(nrow(mat), 5),]
Now run the algorithm:
theResult <- myKmeans(mat, centers, myEuclid, 10)
Here are the centers in the 10th iteration:
theResult$centers[[10]]
[,1] [,2]
1 -0.1343239 1.27925285
2 -0.8004432 -0.77838017
3 0.1956119 -0.19193849
4 0.3886721 -1.80298698
5 1.3640693 -0.04091114
Compare that with implemented kmeans
function:
theResult2 <- kmeans(mat, centers, 10, algorithm="Forgy")
theResult2$centers
[,1] [,2]
1 -0.1343239 1.27925285
2 -0.8004432 -0.77838017
3 0.1956119 -0.19193849
4 0.3886721 -1.80298698
5 1.3640693 -0.04091114
Works fine. Our function however tracks the iterations. We can plot the progress over the first 4 iterations like this:
par(mfrow=c(2,2))
for(i in 1:4) {
plot(mat, col=theResult$clusters[[i]], main=paste("itteration:", i), xlab="x", ylab="y")
points(theResult$centers[[i]], cex=3, pch=19, col=1:nrow(theResult$centers[[i]]))
}
Nice.
However this simple design allows for much more. For example if we want to use another kind of distance (not euclidean) we can just use any function that takes data and centers as inputs. Here is one for correlation distances:
myCor <- function(points1, points2) {
return(1 - ((cor(t(points1), t(points2))+1)/2))
}
And we then can do Kmeans based on those:
theResult <- myKmeans(mat, centers, myCor, 10)
The resulting picture for 4 iterations then looks like this:
Even thou we specified 5 clusters - there were 2 left at the end. That is because for 2 dimensions the correlation can have to values - either +1 or -1. Then when looking for the clusters each point get's assigned to one center, even if it has the same distance to multiple centers - the first one get's chosen.
Anyway this is now getting out of scope. The bottom line is that there are many possible distance metrics and one simple function allows you to use any distance you want and track the results over iterations.
Modified the distance matrix function above (added another loop for no. of points) as the above function displays only the distance of first point from all clusters and not all points, which is what the question is looking for:
myEuclid <- function(points1, points2) {
distanceMatrix <- matrix(NA, nrow=dim(points1)[1], ncol=dim(points2)[1])
for(i in 1:nrow(points2)) {
for (j in c(1:dim(t(points1))[2])) {
distanceMatrix[j,i] <- sqrt(rowSums(t(t(points1)[,j]-t(points2[i,]))^2))
}
}
distanceMatrix
}
Do let me know if this works fine!