How to calculate BIC for k-means clustering in R
For anyone else landing here, there's a method proposed by Sherry Towers at http://sherrytowers.com/2013/10/24/k-means-clustering/, which uses output from stats::kmeans
. I quote:
The AIC can be calculated with the following function:
kmeansAIC = function(fit){ m = ncol(fit$centers) n = length(fit$cluster) k = nrow(fit$centers) D = fit$tot.withinss return(D + 2*m*k) }
From the help for stats::AIC
, you can also see that the BIC can be calculated in a similar way to the AIC. An easy way to get the BIC is to replace the return()
in the above function, with this:
return(data.frame(AIC = D + 2*m*k,
BIC = D + log(n)*m*k))
So you would use this as follows:
fit <- kmeans(x = data,centers = 6)
kmeansAIC(fit)
To compute BIC, just add .5*k*d*log(n)
(where k
is the number of means, d
is the length of a vector in your dataset, and n
is the number of data points) to the standard k-means error function.
The standard k-means penalty is \sum_n (m_k(n)-x_n)^2
, where m_k(n)
is the mean associated with the nth data point. This penalty can be interpreted as a log probability, so BIC is perfectly valid.
BIC just adds an additional penalty term to the k-means error proportional to k
.
Just to add to what user1149913 said (I don't have enough reputation to comment), since you're using the kmeans function in R, \sum_n (m_k(n)-x_n)^2
is already calculated for you as KClData$tot.withinss
.