Golf the K-means algorithm

C++, 479 474 bytes

Only ~20x as much as Matlab!

Golfed

#define V vector<P>
#define f float
struct P{f x,y,i=0;f d(P&p){return(p.x-x)*(p.x-x)+(p.y-y)*(p.y-y);}f n(P&p){return i?x/=i,y/=i,d(p):x=p.x,y=p.y,0;}f a(P&p){x+=p.x,y+=p.y,i++;}};P z;int l(P a,P b){return a.d(z)<b.d(z);}f m(f k,V&p){f s=p.size(),i,j=0,t=1;V c(k),n=c,d;for(random_shuffle(p.begin(),p.end());j<k;c[j].i=j++)c[j]=p[j];for(;t;c=n,n=V(k)){for(i=0;i<s;i++)d=c,z=p[i],sort(d.begin(),d.end(),l),j=d[0].i,p[i].i=j,n[j].a(p[i]);for(j=t=0;j<k;j++)t+=n[j].n(c[j]);}}

The input/output to the algorithm is a set of points (struct P) with x and y; and the output is the same set, with their i's set to indicate the index of the output cluster that the point ends in.

That extra i is also used to identify the clusters. In the main loop, the closest centroid to each point is found by sorting a copy of the current centroids by proximity to that point.

This handles degenerate cases (empty clusters) by keeping the corresponding centroids' previous position (see definition of P::n, which also returns distance-to-previous-centroid). A few chars could be saved by assuming that these will not crop up.

Ungolfed, with main

#include <cstdio>
#include <ctime>
#include <cstdlib>
#include <vector>
#include <algorithm>
using namespace std;

#define V vector<P>
#define f float
struct P{
    f x,y,i=0;
    f d(P&p){return(p.x-x)*(p.x-x)+(p.y-y)*(p.y-y);} // distance squared
    f n(P&p){return i?x/=i,y/=i,d(p):x=p.x,y=p.y,0;} // normalize-or-reset
    f a(P&p){x+=p.x,y+=p.y,i++;}                     // add coordinates
};
P z;int l(P a,P b){return a.d(z)<b.d(z);}            // closer-to-z comparator 
f m(f k,V&p){
    f s=p.size(),i,j=0,t=1;V c(k),n=c,d;
    for(random_shuffle(p.begin(),p.end());j<k;c[j].i=j++)
        c[j]=p[j];                                // initial random assignment
    for(;t;c=n,n=V(k)){                           
        for(i=0;i<s;i++)                          // assign to clusters
            d=c,z=p[i],sort(d.begin(),d.end(),l),
            j=d[0].i,p[i].i=j,n[j].a(p[i]);       // and add those coords
        for(j=t=0;j<k;j++)t+=n[j].n(c[j]);        // normalize & count changes
    }        
}

int main(int argc, char **argv) {
    srand((unsigned long)time(0));

    int k;
    V p;
    sscanf(argv[1], "%d", &k);
    printf("Input:\n");
    for (int i=2,j=0; i<argc; i+=2, j++) {
        P n;
        sscanf(argv[i], "%f", &(n.x));
        sscanf(argv[i+1], "%f", &(n.y));
        p.push_back(n);
        printf("%d : %f,%f\n", j, p[j].x, p[j].y);
    }

    m(k,p);
    printf("Clusters:\n");
    for (int q=0; q<k; q++) {
        printf("%d\n", q);
        for (unsigned int i=0; i<p.size(); i++) {
            if (p[i].i == q) printf("\t%f,%f (%d)\n", p[i].x, p[i].y, i);
        }
    }
    return 0;
}

Matlab, 25 bytes

@(x,k)kmeans(x,k,'S','u')

Given a n x 2 matrix (one row per point e.g. [[4,8]; [15,16]; [23,42]; [-13.37,-12.1]; [666,-666]]) this functions returns a list of labels for each input point.


J, 60 54 bytes

p=:[:(i.<./)"1([:+/&.:*:-)"1/
]p](p(+/%#)/.[)^:_(?#){]

Defines a helper verb p which takes a list of points and centroids and classifies each point by the index of the nearest centroid. Then, it uses that to repeat the process of choosing new centroid by taking the averages of the points in each cluster until it converges, and then to partition the points for output.

Usage

The value of k is given as an integer on the LHS. The list of points is given as a 2d array on the RHS. Here it is specified as a list of points which is reshaped into a 2d array of 5 x 2. The output will be the label for which cluster each point belongs in the same order as the input.

If you desire to use a fixed seed for reproducible results, replace the ? with a ?. at (?#).

   p =: [:(i.<./)"1([:+/&.:*:-)"1/
   f =: ]p](p(+/%#)/.[)^:_(?#){]
   3 f (5 2 $ 4 8 15 16 23 42 _13.37 _12.1 666 _666)
0 1 1 0 2

Explanation

[:(i.<./)"1([:+/&.:*:-)"1/  Input: points on LHS, centroids on RHS
           (          )"1/  Form a table between each point and centroid and for each
                     -        Find the difference elementwise
            [:     *:         Square each
              +/&.:           Reduce using addition
                              Apply the inverse of square (square root) to that sum
[:(     )"1                 For each row of that table
     <./                      Reduce using min
   i.                         Find the index of the minimum in that row
                            Returns a list of indices for each point that shows
                            which centroid it belongs to

]p](p(+/%#)/.[)^:_(?#){]  Input: k on LHS, points on RHS
                    #     Count the number of points
                   ?      Choose k values in the range [0, len(points))
                          without repetition
                       ]  Identity function, get points
                      {   Select the points at the indices above
  ]                       Identity function, get points
   (         )^:_         Repeat until convergence
    p                       Get the labels for each point
             [              Identity function, get points
           /.               Partition the points using the labels and for each
      +/                      Take the sums of points elementwise
         #                    Get the number of points
        %                     Divide sum elementwise by the count
                            Return the new values as the next centroids
]                         Identity function, get points
 p                        Get the labels for each point and return it