How to split data into clusters based on fitting to function

If we can assume we know more or less the functions for the two groups, in this case two linear functions (two parameters each) we can just NMinimize for the Min distance of the two functions.

{w, sol} = NMinimize[
  Total[
   Map[
    Min[Abs[#[[2]] - (n1 + m1 #[[1]])], 
      Abs[#[[2]] - (n2 + m2 #[[1]])]] &
    , dd
    ]
   ], {n1, n2, m1, m2}]
{95.3241, {n1 -> 1.84575, n2 -> -19.8623, m1 -> 0.300732,  m2 -> 0.598053}}

Once we have extracted the parameters for the two functions, then we just GatherBy each point, based on which function is closer.

ListPlot@Module[{n1, n2, m1, m2},
  {n1, n2, m1, m2} = Values[sol];
  GatherBy[dd, 
   LessEqual[Abs[#[[2]] - (n1 + m1 #[[1]])], 
     Abs[#[[2]] - (n2 + m2 #[[1]])]] &]
  ]

Mathematica graphics

More general

Define your function with arbitrary number of parameters

func[parms_List][x_] := parms[[1]] + parms[[2]]  x

A distance function

dist[f_][{x_, y_}] := Abs[y - f[x]]

Or (after comment by @SampoSmolander)

dist[f_][{x_, y_}] := (y - f[x])^2

Fake data

dd = Table[
   With[
    {
     a = RandomChoice[{1, 4}],
     b = RandomChoice[{1/2, 2}]
     }, {x, func[{a, b}][x] + RandomReal[{-1, 1}]}], {x, 0, 10, 
    0.01}];

Minimisation, The i index goes through the number of parameters of each function (here 2), the index j over the number of functions to fit (in this case 4).

sol2 = Last@With[{n = 2, m = 4},
   NMinimize[
    Total@Map[Function[{L}, Min @@ Table[
         dist[func[Array[c[#, k] &, n]]][L]
         , {k, m}]], dd]
    , Flatten[Table[c[i, j], {i, n}, {j, m}]]]
   ]

Plot

Show[
 ListPlot@GatherBy[
   dd
   , Position[#, Min[#]] &[
     Table[dist[func[{c[1, k], c[2, k]} /. sol2]][#], {k, 4}]] &
   ],
 Plot[
  Evaluate[
   Table[
     func[{c[1, k], c[2, k]}][x]
     , {k, 4}] /. sol2
   ]
  , {x, 0, 10}
  ]]

Mathematica graphics

Obviously there is no way to be sure this will converge to reasonable clusters if the lines are too mixed or the functions too complicated.

Fitting different number of functions

Let's see how does this perform when there are four groups of points and we ask for m different number of clusters, from 1 to 6.

plots = Table[
  sol2 = Last@With[{n = 2},
     NMinimize[
      Total@Map[Function[{L}, Min @@ Table[
           dist[func[Array[c[#, k] &, n]]][L]
           , {k, m}]], dd]
      , Flatten[Table[c[i, j], {i, n}, {j, m}]]]
     ];
  ListPlot[
   GatherBy[
    dd
    , Position[#, Min[#]] &[
      Table[dist[func[{c[1, k], c[2, k]} /. sol2]][#], {k, m}]] &
    ], PlotLabel -> m], {m, 6}];

Grid@Partition[plots, 2]

Mathematica graphics


Another approach to this problem is to use the image processing capabilities. This code creates a binary image from the data, then uses ImageLines, which implements the Radon transform to locate the most prominent lines. These are then highlighted in orange.

dd = Join[Table[{x, .3 x + 2 + RandomReal[{-1, 1}]}, {x, 0, 100}], 
  Table[{x, .6 x - 20 + RandomReal[{-1, 1}]}, {x, 0, 100}]]; 
img = ListPlot[dd, Axes -> False] // Image;
binImg = Binarize[ColorNegate@img];
lines = ImageLines[binImg, 0.2];
HighlightImage[binImg, {Orange, Line /@ lines}]

enter image description here

In the event of more lines, you can change the second argument to ImageLines.


This does the same as the other answer (fit two lines to the data, cluster data points based on whether they are closer to one line or the other), but with a bit different coding style:

dd = Join[Table[ {x, .3 x + 2 + RandomReal[{-1, 1}]}, {x, 0, 100}], 
  Table[ {x, .6 x - 20 + RandomReal[{-1, 1}]}, {x, 0, 100}]];

cost1[a1_, b1_, a2_, b2_, {x_, y_}] :=
  Min[(a1 + b1 x - y)^2, (a2 + b2 x - y)^2]
cost[a1_, b1_, a2_, b2_] := 
  Total[Map[cost1[a1, b1, a2, b2, #]&, dd]]

sol = Minimize[cost[a1, b1, a2, b2], {a1, b1, a2, b2}][[2]]

set1q[{x_, y_}] := (a1 + b1 x - y)^2 < (a2 + b2 x - y)^2 /. sol
set1 = Select[dd, set1q];
set2 = Complement[dd, set1];

ListPlot[{set1, set2}]