How to split data into clusters based on fitting to function
If we can assume we know more or less the functions for the two groups, in this case two linear functions (two parameters each) we can just NMinimize
for the Min
distance of the two functions.
{w, sol} = NMinimize[
Total[
Map[
Min[Abs[#[[2]] - (n1 + m1 #[[1]])],
Abs[#[[2]] - (n2 + m2 #[[1]])]] &
, dd
]
], {n1, n2, m1, m2}]
{95.3241, {n1 -> 1.84575, n2 -> -19.8623, m1 -> 0.300732, m2 -> 0.598053}}
Once we have extracted the parameters for the two functions, then we just GatherBy
each point, based on which function is closer.
ListPlot@Module[{n1, n2, m1, m2},
{n1, n2, m1, m2} = Values[sol];
GatherBy[dd,
LessEqual[Abs[#[[2]] - (n1 + m1 #[[1]])],
Abs[#[[2]] - (n2 + m2 #[[1]])]] &]
]
More general
Define your function with arbitrary number of parameters
func[parms_List][x_] := parms[[1]] + parms[[2]] x
A distance function
dist[f_][{x_, y_}] := Abs[y - f[x]]
Or (after comment by @SampoSmolander)
dist[f_][{x_, y_}] := (y - f[x])^2
Fake data
dd = Table[
With[
{
a = RandomChoice[{1, 4}],
b = RandomChoice[{1/2, 2}]
}, {x, func[{a, b}][x] + RandomReal[{-1, 1}]}], {x, 0, 10,
0.01}];
Minimisation, The i
index goes through the number of parameters of each function (here 2), the index j
over the number of functions to fit (in this case 4).
sol2 = Last@With[{n = 2, m = 4},
NMinimize[
Total@Map[Function[{L}, Min @@ Table[
dist[func[Array[c[#, k] &, n]]][L]
, {k, m}]], dd]
, Flatten[Table[c[i, j], {i, n}, {j, m}]]]
]
Plot
Show[
ListPlot@GatherBy[
dd
, Position[#, Min[#]] &[
Table[dist[func[{c[1, k], c[2, k]} /. sol2]][#], {k, 4}]] &
],
Plot[
Evaluate[
Table[
func[{c[1, k], c[2, k]}][x]
, {k, 4}] /. sol2
]
, {x, 0, 10}
]]
Obviously there is no way to be sure this will converge to reasonable clusters if the lines are too mixed or the functions too complicated.
Fitting different number of functions
Let's see how does this perform when there are four groups of points and we ask for m
different number of clusters, from 1 to 6.
plots = Table[
sol2 = Last@With[{n = 2},
NMinimize[
Total@Map[Function[{L}, Min @@ Table[
dist[func[Array[c[#, k] &, n]]][L]
, {k, m}]], dd]
, Flatten[Table[c[i, j], {i, n}, {j, m}]]]
];
ListPlot[
GatherBy[
dd
, Position[#, Min[#]] &[
Table[dist[func[{c[1, k], c[2, k]} /. sol2]][#], {k, m}]] &
], PlotLabel -> m], {m, 6}];
Grid@Partition[plots, 2]
Another approach to this problem is to use the image processing capabilities. This code creates a binary image from the data, then uses ImageLines, which implements the Radon transform to locate the most prominent lines. These are then highlighted in orange.
dd = Join[Table[{x, .3 x + 2 + RandomReal[{-1, 1}]}, {x, 0, 100}],
Table[{x, .6 x - 20 + RandomReal[{-1, 1}]}, {x, 0, 100}]];
img = ListPlot[dd, Axes -> False] // Image;
binImg = Binarize[ColorNegate@img];
lines = ImageLines[binImg, 0.2];
HighlightImage[binImg, {Orange, Line /@ lines}]
In the event of more lines, you can change the second argument to ImageLines.
This does the same as the other answer (fit two lines to the data, cluster data points based on whether they are closer to one line or the other), but with a bit different coding style:
dd = Join[Table[ {x, .3 x + 2 + RandomReal[{-1, 1}]}, {x, 0, 100}],
Table[ {x, .6 x - 20 + RandomReal[{-1, 1}]}, {x, 0, 100}]];
cost1[a1_, b1_, a2_, b2_, {x_, y_}] :=
Min[(a1 + b1 x - y)^2, (a2 + b2 x - y)^2]
cost[a1_, b1_, a2_, b2_] :=
Total[Map[cost1[a1, b1, a2, b2, #]&, dd]]
sol = Minimize[cost[a1, b1, a2, b2], {a1, b1, a2, b2}][[2]]
set1q[{x_, y_}] := (a1 + b1 x - y)^2 < (a2 + b2 x - y)^2 /. sol
set1 = Select[dd, set1q];
set2 = Complement[dd, set1];
ListPlot[{set1, set2}]