Deleting noisy data from a plot (manually) and export the best remaining data

One simple way of "filtering" your data is to treat the points as a graph, and search for the shortest path from left to right:

Click to copy

xScale = 10.; 
xy = Transpose[{N[Range[Length[data]]]*(xScale/Length[data]), data}]; 

start = {-xScale, Mean[data]}; 
finish = {2*xScale, Mean[data]}; 
graph = NearestNeighborGraph[Join[{start}, xy, {finish}], 25]; 
graph = SetProperty[graph, 
   EdgeWeight -> 
    Apply[SquaredEuclideanDistance, EdgeList[graph], {1}]];

Now we have graph with all your points, plus two extra points "start" and "finish" far left and far right of the point set.

enter image description here

Every point is connected to it's 25 closest neighbors, and I'm using SquaredEuclideanDistance as edge weights - that way, long edges are penalized, and the shortest path should contain all "inlier" points.

Here's the shortest path from start to finish:

Click to copy

path = FindShortestPath[graph, start, finish][[2 ;; -2]];
ListLinePlot[path, PlotRange -> MinMax[data], 
 Prolog -> {Gray, Point[xy]}, PlotStyle -> Red, ImageSize -> 600]

enter image description here

ADD:

I was asked what the xScale parameter means. Basically, it defines the cost of "skipping" a point of data, in relation to going "up and down" in y direction. So for smaller values of xScale, to cost of skipping a point is low, and a lot of points are skipped. If we increase xScale, more points get selected:

enter image description here

Here's an alternate approach that takes the data into an image that you can edit in a Paint program, and then back into data. Presumes you have "insider knowledge" about the data set that allows you to identify and exclude bad data.

Assuming the data is in dat, plot it

Click to copy

 ListPlot[dat]

enter image description here

Convert to an image, using a sparse array as an intermediary:

Click to copy

mindat = dat // Min
normdat = (dat - mindat)/((dat - mindat) // Max);
discretizer = 5000; (* discretizes the vertical axis, you pick the value to choose resolution *)
datsparse = Floor[normdat * discretizer] + 1;
sparsedata = Transpose[{datsparse, Range[Length[datsparse]]}];
rules = Map[# -> 1 &, sparsedata];
spa = SparseArray[rules]
MatrixPlot[spa, ColorFunction -> "Monochrome"]

enter image description here

From sparse array to image

Click to copy

immf = MaxFilter[Image[spa, "Bit"], 2]
Export["C:\\immf.tiff", immf]

Note: the MaxFilter[ ] command will result in multiple values for the same time step. Can delete if you want. Also note the image is inverted. Taken care of on the Import[ ].

enter image description here

Edit it in your favorite paint program and import, reversing the math

Click to copy

immfmod = Import["C:\\immfmod.tiff"] // Binarize

enter image description here

Grab the data and renormalize. First the data...

Click to copy

imd = immfmod // ImageData

Get the location of positions where they have a value of '1', doing a little manipulating to get the X and Y axes correct, and transpose so we can scale the Y axis in the next step.

Click to copy

postrans = Transpose[Sort[Map[Reverse, Position[imd, 1]]]];

Reset the Y axis back to the original space

Click to copy

postrans[[2]] = (postrans[[2]] - 1)/discretizer ((dat - mindat) // Max) + mindat;
pos = postrans // Transpose;

Take a look

enter image description here

The data has repeated points for the same X-axis values. If you don't like that, get rid of the MaxFilter[ ] command, as I just used it to make the image easier to see.

Introduction

It seems to me that this question should be answered using more "traditional" time series methods than the already provided interesting solutions (with graphs and image processing.)

The workflow shown below is something considered during the design of the QRMon package and it is very similar to the data cleaning done in "Cleaning away data points which are enveloped within a function".

The "traditional" time series procedure

Summarize the data
Do a (Quantile Regression) fit.
Pick points close to the fitted curve.
- Using an appropriate threshold.
Plot the picked points.
If satisfactory results stop, else use the picked points as new data and goto to 1.

Get the data

Actually the other answers did not discuss how the data is obtained. I downloaded the data from the provided link and had to pre-process it a bit.

Click to copy

data = Import["~/Downloads/MSE-q188361.txt", "Data"];

Tally[Length /@ data]
(* {{5, 1704}, {4, 34}} *)

data = Select[data, Length[#] == 5 &];
data = data[[All, 1 ;; 4]];
data = Select[data, VectorQ[#, NumberQ] &];
Dimensions[data]
(* {1703, 4} *)

Workflow code

The implementation below uses the package QRMon:

Click to copy

Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/MonadicProgramming/MonadicQuantileRegression.m"]

and Fold. Only two interations are needed, but I did experiment with different regression fits (algorithms, function bases parameters, and algorithm options) and different point picking thresholds.

The data is quite skewed, so the built-in function Fit does not work that well. The Quantile Regression algorithm is somewhat slow, but the whole computation should finish within 15 seconds.

Click to copy

AbsoluteTiming[
 cleanData =
   Fold[
    First[Values[
       QRMonUnit[#1]⟹
        QRMonEcho[Style[Row[{"Iteration parameters:\n{number of knots, quantile, pick threshold}=", #2}], Bold, Purple, FontSize -> 16]]⟹
        QRMonEchoDataSummary⟹
        QRMonQuantileRegression[#2[[1]], #2[[2]], Method -> {LinearProgramming, Method -> "InteriorPoint", Tolerance -> 10^(-3)}]⟹
        QRMonSetRegressionFunctionsPlotOptions[{PlotStyle -> Red}]⟹
        QRMonPlot[ImageSize -> Large, PlotLabel -> Style["Data and fit", Bold, 16]]⟹
        QRMonPickPathPoints[#2[[3]]]⟹
        QRMonEchoFunctionValue[ListPlot[#, ImageSize -> Large, PlotLabel -> Style["Picked points", Bold, 16], PlotTheme -> "Detailed"] & /@ # &]⟹
        QRMonTakeValue
       ]] &, 
    Join @@ data, {{16, 0.3, 0.1}, {30, 0.5, 0.025} (*,{24, 0.5, 0.01}*)}];
]

enter image description here

The final result is given to the variable cleanData:

Click to copy

Short[cleanData]
(* {{2, 5.3698}, {4, 5.3698} <<4563>> {6809, 4.813}, {6811, 4.813}) *)

Deleting noisy data from a plot (manually) and export the best remaining data

ADD:

Introduction

The "traditional" time series procedure

Get the data

Workflow code

Tags:

Image

Dataset

Signal Processing

Image Processing

Time Series

Related

Recent Posts