Filtering Outliers - how to make median-based Hampel Function faster?
A Pandas solution is several orders of magnitude faster:
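import numpy as np

def hampel(vals_orig, k=7, t0=3):
    '''
    vals_orig: pandas Series of values from which to remove outliers
    k: size of window (including the sample; 7 is equal to 3 on either side of value)
    t0: number of scaled deviations beyond which a value counts as an outlier
    '''
    # Make a copy so the original is not edited
    vals = vals_orig.copy()
    # Hampel filter
    L = 1.4826  # scale factor relating the MAD to the standard deviation for Gaussian data
    # Note: rolling(k) uses a trailing window by default; center=True would center it on each sample
    rolling_median = vals.rolling(k).median()
    difference = np.abs(rolling_median - vals)
    median_abs_deviation = difference.rolling(k).median()
    threshold = t0 * L * median_abs_deviation
    outlier_idx = difference > threshold
    # Outliers are replaced with NaN
    vals[outlier_idx] = np.nan
    return vals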
Timing this gives 11 ms vs 15 seconds; vast improvement.
I found a solution for a similar filter in this post.
The solution by @EHB above is helpful, but it is incorrect. Its median_abs_deviation is a rolling median of difference, and difference is itself the distance between each data point and its own rolling median from rolling_median. The MAD should instead be computed within each window: take the median of the window, then take the median of the absolute differences between every value in that window and that single window median. I took the code above and modified it:
import numpy as np

def hampel(vals_orig, k=7, t0=3):
    '''
    vals_orig: pandas Series of values in which to replace outliers
    k: size of window (including the sample; 7 is equal to 3 on either side of value)
    t0: number of scaled MADs beyond which a value counts as an outlier
    '''
    # Make a copy so the original is not edited
    vals = vals_orig.copy()
    # Hampel filter
    L = 1.4826
    rolling_median = vals.rolling(window=k, center=True).median()
    # MAD of one window: median of absolute deviations from that window's median
    MAD = lambda x: np.median(np.abs(x - np.median(x)))
    # raw=True passes each window to MAD as a NumPy array rather than a Series
    rolling_MAD = vals.rolling(window=k, center=True).apply(MAD, raw=True)
    threshold = t0 * L * rolling_MAD
    difference = np.abs(vals - rolling_median)
    # Perhaps a condition should be added here for the case where the threshold
    # is 0.0; maybe do not mark as outlier. MAD may be 0.0 without the original
    # values being equal. See the differences between the MAD and the standard deviation.
    outlier_idx = difference > threshold
    # Outliers are replaced with the rolling median rather than NaN
    vals[outlier_idx] = rolling_median[outlier_idx]
    return vals
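Since rolling(...).apply still calls a Python function once per window, it may be slow on long series. As a rough sketch (not benchmarked here), the same per-window median and MAD can be computed in bulk with NumPy's sliding_window_view, which needs NumPy 1.20 or later. The name hampel_vectorized and the mad > 0 guard are my own assumptions, not part of the answers above; the guard is just one possible way to handle the zero-threshold case raised in the note inside the code above. It assumes an odd k, and has not been checked against input containing NaNs.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view


def hampel_vectorized(vals_orig, k=7, t0=3):
    """Hypothetical vectorized variant of the centered Hampel filter.

    Assumes k is odd. The first and last k // 2 samples are left unchanged,
    as with rolling(window=k, center=True).
    """
    if k % 2 == 0:
        raise ValueError("k must be odd so the window is centered on a sample")

    x = vals_orig.to_numpy(dtype=float)
    n = len(x)
    result = x.copy()  # work on a copy so the input is not edited
    if n >= k:
        half = k // 2
        # One row per full window: shape (n - k + 1, k)
        windows = sliding_window_view(x, k)
        med = np.median(windows, axis=1)
        # MAD of each window relative to that window's own median
        mad = np.median(np.abs(windows - med[:, None]), axis=1)
        threshold = t0 * 1.4826 * mad

        centre = x[half:n - half]  # the sample in the middle of each window
        # Assumption: skip windows whose MAD is 0 rather than flag everything in them
        is_outlier = (mad > 0) & (np.abs(centre - med) > threshold)
        result[half:n - half] = np.where(is_outlier, med, centre)

    return pd.Series(result, index=vals_orig.index, name=vals_orig.name)


s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0, 6.0, 7.0, 8.0, 9.0])
print(hampel_vectorized(s, k=7, t0=3))
```

With the guard in place, a single spike in an otherwise constant series is not replaced, because the MAD of its window is 0; drop the mad > 0 condition if you want such spikes flagged.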