Estimate error on slope of linear regression given data with associated uncertainty
Here's a method for doing weighted orthogonal regression of a straight line, based on the formulae in Krystek/Anton and York:
ortlinfit[data_?MatrixQ, errs_?MatrixQ] :=
Module[{n = Length[data], c, ct, dk, dm, k, m, p, s, st, ul, vl, w, wt, xm, ym},
(* yes, I know I could have used FindFit[] for this... *)
{ct, st, k} = Flatten[MapAt[Normalize[{1, #}] &,
NArgMin[Norm[Function[{x, y}, y - \[FormalM] x - \[FormalK]] @@@ data],
{\[FormalM], \[FormalK]}], 1]];
(* find orthogonal regression coefficients *)
{c, s, p} = FindArgMin[{
Total[(data.{-\[FormalS], \[FormalC]} -
\[FormalP])^2/((errs^2).{\[FormalS]^2, \[FormalC]^2})],
\[FormalC]^2 + \[FormalS]^2 == 1},
{{\[FormalC], ct}, {\[FormalS], st}, {\[FormalP], k/ct}}];
(* slope and intercept *)
{m, k} = {s, p}/c;
wt = 1/errs^2; w = (Times @@@ wt)/(wt.{1, m^2});
{xm, ym} = w.data/Total[w];
ul = data[[All, 1]] - xm; vl = data[[All, 2]] - ym;
(* uncertainties in slope and intercept *)
dm = w.(m ul - vl)^2/w.ul^2/(n - 2);
dk = dm (w.data[[All, 1]]^2/Total[w]);
{Function[\[FormalX], Evaluate[{m, k}.{\[FormalX], 1}]], Sqrt[{dm, dk}]}] /;
Dimensions[data] === Dimensions[errs]
ortlinfit[] expects data to contain the $(x_j,y_j)$ pairs and errs to contain the corresponding uncertainties $(\mathrm{d}x_j,\mathrm{d}y_j)$. The routine returns the best-fit line as a pure function, as well as the uncertainties in the slope and intercept ($\sigma_m$ and $\sigma_k$).
As an example, here's some data used in York's paper:
data = {{0, 5.9}, {0.9, 5.4}, {1.8, 4.4}, {2.6, 4.6}, {3.3, 3.5}, {4.4, 3.7},
{5.2, 2.8}, {6.1, 2.8}, {6.5, 2.4}, {7.4, 1.5}};
errs = {{1000., 1.}, {1000., 1.8}, {500., 4.}, {800., 8.}, {200., 20.},
{80., 20.}, {60., 70.}, {20., 70.}, {1.8, 100.}, {1, 500.}} // Sqrt[1/#] &;
{lin, {sm, sk}} = ortlinfit[data, errs]
{Function[x, 5.47991 - 0.480533 x], {0.0710065, 0.361871}}
Now, let's look at the data, the associated error ellipses (constructed from the uncertainties), the best-fit line, and the "bounding lines" $(m-\sigma_m)x+(k-\sigma_k)$ and $(m+\sigma_m)x+(k+\sigma_k)$:
Show[
Graphics[{AbsolutePointSize[4], Point[data], MapThread[Circle, {data, errs}]},
Frame -> True],
Plot[{lin[x], lin[x] - sm x - sk, lin[x] + sm x + sk}, {x, -1, 9},
PlotStyle -> {Directive[Thick, Red],
Directive[Dashed, Gray], Directive[Dashed, Gray]}]
]
Summary
Use the Weights option to LinearModelFit, setting the weights to be inversely proportional to the variances of the error terms.
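A minimal sketch of that call, assuming the abscissas, ordinates, and error standard deviations sit in (hypothetical) lists x, y, and sigma:
(* weighted straight-line fit; the weights are the reciprocal variances *)
fit = LinearModelFit[Transpose[{x, y}], t, t, Weights -> 1/sigma^2];
fit["BestFitParameters"] (* {intercept, slope} *)
fit["ParameterErrors"]   (* their standard errors *)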
Theory
This is a standard problem: when the errors in the individual $y$ values are expressed in a way that can be related to their variances, then use weighted least squares with the reciprocal variances as the weights. (Search our sister site Cross Validated for more about this, including references and generalizations.)
Creating realistic data to study
To illustrate, suppose the data are given as vectors $x$ and $y$ with the "errors" expressed either as standard deviations of $y$ or as standard errors of estimate of $y$, or any other quantity that can be interpreted as a fixed, constant multiple of the standard deviations of the $y$. Specifically, the applicable model for these data is
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$
where $\beta_0$ (the intercept) and $\beta_1$ (the slope) are constants to be estimated, the $\varepsilon_i$ are independent random deviations with mean zero, and $\text{Var}(\varepsilon_i) = \sigma_i^2$ for some given quantities $\sigma_i$ assumed to be known accurately. (The case where all the $\sigma_i$ equal a common unknown constant is the usual linear regression setting.)
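With the $\sigma_i$ treated as known, the weighted least-squares estimates $\hat\beta_0,\hat\beta_1$ minimize
$$\sum_{i=1}^n \frac{\left(y_i - \beta_0 - \beta_1 x_i\right)^2}{\sigma_i^2},$$
which is why the weights supplied to LinearModelFit below are the reciprocal variances $1/\sigma_i^2$.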
In Mathematica we can simulate such data with random number generation. Let's wrap this into a function whose arguments are the amount of data and the slope and intercept. I will make the sizes of the errors vary randomly, but generally they will increase as $x$ increases.
simulate[n_Integer, intercept_: 0, slope_: 0] /; n >= 1 :=
Module[{x, y, errors, sim},
x = Range[n];
errors = RandomReal[GammaDistribution[n, #/(10 n)]] & /@ x;
y = RandomReal[NormalDistribution[intercept + slope x[[#]], errors[[#]]]] & /@ Range[n];
sim["x"] = x;
sim["y"] = y;
sim["errors"] = errors;
sim
]
Here is a tiny example of its use.
SeedRandom[17];
{x, y, errors} = simulate[16, 50, -2/3][#] & /@ {"x", "y", "errors"};
ListPlot[{y, y + errors, y - errors}, Joined -> {False, True, True},
PlotStyle -> {PointSize[0.015], Thick, Thick},
AxesOrigin -> {0, Min[y - errors]}]
The simulated points are surrounded by error bands.
Weighted least-squares estimation
To fit these data, use the Weights option of LinearModelFit. Once again, let's prepare for later analysis by encapsulating the fitting in a function. For comparison, let's fit the data both with and without the weights.
trial[n_Integer: 1, intercept_: 0, slope_: 0] :=
Module[{x, y, errors, t, fit, fit0},
{x, y, errors} = simulate[n, intercept, slope][#] & /@ {"x", "y", "errors"};
fit = LinearModelFit[y, t, t, Weights -> 1 / errors^2];
fit0 = LinearModelFit[y, t, t];
{fit[#], fit0[#]} & @ "BestFitParameters"
]
The output is a list whose elements give {intercept, slope}: the first element is for the weighted fit, the second for the unweighted.
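To make the indexing used in the next section explicit, a single call can be unpacked like this (the name one is just illustrative):
one = trial[32, 20, -1/2];
one[[1]] (* {intercept, slope} from the weighted fit *)
one[[2]] (* {intercept, slope} from the ordinary (unweighted) fit *)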
Monte-Carlo comparison of the weighted and ordinary least squares methods
Let's run a lot of independent trials (say, $1000$ of them for simulated datasets of $n=32$ points each) and compare their results:
SeedRandom[17];
simulation = ParallelTable[trial[32, 20, -1/2], {i, 1, 1000}];
ranges = {{18.5, 22}, {-0.65, -0.35}};
TableForm[
Table[Histogram[simulation[[All, i, j]], ImageSize -> 250,
PlotRange -> {ranges[[j]], Automatic}, Axes -> {True, False}],
{i, 1, 2}, {j, 1, 2}],
TableHeadings -> {{"Weighted", "OLS"}, {"Intercept", "Slope"}}
]
Because I specified an intercept of $20$ and slope of $-1/2$, we will want to use these values as references. Indeed, the histograms in the left column ("Intercept") display sets of estimated intercepts hovering around $20$ and the histograms in the right column ("Slope") display sets of slopes hovering around $-0.50 = -1/2$. This illustrates the theoretical fact that the estimates in either case are unbiased. However, looking more closely at the spreads of the histograms (read the numbers on the horizontal axes), we see that those in the upper row ("Weighted") have smaller widths than their counterparts in the lower row ("OLS," or "Ordinary Least Squares"). This is evidence that the weighted estimates tend, on the whole, to be better than the unweighted ones, because they tend to deviate less from the true parameter values.
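To quantify that visual impression, one can compare the Monte-Carlo standard deviations of the two sets of slope estimates directly, re-using the simulation table from above (a quick check; the weighted entry should come out smaller):
(* spread of the slope estimates: {weighted, OLS} *)
StandardDeviation /@ {simulation[[All, 1, 2]], simulation[[All, 2, 2]]}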
When the underlying data truly conform to the hypothesized model--there is a linear relationship between the $x$'s and $y$'s, with errors in the $y$'s having known but differing standard deviations--then among all unbiased linear estimates of the slope and intercept, weighted least squares using reciprocal variances for the weights is best in the sense just illustrated.
Obtaining the estimation error of the slope
Now, to answer the question: we would like to assess the estimation error in the slope. This can be obtained from the fit object in many ways: consult the help page for details. Here is a nicely formatted table:
fit = LinearModelFit[y, t, t, Weights -> 1 / errors^2];
fit0 = LinearModelFit[y, t, t];
TableForm[{{fit[#], fit0[#]} & @ "ParameterConfidenceIntervalTable"},
TableHeadings -> {{}, {"Weighted", "OLS"}}]
In this case, for this particular set of simulated data (as created previously), the weighted method reports a much smaller standard error for the intercept than the OLS method (because errors near $x=0$ are low, according to the information in errors), but the weighted estimate of the slope has only a slightly smaller standard error than the OLS estimate of the slope.
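If all that is needed is the number asked for in the question, the standard error of the slope can be read off the weighted fit directly, for example:
fit["ParameterErrors"][[2]] (* standard error of the slope (coefficient of t) *)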
Comments
Errors in both $x$ and $y$ can be handled, using--for instance--methods of maximum likelihood. However, this involves considerably more mathematical, statistical, and computational machinery and requires a careful assessment of the nature of those errors (such as whether the $x$ errors and $y$ errors are independent). One general result in the statistical literature is that when the $x$ errors are typically smaller than the $y$ errors, yet independent of them, it is usually safe to ignore the $x$ errors. For more about all this, good search terms include "errors-in-variables regression," "Deming regression," and even "principal components analysis (PCA)".
I made this implementation of York's classical (and easy to understand) method following this paper by Cameron Reed.
f[x_List, y_List, wxi_List, wyi_List] :=
 Module[{n = Length@x, wi, ui, vi, wmean, d, g, a, b, set, least},
  (* combined weight of point i for a trial slope m, from the x- and y-weights *)
  wi[i_, m_] := wxi[[i]] wyi[[i]]/(m^2 wyi[[i]] + wxi[[i]]);
  (* deviations from the weighted means *)
  ui[i_, m_] := x[[i]] - wmean[x, m];
  vi[i_, m_] := y[[i]] - wmean[y, m];
  wmean[q_List, m_] := Sum[wi[i, m] q[[i]], {i, n}]/Sum[wi[i, m], {i, n}];
  (* auxiliary sums that build the cubic equation for the slope *)
  d[m_] := Sum[wi[i, m]^2 ui[i, m]^2/wxi[[i]], {i, n}];
  g[m_] := -Sum[wi[i, m] ui[i, m] vi[i, m], {i, n}]/d[m];
  a[m_] := 2 Sum[wi[i, m]^2 ui[i, m] vi[i, m]/wxi[[i]], {i, n}]/(3 d[m]);
  b[m_] := (Sum[wi[i, m]^2 vi[i, m]^2/wxi[[i]], {i, n}] -
      Sum[wi[i, m] ui[i, m]^2, {i, n}])/(3 d[m]);
  (* solve the cubic for the slope m, back-substituting the intercept c *)
  set = {ToRules@Reduce[m^3 - 3 a[m] m^2 + 3 b[m] m - g[m] == 0 &&
       c == wmean[y, m] - m wmean[x, m], {m, c},
      Backsubstitution -> True]};
  (* among the real roots, keep the one with the smallest weighted sum of squares *)
  least = Sum[wxi[[i]] (x[[i]] - (y[[i]] - c)/m)^2 +
       wyi[[i]] (y[[i]] - (m x[[i]] + c))^2, {i, Length@x}] /.
    set[[Flatten@Position[m /. set, _Real]]];
  set[[Flatten@Position[m /. set, _Real]]][[Position[least, Min@least][[1]]]]];
Usage
f[Range[10], 3 Range[10] + RandomReal[.2], Array[# &, 10], Array[# &, 10]]
(*
-> {{m -> 3., c -> 0.110805}}
*)
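As a cross-check (assuming the data and errs lists from the orthogonal-regression answer above are still defined), York's weights are the reciprocal variances, so the same benchmark data can be fed to f; the slope should come out close to the $-0.4805$ found by ortlinfit:
(* wx and wy are illustrative names for the x- and y-weights *)
{wx, wy} = Transpose[1/errs^2];
f[data[[All, 1]], data[[All, 2]], wx, wy]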