Why work with squares of error in regression analysis?

From a Bayesian point of view, this is equivalent to assuming that your data are generated by a line plus Gaussian noise, and finding the maximum likelihood line under that assumption. Using absolute values instead means assuming that your noise has a pdf proportional to $e^{-|x|}$ (Laplace noise), which is substantially less natural than assuming Gaussian noise (e.g. Gaussian noise falls out of the central limit theorem).
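
To see the equivalence, here is the standard maximum likelihood calculation for a line $y_i = a x_i + b + \varepsilon_i$ with Gaussian noise $\varepsilon_i \sim N(0,\sigma^2)$ (the symbols are just for illustration):

$$\log L(a,b) \;=\; \sum_{i=1}^n \log\!\left(\frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(y_i - a x_i - b)^2}{2\sigma^2}}\right) \;=\; \text{const} \;-\; \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - a x_i - b)^2,$$

so maximizing the likelihood over $a$ and $b$ is exactly minimizing the sum of squared errors. With $e^{-|x|}$ (Laplace) noise the same calculation gives the sum of absolute errors instead.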

Using squared errors also makes the regression extremely easy to compute, since the minimizer has a closed-form solution; this is probably a major practical factor. Most other functions of the error would lead to something much more annoying to compute.
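
For instance, here is a minimal sketch (with made-up data) of that closed-form fit via the normal equations; nothing below is specific to any particular statistics package, it is just a few lines of NumPy:

```python
import numpy as np

# Tiny made-up dataset, just for illustration.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

# Design matrix with a column for the slope and one for the intercept.
X = np.column_stack([x, np.ones_like(x)])

# Closed-form least-squares solution: solve the normal equations (X^T X) beta = X^T y.
slope, intercept = np.linalg.solve(X.T @ X, X.T @ y)
print(slope, intercept)
```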


You square the error terms because of the Pythagorean theorem x^2 + y^2 = z^2.

Consider just the 2-dimensional case.

The x and y correspond to error terms in each orthogonal dimension. But that hypotenuse z is the distance you really want to minimize.

Now, minimizing the sum of the squares of x and y also minimizes the square root of that sum, so there is no need to take the final square root.
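
In symbols: because the square root is an increasing function, whichever choice of line makes $x^2+y^2$ smallest also makes $\sqrt{x^2+y^2}$ smallest,

$$\operatorname*{arg\,min}\,(x^2 + y^2) \;=\; \operatorname*{arg\,min}\,\sqrt{x^2 + y^2}.$$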

With a little thought you will see that this works as you add more x,y error terms to the mix. Minimizing

x1^2 + y1^2 + ... + xN^2 + yN^2

has the effect of also minimizing the overall sum of the distances (all those little hypotenuses)

sqrt(x1^2 + y1^2) + ... + sqrt(xN^2 + yN^2) = z1 + ... + zN

but is much simpler to calculate.

Make sense?

Ok, so what would happen if you took absolute values and minimized

|x1| + |y1| + ... + |xN| + |yN| ?

Instead of minimizing the sum of the distances, you would bias the resulting fit toward a slope of 1 or -1 and away from slopes near 0 or infinity. Of course you can do that, but your fit will be pulled toward a line with slope plus or minus 1 and away from the solution that minimizes those Pythagorean distances.


Basically, you can ask the same question in the much simpler setting of finding the "best" average of values $x_1,\ldots,x_n$, where by "average" I mean, in a general sense, a single value chosen to represent them, such as the (arithmetic) mean, geometric mean, median, or $l_p$-mean (not sure if that's the right name).

For data that actually come from a normal distribution, the mean will be the most efficient estimator of the true mean. However, if the distribution is long-tailed (or contains extreme values), the median will be more robust.

You can also use the $l_p$ norm and find the $l_p$-mean, i.e. the value $u$ that minimises $\sum_i |x_i-u|^p$, for any $p\ge1$. (For $p<1$ the minimiser need no longer be unique.) For $p=2$ we recover the traditional squared distance, while for $p=1$ we get (almost) the median. I once found $p=1.5$ to behave well in terms of both efficiency and robustness.
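
As a rough sketch of what this looks like numerically (the `lp_mean` helper and the data below are made up for illustration), you can compute the $l_p$-mean with a one-dimensional optimiser and compare it with the mean and median:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def lp_mean(x, p):
    """Value u minimising sum_i |x_i - u|^p (illustrative helper, not a standard API)."""
    x = np.asarray(x, dtype=float)
    res = minimize_scalar(lambda u: np.sum(np.abs(x - u) ** p),
                          bounds=(x.min(), x.max()), method="bounded")
    return res.x

# Made-up data with one extreme value.
data = [1.0, 1.2, 0.9, 1.1, 1.0, 8.0]

print(np.mean(data))       # l_2-mean: pulled toward the outlier
print(np.median(data))     # (almost) the l_1-mean: robust to the outlier
print(lp_mean(data, 1.5))  # a compromise between the two
```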

So, switching from least square regression ($l_2$-norm) to using absolute distance ($l_1$-norm) corresponds to switching from mean to median. Which is better depends on the data, and also on the context of the analysis: what you are actually looking for.
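
Here is a small illustration of that correspondence, assuming a made-up dataset with one large outlier: the $l_1$ (least absolute deviations) fit behaves like a "median line" and is far less affected by the outlier than the least-squares fit.

```python
import numpy as np
from scipy.optimize import minimize

# Toy data: a line with Gaussian noise plus one large outlier (all made up for illustration).
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)
y[-1] += 30.0

X = np.column_stack([x, np.ones_like(x)])

# l_2 fit: ordinary least squares (has a closed form).
beta_l2, *_ = np.linalg.lstsq(X, y, rcond=None)

# l_1 fit: least absolute deviations, found numerically (no closed form).
beta_l1 = minimize(lambda b: np.sum(np.abs(y - X @ b)),
                   x0=beta_l2, method="Nelder-Mead").x

print("least squares  (l2):", beta_l2)  # slope/intercept pulled by the outlier
print("least absolute (l1):", beta_l1)  # stays much closer to the true line
```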

The mean does have the advantage that it is an unbiased estimator of the true mean no matter what the underlying distribution is, but usually accuracy is more important than unbiasedness.

Tags:

Regression