Why can't you just average ADC samples to get more resolution from an ADC?
If you ask someone to measure a 45.2cm board accurate to the nearest centimeter, they would (or should) answer 45. If you ask them to measure it again, they would answer 45 again. Repeat the exercise 8 more times and every reading will still be 45, so no matter how many times one samples the input, the average of all those readings will, of course, be 45 (even though the board is 45.2cm long).
If you had the person adjust the measuring apparatus so as to read 0.45cm long before the first measurement, 0.35cm long before the second, and so on down to 0.05cm long before the fifth, then 0.05cm short before the sixth, and so on up to 0.45cm short before the tenth, then two of the measurements would read 46 and the other eight would read 45. The average of all of them would be 45.2.
In practice, managing to bias things so precisely is difficult. If one randomly adjusts the measurement apparatus before each measurement to read somewhere between 0.5cm long and 0.5cm short, then about 1/5 of the measurements would read 46 and the rest 45, but because the adjustments are random the actual fraction might be higher or lower. Taking ten measurements would not add quite a full significant figure's worth of precision, but averaging about 100 would.
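Here is a minimal sketch of that idea in C (the constants and loop counts are mine, purely illustrative): each reading of the 45.2cm board gets a random offset between 0.5cm short and 0.5cm long, is rounded to a whole centimeter, and the whole-centimeter readings are averaged.

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(void)
{
    const double board_cm = 45.2;                 /* true length */
    const int counts[] = {10, 100, 1000, 10000};

    for (size_t k = 0; k < sizeof counts / sizeof counts[0]; k++) {
        long sum = 0;
        for (int i = 0; i < counts[k]; i++) {
            /* random offset in [-0.5, +0.5) cm, i.e. the "dither" */
            double dither = (double)rand() / ((double)RAND_MAX + 1.0) - 0.5;
            sum += lround(board_cm + dither);     /* whole-centimeter reading */
        }
        printf("%6d readings -> average %.3f cm\n",
               counts[k], (double)sum / counts[k]);
    }
    return 0;
}
```

With only 10 readings the average is still noisy; by around 100 it is typically within a few hundredths of 45.2, which matches the back-of-the-envelope estimate above.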
I'm not sure I quite understand the paper's rationale for the distinction between averaging and right shifting. One needs to be mindful that the apparent precision achieved by averaging may exceed the meaningful level of precision, but from my experience the question of when and how much to right-shift should be driven by the limits of the processor's numerical range. Working with numbers that are scaled up as much as they can be without causing overflow will generally minimize the effects of rounding errors, provided that one doesn't attach undue significance to small amounts of noise.
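To tie the averaging and right-shifting together concretely: the usual oversample-and-decimate recipe is to accumulate 4^n samples and shift right by n, which gains n bits of scale provided there is at least an LSB or so of noise or dither on the input. A minimal C sketch, assuming a hypothetical read_adc10() driver function (not from the note):

```c
#include <stdint.h>

/* Placeholder for whatever your ADC driver provides: returns one raw
 * 10-bit sample (0..1023). */
extern uint16_t read_adc10(void);

/* Oversample and decimate: accumulate 4^extra_bits raw samples, then shift
 * right by extra_bits. The result is scaled to (10 + extra_bits) bits, e.g.
 * extra_bits = 2 averages 16 samples into one 12-bit value. Keep extra_bits
 * at 6 or below so the result still fits in the 16-bit return type. */
uint16_t read_adc_oversampled(uint8_t extra_bits)
{
    uint32_t n   = 1UL << (2U * extra_bits);   /* 4^extra_bits samples */
    uint32_t acc = 0;                          /* wide accumulator, no overflow */

    for (uint32_t i = 0; i < n; i++)
        acc += read_adc10();

    return (uint16_t)(acc >> extra_bits);      /* keep the gained bits */
}
```

The accumulator stays wide and the shift is no larger than necessary, which is exactly the "keep the numbers scaled up as far as the numeric range allows" point above.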
Incidentally, in the original usage, to "decimate" an army was to kill 1/10 of the soldiers therein. To decimate the data from an ADC is to discard part of it. The common prefix with the phrase "decimal point" does not imply an association.
The short answer is noise, and it's not necessarily the amount of noise that matters but the type of noise. The other problem is nonlinear effects like INL (integral nonlinearity) that throw off the average value.
First, on to noise:
If we were to sample a Gaussian distribution it would look something like this:
The red line is closer to the actual thermal distribution (averaged over time), and the blue histogram represents many ADC samples. If we were to continuously sample this distribution we would get better statistics, and we would be able to find the average value, or mean, with better accuracy, which is usually what we're after. (Yes, I realize signals move around, and there is filtering and signal-to-noise to consider depending on the frequency content, but let's just consider the DC case where the signal is not moving for now.)
$$ \mu = \frac{1}{n} \sum_{i=1}^{n}{x_i}$$
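As a standard statistical aside (not something the note itself states): for $n$ independent samples with standard deviation $\sigma$, the uncertainty of that estimated mean shrinks as

$$ \sigma_{\mu} = \frac{\sigma}{\sqrt{n}} $$

so averaging roughly 100 samples cuts the random error by a factor of 10, i.e. about one extra decimal digit.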
The problem is flicker noise, or 1/f noise: it shifts the Gaussian mean around and causes the statistics to break down, because the distribution is no longer Gaussian.
This is a poor model, but you could consider it as looking something like the expression below. INL is also a problem because it can introduce a few bits of error, which also throws off the mean.
$$ \mu = \frac{1}{n} \sum_{i=1}^{n}{x_i} + \text{error}$$
That is probably confusing, so let's look at the time domain, as shown below.
In the top image you can see a signal with Gaussian noise; it would be easy to "draw a line" through the middle and find the mean. The more samples you have from a signal like this, the better your accuracy and knowledge of the mean.
In the lower image you can see what flicker noise looks like; averaging is not going to help here.
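A crude way to see this numerically (my own sketch; it uses a slow random-walk offset as a stand-in for low-frequency noise, since real flicker noise has a 1/f spectrum and is harder to generate, but the qualitative point is the same):

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define TWO_PI 6.28318530717958647692

/* Box-Muller: one standard-normal sample from two uniform samples. */
static double gauss(void)
{
    double u1 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);  /* avoid log(0) */
    double u2 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    return sqrt(-2.0 * log(u1)) * cos(TWO_PI * u2);
}

int main(void)
{
    const double true_value = 1.0;   /* the DC signal we want to measure */
    const double sigma      = 0.1;   /* white (thermal-like) noise        */

    for (long n = 10; n <= 100000; n *= 10) {
        double sum_white = 0.0, sum_drift = 0.0, drift = 0.0;

        for (long i = 0; i < n; i++) {
            double w = sigma * gauss();
            drift += 0.001 * gauss();          /* slowly wandering offset */
            sum_white += true_value + w;
            sum_drift += true_value + w + drift;
        }
        printf("n=%7ld  white-only mean error=%+.5f  with drift=%+.5f\n",
               n, sum_white / n - true_value, sum_drift / n - true_value);
    }
    return 0;
}
```

The white-noise-only error keeps shrinking roughly as 1/sqrt(n), while the version with the wandering offset stops improving (and usually gets worse) no matter how many samples you average.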
The problem is that most electronics have flicker noise. Resistors do not (assuming there is no influence from room temperature), but transistors and ICs do. There are amplifiers, called chopper amplifiers, that overcome these effects.
Another thing to know is that there are ADCs (Linear Technology has a new SAR core) where the engineers have worked to reduce the effects of 1/f noise (and other nonlinear effects of ADCs, like INL) to a level much lower than the ADC's bit value. You can employ heavy oversampling and get 32-bit values out of a 14-bit core.
Source: EDN- 1/f Noise—the flickering candle
I took a look at the note and that is indeed a weird claim (or a confusing way of saying what they actually mean).
Perhaps what they actually mean is that if you want to get more resolution, you can't divide/shift the accumulated number back down to the same scale as a single sample, because (in integer arithmetic) that would throw out the bits you gained.
If your ADC samples are noisy, then of course you can divide to get a less noisy value at the original scale.
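A tiny worked example of that bookkeeping (the numbers here are made up, just to show the integer arithmetic): summing 16 samples from a 10-bit ADC yields a 14-bit sum; shifting right by 4 puts you back on the 10-bit scale and discards what you gained, while shifting right by only 2 keeps a 12-bit result.

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Pretend we summed 16 raw 10-bit samples of a signal sitting at about
     * 462.7 counts (made-up value, chosen only to show the arithmetic). */
    uint32_t sum = 7403;                   /* 16 x ~462.7 counts           */

    uint32_t back_to_10_bit = sum >> 4;    /* 462: extra bits thrown away  */
    uint32_t keep_12_bit    = sum >> 2;    /* 1850: two extra bits kept    */

    printf("10-bit scale: %u\n", (unsigned)back_to_10_bit);
    printf("12-bit scale: %u (= %.2f in 10-bit units)\n",
           (unsigned)keep_12_bit, keep_12_bit / 4.0);
    return 0;
}
```

The 12-bit value resolves quarter-counts of the original scale; dividing it back down to 10 bits would just throw those quarters away again.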
The other thing I thought of from your question is that to do oversampling right you need an effective low-pass filter, and a straightforward moving average is not as good a low-pass filter as a properly designed FIR (or IIR) filter, but that doesn't seem to be supported by the text of the note.
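For what it's worth, the moving average's weakness as a low-pass filter is easy to quantify: its magnitude response follows the standard sin(N·x)/(N·sin(x)) shape, with a first sidelobe only about 13 dB down regardless of length. A small sketch (mine, not from the note) that evaluates it for a 16-tap average:

```c
#include <stdio.h>
#include <math.h>

#define PI 3.14159265358979323846

/* Magnitude response of an N-tap moving average at normalized frequency f
 * (cycles per sample): |sin(pi*N*f) / (N*sin(pi*f))|. */
static double boxcar_mag(int N, double f)
{
    if (f == 0.0)
        return 1.0;
    return fabs(sin(PI * N * f) / (N * sin(PI * f)));
}

int main(void)
{
    const int N = 16;
    /* A few spot frequencies: gentle rolloff in the passband, a null at
     * f = 1/N, and a first sidelobe near f = 3/(2N) at only about -13 dB. */
    const double freqs[] = {1.0 / 64, 1.0 / 32, 1.0 / 16, 3.0 / 32, 0.2};

    for (size_t i = 0; i < sizeof freqs / sizeof freqs[0]; i++) {
        double m = boxcar_mag(N, freqs[i]);
        printf("f = %.4f  |H| = %.4f  (%7.1f dB)\n",
               freqs[i], m, 20.0 * log10(m + 1e-12));
    }
    return 0;
}
```

A purpose-designed FIR of similar length can typically hold the whole stopband several tens of dB down, which is the sense in which a plain moving average is a fairly weak decimation filter.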