How do I find the average in a LARGE set of numbers?

Integers or floats?

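If they're integers, accumulate a frequency distribution: read the numbers and record how many times each value occurs. The mean is then the sum of (value × count) divided by the total count, and the table stays small as long as the number of distinct values is modest.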
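A minimal sketch of the frequency-table idea, written in Python for brevity (the function name `mean_from_counts` is made up for illustration). It assumes the number of distinct values is small enough for the table to fit in memory:

```python
from collections import Counter
from typing import Iterable


def mean_from_counts(values: Iterable[int]) -> float:
    """Stream integers once, keep only a value -> count table, then average."""
    counts = Counter()
    for v in values:  # values may be a generator reading from disk
        counts[v] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no data")
    # Python ints are arbitrary precision, so this weighted sum cannot overflow.
    weighted_sum = sum(value * n for value, n in counts.items())
    return weighted_sum / total


print(mean_from_counts(x % 10 for x in range(100_000)))  # → 4.5
```

In C# the same shape would be a Dictionary<long, long>, with the caveat that the weighted sum needs an accumulator wide enough not to overflow.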

For floating point, this is harder, because nearly every value is distinct. Given the overall range of the floats and their actual distribution, you have to work out a bin size that preserves the accuracy you want without storing every number.

Edit

First, you need to sample your data to get a mean and a standard deviation. A few thousand points should be good enough.

Then, you need to determine a respectable range. Folks pick things like ±6σ (standard deviations) around the mean. You'll divide this range into as many buckets as you can stand.

In effect, the number of buckets determines the number of significant digits in your average. So, pick 10,000 or 100,000 buckets to get 4 or 5 digits of precision. If the data are measurements, odds are good that they only carry two or three significant digits anyway.
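The sample-then-bucket steps above could look roughly like this in Python. The function name `binned_mean`, clamping outliers into the end buckets, and representing each bucket by its midpoint are all assumptions of this sketch, not something the answer prescribes:

```python
import random
import statistics


def binned_mean(values, sample, n_buckets=10_000, k_sigma=6.0):
    """Approximate the mean of a float stream with a fixed-size histogram.

    values: any iterable over the full data set (read once).
    sample: a few thousand values already drawn from the data.
    """
    mu = statistics.fmean(sample)
    sigma = statistics.stdev(sample)
    lo = mu - k_sigma * sigma
    width = (2 * k_sigma * sigma) / n_buckets
    if width == 0:  # the sample had no spread; nothing to bin
        return mu
    counts = [0] * n_buckets
    for x in values:
        # Clamp anything outside the range into the first or last bucket.
        i = min(max(int((x - lo) / width), 0), n_buckets - 1)
        counts[i] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no data")
    # Each bucket is represented by its midpoint.
    return sum((lo + (i + 0.5) * width) * c for i, c in enumerate(counts)) / total


random.seed(0)
data = [random.gauss(50.0, 5.0) for _ in range(200_000)]
approx = binned_mean(data, random.sample(data, 2_000))
print(approx, statistics.fmean(data))
```

Memory use is fixed by n_buckets, not by how many numbers you read.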

Edit

What you'll discover is that the mean of your initial sample is very close to the mean of any other sample. And any sample mean is close to the population mean. You'll note that most (but not all) of your means are within 1 standard deviation of each other.

You should find that your measurement errors and inaccuracies are larger than your standard deviation.

This means that a sample mean is as useful as a population mean.

You can sample randomly from your set ("population") to get an average ("mean"). The accuracy will be determined by how much your samples vary (as determined by "standard deviation" or variance).

The advantage is that you have billions of observations, and you only have to sample a fraction of them to get decent accuracy or the "confidence range" of your choice. If the conditions are right, this cuts down the amount of work you will be doing.
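As a rough guide to how small that fraction can be, here is a sketch in Python using the usual normal-approximation rule n = (z·σ / margin)². The function name is hypothetical, and sigma is assumed to come from a pilot sample:

```python
import math


def required_sample_size(sigma: float, margin: float, z: float = 1.96) -> int:
    """Samples needed so the mean's confidence interval is about +/- margin.

    z = 1.96 corresponds to roughly 95% confidence under a normal approximation.
    """
    if sigma < 0 or margin <= 0:
        raise ValueError("sigma must be >= 0 and margin must be > 0")
    return math.ceil((z * sigma / margin) ** 2)


# With a standard deviation of 5 and z = 2, a margin of +/- 0.25 needs:
print(required_sample_size(sigma=5.0, margin=0.25, z=2.0))  # → 1600
```

Note that the result does not depend on how many billions of observations you have, only on their spread and the margin you want.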

A numerical library for C# that includes a random sequence generator helps here. Just make a random sequence of numbers that reference indices in your array of elements (from 0 to x − 1, where x is the number of elements in your array). Dereference to get the values, and then calculate your mean and standard deviation.
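A minimal sketch of that index-sampling step, in Python rather than C# to keep it short. The name `sample_mean` is made up, and sampling with replacement is an assumption of the sketch:

```python
import math
import random
import statistics


def sample_mean(data, n, seed=None):
    """Estimate the mean of `data` from n randomly chosen indices.

    Returns (mean, standard deviation, standard error of the mean).
    """
    if n < 2:
        raise ValueError("need at least two samples for a standard deviation")
    rng = random.Random(seed)
    # Indices run from 0 to len(data) - 1; sampling is with replacement.
    picks = [data[rng.randrange(len(data))] for _ in range(n)]
    mean = statistics.fmean(picks)
    sd = statistics.stdev(picks)
    return mean, sd, sd / math.sqrt(n)


data = list(range(1_000_000))  # true mean is 499999.5
mean, sd, se = sample_mean(data, 10_000, seed=1)
print(f"mean ~ {mean:.1f}, sd ~ {sd:.1f}, standard error ~ {se:.1f}")
```

The standard error (sd / sqrt(n)) gives a feel for how far the sample mean is likely to be from the true one.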

If you want to test the distribution of your data, consider using the chi-squared goodness-of-fit test or the Kolmogorov-Smirnov (K-S) test, which you'll find in many spreadsheet and statistical packages (e.g., R). That will help confirm whether this approach is usable or not.
