How can I calculate the variance of a list in Python?

There are two ways to define the variance: the variance n, which you use when you have the full set of values, and the variance n-1, which you use when you only have a sample.

The difference between the two is whether the value m = sum(xi) / n is the true average or only an approximation of what the average should be.

Example 1: you want to know the average height of the students in a class and its variance. Here the value m = sum(xi) / n is the true average, and the formulas given by Cleb are correct (variance n).

Example 2: you want to know the average time at which a bus passes a bus stop and its variance. You note the time for a month and get 30 values. Here the value m = sum(xi) / n is only an approximation of the true average, and that approximation becomes more accurate with more values. In that case the best approximation of the actual variance is the variance n-1:

varRes = sum((xi - m) ** 2 for xi in results) / (len(results) - 1)
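To see why the n-1 divisor is the better estimate in the sample case, here is a small sketch (not from the original answers) that enumerates every possible sample of size 3 drawn with replacement from a tiny made-up population and averages both estimators. The population values and the helper name var_with_divisor are assumptions chosen only for illustration.

```python
from itertools import product


def var_with_divisor(values, ddof):
    """Sum of squared deviations from the mean, divided by len(values) - ddof."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / (len(values) - ddof)


# Hypothetical population: we hold the full set, so variance n is the true variance.
population = [1, 2, 3, 4, 5, 6]
true_var = var_with_divisor(population, ddof=0)

# Every ordered sample of size 3, drawn with replacement (6**3 = 216 samples).
samples = list(product(population, repeat=3))

# Average each estimator over all possible samples.
avg_var_n = sum(var_with_divisor(s, ddof=0) for s in samples) / len(samples)
avg_var_n1 = sum(var_with_divisor(s, ddof=1) for s in samples) / len(samples)

print("true variance:          ", true_var)
print("average of variance n:  ", avg_var_n)
print("average of variance n-1:", avg_var_n1)
```

Averaged over all samples, the n-1 version recovers the true variance (up to floating-point rounding), while the n version underestimates it by a factor (n-1)/n, here 2/3.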

OK, this has nothing to do with Python as such, but it does have an impact on the statistical analysis, and the question is tagged statistics and variance.

Note: check which variant your library computes by default. numpy uses the variance n for both var and std unless you pass ddof=1, whereas statistics.variance in the standard library and the var and std methods in pandas use the variance n-1 by default.

You can use numpy's built-in function var:

import numpy as np

results = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
          0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]

print(np.var(results))

This gives you 28.822364260579157, which is the variance n.

If - for whatever reason - you cannot use numpy and/or you don't want to use a built-in function for it, you can also calculate it "by hand" using e.g. a list comprehension:

# calculate mean
m = sum(results) / len(results)

# calculate variance using a list comprehension
var_res = sum((xi - m) ** 2 for xi in results) / len(results)

which gives you the identical result.

If you are interested in the standard deviation, you can use numpy.std:

print(np.std(results))
5.36864640860051

@Serge Ballesta explained very well the difference between variance n and n-1. In numpy you can easily set this parameter using the option ddof; its default is 0, so for the n-1 case you can simply do:

np.var(results, ddof=1)

The "by hand" solution is given in @Serge Ballesta's answer.

Both approaches yield 32.024849178421285.

You can set the parameter also for std:

np.std(results, ddof=1)
5.659050201086865
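For reference, here is a minimal pure-Python sketch of what the ddof option does, assuming the usual definition (the sum of squared deviations is divided by n - ddof). The function names variance and std are made up for this illustration; this is not numpy's implementation.

```python
import math


def variance(values, ddof=0):
    """Variance with divisor n - ddof (ddof=0: variance n, ddof=1: variance n-1)."""
    n = len(values)
    if n - ddof <= 0:
        raise ValueError("need more than ddof values")
    m = sum(values) / n
    return sum((x - m) ** 2 for x in values) / (n - ddof)


def std(values, ddof=0):
    """Standard deviation: square root of the variance with the same divisor."""
    return math.sqrt(variance(values, ddof))


results = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
           0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]

print(variance(results), std(results))                  # variance n and its std
print(variance(results, ddof=1), std(results, ddof=1))  # variance n-1 and its std
```

The printed values should agree with the numpy results above, possibly differing in the last digit because of floating-point rounding.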

Numpy is indeed the most elegant and fast way to do it.

I think the actual question was about how to access the individual elements of a list to do such a calculation yourself, so here is an example:

results = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
           0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]

import numpy as np
print('numpy variance: ', np.var(results))


# without numpy, by hand

# there are two ways of calculating the variance
#   - 1. directly, as the 2nd-order central moment (https://en.wikipedia.org/wiki/Moment_(mathematics)): sum of squared deviations from the mean, divided by the length of the vector
#   - 2. "mean of square minus square of mean" (see https://en.wikipedia.org/wiki/Variance)

# calculate mean
n = len(results)
total = 0  # named "total" so the built-in sum() is not shadowed
for i in range(n):
    total = total + results[i]


mean = total / n
print('mean: ', mean)

# calculate the central moment
sum2 = 0
for i in range(n):
    sum2 = sum2 + (results[i] - mean) ** 2

myvar1 = sum2 / n
print("my variance1: ", myvar1)

# calculate the mean of square minus square of mean
sum3 = 0
for i in range(n):
    sum3 = sum3 + results[i] ** 2

myvar2 = sum3 / n - mean ** 2
print("my variance2: ", myvar2)

gives you (values shown to 12 significant digits):

numpy variance:  28.8223642606
mean:  -3.731499805
my variance1:  28.8223642606
my variance2:  28.8223642606
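For values of moderate size, as in this list, both formulas agree. A known caveat of the "mean of square minus square of mean" form is that it can lose precision in floating point when the mean is large compared to the spread. The sketch below illustrates this with made-up data (a small spread on top of an offset of 1e9); the function names are hypothetical.

```python
def var_central_moment(values):
    """Variance n as the average squared deviation from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def var_mean_of_squares(values):
    """Variance n as mean of squares minus square of mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(x ** 2 for x in values) / n - mean ** 2


# Hypothetical data: the offsets 4, 7, 13, 16 have variance n = 22.5,
# and adding the same constant to every value does not change the variance.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]

print("central moment:                       ", var_central_moment(data))
print("mean of square minus square of mean:  ", var_mean_of_squares(data))
```

Here the central-moment version returns 22.5, while the second version is visibly off, because the squares are around 1e18 and a double cannot hold them precisely enough for the subtraction to recover a difference of that size.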

Starting with Python 3.4, the standard library comes with the variance function (sample variance or variance n-1) as part of the statistics module:

from statistics import variance
data = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439, 0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]
variance(data)
# 32.024849178421285

The population variance (or variance n) can be obtained using the pvariance function:

from statistics import pvariance
data = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439, 0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]
pvariance(data)
# 28.822364260579157

Also note that if you already know the mean of your list, the variance and pvariance functions take an optional second argument (xbar and mu respectively) so that the mean, which is needed for the variance computation, does not have to be recomputed.

Tags:

Python

List

Statistics

Variance
