What is the difference between numpy var() and statistics variance() in python?
Use ddof=1 so that np.var() matches statistics.variance():

import numpy as np

print(np.var([1, 2, 3, 4], ddof=1))
# 1.66666666667
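For a side-by-side comparison, here is a short sketch (using only numpy and the standard-library statistics module) showing that np.var() defaults to the divisor N, while statistics.variance() uses N - 1:

import statistics

import numpy as np

data = [1, 2, 3, 4]

print(np.var(data))                 # 1.25       -> divisor N (ddof=0), population variance
print(statistics.pvariance(data))   # 1.25       -> population variance as well
print(np.var(data, ddof=1))         # 1.6666...  -> divisor N - 1, sample variance
print(statistics.variance(data))    # 1.6666...  -> sample variance as well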
ddof stands for Delta Degrees of Freedom: the divisor used in the calculation is N - ddof, where N represents the number of elements. By default, ddof is zero.

The mean is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead.

In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of a hypothetical infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables.
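As an illustration (my own sketch, not from the numpy docs), computing the same numbers by hand shows exactly where the N - ddof divisor enters:

# Manual computation for data = [1, 2, 3, 4] to show the role of ddof.
data = [1, 2, 3, 4]
N = len(data)
mean = sum(data) / N                                       # 2.5
squared_deviations = sum((x - mean) ** 2 for x in data)    # 5.0

print(squared_deviations / N)        # 1.25       -> ddof=0 (divisor N)
print(squared_deviations / (N - 1))  # 1.6666...  -> ddof=1 (divisor N - 1)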
Statistical libraries like numpy divide by n by default for what they call var (and likewise for std), whereas the standard-library statistics.variance() and statistics.stdev() divide by n - 1. That default is why the two functions disagree unless you pass ddof=1.
For more information, refer to the numpy documentation: numpy doc
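A quick check of the corresponding standard-deviation functions (my own sketch) shows the same pattern:

import statistics

import numpy as np

data = [1, 2, 3, 4]

print(np.std(data))             # 1.1180...  = sqrt(1.25), divisor N
print(statistics.pstdev(data))  # 1.1180...  population standard deviation
print(np.std(data, ddof=1))     # 1.2909...  = sqrt(5/3), divisor N - 1
print(statistics.stdev(data))   # 1.2909...  sample standard deviation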
It is correct that dividing by N-1 gives an unbiased estimate of the variance, which can give the impression that dividing by N-1 is therefore slightly more accurate, albeit a little more complex. What is too often not stated is that dividing by N gives the maximum likelihood estimate, which has a lower mean squared error and is therefore likely to be closer to the true variance than the unbiased estimate, as well as being somewhat simpler.
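A small Monte Carlo sketch (my own illustration, with an arbitrary sample size and trial count) makes the trade-off concrete: the N - 1 estimator is unbiased on average, while the N estimator typically has a smaller mean squared error:

import numpy as np

rng = np.random.default_rng(0)
true_var = 1.0          # variance of a standard normal population
n = 5                   # small sample size, where the difference is most visible
trials = 100_000

samples = rng.standard_normal((trials, n))
var_n = samples.var(axis=1, ddof=0)    # divisor N
var_n1 = samples.var(axis=1, ddof=1)   # divisor N - 1

print("mean of ddof=0 estimates:", var_n.mean())    # biased low, about 0.8
print("mean of ddof=1 estimates:", var_n1.mean())   # close to 1.0, unbiased
print("MSE of ddof=0:", ((var_n - true_var) ** 2).mean())   # smaller
print("MSE of ddof=1:", ((var_n1 - true_var) ** 2).mean())  # larger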