Implementing a Kolmogorov-Smirnov test in Python with SciPy

An update on unutbu's answer:

For distributions that depend only on location and scale and have no shape parameter, the distributions of several goodness-of-fit test statistics, computed with estimated parameters, are independent of the location and scale values. The resulting distribution is non-standard, but it can be tabulated and used with any location and scale of the underlying distribution.
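A minimal simulation sketch of that invariance, using only numpy and scipy. The helper names `ks_stat_estimated` and `simulate_null`, the sample size, and the seeds are illustrative assumptions, not part of any library:

```python
import numpy as np
from scipy import stats

def ks_stat_estimated(sample):
    """KS distance to a normal whose mean and std are estimated from the sample."""
    z = (sample - sample.mean()) / sample.std(ddof=1)
    return stats.kstest(z, 'norm')[0]   # first element is the statistic D

def simulate_null(mu, sigma, n=50, reps=1000, seed=0):
    """Simulate the statistic for normal samples drawn with the given mu and sigma."""
    rng = np.random.default_rng(seed)
    return np.array([ks_stat_estimated(rng.normal(mu, sigma, n))
                     for _ in range(reps)])

# Two different (mu, sigma) pairs, simulated with independent seeds.
q_a = np.quantile(simulate_null(0.0, 1.0, seed=1), [0.5, 0.9, 0.95])
q_b = np.quantile(simulate_null(0.07, 0.89, seed=2), [0.5, 0.9, 0.95])
print(q_a)
print(q_b)
```

The two sets of quantiles should agree up to Monte Carlo noise, which is why a single table of critical values can serve any location and scale.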

The Kolmogorov-Smirnov test for the normal distribution with estimated location and scale is also called the Lilliefors test.

It is now available in statsmodels, with approximate p-values for the relevant decision range.

>>> import numpy as np
>>> mu,sigma = 0.07, 0.89
>>> x = np.random.normal(mu, sigma, 10000)
>>> import statsmodels.api as sm
>>> sm.stats.lilliefors(x)
(0.0055267411213540951, 0.66190841161592895)

Most Monte Carlo studies show that the Anderson-Darling test is more powerful than the Kolmogorov-Smirnov test. It is available in scipy.stats with critical values, and in statsmodels with approximate p-values:

>>> sm.stats.normal_ad(x)
(0.23016468240712129, 0.80657628536145665)

Neither test rejects the null hypothesis that the sample is normally distributed, whereas the kstest in the question rejects the null hypothesis that the sample follows the standard normal distribution.
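For the scipy.stats variant of the Anderson-Darling test mentioned above, a short sketch. It assumes the long-standing interface in which scipy.stats.anderson returns a statistic together with critical values and their significance levels (in percent) rather than a p-value; the sample and seed are made up for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(0.07, 0.89, 10000)   # same mu and sigma as in the question

# Location and scale are estimated internally, so no normalization is needed.
result = stats.anderson(x, dist='norm')
print('A^2 =', result.statistic)

# Compare the statistic with the critical value at each tabulated level.
for crit, level in zip(result.critical_values, result.significance_level):
    verdict = 'reject' if result.statistic > crit else 'do not reject'
    print('{}% level: critical value {:.3f} -> {}'.format(level, crit, verdict))
```

The decision is read off by comparing the statistic with the critical value at the chosen level.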


You may also want to consider using the Shapiro-Wilk test, which "tests the null hypothesis that the data was drawn from a normal distribution." It's also implemented in scipy:

http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.shapiro.html

You'll need to pass your data directly into the function.

import scipy.stats  # import the submodule explicitly; a bare "import scipy" may not load it

W, p = scipy.stats.shapiro(dataset)
print("Shapiro-Wilk test statistic, W:", W, "\n", "p-value:", p)

Which returns something like:

 Shapiro-Wilk test statistic, W: 0.7761164903640747 
 p-value: 6.317247641091492e-37

With p << 0.01 (or 0.05, if you prefer; it doesn't matter), we have good reason to reject the null hypothesis that these data were drawn from a normal distribution.


Your data was generated with mu=0.07 and sigma=0.89. You are testing this data against a normal distribution with mean 0 and standard deviation of 1.

The null hypothesis (H0) is that the distribution from which your data was sampled is the standard normal distribution, with mean 0 and standard deviation 1.

The p-value is the probability, assuming H0 is true, of observing a test statistic at least as large as D.

In other words, with a p-value of ~8.9e-22, data like yours would be extremely unlikely if H0 were true, so H0 is rejected.

That is reasonable, since the means and standard deviations don't match.

Compare your result with:

In [22]: import numpy as np
In [23]: import scipy.stats as stats
In [24]: stats.kstest(np.random.normal(0,1,10000),'norm')
Out[24]: (0.007038739782416259, 0.70477679457831155)
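The statistic D is the largest vertical gap between the empirical CDF of the sample and the reference CDF. A rough sketch that computes it by hand and compares it with stats.kstest, assuming a sample drawn with the question's mu and sigma (the seed and variable names are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.sort(rng.normal(0.07, 0.89, 10000))
n = x.size

# Reference CDF (standard normal) evaluated at the sorted sample points.
cdf = stats.norm.cdf(x)

# The empirical CDF jumps from (i-1)/n to i/n at the i-th sorted point,
# so the gap has to be checked on both sides of each jump.
d_plus = np.max(np.arange(1, n + 1) / n - cdf)
d_minus = np.max(cdf - np.arange(0, n) / n)
d_manual = max(d_plus, d_minus)

d_scipy, p_scipy = stats.kstest(x, 'norm')
print(d_manual, d_scipy, p_scipy)
```

The hand-computed value should match the one from kstest, and the p-value should be tiny because the sample is not standard normal.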

To test whether your data is Gaussian, you could shift and rescale it so that it is normal with mean 0 and standard deviation 1:

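mu, sigma = 0.07, 0.89    # the parameters used to generate the data in the question
data = np.random.normal(mu, sigma, 10000)
normed_data = (data - mu) / sigma    # standardize with the known parameters
print(stats.kstest(normed_data, 'norm'))
# (0.0085805670733036798, 0.45316245879609179)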

Warning (many thanks to user333700, aka scipy developer Josef Perktold): if you don't know mu and sigma, estimating them from the data makes the p-value invalid:

import numpy as np
import scipy.stats as stats

mu = 0.3
sigma = 5

num_tests = 10**5
num_rejects = 0
alpha = 0.05
for i in range(num_tests):    # range replaces the Python 2-only xrange
    data = np.random.normal(mu, sigma, 10000)
    # normed_data = (data - mu) / sigma    # this is okay
    # 4915/100000 = 0.05 rejects at rejection level 0.05 (as expected)
    normed_data = (data - data.mean()) / data.std()    # this is NOT okay
    # 20/100000 = 0.00 rejects at rejection level 0.05 (not expected)
    D, pval = stats.kstest(normed_data, 'norm')
    if pval < alpha:
        num_rejects += 1
ratio = float(num_rejects) / num_tests
print('{}/{} = {:.2f} rejects at rejection level {}'.format(
    num_rejects, num_tests, ratio, alpha))     

prints

20/100000 = 0.00 rejects at rejection level 0.05

which shows that stats.kstest rejects far fewer null hypotheses than the expected 5% (at alpha = 0.05) if the sample is normalized using the sample's own mean and standard deviation:

normed_data = (data - data.mean()) / data.std()    # this is NOT okay
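One way around this, when mu and sigma have to be estimated, is to simulate the null distribution of D under the same estimation step and read the p-value off that. A hedged sketch using only numpy and scipy; `ks_stat_estimated` and `monte_carlo_pvalue` are hypothetical helper names, and the sample sizes, seeds, and number of repetitions are arbitrary choices:

```python
import numpy as np
from scipy import stats

def ks_stat_estimated(sample):
    """KS distance after standardizing with the sample's own mean and std."""
    z = (sample - sample.mean()) / sample.std(ddof=1)
    return stats.kstest(z, 'norm')[0]

def monte_carlo_pvalue(sample, reps=500, seed=0):
    """Share of simulated normal samples whose D is at least the observed D."""
    rng = np.random.default_rng(seed)
    n = len(sample)
    d_obs = ks_stat_estimated(sample)
    d_null = np.array([ks_stat_estimated(rng.standard_normal(n))
                       for _ in range(reps)])
    # The +1 terms keep the estimated p-value strictly positive.
    return d_obs, (1 + np.sum(d_null >= d_obs)) / (reps + 1)

rng = np.random.default_rng(42)
normal_sample = rng.normal(0.3, 5, 200)      # normal, parameters treated as unknown
skewed_sample = rng.exponential(1.0, 200)    # clearly not normal

d_norm, p_norm = monte_carlo_pvalue(normal_sample)
d_skew, p_skew = monte_carlo_pvalue(skewed_sample)
print(d_norm, p_norm)
print(d_skew, p_skew)
```

This is essentially what the Lilliefors test tabulates in advance; the simulation only makes the idea explicit and is not a replacement for the statsmodels implementation.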