How can I efficiently calculate the binomial cumulative distribution function?
I think you want to evaluate the incomplete beta function.
There's a nice implementation using a continued fraction representation in "Numerical Recipes In C", chapter 6: 'Special Functions'.
Exact Binomial Distribution
def factorial(n):
if n < 2: return 1
return reduce(lambda x, y: x*y, xrange(2, int(n)+1))
def prob(s, p, n):
x = 1.0 - p
a = n - s
b = s + 1
c = a + b - 1
prob = 0.0
for j in xrange(a, c + 1):
prob += factorial(c) / (factorial(j)*factorial(c-j)) \
* x**j * (1 - x)**(c-j)
return prob
>>> prob(20, 0.3, 100)
0.016462853241869437
>>> 1-prob(40-1, 0.3, 100)
0.020988576003924564
Normal Estimate, good for large n
import math
def erf(z):
t = 1.0 / (1.0 + 0.5 * abs(z))
# use Horner's method
ans = 1 - t * math.exp( -z*z - 1.26551223 +
t * ( 1.00002368 +
t * ( 0.37409196 +
t * ( 0.09678418 +
t * (-0.18628806 +
t * ( 0.27886807 +
t * (-1.13520398 +
t * ( 1.48851587 +
t * (-0.82215223 +
t * ( 0.17087277))))))))))
if z >= 0.0:
return ans
else:
return -ans
def normal_estimate(s, p, n):
u = n * p
o = (u * (1-p)) ** 0.5
return 0.5 * (1 + erf((s-u)/(o*2**0.5)))
>>> normal_estimate(20, 0.3, 100)
0.014548164531920815
>>> 1-normal_estimate(40-1, 0.3, 100)
0.024767304545069813
Poisson Estimate: Good for large n and small p
import math
def poisson(s,p,n):
L = n*p
sum = 0
for i in xrange(0, s+1):
sum += L**i/factorial(i)
return sum*math.e**(-L)
>>> poisson(20, 0.3, 100)
0.013411150012837811
>>> 1-poisson(40-1, 0.3, 100)
0.046253037645840323
I was on a project where we needed to be able to calculate the binomial CDF in an environment that didn't have a factorial or gamma function defined. It took me a few weeks, but I ended up coming up with the following algorithm which calculates the CDF exactly (i.e. no approximation necessary). Python is basically as good as pseudocode, right?
import numpy as np
def binomial_cdf(x,n,p):
cdf = 0
b = 0
for k in range(x+1):
if k > 0:
b += + np.log(n-k+1) - np.log(k)
log_pmf_k = b + k * np.log(p) + (n-k) * np.log(1-p)
cdf += np.exp(log_pmf_k)
return cdf
Performance scales with x. For small values of x, this solution is about an order of magnitude faster than scipy.stats.binom.cdf
, with similar performance at around x=10,000.
I won't go into a full derivation of this algorithm because stackoverflow doesn't support MathJax, but the thrust of it is first identifying the following equivalence:
- For all k > 0,
sp.misc.comb(n,k) == np.prod([(n-k+1)/k for k in range(1,k+1)])
Which we can rewrite as:
sp.misc.comb(n,k) == sp.misc.comb(n,k-1) * (n-k+1)/k
or in log space:
np.log( sp.misc.comb(n,k) ) == np.log(sp.misc.comb(n,k-1)) + np.log(n-k+1) - np.log(k)
Because the CDF is a summation of PMFs, we can use this formulation to calculate the binomial coefficient (the log of which is b
in the function above) for PMF_{x=i} from the coefficient we calculated for PMF_{x=i-1}. This means we can do everything inside a single loop using accumulators, and we don't need to calculate any factorials!
The reason most of the calculations are done in log space is to improve the numerical stability of the polynomial terms, i.e. p^x
and (1-p)^(1-x)
have the potential to be extremely large or extremely small, which can cause computational errors.
EDIT: Is this a novel algorithm? I've been poking around on and off since before I posted this, and I'm increasingly wondering if I should write this up more formally and submit it to a journal.