Monday, February 27, 2012

Binomial vs. Poisson vs. Normal Distribution

Consider this scenario: given 100 bottles, each of which independently has a 10% chance of being defective, what is the chance that at most 0 are defective? At most 1? At most 2? 3? ... In the general case, given n samples, each of which independently returns true with probability p, what is the probability that at most k of the samples return true? The binomial distribution gives the exact answer, while the Poisson and normal distributions give approximations whose accuracy depends on the scenario.
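For reference, the three quantities being compared can be written out as follows. The symbols X (the number of samples that return true), λ, and Φ (the standard normal CDF) are notation introduced here for illustration; the 0.5 in the last line is the usual continuity correction, which the MATLAB code in this post also applies.

```latex
\begin{align*}
\text{Binomial (exact):}\quad & P(X \le k) = \sum_{j=0}^{k} \binom{n}{j} p^{j} (1-p)^{n-j} \\
\text{Poisson approximation:}\quad & P(X \le k) \approx \sum_{j=0}^{k} \frac{e^{-\lambda} \lambda^{j}}{j!}, \qquad \lambda = np \\
\text{Normal approximation:}\quad & P(X \le k) \approx \Phi\!\left( \frac{k + 0.5 - np}{\sqrt{np(1-p)}} \right)
\end{align*}
```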

The MATLAB code below computes all three. Here n = 100, p = 0.05, and k = 0 to 10, but all of those values can easily be changed by hand. The first column of the table holds the manually chosen k values for which the cumulative probabilities are calculated; columns 2 to 4 hold the binomial, Poisson, and normal results:
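n = 100;                        % number of samples
p = 0.05;                       % probability that a single sample returns true
table = zeros(11,4);            % columns: k, binomial, Poisson, normal
table(:,1) = [0;1;2;3;4;5;6;7;8;9;10];

for i = 1:size(table,1)
    k = table(i,1);             % use the k value itself, not the loop index
    table(i,2) = binocdf(k,n,p);                          % exact binomial CDF
    table(i,3) = poisscdf(k,n*p);                         % Poisson CDF with lambda = n*p
    table(i,4) = normcdf((k+0.5-n*p)/sqrt(n*p*(1-p)));    % normal CDF with continuity correction
end
table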
Now tweak the program above (the values of n, p, and table(:,1)) to run through two cases. In case 1, n is large while p is small. In case 2, p is relatively large. A combination of MATLAB results and Excel calculations was used to produce the tables below; the two error columns give each approximation's relative error with respect to the binomial value.

Case 1: n = 200, p = 0.02, λ = 4

k     Binomial   Poisson   Normal    Poisson error   Normal error
0     0.0176     0.0183    0.0385     3.98%          118.75%
1     0.0894     0.0916    0.1034     2.46%           15.66%
2     0.2351     0.2381    0.2243     1.28%           -4.59%
3     0.4315     0.4335    0.4003     0.46%           -7.23%
4     0.6288     0.6288    0.5997     0.00%           -4.63%
5     0.7867     0.7851    0.7757    -0.20%           -1.40%
6     0.8914     0.8893    0.8966    -0.24%            0.58%
7     0.9507     0.9489    0.9615    -0.19%            1.14%
8     0.9798     0.9786    0.9885    -0.12%            0.89%
9     0.9925     0.9919    0.9973    -0.06%            0.48%
10    0.9975     0.9972    0.9995    -0.03%            0.20%

Case 2: n = 100, p = 0.4, λ = 40

k     Binomial   Poisson   Normal    Poisson error   Normal error
15    0.0000     0.0000    0.0000     n/a              n/a
20    0.0000     0.0004    0.0000     n/a              n/a
25    0.0012     0.0076    0.0015    533.33%          25.00%
30    0.0248     0.0617    0.0262    148.79%           5.65%
35    0.1795     0.2424    0.1792     35.04%          -0.17%
40    0.5433     0.5419    0.5406     -0.26%          -0.50%
45    0.8689     0.8097    0.8692     -6.81%           0.03%
50    0.9832     0.9474    0.9840     -3.64%           0.08%
55    0.9991     0.9903    0.9992     -0.88%           0.01%
60    1.0000     0.9988    1.0000     -0.12%           0.00%
65    1.0000     0.9999    1.0000     -0.01%           0.00%

(n/a: the binomial value rounds to 0 at four decimal places, so Excel could not compute a relative error.)

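These data show that the Poisson distribution is the better approximation when n is large and p is small (Case 1), while the normal distribution is the better approximation when p is relatively large (Case 2). Because the numbers were copied from MATLAB into Excel after rounding to four decimal places, rounding error may have distorted the percentage error calculations a bit, especially where the binomial value is close to 0.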
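To sidestep the rounding issue, the percentage errors can be recomputed from unrounded values. The following is a minimal sketch in Python, using only the standard library, for readers without MATLAB at hand. It is not the code that produced the tables above. The function names (binom_cdf, poisson_cdf, normal_cdf_cc, compare) are made up for this illustration, and the normal approximation assumes the same 0.5 continuity correction as the MATLAB code.

```python
import math

def binom_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam)."""
    return sum(math.exp(-lam) * lam**j / math.factorial(j) for j in range(k + 1))

def normal_cdf_cc(k, n, p):
    """Normal approximation to P(X <= k) with a 0.5 continuity correction."""
    z = (k + 0.5 - n * p) / math.sqrt(n * p * (1 - p))
    return 0.5 * math.erfc(-z / math.sqrt(2))  # standard normal CDF at z

def compare(n, p, ks):
    """Rows of (k, binomial, poisson, normal, poisson_err_pct, normal_err_pct).

    Errors are relative to the unrounded binomial value, in percent.
    """
    rows = []
    for k in ks:
        b = binom_cdf(k, n, p)
        po = poisson_cdf(k, n * p)   # lambda = n * p
        no = normal_cdf_cc(k, n, p)
        rows.append((k, b, po, no, 100 * (po - b) / b, 100 * (no - b) / b))
    return rows

if __name__ == "__main__":
    # The two cases from this post: (n, p, k values)
    cases = [(200, 0.02, range(0, 11)), (100, 0.4, range(15, 70, 5))]
    for n, p, ks in cases:
        print(f"n = {n}, p = {p}, lambda = {n * p:g}")
        for k, b, po, no, pe, ne in compare(n, p, ks):
            print(f"{k:3d} {b:.4e} {po:.4e} {no:.4e} {pe:12.2f}% {ne:12.2f}%")
```

Because the binomial values are never rounded to 0 here, the rows for k = 15 and k = 20 in Case 2 get finite relative errors instead of a division by zero. Expect the percentages to differ somewhat from the tables wherever the tabulated binomial value has only one or two significant digits.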