Intuitively, why does Bayes' theorem work?
Draw a Venn diagram to help you understand $P(A|B)=P(A\cap B)/P(B)$. Then use this to relate the quantities $P(A|B)$ and $P(B|A)$ algebraically. Let's discuss the first point.
Suppose we have a finite sample space so we can count the number of outcomes in each possible "event." To determine $P(A|B)$, we're essentially asking for the probability of getting an outcome in $A$ if we pick an outcome in event $B$ uniformly at random (for simplicity).
For example, consider a collection of 100 objects: 64 balls and 36 blocks. Suppose further that among the balls, 24 are red and 40 are blue. Let $A$ be the event of being a red object (we don't know how many red blocks there are, but it won't matter) and $B$ the event of being a ball. Then $P(A|B)$ is the probability of picking a red object given that the object you picked was a ball, or equivalently of picking a red ball out of all of the balls. This equals the number of red balls divided by the number of balls: $P(A\cap B)/P(B)=\frac{24}{64}$.
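To check the counting arithmetic, here is a minimal sketch in Python using the numbers from the example (variable names are mine):

```python
# Counting check for the ball example.
# 100 objects: 64 balls (24 red, 40 blue) and 36 blocks.
n_total = 100       # size of the sample space
n_balls = 64        # |B|: outcomes in the event "is a ball"
n_red_balls = 24    # |A ∩ B|: red balls

p_B = n_balls / n_total            # P(B)
p_A_and_B = n_red_balls / n_total  # P(A ∩ B)

# P(A|B) = P(A ∩ B) / P(B), equivalently red balls / all balls
p_A_given_B = p_A_and_B / p_B
print(p_A_given_B)  # ≈ 0.375, i.e. 24/64
```

The division by $P(B)$ is exactly the "restrict attention to $B$" step: both counts are rescaled by the same total, so the sample-space size drops out.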
Now, given $P(A|B)P(B)=P(A\cap B)=P(B|A)P(A)$, it shouldn't be hard to finish.
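Spelling out that last step: divide the identity through by $P(B)$ (assuming $P(B)>0$) to obtain Bayes' theorem,

$$P(A\mid B)=\frac{P(B\mid A)\,P(A)}{P(B)}.$$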
The answers here are good, but if you're like me, you learn better with visual aids and actual numbers. Here's one for Bayes' Theorem using the same testing idea. The setup:
- You have a population of 100,000 people
- 0.5% of this population use a drug
- We have a test that tells us with 99% accuracy if a person is or is not a drug user
- You tested positive for this drug. What are the odds that you're a user given this information?
Let's set this up in a simple grid:
In the columns, I show counts of users and non-users. We are told that 0.5% of the population of 100,000 use this drug, so there are 500 users and 99,500 non-users.
In the rows, I show counts of test results. I will fill in these values in a moment.
I've highlighted cells inside the table in a certain way:
- Green cells are accurate test results. If a person is a user and the test result is positive, the test is accurate and the cell is green. Likewise, if a person is not a user and the test result is negative, the test is accurate and the cell is green.
- Red cells are inaccurate test results. If a person is a user but the test is negative, this is a false-negative and the cell is red. Likewise, if a person is not a user but the test is positive, this is a false-positive and the cell is red.
Let's start filling in the table. Given that we have 500 users, how many positive and negative test results will we have among this subset of the population? We're told the test is 99% accurate, so that's 495 positive results and 5 false-negatives:
Following the same process for the non-users: given that we have 99,500 non-users, how many positive and negative results will we have among this subset of the population? Again, the test is 99% accurate, so that's 98,505 negative results and 995 false-positives.
Time to analyze the results. We're told that you tested positive for the drug, so let's throw out the results pertaining to negative test results and look at only the "Positive" row in the grid. 1,490 people tested positive for the drug, of which 495 actually are users and 995 are false-positives. From here it's easy: the probability that you're actually a user, given that you're in the "Positive" row, is $\frac{495}{1490} \approx 33.2\%$.
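The whole grid can be rebuilt in a few lines of Python (a sketch using only the numbers stated above; names are mine):

```python
# Reconstruct the 2x2 grid from the stated numbers.
population = 100_000
p_user = 0.005        # 0.5% of the population use the drug
accuracy = 0.99       # test is correct 99% of the time

users = round(population * p_user)        # 500
non_users = population - users            # 99,500

true_positives = round(users * accuracy)        # 495
false_negatives = users - true_positives        # 5
true_negatives = round(non_users * accuracy)    # 98,505
false_positives = non_users - true_negatives    # 995

# Restrict attention to the "Positive" row of the grid.
total_positives = true_positives + false_positives  # 1,490
p_user_given_positive = true_positives / total_positives
print(f"{p_user_given_positive:.1%}")  # 33.2%
```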
Turning this into a formula:
$$P(\text{Drug User} \mid \text{Positive Result}) = \frac{P(\text{Drug User}) \cdot N \cdot \text{TestAccuracy}}{P(\text{Drug User}) \cdot N \cdot \text{TestAccuracy} + P(\text{Not Drug User}) \cdot N \cdot (1 - \text{TestAccuracy})}$$
This is exactly the formula I used to get 33.2%. Here, $N$ is the 100,000 population size that I made up. $N$ cancels out, so my 100,000 is strictly extraneous information, but I thought it'd be helpful to include it.
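A quick way to see that $N$ cancels is to evaluate the formula for two very different population sizes (a sketch; the function name is mine):

```python
def prob_user_given_positive(p_user, accuracy, n):
    """Formula from above: every term scales with n, so n cancels."""
    true_pos = p_user * n * accuracy
    false_pos = (1 - p_user) * n * (1 - accuracy)
    return true_pos / (true_pos + false_pos)

# Same answer for any population size:
print(prob_user_given_positive(0.005, 0.99, 100_000))  # ≈ 0.332
print(prob_user_given_positive(0.005, 0.99, 1))        # same value
```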
With this grid I hope it's clear why, even with a test accuracy so high, your chances of actually being a drug user given a positive test result are so much lower. The number of false-positives is large enough to skew the naive intuition of the result.
The prior distribution and the likelihood function (based on data) both contain information about a parameter. Bayes' theorem allows these two kinds and sources of information to be put together into a posterior distribution. The combined information in the posterior distribution can then be used to make inferences about the parameter. A couple of examples illustrate this process.
Screening test for a disease. Suppose we wonder whether a particular person has a disease. The prevalence in the population to which the subject belongs is 2%, so this is our prior information about the subject: P(Dis) = .02. A quick and inexpensive, but imperfect, screening test for the disease is available. Its characteristics are described by conditional probabilities: P(Pos test | Dis) = .99, P(Neg test | No Dis) = .97. Suppose our data is that the subject tests positive.
Then using the elementary form of Bayes' Theorem we can find the posterior probability P(Dis | Pos test) = 0.4024. Some people, focusing on P(Pos test | Dis) = .99, are surprised the posterior probability is so small.
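Carrying the arithmetic through with the elementary form of Bayes' theorem, using only the numbers above (a minimal sketch; variable names are mine):

```python
# Elementary Bayes' theorem for the screening test.
p_dis = 0.02                 # prevalence: P(Dis)
p_pos_given_dis = 0.99       # sensitivity: P(Pos | Dis)
p_neg_given_no_dis = 0.97    # specificity: P(Neg | No Dis)

# Law of total probability: P(Pos) over diseased and non-diseased.
p_pos = (p_pos_given_dis * p_dis
         + (1 - p_neg_given_no_dis) * (1 - p_dis))

# Bayes: P(Dis | Pos) = P(Pos | Dis) P(Dis) / P(Pos)
p_dis_given_pos = p_pos_given_dis * p_dis / p_pos
print(round(p_dis_given_pos, 4))  # 0.4024
```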
However, the appropriate focus for our purposes is that the data (a positive test result) has combined with the prior probability of 0.02 to give us a posterior probability of about 0.40. The screening test is imperfect, but data from it has made a considerable change in our assessment of the subject's probability of disease. A subject with a 40% chance of having a serious disease should be evaluated with further, and perhaps more time-consuming and expensive, tests.
Public Opinion Poll. A newly hired consultant for a political campaign to elect Candidate A feels that the candidate will win, but not overwhelmingly. Suppose her prior distribution on the probability $\psi$ of winning is $Beta(330, 270)$, which has mean 0.55 and 95% of its probability in the interval $(0.51,0.59).$ Then a poll of 1000 randomly selected potential voters shows 620 of them in favor of Candidate A. This is our data and it is reflected in the binomial likelihood function with kernel $\psi^{620}(1-\psi)^{380}$.
Bayes' Theorem melds the prior distribution with the likelihood function encoding the data to give the posterior distribution $Beta(950, 650)$, where multiplying the prior by the likelihood gives the posterior beta parameters $330+620=950$ and $270+380=650.$ The posterior beta distribution has mean about 0.59 and puts about 95% of its probability in the interval $(0.57, 0.62),$ which we take as our posterior probability interval for $\psi$, a somewhat more optimistic outlook for the candidate than given by the prior.
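The conjugate update and the posterior summary can be checked with the standard library alone. The normal approximation to the Beta below is my own shortcut (a Beta with parameters this large is close to normal); it reproduces the quoted interval to two decimals:

```python
import math

# Prior Beta(330, 270) plus binomial data: 620 successes, 380 failures.
a_prior, b_prior = 330, 270
successes, failures = 620, 380

# Conjugate update: posterior is Beta(330 + 620, 270 + 380).
a_post = a_prior + successes   # 950
b_post = b_prior + failures    # 650

mean = a_post / (a_post + b_post)
var = (a_post * b_post) / ((a_post + b_post) ** 2 * (a_post + b_post + 1))
sd = math.sqrt(var)

# Approximate 95% posterior interval via the normal approximation.
lo, hi = mean - 1.96 * sd, mean + 1.96 * sd
print(f"mean={mean:.3f}, 95% interval=({lo:.2f}, {hi:.2f})")
# mean=0.594, 95% interval=(0.57, 0.62)
```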
Here again, the information in the prior distribution and the data (as reflected in the likelihood function) have been combined to give a posterior distribution. Very roughly speaking, it is as if the consultant's prior distribution contributed information equivalent to that in a poll of 600 prospective voters of whom 330 favored the candidate.
Note: I have chosen these two examples, so that the math (if you care to carry it through) is quite simple. In some cases, much more computational effort is required to find and use the posterior distribution. But the computation needs to be viewed as a means to an end: to combine the information in the prior with the information in the data in order to make inferences based on both.
Acknowledgment: Numbers and distributions in these examples are the same as for ones in Ch 5 & 8, respectively, of Suess and Trumbo (2010), Springer.