An Intuitive Explanation of Bayes' Theorem
for the curious and bewildered;
an excruciatingly gentle introduction.
This page has now been obsoleted by a vastly improved guide to Bayes's Theorem, the Arbital Guide to Bayes's Rule. Please read that instead. Seriously. I mean it.
Your friends and colleagues are talking about something called "Bayes' Theorem" or "Bayes' Rule", or something called Bayesian reasoning. They sound really enthusiastic about it, too, so you google and find a webpage about Bayes' Theorem and...
It's this equation. That's all. Just one equation. The page you found gives a definition of it, but it doesn't say what it is, or why it's useful, or why your friends would be interested in it. It looks like this random statistics thing.
So you came here. Maybe you don't understand what the equation says. Maybe you understand it in theory, but every time you try to apply it in practice you get mixed up trying to remember the difference between p(a|x) and p(x|a), and whether p(a)*p(x|a) belongs in the numerator or the denominator. Maybe you see the theorem, and you understand the theorem, and you can use the theorem, but you can't understand why your friends and/or research colleagues seem to think it's the secret of the universe. Maybe your friends are all wearing Bayes' Theorem T-shirts, and you're feeling left out. Maybe you're a girl looking for a boyfriend, but the boy you're interested in refuses to date anyone who "isn't Bayesian". What matters is that Bayes is cool, and if you don't know Bayes, you aren't cool.
Why does a mathematical concept generate this strange enthusiasm in its students? What is the so-called Bayesian Revolution now sweeping through the sciences, which claims to subsume even the experimental method itself as a special case? What is the secret that the adherents of Bayes know? What is the light that they have seen?
Soon you will know. Soon you will be one of us.
While there are a few existing online explanations of Bayes' Theorem, my experience with trying to introduce people to Bayesian reasoning is that the existing online explanations are too abstract. Bayesian reasoning is very counterintuitive. People do not employ Bayesian reasoning intuitively, find it very difficult to learn Bayesian reasoning when tutored, and rapidly forget Bayesian methods once the tutoring is over. This holds equally true for novice students and highly trained professionals in a field. Bayesian reasoning is apparently one of those things which, like quantum mechanics or the Wason Selection Test, is inherently difficult for humans to grasp with our built-in mental faculties.
Or so they claim. Here you will find an attempt to offer an intuitive explanation of Bayesian reasoning - an excruciatingly gentle introduction that invokes all the human ways of grasping numbers, from natural frequencies to spatial visualization. The intent is to convey, not abstract rules for manipulating numbers, but what the numbers mean, and why the rules are what they are (and cannot possibly be anything else). When you are finished reading this page, you will see Bayesian problems in your dreams.
And let's begin.
Here's a story problem about a situation that doctors often encounter:
1% of women at age forty who participate in routine screening have breast cancer. 80% of women with breast cancer will get positive mammographies. 9.6% of women without breast cancer will also get positive mammographies. A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?
What do you think the answer is? If you haven't encountered this kind of problem before, please take a moment to come up with your own answer before continuing.
Next, suppose I told you that most doctors get the same wrong answer on this problem - usually, only around 15% of doctors get it right. ("Really? 15%? Is that a real number, or an urban legend based on an Internet poll?" It's a real number. See Casscells, Schoenberger, and Graboys 1978; Eddy 1982; Gigerenzer and Hoffrage 1995; and many other studies. It's a surprising result which is easy to replicate, so it's been extensively replicated.)
On the story problem above, most doctors estimate the probability to be between 70% and 80%, which is wildly incorrect.
Here's an alternate version of the problem on which doctors fare somewhat better:
10 out of 1000 women at age forty who participate in routine screening have breast cancer. 800 out of 1000 women with breast cancer will get positive mammographies. 96 out of 1000 women without breast cancer will also get positive mammographies. If 1000 women in this age group undergo a routine screening, about what fraction of women with positive mammographies will actually have breast cancer?
And finally, here's the problem on which doctors fare best of all, with 46% - nearly half - arriving at the correct answer:
100 out of 10,000 women at age forty who participate in routine screening have breast cancer. 80 of every 100 women with breast cancer will get a positive mammography. 950 out of 9,900 women without breast cancer will also get a positive mammography. If 10,000 women in this age group undergo a routine screening, about what fraction of women with positive mammographies will actually have breast cancer?
The correct answer is 7.8%, obtained as follows: Out of 10,000 women, 100 have breast cancer; 80 of those 100 have positive mammographies. From the same 10,000 women, 9,900 will not have breast cancer and of those 9,900 women, 950 will also get positive mammographies. This makes the total number of women with positive mammographies 950+80 or 1,030. Of those 1,030 women with positive mammographies, 80 will have cancer. Expressed as a proportion, this is 80/1,030 or 0.07767 or 7.8%.
To put it another way, before the mammography screening, the 10,000 women can be divided into two groups:
- Group 1: 100 women with breast cancer.
- Group 2: 9,900 women without breast cancer.

After the mammography, the women can be divided into four groups:

- Group A: 80 women with breast cancer, and a positive mammography.
- Group B: 20 women with breast cancer, and a negative mammography.
- Group C: 950 women without breast cancer, and a positive mammography.
- Group D: 8,950 women without breast cancer, and a negative mammography.
The proportion of cancer patients with positive results, within the group of all patients with positive results, is the proportion of (A) within (A + C): 80 / (80 + 950) = 80 / 1030 = 7.8%. If you administer a mammography to 10,000 patients, then out of the 1030 with positive mammographies, 80 of those positive-mammography patients will have cancer. This is the correct answer, the answer a doctor should give a positive-mammography patient if she asks about the chance she has breast cancer; if thirteen patients ask this question, roughly 1 out of those 13 will have cancer.
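The arithmetic above can be double-checked with a few lines of code; a minimal sketch of the 10,000-woman calculation (the variable names are mine, not part of the problem):

```python
# Counts from the 10,000-woman version of the problem.
total_women = 10_000
with_cancer = 100                  # 1% prior prevalence
without_cancer = total_women - with_cancer

true_positives = 80                # 80% of the 100 women with cancer
false_positives = 950              # ~9.6% of the 9,900 women without cancer

all_positives = true_positives + false_positives   # 1,030 positive mammographies
posterior = true_positives / all_positives

print(all_positives)               # 1030
print(round(posterior, 3))         # 0.078, i.e. 7.8%
```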
The most common mistake is to ignore the original fraction of women with breast cancer, and the fraction of women without breast cancer who receive false positives, and focus only on the fraction of women with breast cancer who get positive results. For example, the vast majority of doctors in these studies seem to have thought that if around 80% of women with breast cancer have positive mammographies, then the probability of a woman with a positive mammography having breast cancer must be around 80%.
Figuring out the final answer always requires all three pieces of information - the percentage of women with breast cancer, the percentage of women without breast cancer who receive false positives, and the percentage of women with breast cancer who receive (correct) positives.
To see that the final answer always depends on the original fraction of women with breast cancer, consider an alternate universe in which only one woman out of a million has breast cancer. Even if mammography in this world detects breast cancer in 8 out of 10 cases, while returning a false positive on a woman without breast cancer in only 1 out of 10 cases, there will still be a hundred thousand false positives for every real case of cancer detected. The original probability that a woman has cancer is so extremely low that, although a positive result on the mammography does increase the estimated probability, the probability isn't increased to certainty or even "a noticeable chance"; the probability goes from 1:1,000,000 to 1:100,000.
Similarly, in an alternate universe where only one out of a million women does not have breast cancer, a positive result on the patient's mammography obviously doesn't mean that she has an 80% chance of having breast cancer! If this were the case her estimated probability of having cancer would have been revised drastically downward after she got a positive result on her mammography - an 80% chance of having cancer is a lot less than 99.9999%! If you administer mammographies to ten million women in this world, around eight million women with breast cancer will get correct positive results, while one woman without breast cancer will get false positive results. Thus, if you got a positive mammography in this alternate universe, your chance of having cancer would go from 99.9999% up to 99.999987%. That is, your chance of being healthy would go from 1:1,000,000 down to 1:8,000,000.
These two extreme examples help demonstrate that the mammography result doesn't replace your old information about the patient's chance of having cancer; the mammography slides the estimated probability in the direction of the result. A positive result slides the original probability upward; a negative result slides the probability downward. For example, in the original problem where 1% of the women have cancer, 80% of women with cancer get positive mammographies, and 9.6% of women without cancer get positive mammographies, a positive result on the mammography slides the 1% chance upward to 7.8%.
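This "sliding" can be packaged as a small function; a sketch, where the name bayes_update is my own shorthand rather than standard terminology:

```python
def bayes_update(prior, p_pos_given_cancer, p_pos_given_healthy):
    """Slide the prior probability in light of a positive test result."""
    true_pos = prior * p_pos_given_cancer           # women with cancer who test positive
    false_pos = (1 - prior) * p_pos_given_healthy   # women without cancer who test positive
    return true_pos / (true_pos + false_pos)

# The original problem: a positive mammography slides the 1% prior up to 7.8%.
print(round(bayes_update(0.01, 0.80, 0.096), 3))    # 0.078
```

A negative result would slide the probability the other way, using the complements of the two conditional probabilities.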
Most people encountering problems of this type for the first time carry out the mental operation of replacing the original 1% probability with the 80% probability that a woman with cancer gets a positive mammography. It may seem like a good idea, but it just doesn't work. "The probability that a woman with a positive mammography has breast cancer" is not at all the same thing as "the probability that a woman with breast cancer has a positive mammography"; they are as unlike as apples and cheese. Finding the final answer, "the probability that a woman with a positive mammography has breast cancer", uses all three pieces of problem information - "the prior probability that a woman has breast cancer", "the probability that a woman with breast cancer gets a positive mammography", and "the probability that a woman without breast cancer gets a positive mammography".
Q. What is the Bayesian Conspiracy?
A. The Bayesian Conspiracy is a multinational, interdisciplinary, and shadowy group of scientists that controls publication, grants, tenure, and the illicit traffic in grad students. The best way to be accepted into the Bayesian Conspiracy is to join the Campus Crusade for Bayes in high school or college, and gradually work your way up to the inner circles. It is rumored that at the upper levels of the Bayesian Conspiracy exist nine silent figures known only as the Bayes Council.
To see that the final answer always depends on the chance that a woman without breast cancer gets a positive mammography, consider an alternate test, mammography+. Like the original test, mammography+ returns positive for 80% of women with breast cancer. However, mammography+ returns a positive result for only one out of a million women without breast cancer - mammography+ has the same rate of false negatives, but a vastly lower rate of false positives. Suppose a patient receives a positive mammography+. What is the chance that this patient has breast cancer? Under the new test, it is a virtual certainty - 99.988%, i.e., a 1 in 8082 chance of being healthy.
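The mammography+ figures can be verified with the same arithmetic; a quick sketch:

```python
# mammography+: the same 80% detection rate, but only a one-in-a-million
# false positive rate for women without breast cancer.
prior = 0.01
true_pos = prior * 0.80                  # 0.008
false_pos = (1 - prior) * 1e-6           # 9.9e-7

posterior = true_pos / (true_pos + false_pos)
print(round(posterior, 5))               # 0.99988 - a virtual certainty
print(round(1 / (1 - posterior)))        # ~8082 - a 1 in 8082 chance of being healthy
```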
Remember, at this point, that neither mammography nor mammography+ actually change the number of women who have breast cancer. It may seem like "There is a virtual certainty you have breast cancer" is a terrible thing to say, causing much distress and despair; that the more hopeful verdict of the previous mammography test - a 7.8% chance of having breast cancer - was much to be preferred. This comes under the heading of "Don't shoot the messenger". The number of women who really do have cancer stays exactly the same between the two cases. Only the accuracy with which we detect cancer changes. Under the previous mammography test, 80 women with cancer (who already had cancer, before the mammography) are first told that they have a 7.8% chance of having cancer, creating X amount of uncertainty and fear, after which more detailed tests will inform them that they definitely do have breast cancer. The old mammography test also involves informing 950 women without breast cancer that they have a 7.8% chance of having cancer, thus creating twelve times as much additional fear and uncertainty. The new test, mammography+, does not give 950 women false positives, and the 80 women with cancer are told the same facts they would have learned eventually, only earlier and without an intervening period of uncertainty. Mammography+ is thus a better test in terms of its total emotional impact on patients, as well as being more accurate. Regardless of its emotional impact, it remains a fact that a patient with positive mammography+ has a 99.988% chance of having breast cancer.
Of course, that mammography+ does not give 950 healthy women false positives means that all 80 of the patients with positive mammography+ will be patients with breast cancer. Thus, if you have a positive mammography+, your chance of having cancer is a virtual certainty. It is because mammography+ does not generate as many false positives (and needless emotional stress), that the (much smaller) group of patients who do get positive results will be composed almost entirely of genuine cancer patients (who have bad news coming to them regardless of when it arrives).
Similarly, let's suppose that we have a less discriminating test, mammography*, that still has a 20% rate of false negatives, as in the original case. However, mammography* has an 80% rate of false positives. In other words, a patient without breast cancer has an 80% chance of getting a false positive result on her mammography* test. If we suppose the same 1% prior probability that a patient presenting herself for screening has breast cancer, what is the chance that a patient with positive mammography* has cancer?
- Group 1: 100 patients with breast cancer.
- Group 2: 9,900 patients without breast cancer.
- Group A: 80 patients with breast cancer and a "positive" mammography*.
- Group B: 20 patients with breast cancer and a "negative" mammography*.
- Group C: 7,920 patients without breast cancer and a "positive" mammography*.
- Group D: 1,980 patients without breast cancer and a "negative" mammography*.

The result works out to 80 / (80 + 7,920), or 80/8,000, or exactly 1% - the same as the original prior probability. Under mammography*, a "positive" result tells us nothing at all about whether the patient has breast cancer.
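Working through the mammography* numbers shows the posterior landing right back on the prior; a quick sketch using the group counts:

```python
# mammography*: 80% true positive rate AND 80% false positive rate.
group_a = 80      # patients with breast cancer and a "positive" mammography*
group_c = 7_920   # patients without breast cancer and a "positive" mammography*

posterior = group_a / (group_a + group_c)
print(posterior)  # 0.01 - exactly the 1% prior; the "test" told us nothing
```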
We can show algebraically that this must hold for any case where the chance of a true positive and the chance of a false positive are the same, i.e.:
- Group 1: 100 patients with breast cancer.
- Group 2: 9,900 patients without breast cancer.
- Group A: 100*M patients with breast cancer and a "positive" result.
- Group B: 100*(1 - M) patients with breast cancer and a "negative" result.
- Group C: 9,900*M patients without breast cancer and a "positive" result.
- Group D: 9,900*(1 - M) patients without breast cancer and a "negative" result.

The proportion of patients with breast cancer, within the group of patients with positive results, then equals 100*M / (100*M + 9,900*M) = 100 / (100 + 9,900) = 1%. The conditional probability M cancels out, leaving the prior probability unchanged.
You can run through the same algebra, replacing the prior proportion of patients with breast cancer with an arbitrary percentage P:
- Group 1: Within some number of patients, a fraction P have breast cancer.
- Group 2: Within some number of patients, a fraction (1 - P) do not have breast cancer.
- Group A: P*M patients have breast cancer and a "positive" result.
- Group B: P*(1 - M) patients have breast cancer and a "negative" result.
- Group C: (1 - P)*M patients have no breast cancer and a "positive" result.
- Group D: (1 - P)*(1 - M) patients have no breast cancer and a "negative" result.

The proportion of patients with breast cancer within the positive-result group is then P*M / (P*M + (1 - P)*M) = P / (P + (1 - P)) = P. If the two conditional probabilities are equal, the revised probability equals the prior probability.
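The cancellation can be spot-checked with exact arithmetic; a sketch using Python's fractions module (the function name posterior is mine):

```python
from fractions import Fraction

def posterior(p, m_pos, m_neg):
    """p(cancer | positive), given prior p and the two conditional probabilities."""
    return (p * m_pos) / (p * m_pos + (1 - p) * m_neg)

# When both conditional probabilities equal the same M, the M cancels:
# P*M / (P*M + (1-P)*M) = P, for any prior P and any M.
for p in (Fraction(1, 100), Fraction(2, 5), Fraction(999, 1000)):
    for m in (Fraction(1, 2), Fraction(3, 5), Fraction(99, 100)):
        assert posterior(p, m, m) == p
print("posterior equals prior whenever the two conditionals are equal")
```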
Which is common sense. Take, for example, the "test" of flipping a coin; if the coin comes up heads, does it tell you anything about whether a patient has breast cancer? No; the coin has a 50% chance of coming up heads if the patient has breast cancer, and also a 50% chance of coming up heads if the patient does not have breast cancer. Therefore there is no reason to call either heads or tails a "positive" result. It's not the probability being "50/50" that makes the coin a bad test; it's that the two probabilities, for "cancer patient turns up heads" and "healthy patient turns up heads", are the same. If the coin was slightly biased, so that it had a 60% chance of coming up heads, it still wouldn't be a cancer test - what makes a coin a poor test is not that it has a 50/50 chance of coming up heads if the patient has cancer, but that it also has a 50/50 chance of coming up heads if the patient does not have cancer. You can even use a test that comes up "positive" for cancer patients 100% of the time, and still not learn anything. An example of such a test is "Add 2 + 2 and see if the answer is 4." This test returns positive 100% of the time for patients with breast cancer. It also returns positive 100% of the time for patients without breast cancer. So you learn nothing.
The original proportion of patients with breast cancer is known as the prior probability. The chance that a patient with breast cancer gets a positive mammography, and the chance that a patient without breast cancer gets a positive mammography, are known as the two conditional probabilities. Collectively, this initial information is known as the priors. The final answer - the estimated probability that a patient has breast cancer, given that we know she has a positive result on her mammography - is known as the revised probability or the posterior probability. What we've just shown is that if the two conditional probabilities are equal, the posterior probability equals the prior probability.
Q. How can I find the priors for a problem?
A. Many commonly used priors are listed in the Handbook of Chemistry and Physics.
Q. Where do priors originally come from?
A. Never ask that question.
Q. Uh huh. Then where do scientists get their priors?
A. Priors for scientific problems are established by annual vote of the AAAS. In recent years the vote has become fractious and controversial, with widespread acrimony, factional polarization, and several outright assassinations. This may be a front for infighting within the Bayes Council, or it may be that the disputants have too much spare time. No one is really sure.
Q. I see. And where does everyone else get their priors?
A. They download their priors from Kazaa.
Q. What if the priors I want aren't available on Kazaa?
A. There's a small, cluttered antique shop in a back alley of San Francisco's Chinatown. Don't ask about the bronze rat.
Actually, priors are true or false just like the final answer - they reflect reality and can be judged by comparing them against reality. For example, if you think that 920 out of 10,000 women in a sample have breast cancer, and the actual number is 100 out of 10,000, then your priors are wrong. For our particular problem, the priors might have been established by three studies - a study on the case histories of women with breast cancer to see how many of them tested positive on a mammography, a study on women without breast cancer to see how many of them test positive on a mammography, and an epidemiological study on the prevalence of breast cancer in some specific demographic.
Suppose that a barrel contains many small plastic eggs. Some eggs are painted red and some are painted blue. 40% of the eggs in the bin contain pearls, and 60% contain nothing. 30% of eggs containing pearls are painted blue, and 10% of eggs containing nothing are painted blue. What is the probability that a blue egg contains a pearl? For this example the arithmetic is simple enough that you may be able to do it in your head, and I would suggest trying to do so.
A more compact way of specifying the problem:
- p(pearl) = 40%
- p(blue|pearl) = 30%
- p(blue|~pearl) = 10%
- p(pearl|blue) = ?
blue|pearl is shorthand for "blue given pearl" or "the probability that an egg is painted blue, given that the egg contains a pearl". One thing that's confusing about this notation is that the order of implication is read right-to-left, as in Hebrew or Arabic. blue|pearl means "blue<-pearl", the degree to which pearl-ness implies blue-ness, not the degree to which blue-ness implies pearl-ness. This is confusing, but it's unfortunately the standard notation in probability theory.
Readers familiar with quantum mechanics will have already encountered this peculiarity; in quantum mechanics, for example, <d|c><c|b><b|a> reads as "the probability that a particle at A goes to B, then to C, ending up at D". To follow the particle, you move your eyes from right to left. Reading from left to right, "|" means "given"; reading from right to left, "|" means "implies" or "leads to". Thus, moving your eyes from left to right, blue|pearl reads "blue given pearl" or "the probability that an egg is painted blue, given that the egg contains a pearl". Moving your eyes from right to left, blue|pearl reads "pearl implies blue" or "the probability that an egg containing a pearl is painted blue".
The item on the right side is what you already know or the premise, and the item on the left side is the implication or conclusion. If we have p(blue|pearl) = 30%, and we already know that some egg contains a pearl, then we can conclude there is a 30% chance that the egg is painted blue. Thus, the final fact we're looking for - "the chance that a blue egg contains a pearl" or "the probability that an egg contains a pearl, if we know the egg is painted blue" - reads p(pearl|blue).
Let's return to the problem. We have that 40% of the eggs contain pearls, and 60% of the eggs contain nothing. 30% of the eggs containing pearls are painted blue, so 12% of the eggs altogether contain pearls and are painted blue. 10% of the eggs containing nothing are painted blue, so altogether 6% of the eggs contain nothing and are painted blue. A total of 18% of the eggs are painted blue, and a total of 12% of the eggs are painted blue and contain pearls, so the chance a blue egg contains a pearl is 12/18 or 2/3 or around 67%.
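The same calculation can be written out directly; a minimal sketch of the pearl-egg arithmetic:

```python
# The pearl-egg problem in probabilities.
p_pearl = 0.40
p_blue_given_pearl = 0.30
p_blue_given_empty = 0.10

blue_and_pearl = p_pearl * p_blue_given_pearl           # 0.12: blue eggs with pearls
blue_and_empty = (1 - p_pearl) * p_blue_given_empty     # 0.06: blue eggs with nothing

p_pearl_given_blue = blue_and_pearl / (blue_and_pearl + blue_and_empty)
print(round(p_pearl_given_blue, 2))                     # 0.67, i.e. 12/18 = 2/3
```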
The applet below, courtesy of Christian Rovner, shows a graphic representation of this problem:
Looking at this applet, it's easier to see why the final answer depends on all three probabilities; it's the differential pressure between the two conditional probabilities, p(blue|pearl) and p(blue|~pearl), that slides the prior probability p(pearl) to the posterior probability p(pearl|blue).
As before, we can see the necessity of all three pieces of information by considering extreme cases (feel free to type them into the applet). In a (large) barrel in which only one egg out of a thousand contains a pearl, knowing that an egg is painted blue slides the probability from 0.1% to 0.3% (instead of sliding the probability from 40% to 67%). Similarly, if 999 out of 1000 eggs contain pearls, knowing that an egg is blue slides the probability from 99.9% to 99.966%; the probability that the egg does not contain a pearl goes from 1/1000 to around 1/3000. Even when the prior probability changes, the differential pressure of the two conditional probabilities always slides the probability in the same direction. If you learn the egg is painted blue, the probability the egg contains a pearl always goes up - but it goes up from the prior probability, so you need to know the prior probability in order to calculate the final answer. 0.1% goes up to 0.3%, 10% goes up to 25%, 40% goes up to 67%, 80% goes up to 92%, and 99.9% goes up to 99.966%. If you're interested in knowing how any other probabilities slide, you can type your own prior probability into the Java applet. You can also click and drag the dividing line between pearl and ~pearl in the upper bar, and watch the posterior probability change in the bottom bar.
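The slides quoted above can be reproduced for any prior; a sketch, where the helper name slide is my own label for the update:

```python
def slide(prior, p_blue_given_pearl=0.30, p_blue_given_empty=0.10):
    """Posterior p(pearl | blue) for a given prior p(pearl)."""
    num = prior * p_blue_given_pearl
    return num / (num + (1 - prior) * p_blue_given_empty)

# Learning the egg is blue always slides the probability upward,
# but where it ends up depends on where it started:
# 0.1% -> ~0.3%, 10% -> 25%, 40% -> ~67%, 80% -> ~92%, 99.9% -> ~99.966%
for prior in (0.001, 0.10, 0.40, 0.80, 0.999):
    print(f"{prior:>6.1%} -> {slide(prior):.3%}")
```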
Studies of clinical reasoning show that most doctors carry out the mental operation of replacing the original 1% probability with the 80% probability that a woman with cancer would get a positive mammography. Similarly, on the pearl-egg problem, most respondents unfamiliar with Bayesian reasoning would probably respond that the probability a blue egg contains a pearl is 30%, or perhaps 20% (the 30% chance of a true positive minus the 10% chance of a false positive). Even if this mental operation seems like a good idea at the time, it makes no sense in terms of the question asked. It's like the experiment in which you ask a second-grader: "If eighteen people get on a bus, and then seven more people get on the bus, how old is the bus driver?" Many second-graders will respond: "Twenty-five." They understand when they're being prompted to carry out a particular mental procedure, but they haven't quite connected the procedure to reality. Similarly, to find the probability that a woman with a positive mammography has breast cancer, it makes no sense whatsoever to replace the original probability that the woman has cancer with the probability that a woman with breast cancer gets a positive mammography. Neither can you subtract the probability of a false positive from the probability of the true positive. These operations are as wildly irrelevant as adding the number of people on the bus to find the age of the bus driver.
I keep emphasizing the idea that evidence slides probability because of research that shows people tend to use spatial intuitions to grasp numbers. In particular, there's interesting evidence that we have an innate sense of quantity that's localized to left inferior parietal cortex - patients with damage to this area can selectively lose their sense of whether 5 is less than 8, while retaining their ability to read, write, and so on. (Yes, really!) The parietal cortex processes our sense of where things are in space (roughly speaking), so an innate "number line", or rather "quantity line", may be responsible for the human sense of numbers. This is why I suggest visualizing Bayesian evidence as sliding the probability along the number line; my hope is that this will translate Bayesian reasoning into something that makes sense to innate human brainware. (That, really, is what an "intuitive explanation" is.) For more information, see Stanislas Dehaene's The Number Sense.
A study by Gigerenzer and Hoffrage in 1995 showed that some ways of phrasing story problems are much more evocative of correct Bayesian reasoning. The least evocative phrasing used probabilities. A slightly more evocative phrasing used frequencies instead of probabilities; the problem remained the same, but instead of saying that 1% of women had breast cancer, one would say that 1 out of 100 women had breast cancer, that 80 out of 100 women with breast cancer would get a positive mammography, and so on. Why did a higher proportion of subjects display Bayesian reasoning on this problem? Probably because saying "1 out of 100 women" encourages you to concretely visualize X women with cancer, leading you to visualize X women with cancer and a positive mammography, etc.
The most effective presentation found so far is what's known as natural frequencies - saying that 40 out of 100 eggs contain pearls, 12 out of 40 eggs containing pearls are painted blue, and 6 out of 60 eggs containing nothing are painted blue. A natural frequencies presentation is one in which the information about the prior probability is included in presenting the conditional probabilities. If you were just learning about the eggs' conditional probabilities through natural experimentation, you would - in the course of cracking open a hundred eggs - crack open around 40 eggs containing pearls, of which 12 eggs would be painted blue, while cracking open 60 eggs containing nothing, of which about 6 would be painted blue. In the course of learning the conditional probabilities, you'd see examples of blue eggs containing pearls about twice as often as you saw examples of blue eggs containing nothing.
It may seem like presenting the problem in this way is "cheating", and indeed if it were a story problem in a math book, it probably would be cheating. However, if you're talking about real doctors, you want to cheat; you want the doctors to draw the right conclusions as easily as possible. The obvious next move would be to present all medical statistics in terms of natural frequencies. Unfortunately, while natural frequencies are a step in the right direction, they probably won't be enough. When problems are presented in natural frequencies, the proportion of people using Bayesian reasoning rises to around half. A big improvement, but not big enough when you're talking about real doctors and real patients.
A presentation of the problem in natural frequencies might be visualized like this:
In the frequency visualization, the selective attrition of the two conditional probabilities changes the proportion of eggs that contain pearls. The bottom bar is shorter than the top bar, just as the number of eggs painted blue is less than the total number of eggs. The probability graph shown earlier is really just the frequency graph with the bottom bar "renormalized", stretched out to the same length as the top bar. In the frequency applet you can change the conditional probabilities by clicking and dragging the left and right edges of the graph. (For example, to change the conditional probability blue|pearl, click and drag the line on the left that stretches from the left edge of the top bar to the left edge of the bottom bar.)
In the probability applet, you can see that when the conditional probabilities are equal, there's no differential pressure - the arrows are the same size - so the prior probability doesn't slide between the top bar and the bottom bar. But the bottom bar in the probability applet is just a renormalized (stretched out) version of the bottom bar in the frequency applet, and the frequency applet shows why the probability doesn't slide if the two conditional probabilities are equal. Here's a case where the prior proportion of pearls remains 40%, and the proportion of pearl eggs painted blue remains 30%, but the number of empty eggs painted blue is also 30%:
If you diminish two shapes by the same factor, their relative proportion will be the same as before. If you diminish the left section of the top bar by the same factor as the right section, then the bottom bar will have the same proportions as the top bar - it'll just be smaller. If the two conditional probabilities are equal, learning that the egg is blue doesn't change the probability that the egg contains a pearl - for the same reason that similar triangles have identical angles; geometric figures don't change shape when you shrink them by a constant factor.
In this case, you might as well just say that 30% of eggs are painted blue, since the probability of an egg being painted blue is independent of whether the egg contains a pearl. Applying a "test" that is statistically independent of its condition just shrinks the sample size. In this case, requiring that the egg be painted blue doesn't shrink the group of eggs with pearls any more or less than it shrinks the group of eggs without pearls. It just shrinks the total number of eggs in the sample.
Q. Why did the Bayesian reasoner cross the road?
A. You need more information to answer this question.
Here's what the original medical problem looks like when graphed. 1% of women have breast cancer, 80% of those women test positive on a mammography, and 9.6% of women without breast cancer also receive positive mammographies.
As is now clearly visible, the mammography doesn't increase the probability a positive-testing woman has breast cancer by increasing the number of women with breast cancer - of course not; if mammography increased the number of women with breast cancer, no one would ever take the test! However, requiring a positive mammography is a membership test that eliminates many more women without breast cancer than women with cancer. The number of women without breast cancer diminishes by a factor of more than ten, from 9,900 to 950, while the number of women with breast cancer is diminished only from 100 to 80. Thus, the proportion of 80 within 1,030 is much larger than the proportion of 100 within 10,000. In the graph, the left sector (representing women with breast cancer) is small, but the mammography test projects almost all of this sector into the bottom bar. The right sector (representing women without breast cancer) is large, but the mammography test projects a much smaller fraction of this sector into the bottom bar. There are, indeed, fewer women with breast cancer and positive mammographies than there are women with breast cancer - obeying the law of probabilities which requires that p(A) >= p(A&B). But even though the left sector in the bottom bar is actually slightly smaller, the proportion of the left sector within the bottom bar is greater - though still not very great. If the bottom bar were renormalized to the same length as the top bar, it would look like the left sector had expanded. This is why the proportion of "women with breast cancer" in the group "women with positive mammographies" is higher than the proportion of "women with breast cancer" in the general population - although the proportion is still not very high. The evidence of the positive mammography slides the prior probability of 1% to the posterior probability of 7.8%.
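The filtering arithmetic above can be sketched in a few lines of Python (a minimal illustration using the frequencies from the text, on a notional sample of 10,000 women; variable names are mine):

```python
# Membership-test view of the mammography problem: a positive result
# filters both groups, but shrinks the cancer-free group much more.
total = 10_000
with_cancer = int(total * 0.01)            # 100 women with breast cancer
without_cancer = total - with_cancer       # 9,900 women without

positive_with = with_cancer * 0.80         # 80 true positives
positive_without = without_cancer * 0.096  # ~950 false positives

posterior = positive_with / (positive_with + positive_without)
print(round(posterior, 3))  # 0.078, i.e. roughly 7.8%
```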
Suppose there's yet another variant of the mammography test, mammography@, which behaves as follows. 1% of women in a certain demographic have breast cancer. Like ordinary mammography, mammography@ returns positive 9.6% of the time for women without breast cancer. However, mammography@ returns positive 0% of the time (say, once in a billion) for women with breast cancer. The graph for this scenario looks like this:
What is it that this test actually does? If a patient comes to you with a positive result on her mammography@, what do you say?
"Congratulations, you're among the rare 9.5% of the population whose health is definitely established by this test."
Mammography@ isn't a cancer test; it's a health test! Few women without breast cancer get positive results on mammography@, but only women without breast cancer ever get positive results at all. Not much of the right sector of the top bar projects into the bottom bar, but none of the left sector projects into the bottom bar. So a positive result on mammography@ means you definitely don't have breast cancer.
What makes ordinary mammography a positive indicator for breast cancer is not that someone named the result "positive", but rather that the test result stands in a specific Bayesian relation to the condition of breast cancer. You could call the same result "positive" or "negative" or "blue" or "red" or "James Rutherford", or give it no name at all, and the test result would still slide the probability in exactly the same way. To minimize confusion, a test result which slides the probability of breast cancer upward should be called "positive". A test result which slides the probability of breast cancer downward should be called "negative". If the test result is statistically unrelated to the presence or absence of breast cancer - if the two conditional probabilities are equal - then we shouldn't call the procedure a "cancer test"! The meaning of the test is determined by the two conditional probabilities; any names attached to the results are simply convenient labels.
The bottom bar for the graph of mammography@ is small; mammography@ is a test that's only rarely useful. Or rather, the test only rarely gives strong evidence, and most of the time gives weak evidence. A negative result on mammography@ does slide probability - it just doesn't slide it very far. Click the "Result" switch at the bottom left corner of the applet to see what a negative result on mammography@ would imply. You might intuit that since the test could have returned positive for health, but didn't, then the failure of the test to return positive must mean that the woman has a higher chance of having breast cancer - that her probability of having breast cancer must be slid upward by the negative result on her health test.
This intuition is correct! The sum of the groups with negative results and positive results must always equal the group of all women. If the positive-testing group has "more than its fair share" of women without breast cancer, there must be an at least slightly higher proportion of women with cancer in the negative-testing group. A positive result is rare but very strong evidence in one direction, while a negative result is common but very weak evidence in the opposite direction. You might call this the Law of Conservation of Probability - not a standard term, but the conservation rule is exact. If you take the revised probability of breast cancer after a positive result, times the probability of a positive result, and add that to the revised probability of breast cancer after a negative result, times the probability of a negative result, then you must always arrive at the prior probability. If you don't yet know what the test result is, the expected revised probability after the test result arrives - taking both possible results into account - should always equal the prior probability.
On ordinary mammography, the test is expected to return "positive" 10.3% of the time - 80 positive women with cancer plus 950 positive women without cancer equals 1030 women with positive results. Conversely, the mammography should return negative 89.7% of the time: 100% - 10.3% = 89.7%. A positive result slides the revised probability from 1% to 7.8%, while a negative result slides the revised probability from 1% to 0.22%. So p(cancer|positive)*p(positive) + p(cancer|negative)*p(negative) = 7.8%*10.3% + 0.22%*89.7% = 1% = p(cancer), as expected.
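As a sketch, the conservation rule can be checked numerically, assuming the mammography figures used throughout (variable names are mine):

```python
# Conservation of Probability: the expected posterior over both possible
# test results must always equal the prior.
p_cancer = 0.01
p_pos_given_cancer = 0.80
p_pos_given_healthy = 0.096

p_pos = p_cancer * p_pos_given_cancer + (1 - p_cancer) * p_pos_given_healthy
p_neg = 1 - p_pos

p_cancer_given_pos = p_cancer * p_pos_given_cancer / p_pos          # ~7.8%
p_cancer_given_neg = p_cancer * (1 - p_pos_given_cancer) / p_neg    # ~0.22%

expected_posterior = p_cancer_given_pos * p_pos + p_cancer_given_neg * p_neg
print(round(expected_posterior, 6))  # 0.01 - exactly the prior
```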
Why "as expected"? Let's take a look at the quantities involved:
| p(cancer): | 0.01 | Group 1: 100 women with breast cancer |
| p(~cancer): | 0.99 | Group 2: 9900 women without breast cancer |
| p(positive|cancer): | 80.0% | 80% of women with breast cancer have positive mammographies |
| p(~positive|cancer): | 20.0% | 20% of women with breast cancer have negative mammographies |
| p(positive|~cancer): | 9.6% | 9.6% of women without breast cancer have positive mammographies |
| p(~positive|~cancer): | 90.4% | 90.4% of women without breast cancer have negative mammographies |
| p(cancer&positive): | 0.008 | Group A: 80 women with breast cancer and positive mammographies |
| p(cancer&~positive): | 0.002 | Group B: 20 women with breast cancer and negative mammographies |
| p(~cancer&positive): | 0.095 | Group C: 950 women without breast cancer and positive mammographies |
| p(~cancer&~positive): | 0.895 | Group D: 8950 women without breast cancer and negative mammographies |
| p(positive): | 0.103 | 1030 women with positive results |
| p(~positive): | 0.897 | 8970 women with negative results |
| p(cancer|positive): | 7.80% | Chance you have breast cancer if mammography is positive: 7.8% |
| p(~cancer|positive): | 92.20% | Chance you are healthy if mammography is positive: 92.2% |
| p(cancer|~positive): | 0.22% | Chance you have breast cancer if mammography is negative: 0.22% |
| p(~cancer|~positive): | 99.78% | Chance you are healthy if mammography is negative: 99.78% |
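All sixteen quantities in the table follow from just three inputs - the prior and the two conditional probabilities. A hypothetical helper (names are mine) might compute them like this:

```python
# Derive every quantity in the table from three degrees of freedom.
def all_quantities(p_a, p_b_given_a, p_b_given_not_a):
    p = {}
    p["a"], p["~a"] = p_a, 1 - p_a
    p["b|a"], p["~b|a"] = p_b_given_a, 1 - p_b_given_a
    p["b|~a"], p["~b|~a"] = p_b_given_not_a, 1 - p_b_given_not_a
    p["a&b"] = p["a"] * p["b|a"]        # Group A
    p["a&~b"] = p["a"] * p["~b|a"]      # Group B
    p["~a&b"] = p["~a"] * p["b|~a"]     # Group C
    p["~a&~b"] = p["~a"] * p["~b|~a"]   # Group D
    p["b"] = p["a&b"] + p["~a&b"]
    p["~b"] = 1 - p["b"]
    p["a|b"] = p["a&b"] / p["b"]        # posterior after a positive result
    p["~a|b"] = 1 - p["a|b"]
    p["a|~b"] = p["a&~b"] / p["~b"]     # posterior after a negative result
    p["~a|~b"] = 1 - p["a|~b"]
    return p

q = all_quantities(0.01, 0.80, 0.096)
print(round(q["a|b"], 4), round(q["a|~b"], 4))  # 0.0776 0.0022
```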
One of the common confusions in using Bayesian reasoning is to mix up some or all of these quantities - which, as you can see, are all numerically different and have different meanings. p(A&B) is the same as p(B&A), but p(A|B) is not the same thing as p(B|A), and p(A&B) is completely different from p(A|B). (I don't know who chose the symmetrical "|" symbol to mean "implies", and then made the direction of implication right-to-left, but it was probably a bad idea.)
To get acquainted with all these quantities and the relationships between them, we'll play "follow the degrees of freedom". For example, the two quantities p(cancer) and p(~cancer) have 1 degree of freedom between them, because of the general law p(A) + p(~A) = 1. If you know that p(~cancer) = .99, you can obtain p(cancer) = 1 - p(~cancer) = .01. There's no room to say that p(~cancer) = .99 and then also specify p(cancer) = .25; it would violate the rule p(A) + p(~A) = 1.
p(positive|cancer) and p(~positive|cancer) also have only one degree of freedom between them; either a woman with breast cancer gets a positive mammography or she doesn't. On the other hand, p(positive|cancer) and p(positive|~cancer) have two degrees of freedom. You can have a mammography test that returns positive for 80% of cancerous patients and 9.6% of healthy patients, or that returns positive for 70% of cancerous patients and 2% of healthy patients, or even a health test that returns "positive" for 30% of cancerous patients and 92% of healthy patients. The two quantities, the output of the mammography test for cancerous patients and the output of the mammography test for healthy patients, are in mathematical terms independent; one cannot be obtained from the other in any way, and so they have two degrees of freedom between them.
What about p(positive&cancer), p(positive|cancer), and p(cancer)? Here we have three quantities; how many degrees of freedom are there? In this case the equation that must hold is p(positive&cancer) = p(positive|cancer) * p(cancer). This equality reduces the degrees of freedom by one. If we know the fraction of patients with cancer, and chance that a cancerous patient has a positive mammography, we can deduce the fraction of patients who have breast cancer and a positive mammography by multiplying. You should recognize this operation from the graph; it's the projection of the top bar into the bottom bar. p(cancer) is the left sector of the top bar, and p(positive|cancer) determines how much of that sector projects into the bottom bar, and the left sector of the bottom bar is p(positive&cancer).
Similarly, if we know the number of patients with breast cancer and positive mammographies, and also the number of patients with breast cancer, we can estimate the chance that a woman with breast cancer gets a positive mammography by dividing: p(positive|cancer) = p(positive&cancer) / p(cancer). In fact, this is exactly how such medical diagnostic tests are calibrated; you do a study on 8,520 women with breast cancer and see that there are 6,816 (or thereabouts) women with breast cancer and positive mammographies, then divide 6,816 by 8,520 to find that 80% of women with breast cancer had positive mammographies. (Incidentally, if you accidentally divide 8,520 by 6,816 instead of the other way around, your calculations will start doing strange things, such as insisting that 125% of women with breast cancer and positive mammographies have breast cancer. This is a common mistake in carrying out Bayesian arithmetic, in my experience.) And finally, if you know p(positive&cancer) and p(positive|cancer), you can deduce how many cancer patients there must have been originally. There are two degrees of freedom shared out among the three quantities; if we know any two, we can deduce the third.
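A minimal sketch of this calibration-by-division, including the swapped-division mistake described above (numbers from the text):

```python
# Calibrating p(positive|cancer) from a confirmed-cancer cohort.
cohort = 8520     # women with confirmed breast cancer in the study
positives = 6816  # of those, women with positive mammographies

p_pos_given_cancer = positives / cohort
print(p_pos_given_cancer)  # 0.8

# Dividing the wrong way around yields a "probability" above 1 -
# a sure sign that numerator and denominator were swapped.
wrong = cohort / positives
print(round(wrong, 2))  # 1.25, i.e. the absurd "125%"
```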
How about p(positive), p(positive&cancer), and p(positive&~cancer)? Again there are only two degrees of freedom among these three variables. The equation occupying the extra degree of freedom is p(positive) = p(positive&cancer) + p(positive&~cancer). This is how p(positive) is computed to begin with; we figure out the number of women with breast cancer who have positive mammographies, and the number of women without breast cancer who have positive mammographies, then add them together to get the total number of women with positive mammographies. It would be very strange to go out and conduct a study to determine the number of women with positive mammographies - just that one number and nothing else - but in theory you could do so. And if you then conducted another study and found the number of those women who had positive mammographies and breast cancer, you would also know the number of women with positive mammographies and no breast cancer - either a woman with a positive mammography has breast cancer or she doesn't. In general, p(A&B) + p(A&~B) = p(A). Symmetrically, p(A&B) + p(~A&B) = p(B).
What about p(positive&cancer), p(positive&~cancer), p(~positive&cancer), and p(~positive&~cancer)? You might at first be tempted to think that there are only two degrees of freedom for these four quantities - that you can, for example, get p(positive&~cancer) by multiplying p(positive) * p(~cancer), and thus that all four quantities can be found given only the two quantities p(positive) and p(cancer). This is not the case! p(positive&~cancer) = p(positive) * p(~cancer) only if the two probabilities are statistically independent - if the chance that a woman has breast cancer has no bearing on whether she has a positive mammography. As you'll recall, this amounts to requiring that the two conditional probabilities be equal to each other - a requirement which would eliminate one degree of freedom. If you remember that these four quantities are the groups A, B, C, and D, you can look over those four groups and realize that, in theory, you can put any number of people into the four groups. If you start with a group of 80 women with breast cancer and positive mammographies, there's no reason why you can't add another group of 500 women with breast cancer and negative mammographies, followed by a group of 3 women without breast cancer and negative mammographies, and so on. So now it seems like the four quantities have four degrees of freedom. And they would, except that in expressing them as probabilities, we need to normalize them to fractions of the complete group, which adds the constraint that p(positive&cancer) + p(positive&~cancer) + p(~positive&cancer) + p(~positive&~cancer) = 1. This equation takes up one degree of freedom, leaving three degrees of freedom among the four quantities. If you specify the fractions of women in groups A, B, and D, you can deduce the fraction of women in group C.
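A quick numeric sketch of the normalization constraint (the counts 80, 500, and 3 come from the text; the fourth count is an arbitrary choice of mine):

```python
# Any four nonnegative group counts are allowed; normalizing them into
# probabilities uses up one degree of freedom, leaving three.
groups = {"A": 80, "B": 500, "C": 3, "D": 417}  # D chosen arbitrarily
total = sum(groups.values())
probs = {name: count / total for name, count in groups.items()}
print(round(sum(probs.values()), 10))  # 1.0 - the normalization constraint
```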
Given the four groups A, B, C, and D, it is very straightforward to compute everything else: p(cancer) = A + B, p(~positive|cancer) = B / (A + B), and so on. Since ABCD contains three degrees of freedom, it follows that the entire set of 16 probabilities contains only three degrees of freedom. Remember that in our problems we always needed three pieces of information - the prior probability and the two conditional probabilities - which, indeed, have three degrees of freedom among them. Actually, for Bayesian problems, any three quantities with three degrees of freedom between them should logically specify the entire problem. For example, let's take a barrel of eggs with p(blue) = 0.40, p(blue|pearl) = 5/13, and p(~blue&~pearl) = 0.20. Given this information, you can compute p(pearl|blue).
As a story problem:
Suppose you have a large barrel containing a number of plastic eggs. Some eggs contain pearls, the rest contain nothing. Some eggs are painted blue, the rest are painted red. Suppose that 40% of the eggs are painted blue, 5/13 of the eggs containing pearls are painted blue, and 20% of the eggs are both empty and painted red. What is the probability that an egg painted blue contains a pearl?
Try it - I assure you it is possible.
As a check on your calculations, does the (meaningless) quantity p(~pearl|~blue)/p(pearl) roughly equal .51? (In story problem terms: The likelihood that a red egg is empty, divided by the likelihood that an egg contains a pearl, equals approximately .51.) Of course, using this information in the problem would be cheating.
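If you want to verify your arithmetic afterward, here is one possible solution path sketched in Python with exact fractions (variable names are mine):

```python
from fractions import Fraction as F

# Three quantities with three degrees of freedom specify the whole problem:
# p(blue), p(blue|pearl), and p(~blue&~pearl).
p_blue = F(2, 5)               # 40% of eggs are painted blue
p_blue_given_pearl = F(5, 13)  # 5/13 of pearl eggs are painted blue
p_red_and_empty = F(1, 5)      # 20% of eggs are both empty and red

p_red = 1 - p_blue
p_red_and_pearl = p_red - p_red_and_empty       # p(~blue&pearl)
p_red_given_pearl = 1 - p_blue_given_pearl      # 8/13
p_pearl = p_red_and_pearl / p_red_given_pearl   # deduce the prior
p_pearl_and_blue = p_pearl * p_blue_given_pearl
p_pearl_given_blue = p_pearl_and_blue / p_blue

print(p_pearl_given_blue)  # 5/8, i.e. 62.5%

# Sanity check from the text: p(~pearl|~blue) / p(pearl) should be ~.51
check = (p_red_and_empty / p_red) / p_pearl
print(float(check))  # ~0.513
```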
If you can solve that problem, then when we revisit Conservation of Probability, it seems perfectly straightforward. Of course the mean revised probability, after administering the test, must be the same as the prior probability. Of course strong but rare evidence in one direction must be counterbalanced by common but weak evidence in the other direction.
In terms of the four groups:
p(cancer|positive) = A / (A + C)
p(positive) = A + C
p(cancer&positive) = A
p(cancer|~positive) = B / (B + D)
p(~positive) = B + D
p(cancer&~positive) = B
p(cancer) = A + B
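In code, the group identities above make the conservation rule nearly a one-liner (a sketch using the four group probabilities from the table):

```python
# Conservation of Probability in terms of the four groups:
# p(c|pos)p(pos) + p(c|neg)p(neg) = A + B = p(cancer).
A, B, C, D = 0.008, 0.002, 0.095, 0.895  # groups from the mammography table

p_cancer_given_pos = A / (A + C)
p_cancer_given_neg = B / (B + D)
total = p_cancer_given_pos * (A + C) + p_cancer_given_neg * (B + D)
print(round(total, 10))  # 0.01 = A + B = p(cancer)
```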
Let's return to the original barrel of eggs - 40% of the eggs containing pearls, 30% of the pearl eggs painted blue, 10% of the empty eggs painted blue. The graph for this problem is:
What happens to the revised probability, p(pearl|blue), if the proportion of eggs containing pearls is kept constant, but 60% of the eggs with pearls are painted blue (instead of 30%), and 20% of the empty eggs are painted blue (instead of 10%)? You could type 60% and 20% into the inputs for the two conditional probabilities, and see how the graph changes - but can you figure out in advance what the change will look like?
If you guessed that the revised probability remains the same, because the bottom bar grows by a factor of 2 but retains the same proportions, congratulations! Take a moment to think about how far you've come. Looking at a problem like
1% of women have breast cancer. 80% of women with breast cancer get positive mammographies. 9.6% of women without breast cancer get positive mammographies. If a woman has a positive mammography, what is the probability she has breast cancer?
the vast majority of respondents intuit that around 70-80% of women with positive mammographies have breast cancer. Now, looking at a problem like
Suppose there are two barrels containing many small plastic eggs. In both barrels, some eggs are painted blue and the rest are painted red. In both barrels, 40% of the eggs contain pearls and the rest are empty. In the first barrel, 30% of the pearl eggs are painted blue, and 10% of the empty eggs are painted blue. In the second barrel, 60% of the pearl eggs are painted blue, and 20% of the empty eggs are painted blue. Would you rather have a blue egg from the first or second barrel?
you can see it's intuitively obvious that the probability of a blue egg containing a pearl is the same for either barrel. Imagine how hard it would be to see that using the old way of thinking!
It's intuitively obvious, but how to prove it? Suppose that we call P the prior probability that an egg contains a pearl, that we call M the first conditional probability (that a pearl egg is painted blue), and N the second conditional probability (that an empty egg is painted blue). Suppose that M and N are both increased or diminished by an arbitrary factor X - for example, in the problem above, they are both increased by a factor of 2. Does the revised probability that an egg contains a pearl, given that we know the egg is blue, stay the same?
- p(pearl) = P
- p(blue|pearl) = M*X
- p(blue|~pearl) = N*X
- p(pearl|blue) = ?
- Group A: p(pearl&blue) = P*M*X
- Group B: p(pearl&~blue) = P*(1 - (M*X))
- Group C: p(~pearl&blue) = (1 - P)*N*X
- Group D: p(~pearl&~blue) = (1 - P)*(1 - (N*X))
- p(pearl|blue) = A / (A + C) = P*M*X / (P*M*X + (1 - P)*N*X) = P*M / (P*M + (1 - P)*N)

The factor X cancels out of the numerator and denominator, so the revised probability p(pearl|blue) stays the same no matter what X is.
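The same algebra can be sketched numerically: scale both conditional probabilities by the same factor X and watch p(pearl|blue) hold still while p(pearl|~blue) moves (a hypothetical helper; names are mine):

```python
def posteriors(prior, p_blue_pearl, p_blue_empty):
    """Return (p(pearl|blue), p(pearl|~blue)) from the three inputs."""
    a = prior * p_blue_pearl              # Group A: pearl & blue
    b = prior * (1 - p_blue_pearl)        # Group B: pearl & red
    c = (1 - prior) * p_blue_empty        # Group C: empty & blue
    d = (1 - prior) * (1 - p_blue_empty)  # Group D: empty & red
    return a / (a + c), b / (b + d)

barrel1 = posteriors(0.40, 0.30, 0.10)
barrel2 = posteriors(0.40, 0.60, 0.20)  # both conditionals doubled (X = 2)

print(barrel1)  # p(pearl|blue) ~0.667, p(pearl|~blue) ~0.341
print(barrel2)  # p(pearl|blue) ~0.667 again - but p(pearl|~blue) is 0.25
```

The first component is identical for both barrels; the second is not, which is exactly the point made below about likelihood ratios not fixing the meaning of a negative result.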
|Q. Suppose that there are two barrels, each containing a number of plastic eggs. In both barrels, some eggs are painted blue and the rest are painted red. In the first barrel, 90% of the eggs contain pearls and 20% of the pearl eggs are painted blue. In the second barrel, 45% of the eggs contain pearls and 60% of the empty eggs are painted red. Would you rather have a blue pearl egg from the first or second barrel?|
A. Actually, it doesn't matter which barrel you choose! Can you see why?
The probability that a test gives a true positive divided by the probability that a test gives a false positive is known as the likelihood ratio of that test. Does the likelihood ratio of a medical test sum up everything there is to know about the usefulness of the test?
No, it does not! The likelihood ratio sums up everything there is to know about the meaning of a positive result on the medical test, but the meaning of a negative result on the test is not specified, nor is the frequency with which the test is useful. If we examine the algebra above, while p(pearl|blue) remains constant, p(pearl|~blue) may change - the X does not cancel out. As a story problem, this strange fact would look something like this:
Bayes’ Rule is a way of calculating conditional probabilities. It is difficult to find an explanation of its relevance that is both mathematically comprehensive and easily accessible to all readers. This article tries to fill that void, by laying out the nature of Bayes’ Rule and its implications for clinicians in a way that assumes little or no background in probability theory. It builds on Meehl and Rosen's (1955) classic paper, by laying out algebraic proofs that they simply allude to, and by providing extremely simple and intuitively accessible examples of the concepts that they assumed their reader understood.
Keywords: probability, diagnosis, Bayes theory, base rates
Conditional probabilities are those probabilities whose value depends on the value of another probability. Such probabilities are ubiquitous. For example, we may wish to calculate the probability that a particular patient has a disease, given the presence of a particular set of symptoms. The probability of disease may be more or less close to certain, depending on the nature and number of symptoms. We will certainly wish to take into account a patient's relevant prior history with medication (e.g., the known probability of responding) before prescribing medication (Belmaker et al., 2010). Or we may wish to take into account factors (such as defensiveness) that might impact success in psychotherapy before we begin that therapy (Zanarini et al., 2009). More generally, restating all these specific cases in a more abstract way, we may wish to calculate the probability that a given hypothesis is true, given a diverse set of evidence (say, results from several diagnostic instruments) for or against it. Hypothesis testing is just one way of assigning weight to belief. Conditional probabilities come into play when we wish to decide how much confidence to assign to such beliefs as “this patient will respond to this intervention,” “this person should receive this specific diagnosis,” or “it is worth incorporating this method into my clinical practice.”
A very simple example of conditional probability will elucidate its nature. Consider the question: How likely is it that you would win the jackpot in a lottery if you didn't have a lottery ticket? It should be obvious that the answer is zero – you certainly could not win if you didn't even have a ticket. It may be equally obvious that you are more likely to win the lottery the more tickets you buy. So the probability of winning a lottery is really a conditional probability, where your odds of winning are conditional on the number of tickets you have purchased. If you have zero tickets, then you have no chance of winning. With one ticket, you have a small chance to win. With two tickets, your odds will be twice as good.
We symbolize conditionality by using a vertical slash “ | ”, which can be read as “given.” Then the odds of winning a lottery with one ticket could be expressed as P(Winning | One ticket). There are many “keywords” in a problem's definition that may (but need not necessarily) suggest that you are dealing with a problem of conditional probability. Phrases like “given,” “if,” “with the constraint that,” “assuming that,” “under the assumption that” and so on all suggest that there may be a conditional clause in the problem.
One thing that sometimes confuses students of probability is the fact that all probability problems are really conditional. Consider the simple probability question: “What is the probability of getting a head with a coin toss?” The question implicitly assumes that the coin is fair (that is, that heads and tails are equally probable), and should really be phrased “What is the probability of getting a head with a coin toss, given that the coin is fair?” Non-conditional probability problems conceal their conditional clause in the background assumptions that either explicitly or implicitly limit the domain in which the probability calculation is supposed to apply.
This observation sheds light on what conditionality actually does. A condition always serves exactly this role: to limit the domain in which the “non-conditional” portion of the question is supposed to apply. When you are asked “What is the probability of getting a head with a coin toss?” you are supposed to understand that we are limiting the domain to which the question applies by considering only fair coins. When you are asked “What is the probability that you have disease X, given that you have symptom Y?,” you are supposed to understand that the probability calculation only applies to those people who do have symptom Y. An appropriate way of thinking about conditional probability is to understand that a conditional limits the number and kind of cases you are supposed to consider. You can think of the vertical slash as meaning something like “ignoring everything to which the following constraint does not apply.” So “What is the probability of getting a head with a coin toss, given that the coin is fair?” means “What is the probability of getting a head with a coin toss, ignoring every coin to which the following statement does not apply: The coin is fair.”
Bayes’ Rule and other methods of solving conditional probability questions are simply mathematical means of limiting the domain across which a calculation is being computed. To see that this is so, consider the following simple question:
Three tall and two short men went on a picnic with four tall and four short women. What is P(Tall | Female), the probability that a person is tall, given that the person is female?
The solution to this problem may be immediately obvious, but it is worth working through a few ways of solving it. These are all formally the same, though they may appear to be different.
The first way is just to turn the question into a very simple non-conditional question that we know how to solve. Following the discussion above, the question can be re-phrased to say “What is the probability that a person is tall, ignoring everyone who is not a woman?” If we ignore the men, we have a really simple question, viz. “Four tall and four short women went on a picnic. What is the probability that a woman who went on the picnic was tall?” This is simple (that is, non-conditional) probability. Like any simple probability question, it can be solved by dividing the number of ways the outcome of interest (“being tall”) can happen by the number of ways any outcome in the domain (“being a woman”) can happen. So: 4 tall women / (4 tall women + 4 short women) = 0.5 probability that a person on the picnic was tall, given that she was a woman.
A formally identical way of solving the same problem can be seen by drawing a 2 × 2 table such as the following:

| | Female | Male |
| Tall | 4 | 3 |
| Short | 4 | 2 |

The condition “Given that she was a female” means that we can simply ignore the rightmost column of this box, the males, and act as if the question about the probability of being tall only applied to the leftmost column, the women.
Here comes the tricky part. This diagram makes clear what the question is asking: What is the ratio of people who are both tall and female (top left cell) to people who are female (sum of left column)? We can re-state this and solve the problem in a third way by asking: What is the ratio of the probability that a person is both female and tall to the probability that a person is female? To see why, consider the concrete example again. There were 13 people on the picnic. Since 4 were tall females, the probability of being a tall female is 4/13. Since 8 were females, the probability of being female was 8/13. The ratio of people who were both tall and female to people who were female is therefore (4/13) / (8/13), or 4/8, or 50%. The reason this may seem “tricky” is that here we consider the domain as a whole – all people who went on the picnic – and then take the ratio of two subsets within that domain.
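The agreement between the filtering method and the ratio-of-probabilities method can be sketched directly (a minimal illustration; variable names are mine):

```python
# The picnic problem two ways: filter to the conditioned domain, or take
# a ratio of two probabilities over the whole domain. Both must agree.
tall_women, short_women = 4, 4
tall_men, short_men = 3, 2
everyone = tall_women + short_women + tall_men + short_men  # 13 picnickers

# 1) Filter to the conditioned domain (women only), then count.
way1 = tall_women / (tall_women + short_women)

# 2) Ratio of joint probability to marginal probability.
p_tall_and_female = tall_women / everyone            # 4/13
p_female = (tall_women + short_women) / everyone     # 8/13
way2 = p_tall_and_female / p_female

print(way1, way2)  # 0.5 0.5
```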
If you understand this third method of calculating the conditional probability, then you will understand Bayes’ Rule. Bayes’ Rule is a way to “automatically” pick out this very same ratio: the ratio of the probability of being in the cell of interest (in this case, the cell consisting of tall and female picnickers) to the probability of being in the sub-domain of interest that is specified by the conditional clause (in this case, women, a subset of all the people who went on the picnic).
Before we look at how the math works, let's introduce the rule itself.
Bayes’ Rule is very often referred to as Bayes’ Theorem, but it is not really a theorem, and should more properly be referred to as Bayes’ Rule (Hacking, 2001). In either case, it is so-called because it was first stated (in a different form than we consider here) by Reverend Thomas Bayes in his “Essay toward solving a problem in the doctrine of chances,” which was published in the Philosophical Transactions of the Royal Society of London in 1764. Bayes was a minister interested in probability and stated a form of his famous rule in the context of solving a somewhat complex problem involving billiard balls that need not concern us here.
Bayes’ Rule has many analogous forms of varying degrees of apparent complexity. This paper concerns itself almost entirely with the simplest form, which covers the cases in which two sets of mutually exclusive possibilities A and B are considered, and where the total probability in each set is 1. At the end of the paper we will briefly examine how this most simple case is just a specific case of a more general form of Bayes’ Rule. The simplest case covers many diagnostic situations, in which the patient either has or does not have a diagnosable condition (possibility set A) and either has or does not have a set of symptoms (possibility set B). For such cases, Bayes’ Rule can be used to calculate P(A | B), the probability that the patient has the condition given the symptom set. Bayes’ Rule says that:
P(A | B) = P(B | A) P(A) / P(B)
P(A) is called the marginal or prior probability of A, since it is the probability of A prior to having any information about B. Similarly, the term P(B) is the marginal or prior probability of B. Because it does depend on having information about B, the term P(A | B) is called the posterior probability of A given B. The term P(B | A) is called the likelihood function for B given A.
In the third solution to the example above, we solved for the probability of being tall, given that a picnicker is female, by considering the ratio of those who were both tall and female to those who were female:
P(Tall | Female) = P(Tall & Female)/P(Female)
This suggests that Bayes’ Rule can also be stated in the following form:
P(A | B) = P(A & B) / P(B)
From this it should be evident, by equating the numerators of the two equations above, that:
P(A & B) = P(B | A) P(A)
This is true by the definition of “&.” Let us try to understand why this is so, by again considering the picnic with three tall and two short men and four tall and four short women. We have already convinced ourselves that P(Female & Tall) is 4/13, because there are 4 people in the cell of interest and thirteen people in the problem's domain. Let's see how the definition agrees with this answer. The definition above says that P(Female & Tall) = P(Tall | Female)P(Female). P(Tall | Female), the probability of a picnicker being tall given that she is female, is 4/8. P(Female) is 8/13, because eight of the thirteen people on the picnic are females. 4/8 multiplied by 8/13 is 4/13.
Note that it is equally correct to write that:
P(A & B) = P(A | B) P(B)
In other words:
P(B | A)P(A) = P(A | B) P(B)
Let's see why using the same example. Now we will see that P(Female & Tall) = P(Female | Tall)P(Tall). P(Female | Tall), the probability of a picnicker being female given that he or she is tall, is 4/7, because there are four tall females and seven tall people altogether. P(Tall) is 7/13, because seven of the 13 people on the picnic are tall. 4/7 multiplied by 7/13 is 4/13.
If you go back and look at the 2 × 2 table above, you should be able to understand why these two calculations of P(A & B) must be the same. The first calculation picks out the cell of tall females by column. The second picks it out by row. It doesn't matter if you concern yourself with females who are tall or tall people who are females – in the end you must get to the same answer if you want to know about people who are both tall and female. A tall female person is also a female tall person.
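The agreement of the column-wise and row-wise calculations can be checked mechanically. Below is a small Python sketch with the picnic counts from the example hard-coded; exact fractions are used so the arithmetic matches the text:

```python
from fractions import Fraction

# Picnic counts from the example: 3 tall men, 2 short men, 4 tall women, 4 short women.
tall_female, short_female = Fraction(4), Fraction(4)
tall_male, short_male = Fraction(3), Fraction(2)
total = tall_female + short_female + tall_male + short_male  # 13 picnickers

p_female = (tall_female + short_female) / total                   # 8/13
p_tall = (tall_female + tall_male) / total                        # 7/13
p_tall_given_female = tall_female / (tall_female + short_female)  # 4/8
p_female_given_tall = tall_female / (tall_female + tall_male)     # 4/7

# P(Female & Tall) computed by column and by row:
by_column = p_tall_given_female * p_female  # 4/8 * 8/13 = 4/13
by_row = p_female_given_tall * p_tall       # 4/7 * 7/13 = 4/13
print(by_column, by_row)  # 4/13 4/13
```

Both routes to the tall-female cell give the same 4/13, just as the 2 × 2 table suggests they must.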
So now we have
P(A | B) = P(B | A)P(A)/P(B) = P(A | B)P(B)/P(B)
Although either form will give the same answer, the first form is the “canonical” form of Bayes’ Rule, for a reason that should be obvious: because the second form contains the same element on the right, P(A | B), as the left element that we are trying to calculate. If we already know P(A | B), then we don't need to compute it. If we don't know it, then it will not help us to include it in the equation we will use to calculate it.
Bayes’ Rule can be easily derived from the definition of P(A | B), in the following manner:
P(A | B) = P(A & B)/P(B) [1. By definition]
P(B | A) = P(A & B)/P(A) [2. By definition]
P(B | A) P(A) = P(A & B) [3. Multiply (2) by P(A)]
P(A | B) P(B) = P(B | A) P(A) [4. Multiply (1) by P(B) and substitute (3)]
P(A | B) = P(B | A) P(A)/P(B) [5. Divide (4) by P(B): Bayes’ Rule]
It might seem at first glance that Bayes’ Rule cannot be a very helpful rule, because it says that to solve a conditional probability P(A | B) you have to know another conditional probability P(B | A). However, Reverend Bayes’ insight was that in many cases the second conditional is knowable when the first is not. In diagnostic cases where we are trying to calculate P(Condition | Symptom), we often know P(Symptom | Condition), the probability that you have the symptom given the condition, because this data has been collected from previous confirmed cases.
Implications of Bayes’ Rule
Bayes’ Rule is very simple. However, its implications are often unexpected. Many studies have shown that people of all kinds – even those who are trained in probability theory – tend to be very poor at estimating conditional probabilities. It seems to be a kind of innate incompetence in our species. As a result, people are often surprised by what Bayes’ Rule tells them.
Let us consider a concrete example given in Meehl and Rosen (1955), from which much of the discussion in this section is drawn. A particular disorder has a base rate occurrence of 1/1000 people. A test to detect this disease has a false positive rate of 5% – that is, 5% of people who do not have the disease will nevertheless test positive. Assume that the false negative rate is 0% – the test correctly diagnoses every person who does have the disease. What is the chance that a randomly selected person with a positive result actually has the disease?
When this question was posed to Harvard University medical students, about half said that the answer was 95%, presumably because the test has a 5% false positive rate. The average response was 56%. Only 16% gave the correct answer, which can be computed with Bayes’ Rule in the following manner:
Let: P(A) = Probability of having the disease = 0.001
P(B) = Probability of positive test
= Sum of probabilities of all the mutually exclusive ways to get a positive test
= Probability of true positive + probability of false positive
= (Positive base rate × Percent correctly identified) + (Negative base rate × Percent incorrectly identified)
= (0.001 × 1) + (0.999 × 0.05)
P(B | A) = Probability of positive test given disease = 1
Then: P(A | B) = P(B | A) P(A)/P(B)
= (1 × 0.001)/(0.051)
= 0.02, or 2%
Although the test is highly accurate, a positive result is in fact correct just 2% of the time. How can this be? The answer (and the importance of Bayes’ Rule in diagnostic situations) lies in the highly skewed base rate of the disease. Since so few people actually have the disease, the probability of a true positive test result is very small. It is swamped by the probability of a false positive result, which is fifty times larger than the probability of a true positive result.
You can concretely understand how the false positive rate swamps the true positive rate by considering a population of 10,000 people who are given the test. Just 1/1000th of those people, or 10, will actually have the disease and therefore a true positive test result. However, 5% of the remaining 9990 people, or approximately 500 people, will have a false positive test result. So the probability that a person has the disease given that they have a positive test result is 10/510, or approximately 2%.
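The arithmetic of this example can be sketched in Python. The function below is a hypothetical helper, not part of Meehl and Rosen's paper; it simply encodes Bayes' Rule with the expanded denominator used above:

```python
def posterior(base_rate, true_pos_rate, false_pos_rate):
    """P(disease | positive test) via Bayes' Rule.

    base_rate      -- P(A), prior probability of the disease
    true_pos_rate  -- P(B | A), probability of a positive test given the disease
    false_pos_rate -- P(B | not A), probability of a positive test without the disease
    """
    # P(B) = sum of the mutually exclusive ways to get a positive test.
    p_positive = base_rate * true_pos_rate + (1 - base_rate) * false_pos_rate
    return true_pos_rate * base_rate / p_positive

# Meehl and Rosen's disease example: 1/1000 base rate,
# perfect sensitivity, 5% false positive rate.
p = posterior(0.001, 1.0, 0.05)
print(round(p, 3))  # 0.02
```

Running this reproduces the 2% figure that surprised so many of the medical students.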
Many cases are more subtle. Consider another case cited by Meehl and Rosen (1955). This involved a test to detect psychological adjustment in soldiers. The authors of the instrument validated their test by giving it to 415 soldiers known to be well-adjusted and 89 soldiers known to be maladjusted. The test correctly diagnosed 55% of the maladjusted soldiers as maladjusted, and incorrectly diagnosed only 19% of the well-adjusted soldiers. Since the true positive rate (55%) is much higher than the false positive rate (19%), the authors believed their test was good. However, they failed to take base rates into account. Meehl and Rosen did not know P(Maladjusted), the probability that a randomly selected soldier was maladjusted, but they guessed that it might be as high as 5%. With this estimate, we can use Bayes’ Rule as follows:
Let P(M) = Probability of being maladjusted = 0.05, by assumption
Let P(D) = Probability of being diagnosed as being maladjusted.
= Probability of true positive + probability of false positive
= (Positive base rate × Percent correctly identified) + (Negative base rate × Percent incorrectly identified)
= (0.55 × 0.05) + (0.95 × 0.19)
P(D | M) = Probability of being diagnosed, given maladjustment.
=0.55, as found by the authors.
P(M | D) = Probability of maladjustment given diagnosis as maladjusted
=P(D | M)P(M)/P(D) [Bayes’ Rule]
=0.13 or 13%
When base rates are taken into account, the probability that the test's positive diagnosis is correct is just 13%, not the 55% suggested by the true positive rate. The test is still better than guessing that everyone is maladjusted: with that strategy, only 5% of positive diagnoses would be correct. However, note that the test's diagnosis of maladjustment is much more likely to be wrong (87% probability) than right (13% probability).
Of course clinicians prefer to make diagnoses that are more likely to be right than wrong. We can state this desire more formally by saying that we prefer the fraction of the population that is diagnosed correctly to be greater than the fraction of the population that is diagnosed incorrectly. Mathematically this leads to a useful conclusion in the following manner:
Fraction diagnosed correctly > Fraction diagnosed incorrectly
Fraction diagnosed incorrectly / Fraction diagnosed correctly < 1
Let D = Diseased and S = Selected as positive by the test (“∼” means “not”)
P(∼D & S)/P(D & S) < 1 [Substitute symbols]
P(S | ∼D)P(∼D)/P(S | D)P(D) < 1 [By definition of “&”]
P(S | ∼D)/P(S | D)P(D) < 1/P(∼D) [Divide by P(∼D)]
P(S | ∼D)/P(S | D) < P(D)/P(∼D) [Multiply by P(D)]
In English this can be expressed as:
False positive rate/True positive rate < Positive base rate/Negative base rate
We need the ratio of positive to negative base rates to be greater than the ratio of the false positive rate to the true positive rate, if we want to be more likely to be right than wrong.
This can be a handy heuristic because it allows us to calculate how prevalent a condition must be in the population we are working with in order for our diagnostic methods to be useful. In the example above, the ratio of the false positive to the true positive rate is 0.19/0.55, or 0.34. This means that the test can only be useful – in the sense of having a positive diagnosis that is more likely to be true than false – when it is used in settings in which the ratio of maladjusted people (positive base rate) to people who are not maladjusted (negative base rate) is at least 0.34.
Again we can consider one example from Meehl and Rosen (1955). Imagine that you have a test that correctly identifies 80% of brain-damaged patients, but also misidentifies 15% of non-brain-damaged people. The calculation above says that this test will only be reliable if the ratio of brain-damaged to non-brain-damaged people is greater than 0.15/0.80, or about 0.19. If we use the test in a setting with a lower ratio of brain-damaged people, we will run into the problem described above, in which the base rates make it more likely that we are wrong than right when we make a diagnosis.
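Meehl's heuristic is easy to encode. The sketch below (a hypothetical helper, not from the source) computes the minimum ratio of positive to negative base rates for the brain-damage test:

```python
def minimum_base_rate_ratio(true_pos_rate, false_pos_rate):
    """Smallest ratio of positive to negative base rates at which
    a positive diagnosis is more likely to be right than wrong
    (Meehl's heuristic: FP/TP < P/Q)."""
    return false_pos_rate / true_pos_rate

# Brain-damage example: 80% true positive rate, 15% false positive rate.
ratio = minimum_base_rate_ratio(0.80, 0.15)
print(round(ratio, 2))  # 0.19

# The same threshold expressed as a minimum proportion of the
# population, via r/(1 + r):
print(round(ratio / (1 + ratio), 2))  # 0.16
```

As the text notes, below this threshold a positive diagnosis is more often wrong than right.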
As another example, let us consider an analysis of the utility of the screening version of the Psychopathy Checklist (PCL:SV; Hart et al., 1995) in predicting violence within a year after discharge from a civil psychiatric institute. Skeem and Mulvey (2001) report that “a threshold of approximately 8 [much lower than the cut off of 17 for probable diagnoses of psychopathy] simultaneously maximizes the sensitivity and specificity of the PCL:SV in predicting violence in this sample” (p. 365). They therefore suggest 8 as the optimal cut-off. With that cut-off, the test has a true positive rate (sensitivity) of 0.72. It has a true negative rate (specificity) of 0.65, and therefore a false positive rate of 0.35. The ratio of false to true positives is thus 0.35/0.72, or 0.486. By the heuristic above, a positive prediction of violence with this cut-off will be right more often than wrong only if the ratio of violent to nonviolent people in the sample is at least 0.486. In the sample, 245/871 or 28% were actually violent, a ratio of violent to nonviolent individuals of about 0.39. A person would be more accurate than using the cut-off if she simply guessed that no one will be violent, since she would then correctly classify the 72% of the discharged who will not be. With a higher cut-off of 16, the true positive rate is just 0.21, but the false positive rate plummets from 0.35 to 0.06. This gives a ratio of false to true positives of 0.06/0.21, or 0.286, which is below the sample's actual ratio of violent to nonviolent individuals, suggesting this (or a higher) cut-off point is better from the point of view of maximizing accuracy. In this case, the mathematical result is somewhat equivocal because of the unequal costs of making false positive and false negative identifications. The rate of identifying future violence is certainly very poor with the prescribed cut-off of 8.
The ratio of false to true positives shows that a person using this cut-off will do no better than flipping a coin: because the sample's ratio of violent to nonviolent individuals (about 0.39) falls short of the required 0.486, his positive predictions of violence will be wrong at least as often as they are right. However, it may be more desirable to err on the side of conservatism by incorrectly treating 35% of people as likely to be violent than to lower the overall error rate (by raising the cut-off above 16) at the cost of missing 79% of the people who actually will be violent. Sometimes we have pragmatic reasons to prefer one kind of inaccuracy to another.
Note that Meehl's heuristic does not mean that the true population base rate must be as high as the calculation prescribes – it is sufficient for the base rate of the subpopulation to which the test is exposed to be high enough. If the test is used in settings (such a mental clinic to which front-line physicians refer) that have “higher concentration” of maladjusted subjects than the general population as a result on non-random sampling of that population, then the test may be useful in that setting, even though it may not be reliable if subjects were randomly selected from the population as a whole. For example, Fontaine et al. (2001) looked at how an elevated t-score on the Minnesota Multiphasic Personality Inventory-Adolescent (MMPI-A; Butcher et al., 1992) was able to classify subjects as “normal” or “clinical” in an inpatient sample with a base rate of 50% versus a normative sample with a base rate of 20%. They found, as Bayes’ Rule guarantees they must, that “the classification accuracy hit rates generally increased as the clinical base rate increased from 20 to 50% of the total sample” (p. 276).
This ability to skew true diagnosis rates in a favorable direction by pre-selecting subjects has important implications. In most of the examples we have considered so far, we have assumed low base rates. If the base rates are very high, an opposite issue arises: it becomes increasingly less worthwhile to give a diagnostic test if the base rate odds of the diagnosis are very high to begin with, because test results may add so little certainty to the base rate as to make it not worth the effort (or risk) of administering the tests. A recent practical example with a very strong result concerns the use of the Wada test, an invasive, potentially dangerous, and expensive test for determining language lateralization prior to surgery. The test involves injecting sodium amytal into each internal carotid artery to anesthetize each cerebral hemisphere independently. Kemp et al. (2008) looked at 141 consecutive administrations of the Wada test. One key finding was that no patient failed the test who had both a right temporal lesion and a stronger verbal than visual memory test result. The memory test result is also a key piece of lateralizing evidence (suggesting left-lateralized language) that can be obtained relatively cheaply and safely. Based on the base rate information for this particular subset of patients with right temporal lesions and clear memory test results, Kemp et al. concluded that “this group of patients is at negligible risk of failing the Wada test and the risks of the procedure probably outweigh the information obtained” (p. 632).
This is one “degenerate” case, in which the base rate in one subsample of interest went 100% in one direction, eliminating the possibility that another test could add any further certainty to the diagnostic question of language lateralization. The degenerate case in the opposite direction – when base rates are 0% – has equally clear implications: except perhaps as a confirmation of the continuing absence of the disease in a population, it is a waste of resources to test for a condition that no one has. Between 0 and 100%, the implications of a conditional probability, such as the probability that a person has a disease given a positive test result, become more severe as the base rate moves away from 0.5 in either direction. The further the base rate is from 50/50, the further the posterior probability P(A | B) departs from the simple “hit rate,” given by the ratio of the true positive rate to the overall positive diagnosis rate (the sum of the true and false positive rates).
Mathematically, we can see this by expanding the canonical form of Bayes’ Rule given above, just as we did with the example of the maladjusted soldiers above:
Let P(C) = Probability of belonging to the diagnostic category
Let TP = True positive rate = P(Diagnosed | C)
Let FP = False positive rate = P(Diagnosed | ∼C)
Let B = Base rate of the diagnostic category
Let P(D) = Probability of being diagnosed as belonging to the category.
= Probability of true positive + probability of false positive
= (Positive base rate × Percent correctly identified) + (Negative base rate × Percent incorrectly identified)
= (B × TP) + ((1 − B) × FP)
P(C | D) = Probability of belonging to the category given diagnosis
= P(D | C)P(C)/P(D) [Bayes’ Rule]
= (TP × B)/[(B × TP) + ((1 − B) × FP)] [Substitute P(D)]
= (TP × 0.5)/[(0.5 × TP) + (0.5 × FP)] [Let the base rate B = 0.5]
= TP/(TP + FP) [Divide through by 0.5]
Along with the extreme cases considered above (100% or 0% base rates), this case of 50% base rates is another “degenerate” case of Bayes’ Rule, in which the rule is not really needed. When the base rate of a disorder is 50%, the conditional collapses to the simple (i.e., unconditional) probability that is given by the ratio of the probability of getting diagnosed correctly to the probability of getting diagnosed at all, whether correctly or not. One way of understanding what is happening in this case is to note that the true and false positive rates are sampling equally from the population. When this is so, we don't need to bother to “weight” their respective contributions to the conditional probability of belonging to the category given a diagnosis.
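The collapse at a 50% base rate can be verified numerically. In this Python sketch the function simply encodes the expansion of P(C | D) derived above, and the rates are taken from the maladjusted-soldier example:

```python
def posterior(base_rate, tp, fp):
    """P(C | D) = TP*B / (B*TP + (1 - B)*FP), the expanded form of Bayes' Rule."""
    return tp * base_rate / (base_rate * tp + (1 - base_rate) * fp)

tp, fp = 0.55, 0.19  # maladjusted-soldier true and false positive rates

# At a 50% base rate the conditional collapses to the simple hit rate TP/(TP + FP):
print(round(posterior(0.5, tp, fp), 3))  # 0.743
print(round(tp / (tp + fp), 3))          # 0.743

# Away from 50% the two diverge (base rate 5%, as in the soldier example):
print(round(posterior(0.05, tp, fp), 2))  # 0.13
```

With equal base rates the weighting terms cancel, which is exactly why Bayes' Rule is not needed in this degenerate case.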
A concrete example may make this interpretation more clear. Consider the conditional probability of having blue eyes, given that you are female. Since eye color is not a sex-linked character, the conditional is the same for both those who are in the group of interest (females) and those who are not (males). You may be able to intuit in this case that the conditional is therefore irrelevant: that is, the probability of being blue-eyed given that you are female is just the same as the probability of being blue-eyed.
This degenerate case of exactly equal base rates with and without the character of interest may occur only rarely, but the general principle it illustrates is of wider relevance for the reason noted above: the further the positive and negative base rates are from being equal, the greater the difference between the conditional probability that depends on that base rate and the simple probability given by the ratio of the probability of getting diagnosed correctly to the probability of getting diagnosed at all (that is, the ratio of the true positives to the sum of the true and false positives).
Intuitively, this makes sense for the following reasons. Insofar as a disease is less common, it becomes more likely that a larger portion of the positives are false positives, as in the case considered above that bamboozled so many of the Harvard medical students. By the same token, insofar as a disease is more common, it becomes more likely that many of the negative diagnoses are false. At some point as base rates increase, they may come to exceed the ability of the test to identify them, rendering the test worse than guessing, as discussed above.
Bayes’ Rule may be easily generalized to incorporate multiple pieces of evidence bearing on a single belief, hypothesis, or diagnosis, or to incorporate multiple pieces of evidence bearing on multiple beliefs, hypotheses, or diagnoses.
The simplest way to “extend” Bayes’ Rule is to note that the posterior probability may depend on more than one piece of evidence. This is not an extension at all, since we noted at the beginning that what is given in a conditional may be a set of evidence rather than a single piece of evidence. However, it is worth emphasizing this point, since so many of the examples considered in this paper have treated the conditional as a single piece of evidence. Given a belief, hypothesis, or diagnosis H, and a single relevant piece of evidence E1, we have seen how to compute some new probability P(H | E1). If we get a new piece of relevant evidence E2 that is independent of E1, we could as easily calculate P(H | E2) for the same H. However, that calculation would not take into account the fact that we already attached a certain level of probability to H because of the prior evidence E1. To get that, we need to calculate P(H | E1 & E2).
For example, imagine trying to guess a single card drawn from a deck. If you know it is red, then you have P(Guess | Red) = 1/26, because there are 26 red cards in a deck. If you know it is a face card, you have P(Guess | Face) = 1/12, because there are twelve face cards (jack, queen, and king in each of four suits). If you know it is both a face card and red, you need to calculate P(Guess | Face & Red) = 1/6, because only six cards are both red and face cards.
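Counting jack, queen, and king as the face cards, these conditional guessing probabilities can be checked by brute-force enumeration of a 52-card deck (a sketch, not part of the original):

```python
from fractions import Fraction
from itertools import product

ranks = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
suits = ['hearts', 'diamonds', 'clubs', 'spades']
deck = list(product(ranks, suits))  # 52 (rank, suit) pairs

def p_guess(condition):
    """Probability of guessing the card given only that the condition holds:
    1 divided by the number of cards satisfying the condition."""
    return Fraction(1, sum(1 for card in deck if condition(card)))

def is_red(card):
    return card[1] in ('hearts', 'diamonds')

def is_face(card):
    return card[0] in ('J', 'Q', 'K')

print(p_guess(is_red))                              # 1/26
print(p_guess(is_face))                             # 1/12
print(p_guess(lambda c: is_red(c) and is_face(c)))  # 1/6
```

Each added piece of evidence shrinks the set of candidate cards, which is what conditioning on E1 & E2 amounts to here.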
A slightly more complex way of generalizing Bayes’ Rule comes about when there is more than one competing hypothesis, diagnosis, or possibility to be considered. In that case, evidence brought to bear in favor of any single hypothesis needs to be considered in the context of the domain of all other competing hypotheses. In fact, the simple forms of Bayes’ Rule we have considered in this paper do exactly this. We have seen that P(H | E) = P(E | H) P(H)/P(E), where H is some hypothesis, diagnosis, or possibility, and E is some evidence bearing on it. We have also seen in several examples that the denominator P(E) – to be concrete, the probability of getting a positive diagnosis – can be expanded into the sum of (the true positive rate × the positive base rate) and (the false positive rate × the negative base rate). The two elements in this sum are just two different hypotheses about where a positive diagnosis could have come from: it could have come either from a mistaken diagnosis or from a true diagnosis. If there were also a possibility of a deliberately fraudulent diagnosis, we would have to add that in as a third term in our calculation of P(E), the probability of getting a positive diagnosis.
The generalization of Bayes’ Rule to handle any number of competing hypotheses simply makes explicit that the denominator in Bayes’ Rule sums over the domain of hypotheses that could explain the evidence E – or, said another way, over all the possible ways the evidence under consideration could come about. The generalized expression is:
P(Hn | E) = P(E | Hn)P(Hn)/Σi[P(E | Hi)P(Hi)]
Hn is the particular hypothesis under consideration, and E is, as ever, some new piece of evidence, such as a diagnostic sign. The denominator, as in the specific cases we have considered above, is simply the sum, over all the competing hypotheses, of the ways the diagnostic sign might occur, howsoever that may be.
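The generalized rule is straightforward to compute. The sketch below (with hypothetical numbers, including the fanciful fraudulent-diagnosis possibility mentioned earlier) normalizes each hypothesis's likelihood-weighted prior by the sum over all competing hypotheses:

```python
def posteriors(priors, likelihoods):
    """Generalized Bayes' Rule over competing hypotheses.

    priors      -- dict mapping hypothesis name to P(Hi)
    likelihoods -- dict mapping hypothesis name to P(E | Hi)
    Returns a dict of P(Hi | E); the denominator sums over all hypotheses.
    """
    evidence = sum(likelihoods[h] * priors[h] for h in priors)
    return {h: likelihoods[h] * priors[h] / evidence for h in priors}

# Hypothetical three-hypothesis account of a positive test result: it may be
# a true positive, a mistaken diagnosis, or a deliberately fraudulent one.
priors = {'diseased': 0.01, 'healthy': 0.98, 'fraud': 0.01}
likelihoods = {'diseased': 0.95, 'healthy': 0.05, 'fraud': 1.0}
post = posteriors(priors, likelihoods)
print({h: round(p, 3) for h, p in post.items()})
```

Because the denominator covers every hypothesis, the posteriors necessarily sum to 1.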
Bayes’ Rule has important implications for clinicians, allowing as it does for formal specification of the probability of a diagnosis being correct, taking into account relevant prior probabilities. Although Bayes’ Rule is simple, it is often ignored in practice, perhaps because the mathematics underlying the rule is often either dealt with in a cursory manner in clinical training or else left under-specified. Although Meehl and Rosen’s (1955) exposition of the importance of Bayes’ Theorem is thorough and convincing, it left many proofs for the reader, with an apparent (probably erroneous) assumption that they were too simple to include. In this article I have followed the substance of Meehl and Rosen's exposition, but started from a simpler base and provided all the details of algebraic derivation that were left out of that article. My goal in doing so has been to make their exposition of Bayes’ Rule more accessible, and thereby to make it possible for more clinicians to benefit from their ground-breaking work demonstrating the importance of the rule in clinical settings.
Conflict of Interest Statement
The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Thanks to Gail Moroschan for feedback on an earlier draft of this article.
- Belmaker R. H., Bersudky Y., Lichtenberg P. (2010). Bayesian approach to bipolar guidelines. World J. Biol. Psychiatry 11, 76–77.
- Butcher J. N., Williams C. L., Graham J. R., Archer R. P., Tellegen A., Ben-Porath Y. S., Kaemmer B. (1992). MMPI-A (Minnesota Multiphasic Personality Inventory-Adolescent): Manual for Administration, Scoring, and Interpretation. Minneapolis: University of Minnesota Press.
- Fontaine J. L., Archer R. P., Elkins D. E., Johnsen J. (2001). The effects of MMPI-A T-score elevation on classification accuracy for normal and clinical adolescent samples. J. Pers. Assess. 76, 264–281.
- Hacking I. (2001). An Introduction to Probability and Inductive Logic. Cambridge, England: Cambridge University Press.
- Hart S., Cox D., Hare R. (1995). Manual for the Psychopathy Checklist: Screening Version (PCL:SV). Toronto, Canada: Multi-Health Systems.
- Kemp S., Wilkinson K., Caswell H., Reynders H., Baker G. (2008). The base rate of Wada test failure. Epilepsy Behav. 13, 630–633. doi: 10.1016/j.yebeh.2008.07.013
- Meehl P., Rosen A. (1955). Antecedent probability and the efficiency of psychometric signs, patterns, or cutting scores. Psychol. Bull. 52, 194–216. Reprinted in: Meehl, P. (1977). Psychodiagnosis: Selected Papers. New York, USA: W.W. Norton & Sons.
- Skeem J. L., Mulvey E. P. (2001). Psychopathy and community violence among civil psychiatric patients: results from the MacArthur Violence Risk Assessment Study. J. Consult. Clin. Psychol. 69, 358–374.
- Zanarini M. C., Weingeroff M. A., Frankenburg F. R. (2009). Defense mechanisms associated with borderline personality disorder. J. Pers. Disord. 23, 113–121.