
**A 15 Page Introduction To Bayes Theorem**

**By Scott Hartshorn**

This is intended to be a short book to help you understand Bayes Theorem without covering every detail. To keep it a length that you can read through and try out the examples in under an hour, this book walks through only two examples.

The first example is a “toy example”, where an unknown die is drawn from a bag and you have to identify what die it was. Although just a toy example, this turns out to have a lot in common with some real life problems, such as identifying how many tanks the enemy has based on captured serial numbers.

The second example takes us into the seat of a spaceship about to fly through an asteroid field. Sure the odds of successfully navigating that asteroid field may seem astronomical (at least according to your loud-mouthed golden robot) but he might not know the full story. This example shows how you can include additional information into the Bayes Theorem calculation in multiple steps. In real life, this example might have applications to a person trying to figure out how much risk they have of heart disease if they have no family history and have good physical fitness, but also have high blood pressure and are over 50 years old.

**As a way of saying thank you for your purchase, I’m offering this free Bayes Theorem cheat sheet that’s exclusive to my readers.**

This cheat sheet contains information about the Bayes Theorem and key terminology, 6 easy steps to solve a Bayes Theorem Problem, and an example to follow. This is a PDF document that I encourage you to print, save, and share. You can download it by going here

Even though we are only doing two examples, Bayes Theorem is a straightforward topic to understand. The simplest explanation is that Bayes Theorem is the same type of probability you already know, just in reverse.

What does that mean? In your typical experience, you have likely seen situations where you have a known starting point and you are asked to calculate the probability of an outcome. For instance, if I know that I am holding a six sided die, what is the probability that I will roll a 3? Alternatively, if you know that you are innocent of a crime, what is the probability that a DNA test will come back positive anyway (a false positive)? With both of those examples, we know the current state and want to calculate the probability of something in the future.

In our typical experience, we have also seen situations where we have two events in a series. In that case, those probabilities get multiplied together. For instance, I could say “I have a bag that has 4 different dice in it, each with a different number of sides. One of those dice has 6 sides. If I draw a single die, what is the probability that I will draw the 6 sided die **and** roll a 3 with it on the first try?” In this case, the solution is to multiply the probability of drawing the 6 sided die (1 in 4) by the odds of rolling a 3 (1 in 6), for a total of 1 in 24.

With Bayes Theorem, we can ask the reverse of the same questions. Instead of “What are the odds of drawing a 6 sided die and rolling a 3?” it is “I rolled a 3; what are the odds that I had a 6 sided die?” (To actually solve that problem we would need to either know or estimate a little more information, namely what the other dice in the bag are.)

The reason we bother with Bayes Theorem is that we live in a world where we frequently see outcomes but have to guess at the initial events that caused those outcomes. Some knowledge is withheld from us due to circumstances, so we need to estimate it as best we can.

The equation for Bayes Theorem is

P(A|B) = P(B|A) × P(A) / P(B)

This equation is not immediately understandable to me, so instead of focusing on the equation, it is more intuitive to show how to actually solve the problems, and then show how the equation fits in. The easiest way to understand how to solve a Bayes Theorem problem is with a table. The first example we will show is a dice example, but the same type of table can be used to solve all types of Bayes Theorem problems.

For this problem, assume that I randomly drew a die from a bag that contained 4 dice. The 4 dice each have a different number of sides: one has 4 sides, one has 5 sides, one has 6 sides, and one has 8 sides. If I roll that die and tell you the result without showing you the actual die, can you tell me what the die is, or at least give me probabilities for each die? For this example let’s say that I rolled a 5.

Here you have access to the outcome, the roll of the die, but not the initial state.

If we were asking the problem the other way it would be simple. If I told you that I was holding an 8 sided die and asked you the odds that I would roll a 5, it is easy. Those odds would be 1 in 8. In fact, the odds that I would roll any given number between 1 and 8 with the 8 sided die are 1 in 8. The odds for any given number for the 4, 5, and 6 sided dice are also easy: 1 in 4, 5, or 6 respectively, up to the maximum number on that die.

That was simple, and it turns out we can use our knowledge of probabilities going in one direction to calculate the odds going in the other direction. If we make a table of the probabilities of rolling a certain value for any given die, what we get is this

| Roll | 4 sided | 5 sided | 6 sided | 8 sided |
|------|---------|---------|---------|---------|
| 1 | 25% | 20% | 16.7% | 12.5% |
| 2 | 25% | 20% | 16.7% | 12.5% |
| 3 | 25% | 20% | 16.7% | 12.5% |
| 4 | 25% | 20% | 16.7% | 12.5% |
| 5 | 0% | 20% | 16.7% | 12.5% |
| 6 | 0% | 0% | 16.7% | 12.5% |
| 7 | 0% | 0% | 0% | 12.5% |
| 8 | 0% | 0% | 0% | 12.5% |

Note that this table ignores the odds of actually drawing a specific die. It shows the odds for a given die once it has been drawn. Each die has its own column. Once I tell you which die I am holding, for instance the 5 sided die, you could delete all the other columns and tell me the odds of any given outcome based on that information.

Bayes theorem tells us that we can go the other direction also.

If instead of telling you that I have a 5 sided die, I tell you that I rolled a 5 (with whatever die I had), we can delete all the blocks where the 5 wasn’t rolled. (Basically deleting rows instead of columns)

What we are left with is just the row for a roll of 5. What this tells us is that

* If we had a 4 sided die, we had a 0% chance of rolling a 5
* If we had a 5 sided die, we had a 20% chance of rolling a 5
* If we had a 6 sided die, we had a 16.7% chance of rolling a 5
* If we had an 8 sided die, we had a 12.5% chance of rolling a 5

Since I’m telling you that the 5 was the measured outcome, this is all that we need to worry about. Knowing the outcome of that event allows us to remove all the probabilities associated with other events.

However, we aren’t quite done yet. With this type of probability, all the outcomes must sum up to 1.0 (i.e. 100% likelihood). That is because the outcome in question is no longer just possible, it actually happened. I actually did roll a 5, there is a 100% chance that that event occurred.

With this Bayes Theorem example, the remaining probabilities do not sum to 1.0. So after we determine which cells we are keeping based on our known outcome, we need to adjust them to make them sum to 1.0, which is called normalizing. In this example, the outcomes only sum up to 49.2% (20 + 16.7 + 12.5). To make them sum to 1.0 we divide each of the probabilities by the sum of all the probabilities.

* .2 / .492 = .407 = 40.7%
* .167 / .492 = .339 = 33.9%
* .125 / .492 = .254 = 25.4%

What this means is that, based on this single roll of a die, we have a 40.7% chance of having a 5 sided die, a 33.9% chance of having a 6 sided die, and a 25.4% chance of having an 8 sided die.
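If you prefer to see the table method as code, here is a minimal sketch of this dice problem in Python. The variable names are my own, not from the book:

```python
# The bag of dice, described by the number of sides on each die
dice = [4, 5, 6, 8]
roll = 5  # the observed outcome

# Likelihood of rolling a 5 with each die (0 if the die can't show a 5)
likelihoods = [1 / sides if roll <= sides else 0.0 for sides in dice]

# Normalize so the remaining probabilities sum to 1.0
total = sum(likelihoods)
posteriors = [lk / total for lk in likelihoods]

for sides, p in zip(dice, posteriors):
    print(f"{sides} sided die: {p:.1%}")
# 4 sided die: 0.0%, 5 sided die: 40.7%, 6 sided die: 33.9%, 8 sided die: 25.4%
```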

Everything in the above paragraphs is correct for the example given. But something is missing for more complicated examples. To understand what that is, it is time to actually look at the Bayes Theorem equation

P(A|B) = P(B|A) × P(A) / P(B)

If we label the terms, what they are is

* Prior, P(A): our initial estimate of the probability before we know the result of our data
* Likelihood, P(B|A): the probability that any given initial condition would produce the result that we got
* Normalizing Constant, P(B): the sum of the probabilities of all the conditions which satisfy our result
* Posterior, P(A|B): the result that we are looking for

What we did in the dice example above was use the likelihood and the normalizing constant. I.e. we used the P(B|A) / P(B) part of the equation, but did not explicitly use the P(A) part.

The P(B|A) part of the equation is the likelihood, which is the resulting value of any given cell. I.e. the likelihood for a 5 sided die was 20%. What this is saying is that if we assume we have a 5 sided die, the odds that we would roll a 5 are 20%. Likewise, the likelihoods for the 6 sided die and the 8 sided die are 16.7% and 12.5% respectively.

The P(B) part of the equation is the normalizing constant, which is the sum of all of those likelihoods. In this case that was 49.2%. As we saw before, dividing by the normalizing constant forces the resulting probabilities to sum to 1.0.

What we did not explicitly use was the P(A) part of the equation, which is the prior. The prior is how you account for your initial estimate of probability before incorporating new data. In the example above, I told you that I had 4 dice in a bag and randomly drew one out. So it was reasonable to assume that I had equal odds of drawing out any single die. Since the initial odds for all of our outcomes were the same, the fact that we ignored the prior didn’t affect the results, because an equal prior cancels out in the normalization.

However, it’s possible, and even likely in many cases, that you won’t have equal odds for all initial conditions in the prior. For the example above, you might assume that I was more likely to draw a 6 sided die than the 8 sided die because of the shape. Or, more reasonably, instead of 4 dice in the bag I might have 10 dice in the bag, one 4 sided die, two 5 sided dice, three 6 sided dice, and four 8 sided dice, in which case your initial estimate of probabilities would not be that there was an equal likelihood of drawing any given die. To relate that back to a real life example, if you are trying to estimate how many tanks an enemy country has, you might pick the most likely number and then estimate a bell curve around it for the other initial probability estimates. As a result, each individual starting state (i.e. whether they have 200 tanks or 400 tanks) might get a different initial probability.

With Bayes Theorem, the way we account for whatever initial probability estimates we have is by multiplying the prior probability by each individual likelihood. I.e. if I have a 10% chance of having a 4 sided die, and I have a 25% chance of rolling a three, assuming that I have a 4 sided die, then my odds of both having a 4 sided die and rolling a three are .1 * .25 = .025.

So we can create the table again, accounting for both the prior probability and the likelihood of a given roll assuming you have that die. If we assume there is a 10%, 20%, 30%, and 40% chance of the 4, 5, 6, and 8 sided dice respectively, then what we are doing is multiplying each column of our original likelihood table by the prior probability for that die. The result is

| Roll | 4 sided (10%) | 5 sided (20%) | 6 sided (30%) | 8 sided (40%) |
|------|---------------|---------------|---------------|---------------|
| 1 | .025 | .04 | .05 | .05 |
| 2 | .025 | .04 | .05 | .05 |
| 3 | .025 | .04 | .05 | .05 |
| 4 | .025 | .04 | .05 | .05 |
| 5 | 0 | .04 | .05 | .05 |
| 6 | 0 | 0 | .05 | .05 |
| 7 | 0 | 0 | 0 | .05 |
| 8 | 0 | 0 | 0 | .05 |

Importantly, if you sum every cell in this table, the total value is 1.0

Now if we say we rolled a 5, we can remove all the non-5 rolls, sum up the remaining results (0 + .04 + .05 + .05 = .14), and divide by that sum to normalize those results.

The final result we get is that we have a 28.6% chance of having the 5 sided die, and a 35.7% chance of having each of the 6 or 8 sided dice.
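Here is a minimal sketch of the same calculation in Python, assuming the 10/20/30/40% priors above (again, the variable names are my own):

```python
dice = [4, 5, 6, 8]
priors = [0.10, 0.20, 0.30, 0.40]  # assumed chance of drawing each die
roll = 5

# Joint probability of drawing each die AND rolling a 5 with it
joint = [p * (1 / sides if roll <= sides else 0.0)
         for p, sides in zip(priors, dice)]

# Normalize the surviving probabilities so they sum to 1.0
total = sum(joint)                 # 0 + .04 + .05 + .05 = .14
posteriors = [j / total for j in joint]

for sides, p in zip(dice, posteriors):
    print(f"{sides} sided die: {p:.1%}")
# 4 sided: 0.0%, 5 sided: 28.6%, 6 sided: 35.7%, 8 sided: 35.7%
```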

Using the process above is good for actually solving the updated probabilities after incorporating new information. But for understanding exactly what happened, sometimes it is helpful to see it more graphically.

The table below represents the likelihood table that we initially had when we had one of each of the four dice. This table is a 1x1 square, so it has a total area of 1.0. That represents a 100% probability. Thus the relative size of any given square represents how likely that outcome is.

In the table above, each of the columns has the same width; that is because every die is equally likely to be drawn. Since I have a 1 in 4 chance of drawing any single die, each column has 25% of the total area.

Within a given column, all the rectangles have the same height. That is because, for any given die, you are equally likely to roll any number. However different rectangles have different heights in different columns because different dice have different odds of rolling a number. I.e. you have a 25% chance of rolling a 1 with a 4 sided die, but only a 20% chance of rolling it with the 5 sided die.

We can adjust the table above to account for any initial probabilities that we have (i.e. the prior). In the second part of the example above, we said we have a 10% chance of drawing the 4 sided die, a 20% chance of drawing the 5 sided die, a 30% chance of the 6 sided die, and a 40% chance of the 8 sided die. We can make the table reflect this by adjusting the width of the columns.

Now the blue column for the 4 sided die has 10% of the total area, and the widths of the columns for the other dice correspond to their initial probabilities.

So far we haven’t done Bayes Theorem. This is the same visual table you could make if you just had the different dice in a bag, and wanted to make a table of your odds of drawing any die and getting a specific roll with it.

We can start using Bayes Theorem by incorporating another result. Here we observe that we rolled a 5, so we can remove all outcomes that are not associated with rolling a 5. What remains is shown below

This gives us the relative likelihood of any given dice resulting in the observed outcome. However we don’t want to just know the relative likelihood, we want to know the absolute probability. So we have to adjust our results so that they all add up to 100% probability again (i.e. normalize). We can do that by adjusting the relative width of the rectangles above and then stretching them all to have a total area of 1.0, as is shown below.

At this point, we have our result. If we measured the area of each of the columns, we would see that the 5 sided die has 28.6% of the total area, and the other two dice have 35.7% of the total area each. This is the same result we got using the table method.

If we wanted to incorporate another roll, we could keep the resulting table above as our starting point, and incorporate the likelihoods of another roll. We will show that mathematically in the next example below.

In the example above, we made an initial estimate of the probability that we had a given die and used Bayes Theorem to update that probability a single time. However, you are not limited to updating the probability only once; you can continue updating it multiple times, as long as you have new information to incorporate.

As an example, imagine you are a scruffy looking spaceship captain about to fly through an asteroid field. As you are about to enter the asteroid field, your “helpful” robot companion follows protocol and tells you that the odds of navigating the asteroid field are 1 in 3720 or approximately .0269%. But it doesn’t know you very well, and if it had more information about your piloting skill, it would likely give you better odds. Bayes Theorem is how it can update that probability with new information.

In terms of Bayes Theorem, what were the odds the robot initially stated, the 1 in 3720? That was the initial estimate of your odds of success, i.e. the prior. How did the robot get that information? Likely he has records of thousands of pilots who flew into asteroid fields and how many came back out alive. I.e. he knows the odds for the population in general, but he might not know about you specifically. And there are some important factors that might affect the results. For instance, these might improve your odds

* You have flown through asteroid fields multiple times before (perhaps smuggling illicit cargo in hidden storage compartments)
* You have a skilled (if rather hirsute) co-pilot

But on the other hand, you are being chased and shot at by the galactic military, which will tend to lower your odds of survival. How can you account for all these factors? By using Bayes Theorem

Let’s start with the first piece of information, which is that you have flown through asteroid fields multiple times in the past. We need to make a likelihood table showing how many of the people who survived, and how many of the people who did not survive, had previous experience flying through asteroid fields. Based on the information from your golden android, that likelihood table is shown below

| | Survived | Did not survive |
|---|----------|-----------------|
| Never flew an asteroid field before | 20% | 98% |
| Flew asteroid fields before | 80% | 2% |

What we see is that of the people who didn’t survive, 98% of them had never navigated an asteroid field before. And of the people who survived, only 20% had never flown through an asteroid field before.

Now we can multiply that likelihood table by the initial probabilities of

* 1 in 3720 people did survive, and
* 3719 in 3720 people did not survive

And get this new table

| | Survived | Did not survive |
|---|----------|-----------------|
| Never flew an asteroid field before | 0.0054% | 97.97% |
| Flew asteroid fields before | 0.0215% | 2.00% |

We get 4 resulting paired probabilities. From this, we can see that, of all the people flying into an asteroid field, the vast majority, 97.97%, are people doing it for the first time who will perish in the attempt. However, what we care about are people who are doing it for the second (or more) time. So we delete the row that doesn’t match our situation; in this case, we keep the fact that we have previously flown through asteroid fields.

Summing those odds (0.0215% + 2.00% = 2.02%) and normalizing gives us a .01064 probability of surviving, and a .98936 probability of not surviving.

What we have calculated is that an experienced pilot has better odds of navigating the asteroid field: .01064 is just over 1 percent, or approximately 1 in 94. Those might not be great odds, but they are certainly better than 1 in 3720.

**Swamping The Prior**

This brings up an interesting point about Bayes Theorem: if your initial odds are really low, the new odds after 1 piece of new data are likely to still be low. This is frequently seen in medical testing. If the odds that you have a rare disease are low, then the odds that you have the disease after a single positive test are probably still low (i.e. the positive result is likely a false positive). After several tests, however, the initial probability becomes dominated by the new information, which is known as “swamping the prior”.
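To illustrate the idea, here is a minimal sketch with made-up numbers: a disease with a 1 in 1000 prevalence, and a test that catches 99% of true cases but also gives 2% false positives. None of these figures come from the book; they exist purely to show the mechanism:

```python
# Repeated Bayes updates: each posterior becomes the prior for the next test.
prior = [0.001, 0.999]               # [have disease, healthy] (hypothetical)
likelihood_positive = [0.99, 0.02]   # P(test positive | state) (hypothetical)

for test in range(1, 4):             # apply three positive tests in a row
    joint = [p * lk for p, lk in zip(prior, likelihood_positive)]
    total = sum(joint)
    prior = [j / total for j in joint]   # normalize; posterior -> next prior
    print(f"after positive test {test}: P(disease) = {prior[0]:.1%}")
# after test 1: ~4.7%, after test 2: ~71.0%, after test 3: ~99.2%
```

One positive test only moves the probability to about 5%, but three positive tests in a row swamp the 1 in 1000 prior.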

In this example, we have additional information we can incorporate into our probability calculation. Namely, we can account for the fact that we have a skilled co-pilot, but sadly also need to account for the fact that we are being pursued.

The easiest way to include this is to just take the results from the previous step and treat them as the prior for this step. So our odds going into this second step are

* .01064 that we will survive
* .98936 that we will not survive

If we make a likelihood table of who survived based on their co-pilot status, we can then multiply it by our prior. After discarding the events that don’t fit with our situation, we are left with a new pair of unnormalized probabilities.

Now in the previous examples, we normalized these results. We could normalize again right now and it would work just fine. But the thing about normalizing is, as long as you do it in the last step, it doesn’t matter whether you do it in the intermediate steps or not. Since we also need to adjust the odds based on the fact that the military is shooting at us, we can wait and normalize after doing that calculation. The only thing to be aware of after multiple iterations is that the probabilities can get so small that you start running below the available precision of whatever computer/calculator/robot you are using.
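To see why deferring normalization is safe, note that normalizing just divides every entry by the same constant, so it cannot change the final ratios. A quick sketch (the likelihood pairs here are made-up values, purely for illustration):

```python
# Normalizing at each step vs. only at the end gives the same answer,
# because normalization divides both entries by the same constant.
def update(p, lk, normalize):
    out = [p[0] * lk[0], p[1] * lk[1]]
    if normalize:
        s = sum(out)
        out = [x / s for x in out]
    return out

prior = [0.01064, 0.98936]           # [survive, not survive] from step 1
steps = [(0.7, 0.3), (0.5, 0.8)]     # hypothetical likelihood pairs

a, b = prior, prior
for lk in steps:
    a = update(a, lk, normalize=True)    # normalize every step
    b = update(b, lk, normalize=False)   # normalize only at the end
s = sum(b)
b = [x / s for x in b]
print(a, b)   # identical (up to floating point rounding)
```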

So here let’s skip normalizing and go straight into the third Bayes calculation where we incorporate the fact that we are being pursued by the galactic government.

Assume that we have a similar likelihood table for surviving while being pursued by the military. We can multiply that by the prior from the previous step, and when we normalize we get the final odds of survival.

.06977 is 6.977% which is approximately 1 in 15. Those are just the kind of heroic odds we need to impress any nearby princesses.
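Here is a minimal sketch of the whole three-step update in Python, normalizing only once at the end. The experience likelihoods (80% / 2%) come from the example above; the book’s tables for the co-pilot and pursuit steps are not reproduced here, so those likelihood pairs are hypothetical placeholders, and the printed number will not match the book’s .06977:

```python
# Chained Bayes updates, normalizing only once at the end.
prior = [1 / 3720, 3719 / 3720]      # [survive, not survive]

# Likelihood of each observation given survival / non-survival.
# The first pair comes from the example above; the other two are
# HYPOTHETICAL placeholders, since the book's tables for those
# steps are not reproduced here.
steps = [
    (0.80, 0.02),   # flew through asteroid fields before
    (0.70, 0.30),   # hypothetical: has a skilled co-pilot
    (0.50, 0.80),   # hypothetical: being pursued by the military
]

unnormalized = prior
for lk_survive, lk_die in steps:
    unnormalized = [unnormalized[0] * lk_survive, unnormalized[1] * lk_die]

total = sum(unnormalized)
print(f"P(survive) = {unnormalized[0] / total:.4f}")
# ~0.0154 with these placeholder numbers
```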

This book showed a table method of doing Bayes Theorem, which I think is a very intuitive method. This table method is not the only way to solve Bayes Theorem. We wasted some effort by generating likelihoods for outcomes that we immediately threw away. A refinement of this method would be to not construct the full table at each step. For instance, if you know you rolled a 5, you don’t have to populate every roll and then discard most of them; you just need to compute the likelihoods associated with rolling a 5.

No matter what refinements you include though, you will have to calculate the odds of that outcome for every single possible initial state (i.e. each column). As a result, thinking of Bayes Theorem as a table where you are keeping certain rows based on the outcome you observe is a good way to remember how it works.

If you liked this book, you may be interested in checking out some of my other books. The full list with links is located here

Some that you may like are

* *Bayes Theorem Examples* – This book gives additional examples of how to use Bayes Theorem. It dives into some details that were not covered in this shorter book, such as how you can account for potential errors in your data.
* *Probability – A Beginner’s Guide To Permutations and Combinations* – Which dives deeply into what the permutation and combination equations really mean, and how to understand permutations and combinations without having to just memorize the equations. It also shows how to solve problems that the traditional equations don’t cover, such as “If you have 20 basketball players, how many different ways can you split them into 4 teams of 5 players each?” (Answer: 11,732,745,024)
* *Linear Regression and Correlation* – Linear Regression is a way of simplifying a set of data into a single equation. For instance, we all know Moore’s law: that the number of transistors on a computer chip doubles every two years. This law was derived by using regression analysis to simplify the progress of dozens of computer manufacturers over the course of decades into a single equation. This book walks through how to do regression analysis, including multiple regression when you have more than one independent variable. It also demonstrates how to find the correlation between two sets of numbers.

And here is a more advanced book on Bayes Theorem by a different author: *Think Bayes* by Allen Downey. This book goes much deeper into complicated probability distributions for the priors and the likelihoods. I found the probability of any given spot getting hit by a paintball during a paintball competition to be an interesting example.

Before you go, I’d like to say thank you for purchasing my eBook. I know you have a lot of options online to learn this kind of information. So a big thank you for downloading this book and reading all the way to the end.

If you like this book, then I need your help. Please take a moment to leave a review for this book. It really does make a difference and will help me continue to write quality eBooks on Math, Statistics, and Computer Science.

**P.S.**

I would love to hear from you. It is easy for you to connect with us on Facebook here

or on our web page here

But it’s often better to have one-on-one conversations. So I encourage you to reach out over email with any questions you have or just to say hi!

Simply write here:

~ Scott Hartshorn

**Bayes Theorem Is Important**

Bayes Theorem is a way of updating probability as you get new information. Essentially, you make an initial guess, and then get more data to improve it. Bayes Theorem, or Bayes Rule, has a ton of real world applications, from estimating your risk of a heart attack to making recommendations on Netflix.

**But It Isn't That Complicated**

This book is a short introduction to Bayes Theorem. It is only 15 pages long, and is intended to show you how Bayes Theorem works as quickly as possible. The examples are intentionally kept simple to focus solely on Bayes Theorem without requiring that the reader know complicated probability distributions. If you want to learn the basics of Bayes Theorem as quickly as possible, with some easy to duplicate examples, this is a good book for you.

- ISBN: 9781370730704
- Author: Scott Hartshorn
- Published: 2017-09-20 07:20:12
- Words: 4389