math, philosophy

What is Bayesian Statistics: A basic worked example

I did a series on Bayes’ Theorem awhile ago and it gave us some nice heuristics on how a rational person ought to update their beliefs as new evidence comes in. The term “Bayesian statistics” gets thrown around a lot these days, so I thought I’d do a whole post just working through a single example in excruciating detail to show what is meant by this. If you understand this example, then you basically understand what Bayesian statistics is.

Problem: We run an experiment of flipping a coin {N} times and record a {1} every time it comes up heads and a {0} every time it comes up tails. This gives us a data set. Using this data set and Bayes’ theorem, we want to figure out whether or not the coin is biased and how confident we are in that assertion.

Let’s get some technical stuff out of the way. This is the least important part to fully understand for this post, but is kind of necessary. Define {\theta} to be the bias towards heads. This just means that if {\theta=0.5}, then the coin has no bias and is perfectly fair. If {\theta=1}, then the coin will never land on tails. If {\theta = 0.75}, then if we flip the coin a huge number of times we will see close to {3} out of every {4} flips lands on heads. For notation we’ll let {y} be the trait of whether or not it lands on heads or tails (so it is {0} or {1}).

We can encode this information mathematically by saying {P(y=1|\theta)=\theta}. In plain english: The probability that the coin lands on heads given that the bias towards heads is {\theta} is {\theta}. Likewise, {P(y=0|\theta)=1-\theta}. Let’s just chain a bunch of these coin flips together now. Let {a} be the event of seeing {a} heads when flipping the coin {N} times (I know, the double use of {a} is horrifying there but the abuse makes notation easier later).

Since coin flips are independent we just multiply probabilities and hence {P(a|\theta)=\theta^a(1-\theta)^{N-a}}. Rather than lug around the total number {N} and have that subtraction, normally people just let {b} be the number of tails and write {P(a,b |\theta)=\theta^a(1-\theta)^b}. Let’s just do a quick sanity check to make sure this seems right. Note that if {a,b\geq 1}, then as the bias goes to zero the probability goes to zero. This is expected because we observed a heads ({a\geq 1}), so it is highly unlikely to be totally biased towards tails. Likewise as {\theta} gets near {1} the probability goes to {0}, because we observed a tails.

The other special cases are when {a=0} or {b=0}, and in these cases we just recover that the probability of getting heads a times in a row if the probability of heads is {\theta} is {\theta^a}. Of course, the mean of {\beta (a,b)} is {a/(a+b)}, the proportion of the number of heads observed. Moving on, we haven’t quite thought of this in the correct way yet, because in our introductory problem we have a fixed data set that we want to analyze. So from now on we should think about {a} and {b} being fixed from the data we observed.

The idea now is that as {\theta} varies through {[0,1]} we have a distribution {P(a,b|\theta)}. What we want to do is multiply this by the constant that makes it integrate to {1} so we can think of it as a probability distribution. In fact, it has a name called the beta distribution (caution: the usual form is shifted from what I’m writing), so we’ll just write {\beta(a,b)} for this (the number we multiply by is the inverse of {B(a,b)=\int_0^1 \theta^a(1-\theta)^b d\theta} called the (shifted) beta function).

This might seem unnecessarily complicated to start thinking of this as a probability distribution in {\theta}, but it is actually exactly what we are looking for. Consider the following three examples:


The red one says if we observe {2} heads and {8} tails, then the probability that the coin has a bias towards tails is greater. The mean happens at {0.20}, but because we don’t have a lot of data there is still a pretty high probability of the true bias lying elsewhere. The middle one says if we observe 5 heads and 5 tails, then the most probable thing is that the bias is {0.5}, but again there is still a lot of room for error. If we do a ton of trials to get enough data to be more confident in our guess, then we see something like:


Already at observing 50 heads and 50 tails we can say with 95% confidence that the true bias lies between 0.40 and 0.60. Alright, you might be objecting at this point that this is just usual statistics, where the heck is Bayes’ Theorem? You’d be right. Bayes’ Theorem comes in because we aren’t building our statistical model in a vacuum. We have prior beliefs about what the bias is.

Let’s just write down Bayes’ Theorem in this case. We want to know the probability of the bias {\theta} being some number given our observations in our data. We use the “continuous form” of Bayes’ Theorem:

\displaystyle P(\theta|a,b)=\frac{P(a,b|\theta)P(\theta)}{\int_0^1 P(a,b|\theta)d\theta}

I’m trying to give you a feel for Bayesian statistics, so I won’t work out in detail the simplification of this. Just note that the “posterior probability” (the left hand side of the equation), i.e. the distribution we get after taking into account our data is the likelihood times our prior beliefs divided by the evidence. Now if you use that the denominator is just the definition of {B(a,b)} and work everything out it turns out to be another beta distribution!

If our prior belief is that the bias has distribution {\beta(x,y)}, then if our data has {a} heads and {b} tails we get {P(\theta|a,b)=\beta(a+x, b+y)}. The way we update our beliefs based on evidence in this model is incredibly simple. Now I want to sanity check that this makes sense again. Suppose we have absolutely no idea what the bias is and we make our prior belief {\beta(0,0)} the flat line. This says that we believe ahead of time that all biases are equally likely.

Now we observe {3} heads and {1} tails. Bayesian analysis tells us that our new (posterior probability) distribution is {\beta(3,1)}:


Yikes! We don’t have a lot of certainty, but it looks like the bias is heavily towards heads. Danger: This is because we used a terrible prior. This is the real world so it isn’t reasonable to think that a bias of {0.99} is just as likely as {0.45}. Let’s see what happens if we use just an ever so slightly more modest prior. We’ll use {\beta(2,2)}. This puts our assumption on it being most likely close to {0.5}, but it is still very open to whatever the data suggests. In this case our {3} heads and {1} tails tells us our updated belief is {\beta(5,3)}:


Ah. Much better. We see a slight bias coming from the fact that we observed {3} heads and {1} tails and these can’t totally be ignored, but our prior belief tames how much we let this sway our new beliefs. This is what makes Bayesian statistics so great. If we have tons of prior evidence of a hypothesis, then observing a few outliers shouldn’t make us change our minds. On the other hand, the setup allows for us to change our minds even if we are 99% certain on something as long as sufficient evidence is given. This is the mantra: extraordinary claims require extraordinary evidence.

Not only would a ton of evidence be able to persuade us that the coin bias is {0.90}, but we should need a ton of evidence. This is part of the shortcomings of non-Bayesian analysis. It would be much easier to become convinced of such a bias if we didn’t have a lot of data and we accidentally sampled some outliers.

Anyway. Now you should have an idea of Bayesian statistics. In fact, if you understood this example, then most of the rest is just adding parameters and using other distributions, so you actually have a really good idea of what is meant by that term now.

15 thoughts on “What is Bayesian Statistics: A basic worked example”

  1. “f we have tons of prior evidence of a hypothesis, then observing a few outliers shouldn’t make us change our minds. ”
    So you mean to say that a Bayesian can think about a prior which dominates likelihood? A novice might need more explanation on this point, I feel

  2. Using p(θ\a, b) =θ^a*(1-θ)^b*p(θ)/[integrand from 0 to 1 of (θ^a(1-θ)^b dθ)] that has been obtained from formulas in your exposition above.

    For a = 4, b = 6, θ = 0.4, (1-θ) = 0.6 and p(θ) = 0.4; the above expression solves to 1.1016. I don’t know what to make if this result of of a probability of more than 1.

    Unless I am missing something, I understand p(θ) to be the prior and that it is always equal to θ.

    Please help me get around this!

  3. p(\theta) is the prior, but it is a distribution, not a number. The values of the function you calculated aren’t probabilities. The whole thing as a function of theta is the posterior distribution (i.e. a probability density function). To turn this into a probability, you integrate. This is why individual values can be greater than 1.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s