
# Bayesian Statistics Worked Example Part 2

Last time I decided my post was too long, so I cut some material, and this post fleshes those parts out on their own. Recall our setup. We perform an experiment of flipping a coin. Our data set consists of ${a}$ heads and ${b}$ tails. We want to run a Bayesian analysis to figure out whether or not the coin is biased. The bias is a number between ${0}$ and ${1}$ that indicates the expected proportion of flips that land on heads.

We found our situation was modeled by the beta distribution: ${P(\theta |a,b)=\beta(a,b)}$. I reiterate here a word of warning. ALL other sources will call this ${B(a+1, b+1)}$. I’ve just shifted by 1 for ease of notation. We saw last time that if our prior belief is that the probability distribution is ${\beta(x,y)}$, then our posterior belief should be ${\beta(x+a, y+b)}$. This simple “update rule” falls out purely from Bayes’ Theorem.
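The update rule is simple enough to write down as code. Here is a minimal sketch; the `ShiftedBeta` type and the `update` function are names made up for illustration, and the parameters follow this post's shifted notation rather than the standard one.

```python
from typing import NamedTuple


class ShiftedBeta(NamedTuple):
    """beta(x, y) in this post's shifted notation, i.e. the standard B(x + 1, y + 1)."""
    x: float
    y: float


def update(prior: ShiftedBeta, heads: int, tails: int) -> ShiftedBeta:
    """Update rule: a prior beta(x, y) plus a heads and b tails gives beta(x + a, y + b)."""
    return ShiftedBeta(prior.x + heads, prior.y + tails)


flat_prior = ShiftedBeta(0, 0)  # beta(0, 0), the flat prior in shifted notation
posterior = update(flat_prior, heads=3, tails=1)
print(posterior)  # ShiftedBeta(x=3, y=1)
```

Because the update is just addition, feeding the data in one flip at a time or all at once lands on the same posterior.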

The main thing I didn’t explain last time was what exactly I meant by the phrase “we can say with 95% confidence that the true bias of the coin lies between ${0.40}$ and ${0.60}$” or whatever the particular numbers are that we get from our data. What I had in mind for that phrase was something called the highest density interval (HDI). The 95% HDI is an interval for which the area under the distribution is ${0.95}$ (i.e. an interval containing 95% of the probability) such that every point in the interval has a higher probability density than any point outside of the interval (I apologize for such highly unprofessional pictures):

(It doesn’t look like it, but that is supposed to be perfectly symmetrical.)

The first is the correct way to make the interval, because notice all points on the curve over the shaded region are higher up (i.e. more probable) than points on the curve not in the region. There are lots of 95% intervals that are not HDIs. The second is such a non-example, because even though the area under the curve is 0.95, the big purple point is not in the interval but is higher up than some of the points off to the left which are included in the interval.
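The definition above translates fairly directly into a brute-force computation: chop ${[0,1]}$ into tiny cells, keep the highest-density cells first, and stop once they hold 95% of the area. The sketch below does that in plain Python. The function name `hdi`, the grid size, and the use of the post's shifted parameters are all choices made for this illustration, and it assumes the density has a single peak so that the kept cells form one interval.

```python
import math


def hdi(x: float, y: float, mass: float = 0.95, n: int = 20000) -> tuple[float, float]:
    """Grid approximation to the HDI of beta(x, y) in shifted notation.

    beta(x, y) here is the standard B(x + 1, y + 1), whose density is
    proportional to theta**x * (1 - theta)**y. Assumes a single peak.
    """
    # Log of the normalizing constant of B(x + 1, y + 1).
    log_norm = math.lgamma(x + 1) + math.lgamma(y + 1) - math.lgamma(x + y + 2)
    width = 1.0 / n
    cells = []
    for i in range(n):
        theta = (i + 0.5) * width  # midpoint of the i-th cell, never exactly 0 or 1
        log_density = x * math.log(theta) + y * math.log(1.0 - theta) - log_norm
        cells.append((math.exp(log_density), theta))

    # Take cells from highest density to lowest until they hold the requested area.
    cells.sort(reverse=True)
    total = 0.0
    kept = []
    for density, theta in cells:
        kept.append(theta)
        total += density * width
        if total >= mass:
            break
    return min(kept) - width / 2, max(kept) + width / 2


low, high = hdi(5, 3)
print(f"95% HDI of beta(5, 3): {low:.3f} to {high:.3f}")
```

Every cell that was kept has a density at least as high as every cell that was dropped, which is exactly the property that separates the first picture from the second.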

Lastly, we will say that a hypothesized bias ${\theta_0}$ is credible if some small neighborhood of that value lies completely inside our 95% HDI. That small neighborhood is sometimes called the “region of practical equivalence (ROPE)”, and its half-width is just a value we must set. If we set it to be 0.02, then we would say that the coin being fair is a credible hypothesis if the whole interval from 0.48 to 0.52 is inside the 95% HDI.
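As a small sketch, the credibility check is just an interval containment test. The function name `is_credible` and its parameter names are invented here, and the intervals in the usage lines are illustrative values rather than results computed from data.

```python
def is_credible(theta0: float, hdi_low: float, hdi_high: float,
                rope_half_width: float = 0.02) -> bool:
    """True if the whole ROPE around theta0 lies inside the interval [hdi_low, hdi_high]."""
    rope_low = theta0 - rope_half_width
    rope_high = theta0 + rope_half_width
    return hdi_low <= rope_low and rope_high <= hdi_high


# Illustrative intervals only.
print(is_credible(0.5, 0.40, 0.60))  # True: 0.48 to 0.52 fits inside 0.40 to 0.60
print(is_credible(0.5, 0.49, 0.70))  # False: 0.48 falls below the lower end 0.49
print(is_credible(0.5, 0.49, 0.70, rope_half_width=0.005))  # True with a tighter ROPE
```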

A note ahead of time: calculating the HDI for the beta distribution is actually kind of a mess because of the nature of the function. There is no closed form solution, so usually you just look these things up in a table or approximate them numerically. Both the mean and the standard deviation do have closed forms. In the shifted notation used here, ${\beta(a,b)}$ has mean ${\mu=\frac{a+1}{a+b+2}}$ and standard deviation ${\left(\frac{\mu(1-\mu)}{a+b+3}\right)^{1/2}}$ (these are the usual formulas for ${B(a+1,b+1)}$). Thus I’m going to approximate for the sake of this post using the “two standard deviations” rule, which says that two standard deviations on either side of the mean covers roughly 95%. Caution: this rule comes from the normal distribution, so if the distribution is highly skewed, for example ${\beta(3,25)}$ or something, then this approximation can be way off.
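Here is a minimal sketch of the two standard deviations approximation. It first converts the post's shifted ${\beta(x,y)}$ to the standard parameters ${B(x+1,y+1)}$ and then applies the standard mean and standard deviation formulas for the beta distribution. The function name `two_sd_interval` is made up, and clipping the result to ${[0,1]}$ is an extra choice made here because a bias cannot fall outside that range.

```python
import math


def two_sd_interval(x: float, y: float) -> tuple[float, float]:
    """Rough 95% interval for beta(x, y) in shifted notation: mean plus or minus 2 sd."""
    alpha, beta = x + 1, y + 1  # standard parameters B(alpha, beta)
    mean = alpha / (alpha + beta)
    sd = math.sqrt(mean * (1 - mean) / (alpha + beta + 1))
    # Clip to [0, 1]: an assumption of this sketch, not part of the rule itself.
    return max(0.0, mean - 2 * sd), min(1.0, mean + 2 * sd)


low, high = two_sd_interval(3, 25)  # the skewed example mentioned above
print(f"beta(3, 25): roughly {low:.3f} to {high:.3f}")
```

The interval this produces is always symmetric about the mean before clipping, which is why it struggles with skewed distributions: a true HDI is free to extend further on the side with the long tail.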

Let’s go back to the same examples from before and add in this new terminology to see how it works. Suppose we have absolutely no idea what the bias is and we make our prior belief ${\beta(0,0)}$, the flat line. This says that we believe ahead of time that all biases are equally likely. Now we observe ${3}$ heads and ${1}$ tails. Bayesian analysis tells us that our new distribution is ${\beta(3,1)}$, which everyone else would call ${B(4,2)}$. Its mean is ${4/6\approx 0.67}$ and its standard deviation is about ${0.18}$, so the two standard deviations rule gives roughly ${0.67\pm 0.36}$, i.e. about 0.31 up to 1 once we cut the interval off at the largest possible bias. Thus we can say with roughly 95% certainty that the true bias is in this region. Note that with only four flips the interval is so wide that it rules out almost nothing: the whole ROPE from 0.48 to 0.52 sits inside it, so the coin being fair is still a credible hypothesis. The choice of threshold matters most when a hypothesis sits near the edge of the interval. If the lower end had come out at 0.49, say, then a ROPE of 0.02 would make the fair coin not credible, while a ROPE of 0.01 would make it credible!

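Let’s see what happens if we use just an ever so slightly more reasonable prior. We’ll use ${\beta(2,2)}$. This gives us a starting assumption that the coin is probably fair, but it is still very open to whatever the data suggests. In this case our ${3}$ heads and ${1}$ tails tells us our posterior distribution is ${\beta(5,3)}$, i.e. ${B(6,4)}$ in standard notation. Its mean is ${0.6}$ and its standard deviation is about ${0.15}$, so the two standard deviations rule gives an approximate 95% interval of 0.30 to 0.90. The extra prior weight makes the posterior a little tighter than under the flat prior (the standard deviation drops from about 0.18 to about 0.15) and centers it closer to ${0.5}$. More importantly, the ROPE from 0.48 to 0.52 sits comfortably inside the interval, so we feel much more comfortable with the claim that the coin being fair is a credible hypothesis.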

This brings up a sort of “statistical uncertainty principle.” If we want a ton of certainty, then it forces our interval to get wider and wider. This makes intuitive sense, because if I want to give you a range that I’m 99.9999999% certain the true bias is in, then I better give you practically every possibility. If I want to pinpoint a precise spot for the bias, then I have to give up certainty (unless you’re in an extreme situation where the distribution is a really sharp spike or something). You’ll end up with something like: I can say with 1% certainty that the true bias is between 0.59999999 and 0.6000000001. We’ve locked onto a small range, but we’ve given up certainty. Note the similarity to the Heisenberg uncertainty principle which says the more precisely you know the momentum or position of a particle the less precisely you know the other.
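To see the trade-off in numbers, here is a sketch that widens or narrows a normal-approximation interval as the requested certainty changes. It uses ${\beta(5,3)}$ as the distribution and the same mean and standard deviation formulas for the standard ${B(x+1,y+1)}$. The function name is hypothetical, and the normal approximation is a stand-in for a real HDI, so the intervals are not clipped and can spill past ${1}$ at very high certainty.

```python
import math
from statistics import NormalDist


def normal_approx_interval(x: float, y: float, certainty: float) -> tuple[float, float]:
    """Interval around the mean of beta(x, y) (shifted notation) holding `certainty`
    of the mass of a normal distribution with the same mean and sd."""
    alpha, beta = x + 1, y + 1  # standard parameters B(alpha, beta)
    mean = alpha / (alpha + beta)
    sd = math.sqrt(mean * (1 - mean) / (alpha + beta + 1))
    # Number of standard deviations needed on each side for this certainty.
    z = NormalDist().inv_cdf(0.5 + certainty / 2)
    return mean - z * sd, mean + z * sd


for certainty in (0.01, 0.50, 0.95, 0.999):
    low, high = normal_approx_interval(5, 3, certainty)
    print(f"{certainty:6.1%} certainty -> width {high - low:.3f}")
```

The width grows every time the certainty goes up, and the 1% interval is a sliver around the mean, which is the same shape of statement as the 0.59999999 to 0.6000000001 example.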

Let’s wrap up by trying to pinpoint exactly where we needed to make choices for this statistical model. The most common objection to Bayesian models is that you can subjectively pick a prior to rig the model to get any answer you want. Hopefully this wrap-up will show that in the abstract that objection is essentially correct, but in real-life practice you cannot get away with this.

Step 1 was to write down the likelihood function: as a function of ${\theta}$, the probability of seeing ${a}$ heads and ${b}$ tails is proportional to ${\theta^a(1-\theta)^b}$, which is where ${P(\theta | a,b)=\beta(a,b)}$ came from. This was derived directly from the type of data we were collecting and was not a choice. Step 2 was to determine our prior distribution. This was a choice, but a constrained one. In real-life statistics you will probably have a lot of prior information that will go into this choice. Recall that the prior encodes both what we believe is likely to be true and how confident we are in that belief. Suppose you make a model to predict who will win an election based off of polling data. You have previous years’ data and that collected data has been tested, so you know how accurate it was! Thus forming your prior based on this information is a well-informed choice. Just because a choice is involved here doesn’t mean you can arbitrarily pick any prior you want to get any conclusion you want.

I can’t reiterate this enough. In our example, if you pick a prior of ${\beta(100,1)}$ with no reason to expect the coin is biased, then we have every right to reject your model as useless. Your prior must be informed and must be justified. If you can’t justify your prior, then you probably don’t have a good model. The choice of prior is a feature, not a bug. If a Bayesian model turns out to be much more accurate than all other models, then it probably came from the fact that prior knowledge was not being ignored. It is frustrating to see opponents of Bayesian statistics use the “arbitrariness of the prior” as a failure when it is exactly the opposite (see the picture at the end of this post for a humorous illustration).
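A quick sketch shows why a prior like ${\beta(100,1)}$ deserves suspicion: with only ${3}$ heads and ${1}$ tails, it almost completely drowns out the data. The code below compares posterior means under the three priors mentioned in this post, using the fact that ${\beta(x,y)}$, being the standard ${B(x+1,y+1)}$, has mean ${\frac{x+1}{x+y+2}}$. The function name and the prior labels are made up for illustration.

```python
from fractions import Fraction


def posterior_mean(prior_x: int, prior_y: int, heads: int, tails: int) -> Fraction:
    """Mean of the posterior beta(prior_x + heads, prior_y + tails) in shifted notation."""
    x, y = prior_x + heads, prior_y + tails
    return Fraction(x + 1, x + y + 2)  # mean of the standard B(x + 1, y + 1)


priors = {
    "flat beta(0,0)": (0, 0),
    "mild beta(2,2)": (2, 2),
    "rigged beta(100,1)": (100, 1),
}
for name, (px, py) in priors.items():
    mean = posterior_mean(px, py, heads=3, tails=1)
    print(f"{name}: posterior mean {float(mean):.3f}")
# flat beta(0,0): posterior mean 0.667
# mild beta(2,2): posterior mean 0.600
# rigged beta(100,1): posterior mean 0.972
```

The first two priors let four flips move the estimate noticeably, while the third leaves it pinned near ${1}$ no matter what those four flips were, which is exactly the kind of prior that needs a justification.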

The last step is to set a ROPE to determine whether or not a particular hypothesis is credible. This merely prevents something right on the edge of the 95% HDI from counting as a credible guess. Admittedly, this step really is pretty arbitrary, but every statistical model has this problem. It isn’t unique to Bayesian statistics, and it isn’t typically a problem in real life. If something is that close to being outside of your HDI, then you’ll probably want more data. For example, if you are a scientist, then you re-run the experiment or you honestly admit that it seems possible to go either way.