The Carter Catastrophe

I’ve been reading Manifold: Time by Stephen Baxter. The book is quite good so far, and it presents a fascinating probabilistic argument that humans will go extinct in the near future. It is sometimes called the Carter Catastrophe, because Brandon Carter first proposed it in 1983.

I’ll use Bayesian arguments, so you might want to review some of my previous posts on the topic if you’re feeling shaky. One thing we didn’t talk all that much about is the idea of model selection. This is the most common thing scientists have to do. If you run an experiment, you get a bunch of data. Then you have to figure out the most likely reason for what you see.

Let’s take a basic example. We have a giant tub of golf balls, and we can’t see inside the tub. There could be 1 ball or a million. We’re told the owner accidentally dropped a red ball in at some point. All the other balls are the standard white golf balls. We decide to run an experiment where we draw a ball out, one at a time, until we reach the red one.

First ball: white. Second ball: white. Third ball: red. We stop. We’ve now generated a data set from our experiment, and we want to use Bayesian methods to give the probability of there being three total balls or seven or a million. In probability terms, we need to calculate the probability that there are x balls in the tub given that we drew the red ball on the third draw. Any time we see this language, our first thought should be Bayes’ theorem.

Define A_i to be the model of there being exactly i balls in the tub. I’ll use “3” inside of P( ) to be the event of drawing the red ball on the third try. We have to make a finiteness assumption, and although this is one of the main critiques of the argument, we can examine what happens as we let the size of the bound grow. Suppose for now the tub can only hold 100 balls.

A priori, we have no idea how many balls are in there, so we’ll assume all “models” are equally likely. This means P(A_i)=1/100 for all i. By Bayes’ theorem we can calculate:

\displaystyle P(A_3|3) = \frac{P(3|A_3)P(A_3)}{\sum_{i=1}^{100}P(3|A_i)P(A_i)} = \frac{(1/3)(1/100)}{(1/100)\sum_{i=3}^{100}1/i} \approx 0.09

So there’s around a 9% chance that there are only 3 balls in the tub. That bottom summation remains exactly the same when computing P(A_n | 3) for any n and equals about 3.69, and the (1/100) cancels out every time. So we can compute explicitly that for any n \geq 3:

P(A_n|3)\approx \frac{1}{n}(0.27)

This is a decreasing function of n, and this shouldn’t be surprising at all. It says that as we guess there are more and more balls in the tub, the probability of that guess goes down. This makes sense, because it’s unreasonable to think we’d see the red one that early if there are actually 100 balls in the tub.
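If you want to check these numbers yourself, here’s a quick sketch of the computation (the only inputs are the maximum tub size and which draw produced the red ball):

# A quick numerical check of the golf ball computation above.
# Assumptions: the tub holds at most N balls, the prior over the models
# A_1, ..., A_N is uniform, and the red ball showed up on draw k.
N = 100
k = 3

# P(red on draw k | A_i) = 1/i if i >= k, and 0 otherwise
def likelihood(i, k):
    return 1.0 / i if i >= k else 0.0

# Posterior P(A_n | k) via Bayes' theorem with the uniform prior 1/N
evidence = sum(likelihood(i, k) * (1.0 / N) for i in range(1, N + 1))
posterior = [likelihood(n, k) * (1.0 / N) / evidence for n in range(1, N + 1)]

print(posterior[2])   # P(A_3 | 3), roughly 0.09
print(posterior[99])  # P(A_100 | 3), roughly 0.0027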

There are lots of ways to play with this. What happens if our tub could hold millions of balls but we still assume a uniform prior? It just takes all the probabilities down, but the general trend is the same: it becomes less and less reasonable to assume a large total number of balls given that we found the red one so early.

You could also only care about this “earliness” idea and redo the computations where you ask how likely is A_n given that we found the red ball by the third try. This is actually the more typical way the problem is formulated in the Doomsday arguments. It’s more complicated, but the same idea pops out, and this should make intuitive sense.

Part of the reason these computations were somewhat involved is because we tried to get a distribution on the natural numbers. But we could have just compared two hypotheses to get a super clear answer (homework for you). What if we only had two choices: “small number of total balls (say 10)” or “large number of total balls (say 10,000)”? You’d find there is over a 99% chance that the “small” hypothesis is correct.
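If you want a head start on that homework, the computation looks like this (assuming we give each of the two hypotheses a prior of 1/2 and condition on the red ball showing up on the third draw):

\displaystyle P(\text{small}|3)=\frac{(1/10)(1/2)}{(1/10)(1/2)+(1/10000)(1/2)}=\frac{1000}{1001}\approx 0.999

so the “small” hypothesis wins overwhelmingly.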

Here’s the leap. Now assume the fact that you exist right now is random. In other words, you popped out at a random point in the existence of humans. So the totality of humans to ever exist are the white balls and you are the red ball. The same type of argument above applies, and it says that the most likely thing is that you aren’t born at some super early point in human history. In fact, it’s unreasonable from a probabilistic standpoint to think that humans will continue much longer at all given your existence.

The “small” total population of humans is far, far more likely than the “large” total population, and the interesting thing is that this remains true even if you mess with the uniform prior. You could assume it is much more likely a priori for humans to continue to make improvements, colonize space, and develop vaccines, giving a higher prior for the species existing far into the future. But unfortunately the Bayesian argument will still pull so strongly in favor of humans ceasing to exist in the near future that one must conclude it is inevitable and will happen soon!

Anyway. I’m travelling this week, so I’m sorry if there are errors in those calculations. I was in a hurry and never double-checked them. The crux of the argument should still make sense even if you don’t get my exact numbers. There are also a lot of interesting and convincing rebuttals, but I don’t have time to get into them now (including the fact that unlikely hypotheses turn out to be true all the time).


Does Bayesian Epistemology Suffer Foundational Problems?

I recently had a discussion about whether Bayesian epistemology suffers from the problem of induction, and I think some interesting things came from it. If these words make you uncomfortable, think of epistemology as the study of how we form beliefs and gain knowledge. Bayesian epistemology means we model it probabilistically using Bayesian methods. This old post of mine talks a bit about it but is long and unnecessary to read to get the gist of this post.

I always think of the problem of induction in terms of the classic swan analogy. Someone wants to claim that all swans are white. They go out and see swan after swan after swan, each confirming the claim. Is there any point at which the person can legitimately say they know that all swans are white?

Classically, the answer is no. The problem of induction is crippling to classic epistemologies, because we can never be justified in believing any universal claim (at least using empirical methods). One of the great things about probabilistic epistemologies (not just Bayesian) is that it circumvents this problem.

Classical epistemologies require you to have 100% certainty to attain knowledge. Since you can’t ever be sure you’ve encountered every instance of a universal, you can’t be certain there is no instance that violates the universal. Hence the problem of induction is an actual problem. But note it is only a problem if your definition of knowledge requires you to have absolute certainty of truth.

Probabilistic epistemologies lower the threshold. They merely ask that you have 95% (or 98%, etc) confidence (or that your claim sits in some credible region, etc) for the justification. By definition, knowledge is always tentative and subject to change in these theories of knowledge.

This is one of the main reasons to use a probabilistic epistemology. It is the whole point. They were invented to solve this problem, so I definitely do not believe that Bayesian epistemology suffers from the problem of induction.

But it turns out I had misunderstood. The point the other person tried to make was much more subtle. It had to do with the other half of the problem of induction (which I always forget about, because I usually consider it an axiom when doing epistemology).

This other problem is referred to as the principle of the uniformity of nature. One must presuppose that the laws of nature are consistent across time and space. Recall that a Bayesian has prior beliefs and then upon encountering new data they update their beliefs factoring in both the prior and new data.

This criticism has to do with the application of Bayes’ theorem, period. In order to consider the prior to be relevant to factor in at all, you must believe it is … well, relevant! You’ve implicitly assumed at that step the uniformity of nature. If you don’t believe nature is consistent across time, then you should not factor prior beliefs into the formation of knowledge.

Now a Bayesian will often try to use Bayesian methods to justify the uniformity of nature. We start with a uniform prior so that we haven’t assumed anything about the past or its relevance to the future. Then we merely note that billions of people across thousands of years have only ever observed a uniformity of nature, and hence it is credible to believe the axiom is true.

Even though my gut buys that argument, it is a bit intellectually dishonest. You can never, ever justify an axiom by using a method that relies on that axiom. That is the quintessential begging the question fallacy.

I think the uniformity of nature issue can be dismissed on different grounds. If you want to dismiss an epistemology based on the uniformity of nature issue, then you have to be willing to dismiss every epistemology that allows you to come to knowledge.

It doesn’t matter what the method is. If you somehow come to knowledge, then one second later all of nature could have changed and hence you no longer have that knowledge. Knowledge is impossible if you want to use that criticism. All this leaves you with is radical skepticism, which of course leads to self-contradiction (if you know you can’t know anything, then you know something –><– ).

This is why I think of the uniformity of nature as a necessary axiom for epistemology. Without some form of it, epistemology is impossible. So at least in terms of the problem of induction, I do not see foundational problems for Bayesian epistemology.

Markov Chain Monte Carlo Example

Let’s look at a problem called parameter estimation. As usual, we have a bunch of coin flips. What we’ve learned to do with Bayesian statistics is calculate some posterior distribution that tells me how likely the bias {\theta} is. I ask myself, “Is it a credible hypothesis that the coin is fair ({\theta =1/2})?” I find out yes it is. I ask myself, “Is it a credible hypothesis that the coin is massively biased at {\theta=4/5}?” I find out yes it is. Uh oh.

Maybe in abstract math land this type of contradictory information is fine. I should be honest that both are credible guesses based on my data, and Bayesian statistics helps me to be very precise about my degrees of certainty and uncertainty. Unfortunately, in the real world I want to figure out which {\theta} is “best” so that I can use it in my model for practical purposes. This is called parameter estimation, because I want to estimate what my parameter {\theta} should be in my model.

We’re in luck for the coin example, because we only have one parameter living in one-dimensional space. This alone vastly simplifies the situation, but we have something far, far better. Our posterior distribution has a unique maximum, that max sits essentially on top of the mean of the distribution, and it can be determined easily and exactly! This means that we can safely use that parameter as the “best.”
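For the record, here’s the one-line calculus behind that claim. If the posterior is proportional to {\theta^a(1-\theta)^b}, then setting the derivative of its logarithm to zero gives

\displaystyle \frac{d}{d\theta}\left(a\ln\theta+b\ln(1-\theta)\right)=\frac{a}{\theta}-\frac{b}{1-\theta}=0 \quad\Longrightarrow\quad \theta=\frac{a}{a+b},

so the peak sits exactly at the observed proportion of heads.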

In the real world, we often have several parameters we are trying to estimate in a high-dimensional space, and the posterior is some non-convex crazy thing with lots of local mins/maxs that can’t be determined analytically. Let’s face it. Optimization is really hard even in relatively nice situations. The real world is usually not nice.

There often isn’t even an obvious notion of what you mean by “best” set of parameters either. Think of a symmetrical bimodal distribution where both peaks have the same max. You don’t really have any good reason to pick one of the points that gives the max, and if you do something like take the mean, then you might end up with a min by accident. The method I’m going to describe doesn’t really help with this issue of “equally good choices”, but it does give a beautiful way to deal with high-dimensional parameter spaces and crazy posterior distributions.

The idea is extremely simple. You will pick some starting collection of parameters. Then you let those parameters randomly move in some direction. We will then use our model to test whether or not it is more or less likely to see the data that we observed under each of those parameter choices. With some probability depending on this likelihood we will move that parameter to that value. This is just a Markov chain process of our {\theta} values moving through the possible parameter values, and hence this technique is called a Markov Chain Monte Carlo (MCMC) method (I used the indefinite article “a” because there are all sorts of variations on this out there).

It turns out that as long as we set this up in a reasonable way, then it will converge. Here’s something cool about this. Your parameters could live in some gigantic space for which it would be impossible to search for a good parameter estimation. Usually there is some much, much smaller dimensional subset of reasonably likely candidates. Once you move to this smaller dimensional set, by the nature of the algorithm, you will stay close to it and hence start moving to something optimal much faster. Here’s a picture showing how the random walks stay on a smaller set in a real example:

Let’s actually implement this in the silly case of the coin example where we know what the answer should be. My next post might try to implement this for some sort of real data set, although that could be more time consuming than I’m willing to do. To make this example more fun, I had the computer pick a random number in {[0,1]} and then generate 100 coin flips with bias equal to that number without telling me the number! This way we are in a realistic situation of not knowing what the “correct” answer is ahead of time.

I got 85 heads and 15 tails. To make computations easier, let’s assume the prior probability is just uniform. This means the posterior is proportional to the likelihood: {p(D|\theta)=\theta^{85}\cdot (1-\theta)^{15}}. I’ll start the random walk at {\theta = 0.5}. To know how much to move by, I pick a random number from a normal distribution with mean {0} and standard deviation {0.1}. So if I pick {0.05}, then my candidate place to move to is {0.55}.

I compute {p(D|\theta_{new})/p(D|\theta_{old})} and I move to the new spot with this probability. Note that if my new theta value is more likely to be the true theta, then I will always move to the new value, because the probability of moving is greater than {1}. The more unlikely my new theta value is, the less likely it is that I will move there. This implementation is called the Metropolis (or Metropolis-Hastings) algorithm. Note how simple the implementation is. It is only a few lines of code:

import numpy as np
import random
import pylab

# Posterior Distribution
def p(theta):
    return (theta**85)*((1-theta)**15)

# Random Walk Step Size
def norm_dist():
    return random.normalvariate(0, 0.1)
    
# Perform one step of random walk from spot theta
def rand_walk(theta):
    x = norm_dist()
    if 0 < theta + x < 1:
        return theta + x
    else:
        return theta
 
# Simulate the random walk for 1000 time steps   
walk = [0.5]
for i in range(1000):
    n = walk[-1]        # current position of the walk
    y = rand_walk(n)    # proposed new position
    # Accept the proposal with probability min(1, p(y)/p(n))
    if random.random() < p(y)/p(n):
        walk.append(y)
    else:
        walk.append(n)


# Plot the results
ylab = list(range(len(walk)))
pylab.plot(walk, ylab)
pylab.title('Random Walk Visualization')
pylab.xlabel('Theta Value')
pylab.ylabel('Time')
pylab.show() 

Note that the key insight that MCMC gives us is that picking values from the posterior is going to be “easy.” Even if we don’t know much about the distribution and have no idea how to explicitly calculate anything from it, we can still perform this random walk. This is what it looks like:

[Figure: plot of the random walk, showing the theta value at each time step]

The last step is to actually do a parameter estimation. The whole point is that the walk will stay close to the best value, so we can now just average these to get {\theta = 0.84}. The average is just a finite sum instead of an integral now. If we had done this analytically, we would have gotten {0.85}. Since we know MCMC is just giving us an estimation coming from randomness, this is really quite good!
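If you want the estimate straight from the code, just tack this onto the end of it (throwing away some early samples as “burn-in” is a standard convention, though with a posterior this tame it hardly matters):

# Parameter estimate: average the walk, discarding the first 100
# samples as burn-in (the cutoff of 100 is an arbitrary choice)
burn_in = 100
print(np.mean(walk[burn_in:]))  # lands near 0.84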

Decision Theory 2

Now we will move on to a far, far more complex version of the same problem from before. Recall last time we worked with a fair coin. We want to make guesses that minimize our loss (or maximize our utility). The assumption that the coin was fair basically nullified having to do any analysis. No matter what decision function we picked, we would have the same expected loss, i.e. there is no way to do better than random guessing.

Let’s introduce the complexity of an extra parameter slowly through an example. Let’s suppose again that the coin is fair, but we don’t know that ahead of time. We have no idea what the bias of the coin is. We’ve already analyzed how to model this situation in our Bayesian statistics example.

If we observe {n} heads and {m} tails, we have a probability distribution describing the likelihood of the possible biases. We found this to be the beta distribution {\beta(n, m)} (in the shifted notation of the earlier Bayesian statistics posts). If we start with a uniform, uninformed prior, then we can use Bayesian statistics to update our decision rule after each flip. This should make intuitive sense, because if the bias of the coin is 0.9, we should quickly see the posterior distribution reflect this and we will start getting most of our guesses correct.

Thus, the most naive thing to do is to look at the mean of the posterior distribution, which with the uniform prior works out to {\frac{n+1}{n+m+2}} (essentially the observed proportion of heads once we have a decent number of flips). If this number is bigger than {0.5}, then we guess heads because our Bayesian posterior predicts heads is coming up more frequently. If it is less than {0.5}, then we guess tails. If it equals {0.5}, then we make a random guess. Note that as long as the true bias is not {0.5}, we should be able to tell this with statistics after sufficiently many flips, which will give us a better expected loss (i.e. risk) than random guessing. Let’s try two examples to see what happens.

I won’t post the code or the graph of what happens if the true bias is {0.5}, because our previous analysis shows it to be exactly the same independent of our decision function. Thus our more complicated decision rule doesn’t actually do anything to improve our guess. As a second example, we can mildly modify the previous code to see what happens with a {0.75} bias:

import random
import numpy as np
import pylab

def flip(true_bias):
    rand = random.random()
    if rand > true_bias:
        return 0
    else:
        return 1

def simulate(money, bet, true_bias, num_flips):
    num_heads = 0
    est_bias = 0.5
    for i in range(num_flips):
    
        #make a choice based on Bayesian posterior
        if est_bias >= 0.5:
            choice = 1
        else:
            choice = 0
            
        #flip the coin
        rand = flip(true_bias)
        
        #keep track of the number of heads
        num_heads += rand
        
        #update estimated bias
        est_bias = float(num_heads+1)/(i+3)
                   
        #check whether or not choice was correct
        if choice == rand:
            money += 2*bet
        else:
            money -= bet
    return money

results = []
for i in range(1000):
    results.append(simulate(10, 1, 0.75, 100))

pylab.plot(results)
pylab.title('Coin Experiment Results')
pylab.xlabel('Trial Number')
pylab.ylabel('Money at the End of the Trial')
pylab.show()

print(np.mean(results))

The program says we end with {134.3} cents on average. Since we started with {10} cents, that’s pretty close to the {125} cents of winnings we’d expect, as opposed to the {50} cents we’d expect off of the {0.5} bias. These numbers should not be mysterious, because in the long run we expect to start guessing heads, which will occur {3/4} of the time. Thus our expected gain is {100((-1/4)+(3/4)*2)=125}. Here’s the plot of the experiment:

[Figure: money at the end of each of the 1,000 simulated trials]

This should feel a little weird, because with this decision rule we expect to always do better than (or equal to) our previous example. But this example is more realistic, because we don’t assume to know the bias of the coin! How could we do better with “less” information? That is the power of Bayesian decision theory which allows you to update your decision rule as you observe more information.

The classical admissible decision of always picking heads will do better if the bias is towards heads, because we don’t have to wait for our posterior to tell us to pick heads, but it will do terribly if the bias is towards tails, because even once we see that we get mostly tails we are not allowed to change our decision rule.

Let’s go back to our experiment of 100 coin flips. If {\theta} is the true bias of the coin, then the negative of the risk (the expected value of the utility function) of our Bayesian naive decision rule is

{-R(\theta) = \begin{cases} 100 (3\theta -1) & \ \text{if} \ \theta \geq 0.5 \\ 100(2-3\theta) & \text{if} \ \theta < 0.5\end{cases}.}

We've now successfully incorporated our new parameter. The risk will in general depend on this parameter. The function is just a "V" when graphed and our risk from last post is just a straight line {-R(\theta)=100(3\theta-1)}. It matches on the right piece, but is strictly below this one on the left half. This shows that no matter the bias of the coin, the naive Bayesian decision rule does better than our first post's choice.
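If you want to see that comparison concretely, here’s a quick sketch that just plots the two negative risk curves from the formulas above:

import numpy as np
import pylab

# Negative risk (expected gain over 100 flips) of the naive Bayesian rule
def neg_risk_bayes(theta):
    return 100*(3*theta - 1) if theta >= 0.5 else 100*(2 - 3*theta)

# Negative risk of always guessing heads
def neg_risk_heads(theta):
    return 100*(3*theta - 1)

thetas = np.linspace(0, 1, 101)
pylab.plot(thetas, [neg_risk_bayes(t) for t in thetas], label='naive Bayesian rule')
pylab.plot(thetas, [neg_risk_heads(t) for t in thetas], label='always pick heads')
pylab.legend()
pylab.xlabel('True bias theta')
pylab.ylabel('Expected gain (negative risk)')
pylab.show()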

Last post I said we could order the decision functions based on risk, and then we just call a minimum in the ordering admissible. Now we have to be more careful. With this extra parameter we only get a partial ordering by checking whether or not the risk is greater pointwise for every {\theta}. As just pointed out, the Bayesian decision function is lower in the ordering than random guessing or always picking heads (the two are comparable!). The question is, how do we know whether or not it is a minimum? Is this the best we can do? Is this naive decision rule admissible?

We will dig a little more into the theory next time about how those risk functions were computed (I just told you what they were which matched our experiments), and how to actually prove that a certain decision is admissible in this more complicated situation.

Statistical Oddities 4: The Base Rate Fallacy

If you are a classically trained scientist, then you probably do an experiment, get some data, run it through SPSS (or something similar), and then, to see whether or not the results are significant, you look at the {p}-value. It is standard that if {p<0.05}, then you consider the result to likely be "real" and not just some random noise making a pattern.

Why is that? Well, here's how you define the {p}-value. Suppose the effect you are looking for isn't actually there, i.e. the "null hypothesis" is true. What is the probability of seeing data at least as extreme as the data you got? That is the {p}-value. I want to test whether my coin is biased, so my null hypothesis is that it is fair. I do 200 flips. I calculate {p=0.04}. Then there is only a 4% chance that I would see a head/tail split at least that lopsided if my coin were actually fair.
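To make that number concrete, here's a quick sketch of the computation (the 115 heads out of 200 is a count I made up so that the exact two-sided {p}-value lands near 0.04):

from math import comb

n, heads = 200, 115

# Two-sided p-value under the null hypothesis of a fair coin: the
# probability of a head/tail split at least this lopsided, in either direction
extreme = max(heads, n - heads)
p_value = 2 * sum(comb(n, k) for k in range(extreme, n + 1)) / 2**n
print(p_value)  # roughly 0.04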

Danger! Frequently, people try to negate everything and say that "there is a 96% chance (or less obviously wrong they'll say we have 96% confidence) that the coin is biased," or equivalently that there is only a 4% chance that the coin is fair. If you read these posts, then you should immediately see the error. There can’t be a 4% chance of the coin being fair, because no matter what our flips were (it could be 100 and 100 after a 200-flip trial), the probability of it being exactly fair is still {0} (just compute the integral of the posterior distribution from {0.5} to {0.5}). Yet you see this language all over scientific papers.

If we want to talk about confidence, then we have a way to do it and it does not involve the {p}-value. I wouldn’t say this is a frequentist vs Bayesian thing, but I think the Bayesian analysis actually makes it harder to make this mistake. Recall that what we did there was use the language that being unbiased was a credible hypothesis given our 95% HDI. What we have confidence about is an interval. Maybe we have 95% confidence that the bias is in the interval {(0.2, 0.8)}. In this case, the hypothesis of being unbiased is credible, but the hypothesis of it being {0.7} is also credible with the same HDI.

Anyway, back to {p}-values. Since I’m using coin flipping as an example, you might think this is silly, but let’s ramp up our experiment. Suppose my lab works for a casino. We make sure coins are fair before passing them on to the casino (for their hot new coin flipping game?). I use a {p} value of {0.05} as usual. After {100} coins I expect that I’ve made a mistake on {5} or fewer because of my {p}-value, right? This is the type of interpretation you see all the time! It is clearly wrong.

Suppose {10} of them are biased due to manufacturing errors. Depending on the power of my test (I haven’t talked about power, but as you can imagine it depends on how many flips I use in my trials among other things) maybe I find 8 of them (this would be a power of {0.8}, which isn’t unreasonable in science). Now recall our definition of the {p}-value. I also have a {5\%} chance of incorrectly flagging each of my {90} unbiased coins as biased, which hands me about {5} false positives. This puts me at identifying {13} biased coins, only {8} of which are actually biased. Despite a {p}-value threshold of {0.05}, I actually only got {62\%} of my guesses of bias correct (you could calculate this much more easily using Bayes’ theorem).
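Here’s that Bayes’ theorem version of the computation, with the same numbers (a {10\%} base rate of biased coins, power {0.8}, and a {5\%} false positive rate):

\displaystyle P(\text{biased}\,|\,\text{flagged})=\frac{(0.8)(0.1)}{(0.8)(0.1)+(0.05)(0.9)}=\frac{0.08}{0.125}=0.64

The {62\%} above versus the {64\%} here is just from rounding the expected {12.5} flagged coins up to {13}.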

The above scenario is extremely common in some medical science labs where it matters. Suppose you test drugs to see if they work. Your test has {0.8} power and you use a {p}-value threshold of {0.05} as you’ve been taught. Out of every {100} drugs you test, you end up sending about {13} to the manufacturers claiming they work. You think that you are wrong only {5\%} of the time, but in reality {5} out of those {13} drugs don’t work! This is extremely dangerous. Of course, these should be weeded out in secondary trials, but who has the time or money to do that? If we think we have {95\%} confidence that a drug works, then we may as well send it out to help people and only do our repeat experiment while it is on the market.

Ignoring the real interpretation of the p-value in favor of the more optimistic one is so common it has a name: the base rate fallacy. The name comes from the fact that the base rate of a drug actually working (or of a coin actually being biased) is so low that a large fraction of your positives will be false positives, even with a high power test and a small p-value threshold. I know this type of thing has been posted on the internet all over the place, but I hadn’t done it yet and it seemed to fit in with the statistical oddities series. For the record, the example scenario above was taken from Statistics Done Wrong by Alex Reinhart.

Bayesian Statistics Worked Example Part 2

Last time I decided my post was too long, so I cut some stuff out and now this post is fleshing those parts out into their own post. Recall our setup. We perform an experiment of flipping a coin. Our data set consists of {a} heads and {b} tails. We want to run a Bayesian analysis to figure out whether or not the coin is biased. Our bias is a number between {0} and {1} which just indicates the expected proportion of times it will land on heads.

We found our situation was modeled by the beta distribution: {P(\theta |a,b)=\beta(a,b)}. I reiterate here a word of warning. ALL other sources will call this {B(a+1, b+1)}. I’ve just shifted by 1 for ease of notation. We saw last time that if our prior belief is that the probability distribution is {\beta(x,y)}, then our posterior belief should be {\beta(x+a, y+b)}. This simple “update rule” falls out purely from Bayes’ Theorem.

The main thing I didn’t explain last time was what exactly I meant by the phrase “we can say with 95% confidence that the true bias of the coin lies between {0.40} and {0.60}” or whatever the particular numbers are that we get from our data. What I had in mind for that phrase was something called the highest density interval (HDI). The 95% HDI just means that it is an interval for which the area under the distribution is {0.95} (i.e. an interval spanning 95% of the distribution) such that every point in the interval has a higher probability than any point outside of the interval (I apologize for such highly unprofessional pictures):

[Figure: a 95% HDI; every point on the curve over the shaded interval is higher than any point on the curve outside of it]

(It doesn’t look like it, but that is supposed to be perfectly symmetrical.)

[Figure: a 95% interval that is not an HDI; a point outside the interval is higher up than some points inside it]

The first is the correct way to make the interval, because notice all points on the curve over the shaded region are higher up (i.e. more probable) than points on the curve not in the region. There are lots of 95% intervals that are not HDI’s. The second is such a non-example, because even though the area under the curve is 0.95, the big purple point is not in the interval but is higher up than some of the points off to the left which are included in the interval.

Lastly, we will say that a hypothesized bias {\theta_0} is credible if some small neighborhood of that value lies completely inside our 95% HDI. That small threshold is sometimes called the “region of practical equivalence (ROPE)” and is just a value we must set. If we set it to be 0.02, then we would say that the coin being fair is a credible hypothesis if the whole interval from 0.48 to 0.52 is inside the 95% HDI.
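In code, the credibility check is nothing more than an interval containment test. Here’s a tiny sketch (the function name is mine, and the example numbers just match the cases below):

# Is theta_0 a credible hypothesis, given a 95% HDI and a ROPE half-width?
def is_credible(theta_0, hdi_low, hdi_high, rope=0.02):
    return hdi_low <= theta_0 - rope and theta_0 + rope <= hdi_high

print(is_credible(0.5, 0.45, 0.75))  # True: all of (0.48, 0.52) is inside the HDI
print(is_credible(0.5, 0.49, 0.84))  # False: 0.48 falls outside the HDI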

A note ahead of time: calculating the HDI for the beta distribution is actually kind of a mess, because there is no closed form solution, so usually you just look these things up in a table or approximate it somehow. Both the mean {\mu=\frac{a}{a+b}} and the standard deviation {\left(\frac{\mu(1-\mu)}{a+b+1}\right)^{1/2}} do have closed forms (those are in terms of the standard {B(a,b)} parameters, so remember the shift by 1 if you’re using my notation). Thus I’m going to approximate for the sake of this post using the “two standard deviations” rule that says that two standard deviations on either side of the mean is roughly 95%. Caution: if the distribution is highly skewed, for example {\beta(3,25)} or something, then this approximation will actually be way off.

Let’s go back to the same examples from before and add in this new terminology to see how it works. Suppose we have absolutely no idea what the bias is and we make our prior belief {\beta(0,0)} the flat line. This says that we believe ahead of time that all biases are equally likely. Now we observe {3} heads and {1} tails. Bayesian analysis tells us that our new distribution is {\beta(3,1)}. The 95% HDI in this case is approximately 0.49 to 0.84. Thus we can say with 95% certainty that the true bias is in this region. Note that it is NOT a credible hypothesis off of this data to guess that the coin is fair, because 0.48 is not in the HDI. This example really illustrates how choosing different thresholds can matter, because if we picked a ROPE of 0.01 rather than 0.02, then that guess would be credible!

Let’s see what happens if we use just an ever so slightly more reasonable prior. We’ll use {\beta(2,2)}. This gives us a starting assumption that the coin is probably fair, but it is still very open to whatever the data suggests. In this case our {3} heads and {1} tails tells us our posterior distribution is {\beta(5,3)}. In this case the 95% HDI is 0.45 to 0.75. Using the same data we get a slightly narrower interval here, but more importantly we feel much more comfortable with the claim that the coin being fair is still a credible hypothesis.

This brings up a sort of “statistical uncertainty principle.” If we want a ton of certainty, then it forces our interval to get wider and wider. This makes intuitive sense, because if I want to give you a range that I’m 99.9999999% certain the true bias is in, then I better give you practically every possibility. If I want to pinpoint a precise spot for the bias, then I have to give up certainty (unless you’re in an extreme situation where the distribution is a really sharp spike or something). You’ll end up with something like: I can say with 1% certainty that the true bias is between 0.59999999 and 0.6000000001. We’ve locked onto a small range, but we’ve given up certainty. Note the similarity to the Heisenberg uncertainty principle which says the more precisely you know the momentum or position of a particle the less precisely you know the other.

Let’s wrap up by trying to pinpoint exactly where we needed to make choices for this statistical model. The most common objection to Bayesian models is that you can subjectively pick a prior to rig the model to get any answer you want. Hopefully this wrap up will show that in the abstract that objection is essentially correct, but in real life practice you cannot get away with this.

Step 1 was to write down the likelihood function {P(\theta | a,b)=\beta(a,b)}. This was derived directly from the type of data we were collecting and was not a choice. Step 2 was to determine our prior distribution. This was a choice, but a constrained one. In real life statistics you will probably have a lot of prior information that will go into this choice. Recall that the prior encodes both what we believe is likely to be true and how confident we are in that belief. Suppose you make a model to predict who will win an election based off of polling data. You have previous year’s data and that collected data has been tested, so you know how accurate it was! Thus forming your prior based on this information is a well-informed choice. Just because a choice is involved here doesn’t mean you can arbitrarily pick any prior you want to get any conclusion you want.

I can’t reiterate this enough. In our example, if you pick a prior of {\beta(100,1)} with no reason to expect the coin is biased, then we have every right to reject your model as useless. Your prior must be informed and must be justified. If you can’t justify your prior, then you probably don’t have a good model. The choice of prior is a feature, not a bug. If a Bayesian model turns out to be much more accurate than all other models, then it probably came from the fact that prior knowledge was not being ignored. It is frustrating to see opponents of Bayesian statistics use the “arbitrariness of the prior” as a failure when it is exactly the opposite (see the picture at the end of this post for a humorous illustration.)

The last step is to set a ROPE to determine whether or not a particular hypothesis is credible. This merely rules out considering something right on the edge of the 95% HDI from being a credible guess. Admittedly, this step really is pretty arbitrary, but every statistical model has this problem. It isn’t unique to Bayesian statistics, and it isn’t typically a problem in real life. If something is so close to being outside of your HDI, then you’ll probably want more data. For example, if you are a scientist, then you re-run the experiment or you honestly admit that it seems possible to go either way.

What is Bayesian Statistics: A basic worked example

I did a series on Bayes’ Theorem a while ago and it gave us some nice heuristics on how a rational person ought to update their beliefs as new evidence comes in. The term “Bayesian statistics” gets thrown around a lot these days, so I thought I’d do a whole post just working through a single example in excruciating detail to show what is meant by this. If you understand this example, then you basically understand what Bayesian statistics is.

Problem: We run an experiment of flipping a coin {N} times and record a {1} every time it comes up heads and a {0} every time it comes up tails. This gives us a data set. Using this data set and Bayes’ theorem, we want to figure out whether or not the coin is biased and how confident we are in that assertion.

Let’s get some technical stuff out of the way. This is the least important part to fully understand for this post, but it is kind of necessary. Define {\theta} to be the bias towards heads. This just means that if {\theta=0.5}, then the coin has no bias and is perfectly fair. If {\theta=1}, then the coin will never land on tails. If {\theta = 0.75}, then if we flip the coin a huge number of times we will see close to {3} out of every {4} flips land on heads. For notation we’ll let {y} be the trait of whether or not it lands on heads or tails (so it is {0} or {1}).

We can encode this information mathematically by saying {P(y=1|\theta)=\theta}. In plain English: the probability that the coin lands on heads given that the bias towards heads is {\theta} is {\theta}. Likewise, {P(y=0|\theta)=1-\theta}. Let’s just chain a bunch of these coin flips together now. Let {a} be the event of seeing {a} heads when flipping the coin {N} times (I know, the double use of {a} is horrifying there but the abuse makes notation easier later).

Since coin flips are independent we just multiply probabilities and hence {P(a|\theta)=\theta^a(1-\theta)^{N-a}}. Rather than lug around the total number {N} and have that subtraction, normally people just let {b} be the number of tails and write {P(a,b |\theta)=\theta^a(1-\theta)^b}. Let’s just do a quick sanity check to make sure this seems right. Note that if {a,b\geq 1}, then as the bias goes to zero the probability goes to zero. This is expected because we observed a heads ({a\geq 1}), so it is highly unlikely to be totally biased towards tails. Likewise as {\theta} gets near {1} the probability goes to {0}, because we observed a tails.

The other special cases are when {a=0} or {b=0}, and in these cases we just recover that the probability of getting heads {a} times in a row, if the probability of heads is {\theta}, is {\theta^a}. Of course, the peak of {\beta (a,b)} sits at {a/(a+b)}, the proportion of heads observed. Moving on, we haven’t quite thought of this in the correct way yet, because in our introductory problem we have a fixed data set that we want to analyze. So from now on we should think about {a} and {b} being fixed from the data we observed.

The idea now is that as {\theta} varies through {[0,1]} we have a distribution {P(a,b|\theta)}. What we want to do is multiply this by the constant that makes it integrate to {1} so we can think of it as a probability distribution. In fact, it has a name called the beta distribution (caution: the usual form is shifted from what I’m writing), so we’ll just write {\beta(a,b)} for this (the number we multiply by is the inverse of {B(a,b)=\int_0^1 \theta^a(1-\theta)^b d\theta} called the (shifted) beta function).
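If you want to generate plots like the ones below yourself, here’s a rough sketch of how to do it (the particular head/tail counts are just for illustration):

from math import factorial
import numpy as np
import pylab

# Normalized "shifted" beta density: theta^a (1-theta)^b / B(a,b),
# where B(a,b) = a! b! / (a+b+1)! is the (shifted) beta function
def beta_pdf(theta, a, b):
    norm = factorial(a) * factorial(b) / factorial(a + b + 1)
    return theta**a * (1 - theta)**b / norm

theta = np.linspace(0, 1, 500)
for a, b in [(2, 8), (5, 5), (8, 2)]:
    pylab.plot(theta, beta_pdf(theta, a, b), label='%d heads, %d tails' % (a, b))

pylab.legend()
pylab.xlabel('Bias theta')
pylab.ylabel('Probability density')
pylab.show()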

This might seem unnecessarily complicated to start thinking of this as a probability distribution in {\theta}, but it is actually exactly what we are looking for. Consider the following three examples:

[Figure: three example beta distributions for different observed head/tail counts]

The red one says if we observe {2} heads and {8} tails, then the probability that the coin has a bias towards tails is greater. The peak happens at {0.20}, but because we don’t have a lot of data there is still a pretty high probability of the true bias lying elsewhere. The middle one says if we observe 5 heads and 5 tails, then the most probable thing is that the bias is {0.5}, but again there is still a lot of room for error. If we do a ton of trials to get enough data to be more confident in our guess, then we see something like:

[Figure: the beta distribution after observing 50 heads and 50 tails]

Already at observing 50 heads and 50 tails we can say with 95% confidence that the true bias lies between 0.40 and 0.60. Alright, you might be objecting at this point that this is just usual statistics, where the heck is Bayes’ Theorem? You’d be right. Bayes’ Theorem comes in because we aren’t building our statistical model in a vacuum. We have prior beliefs about what the bias is.

Let’s just write down Bayes’ Theorem in this case. We want to know the probability of the bias {\theta} being some number given our observations in our data. We use the “continuous form” of Bayes’ Theorem:

\displaystyle P(\theta|a,b)=\frac{P(a,b|\theta)P(\theta)}{\int_0^1 P(a,b|\theta)P(\theta)\,d\theta}

I’m trying to give you a feel for Bayesian statistics, so I won’t work out in detail the simplification of this. Just note that the “posterior probability” (the left hand side of the equation), i.e. the distribution we get after taking into account our data, is the likelihood times our prior beliefs divided by the evidence. Now if you note that the denominator is just another one of those beta-function integrals and work everything out, it turns out to be another beta distribution!
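For the skeptical, the key step is short enough to show. If the prior is {\beta(x,y)}, i.e. proportional to {\theta^x(1-\theta)^y}, then the numerator of Bayes’ Theorem is

\displaystyle P(a,b|\theta)P(\theta)=\theta^a(1-\theta)^b\cdot\theta^x(1-\theta)^y=\theta^{a+x}(1-\theta)^{b+y},

which is exactly the un-normalized form of {\beta(a+x, b+y)}, and the denominator is just the constant that normalizes it.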

If our prior belief is that the bias has distribution {\beta(x,y)}, then if our data has {a} heads and {b} tails we get {P(\theta|a,b)=\beta(a+x, b+y)}. The way we update our beliefs based on evidence in this model is incredibly simple. Now I want to sanity check that this makes sense again. Suppose we have absolutely no idea what the bias is and we make our prior belief {\beta(0,0)} the flat line. This says that we believe ahead of time that all biases are equally likely.

Now we observe {3} heads and {1} tails. Bayesian analysis tells us that our new (posterior probability) distribution is {\beta(3,1)}:

[Figure: the posterior distribution beta(3,1), peaked towards heads]

Yikes! We don’t have a lot of certainty, but it looks like the bias is heavily towards heads. Danger: this is because we used a terrible prior. This is the real world, so it isn’t reasonable to think that a bias of {0.99} is just as likely as {0.45}. Let’s see what happens if we use just an ever so slightly more modest prior. We’ll use {\beta(2,2)}. This encodes the assumption that the bias is most likely close to {0.5}, but it is still very open to whatever the data suggests. In this case our {3} heads and {1} tails tells us our updated belief is {\beta(5,3)}:

[Figure: the posterior distribution beta(5,3), slightly favoring heads]

Ah. Much better. We see a slight bias coming from the fact that we observed {3} heads and {1} tails and these can’t totally be ignored, but our prior belief tames how much we let this sway our new beliefs. This is what makes Bayesian statistics so great. If we have tons of prior evidence of a hypothesis, then observing a few outliers shouldn’t make us change our minds. On the other hand, the setup allows for us to change our minds even if we are 99% certain on something as long as sufficient evidence is given. This is the mantra: extraordinary claims require extraordinary evidence.

Not only would a ton of evidence be able to persuade us that the coin bias is {0.90}, but we should need a ton of evidence. This is part of the shortcomings of non-Bayesian analysis. It would be much easier to become convinced of such a bias if we didn’t have a lot of data and we accidentally sampled some outliers.

Anyway. Now you should have an idea of Bayesian statistics. In fact, if you understood this example, then most of the rest is just adding parameters and using other distributions, so you actually have a really good idea of what is meant by that term now.