The Carter Catastrophe

I’ve been reading Manifold: Time by Stephen Baxter. The book is quite good so far, and it presents a fascinating probabilistic argument that humans will go extinct in the near future. It is sometimes called the Carter Catastrophe, because Brandon Carter first proposed it in 1983.

I’ll use Bayesian arguments, so you might want to review some of my previous posts on the topic if you’re feeling shaky. One thing we didn’t talk all that much about is the idea of model selection. This is the most common thing scientists have to do. If you run an experiment, you get a bunch of data. Then you have to figure out the most likely reason for what you see.

Let’s take a basic example. We have a giant tub of golf balls, and we can’t see inside the tub. There could be 1 ball or a million. We’re told the owner accidentally dropped a red ball in at some point. All the other balls are the standard white golf balls. We decide to run an experiment where we draw a ball out, one at a time, until we reach the red one.

First ball: white. Second ball: white. Third ball: red. We stop. We’ve now generated a data set from our experiment, and we want to use Bayesian methods to give the probability of there being three total balls or seven or a million. In probability terms, we need to calculate the probability that there are x balls in the tub given that we drew the red ball on the third draw. Any time we see this language, our first thought should be Bayes’ theorem.

Define A_i to be the model of there being exactly i balls in the tub. I’ll use “3” inside of P( ) to be the event of drawing the red ball on the third try. We have to make a finiteness assumption, and although this is one of the main critiques of the argument, we can examine what happens as we let the size of the bound grow. Suppose for now the tub can only hold 100 balls.

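A priori, we have no idea how many balls are in there, so we’ll assume all “models” are equally likely. This means P(A_i)=1/100 for all i. If there are i balls, the red one is equally likely to turn up on any of the i draws, so P(3|A_i)=1/i for i \geq 3 and P(3|A_i)=0 for i<3. By Bayes’ theorem we can calculate: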

P(A_3|3) = \frac{P(3|A_3)P(A_3)}{\sum_{i=1}^{100}P(3|A_i)P(A_i)}

= \frac{(1/3)(1/100)}{(1/100)\sum_{i=3}^{100}1/i} \approx 0.09

So there’s around a 9% chance that there are only 3 balls in the tub. That bottom summation remains exactly the same when computing P(A_n | 3) for any n and equals about 3.69, and the (1/100) cancels out every time. So we can compute explicitly that for n \geq 3:

P(A_n|3)\approx \frac{1}{n}(0.27)

This is a decreasing function of n, and this shouldn’t be surprising at all. It says that as we guess there are more and more balls in the tub, the probability of that guess goes down. This makes sense, because it’s unreasonable to think we’d see the red one that early if there are actually 100 balls in the tub.
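Here is a minimal sketch of that computation in Python, assuming the same setup as above (uniform prior, tub capped at 100 balls, red ball found on draw 3). The function name `posterior_given_red_on_draw` is just something I made up for illustration.

```python
from fractions import Fraction

def posterior_given_red_on_draw(k, max_balls):
    """Posterior P(A_n | red on draw k) for n = 1..max_balls, uniform prior."""
    prior = Fraction(1, max_balls)
    # With n balls the red one is equally likely to come out at any position,
    # so the likelihood is 1/n when n >= k and 0 when n < k.
    joint = {n: (Fraction(1, n) * prior if n >= k else Fraction(0))
             for n in range(1, max_balls + 1)}
    evidence = sum(joint.values())  # the denominator in Bayes' theorem
    return {n: joint[n] / evidence for n in joint}

post = posterior_given_red_on_draw(3, 100)
for n in (3, 7, 50, 100):
    print(n, round(float(post[n]), 4))
```

Changing `max_balls` lets you watch what happens as the bound on the tub grows: every individual probability shrinks, but the posterior still decreases like 1/n.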

There are lots of ways to play with this. What happens if our tub could hold millions but we still assume a uniform prior? It just takes all the probabilities down, but the general trend is the same: It becomes less and less reasonable to assume a large total number of balls given that we found the red one so early.

You could also only care about this “earliness” idea and redo the computations where you ask how likely is A_n given that we found the red ball by the third try. This is actually the more typical way the problem is formulated in the Doomsday arguments. It’s more complicated, but the same idea pops out, and this should make intuitive sense.

Part of the reason these computations were somewhat involved is that we tried to get a distribution on the natural numbers. But we could have tried to compare heuristically to get a super clear answer (homework for you). What if we only had two choices “small number of total balls (say 10)” or “large number of total balls (say 10,000)”? With equal prior weight on the two, the likelihoods of finding the red ball on the third draw are 1/10 and 1/10,000, so you’d find there is about a 99.9% chance that the “small” hypothesis is correct.
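If you want to check your homework, here is a small sketch of the two-hypothesis comparison. It assumes the red ball was found on the third draw, and the `prior_small` parameter is my own addition so you can see what happens when the prior is stacked against the “small” hypothesis.

```python
def posterior_small(small, large, draw, prior_small=0.5):
    """P(small hypothesis | red ball found on the given draw)."""
    # P(red on a given draw | n balls) is 1/n as long as n >= draw.
    like_small = 1.0 / small if small >= draw else 0.0
    like_large = 1.0 / large if large >= draw else 0.0
    num = like_small * prior_small
    return num / (num + like_large * (1.0 - prior_small))

# Equal prior weight on 10 balls versus 10,000 balls
print(posterior_small(10, 10000, 3))
# Prior that favors the "large" hypothesis 99 to 1
print(posterior_small(10, 10000, 3, prior_small=0.01))
```

Even when the prior favors the large tub 99 to 1, the small hypothesis still comes out well ahead, because the likelihood ratio is 1000 to 1.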

Here’s the leap. Now assume the fact that you exist right now is random. In other words, you popped out at a random point in the existence of humans. So the totality of humans to ever exist are the white balls and you are the red ball. The same type of argument above applies, and it says that the most likely thing is that you aren’t born at some super early point in human history. In fact, it’s unreasonable from a probabilistic standpoint to think that humans will continue much longer at all given your existence.

The “small” total population of humans is far, far more likely than the “large” total population, and the interesting thing is that this remains true even if you mess with the uniform prior. You could assume it is much more likely a priori for humans to continue to make improvements and colonize space and develop vaccines giving a higher prior for the species existing far into the future. But unfortunately the Bayesian argument will still pull so strongly in favor of humans ceasing to exist in the near future that one must conclude it is inevitable and will happen soon!

Anyway. I’m travelling this week, so I’m sorry if there are errors in those calculations. I was in a hurry and never double checked them. The crux of the argument should still make sense even if you don’t get my exact numbers. There are also a lot of interesting and convincing rebuttals, but I don’t have time to get into them now (including the fact that unlikely hypotheses turn out to be true all the time).


Does Bayesian Epistemology Suffer Foundational Problems?

I recently had a discussion about whether Bayesian epistemology suffers from the problem of induction, and I think some interesting things came from it. If these words make you uncomfortable, think of epistemology as the study of how we form beliefs and gain knowledge. Bayesian epistemology means we model it probabilistically using Bayesian methods. This old post of mine talks a bit about it but is long and unnecessary to read to get the gist of this post.

I always think of the problem of induction in terms of the classic swan analogy. Someone wants to claim that all swans are white. They go out and see swan after swan after swan, each confirming the claim. Is there any point at which the person can legitimately say they know that all swans are white?

Classically, the answer is no. The problem of induction is crippling to classic epistemologies, because we can never be justified in believing any universal claim (at least using empirical methods). One of the great things about probabilistic epistemologies (not just Bayesian) is that they circumvent this problem.

Classical epistemologies require you to have 100% certainty to attain knowledge. Since you can’t ever be sure you’ve encountered every instance of a universal, you can’t be certain there is no instance that violates the universal. Hence the problem of induction is an actual problem. But note it is only a problem if your definition of knowledge requires you to have absolute certainty of truth.

Probabilistic epistemologies lower the threshold. They merely ask that you have 95% (or 98%, etc) confidence (or that your claim sits in some credible region, etc) for the justification. By definition, knowledge is always tentative and subject to change in these theories of knowledge.

This is one of the main reasons to use a probabilistic epistemology. It is the whole point. They were invented to solve this problem, so I definitely do not believe that Bayesian epistemology suffers from the problem of induction.

But it turns out I had misunderstood. The point the other person tried to make was much more subtle. It had to do with the other half of the problem of induction (which I always forget about, because I usually consider it an axiom when doing epistemology).

This other problem is referred to as the principle of the uniformity of nature. One must presuppose that the laws of nature are consistent across time and space. Recall that a Bayesian has prior beliefs and then upon encountering new data they update their beliefs factoring in both the prior and new data.

This criticism has to do with the application of Bayes’ theorem period. In order to consider the prior to be relevant to factor in at all, you must believe it is … well, relevant! You’ve implicitly assumed at that step the uniformity of nature. If you don’t believe nature is consistent across time, then you should not factor prior beliefs into the formation of knowledge.

Now a Bayesian will often try to use Bayesian methods to justify the uniformity of nature. We start with a uniform prior so that we haven’t assumed anything about the past or its relevance to the future. Then we merely note that billions of people across thousands of years have only ever observed a uniformity of nature, and hence it is credible to believe the axiom is true.

Even though my gut buys that argument, it is a bit intellectually dishonest. You can never, ever justify an axiom by using a method that relies on that axiom. That is the quintessential begging the question fallacy.

I think the uniformity of nature issue can be dismissed on different grounds. If you want to dismiss an epistemology based on the uniformity of nature issue, then you have to be willing to dismiss every epistemology that allows you to come to knowledge.

It doesn’t matter what the method is. If you somehow come to knowledge, then one second later all of nature could have changed and hence you no longer have that knowledge. Knowledge is impossible if you want to use that criticism. All this leaves you with is radical skepticism, which of course leads to self-contradiction (if you know you can’t know anything, then you know something –><– ).

This is why I think of the uniformity of nature as a necessary axiom for epistemology. Without some form of it, epistemology is impossible. So at least in terms of the problem of induction, I do not see foundational problems for Bayesian epistemology.

Markov Chain Monte Carlo Example

Let’s look at a problem called parameter estimation. As usual, we have a bunch of coin flips. What we’ve learned to do with Bayesian statistics is calculate some posterior distribution that tells me how likely the bias {\theta} is. I ask myself, “Is it a credible hypothesis that the coin is fair ({\theta =1/2})?” I find out yes it is. I ask myself, “Is it a credible hypothesis that the coin is massively biased at {\theta=4/5}?” I find out yes it is. Uh oh.

Maybe in abstract math land this type of contradictory information is fine. I should be honest that both are credible guesses based on my data, and Bayesian statistics helps me to be very precise about my degrees of certainty and uncertainty. Unfortunately, in the real world I want to figure out which {\theta} is “best” so that I can use it in my model for practical purposes. This is called parameter estimation, because I want to estimate what my parameter {\theta} should be in my model.

We’re in luck for the coin example, because we only have one parameter living in one-dimensional space. This alone vastly simplifies the situation, but we have something far, far better. Our posterior distribution has a unique maximum, that max sits very close to the mean of the distribution (the two coincide exactly only when the posterior is symmetric), and that max can be determined easily and exactly! This means that we can safely use that parameter as the “best.”

In the real world, we often have several parameters we are trying to estimate in a high-dimensional space, and the posterior is some non-convex crazy thing with lots of local mins/maxs that can’t be determined analytically. Let’s face it. Optimization is really hard even in relatively nice situations. The real world is usually not nice.

There often isn’t even an obvious notion of what you mean by “best” set of parameters either. Think of a symmetrical bimodal distribution where both peaks have the same max. You don’t really have any good reason to pick one of the points that gives the max, and if you do something like take the mean, then you might end up with a min by accident. The method I’m going to describe doesn’t really help with this issue of “equally good choices”, but it does give a beautiful way to deal with high-dimensional parameter spaces and crazy posterior distributions.

The idea is extremely simple. You will pick some starting collection of parameters. Then you let those parameters randomly move in some direction. We will then use our model to test whether or not it is more or less likely to see the data that we observed under each of those parameter choices. With some probability depending on this likelihood we will move that parameter to that value. This is just a Markov chain process of our {\theta} values moving through the possible parameter values, and hence this technique is called a Markov Chain Monte Carlo (MCMC) method (I used the indefinite article “a” because there are all sorts of variations on this out there).

It turns out that as long as we set this up in a reasonable way, then it will converge. Here’s something cool about this. Your parameters could live in some gigantic space for which it would be impossible to search for a good parameter estimation. Usually there is some much, much smaller dimensional subset of reasonably likely candidates. Once you move to this smaller dimensional set, by the nature of the algorithm, you will stay close to it and hence start moving to something optimal much faster. Here’s a picture showing how the random walks stay on a smaller set in a real example:

Let’s actually implement this in the silly case of the coin example where we know what the answer should be. My next post might try to implement this for some sort of real data set, although that could be more time consuming than I’m willing to do. To make this example more fun, I had the computer pick a random number in {[0,1]} and then generate 100 coin flips with bias equal to that number without telling me the number! This way we are in a realistic situation of not knowing what the “correct” answer is ahead of time.

I got 85 heads and 15 tails. To make computations easier, let’s assume the prior probability is just uniform. This means the posterior is proportional to the likelihood {p(D|\theta)=\theta^{85}\cdot (1-\theta)^{15}}. I’ll start the random walk at {\theta = 0.5}. To know how much to move by, I pick a random number from a normal distribution with mean {0} and standard deviation {0.1}. So if I pick {0.05}, then my candidate place to move to is {0.55}.

I compute {p(D|\theta_{new})/p(D|\theta_{old})} and I move to the new spot with this probability. Note that if my new theta value is more likely to be the true theta, then I will always move to the new value, because the ratio is greater than {1} (so the probability of moving is effectively capped at {1}). The more unlikely my new theta value is, the less likely it is that I will move there. This implementation is called the Metropolis (or Metropolis-Hastings) algorithm. Note how simple the implementation is. It is only a few lines of code:

import numpy as np
import random
import pylab

# Posterior distribution (up to a normalizing constant)
def p(theta):
    return (theta**85)*((1-theta)**15)

# Random walk step size
def norm_dist():
    return random.normalvariate(0, 0.1)

# Perform one step of random walk from spot theta;
# a step that would leave (0, 1) is discarded and we stay put
def rand_walk(theta):
    x = norm_dist()
    if theta + x < 1 and theta + x > 0:
        return theta + x
    else:
        return theta

# Simulate the random walk for 1000 time steps, starting at theta = 0.5
walk = [0.5]
for i in range(1000):
    n = walk[-1]
    y = rand_walk(n)
    # Move to the candidate with probability min(1, p(y)/p(n))
    if random.random() < p(y)/p(n):
        walk.append(y)
    else:
        walk.append(n)

# Plot the results
ylab = [i for i in range(len(walk))]
pylab.plot(walk, ylab)
pylab.title('Random Walk Visualization')
pylab.xlabel('Theta Value')
pylab.ylabel('Time')
pylab.show()

# Parameter estimate: average of the walk
print(np.mean(walk))

Note that the key insight that MCMC gives us is that picking values from the posterior is going to be “easy.” Even if we don’t know much about the distribution and have no idea how to explicitly calculate anything from it, we can still perform this random walk. This is what it looks like:


The last step is to actually do a parameter estimation. The whole point is that the walk will stay close to the best value, so we can now just average these to get {\theta = 0.84}. The average is just a finite sum instead of an integral now. If we had done this analytically, we would have gotten a posterior mean of {86/102\approx 0.843} and a maximum at {0.85}. Since we know MCMC is just giving us an estimation coming from randomness, this is really quite good!
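As a sanity check on what the walk should be averaging to, here is a rough sketch that computes the mean and the maximum of the same posterior by brute force on a grid. This is not MCMC, just a plain Riemann sum that only works because we have a single parameter; the grid size is an arbitrary choice of mine.

```python
# Brute-force grid check of the coin posterior theta^85 * (1 - theta)^15
n_grid = 100000
thetas = [i / n_grid for i in range(n_grid + 1)]
unnorm = [t**85 * (1 - t)**15 for t in thetas]  # posterior up to a constant

total = sum(unnorm)
grid_mean = sum(t * u for t, u in zip(thetas, unnorm)) / total
grid_mode = thetas[unnorm.index(max(unnorm))]

print(grid_mean)  # should be close to 86/102
print(grid_mode)  # should be close to 85/100
```

In one dimension this is cheap, but the number of grid points blows up exponentially with the number of parameters, which is exactly the situation where the random walk earns its keep.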

Decision Theory 2

Now we will move on to a far, far more complex version of the same problem from before. Recall last time we worked with a fair coin. We want to make guesses that minimize our loss (or maximize our utility). The assumption that the coin was fair basically nullified having to do any analysis. No matter what decision function we picked, we would have the same expected loss, i.e. there is no way to do better than random guessing.

Let’s introduce the complexity of an extra parameter slowly through an example. Let’s suppose again that the coin is fair, but we don’t know that ahead of time. We have no idea what the bias of the coin is. We’ve already analyzed how to model this situation in our Bayesian statistics example.

If we observe {n} heads and {m} tails, we have a probability distribution describing the likelihood of the possible biases. We found this to be the beta distribution {B(n+1, m+1)}. If we start with a uniform, uninformed prior, then we could use Bayesian statistics to update our decision rule after each flip. This should make intuitive sense, because if the bias of the coin is 0.9, we should quickly see the posterior distribution reflect this and we will start getting most of our guesses correct.

Thus, the most naive thing to do is to look at the mean of the posterior distribution: {\frac{n+1}{n+m+2}}. If this number is bigger than {0.5} (which happens exactly when {n>m}), then we guess heads because our Bayesian posterior predicts heads is coming up more frequently. If it is less than {0.5}, then we guess tails. If it equals {0.5}, then we make a random guess. Note that as long as the true bias is not {0.5}, we should be able to tell this with statistics after sufficiently many flips which will give us a better expected loss (i.e. risk) than random guessing. Let’s try two examples to see what happens.

I won’t post the code or the graph of what happens if the true bias is {0.5}, because our previous analysis shows it to be exactly the same independent of our decision function. Thus our more complicated decision rule doesn’t actually do anything to improve our guess. As a second example, we can mildly modify the code previously to see what happens with a {0.75} bias:

import random
import numpy as np
import pylab

# Flip a coin that lands heads (1) with probability true_bias
def flip(true_bias):
    rand = random.random()
    if rand > true_bias:
        return 0
    else:
        return 1

def simulate(money, bet, true_bias, num_flips):
    num_heads = 0
    est_bias = 0.5
    for i in range(num_flips):
        #make a choice based on Bayesian posterior (ties go to heads here)
        if est_bias >= 0.5:
            choice = 1
        else:
            choice = 0
        #flip the coin
        rand = flip(true_bias)
        #keep track of the number of heads
        num_heads += rand
        #update estimated bias: posterior mean (heads+1)/(flips+2)
        est_bias = float(num_heads+1)/(i+3)
        #check whether or not choice was correct
        if choice == rand:
            money += 2*bet
        else:
            money -= bet
    return money

results = []
for i in range(1000):
    results.append(simulate(10, 1, 0.75, 100))

pylab.plot(results)
pylab.title('Coin Experiment Results')
pylab.xlabel('Trial Number')
pylab.ylabel('Money at the End of the Trial')
pylab.show()

print(np.mean(results))

The program says we average ending with {134.3} cents. We made pretty close to {125} cents as opposed to making {50} cents off of the {0.5} bias. These numbers should not be mysterious, because in the long run we expect to start guessing heads which will occur {3/4} of the time. Thus our expected gain is {100((-1/4)+(3/4)*2)=125}. Here’s the plot of the experiment:


This should feel a little weird, because with this decision rule we expect to always do better than (or equal to) our previous example. But this example is more realistic, because we don’t assume to know the bias of the coin! How could we do better with “less” information? That is the power of Bayesian decision theory which allows you to update your decision rule as you observe more information.

The classical admissible decision of always picking heads will do better if the bias is towards heads because we don’t have to wait for our posterior to tell us to pick heads, but it will do terribly if the bias is towards tails because even once we see that we get mostly tails we are not allowed to change our decision rule.

Let’s go back to our experiment of 100 coin flips. If {\theta} is the true bias of the coin, then the negative of the risk (the expected value of the utility function) of our Bayesian naive decision rule is

{-R(\theta) = \begin{cases} 100 (3\theta -1) & \ \text{if} \ \theta \geq 0.5 \\ 100(2-3\theta) & \text{if} \ \theta < 0.5\end{cases}.}

We've now successfully incorporated our new parameter. The risk will in general depend on this parameter. The function is just a "V" when graphed and our risk from last post is just a straight line {-R(\theta)=100(3\theta-1)}. It matches on the right piece, but is strictly below this one on the left half. This shows that no matter the bias of the coin, the naive Bayesian decision rule does at least as well as our first post's choice.
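To make the comparison concrete, here is a small sketch of those expected utilities (win 2 on a correct guess, lose 1 on a wrong one, 100 flips). The function names are mine, and like the formula above it idealizes the Bayesian rule by ignoring the handful of early flips spent figuring out which way the coin leans.

```python
def utility_always_heads(theta, flips=100):
    # Heads comes up with probability theta: win 2, otherwise lose 1.
    return flips * (2 * theta - (1 - theta))      # = flips * (3*theta - 1)

def utility_always_tails(theta, flips=100):
    return flips * (2 * (1 - theta) - theta)      # = flips * (2 - 3*theta)

def utility_naive_bayes(theta, flips=100):
    # Idealized: the rule ends up guessing whichever side is more likely.
    return max(utility_always_heads(theta, flips),
               utility_always_tails(theta, flips))

for theta in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(theta, utility_always_heads(theta), utility_naive_bayes(theta))
```

Scanning over {\theta} shows the "V" sitting on or above the straight line everywhere, which is the pointwise comparison used to order decision rules.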

Last post I said we could order the decision functions based on risk, and then we just call a minimum in the ordering admissible. Now we have to be more careful. With this extra parameter we only get a partial ordering by checking whether or not the risk is greater pointwise for every {\theta}. As just pointed out, the Bayesian decision function is lower in the ordering than random guessing or always picking heads (the two are comparable!). The question is, how do we know whether or not it is a minimum? Is this the best we can do? Is this naive decision rule admissible?

We will dig a little more into the theory next time about how those risk functions were computed (I just told you what they were which matched our experiments), and how to actually prove that a certain decision is admissible in this more complicated situation.

Statistical Oddities 4: The Base Rate Fallacy

If you are a classically trained scientist, then you probably do an experiment, get some data, run it through SPSS (or something similar), then to see whether or not the results are significant you look at the {p}-value. It is a standard that if {p<0.05}, then you consider the result to likely be "real" and not just some random noise making a pattern.

Why is that? Well, here's how you define the {p}-value. Suppose there is really nothing going on, i.e. the null hypothesis is true. What is the probability of seeing data at least as extreme as yours? That is the {p}-value. Say my null hypothesis is that my coin is fair. I do 200 flips. I calculate {p=0.04}. Then there is only a 4% chance that I would see a result at least that lopsided if my coin is fair.

Danger! Frequently, people try to negate everything and say that "there is a 96% chance (or less obviously wrong they'll say we have 96% confidence) that the coin is biased." If you read these posts, then you should immediately see the error. The {p}-value is a statement about the data under the assumption that the coin is fair; it is not the probability of any hypothesis about the coin. There can’t be a 4% chance of the coin being fair either, because no matter what our flips were (it could be 100 heads and 100 tails after a 200 flip trial), the probability of it being exactly fair is still {0} (just compute the integral of the posterior distribution from {0.5} to {0.5}). Yet you see this language all over scientific papers.

If we want to talk about confidence, then we have a way to do it and it does not involve the {p}-value. I wouldn’t say this is a frequentist vs Bayesian thing, but I think the Bayesian analysis actually makes it harder to make this mistake. Recall that what we did there was use the language that being unbiased was a credible hypothesis given our 95% HDI. What we have confidence about is an interval. Maybe we have 95% confidence that the bias is in the interval {(0.2, 0.8)}. In this case, the hypothesis of being unbiased is credible, but the hypothesis of it being {0.7} is also credible with the same HDI.

Anyway, back to {p}-values. Since I’m using coin flipping as an example, you might think this is silly, but let’s ramp up our experiment. Suppose my lab works for a casino. We make sure coins are fair before passing them on to the casino (for their hot new coin flipping game?). I use a {p}-value threshold of {0.05} as usual. After {100} coins I expect that I’ve made a mistake on {5} or fewer because of my {p}-value, right? This is the type of interpretation you see all the time! It is clearly wrong.

Suppose {10} of them are biased due to manufacturing errors. Depending on the power of my test (I haven’t talked about power, but as you can imagine it depends on how many flips I use in my trials among other things) maybe I find 8 of them (this would be a power of {0.8} which isn’t unreasonable in science). Now recall our definition of the {p}-value. I also have a {5\%} chance of incorrectly saying that any one of my unbiased coins is biased, so I expect to flag about {5} of the {90} unbiased coins ({4.5}, rounded up). This puts me at identifying {13} biased coins only {8} of which are actually biased. Despite a {p}-value threshold of {0.05}, I actually only got {62\%} of my guesses of bias correct (you could calculate this much more easily using Bayes’ theorem).
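Here is that Bayes’ theorem calculation as a short sketch. The function name and parameters are my own labels for the quantities in the example: 100 coins, 10 truly biased, power 0.8, threshold 0.05.

```python
def fraction_of_flags_correct(n_total, n_biased, power, alpha):
    """Of the coins flagged as biased, what fraction really are biased?"""
    true_positives = power * n_biased               # biased coins we catch
    false_positives = alpha * (n_total - n_biased)  # fair coins flagged anyway
    return true_positives / (true_positives + false_positives)

print(fraction_of_flags_correct(100, 10, 0.8, 0.05))
```

Without rounding the 4.5 expected false positives up to 5, this gives 8/12.5 = 0.64 rather than 8/13, but the conclusion is the same: far fewer than 95% of the flagged coins are actually biased.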

The above scenario is extremely common in some medical science labs where it matters. Suppose you test a drug to see if it works. Your test has {0.8} power and you use a {p}-value threshold of {0.05} as you’ve been taught. You send {13} to drug manufacturers claiming they work. You think that you are wrong only {5\%} of the time, but in reality after you’ve tested {100} drugs (assuming, as with the coins, that only {10} of them actually work), {5} out of the {13} drugs you send don’t work! This is extremely dangerous. Of course, these should be weeded out on secondary trials, but who has the time or money to do that? If we think we have {95\%} confidence that it works, then we may as well send them out to help people and only do our repeat experiment while it is on the market.

Ignoring the real interpretation of the p-value in favor of the more optimistic one is so common it has a name: the base rate fallacy. This is because that high number of false positives comes from the fact that the base rate of a drug working (or the coin being biased) is so low that you are likely to get false positives even with a high power test and a small p-value. I know this type of thing has been posted on the internet all over the place, but I hadn’t done it yet and it seemed to fit in with the statistical oddities series. For the record, the example scenario above was taken from Statistics Done Wrong by Alex Reinhart.

Bayesian Statistics Worked Example Part 2

Last time I decided my post was too long, so I cut some stuff out and now this post is fleshing those parts into their own post. Recall our setup. We perform an experiment of flipping a coin. Our data set consists of {a} heads and {b} tails. We want to run a Bayesian analysis to figure out whether or not the coin is biased. Our bias is a number between {0} and {1} which just indicates the expected proportion of times it will land on heads.

We found our situation was modeled by the beta distribution: {P(\theta |a,b)=\beta(a,b)}. I reiterate here a word of warning. ALL other sources will call this {B(a+1, b+1)}. I’ve just shifted by 1 for ease of notation. We saw last time that if our prior belief is that the probability distribution is {\beta(x,y)}, then our posterior belief should be {\beta(x+a, y+b)}. This simple “update rule” falls out purely from Bayes’ Theorem.
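The update rule is simple enough to write down in a couple of lines. This is only a sketch in my shifted notation, with a helper (the name `to_standard` is mine) for converting to the parameters other sources use.

```python
def update(prior, heads, tails):
    """Posterior (x + heads, y + tails) from prior beta(x, y), shifted notation."""
    x, y = prior
    return (x + heads, y + tails)

def to_standard(shifted):
    """Convert the shifted beta(a, b) to the usual B(a + 1, b + 1) parameters."""
    a, b = shifted
    return (a + 1, b + 1)

flat = (0, 0)                     # the flat prior
posterior = update(flat, 3, 1)    # observe 3 heads and 1 tails
print(posterior, to_standard(posterior))

# Updating one flip at a time lands in the same place as updating all at once.
step = flat
for outcome in (1, 1, 0, 1):
    step = update(step, outcome, 1 - outcome)
print(step == posterior)
```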

The main thing I didn’t explain last time was what exactly I meant by the phrase “we can say with 95% confidence that the true bias of the coin lies between {0.40} and {0.60}” or whatever the particular numbers are that we get from our data. What I had in mind for that phrase was something called the highest density interval (HDI). The 95% HDI just means that it is an interval for which the area under the distribution is {0.95} (i.e. an interval spanning 95% of the distribution) such that every point in the interval has a higher probability than any point outside of the interval (I apologize for such highly unprofessional pictures):


(It doesn’t look like it, but that is supposed to be perfectly symmetrical.)

The first is the correct way to make the interval, because notice all points on the curve over the shaded region are higher up (i.e. more probable) than points on the curve not in the region. There are lots of 95% intervals that are not HDI’s. The second is such a non-example, because even though the area under the curve is 0.95, the big purple point is not in the interval but is higher up than some of the points off to the left which are included in the interval.

Lastly, we will say that a hypothesized bias {\theta_0} is credible if some small neighborhood of that value lies completely inside our 95% HDI. That small threshold is sometimes called the “region of practical equivalence (ROPE)” and is just a value we must set. If we set it to be 0.02, then we would say that the coin being fair is a credible hypothesis if the whole interval from 0.48 to 0.52 is inside the 95% HDI.
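In code, the credibility check is just an interval containment test. This is a minimal sketch; the function name is mine, the first interval is the 0.40 to 0.60 from the phrase above, and the second is a made-up interval that barely misses.

```python
def is_credible(theta0, hdi, rope=0.02):
    """True if the whole neighborhood theta0 +/- rope lies inside the HDI."""
    low, high = hdi
    return low <= theta0 - rope and theta0 + rope <= high

print(is_credible(0.5, (0.40, 0.60)))   # 0.48 to 0.52 fits inside
print(is_credible(0.5, (0.49, 0.84)))   # 0.48 falls just outside
```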

A note ahead of time: calculating the HDI for the beta distribution is actually kind of a mess because of the nature of the function. There is no closed form solution, so usually you can just look these things up in a table or approximate it somehow. Both the mean {\mu=\frac{a+1}{a+b+2}} and the standard deviation {\left(\frac{\mu(1-\mu)}{a+b+3}\right)^{1/2}} do have closed forms (these are written in my shifted notation). Thus I’m going to approximate for the sake of this post using the “two standard deviations” rule that says that two standard deviations on either side of the mean is roughly 95%. Caution: if the distribution is highly skewed, for example {\beta(3,25)} or something, then this approximation will actually be way off.

Let’s go back to the same examples from before and add in this new terminology to see how it works. Suppose we have absolutely no idea what the bias is and we make our prior belief {\beta(0,0)} the flat line. This says that we believe ahead of time that all biases are equally likely. Now we observe {3} heads and {1} tails. Bayesian analysis tells us that our new distribution is {\beta(3,1)}. The 95% HDI in this case is approximately 0.49 to 0.84. Thus we can say with 95% certainty that the true bias is in this region. Note that it is NOT a credible hypothesis off of this data to guess that the coin is fair because 0.48 is not in HDI. This example really illustrates how choosing different thresholds can matter, because if we picked an interval of 0.01 rather than 0.02, then that guess would be credible!

Let’s see what happens if we use just an ever so slightly more reasonable prior. We’ll use {\beta(2,2)}. This gives us a starting assumption that the coin is probably fair, but it is still very open to whatever the data suggests. In this case our {3} heads and {1} tails tells us our posterior distribution is {\beta(5,3)}. In this case the 95% HDI is 0.45 to 0.75. Using the same data we get a little bit more narrow interval here, but more importantly we feel much more comfortable with the claim that the coin being fair is still a credible hypothesis.
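For anyone who wants to reproduce intervals like these, here is a rough sketch of the “mean plus or minus some number of standard deviations” approximation. It assumes my shifted notation, so {\beta(a,b)} is the standard {B(a+1,b+1)}, and it takes the number of standard deviations as a parameter `k` (the name and the clipping to {[0,1]} are my own choices).

```python
from math import sqrt

def approx_interval(a, b, k=2.0):
    """Mean +/- k standard deviations for the shifted beta(a, b) = B(a+1, b+1)."""
    alpha, beta = a + 1, b + 1
    mean = alpha / (alpha + beta)
    sd = sqrt(mean * (1 - mean) / (alpha + beta + 1))
    # Clip to [0, 1] since a bias cannot leave that range.
    return max(0.0, mean - k * sd), min(1.0, mean + k * sd)

for a, b in ((3, 1), (5, 3)):
    print((a, b), approx_interval(a, b, k=1.0), approx_interval(a, b, k=2.0))
```

Running this, the intervals quoted in the two examples (0.49 to 0.84 and 0.45 to 0.75) come out with `k=1.0`. With the full two standard deviations the intervals are wider, roughly 0.31 to 1 and 0.30 to 0.90, so it is worth keeping track of which `k` you are using before deciding whether a hypothesis is credible.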

This brings up a sort of “statistical uncertainty principle.” If we want a ton of certainty, then it forces our interval to get wider and wider. This makes intuitive sense, because if I want to give you a range that I’m 99.9999999% certain the true bias is in, then I better give you practically every possibility. If I want to pinpoint a precise spot for the bias, then I have to give up certainty (unless you’re in an extreme situation where the distribution is a really sharp spike or something). You’ll end up with something like: I can say with 1% certainty that the true bias is between 0.59999999 and 0.6000000001. We’ve locked onto a small range, but we’ve given up certainty. Note the similarity to the Heisenberg uncertainty principle which says the more precisely you know the momentum or position of a particle the less precisely you know the other.

Let’s wrap up by trying to pinpoint exactly where we needed to make choices for this statistical model. The most common objection to Bayesian models is that you can subjectively pick a prior to rig the model to get any answer you want. Hopefully this wrap up will show that, in the abstract, that objection is essentially correct, but in real life practice you cannot get away with this.

Step 1 was to write down the likelihood function {P(a,b|\theta)=\theta^a(1-\theta)^b}, which becomes {\beta(a,b)} once normalized as a distribution in {\theta}. This was derived directly from the type of data we were collecting and was not a choice. Step 2 was to determine our prior distribution. This was a choice, but a constrained one. In real life statistics you will probably have a lot of prior information that will go into this choice. Recall that the prior encodes both what we believe is likely to be true and how confident we are in that belief. Suppose you make a model to predict who will win an election based off of polling data. You have previous years’ data and that collected data has been tested, so you know how accurate it was! Thus forming your prior based on this information is a well-informed choice. Just because a choice is involved here doesn’t mean you can arbitrarily pick any prior you want to get any conclusion you want.

I can’t reiterate this enough. In our example, if you pick a prior of {\beta(100,1)} with no reason to expect the coin is biased, then we have every right to reject your model as useless. Your prior must be informed and must be justified. If you can’t justify your prior, then you probably don’t have a good model. The choice of prior is a feature, not a bug. If a Bayesian model turns out to be much more accurate than all other models, then it probably came from the fact that prior knowledge was not being ignored. It is frustrating to see opponents of Bayesian statistics use the “arbitrariness of the prior” as a failure when it is exactly the opposite (see the picture at the end of this post for a humorous illustration).

The last step is to set a ROPE to determine whether or not a particular hypothesis is credible. This merely rules out considering something right on the edge of the 95% HDI from being a credible guess. Admittedly, this step really is pretty arbitrary, but every statistical model has this problem. It isn’t unique to Bayesian statistics, and it isn’t typically a problem in real life. If something is so close to being outside of your HDI, then you’ll probably want more data. For example, if you are a scientist, then you re-run the experiment or you honestly admit that it seems possible to go either way.

What is Bayesian Statistics: A basic worked example

I did a series on Bayes’ Theorem a while ago and it gave us some nice heuristics on how a rational person ought to update their beliefs as new evidence comes in. The term “Bayesian statistics” gets thrown around a lot these days, so I thought I’d do a whole post just working through a single example in excruciating detail to show what is meant by this. If you understand this example, then you basically understand what Bayesian statistics is.

Problem: We run an experiment of flipping a coin {N} times and record a {1} every time it comes up heads and a {0} every time it comes up tails. This gives us a data set. Using this data set and Bayes’ theorem, we want to figure out whether or not the coin is biased and how confident we are in that assertion.

Let’s get some technical stuff out of the way. This is the least important part to fully understand for this post, but is kind of necessary. Define {\theta} to be the bias towards heads. This just means that if {\theta=0.5}, then the coin has no bias and is perfectly fair. If {\theta=1}, then the coin will never land on tails. If {\theta = 0.75}, then if we flip the coin a huge number of times we will see close to {3} out of every {4} flips land on heads. For notation we’ll let {y} record whether a flip lands on heads or tails (so it is {1} or {0}).

We can encode this information mathematically by saying {P(y=1|\theta)=\theta}. In plain English: The probability that the coin lands on heads given that the bias towards heads is {\theta} is {\theta}. Likewise, {P(y=0|\theta)=1-\theta}. Let’s just chain a bunch of these coin flips together now. Let {a} be the event of seeing {a} heads when flipping the coin {N} times (I know, the double use of {a} is horrifying there but the abuse makes notation easier later).

Since coin flips are independent we just multiply probabilities and hence {P(a|\theta)=\theta^a(1-\theta)^{N-a}}. Rather than lug around the total number {N} and have that subtraction, normally people just let {b} be the number of tails and write {P(a,b |\theta)=\theta^a(1-\theta)^b}. Let’s just do a quick sanity check to make sure this seems right. Note that if {a,b\geq 1}, then as the bias goes to zero the probability goes to zero. This is expected because we observed a heads ({a\geq 1}), so it is highly unlikely to be totally biased towards tails. Likewise as {\theta} gets near {1} the probability goes to {0}, because we observed a tails.

The other special cases are when {a=0} or {b=0}. When {b=0}, for instance, we just recover that the probability of getting heads {a} times in a row if the probability of heads is {\theta} is {\theta^a}. Of course, {\theta^a(1-\theta)^b} peaks at {\theta=a/(a+b)}, the proportion of heads observed. Moving on, we haven’t quite thought of this in the correct way yet, because in our introductory problem we have a fixed data set that we want to analyze. So from now on we should think about {a} and {b} being fixed from the data we observed.
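Here is a minimal sketch of that function in Python, just to make the sanity checks concrete. The function name and the grid scan are my own illustration, not part of any library.

```python
def likelihood(theta, heads, tails):
    """P(a, b | theta) = theta^a * (1 - theta)^b for independent flips."""
    return theta ** heads * (1 - theta) ** tails

# Sanity checks from the text: with at least one head and one tail,
# the likelihood vanishes at both extremes of the bias.
print(likelihood(0.0, 3, 1))   # 0.0
print(likelihood(1.0, 3, 1))   # 0.0

# With the data fixed (2 heads, 8 tails), scan theta over a grid
# to locate where the likelihood is largest.
grid = [i / 1000 for i in range(1001)]
peak = max(grid, key=lambda t: likelihood(t, 2, 8))
print(peak)                    # 0.2, i.e. a / (a + b)
```

Treating the data as fixed and letting {\theta} vary is exactly the shift in viewpoint the rest of the post relies on.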

The idea now is that as {\theta} varies through {[0,1]} we have a function {P(a,b|\theta)} of {\theta}. What we want to do is multiply this by the constant that makes it integrate to {1} so we can think of it as a probability distribution. In fact, it has a name: the beta distribution (caution: the usual form is shifted from what I’m writing), so we’ll just write {\beta(a,b)} for this. The number we multiply by is the inverse of {B(a,b)=\int_0^1 \theta^a(1-\theta)^b d\theta}, called the (shifted) beta function.
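As a rough sketch of the normalization, assuming {a} and {b} are non-negative integers, the shifted beta function has the closed form {a!\,b!/(a+b+1)!}. The helper names below are hypothetical, and the midpoint-rule integral is only there as a numerical check that the normalized curve really has area {1}.

```python
from math import factorial

def shifted_beta_function(a, b):
    """Integral over [0, 1] of theta^a * (1 - theta)^b for non-negative
    integers a and b, using the closed form a! b! / (a + b + 1)!."""
    return factorial(a) * factorial(b) / factorial(a + b + 1)

def beta_density(theta, a, b):
    """The normalized curve written beta(a, b) in this post's shifted convention."""
    return theta ** a * (1 - theta) ** b / shifted_beta_function(a, b)

def integrate(f, steps=100_000):
    """Midpoint-rule approximation of the integral of f over [0, 1]."""
    width = 1 / steps
    return sum(f((i + 0.5) * width) for i in range(steps)) * width

# The area under beta(3, 1) should come out very close to 1.
area = integrate(lambda t: beta_density(t, 3, 1))
print(shifted_beta_function(3, 1), area)
```

In the usual (unshifted) convention this same curve would be called Beta(4, 2), which is the shift the caution above refers to.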

This might seem unnecessarily complicated to start thinking of this as a probability distribution in {\theta}, but it is actually exactly what we are looking for. Consider the following three examples:


The red one says if we observe {2} heads and {8} tails, then the probability that the coin has a bias towards tails is greater. The peak happens at {0.20}, but because we don’t have a lot of data there is still a pretty high probability of the true bias lying elsewhere. The middle one says if we observe 5 heads and 5 tails, then the most probable thing is that the bias is {0.5}, but again there is still a lot of room for error. If we do a ton of trials to get enough data to be more confident in our guess, then we see something like:


Already at observing 50 heads and 50 tails we can say with 95% confidence that the true bias lies between 0.40 and 0.60. Alright, you might be objecting at this point that this is just usual statistics, where the heck is Bayes’ Theorem? You’d be right. Bayes’ Theorem comes in because we aren’t building our statistical model in a vacuum. We have prior beliefs about what the bias is.
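If you want to check the 50 heads and 50 tails interval yourself, here is a rough numerical sketch. It approximates the normalized curve on a grid and keeps the most probable grid points until they hold 95% of the total mass. The function name, the grid size, and the grid approach itself are my own choices for illustration, so treat the result as an approximation.

```python
def highest_density_interval(a, b, mass=0.95, grid_size=20001):
    """Approximate the highest-density interval of the curve
    theta^a * (1 - theta)^b on [0, 1] using a uniform grid."""
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    weights = [t ** a * (1 - t) ** b for t in thetas]
    total = sum(weights)
    # Visit grid points from most to least probable, accumulating mass.
    order = sorted(range(grid_size), key=lambda i: weights[i], reverse=True)
    kept, accumulated = [], 0.0
    for i in order:
        kept.append(thetas[i])
        accumulated += weights[i] / total
        if accumulated >= mass:
            break
    # For a single-peaked curve the kept points form one interval.
    return min(kept), max(kept)

low, high = highest_density_interval(50, 50)
print(round(low, 2), round(high, 2))
```

Raising `mass` widens the interval and lowering it narrows the interval, which is worth playing with to get a feel for the trade-off between certainty and precision.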

Let’s just write down Bayes’ Theorem in this case. We want to know the probability of the bias {\theta} being some number given our observations in our data. We use the “continuous form” of Bayes’ Theorem:

\displaystyle P(\theta|a,b)=\frac{P(a,b|\theta)P(\theta)}{\int_0^1 P(a,b|\theta')P(\theta')d\theta'}

I’m trying to give you a feel for Bayesian statistics, so I won’t work out in detail the simplification of this. Just note that the “posterior probability” (the left hand side of the equation), i.e. the distribution we get after taking into account our data, is the likelihood times our prior beliefs divided by the evidence. Now if you take the prior to be a beta distribution, the denominator becomes a ratio of (shifted) beta functions, and when you work everything out the posterior turns out to be another beta distribution!

If our prior belief is that the bias has distribution {\beta(x,y)}, then if our data has {a} heads and {b} tails we get {P(\theta|a,b)=\beta(a+x, b+y)}. The way we update our beliefs based on evidence in this model is incredibly simple. Now I want to sanity check that this makes sense again. Suppose we have absolutely no idea what the bias is and we make our prior belief {\beta(0,0)} the flat line. This says that we believe ahead of time that all biases are equally likely.

Now we observe {3} heads and {1} tails. Bayesian analysis tells us that our new (posterior probability) distribution is {\beta(3,1)}:


Yikes! We don’t have a lot of certainty, but it looks like the bias is heavily towards heads. Danger: This is because we used a terrible prior. This is the real world so it isn’t reasonable to think that a bias of {0.99} is just as likely as {0.45}. Let’s see what happens if we use just an ever so slightly more modest prior. We’ll use {\beta(2,2)}. This puts our assumption on it being most likely close to {0.5}, but it is still very open to whatever the data suggests. In this case our {3} heads and {1} tails tells us our updated belief is {\beta(5,3)}:


Ah. Much better. We see a slight bias coming from the fact that we observed {3} heads and {1} tails and these can’t totally be ignored, but our prior belief tames how much we let this sway our new beliefs. This is what makes Bayesian statistics so great. If we have tons of prior evidence of a hypothesis, then observing a few outliers shouldn’t make us change our minds. On the other hand, the setup allows for us to change our minds even if we are 99% certain of something as long as sufficient evidence is given. This is the mantra: extraordinary claims require extraordinary evidence.
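The two updates above can be reproduced in a few lines. This is only a sketch in this post’s shifted convention, where {\beta(x,y)} is proportional to {\theta^x(1-\theta)^y}; the function names are hypothetical.

```python
def update(prior, heads, tails):
    """Posterior parameters for a beta(x, y) prior (shifted convention)
    after observing the given numbers of heads and tails."""
    x, y = prior
    return (x + heads, y + tails)

def peak(params):
    """Location of the maximum of theta^a * (1 - theta)^b; needs a + b > 0."""
    a, b = params
    return a / (a + b)

flat_posterior = update((0, 0), heads=3, tails=1)    # flat prior
modest_posterior = update((2, 2), heads=3, tails=1)  # prior centred on 0.5

print(flat_posterior, peak(flat_posterior))      # (3, 1) 0.75
print(modest_posterior, peak(modest_posterior))  # (5, 3) 0.625
```

The same four flips pull the peak to 0.75 under the flat prior but only to 0.625 under the modest one, which is the taming effect described above.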

Not only would a ton of evidence be able to persuade us that the coin bias is {0.90}, but we should need a ton of evidence. This is one of the shortcomings of non-Bayesian analysis. It would be much easier to become convinced of such a bias if we didn’t have a lot of data and we accidentally sampled some outliers.

Anyway. Now you should have an idea of Bayesian statistics. In fact, if you understood this example, then most of the rest is just adding parameters and using other distributions, so you actually have a really good idea of what is meant by that term now.

Bayes’ Theorem and Bias

I wonder how many times I’ll say this. This will be my last Bayes’ theorem post. At this point a careful reader should be able to extract most of the following post from the past few, but it is definitely worth spelling out in detail here. We’ve been covering how academics have used Bayes’ theorem in their work. It is also important to see how Bayes’ theorem could be useful to you in your own life.

For this post I’m going to take as my starting point that it is better in the long run for you to believe true things rather than false things. Since this is an academic blog and most academics are seeking truth in general, they hold to some sort of rational or skeptic philosophy. Whole books have been written defending why this position will improve society, but wanting to believe true things shouldn’t be that controversial of a position.

Honestly, to people who haven’t spent a lot of time learning about bias, it is probably impossible to overestimate how important a role it plays in making decisions. Lots of well-educated, logical, careful people can look at both sides of the evidence of something and honestly think they are making an objective decision about what is true based on the evidence, but in reality they are just reconfirming a belief they made for totally irrational reasons.

We’ve all seen this type of picture before:

Even though you know the following things:

1. What the optical illusion (i.e. bias) is.
2. How and why it works.
3. The truth, which is easily verified using a ruler, that the lines are the same length.

This knowledge does not give you the power to overcome the illusion/bias. You will still continue to see the lines as different lengths. If bias can do this for a sense as objective as sight, think about how easily tricked you can be if you go off of intuition or feelings.

This exercise makes us confront a startling conclusion. In order to form a true belief, we must use the conclusion that looks and feels wrong. We must trust the fact we came to through the verifiably more objective means. This is true of your opinions/beliefs as well. You probably have false beliefs that will still look and feel true to you even once you’ve changed your mind about them. You need to trust the evidence and arguments.

A Bayesian analysis of this example might run as follows. You have the belief that the lines are different lengths from looking at it. In fact, you could reasonably set the prior probability that this belief is true pretty high because although your eyesight has been wrong in the past, you estimate that around 99% of the time it wouldn’t make such an obvious and large error. The key piece of evidence you acquired is when you measured this with a ruler. You find they are the same length. This evidence is so strong in comparison with your confidence in your eyesight that it vastly outweighs the prior probability and you confidently conclude your first belief was false.
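To put rough numbers on this, here is a small sketch of the update. The 99% prior comes from the paragraph above, but the two ruler probabilities are made up purely for illustration: I’m assuming a ruler reports “same length” only 0.1% of the time when the lines really differ, and 99.9% of the time when they really are the same.

```python
def bayes_update(prior, p_evidence_if_true, p_evidence_if_false):
    """P(belief | evidence) from Bayes' theorem for a yes/no belief."""
    numerator = p_evidence_if_true * prior
    evidence = numerator + p_evidence_if_false * (1 - prior)
    return numerator / evidence

# Belief: "the lines are different lengths". Evidence: the ruler says
# they are the same. The two likelihoods are assumed values.
posterior = bayes_update(prior=0.99,
                         p_evidence_if_true=0.001,
                         p_evidence_if_false=0.999)
print(round(posterior, 3))  # 0.09
```

Under these assumed numbers, one ruler measurement drops the belief from 99% to about 9%.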

You probably came to many of the beliefs you have early on in life. Maybe your parents held them. Maybe your childhood friends influenced you. Maybe you saw some program on TV that got you thinking a certain way. In any case, all of these are bad reasons to believe something. Now you’ve grown up, and you think that you’ve re-evaluated these beliefs and can justify them. In reality, you’ve probably just reconfirmed them through bias.

Once you’ve taken a position on something, your brain has an almost incomprehensible number of tricks it can do in order to prevent you from changing your mind. This is called bias and you will be totally unaware of it happening. The rational position is to recognize this happens and try to remove it as much as possible in order to change an untrue belief to a true belief. Trust me. If done right, this will be a painful process. But if you want true beliefs, it must be done and you must be willing to give up your most cherished beliefs since childhood even if it means temporary social ostracization (spell check tells me this isn’t a real word, but it feels right).

What this tells us is that if we really want true beliefs we need to periodically revisit our beliefs and do a thorough sifting of the evidence in as objective a way as possible to see if our beliefs have a chance at being true.

Since there are far too many known cognitive biases to list, I can’t go through all the ones you might encounter, but here are a few. One is confirmation bias. When you look at evidence for and against a belief you will tend to remember only the evidence that confirmed your already held belief (even if the evidence against is vastly greater!). It is difficult to reiterate this enough, but you will not consciously throw the evidence out. You will not be aware that this happened to you. You will feel as if you evenly weighed both sides of the evidence.

One of my favorite biases that seems to receive less attention is what I call the many-against-one bias (I’m not sure if it has a real name). Suppose you have three solid pieces of evidence for your belief. Suppose the counter-evidence is much better and there are seven solid pieces of it. When you look through this, what you will tend to do is look at the first piece of evidence and think, “Well, my side has these three pieces of evidence and so although that looks good it isn’t as strong as my side.” Then you move on to piece of counter-evidence two and do the same thing.

All of a sudden you’ve dismissed tons of good evidence that when taken together would convince you to change your mind, but since it was evaluated separately in a many-against-one (hence the name!) fashion you’ve kept your old opinion. Since you can’t read all the counter-evidence simultaneously, and you probably have your own personal evidence well-reinforced, it is extremely difficult to avoid this fallacy.

And on and on and on and on and on … it goes. Seriously. This should not be thought of as “bad” or something. Just a fact. It will happen, and you will not be aware of it. If you just simply look at both sides of the argument you will 99.99% of the time just come out believing the same thing. You need to take very careful precautions to avoid this.

Enter Bayes’ theorem. Do not misconstrue what I’m saying here as this being a totally objective way to come to the truth. This is just one way that you could try as a starting point. Here’s how it works. You take a claim/belief which we call B. Now you look at the best arguments and evidence for the claim you can find. You write each one down, clearly numbered, with lots of space between. Now you go find all the best counterarguments and evidence you can find to those claims and write those down next to the original ones. Now do the exact same thing with the best arguments/evidence you can find against the claim/belief.

One at a time you totally bracket off all your feelings and thoughts about the total question at hand. Just look at evidence 1 together with its counter-evidence. Supposing the claim is true, what are the chances that you would have this evidence? This is part of your P(E|B) calculation. Don’t think about how it will affect the total P(B|E) calculation. Stay detached. Find people who have the opposite opinion as you and try to convince them of your number just on this one tiny thing. If you can’t, maybe you aren’t weighting it right.

Go through every piece of evidence this way weighing it totally on its own merits and not in relation to the whole argument. Having everything written down ahead of time will help you overcome confirmation bias. Evaluating the probabilities in this way one at a time will help you overcome the many-against-one bias (you’ll probably physically feel this bias when you do it this way as you start to think, “But it isn’t that good in relation to this.”) This will also overcome numerous other biases, especially ones involving poor intuitions about probability. But do not think you’ve somehow overcome them all, because you won’t.

One of the hardest steps is then to combine your calculations into Bayes’ theorem. You should think about whether or not pieces of evidence are truly independent if you want a proper calculation. But overall you’ll get the probability that your belief is true given the evidence, and it will probably be pretty shocking. Maybe you were super confident (99.99% or something) that there was no real reason to doubt it, but you find out it is more like 55%.
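As a sketch of what the combining step can look like, here is the odds form of Bayes’ theorem. It assumes the pieces of evidence are independent given the belief and given its negation, which is exactly the assumption flagged above, and every number in the example is invented for illustration.

```python
def posterior_probability(prior, evidence):
    """Combine evidence with Bayes' theorem in odds form.

    prior    -- P(B), your starting probability for the belief
    evidence -- list of (P(E_i | B), P(E_i | not B)) pairs, assumed
                conditionally independent of one another
    """
    odds = prior / (1 - prior)
    for p_if_true, p_if_false in evidence:
        odds *= p_if_true / p_if_false  # likelihood ratio for this item
    return odds / (1 + odds)

# A confident 90% prior, two items that favour the belief and three
# that count against it (all numbers hypothetical).
estimates = [(0.8, 0.4), (0.6, 0.3), (0.2, 0.6), (0.1, 0.4), (0.3, 0.6)]
result = posterior_probability(0.9, estimates)
print(round(result, 2))
```

Because each item only enters through its own ratio, you can estimate the pairs one at a time without thinking about the final answer, which is the point of the bracketing exercise.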

Maybe something you believe only has a 5% chance of being true and you’ve just never weighed the evidence in this objective a way. You need to either update what you think is true, or at very least if it still seems to be able to go either way, be much more honest about how sure you are. I hope more people start doing this as I am one of those people that think the world would be a much better place if people stopped confidently clinging to their beliefs taught to them from childhood.

Changing your mind should not have the cultural stigma it does. Currently people who change their minds are perceived as weak and not knowing what they are talking about. At very least, they give the impression that since their opinion changes it shouldn’t be taken seriously as it might change again soon. What needs to happen is that we come to recognize the ability to change one’s beliefs as an honest endeavor that shows academic integrity, and as something that someone who really seeks to hold true beliefs does frequently. These people should be held up as models and not the other way around.

Bayesian vs Frequentist Statistics

I was tempted for Easter to do an analysis of the Resurrection narratives in some of the Gospels as this is possibly even more fascinating (read: the differences are starker) than our analysis of the Passion narratives. But we’ll return to the Bayesian stuff. I’m not sure what more to add after this discussion, so this topic might end. I feel like continually presenting endless examples of Bayesian methods will get boring.

Essentially everything in today’s post will be from Chapter 8 of Nate Silver’s book The Signal and the Noise (again from memory so hopefully I don’t make any major mistakes and if so don’t think they are in the book or anything). I should say this book is pretty good, but a large part of it is just examples of models which might be cool if you haven’t been thinking about this for a while, but feels repetitive. I still recommend it if you have an interest in how Bayesian models are used in the real world.

Today’s topic is an explanation of essentially the only rival theory out there to Bayesianism. It is a method called “frequentism.” One might refer to this as “classical statistics.” It is what you would learn in a first undergraduate course in statistics, and although it still seems to be the default method in most fields of study, recently Bayesian methods have been surging and may soon replace frequentist methods.

It turns out that frequentist methods are newer and in some sense an attempt to replace some of the wishy-washy guess-work of Bayesianism. Recall that Bayesianism requires us to form a prior probability. To apply Bayes’ theorem we need to assign a probability based on … well … prior knowledge? In some fields like history this isn’t so weird. You look at similar cases that have been examined already to get the number. It is a little more awkward in science because when calculating P(C|E) the probability a conjecture is true given the evidence, you need to calculate P(C) which is your best guess at the probability your conjecture is true. It feels circular or like you can rig it so that you assume your experiment into a certain conclusion.

The frequentist will argue that assigning this probability involves all sorts of bias and subjectivity on the part of the person doing the analysis. Now this argument has been going in circles for years, but we’ve already addressed this. The Bayesian can just use probabilities that have a solid rationale that even opponents of the conclusion will agree to, or could make a whole interval of possible probabilities. It is true that the frequentist has a point, though. The bias/subjectivity does exist and an honest Bayesian admits this and takes precaution against it.

The frequentist method involves a rather simple idea (that gets complicated fast as anyone who has taken such a course knows). The idea is that we shouldn’t stack the odds for a conclusion by subjectively assigning some prior. We should just take measurements. Then, only after objective statistical analysis, should we make any such judgments. The problem is that when we take measurements, we only have a small sample of everything. We need a way to take this into account.

To illustrate using an example, we could do a poll to see who people will vote for in an election. We’re only going to poll a small number of people compared to everyone in the country. But the idea is that if we use a large enough sample size we can assume that it will roughly match the whole population. In other words, we can assume (if it was truly random) that we haven’t accidentally gotten a patch of the population that will vote significantly differently than the rest of the country. If we take a larger sample size, then our margin of error will decrease.

But built into this assumption we already have several problems. First is that hidden behind the scenes is that we must assume the voting population falls into some nice distribution for our model (for example a normal distribution). This is actually a major problem, because depending on what you are modelling there are different standards for what type of distribution to use. Moreover, we assume the sampling was random and falls into this distribution. These are two assumptions that usually can’t be well-justified (at least until well after the fact when we see if its predictive value was correct).

After that, we can figure out what our expected margin of error will be. This is exactly what we see in real political polling. They give us the results and some margin of error. If you’ve taken statistics you’ve probably spent lots of time calculating these so-called “confidence intervals.” There are lots of numerics such as p-values to tell you how significant or trustworthy the statistics and interval are.
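For a feel of how the margin of error shrinks with sample size, here is a sketch using the textbook formula for a polled proportion. It leans on precisely the assumptions questioned above (a truly random sample and a normal approximation), and the default of 1.96 is the usual multiplier for a 95% interval.

```python
from math import sqrt

def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Approximate margin of error for a polled proportion, assuming a
    simple random sample and a normal approximation."""
    return z * sqrt(proportion * (1 - proportion) / sample_size)

# proportion=0.5 is the worst case, so these are upper bounds.
for n in (100, 1000, 10000):
    print(n, round(margin_of_error(n), 3))
```

A poll of 1,000 people comes out at about three percentage points, and you have to quadruple the sample to cut the margin in half.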

Richard Carrier seems to argue in Proving History that there isn’t really a big difference between these two viewpoints. Bayesianism is just epistemic frequentism. They are just sort of hiding the bias and subjectivity in different places. I’d argue that Bayesian methods are superior for some simple reasons. First, the subjectivity can be quantified and put on the table for everyone to see and make their own judgments about. Second, Bayesian methods allow you to consistently update based on new evidence and take into account that more extraordinary claims require more extraordinary evidence. Lastly, you are less likely to make standard fallacies such as the correlation implies causation fallacy.

For a funny (and fairly accurate in my opinion) summary that is clearly advocating for Bayesian methods see this:


Bayesianism in the Philosophy of Math

Today I’ll sketch an idea that I first learned about from David Corfield’s excellent book Towards a Philosophy of Real Mathematics. I read it about six years ago while doing my undergraduate honors thesis and my copy is filled with notes in the margins. It has been interesting to revisit this book. What I’m going to talk about is done in much greater detail and thoroughness with tons of examples in that book. So check it out if this is interesting to you.

There are lots of ways we could use Bayesian analysis in the philosophy of math. I’ll just use a single example to show how we can use it to describe how confident we are in certain conjectures. In other words, we’ll come up with a probability for how plausible a conjecture is given the known evidence. As usual we’ll denote this P(C|E). Before doing this, let’s address the question of why would we want to do this.

To me, there are two main answers to this question. The first is that mathematicians already do this colloquially. When someone proposes something in an informal setting, you hear phrases like, “I don’t believe that at all,” or “How could that be true considering …” or “I buy that, it seems plausible.” If you think that the subject of philosophy of mathematics has any legitimacy, then certainly one of its main goals would be to take such statements and try to figure out what is meant by them and whether or not they seem justified. This is exactly what our analysis will do.

The second answer is much more practical in nature. Suppose you conjecture something as part of your research program. As we’ve been doing in these posts, you could use Bayes’ theorem to give two estimates on the plausibility of your conjecture being true. One is giving the most generous probabilities given the evidence, and the other is giving the least generous. You’ll get some sort of Bayesian confidence interval of the probability of the conjecture being true. If the entire interval is low (say below 60% or something), then before spending several months trying to prove it your time might be better spent gathering more evidence for or against it.

Again, mathematicians already do this at some subconscious level, so being aware of one way to analyze what it is you are actually doing could be very useful. Humans have tons of cognitive biases, so maybe you have greatly overestimated how likely something is and doing a quick Bayes’ theorem calculation can set you straight before wasting a ton of time. Or you could write all this off as nonsense. Whatever. It’s up to you.

If you’ve followed the posts up to now, you’ll probably find this calculation quite repetitive. You can probably guess what we’ll do. We want to figure out P(C|E), the probability that a conjecture is true given the evidence you’ve accumulated. What goes into Bayes’ theorem? Well, P(E|C) the probability that we would see the evidence we have supposing the conjecture is true; P(C) the prior probability that the conjecture is true; P(E|-C) the probability we would see the evidence we have supposing the conjecture is not true; and P(-C) the prior probability that the conjecture is not true.

Clearly the problem of assigning some exact probability to any of these is insanely subjective. But also, as before, it should be possible to find the most optimistic person about a conjecture to overestimate the probability and the most skeptical person to underestimate the probability. This type of interval forming should be a lot less subjective and fairly consistent. One should even have strong arguments to support the estimates which will convince someone who questions them.
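Here is a sketch of that interval forming, with completely made-up inputs standing in for the most skeptical and most optimistic estimates. The names are hypothetical; the only real content is Bayes’ theorem written with the four quantities listed above, using {P(-C)=1-P(C)}.

```python
def posterior(prior_c, p_e_given_c, p_e_given_not_c):
    """P(C | E) from P(C), P(E | C), and P(E | -C), with P(-C) = 1 - P(C)."""
    numerator = p_e_given_c * prior_c
    return numerator / (numerator + p_e_given_not_c * (1 - prior_c))

def plausibility_interval(skeptic, optimist):
    """Lower and upper estimates of P(C | E) from two sets of inputs."""
    return posterior(**skeptic), posterior(**optimist)

# Hypothetical estimates from the most skeptical and most optimistic
# person you can find.
skeptic = {"prior_c": 0.3, "p_e_given_c": 0.9, "p_e_given_not_c": 0.6}
optimist = {"prior_c": 0.7, "p_e_given_c": 1.0, "p_e_given_not_c": 0.2}

low, high = plausibility_interval(skeptic, optimist)
print(round(low, 2), round(high, 2))  # 0.39 0.92
```

If even the optimistic end of such an interval is low, that is the signal to go gather more evidence before investing months in a proof.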

Let’s use the Riemann hypothesis as an example. In our modern age, we have massive numerical evidence that the Riemann hypothesis is true. Recall that it just says that all the zeroes of the Riemann zeta function in the critical strip lie on the line with real part 1/2. Something like the first 10,000,000,000,000 zeroes have been checked by computer plus lots (billions?) have been checked in random other places after this.

Interestingly enough, if this were our “evidence” our estimation of P(E|C) may as well be 1, but P(E|-C) would have to contribute a significant non-trivial factor in the denominator of Bayes’ theorem. This is because we estimate this probability based on what we’ve seen in the past in similar situations. It turns out that in analytic number theory we have several prior instances of the phenomenon of a conjecture looking true for exceedingly large numbers before getting a counterexample. In fact, the Mertens conjecture is explicitly connected to the Riemann hypothesis and the first counterexample could be around 10^{30} (no explicit counterexample is known, just that one exists, but we know by checking that it is exceedingly large).

It probably isn’t unreasonable to say that most mathematicians believe the Riemann hypothesis. Even giving generous prior probabilities, the above analysis would give a not too high level of confidence. So where does the confidence come from? Remember, that in Bayesian analysis it is often easy to accidentally not use all available evidence (subconscious bias may play a role in this process).

I could do an entire series on the analogies and relations between the Riemann hypothesis for curves over finite fields and the standard Riemann hypothesis, so I won’t explain it here. The curves over finite fields case has been proven and provides quite good evidence in terms of making P(E|-C) small.

The Bayesian calculation becomes much, much more complicated in terms of modern mathematics because of all the analogies and more concretely the ways in which the RH is interrelated with theorems about number fields and Galois representations and cohomological techniques. We have conjectures equivalent to (or implying or implied by) the RH which allows us to transfer evidence for and against these other conjectures.

In some sense, essentially all this complication will only increase the Bayesian estimate, so we could simplify our lives and make some baseline estimate taking into account the clearest of these and then just say that our confidence is at least that much. That is one explanation of why many mathematicians believe the RH even if they’ve never explicitly thought of it that way. Well, this has gone on too long, but I hope the idea has been elucidated.