The Carter Catastrophe

I’ve been reading Manifold: Time by Stephen Baxter. The book is quite good so far, and it presents a fascinating probabilistic argument that humans will go extinct in the near future. It is sometimes called the Carter Catastrophe, because Brandon Carter first proposed it in 1983.

I’ll use Bayesian arguments, so you might want to review some of my previous posts on the topic if you’re feeling shaky. One thing we didn’t talk all that much about is the idea of model selection. This is one of the most common things scientists have to do. If you run an experiment, you get a bunch of data. Then you have to figure out the most likely explanation for what you see.

Let’s take a basic example. We have a giant tub of golf balls, and we can’t see inside the tub. There could be 1 ball or a million. We’re told the owner accidentally dropped a red ball in at some point. All the other balls are the standard white golf balls. We decide to run an experiment where we draw a ball out, one at a time, until we reach the red one.

First ball: white. Second ball: white. Third ball: red. We stop. We’ve now generated a data set from our experiment, and we want to use Bayesian methods to give the probability of there being three total balls or seven or a million. In probability terms, we need to calculate the probability that there are x balls in the tub given that we drew the red ball on the third draw. Any time we see this language, our first thought should be Bayes’ theorem.

Define A_i to be the model of there being exactly i balls in the tub. I’ll use “3” inside of P( ) to be the event of drawing the red ball on the third try. We have to make a finiteness assumption, and although this is one of the main critiques of the argument, we can examine what happens as we let the size of the bound grow. Suppose for now the tub can only hold 100 balls.

A priori, we have no idea how many balls are in there, so we’ll assume all “models” are equally likely. This means P(A_i)=1/100 for all i. By Bayes’ theorem we can calculate:

P(A_3|3) = \frac{P(3|A_3)P(A_3)}{\sum_{i=1}^{100}P(3|A_i)P(A_i)}

= \frac{(1/3)(1/100)}{(1/100)\sum_{i=3}^{100}1/i} \approx 0.09

So there’s around a 9% chance that there are only 3 balls in the tub. (The terms with i < 3 vanish, since you can’t draw the red ball third if there are fewer than 3 balls in total, which is why the bottom summation runs from i = 3.) That bottom summation remains exactly the same when computing P(A_n | 3) for any n and equals about 3.69, and the (1/100) cancels out every time. So we can compute explicitly that for any n ≥ 3:

P(A_n|3)\approx \frac{1}{n}(0.27)

This is a decreasing function of n, and this shouldn’t be surprising at all. It says that as we guess there are more and more balls in the tub, the probability of that guess goes down. This makes sense, because it’s unreasonable to think we’d see the red one that early if there are actually 100 balls in the tub.
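Here’s a minimal sketch of that calculation in Python, just to check the numbers above (the variable names are my own, and nothing is special about 100; change N_MAX to play with the bound):

    # Posterior over the total number of balls n, given that the red ball
    # showed up exactly on the third draw, with a uniform prior on n = 1..100.
    N_MAX = 100
    prior = 1 / N_MAX
    # Likelihood of the red ball being the third draw: 1/n if n >= 3, else 0.
    likelihood = {n: (1 / n if n >= 3 else 0) for n in range(1, N_MAX + 1)}
    evidence = sum(likelihood[n] * prior for n in range(1, N_MAX + 1))
    posterior = {n: likelihood[n] * prior / evidence for n in range(1, N_MAX + 1)}
    print(round(posterior[3], 2))   # about 0.09
    print(round(posterior[10], 3))  # about 0.027, i.e. roughly (1/10)(0.27)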

There are lots of ways to play with this. What happens if our tub could hold millions of balls but we still assume a uniform prior? It just takes all the probabilities down, but the general trend is the same: it becomes less and less reasonable to assume a large total number of balls given that we found the red one so early.

You could also care only about this “earliness” idea and redo the computations asking how likely A_n is given that we found the red ball by the third try. This is actually the more typical way the problem is formulated in the Doomsday argument. It’s more complicated, but the same idea pops out, and this should make intuitive sense.

Part of the reason these computations were somewhat involved is that we tried to get a distribution on the whole range of natural numbers. But we could have compared just two hypotheses to get a super clear answer (homework for you). What if we only had two choices, “small number of total balls (say 10)” or “large number of total balls (say 10,000)”? You’d find there is better than a 99% chance that the “small” hypothesis is correct.
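To spell that out, assuming equal prior weight on the two hypotheses and using the likelihood 1/n of drawing the red ball third:

P(\text{small}|3) = \frac{(1/10)(1/2)}{(1/10)(1/2)+(1/10000)(1/2)} = \frac{1000}{1001} \approx 0.999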

Here’s the leap. Now assume the fact that you exist right now is random. In other words, you popped out at a random point in the existence of humans. So the totality of humans to ever exist are the white balls and you are the red ball. The same type of argument above applies, and it says that the most likely thing is that you aren’t born at some super early point in human history. In fact, it’s unreasonable from a probabilistic standpoint to think that humans will continue much longer at all given your existence.

The “small” total population of humans is far, far more likely than the “large” total population, and the interesting thing is that this remains true even if you mess with the uniform prior. You could assume it is much more likely a priori that humans will continue to make improvements, colonize space, and develop vaccines, giving a higher prior to the species existing far into the future. But unfortunately the Bayesian argument will still pull so strongly in favor of humans ceasing to exist in the near future that one must conclude it is inevitable and will happen soon!

Anyway. I’m travelling this week, so I’m sorry if there are errors in those calculations. I was in a hurry and never double checked them. The crux of the argument should still make sense even if you don’t get my exact numbers. There are also a lot of interesting and convincing rebuttals, but I don’t have time to get into them now (including the fact that unlikely hypotheses turn out to be true all the time).

Does Bayesian Epistemology Suffer Foundational Problems?

I recently had a discussion about whether Bayesian epistemology suffers from the problem of induction, and I think some interesting things came from it. If these words make you uncomfortable, think of epistemology as the study of how we form beliefs and gain knowledge. Bayesian epistemology means we model it probabilistically using Bayesian methods. This old post of mine talks a bit about it but is long and unnecessary to read to get the gist of this post.

I always think of the problem of induction in terms of the classic swan analogy. Someone wants to claim that all swans are white. They go out and see swan after swan after swan, each confirming the claim. Is there any point at which the person can legitimately say they know that all swans are white?

Classically, the answer is no. The problem of induction is crippling to classical epistemologies, because we can never be justified in believing any universal claim (at least using empirical methods). One of the great things about probabilistic epistemologies (not just Bayesian ones) is that they circumvent this problem.

Classical epistemologies require you to have 100% certainty to attain knowledge. Since you can’t ever be sure you’ve encountered every instance of a universal, you can’t be certain there is no instance that violates the universal. Hence the problem of induction is an actual problem. But note it is only a problem if your definition of knowledge requires you to have absolute certainty of truth.

Probabilistic epistemologies lower the threshold. They merely ask that you have 95% (or 98%, etc) confidence (or that your claim sits in some credible region, etc) for the justification. By definition, knowledge is always tentative and subject to change in these theories of knowledge.

This is one of the main reasons to use a probabilistic epistemology. It is the whole point. They were invented to solve this problem, so I definitely do not believe that Bayesian epistemology suffers from the problem of induction.

But it turns out I had misunderstood. The point the other person tried to make was much more subtle. It had to do with the other half of the problem of induction (which I always forget about, because I usually consider it an axiom when doing epistemology).

This other problem is referred to as the principle of the uniformity of nature. One must presuppose that the laws of nature are consistent across time and space. Recall that a Bayesian has prior beliefs and then upon encountering new data they update their beliefs factoring in both the prior and new data.

This criticism has to do with the application of Bayes’ theorem, period. In order to consider the prior to be relevant to factor in at all, you must believe it is … well, relevant! You’ve implicitly assumed at that step the uniformity of nature. If you don’t believe nature is consistent across time, then you should not factor prior beliefs into the formation of knowledge.

Now a Bayesian will often try to use Bayesian methods to justify the uniformity of nature. We start with a uniform prior so that we haven’t assumed anything about the past or its relevance to the future. Then we merely note that billions of people across thousands of years have only ever observed a uniformity of nature, and hence it is credible to believe the axiom is true.

Even though my gut buys that argument, it is a bit intellectually dishonest. You can never, ever justify an axiom by using a method that relies on that axiom. That is the quintessential begging the question fallacy.

I think the uniformity of nature issue can be dismissed on different grounds. If you want to dismiss an epistemology based on the uniformity of nature issue, then you have to be willing to dismiss every epistemology that allows you to come to knowledge.

It doesn’t matter what the method is. If you somehow come to knowledge, then one second later all of nature could have changed and hence you no longer have that knowledge. Knowledge is impossible if you want to use that criticism. All this leaves you with is radical skepticism, which of course leads to self-contradiction (if you know you can’t know anything, then you know something –><– ).

This is why I think of the uniformity of nature as a necessary axiom for epistemology. Without some form of it, epistemology is impossible. So at least in terms of the problem of induction, I do not see foundational problems for Bayesian epistemology.

Bayesian Statistics Worked Example Part 2

Last time I decided my post was too long, so I cut some stuff out, and now this post is fleshing those parts out into their own post. Recall our setup. We perform an experiment of flipping a coin. Our data set consists of {a} heads and {b} tails. We want to run a Bayesian analysis to figure out whether or not the coin is biased. The bias is a number between {0} and {1} which just indicates the expected proportion of times it will land on heads.

We found our situation was modeled by the beta distribution: {P(\theta |a,b)=\beta(a,b)}. I reiterate here a word of warning. ALL other sources will call this {B(a+1, b+1)}. I’ve just shifted by 1 for ease of notation. We saw last time that if our prior belief is that the probability distribution is {\beta(x,y)}, then our posterior belief should be {\beta(x+a, y+b)}. This simple “update rule” falls out purely from Bayes’ Theorem.

The main thing I didn’t explain last time was what exactly I meant by the phrase “we can say with 95% confidence that the true bias of the coin lies between {0.40} and {0.60}” or whatever the particular numbers are that we get from our data. What I had in mind for that phrase was something called the highest density interval (HDI). The 95% HDI just means that it is an interval for which the area under the distribution is {0.95} (i.e. an interval spanning 95% of the distribution) such that every point in the interval has a higher probability than any point outside of the interval (I apologize for such highly unprofessional pictures):

[Figure: a 95% HDI, where every point on the curve over the shaded interval is higher than every point outside it]

(It doesn’t look like it, but that is supposed to be perfectly symmetrical.)

[Figure: a 95% interval that is not an HDI]

The first is the correct way to make the interval, because notice all points on the curve over the shaded region are higher up (i.e. more probable) than points on the curve not in the region. There are lots of 95% intervals that are not HDI’s. The second is such a non-example, because even though the area under the curve is 0.95, the big purple point is not in the interval but is higher up than some of the points off to the left which are included in the interval.

Lastly, we will say that a hypothesized bias {\theta_0} is credible if some small neighborhood of that value lies completely inside our 95% HDI. That small threshold is sometimes called the “region of practical equivalence (ROPE)” and is just a value we must set. If we set it to be 0.02, then we would say that the coin being fair is a credible hypothesis if the whole interval from 0.48 to 0.52 is inside the 95% HDI.

A note ahead of time: calculating the HDI for the beta distribution is actually kind of a mess because of the nature of the function. There is no closed form solution, so usually you just look these things up in a table or approximate it somehow. Both the mean {\mu=\frac{a+1}{a+b+2}} and the standard deviation {\left(\frac{\mu(1-\mu)}{a+b+3}\right)^{1/2}} do have closed forms (remember my shift by 1; these are just the usual beta distribution formulas written in my notation). Thus I’m going to approximate for the sake of this post using the “two standard deviations” rule that says that two standard deviations on either side of the mean covers roughly 95%. Caution: if the distribution is highly skewed, for example {\beta(3,25)} or something, then this approximation will actually be way off.
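If you would rather not approximate at all, here is a minimal numerical sketch (assuming SciPy is available; it uses the standard Beta(a+1, b+1) parameterization to match my shifted notation). It finds the narrowest interval containing 95% of the distribution, which for a single-peaked distribution is exactly the HDI. Since it is exact, its output may differ somewhat from the rough two-standard-deviation intervals quoted below.

    # Numerically find the 95% HDI of my (shifted) beta(a, b), i.e. Beta(a+1, b+1).
    from scipy.optimize import minimize_scalar
    from scipy.stats import beta as beta_dist

    def hdi(a, b, mass=0.95):
        """Narrowest interval containing `mass` of the Beta(a+1, b+1) distribution."""
        dist = beta_dist(a + 1, b + 1)

        def width(p):
            # Width of the interval [ppf(p), ppf(p + mass)].
            return dist.ppf(p + mass) - dist.ppf(p)

        best = minimize_scalar(width, bounds=(0, 1 - mass), method="bounded")
        return dist.ppf(best.x), dist.ppf(best.x + mass)

    print(hdi(3, 1))  # posterior after 3 heads and 1 tail with a flat prior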

Let’s go back to the same examples from before and add in this new terminology to see how it works. Suppose we have absolutely no idea what the bias is and we make our prior belief {\beta(0,0)} the flat line. This says that we believe ahead of time that all biases are equally likely. Now we observe {3} heads and {1} tails. Bayesian analysis tells us that our new distribution is {\beta(3,1)}. The 95% HDI in this case is approximately 0.49 to 0.84. Thus we can say with 95% certainty that the true bias is in this region. Note that based on this data it is NOT a credible hypothesis to guess that the coin is fair, because 0.48 is not in the HDI. This example really illustrates how choosing different thresholds can matter, because if we picked a ROPE of 0.01 rather than 0.02, then that guess would be credible!

Let’s see what happens if we use just an ever so slightly more reasonable prior. We’ll use {\beta(2,2)}. This gives us a starting assumption that the coin is probably fair, but it is still very open to whatever the data suggests. In this case our {3} heads and {1} tails tells us our posterior distribution is {\beta(5,3)}. In this case the 95% HDI is 0.45 to 0.75. Using the same data we get a slightly narrower interval here, but more importantly we feel much more comfortable with the claim that the coin being fair is still a credible hypothesis.

This brings up a sort of “statistical uncertainty principle.” If we want a ton of certainty, then it forces our interval to get wider and wider. This makes intuitive sense, because if I want to give you a range that I’m 99.9999999% certain the true bias is in, then I better give you practically every possibility. If I want to pinpoint a precise spot for the bias, then I have to give up certainty (unless you’re in an extreme situation where the distribution is a really sharp spike or something). You’ll end up with something like: I can say with 1% certainty that the true bias is between 0.59999999 and 0.6000000001. We’ve locked onto a small range, but we’ve given up certainty. Note the similarity to the Heisenberg uncertainty principle which says the more precisely you know the momentum or position of a particle the less precisely you know the other.

Let’s wrap up by trying to pinpoint exactly where we needed to make choices for this statistical model. The most common objection to Bayesian models is that you can subjectively pick a prior to rig the model to get any answer you want. Hopefully this wrap-up will show that in the abstract that objection is essentially correct, but that in real life practice you cannot get away with it.

Step 1 was to write down the likelihood function {P(a,b|\theta)=\theta^a(1-\theta)^b} (which, viewed as a function of {\theta}, is our {\beta(a,b)}). This was derived directly from the type of data we were collecting and was not a choice. Step 2 was to determine our prior distribution. This was a choice, but a constrained one. In real life statistics you will probably have a lot of prior information that will go into this choice. Recall that the prior encodes both what we believe is likely to be true and how confident we are in that belief. Suppose you make a model to predict who will win an election based off of polling data. You have previous years’ data, and that collected data has been tested, so you know how accurate it was! Thus forming your prior based on this information is a well-informed choice. Just because a choice is involved here doesn’t mean you can arbitrarily pick any prior you want to get any conclusion you want.

I can’t reiterate this enough. In our example, if you pick a prior of {\beta(100,1)} with no reason to expect the coin is biased, then we have every right to reject your model as useless. Your prior must be informed and must be justified. If you can’t justify your prior, then you probably don’t have a good model. The choice of prior is a feature, not a bug. If a Bayesian model turns out to be much more accurate than all other models, then it probably came from the fact that prior knowledge was not being ignored. It is frustrating to see opponents of Bayesian statistics use the “arbitrariness of the prior” as a failure when it is exactly the opposite (see the picture at the end of this post for a humorous illustration).

The last step is to set a ROPE to determine whether or not a particular hypothesis is credible. This merely rules out considering something right on the edge of the 95% HDI from being a credible guess. Admittedly, this step really is pretty arbitrary, but every statistical model has this problem. It isn’t unique to Bayesian statistics, and it isn’t typically a problem in real life. If something is so close to being outside of your HDI, then you’ll probably want more data. For example, if you are a scientist, then you re-run the experiment or you honestly admit that it seems possible to go either way.

What is Bayesian Statistics: A basic worked example

I did a series on Bayes’ Theorem a while ago and it gave us some nice heuristics on how a rational person ought to update their beliefs as new evidence comes in. The term “Bayesian statistics” gets thrown around a lot these days, so I thought I’d do a whole post just working through a single example in excruciating detail to show what is meant by this. If you understand this example, then you basically understand what Bayesian statistics is.

Problem: We run an experiment of flipping a coin {N} times and record a {1} every time it comes up heads and a {0} every time it comes up tails. This gives us a data set. Using this data set and Bayes’ theorem, we want to figure out whether or not the coin is biased and how confident we are in that assertion.

Let’s get some technical stuff out of the way. This is the least important part to fully understand for this post, but it is kind of necessary. Define {\theta} to be the bias towards heads. This just means that if {\theta=0.5}, then the coin has no bias and is perfectly fair. If {\theta=1}, then the coin will never land on tails. If {\theta = 0.75}, then if we flip the coin a huge number of times we will see close to {3} out of every {4} flips land on heads. For notation we’ll let {y} record whether a single flip lands on heads or tails (so it is {1} or {0} respectively).

We can encode this information mathematically by saying {P(y=1|\theta)=\theta}. In plain English: the probability that the coin lands on heads, given that the bias towards heads is {\theta}, is {\theta}. Likewise, {P(y=0|\theta)=1-\theta}. Let’s just chain a bunch of these coin flips together now. Let {a} be the event of seeing {a} heads when flipping the coin {N} times (I know, the double use of {a} is horrifying there, but the abuse makes notation easier later).

Since coin flips are independent we just multiply probabilities and hence {P(a|\theta)=\theta^a(1-\theta)^{N-a}}. Rather than lug around the total number {N} and have that subtraction, normally people just let {b} be the number of tails and write {P(a,b |\theta)=\theta^a(1-\theta)^b}. Let’s just do a quick sanity check to make sure this seems right. Note that if {a,b\geq 1}, then as the bias goes to zero the probability goes to zero. This is expected because we observed a heads ({a\geq 1}), so it is highly unlikely to be totally biased towards tails. Likewise as {\theta} gets near {1} the probability goes to {0}, because we observed a tails.

The other special cases are when {a=0} or {b=0}. In these cases we just recover that the probability of getting heads {a} times in a row, if the probability of heads is {\theta}, is {\theta^a}. Of course, the peak (the most probable single value) of {\theta^a(1-\theta)^b} occurs at {a/(a+b)}, the observed proportion of heads. Moving on, we haven’t quite thought of this in the correct way yet, because in our introductory problem we have a fixed data set that we want to analyze. So from now on we should think about {a} and {b} being fixed from the data we observed.

The idea now is that as {\theta} varies through {[0,1]} we have a distribution {P(a,b|\theta)}. What we want to do is multiply this by the constant that makes it integrate to {1} so we can think of it as a probability distribution. In fact, it has a name called the beta distribution (caution: the usual form is shifted from what I’m writing), so we’ll just write {\beta(a,b)} for this (the number we multiply by is the inverse of {B(a,b)=\int_0^1 \theta^a(1-\theta)^b d\theta} called the (shifted) beta function).
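To make this concrete, here is a minimal numerical sketch in plain NumPy (the 2 heads and 8 tails are chosen to match the first curve pictured below): normalize the likelihood on a grid and you get a genuine probability distribution in {\theta}.

    # Turn the likelihood theta^a (1-theta)^b into a probability distribution
    # over theta by dividing by its integral, the (shifted) beta function B(a, b).
    import numpy as np

    a, b = 2, 8                              # e.g. 2 heads and 8 tails
    theta = np.linspace(0, 1, 10001)
    dtheta = theta[1] - theta[0]
    unnormalized = theta**a * (1 - theta)**b
    density = unnormalized / (unnormalized.sum() * dtheta)   # divide by approximate B(a, b)
    print(density.sum() * dtheta)     # 1.0 (numerically), so it is a distribution
    print(theta[np.argmax(density)])  # the peak sits at a/(a+b) = 0.2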

This might seem unnecessarily complicated to start thinking of this as a probability distribution in {\theta}, but it is actually exactly what we are looking for. Consider the following three examples:

[Figure: beta distributions for three small data sets, including 2 heads/8 tails and 5 heads/5 tails]

The red one says if we observe {2} heads and {8} tails, then the probability that the coin has a bias towards tails is greater. The peak happens at {0.20}, but because we don’t have a lot of data there is still a pretty high probability of the true bias lying elsewhere. The middle one says if we observe 5 heads and 5 tails, then the most probable thing is that the bias is {0.5}, but again there is still a lot of room for error. If we do a ton of trials to get enough data to be more confident in our guess, then we see something like:

[Figure: the beta distribution after observing 50 heads and 50 tails]

Already at observing 50 heads and 50 tails we can say with 95% confidence that the true bias lies between 0.40 and 0.60. Alright, you might be objecting at this point that this is just usual statistics, where the heck is Bayes’ Theorem? You’d be right. Bayes’ Theorem comes in because we aren’t building our statistical model in a vacuum. We have prior beliefs about what the bias is.

Let’s just write down Bayes’ Theorem in this case. We want to know the probability of the bias {\theta} being some number given our observations in our data. We use the “continuous form” of Bayes’ Theorem:

\displaystyle P(\theta|a,b)=\frac{P(a,b|\theta)P(\theta)}{\int_0^1 P(a,b|\theta)P(\theta)\,d\theta}

I’m trying to give you a feel for Bayesian statistics, so I won’t work out in detail the simplification of this. Just note that the “posterior probability” (the left hand side of the equation), i.e. the distribution we get after taking into account our data, is the likelihood times our prior beliefs divided by the evidence. Now if you note that the denominator is just the constant that makes everything integrate to {1} (for a flat prior it is exactly {B(a,b)}) and work everything out, it turns out to be another beta distribution!
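If you do want to see the key step, here it is, assuming the prior is itself a beta distribution {\beta(x,y)} (I only track proportionality, since the denominator is a constant in {\theta}):

\displaystyle P(\theta|a,b) \propto P(a,b|\theta)P(\theta) \propto \theta^a(1-\theta)^b\cdot\theta^x(1-\theta)^y = \theta^{a+x}(1-\theta)^{b+y},

which is exactly the (unnormalized) {\beta(a+x,b+y)}.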

If our prior belief is that the bias has distribution {\beta(x,y)}, then if our data has {a} heads and {b} tails we get {P(\theta|a,b)=\beta(a+x, b+y)}. The way we update our beliefs based on evidence in this model is incredibly simple. Now I want to sanity check that this makes sense again. Suppose we have absolutely no idea what the bias is and we make our prior belief {\beta(0,0)} the flat line. This says that we believe ahead of time that all biases are equally likely.

Now we observe {3} heads and {1} tails. Bayesian analysis tells us that our new (posterior probability) distribution is {\beta(3,1)}:

[Figure: the posterior distribution {\beta(3,1)}]

Yikes! We don’t have a lot of certainty, but it looks like the bias is heavily towards heads. Danger: This is because we used a terrible prior. This is the real world, so it isn’t reasonable to think that a bias of {0.99} is just as likely as {0.45}. Let’s see what happens if we use just an ever so slightly more modest prior. We’ll use {\beta(2,2)}. This encodes the assumption that the bias is most likely close to {0.5}, but it is still very open to whatever the data suggests. In this case our {3} heads and {1} tails tells us our updated belief is {\beta(5,3)}:

[Figure: the posterior distribution {\beta(5,3)}]

Ah. Much better. We see a slight bias coming from the fact that we observed {3} heads and {1} tails and these can’t totally be ignored, but our prior belief tames how much we let this sway our new beliefs. This is what makes Bayesian statistics so great. If we have tons of prior evidence of a hypothesis, then observing a few outliers shouldn’t make us change our minds. On the other hand, the setup allows for us to change our minds even if we are 99% certain on something as long as sufficient evidence is given. This is the mantra: extraordinary claims require extraordinary evidence.

Not only would a ton of evidence be able to persuade us that the coin bias is {0.90}, but we should need a ton of evidence. This is part of the shortcomings of non-Bayesian analysis. It would be much easier to become convinced of such a bias if we didn’t have a lot of data and we accidentally sampled some outliers.

Anyway. Now you should have an idea of Bayesian statistics. In fact, if you understood this example, then most of the rest is just adding parameters and using other distributions, so you actually have a really good idea of what is meant by that term now.

Bayesian vs Frequentist Statistics

I was tempted for Easter to do an analysis of the Resurrection narratives in some of the Gospels as this is possibly even more fascinating (re: the differences are starker) than our analysis of the Passion narratives. But we’ll return to the Bayesian stuff. I’m not sure what more to add after this discussion, so this topic might end. I feel like continually presenting endless examples of Bayesian methods will get boring.

Essentially everything in today’s post will be from Chapter 8 of Nate Silver’s book The Signal and the Noise (again from memory, so hopefully I don’t make any major mistakes, and if so don’t think they are in the book or anything). I should say this book is pretty good, but a large part of it is just examples of models, which might be cool if you haven’t been thinking about this for a while, but feels repetitive. I still recommend it if you have an interest in how Bayesian models are used in the real world.

Today’s topic is an explanation of essentially the only rival theory out there to Bayesianism. It is a method called “frequentism.” One might refer to this as “classical statistics.” It is what you would learn in a first undergraduate course in statistics, and although it still seems to be the default method in most fields of study, recently Bayesian methods have been surging and may soon replace frequentist methods.

It turns out that frequentist methods are newer and in some sense an attempt to replace some of the wishy-washy guess-work of Bayesianism. Recall that Bayesianism requires us to form a prior probability. To apply Bayes’ theorem we need to assign a probability based on … well … prior knowledge? In some fields like history this isn’t so weird. You look at similar cases that have been examined already to get the number. It is a little more awkward in science, because when calculating P(C|E), the probability a conjecture is true given the evidence, you need P(C), which is your best guess at the probability your conjecture is true. It feels circular, or like you can rig it so that your experiment is guaranteed to reach the conclusion you assumed.

The frequentist will argue that assigning this probability involves all sorts of bias and subjectivity on the part of the person doing the analysis. Now this argument has been going in circles for years, but we’ve already addressed this. The Bayesian can just use probabilities that have a solid rationale that even opponents of the conclusion will agree to, or could make a whole interval of possible probabilities. It is true that the frequentist has a point, though. The bias/subjectivity does exist and an honest Bayesian admits this and takes precaution against it.

The frequentist method involves a rather simple idea (that gets complicated fast as anyone who has taken such a course knows). The idea is that we shouldn’t stack the odds for a conclusion by subjectively assigning some prior. We should just take measurements. Then, only after objective statistical analysis, should we make any such judgments. The problem is that when we take measurements, we only have a small sample of everything. We need a way to take this into account.

To illustrate using an example, we could do a poll to see who people will vote for in an election. We’re only going to poll a small number of people compared to everyone in the country. But the idea is that if we use a large enough sample size we can assume that it will roughly match the whole population. In other words, we can assume (if it was truly random) that we haven’t accidentally gotten a patch of the population that will vote significantly differently than the rest of the country. If we take a larger sample size, then our margin of error will decrease.

But built into this assumption we already have several problems. First, hidden behind the scenes is the assumption that the voting population falls into some nice distribution for our model (for example a normal distribution). This is actually a major problem, because depending on what you are modelling there are different standards for what type of distribution to use. Moreover, we assume the sampling was random and falls into this distribution. These are two assumptions that usually can’t be well-justified (at least until well after the fact, when we see whether the predictions were correct).

After that, we can figure out what our expected margin of error will be. This is exactly what we see in real political polling. They give us the results and some margin of error. If you’ve taken statistics you’ve probably spent lots of time calculating these so-called “confidence intervals.” There are lots of associated numbers, such as p-values, that tell you how significant or trustworthy the statistics and the interval are.
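For concreteness, here is a minimal sketch of that poll example in Python, using the usual normal-approximation confidence interval; the poll numbers are made up purely for illustration.

    # 95% confidence interval for a polled proportion (normal approximation).
    import math

    n = 1000         # number of people polled (hypothetical)
    support = 520    # how many said they will vote for candidate A (hypothetical)
    p_hat = support / n
    margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)   # 1.96 standard errors is roughly 95%
    print(f"{p_hat:.2f} +/- {margin:.3f}")               # about 0.52 +/- 0.031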

Richard Carrier seems to argue in Proving History that there isn’t really a big difference between these two viewpoints: Bayesianism is just epistemic frequentism, and they just sort of hide the bias and subjectivity in different places. I’d argue that Bayesian methods are superior for some simple reasons. First, the subjectivity can be quantified and put on the table for everyone to see and make their own judgments about. Second, Bayesian methods allow you to consistently update based on new evidence and take into account that more extraordinary claims require more extraordinary evidence. Lastly, you are less likely to make standard fallacies such as the “correlation implies causation” fallacy.

For a funny (and fairly accurate in my opinion) summary that is clearly advocating for Bayesian methods see this:

[Image: a comic pitting frequentist reasoning against Bayesian reasoning]

Bayesianism in the Philosophy of Math

Today I’ll sketch an idea that I first learned about from David Corfield’s excellent book Towards a Philosophy of Real Mathematics. I read it about six years ago while doing my undergraduate honors thesis, and my copy is filled with notes in the margins. It has been interesting to revisit this book. What I’m going to talk about is done in much greater detail and thoroughness, with tons of examples, in that book. So check it out if this is interesting to you.

There are lots of ways we could use Bayesian analysis in the philosophy of math. I’ll just use a single example to show how we can use it to describe how confident we are in certain conjectures. In other words, we’ll come up with a probability for how plausible a conjecture is given the known evidence. As usual we’ll denote this P(C|E). Before doing this, let’s address the question of why would we want to do this.

To me, there are two main answers to this question. The first is that mathematicians already do this colloquially. When someone proposes something in an informal setting, you hear phrases like, “I don’t believe that at all,” or “How could that be true considering …” or “I buy that, it seems plausible.” If you think that the subject of philosophy of mathematics has any legitimacy, then certainly one of its main goals would be to take such statements and try to figure out what is meant by them and whether or not they seem justified. This is exactly what our analysis will do.

The second answer is much more practical in nature. Suppose you conjecture something as part of your research program. As we’ve been doing in these posts, you could use Bayes’ theorem to give two estimates on the plausibility of your conjecture being true. One is giving the most generous probabilities given the evidence, and the other is giving the least generous. You’ll get some sort of Bayesian confidence interval of the probability of the conjecture being true. If the entire interval is low (say below 60% or something), then before spending several months trying to prove it your time might be better spent gathering more evidence for or against it.

Again, mathematicians already do this at some subconscious level, so being aware of one way to analyze what it is you are actually doing could be very useful. Humans have tons of cognitive biases, so maybe you have greatly overestimated how likely something is and doing a quick Bayes’ theorem calculation can set you straight before wasting a ton of time. Or you could write all this off as nonsense. Whatever. It’s up to you.

If you’ve followed the posts up to now, you’ll probably find this calculation quite repetitive. You can probably guess what we’ll do. We want to figure out P(C|E), the probability that a conjecture is true given the evidence you’ve accumulated. What goes into Bayes’ theorem? Well, P(E|C) the probability that we would see the evidence we have supposing the conjecture is true; P(C) the prior probability that the conjecture is true; P(E|-C) the probability we would see the evidence we have supposing the conjecture is not true; and P(-C) the prior probability that the conjecture is not true.

Clearly the problem of assigning some exact probability to any of these is insanely subjective. But also, as before, it should be possible to find the most optimistic person about a conjecture to overestimate the probability and the most skeptical person to underestimate the probability. This type of interval forming should be a lot less subjective and fairly consistent. One should even have strong arguments to support the estimates which will convince someone who questions them.
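To illustrate the kind of calculation I have in mind, here is a minimal Python sketch; every number in it is a made-up placeholder rather than an estimate about any real conjecture.

    # Range of P(C|E) coming from optimistic and pessimistic estimates of each input.
    def p_conjecture_given_evidence(p_e_given_c, p_c, p_e_given_not_c):
        # Bayes' theorem: P(C|E) = P(E|C)P(C) / (P(E|C)P(C) + P(E|-C)P(-C)).
        return (p_e_given_c * p_c) / (p_e_given_c * p_c + p_e_given_not_c * (1 - p_c))

    optimistic = p_conjecture_given_evidence(0.99, 0.6, 0.1)    # placeholder numbers
    pessimistic = p_conjecture_given_evidence(0.90, 0.4, 0.5)   # placeholder numbers
    print(f"P(C|E) lands somewhere between {pessimistic:.2f} and {optimistic:.2f}")
    # roughly between 0.55 and 0.94 with these made-up inputs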

Let’s use the Riemann hypothesis as an example. In our modern age, we have massive numerical evidence that the Riemann hypothesis is true. Recall that it just says that all the zeroes of the Riemann zeta function in the critical strip lie on the line with real part 1/2. Something like the first 10,000,000,000,000 zeroes have been checked by computer plus lots (billions?) have been checked in random other places after this.

Interestingly enough, if this were our “evidence,” our estimate of P(E|C) may as well be 1, but P(E|-C) would still contribute a significant, non-trivial factor in the denominator of Bayes’ theorem. This is because we estimate this probability based on what we’ve seen in the past in similar situations. It turns out that in analytic number theory we have several prior instances of the phenomenon of a conjecture looking true for exceedingly large numbers before getting a counterexample. In fact, Mertens’ conjecture is explicitly connected to the Riemann hypothesis, and its first counterexample could be around 10^{30} (no explicit counterexample is known, just that one exists, but we know by checking that it is exceedingly large).

It probably isn’t unreasonable to say that most mathematicians believe the Riemann hypothesis. Even giving generous prior probabilities, the above analysis would give a not-too-high level of confidence. So where does the confidence come from? Remember that in Bayesian analysis it is often easy to accidentally not use all available evidence (subconscious bias may play a role in this process).

I could do an entire series on the analogies and relations between the Riemann hypothesis for curves over finite fields and the standard Riemann hypothesis, so I won’t explain it here. The curves over finite fields case has been proven and provides quite good evidence in terms of making P(E|-C) small.

The Bayesian calculation becomes much, much more complicated in terms of modern mathematics because of all the analogies and, more concretely, the ways in which the RH is interrelated with theorems about number fields and Galois representations and cohomological techniques. We have conjectures equivalent to (or implying or implied by) the RH, which lets us transfer evidence for and against these other conjectures.

In some sense, essentially all this complication will only increase the Bayesian estimate, so we could simplify our lives by making some baseline estimate taking into account the clearest of these and then just saying that our confidence is at least that much. That is one explanation of why many mathematicians believe the RH even if they’ve never explicitly thought of it that way. Well, this has gone on too long, but I hope the idea has been elucidated.

Bayes’ Theorem 3: Arguments from Absence of Evidence (Historical Edition)

If you move in the same circles that I do then you’ve probably heard the following phrase many times, “Absence of evidence is not evidence of absence.” In and of itself this is totally true. In fact, it is just a special case of a well-known logical fallacy called an argument from ignorance.

One of the really cool things about using Bayesian methods when analyzing historical events (actually you could adapt the following to the example of the scientific method as well) is that you can quantify how improbable a certain absence of evidence is to make a sound argument. This allows you to conclude that a historical event actually did not take place based on the absence of evidence.

Now I could try to do this in some extremely abstract fashion, but it is so much clearer to just show you an example. There’s some good news and some bad news. The good news is that this post is being made near Easter, so the example is timely. The bad news is that some people might find the example highly offensive, because we will show using Bayesian inference that a certain event from the Gospels was entirely made up (or at least we can say, with better certainty than we could ever hope for in our wildest imagination, that this is the case).

This example, including the numbers, is entirely lifted from Richard Carrier’s book Proving History. This is not intended as plagiarism, but as I am not an expert in history I feel like randomly making up probabilities about how likely certain historical events are would just not make as convincing an example. Here’s the example: In the Synoptic Gospels (Matthew, Mark, and Luke) it is said that up to the death of Jesus the entire Earth was covered in darkness for three hours.

We want to figure out the probability that this was a historical event using the fact that there are no extra-Biblical accounts of this event happening. One thing to note is that there were civilizations all across the Earth in the first century who were already keeping copious records of bizarre astronomical phenomena that have survived to this time.

One very important thing to keep in mind when doing this example is the following. One might be tempted to make an argument about history vs supernatural events and so on. But the cool thing about this is that we don’t need to make any assumption about the occurrence of supernatural events to do this analysis. In fact, we could assume that supernatural events happen all the time and we will still come to the conclusion that this story was fabricated.

Let A be the statement that the Earth was covered in darkness for three hours. Let B be the event that we have no extra-Biblical accounts of this fact (I use the term “extra-Biblical” loosely to mean no sources that don’t admit they are referencing the Bible). We want to calculate P(A|B) the probability that the Earth was actually covered in darkness for three hours given the fact that we have no evidence for it.

The quantities that come up in Bayes’ theorem are the following: P(B|A), the probability we have no evidence of the event occurring supposing that it actually did occur. If we are exceedingly generous we can assign this probability at 1%. Note how high this percentage is though. Given our knowledge of surviving records of the time it is so mind-bogglingly unlikely that every civilization on the planet just accidentally missed something that would have scared them all out of their minds.

We also have P(A). This is slightly subtle, because in this case it represents not merely the probability that the event occurred, but really the probability that the author of Mark (or one of the other Gospels, which were probably just copying this detail) is telling the truth about the event occurring. More precisely, considering all the times we know of (our “prior knowledge,” as it was called in the previous post) when people have told us that the sun was blotted out, how frequently did it actually happen (or, less confusingly, when computing P(-A), how frequently did the story turn out to be made up)? Being exceedingly generous again, we’ll call this 1%.

Note we are not dismissing this on grounds of being a supernatural event (we’ve assumed for the purposes of this calculation that they happen all the time). The low number of 1% comes from the fact that we know of tons of examples in history where people tell stories like this one, but where we later find out they were made up. Lastly, we need P(B|-A) which is the probability of finding no external evidence for the event assuming the event was made up. This is so close to 100% that we may as well assign it a probability of 1.
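Spelling it out with those three generous numbers, Bayes’ theorem reads:

P(A|B) = \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|-A)P(-A)} = \frac{(0.01)(0.01)}{(0.01)(0.01) + (1)(0.99)} \approx 0.0001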

Plugging everything in tells us that with at least (remember we were quite generous with the numbers) 99.99% certainty (re: there is a 99.99% chance that) the event never happened in history and was just made up by the authors of the Synoptic Gospels. And that is how Bayesian inference can lead to a sound argument from absence of evidence.

Of course, this should be an entirely non-controversial example, because outside of a tiny few fundamentalist “scholars” who are clearly pushing an agenda, the fact that this event never happened in history has essentially unanimous consensus among historians and Biblical scholars. So our result shouldn’t actually be surprising.