The Carter Catastrophe

I’ve been reading Manifold: Time by Stephen Baxter. The book is quite good so far, and it presents a fascinating probabilistic argument that humans will go extinct in the near future. It is sometimes called the Carter Catastrophe, because Brandon Carter first proposed it in 1983.

I’ll use Bayesian arguments, so you might want to review some of my previous posts on the topic if you’re feeling shaky. One thing we haven’t talked about much is the idea of model selection, which is one of the most common things scientists have to do. If you run an experiment, you get a bunch of data, and then you have to figure out which explanation of that data is most likely.

Let’s take a basic example. We have a giant tub of golf balls, and we can’t see inside the tub. There could be 1 ball or a million. We’re told the owner accidentally dropped a red ball in at some point. All the other balls are the standard white golf balls. We decide to run an experiment where we draw a ball out, one at a time, until we reach the red one.

First ball: white. Second ball: white. Third ball: red. We stop. We’ve now generated a data set from our experiment, and we want to use Bayesian methods to give the probability of there being three total balls or seven or a million. In probability terms, we need to calculate the probability that there are x balls in the tub given that we drew the red ball on the third draw. Any time we see this language, our first thought should be Bayes’ theorem.

Define A_i to be the model of there being exactly i balls in the tub. I’ll use “3” inside of P( ) to be the event of drawing the red ball on the third try. We have to make a finiteness assumption, and although this is one of the main critiques of the argument, we can examine what happens as we let the size of the bound grow. Suppose for now the tub can only hold 100 balls.

A priori, we have no idea how many balls are in there, so we’ll assume all “models” are equally likely. This means P(A_i)=1/100 for all i. By Bayes’ theorem we can calculate:

P(A_3|3) = \frac{P(3|A_3)P(A_3)}{\sum_{i=1}^{100}P(3|A_i)P(A_i)}

= \frac{(1/3)(1/100)}{(1/100)\sum_{i=3}^{100}1/i} \approx 0.09

So there’s around a 9% chance that there are only 3 balls in the tub. Note that the sum in the denominator effectively starts at i = 3, because the i = 1 and i = 2 terms vanish (you can’t draw the red ball on the third try with fewer than three balls). That bottom summation remains exactly the same when computing P(A_n | 3) for any n and equals about 3.69, and the (1/100) cancels out every time. So we can compute explicitly that for n ≥ 3:

P(A_n|3)\approx \frac{1}{n}(0.27)

This is a decreasing function of n, and this shouldn’t be surprising at all. It says that as we guess there are more and more balls in the tub, the probability of that guess goes down. This makes sense, because it’s unreasonable to think we’d see the red one that early if there are actually 100 balls in the tub.
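If you want to check these numbers yourself, here’s a quick sketch in Python, assuming the cap of 100 balls and the uniform prior from above:

# Sanity check of the golf-ball posterior, assuming a cap of 100 balls
# and a uniform prior over the number of balls.
N = 100

def likelihood(i, k=3):
    # Probability of drawing the red ball on exactly the k-th draw when
    # there are i balls total: the red ball is equally likely to sit in
    # any of the i positions, so this is 1/i when i >= k and 0 otherwise.
    return 1.0 / i if i >= k else 0.0

evidence = sum(likelihood(i) * (1.0 / N) for i in range(1, N + 1))

def posterior(n):
    return likelihood(n) * (1.0 / N) / evidence

print(posterior(3))    # about 0.09
print(posterior(50))   # about 0.27 / 50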

There are lots of ways to play with this. What happens if our tub could hold millions of balls but we still assume a uniform prior? It pushes all the probabilities down, but the general trend is the same: it becomes less and less reasonable to assume a large total number of balls given that we found the red one so early.

You could also care only about this “earliness” idea and redo the computations asking how likely A_n is given that we found the red ball by the third try. This is actually the more typical way the problem is formulated in the Doomsday arguments. It’s more complicated, but the same idea pops out, and this should make intuitive sense.

Part of the reason these computations were somewhat involved is that we tried to get a distribution on all the natural numbers. But we could instead compare just two coarse hypotheses and get a super clear answer (homework for you). What if we only had two choices: “small number of total balls (say 10)” or “large number of total balls (say 10,000)”? You’d find there is around a 99% chance that the “small” hypothesis is correct.
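Here’s a minimal sketch of that two-hypothesis version, if you want to check the homework (with 10 and 10,000 as the assumed “small” and “large” sizes and equal priors on each):

# Compare "small tub" (10 balls) vs "large tub" (10,000 balls) with equal
# priors, given that the red ball showed up on the third draw.
p_small, p_large = 1.0 / 10, 1.0 / 10000   # likelihood of the data under each
prior = 0.5                                 # equal prior on each hypothesis
posterior_small = p_small * prior / (p_small * prior + p_large * prior)
print(posterior_small)   # about 0.999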

Here’s the leap. Now assume that the moment of your existence is random; in other words, you popped out at a random point in the full history of humans. So the totality of humans ever to exist are the white balls and you are the red ball. The same type of argument as above applies, and it says that the most likely thing is that you weren’t born at some super early point in human history. In fact, it’s unreasonable from a probabilistic standpoint to think that humans will continue much longer at all, given your existence.

The “small” total population of humans is far, far more likely than the “large” total population, and the interesting thing is that this remains true even if you mess with the uniform prior. You could assume it is much more likely a priori that humans will continue to make improvements, colonize space, and develop vaccines, giving a higher prior to the species existing far into the future. But unfortunately the Bayesian argument still pulls so strongly in favor of humans ceasing to exist in the near future that one must conclude it is inevitable and will happen soon!

Anyway. I’m travelling this week, so I’m sorry if there are errors in those calculations. I was in a hurry and never double checked them. The crux of the argument should still make sense even if you don’t get my exact numbers. There’s also a lot of interesting and convincing rebuttals, but I don’t have time to get into them now (including the fact that unlikely hypotheses turn out to be true all the time).


Does Bayesian Epistemology Suffer Foundational Problems?

I recently had a discussion about whether Bayesian epistemology suffers from the problem of induction, and I think some interesting things came from it. If these words make you uncomfortable, think of epistemology as the study of how we form beliefs and gain knowledge. Bayesian epistemology means we model it probabilistically using Bayesian methods. This old post of mine talks a bit about it but is long and unnecessary to read to get the gist of this post.

I always think of the problem of induction in terms of the classic swan analogy. Someone wants to claim that all swans are white. They go out and see swan after swan after swan, each confirming the claim. Is there any point at which the person can legitimately say they know that all swans are white?

Classically, the answer is no. The problem of induction is crippling to classic epistemologies, because we can never be justified in believing any universal claim (at least using empirical methods). One of the great things about probabilistic epistemologies (not just Bayesian ones) is that they circumvent this problem.

Classical epistemologies require you to have 100% certainty to attain knowledge. Since you can’t ever be sure you’ve encountered every instance of a universal, you can’t be certain there is no instance that violates the universal. Hence the problem of induction is an actual problem. But note it is only a problem if your definition of knowledge requires you to have absolute certainty of truth.

Probabilistic epistemologies lower the threshold. They merely ask that you have 95% (or 98%, etc) confidence (or that your claim sits in some credible region, etc) for the justification. By definition, knowledge is always tentative and subject to change in these theories of knowledge.

This is one of the main reasons to use a probabilistic epistemology. It is the whole point. They were invented to solve this problem, so I definitely do not believe that Bayesian epistemology suffers from the problem of induction.

But it turns out I had misunderstood. The point the other person tried to make was much more subtle. It had to do with the other half of the problem of induction (which I always forget about, because I usually consider it an axiom when doing epistemology).

This other problem is referred to as the principle of the uniformity of nature. One must presuppose that the laws of nature are consistent across time and space. Recall that a Bayesian has prior beliefs and then upon encountering new data they update their beliefs factoring in both the prior and new data.

This criticism has to do with the application of Bayes’ theorem period. In order to consider the prior to be relevant to factor in at all, you must believe it is … well, relevant! You’ve implicitly assumed at that step the uniformity of nature. If you don’t believe nature is consistent across time, then you should not factor prior beliefs into the formation of knowledge.

Now a Bayesian will often try to use Bayesian methods to justify the uniformity of nature. We start with a uniform prior so that we haven’t assumed anything about the past or its relevance to the future. Then we merely note that billions of people across thousands of years have only ever observed a uniformity of nature, and hence it is credible to believe the axiom is true.

Even though my gut buys that argument, it is a bit intellectually dishonest. You can never, ever justify an axiom by using a method that relies on that axiom. That is the quintessential begging the question fallacy.

I think the uniformity of nature issue can be dismissed on different grounds. If you want to dismiss an epistemology based on the uniformity of nature issue, then you have to be willing to dismiss every epistemology that allows you to come to knowledge.

It doesn’t matter what the method is. If you somehow come to knowledge, then one second later all of nature could have changed and hence you no longer have that knowledge. Knowledge is impossible if you want to use that criticism. All this leaves you with is radical skepticism, which of course leads to self-contradiction (if you know you can’t know anything, then you know something, a contradiction).

This is why I think of the uniformity of nature as a necessary axiom for epistemology. Without some form of it, epistemology is impossible. So at least in terms of the problem of induction, I do not see foundational problems for Bayesian epistemology.

Markov Chain Monte Carlo Example

Let’s look at a problem called parameter estimation. As usual, we have a bunch of coin flips. What we’ve learned to do with Bayesian statistics is calculate a posterior distribution that tells me how likely each value of the bias {\theta} is. I ask myself, “Is it a credible hypothesis that the coin is fair ({\theta =1/2})?” I find out yes it is. I ask myself, “Is it a credible hypothesis that the coin is massively biased at {\theta=4/5}?” I find out yes it is. Uh oh.

Maybe in abstract math land this type of contradictory information is fine. I should be honest that both are credible guesses based on my data, and Bayesian statistics helps me to be very precise about my degrees of certainty and uncertainty. Unfortunately, in the real world I want to figure out which {\theta} is “best” so that I can use it in my model for practical purposes. This is called parameter estimation, because I want to estimate what my parameter {\theta} should be in my model.

We’re in luck for the coin example, because we only have one parameter living in one-dimensional space. This alone vastly simplifies the situation, but we have something far, far better. Our posterior distribution has a unique maximum, that maximum sits essentially on top of the mean of the distribution, and it can be determined easily and exactly! This means that we can safely use that parameter as the “best.”
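As a small illustration (with hypothetical counts of 60 heads and 40 tails and a uniform prior, so the posterior is a Beta distribution), the mode and the mean of that posterior land nearly on top of each other:

# Hypothetical data: 60 heads, 40 tails, uniform prior.
# The posterior is Beta(heads + 1, tails + 1); both its mode and its mean
# have simple closed forms and are nearly identical here.
heads, tails = 60, 40
a, b = heads + 1, tails + 1
mode = (a - 1) / (a + b - 2)   # 0.600
mean = a / (a + b)             # about 0.598
print(mode, mean)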

In the real world, we often have several parameters we are trying to estimate in a high-dimensional space, and the posterior is some non-convex crazy thing with lots of local mins/maxs that can’t be determined analytically. Let’s face it. Optimization is really hard even in relatively nice situations. The real world is usually not nice.

There often isn’t even an obvious notion of what you mean by “best” set of parameters either. Think of a symmetrical bimodal distribution where both peaks have the same max. You don’t really have any good reason to pick one of the points that gives the max, and if you do something like take the mean, then you might end up with a min by accident. The method I’m going to describe doesn’t really help with this issue of “equally good choices”, but it does give a beautiful way to deal with high-dimensional parameter spaces and crazy posterior distributions.

The idea is extremely simple. You will pick some starting collection of parameters. Then you let those parameters randomly move in some direction. We will then use our model to test whether or not it is more or less likely to see the data that we observed under each of those parameter choices. With some probability depending on this likelihood we will move that parameter to that value. This is just a Markov chain process of our {\theta} values moving through the possible parameter values, and hence this technique is called a Markov Chain Monte Carlo (MCMC) method (I used the indefinite article “a” because there are all sorts of variations on this out there).

It turns out that as long as we set this up in a reasonable way, the walk converges to sampling from the posterior distribution. Here’s something cool about this. Your parameters could live in some gigantic space for which it would be impossible to search exhaustively for a good parameter estimate. Usually there is some much, much smaller dimensional subset of reasonably likely candidates. Once the walk moves onto this smaller set, by the nature of the algorithm it will stay close to it and hence start moving toward something optimal much faster. In a real example, a picture of the random walk shows it hugging this smaller set.

Let’s actually implement this in the silly case of the coin example where we know what the answer should be. My next post might try to implement this for some sort of real data set, although that could be more time consuming than I’m willing to do. To make this example more fun, I had the computer pick a random number in {[0,1]} and then generate 100 coin flips with bias equal to that number without telling me the number! This way we are in a realistic situation of not knowing what the “correct” answer is ahead of time.

I got 85 heads and 15 tails. To make computations easier, let’s assume the prior is uniform. This means the posterior is proportional to the likelihood {p(D|\theta)=\theta^{85}\cdot (1-\theta)^{15}}. I’ll start the random walk at {\theta = 0.5}. To know how much to move by, I pick a random number from a normal distribution with mean {0} and standard deviation {0.1}. So if I pick {0.05}, then my candidate place to move to is {0.55}.

I compute {p(D|\theta_{new})/p(D|\theta_{old})} and move to the new spot with this probability. Note that if my new theta value is more likely to be the true theta, then I always move to the new value, because the acceptance ratio is greater than {1}. The more unlikely my new theta value is, the less likely it is that I will move there. This implementation is called the Metropolis (or Metropolis-Hastings) algorithm. Note how simple the implementation is. It is only a few lines of code:

import random

import matplotlib.pyplot as plt

# Posterior distribution (uniform prior, 85 heads and 15 tails)
def p(theta):
    return (theta**85) * ((1 - theta)**15)

# Random walk step size
def norm_dist():
    return random.normalvariate(0, 0.1)

# Perform one step of the random walk from spot theta,
# staying put if the proposal falls outside (0, 1)
def rand_walk(theta):
    x = norm_dist()
    if 0 < theta + x < 1:
        return theta + x
    else:
        return theta

# Simulate the random walk for 1000 time steps
walk = [0.5]
for i in range(1000):
    n = walk[-1]
    y = rand_walk(n)
    # Accept the proposal with probability min(1, p(y)/p(n))
    if random.random() < p(y) / p(n):
        walk.append(y)
    else:
        walk.append(n)

# Plot the results
times = list(range(len(walk)))
plt.plot(walk, times)
plt.title('Random Walk Visualization')
plt.xlabel('Theta Value')
plt.ylabel('Time')
plt.show()

Note that the key insight that MCMC gives us is that picking values from the posterior is going to be “easy.” Even if we don’t know much about the distribution and have no idea how to explicitly calculate anything from it, we can still perform this random walk. This is what it looks like:

[Figure: trace of the MCMC random walk over the 1000 time steps]

The last step is to actually do a parameter estimation. The whole point is that the walk stays close to the best values, so we can now just average them to get {\theta \approx 0.84}. The average is just a finite sum instead of an integral now. Analytically, the posterior mean with a uniform prior is {86/102 \approx 0.843} (and the posterior max sits at {0.85}). Since we know MCMC is just giving us an estimate coming from randomness, this is really quite good!
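As a sanity check, here’s a short continuation of the code above that does the averaging and compares it to the exact posterior mean (dropping an initial “burn-in” chunk of the walk is a common refinement, not something the plain version above requires):

import numpy as np

# Average the walk to estimate theta and compare against the exact
# posterior mean of the Beta(86, 16) posterior from the uniform prior.
burn_in = 100                         # optionally discard early, unsettled steps
mcmc_estimate = np.mean(walk[burn_in:])
exact_mean = 86 / 102                 # about 0.843
print(mcmc_estimate, exact_mean)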

Decision Theory 4: Hypothesis Testing


Now we return to decision theory. It is also a return to the thing that first made me interested in learning something about statistics a year or two ago. I had heard about John Ioannidis’ shocking article “Why Most Published Research Findings Are False” and started to investigate. To me statistics was some settled thing that you hit your data with after doing an experiment. It told you whether or not your findings were real and how confident you could be in them.

Moreover, I believed that as long as you followed the prescriptions taught to you, you couldn’t mess it up. It was foolproof. Just look around and try to find one thing that science hasn’t touched. The scientific method has clearly led us to discover something about the world. That’s why stats seemed like an uninteresting part of math. People seemed to have figured it all out. Then I was shocked to find that article. I started to learn about all these fallacies, and methodological problems that I’ve been pointing out over the past few months.

One of the main difficulties, particularly in science, is classical null hypothesis significance testing (NHST). One way to try to mitigate these difficulties is to rephrase our hypothesis test as a Bayesian decision theory problem. This is not the only Bayesian reformulation (Kruschke’s MCMC stuff is pretty cool, and I might get to it someday), but it fits in as a nice example of the use of decision theory outside of the silly gambling problems I’ve been using.

Let’s start by seeing how to test a point null hypothesis. Think about the biased coin example. We want to test {\theta=1/2}, i.e. is the coin unbiased? This is obviously a ridiculous type of hypothesis test, because the term “unbiased” in real life encompasses a range {(1/2-\varepsilon, 1/2+\varepsilon)} where we can’t tell the difference. This is actually the case in most scientific situations as well (there is only so much precision your instruments can achieve), and often scientists incorrectly use a point NHST when there should be a ROPE (region of practical equivalence).

Our first step is to take the previous paragraph’s discussion and cheat a little. Suppose we want to test {\theta = \theta_0}. The Bayesian way to do this would work out of the box using a ROPE. Unfortunately, if we want continuous densities for the probabilities, then we will always reject our null hypothesis. This is because a point has probability zero. The cheat is to just convert the continuous prior, {\pi(\theta)}, to a piecewise defined prior where we assign a point mass of probability

\displaystyle \pi_0 = \displaystyle \int_{\theta_0-\varepsilon}^{\theta_0+\varepsilon} \pi(\theta)d\theta

to {\theta_0} and the renormalized old prior otherwise. This is merely saying that we make a starting assumption that {\theta} has true value {\theta_0} with probability {\pi_0}; no actual integral needs to be calculated, since the integral above is just intuitive justification for the shape of {\pi}. If this makes you uncomfortable, then use the uninformed prior of {\theta=\theta_0} having probability {1/2} and the alternative having a uniform distribution of mass 1/2.
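Here’s a minimal sketch of how that point mass could be computed, assuming (purely for illustration) a Beta(2, 2) prior on {\theta}, {\theta_0 = 1/2}, and {\varepsilon = 0.01}:

from scipy.stats import beta

# Hypothetical choices: a Beta(2, 2) prior on theta, theta_0 = 0.5, and a
# practical-equivalence half-width of epsilon = 0.01.
theta_0, eps = 0.5, 0.01
pi_0 = beta.cdf(theta_0 + eps, 2, 2) - beta.cdf(theta_0 - eps, 2, 2)
print(pi_0)   # the prior mass we lump onto the single point theta_0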

Let’s recap what we are trying to do. We have two hypotheses. The null which is {H_0: \theta=\theta_0}, and the alternative {H_1: \theta\neq \theta_0}. This type of NHST came up in the last post where we wanted to experimentally test whether or not the acceleration due to gravity was {g=9.8}. Our process should be clear if you’ve been following this sequence of posts. We just use our data to calculate the posterior distributions {P(H_0|x)} and {P(H_1|x)}. We must decide between these two by seeing which one has less risk (and that risk will come from a loss function which appropriately penalizes falsely accepting/rejecting each one).

This approach is really nice, because depending on your situation you will want to penalize differently. If you are testing a drug for effectiveness, then it is better to harshly penalize falsely claiming a placebo to be effective (a false positive or Type I error). If you are testing whether or not someone has a fatal disease, then you want to harshly penalize falsely claiming they have it and having them undergo dangerous and expensive unnecessary treatments. Maybe these aren’t the best examples, but you see how having a flexible system could be a lot more useful than blindly running a {p=0.05} NHST.

Rather than going through some made up example from fake randomly generated data as I’ve been doing, let’s examine some differences at the theoretical level when we assume everything is normal. Suppose our data is a sample of {n} points from a normal distribution. Any book on Bayesian statistics will have the details on working this out, so I’ll get to the punch line.

If we denote by {m(x)} the marginal density, then the posterior probability of {H_0} is given by

\displaystyle \frac{f(x|\theta_0)\pi_0}{m(x)}.

In the normal distribution case (we assume the prior has standard deviation {\tau}, the data has standard deviation {\sigma}, and both have mean {\theta_0}) we get something much more specific:

\displaystyle \left(1+\frac{(1-\pi_0)}{\pi_0}\cdot \frac{\exp\left(\frac{1}{2}z^2\left[1+\sigma^2/(n\tau^2)\right]^{-1}\right)}{(1+n\tau^2/\sigma^2)^{1/2}}\right)^{-1}

where {z=\frac{\sqrt{n}|\overline{x}-\theta_0|}{\sigma}}. This term actually appears in classical NHST as well. Let’s look at the differences. For the purpose of getting some numbers down, let’s assume {\pi_0=1/2} and {\sigma=\tau}. In a two-tailed test, let’s assume that we observe a {p=0.01} and hence would very, very strongly and confidently reject {H_0}. This corresponds to a {z}-value of {2.576}. In this case if {n} is small, i.e. in the 10-100 range, then the posterior is around {0.14} to {0.27}. This means that we would likely want to reject {H_0}, because it is quite a bit more unlikely than {H_1} (this will of course depend on the specifics of our loss function).

Shockingly, if {n} is large, we lose a lot of confidence. If {n=1000}, then the posterior for {H_0} is {0.53}. Whoops. The Bayesian approach says that {H_0} is actually more likely to be true than {H_1}, but our NHST gives us {p=0.01} level confidence for rejecting (i.e. the usual, sloppy reading that there is a 99% chance our observations were not a fluke and that the result causing us to reject {H_0} is real).
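If you want to reproduce these numbers, here’s a small sketch of the posterior formula above with {\pi_0=1/2}, {\sigma=\tau}, and {z=2.576} plugged in:

import numpy as np

# Posterior probability of H_0 from the formula above, assuming
# pi_0 = 1/2 and sigma = tau, for a z-value of 2.576 (two-tailed p = 0.01).
def posterior_H0(n, z=2.576, pi_0=0.5):
    ratio = np.exp(0.5 * z**2 / (1 + 1 / n)) / np.sqrt(1 + n)
    return 1 / (1 + (1 - pi_0) / pi_0 * ratio)

for n in [10, 100, 1000]:
    print(n, posterior_H0(n))   # about 0.14, 0.27, 0.53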

As we see, by working with the Bayesian framework, we get posterior probabilities for how likely {H_0} and {H_1} are given our observations of the data. This allows us to do a suitable analysis. The classical framework feels very limited, because even when we get extreme {p}-values that give us lots of confidence, we could accidentally be overlooking something that would be obvious if we worked directly with how likely each is to be true.

To end this post, I’ll just reiterate that careful scientists are completely aware that a {p}-value is not to be interpreted as a probability against {H_0}. One can certainly apply classical methods and end with a solid analysis. On the other hand, this is a quite widespread sloppiness or, less generously, a widespread misunderstanding of what is going on.

Decision Theory 3

If you could follow the last post then you have all the pieces you need to understand the basic theory. Let’s go back and actually work this out in the abstract now that we have an example for our template. If you only care about seeing some examples in action, then you should feel free to skip this post which will be almost entirely defining the pieces of the last post more rigorously. We will need to revise some things from the first post, because we were able to state things in a simpler form without Bayesian updating or continuous distributions happening.

Last time we introduced a one-parameter family of unknowns, the true bias of the coin. We denoted this {\theta}. For now we’ll keep this to just be some continuous real-valued parameter and it will represent an unknown quantity in our model. If you haven’t thought about this before, then I recommend continuing in the way we did last post. You pretend like {\theta} is some fixed known quantity and run classical decision theory. From there you extrapolate. The value of this parameter could be this, or this, or this, and my decision has to be the best no matter what it really is.

In the future, there could be a whole bunch of unknowns and {\theta} will turn into a vector or matrix, but for now we’ll stick to just a single variable. To pin down terminology, we will call {\theta} the parameter and {\Theta} the parameter space (all the possible values of {\theta}). So in our coin example {\Theta = [0,1]}.

We also have a collection of actions: {A}. An individual action will be denoted by {a}. For the coin example, an action would be betting on heads or tails. We will never be able to know {\theta} … because it is an unknown, but we will want to make observations/gather data which will be denoted {X}. In the coin example, this would be our observed sequence of flips (so it is probably best represented as a vector). We will denote the collection of all possible observations {\mathcal{X}} and this is called the sample space. In the coin example, we flipped the coin {100} times, so this consists of {2^{100}} vectors. In general, we will want to allow {X} to be continuous random variables and hence {\mathcal{X}} could be subsets of {\mathbb{R}^n}.

Let {I\subset \mathcal{X}} (suggestively we will often want to consider an “interval” {[a,b]\subset \mathbb{R}} if we just have one continuous random variable). As I already pointed out earlier, we will often want to take the view of a given fixed {\theta}. In this situation we will assume for the purposes of being able to analyze things that we always have an integrable probability distribution {f(x|\theta)} which is “the probability of observing x given {\theta}“. Thus, by definition, the probability of observing {I} given {\theta} is just the integral:

\displaystyle P_{\theta}(I)=\int_I f(x|\theta)dx

I won’t adopt the cumbersome notation that some texts use to indicate that this could be an integral or a finite sum. I will just use the integral, and assume the reader can translate to the appropriate sum if {\mathcal{X}} is discrete. If we have some function {h(X)}, then we define the expected value of {h(X)} over {\mathcal{X}} to be

\displaystyle E_{\theta}[h(X)] = \int_{\mathcal{X}}h(x)f(x|\theta)dx

Now that that is settled, let’s formalize the decision function, loss, and risk. Suppose that we have some prior probability describing the possibilities for {\theta}. We denote this {\pi(\theta)}. The choice of such a thing in the absence of any actual prior knowledge is one of the main (only?) arguments against Bayesian statistics. This shouldn’t be distressing, because any reasonable experiment will have a large enough sample size that picking an uninformed uniform prior will easily be overcome.

In the first decision theory post, we made a decision rule without basing it on any data. This is why we need to change our definition a little. In that situation a decision rule is equivalent to picking an action. If observing some data is involved, then our decision rule is a function {\delta: \mathcal{X}\rightarrow A}. This should just be read, “If I observe this type of data, then I will act in this way.” You let the data inform your decision. Our decision rule in the coin example was to look at the ratio of heads to tails. If there were more heads we pick heads. If there were more tails, we pick tails.

The loss function is a function {L: \Theta\times A \rightarrow \mathbb{R}}. This is the choice that people should feel a little uncomfortable with, because there is a definite choice that may or may not be reasonable affecting everything. The value {L(\theta, a)} should measure the loss that will be incurred if you do action {a} and {\theta} is the true value of the unknown.

We won’t worry so much about this right now. The more important one for us is the decision loss function {L:\Theta\times \mathcal{X}\rightarrow \mathbb{R}}. This is just plugging in to the other one: {L(\theta, \delta(x))}. Sometimes we just start with this one though. This was a no-brainer for our coin example, because I purposely set up the question to have a natural loss function. This was due to the fact that a well-defined “bet” was being made. In more general situations, the choice of a loss function could be seen as essentially equivalent to picking a betting scheme for your choices. You could easily come up with some wacky ones to see that it might not reflect reality if you aren’t careful.

To me the more “intuitive” notion is that of the risk function. This is the expected value of the loss:

\displaystyle R(\theta, \delta)=E_{\theta}[L(\theta, \delta(X))] = \int_\mathcal{X} L(\theta, \delta(x))f(x|\theta)dx

Note we integrate out the random variable {x}, but we are left with a function of {\theta}. We saw this in our coin example last time. We get a similar thing for the Bayesian risk, but we incorporate the prior probability of {\theta}. Lots of times it is actually somewhat easier to just jump right to the risk, because in the case of squared-error loss (defined below) we just get that the risk is the variance of the posterior distribution. No extra intermediary calculations are needed.

In general, most loss functions will be a variant on one of two types. The first is called the squared-error loss function. It is given by {L(\theta, a)=(\theta-a)^2}. You can think of this as “least-squares” fitting your decision or minimizing risk in the {L^2}-norm. The other is called the {0-1} loss function. This one arises quite naturally when you just have to pick between two choices like the coin flip. Ours was a variant on this. It penalizes you by {1} unit if your “decision is incorrect” and doesn’t penalize you at all if your “decision is correct.” It is given by

\displaystyle L(\theta, a_i)=\begin{cases} 0 & \text{if} \ \theta\in \Theta_i \\ 1 & \text{if} \ \theta\in \Theta_j \ \text{for} \ i\neq j\end{cases}

The beautiful thing about this one is that the risk is just {1} minus the posterior probability. Thus, it is minimized at the max of the posterior, which is often really easy to calculate. In the coin example, we got the beta distribution, and its max (the mode) has a simple closed form. Of course, we have to be careful that we are measuring the right thing, because we aren’t trying to predict the true bias. We were merely trying to predict heads or tails, so that situation was an even easier discrete version.
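As a tiny illustration of this (using hypothetical counts of 60 heads and 40 tails, a uniform prior, and the partition {\Theta_1=[0,1/2)}, {\Theta_2=[1/2,1]}), the posterior expected 0-1 loss of each action is just one minus the posterior mass of the corresponding region:

from scipy.stats import beta

# Hypothetical data: 60 heads, 40 tails, uniform prior => Beta(61, 41) posterior.
# Partition: Theta_1 = [0, 1/2) ("tails-biased"), Theta_2 = [1/2, 1] ("heads-biased").
heads, tails = 60, 40
post = beta(heads + 1, tails + 1)

p_theta1 = post.cdf(0.5)    # posterior mass of Theta_1
p_theta2 = 1 - p_theta1     # posterior mass of Theta_2

# Under 0-1 loss the expected loss of declaring Theta_i is 1 - P(Theta_i | data),
# so the best move is to pick the region with more posterior mass.
print(1 - p_theta1, 1 - p_theta2)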

Lastly, there is a partial ordering on decision functions given by \delta_1 \leq \delta_2 if and only if R(\theta, \delta_1) \leq R(\theta, \delta_2) for all \theta . A decision rule that is minimal in this ordering, i.e. one that no other rule beats, is called admissible and corresponds to a rational decision. If you make some other decision you are just asking to lose more.

Well, I think this post has gone on long enough (I’ve basically been trapped at the airport for the past 8 hours, so …). We’ll get back to some examples of all this next time. I just needed to finally formalize what we were doing before going any further.

A Bayesian Formulation of Occam’s Razor

Today we will formulate Occam’s Razor in Bayesian terms. Recall that this says that if two hypotheses explain the data equally well, then the one with fewer assumptions is to be preferred. Before continuing, we should first get a handle on what this is and what the Bayesian reformulation means. First, it is basically a scientific heuristic. The intuitive reason for it is that unnecessary hypotheses are just going to make your model more likely to make mistakes (i.e. it will “overfit”).

What this post is going to do is give a formulation of it in Bayesian terms. This is not a mathematical proof that Occam’s Razor is true or anything, but it will be a proof that under certain mild assumptions the principle falls out as a consequence. That’s what makes this kind of cool. We want to decide whether or not hypothesis A or B is a better statistical model where A and B explain the observed data equally well, but B has an extra parameter.

How should we do this? Well, in probabilistic terms we want to figure out {P(A|D)} and {P(B|D)}, the “probability that {A} is true given the data {D}” and the “probability that {B} is true given the data {D}.” We merely compare these two quantities for example by taking the quotient

\displaystyle \frac{P(A|D)}{P(B|D)}.

If the quotient is near {1}, then they are roughly equally good models. If the quotient is large, then {A} is a better hypothesis and if the quantity is close to {0}, then {B} is the better hypothesis.

Let’s take stock of our assumptions here. We do not assume Occam’s Razor (some people feel like OR is a pretty steep assumption), because it is not a priori clear that it is always the best principle to follow. But here we are merely making the assumption that comparing the probabilities that each model is a true model of the data we observe is a good test for selecting one model over another. It is kind of hard to argue against such a common sense assumption.

Now we use Bayes’ Theorem to convert these quantities to things we actually know about:

\displaystyle \frac{P(A|D)}{P(B|D)} = \frac{P(D|A)P(A)}{P(D|B)P(B)}

At this point we have some difficulty with the {B} hypothesis still, because implicitly we have assumed it relies on some extra parameter {\lambda}. To simplify the argument, we will assume that {\lambda} lies in some range (this isn’t unreasonable because in real life you should have some idea what order of magnitude etc this parameter should be): {\lambda_m \leq \lambda \leq \lambda_M}. We will make a less reasonable simplifying assumption and say that once this range is specified, we have a uniform chance of it being anything in the range, i.e.

\displaystyle P(\lambda |B) = \frac{1}{\lambda_M - \lambda_m}

for {\lambda} in the range and {0} otherwise. There will be an observed {\lambda_0} that maximizes the likelihood function (i.e. fits the data the best). Choose {\delta} so that {\lambda_0 \pm \delta} is an interval giving us reasonable certainty of the best {\lambda_0} (we could use the 95% HDI if we wanted to get the interval). Now let’s work out what is happening for {B}:

\displaystyle P(D|B) = \int P(D, \lambda|B)d\lambda = \int P(D|\lambda, B)P(\lambda |B)d\lambda

Approximating the likelihood by a Gaussian of width {\delta} centered at the best-fit value {\lambda_0}, this becomes

\displaystyle =\frac{1}{\lambda_M - \lambda_m}\int P(D|\lambda_0, B)\exp\left(-\frac{(\lambda-\lambda_0)^2}{2\delta^2}\right)d\lambda

\displaystyle =\frac{\delta\sqrt{2\pi}\,P(D|\lambda_0, B)}{\lambda_M - \lambda_m}

Now we can plug this into our original comparison ratio and use the fact that both are equally good at explaining the data:

\displaystyle \frac{P(A|D)}{P(B|D)}=\frac{(\lambda_M-\lambda_m)P(D|A)}{\delta\sqrt{2\pi}P(D|\lambda_0, B)}

This gives us two main conclusions. The first is that if we assume our two models make roughly equivalent predictions on the data, i.e. {P(D|A)\approx P(D|\lambda_0, B)}, then we should prefer {A}, because the range {\lambda_M-\lambda_m} in the numerator will in general be quite a bit larger than {\delta}, making the ratio large. This is exactly Occam’s Razor.
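To see the size of the penalty, here’s a tiny numerical sketch with made-up values: both models fit the data equally well, {\lambda} is allowed to range over an interval of length 10, and the best fit pins it down to about {\delta = 0.5}:

import numpy as np

# Made-up values: equal fits to the data, a prior range of length 10 for
# lambda, and a best-fit uncertainty of delta = 0.5.
p_D_given_A = 1.0
p_D_given_B = 1.0          # i.e. P(D | lambda_0, B); only the ratio matters
lam_range = 10.0           # lambda_M - lambda_m
delta = 0.5

ratio = (lam_range * p_D_given_A) / (delta * np.sqrt(2 * np.pi) * p_D_given_B)
print(ratio)   # about 8, so the simpler model A is preferred by that factor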

The possibly more interesting consequence is that we now know exactly how much this extra parameter is “penalizing” the theory. So given specific cases we can test whether or not that extra parameter is worth putting in. In other words, are the predictions enough better with the extra parameter to overcome the penalty of introducing a more complicated hypothesis? This vague, abstract notion from Occam’s Razor gets explicitly quantified in Bayesian analysis, so we can confidently apply the razor when it is warranted and avoid it when it isn’t.

Statistical Oddities Part 3

This oddity is really hard to get your head around if you’ve been doing standard null-hypothesis testing all your life. This oddity says that null hypothesis significance testing depends on the intentions of the experimenter.

What does this mean? Well, let’s go back to our worked example of flipping a coin and trying to determine whether or not it is biased based on the observed data. Recall that in our Bayesian analysis, the test for whether or not the coin was biased came down to whether 0.5 was a reasonable guess given the posterior distribution. We didn’t need to know anything about the intentions of the person flipping the coin.

How does traditional (read: frequentist) null hypothesis testing work? We set up an experiment in which the experimenter flips the coin 100 times. If we observe 47 heads, then we calculate the probability of seeing a result at least that extreme given that the coin is fair. If that probability is below a certain threshold, then we say the coin is biased, because it is extremely unlikely that we would observe that result by chance alone. Otherwise we do not reject the null hypothesis and say the coin is fair.

Unfortunately, our probability space depends on the number of total coin flips. The probability space is extremely different if the experimenter set up the experiment so that the number of flips was not predetermined and instead a coin was flipped as many times as possible for 5 minutes. The probability space in this case is much larger because some possible samples would have 90 flips and some would have 110 and so on.

It would also be radically different if the experimenter decided to flip the coin until they reached 47 heads. Then the probability space would again have all sorts of different possibilities for the number of flips. Maybe sometimes you would expect to do 150 flips before seeing 47 heads.
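To see the dependence concretely, here’s a sketch (using scipy, with one-sided p-values just for illustration) of the same observed data, 47 heads in 100 flips, analyzed under two of those stopping rules:

from scipy.stats import binom, nbinom

heads, flips = 47, 100
tails = flips - heads

# Stopping rule 1: flip exactly 100 times. One-sided p-value for
# "too few heads" under a fair coin.
p_fixed_n = binom.cdf(heads, flips, 0.5)

# Stopping rule 2: flip until 47 heads are seen. Now the random quantity is
# the number of flips, and "too few heads" becomes "at least 53 tails before
# the 47th head."
p_stop_at_heads = 1 - nbinom.cdf(tails - 1, heads, 0.5)

print(p_fixed_n, p_stop_at_heads)   # different p-values for identical data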

Just to reiterate, this isn’t a trivial matter. This says we need to know the intent of the experimenter if we want to do a legitimate null hypothesis significance test. If we don’t know how the experiment was designed, then we don’t know what our probability space should look like to know whether or not we should reject the null hypothesis.

To see why this is shocking just do the thought experiment where three labs flip the same coin. Each of the labs sets up the experiment in the three ways listed above. You get the exact same data from each of the labs. You could rig the numbers so that in some cases you decide the coin is fair and in others you decide that it is not fair. But they gave you the same exact data of 47 heads out of 100 flips (or whatever your thought experiment requires)! Let’s reiterate: They gave you the exact same data, but came to different conclusions about the fairness of the coin. How is this possible?

If we live in some sort of objective universe where we can do experiments and draw conclusions from them, then the results of an experiment should rely on the data and not on the subjective intentions of the experimenter. More bluntly, determining whether or not the coin is biased should not depend on what is happening in the coin flipper’s mind during the flipping.

This is a very real and dangerous statistical oddity if the person running the analysis isn’t aware of it. In fact, I dare say that this is one of the easy ways to massage data in the sciences to get “results” where none exist. To me, this is actually one of the strongest arguments for scientists to use Bayesian statistics rather than null hypothesis testing. As we saw in the linked post, Bayesian statistics gets around this issue and only needs the raw data and not the intentions of the experimenter.

By the way, before I get sued, I stole this example (with different numbers) from Doing Bayesian Data Analysis by John K. Kruschke. It is a really fantastic book to learn about this stuff.