The Carter Catastrophe

I’ve been reading Manifold: Time by Stephen Baxter. The book is quite good so far, and it presents a fascinating probabilistic argument that humans will go extinct in the near future. It is sometimes called the Carter Catastrophe, because Brandon Carter first proposed it in 1983.

I’ll use Bayesian arguments, so you might want to review some of my previous posts on the topic if you’re feeling shaky. One thing we didn’t talk all that much about is the idea of model selection. This is the most common thing scientists have to do. If you run an experiment, you get a bunch of data. Then you have to figure out the most likely reason for what you see.

Let’s take a basic example. We have a giant tub of golf balls, and we can’t see inside the tub. There could be 1 ball or a million. We’re told the owner accidentally dropped a red ball in at some point. All the other balls are the standard white golf balls. We decide to run an experiment where we draw a ball out, one at a time, until we reach the red one.

First ball: white. Second ball: white. Third ball: red. We stop. We’ve now generated a data set from our experiment, and we want to use Bayesian methods to give the probability of there being three total balls or seven or a million. In probability terms, we need to calculate the probability that there are x balls in the tub given that we drew the red ball on the third draw. Any time we see this language, our first thought should be Bayes’ theorem.

Define A_i to be the model of there being exactly i balls in the tub. I’ll use “3” inside of P( ) to be the event of drawing the red ball on the third try. We have to make a finiteness assumption, and although this is one of the main critiques of the argument, we can examine what happens as we let the size of the bound grow. Suppose for now the tub can only hold 100 balls.

A priori, we have no idea how many balls are in there, so we’ll assume all “models” are equally likely. This means P(A_i)=1/100 for all i. By Bayes’ theorem we can calculate:

P(A_3|3) = \frac{P(3|A_3)P(A_3)}{\sum_{i=1}^{100}P(3|A_i)P(A_i)}

= \frac{(1/3)(1/100)}{(1/100)\sum_{i=3}^{100}1/i} \approx 0.09

So there’s around a 9% chance that there are only 3 balls in the tub. That bottom summation remains exactly the same when computing P(A_n | 3) for any n and equals about 3.69, and the (1/100) cancels out every time. So we can compute explicitly that for n ≥ 3:

P(A_n|3)\approx \frac{1}{n}(0.27)

This is a decreasing function of n, and this shouldn’t be surprising at all. It says that as we guess there are more and more balls in the tub, the probability of that guess goes down. This makes sense, because it’s unreasonable to think we’d see the red one that early if there are actually 100 balls in the tub.
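If you want to double check these numbers, here’s a quick sketch in Python (the uniform prior over 1 to 100 balls is the same assumption as above):

# A quick numerical check of the computation above. We put a uniform prior on
# the number of balls, 1 through 100, and condition on drawing the red ball
# exactly on the third draw.
N = 100
prior = {n: 1.0 / N for n in range(1, N + 1)}

# Likelihood of the red ball showing up exactly on draw three given n balls:
# the red ball is equally likely to be in any position, so this is 1/n for n >= 3.
def likelihood(n):
    return 1.0 / n if n >= 3 else 0.0

evidence = sum(likelihood(n) * prior[n] for n in prior)
posterior = {n: likelihood(n) * prior[n] / evidence for n in prior}

print(posterior[3])     # roughly 0.09
print(posterior[50])    # roughly 0.27/50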

There are lots of ways to play with this. What happens if our tub could hold millions of balls but we still assume a uniform prior? It drags all the probabilities down, but the general trend is the same: it becomes less and less reasonable to assume a large total number of balls given that we found the red one so early.

You could also only care about this “earliness” idea and redo the computations where you ask how likely is A_n given that we found the red ball by the third try. This is actually the more typical way the problem is formulated in the Doomsday arguments. It’s more complicated, but the same idea pops out, and this should make intuitive sense.

Part of the reason these computations were somewhat involved is that we tried to get a full distribution on the number of balls. But we could have compared just two hypotheses to get a much cleaner answer (homework for you). What if we only had two choices, “small number of total balls (say 10)” or “large number of total balls (say 10,000)”? You’d find there is around a 99% chance that the “small” hypothesis is correct.
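Here is a sketch of that two-hypothesis homework computation; the sizes 10 and 10,000 and the 50/50 prior are just the ones suggested above:

# "Small" tub (10 balls) vs "large" tub (10,000 balls), equal prior probability,
# given that the red ball appeared on the third draw.
p_small, p_large = 0.5, 0.5
like_small = 1.0 / 10        # chance the red ball sits in position 3 out of 10
like_large = 1.0 / 10000     # chance it sits in position 3 out of 10,000

posterior_small = like_small * p_small / (like_small * p_small + like_large * p_large)
print(posterior_small)       # well over 99%, i.e. overwhelmingly "small"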

Here’s the leap. Now assume the fact that you exist right now is random. In other words, you popped out at a random point in the existence of humans. So the totality of humans to ever exist are the white balls and you are the red ball. The same type of argument above applies, and it says that the most likely thing is that you aren’t born at some super early point in human history. In fact, it’s unreasonable from a probabilistic standpoint to think that humans will continue much longer at all given your existence.

The “small” total population of humans is far, far more likely than the “large” total population, and the interesting thing is that this remains true even if you mess with the uniform prior. You could assume it is much more likely a priori for humans to continue to make improvements and colonize space and develop vaccines giving a higher prior for the species existing far into the future. But unfortunately the Bayesian argument will still pull so strongly in favor of humans ceasing to exist in the near future that one must conclude it is inevitable and will happen soon!

Anyway. I’m travelling this week, so I’m sorry if there are errors in those calculations. I was in a hurry and never double checked them. The crux of the argument should still make sense even if you don’t get my exact numbers. There are also a lot of interesting and convincing rebuttals, but I don’t have time to get into them now (including the fact that unlikely hypotheses turn out to be true all the time).

Does Bayesian Epistemology Suffer Foundational Problems?

I recently had a discussion about whether Bayesian epistemology suffers from the problem of induction, and I think some interesting things came from it. If these words make you uncomfortable, think of epistemology as the study of how we form beliefs and gain knowledge. Bayesian epistemology means we model it probabilistically using Bayesian methods. This old post of mine talks a bit about it but is long and unnecessary to read to get the gist of this post.

I always think of the problem of induction in terms of the classic swan analogy. Someone wants to claim that all swans are white. They go out and see swan after swan after swan, each confirming the claim. Is there any point at which the person can legitimately say they know that all swans are white?

Classically, the answer is no. The problem of induction is crippling to classic epistemologies, because we can never be justified in believing any universal claim (at least using empirical methods). One of the great things about probabilistic epistemologies (not just Bayesian) is that it circumvents this problem.

Classical epistemologies require you to have 100% certainty to attain knowledge. Since you can’t ever be sure you’ve encountered every instance of a universal, you can’t be certain there is no instance that violates the universal. Hence the problem of induction is an actual problem. But note it is only a problem if your definition of knowledge requires you to have absolute certainty of truth.

Probabilistic epistemologies lower the threshold. They merely ask that you have 95% (or 98%, etc) confidence (or that your claim sits in some credible region, etc) for the justification. By definition, knowledge is always tentative and subject to change in these theories of knowledge.

This is one of the main reasons to use a probabilistic epistemology. It is the whole point. They were invented to solve this problem, so I definitely do not believe that Bayesian epistemology suffers from the problem of induction.

But it turns out I had misunderstood. The point the other person tried to make was much more subtle. It had to do with the other half of the problem of induction (which I always forget about, because I usually consider it an axiom when doing epistemology).

This other problem is referred to as the principle of the uniformity of nature. One must presuppose that the laws of nature are consistent across time and space. Recall that a Bayesian has prior beliefs and then upon encountering new data they update their beliefs factoring in both the prior and new data.

This criticism has to do with the application of Bayes’ theorem period. In order to consider the prior to be relevant to factor in at all, you must believe it is … well, relevant! You’ve implicitly assumed at that step the uniformity of nature. If you don’t believe nature is consistent across time, then you should not factor prior beliefs into the formation of knowledge.

Now a Bayesian will often try to use Bayesian methods to justify the uniformity of nature. We start with a uniform prior so that we haven’t assumed anything about the past or its relevance to the future. Then we merely note that billions of people across thousands of years have only ever observed a uniformity of nature, and hence it is credible to believe the axiom is true.

Even though my gut buys that argument, it is a bit intellectually dishonest. You can never, ever justify an axiom by using a method that relies on that axiom. That is the quintessential begging the question fallacy.

I think the uniformity of nature issue can be dismissed on different grounds. If you want to dismiss an epistemology based on the uniformity of nature issue, then you have to be willing to dismiss every epistemology that allows you to come to knowledge.

It doesn’t matter what the method is. If you somehow come to knowledge, then one second later all of nature could have changed and hence you no longer have that knowledge. Knowledge is impossible if you want to use that criticism. All this leaves you with is radical skepticism, which of course leads to self-contradiction (if you know you can’t know anything, then you know something –><– ).

This is why I think of the uniformity of nature as a necessary axiom for epistemology. Without some form of it, epistemology is impossible. So at least in terms of the problem of induction, I do not see foundational problems for Bayesian epistemology.

Markov Chain Monte Carlo Example

Let’s look at a problem called parameter estimation. As usual, we have a bunch of coin flips. What we’ve learned to do with Bayesian statistics is calculate some posterior distribution that tells me how likely the bias {\theta} is. I ask myself, “Is it a credible hypothesis that the coin is fair ({\theta =1/2})?” I find out yes it is. I ask myself, “Is it a credible hypothesis that the coin is massively biased at {\theta=4/5}?” I find out yes it is. Uh oh.

Maybe in abstract math land this type of contradictory information is fine. I should be honest that both are credible guesses based on my data, and Bayesian statistics helps me to be very precise about my degrees of certainty and uncertainty. Unfortunately, in the real world I want to figure out which {\theta} is “best” so that I can use it in my model for practical purposes. This is called parameter estimation, because I want to estimate what my parameter {\theta} should be in my model.

We’re in luck for the coin example, because we only have one parameter living in one-dimensional space. This alone vastly simplifies the situation, but we have something far, far better. Our posterior distribution has a unique maximum, that max can be determined easily and exactly, and it sits essentially on top of the mean of the distribution. This means that we can safely use that parameter as the “best.”

In the real world, we often have several parameters we are trying to estimate in a high-dimensional space, and the posterior is some non-convex crazy thing with lots of local mins/maxs that can’t be determined analytically. Let’s face it. Optimization is really hard even in relatively nice situations. The real world is usually not nice.

There often isn’t even an obvious notion of what you mean by “best” set of parameters either. Think of a symmetrical bimodal distribution where both peaks have the same max. You don’t really have any good reason to pick one of the points that gives the max, and if you do something like take the mean, then you might end up with a min by accident. The method I’m going to describe doesn’t really help with this issue of “equally good choices”, but it does give a beautiful way to deal with high-dimensional parameter spaces and crazy posterior distributions.

The idea is extremely simple. You will pick some starting collection of parameters. Then you let those parameters randomly move in some direction. We will then use our model to test whether or not it is more or less likely to see the data that we observed under each of those parameter choices. With some probability depending on this likelihood we will move that parameter to that value. This is just a Markov chain process of our {\theta} values moving through the possible parameter values, and hence this technique is called a Markov Chain Monte Carlo (MCMC) method (I used the indefinite article “a” because there are all sorts of variations on this out there).

It turns out that as long as we set this up in a reasonable way, then it will converge. Here’s something cool about this. Your parameters could live in some gigantic space for which it would be impossible to search for a good parameter estimation. Usually there is some much, much smaller dimensional subset of reasonably likely candidates. Once you move to this smaller dimensional set, by the nature of the algorithm, you will stay close to it and hence start moving to something optimal much faster. Here’s a picture showing how the random walks stay on a smaller set in a real example:

Let’s actually implement this in the silly case of the coin example where we know what the answer should be. My next post might try to implement this for some sort of real data set, although that could be more time consuming than I’m willing to do. To make this example more fun, I had the computer pick a random number in {[0,1]} and then generate 100 coin flips with bias equal to that number without telling me the number! This way we are in a realistic situation of not knowing what the “correct” answer is ahead of time.

I got 85 heads and 15 tails. To make computations easier, let’s assume the prior is just uniform. This means the posterior is proportional to the likelihood {p(D|\theta)=\theta^{85}\cdot (1-\theta)^{15}}, and the normalizing constant won’t matter for what follows. I’ll start the random walk at {\theta = 0.5}. To know how much to move by, I pick a random number from a normal distribution with mean {0} and standard deviation {0.1}. So if I pick {0.05}, then my candidate place to move to is {0.55}.

I compute {p(D|\theta_{new})/p(D|\theta_{old})} and I move to the new spot with this probability. Note that if my new theta value is more likely to be the true theta, then I will always move to the new value, because the probability of moving is greater than {1}. The more unlikely my new theta value is, the less likely it is that I will move there. This implementation is called the Metropolis (or Metropolis-Hastings) algorithm. Note how simple the implementation is. It is only a few lines of code:

import random
import pylab

# Unnormalized posterior (uniform prior, 85 heads and 15 tails observed)
def p(theta):
    return (theta**85)*((1-theta)**15)

# Random walk step size
def norm_dist():
    return random.normalvariate(0, 0.1)

# Propose one step of the random walk from spot theta, staying inside (0, 1)
def rand_walk(theta):
    x = norm_dist()
    if 0 < theta + x < 1:
        return theta + x
    else:
        return theta

# Simulate the random walk for 1000 time steps
walk = [0.5]
for i in range(1000):
    n = walk[-1]          # current position
    y = rand_walk(n)      # proposed position
    # Accept with probability p(y)/p(n); if the proposal is more likely we always move
    if random.random() < p(y)/p(n):
        walk.append(y)
    else:
        walk.append(n)

# Plot the results
time_steps = [i for i in range(len(walk))]
pylab.plot(walk, time_steps)
pylab.title('Random Walk Visualization')
pylab.xlabel('Theta Value')
pylab.ylabel('Time')
pylab.show()

Note that the key insight that MCMC gives us is that picking values from the posterior is going to be “easy.” Even if we don’t know much about the distribution and have no idea how to explicitly calculate anything from it, we can still perform this random walk. This is what it looks like:

[Figure: mcmc, the trace of the random walk of theta values over time]

The last step is to actually do a parameter estimation. The whole point is that the walk will stay close to the best value, so we can now just average these to get {\theta = 0.84}. The average is just a finite sum instead of an integral now. If we had done this analytically, we would have gotten {0.85}. Since we know MCMC is just giving us an estimation coming from randomness, this is really quite good!
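In code, the estimate is just a couple of lines tacked onto the walk above (throwing away some initial steps as burn-in is a standard precaution; the cutoff of 100 is an arbitrary choice on my part):

# Average the walk after discarding a burn-in period to estimate theta.
burn_in = 100
theta_estimate = sum(walk[burn_in:]) / len(walk[burn_in:])
print(theta_estimate)    # lands close to 0.84 on a typical run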

Decision Theory 4: Hypothesis Testing

claimtoken-5348b79fd759b

Now we return to decision theory. It is also a return to the thing that first made me interested in learning something about statistics a year or two ago. I had heard about John Ioannidis’ shocking article “Why Most Published Research Findings Are False” and started to investigate. To me statistics was some settled thing that you hit your data with after doing an experiment. It told you whether or not your findings were real and how confident you could be in them.

Moreover, I believed that as long as you followed the prescriptions taught to you, you couldn’t mess it up. It was foolproof. Just look around and try to find one thing that science hasn’t touched. The scientific method has clearly led us to discover something about the world. That’s why stats seemed like an uninteresting part of math. People seemed to have figured it all out. Then I was shocked to find that article. I started to learn about all these fallacies, and methodological problems that I’ve been pointing out over the past few months.

One of the main difficulties, particularly in science, is classical null hypothesis significance testing (NHST). One way to try to mitigate these difficulties is to rephrase our hypothesis test as a Bayesian decision theory problem. This is not the only Bayesian reformulation (Kruschke’s MCMC stuff is pretty cool which I might get to someday), but it fits in as a nice example of the use of decision theory outside of the silly gambling problems I’ve been using.

Let’s start by seeing how to test a point null hypothesis. Think about the biased coin example. We want to test {\theta=1/2}, i.e. is the coin unbiased? This is obviously a ridiculous type of hypothesis test, because the term “unbiased” in real life encompasses a range {(1/2-\varepsilon, 1/2+\varepsilon)} where we can’t tell the difference. This is actually the case in most scientific situations as well (there is only so much precision your instruments can achieve), and often scientists incorrectly use a point NHST when there should be a ROPE (region of practical equivalence).

Our first step is to take the previous paragraph’s discussion and cheat a little. Suppose we want to test {\theta = \theta_0}. The Bayesian way to do this would work out of the box using a ROPE. Unfortunately, if we want continuous densities for the probabilities, then we will always reject our null hypothesis. This is because a point has probability zero. The cheat is to just convert the continuous prior, {\pi(\theta)}, to a piecewise defined prior where we assign a point mass of probability

\displaystyle \pi_0 = \displaystyle \int_{\theta_0-\varepsilon}^{\theta_0+\varepsilon} \pi(\theta)d\theta

to {\theta_0} and the renormalized old prior otherwise. This is merely saying that we make a starting assumption that {\theta} has true value {\theta_0} with probability {\pi_0}, and hence no actual integral needs to be calculated. That is just for intuitive justification for the shape of {\pi}. If this makes you uncomfortable, then use the uninformed prior of {\theta=\theta_0} having probability {1/2} and the alternative having a uniform distribution of mass 1/2.

Let’s recap what we are trying to do. We have two hypotheses. The null, which is {H_0: \theta=\theta_0}, and the alternative, {H_1: \theta\neq \theta_0}. This type of NHST came up in the last post where we wanted to experimentally test whether or not the acceleration due to gravity was {g=9.8}. Our process should be clear if you’ve been following this sequence of posts. We just use our data to calculate the posterior probabilities {P(H_0|x)} and {P(H_1|x)}. We must decide between these two by seeing which one has less risk (and that risk will come from a loss function which appropriately penalizes falsely accepting/rejecting each one).

This approach is really nice, because depending on your situation you will want to penalize differently. If you are testing a drug for effectiveness, then it is better to harshly penalize falsely claiming a placebo to be effective (a false positive or Type I error). If you are testing whether or not someone has a fatal disease, then you want to harshly penalize falsely claiming they have it and having them undergo dangerous and expensive unnecessary treatments. Maybe these aren’t the best examples, but you see how having a flexible system could be a lot more useful than blindly running a {p=0.05} NHST.

Rather than going through some made up example from fake randomly generated data as I’ve been doing, let’s examine some differences at the theoretical level when we assume everything is normal. Suppose our data is a sample of {n} points from a normal distribution. Any book on Bayesian statistics will have the details on working this out, so I’ll get to the punch line.

If we denote {m(x)} the marginal density, then the posterior distribution for {H_0} is given by

\displaystyle \frac{f(x|\theta_0)\pi_0}{m(x)}.

In the normal distribution case (we assume the prior has standard deviation {\tau}, the data has standard deviation {\sigma}, and both have mean {\theta_0}) we get something much more specific:

\displaystyle \left(1+\frac{(1-\pi_0)}{\pi_0}\cdot \frac{\exp\left(\frac{1}{2}z^2[1+\sigma^2/(n\tau^2)]^{-1}\right)}{(1+n\tau^2/\sigma^2)^{1/2}}\right)^{-1}

where {z=\frac{\sqrt{n}|\overline{x}-\theta_0|}{\sigma}}. This term actually appears in classical NHST as well. Let’s look at the differences. For the purpose of getting some numbers down, let’s assume {\pi_0=1/2} and {\sigma=\tau}. In a two-tailed test, let’s assume that we observe a {p=0.01} and hence would very, very strongly and confidently reject {H_0}. This corresponds to a {z}-value of {2.576}. In this case if {n} is small, i.e. in the 10-100 range, then the posterior is around {0.14} to {0.27}. This means that we would likely want to reject {H_0}, because it is quite a bit more unlikely than {H_1} (this will of course depend on the specifics of our loss function).

Shockingly, if {n} is large, we lose a lot of confidence. If {n=1000}, then the posterior for {H_0} is {0.53}. Woops. The Bayesian approach says that {H_0} is actually more likely to be true than {H_1}, but our NHST gives us {p=0.01} level confidence for rejecting (i.e. there is a 99% chance that our data observations were not a fluke chance and the result that causes us to reject {H_0} is real).
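If you want to reproduce these numbers, here is a quick sketch of the formula above under the stated assumptions ({\pi_0=1/2}, {\sigma=\tau}, {z=2.576}):

import math

# Posterior probability of H_0 as a function of the sample size n,
# assuming pi_0 = 1/2 and sigma = tau so that sigma^2/(n tau^2) = 1/n.
def posterior_H0(n, z=2.576, pi0=0.5):
    odds = ((1 - pi0) / pi0) * math.exp(0.5 * z**2 / (1 + 1.0 / n)) / math.sqrt(1 + n)
    return 1.0 / (1 + odds)

for n in [10, 100, 1000]:
    print(n, round(posterior_H0(n), 2))    # roughly 0.14, 0.27, 0.53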

As we see, by working with the Bayesian framework, we get posterior probabilities for how likely {H_0} and {H_1} are given our observations of the data. This allows us to do a suitable analysis. The classical framework feels very limited, because even when we get extreme {p}-values that give us lots of confidence, we could accidentally be overlooking something that would be obvious if we worked directly with how likely each is to be true.

To end this post, I’ll just reiterate that careful scientists are completely aware of the fact that a {p}-value is not to be interpreted as a probability against {H_0}. One can certainly apply classical methods and end with a solid analysis. On the other hand, this sloppiness is quite widespread, or, less generously, it reflects a widespread misunderstanding of what is going on.

Decision Theory 3

If you could follow the last post then you have all the pieces you need to understand the basic theory. Let’s go back and actually work this out in the abstract now that we have an example for our template. If you only care about seeing some examples in action, then you should feel free to skip this post which will be almost entirely defining the pieces of the last post more rigorously. We will need to revise some things from the first post, because we were able to state things in a simpler form without Bayesian updating or continuous distributions happening.

Last time we introduced a one-parameter family of unknowns, the true bias of the coin. We denoted this {\theta}. For now we’ll keep this to just be some continuous real-valued parameter and it will represent an unknown quantity in our model. If you haven’t thought about this before, then I recommend continuing in the way we did last post. You pretend like {\theta} is some fixed known quantity and run classical decision theory. From there you extrapolate. The value of this parameter could be this, or this, or this, and my decision has to be the best no matter what it really is.

In the future, there could be a whole bunch of unknowns and {\theta} will turn into a vector or matrix, but for now we’ll stick to just a single variable. To pin down terminology, we will call {\theta} the parameter and {\Theta} the parameter space (all the possible values of {\theta}). So in our coin example {\Theta = [0,1]}.

We also have a collection of actions: {A}. An individual action will be denoted by {a}. For the coin example, an action would be betting on heads or tails. We will never be able to know {\theta} … because it is an unknown, but we will want to make observations/gather data which will be denoted {X}. In the coin example, this would be our observed sequence of flips (so it is probably best represented as a vector). We will denote the collection of all possible observations {\mathcal{X}} and this is called the sample space. In the coin example, we flipped the coin {100} times, so this consists of {2^{100}} vectors. In general, we will want to allow {X} to be continuous random variables and hence {\mathcal{X}} could be subsets of {\mathbb{R}^n}.

Let {I\subset \mathcal{X}} (suggestively we will often want to consider an “interval” {[a,b]\subset \mathbb{R}} if we just have one continuous random variable). As I already pointed out earlier, we will often want to take the view of a given fixed {\theta}. In this situation we will assume for the purposes of being able to analyze things that we always have an integrable probability distribution {f(x|\theta)} which is “the probability of observing x given {\theta}“. Thus, by definition, the probability of observing {I} given {\theta} is just the integral:

\displaystyle P_{\theta}(I)=\int_I f(x|\theta)dx

I won’t adopt the cumbersome notation that some texts use to indicate that this could be an integral or a finite sum. I will just use the integral, and assume the reader can translate to the appropriate sum if {\mathcal{X}} is discrete. If we have some function {h(X)}, then we define the expected value of {h(X)} over {\mathcal{X}} to be

\displaystyle E_{\theta}[h(X)] = \int_{\mathcal{X}}h(x)f(x|\theta)dx

Now that that is settled, let’s formalize the decision function, loss, and risk. Suppose that we have some prior probability describing the possibilities for {\theta}. We denote this {\pi(\theta)}. The choice of such a thing in the absence of any actual prior knowledge is one of the main (only?) arguments against Bayesian statistics. This shouldn’t be distressing, because any reasonable experiment will have a large enough sample size that the data will easily overwhelm an uninformed uniform prior.

In the first decision theory post, we made a decision rule without basing it on any data. This is why we need to change our definition a little. In that situation a decision rule is equivalent to picking an action. If observing some data is involved, then our decision rule is a function {\delta: \mathcal{X}\rightarrow A}. This should just be read, “If I observe this type of data, then I will act in this way.” You let the data inform your decision. Our decision rule in the coin example was to look at the ratio of heads to tails. If there were more heads we pick heads. If there were more tails, we pick tails.

The loss function is a function {L: \Theta\times A \rightarrow \mathbb{R}}. This is the choice that people should feel a little uncomfortable with, because there is a definite choice that may or may not be reasonable affecting everything. The value {L(\theta, a)} should measure the loss that will be incurred if you do action {a} and {\theta} is the true value of the unknown.

We won’t worry so much about this right now. The more important one for us is the decision loss function {L:\Theta\times \mathcal{X}\rightarrow \mathbb{R}}. This is just plugging in to the other one: {L(\theta, \delta(x))}. Sometimes we just start with this one though. This was a no-brainer for our coin example, because I purposely set up the question to have a natural loss function. This was due to the fact that a well-defined “bet” was being made. In more general situations, the choice of a loss function could be seen as essentially equivalent to picking a betting scheme for your choices. You could easily come up with some wacky ones to see that it might not reflect reality if you aren’t careful.

To me the more “intuitive” notion is that of the risk function. This is the expected value of the loss:

\displaystyle R(\theta, \delta)=E_{\theta}[L(\theta, \delta(X))] = \int_\mathcal{X} L(\theta, \delta(x))f(x|\theta)dx

Note we integrate out the random variable {x}, but we are left with a function of {\theta}. We saw this in our coin example last time. We get a similar thing for the Bayesian risk, but we incorporate the prior probability of {\theta}. Lots of times it is actually somewhat easier to just jump right to the risk, because in the case of squared-error loss (see below) we just get that the risk is the variance of the posterior distribution. No extra intermediary calculations are needed.

In general, most loss functions will be a variant on one of two types. The first is called the squared-error loss function. It is given by {L(\theta, a)=(\theta-a)^2}. You can think of this as “least-squares” fitting your decision or minimizing risk in the {L^2}-norm. The other is called the {0-1} loss function. This one arises quite naturally when you just have to pick between two choices like the coin flip. Ours was a variant on this. It penalizes you by {1} unit if your “decision is incorrect” and doesn’t penalize you at all if your “decision is correct.” It is given by

\displaystyle L(\theta, a_i)=\begin{cases} 0 & \text{if} \ \theta\in \Theta_i \\ 1 & \text{if} \ \theta\in \Theta_j \ \text{for} \ i\neq j\end{cases}

The beautiful thing about this one is that the risk is just {1} minus the posterior distribution. Thus, it is minimized at the max of the posterior which is often really easy to calculate. In the coin example, we got the beta distribution and hence the max was just the observed proportion of heads (which sits essentially at the mean). Of course, we have to be careful that we are measuring the right thing, because we aren’t trying to predict the true bias. We were merely trying to predict heads or tails so that situation was an even easier discrete version.
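To make the contrast concrete, here is a small numerical sketch with a beta posterior on a grid (the particular counts, 5 and 3 in the shifted notation, are just an illustration): the squared-error loss is minimized at the posterior mean, while the 0-1 style loss is minimized at the max.

import numpy as np

# Unnormalized beta(5,3) posterior (shifted notation) on a grid.
theta = np.linspace(0.001, 0.999, 999)
posterior = theta**5 * (1 - theta)**3
posterior /= posterior.sum()

# Squared-error loss: expected loss of guessing "a" is the sum of (theta - a)^2
# weighted by the posterior; the minimizer agrees with the posterior mean.
sq_loss = [(posterior * (theta - a)**2).sum() for a in theta]
print(theta[np.argmin(sq_loss)], (posterior * theta).sum())

# 0-1 style loss with a small tolerance: expected loss of guessing "a" is 1 minus
# the posterior mass within the tolerance; the minimizer sits at the posterior max.
tol = 0.01
zero_one_loss = [1 - posterior[np.abs(theta - a) < tol].sum() for a in theta]
print(theta[np.argmin(zero_one_loss)], theta[np.argmax(posterior)])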

Lastly, there is a partial ordering on decision functions given by \delta_1 \leq \delta_2 if and only if R(\theta, \delta_1) \leq R(\theta, \delta_2) for all \theta . A decision rule that is minimal in this ordering (no other rule has risk at least as small for every \theta and strictly smaller for some \theta ) is called admissible, and it corresponds to a rational decision. If you make some other decision you are just asking to lose more.

Well, I think this post has gone on long enough (I’ve basically been trapped at the airport for the past 8 hours, so …). We’ll get back to some examples of all this next time. I just needed to finally formalize what we were doing before going any further.

A Bayesian Formulation of Occam’s Razor

Today we will formulate Occam’s Razor in Bayesian terms. Recall that this says that if two hypotheses explain the data equally well, then the one with less assumptions is to be preferred. Before continuing, we should first get a handle on what this is and what the Bayesian reformulation means. First, it is basically a scientific heuristic. The intuitive reason for it is that unnecessary hypotheses are just going to make your model more likely to make mistakes (i.e. it will “overfit”).

What this post is going to do is give a formulation of it in Bayesian terms. This is not a mathematical proof that Occam’s Razor is true or anything, but it will be a proof that under certain mild assumptions the principle falls out as a consequence. That’s what makes this kind of cool. We want to decide whether or not hypothesis A or B is a better statistical model where A and B explain the observed data equally well, but B has an extra parameter.

How should we do this? Well, in probabilistic terms we want to figure out {P(A|D)} and {P(B|D)}, the “probability that {A} is true given the data {D}” and the “probability that {B} is true given the data {D}.” We merely compare these two quantities for example by taking the quotient

\displaystyle \frac{P(A|D)}{P(B|D)}.

If the quotient is near {1}, then they are roughly equally good models. If the quotient is large, then {A} is a better hypothesis and if the quantity is close to {0}, then {B} is the better hypothesis.

Let’s take stock of our assumptions here. We do not assume Occam’s Razor (some people feel like OR is a pretty steep assumption), because it is not a priori clear that it is always the best principle to follow. But here we are merely making the assumption that comparing the probabilities that each model is a true model of the data we observe is a good test for selecting one model over another. It is kind of hard to argue against such a common sense assumption.

Now we use Bayes’ Theorem to convert these quantities to things we actually know about:

\displaystyle \frac{P(A|D)}{P(B|D)} = \frac{P(D|A)P(A)}{P(D|B)P(B)}

At this point we have some difficulty with the {B} hypothesis still, because implicitly we have assumed it relies on some extra parameter {\lambda}. To simplify the argument, we will assume that {\lambda} lies in some range (this isn’t unreasonable because in real life you should have some idea what order of magnitude etc this parameter should be): {\lambda_m \leq \lambda \leq \lambda_M}. We will make a less reasonable simplifying assumption and say that once this range is specified, we have a uniform chance of it being anything in the range, i.e.

\displaystyle P(\lambda |B) = \frac{1}{\lambda_M - \lambda_m}

for {\lambda} in the range and {0} otherwise. There will be an observed {\lambda_0} that maximizes the likelihood function (i.e. fits the data the best). Choose {\delta} so that {\lambda_0 \pm \delta} is an interval giving us reasonable certainty of the best {\lambda_0} (we could use the 95% HDI if we wanted to get the interval). Now let’s work out what is happening for {B}:

\displaystyle P(D|B) = \int P(D, \lambda|B)d\lambda = \int P(D|\lambda, B)P(\lambda |B)d\lambda

where we approximate the likelihood near its peak {\lambda_0} by a Gaussian of width {\delta} (this is exactly the role {\delta} plays):

\displaystyle \approx\frac{1}{\lambda_M - \lambda_m}\int P(D|\lambda_0, B)\exp\left(-\frac{(\lambda-\lambda_0)^2}{2\delta^2}\right)d\lambda

\displaystyle =\frac{\delta\sqrt{2\pi}P(D|\lambda_0, B)}{\lambda_M - \lambda_m}

Now we can plug this into our original comparison ratio and use the fact that both are equally good at explaining the data:

\displaystyle \frac{P(A|D)}{P(B|D)}=\frac{(\lambda_M-\lambda_m)P(D|A)}{\delta\sqrt{2\pi}P(D|\lambda_0, B)}

This gives us two main conclusions. The first is that if we assume our two models make roughly equivalent predictions on the data, i.e. {P(D|A)\approx P(D|\lambda_0, B)}, then we should prefer {A} because the possible range for {\lambda} giving a factor in the numerator will in general be quite a bit larger than {\delta}. This is exactly Occam’s Razor.
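To get a feel for the size of the penalty, note that when {P(D|A)\approx P(D|\lambda_0, B)} the ratio reduces to {(\lambda_M-\lambda_m)/(\delta\sqrt{2\pi})}. Here is a toy computation with completely made-up numbers:

import math

# Say lambda could plausibly range over [0, 10] a priori, but the data pins it
# down to lambda_0 plus or minus 0.5. (Both numbers are invented for illustration.)
lam_range = 10.0 - 0.0
delta = 0.5

occam_factor = lam_range / (delta * math.sqrt(2 * math.pi))
print(occam_factor)    # about 8: B must fit the data roughly 8x better to break even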

The possibly more interesting consequence is that we now know exactly how much this extra parameter is “penalizing” the theory. So given specific cases we can test whether or not that extra parameter is worth putting in. In other words, are the predictions significantly enough better with the extra parameter to overcome the penalty of introducing an extra complicated hypothesis? This abstract and vague notion from Occam’s Razor gets explicitly quantified in Bayesian analysis so that it is no longer vague or abstract and we can confidently apply Occam’s Razor when it is needed and avoid it when it isn’t.

Statistical Oddities Part 3

This oddity is really hard to get your head around if you’ve been doing standard null-hypothesis testing all your life. This oddity says that null hypothesis significance testing depends on the intentions of the experimenter.

What does this mean? Well, let’s go back to our worked example of flipping a coin and trying to determine whether or not it is biased based on the observed data. Recall that in our Bayesian analysis we take our data and our test for whether or not it was biased was determined by whether or not 0.5 was a reasonable guess given the posterior distribution. We didn’t need to know anything about the intentions of the person flipping the coin.

How does traditional (read: frequentist) null hypothesis testing work? We set up an experiment in which the experimenter flips the coin 100 times. If we observe 47 heads, then we calculate the probability of seeing a result at least that extreme given that the coin is fair. If that probability is below a certain threshold, then we say the coin is biased because it is extremely unlikely that we would observe that number by chance alone. Otherwise we do not reject the null hypothesis and say the coin is fair.

Unfortunately, our probability space depends on the number of total coin flips. The probability space is extremely different if the experimenter set up the experiment so that the number of flips was not predetermined and instead a coin was flipped as many times as possible for 5 minutes. The probability space in this case is much larger because some possible samples would have 90 flips and some would have 110 and so on.

It would also be radically different if the experimenter decided to flip the coin until they reached 47 heads. Then the probability space would again have all sorts of different possibilities for the number of flips. Maybe sometimes you would expect to do 150 flips before seeing 47 heads.
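To see the dependence in actual numbers, here is a sketch using small illustrative data (9 heads and 3 tails rather than the counts above). The same data crosses the usual 0.05 threshold under one stopping rule but not the other:

from math import comb

# Design A: flip exactly 12 times, observe 9 heads.
# One-sided p-value: chance of 9 or more heads from a fair coin.
p_fixed_flips = sum(comb(12, k) for k in range(9, 13)) / 2**12

# Design B: flip until the 3rd tail appears, observe 9 heads along the way.
# P(exactly h heads before the 3rd tail) = C(h+2, 2) * (1/2)^(h+3), so the
# one-sided p-value is the chance of 9 or more heads.
p_until_3_tails = 1 - sum(comb(h + 2, 2) * 0.5**(h + 3) for h in range(0, 9))

print(p_fixed_flips)      # about 0.073, not significant at 0.05
print(p_until_3_tails)    # about 0.033, "significant" at 0.05, same data!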

Just to reiterate, this isn’t a trivial matter. This says we need to know the intent of the experimenter if we want to do a legitimate null hypothesis significance test. If we don’t know how the experiment was designed, then we don’t know what our probability space should look like to know whether or not we should reject the null hypothesis.

To see why this is shocking just do the thought experiment where three labs flip the same coin. Each of the labs sets up the experiment in the three ways listed above. You get the exact same data from each of the labs. You could rig the numbers so that in some cases you decide the coin is fair and in others you decide that it is not fair. But they gave you the same exact data of 47 heads out of 100 flips (or whatever your thought experiment requires)! Let’s reiterate: They gave you the exact same data, but came to different conclusions about the fairness of the coin. How is this possible?

If we live in some sort of objective universe where we can do experiments and draw conclusions from them, then the results of an experiment should rely on the data and not on the subjective intentions of the experimenter. More bluntly, determining whether or not the coin is biased should not depend on what is happening in the coin flipper’s mind during the flipping.

This is a very real and dangerous statistical oddity if the person running the analysis isn’t aware of it. In fact, I dare say that this is one of the easy ways to massage data in the sciences to get “results” where none exist. To me, this is actually one of the strongest arguments for scientists to use Bayesian statistics rather than null hypothesis testing. As we saw in the linked post, Bayesian statistics gets around this issue and only needs the raw data and not the intentions of the experimenter.

By the way, before I get sued, I stole this example (with different numbers) from Doing Bayesian Data Analysis by John K. Kruschke. It is a really fantastic book to learn about this stuff.

Statistical Oddities Part 1

I’m going to come back to posting on L-series soon. For now I’m going to do a first post in a series on some funny business in statistics. These are mostly going to be little things to watch out for if you are new to statistical analysis. These will be well-known things, so experts probably won’t find much use in this series. On the other hand, if you’ve just had a single undergraduate course, then you probably missed these strange examples which cause danger for mis-analysis in the “real world.”

Our funny distribution today is the Cauchy distribution. To point out the funny business, let’s recall the Bayesian worked example. We flipped a coin and modelled it using the beta distribution. We wanted to determine whether or not it was biased. If we got {n} heads and {m} tails, then the maximum of the distribution happened at {\frac{n}{n+m}}.

Nice! The most likely value for our bias was exactly the mean. Now beta distributions can be a little skewed, so our 95% confidence interval wasn’t symmetric about the mean, but the mean is always in the confidence interval no matter what our threshold is or how skewed our posterior is. This feels right, and it turns out that basically every distribution you encounter in a first stats class has this property.

This property (that the mean is always a “good guess”) is essentially a formal consequence of the Central Limit Theorem. That’s the problem. To prove the CLT, our distribution has to satisfy some mild niceties. One of them is that the mean and variance are finite. It turns out that the Cauchy distribution does not satisfy this.

One scary thing is that the Cauchy distribution actually appears very naturally in lots of situations. It has two parameters, a location {\alpha} and a scale {\beta}:

\displaystyle P(x|\alpha, \beta)=\frac{\beta}{\pi(\beta^2+(x-\alpha)^2)}

and even worse it lends itself to a Bayesian analysis well, because the way we update the distribution as new data comes in gives us another Cauchy distribution.

Suppose we do an experiment: we collect photons from an unknown source and want to locate the {x} and {y} (i.e. {\alpha} and {\beta}) values of the source. This fits into a Cauchy distribution framework. In fact, Hanson and Wolf did some computer simulations using a Monte Carlo method to see what happens (the full paper is here). To simplify things, we assume that one of the values is known exactly.

The posterior distribution actually peaks extremely fast (its width shrinks roughly like {1/\sqrt{N}}, where {N} is the sample size). So after a reasonable amount of data we get an extremely high confidence in a very narrow range. We can say with near certainty exactly where the location is by using the posterior distribution.

So what happened with the mean? In the experiment with the most data, they found the actual location at {0.236} and the mean was {7.14\times 10^8}. So…it was off by probably worse than your wildest imagination could have guessed. On the other hand, the median was {0.256}.

The variance of the distribution is infinite, so the outliers throw the mean around a lot, but the median is actually protected against this. This goes to show that you cannot always assume the mean of the data is a reasonable guess! You actually have to do the Bayesian analysis and go to the posterior distribution to get the correct estimator.
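You can watch this happen with a few lines of simulation (my own quick check, not the Hanson and Wolf setup): draw from a Cauchy centered at {0} and compare the two estimators as the sample grows.

import numpy as np

# Samples from a standard Cauchy (location 0, scale 1).
np.random.seed(0)
samples = np.random.standard_cauchy(100000)

for n in [100, 1000, 10000, 100000]:
    print(n, np.mean(samples[:n]), np.median(samples[:n]))
# The median hugs 0 while the mean gets yanked around by occasional huge outliers.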

Bayesian Statistics Worked Example Part 2

Last time I decided my post was too long, so I cut some stuff out and now this post is fleshing those parts into their own post. Recall our setup. We perform an experiment of flipping a coin. Our data set consists of {a} heads and {b} tails. We want to run a Bayesian analysis to figure out whether or not the coin is biased. Our bias is a number between {0} and {1} which just indicates the expected proportion of times it will land on heads.

We found our situation was modeled by the beta distribution: {P(\theta |a,b)=\beta(a,b)}. I reiterate here a word of warning. ALL other sources will call this {B(a+1, b+1)}. I’ve just shifted by 1 for ease of notation. We saw last time that if our prior belief is that the probability distribution is {\beta(x,y)}, then our posterior belief should be {\beta(x+a, y+b)}. This simple “update rule” falls out purely from Bayes’ Theorem.

The main thing I didn’t explain last time was what exactly I meant by the phrase “we can say with 95% confidence that the true bias of the coin lies between {0.40} and {0.60}” or whatever the particular numbers are that we get from our data. What I had in mind for that phrase was something called the highest density interval (HDI). The 95% HDI just means that it is an interval for which the area under the distribution is {0.95} (i.e. an interval spanning 95% of the distribution) such that every point in the interval has a higher probability than any point outside of the interval (I apologize for such highly unprofessional pictures):

[Figure: bayes1, a correct 95% HDI]

(It doesn’t look like it, but that is supposed to be perfectly symmetrical.)

[Figure: a 95% interval that is not an HDI]

The first is the correct way to make the interval, because notice all points on the curve over the shaded region are higher up (i.e. more probable) than points on the curve not in the region. There are lots of 95% intervals that are not HDI’s. The second is such a non-example, because even though the area under the curve is 0.95, the big purple point is not in the interval but is higher up than some of the points off to the left which are included in the interval.

Lastly, we will say that a hypothesized bias {\theta_0} is credible if some small neighborhood of that value lies completely inside our 95% HDI. That small threshold is sometimes called the “region of practical equivalence (ROPE)” and is just a value we must set. If we set it to be 0.02, then we would say that the coin being fair is a credible hypothesis if the whole interval from 0.48 to 0.52 is inside the 95% HDI.

A note ahead of time: calculating the HDI for the beta distribution is actually kind of a mess because of the nature of the function. There is no closed form solution, so usually you can just look these things up in a table or approximate it somehow. Both the mean {\mu=\frac{a}{a+b}} and the standard deviation {\left(\frac{\mu(1-\mu)}{a+b+1}\right)^{1/2}} do have closed forms (these are for the standard, unshifted parametrization {B(a,b)}). Thus I’m going to approximate for the sake of this post using the “two standard deviations” rule that says that two standard deviations on either side of the mean is roughly 95%. Caution: if the distribution is highly skewed, for example {\beta(3,25)} or something, then this approximation will actually be way off.
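If you don’t want to rely on the two-standard-deviation rule, a brute-force numerical approximation is only a few lines. This is a sketch of my own; the 50 heads and 50 tails below are the example from last time, and it lands close to the 0.40-to-0.60 interval quoted there.

import numpy as np

# Brute-force 95% HDI for the (shifted) beta posterior with a heads and b tails.
def hdi_95(a, b, grid_size=100001):
    theta = np.linspace(0, 1, grid_size)
    density = theta**a * (1 - theta)**b    # unnormalized beta(a, b) in our notation
    density /= density.sum()
    # Keep grid points from most to least probable until we cover 95% of the mass.
    order = np.argsort(density)[::-1]
    cutoff = np.searchsorted(np.cumsum(density[order]), 0.95)
    kept = theta[order[:cutoff + 1]]
    return kept.min(), kept.max()

print(hdi_95(50, 50))    # roughly (0.40, 0.60)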

Let’s go back to the same examples from before and add in this new terminology to see how it works. Suppose we have absolutely no idea what the bias is and we make our prior belief {\beta(0,0)} the flat line. This says that we believe ahead of time that all biases are equally likely. Now we observe {3} heads and {1} tails. Bayesian analysis tells us that our new distribution is {\beta(3,1)}. The 95% HDI in this case is approximately 0.49 to 0.84. Thus we can say with 95% certainty that the true bias is in this region. Note that it is NOT a credible hypothesis off of this data to guess that the coin is fair because 0.48 is not in the HDI. This example really illustrates how choosing different thresholds can matter, because if we picked an interval of 0.01 rather than 0.02, then that guess would be credible!

Let’s see what happens if we use just an ever so slightly more reasonable prior. We’ll use {\beta(2,2)}. This gives us a starting assumption that the coin is probably fair, but it is still very open to whatever the data suggests. In this case our {3} heads and {1} tails tells us our posterior distribution is {\beta(5,3)}. In this case the 95% HDI is 0.45 to 0.75. Using the same data we get a little bit more narrow interval here, but more importantly we feel much more comfortable with the claim that the coin being fair is still a credible hypothesis.

This brings up a sort of “statistical uncertainty principle.” If we want a ton of certainty, then it forces our interval to get wider and wider. This makes intuitive sense, because if I want to give you a range that I’m 99.9999999% certain the true bias is in, then I better give you practically every possibility. If I want to pinpoint a precise spot for the bias, then I have to give up certainty (unless you’re in an extreme situation where the distribution is a really sharp spike or something). You’ll end up with something like: I can say with 1% certainty that the true bias is between 0.59999999 and 0.6000000001. We’ve locked onto a small range, but we’ve given up certainty. Note the similarity to the Heisenberg uncertainty principle which says the more precisely you know the momentum or position of a particle the less precisely you know the other.

Let’s wrap up by trying to pinpoint exactly where we needed to make choices for this statistical model. The most common objection to Bayesian models is that you can subjectively pick a prior to rig the model to get any answer you want. Hopefully this wrap up will show that in the abstract that objection is essentially correct, but in real life practice you cannot get away with this.

Step 1 was to write down the likelihood function {P(a,b|\theta)=\theta^a(1-\theta)^b} (normalized as a distribution in {\theta}, this is our {\beta(a,b)}). This was derived directly from the type of data we were collecting and was not a choice. Step 2 was to determine our prior distribution. This was a choice, but a constrained one. In real life statistics you will probably have a lot of prior information that will go into this choice. Recall that the prior encodes both what we believe is likely to be true and how confident we are in that belief. Suppose you make a model to predict who will win an election based off of polling data. You have previous year’s data and that collected data has been tested, so you know how accurate it was! Thus forming your prior based on this information is a well-informed choice. Just because a choice is involved here doesn’t mean you can arbitrarily pick any prior you want to get any conclusion you want.

I can’t reiterate this enough. In our example, if you pick a prior of {\beta(100,1)} with no reason to expect the coin to be biased, then we have every right to reject your model as useless. Your prior must be informed and must be justified. If you can’t justify your prior, then you probably don’t have a good model. The choice of prior is a feature, not a bug. If a Bayesian model turns out to be much more accurate than all other models, then it probably came from the fact that prior knowledge was not being ignored. It is frustrating to see opponents of Bayesian statistics use the “arbitrariness of the prior” as a failure when it is exactly the opposite (see the picture at the end of this post for a humorous illustration.)

The last step is to set a ROPE to determine whether or not a particular hypothesis is credible. This merely rules out considering something right on the edge of the 95% HDI from being a credible guess. Admittedly, this step really is pretty arbitrary, but every statistical model has this problem. It isn’t unique to Bayesian statistics, and it isn’t typically a problem in real life. If something is so close to being outside of your HDI, then you’ll probably want more data. For example, if you are a scientist, then you re-run the experiment or you honestly admit that it seems possible to go either way.

What is Bayesian Statistics: A basic worked example

I did a series on Bayes’ Theorem awhile ago and it gave us some nice heuristics on how a rational person ought to update their beliefs as new evidence comes in. The term “Bayesian statistics” gets thrown around a lot these days, so I thought I’d do a whole post just working through a single example in excruciating detail to show what is meant by this. If you understand this example, then you basically understand what Bayesian statistics is.

Problem: We run an experiment of flipping a coin {N} times and record a {1} every time it comes up heads and a {0} every time it comes up tails. This gives us a data set. Using this data set and Bayes’ theorem, we want to figure out whether or not the coin is biased and how confident we are in that assertion.

Let’s get some technical stuff out of the way. This is the least important part to fully understand for this post, but is kind of necessary. Define {\theta} to be the bias towards heads. This just means that if {\theta=0.5}, then the coin has no bias and is perfectly fair. If {\theta=1}, then the coin will never land on tails. If {\theta = 0.75}, then if we flip the coin a huge number of times we will see close to {3} out of every {4} flips lands on heads. For notation we’ll let {y} be the trait of whether or not it lands on heads or tails (so it is {0} or {1}).

We can encode this information mathematically by saying {P(y=1|\theta)=\theta}. In plain English: The probability that the coin lands on heads given that the bias towards heads is {\theta} is {\theta}. Likewise, {P(y=0|\theta)=1-\theta}. Let’s just chain a bunch of these coin flips together now. Let {a} be the event of seeing {a} heads when flipping the coin {N} times (I know, the double use of {a} is horrifying there but the abuse makes notation easier later).

Since coin flips are independent we just multiply probabilities and hence {P(a|\theta)=\theta^a(1-\theta)^{N-a}}. Rather than lug around the total number {N} and have that subtraction, normally people just let {b} be the number of tails and write {P(a,b |\theta)=\theta^a(1-\theta)^b}. Let’s just do a quick sanity check to make sure this seems right. Note that if {a,b\geq 1}, then as the bias goes to zero the probability goes to zero. This is expected because we observed a heads ({a\geq 1}), so it is highly unlikely to be totally biased towards tails. Likewise as {\theta} gets near {1} the probability goes to {0}, because we observed a tails.

The other special cases are when {a=0} or {b=0}, and in these cases we just recover that the probability of getting heads a times in a row if the probability of heads is {\theta} is {\theta^a}. Of course, the maximum of {\beta (a,b)} occurs at {a/(a+b)}, the proportion of the number of heads observed. Moving on, we haven’t quite thought of this in the correct way yet, because in our introductory problem we have a fixed data set that we want to analyze. So from now on we should think about {a} and {b} being fixed from the data we observed.

The idea now is that as {\theta} varies through {[0,1]} we have a distribution {P(a,b|\theta)}. What we want to do is multiply this by the constant that makes it integrate to {1} so we can think of it as a probability distribution. In fact, it has a name called the beta distribution (caution: the usual form is shifted from what I’m writing), so we’ll just write {\beta(a,b)} for this (the number we multiply by is the inverse of {B(a,b)=\int_0^1 \theta^a(1-\theta)^b d\theta} called the (shifted) beta function).
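If you want to generate pictures like the ones below yourself, here is a quick sketch (the head/tail counts are the ones discussed in the text):

import numpy as np
import pylab

# Normalize theta^a (1-theta)^b on a grid so it integrates to 1; dividing by the
# (shifted) beta function B(a,b) is exactly this normalization.
theta = np.linspace(0, 1, 1000)
for a, b in [(2, 8), (5, 5), (50, 50)]:
    density = theta**a * (1 - theta)**b
    density /= density.sum() * (theta[1] - theta[0])    # numerical stand-in for B(a, b)
    pylab.plot(theta, density, label="%d heads, %d tails" % (a, b))

pylab.xlabel("theta")
pylab.legend()
pylab.show()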

This might seem unnecessarily complicated to start thinking of this as a probability distribution in {\theta}, but it is actually exactly what we are looking for. Consider the following three examples:

[Figure: beta1, posterior distributions for a few small data sets]

The red one says if we observe {2} heads and {8} tails, then the probability that the coin has a bias towards tails is greater. The mean happens at {0.20}, but because we don’t have a lot of data there is still a pretty high probability of the true bias lying elsewhere. The middle one says if we observe 5 heads and 5 tails, then the most probable thing is that the bias is {0.5}, but again there is still a lot of room for error. If we do a ton of trials to get enough data to be more confident in our guess, then we see something like:

[Figure: beta2, the posterior after observing 50 heads and 50 tails]

Already at observing 50 heads and 50 tails we can say with 95% confidence that the true bias lies between 0.40 and 0.60. Alright, you might be objecting at this point that this is just usual statistics, where the heck is Bayes’ Theorem? You’d be right. Bayes’ Theorem comes in because we aren’t building our statistical model in a vacuum. We have prior beliefs about what the bias is.

Let’s just write down Bayes’ Theorem in this case. We want to know the probability of the bias {\theta} being some number given our observations in our data. We use the “continuous form” of Bayes’ Theorem:

\displaystyle P(\theta|a,b)=\frac{P(a,b|\theta)P(\theta)}{\int_0^1 P(a,b|\theta)P(\theta)d\theta}

I’m trying to give you a feel for Bayesian statistics, so I won’t work out the simplification of this in detail. Just note that the “posterior probability” (the left hand side of the equation), i.e. the distribution we get after taking into account our data, is the likelihood times our prior beliefs divided by the evidence. Now if you work everything out (with a uniform prior the denominator is just the definition of {B(a,b)}), it turns out to be another beta distribution!

If our prior belief is that the bias has distribution {\beta(x,y)}, then if our data has {a} heads and {b} tails we get {P(\theta|a,b)=\beta(a+x, b+y)}. The way we update our beliefs based on evidence in this model is incredibly simple. Now I want to sanity check that this makes sense again. Suppose we have absolutely no idea what the bias is and we make our prior belief {\beta(0,0)} the flat line. This says that we believe ahead of time that all biases are equally likely.

Now we observe {3} heads and {1} tails. Bayesian analysis tells us that our new (posterior probability) distribution is {\beta(3,1)}:

[Figure: beta3, the posterior after 3 heads and 1 tails with a flat prior]

Yikes! We don’t have a lot of certainty, but it looks like the bias is heavily towards heads. Danger: This is because we used a terrible prior. This is the real world so it isn’t reasonable to think that a bias of {0.99} is just as likely as {0.45}. Let’s see what happens if we use just an ever so slightly more modest prior. We’ll use {\beta(2,2)}. This puts our assumption on it being most likely close to {0.5}, but it is still very open to whatever the data suggests. In this case our {3} heads and {1} tails tells us our updated belief is {\beta(5,3)}:

[Figure: beta4, the posterior after 3 heads and 1 tails with a beta(2,2) prior]

Ah. Much better. We see a slight bias coming from the fact that we observed {3} heads and {1} tails and these can’t totally be ignored, but our prior belief tames how much we let this sway our new beliefs. This is what makes Bayesian statistics so great. If we have tons of prior evidence of a hypothesis, then observing a few outliers shouldn’t make us change our minds. On the other hand, the setup allows for us to change our minds even if we are 99% certain on something as long as sufficient evidence is given. This is the mantra: extraordinary claims require extraordinary evidence.

Not only would a ton of evidence be able to persuade us that the coin bias is {0.90}, but we should need a ton of evidence. This is part of the shortcomings of non-Bayesian analysis. It would be much easier to become convinced of such a bias if we didn’t have a lot of data and we accidentally sampled some outliers.

Anyway. Now you should have an idea of Bayesian statistics. In fact, if you understood this example, then most of the rest is just adding parameters and using other distributions, so you actually have a really good idea of what is meant by that term now.