Decision Theory 4: Hypothesis Testing

Now we return to decision theory. It is also a return to the thing that first made me interested in learning something about statistics a year or two ago. I had heard about John Ioannidis’ shocking article “Why Most Published Research Findings Are False” and started to investigate. To me, statistics was some settled thing that you hit your data with after doing an experiment. It told you whether or not your findings were real and how confident you could be in them.

Moreover, I believed that as long as you followed the prescriptions taught to you, you couldn’t mess it up. It was foolproof. Just look around and try to find one thing that science hasn’t touched. The scientific method has clearly led us to discover something about the world. That’s why stats seemed like an uninteresting part of math. People seemed to have figured it all out. Then I was shocked to find that article. I started to learn about all the fallacies and methodological problems that I’ve been pointing out over the past few months.

One of the main difficulties, particularly in science, is classical null hypothesis significance testing (NHST). One way to try to mitigate these difficulties is to rephrase our hypothesis test as a Bayesian decision theory problem. This is not the only Bayesian reformulation (Kruschke’s MCMC stuff is pretty cool, and I might get to it someday), but it fits in as a nice example of the use of decision theory outside of the silly gambling problems I’ve been using.

Let’s start by seeing how to test a point null hypothesis. Think about the biased coin example. We want to test ${\theta=1/2}$, i.e. is the coin unbiased? This is obviously a ridiculous type of hypothesis test, because the term “unbiased” in real life encompasses a range ${(1/2-\varepsilon, 1/2+\varepsilon)}$ where we can’t tell the difference. This is actually the case in most scientific situations as well (there is only so much precision your instruments can achieve), and often scientists incorrectly use a point NHST when there should be a ROPE (region of practical equivalence).

Our first step is to take the previous paragraph’s discussion and cheat a little. Suppose we want to test ${\theta = \theta_0}$. The Bayesian way to do this would work out of the box using a ROPE. Unfortunately, if we want continuous densities for the probabilities, then we will always reject our null hypothesis. This is because a point has probability zero. The cheat is to just convert the continuous prior, ${\pi(\theta)}$, to a piecewise defined prior where we assign a point mass of probability

$\displaystyle \pi_0 = \displaystyle \int_{\theta_0-\varepsilon}^{\theta_0+\varepsilon} \pi(\theta)d\theta$

to ${\theta_0}$ and the renormalized old prior otherwise. This is merely saying that we make a starting assumption that ${\theta}$ has true value ${\theta_0}$ with probability ${\pi_0}$, and hence no actual integral needs to be calculated. The integral is only there as an intuitive justification for the shape of ${\pi}$. If this makes you uncomfortable, then use the uninformed prior in which ${\theta=\theta_0}$ has probability ${1/2}$ and the alternative is a uniform distribution carrying the remaining mass of ${1/2}$.
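If you do want to see what that integral gives for a concrete prior, here is a minimal sketch for the coin example. The function names, the choice ${\varepsilon=0.05}$, and the Beta(2,2) prior are assumptions made purely for illustration; nothing later in the post depends on them.

```python
def point_mass_weight(prior_density, theta0, eps, steps=1000):
    """Approximate pi_0, the prior mass on (theta0 - eps, theta0 + eps),
    with a simple midpoint rule."""
    a = theta0 - eps
    h = 2.0 * eps / steps
    return sum(prior_density(a + (i + 0.5) * h) for i in range(steps)) * h


def uniform_density(theta):
    # Flat prior on [0, 1] (assumed for illustration).
    return 1.0 if 0.0 <= theta <= 1.0 else 0.0


def beta22_density(theta):
    # Beta(2, 2) prior on [0, 1], density 6 * theta * (1 - theta) (assumed).
    return 6.0 * theta * (1.0 - theta) if 0.0 <= theta <= 1.0 else 0.0


if __name__ == "__main__":
    print(point_mass_weight(uniform_density, 0.5, 0.05))  # about 0.1, i.e. 2 * eps
    print(point_mass_weight(beta22_density, 0.5, 0.05))   # about 0.1495
```

The point is only that a prior which already concentrates near ${1/2}$ puts more starting weight ${\pi_0}$ on the null than a flat one does.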

Let’s recap what we are trying to do. We have two hypotheses: the null, ${H_0: \theta=\theta_0}$, and the alternative, ${H_1: \theta\neq \theta_0}$. This type of NHST came up in the last post where we wanted to experimentally test whether or not the acceleration due to gravity was ${g=9.8}$. Our process should be clear if you’ve been following this sequence of posts. We just use our data to calculate the posterior probabilities ${P(H_0|x)}$ and ${P(H_1|x)}$. We must decide between the two hypotheses by seeing which choice has less risk (and that risk will come from a loss function which appropriately penalizes falsely accepting/rejecting each one).

This approach is really nice, because depending on your situation you will want to penalize differently. If you are testing a drug for effectiveness, then it is better to harshly penalize falsely claiming a placebo to be effective (a false positive or Type I error). If you are testing whether or not someone has a fatal disease, then you want to harshly penalize falsely claiming they have it and having them undergo dangerous and expensive unnecessary treatments. Maybe these aren’t the best examples, but you see how having a flexible system could be a lot more useful than blindly running a ${p=0.05}$ NHST.
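To make the “penalize differently” idea concrete, here is a minimal sketch of the decision step, assuming the simplest possible loss function: a correct decision costs nothing, and each kind of wrong decision has its own fixed cost. The function name, the two cost parameters, and the posterior value of 0.2 are all hypothetical and chosen only for illustration.

```python
def bayes_decision(p_h0, loss_false_reject, loss_false_accept):
    """Choose the action with the smaller posterior expected loss.

    p_h0              -- posterior probability P(H0 | x)
    loss_false_reject -- cost of rejecting H0 when it is true (Type I error)
    loss_false_accept -- cost of accepting H0 when H1 is true (Type II error)
    """
    p_h1 = 1.0 - p_h0
    risk_reject = loss_false_reject * p_h0  # we reject, but H0 was true
    risk_accept = loss_false_accept * p_h1  # we accept, but H1 was true
    return "reject H0" if risk_reject < risk_accept else "accept H0"


if __name__ == "__main__":
    # Same (made up) posterior, two different loss functions.
    print(bayes_decision(0.2, 1.0, 1.0))  # reject H0
    print(bayes_decision(0.2, 5.0, 1.0))  # accept H0
```

With equal costs we simply pick the more probable hypothesis. Once a false rejection is five times as costly, the same posterior is no longer enough evidence to reject.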

Rather than going through some made up example from fake randomly generated data as I’ve been doing, let’s examine some differences at the theoretical level when we assume everything is normal. Suppose our data is a sample of ${n}$ points from a normal distribution. Any book on Bayesian statistics will have the details on working this out, so I’ll get to the punch line.

If we denote by ${m(x)}$ the marginal density of the data, then the posterior probability of ${H_0}$ is given by

$\displaystyle \frac{f(x|\theta_0)\pi_0}{m(x)}.$
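It can help to unpack ${m(x)}$ using the two-piece prior from before. As a sketch in notation introduced just for this step (${g_1}$ for the renormalized continuous part of the prior and ${m_1}$ for the marginal density under ${H_1}$), we have

$\displaystyle m(x) = \pi_0 f(x|\theta_0) + (1-\pi_0)\, m_1(x), \qquad m_1(x) = \int_{\theta\neq\theta_0} f(x|\theta)\, g_1(\theta)\, d\theta.$

Dividing the numerator and denominator of the posterior by ${\pi_0 f(x|\theta_0)}$ then gives

$\displaystyle P(H_0|x) = \left(1+\frac{1-\pi_0}{\pi_0}\cdot \frac{m_1(x)}{f(x|\theta_0)}\right)^{-1},$

so everything comes down to the ratio ${m_1(x)/f(x|\theta_0)}$, which compares how well each hypothesis predicted the data.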

In the normal case (we assume the data has standard deviation ${\sigma}$, and the prior on ${\theta}$ under ${H_1}$ is normal with standard deviation ${\tau}$ and mean ${\theta_0}$) we get something much more specific:

$\displaystyle \left(1+\frac{(1-\pi_0)}{\pi_0}\cdot \frac{\exp\left(\frac{1}{2}z^2[1+\sigma^2/(n\tau^2)]^{-1}\right)}{(1+n\tau^2/\sigma^2)^{1/2}}\right)^{-1}$

where ${z=\frac{\sqrt{n}|\overline{x}-\theta_0|}{\sigma}}$. This ${z}$ is the same statistic that appears in classical NHST. Let’s look at the differences. For the purpose of getting some numbers down, let’s assume ${\pi_0=1/2}$ and ${\sigma=\tau}$. In a two-tailed test, let’s assume that we observe a ${p=0.01}$ and hence would very, very strongly and confidently reject ${H_0}$. This corresponds to a ${z}$-value of ${2.576}$. In this case if ${n}$ is small, i.e. in the 10-100 range, then the posterior probability of ${H_0}$ is around ${0.14}$ to ${0.27}$. This means that we would likely want to reject ${H_0}$, because it is quite a bit less likely than ${H_1}$ (this will of course depend on the specifics of our loss function).

Shockingly, if ${n}$ is large, we lose a lot of confidence. If ${n=1000}$, then the posterior for ${H_0}$ is ${0.53}$. Whoops. The Bayesian approach says that ${H_0}$ is actually more likely to be true than ${H_1}$, while our NHST tells us to reject at the ${p=0.01}$ level (i.e. data this extreme would show up only 1% of the time if ${H_0}$ were true, which is all too often misread as a 99% chance that the result that causes us to reject ${H_0}$ is real).
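These numbers are easy to check. Below is a minimal sketch that evaluates the posterior formula above for a fixed ${z}$ and several sample sizes, using only the Python standard library. The function name and its default arguments (${\pi_0=1/2}$ and ${\sigma=\tau=1}$) are my own choices for illustration; only the ratio ${\tau/\sigma}$ matters in the formula.

```python
import math
from statistics import NormalDist


def posterior_h0(z, n, pi0=0.5, sigma=1.0, tau=1.0):
    """Posterior probability of H0 for normal data (std dev sigma) and a
    normal prior under H1 (mean theta_0, std dev tau), given the z statistic."""
    r = n * tau**2 / sigma**2
    # Ratio m_1(x) / f(x | theta_0), i.e. the evidence for H1 relative to H0.
    bf_h1_over_h0 = math.exp(0.5 * z**2 / (1.0 + 1.0 / r)) / math.sqrt(1.0 + r)
    return 1.0 / (1.0 + (1.0 - pi0) / pi0 * bf_h1_over_h0)


if __name__ == "__main__":
    # Two-tailed p = 0.01 corresponds to z = 2.576 (rounded to three places).
    z = round(NormalDist().inv_cdf(1 - 0.01 / 2), 3)
    for n in (10, 100, 1000):
        print(n, round(posterior_h0(z, n), 2))
    # 10 0.14
    # 100 0.27
    # 1000 0.53
```

Holding ${z}$ fixed while ${n}$ grows means the observed deviation ${|\overline{x}-\theta_0|}$ is shrinking like ${1/\sqrt{n}}$, which is why the point null looks better and better compared to a spread-out alternative.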

As we see, by working with the Bayesian framework, we get posterior probabilities for how likely ${H_0}$ and ${H_1}$ are given our observations of the data. This allows us to do a suitable analysis. The classical framework feels very limited, because even when we get extreme ${p}$-values that give us lots of confidence, we could accidentally be overlooking something that would be obvious if we worked directly with how likely each is to be true.

To end this post, I’ll just reiterate that careful scientists are completely aware of the fact that a ${p}$-value is not to be interpreted as a probability against ${H_0}$. One can certainly apply classical methods and end with a solid analysis. On the other hand, misreading ${p}$-values this way is quite a widespread sloppiness, or, less generously, a widespread misunderstanding of what is going on.