This oddity is really hard to get your head around if you’ve been doing standard null-hypothesis testing all your life. This oddity says that null hypothesis significance testing *depends on the intentions of the experimenter*.

What does this mean? Well, let’s go back to our worked example of flipping a coin and trying to determine whether or not it is biased based on the observed data. Recall that in our Bayesian analysis we take our data and our test for whether or not it was biased was determined by whether or not 0.5 was a reasonable guess given the posterior distribution. We didn’t need to know anything about the intentions of the person flipping the coin.

How does traditional (re: frequentist) null hypothesis testing work? We set up an experiment in which the experimenter flips the coin 100 times. If we observe 47 heads, then we calculate the probability that this would happen given the coin is fair. If that probability is below a certain threshold, then we say the coin is biased because it is extremely unlikely that we would observe that number by chance alone. Otherwise we do not reject the null hypothesis and say the coin is fair.

Unfortunately, our probability space depends on the number of total coin flips. The probability space is extremely different if the experimenter set up the experiment so that the number of flips was not predetermined and instead a coin was flipped as many times as possible for 5 minutes. The probability space in this case is much larger because some possible samples would have 90 flips and some would have 110 and so on.

It would also be radically different if the experimenter decided to flip the coin until they reached 47 heads. Then the probability space would again have all sorts of different possibilities for the number of flips. Maybe sometimes you would expect to do 150 flips before seeing 47 heads.

Just to reiterate, this isn’t a trivial matter. This says we need to know the intent of the experimenter if we want to do a legitimate null hypothesis significance test. If we don’t know how the experiment was designed, then we don’t know what our probability space should look like to know whether or not we should reject the null hypothesis.

To see why this is shocking just do the thought experiment where three labs flip the same coin. Each of the labs sets up the experiment in the three ways listed above. You get the exact same data from each of the labs. You could rig the numbers so that in some cases you decide the coin is fair and in others you decide that it is not fair. But they gave you the same exact data of 47 heads out of 100 flips (or whatever your thought experiment requires)! Let’s reiterate: They gave you the exact same data, but came to different conclusions about the fairness of the coin. How is this possible?

If we live in some sort of objective universe where we can do experiments and draw conclusions from them, then the results of an experiment should rely on the data and not on the subjective intentions of the experimenter. More bluntly, determining whether or not the coin is biased should not depend on what is happening in the coin flipper’s mind during the flipping.

This is a very real and dangerous statistical oddity if the person running the analysis isn’t aware of it. In fact, I dare say that this is one of the easy ways to massage data in the sciences to get “results” where none exist. To me, this is actually one of the strongest arguments for scientists to use Bayesian statistics rather than null hypothesis testing. As we saw in the linked post, Bayesian statistics gets around this issue and only needs the raw data and not the intentions of the experimenter.

By the way, before I get sued, I stole this example (with different numbers) from *Doing Bayesian Data Analysis* by John K. Kruschke. It is a really fantastic book to learn about this stuff.