Our next decision theory post is going to be on how to rephrase hypothesis testing in terms of Bayesian decision theory. We already saw in our last statistical oddities post that p-values can cause some problems if you are not careful. This oddity makes the situation even worse. We’ll show that if you use a classical null hypothesis significance test (NHST), even at p = 0.05, and your experimental design is to check significance after each new observation is added to the sample, then as the sample size increases, you will falsely reject the hypothesis more and more.
I’ll reiterate that this is more of an experimental design flaw than a statistical problem, so a careful statistician will not run into the problem. On the other hand, lots of scientists are not careful statisticians and do make these mistakes. These mistakes don’t exist in the Bayesian framework (advertisement for the next post). I also want to reiterate that the oddity is not that you sometimes falsely reject hypotheses (this is obviously going to happen, since we are dealing with a degree of randomness). The oddity is that as the sample size grows, your false rejection rate will tend to 100%! Usually people think that a higher sample size will protect them, but in this case it exacerbates the problem.
To avoid offending people, let’s assume you are a freshman in college and you go to your very first physics lab. Of course, it will be to let a ball drop. You measure how long it takes to drop at various heights. You want to determine whether or not the acceleration due to gravity is really 9.8. You took a statistics class in high school, so you recall that you can run a NHST at the p = 0.05 level and impress your professor with this knowledge. Unfortunately, you haven’t quite grasped experimental methodology, so you rerun your NHST after each trial of dropping the ball.
When you see p < 0.05 you get excited because you can safely reject the hypothesis! This happens and you turn in a lab write-up claiming that with greater than 95% certainty the true acceleration due to gravity is NOT 9.8. Let's make the nicest assumptions possible and see that it was still likely for you to reach that conclusion. Assume g = 9.8 exactly. Also, assume that your measurements are pretty good and hence form a normal distribution with mean 9.8. I wrote the following code to simulate exactly that:
```python
import random

import pylab
from scipy import stats


# Generate one observation from a normal distribution (mean 9.8, sd 1)
def norm():
    return random.normalvariate(9.8, 1)


# Run the experiment; return 1 if it falsely rejects and 0 otherwise
def experiment(num_samples, p_val):
    x = []
    # One by one we append an observation to our list
    for i in range(num_samples):
        x.append(norm())
        # Run a t-test at p_val significance to see if we reject the hypothesis.
        # With a single observation the p-value is nan, so nothing is rejected.
        t, p = stats.ttest_1samp(x, 9.8)
        if p < p_val:
            return 1
    return 0


# Check the proportion of falsely rejecting at various sample sizes
rej_proportion = []
for j in range(10):
    f_rej = 0
    for i in range(5000):
        f_rej += experiment(10 * j + 1, 0.05)
    rej_proportion.append(f_rej / 5000)

# Plot the results
axis = [10 * j + 1 for j in range(10)]
pylab.plot(axis, rej_proportion)
pylab.title('Proportion of Falsely Rejecting the Hypothesis')
pylab.xlabel('Sample Size')
pylab.ylabel('Proportion')
pylab.show()
```
What is this producing? On the first run of the experiment, what is the probability that you reject the null hypothesis? Basically 0, because a single observation gives no estimate of the variance, so the test knows that this isn't enough data to make a firm conclusion. If you run the experiment 10 times, what is the probability that at some point you reject the null hypothesis? It has gone up a bit. On and on this goes up to 100 trials, where you have nearly a 40% chance of rejecting the null hypothesis using this method. This should make you uncomfortable, because this is ideal data where the mean really is 9.8 exactly! This isn't coming from imprecise measurements or something.
The trend will actually continue, but the simulation reruns the t-test on the whole sample after every new observation, so it was already taking a while to run and I cut it off. As you accumulate more and more experiments, you will be more and more likely to reject the hypothesis:
Actually, if you think about this carefully it isn't so surprising. The fault is that you recheck whether or not to reject after each sample. Recall that the p-value tells you how likely it is to see results at least this extreme by random chance supposing the hypothesis is true. But the threshold is not 0, which means with enough trials you'll get the wrong thing. If you have a sample size of 100 and you recheck your NHST after each sample is added, then you give yourself 100 chances to see this randomness manifest rather than checking once with all 100 data points. As your sample size increases, you give yourself more and more chances to see the randomness, and hence as your sample goes to infinity your probability of falsely rejecting the hypothesis tends to 1.
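To make the "100 chances" point concrete, here is a minimal sketch that compares checking after every observation against checking once at the end. It is not the simulation from the post: to stay within the standard library it swaps the t-test for a two-sided z-test with a known standard deviation of 1, and the names `z_test_p_value`, `run_experiment` and `false_rejection_rates` are made up for illustration. For scale, if the 100 checks were independent the chance of at least one false rejection would be 1 - 0.95^100, about 99.4%; successive checks share almost all of their data, so they are strongly correlated and the simulated rate comes out well below that bound.

```python
import random
from statistics import NormalDist

TRUE_MEAN = 9.8   # the null hypothesis is exactly true
SIGMA = 1.0       # assumed known, which is what allows a z-test here
ALPHA = 0.05


def z_test_p_value(total, n):
    """Two-sided p-value for H0: mean == TRUE_MEAN, given the sum of n observations."""
    z = (total / n - TRUE_MEAN) / (SIGMA / n ** 0.5)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def run_experiment(num_samples, rng):
    """Return (rejected at some peek, rejected by the single final test)."""
    total = 0.0
    rejected_at_some_peek = False
    p = 1.0
    for n in range(1, num_samples + 1):
        total += rng.gauss(TRUE_MEAN, SIGMA)
        p = z_test_p_value(total, n)
        if p < ALPHA:
            rejected_at_some_peek = True
    # After the loop, p is the p-value of the test on all num_samples points
    return rejected_at_some_peek, p < ALPHA


def false_rejection_rates(num_samples=100, num_experiments=2000, seed=0):
    """Estimate both false rejection rates over many simulated experiments."""
    rng = random.Random(seed)
    peek_count = 0
    end_count = 0
    for _ in range(num_experiments):
        peeked, at_end = run_experiment(num_samples, rng)
        peek_count += peeked
        end_count += at_end
    return peek_count / num_experiments, end_count / num_experiments


peek_rate, end_rate = false_rejection_rates()
print("rejected at some peek:     ", peek_rate)
print("rejected by one final test:", end_rate)
```

The single final test should reject at roughly the nominal 5% rate, while the peeking rate is several times larger, even though both are computed from exactly the same simulated data.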
We can modify the above code to just track the p-value over a single 1000-sample experiment (the word "trial" in the title was meant to indicate dropping a ball in the physics experiment). This shows that if you cut your experiment off almost anywhere and run your NHST, then you would not reject the hypothesis. It is only because you incorrectly tracked the p-value until it dipped below 0.05 that a mistake was made:
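The modified code is not shown in the post, so the following is only a sketch of what such a modification might look like. It keeps the same `stats.ttest_1samp` call as the simulation above; the function name `p_value_path`, its parameters and the fixed seed are assumptions made for illustration.

```python
import random

from scipy import stats


def p_value_path(num_samples=1000, true_mean=9.8, sigma=1.0, seed=0):
    """Return the t-test p-value after each new observation (from the 2nd on)."""
    rng = random.Random(seed)  # hypothetical fixed seed, for repeatability only
    x = []
    p_values = []
    for _ in range(num_samples):
        x.append(rng.normalvariate(true_mean, sigma))
        if len(x) < 2:
            continue  # the t-test is undefined for a single observation
        _, p = stats.ttest_1samp(x, true_mean)
        p_values.append(p)
    return p_values


path = p_value_path()
below = sum(p < 0.05 for p in path)
print(len(path), "tests run;", below, "of them had p < 0.05")
```

Plotting `path` against the sample sizes 2 through 1000 (for example with `pylab.plot(range(2, 1001), path)`) gives a curve of the kind described above. How often and where it dips below 0.05 depends on the random seed.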