If you are a classically trained scientist, then you probably do an experiment, get some data, run it through SPSS (or something similar), and then look at the $p$-value to see whether or not the results are significant. The standard is that if $p < 0.05$, then you consider the result likely to be "real" and not just some random noise making a pattern.

Why is that? Well, here's how you define the $p$-value. Suppose your hypothesis is false, so that there is no real effect (this is the null hypothesis). What is the probability of seeing data at least as extreme as yours? That is the $p$-value. I hypothesize that my coin is biased. I do 200 flips. I calculate $p = 0.04$. Then there is only a 4% chance that I would see a split of heads and tails at least that lopsided if my coin were fair.

Danger! Frequently, people try to negate everything and say that "there is a 96% chance (or, less obviously wrong, they'll say we have 96% confidence) that the coin is biased." If you read the earlier posts in this series, then you should immediately see the error. The $p$-value is a probability about the data given a fair coin, not a probability about the coin given the data. There can’t be a 4% chance of the coin being fair, because no matter what our flips were (it could be 100 and 100 after a 200 flip trial), the probability of it being exactly fair is still $0$ (just compute the integral of the posterior distribution from $0.5$ to $0.5$). Yet you see this language all over scientific papers.
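To make the definition concrete, here is a minimal sketch of an exact two-sided $p$-value for a fair coin. The function name and the count of 115 heads are my own assumptions, picked only because 115 heads in 200 flips gives a value of roughly 0.04, like the 4% in the example.

```python
from math import comb


def two_sided_p_value(heads: int, flips: int) -> float:
    """Exact two-sided p-value under the null hypothesis of a fair coin."""
    center = flips / 2
    observed_gap = abs(heads - center)
    # Count every outcome at least as far from an even split as the one
    # observed. Under a fair coin each sequence has probability 1 / 2**flips.
    extreme = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - center) >= observed_gap
    )
    return extreme / 2 ** flips


# Hypothetical data: 115 heads out of 200 flips.
p = two_sided_p_value(115, 200)
print(f"p = {p:.3f}")
```

Notice that the whole calculation assumes a fair coin from start to finish. Nothing in it ever asks how probable a fair coin is, which is why the result cannot be flipped around into a statement about the coin.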

If we want to talk about confidence, then we have a way to do it, and it does not involve the $p$-value. I wouldn’t say this is a frequentist vs Bayesian thing, but I think the Bayesian analysis actually makes it harder to make this mistake. Recall that what we did there was use the language that being unbiased was a *credible hypothesis* given our 95% HDI. What we have confidence about is an interval. Maybe we have 95% confidence that the bias is in some interval containing $0.5$. In this case, the hypothesis of being unbiased is credible, but the hypothesis of the bias being any other value inside the same HDI is also credible.
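Here is a rough sketch of what the HDI language looks like in practice. It assumes a uniform prior on the bias, a simple grid approximation of the posterior, and a made-up result of 108 heads and 92 tails; none of these come from the example above, and the function names are just for illustration.

```python
from math import exp, log


def grid_posterior(heads, tails, points=2001):
    """Posterior over the coin's bias on an even grid, assuming a uniform prior."""
    thetas = [i / (points - 1) for i in range(1, points - 1)]  # skip 0 and 1
    log_like = [heads * log(t) + tails * log(1 - t) for t in thetas]
    peak = max(log_like)
    weights = [exp(v - peak) for v in log_like]  # rescale to avoid underflow
    total = sum(weights)
    return thetas, [w / total for w in weights]


def hdi(thetas, probs, mass=0.95):
    """End points of the smallest set of grid points holding `mass` of the posterior."""
    ranked = sorted(zip(probs, thetas), reverse=True)  # most probable first
    kept, running = [], 0.0
    for prob, theta in ranked:
        kept.append(theta)
        running += prob
        if running >= mass:
            break
    return min(kept), max(kept)


# Hypothetical data: 108 heads and 92 tails.
thetas, probs = grid_posterior(heads=108, tails=92)
low, high = hdi(thetas, probs)
print(f"95% HDI is roughly ({low:.3f}, {high:.3f})")
for candidate in (0.5, 0.55):
    verdict = "credible" if low <= candidate <= high else "not credible"
    print(f"bias {candidate}: {verdict}")
```

With data like this, both a bias of 0.5 and a bias of 0.55 land inside the interval, so the analysis never tempts you into saying the coin "is fair with 95% probability".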

Anyway, back to $p$-values. Since I’m using coin flipping as an example, you might think this is silly, but let’s ramp up our experiment. Suppose my lab works for a casino. We make sure coins are fair before passing them on to the casino (for their hot new coin flipping game?). I use a $p$-value threshold of $0.05$ as usual. After $100$ coins I expect that I’ve made a mistake on $5$ or fewer because of my $p$-value, right? This is the type of interpretation you see all the time! It is clearly wrong.

Suppose $10$ of them are biased due to manufacturing errors. Depending on the power of my test (I haven’t talked about power, but as you can imagine it depends on how many flips I use in my trials, among other things), maybe I find 8 of them (this would be a power of $0.8$, which isn’t unreasonable in science). Now recall our definition of the $p$-value. I also have a $5\%$ chance of incorrectly saying that any one of my unbiased coins is biased, which means about $5$ false alarms among the $90$ fair coins. This puts me at identifying $13$ biased coins, only $8$ of which are actually biased. Despite a $p$-value threshold of $0.05$, I actually only got $62\%$ of my guesses of bias correct (you could calculate this much more easily using Bayes’ theorem).
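For anyone who wants to see the Bayes’ theorem version, here is a small sketch. The numbers are assumptions for illustration: 100 coins, 10 of them biased, a power of 0.8, and a threshold of 0.05. The function name is mine.

```python
def prob_biased_given_flagged(base_rate, power, alpha):
    """Bayes' theorem: P(biased | test says biased)."""
    true_positive = power * base_rate          # biased and caught
    false_positive = alpha * (1 - base_rate)   # fair but flagged anyway
    return true_positive / (true_positive + false_positive)


coins = 100
base_rate = 0.10   # assumed fraction of coins that are really biased
power = 0.8        # assumed chance of catching a biased coin
alpha = 0.05       # p-value threshold

expected_caught = coins * base_rate * power               # biased coins found
expected_false_alarms = coins * (1 - base_rate) * alpha   # fair coins flagged
share_correct = prob_biased_given_flagged(base_rate, power, alpha)

print(f"biased coins caught: {expected_caught:.1f}")
print(f"fair coins flagged:  {expected_false_alarms:.1f}")
print(f"share of flags that are right: {share_correct:.0%}")
```

The formula works with the expected 4.5 false alarms rather than a whole number of coins, so it gives 8 out of 12.5. Rounding the false alarms up to 5 gives 8 out of 13, or about 62%.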

The above scenario is extremely common in some medical science labs where it matters. Suppose you test drugs to see if they work. Your test has $0.8$ power and you use a $p$-value threshold of $0.05$ as you’ve been taught. You send $13$ to drug manufacturers claiming they work. You think that you are wrong only $5\%$ of the time, but in reality, after you’ve tested $100$ drugs, $5$ out of the $13$ drugs you send don’t work! This is extremely dangerous. Of course, these should be weeded out in secondary trials, but who has the time or money to do that? If we think we have $95\%$ confidence that a drug works, then we may as well send it out to help people and only do our repeat experiment while it is on the market.
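If you don’t trust the arithmetic, you can simulate a lab like this. The sketch below is a toy model with assumed parameters (10% of candidate drugs really work, power 0.8, threshold 0.05) and a made-up function name. It screens a large number of drugs so that the long-run fraction is visible.

```python
import random


def simulate_screening(n_drugs, base_rate, power, alpha, seed=0):
    """Screen n_drugs and return (number shipped, number shipped that don't work)."""
    rng = random.Random(seed)
    shipped = 0
    shipped_duds = 0
    for _ in range(n_drugs):
        works = rng.random() < base_rate
        # A working drug is detected with probability `power`; a dud passes
        # the test as a false positive with probability `alpha`.
        flagged = rng.random() < (power if works else alpha)
        if flagged:
            shipped += 1
            if not works:
                shipped_duds += 1
    return shipped, shipped_duds


shipped, duds = simulate_screening(100_000, base_rate=0.10, power=0.8, alpha=0.05)
print(f"shipped {shipped} drugs, {duds} of which don't work "
      f"({duds / shipped:.0%} of what went out the door)")
```

In this toy model the fraction of shipped drugs that don’t work settles well above a third, nowhere near the 5% you might expect from the threshold alone.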

Ignoring the real interpretation of the $p$-value in favor of the more optimistic one is so common it has a name: *the base rate fallacy*. The name fits because the high number of false positives comes from the base rate of a drug working (or a coin being biased) being so low that false positives make up a large share of your positive results, even with a high power test and a small $p$-value. I know this type of thing has been posted on the internet all over the place, but I hadn’t done it yet and it seemed to fit in with the statistical oddities series. For the record, the example scenario above was taken from *Statistics Done Wrong* by Alex Reinhart.
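To see how much the base rate matters, here is one last sketch that holds the test fixed and varies only the base rate. The power of 0.8, the threshold of 0.05, and the three base rates are assumed values chosen for illustration.

```python
def false_discovery_rate(base_rate, power=0.8, alpha=0.05):
    """Fraction of positive results that are false positives."""
    true_positive = power * base_rate
    false_positive = alpha * (1 - base_rate)
    return false_positive / (true_positive + false_positive)


# Same test every time; only the fraction of real effects changes.
for base_rate in (0.5, 0.1, 0.01):
    rate = false_discovery_rate(base_rate)
    print(f"base rate {base_rate:>4}: {rate:.0%} of positives are false")
```

The test never changes, but as real effects get rarer the false positives take over the pile of "significant" results.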
