In the world of big data that constantly bombards us with fancy graphics, the statistical fallacy that I think we are most likely to fall for is called the Texas Sharpshooter Fallacy. What makes this fallacy so dangerous is that it is propped up by solid, correct statistics which can be hard to argue against.
Here’s the idea. A person goes into the yard and shoots their rifle at random at their barn. Maybe even say the person is drunk, so the holes have no underlying pattern to them. The person then goes to the barn and figures out a way to draw a bullseye after the fact that makes it look like they are a competent sharpshooter.
The fallacy is that if you look at a large enough amount of data with good enough visualization tools, you will probably start to find patterns that aren’t actually there by strategically drawing artificial boundaries. Let’s make the example a bit more real.
Suppose you want to better understand the causes of Disease X, a newly discovered disease that occurs naturally in 10% of the population. You plot the data for a nearby town of 10,000 people to see if you can find a pattern.
Here is the plot (I used a uniform distribution so we know any clumps have no underlying cause):
Your eye gets drawn to an oddly dense clump of cases of Disease X. You circle it and then run a statistical test to see if the number of cases is significant. You’re shocked! Your properly run statistical test shows that the increased number of cases is significant at the 95% confidence level, so you conclude it isn’t just a fluke.
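To see how easily a purely uniform town produces a "significant" clump, here is a minimal Python sketch. It is not the code behind the plot above: the 10-by-10 grid, the seed, and the helper names (`binom_sf`, `simulate_town`, `most_suspicious_cell`) are illustrative assumptions, and the test is an exact one-sided binomial test against the 10% base rate.

```python
import math
import random


def binom_sf(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def simulate_town(n_people=10_000, rate=0.10, grid=10, seed=0):
    """Scatter people uniformly over a unit square; each is a case with probability `rate`."""
    rng = random.Random(seed)
    totals = [[0] * grid for _ in range(grid)]
    cases = [[0] * grid for _ in range(grid)]
    for _ in range(n_people):
        i = min(int(rng.random() * grid), grid - 1)
        j = min(int(rng.random() * grid), grid - 1)
        totals[i][j] += 1
        if rng.random() < rate:
            cases[i][j] += 1
    return totals, cases


def most_suspicious_cell(totals, cases, rate=0.10):
    """Return (p_value, row, col, cases, residents) for the cell with the smallest p-value."""
    best = None
    for i, row in enumerate(totals):
        for j, n in enumerate(row):
            k = cases[i][j]
            p_value = binom_sf(k, n, rate)
            if best is None or p_value < best[0]:
                best = (p_value, i, j, k, n)
    return best


if __name__ == "__main__":
    totals, cases = simulate_town()
    p_value, i, j, k, n = most_suspicious_cell(totals, cases)
    print(f"cell ({i}, {j}): {k} cases among {n} residents, naive p = {p_value:.4f}")
```

The point of the sketch is the order of operations: the cell is chosen because it looks dense, and only then tested as if it had been picked in advance. Try a few different seeds and see how often the "worst" cell clears the 5% bar even though nothing is going on.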
So what do you do? You start looking for causes. Of course you’ll be able to find one. Maybe that clump of houses has a power station nearby, or they drink from the same well water source or whatever. When you are looking for something in common, you’ll be able to find it.
When this happens, you’ve committed the Texas Sharpshooter Fallacy. It might be okay to use this data exploration to look for a cause if you merely intend to turn it into a hypothesis to be tested. So you hypothesize that it is radon in the water that caused the spike of cases in that cluster.
Now do real science: run a randomized controlled study to actually test your hypothesis. Doing statistics on big data is risky business, because any clever person can construct correlations from a large enough data set that, first, may not actually be there and, second, are almost surely not causal.
Another way to think about why this is a fallacy is that when you test at the 95% confidence level, about 5 out of every 100 tests will falsely find a correlation where none exists. So if your data set is large enough to draw 100 different boundaries, then by random chance you should expect about 5 of them to show false correlations. When you allow your eye to catch the cluster, it is your brain being good at finding patterns. It probably rejected 100 non-clusters to find that one.
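The arithmetic behind that paragraph can be written out in a few lines of Python. This is a sketch that assumes the 100 tests are independent, which hand-drawn boundaries on the same map generally are not; the function names are made up for illustration, and the Bonferroni threshold is one common correction I am adding for comparison, not something the argument above depends on.

```python
def expected_false_positives(alpha, m):
    """Average number of false alarms across m tests when nothing is really there."""
    return alpha * m


def chance_of_any_false_positive(alpha, m):
    """Probability that at least one of m independent tests is a false alarm."""
    return 1 - (1 - alpha) ** m


def bonferroni_threshold(alpha, m):
    """Per-test significance level that keeps the overall false-alarm rate at or below alpha."""
    return alpha / m


if __name__ == "__main__":
    alpha, m = 0.05, 100
    print(expected_false_positives(alpha, m))
    print(chance_of_any_false_positive(alpha, m))
    print(bonferroni_threshold(alpha, m))
```

With 100 boundaries at the 5% level, the chance of at least one false alarm is 1 - 0.95^100, or about 99.4%. Finding a "significant" cluster somewhere on the map is close to guaranteed.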
This is scary in today’s world, because lots of news articles do exactly this. They claim some crazy thing, and they use statistics people don’t understand to “prove” its legitimacy (numbers can’t lie, don’t you know). But really it is just this fallacy at work. The media don’t want to double-check it because “Cancer rate five times higher near power station” is going to get a lot of hits and interest.
Actually, cancer is particularly susceptible to this type of fallacy. Dozens of examples of such studies getting publicity despite no actual correlation (let alone causation!) are documented in George Johnson’s (excellent) The Cancer Chronicles and in an older New Yorker article called “The Cancer-Cluster Myth.”
So the next time you read about one of these public health outcries, pay careful attention to whether the article commits this fallacy. For example, the myth that vaccination causes autism also originated this way.
Probably the most egregious example is The China Study, a highly praised vegan propaganda book. It takes the largest diet study ever done (367 variables) and pulls out the correlations that support the hypothesis “meat is bad.”
What the book doesn’t tell you is that the study found over 8000 statistically significant correlations, many contradicting the ones presented in the book. This is why large studies of observational epidemiology always have to be treated with caution. The larger the study, the more likely you will be able to find a way to support your hypothesis.
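To get a feel for why more variables means more "findings," here is a small Python sketch that correlates columns of pure noise with each other. Everything in it is an assumption for illustration (40 variables, 50 observations each, the helper names), and it has nothing to do with the actual study data. The cutoff `r_crit=0.279` is approximately the two-sided 5% critical value of Pearson's r for 50 observations.

```python
import itertools
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def count_spurious_correlations(n_vars=40, n_obs=50, r_crit=0.279, seed=0):
    """Generate independent noise variables and count pairs with |r| above r_crit."""
    rng = random.Random(seed)
    data = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    pairs = list(itertools.combinations(range(n_vars), 2))
    hits = sum(1 for a, b in pairs if abs(pearson_r(data[a], data[b])) > r_crit)
    return hits, len(pairs)


if __name__ == "__main__":
    hits, pairs = count_spurious_correlations()
    print(f"{hits} of {pairs} pairs of pure noise look 'significant'")
```

Forty variables already give 780 pairs to choose from. If every pair of 367 variables were compared, there would be 67,161 of them (`math.comb(367, 2)`), so a few thousand chance hits at the 5% level would be no surprise.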
If you don’t believe me, and you want to protect marriage in Maine, then make sure you eat less margarine this year: