I’m going to come back to posting on L-series soon. For now I’m going to do a first post in a series on some funny business in statistics. These are mostly going to be little things to watch out for if you are new to statistical analysis. These will be well-known things, so experts probably won’t find much use in this series. On the other hand, if you’ve just had a single undergraduate course, then you probably missed these strange examples which cause danger for mis-analysis in the “real world.”
Our funny distribution today is the Cauchy distribution. To point out the funny business, let’s recall the Bayesian worked example. We flipped a coin and modelled it using the beta distribution. We wanted to determine whether or not it was biased. If we got heads and tails, then the maximum of the distribution happened at .
Nice! The most likely value for our bias was exactly the mean. Now beta distributions can be a little skewed, so our 95% confidence interval wasn’t symmetric about the mean, but the mean is always in the confidence interval no matter what our threshold is or how skewed our posterior is. This feels right, and it turns out that basically every distribution you encounter in a first stats class has this property.
This property (that the mean is always a “good guess”) is essentially a formal consequence of the Central Limit Theorem. That’s the problem. To prove the CLT, our distribution has to satisfy some mild amount of niceties. One of them is that the moments/variance are defined. It turns out that the Cauchy distribution does not satisfy this.
One scary thing is that the Cauchy distribution actually appears very naturally in lots of situations. It has two hyperparameters
and even worse it lends itself to a Bayesian analysis well, because the way we update the distribution as new data comes in gives us another Cauchy distribution.
Suppose we do an experiment: we collect photons from an unknown source and want to locate the and (i.e and ) values of the source. This fits into a Cauchy distribution framework. In fact, Hanson and Wolf did some computer simulations using a Monte Carlo method to see what happens (the full paper is here). To simplify things, we assume that one of the values is known exactly.
The Cauchy distribution actually peaks extremely fast (inversely proportional to where is the sample size). So after a reasonable amount of data we get an extremely high confidence in a very narrow range. We can say with near certainty exactly where the location is by using the posterior distribution.
So what happened with the mean? In the experiment with the most data, they found the actual location at and the mean was . So…it was off by probably worse than your wildest imagination could have guessed. On the other hand, the median was .
The variance of the distribution is infinite, so the outliers throw the mean around alot, but the median is actually protected against this. This goes to show that you cannot always assume the mean of the data is a reasonable guess! You actually have to do the Bayesian analysis and go to the posterior distribution to get the correct estimator.