Statistical Oddities Part 1

I’m going to come back to posting on L-series soon. For now I’m going to do a first post in a series on some funny business in statistics. These are mostly going to be little things to watch out for if you are new to statistical analysis. These will be well-known things, so experts probably won’t find much use in this series. On the other hand, if you’ve just had a single undergraduate course, then you probably missed these strange examples which cause danger for mis-analysis in the “real world.”

Our funny distribution today is the Cauchy distribution. To point out the funny business, let’s recall the Bayesian worked example. We flipped a coin and modelled it using the beta distribution. We wanted to determine whether or not it was biased. If we got {n} heads and {m} tails, then the maximum of the distribution happened at {\frac{n}{n+m}}.

Nice! The most likely value for our bias was exactly the mean. Now beta distributions can be a little skewed, so our 95% confidence interval wasn’t symmetric about the mean, but the mean is always in the confidence interval no matter what our threshold is or how skewed our posterior is. This feels right, and it turns out that basically every distribution you encounter in a first stats class has this property.

This property (that the mean is always a “good guess”) is essentially a formal consequence of the Central Limit Theorem. That’s the problem. To prove the CLT, our distribution has to satisfy some mild amount of niceties. One of them is that the moments/variance are defined. It turns out that the Cauchy distribution does not satisfy this.

One scary thing is that the Cauchy distribution actually appears very naturally in lots of situations. It has two hyperparameters

\displaystyle P(x|\alpha, \beta)=\frac{\beta}{\pi(\beta^2+(x-\alpha)^2)}

and even worse it lends itself to a Bayesian analysis well, because the way we update the distribution as new data comes in gives us another Cauchy distribution.

Suppose we do an experiment: we collect photons from an unknown source and want to locate the {x} and {y} (i.e {\alpha} and {\beta}) values of the source. This fits into a Cauchy distribution framework. In fact, Hanson and Wolf did some computer simulations using a Monte Carlo method to see what happens (the full paper is here). To simplify things, we assume that one of the values is known exactly.

The Cauchy distribution actually peaks extremely fast (inversely proportional to {\sqrt{N}} where {N} is the sample size). So after a reasonable amount of data we get an extremely high confidence in a very narrow range. We can say with near certainty exactly where the location is by using the posterior distribution.

So what happened with the mean? In the experiment with the most data, they found the actual location at {0.236} and the mean was {7.14\times 10^8}. So…it was off by probably worse than your wildest imagination could have guessed. On the other hand, the median was {0.256}.

The variance of the distribution is infinite, so the outliers throw the mean around alot, but the median is actually protected against this. This goes to show that you cannot always assume the mean of the data is a reasonable guess! You actually have to do the Bayesian analysis and go to the posterior distribution to get the correct estimator.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s