## Statistical Oddities 4: The Base Rate Fallacy

If you are a classically trained scientist, then you probably do an experiment, get some data, run it through SPSS (or something similar), then to see whether or not the results are significant you look at the ${p}$-value. It is a standard that if ${p<0.05}$, then you consider the result to likely be "real" and not just some random noise making a pattern.

Why is that? Well, here's how you define the ${p}$-value. Suppose the null hypothesis is true. What is the probability of seeing data at least as extreme as yours? That is the ${p}$-value. My null hypothesis is that my coin is fair. I do 200 flips. I calculate ${p=0.04}$. Then there is only a 4% chance that I would see a split at least that lopsided if my coin is fair.

Danger! Frequently, people try to negate everything and say that "there is a 96% chance (or less obviously wrong they'll say we have 96% confidence) that the coin is biased." If you read these posts, then you should immediately see the error. The ${p}$-value is a statement about the data given a hypothesis, not about the hypothesis given the data. There can’t be a 96% chance of the coin being biased, because no matter what our flips were (it could be 100 and 100 after a 200 flip trial), the posterior probability of it being exactly fair is still ${0}$ (just compute the integral of the posterior distribution from ${0.5}$ to ${0.5}$), which leaves a probability of ${1}$ for some bias. Yet you see this language all over scientific papers.

If we want to talk about confidence, then we have a way to do it and it does not involve the ${p}$-value. I wouldn’t say this is a frequentist vs Bayesian thing, but I think the Bayesian analysis actually makes it harder to make this mistake. Recall that what we did there was use the language that being unbiased was a credible hypothesis given our 95% HDI. What we have confidence about is an interval. Maybe we have 95% confidence that the bias is in the interval ${(0.2, 0.8)}$. In this case, the hypothesis of being unbiased is credible, but the hypothesis of it being ${0.7}$ is also credible with the same HDI.

Anyway, back to ${p}$-values. Since I’m using coin flipping as an example, you might think this is silly, but let’s ramp up our experiment. Suppose my lab works for a casino. We make sure coins are fair before passing them on to the casino (for their hot new coin flipping game?). I use a ${p}$-value threshold of ${0.05}$ as usual. After ${100}$ coins I expect that I’ve made a mistake on ${5}$ or fewer because of my threshold, right? This is the type of interpretation you see all the time! It is clearly wrong.

Suppose ${10}$ of the ${100}$ coins are biased due to manufacturing errors. Depending on the power of my test (I haven’t talked about power, but as you can imagine it depends on how many flips I use in my trials among other things) maybe I find ${8}$ of them (this would be a power of ${0.8}$ which isn’t unreasonable in science). Now recall our definition of the ${p}$-value. I also have a ${5\%}$ chance of incorrectly saying that any one of my ${90}$ unbiased coins is biased, which comes to ${4.5}$, so call it ${5}$, false alarms. This puts me at identifying ${13}$ biased coins only ${8}$ of which are actually biased. Despite a ${p}$-value threshold of ${0.05}$, I actually only got ${62\%}$ of my guesses of bias correct (you could calculate this much more easily using Bayes’ theorem).
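The arithmetic in the casino example can be sketched in a few lines. This is only an illustration: the function name `screening_outcome` and its parameter names are made up for this sketch, and it works with expected counts rather than simulating any flips.

```python
def screening_outcome(n_tested, n_real, power, alpha):
    """Expected counts when every item is tested once at threshold alpha.

    n_real is the number of items that truly have the effect (the base rate
    times n_tested), power is the chance a real effect is detected, and
    alpha is the chance a null item is flagged anyway.
    """
    true_positives = power * n_real
    false_positives = alpha * (n_tested - n_real)
    flagged = true_positives + false_positives
    return true_positives, false_positives, true_positives / flagged


# The casino example: 100 coins, 10 truly biased, power 0.8, threshold 0.05.
tp, fp, ppv = screening_outcome(n_tested=100, n_real=10, power=0.8, alpha=0.05)
print(f"true positives: {tp:.1f}, false positives: {fp:.1f}, fraction correct: {ppv:.2f}")
```

Without rounding the ${4.5}$ false alarms up to ${5}$, the fraction of correct calls comes out at ${64\%}$ rather than ${62\%}$. Either way it is nowhere near ${95\%}$, and it gets worse as the number of truly biased coins shrinks.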

The above scenario is extremely common in some medical science labs where it matters. Suppose you test drugs to see if they work, and only ${10}$ out of every ${100}$ candidates really do. Your test has ${0.8}$ power and you use a ${p}$-value threshold of ${0.05}$ as you’ve been taught. After testing ${100}$ drugs, you send ${13}$ to drug manufacturers claiming they work. You think that you are wrong only ${5\%}$ of the time, but in reality ${5}$ out of the ${13}$ drugs you send don’t work! This is extremely dangerous. Of course, these should be weeded out on secondary trials, but who has the time or money to do that? If we think we have ${95\%}$ confidence that it works, then we may as well send them out to help people and only do our repeat experiment while it is on the market.

Ignoring the real interpretation of the ${p}$-value in favor of the more optimistic one is so common it has a name: the base rate fallacy. This is because that high number of false positives comes from the fact that the base rate of a drug working (or the coin being biased) is so low that you are likely to get false positives even with a high power test and a small ${p}$-value. I know this type of thing has been posted on the internet all over the place, but I hadn’t done it yet and it seemed to fit in with the statistical oddities series. For the record, the example scenario above was taken from Statistics Done Wrong by Alex Reinhart.

## Westward the Course of Empire Takes its Way

This is mostly meant to be a direct continuation of the last post, but there is so much to say about the importance of this short story for understanding Infinite Jest that I needed a full post to do it. I will try to stick to this thesis, but I get so excited about unraveling all the complexities and parallels in this story that I may wander off at times. This story may, in fact, be more complicated and difficult to read than Infinite Jest, so be warned.

Let’s start with the basics. The main character is a writer who wants to write a new type of fiction. He claims that it will use the old metafictional devices, but also move past them and stab the reader in the heart. We already saw this idea in the last post, but this story is a way for DFW to tell us how he intends to do it, i.e. it serves as a reader’s guide to Infinite Jest. That’s why this story is so important for prep material (if you choose to do such a thing).

What is going on takes a moment to digest. Here goes. The work is a criticism of the shortcomings of metafiction. But it is a metafictional story using those very devices to do the criticism. The main critique is of Barth’s “Lost in the Funhouse.” To do this, Barth is literally a character in the story as Professor Ambrose who wrote the aforementioned story (LitF from now on, because that is getting annoying to type), but this time it is an autobiographical nonfiction work instead of Barth’s fiction (recall that the main character of LitF is Ambrose). Summary: Prof Ambrose wrote LitF in DFW’s story and is leading a writing workshop.

Ambrose (despite being a “real” character already) from LitF is fictionalized as Mark, the main character in “Westward …” through a retelling of LitF. LitF is a story about Ambrose travelling to Ocean City and getting lost in a funhouse at the amusement park. DFW uses wordplay in the retelling and has Mark travelling to a McDonald’s commercial actors reunion where there will (of course!) be a Ronald McDonald “funhouse.”

I said I wouldn’t do this, so I’m going to cut myself off there. I trust that if you’ve read LitF, and you take some time to meditate on the above two paragraphs until it stops being so confusing, then you can continue to unravel this ridiculously convoluted metaphor and story within a story that is a retelling of that story (which is already in the story …). Stop. I must stop. But it is just so much fun to unravel (for example, the funhouse in LitF is being franchised which is an insult that post-modernism has become commercial).

So what is DFW trying to tell us? Well, Barth uses his story to tell us how he sees metafiction. His metaphor is the funhouse of mirrors. In LitF he writes, “In a funhouse mirror-room you can’t see yourself go on forever, because no matter how you stand your head gets in the way.” This is the exact type of critical theory conundrum that DFW faces. He wants to affect the reader. But words and texts and people’s thoughts (i.e. “heads”) are always in the way. You can’t ever truly get to the person.

DFW’s metaphor is a bow and arrow, because Mark, the main character, is a pro archer. He has a beautiful description in “Westward …” of how an archer must take into account that the arrow doesn’t fly true. So to hit the bullseye, the archer actually makes adjustments ahead of time, aims off-center, and ends up hitting the center.

He’s saying that Barth can’t hit the reader, because he’s aiming at the wrong place: the head. Writers that strike at the reader’s heart also fall short, because they aim at it too directly. This new type of fiction will take this into account and aim in between. The result will be a piercing of the reader’s heart in a new and more serious way.

Mark’s girlfriend is a post-modernist writer in Ambrose’s workshop. Without going too far into it, the thing to pay attention to with her is that she is the epitome of the type of metafiction that DFW wants to do away with. Remember, DFW wants to keep some metafiction and throw out other parts to invent a new type of fiction. This character is a guide to the parts he wants thrown out.

This is a long story, and so I can’t help you through every detail. Another general principle to keep in mind while interpreting this is that the arrow is meant to be a stand-in for the pen. So when the arrow “kills” things/people, you should figure out what those things/people are representing. For example, Mark writes a story about a person named Dave (oh no, Mark who is Ambrose is a stand-in for DFW writes a work of “new fiction” with Dave as its main character …).

Dave has a lover named L– (presumably meant to be “literature”). But L– commits suicide (as the post-modernists brought the death of literature) with the arrow. Dave is innocent, but feels guilty and hence admits that (after translation out of the metaphor) his writing helped bring about the death of literature. Of course, Mark makes an appearance in this story that he wrote causing yet another story within the story with a character as the person that wrote the story, but also a stand-in for someone else (which sets up a weird endless loop that DFW is Mark, Mark is Dave, and Dave is DFW …). I seem to be losing my way again, so I’ll end this line of thought.

Hopefully you have a bit of a feel for what “Westward …” is doing. I’ll end this post by going through my thoroughly well-worn copy of the story and pulling the quotes that I think are the most important to focus on for understanding how and why DFW wrote Infinite Jest.

“…they want to build a Funhouse for lovers out of a story that does not love. J.D. himself had said the story doesn’t love, no? Yes. However, Mark postulates that Steelritter is only half-right. The story does not love, but this is precisely because it is not cruel…. The way to make a story a Funhouse is to put the story itself in one. For a lover. Make the reader a lover, who wants to be inside.”

“Please don’t tell anybody, but Mark Nechtr desires, some distant hard-earned day, to write something that stabs you in the heart. That pierces you, makes you think you’re going to die…. The stuff would probably use metafiction as a bright smiling disguise, a harmless floppy-shoed costume, because metafiction is safe to read, familiar as syndication; and no victim is as delicious as the one who smiles in relief at your familiar approach.”

Barth’s LitF famously opens with, “For whom is the funhouse fun? Perhaps for lovers. For Ambrose it is a place of fear and confusion.”

DFW turns it around and beautifully sums up what he is doing with his closing lines:

“For whom?
You are loved.”

## Minor Preparation to Get the Most out of Infinite Jest

I’ve been reading the biography of David Foster Wallace, Every Love Story is a Ghost Story by D.T. Max, and it reminded me that for years I’ve been meaning to do a blog post on some of the preparation you can do to have a much better experience reading Infinite Jest.

First, I’m not doing this out of some condescending “let the self-declared expert tell you how you must read this” type of thing. I actually get asked this question semi-frequently, and I want something I can direct people to. My first answer is usually, “Just do it.” You can get a lot of enjoyment out of the novel without delving into the philosophy of the meta-fictional devices.

On the other hand, if you are going to spend a few months of your life reading a 1000 page beast of a novel, then you should be willing to do some minor preparation. I estimate a dedicated person could easily do these reading assignments in less than a week. I picked these for both brevity and clarity after years of reading everything he’s ever written and watching/reading tons of interviews with him, and reading as many things as I can that he points out as influences.

This will take two posts. One on everything and why I chose it. The other on understanding his story Westward the Course of Empire Takes its Way. If you are really pressed for time, then my advice is to finish reading this post. Read that story. Then read my soon to come explanation of why that story is the most important thing he ever wrote in trying to decipher why he writes in the way he writes. That story is a Rosetta stone to understanding his later works.

“Lost in the Funhouse” by John Barth (a short story)
“The Balloon” by Donald Barthelme (a short story)
The Mezzanine by Nicholson Baker (a very short novella)
“Westward the Course of Empire Takes its Way” by David Foster Wallace (a short story/novella)

That may look like a lot, but each story can probably be read in one sitting, although I recommend going slowly through that last one. Let’s take them one at a time.

“The Balloon” is probably the least important of the list. This is a short story that DFW talked about in several interviews. It was a story that basically changed his life. He wasn’t a literature or creative writing major in college, but this story made him see writing in a different light. It made him want to be a writer.

Here’s how I understand this. All the fiction that DFW wrote was deeply philosophical. He majored in philosophy and as a grad student took lots of critical theory. He was obsessed with the theory behind the relationship between author, text, and reader. This wasn’t abstract for him. Because he wanted to develop a relationship with his readers through what he wrote, he needed to understand what the nature of that relationship was.

What Barthelme’s story does, which was so novel at the time, is put the theoretical considerations right in the story plainly for all to see. This is essentially a defining characteristic of the post-modernists of the time. The story as a whole has some macro-structure (“plot” if you want to use that term), but the individual sentences have a micro-structure which is informing you as you go how to interpret the macro-structure.

The story is very enigmatic. Just as you are thinking, “What in the world is going on?” you encounter characters who say things like, “We have learned not to insist on meanings.” This isn’t the type of place where DFW ended in his writing, but it makes a lot of sense why he started here. The story is difficult, but the reader who is willing to put in the effort to think about the individual sentences is rewarded by being helped by the author, i.e. a back-and-forth rewarding relationship is built. Both sides have to put in effort, which is a key idea that will keep coming up.

As linked above, I’ve written about “Lost in the Funhouse” before. You can read that for details. Some might go so far as to call it “the canonical” example of post-modernism. The main importance on this list is that “Westward …” is simultaneously a parody of it, a rewriting of it, and a tool to get some messages across. I dare say it is impossible to read “Westward …” and have any idea what is going on without having read “Lost in the Funhouse” first. We’ll discuss it a bit more next time.

Last is The Mezzanine by Nicholson Baker. This book takes place over something like 10 seconds. The plot (and full main text!) of the novella is that a man walks into a mezzanine and takes an escalator up to the next floor. That’s it. What makes this so compelling is that there are about 130 pages of footnotes telling you what the guy is thinking through this whole process.

The book is a page turner. I’m not joking. It gives you a glimpse into the mind of another human in such a raw and unfiltered way. It, of course, is really funny at times, but the fact that it is funny is because you know your thoughts do the same exact types of things. You chain together all sorts of seemingly unrelated stupid things.

The reason for putting this on here is two-fold. First, DFW constantly talked about the importance of literature being that it makes you for a moment feel less alone. Here’s the quote, “We all suffer alone in the real world. True empathy’s impossible. But if a piece of fiction can allow us imaginatively to identify with a character’s pain, we might then also more easily conceive of others identifying with our own. This is nourishing, redemptive; we become less alone inside. It might just be that simple.” This book comes as close as any that I can think of to achieving the idea of truly identifying with a character.

The second reason I chose this book is actually the key one. The way the book does it is not by any of the conventional means. It achieves this truly magnificent feat purely through the use of footnotes. DFW loved this book. Now ask yourself what is the most daunting part of Infinite Jest? Most people say it is the extensive use of endnotes.

We’ll get more to the endnotes next time, but I think The Mezzanine holds the key to one of the reasons DFW used them. They aren’t purely distraction. They aren’t meta-fictional wankery. They aren’t highfalutin philosophical nonsense. DFW read a book that achieved what he considered the goal of literature, and it was done using this device. If you can understand the use in The Mezzanine, then you will be well on your way to understanding the use of the endnotes in Infinite Jest.

We’re only halfway there, but if you’ve made it this far and you want some extra credit, then I also recommend finding a copy of Marshall Boswell’s Understanding David Foster Wallace. It is a good resource if you want to delve deeper into the philosophy and critical theory of what he was trying to do. Also, DFW is trying to surpass his post-modern idols, so it helps to be familiar with post-modernism in general. If you aren’t, then The Crying of Lot 49 by Thomas Pynchon is a pretty short but classic book in that style as well.

## A Bayesian Formulation of Occam’s Razor

Today we will formulate Occam’s Razor in Bayesian terms. Recall that this says that if two hypotheses explain the data equally well, then the one with fewer assumptions is to be preferred. Before continuing, we should first get a handle on what this is and what the Bayesian reformulation means. First, it is basically a scientific heuristic. The intuitive reason for it is that unnecessary hypotheses are just going to make your model more likely to make mistakes (i.e. it will “overfit”).

What this post is going to do is give a formulation of it in Bayesian terms. This is not a mathematical proof that Occam’s Razor is true or anything, but it will be a proof that under certain mild assumptions the principle falls out as a consequence. That’s what makes this kind of cool. We want to decide whether or not hypothesis A or B is a better statistical model where A and B explain the observed data equally well, but B has an extra parameter.

How should we do this? Well, in probabilistic terms we want to figure out ${P(A|D)}$ and ${P(B|D)}$, the “probability that ${A}$ is true given the data ${D}$” and the “probability that ${B}$ is true given the data ${D}$.” We merely compare these two quantities for example by taking the quotient

$\displaystyle \frac{P(A|D)}{P(B|D)}.$

If the quotient is near ${1}$, then they are roughly equally good models. If the quotient is large, then ${A}$ is a better hypothesis and if the quantity is close to ${0}$, then ${B}$ is the better hypothesis.

Let’s take stock of our assumptions here. We do not assume Occam’s Razor (some people feel like OR is a pretty steep assumption), because it is not a priori clear that it is always the best principle to follow. But here we are merely making the assumption that comparing the probabilities that each model is a true model of the data we observe is a good test for selecting one model over another. It is kind of hard to argue against such a common sense assumption.

Now we use Bayes’ Theorem to convert these quantities to things we actually know about:

$\displaystyle \frac{P(A|D)}{P(B|D)} = \frac{P(D|A)P(A)}{P(D|B)P(B)}$

At this point we have some difficulty with the ${B}$ hypothesis still, because implicitly we have assumed it relies on some extra parameter ${\lambda}$. To simplify the argument, we will assume that ${\lambda}$ lies in some range (this isn’t unreasonable because in real life you should have some idea what order of magnitude etc this parameter should be): ${\lambda_m \leq \lambda \leq \lambda_M}$. We will make a less reasonable simplifying assumption and say that once this range is specified, we have a uniform chance of it being anything in the range, i.e.

$\displaystyle P(\lambda |B) = \frac{1}{\lambda_M - \lambda_m}$

for ${\lambda}$ in the range and ${0}$ otherwise. There will be an observed ${\lambda_0}$ that maximizes the likelihood function (i.e. fits the data the best). Choose ${\delta}$ so that ${\lambda_0 \pm \delta}$ is an interval giving us reasonable certainty of the best ${\lambda_0}$ (we could use the 95% HDI if we wanted to get the interval). We will also approximate the likelihood near its peak by a Gaussian of width ${\delta}$ centered at ${\lambda_0}$, and assume the allowed range is wide enough to contain essentially all of it. Now let’s work out what is happening for ${B}$:

$\displaystyle P(D|B) = \int P(D, \lambda|B)d\lambda = \int P(D|\lambda, B)P(\lambda |B)d\lambda$

$\displaystyle \approx\frac{1}{\lambda_M - \lambda_m}\int P(D|\lambda_0, B)\exp\left(-\frac{(\lambda-\lambda_0)^2}{2\delta^2}\right)d\lambda$

$\displaystyle =\frac{\delta\sqrt{2\pi}P(D|\lambda_0, B)}{\lambda_M - \lambda_m}$

Now we can plug this into our original comparison ratio. Assuming we have no prior reason to favor either model, i.e. ${P(A)=P(B)}$, the priors cancel and we get:

$\displaystyle \frac{P(A|D)}{P(B|D)}=\frac{(\lambda_M-\lambda_m)P(D|A)}{\delta\sqrt{2\pi}P(D|\lambda_0, B)}$

This gives us two main conclusions. The first is that if we assume our two models make roughly equivalent predictions on the data, i.e. ${P(D|A)\approx P(D|\lambda_0, B)}$, then we should prefer ${A}$ because the possible range for ${\lambda}$ giving a factor in the numerator will in general be quite a bit larger than ${\delta}$. This is exactly Occam’s Razor.

The possibly more interesting consequence is that we now know exactly how much this extra parameter is “penalizing” the theory. So given specific cases we can test whether or not that extra parameter is worth putting in. In other words, are the predictions significantly enough better with the extra parameter to overcome the penalty of introducing an extra complicated hypothesis? This abstract and vague notion from Occam’s Razor gets explicitly quantified in Bayesian analysis so that it is no longer vague or abstract and we can confidently apply Occam’s Razor when it is needed and avoid it when it isn’t.
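To get a feel for the size of the penalty, here is a small sketch of the final ratio, together with a numerical check of the Gaussian integral that produced it. Everything in it is illustrative: the function names are invented for this post, the priors on ${A}$ and ${B}$ are assumed equal, the prior on ${\lambda}$ is the uniform one used above, and the example numbers are made up.

```python
import math


def evidence_ratio(lam_min, lam_max, delta, like_a, like_b_best):
    """P(A|D) / P(B|D) from the closed form, assuming P(A) = P(B)."""
    occam_factor = (lam_max - lam_min) / (delta * math.sqrt(2 * math.pi))
    return occam_factor * like_a / like_b_best


def marginal_likelihood_b(lam_min, lam_max, lam0, delta, like_b_best, steps=100_000):
    """Midpoint-rule estimate of P(D|B) with a Gaussian-shaped likelihood."""
    width = (lam_max - lam_min) / steps
    total = 0.0
    for i in range(steps):
        lam = lam_min + (i + 0.5) * width
        total += like_b_best * math.exp(-((lam - lam0) ** 2) / (2 * delta ** 2))
    # Multiply by the uniform prior density 1 / (lam_max - lam_min).
    return total * width / (lam_max - lam_min)


# Made-up numbers: lambda allowed in [0, 10], best fit 5 +/- 0.1,
# and both models fit the data equally well at their best.
ratio = evidence_ratio(0.0, 10.0, 0.1, like_a=0.02, like_b_best=0.02)
numeric = marginal_likelihood_b(0.0, 10.0, 5.0, 0.1, like_b_best=0.02)
closed = 0.1 * math.sqrt(2 * math.pi) * 0.02 / 10.0
print("P(A|D)/P(B|D):", ratio)
print("P(D|B) numeric vs closed form:", numeric, closed)
```

Read the other way around, the Occam factor ${(\lambda_M-\lambda_m)/(\delta\sqrt{2\pi})}$ is how many times better ${B}$ has to fit at ${\lambda_0}$ before the extra parameter pays for itself.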

## Statistical Oddities Part 3

This oddity is really hard to get your head around if you’ve been doing standard null-hypothesis testing all your life. This oddity says that null hypothesis significance testing depends on the intentions of the experimenter.

What does this mean? Well, let’s go back to our worked example of flipping a coin and trying to determine whether or not it is biased based on the observed data. Recall that in our Bayesian analysis we took our data, and our test for whether or not the coin was biased came down to whether or not 0.5 was a credible value given the posterior distribution. We didn’t need to know anything about the intentions of the person flipping the coin.

How does traditional (read: frequentist) null hypothesis testing work? We set up an experiment in which the experimenter flips the coin 100 times. If we observe 47 heads, then we calculate the probability of a result at least that far from 50 given the coin is fair. If that probability is below a certain threshold, then we say the coin is biased because it is extremely unlikely that we would observe such a result by chance alone. Otherwise we do not reject the null hypothesis and keep treating the coin as fair.

Unfortunately, our probability space depends on the number of total coin flips. The probability space is extremely different if the experimenter set up the experiment so that the number of flips was not predetermined and instead a coin was flipped as many times as possible for 5 minutes. The probability space in this case is much larger because some possible samples would have 90 flips and some would have 110 and so on.

It would also be radically different if the experimenter decided to flip the coin until they reached 47 heads. Then the probability space would again have all sorts of different possibilities for the number of flips. Maybe sometimes you would expect to do 150 flips before seeing 47 heads.

Just to reiterate, this isn’t a trivial matter. This says we need to know the intent of the experimenter if we want to do a legitimate null hypothesis significance test. If we don’t know how the experiment was designed, then we don’t know what our probability space should look like to know whether or not we should reject the null hypothesis.

To see why this is shocking just do the thought experiment where three labs flip the same coin. Each of the labs sets up the experiment in one of the three ways listed above. You get the exact same data from each of the labs. You could rig the numbers so that in some cases you decide the coin is fair and in others you decide that it is not fair. But they gave you the same exact data of 47 heads out of 100 flips (or whatever your thought experiment requires)! Let’s reiterate: They gave you the exact same data, but came to different conclusions about the fairness of the coin. How is this possible?
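Two of the three designs are easy to compare directly. The sketch below is a simplification and not Kruschke’s own calculation: it uses one-sided ${p}$-values (asking only whether there are suspiciously few heads), it assumes the last flip in the stop-at-heads design was a head, and the function names are invented for illustration.

```python
from math import comb


def binom_cdf(k, n, p=0.5):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def p_fixed_flips(heads, flips):
    """Design 1: the number of flips was fixed in advance.

    One-sided p-value: chance of seeing this many heads or fewer.
    """
    return binom_cdf(heads, flips)


def p_stop_at_heads(heads, flips):
    """Design 2: flip until `heads` heads have appeared.

    One-sided p-value: chance of needing this many flips or more, i.e. of
    seeing fewer than `heads` heads in the first `flips - 1` flips.
    """
    return binom_cdf(heads - 1, flips - 1)


# Same data, two different intentions.
for heads, flips in [(47, 100), (3, 12)]:
    print(heads, flips, p_fixed_flips(heads, flips), p_stop_at_heads(heads, flips))
```

For 3 heads in 12 flips the fixed-flips design gives ${p\approx 0.073}$ and the stop-at-3-heads design gives ${p\approx 0.033}$, so at a ${0.05}$ threshold one lab rejects fairness and the other doesn’t, from identical data.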

If we live in some sort of objective universe where we can do experiments and draw conclusions from them, then the results of an experiment should rely on the data and not on the subjective intentions of the experimenter. More bluntly, determining whether or not the coin is biased should not depend on what is happening in the coin flipper’s mind during the flipping.

This is a very real and dangerous statistical oddity if the person running the analysis isn’t aware of it. In fact, I dare say that this is one of the easy ways to massage data in the sciences to get “results” where none exist. To me, this is actually one of the strongest arguments for scientists to use Bayesian statistics rather than null hypothesis testing. As we saw in the linked post, Bayesian statistics gets around this issue and only needs the raw data and not the intentions of the experimenter.

By the way, before I get sued, I stole this example (with different numbers) from Doing Bayesian Data Analysis by John K. Kruschke. It is a really fantastic book to learn about this stuff.

## Statistical Oddities Part 2

Suppose you take in a bunch of data and you make a statistical model out of it. You start making predictions and find that you are wrong a lot of the time. Naturally, your first thought is to go collect a lot more data. Question: Is feeding the model more data necessarily going to improve your prediction record in a significant way?

Intuition tells us that the answer should be yes. I used to think the more you know, the better your guesses are going to be even if the model is bad. It turns out that the answer depends on what is causing your error. Nowadays there are tons of ways to measure error, but let’s compare two of them. One of them you are probably already familiar with called the variance. The other is called the bias.

Bias roughly corresponds to “being biased” towards a certain answer. Your guesses are localized around something that isn’t correct. Some people call this “underfitting.” If your data set comes from a parabola and you use linear regression to model your predictions, then you will see a high bias.

High variance is the opposite. It comes from guesses that are not localized enough. Little changes are causing big swings in your predictions. You are confusing the noise in the data for a real signal.

Thinking about these two vastly different ways your predictions could be going wrong, it turns out that if you are in the high bias case then more data will not improve your predictions. This is just because once you’ve reached a critical amount of data, then the predictions are set. Adding in more data will not update the model to something new or different, because it is biased to give a certain prediction. Thinking back to using linear regression to model data coming from a parabola, your predictions obviously won’t improve just by feeding it more data.
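The parabola example is easy to try. The following is a minimal sketch under assumptions of my choosing: the true curve is ${y=x^2}$ on ${[-1,1]}$, the noise is Gaussian with standard deviation ${0.1}$, and the helper names are invented. It fits a straight line by ordinary least squares and measures how far that line is from the noise-free parabola.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def line_error_on_parabola(n_train, seed=0, noise=0.1):
    """Mean squared gap between the fitted line and the true curve y = x^2."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1, 1) for _ in range(n_train)]
    ys = [x * x + rng.gauss(0, noise) for x in xs]
    slope, intercept = fit_line(xs, ys)
    grid = [-1 + 2 * i / 1000 for i in range(1001)]
    return sum((slope * x + intercept - x * x) ** 2 for x in grid) / len(grid)


for n in (20, 200, 2000, 20000):
    print(n, line_error_on_parabola(n))
```

No straight line can get this error below roughly ${4/45\approx 0.089}$ on that interval, so the numbers stall near that floor no matter how large the training set gets. That floor is the bias.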

If you have a high variance problem, then getting more data will actually help. This is basically because if you make a model that is sensitive to noise on a small data set, then the noise is going to throw your predictions around a lot. But the more data you add, the more that noise is going to cancel itself out and give some better predictions. Of course, this is a brute force fix, and you should actually try to get the variance down so that the model is better, but that is another post.

That’s the oddity for the day. It seems that adding more data should always improve predictions in a statistical model, but it turns out that this is not the case if your error is coming from high bias. This is actually somewhat related to the next oddity, but I think the next one will be much more interesting than this one, so stay tuned.

## Statistical Oddities Part 1

I’m going to come back to posting on L-series soon. For now I’m going to do a first post in a series on some funny business in statistics. These are mostly going to be little things to watch out for if you are new to statistical analysis. These will be well-known things, so experts probably won’t find much use in this series. On the other hand, if you’ve just had a single undergraduate course, then you probably missed these strange examples which cause danger for mis-analysis in the “real world.”

Our funny distribution today is the Cauchy distribution. To point out the funny business, let’s recall the Bayesian worked example. We flipped a coin and modelled it using the beta distribution. We wanted to determine whether or not it was biased. If we got ${n}$ heads and ${m}$ tails, then the maximum of the distribution happened at ${\frac{n}{n+m}}$.

Nice! The most likely value for our bias was exactly the mean. Now beta distributions can be a little skewed, so our 95% confidence interval wasn’t symmetric about the mean, but the mean is always in the confidence interval no matter what our threshold is or how skewed our posterior is. This feels right, and it turns out that basically every distribution you encounter in a first stats class has this property.

This property (that the mean is always a “good guess”) is essentially a formal consequence of the Central Limit Theorem. That’s the problem. To prove the CLT, our distribution has to satisfy some mild amount of niceties. One of them is that the moments/variance are defined. It turns out that the Cauchy distribution does not satisfy this.

One scary thing is that the Cauchy distribution actually appears very naturally in lots of situations. It has two parameters, a location ${\alpha}$ and a scale ${\beta}$:

$\displaystyle P(x|\alpha, \beta)=\frac{\beta}{\pi(\beta^2+(x-\alpha)^2)}$

and even worse it lends itself to a Bayesian analysis well, because the way we update the distribution as new data comes in gives us another Cauchy distribution.

Suppose we do an experiment: we collect photons from an unknown source and want to locate the ${x}$ and ${y}$ (i.e. ${\alpha}$ and ${\beta}$) values of the source. This fits into a Cauchy distribution framework. In fact, Hanson and Wolf did some computer simulations using a Monte Carlo method to see what happens (the full paper is here). To simplify things, we assume that one of the values is known exactly.

The posterior distribution actually narrows extremely fast (its width is inversely proportional to ${\sqrt{N}}$ where ${N}$ is the sample size). So after a reasonable amount of data we get an extremely high confidence in a very narrow range. We can say with near certainty exactly where the location is by using the posterior distribution.

So what happened with the mean? In the experiment with the most data, they found the actual location at ${0.236}$ and the mean was ${7.14\times 10^8}$. So…it was off by probably worse than your wildest imagination could have guessed. On the other hand, the median was ${0.256}$.

The variance of the distribution is infinite, so the outliers throw the mean around a lot, but the median is actually protected against this. This goes to show that you cannot always assume the mean of the data is a reasonable guess! You actually have to do the Bayesian analysis and go to the posterior distribution to get the correct estimator.
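You can watch the mean misbehave without any photons. This is a toy simulation, not a reproduction of the Hanson and Wolf experiment: the location ${2.0}$, scale ${1.0}$, sample size and the function name are all assumptions made for illustration. Samples are drawn by pushing uniform random numbers through the inverse of the Cauchy cumulative distribution function.

```python
import math
import random
import statistics


def cauchy_samples(n, location, scale, seed=0):
    """Draw n Cauchy samples via the inverse CDF: location + scale * tan(pi * (u - 1/2))."""
    rng = random.Random(seed)
    return [
        location + scale * math.tan(math.pi * (rng.random() - 0.5))
        for _ in range(n)
    ]


samples = cauchy_samples(20001, location=2.0, scale=1.0)
print("sample mean:  ", sum(samples) / len(samples))
print("sample median:", statistics.median(samples))
```

Change the seed a few times: the median stays put near the true location while the mean jumps around, sometimes wildly, because a single extreme draw can dominate the whole sum.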