The Carter Catastrophe

I’ve been reading Manifold: Time by Stephen Baxter. The book is quite good so far, and it presents a fascinating probabilistic argument that humans will go extinct in the near future. It is sometimes called the Carter Catastrophe, because Brandon Carter first proposed it in 1983.

I’ll use Bayesian arguments, so you might want to review some of my previous posts on the topic if you’re feeling shaky. One thing we didn’t talk all that much about is the idea of model selection. This is the most common thing scientists have to do. If you run an experiment, you get a bunch of data. Then you have to figure out the most likely reason for what you see.

Let’s take a basic example. We have a giant tub of golf balls, and we can’t see inside the tub. There could be 1 ball or a million. We’re told the owner accidentally dropped a red ball in at some point. All the other balls are the standard white golf balls. We decide to run an experiment where we draw a ball out, one at a time, until we reach the red one.

First ball: white. Second ball: white. Third ball: red. We stop. We’ve now generated a data set from our experiment, and we want to use Bayesian methods to give the probability of there being three total balls or seven or a million. In probability terms, we need to calculate the probability that there are x balls in the tub given that we drew the red ball on the third draw. Any time we see this language, our first thought should be Bayes’ theorem.

Define A_i to be the model of there being exactly i balls in the tub. I’ll use “3” inside of P( ) to be the event of drawing the red ball on the third try. We have to make a finiteness assumption, and although this is one of the main critiques of the argument, we can examine what happens as we let the size of the bound grow. Suppose for now the tub can only hold 100 balls.

A priori, we have no idea how many balls are in there, so we’ll assume all “models” are equally likely. This means P(A_i)=1/100 for all i. By Bayes’ theorem we can calculate:

P(A_3|3) = \frac{P(3|A_3)P(A_3)}{\sum_{i=1}^{100}P(3|A_i)P(A_i)}

= \frac{(1/3)(1/100)}{(1/100)\sum_{i=3}^{100}1/i} \approx 0.09

So there’s around a 9% chance that there are only 3 balls in the tub. That bottom summation remains exactly the same when computing P(A_n | 3) for any n and equals about 3.69, and the (1/100) cancels out every time. So we can compute explicitly that for any n ≥ 3:

P(A_n|3)\approx \frac{1}{n}(0.27)

This is a decreasing function of n, and this shouldn’t be surprising at all. It says that as we guess there are more and more balls in the tub, the probability of that guess goes down. This makes sense, because it’s unreasonable to think we’d see the red one that early if there are actually 100 balls in the tub.
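Here’s a quick sketch of that computation if you want to check the numbers yourself (a minimal sketch, assuming the uniform prior on 1 to 100 balls described above):

# Posterior probability of "there are i balls in the tub" given that the
# red ball appeared on the third draw, with a uniform prior on 1..100.
N = 100
prior = 1 / N

def likelihood(i):
    # Probability of drawing the red ball exactly on draw 3 when there are i
    # balls: impossible for i < 3, otherwise 1/i since the red ball is equally
    # likely to sit in any position.
    return 0 if i < 3 else 1 / i

evidence = sum(likelihood(i) * prior for i in range(1, N + 1))
posterior = {i: likelihood(i) * prior / evidence for i in range(1, N + 1)}

print(posterior[3])   # about 0.09, the 9% computed above
print(posterior[10])  # about 0.027, matching the 0.27/n formula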

There’s lots of ways to play with this. What happens if our tub could hold millions but we still assume a uniform prior? It just takes all the probabilities down, but the general trend is the same: It becomes less and less reasonable to assume large amounts of total balls given that we found the red one so early.

You could also only care about this “earliness” idea and redo the computations where you ask how likely is A_n given that we found the red ball by the third try. This is actually the more typical way the problem is formulated in the Doomsday arguments. It’s more complicated, but the same idea pops out, and this should make intuitive sense.

Part of the reason these computations were somewhat involved is that we tried to get a distribution on all the natural numbers. But we could instead compare just two competing hypotheses and get a super clear answer (homework for you). What if we only had two choices: “small number of total balls (say 10)” or “large number of total balls (say 10,000)”? You’d find there is around a 99% chance that the “small” hypothesis is correct.
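If you want to skip the homework, here is a quick sketch of that two-hypothesis comparison (equal priors on the two models are my assumption; with these numbers the “small” hypothesis actually gets well over 99%):

# Compare "10 total balls" vs "10,000 total balls" given the red ball on draw 3.
prior_small, prior_large = 0.5, 0.5
like_small = 1 / 10      # P(red on draw 3 | 10 balls)
like_large = 1 / 10_000  # P(red on draw 3 | 10,000 balls)

posterior_small = (like_small * prior_small /
                   (like_small * prior_small + like_large * prior_large))
print(posterior_small)   # about 0.999, so "small" wins by a wide margin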

Here’s the leap. Now assume the fact that you exist right now is random. In other words, you popped out at a random point in the existence of humans. So the totality of humans to ever exist are the white balls and you are the red ball. The same type of argument above applies, and it says that the most likely thing is that you aren’t born at some super early point in human history. In fact, it’s unreasonable from a probabilistic standpoint to think that humans will continue much longer at all given your existence.

The “small” total population of humans is far, far more likely than the “large” total population, and the interesting thing is that this remains true even if you mess with the uniform prior. You could assume it is much more likely a priori for humans to continue to make improvements and colonize space and develop vaccines giving a higher prior for the species existing far into the future. But unfortunately the Bayesian argument will still pull so strongly in favor of humans ceasing to exist in the near future that one must conclude it is inevitable and will happen soon!

Anyway. I’m travelling this week, so I’m sorry if there are errors in those calculations. I was in a hurry and never double checked them. The crux of the argument should still make sense even if you don’t get my exact numbers. There’s also a lot of interesting and convincing rebuttals, but I don’t have time to get into them now (including the fact that unlikely hypotheses turn out to be true all the time).

Confounding Variables and Apparent Bias

I was going to call this post something inflammatory like #CylonLivesMatter but decided against it. Today will be a thought experiment to clarify some confusion over whether apparent bias is real bias based on aggregate data. I’ll unpack all that with a very simple example.

Let’s suppose we have a region, say a county, and we are trying to tell if car accidents disproportionately affect cylons due to bias. If you’re unfamiliar with this term, it comes from Battlestar Galactica. They were the “bad guys,” but they had absolutely no distinguishing features. From looking at them, there was no way to tell if your best friend was one or not. I want to use this for the thought experiment so that we can be absolutely certain there is no bias based on appearance.

The county we get our data from has roughly two main areas: Location 1 and Location 2. Location 1 has 5 cylons and 95 humans. Location 2 has 20 cylons and 80 humans. This means the county is 12.5% cylon and 87.5% human.

Let’s assume that there is no behavioral reason among the people of Location 1 to have safer driving habits. Let’s assume it is merely an environmental thing, say the roads are naturally larger and speed limits lower or something. They only average 1 car accident per month. Location 2, on the other hand, has poorly designed roads and bad visibility in areas, so they have 10 car accidents per month.

At the end of the year, if there is absolutely no bias at all, we would expect to see 12 car accidents uniformly distributed among the population of Location 1 and 120 car accidents uniformly distributed among the population of Location 2. This means Location 1 had roughly 1 cylon and 11 humans in accidents (rounding up the expected 0.6 cylon accidents), and Location 2 had 24 cylons and 96 humans in accidents.

We work for the county, and we take the full statistics: 25 cylon accidents and 107 human accidents. That means about 19% of car accidents involve cylons, even though cylons are only 12.5% of the county population. As investigators into this matter, we might now try to conclude that since there is a disproportionate number of cylons in car accidents with respect to their baseline population, there must be some bias or speciesism present causing this.
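Here is a minimal sketch reproducing that arithmetic (the 0.6 expected cylon accidents in Location 1 are what get rounded up to 1 in the prose above):

# Accidents split proportionally within each location, then aggregated.
loc1 = {"cylons": 5, "humans": 95, "accidents": 12}
loc2 = {"cylons": 20, "humans": 80, "accidents": 120}

def split(loc):
    total = loc["cylons"] + loc["humans"]
    frac = loc["cylons"] / total
    return loc["accidents"] * frac, loc["accidents"] * (1 - frac)

c1, h1 = split(loc1)   # 0.6 cylons, 11.4 humans
c2, h2 = split(loc2)   # 24 cylons, 96 humans

cylon_share = (c1 + c2) / (c1 + c2 + h1 + h2)
print(cylon_share)     # about 0.19, versus a 12.5% share of the county population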

Now I think everyone knows where this is going. It is clear from the example that combining together all the numbers from across the county, and then saying that the disproportionately high number of cylon car accidents had to be indicative of some underlying, institutional problem, was the incorrect thing to do. But this is the standard rhetoric of #blacklivesmatter. We hear that blacks make up roughly 13% of the population but are 25% of those killed by cops. Therefore, that basic disparity is indicative of racist motives by the cops, or at least is an institutional bias that needs to be fixed.

Recently, a more nuanced study has been making the news rounds that claims there isn’t a bias in who cops kill. How can this be? Well, what happened in our example case to cause the misleading information? A disproportionate number of cylons lived in environmental conditions that caused the car accidents. It wasn’t anyone’s fault. There wasn’t bias or speciesism at work. The lack of nuance in analyzing the statistics caused apparent bias that wasn’t there.

The study by Fryer does this. It builds a model that takes into account one uncontroversial environmental factor: we expect more accidental, unnecessary shootings by cops in more dangerous locations. In other words, we expect that, regardless of race, cops will shoot out of fear for their lives in locations where higher chances of violent crimes occur.

As with any study, there is always pushback. Mathbabe had a guest post pointing to some potential problems with sampling. I’m not trying to make any sort of statement with this post. I’ve talked about statistics a lot on the blog, and I merely wanted to show how such a study is possible with a less charged example. I know a lot of the initial reaction to the study was: But 13% vs 25%!!! Of course it’s racism!!! This idiot just has an agenda, and he’s manipulating data for political purposes!!!

Actually, when we only look at aggregate statistics across the entire country, we can accidentally pick up apparent bias where none exists, as in the example. The study just tries to tease these confounding factors out. Whether it did a good job is the subject of another post.

Do we live in a patriarchy?

I recently read Roxane Gay’s Bad Feminist. It was far better than I was expecting. The essays are personal and humorous yet address a lot of serious and deep issues. Her takedown of trigger warnings is particularly good. The essays are best when sticking to specific topics like the critiques of The Help, 50 Shades, The Hunger Games, Twilight, 12 Years a Slave, and Tyler Perry’s work. The inside look at the professional Scrabble scene is entertaining.

The essays get a bit worse when they are more general. Sometimes back-to-back essays contradict each other. In one she argues that there should be more diverse representation in TV, movies, and books, because people have a hard time relating to people who don’t look like them. In the next, she spends 5,000 words describing how deeply she identified with the white girls in Sweet Valley High. How are we supposed to take the previous essay seriously after that?

The most cringe-worthy part had to do with the elusive concept of “patriarchy.” She had just gotten through critiquing Hanna Rosin’s The End of Men, which provides a book-long study with evidence and statistics to argue that patriarchy is essentially dead.

Gay’s sophisticated response to this was to laugh it off. Ha ha ha. Of course it’s not dead. Just look around you. It is so obvious we live in a patriarchy. Sorry Gay, but you can’t argue against the conclusion of an entire book by saying the conclusion is “obviously” incorrect.

Let me begin with a (fictional) story. In college I took a lot of physics. One day the professor gave a bunch of solid arguments with evidence and studies to back it up that the Earth goes around the Sun. I burst out laughing. It was so obviously false a conclusion.

I raised my hand, ready to embarrass the professor. I pointed out that I see the Sun go around the Earth every single day with my own two eyes. He might have had some fancy arguments, but I had obviousness on my side.

It is an unfortunate truth that what seems obvious (we can even produce convincing arguments for it!) is often wrong. This is exactly what is happening when Gay’s rebuttal is: look at the political system, most of Congress is men, we’ve never had a woman president, thus men hold all the political power.

That is convincing on its surface like the Sun going around the Earth is convincing on its surface. The gender of elected officials is one metric to measure political power. Can we think of any other?

Maybe we should take the premise of a representational democracy seriously and say the electorate has the power, because it elects the members of Congress. Who votes more? Well, women do! Now we’re at an impasse, because one metric claims women have the political power, and the other metric claims men have the power. This is looking a little less obvious now that we dig deeper.

I haven’t defined patriarchy yet. Most people don’t, because they don’t want to be tied down to a particular type of evidence. The relevant dictionary definition is: a social system in which power is held by men, through cultural norms and customs that favor men and withhold opportunity from women.

For each metric you come up with to show our culture favors men, I’ll come up with one to show it favors women. My starting statistics will be: life expectancy, education (measured by the number of degrees conferred), incarceration rate, poverty rate, homelessness, victims of violent crime, workplace fatalities, and suicide rate. Your turn.

I grant you that many people argue patriarchy causes these problems for men (often stated “patriarchy hurts men too”). But that’s playing with words. By definition, a patriarchy “favors men,” and therefore cannot be the cause of society-wide disadvantages for men.

Here’s the truth. Any claim about anything can be supported with evidence if the person who believes the claim gets to pick the metric by which we measure it. This is a form of confirmation bias and sometimes the Texas sharpshooter fallacy. Raw statistics like the ones we’ve been looking at are slippery business, because they tell us nothing about causation. Is the sparsity of women in Congress because opportunity is being withheld from them by some social system that favors men, or is some other causal factor at play?

When you pick the gender of Congress as a measure, you see ahead of time that it works in your favor, and that’s why you picked it. In other words, when you look for a pattern, you’ll find it. To avoid statistical fallacies like this, we need a metric whose results we are blind to, and we need a solid argument that this metric is actually measuring what we think it is. Only then do we test what the results show. Then we repeat this with many other metrics, because the issue is way too complicated for one metric to prove anything.

I’m not saying we don’t live in a patriarchy. What I’m saying is that you can’t laugh off someone who claims we don’t, and who has a book-long argument to support her case, just because the claim is “obviously false” to you. Any argument that we live in a patriarchy is going to have to be subtle and complicated for the reasons listed above. It’s also more likely that the answer is somewhere in the middle: men are favored in some places, women are favored in some places, and it’s counterproductive to try to decide whether one outweighs the other. We can work towards equality without one gender “winning” the “oppression” war.

The 77 Cent Wage Gap Fallacy

I almost posted about this last month when “Equal Pay Day” happened. Instead, I sat back on the lookout for a good explanation of why the “fact” that “women only make 77 cents for every dollar a man makes” is meaningless. There were a ton of excellent takedowns pointing out all sorts of variables that weren’t controlled for. This is fine, but the reason the number is meaningless is so much more obvious.

Now, this blog talks about math and statistics a lot, so I felt somewhat obligated to point this out. Unfortunately, this topic is politically charged, and I’ve heard some very smart, well-intentioned people who should know better repeat this nonsense. This means bias is at work.

Let’s be clear before I start. I’m not saying there is no pay gap or no discrimination. This post is only about the most prominent figure that gets thrown around: 77 cents for every $1 and why it doesn’t mean what people want it to mean. This number is everywhere and still pops up in viral videos monthly (sometimes as “78” because they presume the gap has decreased?):

I include this video to be very clear that I am not misrepresenting the people who cite this number. They really propagate the idea that the number means a woman with the same experience and same job will tend to make 77% of what a man makes.

I did some digging and found the number comes from this outdated study. If you actually read it, you’ll find something shocking. This number refers to the median salary of a full-time, year-round working woman versus the median salary of a full-time, year-round working man. You read that right: the median across everything!!

At this point, my guess is that all my readers immediately see the problem. In case someone stumbles on this who doesn’t, let’s do a little experiment where we control for everything so we know beyond all doubt that two groups of people have the exact same pay for the same work, but a median gap appears.

Company A is perfectly egalitarian. Every single employee gets $20 an hour, including the highest ranking people. This company also believes in uniforms, but gives the employees some freedom. They can choose blue or green. The company is a small start-up, so there are only 10 people: 8 choose blue and 2 choose green.

Company B likes the model of A, but can’t afford to pay as much. They pay every employee $15 an hour. In company B it turns out that 8 choose green and 2 choose blue.

It should be painfully obvious that there is no wage gap between blue and green uniformed people in any meaningful sense, because they are paid exactly the same as their coworkers with the same job. Pay is equal in the sense that everyone who argues for pay equality should want.

But, of course, the median blue uniform worker makes $20/hour whereas the median green uniform worker only makes $15/hour. There is a uniform wage gap!

Here are some of the important factors to note from this example. It cannot be from discriminatory hiring practices, because the uniform was chosen after being hired. It cannot be that green uniform people are picking lower paying jobs, because they picked the uniform after picking the job. It cannot be from green uniforms wanting to give up their careers to go have a family, because we’ll assume for the example that all the workers are single.

I’ll reiterate, it can’t be from anything, because no pay gap exists in the example! But it gets worse. Now suppose that both companies are headed by a person who likes green and gives a $1/hour raise to all green employees. This means both companies have discriminatory practices which favor green uniforms, but the pay gap would tell us that green are discriminated against!
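A minimal sketch of the whole example, including the green raise (the lists below just encode the ten employees of each company):

from statistics import median

# 8 blue and 2 green at Company A ($20/hour), 2 blue and 8 green at Company B ($15/hour)
blue  = [20] * 8 + [15] * 2
green = [20] * 2 + [15] * 8

print(median(blue), median(green))   # 20 vs 15: a median gap with no unequal pay anywhere

# Both companies now give every green employee a $1/hour raise
green_raised = [g + 1 for g in green]
print(median(blue), median(green_raised))   # 20 vs 16: green is actively favored,
                                            # yet the median still says blue earns more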

This point can’t be stated enough. It is possible (though obviously not true based on other, narrower studies) that every company in the U.S. pays women more for equal work, yet we could still see the so-called “77 cent gender wage gap” calculated from medians. If you don’t believe this, then you haven’t understood the example I gave. Can we please stop pretending this number is meaningful?

Someone who uses a median across jobs and companies to say there is a pay gap has committed a statistical fallacy or is intentionally misleading you for political purposes. My guess is we’ll be seeing this pop up more and more as we get closer to the next election, and it will be perpetuated by both sides. It is a hard statistic to debunk in a small sound bite without sounding like you advocate unequal pay. I’ll leave you with a clip from a few weeks ago (see how many errors you spot).

Texas Sharpshooter Fallacy

In the world of big data that constantly bombards us with fancy graphics, the statistical fallacy that I think we are most likely to fall for is called the Texas Sharpshooter Fallacy. What makes this fallacy so dangerous is that it is propped up by solid, correct statistics which can be hard to argue against.

Here’s the idea. A person goes into the yard and shoots their rifle at random at their barn. Maybe even say the person is drunk, so the holes have no underlying pattern to them. The person then goes to the barn and figures out a way to draw a bullseye after the fact that makes it look like they are a competent sharpshooter.

The fallacy is that if you look at a large enough amount of data with good enough visualization tools, you will probably start to find patterns that aren’t actually there by strategically drawing artificial boundaries. Let’s make the example a bit more real.

Suppose you want to better understand the causes of Disease X, something just discovered that occurs in 10% of the population naturally. You plot the data from a nearby town of 10,000 people to see if you can find a pattern.

Here is the plot (I used a uniform distribution so we know any clumps have no underlying cause):

[Figure: scatter plot of simulated Disease X cases across the town, drawn from a uniform distribution]

Your eye gets drawn to an oddly dense clump of cases of Disease X. You circle it and then run a statistical test to see if the number of cases is significant. You’re shocked! Your properly run statistical test shows you the increased number of cases is significant and with 95% certainty you conclude it isn’t just a fluke.

So what do you do? You start looking for causes. Of course you’ll be able to find one. Maybe that clump of houses has a power station nearby, or they drink from the same well water source or whatever. When you are looking for something in common, you’ll be able to find it.

When this happens, you’ve committed the Texas Sharpshooter Fallacy. It might be okay to use this data exploration to look for a cause if you merely intend to turn it into a hypothesis to be tested. So you hypothesize that it is radon in the water that caused the spike of cases in that cluster.

Now do real science where you run a randomized controlled study to actually test that hypothesis. Doing statistics on big data is risky business, because any clever person can construct correlations from a large enough data set that, first, may not actually be there and, second, are almost surely not causally related.

Another way to think about why this is a fallacy: at a 95% significance level, about 5 out of every 100 tests will falsely find a correlation where none exists. So if your data set is large enough to draw 100 different boundaries, then by random chance about 5 of those will show false correlations. When your eye catches the cluster, that is your brain being good at finding patterns. It probably rejected 100 non-clusters to find that one.
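Here is a rough simulation of that effect (a minimal sketch, not the code behind the plot above): disease status is assigned completely at random at the 10% base rate, the town is carved into 100 blocks, and each block gets its own significance test.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
base_rate = 0.10
people_per_block = 100
n_blocks = 100

# Number of Disease X cases in each block, purely by chance
cases = rng.binomial(people_per_block, base_rate, size=n_blocks)

# One-sided test per block: probability of seeing at least this many cases
# if the true rate really is 10%
p_values = stats.binom.sf(cases - 1, people_per_block, base_rate)

print(int((p_values < 0.05).sum()))   # typically a few "significant" clusters, all noise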

This is scary in today’s world, because lots of news articles do exactly this. They claim some crazy thing, and they use statistics people don’t understand to “prove” its legitimacy (numbers can’t lie don’t you know). But really it is just this fallacy at work. The media don’t want to double check it because “Cancer rate five times higher near power station” is going to get a lot of hits and interest.

Actually, cancer is particularly susceptible to this type of fallacy: dozens of examples of such studies getting publicity despite no actual correlation (let alone causation!) are documented in George Johnson’s (excellent) The Cancer Chronicles and in an older New Yorker article called “The Cancer-Cluster Myth.”

So the next time you read about one of these public health outcries, you should pay careful attention to whether the article has made this fallacy. For example, the “vaccines cause autism” myth also originated this way.

Probably the most egregious example is The China Study, a highly praised vegan propaganda book. It takes the largest diet study ever done (367 variables) and pulls out the correlations that support the hypothesis “meat is bad.”

What the book doesn’t tell you is that the study found over 8000 statistically significant correlations, many contradicting the ones presented in the book. This is why large studies of observational epidemiology always have to be treated with caution. The larger the study, the more likely you will be able to find a way to support your hypothesis.

If you don’t believe me, and you want to protect marriage in Maine, then make sure you eat less margarine this year:

[Figure: spurious correlation chart pairing margarine consumption with marriage statistics in Maine]

Statistical Oddities 5: Sequential Testing

Our next decision theory post is going to be on how to rephrase hypothesis testing in terms of Bayesian decision theory. We already saw in our last statistical oddities post that {p}-values can cause some problems if you are not careful. This oddity makes the situation even worse. We’ll show that if you use a classical null hypothesis significance test (NHST) even at {p=0.05} and your experimental design is to check significance after each iteration of a sample, then as the sample size increases, you will falsely reject the hypothesis more and more.

I’ll reiterate that this is more of an experimental design flaw than a statistical problem, so a careful statistician will not run into it. On the other hand, lots of scientists are not careful statisticians and do make these mistakes. These mistakes don’t exist in the Bayesian framework (advertisement for the next post). I also want to reiterate that the oddity is not that you sometimes falsely reject hypotheses (this is obviously going to happen, since we are dealing with a degree of randomness). The oddity is that as the sample size grows, your false rejection rate will tend to 100%! Usually people think that a larger sample size will protect them, but in this case it exacerbates the problem.

To avoid offending people, let’s assume you are a freshman in college and you go to your very first physics lab. Of course, it will be to drop a ball. You measure how long it takes to fall from various heights. You want to determine whether or not the acceleration due to gravity is really 9.8 m/s². You took a statistics class in high school, so you recall that you can run an NHST at the {p=0.05} level and impress your professor with this knowledge. Unfortunately, you haven’t quite grasped experimental methodology, so you rerun your NHST after each trial of dropping the ball.

When you see {p<0.05} you get excited because you can safely reject the hypothesis! This happens, and you turn in a lab write-up claiming that with greater than {95\%} certainty the true acceleration due to gravity is NOT {9.8}. Let's make the nicest assumptions possible and see that it was still reasonably likely for you to reach that conclusion. Assume {g=9.8} exactly. Also, assume that your measurements are pretty good and hence form a normal distribution with mean {9.8}. I wrote the following code to simulate exactly that:

import random
import matplotlib.pyplot as plt
from scipy import stats

# Generate a single measurement: normally distributed around the true g = 9.8
def norm():
    return random.normalvariate(9.8, 1)

# Run one experiment, returning 1 if it ever falsely rejects the null and 0 otherwise
def experiment(num_samples, p_val):
    x = []

    # One by one we append an observation to our list
    for i in range(num_samples):
        x.append(norm())

        # Rerun a t-test at the p_val significance level after every observation
        # to see if we reject the null hypothesis that the mean is 9.8
        t, p = stats.ttest_1samp(x, 9.8)
        if p < p_val:
            return 1
    return 0

# Check the proportion of false rejections at various sample sizes
rej_proportion = []
for j in range(10):
    f_rej = 0
    for i in range(5000):
        f_rej += experiment(10 * j + 1, 0.05)
    rej_proportion.append(f_rej / 5000)

# Plot the results
axis = [10 * j + 1 for j in range(10)]
plt.plot(axis, rej_proportion)
plt.title('Proportion of Falsely Rejecting the Hypothesis')
plt.xlabel('Sample Size')
plt.ylabel('Proportion')
plt.show()

What is this producing? On the first run, with a single sample, what is the probability that you reject the null hypothesis? Basically {0}, because one data point isn't enough for the test to reach a firm conclusion. With a sample size of around 10, what is the probability that at some point along the way you reject the null hypothesis? It has gone up a bit. On and on this goes, until at a sample size of around 100 you have nearly a 40% chance of rejecting the null hypothesis at some point using this method. This should make you uncomfortable, because this is ideal data where the mean really is 9.8 exactly! The effect isn't coming from imprecise measurements or anything like that.

The trend actually continues, but because the test is rerun after every single new sample (the so-called {n+1} problem), the simulation was taking a while to run, so I cut it off. As you accumulate more and more samples, you become more and more likely to reject the hypothesis:

[Figure: proportion of experiments that falsely reject the null hypothesis, increasing with sample size]

Actually, if you think about this carefully, it isn’t so surprising. The fault is that you recheck whether or not to reject after each sample. Recall that the {p}-value tells you how likely it is to see results at least this extreme by random chance supposing the null hypothesis is true. That probability is never {0}, which means with enough checks you will eventually see such an extreme result. If you have a sample size of {100} and you recheck your NHST after each sample is added, then you give yourself 100 chances to see this randomness manifest rather than checking once with all {100} data points. As your sample size increases, you give yourself more and more chances to see the randomness, and hence as your sample size goes to infinity your probability of falsely rejecting the null hypothesis tends to {1}.

We can modify the above code to track the p-value over a single 1000-sample experiment (the word “trial” in the plot title refers to dropping the ball in the physics experiment). This shows that if you had cut your experiment off almost anywhere and run your NHST once, you would not have rejected the hypothesis. It is only because you kept tracking the p-value until it happened to dip below 0.05 that a mistake was made:

[Figure: the p-value tracked over a single 1000-trial experiment, only briefly dipping below 0.05]
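For completeness, here is a minimal sketch of that modification (same assumptions as the code above):

import random
from scipy import stats
import matplotlib.pyplot as plt

random.seed(0)
x, pvals = [], []
for i in range(1000):
    x.append(random.normalvariate(9.8, 1))
    if len(x) > 1:                       # the t-test needs at least two points
        t, p = stats.ttest_1samp(x, 9.8)
        pvals.append(p)

plt.plot(pvals)
plt.axhline(0.05, color='red')           # the running p-value only occasionally dips below 0.05
plt.xlabel('Sample Size')
plt.ylabel('p-value')
plt.show()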

Statistical Oddities 4: The Base Rate Fallacy

If you are a classically trained scientist, then you probably do an experiment, get some data, run it through SPSS (or something similar), and then, to see whether or not the results are significant, you look at the {p}-value. The standard is that if {p<0.05}, then you consider the result likely to be "real" and not just some random noise making a pattern.

Why is that? Well, here's how you define the {p}-value. Suppose the null hypothesis is true. What is the probability of seeing data at least as extreme as yours? That is the {p}-value. For example, my null hypothesis is that my coin is fair. I do 200 flips and calculate {p=0.04}. Then there is only a 4% chance that I would see results at least that extreme if my coin really is fair.

Danger! Frequently, people try to negate everything and say that "there is only a 4% chance that the coin is fair" (or, less obviously wrong, "we have 96% confidence that the coin is biased"). If you read these posts, then you should immediately see the error. A {p}-value is not the probability of a hypothesis. In fact, there can’t be a 4% (or 96%, or any nonzero) chance of the coin being exactly fair, because no matter what our flips were (they could be 100 and 100 after a 200-flip trial), the probability of the bias being exactly {0.5} is still {0} (just compute the integral of the posterior distribution from {0.5} to {0.5}). Yet you see this language all over scientific papers.

If we want to talk about confidence, then we have a way to do it and it does not involve the {p}-value. I wouldn’t say this is a frequentist vs Bayesian thing, but I think the Bayesian analysis actually makes it harder to make this mistake. Recall that what we did there was use the language that being unbiased was a credible hypothesis given our 95% HDI. What we have confidence about is an interval. Maybe we have 95% confidence that the bias is in the interval {(0.2, 0.8)}. In this case, the hypothesis of being unbiased is credible, but the hypothesis of it being {0.7} is also credible with the same HDI.

Anyway, back to {p}-values. Since I’m using coin flipping as an example, you might think this is silly, but let’s ramp up the experiment. Suppose my lab works for a casino. We make sure coins are fair before passing them on to the casino (for their hot new coin flipping game?). I use a {p}-value threshold of {0.05} as usual. After {100} coins, I expect that I’ve made a mistake on {5} or fewer because of my {p}-value, right? This is the type of interpretation you see all the time! It is clearly wrong.

Suppose {10} of them are biased due to manufacturing errors. Depending on the power of my test (I haven’t talked about power, but as you can imagine it depends on how many flips I use in my trials, among other things), maybe I find 8 of them (this would be a power of {0.8}, which isn’t unreasonable in science). Now recall our definition of the {p}-value: for each of the {90} unbiased coins, I also have a {5\%} chance of incorrectly declaring it biased, which gives roughly {5} false positives. This puts me at identifying {13} biased coins, only {8} of which are actually biased. Despite a {p}-value threshold of {0.05}, I actually only got about {62\%} of my guesses of bias correct (you could calculate this much more easily using Bayes’ theorem).
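Here is that calculation done with Bayes’ theorem (a minimal sketch using the numbers from the paragraph above):

# P(coin is actually biased | my test flagged it as biased)
p_biased = 10 / 100   # base rate: 10 of the 100 coins are biased
power = 0.8           # P(flagged | biased)
alpha = 0.05          # P(flagged | fair), the p-value threshold

p_flagged = power * p_biased + alpha * (1 - p_biased)
print(power * p_biased / p_flagged)   # 0.64, close to the rough 8-out-of-13 count above
                                      # (the small gap is just rounding 4.5 false positives up to 5)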

The above scenario is extremely common in medical science, where it really matters. Suppose you test drugs to see if they work. Your test has {0.8} power, and you use a {p}-value threshold of {0.05} as you’ve been taught. Suppose, as with the coins, only {10} out of the {100} drugs you test actually work. You send {13} to drug manufacturers claiming they work. You think that you are wrong only {5\%} of the time, but in reality {5} out of the {13} drugs you send don’t work! This is extremely dangerous. Of course, these should be weeded out in secondary trials, but who has the time or money to do that? If we think we have {95\%} confidence that a drug works, then we may as well send it out to help people and only do our repeat experiment while it is on the market.

Ignoring the real interpretation of the {p}-value in favor of the more optimistic one is so common it has a name: the base rate fallacy. The high number of false positives comes from the fact that the base rate of a drug working (or of a coin being biased) is so low that you are likely to get false positives even with a high-power test and a small {p}-value threshold. I know this type of thing has been posted all over the internet, but I hadn’t done it yet, and it seemed to fit in with the statistical oddities series. For the record, the example scenario above was taken from Statistics Done Wrong by Alex Reinhart.

A Bayesian Formulation of Occam’s Razor

Today we will formulate Occam’s Razor in Bayesian terms. Recall that this says that if two hypotheses explain the data equally well, then the one with fewer assumptions is to be preferred. Before continuing, we should first get a handle on what this principle is and what the Bayesian reformulation means. First, it is basically a scientific heuristic. The intuitive reason for it is that unnecessary hypotheses are just going to make your model more likely to make mistakes (i.e. it will “overfit”).

What this post is going to do is give a formulation of it in Bayesian terms. This is not a mathematical proof that Occam’s Razor is true or anything, but it will be a proof that under certain mild assumptions the principle falls out as a consequence. That’s what makes this kind of cool. We want to decide whether or not hypothesis A or B is a better statistical model where A and B explain the observed data equally well, but B has an extra parameter.

How should we do this? Well, in probabilistic terms we want to figure out {P(A|D)} and {P(B|D)}, the “probability that {A} is true given the data {D}” and the “probability that {B} is true given the data {D}.” We merely compare these two quantities for example by taking the quotient

\displaystyle \frac{P(A|D)}{P(B|D)}.

If the quotient is near {1}, then they are roughly equally good models. If the quotient is large, then {A} is a better hypothesis and if the quantity is close to {0}, then {B} is the better hypothesis.

Let’s take stock of our assumptions here. We do not assume Occam’s Razor (some people feel like OR is a pretty steep assumption), because it is not a priori clear that it is always the best principle to follow. But here we are merely making the assumption that comparing the probabilities that each model is a true model of the data we observe is a good test for selecting one model over another. It is kind of hard to argue against such a common sense assumption.

Now we use Bayes’ Theorem to convert these quantities to things we actually know about:

\displaystyle \frac{P(A|D)}{P(B|D)} = \frac{P(D|A)P(A)}{P(D|B)P(B)}

At this point we have some difficulty with the {B} hypothesis still, because implicitly we have assumed it relies on some extra parameter {\lambda}. To simplify the argument, we will assume that {\lambda} lies in some range (this isn’t unreasonable because in real life you should have some idea what order of magnitude etc this parameter should be): {\lambda_m \leq \lambda \leq \lambda_M}. We will make a less reasonable simplifying assumption and say that once this range is specified, we have a uniform chance of it being anything in the range, i.e.

\displaystyle P(\lambda |B) = \frac{1}{\lambda_M - \lambda_m}

for {\lambda} in the range and {0} otherwise. There will be an observed {\lambda_0} that maximizes the likelihood function (i.e. fits the data the best). Choose {\delta} so that {\lambda_0 \pm \delta} is an interval giving us reasonable certainty about {\lambda_0} (we could use the 95% HDI if we wanted an explicit interval), and approximate the likelihood near its peak by a Gaussian of that width: {P(D|\lambda, B) \approx P(D|\lambda_0, B)\exp(-(\lambda-\lambda_0)^2/2\delta^2)}. Now let’s work out what is happening for {B}:

\displaystyle P(D|B) = \int P(D, \lambda|B)d\lambda = \int P(D|\lambda, B)P(\lambda |B)d\lambda

\displaystyle =\frac{1}{\lambda_M - \lambda_m}\int P(D|\lambda_0, B)\exp\left(-\frac{(\lambda-\lambda_0)^2}{2\delta^2}\right)d\lambda

\displaystyle =\frac{\delta\sqrt{2\pi}P(D|\lambda_0, B)}{\lambda_M - \lambda_m}

Now we can plug this into our original comparison ratio and use the fact that both are equally good at explaining the data:

\displaystyle \frac{P(A|D)}{P(B|D)}=\frac{(\lambda_M-\lambda_m)P(D|A)}{\delta\sqrt{2\pi}P(D|\lambda_0, B)}

This gives us two main conclusions. The first is that if we assume our two models make roughly equivalent predictions on the data, i.e. {P(D|A)\approx P(D|\lambda_0, B)}, then we should prefer {A}, because the factor {\lambda_M-\lambda_m} in the numerator (the allowed prior range for {\lambda}) will in general be quite a bit larger than {\delta}. This is exactly Occam’s Razor.
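To make the penalty concrete, here is a toy calculation (the range and width below are made-up numbers, chosen only for illustration):

import math

# Suppose both models fit the data equally well, lambda was allowed to range
# over [0, 10] a priori, and the data pins it down to lambda_0 +/- 0.5.
lambda_range = 10.0 - 0.0    # lambda_M - lambda_m
delta = 0.5
fit_ratio = 1.0              # P(D|A) / P(D|lambda_0, B): equally good fits

ratio = lambda_range * fit_ratio / (delta * math.sqrt(2 * math.pi))
print(ratio)   # about 8, so B's fit would need to be roughly 8 times better to break even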

The possibly more interesting consequence is that we now know exactly how much this extra parameter is “penalizing” the theory. So given specific cases we can test whether or not that extra parameter is worth putting in. In other words, are the predictions significantly enough better with the extra parameter to overcome the penalty of introducing an extra complicated hypothesis? This abstract and vague notion from Occam’s Razor gets explicitly quantified in Bayesian analysis so that it is no longer vague or abstract and we can confidently apply Occam’s Razor when it is needed and avoid it when it isn’t.

Statistical Oddities Part 3

This oddity is really hard to get your head around if you’ve been doing standard null-hypothesis testing all your life. This oddity says that null hypothesis significance testing depends on the intentions of the experimenter.

What does this mean? Well, let’s go back to our worked example of flipping a coin and trying to determine whether or not it is biased based on the observed data. Recall that in our Bayesian analysis we took the data, and our test for whether the coin was biased came down to whether 0.5 was a credible value given the posterior distribution. We didn’t need to know anything about the intentions of the person flipping the coin.

How does traditional (read: frequentist) null hypothesis testing work? We set up an experiment in which the experimenter flips the coin 100 times. If we observe 47 heads, then we calculate the probability of a result at least that extreme given the coin is fair. If that probability is below a certain threshold, then we say the coin is biased, because it would be extremely unlikely to see such a result by chance alone. Otherwise we do not reject the null hypothesis and say the coin is fair.

Unfortunately, our probability space depends on the number of total coin flips. The probability space is extremely different if the experimenter set up the experiment so that the number of flips was not predetermined and instead a coin was flipped as many times as possible for 5 minutes. The probability space in this case is much larger because some possible samples would have 90 flips and some would have 110 and so on.

It would also be radically different if the experimenter decided to flip the coin until they reached 47 heads. Then the probability space would again have all sorts of different possibilities for the number of flips. Maybe sometimes you would expect to do 150 flips before seeing 47 heads.
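Here is a small sketch of how this plays out for two of the designs above, using illustrative numbers (not the 47 out of 100 from the example) and doubling the one-sided tail as a crude two-sided p-value:

from scipy import stats

z, N = 7, 24   # illustrative data: 7 heads in 24 flips

# Design 1: the number of flips N was fixed in advance.
# Tail: z or fewer heads out of N flips of a fair coin.
p_fixed_N = 2 * stats.binom.cdf(z, N, 0.5)

# Design 3: the experimenter flipped until the z-th head appeared.
# Tail: needing N or more flips (i.e. N - z or more tails) to get z heads.
p_fixed_z = 2 * stats.nbinom.sf(N - z - 1, z, 0.5)

print(p_fixed_N, p_fixed_z)   # about 0.064 vs 0.035: same data, opposite
                              # decisions at the 0.05 threshold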

Just to reiterate, this isn’t a trivial matter. This says we need to know the intent of the experimenter if we want to do a legitimate null hypothesis significance test. If we don’t know how the experiment was designed, then we don’t know what our probability space should look like to know whether or not we should reject the null hypothesis.

To see why this is shocking, just do the thought experiment where three labs flip the same coin, and each lab sets up its experiment in one of the three ways listed above. All three labs hand you the exact same data, say 47 heads out of 100 flips (or whatever your thought experiment requires). You could rig the numbers so that under some of the designs you decide the coin is fair and under others you decide that it is not. Let’s reiterate: the labs gave you the exact same data, but you came to different conclusions about the fairness of the coin. How is this possible?

If we live in some sort of objective universe where we can do experiments and draw conclusions from them, then the results of an experiment should rely on the data and not on the subjective intentions of the experimenter. More bluntly, determining whether or not the coin is biased should not depend on what is happening in the coin flipper’s mind during the flipping.

This is a very real and dangerous statistical oddity if the person running the analysis isn’t aware of it. In fact, I dare say that this is one of the easy ways to massage data in the sciences to get “results” where none exist. To me, this is actually one of the strongest arguments for scientists to use Bayesian statistics rather than null hypothesis testing. As we saw in the linked post, Bayesian statistics gets around this issue and only needs the raw data and not the intentions of the experimenter.

By the way, before I get sued, I stole this example (with different numbers) from Doing Bayesian Data Analysis by John K. Kruschke. It is a really fantastic book to learn about this stuff.

Statistical Oddities Part 2

Suppose you take in a bunch of data and you make a statistical model out of it. You start making predictions and find that you are wrong a lot of the time. Naturally, your first thought is to go collect a lot more data. Question: Is feeding the model more data necessarily going to improve your prediction record in a significant way?

Intuition tells us that the answer should be yes. I used to think the more data you have, the better your guesses are going to be, even if the model is bad. It turns out that the answer depends on what is causing your error. Nowadays there are tons of ways to measure error, but let’s compare two of them. One of them, which you are probably already familiar with, is called the variance. The other is called the bias.

Bias roughly corresponds to “being biased” towards a certain answer. Your guesses are localized around something that isn’t correct. Some people call this “underfitting.” If your data set comes from a parabola and you use linear regression to model your predictions, then you will see a high bias.

High variance is the opposite. It comes from guesses that are not localized enough. Little changes are causing big swings in your predictions. You are confusing the noise in the data for a real signal.

Thinking about these two vastly different ways your predictions could be going wrong, it turns out that if you are in the high bias case then more data will not improve your predictions. This is just because once you’ve reached a critical amount of data, then the predictions are set. Adding in more data will not update the model to something new or different, because it is biased to give a certain prediction. Thinking back to using linear regression to model data coming from a parabola, your predictions obviously won’t improve just by feeding it more data.
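To see the high-bias case in action, here is a minimal sketch (a made-up setup, just for illustration): fit a straight line to data that really comes from a parabola and watch the error refuse to improve as the sample size grows.

import numpy as np

rng = np.random.default_rng(0)

def linear_fit_error(n):
    x = rng.uniform(-1, 1, n)
    y = x**2 + rng.normal(0, 0.1, n)        # the true relationship is quadratic
    slope, intercept = np.polyfit(x, y, 1)  # deliberately underfit with a straight line
    x_test = np.linspace(-1, 1, 200)
    return np.mean((slope * x_test + intercept - x_test**2) ** 2)

for n in [10, 100, 1000, 10000]:
    print(n, linear_fit_error(n))   # the error levels off near 0.09 instead of shrinking to 0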

If you have a high variance problem, then getting more data will actually help. This is basically because if you make a model that is sensitive to noise on a small data set, then the noise is going to throw your predictions around a lot. But the more data you add, the more that noise is going to cancel itself out and give some better predictions. Of course, this is a brute force fix, and you should actually try to get the variance down so that the model is better, but that is another post.

That’s the oddity for the day. It seems that adding more data should always improve predictions in a statistical model, but it turns out that this is not the case if your error is coming from high bias. This is actually somewhat related to the next oddity, but I think the next one will be much more interesting than this one, so stay tuned.