The Carter Catastrophe

I’ve been reading Manifold: Time by Stephen Baxter. The book is quite good so far, and it presents a fascinating probabilistic argument that humans will go extinct in the near future. It is sometimes called the Carter Catastrophe, because Brandon Carter first proposed it in 1983.

I’ll use Bayesian arguments, so you might want to review some of my previous posts on the topic if you’re feeling shaky. One thing we didn’t talk all that much about is the idea of model selection. This is the most common thing scientists have to do. If you run an experiment, you get a bunch of data. Then you have to figure out the most likely reason for what you see.

Let’s take a basic example. We have a giant tub of golf balls, and we can’t see inside the tub. There could be 1 ball or a million. We’re told the owner accidentally dropped a red ball in at some point. All the other balls are the standard white golf balls. We decide to run an experiment where we draw balls out, one at a time, until we reach the red one.

First ball: white. Second ball: white. Third ball: red. We stop. We’ve now generated a data set from our experiment, and we want to use Bayesian methods to give the probability of there being three total balls or seven or a million. In probability terms, we need to calculate the probability that there are x balls in the tub given that we drew the red ball on the third draw. Any time we see this language, our first thought should be Bayes’ theorem.

Define A_i to be the model of there being exactly i balls in the tub. I’ll use “3” inside of P( ) to be the event of drawing the red ball on the third try. We have to make a finiteness assumption, and although this is one of the main critiques of the argument, we can examine what happens as we let the size of the bound grow. Suppose for now the tub can only hold 100 balls.

A priori, we have no idea how many balls are in there, so we’ll assume all “models” are equally likely. This means P(A_i)=1/100 for all i. By Bayes’ theorem we can calculate:

P(A_3|3) = \frac{P(3|A_3)P(A_3)}{\sum_{i=1}^{100}P(3|A_i)P(A_i)}

= \frac{(1/3)(1/100)}{(1/100)\sum_{i=3}^{100}1/i} \approx 0.09

So there’s around a 9% chance that there are only 3 balls in the tub. The sum in the denominator stays exactly the same when computing P(A_n | 3) for any n and equals about 3.69, and the (1/100) cancels out every time. So we can compute explicitly that for n ≥ 3:

P(A_n|3)\approx \frac{1}{n}(0.27)

This is a decreasing function of n, and this shouldn’t be surprising at all. It says that as we guess there are more and more balls in the tub, the probability of that guess goes down. This makes sense, because it’s unreasonable to think we’d see the red one that early if there are actually 100 balls in the tub.

There are lots of ways to play with this. What happens if our tub could hold millions but we still assume a uniform prior? It just takes all the probabilities down, but the general trend is the same: It becomes less and less reasonable to assume a large total number of balls given that we found the red one so early.
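Here is a minimal sketch of the calculation above in Python, in case you want to play with the bound yourself. The function name `posterior` and its parameters are my own choices for illustration; it assumes the uniform prior and the likelihood P(3|A_n) = 1/n for n ≥ 3 (and 0 otherwise) used in the text.

```python
def posterior(draw, bound):
    """Return a list p where p[n] = P(A_n | red ball on draw number `draw`).

    Assumes a uniform prior over n = 1..bound. Index 0 is unused.
    """
    likelihood = [0.0] * (bound + 1)
    for n in range(draw, bound + 1):
        # With n balls, the red one is equally likely to come out in any position.
        likelihood[n] = 1.0 / n
    prior = 1.0 / bound
    evidence = sum(l * prior for l in likelihood)
    return [l * prior / evidence for l in likelihood]

post_100 = posterior(draw=3, bound=100)
print(post_100[3])    # about 0.09
print(post_100[100])  # about 0.0027, i.e. 0.27 / 100

# A bigger tub pulls every probability down but keeps the 1/n shape.
post_10000 = posterior(draw=3, bound=10_000)
print(post_10000[3])  # about 0.04
```

Changing `bound` only rescales the whole distribution, since the sum in the denominator is the same for every n.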

You could also only care about this “earliness” idea and redo the computations where you ask how likely is A_n given that we found the red ball by the third try. This is actually the more typical way the problem is formulated in the Doomsday arguments. It’s more complicated, but the same idea pops out, and this should make intuitive sense.
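As a rough sketch of that “earliness” version, assume the red ball is equally likely to sit at any position in the drawing order, so that the chance of seeing it within the first k draws is min(k, n)/n when there are n balls. The function name `posterior_by` and the bound of 100 are just illustrative choices.

```python
def posterior_by(draw, bound):
    """Return a list p where p[n] = P(A_n | red ball found within the first `draw` draws).

    Assumes a uniform prior over n = 1..bound. Index 0 is unused.
    """
    likelihood = [0.0] * (bound + 1)
    for n in range(1, bound + 1):
        # With n <= draw balls the red one is certain to have shown up already.
        likelihood[n] = min(draw, n) / n
    prior = 1.0 / bound
    evidence = sum(l * prior for l in likelihood)
    return [l * prior / evidence for l in likelihood]

post = posterior_by(draw=3, bound=100)
print(post[1], post[2], post[3])  # all equal, about 0.077 each
print(post[100])                  # about 0.0023
```

The tiny tubs now get positive probability, but past n = 3 the posterior still falls off like 1/n, which is the same trend as before.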

Part of the reason these computations were somewhat involved is that we tried to get a distribution on the natural numbers. But we could have tried to compare heuristically to get a super clear answer (homework for you). What if we only had two choices “small number of total balls (say 10)” or “large number of total balls (say 10,000)”? With equal priors, you’d find there is around a 99.9% chance that the “small” hypothesis is correct.
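If you’d rather check the homework by machine, here is a small sketch of the two-hypothesis comparison. The function name and the `prior_small` parameter are hypothetical; the likelihood is the same 1/n for drawing the red ball on the third try.

```python
def two_model_posterior(small, large, draw=3, prior_small=0.5):
    """Posterior probability of the "small" model given red on draw number `draw`."""
    like_small = 1.0 / small if small >= draw else 0.0
    like_large = 1.0 / large if large >= draw else 0.0
    numerator = like_small * prior_small
    evidence = numerator + like_large * (1.0 - prior_small)
    return numerator / evidence

# Equal priors on "10 balls" versus "10,000 balls".
print(two_model_posterior(10, 10_000))                   # about 0.999

# Even a prior that favors the large model 9 to 1 barely changes the verdict.
print(two_model_posterior(10, 10_000, prior_small=0.1))  # about 0.99
```

The likelihood ratio is 1000 to 1 in favor of the small model, so it takes a very lopsided prior to overturn it.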

Here’s the leap. Now assume the fact that you exist right now is random. In other words, you popped out at a random point in the existence of humans. So the totality of humans to ever exist are the white balls and you are the red ball. The same type of argument above applies, and it says that the most likely thing is that you weren’t born at some super early point in human history. In fact, it’s unreasonable from a probabilistic standpoint to think that humans will continue much longer at all given your existence.

The “small” total population of humans is far, far more likely than the “large” total population, and the interesting thing is that this remains true even if you mess with the uniform prior. You could assume it is much more likely a priori for humans to continue to make improvements and colonize space and develop vaccines giving a higher prior for the species existing far into the future. But unfortunately the Bayesian argument will still pull so strongly in favor of humans ceasing to exist in the near future that one must conclude it is inevitable and will happen soon!

Anyway. I’m travelling this week, so I’m sorry if there are errors in those calculations. I was in a hurry and never double checked them. The crux of the argument should still make sense even if you don’t get my exact numbers. There are also a lot of interesting and convincing rebuttals, but I don’t have time to get into them now (including the fact that unlikely hypotheses turn out to be true all the time).

Validity in Interpretation Chapter 5

You know the drill by now. These are just notes from my reading of E.D. Hirsch, Jr.’s Validity in Interpretation. We have finally reached the last chapter. The main thrust of this last chapter is on how to tell whether our interpretation is valid. It rehashes a lot of stuff we’ve already covered, and it gives some examples of putting the theory to use.

The first point is that we can often trick ourselves into self-validating an invalid interpretation. Hirsch doesn’t use the term, but this is a direct application of confirmation bias to literary interpretation. If we go into a text thinking it must mean something, then try to find confirmation of this interpretation, we will always find it and will overlook conflicting evidence. This is not the correct way to validate an interpretation (or anything for that matter!).

We are led back to the hermeneutic circle, because some of the evidence will only appear after a hypothesis about the interpretation has been formed. In the next section, Hirsch doesn’t say this, but he essentially argues for a Bayesian theory of interpretation. The process of validation is to take all the hypotheses and then figure out which one is most likely correct based on the evidence. As new evidence comes in, we revise our view.

All that matters are the relative probabilities. Sometimes two interpretations are equally likely, and then we say both are valid. The point is not to have one victorious theory, but to have a way to measure how likely each is in terms of the others.

Personal Note: Whenever someone brings up probabilistic reasoning in the arts (or even history) the same sorts of objections get raised. The assignment of a probability is arbitrary. You can make up whatever priors you want to skew the results in favor of your pet interpretation. These are very recent debates that came decades after this book was published. Surprisingly, Hirsch gives the same answers to these objections that we still give.

First, we already speak in probabilities when analyzing interpretations. I think it is “extremely unlikely” that the word “plastic” means the modern substance in this 1744 poem, because it hadn’t been invented yet. It is “likely” that this poem is about the death of a loved one, because much of Donne’s work is about death. These statements assign relative probabilities to the likelihood of the interpretation, but they try to mask this.

By clearly stating what we are doing, and coming up with actual quantities that can be disputed and argued for, we make our reasoning more explicit and less prone to error. If we pretend that we are not dealing with probabilities, then our arguments and reasoning become sloppy.

As usual, when determining probabilities, we need to figure out the narrowest class that the work under consideration fits in. A good clarifying example is the broad classification of women vs men. Women live longer on average than men. But when we pick a specific woman and a specific man, it would be insane to argue that the woman will probably live longer based only on that broad class. If we note that the woman is a sedentary smoker with lung cancer, and the man is an Olympic marathon runner, then these narrower classes improve our probability judgments.

This was the point of having an entire chapter on genre. We must analyze the intrinsic genre of a work to find the narrowest class that it fits in. This gives us a prior probability for certain types of interpretation. Then we can continue the analysis, updating our views as we encounter more or less evidence.

Hirsch then goes on to talk about the principle of falsifiability as we know it from science. Rather than confirming our hypothesis, we should come up with plausible evidence that would conclusively falsify the interpretation. He goes on to give a bunch of subtle examples that would take a lot of time to explain here. For simplicity, we could go back to the plastic example. If a poem dates before 1907, then any interpretation that requires the substance meaning of the word plastic is false.

He ends the section by reminding us that we always have to think in context. There are no rules of interpretation that can be stated generally and be practical in all situations. There are always exceptions. The interpretive theory in this book is meant as a starting point or provisional guide. This is also true of all methods of interpretation (think of people who always do a “Marxist reading” or “feminist reading” of a text).

I’ll end with a quote:

“While there is not and cannot be any method or model of correct interpretation, there can be a ruthlessly critical process of validation to which many skills and many hands may contribute. Just as any individual act of interpretation comprises both a hypothetical and a critical function, so the discipline of interpretation also comprises the having of ideas and the testing of them.”

Decision Theory 1

Today we’ll start looking at a branch of math called Decision Theory. It uses the types of things in probability and statistics that we’ve been looking at to make rational decisions. In fact, when bias/rationality experiments are done in the social sciences, how close people’s decisions come to these optimal decisions is the baseline definition of rationality.

Today’s post will just take the easiest possible scenarios to explain the terms. I think most of this stuff is really intuitive, but all the textbooks and notes I’ve looked at make this way more complicated and confusing. This basically comes from doing too much too fast and not working basic examples.

Let’s go back to our original problem which is probably getting old by now. We have a fair coin. It gets flipped. I have to bet on either heads or tails. If I guess wrong, then I lose the money I bet. If I guess right, then I double my money. The coin will be flipped 100 times. How should I bet?

Let’s work a few things out. A decision function is a function from the space of random variables {X} (technically we can let {X} be any probability space) to the set of possible actions. Let’s call {A=\{0,1\}} our set of actions where {0} corresponds to choosing tails and {1} corresponds to heads. Our decision function is a function that assigns to each flip a choice of picking heads or tails, {\delta: X \rightarrow A}. Note that in this example {X} is also just a discrete space corresponding to the 100 flips of the coin.

We now define a loss function, {L:X\times A \rightarrow \mathbb{R}}. To make things easy, suppose we bet 1 cent every time. Then our loss is {1} cent every time we guess wrong and {-2} cents if we guess right. Because of the awkwardness of thinking in terms of loss (i.e. a negative loss is a gain) we will just invert it and use a utility function in this case which measures gains. Thus {U=-1} when we guess wrong and {U=2} when we guess right. Notationally, suppose {F: X\rightarrow A} is the function that tells us the outcome of each flip. Explicitly,

\displaystyle U(x_i, \delta(x_i)) = \begin{cases} -1 \ \text{if} \ F(x_i) \neq \delta(x_i) \\ 2 \ \text{if} \ F(x_i) = \delta(x_i) \end{cases}

The last thing we need is the risk involved. The risk is just the expected value of the loss function (or the negative of the expected value of the utility). Suppose our decision function is to pick {0} every time. Then our expected utility is just {100(1/2(-1)+1/2(2))=50}. This makes sense, because half the time we expect to lose and half we expect to win. But we double our money on a win, so we expect a net gain. Thus our risk is {-50}, i.e. there is no risk involved in playing this way!

This is a weird example, because in the real world we have to make our risk function up and it does not usually have negative expected value, i.e. there is almost always real risk in a decision. Also, our typical risk will still be a function. It is only because everything is discrete that some concepts have been combined which will need to be pulled apart later.
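The expected utility and risk for the “always pick tails” decision function can be checked directly with a few lines of Python. This is only a sketch: the names `utility` and `expected_utility` and the `p_heads` parameter are mine, and a decision function is represented simply as a list of 100 guesses.

```python
def utility(outcome, guess):
    """U = 2 for a right guess and U = -1 for a wrong one (in cents, betting 1 cent)."""
    return 2 if outcome == guess else -1

def expected_utility(decision, p_heads=0.5):
    """Sum the per-flip expected utility over a list of guesses (0 = tails, 1 = heads)."""
    total = 0.0
    for guess in decision:
        total += p_heads * utility(1, guess) + (1 - p_heads) * utility(0, guess)
    return total

always_tails = [0] * 100
eu = expected_utility(always_tails)
print(eu)   # 50.0
print(-eu)  # the risk: -50.0
```

Each flip contributes an expected half a cent, and the 100 flips just add up.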

The other reason this is weird is that even though there are {2^{100}} different decision functions (one heads-or-tails choice for each of the 100 flips), they all have the same risk because of the symmetry and independence of everything. In general, each decision function will give a different risk, and they are ordered by this risk. Any minimum risk decision function is called “admissible” and it corresponds to making a rational decision.
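To see both halves of that claim concretely, here is a sketch that enumerates every decision function for a small number of flips and computes each one’s risk. I use 4 flips so the enumeration stays tiny, and the 0.6 probability of heads in the second half is a made-up biased coin, purely to show what happens once the symmetry is broken.

```python
from itertools import product

def utility(outcome, guess):
    """U = 2 for a right guess and U = -1 for a wrong one."""
    return 2 if outcome == guess else -1

def expected_utility(decision, p_heads):
    """Expected utility of a tuple of guesses (0 = tails, 1 = heads)."""
    return sum(
        p_heads * utility(1, guess) + (1 - p_heads) * utility(0, guess)
        for guess in decision
    )

def risks(n_flips, p_heads):
    """Map every one of the 2**n_flips decision functions to its risk."""
    return {
        decision: -expected_utility(decision, p_heads)
        for decision in product((0, 1), repeat=n_flips)
    }

fair = risks(4, 0.5)
print(len(fair), set(fair.values()))  # 16 decision functions sharing one risk value

biased = risks(4, 0.6)                # hypothetical coin that favors heads
best = min(biased, key=biased.get)
print(best)                           # (1, 1, 1, 1): always guess heads
```

With the fair coin every choice is equally good, so the ordering by risk is trivial. With the biased coin the decision functions spread out, and the minimum risk one is the obvious “always bet on the likelier side.”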

I want to point out that if you have the most rudimentary programming skills, then you don’t have to know anything about probability, statistics, or expected values to figure these things out in these simple toy examples. Let’s write a program to check our answer (note that you could write a much simpler program which is only about 5 lines, has no functions, etc to do this):

import random
import numpy as np
import matplotlib.pyplot as plt

def flip():
    # Fair coin: 0 is tails, 1 is heads.
    return random.randint(0, 1)

def simulate(money, bet, choice, length):
    # Play `length` flips, always guessing `choice`.
    for i in range(length):
        tmp = flip()
        if choice == tmp:
            money += 2*bet  # right guess: utility of 2 per cent bet
        else:
            money -= bet    # wrong guess: lose the bet
    return money

results = []
for i in range(1000):
    results.append(simulate(10, 1, 0, 100))

plt.plot(results)
plt.title('Coin Experiment Results')
plt.xlabel('Trial Number')
plt.ylabel('Money at the End of the Trial')
plt.show()

print(np.mean(results))

This Python program runs the given scenario 1000 times. You start with 10 cents and play the betting game with 100 flips. We expect to end with 60 cents (we start with 10 and have an expected gain of 50). The plot shows that sometimes we end with way more, and sometimes we end with way less (in these 1000 we never end with less than we started with, but note that is a real possibility, just highly unlikely):


It clearly hovers around 60. The program then spits out the average after 1000 simulations and we get 60.465. If we run the program a bunch of times we get the same type of thing over and over, so we can be reasonably certain that our above analysis was correct (supposing a frequentist view of probability it is by definition correct).

Eventually we will want to jump this up to continuous variables. This means doing an integral to get the expected value. We will also want to base our decision on data we observe, i.e. inform our decisions instead of just deciding on what to do ahead of time and then plugging our ears, closing our eyes, and yelling, “La, la, la, I can’t see what’s happening.” When we update our decision as the actions happen it will just update our probability distributions and turn it into a Bayesian decision theory problem.

So you have that to look forward to. Plus some fun programming/pictures should be in the future where we actually do the experiment to see if it agrees with our analysis.