A Mind for Madness

Musings on art, philosophy, mathematics, and physics


Markov Chain Monte Carlo Example

Let’s look at a problem called parameter estimation. As usual, we have a bunch of coin flips. What we’ve learned to do with Bayesian statistics is calculate some posterior distribution that tells me how likely the bias {\theta} is. I ask myself, “Is it a credible hypothesis that the coin is fair ({\theta =1/2})?” I find out yes it is. I ask myself, “Is it a credible hypothesis that the coin is massively biased at {\theta=4/5}?” I find out yes it is. Uh oh.

Maybe in abstract math land this type of contradictory information is fine. I should be honest that both are credible guesses based on my data, and Bayesian statistics helps me to be very precise about my degrees of certainty and uncertainty. Unfortunately, in the real world I want to figure out which {\theta} is “best” so that I can use it in my model for practical purposes. This is called parameter estimation, because I want to estimate what my parameter {\theta} should be in my model.

We’re in luck for the coin example, because we only have one parameter living in one-dimensional space. This alone vastly simplifies the situation, but we have something far, far better. Our posterior distribution has a unique maximum, that maximum lies essentially at the mean of the distribution, and it can be determined easily and exactly! This means that we can safely use that parameter as the “best.”

In the real world, we often have several parameters we are trying to estimate in a high-dimensional space, and the posterior is some non-convex crazy thing with lots of local minima and maxima that can’t be determined analytically. Let’s face it: optimization is really hard even in relatively nice situations. The real world is usually not nice.

There often isn’t even an obvious notion of what you mean by the “best” set of parameters. Think of a symmetric bimodal distribution where both peaks have the same max. You don’t really have any good reason to pick one of the points that gives the max, and if you do something like take the mean, then you might end up with a min by accident. The method I’m going to describe doesn’t really help with this issue of “equally good choices,” but it does give a beautiful way to deal with high-dimensional parameter spaces and crazy posterior distributions.

The idea is extremely simple. You will pick some starting collection of parameters. Then you let those parameters randomly move in some direction. We will then use our model to test whether or not it is more or less likely to see the data that we observed under each of those parameter choices. With some probability depending on this likelihood we will move that parameter to that value. This is just a Markov chain process of our {\theta} values moving through the possible parameter values, and hence this technique is called a Markov Chain Monte Carlo (MCMC) method (I used the indefinite article “a” because there are all sorts of variations on this out there).

It turns out that as long as we set this up in a reasonable way, then it will converge. Here’s something cool about this. Your parameters could live in some gigantic space for which it would be impossible to search for a good parameter estimation. Usually there is some much, much smaller dimensional subset of reasonably likely candidates. Once you move to this smaller dimensional set, by the nature of the algorithm, you will stay close to it and hence start moving to something optimal much faster. Here’s a picture showing how the random walks stay on a smaller set in a real example:

Let’s actually implement this in the silly case of the coin example where we know what the answer should be. My next post might try to implement this for some sort of real data set, although that could be more time consuming than I’m willing to do. To make this example more fun, I had the computer pick a random number in {[0,1]} and then generate 100 coin flips with bias equal to that number without telling me the number! This way we are in a realistic situation of not knowing what the “correct” answer is ahead of time.

I got 85 heads and 15 tails. To make computations easier, let’s assume the prior probability is just uniform. This means the posterior is proportional to the likelihood {p(D|\theta)=\theta^{85}\cdot (1-\theta)^{15}}. I’ll start the random walk at {\theta = 0.5}. To know how much to move by, I pick a random number from a normal distribution with mean {0} and standard deviation {0.1}. So if I pick {0.05}, then my candidate place to move to is {0.55}.

I compute {p(D|\theta_{new})/p(D|\theta_{old})} and move to the new spot with probability equal to the minimum of this ratio and {1}. Note that if my new theta value is more likely to be the true theta, then the ratio exceeds {1} and I will always move to the new value. The more unlikely my new theta value is, the less likely it is that I will move there. This implementation is called the Metropolis (or Metropolis-Hastings) algorithm. Note how simple the implementation is. It is only a few lines of code:

import random
import pylab

# Posterior distribution (uniform prior, 85 heads, 15 tails)
def p(theta):
    return (theta**85)*((1-theta)**15)

# Random walk step size
def norm_dist():
    return random.normalvariate(0, 0.1)

# Perform one step of the random walk from spot theta,
# staying inside the interval (0, 1)
def rand_walk(theta):
    x = norm_dist()
    if 0 < theta + x < 1:
        return theta + x
    return theta

# Simulate the random walk for 1000 time steps
walk = [0.5]
for i in range(1000):
    n = walk[-1]
    y = rand_walk(n)
    # Move with probability min(1, p(y)/p(n))
    if random.random() < p(y)/p(n):
        walk.append(y)
    else:
        walk.append(n)

# Plot the results
ylab = [i for i in range(len(walk))]
pylab.plot(walk, ylab)
pylab.title('Random Walk Visualization')
pylab.xlabel('Theta Value')

Note that the key insight that MCMC gives us is that picking values from the posterior is going to be “easy.” Even if we don’t know much about the distribution and have no idea how to explicitly calculate anything from it, we can still perform this random walk. This is what it looks like:


The last step is to actually do a parameter estimation. The whole point is that the walk stays close to the best values, so we can now just average the samples to get {\theta \approx 0.84}. The average is just a finite sum instead of an integral now. Computed analytically, the posterior mean is {86/102\approx 0.843}. Since we know MCMC is just giving us an estimate coming from randomness, this is really quite good!
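The averaging step can be sketched in a couple of lines. The example walk below and the burn-in length of 100 are illustrative assumptions on my part (the run above just averages the whole walk); dropping an initial burn-in segment is a conventional way to keep the starting point from biasing the estimate.

```python
# Average the post-burn-in MCMC samples to estimate theta.
# The burn-in length of 100 is a conventional, illustrative choice.
def estimate(walk, burn_in=100):
    samples = walk[burn_in:]
    return sum(samples) / len(samples)

# A fake walk for illustration: 100 burn-in steps near the start value,
# then samples hovering around the high-posterior region.
walk = [0.5] * 100 + [0.84, 0.85, 0.83, 0.86, 0.82]
print(estimate(walk))  # close to 0.84
```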


Video Games as a Solution to the One-Sided Problem of Art

In October I wrote a post in defense of gaming in which the central argument is a claim that any person who takes experiencing art as an important human experience should consider certain types of games as a worthwhile use of time as well. Some games are basically interactive films, but some are much more interesting and original forms of interactive art. If you close yourself off from this world, then you close yourself off from deep artistic experiences that you can’t get elsewhere.

A few months ago I did two posts on David Foster Wallace, his philosophy of art, and how to get the most out of Infinite Jest.

One of DFW’s central concerns in art was the one-sided nature of art. The artist puts in hundreds of hours of work, and the viewer/reader/whatever passively experiences the work. He thought of the artist/viewer relationship as an honest relationship. If it is completely one-sided, then it is a defunct relationship and you won’t get much out of it for very long. To have a successful relationship, both sides have to be putting in reasonable amounts of work.

This is one way people justify postmodernist writing. You have a bunch of endnotes or footnotes or you pull the reader out of the reading experience in other ways by drawing attention to the fact that they are reading something. You write in stream of consciousness from points of view that change every couple of pages, so that the reader can’t immediately tell what is happening. Whatever the literary device, the idea is that the reader has to put in work.

The point is that the more work the reader puts in, the more they will get out of the experience. Just like in a relationship, the reader has to invest something if they want a meaningful experience. Of course, the relationship becomes one-sided on the other side if the author just uses a random word generator and plops nonsense on the page for the reader to spend months trying to decipher. It needs to be a symbiotic relationship where neither side carries too much of the burden.

I’m going to go out on a limb and say that this problem is a real problem, and what writers, filmmakers, artists, etc have come up with so far merely mitigates the problem. There hasn’t been a really good way to get the viewer to truly participate in and invest in the work of art … until the fairly recent paradigm shift in thinking about games as art.

I’m definitely not the first to propose this, so I won’t spend a lot of time making this into a long post. Now that I’ve blogged around this topic a few times without actually addressing it I thought I would just point out that games are one obvious solution to the problem. They provide an interactive experience where the “player” has to fully invest in the work.

In fact, if artists are scared of the idea that their art will be “played” and hence will not qualify as “serious” (two notions that are extraordinarily hard to define or separate), then they should check out some recent games like To the Moon. The game play is extremely minimal. The player experiences a moving story by progressing through the game. The game play consists of moving around to collect some items and at the end of certain segments of collecting you “solve a puzzle” (sometimes only 2 or 3 clicks of the mouse). Still, this level of interaction is vital to fully immersing you in the story as if you were really the main character. This interaction is impossible with film or literature.


Decision Theory 4: Hypothesis Testing


Now we return to decision theory. It is also a return to the thing that first made me interested in learning something about statistics a year or two ago. I had heard about John Ioannidis’ shocking article “Why Most Published Research Findings Are False” and started to investigate. To me statistics was some settled thing that you hit your data with after doing an experiment. It told you whether or not your findings were real and how confident you could be in them.

Moreover, I believed that as long as you followed the prescriptions taught to you, you couldn’t mess it up. It was foolproof. Just look around and try to find one thing that science hasn’t touched. The scientific method has clearly led us to discover something about the world. That’s why stats seemed like an uninteresting part of math. People seemed to have figured it all out. Then I was shocked to find that article. I started to learn about all these fallacies, and methodological problems that I’ve been pointing out over the past few months.

One of the main difficulties, particularly in science, is classical null hypothesis significance testing (NHST). One way to try to mitigate these difficulties is to rephrase our hypothesis test as a Bayesian decision theory problem. This is not the only Bayesian reformulation (Kruschke’s MCMC stuff is pretty cool which I might get to someday), but it fits in as a nice example of the use of decision theory outside of the silly gambling problems I’ve been using.

Let’s start by seeing how to test a point null hypothesis. Think about the biased coin example. We want to test {\theta=1/2}, i.e. is the coin unbiased? This is obviously a ridiculous type of hypothesis test, because the term “unbiased” in real life encompasses a range {(1/2-\varepsilon, 1/2+\varepsilon)} where we can’t tell the difference. This is actually the case in most scientific situations as well (there is only so much precision your instruments can achieve), and often scientists incorrectly use a point NHST when there should be a ROPE (region of practical equivalence).
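As a rough sketch of the ROPE idea: decide by estimating how much posterior mass falls inside the region of practical equivalence around {\theta=1/2}. The Beta(86, 16) posterior here is the one from the coin example earlier (85 heads, 15 tails, uniform prior), and {\varepsilon = 0.05} is an illustrative choice of mine.

```python
import random

# Estimate the posterior probability that theta lies in the ROPE
# (1/2 - eps, 1/2 + eps) by sampling the Beta(86, 16) coin posterior.
# eps = 0.05 is an illustrative choice, not a universal standard.
rng = random.Random(0)
eps = 0.05
samples = [rng.betavariate(86, 16) for _ in range(100000)]
in_rope = sum(1 for t in samples if 0.5 - eps < t < 0.5 + eps)
print(in_rope / len(samples))  # essentially 0: a fair coin is not credible here
```

With 85 heads in 100 flips, virtually no posterior mass sits near {1/2}, so the ROPE decision is to reject practical equivalence with a fair coin.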

Our first step is to take the previous paragraph’s discussion and cheat a little. Suppose we want to test {\theta = \theta_0}. The Bayesian way to do this would work out of the box using a ROPE. Unfortunately, if we want continuous densities for the probabilities, then we will always reject our null hypothesis. This is because a point has probability zero. The cheat is to just convert the continuous prior, {\pi(\theta)}, to a piecewise defined prior where we assign a point mass of probability

\displaystyle \pi_0 = \displaystyle \int_{\theta_0-\varepsilon}^{\theta_0+\varepsilon} \pi(\theta)d\theta

to {\theta_0} and the renormalized old prior otherwise. This is merely saying that we make a starting assumption that {\theta} has true value {\theta_0} with probability {\pi_0}, and hence no actual integral needs to be calculated. That is just for intuitive justification for the shape of {\pi}. If this makes you uncomfortable, then use the uninformed prior of {\theta=\theta_0} having probability {1/2} and the alternative having a uniform distribution of mass 1/2.
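For intuition, here is a sketch of that point-mass computation using a simple midpoint Riemann sum (a numerical integrator like `scipy.integrate.quad` would be the standard tool). The uniform prior and {\varepsilon = 0.05} below are illustrative assumptions, not choices made in the post.

```python
# Approximate pi_0 = integral of the prior over [theta_0 - eps, theta_0 + eps]
# with a midpoint Riemann sum. The uniform prior and eps = 0.05 below are
# illustrative assumptions.
def point_mass(prior, theta_0, eps, n=1000):
    a, b = theta_0 - eps, theta_0 + eps
    h = (b - a) / n
    return sum(prior(a + (i + 0.5) * h) for i in range(n)) * h

uniform_prior = lambda theta: 1.0  # uniform prior on [0, 1]
print(point_mass(uniform_prior, 0.5, 0.05))  # mass of a width-0.1 window: 0.1
```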

Let’s recap what we are trying to do. We have two hypotheses. The null which is {H_0: \theta=\theta_0}, and the alternative {H_1: \theta\neq \theta_0}. This type of NHST came up in the last post where we wanted to experimentally test whether or not the acceleration due to gravity was {g=9.8}. Our process should be clear if you’ve been following this sequence of posts. We just use our data to calculate the posterior distributions {P(H_0|x)} and {P(H_1|x)}. We must decide between these two by seeing which one has less risk (and that risk will come from a loss function which appropriately penalizes falsely accepting/rejecting each one).

This approach is really nice, because depending on your situation you will want to penalize differently. If you are testing a drug for effectiveness, then it is better to harshly penalize falsely claiming a placebo to be effective (a false positive or Type I error). If you are testing whether or not someone has a fatal disease, then you want to harshly penalize falsely claiming they have it and having them undergo dangerous and expensive unnecessary treatments. Maybe these aren’t the best examples, but you see how having a flexible system could be a lot more useful than blindly running a {p=0.05} NHST.

Rather than going through some made up example from fake randomly generated data as I’ve been doing, let’s examine some differences at the theoretical level when we assume everything is normal. Suppose our data is a sample of {n} points from a normal distribution. Any book on Bayesian statistics will have the details on working this out, so I’ll get to the punch line.

If we denote {m(x)} the marginal density, then the posterior distribution for {H_0} is given by

\displaystyle \frac{f(x|\theta_0)\pi_0}{m(x)}.

In the normal distribution case (we assume the prior has standard deviation {\tau}, the data has standard deviation {\sigma}, and both have mean {\theta_0}) we get something much more specific:

\displaystyle \left(1+\frac{(1-\pi_0)}{\pi_0}\cdot \frac{\exp\left(\frac{1}{2}z^2[1+\sigma^2/(n\tau^2)]^{-1}\right)}{(1+n\tau^2/\sigma^2)^{1/2}}\right)^{-1}

where {z=\frac{\sqrt{n}|\overline{x}-\theta_0|}{\sigma}}. This term actually appears in classical NHST as well. Let’s look at the differences. For the purpose of getting some numbers down, let’s assume {\pi_0=1/2} and {\sigma=\tau}. In a two-tailed test, let’s assume that we observe a {p=0.01} and hence would very, very strongly and confidently reject {H_0}. This corresponds to a {z}-value of {2.576}. In this case if {n} is small, i.e. in the 10-100 range, then the posterior is around {0.14} to {0.27}. This means that we would likely want to reject {H_0}, because it is quite a bit more unlikely than {H_1} (this will of course depend on the specifics of our loss function).

Shockingly, if {n} is large, we lose a lot of confidence. If {n=1000}, then the posterior for {H_0} is {0.53}. Whoops. The Bayesian approach says that {H_0} is actually more likely to be true than {H_1}, but our NHST gives us {p=0.01} level confidence for rejecting (i.e. there is a 99% chance that our data observations were not a fluke chance and the result that causes us to reject {H_0} is real).
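The numbers quoted above can be checked directly from the formula. A quick sketch (the function name is mine), specialized to {\pi_0 = 1/2} and {\sigma = \tau} so that {\sigma^2/(n\tau^2) = 1/n}:

```python
import math

# Posterior probability of H_0 from the formula above,
# specialized to pi_0 = 1/2 and sigma = tau.
def posterior_H0(z, n, pi_0=0.5):
    odds = (1 - pi_0) / pi_0
    num = math.exp(0.5 * z**2 / (1 + 1.0 / n))
    den = math.sqrt(1 + n)
    return 1.0 / (1 + odds * num / den)

z = 2.576  # z-value corresponding to a two-tailed p = 0.01
for n in [10, 100, 1000]:
    print(n, round(posterior_H0(z, n), 2))  # 0.14, 0.27, 0.53
```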

As we see, by working with the Bayesian framework, we get posterior probabilities for how likely {H_0} and {H_1} are given our observations of the data. This allows us to do a suitable analysis. The classical framework feels very limited, because even when we get extreme {p}-values that give us lots of confidence, we could accidentally be overlooking something that would be obvious if we worked directly with how likely each is to be true.

To end this post, I’ll just reiterate that careful scientists are completely aware that a {p}-value is not to be interpreted as a probability against {H_0}. One can certainly apply classical methods and end with a solid analysis. On the other hand, this sloppiness, or, less generously, this misunderstanding of what is going on, is quite widespread.


Statistical Oddities 5: Sequential Testing

Our next decision theory post is going to be on how to rephrase hypothesis testing in terms of Bayesian decision theory. We already saw in our last statistical oddities post that {p}-values can cause some problems if you are not careful. This oddity makes the situation even worse. We’ll show that if you use a classical null hypothesis significance test (NHST) even at {p=0.05} and your experimental design is to check significance after each iteration of a sample, then as the sample size increases, you will falsely reject the hypothesis more and more.

I’ll reiterate that this is more of an experimental design flaw than a statistical problem, so a careful statistician will not run into the problem. On the other hand, lots of scientists are not careful statisticians and do make these mistakes. These mistakes don’t exist in the Bayesian framework (advertisement for the next post). I also want to reiterate that the oddity is not that you sometimes falsely reject hypotheses (this is obviously going to happen, since we are dealing with a degree of randomness). The oddity is that as the sample size grows, your false rejection rate will tend to 100% ! Usually people think that a higher sample size will protect them, but in this case it exacerbates the problem.

To avoid offending people, let’s assume you are a freshman in college and you go to your very first physics lab. Of course, it will be to let a ball drop. You measure how long it takes to drop at various heights. You want to determine whether or not the acceleration due to gravity is really 9.8. You took a statistics class in high school, so you recall that you can run an NHST at the {p=0.05} level and impress your professor with this knowledge. Unfortunately, you haven’t quite grasped experimental methodology, so you rerun your NHST after each trial of dropping the ball.

When you see {p<0.05} you get excited because you can safely reject the hypothesis! This happens and you turn in a lab write-up claiming that with greater than {95\%} certainty the true acceleration due to gravity is NOT {9.8}. Let's make the nicest assumptions possible and see that it was still likely for you to reach that conclusion. Assume {g=9.8} exactly. Also, assume that your measurements are pretty good and hence form a normal distribution with mean {9.8}. The following code simulates exactly that:

import random
import pylab
from scipy import stats

#Generate a normal sample observation
def norm():
    return random.normalvariate(9.8, 1)

#Run the experiment, return 1 if it ever falsely rejects and 0 otherwise
def experiment(num_samples, p_val):
    x = []
    #One by one we append an observation to our list
    for i in range(num_samples):
        x.append(norm())
        #The t-test needs at least two observations
        if len(x) < 2:
            continue
        #Run a t-test at p_val significance to see if we reject the hypothesis
        t, p = stats.ttest_1samp(x, 9.8)
        if p < p_val:
            return 1
    return 0

#Check the proportion of falsely rejecting at various sample sizes
rej_proportion = []
for j in range(10):
    f_rej = 0
    for i in range(5000):
        f_rej += experiment(10*j + 1, 0.05)
    rej_proportion.append(f_rej / 5000)

#Plot the results
axis = [10*j + 1 for j in range(10)]
pylab.plot(axis, rej_proportion)
pylab.title('Proportion of Falsely Rejecting the Hypothesis')
pylab.xlabel('Sample Size')

What is this producing? On the first run of the experiment, what is the probability that you reject the null hypothesis? Basically {0} because the test knows that this isn't enough data to make a firm conclusion. If you run the experiment 10 times, what is the probability that at some point you reject the null hypothesis? It has gone up a bit. On and on this goes up to 100 trials where you have nearly a 40% chance of rejecting the null hypothesis using this method. This should make you uncomfortable, because this is ideal data where the mean really is 9.8 exactly! This isn't coming from imprecise measurements or something.

The trend will actually continue, but because the test is rerun after every single sample the cost grows quadratically in the sample size, and this was taking a while to run, so I cut it off. As you accumulate more and more samples, you will be more and more likely to reject the hypothesis:


Actually, if you think about this carefully it isn’t so surprising. The fault is that you recheck whether or not to reject after each sample. Recall that the {p}-value tells you how likely it is to see results at least this extreme by random chance under the assumption that the null hypothesis is true. But that probability is not {0}, which means that with enough checks you will eventually see such a fluke. If you have a sample size of {100} and you recheck your NHST after each sample is added, then you give yourself 100 chances to see this randomness manifest rather than checking once with all {100} data points. As your sample size increases, you give yourself more and more chances to see the randomness, and hence as your sample size goes to infinity your probability of falsely rejecting the hypothesis tends to {1}.

We can modify the above code to just track the p-value over a single 1000 sample experiment (the word “trial” in the title was meant to indicate dropping a ball in the physics experiment). This shows that if you cut your experiment off almost anywhere and run your NHST, then you would not reject the hypothesis. It is only because you incorrectly tracked the p-value until it dipped below 0.05 that a mistake was made:
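A sketch of that modification (the function name and the fixed seed are my own choices; it reuses the same t-test as the code above):

```python
import random
from scipy import stats

# Track the t-test p-value as observations accumulate over a single
# 1000-sample experiment. The data really does have mean 9.8, as before.
def p_value_trace(num_samples=1000, mean=9.8, sd=1.0, seed=0):
    rng = random.Random(seed)
    x, p_values = [], []
    for _ in range(num_samples):
        x.append(rng.normalvariate(mean, sd))
        if len(x) > 1:  # the t-test needs at least two observations
            t, p = stats.ttest_1samp(x, mean)
            p_values.append(p)
    return p_values

trace = p_value_trace()
# A sequential tester stops the first time the trace dips below 0.05;
# a single test at the end would just look at trace[-1].
print(min(trace), trace[-1])
```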



Decision Theory 3

If you could follow the last post then you have all the pieces you need to understand the basic theory. Let’s go back and actually work this out in the abstract now that we have an example for our template. If you only care about seeing some examples in action, then you should feel free to skip this post which will be almost entirely defining the pieces of the last post more rigorously. We will need to revise some things from the first post, because we were able to state things in a simpler form without Bayesian updating or continuous distributions happening.

Last time we introduced a one-parameter family of unknowns, the true bias of the coin. We denoted this {\theta}. For now we’ll keep this to just be some continuous real-valued parameter and it will represent an unknown quantity in our model. If you haven’t thought about this before, then I recommend continuing in the way we did last post. You pretend like {\theta} is some fixed known quantity and run classical decision theory. From there you extrapolate. The value of this parameter could be this, or this, or this, and my decision has to be the best no matter what it really is.

In the future, there could be a whole bunch of unknowns and {\theta} will turn into a vector or matrix, but for now we’ll stick to just a single variable. To pin down terminology, we will call {\theta} the parameter and {\Theta} the parameter space (all the possible values of {\theta}). So in our coin example {\Theta = [0,1]}.

We also have a collection of actions: {A}. An individual action will be denoted by {a}. For the coin example, an action would be betting on heads or tails. We will never be able to know {\theta} … because it is an unknown, but we will want to make observations/gather data which will be denoted {X}. In the coin example, this would be our observed sequence of flips (so it is probably best represented as a vector). We will denote the collection of all possible observations {\mathcal{X}} and this is called the sample space. In the coin example, we flipped the coin {100} times, so this consists of {2^{100}} vectors. In general, we will want to allow {X} to be continuous random variables and hence {\mathcal{X}} could be subsets of {\mathbb{R}^n}.

Let {I\subset \mathcal{X}} (suggestively we will often want to consider an “interval” {[a,b]\subset \mathbb{R}} if we just have one continuous random variable). As I already pointed out earlier, we will often want to take the view of a given fixed {\theta}. In this situation we will assume for the purposes of being able to analyze things that we always have an integrable probability distribution {f(x|\theta)} which is “the probability of observing x given {\theta}“. Thus, by definition, the probability of observing {I} given {\theta} is just the integral:

\displaystyle P_{\theta}(I)=\int_I f(x|\theta)dx

I won’t adopt the cumbersome notation that some texts use to indicate that this could be an integral or a finite sum. I will just use the integral, and assume the reader can translate to the appropriate sum if {\mathcal{X}} is discrete. If we have some function {h(X)}, then we define the expected value of {h(X)} over {\mathcal{X}} to be

\displaystyle E_{\theta}[h(X)] = \int_{\mathcal{X}}h(X)f(x|\theta)dx

Now that that is settled, let’s formalize the decision function, loss, and risk. Suppose that we have some prior probability distribution describing the possibilities for {\theta}. We denote this {\pi(\theta)}. The choice of such a thing in the absence of any actual prior knowledge is one of the main (only?) arguments against Bayesian statistics. This shouldn’t be distressing, because any reasonable experiment will have a large enough sample size that an uninformed uniform prior is easily overcome by the data.

In the first decision theory post, we made a decision rule without basing it on any data. This is why we need to change our definition a little. In that situation a decision rule is equivalent to picking an action. If observing some data is involved, then our decision rule is a function {\delta: \mathcal{X}\rightarrow A}. This should just be read, “If I observe this type of data, then I will act in this way.” You let the data inform your decision. Our decision rule in the coin example was to look at the ratio of heads to tails. If there were more heads we pick heads. If there were more tails, we pick tails.

The loss function is a function {L: \Theta\times A \rightarrow \mathbb{R}}. This is the choice that people should feel a little uncomfortable with, because there is a definite choice that may or may not be reasonable affecting everything. The value {L(\theta, a)} should measure the loss that will be incurred if you do action {a} and {\theta} is the true value of the unknown.

We won’t worry so much about this right now. The more important one for us is the decision loss function {L:\Theta\times \mathcal{X}\rightarrow \mathbb{R}}. This is just plugging in to the other one: {L(\theta, \delta(x))}. Sometimes we just start with this one though. This was a no-brainer for our coin example, because I purposely set up the question to have a natural loss function. This was due to the fact that a well-defined “bet” was being made. In more general situations, the choice of a loss function could be seen as essentially equivalent to picking a betting scheme for your choices. You could easily come up with some wacky ones to see that it might not reflect reality if you aren’t careful.

To me the more “intuitive” notion is that of the risk function. This is the expected value of the loss:

\displaystyle R(\theta, \delta)=E_{\theta}[L(\theta, \delta(X))] = \int_\mathcal{X} L(\theta, \delta(X))f(x|\theta)dx

Note we integrate out the random variable {x}, but we are left with a function of {\theta}. We saw this in our coin example last time. We get a similar thing for the Bayesian risk, but we incorporate the prior probability of {\theta}. Lots of times it is actually somewhat easier to just jump right to the risk, because in the case of squared-error loss (defined below) we just get that the risk is the variance of the posterior distribution. No extra intermediary calculations are needed.
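For instance, here is a quick Monte Carlo check that under squared-error loss, deciding on the posterior mean incurs an expected posterior loss equal to the posterior variance. The Beta(86, 16) posterior is the one from the coin example (85 heads, 15 tails, uniform prior); the sampling setup itself is my own sketch.

```python
import random

# Under squared-error loss, the expected posterior loss of the posterior
# mean equals the posterior variance. Check this by sampling from the
# Beta(86, 16) coin posterior.
rng = random.Random(0)
samples = [rng.betavariate(86, 16) for _ in range(100000)]
post_mean = sum(samples) / len(samples)
expected_loss = sum((t - post_mean)**2 for t in samples) / len(samples)
exact_variance = 86 * 16 / ((86 + 16)**2 * (86 + 16 + 1))  # Beta variance
print(expected_loss, exact_variance)  # these should agree closely
```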

In general, most loss functions will be a variant on one of two types. The first is called the squared-error loss function. It is given by {L(\theta, a)=(\theta-a)^2}. You can think of this as “least-squares” fitting your decision or minimizing risk in the {L^2}-norm. The other is called the {0-1} loss function. This one arises quite naturally when you just have to pick between two choices like the coin flip. Ours was a variant on this. It penalizes you by {1} unit if your “decision is incorrect” and doesn’t penalize you at all if your “decision is correct.” It is given by

\displaystyle L(\theta, a_i)=\begin{cases} 0 & \text{if} \ \theta\in \Theta_i \\ 1 & \text{if} \ \theta\in \Theta_j \ \text{for} \ i\neq j\end{cases}

The beautiful thing about this one is that the risk is just {1} minus the posterior distribution. Thus, it is minimized at the max of the posterior which is often really easy to calculate. In the coin example, we got the beta distribution and hence the max was just the mean. Of course, we have to be careful that we are measuring the right thing, because we aren’t trying to predict the true bias. We were merely trying to predict heads or tails so that situation was an even easier discrete version.

Lastly, there is a partial ordering on decision functions given by {\delta_1 \leq \delta_2} if and only if {R(\theta, \delta_1) \leq R(\theta, \delta_2)} for all {\theta}. A minimal element in this ordering is called admissible and corresponds to a rational decision. If you make some other decision you are just asking to lose more.

Well, I think this post has gone on long enough (I’ve basically been trapped at the airport for the past 8 hours, so …). We’ll get back to some examples of all this next time. I just needed to finally formalize what we were doing before going any further.


Decision Theory 2

Now we will move on to a far, far more complex version of the same problem from before. Recall last time we worked with a fair coin. We want to make guesses that minimize our loss (or maximize our utility). The assumption that the coin was fair basically nullified having to do any analysis. No matter what decision function we picked, we would have the same expected loss, i.e. there is no way to do better than random guessing.

Let’s introduce the complexity of an extra parameter slowly through an example. Let’s suppose again that the coin is fair, but we don’t know that ahead of time. We have no idea what the bias of the coin is. We’ve already analyzed how to model this situation in our Bayesian statistics example.

If we observe {n} heads and {m} tails, we have a probability distribution describing the likelihood of the possible biases. We found this to be the beta distribution {B(n+1, m+1)}. If we start with a uniform, uninformed prior, then we could use Bayesian statistics to update our decision rule after each flip. This should make intuitive sense, because if the bias of the coin is 0.9, we should quickly see the posterior distribution reflect this and we will start getting most of our guesses correct.

Thus, the most naive thing to do is to look at the mean of the posterior distribution: {\frac{n+1}{n+m+2}}. If this number is bigger than {0.5}, then we guess heads because our Bayesian posterior predicts heads is coming up more frequently. If it is less than {0.5}, then we guess tails. If it equals {0.5}, then we make a random guess. Note that as long as the true bias is not {0.5}, we should be able to tell this with statistics after sufficiently many flips which will give us a better expected loss (i.e. risk) than random guessing. Let’s try two examples to see what happens.

I won’t post the code or the graph of what happens if the true bias is {0.5}, because our previous analysis shows the result to be exactly the same independent of our decision function. Thus our more complicated decision rule doesn’t actually do anything to improve our guesses there. As a second example, we can mildly modify the previous code to see what happens with a {0.75} bias:

import random
import numpy as np
import pylab

def flip(true_bias):
    # returns 1 (heads) with probability true_bias, else 0 (tails)
    rand = random.random()
    if rand > true_bias:
        return 0
    return 1

def simulate(money, bet, true_bias, num_flips):
    num_heads = 0
    est_bias = 0.5
    for i in range(num_flips):
        # make a choice based on the Bayesian posterior mean
        if est_bias >= 0.5:
            choice = 1
        else:
            choice = 0
        # flip the coin
        rand = flip(true_bias)
        # keep track of the number of heads
        num_heads += rand
        # update the estimated bias: mean of the Beta posterior
        est_bias = float(num_heads + 1) / (i + 3)
        # check whether or not the choice was correct
        if choice == rand:
            money += 2 * bet
        else:
            money -= bet
    return money

results = []
for i in range(1000):
    results.append(simulate(10, 1, 0.75, 100))

pylab.title('Coin Experiment Results')
pylab.xlabel('Trial Number')
pylab.ylabel('Money at the End of the Trial')
pylab.plot(results, 'o')
pylab.show()

print(np.mean(results))

The program says we end with {134.3} cents on average. Since we started with {10} cents, that is a gain of about {124.3} cents, pretty close to the expected {125} cents, as opposed to the {50} cents we make off of the {0.5} bias. These numbers should not be mysterious, because in the long run we expect to start guessing heads, which will occur {3/4} of the time. Thus our expected gain is {100((-1/4)+(3/4)\cdot 2)=125}. Here’s the plot of the experiment:


This should feel a little weird, because with this decision rule we expect to always do better than (or equal to) our previous example. But this example is more realistic, because we don’t assume to know the bias of the coin! How could we do better with “less” information? That is the power of Bayesian decision theory which allows you to update your decision rule as you observe more information.

The classical admissible decision of always picking heads will do better if the bias is towards heads because we don’t have to wait for our posterior to tell us to pick heads, but it will do terrible if the bias is towards tails because even once we see that we get mostly tails we are not allowed to change our decision rule.

Let’s go back to our experiment of 100 coin flips. If {\theta} is the true bias of the coin, then the negative of the risk (the expected value of the utility function) of our Bayesian naive decision rule is

{-R(\theta) = \begin{cases} 100 (3\theta -1) & \ \text{if} \ \theta \geq 0.5 \\ 100(2-3\theta) & \text{if} \ \theta < 0.5\end{cases}.}

We've now successfully incorporated our new parameter. The risk will in general depend on this parameter. The function is just a "V" when graphed, and the risk of a fixed rule like always picking heads is just a straight line {-R(\theta)=100(3\theta-1)}. It matches the Bayesian rule on the right piece, but is strictly below it on the left half. This shows that no matter the bias of the coin, the naive Bayesian decision rule does at least as well as a fixed guess.
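To make the comparison concrete, here is a minimal check of my own (assuming, as the formulas do, that the rule has already settled on its long-run guess):

```python
# Negative risk (expected utility over 100 flips) of the naive Bayesian
# rule once it has settled: guess heads for theta >= 0.5, tails otherwise.
# Compared against the fixed always-heads rule.

def neg_risk_bayes(theta):
    if theta >= 0.5:
        return 100 * (3 * theta - 1)   # guessing heads
    return 100 * (2 - 3 * theta)       # guessing tails

def neg_risk_always_heads(theta):
    return 100 * (3 * theta - 1)

print(neg_risk_bayes(0.75))         # 125.0, the expected gain at bias 0.75
print(neg_risk_bayes(0.25))         # 125.0, by symmetry
print(neg_risk_always_heads(0.25))  # -25.0, a fixed guess loses here
```

The experiment above gave a gain of about 124.3 rather than exactly 125 because the first few flips are spent learning the bias before the rule settles down.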

Last post I said we could order the decision functions based on risk, and then we just call a minimum in the ordering admissible. Now we have to be more careful. With this extra parameter we only get a partial ordering, by checking whether or not the risk is greater pointwise for every {\theta}. As just pointed out, the Bayesian decision function sits below both random guessing and always picking heads in the ordering (so those pairs are comparable!). The question is, how do we know whether or not it is a minimum? Is this the best we can do? Is this naive decision rule admissible?

We will dig a little more into the theory next time about how those risk functions were computed (I just told you what they were which matched our experiments), and how to actually prove that a certain decision is admissible in this more complicated situation.


Decision Theory 1

Today we’ll start looking at a branch of math called Decision Theory. It uses the types of things in probability and statistics that we’ve been looking at to make rational decisions. In fact, in the social sciences, when bias/rationality experiments are done, seeing how closely people’s decisions match these optimal decisions is the baseline definition of rationality.

Today’s post will just take the easiest possible scenarios to explain the terms. I think most of this stuff is really intuitive, but all the textbooks and notes I’ve looked at make this way more complicated and confusing. This basically comes from doing too much too fast and not working basic examples.

Let’s go back to our original problem which is probably getting old by now. We have a fair coin. It gets flipped. I have to bet on either heads or tails. If I guess wrong, then I lose the money I bet. If I guess right, then I double my money. The coin will be flipped 100 times. How should I bet?

Let’s work a few things out. A decision function is a function from the space of random variables {X} (technically we can let {X} be any probability space) to the set of possible actions. Let’s call {A=\{0,1\}} our set of actions where {0} corresponds to choosing tails and {1} corresponds to heads. Our decision function is a function that assigns to each flip a choice of picking heads or tails, {\delta: X \rightarrow A}. Note that in this example {X} is also just a discrete space corresponding to the 100 flips of the coin.

We now define a loss function, {L:X\times A \rightarrow \mathbb{R}}. To make things easy, suppose we bet 1 cent every time. Then our loss is {1} cent every time we guess wrong and {-2} cents if we guess right. Because of the awkwardness of thinking in terms of loss (i.e. a negative loss is a gain) we will just invert it and use a utility function in this case which measures gains. Thus {U=-1} when we guess wrong and {U=2} when we guess right. Notationally, suppose {F: X\rightarrow A} is the function that tells us the outcome of each flip. Explicitly,

\displaystyle U(x_i, \delta(x_i)) = \begin{cases} -1 \ \text{if} \ F(x_i) \neq \delta(x_i) \\ 2 \ \text{if} \ F(x_i) = \delta(x_i) \end{cases}

The last thing we need is the risk involved. The risk is just the expected value of the loss function (or the negative of the expected value of the utility). Suppose our decision function is to pick {0} every time. Then our expected utility is just {100(1/2(-1)+1/2(2))=50}. This makes sense, because half the time we expect to lose and half we expect to win. But we double our money on a win, so we expect a net gain. Thus our risk is {-50}, i.e. there is no risk involved in playing this way!
This is a weird example, because in the real world we have to make our risk function up and it does not usually have negative expected value, i.e. there is almost always real risk in a decision. Also, our typical risk will still be a function. It is only because everything is discrete that some concepts have been combined which will need to be pulled apart later.
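The expected-value arithmetic above is quick to sanity-check (a trivial sketch of my own):

```python
# Expected utility of any fixed guess on a fair coin: win 2 with
# probability 1/2, lose 1 with probability 1/2, over 100 flips.

per_flip = 0.5 * 2 + 0.5 * (-1)   # 0.5 cents expected gain per flip
total = 100 * per_flip
print(total)  # 50.0, so the risk is -50
```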

The other reason this is weird is that even though there are {2^{100}} different decision functions, they all have the same risk because of the symmetry and independence of everything. In general, each decision function will give a different risk, and they are ordered by this risk. Any minimum risk decision function is called “admissible”, and it corresponds to making a rational decision.

I want to point out that if you have the most rudimentary programming skills, then you don’t have to know anything about probability, statistics, or expected values to figure these things out in these simple toy examples. Let’s write a program to check our answer (note that you could write a much simpler program, only about 5 lines with no functions, to do this):

import random
import numpy as np
import pylab

def flip():
    # fair coin: 0 (tails) or 1 (heads) with equal probability
    return random.randint(0, 1)

def simulate(money, bet, choice, length):
    for i in range(length):
        tmp = flip()
        if choice == tmp:
            money += 2 * bet
        else:
            money -= bet
    return money

results = []
for i in range(1000):
    results.append(simulate(10, 1, 0, 100))

pylab.title('Coin Experiment Results')
pylab.xlabel('Trial Number')
pylab.ylabel('Money at the End of the Trial')
pylab.plot(results, 'o')
pylab.show()

print(np.mean(results))

This Python program runs the given scenario 1000 times. You start with 10 cents. You play the betting game with 100 flips. We expect to end with 60 cents (we start with 10 and have an expected gain of 50). The plot shows that sometimes we end with way more and sometimes with way less (in these 1000 trials we never end with less than we started with, but note that is a real possibility, just a highly unlikely one):


It clearly hovers around 60. The program then spits out the average after 1000 simulations, and we get 60.465. If we run the program a bunch of times we get the same type of thing over and over, so we can be reasonably certain that our above analysis was correct (on a frequentist view of probability, it is correct by definition).

Eventually we will want to jump this up to continuous variables. This means doing an integral to get the expected value. We will also want to base our decision on data we observe, i.e. inform our decisions instead of just deciding on what to do ahead of time and then plugging our ears, closing our eyes, and yelling, “La, la, la, I can’t see what’s happening.” When we update our decision as the actions happen it will just update our probability distributions and turn it into a Bayesian decision theory problem.

So you have that to look forward to. Plus some fun programming/pictures should be in the future where we actually do the experiment to see if it agrees with our analysis.


Statistical Oddities 4: The Base Rate Fallacy

If you are a classically trained scientist, then you probably do an experiment, get some data, run it through SPSS (or something similar), then to see whether or not the results are significant you look at the {p}-value. It is a standard that if {p<0.05}, then you consider the result to likely be "real" and not just some random noise making a pattern.

Why is that? Well, here's how you define the {p}-value. Suppose your null hypothesis is true. What is the probability of seeing data at least as extreme as the data you actually collected? That is the {p}-value. I hypothesize that my coin is fair. I do 200 flips. I calculate {p=0.04}. Then there is only a 4% chance that I would see a combination of flips at least that extreme if my coin really is fair.

Danger! Frequently, people try to negate everything and say "there is only a 4% chance that the coin is fair" (or, less obviously wrong, "we have 96% confidence that it is biased"). If you read these posts, then you should immediately see the error. The {p}-value is a probability about the data assuming the hypothesis, not a probability about the hypothesis given the data. In fact, there can’t be a 4% chance of the coin being exactly fair, because no matter what our flips were (it could be 100 and 100 after a 200 flip trial), the probability of the bias being exactly {0.5} is still {0} (just compute the integral of the posterior distribution from {0.5} to {0.5}). Yet you see this language all over scientific papers.

If we want to talk about confidence, then we have a way to do it and it does not involve the {p}-value. I wouldn’t say this is a frequentist vs Bayesian thing, but I think the Bayesian analysis actually makes it harder to make this mistake. Recall that what we did there was use the language that being unbiased was a credible hypothesis given our 95% HDI. What we have confidence about is an interval. Maybe we have 95% confidence that the bias is in the interval {(0.2, 0.8)}. In this case, the hypothesis of being unbiased is credible, but the hypothesis of it being {0.7} is also credible with the same HDI.

Anyway, back to {p}-values. Since I’m using coin flipping as an example, you might think this is silly, but let’s ramp up our experiment. Suppose my lab works for a casino. We make sure coins are fair before passing them on to the casino (for their hot new coin flipping game?). I use a {p} value of {0.05} as usual. After {100} coins I expect that I’ve made a mistake on {5} or fewer because of my {p}-value, right? This is the type of interpretation you see all the time! It is clearly wrong.

Suppose {10} of them are biased due to manufacturing errors. Depending on the power of my test (I haven’t talked about power, but as you can imagine it depends on how many flips I use in my trials, among other things) maybe I find 8 of them (this would be a power of {0.8}, which isn’t unreasonable in science). Now recall our definition of the {p}-value. I also have a {5\%} chance of incorrectly flagging each of my {90} unbiased coins as biased, which gives about {5} more. This puts me at flagging {13} coins as biased, only {8} of which actually are. Despite a {p}-value threshold of {0.05}, only about {62\%} of my calls of bias were correct (you could calculate this much more easily using Bayes’ theorem).
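The Bayes' theorem shortcut mentioned above looks like this (a sketch of my own using the scenario's numbers: base rate 10%, power 0.8, false-positive rate 0.05):

```python
# P(actually biased | flagged as biased) via Bayes' theorem.
base_rate = 0.10  # fraction of coins that are biased
power = 0.80      # P(flag | biased)
alpha = 0.05      # P(flag | unbiased), the p-value threshold

p_flag = power * base_rate + alpha * (1 - base_rate)
p_biased_given_flag = power * base_rate / p_flag
print(p_biased_given_flag)  # about 0.64
```

Exact expected counts give 64%; the 62% above comes from rounding the expected 4.5 false positives up to 5 whole coins (8 out of 13).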

The above scenario is extremely common in some medical science labs where it matters. Suppose you test a drug to see if it works. Your test has {0.8} power and you use a {p}-value of {0.05} as you’ve been taught. You send {13} to drug manufacturers claiming they work. You think that you are wrong only {5\%} of the time, but in reality after you’ve tested {100} drugs, {5} out of the {13} drugs you send don’t work! This is extremely dangerous. Of course, these should be weeded out on secondary trials, but who has the time or money to do that? If we think we have {95\%} confidence that it works, then we may as well send them out to help people and only do our repeat experiment while it is on the market.

Ignoring the real interpretation of the {p}-value in favor of the more optimistic one is so common it has a name: the base rate fallacy. The name comes from the fact that the high number of false positives is driven by the base rate: when the base rate of a drug working (or a coin being biased) is low, you are likely to get false positives even with a high-power test and a small {p}-value threshold. I know this type of thing has been posted on the internet all over the place, but I hadn’t done it yet and it seemed to fit in with the statistical oddities series. For the record, the example scenario above was taken from Statistics Done Wrong by Alex Reinhart.


Westward the Course of Empire Takes its Way

This is mostly meant to be a direct continuation of the last post, but there is so much to say about the importance of this short story for understanding Infinite Jest that I needed a full post to do it. I will try to stick to this thesis, but I get so excited about unraveling all the complexities and parallels in this story that I may wander off at times. This story may, in fact, be more complicated and difficult to read than Infinite Jest, so be warned.

Let’s start with the basics. The main character is a writer who wants to write a new type of fiction. He claims that it will use the old metafictional devices, but also move past them and stab the reader in the heart. We already saw this idea in the last post, but this story is a way for DFW to tell us how he intends to do it, i.e. it serves as a reader’s guide to Infinite Jest. That’s why this story is so important for prep material (if you choose to do such a thing).

What is going on takes a moment to digest. Here goes. The work is a criticism of the shortcomings of metafiction. But it is a metafictional story using those very devices to do the criticism. The main critique is of Barth’s “Lost in the Funhouse.” To do this, Barth is literally a character in the story as Professor Ambrose who wrote the aforementioned story (LitF from now on, because that is getting annoying to type), but this time it is an autobiographical nonfiction work instead of Barth’s fiction (recall that the main character of LitF is Ambrose). Summary: Prof Ambrose wrote LitF in DFW’s story and is leading a writing workshop.

Ambrose (despite being a “real” character already) from LitF is fictionalized as Mark, the main character in “Westward …” through a retelling of LitF. LitF is a story about Ambrose travelling to Ocean City and getting lost in a funhouse at the amusement park. DFW uses wordplay in the retelling and has Mark travelling to a McDonald’s commercial actors reunion where there will (of course!) be a Ronald McDonald “funhouse.”

I said I wouldn’t do this, so I’m going to cut myself off there. I trust that if you’ve read LitF, and you take some time to meditate on the above two paragraphs until it stops being so confusing, then you can continue to unravel this ridiculously convoluted metaphor and story within a story that is a retelling of that story (which is already in the story …). Stop. I must stop. But it is just so much fun to unravel (for example, the funhouse in LitF is being franchised, a jab at post-modernism having become commercial).

So what is DFW trying to tell us? Well, Barth uses his story to tell us how he sees metafiction. His metaphor is the funhouse of mirrors. In LitF he writes, “In a funhouse mirror-room you can’t see yourself go on forever, because no matter how you stand your head gets in the way.” This is the exact type of critical theory conundrum that DFW faces. He wants to affect the reader. But words and texts and people’s thoughts (i.e. “heads”) are always in the way. You can’t ever truly get to the person.

DFW’s metaphor is a bow and arrow, because Mark, the main character, is a pro archer. He has a beautiful description in “Westward …” of how an archer must take into account that the arrow doesn’t fly true. So to hit the bullseye, the archer actually makes adjustments ahead of time, aims off-center, and ends up hitting the center.

He’s saying that Barth can’t hit the reader, because he’s aiming at the wrong place: the head. Writers that strike at the reader’s heart also fall short, because they aim at it too directly. This new type of fiction will take this into account and aim in between. The result will be a piercing of the reader’s heart in a new and more serious way.

Mark’s girlfriend is a post-modernist writer in Ambrose’s workshop. Without going too far into it, the thing to pay attention to with her is that she is the epitome of the type of metafiction that DFW wants to do away with. Remember, DFW wants to keep some metafiction and throw out other parts to invent a new type of fiction. This character is a guide to the parts he wants thrown out.

This is a long story, and so I can’t help you through every detail. Another general principle to keep in mind while interpreting this is that the arrow is meant to be a stand-in for the pen. So when the arrow “kills” things/people, you should figure out what those things/people are representing. For example, Mark writes a story about a person named Dave (oh no, Mark who is Ambrose is a stand-in for DFW writes a work of “new fiction” with Dave as its main character …).

Dave has a lover named L– (presumably meant to be “literature”). But L– commits suicide (as the post-modernists brought the death of literature) with the arrow. Dave is innocent, but feels guilty and hence admits that (after translation out of the metaphor) his writing helped bring about the death of literature. Of course, Mark makes an appearance in this story that he wrote causing yet another story within the story with a character as the person that wrote the story, but also a stand-in for someone else (which sets up a weird endless loop that DFW is Mark, Mark is Dave, and Dave is DFW …). I seem to be losing my way again, so I’ll end this line of thought.

Hopefully you have a bit of a feel for what “Westward …” is doing. I’ll end this post by going through my thoroughly well-worn copy of the story and pulling the quotes that I think are the most important to focus on for understanding how and why DFW wrote Infinite Jest.

“…they want to build a Funhouse for lovers out of a story that does not love. J.D. himself had said the story doesn’t love, no? Yes. However, Mark postulates that Steelritter is only half-right. The story does not love, but this is precisely because it is not cruel…. The way to make a story a Funhouse is to put the story itself in one. For a lover. Make the reader a lover, who wants to be inside.”

“Please don’t tell anybody, but Mark Nechtr desires, some distant hard-earned day, to write something that stabs you in the heart. That pierces you, makes you think you’re going to die…. The stuff would probably use metafiction as a bright smiling disguise, a harmless floppy-shoed costume, because metafiction is safe to read, familiar as syndication; and no victim is as delicious as the one who smiles in relief at your familiar approach.”

Barth’s LitF famously opens with, “For whom is the funhouse fun? Perhaps for lovers. For Ambrose it is a place of fear and confusion.”

DFW turns it around and beautifully sums up what he is doing with his closing lines:

“For whom?
You are loved.”


Minor Preparation to Get the Most out of Infinite Jest

I’ve been reading the biography of David Foster Wallace, Every Love Story is a Ghost Story by D.T. Max, and it reminded me that for years I’ve been meaning to do a blog post on some of the preparation you can do to have a much better experience reading Infinite Jest.

First, I’m not doing this out of some condescending “let the self-declared expert tell you how you must read this” type of thing. I actually get asked this question semi-frequently, and I want something I can direct people to. My first answer is usually, “Just do it.” You can get a lot of enjoyment out of the novel without delving into the philosophy of the meta-fictional devices.

On the other hand, if you are going to spend a few months of your life reading a 1000 page beast of a novel, then you should be willing to do some minor preparation. I estimate a dedicated person could easily do these reading assignments in less than a week. I picked these for both brevity and clarity after years of reading everything he’s ever written and watching/reading tons of interviews with him, and reading as many things as I can that he points out as influences.

This will take two posts. One on everything and why I chose it. The other on understanding his story Westward the Course of Empire Takes its Way. If you are really pressed for time, then my advice is to finish reading this post. Read that story. Then read my soon to come explanation of why that story is the most important thing he ever wrote in trying to decipher why he writes in the way he writes. That story is a Rosetta stone to understanding his later works.

Here’s my reading list:
Lost in the Funhouse by John Barth (a short story)
“The Balloon” by Donald Barthelme (a short story)
The Mezzanine by Nicholson Baker (a very short novella)
“Westward the Course of Empire Takes its Way” by David Foster Wallace (a short story/novella)

That may look like a lot, but each story can probably be read in one sitting, although I recommend going slowly through that last one. Let’s take them one at a time.

“The Balloon” is probably the least important of the list. This is a short story that DFW talked about in several interviews. It was a story that basically changed his life. He wasn’t a literature or creative writing major in college, but this story made him see writing in a different light. It made him want to be a writer.

Here’s how I understand this. All the fiction that DFW wrote was deeply philosophical. He majored in philosophy and as a grad student took lots of critical theory. He was obsessed with the theory behind the relationship between author, text, and reader. This wasn’t abstract for him. Because he wanted to develop a relationship with his readers through what he wrote, he needed to understand what the nature of that relationship was.

What Barthelme’s story does, which was so novel at the time, is put the theoretical considerations right in the story plainly for all to see. This is essentially a defining characteristic of the post-modernists of the time. The story as a whole has some macro-structure (“plot” if you want to use that term), but the individual sentences have a micro-structure which is informing you as you go how to interpret the macro-structure.

The story is very enigmatic. Just as you are thinking, “What in the world is going on?” you encounter characters who say things like, “We have learned not to insist on meanings.” This isn’t the type of place where DFW ended in his writing, but it makes a lot of sense why he started here. The story is difficult, but the reader who is willing to put in the effort to think about the individual sentences is rewarded by being helped by the author, i.e. a back-and-forth rewarding relationship is built. Both sides have to put in effort, which is a key idea that will keep coming up.

As linked above, I’ve written about “Lost in the Funhouse” before. You can read that for details. Some might go so far as to call it “the canonical” example of post-modernism. Its main importance on this list is that “Westward …” is simultaneously a parody of it, a rewriting of it, and a tool to get some messages across. I dare say it is impossible to read “Westward …” and have any idea what is going on without having read “Lost in the Funhouse” first. We’ll discuss it a bit more next time.

Last is The Mezzanine by Nicholson Baker. This book takes place over something like 10 seconds. The plot (and full main text!) of the novella is that a man walks into a mezzanine and takes an escalator up to the next floor. That’s it. What makes this so compelling is that there are about 130 pages of footnotes telling you what the guy is thinking through this whole process.

The book is a page turner. I’m not joking. It gives you a glimpse into the mind of another human in such a raw and unfiltered way. It is, of course, really funny at times, but it is funny precisely because you know your own thoughts do the same exact types of things. You chain together all sorts of seemingly unrelated stupid things.

The reason for putting this on here is two-fold. First, DFW constantly talked about the importance of literature being that it makes you for a moment feel less alone. Here’s the quote, “We all suffer alone in the real world. True empathy’s impossible. But if a piece of fiction can allow us imaginatively to identify with a character’s pain, we might then also more easily conceive of others identifying with their own. This is nourishing, redemptive; we become less alone inside. It might just be that simple.” This book comes as close as any that I can think of to achieving the idea of truly identifying with a character.

The second reason I chose this book is actually the key one. The way the book does it is not by any of the conventional means. It achieves this truly magnificent feat purely through the use of footnotes. DFW loved this book. Now ask yourself what is the most daunting part of Infinite Jest? Most people say it is the extensive use of endnotes.

We’ll get more to the endnotes next time, but I think The Mezzanine holds the key to one of the reasons DFW used them. They aren’t purely distraction. They aren’t meta-fictional wankery. They aren’t highfalutin philosophical nonsense. DFW read a book that achieved what he considered the goal of literature, and it was done using this device. If you can understand the use in The Mezzanine, then you will be well on your way to understanding the use of the endnotes in Infinite Jest.

We’re only halfway there, but if you’ve made it this far and you want some extra credit, then I also recommend finding a copy of Marshall Boswell’s Understanding David Foster Wallace. It is a good resource if you want to delve deeper into the philosophy and critical theory of what he was trying to do. Also, DFW is trying to surpass his post-modern idols, so it helps to be familiar with post-modernism in general. If you aren’t, then The Crying of Lot 49 by Thomas Pynchon is a pretty short but classic book in that style as well.

