A Mind for Madness

Musings on art, philosophy, mathematics, and physics

Leave a comment

Markov Chain Monte Carlo Example

Let’s look at a problem called parameter estimation. As usual, we have a bunch of coin flips. What we’ve learned to do with Bayesian statistics is calculate some posterior distribution that tells me how likely the bias {\theta} is. I ask myself, “Is it a credible hypothesis that the coin is fair ({\theta =1/2})?” I find out yes it is. I ask myself, “Is it a credible hypothesis that the coin is massively biased at {\theta=4/5}?” I find out yes it is. Uh oh.

Maybe in abstract math land this type of contradictory information is fine. I should be honest that both are credible guesses based on my data, and Bayesian statistics helps me to be very precise about my degrees of certainty and uncertainty. Unfortunately, in the real world I want to figure out which {\theta} is “best” so that I can use it in my model for practical purposes. This is called parameter estimation, because I want to estimate what my parameter {\theta} should be in my model.

We’re in luck for the coin example, because we only have one parameter living in one-dimensional space. This alone vastly simplifies the situation, but we have something far, far better. Our posterior distribution has a unique maximum, that max happens to equal the mean of the distribution, and that max can be determined easily and exactly! This means that we can safely use that parameter as the “best.”

In the real world, we often have several parameters we are trying to estimate in a high-dimensional space, and the posterior is some non-convex crazy thing with lots of local mins/maxs that can’t be determined analytically. Let’s face it. Optimization is really hard even in relatively nice situations. The real world is usually not nice.

There often isn’t even an obvious notion of what you mean by “best” set of parameters either. Think of a symmetrical bimodal distribution where both peaks have the same max. You don’t really have any good reason to pick one of the points that gives the max, and if you do something like take the mean, then you might end up with a min on accident. The method I’m going to describe doesn’t really help with this issue of “equally good choices”, but it does give a beautiful way to deal with high-dimensional parameter spaces and crazy posterior distributions.

The idea is extremely simple. You will pick some starting collection of parameters. Then you let those parameters randomly move in some direction. We will then use our model to test whether or not it is more or less likely to see the data that we observed under each of those parameter choices. With some probability depending on this likelihood we will move that parameter to that value. This is just a Markov chain process of our {\theta} values moving through the possible parameter values, and hence this technique is called a Markov Chain Monte Carlo (MCMC) method (I used the indefinite article “a” because there are all sorts of variations on this out there).

It turns out that as long as we set this up in a reasonable way, then it will converge. Here’s something cool about this. Your parameters could live in some gigantic space for which it would be impossible to search for a good parameter estimation. Usually there is some much, much smaller dimensional subset of reasonably likely candidates. Once you move to this smaller dimensional set, by the nature of the algorithm, you will stay close to it and hence start moving to something optimal much faster. Here’s a picture showing how the random walks stay on a smaller set in a real example:

Let’s actually implement this in the silly case of the coin example where we know what the answer should be. My next post might try to implement this for some sort of real data set, although that could be more time consuming than I’m willing to do. To make this example more fun, I had the computer pick a random number in {[0,1]} and then generate 100 coin flips with bias equal to that number without telling me the number! This way we are in a realistic situation of not knowing what the “correct” answer is ahead of time.

I got 85 heads and 15 tails. To make computations easier, let’s assume the prior probability is just uniform. This means the posterior is given by {p(D|\theta)=\theta^{85}\cdot (1-\theta)^{15}}. I’ll start the random walk at {\theta = 0.5}. To know how much to move by, I pick a random number from a normal distribution with mean {0} and standard deviation {0.1}. So if I pick {0.05}, then my candidate place to move to is {0.55}.

I compute {p(D|\theta_{new})/p(D|\theta_{old})} and I move to the new spot with this probability. Note that if my new theta value is more likely to be the true theta, then I will always move to the new value, because the probability of moving is greater than {1}. The more unlikely my new theta value is, the less likely it is that I will move there. This implementation is called the Metropolis (or Metropolis-Hastings) algorithm. Note how simple the implementation is. It is only a few lines of code:

import numpy as np
import random
import pylab

# Posterior Distribution
def p(theta):
    return (theta**85)*((1-theta)**15)

# Random Walk Step Size
def norm_dist():
    return random.normalvariate(0, 0.1)
# Perform one step of random walk from spot theta
def rand_walk(theta):
    x = norm_dist()
    if theta + x < 1 and theta + x >0:
        return theta + x
        return theta
# Simulate the random walk for 1000 time steps   
walk = []
for i in xrange(1000):
    n = walk.pop()
    y = rand_walk(n)
    if random.random() < p(y)/p(n):

# Plot the results
ylab = [i for i in xrange(len(walk))]
pylab.plot(walk, ylab)
pylab.title('Random Walk Visualization')
pylab.xlabel('Theta Value')

Note that the key insight that MCMC gives us is that picking values from the posterior is going to be “easy.” Even if we don’t know much about the distribution and have no idea how to explicitly calculate anything from it, we can still perform this random walk. This is what it looks like:


The last step is to actually do a parameter estimation. The whole point is that the walk will stay close to the best value, so we can now just average these to get {\theta = 0.84}. The average is just a finite sum instead of an integral now. If we had done this analytically, we would have gotten {0.85}. Since we know MCMC is just giving us an estimation coming from randomness, this is really quite good!

Leave a comment

Video Games as a Solution to the One-Sided Problem of Art

In October I wrote a post in defense of gaming in which the central argument is a claim that any person who takes experiencing art as an important human experience should consider certain types of games as a worthwhile use of time as well. Some games are basically interactive films, but some are much more interesting and original forms of interactive art. If you close yourself off from this world, then you close yourself off from deep artistic experiences that you can’t get elsewhere.

A few months ago I did two posts on David Foster Wallace, his philosophy of art, and how to get the most out of Infinite Jest.

One of DFW’s central concerns in art was the one-sided nature of art. The artist puts in hundreds of hours of work, and the viewer/reader/whatever passively experiences the work. He thought of the artist/viewer relationship as an honest relationship. If it is completely one-sided, then it is a defunct relationship and you won’t get much out of it for very long. To have a successful relationship, both sides have to be putting in reasonable amounts of work.

This is one way people justify postmodernist writing. You have a bunch of endnotes or footnotes or you pull the reader out of the reading experience in other ways by drawing attention to the fact that they are reading something. You write in stream of consciousness from points of view that change every couple of pages, so that the reader can’t immediately tell what is happening. Whatever the literary device, the idea is that the reader has to put in work.

The point is that the more work the reader puts in, the more they will get out of the experience. Just like in a relationship, the reader has to invest something if they want a meaningful experience. Of course, the relationship becomes one-sided on the other side if the author just uses a random word generator and plops nonsense on the page for the reader to spend months trying to decipher. It needs to be a symbiotic relationship where neither side carries too much of the burden.

I’m going to go out on a limb and say that this problem is a real problem, and what writers, filmmakers, artists, etc have come up with so far merely mitigates the problem. There hasn’t been a really good way to get the viewer to truly participate in and invest in the work of art … until the fairly recent paradigm shift in thinking about games as art.

I’m definitely not the first to propose this, so I won’t spend a lot of time making this into a long post. Now that I’ve blogged around this topic a few times without actually addressing it I thought I would just point out that games are one obvious solution to the problem. They provide an interactive experience where the “player” has to fully invest in the work.

In fact, if artists are scared of the idea that their art will be “played” and hence will not qualify as “serious” (two notions that are extraordinarily hard to define or separate), then they should check out some recent games like To the Moon. The game play is extremely minimal. The player experiences a moving story by progressing through the game. The game play consists of moving around to collect some items and at the end of certain segments of collecting you “solve a puzzle” (sometimes only 2 or 3 clicks of the mouse). Still, this level of interaction is vital to fully immersing you in the story as if you were really the main character. This interaction is impossible with film or literature.

Leave a comment

Decision Theory 4: Hypothesis Testing


Now we return to decision theory. It is also a return to the thing that first made me interested in learning something about statistics a year or two ago. I had heard about John Ioannidis’ shocking article “Why Most Published Research Findings are False” and started to investigate. To me statistics was some settled thing that you hit your data with after doing an experiment. It told you whether or not your findings were real and how confident you could be in them.

Moreover, I believed that as long as you followed the prescriptions taught to you, you couldn’t mess it up. It was foolproof. Just look around and try to find one thing that science hasn’t touched. The scientific method has clearly led us to discover something about the world. That’s why stats seemed like an uninteresting part of math. People seemed to have figured it all out. Then I was shocked to find that article. I started to learn about all these fallacies, and methodological problems that I’ve been pointing out over the past few months.

One of the main difficulties, particularly in science, is classical null hypothesis significance testing (NHST). One way to try to mitigate these difficulties is to rephrase our hypothesis test as a Bayesian decision theory problem. This is not the only Bayesian reformulation (Kruschke’s MCMC stuff is pretty cool which I might get to someday), but it fits in as a nice example of the use of decision theory outside of the silly gambling problems I’ve been using.

Let’s start by seeing how to test a point null hypothesis. Think about the biased coin example. We want to test {\theta=1/2}, i.e. is the coin unbiased? This is obviously a ridiculous type of hypothesis test, because the term “unbiased” in real life encompasses a range {(1/2-\varepsilon, 1/2+\varepsilon)} where we can’t tell the difference. This is actually the case in most scientific situations as well (there is only so much precision your instruments can achieve), and often scientists incorrectly use a point NHST when there should be a ROPE (region of practical equivalence).

Our first step is to take the previous paragraph’s discussion and cheat a little. Suppose we want to test {\theta = \theta_0}. The Bayesian way to do this would work out of the box using a ROPE. Unfortunately, if we want continuous densities for the probabilities, then we will always reject our null hypothesis. This is because a point has probability zero. The cheat is to just convert the continuous prior, {\pi(\theta)}, to a piecewise defined prior where we assign a point mass of probability

\displaystyle \pi_0 = \displaystyle \int_{\theta_0-\varepsilon}^{\theta_0+\varepsilon} \pi(\theta)d\theta

to {\theta_0} and the renormalized old prior otherwise. This is merely saying that we make a starting assumption that {\theta} has true value {\theta_0} with probability {\pi_0}, and hence no actual integral needs to be calculated. That is just for intuitive justification for the shape of {\pi}. If this makes you uncomfortable, then use the uninformed prior of {\theta=\theta_0} having probability {1/2} and the alternative having a uniform distribution of mass 1/2.

Let’s recap what we are trying to do. We have two hypotheses. The null which is {H_0: \theta=\theta_0}, and the alternative {H_1: \theta\neq \theta_0}. This type of NHST came up in the last post where we wanted to experimentally test whether or not the acceleration due to gravity was {g=9.8}. Our process should be clear if you’ve been following this sequence of posts. We just use our data to calculate the posterior distributions {P(H_0|x)} and {P(H_1|x)}. We must decide between these two by seeing which one has less risk (and that risk will come from a loss function which appropriately penalizes falsely accepting/rejecting each one).

This approach is really nice, because depending on your situation you will want to penalize differently. If you are testing a drug for effectiveness, then it is better to harshly penalize falsely claiming a placebo to be effective (a false positive or Type I error). If you are testing whether or not someone has a fatal disease, then you want to harshly penalize falsely claiming they have it and having them undergo dangerous and expensive unnecessary treatments. Maybe these aren’t the best examples, but you see how having a flexible system could be a lot more useful than blindly running a {p=0.05} NHST.

Rather than going through some made up example from fake randomly generated data as I’ve been doing, let’s examine some differences at the theoretical level when we assume everything is normal. Suppose our data is a sample of {n} points from a normal distribution. Any book on Bayesian statistics will have the details on working this out, so I’ll get to the punch line.

If we denote {m(x)} the marginal density, then the posterior distribution for {H_0} is given by

\displaystyle \frac{f(x|\theta_0)\pi_0}{m(x)}.

In the normal distribution (we assume the prior has {\tau} standard deviation and the data has {\mu} and both have mean {\theta_0}) case we get something much more specific:

\displaystyle \left(1+\frac{(1-\pi_0)}{\pi_0}\cdot \frac{\exp(\frac{1}{2}z^2[1+\sigma^2/(n\tau^2)]^{-1}}{(1+n\tau^2/\sigma^2)^{1/2}}\right)^{-1}

where {z=\frac{\sqrt{n}|\overline{x}-\theta_0|}{\sigma}}. This term actually appears in classical NHST as well. Let’s look at the differences. For the purpose of getting some numbers down, let’s assume {\pi_0=1/2} and {\sigma=\tau}. In a two-tailed test, let’s assume that we observe a {p=0.01} and hence would very, very strongly and confidently reject {H_0}. This corresponds to a {z}-value of {2.576}. In this case if {n} is small, i.e. in the 10-100 range, then the posterior is around {0.14} to {0.27}. This means that we would likely want to reject {H_0}, because it is quite a bit more unlikely than {H_1} (this will of course depend on the specifics of our loss function).

Shockingly, if {n} is large, we lose a lot of confidence. If {n=1000}, then the posterior for {H_0} is {0.53}. Woops. The Bayesian approach says that {H_0} is actually more likely to be true than {H_1}, but our NHST gives us {p=0.01} level confidence for rejecting (i.e. there is a 99% chance that our data observations were not a fluke chance and the result that causes us to reject {H_0} is real).

As we see, by working with the Bayesian framework, we get posterior probabilities for how likely {H_0} and {H_1} are given our observations of the data. This allows us to do a suitable analysis. The classical framework feels very limited, because even when we get extreme {p}-values that give us lots of confidence, we could accidentally be overlooking something that would be obvious if we worked directly with how likely each is to be true.

To end this post, I’ll just reiterate that careful scientists are completely aware of the fact that a {p}-value is not to be interpreted as probabilities against {H_0}. One can certainly apply classical methods and end with a solid analysis. On the other hand, this is quite a widespread sloppiness or less generously I’ll call it a widespread misunderstanding of what is going on.


Statistical Oddities 5: Sequential Testing

Our next decision theory post is going to be on how to rephrase hypothesis testing in terms of Bayesian decision theory. We already saw in our last statistical oddities post that {p}-values can cause some problems if you are not careful. This oddity makes the situation even worse. We’ll show that if you use a classical null hypothesis significance test (NHST) even at {p=0.05} and your experimental design is to check significance after each iteration of a sample, then as the sample size increases, you will falsely reject the hypothesis more and more.

I’ll reiterate that this is more of an experimental design flaw than a statistical problem, so a careful statistician will not run into the problem. On the other hand, lots of scientists are not careful statisticians and do make these mistakes. These mistakes don’t exist in the Bayesian framework (advertisement for the next post). I also want to reiterate that the oddity is not that you sometimes falsely reject hypotheses (this is obviously going to happen, since we are dealing with a degree of randomness). The oddity is that as the sample size grows, your false rejection rate will tend to 100% ! Usually people think that a higher sample size will protect them, but in this case it exacerbates the problem.

To avoid offending people, let’s assume you are a freshmen in college and you go to your very first physics lab. Of course, it will be to let a ball drop. You measure how long it takes to drop at various heights. You want to determine whether or not the acceleration due to gravity is really 9.8. You took a statistics class in high school, so you recall that you can run a NHST at the {p=0.05} level and impress your professor with this knowledge. Unfortunately, you haven’t quite grasped experimental methodology, so you rerun your NHST after each trial of dropping the ball.

When you see {p<0.05} you get excited because you can safely reject the hypothesis! This happens and you turn in a lab write-up claiming that with greater than {95\%} certainty the true acceleration due to gravity is NOT {9.8}. Let's make the nicest assumptions possible and see that it was still likely for you to reach that conclusion. Assume {g=9.8} exactly. Also, assume that your measurements are pretty good and hence form a normal distribution with mean {9.8}. The following code simulates exactly that:

import random
import numpy as np
import pylab
from scipy import stats

#Generate normal sample
def norm():
    return random.normalvariate(9.8,1)

#Run the experiment, return 1 if falsely rejects and 0 else
def experiment(num_samples, p_val):
    x = []
    #One by one we append an observation to our list
    for i in xrange(num_samples):
        #Run a t-test at p_val significance to see if we reject the hypothesis
        t,p = stats.ttest_1samp(x, 9.8)
        if p < p_val:
            return 1
    return 0

#Check the proportion of falsely rejecting at various sample sizes
rej_proportion = []
for j in xrange(10):
    f_rej = 0
    for i in xrange(5000):
        f_rej += experiment(10*j+1, 0.05)

#Plot the results
axis = [10*j+1 for j in xrange(10)]
pylab.plot(axis, rej_proportion)
pylab.title('Proportion of Falsely Rejecting the Hypothesis')
pylab.xlabel('Sample Size')

What is this producing? On the first run of the experiment, what is the probability that you reject the null hypothesis? Basically {0} because the test knows that this isn't enough data to make a firm conclusion. If you run the experiment 10 times, what is the probability that at some point you reject the null hypothesis? It has gone up a bit. On and on this goes up to 100 trials where you have nearly a 40% chance of rejecting the null hypothesis using this method. This should make you uncomfortable, because this is ideal data where the mean really is 9.8 exactly! This isn't coming from imprecise measurements or something.

The trend will actually continue, but already because of the so-called {n+1} problem in programming this was taking a while to run, so I cut it off. As you accumulate more and more experiments, you will be more and more likely to reject the hypothesis:


Actually, if you think about this carefully it isn’t so surprising. The fault is that you recheck whether or not to reject after each sample. Recall that the {p}-value tells you how likely it is to see these results by random chance supposing the hypothesis is false. But the value is not {0} which means with enough trials you’ll get the wrong thing. If you have a sample size of {100} and you recheck your NHST after each sample is added, then you give yourself 100 chances to see this randomness manifest rather than checking once with all {100} data points. As your sample size increases, you give yourself more and more chances to see the randomness and hence as your sample goes to infinity your probability of falsely rejecting the hypothesis tends to {1}.

We can modify the above code to just track the p-value over a single 1000 sample experiment (the word “trial” in the title was meant to indicate dropping a ball in the physics experiment). This shows that if you cut your experiment off almost anywhere and run your NHST, then you would not reject the hypothesis. It is only because you incorrectly tracked the p-value until it dipped below 0.05 that a mistake was made:


Leave a comment

Decision Theory 3

If you could follow the last post then you have all the pieces you need to understand the basic theory. Let’s go back and actually work this out in the abstract now that we have an example for our template. If you only care about seeing some examples in action, then you should feel free to skip this post which will be almost entirely defining the pieces of the last post more rigorously. We will need to revise some things from the first post, because we were able to state things in a simpler form without Bayesian updating or continuous distributions happening.

Last time we introduced a one-parameter family of unknowns, the true bias of the coin. We denoted this {\theta}. For now we’ll keep this to just be some continuous real-valued parameter and it will represent an unknown quantity in our model. If you haven’t thought about this before, then I recommend continuing in the way we did last post. You pretend like {\theta} is some fixed known quantity and run classical decision theory. From there you extrapolate. The value of this parameter could be this, or this, or this, and my decision has to be the best no matter what it really is.

In the future, there could be a whole bunch of unknowns and {\theta} will turn into a vector or matrix, but for now we’ll stick to just a single variable. To pin down terminology, we will call {\theta} the parameter and {\Theta} the parameter space (all the possible values of {\theta}). So in our coin example {\Theta = [0,1]}.

We also have a collection of actions: {A}. An individual action will be denoted by {a}. For the coin example, an action would be betting on heads or tails. We will never be able to know {\theta} … because it is an unknown, but we will want to make observations/gather data which will be denoted {X}. In the coin example, this would be our observed sequence of flips (so it is probably best represented as a vector). We will denote the collection of all possible observations {\mathcal{X}} and this is called the sample space. In the coin example, we flipped the coin {100} times, so this consists of {2^{100}} vectors. In general, we will want to allow {X} to be continuous random variables and hence {\mathcal{X}} could be subsets of {\mathbb{R}^n}.

Let {I\subset \mathcal{X}} (suggestively we will often want to consider an “interval” {[a,b]\subset \mathbb{R}} if we just have one continuous random variable). As I already pointed out earlier, we will often want to take the view of a given fixed {\theta}. In this situation we will assume for the purposes of being able to analyze things that we always have an integrable probability distribution {f(x|\theta)} which is “the probability of observing x given {\theta}“. Thus, by definition, the probability of observing {I} given {\theta} is just the integral:

\displaystyle P_{\theta}(I)=\int_I f(x|\theta)dx

I won’t adopt the cumbersome notation that some texts use to indicate that this could be an integral or a finite sum. I will just use the integral, and assume the reader can translate to the appropriate sum if {\mathcal{X}} is discrete. If we have some function {h(X)}, then we define the expected value of {h(X)} over {\mathcal{X}} to be

\displaystyle E_{\theta}[h(X)] = \int_{\mathcal{X}}h(X)f(x|\theta)dx

Now that that is settled, let’s formalize the decision function, loss, and risk. Suppose that we have some prior probability describing the possibilities for {\theta}. We denote this {\pi(\theta)}. The choice of such a thing in the absence of any actual prior knowledge is one of the main (only?) arguments against Bayesian statistics. This shouldn’t be distressing, because any reasonable experiment will have a large enough sample size that picking an uninformed uniform prior will easily be overcome.

In the first decision theory post, we made a decision rule without basing it on any data. This is why we need to change our definition a little. In that situation a decision rule is equivalent to picking an action. If observing some data is involved, then our decision rule is a function {\delta: \mathcal{X}\rightarrow A}. This should just be read, “If I observe this type of data, then I will act in this way.” You let the data inform your decision. Our decision rule in the coin example was to look at the ratio of heads to tails. If there were more heads we pick heads. If there were more tails, we pick tails.

The loss function is a function {L: \Theta\times A \rightarrow \mathbb{R}}. This is the choice that people should feel a little uncomfortable with, because there is a definite choice that may or may not be reasonable affecting everything. The value {L(\theta, a)} should measure the loss that will be incurred if you do action {a} and {\theta} is the true value of the unknown.

We won’t worry so much about this right now. The more important one for us is the decision loss function {L:\Theta\times \mathcal{X}\rightarrow \mathbb{R}}. This is just plugging in to the other one: {L(\theta, \delta(x))}. Sometimes we just start with this one though. This was a no-brainer for our coin example, because I purposely set up the question to have a natural loss function. This was due to the fact that a well-defined “bet” was being made. In more general situations, the choice of a loss function could be seen as essentially equivalent to picking a betting scheme for your choices. You could easily come up with some wacky ones to see that it might not reflect reality if you aren’t careful.

To me the more “intuitive” notion is that of the risk function. This is the expected value of the loss:

\displaystyle R(\theta, \delta)=E_{\theta}[L(\theta, \delta(X))] = \int_\mathcal{X} L(\theta, \delta(X))f(x|\theta)dx

Note we integrate out the random variables {x}, but we are left over with a function of {\theta}. We saw this in our coin example last time. We get a similar thing for the Bayesian risk, but we incorporate the prior probability of {\theta}. Lots of times it is actually somewhat easier to just jump right to the risk, because in the case of squared-error loss (see we just get that the risk is the variance of the posterior distribution. No extra intermediary calculations are needed.

In general, most loss functions will be a variant on one of two types. The first is called the squared-error loss function. It is given by {L(\theta, a)=(\theta-a)^2}. You can think of this as “least-squares” fitting your decision or minimizing risk in the {L^2}-norm. The other is called the {0-1} loss function. This one arises quite naturally when you just have to pick between two choices like the coin flip. Ours was a variant on this. It penalizes you by {1} unit if your “decision is incorrect” and doesn’t penalize you at all if your “decision is correct.” It is given by

\displaystyle L(\theta, a_i)=\begin{cases} 0 & \text{if} \ \theta\in \Theta_i \\ 1 & \text{if} \ \theta\in \Theta_j \ \text{for} \ i\neq j\end{cases}

The beautiful thing about this one is that the risk is just {1} minus the posterior distribution. Thus, it is minimized at the max of the posterior which is often really easy to calculate. In the coin example, we got the beta distribution and hence the max was just the mean. Of course, we have to be careful that we are measuring the right thing, because we aren’t trying to predict the true bias. We were merely trying to predict heads or tails so that situation was an even easier discrete version.

Lastly, there is a partial ordering on decision functions given by \delta_1 \leq \delta_2 if and only if R(\theta, \delta_1) \leq R(\theta, \delta_2) for all \theta . A minimum in this ordering is called admissible and corresponds to a rational decision. If you make some other decision you are just asking to lose more.

Well, I think this post has gone on long enough (I’ve basically been trapped at the airport for the past 8 hours, so …). We’ll get back to some examples of all this next time. I just needed to finally formalize what we were doing before going any further.

1 Comment

Decision Theory 2

Now we will move on to a far, far more complex version of the same problem from before. Recall last time we worked with a fair coin. We want to make guesses that minimize our loss (or maximize our utility). The assumption that the coin was fair basically nullified having to do any analysis. No matter what decision function we picked, we would have the same expected loss, i.e. there is no way to do better than random guessing.

Let’s introduce the complexity of an extra parameter slowly through an example. Let’s suppose again that the coin is fair, but we don’t know that ahead of time. We have no idea what the bias of the coin is. We’ve already analyzed how to model this situation in our Bayesian statistics example.

If we observe {n} heads and {m} tails, we have a probability distribution describing the likelihood of the possible biases. We found this to be the beta distribution {B(n-1, m-1)}. If we start with a uniform, uninformed prior, then we could use Bayesian statistics to update our decision rule after each flip. This should make intuitive sense, because if the bias of the coin is 0.9, we should quickly see the posterior distribution reflect this and we will start getting most of our guesses correct.

Thus, the most naive thing to do is to look at the mean of the posterior distribution: {\frac{n}{n+m}}. If this number is bigger than {0.5}, then we guess heads because our Bayesian posterior predicts heads is coming up more frequently. If it is less than {0.5}, then we guess tails. If it equals {0.5}, then we make a random guess. Note that as long as the true bias is not {0.5}, we should be able to tell this with statistics after sufficiently many flips which will give us a better expected loss (i.e. risk) than random guessing. Let’s try two examples to see what happens.

I won’t post the code or the graph of what happens if the true bias is {0.5}, because our previous analysis shows it to be exactly the same independent of our decision function. Thus our more complicated decision rule doesn’t actually do anything to improve our guess. As a second example, we can mildly modify the code previously to see what happens with a {0.75} bias:

import random
import numpy as np
import pylab

def flip(true_bias):
    rand = random.random()
    if rand > true_bias:
        return 0
        return 1

def simulate(money, bet, true_bias, num_flips):
    num_heads = 0
    est_bias = 0.5
    for i in range(num_flips):
        #make a choice based on Bayesian posterior
        if est_bias >= 0.5:
            choice = 1
            choice = 0
        #flip the coin
        rand = flip(true_bias)
        #keep track of the number of heads
        num_heads += rand
        #update estimated bias
        est_bias = float(num_heads+1)/(i+3)
        #check whether or not choice was correct
        if choice == rand:
            money += 2*bet
            money -= bet
    return money

results = []
for i in range(1000):
    results.append(simulate(10, 1, 0.75, 100))

pylab.title('Coin Experiment Results')
pylab.xlabel('Trial Number')
pylab.ylabel('Money at the End of the Trial')

print np.mean(results)

The program says we average ending with {134.3} cents. We made pretty close to {125} cents as opposed to making {50} cents off of the {0.5} bias. These numbers should not be mysterious, because in the long run we expect to start guessing heads which will occur {3/4} of the time. Thus our expected gain is {100((-1/4)+(3/4)*2)=125}. Here’s the plot of the experiment:


This should feel a little weird, because with this decision rule we expect to always do better than (or equal to) our previous example. But this example is more realistic, because we don’t assume to know the bias of the coin! How could we do better with “less” information? That is the power of Bayesian decision theory which allows you to update your decision rule as you observe more information.

The classical admissible decision of always picking heads will do better if the bias is towards heads because we don’t have to wait for our posterior to tell us to pick heads, but it will do terrible if the bias is towards tails because even once we see that we get mostly tails we are not allowed to change our decision rule.

Let’s go back to our experiment of 100 coin flips. If {\theta} is the true bias of the coin, then the negative of the risk (the expected value of the utility function) of our Bayesian naive decision rule is

{-R(\theta) = \begin{cases} 100 (3\theta -1) & \ \text{if} \ \theta \geq 0.5 \\ 100(2-3\theta) & \text{if} \ \theta < 0.5\end{cases}.}

We've now successfully incorporated our new parameter. The risk will in general depend on this parameter. The function is just a "V" when graphed and our risk from last post is just a straight line {-R(\theta)=100(3\theta-1)}. It matches on the right piece, but is strictly below this one on the left half. This shows that no matter the bias of the coin, the naive Bayesian decision rule does better than our first post's choice.

Last post I said we could order the decision functions based on risk, and then we just call a minimum in the ordering admissible. Now we have to be more careful. With this extra parameter we only get a partial ordering by checking whether or not the risk is greater pointwise for every {\theta}. As just pointed out, the Bayesian decision function is lower in the ordering than random guessing or always picking heads (the two are comparable!). The question is, how do we know whether or not it is a minimum? Is this the best we can do? Is this naive decision rule admissible?

We will dig a little more into the theory next time about how those risk functions were computed (I just told you what they were which matched our experiments), and how to actually prove that a certain decision is admissible in this more complicated situation.

1 Comment

Decision Theory 1

Today we’ll start looking at a branch of math called Decision Theory. It uses the types of things in probability and statistics that we’ve been looking at to make rational decisions. In fact, in the social sciences when bias/rationality experiments are done, seeing how closely people make decisions to these optimal decisions is the base line definition of rationality.

Today’s post will just take the easiest possible scenarios to explain the terms. I think most of this stuff is really intuitive, but all the textbooks and notes I’ve looked at make this way more complicated and confusing. This basically comes from doing too much too fast and not working basic examples.

Let’s go back to our original problem which is probably getting old by now. We have a fair coin. It gets flipped. I have to bet on either heads or tails. If I guess wrong, then I lose the money I bet. If I guess right, then I double my money. The coin will be flipped 100 times. How should I bet?

Let’s work a few things out. A decision function is a function from the space of random variables {X} (technically we can let {X} be any probability space) to the set of possible actions. Let’s call {A=\{0,1\}} our set of actions where {0} corresponds to choosing tails and {1} corresponds to heads. Our decision function is a function that assigns to each flip a choice of picking heads or tails, {\delta: X \rightarrow A}. Note that in this example {X} is also just a discrete space corresponding to the 100 flips of the coin.

We now define a loss function, {L:X\times A \rightarrow \mathbb{R}}. To make things easy, suppose we bet 1 cent every time. Then our loss is {1} cent every time we guess wrong and {-2} cents if we guess right. Because of the awkwardness of thinking in terms of loss (i.e. a negative loss is a gain) we will just invert it and use a utility function in this case which measures gains. Thus {U=-1} when we guess wrong and {U=2} when we guess right. Notationally, suppose {F: X\rightarrow A} is the function that tells us the outcome of each flip. Explicitly,

\displaystyle U(x_i, \delta(x_i)) = \begin{cases} -1 \ \text{if} \ F(x_i) \neq \delta(x_i) \\ 2 \ \text{if} \ F(x_i) = \delta(x_i) \end{cases}

The last thing we need is the risk involved. The risk is just the expected value of the loss function (or the negative of the expected value of the utility). Suppose our decision function is to pick {0} every time. Then our expected utility is just {100(1/2(-1)+1/2(2))=50}. This makes sense, because half the time we expect to lose and half we expect to win. But we double our money on a win, so we expect a net gain. Thus our risk is {-50}, i.e. there is no risk involved in playing this way!
This is a weird example, because in the real world we have to make our risk function up and it does not usually have negative expected value, i.e. there is almost always real risk in a decision. Also, our typical risk will still be a function. It is only because everything is discrete that some concepts have been combined which will need to be pulled apart later.

The other reason this is weird is that even though there are {2^{10}} different decision functions, they all have the same risk because of the symmetry and independence of everything. In general, each decision function will give a different risk, and they are ordered by this risk. Any minimum risk decision function is called “admissible” and it corresponds to making a rational decision.

I want to point out that if you have the most rudimentary programming skills, then you don’t have to know anything about probability, statistics, or expected values to figure these things out in these simple toy examples. Let’s write a program to check our answer (note that you could write a much simpler program which is only about 5 lines, has no functions, etc to do this):

import random
import numpy as np
import pylab

def flip():
    return random.randint(0,1)

def simulate(money, bet, choice, length):
    for i in range(length):
        tmp = flip()
        if choice == tmp:
            money += 2*bet
            money -= bet
    return money

results = []
for i in range(1000):
    results.append(simulate(10, 1, 0, 100))

pylab.title('Coin Experiment Results')
pylab.xlabel('Trial Number')
pylab.ylabel('Money at the End of the Trial')

print np.mean(results)

This python program runs the given scenario 1000 times. You start with 10 cents. You play the betting game with 100 flips. We expect to end with 60 cents at the end (we start with 10 and have an expected gain of 50). The plot shows that sometimes we end with way more, and sometimes we end with way less (in these 1000 we never end with less than we started with, but note that is a real possibility, just highly unlikely):


It clearly hovers around 60. The program then spits out the average after 1000 simulations and we get 60.465. If we run the program a bunch of times we get the same type of thing over and over, so we can be reasonably certain that our above analysis was correct (supposing a frequentist view of probability it is by definition correct).

Eventually we will want to jump this up to continuous variables. This means doing an integral to get the expected value. We will also want to base our decision on data we observe, i.e. inform our decisions instead of just deciding on what to do ahead of time and then plugging our ears, closing our eyes, and yelling, “La, la, la, I can’t see what’s happening.” When we update our decision as the actions happen it will just update our probability distributions and turn it into a Bayesian decision theory problem.

So you have that to look forward to. Plus some fun programming/pictures should be in the future where we actually do the experiment to see if it agrees with our analysis.


Get every new post delivered to your Inbox.

Join 142 other followers