Let’s look at a problem called parameter estimation. As usual, we have a bunch of coin flips. What we’ve learned to do with Bayesian statistics is calculate some posterior distribution that tells me how likely the bias is. I ask myself, “Is it a credible hypothesis that the coin is fair ()?” I find out yes it is. I ask myself, “Is it a credible hypothesis that the coin is massively biased at ?” I find out yes it is. Uh oh.
Maybe in abstract math land this type of contradictory information is fine. I should be honest that both are credible guesses based on my data, and Bayesian statistics helps me to be very precise about my degrees of certainty and uncertainty. Unfortunately, in the real world I want to figure out which is “best” so that I can use it in my model for practical purposes. This is called parameter estimation, because I want to estimate what my parameter should be in my model.
We’re in luck for the coin example, because we only have one parameter living in one-dimensional space. This alone vastly simplifies the situation, but we have something far, far better. Our posterior distribution has a unique maximum, that max happens to equal the mean of the distribution, and that max can be determined easily and exactly! This means that we can safely use that parameter as the “best.”
In the real world, we often have several parameters we are trying to estimate in a high-dimensional space, and the posterior is some non-convex crazy thing with lots of local mins/maxs that can’t be determined analytically. Let’s face it. Optimization is really hard even in relatively nice situations. The real world is usually not nice.
There often isn’t even an obvious notion of what you mean by “best” set of parameters either. Think of a symmetrical bimodal distribution where both peaks have the same max. You don’t really have any good reason to pick one of the points that gives the max, and if you do something like take the mean, then you might end up with a min on accident. The method I’m going to describe doesn’t really help with this issue of “equally good choices”, but it does give a beautiful way to deal with high-dimensional parameter spaces and crazy posterior distributions.
The idea is extremely simple. You will pick some starting collection of parameters. Then you let those parameters randomly move in some direction. We will then use our model to test whether or not it is more or less likely to see the data that we observed under each of those parameter choices. With some probability depending on this likelihood we will move that parameter to that value. This is just a Markov chain process of our values moving through the possible parameter values, and hence this technique is called a Markov Chain Monte Carlo (MCMC) method (I used the indefinite article “a” because there are all sorts of variations on this out there).
It turns out that as long as we set this up in a reasonable way, then it will converge. Here’s something cool about this. Your parameters could live in some gigantic space for which it would be impossible to search for a good parameter estimation. Usually there is some much, much smaller dimensional subset of reasonably likely candidates. Once you move to this smaller dimensional set, by the nature of the algorithm, you will stay close to it and hence start moving to something optimal much faster. Here’s a picture showing how the random walks stay on a smaller set in a real example:
Let’s actually implement this in the silly case of the coin example where we know what the answer should be. My next post might try to implement this for some sort of real data set, although that could be more time consuming than I’m willing to do. To make this example more fun, I had the computer pick a random number in and then generate 100 coin flips with bias equal to that number without telling me the number! This way we are in a realistic situation of not knowing what the “correct” answer is ahead of time.
I got 85 heads and 15 tails. To make computations easier, let’s assume the prior probability is just uniform. This means the posterior is given by . I’ll start the random walk at . To know how much to move by, I pick a random number from a normal distribution with mean and standard deviation . So if I pick , then my candidate place to move to is .
I compute and I move to the new spot with this probability. Note that if my new theta value is more likely to be the true theta, then I will always move to the new value, because the probability of moving is greater than . The more unlikely my new theta value is, the less likely it is that I will move there. This implementation is called the Metropolis (or Metropolis-Hastings) algorithm. Note how simple the implementation is. It is only a few lines of code:
import numpy as np import random import pylab # Posterior Distribution def p(theta): return (theta**85)*((1-theta)**15) # Random Walk Step Size def norm_dist(): return random.normalvariate(0, 0.1) # Perform one step of random walk from spot theta def rand_walk(theta): x = norm_dist() if theta + x < 1 and theta + x >0: return theta + x else: return theta # Simulate the random walk for 1000 time steps walk =  walk.append(0.5) for i in xrange(1000): n = walk.pop() walk.append(n) y = rand_walk(n) if random.random() < p(y)/p(n): walk.append(y) else: walk.append(n) # Plot the results ylab = [i for i in xrange(len(walk))] pylab.plot(walk, ylab) pylab.title('Random Walk Visualization') pylab.xlabel('Theta Value') pylab.ylabel('Time') pylab.show()
Note that the key insight that MCMC gives us is that picking values from the posterior is going to be “easy.” Even if we don’t know much about the distribution and have no idea how to explicitly calculate anything from it, we can still perform this random walk. This is what it looks like:
The last step is to actually do a parameter estimation. The whole point is that the walk will stay close to the best value, so we can now just average these to get . The average is just a finite sum instead of an integral now. If we had done this analytically, we would have gotten . Since we know MCMC is just giving us an estimation coming from randomness, this is really quite good!