Now we will move on to a far, far more complex version of the same problem from before. Recall last time we worked with a fair coin. We want to make guesses that minimize our loss (or maximize our utility). The assumption that the coin was fair basically nullified having to do any analysis. No matter what decision function we picked, we would have the same expected loss, i.e. there is no way to do better than random guessing.
Let’s introduce the complexity of an extra parameter slowly through an example. Let’s suppose again that the coin is fair, but we don’t know that ahead of time. We have no idea what the bias of the coin is. We’ve already analyzed how to model this situation in our Bayesian statistics example.
If we observe $h$ heads and $t$ tails, we have a probability distribution describing the likelihood of the possible biases. We found this to be a beta distribution; starting from a uniform, uninformed prior, it is $\mathrm{Beta}(h+1, t+1)$. We can then use Bayesian statistics to update our decision rule after each flip. This should make intuitive sense, because if the bias of the coin is $0.9$, we should quickly see the posterior distribution reflect this, and we will start getting most of our guesses correct.
Thus, the most naive thing to do is to look at the mean of the posterior distribution: $\frac{h+1}{h+t+2}$. If this number is bigger than $\frac{1}{2}$, then we guess heads, because our Bayesian posterior predicts heads is coming up more frequently. If it is less than $\frac{1}{2}$, then we guess tails. If it equals $\frac{1}{2}$, then we make a random guess. Note that as long as the true bias is not $\frac{1}{2}$, we should be able to detect this statistically after sufficiently many flips, which will give us a better expected loss (i.e. risk) than random guessing. Let’s try two examples to see what happens.
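The rule just described can be sketched in a few lines. This is only an illustration under the uniform-prior assumption above; the function names `posterior_mean` and `naive_bayes_guess` are made up for this sketch and do not appear in the simulation code of this post.

```python
import random


def posterior_mean(num_heads, num_tails):
    """Mean of Beta(num_heads + 1, num_tails + 1), i.e. a uniform prior updated on the flips."""
    return (num_heads + 1) / (num_heads + num_tails + 2)


def naive_bayes_guess(num_heads, num_tails, rng=random):
    """Return 1 (heads) or 0 (tails) by comparing the posterior mean with 1/2.

    The posterior mean exceeds 1/2 exactly when num_heads > num_tails,
    so we compare the integer counts and avoid floating-point ties.
    """
    if num_heads > num_tails:
        return 1
    if num_heads < num_tails:
        return 0
    # posterior mean is exactly 1/2: guess at random
    return rng.choice([0, 1])
```

For example, `naive_bayes_guess(7, 3)` returns `1`, since the posterior mean is `8/12`. Breaking ties at random follows the description above; always guessing heads on a tie would be an equally reasonable convention.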
I won’t post the code or the graph of what happens if the true bias is $0.5$, because our previous analysis shows the expected loss to be exactly the same regardless of our decision function. Thus our more complicated decision rule doesn’t actually do anything to improve our guesses in that case. As a second example, we can mildly modify the previous code to see what happens with a $0.75$ bias:
```python
import random

import numpy as np
import pylab


def flip(true_bias):
    # return 1 (heads) with probability true_bias, otherwise 0 (tails)
    rand = random.random()
    if rand > true_bias:
        return 0
    else:
        return 1


def simulate(money, bet, true_bias, num_flips):
    num_heads = 0
    est_bias = 0.5
    for i in range(num_flips):
        # make a choice based on Bayesian posterior (ties go to heads)
        if est_bias >= 0.5:
            choice = 1
        else:
            choice = 0

        # flip the coin
        rand = flip(true_bias)

        # keep track of the number of heads
        num_heads += rand

        # update estimated bias: posterior mean after i + 1 flips
        est_bias = float(num_heads + 1) / (i + 3)

        # check whether or not choice was correct
        if choice == rand:
            money += 2 * bet
        else:
            money -= bet
    return money


results = []
for i in range(1000):
    results.append(simulate(10, 1, 0.75, 100))

pylab.plot(results)
pylab.title('Coin Experiment Results')
pylab.xlabel('Trial Number')
pylab.ylabel('Money at the End of the Trial')
pylab.show()

print(np.mean(results))
```
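The program prints our average ending balance over the 1000 trials, which comes out close to $135$ cents (the $10$ we started with plus our winnings). We made pretty close to $125$ cents, as opposed to the $50$ cents we made with the $0.5$ bias. These numbers should not be mysterious, because in the long run we expect to start guessing heads, which will occur $\frac{3}{4}$ of the time. Thus our expected gain is $100\left(\frac{3}{4}\cdot 2 - \frac{1}{4}\cdot 1\right) = 125$. Here’s the plot of the experiment: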
This should feel a little weird, because with this decision rule we expect to always do at least as well as in our previous example. But this example is more realistic, because we don’t assume we know the bias of the coin! How could we do better with “less” information? That is the power of Bayesian decision theory, which allows you to update your decision rule as you observe more information.
The classical admissible decision of always picking heads will do better if the bias is towards heads, because we don’t have to wait for our posterior to tell us to pick heads. But it will do terribly if the bias is towards tails, because even once we see that we get mostly tails, we are not allowed to change our decision rule.
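To put rough numbers on this trade-off, here is a small seeded simulation that plays both rules against the same kind of coin. It is a sketch, not the code used for the plot in this post: the helper names `play` and `average_winnings` are hypothetical, and the payoff (win 2, lose 1 per flip) is copied from the simulation above.

```python
import random


def play(true_bias, num_flips, adaptive, rng):
    """Total winnings over num_flips flips (win +2, lose -1 on each flip)."""
    heads = 0
    winnings = 0
    for n in range(num_flips):
        # The adaptive rule guesses heads when the posterior mean
        # (heads + 1) / (n + 2) is at least 1/2; the fixed rule always guesses heads.
        guess = 1 if (not adaptive or 2 * (heads + 1) >= n + 2) else 0
        outcome = 1 if rng.random() < true_bias else 0
        heads += outcome
        winnings += 2 if guess == outcome else -1
    return winnings


def average_winnings(true_bias, adaptive, trials=2000, num_flips=100, seed=0):
    """Average winnings over many independent runs, with a fixed seed for repeatability."""
    rng = random.Random(seed)
    total = sum(play(true_bias, num_flips, adaptive, rng) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    for bias in (0.25, 0.75):
        print(bias,
              "always heads:", average_winnings(bias, adaptive=False),
              "Bayesian:", average_winnings(bias, adaptive=True))
```

With a bias of 0.25, always picking heads loses money on average while the adaptive rule still comes out well ahead. With a bias of 0.75, the two land close together, with the fixed rule slightly ahead because it never spends flips guessing tails.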
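Let’s go back to our experiment of 100 coin flips. If $\theta$ is the true bias of the coin, then the negative of the risk (the expected value of the utility function) of our naive Bayesian decision rule is, ignoring the first few flips while the posterior settles,

$$-R(\theta) = \begin{cases} 100(3\theta - 1) & \text{if } \theta \geq \frac{1}{2} \\ 100(2 - 3\theta) & \text{if } \theta < \frac{1}{2} \end{cases}$$

We've now successfully incorporated our new parameter. The risk will in general depend on this parameter. The function $-R(\theta)$ is just a "V" when graphed, and the corresponding function for last post's rule of always picking heads is just the straight line $100(3\theta - 1)$. The line matches the V on the right piece, but is strictly below it on the left half. This shows that no matter the bias of the coin, the naive Bayesian decision rule does at least as well as our first post's choice, and strictly better whenever the bias is towards tails.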
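As a quick numerical sanity check on the "V" versus the straight line, the two expected-utility functions can be tabulated on a grid of biases. This is a sketch that assumes the payoff from the simulation code (win 2, lose 1 per flip) and the long-run approximation in which the Bayesian rule always guesses the likelier side; the function names are invented here.

```python
def bayes_utility(theta, num_flips=100):
    """Long-run expected winnings when we always guess the likelier side of the coin."""
    # guessing heads earns 3*theta - 1 per flip; guessing tails earns 2 - 3*theta
    return num_flips * max(3 * theta - 1, 2 - 3 * theta)


def always_heads_utility(theta, num_flips=100):
    """Expected winnings when we guess heads on every flip."""
    return num_flips * (3 * theta - 1)


if __name__ == "__main__":
    for k in range(0, 11):
        theta = k / 10
        print(theta, bayes_utility(theta), always_heads_utility(theta))
```

At a bias of 0.75 both functions give 125, which agrees with the expected gain computed for the experiment. At a bias of 0.25 the Bayesian rule still gives 125, while always picking heads gives -25.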
Last post I said we could order the decision functions based on risk, and then we just call a minimum in the ordering admissible. Now we have to be more careful. With this extra parameter we only get a partial ordering, obtained by checking whether or not the risk is greater pointwise for every bias $\theta$. As just pointed out, the Bayesian decision function is lower in the ordering than random guessing or always picking heads (so it is comparable to each of them, which a partial ordering does not guarantee; random guessing and always picking heads are not comparable to each other, since their risks cross at $\theta = \frac{1}{2}$). The question is, how do we know whether or not it is a minimum? Is this the best we can do? Is this naive decision rule admissible?
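The pointwise partial ordering can be made concrete with a small comparison helper. This is a sketch under stated assumptions: the three risk functions use the win-2, lose-1 payoff from the simulation and the long-run approximation for the Bayesian rule, the names are hypothetical, and checking a finite grid of biases illustrates the ordering but does not prove anything about admissibility.

```python
def bayes_risk(theta, num_flips=100):
    """Risk (negative expected winnings) of always guessing the likelier side."""
    return -num_flips * max(3 * theta - 1, 2 - 3 * theta)


def always_heads_risk(theta, num_flips=100):
    """Risk of guessing heads on every flip."""
    return -num_flips * (3 * theta - 1)


def random_guess_risk(theta, num_flips=100):
    """Risk of guessing uniformly at random; it does not depend on theta."""
    return -num_flips * 0.5


def compare_risks(risk_a, risk_b, grid):
    """Compare two risk functions pointwise on a grid of biases."""
    a_le_b = all(risk_a(t) <= risk_b(t) for t in grid)
    b_le_a = all(risk_b(t) <= risk_a(t) for t in grid)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "a is lower"
    if b_le_a:
        return "b is lower"
    return "incomparable"


GRID = [k / 100 for k in range(101)]

if __name__ == "__main__":
    print(compare_risks(bayes_risk, always_heads_risk, GRID))
    print(compare_risks(bayes_risk, random_guess_risk, GRID))
    print(compare_risks(always_heads_risk, random_guess_risk, GRID))
```

On this grid the Bayesian rule sits below both alternatives, while always picking heads and random guessing come out incomparable because each is better on one side of a fair coin.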
We will dig a little more into the theory next time about how those risk functions were computed (I just told you what they were which matched our experiments), and how to actually prove that a certain decision is admissible in this more complicated situation.