# Decision Theory 3

If you could follow the last post, then you have all the pieces you need to understand the basic theory. Let’s go back and work this out in the abstract now that we have an example to use as a template. If you only care about seeing examples in action, feel free to skip this post, which will consist almost entirely of defining the pieces of the last post more rigorously. We will also need to revise some things from the first post, because there we were able to state things in a simpler form, without Bayesian updating or continuous distributions in the picture.

Last time we introduced a one-parameter family of unknowns, the true bias of the coin. We denoted this ${\theta}$. For now we’ll keep this to just be some continuous real-valued parameter and it will represent an unknown quantity in our model. If you haven’t thought about this before, then I recommend continuing in the way we did last post. You pretend like ${\theta}$ is some fixed known quantity and run classical decision theory. From there you extrapolate. The value of this parameter could be this, or this, or this, and my decision has to be the best no matter what it really is.

In the future, there could be a whole bunch of unknowns and ${\theta}$ will turn into a vector or matrix, but for now we’ll stick to just a single variable. To pin down terminology, we will call ${\theta}$ the parameter and ${\Theta}$ the parameter space (all the possible values of ${\theta}$). So in our coin example ${\Theta = [0,1]}$.

We also have a collection of actions: ${A}$. An individual action will be denoted by ${a}$. For the coin example, an action would be betting on heads or tails. We will never be able to know ${\theta}$, because it is an unknown, but we will want to make observations/gather data, which will be denoted ${X}$. In the coin example, this would be our observed sequence of flips (so it is probably best represented as a vector). We will denote the collection of all possible observations by ${\mathcal{X}}$, and this is called the sample space. In the coin example, we flipped the coin ${100}$ times, so this consists of ${2^{100}}$ vectors. In general, we will want to allow ${X}$ to be a continuous random variable (or a vector of them), and hence ${\mathcal{X}}$ could be a subset of ${\mathbb{R}^n}$.

Let ${I\subset \mathcal{X}}$ (suggestively, we will often want to consider an “interval” ${[a,b]\subset \mathbb{R}}$ if we just have one continuous random variable). As I already pointed out earlier, we will often want to take the view of a given fixed ${\theta}$. In this situation we will assume, for the purposes of being able to analyze things, that we always have an integrable probability distribution ${f(x|\theta)}$, which is “the probability of observing ${x}$ given ${\theta}$.” Thus, by definition, the probability of the observation landing in ${I}$ given ${\theta}$ is just the integral:

$\displaystyle P_{\theta}(I)=\int_I f(x|\theta)dx$
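To make this concrete, here is a minimal numerical sketch (not from the post itself) of computing ${P_{\theta}(I)}$ for an assumed sampling density. I’ve picked a normal density with mean ${\theta}$ purely for illustration, and approximate the integral with a midpoint Riemann sum; all the function names are mine.

```python
import math

def normal_pdf(x, theta, sigma=1.0):
    # Assumed density f(x|theta): a Normal(theta, sigma^2) observation model.
    return math.exp(-(x - theta) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def prob_interval(a, b, theta, n=10_000):
    # P_theta([a, b]) = integral of f(x|theta) over [a, b],
    # approximated by a midpoint Riemann sum with n subintervals.
    h = (b - a) / n
    return sum(normal_pdf(a + (i + 0.5) * h, theta) for i in range(n)) * h

# Probability of observing x within one standard deviation of theta = 0
p = prob_interval(-1.0, 1.0, theta=0.0)  # roughly 0.68
```

The same pattern works for any density you can evaluate pointwise; only the accuracy of the Riemann sum changes.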

I won’t adopt the cumbersome notation that some texts use to indicate that this could be an integral or a finite sum. I will just use the integral, and assume the reader can translate to the appropriate sum if ${\mathcal{X}}$ is discrete. If we have some function ${h(X)}$, then we define the expected value of ${h(X)}$ over ${\mathcal{X}}$ to be

$\displaystyle E_{\theta}[h(X)] = \int_{\mathcal{X}}h(x)f(x|\theta)dx$
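In the discrete coin setting the integral is just a finite sum. Here is a small sketch (again, mine, not from the post) that takes ${X}$ to be the number of heads in ${n}$ flips, so ${f(x|\theta)}$ is the binomial pmf, and computes ${E_{\theta}[h(X)]}$ by summing over the sample space:

```python
from math import comb

def binom_pmf(k, n, theta):
    # f(k|theta): probability of exactly k heads in n independent flips with bias theta.
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)

def expectation(h, n, theta):
    # E_theta[h(X)] = sum of h(k) * f(k|theta) over the whole sample space.
    return sum(h(k) * binom_pmf(k, n, theta) for k in range(n + 1))

# Sanity check: expected number of heads in 100 flips of a 0.6-biased coin
m = expectation(lambda k: k, n=100, theta=0.6)  # should be 60
```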

Now that that is settled, let’s formalize the decision function, loss, and risk. Suppose that we have some prior probability distribution describing the possibilities for ${\theta}$. We denote this ${\pi(\theta)}$. The choice of such a thing in the absence of any actual prior knowledge is one of the main (only?) arguments against Bayesian statistics. This shouldn’t be distressing, because any reasonable experiment will have a large enough sample size that the data will easily overwhelm an uninformative uniform prior.

In the first decision theory post, we made a decision rule without basing it on any data. This is why we need to change our definition a little. In that situation, a decision rule is equivalent to picking an action. If observing some data is involved, then our decision rule is a function ${\delta: \mathcal{X}\rightarrow A}$. This should just be read, “If I observe this type of data, then I will act in this way.” You let the data inform your decision. Our decision rule in the coin example was to look at the ratio of heads to tails. If there were more heads, we picked heads. If there were more tails, we picked tails.
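That majority rule is simple enough to write down directly. A sketch (the tie-breaking toward heads is my own arbitrary convention, since the post doesn’t specify one):

```python
def delta(flips):
    # Decision rule delta: X -> A for the coin example.
    # Input: an observed sequence of flips, encoded 1 = heads, 0 = tails.
    # Output: the action -- bet on whichever side appeared more often.
    heads = sum(flips)
    tails = len(flips) - heads
    return "heads" if heads >= tails else "tails"  # ties go to heads (a convention)

delta([1, 1, 0, 1])  # -> "heads"
delta([0, 0, 1])     # -> "tails"
```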

The loss function is a function ${L: \Theta\times A \rightarrow \mathbb{R}}$. This is the choice that people should feel a little uncomfortable with, because there is a definite choice that may or may not be reasonable affecting everything. The value ${L(\theta, a)}$ should measure the loss that will be incurred if you do action ${a}$ and ${\theta}$ is the true value of the unknown.

We won’t worry so much about this right now. The more important function for us is the loss of a decision rule, ${L:\Theta\times \mathcal{X}\rightarrow \mathbb{R}}$. This is just plugging the decision rule into the other one: ${L(\theta, \delta(x))}$. Sometimes we just start with this one, though. This was a no-brainer for our coin example, because I purposely set up the question to have a natural loss function. This was due to the fact that a well-defined “bet” was being made. In more general situations, the choice of a loss function could be seen as essentially equivalent to picking a betting scheme for your choices. You could easily come up with some wacky ones to see that it might not reflect reality if you aren’t careful.

To me the more “intuitive” notion is that of the risk function. This is the expected value of the loss:

$\displaystyle R(\theta, \delta)=E_{\theta}[L(\theta, \delta(X))] = \int_\mathcal{X} L(\theta, \delta(x))f(x|\theta)dx$

Note that we integrate out the random variable ${x}$, but we are left with a function of ${\theta}$. We saw this in our coin example last time. We get a similar thing for the Bayes risk, but we incorporate the prior probability of ${\theta}$. Often it is actually easier to jump right to the risk: in the case of squared-error loss (defined below), the Bayes risk of the posterior mean is just the variance of the posterior distribution. No extra intermediary calculations are needed.
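Here is a sketch of that computation for the coin, with everything discrete so the integral becomes a finite sum. I’m using a 0-1-style loss that penalizes betting on the side that is less likely under the true ${\theta}$; this particular loss and the tie-breaking conventions are my own illustrative choices, not lifted from the post.

```python
from math import comb

def binom_pmf(k, n, theta):
    # f(k|theta): probability of k heads in n flips with bias theta.
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)

def loss(theta, action):
    # 0-1-style loss: lose 1 unit if you bet on the less likely side.
    better = "heads" if theta >= 0.5 else "tails"
    return 0.0 if action == better else 1.0

def delta(k, n):
    # Majority rule applied to k observed heads out of n flips.
    return "heads" if 2 * k >= n else "tails"

def risk(theta, n):
    # R(theta, delta) = E_theta[L(theta, delta(X))], a finite sum over the sample space.
    return sum(loss(theta, delta(k, n)) * binom_pmf(k, n, theta) for k in range(n + 1))

r = risk(0.7, n=100)  # probability the majority rule bets on the wrong side at theta = 0.7
```

As the formula says, the data has been summed out and `risk` is purely a function of ${\theta}$: it is tiny for strongly biased coins and largest near ${\theta = 0.5}$, where no rule can do well.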

In general, most loss functions will be a variant on one of two types. The first is called the squared-error loss function. It is given by ${L(\theta, a)=(\theta-a)^2}$. You can think of this as “least-squares” fitting your decision or minimizing risk in the ${L^2}$-norm. The other is called the ${0-1}$ loss function. This one arises quite naturally when you just have to pick between two choices like the coin flip. Ours was a variant on this. It penalizes you by ${1}$ unit if your “decision is incorrect” and doesn’t penalize you at all if your “decision is correct.” It is given by

$\displaystyle L(\theta, a_i)=\begin{cases} 0 & \text{if} \ \theta\in \Theta_i \\ 1 & \text{if} \ \theta\in \Theta_j \ \text{for} \ i\neq j\end{cases}$

The beautiful thing about this one is that the expected loss of an action is just ${1}$ minus the posterior probability that the action is correct. Thus, it is minimized by maximizing the posterior, which is often really easy to do. In the coin example, we got a beta distribution, and hence the maximum (the posterior mode) was easy to calculate. Of course, we have to be careful that we are measuring the right thing, because we weren’t trying to predict the true bias. We were merely trying to predict heads or tails, so that situation was an even easier discrete version.
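The two loss functions pick out different summaries of the posterior: squared-error leads to the posterior mean, while 0-1-style loss points at the mode. A quick sketch using the standard closed forms for a Beta distribution (these formulas are textbook facts about the Beta, not something derived in this post):

```python
def beta_posterior_stats(heads, n):
    # With a uniform prior and a binomial likelihood, the posterior on theta
    # after observing `heads` heads in n flips is Beta(heads + 1, tails + 1).
    a, b = heads + 1, (n - heads) + 1
    mean = a / (a + b)               # Bayes estimate under squared-error loss
    mode = (a - 1) / (a + b - 2)     # MAP estimate; formula valid for a, b > 1
    return mean, mode

mean, mode = beta_posterior_stats(heads=60, n=100)
# mode is exactly the observed frequency 0.6; mean = 61/102 is pulled slightly toward 1/2
```

Note that the mean and mode agree only when the posterior is symmetric, which is why conflating them is a (mild) error for a skewed Beta.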

Lastly, there is a partial ordering on decision rules given by $\delta_1 \leq \delta_2$ if and only if $R(\theta, \delta_1) \leq R(\theta, \delta_2)$ for all $\theta$. We say $\delta_1$ dominates $\delta_2$ if, in addition, the inequality is strict for some $\theta$. A decision rule that is not dominated by any other rule is called admissible, and using one corresponds to a rational decision. If you use a dominated rule, you are just asking to lose more.
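A small sketch of checking dominance on a grid of ${\theta}$ values, using two estimators of the bias whose squared-error risks have closed forms (from the usual bias-variance decomposition; this comparison is my own illustration, not from the post). The rule $\delta_2(X) = (X+1)/n$ carries the same variance as $\delta_1(X) = X/n$ plus a squared bias of $1/n^2$, so it is dominated, hence inadmissible:

```python
N = 100  # number of flips, fixed for both estimators

def risk_mle(theta):
    # Squared-error risk of delta_1(X) = X/N: just its variance, theta(1-theta)/N.
    return theta * (1 - theta) / N

def risk_shifted(theta):
    # Squared-error risk of delta_2(X) = (X+1)/N: same variance plus bias^2 = 1/N^2.
    return theta * (1 - theta) / N + 1 / N**2

def dominates(r1, r2, thetas):
    # delta_1 dominates delta_2: never worse on the grid, strictly better somewhere.
    no_worse = all(r1(t) <= r2(t) for t in thetas)
    better_somewhere = any(r1(t) < r2(t) for t in thetas)
    return no_worse and better_somewhere

grid = [i / 100 for i in range(101)]
dominates(risk_mle, risk_shifted, grid)  # -> True: delta_2 is inadmissible
```

Of course, a finite grid can only certify dominance up to the resolution of the grid; here the closed forms make the conclusion exact.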

Well, I think this post has gone on long enough (I’ve basically been trapped at the airport for the past 8 hours, so …). We’ll get back to some examples of all this next time. I just needed to finally formalize what we were doing before going any further.