# A Bayesian Formulation of Occam’s Razor

Today we will formulate Occam’s Razor in Bayesian terms. Recall that the principle says that if two hypotheses explain the data equally well, then the one with fewer assumptions is to be preferred. Before continuing, we should get a handle on what this principle is and what the Bayesian reformulation means. First, it is essentially a scientific heuristic. The intuitive reason for it is that unnecessary hypotheses make your model more likely to make mistakes (i.e. it will “overfit”).

What this post is going to do is give a formulation of it in Bayesian terms. This is not a mathematical proof that Occam’s Razor is true, but it is a proof that under certain mild assumptions the principle falls out as a consequence. That’s what makes this kind of cool. We want to decide whether hypothesis A or hypothesis B is the better statistical model, where A and B explain the observed data equally well but B has an extra parameter.

How should we do this? Well, in probabilistic terms we want to figure out ${P(A|D)}$ and ${P(B|D)}$: the “probability that ${A}$ is true given the data ${D}$” and the “probability that ${B}$ is true given the data ${D}$.” We then compare these two quantities, for example by taking the quotient

$\displaystyle \frac{P(A|D)}{P(B|D)}.$

If the quotient is near ${1}$, then they are roughly equally good models. If the quotient is large, then ${A}$ is the better hypothesis, and if it is close to ${0}$, then ${B}$ is the better hypothesis.

Let’s take stock of our assumptions here. We do not assume Occam’s Razor (some people feel it is a pretty steep assumption), because it is not a priori clear that it is always the best principle to follow. We merely assume that comparing the probabilities that each model is a true model of the observed data is a good test for selecting one model over another. It is hard to argue against such a common-sense assumption.

Now we use Bayes’ Theorem to convert these quantities into things we actually know about (the common factor ${P(D)}$ cancels in the quotient):

$\displaystyle \frac{P(A|D)}{P(B|D)} = \frac{P(D|A)P(A)}{P(D|B)P(B)}$

At this point we still have some difficulty with the ${B}$ hypothesis, because implicitly we have assumed it relies on some extra parameter ${\lambda}$. To simplify the argument, we will assume that ${\lambda}$ lies in some range (this isn’t unreasonable, because in real life you should have some idea of the order of magnitude of this parameter): ${\lambda_m \leq \lambda \leq \lambda_M}$. We will make a less reasonable simplifying assumption and say that once this range is specified, the prior on ${\lambda}$ is uniform over it, i.e.

$\displaystyle P(\lambda |B) = \frac{1}{\lambda_M - \lambda_m}$

for ${\lambda}$ in the range and ${0}$ otherwise. There will be an observed ${\lambda_0}$ that maximizes the likelihood function (i.e. fits the data the best). Choose ${\delta}$ so that ${\lambda_0 \pm \delta}$ is an interval giving us reasonable certainty of the best ${\lambda_0}$ (we could use the 95% HDI if we wanted to get the interval). We will approximate the likelihood by a Gaussian of width ${\delta}$ centered at ${\lambda_0}$, and assume the range ${[\lambda_m, \lambda_M]}$ contains essentially all of its mass. Now let’s work out what is happening for ${B}$:

$\displaystyle P(D|B) = \int P(D, \lambda|B)d\lambda = \int P(D|\lambda, B)P(\lambda |B)d\lambda$

$\displaystyle \approx\frac{1}{\lambda_M - \lambda_m}\int P(D|\lambda_0, B)\exp\left(-\frac{(\lambda-\lambda_0)^2}{2\delta^2}\right)d\lambda$

$\displaystyle =\frac{\delta\sqrt{2\pi}\,P(D|\lambda_0, B)}{\lambda_M - \lambda_m}$
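As a sanity check, we can verify that last step numerically. All the numbers below are invented purely for illustration: a parameter range of ${[0, 10]}$, best fit ${\lambda_0 = 4}$, width ${\delta = 0.5}$, and peak likelihood ${0.02}$.

```python
import math

# Invented numbers, purely for illustration.
lam_m, lam_M = 0.0, 10.0   # prior range for the extra parameter
lam0, delta = 4.0, 0.5     # best-fit value and its uncertainty width
L0 = 0.02                  # peak likelihood P(D | lambda_0, B)

# P(D|B) = (1/(lam_M - lam_m)) * integral of L0 * exp(-(lam - lam0)^2 / (2 delta^2)),
# computed here with a simple midpoint rule.
n = 200_000
dx = (lam_M - lam_m) / n
numeric = sum(
    L0 * math.exp(-((lam_m + (i + 0.5) * dx) - lam0) ** 2 / (2 * delta ** 2))
    for i in range(n)
) * dx / (lam_M - lam_m)

closed_form = delta * math.sqrt(2 * math.pi) * L0 / (lam_M - lam_m)
print(numeric, closed_form)
```

Because the Gaussian sits well inside the range, the midpoint sum and the closed form agree to many decimal places.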

Now we can plug this into our original comparison ratio, assuming equal priors ${P(A) = P(B)}$ (neither model is favored before seeing the data):

$\displaystyle \frac{P(A|D)}{P(B|D)}=\frac{(\lambda_M-\lambda_m)P(D|A)}{\delta\sqrt{2\pi}P(D|\lambda_0, B)}$
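To get a feel for the size of this penalty, here is the ratio evaluated with some invented numbers: equal fits ${P(D|A) = P(D|\lambda_0, B)}$, a prior range of ${[0, 10]}$ for ${\lambda}$, and ${\delta = 0.5}$.

```python
import math

# Invented numbers: both models fit the data equally well.
P_D_given_A = 0.02
P_D_given_lam0_B = 0.02
lam_m, lam_M = 0.0, 10.0
delta = 0.5

ratio = ((lam_M - lam_m) * P_D_given_A
         / (delta * math.sqrt(2 * math.pi) * P_D_given_lam0_B))
print(ratio)  # about 8: A is favored roughly 8-to-1 despite the identical fits
```

The factor of roughly 8 here is exactly ${(\lambda_M - \lambda_m)/(\delta\sqrt{2\pi})}$, the “Occam factor” charged to ${B}$ for its extra parameter.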

This gives us two main conclusions. The first is that if our two models make roughly equivalent predictions on the data, i.e. ${P(D|A)\approx P(D|\lambda_0, B)}$, then we should prefer ${A}$, because the factor ${\lambda_M-\lambda_m}$ in the numerator (the possible range for ${\lambda}$) will in general be quite a bit larger than ${\delta}$. This is exactly Occam’s Razor.

The possibly more interesting consequence is that we now know exactly how much this extra parameter “penalizes” the theory. So in specific cases we can test whether that extra parameter is worth putting in. In other words, are the predictions better enough with the extra parameter to overcome the penalty of introducing it? This abstract and vague notion from Occam’s Razor gets explicitly quantified in Bayesian analysis, so we can confidently apply Occam’s Razor when it is needed and avoid it when it isn’t.
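Concretely, ${B}$ overcomes its penalty exactly when ${P(D|\lambda_0, B)/P(D|A)}$ exceeds ${(\lambda_M-\lambda_m)/(\delta\sqrt{2\pi})}$. A sketch with invented numbers, where the extra parameter buys a 20x better fit:

```python
import math

# Invented numbers: prior range [0, 10] and delta = 0.5, but now suppose
# B's extra parameter makes the fit 20x better than A's.
lam_m, lam_M = 0.0, 10.0
delta = 0.5
P_D_given_A = 0.001
P_D_given_lam0_B = 0.02  # 20x larger than P(D|A)

penalty = (lam_M - lam_m) / (delta * math.sqrt(2 * math.pi))  # roughly 8
ratio = penalty * P_D_given_A / P_D_given_lam0_B
print(ratio)  # about 0.4 < 1: the better fit overcomes the penalty, so prefer B
```

Here the 20x improvement in fit beats the roughly 8x complexity penalty, so the razor correctly steps aside.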