# A Mind for Madness

## 2013 in review

Just because, why not?

The WordPress.com stats helper monkeys prepared a 2013 annual report for this blog.

Here’s an excerpt:

The concert hall at the Sydney Opera House holds 2,700 people. This blog was viewed about 44,000 times in 2013. If it were a concert at Sydney Opera House, it would take about 16 sold-out performances for that many people to see it.


## Gauss’ Law

Since my blog claims to talk about physics sometimes and I just finished teaching multivariable calculus, I thought I’d do a post on one form of Gauss’ law. As a teacher of the course, I found this to be an astonishingly beautiful “application” of the divergence theorem. It turned out to be a touch too difficult for my students (and I vaguely recall being extremely confused about this when I took the class myself).

First, I’ll remind you what some of this stuff is if you haven’t thought about these concepts for a while. Let’s work in ${\mathbb{R}^3}$ for simplicity. Consider some subset ${U\subset \mathbb{R}^3}$. Let ${F: U\rightarrow \mathbb{R}^3}$ be a vector field. Mathematically this just means assigning a vector to each point of ${U}$. For calculus we usually put some fairly restrictive conditions on ${F}$, such as requiring all partial derivatives to exist and be continuous.

The above situation is ubiquitous in classical physics. The vector field could be the gravitational field or the electric field, or it could describe the velocity of a flowing fluid, or … One key quantity you might want to know about your field is the flux of the field through a given surface ${S}$. This measures the net flow of the field through the surface. If ${S}$ is just a sphere, then it is easy to visualize the flux as the amount flowing out of the sphere minus the amount flowing in.

Let’s suppose ${S}$ is a smooth surface bounding a solid volume ${E}$ (e.g. the sphere bounding the solid ball). In this case we have a well-defined “outward normal” direction. Define ${\mathbf{n}}$ to be the unit vector field in this direction at all points of ${S}$. Just by definition the flux of ${F}$ through ${S}$ must be “adding up” the values of ${F\cdot \mathbf{n}}$ over ${S}$, because this dot product just tells us how much ${F}$ is pointing in the outward direction.

Thus we define the flux (using Stewart’s notation) to be:

$\displaystyle \iint_S F\cdot d\mathbf{S} := \iint_S F\cdot \mathbf{n} \,dS$

Note the second integral is integrating a scalar valued function with respect to surface area “dS.” Now recall that the divergence theorem says that in our situation (given that ${F}$ extends to a vector field on an open set containing ${E}$) we can calculate this rather tedious surface integral by converting it to a usual triple integral:

$\displaystyle \iint_S F\cdot d\mathbf{S} = \iiint_E div(F) \,dV$

If you’re advanced, then of course you could just work this out as a special case of Stokes’ theorem using the musical isomorphisms and so on. Let’s now return to our original problem. Suppose I have a charge ${Q}$ inside some surface ${S}$ and I want to compute the flux of the associated electric field through ${S}$.

From the given information this would seem absolutely impossible. If ${S}$ can be anything, and ${Q}$ can be located anywhere inside, then of course there are just way too many variables to come up with a reasonably succinct answer. Surprisingly, Gauss’ law tells us that no matter what ${S}$ is and where ${Q}$ is located, the answer is always the same, and it is just a quick application of the divergence theorem to prove it.

First, let’s translate everything so that ${Q}$ is located at the origin. Since flux is translation invariant, this will not change our answer. We first need to know what the electric field is, and this is essentially a direct consequence of Coulomb’s law:

$\displaystyle F(x,y,z)=\frac{kQ}{(x^2+y^2+z^2)^{3/2}}\langle x, y, z\rangle$

If we care about higher dimensions, then we might want to note that the value only depends on the radial distance from the origin and write it in the more succinct way ${\displaystyle F(r)=\frac{kQ}{|r|^3}r}$, where ${k}$ is just some constant that depends on the textbook/units you are working in. Let’s first compute the partial of the first coordinate with respect to ${x}$ (ignoring the constant factor for now):

$\displaystyle \frac{\partial}{\partial x}\left(\frac{x}{(x^2+y^2+z^2)^{3/2}}\right) = \frac{-2x^2+y^2+z^2}{(x^2+y^2+z^2)^{5/2}}$

You get similar expressions for the other derivatives involved in the divergence, except the ${-2}$ lands on ${y^2}$ and ${z^2}$ respectively. When you add all these together, the numerator is ${-2x^2-2y^2-2z^2+2x^2+2y^2+2z^2=0}$. Thus the divergence is ${0}$ everywhere and hence by the divergence theorem the flux must be ${0}$ too, right? Wrong! And that’s where I lost most of my students.
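If you want to double-check the algebra, here is a minimal symbolic sketch (assuming sympy is available; the constant ${kQ}$ is dropped since it doesn’t affect whether the divergence vanishes):

```python
import sympy as sp

x, y, z = sp.symbols("x y z", real=True)
r2 = x**2 + y**2 + z**2

# The inverse-square field without its constant factor kQ.
F = [x / r2**sp.Rational(3, 2),
     y / r2**sp.Rational(3, 2),
     z / r2**sp.Rational(3, 2)]

div = sum(sp.diff(Fi, v) for Fi, v in zip(F, [x, y, z]))
print(sp.simplify(div))  # 0, away from the origin
```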

Recall that pesky hypothesis that ${F}$ can be extended to a vector field on an open set containing ${E}$. Our ${F}$ cannot be defined at the origin at all, let alone extended continuously across it. Thus we must do something different. Here’s the idea: we just change our region. Since the origin is an interior point of ${E}$, we can find a small sphere ${S_\varepsilon}$ of radius ${\varepsilon>0}$ centered at ${(0,0,0)}$ whose solid ball is properly contained in ${E}$.

Let ${\Omega}$ be the region between these two surfaces. Effectively this “cuts out” the bad point of ${F}$, and now we are allowed to apply the divergence theorem to ${\Omega}$, where our new boundary is ${S}$ oriented outward and ${S_\varepsilon}$ oriented inward (negatively). We already calculated that ${div(F)=0}$, so the divergence theorem applied to ${\Omega}$ reads ${\iint_S F\cdot d\mathbf{S} - \iint_{S_\varepsilon} F\cdot d\mathbf{S} = \iiint_\Omega div(F)\,dV = 0}$. This gives us

$\displaystyle \iint_S F\cdot d\mathbf{S} = \iint_{S_\varepsilon} F\cdot d\mathbf{S}$

This is odd, because it says that no matter how bizarre or gigantic ${S}$ was we can just compute the flux through a small sphere and get the same answer. At this point we’ve converted the problem to something we can do because the unit normal is just ${\mathbf{n}=\frac{1}{\sqrt{x^2+y^2+z^2}}\langle x, y, z\rangle}$. Direct computation gives us

$\displaystyle F\cdot \mathbf{n} = \frac{kQ (x^2+y^2+z^2)}{(x^2+y^2+z^2)^2}=\frac{kQ}{x^2+y^2+z^2}$

On ${S_\varepsilon}$ we have ${x^2+y^2+z^2=\varepsilon^2}$, so plugging this all in we get that the flux through ${S}$ is

$\displaystyle \iint_{S_\varepsilon} \frac{kQ}{\varepsilon^2} \,dS = \frac{kQ}{\varepsilon^2}\,Area(S_\varepsilon) = \frac{kQ}{\varepsilon^2}\cdot 4\pi\varepsilon^2 = 4\pi k Q.$

That’s Gauss’ Law. It says that no matter the shape of ${S}$ or the location of the charge inside ${S}$, you can always compute the flux of the electric field produced by ${Q}$ through ${S}$ as a constant multiple of the amount of charge! In fact, most books use $k=1/(4\pi \varepsilon_0)$, where $\varepsilon_0$ is the “permittivity of free space,” which makes the flux exactly $Q/\varepsilon_0$ and kills off practically all extraneous symbols in the answer.
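As a numerical sanity check, here is a small sketch (assuming numpy and scipy, with ${k=Q=1}$) that integrates ${F\cdot\mathbf{n}}$ over spheres of very different radii; the radius cancels and both answers come out to ${4\pi}$:

```python
import numpy as np
from scipy import integrate

def flux_through_sphere(R, k=1.0, Q=1.0):
    # On a sphere of radius R: F·n = kQ/R^2 and dS = R^2 sin(phi) dphi dtheta,
    # so the R dependence cancels before we even integrate.
    integrand = lambda phi, theta: (k * Q / R**2) * R**2 * np.sin(phi)
    value, _ = integrate.dblquad(integrand, 0.0, 2.0 * np.pi, 0.0, np.pi)
    return value

print(flux_through_sphere(0.01), flux_through_sphere(100.0), 4 * np.pi)
```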

## Christmas: The Nativity Story as Presented in Matthew and Luke

Since one of my most read posts this year was my analysis of the passion narrative around Easter time, I thought I’d do another one of these for Christmas. I’m not going to present much of the historical analysis showing that the events depicted are fiction (it is well known that there was no census, no slaughter of innocent children, Nazareth probably wasn’t even a town at the time of Jesus, etc.). This will again be a textual analysis. It will look at what the stories are and why they were invented based on internal evidence.

First, I’d like to get this out of the way. As I pointed out in the last analysis, most people who grow up in a Christian household don’t realize there are differences between the passion narratives or why they exist. This is somewhat forgivable because they are roughly the same story told in different ways. For the nativity, there are two very different birth narratives in the Gospels, and it seems to me a much more devious and intentionally misleading act that churches try to keep this from people.

The birth narratives in Matthew and Luke are not at all the same story told in different ways. They are radically different and cannot be reconciled. One or both must be fiction (as we will see, we have good reason to believe both are fiction). A game gets played in churches on Christmas Eve where a quote from Matthew gets read here, a quote from Luke there, and deceptively one consistent story is carefully crafted.

All the usual caveats apply. The Gospels were written by anonymous authors, but for ease of reference I’ll use the phrase “Matthew says” etc. to mean “the author of the book of Matthew says.” Before beginning, let’s talk about why Mark doesn’t have a birth narrative. In the early days of Christianity there were all sorts of competing sects with different views trying to make their beliefs the orthodox view. One of these competing views was called “adoptionism.” Adoptionists believed that Jesus was born human by normal human means and only later (during his baptism by John the Baptist) became “adopted by” God as his son. If you read Mark’s account of the baptism, it is pretty clear this is what is happening. So Mark probably omitted a birth narrative because he held this view and thought there was nothing special about Jesus’ birth.

Matthew and Luke are the only two Gospels with a birth narrative. So let’s start by just stating what their accounts are. In Matthew, Joseph wants to divorce Mary because she is pregnant, but is convinced not to in a dream. They already live in Bethlehem, so without travelling Mary has the child, and “wise men” come to investigate for Herod. The wise men decide not to return to Herod, and then Herod, grown suspicious, orders all male children to be killed. Luckily, Mary and Joseph flee to Egypt in time. Once Herod dies they want to go back, but because Herod’s son is ruler they instead go to Nazareth.

In Luke, an angel appears to Mary to inform her of her pregnancy. Then emperor Augustus orders an empire wide census and people need to return to their ancestral home to register. Mary and Joseph live in Nazareth in this story. Thus Joseph needs to go to Bethlehem, his father’s father’s father’s … (13 times) father David’s birthplace. Here Mary gives birth in a stable because there is no room in the inn. Nearby shepherds come and worship him (note: no wise men in this story). Jesus is taken to the Temple for the standard Jewish rites and is recognized as the Messiah there. Once finished, they return to Nazareth.

Essentially no part of the stories matches up. Where they are living at the start is different. Joseph gets informed by a dream vs Mary gets informed by an angel. Staying in place vs travelling before the birth. Born at their home (surprising?) vs born in a manger. Wise men travelling vs nearby shepherds. Fleeing to Egypt vs immediately returning to Nazareth. Herod ordering the murder of children vs Augustus ordering a census. Travelling to Nazareth for the first time vs returning to Nazareth.

In fact, in most cases these just can’t be harmonized at all. If one happened, then the other must be fiction. If these details aren’t enough to convince you, consider that Matthew’s mention of Herod places Jesus’ birth before 4 BC and Luke’s mention of Quirinius places Jesus’ birth after 6 AD. It is a true impossibility that both these scenarios happened.

Let’s look at how the stories are told to figure out why they might be making them up. Just like in the Easter post, the text itself provides clues to the theological points the authors are trying to make. Matthew is the easy one, because he is quite explicit about what he is doing. He bangs the reader over the head with it over and over throughout the story by saying, “To fulfill what the prophet had said…”

Aha. Now we have a hypothesis for why these strange stories were made up. Matthew needs Jesus to fulfill a bunch of prophecies from the Jewish scriptures to make his theological point that Jesus was the Messiah. Thus our hypothesis is that Matthew looked at what the prophecies were and then made up a story to fit them. Wait, I hear you protesting. You say there is no way for us to tell the difference between Matthew writing what he thought was true and just happened to fulfill prophecy versus Matthew making up a story after the fact to fit it.

Here’s the interesting thing. We actually can tell, because Matthew was using the Greek translation of the Old Testament called the Septuagint. It turns out that there were some mistranslations and misinterpretations coming from this version that are not in the actual scriptures. Thus either Jesus was fulfilling mistranslated prophecy (and hence not the real prophecy) or Matthew was making up a story based on the mistranslation. I’ll let you decide.

Here are two examples of that. In the original Hebrew version of Isaiah, the word “almah” is used to indicate that a “young woman” would give birth (“and they shall call him Immanuel”). Strangely, even though Hebrew has a different word for “virgin” (as opposed to “young woman”), the Septuagint mistranslates it as “parthenos,” meaning virgin. Thus Matthew needs Mary to give birth to Jesus as a virgin in order to fulfill a mistranslated prophecy.

Let’s take this same part of the story in Luke. Luke makes no mention of a prophecy that Jesus would be born of a virgin. Instead, he makes it pretty clear through the words of the angel Gabriel, “He will be called the Son of God” what theological point he is trying to make. Jesus is born of a virgin not to fulfill prophecy, but to be clear that Jesus is the literal son of God and no human created him in the natural way. Thus we start to see that even ignoring historical evidence there is internal textual evidence that Matthew and Luke were advancing certain theological concerns in constructing their narratives.

The key strangeness of Matthew’s story is that Mary and Joseph live in Bethlehem, so why make up this thing about Herod which forces them to flee and eventually end up in Nazareth? Well, like the rest of the story, Matthew is trying to pull all of his details from prophecy. He has to reconcile two seemingly contradictory prophecies. The first is Micah 5:2 which seemingly predicts the birthplace of the Messiah to be Bethlehem, but also there is an unspecified reference to a prophecy that says “He shall be called a Nazarene.”

It is interesting that, again, both of these interpretations are wholly unfounded and the effort to reconcile them seems for naught. The Micah prophecy is really just a description of where the Davidic dynasty originated; the Christian interpretation of it as a prophecy didn’t appear until much later. The “Nazarene” prophecy appears nowhere in the Jewish scriptures. There are several theories on where it came from, one of which is again just a mistranslation of Judges 13:5, “The boy shall be a Nazirite to God.” But the word Nazirite has nothing to do with Nazareth. It merely means one consecrated by taking vows, and it is in reference to Samson.

One can go on and on showing how Matthew not only pulled his details from prophecies, but how we know that he did so based on mistranslations or interpretations from the Septuagint. In fact, if you want to see a more thorough analysis along these lines check out chapter III of Randel Helms’ Gospel Fictions. Some scholars even propose the hypothesis that Matthew’s account is an example of Jewish Midrash (note he chooses to have Jesus flee to Egypt which is essentially retracing the steps of Moses’ flight out of Egypt).

Well, we could go on like this forever because entire bookcases have been filled with writings on this one topic, but hopefully this was interesting and new to some people. I didn’t give many references, because essentially every single New Testament scholar and ancient historian will tell you the above (and most of them are Christian!). If this sounds like some fringe atheist analysis, I challenge you to find a single respected New Testament scholar (one without a major source of income coming from evangelical apologetics) who doesn’t hold this view.

In fact, these exact things are taught in most seminaries, so your pastor/minister is fully aware of these types of analyses. I would imagine most mainline protestant pastors would tell you behind closed doors that they also believe the birth narratives to be fiction for the above reasons. Any introductory text on the subject would go through all of this. For example, Jesus, Interrupted by Bart Ehrman probably covers this (though I’m not committing to that since I don’t have it with me).

## The Functor of Points Revisited

Mike Hopkins is giving the Milliman Lectures this week at the University of Washington, and the first talk involved an idea that I’m extremely familiar with, but that I’m always surprised most mathematicians are unfamiliar with. I’ve made almost this exact post several other times, but it bears repeating. As I basked in the amazingness of this idea during the talk, I couldn’t help but notice how annoyed some people seemed to be at the level of abstractness and generality this notion forces on you.

Every branch of math has some crowning achievements and insights into how to actually think about something so that it works. The idea I’ll present in this post is a truly remarkable insight into geometry and topology. It is incredibly simple (despite the daunting language) which is what makes it so fascinating. Here is the idea. Suppose you care about some type of spaces (metric, topological, manifolds, varieties, …).

Let ${X}$ be one of your spaces. In order to figure out what ${X}$ is you could probe it by other spaces. What does this mean? It just means you look at maps ${Y\rightarrow X}$. If ${X}$ is a topological space, then you can recover the points of ${X}$ by considering all the maps from a singleton (i.e. point) ${\{x\} \rightarrow X}$. If you want to understand more about the topology, then you probe by some other spaces. Simple.
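Here is a toy illustration in Python (a sketch using nothing beyond the standard library): if we model a “space” as a bare finite set, then the maps from a one-point probe are exactly its points, and probing by bigger sets sees more.

```python
from itertools import product

def hom(Y, X):
    """All maps Y -> X between finite sets, encoded as dicts."""
    Y = list(Y)
    return [dict(zip(Y, values)) for values in product(X, repeat=len(Y))]

X = {"a", "b", "c"}
point = {"*"}                 # the one-point probe
print(len(hom(point, X)))     # 3: one map per point of X
print(len(hom({1, 2}, X)))    # 9: probing by a two-point set
```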

Even analysts use this idea all the time. A distribution ${\phi}$ (on ${\mathbb{R}}$) does not have well-defined pointwise values, so you can’t tell whether two distributions are the same by comparing values. Instead you probe it by test functions via the pairings ${\int \phi f \,dx}$. If these probes give you the same thing for all test functions, then the distributions are the same. This is all we are doing with our spaces above, and this is all the Yoneda lemma is saying. It says that if the maps (test functions) to ${X}$ and the maps to ${Y}$ are the same, then ${X}$ and ${Y}$ are the same.
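A numerical caricature of this (a sketch assuming numpy and scipy): a distribution is known only through its pairings with test functions, and a sufficiently narrow Gaussian is “the same as” the delta at ${0}$ because every probe nearly agrees.

```python
import numpy as np
from scipy import integrate

delta0 = lambda f: f(0.0)  # <delta_0, f> = f(0)

def narrow_gaussian(f, eps=0.05):
    # <phi_eps, f>: integrate f against a Gaussian bump of width eps.
    g = lambda t: f(t) * np.exp(-t**2 / (2 * eps**2)) / np.sqrt(2 * np.pi * eps**2)
    return integrate.quad(g, -1.0, 1.0)[0]

for f in (np.cos, lambda t: np.exp(-t**2)):
    print(delta0(f), narrow_gaussian(f))  # nearly equal on every probe
```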

We can fancy up the language now. Considering maps to ${X}$ is a functor ${Hom(-,X): Spaces^{op} \rightarrow Set}$. Such a functor is called a presheaf on the category of Spaces (recall that for your particular situation this might be the category of smooth manifolds or metric spaces or algebraic varieties or …). Don’t be scared. This is literally the definition of presheaf, so if you were following up to now, then introducing this term requires no new definitions.

The Yoneda lemma is saying something very simple in this fancy language. It says that there is a (fully faithful) embedding of Spaces into Pre(Spaces), the category of presheaves on Spaces. If we now work with this new category of functors, we just enlarge what we consider to be a space and this is of fundamental importance for many reasons. If ${X}$ is one of our old spaces, then we can just naturally identify it with the presheaf ${Hom(-,X)}$. The reason Mike Hopkins is giving for why this is important is very different from the one I’ll give which just goes to show how incredibly useful this idea is.

In every single branch of math people care about some sort of classification problem. Classify all elliptic curves. What are the vector bundles on my manifold? If I fix a vector bundle, what are the connections on my vector bundle? What are the Borel measures on my metric space? The list goes on forever.

In general, classification is an impossibly huge task to grapple with. We know a ton of stuff about smooth manifolds, but how can we leverage that to make the seemingly unrelated problem of classifying vector bundles more manageable? Here our insight comes to the rescue, because there is a way to write down a functor that outputs vector bundles. There is subtlety in writing it down properly (and we should now land in Grpds instead of Set so that we can identify isomorphic ones), but once we do this we get a presheaf. In other words, we make a (generalized) space whose points are the objects we are classifying.

In many situations you then go on to prove that this moduli space of vector bundles is actually one of the original types of spaces (or not too far from one) we know a lot about. Now our impossible task of understanding what the vector bundles on my manifold are is reduced to the already studied problem of understanding the geometry of a manifold itself!

Here is my challenge to any analyst who knows about measures. Warning, this could be totally ridiculous and nonsense because it is based on reading Wikipedia for 5 minutes. Construct a presheaf of real-valued Radon measures on ${\mathbb{R}}$. Analyze this “space.” If it was done right, you should somehow recover that the space is the dual space of ${C_c(\mathbb{R})}$, the space of compactly supported continuous real-valued functions on ${\mathbb{R}}$. Congratulations, you’ve just started a new branch of math in which you classify measures on a space by analyzing the topology/geometry of the associated presheaf.

## Bayesian Statistics Worked Example Part 2

Last time I decided my post was too long, so I cut some stuff out, and now this post is fleshing those parts into their own post. Recall our setup. We perform an experiment of flipping a coin. Our data set consists of ${a}$ heads and ${b}$ tails. We want to run a Bayesian analysis to figure out whether or not the coin is biased. Our bias is a number between ${0}$ and ${1}$ which just indicates the expected proportion of times it will land on heads.

We found our situation was modeled by the beta distribution: ${P(\theta |a,b)=\beta(a,b)}$. I reiterate here a word of warning. ALL other sources will call this ${B(a+1, b+1)}$. I’ve just shifted by 1 for ease of notation. We saw last time that if our prior belief is that the probability distribution is ${\beta(x,y)}$, then our posterior belief should be ${\beta(x+a, y+b)}$. This simple “update rule” falls out purely from Bayes’ Theorem.

The main thing I didn’t explain last time was what exactly I meant by the phrase “we can say with 95% confidence that the true bias of the coin lies between ${0.40}$ and ${0.60}$” or whatever the particular numbers are that we get from our data. What I had in mind for that phrase was something called the highest density interval (HDI). The 95% HDI is an interval for which the area under the distribution is ${0.95}$ (i.e. an interval spanning 95% of the distribution) such that every point in the interval has a higher probability density than any point outside of the interval (I apologize for such highly unprofessional pictures):

(It doesn’t look like it, but that is supposed to be perfectly symmetrical.)

The first is the correct way to make the interval, because notice all points on the curve over the shaded region are higher up (i.e. more probable) than points on the curve not in the region. There are lots of 95% intervals that are not HDI’s. The second is such a non-example, because even though the area under the curve is 0.95, the big purple point is not in the interval but is higher up than some of the points off to the left which are included in the interval.

Lastly, we will say that a hypothesized bias ${\theta_0}$ is credible if some small neighborhood of that value lies completely inside our 95% HDI. That small threshold is sometimes called the “region of practical equivalence (ROPE)” and is just a value we must set. If we set it to be 0.02, then we would say that the coin being fair is a credible hypothesis if the whole interval from 0.48 to 0.52 is inside the 95% HDI.
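In code this check is one line (a hypothetical helper, not anything standard; the interval used here is the one computed for the ${\beta(5,3)}$ posterior later in this post):

```python
def is_credible(theta0, hdi, rope=0.02):
    # theta0 is credible when its whole ROPE neighborhood sits inside the HDI.
    lo, hi = hdi
    return lo <= theta0 - rope and theta0 + rope <= hi

print(is_credible(0.5, (0.45, 0.75)))  # True: a fair coin is credible here
```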

A note ahead of time, calculating the HDI for the beta distribution is actually kind of a mess because of the nature of the function. There is no closed form solution, so usually you can just look these things up in a table or approximate it somehow. Both the mean ${\mu=\frac{a}{a+b}}$ and the standard deviation ${\left(\frac{\mu(1-\mu)}{a+b+1}\right)^{1/2}}$ do have closed forms. Thus I’m going to approximate for the sake of this post using the “two standard deviations” rule that says that two standard deviations on either side of the mean is roughly 95%. Caution, if the distribution is highly skewed, for example ${\beta(3,25)}$ or something, then this approximation will actually be way off.
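If you’d rather not rely on tables or the two-standard-deviation shortcut, here is a sketch of a direct numerical computation (assuming scipy; remember the post’s ${\beta(a,b)}$ is scipy’s Beta(${a+1,b+1}$)). For a unimodal density the 95% HDI is the narrowest interval containing 95% of the probability, so we can just minimize the width:

```python
from scipy import stats, optimize

def hdi_beta(a, b, mass=0.95):
    dist = stats.beta(a + 1, b + 1)  # shift from the post's notation
    # Width of the interval whose lower tail probability is lo_tail.
    def width(lo_tail):
        return dist.ppf(lo_tail + mass) - dist.ppf(lo_tail)
    res = optimize.minimize_scalar(width, bounds=(0.0, 1.0 - mass), method="bounded")
    return dist.ppf(res.x), dist.ppf(res.x + mass)

print(hdi_beta(3, 1))  # posterior after 3 heads, 1 tail with a flat prior
```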

Let’s go back to the same examples from before and add in this new terminology to see how it works. Suppose we have absolutely no idea what the bias is and we make our prior belief ${\beta(0,0)}$ the flat line. This says that we believe ahead of time that all biases are equally likely. Now we observe ${3}$ heads and ${1}$ tails. Bayesian analysis tells us that our new distribution is ${\beta(3,1)}$. The 95% HDI in this case is approximately 0.49 to 0.84. Thus we can say with 95% certainty that the true bias is in this region. Note that it is NOT a credible hypothesis off of this data to guess that the coin is fair, because 0.48 is not in the HDI. This example really illustrates how choosing different thresholds can matter, because if we picked a ROPE of 0.01 rather than 0.02, then that guess would be credible!

Let’s see what happens if we use just an ever so slightly more reasonable prior. We’ll use ${\beta(2,2)}$. This gives us a starting assumption that the coin is probably fair, but it is still very open to whatever the data suggests. In this case our ${3}$ heads and ${1}$ tails tells us our posterior distribution is ${\beta(5,3)}$. The 95% HDI is now 0.45 to 0.75. Using the same data we get a somewhat narrower interval here, but more importantly we feel much more comfortable with the claim that the coin being fair is still a credible hypothesis.

This brings up a sort of “statistical uncertainty principle.” If we want a ton of certainty, then it forces our interval to get wider and wider. This makes intuitive sense, because if I want to give you a range that I’m 99.9999999% certain the true bias is in, then I better give you practically every possibility. If I want to pinpoint a precise spot for the bias, then I have to give up certainty (unless you’re in an extreme situation where the distribution is a really sharp spike or something). You’ll end up with something like: I can say with 1% certainty that the true bias is between 0.59999999 and 0.6000000001. We’ve locked onto a small range, but we’ve given up certainty. Note the similarity to the Heisenberg uncertainty principle which says the more precisely you know the momentum or position of a particle the less precisely you know the other.

Let’s wrap up by trying to pinpoint exactly where we needed to make choices for this statistical model. The most common objection to Bayesian models is that you can subjectively pick a prior to rig the model to get any answer you want. Hopefully this wrap-up will show that, in the abstract, that objection is essentially correct, but that in real-life practice you cannot get away with it.

Step 1 was to write down the likelihood function ${P(a,b|\theta)=\theta^a(1-\theta)^b}$. This was derived directly from the type of data we were collecting and was not a choice. Step 2 was to determine our prior distribution. This was a choice, but a constrained one. In real-life statistics you will probably have a lot of prior information that will go into this choice. Recall that the prior encodes both what we believe is likely to be true and how confident we are in that belief. Suppose you make a model to predict who will win an election based on polling data. You have previous years’ data, and that collected data has been tested, so you know how accurate it was! Thus forming your prior based on this information is a well-informed choice. Just because a choice is involved here doesn’t mean you can arbitrarily pick any prior you want to get any conclusion you want.

I can’t reiterate this enough. In our example, if you pick a prior of ${\beta(100,1)}$ with no reason to expect the coin is biased, then we have every right to reject your model as useless. Your prior must be informed and must be justified. If you can’t justify your prior, then you probably don’t have a good model. The choice of prior is a feature, not a bug. If a Bayesian model turns out to be much more accurate than all other models, then it probably came from the fact that prior knowledge was not being ignored. It is frustrating to see opponents of Bayesian statistics use the “arbitrariness of the prior” as a failure when it is exactly the opposite (see the picture at the end of this post for a humorous illustration).

The last step is to set a ROPE to determine whether or not a particular hypothesis is credible. This merely rules out considering something right on the edge of the 95% HDI from being a credible guess. Admittedly, this step really is pretty arbitrary, but every statistical model has this problem. It isn’t unique to Bayesian statistics, and it isn’t typically a problem in real life. If something is so close to being outside of your HDI, then you’ll probably want more data. For example, if you are a scientist, then you re-run the experiment or you honestly admit that it seems possible to go either way.

## What is Bayesian Statistics: A basic worked example

I did a series on Bayes’ Theorem a while ago, and it gave us some nice heuristics on how a rational person ought to update their beliefs as new evidence comes in. The term “Bayesian statistics” gets thrown around a lot these days, so I thought I’d do a whole post just working through a single example in excruciating detail to show what is meant by this. If you understand this example, then you basically understand what Bayesian statistics is.

Problem: We run an experiment of flipping a coin ${N}$ times and record a ${1}$ every time it comes up heads and a ${0}$ every time it comes up tails. This gives us a data set. Using this data set and Bayes’ theorem, we want to figure out whether or not the coin is biased and how confident we are in that assertion.

Let’s get some technical stuff out of the way. This is the least important part to fully understand for this post, but it is kind of necessary. Define ${\theta}$ to be the bias towards heads. This just means that if ${\theta=0.5}$, then the coin has no bias and is perfectly fair. If ${\theta=1}$, then the coin will never land on tails. If ${\theta = 0.75}$, then if we flip the coin a huge number of times we will see close to ${3}$ out of every ${4}$ flips land on heads. For notation we’ll let ${y}$ be the outcome of a single flip (so it is ${0}$ for tails or ${1}$ for heads).

We can encode this information mathematically by saying ${P(y=1|\theta)=\theta}$. In plain English: the probability that the coin lands on heads, given that the bias towards heads is ${\theta}$, is ${\theta}$. Likewise, ${P(y=0|\theta)=1-\theta}$. Let’s just chain a bunch of these coin flips together now. Let ${a}$ be the event of seeing ${a}$ heads when flipping the coin ${N}$ times (I know, the double use of ${a}$ is horrifying there, but the abuse makes notation easier later).

Since coin flips are independent we just multiply probabilities, and hence ${P(a|\theta)=\theta^a(1-\theta)^{N-a}}$ (up to a binomial coefficient counting the possible orderings, which we can safely drop since it will cancel when we normalize later). Rather than lug around the total number ${N}$ and have that subtraction, normally people just let ${b}$ be the number of tails and write ${P(a,b |\theta)=\theta^a(1-\theta)^b}$. Let’s just do a quick sanity check to make sure this seems right. Note that if ${a,b\geq 1}$, then as the bias goes to zero the probability goes to zero. This is expected because we observed a heads (${a\geq 1}$), so it is highly unlikely to be totally biased towards tails. Likewise as ${\theta}$ gets near ${1}$ the probability goes to ${0}$, because we observed a tails.
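As a one-line sketch in code (a hypothetical helper name; the binomial coefficient is dropped as above):

```python
def likelihood(theta, a, b):
    # P(a, b | theta) up to a constant: theta^a * (1 - theta)^b
    return theta**a * (1 - theta)**b

# After 3 heads and 1 tail, a bias of 0.75 is more likely than a fair coin:
print(likelihood(0.5, 3, 1), likelihood(0.75, 3, 1))  # 0.0625 vs ~0.1055
```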

The other special cases are when ${a=0}$ or ${b=0}$, and in these cases we just recover that the probability of getting heads ${a}$ times in a row, if the probability of heads is ${\theta}$, is ${\theta^a}$. Of course, the peak (mode) of ${\beta (a,b)}$ is at ${a/(a+b)}$, the observed proportion of heads. Moving on, we haven’t quite thought of this in the correct way yet, because in our introductory problem we have a fixed data set that we want to analyze. So from now on we should think about ${a}$ and ${b}$ being fixed from the data we observed.

The idea now is that as ${\theta}$ varies through ${[0,1]}$ we have a distribution ${P(a,b|\theta)}$. What we want to do is multiply this by the constant that makes it integrate to ${1}$ so we can think of it as a probability distribution in ${\theta}$. This distribution has a name, the beta distribution (caution: the usual form is shifted from what I’m writing), so we’ll just write ${\beta(a,b)}$ for it (the number we multiply by is the inverse of ${B(a,b)=\int_0^1 \theta^a(1-\theta)^b d\theta}$, called the (shifted) beta function).
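A quick sanity check of that normalizing constant (a sketch assuming scipy; the post’s shifted ${B(a,b)}$ is the standard Beta function evaluated at ${(a+1, b+1)}$):

```python
from scipy import integrate, special

a, b = 3, 1
direct, _ = integrate.quad(lambda t: t**a * (1 - t)**b, 0.0, 1.0)
print(direct, special.beta(a + 1, b + 1))  # both equal 1/20 here
```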

This might seem unnecessarily complicated to start thinking of this as a probability distribution in ${\theta}$, but it is actually exactly what we are looking for. Consider the following three examples:

The red one says that if we observe ${2}$ heads and ${8}$ tails, then the probability that the coin has a bias towards tails is greater. The peak happens at ${0.20}$, but because we don’t have a lot of data there is still a pretty high probability of the true bias lying elsewhere. The middle one says that if we observe 5 heads and 5 tails, then the most probable thing is that the bias is ${0.5}$, but again there is still a lot of room for error. If we do a ton of trials to get enough data to be more confident in our guess, then we see something like:

Already at observing 50 heads and 50 tails we can say with 95% confidence that the true bias lies between 0.40 and 0.60. Alright, you might be objecting at this point that this is just usual statistics, where the heck is Bayes’ Theorem? You’d be right. Bayes’ Theorem comes in because we aren’t building our statistical model in a vacuum. We have prior beliefs about what the bias is.

Let’s just write down Bayes’ Theorem in this case. We want to know the probability of the bias ${\theta}$ being some number given our observations in our data. We use the “continuous form” of Bayes’ Theorem:

$\displaystyle P(\theta|a,b)=\frac{P(a,b|\theta)P(\theta)}{\int_0^1 P(a,b|\theta)P(\theta)\,d\theta}$

I’m trying to give you a feel for Bayesian statistics, so I won’t work out the simplification of this in detail. Just note that the “posterior probability” (the left hand side of the equation), i.e. the distribution we get after taking into account our data, is the likelihood times our prior beliefs divided by the evidence. Now if you note that for a flat prior the denominator is just the definition of ${B(a,b)}$ and work everything out, it turns out to be another beta distribution!

If our prior belief is that the bias has distribution ${\beta(x,y)}$, then if our data has ${a}$ heads and ${b}$ tails we get ${P(\theta|a,b)=\beta(a+x, b+y)}$. The way we update our beliefs based on evidence in this model is incredibly simple. Now I want to sanity check that this makes sense again. Suppose we have absolutely no idea what the bias is and we make our prior belief ${\beta(0,0)}$ the flat line. This says that we believe ahead of time that all biases are equally likely.

Now we observe ${3}$ heads and ${1}$ tails. Bayesian analysis tells us that our new (posterior probability) distribution is ${\beta(3,1)}$:

Yikes! We don’t have a lot of certainty, but it looks like the bias is heavily towards heads. Danger: this is because we used a terrible prior. This is the real world, so it isn’t reasonable to think that a bias of ${0.99}$ is just as likely as ${0.45}$. Let’s see what happens if we use just an ever so slightly more modest prior. We’ll use ${\beta(2,2)}$. This encodes the assumption that the bias is most likely close to ${0.5}$, but it is still very open to whatever the data suggests. In this case our ${3}$ heads and ${1}$ tails tells us our updated belief is ${\beta(5,3)}$:

Ah. Much better. We see a slight bias coming from the fact that we observed ${3}$ heads and ${1}$ tails and these can’t totally be ignored, but our prior belief tames how much we let this sway our new beliefs. This is what makes Bayesian statistics so great. If we have tons of prior evidence of a hypothesis, then observing a few outliers shouldn’t make us change our minds. On the other hand, the setup allows for us to change our minds even if we are 99% certain on something as long as sufficient evidence is given. This is the mantra: extraordinary claims require extraordinary evidence.
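Here is a numerical check of that update rule on this exact example (a sketch assuming numpy and scipy): multiplying the ${\beta(2,2)}$ prior by the likelihood for ${3}$ heads and ${1}$ tails and normalizing should land exactly on ${\beta(5,3)}$.

```python
import numpy as np
from scipy import integrate, stats

theta = np.linspace(0.001, 0.999, 999)
prior = stats.beta(3, 3).pdf(theta)             # the post's beta(2,2), shifted for scipy
like = theta**3 * (1 - theta)**1                # 3 heads, 1 tail
posterior = prior * like
posterior /= integrate.trapezoid(posterior, theta)  # normalize numerically
exact = stats.beta(6, 4).pdf(theta)             # the post's beta(5,3)
print(np.max(np.abs(posterior - exact)))        # tiny: the update rule checks out
```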

Not only would a ton of evidence be able to persuade us that the coin bias is ${0.90}$, but we should need a ton of evidence. This points to one of the shortcomings of non-Bayesian analysis: it would be much easier to become convinced of such a bias if we didn’t have a lot of data and accidentally sampled some outliers.

Anyway. Now you should have an idea of Bayesian statistics. In fact, if you understood this example, then most of the rest is just adding parameters and using other distributions, so you actually have a really good idea of what is meant by that term now.

## In Defense of Gaming

It’s been over a month, so I decided to do a post that I’ve had in the bag for a while, but that I don’t think adds much new to the discussion. This is what happens when you are taking classes, teaching classes, writing things up, and applying for jobs, I guess.

Are video games art? What a bizarre question. It has been debated through the years, but I’m not sure there is anyone out there who has seriously thought about the question and is still willing to defend that they are not. The debate seems over, and the conclusion is that video games are art.

The one notable opposition is Roger Ebert, but his position boils down to a “no true Scotsman fallacy.” It is such a classic example that it should probably just start being used to illustrate what the fallacy is. He says games cannot be art. Then when shown a game that he admits is art he says, “But that isn’t a real game.” That would be like arguing novels cannot be art by just declaring that any novel that could be considered art is not a real novel. It is a silly argument that doesn’t need to be taken seriously.

First, we should notice that there is a “type error” (as a programmer would say) in the original question. No one would think “Are books art?” is a properly phrased question. What does that mean? If you find one book that is not art, then is the answer no? Do you merely need to give one book that is art to answer yes? The answer isn’t well-defined because “book” encompasses a whole class of objects: some of which are art and some of which are not.

For our purposes we’ll say a medium (like video games) “is art” if an artist can consistently use the medium to produce something that can be broadly recognized as art. This brings us to the difficult question of how to determine if something can be broadly recognized as art. Some things that come to mind are aesthetics/beauty, the ability to make a human being feel something, the ability to make someone think deeply about important questions, originality, and on and on we could go. Any given work of art could be missing any or all of these qualities, but if something exhibits enough of these qualities, then we would probably have no problem calling it art.

In order to argue that games can be works of art, I’ll take two examples that are relatively recent from the “indie game” community. These are both games in a sense that even Ebert could not deny. I’ll stay away from controversial examples like Dear Esther or Proteus (which are undeniably works of art but more questionable about being games).

The first is Bastion. The art direction and world that has been constructed is a staggering work of beauty on its own. Remove everything about this game except just exploring this universe and I think you would find many people totally engrossed in the experience.

We already have check mark one down. But there’s more! The music is fantastic as well. But let’s get to what really sets this game apart as a work of art. The story is fantastic and is mostly told with great voice acting through a narrator. I won’t spoil the ending in its totality, but I’m about to give away a major plot point near the end.

Your good friend betrays you and comes close to destroying everything (literally the whole world) in the middle of the game. It hurts. Then near the end he is going to die and you have the choice to save him. The game branches and you can either keep your weapons and safely fight your way to the end of the game, or you can carry this traitor through a dangerous area possibly sacrificing your own life for him.

Books and movies can’t do this. You have to make this choice and it affects how the story progresses. It reveals to you what type of human you are. You have to live with the consequences of this choice. If you save him, then you slowly walk through an area where your enemies shoot you from afar and there is nothing you can do. When they realize what you’re doing they stop in awe and just solemnly let you pass. The visuals plus the music plus the dramatic climax of this moment brings many people to tears.

I know this because you can just search discussion boards on the game. Gaming discussion boards are notorious for being misogynistic and full of masculine one-upmanship. No one makes fun of the people who say it brought them to tears, and usually there will be a bunch of other people admitting the same. If this sort of emotional connection isn’t art, then I don’t know what is. Not only that, but this type of connection can only really happen through games, where you are wholly invested because you’ve made these decisions.

Maybe Bastion isn’t your thing, because it is a “gamer’s game” with a bit of a barrier to entry since it involves experience points, weapons, items, leveling up, and real-time fighting of monsters and bosses. That could be a bit much for the uninitiated. We’ll move on to a game that every person, regardless of gaming experience, can play and really see how elegantly simple an “art game” can be.

Thomas Was Alone is extremely simple. Thomas is a rectangle. You move him to a rectangular door. End of level. The game is in a genre called a “puzzle platformer.” As the levels progress you get different sized rectangles to move and moving and jumping in various orders will help you get to the end. This is the “puzzle” aspect, because you have to figure out the correct order to do things otherwise you’ll get stuck.

Why is this art? Well, why is writing a book about some animals on a farm art? Because it isn’t really about animals on a farm. The same is true here. The game is a huge metaphor. A deeply moving one at that. I consistently had to stop playing at parts because of how overwhelmed with the concept I became when I allowed myself to think about it.

Just like Bastion, this game is truly magnificent visually. The style is the opposite: it has minimalism and simplicity as the guiding aesthetic virtues.

The music is perfect for the mood, and the narration which tells the story is beyond superb. You grow attached to these rectangles which have such nuanced personalities. What is the metaphor? Well, there are all these obstacles in your way, and you can’t get past them without working together. The whole idea is that there are seemingly impossible obstacles in life, but when humans cooperate and work together they can get past them.

The thing that makes the game so moving at parts is that your rectangle friends are so humanly flawed. They get upset at each other for such petty reasons. They have crushes on each other. They hate each other. But in the end they overcome those differences to work together and accomplish great things. If you haven’t experienced it, then this probably sounds totally absurd.

Again from discussion forums, I quote, “I just finished the game and a group of coloured quadrilaterals made me cry.” Or “Everything about this game makes me feel incredible. I feel as if I can achieve things I could never think of being. This is the best thing I could have experienced, and it’s worth everything…This game makes you love and cry over shapes.” When people have these reactions, that is without question the definition of art.

I think we’ve firmly established that games can be art. I thought I’d just bring up a few cultural tidbits right at the end here. Some famous art galleries across the world have started to recognize the importance of including works of art in their collections that happen to be games. MoMA (the Museum of Modern Art in New York) currently has 14 games in its collection. Paris had an exhibit that included Fez. The Smithsonian American Art Museum had one last year. There have been many others too.

I’ll try to wrap up now. If you’re the type of person that reads literary novels and goes to the symphony because you think experiencing art is an important and enriching experience, then you probably also write off video games as a mindless waste of time. This is partially warranted because so many of the most popular games today are mindless wastes of time (just like most popular music and movies are too).

I hope that after this maybe your mind has changed a little. If you are willing to make time in your schedule to read a book or go to an art gallery, then I’d argue that you should also be willing to make time in your schedule to experience great games. The medium has all the same artistic qualities as a great film, but has added value given by the interactivity you have with the medium.