## The Myth of a Great Seminar

Sometimes I peruse the debates at Intelligence Squared to see if any catch my eye. There was one this time that seemed really interesting to me. It was a debate on whether or not MOOCs are reasonable replacements for actual in-class and campus college experiences. You can see the full thing here.

This was interesting to me, because I’ve actually gone through a few MOOCs from start to finish and found them to be extremely good experiences. I was curious if there was research that would be mentioned about the effectiveness of one or the other. The debate was pretty disappointing in this regard. The main anti-MOOC argument was based around how wonderful small seminars are and that you can’t get this in a MOOC. That’s why I want to write a response to this mythical seminar.

Before talking about why I think such seminars don’t really exist in this Platonic, pristine state at any university, I want to first address the fact that the existence of seminars at all is pretty mythical. I decided to check the University of Washington’s Spring 2014 schedule. The senior level literature classes had a student range of 25-40, but most were about 30. Should I consider a 30 person class a “small seminar?” I get it. We’re a gigantic school, so I fully admit that small liberal arts colleges probably do have a lot of small seminars. But most students at most schools will graduate with few to no small seminars as their classes.

Even middle level courses like Introduction to the Theory of Literature at Ivy League schools are gigantic. That class probably has 100 students or more in it, and those are the types of courses that are offered as MOOCs. I think the comparison is a bit disingenuous when you take some capstone seminar and compare it to an “intro” MOOC. The MOOC side of the debate also responded to this criticism and pointed out that some MOOCs offer small group breakout sessions which actually do simulate small seminars. So the point doesn’t even stand.

Now that that rant is over, let’s pretend like the comparison is fair. Here are some of the myths I heard and why I think they are mostly myth (I’ll grant that maybe a few seminars run according to plan):

Let’s suppose for the sake of argument that the teacher is practically invisible in this mythical seminar and the students are all enraptured in high level critical conversation about Dostoevsky or some such nonsense. This seems to be the ideal the seminar aspires to. This is going to sound extremely cynical, but just how interesting can this conversation actually be? The seminar is going to be made up of an incredibly homogeneous group. Everyone is going to be about 20, never having had to make a living. They are all educated at the same school, which means they have roughly the same cultural experience, read the same books, and developed the same theories about how to analyze books.

What’s so great about this perfect conversation in comparison with a MOOC? When you take the exact same course as a MOOC, you will probably have a math professor in India, a farmer in the American midwest, a retired middle school teacher in Scotland, etc. The conversation about the same books is going to be infinitely more interesting and enlightening, because the perspectives will be so varied.

Now let’s back up a little from the perfect situation and get a little more realistic. We’ve all been to these seminar classes before. The free-flowing and enlightening conversation essentially never happens. You have some people who didn’t read the stuff. You have people who aren’t very good at articulating their thoughts on the spot. The whole thing usually turns into the professor calling on someone, a brief sentence or two is mumbled, and then the professor carries on along that point. The “conversation” is forced, and the student input is more like a prompt for the professor to riff on.

Depending on the day and material, the degree to which this is the case will vary, but I think the overall sentiment is what happens most days in most seminars. This is why I think a written discussion board in a MOOC is a far better method for discussion than a conversation in a seminar.

First off, there are hundreds more topics and conversations going on at a discussion board than in class. This means that you can search around for conversations that you really want to participate in. Second, you have to write your thoughts down. This gives you time to figure out what you are going to say rather than awkwardly spewing out some muddled nonsense while everyone stares at you. It also gives you time to figure out what other people mean before responding to them.

It is amazing the number of times you start typing a response, and then when you go back to what was actually said you realize you misunderstood at first. Which brings me to my next point. A discussion board records all of it. You can continually return to conversations as your understanding of a topic develops. The conversation doesn’t end at the end of the hour. Once you leave the physical setting of a seminar, it probably only takes a few hours to forget most of what most people said. The discussion board allows you to go back whenever you want to recall certain parts of certain conversations.

To summarize, I think most courses most people take are not seminars, so it is pointless to use them as a main argument against MOOCs. I also think that the MOOC setup is actually a better platform for enlightening discussion in almost every respect than an actual seminar. That being said, I think the anti-MOOC side has a point when they say that communication skills are developed in class discussion. Unfortunately, even small seminars tend not to have real “discussions,” so I don’t find that compelling (along with the fact that some MOOCs are incorporating small group live chat sessions now).

Don’t get me wrong. I don’t think all university education should be relegated to the online setting. I’m just saying that using some idealized small seminar as the main argument is a highly flawed way to go about it.

## Correlation Does not Imply Causation

I’ve never done this before in six years and well over 400 posts. I’m going to direct your attention to a webpage rather than write a post. As they say, “A picture is worth 1000 words,” so consider this a 1000 word post:

The full page is here.

This is exactly why it is so dangerous to conclude a relationship from statistically significant correlations. Even phenomena with direct known causal relationships tend not to have 0.99 correlation. Peruse the rest of that webpage at your own risk. It is quite addicting (who knew that so many people died from getting tangled in their bed sheets every year?).
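To see how easily a high correlation arises without any causal link, here is a minimal sketch. The two series and their names (`cheese`, `sheets`) are invented for illustration and are not data from that webpage; each is just an upward trend plus its own independent noise.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)
years = range(20)

# Two made-up quantities that both happen to grow over time.
# The noise terms are independent, so neither series influences the other.
cheese = [t + rng.uniform(-0.5, 0.5) for t in years]
sheets = [2 * t + rng.uniform(-0.5, 0.5) for t in years]

r = pearson(cheese, sheets)
print(f"correlation = {r:.3f}")
```

The shared trend alone pushes the correlation close to 1, even though by construction nothing connects the two series except the passage of time.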

## Rorty’s Pragmatism

Today I’d like to talk about Richard Rorty. He was an American philosopher who became famous in the late ’70s and ’80s for advocating a new form of pragmatism. I thought this might be a timely topic, because we’ve been spending a lot of time on making sense of data. Modern society has become polarized on a bunch of issues which basically stem from more fundamental questions: what is knowledge and what is truth?

On the one side we have radical scientism. This side argues that in order to count something as knowledge, it must be falsifiable, formulated as a scientific hypothesis, and demonstrated with 95% certainty. There are of course much milder variants on this side. For example, one might stipulate that all questions that naturally have a scientific formulation must meet scientific standards before we consider it to be reliable information, but science doesn’t have much to say about non-scientific questions.

The other side is radical skepticism or postmodernism (I know these are not at all the same thing). The radical skeptics claim that all knowledge is impossible, so we should be skeptical of all things that we hear (even if they were proven by a scientific study). I have a lot of sympathy for this side. Facebook alone makes me skeptical of basically anything anyone says, because I know that half of the interesting things I’m told probably come from a totally false Facebook post someone made. Everyone has bias and/or funding which skews results including supposedly objective scientific ones.

Postmodernism gives a bit more substance to this argument. It essentially says that we have no foundations anymore. Science can’t prove that science is getting at truth, so we shouldn’t treat it as a special class of knowledge. This “lack of foundations” argument ends up giving merit to a lot of dangerous ideas. Since the scientific method is no longer seen as the most reliable way to truth, maybe new age spirituality or alternative medicine actually works and is just as effective.

I’ll state my bias right up front. I tend to agree with the scientism viewpoint (although I’d probably call my stance “naturalism,” but let’s not get into that). Both sides make really good critiques of the other when done by a careful thinker. Science has assumptions that cannot be justified. It is merely building models. Maybe our model of gravity is totally wrong, but just happens to consistently give really accurate predictions when tested.

Science critiques the other positions as well. Skepticism is not self-consistent, because it requires you to be skeptical of skepticism. The lack of foundations in postmodernism does not mean that all things are equally likely to be true.

These differing foundations manifest in huge shouting matches: evolution vs intelligent design, medicine vs alternative medicine, atheism vs theism, and on and on. The main reason I err on the side of science is because all people seem to think that science provides the best answers until those answers disagree with their previously held beliefs. It is only then that the lack of foundations is pointed out or the bias of the researcher is brought up. See also this post which shows why the scientific method is needed to surpass bias and this post for an ethical reason to err on the side of science.

Anyway, we’ve passed 500 words already and I’m still just setting up why Rorty is such an important thinker. His views seem to just gain importance as data sets keep getting bigger and we get confused about who we should believe. Rorty basically comes up with a middle ground which is sometimes called neopragmatism. He entered the scene at a time where both sides seemed right and wrong. His position is that the postmodernists are right that there are no foundations, but this doesn’t matter because some systems are useful. Let’s unpack this a bit.

First off, if this interests you, then go read Philosophy and the Mirror of Nature. A quick blog post cannot do it justice. It is quite complex and subtle. One side says that they’ve built a fantastic pillar called science on the solid foundations of peer review, objectivity, etc. The other side says that all our institutions can be knocked down, because there are no solid foundations.

Rorty has a somewhat shocking response that both sides are wrong. There are no foundations (i.e. external objective standards), but this doesn’t mean the pillars are unstable. It just means that the rules of the game depend on which game we’re playing. When playing tennis, we must follow the rules of tennis. When doing science, we must play by the rules of science. There is no universal, correct rule set for all games. It is just dependent on the game. That’s okay. None are more “right” than another, because this concept doesn’t even make sense.

So what is truth? Rorty says that we can think about justification, but not about truth. How we justify beliefs is dependent on the system we are in. We know how to use the word true in each system, so we don’t have to define it. This is a very classic pragmatic response. When speaking of scientific truth, we have a collection of things we mean. When speaking of literary truth we have another. These truths are dependent on time and place (e.g. “It is a truth universally acknowledged, that a single man in possession of a good fortune must be in want of a wife.”)

So how is this different from the extreme relativism of postmodernism? Well, Rorty would say that usefulness has to be taken into account. There is no way to get at objective truth, but some systems are more useful for certain purposes than others. For example, at this point in time, science seems to be the most useful system to answer scientific questions. Your computer is working, polio was eradicated, we put people on the moon, etc, etc. As the internet meme goes, “Science. It works, bitches!” And so even though we don’t know if science is getting at truth (which reasonable scientists fully admit, by the way), it does consistently get at something useful. There may be other contexts in which scientific rigor is not the most useful system.

Rorty develops a theory that fully admits that the postmodernists are right when they say that we have no basis for foundations anymore. But he doesn’t descend into extreme relativism. He leaves room for some systems of thought to be more useful than others. They don’t have a monopoly on truth, because we don’t even know what that means. Relativism doesn’t even really make sense from Rorty’s viewpoint, because you can never leave your current context from which to make a relative judgement. And that’s why I think he’s so important. He points out that our shouting matches aren’t about content or truth. They are about coming at the same question from different systems.

## Fun with Decision Theory

I’ve done quite a few decision theory posts at this point, and I think I’m mostly done with it. So to conclude the section I thought I’d leave you with some fun thought experiments having to do with decision theory. You can use your new skills to try to analyze them.

The first thought experiment I want to present has been around since at least the late ’60s. It is generally referred to as Newcomb’s paradox. Here’s the setup. Suppose you encounter a strange being in the forest that can predict your decisions (it’s telepathic or something, just go with it for the purposes of the thought experiment).

They offer you a deal. They present Box A, which contains $1,000, and they present Box B, which contains either $0 or $1,000,000. You are allowed to take Box B by itself or both Box A and Box B. The being predicts what choice you will make to determine the contents of Box B. If they predict that you will take only Box B, then they put the $1,000,000 in it. If they predict that you will take both, then they put $0 in. All of this is done ahead of time (because they also correctly predicted that you would walk through this random area of the forest).

An important part of the setup is that the predictor puts the money in ahead of time, so the contents are not determined after you make a decision. The contents cannot change.

There are only four total possibilities, so if you use your decision theory skills, then it should be a pretty straightforward calculation to figure out how to maximize your profit. Strangely, this is often referred to as a paradox, because two equally valid sounding arguments lead to different answers.

Here’s one analysis. Suppose this being thinks you will pick both boxes. If you actually pick B, then you get nothing. If you actually pick both, then you get $1,000. Thus picking both gets you a better result in that case. Suppose it thinks you will only pick B. Then if you actually pick both, you get $1,001,000. If you only pick B, then you only get $1,000,000. Thus picking both leads to a better result in that case as well. In fact, picking both clearly gets you more money no matter what the prediction was. Thus picking both maximizes your profit.

The other analysis says that the first one ignored vital information. We can throw out two possibilities, because by the assumption of the thought experiment the prediction will never be wrong. Thus the only two possibilities are that you pick both, in which case you get $1,000, or you pick B, in which case you get $1,000,000. Therefore picking only B maximizes your profit.
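The two analyses can be put side by side in a short sketch. The `accuracy` parameter (the probability that the prediction matches your actual choice) is an assumption added for illustration; the thought experiment itself corresponds to `accuracy = 1`. Whether averaging over the prediction in this way is even the right calculation is exactly what the two arguments disagree about, so treat this as bookkeeping rather than a resolution.

```python
from fractions import Fraction

BOX_A = 1_000
BOX_B_FULL = 1_000_000


def payoff(choice, prediction):
    """Money received for a choice ('B' or 'both') given the prediction."""
    box_b = BOX_B_FULL if prediction == "B" else 0
    return box_b + (BOX_A if choice == "both" else 0)


def expected_payoff(choice, accuracy):
    """Expected money if the prediction matches the choice with probability accuracy."""
    other = "both" if choice == "B" else "B"
    return accuracy * payoff(choice, choice) + (1 - accuracy) * payoff(choice, other)


# First analysis: for each fixed prediction, taking both is better by $1,000.
for prediction in ("B", "both"):
    print(prediction, payoff("both", prediction) - payoff("B", prediction))

# Second analysis: with a perfect predictor, only two outcomes remain.
perfect = Fraction(1)
print(expected_payoff("B", perfect), expected_payoff("both", perfect))
```

Under these assumptions the two choices break even at an accuracy of 1001/2000, so in the expected-value bookkeeping even a predictor only slightly better than a coin flip favors taking only Box B.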

I won’t present any of the attempted resolutions of this, because I’ve given you some tools to think about it on your own for a while. I’ll just say that if you Google this, then you will find that tons of famous philosophers and mathematicians have attempted to resolve it. So answers are really easy to find if you get stuck or are curious to read more about it. If you aren’t sure where to start, I highly recommend stuff that Eliezer Yudkowsky has written on it. I dare say he has probably thought about this more deeply than most people.

Another fun and related issue is the idea of acting randomly being the best decision. Suppose you are playing a game in which if you make moves at random, then you have a 1/2 probability of winning. If your opponent can guess what your next move will be, then you only have a 1/4 chance of winning. Games like these are pretty easy to construct, but telling you one isn’t as important as the fact that it has this feature.

In such a situation, if you run your decision theory algorithm and come up with a deterministic set of moves to make that maximizes your chance of winning, then you will almost surely lose. This is because your opponent could figure out what moves you need to make to win and hence figure out which moves you are going to make. In such a situation, the only way to maximize your chance of winning is to ensure that you never make moves according to some rule that your opponent could figure out, i.e. picking a move at random maximizes your chance of winning.
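Here is a minimal sketch of this effect using a simpler game than the one described above: a hypothetical matching-pennies game in which you win a round when your move differs from your opponent’s guess. The opponent is assumed to have worked out the deterministic rule, so the win rates here (0 and about 1/2) differ from the 1/4 and 1/2 in the text. All function names are made up for the example.

```python
import random

MOVES = ("heads", "tails")


def alternate(history, rng):
    """A deterministic rule: heads on even rounds, tails on odd rounds."""
    return MOVES[len(history) % 2]


def coin_flip(history, rng):
    """A random rule: ignore the past and pick a move uniformly at random."""
    return rng.choice(MOVES)


def opponent_guess(history):
    """The opponent has worked out the alternating rule and guesses by it."""
    return alternate(history, None)


def win_rate(strategy, rounds, seed=0):
    """Fraction of rounds in which our move differs from the opponent's guess."""
    rng = random.Random(seed)
    history = []
    wins = 0
    for _ in range(rounds):
        guess = opponent_guess(history)
        move = strategy(history, rng)
        wins += move != guess
        history.append(move)
    return wins / rounds


print("deterministic:", win_rate(alternate, 10_000))
print("random:       ", win_rate(coin_flip, 10_000))
```

The deterministic player never wins because every move is anticipated, while the coin flipper wins about half the time no matter what the opponent has figured out.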

In some sense, if you make your decision according to some random mechanism external to yourself, then you prevent the game from becoming a “Newcomb-like problem.” In fact, some people try to resolve the Newcomb problem with such randomness. Anyway, I thought it would be fun to end this series with something a little lighter.

## Markov Chain Monte Carlo Example

Let’s look at a problem called parameter estimation. As usual, we have a bunch of coin flips. What we’ve learned to do with Bayesian statistics is calculate some posterior distribution that tells me how likely the bias ${\theta}$ is. I ask myself, “Is it a credible hypothesis that the coin is fair (${\theta =1/2}$)?” I find out yes it is. I ask myself, “Is it a credible hypothesis that the coin is massively biased at ${\theta=4/5}$?” I find out yes it is. Uh oh.

Maybe in abstract math land this type of contradictory information is fine. I should be honest that both are credible guesses based on my data, and Bayesian statistics helps me to be very precise about my degrees of certainty and uncertainty. Unfortunately, in the real world I want to figure out which ${\theta}$ is “best” so that I can use it in my model for practical purposes. This is called parameter estimation, because I want to estimate what my parameter ${\theta}$ should be in my model.

We’re in luck for the coin example, because we only have one parameter living in one-dimensional space. This alone vastly simplifies the situation, but we have something far, far better. Our posterior distribution has a unique maximum, that max is very close to the mean of the distribution, and that max can be determined easily and exactly! This means that we can safely use that parameter as the “best.”
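For reference, the exact maximum can be sketched as follows, assuming a uniform prior and writing ${h}$ and ${t}$ for the observed numbers of heads and tails (symbols introduced here for illustration). The posterior is then proportional to ${\theta^h(1-\theta)^t}$, and setting the derivative of its logarithm to zero gives

$\displaystyle \frac{h}{\theta}-\frac{t}{1-\theta}=0 \quad\Longrightarrow\quad \theta=\frac{h}{h+t},$

which is just the observed fraction of heads.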

In the real world, we often have several parameters we are trying to estimate in a high-dimensional space, and the posterior is some non-convex crazy thing with lots of local minima and maxima that can’t be determined analytically. Let’s face it. Optimization is really hard even in relatively nice situations. The real world is usually not nice.

There often isn’t even an obvious notion of what you mean by “best” set of parameters either. Think of a symmetrical bimodal distribution where both peaks have the same max. You don’t really have any good reason to pick one of the points that gives the max, and if you do something like take the mean, then you might end up with a min by accident. The method I’m going to describe doesn’t really help with this issue of “equally good choices”, but it does give a beautiful way to deal with high-dimensional parameter spaces and crazy posterior distributions.
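The bimodal trap is easy to see numerically. The sketch below assumes an equal mixture of two normal distributions centered at -2 and 2 with standard deviation 0.5; these particular numbers are an illustrative choice, not anything from a real model.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def bimodal_pdf(x):
    """Equal mixture of two normals centered at -2 and 2 (illustrative choice)."""
    return 0.5 * normal_pdf(x, -2.0, 0.5) + 0.5 * normal_pdf(x, 2.0, 0.5)


# Approximate the mean with a Riemann sum on a grid over [-6, 6].
step = 0.01
grid = [-6.0 + i * step for i in range(1201)]
weights = [bimodal_pdf(x) for x in grid]
mean = sum(x * w for x, w in zip(grid, weights)) / sum(weights)

print(f"mean              = {mean:.4f}")
print(f"density at mean   = {bimodal_pdf(mean):.6f}")
print(f"density at a peak = {bimodal_pdf(2.0):.6f}")
```

The mean lands essentially at 0, in the valley between the two peaks, where the density is a tiny fraction of its value at either peak.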

The idea is extremely simple. You will pick some starting collection of parameters. Then you let those parameters randomly move in some direction. We will then use our model to test whether or not it is more or less likely to see the data that we observed under each of those parameter choices. With some probability depending on this likelihood we will move that parameter to that value. This is just a Markov chain process of our ${\theta}$ values moving through the possible parameter values, and hence this technique is called a Markov Chain Monte Carlo (MCMC) method (I used the indefinite article “a” because there are all sorts of variations on this out there).

It turns out that as long as we set this up in a reasonable way, then it will converge. Here’s something cool about this. Your parameters could live in some gigantic space for which it would be impossible to search for a good parameter estimation. Usually there is some much, much smaller dimensional subset of reasonably likely candidates. Once you move to this smaller dimensional set, by the nature of the algorithm, you will stay close to it and hence start moving to something optimal much faster. Here’s a picture showing how the random walks stay on a smaller set in a real example:

Let’s actually implement this in the silly case of the coin example where we know what the answer should be. My next post might try to implement this for some sort of real data set, although that could be more time consuming than I’m willing to do. To make this example more fun, I had the computer pick a random number in ${[0,1]}$ and then generate 100 coin flips with bias equal to that number without telling me the number! This way we are in a realistic situation of not knowing what the “correct” answer is ahead of time.

I got 85 heads and 15 tails. To make computations easier, let’s assume the prior probability is just uniform. This means the posterior is proportional to the likelihood ${p(D|\theta)=\theta^{85}\cdot (1-\theta)^{15}}$. I’ll start the random walk at ${\theta = 0.5}$. To know how much to move by, I pick a random number from a normal distribution with mean ${0}$ and standard deviation ${0.1}$. So if I pick ${0.05}$, then my candidate place to move to is ${0.55}$.

I compute the ratio ${p(D|\theta_{new})/p(D|\theta_{old})}$ and I move to the new spot with this probability (capped at ${1}$). Note that if my new theta value is more likely to be the true theta, then I will always move to the new value, because the ratio is greater than ${1}$. The more unlikely my new theta value is, the less likely it is that I will move there. This implementation is called the Metropolis (or Metropolis-Hastings) algorithm. Note how simple the implementation is. It is only a few lines of code:

```python
import random

import matplotlib.pyplot as plt


# Unnormalized posterior (uniform prior): likelihood of 85 heads and 15 tails
def p(theta):
    return (theta**85) * ((1 - theta)**15)


# Random walk step size: normal with mean 0 and standard deviation 0.1
def norm_dist():
    return random.normalvariate(0, 0.1)


# Propose one step of the random walk from spot theta.
# A proposal that would leave the interval (0, 1) is discarded.
def rand_walk(theta):
    x = norm_dist()
    if 0 < theta + x < 1:
        return theta + x
    else:
        return theta


# Simulate the random walk for 1000 time steps
walk = [0.5]
for i in range(1000):
    n = walk[-1]
    y = rand_walk(n)
    # Move to y with probability min(1, p(y)/p(n)); otherwise stay at n
    if random.random() < p(y) / p(n):
        walk.append(y)
    else:
        walk.append(n)

# Plot the results (theta on the horizontal axis, time on the vertical axis)
ylab = list(range(len(walk)))
plt.plot(walk, ylab)
plt.title('Random Walk Visualization')
plt.xlabel('Theta Value')
plt.ylabel('Time')
plt.show()
```


Note that the key insight that MCMC gives us is that picking values from the posterior is going to be “easy.” Even if we don’t know much about the distribution and have no idea how to explicitly calculate anything from it, we can still perform this random walk. This is what it looks like:

The last step is to actually do a parameter estimation. The whole point is that the walk will stay close to the best value, so we can now just average these to get ${\theta = 0.84}$. The average is just a finite sum instead of an integral now. If we had done this analytically, we would have gotten a posterior mean of ${86/102\approx 0.843}$ and a posterior maximum of exactly ${0.85}$. Since we know MCMC is just giving us an estimation coming from randomness, this is really quite good!
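As a rough check on these numbers, here is a sketch that computes the exact posterior mean and maximum under the uniform prior (the posterior is then a Beta(86, 16) distribution) and compares them with a longer, seeded run of the same random walk. The log-space density, the burn-in length, the number of steps, and the seed are assumptions added for this sketch and are not part of the script above.

```python
import math
import random

HEADS, TAILS = 85, 15

# With a uniform prior the posterior is Beta(HEADS + 1, TAILS + 1).
a, b = HEADS + 1, TAILS + 1
exact_mean = a / (a + b)            # 86/102
exact_mode = (a - 1) / (a + b - 2)  # 85/100


def log_p(theta):
    """Log of the unnormalized posterior; working in logs avoids underflow."""
    return HEADS * math.log(theta) + TAILS * math.log(1 - theta)


def metropolis(steps, start=0.5, step_sd=0.1, seed=0):
    """Random-walk Metropolis sampler on (0, 1) with a fixed seed."""
    rng = random.Random(seed)
    theta = start
    walk = [theta]
    for _ in range(steps):
        proposal = theta + rng.normalvariate(0, step_sd)
        if 0 < proposal < 1:
            log_ratio = log_p(proposal) - log_p(theta)
            # Accept with probability min(1, ratio).
            if rng.random() < math.exp(min(0.0, log_ratio)):
                theta = proposal
        walk.append(theta)
    return walk


walk = metropolis(20_000)
burn_in = 1_000  # drop the early steps taken while moving away from 0.5
kept = walk[burn_in:]
estimate = sum(kept) / len(kept)

print(f"exact mean = {exact_mean:.4f}, exact mode = {exact_mode:.4f}")
print(f"MCMC mean  = {estimate:.4f}")
```

Averaging the walk estimates the posterior mean rather than the maximum, which is why a value slightly below 0.85 is what one should expect here.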

## Video Games as a Solution to the One-Sided Problem of Art

In October I wrote a post in defense of gaming in which the central argument is a claim that any person who takes experiencing art as an important human experience should consider certain types of games as a worthwhile use of time as well. Some games are basically interactive films, but some are much more interesting and original forms of interactive art. If you close yourself off from this world, then you close yourself off from deep artistic experiences that you can’t get elsewhere.

A few months ago I did two posts on David Foster Wallace, his philosophy of art, and how to get the most out of Infinite Jest.

One of DFW’s central concerns in art was the one-sided nature of art. The artist puts in hundreds of hours of work, and the viewer/reader/whatever passively experiences the work. He thought of the artist/viewer relationship as an honest relationship. If it is completely one-sided, then it is a defunct relationship and you won’t get much out of it for very long. To have a successful relationship, both sides have to be putting in reasonable amounts of work.

This is one way people justify postmodernist writing. You have a bunch of endnotes or footnotes or you pull the reader out of the reading experience in other ways by drawing attention to the fact that they are reading something. You write in stream of consciousness from points of view that change every couple of pages, so that the reader can’t immediately tell what is happening. Whatever the literary device, the idea is that the reader has to put in work.

The point is that the more work the reader puts in, the more they will get out of the experience. Just like in a relationship, the reader has to invest something if they want a meaningful experience. Of course, the relationship becomes one-sided on the other side if the author just uses a random word generator and plops nonsense on the page for the reader to spend months trying to decipher. It needs to be a symbiotic relationship where neither side carries too much of the burden.

I’m going to go out on a limb and say that this problem is a real problem, and what writers, filmmakers, artists, etc have come up with so far merely mitigates the problem. There hasn’t been a really good way to get the viewer to truly participate in and invest in the work of art … until the fairly recent paradigm shift in thinking about games as art.

I’m definitely not the first to propose this, so I won’t spend a lot of time making this into a long post. Now that I’ve blogged around this topic a few times without actually addressing it I thought I would just point out that games are one obvious solution to the problem. They provide an interactive experience where the “player” has to fully invest in the work.

In fact, if artists are scared of the idea that their art will be “played” and hence will not qualify as “serious” (two notions that are extraordinarily hard to define or separate), then they should check out some recent games like To the Moon. The game play is extremely minimal. The player experiences a moving story by progressing through the game. The game play consists of moving around to collect some items and at the end of certain segments of collecting you “solve a puzzle” (sometimes only 2 or 3 clicks of the mouse). Still, this level of interaction is vital to fully immersing you in the story as if you were really the main character. This interaction is impossible with film or literature.

## Decision Theory 4: Hypothesis Testing

Now we return to decision theory. It is also a return to the thing that first made me interested in learning something about statistics a year or two ago. I had heard about John Ioannidis’ shocking article “Why Most Published Research Findings Are False” and started to investigate. To me statistics was some settled thing that you hit your data with after doing an experiment. It told you whether or not your findings were real and how confident you could be in them.

Moreover, I believed that as long as you followed the prescriptions taught to you, you couldn’t mess it up. It was foolproof. Just look around and try to find one thing that science hasn’t touched. The scientific method has clearly led us to discover something about the world. That’s why stats seemed like an uninteresting part of math. People seemed to have figured it all out. Then I was shocked to find that article. I started to learn about all these fallacies, and methodological problems that I’ve been pointing out over the past few months.

One of the main difficulties, particularly in science, is classical null hypothesis significance testing (NHST). One way to try to mitigate these difficulties is to rephrase our hypothesis test as a Bayesian decision theory problem. This is not the only Bayesian reformulation (Kruschke’s MCMC stuff is pretty cool which I might get to someday), but it fits in as a nice example of the use of decision theory outside of the silly gambling problems I’ve been using.

Let’s start by seeing how to test a point null hypothesis. Think about the biased coin example. We want to test ${\theta=1/2}$, i.e. is the coin unbiased? This is obviously a ridiculous type of hypothesis test, because the term “unbiased” in real life encompasses a range ${(1/2-\varepsilon, 1/2+\varepsilon)}$ where we can’t tell the difference. This is actually the case in most scientific situations as well (there is only so much precision your instruments can achieve), and often scientists incorrectly use a point NHST when there should be a ROPE (region of practical equivalence).

Our first step is to take the previous paragraph’s discussion and cheat a little. Suppose we want to test ${\theta = \theta_0}$. The Bayesian way to do this would work out of the box using a ROPE. Unfortunately, if we want continuous densities for the probabilities, then we will always reject our null hypothesis. This is because a point has probability zero. The cheat is to just convert the continuous prior, ${\pi(\theta)}$, to a piecewise defined prior where we assign a point mass of probability

$\displaystyle \pi_0 = \displaystyle \int_{\theta_0-\varepsilon}^{\theta_0+\varepsilon} \pi(\theta)d\theta$

to ${\theta_0}$ and the renormalized old prior otherwise. This is merely saying that we make a starting assumption that ${\theta}$ has true value ${\theta_0}$ with probability ${\pi_0}$, and hence no actual integral needs to be calculated. That is just for intuitive justification for the shape of ${\pi}$. If this makes you uncomfortable, then use the uninformed prior of ${\theta=\theta_0}$ having probability ${1/2}$ and the alternative having a uniform distribution of mass 1/2.

Let’s recap what we are trying to do. We have two hypotheses. The null which is ${H_0: \theta=\theta_0}$, and the alternative ${H_1: \theta\neq \theta_0}$. This type of NHST came up in the last post where we wanted to experimentally test whether or not the acceleration due to gravity was ${g=9.8}$. Our process should be clear if you’ve been following this sequence of posts. We just use our data to calculate the posterior probabilities ${P(H_0|x)}$ and ${P(H_1|x)}$. We must decide between these two by seeing which one has less risk (and that risk will come from a loss function which appropriately penalizes falsely accepting/rejecting each one).

This approach is really nice, because depending on your situation you will want to penalize differently. If you are testing a drug for effectiveness, then it is better to harshly penalize falsely claiming a placebo to be effective (a false positive or Type I error). If you are testing whether or not someone has a fatal disease, then you want to harshly penalize falsely claiming they have it and having them undergo dangerous and expensive unnecessary treatments. Maybe these aren’t the best examples, but you see how having a flexible system could be a lot more useful than blindly running a ${p=0.05}$ NHST.
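A minimal sketch of this kind of flexible decision rule follows. It assumes the simplest possible loss function: zero loss for a correct decision and two separate penalties for the two kinds of mistake. The function name, the penalty values, and the posterior value 0.14 are illustrative inputs, not results from any data.

```python
def bayes_decision(posterior_h0, loss_false_reject, loss_false_accept):
    """Pick the action with the smaller posterior expected loss."""
    risk_reject = loss_false_reject * posterior_h0          # wrong only if H0 is true
    risk_accept = loss_false_accept * (1.0 - posterior_h0)  # wrong only if H1 is true
    action = "reject H0" if risk_reject < risk_accept else "accept H0"
    return action, risk_reject, risk_accept


# Equal penalties: simply go with the more probable hypothesis.
print(bayes_decision(0.14, 1.0, 1.0))

# Penalize false rejection ten times as harshly: the same posterior now
# leads to keeping H0.
print(bayes_decision(0.14, 10.0, 1.0))
```

The same posterior probability can lead to opposite decisions once the penalties differ, which is the flexibility a fixed cutoff does not give you.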

Rather than going through some made up example from fake randomly generated data as I’ve been doing, let’s examine some differences at the theoretical level when we assume everything is normal. Suppose our data is a sample of ${n}$ points from a normal distribution. Any book on Bayesian statistics will have the details on working this out, so I’ll get to the punch line.

If we denote the marginal density by ${m(x)}$, then the posterior probability of ${H_0}$ is given by

$\displaystyle \frac{f(x|\theta_0)\pi_0}{m(x)}.$

In the normal distribution case (we assume the prior has standard deviation ${\tau}$, the data has standard deviation ${\sigma}$, and both have mean ${\theta_0}$) we get something much more specific:

$\displaystyle \left(1+\frac{(1-\pi_0)}{\pi_0}\cdot \frac{\exp\left(\frac{1}{2}z^2[1+\sigma^2/(n\tau^2)]^{-1}\right)}{(1+n\tau^2/\sigma^2)^{1/2}}\right)^{-1}$

where ${z=\frac{\sqrt{n}|\overline{x}-\theta_0|}{\sigma}}$. This term actually appears in classical NHST as well. Let’s look at the differences. For the purpose of getting some numbers down, let’s assume ${\pi_0=1/2}$ and ${\sigma=\tau}$. In a two-tailed test, let’s assume that we observe a ${p=0.01}$ and hence would very, very strongly and confidently reject ${H_0}$. This corresponds to a ${z}$-value of ${2.576}$. In this case if ${n}$ is small, i.e. in the 10-100 range, then the posterior is around ${0.14}$ to ${0.27}$. This means that we would likely want to reject ${H_0}$, because it is quite a bit more unlikely than ${H_1}$ (this will of course depend on the specifics of our loss function).

Shockingly, if ${n}$ is large, we lose a lot of confidence. If ${n=1000}$, then the posterior for ${H_0}$ is ${0.53}$. Whoops. The Bayesian approach says that ${H_0}$ is actually more likely to be true than ${H_1}$, but our NHST gives us ${p=0.01}$ level confidence for rejecting (which is often loosely read as a 99% chance that our data observations were not a fluke and that the result that causes us to reject ${H_0}$ is real).
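These posterior values are easy to reproduce. The sketch below just evaluates the formula above with ${\pi_0=1/2}$, ${\sigma=\tau}$, and ${z=2.576}$; the function name and the default arguments are choices made for the example.

```python
import math


def posterior_h0(z, n, pi0=0.5, sigma=1.0, tau=1.0):
    """Posterior probability of the point null in the normal case."""
    shrink = 1.0 / (1.0 + sigma**2 / (n * tau**2))
    ratio = math.exp(0.5 * z**2 * shrink) / math.sqrt(1.0 + n * tau**2 / sigma**2)
    return 1.0 / (1.0 + (1.0 - pi0) / pi0 * ratio)


z = 2.576  # z-value for a two-tailed p = 0.01, as quoted in the text

for n in (10, 100, 1000):
    print(f"n = {n:4d}: P(H0 | x) = {posterior_h0(z, n):.3f}")
```

The values come out at roughly 0.14, 0.27, and 0.53, so the same ${z}$-value supports ${H_0}$ more and more as the sample size grows.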

As we see, by working with the Bayesian framework, we get posterior probabilities for how likely ${H_0}$ and ${H_1}$ are given our observations of the data. This allows us to do a suitable analysis. The classical framework feels very limited, because even when we get extreme ${p}$-values that give us lots of confidence, we could accidentally be overlooking something that would be obvious if we worked directly with how likely each is to be true.

To end this post, I’ll just reiterate that careful scientists are completely aware of the fact that a ${p}$-value is not to be interpreted as a probability against ${H_0}$. One can certainly apply classical methods and end with a solid analysis. On the other hand, this is quite a widespread sloppiness, or, less generously, a widespread misunderstanding of what is going on.