In which I use watching a movie to explain Bayesian Probability.

(10/13/2017: ~4.5 years ago, I wrote this article to explain Bayesianism without invoking any math. Bayes' Theorem is not particular to Bayesianism, although it is true that the rule underpins the technique when you zoom out far enough. But the essence of what Bayesian computation consists of has, in my opinion, never been properly explicated at a high level. Looking back, I think I did well but for a few misunderstandings, which I have corrected in parenthesized italics.)

Friday, February 22, 2013

What is Bayesian Probability?

Subjective (as of now, I consider the notion of subjective to be superfluous) Bayesian Probability is a very powerful but partially impractical philosophy about the most sensible way to handle information. It was basically invented by this guy called Laplace. Quite often, when people explain Bayesian probability, they get tripped up by the name and end up spending way too much time talking about Bayes' Theorem (or Bayes' Rule), which is actually not at all special to Bayesian Probability.

The proper way to view Bayesian probability is as the idea that everything can be assigned a probability (this is not the best frame, it's much better to think of it as repeated conditioning on a space/collection of hypotheses. That is, slicing away and eliminating falsified hypotheses, as well as up-weighting supported hypotheses and down-weighting unsupported ones, as appropriate). In this way we are all natural Bayesians, since a typical response would go like, 'eh? you mean you can't assign probabilities to everything?' You can't, but people act this way anyway, talking about likely, unlikely, betting and percent chance. You can think of it as subsuming logic, where each proposition has a value between 0 and 1, with false as 0 and true as 1. Then I can ask you if you think you're at risk of being eaten by bigfoot and you could give me an answer between yes and no. Bayesian Probability is also naturally extensible to quantum mechanics and makes a lot of (but not all) things less counterintuitive. But how to understand it?

Understanding Likelihoods, Priors and Posterior distributions

Well consider a movie. We can view a particular aspect of watching a movie as a game of prediction. The fun of the game is to try and get as close a guess to the movie's final outcome as possible, while the purpose of the director is to try and surprise as much as possible. This surprise is a measure of how not boring the movie is. So you come into the movie with your past experiences, the title, poster and maybe trailers. You've got a bunch of ideas of what will happen. As each scene unfolds you get more data and you lower the score of some guesses, while giving other guesses more weight. Each guess is a hypothesis and its weight is how likely you think it is, i.e. its probability. You can then think of the whole bunch of guesses with weights as your prior distribution (the purpose of the prior is to keep you grounded, rooting future predictions to be consistent with past knowledge). As you get more data or scenes you need to do an update of your beliefs.

That update is where the famous Bayes rule comes in. You can think of it as rebalancing your guesses with a hard-to-beat sanity check. You take how likely each guess is (prior), check whether the scenes so far match the guess (likelihood) (I would rephrase that as how well each guess rates the scene when it assumes itself true; the guesses that didn't do well get pushed down) and then balance it with how much sense the movie makes now (probability of the data (more specifically, think of it this way: after slicing some things away, squishing others and inflating the rest, you need to rebalance the existing hypotheses with respect to the new space so they remain probabilities)). What you get out is a posterior distribution. As it turns out, we are not consciously good at being Bayesians (this isn't quite accurate. It's true we rely on brittle heuristics to make up for the small scratch space, but being Bayesian is intractable anyway). We don't do the update step properly. But our unconscious mind can approximate it very well (the truth is more that the natural world is highly structured and hierarchical, and therefore forgiving of many hacks). This is good because you are not consciously guessing what will happen; it just happens in the background. So describing people as doing Bayesian movie watching is pretty accurate.

There's one more concept to talk about: nonparametricity. It sounds complicated but you can think of it (roughly) like this. When you start the movie you don't have all possible scenarios already figured out; as you get more data/scenes you grow the size and complexity of your hypothesis space/bag.
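For readers who want to see the update step spelled out, here is a minimal Python sketch of it. The hypotheses, the numbers and the function name bayes_update are all made up for illustration; the only thing taken from the text is the recipe: multiply each guess's prior weight by how well it rates the new scene, then rebalance so the weights remain probabilities.

```python
def bayes_update(prior, likelihood):
    """One Bayesian update over a finite bag of guesses.

    prior:      dict mapping each guess to its current weight (sums to 1)
    likelihood: dict mapping each guess to how probable the new scene is
                if that guess were true
    """
    # Reweight: guesses that rated the scene poorly get pushed down.
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    # Probability of the data: how much sense the scene makes overall.
    evidence = sum(unnormalized.values())
    if evidence == 0:
        raise ValueError("no guess in the bag can explain this scene")
    # Rebalance so the weights are probabilities again.
    return {h: w / evidence for h, w in unnormalized.items()}


# Hypothetical whodunit: weights before the scene (the prior).
prior = {"butler did it": 0.5, "twin did it": 0.3, "it was a dream": 0.2}
# How strongly each guess expected the scene we just watched.
scene = {"butler did it": 0.1, "twin did it": 0.6, "it was a dream": 0.3}

posterior = bayes_update(prior, scene)
for guess, weight in posterior.items():
    print(f"{guess}: {weight:.3f}")
```

After this one scene the twin guess overtakes the butler guess, even though the butler started out as the favorite. The posterior then serves as the prior for the next scene.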

Now, at the end of the movie one hypothesis ends up as true. If you did well then the difference between your final posterior distribution and your prior before that—where you ended up concentrating most of the weight—should be minimal. This is a typical movie where you are not very surprised. However, good directors and writers have a handful of choices to trip you up with.
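One way to put a number on that "difference" is the Kullback-Leibler divergence from your earlier beliefs to your final ones, measured in bits. This is my choice of measure for illustration, not something the article commits to, and the two endings below are invented. A rough Python sketch:

```python
import math


def surprise_bits(posterior, prior):
    """KL divergence D(posterior || prior) in bits.

    Small when the ending lands where you already put most of your weight,
    large when it lands on a guess you had heavily discounted.
    """
    total = 0.0
    for guess, p in posterior.items():
        if p == 0:
            continue  # guesses with no final weight contribute nothing
        q = prior.get(guess, 0.0)
        if q == 0:
            return math.inf  # the ending was not in your bag at all
        total += p * math.log2(p / q)
    return total


# Beliefs just before the final scene (hypothetical numbers).
before_ending = {"hero wins": 0.9, "villain wins": 0.1}

# A predictable ending versus a twist ending.
predictable = surprise_bits({"hero wins": 1.0, "villain wins": 0.0}, before_ending)
twist = surprise_bits({"hero wins": 0.0, "villain wins": 1.0}, before_ending)

print(f"predictable ending: {predictable:.2f} bits")
print(f"twist ending: {twist:.2f} bits")
```

The twist ending costs many more bits than the predictable one. An ending that had zero weight in your bag comes out as infinitely surprising, which matches the "unjustified conclusion" case in the next section.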

How to Trip Up your Priors

Directors and writers can overwhelm you with information, use non-intuitive methods such as playing the film backwards or jumping around between scenes, or write a script with lots of possible scenarios. Then you are not able to incorporate data properly and your distribution goes out of whack.

They can downplay certain cues so you don't give certain guesses the proper weight till near the end. If they do it right, you would hold the guess in your bag of guesses but not count it for much. If they do it wrong, they will just outright give an unjustified conclusion that was not at all in your prior. This is often upsetting and feels like it came out of nowhere. A movie can also be so nonsensical that you reject it because of the probability of the data. Maybe there is another semantic or logical meta-layer (this is hierarchical Bayes, which is necessary in the real world; note that the probabilistic programming Bayesian community also, and IMO confusingly, calls nested parameter dependencies hierarchical) that looks at the consistency of the movie, compares it to experience and says: wtf m8?

Another option is to come up with an original story that is so totally out of sync with your priors that your updates don't end up converging (getting on target) in time for the movie's end. These are those super rare "make you think" movies. A typical good movie is one which surprises you but ends up with an explanation that you did not overly discount and with a collection of scenes which are consistent with their internal logic.


Now, there is a controversy between Bayesians and non-Bayesians, which basically reduces to dissatisfaction that, a great deal of the time, a director can come up with a final story so completely outside your prior that you never converge, despite the story being completely consistent and you updating on all the data properly. You can only ever have imperfect coverage over all possibilities (there is a more common but less defensible series of arguments I omitted. A lot of people like to argue about objective vs subjective priors. This makes no sense since objective priors merely hide subjectivity. There is also a misapprehension leading to a belief that non-informative priors are superior. In reality, use of informative priors is vital to get learning going and to guard against conspiracy theories).
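The "never converge" complaint is easy to see in a toy Python sketch. Everything here is invented for illustration (the guesses, the numbers, the helper name update). The point is only that a guess given zero prior weight stays at zero no matter how strongly the scenes favor it, because conditioning can only reweight what is already in the bag.

```python
def update(beliefs, likelihood):
    """Plain Bayesian conditioning: reweight, then renormalize."""
    unnormalized = {h: beliefs[h] * likelihood[h] for h in beliefs}
    total = sum(unnormalized.values())
    return {h: w / total for h, w in unnormalized.items()}


# The true ending gets zero weight: it is effectively outside the prior.
beliefs = {
    "hero wins": 0.7,
    "villain wins": 0.3,
    "narrator is the villain": 0.0,
}

# Every scene strongly favors the ending we ruled out.
scene_likelihood = {
    "hero wins": 0.05,
    "villain wins": 0.05,
    "narrator is the villain": 0.9,
}

for _ in range(20):  # twenty scenes of perfectly correct updating
    beliefs = update(beliefs, scene_likelihood)

print(beliefs)
```

Twenty flawless updates later, the ruled-out ending still has no weight and the other two guesses sit where they started. Getting out of this requires something beyond conditioning, such as growing the hypothesis bag as you go.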

Another problem is that the full Bayesian update is very hard to compute, so it must be estimated. In reality, a respectable Bayesian will use non-Bayesian methods as sanity checks or guides. Beware of the Rabid Rationalist Zealots.

An aside: our brains were optimized along a particular track that did well in its proper environment, but those optimizations now show up as "biases". Speaking of which, this example also gives somewhat of an idea of a basic model of evolution. Replace guesses with particular versions of genes (alleles), likelihood with the number of organisms with that gene (I don't like this phrasing. Since likelihood reweights things, I'd amend it as a process that yields the new proportion of each gene in organisms that survived to reproduction) and surprise with the number of deaths. Indeed, Bayesian updating is just a special case of evolutionary learning (which fuels my belief that methods like Holland's deserve more study). (To clarify: it's general in the sense of being less strict on what constitutes an update, i.e. you can do more involved things than simply conditioning on observations. You can mutate and recombine hypotheses and not be strict about things adding up to one or being probabilities; you need not normalize globally in order to understand what is going on locally.)
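To make the analogy concrete, here is a small Python sketch using the textbook discrete replicator step, in which each allele's new proportion is its old proportion times its fitness, divided by the population's mean fitness. That has the same shape as a Bayesian update, with proportions playing the prior and fitness playing the likelihood. The allele names, the survival rates and the uniform mutate step are assumptions of mine, chosen only to show one "more involved" move that plain conditioning cannot make.

```python
def replicator_step(proportions, fitness):
    """Selection: same arithmetic as a Bayesian update.

    proportions ~ prior weights, fitness ~ likelihood,
    mean fitness ~ probability of the data.
    """
    mean_fitness = sum(proportions[a] * fitness[a] for a in proportions)
    return {a: proportions[a] * fitness[a] / mean_fitness for a in proportions}


def mutate(proportions, rate):
    """Mutation: spread a fraction `rate` of the population evenly over
    all alleles. Pure conditioning has no counterpart to this step."""
    n = len(proportions)
    return {a: (1 - rate) * p + rate / n for a, p in proportions.items()}


# Hypothetical gene pool; allele C is currently absent.
alleles = {"A": 0.5, "B": 0.5, "C": 0.0}
# Hypothetical chance of surviving to reproduction for each allele.
survival = {"A": 0.2, "B": 0.6, "C": 0.9}

selected = replicator_step(alleles, survival)
print(selected)  # C is still absent: selection alone cannot revive it

mutated = mutate(selected, rate=0.03)
print(mutated)  # C now has a small foothold and can be selected for
```

Selection by itself behaves exactly like conditioning, so an absent allele stays absent. Adding mutation lets the population put weight back on alternatives it had written off, which is one sense in which evolutionary learning is the more general scheme.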

(Another criticism is that the prior has to match the data generation process, but this is true for any learning method: every learner comes equipped with structural priors which must match the data regularities or do no better than random. Sanity checks, testing and noting the lack of separability between the likelihood and prior acknowledge the limitations of Bayes in reality. Bayesianism, or more usefully, probabilistic programming, is ideal as it's a simple way to reason about uncertainty, regularize, package hypotheses and condition on new information. It's preferable in exactly the same way that a statically typed functional programming language is preferable to C: there are fewer ways to shoot yourself in the foot and more ways to guard against simple but impactful mistakes, despite it not being a panacea. Probabilistic programming also has the advantage of extensibility, generalizability and modularization between inference procedure and problem declaration. Furthermore, a Bayesian system that can generate hypotheses online, that is not limited to some fixed set of hypotheses, that has some "fast learning" trap door, and that learns in a hierarchical manner is probably the most powerful learner this universe can host.)

And, taking this further, you can look at evolution as a type of intelligence [1]: the universe learning about itself, doing inference and making predictions about its own rules. Life as a computation trying to figure out what combination of elements leads to the most stable and (entropy-exporting) persistent structures. Why, if I were being poetic I would say Life is the universe thinking about itself (recursively self-modeling; Good Regulator invokes self-similarity!).

And there you have it. I've taken some liberties but I think this is a comprehensible but still representative explanation.

Deen Abiola

[1] Are we more intelligent than evolution? A look at how we fare vs disease would suggest barely so. I would guess an individual human is less intelligent, but maybe billions of us (when counting across history) are more so. I would also suggest that we can individually represent arbitrary functions more compactly, so we could be considered more intelligent in that manner (we can also learn from single examples, unlike evolution, which must aggregate over a population). And certainly more so with the addition of computers.