# Compressibility is a Key Concept

Saturday, June 30, 2012 - 4:02 PM edited Sunday, September 09, 2012 - 2:00 PM

In the future, I think we will look back at how incomplete our model of probability was because it did not fully incorporate notions of compressibility.

The central dogma of statistics is that data should be viewed as realizations of random variables. This has been a very fruitful idea, but it has its limits. Itâ€™s a reification of the world. And like all reifications, it eventually becomes invisible to those who rely on it.

This jives with the qualms I have with that aspect of the subject, which I will now elucidate.

Notions of Compressibility and Probability should really go hand in hand. For example, if someone says something is unpredictable (which is clearly not true for anything interesting) what do they mean? They mean that they think the phenomenon is so complex it would be hard to build a predictive model that does better than random without building the entire thing itself. Did you notice something implicit in that definition?

Consider the sequence "010101010..", what is the best way to characterize this sequence?

Leave that thought hanging a moment and consider the two infinite sequences; A:"001001001001..." and B:"111000010001...".

The probability of randomly picking a "1" in both of these sequences is 1/3 yet there is something different about one of the two. One of them is less regular, more random. Can you guess which? Although instinctually people tend to generate sequences like A when trying to be random, when compared like this it is clear that B is more random. We can arbitrarily approach 100% confidence in knowing the next digit in sequence A but can never do better than random for sequence B.

Probability theory is not equipped with the vocabulary to tell the two apart. And while you can certainly build a Markov Chain conditioned on prior history that can properly represent this sequence, an umbrella term for what this concept falls under is not easily expressed in the language or probability. There are valiant attempts in model picking, Minimum Description Lengths and various information criterion to do so and while they see practical use, proper use is as yet an art form.

The word which accurately and cleanly describes all these concepts is compressibility. Sequence A can easily be compressed to a Markov chain or computer program. Sequence B cannot. The shortest computer program amongst arbitrarily many which can model A is the most preferable for no other reason than it minimized complexity. That makes that explanation most robust. Unfortunately the Halting Problem makes this metric incomputable. As said attempts such as regularization, early stopping, adding noise, dropping features are all ad hoc methods to get around this. MDL is the cleanest but still remains messy and while computability theory permanently force this as an art form perhaps there is some principled way we can account for Compressibility and Complexity...

When someone says something is unpredictable, what they really mean is that this thing is incompressible, it is already in a state of maximal entropy. Incompressible things cannot be learned. It is an interesting parallel that in physical systems, no work can be squeezed from states of maximal entropy.