Compression for Clustering

Deen Abiola - 24 Feb 2013

An Example

A refrain that I repeat over and over is that many things can be viewed as a prediction game and that better prediction is just better compression. And when someone says something is unpredictable they actually mean this data is highly incompressible - maybe the data generating function is of high kolmogorov complexity or comes from a treacherous search space.

So for example I can tell you that in their core essence, there is not much difference between a scientist and gzip or winzip (actually the best comparison would be 7zip as I explain later). A scientist is trying to make sense of data by creating a simpler description that makes predictions - i.e. compression.

This can be taken literally by doing actual science using something like 7-zip. Using a very simple technique called normalized compression distance - a measure of the similarity between two bit strings by approximating their normalized information distance, itself incomputable since it is based on the Kolmogorov complexity function. For example, one can compare and cluster genomes into phylogenetic trees using 7zip.

To do so I first used the PPMD algo from the 7zip library to compare pairs of files (something like LZW/gzip will not do due to having too small a context though using a larger dictionary cancels this). I then wrote an implementation of the nearest-neighbor chain algorithm for clustering (in F#, no errors, it just worked on first go! Also, in retrospect I see this is easily adaptable to huffman coding).

Of course I don't have the time or bandwidth to compare whole genomes so settled for a couple chromosomes across a a handful of monkeys/apes (gorilla, chimp, baboon, rhesus monkey, macaque and human). You can see the resulting graph below even though looking at a single chromosome or two is not very representative. It ran through the O(N^2) algorithm on the ~ 1GB data set in a few minutes on my machine.

I did the same with some 30 or so text documents (much smaller so faster) and got a result much more easily judged.

The resulting clustering was excellent: with a section on math and category theory, another on neuroscience, another on machine learning and another on exotic computing.

Compression and Intelligence

Marcus Hutter has an excellent argument for why being able to compress English text at its expected entropy rate ~ 1 bit per character would necessarily signify an AI. In fact top entries in his competition make use of Neural Networks in their compression algorithm and take nearly 10 hours to compress 1GB, blurring the illusory boundaries between Machine Learning and compression. Alas, while impressive they are still mostly limited to the word level. To get better concepts will need to be incorporated.

You can look at it this way. Near perfect compression implies a couple of things. It implies the ability to understand ambiguous contexts, non-specific references and vague language. Such an algorithm would require sophisticated knowledge of not just grammar but knowledge and culture, hence it could read and write English as well as any human.

Consider a simple example of compressing a series of digits such as 14159265... This is at first incompressible but if you recognized it as the decimal portion of pi you could simply represent it symbolically. Of course utilizing all of human knowledge as the model makes a mere look up table and is counter to Occam's Razor. It would be intelligence through brute force stupidity which is not even computationally tractable - hence the need for powerful but smaller models. The more intelligent the model the more it minimizes the data's minimum description length: Length(Model) - log2(probability of data Given Model). We prefer the smaller model not because it is necessarily better but because all things being equal it's much less likely to break way into the future.

I hope after reading this you now have an appreciation for the intelligence behind 7zip (the best open source compression library btw) and how learning, compression, coding and prediction are all related.

Images of the Clusters

The below are the clusters created from the NCDs.


img1 img2

Categorizing Books and Articles


Same Tree as Above but from Another view



Below is the code. It was not written with optimality in mind although it is still fairly performant (I like that word so nya).

// Learn more about F# at
// See the 'F# Tutorial' project for more help.
open System.IO
open SevenZip
open System
let dir = @"files/path/..."
let txts = Directory.GetFiles(dir + @"Test")
let sz = SevenZip.SevenZipCompressor()
let st = System.Diagnostics.Stopwatch()
sz.CompressionMethod <- CompressionMethod.Ppmd
sz.CompressionLevel <- CompressionLevel.Low
let compress2 (f:byte[]) =
use mio = new MemoryStream(f)
use m2 = new MemoryStream(f.Length * 2)
sz.CompressStream(mio, m2)
m2.GetBuffer().[ m2.Length - 1]
let compressionMap = txts |> (fun f -> Path.GetFileNameWithoutExtension f, f |> File.ReadAllBytes |> compress2) |> Map.ofArray
let compredist f1 f2 n1 n2 =
let code = compressionMap.[n1]
let code2 = compressionMap.[n2]
let fxy = Array.append f1 f2
let code3 = compress2 (fxy)
float(code3.Length - (min (code.Length) (code2.Length))) / float(max (code.Length) (code2.Length))
let nearEdges = [| for f in txts -> let n1 = Path.GetFileNameWithoutExtension f
let fbytes = File.ReadAllBytes f
txts |> (fun fname ->
let n2 = Path.GetFileNameWithoutExtension fname
Path.GetFileNameWithoutExtension fname, compredist fbytes (File.ReadAllBytes(fname)) n1 n2)
|> Array.sortBy snd|]
let nearEdgesMap = nearEdges |> Map.ofArray
let pairs = nearEdgesMap |> (fun _ v -> Map.ofArray v)
printfn "%A" nearEdges
type 'a Tree =
| Node of 'a
| Branch of 'a Tree * 'a Tree
type Cluster<'a when 'a : comparison> =
| Singleton of 'a Set
| Clusters of 'a Set * 'a Tree
let completelinkage (ps:Map<'a, Map<'a,float>>) (a: 'a Set) (b:'a Set) =
a |> (fun item1 -> b |> (fun item2 -> ps.[item1].[item2])
|> Set.maxElement) //we only want the two largest pair distances)
|> Set.maxElement
let distclust ps = function
| Singleton (item), Clusters(items, _) -> completelinkage ps item items
| Clusters (items, _) , Singleton(item) -> completelinkage ps items item
| Clusters (items1, _), Clusters(items2, _) -> completelinkage ps items1 items2
| Singleton (item1) , Singleton(item2) -> ps.[item1.MaximumElement].[item2.MaximumElement]
let mergeClusters = function
| Singleton (item), Clusters(items, dendogram)
| Clusters (items, dendogram) , Singleton(item) -> Clusters(Set.union item items, Branch(dendogram, Node item.MinimumElement))
| Clusters (items1, dendogram1), Clusters(items2, dendogram2) -> Clusters(Set.union items1 items2, Branch(dendogram1, dendogram2))
| Singleton (item1) , Singleton(item2) -> Clusters(Set.union item1 item2, Branch(Node item1.MinimumElement, Node item2.MinimumElement))
let r = Random()
A function that takes a cluster and a set of clusters and finds the nearest item using cluster dist functions
A function that takes a cluster and an item and calculates distance as maxdist (item, clustermember)
There is a map that holds every item and its neighbiors
If we have an item we find the closest item by looking it up in the map.
But we also need to find the closest in the cluster. So we must compare the item to a cluster
To do this we for each cluster, compare the distance to our current item
If an item is closest we add the merged 2 to the cluster stack as a branch and remove the item from actives
If a cluster is closest we merge the item to the tree, remove it from the cluster stack and add the new tree to the stack
If the next item we are looking at is a cluster we must find the closest item.
To find it in the single set we map each item to its distance from the cluster using dist clust
We also sort the cluster set by distance from current cluster
Again if the single item is the closest we merge with cluster and remove from map;
If the cluster is the closest we remove both clusters from clusterset, merge them and put them back
Item , Item -> Pack as a Singleton
// (clusterset : Map<string, string Cluster>)
let asCluster x = Singleton (set [x])
let closestinActives distances cluster (item : string Set) =
item |> (fun s -> distclust distances (asCluster s, cluster), s)
|> Set.minElement
let find points first closest (distances : Map<string, Map<string,float>>) =
let initialActives = points |> Set.ofArray |> Set.remove first
|> Set.remove closest
let rec seek (stack : string Cluster list) (actives : string Set) =
let current = stack.Head
if stack.Length = 1 && actives = Set.empty then current
let nextDist, next = if actives.Count = 0 then Double.MaxValue,"" else closestinActives distances current actives
if stack.Length = 1 then seek (asCluster next :: stack) (actives.Remove(next))
else let topofstack = stack.Tail.Head
let stackDist = distclust distances (topofstack, current)
if nextDist < stackDist then
seek (asCluster next :: stack) (actives.Remove(next))
else seek ((mergeClusters (current, topofstack)) :: (stack.Tail.Tail)) actives
seek [asCluster closest ; asCluster first] initialActives
let rec toGraph depth = function
| Node(x) -> x, "", " node\r\n [\r\n id\t\""+x+"\"\r\n label\t\"" + x + "\"\r\n ]\r\n"
| Branch(ltree,rtree) -> let lname, lgraph, names1 = toGraph (depth + 1) ltree
let rname, rgraph, names2 = toGraph (depth + 1) rtree
let name = string (r.Next(0, int(2. ** (float depth + 9.))) )
name, sprintf "%s\r\n%s\r\n edge\r\n [\r\n source\t\"%s\"\r\n target\t\"%s\"\r\n ]\r\n edge\r\n [\r\n source\t\"%s\"\r\n target\t\"%s\"\r\n ]"
lgraph rgraph name lname name rname,
(sprintf " node\r\n [\r\n id\t\"%s\"\r\n label\t\"\"\r\n ]\r\n" name) + names1 + names2
let first = fst nearEdges.[r.Next(0,txts.Length)]
let closest = fst nearEdgesMap.[first].[1]
let items, fcluster = (function | Clusters(leset, letree) -> leset, letree) (find (nearEdges |> fst) first closest pairs)
let _, outgraph, nodes = toGraph 0 fcluster
let n = "graph [" + nodes + outgraph + "]"
File.WriteAllText("mbook.gml", n) adv.

  1. In a literal manner; word for word: translated the Greek passage literally.
  2. In a literal or strict sense: Don't take my remarks literally.

Usage Problem

  1. a. Really; actually: "There are people in the world who literally do not know how to boil water" (Craig Claiborne).
  2. b. Used as an intensive before a figurative expression.

Usage Note: For more than a hundred years, critics have remarked on the incoherency of using literally in a way that suggests the exact opposite of its primary sense of "in a manner that accords with the literal sense of the words." In 1926, for example, H.W. Fowler cited the example "The 300,000 Unionists ... will be literally thrown to the wolves." The practice does not stem from a change in the meaning of literally itselfif it did, the word would long since have come to mean "virtually" or "figuratively"but from a natural tendency to use the word as a general intensive, as in They had literally no help from the government on the project, where no contrast with the figurative sense of the words is intended.