Distilling meaning to a number between 0 and 65,536

What is a concept? What does a word mean? A good reader quickly learns that meaning does not come from a word but rather, from the words around it, the words it tends to keep company with. That is the key motivation behind the Distributional hypothesis where:

The basic idea of distributional semantics can be summed up in the so-called Distributional hypothesis: linguistic items with similar distributions have similar meanings.

There are many ways to leverage this, one of the oldest is something called Latent Semantic indexing where Singular Value Decomposition (SVD) is used to find the associations between words in similar contexts; words that tend to fill the same idea shaped hole. The problem is that it's slow - for my needs anything slower than linear is almost always unacceptable.

There's another idea, Random Indexing which also has the added benefit of being online - where each new document or word does not start the learning process again from scratch. The idea is to keep extremely high dimensional but sparse random vectors for each word. These word vectors are then used to update a context vector as the model grazes on various sentences. The detailed how of this is something I will save for another post but the same kind of semantic indexing as SVD is achieved but with a many several order of magnitude reduction in cost. There are many ways I use this, including document similarity, summarization, paragraph segmentation and query expansion but the simplest example is in finding similar words. Say I put in 'thus' then it, without explicit training knows similar words from having analyzed text. Then I get a result like: "(thus, 1), (then, 0.95), (therefore, 0.94), (hence, 0.92)". This is useful when dealing with new jargon and you wish to know where to go next (i.e. interactively searching dozens of pages at a time).

Problem is, I now have this 1000 dimensional vector and I'd like to find words which have similar meanings - usages (or contexts) to this new word I've never seen before. Tricks like kd-trees are not going to help here. Random Hyper planes to the rescue.

The idea is a form of local sensitivity hashing (another post) where similar things hash down to the same bucket. So I generate a hash function for my vector by generating a random vector, r, where a bitmap is generated such that presence of a bit is decided by if < r, v > >= 0 then 1 else 0. And with say 16 of these, the probability of collision should also double as a sort of similarity function. 16 of these also means I can represent it as a single 16 bit number. So two things happen here. I store the semantic dictionary with a single 16 bit number as the key, words with similar contexts will tend to fall in the same bucket. That single number represents a particular concept in my dictionary. And also, even with a linear search I can do bit twiddling to get the hamming distance of a 16 or 32 bit number much more quickly than calculating the cosine similarity of two extremely high dimensional vectors.

Then by scanning through and changing just one bit of the bitmap the word's concept vector hashes down to, I also get a pretty good neighborhood of word's similar to my vector. I can then do the more expensive (with magnitude pre-calculated) cosine similarity operation on this much reduced space. My tests show a 95-99% reduction in the space searched and acts as a sort of parameter-less approximate nearest neighbor (not all near neighbors are returned). Reducing the bits in the key so there are more collisions, results in a more thorough but still efficient search. For example, using 8 bit keys results in still surprising sensible divisions - I had to search only 10% of the space. This method allows a very quick use of context to get an idea of what a never before met word might mean - almost like what humans do.

An awesome corollary is that I could then take a document and reduce it to its key topic words, take their average and then hash that down to a single integer. Pretty nifty eh? And now for a twist.

Imagine two entities that have taken this concept to its logical extreme. To communicate entire documents they spit out single numbers with perfect extraction and hashing. Vectors are shared such that each number is decompressed to its topics, the topics are automatically generated to full ideas and expanded out to trees. Communication is really dense, involving code numbers to shared references and compressed thought trees...While the communicating using streams of numbers thing is not really tenable I do think something like communicating thought trees is possible. More on that later.


Examples below use 16 bit keys showing various clusters from: http://nplusonemag.com/issue-13/essays/stupidity-of-computers/ (I have vectors derived from all my papers and notes but it's easier to analyze single pieces of text)

[You can see the full list using an 8 bit key and a hamming distance of 1 at: http://sir-deenicus.github.io/home/rvec1.htm]

31990 -> ["situations"; "perceptions"; "our"; "ontologies"] 30948 -> ["facebook"; "degrees"; "and"] 22775 -> ["pseudocode"; "algorithm"; "aggressive"]

from a computer go article

30924 -> ["rematch"; "loses"; "1965"] 2910 -> ["winner"; "second"; "favorite"; "event"] 52 -> ["tokyo"; "japanese"]

Sometimes the words are not synonyms as in 22778 -> ["remained"; "meaningful"; "ambiguities"]