
Analogies as (Endo)Functors

What does intelligence mean? For some time I've been thinking about a framework within which one could treat questions on intelligence (whether of human, AI, corporation, or evolution) in the spirit of Leibniz. Instead of rambling discussions where everyone brings lots of extravagant baggage, I wish to someday be able to think through (even if not calculate) answers under the framework and ask: do you have something better? (This is not rhetorical; if you do, then I can stop wasting my time right away.)

The framework starts from intuition, guided by experience of which framing is most fruitful when the task is principled generation of hypotheses. The framework is also grounded mathematically: I achieve this by starting from computational principles and working backwards to the basic math that adequately expresses the concepts of interest. Although the framework is broad (it draws on computing theory and the link between information theory and thermodynamics), I'll zoom in and focus this article on analogy, because that is the part I was able to run an experiment around.

Analogical Reasoning

One of the most important aspects of intelligence is analogical reasoning: mapping concepts from one space to concepts in another space. For example, I might ask: what is the biological construct most similar to this physics concept? Or I might say: what do I know that is like this new thing I am seeing?

By noticing how one thing is similar to another, I accomplish two things. First, I pave the way towards encoding both concepts in terms of something more general, and thus achieve a better compression of both. Second, by leveraging knowledge of the known structure, I can more quickly learn, probe, and design experiments to better understand the new structure. The most creative individuals are really good at exploiting connections between things. Analogies do have a downside: if the structure is not actually shared across the full space, you end up with a worse understanding of the new space than if you had not tried to accelerate by leveraging the analogy.

Analogies as Functors

Thinking about analogy, the most obvious representation is a structure-preserving map from one space to another (although in reality, whether in the brain or practically in an AI, I doubt it is ever so direct). In category-theoretic language, an analogy can be thought of as a functor from one category to another. If you know A and wish to learn about B, then you must first learn a functor \(F : A \to B\); then, using knowledge about A and your guesses about the structure of F, you design experiments with \(F(A)\) to further learn about F and B (and how B diverges from A).

From this inspiration, I considered a simpler scenario for a computable experiment. Instead of functors and categories, of which I have only a rudimentary understanding, I use vector spaces and linear transformations. Suppose you have a vector space model A of one topic and you learn a model B for another topic; then what you want is some X such that \(AX = B\). In this model scenario, the linear transformation X is the analogical mapping/functor, and it can be solved for easily enough using QR factorization. The problem now is how to construct said spaces and then test whether X really is an analogical mapping.
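As a minimal sketch of that solve (assuming NumPy, with random toy matrices standing in for actual topic models):

    import numpy as np

    # Random toy matrices standing in for the two topic models:
    # each row of A is a concept vector in space A, and the matching
    # row of B is the same concept's vector in space B.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((300, 250))   # 300 paired concepts, 250-dim vectors
    B = rng.standard_normal((300, 250))

    # Least-squares solve of AX = B via QR factorization: A = QR, X = R^{-1} Q^T B.
    Q, R = np.linalg.qr(A)
    X = np.linalg.solve(R, Q.T @ B)

    # np.linalg.lstsq(A, B, rcond=None)[0] gives the same X.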

This too is easily done: there is a multitude of vector space models, including latent semantic indexing, random indexing, GloVe, holographic reduced representations, and word2vec. For my own purposes, across the requirements of speed, memory use, data requirements, online learning (learning cleanly on the go), and accuracy, random indexing dominates all other choices. The curious thing about random projections is that they are "anti-fragile": they actually turn the curse of dimensionality into a boon. There are even some neuroscience models that hypothesize the brain makes extensive use of random projections. In our case, because random vectors in high dimensions are almost orthogonal to each other, even small nudges are enough to get useful results, which is exactly what you want when data is small.
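Here is a minimal sketch of the random indexing idea, simplified from any particular implementation (the function names, window size, and sparsity are illustrative): each word gets a fixed sparse ternary index vector, and a word's context vector accumulates the index vectors of the words around it.

    import numpy as np
    from collections import defaultdict

    DIM = 250
    rng = np.random.default_rng(42)

    def index_vector(dim=DIM, nonzeros=10):
        # Fixed sparse ternary vector: a few +1/-1 entries, zeros elsewhere.
        v = np.zeros(dim)
        pos = rng.choice(dim, size=nonzeros, replace=False)
        v[pos] = rng.choice([-1.0, 1.0], size=nonzeros)
        return v

    def build_context_vectors(tokens, window=2):
        # Each word's context vector is the sum of the index vectors
        # of the words that co-occur with it inside the window.
        index = defaultdict(index_vector)
        context = defaultdict(lambda: np.zeros(DIM))
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    context[word] += index[tokens[j]]
        return dict(context)

    vectors = build_context_vectors("natural selection acts on heritable variation".split())

Because the index vectors are nearly orthogonal, these sums barely interfere with one another, which is why even small corpora produce usable vectors.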

Digression: When word2vec analogies are not useful

You might have heard of word2vec, a shallow neural-network-based method that approximately "reduces/factorizes" pointwise mutual information on words using co-occurrence statistics. Its ability to handle analogies using vector math seems impressive at first but, upon closer inspection, turns out to be a narrow gimmick with limited applicability.

The word2vec analogies fail quite often beyond well-represented examples, e.g. king:man :: woman:... vs baseball:sabermetrics :: basketball:.... They work best for obvious, i.e. one-step, analogies on single words, which in general is not useful. Those analogies are too simple and limited to serve as a demonstration of analogies as mappings across (sub)spaces.
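For reference, the demo being criticized is nothing more than vector arithmetic plus a nearest-neighbor lookup; in this sketch, vecs is a hypothetical word-to-vector dictionary, not any particular library's API:

    import numpy as np

    def analogy(a, b, c, vecs, topn=5):
        # a : b :: c : ?  ->  nearest neighbors of b - a + c by cosine similarity.
        query = vecs[b] - vecs[a] + vecs[c]
        query = query / np.linalg.norm(query)
        scores = {w: float(v @ query / np.linalg.norm(v))
                  for w, v in vecs.items() if w not in (a, b, c)}
        return sorted(scores, key=scores.get, reverse=True)[:topn]

    # analogy("man", "king", "woman", vecs)  # hopefully puts "queen" near the top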

(Aside: no vector space model based purely on context can learn to distinguish antonyms from synonyms.)

Digression 2: When Big Data is worse than Small Data

Big data is not always best. For example, when I wish to narrowly focus on two topics, I do not care for vectors that have been trained on the entirety of Wikipedia; they will fail to surface the connections I am most interested in. Consider the case where I want to find analogies between learning and evolution: even when I simplify the problem to cute SAT-style analogies, the results I get from such vectors are completely useless.

Space

The analogies I build are based on mapping from one space to another. The source of the vectors can be anything (including word2vec; my criticism is of the analogy demos, not the quality of the generated "concept vectors"), but I use random indexing vectors since, as previously stated, they are Pareto-optimal for my requirements (e.g. speed). For this demonstration, I extract and build vector space models of the Wikipedia pages for evolution and for reinforcement learning separately. The vectors have dimensionality 250. After this, I 'solve' for the analogy F from reinforcement learning to evolution, which is represented by a 250x250 matrix.

Because the vector dimensionality is 250, I need at least 250 sample vectors so that the equation is not underdetermined. Then, if I want to look at concepts in evolution in terms of reinforcement learning, or vice versa, I simply multiply the vectors by the appropriate solved-for matrix. To my surprise, it worked. (Why was I surprised? Because I generated a matrix randomly filled with 1s, 0s, and -1s, performed a slipshod approximation of sparse matrix multiplication against co-occurrence frequencies, ran QR factorization on that, multiplied the solved-for matrix with my random matrix, and it produced interesting analogies! As I will emphasize again and again, "human-unique" intelligence is over-estimated.) You can scroll to the bottom to see some examples or check the full results here and here; it's almost all interesting.
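A sketch of how the pieces fit together, with the caveats that rl_vectors and evo_vectors are placeholder names for the two per-topic spaces, and that lining up the rows of A and B by the shared vocabulary is my assumption about one reasonable way to get the required sample pairs, not a description lifted from the experiment:

    import numpy as np

    def solve_analogy_map(src_vectors, dst_vectors):
        # Fit F (250x250) on the words the two vocabularies share;
        # at least 250 shared words are needed to pin F down.
        shared = sorted(set(src_vectors) & set(dst_vectors))
        A = np.stack([src_vectors[w] for w in shared])   # n x 250
        B = np.stack([dst_vectors[w] for w in shared])   # n x 250
        F, *_ = np.linalg.lstsq(A, B, rcond=None)
        return F

    def nearest(query, vectors, topn=5):
        # Cosine nearest neighbors of a (mapped) vector in the target space.
        q = query / np.linalg.norm(query)
        sims = {w: float(v @ q / np.linalg.norm(v)) for w, v in vectors.items()}
        return sorted(sims.items(), key=lambda kv: -kv[1])[:topn]

    # F = solve_analogy_map(rl_vectors, evo_vectors)     # RL -> evolution
    # nearest(rl_vectors["policy"] @ F, evo_vectors)     # e.g. phenotype, genome, ...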

There are a couple of impressive examples I'd like to highlight. Selection (from natural selection) is mapped to search, iteration, improvement, gradient, and evaluation. Those are all extremely on point! Survival is mapped to states, actions, regret, and moves. One of those is a non-trivial insight, as it is a link that was only recently suggested (more in part 2). It also maps survival to non-episodic; this is correct and a link I'd never made before. Genes, too, are well mapped, with history a low-scoring match (appropriately low).

In the other direction, from reinforcement learning to evolution, the map also has some standout examples. Policy in particular is one I'd never considered before. Policies are what an RL agent learns in order to behave optimally at each particular state. What's the most appropriate match? The genome and the phenotype, of course; alleles, survival, and gene are also all linked and pretty good matches. Reinforcement is matched with breeding, existence, and elimination; learning with adapted and surviving; exploration with species, individuals, population, and mutations.

Function, too, is a curious one, and it highlights what I mean by the advantages of small data. F(function) is judged as very close to hypothesis, idea, and observations. In reinforcement learning, function is meant in the sense of a value function or action-value function, and indeed value functions are much closer to hypothesis and observations than to purpose, graph, or chart, which is what something trained on more data would have reckoned.

These are all rather impressive and, taken together, virtually impossible to have occurred by chance, especially when one considers the cosine similarity values and the vector dimensionality.
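As a quick sanity check on that claim (a Monte Carlo sketch, not part of the original experiment): for random directions in 250 dimensions, cosine similarities concentrate tightly around zero, so scores in the 0.4 to 0.9 range sit far out in the tail.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 250, 20_000
    x = rng.standard_normal((n, d))
    y = rng.standard_normal((n, d))
    cos = np.einsum('ij,ij->i', x, y) / (np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1))

    print(cos.std())          # roughly 1/sqrt(250), i.e. about 0.063
    print(np.abs(cos).max())  # rarely much above 0.25-0.3 even over 20k draws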

Conclusion

So what am I to make of this? Haphazardly multiplying co-occurrence counts from a couple of wiki pages with a random matrix is enough to make associations that would give many humans trouble. Why? I hypothesize two reasons working in conjunction. One is that text is not random; it is structured for the purpose of passing along information (I wonder whether a concept such as potential intelligence, by analogy with potential energy, is appropriate). The other, I think, is that this is another instance of Moravec's paradox. If you listed the things that you intuitively think require the most computational power and intelligence, you'd probably have written that list upside down with respect to the correct order. If you want to understand intelligence (in animals, in humans, and in AI), you would do well to take a look at Moravec's paradox. After you've done that, tell me if you think the conventional presentation of the "Clever Hans Effect" makes even a tiny bit of sense.

What's Next

In this post I explicitly computed an "analogy" as a map between two spaces. But although the method worked (to my surprise), I doubt that such a map is ever explicitly made in the brain. In Project Int.Aug this method is not used: it's too rigid, and it works only on two concepts when I'm usually considering multiple concepts at a time. Secondly, although the map works all right, especially for well-represented words, it outputs nonsense in many instances. In a future post I'll show the practical method in use, including various ways of constructing subspaces, as well as composition, which go beyond single words and two concepts at a time while also being more accurate.

In my next post I talk about how AI and intelligence amplification differ in their treatment of A, F, and B. I'll also go over my choice of these two topics. (Why not others? Well, one aspect is that a recurring theme in my writing is evolution as learning, and so I'm usually trying to amplify my understanding of related topics by, e.g., mapping between reinforcement learning <=> regret <=> evolution.)

Meta: I hope that if you previously had no clear idea of reinforcement learning but did have knowledge of evolution (which is more widely known), the analogies (not mine!) have put some of its concepts in context.

Examples

In the tables below, the words in all caps are the source (evolution) concepts and the rows that follow are the target (RL) nearest neighbors. Remember, the word vectors have been learned completely separately and are linked only through the matrix F.

SELECTION
Name          Sim
search        0.9078385
iteration     0.5983979
improvement   0.5010358
gradient      0.4711323
evaluation    0.434157

EVOLUTION
Name          Sim
computing     0.8000988
maintaining   0.4766024
finding       0.4171107

SURVIVAL
Name          Sim
states        0.7352338
actions       0.4152263
non-episodic  0.4131915
samples       0.4096077
regret        0.4069343
moves         0.3897748

KNOWN
Name          Sim
improve       0.8170525
demonstrate   0.6549885
address       0.6359063
change        0.6319368
observe       0.6311502
compute       0.6282258
unify         0.619693

GENES
Name          Sim
evaluation    0.8402829
improvement   0.6233622
space         0.4338689
iteration     0.4281315
search        0.407939
gradient      0.3085533
history       0.3024744

ENVIRONMENT
Name          Sim
computation   0.8335705
domain        0.6284672
expectation   0.6255608
class         0.6169248
set           0.6092064
description   0.6053985


The examples in the other direction, from RL to evolution:

POLICY
Name          Sim
principle     0.8019225
environment   0.5997837
phenotype     0.5960799
Survival      0.5955663
genome        0.5697417
gene          0.5480887
alleles       0.5475575

METHODS
Name          Sim
offspring     0.8382673
genes         0.4183966
mutations     0.4166025
produce       0.3901573
individuals   0.3897389

BASED
Name          Sim
According     0.880974
Due           0.7900774
due           0.7368965
appears       0.7197356

REINFORCEMENT
Name          Sim
breeding      0.8270694
policies      0.5013691
existence     0.4515101
elimination   0.4443123
mates         0.4295801

LEARNING
Name          Sim
adapted       0.8522058
adaptation    0.4814506
applied       0.480039
surviving     0.4398851

EXPLORATION
Name          Sim
species       0.4411014
vestigial     0.4271055
sexually      0.4088307
region        0.3941363
genes         0.3930252
individuals   0.3822998
population    0.375715
mutations     0.3743682