Thursday, November 16, 2017

A Critique of Numenta's Location Hypothesis

Why I Respect Numenta

I have always had respect for Numenta. Over the years, under the leadership of their maverick founder and chief architect, Jeff Hawkins, they have steadfastly maintained that deep learning was not the way to achieve artificial general intelligence (AGI). They insisted that imitating the brain was the right way forward, that intelligence was based on the timing of sensory signals, and that learning in the brain consisted mainly of making new synaptic connections, not modifying connection weights. They held this position while the deep learning hype was in full swing, and they never flinched even in the face of overt hostility from the mainstream AI community. They had a healthy, think-outside-the-box attitude. As a rebel, I admired that. Lately, however, apparently reacting to pressure from the AI community to show some serious results, the folks at Numenta seem to have lost their way. Their latest offering, the so-called location hypothesis, misses the mark. Worse, there is no demo program to support the theory.

The Universal Invariant Recognition Problem

One of the most difficult problems in AI is universal invariant recognition. The human brain has the seemingly magical ability to recognize an object regardless of its position and orientation in the field of view. Deep learning experts tried to solve the problem with brute force: they trained their networks on millions of images in the hope of covering every possible situation. However, this approach invariably leaves holes that can lead to spectacular failures. So they (Yann LeCun and colleagues) came up with a partial solution, a technique called convolution, which gave the network a degree of translation invariance. Even then, deep neural nets can still be fooled by adversarial examples. It turns out that they can fail catastrophically if a previously learned pattern is perturbed by an imperceptibly small number of pixels. In other words, deep neural nets are not universally invariant. Some in the AI community (e.g., DeepMind) have been promoting deep learning as a stepping stone toward AGI. They are sorely mistaken. Others (e.g., Geoffrey Hinton and Yann LeCun) seem to be more aware of its limitations.

The Location Hypothesis

Jeff Hawkins and his team at Numenta believe they may have found the secret of universal invariance. They are proposing that the brain somehow generates a special signal that specifies the location of an object under observation and the location of its features relative to the object. The idea seems to be that, by knowing the position of an object relative to its features, the brain can compensate for positional differences and solve the problem of invariant recognition. They write:
We propose that a representation of location relative to the object being sensed is calculated within the sub-granular layers of each column. The location signal is provided as an input to the network, where it is combined with sensory data.
...
A key component of our theory is the presence in each column of a signal representing location. The location signal represents an “allocentric” location, meaning it is a location relative to the object being sensed. In our theory, the input layer receives both a sensory signal and the location signal. Thus, the input layer knows both what feature it is sensing and where the sensory feature is on the object being sensed. The output layer learns complete models of objects as a set of features at locations. This is analogous to how computer-aided-design programs represent multi-dimensional objects.
This article by Hawkins explains Numenta's approach in an easy-to-read style. While I admire Numenta's courage and willingness to attack a hard problem head-on, I must say that I am disappointed with this hypothesis.

Why Is the Location Hypothesis Flawed?

There are several reasons:
  • As I have argued on many occasions, neurons are slow and there is very little time and energy in the brain for fancy calculations. Maintaining a location reference for visual objects is a particularly complex task, especially if it is a 3-dimensional location, which it would have to be if the sensed object is in a 3-dimensional world. The system would have to determine not only the location of the object relative to the viewer but also the location of a reference point relative to the object itself. Is it in the middle of the object or somewhere else? This is not an easy task. And this is not even taking into account the fact that the brain must somehow detect the boundaries of the object under observation while excluding all the other objects in the scene.
  • A location signal is necessarily encoded with spikes (discrete pulses). A spike, by itself, carries no information other than its time of arrival. How many spikes would it take to encode a continually changing location vector in 3D space? The answer is: a lot. Again, there is no time for this in the brain. The highest spiking frequency is about 1000 Hz and the brain only has about a 10 millisecond window to process each sensory input. There is not enough time to encode even a 1-dimensional location for each input signal (see the back-of-the-envelope sketch after this list).
  • Let us suppose, for argument's sake, that the brain uses a single connection for each possible location. This would require millions of connections per feature. This is clearly out of the question.
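
To make the second objection concrete, here is a back-of-the-envelope sketch in Python using the figures cited above (a 1000 Hz ceiling and a 10 millisecond window). The 8-bits-per-axis resolution and the simple rate code are illustrative assumptions of mine, not claims from Numenta's paper.

```python
import math

# Back-of-the-envelope spike budget, using the figures cited above.
MAX_RATE_HZ = 1000     # assumed ceiling on spiking frequency
WINDOW_S = 0.010       # assumed processing window per sensory input (10 ms)

spikes_available = int(MAX_RATE_HZ * WINDOW_S)   # ~10 spikes per neuron

# Illustrative assumption: 8 bits (256 levels) of resolution per axis of
# a 3D location, and a simple rate code carrying at most
# log2(spikes_available + 1) bits per neuron per window.
bits_needed = 3 * 8                                # 24 bits for x, y, z
bits_per_neuron = math.log2(spikes_available + 1)  # ~3.5 bits

neurons_needed = math.ceil(bits_needed / bits_per_neuron)
print(f"{spikes_available} spikes per window -> {bits_per_neuron:.1f} bits per neuron")
print(f"A 3x8-bit location needs ~{neurons_needed} dedicated neurons per feature, every window")
```

Even under these generous assumptions, every feature would need its own pool of neurons re-encoding a 24-bit location every 10 milliseconds. That is exactly the kind of overhead the brain cannot afford.
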
I have other objections but these three should suffice to show that Numenta's location hypothesis is not biologically plausible.

A New Memory Model

I am proposing a new memory model based on spike timing. The model assumes that the brain perceives and learns by detecting many minute changes in its sensory space. I hypothesize that the brain uses branches in its hierarchical sequence memory to detect complex objects in the world regardless of their locations or orientations. A branch is a top-level node in the sequence hierarchy that is activated when it receives enough signals from lower level nodes to trigger a recognition. This memory model has the ability to instantly sense and understand complex objects in the environment, even objects that it has never encountered before.
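
Since I have not published an implementation yet, here is a minimal Python sketch of the branch idea: a top-level node that triggers once enough lower-level nodes have signaled it. All class names and the threshold value are illustrative placeholders of mine.

```python
# Hypothetical sketch of a branch: a top-level node that fires when
# enough lower-level nodes have signaled it. Names and thresholds are
# illustrative placeholders, not a specification.

class Node:
    """A node in the sequence hierarchy."""
    def __init__(self, name):
        self.name = name
        self.active = False

class Branch:
    """A top-level node; triggers recognition once enough of its
    lower-level inputs are active."""
    def __init__(self, name, inputs, threshold):
        self.name = name
        self.inputs = inputs        # lower-level nodes feeding this branch
        self.threshold = threshold  # signals required to trigger

    def recognized(self):
        # Recognition depends only on how many inputs are active,
        # not on which particular inputs are active.
        return sum(node.active for node in self.inputs) >= self.threshold

# Example: a branch over five lower-level nodes that fires on any three.
nodes = [Node(f"n{i}") for i in range(5)]
cup = Branch("cup", nodes, threshold=3)
for n in nodes[:3]:
    n.active = True
print(cup.recognized())  # True
```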


There are two hierarchies, one for pattern detection (not shown) and one for sequences. Sequence memory is where actual object recognition happens. It receives discrete signals from pattern memory. Pattern neurons learn to detect a huge number of small elementary patterns such as lines, edges, dots, etc. Signals from pattern neurons are fed directly to the bottom or entry level of the sequence hierarchy. Pattern signals are stitched together in sequence memory to form any complex object.
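
As a rough sketch of this interface, one can picture pattern neurons emitting discrete, timestamped events that the entry level of sequence memory consumes. The event format below is an assumption of mine for illustration only.

```python
# Sketch of the pattern-memory -> sequence-memory interface: pattern
# neurons detect small elementary features and emit discrete,
# timestamped events into the bottom level of sequence memory.
# The event format is my own assumption.

from dataclasses import dataclass

@dataclass
class PatternEvent:
    pattern_id: str   # e.g. "vertical_edge", "dot"
    position: tuple   # retinal coordinates of the detector
    t: float          # spike time in seconds

def pattern_memory(frame_changes, t):
    """Emit one event per elementary change detected at time t.
    `frame_changes` maps detector positions to pattern ids."""
    return [PatternEvent(pid, pos, t) for pos, pid in frame_changes.items()]

events = pattern_memory({(3, 7): "vertical_edge"}, t=0.010)
print(events)
```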


As an example of sequence processing, consider the horizontal motion of a short vertical line or edge across the retina. This would result in multiple pattern neurons generating a series of spikes (one at a time) separated by short intervals. This series of events can be captured by an indefinitely long structure of connected nodes at the bottom level of the sequence hierarchy. I call these long structures "vines" to distinguish them from the shorter "sequences". The nodes in a vine would fire in succession as the line/edge moves horizontally in a given direction. There are many such sequence structures in sequence memory that capture various movements or other forms of change in the environment. The important thing to note here is that the interval between nodes in a vine is not fixed but can vary over time.
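
Here is a minimal sketch of a vine under these assumptions: an ordered chain that advances each time the next expected pattern fires, with no constraint on the interval between firings. The class and method names are placeholders of my own.

```python
# Sketch of a "vine": an ordered chain of nodes at the bottom level of
# sequence memory. A vine advances one node whenever the next expected
# pattern fires; the interval between firings can vary, only the order
# is fixed. Implementation details are my own assumptions.

class Vine:
    def __init__(self, expected_patterns):
        self.expected = expected_patterns  # pattern ids, in order
        self.pos = 0                       # index of the next expected node

    def on_spike(self, pattern_id):
        """Advance if the spike matches the next node in the chain."""
        if self.pos < len(self.expected) and pattern_id == self.expected[self.pos]:
            self.pos += 1

    def completed(self):
        return self.pos == len(self.expected)

# A short vertical edge sweeping right activates edge detectors in order.
vine = Vine([f"edge_x{x}" for x in range(4)])
for x in range(4):
    vine.on_spike(f"edge_x{x}")   # intervals between spikes may vary
print(vine.completed())  # True
```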

How the Brain Does Invariant Object Recognition

Obviously, the brain must have a simple and energy-efficient solution that does not require lengthy calculations. Recognition must happen quickly and accurately using uncertain sensory information. How does the brain do it? I propose that the brain has a way to pool multiple concurrent sequences together to form branches that can detect any arbitrarily complex moving object. Recognition is based on a competitive, winner-take-all process. Only the branches that receive enough signals will trigger a recognition event.
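
The competition could be sketched as follows; the tally-and-threshold formulation is an assumption of mine, as is the handling of ties.

```python
# Sketch of winner-take-all competition among branches: each branch
# tallies the signals that reached it, and only the branch with the
# highest tally (at or above threshold) triggers recognition and
# becomes "awake". Ties resolve to the first maximum; that rule is
# my own assumption.

def winner_take_all(branch_scores, threshold):
    """branch_scores: dict mapping branch name -> number of signals
    that reached the top. Returns the single winning branch, or None."""
    name, score = max(branch_scores.items(), key=lambda kv: kv[1])
    return name if score >= threshold else None

scores = {"cup": 9, "pen": 4, "face": 2}
print(winner_take_all(scores, threshold=6))  # 'cup'
```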

Like almost everybody who has attempted to design a sequence hierarchy for AI, I used to think that a higher-level sequence was just a mechanism that served to join two or more non-overlapping sequences at a lower level. It took me years to figure out that I was wrong. It turned out that the main function of the sequence hierarchy is not to manage sequence storage but to find as many fixed temporal correlations between multiple co-occurring sequences as possible. Here is how it works.

It would be too inefficient to test every node in a vine against every other node in sequence memory. The brain uses a divide-and-conquer approach. Every vine is divided into multiple seven-node sequences. Why seven? It is a compromise. Fewer than seven would consume too much energy while more than seven would result in sluggish performance.
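
The slicing itself is trivial; here is a sketch. What happens to a vine whose length is not a multiple of seven is an open question, so the short final chunk below is an assumption of mine.

```python
# Sketch of the divide-and-conquer step: slicing an arbitrarily long
# vine into consecutive seven-node sequences. The short final chunk is
# my own assumption; the model does not specify it.

SEQUENCE_LENGTH = 7

def split_vine(vine_nodes):
    """Split a vine (list of nodes) into seven-node sequences."""
    return [vine_nodes[i:i + SEQUENCE_LENGTH]
            for i in range(0, len(vine_nodes), SEQUENCE_LENGTH)]

sequences = split_vine(list(range(20)))
print([len(s) for s in sequences])  # [7, 7, 6]
```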


Let me go out on a limb and claim that these short sequences are implemented in the brain as cortical columns. In addition to serving as a mechanism for ordering pattern activations, they can also record their activities by retaining a trace (both time and speed) of their last activation in their minicolumns. The seventh node of every sequence can be connected to nodes in an upper level to form higher-level vines. These are, likewise, divided into sequences which, in turn, can send connections to an upper level. I happen to know that the sequence hierarchy has 20 levels. How I know this and how vines are constructed are topics for a future article. The important thing to notice here is that upper sequences are just mechanisms that connect lower-level sequences that are temporally related. They essentially bind a number of patterns together to form a single complex object.
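
Here is a toy sketch of how the stacking could work: the seventh node of each sequence at one level becomes an input node at the level above, and the process repeats. Apart from the 20-level figure, everything in the sketch is an illustrative assumption of mine.

```python
# Sketch of stacking the hierarchy: the seventh (last) node of each
# sequence at one level feeds the level above, and the process repeats.
# The 20-level figure is my claim; the rest is illustrative only.

SEQUENCE_LENGTH = 7
NUM_LEVELS = 20

def build_level(lower_nodes):
    """Group lower-level nodes into seven-node sequences and return the
    list of their last (seventh) nodes, which feed the next level."""
    sequences = [lower_nodes[i:i + SEQUENCE_LENGTH]
                 for i in range(0, len(lower_nodes), SEQUENCE_LENGTH)]
    return [seq[-1] for seq in sequences]

level = list(range(7 ** 4))      # a toy bottom level of 2401 nodes
for depth in range(NUM_LEVELS):
    if len(level) <= 1:          # reached the top (a branch)
        break
    level = build_level(level)
print(f"top reached after {depth} levels, {len(level)} node(s) remain")
```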

A top-level sequence is what I call a branch in the sequence hierarchy. It is a complex object detector. It is also the brain's mechanism of attention: only one branch can be "awake" at a time. During recognition, signals from pattern memory quickly travel (via the seventh nodes of many sequences) all the way up the sequence tree as far as they can go. A top-level sequence will trigger a recognition event as soon as the signals it receives from the lower levels add up to the full activation of just two of its nodes. This recognition event is invariant to the actual activation states of the lower-level sequences. What matters is that enough signals reach the top.

Partial activation of more than two nodes is acceptable as long as the required overall amount is reached. This is how the brain handles uncertainty. It means that it takes relatively few sensory signals to trigger a recognition; even partial occlusions can do it. This, combined with the variable intervals of the sequences, is the reason that we can recognize faces and animals in the clouds, different handwritings and fonts, highly stylized art, and so on. When a top-level sequence is triggered, it sends a recognition signal via feedback pathways all the way back down to pattern memory, where pattern neurons are also triggered, thus correcting any incomplete or corrupt pattern information.
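
The trigger rule can be sketched as a simple sum over partial activations, with a callback standing in for the descending feedback pathways; the numeric scale is an assumption of mine.

```python
# Sketch of the recognition trigger: a top-level sequence fires as soon
# as the summed activation reaching it equals the full activation of
# two of its nodes, whether from two fully active nodes or partial
# activation spread over more. On firing, feedback is sent back toward
# pattern memory. Numeric scales are my own assumptions.

TRIGGER_THRESHOLD = 2.0   # "the overall activation of two nodes"

def recognize(node_activations, feedback):
    """node_activations: per-node activation in [0, 1].
    Calls `feedback` when the summed activation crosses threshold."""
    total = sum(node_activations)
    if total >= TRIGGER_THRESHOLD:
        feedback()            # correct/complete lower-level patterns
        return True
    return False

# Partial activations over four nodes can still trigger recognition:
fired = recognize([0.6, 0.5, 0.5, 0.5],
                  feedback=lambda: print("feedback to pattern memory"))
print(fired)  # True
```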


Note: In a future article, I will explain how sequence learning is done using spike timing, among other interesting things. I may also have a demo program (one never knows) to support my claims. Stay tuned and be patient.

See Also:

Invariant Recognition of Visual Objects (Frontiers Media)
A Theory of How Columns in the Neocortex Enable Learning the Structure of the World (Frontiers Media)
Unsupervised Machine Learning: What Will Replace Backpropagation
Fast Unsupervised Pattern Learning Using Spike Timing
Fast Unsupervised Sequence Learning Using Spike Timing

2 comments:


Anonymous said...


Have you heard of Hinton's "capsules" theory? It is very similar to your description, and it uses a hierarchy of layers composing a deep learning network.

A./

Louis Savain said...

Anonymous:

Have you heard of Hinton's "capsules" theory? It is very similar to your description, and it uses a hierarchy of layers composing a deep learning network.


Thanks for the comment. Are you kidding me? Hinton is even more lost in the woods than Hawkins. His capsules are a complete joke. They have nothing in common with the model I am proposing. Also, everybody has been using a hierarchy for ages.