Wednesday, February 12, 2014

Why Convolutional Neural Networks Miss the Mark


Convolutional neural networks (CNNs) are a type of deep learning network that has been applied successfully to visual recognition. They owe their success to being faster to train (probably because of their sparse connectivity) and to being invariant to certain spatial transformations, such as translations. In this article, I argue that CNNs miss the mark because they have a rather limited form of invariance, whereas the brain is universally invariant.

Universal Versus Translation Invariance

If you hold your hand in front of your face and rotate it, move it side to side, up and down, shine a blue or red light on it, make a fist, a thumb up or peace sign, etc., at no point during the transformations will there be any doubt in your mind that you are looking at your hand. This is in spite of the fact that, during the transformations, your visual cortex is presented with literally hundreds of very different images. This is an example of universal invariance, something that the brain accomplishes with ease. CNNs can handle only a subset of these transformations because, as seen in the diagram below, they are hardwired for translation invariance.

With some modifications, it should even be possible to get a CNN to tolerate rotations. But CNNs suffer from an even bigger problem. They may be invariant to translations, but they have no way of telling whether all the successive images represent the same hand. They can only recognize each image as a hand, and that's about it. This lack of continuity makes them ill-suited to future robot intelligence.
CNNs are invariant to translations thanks to a technique known as spatial pooling. Essentially, neighboring units in a given layer are pooled together to activate a single unit in the layer immediately above. The pooling operation can be a sum, an average, or a maximum. The end result is that the activation of a top-layer unit is invariant to the position of a stimulus at the bottom layer.
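To make the mechanism concrete, here is a minimal NumPy sketch of max pooling (one of the pooling operations mentioned above). It shows that a stimulus at two nearby positions lands in the same pooled unit, so the pooled output is identical:

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Downsample a 2D feature map by taking the max over
    non-overlapping size x size windows."""
    h, w = feature_map.shape
    h2, w2 = h // size, w // size
    trimmed = feature_map[:h2 * size, :w2 * size]
    windows = trimmed.reshape(h2, size, w2, size)
    return windows.max(axis=(1, 3))

# A single activation at two nearby positions of a 4x4 feature map:
a = np.zeros((4, 4)); a[0, 0] = 1.0
b = np.zeros((4, 4)); b[1, 1] = 1.0

# Both positions fall into the same 2x2 pooling window,
# so the pooled outputs are identical:
print(np.array_equal(max_pool(a), max_pool(b)))  # True
```

Of course, the invariance only holds within a pooling window; larger shifts require stacking several convolution/pooling layers, which is exactly what CNNs do.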

Biological Implausibility

It is highly unlikely that the visual cortex uses pre-wired spatial pooling to obtain translation invariance. Why? First off, if the brain used a different type of invariant architecture for every type of transformation, the cortex would be a wiring mess. Second, one would expect the auditory cortex to have a different architecture for invariance than the visual cortex, but this is not observed. The global uniformity of the cortex is one of its most striking features. A ferret whose optic nerves were rerouted to its auditory cortex at the embryonic stage was able to use its auditory cortex to learn to see and navigate fairly normally.

How Does the Brain Do It?

It should be fairly obvious that the brain uses a single method to achieve universal invariance. The most likely hypothesis is that the brain has two memory hierarchies, one for concurrent patterns and one for sequences of patterns. Learning in the brain is 100% unsupervised. The sequence hierarchy is a powerful memory structure that serves multiple functions. It is a common storage mechanism for attention, prediction, planning, adaptation, short- and long-term memory, analogies, and, last but not least, temporal pooling. Every invariant object is represented by a single branch in the hierarchy. I hypothesize that temporal pooling is how the cortex achieves universal invariance. To emulate the brain's universal invariance, one must first design a good pattern learner/recognizer that feeds its output signals to a sequence learning mechanism. The latter must be able to automatically stitch patterns and related sequences together to form invariant object representations. I will have more to say about pattern and sequence learning in future articles.
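For illustration only, here is a toy Python sketch of the pattern-then-sequence idea just described. Every class, name, and rule below is hypothetical; no claim is made that this matches the brain or any published system. A pattern learner assigns a stable label to each distinct input, and a sequence learner stitches successive labels into one invariant object:

```python
# Hypothetical sketch of the two-hierarchy idea: a pattern learner feeding
# a sequence learner that pools successive patterns into invariant objects.

class PatternLearner:
    """Assigns a stable label to each distinct input pattern
    (a stand-in for an unsupervised pattern recognizer)."""
    def __init__(self):
        self.labels = {}

    def recognize(self, pattern):
        if pattern not in self.labels:
            self.labels[pattern] = len(self.labels)
        return self.labels[pattern]

class SequenceLearner:
    """Stitches successive pattern labels together: labels that follow
    one another in time get pooled under one invariant object id."""
    def __init__(self):
        self.object_of = {}   # pattern label -> invariant object id
        self.next_object = 0

    def observe(self, prev_label, label):
        # If either label already belongs to an object, pool both into it.
        obj = self.object_of.get(prev_label, self.object_of.get(label))
        if obj is None:
            obj = self.next_object
            self.next_object += 1
        self.object_of[prev_label] = obj
        self.object_of[label] = obj
        return obj

pl = PatternLearner()
sl = SequenceLearner()
# Three very different views of the same hand arrive in succession:
labels = [pl.recognize(v) for v in ["fist", "thumb_up", "peace_sign"]]
objects = [sl.observe(labels[i - 1], labels[i]) for i in range(1, len(labels))]
print(objects)  # both transitions map to the same object id: [0, 0]
```

The point of the sketch is the temporal pooling step: the three static views share no features, yet their succession in time is what binds them into one invariant representation.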

See Also

Why Deep Learning Is a Hindrance to Progress Toward True AI
The Billion Dollar AI Castle in the Air
Why Deep Learning Will Go the Way of Symbolic AI


David Díaz Vico said...

Hi Louis,

Very interesting. I agree that CNNs are limited in some way. However, I also think their design is very smart and, by now, they are the state of the art for many kinds of problems.

On the issue of time invariance for sequences of images (video), wouldn't a CNN with order-3 tensor filters do the trick? It's clear that the usual CNN with order-2 tensor filters used for static images, like LeNet for MNIST, cannot exploit the topological structure of the data in the time dimension, but it could easily be extended just by increasing the order of the tensor filters by one. I'm not a guru of CNNs, but I think the theory is the same. Of course, working with order-3 tensors would be much more computationally expensive than working with order-2 tensors.
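[For concreteness, a naive NumPy sketch of what such an order-3 (time × height × width) filter would compute; the shapes and random values are purely illustrative:]

```python
import numpy as np

def conv3d_valid(clip, kernel):
    """Naive 'valid' 3D cross-correlation of a (T, H, W) clip with a
    (t, h, w) kernel; returns a (T-t+1, H-h+1, W-w+1) volume. A single
    filter response now summarizes a short window of frames, not one frame."""
    T, H, W = clip.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(clip[i:i+t, j:j+h, k:k+w] * kernel)
    return out

clip = np.random.rand(8, 16, 16)    # 8 frames of 16x16 "video"
kernel = np.random.rand(3, 5, 5)    # order-3 filter: 3 frames deep
print(conv3d_valid(clip, kernel).shape)  # (6, 12, 12)
```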

In any case, I agree that biological neural networks probably don't get time invariance through that mechanism, as you say. I guess it has something to do with temporal coding through spikes in their activations, but I'm quite new to these ideas and this is just some intuition. A recurrent architecture is probably also involved.

And speaking of rotation invariance, could you give more details on this? I find it very interesting, but I don't know of any current model that achieves it.

Just one more thing. I wouldn't say ALL brain learning is unsupervised. Most of it surely is, but I think supervised learning or reinforcement learning have their role too.

Thanks for sharing your thoughts. I find them really interesting.

Best regards.

David Díaz

Louis Savain said...

Hi David,

I see what you are saying, and I'm sure that adding a third dimension to represent time in a CNN could lead to a temporal recognizer for video and audio data. However, it would not eliminate some of the fundamental problems with the current deep learning paradigm. For one, there is no need for spatial pooling if temporal learning is used. Temporal pooling can handle all types of transformations. Also, there is a lot more to understanding a scene than recognizing the objects in it. There are precise cause-effect relationships between the objects in the scene and between the viewer and the objects. These things cannot be done in a deep learning network because nobody really knows how knowledge is represented in the network. The reason for this is the use of supervised learning. So-called unsupervised deep learning is really a joke.

It is possible that current DL models will be able to outperform the brain in one or two narrow domains, but there is no doubt that they are not part of the future of AI and intelligent robotics. In my opinion, the future of AI is in spiking neural networks, which eliminate the concept of a static image altogether. IOW, if something does not change, it cannot be seen or heard. This immediately forces us to use center-surround sensors and micro-saccades for visual recognition. These give the classifier much more meaningful temporal information about a scene right off the bat. Concurrent patterns can easily be discovered without the use of supervision because the fitness criterion for learning is simple and independent of the subject matter: the spikes in a pattern should arrive concurrently. A similarly simple fitness criterion can be used for sequence learning. So the killer advantage of temporal learning is that it makes unsupervised learning a breeze.
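As a toy illustration of the concurrence criterion (spikes in a pattern should arrive concurrently), here is a hypothetical sketch that groups input lines whose spike times coincide within a tolerance. The function, the grouping rule, and the tolerance value are all my own inventions for illustration, not a description of any real spiking network:

```python
def concurrent_groups(spike_times, tolerance=1.0):
    """spike_times: dict mapping input id -> sorted list of spike times.
    Two inputs are grouped if each spike on one has a matching spike on
    the other within `tolerance`. Note the criterion never looks at what
    the inputs mean -- only at when they fire -- which is why no labels
    (no supervision) are needed."""
    def concurrent(a, b):
        return (len(a) == len(b) and
                all(abs(x - y) <= tolerance for x, y in zip(a, b)))

    groups = []
    for inp, times in spike_times.items():
        for g in groups:
            if concurrent(spike_times[g[0]], times):
                g.append(inp)
                break
        else:
            groups.append([inp])
    return groups

spikes = {
    "A": [10.0, 20.0, 30.0],
    "B": [10.2, 19.9, 30.1],   # fires together with A -> same pattern
    "C": [12.0, 25.0, 33.0],   # fires on its own schedule
}
print(concurrent_groups(spikes))  # [['A', 'B'], ['C']]
```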

You mentioned that you don't think that all brain learning is unsupervised, and I agree. However, my understanding is different from yours. In my opinion, the only part of the brain that uses strictly supervised learning is the cerebellum. It learns routine sensorimotor tasks (e.g., posture maintenance, walking, etc.) from the volitional cortico-motor complex. This is necessary in order to give the brain the ability to attend to important events and to volitional behaviors such as thinking, speaking, and eating while walking, sitting, etc. This is the reason that people with impaired cerebellums find it difficult to speak or think while standing or walking.


Louis Savain said...

Hi David,

In regard to this: "Just one more thing. I wouldn't say ALL brain learning is unsupervised. Most of it surely is, but I think supervised learning or reinforcement learning have their role too."

Maybe I should have written that perceptual learning in the cortex is 100% unsupervised. In other words, the brain does not depend on labeled data to learn to classify sensory objects. You could say that reinforcement learning uses labeled data (from pain and pleasure sensors) but that is pushing the definition, in my opinion.