Convolutional neural networks (CNNs) are a type of deep neural network that has been successfully applied to visual recognition. They owe much of their success to being fast to train (probably because of their sparse connectivity) and to being invariant to certain spatial transformations, such as translation. In this article, I argue that CNNs miss the mark because they have a rather limited form of invariance, whereas the brain is universally invariant.
Universal Versus Translation Invariance
If you hold your hand in front of your face and rotate it, move it side to side, up and down, shine a blue or red light on it, make a fist, a thumbs-up, or a peace sign, etc., at no point during the transformations will there be any doubt in your mind that you are looking at your hand. This is in spite of the fact that, during the transformations, your visual cortex is presented with literally hundreds of very different images. This is an example of universal invariance, something that the brain accomplishes with ease. CNNs can handle only a subset of these transformations because, as seen in the diagram below, they are hardwired for translation invariance.
With some modifications, it should even be possible to get a CNN to tolerate rotations. But CNNs suffer from an even bigger problem: they may be invariant to translations, but they have no way of telling whether a series of successive images represents the same hand. They can recognize each image as a hand, and that is about it. This lack of temporal continuity makes them ill-suited to future robot intelligence.
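The hardwired translation invariance described above is easy to see in miniature. The sketch below (my own illustration, not code from any particular CNN library) slides a small edge detector over an image and then applies global max pooling, which discards position: a translated copy of the feature produces exactly the same pooled response, while a rotated copy does not.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 'valid' cross-correlation: slide the kernel over the image."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A small vertical-edge detector.
kernel = np.array([[1.0, -1.0],
                   [1.0, -1.0]])

image = np.zeros((8, 8))
image[2:4, 2] = 1.0          # a vertical edge at one location

shifted = np.zeros((8, 8))
shifted[4:6, 5] = 1.0        # the same edge, translated

rotated = np.zeros((8, 8))
rotated[2, 2:4] = 1.0        # the same edge rotated 90 degrees

# Global max pooling over the feature map throws away position, so the
# response is identical for the original and the translated edge...
r1 = conv2d_valid(image, kernel).max()
r2 = conv2d_valid(shifted, kernel).max()
# ...but nothing in the architecture handles rotation: the response drops.
r3 = conv2d_valid(rotated, kernel).max()

print(r1 == r2, r3 < r1)  # translation survives pooling; rotation does not
```

The invariance here comes entirely from the pooling step being wired into the architecture, which is precisely the point: each new transformation would need its own special wiring.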
It is highly unlikely that the visual cortex uses pre-wired spatial pooling to obtain translation invariance. Why? First off, if the brain used a different type of invariant architecture for every type of transformation, the cortex would be a wiring mess. Second, one would expect the auditory cortex to have a different architecture for invariance than the visual cortex, but this is not observed. The global uniformity of the cortex is one of its most striking features. A ferret whose optic nerves were rerouted to its auditory cortex at the embryonic stage was able to use its auditory cortex to learn to see and navigate fairly normally.
How Does the Brain Do It?
It should be fairly obvious that the brain uses a single method to achieve universal invariance. The most likely hypothesis is that the brain has two memory hierarchies, one for concurrent patterns and one for sequences of patterns. Learning in the brain is 100% unsupervised. The sequence hierarchy is a powerful memory structure that serves multiple functions. It is a common storage mechanism for attention, prediction, planning, adaptation, short- and long-term memory, analogies, and, last but not least, temporal pooling. Every invariant object is represented by a single branch in the hierarchy. I hypothesize that temporal pooling is the way the cortex achieves universal invariance. To emulate the brain's universal invariance, one must first design a good pattern learner/recognizer that feeds its output signals to a sequence learning mechanism. The latter must be able to automatically stitch patterns and related sequences together to form invariant object representations. I will have more to say about pattern and sequence learning in future articles.
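The two-hierarchy arrangement can be sketched in a few lines. This is a toy illustration of temporal pooling under my own assumptions, not an implementation of the cortex: a pattern memory labels each frame, and a sequence memory assumes that patterns following one another in time belong to the same object, stitching them into a single invariant representation. All class and variable names are hypothetical.

```python
class PatternMemory:
    """Stands in for a pattern learner/recognizer: maps a raw frame
    to a pattern label."""
    def recognize(self, frame):
        return frame["pattern"]          # e.g. "fist", "thumb_up"

class SequenceMemory:
    """Toy temporal pooling: patterns that occur in succession are
    stitched into one invariant object representation."""
    def __init__(self):
        self.object_of = {}              # pattern label -> object id
        self.next_id = 0

    def observe(self, prev_pattern, pattern):
        if prev_pattern is None:
            # First frame: start a new invariant object if needed.
            self.object_of.setdefault(pattern, self._new_object())
        else:
            # Successive patterns are assumed to belong to one object.
            self.object_of[pattern] = self.object_of[prev_pattern]
        return self.object_of[pattern]

    def _new_object(self):
        self.next_id += 1
        return self.next_id

patterns = PatternMemory()
sequences = SequenceMemory()

# Successive views of one gesturing hand, as very different "images".
frames = [{"pattern": p}
          for p in ["open_palm", "fist", "thumb_up", "open_palm"]]

prev = None
ids = []
for frame in frames:
    p = patterns.recognize(frame)
    ids.append(sequences.observe(prev, p))
    prev = p

print(ids)  # every frame maps to the same invariant object id
```

The point of the sketch is the division of labor: the pattern memory alone would only report "a hand" for each frame, while the sequence memory is what ties the frames together into one persistent object.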