I recently read a paper (by way of PhysOrg) about a new speech recognition model called Hidden Conditional Neural Fields (HCNF) for continuous phoneme speech recognition. The good news is that HCNF outperforms existing models. The bad news is that it does not come close to solving any of the pressing problems that plague automatic speech recognition. Problems like noise intolerance and the inability to focus on one speaker in a roomful of speakers continue to vex the best experts in the field. While reading the paper, it occurred to me that the main problem with speech recognition models is that they exist in the first place. I will argue that the very fact that we have such models is at the root of the problem. Let me explain.
Not Good Enough
Wouldn't it be nice if we could talk directly to our television sets? We should be able to say things like, "turn the volume down a little" or "please record the next episode of Game of Thrones". Indeed, why is it that we can talk to our smartphones but not to our TVs? The reason is simple: current speech recognizers are pretty much useless in the presence of noise or multiple voices speaking at the same time. The sounds coming from the TV alone would confuse any state-of-the-art recognizer. Sure, we could turn the remote control into a noise-reduction microphone and hold it close to the mouth when speaking, but that would defeat the purpose of having a hands-free and intuitive way of interacting with our TVs. What we need is a recognizer that can focus on one or two voices in the room while ignoring everything else, including other voices, e.g., from children, pets or guests, or from the TV. A good TV speech recognizer should respond only to the voices of those it was instructed to pay attention to. It should also ignore any conversation that does not concern its function. Unfortunately, these capabilities are way beyond what current speech recognition technology can achieve, and there are no solutions in sight.
I am arguing that speech recognition models are not up to the task simply because they are limited-domain models, i.e., they only work with speech. But why shouldn't they, you ask? The reason is that the brain does not use a different representation or learning scheme for different types of knowledge. To the brain, knowledge is knowledge, regardless of whether its origin is auditory, tactile or visual. It does not matter whether it has to do with language, music, pictures, food, houses, trees, cats or what have you. The cortical mechanism that lets you recognize your grandmother's face is not structurally different from the one that lets you recognize your grandmother's name. A good speech recognition model should be able to learn to recognize any type of sensory data, not just speech. It should also be able to recognize multiple languages, not just one. And why not? If the human brain can do it, a computer program can do it too, right? After all, it is just a neural mechanism. However, as such, the model would no longer be a speech recognition model but a general perceptual learning model.
There Is a Pattern to the Madness
The brain learns by finding patterns in the stream of signals that it continually receives from its sensors. The origin of the signals does not matter because a signal arriving from an audio sensor is no different than a signal arriving from a light detector. It is just a transient pulse, a temporal marker that signifies that something just happened. This raises the question: how can the brain use the same model to learn different types of knowledge? In other words, how does the brain extract knowledge from a stream of unlabeled sensory pulses? The answer lies in the observation that sensory signals do not occur randomly. There is a pattern to the madness. In fact, there are millions of patterns in the brain's sensory stream. The key to learning them all has to do with timing. That is, sensory signals can be grouped and categorized according to their temporal relationships. It turns out that signals can have only two types of temporal relationships: they can be either concurrent or sequential. The learning mechanism of the brain is designed to discover those relationships and recognize them every time they occur. This is the basis of all learning and knowledge.
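To make the idea concrete, here is a minimal sketch of how pulses could be sorted into the two temporal relationships described above. This is purely my own illustration, not code from any published model; the function name, the 10-millisecond concurrency window, and the sample stream are all assumptions made up for the example.

```python
# Toy illustration: classify pairs of sensory pulses as "concurrent" or
# "sequential" based only on their timestamps. The sensor of origin is
# deliberately ignored, mirroring the point that a pulse is a pulse.

CONCURRENCY_WINDOW = 0.010  # seconds; an arbitrary threshold chosen for this sketch


def relate(t_a, t_b, window=CONCURRENCY_WINDOW):
    """Return the temporal relationship between two pulse timestamps."""
    return "concurrent" if abs(t_a - t_b) < window else "sequential"


# A stream of (sensor, timestamp) pulses from different modalities.
stream = [("audio", 0.000), ("touch", 0.004), ("vision", 0.120)]

print(relate(stream[0][1], stream[1][1]))  # concurrent (4 ms apart)
print(relate(stream[1][1], stream[2][1]))  # sequential (116 ms apart)
```

The point of the sketch is that the classification rule never inspects the sensor label, only the timing, so the same mechanism applies to audio, touch or vision alike.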
The Holy Grail of Perceptual Learning
Many in the business assume that the cocktail party problem is relevant only to speech recognition. In reality, it is a problem that must be solved for every type of sensory phenomena, not just speech sounds. Humans and animals do it continually when they shift their attention from one object to another. The brain's ability to pay attention to one thing at a time is the holy grail of perceptual learning. In conclusion, let me reiterate that we don't need different models for visual and speech recognition. We need only one perceptual learning model for everything.
PS. I am continuing to write code for the Rebel Speech recognizer and incorporating the principles of perceptual learning that I have written about on this blog. I am making steady progress and I will post a demo executable as soon as it is ready. Hang in there.