Friday, October 11, 2013

The Problem with Speech Recognition Models

Abstract

I recently read a paper (by way of PhysOrg) about a new speech recognition model called Hidden Conditional Neural Fields (HCNF) for continuous phoneme speech recognition. The good news is that HCNF outperforms existing models. The bad news is that it does not come close to solving any of the pressing problems that plague automatic speech recognition. Problems like noise intolerance and the inability to focus on one speaker in a roomful of speakers continue to vex the best experts in the field. While reading the paper, it occurred to me that the main problem with speech recognition models is that they exist in the first place. I will argue that the very fact that we have such models is at the root of the problem. Let me explain.

Not Good Enough

Wouldn't it be nice if we could talk directly to our television sets? We should be able to say things like, "turn the volume down a little" or "please record the next episode of Game of Thrones". Indeed, why is it that we can talk to our smartphones but not to our TVs? The reason is simple: current speech recognizers are pretty much useless in the presence of noise or multiple voices speaking at the same time. The sounds coming from the TV alone would confuse any state-of-the-art recognizer. Sure, we could turn the remote control into a noise-reduction microphone and hold it close to the mouth when speaking, but that would defeat the purpose of having a hands-free, intuitive way of interacting with our TVs. What we need is a recognizer that can focus on one or two voices in the room while ignoring everything else, including other voices, e.g., from children, pets or guests, or from the TV itself. A good TV speech recognizer should respond only to the voices of those it was instructed to pay attention to. It should also ignore any conversation that does not concern its function. Unfortunately, these capabilities are far beyond what current speech recognition technology can achieve, and there are no solutions in sight.

Limited Domain

I am arguing that speech recognition models are not up to the task simply because they are limited-domain models, i.e., they only work with speech. But why shouldn't they, you ask? Because the brain does not use a different representation or learning scheme for different types of knowledge. To the brain, knowledge is knowledge, regardless of whether its origin is auditory, tactile or visual. It does not matter whether it has to do with language, music, pictures, food, houses, trees, cats or what have you. The cortical mechanism that lets you recognize your grandmother's face is not structurally different from the one that lets you recognize your grandmother's name. A good speech recognition model should be able to learn to recognize any type of sensory data, not just speech. It should also be able to recognize multiple languages, not just one. And why not? If the human brain can do it, a computer program can do it too, right? After all, it is just a neural mechanism. At that point, however, the model would no longer be a speech recognition model but a general perceptual learning model.

There Is a Pattern to the Madness

The brain learns by finding patterns in the stream of signals that it continually receives from its sensors. The origin of the signals does not matter because a signal arriving from an audio sensor is no different from a signal arriving from a light detector. It is just a transient pulse, a temporal marker that signifies that something just happened. This raises the question: how can the brain use the same model to learn different types of knowledge? In other words, how does the brain extract knowledge from a stream of unlabeled sensory pulses? The answer lies in the observation that sensory signals do not occur randomly. There is a pattern to the madness. In fact, there are millions of patterns in the brain's sensory stream. The key to learning them all has to do with timing. That is, sensory signals can be grouped and categorized according to their temporal relationships. It turns out that signals can have only two types of temporal relationships: they are either concurrent or sequential. The learning mechanism of the brain is designed to discover those relationships and to recognize them every time they occur. This is the basis of all learning and knowledge.
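To make this concrete, here is a minimal sketch of what temporal grouping could look like in code. It is purely illustrative: the function name, the 10 ms concurrency window and the 500 ms maximum lag are my own assumptions for the example, not claims about how the brain actually times things.

    from collections import defaultdict

    # Illustrative sketch: group a stream of unlabeled sensory pulses by
    # their temporal relationships. Each pulse is a (sensor_id, timestamp)
    # pair. The window sizes below are assumptions made for the example.

    CONCURRENCY_WINDOW = 0.010  # pulses within 10 ms count as concurrent
    MAX_LAG = 0.500             # pairs more than 500 ms apart are ignored

    def group_pulses(pulses):
        """Count concurrent and sequential pulse pairs in a sensory stream."""
        concurrent = defaultdict(int)  # {frozenset({a, b}): count}
        sequential = defaultdict(int)  # {(earlier, later): count}
        pulses = sorted(pulses, key=lambda p: p[1])  # order by timestamp
        for i, (a, t_a) in enumerate(pulses):
            for b, t_b in pulses[i + 1:]:
                if t_b - t_a <= CONCURRENCY_WINDOW:
                    concurrent[frozenset({a, b})] += 1  # effectively simultaneous
                elif t_b - t_a <= MAX_LAG:
                    sequential[(a, b)] += 1             # a reliably precedes b
                else:
                    break  # pulses are sorted, so later ones are farther still
        return concurrent, sequential

Pairs that recur far more often than chance would then be promoted to learned patterns: concurrent groups behave like snapshots, sequential ones like melodies.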

The Holy Grail of Perceptual Learning

Many in the business assume that the cocktail party problem is relevant only to speech recognition. In reality, it is a problem that must be solved for every type of sensory phenomenon, not just speech sounds. Humans and animals solve it continually when they shift their attention from one object to another. The brain's ability to pay attention to one thing at a time is the holy grail of perceptual learning. In conclusion, let me reiterate that we don't need different models for visual and speech recognition. We need only one perceptual learning model for everything.

P.S. I am continuing to write code for the Rebel Speech recognizer, incorporating the principles of perceptual learning that I have written about on this blog. I am making steady progress and will post a demo executable as soon as it is ready. Hang in there.

6 comments:

reppoHssarg said...

Have you heard of Numenta? They have both a .org and a .com, but the .com has become GROK.

bitcuration said...

What is "Attention" in neural science? http://www.amazon.com/Neuroscience-Attention-Attentional-Control-Selection/dp/0195334361
This text addresses the basic neuroscience of how the brain controls the focus of attention, and how this focused attention influences sensory and motor processes, for freakin $89.98.

Louis Savain said...

reppoHssarg,

Thank you and yes, I know about Numenta and Grok.

bitcuration,

Thanks for the link. I seriously doubt that anybody in the field understands how attention works unless they've been reading my blog. I am not boasting or anything, because I did not figure it out on my own. I would never buy that book because, if anybody in the neuroscience community really understood the brain's mechanism of attention, it would be big news. Very big news.

There is no need to spend your money because the way it works is really simple. Essentially, cortical sequence memory is organized hierarchically like a tree. Each branch of the tree represents an object or concept. Only one branch of the tree of knowledge can be active at a time and it can remain active for no longer than about 12 seconds, at which point attention is switched to another branch. That's pretty much it.
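If it helps, here is a toy sketch of that switching behavior in code. The class and the names in it are mine and purely illustrative (and it assumes at least two branches); the only point is that exactly one branch is active at a time and it gets evicted after roughly 12 seconds:

    import random
    import time

    MAX_DWELL = 12.0  # seconds a branch may stay active, per the claim above

    class KnowledgeTree:
        def __init__(self, branches):
            self.branches = branches  # each branch stands for an object/concept
            self.active = None        # at most one branch is active at a time
            self.since = 0.0

        def attend(self, branch):
            """Activate a single branch; every other branch is suppressed."""
            self.active = branch
            self.since = time.monotonic()

        def tick(self):
            """Switch attention once the active branch has been held too long."""
            if self.active is None or time.monotonic() - self.since > MAX_DWELL:
                others = [b for b in self.branches if b != self.active]
                self.attend(random.choice(others))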

pobri said...

"Essentially, cortical sequence memory is organized hierarchically like a tree. Each branch of the tree represents an object or concept. Only one branch of the tree of knowledge can be active at a time and it can remain active for no longer than about 12 seconds, at which point attention is switched to another branch. That's pretty much it."

Doesn't really sound like any revelation, I mean, you're describing a binary search tree. They're frequently used in AI research. How is this new?

Louis Savain said...

Hi pobri. You wrote:

Doesn't really sound like any revelation, I mean, you're describing a binary search tree. They're frequently used in AI research. How is this new?

Well, it's not a binary search tree. It's a temporal knowledge classification tree that receives its inputs from pattern memory. Each node in the tree is a sequence of seven nodes. The tree serves multiple purposes, as seen below.

1. It provides a mechanism for attention and invariant recognition (the active branch).
2. It is a behavior selection mechanism because only the active branch can participate in motor output.
3. It is a prediction mechanism because sequences can be traced.
4. Thanks to its predictive ability, it is also an adaptation mechanism (seeking rewards and avoiding pain).

There are a few tricks used in building the tree but I cannot go into full details right now because I want to save that for later. I hope this helped a little.
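In the meantime, here is a toy sketch of what I mean by a sequence node and by tracing a sequence for prediction. Every name in it is illustrative; it is not the Rebel Speech code:

    SEQ_LEN = 7  # each node holds an ordered run of seven lower-level nodes

    class SequenceNode:
        def __init__(self, children):
            assert len(children) == SEQ_LEN
            self.children = list(children)  # ordered lower-level nodes

        def predict_next(self, prefix):
            """Trace the stored sequence: given what has played so far,
            return the element expected next, or None on a mismatch."""
            n = len(prefix)
            if n < SEQ_LEN and self.children[:n] == list(prefix):
                return self.children[n]
            return None

    # Example: a node whose seven children are the letters of a word.
    node = SequenceNode(list("grandma"))
    print(node.predict_next(list("gra")))  # -> 'n'

Prediction (purpose 3 above) is just this trace; attention (purpose 1) is whichever branch of such nodes is currently active; and once you can predict, rewards and penalties (purpose 4) have something to work with.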

pobri said...

Fair enough; well, I look forward to the day you can shed further light on it, or perhaps share a working implementation. It's hard for me to really comment on its efficacy as an AI tool without seeing the code :-) Thanks for your reply.