Wednesday, August 15, 2012

Rebel Speech Recognition Theory

A Different Approach

The approach that I use for speech recognition in the Rebel Speech recognizer is completely different from the one used by most existing technologies. As I have mentioned elsewhere, I disagree with the AI experts who claim that the brain uses anything like Bayesian statistics to process probabilistic stimuli. I have added a theory section to the Rebel Speech design document (pdf) and I reproduce it below.

Bayesian Bandwagon

The most surprising thing about the Rebel speech recognition engine is that, unlike current state-of-the-art speech recognizers, it does not use Bayesian statistics. This will come as a surprise to AI experts because they all jumped on the Bayesian bandwagon many years ago. Even those who claim to be closely emulating biological systems believe in the myth of the Bayesian brain. Of course, this is pure speculation and wishful thinking, because there is no biological evidence for it. In a way, it is not unlike the way the AI community jumped on the symbol manipulation bandwagon back in the 1950s, only to be proven wrong more than half a century later. I have excellent reasons to believe that, in spite of its current utility, the Bayesian approach is yet another red herring on the road to true AI.

Traditional Speech Recognition

Most speech recognition systems use a Bayesian probabilistic model, such as a hidden Markov model, to determine which senone, phoneme, or word is most likely to come next in a given speech segment. A learning algorithm is normally used to compile a large database of such probabilities. During recognition, the hypotheses generated for a given sound are tested against these precompiled expectations, and the one with the highest probability is selected as the winner.
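For concreteness, here is a minimal sketch of the kind of Bayesian decoding described above: a toy hidden Markov model decoded with the Viterbi algorithm in Python. The phoneme states, the symbolic acoustic frames, and every probability below are made-up toy values, not numbers from any actual recognizer; the point is only to show precompiled probabilities being used to pick the most likely hypothesis.

```python
# Toy HMM decoding with the Viterbi algorithm. All labels and
# probabilities are hypothetical illustrations, not real acoustic data.

states = ["sil", "AH", "T"]                     # toy hidden phoneme states
observations = ["frame1", "frame2", "frame3"]   # symbolic acoustic frames

start_p = {"sil": 0.6, "AH": 0.3, "T": 0.1}     # precompiled start probabilities
trans_p = {                                      # precompiled transition probabilities
    "sil": {"sil": 0.5, "AH": 0.4, "T": 0.1},
    "AH":  {"sil": 0.1, "AH": 0.5, "T": 0.4},
    "T":   {"sil": 0.4, "AH": 0.2, "T": 0.4},
}
emit_p = {                                       # precompiled emission probabilities
    "sil": {"frame1": 0.7, "frame2": 0.2, "frame3": 0.1},
    "AH":  {"frame1": 0.1, "frame2": 0.6, "frame3": 0.3},
    "T":   {"frame1": 0.2, "frame2": 0.2, "frame3": 0.6},
}

def viterbi(obs):
    # V[t][s] = probability of the best state path ending in s at time t
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    path = {s: [s] for s in states}
    for t in range(1, len(obs)):
        V.append({})
        new_path = {}
        for s in states:
            # Pick the predecessor that maximizes the path probability.
            prob, prev = max((V[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p)
                             for p in states)
            V[t][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best = max(states, key=lambda s: V[-1][s])
    return path[best], V[-1][best]

# The hypothesis with the highest precompiled probability wins.
print(viterbi(observations))
```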

Rebel Speech Recognition

In contrast to the above, the Rebel Speech engine does not rely on pre-learned probabilities. Instead, it uses an approach that is as counterintuitive as it is powerful: the probability that an interpretation of a sound is correct is not known in advance but is computed on the fly. The engine builds a hierarchical database of as many sequences of learned sounds as possible, starting with tiny snippets of sound shorter than a senone. As sounds are detected, they activate the various sequences that contain them, and the sequence with the highest hit count is the winner. A winner is usually found even before the speaker has finished speaking. This works because sound patterns are so distinctive that they form very few sequences. Once a winner is determined, all other sequences that do not belong to the same branch of the hierarchy are immediately suppressed. This approach yields very high recognition accuracy even when parts of the speech are missing, and it makes it possible to solve the cocktail party problem (pdf).
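To illustrate the hit-count idea, here is a toy sketch in Python. It is not the actual engine code: the snippet labels, the tiny sequence table, and the win_margin threshold are all hypothetical stand-ins I made up for this example, and real snippets would be sub-senone sound units rather than letters.

```python
from collections import defaultdict

# Toy table of learned sequences. Each sequence is an ordered tuple of
# sound-snippet labels, tagged with the branch of the hierarchy (here,
# the word) it belongs to. Labels are letters purely for readability.
learned_sequences = {
    ("h", "e", "l", "o"): "hello",
    ("h", "e", "l", "p"): "help",
    ("w", "o", "r", "l"): "world",
}

def recognize(snippet_stream, win_margin=1):
    """Feed detected snippets one at a time. Each snippet bumps the hit
    count of every learned sequence that contains it; a winner is declared
    as soon as one sequence leads all others by win_margin hits, which can
    happen before the utterance is finished."""
    hits = defaultdict(int)
    ranked = list(learned_sequences)
    for snippet in snippet_stream:
        for seq in learned_sequences:
            if snippet in seq:
                hits[seq] += 1
        ranked = sorted(learned_sequences, key=lambda s: hits[s], reverse=True)
        if hits[ranked[0]] - hits[ranked[1]] >= win_margin:
            # In the full engine, sequences outside the winner's branch
            # would now be suppressed and recognition would continue up
            # the hierarchy; here we simply return the winner's label.
            return learned_sequences[ranked[0]]
    # No early winner: fall back to the best hit count so far.
    return learned_sequences[ranked[0]] if hits else None

print(recognize(["w", "o", "r", "l"]))  # "world", decided after the first snippet
print(recognize(["e", "l", "o"]))       # "hello", even with the first snippet missing
```

The two usage lines show the properties claimed above under these toy assumptions: "world" is recognized from its very first snippet, before the utterance ends, and "hello" is still recognized when part of the speech is missing, because the surviving snippets activate its sequence more strongly than any competitor's.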
