The approach that I use in the Rebel Speech recognizer is completely different from the one used by most existing speech recognition technologies. As I have mentioned elsewhere, I disagree with the AI experts who hold that the brain uses anything like Bayesian statistics to process probabilistic stimuli. I added a theory section to the Rebel Speech design document (pdf), and I reproduce it below.
Bayesian Bandwagon
Traditional Speech Recognition
Most speech recognition systems use a Bayesian probabilistic model, such as the hidden Markov model, to determine which senone, phoneme or word is most likely to come next in a given speech segment. A learning algorithm is normally used to compile a large database of such probabilities. During recognition, hypotheses generated for a given sound are tested against these precompiled expectations and the one with the highest probability is selected as the winner.
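As a rough illustration of the selection step described above, here is a minimal Python sketch. It is not a hidden Markov model; it swaps in a plain first-order transition table to keep the example short. The `TRANSITIONS` values, the phoneme labels, the `floor` fallback and the function names are all made up for illustration.

```python
import math

# Hypothetical precompiled table: P(next phoneme | previous phoneme).
# In a real system these values would come from a learning algorithm.
TRANSITIONS = {
    ("k", "ae"): 0.6,
    ("k", "ah"): 0.4,
    ("ae", "t"): 0.7,
    ("ae", "p"): 0.3,
    ("ah", "t"): 0.5,
    ("ah", "p"): 0.5,
}


def log_score(phonemes, floor=1e-6):
    """Sum the log-probabilities of each transition in one hypothesis.

    Transitions missing from the table get a small assumed floor value.
    """
    return sum(
        math.log(TRANSITIONS.get(pair, floor))
        for pair in zip(phonemes, phonemes[1:])
    )


def pick_winner(hypotheses):
    """Return the hypothesis with the highest precompiled probability."""
    return max(hypotheses, key=log_score)


hypotheses = [("k", "ae", "t"), ("k", "ah", "p"), ("k", "ae", "p")]
winner = pick_winner(hypotheses)
print(winner)
```

The point to notice is that every number the recognizer needs is fixed before any sound arrives; recognition is a lookup followed by a comparison.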
Rebel Speech Recognition
In contrast to the above, the Rebel Speech engine does not rely on pre-learned probabilities. Rather, it uses an approach that is as counter-intuitive as it is powerful. In this approach, the probability that the interpretation of a sound is correct is not known in advance but is computed on the fly. The engine builds a hierarchical database of as many sequences of learned sounds as possible, starting with tiny snippets of sound that are shorter than a senone. When sounds are detected, they attempt to activate various sequences, and the sequence with the highest hit count is the winner. A winner is usually found even before the speaker has finished speaking. This works because sound patterns are so distinctive that any given pattern takes part in very few sequences. Once a winner is determined, all other sequences that do not belong to the same branch in the hierarchy are immediately suppressed. This approach leads to very high recognition accuracy even when parts of the speech are missing, and it makes it possible to solve the cocktail party problem (pdf).
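The hit-counting idea can be sketched in a few lines of Python. This is only a toy under stated assumptions, not the Rebel Speech implementation: the `SEQUENCES` table, the snippet labels, the flat `branch` field standing in for the hierarchy, and the `margin` rule for declaring a winner are all hypothetical.

```python
# Hypothetical learned sequences. "branch" is a stand-in for the node in
# the hierarchy that a sequence belongs to.
SEQUENCES = {
    "recognize": {"branch": "speech",
                  "snippets": ["r", "e", "k", "o", "g", "n", "ai", "z"]},
    "wreck a nice": {"branch": "beach",
                     "snippets": ["r", "e", "k", "a", "n", "ai", "s"]},
    "recognition": {"branch": "speech",
                    "snippets": ["r", "e", "k", "o", "g", "n", "i", "sh"]},
}


class HitCountMatcher:
    def __init__(self, sequences, margin=1):
        self.sequences = sequences
        self.margin = margin  # assumed lead needed to declare a winner
        self.hits = {name: 0 for name in sequences}
        self.position = {name: 0 for name in sequences}
        self.suppressed = set()
        self.winner = None

    def feed(self, snippet):
        """Let one detected snippet try to activate every live sequence."""
        for name, seq in self.sequences.items():
            if name in self.suppressed:
                continue
            snippets = seq["snippets"]
            start = self.position[name]
            # Search forward from the last match, so that a missing
            # snippet does not block the ones that follow it.
            if snippet in snippets[start:]:
                self.hits[name] += 1
                self.position[name] = snippets.index(snippet, start) + 1
        self._update_winner()
        return self.winner

    def _update_winner(self):
        if self.winner is not None:
            return
        ranked = sorted(self.hits, key=self.hits.get, reverse=True)
        if self.hits[ranked[0]] == 0:
            return
        lead = self.hits[ranked[0]] - (self.hits[ranked[1]] if len(ranked) > 1 else 0)
        if lead >= self.margin:
            self.winner = ranked[0]
            branch = self.sequences[self.winner]["branch"]
            # Suppress every sequence outside the winner's branch.
            self.suppressed = {n for n, s in self.sequences.items()
                               if s["branch"] != branch}


matcher = HitCountMatcher(SEQUENCES, margin=1)
stream = ["r", "e", "o", "g", "n", "ai", "z"]  # the "k" snippet is missing
decided_at = None
for i, snippet in enumerate(stream):
    if matcher.feed(snippet) is not None and decided_at is None:
        decided_at = i
print(matcher.winner, decided_at, matcher.suppressed)
```

In this toy run the input stream is missing one snippet, yet a winner is still declared before the final snippet arrives, and the sequence from the other branch is suppressed from that moment on. No probability is stored anywhere; the only numbers involved are hit counts accumulated while the sound is being heard.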