Wednesday, May 28, 2014

The Rebel Speech Recognition Project

Progress Update

I am making rapid progress on the Rebel Speech project and it will not be long before I release a demo. Please have patience. Rebel Speech will be a game changer in more ways than one. I still have to weigh many things about when and how to publish the results of my research. I cannot divulge the state of the engine at this time, but what I can say is that it will take many by surprise.

My plan, which is subject to change, is to release a program that will demonstrate most of the capabilities of the model. The demo will consist of an executable program and a single data file for the neural network, aka the brain. The latter will be pre-trained to recognize the numbers 1 to 20 (or more) in three or four different languages. I will not release the learning module or the source code, at least not for a while. The reason is that I need to monetize this technology to raise enough money to continue my AI research. What follows is a general description of Rebel Speech.

The Rebel Speech Recognition Engine

The Rebel Speech recognition engine is a biologically plausible spiking neural network designed for general audio learning and recognition. The engine uses two hierarchical subnetworks (one for patterns and one for sequences) to convert audio waveform data into discrete classifications that represent phonemes, syllables, words and even whole phrases and sentences. The following is a list of some of the characteristics that distinguish Rebel Speech’s architecture from other speech recognizers and neural networks:
  • It can learn to recognize speech in any language, just by listening through a microphone.
  • It can learn multiple languages concurrently.
  • It can learn to recognize any type of sound, e.g., music, machinery, animal sounds, etc.
  • Learning is fully unsupervised.
  • On trained data, it is as accurate as humans, or better.
  • It is noise and speaker tolerant.
  • It can recognize partial words and sentences.
  • It uses no math other than simple arithmetic.

Even though Rebel Speech has multiple layers of neurons in two hierarchical networks, this is where the similarity with deep learning ends. Unlike in deep neural networks, the layers in Rebel Speech are not pre-wired and synaptic connections have no weights. A synapse is either connected or it is not. In fact, when Rebel Speech begins training, both networks are empty. Neurons and synapses are created and added on the fly during learning, and only when needed.
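
To make the idea of an initially empty network with unweighted synapses more concrete, here is a minimal Python sketch. It is not the Rebel Speech code, which is unreleased. The class name GrowingNetwork, the methods learn and fires, the sensor labels, and the rule that a neuron fires only when all of its connected inputs are active are all assumptions made for illustration.

```python
class GrowingNetwork:
    """Toy network that starts empty and has unweighted, all-or-nothing synapses.

    Hypothetical illustration only; not the Rebel Speech design.
    """

    def __init__(self):
        # neuron id -> frozenset of input ids.
        # A synapse either exists (the id is in the set) or it does not.
        self.neurons = {}

    def learn(self, active_inputs):
        """Return the neuron for this input combination, creating it only if needed."""
        wanted = frozenset(active_inputs)
        for neuron_id, synapses in self.neurons.items():
            if synapses == wanted:
                return neuron_id  # already known, nothing is added
        neuron_id = len(self.neurons)
        self.neurons[neuron_id] = wanted
        return neuron_id

    def fires(self, neuron_id, active_inputs):
        """Assumed rule: a neuron fires when every connected input is active."""
        return self.neurons[neuron_id] <= frozenset(active_inputs)


net = GrowingNetwork()                # empty before training
first = net.learn({"s1", "s4"})       # new combination: a neuron is created
again = net.learn({"s4", "s1"})       # same combination: the neuron is reused
second = net.learn({"s2"})            # another new combination
```

Note that there are no weights to adjust anywhere in this sketch. Learning consists only of adding neurons and connections.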

Program Design

The engine consists of three software modules, as depicted below: the sensory layer, pattern memory and sequence memory.

The sensory layer is a collection of audio sensors. It uses a Fast Fourier Transform algorithm and threshold detectors (sensors) to convert audio waveform data into multiple streams of discrete signals (pulses) representing changes in amplitude. These raw signals are fed directly to pattern memory where they are combined into concurrent groups called patterns. Pattern detectors send their signals to sequence memory where they are organized into temporal hierarchies called branches. Each branch is a classification structure that represents a specific sound or sequence of sounds.
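
The sensory stage described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual engine. The frame size, the two threshold levels, the "up" and "down" pulse labels, and the function names fft, band_amplitudes, sense and tone are all hypothetical. The sketch uses a textbook radix-2 FFT and emits a pulse whenever the amplitude in a frequency bin crosses a threshold between two consecutive frames.

```python
import cmath
import math


def fft(samples):
    """Recursive radix-2 FFT; len(samples) must be a power of two."""
    n = len(samples)
    if n == 1:
        return [complex(samples[0])]
    even = fft(samples[0::2])
    odd = fft(samples[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def band_amplitudes(frame):
    """Amplitude per positive-frequency bin, scaled so a sine of amplitude A gives about A."""
    n = len(frame)
    spectrum = fft(frame)
    return [2 * abs(spectrum[k]) / n for k in range(n // 2)]


def sense(frames, thresholds=(0.25, 0.5)):
    """Emit (frame, bin, level, direction) pulses on threshold crossings.

    The threshold values are assumptions chosen for this example.
    """
    pulses = []
    previous = None
    for t, frame in enumerate(frames):
        current = band_amplitudes(frame)
        if previous is not None:
            for b, (before, after) in enumerate(zip(previous, current)):
                for level in thresholds:
                    if before < level <= after:
                        pulses.append((t, b, level, "up"))
                    elif after < level <= before:
                        pulses.append((t, b, level, "down"))
        previous = current
    return pulses


def tone(bin_index, amplitude, n=16):
    """Synthetic test frame: a sine that falls exactly on one FFT bin."""
    return [amplitude * math.sin(2 * math.pi * bin_index * i / n) for i in range(n)]


# Silence, then a loud tone in bin 3, then the same tone much quieter.
frames = [tone(3, 0.0), tone(3, 0.6), tone(3, 0.1)]
pulses = sense(frames)
```

In this example the tone rises through both thresholds in frame 1 and falls back through both in frame 2, so four pulses are produced, all in bin 3. Streams of such pulses are the kind of discrete signal that pattern memory would then group into patterns.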


Most speech recognition systems use a Bayesian probabilistic model, such as the hidden Markov model, to determine which phoneme or word is most likely to come next in a given speech segment. A special algorithm is used to compile a large database of such probabilities. During recognition, hypotheses generated for a given sound segment are tested against these precompiled expectations and the one with the highest probability is selected.

In Rebel Speech, by contrast, the probability that the interpretation of a sound is correct is not known in advance. During learning, the engine creates a hierarchical database of as many non-random sequences of patterns as possible. Sequences compete for activation. When certain sound segments are detected, they attempt to activate various pre-learned sequences in memory and the one with the highest hit count is the winner. A winner usually pops up before the speaker has finished speaking. Once a winner is found, all other competing sequences are suppressed. This approach leads to high recognition accuracy even in noisy environments or when parts of the speech are missing.
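
The competition between sequences can be illustrated with a small Python sketch. It is a guess at the general idea, not the engine's actual algorithm. The phoneme-like labels, the min_hits parameter, the in-order matching rule that tolerates gaps, and the function name recognize are all assumptions. Each learned sequence scores a hit when an incoming pattern matches one of its remaining elements, and the first sequence to hold a strict lead with enough hits wins while the others are suppressed.

```python
def recognize(stream, sequences, min_hits=3):
    """Return (winner, step, suppressed) using a simple hit-count competition.

    stream    -- incoming pattern labels, in order of arrival
    sequences -- name -> list of pattern labels learned earlier
    min_hits  -- assumed minimum evidence before a winner may be declared
    """
    hits = {name: 0 for name in sequences}
    cursor = {name: 0 for name in sequences}
    for step, pattern in enumerate(stream):
        for name, seq in sequences.items():
            # Search at or after the cursor, so missing patterns are tolerated.
            try:
                found = seq.index(pattern, cursor[name])
            except ValueError:
                continue
            hits[name] += 1
            cursor[name] = found + 1
        best_name = max(hits, key=hits.get)
        best = hits[best_name]
        rivals = [count for name, count in hits.items() if name != best_name]
        if best >= min_hits and all(best > count for count in rivals):
            # A winner exists: every other sequence is suppressed.
            suppressed = [name for name in hits if name != best_name]
            return best_name, step, suppressed
    return None, len(stream), []


# Hypothetical learned sequences, written with made-up phoneme labels.
sequences = {
    "one": ["w", "ah", "n"],
    "two": ["t", "uw"],
    "ten": ["t", "eh", "n"],
    "twenty": ["t", "w", "eh", "n", "t", "iy"],
}

# "twenty" spoken with the "w" sound missing.
stream = ["t", "eh", "n", "t", "iy"]
winner, step, suppressed = recognize(stream, sequences)
```

Here "ten" and "twenty" are tied after three patterns, so no winner is declared yet. The fourth pattern breaks the tie in favor of "twenty" before the last pattern has arrived, even though one sound was missing from the input.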

Stay tuned.