Thursday, August 11, 2011

Rebel Speech Recognition

Rebel Speech

I started yet another artificial intelligence project, Rebel Speech. Actually, it is part of the Rebel Cortex project, which now consists of Rebel Vision and Rebel Speech. Both subprojects will use the same sensory cortex for learning and recognition. Programming-wise, speech recognition is less complex than visual recognition because the sensory mechanism is easier to implement. It's mostly a matter of using a Fast Fourier Transform (FFT) to convert the time-domain audio signal from a microphone into a frequency-domain signal. In addition, only a fraction of the detected frequency spectrum is required for good performance. I envision that someone will one day design a microphone that works like the human ear, i.e., one that uses many microscopic hair-like sensors, each responding directly to a particular frequency. In the meantime, a regular microphone and an FFT will do.
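To make the front end concrete, here is a minimal sketch of that idea in Python rather than the actual C# code. The sample rate, frame size, and band limits are my own assumptions, chosen only for illustration: each audio frame is transformed with an FFT and everything outside a narrow speech band is simply discarded.

```python
import numpy as np

SAMPLE_RATE = 16000  # assumed microphone sample rate (Hz)
FRAME_SIZE = 512     # samples per analysis frame (32 ms at 16 kHz)

def frame_to_spectrum(frame, low_hz=100.0, high_hz=4000.0):
    """Convert one time-domain audio frame into the magnitudes of a
    limited frequency band, discarding the rest of the spectrum."""
    windowed = frame * np.hanning(len(frame))      # reduce spectral leakage
    spectrum = np.abs(np.fft.rfft(windowed))       # frequency-domain magnitudes
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / SAMPLE_RATE)
    band = (freqs >= low_hz) & (freqs <= high_hz)  # keep only the speech band
    return freqs[band], spectrum[band]

# A pure 440 Hz tone should produce a magnitude peak near 440 Hz.
t = np.arange(FRAME_SIZE) / SAMPLE_RATE
tone = np.sin(2 * np.pi * 440.0 * t)
freqs, mags = frame_to_spectrum(tone)
peak_hz = freqs[np.argmax(mags)]
```

With a 512-sample frame at 16 kHz the frequency bins are about 31 Hz apart, which illustrates the point above: a coarse slice of the spectrum is enough to feed the sensory layer.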

Population Modulation

I've been writing some Windows C# code for Rebel Speech in my spare time over the last few days. I have already implemented the microphone capture and FFT code. Well, it's not all that hard, considering that there is a lot of good, free FFT code on the net and Microsoft provides a handy Microphone class in its XNA framework. I am now working on designing the audio sensors and the sensory layer. It's a little complicated, not just because I need to design both signal onset and offset sensors, but also because dealing with stimulus amplitude is counterintuitive. In the brain, all signals are carried by pulses of roughly equal amplitude. One would think that changes in the intensity of a stimulus would be converted into frequency modulation, but that is not how it works either. The brain uses a technique that I call population modulation to encode amplitude. In other words, many sensors handle a single phenomenon, and the number of sensors that fire in response to a stimulus is a function of the intensity of that stimulus.
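The onset and offset sensors mentioned above could be sketched as follows. This is illustrative Python, not the Rebel Speech C# code, and the single fixed threshold is an assumption of mine: a sensor fires an onset pulse when the amplitude in its frequency bin rises above the threshold, and an offset pulse when it falls back below.

```python
class OnsetOffsetSensor:
    """Hypothetical sensor for one frequency bin: reports when its
    signal rises above (onset) or falls below (offset) a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.active = False  # is the signal currently above threshold?

    def feed(self, amplitude):
        """Return 'onset' or 'offset' on a threshold crossing, else None."""
        if amplitude >= self.threshold and not self.active:
            self.active = True
            return "onset"
        if amplitude < self.threshold and self.active:
            self.active = False
            return "offset"
        return None

sensor = OnsetOffsetSensor(threshold=0.5)
events = [sensor.feed(a) for a in [0.1, 0.6, 0.7, 0.2]]
# the rising edge at 0.6 and the falling edge at 0.2 are the only events
```

A real sensory layer would hold one such sensor (or, per the population idea below, many) per frequency bin.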

In the brain, this sort of sparse activation is accomplished with the use of inhibitory connections between the cells in a group. Luckily, in a computer brain simulation, all we need is a list of cells. Stay tuned.
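The population-modulation idea above can be sketched in a few lines, again as illustrative Python under my own assumptions (the population size and linear mapping are hypothetical): every pulse has the same strength, and intensity is carried only by how many cells in the list fire.

```python
def population_response(amplitude, population_size=100, max_amplitude=1.0):
    """Encode stimulus intensity by population modulation: return the
    indices of the sensors that fire. Each pulse is identical; only the
    COUNT of firing sensors reflects the amplitude."""
    amplitude = min(max(amplitude, 0.0), max_amplitude)  # clamp to valid range
    n_firing = round(population_size * amplitude / max_amplitude)
    return list(range(n_firing))

# Half-strength stimulus -> half the population fires.
half = population_response(0.5)
# Zero stimulus -> silence; saturating stimulus -> the whole population.
silent = population_response(0.0)
full = population_response(5.0)
```

As the paragraph above notes, no inhibitory wiring is needed in a simulation; the list itself plays the role of the cell group.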

See Also:

Rebel Cortex
Invariant Visual Recognition, Patterns and Sequences
Rebel Cortex: Temporal Learning in the Tree of Knowledge


juha.ranta said...


To be really cool, though, I think your speech cortex should be capable of handling the cocktail party problem. That is, if there are many streams of speech intermixed, it should learn to follow one stream based on expectations of that stream.


Louis Savain said...

Hi Juha,

Thanks for the comment. I think that the branch concept in Rebel Cortex can easily handle the cocktail party problem. In RC, only one branch of the tree of knowledge is active at any one time. Once a particular speech stream is recognized, its branch stays active for about twelve seconds.

This is the way it works in the human brain, in my opinion. At the end of that short attention span, the branch goes back to sleep. It is either immediately reactivated or some other branch wakes up to take its place. The question is, given that there are many branches vying for attention, what determines which branch is activated? I'll leave that question for a future blog article.