Tuesday, November 19, 2013

Did OSU Researchers Solve the Cocktail Party Problem?

Potential Speech Recognition Breakthrough?

There is incredible news coming out of Ohio State University. Speech researchers there claim that a deep neural network can pick out a particular voice from a single sound track mixed with random noise and/or other speech sounds. I have my doubts about the claim but, if it is true, it would mean that they have solved a major part of the perceptual learning puzzle: the ability to focus on one thing while ignoring others. In my opinion, that would be the biggest single breakthrough in the history of artificial intelligence research. That being said, I will reserve judgment until I know more about the details of the research.
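
From what I can gather, this line of research treats separation as a classification problem: a network looks at the spectrogram of the noisy mixture and labels each time-frequency unit as speech-dominant or noise-dominant, producing a binary mask that keeps the speech units and silences the rest. Here is a minimal sketch of that general idea in Python; the toy "spectrograms", the features, and the network size are placeholders of my own, not the OSU system.

import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Toy "spectrograms": rows are time frames, columns are frequency bins.
# The "speech" has a fixed spectral shape; the "noise" is unstructured.
n_frames, n_bins = 2000, 64
speech = np.abs(rng.normal(0.0, 1.0, (n_frames, n_bins))) * \
         np.sin(np.linspace(0, 20, n_bins)) ** 2
noise = np.abs(rng.normal(0.0, 0.7, (n_frames, n_bins)))
mixture = speech + noise

# Ideal binary mask: 1 wherever speech dominates a time-frequency unit.
ibm = (speech > noise).astype(int)

# Train a small network to predict the mask from the mixture alone.
clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=300, random_state=0)
clf.fit(mixture[:1500], ibm[:1500])

# Apply the estimated mask to held-out frames to suppress noisy units.
est_mask = clf.predict(mixture[1500:])
separated = mixture[1500:] * est_mask
print("mask accuracy on held-out frames:", (est_mask == ibm[1500:]).mean())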

7 comments:

Bill said...

Pretty sure this has already been solved. Andrew Ng from Stanford/Google talks about the cocktail party problem in the unsupervised learning lecture of his intro ML class. Check out this Coursera lecture around 5:23:

http://www.youtube.com/watch?v=5km0Tx9OcIo&t=5m23s
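
For the record, the multi-microphone trick Ng demos is independent component analysis (ICA). Here's a rough sketch using scikit-learn's FastICA; the two "speakers" and the mixing matrix below are made up for illustration:

import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(1)
t = np.linspace(0, 8, 4000)

# Two hypothetical source signals standing in for two speakers.
s1 = np.sin(2 * np.pi * 3 * t)            # "speaker 1"
s2 = np.sign(np.sin(2 * np.pi * 5 * t))   # "speaker 2"
sources = np.c_[s1, s2]

# Two microphones, each hearing a different mix of the two speakers.
mixing = np.array([[1.0, 0.5],
                   [0.4, 1.0]])
mics = sources @ mixing.T
mics += 0.02 * rng.normal(size=mics.shape)  # a little sensor noise

# FastICA recovers the sources (up to scale and order) from the mixtures.
ica = FastICA(n_components=2, random_state=0)
recovered = ica.fit_transform(mics)
print(recovered.shape)  # (4000, 2): one column per estimated speaker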

Louis Savain said...

Bill,

Thanks for the link. Using multiple microphones is indeed a known method for solving this problem, but it's a purely mathematical approach that exploits the differences between the signals picked up by the two microphones. The OSU researchers claim to be using a neural network to process a single monaural sound track, and this is what makes it amazing. Humans are pretty good at focusing on a single voice even when the sound is coming from a single source such as a TV or a radio speaker.

What's even more amazing about the OSU announcement is that their technology seems to be at least an order of magnitude better than the human brain. I'm pretty sure there is something they are not telling us.

Bill said...

Here is the OSU paper. I'm reading it now:

http://www.cse.ohio-state.edu/~wangyuxu/papers/Wang-Wang.NIPS12.pdf

Bill said...

BTW, humans use two "microphones" called ears! ;-)

Louis Savain said...

Yeah, there is no question that having two ears helps a lot in real-world situations, but it does not help if you're listening to a mono recording of a conversation with overlapping voices and background noise. The human brain is still pretty good at focusing on one voice (or any particular sound source) at a time in such situations. Current state-of-the-art recognizers get totally confused.

Bill said...

Good point. What did you think of the paper?

Louis Savain said...

Bill, after reading just the introduction and the conclusion of their paper (I refuse to wade through the math), I now think they're faking it to a great extent. I don't believe that the demo they released contained actual speech in the background, only unintelligible speech-like noise. It should be possible for a predictive algorithm to discriminate between such gobbledygook and actual speech, because real speech is predictable.
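
To illustrate what I mean by a predictive test, here is a toy sketch: fit a simple linear predictor to a signal and compare the prediction errors. A structured, speech-like signal is far more predictable than random babble. The signals below are synthetic stand-ins, not real speech, so take it only as an illustration of the idea.

import numpy as np

rng = np.random.default_rng(2)

def prediction_error(x, order=8):
    # Fit a linear predictor x[n] ~ sum(a_k * x[n-k]) by least squares,
    # then return the normalized residual energy. Lower = more predictable.
    X = np.column_stack([x[order - k - 1 : len(x) - k - 1] for k in range(order)])
    y = x[order:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coeffs
    return np.mean(resid ** 2) / np.mean(y ** 2)

t = np.linspace(0, 1, 8000)
structured = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 440 * t)
babble = rng.normal(size=8000)  # unpredictable "gobbledygook" stand-in

print("structured:", prediction_error(structured))  # near 0: highly predictable
print("babble:    ", prediction_error(babble))      # near 1: unpredictable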

If I'm right, they will definitely get some flak for this. Although I don't think they have a true cocktail party breakthrough, I think their technology will still be very useful for noise reduction in hearing aids and smartphones.