
Deep listening: The neural network learning to hear you in a crowd

27th Jun ’16, 11:27 AM in Artificial Intelligence


Sophie Curtis, Contributor


The human auditory system gives us the extraordinary ability to converse above the chatter of a lively cocktail party. Selective listening in such conditions is an extremely challenging task for computers, and has been the holy grail of speech processing for more than 50 years.

Until recently, no practical method existed for single-channel mixtures of speech, especially when the speakers are unknown. Now Mitsubishi Electric Research Labs (MERL) are addressing the problem of acoustic source separation with a deep learning framework they call “deep clustering”. At the Deep Learning Summit in Boston last month, John Hershey, Senior Principal Research Scientist at MERL, presented ‘Cracking the Cocktail Party Problem: Deep Clustering for Speech Separation’ and shared their breakthrough: a deep clustering network that assigns embedding vectors to different sonic elements of the noisy signal. With this technology, MERL are on the verge of solving the general audio separation problem, opening up a new era in spontaneous human-machine communication.

I asked John a few questions to learn more about speech recognition, human-machine communication, and his thoughts on the future of deep learning.

Please tell us a bit more about your work in deep learning.

The “cocktail party problem” has been an enigma for 50 years: how is it possible for humans to hear separate voices in a crowd even though the individual sound waves add together into a single waveform? We were eager to apply the discriminative power of deep learning to this problem, but it was not straightforward. We had tried direct applications of deep networks, using a bank of outputs for each source type, to identify the parts of the spectrum corresponding to that source at each moment. We found that this works well for speech versus noise, but it fails dramatically for speech versus speech. In this case the network needs to arbitrarily decide which signal to assign to the output for each source, and it can’t easily learn to do this in a consistent way. Our approach is to instead allow the network to produce embedding vectors trained to discriminate between different voices, but without forcing it to decide on the overall segmentation of the spectrum. Instead the embeddings implicitly represent the network’s uncertainty about the segmentation. The embeddings are then clustered in a second stage to produce a more holistic decision process. At first we didn’t know if this would work, but we now have some very encouraging results, and we feel we are close to solving the cocktail party problem in general, so this is a very exciting time for us.
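
The two-stage idea described above can be sketched in a few lines of Python with NumPy. This is a minimal illustration under stated assumptions, not MERL’s implementation. No network is trained here: `make_toy_embeddings` is a hypothetical stand-in that fabricates one noisy embedding per time-frequency bin. The function names, the array sizes, and the use of k-means for the second stage are assumptions for illustration. The affinity-based objective follows the form given in the published description of deep clustering; it does not appear in this interview. What the sketch shows is why the objective sidesteps the arbitrary output assignment: it depends only on whether two bins belong to the same source, so swapping the source labels leaves it unchanged.

```python
import numpy as np

def affinity_loss(V, Y):
    """||V V^T - Y Y^T||_F^2, expanded so no N x N matrix is ever formed."""
    return (np.linalg.norm(V.T @ V) ** 2
            - 2.0 * np.linalg.norm(V.T @ Y) ** 2
            + np.linalg.norm(Y.T @ Y) ** 2)

def kmeans(V, k, n_iter=20):
    """Plain k-means with deterministic farthest-point initialisation."""
    centers = [V[0]]
    for _ in range(1, k):
        # Squared distance from every point to its nearest chosen center.
        d = np.min([((V - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(V[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = ((V[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = V[labels == j].mean(axis=0)
    return labels

def make_toy_embeddings(T=10, F=8, D=4, n_sources=2, noise=0.05, seed=0):
    """Hypothetical stand-in for a trained network's output:
    noisy unit vectors, one direction per source (requires D >= n_sources)."""
    rng = np.random.default_rng(seed)
    dominant = rng.integers(0, n_sources, size=T * F)  # dominant source per bin
    Y = np.eye(n_sources)[dominant]                    # one-hot, shape (T*F, n_sources)
    V = np.eye(D)[dominant] + noise * rng.standard_normal((T * F, D))
    V /= np.linalg.norm(V, axis=1, keepdims=True)      # unit-length embeddings
    return V, Y, dominant

# Stage 1 (assumed): embeddings for a 10-frame, 8-bin toy "spectrogram".
V, Y, dominant = make_toy_embeddings()

# Stage 2: cluster the embeddings, then turn each cluster into a binary mask.
labels = kmeans(V, k=2)
masks = np.stack([(labels == j).reshape(10, 8) for j in range(2)])

print("loss:", affinity_loss(V, Y))
print("loss with the two sources swapped:", affinity_loss(V, Y[:, ::-1]))
print("mask array shape:", masks.shape)
```

In a real system each mask would be multiplied with the mixture spectrogram to recover one speaker. The cluster indices themselves carry no meaning, which is the point: nothing in the pipeline has to commit to which voice is “output 1”.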

How will this change spontaneous human-machine communication?

Speech recognition has always been confined to special situations where interfering sounds are kept out of the mix. For example, speech recognition works well over the telephone, or in cars with nobody else talking. Out in the real world, interfering sounds cannot be controlled. And it may actually be desirable to discern among multiple speakers at the same time – that’s what humans do. But we can potentially move beyond even that since deep clustering can apply to arbitrary sounds. Acoustic event detection is extremely challenging in a mixture of sounds, and deep clustering could allow recognition of all the sounds in the environment, from hearing what people are talking about, to recognizing the sounds of danger, or interpreting the sounds of music. For the hearing impaired this could be revolutionary. And for robots, this is perhaps the only way for hearing to be useful.

What do you feel is essential to future progress in speech processing?

If you had asked this before we stumbled into deep clustering, I would have said solving the cocktail party problem is an important hurdle to clear. But if we pretend that is already solved, I would think the most important remaining problem, and the real elephant in the room, is natural language understanding. Speech recognition already works extremely well. But in order to have meaningful conversations with a speech system, and have the system take the right actions, we arguably need some form of artificial intelligence.

What present or potential future applications of deep learning excite you most?

Following from the previous question, training a system to have real understanding of semantics is obviously a major challenge, and we might not know exactly when we’ve succeeded. But what we are seeing in work coming out today is already far beyond what existed a few years ago. It is widely thought that a key to true semantic understanding is “grounding” the semantics in everyday experience of the real world. That is one reason to be excited about the possibilities for cross-modal learning, and the potential for reinforcement learning of complex behaviors.

What do you feel are the leading factors enabling recent advancements and uptake of deep learning?

We all know the first part of the story: the deep learning revolution sprang from the combination of faster computers, large data, and the apparent scalability of deep neural networks. While we are still scratching our heads to explain how deep networks can achieve so many remarkable results, there are some exciting trends at work. It used to be that different domains, such as audio, video, and language understanding, all used very different frameworks. But now these are converging around the language of deep networks. So for the first time it is not rocket science to consider jointly training end-to-end systems that encompass vision, hearing, language processing, and more. Another nice development is that we are beginning to understand how to inject deep networks with some of the desirable features of probabilistic models to improve their flexibility. Finally, some recent architectures are combining memory, attention, sequential processing, and reinforcement learning. This moves the field firmly beyond simple pattern recognition, and we are very excited about the possibilities.

What developments can we expect to see in deep learning in the next 5 years?

Unsupervised learning has taken a back seat to more discriminative methods, but we can expect that to change. With the right learning framework, the benefit of immense data should outweigh the disadvantage of having to infer labels. Humans don’t need such massive supervision, so I am confident that reinforcement learning and other methods can help bootstrap deep learning without dense labels. Another issue is the modularity and flexibility needed for deep learning. I would like to see systems with trained parts that can be re-combined or re-absorbed so that a system can engage in life-long learning. That way, instead of keeping around data to retrain for each new task, the products of learning could be transferred from system to system.

Which industries do you feel will be most disrupted by deep learning, and machine learning in general, in the future?

All of them will be most disrupted by machine learning, eventually. Things are developing so quickly now, it is tempting to just watch and make bets on what will happen next. Kidding aside, certainly internet service-based industries have already been transformed by machine learning, behind the scenes. But I suppose the self-driving car would shake up the automotive industry in a most dramatic way. I’m looking forward to it!

Originally appeared on Rework