Episode 50, November 14, 2018
Ivan Tashev: You know, humans, they don’t care about mean square error solution or maximum likelihood solution, they just want the sound to sound better. For them. And it’s about human perception. That’s one of the very tricky parts in audio signal processing.
Host: You’re listening to the Microsoft Research Podcast, a show that brings you closer to the cutting-edge of technology research and the scientists behind it. I’m your host, Gretchen Huizinga.
Host: After decades of research in processing audio signals, we’ve reached the point of so-called performance saturation. But recent advances in machine learning and signal processing algorithms have paved the way for a revolution in speech recognition technology and audio signal processing. Dr. Ivan Tashev, a Partner Software Architect in the Audio and Acoustics Group at Microsoft Research, is no small part of the revolution, having both published papers and shipped products at the forefront of the science of sound.
On today’s podcast, Dr. Tashev gives us an overview of the quest for better sound processing and speech enhancement, tells us about the latest innovations in 3D audio, and explains why the research behind audio processing technology is, thanks to variations in human perception, equal parts science, art and craft. That and much more on this episode of the Microsoft Research Podcast.
Host: Ivan Tashev, welcome to the podcast.
Ivan Tashev: Thank you.
Host: Great to have you here. You’re a Partner Software Architect in the Audio and Acoustics Group at Microsoft Research, so, in broad strokes, tell us about your work. What gets you up in the morning, what questions are you asking, what big problems are you trying to solve?
Ivan Tashev: So, in general, in the Audio and Acoustics Research Group, we do audio signal processing. That includes enhancing the sound captured by our microphones, and better sound reproduction using binaural audio, so-called spatial audio. We do a lot of work in audio analytics, recognition of audio objects, recognition of the audio background. We design a lot of interesting audio devices. Our research ranges from applied research related to Microsoft products to blue-sky research far from what Microsoft’s business is today.
Host: So, what’s the ultimate goal? Perfect sound?
Ivan Tashev: Hhhh… Perfect sound is a very tricky thing, because it is about human perception. And this is very difficult to be modeled using mathematical equations. So, the classic statistical signal processing was established in 1947 with a paper published by Norbert Wiener defining what we call, today, the Wiener Filtering. The approach is simple: you have a process, you make a statistical model, you define optimality criterion, make the first derivative, make it zero, voila! You have the analytical solution of the problem. The problem is that, you either have an approximate model, and find the solution analytically, or you have precise model which you cannot solve analytically. The other thing is the optimality criterion. You know, humans, they don’t care about mean square error solution or maximum likelihood solution, they just want the sound to sound better. For them. And it’s about human perception. That’s one of the very tricky parts in audio signal processing.
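[Editor's note: The recipe described above (statistical model, optimality criterion, derivative set to zero) can be illustrated with a minimal sketch. It assumes the simplest possible model, a noisy observation y = x + n with uncorrelated speech x and noise n in a single frequency bin, and a mean square error criterion. Setting the derivative of (1 - g)^2 * Sx + g^2 * Sn with respect to g to zero gives the gain g = Sx / (Sx + Sn). The function names and the power values are illustrative assumptions, not taken from the interview.]

```python
def wiener_gain(signal_power: float, noise_power: float) -> float:
    """Gain g that minimizes E[(x - g*y)^2] for y = x + n, with x and n uncorrelated."""
    return signal_power / (signal_power + noise_power)


def mean_square_error(gain: float, signal_power: float, noise_power: float) -> float:
    """Expected squared error of the estimate g*y: (1 - g)^2 * Sx + g^2 * Sn."""
    return (1.0 - gain) ** 2 * signal_power + gain ** 2 * noise_power


# Hypothetical powers for one frequency bin: speech four times stronger than noise.
speech_power, noise_power = 4.0, 1.0
best_gain = wiener_gain(speech_power, noise_power)
best_error = mean_square_error(best_gain, speech_power, noise_power)

# Nudging the gain in either direction can only increase the expected error.
error_below = mean_square_error(best_gain - 0.05, speech_power, noise_power)
error_above = mean_square_error(best_gain + 0.05, speech_power, noise_power)
```

[The sketch optimizes mean square error only. As the answer above points out, that criterion is not the same thing as what a listener perceives as better sound.]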
Host: So, where are we heading in audio signal processing, in the era of machine learning and neural networks?
Ivan Tashev: Machine learning and neural networks are capable of finding the solution from the data without us making an approximate model. And this is the beauty of this whole application of machine learning in signal processing, and the reason why we achieve significantly better results than with statistical signal processing. Even more, we train the neural network using a certain cost function, and we can make the cost function itself another neural network, trained on human perception of audio quality, which allows us to achieve higher perceived quality in the speech enhancement we do using neural networks. I’m not saying that we should use machine learning and neural networks in every single audio processing block. We have processing blocks which have a nice and clean analytical solution, and these run fast and efficiently, and they will remain the same. But in many cases, we operate with approximate models with not very natural optimality criteria. And this is where machine learning shines. This is where we can achieve much better results and provide a higher quality of our output signal.
Host: One interesting area of research that you are doing is noise robust speech recognition. And this is where researchers are working to improve automatic speech recognition systems. So, what’s the science behind this and how are algorithms helping to clean up the signal?
Ivan Tashev: We are witnessing a revolution in speech recognition. The classic speech recognizer was based on so-called Hidden Markov Models or HMMs. And they served us quite well, but the revolution came when neural networks were implemented and trained to do speech recognition. My colleagues in the speech research group were the first to design a neural network-based speech recognition algorithm which instantly showed better results than the existing production HMM-based speech recognizer. The speech recognition engine has one channel input, while in audio processing, we can deal with multiple channels, so-called microphone arrays, and they give us a sense of spatiality. We can detect the direction the sounds came from. We can enhance that sound. We can suppress sounds coming from other directions. And then provide this cleaner sound to the speech recognition engine. Microphone array processing technologies, combined with techniques like sound source localization and tracking and sound source separation, allow us to even separate two simultaneously speaking humans in the conference room and feed two separate instances of the speech recognizer for meeting transcription.
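[Editor's note: One textbook way a microphone array can favor sound from a chosen direction is delay-and-sum beamforming. The sketch below is an illustration of that general idea only, not the processing used in any Microsoft product. It assumes whole-sample delays and a made-up three-microphone recording of a single click; `delay_and_sum` and the test signal are hypothetical.]

```python
def delay_and_sum(channels, steering_delays):
    """Average the channels after advancing each one by its steering delay (in samples)."""
    length = min(len(ch) - d for ch, d in zip(channels, steering_delays))
    return [
        sum(ch[n + d] for ch, d in zip(channels, steering_delays)) / len(channels)
        for n in range(length)
    ]


# Made-up recording: one click reaches microphone 0 at sample 5,
# microphone 1 two samples later, and microphone 2 four samples later.
arrival_delays = [0, 2, 4]
channels = []
for delay in arrival_delays:
    channel = [0.0] * 16
    channel[5 + delay] = 1.0
    channels.append(channel)

# Steering toward the source aligns the three copies, so they add up coherently.
steered = delay_and_sum(channels, arrival_delays)

# Without steering, the copies stay misaligned and the click is attenuated.
unsteered = delay_and_sum(channels, [0, 0, 0])
```

[With the matching delays the click keeps its full amplitude; with no steering it drops to one third. Sound arriving from other directions is attenuated in the same way when the array is steered toward the talker.]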
Host: Are you serious?
Ivan Tashev: Yes, we can do that. Even more, the audio processing engine has more prior information. For example, the signal we send to the loudspeakers. And the goal of this engine is to remove the sound which is interfering with our sound. And this is also one of the oldest signal processing algorithms and every single speakerphone has it. But, in all instances, it has been implemented as mono acoustic echo cancellation. At Microsoft, we were the first to design a stereo and surround sound echo canceller, despite a paper written by the inventor of acoustic echo cancellation himself, stating that stereo acoustic echo cancellation is not possible. And it’s relatively simple to understand: you have two channels, from the left and the right speaker, coming to one microphone, so you have one equation and two unknowns. And Microsoft released, as part of Kinect for Xbox, a surround sound echo cancellation engine. Not that we solved five unknowns from one equation, but we just found a workaround which was good enough for any practical purposes and allowed us to clean the surround sound coming from the Xbox to provide a cleaner sound to the speech recognition engine.
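[Editor's note: The mono case mentioned above is commonly implemented with an adaptive filter that learns the path from loudspeaker to microphone and subtracts the predicted echo. The sketch below uses a normalized least mean squares (NLMS) update, a standard textbook choice, on a synthetic signal. It does not represent the stereo or surround workaround described in the interview, which the transcript does not detail. The echo path, the filter length, and the step size are assumed values.]

```python
import random


def nlms_echo_canceller(far_end, microphone, taps=8, step=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo of `far_end` from `microphone`."""
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent loudspeaker samples, newest first
    residual = []
    for x, d in zip(far_end, microphone):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - echo_estimate  # what is left after removing the estimated echo
        norm = sum(h * h for h in history) + eps
        weights = [w + step * error * h / norm for w, h in zip(weights, history)]
        residual.append(error)
    return residual, weights


# Synthetic test: white noise played through a loudspeaker and picked up
# by the microphone through a short, made-up echo path (no near-end talker).
rng = random.Random(0)
far_end = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
echo_path = [0.6, 0.3, -0.2]
microphone = [
    sum(echo_path[k] * far_end[n - k] for k in range(len(echo_path)) if n - k >= 0)
    for n in range(len(far_end))
]

residual, learned_weights = nlms_echo_canceller(far_end, microphone)
```

[With one loudspeaker signal the filter has a unique path to learn, and the learned weights approach the echo path. With two correlated loudspeaker signals reaching one microphone, many pairs of paths explain the same microphone signal, which is the "one equation and two unknowns" difficulty described above.]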
Host: So, did you write a paper and say, “Yes, it is possible, thank you very much!”?
Ivan Tashev: I did write a paper.
Host: Oh, you did!
Ivan Tashev: And it was rejected with the most critical feedback from the reviewers I have ever seen in my career. It is the same as going to the French Academy of Sciences and proposing a perpetual motion machine. They decided, back in the 18th century, not to discuss papers about that. When I received the rejection notice, I went downstairs to my lab, started the demo, listened to the output. Okay, it works! So, we should be fine!
Host: One thing that’s fascinated me about your work is the infamous anechoic chamber – or chambers, as I came to find out – at Microsoft, and one’s right here in Building 99, but there are others. And so, phrases like “the quietest place on earth” and “where sound goes to die” are kind of sensational, but these are really interesting structures and have really specific purposes which I was interested to find out about. So, tell us about these anechoic, or echo-free, chambers. How many are there here, how are they different from one another and what are they used for?
Ivan Tashev: So, the anechoic chamber is just a room insulated from the sounds outside. In our case, it’s a concrete cube which does not touch the building and sits on around half a meter of rubber to prevent vibrations from the street from coming into the room. And internally, the walls, the ceiling and the floor are covered with sound absorption panels. This is pretty much it. What happens is that the sound from the source reaches the microphone, or the human ear, only by the direct path. There is no reflection from the walls and there is no other noise in the chamber. Pretty much, the anechoic chamber simulates the absence of a room. And it’s just an instrument for making acoustical measurements. What we do in the chamber is we measure the directivity patterns of microphones or radiation patterns of loudspeakers as they are installed in the devices we design. Initially, the anechoic chamber here, in Microsoft Building 99, the headquarters of Microsoft Research, was the only one in Microsoft. But with our engagement with product teams, it became overcrowded, and our business partners decided to build their own anechoic chambers. And there are, today, five in Microsoft Corporation. They all can perform the standard set of measurements, but all of them are a little bit different from each other. For example, the “Quietest Place on Earth,” as recorded in the Guinness Book of Records, is the anechoic chamber in Building 88. And the largest anechoic chamber is in Studio B, which allows making measurements at lower frequencies than in the rest of the chambers. Our chamber, in Building 99, is the only one in Microsoft which allows human beings to stay in the chamber for a prolonged amount of time, because we have air-conditioning connected to the chamber. It’s a different story how much effort it cost us to keep the rumbling noise from the air conditioner from entering the anechoic chamber. But this allowed us to do a lot of research on human spatial hearing in that chamber.
Host: So, drill in on that a little bit because, coming from a video production background, the air conditioner in a building is always the annoying part for the sound people. But you’ve got that figured out in the way that you situated the air conditioning unit and so on?
Ivan Tashev: To remove this rumbling sound from the air conditioner, we installed a gigantic filter which is under the floor of the entire equipment room. So, think about six by four meters floor and this is how we were able to reduce the sound from the air conditioning. Still, if you do a very precise acoustical measurement, we have the ability to switch it off.
Host: Okay. So, back to what you had said about having humans in this room for prolonged periods of time. I’ve heard that your brain starts to play tricks on you when you are in that quiet of a place for a prolonged period of time. What’s the deal there?
Ivan Tashev: OK. This is the human perception of the anechoic chamber. Humans, in general, are, I would say two and a half dimensional creatures. When we walk on the ground, we don’t have very good spatial hearing, vertically. We do much better horizontally. But also, we count on the first reflection from the ground to use it as a distance cue. When you enter the anechoic chamber, you subconsciously swallow, and this is a reaction because your brain thinks that there is a difference in the pressure between your inner ear and the atmosphere which presses the ear drums and you cannot hear anything.
Host: So that swallowing reaction is what you do when you’re in an airplane and the pressure actually changes. And you get the same perception in this room, but the pressure didn’t change.
Ivan Tashev: Exactly. But the problem in the room is that you cannot hear anything just because there is no sound in the chamber. And the other thing that happens is you cannot hear that reflection from the floor, which is basically very hard-wired in our brains. We can distinguish two separate sounds when the delay between them is a couple of milliseconds. And when the sound source is far away, this difference between the direct path and the reflection from the ground is less than that. We hear this as one sound. We start to perceive those two as separate sounds when the sound source is closer than a couple of meters away… meaning two jumps. Then, subconsciously, alarm bells start to ring in our brain that, hey, there is a sound source less than two jumps away, watch out not to become the dinner! Or maybe this is the dinner!
Host: So, the progress, though, of what your brain does and what your hearing does inside the chamber for one minute, for ten minutes, what happens?
Ivan Tashev: So, there is no sound. And, the brain tries to acquire as much information as possible. And the situation when you don’t get information is called information deprivation. First, after a minute or so, you start to hear a shhhhhh, which is actually the blood in the vessels of your ear. Then, after a couple of minutes, you start to hear your body sounds, your heartbeat, your breathing. And, with no input from other senses, eyes closed, no sound coming, literally you reach, after ten, fifteen minutes, the stage of audio hallucinations. Our brains are pattern-matching machines, so sooner or later, the brain will start to recognize sounds you have heard somewhere in different places. We, the people from my team, have not reached that stage, simply because when you work there, the door is open, the tools are clanking, we have conversations, etcetera, etcetera. But maybe someday I will have to lie there and close my eyes and see, can I reach the hallucination stage?
Host: Well, let’s talk about the research behind Microsoft Kinect. And that’s been a huge driver of innovations in this field. Tell us how the legacy of research and hardware for Kinect led to progress in other areas of Microsoft.
Ivan Tashev: Kinect introduced us to new modalities in human-machine interfaces: voice and gesture. And it was a wildly successful product. Kinect entered the Guinness Book of Records as the fastest-selling electronic device in the history of mankind. Microsoft sold eight million devices in the first three months of production. Since then, most of the technologies in Kinect have been further developed. But even during the first year of Kinect, Microsoft released Kinect for Windows, which allowed researchers from all over the globe to do things we didn’t even think of. This is the so-called Kinect Effect. We had more than fifty start-ups building their products using technologies from Microsoft Kinect. Today, most of those technologies are further developed, enhanced, and are part of our products. I’ll give just two examples. The first is HoloLens. The device does not have a mouse or keyboard and the human-machine interface is built on three input modalities: gaze, gesture and voice. In HoloLens, we have a depth camera, quite similar to the one in Kinect, and we do gesture recognition using super-refined and improved algorithms, but they originate from the ones we had in Kinect. The second example is also HoloLens. HoloLens has four microphones, the same number as Kinect, and I would say that the audio enhancement pipeline for getting the voice of the person wearing the device is the granddaughter of the audio pipeline released in Kinect in 2010.
Host: Now let’s talk about one of the coolest projects you are working on. It’s the spatial audio or 3D audio. What’s your team doing to make the 3D audio experience a reality?
Ivan Tashev: In general, spatial audio or 3D audio is a technology that allows us to project audio sources in any desired position to be perceived by the human being wearing headphones. This technology is not something new. Actually, we have instances of it in the late 19th century, when two microphones and two rented telephone lines were used for stereo broadcasting of a theatrical play. Later, in the 20th century, there were vinyl records marked to be listened to with headphones because they were stereo recorded using a dummy head with two microphones in the ears. This technology did not fly because of two major deficiencies. The first is, you move your head left and right and the entire audio scene rotates with you. The second is that your brain may not exactly like the spatial cues coming from the microphones in the ears of the dummy head. And this is where we reach the topic of head-related transfer functions. Literally, if you have a sound source somewhere in the space, the sound from it reaches your left and right ear in a slightly different way. It can be modeled as two filters. And if you filter it through those two filters and play it through headphones, your brain will perceive the sound coming from that direction. If we know those pairs of filters for all directions around you, this is called head-related transfer functions. The problem is that they are highly individual. Head-related transfer functions are formed by the size and the dimensions of the head, the position of the ears, the fine structure of the pinna, the reflections from the shoulders. And we did a lot of research to find a way to quickly generate personalized head-related transfer functions. We put, in our anechoic chamber, more than four hundred subjects. We measured their HRTFs. We did a submillimeter precision scan of their head and torso, and we did measurement of certain anthropometric dimensions of those subjects. 
Today, we can just measure several dimensions of your head and generate your personalized head-related transfer function. We can do this even from a depth picture. Literally, you can tell how you hear from the way you look. And we polished this technology to the extent that in HoloLens, you have your spatial audio personalized without even knowing it. You put the device on and you hear through your own personalized spatial hearing.
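[Editor's note: The "two filters" idea described above can be sketched as filtering one mono signal with a left-ear and a right-ear impulse response and sending the two results to the headphones. The impulse responses below are toy values invented for illustration: the right ear simply receives the sound three samples later and at half the level, as it might for a source on the listener's left. Measured or personalized head-related transfer functions are far richer than this.]

```python
def convolve(signal, impulse_response):
    """Direct-form convolution of a signal with a finite impulse response."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def render_binaural(mono, hrir_left, hrir_right):
    """Filter one mono source through a pair of ear filters (left, right)."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)


# Toy, made-up ear filters for a source on the listener's left.
toy_hrir_left = [1.0, 0.0, 0.0, 0.0]
toy_hrir_right = [0.0, 0.0, 0.0, 0.5]

click = [1.0, 0.0, 0.0]
left_ear, right_ear = render_binaural(click, toy_hrir_left, toy_hrir_right)
```

[The difference in arrival time and level between `left_ear` and `right_ear` is the kind of cue the brain uses to place the click to the left. A full renderer would select a different filter pair for every direction and update it as the head moves.]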
Host: How does that do that automatically?
Ivan Tashev: Silently, we measure certain anthropometrics of your head. Our engineering teams, our partners, decided that there should not be anything visible to the user in generating that personalized spatial hearing.
Host: So, if I put this on, say the HoloLens headset, it’s going to measure me on the fly?
Ivan Tashev: Mmm hmmm.
Host: And then the 3D audio will happen for me. Both of us could have the headset on and hear a noise in one of our ears that supposedly is coming from behind us, but really isn’t. It’s virtual.
Ivan Tashev: That’s absolutely correct. With the two loudspeakers in HoloLens or in your headphones, we can make you perceive the sound coming from above, from below, from behind. And this is actually the main difference between surround sound and 3D audio for headphones. Surround sound has five or seven loudspeakers, but they are all in one plane. So, surround audio world is actually flat. While with this spatial audio engine, we can actually render audio above and below which opens pretty much a new frontier in expressiveness of the audio, what we can do.
Host: Listen, as you talk, I have a vision of a bat in my head sending out signals and getting signals and echolocations and…
Ivan Tashev: We did that.
Host: Okay, tell.
Ivan Tashev: So, one of our projects (this is one of those more blue-sky research projects) is exactly about that. What we wanted to explore is using audio for echolocation in the same way bats see in complete darkness. And we built a spherical loudspeaker array of eight transducers which sends ultrasound pulses toward a given direction, and near it, an eight-element microphone array which, through the technology called beamforming, listens toward the same direction. With this, we utilize the energy of the loudspeakers well and reduce the amount of sound coming from other directions, and this allows us to measure the energy reflected by the object in that direction. When you scan the space this way, you can create an image which is exactly the same as the one created by a depth camera using infrared light, but with a fraction of the energy. The ultimate goal, eventually, will be to get the same gesture recognition with one tenth or one hundredth of the power necessary. This is important for all portable battery-operated devices.
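[Editor's note: A depth value for a single look direction can be obtained from the round-trip time of a pulse. The sketch below finds the echo by cross-correlating the emitted pulse with the recording, then converts the delay into a distance. It covers ranging in one direction only, not the beamformed scan described above. The sample rate, the pulse shape, the echo position and the speed of sound are assumed values, not details of the actual project.]

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # assumed: air at roughly 20 degrees Celsius


def echo_delay_samples(pulse, recording):
    """Lag at which the emitted pulse best matches the recording (cross-correlation peak)."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(len(recording) - len(pulse) + 1):
        score = sum(p * recording[lag + i] for i, p in enumerate(pulse))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def range_from_delay(delay_samples, sample_rate_hz):
    """Distance to the reflector: the sound travels there and back, so halve the path."""
    round_trip_s = delay_samples / sample_rate_hz
    return SPEED_OF_SOUND_M_PER_S * round_trip_s / 2.0


# Hypothetical capture: a short pulse and an attenuated echo 1120 samples after emission.
sample_rate_hz = 192_000
pulse = [1.0, -1.0, 1.0, -1.0]
recording = [0.0] * 2000
for i, p in enumerate(pulse):
    recording[1120 + i] += 0.2 * p

delay = echo_delay_samples(pulse, recording)
distance_m = range_from_delay(delay, sample_rate_hz)
```

[In this made-up capture the reflector sits about one meter away. Repeating the measurement while steering the loudspeaker and microphone arrays across directions would fill in one pixel of the depth image at a time.]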
Host: Yeah. Speaking of that, accessibility is a huge area of interest for Microsoft right now, especially here in Microsoft Research with the AI for Accessibility initiative. And it’s really revolutionizing access to technology for people with disabilities. Tell us how the research you’re doing is finding its way into the projects and products in the arena of accessibility.
Ivan Tashev: You know, accessibility finds a resonance among Microsoft employees. The first application of our spatial audio technology was actually not HoloLens. It was a kind of grassroots project in which Microsoft employees worked with a charity organization called Guide Dogs in the United Kingdom. And from the name you can basically guess that they train guide dogs for people with blindness. The idea was to use spatial audio to help the visually impaired. Multiple teams in Microsoft Research, including my team, have been involved to overcome a lot of problems, and this whole story ended up with releasing a product called Soundscape, which is a phone application which allows people with blindness to navigate more easily, where the spatial audio acts like a pointing finger. When the system says, “And on the left is the department store,” that voice prompt actually comes from the direction where the department store is, and this is an additional spatial cue which helps the orientation of visually impaired people. Another interesting project we have been involved in is also a grassroots project. It was driven by a girl who was hearing-impaired. She initiated a project during one of the yearly hackathons. And the project was triggered by the fact that she was told by her neighbor, “Your CO2 alarm has been beeping for a week already. You have to replace the battery.” So, we created a phone application which was able to recognize numerous sounds like a CO2 alarm, fire alarm, door knock, phone ring, baby crying, etcetera, etcetera, and to signal the hearing-impaired person using vibration, or the display. And this is to help them to navigate and to live a better life in our environment.
Host: You have an interesting personal story. Tell us a bit about your background. Where did you grow up, what got you interested in the work you are doing and how did you end up at Microsoft Research?
Ivan Tashev: I was born in a small country in Southeast Europe called Bulgaria. I took my diploma in electronic engineering, and PhD in computer science, from the Technical University of Sofia, and immediately after my graduation, started to work as a researcher there. In 1998, I was Assistant Professor in the Department of Electronic Engineering when Microsoft hired me, and I moved to Washington State. I spent two full shipping cycles in Microsoft engineering teams before moving, in 2001, to Microsoft Research. And what I learned during those two shipping cycles actually helped me a lot to talk better with the engineers during the technology transfers I have done with Microsoft engineering teams.
Host: Yeah, and there’s quite a bit of tech transfer that’s coming out of your group. What are some examples of the things that have been “blue sky research” at the beginning that are now finding their way into millions of users’ desks and homes?
Ivan Tashev: I have been lucky enough to be a part of very strong research groups and to learn from masters like Anoop Gupta or Rico Malvar. My first project in Microsoft Research was called Distributed Meetings, and we used a device to record meetings, to store them and to process them. Later, this device became the RoundTable device, which is part of many conference rooms worldwide. Then, I decided to generalize the microphone array support I designed for the RoundTable device, and this became the microphone array support in Windows Vista. The next challenge was to make this speech enhancement pipeline work in even harsher conditions, like a noisy car. And, I designed the algorithms and transferred them to the first speech-driven entertainment system in a mass-production car. And then, the story continues with Kinect, with HoloLens, many other products, and this is another difference between industrial research and academia. The satisfaction from your work is measurable. You know how many homes your technology has been released to, and for how many people you changed the way they live, entertain or work.
Host: As we close, Ivan, perhaps you can give some parting advice to those of our listeners that might be interested in the science of sound, so to speak. What are the exciting challenges out there in audio and acoustics research, and what guidance would you offer would-be researchers in this area?
Ivan Tashev: So, audio processing is a very interesting area of research because it is a mixture of art, craft and science. It is science because we work with mathematical models and we have repeatable results. But it is an art because it’s about human perception. Humans have their own preferences and tastes, and this makes it very difficult to model with mathematical models. And it’s also a craft. There are always some small tricks and secret sauce which are not mathematical models but make the algorithms from one lab work much better than the algorithms from another lab. Into the mixture, we have to add the powerful innovation of machine learning technologies, neural networks and artificial intelligence, which allow us to solve problems we thought were unsolvable and to produce algorithms which work much better than the classic ones. So, the advice is, learn signal processing and machine learning. This combination is very powerful!
Host: Ivan Tashev, thank you for joining us today.
Ivan Tashev: Thank you.
To learn more about Dr. Ivan Tashev and how Microsoft Research is working to make sound sound better, visit Microsoft.com/research.