The evolution from research to product
It’s one thing for a Microsoft researcher to use all the available bells and whistles, plus Azure’s powerful computing infrastructure, to develop an AI-based machine translation model that can perform as well as a person on a narrow research benchmark with lots of data. It’s quite another to make that model work in a commercial product.
To tackle the human parity challenge, three research teams used deep neural networks and applied other cutting-edge training techniques that mimic the way people might approach a problem to provide more fluent and accurate translations. Those included translating sentences back and forth between English and Chinese and comparing results, as well as repeating the same translation over and over until its quality improves.
“In the beginning, we were not taking into account whether this technology was shippable as a product. We were just asking ourselves if we took everything in the kitchen sink and threw it at the problem, how good could it get?” Menezes said. “So we came up with this research system that was very big, very slow and very expensive just to push the limits of achieving human parity.”
“Since then, our goal has been to figure out how we can bring this level of quality — or as close to this level of quality as possible — into our production API,” Menezes said.
Someone using Microsoft Translator types in a sentence and expects a translation in milliseconds, Menezes said. So the team needed to figure out how to make its big, complicated research model much leaner and faster. But as they were working to shrink the research system algorithmically, they also had to broaden its reach exponentially — not just training it on news articles but on anything from handbooks and recipes to encyclopedia entries.
To accomplish this, the team employed a technique called knowledge distillation, which involves creating a lightweight “student” model that learns from translations generated by the “teacher” model with all the bells and whistles, rather than the massive amounts of raw parallel data that machine translation systems are generally trained on. The goal is to engineer the student model to be much faster and less complex than its teacher, while still retaining most of the quality.
In one example, the team found that the student model could use a simplified decoding algorithm to select the best translated word at each step, rather than the usual method of searching through a huge space of possible translations.
The researchers also developed a different approach to dual learning, which takes advantage of “round trip” translation checks. For example, if a person learning Japanese wants to check and see if a letter she wrote to an overseas friend is accurate, she might run the letter back through an English translator to see if it makes sense. Machine learning algorithms can also learn from this approach.
In the research model, the team used dual learning to improve the model’s output. In the production model, the team used dual learning to clean the data that the student learned from, essentially throwing out sentence pairs that represented inaccurate or confusing translations, Menezes said. That preserved a lot of the technique’s benefit without requiring as much computing.
With lots of trial and error and engineering, the team developed a recipe that allowed the machine translation student model — which is simple enough to operate in a cloud API — to deliver real-time results that are nearly as accurate as the more complex teacher, Menezes said.
Improving search with multi-task learning
In the rapidly evolving AI landscape, where new language understanding models are constantly introduced and improved upon by others in the research community, Bing’s search experts are always on the hunt for new and promising techniques. Unlike the old days, in which people might type in a keyword and click through a list of links to get to the information they’re looking for, users today increasingly search by asking a question — “How much would the Mona Lisa cost?” or “Which spider bites are dangerous?” — and expect the answer to bubble up to the top.
“This is really about giving the customers the right information and saving them time,” said Rangan Majumder, partner group program manager of search and AI in Bing. “We are expected to do the work on their behalf by picking the most authoritative websites and extracting the parts of the website that actually shows the answer to their question.”
To do this, not only does an AI model have to pick the most trustworthy documents, but it also has to develop an understanding of the content within each document, which requires proficiency in any number of language understanding tasks.
Last June, Microsoft researchers were the first to develop a machine learning model that surpassed the estimate for human performance on the General Language Understanding Evaluation (GLUE) benchmark, which measures mastery of nine different language understanding tasks ranging from sentiment analysis to text similarity and question answering. Their Multi-Task Deep Neural Network (MT-DNN) solution employed both knowledge distillation and multi-task learning, which allows the same model to train on and learn from multiple tasks at once and to apply knowledge gained in one area to others.
Bing’s experts this fall incorporated core principles from that research into their own machine learning model, which they estimate has improved answers in up to 26 percent of all questions sent to Bing in English markets. It also improved caption generation — or the links and descriptions lower down on the page — in 20 percent of those queries. Multi-task deep learning led to some of the largest improvements in Bing question answering and captions, which have traditionally been done independently, by using a single model to perform both.
For instance, the new model can answer the question “How much does the Mona Lisa cost?” with a bolded numerical estimate: $830 million. In the answer below, it first has to know that the word cost is looking for a number, but it also has to understand the context within the answer to pick today’s estimate over the older value of $100 million in 1962. Through multi-task training, the Bing team built a single model that selects the best answer, whether it should trigger and which exact words to bold.
Earlier this year, Bing engineers open sourced their code to pretrain large language representations on Azure. Building on that same code, Bing engineers working on Project Turing developed their own neural language representation, a general language understanding model that is pretrained to understand key principles of language and is reusable for other downstream tasks. It masters these by learning how to fill in the blanks when words are removed from sentences, similar to the popular children’s game Mad Libs.
You take a Wikipedia document, remove a phrase and the model has to learn to predict what phrase should go in the gap only by the words around it,” Majumder said. “And by doing that it’s learning about syntax, semantics and sometimes even knowledge. This approach blows other things out of the water because when you fine tune it for a specific task, it’s already learned a lot of the basic nuances about language.”
To teach the pretrained model how to tackle question answering and caption generation, the Bing team applied the multi-task learning approach developed by Microsoft Research to fine tune the model on multiple tasks at once. When a model learns something useful from one task, it can apply those learnings to the other areas, said Jianfeng Gao, partner research manager in the Deep Learning Group at Microsoft Research.
For example, he said, when a person learns to ride a bike, she has to master balance, which is also a useful skill in skiing. Relying on those lessons from bicycling can make it easier and faster to learn how to ski, as compared with someone who hasn’t had that experience, he said.
“In some sense, we’re borrowing from the way human beings work. As you accumulate more and more experience in life, when you face a new task you can draw from all the information you’ve learned in other situations and apply them,” Gao said.
Like the Microsoft Translator team, the Bing team also used knowledge distillation to convert their large and complex model into a leaner model that is fast and cost-effective enough to work in a commercial product.
And now, that same AI model working in Microsoft Search in Bing is being used to improve question answering when people search for information within their own company. If an employee types a question like “Can I bring a dog to work”? into the company’s intranet, the new model can recognize that a dog is a pet and pull up the company’s pet policy for that employee — even if the word dog never appears in that text. And it can surface a direct answer to the question.
“Just like we can get answers for Bing searches from the public web, we can use that same model to understand a question you might have sitting at your desk at work and read through your enterprise documents and give you the answer,” Majumder said.
Top image: Microsoft investments in natural language understanding research are improving the way Bing answers search questions like “How much does the Mona Lisa cost?” Image by Musée du Louvre/Wikimedia Commons.
Jennifer Langston writes about Microsoft research and innovation. Follow her on Twitter.