Paralympics in Brazil
Project Tokyo was born out of a challenge, in early 2016, from senior leaders at Microsoft to create AI systems that would go beyond completing tasks such as fetching sports scores and weather forecasts or identifying objects. Morrison said creating tools for people who are blind or with low vision was a natural fit for the project, because people with disabilities are often early adopters of new technology.
“It is not about saying, ‘Let’s build something for blind people,’” Morrison said. “We are working with blind people to help us imagine the future, and that future is about new experiences with AI.”
Morrison and her colleague Ed Cutrell, a senior principal researcher at Microsoft’s research lab in Redmond, Washington, were tapped to lead the project. Both have expertise in designing technologies with people who are blind or with low vision and decided to begin by trying to understand how an agent technology could augment, or extend, the capabilities of these users.
To start, they followed a group of athletes and spectators with varying levels of vision on a trip from the United Kingdom to the 2016 Paralympic Games in Rio de Janeiro, Brazil, observing how they interacted with other people as they navigated airports, attended sporting venues and went sightseeing, among other activities. A key lesson, Cutrell noted, was how an enriched understanding of social context could help people who are blind or with low vision make sense of their environment.
“We, as humans, have this very, very nuanced and elaborate sense of social understanding of how to interact with people – getting a sense of who is in the room, what are they doing, what is their relationship to me, how do I understand if they are relevant for me or not,” he said. “And for blind people a lot of the cues that we take for granted just go away.”
This understanding spurred a series of workshops with the blind and low vision community, focused on potential technologies that could provide such an experience. Peter Bosher, an audio engineer in his mid-50s who has been blind most of his life and worked with the Project Tokyo team, said the concept of a technology that provided information about the people around him resonated immediately.
“Whenever I am in a situation with more than two or three people, especially if I don’t know some of them, it becomes exponentially more difficult to deal with because people use more and more eye contact and body language to signal that they want to talk to such-and-such a person, that they want to speak now,” he said. “It is really very difficult as a blind person.”
A modified HoloLens
Once the Project Tokyo researchers understood the type of AI experience they wanted to create, they set out to build the enabling technology. They started with the original Microsoft HoloLens, a mixed reality headset that projects holograms users can manipulate into the real world.
“HoloLens gives us a ton of what we need to build a real time AI agent that can communicate the social environment,” said Grayson during a demonstration of the technology at Microsoft’s research lab in Cambridge.
For example, the device has an array of grayscale cameras that provide a near 180-degree view of the environment and a high-resolution color camera for high-accuracy facial recognition. In addition, the speakers above the user’s ears allow for spatialized audio – the creation of sounds that seem to be coming from specific locations around the user.
Machine learning experts on the Project Tokyo team then developed computer vision algorithms that provide varying levels of information about who is where in the user’s environment. The models run on graphics processing units, known as GPUs, that are housed in the black chest that Grayson carted off to Regan’s house for the user testing with Theo.
One model, for example, detects the pose of people in the environment, which provides a sense of where and how far away people are from the user. Another analyzes the stream of photos from the high-resolution camera to recognize people and determine if they have opted to make their names known to the system. All this information is relayed to the user through audio cues.
For example, if the device detects a person one meter away on the user’s left side, the system will play a click that sounds like it is coming from one meter away on the left. If the system recognizes the person’s face, it will play a bump sound, and if that person is also known to the system, it will announce their name.
When the user only hears a click but wants to know who the person is, a second layer of sound that resembles an elastic band stretching guides the user’s gaze toward the person’s face. When the lens’ central camera connects with the person’s nose, the user hears a high-pitched click and, if the person is known to the system, their name.
“I particularly like the thing that gives you the angle of gaze because I’m never really sure what is the sensible angle for your head to be at,” said Bosher, who worked with the Project Tokyo team on the audio experience early in the design process and returned to the Cambridge lab to discuss his experience and check out the latest iteration. “That would be a great tool for learning body language.”
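The layered audio cues described in this section can be sketched in a few lines of Python. This is only an illustration of the logic as the article describes it, not the Project Tokyo team's code: the `Detection` and `Cue` types, the field names and the `cues_for` function are all hypothetical, and the real system's sounds and thresholds are not documented here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Detection:
    """One person found by the (hypothetical) vision models."""
    distance_m: float             # how far the person is from the user
    azimuth_deg: float            # negative = user's left, positive = right
    face_recognized: bool = False # True once the system has recognized a face
    name: Optional[str] = None    # set only if the person opted to be named


@dataclass
class Cue:
    """A spatialized sound to play at the person's position."""
    sound: str
    distance_m: float
    azimuth_deg: float


def cues_for(person: Detection) -> List[Cue]:
    """Return the audio cues for one detected person, in playback order."""
    where = (person.distance_m, person.azimuth_deg)
    # Every detected person gets a click placed at their location.
    cues = [Cue("click", *where)]
    if person.face_recognized:
        # A recognized face adds a bump sound.
        cues.append(Cue("bump", *where))
        if person.name is not None:
            # A name is announced only for people known to the system.
            cues.append(Cue(f"name:{person.name}", *where))
    return cues


if __name__ == "__main__":
    stranger = Detection(distance_m=1.0, azimuth_deg=-90.0)
    friend = Detection(1.5, 30.0, face_recognized=True, name="Cecily")
    for person in (stranger, friend):
        print([cue.sound for cue in cues_for(person)])
```

Keeping the name as an optional field reflects the point that people choose whether to make their names known to the system; without that choice, the sketch never produces a name cue.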
Prototyping with adults
As the Project Tokyo team has developed and evolved the technology, the researchers have routinely invited adults who are blind or with low vision to test the system and provide feedback. To facilitate more direct social interaction, for example, the team removed the lenses from the front of the HoloLens.
Several users expressed a desire to unobtrusively get the information collected by the system without constantly turning their heads, which felt socially awkward. The feedback prompted the Project Tokyo team to work on features that help users quickly learn who is around them by, for example, asking for an overview and getting a spatial readout of all the names of people who have given permission to be recognized by the system.
Another experimental feature alerts the user with a spatialized chime when someone is looking at them, because people with typical vision often establish eye contact to initiate a conversation. Unlike the bump, however, the chime is not followed by a name.
“We already use the name when you look at somebody,” Grayson explained to Emily, a tester in her 20s who has low vision and visited the Cambridge lab to learn about the most recent features. “But also, by not giving the name, it might draw your attention to turn to somebody who is trying to get your attention. And by turning to them, you find out their name.”
“I totally agree with that. That is how sighted people react. They capture someone out of the corner of their eye, or you get that sense, and go, ‘Cecily,’” Emily said.
The modified HoloLens the researchers showed to Emily also included an LED strip affixed above the band of cameras. A white light tracks the person closest to the user and turns green when the person has been identified to the user. The feature lets communication partners or bystanders know they’ve been seen, making it more natural to initiate a conversation.
The LED strip also provides people an opportunity to move out of the device’s field of view and not be seen, if they so choose. “When you know you are about to be seen, you can also decide not to be seen,” noted Morrison. “If you know when you are being seen, you know when you are not being seen.”
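As a rough illustration of the LED behavior described above, the sketch below picks the closest person, maps their direction onto a position along the strip, and chooses white or green depending on whether they have been identified. The function name, the tuple layout, the 20-LED strip length and the 180-degree span are assumptions made for the example; the article does not give these details.

```python
from typing import List, Optional, Tuple

# Each person is (distance_m, azimuth_deg, identified); a hypothetical layout.
Person = Tuple[float, float, bool]


def led_strip_state(
    people: List[Person],
    num_leds: int = 20,      # assumed strip length
    fov_deg: float = 180.0,  # assumed span covered by the strip
) -> Optional[Tuple[int, str]]:
    """Return (led_index, color) for the closest person, or None if nobody is seen."""
    if not people:
        return None
    # The light tracks whoever is closest to the user.
    _, azimuth_deg, identified = min(people, key=lambda p: p[0])
    # Map azimuth in [-fov/2, +fov/2] onto an LED index, clamped to the strip.
    fraction = (azimuth_deg + fov_deg / 2) / fov_deg
    index = max(0, min(num_leds - 1, int(fraction * num_leds)))
    # White while tracking; green once the person has been identified to the user.
    color = "green" if identified else "white"
    return index, color


if __name__ == "__main__":
    print(led_strip_state([(2.0, 0.0, False)]))
    print(led_strip_state([(2.0, 0.0, False), (1.0, -90.0, True)]))
```

A person who steps outside the assumed field of view simply stops appearing in the input list, which matches the idea that bystanders can choose not to be seen.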