designing for speech

Daimler AG announced the launch of its new MBUX system at CES 2018. With all the different ways of interacting with in-car infotainment systems, advances in speech technology have helped make the new MBUX much easier to use.

project goals

  1. Develop new use cases for voice interaction.
  2. Create an all new voice interface for in-car infotainment systems.

my roles and responsibilities

The following summarizes a two-year design project. There is a lot more to the story, but covering everything in detail would keep you here reading forever!

the process

It began rather innocuously, with a request to develop interface concepts for displaying speech commands when the voice system is active (teleprompters). In current and prior car models, the speech interfaces consisted primarily of teleprompter interfaces: just speech commands on the screen.

We sent over various wireframe concepts (like the one seen above) and discussed some of the challenges with this kind of interface. Largely it was the information architecture that needed improvement. With the addition of new speech commands and more natural language recognition, the number of potential voice commands grew significantly and the request from management was to display them all.

As the project progressed, I began to have other ideas about what we might be able to do with a speech interface. Obviously just developing a new teleprompter UI wasn't going to cut it. But what could it become? This is where the real project began in earnest. As usual, it began with a lot of sketching and brainstorming.

These are just a fraction of the whiteboard sketches that were generated at this point. I was paired up with a designer from a different team that did visual design. Lots of detailed discussions about various use cases and scenarios took place at this point. There's nothing quite like locking yourself up in a room with a bunch of whiteboards and hashing out concept after concept to refine an idea.

flexible, modular interfaces

What we originally settled on was an interface based on modular tiles. Why? The basic idea was that we could improve the voice experience by displaying only the information relevant to helping the user continue into the next dialogue step.

Keep in mind, the general convention at the time was to use text lists everywhere, and I do mean everywhere. In earlier systems, if you did a "free POI search" for a restaurant (aside from the clunky dialogue) you would be presented with an ordered list of names, addresses, and relative distances filling up your screen.

While it certainly provides a lot of detailed information, as an interface, it resembles a set of database records and not an actual user interface. So our mission overall was to change that, which proved to be much harder in practice than it seemed.

collaboration and compromise

Collaborating with international teams is not easy. Technology has not been able to resolve the problem of remote interaction. Video conferences just don't cut it. Face-to-face meetings are the best way to build collaboration, but you can't always fly your whole team to a remote location (Germany in this case, of course). So we had to have video conferences, lots and lots of video conferences.

I pitched the tile UI concept over and over. Pitched it to as many people as I could. There was simply too much resistance to this kind of idea. In retrospect, I might have tried to change my tactic a bit simply because we were showing some really different UI concepts from the generally expected text lists. Anyway, as a result, we had to figure out an effective compromise. We couldn't have just text lists, but the modular tile concept wasn't winning people over either. Meeting in the middle isn't always the best thing for a product, but it is very important to keep the project moving forward!

So these were the results of the compromises. We decided to develop a new set of UI "archetypes" as an iteration of the tile concept and keep things moving forward. The idea of modularity was still important (given how dialogues work), and even though I couldn't get agreement on the tile concepts, the overall mission to simplify and improve the voice interface remained. After reading through all of the different dialogue specifications (there were a lot!), I realized we could abstract the system into a set of key moments (maps, contacts, music, messages, etc.), and these moments, when combined, would form the basis for the whole speech UI system.

continuing our learnings

Usability testing was critical to help us refine the design. To test the UI, however, we needed something interactive. Voice interactions are tricky in that they're not as easy to put into a usability test as a simple clickthrough prototype. It's difficult to measure the quality of a voice interaction without a speech dialogue system. So we had to get far enough with prototyping to be able to evaluate some of our advanced dialogue concepts.

One example was where we extended the dialogue for "Hey Mercedes, take me home." If the user had not set a home address, the original requirement was to just end the dialogue with "Sorry, no home address available." We proposed to extend it to allow the user to set their home address by voice (and then start the navigation) because an intelligent system would know what data it has and assist the user accordingly. Testing showed that these were positive improvements and through a lot of different discussions we got approval for a lot of the dialogue enhancements.
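To make the difference concrete, here is a minimal sketch of the two dialogue variants in Python. It is only an illustration, not the production dialogue logic: the `Profile` class, the function names, the step strings, and the wording of the follow-up question are all hypothetical. Only the "Sorry, no home address available." prompt comes from the original requirement.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Profile:
    """Hypothetical user profile; only the home address matters here."""
    home_address: Optional[str] = None


def take_me_home_original(profile: Profile) -> List[str]:
    """Original requirement: end the dialogue if no home address is stored."""
    if profile.home_address is None:
        return ["SAY: Sorry, no home address available.", "END"]
    return [f"NAVIGATE: {profile.home_address}", "END"]


def take_me_home_extended(profile: Profile, spoken_address: Optional[str]) -> List[str]:
    """Proposed extension: let the user set the address by voice, then navigate."""
    steps: List[str] = []
    if profile.home_address is None:
        # Assumed prompt wording; the real system prompt is not given in the text.
        steps.append("ASK: What is your home address?")
        if spoken_address is None:
            # The user cancelled or gave no answer, so the dialogue ends here.
            steps.append("END")
            return steps
        profile.home_address = spoken_address
        steps.append(f"SAVE: {spoken_address}")
    steps.append(f"NAVIGATE: {profile.home_address}")
    steps.append("END")
    return steps
```

In the extended variant, the missing data becomes one extra dialogue step instead of a dead end, which is the behavior we argued for.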

Usability tests also helped us evolve the UI concepts. Even though the idea was to be "voice first", we had to allow for multimodal interaction as well because we knew that people wouldn't only use their voice to interact with the UI. The challenge was to design a voice interface that understood that you might also want to interact with it via touch. We had to design all the touch affordances in a different way than the rest of the system. Text labels had to be written to indicate voice commands. Focus cursors (that highlighted where the remote touch interaction would take place) were not visible until after some time. We even tested what an appropriate timing would be to automatically change the UI to display more touch affordances.

All of this test data helped us develop the proper visual designs that would end up in the final product.

how it works

Bringing it all together, here's how the product works.

After activating the speech system (push-to-talk button on the steering wheel or the wake word "Hey Mercedes"), the Voice UI opens in a minimal state, with the animated voice wave occupying just the top portion of the screen. The main system is still visible beneath the speech interface so that the context of the user's previous interactions with the system is preserved.

If the user utters a request at this point, the dialogue continues into the next step. If they wait without saying anything for 4.5 seconds (usability testing revealed 4.5 seconds was the ideal timing for this), the system reacts with the teleprompter UI. We use their reluctance to speak as an indicator to provide more assistance.
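The timing rule can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions: the function and screen names are hypothetical, and the real system is event-driven rather than a single function call. Only the 4.5-second threshold comes from our testing.

```python
SILENCE_TIMEOUT_S = 4.5  # threshold from usability testing, as described above


def screen_after_activation(heard_utterance: bool, seconds_of_silence: float) -> str:
    """Pick what the Voice UI shows after activation (hypothetical screen names)."""
    if heard_utterance:
        # The user spoke, so the dialogue moves on to its next step.
        return "next_dialogue_step"
    if seconds_of_silence >= SILENCE_TIMEOUT_S:
        # Silence is read as hesitation, so we offer example commands.
        return "teleprompter"
    # Still waiting: keep the minimal voice wave at the top of the screen.
    return "minimal"
```

For example, `screen_after_activation(False, 2.0)` keeps the minimal UI, while `screen_after_activation(False, 4.5)` switches to the teleprompter.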

Conversely, it's also possible that certain utterances (referred to as "one shots") conclude the dialogue as well. For example, saying "Hey Mercedes, navigate to 309 North Pastor Avenue in Sunnyvale" will close the Voice UI and start route guidance to the requested location. With these use cases, the minimized UI is all that's necessary. However, continuing on in the dialogue will display one of the archetypal screens depending on the speech domain that was identified by the NLU.

For example, when someone asks to "...find an Italian restaurant," the appropriate screen is shown and the full Voice UI is used to help the person continue on to the next dialogue steps.
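As a rough sketch, the choice between closing the UI and showing an archetype screen could look like this in Python. Everything here is an assumption for illustration: the `NluResult` shape, the `is_one_shot` flag, the screen names, and the fallback to the teleprompter for an unknown domain. The domain names follow the key moments mentioned earlier (maps, contacts, music, messages).

```python
from dataclasses import dataclass

# Hypothetical mapping from speech domain to archetype screen.
ARCHETYPE_BY_DOMAIN = {
    "maps": "map_results_screen",
    "contacts": "contact_list_screen",
    "music": "media_screen",
    "messages": "message_screen",
}


@dataclass
class NluResult:
    """Hypothetical, simplified view of what the NLU hands to the UI."""
    domain: str
    is_one_shot: bool  # True when the utterance already contains everything needed


def next_ui(result: NluResult) -> str:
    """Decide what the Voice UI does after an utterance is understood."""
    if result.is_one_shot:
        # Nothing more to ask: close the Voice UI and perform the action.
        return "close_voice_ui"
    # Otherwise show the archetype screen for the recognized domain.
    # Falling back to the teleprompter is an assumption of this sketch.
    return ARCHETYPE_BY_DOMAIN.get(result.domain, "teleprompter")
```

A full address request would arrive as a one shot in the maps domain and close the UI, while a restaurant search in the same domain would open the results screen.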

At this point, what happens next is highly dependent on the user. Based on the results, they might take a direct action ("Navigate to number 3") which would close the Voice UI and perform the action. They could also continue looking through the results to find something more appropriate, or they could ask for more information on one of the results they see on screen ("Show me details about number 6").

From the detail view screens, the user can complete their dialogue with the system or return to the results list to look for something else. They can, of course, cancel and exit the speech system at any time.
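The results and detail steps behave like a small state machine, which can be sketched as follows. This is an illustrative model only: the state names, intent strings, and returned action labels are hypothetical and do not reflect the actual dialogue specification.

```python
from enum import Enum, auto
from typing import Optional, Tuple


class State(Enum):
    RESULTS = auto()  # list of search results
    DETAIL = auto()   # detail view for one result
    CLOSED = auto()   # Voice UI dismissed


def step(state: State, intent: str, item: Optional[int] = None) -> Tuple[State, Optional[str]]:
    """Return the next state and an optional action to perform."""
    if state is State.CLOSED:
        return State.CLOSED, None
    if intent == "cancel":
        # Cancelling is possible from any open screen.
        return State.CLOSED, None
    if intent == "navigate" and item is not None:
        # A direct action ("Navigate to number 3") closes the UI and performs it.
        return State.CLOSED, f"navigate_to_result_{item}"
    if state is State.RESULTS and intent == "show_details" and item is not None:
        return State.DETAIL, None
    if state is State.DETAIL and intent == "back":
        return State.RESULTS, None
    # Anything else leaves the screen unchanged in this sketch.
    return state, None
```

Saying "Show me details about number 6" maps to `step(State.RESULTS, "show_details", 6)`, which moves to the detail view without performing any action yet.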

a note about the visual designs

Given that my role was more as a concept lead and interaction designer, I was not responsible for the visual designs of the interface. I collaborated very closely with designers on a different team to help develop the visual language used in the UI. One of the main things we had to develop was the visualization of the person speaking.

In a conventional voice UI (like Siri), your spoken utterances appear in the interface as text as soon as you speak the words. To accomplish this, the system relies upon your phone's robust data connection and onboard language processing. Unfortunately for the car, the time needed to go out and back to translate your speech to visible text would be far too long to be usable. Knowing that feedback is critical for effective interface design, we decided to create an animated graphical display of speech instead. This "voice wave" would provide the necessary affordances that let the user know whether the system is ready for speech input or not.

In addition, we worked with the UI teams in Germany to ensure that the voice UI designs we were coming up with fit within the overall appearance of the system. When working on a single aspect of a larger system, it is possible to end up doing things that work well for your own use cases but create inconsistencies with other parts of the system. We had to have consistent communication (and business travel) to help ensure that we were developing an interface that looked like it all belonged inside one product.

see it in action

This video from Nuance Communications (NLU provider for MBUX) shows all the cool features along with the new interface I designed.

conclusions and take-aways

It is difficult to encapsulate every detail about this project. This is just a brief summary of all the different things that happened along the way. In total, it was two years' worth of effort: coming up with the ideas, iterating, reviewing, and eventually working with various teams and suppliers to implement the final product. It is truly incredible to see the response from people, as this new voice experience is something never seen before in a Mercedes-Benz. Additionally, it is very exciting to hear the company executives praise the features that we worked so hard to develop. It was a long and challenging road, and now, starting with the 2018 A-Class, people will be able to get their hands on vehicles equipped with the voice UI in the all-new MBUX!
