Designing for speech

Daimler AG announced the launch of its new MBUX system at CES 2018. Of all the different ways of interacting with in-car infotainment systems, speech has advanced the most, and that advancement has helped make the new MBUX much easier to use.

The following summarizes a two-year design project. There is a lot more to the story, but covering everything in detail would keep you reading forever!


The Summary

  • The secret is to align the interface design with the flow of natural conversations.

  • I designed a modular UI concept to support a wide variety of conversation contexts.

  • I built up the international collaboration needed to bring it to life by openly sharing our ideas and mission.

  • The new system, MBUX, launched in 2018 with a huge presentation at CES.

  • MBUX got a lot of positive press, especially the new voice system!


The Problem

How might we create a more compelling in-car voice experience?

The image above shows the previous version of the voice UI. Much of the interface was just lists of text, and the thinking at the time was that showing as much text as possible would help people decide what they wanted. The side effect is that reading text on a screen becomes the main way people use the speech system, which is not necessarily good for voice interaction and creates a potential safety issue.


The Customer

This was a big challenge. At the outset of the project, no decision had been made about which cars would get the new MBUX, so it was not clear who the end customer would be. The general idea was that all the different carlines would eventually get MBUX, so I had to do the best I could with the information I had.

Later on, it was decided to release with the all-new 2019 A-Class, but by that point a lot of the work had already been completed. This is not really the best way to proceed, but that's how it goes sometimes.


What were their goals?

The speech interface had to support all the core functionality of the main, haptic system. Speech allows for hands-free operation, meaning the user can get what they need from the system without taking their eyes off the road. Even though I had no information on a particular type of customer, I could still focus the product on the core functions of in-car HMI: navigation, entertainment, and communications.

What were their pain points?

Driving a car demands a lot of cognitive effort. Previous speech interfaces in Mercedes cars consisted mostly of text-based lists. These required a lot of reading, which can be difficult (and possibly distracting) while driving.


The Idea

The core idea was to create a modular UI system to support a wide variety of use cases and situations, a design that would work everywhere.


It began rather innocuously, with a request to develop interface concepts for displaying speech commands when the voice system is active (speech teleprompters). In prior car models, the speech interfaces consisted primarily of teleprompter interfaces (as seen above) that told people exactly what to say at each moment in the dialogue. This was partly due to the technology but also due to the lack of designers involved in the process.

I sent over various wireframe concepts (like the one seen above) and discussed the complexity of the information architecture (which needed improvement). With the addition of new speech commands and more natural language recognition, the number of potential voice commands grew significantly and the request from management was to display them all, which is obviously not advisable.

As the project progressed, I began to have other ideas about what we might be able to do with a speech interface. For example, if the teleprompter is meant to assist the user, then why not provide a better interface that includes user assistance? Obviously just developing a new teleprompter UI wasn't going to cut it. But what could it become? This is where the real project began in earnest. As usual, it began with a lot of sketching and brainstorming.

These are just a fraction of the whiteboard sketches that were generated at this point. I was paired up with a designer from a different team that did visual design. Lots of detailed discussions about various use cases and scenarios took place at this point. There's nothing quite like locking yourself up in a room with a bunch of whiteboards and hashing out concept after concept to refine an idea.

What we originally settled on was an interface based on modular tiles. Why? The basic idea was that we could improve the voice experience by presenting information visually and displaying only the information relevant to helping the user continue to the next dialogue step.

Keep in mind, the general UI convention at the time was to use text lists everywhere, and I do mean everywhere. In earlier systems, if you did a "free POI search" for a restaurant (aside from the clunky dialogue) you would be presented with an ordered list of names, addresses, and relative distances filling up your screen.

While it certainly provides a lot of detailed information, as an interface, it resembles a set of database records and not an actual user interface. So our mission overall was to change that, which proved to be much harder in practice than it seemed.

Collaboration and Compromise

Collaborating with international teams is not easy. Technology has not been able to resolve the problem of remote interaction. Video conferences just don't cut it. Face-to-face meetings are the best way to build collaboration, but you can't always fly your whole team to a remote location (Germany in this case, of course). So we had to have video conferences, lots and lots of video conferences.

I pitched the tile UI concept over and over. Pitched it to as many people as I could. There was simply too much resistance to this kind of idea. In retrospect, I might have tried to change my tactic a bit simply because we were showing some really different UI concepts from the generally expected text lists. Anyway, as a result, we had to figure out an effective compromise. We couldn't have just text lists, but the modular tile concept wasn't winning people over either. Meeting in the middle isn't always the best thing for a product, but it is very important to keep the project moving forward!

So these were the results of the compromises. We decided to develop a new set of UI "archetypes" as an iteration of the tile concept and keep things moving forward. The idea of modularity was still important (given how dialogues work), and even though I couldn't get agreement on the tile concepts, the overall mission to simplify and improve the voice interface remained. After reading through all of the different dialogue specifications (there were a lot!) I realized we could abstract the system into a set of key moments (maps, contacts, music, messages, etc.), and that these moments, when combined, would form the basis for the whole speech UI system.

Continuing our learnings

Usability testing was critical to help us refine the design. To test the UI, however, we needed something interactive. Voice interactions are tricky in that they're not as easy to put into a usability test as a simple clickthrough prototype. It's difficult to measure the quality of a voice interaction without a speech dialogue system. So we had to get far enough with prototyping to be able to evaluate some of our advanced dialogue concepts.

One example was extending the dialogue for "Hey Mercedes, take me home." If the user had not set a home address, the original requirement was to just end the dialogue with "Sorry, no home address available." We proposed extending it to let the user set their home address by voice (and then start the navigation), because an intelligent system would know what data it has and assist the user accordingly. Testing showed that these were positive improvements, and after many discussions we got approval for a lot of the dialogue enhancements.
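The extended "take me home" dialogue can be sketched roughly as follows. This is only an illustration of the dialogue logic, not the production implementation: the `Profile` class, the `ask_user` callback, and the prompt wording are all hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Profile:
    """Hypothetical user profile; only the home address matters here."""
    home_address: Optional[str] = None


def take_me_home(profile: Profile,
                 ask_user: Callable[[str], Optional[str]]) -> str:
    """Return the system's final response to "Hey Mercedes, take me home."

    ask_user stands in for one dialogue turn: it speaks a prompt and
    returns the recognized answer, or None if the user gave no answer.
    """
    if profile.home_address is None:
        # Extended behaviour: offer to set the address instead of
        # ending with "Sorry, no home address available."
        answer = ask_user("You have not set a home address yet. What is it?")
        if answer is None:
            return "Okay, cancelled."
        profile.home_address = answer
    return f"Starting route guidance to {profile.home_address}."
```

The point of the sketch is that the missing-data case becomes one more dialogue step rather than a dead end, and the newly captured address is stored so the next request succeeds immediately.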

Usability tests also helped us evolve the UI concepts. Even though the idea was to be "voice first", we had to allow for multimodal interaction as well because we knew that people wouldn't only use their voice to interact with the UI. The challenge was to design a voice interface that understood that you might also want to interact with it via touch. We had to design all the touch affordances in a different way than the rest of the system. Text labels had to be written to indicate voice commands. Focus cursors (that highlighted where the remote touch interaction would take place) were not visible until after some time. We even tested what an appropriate timing would be to automatically change the UI to display more touch affordances.

All of this test data helped us develop the proper visual designs that would end up in the final product.


The Implementation

Bringing it all together, here's how the product works, as narrated by Mr. Sajjad Khan, Head of Digital Vehicle and Mobility.


After activating the speech system (push-to-talk button on the steering wheel or the wake word "Hey Mercedes"), the Voice UI opens in a minimal state with the animated voice wave at just the top portion of the screen. The main system is still visible beneath the speech interface so that the context of the user's previous interactions with the system is preserved.

If the user utters a request at this point, the dialogue continues into the next step. If they say nothing for 4.5 seconds (usability testing revealed 4.5 seconds was the ideal timing for this), the system reacts with the teleprompter UI. We use their reluctance to speak as an indicator to provide more assistance.
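As a minimal sketch of that timing rule, assuming the UI can be modeled as three named states: the 4.5-second threshold comes from the usability tests described here, while the function name and state labels are made up for illustration.

```python
from typing import Optional

# Threshold reported from usability testing.
TELEPROMPTER_DELAY_S = 4.5


def ui_state_while_listening(seconds_since_activation: float,
                             utterance: Optional[str]) -> str:
    """Pick the Voice UI state while the system waits for speech.

    The state names are hypothetical labels, not product terminology.
    """
    if utterance:
        # The user spoke, so the dialogue moves on.
        return "next_dialogue_step"
    if seconds_since_activation >= TELEPROMPTER_DELAY_S:
        # Silence is treated as a request for help.
        return "teleprompter"
    # Keep the minimal voice wave at the top of the screen.
    return "minimal"
```

For example, `ui_state_while_listening(2.0, None)` stays in the minimal state, while the same call at 4.5 seconds or later switches to the teleprompter.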

Conversely, certain utterances (referred to as "one shots") conclude the dialogue on their own. For example, saying "Hey Mercedes, navigate to 309 North Pastoria Avenue in Sunnyvale" will close the Voice UI and start route guidance to the requested location. With these use cases, the minimized UI is all that's necessary. However, continuing on in the dialogue will display one of the archetypal screens depending on the speech domain that was identified by the NLU.

For example, when someone asks to "...find an Italian restaurant," the appropriate screen is shown and the full Voice UI is used to help the person continue on in the next dialogue steps.
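The choice between closing the UI and showing an archetype screen could be expressed roughly like this. The domain names, the mapping itself, the `NluResult` type, and the fallback to the teleprompter are assumptions made for the sketch; only the archetype categories (maps, contacts, music, messages) come from the project.

```python
from dataclasses import dataclass

# Hypothetical mapping from NLU domain to archetype screen.
ARCHETYPE_BY_DOMAIN = {
    "navigation": "map",
    "phone": "contacts",
    "media": "music",
    "messaging": "messages",
}


@dataclass
class NluResult:
    """Simplified, hypothetical NLU output."""
    domain: str
    is_one_shot: bool  # True if the utterance fully specifies the action


def next_screen(result: NluResult) -> str:
    """Decide what the Voice UI shows after an utterance is understood."""
    if result.is_one_shot:
        # Nothing left to ask, so the Voice UI closes and the action runs.
        return "closed"
    # Assumption: an unknown domain falls back to the teleprompter.
    return ARCHETYPE_BY_DOMAIN.get(result.domain, "teleprompter")
```

Keeping the mapping in one table reflects the modular idea: adding a new speech domain means adding one entry rather than designing a new flow.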

At this point, what happens next is highly dependent on the user. Based on the results, they might take a direct action ("Navigate to number 3") which would close the Voice UI and perform the action. They could also continue looking through the results to find something more appropriate, or they could ask for more information on one of the results they see on screen ("Show me details about number 6").

From the detail view screens, the user can complete their dialogue with the system or return to the results list to look for something else. They can, of course, cancel and exit the speech system at any time.
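One way to picture the results and detail steps is as a small state machine. The sketch below is a simplification under stated assumptions: the state and intent names are hypothetical, and unrecognized intents are assumed to leave the screen unchanged.

```python
# Hypothetical (state, intent) -> next state table for the results flow.
TRANSITIONS = {
    ("results", "navigate"): "closed",      # "Navigate to number 3"
    ("results", "show_details"): "detail",  # "Show me details about number 6"
    ("results", "browse"): "results",       # keep looking through results
    ("detail", "navigate"): "closed",       # complete the dialogue
    ("detail", "back"): "results",          # return to the results list
}


def step(state: str, intent: str) -> str:
    """Return the next Voice UI state for a recognized intent."""
    if intent == "cancel":
        # Cancelling is possible from any state.
        return "closed"
    # Assumption: an intent with no entry keeps the current screen.
    return TRANSITIONS.get((state, intent), state)
```

Handling "cancel" outside the table keeps the exit path available everywhere without repeating it for each screen.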

A note about visual designs

Given that my role was more as a concept lead and interaction designer, I was not responsible for the visual designs of the interface. I collaborated very closely with designers on a different team to help develop the visual language used in the UI. One of the main things we had to develop was the visualization of the person speaking.

In a conventional voice UI (like Siri) your spoken utterances appear in the interface as text as soon as you speak the words. To accomplish this, the system relies upon your phone's robust data connection and onboard language processing. Unfortunately for the car, the round trip needed to turn your speech into visible text would take far too long to be usable. Knowing that feedback is critical for effective interface design, we decided to create an animated graphical display of speech instead. This "voice wave" would provide the necessary affordances that let the user know whether the system is ready for speech input or not.

We also worked with the UI teams in Germany to ensure that the voice UI designs we were coming up with fit within the overall appearance of the system. When working on a single aspect of a larger system, it is possible to end up doing things that work well for your own use cases but create inconsistencies with other parts of the system. We had to have consistent communication (and business travel) to help ensure that we were developing an interface that looked like it belonged inside one product.


The Outcome

Because of GDPR rules, Daimler AG unfortunately does not really gather analytics about how people use the infotainment system. This made it very difficult to understand the true outcome of the work, but we did conduct many usability tests during the design phase of the project that indicated strong improvements and significant benefits to the user.

I also gave a talk about this whole thing. :)


The Takeaways

It is difficult to encapsulate every detail of this project. This is just a brief summary of everything that happened along the way. In total, it was two years' worth of effort: coming up with the ideas, iterating, reviewing, and eventually working with various teams and suppliers to implement the final product. It is truly incredible to see the response from people, as this new voice experience is something never seen before in a Mercedes-Benz. Additionally, it is very exciting to hear the company executives praise the features that we worked so hard to develop. It was a long and challenging road, and now, starting with the 2018 A-Class, people will be able to get their hands on vehicles equipped with the voice UI in the all-new MBUX!