Designing for speech

Daimler AG announced the launch of its new MBUX system at CES 2018. Among all the different ways of interacting with in-car infotainment systems, advances in speech technology have helped make the new MBUX much easier to use.

The following summarizes a two-year design project. There is a lot more to this story, but covering everything in detail would keep you reading forever!

Project summary

From this project, I learned a lot about designing for new kinds of interaction. Prior to this, I had only ever designed the usual screen-based products. Voice interfaces were something new for me, and I had to become a student of these kinds of products in order to come up with ideas. Especially in the automotive space, non-haptic interaction like speech can really benefit people, because you no longer need to stare at a display. If designed well, it can provide a more distraction-free experience and truly assist drivers as they go about their lives.

In addition to the usual process, from sketching and writing to wireframes and prototypes, I also spent time on up-front research. I had to not only familiarize myself with the conventional voice assistant products of the day, but also conduct usability studies to learn about different UX concepts and how they might affect drivers. Prototyping for embedded systems is a little different, since you can't use the more common development tools. What made this even more complicated was that I couldn't use popular cloud-based design tools due to the company's IT security protocols. This meant that everything had to be exported and documented (like a specification) rather than prototyped and experienced.

Working with an incredibly long timeline for just one "release" is not without challenges. Many business trips to Germany and countless video conferences later, I was able to build up the collaboration with various partner teams and launch a completely redesigned experience for the 2018 A-Class! I am very proud of that accomplishment, and despite the difficulties of working internationally on a project this long, it will always stand out as an amazing experience in my career.

Project goals

  • Develop new use cases for voice interaction.

  • Create an all new voice interface for in-car infotainment systems.

My roles and responsibilities

  • Product management

  • Interaction design

  • UI concept lead // creative strategy

The Why

Who is the customer?
This was a big challenge. At the outset of the project, no decision had been made about which cars would get the new MBUX, so it was not clear who the end customer would be. The general idea was that all the different carlines would eventually get MBUX, but without that decision I had to do the best I could with the information I had. It was eventually decided to launch with the all-new 2019 A-Class, but by that point a lot of the work had already been completed. This is not really the best way to proceed, but that's life sometimes.

What were their goals?
The speech interface had to support all the core functionality of the main haptic system. Speech allows for hands-free operation, meaning the user can get what they need from the system without taking their eyes off the road. Even though I had no information on a particular type of customer, I could still focus the product on the core functions of in-car HMI: navigation, entertainment, and communications.

What were their pain points?
Driving a car demands a lot of cognitive resources. Previous speech interfaces in Mercedes cars consisted mostly of text-based lists. This required a lot of reading, which can be difficult (and potentially distracting) while driving.

The What

It began rather innocuously: develop interface concepts for displaying speech commands when the voice system is active (teleprompters). In current and prior car models, the speech interface consisted primarily of teleprompters, with speech commands simply listed on the screen.

I sent over various wireframe concepts (like the one seen above) and discussed the complexity of the information architecture, which needed improvement. With the addition of new speech commands and more natural language recognition, the number of potential voice commands grew significantly, and the request from management was to display them all.

As the project progressed, I began to have other ideas about what we might be able to do with a speech interface. For example, if the teleprompter is meant to assist the user, then why not provide a better interface that includes user assistance? Obviously just developing a new teleprompter UI wasn't going to cut it. But what could it become? This is where the real project began in earnest. As usual, it began with a lot of sketching and brainstorming.

These are just a fraction of the whiteboard sketches generated at this point. I was paired with a visual designer from a different team, and we had many detailed discussions about various use cases and scenarios. There's nothing quite like locking yourself in a room with a bunch of whiteboards and hashing out concept after concept to refine an idea.

Flexible, modular interfaces

What we initially settled on was an interface based on modular tiles. Why? The basic idea was that we could improve the voice experience by presenting information visually, displaying only the information relevant to helping the user continue to the next dialogue step.

Keep in mind, the general UI convention at the time was to use text lists everywhere, and I do mean everywhere. In earlier systems, if you did a "free POI search" for a restaurant (aside from the clunky dialogue) you would be presented with an ordered list of names, addresses, and relative distances filling up your screen.

While it certainly provides a lot of detailed information, as an interface, it resembles a set of database records and not an actual user interface. So our mission overall was to change that, which proved to be much harder in practice than it seemed.

Collaboration and compromise

Collaborating with international teams is not easy. Technology has not been able to resolve the problem of remote interaction. Video conferences just don't cut it. Face-to-face meetings are the best way to build collaboration, but you can't always fly your whole team to a remote location (Germany in this case, of course). So we had to have video conferences, lots and lots of video conferences.

I pitched the tile UI concept over and over, to as many people as I could, but there was simply too much resistance to this kind of idea. In retrospect, I might have changed my tactics, given that we were showing UI concepts that departed quite a bit from the generally expected text lists. As a result, we had to figure out an effective compromise. We couldn't have just text lists, but the modular tile concept wasn't winning people over either. Meeting in the middle isn't always the best thing for a product, but it is very important for keeping the project moving forward.

So these were the results of the compromises. We decided to develop a new set of UI "archetypes" as an iteration of the tile concept and keep things moving forward. The idea of modularity was still important (given how dialogues work), and even though I couldn't get agreement on the tile concepts, the overall mission to simplify and improve the voice interface remained. After reading through all of the different dialogue specifications (there were a lot!), I realized we could abstract the system into a set of key moments (maps, contacts, music, messages, etc.), and these moments, when combined, would form the basis for the whole speech UI system.

Continuing our learnings

Usability testing was critical to refining the design. To test the UI, however, we needed something interactive to put in front of people. Voice interactions are tricky: they're not as easy to drop into a usability test as a simple clickthrough prototype, and it's difficult to measure the quality of a voice interaction without a speech dialogue system. So we had to get far enough with prototyping to be able to evaluate some of our advanced dialogue concepts.

One example was where we extended the dialogue for "Hey Mercedes, take me home." If the user had not set a home address, the original requirement was to simply end the dialogue with "Sorry, no home address available." We proposed extending it to let the user set their home address by voice (and then start the navigation), because an intelligent system should know what data it has and assist the user accordingly. Testing showed these were positive improvements, and through many discussions we got approval for many of the dialogue enhancements.
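To make that extended dialogue step a little more concrete, here is a small TypeScript sketch. The NavigationService interface, the function names, and the prompts are hypothetical, made up for this illustration rather than taken from the production system.

```typescript
// Hypothetical sketch of the extended "take me home" dialogue step.
// NavigationService, DialogueResponse, and the prompts are illustrative only.

interface NavigationService {
  getHomeAddress(): string | null;
  setHomeAddress(address: string): void;
  startRouteGuidance(destination: string): void;
}

type DialogueResponse =
  | { type: "end"; prompt: string }
  | { type: "ask"; prompt: string; onAnswer: (utterance: string) => DialogueResponse };

function handleTakeMeHome(nav: NavigationService): DialogueResponse {
  const home = nav.getHomeAddress();

  if (home !== null) {
    // Happy path: a home address exists, so start guidance right away.
    nav.startRouteGuidance(home);
    return { type: "end", prompt: "Starting route guidance to your home address." };
  }

  // The original behavior ended here with "Sorry, no home address available."
  // The extended dialogue instead asks the user to set a home address by voice.
  return {
    type: "ask",
    prompt: "You haven't set a home address yet. What is your home address?",
    onAnswer: (utterance) => {
      nav.setHomeAddress(utterance);
      nav.startRouteGuidance(utterance);
      return { type: "end", prompt: "Home address saved. Starting route guidance." };
    },
  };
}
```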

Usability tests also helped us evolve the UI concepts. Even though the idea was to be "voice first," we had to allow for multimodal interaction as well, because we knew people wouldn't only use their voice to interact with the UI. The challenge was to design a voice interface that understood you might also want to interact with it via touch. We had to design all the touch affordances differently from the rest of the system: text labels had to be written to indicate voice commands, and focus cursors (highlighting where the remote touch interaction would take place) only became visible after a delay. We even tested what an appropriate timing would be to automatically change the UI to display more touch affordances.

All of this test data helped us develop the proper visual designs that would end up in the final product.

The How

Bringing it all together, here's how the product works.

After activating the speech system (with the push-to-talk button on the steering wheel or the wake word "Hey Mercedes"), the Voice UI opens in a minimal state, with the animated voice wave occupying just the top portion of the screen. The main system remains visible beneath the speech interface so that the context of the user's previous interactions is preserved.

If the user utters a request at this point, the dialogue continues to the next step. If they wait or say nothing for 4.5 seconds (usability testing revealed 4.5 seconds was the ideal timing), the system responds with the teleprompter UI, using their reluctance to speak as an indicator to provide more assistance.

Conversely, certain utterances (referred to as "one shots") can conclude the dialogue immediately. For example, saying "Hey Mercedes, navigate to 309 North Pastoria Avenue in Sunnyvale" will close the Voice UI and start route guidance to the requested location. For these use cases, the minimized UI is all that's necessary. Continuing the dialogue, however, will display one of the archetypal screens depending on the speech domain identified by the NLU.

For example, when someone asks to "...find an Italian restaurant," the appropriate screen is shown and the full Voice UI is used to help the person continue through the next dialogue steps.
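The overall flow can be summarized as a small state machine. Here is a minimal TypeScript sketch of that idea: activation into the minimal state, the 4.5-second silence timeout that reveals the teleprompter, and the split between one-shot utterances and multi-turn dialogues that show an archetype screen. The state names and the NluResult shape are assumptions made for this illustration, not the actual MBUX implementation.

```typescript
// Illustrative sketch of the Voice UI state handling described above.
// The 4.5 s timing and the overall flow come from this write-up; the
// type and member names are assumptions for the example.

type VoiceUiState = "hidden" | "minimal" | "teleprompter" | "archetype";

interface NluResult {
  domain: "navigation" | "entertainment" | "communication";
  isOneShot: boolean; // e.g. "navigate to 309 North Pastoria Avenue in Sunnyvale"
}

const SILENCE_TIMEOUT_MS = 4500; // value chosen based on usability testing

class VoiceSession {
  state: VoiceUiState = "hidden";
  private silenceTimer?: ReturnType<typeof setTimeout>;

  // Triggered by the push-to-talk button or the "Hey Mercedes" wake word.
  activate(): void {
    this.state = "minimal"; // animated voice wave at the top of the screen
    this.silenceTimer = setTimeout(() => {
      // No utterance yet: offer more assistance via the teleprompter UI.
      this.state = "teleprompter";
    }, SILENCE_TIMEOUT_MS);
  }

  onUtterance(result: NluResult): void {
    clearTimeout(this.silenceTimer);
    if (result.isOneShot) {
      // "One shot" requests complete the task and close the Voice UI.
      this.state = "hidden";
    } else {
      // Multi-turn dialogues show the archetype screen for the NLU domain.
      this.state = "archetype";
    }
  }
}
```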

What happens next is highly dependent on the user. Based on the results, they might take a direct action ("Navigate to number 3"), which would close the Voice UI and perform the action. They could also continue looking through the results to find something more appropriate, or ask for more information about one of the results on screen ("Show me details about number 6").

From the detail view screens, the user can complete their dialogue with the system or return to the results list to look for something else. They can, of course, cancel and exit the speech system at any time.
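As a simple, purely illustrative sketch of how those result-list commands could be dispatched, here is one more TypeScript snippet. The SearchResult shape and the action union are assumptions for the example, not the production dialogue model.

```typescript
// Sketch of handling the result-list commands mentioned above.
// All names here are hypothetical.

interface SearchResult {
  name: string;
  address: string;
}

type ResultListAction =
  | { kind: "navigate"; index: number } // "Navigate to number 3"
  | { kind: "details"; index: number }  // "Show me details about number 6"
  | { kind: "back" }                    // return to the results list
  | { kind: "cancel" };                 // exit the speech system

function handleResultAction(results: SearchResult[], action: ResultListAction): string {
  switch (action.kind) {
    case "navigate": {
      const target = results[action.index - 1]; // spoken numbers are 1-based
      return `Starting route guidance to ${target.name}.`; // closes the Voice UI
    }
    case "details": {
      const target = results[action.index - 1];
      return `Showing details for ${target.name}, ${target.address}.`;
    }
    case "back":
      return "Returning to the results list.";
    case "cancel":
      return "Closing the voice assistant.";
  }
}
```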

A note about the visual designs

Given that my role was concept lead and interaction designer, I was not responsible for the visual designs of the interface. Instead, I collaborated very closely with designers on a different team to help develop the visual language used in the UI. One of the main things we had to develop was the visualization of the person speaking.

In a conventional voice UI (like Siri), your spoken utterances appear in the interface as text as soon as you speak the words. To accomplish this, the system relies on your phone's robust data connection and onboard language processing. Unfortunately for the car, the round trip required to turn your speech into visible text would take far too long to be usable. Knowing that feedback is critical for effective interface design, we decided to create an animated graphical display of speech instead. This "voice wave" would provide the necessary affordances to let the user know whether or not the system is ready for speech input.
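The core of that design choice is that the animation can be driven entirely by locally available signals (the listening state and the microphone level) rather than waiting on a speech-to-text round trip. The following tiny sketch is only meant to illustrate that idea; the function names and values are invented for the example.

```typescript
// Rough sketch: map the listening state and local mic level (0..1)
// to a wave amplitude, so feedback is immediate and needs no network call.

type ListeningState = "idle" | "listening" | "processing";

function voiceWaveAmplitude(state: ListeningState, micLevel: number): number {
  switch (state) {
    case "idle":
      return 0;        // flat line: the system is not ready for speech input
    case "listening":
      return micLevel; // the wave follows the user's voice immediately
    case "processing":
      return 0.2;      // gentle pulse while the utterance is being handled
  }
}

// Example: compute the amplitude for one animation frame.
function renderFrame(state: ListeningState, micLevel: number): void {
  const amplitude = voiceWaveAmplitude(state, micLevel);
  console.log(`draw wave with amplitude ${amplitude.toFixed(2)}`);
}
```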

In addition, we worked with the UI teams in Germany to ensure that the voice UI designs we were coming up with fit within the overall appearance of the system. When working on a single aspect of a larger system, it is easy to end up with things that work well for your own use cases but create inconsistencies with other parts of the system. We had to maintain consistent communication (and business travel) to ensure we were developing an interface that looked like it belonged to a single product.

I also gave a talk about this whole thing.

Conclusions and take-aways

It is difficult to encapsulate every detail about this project; this is just a brief summary of everything that happened along the way. In total, it was two years' worth of effort: coming up with the ideas, iterating, reviewing, and eventually working with various teams and suppliers to implement the final product. It is truly incredible to see the response from people, as this voice experience is something never before seen in a Mercedes-Benz. It is also very exciting to hear company executives praise the features we worked so hard to develop. It was a long and challenging road, and now, starting with the 2018 A-Class, people will be able to get their hands on vehicles equipped with the voice UI in the all-new MBUX!
