How to build conversation friendly voice apps that won’t kill you

Voice is the future — here’s what to do while waiting

This article is a part of the Battery001 issue: An issue dedicated to Voice!

Speaking to computers has been a part of sci-fi since forever. HAL 9000 gave us an idea about what to expect both from computers’ conversational skills and their mental disorders. Now voice assistants are in everyone’s hands so this must be the future, right?

If you saw the Google Duplex demo last year then it actually seems like it is finally here, the future! Natural conversation, tone of voice, intonation, impressive! I will cover Duplex in another article, due shortly.

And Google is rolling it out now. In selected cities. English only. For booking restaurants (which I don’t do that often). And haircuts (which I don’t need). And probably a list of other stuff before getting to something that makes sense to me. It seems like at least I have to wait for the future a bit more.

“Hmmm, I don’t know that one.”

But while waiting there is a lot we can do! Here are my top priorities for building smart and useful voice apps today!

Before starting, let me define what I mean by voice apps. On one hand there are simple one-shot request/response skills/actions, on the other there are the AI assistants that are supposed to handle everything. A voice app is what lives in the huge but still quite unpopulated space in between. The voice equivalent of a chatbot.

Starting with skipping the “obvious”

If we have already decided to create a voice app then we also have figured out the reason why, what specific problems we want to solve, why voice is the natural choice compared to a standard website or app, and more, so I will not bore you with such trivialities 😜.

Some of the advice below might still feel like it falls under the “obvious” label, but if it were that “obvious” then everyone would follow it, and that is “obviously” not the case.

Setting expectations, ours and the user’s

Building anything beyond a simple single-shot request/response skill/action can get complicated pretty quickly. That is why we need to have a very clear idea about what the objective of the voice app is and how to achieve just that.

This is also essential from the user’s perspective. An app with a clear focus is easy to explain and easy to use. Make sure that the user understands what the app does and doesn’t do from the very start. Setting the right expectations is the first step to avoid frustration and disappointment, both for the user and ourselves.

First make it work, then make it great

It is easy to get carried away with all the creative opportunities. Put all those amazing ideas in a box, mark it “ideas for later”, start with an MVP and build from there.

Focus on basic functionality, test with real users (see below) and verify that the experience is on the right track before getting too clever. Choose the simple solutions, avoid guesses and assumptions about user behavior, and let the dialog flow evolve based on real-world testing. Never solve a problem “just to be safe” before knowing it is real. Put a note in the book and let reality be the judge. In proper user testing of course 😜 (see below).

Focus on the 95%

Define the most common user behavior and build the flow from there. Focus on giving the best user experience to the absolute majority, not the outliers. There is no way to make everyone happy anyway.

Unless being edgy is the goal of the app, don’t spend any time on designing for all the edge cases, e.g.:

“What if the user says “f**k you”?? We need to cover for that too!”

No, we don’t, and actually we shouldn’t. The conversation should be designed to be fun for the 95% that wants to play along with the rules, not the rebels that want to break them. People can come up with more weird shit than we will ever be able to cover for, and just trying will make them even more determined.

Let’s just give up from the start and make it boooooring to break the rules. That will get most of the trolls back in line. This way we reduce overall complexity and lower the probability of unintentional errors, win-win-win!!

If you still want to go that route, here is how Mitsuku handles provocateurs: The Curse of the Chatbot Users

Test, test and test

Proper user testing is essential for creating an amazing user experience. Sounds obvious, but it turns out that end user testing on voice platforms is a real pain, especially if you want to support several platforms. 
Don’t use that as an excuse!

The more we know about a project, the less useful we are as test subjects. The worst testers are developers (sorry guys!). They know exactly what works and what doesn’t, so they simply don’t do anything that breaks the experience. A schmuck on the other hand breaks apps within minutes.

Both Alexa and Google Assistant provide tools for testing. However, the tools are not constructed for testing by “real” users. They are developer tools and are a part of the dev environment. The level of damage the aforementioned schmuck could cause is massive.

So to do proper end user tests on Alexa/Google Home the skill/action has to be deployed to the platforms and must be accessed by supported hardware. Don’t let that discourage you, just plan it well and you will be greatly rewarded.

Logs are the real treasure troves!

Logging user interaction is not only for development, but for the live application too. There is so much to learn from real user interaction. Make sure that every interaction is logged in a structured and searchable form so that you can squeeze all the value out of it.

The logs help us catch the cases where the voice app wasn’t able to understand the user’s intent. What did the user say? Should we add training data, or is there a response missing? Or do they deserve to get the fallback option?

The logs will not only help us fix what is broken, they will also tell us about the user’s expectations and how they would like to use the app. Keeping an eye on the logs will guide you on how the app should evolve over time.
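As a rough illustration of what “structured and searchable” could look like, here is a minimal Python sketch that writes one JSON line per interaction and then counts the utterances that ended up in the fallback. The field names and the two helper functions are my own assumptions for the example, not something Alexa or Google Assistant provides.

```python
import json
from collections import Counter
from datetime import datetime, timezone


def log_interaction(log_file, session_id, utterance, intent, response):
    """Append one interaction as a JSON line.

    `intent` is None when the app could not match the utterance
    (hypothetical convention for this sketch).
    """
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "session_id": session_id,
        "utterance": utterance,
        "intent": intent,
        "response": response,
    }
    log_file.write(json.dumps(entry) + "\n")
    return entry


def unmatched_utterances(lines):
    """Count utterances that hit the fallback, most frequent first."""
    counts = Counter()
    for line in lines:
        entry = json.loads(line)
        if entry["intent"] is None:
            # Normalize a little so near-identical phrases group together.
            counts[entry["utterance"].strip().lower()] += 1
    return counts.most_common()
```

The most frequent unmatched phrases are the first candidates for new training data or a missing response.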

Voice is not text

For someone coming from chatbots the difference looks small, they might even use the same tools. But voice is not text. In speech we use simpler words but longer sentences, so the user input will be quite different.

The responses from the app must be short and to the point. Points in the flow where the user can make selections must be simplified. Buttons are not an option. A large number of options and long chunks of speech make for a poor user experience. KISS!

If you want to cover both voice and text, think voice-first. It is easier to adapt a voice flow to text than the opposite.

A script that feels great in our heads might be totally awkward when read aloud. And the users never say what we assume they will. Test, test and test. And then test.

Context is King

Context is what makes human interactions flow and feel natural. In voice apps context is what makes it possible for the user to handle more complex tasks, such as making selections between multiple options. Keeping track of where the user is in the process makes your voice app feel smart and lets it give meaningful responses. Context awareness rules!
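To make this a bit more concrete, here is a small Python sketch of the idea: the app remembers which step the user is at and which options it just offered, so a reply like “the second one” can be resolved. The Session class, the step names and the matching logic are all made up for illustration. A real app would lean on the platform’s own context or session features.

```python
from dataclasses import dataclass, field

# Hypothetical mapping from spoken ordinals to list positions.
ORDINALS = {"first": 0, "second": 1, "third": 2}


@dataclass
class Session:
    step: str = "menu"                            # where the user is in the flow
    options: list = field(default_factory=list)   # what we last offered


def offer(session, step, options):
    """Remember the step and the offered options, then build the prompt."""
    session.step = step
    session.options = list(options)
    return "Do you want " + " or ".join(options) + "?"


def resolve_choice(session, utterance):
    """Interpret the reply in the context of the options just offered."""
    text = utterance.lower()
    for option in session.options:
        if option.lower() in text:
            return option
    for word, index in ORDINALS.items():
        if word in text.split() and index < len(session.options):
            return session.options[index]
    return None  # not understood, time for the fallback
```

Without the stored options, “the second one” means nothing. With them, it is an easy selection.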

What is the fallback?

“I’m sorry, I don’t know how to help with that yet.”

Occasionally our voice app will not understand what the user says. Maybe the user has a strong accent or doesn’t understand what they are supposed to do. It is not uncommon that apps keep on repeating “I don’t understand” or variations on the same theme until the user gives up in frustration.

Instead, think about why the user gets stuck and what they need to get back on track. An example of how to handle this:

Do you want option A or B?
[mumble mumble]
I did not understand, do you want option A or B?
[blah blah blah]
I am sorry, I still don’t understand. Let me know if you want option A or B, or if you want to get back to the menu.
[doobee doobee do]
Here is the menu…

Choose a two- or three-strike strategy: make sure the user understands that their input has not been understood, clarify the options if needed, and if all else fails get out of the gridlock by going back to an earlier step.
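The dialog above could be sketched in Python roughly like this. It is only one way to do it: the Turn class, the MAX_STRIKES constant and the “menu” step name are assumptions for the example, and the wording of the prompts is taken from the dialog.

```python
from dataclasses import dataclass

MAX_STRIKES = 3  # set to 2 for a two-strike strategy


@dataclass
class Turn:
    step: str         # current step in the flow
    strikes: int = 0  # consecutive misunderstandings at this step


def respond(turn, choice, options):
    """Return the next prompt; `choice` is None when the input was not understood."""
    listed = " or ".join(options)
    if choice is not None:
        turn.strikes = 0
        return f"OK, {choice}."
    turn.strikes += 1
    if turn.strikes >= MAX_STRIKES:
        # Last strike: leave the gridlock by going back to an earlier step.
        turn.step = "menu"
        turn.strikes = 0
        return "Here is the menu."
    if turn.strikes == 1:
        # First strike: say that we did not understand and repeat the question.
        return f"I did not understand, do you want {listed}?"
    # Later strikes: clarify the options and offer a way out.
    return (
        "I am sorry, I still don't understand. "
        f"Let me know if you want {listed}, or if you want to get back to the menu."
    )
```

The counter resets as soon as the user is understood, so one mumble early in the conversation does not count against them later.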

Be careful with “personality”

One of the most common pieces of advice for making a chatbot great is to give it a personality. For custom voices (see below) that is a part of the package. But if we plan to use the platforms’ standard voices, maybe a personality doesn’t even make sense?

In all cases, make sure your app understands what the user says, otherwise we have a sure way to annoy the hell out of our users. For example, responding

“Sorry, dozed off for a second. What were you saying?”

when it doesn’t understand…grrrrrrr. A stupid bot with a “personality”, not a good idea.

Use a custom voice

Personality in a custom voice is a completely different thing. Instead of using the default voices for the platforms you could use a voice actor and record the responses for your app. There are several reasons why this could make perfect sense:

  • it makes the app unique and personal, and stand out from the generic Alexa/Google Assistant apps
  • it can express any emotion and state of mind, the tone of voice will be made to fit the content and context
  • it can be made to fit a brand
  • it makes the user experience consistent over platforms, e.g. Alexa, Google Assistant, web based.

There are some downsides though. Where a digital voice can say anything, a pre-recorded voice is limited to what has been recorded (duh!). It increases production costs, requires more planning ahead, and might be hard or even impossible to update at a later time.

An option is to create a custom digital voice, but that will not solve the challenges with tone of voice and expressiveness.

Wrapping it up

Scratching the surface as always. Some final thoughts that I did not cover:

  • Speech will replace many of our click/touch interactions, but will still benefit from supporting visuals, i.e. smart displays. To solve both, think voice-first.
  • Smart displays (which now are pretty dumb) will get the same capabilities as mobile phones because why not?
  • Skills and actions are now hardware dependent, but the functionality we want to access is general. In that way the home devices are like browsers, and all services should be developed in a platform-independent way.

That was all for now. Some more voice related articles are in the pipeline.

Until next time!
