
How to talk with humans? A story about Voice User Interfaces (VUI)

Aagje Reynders

What is a Voice User Interface?

A Voice User Interface (VUI) is a system that makes spoken conversations between humans and computers possible.

In this blog, we won't be focusing on how to create a VUI by writing code or talking about technical details. Instead, we'll try to give some pointers to keep in mind when developing a VUI to make it as user-friendly as possible.

Not every VUI is useful

It might sound stupid, but it happens a lot: you get a wild idea to create something, it sounds amazing, and you're confident that everyone will love it, but once you're done, no one actually uses your creation. This can also happen with a virtual assistant. Keep in mind that sometimes a VUI simply isn't the best solution. So before you start building your VUI, here are some questions you might want to ask yourself:

Why use a VUI if I could do it manually?

A voice user interface for setting my alarm could be really cool, but why would I use the voice interface if I could do it in a single tap in the application? Would it be faster? Would the system remind me before bed: “Oh hey, I see you have an appointment at 9 am, should I set an alarm at 7 am?”?

There are multiple ways to achieve the same goal, so when developing a VUI, you need to analyze the different paths a user can take and identify the added value your system provides.

Where will the VUI be used?

Before you begin, take into consideration where and how users might interact with the interface.

Do you need any visuals to support the VUI, or should the system be usable at any time and place? If so, an application on your phone makes a lot of sense. If it's something you typically only use in one room (bedroom, living room), a Google Home or Amazon Echo might be more relevant.

Where should we start?


A VUI can be really small and simple, or large and complex. You should keep a few things in mind before you start writing code and "assuming" how conversations will go.

First, you should consider personas. What kind of conversation do you expect the user to have? Does the VUI work with question > answer > end of conversation? Or will the system have long, complex conversations with your user? Although the first option is more common, a user should still get the feeling that the conversation can go on. Your conversation flow is really important in this step.

Once you know what kind of conversation you want to have, you can start making Sample Designs: a textual representation of how your conversation will go, just like a movie script. Write five conversations and think through each scenario: where could things go wrong, and how should your system react? To take this even further, you can read the conversations out loud with someone else to see how the dialogue flows and whether the interaction feels natural.

When you are done with your Sample Designs, you can start working on flow diagrams, where you map all the possible options and answers. This gives you an idea of how the conversation can go and how big your project really is.
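As a minimal sketch, a flow diagram can be represented in code as a small state graph. The state names, prompts, and answer labels below are invented for illustration (a hypothetical alarm-setting flow), not part of any real framework:

```python
# A hypothetical alarm-setting flow, sketched as a state graph.
# Each state has a prompt and edges keyed by the recognized answer.
FLOW = {
    "start": {
        "prompt": "Should I set an alarm for tomorrow?",
        "yes": "ask_time",
        "no": "end",
    },
    "ask_time": {
        "prompt": "What time should I set it for?",
        "time_given": "confirm",
    },
    "confirm": {
        "prompt": "Alarm set. Anything else?",
        "yes": "start",
        "no": "end",
    },
    "end": {"prompt": "Goodbye!"},
}

def next_state(state: str, answer: str) -> str:
    """Follow one edge of the flow; stay in place on unrecognized answers."""
    return FLOW[state].get(answer, state)
```

Counting the states and edges in such a graph quickly shows how big the project really is.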

Wizard of Oz testing

Finally, before you start coding and creating the magic, you should do some user tests. You now have your sample flows and diagrams, and it would be very useful to know whether users can work with them and whether they actually want to have the conversation that way. You can do this with the "Wizard of Oz" testing technique: a human "wizard" behind the curtain plays the role of the system, while a facilitator observes how the user interacts with it. There is no need to program anything or have something working. You can fake it till you make it.

Understanding human responses


A conversation is more than just words. There are some fundamental elements for two parties to completely understand one another: non-verbal communication, tone, emotions, sarcasm, ... You should find a good balance between making clear to people that your VUI is just some code and still making the dialogues feel human. So, how do we do that?


Predicting how a user could answer helps you understand how to phrase your questions and how to handle confirmation. Let's discuss three types of responses.

  1. Constrained responses
    You know you will get a specific type of response back.
    ex. yes/no, type of color, song titles, ...
  2. Open speech:
    The user can answer whatever they want. The system probably won't understand what it means, but the answer is stored so that someone can consult it later.
    ex. a doctor reviewing a patient's symptoms
  3. Categorization:
    You don't know what kind of answer the user will give, but you can give it different labels.
    ex. happy, sad, angry
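To make the distinction concrete, here is a toy sketch of the three response types. The keyword lists and helper names are invented for illustration; a real system would use proper speech recognition and classification:

```python
def constrained_yes_no(answer: str):
    """Constrained response: only 'yes'-like or 'no'-like answers count."""
    a = answer.lower().strip()
    if a in {"yes", "yeah", "sure"}:
        return "yes"
    if a in {"no", "nope"}:
        return "no"
    return None  # unrecognized: re-prompt the user

def open_speech(answer: str, notes: list):
    """Open speech: store the answer verbatim so a human can consult it."""
    notes.append(answer)

def categorize(answer: str) -> str:
    """Categorization: map free text onto a small set of labels."""
    moods = {
        "happy": ["great", "good", "love"],
        "sad": ["bad", "terrible"],
        "angry": ["furious", "annoyed"],
    }
    for label, keywords in moods.items():
        if any(k in answer.lower() for k in keywords):
            return label
    return "neutral"
```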


Confirmation can be really annoying, but it is also really important. Too much confirmation leads to long, tiresome conversations, but with too little confirmation your VUI might order a flight to the wrong destination, which is just as frustrating. In either case, the user will stop using the application.
There are multiple ways to tackle confirmation.

If your bot asks for too much confirmation, you could work with three-tiered confidence. Everything a user says to the VUI is rated with a confidence score of how sure the system is about what the person just said. Depending on how confident the system is, you ask for confirmation or not.
ex. "I want a latte macchiato"  
<45%: "Sorry I didn't understand, could you repeat that?"
45% - 80%: "So you want me to order a Latte macchiato?"
>80%: "Alright, I will order a Latte macchiato, do you want something else?"
(the percentage is the system's confidence that it understood the answer)
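The tiers above can be sketched in a few lines. The thresholds (45% and 80%) come from the example; the function name and wording are just for illustration:

```python
def respond(item: str, confidence: float) -> str:
    """Pick a reply based on the recognizer's confidence score (0.0-1.0)."""
    if confidence < 0.45:
        # Too unsure: ask the user to repeat themselves.
        return "Sorry, I didn't understand, could you repeat that?"
    if confidence < 0.80:
        # Somewhat sure: ask for explicit confirmation.
        return f"So you want me to order a {item}?"
    # Very sure: confirm implicitly and move on.
    return f"Alright, I will order a {item}, do you want something else?"
```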

You don't need to explicitly ask the user every time whether it's correct; you can confirm implicitly. Instead of saying "Alright, got it", you can confirm the answer by repeating it (as in the example above): "Alright, I will order a latte macchiato".

You can also work with generic confirmation, which is really useful for open questions. A VUI doesn't always need to understand every word, as long as it understands the difference between a positive and a negative answer. Note that you will need a variety of generic answers: if the VUI always says "Thank you for sharing that", it quickly becomes predictable and boring.
ex. VUI: "How was your meeting?"
user: "Well it went good, the clients looked really interested, I should call them next week"
VUI: "That's great to hear, I wrote that down for you."
or ...
user: "They weren't interested at all"
VUI: "Okay, got it, hopefully better next time."
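A minimal sketch of generic confirmation: decide only whether the answer was positive or negative, then vary the wording. The negation cues here are a crude stand-in for a real sentiment classifier, and the reply lists are invented:

```python
import random

POSITIVE = [
    "That's great to hear, I wrote that down for you.",
    "Sounds like it went well, noted!",
]
NEGATIVE = [
    "Okay, got it, hopefully better next time.",
    "Sorry to hear that, I saved your notes.",
]

def generic_confirmation(answer: str) -> str:
    """Only detect polarity, then pick a varied generic reply."""
    negative_cues = ("weren't", "not", "bad", "didn't")
    pool = NEGATIVE if any(c in answer.lower() for c in negative_cues) else POSITIVE
    return random.choice(pool)
```

Picking randomly from a pool of replies is what keeps the VUI from sounding like a broken record.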

Finally, you can also work with non-speech or visual confirmation. If your VUI has a visual aspect, you can show the user's answers on the screen, or confirm with a light or a sound. When a VUI doesn't understand what you said, saying nothing sometimes works better than constantly asking "Sorry, I didn't understand that, could you repeat yourself?".


Getting information at the right time

A VUI is often busy collecting data to perform an action. Users are unpredictable in what information they will provide: they might say "I want a new flight", but also "I want to go to Barcelona next week".
Your system needs to be ready for both scenarios and guide the user through the data-collecting flow.
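This pattern is often called slot filling: the system only asks for the pieces of information the user hasn't already given. A minimal sketch for the flight example, with invented slot names and prompts:

```python
# Hypothetical slots for a flight-booking intent.
REQUIRED_SLOTS = ["destination", "date"]
PROMPTS = {
    "destination": "Where would you like to fly to?",
    "date": "When do you want to leave?",
}

def next_prompt(filled: dict):
    """Return the question for the first missing slot, or None when done."""
    for slot in REQUIRED_SLOTS:
        if slot not in filled:
            return PROMPTS[slot]
    return None  # all slots filled: ready to book
```

With "I want a new flight" both slots are empty, so the VUI asks for each in turn; with "I want to go to Barcelona next week" both are already filled and no follow-up is needed.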

Some existing voice user interfaces solve the problem of not having enough information by simply guessing. For example: "What's the weather in Aalst?". In Belgium and the Netherlands there are multiple cities and villages called Aalst, so the region and country matter for determining the weather. You might expect a follow-up question asking which region, but when asking this question today to Google Home and Siri, they just assume a region. That doesn't feel user-friendly at all.

Receiving too much information can also make it hard for the VUI to focus on a single item. If you ask for the user's favorite hobby and they answer "Oh I love swimming and playing the guitar", which one should your system choose? It should recognize both and try to get a more specific answer by asking "Which one of those do you like to do most?". Keep in mind that you can't always predict the user's answer.

Some unrelated tips

- If doing a questionnaire, give users an idea of how long it is going to take with keywords like "first", "halfway there" and "finally"
- Give positive feedback: "Good job", "Nice to hear that", ...
- Distinguish two types of users: a novice user (someone who is new or only uses it once or twice a month) and an expert user (someone who uses it every day). You don't need to explain something like measuring your blood pressure every time to an expert user, but you do need to explain it to a novice user.
- Negation: keep in mind your system needs to recognize the difference between "I am feeling great" and "I am feeling not great"


Creating a VUI comes with a lot of UX challenges, and we have to think about these topics to make it as powerful as we want it to be. Don't forget that user testing is a really important part of the development. If you are looking for a powerful chatbot platform and an inspiring company to help you with these topics, check us out!

Source: Designing Voice User Interfaces by Cathy Pearl  
