The state of Conversational AI
Talking with machines that talk back
This article is a part of the Battery001 issue: An issue dedicated to Voice!
Nothing beats a good conversation: exploring topics of shared interest, exchanging ideas and experiences. And as much as I love that, I loathe the opposite, where no topic triggers any interest and every question just drops dead.
That is how it feels to talk to Siri, Apple’s digital assistant. Nothing interests her, most things she doesn’t know anything about, and when she does her response is usually not that exciting. Her colleagues seem to be slightly ahead, but as conversation partners they all suck.
Our dead boring assistants are examples of what is called Conversational UI and Conversational AI (oh the irony!). There seems to be some work to be done here. What is actually required to get the conversation going?
What is a conversation?
I’ll let Oxford Dictionaries chime in:
A talk, especially an informal one, between two or more people, in which news and ideas are exchanged.
Makes sense if we ignore the “people” part. Here is a typical Alexa user interaction:
Alexa, what time is it?
The time is twenty to five
Practical, but a conversation? I doubt that anyone would describe this as “I just had a conversation with Alexa about time”. Which, on the other hand, sounds really intriguing: what did she say?
How about this then:
Alexa, do you know what time it is?
It is twenty to five. Why do you ask?
…?!
Whoa, a conversation starter! That’s much better! I initiated the conversation because I wanted to learn the time. Alexa obliged but also asked a follow-up question. As if she had a will of her own…
A real conversation is a dance
Human conversations are genuinely fascinating when you start picking them apart. Take this short excerpt from a phone call between two friends who haven’t talked for a while and are catching up:
B: Um, have you talked to… what’s her name…Sheila?
A: Sheila who?
B: Sheila…what’s her name…Nadine and them’s sister
A: White
B: You see her still?
A: The twins getting ready to graduate girl
B: Really?
A: Yes. I know we getting old
There is so much going on here. It is like a dance: they are taking turns, mixing questions, answers, prompts, implicit knowledge, assumptions, clarifications and acknowledgements, all within 8 lines. A dance.
The example is from this paper which provides a detailed breakdown and analysis of the same lines and more, great stuff!
The state of virtual conversations
The breadth, depth and complexity of human conversation are immense and very hard to replicate in a virtual personality. All behavior has to be replicated in a system of rules, i.e. logic. Whether that is even possible is a topic for a nice, long, late night philosophical conversation, probably not with a machine. Regardless of your point of view, the challenge is massive.
Most conversations don’t even have a clear goal. They move from topic to topic, and at some point both parties decide that the conversation is over, that it has fulfilled its purpose. Good luck with figuring that one out.
There has been some amazing progress lately. Google Duplex’s ability to mimic a human is super impressive, but it is strictly limited to a specific domain. Try saying something outside the script, like offering a haircut when it asks for a dinner reservation, and see what happens.
Chatbots are generally much more conversational than assistants. Mitsuku, the four-time winner of the Loebner Prize, is able to respond meaningfully to almost everything you write, an incredible achievement! Still, it is hard not to notice that I, the human, am driving the interaction. Mitsuku has no will of her own.
Intention as a driver
Intentions, or intents, are a core concept in platforms for creating conversations. The model assumes that the user wants to do something, e.g. order a beverage, more specifically a double espresso macchiato. Ordering a beverage is then defined as an intent, and each configuration option (size, type, add-on) is defined as an entity.
If you are able to cram all this into a single sentence then you’ll just get your coffee. If you leave some entity out, then the bot asks for the missing entities. And that is as fun as it gets.
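The intent-and-entity pattern described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular platform implements it: the `OrderBeverage` class, the `ENTITY_VALUES` vocabulary and the reply wording are all assumptions made up for the example, and real platforms use trained language understanding rather than keyword matching.

```python
import re
from dataclasses import dataclass, field

# Hypothetical vocabulary: each entity of the "order beverage" intent
# and the words that can fill it.
ENTITY_VALUES = {
    "size": ["single", "double"],
    "type": ["espresso", "latte", "cappuccino"],
    "add_on": ["macchiato", "vanilla", "caramel"],
}


@dataclass
class OrderBeverage:
    """One intent with three entities (slots) to fill."""

    slots: dict = field(default_factory=dict)

    def fill(self, utterance: str) -> None:
        # Naive keyword matching stands in for real language understanding.
        words = re.findall(r"[a-z]+", utterance.lower())
        for entity, values in ENTITY_VALUES.items():
            for value in values:
                if value in words:
                    self.slots[entity] = value

    def missing(self) -> list:
        return [entity for entity in ENTITY_VALUES if entity not in self.slots]

    def respond(self, utterance: str) -> str:
        self.fill(utterance)
        missing = self.missing()
        if missing:
            # The bot only ever asks for the next empty slot.
            return f"What {missing[0].replace('_', ' ')} would you like?"
        return "Coming right up: {size} {type} {add_on}.".format(**self.slots)
```

Calling `OrderBeverage().respond("I want a double espresso macchiato")` fills every slot at once and confirms the order, while a shorter request such as "an espresso, please" makes the bot ask for the size next. Either way the machine's only initiative is to chase empty slots.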
Machines can have intentions too
The terminology of conversational platforms is built on the assumption that it is the user who has intent and drives the interaction. Google Duplex turns this on its head. Now it is the machine that seems to drive the conversation, that has intent.
The human fulfills the need of the machine…
However, it is an illusion. Duplex is still just a much fancier way to fill in the blanks, and when they are filled the interaction is over.
No conversation.
It takes two to tango
Think about reconstructing the human-to-human conversation quoted above in a chatbot. The driving party in the conversation is constantly switching, intents are running in parallel, questions are answered by the party that asked them. A beautiful mess.
For a conversation to happen, it is not enough that one of the parties is driving the interaction. We need at least two parties with a will of their own, with their own goals and intentions.
Fake it till you make it
I think we have to wait a bit more for the truly conversational assistant. But I am a fan of fake it till you make it. We will get better and better at mimicking human behavior and at some point we will surprise ourselves.
Before getting to a conclusion: we have had the good fortune to work on several projects where the goal of the conversation wasn’t utility but to give the user the impression of talking to a real person. Below are some examples of our experiments in the fake conversational space.
Nissan Infiniti — Deja View
Deja View is an interactive film where you interact with the characters in the film by talking to them. During the experience the characters will pick up their phones and call your real world phone. What you say to them during those calls determines what will happen next and how the story evolves.
Here is how it works. The user arrives at the site, calls a phone number on the screen and gets a code, enters this in the browser and the experience starts. The first scene in the film shows the two main characters sitting in a car, waking up with no memory of who they are or where they come from. Looking at their mobile phones they discover that someone has called them several times, and the number on the screen is YOUR phone number. They decide to call the number and your mobile rings in real life.
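The pairing step at the start (call a number, get a code, enter it in the browser) is what lets the film know which phone to ring later. A minimal sketch of that idea follows. It is an assumption about how such a flow could be built, not a description of the actual Deja View backend: the `PairingService` class, its method names and the four-digit code format are all hypothetical.

```python
import secrets


class PairingService:
    """Links a browser session to the phone number that called in."""

    def __init__(self):
        self._pending = {}   # one-time code -> caller's phone number
        self._sessions = {}  # browser session id -> phone number

    def on_incoming_call(self, caller_number: str) -> str:
        # Generate a short code to read out to the caller (format is assumed).
        code = f"{secrets.randbelow(10_000):04d}"
        while code in self._pending:
            code = f"{secrets.randbelow(10_000):04d}"
        self._pending[code] = caller_number
        return code

    def on_code_entered(self, session_id: str, code: str) -> bool:
        # Codes are single use: a successful match removes the pending entry.
        number = self._pending.pop(code, None)
        if number is None:
            return False
        self._sessions[session_id] = number
        return True

    def number_to_call(self, session_id: str):
        # Later scenes would use this to ring the viewer's real phone.
        return self._sessions.get(session_id)
```

Once `on_code_entered` succeeds, the browser session and the phone are tied together, so a scene in the film can trigger a call to `number_to_call(session_id)` at the right moment.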
Here is a screen capture of a conversation.
The core concept of Deja View was to break the fourth wall. This was achieved in several ways:
- the characters in the film that is shown on your computer call you on your mobile phone. The physical separation makes it feel a little magical even if you can figure out how it is done.
- talking on the mobile is something associated with “real” humans, not with characters in a film.
- what you say outside the film affects how the story in the film evolves. The characters even reference what you said during the call in the film, using your own words.
- every conversation builds on what you said earlier, giving the feeling that the characters remember you.
It all adds up, and the experience of talking to the characters becomes much more believable and emotionally engaging.
In the call the characters are driving the conversation, i.e. they have intent. That makes it easier to ensure the conversation leads to something worth experiencing for the user, and it also gives us enough data to drive the story in a meaningful way.
There is a lot to say about this project, it was exceptional in many ways. I have described the project on a general level here.
Slaves for Santánico
Slaves for Santánico was created to generate awareness and excitement for the premiere of From Dusk Till Dawn: The Series, a TV show based on Robert Rodriguez’ film of the same name. We were invited to bring the show’s most infamous character, Santánico Pandemonium, to life as a virtual character.
The channel for the conversation was “Santánico’s Party Line”. Talking urinal cakes and burlesque posters were placed in dive bars, promoting Santánico’s 1-800 number. When fans called they had a conversation with Santánico, where she demanded they prove their devotion. She responded to callers based on their originality, creativity and depravity. Her favorites were immortalized on the Tumblr: SlavesForSantanico.com.
We developed the conversation logic that made her come to life and brought out the true depraved creativity of her fans. We set up the phone system including a custom phone server, speech recognition, natural language processing, dialogue logic etc.
Compared to an online campaign, the activation in the physical world made the call feel more real. But we wanted to take the experience of the conversation to the next level. We added background noise and sound effects of the room where Santánico took your call, disturbances to the phone line etc. This served two functions. First, it made the call feel like a real phone call, not a fake one. Second, it used the power of misdirection, making the user slightly disoriented and no longer focused on whether Santánico is real or not.
In the same way as for Deja View, it all adds up. Each detail might not be that important in itself, but the whole is greater than the sum of its parts: the conversation becomes more real and the user gets a much more immersive experience.
Cisco — Internet of Everything
Together with GSPSF we invited Michael T Jones, then Google’s Chief Technology Advocate, to explain the concept of the Internet of Everything. “Everything” is a pretty big topic to cover in an interview, so we wanted to give users the possibility to steer the interview toward what interested them most. An interactive interview! Or as Mr Jones puts it, a real two-way conversation.
In this project the whole experience was in the browser. There was no fourth wall to break; the aim was to present content in a novel and useful way. Users activated the microphone in their browser and could then both navigate the topics and ask questions. Menu options were available for those not using voice.
To enhance the experience we used the user’s mobile phone as a second screen. This content was available not only during the interview but also afterwards, as a list of interesting topics to explore deeper.
While Deja View and Santánico were immersive and emotional experiences, this was more about function and utility. Still, using voice made the experience much more personal and human.
Conclusion
We have to wait a bit more for a truly conversational assistant. The data needed for training to get beyond dinners and haircuts quickly becomes unmanageable, and it is an infinitely long way to the open world of human conversations.
The current tools for creating dialogs are not made for real conversations. The user is the only one driving the conversation and the machine is only filling the slots. Furthermore, in a real conversation, not only does the intent move between the parties, it is constantly in flux. The model of intents and sub-intents simply does not work for creating the organic flow of a conversation.
That doesn’t mean that we should stop trying! We will get better and better at mimicking human behavior, and suddenly we will have created something that not only passes the Turing test but also gets our emotional and intellectual approval to become a part of the conversation.
And when we get there, what will we talk about?