Alexa, Please Kill Me Now
The biggest conundrum in technology is that it’s harder to tell a computer to do something than it is for the computer to do it. Complex and difficult jobs are fairly easy for digital power to accomplish, but instructing and directing the nuance and intention of that complexity remains an evergreen challenge. Thus the rationale for the entire profession of interaction design.
Some people think that the difficulties of directing digital technology will ease dramatically when we perfect the voice interface. That is, when we can simply talk to our computers, interacting with them will become simple, clear, and easy. This notion has been around for decades, and like a tire fire eternally burning in the foothills, it will never subside. As voice recognition software has gotten better — and it’s pretty good, now — the noxious flames mount ever higher.
Our imagination glides to a Hollywood vision of effortless, empathetic conversation, and our machines-of-loving-grace respectfully bow as they retire to do our bidding. They become our sentient and willing servants, responding to our verbal commands. “Fix dinner.” “Let Jen know that I’ll be late.” “Increase sales by ten percent.” “Make sure nobody is spying on me.”
This vision is not just anthropomorphic, it’s fantastical. It’s not just imputing human capabilities onto computers, it’s imputing superhuman capabilities. Just because we can form a thought in our head, we mistakenly assume that someone else can form that same thought based on some noises we make in our throats.
Just because your computer recognizes the words you say, don’t extrapolate from that to assume that it understands what you mean. Your spouse, who has lived with you for 20 years, is only now getting an inkling of what you mean when you talk. Your computer is likely never going to understand you, for the simple reason that the things you say aren’t really understandable.
The long history of confusion, misunderstanding, and failed human-to-human communications should keep us on notice that this assumption is based on what we want and not on what actually exists. If giving verbal instructions to people is so fraught, how are we ever going to effectively give verbal instructions to computers? Lots of people, me included, think this fantasy world will remain an unattainable chimera.
“Alexa, turn off the lights!” is voice recognition capability that’s already here. It’s cool! It’s fun! Amaze your friends! It’s not a killer app, but it is what the technology can do today, so we will see oceans of similar behavior in the near future. Of course, the unintended consequences of every cheesy appliance in your household having built-in voice recognition is remarkably easy to foresee. “Alexa, turn off the lights!” “Not those lights!” “No, the other lights!” “Alexa, just the lights in the garage!” “No, Alexa, turn them off, not on.” “Just the garage lights.” “Damn you, Alexa!”
One of the things that tantalizes and confuses us about conversational user interfaces is that modern software is quite good at speech recognition. Unfortunately, “quite good” is a relative term, depending on what you are trying to do.
Several years ago a good friend of mine with a strong pedigree in the healthcare industry started a company to address the age-old problem of doctors taking notes. Currently, physicians spend nearly as much time jotting down notes as they do examining patients and this product promised to be a great timesaver. My buddy was going to let doctors simply speak those notes into a lavalier microphone as they poked and palpated their patients. The product relied on the very capable Dragon speech recognition platform. Everything worked fine except that it didn’t work fine enough for the needs of healthcare. The doctors found that they still had to proofread the transcription. In mission-critical apps, 99.9% success means a one-in-a-thousand failure rate. When people’s lives are at stake, that’s not good enough.
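The arithmetic behind that one-in-a-thousand failure rate is worth making concrete. Here is a back-of-the-envelope sketch; the note length and patient load are illustrative assumptions of mine, not figures from the story above:

```python
# Back-of-the-envelope: why 99.9% per-word accuracy still fails in the clinic.
# WORDS_PER_NOTE and NOTES_PER_DAY are assumed, illustrative values.

WORD_ACCURACY = 0.999        # one error per thousand words
WORDS_PER_NOTE = 300         # assumed length of a typical clinical note
NOTES_PER_DAY = 25           # assumed daily patient load

# Probability that a single note is transcribed with zero errors.
p_clean_note = WORD_ACCURACY ** WORDS_PER_NOTE

# Expected number of wrong words the doctor must hunt down each day.
expected_errors_per_day = (1 - WORD_ACCURACY) * WORDS_PER_NOTE * NOTES_PER_DAY

print(f"Chance a {WORDS_PER_NOTE}-word note is error-free: {p_clean_note:.1%}")
print(f"Expected transcription errors per day: {expected_errors_per_day:.1f}")
```

Under these assumptions, roughly a quarter of all notes contain at least one error, which is exactly why the doctors couldn’t skip the proofreading step.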
Doctors notwithstanding, there is still significant value to be found with voice recognition in many data entry applications. The latest Apple iPhone, for example, presents a written transcript of my voicemail messages. This is a remarkably handy timesaver because — even though about 20% of the words are skipped or garbled — I can read enough to understand the gist of a message without having to first listen to it.
Recognizing words is not at all the same thing as recognizing meaning, and meaning is critical when giving instructions. The place where voice recognition is most needed is in important, complex applications where the user is already engaged in using their hands and eyes. In the TV commercials, the attractive young woman behind the wheel of the latest luxury automobile says, “Call Robert,” and her handsome young husband answers the phone while she cruises down a recently wetted suburban boulevard.
In my car — which resides in the real world — it goes a little differently. “Call Robert.” “I’m sorry, I don’t understand.” “Call Robert.” “I’m sorry, I don’t understand.” “Dial Robert.” “Did you mean Robert Jones at 555–543–1298?” “Yes.” “Ready.” “Dial.” “Dialing.” At this point I realize that while I was preoccupied with this excessive verbalization, I have missed my exit. From an interaction design standpoint, the user’s every voice command must be considered mission-critical, and that is why most voice-response systems in automobiles are never used once driven off the showroom floor.
Now, imagine that same automotive level of obtuse misunderstanding and slothful, pedantic obstructionism when trying to control a tractor, an assembly line, a jet airplane, or a nuclear warhead. Such command recognition systems are not obtuse by accident. They need to behave that way in order to resolve ambiguity because the one thing that cannot be tolerated is uncertainty in the man-machine dialog. Sadly, inserting voice in the interaction always inserts uncertainty, too, and that, I predict, will never go away.
It is inevitable that we will use more and more conversational user interfaces in the future. This is not because they are better than other interface technologies, but because they are cheaper. They substitute a software program for what would otherwise require a human operator. Cost reduction, not user benefit, drives this evolution.
Other Cooperistas have a lot to say about conversational user interfaces. Here’s Nate Clinton’s keynote speech on the subject. Here’s some very interesting real-world observations about voice patterns from Joe Kappes and Jiwon Paik. Here’s Cale LeRoy discussing the visual design of voice UI.
One of my favorite movies is The Conversation, by director Francis Ford Coppola. This moody, flawed gem is a very intimate and personal film made by the great director in 1974 after his triumph with the larger-than-life The Godfather. Essentially, like any good noir detective story, it’s a character study masquerading as a murder mystery. What’s relevant here is that the characterization, the plot, the theme, who’s the good guy and who’s bad, everything, hinges on the interpretation of the pronunciation of a single word.