January 19, 2014

The uncanny valley for computer speech

While configuring my Vim text editor to play a "clicky keyboard" sound for typing, I found out that you can make it play sounds for other processes as well. A welcome and goodbye message would be nice to hear when you start it up or shut it down -- but where to get a voice sample?

I didn't want it to be a human being saying "Welcome" or "You've got mail." That always sounded lame. It was trying to sound all futuristic, like "Whoa, my computer's talking to me!" -- but it was just a human voice recorded and played back on an ordinary audio medium. No different than hearing a human voice singing when you played a CD on your computer.

As luck would have it, some guy made an emulator of the Speak & Spell toy, famous for its speech synthesizer, and made all of the audio files available for download. Then I used Audacity to string several words into a single message. The download also has some neat "beep boop bee-beep" motifs that now play when the text editor switches between modes.
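
I did the splicing by hand in Audacity, but the same thing could probably be scripted. Here is a minimal Python sketch using only the standard library's wave module. It assumes the word clips are uncompressed WAV files that all share the same channel count, sample width, and sample rate; the function name join_wavs and the file names in the usage comment are made up for illustration.

```python
import wave


def join_wavs(word_paths, out_path, gap_seconds=0.0):
    """Concatenate same-format WAV clips into one file, with optional silence between them."""
    fmt = None
    chunks = []
    for path in word_paths:
        with wave.open(path, "rb") as clip:
            this_fmt = (clip.getnchannels(), clip.getsampwidth(), clip.getframerate())
            if fmt is None:
                fmt = this_fmt
            elif this_fmt != fmt:
                # Mixing formats would garble the audio, so refuse instead.
                raise ValueError(f"{path} has format {this_fmt}, expected {fmt}")
            chunks.append(clip.readframes(clip.getnframes()))
    if fmt is None:
        raise ValueError("no input clips given")

    channels, width, rate = fmt
    # 8-bit WAV samples are unsigned (silence is 0x80); wider samples are signed (silence is 0).
    silent_byte = b"\x80" if width == 1 else b"\x00"
    gap = silent_byte * (round(rate * gap_seconds) * channels * width)

    with wave.open(out_path, "wb") as out:
        out.setnchannels(channels)
        out.setsampwidth(width)
        out.setframerate(rate)
        out.writeframes(gap.join(chunks))


# Hypothetical usage (these file names are placeholders, not the real download's names):
# join_wavs(["machine.wav", "workman.wav", "is.wav", "quiet.wav"], "shutdown.wav", gap_seconds=0.1)
```

A short gap between words keeps the choppy, one-word-at-a-time delivery that makes it sound like a toy rather than a person.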

These messages sound heavily synthesized -- there's no doubt it's a robot talking -- but the meaning or content is still clear.

It struck me how nice it is to hear your computer talking like a computer, when these days the dominant trend is toward trying to make machines sound like people. That Siri thing just sounds weird. The blending of sounds and the intonational contour are almost human, but not quite. It's noticeable right away, and hearing it over the course of a conversation only makes that "wtf am I talking to?" impression stronger.

I tried out some free text-to-speech applications online, such as this site, and they all sound just as weird as Siri does. They fall in the uncanny valley of spoken language, where they're neither distinctly artificial nor distinctly organic. It leaves your brain confused and unable to attend to the conversation because it's constantly trying to resolve who or what the hell is talking to you. They sound like they're crippled by a speech pathology, rather than merely having a foreign accent.

What ever happened to the Stephen Hawking kind of speech device? It didn't creep you out because it was so clearly a machine. The odd phonetic blending and the offbeat intonation made it sound quirky and charming. Yeah, I know we weren't supposed to laugh at someone who couldn't speak on his own. But it was laughing with, not laughing at. Like, "Gadgets, eh?"

Then there's that new movie, Her, about a hipster who falls in love with the feminine incarnation of an Apple device, now with the (almost) fully human voice of Scarlett Johansson. That's even weirder because of the conflicting cues -- organic speech sounds coming from something with the cold, featureless look-and-feel of a desktop computer.

It also undercuts the whole premise of "man falls in love with his computer" (if that's really the main point; I haven't seen it). The whole time, it doesn't sound like he's interacting with a machine but with a person. It's more like he's a lonely schlub who's falling for the phone sex operator he calls every night. Giving her a more Speak & Spell kind of voice would have emphasized the "odd couple" theme -- partners who always seem to be talking past each other just a bit.

But audiences today would not have responded to that approach, given how deeply committed they are to finding emotional fulfillment from websurfing and gaming. Cocooners want their virtual friends to be substitutes for actual friends. Back in the '80s, it would have been about adding a robo-friend as an offbeat complement to your existing social circle. (I think the last example of that was Screech's robot Kevin from Saved by the Bell.)

That different approach -- the robot as a welcome outsider -- would have brought along other choices that would have lessened the "uncanny valley" effect further still.

For example, when I was piecing the messages together from the single words available, it was hard to get across most messages because there were only dozens of words, rather than thousands, to build with. You want it to say something like "file saved"? Well, there are no words like file, work, writing, etc., nor for save, store, or whatever.

I finally settled on "mirror, built," hoping that a human being would understand the meaning from context (taking the steps that save a file). That's right: this little exercise made me try to get inside a robot's mind. The starting-up message: "Blood, is, coming, to, circuit." And for shutting-down: "Machine, workman, is, quiet."

What resulted, then, was a kind of pidgin between two strangers who shared only a handful of words in common, and would try to cobble together phrases that were not exactly straightforward but got the meaning across all the same.

Hearing the text editor use these eccentric phrases makes it sound even less like a normal person than its strange pronunciation does alone. That makes it even more acceptable and easy to get along with, being that much farther away from the uncanny valley. Phrases, slang, and the like are strong markers of in-group vs. out-group membership. The computer ends up sounding more like an exchange student from Mars, a well-meaning fish out of water.

I thought of coining some idioms for it that cannot be decomposed, and thus do not translate, like "let the cat out of the bag," only using the Speak & Spell vocabulary. "Pull poultry," "a reindeer for the ("d") dungeon," and so on. Perhaps when I find another context, as the three basic ones are already taken.

Don't overlook the importance of the "beep boop beep" sounds for avoiding the uncanny valley either. Machines are supposed to make their own, er, machine-sounding sounds. When you start up your car, it hasn't been rigged to make the ignition sound like a dog barking, and the turn signal doesn't make a dripping faucet sound. At the same time, they should sound natural enough to make the meaning intuitive -- a sequence with rising intonation for starting-up or going-into, and one with falling intonation for shutting-down or going-out-of.
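
The motifs I use came with the Speak & Spell files, but the rising-vs-falling idea is easy to try out yourself. Below is a rough Python sketch (standard library only) that writes a few square-wave beeps to a WAV file. The function name write_motif, the note frequencies, and the note length are all assumptions picked for illustration, not anything taken from the toy.

```python
import math
import struct
import wave

# Hypothetical pitches in Hz: ascending for "starting up", the reverse for "shutting down".
RISING = [440, 554, 659]
FALLING = list(reversed(RISING))


def write_motif(path, freqs_hz, note_seconds=0.12, rate=22050, volume=0.4):
    """Write one square-wave beep per frequency as a 16-bit mono WAV file."""
    samples_per_note = round(rate * note_seconds)
    amplitude = int(32767 * volume)
    frames = bytearray()
    for freq in freqs_hz:
        for i in range(samples_per_note):
            # A square wave is just the sign of a sine wave: harsh, buzzy, clearly electronic.
            phase = math.sin(2 * math.pi * freq * i / rate)
            frames += struct.pack("<h", amplitude if phase >= 0 else -amplitude)
    with wave.open(path, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(rate)
        out.writeframes(bytes(frames))


# Hypothetical usage:
# write_motif("enter_mode.wav", RISING)
# write_motif("leave_mode.wav", FALLING)
```

The square wave is a deliberate choice here: it sounds like a gadget, not like a flute or a voice, which is the whole point.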

On the computer, I chafe when I hear goofy sound effects that already belong to some other thing (such as wood thudding against a ceramic tub), gentle Zen whooshes that belong in a spa, or anything else that tries to make it sound organic rather than like the artificial and electronic gizmo that it is. Don't confuse my brain -- make the computer sound mechanical and electronic. Things ought to sound within the range that we'd expect them to, given their main properties.

7 comments:

  1. off-topic, but Steve Sailer has an interesting new article: "New York Times, 1986: Boyhood Effeminacy and Homosexuality". I'm sure you've seen it, but what the heck.

    http://www.nytimes.com/1986/12/16/science/boyhood-effeminancy-and-later-homosexuality.html?pagewanted=all

    -Curtis

  2. something I've been wondering about, do technological advances correlate with a more outgoing culture?

    -Curtis

  3. I don't think so. The airplane, the transistor, and the computer are all mid-century creations.

    Particular uses or attitudes toward tech are what seem to change along with the social mood.

  4. Airplane -- I mean civilian airliners (modified WWII bombers) and their industry.

  5. There was a movie called "Simone" which covered some of the same subject matter, the title character being a virtual film actress.

  6. Have you heard Hatsune Miku?

    I think they use a real woman to provide the basic samples at some level, but for the most part, she's a 99% computer-synthesized voice.

  7. Those technologies came out of the war, though. Didn't the crime rate rise at some point in the 40s?

    A lot of technology was invented in the 1900-1930 period (rising crime): the radio, flight, tanks, etc.

    -Curtis

