Canoe here be dow?
May 3, 2010 11:14 PM Subscribe

Oh, punt appalled bait oars, Hal. Why Computer Speech Recognition hasn't gotten any better since 2001. Or bed her sin stew thou send Juan.
posted by oneswellfoop (106 comments total) 21 users marked this as a favorite

Speech recognition? I'm afraid I can't do that, Dave.
posted by zarq at 11:16 PM on May 3, 2010 [6 favorites]

Be sure to read Jeff Foley's comment for an alternative perspective. Conversational speech recognition has stalled to a certain extent, but task-based SR is charging ahead with great results. I figure it'll tide us over until we manage some real AI.
posted by wemayfreeze at 11:22 PM on May 3, 2010 [2 favorites]

Actually, you should say it hasn't gotten any better since 1984, when IBM achieved "95% accuracy" on a mainframe.

See also their splendid 1962 voice recognition hardware.
posted by shii at 11:22 PM on May 3, 2010 [1 favorite]

Before the onslaught of puns, please consider the work of renowned Professor Afferbeck Lauder.
posted by Fiasco da Gama at 11:28 PM on May 3, 2010 [2 favorites]

Sorry, I did not understand your response. To hear the menu again, please press 1.
posted by crapmatic at 11:32 PM on May 3, 2010 [30 favorites]

I find the spoken menu systems to be much easier to use on my cellphone, actually, since it's a touch screen.
posted by !Jim at 11:39 PM on May 3, 2010

Cod ram plucking pizza ship.
posted by The otter lady at 11:41 PM on May 3, 2010 [43 favorites]

Also, pressing a button is *silent* on my end. I don't have to disturb the people around me, and if someone else is using one of the systems, they don't bother me.

50 irate customers trying to rebook their cancelled flights on their cells, standing at the gate together, makes for a lot of "I said one you stupid machine!" "Oh, you don't understand when I swear at you? $#%@#%! " etc.

In other news, two thoughts about the actual article:
1) how much does which language you're using matter? English is not such a great example, as so much of its lexicon is borrowed from many different sources; would a language without so much, uh, corruption, be easier? (Does such a thing even really exist?) Or one that has fewer phonemes, or maybe more? Maybe even just a language which has a more phonetic spelling system would help; if one can translate sound to writing, perhaps it's easier to search for reasonable word boundaries within that writing.

2) The estimate for the number of possible sentences (given as 10^570) is oddly close to the estimate for the number of possible string theory vacua (10^500 give or take a bunch of orders of magnitude). In fact I can't think of anywhere else I've seen that big a number. Woah.
posted by nat at 11:43 PM on May 3, 2010 [1 favorite]

Before the onslaught of puns
posted by Fiasco da Gama

Ebony stair decal.
posted by UbuRoivas at 11:43 PM on May 3, 2010 [29 favorites]

Amen, afroblanco. I honestly think anybody behind adopting those things should be fired.
posted by weston at 11:45 PM on May 3, 2010 [2 favorites]

There are really two questions: can computers deal with meaning; and can they simulate dealing with it well enough that their error rate is in the human range.

On the first question, we have virtually no practical idea what meaning is or how it works, no real progress has ever been made, AI labs have consistently ignored the issue, and there are theoretical reasons to think computers may never be able to deal with it.

On the second, brute force has achieved a great deal, but the claim here that existing technology has reached its limits seems plausible to me.

Overall, that may be a good thing: it's useful (possibly life-saving) to be able to tell the difference between entities that understand you and entities that only seem to.
posted by Phanx at 11:46 PM on May 3, 2010 [2 favorites]

Ah - "Open the pod bay doors, Hal". Took me a minute to figure that one out.
posted by w0mbat at 11:52 PM on May 3, 2010

So, upshot:

1. We need artificial intelligence to make speech recognition work;
2. We need speech recognition to make artificial intelligence work.

One obvious question that provokes: does that mean they're the same thing?
posted by Malor at 11:53 PM on May 3, 2010 [2 favorites]

I'm pretty sure that voice response systems are only there to teach humans something about the futility of getting frustrated at a machine.

Oh gawd! AT&T customer service has the WORST voice response system ever. I pay my bill by phone every month, and I've gotten used to it, so it doesn't drive me insane anymore, but the funniest part is at the very end, when the disembodied computer voice says "Ranking your experience on a scale of 1 to 5, with 1 being very dissatisfied and 5 being very satisfied, how would you rate your experience?"

I always say a very firm and very emphatic "ONE" (i.e. very dissatisfied) and the voice every-so-pleasantly answers "I'm sorry to hear that. Goodbye!"
posted by amyms at 11:55 PM on May 3, 2010 [2 favorites]

It usually goes something like this:

"For technical support, say 'support'; for accounts, say 'accounts'; for sales, say 'sales'. To hear these options again, say 'options'."

"Cabbages."

"I'm sorry, I don't understand. For technical support, say 'support'; for accounts, say 'accounts'; for sales, say 'sales'. To hear these options again, say 'options'."

"Cabbages."

"Connecting you to the first available consultant. You are ^third in the queue."

(on-hold music)

This will generally bypass all the sub-menus and get me in as fast as possible. Which is usually not terribly fast, but at least I don't waste much time talking to idiot bots.
posted by flabdablet at 12:02 AM on May 4, 2010 [3 favorites]

flabdablet: I've heard that most interactive voice recognition systems are programmed to recognise swear words, and direct those customers straight to the human operator queue.

In other words, skip the polite "cabbages" and jump straight in with the FUCKING SHIT FUCK ASSHOLE SYSTEM!!! and your customer experience will be greatly enhanced.
posted by UbuRoivas at 12:07 AM on May 4, 2010 [5 favorites]

FTA: "We use grammar all the time, but no effort to completely formalize it in a set of rules has succeeded. If such rules exist, computers programs turned loose on great bodies of text haven’t been able to suss them out either."

emphasis mine
posted by idiopath at 12:08 AM on May 4, 2010 [4 favorites]

Astronaut: Oh, punt appalled bait oars, Hal.
HAL: Silly country state your request.
Astronaut: Oh, punt appalled bait oars, Hal.
HAL: Scuze me while I kiss this guy!
Astronaut: Oh, punt appalled bait oars, Hal.
HAL: Thank you falettinme be mice elf agin!
Astronaut: *disconnects own air supply*
posted by pracowity at 12:15 AM on May 4, 2010 [4 favorites]

"Cabbages" performs the same function, in my experience. Since there's some chance that the human operator will already have been listening to my responses to the bot, and since it's never the call centre peon's fault that the VRS sucks, I prefer to reserve displays of hostility until they're tactically appropriate.
posted by flabdablet at 12:15 AM on May 4, 2010

In theory the systems should work fine with a limited vocabulary of options, like with the phone systems. I've never had a problem with them.
posted by furiousxgeorge at 12:18 AM on May 4, 2010

no mention of the blind? Geez, sorry speech recog didn't save you any time.
posted by sredefer at 12:20 AM on May 4, 2010

sredefer: "no mention of the blind?"

The blind computer users I have known have all been perfectly happy with using keyboards. Hell I have working eyes and I don't need to look at my keyboard.
posted by idiopath at 12:23 AM on May 4, 2010

Hey, I just got the title of this post ... it's Hoffnung, in 1956 at the Royal Festival Hall!
posted by woodblock100 at 12:23 AM on May 4, 2010

In my experience the problem with VR menus is not that they get the menu choices wrong, it's that they're often structured as four-deep submenus that don't recognize anything until they've finished droning their way through all the options. Hence, cabbages.

On the non-menu front, Telstra has recently implemented a voice-to-SMS system for leaving messages to mobile phones you can't reach straight away. The results are frequently hilarious. I've never seen it actually get one right.
posted by flabdablet at 12:24 AM on May 4, 2010

I use the same technique on voice response systems as I do on telemarketers: I just keep saying "cabbages" until they do what I want (telemarketers = go away, VRS = connect me to a human). If a VRS won't play nice at cabbages, I take my business elsewhere.

That is so funny! I say "blueberry." Somehow I don't get nearly as angry when I continually say "blueberry" to a robot voice on the phone. It's like a mantra of calmness.
posted by The Light Fantastic at 12:26 AM on May 4, 2010

Also in the same vein of puns the apocryphal exchange between Doug Anthony the then leader of the Country Party and Gough Whitlam the then Australian Prime Minister—
"I'm a country member..."
"We remember!"
posted by Fiasco da Gama at 12:27 AM on May 4, 2010 [6 favorites]

For savoy, say 'savoy'; for napa, say 'napa'; for bok choy, say 'bok choy'; for late flat dutch, say 'late flat dutch'; for early jersey wakefield, say 'early jersey wakefield'; for danish ballhead, say 'danish ballhead'; for meteor, say 'meteor'; for red rodan, say 'red rodan'; for ruby ball, say 'ruby ball'; for scarlet o'hara, say 'scarlet o'hara'; for green say 'green'. To hear these choices again, say 'cabbages'.
posted by tellurian at 12:27 AM on May 4, 2010 [28 favorites]

at least we're obeying Cole's Law.
posted by oneswellfoop at 12:32 AM on May 4, 2010 [12 favorites]

idiopath: The blind computer users I have known have all been perfectly happy with using keyboards.

fair. But can you see my point? i.e. There are blind people you don't know.
posted by sredefer at 12:35 AM on May 4, 2010

eye hefcott sawb hauls
posted by uncanny hengeman at 12:44 AM on May 4, 2010

sredefer: "But can you see my point?"

Yes, but not being an expert on the subject I can only offer anecdote and speculation. So now I will add speculation:

A well designed tactile input device should be no harder for a blind person than a sighted person, and should be quite a bit more efficient to use than the current sorry state of speech recognition.

A blind person who also has problems with touch or proprioception would obviously be very well served by voice recognition.
posted by idiopath at 12:46 AM on May 4, 2010 [2 favorites]

Ubu: in my experience, that isn't true.
posted by jacalata at 12:49 AM on May 4, 2010

And by well designed tactile input I mean a standard keyboard with the little bumps on the f and j keys (or whatever keys are in those positions for another regional keyboard).

A blind user should also probably be using a text based interface rather than a shape/color/pointer based one, but that is an ancient and very well understood thing in the world of computers, and very easy to implement.
posted by idiopath at 12:51 AM on May 4, 2010

If that ever happens, tellurian, I'm a gonna cabbage you.
posted by flabdablet at 12:54 AM on May 4, 2010

To use your keyboard to interact with the menu system, press 'one' now. To use the speech recognition system, say 'FUCKING SHIT HELL GODDAMN.' To use the speech recognition system en Francaise, say, 'Nique ta mère.'

Reading this article made me think that it would be good to round up funding for a Netflix-prize like competition, geared towards conversational speech recognition. Ok, the crowd can find movie recommendations for a million dollars, Let's see what they can do with a more fundamental problem... Given the counter-point comment, I'm a bit less convinced it's a good idea though; the capitalists might already be doing as good a job as one could hope for.

The other thing this makes me think of is that we should be able to write some much, much better text adventure interpreters than what currently exist. These still run accept the same level of input that was common in the early eighties, mainly due to the expectations within the genre. "x mailbox" is perfectly good for interacting with the system, but for new players it should be able to take a sentence like "I want to look at the mailbox," and deduce that the player wants to "x mailbox." It could then suggest the shorter phrasing, to save the player future typing. Such a system should also try to automatically correct typing mistakes:
> I wnat too lok at the malebox.
I'm sorry, I think you said, "I want to look at the mail-box." Is that correct?

It's a highly constrained, typed text environment, so this should be quite feasible...
Maybe I'll go try to implement something this summer.
posted by kaibutsu at 12:56 AM on May 4, 2010 [3 favorites]

"> I wnat too lok at the malebox."

I have played a MUD where the parser was intentionally very strict, and if you typed something like that, the game would reply with something like "derp" and everyone around you would see "<name> drools and stares off into space like an idiot". People learned typing skills pretty fast (not to mention that swift and accurate typing is a life or death matter in a MUD - best typing tutor ever - just hope you don't lose your job and your girlfriend and flunk out of school from the addiction).

The problem with speech recognition is that voices (and the technology for capturing voice) are extremely more variable, subject to noise, and error prone than typing is.

Imagine the problem of building a parser that has to cope with lisps. And people from Scotland. And people from Boston. And people who have immigrated to Boston from Scotland and speak with a mix of both accents - with a lisp. We have immense and very powerful heuristic computation systems, with extensive mutability and capacity for backtracking in order to understand the sounds we hear when someone else speaks. It is no wonder that computers are so lousy at the job.

At least for English, there is a much lower entropy measure for written as opposed to spoken speech - thus it is easier to detect errors (an interesting side effect of this fact is that acoustic crosswords would be much easier to build than textual crosswords).
posted by idiopath at 1:24 AM on May 4, 2010 [3 favorites]

Language is an order of magnitude more analog than other human systems. I mean, jesus, adult human beings can't parse meaning out of a language other than their primary one without grinding, protracted effort. I see no reason to believe we'll be successful in getting computers' semantic understanding up near human levels any time soon. Throwing distributed processing power at Internet-scale training datasets seems more likely to yield improvement over the near term.

Aside: why the hell do IVR systems bother to read you all the options (and in a tone of voice similar to the way one might talk to a particularly slow child)? I just say "agent, agent agent, agent, AGENT" over and over until I get bounced out to a live human. Works at least 80% of the time.
posted by killdevil at 1:28 AM on May 4, 2010 [1 favorite]

Computer, please.
posted by hal9k at 2:23 AM on May 4, 2010

Cry sit what in iss howle.
posted by GilloD at 2:26 AM on May 4, 2010 [1 favorite]

whale oil beef hooked

hoof hearted? ice melted

And that's all from me - I PROMISE.
posted by uncanny hengeman at 2:39 AM on May 4, 2010 [1 favorite]

To the contrary, I was recently surprised to find how damn good voice recognition has become. Two years ago my father had scanned 13,000 photos and wanted to add dates and captions. He started typing, but he's purely hunt-and-peck and it would have taken years. I had him try Dragon Naturally Speaking and was amazed at how good it was.

A few months later my mom commented that he now "spends all day on the computer talking to himself!" He had a great time doing it, basically re-living his life by talking about it one picture at a time. By the end the system had learned all the names of our family members, towns we'd lived in, and other family-specific vocabulary. He completed it all in just a few months.

Here's a short video about his project. Quote from the video about speech recognition: "It's like a miracle! He just says what it is and it types it for him. What a time saver."

I second the above commenter: go read Jeff Foley's comment in the original post. He worked on the product that my dad used and makes an excellent rebuttal.
posted by wanderingstan at 2:49 AM on May 4, 2010 [9 favorites]

I find it interesting that these discussions are always entirely anglophone and anglo-centric. There are languages that I can only imagine are much worse than English for recognition (due to the similarity of many sounds, and the lack of proper enunciation) such as Danish. By the same logic, there must be some languages with a clearer spoken language, more differentiated phonemes, etc. Perhaps German?

(I know nothing about linguistics, and if someone who knows what they're talking about could shoot me down, that'd be awesome)
posted by Dysk at 3:05 AM on May 4, 2010

Speech recognition is BS not because computers can't recognize speech but because recognizing speech is the EASY part of understanding meaning in a spoken context.
posted by DU at 3:05 AM on May 4, 2010 [1 favorite]

DU: "not because computers can't recognize speech"

When most people say "speech recognition", they don't mean "identifying the set of sounds uttered" (yes, easy), but rather "identifying which word you mean by that set of sounds" (which computers fail at, hard). See also "recognizing pitch content of a recording" (easy) vs. "recognizing the notes being played by the instruments you recorded" (very hard and not very reliable).

Computers are good at sensing and analyzing, but that is not yet recognizing. With the current state of the art, computers are pretty bad at recognizing, especially with recognizing sound.
posted by idiopath at 3:19 AM on May 4, 2010

A blind person who also has problems with touch or proprioception would obviously be very well served by voice recognition.

Or a sighted person with carpal tunnel and pinched nerves at C6 who has to type all fucking day long.
posted by FelliniBlank at 4:34 AM on May 4, 2010 [1 favorite]

Once computers get good AI for speech recognition, will they even care what we have to say? I mean, I could dictate a letter or an essay, but my computer will just criticize my grammar and vocabulary. I might as well just let the computer come up with its own ideas, rather than dealing with that primadonna.

God made computers with keyboards for a reason.

But that said, text-to-speech is more interesting. Once computers can speak with proper emotion (10 minutes after the singularity, probably), it will kill the audiobook industry. Plus, computers could deliver error messages empathetically and uniquely each time (like how a person never says a phrase the exact same way twice), which would make them less frustrating.
posted by mccarty.tim at 4:39 AM on May 4, 2010

Recently:

Me: "Pay and talk account"
Telus: "Sorry, I did not understand your response"
Me: "PAY . AND . TALK . ACCOUNT"
Telus: "Sorry, I did not understand your response"
Me: "That's because you're a fucking computer"
*Pause*
"Hello, my name is Darren. May I help you?
Me: "Yes. Your speech recognition program doesn't recognize "pay and talk account", but apparently it recognizes "fucking".
posted by weapons-grade pandemonium at 4:40 AM on May 4, 2010 [9 favorites]

I use speech recognition every day for my work, and my experience matches the Foley comment that wemayfreeze calls out. I hesitated for a long time, because my typing speed is in the low 90s and I didn't think it would make much of a difference, but it has upped my output by about 30%. Although you do have to watch out for the "speakos", on the other hand it never transposes letters, misspells anything, or accidentally enters "contract" when I mean "contrast".

It still has a hard time with the little words, but it can also be uncanny in its ability to recognize certain things. For example, yesterday I was translating a paper in the field of paleozoology, discussing remains from a site in Spain called "Kiputz"--an alternate spelling of a Basque word that basically appears nowhere in the English-language literature on anything. And yet, because the word appeared half a dozen times elsewhere in the untranslated text, when I spoke it into the microphone, out it popped on the screen--Kiputz--right as rain, and likewise each time thereafter, never as "kibbutz" or "kaput is" or "cap huts". Freaked me right the fuck out.

Which leads me to my one gripe about speech recognition (I'm using Vista's built-in SR, btw, which is damn good, and I've heard may be better than NaturallySpeaking in some regards) is that it sucks balls for recognizing curse words. No doubt this is done on purpose to prevent embarrassing speakos from slipping through the proofreading process, but I wish there was an NC-17 setting for when I'm working with police interview transcripts, for example.
posted by drlith at 4:47 AM on May 4, 2010 [7 favorites]

drlith: "it has upped my output by about 30%"

I am using an operating system without any good speech recognition. Is the linked article wrong about the error rates of the speech recognition or am I failing to grasp something about your workflow?
posted by idiopath at 4:58 AM on May 4, 2010

I know how they can instantly get computer speech recognition to be more humanlike; do what I do. Pretend I heard/understood what a person said and then change the subject.
posted by digsrus at 5:04 AM on May 4, 2010 [4 favorites]

I'm listed in the acknowledgements of a PhD thesis for phoneme-based speech recognition out of MITs LCS in the mid-90's. Not for any theoretical, mathematical or other meaningful contributions but for my poker skills, or lack thereof. A nice side effect was that I received a crash course in Hidden Markov Models and some of the more interesting aspects of Draper Labs. I don't think any of the people I got to know stayed in the field for more than another 5 years. There was substantially more money in using their math and modelling skills in the financial sector.
posted by michswiss at 5:10 AM on May 4, 2010

Cries, wad and as hoe.
posted by mr_crash_davis mark II: Jazz Odyssey at 5:21 AM on May 4, 2010 [3 favorites]

For gods fucking sake, IT DOESN'T SAVE ANY TIME.

You misunderstand - your time is an externality, of no concern to the company. It saves plenty of time for the paid employee of the place you are trying to do business with.
posted by Meatbomb at 5:24 AM on May 4, 2010 [2 favorites]

It seems to me despite its limitations, voice recognition does have some uses. For example singing into a voice recognition system could result in an effective mondegren generator.
posted by TedW at 5:37 AM on May 4, 2010

A few years ago, I fractured my right shoulder, and ended up using voice input for a semester. I got fairly good at navigating formal writing, although it required a fair amount of backspacing and second tries, but my IMs became a lot more amusing to my friends. I think they would start conversations just to get gibberish....
posted by GenjiandProust at 5:43 AM on May 4, 2010

On a more serious note, I have voice activated phone and navigation in my car, and even with a vocabulary limited to numbers and a few hundred commands, the results are sometimes laughable. When a say a phone number I dial on a regular basis, many times it will come out right, but others it will be right for all but one or two digits (which makes it useless as a phone number), and still other times it will be almost completely wrong. Based on my experience with voice activated dialing in my car, I haven't even tried the voice control feature on my iPhone even after a year. I wonder how others experience in using voice activated phones compares with mine and what the overall accuracy of just this one subset of voice recognition is. It seems to me if voice dialing still has a long way to go, then more complex speech recognition is still about as attainable as nuclear fusion as an energy source.
posted by TedW at 5:47 AM on May 4, 2010

For further exploration of the task-based approach to SR, check out Siri CTO Tom Gruber's keynote at the semantic web conference. It really opened my eyes to how much is possible with task-based SR. The Siri iphone app can really feels like magic. Essentially, by stringing together Web apps they're able to achieve a super high level of functionality using only speech as the input. Along with great recognition, this makes for an experience that feels like you could tell it anything... "Siri, I've been having trouble, uh, standing at attention if you know what I mean?"

Siri was recently acquired by Apple (did someone mention that already?), which is prettify exciting. I'm imagining a world where even deep functionality in every app on your device is accessible through a unified speech interface.
posted by wemayfreeze at 5:51 AM on May 4, 2010

You misunderstand - your time is an externality, of no concern to the company. It saves plenty of time for the paid employee of the place you are trying to do business with.

Perhaps it's you that misunderstands. Whether the automated system asks you to speak your choice or asks you to press 1 on your phone won't make a lick of difference to the paid employees. However, it will save untold amounts of frustration on the part of the customer if he didn't have to wrangle with a half-assed, technologically challenged system that insists he speak to the computer (which can't understand a staggering number of people) and could instead just press the damn "1" on the phone, which is easily recognized by machines that are decades old.
posted by splice at 5:55 AM on May 4, 2010 [1 favorite]

A blind person wouldn't have any problem learning to use a keyboard with Braille labels.

Also, voice recognition will probably work if you speak clearly and enunciate

---

One important question is: Why does phone call quality still suck so hard? It would be really easy for a phone to use high fidelity audio that would work much better for voice recognition, but that's never happened. Probably because phone companies want to be able to compress the hell out of the signal.
posted by delmoi at 6:00 AM on May 4, 2010

I use the same technique on voice response systems as I do on telemarketers: I just keep saying "cabbages" until they do what I want (telemarketers = go away, VRS = connect me to a human). If a VRS won't play nice at cabbages, I take my business elsewhere.

I tried that yesterday with a telemarketer trying to sell me "fresh organic produce delivered straight to your door" or something. Worked like a charm! I think- oh hang on there seems to be a truck backing into my driveway for some reason, just a sec...
posted by EndsOfInvention at 6:21 AM on May 4, 2010

Fuck.
posted by EndsOfInvention at 6:23 AM on May 4, 2010 [3 favorites]

Perhaps it's you that misunderstands. Whether the automated system asks you to speak your choice or asks you to press 1 on your phone won't make a lick of difference to the paid employees. However, it will save untold amounts of frustration on the part of the customer if he didn't have to wrangle with a half-assed, technologically challenged system that insists he speak to the computer (which can't understand a staggering number of people) and could instead just press the damn "1" on the phone, which is easily recognized by machines that are decades old.

You have reached the Odeon Cinema Line! Please select the name of the cinema you would like film times for:
For Aberdeen, press 1.
For Anglesea, press 2.

[some time later]

For Wandsworth, press 247.
For Worcester, press 248.
posted by EndsOfInvention at 6:27 AM on May 4, 2010 [3 favorites]

FYI: you can bypass most automated phone menu systems, VR or otherwise, by just pressing 0
posted by Uther Bentrazor at 6:38 AM on May 4, 2010

As one who has worked in a customer service cube farm, I can tell you that the voice recognition systems more and more corporations are installing have nothing to do with improving customer experience. They have everything to do with saving money on customer service drones.

Keeping a body in that seat all year is going to cost the company at least $30-40k, what with taxes and insurance, even with no benefits. Once they've been there a while we could easily be talking $40-50k.

For that much you can easily purchase, install, and maintain a voice recognition program which has the potential to eliminate potentially dozens of otherwise necessary positions by handling routine matters automatically. This could literally save a company hundreds of thousands of dollars every year by reducing their overhead.

Yes, your customers hate you, but if 1) they're costing you less money, and 2) you don't have to talk to them anymore, the MBA analysis is pretty obvious. You'd have to lose a lot of customers over your shitty and frustrating customer experience before you start to actually lose money on the deal, and if everyone else is doing it too...

A more philosophical analysis would lead to a different conclusion, but business-types don't tend to give a damn about virtue these days.
posted by valkyryn at 6:42 AM on May 4, 2010 [1 favorite]

God-ram pizza plucking ship.
posted by PuppyCat at 6:45 AM on May 4, 2010

But instead of pressing an easily-understood button, you have to sound like a fucking moron repeating the same fucking word a thousand fucking times

That's the point, I think. The more they can make you feel like a hapless moron on your own, the easier it will be when they start treating you like one. The degradation is a feature, not a bug!
posted by Hiker at 6:52 AM on May 4, 2010

A friend likes telling his 2004-ish story of trying to tell American Airlines' VR system his flight number. "Four one nine" - "I'm sorry, I didn't understand that. Please say your flight number again." "Foouuuur Oooone Niiiiine" - "I'm sorry ... etc." "FOUR. ONE. NINE." - "I'm sorry .. etc."

The whole time speaking on a cellphone that had a huge number keypad on it.
posted by anthill at 6:55 AM on May 4, 2010 [1 favorite]

The whole conflation of AI and speech recognition is understandable but not warranted.

To many computer scientists, AI is essentially "the set of problems for which exact algorithms do not yet exist." This is much, much broader than speech recognition.

Speech recognition also does not involve "understanding" the way humans "understand" each other. Problems like "the human user just said a word that has a homophone; which word should I choose as the output?" can be decided on statistically by the context, without the computer having ever experienced qualia. (Incidentally, this gets easier the more the output of the computer is allowed to lag behind the input, because more context can inform the choice.)

If you want to appreciate the difficulty of the job the computer is doing, open up an audio editor program. Record yourself saying a phrase. Selecting small snippets of audio at a time, identify where each word ends. Congratulations, you have segmented words! Notice how the word "boundaries" that you have found do not match up with the clumps of peaks in the audio. The final consonants of many words will end up in the next clump of peaks, or there may be unexpected silences in the middle of words. If you actually go through with this in reality instead of as a thought experiment, you will probably get a sense of the complete hopelessness of isolating individual phonemes in the signal; they just blend one right into the next.

It is pretty much a miracle that speech recognition software works as well as it does.
posted by a snickering nuthatch at 7:00 AM on May 4, 2010 [3 favorites]

many VR systems will respond if you just keep saying the word "agent." It is programmed in.

VR: ...Say something I understand...
Me: Agent
VR (pretending ignorance): I'm sorry I didn't get that, say something I understand.
Me: Agent
VR (starting to cave): I can connect you with an agent, but first I need you to say something I understand...
Me: Agent
VR: (surrendering): I'm connecting you to an agent.

I was told this trick by a UPS driver.
posted by warbaby at 7:09 AM on May 4, 2010 [4 favorites]

Those of us in the medical transcription field smiled knowingly as Dragon ate up, then spit out, many clients. Very intelligent people invested in Dragon systems which ultimately required more time and money than human scribes ever could. One MD spent over four hours every evening fixing what his VR software had mangled; ultimately he returned to using people.

That this stuff has hit a brick wall comes as no surprise. Despite dire warnings to the contrary, our business has actually picked up (dramatically) in the past few years as the limitations of VR programs became increasingly apparent. Interestingly the same has occurred with clients using overseas services, since few can completely expunge the colloquial from their dictations and offshore workers lacked the background needed to process same. No doubt this also complicated efforts to make VR work for all users.
posted by kinnakeet at 7:17 AM on May 4, 2010 [2 favorites]

Speech recognition also does not involve "understanding" the way humans "understand" each other. Problems like "the human user just said a word that has a homophone; which word should I choose as the output?" can be decided on statistically by the context, without the computer having ever experienced qualia.

The article specifically addresses this. Yes, it can do that, and it can do pretty well, but it eventually hits a limit to how accurate it can be. The author claims that we've hit that limit, and it's significantly less accurate than human listeners. It seems likely that something like "understanding" is necessary to do any better.
posted by bonecrusher at 7:26 AM on May 4, 2010

You know what made me hate voice recognition? Voice response systems. Has to be the biggest step backward in customer service since outsourcing.

The rest of this menu will be presented in interpretive dance.
posted by Evilspork at 7:41 AM on May 4, 2010 [1 favorite]

Those of us in the medical transcription field smiled knowingly as Dragon ate up, then spit out, many clients.

Yeah, I'm not sure that people who haven't spent much time transcribing voice to text realize how challenging it is for humans to do it, let alone machines. I once was hired to clean up some very rough transcriptions of videotaped conference presentations, and I'm pretty sure the (American) person who did the initial voice-to-text of one had never heard someone speaking English with a British accent.
posted by FelliniBlank at 7:47 AM on May 4, 2010

>> drlith: "it has upped my output by about 30%"
>
> I am using an operating system without any good speech recognition. Is the linked article
> wrong about the error rates of the speech recognition or am I failing to grasp something
> about your workflow?
> posted by idiopath at 7:58 AM on May 4 [+] [!]

I work in a hospital's department of radiology. All our radiologists use voice rec when dictating reports of what they see in your chest xray or CT scan or whatever. This works pretty well but no better than pretty well, with the provisos that 1. the system only has to recognize a limited vocabulary of medical terms and radiologist boilerplate (There is a pleural-based 1.5 cm fat density lesion at the level of the left lung apex posteriorly medially suggesting a
lipoma yes, eventually (see proviso 2); Ph'nglui mglw'nafh Cthulhu R'lyeh wgah'nagl fhtagn not in a million years) and 2. it only works even pretty well for single individuals, each of whom has put a lot of time and effort into "training" his/her instance of voice rec to transcribe his/her voice timbre and habits of speech. Process:

- "say "eponysterical", click transcribe button on your speech-rec mic
- system transcribes that to text as "Ebony stair decal" in the report window. Or maybe "mongolia cellphone".
- highlight "Ebony stair decal" text, say "eponysterical" again, click transcribe button
- after several iterations of this, the system may (may!) start transcribing the spoken word "eponysterical" as "eponysterical" in printed text.

I say "may" and that needs emphasis. Every single radiologist's report contains the phrase "Impression:" (pronounced "impression colon") after the detailed report and followed by a summary. For one particular radiologist here the system has yet to learn to transcribe "impression colon" correctly and shows no sign of ever doing so no matter how many times he says it and corrects whatever hairball the system coughs as output.
posted by jfuller at 8:02 AM on May 4, 2010 [4 favorites]

The article specifically addresses this. Yes, it can do that, and it can do pretty well, but it eventually hits a limit to how accurate it can be. The author claims that we've hit that limit, and it's significantly less accurate than human listeners. It seems likely that something like "understanding" is necessary to do any better.
posted by bonecrusher at 9:26 AM on May 4 [+] [!]

Sorry- I should've been more clear that I was addressing comments made in the thread and not in the article.
posted by a snickering nuthatch at 8:03 AM on May 4, 2010

Great article! Quick comment about those wondering about ease of speech recognition using other languages. It doesn't matter a whole lot, because the crux of the problem lies in speaker variation. Sometimes that's easy to see...when there's homophones for the words we're trying to get the machine to understand. But there's actually variation found in every single word one can speak. This is true in every language. Think of the word 'cat'. In the first consonant alone, different speakers will pronounce that 'k' sound either more forward or back in the mouth, more or less voiced (like a 'g' sound), longer or shorter, with an air burst at the end or not (breathy, aspirated, etc.), higher or lower pitched, more evenly throughout or with variation, co-articulated with the next sound or not. We recognize ALL these subtle changes as 'k'. Same is true for the next sound, the 'a' can be higher in the mouth (more like ket) or lower (like caaaat), more forward or more backward in the mouth (like 'cot' or 'caught'), more rounded or open-mouthed, longer or shorter, pitched, creaky or breathy, and on and on. We are so good at normalizing the speech stream that we don't even hear the variation anymore, unless it's really pronounced (a thick accent or something that could be parsed differently, "olive juice"). But if you have a little training on speech sounds, and a fair amount of curiosity, you start hearing quirks in everybody's speech (one if my favorites is how Obama uses plosive 'b' sounds often when he's making a point. I have a linguist friend who does this too and when she talks and says something like "No, both books" we'll sometimes catch each others eyes and smile because we both heard it, even if other people around didn't.)

When yelling at a voice recognition system, does little good to raise your voice, stretch out the word and attempt to speak slowly and clearly (when you do that, you're introducing a really unnatural, exaggerated set of features into your words). You're better off trying to speak at a normal rate and volume and adopting the standard accent of the region.
posted by iamkimiam at 8:05 AM on May 4, 2010 [1 favorite]

Oops, hit post too soon. So, derlith, are you working on a system that allows training to your particular voice? Or a one-size-fits-all system like a voice-menu ayatem?
posted by jfuller at 8:05 AM on May 4, 2010

Er, that is, if you really want the recognition system to understand you and take you to the next menu. I recognize that there's all sorts of other reasons to raise your voice or yell.
posted by iamkimiam at 8:06 AM on May 4, 2010

grr, implosive 'b' sounds.
posted by iamkimiam at 8:08 AM on May 4, 2010

I know of a doctor who used Dragon Naturally Speaking. He became so irritated at how often the software misinterpreted what he said that he added a vocal shortcut: "Dragon, delete that fucking sentence."

Train these things on music. I have three bands in mind: ZZ Top, to train the systems to recognize divisions between words when they are not clearly spoken; Nirvana, to train it on screaming ("I'll concede from shame," Kurt?); and early Duran Duran, to extract meaning where not much may exist.

Or we could all learn lojban. I submit that the humans will adapt to the machines more readily than the machines will adapt to humans.
posted by adipocere at 8:12 AM on May 4, 2010

I just had an idea — maybe we need a speech recognition system specifically for transcriptionists in the healing arts. Restricting the problem domain to a specific sandbox would help as it creates Markov chains, strung together to represent probable sentences, digesting sound into text output.

The Human Centipede Voice Recognition System — it's 100% medically accurate!
posted by adipocere at 8:19 AM on May 4, 2010 [1 favorite]

idiopath: "recognizing the notes being played by the instruments you recorded" (very hard and not very reliable)

Do you know about Melodyne's Direct Note Access? (MeFi post)
posted by zsazsa at 8:36 AM on May 4, 2010 [1 favorite]

Is our computers learning?
posted by Mister_A at 9:10 AM on May 4, 2010

COVOX. VOICE MAS-TER.
posted by joe lisboa at 9:10 AM on May 4, 2010 [3 favorites]

A thought about medical transcription... how many reports come out as "Impression : the patient has a tumor on his : "?

And the limitations of human voice recognition have long been the subject of games (a round of "Telephone", anyone?), not to mention childish play, from "Owa Tagoo Siam" to "Sofa King Cool"...

I'm delighted with the quality of discussion, when the only reason I did the post was for the opportunity to write "Oh, punt appalled bait oars, Hal" on the MeFi front page. Now, THAT's intelligent interpretation.
posted by oneswellfoop at 9:11 AM on May 4, 2010

I've posted this before, but a college friend of mine had a job as a summer intern at Apple during the late 80s. He worked on a speech recognition project there and at the end of the summer everyone got a tee-shirt that said:

"I helped Apple wreck a nice beach."

I wonder if he still has it.
posted by The Bellman at 9:19 AM on May 4, 2010 [4 favorites]

Sorry for the non sequitur. Here's the backstory: A company out of Oregon called Covox marketed voice recognition hardware and software for the Apple II. I save up a bunch of my allowance money over an entire summer and bought the Covox Voice Master and it was everything I dreamed of and more. Me, my brother and my cousin spent months working on writing a text-and-graphics RPG game (in BASIC!) that used voice recognition to control your character. I was so thrilled I could puke. Then, we went out of town for my aunt's wedding one weekend and returned to find our house broken into. Jewelry, TV, appliances all gone, but the most tragic loss (to me) was that they took the computer, with the Voice Master still plugged into it and the floppy containing the only copy of the program we had labored so hard to produce. It was seriously crushing. Now that I think about it, that was right around when my love affair with computers started to simmer into a low burn and I got more involved in thing like reading and writing than coding. It's one of those highest-high yet lowest-low memories that I've always hung onto. And that is my voice recognition story. Such as it is.
posted by joe lisboa at 9:20 AM on May 4, 2010 [8 favorites]

Google Voice does transcription quite well. I don't even listen to my voicemails, I read them. Sure it fails on some accents, but it is very good and getting better.

The article is stupid. Sure progress is going to be harder but it hasn't stopped.
posted by bhnyc at 9:28 AM on May 4, 2010

> Speech recognition also does not involve "understanding" the way humans "understand" each other. Problems like "the human user just said a word that has a homophone; which word should I choose as the output?" can be decided on statistically by the context, without the computer having ever experienced qualia.

Using statistics and context will only get you so far. My work involves extracting meaning from text, so this comes up a lot. Understanding an utterance involves a lot of layers that all interact with each other in complicated ways.

My work involves text, so word segmentation is pretty trivial, and homophones are usually not a problem, so the first interesting problem I've looked at is part-of-speech recognition. In other words, you need to decide if each word is a noun, verb, etc. before you can do much of anything else. The current state of the art for English using purely statistical methods is around 95%, meaning about 1 in 20 words is misrecognized. This is a big problem, because the errors frequently involve mistaking a noun for a verb or vice versa, which screws up everything on down the line.

The next step is parsing, which the article touched on. Sometimes you can use the parse to fix part-of-speech errors, but just as often, a part-of-speech error will result in a perfectly grammatical parse that just happens to be wrong, so it's not a cure-all. Parsing also introduces its own ambiguities even when the parts of speech are all correct. For instance, if I say "I saw a man with a telescope", does the man have the telescope, or do I? A human will usually guess that I have the telescope, because telescopes are used for seeing, but you have no hope of figuring this kind of thing out without some fairly sophisticated knowledge of what the words mean. This is different from the example sentence in the article ("The driver struck the ball"), because there is not much difficulty deciding what the individual words mean, but it's impossible to know for sure they the words relate to each other.

Language researchers have also tried to tackle the problem of "textual entailment", determining if one sentence implies (or contradicts another). For example, if I say "I'm going to LA tomorrow", does that imply "I will be in California tomorrow"? This kind of thing is really hard, and it's telling that the most successful approaches so far has been almost purely statistical. The results are abysmal, of course, because you can't draw correct inferences from sentences without some idea of what the words mean and how they fit together, but parsing that isn't informed by semantics is almost useless for any task more complicated than grammar checking.

I guess the point I'm trying to make is that there's a lot that absolutely must go on in order to interpret speech, even if it's not what you would call "understanding". Even a problem as simple is figuring out which of two homophones a person is saying frequently can't be solved without detailed knowledge of what the words mean and inferences based on common sense. Your brain automatically filters out nonsensical interpretations of the sounds you hear based on an amazing amount of knowledge and reasoning, and you'll never even be aware of it unless you sit down and very consciously pick apart sentences, looking for incorrect ways to interpret every little thing.

Whether you call the kind of interpretation "understanding" is beside the point for anyone working in the field. We don't trouble ourselves over whether the computer experiences qualia; it would be enough for the computer to draw enough correct inferences to avoid making stupid mistakes all the time.
posted by shponglespore at 9:50 AM on May 4, 2010 [2 favorites]

The article notes that voice recognition works a lot better for specific purposes.

Communication works a lot better when you do it for a specific purpose. Professionals regularly come up with jargon terms for the specific meanings they need to talk about when they work. This allows them to communicate more effectively than without said terms.

This suggest to me that a truly generalized voice recognition program is a pipe dream. Computers can't recognize speech-in-general because neither can humans! If you walk into an operating room and try to understand what is going on, then unless you have the appropriate MD, you'll fail. The same thing happens if you try to socialize among a subculture you're not familiar with. Apart from obvious things like slang words, different subcultures use ordinary words in different ways; witness LOLspeak and Metafilter's curious preoccupation with "Metafilter: a curious preoccupation" et al. Humans have the benefit of a lifetime of experience learning new dialects (perhaps simple ones, but nonetheless); computers don't have that.

Instead we want computers that teach themselves how to recognize our particular modes of speech. Dragon Naturally Speaking already does this a bit, according to wanderingstan, when it learns to recognize people's names. Right now, the process of teaching a computer to recognize any particular kind of speech still requires big organizations with lots of funding, and this limits the applications of the technology to those that are worth millions of dollars. Making the machines learn better, faster, and cheaper will likely result in applications more like the computers in Star Trek.

IANAartificial intelligence researcher
posted by LogicalDash at 9:50 AM on May 4, 2010

> I find it interesting that these discussions are always entirely anglophone and anglo-centric. There are languages that I can only imagine are much worse than English for recognition (due to the similarity of many sounds, and the lack of proper enunciation) such as Danish. By the same logic, there must be some languages with a clearer spoken language, more differentiated phonemes, etc. Perhaps German?

Languages don't all present exactly the same challenges, but I don't think many researchers would be willing to say that one language is easier than another overall. English is the most studied language, but there is quite a bit of work being done in languages like Arabic and Chinese that are very different from English.

It seems that languages always evolve towards a certain level of difficulty. When it comes to phonetics, you generally see a balance between the number of possible phonemes and clarity of enunciation. English has unusually subtle phonetics, so English speakers must speak relatively slowly in order to be understood. Spanish has far fewer phonemes, so Spanish speakers can speak very quickly and slur words together much more.
posted by shponglespore at 10:04 AM on May 4, 2010 [1 favorite]

shponglespore, I don't think we actually disagree except in ways that are outside the scope of the thread. That is to say, our comments don't differ wrt the mechanics of speech recognition (which you did not talk about much) or of text parsing/NLP (which I did not comment at all about). We might disagree about what constitutes "understanding" or "interpretation", but I suspect the nature of the disagreement would be a huge derail. We do not disagree that a human level of understanding is not relevant to getting either speech recognition or text parsing done- and indeed, that irrelevancy was part of my original point)
posted by a snickering nuthatch at 10:16 AM on May 4, 2010

I'll bet Hungarian works really well with speech recognition. Every word has its emphasis on the first syllable. Always. My first day in Budapest, I could already tell where the words were in sentences I overhead - so unlike French or Japanese, where it's all just syllable soup.
posted by Michael Roberts at 11:32 AM on May 4, 2010

LogicalDash: Computers can't recognize speech-in-general because neither can humans!

That doesn't follow. It's something I've actually often said and thought in the past, but statistical methods (just to mention one thing) often provide extremely counterintuitive success that surpasses what humans can do, just because they require more memory than our short-term capacity, or better math ability than we can muster, or what have you.

One thing that comes to mind that I just saw this week: in a textual analysis of a bunch of Enron email to estimate who reported to whom in the organization (i.e. the actual power structure independent of anything on paper), it turned out that the stop words had the best predictive power. Words like "on" and "the". Seriously. And it freaking worked, and it is so very much not what a human could do, because it just isn't the way we process things.

So I'm open to the possibility that really good machine speech recognition could be possible without even the slightest smidgen of understanding. It would likely have some failure modes that humans don't - but it would have success modes we don't, too.
posted by Michael Roberts at 11:38 AM on May 4, 2010 [1 favorite]

On a bus in Burma once, they played a pirated DVD of the latest James Bond movie.

The movie had been dubbed into Spanish, and then subtitled back into English, apparently using voice-to-text translation software, which must be an order of magnitude more prone to misinterpretations than regular voice-to-text.

I didn't have much of an idea what the movie was about, but goddamn it was funny.
posted by UbuRoivas at 1:59 PM on May 4, 2010

Wait, is the assertion in the article that we're at about 80%, and the last 20% will be really hard?

Huh.
posted by rush at 2:15 PM on May 4, 2010

Speech recognition is pretty tought in Hungarian too (although I haven't tried myself). The number of possible forms of a single verb is estimated to be in millions because of suffixes and large number of grammatical cases. So a Hidden Markov Model on a word level really doesn't work at all, because it is more or less impossible to even have enough data for modeling all possible state transitions.
posted by ikalliom at 2:45 PM on May 4, 2010

But Hungarian verbs do have some structure. You've got the root, and then you throw on modifiers until you decide to move on to the next word. Some of these modifiers indicate whether we're in first-person, etc, another indicates the dative case or ownership, and another tells you that the verb is doing something to something vaguely round in shape. Ok, so there's lots of suffixes, but they generally just add meaning individually... You can also cut the number of suffixes in half or a third because of the vowel-sound agreement rules. In other words, there are three endings indicating the first person, but which one you use is determined by the type of vowels present in the root. When I was learning the language (and I was nowhere near fluent in the year I lived in Budapest), I thought it very nice that there were certain redundancies built into the word structures.

That said, I also understood that there were other ways to order a sentence, which I avoided at all costs after being roundly ridiculed for trying. Yes, you can put the object at the start of the sentence, followed by the subject, and then the verb. Or any other permutation of the three. But usually all but one of those permutations will get you funny looks....

(btw, I'll be in budapest next week, if anyone's still there...)
posted by kaibutsu at 7:13 PM on May 4, 2010

I think Japanese tends to be one of the easiest languages for speech recognition, because it's a syllable-based alphabet (i.e., it's generally possible to correctly "spell" any word you hear), and so there's consequently very little variation in pronunciation from one person to the next.
Even the cheapo cell phone I had when I first came to Japan 10 years ago had usable voice recognition, including for names in the address book.
posted by bakerybob at 4:37 AM on May 5, 2010 [1 favorite]

Japanese is mora-based, not syllable-based. (A mora is a unit of speech that is based on weight and has an effect on timing & stress.)
posted by iamkimiam at 8:31 AM on May 5, 2010

"The last [speech recognition] system I had I think was called Speaking Naturally; the development of the work is continuing, but it does depend, of course, on training it to the voice. When I was fully trained, I did a test sentence. I said, 'This machine can type ten times as fast as a top typist can type'. And it printed out, 'This magazine can bite ten times as fast as a top baptist can fight'." -- Tony Benn
posted by logopetria at 10:05 PM on May 5, 2010

You have reached the Odeon Cinema Line! Please select the name of the cinema you would like film times for:
For Aberdeen, press 1.
For Anglesea, press 2.

[some time later]

For Wandsworth, press 247.
For Worcester, press 248.
posted by EndsOfInvention

Eponysterical!

Hey, take a look at your phone keypad. You see those nice letters? Wow! What a concept!

Now all we'd need is to have a system that asks you to spell the name of the cinema you'd want times for... Something like, when you press "223", the system would recognize that the only cinema which translates to something starting with 223 is aberdeen (22373336), and then would say "Do you want times for Aberdeen?", you'd press 1, and there you go, all you had to do was key in 4 easy numbers.

Wait, you mean they already have systems like that? That a great number of companies with automated IVRs have company directories, which act exactly like this? It's like they thought of everything!
posted by splice at 6:37 AM on May 6, 2010

David Pogue, the author and New York Times columnist, is enthusiastic about Dragon Naturally Speaking. He wrote about version 9 here and version 10 here.

In testing the software right out of the box, he found it to be better than 99% accurate, even without "training" (the process by which the program learns your idiosyncratic pronunciation). Not bad!

By the way, the program is available in French, Italian, German, Spanish, Dutch, Japanese, as well as varieties of English, and comes in special versions for legal and medical transcription.
posted by exphysicist345 at 7:08 PM on May 6, 2010

Are you a shill?
posted by OmieWise at 7:03 AM on May 7, 2010

No, I am not a shill. I've never even used DNS, and have no relationship with the company. I just wanted to point out that, according to a NYTimes columnist, there is an affordable commercial program that does many of the things MeFites asked for above, such as good accuracy and versions for medical and legal transcribing. Sorry if my comment came off as a product promotion — I didn't mean it that way.
posted by exphysicist345 at 8:30 AM on May 9, 2010

« Older SK8 OR DIE! | The sun never sets on the British Empire; it sets... Newer »

This thread has been archived and is closed to new comments

MetaFilter

Canoe here be dow?
May 3, 2010 11:14 PM Subscribe

Tags

Share

Canoe here be dow? May 3, 2010 11:14 PM Subscribe

Tags

Share

Canoe here be dow?
May 3, 2010 11:14 PM Subscribe