The Economist explains

How speech-recognition software got so good

By R.L.G. | BERLIN

FOR a long time speech-recognition software was poor, confirming the saying that computers find it difficult to do things humans do easily, and vice versa. But lately it has got much better: most modern smartphones now have a host of voice-activated features which actually work. Not only can programs such as Google Now or the iPhone's Siri handle restricted tasks like finding a restaurant or dialling a phone number; smartphones are also getting much better at free-form speech recognition, such as taking dictated text-messages or e-mails. How did computers get so much better at understanding speech?

Almost any word can begin a sentence, so the first word alone could be any one of tens of thousands. If any word were as likely as any other in any position, a five-word utterance from a vocabulary of 20,000 words would have 20,000⁵, or 3.2 × 10²¹, possibilities. Faced with such odds (and a sound signal degraded by cheap microphones, background noise and compression), the task would be impossible.
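That figure is simple arithmetic, as a quick check shows (a sketch in Python; the vocabulary size and utterance length are the ones assumed above):

    # If all 20,000 words were equally likely in all five positions,
    # the number of possible utterances is 20,000 to the fifth power.
    vocabulary_size = 20_000
    utterance_length = 5
    possibilities = vocabulary_size ** utterance_length
    print(f"{possibilities:.1e}")  # prints 3.2e+21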

But words do not appear in random order, so the computer does not have to guess from (say) a vocabulary of 20,000 words for each word you speak. Instead, the software assesses how likely you are to have said a given word based on the surrounding words, drawing on statistical models derived from vast repositories of digitised documents and the previous utterances of other users. What comes after “the” is probably not a verb, for example, narrowing the possibilities. What comes after “Jefferson wrote the Declaration of” narrows down the possibilities rather a lot more. Dictate “a nice cream truck” at a natural rate of speech into your phone, and it is likely to return the nearly homophonous “an ice cream truck”. All of the words in “a nice cream truck” are common, but the combination is not. Smartphones can improve their guesswork further by taking into account the user's personal information, such as names in his address book or cities near his location.
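A stripped-down sketch of the idea in Python may help. The bigram (word-pair) counts below are invented for illustration; a real recogniser would estimate probabilities from billions of words and weigh them against the acoustic signal:

    # Toy bigram model: rank candidate transcriptions by how often each
    # adjacent word pair occurs in some hypothetical corpus of text.
    BIGRAM_COUNTS = {
        ("a", "nice"): 3000,
        ("nice", "cream"): 2,      # both words common, combination rare
        ("an", "ice"): 900,
        ("ice", "cream"): 5000,
        ("cream", "truck"): 400,
    }

    def score(words):
        # Multiply the counts of successive pairs; unseen pairs get a
        # small default so one gap does not zero out the whole product.
        total = 1.0
        for pair in zip(words, words[1:]):
            total *= BIGRAM_COUNTS.get(pair, 0.1)
        return total

    candidates = [
        ["a", "nice", "cream", "truck"],
        ["an", "ice", "cream", "truck"],
    ]
    print(" ".join(max(candidates, key=score)))  # an ice cream truck

Because “nice cream” is vanishingly rare while “ice cream” is everywhere, the common reading wins even though every word in the losing candidate is, on its own, perfectly ordinary.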

Such statistical models power all kinds of language applications. For example, older forms of computerised translation tried to break down the grammar and meaning of a sentence and recompose it in the new language. The best modern systems rely on the likelihood of string A in the original language being rendered correctly as string B in the target language, based on a body of human-translated material that the computer has been trained on. And statistical models can correct common and obvious mistakes: text a friend “on the way mow” and, even though “mow” is an English word, some software will know to correct it to “on the way now”, since “mow” is a relatively uncommon word, virtually never preceded by “on the way” (a toy version of this comparison is sketched below). Computers can be more useful to humans the more they learn about us, both collectively and individually. Increasingly, the question for consumers is how much personal information they are willing to give up in return for more helpful and reliable services.
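Here is that correction in miniature, again in Python. The counts and the confusion set are invented for illustration; real systems also weigh how acoustically or typographically confusable the words are:

    # "mow" and "now" are both real words, so context breaks the tie.
    BIGRAM_COUNTS = {("way", "now"): 8000, ("way", "mow"): 1}
    CONFUSABLE = {"mow": ("mow", "now")}  # easily-confused alternatives

    def correct(previous, word):
        # Pick whichever confusable alternative is most common
        # immediately after the preceding word.
        options = CONFUSABLE.get(word, (word,))
        return max(options, key=lambda w: BIGRAM_COUNTS.get((previous, w), 0))

    print(correct("way", "mow"))  # now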

Dig deeper:
Move over, Siri: a new breed of personal-assistant software knows what users want before they ask for it (November 2013)
The rise of the cheap smartphone (April 2014)
Does Scarlett Johansson deserve an Oscar for her role as the voice of a computer operating system? (January 2014)
