

IBM is claiming a great leap forward in computer recognition of human speech after successfully demonstrating a 20,000-word vocabulary on its experimental system at the T J Watson Research Center in Yorktown Heights, New York. The company says that reaching 20,000 words in a desk-top system is the latest advance for the project, which it claims is the world's most advanced. The 20,000-word vocabulary includes 97% of all the words a speaker is likely to use in business.
Speech uttered into a small microphone, with brief pauses between words, appears almost instantly on the screen. Documents can then be edited by voice or keyboard, stored, printed or transmitted. The system requires only 20 minutes of training to the individual user's voice, during which he or she must read a special document that the system uses to characterise and store the individual's unique way of speaking. Trials are planned for IBM offices as the company speeds up development of a system to recognise continuous speech without pauses between words.
The IBM approach to speech recognition is claimed to be unique, and is based on two statistical models. The first comes from the speaker's training session, in which 200 sound patterns that characterise the speaker are established. This model yields a selection of candidate words, drawn from the 20,000-word vocabulary and described in terms of those sound patterns. The candidate words are then matched against the second model, which draws on a database of 25m words of IBM office correspondence. The number of candidates is thereby reduced by determining which are most likely to follow the two previous words. The system makes its final selection of the best word only after it has determined that analysis of subsequent words won't affect the choice. This contextual ability enables the system to distinguish between homonyms such as to, two and too.
The Personal Computer-based system uses two high-speed subsystems, each using an IBM signal-processor chip; the first transforms a speaker’s words into labels to encode the speech, the second does the pattern matching.
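The second stage described above, ranking candidate words by how likely each is to follow the two previous words, can be illustrated with a toy sketch. This is not IBM's implementation: the four-sentence corpus, the acoustic scores, the add-one smoothing and the names `train_trigrams` and `pick_word` are all assumptions made for illustration, and the sketch picks each word immediately rather than waiting for subsequent words as the IBM system is said to do.

```python
from collections import Counter

# Toy stand-in for the 25m-word correspondence database (hypothetical data).
CORPUS = [
    "please send the report to the manager",
    "we need two copies of the report",
    "the invoice is too late",
    "send the two copies to the office",
]

def train_trigrams(sentences):
    """Count word triples and the two-word contexts that precede them."""
    trigrams, contexts = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>", "<s>"] + sentence.lower().split()
        for a, b, c in zip(words, words[1:], words[2:]):
            trigrams[(a, b, c)] += 1
            contexts[(a, b)] += 1
    return trigrams, contexts

def pick_word(candidates, prev_two, trigrams, contexts, vocab_size):
    """Return the candidate with the best acoustic-times-context score.

    candidates: list of (word, acoustic_score) pairs from the first stage.
    prev_two:   tuple of the two words already recognised.
    The context probability uses add-one smoothing (an assumption here),
    so unseen word sequences get a small non-zero probability.
    """
    def score(item):
        word, acoustic = item
        seen = trigrams[(prev_two[0], prev_two[1], word)]
        context_prob = (seen + 1) / (contexts[prev_two] + vocab_size)
        return acoustic * context_prob
    return max(candidates, key=score)[0]

trigrams, contexts = train_trigrams(CORPUS)
vocab_size = len({w for s in CORPUS for w in s.split()})

# The three homonyms sound alike, so their acoustic scores are nearly equal.
homonyms = [("to", 0.34), ("two", 0.33), ("too", 0.33)]

print(pick_word(homonyms, ("we", "need"), trigrams, contexts, vocab_size))
print(pick_word(homonyms, ("the", "report"), trigrams, contexts, vocab_size))
print(pick_word(homonyms, ("invoice", "is"), trigrams, contexts, vocab_size))
```

With this toy corpus the three calls select "two", "to" and "too" respectively: the sound alone cannot separate the homonyms, but the two preceding words can.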


This article is from the CBROnline archive: some formatting and images may not be present.

CBR Staff Writer

CBR Online legacy content.