
Speech Background
(Thanks to Microsoft Speech Awareness)
Dictation
A dictation application requires certain hardware and software on the user's
computer. Not all computers have the memory, speed, microphone, or speakers
required to support speech, so it is a good idea to design the application so
that speech is optional.
These hardware and software requirements should be considered when designing
a speech application:
- Processor speed. The speech recognition and text-to-speech engines
currently on the market typically require a Pentium 60 or faster processor
for discrete dictation and a Pentium 200 or faster processor for continuous
dictation.
- Memory. On average, speech recognition for dictation consumes 4 to 8
megabytes (MB) of random-access memory (RAM) for discrete dictation and
about 32 MB for continuous dictation, in addition to the memory required by
the running application.
- Sound card. Almost any sound card will work for speech recognition and
text-to-speech, including Sound Blaster™, Media Vision™, ESS Technology,
cards that are compatible with the Microsoft® Windows Sound System, and the
audio hardware built into multimedia computers. A few speech recognition
engines still need a DSP (digital signal processor) card.
- Microphone. The user can choose between two kinds of microphones: either a
close-talk or headset microphone that is held close to the mouth or a
medium-distance microphone that rests on the computer 30 to 60 centimeters
away from the speaker. A headset microphone is needed for noisy
environments. Dictation works best with close-talk microphones.
- Operating system. The Microsoft Speech application programming interface
(API) requires either Windows 95 or Windows NT version 3.5.
- Speech-recognition and text-to-speech engine. Speech recognition and
text-to-speech software must be installed on the user's system. Many new
audio-enabled computers and sound cards are bundled with speech recognition
and text-to-speech engines. As an alternative, many engine vendors offer
retail packages for speech recognition or text-to-speech, and some license
copies of their engines.
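The requirements above can be turned into a simple capability check so that
speech stays optional. The following is a minimal sketch, not part of the
Microsoft Speech API: the SystemInfo fields and the
supported_dictation_modes function are hypothetical names, and the
thresholds are the typical figures quoted in the list (using the low end of
the 4 to 8 MB range for discrete dictation). Real engines may need more.

```python
from dataclasses import dataclass
from typing import List

# Typical figures quoted in the requirements list; real engines vary.
DISCRETE_MIN_MHZ = 60       # Pentium 60
CONTINUOUS_MIN_MHZ = 200    # Pentium 200
DISCRETE_MIN_RAM_MB = 4     # low end of the 4 to 8 MB range
CONTINUOUS_MIN_RAM_MB = 32


@dataclass
class SystemInfo:
    """Hypothetical description of the user's machine."""
    cpu_mhz: int
    free_ram_mb: int
    has_sound_card: bool
    has_microphone: bool
    has_engine: bool


def supported_dictation_modes(info: SystemInfo) -> List[str]:
    """Return the dictation modes the machine can plausibly support."""
    # Without audio hardware or an installed engine, speech is unavailable.
    if not (info.has_sound_card and info.has_microphone and info.has_engine):
        return []
    modes = []
    if info.cpu_mhz >= DISCRETE_MIN_MHZ and info.free_ram_mb >= DISCRETE_MIN_RAM_MB:
        modes.append("discrete")
    if info.cpu_mhz >= CONTINUOUS_MIN_MHZ and info.free_ram_mb >= CONTINUOUS_MIN_RAM_MB:
        modes.append("continuous")
    return modes
```

An empty result means the application should run with keyboard and mouse
input only, rather than fail.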
Even the most sophisticated speech recognition engine has limitations that
affect what it can recognize and how accurate the recognition will be. The
following list illustrates many of the limitations found today. The limitations
do pose some problems, but they do not prevent the design and development of
savvy applications that use dictation.
Microphones and sound cards
The microphone is the largest problem that speech recognition encounters.
Microphones inherently have the following problems:
- Not every user has a sound card. Over time more and more PCs will bundle a
sound card.
- Not every user has a microphone. Over time more and more PCs will bundle a
microphone.
- Sound card jacks are on the back of the computer, which makes it hard for
users to plug in the microphone.
- Most microphones that come with computers are cheap, and they don't do as
well as more expensive microphones that retail for $50 to $100. Furthermore,
many of the cheap microphones that are designed to be worn are
uncomfortable. A user will not use a microphone if it is uncomfortable.
- Users don't know how to use a microphone. If the microphone is worn on
the head, they often wear it incorrectly. If it sits on the desktop, they
lean toward it to speak, even though the microphone is designed for the
user to speak from a normal sitting position.
Most applications can do little about the microphone. One way that vendors
can deal with this is to test and verify the user's microphone setup as part of
the installation of any speech component software. Software to test a user's
microphone can be delivered along with other components to ensure that the user
can periodically test and adjust the microphone and configuration.
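A microphone test of the kind described above might look at the level of a
short test recording. The following is a minimal sketch under stated
assumptions: the samples are already captured and normalized to the range
-1.0 to 1.0, the function name check_microphone_level is hypothetical, and
the two thresholds are illustrative values, not figures from any engine
vendor.

```python
import math
from typing import Sequence

# Assumed thresholds for samples normalized to the range -1.0..1.0.
TOO_QUIET_RMS = 0.01
CLIPPING_PEAK = 0.99


def check_microphone_level(samples: Sequence[float]) -> str:
    """Classify a test recording as 'no signal', 'clipping', 'too quiet', or 'ok'."""
    if not samples:
        return "no signal"
    peak = max(abs(s) for s in samples)
    if peak == 0.0:
        # Pure silence usually means the microphone is unplugged or muted.
        return "no signal"
    if peak >= CLIPPING_PEAK:
        # The input gain is too high or the user is too close.
        return "clipping"
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms < TOO_QUIET_RMS:
        # The input gain is too low or the user is too far away.
        return "too quiet"
    return "ok"
```

An installer could run this check once during setup and offer it again from
a settings page so the user can retest after moving the microphone.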
Most users of dictation will wear close-talk microphones for maximum
accuracy. Close-talk microphones have the best characteristics for speech
recognition, and they alleviate a number of the problems that weaker user
microphones cause in command and control recognition and in dictation.
Speech Recognizers make mistakes
Speech recognizers make mistakes and always will. What changes is the rate:
roughly every two years, recognizers make half as many mistakes as they did
before. But no matter how good a recognizer is, it will still make
mistakes.
To make matters worse, dictation engines make misrecognitions that are
correctly spelled and often grammatically correct, but mean nothing.
Sometimes the misrecognitions mean something completely different from what
the user intended. Such errors illustrate the complexity of speech
communication: people are not accustomed to attributing strange wording to
a speech error.
To reduce misrecognitions and limit their impact, an application can:
- Make it as easy as possible for users to correct mistakes.
- Provide easy access to the "Correction Window" so the user can
correct mistakes that the recognizer made.
- Allow the user to train the speech recognition system to their voice.
Is it a Command?
When speech recognition is listening for dictation, users will often want to
interject commands such as "cross-out" to delete the previous word or
"capitalize-that". Applications should make sure that:
- If a command is just one word, it does not replace a word that people like
to dictate.
- If a command is multiple words, it can't be a phrase that people like to
dictate.
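The two rules above can be checked mechanically before a command set
ships. The following is a minimal sketch: find_command_conflicts is a
hypothetical helper, and the list of commonly dictated words and phrases is
assumed to come from the application's own sample text.

```python
def find_command_conflicts(commands, dictated_phrases):
    """Return the commands that match a word or phrase users are likely to dictate."""
    def normalize(phrase):
        # Compare case-insensitively and treat hyphens as spaces.
        return " ".join(phrase.lower().replace("-", " ").split())

    dictated = {normalize(p) for p in dictated_phrases}
    # Keep the original command order so the report is easy to read.
    return [c for c in commands if normalize(c) in dictated]


conflicts = find_command_conflicts(
    ["cross-out", "capitalize-that", "delete"],
    ["delete", "thank you", "Cross out"],
)
print(conflicts)
```

Any command that shows up in the result is a candidate for renaming, for
example by choosing a longer or less common phrase.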
Finite Number of Words
Speech recognizers listen for 20,000 to 100,000 words. As a result, about
one out of every fifty words a user speaks is not recognized, because it is
not among the words the engine supports.
Applications can reduce the error rate of an engine by telling the engine
which words to expect.
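One way to find words worth passing to the engine is to scan text the user
is likely to dictate, such as existing documents, for words outside the
engine's vocabulary. The following is a minimal sketch: the function name
out_of_vocabulary_words is hypothetical, the vocabulary is assumed to be
available as a plain word list, and how the resulting words are handed to
an engine is engine-specific and not shown.

```python
import re


def out_of_vocabulary_words(text, vocabulary):
    """Return the distinct words in `text` missing from `vocabulary`, in first-seen order."""
    known = {w.lower() for w in vocabulary}
    missing = []
    # Split on anything that is not a letter; matching is case-insensitive.
    for word in re.findall(r"[a-z]+", text.lower()):
        if word not in known and word not in missing:
            missing.append(word)
    return missing


sample = "The patient has tachycardia and the ECG shows tachycardia"
print(out_of_vocabulary_words(sample, ["the", "patient", "has", "and", "shows"]))
```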
Other Problems
Some other problems crop up:
- Having a user spell out words is a bad idea, since most recognizers are
too inaccurate.
- An engine also cannot tell who is speaking, although some engines may be
able to detect a change in the speaker. Voice-recognition algorithms exist
that can be used to identify a speaker, but currently they cannot also
determine what the speaker is saying.
- An engine cannot detect multiple speakers talking over each other in the
same digital-audio stream. This means that a dictation system used to
transcribe a meeting will not perform accurately during times when two or
more people are talking at once.
- Unlike a human being, an engine cannot hear a new word and guess its
spelling.
- Localization of a speech recognition engine is time-consuming and
expensive, requiring extensive amounts of speech data and the skills of a
trained linguist. If a language has strong dialects that each represent
sizable markets, it is also necessary to localize the engine for each
dialect. Consequently, most engines support only five or ten major
languages, for example, European languages and Japanese, or possibly Korean.
- Speakers with accents, or those speaking in nonstandard dialects, can
expect more misrecognitions until they train the engine to recognize their
speech, and even then, the engine accuracy will not be as high as it would
be for someone with the expected accent or dialect. An engine can be
designed to recognize different accents or dialects, but this requires
almost as much effort as porting the engine to a new language.
Here are some design considerations for applications that use speech
recognition for dictation.
Design Speech Recognition in From the Start
Don't make the mistake of adding speech recognition to your application
as an afterthought. An application designed only for the keyboard and mouse
gets little benefit from speech recognition. The speech interface is at a
point similar to where the mouse interface was when applications were
designed for keyboard input only: the mouse did not prove generally
effective for user input until applications were deliberately designed for
it.
Do Not Replace the Keyboard and Mouse
Most dictation systems provide discrete dictation, allowing users to speak up
to 50 words per minute. While this is faster than hunt-and-peck typing, touch
typists can type at least 70 words per minute, so touch typists will not use
discrete dictation. Continuous dictation allows up to 120 words per minute.
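To make the comparison concrete, the rates above can be converted into the
time needed to enter a document. This is a rough sketch: it assumes a
steady rate with no time spent on corrections, and the function name
minutes_to_enter and the 1,000-word document are illustrative.

```python
def minutes_to_enter(word_count: int, words_per_minute: float) -> float:
    """Return the minutes needed to enter `word_count` words at a steady rate."""
    if words_per_minute <= 0:
        raise ValueError("words_per_minute must be positive")
    return word_count / words_per_minute


# Rates quoted in the text above, in words per minute.
RATES_WPM = {
    "discrete dictation": 50,
    "touch typing": 70,
    "continuous dictation": 120,
}

for method, rate in RATES_WPM.items():
    print(f"{method}: {minutes_to_enter(1000, rate):.1f} minutes")
```

For a 1,000-word document this gives 20 minutes for discrete dictation,
about 14.3 minutes for touch typing, and about 8.3 minutes for continuous
dictation.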
Communicate Speech Awareness
Since most applications today do not include speech recognition, users will
find speech recognition a new technology. They probably won't assume that your
application has it, and won't know how to use it.
When you design a speech recognition application, it is important to
communicate to users that your application is speech-aware and to show them
the commands it understands. It is also important to provide command
sets that are consistent and complete.
Manage User Expectations
Users will often have the expectation that speech-enabled applications will
provide a level of comprehension and interaction comparable to the futuristic
speech-enabled computers of Star Trek and 2001: A Space Odyssey. Some users will
expect the computer to correctly transcribe every word that they speak,
understand it, and then act upon it in an intelligent manner.
You should convey as clearly as possible exactly what an application can and
cannot do and emphasize that the user should speak clearly, using words the
application understands.
Where the Engine Comes From
If an application implements speech recognition, it can work on an end user's
PC only if the system has a speech recognition engine installed on it. The
application has two choices:
- The application can bundle in and install a speech recognition engine.
This strategy guarantees that speech recognition will be installed and also
guarantees a known level of quality from the speech recognizer. However, if
an application does this, royalties will need to be paid to the engine
vendor.
- Alternatively, an application can assume that the speech recognition
engine is already on the PC or that the user will purchase one if they wish
to use speech recognition. The user may already have speech recognition
because many PCs and sound cards will come bundled with an engine. Or, the
user may have purchased another application that included an engine. If the
user has no speech recognition engine installed, the application can tell
the user that they need to purchase a speech recognition engine and install
it. Several engine vendors offer retail versions of their engines.
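For the second choice, the application needs a startup path that works
whether or not an engine is present. The following is a minimal sketch:
speech_startup is a hypothetical function, the list of installed engine
names is assumed to come from whatever enumeration the speech API provides,
and the engine name in the usage example is made up.

```python
from typing import List, Tuple


def speech_startup(installed_engines: List[str]) -> Tuple[bool, str]:
    """Return (speech_enabled, message_for_user) based on the engines found."""
    if installed_engines:
        # Use the first engine found; a real application might let the user choose.
        return True, f"Speech recognition enabled using {installed_engines[0]}."
    # No engine: keep the application usable and explain how to get speech.
    return False, (
        "No speech recognition engine is installed. "
        "Purchase and install one to use speech. "
        "Keyboard and mouse input remain available."
    )


enabled, message = speech_startup(["Example Dictation Engine"])
print(enabled, message)
```

An application that bundles its own engine can skip the second branch,
since installation guarantees that an engine is present.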
© 1995-1998 Microsoft Corporation. All rights
reserved.