Controlling endpointing
Endpointer technology detects the start and end of speech in audio samples. The speech detector and Recognizer support several modes of endpointer technology, which correspond to voice application barge-in modes.
Note: An application controls these barge-in (endpointer) modes using VoiceXML properties and MRCP headers.
Endpointer modes
Here are the available endpointer modes:
- begin_only is the most commonly used mode. The speech detector detects the beginning of speech. The voice browser terminates the current prompt immediately, and sends the speech for processing.
Note: Speech Server continues to send speech to the endpointer so that it can adapt to the speech volume, background noise, and line noise.
- selective_barge_in prevents accidental interruption by allowing applications to define a small set of key words (to be spoken by callers) that trigger an intended barge-in. A client application that supports selective barge-in always listens for commands, whether the caller is speaking or listening to prompts. Speech Server interrupts the utterance or prompt only after a successful recognition. That is, Speech Server sends the speech for recognition and awaits a successful result before terminating the current utterance or prompt.
- magic_word mode is identical to selective_barge_in except that in magic_word, the speech detector rejects candidates that are too short or too long (as configured by the browser using parameters described in Setting barge-in modes (endpointing)) before sending them to Recognizer.
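The difference between the modes comes down to when the prompt is interrupted. The following Python sketch illustrates that decision only. The `EndpointerMode` enum and the `should_interrupt_prompt` function are hypothetical names invented for this illustration and are not part of any Speech Server API.

```python
from enum import Enum


class EndpointerMode(Enum):
    """Hypothetical enum mirroring the three endpointer mode names."""
    BEGIN_ONLY = "begin_only"
    SELECTIVE_BARGE_IN = "selective_barge_in"
    MAGIC_WORD = "magic_word"


def should_interrupt_prompt(mode, speech_detected, recognition_succeeded):
    """Return True if the current prompt should stop under the given mode.

    This is an illustrative model of the behavior described above,
    not the product's implementation.
    """
    if not speech_detected:
        # Nothing was spoken, so no mode triggers a barge-in.
        return False
    if mode is EndpointerMode.BEGIN_ONLY:
        # begin_only: the prompt stops as soon as speech begins.
        return True
    # selective_barge_in and magic_word: the prompt stops only after
    # a successful recognition.
    return recognition_succeeded
```

In this model, a cough during a prompt stops playback in begin_only mode but not in the other two modes, because no recognition succeeds.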
The following table compares and contrasts the endpointer modes:
| | begin_only | selective_barge_in | magic_word |
|---|---|---|---|
| Robustness to noise | High | Very high | Very high |
| Robustness to background speech | Medium | High | Very high |
| Responsiveness | Quick | Less quick | Less quick |
| Vocabulary flexibility | No limits | Short words | Single word |
| Ease of use for application | Easy | Easy | Medium |
How applications use the modes
Typically, selective_barge_in and magic_word are used for prompts that play long, informational messages (for example, email messages) where an accidental barge-in would disrupt the user’s experience. Until the caller speaks a key word or phrase, the application takes no action. For this reason, the voice platform continues to play a prompt until Recognizer returns a successful result.
Another use of these modes involves no prompt at all. The application can wait silently until triggered into action by a successful recognition. For example, a voice dialing application could allow callers to have a conversation while Recognizer listens for a command word (a magic word). The application could let a caller place a series of telephone calls without needing to hang up: at the end of one call, the caller could speak a magic word and then give commands for the next phone call.
An application can use either selective_barge_in or magic_word depending on its needs. Note the following:
- Because of the duration constraints in magic_word mode, the speech detector returns less audio. This has the advantage of reducing network traffic and Recognizer load.
- However, with magic_word the duration of each sound must be checked before the endpointer can send the first audio sample to Recognizer. This has the disadvantage of adding latency to magic_word recognitions. With selective_barge_in, on the other hand, Speech Server can send audio to Recognizer immediately once speech begins.
In both magic_word and selective_barge_in, the application must activate grammars that contain only single words or very short phrases such as “go back”, “skip this”, or “wake up”:
- A complex grammar adds too much recognition processing in these modes.
- Callers speak hesitantly when they talk over a prompt, and that hesitancy lowers recognition success. Single words and short phrases limit the duration of the overlapping speech.
- To improve the performance of magic_word and selective_barge_in, applications must instruct callers to pause briefly before and after saying the key words.
Setting barge-in modes (endpointing)
The application and Speech Server must coordinate the speech detector and Recognizer: both resources must be set to compatible modes for any given recognition.
Compatible mode combinations:
| swiep_mode | swirec_barge_in_mode |
|---|---|
| begin_only (default) | normal |
| magic_word | magic_word |
| selective_barge_in | selective_barge_in |
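An application could check a requested pair of settings against the compatible combinations before starting a recognition. The sketch below encodes the table above as a lookup. The `COMPATIBLE_MODES` mapping and the `modes_are_compatible` helper are hypothetical names for illustration, not product functions.

```python
# Compatible swiep_mode -> swirec_barge_in_mode pairs, taken from the
# table above. The mapping and helper names are illustrative only.
COMPATIBLE_MODES = {
    "begin_only": "normal",
    "magic_word": "magic_word",
    "selective_barge_in": "selective_barge_in",
}


def modes_are_compatible(swiep_mode, swirec_barge_in_mode):
    """Return True if the speech detector and Recognizer modes match."""
    expected = COMPATIBLE_MODES.get(swiep_mode)
    # An unknown swiep_mode value is never treated as compatible.
    return expected is not None and expected == swirec_barge_in_mode
```

For example, `modes_are_compatible("begin_only", "normal")` returns `True`, while pairing `magic_word` with `normal` returns `False`.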
Use the following parameters to control the speech detector:
| Parameter | Description | Default |
|---|---|---|
| swiep_mode | Sets special recognition modes (such as magic word) in the endpointer. | begin_only |
| | Maximum duration of a magic word candidate for recognition. | 800 (milliseconds) |
| | Minimum duration of a magic word candidate for recognition. | 200 (milliseconds) |
| | Duration of silence to determine that callers have finished speaking. | 1500 (milliseconds) |
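The minimum and maximum duration settings act as a filter on magic word candidates. The following sketch shows one way such a filter could work, using the 200 and 800 millisecond defaults listed above. The constant and function names are hypothetical, and treating both bounds as inclusive is an assumption.

```python
# Default duration limits from the table above, in milliseconds.
# The names are illustrative, not actual parameter names.
MAGIC_WORD_MIN_MSEC = 200
MAGIC_WORD_MAX_MSEC = 800


def is_magic_word_candidate(duration_msec,
                            min_msec=MAGIC_WORD_MIN_MSEC,
                            max_msec=MAGIC_WORD_MAX_MSEC):
    """Return True if a speech segment is within the duration limits.

    Assumption: both bounds are inclusive. Segments outside the range
    would be rejected before any audio is sent to Recognizer.
    """
    return min_msec <= duration_msec <= max_msec
```

With these defaults, a 100 ms click and a 3000 ms sentence are both rejected, while a 500 ms key word passes through for recognition.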
Use the following parameters to control Recognizer:
| Parameter | Description | Default |
|---|---|---|
| swirec_barge_in_mode | Sets special recognition modes in Recognizer. | normal |
| | Confidence threshold for magic word recognition results. | 500 |
| | Confidence threshold for selective_barge_in mode. | 500 |
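The confidence thresholds decide whether a recognition result counts as successful and therefore triggers the barge-in. A minimal sketch of that check follows. The names are hypothetical, and it assumes that the confidence score uses the same scale as the threshold and that a score equal to the threshold is accepted.

```python
# Default confidence threshold from the table above.
# The name is illustrative, not an actual parameter name.
DEFAULT_CONFIDENCE_THRESHOLD = 500


def triggers_barge_in(confidence, threshold=DEFAULT_CONFIDENCE_THRESHOLD):
    """Return True if a result is confident enough to interrupt the prompt.

    Assumptions: the score and threshold share one scale, and a score
    equal to the threshold is accepted.
    """
    return confidence >= threshold
```

Raising the threshold makes accidental interruptions less likely, at the cost of rejecting more genuine key words.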