Controlling endpointing

Endpointer technology detects the start and end of speech in audio samples. The speech detector and Recognizer support several modes of endpointer technology, which correspond to voice application barge-in modes.

Note: An application controls these barge-in (endpointer) modes using VoiceXML properties and MRCP headers.

Endpointer modes

Here are the available endpointer modes:

begin_only is the most commonly used mode. When the speech detector detects the beginning of speech, the voice browser terminates the current prompt immediately and sends the speech for processing.
Note: Speech Server continues to send speech to the endpointer so that it can adapt to the speech volume, background noise, and line noise.
selective_barge_in prevents accidental interruption by allowing applications to define a small set of key words (to be spoken by callers) that trigger an intended barge-in. A client application that supports selective barge-in always listens for commands, whether the caller is speaking or listening to prompts. Speech Server interrupts the utterance or prompt only after a successful recognition. That is, Speech Server sends the speech for recognition and awaits a successful result before terminating the current utterance or prompt.
magic_word mode is identical to selective_barge_in, except that the speech detector rejects candidates that are too short or too long (as configured by the browser using the parameters described in Setting barge-in modes (endpointing)) before sending them to Recognizer.
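
The difference between the modes comes down to when the current prompt is interrupted. The following Python fragment is a minimal sketch of that decision, not Speech Server code: the function name and its boolean arguments are hypothetical and exist only for illustration.

```python
def should_stop_prompt(mode: str, speech_started: bool,
                       recognition_succeeded: bool) -> bool:
    """Return True when the current prompt should be terminated.

    Hypothetical helper: it models the behavior described above and is
    not part of any Speech Server API.
    """
    if mode == "begin_only":
        # The prompt stops as soon as the beginning of speech is detected.
        return speech_started
    if mode in ("selective_barge_in", "magic_word"):
        # The prompt keeps playing until a recognition succeeds.
        return speech_started and recognition_succeeded
    raise ValueError(f"unknown endpointer mode: {mode!r}")
```

For example, should_stop_prompt("selective_barge_in", True, False) is False, because speech has started but no recognition has succeeded yet.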

The following table compares and contrasts the endpointer modes:

	begin_only	selective_barge_in	magic_word
Robustness to noise	High	Very high	Very high
Robustness to background speech	Medium	High	Very high
Responsiveness	Quick	Less quick	Less quick
Vocabulary flexibility	No limits	Short words	Single word
Ease of use for application	Easy	Easy	Medium

How applications use the modes

Typically, selective_barge_in and magic_word are used for prompts that play long, informational messages (for example, email messages), where an accidental barge-in would disrupt the user’s experience. The application takes no action until the caller speaks a key word or phrase, so the voice platform continues to play the prompt until Recognizer returns a successful result.

Another use of these modes involves no prompt at all. The application can wait silently until triggered into action by a successful recognition. For example, a voice dialing application could allow callers to have a conversation while Recognizer listens for a command word (a magic word). The application could let a caller place a series of telephone calls without needing to hang up: at the end of one call, the caller could speak a magic word and then give commands for the next phone call.

An application can use either selective_barge_in or magic_word depending on its needs. Note the following:

Because of the duration constraints in magic_word mode, the speech detector returns less audio. This has the advantage of reducing network traffic and Recognizer load.
However, with magic_word the duration of each sound must be checked before the endpointer can send the first audio sample to Recognizer. This has the disadvantage of adding latency to magic_word recognitions. With selective_barge_in, on the other hand, Speech Server can send audio to Recognizer immediately once speech begins.

In both magic_word and selective_barge_in, the application must activate grammars that contain only single words or very short phrases such as “go back”, “skip this”, or “wake up”:

A complex grammar adds too much recognition processing in these modes.
Callers tend to speak hesitantly while a prompt is playing, and that hesitancy lowers recognition success. Single words and short phrases limit how long their speech overlaps the prompt.
To improve the performance of magic_word and selective_barge_in, applications must instruct callers to pause briefly before and after saying the key words.

Setting barge-in modes (endpointing)

The application and Speech Server must co-ordinate the speech detector and Recognizer: both resources must be set to compatible modes for any given recognition.

Compatible mode combinations:

swiep_mode	swirec_barge_in_mode
begin_only (default)	normal
magic_word	magic_word
selective_barge_in	selective_barge_in
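
An application could check a mode pair against this table before starting a recognition. The Python fragment below is a sketch under that assumption: the dictionary copies the table above, and the function name is hypothetical.

```python
# Compatible pairs, copied from the table above:
# swiep_mode -> required swirec_barge_in_mode.
COMPATIBLE_MODES = {
    "begin_only": "normal",
    "magic_word": "magic_word",
    "selective_barge_in": "selective_barge_in",
}


def modes_are_compatible(swiep_mode: str, swirec_barge_in_mode: str) -> bool:
    """Return True if the detector and Recognizer modes form a listed pair."""
    expected = COMPATIBLE_MODES.get(swiep_mode)
    # An unknown swiep_mode has no listed partner, so it is never compatible.
    return expected is not None and expected == swirec_barge_in_mode
```

For example, modes_are_compatible("magic_word", "normal") is False, because the table pairs magic_word only with magic_word.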

Use the following parameters to control the speech detector:

Parameter	Description	Default
swiep_mode	Sets special recognition modes (such as magic word) in the endpointer.	begin_only
swiep_magic_word_max_msec	Maximum duration of a magic word candidate for recognition.	800 (milliseconds)
swiep_magic_word_min_msec	Minimum duration of a magic word candidate for recognition.	200 (milliseconds)
incompletetimeout	Duration of silence to determine that callers have finished speaking.	1500 (milliseconds)
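
The duration limits for magic_word candidates can be pictured as a simple range check. The Python fragment below is an illustrative sketch only: the function name is hypothetical, the default values are the ones listed for swiep_magic_word_min_msec and swiep_magic_word_max_msec, and treating both limits as inclusive is an assumption.

```python
def is_magic_word_candidate(duration_msec: int,
                            min_msec: int = 200,
                            max_msec: int = 800) -> bool:
    """Return True if a speech segment is neither too short nor too long.

    min_msec and max_msec stand in for swiep_magic_word_min_msec and
    swiep_magic_word_max_msec. Inclusive bounds are an assumption.
    """
    return min_msec <= duration_msec <= max_msec
```

With these defaults, a 500-millisecond segment would be passed on for recognition, while a 100-millisecond click or a 3000-millisecond sentence would be rejected.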

Use the following parameters to control Recognizer:

Parameter	Description	Default
swirec_barge_in_mode	Sets special recognition modes in Recognizer.	normal
swirec_magic_word_conf_thresh	Confidence threshold for magic word recognition results.	500
swirec_selective_barge_in_conf_thresh	Confidence threshold for selective_barge_in mode.	500
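
The two confidence thresholds decide whether a result counts as a successful recognition in the corresponding mode. The Python fragment below sketches that check under stated assumptions: the function name and the settings dictionary are hypothetical, the confidence score is assumed to be on the same scale as the threshold, and a score equal to the threshold is assumed to pass.

```python
# Threshold parameter for each special mode, from the table above.
THRESHOLD_PARAMETER = {
    "magic_word": "swirec_magic_word_conf_thresh",
    "selective_barge_in": "swirec_selective_barge_in_conf_thresh",
}
DEFAULT_THRESHOLD = 500  # Default listed for both parameters.


def barge_in_accepted(mode, confidence, settings=None):
    """Return True if a result's confidence meets the mode's threshold.

    Raises KeyError for modes without a threshold parameter (for example
    "normal"). The >= comparison is an assumption, not documented behavior.
    """
    settings = settings or {}
    name = THRESHOLD_PARAMETER[mode]
    threshold = settings.get(name, DEFAULT_THRESHOLD)
    return confidence >= threshold
```

For example, barge_in_accepted("magic_word", 650) is True with the default threshold, while a score of 499 would leave the prompt playing.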