Controlling endpointing

Endpointer technology detects the start and end of speech in audio samples. The speech detector and Recognizer support several modes of endpointer technology, which correspond to voice application barge-in modes.

Note: An application controls these barge-in (endpointer) modes using VoiceXML properties and MRCP headers.

Endpointer modes

Here are the available endpointer modes:

  • begin_only is the most commonly used mode. The speech detector detects the beginning of speech. The voice browser terminates the current prompt immediately, and sends the speech for processing.

    Note: Speech Server continues to send speech to the endpointer so that it can adapt to the speech volume, background noise, and line noise.

  • selective_barge_in prevents accidental interruption by allowing applications to define a small set of key words (to be spoken by callers) that trigger an intended barge-in. A client application that supports selective barge-in always listens for commands, whether the caller is speaking or listening to prompts. Speech Server interrupts the utterance or prompt only after a successful recognition. That is, Speech Server sends the speech for recognition and awaits a successful result before terminating the current utterance or prompt.
  • magic_word mode is identical to selective_barge_in, except that the speech detector rejects candidates that are too short or too long before sending them to Recognizer. The browser configures these duration limits using the parameters described in Setting barge-in modes (endpointing).
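
The duration check that distinguishes magic_word from selective_barge_in can be sketched as follows. This is only an illustration, not Speech Server code: the function name is hypothetical, the default limits are taken from the swiep_magic_word_min_msec and swiep_magic_word_max_msec defaults listed in Setting barge-in modes (endpointing), and treating both limits as inclusive is an assumption.

```python
def accept_magic_word_candidate(duration_msec, min_msec=200, max_msec=800):
    """Return True if a speech candidate falls inside the duration window.

    Hypothetical helper: min_msec and max_msec stand in for
    swiep_magic_word_min_msec and swiep_magic_word_max_msec.
    Inclusive bounds are an assumption made for this sketch.
    """
    if min_msec > max_msec:
        raise ValueError("min_msec must not exceed max_msec")
    return min_msec <= duration_msec <= max_msec


if __name__ == "__main__":
    # A cough, a short command, and a long stretch of conversation.
    for duration in (120, 450, 1300):
        verdict = "send to Recognizer" if accept_magic_word_candidate(duration) else "reject"
        print(f"{duration} ms: {verdict}")
```

Under these assumptions, only the candidate that falls between the two limits would be passed on to Recognizer.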

The following table compares and contrasts the endpointer modes:

Characteristic                     begin_only   selective_barge_in   magic_word
---------------------------------  -----------  -------------------  -----------
Robustness to noise                High         Very high            Very high
Robustness to background speech    Medium       High                 Very high
Responsiveness                     Quick        Less quick           Less quick
Vocabulary flexibility             No limits    Short words          Single word
Ease of use for application        Easy         Easy                 Medium

How applications use the modes

Typically, selective_barge_in and magic_word are used for prompts that play long, informational messages (for example, email messages) where an accidental barge-in would disrupt the caller's experience. The application takes no action until the caller speaks a key word or phrase, so the voice platform continues to play the prompt until Recognizer returns a successful result.

Another use of these modes involves no prompt at all. The application can wait silently until triggered into action by a successful recognition. For example, a voice dialing application could allow callers to have a conversation while Recognizer listens for a command word (a magic word). The application could let a caller place a series of telephone calls without needing to hang up: at the end of one call, the caller could speak a magic word and then give commands for the next phone call.

An application can use either selective_barge_in or magic_word depending on its needs. Note the following:

  • Because of the duration constraints in magic_word mode, the speech detector returns less audio. This has the advantage of reducing network traffic and Recognizer load.
  • However, with magic_word the duration of each sound must be checked before the endpointer can send the first audio sample to Recognizer. This has the disadvantage of adding latency to magic_word recognitions. With selective_barge_in, on the other hand, Speech Server can send audio to Recognizer immediately once speech begins.

In both magic_word and selective_barge_in, the application must activate grammars that contain only single words or very short phrases such as “go back”, “skip this”, or “wake up”:

  • A complex grammar adds too much recognition processing in these modes.
  • Callers speak with uncertainty during prompt playback, and the uncertainty lowers recognition success. Single words and short phrases limit how long the caller's speech overlaps the prompt.
  • To improve the performance of magic_word and selective_barge_in, applications must instruct callers to pause briefly before and after saying the key words.

Setting barge-in modes (endpointing)

The application and Speech Server must co-ordinate the speech detector and Recognizer: both resources must be set to compatible modes for any given recognition.

Compatible mode combinations:

swiep_mode              swirec_barge_in_mode
----------------------  ----------------------
begin_only (default)    normal
magic_word              magic_word
selective_barge_in      selective_barge_in
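
An application could verify that a pair of settings is compatible before it requests a recognition. The following is a minimal sketch of such a check. The function and constant names are hypothetical, and the mapping simply restates the compatible combinations listed above; it does not call any Speech Server API.

```python
# Compatible pairs, keyed by swiep_mode, restated from the list above.
COMPATIBLE_MODES = {
    "begin_only": "normal",
    "magic_word": "magic_word",
    "selective_barge_in": "selective_barge_in",
}


def modes_are_compatible(swiep_mode, swirec_barge_in_mode):
    """Return True if the speech detector and Recognizer modes match up."""
    expected = COMPATIBLE_MODES.get(swiep_mode)
    # An unknown swiep_mode is never compatible with anything.
    return expected is not None and expected == swirec_barge_in_mode


if __name__ == "__main__":
    print(modes_are_compatible("begin_only", "normal"))
    print(modes_are_compatible("magic_word", "selective_barge_in"))
```

Checking the pair in one place keeps a mismatch, such as magic_word in the speech detector with selective_barge_in in Recognizer, from reaching the server.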

Use the following parameters to control the speech detector:

Parameter                    Description                                                              Default
---------------------------  -----------------------------------------------------------------------  --------------------
swiep_mode                   Sets special recognition modes (such as magic word) in the endpointer.   begin_only
swiep_magic_word_max_msec    Maximum duration of a magic word candidate for recognition.              800 (milliseconds)
swiep_magic_word_min_msec    Minimum duration of a magic word candidate for recognition.              200 (milliseconds)
incompletetimeout            Duration of silence to determine that callers have finished speaking.    1500 (milliseconds)

Use the following parameters to control Recognizer:

Parameter                                Description                                                 Default
---------------------------------------  ----------------------------------------------------------  --------
swirec_barge_in_mode                     Sets special recognition modes in Recognizer.               normal
swirec_magic_word_conf_thresh            Confidence threshold for magic word recognition results.    500
swirec_selective_barge_in_conf_thresh    Confidence threshold for selective_barge_in mode.           500
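
To show how the Recognizer mode and the confidence thresholds might interact, here is a rough sketch of the barge-in decision. It is not the Speech Server implementation. The function name is hypothetical, and it assumes that a "successful result" means a confidence score at or above the configured threshold.

```python
def should_interrupt_prompt(barge_in_mode, speech_detected,
                            confidence=None, conf_thresh=500):
    """Decide whether the current prompt should be stopped.

    barge_in_mode mirrors swirec_barge_in_mode. conf_thresh stands in for
    swirec_magic_word_conf_thresh or swirec_selective_barge_in_conf_thresh.
    Comparing with >= is an assumption made for this sketch.
    """
    if barge_in_mode == "normal":
        # Paired with begin_only: stop as soon as speech begins.
        return bool(speech_detected)
    if barge_in_mode in ("magic_word", "selective_barge_in"):
        # Stop only after a recognition result clears the threshold.
        return bool(speech_detected
                    and confidence is not None
                    and confidence >= conf_thresh)
    raise ValueError(f"unknown barge-in mode: {barge_in_mode!r}")


if __name__ == "__main__":
    print(should_interrupt_prompt("normal", True))
    print(should_interrupt_prompt("magic_word", True, confidence=620))
    print(should_interrupt_prompt("selective_barge_in", True, confidence=300))
```

In this sketch, raising the threshold makes accidental interruptions less likely, at the cost of rejecting more genuine commands.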