Troubleshooting barge-in and endpointing issues

This section describes common speech-detection problems that you may encounter when using the barge-in feature.

Most barge-in and endpointing issues are caused by the quality of the incoming audio. For example, prompt echo may bleed into the input audio, or the input audio may be too quiet or too loud. These problems can cause the endpointer to activate too early or too late. To resolve them, determine exactly what is causing the problem and adjust the system configuration accordingly.

What is endpointing?

Endpointing is the process of determining the beginning and end of speech within an incoming sample stream. Good endpointing allows more efficient use of the Voice Platform recognition engine’s resources, because it keeps silence, background noises, and many non-speech sounds from being processed needlessly.

Endpointing involves several factors that can be affected by Voice Platform properties:

  • noise floor—A minimum sound level used as a reference to determine whether the caller is speaking. Sounds quieter than this volume are ambient sounds, to be ignored.
  • sensitivity—This determines how easily the endpointer is triggered to detect the beginning of speech. If the endpointer is too sensitive, it may be triggered by background noise. If not sensitive enough, it may miss the beginning of caller speech.
  • timeout—A period of silence used to determine the end of caller speech. An extended silence may indicate the final end of caller speech. However, the endpointer must allow a reasonable amount of time for pauses between words so the caller's responses are not cut off before they are complete. See Recognition timeouts for details.

The Voice Platform endpointer detects the beginning of speech when it receives sound that is significantly louder than the noise floor, at a level determined by the sensitivity. Samples begin flowing to the recognition service as soon as the beginning of speech is detected, and this flow continues until the end of speech. Voice Platform uses both the endpointer and the recognizer to detect this end of speech. The endpointer waits for extended silence, while the recognizer determines if the speech already matches an item in the current grammar.

The Voice Platform endpointer is designed to adapt to most audio conditions automatically. It is optimized for accuracy, reduced CPU utilization, and better out-of-the-box performance.

Use caution when modifying endpointer properties

Some properties directly affect Nuance endpointer performance. You can set such properties in a <property> element at different scopes, from an individual field to the application root. Some can also be set as attributes of a <prompt> element.
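As an illustration of these scoping options, the following sketch uses standard VoiceXML: a document-scoped property, a field-scoped override, and a prompt-level bargein attribute. The field names, grammar file, and property values are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <!-- Document scope: applies to every dialog in this document -->
  <property name="sensitivity" value="0.5"/>
  <form id="order">
    <field name="quantity">
      <!-- Field scope: overrides the document-level value for this field only -->
      <property name="sensitivity" value="0.4"/>
      <!-- Prompt attribute: disable barge-in for this one prompt -->
      <prompt bargein="false">Please say how many items you want.</prompt>
      <grammar src="quantity.grxml"/>
    </field>
  </form>
</vxml>
```

The innermost setting wins: while the quantity field is active, the field-scoped value applies; elsewhere, the document-scoped value applies.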

Under most circumstances, you do not need to change endpointer settings in Voice Platform at all. The endpointer is already optimized at installation, and any changes can easily cause problems. If you do modify any endpointer properties, be sure to test the effects carefully before finalizing these changes.

It is strongly advised that you not modify the default values of the endpointer without first consulting an expert (a Nuance Professional Services employee). If you do change endpointer parameters, first ensure that you have a mechanism to roll back to the defaults if necessary. Finally, if you have modified any endpointing parameters, state this explicitly when opening any technical support requests.

Recording pre-endpointed audio

The most straightforward way to determine whether the problem lies with the input audio is to configure the system to record the pre-endpointed audio input. The system then records all of the audio that is sent to it, rather than only the audio that is forwarded to the recognizer after endpointing. This approach is not recommended for a production environment, but it can be useful during testing and troubleshooting.

Recording the original audio input allows you to compare it with the endpointed version sent to the recognizer, so you can determine where the problem lies.

Parameters for initial endpointing

This section describes the parameters that allow the endpointer to reject speech-like residual echoes, which are typically strongest at the start of the call.

You can set these parameters in a Recognizer user configuration file. For a discussion of how to create such a file, see Recognizer user configuration files.

Note: The endpointer is optimized to provide the best performance with the default settings for these service properties. Modifying these properties is particularly risky, so be aware that the warnings about adjusting the endpointer are doubly applicable here.

swiep_bargein_initial_noise_floor

This service property is used only when bargein is enabled. It specifies the speech level (on a dBm scale) of the quietest utterance that can trigger the endpointer at the start of the first recognition. This minimum keeps the endpointer from being triggered by prompt echoes early in the first recognition, before echo cancellation has converged. The minimum level gradually decreases from this value to the value specified by swiep_bargein_min_noise_floor, according to the timing specified by swiep_bargein_initial_hold_seconds and swiep_bargein_initial_decay_seconds.

swiep_bargein_initial_hold_seconds

This service property specifies how long (in seconds) the initial noise floor set by swiep_bargein_initial_noise_floor is held. After this time, the initial noise floor starts to transition to the value specified in swiep_bargein_min_noise_floor. The transition lasts for the duration set in swiep_bargein_initial_decay_seconds. This service property is used only when bargein is enabled.

swiep_bargein_initial_decay_seconds

This service property specifies the duration (in seconds) during which the minimum noise floor transitions from the value specified in swiep_bargein_initial_noise_floor to the value specified in swiep_bargein_min_noise_floor. This service property is used only when bargein is enabled.

swiep_bargein_min_noise_floor

This service property specifies the absolute minimum level of speech (on a dBm scale) that can trigger the endpointer. This service property is always used, regardless of whether bargein is enabled.
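Taken together, these four properties describe an initial noise floor that is held for a time and then decays to the absolute minimum. As a sketch only, a Recognizer user configuration file fragment might set them as shown below; the element layout follows the conventions described in Recognizer user configuration files, and the values are illustrative assumptions, not recommended settings.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative values only. Consult Nuance Professional Services before
     changing these defaults, and keep a copy of the original file so you
     can roll back. -->
<configuration>
  <!-- Hold an elevated floor of -40 dBm for the first 2 seconds of the
       first recognition, while echo cancellation converges -->
  <param name="swiep_bargein_initial_noise_floor"><value>-40</value></param>
  <param name="swiep_bargein_initial_hold_seconds"><value>2.0</value></param>
  <!-- Then decay over 3 seconds down to the absolute minimum of -55 dBm -->
  <param name="swiep_bargein_initial_decay_seconds"><value>3.0</value></param>
  <param name="swiep_bargein_min_noise_floor"><value>-55</value></param>
</configuration>
```

With these hypothetical values, an utterance quieter than -40 dBm cannot barge in during the first two seconds; after the three-second decay, anything above -55 dBm can trigger the endpointer.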

Detection sensitivity

The sensitivity of the endpointer determines the sound level at which the endpointer begins to interpret or record speech. To trigger the endpointer, the current signal must exceed an adapting estimate of the background noise by a certain threshold. If the endpointer is too sensitive, it may be triggered by background noise. If not sensitive enough, it may miss the beginning of caller speech.

You can adjust the sensitivity of the endpointer using the VoiceXML sensitivity property, which ranges from 0.0 (least sensitive) to 1.0 (most sensitive).

<property name="sensitivity" value="0.5"/>
  • To increase the sensitivity of the endpointer for quiet input, increase the value of the sensitivity property from 0.5 to a maximum of 1.0.
  • To decrease the sensitivity of the endpointer for noisy environments, decrease the value of the sensitivity property from 0.5 to a minimum of 0.0.

The default value of the sensitivity property is 0.5.

Modifying detection sensitivity within a prompt

A reduced sensitivity level can be applied during prompt playback to reduce the chance of false barge-in. If there is evidence that reduced sensitivity during the prompt is beneficial, adjust the swirec.swiep_in_prompt_sensitivity_percent property:

<property name="swirec.swiep_in_prompt_sensitivity_percent" value="40"/>

A value lower than 50 reduces the chance of accidental barge-in, but requires louder speech to barge in. Most applications keep the default value.

Beginning of speech margin

To ensure that the beginning of speech is captured, when the endpointer sends speech to the Voice Platform recognizer it includes a brief interval of the sound or silence that occurred before the endpointer was triggered. This margin increases recognition accuracy by including the first moments of speech, which might otherwise be lost against background noise.

By default, the beginning-of-speech margin is set at 200 milliseconds. You can modify this margin by specifying a swirec.swiep_BOS_backoff value:

<property name="swirec.swiep_BOS_backoff" value="300"/>

Here, the margin is increased to 300 milliseconds.

End of speech margin

To ensure that the end of speech is captured, the endpointer also sends a brief interval of the sound or silence that occurs after the end of speech is determined. As with the beginning-of-speech margin, the end-of-speech margin increases recognition accuracy by ensuring that the final sounds of the speech are not lost against background noise.

By default, the end-of-speech margin is set at 240 milliseconds. You can modify this margin by specifying a swirec.swiep_EOS_backoff value:

<property name="swirec.swiep_EOS_backoff" value="400"/>

Here, the margin is increased to 400 milliseconds.

The end-of-speech margin also determines the minimum possible value for the incompletetimeout parameter (see incompletetimeout): incompletetimeout must be greater than or equal to the end-of-speech margin. If you set incompletetimeout to a smaller value, the end-of-speech margin is used instead.
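For example, the following hypothetical settings satisfy this constraint, with incompletetimeout comfortably above a 400-millisecond end-of-speech margin:

```xml
<!-- End-of-speech margin: 400 ms -->
<property name="swirec.swiep_EOS_backoff" value="400"/>
<!-- incompletetimeout must be >= 400 ms here; a smaller value would be
     ignored and the 400 ms margin used instead -->
<property name="incompletetimeout" value="750ms"/>
```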

Enabling waveform log collection

You can record the full audio input using the save-waveform parameter in the application’s session.xml file. To do so, enter this parameter in the <speechrecog> section of <speechserver>, under the <sessionparameters> section, and set its value to 1 (true):

<sessionparameters>
    ...
    <speechserver>
        <speechrecog>
            <param name="save-waveform"><value>1</value></param>
        </speechrecog>
        ...
    </speechserver>
  ...
</sessionparameters>

The endpointer logs the input samples as ep*.wav files in the standard call log directories. By default, these directories are named as follows:

%NUANCE_DATA_DIR%/Company/callLogs/App/YYYY/MMmonth/DD/HH/

In this example:

  • %NUANCE_DATA_DIR% is the default data directory (on Windows, C:\ProgramData\Nuance\Enterprise; on Linux, /var/local/Nuance).
  • Company is the name of the company.
  • App is the name of the application.

YYYY, MMmonth, DD, and HH represent the year, month, day, and hour respectively.

The files themselves are named according to the time, hostname, session ID, and type. These files contain all the audio received by the telephony service, including the leading and trailing silences that endpointing removes.

Note: When waveform collection is enabled, it can fill the disk relatively quickly. Nuance therefore recommends that you enable this feature only during analysis, and disable it before you put your application into production.

For a complete discussion of recorded utterance files, see Data management.

For more information on session.xml files, see Application defaults.

Whole call recording

Voice Platform also supports whole call recording, which lets you record the entire call from start to finish, including prompts. These recordings are often helpful because they provide insight into whether the volume of the incoming speech is too loud or too quiet, or whether prompt echo is bleeding in. For details on this feature, see Enabling whole call recording.