Using barge-in
The barge-in feature lets users interrupt a prompt while it is playing, so they can provide an immediate answer and thus save time. When barge-in is enabled, Voice Platform listens for caller input even while the prompt is playing, and stops playing the prompt as soon as speech or DTMF input is detected. When barge-in is disabled, the interpreter plays an entire prompt without allowing interruption.
Best practices: enable barge-in to allow experienced users to speed the dialog by interrupting prompts as soon as they know what to say. Disable barge-in for crucial error messages, advertisements, and other prompts that the user must hear in their entirety.
By default, barge-in is enabled unless Voice Platform detects that the audio channel does not support full-duplex audio (simultaneous play and record). In that case, a warning message is logged and barge-in is automatically disabled.
Enabling or disabling barge-in
You can explicitly disable (or enable) barge-in in VoiceXML using the bargein property, which configures the endpointer for barge-in. You can disable barge-in functionality in a document by setting this property to “false” in the header:
<property name="bargein" value="false"/>
If you don’t want to use barge-in in an application at all, set the bargein property to “false” in the header of the application root document. To suppress barge-in for a single prompt, you can set bargein to “false” in the <prompt> itself:
<prompt bargein="false">
Please listen to this crucial information before you speak.
</prompt>
The input received by the endpointer differs depending on whether barge-in is enabled. When barge-in is enabled, the ratio of silence to speech sent to the endpointer increases considerably, and the initial input to the endpointer is expected to be either the residual prompt or background noise. With barge-in disabled, the first sound processed is usually the caller speaking.
If your application expects speech right at the beginning of the first recognition call, perhaps with no prompt—for example, in an automated system that calls people—you can set bargein to “false” for the first recognition, allowing the endpointer to interpret the first incoming signals as speech, rather than background noise. Then you can set it back to “true” if barge-in is used throughout the rest of the call.
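As a minimal sketch of this pattern, assuming an outbound call whose first field plays no prompt, the bargein property can be scoped to individual fields. The form and field names, the prompt wording, and the grammar files (hello.grxml, yesno.grxml) are hypothetical:
<form id="outbound">
  <!-- First recognition: no prompt is played, so barge-in is turned off
       and the first incoming signal is treated as speech -->
  <field name="greeting">
    <property name="bargein" value="false"/>
    <grammar src="hello.grxml" type="application/srgs+xml"/>
  </field>
  <!-- Later recognitions: barge-in is turned back on -->
  <field name="confirm">
    <property name="bargein" value="true"/>
    <prompt>This is a reminder about your appointment. Can you confirm it?</prompt>
    <grammar src="yesno.grxml" type="application/srgs+xml"/>
  </field>
</form>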
Suppressing barge-in for only part of a prompt
You can suppress barge-in during the first part of a prompt by using the swirec.swiep_suppress_bargein_time property:
<property name="swirec.swiep_suppress_bargein_time" value="3000"/>
Here, a caller cannot interrupt for the first three seconds (3000 milliseconds) of the prompt—long enough for a short statement (“Hi! Welcome to PizzaTalk”).
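The following sketch shows how this might look in a dialog, assuming the property is set at field scope so that the suppression applies only to that field's prompt. The field name and the grammar file order.grxml are hypothetical:
<field name="order">
  <!-- Callers cannot interrupt during the first 3000 ms of the prompt -->
  <property name="swirec.swiep_suppress_bargein_time" value="3000"/>
  <prompt>
    Hi! Welcome to PizzaTalk.
    What would you like to order?
  </prompt>
  <grammar src="order.grxml" type="application/srgs+xml"/>
</field>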
bargeintype
When barge-in is enabled, you can use the bargeintype property to specify a mode that determines how the barge-in functionality is handled:
- speech—The system detects any user speech. The platform immediately terminates the current prompt, and sends the speech to the recognizer.
- selective—The system only terminates the current prompt when and if a specific key word or phrase is recognized. The system detects speech and sends it to the recognizer, but does not interrupt the prompt unless that speech is recognized as the key word or phrase. This type is a Nuance extension.
- hotword—This bargeintype functions exactly like the selective type, with the additional constraint that the key word (a hot word) must fall within a specified minimum and maximum duration. For details on how to specify these limits, see Tuning hot word duration.
By default, the bargeintype is assumed to be speech unless otherwise specified.
The difference between hotword and selective is that in a hotword barge-in, the word triggering the barge-in must be of a specific length (typically brief), while in a selective barge-in the key word can be as long as needed. For example, if the bargeintype is selective, the key word that triggers the barge-in can be an entire phrase, like “I’d like to exit, please.” If the bargeintype is hotword, the key word itself will be subject to duration constraints, and will likely be shorter. For example, it might simply be the word “exit.”
For more on bargeintype modes, see Using hot word recognition.
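As a sketch of how a mode might be selected, the bargeintype property can be set at field scope, and VoiceXML 2.0 also defines a bargeintype attribute on <prompt>. The field name, the prompt wording, and the grammar file exit.grxml are hypothetical:
<field name="command">
  <!-- Interrupt the prompt only when the hot word is recognized -->
  <property name="bargeintype" value="hotword"/>
  <prompt bargein="true" bargeintype="hotword">
    Here are today's headlines. Say exit at any time to stop.
  </prompt>
  <grammar src="exit.grxml" type="application/srgs+xml"/>
</field>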
Initial noise floor parameters
At the start of the first utterance of a call, the endpointer is triggered by sounds louder than a certain level. This level is called the noise floor.
During a prompt, the noise floor is temporarily increased to a higher level in order to prevent residual echoes (which are strongest at the beginning of a call) from activating the endpointer. This higher initial level then decays back down to the regular level. There are two properties you can use to affect this behavior:
- swiep_bargein_initial_hold_seconds: How long the initial noise floor is used.
- swiep_bargein_initial_decay_seconds: How long it takes to revert to the regular noise floor after the swiep_bargein_initial_hold_seconds period has passed.
The default values for these properties are already optimized for most applications. Use the defaults unless the initial prompt is shorter than 2 seconds. In that case, set swiep_bargein_initial_hold_seconds to the duration of the prompt minus 0.1 seconds, and set swiep_bargein_initial_decay_seconds to 0.1 seconds. For example, for a 1.5-second initial prompt, set swiep_bargein_initial_hold_seconds to 1.4 and swiep_bargein_initial_decay_seconds to 0.1.
You can only modify these properties by specifying them in a user configuration file. For more information, see Parameters for initial endpointing.
Tracking the time of barge-in
To keep track of the time when a barge-in occurred, put a <mark> element at the beginning of a prompt, and use the resulting marktime shadow variable from the field variable. For example, you can use <mark> elements to keep track of the information a caller received before interrupting:
<field name="team">
<prompt bargein="true">
<mark name="ad_start"/>
Baseball scores brought to you by Elephant Peanuts.
There's nothing like the taste of fresh roasted peanuts.
Elephant Peanuts. Ask for them by name.
<mark name="ad_end"/>
<break time="500ms"/>
Say the name of a team. For example, say Boston Red Sox.
</prompt>
<grammar type="application/srgs+xml" root="Sox" version="1.0">
<rule id="Sox" scope="public">
<one-of>
<item>Boston</item>
<item>Red</item>
</one-of>
</rule>
</grammar>
<filled>
<if cond="typeof(team$.markname) == 'string' &amp;&amp;
(team$.markname=='ad_end' ||
(team$.markname=='ad_start' &amp;&amp;
team$.marktime >= 5000))">
<assign name="played_ad" expr="true"/>
<else/><assign name="played_ad" expr="false"/>
</if>
</filled>
</field>
Here, if a caller barges in during the advertisement for Elephant Peanuts, the last mark reached is “ad_start”. If the caller barges in after the advertisement (during the prompt that requests a team name, or after that prompt has finished playing), the last mark reached is “ad_end”. Either way, the mark name is automatically assigned to the team$.markname shadow variable, and the time elapsed since that mark, in milliseconds, is assigned to team$.marktime.
If the last mark reached was ad_end, or if it was ad_start and at least five seconds (5000 milliseconds) had elapsed since that mark, the caller is assumed to have heard the ad, and the played_ad variable is set to “true”. Otherwise, played_ad is set to “false”.
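The fragment above assigns to played_ad, so the variable must already be declared in an enclosing scope; VoiceXML raises error.semantic on assignment to an undeclared variable. The following sketch shows one way the surrounding form might look, assuming the application wants to replay the sponsor message when the caller skipped it. The form id and the follow-up block are hypothetical:
<form id="scores">
  <!-- Declare the variable before the field's <filled> assigns to it -->
  <var name="played_ad" expr="false"/>

  <!-- The "team" field shown above goes here -->

  <block>
    <!-- If the caller interrupted the ad early, play it again without barge-in -->
    <if cond="!played_ad">
      <prompt bargein="false">
        Baseball scores brought to you by Elephant Peanuts.
      </prompt>
    </if>
  </block>
</form>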