Using hot word recognition

The term hot word recognition describes a scenario where the system is constantly listening for a particular command word or phrase (the hot word) that acts as the signal for it to take a given action. For example, hot word recognition can take place while the user is placing a bridged transfer call to a third party. In this scenario, when the system recognizes the hot word, it ends the call to the third party, and the dialog between the caller and the application resumes.

Another possible use of hot word recognition is during the playing of a long prompt, where the hot word can be used simply to stop the prompt and continue with the next part of the application. The advantage of using hot word over normal recognition in this scenario is that the prompt playback will only be stopped when a successful recognition has occurred, rather than at the moment when the user has started to speak.

A typical speech recognition application is designed as a turn-taking dialog, where the system and the user take turns speaking to one another. In this scenario, the recognizer knows when to expect audio from the user.

In other cases, the system listens to and processes the stream of speech but does not take any action until it recognizes a hot word. In this scenario, the recognizer does not know when to expect the hot word. However, the moment the hot word is recognized, the application stops listening and takes action.

A hot word is useful in two principal situations:

Hot word return from a <transfer> element allows the user to terminate a bridged <transfer> and return to the application. Voice Platform supports either near-end (the caller or party A uses the hot word) or far-end (the called party or party C uses the hot word) hot word return.
A field-level hot word allows you to specify hot word recognition instead of standard recognition for a particular <field>.

Hot word return from a bridged transfer

Using the VoiceXML <transfer> element, you can transfer the user to a third party. If you set the attribute type="bridge" (or bridge="true" in VoiceXML 2.0) in the <transfer> element, the application stays connected while the user is in the call with the third party.
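
As a minimal sketch of the two attribute forms mentioned above, the fragments below show the same bridged transfer written once with type and once with bridge. The form item names and the telephone number are placeholders invented for illustration.

```xml
<!-- VoiceXML 2.1 style: the type attribute selects a bridged transfer -->
<transfer name="call_with_type" dest="tel:+15555550100" type="bridge"/>

<!-- VoiceXML 2.0 style: the boolean bridge attribute does the same -->
<transfer name="call_with_bridge" dest="tel:+15555550100" bridge="true"/>
```

Use only one of the two attributes on a given <transfer> element; the VoiceXML 2.1 specification treats specifying both as an error.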

During a bridged transfer, you can define a hot word and instruct Voice Platform to listen for it on one leg of the call. If Voice Platform recognizes this hot word, it terminates the call to the third party, and returns the user to the main application. Voice Platform supports hot word recognition in a bridged transfer on either the near end (party A) or the far end (party C) of a call. Voice Platform can only monitor one end of a call during a transfer.

For example, you can use hot word on party A (default) when a central or main application combines several applications by connecting the caller to different telephone numbers. Suppose the component applications read the news, read emails, and provide a weather forecast. The main application transfers the caller to one of these component applications, but stays connected and listens for the hot word. When the caller says the hot word, the application disconnects the third party application and resumes the call with the caller.

Implementing the hot word within a transfer

To implement hot word recognition during a bridged transfer, nest a <grammar> element within the <transfer> element to recognize the hot word. This grammar specifies the hot word, and the natural language slot that it fills.

When using SIP, you must also enable forking for hot word recognition to work in a bridged transfer.

To specify which end of the call to monitor, use the farendhotword attribute of the <transfer> element. This attribute takes a boolean value:

true means that Voice Platform listens for the hot word on the far end (from the third party)
false means that Voice Platform listens for the hot word on the near end (from the caller)

If not specified, the farendhotword value is false by default.

Far end disconnection

When Voice Platform is listening for the hot word on the far end of a call and the far end is disconnected, it assigns one of two values to the <transfer> element guard variable:

far_end_disconnect means that the third party hung up
far_end_recognition means that the third party used the hot word

This value can be used in conditional logic to determine the application’s next action. For example, you can direct the application to transfer the caller to a human agent after a far_end_recognition, but continue normally after a far_end_disconnect.
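
The branching described above might be sketched as follows. This assumes the transfer is named mycall, as in the far end example later in this section; the agent.vxml document is hypothetical and stands in for whatever dialog connects the caller to a human agent.

```xml
<!-- Placed inside a transfer element that sets name="mycall"
     and farendhotword="true" -->
<filled>
    <if cond="mycall == 'far_end_recognition'">
        <!-- Party C spoke the hot word: hand the caller to an agent.
             agent.vxml is a hypothetical document. -->
        <goto next="agent.vxml"/>
    <elseif cond="mycall == 'far_end_disconnect'"/>
        <!-- Party C hung up: no special handling, the form continues. -->
    </if>
</filled>
```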

Example of near end return

The following example shows a hot word grammar used on the near end of a call within a <transfer> element. The hot word is “home home”. When it is recognized, the “home” natural language slot is given the value “home”. The phrase “go home” is a decoy. (For a discussion of decoys, see Tuning the grammar.)

<transfer destexpr="'tel:+1' + phoneNumber" type="bridge">
    <grammar>
        <rule id="Top" scope="public">
            <one-of>
                <!-- Hot word: fills the "home" slot -->
                <item>home home
                    <tag>home = 'home';</tag>
                </item>
                <!-- Decoy: no slot, so recognizing it takes no action -->
                <item>go home</item>
            </one-of>
        </rule>
    </grammar>
</transfer>

This code produces the following behavior when a bridged call transfer is executed. The VoiceXML application:

Determines what number to call
Places the call transfer
Eavesdrops on the bridged call, waiting for the hot word
When the hot word is recognized, disconnects the called party
Resumes VoiceXML document interpretation

For more information, see the <transfer> element.

Example of far end return

This example shows a hot word grammar used on the far end of a call within a <transfer> element, and includes conditional logic to determine what action to take based on the result. To make this second part possible, the <transfer> guard variable is explicitly given a name (“mycall”) so it can be used in <if> and <elseif> conditional expressions:

<transfer name="mycall" destexpr="'tel:+1' + phoneNumber" type="bridge"
    farenddialog="confirm.vxml" farendhotword="true" maxtime="1800s">
    <grammar>
        <rule id="Top" scope="public">
            <one-of>
                <item>home home
                    <tag>home = 'home';</tag>
                </item>
                <item>go home</item>
            </one-of>
        </rule>
    </grammar>
    <!-- Since the transfer has an explicit name, its result can be
         tested by that name in the conditions below. -->
    <filled>
        <if cond="mycall == 'busy'">
            <!-- Action performed if the line was busy when attempting
                 to place the call to party C. -->
        <elseif cond="mycall == 'noanswer'"/>
            <!-- Action performed if there was no answer from party C. -->
        <elseif cond="mycall == 'far_end_disconnect'"/>
            <!-- Action performed if party C disconnected from the call. -->
        <elseif cond="mycall == 'maxtime_disconnect'"/>
            <!-- Action performed if the transfer connection time exceeded
                 the maximum time (maxtime attribute) set in the transfer tag. -->
        <elseif cond="mycall == 'far_end_recognition'"/>
            <!-- Action performed if party C used the hot word. -->
        <else/>
            <!-- Action performed if no other condition is met. -->
        </if>
    </filled>
</transfer>

As in the near end example, the hot word is “home home”, and “go home” is a decoy.

Field-level hot word recognition

Normally, when barge-in is allowed, a prompt stops playing as soon as the system detects input from the user. When you use field-level hot word recognition, the system listens for user input and performs recognition but does not take any action until it recognizes the hot word. This is especially useful for preventing unintentional interruptions when prompts are very long—for example, if a user clears his throat while listening to a news-reading application.

Another use of field-level hot word recognition involves no prompt at all. The application can wait silently until triggered into action by a successful recognition. For example, a voice dialing application could allow callers to have a conversation while the recognizer listens for the hot word. The application could let a caller place a series of telephone calls without needing to hang up: at the end of one call, the caller just speaks a hot word and then gives commands for the next phone call.

To use hot word recognition within a field, set the bargeintype attribute of the <prompt> element to “hotword” or “selective” (see bargeintype). This activates hot word recognition using the grammar of the current field.

Hotword vs. selective

A hotword barge-in is more robust than a selective barge-in. The duration constraints on the hot word automatically disqualify some utterances before they are sent to the recognizer, which prevents some false acceptances and saves recognizer resources.

However, this increased robustness comes at the cost of the processing resources of other Voice Platform components. A hotword barge-in also places limits on the duration of the hot word, which may be a problem in some circumstances, and may cause false rejections if the duration required to speak the hot word is misjudged.

See Tuning hot word duration for details on how to specify the duration constraints on a hot word when the bargeintype is hotword.

Example

Consider the following example of hot word use:

(...)

<!--
    Play the news prompt at the specified offset. Barge-in
    only on recognition to avoid misrecognition cutting
    off the prompt.
-->
<field name="news">
    <prompt bargeintype="hotword">
        <audio src="long_news.wav"/>
    </prompt>
    <!--
        A natural language (NL) slot is needed in hot word
        mode to trigger recognition.
    -->
    <grammar>
        <rule id="Top" scope="public">
            <one-of>
                <item>
                    cancel
                    <tag>news = 'cancel';</tag>
                </item>
            </one-of>
        </rule>
    </grammar>
</field>

(...)

For more information on the <prompt> element, see <prompt>.

Tuning hot word performance

The goal of the tuning process is to find grammar and system property settings that improve recognition performance over a given test set. It is assumed that these settings will then improve recognition performance in a live deployment.

You create a test set by collecting utterance logs from a given host and obtaining transcriptions for them. The endpointed utterance logs are collected by default. For more information, see Waveform logging.

Tuning the grammar

Data analysis during the tuning phase may reveal cases where the system recognized a wrong word as the hot word: for example, perhaps “departure” was misrecognized as the hot word “voyager.” This type of error is known as a false acceptance. If the system makes a false acceptance, it fills the hot word slot and returns control to the application prematurely.

You can reduce the false acceptance rate by including some decoys in the hot word grammar. Decoys are words that may be misrecognized as the hot word, because they are acoustically similar and similar in duration to the hot word.

You include a decoy in the hot word grammar by listing the word but not providing a slot for it. Then if the system recognizes the decoy, it won’t interrupt the prompt because the decoy has no slot; it will simply continue to listen for the hot word. In the following grammar, “voyager” is the hot word; “departure” and “raiders” are decoys.

<grammar>
    <rule id="Hotword" scope="public">
        <one-of>
            <!-- Hot word: fills the "voyager" slot -->
            <item>
                voyager
                <tag>voyager = 'voyager';</tag>
            </item>
            <!-- Decoys: referenced with a low weight and no slot -->
            <item weight='0.1'>
                <ruleref uri="#OogWords"/>
            </item>
        </one-of>
    </rule>
    <rule id="OogWords" scope="public">
        <one-of>
            <item>departure</item>
            <item>raiders</item>
        </one-of>
    </rule>
</grammar>

Consider whether the hot word might commonly occur as part of a longer utterance, for example, “quote for voyager fund”. If your hot word occurs frequently in normal dialog, you can add the longer phrase to the grammar as another decoy. Within the decoy, place the hot word in the context in which users are likely to say it.
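
As a sketch of this idea, the grammar below adds the longer phrase to the earlier “voyager” grammar as a decoy. The rule name and the weight are carried over from that example for illustration only, and the weight in particular may need tuning against your own test set. Like the other decoys, the phrase has no slot.

```xml
<grammar>
    <rule id="Hotword" scope="public">
        <one-of>
            <!-- Hot word: fills the "voyager" slot -->
            <item>
                voyager
                <tag>voyager = 'voyager';</tag>
            </item>
            <!-- Decoy: the hot word in its usual spoken context, with no
                 slot, so recognizing it does not interrupt the prompt -->
            <item weight="0.1">quote for voyager fund</item>
        </one-of>
    </rule>
</grammar>
```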

Tuning hot word duration

When the bargeintype attribute is set to hotword, the default duration limits for the hot word are a minimum of 200 milliseconds, and a maximum of 800 milliseconds. However, you can change these duration limits by using the swirec.swiep_magic_word_max_msec and swirec.swiep_magic_word_min_msec properties. For example:

<field name="news">
    <property name="swirec.swiep_magic_word_min_msec" value="300"/>
    <property name="swirec.swiep_magic_word_max_msec" value="1000"/>
    <prompt bargeintype="hotword">
        <audio src="long_news.wav"/>
    </prompt>
    <grammar>
        <rule id="Top" scope="public">
            <one-of>
                <item>cancel <tag>news = 'cancel';</tag></item>
            </one-of>
        </rule>
    </grammar>

(...)

Here all utterances that last less than 300 milliseconds or more than 1000 milliseconds will be ignored. Only utterances that fall within the range will be sent to the recognizer to determine whether they match the hot word (cancel).

Take care when setting these duration properties:

If the range is too large, it will not filter out noise effectively. This will cause additional processing on the recognizer, and may decrease accuracy due to false acceptances.
If the range is too small, it will ignore some valid uses of the hot word. The caller will be forced to repeat the word, and may not be able to get it to work.
If the duration limits are badly set—if it takes less or more time to speak the word than the limits allow—callers may not be able to barge in at all.

Hot word grammars and DTMF

For information on how to use DTMF in a hot word grammar, see Including DTMF in a hot word grammar.