Wakeup words

A wakeup word is a word or phrase that users can say to activate an application, for example, Hey Nuance or Hi Dragon. ASRaaS reports the wakeup word spoken by the user, and optionally removes the word from the final transcript.

Follow these steps to use wakeup words in your client applications.

Define wakeup words

Specify one or more wakeup words or phrases in your recognition request, using WakeupWord: words.

Each wakeup word can be a single word or a multi-word phrase. The recognition resource fields weight_enum, weight_value, and reuse are ignored for wakeup words.

For best recognition results, include several variations of the wakeup word your application can accept. (Use the Sample recognition client to try it out.)

# Define wakeup words
wakeups = RecognitionResource(
    wakeup_word = WakeupWord(
        words = ["Hi Dragon", "Hey Dragon", "Yo Dragon"]
    )
)

# Add wakeups to resource list, remove from final results
# (travel_dlm and places_wordset are other resources defined elsewhere in the client)
def client_stream(wf):
    init = RecognitionInitMessage(
        parameters = RecognitionParameters(
            language = 'en-us',
            topic = 'GEN',
            audio_format = AudioFormat(pcm=PCM(sample_rate_hz=wf.getframerate())),
            result_type = 'FINAL',
            utterance_detection_mode = 'SINGLE',
            recognition_flags = RecognitionFlags(
                filter_wakeup_word = True
            )
        ),
        resources = [travel_dlm, places_wordset, wakeups]
    )
    yield RecognitionRequest(recognition_init_message=init)
    # Audio from wf is streamed after the init message (omitted from this excerpt)
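
One way to assemble several variations is to combine the greetings your application accepts with its name. The following is a minimal sketch in plain Python. The helper build_wakeup_variations is hypothetical and not part of the ASRaaS API; the list it returns is what you would pass as WakeupWord: words.

```python
def build_wakeup_variations(greetings, name):
    """Combine each greeting with the name, collapsing extra whitespace
    and skipping case-insensitive duplicates (hypothetical helper)."""
    seen = set()
    variations = []
    for greeting in greetings:
        # split() + join() collapses stray spaces such as "hey  Dragon"
        phrase = " ".join(f"{greeting} {name}".split())
        key = phrase.lower()
        if key not in seen:
            seen.add(key)
            variations.append(phrase)
    return variations


words = build_wakeup_variations(["Hi", "Hey", "Yo", "hey "], "Dragon")
print(words)
```

The duplicate check is only a convenience so that the same phrase is not sent twice in the request.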

Filter wakeup word

Optionally remove the wakeup word from the final results by setting RecognitionFlags: filter_wakeup_word to true.

By default, this flag is false, meaning wakeup words are included in the resulting transcript.

When the audio contains only a wakeup word, it is always included in the final results even when filter_wakeup_word is true. See Only a wakeup word for an example.

See the results

When filter_wakeup_word is true, ASRaaS removes the wakeup word spoken by the user (if any) from the final transcription results, in most situations. Specifically, in the result Hypothesis:

  • A wakeup word at the start of formatted_text or minimally_formatted_text result is removed from final results.

  • A wakeup word as the first element of a word array is removed from final results.

  • However, if the wakeup word is the only input, it is not filtered. It is included in the final results: in formatted_text, minimally_formatted_text, and in the word array. In this situation, filter_wakeup_word is ignored.

  • In all partial results, wakeup words are reported normally. They are not removed from partial or immutable partial results.
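
The rules above for final results can be modeled in a few lines, which may help when writing client-side tests. This is only an illustration of the documented behavior, not how ASRaaS implements filtering. The function expected_final_text and its parameters are hypothetical names, and the matching here is a simple case-insensitive comparison of leading words.

```python
def expected_final_text(transcript, wakeup_words, filter_wakeup_word):
    """Return (final_text, detected_wakeup_word) as the rules above describe.

    Illustration only: models final results, not partial results.
    """
    words = transcript.split()
    # Try longer wakeup phrases first so the longest match wins.
    for wakeup in sorted(wakeup_words, key=len, reverse=True):
        wake_tokens = wakeup.lower().split()
        if not wake_tokens:
            continue
        lead = [w.strip(",.!?").lower() for w in words[:len(wake_tokens)]]
        if lead != wake_tokens:
            continue
        rest = words[len(wake_tokens):]
        # Keep the wakeup word when filtering is off or it is the only input.
        if not filter_wakeup_word or not rest:
            return transcript, wakeup
        return " ".join(rest), wakeup
    # No wakeup word at the start: the transcript is reported unchanged.
    return transcript, None


wakeups = ["Hi Dragon", "Hey Dragon", "Yo Dragon"]
print(expected_final_text("Hey Dragon I'd like to watch the Godfather", wakeups, True))
print(expected_final_text("Hey Dragon", wakeups, True))
```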

These are the final results for the utterance "Hey Dragon, I’d like to watch The Godfather" with filter_wakeup_word set to true. The wakeup word is filtered out, so it does not appear in the final results:

stream ../../audio/wuw.wav
100 Continue - recognition started on audio/l16;rate=8000 stream
final: I'd like to watch the Godfather

These are the results for the IMMUTABLE_PARTIAL result type, showing the wakeup word in partial but not final results:

stream ../../audio/wuw.wav
100 Continue - recognition started on audio/l16;rate=8000 stream
partial: Hey
partial: Hey Dragon
partial: Hey Dragon I
final: I'd like to watch the Godfather

See the detected word

The wakeup word spoken by the user is returned in Hypothesis: detected_wakeup_word.

When the input includes a wakeup word and ASRaaS recognizes it, the result field, detected_wakeup_word, always contains the wakeup word, even if the word is removed from the final results by filter_wakeup_word set to true.

If the user does not say any of the wakeup words, the transcription proceeds without error, reporting all words spoken by the user. The detected_wakeup_word field is not included in the result.

The overall result properties remain intact, including abs_start_ms, utterance_info, and the hypothesis confidences, which all reflect the presence of the wakeup word.

This extract from the full results for IMMUTABLE_PARTIAL shows the wakeup word in the partial results, but not the final results, and the detected wakeup word at the end:

result {
  result_type: PARTIAL
  hypotheses {
    formatted_text: "Hey Dragon"
    minimally_formatted_text: "Hey Dragon"
  }
}

result {
  result_type: FINAL
  hypotheses {
    formatted_text: "I\'d like to watch the Godfather"
    minimally_formatted_text: "I\'d like to watch the Godfather"
    ...
    detected_wakeup_word: "Hey Dragon"
  }
}
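
A client can read detected_wakeup_word from each final hypothesis and treat a missing or empty value as "no wakeup word spoken". The sketch below uses SimpleNamespace objects as stand-ins for Hypothesis messages so that it runs without the generated gRPC stubs. The helper wakeup_from_hypothesis is a hypothetical name, and the empty-string handling assumes the usual protobuf behavior of returning "" for an unset string field.

```python
from types import SimpleNamespace


def wakeup_from_hypothesis(hypothesis):
    """Return the detected wakeup word, or None when the field is absent or empty."""
    # Assumption: an unset string field reads as "" on a generated message,
    # so both "missing attribute" and "" are mapped to None here.
    detected = getattr(hypothesis, "detected_wakeup_word", "")
    return detected or None


# Stand-ins for Hypothesis messages (not real ASRaaS objects)
with_wakeup = SimpleNamespace(
    formatted_text="I'd like to watch the Godfather",
    detected_wakeup_word="Hey Dragon",
)
without_wakeup = SimpleNamespace(formatted_text="Play some music")

print(wakeup_from_hypothesis(with_wakeup))
print(wakeup_from_hypothesis(without_wakeup))
```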

Limitations

Grammars (RecognitionResource: inline_grammar) are not supported when wakeup words are used.
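
An application may want to catch this combination before sending the request. The following is a minimal sketch, assuming each resource is represented as a plain dictionary keyed by its resource type. The dictionaries and the check_wakeup_resources helper are hypothetical stand-ins, not ASRaaS message types.

```python
def check_wakeup_resources(resources):
    """Raise ValueError if wakeup words and an inline grammar are requested together."""
    # Collect the resource types present, for example {"wakeup_word", "inline_wordset"}
    kinds = {kind for resource in resources for kind in resource}
    if "wakeup_word" in kinds and "inline_grammar" in kinds:
        raise ValueError("inline_grammar is not supported when wakeup words are used")
    return resources


# Stand-in resources (illustrative values only)
wakeups = {"wakeup_word": ["Hi Dragon", "Hey Dragon", "Yo Dragon"]}
wordset = {"inline_wordset": "{}"}
grammar = {"inline_grammar": "<grammar/>"}

check_wakeup_resources([wordset, wakeups])  # accepted
try:
    check_wakeup_resources([grammar, wakeups])
except ValueError as err:
    print(err)
```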

Only a wakeup word

When the user says only a wakeup word, the word is always included in the results: in the final hypotheses and in detected_wakeup_word. For example, this is the final result for the utterance "Hey Dragon" with filter_wakeup_word set to true. The wakeup word is not filtered from the results:

stream ../../audio/wuw.wav
100 Continue - recognition started on audio/l16;rate=8000 stream
final: Hey Dragon

The wakeup word also appears in detected_wakeup_word in the details of the final hypothesis, whether filter_wakeup_word is true or false:

  hypotheses {
    average_confidence: 0.3019999861717224
    formatted_text: "Hey Dragon"
    minimally_formatted_text: "Hey Dragon"
    ...
    words {
      text: "Hey"
      start_ms: 220
      end_ms: 320
    }
    words {
      text: "Dragon"
      start_ms: 320
      end_ms: 740
    }
    ...
    detected_wakeup_word: "Hey Dragon"
  }
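
Because a lone wakeup word is never filtered, a client may want to recognize this case and, for example, prompt the user for a command. The sketch below compares the final text with detected_wakeup_word after light normalization. The function is_wakeup_only is a hypothetical helper, and ignoring case and punctuation is an assumption about how strict the comparison should be.

```python
import string


def is_wakeup_only(formatted_text, detected_wakeup_word):
    """True when the final text is nothing but the detected wakeup word."""
    if not detected_wakeup_word:
        return False

    def normalize(text):
        # Drop punctuation, lowercase, and collapse whitespace before comparing
        stripped = text.translate(str.maketrans("", "", string.punctuation))
        return " ".join(stripped.lower().split())

    return normalize(formatted_text) == normalize(detected_wakeup_word)


print(is_wakeup_only("Hey Dragon", "Hey Dragon"))
print(is_wakeup_only("I'd like to watch the Godfather", "Hey Dragon"))
```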