Audio output

The most common form of executable content in a voice application is audio output, which provides prompts and messages to the caller. The audio can come from a prerecorded audio file or from text-to-speech (TTS) synthesis.

VoiceXML includes several elements that define audio output:

  • An <audio> element plays a prerecorded audio clip.

  • A <prompt> element generates synthesized speech from text input, or plays the prerecorded audio specified in an <audio> element. It may contain SSML elements.

  • The <value> element within a <prompt> evaluates an expression to produce text, and generates synthesized speech from this output using the Vocalizer TTS engine.

Prompts can appear within executable content as well as in elements for collecting user input. Anywhere a <prompt> is valid, any otherwise unmarked text is interpreted as a prompt and used to produce synthesized speech even if the enclosing <prompt> and </prompt> delimiters are omitted.
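
As an illustration, the following minimal sketch combines the three elements and ends with a line of unmarked text that is treated as a prompt. The file name hello.wav, the form id, and the callerName variable are hypothetical and not taken from any shipped application.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="greeting">
    <!-- Hypothetical variable holding the caller's name -->
    <var name="callerName" expr="'Pat'"/>
    <block>
      <prompt>
        <!-- Prerecorded clip (hypothetical file name) -->
        <audio src="hello.wav"/>
        <!-- Text and the <value> result are rendered by TTS -->
        Hello, <value expr="callerName"/>.
      </prompt>
      <!-- Unmarked text: handled as if wrapped in <prompt> -->
      Thanks for calling.
    </block>
  </form>
</vxml>
```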

Fallback text

NVP may be unable to play the file specified in an <audio> element if the file is missing or was recorded in a format that NVP does not support. To guard against this, write the <audio> element with separate start and end tags rather than as an empty element, and place fallback text between them:

<prompt><audio src="welcome.wav">Welcome to our company!</audio></prompt>

If NVP is unable to play the audio file (welcome.wav), it logs an error, but renders the fallback text (Welcome to our company!) as speech in the applicable default TTS voice.

If NVP is unable to play an audio file and no fallback text is specified, the system behavior is determined by the applicable ssml_validation parameter (see The ssml_validation parameter).

SSML

If you are using TTS to produce prompts, you can affect the results by using SSML elements to add expression and other vocal cues to TTS output.

The NVP implementation of SSML is described in SSML elements. Some useful SSML elements are:

  • <break>—Inserts a pause.

  • <emphasis>—Specifies vocal emphasis on the enclosed text.

  • <prosody>—Specifies the speed and volume of speech for the enclosed text.

  • <say-as>—Determines how the enclosed text should be spoken: for example, whether a number should be spoken as a number, as digits, as a phone number, or as a date.

  • <voice>—Specifies which voice to use if more than one voice is available.

As a general guideline, you can use these elements within any element where text for TTS conversion is used.
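
The sketch below shows how these elements might be combined in one prompt. The attribute names come from the SSML 1.0 Recommendation, but the interpret-as value and the voice name are assumptions; check SSML elements for the values that NVP actually accepts.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="orderStatus">
    <block>
      <prompt>
        Your order number is
        <!-- Assumed interpret-as value; supported values are platform-specific -->
        <say-as interpret-as="digits">40721</say-as>.
        <!-- Half-second pause before the next sentence -->
        <break time="500ms"/>
        <emphasis level="strong">Please</emphasis> write it down.
        <!-- Slower and louder for the key information -->
        <prosody rate="slow" volume="loud">
          Delivery takes three to five days.
        </prosody>
        <!-- Hypothetical voice name; use one installed on your system -->
        <voice name="ExampleVoice">Thank you for calling.</voice>
      </prompt>
    </block>
  </form>
</vxml>
```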

The ssml_validation parameter

The ssml_validation parameter determines how NVP behaves when it is unable to play a specified audio file or to synthesize the text in a prompt. It offers three possible settings:

  • strict—Validates the input against the SSML 1.0 Recommendation, including Nuance extensions. If the input is not valid, NVP writes error messages to the logs, and stops speech synthesis and audio playback.

  • warn—Performs the same validation, but only logs errors rather than failing out of the synthesis operation, and continues with the prompt if it can.

  • none—Skips validation and plays the prompt if it can, without logging an error.

For example, consider the playback from the ConfirmSizeAndToppings.jsp file in PizzaTalk. This file creates a single toppingListStr string from the toppings in the toppings array:

// One <audio> element for every topping except the last.
for (int i = 0; i < n - 1; i++) {
    toppingListStr += "<audio expr=\"PromptPath + '" + stok.nextToken() + ".wav'\"/>\n";
}
// Play "and" before the final topping.
toppingListStr += "<audio expr=\"PromptPath + 'and.wav'\"/>\n";
toppingListStr += "<audio expr=\"PromptPath + '" + stok.nextToken() + ".wav'\"/>";

This toppingListStr is later inserted into a <prompt> element as part of the JSP itself:

<prompt cond="entry == 'init'">
  <audio expr="PromptPath + 'you_wanna.wav'"/>
  <audio expr="PromptPath + PizzaSize + '.wav'"/>
  <audio expr="PromptPath + 'pizza_with.wav'"/>
  <%= toppingListStr %>
  <audio expr="PromptPath + 'is_that_right.wav'"/>
</prompt>

So if the recognized toppings were “onions”, “olives”, and “mushrooms”, the resulting toppingListStr string evaluates to <audio> elements playing prerecorded prompts:

<audio expr="PromptPath + 'onions.wav'"/>
<audio expr="PromptPath + 'olives.wav'"/>
<audio expr="PromptPath + 'and.wav'"/>
<audio expr="PromptPath + 'mushrooms.wav'"/>

Now suppose that NVP cannot find the olives.wav prerecorded prompt (for example, because the file was accidentally deleted). NVP plays the prompt up to the first topping (“You wanna small pizza with onions,”), but its behavior upon reaching the olives.wav prompt depends on the ssml_validation setting:

  • strict—The playback stops immediately, so the caller only hears “You wanna small pizza with onions.” Behind the scenes, NVP logs an error.

  • warn—The playback skips the missing audio file, but continues with the prompt. The caller hears “You wanna small pizza with onions, and mushrooms. Is that right?” Behind the scenes, NVP logs an error.

  • none—The playback skips the missing audio file, but continues with the prompt. The caller hears “You wanna small pizza with onions, and mushrooms. Is that right?” NVP does not log an error, even though the audio file was missing.

The strict setting is recommended for most applications. The warn and none settings may produce strange TTS output when NVP attempts to synthesize invalid input.
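
Fallback text, described earlier in this section, offers another way to limit the effect of a missing file. The following sketch shows the evaluated topping prompts with fallback text added. The fallback wording is an assumption and is not part of PizzaTalk. Based on the fallback behavior described above, if olives.wav cannot be played, NVP would log an error and speak "olives" in the default TTS voice instead.

```xml
<prompt cond="entry == 'init'">
  <!-- Each clip carries fallback text (assumed wording) for TTS -->
  <audio expr="PromptPath + 'onions.wav'">onions</audio>
  <audio expr="PromptPath + 'olives.wav'">olives</audio>
  <audio expr="PromptPath + 'and.wav'">and</audio>
  <audio expr="PromptPath + 'mushrooms.wav'">mushrooms</audio>
</prompt>
```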

bargein

If your prompt is very long or contains information that experienced users will already know, you may want to allow callers to interrupt the prompt. Alternatively, you may want to prevent them from interrupting until the prompt has finished playing. You can permit or forbid such interruptions using the bargein attribute of the <prompt> element:

<prompt bargein="false">
    Before we begin, here is some important information.
</prompt>

Bargein is enabled by default. To disable it for a prompt, explicitly set the attribute to false.

NVP supports different types of bargein functionality, including normal speech, hotwords, and selective bargein. For details, see Using hot word recognition.
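
As a sketch of how the two settings might be mixed, the field below plays a notice that cannot be interrupted, followed by a menu prompt that can. The field name, prompt wording, and options are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="mainMenu">
    <field name="department">
      <!-- Callers must hear this notice in full -->
      <prompt bargein="false">
        This call may be recorded.
      </prompt>
      <!-- Experienced callers can interrupt the menu -->
      <prompt bargein="true">
        Say sales, support, or billing.
      </prompt>
      <option>sales</option>
      <option>support</option>
      <option>billing</option>
    </field>
  </form>
</vxml>
```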

timeout

After playing a prompt in an input item, the voice browser service expects some sort of reply from the caller. If no reply is detected within a certain span of time, the voice browser service throws a “noinput” event and plays the prompt again.

You can set this timeout with the timeout attribute of a <prompt>. The value must be given in seconds (s) or milliseconds (ms):

<prompt timeout="2s">
    Please answer this prompt quickly.
</prompt>

The value overrides the default timeout that would otherwise apply.
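
The following sketch places a prompt timeout in a complete field and adds an explicit noinput handler. The field name and wording are hypothetical, and the example assumes that the boolean built-in grammar type is available on your platform.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="balanceCheck">
    <!-- Assumes the built-in boolean (yes/no) grammar is supported -->
    <field name="wantsBalance" type="boolean">
      <!-- Wait five seconds for a reply before throwing noinput -->
      <prompt timeout="5s">
        Would you like to hear your balance?
      </prompt>
      <noinput>
        Sorry, I didn't hear anything.
        <!-- Play the field's prompt again -->
        <reprompt/>
      </noinput>
    </field>
  </form>
</vxml>
```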