VoiceXML prompts

VoiceXML applications use the <prompt> element to control speech output to users.

Prompts consist of any combination of synthesized speech and pre-recorded files:

  • During application development, most prompts are synthesized speech. Using text-to-speech is fast, flexible, and less expensive than making recordings.
  • Later in the development life cycle, in preparation for deployment, application developers replace synthesized speech with pre-recorded audio. Use pre-recorded audio in places where the text is known, and where the synthesized speech doesn’t have the desired speech quality or emotional characteristics.

    A pre-recorded prompt is a file fetched from a web server. Well-crafted audio prompts improve application success through their subtle choice of words and intonations. The recordings are high-quality audio files spoken by trained voice talents who are coached by user interface specialists.

The voice browser converts the VoiceXML prompt into an SSML document and generates an MRCP SPEAK message, which Speech Server passes to Vocalizer. (Vocalizer is not required.)
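As a rough illustration of this step, the SSML document produced from a simple prompt might resemble the sketch below. The exact document the voice browser generates is not shown here, so the prompt text, the example.com URI, and the choice of elements are assumptions made for illustration only.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical SSML body of a SPEAK request; not actual voice browser output. -->
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <!-- Pre-recorded audio, with the enclosed text as a synthesized fallback -->
  <audio src="http://example.com/prompts/welcome.wav">Welcome.</audio>
  <!-- Plain text is rendered by the synthesizer -->
  Please say the name of the department you want.
</speak>
```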

Vocalizer generates the requested prompts (concatenating pre-recorded and synthesized audio as needed). Prompts are queued for playback and are not played until the application needs input from the caller. At that point, the queued prompts are played, and the system waits for user input (speech or DTMF) to send for recognition.
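The queueing behavior can be sketched with a small form. This is a minimal example, assuming a hypothetical ordering dialog; the form name, field name, prompt wording, and inline grammar are all invented for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form id="order">
    <block>
      <!-- A <block> collects no input, so this prompt is only queued here. -->
      <prompt>Welcome to the order line.</prompt>
    </block>
    <field name="size">
      <!-- Queued behind the welcome prompt. Both play, in order, once the
           field is ready to listen for speech or DTMF. -->
      <prompt>Would you like small, medium, or large?</prompt>
      <grammar version="1.0" root="size" mode="voice">
        <rule id="size">
          <one-of>
            <item>small</item>
            <item>medium</item>
            <item>large</item>
          </one-of>
        </rule>
      </grammar>
      <filled>
        <prompt>You chose <value expr="size"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```

From the caller's point of view, the welcome message and the question are heard back to back, immediately before the application starts listening.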

The <prompt> element

The <prompt> element has the following attributes:

| Attribute | Description |
|---|---|
| bargein | Specifies whether a user can interrupt a prompt. Values are: true, false. Default: value of the bargein property. |
| bargeintype | Type of bargein. Values are: speech, hotword. Default: value of the bargeintype property (speech unless set otherwise). Note: The Krypton recognition engine does not support hotword mode recognition. |
| timeout | Timeout for the user input that follows the prompt. The value is a Time Designation. Default: noinput timeout is set by the voice platform. |
| xml:lang | Language identifier for the prompt. Default: value specified in the document's "xml:lang" attribute. |
| xml:base | Base URI for resolving relative URIs in the prompt. This base declaration has precedence over the <vxml> base URI declaration. If a local declaration is omitted, the value is inherited from the document hierarchy. |
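The following sketch shows these attributes used together. It assumes a hypothetical terms-and-conditions dialog: the URIs, file names, and prompt text are invented, and type="boolean" assumes the platform supports built-in grammars. The bargein values follow the boolean type (true or false) defined by the W3C VoiceXML schema.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form id="terms">
    <!-- type="boolean" assumes built-in yes/no grammar support -->
    <field name="accept" type="boolean">
      <!-- The caller cannot interrupt this notice. The relative src
           resolves against the xml:base declared on the prompt. -->
      <prompt bargein="false" xml:base="http://example.com/prompts/legal/">
        <audio src="notice.wav">This call may be recorded.</audio>
      </prompt>
      <!-- The caller may interrupt by speaking. A noinput event follows
           after 5 seconds of silence. The language overrides the
           document-level xml:lang for this prompt only. -->
      <prompt bargein="true" bargeintype="speech" timeout="5s" xml:lang="en-GB">
        Do you accept the terms and conditions?
      </prompt>
    </field>
  </form>
</vxml>
```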

Speech synthesis markup

For speech synthesis, a <prompt> can contain any of the following markup elements to improve the quality of the resulting audio.

For definitions and examples, see the W3C Speech Synthesis Markup Language specification.

| Element | Purpose |
|---|---|
| <audio> | Specifies audio files to be played and text to be spoken. Inserts pre-recorded audio within the text to be synthesized. See Audio prompting. |
| <break> | Specifies a pause in the prompt. |
| <desc> | Provides a description of a non-speech audio source in <audio>. |
| <emphasis> | Speaks the enclosed text with emphasis. |
| <lexicon> | Specifies a pronunciation lexicon for the prompt. |
| <mark> | Ignored by voice platforms. |
| <meta> | Specifies meta and "http-equiv" properties for the prompt. |
| <metadata> | Specifies XML metadata content for the prompt. |
| <p> | Identifies the enclosed text as a paragraph, containing zero or more sentences. |
| <phoneme> | Specifies a phonetic pronunciation for the enclosed text. |
| <prosody> | Specifies prosodic information for the enclosed text. |
| <say-as> | Specifies the type of text construct contained within the element. |
| <s> | Identifies the enclosed text as a sentence. |
| <sub> | Specifies replacement spoken text for the enclosed text. |
| <voice> | Specifies voice characteristics for the prompt. |
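A short sketch combining several of these elements in one prompt is shown below. The prompt text is invented, and the interpret-as value "currency" is an assumption: SSML does not standardize say-as values, so the supported set depends on the synthesis engine.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form id="summary">
    <block>
      <prompt>
        <p>
          <!-- say-as tells the engine how to read the enclosed text;
               "currency" is an assumed, engine-dependent value -->
          <s>Your order total is <say-as interpret-as="currency">$299.95</say-as>.</s>
          <!-- sub replaces the written form with the spoken alias -->
          <s>Please have your <sub alias="personal identification number">PIN</sub> ready.</s>
        </p>
        <!-- Half-second pause before the instruction -->
        <break time="500ms"/>
        <!-- Slower speaking rate, with stress on the key word -->
        <prosody rate="slow">
          Press <emphasis level="strong">one</emphasis> to confirm.
        </prosody>
      </prompt>
    </block>
  </form>
</vxml>
```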

Audio prompting

A VoiceXML application can use the <audio> element to include pre-recorded audio in a prompt. The <audio> element has the following attributes:

| Attribute | Description |
|---|---|
| src | URI of the audio prompt. You must specify either "src" or "expr"; otherwise, an error.badfetch event is thrown. |
| expr | An ECMAScript expression that determines the source of the audio to be played. The expression can be either a reference to audio previously recorded with the <record/> item or evaluate to the URI of an audio resource to fetch. You must specify either "src" or "expr"; otherwise, an error.badfetch event is thrown. |
| fetchtimeout | Defaults to the fetchtimeout property. |
| fetchhint | Defaults to the audiofetchhint property. |
| maxage | Defaults to the audiomaxage property. |
| maxstale | Defaults to the audiomaxstale property. |

Note: To optimize performance, the voice browser can stream audio to the user. That is, it can begin processing audio content as it arrives from the Speech Server and not wait for full retrieval. To request full audio retrieval prior to playback, use the "prefetch" fetchhint.
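As a minimal sketch of these attributes, the document below plays one audio file named directly with "src" and another whose URI is computed with "expr". The example.com URIs, file names, variable name, and attribute values are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <!-- Hypothetical base location for the recorded prompts -->
  <var name="promptBase" expr="'http://example.com/prompts/'"/>
  <form id="greeting">
    <block>
      <prompt>
        <!-- Fixed URI. prefetch requests full retrieval before playback;
             the enclosed text is spoken if the file cannot be played. -->
        <audio src="http://example.com/prompts/welcome.wav"
               fetchhint="prefetch" fetchtimeout="5s" maxage="3600">
          Welcome to the example service.
        </audio>
        <!-- URI computed at run time from an ECMAScript expression -->
        <audio expr="promptBase + 'menu.wav'"/>
      </prompt>
    </block>
  </form>
</vxml>
```

Supplying fallback text inside <audio> keeps the dialog usable during development, before the recordings exist.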

The following example illustrates a basic VoiceXML prompt and the corresponding MRCP SPEAK request. Note that the example includes several different audio modes:

  • Audio file, with TTS text as backup (welcome)
  • Audio file (thrush call)
  • TTS
  • Say-as TTS ($299.95)