Integrating into VoiceXML platforms

Vocalizer supports all of the prompting requirements of voice platforms. A platform can delegate all prompt playback (both text-to-speech and audio recordings) to Vocalizer, which yields a more robust and efficient platform that is easier to develop and maintain. The integration follows three steps:

  1. Extract the text strings relating to a prompt into an SSML document.
  2. Submit the SSML to Vocalizer for playback.
  3. Obtain a single audio stream from Vocalizer that unifies all the audio from the text-to-speech and audio recordings.
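The three steps above can be sketched as follows. This is a minimal illustration, not a real Vocalizer API: `build_ssml` and the final submission step are placeholders for whatever interface your platform integration exposes.

```python
# Sketch of the extraction flow: wrap prompt text in a minimal SSML 1.0
# document, then hand the whole document to Vocalizer for playback.
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"

def build_ssml(prompt_text: str, lang: str = "en-US") -> str:
    """Wrap extracted prompt text in a minimal SSML 1.0 document."""
    speak = ET.Element("speak", {
        "version": "1.0",
        "xml:lang": lang,
        "xmlns": SSML_NS,
    })
    speak.text = prompt_text
    return ET.tostring(speak, encoding="unicode")

ssml = build_ssml("Welcome to the order status line.")
# The platform would now submit `ssml` to Vocalizer and receive back
# a single unified audio stream.
```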

While this is straightforward, there are subtle issues involved in ensuring that VoiceXML application developers have full access to all Vocalizer capabilities. In particular, the platform must decide which VoiceXML elements to pass through to the SSML document and which to omit.

To clarify this integration effort, see the topics that follow:

Vocalizer supports the SSML 1.0 Recommendation (September 2004) as used by the VoiceXML 2.0 Recommendation (March 2004) and the VoiceXML 2.1 Recommendation (June 2007), and by default returns an SSML parse error for any elements or attributes that don’t comply with the SSML 1.0 Recommendation. Your platform can implement older VoiceXML specifications such as the VoiceXML 2.0 working drafts or VoiceXML 1.0, but you might need to convert some older elements and attributes to SSML 1.0 Recommendation syntax. This is typically straightforward: while the evolution of SSML 1.0 brought many feature additions and syntax changes, there were few major semantic changes.
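As a hedged sketch of such a conversion, the snippet below renames a few VoiceXML 1.0-era prompt elements to their SSML 1.0 equivalents. The mapping is illustrative only; verify each element (and its attributes) against the legacy dialect your platform actually supports.

```python
# Rename legacy prompt elements in place before extraction.
# The mapping below is an assumption about the legacy dialect,
# not a complete or authoritative conversion table.
import xml.etree.ElementTree as ET

LEGACY_TO_SSML = {
    "emp": "emphasis",   # legacy <emp> -> SSML <emphasis>
    "pros": "prosody",   # legacy <pros> -> SSML <prosody>
    "sayas": "say-as",   # legacy <sayas> -> SSML <say-as>
}

def modernize(root: ET.Element) -> ET.Element:
    """Rename legacy elements; attributes may still need per-element
    conversion (e.g. <sayas class=...> vs. <say-as interpret-as=...>)."""
    for el in root.iter():
        if el.tag in LEGACY_TO_SSML:
            el.tag = LEGACY_TO_SSML[el.tag]
    return root

old = ET.fromstring(
    '<prompt>Call <sayas class="phone">8005551212</sayas></prompt>')
new = modernize(old)
```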

VoiceXML elements to pass through to SSML

To construct an SSML document, begin by identifying the VoiceXML elements to be extracted into the document for handling by Vocalizer.

Except for the <audio> element, which is described in a separate section below, pass all elements as-is to the SSML document. It is safe to pass all valid SSML 1.0 content as-is because Vocalizer gracefully falls back for the small number of SSML 1.0 features it does not support (for example, rarely used features such as pitch contours).
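A minimal pass-through sketch: copy a <prompt>'s content into a <speak> element as-is, omitting VoiceXML-only elements that the interpreter must evaluate itself. The VOICEXML_ONLY set here is illustrative, not exhaustive, and a production version must also splice the tail text of omitted elements back into the output.

```python
import copy
import xml.etree.ElementTree as ET

VOICEXML_ONLY = {"value", "enumerate"}  # interpreter evaluates these first

def extract_prompt(prompt: ET.Element) -> ET.Element:
    speak = ET.Element("speak", {
        "version": "1.0",
        "xml:lang": "en-US",
        "xmlns": "http://www.w3.org/2001/10/synthesis",
    })
    speak.text = prompt.text
    for child in prompt:
        if child.tag not in VOICEXML_ONLY:
            speak.append(copy.deepcopy(child))  # deepcopy keeps the tail text
    return speak

prompt = ET.fromstring(
    '<prompt>Hello <break time="300ms"/> world <value expr="x"/></prompt>')
speak = extract_prompt(prompt)
```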

| VoiceXML element | Description | Notes |
| --- | --- | --- |
| audio | Specifies audio files to be played | See Handling <audio> elements for important implementation details. The VoiceXML interpreter must evaluate the "expr" attribute, converting it to the static "src" attribute. |
| break | Specifies a pause in the speech output | |
| desc | Provides a description of a non-speech audio source in <audio> | Only for use in visual interfaces, so while this can be passed, it is currently ignored by Vocalizer. |
| emphasis | Specifies text to speak with emphasis | |
| lexicon | Specifies a pronunciation lexicon | Very important to pass so applications can tune text-to-speech via Vocalizer user dictionaries, rulesets, and ActivePrompt databases (three different types of lexicons). |
| mark | Bookmark, used by VoiceXML 2.1 to indicate the barge-in location | The VoiceXML interpreter must evaluate the VoiceXML 2.1 "namexpr" attribute, converting it to the static "name" attribute. |
| meta | Specifies meta and "http-equiv" properties | Important to pass so applications can block logging of confidential information such as credit card numbers by using name="secure_context", or can specify Internet fetch controls for <audio> and <lexicon> elements (currently ignored by Vocalizer, but planned for a future release). It is best to pass all <meta> elements as-is: they have very little overhead, Vocalizer gracefully ignores irrelevant meta content, and future Vocalizer releases are likely to add more <meta> properties. |
| metadata | Specifies XML metadata content | Only for application developer reference purposes, so while this can be passed, it is currently ignored by Vocalizer. |
| p | Identifies a paragraph | |
| phoneme | Specifies a phonetic pronunciation | Vocalizer supports the IPA and L&H+ phoneme alphabets. Blind pass-through is safe; Vocalizer gracefully falls back for unsupported alphabets. |
| prosody | Specifies prosodic information | |
| s | Identifies a sentence | |
| say-as | Specifies the type of text | |
| sub | Specifies replacement spoken text for the contained text | |
| voice | Specifies voice characteristics | |
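The dynamic-attribute conversions noted for <audio> and <mark> can be sketched as follows: the VoiceXML interpreter evaluates <audio expr="..."> and <mark namexpr="..."> and rewrites them as static src/name attributes before extraction. Here, evaluate_ecmascript() is a stand-in for the interpreter's real ECMAScript engine, not an actual API.

```python
import xml.etree.ElementTree as ET

def evaluate_ecmascript(expr: str, scope: dict) -> str:
    # Placeholder: a real platform evaluates expr in the VoiceXML
    # ECMAScript scope; here it is a plain dictionary lookup.
    return scope[expr]

def staticize(root: ET.Element, scope: dict) -> ET.Element:
    for audio in root.iter("audio"):
        expr = audio.attrib.pop("expr", None)
        if expr is not None:
            audio.set("src", evaluate_ecmascript(expr, scope))
    for mark in root.iter("mark"):
        namexpr = mark.attrib.pop("namexpr", None)
        if namexpr is not None:
            mark.set("name", evaluate_ecmascript(namexpr, scope))
    return root

doc = ET.fromstring('<speak><audio expr="greetingUri"/>'
                    '<mark namexpr="markName"/></speak>')
staticize(doc, {"greetingUri": "http://example.com/hello.wav",
                "markName": "afterGreeting"})
```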

Handling <audio> elements

Some VoiceXML platforms handle <audio> element playback themselves, but it is better to delegate all prompts (including all <audio> elements) to Vocalizer:

  • Vocalizer makes the VoiceXML platform easier to design and implement, handling all the complex logic for <audio> fetches and fallback on failure, sequencing <audio> fetches with text-to-speech fragments, handling audio file headers and sample conversions, and delivering all of it to the VoiceXML platform as a single real-time audio stream.
  • Vocalizer does sophisticated HTTP/1.1 compliant fetching and caching of http://, https://, and file:// URIs, minimizing latency for the caller and the load on the web servers.
  • Vocalizer supplies numerous controls for tuning this fetching and caching, and the Vocalizer call logs provide detailed fetching and caching information to make it easy to measure and tune the system and troubleshoot problems. For details on HTTP/1.1 fetching and caching support, see Internet fetch support, and for logging details, see Application call logs.
  • Vocalizer supports the VoiceXML fetchtimeout, fetchhint, maxage, and maxstale attributes, even though they are not part of the SSML 1.0 specification. The only <audio> attribute that must be handled externally is "expr": the VoiceXML interpreter context must evaluate it and convert it to the static "src" attribute.
  • Vocalizer handles all the audio formats required or mentioned by the VoiceXML specification: headerless, WAV format, and AU format files that contain 8kHz µ-law or A-law samples. Vocalizer also supports NIST SPHERE format files with 8kHz µ-law or A-law samples, and supports headerless, WAV format, AU format, and NIST SPHERE audio files that contain linear 16-bit PCM samples (8kHz or 22kHz depending on the Vocalizer voice frequency being used, but NIST SPHERE shorten and wavpack compression is not supported).
  • Vocalizer does intelligent text-to-speech processing that considers the full context, including <audio> insertions. Providing that full context allows for optimal audio quality (including blending the audio and text-to-speech segments) and makes it easier for VoiceXML application developers to tune Vocalizer’s speech output.
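As a concrete illustration of this delegation, a single SSML document can interleave a recorded prompt (with inline TTS fallback text inside <audio>) and synthesized speech; the URI below is illustrative.

```python
import xml.etree.ElementTree as ET

SSML = """<speak version="1.0" xml:lang="en-US"
       xmlns="http://www.w3.org/2001/10/synthesis">
  <audio src="http://example.com/prompts/balance-intro.wav">
    Your account balance is
  </audio>
  five hundred dollars.
</speak>"""

# Vocalizer receives this whole document, fetches the audio, falls
# back to the inline text if the fetch fails, and returns one
# unified audio stream.
root = ET.fromstring(SSML)
```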

Handling Nuance extensions to SSML

Vocalizer supports SSML extensions (see Nuance SSML extensions) that VoiceXML platforms need to allow in VoiceXML documents and pass through to the SSML documents spoken by Vocalizer:

  • The extra <audio> element attributes that are specified by VoiceXML 2.0:  fetchtimeout, fetchhint, maxage, and maxstale.
  • The "x-l&h+" alphabet for <phoneme>.
  • An optional ssft-domaintype attribute for <speak>, <s>, and <p>.

Note: Vocalizer supports the <prompt> element for explicit SSML ActivePrompt insertions. However, this conflicts with VoiceXML <prompt>, and may safely be left out of VoiceXML integrations (ActivePrompt insertions remain usable via proprietary markup or via automatic matching). See Tuning TTS output with ActivePrompts.
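A hedged example exercising the extensions listed above: the VoiceXML fetch attributes on <audio>, the "x-l&h+" alphabet on <phoneme>, and ssft-domaintype on <s>. The URI, domain type value, and ph value are illustrative placeholders, not verified transcriptions; the point is that the platform must let these attributes through rather than rejecting them as invalid SSML 1.0.

```python
import xml.etree.ElementTree as ET

SSML = """<speak version="1.0" xml:lang="en-US"
       xmlns="http://www.w3.org/2001/10/synthesis">
  <s ssft-domaintype="banking">
    <audio src="http://example.com/intro.wav"
           fetchtimeout="5s" fetchhint="prefetch"
           maxage="3600" maxstale="600">Welcome.</audio>
    <phoneme alphabet="x-l&amp;h+" ph="...">tomato</phoneme>
  </s>
</speak>"""
# ph="..." is a placeholder, not a real L&H+ transcription

root = ET.fromstring(SSML)  # well-formed XML with the extensions intact
```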

Generating an SSML document

The final step in delegating prompts to Vocalizer is to generate the SSML document. This requires choosing a character encoding for the document, constructing an SSML document header, and passing through the relevant VoiceXML elements.

Nuance recommends making SSML documents as big as possible, rather than generating lots of small SSML documents for each individual VoiceXML element: this significantly reduces processing overhead, supplies contextual information that is important for optimal audio quality, and makes it much easier for VoiceXML application developers to tune the speech output.

For the character encoding, the best choice is a Unicode encoding, such as UTF-8 or UTF-16. Most VoiceXML platforms use an XML parser library, and most XML parsers return the text as UTF-8 or UTF-16. This makes the VoiceXML extraction language independent, handling any world language without VoiceXML platform code changes.

For the SSML document header, use a "speak" root element with the following attributes (as defined in the SSML 1.0 specification):

| <speak> attribute | Required/Optional | Notes |
| --- | --- | --- |
| version | required | Must be "1.0". |
| xml:lang | required | Language for the document. Pass the xml:lang property for the current VoiceXML interpreter scope, as inherited from the closest enclosing VoiceXML <prompt>. If there is no <prompt>, or no xml:lang on the <prompt> elements, pass the xml:lang property of the <vxml> root element. |
| xmlns | required | The XML namespace for SSML 1.0, "http://www.w3.org/2001/10/synthesis"; otherwise Vocalizer returns an SSML parse error. |
| xml:base | optional but highly recommended | An absolute URI for the VoiceXML document that is being extracted into the SSML document. This allows Vocalizer to properly resolve relative URIs, and is also helpful for diagnostic purposes. |
| xmlns:xsi | optional | The SSML 1.0 specification shows xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" on its first <speak> example. This lets the "xsi" prefix refer to the XML Schema instance namespace, as used for xsi:schemaLocation. |
| xsi:schemaLocation | ignored | The Vocalizer code includes the SSML 1.0 schema document. |
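The header rules above can be sketched as a small builder: the required version, xml:lang, and xmlns attributes, plus the highly recommended xml:base. The two language parameters model the xml:lang inheritance rule (closest enclosing <prompt> wins, else the <vxml> root); all names here are illustrative.

```python
import xml.etree.ElementTree as ET
from typing import Optional

SSML_NS = "http://www.w3.org/2001/10/synthesis"

def make_speak_header(vxml_lang: str, prompt_lang: Optional[str],
                      vxml_uri: str) -> ET.Element:
    lang = prompt_lang if prompt_lang else vxml_lang
    return ET.Element("speak", {
        "version": "1.0",       # must be exactly "1.0"
        "xml:lang": lang,
        "xmlns": SSML_NS,       # any other namespace -> SSML parse error
        "xml:base": vxml_uri,   # lets Vocalizer resolve relative URIs
    })

speak = make_speak_header("en-US", None, "http://example.com/app/main.vxml")
```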