Integrating into VoiceXML platforms
Vocalizer supports all of the prompting requirements of voice platforms. The platforms delegate all prompt playback to Vocalizer (including text-to-speech and audio recordings). This design yields a more robust and efficient platform that is easier to develop and maintain:
- Extract the text strings relating to a prompt into an SSML document.
- Submit the SSML to Vocalizer for playback.
- Obtain a single audio stream from Vocalizer that unifies all the audio from the text-to-speech and audio recordings.
While this is straightforward, there are subtle issues to address so that VoiceXML application developers have full access to all Vocalizer capabilities. In particular, the platform decides which VoiceXML elements to pass through to the SSML document and which to omit.
To clarify this integration effort, see the topics that follow:
- VoiceXML elements to pass through to SSML, including deciding which elements to modify and which to pass unmodified.
- Handling <audio> elements
- Handling Nuance extensions to SSML
- Generating an SSML document
Vocalizer supports the SSML 1.0 Recommendation (September 2004) as used by the VoiceXML 2.0 Recommendation (March 2004) and the VoiceXML 2.1 Recommendation (June 2007), and by default returns an SSML parse error for any elements or attributes that don’t comply with the SSML 1.0 Recommendation. Your platform can implement older VoiceXML specifications such as VoiceXML 2.0 working drafts or VoiceXML 1.0, but you might need to convert some older elements and attributes to the SSML 1.0 Recommendation syntax. This conversion is typically straightforward: although the evolution of SSML 1.0 brought many feature additions and syntax changes, there were few major semantic changes.
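As an illustrative sketch of such a conversion (not an exhaustive converter): VoiceXML 1.0 expressed pauses with a msecs attribute on <break>, whereas the SSML 1.0 Recommendation uses the time attribute. A platform could normalize the older syntax before submitting the document:

```python
import xml.etree.ElementTree as ET

def modernize_break(elem):
    """Convert a VoiceXML 1.0 style <break msecs="..."/> into the
    SSML 1.0 Recommendation syntax <break time="...ms"/>."""
    msecs = elem.attrib.pop("msecs", None)
    if msecs is not None:
        elem.set("time", msecs + "ms")
    return elem

brk = ET.fromstring('<break msecs="500"/>')
modernize_break(brk)  # now equivalent to <break time="500ms"/>
```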
VoiceXML elements to pass through to SSML
To construct an SSML document, begin by identifying the VoiceXML elements to be extracted into the document for handling by Vocalizer.
Except for the <audio> element, which is described in a separate section below, pass all elements as-is to the SSML document. It is safe to pass all valid SSML 1.0 content as-is because Vocalizer gracefully falls back for any features it does not support. (Vocalizer does not support a small number of SSML 1.0 features, such as the rarely used pitch contours.)
| VoiceXML element | Description | Notes |
|---|---|---|
| audio | Specifies audio files to be played | See Handling <audio> elements for important implementation details. The VoiceXML interpreter must evaluate the "expr" attribute, converting it to the static "src" attribute. |
| break | Specifies a pause in the speech output | |
| desc | Provides a description of a non-speech audio source in <audio> | Only for use in visual interfaces, so while this can be passed, it is currently ignored by Vocalizer. |
| emphasis | Specifies text to speak with emphasis | |
| lexicon | Specifies a pronunciation lexicon | Very important to pass so applications can tune text-to-speech via Vocalizer user dictionaries, rulesets, and ActivePrompt databases (three different types of lexicons). |
| mark | Bookmark, used by VoiceXML 2.1 to indicate the barge-in location | The VoiceXML interpreter must evaluate the VoiceXML 2.1 "namexpr" attribute, converting it to the static "name" attribute. |
| meta | Specifies meta and "http-equiv" properties | Important to pass so applications can block logging of confidential information like credit card numbers by using name="secure_context", or can specify Internet fetch controls for <audio> and <lexicon> elements (currently ignored by Vocalizer but planned for a future release). It is best to pass through all <meta> elements as-is: they have very little overhead, Vocalizer gracefully ignores irrelevant meta content, and future Vocalizer releases are likely to add more <meta> properties. |
| metadata | Specifies XML metadata content | Only for application developer reference purposes, so while this can be passed, it is currently ignored by Vocalizer. |
| p | Identifies a paragraph | |
| phoneme | Specifies a phonetic pronunciation | Vocalizer supports the IPA and L&H+ phoneme alphabets. Blind pass-through is safe: Vocalizer uses the fallback text for unsupported alphabets. |
| prosody | Specifies prosodic information | |
| s | Identifies a sentence | |
| say-as | Specifies the type of text | |
| sub | Specifies replacement spoken text for the contained text | |
| voice | Specifies voice characteristics | |
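As a rough sketch of this pass-through, the extraction might look like the following. It assumes a namespace-free parse of the VoiceXML document and uses Python's standard library parser for illustration; a real platform would use its own DOM and handle XML namespaces properly:

```python
import xml.etree.ElementTree as ET

# Elements from the table above that pass through to the SSML document.
PASS_THROUGH = {
    "audio", "break", "desc", "emphasis", "lexicon", "mark", "meta",
    "metadata", "p", "phoneme", "prosody", "s", "say-as", "sub", "voice",
}

def extract_prompt(prompt, lang="en-US"):
    """Copy a VoiceXML <prompt>'s content into a new SSML <speak> document."""
    speak = ET.Element("speak", {
        "version": "1.0",
        "xmlns": "http://www.w3.org/2001/10/synthesis",
        "xml:lang": lang,
    })
    speak.text = prompt.text  # leading text content
    for child in prompt:
        if child.tag in PASS_THROUGH:
            speak.append(child)  # pass through as-is, trailing text included
    return speak

prompt = ET.fromstring(
    '<prompt>Welcome.<break time="300ms"/>How can I help?</prompt>')
ssml = ET.tostring(extract_prompt(prompt), encoding="unicode")
```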
Handling <audio> elements
Some VoiceXML platforms handle <audio> element playback themselves, but it is better to delegate all prompts (including all <audio> elements) to Vocalizer:
- Vocalizer makes the VoiceXML platform easier to design and implement, handling all the complex logic for <audio> fetches and fallback on failure, sequencing <audio> fetches with text-to-speech fragments, handling audio file headers and sample conversions, and delivering all of it to the VoiceXML platform as a single real-time audio stream.
- Vocalizer does sophisticated HTTP/1.1 compliant fetching and caching of http://, https://, and file:// URIs, minimizing latency for the caller and the load on the web servers.
- Vocalizer supplies numerous controls for tuning this fetching and caching, and the Vocalizer call logs provide detailed fetching and caching information to make it easy to measure and tune the system and troubleshoot problems. For details on HTTP/1.1 fetching and caching support, see Internet fetch support, and for logging details, see Application call logs.
- Vocalizer supports the VoiceXML fetchtimeout, fetchhint, maxage, and maxstale attributes, even though they are not part of the SSML 1.0 specification. The only VoiceXML attribute that must be handled externally is "expr": the VoiceXML interpreter context must evaluate it and convert it to the static "src" attribute.
- Vocalizer handles all the audio formats required or mentioned by the VoiceXML specification: headerless, WAV format, and AU format files that contain 8kHz µ-law or A-law samples. Vocalizer also supports NIST SPHERE format files with 8kHz µ-law or A-law samples, and supports headerless, WAV format, AU format, and NIST SPHERE audio files that contain linear 16-bit PCM samples (8kHz or 22kHz depending on the Vocalizer voice frequency being used, but NIST SPHERE shorten and wavpack compression is not supported).
- Vocalizer does intelligent text-to-speech processing that considers the full context, including <audio> insertions. Providing the full context, including audio insertions, allows for optimal audio quality (including blending the audio and text-to-speech segments) and makes it easier for VoiceXML application developers to tune Vocalizer’s speech output.
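The expr-to-src conversion mentioned above can be sketched as follows. The evaluate callback stands in for the platform's ECMAScript evaluator and is purely hypothetical; a dictionary lookup is used here for illustration:

```python
import xml.etree.ElementTree as ET

def resolve_audio_expr(audio, evaluate):
    """Replace the dynamic VoiceXML "expr" attribute with a static "src"
    attribute before the <audio> element is handed to Vocalizer."""
    expr = audio.attrib.pop("expr", None)
    if expr is not None:
        audio.set("src", evaluate(expr))
    return audio

# A dictionary lookup stands in for real ECMAScript evaluation.
variables = {"greetingURI": "http://example.com/audio/welcome.wav"}
audio = ET.fromstring('<audio expr="greetingURI">Welcome!</audio>')
resolve_audio_expr(audio, variables.get)
```

The fallback text ("Welcome!") stays inside the element, so Vocalizer can still speak it if the audio fetch fails.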
Handling Nuance extensions to SSML
Vocalizer supports SSML extensions (see Nuance SSML extensions) that VoiceXML platforms need to allow in VoiceXML documents and pass through to the SSML documents spoken by Vocalizer:
- The extra <audio> element attributes that are specified by VoiceXML 2.0: fetchtimeout, fetchhint, maxage, and maxstale.
- The "x-l&h+" alphabet for <phoneme>.
- An optional ssft-domaintype attribute for <speak>, <s>, and <p>.
Note: Vocalizer supports the <prompt> element for explicit SSML ActivePrompt insertions. However, this conflicts with VoiceXML <prompt>, and may safely be left out of VoiceXML integrations (ActivePrompt insertions remain usable via proprietary markup or via automatic matching). See Tuning TTS output with ActivePrompts.
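For example, a platform might pass through an SSML document that carries the ssft-domaintype extension on <speak>. This is only a sketch: the "sms" value is illustrative and not verified here; see Nuance SSML extensions for the supported values and exact syntax.

```python
import xml.etree.ElementTree as ET

speak = ET.Element("speak", {
    "version": "1.0",
    "xmlns": "http://www.w3.org/2001/10/synthesis",
    "xml:lang": "en-US",
    "ssft-domaintype": "sms",  # illustrative value only
})
speak.text = "Your package has shipped."
ssml = ET.tostring(speak, encoding="unicode")
```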
Generating an SSML document
The final step in delegating prompts to Vocalizer is to generate the SSML document. This requires choosing a character encoding for the document, constructing an SSML document header, and passing the relevant VoiceXML elements.
Nuance recommends making SSML documents as large as possible, rather than generating many small SSML documents (one per VoiceXML element): this significantly reduces processing overhead, supplies contextual information that is important for optimal audio quality, and makes it much easier for VoiceXML application developers to tune the speech output.
For the character encoding, the best choice is a Unicode encoding, such as UTF-8 or UTF-16. Most VoiceXML platforms use an XML parser library, and most XML parsers return the text as UTF-8 or UTF-16. This makes the VoiceXML extraction language-independent, handling any world language without VoiceXML platform code changes.
For the SSML document header, use a "speak" root element with the following attributes (as defined in the SSML 1.0 specification):
| <speak> attribute | Required/Optional | Notes |
|---|---|---|
| version | required | Must be "1.0". |
| xml:lang | required | Language for the document. Pass the xml:lang property for the current VoiceXML interpreter scope, as inherited from the closest enclosing VoiceXML <prompt>. If there is no <prompt>, or no xml:lang on the <prompt> elements, pass the xml:lang property from the <vxml> root element. |
| xmlns | required | The XML namespace for SSML 1.0, "http://www.w3.org/2001/10/synthesis"; otherwise Vocalizer returns an SSML parse error. |
| xml:base | optional but highly recommended | An absolute URI for the VoiceXML document that is being extracted into the SSML document. This allows Vocalizer to properly resolve relative URIs, and is also helpful for diagnostic purposes. |
| xmlns:xsi | optional | The SSML 1.0 specification shows xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" on its first <speak> example. This lets "xsi" refer to the XML Schema namespace later in the document, as used for xsi:schemaLocation. |
| xsi:schemaLocation | ignored | The Vocalizer code includes the SSML 1.0 schema document. |