Vocalizer SSML support

Speech Synthesis Markup Language (SSML) is an XML-based markup language specification for voice browsers, established by the World Wide Web Consortium (W3C). It assists synthetic speech generation in web and other applications. Its essential role is to standardize control of pronunciation, volume, rate, and other aspects of speech.

The Vocalizer SDK provides a built-in preprocessor that supports most of the W3C Speech Synthesis Markup Language (SSML) Version 1.0 specification (W3C Recommendation, 7 September 2004).
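
For orientation, a minimal SSML 1.0 document might look like the sketch below. The version attribute and namespace URI are taken from the W3C Recommendation. The en-US language code and the sentence text are placeholders chosen for illustration, and whether a given language can be spoken depends on the voices you have installed.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Minimal SSML 1.0 document. The xml:lang value and the text are illustrative. -->
<speak version="1.0"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <p>
    <s>Welcome to the demonstration.</s>
    <s>This sentence ends <prosody rate="slow">more slowly</prosody>.</s>
  </p>
</speak>
```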

In addition, Vocalizer extends SSML with a few Nuance-specific elements and attributes, and provides schema documents for the extensions in VOCALIZER_SDK\doc\synthesis.xsd and synthesis-core.xsd.

See each Language Supplement for information on language-specific support.

SSML compliance

The default SSML parser supports all elements and attributes in the September 2004 Recommendation, regardless of their requirement level (MUST, REQUIRED, SHALL, SHOULD, RECOMMENDED, MAY, OPTIONAL), with the following exceptions:

  • The <emphasis> element: the "none" level is not supported. Using this element does not necessarily lead to audible differences, because the system may ignore the requested emphasis in order to produce the most natural speech output.
  • The <voice> element is handled as follows:
    • The variant attribute is not supported.
    • The age attribute is supported, but to use it you need to install a set of voices of different ages for the same language and gender. This requires the use of custom voices.
  • The <prosody> element is handled as follows:
    • The pitch, contour, range, and duration attributes are ignored.
    • The rate and volume attributes are supported.
  • The <break> element:
    • The maximum duration for each <break> is 60 seconds (60000ms). To create a longer break, use more than one <break> (which must appear in separate <s> elements). For example:

      <s>Let's take a two-minute pause <break time="60000ms"/></s>
      <s><break time="60000ms"/> and now the break is finished.</s>

    • The <break strength="none"/> setting only has an audible effect when the TTS engine would have inserted a sentence break without an explicit <break> element.
  • The <lang> element (as defined in version 1.1 of the SSML specification) is supported. Use it to specify the natural language of the content. Supported attributes of this element:

    xml:lang: Required. Specifies the language of the enclosed content. Follow the guidelines for the xml:lang attribute of the <speak> element.

    onlangfailure: Optional. Describes the desired behavior when the requested language cannot be spoken. Contains one of the following values:

    • ignorelang: ignore the change in language and speak as if the content were in the previous language.
    • ignoretext: do not render the text that is in the failed language.
    • changevoice: switch to another voice that can speak the language, if one exists. Otherwise, behave as for ignorelang. Using this value with nested <lang> elements can produce unpredictable results, because the system might not detect language speaking failures.
    • processorchoice (default): same meaning as ignorelang.
  • The <meta> element: the http-equiv attribute is not supported.
  • The <say-as> element: the SSML specification does not standardize the list of <say-as> attribute values. See Say-as support.
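
The following sketch pulls several of the points above into one document: supported <prosody> attributes, an <emphasis> level other than "none", and a <lang> element with an explicit failure policy. It is an illustration only. The fr-FR language code, the choice of ignoretext, and all text are assumptions made for the example, and the audible result depends on the installed voices.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <!-- rate and volume are honored. A pitch attribute here would be ignored. -->
  <s><prosody rate="slow" volume="loud">Please listen carefully.</prosody></s>

  <!-- Any level except "none" is accepted, but the effect may not be audible. -->
  <s>Your order is <emphasis level="strong">confirmed</emphasis>.</s>

  <!-- With ignoretext, the French words are skipped when no installed
       voice can speak fr-FR. -->
  <s>The dish is called
    <lang xml:lang="fr-FR" onlangfailure="ignoretext">pot-au-feu</lang>.
  </s>
</speak>
```

Choosing ignoretext rather than the default processorchoice avoids having foreign text read with the pronunciation rules of the previous language, at the cost of dropping that text entirely when the language is unavailable.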

Note: To learn about SSML compliance for the light SSML parser, see Using the light SSML parser.

Nuance SSML extensions

The Vocalizer SSML extensions are:

  • The <audio> element supports extra attributes to control internet fetching as described for the W3C VoiceXML 2.0 specification’s version of the <audio> element:
    • fetchtimeout: the time allowed to open and read the audio document. The value must be an unsigned integer with a mandatory suffix, as required by the VoiceXML 2.0 specification: "s" for seconds or "ms" for milliseconds.
    • maxage: value for the HTTP 1.1 Cache-Control max-age directive. This specifies that the application is willing to accept a cached copy of the audio document no older than this value. You can use a value of 0 to force revalidation of the cached copy with the origin server, but in most cases it is best to omit this attribute and let the origin server control cache expiration. The value must be an unsigned integer specifying the number of seconds. As required by the VoiceXML 2.0 specification, it must not have a suffix: "s" and "ms" are not allowed.
    • maxstale: value for the HTTP 1.1 Cache-Control max-stale directive. This specifies that the client is willing to accept a cached copy that has expired by up to this value past the expiration time specified by the origin server. In most cases, set this attribute to 0 or omit it, thus respecting the cache expiration time specified by the origin server. The value must be an unsigned integer specifying the number of seconds. As required by the VoiceXML 2.0 specification, it must not have a suffix: "s" and "ms" are not allowed.
    • fetchhint: "prefetch" to allow prefetching the audio content, "safe" (the default) to follow HTTP/1.1 caching semantics. Vocalizer allows this attribute, but currently does not behave differently for "prefetch" mode.

      Note: Vocalizer does not support the <audio> expr attribute defined in the VoiceXML 2.0 specification.

  • The <phoneme> element supports L&H+ phoneme strings when the alphabet attribute is set to "x-l&h+", and IPA phoneme strings when the alphabet attribute is set to "ipa". Because the ampersand is a reserved XML character, the L&H+ alphabet must be written as alphabet="x-l&amp;h+" in an SSML document. IPA phoneme strings must use escape codes for any symbols that cannot be expressed directly. See each voice Language Supplement for a list of escape codes.
  • The <speak>, <s>, and <p> elements support an optional ssft-domaintype attribute for activating an ActivePrompt domain, equivalent to the <ESC>\domain\ native control sequence. The attribute value is the ActivePrompt domain name.
  • The <prompt> element supports specifying ActivePrompt IDs, equivalent to the <ESC>\prompt\ native control sequence. The "id" attribute is required, and specifies the ActivePrompt in <domain>:<prompt> format. The content of the element specifies fallback text that is spoken only if the ActivePrompt cannot be found, similar to the fallback content of the SSML <audio> element.
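
The extensions above could be combined as in the following sketch. It assumes the extension attributes and the <prompt> element are written unprefixed, exactly as they are named in this section. Check synthesis.xsd for the authoritative definitions. The URL, the "banking" domain, the "greeting" prompt ID, and the L&H+ phoneme string are hypothetical placeholders, not values shipped with Vocalizer.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical values: the audio URL, the "banking" domain, and the
     "greeting" prompt ID are placeholders for illustration only. -->
<speak version="1.0"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US"
       ssft-domaintype="banking">
  <!-- fetchtimeout requires an "s" or "ms" suffix. maxage and maxstale are
       plain seconds without a suffix. The element content is fallback text. -->
  <audio src="http://example.com/audio/chime.wav"
         fetchtimeout="5s" maxage="3600" maxstale="0">
    chime
  </audio>

  <!-- IPA symbols written as numeric character references. -->
  <s>I like <phoneme alphabet="ipa"
      ph="t&#x259;&#x2C8;me&#x26A;to&#x28A;">tomato</phoneme> soup.</s>

  <!-- The ampersand in the alphabet name is escaped as &amp;. The ph value is
       a placeholder: replace it with a string from the Language Supplement. -->
  <s>Next comes a <phoneme alphabet="x-l&amp;h+"
      ph="PLACEHOLDER">word</phoneme>.</s>

  <!-- The content is spoken only if the ActivePrompt cannot be found. -->
  <prompt id="banking:greeting">Welcome to the bank.</prompt>
</speak>
```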