Vocalizer SSML support

Speech Synthesis Markup Language (SSML) is a markup language specification for voice browsers established by the World Wide Web Consortium (W3C). SSML provides a rich, XML-based markup language that assists synthetic speech generation in web and other applications. The essential role of the markup language is to standardize control of pronunciation, volume, rate, and other aspects of speech.

The Vocalizer SDK provides a built-in preprocessor that supports most of the W3C Speech Synthesis Markup Language (SSML) Version 1.0 specification (W3C Recommendation, 7 September 2004).

Moreover, Vocalizer extends SSML with a few Nuance-specific elements/attributes, and provides schema documents for the extensions in VOCALIZER_SDK\doc\synthesis.xsd and synthesis-core.xsd.

See each Language Supplement for information on language-specific support.

SSML compliance

The default SSML parser supports all elements/attributes in the September 2004 Recommendation, regardless of their rating (MUST, REQUIRED, SHALL, SHOULD, RECOMMENDED, MAY, OPTIONAL), but with the following exceptions:

The <emphasis> element: the "none" level is not supported. Using this element does not necessarily lead to audible differences, as the system may elect to ignore these targets in order to produce optimal natural speech output.
The <voice> element is handled as follows:
- The variant attribute is not supported.
- The age attribute is supported, but to use this attribute you need to install a set of voices with varying age over the same language and gender. This requires the use of custom voices.
The <prosody> element is handled as follows:
- Pitch, contour, range, and duration attributes are ignored.
- Rate and volume attributes are properly handled.
The <break> element:
- The maximum duration for each <break> is 60 seconds (60000ms). To create a longer break, use more than one <break> (which must appear in separate <s> elements). For example:
```
<s>Let's take a two-minute pause <break time="60000ms"/></s>
<s><break time="60000ms"/> and now the break is finished.</s>
```
- The <break strength="none"/> setting only has an audible effect when the TTS engine would have inserted a sentence break without an explicit <break> element.
The <lang> element (as defined in version 1.1 of the SSML specification) is supported. Use it to specify the natural language of the content. Supported attributes of this element:
xml:lang (required): Specifies the language of the enclosed content. Follow the guidelines for the xml:lang attribute of the <speak> element.
onlangfailure (optional): Describes the desired behavior when language speaking fails. Contains one of the following:
- ignorelang: ignore the change in language and speak as if the content were in the previous language.
- ignoretext: do not render the text that is in the failed language.
- changevoice: switch to another voice that can speak the language, if one exists. Otherwise, use ignorelang. (Using this value with nested lang elements can produce unpredictable results: the system might not detect language speaking failures.)
- processorchoice (default): same meaning as ignorelang.
The <meta> element: the http-equiv attribute is not supported.
The <say-as> element: the SSML specification does not standardize the list of <say-as> attribute values. See Say-as support.

Note: To learn about SSML compliance for the light SSML parser, see Using the light SSML parser.
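
The 60-second limit on a single <break> described above can be handled programmatically. The following Python sketch splits a longer pause into chunks of at most 60000 ms and places each <break> in its own <s> element, mirroring the two-minute example. The function name `long_pause` and its parameters are hypothetical and not part of the Vocalizer SDK.

```python
from xml.sax.saxutils import escape

MAX_BREAK_MS = 60000  # per-<break> limit stated in the compliance list


def long_pause(total_ms, before="", after=""):
    """Return a list of <s> elements whose <break> times add up to total_ms.

    `before` is spoken ahead of the first break and `after` follows the last
    one. Both are hypothetical convenience parameters for this sketch.
    """
    if total_ms <= 0:
        raise ValueError("total_ms must be positive")

    # Split the requested pause into chunks no longer than MAX_BREAK_MS.
    chunks = []
    remaining = total_ms
    while remaining > 0:
        step = min(remaining, MAX_BREAK_MS)
        chunks.append(step)
        remaining -= step

    sentences = []
    last = len(chunks) - 1
    for i, ms in enumerate(chunks):
        lead = escape(before) + " " if (i == 0 and before) else ""
        tail = " " + escape(after) if (i == last and after) else ""
        # Each <break> goes into a separate <s> element.
        sentences.append(f'<s>{lead}<break time="{ms}ms"/>{tail}</s>')
    return sentences
```

For example, `long_pause(120000, "Let's take a two-minute pause", "and now the break is finished.")` yields the same two <s> elements as the example in the list above.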

Audio fetch error handling

Volume scale conversion

The default value is 100, and the scale is linear in amplitude. Although SSML specifies a range of 0–100, Vocalizer extends the range to 200; values above 100 are reached via relative changes or the symbolic values "loud" and "x-loud".

The table below describes the mapping between the SSML volume scale and the Nuance native volume scale (where the volume value is an integer in the range 0 to 100 which can be set via the native <ESC>\vol=x\ markup).

SSML volume value	Amplitude amplification factor	Loudness in dB	Nuance volume value
0	0.00	-∞ dB	0
10	0.10	-20.0 dB	13
20	0.20	-14.0 dB	33
30	0.30	-10.5 dB	45
40	0.40	-8.0 dB	53
50	0.50	-6.0 dB	60
60	0.60	-4.4 dB	65
70	0.70	-3.1 dB	70
80	0.80	-1.9 dB	74
90	0.90	-0.9 dB	77
100	1.00	0.0 dB	80
(141)	1.41	+3.0 dB	90
(200)	2.00	+6.0 dB	100

The formula for converting the SSML volume value (Vssml) to the amplification factor A is:

A = Vssml / 100

The formula for converting a non-zero amplification factor A to the corresponding Vocalizer volume value Xvocalizer is:

Xvocalizer = Round((20 * log10(A) / 0.30) + 80)

The formula for converting a non-zero Vocalizer value Xvocalizer to the dB value Y is:

Y (dB) = (Xvocalizer - 80) * 0.3 dB
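
As a rough illustration, the three formulas above can be chained in a few lines of Python. The function names are invented for this sketch and are not part of any Vocalizer API. Two details are assumptions: `Round` is taken to mean rounding half up, and the result is clamped to the native 0 to 100 range. An SSML value of 0 is mapped to native 0 as in the table, because the logarithmic formula is only defined for non-zero amplification factors.

```python
import math


def round_half_up(x):
    # Assumption: "Round" in the formulas means round-half-up.
    return math.floor(x + 0.5)


def ssml_to_amplification(v_ssml):
    """A = Vssml / 100, for SSML volume values in the extended 0..200 range."""
    if not 0 <= v_ssml <= 200:
        raise ValueError("SSML volume must be in the range 0..200")
    return v_ssml / 100.0


def amplification_to_vocalizer(a):
    """Xvocalizer = Round((20 * log10(A) / 0.30) + 80) for non-zero A."""
    if a <= 0:
        return 0  # the table maps SSML 0 (silence) to native 0
    x = round_half_up(20 * math.log10(a) / 0.30 + 80)
    return max(0, min(100, x))  # assumption: clamp to the native range


def vocalizer_to_db(x_vocalizer):
    """Y (dB) = (Xvocalizer - 80) * 0.3, for non-zero native values."""
    return (x_vocalizer - 80) * 0.3


def ssml_volume_to_vocalizer(v_ssml):
    return amplification_to_vocalizer(ssml_to_amplification(v_ssml))
```

With these definitions, `ssml_volume_to_vocalizer(50)` returns 60 and `ssml_volume_to_vocalizer(200)` returns 100, matching the table rows.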

The SSML symbolic values are mapped as follows:

SSML volume value	Symbolic value	Amplitude amplification factor	Loudness in dB
0	silent	0.00	-∞ dB
18	x-soft	0.18	-15.0 dB
50	soft	0.50	-6.0 dB
100	medium	1.00	0.0 dB
(141)	loud	1.41	+3.0 dB
(200)	x-loud	2.00	+6.0 dB

Rate scale conversion

Vocalizer fully supports SSML rate markup. Use the following tables/rules to map SSML markup to equivalent Vocalizer native markup.

The default value is 1.00.

SSML "number" value	Symbolic value	SSML percentage w.r.t. voice default	Vocalizer native rate value
0.50	x-slow	-50%	50
0.70	slow	-30%	70
1.00	medium	+0%	100
1.60	fast	+60%	160
2.50	x-fast	+150%	250

SSML descriptive and number values change the rate relative to the voice default. All other rate changes are relative to the (XML) parent element.

The following formulas convert an SSML <prosody rate="Xssml"> value into a Vocalizer <ESC>\rate=Yvocalizer\ value. Because rate changes are relative to the parent element, you must keep track of all changes to your SSML value before converting. For example, changing from 50 to 25 is -50%, but restoring the original value from 25 to 50 is +100%.

When increasing the rate (Xssml > 1.0):

Yvocalizer = Round(100 * Xssml)

When decreasing the rate (Xssml < 1.0):

Yvocalizer = Round(100 - (1 - Xssml) * 100)
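
The bookkeeping for relative rate changes can be sketched in Python as follows. The sketch reproduces the rows of the rate table above (for example 0.50 → 50 and 1.60 → 160) by scaling the SSML number by 100. Symbolic and plain number values are resolved against the voice default, and percentages against the parent element. The function names are hypothetical, and rounding half up is an assumption.

```python
import math

# Symbolic values and their SSML "number" equivalents, from the table above.
SYMBOLIC_RATE = {
    "x-slow": 0.50,
    "slow": 0.70,
    "medium": 1.00,
    "fast": 1.60,
    "x-fast": 2.50,
}


def resolve_rate(value, parent=1.0):
    """Return the rate as a multiple of the voice default.

    `value` is the text of a <prosody rate="..."> attribute and `parent` is
    the already resolved rate of the enclosing element (1.0 at the root).
    """
    value = value.strip()
    if value in SYMBOLIC_RATE:
        return SYMBOLIC_RATE[value]  # relative to the voice default
    if value.endswith("%"):
        # Percentages are relative to the parent element.
        return parent * (1 + float(value[:-1]) / 100)
    return float(value)  # plain numbers are relative to the voice default


def to_native_rate(x_ssml):
    """Map the resolved SSML number to the native rate value."""
    return math.floor(100 * x_ssml + 0.5)  # assumption: round half up
```

Starting from a parent rate of 0.5 (native 50), `resolve_rate("-50%", parent=0.5)` gives 0.25 (native 25), and `resolve_rate("+100%", parent=0.25)` restores 0.5, matching the 50 → 25 → 50 example in the text.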

Break implementation

Symbolic value for `strength`	Duration in ms of the pause	Corresponding native markup value
x-weak	100	<ESC>\pause=100\
weak	200	<ESC>\pause=200\
medium (default)	400	<ESC>\pause=400\
strong	700	<ESC>\pause=700\
x-strong	1200	<ESC>\pause=1200\
none	0	<ESC>\eos=0\
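
For applications that generate native markup directly, the table above can be expressed as a small lookup. This is only a sketch: the function name is hypothetical, and it assumes that <ESC> in the native markup stands for the ASCII escape character (0x1B).

```python
ESC = "\x1b"  # assumption: <ESC> denotes the ASCII escape character

# Pause durations in milliseconds for each symbolic strength, per the table.
BREAK_MS = {
    "x-weak": 100,
    "weak": 200,
    "medium": 400,
    "strong": 700,
    "x-strong": 1200,
}


def native_break(strength="medium"):
    """Return the native control sequence for an SSML break strength."""
    if strength == "none":
        # "none" maps to the end-of-sentence control, not to a pause.
        return f"{ESC}\\eos=0\\"
    if strength not in BREAK_MS:
        raise ValueError(f"unknown break strength: {strength!r}")
    return f"{ESC}\\pause={BREAK_MS[strength]}\\"
```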

Nuance SSML extensions

The Vocalizer SSML extensions are:

The <audio> element supports extra attributes to control internet fetching as described for the W3C VoiceXML 2.0 specification’s version of the <audio> element:
- fetchtimeout: time to attempt to open and read the audio document. The value must be an unsigned integer with a mandatory suffix as required by the VoiceXML 2.0 specification, "s" for seconds, "ms" for milliseconds.
- maxage: value for the HTTP 1.1 Cache-Control max-age directive. This specifies that the application is willing to accept a cached copy of the audio document no older than this value. You can use a value of 0 to force revalidation of the cached copy with the origin server, but in most cases it's best to let the origin server control cache expiration (by omitting this attribute). The value must be an unsigned integer that specifies the number of seconds. As required by the VoiceXML 2.0 specification, it must not have a suffix: "s" and "ms" are not allowed.
- maxstale: value for the HTTP 1.1 Cache-Control max-stale directive. This specifies that the client is willing to accept a cached copy that has expired by up to this value past the expiration time specified by the origin server. In most cases, set this property to 0 or omit it (thus respecting the cache expiration time specified by the origin server). The value must be an unsigned integer that specifies the number of seconds. As required by the VoiceXML 2.0 specification, it must not have a suffix: "s" and "ms" are not allowed.
- fetchhint: "prefetch" to allow prefetching the audio content, "safe" (the default) to follow HTTP/1.1 caching semantics. Vocalizer allows this attribute, but currently does not behave differently for "prefetch" mode.
  Note: Vocalizer does not support the <audio> expr attribute defined in the VoiceXML 2.0 specification.
The <phoneme> element supports specifying L&H+ phoneme strings when the alphabet attribute is set to "x-l&h+", and phoneme strings in the IPA alphabet when the alphabet attribute is set to "ipa". The ampersand is a reserved XML character, so in an SSML document the L&H+ alphabet needs to be specified with alphabet="x-l&amp;h+". Use the necessary escape characters for phoneme strings in the IPA alphabet (because they cannot be expressed otherwise). See each voice Language Supplement for a list of escape codes.
The <speak>, <s>, and <p> elements support an optional ssft-domaintype attribute for activating an ActivePrompt domain, equivalent to the <ESC>\domain\ native control sequence. The attribute value is the ActivePrompt domain name.
The <prompt> element supports specifying ActivePrompt IDs, equivalent to the <ESC>\prompt\ native control sequence. The "id" attribute is required, and specifies the ActivePrompt in <domain>:<prompt> format. The content of the element specifies fallback text that is only spoken if the ActivePrompt cannot be found, similar to SSML <audio>.
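
The value formats described above for the <audio> fetch attributes, and the ampersand escaping needed for the L&H+ alphabet name, can be checked before a document is sent to the engine. The Python sketch below is one possible approach using only the standard library. The function names are hypothetical, and the phoneme string passed in is supplied by the caller; no real L&H+ transcription is assumed here.

```python
import re
import xml.etree.ElementTree as ET

_TIMEOUT = re.compile(r"[0-9]+(s|ms)")  # unsigned integer, mandatory suffix
_SECONDS = re.compile(r"[0-9]+")        # unsigned integer, no suffix


def check_audio_attrs(attrs):
    """Return a list of problems found in a dict of <audio> attributes."""
    problems = []
    timeout = attrs.get("fetchtimeout")
    if timeout is not None and not _TIMEOUT.fullmatch(timeout):
        problems.append('fetchtimeout needs an "s" or "ms" suffix')
    for name in ("maxage", "maxstale"):
        value = attrs.get(name)
        if value is not None and not _SECONDS.fullmatch(value):
            problems.append(f"{name} must be whole seconds without a suffix")
    hint = attrs.get("fetchhint")
    if hint is not None and hint not in ("prefetch", "safe"):
        problems.append('fetchhint must be "prefetch" or "safe"')
    if "expr" in attrs:
        problems.append("expr is not supported")
    return problems


def phoneme_element(text, ph, alphabet="x-l&h+"):
    """Serialize a <phoneme> element; the serializer escapes the ampersand."""
    element = ET.Element("phoneme", {"alphabet": alphabet, "ph": ph})
    element.text = text
    return ET.tostring(element, encoding="unicode")
```

Building the element through an XML library rather than by string concatenation means the alphabet name is written as "x-l&amp;amp;h+" automatically, so the escaping rule cannot be forgotten.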
