Input to synthesize: Text or SSML

The SynthesisRequest Input message contains the material for Neural TTSaaS to synthesize. It can be plain text or SSML code.

If you are using the Sample synthesis client for Neural TTSaaS, enter the different types of input as request.input lines in the input file, flow.py.

Plain text input

For plain text input, use SynthesisRequest voice: name and input: text: text to specify a voice and the text to synthesize.

For example, this is synthesized as “Your order will be ready to pick up in forty five minutes”:

SynthesisRequest(
    voice = Voice(
        name = "en-US-JennyNeural"
    ),
    input = Input(
        text = Text(
            text = "Your order will be ready to pick up in 45 minutes."
        )
    )
)

If you are using the sample client, enter your input in flow.py:

request.voice.name = "en-US-JennyNeural"
request.input.text.text = "Your order will be ready to pick up in 45 minutes."

SSML input

For SSML input, use SynthesisRequest input: ssml: text to specify your SSML message. Include the voice either as voice: name or as a <voice> element within the SSML input.

In SSML input, the <speak> and <voice> elements are required by SSML but Neural TTSaaS adds them if necessary. See Partial SSML below.

This example is synthesized as “It’s twenty four thousand nine hundred one miles around the earth, or forty thousand seventy five kilometers.” (For multiple lines, enclose the message in triple quotation marks.)

SynthesisRequest(
    input = Input(
        ssml = SSML(
            text = '''<voice name="en-US-ChristopherNeural">
            It's 24,901 miles around the earth, or 40,075 km.
            </voice>'''
        )
    )
)

If you are using the sample client, enter your input in the flow.py file as follows:

request.input.ssml.text = '''<voice name = "en-US-ChristopherNeural">
It's 24,901 miles around the earth, or 40,075 km.
</voice>'''

You may also include other SSML elements. For example, these <prosody> elements change the speaking volume:

request.input.ssml.text = '''<voice name="en-US-ChristopherNeural">
This is the normal volume of my voice.
<prosody volume="+30">I can also speak more loudly. </prosody>
<prosody volume="50">And I can also speak more quietly.</prosody>
</voice>'''

See SSML elements below for more examples.
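
Because SSML is XML, characters such as & and < in the text to be spoken must be escaped before they are placed inside a <voice> element. The following is a minimal sketch using only the Python standard library. The helper name build_voice_ssml is hypothetical and is not part of the sample client.

```python
from xml.sax.saxutils import escape, quoteattr


def build_voice_ssml(voice_name: str, text: str) -> str:
    """Wrap plain text in a <voice> element, escaping XML special characters.

    Hypothetical helper for illustration; not part of Neural TTSaaS.
    """
    # quoteattr() escapes the attribute value and adds the surrounding quotes.
    # escape() replaces &, < and > in the text with XML entities.
    return f"<voice name={quoteattr(voice_name)}>{escape(text)}</voice>"


ssml = build_voice_ssml("en-US-ChristopherNeural", "Fish & chips cost < 10 dollars.")
print(ssml)
```

The resulting string can then be assigned to request.input.ssml.text in flow.py.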

Partial SSML

You may omit some required elements in your SSML input, and let Neural TTSaaS supply them automatically. A complete SSML request contains a <speak> element and one or more <voice> elements, enclosing the text to be synthesized.

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" 
xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
<voice name="en-US-JennyNeural">
Text to be synthesized
</voice>
</speak>

If your SSML message does not include these elements, Neural TTSaaS supplies them in most situations.

Optional <speak>

You may omit the enclosing <speak> element completely and let Neural TTSaaS add it before sending the request. For example, you enter:

<voice name="en-US-JennyNeural">
This is a test <break time="1000ms" /> a very simple test. 
</voice>

And Neural TTSaaS adds the <speak> element before sending:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" 
xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
<voice name="en-US-JennyNeural">
This is a test <break time="1000ms" /> a very simple test. 
</voice>
</speak>

Notice that Neural TTSaaS includes xmlns:mstts in the <speak> element. This custom XML namespace is required to support Microsoft voice styles. See Voice style.

You may include the <speak> element with one or more attributes, and Neural TTSaaS will provide the rest. If you do provide a complete <speak> element, it must contain the correct information.
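
To preview what the complete message might look like, the wrapping can be imitated locally. This is only a sketch of the behavior described above, not the service's own code: wrap_in_speak is a hypothetical helper, it reuses the <speak> attributes shown in the example, and it does not fill in missing attributes on an existing <speak> element.

```python
import xml.etree.ElementTree as ET

# Opening tag copied from the complete SSML example in this section.
SPEAK_OPEN = (
    '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
    'xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">'
)


def wrap_in_speak(partial_ssml: str) -> str:
    """Return the SSML enclosed in <speak>, unless it already starts with one."""
    stripped = partial_ssml.strip()
    if stripped.startswith("<speak"):
        return stripped
    return f"{SPEAK_OPEN}\n{stripped}\n</speak>"


partial = '''<voice name="en-US-JennyNeural">
This is a test <break time="1000ms" /> a very simple test.
</voice>'''

complete = wrap_in_speak(partial)
root = ET.fromstring(complete)  # raises ParseError if the result is not well-formed
print(complete)
```

Parsing the result with a standard XML parser is a quick local check that the message is well-formed before it is sent.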

Optional <voice>

For SSML input, you must provide a voice, either as SynthesisRequest voice.name or a <voice> element in your SSML message. For messages with just one voice, you may use voice.name and omit the <voice>:

request.voice.name = "en-US-JennyNeural"
request.input.ssml.text = 'Take a deep breath, pause <break time="2000ms" />, then exhale slowly.'

Alternatively, add a <voice> tag in the SSML message. (You may also set a voice.name without error, but it is ignored.)

request.input.ssml.text = '''<voice name="en-US-JennyNeural">
Take a deep breath, pause <break time="2000ms" />, then exhale slowly.
</voice>'''

If you don’t provide a voice at all, Neural TTSaaS returns an error.

If your SSML message includes multiple voices, you must set them using <voice> elements. For example:

request.input.ssml.text = '''<voice name="en-US-JennyNeural">
Hello, it's Jenny.</voice>
<voice name="en-US-AriaNeural">
Hi, it's Aria.
</voice>'''
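
The voice rules above can be summarized as a small client-side check. This is a sketch under stated assumptions: voice_source is a hypothetical helper, it detects <voice> elements with a simple regular expression rather than a full XML parse, and the error it raises is local, not the error returned by Neural TTSaaS.

```python
import re

# Matches the start of a <voice ...> element anywhere in the SSML text.
VOICE_TAG = re.compile(r"<voice\b")


def voice_source(voice_name: str, ssml: str) -> str:
    """Report where the voice for an SSML request would come from.

    Returns "ssml" when the message contains a <voice> element (any
    voice.name is then ignored), "request" when only voice.name is set,
    and raises ValueError when no voice is provided at all.
    """
    if VOICE_TAG.search(ssml):
        return "ssml"
    if voice_name:
        return "request"
    raise ValueError("No voice: set voice.name or add a <voice> element")


print(voice_source("en-US-JennyNeural", "Take a deep breath."))
print(voice_source("en-US-JennyNeural", '<voice name="en-US-AriaNeural">Hi.</voice>'))
```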

SSML elements

Neural TTSaaS supports the SSML elements and attributes described in the Microsoft text-to-speech documentation: Speech Synthesis Markup Language (SSML) overview.

The following examples use the Sample synthesis client for Neural TTSaaS to illustrate how to include some of these elements in your Neural TTSaaS applications.

Text only

This example shows plain text input synthesized by Jenny to say, “This is a test, a very simple test.”

request.input.ssml.text = '''<voice name="en-US-JennyNeural">
This is a test, a very simple test.
</voice>'''

The <speak> element is omitted in these examples, as Neural TTSaaS adds it automatically. You may set the voice as a <voice> element (as shown here) or as a SynthesisRequest voice.name.

Multiple voices

In this multi-voice example, Jenny starts by saying, “Hello it’s Jenny,” then Aria says, “Hi it’s Aria.”

request.input.ssml.text = '''<voice name="en-US-JennyNeural">
Hello it's Jenny.</voice>
<voice name="en-US-AriaNeural">
Hi it's Aria.
</voice>'''

In an SSML message containing multiple voices, you must set the voices using <voice> elements.

Multilingual voice

A Jenny voice, en-US-JennyMultilingualNeural, has multilingual capabilities. Here she starts in English to say, “Hello and welcome to,” then switches to French to say, “la ville de Saint Jean sur Richelieu.”

request.input.ssml.text = '''<voice name="en-US-JennyMultilingualNeural">
Hello and welcome to
<lang xml:lang="fr-CA">la ville de St Jean sur Richelieu.</lang>
</voice>'''

Voice style

This example gives Jenny a cheerful style to her voice as she says, “That’d be just amazing.”

request.input.ssml.text = '''<voice name="en-US-JennyNeural">
This is my normal speaking style. Now I'm going to try to be more cheerful.</voice>
<voice name="en-US-JennyNeural"><mstts:express-as style="cheerful">
That'd be just amazing.</mstts:express-as>
</voice>'''

The style attribute is part of a custom Microsoft element, mstts:express-as, and requires an additional xmlns attribute in <speak> to identify the mstts namespace: xmlns:mstts="https://www.w3.org/2001/mstts". If you omit this attribute (or the entire <speak> element), Neural TTSaaS adds it automatically.
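
The need for the namespace declaration can be seen with any standard XML parser. The sketch below uses Python's xml.etree.ElementTree to show that the fragment alone is not well-formed XML, while the same fragment inside a <speak> element that declares xmlns:mstts is. It is a local illustration only and says nothing about how Neural TTSaaS validates input; is_well_formed is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

fragment = '''<voice name="en-US-JennyNeural"><mstts:express-as style="cheerful">
That'd be just amazing.</mstts:express-as>
</voice>'''

# The same fragment enclosed in a <speak> element that declares the mstts prefix.
with_namespace = (
    '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
    'xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">'
    + fragment
    + "</speak>"
)


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as namespace-aware XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(fragment))        # False: the mstts prefix is not declared
print(is_well_formed(with_namespace))  # True
```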

To learn which styles your voice supports, use GetVoicesRequest, optionally using the Sample synthesis client for Neural TTSaaS.

In the Microsoft documentation, see Voice styles and roles in Language and voice support.

Say as

This example uses the <say-as> tag to let Christopher interpret an address correctly. He says: “My address is, apartment seventeen, twenty-eight north Whitney street, Saint Augustine Beach, Florida, three two zero eight four six seven one five.”

request.input.ssml.text = '''<voice name="en-US-ChristopherNeural">
My address is, <say-as interpret-as="address">Apt. 17, 28 N. Whitney St., 
Saint Augustine Beach, FL 32084-6715</say-as>
</voice>'''

For details, see the Microsoft documentation: say-as element.

Prosody

The prosody element includes several attributes for changing the pattern of speech, including pitch variations, the speaking rate, and the volume. This example uses <prosody volume> to demonstrate some different speech volumes.

Note that 100 is the loudest absolute value, and the default. To increase the volume, use a relative value (+30 in this example).

request.input.ssml.text = '''<voice name="en-US-ChristopherNeural">
This is the normal volume of my voice.
<prosody volume="+30">I can also speak more loudly. </prosody>
<prosody volume="50">And I can also speak more quietly.</prosody>
</voice>'''

For details, see the Microsoft documentation: Adjust prosody.
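
As a rough illustration of the absolute and relative forms, the following sketch builds a <prosody volume> element and rejects absolute values outside the 0 to 100 range. The helper prosody_volume is hypothetical, the lower bound of 0 is an assumption taken from the Microsoft documentation, and only the numeric forms used in this example are handled.

```python
from xml.sax.saxutils import escape


def prosody_volume(text: str, volume: str) -> str:
    """Wrap text in <prosody volume="...">.

    volume is either an absolute number such as "50" or a relative
    change with a leading sign such as "+30" or "-10".
    """
    is_relative = volume.startswith(("+", "-"))
    number = float(volume)  # raises ValueError if volume is not numeric
    if not is_relative and not 0 <= number <= 100:
        raise ValueError("absolute volume must be between 0 and 100")
    return f'<prosody volume="{volume}">{escape(text)}</prosody>'


louder = prosody_volume("I can also speak more loudly.", "+30")
quieter = prosody_volume("And I can also speak more quietly.", "50")
print(louder)
print(quieter)
```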

Lexicon

The lexicon element points to a custom lexicon file that provides pronunciations for one or more words or phrases. The lexicon file is an XML file and must be stored on a publicly accessible web server.

For example, this lexicon1.xml file expands an acronym and provides phonemes for two words:

<lexicon xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="1.0" 
xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd" 
alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>NVC</grapheme>
    <alias>Nuance Vocalizer for Cloud</alias>
  </lexeme>
  <lexeme>
    <grapheme>pecan</grapheme>
    <phoneme>piˈkɘn</phoneme>
  </lexeme>
  <lexeme>
    <grapheme>tomato</grapheme>
    <phoneme>tə.ˈma.toʊ</phoneme>
  </lexeme>
</lexicon>

When the lexicon is hosted on a public web server and referenced in this SSML request, “NVC” is expanded to “Nuance Vocalizer for Cloud,” and the words “tomato” and “pecan” are pronounced differently.

request.input.ssml.text = '''<voice name="en-US-ChristopherNeural">
<lexicon uri="https://www.example.com/lexicon1.xml"/>
Welcome to the snack bar at NVC. Our specials today are tomato salad and pecan pie. 
</voice>'''

In the Microsoft documentation, see Custom lexicon and the related phoneme element.
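
Before uploading a lexicon file, it can help to confirm locally that it parses and contains the entries you expect. The sketch below does this with Python's xml.etree.ElementTree on a trimmed copy of the lexicon above (the schema attributes are omitted for brevity). The function lexicon_entries is a hypothetical helper, not part of Neural TTSaaS.

```python
import xml.etree.ElementTree as ET

# Namespace of the Pronunciation Lexicon Specification, as used in lexicon1.xml.
PLS_NS = {"pls": "http://www.w3.org/2005/01/pronunciation-lexicon"}

LEXICON = '''<lexicon xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
version="1.0" alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>NVC</grapheme>
    <alias>Nuance Vocalizer for Cloud</alias>
  </lexeme>
  <lexeme>
    <grapheme>tomato</grapheme>
    <phoneme>tə.ˈma.toʊ</phoneme>
  </lexeme>
</lexicon>'''


def lexicon_entries(xml_text: str) -> dict:
    """Map each grapheme to its alias, or to its phoneme if no alias is given."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    entries = {}
    for lexeme in root.findall("pls:lexeme", PLS_NS):
        grapheme = lexeme.findtext("pls:grapheme", namespaces=PLS_NS)
        alias = lexeme.findtext("pls:alias", namespaces=PLS_NS)
        phoneme = lexeme.findtext("pls:phoneme", namespaces=PLS_NS)
        entries[grapheme] = alias if alias is not None else phoneme
    return entries


entries = lexicon_entries(LEXICON)
print(entries)
```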

Prerecorded audio

The audio element lets you include prerecorded audio in your synthesized speech. The audio must be a wave file on a public HTTPS web server. For example:

request.input.ssml.text = '''<voice name="en-US-JennyNeural">
Please leave a message at the tone. <audio src="https://www.example.com/beep.wav">Beep</audio>
</voice>'''

In this example, the audio file contains the complete prompt to the user, along with alternative text to be rendered by the Jenny voice if the audio file cannot be played.

request.input.ssml.text = '''<voice name="en-US-JennyNeural">
<audio src="https://www.example.com/welcome.wav">Hello and welcome to the coffee app</audio>
</voice>'''

In the Microsoft documentation, see Add recorded audio.
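
The following sketch builds an <audio> element with fallback text and checks locally that the source is an HTTPS URL, as required above. The helper audio_element is hypothetical, and the check covers only the URL scheme, not whether the file is reachable or in a supported format.

```python
from urllib.parse import urlparse
from xml.sax.saxutils import escape, quoteattr


def audio_element(src: str, fallback_text: str) -> str:
    """Build an <audio> element whose body is spoken if the file cannot be played."""
    if urlparse(src).scheme != "https":
        raise ValueError("audio src must be an HTTPS URL")
    return f"<audio src={quoteattr(src)}>{escape(fallback_text)}</audio>"


beep = audio_element("https://www.example.com/beep.wav", "Beep")
print(beep)
```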

More SSML elements

For details of all supported SSML elements, consult the Microsoft text-to-speech documentation: Speech Synthesis Markup Language (SSML) overview.