Input to synthesize
You provide the text for TTSaaS to synthesize in the Input message. It can be:
- Plain text
- SSML input
- Tokenized sequence of text and Nuance control codes
If you are using the Sample synthesis client, enter the different types of input as request.input lines in the input file, flow.py.
Plain text
Enter plain text as input type Text. For example, this is synthesized as “Your order will be ready to pick up in forty five minutes.”
SynthesisRequest (
voice = Voice (
name = "Evan",
model = "enhanced"),
input = Input (
text = Text (
text = "Your order will be ready to pick up in 45 minutes.")
)
)
If you are using the sample client, enter the following in flow.py.
request.input.text.text = "Your order will be ready to pick up in 45 minutes."
SSML input
Enter SSML as input type SSML. This example contains only text, and is synthesized (in American English) as “It’s twenty four thousand nine hundred one miles around the earth, or forty thousand seventy five kilometers.”
SynthesisRequest (
voice = Voice (
name = "Evan",
model = "enhanced"),
input = Input (
ssml = SSML (
text = "<speak>It's 24,901 miles around the earth, or 40,075 km.</speak>",
ssml_validation_mode = WARN)
)
)
In the sample client, enter:
request.input.ssml.text = "<speak>It's 24,901 miles around the earth, or 40,075 km.</speak>"
This example contains text and SSML elements, and is synthesized as “I can speak rather quietly, BUT ALSO VERY LOUDLY.”
SynthesisRequest (
voice = Voice (
name = "Evan",
model = "enhanced"),
input = Input (
ssml = SSML (
text = "<speak><prosody volume='10'>I can speak rather quietly,</prosody> <prosody volume='90'>But also very loudly.</prosody></speak>",
ssml_validation_mode = WARN)
)
)
In the sample client, enter:
request.input.ssml.text = "<speak><prosody volume='10'>I can speak rather quietly, </prosody><prosody volume='90'>But also very loudly.</prosody></speak>"
SSML tags
SSML elements may be included when using the input type SSML. These tags indicate how the text segments within the tag should be spoken.
This generic example omits the optional elements.
<speak>Text before SSML element.
<prosody volume="10">Text following or affected by SSML element code.</prosody>
</speak>
Optional elements may be included without error.
<?xml version="1.0"?>
<speak xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US" version="1.0">
Text before SSML element.
<prosody volume="10">Text following or affected by SSML element code.</prosody>
</speak>
When using the sample client, enter SSML in the flow.py input file.
# You can enclose the SSML in double quotes with single quotes inside
request.input.ssml.text = "<speak>It's easy. Take a deep breath, pause for a second or two <break time='1500ms'/> and then exhale slowly.</speak>"
# Or vice versa, escaping any apostrophes
request.input.ssml.text = '<speak>It\'s easy. Take a deep breath, pause for a second or two <break time="1500ms"/> and then exhale slowly.</speak>'
# Or enclose in three (single or double) quotes for multiline text
request.input.ssml.text = '''<speak>It's easy.
Take a deep breath, pause for a second or two
<break time="1500ms"/> and then exhale slowly.
</speak>'''
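Because a quoting slip or an unclosed tag is easy to miss in a long string, it can help to check that the SSML is well-formed before assigning it. The following is a minimal sketch using only the Python standard library; check_ssml is a hypothetical helper, not part of the sample client, and it only confirms that the text parses as XML with a speak root. It does not check whether TTSaaS supports the elements used.

```python
import xml.etree.ElementTree as ET


def check_ssml(ssml: str) -> str:
    """Return ssml unchanged if it is well-formed XML with a <speak> root."""
    try:
        root = ET.fromstring(ssml)
    except ET.ParseError as err:
        raise ValueError(f"SSML is not well-formed XML: {err}") from err
    # Drop an optional namespace prefix such as {http://www.w3.org/2001/10/synthesis}
    tag = root.tag.rsplit("}", 1)[-1]
    if tag != "speak":
        raise ValueError(f"root element is <{tag}>, expected <speak>")
    return ssml


ssml = '''<speak>It's easy.
Take a deep breath, pause for a second or two
<break time="1500ms"/> and then exhale slowly.
</speak>'''
checked = check_ssml(ssml)
# In flow.py, the checked string could then be assigned:
# request.input.ssml.text = checked
```

A string such as "<speak>Hello<speak>", where the closing tag is mistyped, raises ValueError instead of being sent to the service.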
TTSaaS supports the following SSML elements and attributes in SSML input. For details about these items, see SSML Specification 1.0. Note that TTSaaS does not support all SSML elements and attributes listed in the W3C specification.
Switching voice and/or language
You can change the voice and/or the language of the speaker within SSML input, using several methods. These elements change to a voice with a different language:
- voice
- p with the xml:lang attribute
- s with the xml:lang attribute
And this control code changes the language in a multilingual voice:
- lang, with escape codes
xml
An XML declaration, specifying the XML version, 1.0.
In TTSaaS, this element is optional. If omitted, TTSaaS adds it automatically.
<?xml version="1.0"?>
speak
The root SSML element. Mandatory. It contains the required attributes, xml:lang and version, and encloses text to be synthesized along with optional elements shown below. The xml:lang attribute sets the base language for the synthesis.
In TTSaaS, the attributes of this element are optional: only <speak> is required. If the attributes are omitted, TTSaaS adds them automatically to the speak element.
Optional attributes may be specified if wanted. If you include the language, it must match the locale of the principal voice.
<speak xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US" version="1.0">
Input text and tags
</speak>
audio
The audio element inserts a digital audio recording at the current location. TTSaaS supports WAV files containing 16-bit PCM samples. The src attribute specifies the location of the recording as either:
- A URN in Mix cloud storage. Use the Storage gRPC API to upload the audio file. See Sample storage client: Upload audio.
- A secure URL. The file must be a WAV file on a web server accessed through a secure (https) URL, with a valid TLS certificate.
This request specifies a URN for the audio file:
<speak>Please leave your name after the tone.
<audio src="urn:nuance-mix:tag:tuning:audio/coffee_app/beep/mix.tts" />
</speak>
This specifies an audio file via secure URL:
<speak>Please leave your name after the tone.
<audio src="https://<host>/audio/beep.wav" />
</speak>
For both URN and URL access, you may include alternative text in the <audio> element. If the audio file cannot be found or is not a WAV file, TTSaaS synthesizes the alternative text and includes it in the results.
Without the alternative text, TTSaaS reports an error if the file is not a WAV file or is not accessed through a URN or an https URL.
These examples show alternative text for URN or URL access. If the audio file is unavailable, the synthesis results are: “Please leave your name after the tone. Beep.”
<speak>Please leave your name after the tone.
<audio src="urn:nuance-mix:tag:tuning:audio/coffee_app/beep/mix.tts">Beep</audio>
</speak>
<speak>Please leave your name after the tone.
<audio src="https://<host>/audio/beep.wav">Beep</audio>
</speak>
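The src rules above (a URN or an https URL, with optional alternative text) can be checked on the client side before the request is sent. The following is a rough sketch, assuming a hypothetical helper named audio_element; it only inspects the src prefix and escapes the values, and does not verify that the file exists or is a WAV file.

```python
from xml.sax.saxutils import escape, quoteattr


def audio_element(src: str, alt_text: str = "") -> str:
    """Build an <audio> element, rejecting sources that are neither a URN nor https."""
    if not (src.startswith("urn:") or src.startswith("https://")):
        raise ValueError("src must be a URN or a secure (https) URL")
    if alt_text:
        # Alternative text is synthesized if the recording is unavailable
        return f"<audio src={quoteattr(src)}>{escape(alt_text)}</audio>"
    return f"<audio src={quoteattr(src)} />"


urn = "urn:nuance-mix:tag:tuning:audio/coffee_app/beep/mix.tts"
ssml = "<speak>Please leave your name after the tone. " + audio_element(urn, "Beep") + "</speak>"
```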
break
The break element controls pausing between words, overriding the default breaks based on punctuation in the text. The break tag has two optional attributes:
- time specifies the duration of the break as seconds (1s) or milliseconds (300ms).
- strength specifies a keyword to indicate the duration of the break: none, x-weak, weak, medium (default), strong, or x-strong. break strength="none" can prevent a pause (caused by a comma or period, for example) that would otherwise occur.
These two examples are read as: “His name is… Michael.” and “Tom lives in New York City. So does John. He’s at one hundred eighty Park Avenue room twenty four.”
<speak>His name is <break time="300ms"/> Michael.</speak>
<speak>Tom lives in New York City. So does John. He's at 180 Park Ave. <break strength="none"/> Room 24</speak>
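If break tags are generated programmatically, a small helper can keep the attribute values within what is listed above. This is a sketch only; break_tag is a hypothetical name, and it writes the time attribute in milliseconds.

```python
# Strength keywords as listed for the break element
BREAK_STRENGTHS = ("none", "x-weak", "weak", "medium", "strong", "x-strong")


def break_tag(time_ms=None, strength=None):
    """Return a <break/> tag with the optional time and strength attributes."""
    attrs = []
    if time_ms is not None:
        if time_ms < 0:
            raise ValueError("time_ms must not be negative")
        attrs.append(f'time="{time_ms}ms"')
    if strength is not None:
        if strength not in BREAK_STRENGTHS:
            raise ValueError(f"unknown break strength: {strength}")
        attrs.append(f'strength="{strength}"')
    if not attrs:
        return "<break/>"
    return "<break " + " ".join(attrs) + "/>"


ssml = "<speak>His name is " + break_tag(time_ms=300) + " Michael.</speak>"
```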
lang
When used in SSML with a multilingual (-Ml) voice, the lang control code switches to another language supported by the voice. This example uses Zoe-Ml, defined with two languages apart from American English.
voices {
name: "Zoe-Ml"
model: "enhanced"
language: "en-us"
. . .
foreign_languages: "es-mx"
foreign_languages: "fr-ca"
}
In this example using the sample client, Zoe starts in her English voice, then switches to her French voice to read “St-Jean-sur-Richelieu” using French pronunciation. When lang is used with a non-multilingual voice, the text is pronounced using the voice’s base language.
# From the client's input file, flow.py
request.voice.name = "Zoe-Ml"
request.voice.model = "enhanced"
request.input.ssml.text = "<speak>Hello and welcome to \!\lang=fr-CA\ St-Jean-sur-Richelieu \!\lang=normal\. </speak>"
This code is supported in SSML using escape code format, as shown.
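The escape codes in this example contain backslash sequences, such as \! and \l, that Python does not recognize as string escapes. Python keeps them unchanged, but recent versions warn about them, so one option is to build the string with explicitly doubled backslashes. The following is a minimal sketch; lang_span is a hypothetical helper, not part of the sample client.

```python
def lang_span(text: str, lang: str) -> str:
    """Wrap text in lang escape codes and switch back to normal afterwards."""
    start = "\\!\\lang=" + lang + "\\"
    end = "\\!\\lang=normal\\"
    return f"{start} {text} {end}"


ssml = (
    "<speak>Hello and welcome to "
    + lang_span("St-Jean-sur-Richelieu", "fr-CA")
    + ". </speak>"
)
# In flow.py, this could then be assigned:
# request.input.ssml.text = ssml
```

The resulting string contains the same characters as the hand-written example, with single backslashes around each lang code.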
mark
The mark element inserts a bookmark that is returned in the results. The value can be any string.
<speak>This bookmark <mark name="bookmark1"/> marks a reference point.
Another <mark name="bookmark2"/> does the same.
</speak>
p
The p element indicates a paragraph break. A paragraph break is equivalent to break strength="x-strong".
<speak><p>Welcome to Vocalizer.</p>
<p>Vocalizer is a state-of-the-art text to speech system.</p>
</speak>
The optional xml:lang attribute switches to a voice whose base language is the locale specified. It does not use a foreign language of the current voice. If possible, the same gender as the original voice is used.
In this scenario, installed voices include American English (en-US) Evan and Zoe-Ml, as well as Spanish American (es-US) Javier and Paulina-Ml. In this example, Evan reads the English text, then Javier reads the Spanish. When starting with Zoe-Ml, the female Paulina-Ml voice is selected as the Spanish voice.
<speak>Say English for an English message.
<p xml:lang="es-US">Decir español para un mensaje en español.</p></speak>
prosody
The prosody element specifies intonation in the generated voice using several attributes. You may combine multiple attributes within the same prosody element.
prosody pitch
Prosody pitch changes the speaking voice to sound lower (lower values) or higher (higher values). Not supported for all languages. The value is a keyword, a number (50-200, default is 100), or a relative percentage (+/-n%). The keywords are:
- x-low (-30%)
- low (-15%)
- medium (0%)
- default (0%)
- high
- x-high
You may combine pitch, rate, and timbre for more precise results. For example, pitch and timbre values of 80 or 90 for a female voice give a more neutral voice.
<speak>Hi, I'm Zoe. This is the normal pitch and timbre of my voice.
<prosody pitch="80" timbre="90">But now my voice sounds lower and richer.</prosody>
</speak>
prosody rate
Prosody rate sets the speaking rate as a keyword, a number (0-100), or a relative percentage (+/-n%). The keywords are:
- x-slow (-50%)
- slow (-30%)
- medium (0%)
- fast (+60%)
- x-fast (+150%)
<speak>This is my normal speaking rate.
<prosody rate="+50%"> But I can speed up the rate.</prosody>
<prosody rate="-25%"> Or I can slow it down.</prosody>
</speak>
prosody timbre
Prosody timbre changes the speaking voice to sound bigger and older (lower values) or smaller and younger (higher values). Not supported for all languages. The value is a keyword, a number (50-200, default is 100), or a relative percentage (+/-n%). The keywords are:
- x-young (+35%)
- young (+20%)
- medium (0%)
- default (0%)
- old (-20%)
- x-old (-35%)
<speak>This is the normal timbre of my voice.
<prosody timbre="young"> I can sound a bit younger. </prosody>
<prosody timbre="old" rate="-10%"> Or older and hopefully wiser. </prosody>
</speak>
prosody volume
Prosody volume changes the speaking volume. The value is a keyword, a number (0-100), or a relative percentage (+/-n%). The keywords are: silent, x-soft, soft, medium (default), loud, or x-loud.
<speak>This is my normal speaking volume.
<prosody volume="-50%">I can also speak rather quietly,</prosody>
<prosody volume="+50%"> or also very loudly.</prosody>
</speak>
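The numeric ranges given above for the prosody attributes (pitch and timbre 50-200, rate and volume 0-100) can be checked before the SSML is built. The sketch below assumes a hypothetical helper named prosody; it validates integer values only and passes keywords and relative percentages through unchecked.

```python
from xml.sax.saxutils import escape

# Numeric ranges as stated in the prosody sections
NUMERIC_RANGES = {
    "pitch": (50, 200),
    "rate": (0, 100),
    "timbre": (50, 200),
    "volume": (0, 100),
}


def prosody(text, **attrs):
    """Wrap text in a <prosody> element, range-checking integer attribute values."""
    if not attrs:
        raise ValueError("at least one prosody attribute is required")
    parts = []
    for name, value in attrs.items():
        if name not in NUMERIC_RANGES:
            raise ValueError(f"unsupported prosody attribute: {name}")
        if isinstance(value, int):
            low, high = NUMERIC_RANGES[name]
            if not low <= value <= high:
                raise ValueError(f"{name} must be between {low} and {high}")
        parts.append(f'{name}="{value}"')
    return f"<prosody {' '.join(parts)}>{escape(text)}</prosody>"


lower = prosody("But now my voice sounds lower and richer.", pitch=80, timbre=90)
faster = prosody("But I can speed up the rate.", rate="+50%")
```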
s
The s element indicates a sentence break. A sentence break is equivalent to break strength="strong".
<speak><s>The wind was a torrent of darkness, among the gusty trees</s>
<s>The moon was a ghostly galleon, tossed upon cloudy seas</s>
</speak>
The optional xml:lang attribute works as for the p element. In this example, it switches to a fr-CA voice to say the name of the song.
<speak>The name of the song is <s xml:lang="fr-CA"> Je ne regrette rien.</s></speak>
say-as
The say-as element controls how to say specific types of text, using the interpret-as attribute to specify a value and (in some cases) a format. A wide range of input is accepted for most values. The values are:
- address: Provides optimal reading for complete postal addresses. For example, “Apt. 17, 28 N. Whitney St., Saint Augustine Beach, FL 32084-6715” is read as “apartment seventeen, twenty eight north Whitney street, Saint Augustine Beach, Florida three two zero eight four six seven one five.”
<speak><say-as interpret-as="address">Apt. 17, 28 N. Whitney St., Saint Augustine Beach, FL 32084-6715</say-as></speak>
- currency: Reads text as currency. For example, “123.45USD” is read as “one hundred twenty three U S dollars and forty five cents.”
<speak><say-as interpret-as="currency">12USD</say-as></speak>
- date: Reads text as a date. For example, “11/21/2020” is read as “November twenty-first, two thousand twenty.” The format attribute is ignored for date values. It may be specified without error but has no effect. The precise date output is determined by the voice, and ambiguous dates are interpreted according to the conventions of the voice’s locale. For example, “05/12/2020” is read by an American English voice as “May twelfth two thousand twenty” and by a British English voice as “the fifth of December two thousand and twenty.”
<speak><say-as interpret-as="date">11/21/2020</say-as></speak>
- name: Gives correct reading of names, including personal names with roman numerals, such as Pius IX (read as “Pius the ninth”), John I (“John the first”), and Richard III (“Richard the third”). The name must be capitalized but the roman numeral may be in upper or lowercase (III or iii). Do not add a punctuation mark immediately following the roman numeral.
<speak><say-as interpret-as="name">Care Telecom Ltd</say-as></speak>
<speak><say-as interpret-as="name">King Richard III</say-as></speak>
- ordinal: Reads positional numbers such as 1st, 2nd, 3rd, and so on. For example, “12th” is read as “twelfth.”
<speak><say-as interpret-as="ordinal">12th</say-as></speak>
- phone: Reads telephone numbers. For example, “1-800-688-0068” is read as “One, eight hundred, six eight eight, zero zero six eight.”
<speak><say-as interpret-as="phone">1-800-688-0068</say-as></speak>
- raw: Provides a literal reading of the text, such as blocking undesired abbreviation expansion. It operates principally on the abbreviations and acronyms but may impact the surrounding text as well.
<speak><say-as interpret-as="raw">app.</say-as></speak>
- sms: Gives short message service (SMS) reading. For example, “ttyl, James, :-)” is read as “Talk to you later, James, smiley happy.”
<speak><say-as interpret-as="sms">ttyl, James, :-)</say-as></speak>
- spell format=alphanumeric: Spells out all alphabetic and numeric characters, but does not read white space, special characters, and punctuation marks. This is how items are spoken with and without this tag, in American English.
Input | With spell alphanumeric | Without spell alphanumeric
a34y - 347 | A three four Y three four seven | a thirty-four y three hundred forty-seven
12345 | one two three four five | twelve thousand three hundred forty-five
Smythe | capital S M Y T H E | smith
<speak><say-as interpret-as="spell" format="alphanumeric">a34y - 347</say-as></speak>
- spell format=strict: Spells out all characters, including white space, special characters, and punctuation marks. For example, “a34y - 347” is pronounced “A three four Y space hyphen space three four seven.”
For both types of spelling, accented and capital characters are indicated. For example: “café” is spoken as “C A F E acute” and “Abc” is spoken as “capital A B C.”
<speak><say-as interpret-as="spell" format="strict">a34y - 347</say-as></speak>
- state: Expands and pronounces state, city, and province names and abbreviations, as appropriate for the locale. For example, “FL” is read as “Florida.” Not supported for all languages.
<speak><say-as interpret-as="state">FL</say-as></speak>
- streetname: Reads street names and abbreviations. For example, “Emerson Rd.” is pronounced “Emerson road.” Not supported for all languages.
<speak><say-as interpret-as="streetname">Emerson Rd.</say-as></speak>
- streetnumber: Reads street numbers. For example, “11001-11010” is read as “eleven oh oh one to eleven oh ten.” Not supported for all languages.
<speak><say-as interpret-as="streetnumber">11001-11010</say-as></speak>
- time: Gives a time of day reading. For example, “10:00” is pronounced “ten o’clock.” The format attribute is ignored for time values. It may be specified without error but has no effect.
<speak><say-as interpret-as="time">10:00</say-as></speak>
- zip: Reads US zip codes. Supported for American English only.
<speak><say-as interpret-as="zip">01803</say-as></speak>
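When say-as elements are generated from application data, a helper can restrict interpret-as to the values listed above and escape the text. The following is a sketch under stated assumptions: say_as is a hypothetical name, and the check that a spell format must be alphanumeric or strict reflects only the two formats documented here.

```python
from xml.sax.saxutils import escape

# interpret-as values listed for the say-as element
INTERPRET_AS = {
    "address", "currency", "date", "name", "ordinal", "phone", "raw",
    "sms", "spell", "state", "streetname", "streetnumber", "time", "zip",
}
SPELL_FORMATS = {"alphanumeric", "strict"}


def say_as(text, interpret_as, fmt=None):
    """Wrap text in a <say-as> element with a checked interpret-as value."""
    if interpret_as not in INTERPRET_AS:
        raise ValueError(f"unknown interpret-as value: {interpret_as}")
    if interpret_as == "spell" and fmt is not None and fmt not in SPELL_FORMATS:
        raise ValueError(f"unknown spell format: {fmt}")
    attrs = f'interpret-as="{interpret_as}"'
    if fmt is not None:
        attrs += f' format="{fmt}"'
    # escape() protects characters such as & and < in the spoken text
    return f"<say-as {attrs}>{escape(text)}</say-as>"


ssml = "<speak>" + say_as("a34y - 347", "spell", "strict") + "</speak>"
```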
style
The style element sets the speaking style of the voice. Values for name depend on the voice but are usually neutral, lively, forceful, and apologetic. The default depends on the voice. If you request a style that the voice does not support, there is no effect.
The first example reads “Hello, this is Samantha” in Samantha’s default style, then switches to lively style to say “Hope you’re having a nice day!”
The style resets to default at the end of the synthesis request or if it encounters a change of voice. The second example continues with Nathan in default style saying “Hello, this is Nathan.”
<speak>Hello, this is Samantha. <style name="lively">Hope you’re having a nice day!</style>
</speak>
<speak>Hello, this is Samantha. <style name="lively">Hope you’re having a nice day!</style>
<voice name="nathan">Hello, this is Nathan.</voice>
</speak>
voice
The voice element changes the speaking voice, which also forces a sentence break. Values for name are the voices available to the session.
<speak><voice name="samantha">Hello, this is Samantha. </voice>
<voice name="tom">Hello, this is Tom.</voice>
</speak>
If you specify a voice in another language, the text is spoken using that language.
<speak>Hi, my name is Zoe.
<voice name="Audrey-Ml">Bonjour, je m'appelle Audrey.</voice></speak>
Tokenized sequence
You may also enter a sequence of text and Nuance control codes, as input type TokenizedSequence. For example, this is synthesized as “My name is… Jeremiah Jones.”
SynthesisRequest (
voice = Voice (
name = "Evan",
model = "enhanced"),
input = Input (
tokenized_sequence = TokenizedSequence (
tokens = [
Token (text = "My name is "),
Token (control_code = ControlCode (key = "pause", value = "300")),
Token (text = "Jeremiah Jones") ]
)
)
)
In the sample client, enter:
request.input.tokenized_sequence.tokens.extend([
Token(text="My name is "),
Token(control_code=ControlCode(key="pause", value="300")),
Token(text="Jeremiah Jones")
])
The following more complicated example, shown in the sample client, is synthesized (in American English) as “The time and date is: ten o’clock, May twenty-sixth, two thousand twenty. My phone number is: one eight hundred, six eight eight, zero zero six eight.”
request.input.tokenized_sequence.tokens.extend([
Token(text="The time and date is."),
Token(control_code=ControlCode(key="tn", value="time")),
Token(text="10:00"),
Token(control_code=ControlCode(key="pause", value="300")),
Token(control_code=ControlCode(key="tn", value="date")),
Token(text="05/26/2020"),
Token(control_code=ControlCode(key="pause", value="300")),
Token(text="My phone number is."),
Token(control_code=ControlCode(key="tn", value="phone")),
Token(text="1-800-688-0068")
])
Control codes
Nuance control codes, also known as control sequences, may be included in the input text when using the input type TokenizedSequence. These codes indicate how the text segments following the code should be spoken.
A tokenized sequence alternates text and control codes.
SynthesisRequest - Input - TokenizedSequence -
Token (text = "Text before control code"),
Token (control_code=ControlCode (key="code name", value="code value")),
Token (text = "Text following or affected by control code")
When using the sample client, enter tokenized sequences in the flow.py input file, for example:
request.input.tokenized_sequence.tokens.extend([
Token (text = "My name and address is: "),
Token (control_code = ControlCode (key = "tn", value = "name")),
Token (text = "Aardvark & Sons Co. Inc.,"),
Token (control_code = ControlCode (key = "tn", value = "address")),
Token (text = "123 E. Forest Ave., Portland, ME 04103"),
Token (control_code = ControlCode (key = "tn", value = "normal"))
])
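To illustrate the alternating structure without the generated gRPC classes, the sketch below uses stand-in dataclasses named Token and ControlCode. These stand-ins are assumptions for illustration only; in flow.py you would use the real message classes, as in the example above. The hypothetical tokens_from helper turns a compact list of strings and (key, value) pairs into tokens.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ControlCode:
    # Stand-in for the generated ControlCode message (assumption)
    key: str
    value: str = ""


@dataclass
class Token:
    # Stand-in for the generated Token message: holds text or a control code
    text: Optional[str] = None
    control_code: Optional[ControlCode] = None


def tokens_from(items):
    """Convert strings to text tokens and (key, value) pairs to control code tokens."""
    tokens = []
    for item in items:
        if isinstance(item, str):
            tokens.append(Token(text=item))
        else:
            key, value = item
            tokens.append(Token(control_code=ControlCode(key=key, value=value)))
    return tokens


tokens = tokens_from([
    "My name and address is: ",
    ("tn", "name"),
    "Aardvark & Sons Co. Inc.,",
    ("tn", "address"),
    "123 E. Forest Ave., Portland, ME 04103",
    ("tn", "normal"),
])
```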
Nuance supports the following control codes and values in TokenizedSequence.
audio
The audio code inserts a digital audio recording at the current location. The value attribute specifies the location of the recording as either:
- A URN in Mix cloud storage. Use the Storage API to upload the audio file. See Sample storage client: Upload audio.
- A secure URL. The file must be a WAV file on a web server accessed through a secure (https) URL, with a valid TLS certificate.
This request specifies a URN for the audio file:
Token (text = "Please leave your name after the tone. "),
Token (control_code = ControlCode (key = "audio",
value = "urn:nuance-mix:tag:tuning:audio/coffee_app/beep/mix.tts"))
This specifies an audio file via secure URL:
Token (text = "Please leave your name after the tone. "),
Token (control_code = ControlCode (key = "audio",
value = "https://<host>/audio/beep.wav"))
TTSaaS supports WAV files containing 16-bit PCM samples.
If the audio file cannot be found or is not a WAV file, TTSaaS reports an error. With the Synthesize method, TTSaaS synthesizes any text tokens in the sequence but does not download or include the file. In these examples, if the audio file is unavailable, the results are only: “Please leave your name after the tone.”
With UnarySynthesize, TTSaaS does not synthesize anything and simply reports an error.
eos
The eos code controls end-of-sentence detection. Values are:
- 1: Forces a sentence break.
- 0: Suppresses a sentence break. To suppress a sentence break, eos 0 must appear immediately after the symbol (such as a period) that triggers the break.
To disable automatic end-of-sentence detection for a block of text, use readmode explicit_eos.
Token (text = "Tom lives in the U.S."),
Token (control_code=ControlCode (key="eos", value="1")),
Token (text = "So does John. 180 Park Ave."),
Token (control_code=ControlCode (key="eos", value="0")),
Token (text = "Room 24")
lang
The lang code labels text identified as from an unknown language. Values are:
- normal: The current voice language.
- unknown: Any other language. You may not specify an explicit language.
The value lang unknown labels all text from that position (up to a lang normal or the end of input) as being from an unknown language. TTSaaS then uses its language identification feature on a sentence-by-sentence basis to determine the language, and switches to a voice for that language if necessary. The original voice is restored at the next lang normal or the end of the synthesis request.
See LanguageIdentificationParameters.
Language identification is only supported for a limited set of languages.
Token (text = "The name of the song is."),
Token (control_code=ControlCode (key="lang", value="unknown")),
Token (text = "Au clair de la lune."),
Token (control_code=ControlCode (key="lang", value="normal")),
Token (text = "It's a folk song meaning, in the light of the moon.")
When used with a multilingual (-Ml) voice, the lang code switches to another language supported by the voice. This example uses Zoe-Ml, defined with two languages apart from American English.
voices {
name: "Zoe-Ml"
model: "enhanced"
language: "en-us"
. . .
foreign_languages: "es-mx"
foreign_languages: "fr-ca"
}
In this example, the voice reads “St-Jean-sur-Richelieu” using French pronunciation, while the rest of the sentence is in English. When the lang code is used with a non-multilingual voice, “St-Jean-sur-Richelieu” is pronounced using the voice’s base language.
Token (text = "Hello and welcome to the city of "),
Token (control_code=ControlCode (key="lang", value="fr-CA")),
Token (text = "St-Jean-sur-Richelieu."),
Token (control_code=ControlCode (key="lang", value="normal"))
mrk
The mrk code inserts a bookmark that is returned in the results. The value can be any name.
Token (control_code=ControlCode (key="mrk", value="important")),
Token (text = "This is an important point. ")
pause
The pause code inserts a pause of a specified duration in milliseconds. Values are from 1 to 65,535.
Token (text = "My name is "),
Token (control_code=ControlCode (key="pause", value="300")),
Token (text = "Jeremiah Jones. ")
para
The para code indicates a paragraph break and implies a sentence break. The difference between this and eos 1 (end of sentence) is that this triggers the delivery of a paragraph mark event.
Token (text = "Introduction to Vocalizer"),
Token (control_code=ControlCode (key="para")),
Token (text = "Vocalizer is a state-of-the-art text-to-speech system.")
pitch
The pitch code changes the speaking voice to sound lower (lower values) or higher (higher values). Values are between 50 and 200, and 100 is typical.
You may combine pitch, rate, and timbre for more precise results. For example, pitch and timbre values of 80 or 90 for a female voice give a more neutral voice.
Token (text = "Hi, I'm Zoe. This is the normal pitch and timbre of my voice."),
Token (control_code=ControlCode (key="pitch", value="80")),
Token (control_code=ControlCode (key="timbre", value="90")),
Token (text = "But now my voice sounds lower and richer.")
prompt
The prompt code inserts an ActivePrompt at a specific location in the text. The value is the name of the prompt within an ActivePrompt database.
To use an ActivePrompt database, you must upload it to central storage using UploadRequest and load it into the session using SynthesisRequest: Input: SynthesisResource: EnumResourceType: ACTIVEPROMPT_DB or ACTIVEPROMPT_DB_AUTO.
Token (control_code=ControlCode (key="prompt", value="banking::confirm_account_number")),
Token (text = "Thanks ")
rate
The rate code sets the speaking rate as a percentage of the default speaking rate. Values are from 1 to 100, with 50 as the default rate.
You may combine the pitch, rate, and timbre codes for more precise results.
Token (text = "I can "),
Token (control_code=ControlCode (key="rate", value="75")),
Token (text = "speed up the rate"),
Token (control_code=ControlCode (key="rate", value="25")),
Token (text = "or slow it down")
readmode
The readmode code changes the reading mode from sentence mode (the default) to specialized modes. Values are the modes:
- sent: Sentence mode (default).
- char: Character-by-character mode, similar to spelling.
- word: Word-by-word mode.
- line: Line-by-line, or list mode, with a pause at the end of each line.
- explicit_eos: Explicit end-of-sentence mode, with sentence breaks only where indicated by eos 1. In the example, the list will be read without sentence breaks.
Return to readmode sent after the specialized readmode.
Token (control_code=ControlCode (key="readmode", value="sent")),
Token (text = "Please buy green apples. You can also get pears.")
Token (control_code=ControlCode (key="readmode", value="char")),
Token (text = "Apples")
Token (control_code=ControlCode (key="readmode", value="word")),
Token (text = "Please buy green apples.")
Token (control_code=ControlCode (key="readmode", value="line")),
Token (text = "Bananas. Low-fat milk. Whole wheat flour.")
Token (control_code=ControlCode (key="readmode", value="explicit_eos")),
Token (text = "Bananas. Low-fat milk. Whole wheat flour.")
rst
The rst code resets all codes to the default values.
Token (control_code=ControlCode (key="vol", value="10")),
Token (text = "The volume is set to a low value."),
Token (control_code=ControlCode (key="rst")),
Token (text = "Now it is reset to its default value.")
spell
The spell code sets the inter-character pause, in milliseconds, for tn spell. Values are from 1 to 65535.
Token (control_code=ControlCode (key="tn", value="spell")),
Token (control_code=ControlCode (key="spell", value="200")),
Token (text = "a134b"),
Token (control_code=ControlCode (key="tn", value="normal"))
style
The style code sets the speaking style of the voice. Values depend on the voice but are usually neutral, lively, forceful, and apologetic. The default is usually neutral. If you request a style that the voice does not support, there is no effect.
The first example reads “Hello, this is Samantha” in Samantha’s default style, then switches to lively style to say “Hope you’re having a nice day!”
The style resets to default at the end of the synthesis request or if it encounters a change of voice. The second example continues with Nathan in default style saying “Hello, this is Nathan.”
Token (text = "Hello, this is Samantha. "),
Token (control_code=ControlCode (key="style", value="lively")),
Token (text = "Hope you're having a nice day!")
Token (text = "Hello, this is Samantha. "),
Token (control_code=ControlCode (key="style", value="lively")),
Token (text = "Hope you're having a nice day!"),
Token (control_code=ControlCode (key="voice", value="nathan")),
Token (text = "Hello, this is Nathan."),
timbre
The timbre code changes the speaking voice to sound bigger and older (lower values) or smaller and younger (higher values). Values are between 50 and 200, and 100 is typical.
You may combine the pitch, rate, and timbre codes for more precise results.
Token (control_code=ControlCode (key="timbre", value="180")),
Token (text = "I can sound quite young. "),
Token (control_code=ControlCode (key="timbre", value="50")),
Token (text = "Or I can sound old and maybe wise. "),
Token (control_code=ControlCode (key="tn", value="normal"))
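The numeric ranges stated for the pause, pitch, rate, spell, and timbre codes can be checked before a control code is added to a sequence. This is a minimal sketch; check_control_code is a hypothetical helper that works on plain (key, value) strings and leaves all other codes unchecked.

```python
# Value ranges as stated in the pause, pitch, rate, spell, and timbre sections
RANGES = {
    "pause": (1, 65535),
    "pitch": (50, 200),
    "rate": (1, 100),
    "spell": (1, 65535),
    "timbre": (50, 200),
}


def check_control_code(key: str, value: str):
    """Return (key, value) if the value is in range for a numeric code."""
    if key in RANGES:
        low, high = RANGES[key]
        number = int(value)  # raises ValueError if the value is not numeric
        if not low <= number <= high:
            raise ValueError(f"{key} must be between {low} and {high}, got {number}")
    return key, value


pause = check_control_code("pause", "300")
timbre = check_control_code("timbre", "180")
```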
tn
The tn code guides text normalization. Values are the different types of text. After applying the normalization mode, apply another code or return to normal.
tn address
The tn address code provides optimal reading for complete postal addresses.
Token (control_code=ControlCode (key="tn", value="address")),
Token (text = "Apt. 7-12, 28 N. Whitney St., Saint Augustine Beach, FL 32084-6715 "),
Token (control_code=ControlCode (key="tn", value="normal"))
Do not include the name portion of the address to avoid undesired expansions of name-specific abbreviations. Instead, include the name in a separate tn name section prior to the tn address. For example, the full name and address is read as: “Aardvark and Sons Company Incorporated, one two three East Forest avenue, Portland, Maine, zero four one zero three.”
Token (control_code=ControlCode (key="tn", value="name")),
Token (text = "Aardvark & Sons Co. Inc., "),
Token (control_code=ControlCode (key="tn", value="address")),
Token (text = "123 E. Forest Ave., Portland, ME 04103 "),
Token (control_code=ControlCode (key="tn", value="normal"))
tn alphanumeric
The tn alphanumeric code is an alias of tn spell:alphanumeric.
tn boolean
The tn boolean code reads boolean values (true, false, yes, no) by spelling them out. This example spells out “T R U E.”
Token (control_code=ControlCode (key="tn", value="boolean")),
Token (text = "true "),
Token (control_code=ControlCode (key="tn", value="normal"))
tn cardinal
The tn cardinal code is an alias of tn number.
tn characters
The tn characters code is an alias of tn spell:alphanumeric.
tn currency
The tn currency code reads text as currency. For example, “123.45USD” is read as “one hundred twenty three U S dollars and forty five cents.”
Token (control_code=ControlCode (key="tn", value="currency")),
Token (text = "123.45USD "),
Token (control_code=ControlCode (key="tn", value="normal"))
tn date
The tn date code reads text as a date. For example, “11/21/1984” is read as “November twenty-first, nineteen eighty four.”
The precise output is determined by the voice, and ambiguous dates are interpreted according to the conventions of the voice’s locale. For example, “05/12/2020” is read by an American English voice as “May twelfth two thousand twenty” and by a British English voice as “the fifth of December two thousand and twenty.”
Token (control_code=ControlCode (key="tn", value="date")),
Token (text = "11/21/1984 "),
Token (control_code=ControlCode (key="tn", value="normal"))
tn digits
The tn digits code is an alias for tn spell:alphanumeric.
tn name
The tn name code reads names correctly, including personal names with roman numerals, such as Pius IX (read as “Pius the ninth”), John I (“John the first”), and Richard III (“Richard the third”). The name must be capitalized, but the roman numeral may be in uppercase or lowercase (III or iii). Do not include punctuation immediately following the roman numeral in the tn name text. If punctuation is required, include it in the tn normal text.
The following examples are read as: “Care Telecom Limited” and “I’m talking about King Richard the third. He lived in the fifteenth century.”
Token (control_code=ControlCode (key="tn", value="name")),
Token (text = "Care Telecom Ltd. "),
Token (control_code=ControlCode (key="tn", value="normal"))
Token (text = "I'm talking about "),
Token (control_code=ControlCode (key="tn", value="name")),
Token (text = "King Richard III "),
Token (control_code=ControlCode (key="tn", value="normal")),
Token (text = ". He lived in the 15th century. ")
tn normal
The tn normal code returns to generic normalization following a text fragment that is normalized in a special way. All the examples in this tn section include tn normal following the specific normalization segment.
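The pattern used throughout this section (a tn control code, the text, then tn normal) can be wrapped in a small helper. The sketch below is only an illustration: Token and ControlCode are plain Python dataclasses standing in for the generated message classes, and the helper name tn_span is hypothetical, not part of the TTSaaS API.

```python
from dataclasses import dataclass
from typing import List, Optional


# Stand-ins for the generated TTSaaS message classes (assumption: the real
# classes come from the compiled proto files and are not imported here).
@dataclass
class ControlCode:
    key: str
    value: str


@dataclass
class Token:
    text: Optional[str] = None
    control_code: Optional[ControlCode] = None


def tn_span(tn_type: str, text: str) -> List[Token]:
    """Hypothetical helper: apply a tn type to text, then restore tn normal."""
    return [
        Token(control_code=ControlCode(key="tn", value=tn_type)),
        Token(text=text),
        Token(control_code=ControlCode(key="tn", value="normal")),
    ]


tokens = tn_span("currency", "123.45USD ")
for token in tokens:
    print(token)
```

Closing every span with tn normal keeps a special normalization from leaking into the text that follows.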
tn ordinal
The tn ordinal code reads positional numbers such as 1st, 2nd, 3rd, and so on. For example, “12th” is read as “twelfth.”
Token (control_code=ControlCode (key="tn", value="ordinal")),
Token (text = "12th "),
Token (control_code=ControlCode (key="tn", value="normal"))
tn phone
The tn phone code reads telephone numbers. For example, “1-800-688-0068” is read as “One, eight hundred, six eight eight, zero zero six eight.”
Token (control_code=ControlCode (key="tn", value="phone")),
Token (text = "1-800-688-0068 "),
Token (control_code=ControlCode (key="tn", value="normal"))
tn raw
The tn raw code gives a literal reading of the text, for example to block undesired abbreviation expansion. It operates principally on abbreviations and acronyms but may affect the surrounding text as well.
For example, “app.” is read as “app” only, without expanding the abbreviation.
Token (control_code=ControlCode (key="tn", value="raw")),
Token (text = "app. "),
Token (control_code=ControlCode (key="tn", value="normal"))
tn scope
Use the control sequence tn=scope to activate a dictionary for a specific scope. The value of scope is any tn type, including any user-defined types you create.
When creating a dictionary with Vocalizer Studio, you define a scope by assigning a domain to that dictionary. When the dictionary is loaded, the scope is declared as a suffix to the MIME type. When your application supplies marked-up text to be spoken, the mark-up can activate that dictionary by referring to its scope: when the mark-up matches the language and scope of any loaded dictionary, Vocalizer consults that dictionary at runtime. Otherwise, Vocalizer ignores dictionaries that don’t match the language and scope.
Imagine you have an English-speaking application for the sport of long-distance bicycling, and many of the technical descriptions use French words such as “brevet” and “randonneuring” with peculiar American pronunciations. You could create a user dictionary designated as a “biking” domain.
In this example, the dictionary might normalize the spoken text as “Welcome to the render nearing hotline. Every brevay in the series begins on Thursday mornings.”
Token (control_code=ControlCode (key="tn", value="biking")),
Token (text = "Welcome to the randonneuring hotline. Every brevet in the series begins on Thursday mornings. "),
Token (control_code=ControlCode (key="tn", value="normal"))
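The matching rule described above (a dictionary is consulted only when both its language and its scope match the mark-up) can be sketched as a simple filter. This is an illustration of the rule only, not Vocalizer's implementation: the LoadedDictionary class, the dictionaries_consulted function, the dictionary names, and the language code format are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LoadedDictionary:
    name: str      # illustrative label only
    language: str  # language of the dictionary (code format is an assumption)
    scope: str     # domain assigned when the dictionary was created


def dictionaries_consulted(
    loaded: List[LoadedDictionary], language: str, scope: str
) -> List[LoadedDictionary]:
    """Keep only dictionaries whose language and scope both match the mark-up."""
    return [d for d in loaded if d.language == language and d.scope == scope]


loaded = [
    LoadedDictionary("cycling terms", "en-US", "biking"),
    LoadedDictionary("street names", "en-US", "streetname"),
    LoadedDictionary("termes de cyclisme", "fr-FR", "biking"),
]

matches = dictionaries_consulted(loaded, language="en-US", scope="biking")
print([d.name for d in matches])
```

In this sketch, only the first dictionary is selected for English text marked with the biking scope; the other two are ignored because either the scope or the language differs.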
tn sms
The tn sms code reads short message service (SMS) text, expanding texting abbreviations and emoticons. For example, “ttyl, James, :-)” is read as “Talk to you later, James, smiley happy.”
Token (control_code=ControlCode (key="tn", value="sms")),
Token (text = "ttyl, James, :-) "),
Token (control_code=ControlCode (key="tn", value="normal"))
tn spell:alphanumeric
The tn spell:alphanumeric code spells out all alphabetic and numeric characters, but does not read white space, special characters, or punctuation marks. The following table shows how items are spoken with and without this code, in American English.
Input | With spell:alphanumeric | Without spell:alphanumeric |
---|---|---|
a34y - 347 | A three four Y three four seven | a thirty-four y three hundred forty-seven |
12345 | one two three four five | twelve thousand three hundred forty-five |
Smythe | capital S M Y T H E | smith |
For both types of spell normalization (spell:alphanumeric and spell:strict), accented and capital characters are indicated. For example: “café” is spoken as “C A F E acute” and “Abc” is spoken as “capital A B C.”
Token (control_code=ControlCode (key="tn", value="spell:alphanumeric")),
Token (text = "a34y - 347"),
Token (control_code=ControlCode (key="tn", value="normal"))
tn spell:strict
The tn spell:strict code spells out all characters, including white space, special characters, and punctuation marks.
For example, “a34y - 347” is pronounced “A three four Y, space hyphen space, three four seven.”
Token (control_code=ControlCode (key="tn", value="spell:strict")),
Token (text = "a34y - 347"),
Token (control_code=ControlCode (key="tn", value="normal"))
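The difference between the two spell modes comes down to which characters are read. The sketch below approximates that selection in plain Python; it does not reproduce the spoken output, and it assumes that str.isalnum() is a reasonable stand-in for the engine's notion of alphabetic and numeric characters. The function name characters_read is hypothetical.

```python
from typing import List


def characters_read(text: str, strict: bool) -> List[str]:
    """Approximate which characters each spell mode reads aloud.

    Assumption: str.isalnum() stands in for the engine's own classification
    of alphabetic and numeric characters, which may differ in detail.
    """
    if strict:
        # spell:strict reads every character, including spaces and punctuation.
        return list(text)
    # spell:alphanumeric skips white space, special characters, and punctuation.
    return [ch for ch in text if ch.isalnum()]


sample = "a34y - 347"
print(characters_read(sample, strict=False))
print(characters_read(sample, strict=True))
```

For the sample input, the alphanumeric mode keeps seven characters, while the strict mode keeps all ten, including the two spaces and the hyphen.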
tn state
The tn state code expands and pronounces state, city, and province names and abbreviations, as appropriate for the locale. Not supported for all languages.
Token (control_code=ControlCode (key="tn", value="state")),
Token (text = "FL"),
Token (control_code=ControlCode (key="tn", value="normal"))
tn streetname
The tn streetname code reads street names and abbreviations. Not supported for all languages.
Token (control_code=ControlCode (key="tn", value="streetname")),
Token (text = "Emerson Rd."),
Token (control_code=ControlCode (key="tn", value="normal"))
tn telephone
The tn telephone code is an alias of tn phone.
tn time
The tn time code reads text as a time of day. For example, “10:00” is pronounced “ten o’clock.”
Token (control_code=ControlCode (key="tn", value="time")),
Token (text = "10:00"),
Token (control_code=ControlCode (key="tn", value="normal"))
tn zip
The tn zip code reads US ZIP codes. Supported for American English only.
Token (control_code=ControlCode (key="tn", value="zip")),
Token (text = "01803"),
Token (control_code=ControlCode (key="tn", value="normal"))
voice
The voice code changes the speaking voice, which also forces a sentence break. Values are the voices within the request.
Token (control_code=ControlCode (key="voice", value="samantha")),
Token (text = "Hello, this is Samantha."),
Token (control_code=ControlCode (key="voice", value="tom")),
Token (text = "Hello, this is Tom.")
If you specify a voice in another language, the text is spoken in that language.
Token (control_code=ControlCode (key="voice", value="samantha")),
Token (text = "Hello, this is Samantha."),
Token (control_code=ControlCode (key="voice", value="aurelie")),
Token (text = "Bonjour, je m'appelle Aurelie.")
vol
The vol code changes the volume as a percentage of maximum volume. Values are from 0 (silent) to 100 (maximum volume). The default is typically 80.
Token (text = "I can "),
Token (control_code=ControlCode (key="vol", value="10")),
Token (text = "speak rather quietly,"),
Token (control_code=ControlCode (key="vol", value="90")),
Token (text = "but also very loudly.")
wait
The wait code specifies the end-of-sentence pause duration. Values are from 0 to 9, where the pause is 200 milliseconds multiplied by the value.
Token (control_code=ControlCode (key="wait", value="2")),
Token (text = "There will be a short wait period after this sentence."),
Token (control_code=ControlCode (key="wait", value="9")),
Token (text = "This sentence will be followed by a long wait. Did you notice the difference? ")
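The pause length follows directly from the rule above: 200 milliseconds multiplied by the wait value. The sketch below works through that arithmetic for the two values in the example; the function name wait_pause_ms is hypothetical and is not part of the TTSaaS API.

```python
def wait_pause_ms(value: int) -> int:
    """Return the end-of-sentence pause, in milliseconds, for a wait value."""
    if not 0 <= value <= 9:
        raise ValueError("wait value must be between 0 and 9")
    return 200 * value


for value in (2, 9):
    print(f"wait={value}: {wait_pause_ms(value)} ms")
```

So the first sentence in the example is followed by a 400 millisecond pause, and the sentences after the second wait code by an 1800 millisecond pause.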