Control sequences
A control sequence is a piece of text that is not read out, but instead affects how other text is spoken, or that performs a specific task. By using control sequences, you can acquire full control over the pronunciation of the input text.
For example, you can use a control sequence to tell Vocalizer to speak a particular word in your text more loudly than the others, or to insert a bookmark that will appear in your application logs. See also Natural language understanding.
Vocalizer supports two types of control sequences:
- Native control sequences: Vocalizer supports a proprietary syntax for control sequences, which is described in the appropriate HTML format Language Supplement found in the VOCALIZER_SDK\doc\languages directory.
- SSML markup: Vocalizer accepts SSML (Speech Synthesis Markup Language) elements in an XML document.
Native control sequences
All native control sequences follow this general syntax notation:
<ESC> \ parameter = value \
In this notation:
- <ESC> normally represents the escape character "\x1B" (ASCII 27, hex 1B).
- parameter is the name of the control parameter that the control sequence affects.
- value is the value you want to assign to the control parameter.
For example, you can insert a half-second pause in your text with the pause parameter:
Welcome to our phone system. <ESC>\pause=500\ How may I help you?
A value that is set with a control sequence remains active until another control sequence sets a new value, or until the end of the input text.
Place control sequences that affect the pronunciation of a word outside that word. A control sequence entered inside a word breaks it into two words.
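As a rough illustration of the syntax above, the following Python sketch assembles native control sequences as strings. The helper name `control` is hypothetical and not part of any Vocalizer API; it only concatenates the escape character, the parameter name, and the value.

```python
ESC = "\x1b"  # the escape character, ASCII 27 (hex 1B)


def control(parameter, value=None):
    """Return a native control sequence such as <ESC>\\pause=500\\.

    Hypothetical helper for illustration; it only formats a string.
    """
    if value is None:
        return f"{ESC}\\{parameter}\\"
    return f"{ESC}\\{parameter}={value}\\"


# Place the sequence between words, never inside a word.
text = f"Welcome to our phone system. {control('pause', 500)} How may I help you?"
```

The resulting string can then be passed to whatever interface your application uses to submit text for synthesis.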
It is possible to use native control sequences within an SSML document. However, this is difficult because the default <ESC> escape character for native control sequences is forbidden in XML documents. For more on using native control sequences within an XML document, see Defining an alternative escape sequence.
SSML markup
The SSML markup language includes several elements that have the same effect as native Vocalizer control sequences. For example, you can insert a half-second pause using the SSML <break> element:
Welcome to our phone system. <break time="500ms"/> How may I help you?
See the SSML Specification for a list of SSML elements, and Vocalizer SSML support for details on how Vocalizer supports them.
You cannot use SSML outside an XML document.
Using native control sequences within SSML can lead to unexpected behavior, so test them carefully. See Native control sequences.
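The following Python sketch shows why the default <ESC> character is a problem inside SSML: a standard XML parser rejects it. The wrapper uses the `<speak>` root element and namespace from the W3C SSML specification; the helper names are hypothetical, and the exact document header Vocalizer expects is described in Vocalizer SSML support, not here.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def ssml_document(body, lang="en-US"):
    """Wrap an SSML fragment in a minimal <speak> document."""
    return (
        '<?xml version="1.0"?>'
        f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang="{lang}">{body}</speak>'
    )


def is_well_formed(document):
    """Return True if the document parses as XML."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


ok = ssml_document('Welcome to our phone system. <break time="500ms"/> How may I help you?')
bad = ssml_document("Welcome. \x1b\\pause=500\\ How may I help you?")
print(is_well_formed(ok))   # prints True
print(is_well_formed(bad))  # prints False: ASCII 27 is not a legal XML character
```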
Defining an alternative escape sequence
Under some circumstances you may be unable to use the <ESC> escape sequence, for example when including a native control sequence in an SSML document. You may also wish to supplement <ESC> with an alternative sequence of your own.
To define an alternative sequence, use <escape_sequence> in Management Station. For example, to define three hashmarks (###) as the escape sequence:
<escape_sequence>###</escape_sequence>
You can then use this new sequence instead of <ESC>:
Welcome to our phone system. ###\pause=500\ How may I help you?
The alternative sequence must be a Perl 5 compatible regular expression. This means that any special characters in the sequence, such as a period (.), pipe (|), question mark (?), and so on, must be escaped using a backslash character (\).
Be careful to avoid using characters that might appear in your input text, otherwise you may inadvertently create an extra escape sequence. For example, "\$" would be a bad choice if your input text might include "$". This example creates an unwanted escape sequence:
\$volume=80\$50 has been transferred to your savings account
The alternative sequence supplements <ESC> rather than replacing it, so you can still use <ESC> for control sequences in non-XML documents.
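A minimal Python sketch of the two checks described above: escaping regular expression special characters in a candidate sequence, and testing whether the candidate already occurs in sample input text. Python's `re` module stands in for a Perl 5 compatible engine here, which is an assumption that holds for simple literal sequences like these. Both function names are hypothetical.

```python
import re

# Characters treated as special in Perl-style regular expressions.
_SPECIAL = set(".^$*+?()[]{}|\\")


def escape_for_regex(literal):
    """Backslash-escape regex special characters in a literal sequence."""
    return "".join("\\" + ch if ch in _SPECIAL else ch for ch in literal)


def collides_with_text(literal, sample_text):
    """Return True if the literal sequence already occurs in the sample text."""
    return re.search(escape_for_regex(literal), sample_text) is not None


print(escape_for_regex("###"))  # prints ###
print(escape_for_regex("$"))    # prints \$
print(collides_with_text("$", "$50 has been transferred to your savings account"))  # prints True
```

Running the collision check over a representative sample of your prompts before choosing a sequence is one way to catch cases like the "$" example.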
Control sequence tasks
The following table summarizes the tasks you can achieve using control sequences, and whether they are supported in native or SSML format:
Inserting a digital audio recording
Use this control sequence to insert a digital audio recording at a specific location in the text.
The control sequence <ESC>\audio="path"\ inserts the recording specified by path, a URI or local file system path. For example, the following sequence plays the audio recording found at c:\recordings\beep.wav:
Say your name at the beep. <ESC>\audio="c:\recordings\beep.wav"\
Vocalizer supports inserting headerless, WAV format, AU format, or NIST SPHERE format audio files that contain mulaw, alaw, or linear 16-bit PCM samples. If the recording sampling rate does not match the current voice, Vocalizer resamples it before inserting it in the speech output.
The SSML equivalent of this control sequence is the <audio> element:
Say your name at the beep. <audio src="c:\recordings\beep.wav"/>
The control sequence can also accept alternate text for an audio recording. For example, the following sequence specifies for Vocalizer to read "beep sound", instead of playing the audio recording at c:\recordings\beep.wav:
<ESC>\audio="c:\recordings\beep.wav":"beep sound"\
Vocalizer uses the alternate text "beep sound" when the audio recording is unavailable or incompatible, such as an unsupported file format. Changes in rate, volume, and pitch on the alternate text are audibly the same as normal input text.
Note: Vocalizer extracts the alternate text from the control sequence without the surrounding double quotes. If the alternate text contains a double quote character, escape it as \".
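The following Python sketch formats the audio control sequence with optional alternate text, escaping double quotes as the note on alternate text describes. The function name `audio_sequence` is hypothetical, and the sketch only builds the string; it does not check that the file exists or has a supported format.

```python
ESC = "\x1b"  # the escape character, ASCII 27


def audio_sequence(path, alt_text=None):
    """Format <ESC>\\audio="path"\\ with optional alternate text."""
    sequence = f'{ESC}\\audio="{path}"'
    if alt_text is not None:
        # Only the double quote is escaped, following the note on alternate text.
        escaped = alt_text.replace('"', '\\"')
        sequence += f':"{escaped}"'
    return sequence + "\\"


prompt = "Say your name at the beep. " + audio_sequence(
    r"c:\recordings\beep.wav", "beep sound"
)
```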
Inserting an ActivePrompt
Use this control sequence to explicitly insert an ActivePrompt at a specific location in the text.
For example:
<ESC>\prompt=banking::confirm_account_number\ 238773?
ActivePrompts are explained in Tuning TTS output with ActivePrompts.
This control sequence has no equivalent in SSML.
Activating implicit matching for an ActivePrompt domain
Use this control sequence to activate implicit matching for an ActivePrompt domain starting at a specific location in the text. If no domain value is specified (for example, <ESC>\domain\), the most recently activated domain is activated.
For example:
<ESC>\domain=banking\Is your account number 238773?
The SSML equivalent of this control sequence consists of using the ssft-domaintype attribute within a <p> or <s> element:
<s ssft-domaintype="banking">Is your account number 238773?</s>
ActivePrompts are explained in Tuning TTS output with ActivePrompts.
Inserting phonetic input
Use this control sequence to clarify the phonetic value of words to ensure they are pronounced correctly. The sequence is useful for words whose spelling deviates from the pronunciation rules of a given language. For example, use it for foreign words or acronyms unknown to the system.
The phonetic input string is composed of symbols of the L&H+ phonetic alphabet, a Nuance specific alphabet that can be conveniently entered from a keyboard. See the Language Supplement for the subset of the L&H+ Phonetic Alphabet relevant for each language, along with examples to help you construct proper phonetic text. However, the following general information applies across all languages, with American English examples to illustrate proper use.
The SSML equivalent of this control sequence is the <phoneme> element:
Would you like a <phoneme alphabet="ipa" ph="təˈme͡ɪ.to͡ʊ">
tomato</phoneme>?
Use the control sequence <ESC>\toi=lhp\ to mark the beginning of a piece of phonetic text (switch to L&H+ phonetic input mode), and <ESC>\toi=orth\ to mark the end (switch back to orthographic input mode).
This control sequence can be extended as <ESC>\toi=lhp:"orthographic_text"\, where orthographic_text supplies the orthographic (plain text) equivalent of the phonetic fragment. The orthographic equivalent can be any Unicode string, but a double quote (") has to be escaped as (\"), a backslash has to be escaped as a double backslash (\\), and the string cannot contain other <ESC> sequences or SSML markup. The orthographic text may be used by Vocalizer for cross-word effects in some languages, but typically is used as an application comment. This is similar to SSML <phoneme> where the "ph" attribute specifies the phonetic input and the <phoneme> element’s content supplies the orthographic alternative. However, unlike SSML <phoneme>, if Vocalizer encounters invalid symbols in the <ESC>\toi=lhp\ phonetic fragment, it drops those symbols, rather than falling back to the orthographic equivalent.
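The escaping rules for the orthographic equivalent can be sketched in Python as follows. The function name `phonetic` is hypothetical; it wraps an L&H+ string in the start and end sequences and escapes backslashes and double quotes in the optional orthographic text. It does not validate the phonetic symbols themselves.

```python
ESC = "\x1b"  # the escape character, ASCII 27


def phonetic(lhp_text, orthographic=None):
    """Wrap L&H+ phonetic text in <ESC>\\toi=lhp\\ ... <ESC>\\toi=orth\\."""
    start = f"{ESC}\\toi=lhp"
    if orthographic is not None:
        # Escape backslashes first so the backslash added for quotes is not doubled.
        escaped = orthographic.replace("\\", "\\\\").replace('"', '\\"')
        start += f':"{escaped}"'
    return f"{start}\\ {lhp_text} {ESC}\\toi=orth\\"


fragment = phonetic("'sI.l$.b$l", "syllable")
```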
In addition to the L&H+ phonetic symbols in the Language Supplement, use the following characters to clarify the pronunciation of the phonetic input string:
L&H+ symbol | Meaning | Example
' (ASCII 39, Hex 27) | Primary word stress | <ESC>\toi=lhp:"record"\ R+I.'kOR+d <ESC>\toi=orth\ (the verb “record”) versus <ESC>\toi=lhp:"record"\ 'R+E.kOR+d <ESC>\toi=orth\ (the noun “record”)
'2 | Secondary word stress | <ESC>\toi=lhp:"explanation"\ '2Ek.spl$.'ne&I.S$n<ESC>\toi=orth\ (“explanation”)
" (ASCII 34, Hex 22) | Sentence accent | <ESC>\toi=lhp\DER+_AR+_"tu_"@k.sEnts_?In_DI_'sEn.t$ns <ESC>\toi=orth\ (“There are TWO ACCENTS in this sentence”)
. | Syllable boundary | <ESC>\toi=lhp:"syllable"\ 'sI.l$.b$l <ESC>\toi=orth\ (“syllable”)
# | Silence (pause) | <ESC>\toi=lhp\?a&I_"sEd#do&Unt_"du_It <ESC>\toi=orth\ (“I said: don’t do it.”)
Punctuation marks remain useful within phonetic input to assure correct intonation. Each punctuation mark must be preceded by an asterisk.
L&H+ symbol | Meaning
_ | Word delimiter
*. | End of declarative
*, | Comma
*! | End of exclamation
*? | End of question
*; | Semicolon
*: | Colon
For example:
<ESC>\toi=lhp\ "jEs.t$.de&I*,_De&I_'lEft_"?E0.li*. <ESC>\toi=orth\
(“Yesterday, they left early.”)
Lexical stress and sentence accents can be indicated in phonetic strings by using a single quote (') or double quote (") respectively. Vocalizer automatically converts all lexical stress marks into sentence accents if the phonetic input doesn’t contain any sentence accents.
Note that manually specified lexical stress marks and sentence accents do not always take effect, because the synthesis module sometimes needs to override the requested stress or accent.
For example:
<ESC>\toi=lhp\If_D$_'wE.D$R+_Is_fa&In_t$.'mA.R+o&U*,_wi_wIl_liv_fOR+_nu.'jOR+k*.<ESC>\toi=orth\
(“If the weather is fine tomorrow, we will leave for New York.”)
If the phonetic input contains at least one manually added sentence accent, no additional sentence accents are assigned by Vocalizer. Therefore, only those words marked with a double quote (") will get a sentence accent. As a consequence, input containing only one manual sentence accent produces an almost flat intonation on all the other words.
For example:
<ESC>\toi=lhp\If_D$_wE.D$R+_Is_fa&In_t$."mA.R+o&U*,_wi_wIl_liv_fOR+_nu.jOR+k*.<ESC>\toi=orth\
(Only one sentence accent is realized in, “If the weather is fine tomorrow, we will leave for New York.”)
Phonetic input can also be combined with orthographic input. If no sentence accents are found in the input text (indicated by <ESC>\sent_accent\ in orthographic input, or by " in phonetic input), Vocalizer automatically assigns sentence accents. In the orthographic part of the input, Vocalizer realizes these sentence accents on the basis of part-of-speech and syntactic information. In the phonetic part of the input, all lexical stress marks (if any) are converted into sentence accents. If there are no lexical stress marks, no sentence accent will be realized for the phonetic part of the input. If the user has manually specified one or more sentence accents, no additional sentence accents are realized.
For example:
If the weather is fine tomorrow, we will leave for <ESC>\toi=lhp:"New York"\nu.'jOR+k <ESC>\toi=orth\.
(No sentence accents are found; Vocalizer automatically assigns sentence accents.)
If the weather is fine tomorrow, we will leave for <ESC>\toi=lhp:"New York"\nu."jOR+k <ESC>\toi=orth\.
(A sentence accent is specified in the phonetic part of the input text. No additional sentence accents will be realized.)
If the weather is <ESC>\sent_accent\fine tomorrow, we will leave for <ESC>\toi=lhp:"New York"\nu.jOR+k<ESC>\toi=orth\.
(A sentence accent is specified in the orthographic part of the input text. No additional sentence accents will be realized.)
If the weather is <ESC>\sent_accent\fine tomorrow, we will leave for <ESC>\toi=lhp:"New York"\nu."jOR+k<ESC>\toi=orth\.
(Two sentence accents are specified; no additional sentence accents will be realized.)
Inserting Pinyin input for Chinese languages
Use this control sequence to insert Pinyin input for Chinese languages. Pinyin is a Romanized form that represents Chinese ideographs using Latin letters and numbers. Use the <ESC>\toi=pyt\ control sequence at the beginning of the Pinyin input, and <ESC>\toi=orth\ at the end (to restore orthographic input mode).
To ensure correct output, mark the Latin regions using <ESC>\lang=latin\ at the beginning and <ESC>\lang=normal\ at the end.
For example:
<ESC>\toi=pyt\ zhe4li5 miao2 shu4 le5 wo3 gong1 si1 <ESC>\lang=latin\ TTS<ESC>\lang=normal\ xi4 tong3 dui4 tuo1 li2 xu4 lie4 de5 zhi1 chi2. <ESC>\toi=orth\
This control sequence has no equivalent in SSML.
Inserting diacritized input for Arabic
Use this control sequence to pronounce Arabic input according to the rules of Arabic orthography. By default, Vocalizer assumes undiacritized orthographic input. Use the control sequence <ESC>\toi=diacritized\ to insert diacritized input, and <ESC>\toi=orth\ at the end of that input to restore orthographic input mode.
For example, undiacritized orthographic input:
Using diacritized input:
<ESC>\toi=diacritized\
<ESC>\toi=orth\
This control sequence has no equivalent in SSML.
Inserting a pause
Use this control sequence to insert a pause of a specified duration at a specific location in the text. For example:
His name is <ESC>\pause=300\ Michael.
The control sequence <ESC>\pause=dur_ms\ inserts a pause of dur_ms milliseconds; the supported range is 1–65535 msec.
The SSML equivalent of this control sequence is the <break> element:
<speak>His name is <break time="300ms"/> Michael.</speak>
The default duration of a pause is 200 milliseconds, inserted as follows:
<ESC>\pause\
To prevent a pause between phrases, specify 0. This example reads the sentence without a pause before saying "Michael":
His name is: <ESC>\pause=0\ Michael.
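A small Python sketch of the pause sequence and its SSML counterpart, using the range stated above. The function names are hypothetical. The sketch accepts 0 as well as 1 to 65535, because 0 is the documented way to suppress a pause.

```python
ESC = "\x1b"  # the escape character, ASCII 27
MAX_PAUSE_MS = 65535


def pause(duration_ms=None):
    """Return a native pause sequence; None requests the default pause."""
    if duration_ms is None:
        return f"{ESC}\\pause\\"
    if not 0 <= duration_ms <= MAX_PAUSE_MS:
        raise ValueError(f"pause must be between 0 and {MAX_PAUSE_MS} ms")
    return f"{ESC}\\pause={duration_ms}\\"


def ssml_break(duration_ms):
    """Return the SSML <break> element for the same duration."""
    return f'<break time="{duration_ms}ms"/>'


native = f"His name is {pause(300)} Michael."
ssml = f"<speak>His name is {ssml_break(300)} Michael.</speak>"
```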
Guiding text normalization
Use this control sequence to guide the text normalization processing step. The control sequence <ESC>\tn=<type>\ is the equivalent of the SSML <say-as> element. (For details on all the supported text normalization types and supported input formats, see the Language Supplement for each language.)
If the text within the control sequence doesn’t match a supported input format, Vocalizer spells the content. Vocalizer supports a broad range of input formats, but you must still take care with the input format. Always specify <ESC>\tn=normal\ to mark the end of the text block that uses the specialized text normalization type.
Some common text normalization (TN) types are listed below. These types can also be used in SSML via <say-as>, where the TN type is specified using the interpret-as attribute, as indicated in these examples.
TN type | Use | Native example | SSML example
address | Address reading | <ESC>\tn=address\Apt. 7-12, 28 N. Whitney St., Saint Augustine Beach, FL 32084-6715<ESC>\tn=normal\ | <say-as interpret-as="address">Apt. 7-12, 28 N. Whitney St., Saint Augustine Beach, FL 32084-6715</say-as>
alphanumeric | Alias of spell:alphanumeric | |
boolean | Alias of vxml:boolean | |
cardinal | Alias of number | |
characters | Alias of spell:alphanumeric | |
currency | Currency reading | <ESC>\tn=currency\12USD<ESC>\tn=normal\ | <say-as interpret-as="currency">12USD</say-as>
date | Date reading | <ESC>\tn=date\12/3/1995<ESC>\tn=normal\ | <say-as interpret-as="date">12/3/1995</say-as>
digits | Alias of spell:alphanumeric | |
name | Proper name reading | <ESC>\tn=name\Care Telecom Ltd<ESC>\tn=normal\ | <say-as interpret-as="name">Care Telecom Ltd</say-as>
ordinal | Ordinal number reading | <ESC>\tn=ordinal\12th<ESC>\tn=normal\ | <say-as interpret-as="ordinal">12th</say-as>
phone | Telephone number reading | <ESC>\tn=phone\1-800-688-0068<ESC>\tn=normal\ | <say-as interpret-as="phone">1-800-688-0068</say-as>
raw | Block expansions of abbreviations and acronyms | <ESC>\tn=raw\app.<ESC>\tn=normal\ | <say-as interpret-as="raw">app.</say-as>
scope | Activate a dictionary based on a scope, where the "scope" is any TN type | <ESC>\tn=biking\brevet<ESC>\tn=normal\ | <say-as interpret-as="biking">brevet</say-as>
sms | Short message service (SMS) reading | <ESC>\tn=sms\CU (-:<ESC>\tn=normal\ | <say-as interpret-as="sms">CU (-:</say-as>
spell | Alias of spell:strict | |
spell:alphanumeric | Spell alphanumeric characters except for white space and punctuation | <ESC>\tn=spell:alphanumeric\a34y<ESC>\tn=normal\ | <say-as interpret-as="spell" format="alphanumeric"> a34y</say-as>
spell:strict | Spell all characters including white space and punctuation | <ESC>\tn=spell:strict\ a34y-347<ESC>\tn=normal\ | <say-as interpret-as="spell" format="strict">a34y-347</say-as>
state | (Not all languages.) State, city, and province names and abbreviations reading | <ESC>\tn=state\ FL<ESC>\tn=normal\ | <say-as interpret-as="state">FL</say-as>
streetname | (Not all languages.) Street name and abbreviation reading | <ESC>\tn=streetname\ Emerson Rd.<ESC>\tn=normal\ | <say-as interpret-as="streetname">Emerson Rd.</say-as>
streetnumber | (Not all languages.) Street number reading | <ESC>\tn=streetnumber\11001-11010<ESC>\tn=normal\ | <say-as interpret-as="streetnumber">11001-11010</say-as>
telephone | Alias of phone | |
time | Time of day reading | <ESC>\tn=time\10:00<ESC>\tn=normal\ | <say-as interpret-as="time">10:00</say-as>
vxml:boolean | VoiceXML 2.0 defined type for boolean input | <ESC>\tn=vxml:boolean\true<ESC>\tn=normal\ | <say-as interpret-as="vxml:boolean">true</say-as>
vxml:currency | VoiceXML 2.0 defined type for currencies | <ESC>\tn=vxml:currency\EUR15.23<ESC>\tn=normal\ | <say-as interpret-as="vxml:currency">EUR15.23</say-as>
vxml:date | VoiceXML 2.0 defined type for dates | <ESC>\tn=vxml:date\20100102<ESC>\tn=normal\ | <say-as interpret-as="vxml:date">20100102</say-as>
vxml:digits | VoiceXML 2.0 defined type for digit sequences | <ESC>\tn=vxml:digits\20051225<ESC>\tn=normal\ | <say-as interpret-as="vxml:digits">20051225</say-as>
vxml:number | VoiceXML 2.0 defined type for numbers | <ESC>\tn=vxml:number\+15243.1235<ESC>\tn=normal\ | <say-as interpret-as="vxml:number">+15243.1235</say-as>
vxml:phone | VoiceXML 2.0 defined type for telephone numbers | <ESC>\tn=vxml:phone\7815655000<ESC>\tn=normal\ | <say-as interpret-as="vxml:phone">7815655000</say-as>
vxml:time | VoiceXML 2.0 defined type for time strings | <ESC>\tn=vxml:time\0100a<ESC>\tn=normal\ | <say-as interpret-as="vxml:time">0100a</say-as>
zip | (American English only.) ZIP codes | <ESC>\tn=zip\01803<ESC>\tn=normal\ | <say-as interpret-as="zip">01803</say-as>
Where:
- address: Provides optimal reading for complete postal addresses. Do not include the addressee portion (name portion) in the address to avoid undesired expansions of name specific abbreviations. Instead, include the name portion in a separate <ESC>\tn=name\ section prior to the <ESC>\tn=address\.
- name: Gives correct reading of names, including personal names with roman numerals, such as Pius IX (read as "Pius the ninth"), John I ("John the first"), and Richard III ("Richard the third"). The name must be capitalized but the roman numeral may be in upper or lowercase (III or iii). Do not add a punctuation mark immediately following the roman numeral.
Some examples:
<ESC>\tn=name\Care Telecom Ltd<ESC>\tn=normal\
<ESC>\tn=name\Pope Pius IX<ESC>\tn=normal\ lived in the 19th century.
I'm talking about <ESC>\tn=name\King Richard III<ESC>\tn=normal\. He lived in the 15th century.
- normal: Ends a text fragment that uses a specialized TN type. Tag the end of the fragment with <ESC>\tn=normal\.
Some examples:
<ESC>\tn=address\ 244 Perryn Rd
Ithaca, NY 14850<ESC>\tn=normal\
That’s spelled <ESC>\tn=spell\Ithaca<ESC>\tn=normal\
<ESC>\tn=sms\ Carlo, can u give me a lift 2 Helena's house 2nite? David <ESC>\tn=normal\
- raw: Provides a more literal reading of the text, such as blocking an undesired abbreviation expansion. <ESC>\tn=raw\ operates on the abbreviations and acronyms as listed in each Language Supplement, but may affect the surrounding text as well. For example, the <ESC>\tn=raw\ in the following text would also block recognition of "12/6" as a date:
<ESC>\tn=raw\ Wed. <ESC>\tn=normal\ 12/6
- spell: Vocalizer supports two Text Normalization (TN) types for spelling text:
<ESC>\tn=spell:alphanumeric\
<ESC>\tn=spell:strict\
- <ESC>\tn=spell:strict\ has the following behavior:
- All characters are spelled, including white space, special characters, and punctuation marks.
- Characters with diacritics are pronounced as such. (For example, ú is spoken as “u with acute accent.”)
- “Upper case” is pronounced for upper case letters. (For example, “Abc” is spoken as “Upper case a, b, c.”)
- <ESC>\tn=spell:alphanumeric\ has the following behavior:
- All alphabetic and numeric characters are spelled. This excludes white space, special characters, and punctuation marks.
- Characters with diacritics are pronounced as such. (For example, ú is spoken as “u with acute accent.”)
- “Upper case” is pronounced for upper case letters. (For example, “Abc” is spoken as “Upper case a, b, c.”)
- vxml: The vxml-prefixed TN types conform to the VoiceXML 2.0 specification. The vxml input formats are also handled by the non-vxml-prefixed counterparts. For example, <ESC>\tn=time\ covers all the input formats supported by <ESC>\tn=vxml:time\.
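The relationship between the native and SSML forms in the TN type table can be sketched in Python as follows. The function names are hypothetical. The mapping rule is inferred from the table's examples only: the spell types split into interpret-as and format attributes, while every other type, including the vxml-prefixed ones, is passed through as the interpret-as value.

```python
from xml.sax.saxutils import escape, quoteattr

ESC = "\x1b"  # the escape character, ASCII 27


def tn_native(tn_type, text):
    """Wrap text in <ESC>\\tn=type\\ ... <ESC>\\tn=normal\\."""
    return f"{ESC}\\tn={tn_type}\\{text}{ESC}\\tn=normal\\"


def tn_ssml(tn_type, text):
    """Return the <say-as> form of the same request."""
    if tn_type.startswith("spell:"):
        # spell:alphanumeric and spell:strict use a separate format attribute.
        interpret_as, fmt = tn_type.split(":", 1)
        attrs = f"interpret-as={quoteattr(interpret_as)} format={quoteattr(fmt)}"
    else:
        # Other types, including the vxml: ones, go into interpret-as unchanged.
        attrs = f"interpret-as={quoteattr(tn_type)}"
    # escape() protects &, < and > in the text so the SSML stays well-formed.
    return f"<say-as {attrs}>{escape(text)}</say-as>"


print(tn_ssml("spell:strict", "a34y-347"))
print(tn_ssml("vxml:time", "0100a"))
```

Closing every fragment with <ESC>\tn=normal\ inside the helper makes it harder to leave a specialized type active by accident.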
Using scopes to activate dictionaries
Use the control sequence <ESC>\tn=scope\ to activate a dictionary for a specific scope. The value of scope is any TN type, including any user-defined types you create.
When creating a dictionary with Vocalizer Studio, you define a scope by assigning a domain to that dictionary. When the dictionary is loaded, the scope is declared as a suffix to the MIME type. When your application supplies marked-up text to be spoken, the mark-up can activate that dictionary by referring to its scope: when the mark-up matches the language and scope of any loaded dictionary, Vocalizer consults that dictionary at runtime. Otherwise, Vocalizer ignores dictionaries that don't match the language and scope.
Imagine you have an English-speaking application for the sport of long-distance bicycling, and many of the technical descriptions use French words such as "brevet" and "randonneuring" with peculiar American pronunciations. You could create a user dictionary assigned to a "biking" domain. Example mark-up, in which the dictionary substitutes the words "randonneuring" and "brevet":
<ESC>\tn=biking\Welcome to the randonneuring hotline. Every brevet in the series begins on Thursday mornings.<ESC>\tn=normal\
For example, the dictionary might normalize the spoken text as "Welcome to the render nearing hotline. Every brevay in the series begins on Thursday mornings."
Inserting a bookmark
Use the <ESC>\mrk=name\ control sequence to mark a position in the input text. Vocalizer tracks this position throughout the TTS conversion. The bookmark name can be any text sequence. After synthesis, Vocalizer delivers a bookmark marker that refers to this position in the input text and the corresponding position in the audio output.
The use of this control sequence does not affect the speech output process.
Some examples:
This bookmark <ESC>\mrk=bookmark 1\ marks a reference point.
Another <ESC>\mrk=bookmark 2\ does the same.
The SSML equivalent of this control sequence is the <mark> element:
<speak>This bookmark <mark name="bookmark1"/> marks a reference point.
Another <mark name="bookmark2"/> does the same.</speak>
Changing the speaking rate
Use this control sequence to set the speaking rate to a specified value. The format is <ESC>\rate=level\ where level is between 50 (half the default rate) and 400 (four times the default rate), and 100 is the default speaking rate.
Example:
I can <ESC>\rate=150\ speed up the rate <ESC>\rate=75\ or slow it down.
The SSML equivalent is the rate attribute of the <prosody> element:
<speak>I can <prosody rate="+50%">speed up the rate</prosody>
<prosody rate="-25%">or slow it down</prosody></speak>
See Rate scale conversion.
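The pairing in the example above (150 with "+50%", 75 with "-25%") suggests a simple offset from 100. The Python sketch below encodes that reading as an assumption only; the authoritative mapping is the one given in Rate scale conversion, which may differ. The function names are hypothetical, and levels are expected to be integers.

```python
ESC = "\x1b"  # the escape character, ASCII 27


def check_rate(level):
    """Raise ValueError if the level is outside the documented 50-400 range."""
    if not 50 <= level <= 400:
        raise ValueError("rate level must be between 50 and 400")


def rate_native(level):
    """Return <ESC>\\rate=level\\ for an integer level."""
    check_rate(level)
    return f"{ESC}\\rate={level}\\"


def rate_ssml_percent(level):
    """Assumed linear mapping inferred from the example: 150 -> +50%, 75 -> -25%."""
    check_rate(level)
    return f"{level - 100:+d}%"


print(rate_ssml_percent(150))  # prints +50%
print(rate_ssml_percent(75))   # prints -25%
```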
For more precise results, experiment with different combinations of pitch, rate, and timbre. For example, you can create a more gender-neutral voice by assigning pitch and timbre to 80 or 90 for a female voice.
Changing the pitch
Use this control sequence to set the pitch to the specified level. The pitch code changes the speaking voice to sound deep (lower values) or thin (higher values).
The format is <ESC>\pitch=level\ where level is a value between 50 (lower pitch) and 200 (higher pitch), where 100 is typical. For example:
<ESC>\pitch=50\ I can speak with a deep voice, <ESC>\pitch=170\ but also very thin.
The SSML equivalent is the pitch attribute of the <prosody> element where the values are relative percentages of change:
<prosody pitch="-50%">I can speak rather deeply,</prosody>
<prosody pitch="+50%">but also very thinly.</prosody>
In SSML, you can set symbolic values instead of percentages: x-low, low, medium, high, x-high, and default.
For more precise results, experiment with different combinations of pitch, rate, and timbre. For example, you can create a more gender-neutral voice by assigning pitch and timbre to 80 or 90 for a female voice.
Changing the volume
Use this control sequence to set the volume to the specified level. The format is <ESC>\vol=level\ where level is a value between 0 (no volume) and 100 (the maximum volume), where 80 is typically the default volume. For example:
<ESC>\vol=10\ I can speak rather quietly, <ESC>\vol=90\ but also very loudly.
The SSML equivalent is the volume attribute of the <prosody> element:
<prosody volume="-50%">I can speak rather quietly,</prosody>
<prosody volume="+50%">but also very loudly.</prosody>
See Volume scale conversion.
Changing the timbre
Use this control sequence to make the timbre of the speaking voice sound older (lower values) or younger (higher values). You can use this feature on any voice.
The control sequence <ESC>\timbre=level\ sets the timbre to the specified level, where level is a percentage value between 50 and 200, where 100 is typical. For example:
<ESC>\timbre=180\ I can sound like this, <ESC>\timbre=50\ but also sound very different.
The SSML equivalent is the timbre attribute of the <prosody> element:
<prosody timbre="180">I can sound like this,</prosody>
<prosody timbre="50">but I can also sound very different.</prosody>
You can set symbolic values instead of percentages:
Symbolic value | Corresponding percentage
x-young | +35%
young | +20%
medium | 0%
default | 0%
old | -20%
x-old | -35%
For more precise results, experiment with different combinations of pitch, rate, and timbre. For example, you can create a more gender-neutral voice by assigning pitch and timbre to 80 or 90 for a female voice.
Setting the end-of-sentence pause duration
Use this control sequence to set an end-of-sentence pause duration (wait period). The format is <ESC>\wait=value\ where the value is between 0 and 9. The pause is that number multiplied by 200 milliseconds.
Examples:
<ESC>\wait=2\ There will be a short wait period after this sentence.
<ESC>\wait=9\ This sentence will be followed by a long wait period. Did you notice the difference?
This control sequence has no equivalent in SSML, although you can use the <break> element to set the length of pauses explicitly.
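The arithmetic behind the wait value is simple enough to show in a few lines of Python. The function names are hypothetical; the 200 millisecond unit and the 0 to 9 range come from the description above.

```python
ESC = "\x1b"  # the escape character, ASCII 27
WAIT_UNIT_MS = 200  # each wait step adds 200 milliseconds


def check_wait(value):
    """Raise ValueError if the value is outside the documented 0-9 range."""
    if not 0 <= value <= 9:
        raise ValueError("wait value must be between 0 and 9")


def wait_sequence(value):
    """Return <ESC>\\wait=value\\."""
    check_wait(value)
    return f"{ESC}\\wait={value}\\"


def wait_duration_ms(value):
    """Return the end-of-sentence pause, in milliseconds, for a wait value."""
    check_wait(value)
    return value * WAIT_UNIT_MS


print(wait_duration_ms(2))  # prints 400
print(wait_duration_ms(9))  # prints 1800
```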
Setting the spelling pause duration
Use this control sequence to set the inter-character pause. The format is <ESC>\spell=duration\ where the duration value is in milliseconds. For example:
The part code is <ESC>\tn=spell\<ESC>\spell=200\a134b<ESC>\tn=normal\
Note: The spelling pause duration does not affect the spelling done by <ESC>\readmode=char\, because that mode treats each character as a separate sentence. To adjust the spelling pause duration for <ESC>\readmode=char\, set the end of sentence pause duration using <ESC>\wait\ instead.
This control sequence has no equivalent in SSML.
Controlling end-of-sentence detection
Use this control sequence to control end of sentence detection. The format is <ESC>\eos=1\ and <ESC>\eos=0\ where the value of 1 forces a sentence break and 0 suppresses a sentence break. Optionally, use this sequence in conjunction with explicit read mode (which disables automatic end-of-sentence detection for a block of text). See Controlling the read mode.
For suppression, the sequence must appear immediately after the symbol that would normally trigger a break (such as after a period).
Examples:
Tom lives in the U.S. <ESC>\eos=1\ So does John.
180 Park Ave. <ESC>\eos=0\ Room 24
The SSML equivalent of this control sequence is the <s> (or <sentence>) element to force a sentence break, and a <break> with attribute strength set to "none" to suppress a break:
<s>Tom lives in the U.S.</s> <s>So does John.</s>
<s>180 Park Ave. <break strength="none"/> Room 24</s>
There is no SSML equivalent for the <ESC>\readmode=explicit_eos\ sequence. SSML lets you force or suppress a sentence break, but does not allow you to activate explicit end-of-sentence mode.
Setting the textual context explicitly
Use this control sequence to indicate where a piece of text sits within the sentence, so that Vocalizer can adjust the intonation appropriately. The format is <ESC>\prosody=position\ where position is one of these:
Prosody position value | Description
<ESC>\prosody=medial\ | Mark for middle of phrase
<ESC>\prosody=phrase-break\ | Mark for phrase boundary
<ESC>\prosody=sentence-break\ | Mark for sentence boundary
For example, this markup identifies the date as being preceded by a carrier phrase and followed by a sentence boundary:
<ESC>\prosody=medial\<ESC>\tn=date\ 2011-07-04
<ESC>\tn=normal\<ESC>\prosody=sentence-break\
With SSML, use the detail attribute in the <say-as> element:
detail value | Preceded by | Followed by
prosody-start-medial-end-medial | Carrier phrase | Carrier phrase
prosody-start-medial-end-phrase | Carrier phrase | Phrase boundary
prosody-start-medial-end-sentence | Carrier phrase | Sentence boundary
prosody-start-phrase-end-medial | Phrase boundary | Carrier phrase
prosody-start-phrase-end-phrase | Phrase boundary | Phrase boundary
prosody-start-phrase-end-sentence | Phrase boundary | Sentence boundary
prosody-start-sentence-end-medial | Sentence boundary | Carrier phrase
prosody-start-sentence-end-phrase | Sentence boundary | Phrase boundary
prosody-start-sentence-end-sentence | Sentence boundary | Sentence boundary
The SSML equivalent to the above example identifies the date as being preceded by a carrier phrase and followed by a sentence boundary:
<say-as interpret-as="date" detail="prosody-start-medial-end-sentence"> 2011-07-04</say-as>
If the intonation pattern isn’t explicitly specified at the SSML level, Vocalizer uses intonation patterns implicitly provided by textual context. If the intonation pattern is explicitly specified at the SSML level, the detail attribute in the <say-as> element has priority over the textual context.
For example, the following SSML element implicitly considers the date inserted as being preceded by a carrier phrase and followed by sentence boundary:
The date is: <say-as interpret-as="date">2011-06-28</say-as>.
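The nine detail values follow one naming pattern, which the Python sketch below reproduces. The function names are hypothetical; the sketch only composes the attribute value from a start and an end context and does not model how Vocalizer applies it.

```python
POSITIONS = ("medial", "phrase", "sentence")


def prosody_detail(start, end):
    """Return the <say-as> detail value for the given start and end context."""
    if start not in POSITIONS or end not in POSITIONS:
        raise ValueError(f"positions must be one of {POSITIONS}")
    return f"prosody-start-{start}-end-{end}"


def say_as_date(text, start, end):
    """Return a date <say-as> element with an explicit prosody context."""
    detail = prosody_detail(start, end)
    return f'<say-as interpret-as="date" detail="{detail}">{text}</say-as>'


# Preceded by a carrier phrase, followed by a sentence boundary.
print(say_as_date("2011-07-04", "medial", "sentence"))
```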
Controlling the read mode
Use this control sequence to change the default reading mode. The format is <ESC>\readmode=mode\ where the value is one of these:
Read mode | Description
<ESC>\readmode=sent\ | Sentence mode (the default)
<ESC>\readmode=char\ | Character mode (similar to spelling)
<ESC>\readmode=word\ | Word-by-word mode
<ESC>\readmode=line\ | Line-by-line mode
<ESC>\readmode=explicit_eos\ | Explicit end-of-sentence mode (sentence breaks only where indicated by <ESC>\eos=1\)
Examples:
<ESC>\readmode=sent\ Please buy green apples. You can also get pears.
(This input is read sentence by sentence.)
<ESC>\readmode=char\ Apples
(The word “Apples” is spelled.)
<ESC>\readmode=word\ Please buy green apples.
(This sentence is read word by word.)
<ESC>\readmode=line\ Bananas
Low-fat milk
Whole wheat flour
(This input is read as a list, with a pause at the end of each line.)
<ESC>\readmode=explicit_eos\ Bananas. Low-fat milk. Whole wheat flour.
(This input is read as one sentence.)
This control sequence has no equivalent in SSML.
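When input text is assembled by an application, the read mode sequences can be produced by a small helper. The sketch below is a minimal illustration in Python, assuming the default <ESC> character (ASCII 27); the function names readmode and spell are hypothetical and not part of any Vocalizer API.

```python
ESC = "\x1b"  # default <ESC> character (ASCII 27, hex 1B)

# Read modes listed in the table above.
READ_MODES = ("sent", "char", "word", "line", "explicit_eos")


def readmode(mode: str) -> str:
    """Return the native control sequence that selects a read mode."""
    if mode not in READ_MODES:
        raise ValueError(f"unknown read mode: {mode!r}")
    return f"{ESC}\\readmode={mode}\\"


def spell(word: str) -> str:
    """Spell one word in character mode, then return to sentence mode."""
    return f"{readmode('char')} {word} {readmode('sent')}"


# Control sequences are placed outside the word so they do not split it.
text = "The product code is " + spell("Apples") + " Thank you."
print(repr(text))
```

Switching back to sent after the spelled word matters because a value set by a control sequence stays active until another sequence changes it or the input ends.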
Changing the voice
Use this control sequence to change the speaking voice and force a sentence break. The format is <ESC>\voice=voice_name\ where the value is any installed voice. For example:
<ESC>\voice=samantha\ Hello, this is Samantha.
<ESC>\voice=tom\ Hello, this is Tom.
The SSML equivalent of this control sequence is the <voice> element:
<voice name="Samantha">Hello, this is Samantha.</voice>
<voice name="Tom">Hello, this is Tom.</voice>
To use this control sequence successfully, you must have more than one voice installed. If you do not have the requested voice installed, Vocalizer flags a warning and does its best to carry on. In this example, if Samantha is installed, but Tom is not, Vocalizer synthesizes, “Hello, this is Tom,” in Samantha’s voice, and produces this debug message:
SEVERE 16123: TTSEG|Could not do a mid-synthesis voice switch, voice load failed, voice=Tom
Instead of naming a specific voice, the native control sequence can accept key-value pairs that select a voice by language or gender. The format is:
<ESC>\voice=key:value[,key:value]\
Where a key may be:
- lang—The three-letter code for a language (for example, ENU for American English). This may be "unknown".
- gender—A gender for the voice (male or female).
- ietf—The IETF code for a language (for example, en-US for American English).
Several key-value pairs may be included, using a comma or semi-colon as a separator between each pair. For example:
<ESC>\voice=(lang:unknown,gender:female)\
Vocalizer chooses a default voice that meets the specified key and value criteria, if such a voice is available.
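The key-value form can be assembled and checked before it is sent to the engine. The following Python sketch is illustrative only: voice_criteria is a hypothetical helper, and because the format line above omits parentheses while the example includes them, the sketch exposes that choice as a flag rather than assuming which form a given Vocalizer version expects.

```python
from typing import Mapping

ESC = "\x1b"  # default <ESC> character (ASCII 27)

# Keys described in the list above.
ALLOWED_KEYS = ("lang", "gender", "ietf")


def voice_criteria(criteria: Mapping[str, str], parenthesize: bool = False) -> str:
    """Build a voice control sequence from key:value selection criteria."""
    if not criteria:
        raise ValueError("at least one key:value pair is required")
    for key, value in criteria.items():
        if key not in ALLOWED_KEYS:
            raise ValueError(f"unsupported key: {key!r}")
        if key == "gender" and value not in ("male", "female"):
            raise ValueError(f"unsupported gender: {value!r}")
    # A comma is used as the separator between pairs.
    pairs = ",".join(f"{key}:{value}" for key, value in criteria.items())
    if parenthesize:
        pairs = f"({pairs})"
    return f"{ESC}\\voice={pairs}\\"


print(repr(voice_criteria({"lang": "unknown", "gender": "female"}, parenthesize=True)))
print(repr(voice_criteria({"ietf": "en-US"})))
```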
Labeling text for language identification
Use this control sequence to label text as being in an unknown language. Vocalizer then determines the language automatically with its built-in language identifier. This feature only works with languages that support language ID. See Using automatic language identification.
To label the text, use the following control sequence:
- Begin the string with: <ESC>\lang=unknown\
- End the string with: <ESC>\lang=normal\
(alternatively, simply end the input)
The automatic language identifier scope is enabled by default (set to user-defined). Use a Vocalizer configuration file to change the setting. If the scope is not enabled, Vocalizer ignores the control sequence.
Vocalizer identifies the language on a sentence-by-sentence basis within the text and switches the synthesis voice if necessary. Vocalizer restores the original synthesis voice at the next <ESC>\lang=normal\ or the end of the synthesis request.
Note: Vocalizer does not support specifying an explicit language name instead of "unknown".
Example:
Le titre de la chanson est : <ESC>\lang=unknown\In Between <ESC>\lang=normal\
The SSML equivalent of this control sequence is the xml:lang attribute, which is available for several SSML elements, including <p>, <speak>, and <s>:
<speak>Le titre de la chanson est:</speak>
<speak xml:lang="unknown">In Between</speak>
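The native labeling steps above amount to wrapping a span of text between two control sequences. A minimal Python sketch, assuming the default <ESC> character; the function name auto_language is hypothetical:

```python
ESC = "\x1b"  # default <ESC> character (ASCII 27)


def auto_language(text: str) -> str:
    """Label text as an unknown language so the language identifier runs on it."""
    return f"{ESC}\\lang=unknown\\{text}{ESC}\\lang=normal\\"


# Only the song title is handed to the automatic language identifier.
prompt = "Le titre de la chanson est : " + auto_language("In Between ")
print(repr(prompt))
```

The closing sequence is optional when the labeled span runs to the end of the input, but emitting it always keeps the helper safe to use in the middle of a longer prompt.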
Indicating a paragraph break
Use this control sequence to declare a paragraph break (which also implies a sentence break). The format is <ESC>\para\ with no value to specify.
Example:
Introduction to Vocalizer. <ESC>\para\ Vocalizer is a state-of-the-art text to speech system.
The SSML equivalent of this control sequence is the <p> (or <paragraph>) element:
<p>Introduction to Vocalizer.</p>
<p>Vocalizer is a state-of-the-art text to speech system.</p>
Resetting control sequences to the default
Use this control sequence to reset all parameters to the original settings at the start of synthesis. The format is <ESC>\rst\ with no value to specify. For example:
<ESC>\vol=10\ The volume is set to a low value. <ESC>\rst\ Now it is reset to its default value.
<ESC>\rate=10\ The rate is set to a low value. <ESC>\rst\ Now it is reset to its default value.
This control sequence has no equivalent in SSML.
Changing the speaking style
Use this control sequence to change the speaking style of the current voice. The format is <ESC>\style=value\ where the value is the name of a style. For example:
<ESC>\style=lively\This text would be read in lively style.
Different voices support different styles. Typical values are lively, neutral, formal, didactic, and apologetic. See Changing the speaking style. If you specify an unsupported value, there is no change in the speaking style.
To reset the speaking style to the default, specify the value default. For example:
<ESC>\style=default\This text would be read in default style of the voice.
This control sequence has no equivalent in SSML.
Controlling agreement of number, gender, and case
Use this control sequence to define the case, gender, and number of a word. This feature and its values can vary for each voice. It was first implemented for German Petra-ml xpremium-high and is being added to other voices over time. For details, see the Language Supplement for each voice you download. (If a supplement does not mention the feature, the voice does not support it.)
The format is <ESC>\agreement=features\ where the value is one or more key/value pairs separated by semi-colons. You can list the pairs in any order. The sequence applies to the next input word (Sekunde, in the following example):
<ESC>\agreement=gender:FEM;case:NOM;number:SING\1 Sekunde
Use this sequence (in languages and voices that support it) where words can vary by number, gender, or case depending upon the implied context where the words appear. Typically, this occurs when reading numeric values where the actual spoken numbers may change based on the context where they are used. When you explicitly define this context, the engine generates the correct word.
Feature | Value | Description
case | ACC | Accusative
case | DAT | Dative
case | GEN | Genitive
case | NOM | Nominative
gender | FEM | Feminine
gender | NEUT | Neuter
gender | MASC | Masculine
number | PLUR | Plural
number | SING | Singular
It's not necessary to specify all features. If you omit a feature, Vocalizer uses its normal processing algorithms.
Vocalizer ignores all specified features in the following situations:
- If the voice does not support the feature.
- If any value is incompatible with the input context. For example, 2 is plural even if the sequence declares number:SING.
- If there's any punctuation between the sequence and the target word.
- If any key/value pair is malformed.
This control sequence has no equivalent in SSML.
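Because Vocalizer ignores the whole sequence when any key/value pair is malformed, it can help to validate the pairs before building the text. The Python sketch below is an illustration under that assumption: the agreement helper is hypothetical, and the accepted values are the ones listed in the table above (actual support varies by voice).

```python
from typing import Mapping

ESC = "\x1b"  # default <ESC> character (ASCII 27)

# Features and values taken from the table above.
AGREEMENT_VALUES = {
    "case": ("ACC", "DAT", "GEN", "NOM"),
    "gender": ("FEM", "NEUT", "MASC"),
    "number": ("PLUR", "SING"),
}


def agreement(features: Mapping[str, str]) -> str:
    """Build an agreement control sequence from feature:value pairs."""
    if not features:
        raise ValueError("at least one feature is required")
    for key, value in features.items():
        if key not in AGREEMENT_VALUES:
            raise ValueError(f"unknown feature: {key!r}")
        if value not in AGREEMENT_VALUES[key]:
            raise ValueError(f"invalid value {value!r} for feature {key!r}")
    # Pairs are separated by semi-colons and may appear in any order.
    pairs = ";".join(f"{key}:{value}" for key, value in features.items())
    return f"{ESC}\\agreement={pairs}\\"


# The sequence is placed directly before the target word, with no
# punctuation in between.
text = agreement({"gender": "FEM", "case": "NOM", "number": "SING"}) + "1 Sekunde"
print(repr(text))
```

This check only catches malformed pairs. It cannot detect values that conflict with the input context, such as number:SING before the numeral 2.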