Synthesizer gRPC API
The Synthesizer gRPC API contains methods for requesting speech synthesis from TTSaaS, using standard and enhanced voices.
To use Microsoft neural voices, use Neural TTSaaS instead. See the Neural TTSaaS documentation: Synthesizer gRPC API for Neural TTSaaS.
Tip: Try out this API using a Sample synthesis client.
Proto file structure
The Synthesizer API is defined in the synthesizer.proto file.
└── nuance
├── rpc (RPC message files)
└── tts
├── storage
│ └── v1beta1
│ └── storage.proto
└── v1
└── synthesizer.proto
The proto file defines a Synthesizer service with three RPC methods: GetVoices, Synthesize, and UnarySynthesize.
Synthesizer
The Synthesizer service offers three methods related to voice synthesis.
Name | Request | Response | Description |
---|---|---|---|
GetVoices | GetVoicesRequest | GetVoicesResponse | Queries the list of available voices, with filters to reduce the search space. |
Synthesize | SynthesisRequest | SynthesisResponse stream | Synthesizes audio from input text and parameters, and returns an audio stream. |
UnarySynthesize | SynthesisRequest | UnarySynthesisResponse | Synthesizes audio and returns a single (unary) audio response. |
Streamed vs. unary response
TTSaaS offers two types of synthesis response: a streamed response in SynthesisResponse and a non-streamed response in UnarySynthesisResponse.
The request is the same in both cases: SynthesisRequest specifies a voice, the input text to synthesize, and optional parameters. The response can be either:
- SynthesisResponse: Returns one status message followed by multiple streamed audio buffers, each including the markers or other events specified in the request. Each audio buffer contains the latest synthesized audio.
- UnarySynthesisResponse: Returns one status message and one audio buffer, containing all the markers and events specified in the request. The underlying TTSaaS engine caps the audio response size.
See Run client for unary response to run the sample Python client with a unary response, activated by a command line flag.
One request, two possible responses (from proto file):
service Synthesizer {
rpc Synthesize(SynthesisRequest) returns (stream SynthesisResponse) {}
rpc UnarySynthesize(SynthesisRequest) returns (UnarySynthesisResponse) {}
. . .
message SynthesisRequest {
Voice voice = 1;
AudioParameters audio_params = 2;
Input input = 3;
EventParameters event_params = 4;
map<string, string> client_data = 5;
}
message SynthesisResponse {
oneof response {
Status status = 1;
Events events = 2;
bytes audio = 3;
}
}
message UnarySynthesisResponse {
Status status = 1;
Events events = 2;
bytes audio = 3;
}
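The streamed and unary methods are invoked differently from a client. The following is a minimal Python sketch of both calls, assuming stubs generated from synthesizer.proto; the module paths, server address, and authentication setup are placeholders for illustration, not definitive client code:
import grpc

# Assumed module names for the generated stubs; adjust to your build layout.
from nuance.tts.v1 import synthesizer_pb2, synthesizer_pb2_grpc

# Authentication (for example, an OAuth token as call credentials) is omitted.
channel = grpc.secure_channel("tts.example.com:443", grpc.ssl_channel_credentials())
stub = synthesizer_pb2_grpc.SynthesizerStub(channel)

request = synthesizer_pb2.SynthesisRequest(
    voice=synthesizer_pb2.Voice(name="Evan", model="enhanced"),
    input=synthesizer_pb2.Input(
        text=synthesizer_pb2.Text(text="Your coffee will be ready in 5 minutes")))

# Streamed: Synthesize returns an iterator of SynthesisResponse messages.
audio = b""
for response in stub.Synthesize(request):
    if response.HasField("audio"):
        audio += response.audio

# Unary: UnarySynthesize returns one UnarySynthesisResponse with all the audio.
unary_response = stub.UnarySynthesize(request)
audio = unary_response.audio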
GetVoicesRequest
Input message for Synthesizer: GetVoices, to query voices available to the client.
Field | Type | Description |
---|---|---|
voice | Voice | Optional. Filter the voices to retrieve. For example, set language to en-US to return only American English voices. |
The GetVoicesRequest message includes:
GetVoicesRequest
voice (Voice)
name
model
language
age_group (EnumAgeGroup)
gender (EnumGender)
sample_rate_hz
language_tlw
For example:
# This retrieves all American English voices
GetVoicesRequest (
voice = Voice (language = "en-us")
)
# This returns one named voice
GetVoicesRequest (
voice = Voice (name = "Evan")
)
# This returns all female American English voices
GetVoicesRequest (
voice = Voice (
gender = EnumGender.FEMALE,
language = "en-us"
)
)
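These requests are sent with the GetVoices method on the Synthesizer stub. The following Python sketch shows one way to issue the query and read the results; the generated module names are assumptions, and the channel setup is omitted:
from nuance.tts.v1 import synthesizer_pb2, synthesizer_pb2_grpc

def list_voices(channel, language=None):
    # channel is an authenticated grpc.Channel; creation is omitted here.
    stub = synthesizer_pb2_grpc.SynthesizerStub(channel)
    voice_filter = synthesizer_pb2.Voice()
    if language:
        voice_filter.language = language   # for example, "en-us"
    response = stub.GetVoices(synthesizer_pb2.GetVoicesRequest(voice=voice_filter))
    for voice in response.voices:
        print(voice.name, voice.model, voice.language, voice.sample_rate_hz)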
Voice
Input or output message for voices:
- In GetVoicesRequest, it filters the list of available voices.
- In GetVoicesResponse, it returns the list of available voices.
- In SynthesisRequest, it specifies the voice to use for the synthesis operation.
These fields are supported in all cases:
Field | Type | Description |
---|---|---|
name | string | The voice’s name, for example, Evan. Mandatory for SynthesisRequest. |
model | string | The voice’s quality model, for example, enhanced or standard. Mandatory for SynthesisRequest. |
These Voice fields are used only in GetVoicesRequest and GetVoicesResponse. They are ignored in SynthesisRequest.
Field | Type | Description |
---|---|---|
language | string | IETF language code, for example, en-US. Search for voices with a specific language. Some voices support multiple languages. |
age_group | EnumAgeGroup | Search for adult or child voices. |
gender | EnumGender | Search for voices with a certain gender. |
sample_rate_hz | uint32 | Search for a certain native sample rate. |
language_tlw | string | Three-letter language code (for example, enu for American English) for configuring language identification in Input. |
restricted | bool | Used only in GetVoicesResponse, to identify restricted voices (restricted true). These are custom voices available only to specific customers. Default is false, meaning the voice is public. |
version | string | Used only in GetVoicesResponse, to return the voice’s version. |
foreign_languages | string | Repeated. Used only in GetVoicesResponse, to return the foreign languages of a multilingual voice. |
The Voice message includes different fields depending on the context:
GetVoicesRequest
voice (Voice)
name
model
language
age_group (EnumAgeGroup)
gender (EnumGender)
sample_rate_hz
language_tlw
GetVoicesResponse
voice (Voice)
name
model
language
age_group (EnumAgeGroup)
gender (EnumGender)
sample_rate_hz
language_tlw
restricted
version
foreign_languages
SynthesisRequest
voice (Voice)
name
model
EnumAgeGroup
Input field for GetVoicesRequest or output field for GetVoicesResponse, specifying whether the voice uses its adult or child version, if available. Included in Voice.
Name | Number | Description |
---|---|---|
ADULT | 0 | Adult voice. Default for GetVoicesRequest. |
CHILD | 1 | Child voice. |
EnumGender
Input field for GetVoicesRequest or output field for GetVoicesResponse, specifying gender for voices that support multiple genders. Included in Voice.
Name | Number | Description |
---|---|---|
ANY | 0 | Any gender voice. Default for GetVoicesRequest. |
MALE | 1 | Male voice. |
FEMALE | 2 | Female voice. |
NEUTRAL | 3 | Neutral gender voice. |
GetVoicesResponse
Output message for Synthesizer: GetVoices. Includes a list of voices that matched the input criteria, if any.
Field | Type | Description |
---|---|---|
voices | Voice | Repeated. Voices and characteristics returned. |
The GetVoicesResponse message includes:
GetVoicesResponse
voice (Voice)
name
model
language
age_group (EnumAgeGroup)
gender (EnumGender)
sample_rate_hz
language_tlw
restricted
version
foreign_languages
This response to GetVoicesRequest returns all American English (en-us) voices:
2023-09-26 15:51:16,151 (139911033857856) INFO Iteration #1
2023-09-26 15:51:16,154 (139911033857856) DEBUG Creating secure gRPC channel
2023-09-26 15:51:16,161 (139911033857856) INFO Running file [flow.py]
2023-09-26 15:51:16,161 (139911033857856) DEBUG [voice {
language: "en-us"
}
]
2023-09-26 15:51:16,161 (139911033857856) INFO Sending GetVoices request
2023-09-26 15:51:16,367 (139911033857856) INFO voices {
name: "Allison"
model: "standard"
language: "en-US"
gender: FEMALE
sample_rate_hz: 22050
language_tlw: "enu"
version: "2.0.0"
}
voices {
name: "Ava-Ml"
model: "enhanced"
language: "en-US"
gender: FEMALE
sample_rate_hz: 22050
language_tlw: "enu"
version: "3.0.1"
foreign_languages: "es-MX"
}
voices {
name: "Chloe"
model: "standard"
language: "en-US"
gender: FEMALE
sample_rate_hz: 22050
language_tlw: "enu"
version: "5.2.3.15315"
}
voices {
name: "Chloe"
model: "standard"
language: "en-US"
gender: FEMALE
sample_rate_hz: 8000
language_tlw: "enu"
version: "5.2.3.15315"
}
voices {
name: "Erica"
model: "standard"
language: "en-US"
gender: FEMALE
sample_rate_hz: 22050
language_tlw: "enu"
restricted: true
version: "1.0.2"
}
voices {
name: "Erica"
model: "standard"
language: "en-US"
gender: FEMALE
sample_rate_hz: 8000
language_tlw: "enu"
restricted: true
version: "1.0.2"
}
voices {
name: "Evan"
model: "enhanced"
language: "en-US"
gender: MALE
sample_rate_hz: 22050
language_tlw: "enu"
version: "1.1.1"
}
voices {
name: "Evelyn"
model: "standard"
language: "en-US"
gender: FEMALE
sample_rate_hz: 22050
language_tlw: "enu"
version: "5.2.3.15114"
}
voices {
name: "Evelyn"
model: "standard"
language: "en-US"
gender: FEMALE
sample_rate_hz: 8000
language_tlw: "enu"
version: "5.2.3.15114"
}
voices {
name: "Nathan"
model: "enhanced"
language: "en-US"
gender: MALE
sample_rate_hz: 22050
language_tlw: "enu"
version: "4.1.1"
}
voices {
name: "Nolan"
model: "standard"
language: "en-US"
gender: MALE
sample_rate_hz: 22050
language_tlw: "enu"
version: "5.2.3.15315"
}
voices {
name: "Nolan"
model: "standard"
language: "en-US"
gender: MALE
sample_rate_hz: 8000
language_tlw: "enu"
version: "5.2.3.15315"
}
voices {
name: "Samantha"
model: "standard"
language: "en-US"
gender: FEMALE
sample_rate_hz: 22050
language_tlw: "enu"
version: "2.0.0"
}
voices {
name: "Susan"
model: "standard"
language: "en-US"
gender: FEMALE
sample_rate_hz: 22050
language_tlw: "enu"
version: "2.0.0"
}
voices {
name: "Tom"
model: "standard"
language: "en-US"
gender: MALE
sample_rate_hz: 22050
language_tlw: "enu"
version: "3.2.1"
}
voices {
name: "Zoe-Ml"
model: "enhanced"
language: "en-US"
gender: FEMALE
sample_rate_hz: 22050
language_tlw: "enu"
version: "2.0.0"
foreign_languages: "es-MX"
foreign_languages: "fr-CA"
}
2023-09-26 15:51:16,368 (139911033857856) INFO Done running file [flow.py]
2023-09-26 15:51:16,369 (139911033857856) INFO Iteration #1 complete
2023-09-26 15:51:16,369 (139911033857856) INFO Done
SynthesisRequest
Input message for Synthesizer: Synthesize. Specifies input text, audio parameters, and events to subscribe to, in exchange for synthesized audio. See Defaults for default values for optional fields.
Field | Type | Description |
---|---|---|
voice | Voice | Mandatory. The voice to use for audio synthesis. |
audio_params | AudioParameters | Output audio parameters, such as encoding and volume. Default is PCM audio at 22050 Hz. |
input | Input | Mandatory. Input text to synthesize, tuning data, etc. |
event_params | EventParameters | Markers and other info to include in events logged during synthesis. |
client_data | map<string,string> | Map of client-supplied key:value pairs to inject into the event log. |
user_id | string | Identifies a specific user within the application. |
The SynthesisRequest message includes:
SynthesisRequest
voice (Voice)
name
model
audio_params (AudioParameters)
audio_format (AudioFormat)
volume_percentage
speaking_rate_factor
audio_chunk_duration_ms
target_audio_length_ms
disable_early_emission
input (Input)
text (Text)
ssml (SSML)
tokenized_sequence (TokenizedSequence)
resources (SynthesisResource)
lid_params (LanguageIdentificationParameters)
download_params (DownloadParameters)
event_params (EventParameters)
send_sentence_marker_events
send_word_marker_events
send_phoneme_marker_events
send_bookmark_marker_events
send_paragraph_marker_events
send_visemes
send_log_events
suppress_input
client_data
user_id
This synthesis request includes most fields:
SynthesisRequest(
voice = Voice(
name = "Evan",
model = "enhanced"
),
audio_params = AudioParameters(
audio_format = AudioFormat(
pcm = PCM(sample_rate_hz = 22050)
),
volume_percentage = 80, # Default value
speaking_rate_factor = 1.0 # Default value
),
input = Input(
text = Text(
text = "Your coffee will be ready in 5 minutes")
),
event_params = EventParameters(
send_log_events = True,
suppress_input = True
),
client_data = {'company':'Aardvark Coffee','user':'Leslie'},
user_id = "leslie.somebody@aardvark.com"
)
This minimal synthesis request uses all defaults:
SynthesisRequest(
voice = Voice(
name = "Evan",
model = "enhanced"
),
input = Input(
text = Text(
text = "Your coffee will be ready in 5 minutes")
)
)
AudioParameters
Input message for audio-related parameters during synthesis, including encoding, volume, and audio length. Included in SynthesisRequest.
Field | Type | Description |
---|---|---|
audio_format | AudioFormat | Audio encoding. Default PCM 22050 Hz. |
volume_percentage | uint32 | Volume amplitude, from 0 to 100. Default 80. |
speaking_rate_factor | float | Speaking rate, from 0 to 2.0. Default 1.0. |
audio_chunk_duration_ms | uint32 | Maximum duration, in ms, of an audio chunk delivered to the client, from 1 to 60000. Default is 20000 (20 seconds). When this parameter is large enough (for example, 20 or 30 seconds), each audio chunk contains an audible segment surrounded by silence. |
target_audio_length_ms | uint32 | Maximum duration, in ms, of synthesized audio. When greater than 0, the server stops ongoing synthesis at the first sentence end, or silence, closest to the value. |
disable_early_emission | bool | By default, audio segments are emitted as soon as possible, even if they are not audible. This behavior may be disabled. |
The AudioParameters message includes:
SynthesisRequest
voice (Voice)
audio_params (AudioParameters)
audio_format (AudioFormat)
pcm (PCM)
alaw (Alaw)
ulaw (Ulaw)
ogg_opus (OggOpus)
opus (Opus)
volume_percentage
speaking_rate_factor
audio_chunk_duration_ms
target_audio_length_ms
disable_early_emission
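For example, a request could cap each streamed chunk at 2 seconds and ask the server to stop synthesis near the 10-second mark. A sketch in the same style as the other examples; the values are illustrative:
SynthesisRequest(
    voice = Voice(
        name = "Evan",
        model = "enhanced"
    ),
    audio_params = AudioParameters(
        audio_chunk_duration_ms = 2000,   # 2-second chunks
        target_audio_length_ms = 10000    # stop at the nearest sentence end or silence
    ),
    input = Input(
        text = Text(
            text = "Your coffee will be ready in 5 minutes")
    )
)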
AudioFormat
Input message for audio encoding of synthesized text. Included in AudioParameters.
Field | Type | Description |
---|---|---|
pcm | PCM | Signed 16-bit little endian PCM. |
alaw | ALaw | G.711 A-law, 8 kHz. |
ulaw | ULaw | G.711 Mu-law, 8 kHz. |
ogg_opus | OggOpus | Ogg Opus, 8 kHz, 16 kHz, or 24 kHz. |
opus | Opus | Opus, 8 kHz, 16 kHz, or 24 kHz. The audio will be sent one Opus packet at a time. |
The AudioFormat message includes:
SynthesisRequest
voice (Voice)
audio_params (AudioParameters)
audio_format (AudioFormat)
pcm (PCM)
sample_rate_hz
alaw (Alaw)
ulaw (Ulaw)
ogg_opus (OggOpus)
sample_rate_hz
bit_rate_bps
max_frame_duration_ms
complexity
vbr (EnumVariableBitrate)
opus (Opus)
sample_rate_hz
bit_rate_bps
max_frame_duration_ms
complexity
vbr (EnumVariableBitrate)
The PCM audio format is shown in this example, with alternatives in commented lines:
SynthesisRequest(
voice = Voice(
name = "Evan",
model = "enhanced"
),
audio_params = AudioParameters(
audio_format = AudioFormat(
pcm = PCM(sample_rate_hz = 22050)
# alaw = ALaw()
# ulaw = ULaw()
# ogg_opus = OggOpus(sample_rate_hz = 16000)
# opus = Opus(sample_rate_hz = 8000, bit_rate_bps = 30000)
        )
    )
)
PCM
Input message defining PCM sample rate. Included in AudioFormat.
Field | Type | Description |
---|---|---|
sample_rate_hz | uint32 | Output sample rate in Hz. Supported values: 8000, 11025, 16000, 22050, 24000. |
ALaw
Input message defining A-law audio format. Included in AudioFormat. G.711 audio formats are set to 8 kHz.
ULaw
Input message defining Mu-law audio format. Included in AudioFormat. G.711 audio formats are set to 8 kHz.
OggOpus
Input message defining Ogg Opus output rate. Included in AudioFormat.
Field | Type | Description |
---|---|---|
sample_rate_hz | uint32 | Output sample rate in Hz. Supported values: 8000, 16000, 24000. |
bit_rate_bps | uint32 | Valid range is 500 to 256000 bps. Default 28000. |
max_frame_duration_ms | float | Opus frame size in ms: 2.5, 5, 10, 20, 40, 60. Default 20. |
complexity | uint32 | Computational complexity. A complexity of 0 means the codec default. |
vbr | EnumVariableBitrate | Variable bitrate. On by default. |
Opus
Input message defining Opus output rate. Included in AudioFormat.
Field | Type | Description |
---|---|---|
sample_rate_hz | uint32 | Output sample rate in Hz. Supported values: 8000, 16000, 24000. |
bit_rate_bps | uint32 | Valid range is 500 to 256000 bps. Default 28000. |
max_frame_duration_ms | float | Opus frame size in ms: 2.5, 5, 10, 20, 40, 60. Default 20. |
complexity | uint32 | Computational complexity. A complexity of 0 means the codec default. |
vbr | EnumVariableBitrate | Variable bitrate. On by default. |
EnumVariableBitrate
Settings for variable bitrate. Included in OggOpus and Opus. Turned on by default.
Name | Number | Description |
---|---|---|
VARIABLE_BITRATE_ON | 0 | Use variable bitrate. Default. |
VARIABLE_BITRATE_OFF | 1 | Do not use variable bitrate. |
VARIABLE_BITRATE_CONSTRAINED | 2 | Use constrained variable bitrate. |
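For instance, to request Ogg Opus at a constrained bitrate rather than the default variable bitrate, set the vbr field explicitly. A sketch in the same style as the other examples; the values are illustrative:
SynthesisRequest(
    voice = Voice(
        name = "Evan",
        model = "enhanced"
    ),
    audio_params = AudioParameters(
        audio_format = AudioFormat(
            ogg_opus = OggOpus(
                sample_rate_hz = 16000,
                bit_rate_bps = 28000,
                vbr = VARIABLE_BITRATE_CONSTRAINED   # instead of the default VARIABLE_BITRATE_ON
            )
        )
    ),
    input = Input(
        text = Text(
            text = "Your coffee will be ready in 5 minutes")
    )
)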
Input
Input message containing text to synthesize and synthesis parameters, including tuning data, etc. Included in SynthesisRequest. The type of input may be plain text, SSML, or a sequence of plain text and Nuance control codes. See Input to synthesize for more examples.
Field | Type | Description |
---|---|---|
text | Text | Plain text input. |
ssml | SSML | SSML input, including text and SSML elements. |
tokenized_sequence | TokenizedSequence | Sequence of text and Nuance control codes. |
resources | SynthesisResource | Repeated. Synthesis resources (user dictionaries, rulesets, etc.) to tune synthesized audio. Default blank. |
lid_params | LanguageIdentificationParameters | LID parameters. |
download_params | DownloadParameters | Remote file download parameters. |
The Input message includes:
SynthesisRequest
voice (Voice)
input (Input)
text (Text)
text
uri
ssml (SSML)
text
uri
ssml_validation_mode (EnumSSMLValidationMode)
tokenized_sequence (TokenizedSequence)
tokens (Token)
uri
Text
Input message for synthesizing plain text. The encoding must be UTF-8. Included in Input.
Field | Type | Description |
---|---|---|
text | string | Plain input text in UTF-8 encoding. |
uri | string | Remote URI to the plain input text. Not supported in Nuance-hosted TTS. |
This example shows plain text input:
SynthesisRequest(
voice = Voice(
name = "Evan",
model = "enhanced"
),
input = Input(
text = Text(
text = "Your coffee will be ready in 5 minutes")
),
)
SSML
Input message for synthesizing SSML input. Included in Input. See SSML input for a list of supported elements and examples.
Field | Type | Description |
---|---|---|
text | string | SSML input text and elements. |
uri | string | Remote URI to the SSML input text. Not supported in Nuance-hosted TTS. |
ssml_validation_mode | EnumSSMLValidationMode | SSML validation mode. Default STRICT. |
This input contains SSML:
SynthesisRequest(
voice = Voice(
name = "Evan",
model = "enhanced"
),
input = Input(
ssml = SSML(
text = '<?xml version="1.0"?><speak xmlns="http://www.w3.org/2001/10/synthesis"
xml:lang="en-US" version="1.0">This is the normal volume of my voice.
<prosody volume="10">I can speak rather quietly, </prosody>
<prosody volume="90">But also very loudly.</prosody></speak>',
ssml_validation_mode = WARN
)
)
)
The xml tag and the speak attributes may be omitted:
SynthesisRequest(
voice = Voice(
name = "Evan",
model = "enhanced"
),
input = Input(
ssml = SSML(
text = '<speak>This is the normal volume of my voice.
<prosody volume="10">I can speak rather quietly,</prosody>
<prosody volume="90">But also very loudly.</prosody></speak>',
ssml_validation_mode = WARN
)
)
)
EnumSSMLValidationMode
SSML validation mode when using SSML input. Included in SSML. Strict by default but can be relaxed.
Name | Number | Description |
---|---|---|
STRICT | 0 | Strict SSML validation. Default. |
WARN | 1 | Give warning only. |
NONE | 2 | Do not validate. |
TokenizedSequence
Input message for synthesizing a sequence of plain text and Nuance control codes. Included in Input. See Tokenized sequence for a list of supported codes and examples.
Field | Type | Description |
---|---|---|
tokens | Token | Repeated. Sequence of text and control codes. |
The TokenizedSequence message includes:
SynthesisRequest
voice (Voice)
input (Input)
tokenized_sequence (TokenizedSequence)
tokens (Token)
text
control_code (ControlCode)
key
value
This input is a sequence of tokens: text and control codes:
SynthesisRequest(
voice = Voice(
name = "Evan",
model = "enhanced"
),
input = Input(
tokenized_sequence = TokenizedSequence(
tokens = [
Token(control_code = ControlCode(
key = "vol",
value = "10")),
Token(text = "I can speak rather quietly,"),
Token(control_code = ControlCode(
key = "vol",
value = "90")),
Token(text = "but also very loudly.")
]
)
)
)
Token
The unit when using TokenizedSequence for input. Each token can be either plain text or a Nuance control code.
Field | Type | Description |
---|---|---|
text | string | Plain input text. |
control_code | ControlCode | Nuance control code. |
ControlCode
Nuance control code that specifies how text should be spoken, similar to SSML. Included in Token.
Field | Type | Description |
---|---|---|
key | string | Name of the control code, for example, pause. |
value | string | Value of the control code. |
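For example, a pause control code can be placed between text tokens. In this sketch the pause value is assumed to be a duration in milliseconds; see Tokenized sequence for the supported codes and their value formats:
SynthesisRequest(
    voice = Voice(
        name = "Evan",
        model = "enhanced"
    ),
    input = Input(
        tokenized_sequence = TokenizedSequence(
            tokens = [
                Token(text = "Your coffee will be ready"),
                Token(control_code = ControlCode(
                    key = "pause",
                    value = "500")),   # assumed: pause length in ms
                Token(text = "in 5 minutes.")
            ]
        )
    )
)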
SynthesisResource
Input message specifying the type of file to tune the synthesized output and its location or contents. Included in Input. See Synthesis resources.
Field | Type | Description |
---|---|---|
type | EnumResourceType | Resource type, for example, user dictionary. Default USER_DICTIONARY. |
uri | string | The URN of a resource previously uploaded to cloud storage with the Storage API. See URNs for the format. |
body | bytes | For EnumResourceType USER_DICTIONARY, the contents of the file. See Inline dictionary for an example. |
The SynthesisResource message includes:
SynthesisRequest
voice (Voice)
input (Input)
resources (SynthesisResource)
type (EnumResourceType)
uri
body
This request includes an inline compiled user dictionary (with body):
SynthesisRequest (
voice = Voice (name = "Evan", model = "enhanced"),
input = Input (
text = Text (text = "Your coffee will be ready in 5 minutes"),
resources = [
SynthesisResource (
type = USER_DICTIONARY,
body = open("/path/to/user_dictionary.dcb", 'rb').read()
)
]
)
)
This request includes an external user dictionary:
SynthesisRequest (
voice = Voice (name = "Evan", model = "enhanced"),
input = Input (
text = Text (text = "Your coffee will be ready in 5 minutes"),
resources = [
SynthesisResource (
type = USER_DICTIONARY,
uri = "urn:nuance-mix:tag:tuning:lang/coffee_app/coffee_dict/en-us/mix.tts"
)
]
)
)
This includes an ActivePrompt database:
SynthesisRequest (
voice = Voice (name = "Evan", model = "enhanced"),
input = Input (
text = Text (text = "Your coffee will be ready in 5 minutes"),
resources = [
SynthesisResource (
type = ACTIVEPROMPT_DB,
uri = "urn:nuance-mix:tag:tuning:voice/coffee_app/coffee_prompts/Evan/mix.tts"
)
]
)
)
And this includes a user ruleset:
SynthesisRequest (
voice = Voice (name = "Evan", model = "enhanced"),
input = Input (
text = Text (text = "Your coffee will be ready in 5 minutes"),
resources = [
SynthesisResource (
type = TEXT_USER_RULESET,
uri = "urn:nuance-mix:tag:tuning:lang/coffee_app/coffee_rules/en-us/mix.tts"
)
]
)
)
EnumResourceType
The type of synthesis resource to tune the output. Included in SynthesisResource. User dictionaries provide custom pronunciations, rulesets apply search-and-replace rules to input text, and ActivePrompt databases help tune synthesized audio under certain conditions, using Nuance Vocalizer Studio.
Name | Number | Description |
---|---|---|
USER_DICTIONARY | 0 | User dictionary (application/edct-bin-dictionary). Default. |
TEXT_USER_RULESET | 1 | Text user ruleset (application/x-vocalizer-rettt+text). |
BINARY_USER_RULESET | 2 | Not supported. Binary user ruleset (application/x-vocalizer-rettt+bin). |
ACTIVEPROMPT_DB | 3 | ActivePrompt database (application/x-vocalizer-activeprompt-db). |
ACTIVEPROMPT_DB_AUTO | 4 | ActivePrompt database with automatic insertion (application/x-vocalizer-activeprompt-db;mode=automatic). This type takes the same ActivePrompt databases but changes the behavior to insert prompts automatically. |
SYSTEM_DICTIONARY | 5 | Nuance system dictionary (application/sdct-bin-dictionary). Not supported. |
URNs
The uri field in SynthesisResource defines the location of a synthesis resource as a URN in the Mix cloud storage area. In SSML and TokenizedSequence input, the audio tag or code references a WAV file as a URN. The format depends on the object type:
- User dictionaries and rulesets: urn:nuance-mix:tag:tuning:lang/context_tag/name/language/mix.tts
- ActivePrompt databases: urn:nuance-mix:tag:tuning:voice/context_tag/name/voice/mix.tts
- Audio files: urn:nuance-mix:tag:tuning:audio/context_tag/name/mix.tts
When you upload these resources using the Storage API, you provide only the context tag and name in UploadInitMessage. The UploadResponse message confirms the complete URN for the object, for example:
uri: "urn:nuance-mix:tag:tuning:lang/coffee_app/coffee_dict/en-us/mix.tts?type=userdict"
Syntax | Description |
---|---|
urn:nuance-mix:tag:tuning | The prefix for all synthesis resources. |
lang and language | The scope keyword, lang, for dictionaries and rulesets, plus the language in the format xx-xx. |
voice and voice | The scope keyword, voice, for ActivePrompt databases, plus the voice name. |
audio | The scope keyword, audio, for audio files. |
context_tag | A name for the collection of objects being stored. This can be a Context Tag from a Mix project or another collective name. If the context tag does not exist, it will be created. |
name | An identifier for the content being uploaded, using 1 to 64 alphanumeric characters or underscore (a-z, A-Z, 0-9, _). |
mix.tts | The suffix for all synthesis resources. |
?type=resource_type | An informational field returned in UploadResponse that identifies the type of resource. This field is not required when using the URN in a synthesis request, although it may be included without error. |
Examples of URNs:
User dictionary:
urn:nuance-mix:tag:tuning:lang/coffee_app/coffee_dict/en-us/mix.tts
Text ruleset:
urn:nuance-mix:tag:tuning:lang/coffee_app/coffee_rules/en-us/mix.tts
ActivePrompt database:
urn:nuance-mix:tag:tuning:voice/coffee_app/coffee_prompts/Evan/mix.tts
Audio file:
urn:nuance-mix:tag:tuning:audio/coffee_app/thanks/mix.tts
LanguageIdentificationParameters
Input message controlling the language identifier. Included in Input. The language identifier runs on input blocks labeled with the control code lang unknown or the SSML attribute xml:lang="unknown".
By default, the language identifier matches languages to all installed voices. The languages field limits the permissible languages, and also sets the order of precedence (first to last) when they have equal confidence scores.
Field | Type | Description |
---|---|---|
disable | bool | Whether to disable language identification. Turned on by default. |
languages | string | Repeated. List of three-letter language codes (for example, enu, frc, spm) to restrict language identification results, in order of precedence. Use GetVoicesRequest to obtain the three-letter codes, returned in GetVoicesResponse language_tlw. Default blank. |
always_use_highest_confidence | bool | If enabled, language identification always chooses the language with the highest confidence score, even if the score is low. Default false, meaning a language can be selected at any confidence level. |
The LanguageIdentificationParameters message includes:
SynthesisRequest
voice (Voice)
input (Input)
lid_params (LanguageIdentificationParameters)
disable
languages
always_use_highest_confidence
This Input message includes LID parameters to limit the choice of languages to French Canadian (frc) or American English (enu):
SynthesisRequest(
voice = Voice(
name = "Evan",
model = "enhanced"
),
input = Input(
tokenized_sequence = TokenizedSequence(
tokens = [
Token(text = "The name of the song is. "),
Token(control_code = ControlCode(
key = "lang",
value = "unknown")),
Token(text = "Au clair de la lune."),
Token(control_code = ControlCode(
key = "lang",
value = "normal")),
Token(text = "It's a folk song meaning, in the light of the moon.")
]
),
lid_params = LanguageIdentificationParameters(
languages = (["frc", "enu"])
)
)
)
DownloadParameters
Input message containing parameters for remote file download, whether for input text (Input.uri) or a SynthesisResource (SynthesisResource.uri). Included in Input.
Field | Type | Description |
---|---|---|
headers | map<string,string> | Map of HTTP header name:value pairs to include in outgoing requests. Supported headers: max_age, max_stale. |
request_timeout_ms | uint32 | Request timeout in ms. Default (0) means server default, usually 30000 (30 seconds). |
refuse_cookies | bool | Whether to disable cookies. By default, HTTP requests accept cookies. |
The DownloadParameters message includes:
SynthesisRequest
voice (Voice)
input (Input)
download_params (DownloadParameters)
headers
request_timeout_ms
refuse_cookies
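For example, a request that fetches a remote user dictionary could tolerate a cached copy and shorten the timeout. The max_stale value is assumed here to be in seconds, and all values are illustrative:
SynthesisRequest(
    voice = Voice(
        name = "Evan",
        model = "enhanced"
    ),
    input = Input(
        text = Text(
            text = "Your coffee will be ready in 5 minutes"),
        resources = [
            SynthesisResource(
                type = USER_DICTIONARY,
                uri = "urn:nuance-mix:tag:tuning:lang/coffee_app/coffee_dict/en-us/mix.tts"
            )
        ],
        download_params = DownloadParameters(
            headers = {"max_stale": "3600"},   # assumed: seconds of acceptable staleness
            request_timeout_ms = 10000
        )
    )
)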
EventParameters
Input message that defines event subscription parameters. Included in SynthesisRequest. Requested events are sent throughout the SynthesisResponse stream as they are generated. Marker events report when certain parts of the synthesized audio are reached, for example, the end of a word, sentence, or user-defined bookmark.
Log events are produced throughout a synthesis request for events such as a voice loaded by the server or an audio chunk being ready to send.
Field | Type | Description |
---|---|---|
send_sentence_marker_events | bool | Sentence marker. Default: do not send. |
send_word_marker_events | bool | Word marker. Default: do not send. |
send_phoneme_marker_events | bool | Phoneme marker. Default: do not send. |
send_bookmark_marker_events | bool | Bookmark marker. Default: do not send. |
send_paragraph_marker_events | bool | Paragraph marker. Default: do not send. |
send_visemes | bool | Lipsync information. Default: do not send. |
send_log_events | bool | Whether to log events in the synthesis response: Events. By default, logging is turned off. |
suppress_input | bool | Whether to omit input text and URIs from log events in Events or in server event logs. By default, these items are included. |
The EventParameters message includes:
SynthesisRequest
voice (Voice)
input (Input)
event_params (EventParameters)
send_sentence_marker_events
send_word_marker_events
send_phoneme_marker_events
send_bookmark_marker_events
send_paragraph_marker_events
send_visemes
send_log_events
suppress_input
Event parameters in SynthesisRequest:
SynthesisRequest(
voice = Voice(
name = "Evan",
model = "enhanced"
),
input = Input(
text = Text(
text = "Your coffee will be ready in 5 minutes.")
),
event_params = EventParameters(
send_sentence_marker_events = True,
send_paragraph_marker_events = True,
send_log_events = True,
suppress_input = True
)
)
SynthesisResponse
The Synthesizer Synthesize method returns a stream of SynthesisResponse messages. (See UnarySynthesisResponse for a non-streamed response.) Each response contains one of:
- A status response, indicating completion or failure of the request. This is received only once and signifies the end of a Synthesize call.
- A list of events the client has requested. This can be received many times.
- An audio buffer. This may be received many times.
Field | Type | Description |
---|---|---|
status | Status | A status response, indicating completion or failure of the request. |
events | Events | A list of events. See EventParameters for details. |
audio | bytes | The latest audio buffer. |
The SynthesisResponse message includes:
SynthesisResponse
status (Status)
code
message
details
events (Events)
event (Event)
name
values
audio
Response to synthesis request:
from google.protobuf import text_format  # used to pretty-print event messages

audio_file = None
try:
    if args.output_audio_file:
        audio_file = open(args.output_audio_file, "wb")
    for response in stream_in:
        if response.HasField("audio"):
            print("Received audio: %d bytes" % len(response.audio))
            if audio_file:
                audio_file.write(response.audio)
        elif response.HasField("events"):
            print("Received events")
            print(text_format.MessageToString(response.events))
        else:
            if response.status.code == 200:
                print("Received status response: SUCCESS")
            else:
                print("Received status response: FAILED")
                print("Code: {}, Message: {}".format(response.status.code, response.status.message))
                print("Error: {}".format(response.status.details))
except Exception as e:
    print(e)
if audio_file:
    print("Saved audio to {}".format(args.output_audio_file))
    audio_file.close()
These results are from synthesizing a simple text string with the Evan voice, using a user dictionary:
2023-09-26 15:45:42,436 (139898668111680) INFO Iteration #1
2023-09-26 15:45:42,439 (139898668111680) DEBUG Creating secure gRPC channel
2023-09-26 15:45:42,444 (139898668111680) INFO Running file [flow.py]
2023-09-26 15:45:42,444 (139898668111680) DEBUG [voice {
name: "Evan"
model: "enhanced"
}
audio_params {
audio_format {
pcm {
sample_rate_hz: 22050
}
}
volume_percentage: 80
speaking_rate_factor: 1.0
audio_chunk_duration_ms: 2000
}
input {
text {
text: "This is a test. A very simple test."
}
resources {
uri: "urn:nuance-mix:tag:tuning:lang/coffee_app/coffee_dict/en-us/mix.tts"
}
}
user_id: "MyApplicationUser"
]
2023-09-26 15:45:42,444 (139898668111680) INFO Sending Synthesis request
2023-09-26 15:45:42,631 (139898668111680) INFO Received audio: 57484 bytes
2023-09-26 15:45:42,655 (139898668111680) INFO Received audio: 70432 bytes
2023-09-26 15:45:42,657 (139898668111680) INFO Received status response: SUCCESS
2023-09-26 15:45:42,657 (139898668111680) INFO Wrote audio to flow.py_i1_s1.wav
2023-09-26 15:45:42,657 (139898668111680) INFO Done running file [flow.py]
2023-09-26 15:45:42,660 (139898668111680) INFO Iteration #1 complete
2023-09-26 15:45:42,660 (139898668111680) INFO Done
Status
Output message containing a status response, indicating completion or failure of a Synthesize call. Included in SynthesisResponse and UnarySynthesisResponse.
Field | Type | Description |
---|---|---|
code | uint32 | HTTP-style return code: 200, 4xx, or 5xx as appropriate. See Status codes. |
message | string | Brief description of the status. |
details | string | Longer description if available. |
Events
Output message defining a container for a list of events. This container is needed because oneof does not allow repeated parameters. Included in SynthesisResponse and UnarySynthesisResponse.
Field | Type | Description |
---|---|---|
events | Event | Repeated. One or more events. |
Events are returned when send_log_events is True in the request’s EventParameters. For a description of the NVOC events in the results, see TTS payload: callsummary.
Event
Output message defining an event message. Included in Events. See EventParameters for details.
Field | Type | Description |
---|---|---|
name | string | Either “Markers” or the name of the event in the case of a Log Event. |
values | map<string,string> | Map of key:value data relevant to the current event. |
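A client can walk this structure directly. A minimal Python sketch, assuming response is a SynthesisResponse whose events field is set (the field names follow the tables above):
for event in response.events.events:
    print("Event:", event.name)               # "Markers" or the log event name
    for key, value in event.values.items():   # event-specific key:value data
        print("  %s: %s" % (key, value))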
UnarySynthesisResponse
The Synthesizer UnarySynthesize RPC call returns a single UnarySynthesisResponse message. It is similar to SynthesisResponse, but returns all the information at once instead of as a stream. The response contains:
- A status response, indicating completion or failure of the request.
- A list of events the client has requested.
- The complete audio buffer of the synthesized text.
Field | Type | Description |
---|---|---|
status | Status | A status response, indicating completion or failure of the request. |
events | Events | A list of events. See EventParameters for details. |
audio | bytes | Audio buffer of the synthesized text, capped if necessary to a configured audio response size. |
The UnarySynthesisResponse message includes:
UnarySynthesisResponse
status (Status)
code
message
details
events (Events)
event (Event)
name
values
audio
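Handling the unary response is simpler than the streamed case because there is exactly one message. A minimal Python sketch; the stub and request setup are as in the earlier examples, and the output file name is illustrative:
response = stub.UnarySynthesize(request)
if response.status.code == 200:
    with open("output.pcm", "wb") as audio_file:
        audio_file.write(response.audio)   # the complete (possibly capped) audio buffer
else:
    print("Synthesis failed: {} {}".format(response.status.code, response.status.message))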
Scalar value types
The data types in the proto files are mapped to equivalent types in the generated client stub files.