Synthesizer gRPC API

The Synthesizer gRPC API contains methods for requesting speech synthesis from TTSaaS, using standard and enhanced voices.

If you wish to use Microsoft neural voices, use Neural TTSaaS instead; see Synthesizer gRPC API for Neural TTSaaS in the Neural TTSaaS documentation.

Proto file structure

The Synthesizer API is defined in the synthesizer.proto file.

└── nuance
    ├── rpc (RPC message files)
    └── tts
        ├── storage
        │   └── v1beta1
        │       └── storage.proto
        └── v1
            └── synthesizer.proto

The proto file defines a Synthesizer service with three RPC methods: GetVoices, Synthesize, and UnarySynthesize.

  Proto file fields for GetVoices  
  Proto file fields for Synthesize and UnarySynthesize  

Synthesizer

The Synthesizer service offers three methods related to voice synthesis.

Synthesizer service
Name Request Response Description
GetVoices GetVoicesRequest GetVoicesResponse Queries the list of available voices, with filters to reduce the search space.
Synthesize SynthesisRequest SynthesisResponse stream Synthesizes audio from input text and parameters, and returns an audio stream.
UnarySynthesize SynthesisRequest UnarySynthesisResponse Synthesizes audio and returns a single (unary) audio response.
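
The sketch below shows how these methods might be called from Python stubs generated from synthesizer.proto. This is a minimal sketch: the module paths follow the proto package, the endpoint is hypothetical, and authorization metadata is omitted.

import grpc
from nuance.tts.v1 import synthesizer_pb2, synthesizer_pb2_grpc

# Hypothetical endpoint; real deployments also attach an authorization token
channel = grpc.secure_channel("tts.example.com:443", grpc.ssl_channel_credentials())
stub = synthesizer_pb2_grpc.SynthesizerStub(channel)

# GetVoices: one request, one response
voices_response = stub.GetVoices(synthesizer_pb2.GetVoicesRequest())

request = synthesizer_pb2.SynthesisRequest(
    voice=synthesizer_pb2.Voice(name="Evan", model="enhanced"),
    input=synthesizer_pb2.Input(
        text=synthesizer_pb2.Text(text="Hello")))

# Synthesize: one request, a stream of responses
for response in stub.Synthesize(request):
    pass  # each response holds a status, events, or an audio buffer

# UnarySynthesize: one request, one response
unary_response = stub.UnarySynthesize(request)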

Streamed vs. unary response

TTSaaS offers two types of synthesis response: a streamed response in SynthesisResponse and a non-streamed response in UnarySynthesisResponse.

The request is the same in both cases: SynthesisRequest specifies a voice, the input text to synthesize, and optional parameters. The response can be either:

  • SynthesisResponse: Returns one status message followed by multiple streamed audio buffers, each including the markers or other events specified in the request. Each audio buffer contains the latest synthesized audio.

  • UnarySynthesisResponse: Returns one status message and one audio buffer, containing all the markers and events specified in the request. The underlying TTSaaS engine caps the audio response size.

    See Run client for unary response to run the sample Python client with a unary response, activated by a command line flag.

One request, two possible responses (from proto file):

service Synthesizer {
    rpc Synthesize(SynthesisRequest) returns (stream SynthesisResponse) {} 
    rpc UnarySynthesize(SynthesisRequest) returns (UnarySynthesisResponse) {}
. . .
message SynthesisRequest { 
    Voice voice = 1;  
    AudioParameters audio_params = 2; 
    Input input = 3;   
    EventParameters event_params = 4;  
    map<string, string> client_data = 5; 
    string user_id = 6;
}

message SynthesisResponse {
    oneof response {
        Status status = 1;
        Events events = 2;
        bytes audio = 3;
    }
}

message UnarySynthesisResponse {
    Status status = 1;
    Events events = 2;
    bytes audio = 3;
}
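
A minimal sketch of the difference on the client side, assuming the stub and request from the earlier sketch:

# Streamed: iterate the response stream, appending audio as it arrives
audio = b""
for response in stub.Synthesize(request):
    if response.HasField("audio"):
        audio += response.audio

# Unary: a single response carries the status, events, and complete audio buffer
response = stub.UnarySynthesize(request)
if response.status.code == 200:
    audio = response.audio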

GetVoicesRequest

Input message for Synthesizer: GetVoices, to query the voices available to the client.

Get voices request
Field Type Description
voice Voice Optional. Filter the voices to retrieve. For example, set language to en-US to return only American English voices.

The GetVoicesRequest message includes:

GetVoicesRequest
  voice (Voice)
    name
    model
    language
    age_group (EnumAgeGroup)
    gender (EnumGender)
    sample_rate_hz
    language_tlw

For example:

# This retrieves all American English voices
GetVoicesRequest (
    voice = Voice (language = "en-us")
)

# This returns one named voice
GetVoicesRequest (
    voice = Voice (name = "Evan")
)

# This returns all female American English voices
GetVoicesRequest (
    voice = Voice (
        gender = EnumGender.FEMALE,
        language = "en-us"
    )
)
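
Sent through the stub from the earlier sketch, the first request might be handled as follows (a sketch, not the full sample client):

request = synthesizer_pb2.GetVoicesRequest(
    voice=synthesizer_pb2.Voice(language="en-us"))
response = stub.GetVoices(request)
for voice in response.voices:
    # Each entry is a Voice message; see the Voice section below
    print(voice.name, voice.model, voice.language, voice.sample_rate_hz)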

Voice

Input or output message for voices:

These fields are supported in all cases:

Voice message
Field Type Description
name string The voice’s name, for example, Evan. Mandatory for SynthesisRequest.
model string The voice’s quality model, for example, enhanced or standard. Mandatory for SynthesisRequest.

These Voice fields are used only in GetVoicesRequest and GetVoicesResponse. They are ignored in SynthesisRequest.

Voice fields
Field Type Description
language string IETF language code, for example, en-US. Search for voices with a specific language. Some voices support multiple languages.
age_group EnumAgeGroup Search for adult or child voices.
gender EnumGender Search for voices with a certain gender.
sample_rate_hz uint32 Search for a certain native sample rate.
language_tlw string Three-letter language code (for example, enu for American English) for configuring language identification in Input.
restricted bool Used only in GetVoicesResponse, to identify restricted voices (restricted: true). These are custom voices available only to specific customers. Default is false, meaning the voice is public.
version string Used only in GetVoicesResponse, to return the voice’s version.
foreign_languages string Repeated. Used only in GetVoicesResponse, to return the foreign languages of a multilingual voice.

The Voice message includes different fields depending on the context:

GetVoicesRequest
  voice (Voice)
    name
    model
    language
    age_group (EnumAgeGroup)
    gender (EnumGender)
    sample_rate_hz
    language_tlw
 
GetVoicesResponse
  voices (Voice)
    name
    model
    language
    age_group (EnumAgeGroup)
    gender (EnumGender)
    sample_rate_hz
    language_tlw
    restricted
    version
    foreign_languages
 
SynthesisRequest
  voice (Voice)
    name
    model

EnumAgeGroup

Input field for GetVoicesRequest or output field for GetVoicesResponse, specifying whether the voice uses its adult or child version, if available. Included in Voice.

Age group
Name Number Description
ADULT 0 Adult voice. Default for GetVoicesRequest.
CHILD 1 Child voice.

EnumGender

Input field for GetVoicesRequest or output field for GetVoicesResponse, specifying gender for voices that support multiple genders. Included in Voice.

Gender
Name Number Description
ANY 0 Any gender voice. Default for GetVoicesRequest.
MALE 1 Male voice.
FEMALE 2 Female voice.
NEUTRAL 3 Neutral gender voice.

GetVoicesResponse

Output message for Synthesizer: GetVoices. Includes a list of voices that matched the input criteria, if any.

Get voices responses
Field Type Description
voices Voice Repeated. Voices and characteristics returned.

The GetVoicesResponse message includes:

GetVoicesResponse
  voices (Voice)
    name
    model
    language
    age_group (EnumAgeGroup)
    gender (EnumGender)
    sample_rate_hz
    language_tlw
    restricted
    version
    foreign_languages

This response to GetVoicesRequest returns all American English (en-us) voices:

2023-09-26 15:51:16,151 (139911033857856) INFO  Iteration #1
2023-09-26 15:51:16,154 (139911033857856) DEBUG Creating secure gRPC channel
2023-09-26 15:51:16,161 (139911033857856) INFO  Running file [flow.py]
2023-09-26 15:51:16,161 (139911033857856) DEBUG [voice {
  language: "en-us"
}
]
2023-09-26 15:51:16,161 (139911033857856) INFO  Sending GetVoices request
2023-09-26 15:51:16,367 (139911033857856) INFO  voices {
  name: "Allison"
  model: "standard"
  language: "en-US"
  gender: FEMALE
  sample_rate_hz: 22050
  language_tlw: "enu"
  version: "2.0.0"
}
voices {
  name: "Ava-Ml"
  model: "enhanced"
  language: "en-US"
  gender: FEMALE
  sample_rate_hz: 22050
  language_tlw: "enu"
  version: "3.0.1"
  foreign_languages: "es-MX"
}
voices {
  name: "Chloe"
  model: "standard"
  language: "en-US"
  gender: FEMALE
  sample_rate_hz: 22050
  language_tlw: "enu"
  version: "5.2.3.15315"
}
voices {
  name: "Chloe"
  model: "standard"
  language: "en-US"
  gender: FEMALE
  sample_rate_hz: 8000
  language_tlw: "enu"
  version: "5.2.3.15315"
}
voices {
  name: "Erica"
  model: "standard"
  language: "en-US"
  gender: FEMALE
  sample_rate_hz: 22050
  language_tlw: "enu"
  restricted: true
  version: "1.0.2"
}
voices {
  name: "Erica"
  model: "standard"
  language: "en-US"
  gender: FEMALE
  sample_rate_hz: 8000
  language_tlw: "enu"
  restricted: true
  version: "1.0.2"
}
voices {
  name: "Evan"
  model: "enhanced"
  language: "en-US"
  gender: MALE
  sample_rate_hz: 22050
  language_tlw: "enu"
  version: "1.1.1"
}
voices {
  name: "Evelyn"
  model: "standard"
  language: "en-US"
  gender: FEMALE
  sample_rate_hz: 22050
  language_tlw: "enu"
  version: "5.2.3.15114"
}
voices {
  name: "Evelyn"
  model: "standard"
  language: "en-US"
  gender: FEMALE
  sample_rate_hz: 8000
  language_tlw: "enu"
  version: "5.2.3.15114"
}
voices {
  name: "Nathan"
  model: "enhanced"
  language: "en-US"
  gender: MALE
  sample_rate_hz: 22050
  language_tlw: "enu"
  version: "4.1.1"
}
voices {
  name: "Nolan"
  model: "standard"
  language: "en-US"
  gender: MALE
  sample_rate_hz: 22050
  language_tlw: "enu"
  version: "5.2.3.15315"
}
voices {
  name: "Nolan"
  model: "standard"
  language: "en-US"
  gender: MALE
  sample_rate_hz: 8000
  language_tlw: "enu"
  version: "5.2.3.15315"
}
voices {
  name: "Samantha"
  model: "standard"
  language: "en-US"
  gender: FEMALE
  sample_rate_hz: 22050
  language_tlw: "enu"
  version: "2.0.0"
}
voices {
  name: "Susan"
  model: "standard"
  language: "en-US"
  gender: FEMALE
  sample_rate_hz: 22050
  language_tlw: "enu"
  version: "2.0.0"
}
voices {
  name: "Tom"
  model: "standard"
  language: "en-US"
  gender: MALE
  sample_rate_hz: 22050
  language_tlw: "enu"
  version: "3.2.1"
}
voices {
  name: "Zoe-Ml"
  model: "enhanced"
  language: "en-US"
  gender: FEMALE
  sample_rate_hz: 22050
  language_tlw: "enu"
  version: "2.0.0"
  foreign_languages: "es-MX"
  foreign_languages: "fr-CA"
}

2023-09-26 15:51:16,368 (139911033857856) INFO  Done running file [flow.py]
2023-09-26 15:51:16,369 (139911033857856) INFO  Iteration #1 complete
2023-09-26 15:51:16,369 (139911033857856) INFO  Done

SynthesisRequest

Input message for Synthesizer: Synthesize. Specifies input text, audio parameters, and events to subscribe to, in exchange for synthesized audio. See Defaults for default values for optional fields.

Synthesis request
Field Type Description
voice Voice Mandatory. The voice to use for audio synthesis.
audio_params AudioParameters Output audio parameters, such as encoding and volume. Default is PCM audio at 22050 Hz.
input Input Mandatory. Input text to synthesize, tuning data, etc.
event_params EventParameters Markers and other info to include in server events returned during synthesis.
client_data map<string,string> Map of client-supplied key:value pairs to inject into the event log.
user_id string Identifies a specific user within the application.

The SynthesisRequest message includes:

SynthesisRequest
  voice (Voice)
    name
    model
  audio_params (AudioParameters)
    audio_format (AudioFormat)
    volume_percentage
    speaking_rate_factor
    audio_chunk_duration_ms
    target_audio_length_ms
    disable_early_emission
  input (Input)
    text (Text)
    ssml (SSML)
    tokenized_sequence (TokenizedSequence)
    resources (SynthesisResource)
    lid_params (LanguageIdentificationParameters)
    download_params (DownloadParameters)
  event_params (EventParameters)
    send_sentence_marker_events
    send_word_marker_events
    send_phoneme_marker_events
    send_bookmark_marker_events
    send_paragraph_marker_events
    send_visemes
    send_log_events
    suppress_input
  client_data
  user_id

This synthesis request includes most fields:

SynthesisRequest(
    voice = Voice(
        name = "Evan",
        model = "enhanced"
    ),
    audio_params = AudioParameters(
        audio_format = AudioFormat(
            pcm = PCM(sample_rate_hz = 22050)
        ),
        volume_percentage = 80,       # Default value
        speaking_rate_factor = 1.0    # Default value
    ),
    input = Input(
        text = Text(
           text = "Your coffee will be ready in 5 minutes")
    ),
    event_params = EventParameters(
        send_log_events = True,
        suppress_input = True
    ),
    client_data = {'company':'Aardvark Coffee','user':'Leslie'},
    user_id = "leslie.somebody@aardvark.com"
)

This minimal synthesis request uses all defaults:

SynthesisRequest(
    voice = Voice(
        name = "Evan",
        model = "enhanced"
    ),
    input = Input(
        text = Text(
           text = "Your coffee will be ready in 5 minutes")
    )
)

AudioParameters

Input message for audio-related parameters during synthesis, including encoding, volume, and audio length. Included in SynthesisRequest.

Audio parameters
Field Type Description
audio_format AudioFormat Audio encoding. Default PCM 22050 Hz.
volume_percentage uint32 Volume amplitude, from 0 to 100. Default 80.
speaking_rate_factor float Speaking rate, from 0 to 2.0. Default 1.0.
audio_chunk_duration_ms uint32 Maximum duration, in ms, of an audio chunk delivered to the client, from 1 to 60000. Default is 20000 (20 seconds). When this parameter is large enough (for example, 20 or 30 seconds), each audio chunk contains an audible segment surrounded by silence.
target_audio_length_ms uint32 Maximum duration, in ms, of synthesized audio. When greater than 0, the server stops ongoing synthesis at the first sentence end, or silence, closest to the value.
disable_early_emission bool By default, audio segments are emitted as soon as possible, even if they are not audible. Set this field to disable that behavior.

The AudioParameters message includes:

SynthesisRequest
  voice (Voice)
  audio_params (AudioParameters)
    audio_format (AudioFormat)
      pcm (PCM)
      alaw (Alaw)
      ulaw (Ulaw)
      ogg_opus (OggOpus)
      opus (Opus)
    volume_percentage
    speaking_rate_factor
    audio_chunk_duration_ms
    target_audio_length_ms
    disable_early_emission

AudioFormat

Input message for audio encoding of synthesized text. Included in AudioParameters.

Audio format
Field Type Description
pcm PCM Signed 16-bit little endian PCM.
alaw ALaw G.711 A-law, 8kHz.
ulaw ULaw G.711 Mu-law, 8kHz.
ogg_opus OggOpus Ogg Opus, 8 kHz, 16 kHz, or 24 kHz.
opus Opus Opus, 8 kHz, 16 kHz, or 24 kHz. The audio is sent one Opus packet at a time.

The AudioFormat message includes:

SynthesisRequest
  voice (Voice)
  audio_params (AudioParameters)
    audio_format (AudioFormat)
      pcm (PCM)
        sample_rate_hz
      alaw (Alaw)
      ulaw (Ulaw)
      ogg_opus (OggOpus)
        sample_rate_hz
        bit_rate_bps
        max_frame_duration_ms
        complexity
        vbr (EnumVariableBitrate)
      opus (Opus)
        sample_rate_hz
        bit_rate_bps
        max_frame_duration_ms
        complexity
        vbr (EnumVariableBitrate)

The PCM audio format is shown in this example, with alternatives in commented lines:

SynthesisRequest(
    voice = Voice(
        name = "Evan",
        model = "enhanced"
    ),
    audio_params = AudioParameters(
        audio_format = AudioFormat(
            pcm = PCM(sample_rate_hz = 22050)
#           alaw = ALaw()
#           ulaw = ULaw()
#           ogg_opus = OggOpus(sample_rate_hz = 16000)
#           opus = Opus(sample_rate_hz = 8000, bit_rate_bps = 30000)
        )
    )
)

PCM

Input message defining PCM sample rate. Included in AudioFormat.

PCM audio
Field Type Description
sample_rate_hz uint32 Output sample rate in Hz. Supported values: 8000, 11025, 16000, 22050, 24000.
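
PCM audio is returned as raw samples with no file header. If you need a playable WAV file, here is a small sketch using Python's standard wave module, assuming mono output at the requested sample rate:

import wave

def save_wav(pcm_bytes, path, sample_rate_hz=22050):
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)        # assumed mono output
        wav.setsampwidth(2)        # signed 16-bit little endian samples
        wav.setframerate(sample_rate_hz)
        wav.writeframes(pcm_bytes)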

ALaw

Input message defining the A-law audio format. Included in AudioFormat. G.711 audio formats are set to 8 kHz.

ULaw

Input message defining the Mu-law audio format. Included in AudioFormat. G.711 audio formats are set to 8 kHz.

OggOpus

Input message defining Ogg Opus output rate. Included in AudioFormat.

Ogg Opus audio
Field Type Description
sample_rate_hz uint32 Output sample rate in Hz. Supported values: 8000, 16000, 24000.
bit_rate_bps uint32 Valid range is 500 to 256000 bps. Default 28000.
max_frame_duration_ms float Opus frame size in ms: 2.5, 5, 10, 20, 40, 60. Default 20.
complexity uint32 Computational complexity. A complexity of 0 means the codec default.
vbr EnumVariableBitrate Variable bitrate. On by default.

Opus

Input message defining Opus output rate. Included in AudioFormat.

Opus audio
Field Type Description
sample_rate_hz uint32 Output sample rate in Hz. Supported values: 8000, 16000, 24000.
bit_rate_bps uint32 Valid range is 500 to 256000 bps. Default 28000.
max_frame_duration_ms float Opus frame size in ms: 2.5, 5, 10, 20, 40, 60. Default 20.
complexity uint32 Computational complexity. A complexity of 0 means the codec default.
vbr EnumVariableBitrate Variable bitrate. On by default.

EnumVariableBitrate

Settings for variable bitrate. Included in OggOpus and Opus. Turned on by default.

Variable bitrate
Name Number Description
VARIABLE_BITRATE_ON 0 Use variable bitrate. Default.
VARIABLE_BITRATE_OFF 1 Do not use variable bitrate.
VARIABLE_BITRATE_CONSTRAINED 2 Use constrained variable bitrate.
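
For example, an Ogg Opus request with constrained variable bitrate, following the same conventions as the other examples in this section (the parameter values are illustrative):

SynthesisRequest(
    voice = Voice(
        name = "Evan",
        model = "enhanced"
    ),
    audio_params = AudioParameters(
        audio_format = AudioFormat(
            ogg_opus = OggOpus(
                sample_rate_hz = 24000,
                bit_rate_bps = 28000,          # Default value
                max_frame_duration_ms = 20,    # Default value
                vbr = VARIABLE_BITRATE_CONSTRAINED
            )
        )
    ),
    input = Input(
        text = Text(text = "Your coffee will be ready in 5 minutes")
    )
)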

Input

Input message containing the text to synthesize and related synthesis parameters, such as tuning data. Included in SynthesisRequest. The type of input may be plain text, SSML, or a sequence of plain text and Nuance control codes. See Input to synthesize for more examples.

Input
Field Type Description
text Text Plain text input.
ssml SSML SSML input, including text and SSML elements.
tokenized_sequence TokenizedSequence Sequence of text and Nuance control codes.
resources SynthesisResource Repeated. Synthesis resources (user dictionaries, rulesets, etc.) to tune synthesized audio. Default blank.
lid_params LanguageIdentificationParameters LID parameters.
download_params DownloadParameters Remote file download parameters.

The Input message includes:

SynthesisRequest
  voice (Voice)
  input (Input)
    text (Text)
      text
      uri
    ssml (SSML)
      text
      uri
      ssml_validation_mode (EnumSSMLValidationMode)
    tokenized_sequence (TokenizedSequence)
      tokens (Token)

Text

Input message for synthesizing plain text. The encoding must be UTF-8. Included in Input.

Text input
Field Type Description
text string Plain input text in UTF-8 encoding.
uri string Remote URI to the plain input text. Not supported in Nuance-hosted TTS.

This example shows plain text input:

SynthesisRequest(
   voice = Voice(
       name = "Evan",
       model = "enhanced"
    ),
    input = Input(
        text = Text(
           text = "Your coffee will be ready in 5 minutes")
    ),
)

SSML

Input message for synthesizing SSML input. See SSML input for a list of supported elements and examples.

SSML input
Field Type Description
text string SSML input text and elements.
uri string Remote URI to the SSML input text. Not supported in Nuance-hosted TTS.
ssml_validation_mode EnumSSMLValidationMode SSML validation mode. Default STRICT.

This input contains SSML:

SynthesisRequest(
   voice = Voice(
       name = "Evan",
       model = "enhanced"
    ),
    input = Input(
        ssml = SSML(
            text = '<?xml version="1.0"?><speak  xmlns="http://www.w3.org/2001/10/synthesis"
xml:lang="en-US" version="1.0">This is the normal volume of my voice.
<prosody volume="10">I can speak rather quietly, </prosody>
<prosody volume="90">But also very loudly.</prosody></speak>',
            ssml_validation_mode = WARN
        )
    )
)

The XML declaration and the speak attributes may be omitted:

SynthesisRequest(
   voice = Voice(
       name = "Evan",
       model = "enhanced"
    ),
    input = Input(
        ssml = SSML(
            text = '<speak>This is the normal volume of my voice.
<prosody volume="10">I can speak rather quietly,</prosody>
<prosody volume="90">But also very loudly.</prosody></speak>',
            ssml_validation_mode = WARN
        )
    )
)

EnumSSMLValidationMode

SSML validation mode when using SSML input. Included in SSML. Strict by default but can be relaxed.

SSML validation mode
Name Number Description
STRICT 0 Strict SSML validation. Default.
WARN 1 Give warning only.
NONE 2 Do not validate.

TokenizedSequence

Input message for synthesizing a sequence of plain text and Nuance control codes. Included in Input. See Tokenized sequence for a list of supported codes and examples.

Tokenized sequence
Field Type Description
tokens Token Repeated. Sequence of text and control codes.

The TokenizedSequence message includes:

SynthesisRequest
  voice (Voice)
  input (Input)
    tokenized_sequence (TokenizedSequence)
      tokens (Token)
        text
        control_code (ControlCode)
          key
          value

This input is a sequence of tokens that mixes text and control codes:

SynthesisRequest(
   voice = Voice(
       name = "Evan",
       model = "enhanced"
    ),
    input = Input(
        tokenized_sequence = TokenizedSequence(
            tokens = [
                Token(control_code = ControlCode(
                    key = "vol",
                    value = "10")),
                Token(text = "I can speak rather quietly,"),
                Token(control_code = ControlCode(
                    key = "vol",
                    value = "90")),
                Token(text = "but also very loudly.")
            ]
        )
    )
)

Token

The unit when using TokenizedSequence for input. Each token can be either plain text or a Nuance control code.

Token
Field Type Description
text string Plain input text.
control_code ControlCode Nuance control code.

ControlCode

Nuance control code that specifies how text should be spoken, similar to SSML. Included in Token.

Control code
Field Type Description
key string Name of the control code, for example, pause.
value string Value of the control code.
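
For example, a pause control code between two text tokens, in the same style as the other examples in this section (treating the value as a pause length in milliseconds is an assumption based on the code's name):

Input(
    tokenized_sequence = TokenizedSequence(
        tokens = [
            Token(text = "Your coffee will be ready"),
            Token(control_code = ControlCode(
                key = "pause",
                value = "300")),    # assumed to be a pause length in ms
            Token(text = "in 5 minutes.")
        ]
    )
)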

SynthesisResource

Input message specifying the type of file to tune the synthesized output and its location or contents. Included in Input. See Synthesis resources.

Synthesis resource
Field Type Description
type EnumResourceType Resource type, for example, user dictionary. Default USER_DICTIONARY.
uri string The URN of a resource previously uploaded to cloud storage with the Storage API. See URNs for the format.
body bytes For EnumResourceType USER_DICTIONARY, the contents of the file. See Inline dictionary for an example.

The SynthesisResource message includes:

SynthesisRequest
  voice (Voice)
  input (Input)
    resources (SynthesisResource)
      type (EnumResourceType)
      uri
      body

This request includes an inline compiled user dictionary (with body):

SynthesisRequest (
    voice = Voice (name = "Evan", model = "enhanced"),
    input = Input (
        text = Text (text = "Your coffee will be ready in 5 minutes"),
        resources =  [
            SynthesisResource (
                type = USER_DICTIONARY,
                body = open("/path/to/user_dictionary.dcb", 'rb').read()
            )
        ]
    )
)

This request includes an external user dictionary:

SynthesisRequest (
    voice = Voice (name = "Evan", model = "enhanced"),
    input = Input (
        text = Text (text = "Your coffee will be ready in 5 minutes"),
        resources =  [
            SynthesisResource (
                type = USER_DICTIONARY,
                uri = "urn:nuance-mix:tag:tuning:lang/coffee_app/coffee_dict/en-us/mix.tts"
            )
        ]
    )
)

This includes an ActivePrompt database:

SynthesisRequest (
    voice = Voice (name = "Evan", model = "enhanced"),
    input = Input (
        text = Text (text = "Your coffee will be ready in 5 minutes"),
        resources =  [
            SynthesisResource (
                type = ACTIVEPROMPT_DB,
                uri = "urn:nuance-mix:tag:tuning:voice/coffee_app/coffee_prompts/Evan/mix.tts"
            )
        ]
    )
)

And this includes a user ruleset:

SynthesisRequest (
    voice = Voice (name = "Evan", model = "enhanced"),
    input = Input (
        text = Text (text = "Your coffee will be ready in 5 minutes"),
        resources =  [
            SynthesisResource (
                type = TEXT_USER_RULESET,
                uri = "urn:nuance-mix:tag:tuning:lang/coffee_app/coffee_rules/en-us/mix.tts"
            )
        ]
    )
)

EnumResourceType

The type of synthesis resource to tune the output. Included in SynthesisResource. User dictionaries provide custom pronunciations, rulesets apply search-and-replace rules to input text, and ActivePrompt databases help tune synthesized audio under certain conditions, using Nuance Vocalizer Studio.

Resource types
Name Number Description
USER_DICTIONARY 0 User dictionary (application/edct-bin-dictionary). Default.
TEXT_USER_RULESET 1 Text user ruleset (application/x-vocalizer-rettt+text).
BINARY_USER_RULESET 2 Not supported. Binary user ruleset (application/x-vocalizer-rettt+bin).
ACTIVEPROMPT_DB 3 ActivePrompt database (application/x-vocalizer-activeprompt-db).
ACTIVEPROMPT_DB_AUTO 4 ActivePrompt database with automatic insertion (application/x-vocalizer-activeprompt-db;mode=automatic). This type accepts the same ActivePrompt databases as ACTIVEPROMPT_DB but enables automatic insertion.

SYSTEM_DICTIONARY 5 Nuance system dictionary (application/sdct-bin-dictionary). Not supported.

URNs

The uri field in SynthesisResource defines the location of a synthesis resource as a URN in the Mix cloud storage area. In SSML and TokenizedSequence input, the audio tag or code references a WAV file as a URN. The format depends on the object type:

  • User dictionaries and rulesets: urn:nuance-mix:tag:tuning:lang/context_tag/name/language/mix.tts

  • ActivePrompt databases: urn:nuance-mix:tag:tuning:voice/context_tag/name/voice/mix.tts

  • Audio files: urn:nuance-mix:tag:tuning:audio/context_tag/name/mix.tts

When you upload these resources using the Storage API, you provide only the context tag and name in UploadInitMessage. The UploadResponse message confirms the complete URN for the object.

uri: "urn:nuance-mix:tag:tuning:lang/coffee_app/coffee_dict/en-us/mix.tts?type=userdict"
URN syntax
Syntax Description
urn:nuance-mix:tag:tuning The prefix for all synthesis resources.
lang and language The scope keyword, lang, for dictionaries and rulesets, plus the language in the format xx-xx.
voice and voice The scope keyword, voice, for ActivePrompt databases, plus the voice name.
audio The scope keyword, audio, for audio files.
context_tag A name for the collection of objects being stored. This can be a Context Tag from a Mix project or another collective name. If the context tag does not exist, it will be created.
name An identifier for the content being uploaded, using 1 to 64 alphanumeric characters or underscore (a-z, A-Z, 0-9, _).
mix.tts The suffix for all synthesis resources.
?type=resource_type An informational field returned by UploadResponse that identifies the type of resource. This field is not required when using the URN in a synthesis request, although it may be included without error.

Examples of URNs:

User dictionary:
urn:nuance-mix:tag:tuning:lang/coffee_app/coffee_dict/en-us/mix.tts

Text ruleset:
urn:nuance-mix:tag:tuning:lang/coffee_app/coffee_rules/en-us/mix.tts

ActivePrompt database:
urn:nuance-mix:tag:tuning:voice/coffee_app/coffee_prompts/Evan/mix.tts

Audio file:
urn:nuance-mix:tag:tuning:audio/coffee_app/thanks/mix.tts
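
A hypothetical helper that assembles these URNs from the syntax described above (the function is illustrative, not part of the API):

def tuning_urn(scope, context_tag, name, qualifier=None):
    """Build a synthesis resource URN.
    scope: "lang" (dictionaries, rulesets), "voice" (ActivePrompt databases),
    or "audio" (audio files). qualifier: the language (xx-xx) or voice name;
    omitted for audio files."""
    parts = [scope, context_tag, name]
    if qualifier:
        parts.append(qualifier)
    return "urn:nuance-mix:tag:tuning:" + "/".join(parts) + "/mix.tts"

tuning_urn("lang", "coffee_app", "coffee_dict", "en-us")
# -> "urn:nuance-mix:tag:tuning:lang/coffee_app/coffee_dict/en-us/mix.tts"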

LanguageIdentificationParameters

Input message controlling the language identifier. Included in Input. The language identifier runs on input blocks labeled with control code lang unknown or the SSML attribute xml:lang="unknown".

By default, the language identifier matches languages to all installed voices. The languages field limits the permissible languages, and also sets the order of precedence (first to last) when they have equal confidence scores.

Language identification parameters
Field Type Description
disable bool Whether to disable language identification. By default, language identification is enabled.
languages string Repeated. List of three-letter language codes (for example, enu, frc, spm) to restrict language identification results, in order of precedence. Use GetVoicesRequest to obtain the three-letter codes, returned in GetVoicesResponse language_tlw. Default blank.
always_use_highest_confidence bool If enabled, language identification always chooses the language with the highest confidence score, even if the score is low. Default false, meaning use language with any confidence.

The LanguageIdentificationParameters message includes:

SynthesisRequest
  voice (Voice)
  input (Input)
    lid_params (LanguageIdentificationParameters)
      disable
      languages
      always_use_highest_confidence

This Input message includes LID parameters to limit the choice of languages to French Canadian (frc) or American English (enu):

SynthesisRequest(
    voice = Voice(
        name = "Evan",
        model = "enhanced"
    ),
    input = Input(
       tokenized_sequence = TokenizedSequence(
            tokens = [
                Token(text = "The name of the song is. "),
                Token(control_code = ControlCode(
                    key = "lang",
                    value = "unknown")),
                Token(text = "Au clair de la lune."),
                Token(control_code = ControlCode(
                    key = "lang",
                    value = "normal")),
                Token(text = "It's a folk song meaning, in the light of the moon.")
            ]
        ),
        lid_params = LanguageIdentificationParameters(
            languages = (["frc", "enu"])
        )
    )
)

DownloadParameters

Input message containing parameters for remote file download, whether for input text (Input.uri) or a SynthesisResource (SynthesisResource.uri). Included in Input.

Download parameters
Field Type Description
headers map<string,string> Map of HTTP header name:value pairs to include in outgoing requests. Supported headers: max_age, max_stale.
request_timeout_ms uint32 Request timeout in ms. Default (0) means server default, usually 30000 (30 seconds).
refuse_cookies bool Whether to disable cookies. By default, HTTP requests accept cookies.

The DownloadParameters message includes:

SynthesisRequest
  voice (Voice)
  input (Input)
    download_params (DownloadParameters)
      headers
      request_timeout_ms
      refuse_cookies
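
For example, download parameters attached to an Input that fetches a remote resource, following the conventions of the other examples (the header and timeout values are illustrative):

Input(
    text = Text(text = "Your coffee will be ready in 5 minutes"),
    resources = [
        SynthesisResource(
            type = USER_DICTIONARY,
            uri = "urn:nuance-mix:tag:tuning:lang/coffee_app/coffee_dict/en-us/mix.tts"
        )
    ],
    download_params = DownloadParameters(
        headers = {"max_age": "86400"},    # illustrative cache directive
        request_timeout_ms = 10000,
        refuse_cookies = True
    )
)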

EventParameters

Input message that defines event subscription parameters. Included in SynthesisRequest. Requested events are sent throughout the SynthesisResponse stream as they are generated. Marker events are sent as specific parts of the synthesized audio are reached, for example, the end of a word, sentence, or user-defined bookmark.

Log events are produced throughout a synthesis request for events such as a voice loaded by the server or an audio chunk being ready to send.

Event parameters
Field Type Description
send_sentence_marker_events bool Sentence marker. Default: do not send.
send_word_marker_events bool Word marker. Default: do not send.
send_phoneme_marker_events bool Phoneme marker. Default: do not send.
send_bookmark_marker_events bool Bookmark marker. Default: do not send.
send_paragraph_marker_events bool Paragraph marker. Default: do not send.
send_visemes bool Lipsync information. Default: do not send.
send_log_events bool Whether to log events during synthesis. By default, logging is turned off.
suppress_input bool Whether to omit input text and URIs from log events. By default, these items are included.

The EventParameters message includes:

SynthesisRequest
  voice (Voice)
  input (Input)
  event_params (EventParameters)
    send_sentence_marker_events
    send_word_marker_events
    send_phoneme_marker_events
    send_bookmark_marker_events
    send_paragraph_marker_events
    send_visemes
    send_log_events
    suppress_input

Event parameters in SynthesisRequest:

SynthesisRequest(
    voice = Voice(
        name = "Evan",
        model = "enhanced"
    ),
    input = Input(
        text = Text(
           text = "Your coffee will be ready in 5 minutes.")
    ),
    event_params = EventParameters(
        send_sentence_marker_events = True,
        send_paragraph_marker_events = True,
        send_log_events = True,
        suppress_input = True
    )
)

SynthesisResponse

The Synthesizer Synthesize method returns a stream of SynthesisResponse messages. (See UnarySynthesisResponse for a non-streamed response.) Each response contains one of:

  • A status response, indicating completion or failure of the request. This is received only once and signifies the end of a Synthesize call.
  • A list of events the client has requested. This can be received many times.
  • An audio buffer. This may be received many times.

Synthesis response
Field Type Description
status Status A status response, indicating completion or failure of the request.
events Events A list of events. See EventParameters for details.
audio bytes The latest audio buffer.

The SynthesisResponse message includes:

SynthesisResponse
  status (Status)
    code
    message
    details
  events (Events)
    event (Event)
      name
      values
  audio

Response to synthesis request:

from google.protobuf import text_format  # for printing event messages

audio_file = None  # stays None when no output file is requested
try:
    if args.output_audio_file:
        audio_file = open(args.output_audio_file, "wb")
    for response in stream_in:
        if response.HasField("audio"):
            print("Received audio: %d bytes" % len(response.audio))
            if audio_file:
                audio_file.write(response.audio)
        elif response.HasField("events"):
            print("Received events")
            print(text_format.MessageToString(response.events))
        else:
            if response.status.code == 200:
                print("Received status response: SUCCESS")
            else:
                print("Received status response: FAILED")
                print("Code: {}, Message: {}".format(response.status.code, response.status.message))
                print("Error: {}".format(response.status.details))
except Exception as e:
    print(e)
if audio_file:
    print("Saved audio to {}".format(args.output_audio_file))
    audio_file.close()

These results show synthesis of a simple text string using the Evan voice, with a user dictionary included:

2023-09-26 15:45:42,436 (139898668111680) INFO  Iteration #1
2023-09-26 15:45:42,439 (139898668111680) DEBUG Creating secure gRPC channel
2023-09-26 15:45:42,444 (139898668111680) INFO  Running file [flow.py]
2023-09-26 15:45:42,444 (139898668111680) DEBUG [voice {
  name: "Evan"
  model: "enhanced"
}
audio_params {
  audio_format {
    pcm {
      sample_rate_hz: 22050
    }
  }
  volume_percentage: 80
  speaking_rate_factor: 1.0
  audio_chunk_duration_ms: 2000
}
input {
  text {
    text: "This is a test. A very simple test."
  }
  resources {
    uri: "urn:nuance-mix:tag:tuning:lang/coffee_app/coffee_dict/en-us/mix.tts"
  }
}
user_id: "MyApplicationUser"
]
2023-09-26 15:45:42,444 (139898668111680) INFO  Sending Synthesis request
2023-09-26 15:45:42,631 (139898668111680) INFO  Received audio: 57484 bytes
2023-09-26 15:45:42,655 (139898668111680) INFO  Received audio: 70432 bytes
2023-09-26 15:45:42,657 (139898668111680) INFO  Received status response: SUCCESS
2023-09-26 15:45:42,657 (139898668111680) INFO  Wrote audio to flow.py_i1_s1.wav
2023-09-26 15:45:42,657 (139898668111680) INFO  Done running file [flow.py]
2023-09-26 15:45:42,660 (139898668111680) INFO  Iteration #1 complete
2023-09-26 15:45:42,660 (139898668111680) INFO  Done

Status

Output message containing a status response, indicating completion or failure of a Synthesize call. Included in SynthesisResponse and UnarySynthesisResponse.

Status response
Field Type Description
code uint32 HTTP-style return code: 200, 4xx, or 5xx as appropriate. See Status codes.
message string Brief description of the status.
details string Longer description if available.

Events

Output message defining a container for a list of events. This container is needed because oneof does not allow repeated parameters. Included in SynthesisResponse and UnarySynthesisResponse.

Events
Field Type Description
events Event Repeated. One or more events.

Events are returned when send_log_events is True in the request’s EventParameters.

  Results with events  

For a description of the NVOC events in the results, see TTS payload: callsummary.

Event

Output message defining an event message. Included in Events. See EventParameters for details.

Event
Field Type Description
name string Either “Markers” or the name of the event in the case of a Log Event.
values map<string,string> Map of key:value data relevant to the current event.

UnarySynthesisResponse

The Synthesizer UnarySynthesize method returns a single UnarySynthesisResponse message. It carries the same information as SynthesisResponse but delivers it all at once rather than as a stream. The response contains:

  • A status response, indicating completion or failure of the request.
  • A list of events the client has requested.
  • The complete audio buffer of the synthesized text.

Unary synthesis response
Field Type Description
status Status A status response, indicating completion or failure of the request.
events Events A list of events. See EventParameters for details.
audio bytes Audio buffer of the synthesized text, capped if necessary to a configured audio response size.

The UnarySynthesisResponse message includes:

UnarySynthesisResponse
  status (Status)
    code
    message
    details
  events (Events)
    event (Event)
      name
      values
  audio
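
A minimal sketch of handling the unary response, assuming the stub and request from the earlier sketches:

response = stub.UnarySynthesize(request)
if response.status.code == 200:
    with open("output.pcm", "wb") as f:
        f.write(response.audio)    # the complete audio buffer, possibly capped
else:
    print("Code: {}, Message: {}".format(
        response.status.code, response.status.message))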

Scalar value types

The data types in the proto files are mapped to equivalent types in the generated client stub files.

Scalar data types
Proto Notes C++ Java Python
double double double float
float float float float
int32 Uses variable-length encoding. Inefficient for encoding negative numbers. If your field is likely to have negative values, use sint32 instead. int32 int int
int64 Uses variable-length encoding. Inefficient for encoding negative numbers. If your field is likely to have negative values, use sint64 instead. int64 long int/long
uint32 Uses variable-length encoding. uint32 int int/long
uint64 Uses variable-length encoding. uint64 long int/long
sint32 Uses variable-length encoding. Signed int value. These encode negative numbers more efficiently than regular int32s. int32 int int
sint64 Uses variable-length encoding. Signed int value. These encode negative numbers more efficiently than regular int64s. int64 long int/long
fixed32 Always four bytes. More efficient than uint32 if values are often greater than 2^28. uint32 int int
fixed64 Always eight bytes. More efficient than uint64 if values are often greater than 2^56. uint64 long int/long
sfixed32 Always four bytes. int32 int int
sfixed64 Always eight bytes. int64 long int/long
bool bool boolean bool
string A string must always contain UTF-8 encoded or 7-bit ASCII text. string String str/unicode
bytes May contain any arbitrary sequence of bytes. string ByteString str