Synthesizer gRPC API

The Synthesizer gRPC API contains methods for requesting speech synthesis from TTSaaS, using standard and enhanced voices.

If you wish to use Microsoft neural voices, use Neural TTSaaS instead; see Synthesizer gRPC API for Neural TTSaaS in the Neural TTSaaS documentation.

Proto file structure

The Synthesizer API is defined in the synthesizer.proto file.

└── nuance
    ├── rpc (RPC message files)
    └── tts
        ├── storage
        │   └── v1beta1
        │       └── storage.proto
        └── v1
            └── synthesizer.proto

The proto file defines a Synthesizer service with three RPC methods: GetVoices, Synthesize, and UnarySynthesize.

  Proto file fields for GetVoices  
  Proto file fields for Synthesize and UnarySynthesize  

Synthesizer

The Synthesizer service offers three methods related to voice synthesis.

Synthesizer service
Name Request Response Description
GetVoices GetVoicesRequest GetVoicesResponse Queries the list of available voices, with filters to reduce the search space.
Synthesize SynthesisRequest SynthesisResponse stream Synthesizes audio from input text and parameters, and returns an audio stream.
UnarySynthesize SynthesisRequest UnarySynthesisResponse Synthesizes audio and returns a single (unary) audio response.
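
The sketch below shows how these methods might be called from Python stubs generated from synthesizer.proto. This is a minimal sketch: the module paths follow the proto package, the endpoint is hypothetical, and authorization metadata is omitted.

import grpc
from nuance.tts.v1 import synthesizer_pb2, synthesizer_pb2_grpc

# Hypothetical endpoint; real deployments also attach an authorization token
channel = grpc.secure_channel("tts.example.com:443", grpc.ssl_channel_credentials())
stub = synthesizer_pb2_grpc.SynthesizerStub(channel)

# GetVoices: one request, one response
voices_response = stub.GetVoices(synthesizer_pb2.GetVoicesRequest())

request = synthesizer_pb2.SynthesisRequest(
    voice=synthesizer_pb2.Voice(name="Evan", model="enhanced"),
    input=synthesizer_pb2.Input(
        text=synthesizer_pb2.Text(text="Hello")))

# Synthesize: one request, a stream of responses
for response in stub.Synthesize(request):
    pass  # each response holds a status, events, or an audio buffer

# UnarySynthesize: one request, one response
unary_response = stub.UnarySynthesize(request)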

Streamed vs. unary response

TTSaaS offers two types of synthesis response: a streamed response in SynthesisResponse and a non-streamed response in UnarySynthesisResponse.

The request is the same in both cases: SynthesisRequest specifies a voice, the input text to synthesize, and optional parameters. The response can be either:

  • SynthesisResponse: Returns one status message followed by multiple streamed audio buffers, each including the markers or other events specified in the request. Each audio buffer contains the latest synthesized audio.

  • UnarySynthesisResponse: Returns one status message and one audio buffer, containing all the markers and events specified in the request. The underlying TTSaaS engine caps the audio response size.

    See Run client for unary response to run the sample Python client with a unary response, activated by a command line flag.

One request, two possible responses (from proto file):

service Synthesizer {
    rpc Synthesize(SynthesisRequest) returns (stream SynthesisResponse) {} 
    rpc UnarySynthesize(SynthesisRequest) returns (UnarySynthesisResponse) {}
. . .
message SynthesisRequest { 
    Voice voice = 1;  
    AudioParameters audio_params = 2; 
    Input input = 3;   
    EventParameters event_params = 4;  
    map<string, string> client_data = 5; 
    string user_id = 6;
}

message SynthesisResponse {
    oneof response {
        Status status = 1;
        Events events = 2;
        bytes audio = 3;
    }
}

message UnarySynthesisResponse {
    Status status = 1;
    Events events = 2;
    bytes audio = 3;
}
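
A minimal sketch of the difference on the client side, assuming the stub and request from the earlier sketch:

# Streamed: iterate the response stream, appending audio as it arrives
audio = b""
for response in stub.Synthesize(request):
    if response.HasField("audio"):
        audio += response.audio

# Unary: a single response carries the status, events, and complete audio buffer
response = stub.UnarySynthesize(request)
if response.status.code == 200:
    audio = response.audio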

GetVoicesRequest

Input message for Synthesizer: GetVoices, to query the voices available to the client.

Get voices request
Field Type Description
voice Voice Optional. Filter the voices to retrieve. For example, set language to en-US to return only American English voices.

The GetVoicesRequest message includes:

GetVoicesRequest
  voice (Voice)
    name
    model
    language
    age_group (EnumAgeGroup)
    gender (EnumGender)
    sample_rate_hz
    language_tlw

For example:

# This retrieves all American English voices
GetVoicesRequest (
    voice = Voice (language = "en-us")
)

# This returns one named voice
GetVoicesRequest (
    voice = Voice (name = "Evan")
)

# This returns all female American English voices
GetVoicesRequest (
    voice = Voice (
        gender = EnumGender.FEMALE,
        language = "en-us"
    )
)
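
Sent through the stub from the earlier sketch, the first request might be handled as follows (a sketch, not the full sample client):

request = synthesizer_pb2.GetVoicesRequest(
    voice=synthesizer_pb2.Voice(language="en-us"))
response = stub.GetVoices(request)
for voice in response.voices:
    # Each entry is a Voice message; see the Voice section below
    print(voice.name, voice.model, voice.language, voice.sample_rate_hz)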

Voice

Input or output message for voices:

These fields are supported in all cases:

Voice message
Field Type Description
name string The voice’s name, for example, Evan. Mandatory for SynthesisRequest.
model string The voice’s quality model, for example, enhanced or standard. Mandatory for SynthesisRequest.

These Voice fields are used only in GetVoicesRequest and GetVoicesResponse. They are ignored in SynthesisRequest.

Voice fields
Field Type Description
language string IETF language code, for example, en-US. Search for voices with a specific language. Some voices support multiple languages.
age_group EnumAgeGroup Search for adult or child voices.
gender EnumGender Search for voices with a certain gender.
sample_rate_hz uint32 Search for a certain native sample rate.
language_tlw string Three-letter language code (for example, enu for American English) for configuring language identification in Input.
restricted bool Used only in GetVoicesResponse, to identify restricted voices (restricted: true). These are custom voices available only to specific customers. Default is false, meaning the voice is public.
version string Used only in GetVoicesResponse, to return the voice’s version.
foreign_languages string Repeated. Used only in GetVoicesResponse, to return the foreign languages of a multilingual voice.

The Voice message includes different fields depending on the context:

GetVoicesRequest
  voice (Voice)
    name
    model
    language
    age_group (EnumAgeGroup)
    gender (EnumGender)
    sample_rate_hz
    language_tlw
 
GetVoicesResponse
  voices (Voice)
    name
    model
    language
    age_group (EnumAgeGroup)
    gender (EnumGender)
    sample_rate_hz
    language_tlw
    restricted
    version
    foreign_languages
 
SynthesisRequest
  voice (Voice)
    name
    model

EnumAgeGroup

Input field for GetVoicesRequest or output field for GetVoicesResponse, specifying whether the voice uses its adult or child version, if available. Included in Voice.

Age group
Name Number Description
ADULT 0 Adult voice. Default for GetVoicesRequest.
CHILD 1 Child voice.

EnumGender

Input field for GetVoicesRequest or output field for GetVoicesResponse, specifying gender for voices that support multiple genders. Included in Voice.

Gender
Name Number Description
ANY 0 Any gender voice. Default for GetVoicesRequest.
MALE 1 Male voice.
FEMALE 2 Female voice.
NEUTRAL 3 Neutral gender voice.

GetVoicesResponse

Output message for Synthesizer: GetVoices. Includes a list of voices that matched the input criteria, if any.

Get voices responses
Field Type Description
voices Voice Repeated. Voices and characteristics returned.

The GetVoicesResponse message includes:

GetVoicesResponse
  voices (Voice)
    name
    model
    language
    age_group (EnumAgeGroup)
    gender (EnumGender)
    sample_rate_hz
    language_tlw
    restricted
    version
    foreign_languages

This response to GetVoicesRequest returns all American English (en-us) voices:

2023-09-26 15:51:16,151 (139911033857856) INFO  Iteration #1
2023-09-26 15:51:16,154 (139911033857856) DEBUG Creating secure gRPC channel
2023-09-26 15:51:16,161 (139911033857856) INFO  Running file [flow.py]
2023-09-26 15:51:16,161 (139911033857856) DEBUG [voice {
  language: "en-us"
}
]
2023-09-26 15:51:16,161 (139911033857856) INFO  Sending GetVoices request
2023-09-26 15:51:16,367 (139911033857856) INFO  voices {
  name: "Allison"
  model: "standard"
  language: "en-US"
  gender: FEMALE
  sample_rate_hz: 22050
  language_tlw: "enu"
  version: "2.0.0"
}
voices {
  name: "Ava-Ml"
  model: "enhanced"
  language: "en-US"
  gender: FEMALE
  sample_rate_hz: 22050
  language_tlw: "enu"
  version: "3.0.1"
  foreign_languages: "es-MX"
}
voices {
  name: "Chloe"
  model: "standard"
  language: "en-US"
  gender: FEMALE
  sample_rate_hz: 22050
  language_tlw: "enu"
  version: "5.2.3.15315"
}
voices {
  name: "Chloe"
  model: "standard"
  language: "en-US"
  gender: FEMALE
  sample_rate_hz: 8000
  language_tlw: "enu"
  version: "5.2.3.15315"
}
voices {
  name: "Erica"
  model: "standard"
  language: "en-US"
  gender: FEMALE
  sample_rate_hz: 22050
  language_tlw: "enu"
  restricted: true
  version: "1.0.2"
}
voices {
  name: "Erica"
  model: "standard"
  language: "en-US"
  gender: FEMALE
  sample_rate_hz: 8000
  language_tlw: "enu"
  restricted: true
  version: "1.0.2"
}
voices {
  name: "Evan"
  model: "enhanced"
  language: "en-US"
  gender: MALE
  sample_rate_hz: 22050
  language_tlw: "enu"
  version: "1.1.1"
}
voices {
  name: "Evelyn"
  model: "standard"
  language: "en-US"
  gender: FEMALE
  sample_rate_hz: 22050
  language_tlw: "enu"
  version: "5.2.3.15114"
}
voices {
  name: "Evelyn"
  model: "standard"
  language: "en-US"
  gender: FEMALE
  sample_rate_hz: 8000
  language_tlw: "enu"
  version: "5.2.3.15114"
}
voices {
  name: "Nathan"
  model: "enhanced"
  language: "en-US"
  gender: MALE
  sample_rate_hz: 22050
  language_tlw: "enu"
  version: "4.1.1"
}
voices {
  name: "Nolan"
  model: "standard"
  language: "en-US"
  gender: MALE
  sample_rate_hz: 22050
  language_tlw: "enu"
  version: "5.2.3.15315"
}
voices {
  name: "Nolan"
  model: "standard"
  language: "en-US"
  gender: MALE
  sample_rate_hz: 8000
  language_tlw: "enu"
  version: "5.2.3.15315"
}
voices {
  name: "Samantha"
  model: "standard"
  language: "en-US"
  gender: FEMALE
  sample_rate_hz: 22050
  language_tlw: "enu"
  version: "2.0.0"
}
voices {
  name: "Susan"
  model: "standard"
  language: "en-US"
  gender: FEMALE
  sample_rate_hz: 22050
  language_tlw: "enu"
  version: "2.0.0"
}
voices {
  name: "Tom"
  model: "standard"
  language: "en-US"
  gender: MALE
  sample_rate_hz: 22050
  language_tlw: "enu"
  version: "3.2.1"
}
voices {
  name: "Zoe-Ml"
  model: "enhanced"
  language: "en-US"
  gender: FEMALE
  sample_rate_hz: 22050
  language_tlw: "enu"
  version: "2.0.0"
  foreign_languages: "es-MX"
  foreign_languages: "fr-CA"
}

2023-09-26 15:51:16,368 (139911033857856) INFO  Done running file [flow.py]
2023-09-26 15:51:16,369 (139911033857856) INFO  Iteration #1 complete
2023-09-26 15:51:16,369 (139911033857856) INFO  Done

SynthesisRequest

Input message for Synthesizer: Synthesize. Specifies input text, audio parameters, and events to subscribe to, in exchange for synthesized audio. See Defaults for default values for optional fields.

Synthesis request
Field Type Description
voice Voice Mandatory. The voice to use for audio synthesis.
audio_params AudioParameters Output audio parameters, such as encoding and volume. Default is PCM audio at 22050 Hz.
input Input Mandatory. Input text to synthesize, tuning data, etc.
event_params EventParameters Markers and other info to include in server events returned during synthesis.
client_data map<string,string> Map of client-supplied key:value pairs to inject into the event log.
user_id string Identifies a specific user within the application.

The SynthesisRequest message includes:

SynthesisRequest
  voice (Voice)
    name
    model
  audio_params (AudioParameters)
    audio_format (AudioFormat)
    volume_percentage
    speaking_rate_factor
    audio_chunk_duration_ms
    target_audio_length_ms
    disable_early_emission
  input (Input)
    text (Text)
    ssml (SSML)
    tokenized_sequence (TokenizedSequence)
    resources (SynthesisResource)
    lid_params (LanguageIdentificationParameters)
    download_params (DownloadParameters)
  event_params (EventParameters)
    send_sentence_marker_events
    send_word_marker_events
    send_phoneme_marker_events
    send_bookmark_marker_events
    send_paragraph_marker_events
    send_visemes
    send_log_events
    suppress_input
  client_data
  user_id

This synthesis request includes most fields:

SynthesisRequest(
    voice = Voice(
        name = "Evan",
        model = "enhanced"
    ),
    audio_params = AudioParameters(
        audio_format = AudioFormat(
            pcm = PCM(sample_rate_hz = 22050)
        ),
        volume_percentage = 80,       # Default value
        speaking_rate_factor = 1.0    # Default value
    ),
    input = Input(
        text = Text(
           text = "Your coffee will be ready in 5 minutes")
    ),
    event_params = EventParameters(
        send_log_events = True,
        suppress_input = True
    ),
    client_data = {'company':'Aardvark Coffee','user':'Leslie'},
    user_id = "leslie.somebody@aardvark.com"
)

This minimal synthesis request uses all defaults:

SynthesisRequest(
    voice = Voice(
        name = "Evan",
        model = "enhanced"
    ),
    input = Input(
        text = Text(
           text = "Your coffee will be ready in 5 minutes")
    )
)

AudioParameters

Input message for audio-related parameters during synthesis, including encoding, volume, and audio length. Included in SynthesisRequest.

Audio parameters
Field Type Description
audio_format AudioFormat Audio encoding. Default PCM 22050 Hz.
volume_percentage uint32 Volume amplitude, from 0 to 100. Default 80.
speaking_rate_factor float Speaking rate, from 0 to 2.0. Default 1.0.
audio_chunk_duration_ms uint32 Maximum duration, in ms, of an audio chunk delivered to the client, from 1 to 60000. Default is 20000 (20 seconds). When this parameter is large enough (for example, 20 or 30 seconds), each audio chunk contains an audible segment surrounded by silence.
target_audio_length_ms uint32 Maximum duration, in ms, of synthesized audio. When greater than 0, the server stops ongoing synthesis at the first sentence end, or silence, closest to the value.
disable_early_emission bool By default, audio segments are emitted as soon as possible, even if they are not audible. Set this field to disable that behavior.

The AudioParameters message includes:

SynthesisRequest
  voice (Voice)
  audio_params (AudioParameters)
    audio_format (AudioFormat)
      pcm (PCM)
      alaw (Alaw)
      ulaw (Ulaw)
      ogg_opus (OggOpus)
      opus (Opus)
    volume_percentage
    speaking_rate_factor
    audio_chunk_duration_ms
    target_audio_length_ms
    disable_early_emission

AudioFormat

Input message for audio encoding of synthesized text. Included in AudioParameters.

Audio format
Field Type Description
pcm PCM Signed 16-bit little endian PCM.
alaw ALaw G.711 A-law, 8kHz.
ulaw ULaw G.711 Mu-law, 8kHz.
ogg_opus OggOpus Ogg Opus, 8 kHz, 16 kHz, or 24 kHz.
opus Opus Opus, 8 kHz, 16 kHz, or 24 kHz. The audio is sent one Opus packet at a time.

The AudioFormat message includes:

SynthesisRequest
  voice (Voice)
  audio_params (AudioParameters)
    audio_format (AudioFormat)
      pcm (PCM)
        sample_rate_hz
      alaw (Alaw)
      ulaw (Ulaw)
      ogg_opus (OggOpus)
        sample_rate_hz
        bit_rate_bps
        max_frame_duration_ms
        complexity
        vbr (EnumVariableBitrate)
      opus (Opus)
        sample_rate_hz
        bit_rate_bps
        max_frame_duration_ms
        complexity
        vbr (EnumVariableBitrate)

The PCM audio format is shown in this example, with alternatives in commented lines:

SynthesisRequest(
    voice = Voice(
        name = "Evan",
        model = "enhanced"
    ),
    audio_params = AudioParameters(
        audio_format = AudioFormat(
            pcm = PCM(sample_rate_hz = 22050)
#           alaw = ALaw()
#           ulaw = ULaw()
#           ogg_opus = OggOpus(sample_rate_hz = 16000)
#           opus = Opus(sample_rate_hz = 8000, bit_rate_bps = 30000)
        )
    )
)

PCM

Input message defining PCM sample rate. Included in AudioFormat.

PCM audio
Field Type Description
sample_rate_hz uint32 Output sample rate in Hz. Supported values: 8000, 11025, 16000, 22050, 24000.
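
PCM audio is returned as raw samples with no file header. If you need a playable WAV file, here is a small sketch using Python's standard wave module, assuming mono output at the requested sample rate:

import wave

def save_wav(pcm_bytes, path, sample_rate_hz=22050):
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)        # assumed mono output
        wav.setsampwidth(2)        # signed 16-bit little endian samples
        wav.setframerate(sample_rate_hz)
        wav.writeframes(pcm_bytes)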

ALaw

Input message defining the A-law audio format. Included in AudioFormat. G.711 audio formats are set to 8 kHz.

ULaw

Input message defining the Mu-law audio format. Included in AudioFormat. G.711 audio formats are set to 8 kHz.

OggOpus

Input message defining Ogg Opus output rate. Included in AudioFormat.

Ogg Opus audio
Field Type Description
sample_rate_hz uint32 Output sample rate in Hz. Supported values: 8000, 16000, 24000.
bit_rate_bps uint32 Valid range is 500 to 256000 bps. Default 28000.
max_frame_duration_ms float Opus frame size in ms: 2.5, 5, 10, 20, 40, 60. Default 20.
complexity uint32 Computational complexity. A complexity of 0 means the codec default.
vbr EnumVariableBitrate Variable bitrate. On by default.

Opus

Input message defining Opus output rate. Included in AudioFormat.

Opus audio
Field Type Description
sample_rate_hz uint32 Output sample rate in Hz. Supported values: 8000, 16000, 24000.
bit_rate_bps uint32 Valid range is 500 to 256000 bps. Default 28000.
max_frame_duration_ms float Opus frame size in ms: 2.5, 5, 10, 20, 40, 60. Default 20.
complexity uint32 Computational complexity. A complexity of 0 means the codec default.
vbr EnumVariableBitrate Variable bitrate. On by default.

EnumVariableBitrate

Settings for variable bitrate. Included in OggOpus and Opus. Turned on by default.

Variable bitrate
Name Number Description
VARIABLE_BITRATE_ON 0 Use variable bitrate. Default.
VARIABLE_BITRATE_OFF 1 Do not use variable bitrate.
VARIABLE_BITRATE_CONSTRAINED 2 Use constrained variable bitrate.
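
For example, an Ogg Opus request with constrained variable bitrate, following the same conventions as the other examples in this section (the parameter values are illustrative):

SynthesisRequest(
    voice = Voice(
        name = "Evan",
        model = "enhanced"
    ),
    audio_params = AudioParameters(
        audio_format = AudioFormat(
            ogg_opus = OggOpus(
                sample_rate_hz = 24000,
                bit_rate_bps = 28000,          # Default value
                max_frame_duration_ms = 20,    # Default value
                vbr = VARIABLE_BITRATE_CONSTRAINED
            )
        )
    ),
    input = Input(
        text = Text(text = "Your coffee will be ready in 5 minutes")
    )
)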

Input

Input message containing the text to synthesize and related synthesis parameters, such as tuning data. Included in SynthesisRequest. The type of input may be plain text, SSML, or a sequence of plain text and Nuance control codes. See Input to synthesize for more examples.

Input
Field Type Description
text Text Plain text input.
ssml SSML SSML input, including text and SSML elements.
tokenized_sequence TokenizedSequence Sequence of text and Nuance control codes.
resources SynthesisResource Repeated. Synthesis resources (user dictionaries, rulesets, etc.) to tune synthesized audio. Default blank.
lid_params LanguageIdentificationParameters LID parameters.
download_params DownloadParameters Remote file download parameters.

The Input message includes:

SynthesisRequest
  voice (Voice)
  input (Input)
    text (Text)
      text
      uri
    ssml (SSML)
      text
      uri
      ssml_validation_mode (EnumSSMLValidationMode)
    tokenized_sequence (TokenizedSequence)
      tokens (Token)

Text

Input message for synthesizing plain text. The encoding must be UTF-8. Included in Input.

Text input
Field Type Description
text string Plain input text in UTF-8 encoding.
uri string Remote URI to the plain input text. Not supported in Nuance-hosted TTS.

This example shows plain text input:

SynthesisRequest(
   voice = Voice(
       name = "Evan",
       model = "enhanced"
    ),
    input = Input(
        text = Text(
           text = "Your coffee will be ready in 5 minutes")
    ),
)

SSML

Input message for synthesizing SSML input. See SSML input for a list of supported elements and examples.

SSML input
Field Type Description
text string SSML input text and elements.
uri string Remote URI to the SSML input text. Not supported in Nuance-hosted TTS.
ssml_validation_mode EnumSSMLValidationMode SSML validation mode. Default STRICT.

This input contains SSML:

SynthesisRequest(
   voice = Voice(
       name = "Evan",
       model = "enhanced"
    ),
    input = Input(
        ssml = SSML(
            text = '<?xml version="1.0"?><speak  xmlns="http://www.w3.org/2001/10/synthesis"
xml:lang="en-US" version="1.0">This is the normal volume of my voice.
<prosody volume="10">I can speak rather quietly, </prosody>
<prosody volume="90">But also very loudly.</prosody></speak>',
            ssml_validation_mode = WARN
        )
    )
)

The XML declaration and the speak attributes may be omitted:

SynthesisRequest(
   voice = Voice(
       name = "Evan",
       model = "enhanced"
    ),
    input = Input(
        ssml = SSML(
            text = '<speak>This is the normal volume of my voice.
<prosody volume="10">I can speak rather quietly,</prosody>
<prosody volume="90">But also very loudly.</prosody></speak>',
            ssml_validation_mode = WARN
        )
    )
)

EnumSSMLValidationMode

SSML validation mode when using SSML input. Included in SSML. Strict by default but can be relaxed.

SSML validation mode
Name Number Description
STRICT 0 Strict SSML validation. Default.
WARN 1 Give warning only.
NONE 2 Do not validate.

TokenizedSequence

Input message for synthesizing a sequence of plain text and Nuance control codes. Included in Input. See Tokenized sequence for a list of supported codes and examples.

Tokenized sequence
Field Type Description
tokens Token Repeated. Sequence of text and control codes.

The TokenizedSequence message includes:

SynthesisRequest
  voice (Voice)
  input (Input)
    tokenized_sequence (TokenizedSequence)
      tokens (Token)
        text
        control_code (ControlCode)
          key
          value

This input is a sequence of tokens that mixes text and control codes:

SynthesisRequest(
   voice = Voice(
       name = "Evan",
       model = "enhanced"
    ),
    input = Input(
        tokenized_sequence = TokenizedSequence(
            tokens = [
                Token(control_code = ControlCode(
                    key = "vol",
                    value = "10")),
                Token(text = "I can speak rather quietly,"),
                Token(control_code = ControlCode(
                    key = "vol",
                    value = "90")),
                Token(text = "but also very loudly.")
            ]
        )
    )
)

Token

The unit when using TokenizedSequence for input. Each token can be either plain text or a Nuance control code.

Token
Field Type Description
text string Plain input text.
control_code ControlCode Nuance control code.

ControlCode

Nuance control code that specifies how text should be spoken, similar to SSML. Included in Token.

Control code
Field Type Description
key string Name of the control code, for example, pause.
value string Value of the control code.
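
For example, a pause control code between two text tokens, in the same style as the other examples in this section (treating the value as a pause length in milliseconds is an assumption based on the code's name):

Input(
    tokenized_sequence = TokenizedSequence(
        tokens = [
            Token(text = "Your coffee will be ready"),
            Token(control_code = ControlCode(
                key = "pause",
                value = "300")),    # assumed to be a pause length in ms
            Token(text = "in 5 minutes.")
        ]
    )
)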

SynthesisResource

Input message specifying the type of file to tune the synthesized output and its location or contents. Included in Input. See Synthesis resources.

Synthesis resource
Field Type Description
type EnumResourceType Resource type, for example, user dictionary. Default USER_DICTIONARY.
uri string The URN of a resource previously uploaded to cloud storage with the Storage API. See URNs for the format.
body bytes For EnumResourceType USER_DICTIONARY, the contents of the file. See Inline dictionary for an example.

The SynthesisResource message includes:

SynthesisRequest
  voice (Voice)
  input (Input)
    resources (SynthesisResource)
      type (EnumResourceType)
      uri
      body

This request includes an inline compiled user dictionary (with body):

SynthesisRequest (
    voice = Voice (name = "Evan", model = "enhanced"),
    input = Input (
        text = Text (text = "Your coffee will be ready in 5 minutes"),
        resources =  [
            SynthesisResource (
                type = USER_DICTIONARY,
                body = open("/path/to/user_dictionary.dcb", 'rb').read()
            )
        ]
    )
)

This request includes an external user dictionary:

SynthesisRequest (
    voice = Voice (name = "Evan", model = "enhanced"),
    input = Input (
        text = Text (text = "Your coffee will be ready in 5 minutes"),
        resources =  [
            SynthesisResource (
                type = USER_DICTIONARY,
                uri = "urn:nuance-mix:tag:tuning:lang/coffee_app/coffee_dict/en-us/mix.tts"
            )
        ]
    )
)

This includes an ActivePrompt database:

SynthesisRequest (
    voice = Voice (name = "Evan", model = "enhanced"),
    input = Input (
        text = Text (text = "Your coffee will be ready in 5 minutes"),
        resources =  [
            SynthesisResource (
                type = ACTIVEPROMPT_DB,
                uri = "urn:nuance-mix:tag:tuning:voice/coffee_app/coffee_prompts/Evan/mix.tts"
            )
        ]
    )
)

And this includes a user ruleset:

SynthesisRequest (
    voice = Voice (name = "Evan", model = "enhanced"),
    input = Input (
        text = Text (text = "Your coffee will be ready in 5 minutes"),
        resources =  [
            SynthesisResource (
                type = TEXT_USER_RULESET,
                uri = "urn:nuance-mix:tag:tuning:lang/coffee_app/coffee_rules/en-us/mix.tts"
            )
        ]
    )
)

EnumResourceType

The type of synthesis resource to tune the output. Included in SynthesisResource. User dictionaries provide custom pronunciations, rulesets apply search-and-replace rules to input text, and ActivePrompt databases help tune synthesized audio under certain conditions, using Nuance Vocalizer Studio.

Resource types
Name Number Description
USER_DICTIONARY 0 User dictionary (application/edct-bin-dictionary). Default.
TEXT_USER_RULESET 1 Text user ruleset (application/x-vocalizer-rettt+text).
BINARY_USER_RULESET 2 Not supported. Binary user ruleset (application/x-vocalizer-rettt+bin).
ACTIVEPROMPT_DB 3 ActivePrompt database (application/x-vocalizer-activeprompt-db).
ACTIVEPROMPT_DB_AUTO 4 ActivePrompt database with automatic insertion (application/x-vocalizer-activeprompt-db;mode=automatic). This type accepts the same ActivePrompt databases as ACTIVEPROMPT_DB but enables automatic insertion.

SYSTEM_DICTIONARY 5 Nuance system dictionary (application/sdct-bin-dictionary). Not supported.

URNs

The uri field in SynthesisResource defines the location of a synthesis resource as a URN in the Mix cloud storage area. In SSML and TokenizedSequence input, the audio tag or code references a WAV file as a URN. The format depends on the object type:

  • User dictionaries and rulesets: urn:nuance-mix:tag:tuning:lang/context_tag/name/language/mix.tts

  • ActivePrompt databases: urn:nuance-mix:tag:tuning:voice/context_tag/name/voice/mix.tts

  • Audio files: urn:nuance-mix:tag:tuning:audio/context_tag/name/mix.tts

When you upload these resources using the Storage API, you provide only the context tag and name in UploadInitMessage. The UploadResponse message confirms the complete URN for the object.

uri: "urn:nuance-mix:tag:tuning:lang/coffee_app/coffee_dict/en-us/mix.tts?type=userdict"
URN syntax
Syntax Description
urn:nuance-mix:tag:tuning The prefix for all synthesis resources.
lang and language The scope keyword, lang, for dictionaries and rulesets, plus the language in the format xx-xx.
voice and voice The scope keyword, voice, for ActivePrompt databases, plus the voice name.
audio The scope keyword, audio, for audio files.
context_tag A name for the collection of objects being stored. This can be a Context Tag from a Mix project or another collective name. If the context tag does not exist, it will be created.
name An identifier for the content being uploaded, using 1 to 64 alphanumeric characters or underscore (a-z, A-Z, 0-9, _).
mix.tts The suffix for all synthesis resources.
?type=resource_type An informational field returned by UploadResponse that identifies the type of resource. This field is not required when using the URN in a synthesis request, although it may be included without error.

Examples of URNs:

User dictionary:
urn:nuance-mix:tag:tuning:lang/coffee_app/coffee_dict/en-us/mix.tts

Text ruleset:
urn:nuance-mix:tag:tuning:lang/coffee_app/coffee_rules/en-us/mix.tts

ActivePrompt database:
urn:nuance-mix:tag:tuning:voice/coffee_app/coffee_prompts/Evan/mix.tts

Audio file:
urn:nuance-mix:tag:tuning:audio/coffee_app/thanks/mix.tts
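
A hypothetical helper that assembles these URNs from the syntax described above (the function is illustrative, not part of the API):

def tuning_urn(scope, context_tag, name, qualifier=None):
    """Build a synthesis resource URN.
    scope: "lang" (dictionaries, rulesets), "voice" (ActivePrompt databases),
    or "audio" (audio files). qualifier: the language (xx-xx) or voice name;
    omitted for audio files."""
    parts = [scope, context_tag, name]
    if qualifier:
        parts.append(qualifier)
    return "urn:nuance-mix:tag:tuning:" + "/".join(parts) + "/mix.tts"

tuning_urn("lang", "coffee_app", "coffee_dict", "en-us")
# -> "urn:nuance-mix:tag:tuning:lang/coffee_app/coffee_dict/en-us/mix.tts"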

LanguageIdentificationParameters

Input message controlling the language identifier. Included in Input. The language identifier runs on input blocks labeled with control code lang unknown or the SSML attribute xml:lang="unknown".

By default, the language identifier matches languages to all installed voices. The languages field limits the permissible languages, and also sets the order of precedence (first to last) when they have equal confidence scores.

Language identification parameters
Field Type Description
disable bool Whether to disable language identification. By default, language identification is enabled.
languages string Repeated. List of three-letter language codes (for example, enu, frc, spm) to restrict language identification results, in order of precedence. Use GetVoicesRequest to obtain the three-letter codes, returned in GetVoicesResponse language_tlw. Default blank.
always_use_highest_confidence bool If enabled, language identification always chooses the language with the highest confidence score, even if the score is low. Default false, meaning use language with any confidence.

The LanguageIdentificationParameters message includes:

SynthesisRequest
  voice (Voice)
  input (Input)
    lid_params (LanguageIdentificationParameters)
      disable
      languages
      always_use_highest_confidence

This Input message includes LID parameters to limit the choice of languages to French Canadian (frc) or American English (enu):

SynthesisRequest(
    voice = Voice(
        name = "Evan",
        model = "enhanced"
    ),
    input = Input(
       tokenized_sequence = TokenizedSequence(
            tokens = [
                Token(text = "The name of the song is. "),
                Token(control_code = ControlCode(
                    key = "lang",
                    value = "unknown")),
                Token(text = "Au clair de la lune."),
                Token(control_code = ControlCode(
                    key = "lang",
                    value = "normal")),
                Token(text = "It's a folk song meaning, in the light of the moon.")
            ]
        ),
        lid_params = LanguageIdentificationParameters(
            languages = (["frc", "enu"])
        )
    )
)

DownloadParameters

Input message containing parameters for remote file download, whether for input text (Input.uri) or a SynthesisResource (SynthesisResource.uri). Included in Input.

Download parameters
Field Type Description
headers map<string,string> Map of HTTP header name:value pairs to include in outgoing requests. Supported headers: max_age, max_stale.
request_timeout_ms uint32 Request timeout in ms. Default (0) means server default, usually 30000 (30 seconds).
refuse_cookies bool Whether to disable cookies. By default, HTTP requests accept cookies.

The DownloadParameters message includes:

SynthesisRequest
  voice (Voice)
  input (Input)
    download_params (DownloadParameters)
      headers
      request_timeout_ms
      refuse_cookies
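
For example, download parameters attached to an Input that fetches a remote resource, following the conventions of the other examples (the header and timeout values are illustrative):

Input(
    text = Text(text = "Your coffee will be ready in 5 minutes"),
    resources = [
        SynthesisResource(
            type = USER_DICTIONARY,
            uri = "urn:nuance-mix:tag:tuning:lang/coffee_app/coffee_dict/en-us/mix.tts"
        )
    ],
    download_params = DownloadParameters(
        headers = {"max_age": "86400"},    # illustrative cache directive
        request_timeout_ms = 10000,
        refuse_cookies = True
    )
)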

EventParameters

Input message that defines event subscription parameters. Included in SynthesisRequest. Requested events are sent throughout the SynthesisResponse stream as they are generated. Marker events are sent as specific parts of the synthesized audio are reached, for example, the end of a word, sentence, or user-defined bookmark.

Log events are produced throughout a synthesis request for events such as a voice loaded by the server or an audio chunk being ready to send.

Event parameters
Field Type Description
send_sentence_marker_events bool Sentence marker. Default: do not send.
send_word_marker_events bool Word marker. Default: do not send.
send_phoneme_marker_events bool Phoneme marker. Default: do not send.
send_bookmark_marker_events bool Bookmark marker. Default: do not send.
send_paragraph_marker_events bool Paragraph marker. Default: do not send.
send_visemes bool Lipsync information. Default: do not send.
send_log_events bool Whether to log events during synthesis. By default, logging is turned off.
suppress_input bool Whether to omit input text and URIs from log events. By default, these items are included.

The EventParameters message includes:

SynthesisRequest
  voice (Voice)
  input (Input)
  event_params (EventParameters)
    send_sentence_marker_events
    send_word_marker_events
    send_phoneme_marker_events
    send_bookmark_marker_events
    send_paragraph_marker_events
    send_visemes
    send_log_events
    suppress_input

Event parameters in SynthesisRequest:

SynthesisRequest(
    voice = Voice(
        name = "Evan",
        model = "enhanced"
    ),
    input = Input(
        text = Text(
           text = "Your coffee will be ready in 5 minutes.")
    ),
    event_params = EventParameters(
        send_sentence_marker_events = True,
        send_paragraph_marker_events = True,
        send_log_events = True,
        suppress_input = True
    )
)

SynthesisResponse

The Synthesizer Synthesize method returns a stream of SynthesisResponse messages. (See UnarySynthesisResponse for a non-streamed response.) Each response contains one of:

  • A status response, indicating completion or failure of the request. This is received only once and signifies the end of a Synthesize call.
  • A list of events the client has requested. This can be received many times.
  • An audio buffer. This may be received many times.

Synthesis response
Field Type Description
status Status A status response, indicating completion or failure of the request.
events Events A list of events. See EventParameters for details.
audio bytes The latest audio buffer.

The SynthesisResponse message includes:

SynthesisResponse
  status (Status)
    code
    message
    details
  events (Events)
    event (Event)
      name
      values
  audio

Response to synthesis request:

from google.protobuf import text_format  # for printing event messages

audio_file = None  # stays None when no output file is requested
try:
    if args.output_audio_file:
        audio_file = open(args.output_audio_file, "wb")
    for response in stream_in:
        if response.HasField("audio"):
            print("Received audio: %d bytes" % len(response.audio))
            if audio_file:
                audio_file.write(response.audio)
        elif response.HasField("events"):
            print("Received events")
            print(text_format.MessageToString(response.events))
        else:
            if response.status.code == 200:
                print("Received status response: SUCCESS")
            else:
                print("Received status response: FAILED")
                print("Code: {}, Message: {}".format(response.status.code, response.status.message))
                print("Error: {}".format(response.status.details))
except Exception as e:
    print(e)
if audio_file:
    print("Saved audio to {}".format(args.output_audio_file))
    audio_file.close()

These results show synthesis of a simple text string using the Evan voice, with a user dictionary included:

2023-09-26 15:45:42,436 (139898668111680) INFO  Iteration #1
2023-09-26 15:45:42,439 (139898668111680) DEBUG Creating secure gRPC channel
2023-09-26 15:45:42,444 (139898668111680) INFO  Running file [flow.py]
2023-09-26 15:45:42,444 (139898668111680) DEBUG [voice {
  name: "Evan"
  model: "enhanced"
}
audio_params {
  audio_format {
    pcm {
      sample_rate_hz: 22050
    }
  }
  volume_percentage: 80
  speaking_rate_factor: 1.0
  audio_chunk_duration_ms: 2000
}
input {
  text {
    text: "This is a test. A very simple test."
  }
  resources {
    uri: "urn:nuance-mix:tag:tuning:lang/coffee_app/coffee_dict/en-us/mix.tts"
  }
}
user_id: "MyApplicationUser"
]
2023-09-26 15:45:42,444 (139898668111680) INFO  Sending Synthesis request
2023-09-26 15:45:42,631 (139898668111680) INFO  Received audio: 57484 bytes
2023-09-26 15:45:42,655 (139898668111680) INFO  Received audio: 70432 bytes
2023-09-26 15:45:42,657 (139898668111680) INFO  Received status response: SUCCESS
2023-09-26 15:45:42,657 (139898668111680) INFO  Wrote audio to flow.py_i1_s1.wav
2023-09-26 15:45:42,657 (139898668111680) INFO  Done running file [flow.py]
2023-09-26 15:45:42,660 (139898668111680) INFO  Iteration #1 complete
2023-09-26 15:45:42,660 (139898668111680) INFO  Done

Status

Output message containing a status response, indicating completion or failure of a Synthesize call. Included in SynthesisResponse and UnarySynthesisResponse.

Status response
Field Type Description
code uint32 HTTP-style return code: 200, 4xx, or 5xx as appropriate. See Status codes.
message string Brief description of the status.
details string Longer description if available.

Events

Output message defining a container for a list of events. This container is needed because oneof does not allow repeated parameters. Included in SynthesisResponse and UnarySynthesisResponse.

Events
Field Type Description
events Event Repeated. One or more events.

Events are returned when send_log_events is True in the request’s EventParameters.

  Results with events  

For a description of the NVOC events in the results, see TTS payload: callsummary.

Event

Output message defining an event message. Included in Events. See EventParameters for details.

Event
Field Type Description
name string Either “Markers” or the name of the event in the case of a Log Event.
values map<string,string> Map of key:value data relevant to the current event.

UnarySynthesisResponse

The Synthesizer UnarySynthesize method returns a single UnarySynthesisResponse message. It carries the same information as SynthesisResponse but delivers it all at once rather than as a stream. The response contains:

  • A status response, indicating completion or failure of the request.
  • A list of events the client has requested.
  • The complete audio buffer of the synthesized text.

Unary synthesis response
Field Type Description
status Status A status response, indicating completion or failure of the request.
events Events A list of events. See EventParameters for details.
audio bytes Audio buffer of the synthesized text, capped if necessary to a configured audio response size.

The UnarySynthesisResponse message includes:

UnarySynthesisResponse
  status (Status)
    code
    message
    details
  events (Events)
    event (Event)
      name
      values
  audio
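
A minimal sketch of handling the unary response, assuming the stub and request from the earlier sketches:

response = stub.UnarySynthesize(request)
if response.status.code == 200:
    with open("output.pcm", "wb") as f:
        f.write(response.audio)    # the complete audio buffer, possibly capped
else:
    print("Code: {}, Message: {}".format(
        response.status.code, response.status.message))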

Scalar value types

The data types in the proto files are mapped to equivalent types in the generated client stub files.

Scalar data types
Proto Notes C++ Java Python
double double double float
float float float float
int32 Uses variable-length encoding. Inefficient for encoding negative numbers. If your field is likely to have negative values, use sint32 instead. int32 int int
int64 Uses variable-length encoding. Inefficient for encoding negative numbers. If your field is likely to have negative values, use sint64 instead. int64 long int/long
uint32 Uses variable-length encoding. uint32 int int/long
uint64 Uses variable-length encoding. uint64 long int/long
sint32 Uses variable-length encoding. Signed int value. These encode negative numbers more efficiently than regular int32s. int32 int int
sint64 Uses variable-length encoding. Signed int value. These encode negative numbers more efficiently than regular int64s. int64 long int/long
fixed32 Always four bytes. More efficient than uint32 if values are often greater than 2^28. uint32 int int
fixed64 Always eight bytes. More efficient than uint64 if values are often greater than 2^56. uint64 long int/long
sfixed32 Always four bytes. int32 int int
sfixed64 Always eight bytes. int64 long int/long
bool bool boolean bool
string A string must always contain UTF-8 encoded or 7-bit ASCII text. string String str/unicode
bytes May contain any arbitrary sequence of bytes. string ByteString str