Synthesizer gRPC API for Neural TTSaaS

The Synthesizer gRPC API contains methods for requesting speech synthesis from Neural TTSaaS, using Microsoft neural voices.

Proto file structure

The Synthesizer API is defined in the synthesizer.proto file.

└── nuance
    └── tts
        └── v1 
            └── synthesizer.proto

Neural TTSaaS does not support all fields in synthesizer.proto. See Supported fields and defaults.

For Neural TTSaaS, the proto file defines a Synthesizer service with two RPC methods: GetVoices and Synthesize.

Sequence flow

The essential tasks are illustrated in the following high-level sequence flow of a client application at runtime:

[Figure: Runtime sequence flow]

Synthesizer

The Synthesizer service offers these functionalities:

  • GetVoices: Queries the list of available voices, with filters to reduce the search space.
  • Synthesize: Synthesizes audio from input text and parameters, and returns an audio stream.

Method                     Request Type      Response Type
GetVoices                  GetVoicesRequest  GetVoicesResponse
Synthesize                 SynthesisRequest  SynthesisResponse stream
UnarySynthesize (Ignored)  SynthesisRequest  UnarySynthesisResponse

GetVoicesRequest

Input message for the GetVoices method, to query voices available to the client. For more examples, see Sample synthesis client for Neural TTSaaS: Get voices and Voice filters.

For information on the Microsoft voices returned by GetVoices, see the Microsoft documentation: Language and voice support for the Speech service.

Field Type Description
voice Voice Optionally filter the voices to retrieve, for example, set language to en-US to return only American English voices.

The GetVoicesRequest message includes:

GetVoicesRequest
  voice (Voice)
    name
    language
    gender (EnumGender)
    sample_rate_hz

For example, this requests information about all female American English voices:

GetVoicesRequest (
    voice = Voice (
        language = "en-US",
        gender = EnumGender.FEMALE
    )
)

This asks about one named voice:

GetVoicesRequest (
    voice = Voice (
        name = "en-US-JennyNeural"
    )
)

Voice

Input or output message for voices. Different fields are supported depending on the method.

  • In SynthesisRequest:

    For plain text, it specifies the voice to use with the mandatory name field.

    For SSML, it optionally specifies the voice to use with the name field. The voice may instead be set with <voice> in the SSML input. See SSML input.

  • In GetVoicesRequest, it filters the list of available voices, with optional fields name, language, gender, foreign_languages, styles, and sample_rate_hz. See Voice filters for more examples.

  • In GetVoicesResponse, it returns the list of available voices, with name, model, language, gender, sample_rate_hz. It includes foreign_languages and/or styles when available for the voice.

Field              Type          Description
name               string        The voice’s name, for example en-US-JennyNeural. Mandatory for SynthesisRequest with plain text input. Optional for SSML input. Used in GetVoicesRequest to search for a named voice. Included in GetVoicesResponse.
model              string        The voice’s model, for example neural. Included in GetVoicesResponse. Ignored otherwise.
language           string        IETF language code, for example en-US. Used in GetVoicesRequest and GetVoicesResponse, to return voices with a certain mother tongue. Ignored otherwise.
age_group          EnumAgeGroup  Ignored.
gender             EnumGender    Used in GetVoicesRequest and GetVoicesResponse, to return voices with a certain gender. Ignored otherwise.
sample_rate_hz     uint32        Used in GetVoicesRequest and GetVoicesResponse, to return a voice’s sampling rate. Ignored otherwise.
language_tlw       string        Ignored.
restricted         bool          Ignored.
versions           string        Ignored.
foreign_languages  string        Repeated. Used in GetVoicesRequest and GetVoicesResponse, to return the foreign languages of a multilingual voice. Ignored otherwise.
styles             string        Repeated. Used in GetVoicesRequest and GetVoicesResponse, to return the available styles of a voice. Ignored otherwise.

The Voice message includes different fields depending on the context:

GetVoicesRequest
  voice (Voice)
    name
    language
    gender (EnumGender)
    sample_rate_hz
 
GetVoicesResponse
  voices (Voice)
    name
    model
    language
    gender (EnumGender)
    sample_rate_hz
    foreign_languages
    styles
 
SynthesisRequest
  voice (Voice)
    name

EnumGender

Input field for GetVoicesRequest or output field for GetVoicesResponse, specifying gender for voices that support multiple genders. Included in Voice.

Name Number Description
ANY 0 Any gender voice. Default for GetVoicesRequest.
MALE 1 Male voice.
FEMALE 2 Female voice.
NEUTRAL 3 Neutral gender voice. Ignored.
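
The numeric values in the table above can be mirrored in client code when a readable name is needed for logging or display. The following is a minimal sketch, not the generated stub: the class here is a local stand-in that only copies the names and numbers from the table, and gender_name is a hypothetical helper.

```python
from enum import IntEnum


class EnumGender(IntEnum):
    """Local mirror of the EnumGender values listed in the table above."""
    ANY = 0      # Any gender voice; default for GetVoicesRequest
    MALE = 1     # Male voice
    FEMALE = 2   # Female voice
    NEUTRAL = 3  # Neutral gender voice; ignored by Neural TTSaaS


def gender_name(number: int) -> str:
    """Translate a numeric gender value into its enum name (hypothetical helper)."""
    return EnumGender(number).name


print(gender_name(2))
```

In a real client, the enum shipped with the generated stub files would normally be used instead of a local copy.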

GetVoicesResponse

Output message in response to GetVoicesRequest. Contains information about the voices that match the input criteria, if any, and includes foreign languages and styles for voices that support them.

To use the styles in a synthesis request, see Input to synthesize: SSML elements: Voice style.

Field Type Description
voices Voice Repeated. Voices and characteristics returned.

For example, this is the response to GetVoices for American English voices. Notice that voice styles are included for voices that support them, and foreign languages are listed for the Jenny multilingual voice.

2022-10-24 15:56:27,265 (140266945111872) DEBUG [voice {
    language: "en-US"
}
]
2022-10-24 15:56:27,265 (140266945111872) INFO  Sending GetVoices request
2022-10-24 15:56:27,405 (140266945111872) INFO  voices {
    name: "en-US-JennyNeural"
    model: "neural"
    language: "en-US"
    gender: FEMALE
    sample_rate_hz: 24000
    styles: "assistant"
    styles: "chat"
    styles: "customerservice"
    styles: "newscast"
    styles: "angry"
    styles: "cheerful"
    styles: "sad"
    styles: "excited"
    styles: "friendly"
    styles: "terrified"
    styles: "shouting"
    styles: "unfriendly"
    styles: "whispering"
    styles: "hopeful"
}
voices {
    name: "en-US-JennyMultilingualNeural"
    model: "neural"
    language: "en-US"
    gender: FEMALE
    sample_rate_hz: 24000
    foreign_languages: "de-DE"
    foreign_languages: "en-AU"
    foreign_languages: "en-CA"
    foreign_languages: "en-GB"
    foreign_languages: "es-ES"
    foreign_languages: "es-MX"
    foreign_languages: "fr-CA"
    foreign_languages: "fr-FR"
    foreign_languages: "it-IT"
    foreign_languages: "ja-JP"
    foreign_languages: "ko-KR"
    foreign_languages: "pt-BR"
    foreign_languages: "zh-CN"
}
voices {
    name: "en-US-GuyNeural"
    model: "neural"
    language: "en-US"
    gender: MALE
    sample_rate_hz: 24000
    styles: "newscast"
    styles: "angry"
    styles: "cheerful"
    styles: "sad"
    styles: "excited"
    styles: "friendly"
    styles: "terrified"
    styles: "shouting"
    styles: "unfriendly"
    styles: "whispering"
    styles: "hopeful"
}
voices {
    name: "en-US-AmberNeural"
    model: "enhanced"
    language: "en-US"
    gender: FEMALE
    sample_rate_hz: 24000
}
voices {
    name: "en-US-AnaNeural"
    model: "enhanced"
    language: "en-US"
    gender: FEMALE
    sample_rate_hz: 24000
}
. . . voices omitted here . . . 
voices {
    name: "en-US-ZiraRUS"
    model: "enhanced"
    language: "en-US"
    gender: FEMALE
    sample_rate_hz: 24000
}

2022-10-24 15:56:27,405 (140266945111872) INFO  Done running file [flow.py]
2022-10-24 15:56:27,407 (140266945111872) INFO  Iteration #1 complete
2022-10-24 15:56:27,407 (140266945111872) INFO  Done
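
Once a response like the one above has been received, a client may want to pick out the voices that offer a particular style or foreign language. The sketch below illustrates one way to do that. It assumes a hypothetical VoiceInfo dataclass as a stand-in for the generated Voice message, and the sample data is a shortened copy of the response shown above.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class VoiceInfo:
    """Hypothetical stand-in for a Voice message returned by GetVoices."""
    name: str
    language: str
    styles: List[str] = field(default_factory=list)
    foreign_languages: List[str] = field(default_factory=list)


def voices_with_style(voices: Iterable[VoiceInfo], style: str) -> List[str]:
    """Return the names of voices that list the given style."""
    return [voice.name for voice in voices if style in voice.styles]


def voices_speaking(voices: Iterable[VoiceInfo], language: str) -> List[str]:
    """Return the names of voices that list the given foreign language."""
    return [voice.name for voice in voices if language in voice.foreign_languages]


# Sample data copied from the response above, with the lists shortened.
sample_voices = [
    VoiceInfo("en-US-JennyNeural", "en-US", styles=["assistant", "chat", "newscast"]),
    VoiceInfo("en-US-JennyMultilingualNeural", "en-US", foreign_languages=["de-DE", "fr-FR"]),
    VoiceInfo("en-US-GuyNeural", "en-US", styles=["newscast", "angry"]),
    VoiceInfo("en-US-AmberNeural", "en-US"),
]

print(voices_with_style(sample_voices, "newscast"))
print(voices_speaking(sample_voices, "fr-FR"))
```

The same checks apply to the repeated styles and foreign_languages fields of a real response, since repeated string fields can be tested for membership in the same way.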

SynthesisRequest

Input message for the Synthesize method. Specifies input text, audio parameters, and events to subscribe to, in exchange for synthesized audio. See Supported fields and defaults.

For more examples, see Sample synthesis client for Neural TTSaaS > Synthesize text input and Synthesize SSML input.

SynthesisRequest
Field Type Description
voice Voice Mandatory for plain text input. Optional for SSML input. The voice to use for audio synthesis.
audio_params AudioParameters Output audio parameters, such as encoding and volume. Default is PCM audio at 22050 Hz.
input Input Mandatory. Input text to synthesize.
event_params EventParameters Markers and other information to include in server events returned during synthesis.
client_data map<string,string> Map of client-supplied key:value pairs to inject into the event log.
user_id string Identifies a specific user within the application.

The SynthesisRequest message includes:

SynthesisRequest
  voice (Voice)
    name
  audio_params (AudioParameters)
    audio_format (AudioFormat)
  input (Input)
    text (Text)
    ssml (SSML)
  event_params (EventParameters)
    send_bookmark_marker_events
    send_visemes
    suppress_input
  client_data
  user_id

This synthesis request includes most fields:

SynthesisRequest(
    voice = Voice(
        name = "en-US-JennyNeural"
    ),
    audio_params = AudioParameters(
        audio_format = AudioFormat(
            pcm = PCM(sample_rate_hz = 22050) 
        )
    ),
    input = Input(
        text = Text(
            text = "Your coffee will be ready in 5 minutes"
        )
    ),
    event_params = EventParameters(
        send_visemes = True  
    ),
    client_data = {"company":"Aardvark Coffee","user":"Leslie"},
    user_id = "leslie.somebody@aardvark.com"
)

This is a minimal synthesis request, using all defaults:

SynthesisRequest(
    voice = Voice(
        name = "en-US-JennyNeural"
    ),
    input = Input(
        text = Text(
            text = "Your coffee will be ready in 5 minutes."
        )
    )
)

The API voice field is optional for SSML input. A voice may instead be provided in the <voice> element in the SSML input.

SynthesisRequest(
    input = Input(
        ssml = SSML(
            text = '''<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
            <voice name="en-US-JennyNeural">Your coffee will be ready in 5 minutes.</voice>
            </speak>'''
        )
    )
)
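
Because client_data is declared as map<string,string>, both keys and values must be strings. The helper below is a small sketch of how an application might prepare such a map from mixed values before building the request. The function name to_client_data is hypothetical, and dropping None values is a design choice of this sketch, not a documented requirement.

```python
from typing import Any, Dict, Mapping


def to_client_data(values: Mapping[Any, Any]) -> Dict[str, str]:
    """Convert keys and values to strings, skipping entries whose value is None."""
    return {
        str(key): str(value)
        for key, value in values.items()
        if value is not None
    }


client_data = to_client_data({"company": "Aardvark Coffee", "order": 1042, "vip": None})
print(client_data)
```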

AudioParameters

Input message for audio-related parameters during synthesis, including encoding, volume, and audio length. Included in SynthesisRequest.

AudioParameters
Field Type Description
audio_format AudioFormat Audio encoding. Default PCM 22050 Hz.
volume_percentage uint32 Ignored.
speaking_rate_factor float Ignored.
audio_chunk_duration_ms uint32 Ignored.
target_audio_length_ms uint32 Ignored.
disable_early_emission bool Ignored.

The AudioParameters message includes:

SynthesisRequest
  voice (Voice)
  audio_params (AudioParameters)
    audio_format (AudioFormat)
      pcm (PCM)
      alaw (ALaw)
      ulaw (ULaw)
      ogg_opus (OggOpus)
      opus (Opus)

AudioFormat

Input message for audio encoding of synthesized text. Included in AudioParameters.

AudioFormat
Field Type Description
pcm PCM Signed 16-bit little endian PCM.
alaw ALaw G.711 A-law, 8kHz.
ulaw ULaw G.711 Mu-law, 8kHz.
ogg_opus OggOpus Ogg Opus, 16kHz or 24kHz.
opus Opus Opus, 16kHz or 24kHz. The audio will be sent one Opus packet at a time.

The AudioFormat message includes:

SynthesisRequest
  voice (Voice)
  audio_params (AudioParameters)
    audio_format (AudioFormat)
      pcm (PCM)
        sample_rate_hz
      alaw (ALaw)
      ulaw (ULaw)
      ogg_opus (OggOpus)
        sample_rate_hz
      opus (Opus)
        sample_rate_hz
        bit_rate_bps

The PCM audio format is shown, with alternatives in commented lines:

SynthesisRequest(
    voice = Voice(
        name = "en-US-JennyNeural"
    ),
    audio_params = AudioParameters(
        audio_format = AudioFormat(
            pcm = PCM(sample_rate_hz = 22050)
#           alaw = ALaw()
#           ulaw = ULaw()
#           ogg_opus = OggOpus(sample_rate_hz = 16000)
#           opus = Opus(sample_rate_hz = 16000, bit_rate_bps = 32000)
        )
    )
)

PCM

Input message defining PCM sample rate. Included in AudioFormat.

PCM
Field Type Description
sample_rate_hz uint32 Output sample rate in Hz. Supported values: 8000, 16000, 22050, 24000.
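
PCM audio arrives as raw samples with no header, so a client that wants a playable file has to add a container itself. The sketch below wraps signed 16-bit little-endian PCM in a WAV container using the Python standard library. It assumes single-channel (mono) audio, which this section does not state, and the function names are hypothetical.

```python
import io
import wave


def pcm_to_wav(pcm_bytes: bytes, sample_rate_hz: int = 22050, channels: int = 1) -> bytes:
    """Wrap raw signed 16-bit little-endian PCM in a WAV container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)        # assumption: mono output
        wav_file.setsampwidth(2)               # 16-bit samples are 2 bytes each
        wav_file.setframerate(sample_rate_hz)  # must match the requested sample_rate_hz
        wav_file.writeframes(pcm_bytes)
    return buffer.getvalue()


def pcm_duration_seconds(pcm_bytes: bytes, sample_rate_hz: int = 22050, channels: int = 1) -> float:
    """Compute the playing time of raw 16-bit PCM audio."""
    bytes_per_second = 2 * channels * sample_rate_hz
    return len(pcm_bytes) / bytes_per_second


# One second of silence at the default rate of 22050 Hz.
silence = b"\x00\x00" * 22050
wav_bytes = pcm_to_wav(silence, sample_rate_hz=22050)
print(pcm_duration_seconds(silence, 22050))
```

The sample rate passed to the WAV writer should be the same value that was sent in the PCM message, otherwise the audio plays at the wrong speed.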

ALaw

Input message defining A-law audio format. Included in AudioFormat. G.711 audio formats are set to 8kHz.

ULaw

Input message defining Mu-law audio format. Included in AudioFormat. G.711 audio formats are set to 8kHz.

OggOpus

Input message defining Ogg Opus output rate. Included in AudioFormat.

OggOpus
Field Type Description
sample_rate_hz uint32 Output sample rate in Hz. Supported values: 16000, 24000.
bit_rate_bps uint32 Ignored.
max_frame_duration_ms float Ignored.
complexity uint32 Ignored.
vbr EnumVariableBitrate Ignored.

Opus

Input message defining Opus output rate. Included in AudioFormat.

Opus
Field Type Description
sample_rate_hz uint32 Output sample rate in Hz. Supported values: 16000, 24000.
bit_rate_bps uint32 Output bitrate. Supported values:
For 16 kHz: 20 ms frame, bitrate can be 0 (default, meaning 32000) or 32000.
For 24 kHz: 20 ms frame, bitrate can be 0 (default, meaning 24000), 24000, or 48000.
max_frame_duration_ms float Ignored.
complexity uint32 Ignored.
vbr EnumVariableBitrate Ignored.
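
The supported sample rates and Opus bitrates listed in this section can be checked on the client before a request is sent. The following sketch collects the values from the tables above into lookup tables. The function names are hypothetical, and raising ValueError is a choice made for this sketch; how the service itself responds to unsupported values is not described here.

```python
# Supported sample rates in Hz for each audio format, as listed in this section.
SUPPORTED_SAMPLE_RATES = {
    "pcm": {8000, 16000, 22050, 24000},
    "alaw": {8000},
    "ulaw": {8000},
    "ogg_opus": {16000, 24000},
    "opus": {16000, 24000},
}

# Allowed Opus bitrates per sample rate. The first entry is what 0 resolves to.
OPUS_BITRATES = {
    16000: (32000,),
    24000: (24000, 48000),
}


def check_sample_rate(audio_format: str, sample_rate_hz: int) -> None:
    """Raise ValueError if the format or its sample rate is not listed above."""
    rates = SUPPORTED_SAMPLE_RATES.get(audio_format)
    if rates is None:
        raise ValueError(f"unknown audio format: {audio_format!r}")
    if sample_rate_hz not in rates:
        raise ValueError(f"{audio_format} does not support {sample_rate_hz} Hz")


def effective_opus_bitrate(sample_rate_hz: int, bit_rate_bps: int = 0) -> int:
    """Return the bitrate that applies to an Opus request, resolving the default 0."""
    check_sample_rate("opus", sample_rate_hz)
    allowed = OPUS_BITRATES[sample_rate_hz]
    if bit_rate_bps == 0:
        return allowed[0]
    if bit_rate_bps not in allowed:
        raise ValueError(f"{bit_rate_bps} bps is not supported at {sample_rate_hz} Hz")
    return bit_rate_bps


print(effective_opus_bitrate(16000))
print(effective_opus_bitrate(24000, 48000))
```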

Input

Input message containing the text to synthesize. Included in SynthesisRequest. The input may be plain text or SSML. See Input to synthesize for examples.

Input
Field Type Description
text Text Plain text input.
ssml SSML SSML input, including text and SSML elements.
tokenized_sequence TokenizedSequence Not allowed.
resources SynthesisResource Ignored.
lid_params LanguageIdentificationParameters Ignored.
download_params DownloadParameters Ignored.

The Input message includes:

SynthesisRequest
  voice (Voice)
  input (Input)
    text (Text)
      text
    ssml (SSML)
      text

Text

Input message for synthesizing plain text. The encoding must be UTF-8. For plain text input, a voice field is required.

Text
Field Type Description
text string Plain input text in UTF-8 encoding.
uri string Not allowed.

For example, this is plain text input:

SynthesisRequest(
    voice = Voice(
        name = "en-US-JennyNeural"
    ),
    input = Input(
        text = Text(
            text = "Your coffee will be ready in 5 minutes"
        )
    )
)

SSML

Input message for synthesizing SSML input. See SSML elements for supported elements and examples.

SSML
Field Type Description
text string SSML input text and elements.
uri string Not allowed.
ssml_validation_mode EnumSSMLValidationMode Ignored.

For example, this is SSML input. The SynthesisRequest voice field is ignored and may be omitted because the voice is set in the <voice> element in the SSML.

SynthesisRequest(
    input = Input(
        ssml = SSML(
            text = '''<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
            <voice name="en-US-JennyNeural">Your coffee will be ready in 5 minutes.</voice>
            </speak>'''
        ) 
    )
)
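
When the text to speak comes from user data, characters such as & and < must be escaped before they are placed inside SSML. The sketch below builds the same <speak> and <voice> wrapper used in the example above from plain text. The function name build_ssml is hypothetical, and only the two elements shown in this section are produced.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text: str, voice_name: str, language: str = "en-US") -> str:
    """Wrap plain text in <speak> and <voice> elements, escaping XML characters."""
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang={quoteattr(language)}>'
        f'<voice name={quoteattr(voice_name)}>{escape(text)}</voice>'
        '</speak>'
    )


ssml_text = build_ssml("Tea & coffee < 5 minutes", "en-US-JennyNeural")
print(ssml_text)
```

The resulting string would be passed as the text field of the SSML message.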

EventParameters

Input message that defines event subscription parameters. Included in SynthesisRequest. Events that are requested are sent throughout the SynthesisResponse stream as they are generated.

Log events are produced throughout a synthesis request for events such as a voice loaded by the server or an audio chunk being ready to send.

EventParameters
Field Type Description
send_sentence_marker_events bool Ignored.
send_word_marker_events bool Ignored.
send_phoneme_marker_events bool Ignored.
send_bookmark_marker_events bool Bookmark marker. Default: do not send.
send_paragraph_marker_events bool Ignored.
send_visemes bool Lipsync information. Default: do not send.
send_log_events bool Ignored.
suppress_input bool Whether to omit input text and URIs from log events. By default, these items are included.

The EventParameters message includes:

SynthesisRequest
  voice (Voice)
  input (Input)
  event_params (EventParameters)
    send_bookmark_marker_events
    send_visemes
    suppress_input

Event parameters in SynthesisRequest

SynthesisRequest(
    voice = Voice(
        name = "en-US-JennyNeural"
    ),
    input = Input(
        text = Text(
            text = "Your coffee will be ready in 5 minutes."
        )
    ),
    event_params = EventParameters(
        send_visemes = True
    )
)

SynthesisResponse

Output message in response to a SynthesisRequest, consisting of a stream of SynthesisResponse responses. Each response contains one of:

  • A status response, indicating completion or failure of the request. This is received only once and signifies the end of a Synthesize call.
  • A list of events the client has requested. This can be received many times. See EventParameters for details.
  • An audio buffer. This may be received many times.

SynthesisResponse
Field   Type    Description
status  Status  A status response, indicating completion or failure of the request.
events  Events  A list of events. See EventParameters for details.
audio   bytes   The latest audio buffer.

The SynthesisResponse message includes:

SynthesisResponse
  status (Status)
    code
    message
    details
  events (Events)
    events (Event)
      name
      values
  audio
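
A client typically loops over the response stream and handles each of the three kinds of response described above. The sketch below shows that loop using hypothetical stand-in classes (FakeStatus and FakeResponse) instead of the generated messages, so it runs without a server. With real generated messages, the check for which field is set would use the presence checks that protobuf provides, not a comparison with None.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class FakeStatus:
    """Hypothetical stand-in for the Status message."""
    code: int
    message: str = ""
    details: str = ""


@dataclass
class FakeResponse:
    """Hypothetical stand-in for SynthesisResponse: exactly one field is set."""
    status: Optional[FakeStatus] = None
    events: Optional[list] = None
    audio: Optional[bytes] = None


def collect_stream(responses: Iterable[FakeResponse]) -> Tuple[bytes, List, Optional[FakeStatus]]:
    """Gather audio buffers, events, and the status from a response stream."""
    audio = bytearray()
    events: List = []
    status = None
    for response in responses:
        if response.audio is not None:
            audio.extend(response.audio)      # may arrive many times
        elif response.events is not None:
            events.extend(response.events)    # may arrive many times
        elif response.status is not None:
            status = response.status          # arrives once; recorded, the loop ends with the stream
    return bytes(audio), events, status


stream = [
    FakeResponse(audio=b"ab"),
    FakeResponse(events=[{"name": "Markers"}]),
    FakeResponse(audio=b"cd"),
    FakeResponse(status=FakeStatus(200, "OK")),
]
audio, events, status = collect_stream(stream)
print(len(audio), len(events), status.code)
```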

Status

Output message containing a status response, indicating completion or failure of a Synthesize call. Included in SynthesisResponse.

Status
Field Type Description
code uint32 HTTP-style return code: 200, 4xx, or 5xx as appropriate. See Status codes.
message string Brief description of the status.
details string Longer description if available.
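
Since the code field follows HTTP conventions, a client can sort results into broad categories by range. This is a minimal sketch; the category labels and the function name classify_status are hypothetical, and the meaning of individual codes is covered in Status codes.

```python
def classify_status(code: int) -> str:
    """Map an HTTP-style status code to a broad category (labels are illustrative)."""
    if code == 200:
        return "success"
    if 400 <= code <= 499:
        return "client_error"
    if 500 <= code <= 599:
        return "server_error"
    return "unexpected"


for code in (200, 400, 500):
    print(code, classify_status(code))
```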

Events

Output message defining a container for a list of events. This container is needed because oneof does not allow repeated parameters in Protobuf. Included in SynthesisResponse.

Events
Field Type Description
events Event Repeated. One or more events.

Event

Output message defining an event message. Included in Events. See EventParameters for details.

Event
Field Type Description
name string Either “Markers” or the name of the event in the case of a Log Event.
values map<string,string> Map of key:value data relevant to the current event.

Scalar value types

The data types in the proto files are mapped to equivalent types in the generated client stub files.

Scalar data types
Proto     C++     Java        Python       Notes
double    double  double      float
float     float   float       float
int32     int32   int         int          Uses variable-length encoding. Inefficient for encoding negative numbers. If your field is likely to have negative values, use sint32 instead.
int64     int64   long        int/long     Uses variable-length encoding. Inefficient for encoding negative numbers. If your field is likely to have negative values, use sint64 instead.
uint32    uint32  int         int/long     Uses variable-length encoding.
uint64    uint64  long        int/long     Uses variable-length encoding.
sint32    int32   int         int          Uses variable-length encoding. Signed int value. These encode negative numbers more efficiently than regular int32s.
sint64    int64   long        int/long     Uses variable-length encoding. Signed int value. These encode negative numbers more efficiently than regular int64s.
fixed32   uint32  int         int          Always four bytes. More efficient than uint32 if values are often greater than 2^28.
fixed64   uint64  long        int/long     Always eight bytes. More efficient than uint64 if values are often greater than 2^56.
sfixed32  int32   int         int          Always four bytes.
sfixed64  int64   long        int/long     Always eight bytes.
bool      bool    boolean     bool
string    string  String      str/unicode  A string must always contain UTF-8 encoded or 7-bit ASCII text.
bytes     string  ByteString  str          May contain any arbitrary sequence of bytes.
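
Python integers have no fixed size, while a uint32 field such as sample_rate_hz can only hold values from 0 to 2^32 - 1. A client can check a value against that range before placing it in a message. The helper below is a small sketch with a hypothetical name, check_uint32.

```python
UINT32_MAX = 2**32 - 1  # largest value a uint32 field can hold


def check_uint32(value: int, field_name: str = "value") -> int:
    """Return value unchanged if it fits in a uint32 field, otherwise raise."""
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError(f"{field_name} must be an int")
    if not 0 <= value <= UINT32_MAX:
        raise ValueError(f"{field_name} is outside the uint32 range")
    return value


print(check_uint32(22050, "sample_rate_hz"))
```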