Synthesizer gRPC API for Neural TTSaaS

The Synthesizer gRPC API contains methods for requesting speech synthesis from Neural TTSaaS, using Microsoft neural voices.

Tip:

Try out this API using the Sample synthesis client for Neural TTSaaS that you can download and run.

Proto file structure

The Synthesizer API is defined in the synthesizer.proto file.

└── nuance
    └── tts
        └── v1 
            └── synthesizer.proto

Neural TTSaaS does not support all fields in synthesizer.proto. See Supported fields and defaults.

For Neural TTSaaS, the proto file defines a Synthesizer service with two RPC methods: GetVoices and Synthesize.
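
Client stubs can be generated from the proto file with the gRPC tools. The following is a minimal sketch, assuming the grpcio-tools Python package is installed and that the nuance directory sits under the proto root you pass in. The helper names build_protoc_args and generate_stubs are illustrative and not part of the API.

```python
from pathlib import Path


def build_protoc_args(proto_root=".", out_dir="."):
    """Return the argument list for grpc_tools.protoc (paths are assumptions)."""
    proto_file = Path("nuance") / "tts" / "v1" / "synthesizer.proto"
    return [
        "grpc_tools.protoc",              # argv[0], a placeholder program name
        f"--proto_path={proto_root}",     # directory that contains nuance/
        f"--python_out={out_dir}",        # message classes (*_pb2.py)
        f"--grpc_python_out={out_dir}",   # service stubs (*_pb2_grpc.py)
        proto_file.as_posix(),
    ]


def generate_stubs(proto_root=".", out_dir="."):
    """Run the proto compiler; requires the grpcio-tools package."""
    from grpc_tools import protoc  # imported lazily so this module loads without it
    return protoc.main(build_protoc_args(proto_root, out_dir))
```

Calling generate_stubs() from the directory that contains nuance/ should write the generated Python files next to the proto file.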

Sequence flow

The essential tasks are illustrated in the following high-level sequence flow of a client application at runtime:

Runtime sequence flow

Synthesizer

The Synthesizer service offers these functionalities:

GetVoices: Queries the list of available voices, with filters to reduce the search space.
Synthesize: Synthesizes audio from input text and parameters, and returns an audio stream.

Method	Request Type	Response Type
GetVoices	GetVoicesRequest	GetVoicesResponse
Synthesize	SynthesisRequest	SynthesisResponse stream
UnarySynthesize (Ignored)	SynthesisRequest	UnarySynthesisResponse

GetVoicesRequest

Input message for the GetVoices method, to query voices available to the client. For more examples, see Sample synthesis client for Neural TTSaaS: Get voices and Voice filters.

For information on the Microsoft voices returned by GetVoices, see the Microsoft documentation: Language and voice support for the Speech service.

Field	Type	Description
voice	Voice	Optionally filter the voices to retrieve, for example, set language to en-US to return only American English voices.

The GetVoicesRequest message includes:

GetVoicesRequest
  voice (Voice)
    name
    language
    gender (EnumGender)
    sample_rate_hz

For example, this requests information about all female American English voices:

GetVoicesRequest (
    voice = Voice (
        language = "en-US",
        gender = EnumGender.FEMALE
    )
)

This asks about one named voice:

GetVoicesRequest (
    voice = Voice (
        name = "en-US-JennyNeural"
    )
)
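
The requests above can be sent through a client stub. This sketch takes the stub and the generated message module as arguments, so it makes no assumption about import paths. The function name list_voice_names is hypothetical, and the voices field name on the response is taken from the sample response output shown under GetVoicesResponse.

```python
def list_voice_names(stub, pb2, language="", gender=0):
    """Call GetVoices and return the names of the matching voices.

    stub: a Synthesizer client stub (assumed to expose GetVoices).
    pb2:  the generated message module (assumed to expose
          GetVoicesRequest and Voice).
    """
    request = pb2.GetVoicesRequest(
        voice=pb2.Voice(language=language, gender=gender)
    )
    response = stub.GetVoices(request)
    # The repeated field is printed as "voices" in the sample output.
    return [voice.name for voice in response.voices]
```

With real stubs, a call such as list_voice_names(stub, pb2, language="en-US", gender=2) would correspond to the first request above, since FEMALE has the number 2 in EnumGender.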

Voice

Input or output message for voices. Different fields are supported depending on the method.

In SynthesisRequest:

For plain text, it specifies the voice to use with the mandatory name field.
For SSML, it optionally specifies the voice to use with the name field. The voice may instead be set with <voice> in the SSML input. See SSML input.

In GetVoicesRequest, it filters the list of available voices, with optional fields name, language, gender, foreign_languages, styles, and sample_rate_hz. See Voice filters for more examples.

In GetVoicesResponse, it returns the list of available voices, with name, model, language, gender, sample_rate_hz. It includes foreign_languages and/or styles when available for the voice.

Field	Type	Description
name	string	The voice’s name, for example en-US-JennyNeural. Mandatory for SynthesisRequest with plain text input. Optional for SSML input. Used in GetVoicesRequest to search for a named voice. Included in GetVoicesResponse.
model	string	The voice’s model, for example neural. Included in GetVoicesResponse. Ignored otherwise.
language	string	IETF language code, for example en-US. Used in GetVoicesRequest and GetVoicesResponse, to return voices with a certain mother tongue. Ignored otherwise.
age_group	EnumAgeGroup	Ignored.
gender	EnumGender	Used in GetVoicesRequest and GetVoicesResponse, to return voices with a certain gender. Ignored otherwise.
sample_rate_hz	uint32	Used in GetVoicesRequest and GetVoicesResponse, to return a voice’s sampling rate. Ignored otherwise.
language_tlw	string	Ignored.
restricted	bool	Ignored.
versions	string	Ignored.
foreign_languages	string	Repeated. Used in GetVoicesRequest and GetVoicesResponse, to return the foreign languages of a multilingual voice. Ignored otherwise.
styles	string	Repeated. Used in GetVoicesRequest and GetVoicesResponse, to return the available styles of a voice. Ignored otherwise.

The Voice message includes different fields depending on the context:

GetVoicesRequest
  voice (Voice)
    name
    language
    gender (EnumGender)
    sample_rate_hz

GetVoicesResponse
  voices (Voice)
    name
    model
    language
    gender (EnumGender)
    sample_rate_hz
    foreign_languages
    styles

SynthesisRequest
  voice (Voice)
    name

EnumGender

Input field for GetVoicesRequest or output field for GetVoicesResponse, specifying gender for voices that support multiple genders. Included in Voice.

Name	Number	Description
ANY	0	Any gender voice. Default for GetVoicesRequest.
MALE	1	Male voice.
FEMALE	2	Female voice.
NEUTRAL	3	Neutral gender voice. Ignored.
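
As an illustration, the enum numbers in the table above can be mirrored locally, for example to turn a command-line option into a filter value. This is a sketch only: the Gender class and gender_filter helper are hypothetical, and the generated stubs already provide their own EnumGender.

```python
from enum import IntEnum


class Gender(IntEnum):
    """Local mirror of EnumGender; numbers come from the table above."""
    ANY = 0
    MALE = 1
    FEMALE = 2
    NEUTRAL = 3  # ignored by Neural TTSaaS


def gender_filter(label=None):
    """Map an optional user-supplied label to an EnumGender number."""
    if not label:
        return int(Gender.ANY)  # ANY is the default for GetVoicesRequest
    try:
        return int(Gender[label.strip().upper()])
    except KeyError:
        raise ValueError(f"unknown gender: {label!r}") from None
```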

GetVoicesResponse

Output message in response to GetVoicesRequest. Contains information about the voices that match the input criteria, if any, and includes foreign languages and styles for voices that support them.

To use the styles in a synthesis request, see Input to synthesize: SSML elements: Voice style.

Field	Type	Description
voices	Voice	Repeated. Voices and characteristics returned.

For example, this is the response to GetVoices for American English voices. Notice that voice styles are included for voices that support them, and foreign languages are listed for the Jenny multilingual voice.

2022-10-24 15:56:27,265 (140266945111872) DEBUG [voice {
    language: "en-US"
}
]
2022-10-24 15:56:27,265 (140266945111872) INFO  Sending GetVoices request
2022-10-24 15:56:27,405 (140266945111872) INFO  voices {
    name: "en-US-JennyNeural"
    model: "neural"
    language: "en-US"
    gender: FEMALE
    sample_rate_hz: 24000
    styles: "assistant"
    styles: "chat"
    styles: "customerservice"
    styles: "newscast"
    styles: "angry"
    styles: "cheerful"
    styles: "sad"
    styles: "excited"
    styles: "friendly"
    styles: "terrified"
    styles: "shouting"
    styles: "unfriendly"
    styles: "whispering"
    styles: "hopeful"
}
voices {
    name: "en-US-JennyMultilingualNeural"
    model: "neural"
    language: "en-US"
    gender: FEMALE
    sample_rate_hz: 24000
    foreign_languages: "de-DE"
    foreign_languages: "en-AU"
    foreign_languages: "en-CA"
    foreign_languages: "en-GB"
    foreign_languages: "es-ES"
    foreign_languages: "es-MX"
    foreign_languages: "fr-CA"
    foreign_languages: "fr-FR"
    foreign_languages: "it-IT"
    foreign_languages: "ja-JP"
    foreign_languages: "ko-KR"
    foreign_languages: "pt-BR"
    foreign_languages: "zh-CN"
}
voices {
    name: "en-US-GuyNeural"
    model: "neural"
    language: "en-US"
    gender: MALE
    sample_rate_hz: 24000
    styles: "newscast"
    styles: "angry"
    styles: "cheerful"
    styles: "sad"
    styles: "excited"
    styles: "friendly"
    styles: "terrified"
    styles: "shouting"
    styles: "unfriendly"
    styles: "whispering"
    styles: "hopeful"
}
voices {
    name: "en-US-AmberNeural"
    model: "enhanced"
    language: "en-US"
    gender: FEMALE
    sample_rate_hz: 24000
}
voices {
    name: "en-US-AnaNeural"
    model: "enhanced"
    language: "en-US"
    gender: FEMALE
    sample_rate_hz: 24000
}
. . . voices omitted here . . . 
voices {
    name: "en-US-ZiraRUS"
    model: "enhanced"
    language: "en-US"
    gender: FEMALE
    sample_rate_hz: 24000
}

2022-10-24 15:56:27,405 (140266945111872) INFO  Done running file [flow.py]
2022-10-24 15:56:27,407 (140266945111872) INFO  Iteration #1 complete
2022-10-24 15:56:27,407 (140266945111872) INFO  Done
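
A client will often search this response for voices with a particular capability. The sketch below works on any objects that have name, styles, and foreign_languages attributes. The helper names are hypothetical, and SAMPLE_VOICES is an abbreviated stand-in for the response above rather than real stub objects.

```python
from types import SimpleNamespace

# Abbreviated stand-ins for a few voices from the response above.
SAMPLE_VOICES = [
    SimpleNamespace(name="en-US-JennyNeural",
                    styles=["assistant", "chat", "customerservice", "newscast"],
                    foreign_languages=[]),
    SimpleNamespace(name="en-US-JennyMultilingualNeural",
                    styles=[],
                    foreign_languages=["de-DE", "en-AU"]),
    SimpleNamespace(name="en-US-GuyNeural",
                    styles=["newscast", "angry"],
                    foreign_languages=[]),
    SimpleNamespace(name="en-US-AmberNeural", styles=[], foreign_languages=[]),
]


def voices_with_style(voices, style):
    """Return the names of voices whose styles include the given style."""
    return [v.name for v in voices if style in v.styles]


def multilingual_voices(voices):
    """Map each multilingual voice name to its foreign languages."""
    return {v.name: list(v.foreign_languages)
            for v in voices if v.foreign_languages}


print(voices_with_style(SAMPLE_VOICES, "newscast"))
print(multilingual_voices(SAMPLE_VOICES))
```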

SynthesisRequest

Input message for the Synthesize method. Specifies input text, audio parameters, and events to subscribe to, in exchange for synthesized audio. See Supported fields and defaults.

For more examples, see Sample synthesis client for Neural TTSaaS > Synthesize text input and Synthesize SSML input.

SynthesisRequest
Field	Type	Description
voice	Voice	Mandatory for plain text input. Optional for SSML input. The voice to use for audio synthesis.
audio_params	AudioParameters	Output audio parameters, such as encoding and volume. Default is PCM audio at 22050 Hz.
input	Input	Mandatory. Input text to synthesize.
event_params	EventParameters	Markers and other information to include in server events returned during synthesis.
client_data	map<string,string>	Map of client-supplied key:value pairs to inject into the event log.
user_id	string	Identifies a specific user within the application.

The SynthesisRequest message includes:

SynthesisRequest
  voice (Voice)
    name
  audio_params (AudioParameters)
    audio_format (AudioFormat)
  input (Input)
    text (Text)
    ssml (SSML)
  event_params (EventParameters)
    send_bookmark_marker_events
    send_visemes
    suppress_input
  client_data
  user_id

This synthesis request includes most fields:

SynthesisRequest(
    voice = Voice(
        name = "en-US-JennyNeural"
    ),
    audio_params = AudioParameters(
        audio_format = AudioFormat(
            pcm = PCM(sample_rate_hz = 22050) 
        )
    ),
    input = Input(
        text = Text(
            text = "Your coffee will be ready in 5 minutes"
        )
    ),
    event_params = EventParameters(
        send_visemes = True  
    ),
    client_data = {"company":"Aardvark Coffee","user":"Leslie"},
    user_id = "leslie.somebody@aardvark.com"
)

This is a minimal synthesis request, using all defaults:

SynthesisRequest(
    voice = Voice(
        name = "en-US-JennyNeural"
    ),
    input = Input(
        text = Text(
            text = "Your coffee will be ready in 5 minutes."
        )
    )
)

The API voice field is optional in SSML input. A voice may instead be provided in the <voice> element in the SSML input.

SynthesisRequest(
    input = Input(
        ssml = SSML(
            text = '''<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
            <voice name="en-US-JennyNeural">Your coffee will be ready in 5 minutes.</voice>
            </speak>'''
        )
    )
)
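
The voice rules for the two input types can be checked on the client before a request is sent. This is a minimal sketch: synthesis_request_fields is a hypothetical helper that returns a plain nested dict shaped like SynthesisRequest, and it assumes that exactly one of text or ssml is provided per request.

```python
def synthesis_request_fields(text=None, ssml=None, voice_name=None):
    """Return a nested dict shaped like SynthesisRequest, or raise ValueError."""
    if (text is None) == (ssml is None):
        # Assumption: a request carries either plain text or SSML, not both.
        raise ValueError("provide exactly one of text or ssml")
    if text is not None and not voice_name:
        raise ValueError("voice name is mandatory for plain text input")

    if text is not None:
        fields = {"input": {"text": {"text": text}}}
    else:
        fields = {"input": {"ssml": {"text": ssml}}}
    if voice_name:
        # Optional for SSML, where <voice> may set the voice instead.
        fields["voice"] = {"name": voice_name}
    return fields
```

If the generated stubs are available, a dict like this can be converted to a message with google.protobuf.json_format.ParseDict.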

AudioParameters

Input message for audio-related parameters during synthesis, including encoding, volume, and audio length. Included in SynthesisRequest.

AudioParameters
Field	Type	Description
audio_format	AudioFormat	Audio encoding. Default PCM 22050 Hz.
volume_percentage	uint32	Ignored.
speaking_rate_factor	float	Ignored.
audio_chunk_duration_ms	uint32	Ignored.
target_audio_length_ms	uint32	Ignored.
disable_early_emission	bool	Ignored.

The AudioParameters message includes:

SynthesisRequest
  voice (Voice)
  audio_params (AudioParameters)
    audio_format (AudioFormat)
      pcm (PCM)
      alaw (ALaw)
      ulaw (ULaw)
      ogg_opus (OggOpus)
      opus (Opus)

AudioFormat

Input message for audio encoding of synthesized text. Included in AudioParameters.

AudioFormat
Field	Type	Description
pcm	PCM	Signed 16-bit little endian PCM.
alaw	ALaw	G.711 A-law, 8 kHz.
ulaw	ULaw	G.711 Mu-law, 8 kHz.
ogg_opus	OggOpus	Ogg Opus, 16 kHz or 24 kHz.
opus	Opus	Opus, 16 kHz or 24 kHz. The audio will be sent one Opus packet at a time.

The AudioFormat message includes:

SynthesisRequest
  voice (Voice)
  audio_params (AudioParameters)
    audio_format (AudioFormat)
      pcm (PCM)
        sample_rate_hz
      alaw (ALaw)
      ulaw (ULaw)
      ogg_opus (OggOpus)
        sample_rate_hz
      opus (Opus)
        sample_rate_hz
        bit_rate_bps

The PCM audio format is shown, with alternatives in commented lines:

SynthesisRequest(
    voice = Voice(
        name = "en-US-JennyNeural"
    ),
    audio_params = AudioParameters(
        audio_format = AudioFormat(
            pcm = PCM(sample_rate_hz = 22050)
#           alaw = ALaw()
#           ulaw = ULaw()
#           ogg_opus = OggOpus(sample_rate_hz = 16000)
#           opus = Opus(sample_rate_hz = 16000, bit_rate_bps = 32000)
        )
    )
)

PCM

Input message defining PCM sample rate. Included in AudioFormat.

PCM
Field	Type	Description
sample_rate_hz	uint32	Output sample rate in Hz. Supported values: 8000, 16000, 22050, 24000.
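
PCM output is raw sample data without a header, so it usually needs a container before a media player can open it. The sketch below wraps collected PCM bytes in a WAV container using the Python standard library. It assumes single-channel (mono) audio, which this page does not state; pcm_to_wav and SUPPORTED_PCM_RATES are hypothetical names.

```python
import io
import wave

# Sample rates listed as supported in the PCM table above.
SUPPORTED_PCM_RATES = (8000, 16000, 22050, 24000)


def pcm_to_wav(pcm_bytes, sample_rate_hz=22050, channels=1):
    """Wrap signed 16-bit little-endian PCM in a WAV container."""
    if sample_rate_hz not in SUPPORTED_PCM_RATES:
        raise ValueError(f"unsupported PCM sample rate: {sample_rate_hz}")
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)        # assumption: mono output
        wav.setsampwidth(2)               # 16-bit samples = 2 bytes
        wav.setframerate(sample_rate_hz)
        wav.writeframes(pcm_bytes)
    return buffer.getvalue()
```

The returned bytes can be written to a .wav file as they are.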

ALaw

Input message defining A-law audio format. Included in AudioFormat. G.711 audio formats are set to 8 kHz.

ULaw

Input message defining Mu-law audio format. Included in AudioFormat. G.711 audio formats are set to 8 kHz.

OggOpus

Input message defining Ogg Opus output rate. Included in AudioFormat.

OggOpus
Field	Type	Description
sample_rate_hz	uint32	Output sample rate in Hz. Supported values: 16000, 24000.
bit_rate_bps	uint32	Ignored.
max_frame_duration_ms	float	Ignored.
complexity	uint32	Ignored.
vbr	EnumVariableBitrate	Ignored.

Opus

Input message defining Opus output rate. Included in AudioFormat.

Opus
Field	Type	Description
sample_rate_hz	uint32	Output sample rate in Hz. Supported values: 16000, 24000.
bit_rate_bps	uint32	Output bit rate in bits per second, with 20 ms frames. Supported values at 16 kHz: 0 (default, meaning 32000) or 32000. Supported values at 24 kHz: 0 (default, meaning 24000), 24000, or 48000.
max_frame_duration_ms	float	Ignored.
complexity	uint32	Ignored.
vbr	EnumVariableBitrate	Ignored.
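
The bit rate rules in the table above can be expressed as a small lookup. This is a sketch for client-side validation only: the names OPUS_BITRATES and effective_opus_bitrate are hypothetical, and the server remains the authority on what it accepts.

```python
# Defaults and allowed values from the Opus table above, keyed by sample rate.
OPUS_BITRATES = {
    16000: {"default": 32000, "allowed": (32000,)},
    24000: {"default": 24000, "allowed": (24000, 48000)},
}


def effective_opus_bitrate(sample_rate_hz, bit_rate_bps=0):
    """Return the bit rate that applies, treating 0 as 'use the default'."""
    try:
        spec = OPUS_BITRATES[sample_rate_hz]
    except KeyError:
        raise ValueError(f"unsupported Opus sample rate: {sample_rate_hz}") from None
    if bit_rate_bps == 0:
        return spec["default"]
    if bit_rate_bps not in spec["allowed"]:
        raise ValueError(
            f"bit rate {bit_rate_bps} is not supported at {sample_rate_hz} Hz"
        )
    return bit_rate_bps
```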

Input

Input message containing text to synthesize and synthesis parameters such as tuning data. Included in SynthesisRequest. The input may be plain text or SSML. See Input to synthesize for examples.

Input
Field	Type	Description
text	Text	Plain text input.
ssml	SSML	SSML input, including text and SSML elements.
tokenized_sequence	TokenizedSequence	Not allowed.
resources	SynthesisResource	Ignored.
lid_params	LanguageIdentificationParameters	Ignored.
download_params	DownloadParameters	Ignored.

The Input message includes:

SynthesisRequest
  voice (Voice)
  input (Input)
    text (Text)
      text
    ssml (SSML)
      text

Text

Input message for synthesizing plain text. The encoding must be UTF-8. For plain text input, a voice field is required.

Text
Field	Type	Description
text	string	Plain input text in UTF-8 encoding.
uri	string	Not allowed.

For example, this is plain text input:

SynthesisRequest(
    voice = Voice(
        name = "en-US-JennyNeural"
    ),
    input = Input(
        text = Text(
            text = "Your coffee will be ready in 5 minutes"
        )
    )
)

SSML

Input message for synthesizing SSML input. See SSML elements for supported elements and examples.

SSML
Field	Type	Description
text	string	SSML input text and elements.
uri	string	Not allowed.
ssml_validation_mode	EnumSSMLValidationMode	Ignored.

For example, this is SSML input. The SynthesisRequest voice field is ignored and may be omitted because the voice is set in the <voice> element in the SSML.

SynthesisRequest(
    input = Input(
        ssml = SSML(
            text = '''<speak xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US" version="1.0">
            <voice name="en-US-JennyNeural">Your coffee will be ready in 5 minutes.</voice>
            </speak>'''
        )
    )
)
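
When the spoken text comes from user data, characters such as & and < must be escaped or the SSML will not be well-formed XML. The sketch below builds the same document shape as the example above using standard-library escaping. build_ssml is a hypothetical helper, and it covers only the <speak> and <voice> elements.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text, voice_name, lang="en-US"):
    """Return an SSML document with one <voice> element and escaped text."""
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang={quoteattr(lang)}>'            # quoteattr adds the quotes
        f'<voice name={quoteattr(voice_name)}>'
        f'{escape(text)}'                         # escapes &, <, and >
        '</voice></speak>'
    )
```

The result can be passed as the text field of the SSML message.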

EventParameters

Input message that defines event subscription parameters. Included in SynthesisRequest. Events that are requested are sent throughout the SynthesisResponse stream as they are generated.

Log events are produced throughout a synthesis request for events such as a voice loaded by the server or an audio chunk being ready to send.

EventParameters
Field	Type	Description
send_sentence_marker_events	bool	Ignored.
send_word_marker_events	bool	Ignored.
send_phoneme_marker_events	bool	Ignored.
send_bookmark_marker_events	bool	Bookmark marker. Default: do not send.
send_paragraph_marker_events	bool	Ignored.
send_visemes	bool	Lipsync information. Default: do not send.
send_log_events	bool	Ignored.
suppress_input	bool	Whether to omit input text and URIs from log events. By default, these items are included.

The EventParameters message includes:

SynthesisRequest
  voice (Voice)
  input (Input)
  event_params (EventParameters)
    send_bookmark_marker_events
    send_visemes
    suppress_input

Event parameters in SynthesisRequest

SynthesisRequest(
    voice = Voice(
        name = "en-US-JennyNeural"
    ),
    input = Input(
        text = Text(
            text = "Your coffee will be ready in 5 minutes."
        )
    ),
    event_params = EventParameters(
        send_visemes = True
    )
)

SynthesisResponse

Output message in response to a SynthesisRequest. The server returns a stream of SynthesisResponse messages, each containing one of:

A status response, indicating completion or failure of the request. This is received only once and signifies the end of a Synthesize call.
A list of events the client has requested. This can be received many times. See EventParameters for details.
An audio buffer. This may be received many times.

SynthesisResponse
Field	Type	Description
status	Status	A status response, indicating completion or failure of the request.
events	Events	A list of events. See EventParameters for details.
audio	bytes	The latest audio buffer.

The SynthesisResponse message includes:

SynthesisResponse
  status (Status)
    code
    message
    details
  events (Events)
    events (Event)
      name
      values
  audio

Status

Output message containing a status response, indicating completion or failure of a Synthesize call. Included in SynthesisResponse.

Status
Field	Type	Description
code	uint32	HTTP-style return code: 200, 4xx, or 5xx as appropriate. See Status codes.
message	string	Brief description of the status.
details	string	Longer description if available.
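
Because the code follows HTTP conventions, a client can branch on its range. This is a minimal sketch; classify_status and the labels it returns are illustrative and not defined by the API.

```python
def classify_status(code):
    """Map an HTTP-style status code to a coarse outcome label."""
    if code == 200:
        return "success"
    if 400 <= code < 500:
        return "client_error"   # problem with the request
    if 500 <= code < 600:
        return "server_error"   # problem on the service side
    return "unknown"
```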

Events

Output message defining a container for a list of events. This container is needed because oneof does not allow repeated parameters in Protobuf. Included in SynthesisResponse.

Events
Field	Type	Description
events	Event	Repeated. One or more events.

Event

Output message defining an event message. Included in Events. See EventParameters for details.

Event
Field	Type	Description
name	string	Either “Markers” or the name of the event in the case of a Log Event.
values	map<string,string>	Map of key:value data relevant to the current event.
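
Since the name field distinguishes marker events from log events, a client can sort received events into two groups. The sketch below assumes each event exposes name and values as described in the table above; split_events is a hypothetical helper.

```python
def split_events(events):
    """Separate marker events from log events, based on the event name."""
    markers, logs = [], []
    for event in events:
        entry = (event.name, dict(event.values))  # copy the map into a dict
        if event.name == "Markers":
            markers.append(entry)
        else:
            logs.append(entry)
    return markers, logs
```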

Scalar value types

The data types in the proto files are mapped to equivalent types in the generated client stub files.

Scalar data types
Proto	Notes	C++	Java	Python
double		double	double	float
float		float	float	float
int32	Uses variable-length encoding. Inefficient for encoding negative numbers. If your field is likely to have negative values, use sint32 instead.	int32	int	int
int64	Uses variable-length encoding. Inefficient for encoding negative numbers. If your field is likely to have negative values, use sint64 instead.	int64	long	int/long
uint32	Uses variable-length encoding.	uint32	int	int/long
uint64	Uses variable-length encoding.	uint64	long	int/long
sint32	Uses variable-length encoding. Signed int value. These encode negative numbers more efficiently than regular int32s.	int32	int	int
sint64	Uses variable-length encoding. Signed int value. These encode negative numbers more efficiently than regular int64s.	int64	long	int/long
fixed32	Always four bytes. More efficient than uint32 if values are often greater than 2^28.	uint32	int	int
fixed64	Always eight bytes. More efficient than uint64 if values are often greater than 2^56.	uint64	long	int/long
sfixed32	Always four bytes.	int32	int	int
sfixed64	Always eight bytes.	int64	long	int/long
bool		bool	boolean	bool
string	A string must always contain UTF-8 encoded or 7-bit ASCII text.	string	String	str/unicode
bytes	May contain any arbitrary sequence of bytes.	string	ByteString	str (Python 2), bytes (Python 3)
