Synthesizer gRPC API for Neural TTSaaS
The Synthesizer gRPC API contains methods for requesting speech synthesis from Neural TTSaaS, using Microsoft neural voices.
Tip:
Try out this API using the Sample synthesis client for Neural TTSaaS that you can download and run.
Proto file structure
The Synthesizer API is defined in the synthesizer.proto file.
└── nuance
    └── tts
        └── v1
            └── synthesizer.proto
Neural TTSaaS does not support all fields in synthesizer.proto. See Supported fields and defaults.
For Neural TTSaaS, the proto file defines a Synthesizer service with two RPC methods: GetVoices and Synthesize.
Sequence flow
The essential tasks are illustrated in the following high-level sequence flow of a client application at runtime:
Synthesizer
The Synthesizer service offers these functionalities:
- GetVoices: Queries the list of available voices, with filters to reduce the search space.
- Synthesize: Synthesizes audio from input text and parameters, and returns an audio stream.
Method | Request Type | Response Type |
---|---|---|
GetVoices | GetVoicesRequest | GetVoicesResponse |
Synthesize | SynthesisRequest | SynthesisResponse stream |
UnarySynthesize (Ignored) | SynthesisRequest | UnarySynthesisResponse |
GetVoicesRequest
Input message for the GetVoices method, to query voices available to the client. For more examples, see Sample synthesis client for Neural TTSaaS: Get voices and Voice filters.
For information on the Microsoft voices returned by GetVoices, see the Microsoft documentation: Language and voice support for the Speech service.
Field | Type | Description |
---|---|---|
voice | Voice | Optional filter for the voices to retrieve. For example, set language to en-US to return only American English voices. |
The GetVoicesRequest message includes:
GetVoicesRequest
  voice (Voice)
    name
    language
    gender (EnumGender)
    sample_rate_hz
For example, this requests information about all female American English voices:
GetVoicesRequest(
    voice = Voice(
        language = "en-US",
        gender = EnumGender.FEMALE
    )
)
This asks about one named voice:
GetVoicesRequest(
    voice = Voice(
        name = "en-US-JennyNeural"
    )
)
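To make the effect of these filters concrete, the following sketch applies the same kind of filtering on the client side to a list of voices that has already been retrieved. It is only an illustration: the `matches` and `filter_voices` helpers and the plain dictionaries are hypothetical stand-ins for the generated Voice message, and exact matching with ANY acting as a wildcard is an assumption, not confirmed server behavior.

```python
# Hypothetical client-side illustration of voice filtering; not the server's implementation.
ANY = "ANY"  # mirrors EnumGender.ANY, the default for GetVoicesRequest


def matches(voice, name=None, language=None, gender=ANY):
    """Return True if a voice dict satisfies every filter that is set."""
    if name is not None and voice["name"] != name:
        return False
    if language is not None and voice["language"] != language:
        return False
    if gender != ANY and voice["gender"] != gender:
        return False
    return True


def filter_voices(voices, **filters):
    """Keep only the voices that satisfy all of the given filters."""
    return [v for v in voices if matches(v, **filters)]


# Two voices taken from the GetVoices output documented on this page.
voices = [
    {"name": "en-US-JennyNeural", "language": "en-US", "gender": "FEMALE"},
    {"name": "en-US-GuyNeural", "language": "en-US", "gender": "MALE"},
]

female_us = filter_voices(voices, language="en-US", gender="FEMALE")
print([v["name"] for v in female_us])
```

Leaving every filter unset returns the full list, which corresponds to sending a GetVoicesRequest with an empty voice field.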
Voice
Input or output message for voices. Different fields are supported depending on the method.
- In SynthesisRequest:
  - For plain text, it specifies the voice to use with the mandatory name field.
  - For SSML, it optionally specifies the voice to use with the name field. The voice may instead be set with <voice> in the SSML input. See SSML input.
- In GetVoicesRequest, it filters the list of available voices, with optional fields name, language, gender, foreign_languages, styles, and sample_rate_hz. See Voice filters for more examples.
- In GetVoicesResponse, it returns the list of available voices, with name, model, language, gender, and sample_rate_hz. It includes foreign_languages and/or styles when available for the voice.
Field | Type | Description |
---|---|---|
name | string | The voice’s name, for example en-US-JennyNeural. Mandatory for SynthesisRequest with plain text input. Optional for SSML input. Used in GetVoicesRequest to search for a named voice. Included in GetVoicesResponse. |
model | string | The voice’s model, for example neural. Included in GetVoicesResponse. Ignored otherwise. |
language | string | IETF language code, for example en-US. Used in GetVoicesRequest and GetVoicesResponse, to return voices with a certain mother tongue. Ignored otherwise. |
age_group | EnumAgeGroup | Ignored. |
gender | EnumGender | Used in GetVoicesRequest and GetVoicesResponse, to return voices with a certain gender. Ignored otherwise. |
sample_rate_hz | uint32 | Used in GetVoicesRequest and GetVoicesResponse, to return a voice’s sampling rate. Ignored otherwise. |
language_tlw | string | Ignored. |
restricted | bool | Ignored. |
versions | string | Ignored. |
foreign_languages | string | Repeated. Used in GetVoicesRequest and GetVoicesResponse, to return the foreign languages of a multilingual voice. Ignored otherwise. |
styles | string | Repeated. Used in GetVoicesRequest and GetVoicesResponse, to return the available styles of a voice. Ignored otherwise. |
The Voice message includes different fields depending on the context:
GetVoicesRequest
  voice (Voice)
    name
    language
    gender (EnumGender)
    sample_rate_hz
GetVoicesResponse
  voice (Voice)
    name
    model
    language
    gender (EnumGender)
    sample_rate_hz
    foreign_languages
    styles
SynthesisRequest
  voice (Voice)
    name
EnumGender
Input field for GetVoicesRequest or output field for GetVoicesResponse, specifying gender for voices that support multiple genders. Included in Voice.
Name | Number | Description |
---|---|---|
ANY | 0 | Any gender voice. Default for GetVoicesRequest. |
MALE | 1 | Male voice. |
FEMALE | 2 | Female voice. |
NEUTRAL | 3 | Neutral gender voice. Ignored. |
GetVoicesResponse
Output message in response to GetVoicesRequest. Contains information about the voices that match the input criteria, if any, and includes foreign languages and styles for voices that support them.
To use the styles in a synthesis request, see Input to synthesize: SSML elements: Voice style.
Field | Type | Description |
---|---|---|
voice | Voice | Repeated. Voices and characteristics returned. |
For example, this is the response to GetVoices for American English voices. Notice that voice styles are included for voices that support them, and foreign languages are listed for the Jenny multilingual voice.
2022-10-24 15:56:27,265 (140266945111872) DEBUG [voice {
language: "en-US"
}
]
2022-10-24 15:56:27,265 (140266945111872) INFO Sending GetVoices request
2022-10-24 15:56:27,405 (140266945111872) INFO voices {
name: "en-US-JennyNeural"
model: "neural"
language: "en-US"
gender: FEMALE
sample_rate_hz: 24000
styles: "assistant"
styles: "chat"
styles: "customerservice"
styles: "newscast"
styles: "angry"
styles: "cheerful"
styles: "sad"
styles: "excited"
styles: "friendly"
styles: "terrified"
styles: "shouting"
styles: "unfriendly"
styles: "whispering"
styles: "hopeful"
}
voices {
name: "en-US-JennyMultilingualNeural"
model: "neural"
language: "en-US"
gender: FEMALE
sample_rate_hz: 24000
foreign_languages: "de-DE"
foreign_languages: "en-AU"
foreign_languages: "en-CA"
foreign_languages: "en-GB"
foreign_languages: "es-ES"
foreign_languages: "es-MX"
foreign_languages: "fr-CA"
foreign_languages: "fr-FR"
foreign_languages: "it-IT"
foreign_languages: "ja-JP"
foreign_languages: "ko-KR"
foreign_languages: "pt-BR"
foreign_languages: "zh-CN"
}
voices {
name: "en-US-GuyNeural"
model: "neural"
language: "en-US"
gender: MALE
sample_rate_hz: 24000
styles: "newscast"
styles: "angry"
styles: "cheerful"
styles: "sad"
styles: "excited"
styles: "friendly"
styles: "terrified"
styles: "shouting"
styles: "unfriendly"
styles: "whispering"
styles: "hopeful"
}
voices {
name: "en-US-AmberNeural"
model: "enhanced"
language: "en-US"
gender: FEMALE
sample_rate_hz: 24000
}
voices {
name: "en-US-AnaNeural"
model: "enhanced"
language: "en-US"
gender: FEMALE
sample_rate_hz: 24000
}
. . . voices omitted here . . .
voices {
name: "en-US-ZiraRUS"
model: "enhanced"
language: "en-US"
gender: FEMALE
sample_rate_hz: 24000
}
2022-10-24 15:56:27,405 (140266945111872) INFO Done running file [flow.py]
2022-10-24 15:56:27,407 (140266945111872) INFO Iteration #1 complete
2022-10-24 15:56:27,407 (140266945111872) INFO Done
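Once a response like this has been received, a client often needs to find the voices that offer a particular style. The sketch below assumes the voices have been copied into plain dictionaries (a hypothetical representation; with the generated classes, the fields would be read from each Voice message) and uses an abbreviated subset of the values shown in the log above. The `voices_by_style` helper is illustrative and not part of the API.

```python
from collections import defaultdict

# Abbreviated subset of the response above, as plain dicts (hypothetical representation).
voices = [
    {"name": "en-US-JennyNeural", "styles": ["chat", "newscast", "cheerful"]},
    {"name": "en-US-JennyMultilingualNeural", "styles": []},
    {"name": "en-US-GuyNeural", "styles": ["newscast", "cheerful"]},
    {"name": "en-US-AmberNeural", "styles": []},
]


def voices_by_style(voices):
    """Map each style name to the names of the voices that offer it."""
    index = defaultdict(list)
    for voice in voices:
        for style in voice["styles"]:
            index[style].append(voice["name"])
    return dict(index)


index = voices_by_style(voices)
print(index["newscast"])
print(index["chat"])
```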
SynthesisRequest
Input message for the Synthesize method. Specifies input text, audio parameters, and events to subscribe to, in exchange for synthesized audio. See Supported fields and defaults.
For more examples, see Sample synthesis client for Neural TTSaaS > Synthesize text input and Synthesize SSML input.
Field | Type | Description |
---|---|---|
voice | Voice | Mandatory for plain text input. Optional for SSML input. The voice to use for audio synthesis. |
audio_params | AudioParameters | Output audio parameters, such as encoding and volume. Default is PCM audio at 22050 Hz. |
input | Input | Mandatory. Input text to synthesize. |
event_params | EventParameters | Markers and other information to include in server events returned during synthesis. |
client_data | map<string,string> | Map of client-supplied key:value pairs to inject into the event log. |
user_id | string | Identifies a specific user within the application. |
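The mandatory and optional rules in this table can be checked before a request is sent. The following is a minimal sketch, not part of the API: `check_request` is a hypothetical helper that works on plain Python values rather than the generated message classes, and the rule that text and ssml are mutually exclusive is an assumption based on the input being either plain text or SSML.

```python
def check_request(voice_name=None, text=None, ssml=None):
    """Return a list of problems with a planned request; an empty list means none found."""
    problems = []
    if text is None and ssml is None:
        problems.append("input is mandatory: provide text or ssml")
    if text is not None and ssml is not None:
        # Assumption: the input is either plain text or SSML, never both.
        problems.append("provide either text or ssml, not both")
    if text is not None and not voice_name:
        problems.append("voice name is mandatory for plain text input")
    return problems


# Plain text needs a voice; SSML may carry the voice in a <voice> element instead.
print(check_request(voice_name="en-US-JennyNeural", text="Your coffee will be ready in 5 minutes."))
print(check_request(text="Your coffee will be ready in 5 minutes."))
```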
The SynthesisRequest message includes:
SynthesisRequest
  voice (Voice)
    name
  audio_params (AudioParameters)
    audio_format (AudioFormat)
  input (Input)
    text (Text)
    ssml (SSML)
  event_params (EventParameters)
    send_bookmark_marker_events
    send_visemes
    suppress_input
  client_data
  user_id
This synthesis request includes most fields:
SynthesisRequest(
    voice = Voice(
        name = "en-US-JennyNeural"
    ),
    audio_params = AudioParameters(
        audio_format = AudioFormat(
            pcm = PCM(sample_rate_hz = 22050)
        )
    ),
    input = Input(
        text = Text(
            text = "Your coffee will be ready in 5 minutes"
        )
    ),
    event_params = EventParameters(
        send_visemes = True
    ),
    client_data = {"company":"Aardvark Coffee","user":"Leslie"},
    user_id = "leslie.somebody@aardvark.com"
)
This is a minimal synthesis request, using all defaults:
SynthesisRequest(
    voice = Voice(
        name = "en-US-JennyNeural"
    ),
    input = Input(
        text = Text(
            text = "Your coffee will be ready in 5 minutes."
        )
    )
)
The API voice field is optional for SSML input. A voice may instead be provided in the <voice> element in the SSML input.
SynthesisRequest(
    input = Input(
        ssml = SSML(
            text = '''<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
            <voice name="en-US-JennyNeural">Your coffee will be ready in 5 minutes.</voice>
            </speak>'''
        )
    )
)
AudioParameters
Input message for audio-related parameters during synthesis, including encoding, volume, and audio length. Included in SynthesisRequest.
Field | Type | Description |
---|---|---|
audio_format | AudioFormat | Audio encoding. Default PCM 22050 Hz. |
volume_percentage | uint32 | Ignored. |
speaking_rate_factor | float | Ignored. |
audio_chunk_duration_ms | uint32 | Ignored. |
target_audio_length_ms | uint32 | Ignored. |
disable_early_emission | bool | Ignored. |
The AudioParameters message includes:
SynthesisRequest
  voice (Voice)
  audio_params (AudioParameters)
    audio_format (AudioFormat)
      pcm (PCM)
      alaw (ALaw)
      ulaw (ULaw)
      ogg_opus (OggOpus)
      opus (Opus)
AudioFormat
Input message for audio encoding of synthesized text. Included in AudioParameters.
Field | Type | Description |
---|---|---|
pcm | PCM | Signed 16-bit little endian PCM. |
alaw | ALaw | G.711 A-law, 8 kHz. |
ulaw | ULaw | G.711 Mu-law, 8 kHz. |
ogg_opus | OggOpus | Ogg Opus, 16 kHz or 24 kHz. |
opus | Opus | Opus, 16 kHz or 24 kHz. The audio is sent one Opus packet at a time. |
The AudioFormat message includes:
SynthesisRequest
  voice (Voice)
  audio_params (AudioParameters)
    audio_format (AudioFormat)
      pcm (PCM)
        sample_rate_hz
      alaw (ALaw)
      ulaw (ULaw)
      ogg_opus (OggOpus)
        sample_rate_hz
      opus (Opus)
        sample_rate_hz
        bit_rate_bps
The PCM audio format is shown, with alternatives in commented lines:
SynthesisRequest(
    voice = Voice(
        name = "en-US-JennyNeural"
    ),
    audio_params = AudioParameters(
        audio_format = AudioFormat(
            pcm = PCM(sample_rate_hz = 22050)
            # alaw = ALaw()
            # ulaw = ULaw()
            # ogg_opus = OggOpus(sample_rate_hz = 16000)
            # opus = Opus(sample_rate_hz = 16000, bit_rate_bps = 32000)
        )
    )
)
PCM
Input message defining PCM sample rate. Included in AudioFormat.
Field | Type | Description |
---|---|---|
sample_rate_hz | uint32 | Output sample rate in Hz. Supported values: 8000, 16000, 22050, 24000. |
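Because PCM output is raw audio with no header, a client that wants a playable file has to add a container itself. The sketch below uses Python's standard `wave` module to wrap raw signed 16-bit little-endian PCM in a WAV file. The `pcm_to_wav` helper is hypothetical, and single-channel (mono) audio is an assumption that this page does not state.

```python
import io
import wave

# Sample rates listed as supported for PCM in the table above.
SUPPORTED_PCM_RATES = (8000, 16000, 22050, 24000)


def pcm_to_wav(pcm_bytes, sample_rate_hz=22050, channels=1):
    """Wrap raw signed 16-bit little-endian PCM in a WAV container."""
    if sample_rate_hz not in SUPPORTED_PCM_RATES:
        raise ValueError(f"unsupported PCM sample rate: {sample_rate_hz}")
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)        # mono is an assumption
        wav.setsampwidth(2)               # 16-bit samples are 2 bytes each
        wav.setframerate(sample_rate_hz)
        wav.writeframes(pcm_bytes)
    return buffer.getvalue()


# One second of silence at the default rate: 22050 samples of 2 bytes each.
silence = b"\x00\x00" * 22050
wav_bytes = pcm_to_wav(silence)
print(len(wav_bytes))
```

The sample rate passed to the helper should match the sample_rate_hz sent in the request, since the raw audio itself carries no rate information.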
ALaw
Input message defining A-law audio format. Included in AudioFormat. G.711 audio formats are set to 8 kHz.
ULaw
Input message defining Mu-law audio format. Included in AudioFormat. G.711 audio formats are set to 8 kHz.
OggOpus
Input message defining Ogg Opus output rate. Included in AudioFormat.
Field | Type | Description |
---|---|---|
sample_rate_hz | uint32 | Output sample rate in Hz. Supported values: 16000, 24000. |
bit_rate_bps | uint32 | Ignored. |
max_frame_duration_ms | float | Ignored. |
complexity | uint32 | Ignored. |
vbr | EnumVariableBitrate | Ignored. |
Opus
Input message defining Opus output rate. Included in AudioFormat.
Field | Type | Description |
---|---|---|
sample_rate_hz | uint32 | Output sample rate in Hz. Supported values: 16000, 24000. |
bit_rate_bps | uint32 | Output bitrate in bits per second, using 20 ms frames. Supported values at 16 kHz: 0 (default, meaning 32000) or 32000. Supported values at 24 kHz: 0 (default, meaning 24000), 24000, or 48000. |
max_frame_duration_ms | float | Ignored. |
complexity | uint32 | Ignored. |
vbr | EnumVariableBitrate | Ignored. |
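The relationship between sample_rate_hz and bit_rate_bps in this table can be expressed as a small lookup. The following sketch resolves the value 0 to the documented default and rejects combinations the table does not list. The `effective_opus_bitrate` helper is hypothetical, and rejecting unlisted values locally is a design choice for this example; how the service itself responds to them is not described here.

```python
# Supported Opus settings from the table above: sample rate -> (default, allowed bitrates).
OPUS_BITRATES = {
    16000: (32000, {32000}),
    24000: (24000, {24000, 48000}),
}


def effective_opus_bitrate(sample_rate_hz, bit_rate_bps=0):
    """Resolve bit_rate_bps, treating 0 as 'use the default for this sample rate'."""
    if sample_rate_hz not in OPUS_BITRATES:
        raise ValueError(f"unsupported Opus sample rate: {sample_rate_hz}")
    default, allowed = OPUS_BITRATES[sample_rate_hz]
    if bit_rate_bps == 0:
        return default
    if bit_rate_bps not in allowed:
        raise ValueError(f"unsupported bitrate {bit_rate_bps} for {sample_rate_hz} Hz")
    return bit_rate_bps


print(effective_opus_bitrate(16000))
print(effective_opus_bitrate(24000, 48000))
```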
Input
Input message containing text to synthesize and synthesis parameters, including tuning data, etc. Included in SynthesisRequest. The type of input may be plain text or SSML. See Input to synthesize for examples.
Field | Type | Description |
---|---|---|
text | Text | Plain text input. |
ssml | SSML | SSML input, including text and SSML elements. |
tokenized_sequence | TokenizedSequence | Not allowed. |
resources | SynthesisResource | Ignored. |
lid_params | LanguageIdentificationParameters | Ignored. |
download_params | DownloadParameters | Ignored. |
The Input message includes:
SynthesisRequest
  voice (Voice)
  input (Input)
    text (Text)
      text
    ssml (SSML)
      text
Text
Input message for synthesizing plain text. The encoding must be UTF-8. For plain text input, a voice field is required.
Field | Type | Description |
---|---|---|
text | string | Plain input text in UTF-8 encoding. |
uri | string | Not allowed. |
For example, this is plain text input:
SynthesisRequest(
    voice = Voice(
        name = "en-US-JennyNeural"
    ),
    input = Input(
        text = Text(
            text = "Your coffee will be ready in 5 minutes"
        )
    )
)
SSML
Input message for synthesizing SSML input. See SSML elements for supported elements and examples.
Field | Type | Description |
---|---|---|
text | string | SSML input text and elements. |
uri | string | Not allowed. |
ssml_validation_mode | EnumSSMLValidationMode | Ignored. |
For example, this is SSML input. The SynthesisRequest voice field is ignored and may be omitted because the voice is set in the <voice> element in the SSML.
SynthesisRequest(
    input = Input(
        ssml = SSML(
            text = '''<speak xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US" version="1.0">
            <voice name = "en-US-JennyNeural">Your coffee will be ready in 5 minutes.</voice>
            </speak>'''
        )
    )
)
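When the text to speak comes from user data, characters such as & and < must be escaped before they are placed inside SSML, or the input will not be well-formed XML. The sketch below builds the same kind of SSML document as the example above using only the Python standard library. The `build_ssml` helper is hypothetical and covers just the <speak> and <voice> elements shown on this page.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text, voice_name, language="en-US"):
    """Build a minimal SSML document with the voice set in a <voice> element."""
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang={quoteattr(language)}>'
        f'<voice name={quoteattr(voice_name)}>{escape(text)}</voice>'
        '</speak>'
    )


# The ampersand is escaped to &amp; so the document stays well-formed.
ssml_text = build_ssml("Coffee & cake will be ready in 5 minutes.", "en-US-JennyNeural")
print(ssml_text)
```

The resulting string would be passed as the text field of the SSML message.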
EventParameters
Input message that defines event subscription parameters. Included in SynthesisRequest. Events that are requested are sent throughout the SynthesisResponse stream as they are generated.
Log events are produced throughout a synthesis request for events such as a voice loaded by the server or an audio chunk being ready to send.
Field | Type | Description |
---|---|---|
send_sentence_marker_events | bool | Ignored. |
send_word_marker_events | bool | Ignored. |
send_phoneme_marker_events | bool | Ignored. |
send_bookmark_marker_events | bool | Bookmark marker. Default: do not send. |
send_paragraph_marker_events | bool | Ignored. |
send_visemes | bool | Lipsync information. Default: do not send. |
send_log_events | bool | Ignored. |
suppress_input | bool | Whether to omit input text and URIs from log events. By default, these items are included. |
The EventParameters message includes:
SynthesisRequest
  voice (Voice)
  input (Input)
  event_params (EventParameters)
    send_bookmark_marker_events
    send_visemes
    suppress_input
Event parameters in SynthesisRequest
SynthesisRequest(
    voice = Voice(
        name = "en-US-JennyNeural"
    ),
    input = Input(
        text = Text(
            text = "Your coffee will be ready in 5 minutes."
        )
    ),
    event_params = EventParameters(
        send_visemes = True
    )
)
SynthesisResponse
Output message in response to a SynthesisRequest, consisting of a stream of SynthesisResponse responses. Each response contains one of:
- A status response, indicating completion or failure of the request. This is received only once and signifies the end of a Synthesize call.
- A list of events the client has requested. This can be received many times. See EventParameters for details.
- An audio buffer. This may be received many times.
Field | Type | Description |
---|---|---|
status | Status | A status response, indicating completion or failure of the request. |
events | Events | A list of events. See EventParameters for details. |
audio | bytes | The latest audio buffer. |
The SynthesisResponse message includes:
SynthesisResponse
  status (Status)
    code
    message
    details
  events (Events)
    event (Event)
      name
      values
  audio
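A client typically loops over the response stream, appending audio buffers and collecting events until the status response arrives. The following sketch shows that loop against stand-in objects so it can run without the generated stubs. `FakeResponse` and `FakeStatus` are hypothetical classes that mimic the generated messages, the oneof group name "response" is an assumption about synthesizer.proto, and events are modeled as a plain list rather than an Events message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeStatus:
    """Hypothetical stand-in for the Status message."""
    code: int
    message: str = ""


@dataclass
class FakeResponse:
    """Hypothetical stand-in for a generated SynthesisResponse message."""
    status: Optional[FakeStatus] = None
    events: Optional[list] = None
    audio: Optional[bytes] = None

    def WhichOneof(self, _group):
        # Mimics the protobuf method: report which member of the oneof is set.
        for name in ("status", "events", "audio"):
            if getattr(self, name) is not None:
                return name
        return None


def consume_stream(responses):
    """Collect audio and events until the final status response arrives."""
    audio = bytearray()
    events = []
    status = None
    for response in responses:
        kind = response.WhichOneof("response")  # oneof name is an assumption
        if kind == "audio":
            audio.extend(response.audio)
        elif kind == "events":
            events.extend(response.events)
        elif kind == "status":
            status = response.status
            break  # the status response signifies the end of the Synthesize call
    return bytes(audio), events, status


stream = [
    FakeResponse(audio=b"\x01\x02"),
    FakeResponse(events=["viseme"]),
    FakeResponse(audio=b"\x03\x04"),
    FakeResponse(status=FakeStatus(code=200, message="OK")),
]
audio, events, status = consume_stream(stream)
print(len(audio), events, status.code)
```

With the real generated classes, the same loop would iterate over the stream returned by the Synthesize call instead of a list.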
Status
Output message containing a status response, indicating completion or failure of a Synthesize call. Included in SynthesisResponse.
Field | Type | Description |
---|---|---|
code | uint32 | HTTP-style return code: 200, 4xx, or 5xx as appropriate. See Status codes. |
message | string | Brief description of the status. |
details | string | Longer description if available. |
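Since the code field follows HTTP-style ranges, a client can branch on the range rather than on individual values. This is a minimal sketch: `classify_status` is a hypothetical helper, and labeling 4xx as a client-side problem and 5xx as a server-side problem follows the usual HTTP convention, which is assumed here rather than taken from the Status codes reference.

```python
def classify_status(code):
    """Map an HTTP-style status code to a coarse outcome label."""
    if code == 200:
        return "success"
    if 400 <= code < 500:
        return "client_error"   # assumption: 4xx means the request needs fixing
    if 500 <= code < 600:
        return "server_error"   # assumption: 5xx means a server-side failure
    return "unknown"


for code in (200, 400, 500):
    print(code, classify_status(code))
```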
Events
Output message defining a container for a list of events. This container is needed because oneof does not allow repeated parameters in Protobuf. Included in SynthesisResponse.
Field | Type | Description |
---|---|---|
events | Event | Repeated. One or more events. |
Event
Output message defining an event message. Included in Events. See EventParameters for details.
Field | Type | Description |
---|---|---|
name | string | Either “Markers” or the name of the event in the case of a Log Event. |
values | map<string,string> | Map of key:value data relevant to the current event. |
Scalar value types
The data types in the proto files are mapped to equivalent types in the generated client stub files.