Neural TTS essentials

Neural TTSaaS is a Nuance text-to-speech service that generates synthesized speech from text input. It accepts plain text or SSML and returns the synthesized speech as an audio stream.

Neural TTSaaS vs. TTSaaS

Neural TTSaaS is a reworking of Nuance’s text-to-speech engine, Nuance Vocalizer for Cloud version 2, also known as TTSaaS. Neural TTSaaS works with the neural Text-to-Speech feature of Microsoft Azure Cognitive Services for Speech.

Although the two services are similar, there are several differences between them, as summarized in this table and described in detail below.

Comparison of Neural TTSaaS and TTSaaS
Feature	Neural TTSaaS	TTSaaS
Synthesis engine	Microsoft Azure Cognitive Services for Speech	Nuance Vocalizer for Enterprise
Production URL	tts.api.nuance.com with header x-nuance-tts-neural	tts.api.nuance.com
Authorization	OAuth 2 protocol with Mix credentials	OAuth 2 protocol with Mix credentials
Voices	Microsoft neural voices	Nuance standard and enhanced voices
Input type	Plain text or SSML	Plain text, SSML, or Nuance control codes
Audio formats	PCM WAV 22.05 kHz, A-law, μ-law, Opus, Ogg Opus	PCM WAV 22.05 kHz, A-law, μ-law, Opus, Ogg Opus
Synthesis tuning	Microsoft custom lexicons	Nuance custom dictionaries, rulesets, ActivePrompt databases
SSML audio	Audio files on public HTTPS web server	Audio files in Nuance storage via URN, or on public HTTPS web server
Synthesizer API	Some fields not allowed, some fields ignored	All fields supported
Synthesizer HTTP API	Not supported	Supported
Storage gRPC API	Not supported	Supported
Sample synthesis client	Available: same client with different options	Available: same client with different options

Synthesis engine

Neural TTSaaS uses the neural Text-to-Speech feature of Microsoft Azure Cognitive Services for Speech. This service is a cloud-based text-to-speech engine that uses deep learning to synthesize speech from text. It’s part of Microsoft’s Speech service, which provides speech recognition and translation.

TTSaaS is based on Nuance Vocalizer for Enterprise, a text-to-speech engine that uses a different technology.

Production URL

Neural TTSaaS and TTSaaS are both reached through the same production URL, tts.api.nuance.com.

To call Neural TTSaaS, you must include the gRPC header x-nuance-tts-neural. When this header is not included, requests are routed to TTSaaS.
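As a rough sketch of how a client might attach this header, the helper below builds gRPC call metadata as a list of lowercase key/value pairs, which is the shape the grpcio package accepts through the metadata argument of a stub call. The function name and the header value "true" are assumptions for illustration; the text above only names the header, so check the sample client for the value it actually sends.

```python
from typing import List, Tuple

# Header name taken from the text above; gRPC metadata keys must be lowercase.
NEURAL_HEADER = "x-nuance-tts-neural"


def build_call_metadata(neural: bool) -> List[Tuple[str, str]]:
    """Return gRPC call metadata that selects the synthesis engine.

    The value "true" is an assumption; the documentation only states
    that the header must be present to reach Neural TTSaaS.
    """
    metadata: List[Tuple[str, str]] = []
    if neural:
        metadata.append((NEURAL_HEADER, "true"))
    # Without the header, the request is routed to TTSaaS.
    return metadata


if __name__ == "__main__":
    # With grpcio, the list would be passed as: stub_call(request, metadata=...)
    print(build_call_metadata(neural=True))
```

Keeping the routing decision in one helper makes it easy to switch the same client between the two services.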

Authorization

Like TTSaaS, Neural TTSaaS is a hosted Mix service, and you must authorize your client applications using the OAuth 2 protocol. This process is the same for both TTSaaS and Neural TTSaaS.
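The sketch below shows one common way to request an OAuth 2 access token, the client-credentials grant with HTTP Basic authentication. It only builds the request and does not send it. The grant type, the "tts" scope, the placeholder token URL, and the function name are all assumptions made for illustration; use the values given in Sample client applications > Authorize.

```python
import base64
import urllib.parse
import urllib.request

# Placeholder only: replace with the token endpoint given in the Mix documentation.
TOKEN_URL = "https://auth.example.com/oauth2/token"


def build_token_request(client_id: str, client_secret: str,
                        scope: str = "tts",
                        token_url: str = TOKEN_URL) -> urllib.request.Request:
    """Build (but do not send) a client-credentials token request."""
    # RFC 6749 section 2.3.1: URL-encode the credentials before Basic encoding.
    user = urllib.parse.quote(client_id, safe="")
    password = urllib.parse.quote(client_secret, safe="")
    basic = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")

    body = urllib.parse.urlencode(
        {"grant_type": "client_credentials", "scope": scope}
    ).encode("ascii")

    request = urllib.request.Request(token_url, data=body, method="POST")
    request.add_header("Authorization", "Basic " + basic)
    request.add_header("Content-Type", "application/x-www-form-urlencoded")
    return request


if __name__ == "__main__":
    req = build_token_request("my:client", "secret")
    print(req.get_method(), req.full_url)
```

Sending the request with urllib.request.urlopen would return a JSON body; in a standard OAuth 2 response the token is in the access_token field.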

See Sample client applications > Authorize.

Voices

Neural TTSaaS works seamlessly with Microsoft neural voices to render speech in many languages and locales, with different genders and styles available. These neural voices produce lifelike speech with realistic intonation and flow.

Microsoft neural voices are described in the Microsoft documentation on supported languages for text to speech.

You can list and filter voices programmatically to select the ones you want to use in your synthesis requests.
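To illustrate the idea of filtering, here is a minimal client-side sketch that selects voices by language and gender from a list of records. The VoiceInfo class, its fields, and the sample records are hypothetical stand-ins, not the Synthesizer API's message types or real service output; in practice, GetVoicesRequest and the voice filters described in the reference topics do this work.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class VoiceInfo:
    """Hypothetical voice record, used only for this illustration."""
    name: str
    language: str
    gender: str


def filter_voices(voices: Iterable[VoiceInfo],
                  language: Optional[str] = None,
                  gender: Optional[str] = None) -> List[VoiceInfo]:
    """Keep voices matching every criterion that was supplied."""
    selected = []
    for voice in voices:
        if language and voice.language.lower() != language.lower():
            continue
        if gender and voice.gender.lower() != gender.lower():
            continue
        selected.append(voice)
    return selected


# Illustrative records only; a real list would come from the service.
SAMPLE_VOICES = [
    VoiceInfo("en-US-JennyNeural", "en-US", "female"),
    VoiceInfo("en-US-GuyNeural", "en-US", "male"),
    VoiceInfo("fr-FR-DeniseNeural", "fr-FR", "female"),
]

if __name__ == "__main__":
    for voice in filter_voices(SAMPLE_VOICES, language="en-us"):
        print(voice.name)
```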

If you are using Mix, select a voice with a Neural model in Mix.dialog > Options > Project settings > TTS settings.

See:

Microsoft Supported languages
Synthesizer API > GetVoicesRequest
Reference topics > Voice filters

Input type

Neural TTSaaS supports two types of input: plain text and Speech Synthesis Markup Language (SSML). It does not support control codes used in TTSaaS: the input.tokenized_sequence field generates an error if you use it.

For finer control over the output, use SSML to specify pronunciation, intonation, and other aspects of the speech. Neural TTSaaS supports the SSML elements described in the Microsoft documentation on SSML, and this documentation provides several examples.
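As a small sketch of SSML input, the function below assembles a document with a pause and a slower second sentence, escapes the text, and checks that the result is well-formed XML before it is used. The function name and parameters are assumptions for illustration. The break and prosody elements are standard SSML; which elements the service accepts is defined by the Microsoft SSML documentation.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_ssml(first: str, second: str, pause_ms: int = 500,
               lang: str = "en-US") -> str:
    """Return an SSML string: first sentence, a pause, then a slow second sentence."""
    ssml = (
        f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang={quoteattr(lang)}>'
        f"{escape(first)}"
        f'<break time="{int(pause_ms)}ms"/>'
        f'<prosody rate="slow">{escape(second)}</prosody>'
        "</speak>"
    )
    # Raises xml.etree.ElementTree.ParseError if the markup is malformed.
    ET.fromstring(ssml)
    return ssml


if __name__ == "__main__":
    print(build_ssml("Your order has shipped.", "Thanks for shopping with us."))
```

Escaping the text matters because characters such as & and < would otherwise break the markup.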

See:

Reference topics > Input to synthesize for general information and SSML examples
Sample synthesis client for Neural TTSaaS to try out the service
Microsoft Speech Synthesis Markup Language (SSML) overview

Audio formats

Neural TTSaaS can generate speech in several audio formats and sampling rates. The default is PCM WAV audio at 22.05 kHz, but it also supports A-law, μ-law, Opus, and Ogg-encapsulated Opus (Ogg Opus).

Neural TTSaaS supports the same audio formats as TTSaaS, but it ignores the other audio parameters and some Opus parameters.

See Synthesizer gRPC API > AudioParameters, including audio formats.

Synthesis tuning

The synthesis resources available in TTSaaS (custom dictionaries, rulesets, and ActivePrompt databases) are not supported in Neural TTSaaS.

You may, however, improve your speech output with Microsoft tuning resources, including custom lexicons. You can then use them in Neural TTSaaS by including them in your synthesis requests.
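Microsoft custom lexicons are XML files that follow the W3C Pronunciation Lexicon Specification (PLS). As a hedged sketch, the function below writes a small lexicon that maps written forms to spoken aliases. The function name and the sample entry are illustrative assumptions, and the sketch does not show how the finished file is hosted or referenced in a request; see Reference topics > Input > Lexicon for that.

```python
import xml.etree.ElementTree as ET
from typing import Dict
from xml.sax.saxutils import escape, quoteattr

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"


def build_lexicon(aliases: Dict[str, str], lang: str = "en-US") -> str:
    """Return a PLS lexicon mapping each grapheme to a spoken alias."""
    lexemes = "".join(
        f"<lexeme><grapheme>{escape(grapheme)}</grapheme>"
        f"<alias>{escape(alias)}</alias></lexeme>"
        for grapheme, alias in aliases.items()
    )
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f'<lexicon version="1.0" xmlns="{PLS_NS}" alphabet="ipa" '
        f"xml:lang={quoteattr(lang)}>{lexemes}</lexicon>"
    )
    # Parse once to confirm the document is well-formed.
    ET.fromstring(xml.encode("utf-8"))
    return xml


if __name__ == "__main__":
    print(build_lexicon({"BTW": "by the way"}))
```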

A Microsoft demo page is also available to further test voices and their features.

See:

Reference topics > Tuning resources
Reference topics > Input > Lexicon for an example

SSML audio

Neural TTSaaS allows prerecorded audio files in SSML synthesis requests, using the <audio> element. The audio source is the URL of a wave file on a public HTTPS web server. Only HTTPS servers are supported.
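Because only HTTPS sources are accepted, a client can reject other URLs before it builds the request. The sketch below is one way to do that; the function name and the fallback-text parameter are assumptions for illustration, and the URL in the usage line is a placeholder.

```python
from urllib.parse import urlparse
from xml.sax.saxutils import escape, quoteattr


def audio_element(url: str, fallback_text: str = "") -> str:
    """Return an SSML <audio> element, refusing sources that are not HTTPS."""
    parsed = urlparse(url)
    if parsed.scheme != "https" or not parsed.netloc:
        raise ValueError(f"audio source must be an https:// URL, got: {url!r}")
    # Text inside <audio> is standard SSML fallback content, spoken if the
    # file cannot be played.
    return f"<audio src={quoteattr(url)}>{escape(fallback_text)}</audio>"


if __name__ == "__main__":
    print(audio_element("https://example.com/prompts/welcome.wav", "Welcome."))
```

The returned element is meant to be placed inside a complete SSML document.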

Unlike TTSaaS, Neural TTSaaS does not support audio files uploaded with the Storage API and referenced with a URN.

See:

Reference topics > Input > Prerecorded audio for an example
Microsoft demo page Audio Content Creation

Synthesizer API

Neural TTSaaS offers a gRPC synthesis API. Unlike TTSaaS, it does not offer a transcoded HTTP API or a Storage API for uploading resources to central storage.

See Synthesizer gRPC API for Neural TTSaaS.

Sample synthesis client

You can experiment with Neural TTSaaS using a sample synthesis client. A Python client is included in this documentation, along with instructions on how to use it. This client can obtain information about available voices and synthesize speech from text or SSML input.

The sample client provided with Neural TTSaaS is the same as the one used in TTSaaS, but with separate flow.py input files that show the different features of the two products.

In Neural TTSaaS, the client calls the TTS service, tts.api.nuance.com, with the gRPC header x-nuance-tts-neural to route its requests to Neural TTSaaS.

See Sample synthesis client for Neural TTSaaS.
