Neural TTS essentials

Neural TTSaaS is a Nuance text-to-speech service that generates synthesized speech from text input. It accepts plain text or SSML and returns the synthesized speech as an audio stream.

Neural TTSaaS vs. TTSaaS

Neural TTSaaS is a reworking of Nuance’s text-to-speech engine, Nuance Vocalizer for Cloud version 2, also known as TTSaaS. Neural TTSaaS works with the neural Text-to-Speech feature of Microsoft Azure Cognitive Services for Speech.

Although the two services are similar, there are several differences between them, as summarized in this table and described in detail below.

Comparison of NTTS and TTS
Feature | Neural TTSaaS | TTSaaS
Synthesis engine | Microsoft Azure Cognitive Services for Speech | Nuance Vocalizer for Enterprise
Production URL | tts.api.nuance.com with header x-nuance-tts-neural | tts.api.nuance.com
Authorization | OAuth 2 protocol with Mix credentials | OAuth 2 protocol with Mix credentials
Voices | Microsoft neural voices | Nuance standard and enhanced voices
Input type | Plain text or SSML | Plain text, SSML, or Nuance control codes
Audio formats | PCM WAV 22050 Hz, A-law, μ-law, Opus, Ogg Opus | PCM WAV 22050 Hz, A-law, μ-law, Opus, Ogg Opus
Synthesis tuning | Microsoft custom lexicons | Nuance custom dictionaries, rulesets, ActivePrompt databases
SSML audio | Audio files on public HTTPS web server | Audio files in Nuance storage via URN, or on public HTTPS web server
Synthesizer API | Some fields not allowed, some fields ignored | All fields supported
Synthesizer HTTP API | Not supported | Supported
Storage gRPC API | Not supported | Supported
Sample synthesis client | Available: same client with different options | Available: same client with different options

Synthesis engine

Neural TTSaaS uses the neural Text-to-Speech feature of Microsoft Azure Cognitive Services for Speech. This feature is a cloud-based text-to-speech engine that uses deep learning to synthesize speech from text. It’s part of Microsoft’s Speech service, which also provides speech recognition and translation.

TTSaaS is based on Nuance Vocalizer for Enterprise, a text-to-speech engine that uses a different technology.

Production URL

Neural TTSaaS and TTSaaS are both reached through the same production URL, tts.api.nuance.com.

To call Neural TTSaaS, you must include the gRPC header x-nuance-tts-neural. When this header is not included, requests are routed to TTSaaS.
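
The routing rule above can be sketched in Python as a small helper that builds gRPC call metadata. This is a minimal sketch, not the sample client's code: the header name comes from this page, but the header value "true" and the helper name build_call_metadata are assumptions made for illustration.

```python
# Header name as given in this documentation.
NEURAL_HEADER = "x-nuance-tts-neural"


def build_call_metadata(neural, extra=()):
    """Return gRPC call metadata as a list of (key, value) tuples.

    When `neural` is True, the routing header is added so the request
    is sent to Neural TTSaaS; otherwise it is omitted and the request
    is routed to TTSaaS.
    """
    metadata = list(extra)
    if neural:
        # Assumption: the value "true" is illustrative; this page only
        # states that the header must be present.
        metadata.append((NEURAL_HEADER, "true"))
    return metadata
```

With the Python grpc package, a list like this is typically passed through the metadata keyword argument of a stub call.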

Authorization

Like TTSaaS, Neural TTSaaS is a hosted Mix service, and you must authorize your client applications using the OAuth 2 protocol. This process is the same for both TTSaaS and Neural TTSaaS.
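
As a rough illustration of the OAuth 2 step, the following sketch builds (but does not send) a token request using the standard client credentials grant with HTTP Basic authentication. The grant type, the "tts" scope, and the function name are assumptions; the token URL is a parameter because it is not given on this page. Check the authorization instructions for the actual values.

```python
import base64
import urllib.parse
import urllib.request


def build_token_request(token_url, client_id, client_secret, scope="tts"):
    """Build an OAuth 2 client-credentials token request (not sent here)."""
    # HTTP Basic credentials: base64("client_id:client_secret").
    raw = f"{client_id}:{client_secret}".encode("utf-8")
    basic = base64.b64encode(raw).decode("ascii")
    # Form-encoded body, as defined for the client credentials grant.
    body = urllib.parse.urlencode(
        {"grant_type": "client_credentials", "scope": scope}
    ).encode("ascii")
    return urllib.request.Request(
        token_url,
        data=body,
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )
```

Passing the returned request to urllib.request.urlopen would send it; the response of a standard OAuth 2 server is a JSON object containing an access_token field.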

See Sample client applications > Authorize.

Voices

Neural TTSaaS uses Microsoft neural voices to render speech in many languages and locales, with different genders and styles available. These neural voices produce lifelike speech with realistic intonation and flow.

Microsoft neural voices are described in the Microsoft documentation on supported languages for text to speech.

You can list and filter voices programmatically to select the ones you want to use in your synthesis requests.
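
However the voice list is retrieved, filtering it comes down to matching a few attributes. The sketch below assumes a simplified, hypothetical Voice record with name, language, and gender fields; the actual fields returned by the service may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Voice:
    """Hypothetical, simplified voice record used for illustration only."""
    name: str
    language: str
    gender: str


def filter_voices(voices, language=None, gender=None):
    """Return the voices matching the given language and/or gender.

    Comparisons are case-insensitive; a criterion left as None is ignored.
    """
    selected = []
    for voice in voices:
        if language and voice.language.lower() != language.lower():
            continue
        if gender and voice.gender.lower() != gender.lower():
            continue
        selected.append(voice)
    return selected
```

For example, filter_voices(voices, language="en-US", gender="female") keeps only the female voices for US English from the list it is given.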

If you are using Mix, select a voice with a Neural model in Mix.dialog > Options > Project settings > TTS settings.

Input type

Neural TTSaaS supports two types of input: plain text and Speech Synthesis Markup Language (SSML). It does not support the Nuance control codes used in TTSaaS: using the input.tokenized_sequence field generates an error.

For more precise input instructions, you can use SSML to control the pronunciation, intonation, and other aspects of the speech. Neural TTSaaS supports the SSML elements described in the Microsoft documentation on SSML. Several examples are provided in this documentation.
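
When SSML is assembled from application text, special characters such as & and < must be escaped so the document stays well-formed. The following is a minimal sketch of that step, assuming a simple document with an optional prosody element; the function name and the choice of attributes are illustrative, and the service may expect additional elements.

```python
from xml.sax.saxutils import escape, quoteattr

# Standard SSML namespace defined by the W3C specification.
SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_ssml(text, language="en-US", rate=None):
    """Wrap plain text in a minimal SSML document.

    The text is XML-escaped. If `rate` is given (for example "slow"),
    the text is wrapped in a <prosody> element with that speaking rate.
    """
    body = escape(text)
    if rate is not None:
        body = f"<prosody rate={quoteattr(rate)}>{body}</prosody>"
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" '
        f"xml:lang={quoteattr(language)}>{body}</speak>"
    )
```

The resulting string is what would be sent as the SSML input of a synthesis request.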

Audio formats

Neural TTSaaS can generate speech in several audio formats and sampling rates. The default is PCM WAV audio at 22050 Hz, but it also supports A-law, μ-law, Opus, and encapsulated Opus (Ogg Opus).

Neural TTSaaS supports the same audio formats as TTSaaS, but other audio parameters and some Opus parameters are ignored.
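
One way to confirm which format a saved synthesis result actually uses is to read its WAV header. This sketch relies only on Python's standard wave module and assumes the audio has been collected into a complete WAV file; a header written for streamed audio may not carry a reliable length, in which case the reported duration would be wrong.

```python
import io
import wave


def describe_wav(data):
    """Return basic properties of a complete WAV file given as bytes."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        return {
            "sample_rate_hz": rate,
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "duration_seconds": frames / rate,
        }
```

For audio in the default format, sample_rate_hz is expected to be 22050.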

See Synthesizer gRPC API > AudioParameters, including audio formats.

Synthesis tuning

The synthesis resources available in TTSaaS (custom dictionaries, rulesets, and ActivePrompt databases) are not supported in Neural TTSaaS.

You may, however, improve your speech output with Microsoft tuning resources, including custom lexicons. You can then use them in Neural TTSaaS by including them in your synthesis requests.

A Microsoft demo page is also available to further test voices and their features.

SSML audio

Neural TTSaaS lets you include prerecorded audio files in SSML synthesis requests, using the <audio> element. The audio source is the URL of a WAV file on a public HTTPS web server. Only HTTPS servers are supported.
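
Because only HTTPS sources are accepted, a client can check the URL before building the element. The helper below is a minimal sketch of that check; the function name and the example URL are hypothetical, and the optional fallback text uses the standard SSML convention of placing alternate content inside the element.

```python
from urllib.parse import urlparse
from xml.sax.saxutils import escape, quoteattr


def audio_element(url, fallback_text=""):
    """Build an SSML <audio> element, rejecting non-HTTPS sources."""
    parsed = urlparse(url)
    if parsed.scheme != "https" or not parsed.netloc:
        raise ValueError(f"audio source must be an HTTPS URL: {url!r}")
    # The fallback text is escaped so the element stays well-formed XML.
    return f"<audio src={quoteattr(url)}>{escape(fallback_text)}</audio>"
```

For example, audio_element("https://example.com/prompts/welcome.wav") returns an element that can be placed inside the speak element of an SSML request, while an http:// or URN source raises ValueError.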

Unlike TTSaaS, Neural TTSaaS does not support audio files uploaded with the Storage API and referenced with a URN.

Synthesizer API

Neural TTSaaS offers a gRPC synthesis API. Unlike TTSaaS, it does not offer a transcoded HTTP API or a Storage API for uploading resources to central storage.

See Synthesizer gRPC API for Neural TTSaaS.

Sample synthesis client

You can experiment with Neural TTSaaS using a sample synthesis client. A Python client is included in this documentation, along with instructions on how to use it. This client can obtain information about available voices and synthesize speech from text or SSML input.

The sample client provided with Neural TTSaaS is the same as the one used in TTSaaS, but with separate input flow.py files to show the different features of these two related products.

In Neural TTSaaS, the client calls the TTS service, tts.api.nuance.com, with the gRPC header x-nuance-tts-neural to route its requests to Neural TTSaaS.

See Sample synthesis client for Neural TTSaaS.