Neural TTS essentials
Neural TTSaaS is a Nuance text-to-speech service that generates synthesized speech from text input. It accepts plain text or SSML and returns the synthesized speech as an audio stream.
Neural TTSaaS vs. TTSaaS
Neural TTSaaS is a reworking of Nuance’s text-to-speech engine, Nuance Vocalizer for Cloud version 2, also known as TTSaaS. Neural TTSaaS works with the neural Text-to-Speech feature of Microsoft Azure Cognitive Services for Speech.
Although the two services are similar, there are several differences between them, as summarized in this table and described in detail below.
Feature | Neural TTSaaS | TTSaaS |
---|---|---|
Synthesis engine | Microsoft Azure Cognitive Services for Speech | Nuance Vocalizer for Enterprise |
Production URL | tts.api.nuance.com with header x-nuance-tts-neural | tts.api.nuance.com |
Authorization | OAuth 2 protocol with Mix credentials | OAuth 2 protocol with Mix credentials |
Voices | Microsoft neural voices | Nuance standard and enhanced voices |
Input type | Plain text or SSML | Plain text, SSML, or Nuance control codes |
Audio formats | PCM WAV 22.05 kHz, A-law, μ-law, Opus, Ogg Opus | PCM WAV 22.05 kHz, A-law, μ-law, Opus, Ogg Opus |
Synthesis tuning | Microsoft custom lexicons | Nuance custom dictionaries, rulesets, ActivePrompt databases |
SSML audio | Audio files on public HTTPS web server | Audio files in Nuance storage via URN, or on public HTTPS web server |
Synthesizer API | Some fields not allowed, some fields ignored | All fields supported |
Synthesizer HTTP API | Not supported | Supported |
Storage gRPC API | Not supported | Supported |
Sample synthesis client | Available: same client with different options | Available: same client with different options |
Synthesis engine
Neural TTSaaS uses the neural Text-to-Speech feature of Microsoft Azure Cognitive Services for Speech. This service is a cloud-based text-to-speech engine that uses deep learning to synthesize speech from text. It’s part of Microsoft’s Speech service, which also provides speech recognition and translation.
TTSaaS is based on Nuance Vocalizer for Enterprise, a text-to-speech engine that uses a different technology.
Production URL
Both Neural TTSaaS and TTSaaS call the same service using the same production URL.
To call Neural TTSaaS, you must include the gRPC header x-nuance-tts-neural. When this header is not included, requests are routed to TTSaaS.
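The routing rule above can be sketched in Python. This is a minimal illustration, not the sample client's code: the helper names with_neural_routing and routes_to_neural are hypothetical, and the header value "true" is an assumption, since this page names the header but not its value.

```python
NEURAL_HEADER = "x-nuance-tts-neural"  # header name taken from this page


def with_neural_routing(metadata=(), value="true"):
    """Return a copy of the call metadata with the Neural TTSaaS header added.

    The value "true" is an assumption: this page gives only the header name.
    """
    # Drop any existing copy of the header so it appears exactly once.
    kept = [(key, val) for key, val in metadata if key.lower() != NEURAL_HEADER]
    kept.append((NEURAL_HEADER, value))
    return kept


def routes_to_neural(metadata):
    """Mirror the routing rule: header present means Neural TTSaaS, absent means TTSaaS."""
    return any(key.lower() == NEURAL_HEADER for key, _ in metadata)


# Usage: gRPC metadata is a sequence of (key, value) pairs.
base = [("user-agent", "demo-client")]  # illustrative existing metadata
neural = with_neural_routing(base)
print(routes_to_neural(base), routes_to_neural(neural))
```

With the Python grpc package, a list like this is typically passed through the metadata keyword argument of a stub call. The sample synthesis client may set the header differently.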
Authorization
Like TTSaaS, Neural TTSaaS is a hosted Mix service, and you must authorize your client applications using the OAuth 2 protocol. This process is the same for both TTSaaS and Neural TTSaaS.
See Sample client applications > Authorize.
Voices
Neural TTSaaS works seamlessly with Microsoft neural voices to render speech in many languages and locales, with different genders and styles available. These neural voices produce lifelike speech with realistic intonation and flow.
Microsoft neural voices are described in the Microsoft documentation on supported languages for text to speech.
You can list and filter voices programmatically to select the ones you want to use in your synthesis requests.
If you are using Mix, select a voice with a Neural model in Mix.dialog > Options > Project settings > TTS settings.
See:
- Microsoft Supported languages
- Synthesizer API > GetVoicesRequest
- Reference topics > Voice filters
Input type
Neural TTSaaS supports two types of input: plain text and Speech Synthesis Markup Language (SSML). It does not support control codes used in TTSaaS: the input.tokenized_sequence field generates an error if you use it.
For finer control over the output, use SSML to specify the pronunciation, intonation, and other aspects of the speech. Neural TTSaaS supports the SSML elements described in the Microsoft documentation on SSML. Several examples are provided in this documentation.
See:
- Reference topics > Input to synthesize for general information and SSML examples
- Sample synthesis client for Neural TTSaaS to try out the service
- Microsoft Speech Synthesis Markup Language (SSML) overview
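As a rough illustration of SSML input, the following Python sketch builds a small SSML document with the standard library. The build_ssml helper is hypothetical, the voice name is only an example of a Microsoft neural voice, and whether a given request needs a voice element inside the SSML is not stated here, so it is optional in this sketch.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"  # W3C SSML namespace
XML_NS = "http://www.w3.org/XML/1998/namespace"  # namespace of xml:lang


def build_ssml(text, lang="en-US", voice_name=None):
    """Build an SSML string; special characters in text are escaped."""
    ET.register_namespace("", SSML_NS)  # serialize SSML as the default namespace
    speak = ET.Element(
        f"{{{SSML_NS}}}speak",
        {"version": "1.0", f"{{{XML_NS}}}lang": lang},
    )
    if voice_name:
        # Optional wrapper selecting a voice (assumption: not always required).
        voice = ET.SubElement(speak, f"{{{SSML_NS}}}voice", {"name": voice_name})
        voice.text = text
    else:
        speak.text = text
    return ET.tostring(speak, encoding="unicode")


# Usage: the ampersand is escaped automatically in the output.
print(build_ssml("Fish & chips, please.", voice_name="en-US-JennyNeural"))
```

Building the markup with an XML library rather than string concatenation keeps the document well-formed when the text contains characters such as & or <.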
Audio formats
Neural TTSaaS can generate speech in several audio formats and sampling rates. The default is PCM WAV audio at 22.05 kHz (22,050 Hz), but it also supports A-law, μ-law, Opus, and Opus encapsulated in Ogg (Ogg Opus).
Neural TTSaaS supports the same audio formats as TTSaaS, but other audio parameters and some Opus parameters are ignored.
See Synthesizer gRPC API > AudioParameters, including audio formats.
Synthesis tuning
The synthesis resources available in TTSaaS (custom dictionaries, rulesets, and ActivePrompt databases) are not supported in Neural TTSaaS.
You may, however, improve your speech output with Microsoft tuning resources, including custom lexicons. You can then use them in Neural TTSaaS by including them in your synthesis requests.
A Microsoft demo page is also available to further test voices and their features.
See:
- Reference topics > Tuning resources
- Reference topics > Input > Lexicon for an example
SSML audio
Neural TTSaaS allows prerecorded audio files in SSML synthesis requests, using the <audio> element. The audio source is the URL of a wave file on a public HTTPS web server. Only HTTPS servers are supported.
Unlike TTSaaS, Neural TTSaaS does not support audio files uploaded with the Storage API and referenced with a URN.
See:
- Reference topics > Input > Prerecorded audio for an example
- Microsoft demo page Audio Content Creation
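Because only HTTPS sources are supported, it can help to check SSML before sending it. The sketch below is a hypothetical client-side check, not part of the service or the sample client: it lists every audio element whose src is not an HTTPS URL, which also catches URN references that only TTSaaS accepts. The URLs and the URN in the usage lines are placeholders.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


def non_https_audio_sources(ssml):
    """Return the src values of <audio> elements that are not HTTPS URLs."""
    root = ET.fromstring(ssml)
    rejected = []
    for element in root.iter():
        # Strip any "{namespace}" prefix so namespaced SSML also matches.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name != "audio":
            continue
        src = element.get("src", "")
        if urlparse(src).scheme.lower() != "https":
            rejected.append(src)
    return rejected


# Usage: the http URL and the URN are reported; the https URL passes.
sample = (
    '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
    '<audio src="https://example.com/beep.wav"/>'
    '<audio src="http://example.com/beep.wav"/>'
    '<audio src="urn:example:audio/beep"/>'
    "</speak>"
)
print(non_https_audio_sources(sample))
```

This check covers only the URL scheme. It does not confirm that the server is publicly reachable or that the file is a wave file.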
Synthesizer API
Neural TTSaaS offers a gRPC synthesis API. Unlike TTSaaS, it does not offer a transcoded HTTP API or a Storage API for uploading resources to central storage.
See Synthesizer gRPC API for Neural TTSaaS.
Sample synthesis client
You can experiment with Neural TTSaaS using a sample synthesis client. A Python client is included in this documentation, along with instructions on how to use it. This client can obtain information about available voices and synthesize speech from text or SSML input.
The sample client provided with Neural TTSaaS is the same as the one used in TTSaaS, but with separate input flow.py files to show the different features of these two related products.
In Neural TTSaaS, the client calls the TTS service, tts.api.nuance.com, with the gRPC header x-nuance-tts-neural to route its requests to Neural TTSaaS.
See Sample synthesis client for Neural TTSaaS.