FAQ for Neural TTS

These frequently asked questions (FAQ) describe the AI impact of Neural TTS in Mix.

What is Neural TTS?

Neural Text-to-Speech, or Neural TTS, is a speech synthesis service powered by Nuance Vocalizer for Cloud and the Text-to-Speech feature of Microsoft Azure Cognitive Services for Speech. Neural TTS takes text as input and outputs synthesized speech generated by a Microsoft neural voice.

Neural TTS uses deep neural networks to make computer voices nearly indistinguishable from recordings of people. Because words are clearly articulated, neural text-to-speech significantly reduces listening fatigue when users interact with AI systems.

What are the capabilities of Neural TTS?

Neural TTS accepts two inputs: the text to synthesize and the voice to perform the synthesis. The input text can be either plain text or SSML (Speech Synthesis Markup Language). When plain text is used, Neural TTS generates the speech based on the punctuation in the text. SSML, if used, can contain synthesis hints such as pauses, voice styles, prosody, and special pronunciation.

The input voice is the named Microsoft neural voice used to synthesize the text. A wide selection of Microsoft neural voices is available in many languages and locales.
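
As an illustration of the two inputs described above, the following minimal Python sketch wraps plain text in an SSML document that names a voice and optionally adds a pause and a speaking-rate hint. The helper build_ssml and its parameters are hypothetical, and the voice name en-US-JennyNeural is only an example; whether a given voice or SSML element is available in this product is not confirmed here.

```python
from xml.sax.saxutils import escape, quoteattr

# Standard SSML namespace defined by the W3C specification.
SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_ssml(text, voice, locale, pause_ms=0, rate=None):
    """Wrap plain text in a minimal SSML document for a single voice.

    build_ssml is a hypothetical helper, not part of any product API.
    """
    # Escape &, <, > so arbitrary prompt text stays well-formed XML.
    body = escape(text)
    if rate is not None:
        # Prosody hint, for example rate="slow".
        body = f"<prosody rate={quoteattr(rate)}>{body}</prosody>"
    if pause_ms > 0:
        # Leading pause before the text is spoken.
        body = f'<break time="{int(pause_ms)}ms"/>{body}'
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang={quoteattr(locale)}>'
        f"<voice name={quoteattr(voice)}>{body}</voice>"
        "</speak>"
    )


ssml = build_ssml(
    "Terms & conditions apply.",
    voice="en-US-JennyNeural",  # example voice name, assumed for illustration
    locale="en-US",
    pause_ms=300,
    rate="slow",
)
```

Passing the same text without SSML would leave pacing to be inferred from punctuation alone, as described above.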

What is the intended use of Neural TTS?

Although Neural TTS has many valid use cases, in the context of this product it is used to synthesize end-user prompts in a conversation between an Interactive Voice Response (IVR) system and an end user. The IVR system uses Neural TTS to convert its text response to the end user into speech.

How was Neural TTS evaluated? What metrics are used to measure performance?

Neural TTS voices are evaluated using MOS (Mean Opinion Score) and SMOS (Similarity Mean Opinion Score), as described in Characteristics and limitations of Custom Neural Voice.

What are the limitations of Neural TTS? How can users minimize the impact of these limitations when using the system?

In this product, Neural TTS is currently available in a limited number of voices and languages. People whose first language is different from the voice’s language may have difficulty understanding it. To mitigate this limitation, users should provide short and simple input text.

The accuracy of Neural TTS will be poor if the neural voice language specified in the request does not match the language of the input text. Users should take care to match their input with the chosen voice.
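
One way to guard against a mismatch is to compare the voice's language with the language the caller declares for the text before sending a request. The sketch below assumes voice names begin with a locale, as in en-US-JennyNeural; the helper names are hypothetical, and the check compares only the primary language, not the region. It does not detect the language of the text itself.

```python
def voice_language(voice):
    """Return the primary language subtag of a voice name.

    Assumes names start with a locale, e.g. "en-US-JennyNeural" -> "en".
    """
    primary = voice.split("-", 1)[0]
    if not (2 <= len(primary) <= 3 and primary.isalpha() and primary.islower()):
        raise ValueError(f"unrecognized voice name: {voice!r}")
    return primary


def voice_matches_text(voice, text_language):
    """Check a voice against a declared text language such as "en" or "en-GB"."""
    declared = text_language.split("-", 1)[0].lower()
    return voice_language(voice) == declared


ok = voice_matches_text("en-US-JennyNeural", "en-GB")
mismatch = voice_matches_text("fr-FR-DeniseNeural", "en")
```

A caller could refuse to synthesize, or log a warning, whenever the check returns False.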

Neural TTS does not currently support multilingual text as input. Users should provide input in one language only or use SSML to switch language explicitly.
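
As a rough sketch of an explicit switch, each single-language segment can be placed in its own SSML voice element. This is an assumption for illustration: the helper build_multivoice_ssml is hypothetical, the voice names are examples, and whether this product accepts several voice elements in one request (or expects a different mechanism, such as the SSML lang element) is not confirmed here.

```python
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_multivoice_ssml(segments, default_locale):
    """Build SSML with one voice element per single-language segment.

    segments is a list of (voice_name, text) pairs; each text is expected
    to be written in the language of its voice. Hypothetical helper.
    """
    parts = [
        f"<voice name={quoteattr(voice)}>{escape(text)}</voice>"
        for voice, text in segments
    ]
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" '
        f"xml:lang={quoteattr(default_locale)}>"
        + "".join(parts)
        + "</speak>"
    )


mixed = build_multivoice_ssml(
    [
        ("en-US-JennyNeural", "Welcome."),   # example English voice
        ("fr-FR-DeniseNeural", "Bienvenue."),  # example French voice
    ],
    default_locale="en-US",
)
```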

Neural TTS does not currently suppress profanity or other inappropriate language in the input text. Users should avoid language that may be offensive to end users.

What operational factors and settings allow for effective and responsible use of the feature?

Operational factors and settings are controlled by the service, not the user.

