Tuning TTS output with ActivePrompts

Vocalizer supports tuning synthesis through Nuance ActivePrompts. ActivePrompts are created with Nuance Vocalizer Studio (a graphical TTS tuning environment) and are stored in an ActivePrompt database for run-time use. Nuance Vocalizer Studio is a separate product. For information, please contact your Nuance representative.

There are two types of ActivePrompts:

  • Recorded ActivePrompts are digital audio recordings that are indexed by an ActivePrompt database. The recordings are stored as individual audio files on a web server or file system. This indexing enables context-sensitive expansion of static or dynamic input text into sequences of pre-recorded audio, making Vocalizer a powerful prompt concatenation engine for recording-only or mixed TTS and recording applications.

  • Tuned ActivePrompts are synthesizer instructions, stored in an ActivePrompt database, so that input text fragments are spoken in a particular way. An application developer creates these instructions in Nuance Vocalizer Studio by adjusting tuning parameters, listening to different versions of a prompt, and then freezing the prompt. These synthesizer instructions are much smaller than the audio that will be produced.

At runtime, all ActivePrompts can be used in two different ways:

  • Explicit insertion using the Nuance <prompt> extension to SSML or the native <ESC>\prompt=prompt\ control sequence.

  • Implicit matching where ActivePrompts are automatically used whenever the input text matches the ActivePrompt text. For implicit matching, there are two sub-modes:

    • Automatic mode, where implicit matches are automatically enabled across all the text in all speak requests.

    • Normal mode, where the Nuance ssft-domaintype extension to SSML or the native <ESC>\domain=domain\ control sequence is used to enable implicit matches for specific regions within the input text.

For recorded ActivePrompt databases, automatic matching can be further restricted so it is only done within a text normalization block (<ESC>\tn\ control sequence or SSML <say-as> element) of a specific type. For example, a recorded ActivePrompt database for spelling might be configured so it is only used for text wrapped in <ESC>\tn=spell\ or SSML <say-as interpret-as="spell">.
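
As a brief illustration of the native control sequences above (a sketch only: the prompt name alphanum/f.alpha0 is borrowed from the installation example below, and the domain name banking is purely hypothetical), the first request inserts a recorded ActivePrompt explicitly, while the second enables implicit matching for the text that follows the domain control sequence:

    Your confirmation code is <ESC>\prompt=alphanum/f.alpha0\
    <ESC>\domain=banking\ Your current balance is 125 dollars.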

Installing ActivePrompts

Applications use ActivePrompts by loading them into the system and then referencing the ActivePrompts.

The available ActivePrompt databases are found in a voice-specific directory under the Vocalizer installation directory, for example, VOCALIZER_SDK/cpr_enu_tom/. The file suffix is .dat. See the Release Notes for each voice for a list of available databases.

The recordings are found relative to the URI or path used to load the ActivePrompt database. For example, if the ActivePrompt database http://myserver/apdb_rp_tom_alphanum.dat contains a prompt named alphanum/f.alpha0 and the database specifies a file suffix of .ulaw for 8000 Hz and .wav for 22050 Hz, the recording file must be http://myserver/alphanum/f.alpha0.ulaw for the 8000 Hz version and http://myserver/alphanum/f.alpha0.wav for the 22050 Hz version.

Store ActivePrompt databases on a web server or in a file system, with the recordings underneath. Store recordings in VOCALIZER_SDK/cpr_enu_tom/domain, where domain corresponds to the domain covered by the ActivePrompt database (alphanum in the example above).
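
For example, using the database and recording names from the example above, the file system layout could look like this (assuming .ulaw recordings for the 8000 Hz version and .wav recordings for the 22050 Hz version, as that example specifies):

    VOCALIZER_SDK/cpr_enu_tom/
        apdb_rp_tom_alphanum.dat        recorded ActivePrompt database
        alphanum/
            f.alpha0.ulaw               8000 Hz recording
            f.alpha0.wav                22050 Hz recording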

To load the ActivePrompt databases for runtime use, use the SSML <lexicon> tag or the <default_activeprompt_dbs> XML configuration file parameter. You can load any number of ActivePrompt databases at runtime. The load order determines the precedence, with more recently loaded ActivePrompt databases taking precedence over previously loaded databases. At runtime, Vocalizer only consults ActivePrompt databases that match the current synthesis voice.

Prompt concatenation engine

The Vocalizer prompt concatenation engine leverages recorded ActivePrompts to support near-flawless playback of static and dynamic input text by concatenating recordings rather than using full TTS. This includes support for recording-only or mixed TTS and recording output, and for creating custom voices for recording-only playback.

Many voice applications are built by manually specifying carrier prompt recordings using SSML <audio>, then using an application library to expand dynamic content like alphanumeric sequences, dates, times, cardinal numbers, and telephone numbers into sequences of SSML <audio> elements. However, Vocalizer’s prompt concatenation engine gives better-sounding results (see the example after this list) with the following advantages:

  • Application developers don’t need to purchase, create, or maintain libraries for expanding dynamic content like alphanumeric sequences, dates, times, cardinal numbers, and telephone numbers. Instead, the application can just specify plain input text for Vocalizer to expand, then create an ActivePrompt database that defines the necessary recordings.
  • ActivePrompts support context-sensitive rules, including prompts that start and/or end on a sentence boundary, on a phrase boundary, on a sentence or phrase boundary, with a specific punctuation symbol, or are phrase internal. For playing back dynamic content, even recording just three variations of each prompt (phrase initial, phrase final, and phrase internal) gives a huge quality boost, producing very natural sounding output.
  • Some Vocalizer voices include predefined ActivePrompt databases and recordings for a variety of dynamic types, along with recording scripts that allow easily re-recording those in a different voice. These optionally support phrase initial, phrase final, and phrase internal recording variations for very high quality output as described above. See the Release Notes for each voice to see where this feature is offered, and for the details.
  • For static prompts, application developers can choose between specifying plain input text (avoids tediously specifying recording file names), SSML <audio> (recording file names), SSML <prompt> (ActivePrompt names), or using a mixed approach.
  • Providing plain input text for all the static and dynamic prompts makes it easy to create rapid application prototypes and to follow rapid application development (RAD) models such as Agile or Extreme Programming: the application can use Vocalizer text-to-speech for all prompts at the beginning of the project, then add ActivePrompt databases and recordings later as required, independently of the application code.
  • Vocalizer produces a single audio stream for all the content rather than relying on rapid fetching and concatenation of individual recording files by another system component such as a telephony service. This ensures the recordings are contiguous, rather than having the extra gaps that some telephony services introduce, which lead to slow playback.
  • This solution is extensible to the wide variety of languages and dynamic data types supported by Vocalizer, rather than requiring special linguistic knowledge and major code updates for each new language or data type.
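
For example, the two approaches can be contrasted as follows. In the manual approach the application expands everything to SSML <audio> elements itself (the recording file names here are hypothetical), whereas with a recorded ActivePrompt database loaded that covers the carrier phrase and cardinal numbers, the application can pass plain text and let the prompt concatenation engine select the recordings:

    <!-- Manual concatenation: the application selects every recording -->
    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
      <audio src="your_balance_is.wav"/>
      <audio src="one_hundred.wav"/>
      <audio src="twenty_five.wav"/>
      <audio src="dollars.wav"/>
    </speak>

    <!-- Vocalizer prompt concatenation: plain text, recordings chosen by implicit matching -->
    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
      Your balance is 125 dollars.
    </speak>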

Load ActivePrompt databases

Use the SSML <lexicon> element or the <default_activeprompt_dbs> XML configuration file parameter to load ActivePrompt databases for runtime use. You can load any number of ActivePrompt databases at runtime. The load order determines the precedence, with more recently loaded ActivePrompt databases having precedence over previously loaded databases. At runtime, Vocalizer only consults ActivePrompt databases that match the current synthesis voice.

For recorded ActivePrompt databases, the recordings are found relative to the URI or file path used to load the ActivePrompt database. For example, if the ActivePrompt database http://myserver/apdb_rp_tom_alphanum.dat contains a prompt named alphanum/f.alpha0 and the database specifies a file suffix of .wav, the recording file must be http://myserver/alphanum/f.alpha0.wav.
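
For example, a single speak request can load a recorded ActivePrompt database with the standard SSML <lexicon> element placed before the text to be spoken (the database URI is taken from the example above; the sentence itself is only illustrative):

    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
      <lexicon uri="http://myserver/apdb_rp_tom_alphanum.dat"/>
      Your confirmation code is A 1 B 2.
    </speak>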

Sample: Load an ActivePrompt database

The TTS User Config file is loaded automatically at runtime when it is specified with the following environment variable:

VOCALIZER_USERCFG=C:\Lex\tts_config.xml

The following illustrates how to load ActivePrompt databases with the user configuration file, using either a local file path or a UNC path.
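
A minimal sketch of such a configuration file follows. The <default_activeprompt_dbs> parameter name comes from this documentation, but the surrounding element layout and the database paths shown here are assumptions for illustration only; the exact schema is defined by the configuration file shipped with your Vocalizer installation.

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- C:\Lex\tts_config.xml: sketch only, the element nesting is assumed -->
    <tts_config>
      <default_activeprompt_dbs>
        <!-- Local file path (hypothetical location) -->
        <activeprompt_db path="C:\Lex\apdb_rp_tom_alphanum.dat"/>
        <!-- UNC path (hypothetical server and share) -->
        <activeprompt_db path="\\myserver\prompts\apdb_rp_tom_alphanum.dat"/>
      </default_activeprompt_dbs>
    </tts_config>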