Using Vocalizer with Speech Server

When using Vocalizer with Speech Server, Speech Server provides access to recognition and TTS services. The application uses VoiceXML to communicate with a voice browser, and the browser uses an MRCP client to communicate with Speech Server.

Vocalizer, MRCP, and Speech Server

The system supports text-to-speech features of the VoiceXML 2.0, MRCP version 1, and MRCP version 2 specifications. For details, see Integrating into VoiceXML platforms.
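
For reference, a minimal VoiceXML 2.0 document that requests synthesized speech might look like the following sketch. The prompt text is illustrative only; the element names and namespace come from the VoiceXML 2.0 specification.

  <?xml version="1.0" encoding="UTF-8"?>
  <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
    <form>
      <block>
        <!-- Plain prompt text is passed through Speech Server to Vocalizer for synthesis -->
        <prompt>Welcome to the automated attendant.</prompt>
      </block>
    </form>
  </vxml>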

Speech Server can reside on any machine on the network, and can serve clients running on any operating system. A single Speech Server process can handle TTS requests for all the Vocalizer languages and voices installed on the host. There is no hard limit on the number of languages and voices.

In the simplest configuration, the Vocalizer client incorporated in Speech Server communicates directly with Nuance Vocalizer for Enterprise via the NVS protocol. Vocalizer performs the request and sends the results back to Speech Server.

Speech Server and Vocalizer can run on the same host, or they can be installed on separate hosts.

When Vocalizer is installed on a different host, you must supply its IP address in the NSS configuration file so that Speech Server can direct requests to it.

Speech processing call flow

Here is a simplified speech processing call flow that emphasizes the roles of Speech Server, Vocalizer, and the speech application. The process may vary depending on the telephony features implemented in the actual deployment.

  1. The voice browser receives a call and starts the session. Session defaults for all components are configured in the session.xml file.
  2. The voice browser parses the application’s VoiceXML pages and queues prompts internally. The application indicates the initial language, accent, and voice to use for the TTS request, and it can change them at any time (see the first VoiceXML sketch after this list).
  3. When the application requests recognition or makes another major call transition (such as initiating a call transfer or a prolonged database lookup), the browser asks Speech Server to play the queued prompts (MRCP SPEAK request) and then to start recognition (MRCP RECOGNIZE request).
  4. Speech Server sends the prompt text to be synthesized (and/or the recording list to be played) to Vocalizer and sends the recognition request to the recognizer.
  5. Vocalizer delivers an audio stream to Speech Server, which then delivers the audio over an RTP stream to the voice browser. Alternatively, Speech Server can deliver the audio inside SPEECHDATA messages as described in Delivering prompts via MRCP.
  6. The browser plays the audio over the telephony interface to the caller.
  7. If the caller interrupts a prompt (barge-in), Speech Server signals Vocalizer to stop audio playback. The application controls barge-in behavior in its VoiceXML, as shown in the second sketch after this list.
  8. Speech Server sends recognition results to the voice browser, which in turn formats the result for the application.
  9. The application responds appropriately, for example by performing a database lookup or by requesting that Speech Server play another prompt to the caller.
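
To make steps 2 through 4 concrete, the following sketch shows the application side in VoiceXML. The form, prompt text, voice name, and grammar file name are illustrative assumptions; the element names come from the VoiceXML 2.0 and SSML specifications. The prompts are queued while the page is processed, then flushed (MRCP SPEAK) and followed by recognition (MRCP RECOGNIZE) when the interpreter reaches the field.

  <?xml version="1.0" encoding="UTF-8"?>
  <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
    <form id="transfer">
      <block>
        <!-- Queued by the browser; not sent to Speech Server until recognition is requested -->
        <prompt xml:lang="en-US">
          <!-- The voice name is illustrative; use a voice installed with Vocalizer -->
          <voice name="example-voice">Welcome to the funds transfer service.</voice>
        </prompt>
      </block>
      <field name="amount">
        <!-- Reaching this field flushes the prompt queue (SPEAK) and starts recognition (RECOGNIZE) -->
        <prompt>How much would you like to transfer?</prompt>
        <grammar src="amounts.grxml" type="application/srgs+xml"/>
      </field>
    </form>
  </vxml>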
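
A second sketch illustrates steps 7 through 9. The bargein attribute lets the caller interrupt the prompt, and the filled block shows the application reacting to the recognition result by queueing another prompt. The grammar file name and prompt wording are again illustrative, and the example assumes the grammar returns the values "yes" and "no".

  <?xml version="1.0" encoding="UTF-8"?>
  <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
    <form id="confirm_transfer">
      <field name="confirm">
        <!-- bargein="true" lets the caller interrupt; Speech Server then tells Vocalizer to stop (step 7) -->
        <prompt bargein="true">Say yes to confirm the transfer, or no to cancel.</prompt>
        <grammar src="yesno.grxml" type="application/srgs+xml"/>
        <filled>
          <!-- The recognition result reaches the application (step 8), which queues the next prompt (step 9) -->
          <if cond="confirm == 'yes'">
            <prompt>Your transfer is complete.</prompt>
          <else/>
            <prompt>The transfer has been cancelled.</prompt>
          </if>
        </filled>
      </field>
    </form>
  </vxml>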

More detail is provided below.