Using Vocalizer with Speech Server
When using Vocalizer with Speech Server, Speech Server provides access to recognition and TTS services. The application uses VoiceXML to communicate with a voice browser, and the browser uses an MRCP client to communicate with Speech Server.
Vocalizer, MRCP, and Speech Server
The system supports text-to-speech features of the VoiceXML 2.0, MRCP version 1, and MRCP version 2 specifications. For details, see Integrating into VoiceXML platforms.
Speech Server can reside on any machine on the network, and can serve clients running on any operating system. A single Speech Server process can handle TTS requests for all the Vocalizer languages and voices installed on the host. There is no hard limit on the number of languages and voices.
In the simplest configuration, the Vocalizer client incorporated in Speech Server communicates directly with Vocalizer via the NVS protocol. Vocalizer performs the request and sends the results back.
Speech Server and Vocalizer can run on the same host:
However, they may instead be installed on separate hosts:
When Vocalizer is installed on a different host, you must supply its IP address in the NSS configuration file so that Speech Server can direct its requests to it.
Speech processing call flow
Here is a simplified speech processing call flow, with an emphasis on the roles of Speech Server, Vocalizer, and the speech application. The process may vary depending on the telephony features implemented in an actual deployment.
- The voice browser receives a call and starts the session. Session defaults for all components are configured in the session.xml file.
- The voice browser parses the application’s VoiceXML pages, and queues prompts internally. The application indicates the initial language, accent, and voice to use for the TTS request, and it can make changes at any time.
- When the application requests recognition or makes another major call transition (such as initiating a call transfer or a prolonged database lookup), the browser asks Speech Server to play the queued prompts (MRCP SPEAK request), then asks Speech Server to start the recognition (MRCP RECOGNIZE request).
- Speech Server sends the prompt text to be synthesized (and/or the recording list to be played) to Vocalizer and sends the recognition request to the recognizer.
- Vocalizer delivers an audio stream to Speech Server, which then delivers the audio over an RTP stream to the voice browser. Alternatively, Speech Server can deliver the audio inside SPEECHDATA messages as described in Delivering prompts via MRCP.
- The browser plays the audio over the telephony interface to the caller.
- If the caller interrupts a prompt (barge-in), Speech Server signals Vocalizer to stop audio playback.
- Speech Server sends recognition results to the voice browser, which in turn formats the result for the application.
- The application responds appropriately; for example, by making a database lookup, or requesting that Speech Server play another prompt to the user.
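The queue-then-flush behavior in the steps above can be sketched as follows. This is an illustrative Python sketch, not browser source code; the class and method names are invented for illustration:

```python
# Illustrative sketch of the browser-side prompt handling described above:
# prompts accumulate while VoiceXML pages are parsed, and are flushed as one
# batch when the application requests recognition (MRCP SPEAK, then RECOGNIZE).

class PromptQueue:
    """Queues prompt texts until a recognition request triggers playback."""

    def __init__(self):
        self._pending = []

    def queue(self, prompt_text):
        # The browser queues prompts internally as it parses VoiceXML.
        self._pending.append(prompt_text)

    def flush_for_recognition(self):
        # On RECOGNIZE (or another major call transition), all queued
        # prompts are sent to Speech Server for playback.
        batch, self._pending = self._pending, []
        return batch

queue = PromptQueue()
queue.queue("Welcome to the demo bank.")
queue.queue("Please say your account number.")
batch = queue.flush_for_recognition()
```

A second flush returns an empty list, since the queue is emptied when the prompts are handed to Speech Server.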
More detail is provided below.
If there is no current active session, the browser must establish one. Within an active session, no further INVITE is required.
- The browser uses the SIP INVITE method to establish a session with Speech Server. You can establish one or more sessions on this connection by sending an INVITE message for each new session (telephone call).
Before processing calls, the browser must first open a port on the host using a TCP, TLS, or UDP connection. (The SIP default is UDP. The following example shows TCP port 5060.) The connection is to a port defined by one of the following Speech Server parameters:
- server.mrcp2.sip.transport.tcp.port
- server.mrcp2.sip.transport.udp.port
- server.mrcp2.sip.transport.tls.port
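The TCP case can be sketched as below. This is a minimal illustration with placeholder host and port values (a local stand-in socket substitutes for Speech Server so the sketch is self-contained); the port you connect to must match the relevant Speech Server parameter, such as server.mrcp2.sip.transport.tcp.port:

```python
# Minimal sketch of a client opening the SIP signaling connection over TCP
# before any INVITE is sent. Host and port are placeholders.

import socket

def open_sip_transport(host, port=5060, timeout=5.0):
    """Open a TCP connection to the Speech Server SIP port (default 5060)."""
    return socket.create_connection((host, port), timeout=timeout)

# Demonstration against a local stand-in for Speech Server:
listener = socket.socket()
listener.bind(("127.0.0.1", 0))      # ephemeral port for the demo
listener.listen(1)
host, port = listener.getsockname()

sock = open_sip_transport(host, port)
conn, peer = listener.accept()       # the stand-in server sees the connection
sock.close(); conn.close(); listener.close()
```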
In the INVITE message, specify the resource attribute with the value "speechsynth", as in the following example. This requests that the server create a synthesizer resource control channel to generate speech, and add a media pipe to send the generated speech. In subsequent requests, the client asks to use the existing connection.
Client->Server:
INVITE sip:mresources@server.example.com:5060 SIP/2.0
Via:SIP/2.0/TCP mrcpclient.nuance.com:2000;branch=z9hG4bK007ecb513
Max-Forwards:6
To:MediaServer <sip:mresources@example.com:5060>;tag=62784
From:client_user <sip:client_user@example.com:2000>;tag=1928301774
Call-ID:a84b4c76e66710
CSeq:314161 INVITE
Contact:<sip:client_user@client.example.com:2000>
Content-Type:application/sdp
Content-Length:446

v=0
o=sarvi 2890844526 2890842808 IN IP4 192.0.2.4
s=Set up MRCPv2 control and audio
i=Add TCP channel, synthesizer and one-way audio
c=IN IP4 192.0.2.12
m=application 9 TCP/MRCPv2 1
a=setup:active
a=connection:new
a=resource:speechsynth
a=cmid:1
m=audio 49170 RTP/AVP 0 96
a=rtpmap:0 pcmu/8000
a=recvonly
a=mid:1
- The OK response to the SIP INVITE reports which requested resources are actually provided, the RTP ports where the application expects to receive the audio data, and the port (TCP, UDP, or TLS) where the client connects for MRCPv2 messages.
Server->Client:
SIP/2.0 200 OK
To:MediaServer <sip:mresources@example.com>;tag=62784
From:sarvi <sip:sarvi@example.com>;tag=1928301774
Call-ID:a84b4c76e66710
CSeq:314161 INVITE
Contact:<sip:mresources@server.example.com>
Content-Type:application/sdp
Content-Length:...

v=0
o=sarvi 2890844526 2890842808 IN IP4 192.0.2.4
s=Set up MRCPv2 control and audio
i=Add TCP channel, synthesizer and one-way audio
c=IN IP4 192.0.2.11
m=application 32416 TCP/MRCPv2 1
a=setup:passive
a=connection:new
a=channel:32AECB23433801@speechsynth
a=cmid:1
m=audio 48260 RTP/AVP 0
a=rtpmap:0 pcmu/8000
a=sendonly
a=mid:1
- When the client receives the OK, it returns a SIP ACK to Speech Server to confirm that it received the information and established the session.
The client can open multiple connections and establish multiple sessions over each connection. Each session is subject to the session timeout rules defined by the Speech Server configuration (for example, 20 seconds). If Speech Server receives no traffic for longer than the configured session timeout period (server.mrcp2.transport.timeout), it deletes the session and responds to further MRCP messages for that session with a Session Not Found error.
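The timeout rule amounts to simple idle-time bookkeeping per session. The sketch below illustrates it (invented class names, not Speech Server internals; the 20-second value is the example from the text):

```python
# Illustrative sketch of the session-timeout rule: any session idle longer
# than the configured server.mrcp2.transport.timeout value is deleted, and
# later messages for it receive a Session Not Found error.

class SessionTable:
    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self._last_seen = {}              # session id -> time of last traffic

    def touch(self, session_id, now):
        self._last_seen[session_id] = now

    def expire(self, now):
        dead = [sid for sid, t in self._last_seen.items()
                if now - t > self.timeout]
        for sid in dead:
            del self._last_seen[sid]      # further messages -> Session Not Found
        return dead

table = SessionTable(timeout_seconds=20)  # example value from the text
table.touch("a84b4c76e66710", now=0)
table.touch("b17f3c22e10a42", now=15)
expired = table.expire(now=25)            # only the first session is idle > 20 s
```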
C->S:
ACK sip:mresources@server.example.com SIP/2.0
Max-Forwards:6
To:MediaServer <sip:mresources@example.com>;tag=62784
From:Sarvi <sip:sarvi@example.com>;tag=1928301774
Call-ID:a84b4c76e66710
CSeq:314162 ACK
Content-Length:...
The browser can set TTS parameters using the SPEAK or SET-PARAMS method.
Voice qualities
The application tells Speech Server which initial language, accent, and/or voice to use for the TTS request. Note that the input text for the TTS request can also contain markup to switch the language and/or voice.
Vocalizer supports several MRCP headers that control characteristics of the speech to be synthesized.
- Speech-Language
- Voice-Name
- Voice-Gender
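As an illustration of setting these headers with SET-PARAMS, the sketch below builds a minimal MRCPv2 request. It is a simplified encoder, not a product API; the request id and channel id mirror the examples in this section:

```python
# Sketch of building an MRCPv2 SET-PARAMS request that sets TTS voice
# parameters for the session. The start line has the form
# "MRCP/2.0 message-length method-name request-id".

def build_set_params(request_id, channel_id, params):
    headers = [f"Channel-Identifier:{channel_id}"]
    headers += [f"{name}:{value}" for name, value in params.items()]
    body = "\r\n".join(headers) + "\r\n\r\n"

    # The message-length field counts the entire message, including the
    # digits of the length itself, so solve for it iteratively.
    length = 0
    while True:
        start = f"MRCP/2.0 {length} SET-PARAMS {request_id}\r\n"
        if len(start) + len(body) == length:
            return start + body
        length = len(start) + len(body)

msg = build_set_params(543256, "32AECB23433801@speechsynth",
                       {"Speech-Language": "en-US", "Voice-Gender": "female"})
```

The Channel-Identifier value is the one returned by the server in the SIP 200 OK (the a=channel line of the SDP answer).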
Load special dictionaries
In MRCPv2 you can load special dictionaries using the MRCP DEFINE-LEXICON method. You can control dictionary loading with the following Nuance vendor-specific parameters:
- switts.ssftrs_dict_load
- switts.ssftrs_dict_unload
- switts.ssftrs_dict_enable
- switts.ssftrs_dict_disable
The DEFINE-LEXICON request can only be used to load Vocalizer dictionaries. To load Vocalizer ActivePrompt databases and rulesets, use the SSML <lexicon> element or the Vocalizer <default_rulesets> XML configuration (ttsshclient.xml).
Protect confidential data
Use the vendor-specific parameters switts.secure_context or switts.mute_wcr to determine whether the result of the text-to-speech synthesis is included in the saved waveform. See Security levels to protect confidential data.
The browser issues an MRCP SPEAK request to the synthesizer to generate the audio (prompt).
Prompts are queued for playback, and are not played until input is needed from the caller. At this point, the prompts are played, and the system waits for user input (speech or DTMF) to send for recognition.
Client->Server:
MRCP/2.0 386 SPEAK 543257
Channel-Identifier:32AECB23433801@speechsynth
Kill-On-Barge-In:false
Voice-gender:neutral
Voice-age:25
Prosody-volume:medium
Content-Type:application/ssml+xml
Content-Length:...
<?xml version="1.0"?>
<speak version="1.0"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="en-US">
<p>
<s>You have 4 new messages.</s>
<s>The first is from Stephanie Williams
<mark name="Stephanie"/>
and arrived at <break/>
<say-as interpret-as="vxml:time">0345p</say-as>.</s>
<s>The subject is <prosody rate="-20%">ski trip</prosody></s>
</p>
</speak>
S->C:
MRCP/2.0 49 543257 200 IN-PROGRESS
Channel-Identifier:32AECB23433801@speechsynth
Speech-Marker:timestamp=857205015059
The synthesizer reads the special marker in the message (<mark> element) and informs the client of the event.
S->C:
MRCP/2.0 46 SPEECH-MARKER 543257 IN-PROGRESS
Channel-Identifier:32AECB23433801@speechsynth
Speech-Marker:timestamp=857206027059;Stephanie
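On the client side, handling this event means splitting the Speech-Marker header value into its timestamp and mark name. A minimal sketch (invented function name, operating on the decoded header value):

```python
# Sketch of client-side parsing of a Speech-Marker header value such as
# "timestamp=857206027059;Stephanie", yielding the timestamp and the name
# of the SSML <mark> that was reached.

def parse_speech_marker(header_value):
    parts = header_value.split(";", 1)
    timestamp = int(parts[0].split("=", 1)[1])
    mark = parts[1] if len(parts) > 1 else None   # absent on plain responses
    return timestamp, mark

ts, mark = parse_speech_marker("timestamp=857206027059;Stephanie")
```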
The synthesizer finishes with the SPEAK request.
S->C:
MRCP/2.0 48 SPEAK-COMPLETE 543257 COMPLETE
Channel-Identifier:32AECB23433801@speechsynth
Speech-Marker:timestamp=857207685213;Stephanie
For most browsers, Speech Server sends prompts (output audio) over the Real-time Transport Protocol (RTP), using the socket established at the beginning of the session. This allows applications to play the audio output for the initial blocks of text to callers while the TTS engine is still processing the blocks of text that follow, minimizing caller-perceived latency. The application can specify the desired audio format (A-law, µ-law, or 16-bit linear).
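For reference, each audio packet on that stream starts with the fixed 12-byte RTP header. The sketch below packs one (RFC 3550 layout, not Speech Server code) for the PCMU stream negotiated by the a=rtpmap:0 pcmu/8000 lines in the SDP examples above:

```python
# Sketch of packing a minimal RTP header for an 8 kHz µ-law (PCMU,
# payload type 0) audio packet, per the fixed header layout of RFC 3550.

import struct

def rtp_header(seq, timestamp, ssrc, payload_type=0):
    byte0 = 2 << 6                     # version 2, no padding/extension/CSRC
    byte1 = payload_type & 0x7F        # marker bit clear
    return struct.pack("!BBHII", byte0, byte1, seq, timestamp, ssrc)

hdr = rtp_header(seq=1, timestamp=160, ssrc=0x1234)  # 160 samples = 20 ms at 8 kHz
```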