Delivering prompts via MRCP

RTP transmission is not satisfactory in some situations:

  • The system supports only 8kHz audio over RTP, but browsers might require 22kHz for higher-quality audio.
  • When data transmission is bursty (an irregular flow that alternates between low and high throughput), the RTP environment can introduce delays. Mobile telephone networks, for example, are often bursty environments.

As an alternative, MRCPv2 browsers can use the Nuance-specific message SPEECHDATA to receive output audio. The SPEECHDATA method allows a 22kHz media type, and the messaging enables a smoother delivery of the audio data. However, there are restrictions (see Limitations when sending prompts with SPEECHDATA messages for details).

When a browser configures audio delivery via SPEECHDATA, Vocalizer ensures quality playback by starting and ending each SPEECHDATA event on a natural speech boundary (for example, at the end of a sentence or during a pause). This packaging ensures that any delays between audio chunks are tolerable to application users. Compare the following:

  • “Today is Thursday.” is acceptable as a single chunk.
  • “Today i” and “s Thursday” are not acceptable as two chunks.

Browsers can also use SPEECHDATA messages to send speech (input audio) for recognition. For details, see Sending speech requests over TCP.

Configuring prompt delivery over TCP

To enable receiving prompts on the session’s TCP connection:

  1. The browser initiates the media session without an RTP channel (RTP is not needed).

    To do this, specify port 0 on the audio media line of the SDP offer in the initial SIP INVITE:

    m=audio 0 RTP/AVP 0
    c=IN IP4 10.3.17.104
    a=rtpmap:0 pcmu/8000
    a=sendrecv
    a=mid:1
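
    For reference, a complete SDP offer also carries the MRCPv2 control channel alongside the disabled audio line. A minimal sketch, with illustrative addresses (see RFC 6787 for the full SDP conventions):

    m=application 9 TCP/MRCPv2 1
    c=IN IP4 10.3.17.104
    a=setup:active
    a=connection:new
    a=resource:speechsynth
    a=cmid:1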
  2. After establishing the MRCPv2 session, the browser uses SET-PARAMS to set the X-Nuance-Mrcp-Audio header on the speechsynth resource:
    X-Nuance-Mrcp-Audio: true

    Using SET-PARAMS enables SPEECHDATA audio for the entire session.
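
    For example, a minimal SET-PARAMS exchange (the request ID and channel identifier are illustrative):

    Client->Server:
    MRCP/2.0 n SET-PARAMS 306639
    Channel-Identifier: 3@speechsynth
    X-Nuance-Mrcp-Audio: true

    Server->Client:
    MRCP/2.0 n 306639 200 COMPLETE
    Channel-Identifier: 3@speechsynth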

  3. The browser uses SET-PARAMS or SPEAK to set the X-Nuance-Mrcp-Audio-Media-Type header on the speechsynth resource:
    X-Nuance-Mrcp-Audio-Media-Type: myType

    Using SET-PARAMS sets the media type for the entire session. Using SPEAK sets it for a given request. (After a SPEAK operation, the media type reverts to the previous value.)

    Possible values for myType:

    • audio/basic (default)
    • audio/x-alaw-basic
    • audio/L16;rate=8000
    • audio/L16;rate=22050
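
    For example, a SPEAK request that selects 22kHz linear audio for a single prompt might look like this (the request ID, channel identifier, and SSML body are illustrative):

    Client->Server:
    MRCP/2.0 n SPEAK 306640
    Channel-Identifier: 3@speechsynth
    X-Nuance-Mrcp-Audio-Media-Type: audio/L16;rate=22050
    Content-Type: application/ssml+xml
    Content-Length: n

    <?xml version="1.0"?>
    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">Today is Thursday.</speak>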
  4. In response to a SPEAK request, Speech Server delivers a series of SPEECHDATA messages, each containing the next chunk of audio. Every message carries the same request ID and the IN-PROGRESS status.

    Speech Server sends the SPEAK-COMPLETE event immediately after the final SPEECHDATA message; it does not wait for real-time playback of the audio. For example, if the system sends 5 seconds of audio in 0.2 seconds, it sends SPEAK-COMPLETE after 0.2 seconds, not after 5.

    Here is an example sequence of messages in a SPEAK request:

    Client->Server: 
    MRCP/2.0 n SPEAK 306640 
    
    Server->Client: 
    MRCP/2.0 n 306640 200 IN-PROGRESS 
    
    Server->Client:
    MRCP/2.0 n SPEECHDATA 306640 IN-PROGRESS
    
    Server->Client:
    MRCP/2.0 n SPEECHDATA 306640 IN-PROGRESS 
    
    Server->Client: 
    MRCP/2.0 n SPEAK-COMPLETE 306640 COMPLETE
  5. The browser delivers the output audio to the application user. The audio itself is carried in the body of each SPEECHDATA message from Speech Server, in the format requested by the X-Nuance-Mrcp-Audio-Media-Type header (and echoed in the message's Content-Type header). The content is binary and headerless.

    For example:

    Server->Client: 
    MRCP/2.0 n SPEECHDATA 306640 IN-PROGRESS
    Channel-Identifier: 3@speechsynth
    Content-Type: audio/L16;rate=22050
    Content-Transfer-Encoding: 8bit
    Content-Length: n

    This example shows the default content-type:

    Server->Client: 
    MRCP/2.0 n SPEECHDATA 2987 IN-PROGRESS
    Channel-Identifier: 5@speechsynth
    Content-Type: audio/basic
    Content-Transfer-Encoding: 8bit
    Content-Length: n

    Notice that the Content-Transfer-Encoding is 8bit (typical for binary data).

Performance considerations for SPEECHDATA audio prompts

For best performance, set the parameter server.mrcp2.rsspeechsynth.ttsOutBufferSize to a value large enough to hold complete sentences (or entire audio buffers). Establish the optimal value by testing different settings. A good starting value is 1048576 (1 MB); typical buffer sizes are 307200 for L16 8kHz and 850000 for L16 22kHz.
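
Assuming the name/type/value layout used in the Speech Server configuration file, the entry for the 1 MB starting value might look like this:

    server.mrcp2.rsspeechsynth.ttsOutBufferSize VXIInteger 1048576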

If you observe latencies and need to reduce them, contact your Nuance sales representative to acquire unthrottled text-to-speech licenses for Nuance Vocalizer. Note, however, that unthrottled text-to-speech uses more CPU. To accommodate the increase, configure no more channels than the number of CPU cores visible to the operating system. For example, if the OS sees 8 cores, configure 8 or fewer channels.

Applications are responsible for stopping prompt playback when a barge-in event occurs (when a caller interrupts with speech).
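
For example, a browser that detects barge-in can stop the active SPEAK request with a standard MRCPv2 STOP exchange (the request IDs and channel identifier are illustrative; BARGE-IN-OCCURRED is an alternative method when prompt playback and recognition are coordinated):

    Client->Server:
    MRCP/2.0 n STOP 306641
    Channel-Identifier: 3@speechsynth

    Server->Client:
    MRCP/2.0 n 306641 200 COMPLETE
    Channel-Identifier: 3@speechsynth
    Active-Request-Id-List: 306640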

Limitations when sending prompts with SPEECHDATA messages

When a browser uses SPEECHDATA to receive speechsynth audio from Speech Server, the audio can be a prerecorded audio file or a synthesized waveform from Vocalizer. It cannot be any of the following:

  • DTMF signals
  • playsilence
  • Whole Call Recording (WCR): the application cannot enable WCR when sending audio with SPEECHDATA

SPEECHDATA audio delivery is available to MRCPv2 browsers only; MRCPv1 is not supported.