Sending speech requests over TCP

Although realtime streaming via RTP is the most common technique for receiving audio and sending it to the recognizer, browsers can also accumulate audio and send it to Speech Server over the session’s TCP or TLS connection.

Browsers can also play prompts (output audio) over TCP instead of RTP. For details, see Delivering prompts via MRCP.

Delivering non-streaming speech

When the browser receives speech (input audio) in buffered chunks instead of a realtime stream, there are performance considerations and a risk that application users will perceive delays in the system’s response. In these situations, the browser can deliver the speech to Speech Server inside the Nuance-specific SPEECHDATA message, a mechanism that handles the chunks with better performance.

Most browsers receive speech in real time from a telephony environment and stream it to Speech Server for recognition using the Real-time Transport Protocol (RTP).

But some environments do not receive and send audio in real time. Instead, they accumulate audio before sending it to Speech Server. As a result, the audio arrives at Speech Server at a faster rate than realtime streaming.

For example, in a mobile computing environment, browsers might receive audio in larger chunks that are buffered and delayed, sometimes arriving long after the start of speech. This scenario can result in performance issues:

  • When the browser delivers the buffered chunks faster than the original speech, there is a risk of losing data packets (RTP typically runs over UDP, which does not retransmit lost packets) and degrading the quality of the audio delivered to the recognizer.
  • Latency grows as the browser gathers audio data before sending, and RTP, which paces delivery at the realtime rate, does not offer sufficient capacity to make up the lost time.

One way to avoid packet loss and make up lost time is for browsers to deliver the audio data over the existing TCP or TLS connection (framed in MRCPv2 messages) instead of using RTP: TCP guarantees delivery, and the transfer is not capped at the realtime rate.

Sending speech via TCP or TLS

To send speech via TCP or TLS, the browser enables audio on the connection, and then delivers the audio inside one or more Nuance-specific SPEECHDATA requests.

Note: This mechanism is for MRCPv2 only, and does not support sending DTMF, Whole Call Recording (WCR), or playsilence.

The steps are:

  1. In the RECOGNIZE request, the browser sets the following header:
    X-Nuance-MRCP-Audio: true

    In response, Speech Server sends the IN-PROGRESS response to the browser, expects audio inside SPEECHDATA requests, and stops listening for data on RTP/RTCP, as shown in the example exchange after these steps.

  2. The browser sends the audio data framed in the body of one or more SPEECHDATA requests. Speech Server accepts 8 kHz µ-law (ulaw) audio across the TCP or TLS connection. Each request specifies the media type and encoding as follows:
    Content-Type: audio/basic
    Content-Encoding: base64

    The browser sends as many requests as needed, as in the example exchange after these steps. Speech Server does not acknowledge each request (it does not send 200 COMPLETE responses); instead, it accepts the requests until the sending is complete.

  3. When sending the final SPEECHDATA request (the last chunk of speech), the browser sets the following header:
    X-Nuance-End-Of-SpeechData: true

    The RECOGNIZE operation does not complete until the browser sets this header. Optionally, the browser can set the header in a SPEECHDATA request with a zero-length body, as in the final messages of the example exchange below.
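
Example exchange

The following exchange sketches these steps end to end. It is a minimal illustration, not a capture from a real session: the channel identifier, request IDs, message lengths, and bodies are placeholders. First, the browser issues the RECOGNIZE request with audio enabled on the connection, and Speech Server answers with IN-PROGRESS:

    C->S: MRCP/2.0 543 RECOGNIZE 10000
          Channel-Identifier: 23af1e13@speechrecog
          X-Nuance-MRCP-Audio: true
          Content-Type: application/srgs+xml
          Content-Length: 372

          <grammar omitted>

    S->C: MRCP/2.0 79 10000 200 IN-PROGRESS
          Channel-Identifier: 23af1e13@speechrecog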
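
The browser then delivers the buffered audio in one or more SPEECHDATA requests, each carrying a base64-encoded chunk of 8 kHz µ-law audio. Speech Server sends no per-request response:

    C->S: MRCP/2.0 5532 SPEECHDATA 10001
          Channel-Identifier: 23af1e13@speechrecog
          Content-Type: audio/basic
          Content-Encoding: base64
          Content-Length: 5344

          <base64-encoded audio chunk omitted>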
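
Finally, the browser marks the end of speech, here using the optional zero-length-body form. Speech Server can then complete the RECOGNIZE operation; the closing RECOGNITION-COMPLETE event is shown with an illustrative Completion-Cause and result body:

    C->S: MRCP/2.0 142 SPEECHDATA 10002
          Channel-Identifier: 23af1e13@speechrecog
          X-Nuance-End-Of-SpeechData: true
          Content-Length: 0

    S->C: MRCP/2.0 512 RECOGNITION-COMPLETE 10000 COMPLETE
          Channel-Identifier: 23af1e13@speechrecog
          Completion-Cause: 000 success
          Content-Type: application/nlsml+xml
          Content-Length: 289

          <NLSML results omitted>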