Speech Server features

Nuance Speech Server provides a single point of control for voice browsers and their applications to access speech recognition and text-to-speech resources. You can run more than one Speech Server.

At the start of a telephone call, the voice browser initiates a Speech Server session that remains dedicated to that call for its duration.

Speech Server handles activities such as:

  • Audio acquisition: Collecting speaker utterances for recognition or recording.
  • Access to speech products: Providing access to:
    • Recognition, using Nuance Recognizer or the Natural Language Processing service for Dragon Voice recognition
    • Text-to-speech generation using Nuance Vocalizer
  • Audio output: Playing back pre-recorded prompts or audio data generated by a text-to-speech (TTS) engine such as Vocalizer.

The Speech Server architecture runs on Windows and on Unix systems, and it can scale to support small to very large applications.

Speech applications communicate with Speech Server using the Media Resource Control Protocol (MRCP). The voice browser must provide an MRCP client to enable communication with Speech Server.

Protocols

Nuance Speech Server uses standard protocols to connect your speech applications to Nuance speech processing software.

MRCP

If you're developing a voice browser, you can translate VoiceXML to MRCP, a client-server protocol designed specifically for integrating automatic speech recognition (ASR) and text-to-speech (TTS) resources. Speech Server supports:

  • MRCPv2 (preferred)
  • MRCPv1 (alternative for Nuance Recognizer)

MRCPv2 uses the Session Initiation Protocol (SIP) to establish the session and the Session Description Protocol (SDP) to set up the media channels to the server. The SDP messages carried in the SIP message body, and their use, follow the offer/answer model of SIP RFC 3264. Use the SIP OPTIONS method to query the capabilities of MRCP server resources, and use the SIP INVITE method and its responses to set up the session. The Transmission Control Protocol (TCP) carries the MRCPv2 requests and responses between the voice browser and Speech Server.
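
The SDP offer the MRCP client places in the SIP INVITE body describes two channels: a TCP control channel for MRCPv2 messages and an RTP channel for the caller's audio. The sketch below composes such an offer in the general shape shown in RFC 6787; the IP address, ports, and session identifiers are illustrative only, not values Speech Server requires.

```python
# Sketch of an MRCPv2 SDP offer as an MRCP client might place it in a
# SIP INVITE body. Values (IP, ports, IDs) are illustrative.

def build_mrcpv2_offer(client_ip: str, rtp_port: int,
                       resource: str = "speechrecog") -> str:
    """Compose an SDP offer with one MRCPv2 control channel and one audio channel."""
    lines = [
        "v=0",
        f"o=- 123456 1 IN IP4 {client_ip}",  # session origin (illustrative IDs)
        "s=-",
        f"c=IN IP4 {client_ip}",
        "t=0 0",
        # MRCPv2 control channel over TCP; port 9 ("discard") is a
        # placeholder until the TCP connection is established
        "m=application 9 TCP/MRCPv2 1",
        "a=setup:active",
        "a=connection:new",
        f"a=resource:{resource}",
        "a=cmid:1",
        # Audio channel carrying the caller's speech to the recognizer
        f"m=audio {rtp_port} RTP/AVP 0",
        "a=rtpmap:0 PCMU/8000",
        "a=sendonly",
        "a=mid:1",
    ]
    return "\r\n".join(lines) + "\r\n"

offer = build_mrcpv2_offer("192.0.2.10", 49170)
print(offer)
```

A matching answer from the server would supply the actual TCP port for the control channel and the RTP port for audio, completing the offer/answer exchange.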

WebSocket

Speech Server uses the WebSocket protocol to communicate with the Natural Language Processing service, which in turn uses WebSocket to communicate with the Krypton recognition engine and the Natural Language Engine (NLE). (NLE also uses WebSocket to communicate with the Nuance Text Processing Engine, NTpE.) WebSocket is a standard extension of the HTTP protocol that provides full-duplex communication channels over a single TCP connection. The protocol consists of an opening handshake followed by basic message framing, layered over TCP. Krypton, NLE, and NTpE exchange non-binary payloads efficiently using JSON representation.
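
The opening handshake mentioned above is ordinary HTTP: the client sends a random Sec-WebSocket-Key header, and the server proves it understood the upgrade by returning a Sec-WebSocket-Accept value derived as defined in RFC 6455. This is generic WebSocket behavior, not anything Nuance-specific; a minimal sketch of the derivation:

```python
# Compute the Sec-WebSocket-Accept response header from a client's
# Sec-WebSocket-Key, per RFC 6455 (generic WebSocket, not Nuance-specific).

import base64
import hashlib

# Fixed GUID defined by RFC 6455 for the opening handshake
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

def websocket_accept(client_key: str) -> str:
    """SHA-1 the key + GUID, then base64-encode the digest."""
    digest = hashlib.sha1((client_key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")

# Example key/accept pair taken from RFC 6455 itself:
accept = websocket_accept("dGhlIHNhbXBsZSBub25jZQ==")
print(accept)  # s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```

Once the handshake completes, both sides exchange framed messages over the same TCP connection, which is what lets the NLP service send and receive concurrently on each channel.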

The Natural Language Processing service acts as both server and client. As a server, it listens for and accepts WebSocket connections from the Speech Server client; as a client, it connects to the Krypton engine and NLE over WebSocket. Once the connections are established, the NLP service processes and forwards requests received from Speech Server to Krypton and NLE, and relays their responses and asynchronous messages back.

Other features

Nuance Speech Server software supports the following features:

  • Optimizations for VoiceXML interpreters, including (secure) URI resolution, W3C grammar format, parallel grammars, and grammar caching
  • Logging of waveforms (caller speech) and recognition, interpretation, and synthesis events
  • Whole-call recording
  • Transport Layer Security (TLS) and Secure RTP for secure communications