Sequence of events of a call

The voice application requests speech recognition and the playing of prompts to callers. Prompt waveforms come from prerecorded audio files or from text-to-speech (speech synthesis). The voice platform uses MRCP to access Nuance speech resources through Speech Server. Nuance Recognizer (the recognizer) is the resource for speech recognition, and Nuance Vocalizer (the text-to-speech engine) is the resource for speech prompts, whether prerecorded or synthesized.

The following diagram summarizes the call flow for a typical session.

The steps are as follows:

  1. Voice platform receives a call and starts a session.

    A session begins when a call comes into the telephony interface. A call can arrive from the PSTN or over a VoIP (Voice over IP) protocol such as SIP, where it arrives as a SIP INVITE.

    A speech application is a collection of one or more VoiceXML documents, each composed of one or more dialogs. The documents share a common application root document containing variable declarations, scripts, links, grammars, event handlers, and properties that are available throughout the application.

    When the VoiceXML context learns about a new telephone call from the telephony interface, it connects to the appropriate application and opens the first VoiceXML page. This single VoiceXML document serves as the application entry point.

    The voice browser fetches and interprets the VoiceXML document. It sends a SIP INVITE request (in MRCPv1, an RTSP SETUP message) to establish a session with Speech Server. The session includes an MRCP control channel and a Real-time Transport Protocol (RTP) or Secure Real-time Transport Protocol (SRTP) channel between the telephony interface and Speech Server for audio (and DTMF) input and output.

    As the call progresses, the voice browser fetches the grammars, scripts, recorded audio, XML data, and additional VoiceXML documents that the application requires.
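    As an illustration of the application structure described in this step, the following sketch shows a root document and an entry-point document that refers to it through the standard VoiceXML `application` attribute. The file names (`app-root.vxml`, `welcome.vxml`), the variable name, and the dialog ids are hypothetical; only the element and attribute names come from the VoiceXML specification.

    ```xml
    <!-- app-root.vxml (hypothetical name): application root document -->
    <vxml version="2.1" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
      <!-- Variable visible to every document in the application -->
      <var name="callerUri" expr="session.connection.remote.uri"/>
      <!-- Property that applies application-wide unless a document overrides it -->
      <property name="timeout" value="5s"/>
      <!-- Event handler shared by all dialogs -->
      <catch event="error">
        <prompt>Sorry, something went wrong. Goodbye.</prompt>
        <exit/>
      </catch>
    </vxml>
    ```

    ```xml
    <!-- welcome.vxml (hypothetical name): application entry point -->
    <vxml version="2.1" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US"
          application="app-root.vxml">
      <!-- First dialog: greet the caller, then move to the next dialog -->
      <form id="welcome">
        <block>
          <prompt>Hello, thank you for calling.</prompt>
          <goto next="#order"/>
        </block>
      </form>
      <!-- Second dialog in the same document -->
      <form id="order">
        <field name="quantity">
          <grammar type="application/srgs+xml" src="/grammars/number.grxml"/>
          <prompt>How many?</prompt>
        </field>
      </form>
    </vxml>
    ```

    Because both documents name the same root, the `error` handler and the `timeout` property would apply to every dialog without being repeated in each document.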

  2. Application requests a prompt.

    Applications typically issue a prompt to acknowledge the call; for example, “Hello, thank you for calling.”

    Most deployed applications use a combination of prompts: prerecorded audio for prompts known in advance, and speech synthesis for prompts created at runtime (for example, to play account balances or any changing data).

    Early in the application development cycle, developers may use Vocalizer to synthesize prompts, because they don’t know the text of the final prompts and have not recorded them. Later in the cycle, developers usually replace most text-to-speech prompts with prerecorded audio, which has advantages for performance.

    When the application requests a prompt, the voice platform interprets the request and generates an MRCP SPEAK request with the necessary information: the URI of a recorded prompt, or the desired language, voice, and so on for a synthesized prompt. (If a third-party IVR or telephony platform is used, the voice browser might fetch a prerecorded audio file directly from a prompt server, but most systems get their prompts from Vocalizer.)

    Once Speech Server receives the MRCP SPEAK request, it forwards the prompt request to Vocalizer.
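    The combination of prerecorded and synthesized prompts described in this step might look like the following in VoiceXML. This is a minimal sketch: the audio path, the form id, and the `amount` variable (with its hard-coded value) are assumptions made for illustration, and a real application would obtain the value at runtime.

    ```xml
    <vxml version="2.1" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
      <form id="balance">
        <!-- Hypothetical value; stands in for data looked up at runtime -->
        <var name="amount" expr="'42 dollars'"/>
        <block>
          <prompt>
            <!-- Prerecorded audio for the fixed part of the prompt.
                 The inner text is synthesized only if the file cannot be played. -->
            <audio src="/prompts/your_balance_is.wav">Your balance is</audio>
            <!-- Changing data is rendered by speech synthesis -->
            <value expr="amount"/>
          </prompt>
        </block>
      </form>
    </vxml>
    ```

    The fallback text inside `<audio>` also suits the development pattern mentioned above: the prompt can be synthesized until the recording exists, and the markup does not need to change afterward.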

  3. Vocalizer queues the prompt.

    Vocalizer generates the audio specified in the SPEAK request, either from a prerecorded file or by synthesis. It sends an IN-PROGRESS event to the browser and sends the audio to Speech Server, which puts the prompt into a queue. The prompt (or concatenated sequence of prompts) remains in the queue until the application requests user input. When the prompt has been played into the media stream, Vocalizer sends a SPEAK-COMPLETE event to the browser.

    The system does not queue prompts in all situations. Sometimes, it fetches and plays prompts immediately (for example, see the fetchhint parameter in the Speech Suite documentation). This VoiceXML construct names the grammar, prompt, and recognition request all at once:

    <field name="quantity">
      <grammar type="application/srgs+xml" src="/grammars/number.grxml"/>
      <prompt>How many?</prompt>
    </field>
  4. Prompt plays when the application requests a recognition.
  5. When the application requests recognition, the voice browser sends a standard MRCP RECOGNIZE request (with accompanying properties). The system plays the queued prompts on the audio channel and returns an MRCP message to the browser confirming that the prompt was played.

    Here are details for the system’s response to the RECOGNIZE request:

    1. Speech Server requests a recognition server.

      For systems that use load balancing, Speech Server asks the Nuance resource manager to find an appropriate recognition server. Based on the characteristics contained in the RECOGNIZE request, the resource manager selects the appropriate recognition server with the lightest load, and sets up a connection between that recognition server and Speech Server.

    2. After establishing the connection, the resource manager drops out of the session, and the recognition server and Speech Server communicate directly.
    3. When the caller responds with speech (known as an utterance) or DTMF (touchtones), the voice platform forwards it to Speech Server on an audio channel, and Speech Server sends the audio to the recognizer for processing.
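    On the application side, the properties that accompany a recognition request are typically declared in VoiceXML. The sketch below uses standard VoiceXML properties (`inputmodes`, `timeout`, `confidencelevel`); the values and the DTMF grammar path are hypothetical, and how a given platform maps these properties onto the RECOGNIZE request is an assumption that should be checked against the platform documentation.

    ```xml
    <field name="quantity">
      <!-- Accept both speech and touchtones for this field -->
      <property name="inputmodes" value="dtmf voice"/>
      <!-- Wait up to 4 seconds for the caller to start responding -->
      <property name="timeout" value="4s"/>
      <!-- Treat results with confidence below 0.5 as a nomatch -->
      <property name="confidencelevel" value="0.5"/>
      <!-- Speech grammar from the example above, plus a hypothetical DTMF grammar -->
      <grammar type="application/srgs+xml" src="/grammars/number.grxml"/>
      <grammar mode="dtmf" type="application/srgs+xml" src="/grammars/digits-dtmf.grxml"/>
      <prompt>How many?</prompt>
      <!-- Standard handlers for silence and unrecognized input -->
      <noinput>I did not hear anything. <reprompt/></noinput>
      <nomatch>Sorry, I did not understand. <reprompt/></nomatch>
    </field>
    ```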
  6. Caller interrupts a prompt.

    If the caller does not wait for the prompt to complete, but starts speaking right away, the recognizer identifies the interruption and notifies Speech Server to stop playing the prompt. This feature is known as barge-in.
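    In VoiceXML, barge-in is usually controlled with the standard `bargein` and `bargeintype` properties and the `bargein` attribute of `<prompt>`. The following is a sketch under the assumption that the platform honors these standard settings; the form id and prompt wording are hypothetical.

    ```xml
    <form id="order">
      <!-- Allow callers to interrupt prompts in this form -->
      <property name="bargein" value="true"/>
      <!-- "speech": any detected speech or DTMF stops the prompt.
           "hotword": only input that matches an active grammar stops it. -->
      <property name="bargeintype" value="speech"/>
      <block>
        <!-- A notice the caller should hear in full: barge-in disabled here -->
        <prompt bargein="false">Calls may be recorded.</prompt>
      </block>
      <field name="quantity">
        <grammar type="application/srgs+xml" src="/grammars/number.grxml"/>
        <!-- Inherits bargein="true" from the form-level property -->
        <prompt>How many?</prompt>
      </field>
    </form>
    ```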

  7. Recognition server returns results to the application.

    Once the recognition server processes the audio, it sends the results (an array with the n-best list) through Speech Server to the voice browser, which passes the result to the application. Typically, the application uses ECMAScript to fetch the result using the shadow variable application.lastresult$. See the Speech Suite documentation for information on getting recognition results.

    Depending on the content of the recognition result, the application either begins the cycle again with a new prompt or ends the call.
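    As a sketch of how an application might read the result and decide what to do next, the field below inspects the top entry of `application.lastresult$`. The `maxnbest` property and the `confidence` and `utterance` members are standard VoiceXML; the 0.7 threshold, the variable name `best`, and the prompt wording are arbitrary choices made for illustration.

    ```xml
    <field name="quantity">
      <!-- Ask for up to three hypotheses in the n-best list -->
      <property name="maxnbest" value="3"/>
      <grammar type="application/srgs+xml" src="/grammars/number.grxml"/>
      <prompt>How many?</prompt>
      <filled>
        <!-- Element 0 of application.lastresult$ is the best hypothesis -->
        <var name="best" expr="application.lastresult$[0]"/>
        <if cond="best.confidence &lt; 0.7">
          <!-- Low confidence: clear the field so the cycle starts again -->
          <prompt>Sorry, I am not sure I heard that.</prompt>
          <clear namelist="quantity"/>
        <else/>
          <!-- Confident result: confirm it and end the session -->
          <prompt>You said <value expr="best.utterance"/>. Goodbye.</prompt>
          <exit/>
        </if>
      </filled>
    </field>
    ```

    Clearing the field causes the form to revisit it, so the prompt is queued and a new recognition is requested, which corresponds to the cycle beginning again.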