Perform both speech recognition and TTS in a single call

The DLGaaS ExecuteStream method allows you to stream speech audio user input to DLGaaS and receive streaming synthesized speech output in response. DLGaaS can call upon the speech recognition (ASR) and natural language understanding (NLU) capabilities of Mix to understand the intent behind the user speech and continue the dialog. It can also call upon the text-to-speech (TTS) capabilities of Mix to generate speech audio for the next machine response and stream this audio back to the client application.

Streaming audio input plus streaming audio output workflow

When you need to both handle speech input and generate speech output, the workflow involves a combination of the steps for each:

  1. The Dialog service sends an ExecuteResponse with a question and answer action, indicating that it requires user input.

  2. The client application collects speech input audio from the user.

  3. The client application sends a first StreamInput message containing the first audio packet, along with the asr_control_v1, tts_control_v1, request, and control_message parameters. Note that the payload of the request must be an empty ExecuteRequestPayload object. This lets DLGaaS know (1) that there is no text input to process, (2) that speech recognition is required and that it should expect additional audio, (3) that speech synthesis for outputs will be required, and (4) the parameters and resources to use to facilitate and tune the speech transcription and speech synthesis.

  4. The client application sends additional StreamInputs to stream the rest of the audio input.

  5. The client application sends an empty StreamInput to indicate the end of audio.

  6. The audio is transcribed and interpreted, and the interpretation is returned to the dialog application. The dialog continues its flow according to the identified intent and entities.

  7. If the dialog is configured to support the TTS modality, speech audio for the text of the next messages and prompts for the dialog is synthesized.

  8. An initial StreamOutput containing a standard ExecuteResponse and the first part of the synthesized speech audio is sent back to the client application.

  9. The remaining synthesized speech audio is streamed back to the client application in a series of additional StreamOutput messages.

Note about performing speech recognition and TTS in a dialog application

The speech recognition and TTS features provided as part of the DLGaaS API should be used in relation to your Mix.dialog, that is:

  • To perform recognition on a spoken user input provided in answer to a question and answer node
  • To synthesize TTS output audio corresponding to message text for the agent response returned to the user

To perform recognition or TTS outside of a Mix.dialog, use the ASRaaS and TTSaaS services directly.

Configuring ASR and TTS performance

The StreamInput message used for ExecuteStream provides you with a rich set of configurable parameters and controls to fine-tune the performance of ASRaaS speech recognition and TTSaaS speech synthesis. These configurations parallel many of those available in the Nuance Mix ASRaaS Recognizer API and the Nuance Mix TTSaaS Synthesizer API.

At a minimum

At a minimum, the audio format for speech input must be configured for ASR. For TTS, both the audio format and a valid TTS voice must be configured.

The full details are outside the scope of this documentation. For more details, see the Nuance Mix ASRaaS Recognizer API and TTSaaS Synthesizer API documentation.
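A minimal parameter set might look like the following sketch. The top-level names mirror the asr_control_v1 and tts_control_v1 messages, but the exact nesting, format names, and the voice shown here are assumptions for illustration; consult the ASRaaS and TTSaaS API references for the audio formats and voices actually available to you.

```python
# Minimal ASR settings: the input audio format is the one required field.
# (Format name and sample rate are illustrative.)
minimal_asr_params = {
    "audio_format": {"pcm": {"sample_rate_hz": 16000}},
}

# Minimal TTS settings: both an output audio format and a valid voice
# are required. ("Evan" is an example voice name, not a guaranteed one.)
minimal_tts_params = {
    "audio_params": {"audio_format": {"pcm": {"sample_rate_hz": 16000}}},
    "voice": {"name": "Evan", "model": "enhanced"},
}
```

These dicts correspond to the asr_control_v1 and tts_control_v1 fields of the first StreamInput message sent for a user turn.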