Handle speech input via DLGaaS

The DLGaaS ExecuteStream method allows you to stream speech audio user input to DLGaaS. DLGaaS can call upon the speech recognition (ASR) and natural language understanding (NLU) capabilities of Mix to understand the intent behind the user speech and continue the dialog accordingly.

Streaming audio input workflow

The workflow to perform speech recognition on audio input is as follows (a client-side sketch of steps 3 to 5 appears after the list):

  1. The Dialog service sends an ExecuteResponse with a question and answer action, indicating that it requires user input.

  2. The client application collects speech input audio from the user.

  3. The client application sends a first StreamInput message containing the first audio packet, along with the request, asr_control_v1, and control_message parameters. Note that the payload of the request must be an empty ExecuteRequestPayload object. This lets DLGaaS know (1) that there is no text input to process, (2) that speech recognition is required and that additional audio should be expected, and (3) which parameters and resources to use to facilitate and tune the transcription.

  4. The client application sends additional StreamInputs to stream the rest of the audio.

  5. The client application sends an empty StreamInput to indicate the end of audio.

  6. The audio is transcribed and interpreted, and the interpretation is returned to the dialog application. The dialog continues its flow according to the identified intent and entities.

  7. The Dialog service returns the corresponding ExecuteResponse in a single StreamOutput.
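As a sketch of steps 3 to 5, a Python client could produce the StreamInput sequence with a generator and pass it to ExecuteStream. The module, message, and field names below (dlg_messages_pb2, recognizer_pb2, AsrParamsV1, and so on) are assumptions about the stubs generated from the DLGaaS and ASRaaS protos; verify them against your own generated code.

# Hypothetical client-side sketch of steps 3 to 5 (assumed stub names)
from nuance.dlg.v1 import dlg_messages_pb2 as dlg   # assumed generated module
from nuance.asr.v1 import recognizer_pb2 as asr     # assumed generated module

def stream_inputs(session_id, audio_chunks):
    """Yield the StreamInput sequence for one spoken user turn.

    audio_chunks: non-empty iterable of audio byte chunks.
    """
    chunks = iter(audio_chunks)

    # Step 3: first StreamInput with the request (empty payload), the ASR
    # audio format, and the first audio packet.
    yield dlg.StreamInput(
        request=dlg.ExecuteRequest(
            session_id=session_id,
            selector=dlg.Selector(
                channel="default", language="en-US", library="default"),
            payload=dlg.ExecuteRequestPayload(),  # empty: no text input
        ),
        asr_control_v1=dlg.AsrParamsV1(
            audio_format=asr.AudioFormat(pcm=asr.PCM(sample_rate_hz=16000)),
        ),
        audio=next(chunks),
    )

    # Step 4: additional StreamInputs carrying only audio bytes.
    for chunk in chunks:
        yield dlg.StreamInput(audio=chunk)

    # Step 5: empty StreamInput to signal the end of audio.
    yield dlg.StreamInput()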

The workflow can be visualized in the detailed sequence flow. For example, assuming that the user says “I want an espresso”, the client application sends a series of StreamInput messages with the following content:

# First StreamInput with parameters and initial audio bytes
{
    "request": {
        "session_id": "1c2c9822-45d5-460d-8696-d3fa9d8af8c2",
        "selector": {
            "channel": "default"
            "language": "en-US"
            "library": "default"
        },
        "payload": {}
    },
    "asr_control_v1": {
        "audio_format": {
            "pcm": {
                "sample_rate_hz": 16000
            }
        }
    },
    "audio": "RIFF4\373\000\00..."
}

# Additional StreamInputs with audio bytes
{
    "audio": "...audio_bytes..."
}

# Final empty StreamInput to indicate end of audio
{

}

Once audio has been recognized, interpreted, and handled by DLGaaS, the following StreamOutput is returned:

# StreamOutput

{
    "response": {
        "payload": {
            "messages": [],
            "qa_action": {
                "message": {
                    "nlg": [{
                            "text": "What size coffee would you like? "
                        }
                    ],
                    "visual": [{
                            "text": "What size coffee would you like?"
                        }
                    ],
                    "audio": [] // This is a reference to an audio file.
                }
            }
        }
    }
}
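As a usage sketch, under the same assumptions about generated stub names as above, the client passes the generator to ExecuteStream and iterates the returned StreamOutput messages, reading the prompt text from the QA action:

# Hypothetical usage sketch (same assumed stub names as above)
from nuance.dlg.v1 import dlg_interface_pb2_grpc  # assumed generated module

def run_speech_turn(channel, session_id, audio_chunks):
    """channel: an already-authenticated grpc.Channel (token setup omitted)."""
    stub = dlg_interface_pb2_grpc.DialogServiceStub(channel)

    # ExecuteStream is bidirectional: StreamInput messages in, StreamOutput messages out.
    for stream_output in stub.ExecuteStream(stream_inputs(session_id, audio_chunks)):
        payload = stream_output.response.payload
        if payload.HasField("qa_action"):
            # Print the visual prompt, e.g. "What size coffee would you like?"
            for visual in payload.qa_action.message.visual:
                print(visual.text)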

Configure ASR performance

The asr_control_v1 and control_message fields in a StreamInput message provide a rich set of configurable parameters and controls to fine-tune the performance of the ASRaaS speech recognition used by Dialog. These configurations parallel many of those available in the Nuance Mix ASRaaS Recognizer API.

The full details are outside the scope of this documentation; see the Recognizer gRPC API documentation. At a minimum, the audio format of the speech input must be configured.

ASR performance can also be improved using compiled ASR resources like domain language models and compiled wordsets, or by passing in inline wordsets.
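For illustration only, the first StreamInput's asr_control_v1 might carry additional tuning alongside the mandatory audio format. The optional field and enum names below (utterance_detection_mode, recognition_flags, and so on) are assumptions drawn from the ASRaaS RecognitionParameters; confirm them against the Recognizer gRPC API documentation and the DLGaaS AsrParamsV1 message before relying on them.

# Hypothetical sketch: extra ASR tuning in asr_control_v1 (assumed field names)
from nuance.dlg.v1 import dlg_messages_pb2 as dlg   # assumed generated module
from nuance.asr.v1 import recognizer_pb2 as asr     # assumed generated module

asr_control = dlg.AsrParamsV1(
    # Required: the format of the streamed audio.
    audio_format=asr.AudioFormat(pcm=asr.PCM(sample_rate_hz=16000)),
    # Assumed optional tuning fields mirroring ASRaaS RecognitionParameters:
    utterance_detection_mode=asr.SINGLE,  # assumed enum: one utterance per turn
    recognition_flags=asr.RecognitionFlags(filter_profanity=True),
)

# Use this value for asr_control_v1 in the first StreamInput of the turn.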