Responses and input and output modalities

At any given point in the dialog, the modalities that your dialog model supports for system output and user input depend on what is configured in Mix.dialog for the current channel and node.

Each execute response payload indicates the modalities supported in both directions for that turn of the dialog. Your client application can use this information on each turn to first play messages to the user and then, if the response contains a QA action, collect input accordingly.

Responses and supported output modalities

The output modalities supported for each message in a given turn can be determined from the fields that are present in each ExecuteResponse message action:

  • The nlg field indicates that TTS-generated speech output is supported for the message. The generated audio itself is not in this field; it is returned in the audio field of a StreamOutput when TTS was requested in an ExecuteStream() call. The nlg field contents provide backup text that can be used to retry if TTS generation was requested in an ExecuteStream() call but did not succeed.
  • The visual field indicates that rich text output is supported for the message.
  • The audio field indicates that prerecorded audio script messages are supported as output for the message.
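The field check described above can be sketched in Python as follows. This is a minimal illustration, not the service's client API: it assumes the message has already been converted to a plain dict (for example, from the protobuf response), and the function name, the modality labels, and the sample message contents are hypothetical. Only the field names nlg, visual, and audio come from the text above.

```python
def supported_output_modalities(message: dict) -> list[str]:
    """Return the output modalities implied by the fields present in a message.

    Assumption: a field that is missing or empty means the modality
    is not supported for this message.
    """
    # Hypothetical labels; only the field names come from the documentation.
    field_to_modality = {
        "nlg": "tts",
        "visual": "rich_text",
        "audio": "recorded_audio",
    }
    return [
        modality
        for field, modality in field_to_modality.items()
        if message.get(field)
    ]


# Illustrative message: TTS and rich text are available, recorded audio is not.
message = {
    "nlg": [{"text": "What size would you like?"}],
    "visual": [{"text": "<b>What size</b> would you like?"}],
    "audio": [],
}
print(supported_output_modalities(message))  # ['tts', 'rich_text']
```

A client could use the returned list to decide whether to request TTS, render rich text, or play a recording for each message in the turn.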

Responses and supported input modalities

Self-hosted environments: This feature is available only in version 1.5.0 or later of the Dialog service. This corresponds to engine pack 2.4 for Speech Suite deployments and engine pack 3.11 for self-hosted Mix deployments.

The input modalities supported for the current channel and QA node are indicated in an input_modes field of the execute response payload, under the recognition_settings of the QA action (qa_action).

{
    "payload": {
        ...
        "qa_action": {
            ...
            "recognition_settings": {
                "collection_settings": {
                    "timeout": "7000",
                    "complete_timeout": "0",
                    "incomplete_timeout": "1500",
                    "max_speech_timeout": "12000"
                },
                "speech_settings": {
                    "sensitivity": "0.5",
                    "barge_in_type": "speech",
                    "speed_vs_accuracy": "0.5"
                },
                "input_modes": [
                    "text",
                    "voice"
                ]
            },
            ...
        },
        "channel": "Voice_Chat"
    }
}
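As a rough sketch of how a client might read this field, the Python below walks the payload shown above and returns the listed input modes. It assumes the response has been converted to a plain dict; the function name and the trimmed sample response are hypothetical, and only the keys payload, qa_action, recognition_settings, and input_modes come from the example above.

```python
def supported_input_modes(response: dict) -> list[str]:
    """Return the input modes listed for the QA action, if there is one.

    An empty list is returned when the response has no QA action,
    in which case there is no input to collect on this turn.
    """
    qa_action = response.get("payload", {}).get("qa_action")
    if not qa_action:
        return []
    settings = qa_action.get("recognition_settings", {})
    return list(settings.get("input_modes", []))


# Trimmed-down version of the payload shown above.
response = {
    "payload": {
        "qa_action": {
            "recognition_settings": {"input_modes": ["text", "voice"]}
        },
        "channel": "Voice_Chat",
    }
}

modes = supported_input_modes(response)
if "voice" in modes:
    print("Collect speech input")
if "text" in modes:
    print("Show a text entry field")
```

Returning an empty list when no QA action is present lets the same call cover turns where the client only plays messages.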

The following topics describe how to handle these different types of output and input in more detail.