Generate and play synthesized speech output

The DLGaaS ExecuteStream method allows you to request synthesized speech output. DLGaaS can call upon the text to speech (TTS) capabilities of Mix to generate speech audio for the next machine response and stream this audio back to the client application. Synthesized speech output for the current response is requested and configured in the most recent preceding request.

Speech synthesis carried out by Nuance TTSaaS can be orchestrated either by Dialog or by the client application. The workflow for generating and playing synthesized speech audio differs depending on which of these two options you choose.

Synthesize an audio output message using TTS with Dialog orchestration

To obtain synthesized speech output for the next machine response:

  1. The client application sends a StreamInput message with the tts_control_v1 and request parameters to DLGaaS. The ExecuteRequest contains the user input, generally as text in this case, or requested data.

  2. DLGaaS handles the input, and continues the flow of the dialog accordingly.

  3. If the dialog is configured to support the TTS modality, speech audio is synthesized for the text of the next messages and prompts in the dialog.

  4. An initial StreamOutput response containing a standard ExecuteResponse and the first part of the synthesized speech audio is sent back to the client application.

  5. The remaining synthesized speech audio is streamed back to the client application in a series of additional StreamOutput messages.

For example, assuming that the user entered text input, typing “I want an espresso”, the client application sends a single StreamInput message with the following content:

# StreamInput
{
    "request": {
        "session_id": "1c2c9822-45d5-460d-8696-d3fa9d8af8c2",
        "selector": {
            "channel": "default"
            "language": "en-US"
            "library": "default"
        },
        "payload": {
            "user_input": {
                "user_text": "I want an espresso"
            }
        }
    },
    "tts_control_v1": {
        "audio_params": {
            "audio_format": {
                "pcm": {
                    "sample_rate_hz": 16000
                }
            }
        }
    }
}

Once the user text has been interpreted and handled by DLGaaS, the following series of StreamOutput messages is returned:

# First StreamOutput
{
    "response": {
        "payload": {
            "messages": [],
            "qa_action": {
                "message": {
                    "nlg": [{
                            "text": "What size coffee would you like? "
                        }
                    ],
                    "visual": [{
                            "text": "What size coffee would you like?"
                        }
                    ],
                    "audio": []
                }
            }
        }
    },
    "audio": "RIFF4\373\000\00.."
}

# Additional StreamOutputs with audio bytes
{
    "audio": "...audio_bytes..."
}
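
Putting the pieces together, the client side of this exchange might look as follows. This is a minimal sketch in Python, assuming stubs generated from the DLGaaS proto files; the nuance.dlg.v1 module path, the dlgaas_pb2 and dlgaas_pb2_grpc module names, and the channel setup are assumptions for illustration, not part of the documented API:

# Sketch: send a StreamInput and collect the streamed audio
from google.protobuf.json_format import ParseDict

# Assumed module names for generated DLGaaS stubs
from nuance.dlg.v1 import dlgaas_pb2, dlgaas_pb2_grpc

def build_stream_input():
    # The same StreamInput payload as in the example above, as a dict
    stream_input = {
        "request": {
            "session_id": "1c2c9822-45d5-460d-8696-d3fa9d8af8c2",
            "selector": {"channel": "default", "language": "en-US", "library": "default"},
            "payload": {"user_input": {"user_text": "I want an espresso"}},
        },
        "tts_control_v1": {
            "audio_params": {"audio_format": {"pcm": {"sample_rate_hz": 16000}}}
        },
    }
    return ParseDict(stream_input, dlgaas_pb2.StreamInput())

def run(channel):  # channel: an authenticated grpc.Channel
    stub = dlgaas_pb2_grpc.DialogServiceStub(channel)
    audio = bytearray()
    # ExecuteStream takes a stream of StreamInput messages; here we send one
    for output in stub.ExecuteStream(iter([build_stream_input()])):
        if output.HasField("response"):
            print(output.response)      # the standard ExecuteResponse
        audio.extend(output.audio)      # audio bytes arrive across messages
    return bytes(audio)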

Configure TTS performance

The StreamInput message used for ExecuteStream provides you with a rich set of configurable parameters and controls to fine-tune the performance of TTSaaS speech generation. These configurations parallel many of those available within the Nuance Mix TTSaaS Synthesizer API.

The full details are outside the scope of this documentation. For more details, see the Synthesizer gRPC API documentation.

Note that not all of the available TTS configurations are required. At a minimum, DLGaaS requires that you specify an audio format and a valid voice in order to obtain generated speech from TTSaaS. The voice can be configured either via the client API at runtime or in the project in Mix.dialog.
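
For example, a minimal tts_control_v1 that sets the voice at runtime could look like this, following the field layout of the earlier example (the voice values are illustrative):

# Minimal tts_control_v1 with a runtime voice
"tts_control_v1": {
    "audio_params": {
        "audio_format": {
            "pcm": {
                "sample_rate_hz": 16000
            }
        }
    },
    "voice": {
        "name": "Evan",
        "model": "enhanced"
    }
}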

Configure preset TTS settings in Mix.dialog

Self-hosted environments: This feature requires version 1.3 of the Dialog service. This corresponds to engine pack 2.2 for Speech Suite deployments and engine pack 3.10 for self-hosted Mix deployments. The VoiceXML Connector does not support this feature.

Preset TTS voice settings can be configured in Mix.dialog at the global, channel, and language level. If these presets are defined in Mix.dialog, then they will be used by Dialog if speech synthesis is requested in an ExecuteStream call. Note that while in this case the voice field of tts_control_v1 does not need to be set, the audio_format field of audio_params does need to be set for TTSaaS to generate and return speech audio.
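
In that case, the tts_control_v1 sent with the StreamInput can be reduced to the audio format alone:

# tts_control_v1 when the voice presets come from Mix.dialog
"tts_control_v1": {
    "audio_params": {
        "audio_format": {
            "pcm": {
                "sample_rate_hz": 16000
            }
        }
    }
}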

Here are some additional important points to remember in your design and configuration in Mix.dialog:

  • Microsoft neural voices can only be configured in Mix.dialog TTS settings. These voices cannot be passed in via the DLGaaS API. If a neural voice is configured in Mix.dialog, the project, when built, is configured for Neural TTS, and you will not be able to pass in a new voice via the API.
  • In your dialog designs, avoid changing the active language mid-flow between collection states, since messages are concatenated in the ExecuteResponse. To ensure that messages play in the intended language, you can, for example, set the language variable in the System Actions section of a question and answer node. All messages after the collection step will then be in the new active language.
  • Make sure that the TTS voice settings configured in Mix.dialog are valid in your target deployment environment. See Configure TTS settings for more information.

Extract and play TTS audio from StreamOutput messages

If TTS generation worked correctly, you can extract and play the TTS audio.

For each StreamOutput message returned by ExecuteStream:

  1. Extract the audio bytes from the audio field of the StreamOutput message.
  2. Play the audio bytes to the user, as sketched below.
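
A minimal sketch of these two steps, assuming the PCM format requested earlier and that audio holds the bytes concatenated from all StreamOutput messages (for example, as returned by the run() sketch above):

# Sketch: save the collected audio so any audio player can play it
def save_wav(audio: bytes, path: str = "response.wav") -> None:
    # The first chunk in the example above begins with a RIFF header, so the
    # concatenated bytes already form a complete WAV file; write them as-is.
    with open(path, "wb") as f:
        f.write(audio)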

Retry TTS generation using nlg text

If TTS generation via orchestration failed, you can retry TTS generation directly with TTSaaS using backup nlg text.

The procedure is the same as that for Generate and play TTS audio client side.

TTS with orchestration by client app

To support alternate solutions for text to speech, DLGaaS provides the current conversation language and the TTS voice settings configured in Mix.dialog for the response messages as part of ExecuteResponse payload messages. The active language lets the client application know which language to generate speech for. If you are using Mix TTSaaS, the voice information tells the client application which Nuance voice profile to request as part of a TTSaaS SynthesisRequest.

Here is an example of language and TTS voice parameters in the DLGaaS response:

{
  "payload": {
    "messages": [],
    "qa_action": {
      "message": {
        "nlg": [{
            "text": "What type of coffee would you like?"
          }
        ],
        "visual": [{
            "text": "What <b>type</b> of coffee would you like? For the list of options, see the <a href=\"www.myserver.com/menu.html\">menu</a>."
          }
        ],
        "language": "en-us",
        "tts_parameters": {
            "voice": {
                "name": "Evan",
                "model": "enhanced",
                "gender": "MALE",
                "language": "en-us",
                "voice_type": "standard"
            }
        }
      }
    }
  }
}

The nlg text contents of response payload messages provide input to pass to TTSaaS if you are doing your own orchestration.

This field contains the message contents as defined in Mix.dialog, including dynamically rendered content as applicable. Depending on what was configured in Mix.dialog, this may include SSML tags.
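
For example, an nlg segment might carry SSML markup such as the following (the text is illustrative):

# nlg text containing SSML tags
"nlg": [{
        "text": "<speak>What size coffee would you like? <break time=\"300ms\"/></speak>"
    }
]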

Voice type and orchestration

The voice parameters within tts_parameters include a voice_type field. This field can take one of two values: standard or neural.

For both standard and neural voices, you can send a TTSaaS SynthesisRequest to the same standard TTSaaS endpoint. However, some details and requirements will be different to invoke neural voices using Neural TTS compared to standard voices. See Input to synthesize, Input: text or SSML and NTTS sample synthesis client for more details.

Generate and play TTS audio client side

For each message within the ExecuteResponse of the first StreamOutput message:

  1. Extract the contents of the nlg field. This gives the text to synthesize for the message.
  2. Extract the contents of the tts_parameters field. This gives the settings to use for TTSaaS, including the voice to use.
  3. Use the TTSaaS Synthesizer API to generate speech audio using the nlg text, the voice information from the tts_parameters, and audio parameters specifying the encoding format to use for the generated audio. The Synthesizer API will generate speech and return a stream of audio bytes (see the sketch after this list).
  4. Extract the audio bytes from the returned audio stream and play to the user.
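
A minimal sketch of these steps in Python, assuming stubs generated from the Synthesizer proto files; the nuance.tts.v1 module path and the synthesizer_pb2 module names are assumptions for illustration, and the authoritative message definitions are in the Synthesizer gRPC API documentation:

# Sketch: synthesize one DLGaaS message with TTSaaS
from google.protobuf.json_format import ParseDict

# Assumed module names for generated TTSaaS stubs
from nuance.tts.v1 import synthesizer_pb2, synthesizer_pb2_grpc

def synthesize_message(channel, message):  # channel: a grpc.Channel
    # Steps 1 and 2: extract the text and voice settings from the message
    text = " ".join(seg.text for seg in message.nlg)
    voice = message.tts_parameters.voice

    # Step 3: build a SynthesisRequest with the text, voice, and audio format.
    # If the nlg text contains SSML tags, use the "ssml" input instead.
    request = ParseDict(
        {
            "voice": {"name": voice.name, "model": voice.model},
            "audio_params": {"audio_format": {"pcm": {"sample_rate_hz": 16000}}},
            "input": {"text": {"text": text}},
        },
        synthesizer_pb2.SynthesisRequest(),
    )

    # Step 4: collect the audio bytes from the streamed responses
    stub = synthesizer_pb2_grpc.SynthesizerStub(channel)
    audio = bytearray()
    for response in stub.Synthesize(request):
        audio.extend(response.audio)
    return bytes(audio)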

If your project and application are multi-lingual, you need to be careful if your dialog flow is configured to change the active language mid-flow. TTSaaS voices are language specific, so different messages within the same response may use different TTS voices. In this case you may need to send a separate TTSaaS request for each message in the response.

Otherwise, for a single-language project configured to use one voice, it is more efficient to concatenate the nlg text from all the messages in the response and send a single TTSaaS request to generate the speech audio for the entire response.
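
For example, a sketch that gathers the text, with field access following the response structure shown earlier:

# Sketch: concatenate the nlg text of all messages in the response
def collect_nlg_text(payload):
    # Include both the messages list and the action message, if any
    messages = list(payload.messages)
    if payload.HasField("qa_action"):
        messages.append(payload.qa_action.message)
    return " ".join(seg.text for msg in messages for seg in msg.nlg)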