FAQ for ASRaaS

These frequently asked questions (FAQ) describe the AI impact of ASRaaS in Mix.

What is ASRaaS?

Nuance ASRaaS (Automated Speech Recognition as a Service) is a cloud-based service for online streaming speech-to-text transcription. It is integrated into an Interactive Voice Response (IVR) system in the context of the Digital Contact Center Platform (DCCP).

ASRaaS receives a stream of audio speech and streams back text output in real time. The results include a simple text transcript as well as full word-by-word information in JSON format. It uses models based on deep neural networks to recognize and transcribe audio.
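
For illustration, a word-by-word result might look like the following Python sketch. The field names are illustrative assumptions only, not the exact ASRaaS response schema:

    # Hypothetical word-by-word result; the field names are illustrative
    # only, not the exact ASRaaS response schema.
    result = {
        "text": "I'd like to check my account balance",
        "confidence": 0.92,
        "words": [
            {"word": "I'd", "confidence": 0.95, "start_ms": 120, "end_ms": 310},
            {"word": "like", "confidence": 0.97, "start_ms": 310, "end_ms": 480},
            # ... one entry per recognized word
        ],
    }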

What are the capabilities of ASRaaS?

ASRaaS accepts a real-time stream of audio speech as input, supporting PCM, μ-law, and A-law codecs. When it receives the audio stream, ASRaaS first identifies an utterance, that is, a segment of speech ending with silence or a pause. In many cases, the audio contains only one utterance, but ASRaaS can process several utterances in the audio, separated by pauses.
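
As a rough sketch of the idea of utterance detection, the following Python heuristic splits linear PCM audio at long pauses. It is illustrative only; ASRaaS performs its own utterance detection on the incoming stream, and the thresholds here are arbitrary assumptions:

    import array

    def split_utterances(pcm16: bytes, sample_rate: int = 8000,
                         frame_ms: int = 20, silence_rms: int = 500,
                         min_pause_ms: int = 600) -> list[bytes]:
        """Split 16-bit linear PCM into utterances at long pauses."""
        samples = array.array("h", pcm16)
        frame = sample_rate * frame_ms // 1000
        utterances, current, silent_ms = [], [], 0
        for i in range(0, len(samples), frame):
            chunk = samples[i:i + frame]
            rms = (sum(s * s for s in chunk) / max(len(chunk), 1)) ** 0.5
            if rms < silence_rms:
                silent_ms += frame_ms
                if current and silent_ms >= min_pause_ms:
                    # Long pause: close the current utterance.
                    utterances.append(array.array("h", current).tobytes())
                    current = []
            else:
                silent_ms = 0
            if rms >= silence_rms or current:
                current.extend(chunk)
        if current:
            utterances.append(array.array("h", current).tobytes())
        return utterances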

ASRaaS then transcribes the utterance, optionally adding punctuation based on the grammar rules of the language, and formatting the transcription based on default or custom rules.

As it transcribes the utterance, ASRaaS streams the response back to the client application. Depending on the result type specified in the request, the transcription can be streamed phrase by phrase or by complete utterance. By default, ASRaaS includes multiple hypotheses for each transcription, with confidence levels.
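
As a sketch of how a client might consume these streamed results, the following Python assumes a simplified response shape (dictionaries with final, hypotheses, and confidence fields); that shape is an assumption for illustration, not the exact ASRaaS schema:

    def best_transcript(responses) -> str:
        """Join the top-confidence hypothesis of each final result."""
        parts = []
        for response in responses:  # arrives phrase by phrase or per utterance
            result = response["result"]
            if not result["final"]:
                continue  # ignore partial results; they are superseded later
            top = max(result["hypotheses"], key=lambda h: h["confidence"])
            parts.append(top["text"])
        return " ".join(parts)

    responses = [
        {"result": {"final": False,
                    "hypotheses": [{"text": "check my", "confidence": 0.40}]}},
        {"result": {"final": True,
                    "hypotheses": [{"text": "check my balance", "confidence": 0.93},
                                   {"text": "check my balances", "confidence": 0.71}]}},
    ]
    print(best_transcript(responses))  # -> check my balance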

AI is used to model the statistics associated with automatic speech recognition. Deep neural networks (DNNs) capture the statistical properties of acoustics and language, either independently or jointly, and these networks are scored against the customer audio using a decoding algorithm to produce the most likely transcription.
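
Conceptually, decoding can be pictured as combining acoustic and language model scores for candidate transcriptions and keeping the best-scoring one. The weights and scores below are made-up numbers for illustration only:

    # Candidate transcriptions scored with made-up (acoustic, language)
    # log probabilities; the higher combined score wins.
    candidates = {
        "check my balance": (-12.1, -4.2),
        "czech my balance": (-11.8, -9.7),
    }
    lm_weight = 0.8  # illustrative language model weight

    def combined_score(acoustic: float, language: float) -> float:
        return acoustic + lm_weight * language

    best = max(candidates, key=lambda text: combined_score(*candidates[text]))
    print(best)  # -> check my balance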

What is the intended use of ASRaaS?

In the context of this IVR system, ASRaaS transcribes general end user speech into text, to be subsequently processed by other services to extract the semantic meaning of the text. Ultimately, these other services return a text prompt, which is then synthesized as speech in an audio prompt using the Neural Text-to-Speech service, Neural TTSaaS.

The end user speech is provided as input to ASRaaS along with various flow or formatting parameters. The output consists of one or several potential text transcriptions along with various metadata. No speaker adaptation is performed in the context of this IVR system.

In addition, a customer-specific or end user-specific wordset may optionally be provided inline to improve speech transcription accuracy.
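
For illustration, an inline wordset might pair written forms with expected pronunciations, as in the Python sketch below. The entity name and the exact schema are assumptions; refer to the product documentation for the supported wordset format:

    import json

    # Hypothetical inline wordset: written forms ("literal") paired with
    # expected pronunciations ("spoken"). The entity name NAMES and the
    # field names are assumptions for illustration.
    wordset = {
        "NAMES": [
            {"literal": "La Jolla", "spoken": ["la hoya"]},
            {"literal": "Beaulieu", "spoken": ["bewly"]},
        ]
    }
    inline_wordset = json.dumps(wordset)  # sent along with the recognition request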

How was ASRaaS evaluated? What metrics are used to measure performance?

ASRaaS is evaluated using Word Error Rate (WER), an accuracy measurement. The WER is the percentage of errors (a weighted average of insertion, deletion, and substitution errors) made by ASRaaS, so “accuracy” can be thought of as 100% minus the WER.
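
The standard WER computation counts word-level edits between a reference transcript and the recognized output. A minimal Python version is shown below; it uses the common unweighted formula, which may differ from the exact weighting used in evaluation:

    def word_error_rate(reference: str, hypothesis: str) -> float:
        """WER = (substitutions + deletions + insertions) / reference words."""
        ref, hyp = reference.split(), hypothesis.split()
        # Levenshtein distance over words, via dynamic programming.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i  # i deletions
        for j in range(len(hyp) + 1):
            d[0][j] = j  # j insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution or match
        return 100.0 * d[len(ref)][len(hyp)] / max(len(ref), 1)

    # Example: 1 substitution over 5 reference words -> WER 20%, accuracy 80%.
    print(word_error_rate("check my account balance please",
                          "check my account balance police"))  # 20.0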

What are the limitations of ASRaaS? How can users minimize the impact of these limitations when using the system?

In this product, ASRaaS is currently available in a limited number of languages, and may have difficulty recognizing speech in other languages or locales.

The accuracy of ASRaaS will be poor if the language specified in the recognition request does not match the language spoken in the audio. Developers should take care to set the recognition language to the language end users are expected to speak.

ASRaaS does not support multilingual speech as input.

ASRaaS may receive audio containing profanity or inappropriate language as input, but the software has customizable filters that remove or mask profanity from the output transcription.
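
As a rough client-side illustration of masking behavior only (the actual filters run in the service and are configurable there, and the word list below is a mild placeholder, not the service's vocabulary):

    import re

    PROFANITY = {"darn", "heck"}  # placeholder list for illustration

    def mask_profanity(text: str) -> str:
        """Replace listed words with asterisks, preserving word length."""
        def mask(match: re.Match) -> str:
            word = match.group(0)
            return word if word.lower() not in PROFANITY else "*" * len(word)
        return re.sub(r"[A-Za-z']+", mask, text)

    print(mask_profanity("Well darn it"))  # -> Well **** it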

What operational factors and settings allow for effective and responsible use of the feature?

Operational factors and settings are controlled by the service, not the user.