ASR essentials

ASR as a Service is a Nuance recognition engine that transcribes speech into text in real time.

ASRaaS receives a stream of audio speech and streams back a range of text output, from a simple text transcript to a search-optimized lattice of information.

You can choose whether to see only the best hypothesis of the final result or view more of ASRaaS’s “thinking process” as it recognizes the input audio stream.

Recognition essentials

When you send ASRaaS an audio stream, it first identifies an utterance, or a segment of speech ending with silence or a pause. In many cases, the audio contains only one utterance, but ASRaaS can process several utterances in the audio, separated by pauses.

ASRaaS then transcribes the utterance, optionally adding punctuation based on the grammar rules of the language, and formatting the transcription based on default or custom rules. For example, it adds initial capitals and capitalizes proper nouns, and renders “ten centimeters” as “10 cm.”

As it transcribes the utterance, ASRaaS streams the response back to the client application. Depending on the result type (see below), the transcription can be streamed word by word, with adjustments, or by phrase, or by complete utterance. By default, ASRaaS includes multiple hypotheses of each transcription, with confidence levels.
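
To make this concrete, here is a minimal streaming sketch in Python, assuming gRPC stubs generated from Nuance's recognizer.proto. The module paths, endpoint, and audio file are placeholders, and a real request also needs an OAuth token from Mix attached as call credentials:

    import grpc
    # Module paths depend on how you run protoc against Nuance's proto files.
    from nuance.asr.v1 import recognizer_pb2, recognizer_pb2_grpc

    def request_stream(audio_path):
        # The first message carries the recognition parameters...
        yield recognizer_pb2.RecognitionRequest(
            recognition_init_message=recognizer_pb2.RecognitionInitMessage(
                parameters=recognizer_pb2.RecognitionParameters(
                    language='en-US',
                    audio_format=recognizer_pb2.AudioFormat(
                        pcm=recognizer_pb2.PCM(sample_rate_hz=16000)))))
        # ...followed by the audio itself, streamed in small chunks.
        with open(audio_path, 'rb') as audio:
            while chunk := audio.read(4096):
                yield recognizer_pb2.RecognitionRequest(audio=chunk)

    # Placeholder endpoint; use the URL for your Mix geography.
    with grpc.secure_channel('asr.example.com:443',
                             grpc.ssl_channel_credentials()) as channel:
        stub = recognizer_pb2_grpc.RecognizerStub(channel)
        for response in stub.Recognize(request_stream('utterance.wav')):
            if response.HasField('result'):
                for hyp in response.result.hypotheses:
                    # Each hypothesis carries its own confidence score.
                    print(hyp.confidence, hyp.formatted_text)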

What’s an utterance?

In ASRaaS, an utterance is a segment of voice audio. It may be equivalent to a sentence, for example: “I want to pay my credit card bill” or “It’s Monday morning and the sun is shining.” It could also be a phrase, such as: “A double espresso” or “October seventh.” In the context of ASRaaS, however, an utterance is defined simply as a segment of speech ending with a pause of (by default) half a second.

In some cases, the audio you stream to ASRaaS contains only one utterance. If it contains more than one (separated by pauses), ASRaaS processes each utterance in turn.
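
The half-second threshold can be tuned if it does not suit your audio. A sketch, assuming the utterance_end_silence_ms field of RecognitionParameters from the generated stubs above:

    # Treat 800 ms of silence as the end of an utterance instead of the
    # default 500 ms; the field name is an assumption from the proto files.
    params = recognizer_pb2.RecognitionParameters(
        language='en-US',
        audio_format=recognizer_pb2.AudioFormat(
            pcm=recognizer_pb2.PCM(sample_rate_hz=16000)),
        utterance_end_silence_ms=800)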

Audio formats and language

ASRaaS accepts a mono (single-channel) audio stream in several formats, including linear PCM, A-law, μ-law, and Ogg Opus.
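
In the gRPC API, the format is declared in an AudioFormat message with one variant per codec. A sketch of the variants, with message names taken from Nuance's proto files as I understand them (verify against your generated stubs):

    fmt_pcm  = recognizer_pb2.AudioFormat(pcm=recognizer_pb2.PCM(sample_rate_hz=8000))
    fmt_alaw = recognizer_pb2.AudioFormat(alaw=recognizer_pb2.ALaw())
    fmt_ulaw = recognizer_pb2.AudioFormat(ulaw=recognizer_pb2.ULaw())
    fmt_opus = recognizer_pb2.AudioFormat(ogg_opus=recognizer_pb2.OggOpus())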

The audio may be in any supported language.

See Reference topics: Audio formats and Geographies.

Result type and scope

You can specify which results you want to receive from ASRaaS, and the scope of these results.

By default, ASRaaS returns only the final hypothesis of a single utterance, but you may instead choose to see much more, using a combination of two recognition parameters, result_type and utterance_detection_mode.

  • Result types are FINAL (the default), PARTIAL, or IMMUTABLE_PARTIAL. FINAL returns only the most likely hypothesis of the utterance, while the PARTIAL choices also stream more of ASRaaS’s guesses along the way.

  • Utterance detection modes are SINGLE (the default), MULTIPLE, or DISABLED. With SINGLE, ASRaaS recognizes only the first utterance in the audio stream. MULTIPLE recognizes all utterances in the stream, while DISABLED recognizes everything in the audio without separating it into utterances. The sketch after this list shows both parameters in a request.
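
For example, to stream immutable partial results for every utterance in the audio, a sketch using the same assumed stubs (the enum values appear as file-level constants in the generated module):

    params = recognizer_pb2.RecognitionParameters(
        language='en-US',
        audio_format=recognizer_pb2.AudioFormat(
            pcm=recognizer_pb2.PCM(sample_rate_hz=16000)),
        result_type=recognizer_pb2.IMMUTABLE_PARTIAL,      # FINAL, PARTIAL, or IMMUTABLE_PARTIAL
        utterance_detection_mode=recognizer_pb2.MULTIPLE)  # SINGLE, MULTIPLE, or DISABLED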

See Reference topics: Results.

Formatting results

ASRaaS formats its results using the basic rules for each language, including capitalizing names and places, writing numbers as digits, and including currency symbols and standard abbreviations.

You may add your own formatting rules using a series of formatting schemes and options.

  • Formatting schemes such as date and time interpret ambiguous numbers such as “It’s nine seventeen.”

  • Formatting options display text using specific rules such as abbreviating titles or masking profanities.

  • Each language has its own set of formatting schemes and options. For example, Japanese has a scheme, all_as_katakana, that lets you render Japanese speech in the phonetic Katakana script.

In most cases, ASRaaS’s default formatting rules give the best results for the language, but you can adjust the rules if needed.
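
As a sketch, a scheme and option might be attached to the request like this; the identifiers below are modeled on the examples above, and the exact names available per language are listed under Formatted text:

    # Apply the date scheme and abbreviate titles; treat both identifiers
    # as illustrative rather than authoritative.
    params = recognizer_pb2.RecognitionParameters(
        language='en-US',
        audio_format=recognizer_pb2.AudioFormat(
            pcm=recognizer_pb2.PCM(sample_rate_hz=16000)),
        formatting=recognizer_pb2.Formatting(
            scheme='date',
            options={'abbreviate_titles': True}))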

See Reference topics: Formatted text.

Data packs for basic recognition

Data packs provide the underlying transcription functionality for many languages and locales. The ASRaaS data packs are based on neural networks and include components for both acoustic and language models.

To see the available languages for your region, see Geographies.

DLMs for specialization

Domain language models (also known as domain LMs or DLMs) specialize and extend the language model in the data pack for a specific environment.

You can generate (or “train”) a DLM using sentences from your environment, for example, news articles, restaurant menus, or sentences from your call center. You may also include entities, or collections of specific terms: for example, in a DLM created for a pharmacy, the MEDS entity might contain a list of medications.

Use Nuance Mix to create DLMs, then reference them in your recognition requests, in RecognitionResource, as URNs pointing to the DLM’s location in Mix storage.
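
A sketch of attaching a DLM, assuming the RecognitionResource and ResourceReference messages from Nuance's resource.proto; the URN is a placeholder for your own model's location:

    from nuance.asr.v1 import resource_pb2   # module path is an assumption

    dlm = resource_pb2.RecognitionResource(
        external_reference=resource_pb2.ResourceReference(
            type=resource_pb2.DOMAIN_LM,
            uri='urn:nuance-mix:tag:model/<your-context-tag>/mix.asr'),  # placeholder URN
        weight_value=0.7)   # how strongly the DLM influences recognition

    init = recognizer_pb2.RecognitionInitMessage(
        parameters=params, resources=[dlm])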

See Reference topics: Domain LMs.

Wordsets for specific terms

Wordsets are another way to improve and customize recognition. In ASRaaS, a wordset is a list of words or phrases that extends the recognition vocabulary.

For best results, a wordset adds terms to an entity defined in a DLM, but it can also provide a standalone collection of terms. For example, a wordset on the MEDS entity might add new medication names, or pharmacy items that are available only in specific locations.

You create wordsets in JSON format, and can include them as is in recognition requests, in RecognitionResource. Alternatively, for larger wordsets, you can compile them using the Training API and then reference them as URNs in central storage, also in RecognitionResource.
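
A sketch of an inline wordset that extends the MEDS entity from the DLM example; the JSON shape (an entity name mapped to literal/spoken entries) follows the wordset format described under Wordsets, and the medication names are invented:

    import json

    wordset = {
        "MEDS": [
            # "spoken" supplies optional pronunciations for unusual terms.
            {"literal": "Xydrafen", "spoken": ["zye druh fen"]},
            {"literal": "Betalol"}
        ]
    }
    ws = resource_pb2.RecognitionResource(inline_wordset=json.dumps(wordset))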

See Reference topics: Wordsets.

Speaker profiles for acoustic customization

A speaker profile is an adaptation of the data pack’s acoustic model for just one speaker. ASRaaS creates and maintains a speaker profile when you request one as a RecognitionResource and provide a user ID.
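
A sketch of requesting a speaker profile, assuming the SPEAKER_PROFILE resource type and a user_id field on RecognitionInitMessage; no URI is needed because the profile is keyed on the user ID:

    profile = resource_pb2.RecognitionResource(
        external_reference=resource_pb2.ResourceReference(
            type=resource_pb2.SPEAKER_PROFILE))

    init = recognizer_pb2.RecognitionInitMessage(
        parameters=params,
        resources=[profile],
        user_id='user-1234')   # placeholder ID identifying this speaker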

An API for deleting speaker profiles is available upon request. See ForgetMe gRPC API.

See Reference topics: Speaker profiles.

See also