Speech resources overview

Before jumping into the details of how to create and use speech resources, some background on speech recognition and speech resources may be useful. The following sections provide a brief overview.

If you are already familiar with Krypton or ASRaaS and the use of speech resources there, feel free to skip this section.

Speech recognition

One of the basic tasks in natural language processing (NLP) is automatic speech recognition (ASR), also known as speech-to-text. In speech recognition, an application takes in a user’s speech as audio input, and transcribes the text corresponding to the speech. Speech recognition as a task is not concerned with trying to understand the text, but simply rendering an accurate transcript.

Relation of ASR to other NLP tasks

In speech-enabled applications, speech recognition is related to some other NLP tasks. Natural language understanding (NLU) tasks seek to extract meaning from user text input. In the case of a digital agent, this usually means interpreting what a user wants to do in terms of some set of tasks it is prepared to handle. NLU can also use a speech recognition transcript of user speech as an input.

Conversational agents depend on NLU to identify user intents and act on them. An accurate understanding of speech must start with an accurate text transcript, so speech-enabled dialog agents rely on highly accurate speech recognition to be effective.

This document will guide you on best practices for creating effective speech recognition resources using Mix tooling.

Resources for speech recognition

Effective speech recognition depends on a number of different resources. Some of these resources are broadly useful across a language, while others provide enhancement to recognition for a particular domain. The following sections describe the different types of resources and how they are used in speech recognition.

Base language recognition with Nuance data packs

Nuance speech recognition using Krypton engine or ASRaaS starts with one or more Nuance factory data packs specific to different languages and locales.

Nuance data packs are based on neural network technology and include two components:

  • An acoustic model: Translates raw speech audio into likely phonetic representations given the language
  • A base language model: Identifies the most likely sequence of words corresponding to the speech audio

Base language models

A base language model provides baseline recognition of general, everyday speech in a particular language and location.

A base language model is a statistical language model trained on a general corpus of speech. As such, it is shaped by the frequencies of words in general speech and how words appear in relation to other words. Nuance ASR services provide base language models for supported languages and locales.
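As a conceptual illustration only (the actual data pack models are neural and far more sophisticated), a statistical language model can be thought of as scoring word sequences by how often words follow one another in its training corpus:

```python
from collections import defaultdict

def train_bigram_counts(corpus):
    """Count how often each word follows another in the corpus."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(words, words[1:]):
            counts[prev][word] += 1
    return counts

def sequence_score(counts, sentence):
    """Score a sentence by the product of bigram relative frequencies."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    score = 1.0
    for prev, word in zip(words, words[1:]):
        total = sum(counts[prev].values())
        if total == 0:
            return 0.0
        score *= counts[prev][word] / total
    return score

corpus = ["check my balance", "check my account", "close my account"]
counts = train_bigram_counts(corpus)

# A word order seen in the corpus scores higher than an unseen one
assert sequence_score(counts, "check my balance") > sequence_score(counts, "balance my check")
```

This is why a model trained on general speech favors common word sequences: the statistics of the training data directly shape which hypotheses score well.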

While base language models perform well on typical speech, there might be domain or customer-specific improvements you can make to increase accuracy for your application.


Builtins

A data pack may include one or more builtins. Builtins are predefined recognition objects for common tasks (numbers, dates, and so on) or general vocabulary for vertical domains such as financial services or healthcare. The available builtins depend on the language and region of the data pack.

Developer-provided resources for domain-specific recognition

A domain is a specialized subject area where people will make requests using vocabulary specific to that subject area.

For example, in retail banking, customers will want to take specific actions such as:

  • Open or close an account
  • Check a balance
  • Make a funds transfer
  • Request a loan
  • Make a payment

As well, people use specific terminology such as account number, account type, balance, checking, savings, investment, mortgage, loan, transfer, and so on. Or, as another example, imagine the fields of medicine or pharmacology, which include a lot of domain-specific terminology.

Certain terms appear only in that domain or have special meanings specific to the domain, while other terms simply appear much more frequently in that domain's speech and text than in general speech and writing.

A base model trained on the patterns of general speech tends to be significantly less accurate in recognizing words in specialized domains. Other resources help supplement performance for specific domains.

Domain language models

A domain language model (DLM) is complementary to the base language model and is trained on words or phrases used in the domain. The DLM can incorporate any or all of:

  • Text samples
  • Entities
  • Pronunciations
  • Rewrites

Each of these is described in the following sections.

The base language model and a well-trained DLM together produce more accurate recognition results on domain-specific speech than the base language model alone.
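To make the four ingredient types concrete, here is a hedged sketch of the kind of material a DLM might be trained on. The structure and all names below (the `ACCOUNT_TYPE` entity, the samples, the rewrite) are illustrative assumptions, not an actual Mix file format:

```python
# Illustrative shapes of DLM training inputs -- not Mix's actual formats.
dlm_training_data = {
    # Text samples: example sentences showing domain phrasing
    "text_samples": [
        "I want to check my checking account balance",
        "transfer two hundred dollars from savings to checking",
        "make a payment on my mortgage",
    ],
    # Entities: a list entity and its possible values (hypothetical name)
    "entities": {
        "ACCOUNT_TYPE": ["checking", "savings", "investment", "mortgage"],
    },
    # Pronunciations: rough phonetic spelling guidance for unusual terms
    "pronunciations": {
        "GIC": "gee eye see",
    },
    # Rewrites: map a spoken form to the desired written form in transcripts
    "rewrites": {
        "two hundred dollars": "$200",
    },
}

assert "checking" in dlm_training_data["entities"]["ACCOUNT_TYPE"]
```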

The base language model and complementary DLMs are the main resources for effective speech recognition. But other resources can also be used to further improve recognition performance.


Wordsets

Wordsets provide additional values for domain-related entities already defined in the DLM as list entities. Speech wordsets can include basic rough guidance on pronunciation using phonetic spelling.

There are two ways to use wordsets:

  • Inline: Passed in at runtime as a JSON string. The wordset is processed at runtime and applied.
  • Compiled: Passed in earlier and processed ahead of time. The compiled wordset is referenced and used at runtime to enhance the DLM.
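An inline wordset might be built and serialized as below. This is a sketch of the general shape only: the `CONTACTS` entity name and its values are hypothetical, and the entity is assumed to already exist in the DLM as a list entity; consult the Mix documentation for the exact wordset schema:

```python
import json

# Hypothetical list entity "CONTACTS", assumed to be defined in the DLM.
# The "spoken" values give rough phonetic spelling guidance.
wordset = {
    "CONTACTS": [
        {"literal": "Nguyen", "spoken": ["new when", "win"]},
        {"literal": "Aoife", "spoken": ["ee fa"]},
    ]
}

# An inline wordset is passed at runtime as a JSON string
inline_wordset = json.dumps(wordset)

assert json.loads(inline_wordset)["CONTACTS"][0]["literal"] == "Nguyen"
```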

There are two types of compiled wordsets:

  • App-level: General to all users of the application
  • User-level: Specific to a particular user specified by a user_id

Speaker profiles

Speaker profiles capture acoustic characteristics of speech specific to a particular user specified by a user_id.

A speaker profile is created and populated on the first recognition turn where a profile is requested for the user. Any subsequent recognition call requesting a speaker profile for the same user loads the profile and uses it as a recognition resource.

By default, speaker profiles are preserved in a data store for some time following a session for future use before being discarded.
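The lifecycle described above can be sketched conceptually as a keyed store with a retention period. This is not the Krypton implementation; the class, field names, and TTL value are illustrative assumptions:

```python
import time

class SpeakerProfileStore:
    """Conceptual sketch of the speaker profile lifecycle.
    Not Krypton internals: names and the retention period are illustrative."""

    def __init__(self, ttl_seconds=7 * 24 * 3600):
        self._profiles = {}  # user_id -> (profile_data, last_used_timestamp)
        self._ttl = ttl_seconds

    def get_or_create(self, user_id):
        # The first recognition turn for a user creates an empty profile;
        # later turns load and reuse it as a recognition resource.
        now = time.time()
        profile, _ = self._profiles.get(user_id, ({"adaptation_data": []}, now))
        self._profiles[user_id] = (profile, now)
        return profile

    def purge_expired(self):
        # Profiles unused past the retention period are discarded.
        now = time.time()
        self._profiles = {
            uid: (p, t) for uid, (p, t) in self._profiles.items()
            if now - t < self._ttl
        }

store = SpeakerProfileStore()
p1 = store.get_or_create("user-123")
p2 = store.get_or_create("user-123")
assert p1 is p2  # subsequent turns load the same profile
```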

Weighting resources

All of these resources stack together to contribute to generating recognition hypotheses. Weights can be applied to some resources, giving them more influence in the results than others, and the weight for each weightable resource can be adjusted on each recognition turn to tune performance.
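The effect of weighting can be illustrated conceptually as a weighted combination of per-resource scores for a hypothesis. This is purely a sketch of the idea: Krypton's actual interpolation is internal, and the resource names and scores below are made up:

```python
def combined_score(hypothesis_scores, weights):
    """Weighted combination of per-resource scores for one hypothesis.
    Conceptual only -- not the engine's actual interpolation."""
    return sum(weights.get(name, 0.0) * score
               for name, score in hypothesis_scores.items())

# Hypothetical scores one hypothesis receives from each resource
scores = {"base_lm": 0.40, "banking_dlm": 0.90, "wordset": 0.10}

# Giving the DLM more weight raises hypotheses the domain favors
light = combined_score(scores, {"base_lm": 0.7, "banking_dlm": 0.2, "wordset": 0.1})
heavy = combined_score(scores, {"base_lm": 0.4, "banking_dlm": 0.5, "wordset": 0.1})
assert heavy > light
```

Raising a DLM's weight helps on strongly domain-specific speech, but too high a weight can degrade recognition of general speech, which is why per-turn adjustment is useful.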