Glossary of terms

This topic describes common terms and acronyms used in Mix.

access token

An access token is a string generated by the OAuth 2.0 API that authorizes clients to communicate with Nuance engines in the Mix environment.

See also token.

anaphora

An anaphora is a generic word that is used to refer back to an entity mentioned earlier in the text or conversation, to avoid repetition. An anaphora could reference a person (him, her, them), a place (here, there), a thing (it, that), or a time (then, at that time).

In Mix.nlu, you can help your application determine to which entity an anaphora refers. You can:

  • Define how entities may be referred to, for example, whether they refer to a person (Contact, “him”) or a place (City, “there”).

  • Annotate samples that contain anaphoras, such as “Call him” or “Drive there”.

annotation

Annotation is the process of marking up a sample phrase or sentence to identify the entities (previously called concepts) and the intents that it contains.

Annotated samples are the main source of data used to train an NLU semantic model.

API

Application programming interface. A specification of routines, data structures, object classes, and protocols used to communicate with a software system or platform.

app ID

The app ID (or application ID) is a unique application string. It is used to reference the resources created and managed in Nuance Mix.

application

A Mix application defines a set of credentials that you use to access ASR, NLU, and Dialog resources. A Mix application can be deployed to multiple runtime environments (for example, sandbox, QA, production).

application configuration

An application configuration associates an app ID with the ASR, NLU, and Dialog resources deployed in a runtime environment.

To create an application configuration, you specify:

  • A context tag, which is a name identifying this application configuration.

  • The versions of the ASR, NLU, and Dialog resources to include in this application configuration.

You use the application configuration at runtime by providing the app ID and context tag to load resources.

ASR

Automatic speech recognition, or ASR, is the automated conversion of spoken words to interpretable text. The technology is also known as speech-to-text.

Nuance offers these ASR services:

  • Krypton performs real-time large vocabulary continuous speech recognition. In Mix, this product is known as ASR as a Service or ASRaaS.

  • Nuance Recognizer recognizes and interprets spoken and DTMF input. This recognition engine is grammar-based, meaning it requires grammars to interpret a caller’s utterance. Unlike Krypton, Nuance Recognizer offers both word recognition and meaning interpretation. In Mix, this product is known as NR as a Service or NRaaS.

bot

A bot is a Mix application with configurations that include dialog builds. Integrators use the DLGaaS API to interact as needed with bot applications.

Bots can take speech, text, or a natural language interpretation as input, perform reasoning in context using NLU, and produce synthesized speech, thus orchestrating the entire experience with an end user.

builtin

Builtins are predefined recognition resources based on slices of the language model in the underlying data pack. They are focused on common tasks (such as numbers and dates) or general information in a vertical domain such as financial services or healthcare.

Builtins are used in the Krypton recognition engine and are similar to predefined entities used in Mix and the NLU semantic understanding engine.

Nuance Recognizer includes builtin grammars, such as date, digits, number, phone, and time, which perform the same function.

channel

Medium through which a message is transmitted to its intended audience, for example, to customers. Channels range from the traditional mediums such as print, telephone, and broadcast (TV and radio), later video and email, and increasingly to digital channels of communication such as SMS, live chat, and chatbots or virtual assistants.

client ID

Nuance Mix uses the OAuth 2.0 protocol for authorization. All client applications must provide an access token to be able to access the ASR, NLU, Dialog, and TTS runtime services. To obtain an access token, a client ID and a client secret must be provided.

client secret

A client secret is one of the credentials required to access Mix services and resources. It is generated through Mix Dashboard and used to obtain an access token for authorization.
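The client ID and client secret are exchanged for an access token using the standard OAuth 2.0 client-credentials grant. The sketch below assembles such a request in Python; it is illustrative only — the token endpoint URL and scope names vary by deployment and are assumptions here, not documented Mix values:

```python
import base64
from urllib.parse import urlencode

def build_token_request(client_id: str, client_secret: str, scope: str):
    """Build the headers and body of an OAuth 2.0 client-credentials token request."""
    # Client credentials are sent as an HTTP Basic authorization header.
    creds = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    headers = {
        "Authorization": f"Basic {creds}",
        "Content-Type": "application/x-www-form-urlencoded",
    }
    body = urlencode({"grant_type": "client_credentials", "scope": scope})
    return headers, body

# Hypothetical credentials and scopes, for illustration only.
headers, body = build_token_request("my-client-id", "my-secret", "asr nlu dlg tts")
print(body)
```

The response to this POST (sent to the deployment's token endpoint) contains the access token to present to the runtime services.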

concept

See entity.

confidence score

Value assigned as a measure of the NLU engine’s confidence that it can correctly identify the intent of a sentence. The higher the score, the more likely it is that the result matches what the user said.

context tag

A context tag is a string that identifies an application configuration in Mix. It determines the project resources (and version) to use for the application.

  • For NLU, the semantic model or compiled wordset

  • For Dialog, the dialog application to use

  • For Krypton, an object such as a DLM, compiled wordset, or settings file

The context tag is specified as part of the URN. For more information, see URN format.

CPR

Concatenative prompt recording. Technique in which audio prompt recording files are designated to be concatenated (played consecutively, in a specified sequence) to form output that sounds natural for the specified language. Typically, this involves the recording of a voice talent saying particular word sequences or phrases that will later be spliced together.

Support for dynamic concatenated audio recordings is limited to these languages: English (Australia), English (United Kingdom), English (United States), French (Canada), German (Germany), Portuguese (Brazil), and Spanish (Latin America).

Available to projects with channels that support the Audio Script modality.

data pack

A data pack is a set of files containing language information for recognition and understanding.

Data packs are used by Krypton as the basis for speech recognition. They are also used in semantic understanding, by NLU and its core engine, QuickNLP. These data packs use neural network technology and include an acoustic model for recognizing speech, and a language model for interpreting speech and/or text.

data type

A data type specifies the type of content to be collected by an entity. Data types allow dialog designers to apply methods and formatting options appropriate to the data type in conditions and messages. Each data type has a number of compatible collection methods (also referred to as entity types), which specify how the entity data is collected.

dialog

Interaction between a user and a client application. A single unit of interaction or single transaction is often referred to as a dialog state.

dialog flow

Logical flow of the client application, including various dialog states, primary paths of informational exchanges, transaction outcomes, and decision logic.

In Mix.dialog a dialog flow comprises nodes that perform operations such as prompting the user, evaluating a response, retrieving information from a backend system, or transferring the user to a live agent for assistance.

disambiguation

Method used to clarify when the recognized item has more than one possible meaning.

domain language model (DLM)

A domain language model (domain LM or DLM) is an extension of the data pack’s language model that specializes the language model for a specific business environment, to improve word recognition.

Used by the Krypton recognition engine, DLMs identify the words and phrases most likely spoken by users of your application. You generate these specialized models from training data that is representative of your application. A DLM may also contain entities, or simple lists of specific terms.

DTMF

Dual-tone multi-frequency. Also known as touchtone. A two-tone signal representing the digits 0-9, *, and #. Each DTMF signal is composed of one tone from a high-frequency group of tones and a second tone from a low-frequency group.

In Mix you have the option to create projects that support DTMF as user input, for example, in IVR systems and other phone-based interfaces.
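The standard 12-key DTMF frequency pairs can be expressed as a small lookup table, for example:

```python
# Standard DTMF frequency pairs (Hz): each key combines one tone from the
# low-frequency group (rows) with one from the high-frequency group (columns).
LOW = [697, 770, 852, 941]
HIGH = [1209, 1336, 1477]
KEYPAD = ["123", "456", "789", "*0#"]

DTMF = {
    key: (LOW[row], HIGH[col])
    for row, keys in enumerate(KEYPAD)
    for col, key in enumerate(keys)
}

print(DTMF["5"])  # (770, 1336)
```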

dynamic list

An entity whose possible set of values is dynamic, depending on details only known at runtime. The values for a dynamic list entity are provided using a wordset.

engine pack

An engine pack is a set of ASR, NLU, TTS, and Dialog engine versions that ensure the applications you build are compatible with their deployment environment. Engine packs are relevant mainly to self-hosted customers. (Hosted customers have access to all functionality; that is, the latest engine pack, known as the “hosted” version.)

Self-hosted environments: When you create a project, you select the engine pack version that corresponds to the engines that you have installed in your self-hosted environment. This ensures that the resources generated for your project (ASR DLMs, NLU models, and Dialog models) are compatible with the engine versions you have installed. The engine pack version also determines the tooling features you can access. In the Mix tools, features introduced in a later engine pack version will not be available. This ensures that changes introduced in any hosted engines in the Mix runtime environment will not impact existing projects.

Self-hosted customers might need to “upgrade” to a new engine pack in order to use new features available to Mix.nlu and Mix.dialog. Features that require a specific engine pack version are identified in the documentation with the “Self-hosted environments” tag and the required engine pack version.

For more information, see Manage engine packs and data packs.

entity

An entity (previously referred to as a concept) is a collection of terms used by speakers or users of your application, for example, PLACES, NAMES, or DRINKS. Entities improve speech recognition and offer details about the sentence being interpreted.

In Mix.nlu you define entities in an ontology, and then annotate your sample data by labeling the tokens with entities. For example, if the intent is ORDER_DRINK, a relevant entity might be DRINK_TYPE. In the sample sentences “I’d like an iced vanilla latte” and “What’s in the caramel macchiato,” you might annotate the words “latte” and “macchiato” with the DRINK_TYPE entity.

While the intent is the overall meaning of a sentence, entities and values capture the meaning of individual words and phrases in that sentence.

See also predefined entity.

grammar

Words and word sequences the recognizer can recognize, and the interpretations for those utterances.

In Mix.nlu, predefined entities (previously called predefined concepts) are expressed as grammars; that is, as a set of rules defining all the ways of expressing items associated with a given entity, without having to enumerate them. For example, the predefined entity [CALENDARX] is a grammar for specifying dates and times. A list of such expressions (“July 5th”, “3rd of June”, “tomorrow”, “a week from Wednesday”, and so on) would be unwieldy, to say the least. Instead, a grammar provides a relatively compact way to accomplish the same thing.

gRPC

gRPC is an open source RPC (remote procedure call) software used to create services. It uses HTTP/2 for transport and Protocol Buffers version 3 (“proto3”) to define the structure of the application.

GrXML

Shorthand for the syntax for grammars defined in the XML format of the W3C Speech Recognition Grammar Specification. The current specification for GrXML is available on the Web at the W3C.

May also refer to the file extension for such grammars.

HTTP

The Hypertext Transfer Protocol (HTTP) is the underlying protocol used by the World Wide Web and defines how messages are formatted and transmitted, and the actions Web servers and browsers should take in response to various commands. Hypertext is structured text that uses logical links (hyperlinks) between nodes containing text.

HTTPS

HTTPS (also called HTTP over TLS, HTTP over SSL, and HTTP Secure) is a protocol for secure communication over a computer network that is widely used on the Internet.

intent

An intent defines and identifies an intended action in NLU. An utterance or query spoken by a user will express an intent, for example, to order a drink. As you develop an NLU model, you define intents based on what you expect your users to do in your application. You then link intents to functions or methods in your client application logic.

Intents are often associated with entities that specify particulars about the intended action.

interpretation

Representation of the meaning of a sentence. May also refer to the recognition of an utterance from text rather than audio.

IVR

General-purpose system for developing and deploying telephony applications that perform automated operations and transactions to callers primarily via voice and DTMF input.

Nuance Speech Suite provides a conversational IVR experience.

Krypton

Krypton is an ASR engine that offers enterprise-grade, realtime large vocabulary continuous speech recognition. The Krypton engine converts an audio stream of human speech into text by recognizing the speech and transcribing it into text.

Krypton supports DLMs among other forms of specialization, allowing it to understand terms specific to a field of work or application.

In Mix, Krypton is known as ASR, ASR as a Service, ASRaaS, or Mix.asr.

language model

A language model is part of a data pack, providing a statistical or neural model for the syntax of language constructions.

Language models are used by the Krypton ASR engine and the NLU semantic engine. An ASR language model can be extended and specialized with optional DLMs and wordsets. In NLU the base language model is always extended with an NLU semantic model.

language topic

A language topic describes a type of ASR core data pack. Each topic provides a specialized, yet still general, knowledge of a domain (such as the banking domain or the retail domain). The topic customizes the language model within the data pack that is used for recognition.

One or more language topics may be available to an organization.

literal

A literal is the range of tokens in a user’s query that corresponds to a certain entity (or concept). The literal is the exact text spoken or written by the user. For example, in the query “I’d like a large t-shirt,” the literal corresponding to the entity [TSHIRT_SIZE] is “large”. Other literals might be “small”, “medium”, “big”, “very big”, and “extra large”.

When you annotate samples, you select a range of text to tag with an entity. Literals can be paired with canonical values. For example, “small”, “medium”, and “large” can be paired with the values “S”, “M”, and “L”, respectively. Multiple literals can have the same value, which makes it easy to map different ways a user might express an entity into a single common form. For example, “large”, “big”, “very big” might all be given the same value: “L”.

In addition, if your NLU model has a list entity, it isn’t necessary to define all of the literals for that entity. The NLU model will infer literals for the entity that are not in the list. Inferred literals will not have values returned, only the literal itself.
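The literal-to-value pairing described above amounts to a lookup in which several surface forms map to one canonical value. A minimal sketch (the TSHIRT_SIZE pairs are illustrative, taken from the example above):

```python
# Literal-to-value pairs for the illustrative TSHIRT_SIZE entity: multiple
# literals ("large", "big", "very big") share one canonical value ("L").
TSHIRT_SIZE = {
    "small": "S",
    "medium": "M",
    "large": "L",
    "big": "L",
    "very big": "L",
    "extra large": "XL",
}

def canonical(literal: str):
    """Return the canonical value for a literal, or None if it is unknown."""
    return TSHIRT_SIZE.get(literal.lower())

print(canonical("very big"))  # L
```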

modality

A modality specifies a format used to exchange information with the user, such as TTS, audio, DTMF, and text.

When you create a project in Mix, you choose input and output modalities to support for each communication channel. These modalities determine the options that are available for a channel in Mix.dialog. For example, if your project has a channel that supports interactivity, you can define interactive elements such as buttons and clickable links in the Interactivity section of question and answer nodes.

natural language understanding

Speech recognition techniques that permit a user to answer a prompt with a full phrase or sentence, as in everyday conversation. Typically, natural speech is longer in duration and has a broad range of possible meanings. A grammar (or model) capable of natural language understanding must accept a wide variety of different phrases.

neural TTS

Neural TTS, or Neural TTSaaS, is a text-to-speech engine that synthesizes speech using neural Microsoft voices. Neural TTS combines NVC and the Microsoft text-to-speech backend.

NLE

Natural Language Engine (NLE) is Nuance’s enterprise-grade text-to-meaning engine, or semantic engine, and the principal component of Mix’s NLU service. NLE provides ontology-based semantic processing: it takes a token sequence as input and identifies the intents and meanings expressed in the human-to-machine turn. The outcome from NLE is typically used to drive the next machine-to-human turn.

NLU

See natural language understanding.

NLU model

An NLU model is a supplementary language model for natural language understanding.

An NLU model consists of entity grammars, grammars inferred from a training corpus, and trained classifiers. It builds on a base language model to interpret vocabulary, intents, and entities specific to a particular application domain. Also called a semantic model.

node

A dialog flow comprises nodes that perform operations such as prompting the user, evaluating a response, retrieving information from a database, or transferring the user to a live agent for assistance.

Mix.dialog provides several types of nodes that each perform a specific kind of operation. For example, Start, Question & Answer, Message, or Decision. For more information, see Node types.

NR

NR, or Nuance Recognizer, is an ASR and NLU engine for grammar-based, constrained vocabularies (“Please say yes or no.” or “What is the destination city?”), and for statistical models that allow natural speech in response to open-ended questions (“Hello. How can I help you today?”). Nuance Recognizer offers both word recognition and meaning interpretation.

In Mix, this product is known as NR as a Service, or NRaaS.

NVC

Nuance Vocalizer for Cloud, or NVC, is a Nuance text-to-speech engine that powers Mix’s TTS services to synthesize speech from text. NVC works with two backends to produce two services: TTSaaS uses the NVE backend, and Neural TTS uses the Microsoft text-to-speech backend with neural voices.

NVE

Nuance Vocalizer for Enterprise, or NVE, is a text-to-speech backend for NVC or Mix’s TTS service.

ontology

An ontology is a formal specification of how words and language structures are related to meanings, typically within some specific context.

In Mix.nlu, the ontology for your model comprises intents, entities (also referred to as concepts), information about relationships (for example, between intents and entities or between entities and other entities), and grammars associated with such information. The ontology is the central schema for organizing your model and its sample data. The intents, entities, and associations between them are all stored in an ontology.

organization

An organization in Mix is the parent entity that contains members, projects, applications, environments, deployment flows, and associated credentials.

Users are members of two or more organizations: the user’s personal organization, specified with the user’s email address, and a standard organization, which groups users with a common email domain.

An organization has members, and members have roles. The different organization types (personal and standard), as well as user roles, determine the permissions that are available for projects, applications, tools (such as Mix.dialog and Mix.nlu), and so on. Some functionality is controlled at the organization level, such as access to prebuilt domains, the voice packs and language topics available, and more.

predefined entity

Predefined entities save you the trouble of defining entities that are common to many applications, such as monetary amounts, Boolean values, calendar items (dates, times, or both), cardinal and ordinal numbers.

Predefined entities are defined by the current QuickNLP (QNLP) data pack.

Predefined entities are similar to Krypton builtins.

For more information, see Predefined entities.

recognition

Recognition is the process of identifying spoken language, also known as automatic speech recognition (ASR) or speech-to-text.

With the help of data packs and optional recognition resources, Krypton offers word recognition only, while Nuance Recognizer combines word recognition with semantic interpretation.

regex

Regex, or regular expression, is a sequence of characters representing a search pattern.

Mix supports regex-based entities, which define a set of values using regular expressions. Some regex-based entities might be account numbers, postal/zip codes, order numbers, and other pattern-based formats.
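As a sketch, a hypothetical order-number entity whose values consist of three uppercase letters, a dash, and six digits could be defined with a pattern like this (the format is invented for illustration):

```python
import re

# Hypothetical pattern for an order-number entity: three uppercase letters,
# a dash, then six digits, matched as a whole word.
ORDER_NUMBER = re.compile(r"\b[A-Z]{3}-\d{6}\b")

text = "Please check order ABC-123456 and order XYZ-654321."
print(ORDER_NUMBER.findall(text))  # ['ABC-123456', 'XYZ-654321']
```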

resource

Resources are objects available to your application. Resources that you build using the Mix platform include:

  • DLMs for Mix.asr
  • NLU models for Mix.nlu
  • Conversational applications for Mix.dialog

You can add additional resources to improve the operation of an engine. For example:

  • Krypton uses recognition resources: DLMs, wordsets, custom pronunciations, builtins, and speaker profiles.

  • NVC uses synthesis resources: user dictionaries, ActivePrompt databases, rulesets, and audio files.

  • NLU uses semantic resources: the mandatory semantic (NLU) model and optional wordsets.

sample

A sample phrase or sentence that you add to your NLU model. After you annotate samples with intents and entities (concepts), the model is trained to learn the annotations. You can exclude samples that you haven’t yet finished annotating (“Exclude from model”) so that they’re not used for training the NLU model.

score

Value assigned as a measure of the NLU engine’s confidence that it can correctly identify the intent of a sentence. The higher the score, the more likely it is that the result matches what the user said.

Secure Sockets Layer

Secure Sockets Layer (SSL) provides a secure channel between two machines or devices operating over the Internet or an internal network.

SSL was the most widely deployed cryptographic protocol for securing Internet communications before it was superseded by TLS (Transport Layer Security).

selector

In the Dialog as a Service API, a selector helps identify the channel and language to use for each interaction.

For more information, see Languages, channels, and modalities.

semantic interpretation

Semantic interpretation allows utterances to be interpreted into structured objects that can be understood by an application.

semantic model

See NLU model.

sensitive data

Personally identifiable information (PII) that could be used to identify a specific individual.

session

A session is a complete, continual interaction between a user (such as a speaker) and a client application, or between the client application and an engine.

In telephony environments, the session can also refer to the duration of a call.

session ID

Unique identifier for each interaction between a speaker and a dialog application. Typically generated at the start of a session.

speaker adaptation

Speaker adaptation is a technique that adapts and improves speech recognition based on qualities of the speaker and channel. The best results are achieved by updating the models in real time based on the immediate utterance.

Krypton maintains adaptation data per caller as speaker profiles.

SSL

See Secure Sockets Layer.

SSML

Speech Synthesis Markup Language is an XML-based markup language for speech synthesis applications. SSML gives authors of synthesizable content a standard way to control aspects of speech output, such as pronunciation, volume, pitch, and rate, across different synthesis-capable platforms.

SSML is a recommendation of the W3C’s voice-browser working group.

TLS

Transport Layer Security. Cryptographic protocol that provides endpoint authentication and communications confidentiality over networks such as the Internet. TLS and its predecessor, Secure Sockets Layer (SSL), encrypt the segments of network connections at the Transport Layer end-to-end.

token

A token is a unit in text made up of characters. Individual words are tokens, as are intent and entity (concept) labels. Consider the following string: “I’d like an [HOT_COLD_TYPE]iced[/] [FLAVOR]vanilla[/] [DRINK_TYPE]latte[/].” Words like “I’d,” “like,” and “an” are tokens, as are the entity labels [HOT_COLD_TYPE], [FLAVOR], and [DRINK_TYPE].

A token can also refer to an access token generated by OAuth, used to communicate with a Nuance engine in the Mix environment.
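The bracketed markup in the example above is for exposition only, not a formal Mix format, but parsing it illustrates how entity labels attach to spans of tokens. A minimal sketch:

```python
import re

# Parse the illustrative [LABEL]literal[/] markup used in the example above.
PATTERN = re.compile(r"\[(\w+)\](.*?)\[/\]")

sample = "I'd like an [HOT_COLD_TYPE]iced[/] [FLAVOR]vanilla[/] [DRINK_TYPE]latte[/]."
entities = {label: literal for label, literal in PATTERN.findall(sample)}
print(entities)
```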

tokenization

Process in which a sequence of strings is broken up into individual words, keywords, phrases, symbols, and other elements called tokens. Tokens can be individual words or phrases.
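A minimal whitespace-and-punctuation tokenizer illustrates the idea; real engines use more sophisticated, language-aware tokenization:

```python
import re

def tokenize(text: str):
    """Split text into simple word tokens, keeping internal apostrophes."""
    return re.findall(r"[\w']+", text)

print(tokenize("I'd like an iced vanilla latte."))
```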

training

Training is the process of building a model based on the data that you have provided. Developing a model is an iterative process that includes multiple training passes.

In Krypton, training also refers to the process of compiling wordsets.

TRSX

TRSX, or Training Set XML, is a specification defined, owned, and maintained by Nuance that enables developers to manage an entire model in a single file outside of Mix and import the model into a Mix.nlu project. You can also manage training data in separate TRSX files and import them individually.

TTS

Text-to-speech, or TTS, is the technology of synthesizing audible speech from text.

Nuance offers two TTS engines: Nuance Vocalizer for Cloud, or NVC, also known as TTS or TTSaaS in the Mix context, and Neural TTS, also known as Neural TTSaaS in Mix.

turn

A system prompt followed by a user’s response. In telephony environments, the response can be a hangup, silence, or even noise that triggers recognition; a recognition can exit after detecting speech, DTMF, a hangup, or after a timeout.

Most transactional applications require multiple turns. For example, trading a stock or paying a credit card. Simpler applications require one or two turns. The more turns in a dialog, the more complex it tends to be to design.

URI

Uniform Resource Identifier. Generic term for all types of names and addresses that refer to objects on the Web.

URL

Uniform Resource Locator. Global address of documents and other resources on the Web. A URL is a kind of URI.

URN

Uniform Resource Name. A name that identifies a resource on the Internet. Unlike URLs, which use network addresses (domain, directory path, filename), URNs use regular words that are protocol- and location-independent.

Uniform Resource Names (URNs) are used in Mix to load a specific Mix resource, described in the application configuration. A URN helps the service determine how to parse the resources in a context tag.

See URN format.

utterance

Distinct chunk of caller speech, usually in response to a prompt, that is recognized using a specific active grammar. An utterance is referred to colloquially as an “utt.”

vocabulary

Set of words that can be understood as a part of an application. For example, both “cents” and “dollars” are in the vocabulary of the currency builtin, but these terms can only be said in particular locations in the phrase.

voice pack

A voice pack is the basic resource used by NVC to enable a text-to-speech persona. Each voice pack provides a male, female, or neutral voice for a particular locale (for example, Canadian French, American English, or Australian English).

W3C

The World Wide Web Consortium is the principal body that defines international standards for the World Wide Web (WWW).

webhook

Webhooks are POST requests sent to a web application when a specific event is triggered. For example, you can configure a webhook that will send a notification when a new application configuration tag is created or an application configuration is deployed.

For more information about using webhooks in Mix, see Configure webhooks.
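A webhook receiver is simply an HTTP endpoint that accepts POST requests. The sketch below uses Python’s standard library; the payload shape (a JSON body with a `type` field) is an assumption for illustration, not a documented Mix format:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def parse_event(body: bytes) -> str:
    """Extract an event type from a JSON payload; the field name is an assumption."""
    event = json.loads(body or b"{}")
    return event.get("type", "unknown")

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        kind = parse_event(self.rfile.read(length))
        self.log_message("webhook event: %s", kind)
        self.send_response(204)  # acknowledge receipt with an empty body
        self.end_headers()

# To listen locally:
#   HTTPServer(("localhost", 8080), WebhookHandler).serve_forever()
```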

WebSocket

WebSocket is a communications protocol that provides full-duplex communications channels over a single TCP connection.

weight

Weight values set the relative importance when processing speech input between the base language model in the data pack and specialization objects such as DLMs, builtins, and wordsets.

Weights apply to speech recognition only and have no impact on meaning extraction, so they are relevant to the Krypton recognition engine, but not to the NLU semantic understanding engine. Nuance Recognizer does not use weighted resources.

wordset

A wordset is a set of words and short phrases that customize the vocabulary by providing additional values.

In the Krypton recognition engine, wordsets provide additional terms for recognition, such as user names or local place names. Wordsets extend entities in a DLM.

In the NLU semantic engine, wordsets provide additional values for dynamic list entities in a semantic model.
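A wordset is typically supplied as JSON that pairs literals with canonical values for a dynamic list entity. The sketch below is illustrative only — the entity name and the exact JSON schema expected by each engine are assumptions; check the Mix documentation for the authoritative format:

```python
import json

# Illustrative wordset for a hypothetical CONTACTS dynamic list entity:
# each item pairs a spoken/written literal with a canonical value.
wordset = {
    "CONTACTS": [
        {"literal": "Jean Dupont", "value": "jdupont"},
        {"literal": "Ana García", "value": "agarcia"},
    ]
}

payload = json.dumps(wordset, ensure_ascii=False)
print(payload)
```

At runtime, a payload like this would accompany the request so the engine can recognize and interpret values known only at that moment.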