Speech terms and abbreviations
-
Call in which the caller hung up before making a reasonable attempt to complete the task.
-
Augmented Backus-Naur Form. One of the two acceptable formats for W3C grammars. Nuance does not use grammars of this format, but does include a utility for converting them into the XML format—abnf2xml.
-
Automatic Call Distribution. System that routes incoming calls to a group of agents with particular capabilities or availability.
-
Feature that analyzes task-specific data like recorded utterances and recognition results, and adapts acoustic models accordingly.
-
Also referred to as a speech model. Statistical model that allows the system to translate utterances into phonetic representations of speech. Derived by speech scientists from analyzing many different people speaking well-defined words and phrases.
-
Grammar (or set of concurrent grammars) being used for the current recognition.
-
A reference that maps a sentence, phrase, or word to a recording, so Vocalizer uses a recording when speaking that text. Those references can be explicit—using the ActivePrompt name—or they can be implicit, where the engine automatically searches the ActivePrompt database for each synthesis request, substituting ActivePrompts whenever the normalized input text matches an ActivePrompt's normalized input text and boundary constraints. ActivePrompt databases are created by Vocalizer Studio.
-
Dialog Module (application building block) that incorporates additional natural language capability, including the robust parsing of SmartListener, dialog shortcuts, and one-step correction techniques.
-
Sound that involves a non-nasal oral closure that develops into a fricative. In phonological terms, you can represent affricates as a single phoneme or a series of two phonemes. Example: /tS/.
-
Automatic Gain Control. Adaptive system that adjusts gain based on input levels to provide more consistent output levels.
-
Event generated by a Management Station service when an error condition occurs. Alarms can be minor, major, or critical.
-
The phones that together form a certain phoneme. For a given phoneme, the actual sounds that are produced may vary. This depends on phoneme context, a speaker’s character, accent, social class, and so on. Example: The phoneme /l/ represents two allophones, the clear l /l/, which occurs before a vowel sound or /j/ (for example, like /laIk/), and the dark l /5/, which occurs elsewhere (for example, milk /mI5k/).
-
The application linguistic model (ALM) is consumed by the Nuance Text Processing Engine. It is generated by Nuance Experience Studio and packaged within the semantic model, and includes the extra vocabulary and text transformation rules specific to the customer’s application domain.
-
Case in a voice application where a recognized utterance maps to more than one natural language result in the current grammar.
-
Automatic Number Identification. Telephony feature that provides the called party with the caller ID of the calling party.
-
Application program interface (API). Specification of routines, data structures, object classes, and protocols for programmers to communicate with a software system.
-
Speech-enabled program that callers interact with. For the enterprise market, the application is typically written in VoiceXML and runs on a voice browser.
-
The application linguistic model (ALM) is consumed by the Nuance Text Processing Engine. It is generated by Nuance Experience Studio and packaged within the semantic model, and includes the extra vocabulary and text transformation rules specific to the customer’s application domain.
-
The artifacts required by Nuance Recognizer and Dragon Voice applications are different. Currently, the artifacts (models) required by Dragon Voice are produced only via Nuance Experience Studio. Contact Nuance about obtaining access to Experience Studio. For information on the models used by Nuance Recognizer, see the Recognizer grammar guide (Reference section).
-
Application Service Provider. A business that provides computer-based services to customers over a network. With regard to speech services, the ASP hosts applications for companies (customers), and the applications receive telephone calls from users (customers of the customers). May also refer to Active Server Page: Web pages created dynamically using HTML, scripts, and reusable ActiveX components.
-
Automatic Speech Recognition. Conversion of spoken words to interpretable text. Nuance Recognizer and Krypton are ASR engines.
-
What happens when the caller interrupts a prompt. This feature can be turned on to let the caller speak without waiting for direction, or off to make sure the entire prompt plays.
-
Telephony call transfer method in which a caller calls an application, the application connects the caller to a different number, then disconnects from the call. There is no feedback on the success or failure of the operation.
-
Expression that evaluates to one of two logical values: TRUE or FALSE (1 or 0). The expression can be a parameter, variable, or ECMAScript expression.
-
Basic Rate Interface. Telephony configuration in the physical layer as defined for ISDN networks. BRI provides two “bearer” channels for communication, and one “delta” channel for voice or user data.
-
Call transfer method in which the application transfers a caller to a third party, while staying connected to both lines. The application can also monitor the call.
-
Logical port within a voice browser service instance that communicates with corresponding telephony session service and Speech Server ports. Each browser instance handles one call.
-
Type of grammar that is provided directly by the platform. Nuance provides the built-in grammar types defined in the VoiceXML specification, as well as some standard universal command grammars and additional Nuance types for different languages.
-
Correct Acceptance Rate. Percentage of in-grammar utterances that were recognized and accepted correctly.
-
Logical flow of a voice application, including various dialog states, primary paths of informational exchanges, transaction outcomes, and decision logic.
-
Text file that records activity performed during a single call session. The log contains events generated by the application as well as by other components such as Nuance Recognizer, Nuance Vocalizer for Network, and so on.
-
Human who is talking to the speech recognition system. (“Caller” is used even when the system initiates an outbound call.)
-
Events logged by an application during the execution of a call.
-
Recognition rate observed by the caller, taking into account out-of-grammar utterances. It is calculated by subtracting the out-of-grammar rate from 100%, then applying the in-grammar accuracy to the remaining percentage. For example, with a 16% OOG and 95% accuracy, the CPA is (100-16%) * 95% = 79.8%.
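As an illustrative sketch (the function name is mine, not part of any Nuance API), the CPA arithmetic in the example above can be written as:

```python
def caller_perceived_accuracy(oog_rate: float, in_grammar_accuracy: float) -> float:
    """CPA = (1 - OOG rate) * in-grammar accuracy."""
    return (1.0 - oog_rate) * in_grammar_accuracy

# 16% OOG with 95% in-grammar accuracy -> 79.8% CPA
print(round(caller_perceived_accuracy(0.16, 0.95), 3))  # 0.798
```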
-
Time between when the caller stops speaking and when the recognizer returns a response.
-
Channel Associated Signaling. Telephony signaling method that uses routing information to send the payload to its destination. Also known as Per-Trunk Signaling (PTS). With CAS, the routing information is transmitted in the same channel as the payload itself.
-
Resources required to handle a single call. A system with 24 channels, for example, can handle 24 simultaneous calls.
-
SLM to which you cannot add new words.
-
Two or more hosts running the entire set of speech services, with each host configured to perform a specific role. Each cluster includes a primary Management Station.
-
Utterance spoken by the caller in response to a prompt.
-
System for expressing phonemes of the IPA (International Phonetic Alphabet) in notation that can be typed on a standard ASCII keyboard. SAMPA is one example.
-
Value assigned to each recognition result to indicate how closely the caller speech matches the recognition result. The higher the score, the more likely it is that the recognition result matches what the caller said. The confidence score is compared to a confidence threshold to determine the application’s behavior.
-
Value of the confidence score below which the behavior of the application changes. Usually expressed as a number between 0 and 999. For example, an application can use a threshold of 200 to decide when to reprompt. There can be two thresholds, the high confidence threshold, below which to confirm, and the low confidence threshold, below which to reject. Adjusting the confidence threshold trades off between false acceptances and false rejections.
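A minimal sketch of the two-threshold behavior described above; the threshold values 600 and 200 are hypothetical placeholders that a real application would tune:

```python
def decide(score: int, high: int = 600, low: int = 200) -> str:
    """Map a confidence score (0-999) to an application action.

    Thresholds here are illustrative, not recommended values.
    """
    if score >= high:
        return "accept"
    if score >= low:
        return "confirm"   # ask the caller to confirm the result
    return "reject"        # reprompt the caller

print(decide(750), decide(400), decide(120))
```

Raising `high` or `low` trades false acceptances for false rejections, as the definition notes.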
-
Asking the caller to indicate whether the system correctly interpreted earlier utterances.
-
Sound produced with a central obstruction in the vocal tract. According to the form of obstruction one distinguishes between different categories of consonants such as fricatives, laterals, nasals, and so on. Examples: /l/, /m/, /f/, /s/, and so on.
-
Call transfer method where a caller calls the application and the application attempts to connect the caller to the called party. If the attempt fails, the application continues interacting with the caller (the VoiceXML document interpretation proceeds). If the attempt succeeds, the application completes the transfer and disconnects from the call. You can perform a consultation transfer with or without far-end dialog.
-
Segments of caller speech that precede or follow a specific word or phrase. In speech recognition, the context surrounding a word is often significant in determining its meaning.
-
File that specifies parameter values to be used during recognition at particular stages of a voice application.
-
Words, phrases, and sentences spoken fluidly without overt pauses between each word. The technology that recognizes continuous speech is more advanced than a discrete speech recognition system (which does require pauses), but requires more tuning to improve accuracy. Nuance Recognizer and Krypton are continuous-speech engines.
-
(CA) A recognition in which an utterance is correctly recognized and accepted.
-
(CR) A recognition in which an utterance is correctly rejected as being out-of-grammar.
-
Caller Perceived Accuracy. Recognition rate observed by the caller, taking into account out-of-grammar utterances. It is calculated by subtracting the out-of-grammar rate from 100%, then applying the in-grammar accuracy to the remaining percentage. For example, with a 16% OOG and 95% accuracy, the CPA is (100-16%) * 95% = 79.8%.
-
Concatenative Prompt Recording
-
Central Processing Unit. Part of a computer that fetches, decodes, and executes instructions; attached directly to the memory. The CPU contains an arithmetic logic unit, which performs calculations and logical operations, and a control unit, which decodes and executes instructions.
-
Correct Rejection Rate. Percentage of out-of-grammar utterances that were rejected correctly.
-
A CSV is a comma-separated values file, which stores tabular data in plain text (using commas to separate the values in each field of data). This file format type is commonly used to transfer information from one software application or system to another.
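A short round-trip through the CSV format using Python's standard library (the column names are invented for illustration):

```python
import csv
import io

# Write a small table to CSV text, then read it back.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerows([["term", "score"], ["yes", 870], ["no", 645]])

rows = list(csv.reader(io.StringIO(buf.getvalue())))
print(rows)  # note: CSV is plain text, so values come back as strings
```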
-
A data pack is a set of data files that configure the Krypton recognition engine and the Nuance Text Processing Engine for a particular language. The data pack consists of an acoustic model, language model, parameter files, and other configuration files.
-
Database events may be logged by applications when performing back-end database transactions. A database event is logged as a pair of events: begin-query and end-query. These events are logged by applications, not platforms.
-
A data pack is a set of data files that configure the Krypton recognition engine and the Nuance Text Processing Engine for a particular language. The data pack consists of an acoustic model, language model, parameter files, and other configuration files.
-
decibel(s). Unit for measuring relative power ratios in terms of gain or loss.
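As a sketch, a power ratio is converted to decibels with 10 * log10(P_out / P_in):

```python
import math

def db_gain(p_out: float, p_in: float) -> float:
    """Express a power ratio in decibels."""
    return 10.0 * math.log10(p_out / p_in)

print(round(db_gain(2.0, 1.0), 2))  # doubling power is about +3 dB
print(db_gain(1.0, 10.0))           # a tenfold power loss is -10 dB
```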
-
Speech density. Percentage of time during a call occupied by the caller’s speech (as opposed to prompts or database queries, and so on). Sometimes called the recognizer duty cycle. Since speech uses recognition resources, applications with high speech density tend to require more resources.
-
Text file containing detailed information about the internal functions of Nuance speech components. This information can include entries at various levels of urgency ranging from alarms to normal status. They can also include informational messages generated by a service or application. Diagnostic logs are separate from the call logs, which contain application-, session-, and call-specific information.
-
Combination of international prefix and regional codes that must be dialed to reach another point in the network.
-
Interaction between a caller and a voice application. A single unit of interaction or single transaction is often referred to as a dialog state.
-
Application building block that provides sharable, common speech recognition tasks. For example, Dialog Modules can understand the formats of dates, monetary units, zip codes, phone numbers, and they automatically ask callers to confirm or repeat information when the confidence score is low.
-
The active grammar and other recognition parameters that are in effect for a particular Dialog Module role. For example, the confirmation consists of a grammar that understands “yes”, “no”, and synonyms.
-
Type of information being recognized—as related to the Dialog Module call flow: collection, confirmation, fallback, or disambiguation.
-
Exit status of a Dialog Module.
-
Direct Inward Dialing. Service that allocates a block of telephone numbers for calling into a private branch exchange (PBX) system without having a physical telephone line for each number. The PBX automatically switches incoming calls for a given phone number to the appropriate workstation.
-
Vowel with a single noticeable change in quality during a syllable. Diphthongs are usually subdivided into descending (falling) and ascending (rising) diphthongs depending on which of the two elements is stressed. In descending diphthongs the first element is more sonorous (as in all English diphthongs), whereas in ascending diphthongs the second element is more sonorous. Example: /aI/ as in time (descending diphthong).
-
Dialog that requests one piece of information at a time in a particular order.
-
Method used to clarify when the recognized item has more than one possible meaning.
-
Speech in which there are distinct silences between words. Discrete speech renders speech recognition relatively simple, but is stilted and unnatural.
-
A Speech Suite installation with different components on different hosts in a computer network.
-
Dialed Number Identification Service. Telephony feature that identifies the called number. This information can be used to route the call to the appropriate service based on the number that was called.
-
The Krypton recognition engine uses domain language models to identify the words and phrases most likely spoken by users of your application. Domain language models are overlaid on the factory or base data pack Krypton initializes on startup, to provide a vocabulary for the application.
-
Nuance Speech Suite components responsible for bringing virtual-assistant style conversation to the IVR experience. Includes the Krypton recognition engine, Natural Language Engine, and the Nuance Text Processing Engine, with the Natural Language Processing service responsible for managing the requests for Dragon Voice recognition and interpretation processing.
-
Digital Signal Processor. Microprocessor optimized for the large number of mathematical calculations needed for digital signaling.
-
Digital Telephony Interface. Circuit board that provides the necessary voice and fax resources that utilize a T1/ISDN connection.
-
Dual-Tone Multi-Frequency. Also known as touchtone. Two-tone signal representing the digits 0-9, *, and #. Each DTMF signal is composed of one tone from a high-frequency group of tones, and a second tone from a low-frequency group.
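The standard DTMF tone pairs can be tabulated directly; each key combines one low-group and one high-group frequency (values in Hz):

```python
# Rows select the low-group tone, columns the high-group tone.
LOW = [697, 770, 852, 941]
HIGH = [1209, 1336, 1477, 1633]
KEYS = ["123A", "456B", "789C", "*0#D"]

DTMF = {key: (LOW[r], HIGH[c])
        for r, row in enumerate(KEYS)
        for c, key in enumerate(row)}

print(DTMF["5"])  # (770, 1336)
```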
-
Feature allowing power users to enter a sequence of DTMF tones to navigate directly through a sequence of menus in a voice application without waiting to hear the prompts for each menu.
-
Grammar that can be dynamically created and modified by an application at runtime. Dynamic grammars enable applications to recognize items that cannot be known in advance. Dynamic grammars can be a W3C XML or a compiled grammar file (*.gram) referenced using external rule references. Nuance Recognizer uses dynamic grammars, whereas the Krypton recognition engine and Natural Language Engine use wordsets to customize the vocabulary at runtime.
-
European digital transmission format, which transmits at a rate of 2.048 Mbps; equivalent to the North American T1. Each signal can carry 32 channels of 64 Kbps each. E1 and T1 lines can be interconnected for international use.
-
Landmark sounds identifying place and functionality to the caller. An “icon” that is heard instead of seen.
-
Call transfer method that allows one application to listen to audio being recorded by another application.
-
Reflection of transmitted signal energy back to the sender. The echo of prompts played by the application to callers can trigger false barge-ins, causing degraded recognition performance.
-
When a prompt is played, the circuitry within the telephone network may echo it back to the recognizer. An echo cancellation algorithm softens the echo so that the recognizer does not treat the prompt as speech from the caller when the caller barges-in during the prompt.
-
Industry-standard scripting language, standardized as ECMA-262; JavaScript is its best-known dialect.
-
Extensible MultiModal Annotation markup language. XML-based representation of the semantic meaning of recognition results. The semantic meaning can often be more useful to applications than literal results. The language was developed by the World Wide Web Consortium (W3C). It replaces NLSML.
-
Point in time when a caller stops talking. It is important to detect the end-of-speech quickly so the caller does not experience a delay in response time. This detection process is referred to as endpointing.
-
Process of detecting when the caller stops talking. Detecting the end-of-speech immediately is important so the caller does not experience a delay in response time. This detection process is referred to as endpointing.
-
Used to describe both initial speech detection and end-of-speech detection.
-
Process of detecting the start and end of speech by distinguishing leading or trailing background noise/silence from an utterance before sending it to the recognizer.
-
An engine is a program or component that performs a core or essential function for other programs or components. For example, the Krypton recognition engine performs realtime large-vocabulary continuous recognition, the Natural Language Engine provides ontology-/concept-based semantic processing, and the Nuance Text Processing Engine provides tokenization and normalization of text. Each of these engines plays a crucial role in the Speech Suite system.
-
Message generated by a service and stored in a call or diagnostic log. These logs can be collected by Management Station and can be analyzed using various tools. Also referred to as "alarm."
-
Rule in a grammar that is expressed by the contents of another grammar on the file system or network.
-
Separate speech detector that runs as a pre-processor to the internal speech detector provided with Nuance Recognizer. Nuance Speech Server provides an external speech detector that establishes the beginning of speech before the utterance is forwarded to the recognizer for recognition.
-
False Acceptance (FA) rate. Percentage of utterances that were accepted incorrectly; that is, the recognizer returned an n-best but should not have. For tuning purposes, raising the confidence threshold reduces the false acceptance rate, but increases the false rejection rate.
-
Call flow strategy for collecting information after initial attempts fail. A typical fallback method is spelling. A fallback can be followed by a confirmation.
-
A recognition in which an utterance is accepted after an incorrect recognition. Such an utterance may be an out-of-grammar utterance that was wrongly recognized as in-grammar, or an in-grammar utterance that was incorrectly recognized as a different in-grammar utterance.
-
A recognition in which an utterance is incorrectly rejected as being out-of-grammar when it is in fact in-grammar.
-
The caller doesn’t know the option, but can describe the problem: “I have a strange charge on my credit card statement.” This type is addressed by SpeakFreely technology.
-
Interaction with the third-party to determine whether the transfer can be completed, such as in collect call applications. Far-end dialogs are supported by bridged transfers and consultation transfers.
-
Front-end processing that performs the spectral analysis of the audio data.
-
Service responsible for moving log files and other data from managed hosts to the Management Station host.
-
Speech process in which a voiced consonant at the end of a word is pronounced as its voiceless counterpart. Final devoicing (Auslautverhärtung) occurs in German and Dutch. Example: Hund /hUnt/ as opposed to Hunde /hUnd@/.
-
Behavior model composed of a finite number of states, transitions between those states, and actions, similar to a flow graph in which one can inspect how the logic runs when certain conditions are met. An FSM begins in one of the states (the start state), moves through transitions to other states depending on its input, and can end in any state; however, only a designated set of states (the accept states) mark a successful flow of operation. Finite state machines can solve a large number of problems; in linguistics, they are used to describe the grammars of natural languages.
-
Nuance Recognizer capability that allows callers to provide longer, unstructured input. Typically this means grammars of more than 100k words and utterances of more than 10 seconds. “Fluent speech” uses technologies including very large statistical language models (using interpolated LMs), high-accuracy speech recognition, fast-match (the large list of words is filtered by an initial “fast” search), and tight control over memory and CPU usage for long duration utterances.
-
False Rejection (FR) rate. Percentage of in-grammar utterances that were rejected but should not have been. For tuning purposes, lowering the confidence threshold reduces the false rejection rate, but increases the false acceptance rate.
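The FA and FR rates above are simple ratios over labeled tuning data; a sketch with invented counts (8 of 200 out-of-grammar utterances wrongly accepted, 30 of 1000 in-grammar utterances wrongly rejected):

```python
def fa_rate(accepted_oog: int, total_oog: int) -> float:
    """False acceptance: out-of-grammar utterances wrongly accepted."""
    return accepted_oog / total_oog

def fr_rate(rejected_ig: int, total_ig: int) -> float:
    """False rejection: in-grammar utterances wrongly rejected."""
    return rejected_ig / total_ig

print(fa_rate(8, 200), fr_rate(30, 1000))  # 0.04 0.03
```

Moving the confidence threshold shifts utterances between these two buckets, which is the tradeoff both definitions describe.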
-
Sound that involves a turbulent airstream within the vocal tract and at least one obstruction such as the teeth, the lips, and so on. Examples: /s/, /S/, /Z/, /f/, and so on.
-
A finite state machine (FSM) is a mathematical model of computation consisting of a finite set of states.
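A minimal deterministic FSM sketch (the states and input symbols are invented): it accepts a sequence of digits terminated by "#", rejecting anything without a valid transition.

```python
# Transition table: (current state, input symbol) -> next state.
TRANSITIONS = {
    ("collect", "digit"): "collect",
    ("collect", "#"): "done",
}

def run(symbols) -> bool:
    state = "collect"                 # start state
    for s in symbols:
        state = TRANSITIONS.get((state, s))
        if state is None:
            return False              # no transition defined: reject
    return state == "done"            # "done" is the only accept state

print(run(["digit", "digit", "#"]))   # True
print(run(["digit", "#", "digit"]))   # False
```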
-
FTP (File Transfer Protocol) is a standard network protocol for exchanging files over the Internet.
-
Type of circuit that allows simultaneous two-way data transmission.
-
Natural language processing that requires each word in the utterance to be parsed to fill slots in the grammar.
-
Doubled consonant sound. Example: unknown /Vnn@Un/.
-
Command that can be spoken during any dialog state, such as “help”, “cancel”, or “exit”. Also called a universal command.
-
Sound produced by a closure of the glottis and its subsequent release. In our transcriptions we generally do not transcribe glottal stops.
-
Speech science algorithm first introduced by Good and Turing, typically used to estimate probabilities from counts.
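A sketch of the basic Good-Turing adjustment, r* = (r + 1) * N[r+1] / N[r], where N[r] is the number of items observed exactly r times (no smoothing of the N[r] values, so counts whose N[r+1] is zero are simply omitted; the toy word counts are invented):

```python
from collections import Counter

def good_turing_adjusted(counts):
    """Return Good-Turing adjusted counts r* = (r + 1) * N[r+1] / N[r]."""
    n = Counter(counts.values())  # frequency of frequencies N[r]
    return {w: (r + 1) * n[r + 1] / n[r]
            for w, r in counts.items() if n[r + 1]}

# Toy data: three words seen once, one word seen twice.
counts = {"hi": 1, "yes": 1, "no": 1, "stop": 2}
print(good_turing_adjusted(counts))  # each singleton adjusted to 2 * 1/3
```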
-
Words and word sequences the recognizer can recognize, and the interpretations for those utterances. For example, a currency grammar might allow you to say things like “thirty four dollars and twelve cents” but not “cents dollars forty four.” Grammars are often specified in GrXML.
-
Feature that calculates probabilities on grammars using task-specific data and generates new, adapted grammars.
-
Part of a grammar that contains the most important meaning-bearing words.
-
Filename of the most frequently recognized grammar in a grammar set. Pointing to the grammar exemplar in a report shows the grammar set. A grammar can be the exemplar for several grammar sets.
-
File that contains a grammar definition. A Nuance grammar file must be a text file written in GrXML, or a compiled binary .gram file.
-
Part of a grammar containing words or expressions that do not convey significant meaning, such as “I’d like to,” “I’m,” “please.”
-
Component of a grammar used to define words and phrases the grammar can accept. Grammar rules can be nested one within another to create a complex grammar capable of interpreting long and detailed utterances.
-
Set of grammars active for a recognition.
-
Set of letters or letter combinations that represent a phoneme in a writing system. May also refer to a single letter or character in a system of writing.
-
Nuance shorthand for the syntax for grammars defined in the XML format of the W3C Speech Recognition Grammar Specification. May also refer to the file extension for such grammars. The current specification for GrXML is available on the Web at the W3C.
-
Call transfer method similar to a bridged transfer, but in which the application cannot monitor the audio path of the two connected parties.
-
Type of circuit that allows data transmission in two directions, but only one direction at a time.
-
Two adjacent vowels in different syllables coming together with or without a slight pause. Example: radius /reIdI@s/; the hiatus is in the sequence /I@/.
-
Characteristic of a system that is up most of the time, corresponding to 99.999% (five-nine) availability.
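As a quick arithmetic check, five-nines availability corresponds to only a few minutes of downtime per year:

```python
def downtime_per_year(availability: float) -> float:
    """Expected downtime in minutes per year for a given availability."""
    return (1.0 - availability) * 365 * 24 * 60

print(round(downtime_per_year(0.99999), 2))  # about 5.26 minutes per year
```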
-
Hidden Markov Model. Complex statistical model that provides a detailed spectral and temporal representation of the speech signal.
-
Two or more words that sound the same, but have different meanings—for example, “to”, “too”, and “two”. Since they all sound the same, the recognizer cannot tell which was spoken. The only way to determine the correct word is to consider the context.
-
Application developed for a specific tenant (customer) but hosted by an application service provider on a server that supports multiple applications from the same or other tenants.
-
Execution environment in which multiple hosted applications are supported. (Also multi-tenant environment.)
-
The unique, physical MAC address of the machine. Get the address with the "ipconfig /all" command, or use the FLEXNET tool: "lmutil lmhostid".
-
Update to acoustic speech recognition models while the system continues operation. Telephone callers do not perceive delays when these updates occur. Hot insert is nearly invisible to applications and voice browsers, although mechanisms are available to control the update process.
-
Type of recognition in which the recognizer listens for utterances, but does not respond unless a specific word or phrase is recognized. Hot word recognition is used by Nuance Recognizer for tasks such as interrupting a long TTS playback, or returning from a bridged transfer. Krypton does not support hot word recognition.
-
The Hypertext Transfer Protocol (HTTP) is the underlying protocol used by the World Wide Web and defines how messages are formatted and transmitted, and what actions Web servers and browsers should take in response to various commands. Hypertext is structured text that uses logical links (hyperlinks) between nodes containing text.
-
HTTPS (also called HTTP over TLS, HTTP over SSL, and HTTP Secure) is a protocol for secure communication over a computer network that is widely used on the Internet.
-
Process of identifying a speaker from a set of valid users, enabling a more personalized interaction. Speaker identification can be performed on a small set of users.
-
Internet Engineering Task Force. Organization that defines standards for Internet protocols.
-
Caller actively trying to get into the system using someone else’s account.
-
An utterance that can be recognized by a given grammar.
-
An utterance that can be recognized by the active grammar.
-
Representation of the meaning of a sentence. May also refer to the recognition of an utterance from text rather than audio.
-
Internet Protocol. Protocol used for communicating data across a packet-switched network using the Internet Protocol Suite (also referred to as TCP/IP). Each computer (known as a host) on the Internet has at least one IP address that uniquely identifies it from all other computers on the Internet.
-
Industry Standard Architecture. Computer bus standard for IBM-compatible computers, succeeded in 1987 by the Extended Industry Standard Architecture (EISA).
-
Integrated Services Digital Network. Communications standard for transmitting voice, audio, and data over digital or analog telephone wires. ISDN lines include B channels, which carry voice or data, and D channels, which carry control and signaling information.
-
International Organization for Standardization (often informally expanded as International Standards Organization).
-
ISDN User Part. Protocol layer used for circuit-switched connections.
-
Inverse Text Normalization
-
Interactive Voice Response. General-purpose system for developing and deploying telephony applications that perform automated operations and transactions to callers via voice and DTMF input.
-
Variation in arrival time of data packets during transmission through the IP network.
-
Java Native Interface. Programming framework that enables communication between native applications and applications running on a Java virtual machine.
-
JavaScript Object Notation. Lightweight data-interchange format that is easy for humans to read and write, and easy for machines to parse and generate.
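A sketch of JSON's round-trip property using Python's standard library (the recognition-style field names are invented):

```python
import json

result = {"literal": "check my balance", "confidence": 870, "nbest": 2}
text = json.dumps(result)            # serialize to a JSON string
print(json.loads(text) == result)    # parsing it back recovers the data
```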
-
Java Server Pages
-
Krypton is Nuance's enterprise-grade, realtime large vocabulary continuous speech recognition engine. The Krypton engine uses a WebSocket-based protocol to accept audio streams and asynchronously send back results as the recognition progresses. Krypton supports domain language models among other forms of specialization, allowing it to understand terms specific to a field of work/application.
-
Local Area Network. Computer network covering a small physical area such as a single office building. A typical LAN has high transmission rates and does not require leased telecommunications lines.
-
Language code as specified in RFC 1766.
-
Statistical model for the syntax of language constructs. The recognizer uses language models to bias it appropriately towards more common phrases. These models increase overall recognition accuracy.
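As a toy sketch of the idea (real engines use far larger corpora and smoothing), a maximum-likelihood bigram model estimates how likely one word is to follow another; the corpus below is invented:

```python
from collections import Counter

corpus = "check my balance please check my account".split()
bigrams = Counter(zip(corpus, corpus[1:]))  # counts of word pairs
unigrams = Counter(corpus[:-1])             # counts of preceding words

def p(word: str, prev: str) -> float:
    """P(word | prev) by maximum likelihood, no smoothing."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p("my", "check"))    # 1.0: "check" is always followed by "my" here
print(p("balance", "my"))  # 0.5: "my" precedes "balance" half the time
```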
-
Language-specific resources installed in addition to the recognition engine. The resources are needed to perform recognition in a particular language and locale (for example, American English, United Kingdom English, or Australian/New Zealand English). A language pack includes localized acoustic models, pronunciation dictionaries, and standard VoiceXML built-in grammar types like date and digits.
-
Caller-perceived delay. Specifically, duration of the delay between reception of audio (utterance) to the emission of the recognition response.
-
Sound that is articulated by allowing air to escape freely over one or both sides of the tongue. Example: /l/.
-
A key to enable ports for every licensed product. It is generated at the Nuance license fulfillment website and stored on a licensing server host.
-
Nuance License Manager. Service responsible for managing the allocation of licenses across a network.
-
Channel used to listen to audio recorded by another channel.
-
Text of the vocabulary item returned by the recognizer. Also called the recognized item or the raw text.
-
In SIP, the server used to identify the location of a caller in the server’s domain. A caller agent must register with a location server to be located by a proxy server for a VoIP call.
-
Magic-word mode is identical to selective barge-in, except that in magic-word mode the speech detector rejects candidates that are too short or too long before sending them to the recognizer. Krypton does not support magic-word recognition.
-
Nuance Management Station. Component that provides centralized management of a Nuance network.
-
Channel on which the audio is recorded during eavesdropping.
-
Minimally Formatted Form. Plain-text literal, consisting only of words, that indicates exactly what was recognized. Produced by Krypton and by the Nuance Text Processing Engine (NTpE).
-
Multipurpose Internet Mail Extensions
-
Dialog format that allows several pieces of information to be gathered at once, in any order, and prompts for missing items as necessary.
-
Statistical method to represent how each phoneme sounds. This method is used by most state-of-the-art recognizers.
-
Single vowel where there is no detectable change in quality during a syllable. Example: /V/ as in cut.
-
Smallest unit of characters that carries meaning. Morphemes are not necessarily identical with syllables. Examples: book, it, he. In dogs the morphemes are ‘dog’ and ‘s’ (‘s’ being the plural morpheme in that case); in disagreement the morphemes are ‘dis’, ‘agree’, and ‘ment’.
-
Media Resources Control Protocol. Communication protocol for speech servers to provide various speech services to an application.
-
Software module that communicates with Speech Server using MRCP.
-
Modular RECognizer. MREC is a language independent, highly configurable speech recognition engine used both by Nuance products (such as the Krypton recognition engine and Nuance Text Processing Engine) and for research. MREC provides a set of primitive speech recognition functions and requires middleware such as S3 to create a recognition system. It is used for realtime and batch dictation tasks for multiple products at Nuance.
-
Mean Time Between Failures. Total equipment uptime in a given period, divided by the number of failures in that period.
-
Mean Time To Repair or Replace. Total equipment downtime for a given period, divided by the number of failures in that period.
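MTBF and MTTR are often combined into a steady-state availability estimate. A minimal sketch, using invented figures for illustration:

```python
# Availability from MTBF and MTTR: the fraction of time a system is up.
# All figures here are hypothetical.
uptime_hours = 8742.0     # total uptime in the period
downtime_hours = 18.0     # total downtime in the period
failures = 3

mtbf = uptime_hours / failures      # mean time between failures
mttr = downtime_hours / failures    # mean time to repair

availability = mtbf / (mtbf + mttr)
print(f"MTBF={mtbf:.1f} h, MTTR={mttr:.1f} h, availability={availability:.4%}")
```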
-
Digital encoding standard used to store audio recordings, such as prompts. Also called mu-law or µ-law.
-
Execution environment in which multiple hosted applications are supported. (Also hosted environment).
-
Mutually Exclusive. Function used to synchronize processes operating on shared data so they don’t interfere with each other.
-
List of N possible recognition results, ranked from highest to lowest likelihood by confidence score, where the application configures the value of N. The recognizer returns an n-best list of varying length to the application depending upon the ambiguity of the recognition. Typically, applications use the first result, but they can refer to other items to deduce the caller’s intended meaning.
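A sketch of how an application might consume such a list; the field names are illustrative, not an actual Nuance API:

```python
# Hypothetical n-best list as a recognizer might return it,
# already ranked from highest to lowest confidence.
n_best = [
    {"literal": "boston",  "confidence": 0.82},
    {"literal": "austin",  "confidence": 0.61},
    {"literal": "houston", "confidence": 0.34},
]

# Typical usage: take the top result, but keep the alternatives
# available in case the caller rejects it during confirmation.
top = n_best[0]["literal"]
alternatives = [r["literal"] for r in n_best[1:]]
print(top, alternatives)
```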
-
Sequence of N consecutive words (w1w2...wN). The words w1,w2,...wN-1 are called the predecessors of wN. For example, the sequence “I would like” is a trigram where “I” and “would” are the predecessors of “like”.
-
Statistical language model in which the probability of the occurrence of any word is completely determined by a fixed number of preceding words. In an n-gram model of order N, the probability of the m-th word in a string depends only on the N-1 words that precede it. The n-gram model is the most widely used type of model in speech recognition. All word sequences are allowed, but likely phrases are assigned higher probabilities.
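A maximum-likelihood bigram (order-2) model can be estimated by simple counting. A toy sketch over a made-up corpus:

```python
from collections import Counter

# Toy corpus; a real SLM is trained on large amounts of transcribed speech.
corpus = "i would like more i would like less i would want more".split()

# Count bigrams and their predecessor words.
bigrams = Counter(zip(corpus, corpus[1:]))
predecessors = Counter(corpus[:-1])

def p(word, prev):
    """Maximum-likelihood bigram probability P(word | prev)."""
    return bigrams[(prev, word)] / predecessors[prev]

# After "would", this corpus makes "like" more probable than "want".
print(p("like", "would"), p("want", "would"))
```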
-
Sparing is a provisioning technique that provides standby hardware that can be quickly substituted for any systems that fail in your network. This is often referred to as N+1 sparing or N:M sparing, where N is the number of machines active at a given time and M is some number between 1 and N indicating how many standby systems you have available.
-
Nuance Application Studio is a web-based tool for designing and developing speech and touchtone applications. It streamlines the design process, facilitates communication with stakeholders, and generates code and other collateral.
-
Sound where a part or the whole of the respiratory stream escapes through the nose. Examples: /m/, /n/, /N/.
-
The Natural Language Engine is Nuance's enterprise-grade text-to-meaning engine or semantic engine. NLE provides ontology- and concept-based semantic processing. NLE takes as input the token sequence provided by the Nuance Text Processing Engine and from this input identifies the intent and/or meanings expressed in the human-machine turn. The outcome from NLE is typically used to drive the next machine-human turn.
-
Natural language processing (NLP) is the ability of a computer program, system, or application to understand human speech as it is spoken.
-
The Natural Language Processing service (or NLP service) provides a single interface to the Krypton recognition engine and to Natural Language Engine services. The NLP service also communicates with the Nuance Resource Manager to allocate a suitable Krypton and NLE resource based on the capabilities requested by the voice application.
-
Speech recognition techniques that permit a caller to answer a prompt with a full phrase or sentence, as in everyday conversation. Typically, natural speech is longer in duration and has a broad range of possible meanings. A grammar capable of natural language understanding must accept a wide variety of different phrases.
-
Tool for creating models used by Dragon Voice components.
-
The caller knows and says the option, surrounded by filler words: “I think I want billing,” “I have a question about my bill.” This is the most common out-of-grammar type.
-
Network Equipment-Building System. The most common set of safety, spatial and environmental design guidelines applied to telecommunications equipment in the United States.
-
Nuance Experience Studio is a web-based tool for creating natural language understanding models for virtual assistant and call steering applications. These models or artifacts enable the application to understand what users mean when they contact the application.
-
Statistical method used in phoneme classification to represent how each phoneme sounds. It can be used in conjunction with a mixture Gaussian model or on its own.
-
Nuance Insights for IVR is an analysis and reporting tool that provides usage information about speech, mobile, and touchtone applications based on call logs and audio files collected from applications.
-
The Natural Language Engine is Nuance's enterprise-grade text-to-meaning engine or semantic engine. NLE provides ontology- and concept-based semantic processing. NLE takes as input the token sequence provided by the Nuance Text Processing Engine and from this input identifies the intent and/or meanings expressed in the human-machine turn. The outcome from NLE is typically used to drive the next machine-human turn.
-
Natural language processing (NLP) is the ability of a computer program, system, or application to understand human speech as it is spoken.
-
The Natural Language Processing service (or NLP service) provides a single interface to the Krypton recognition engine and to Natural Language Engine services. The NLP service also communicates with the Nuance Resource Manager to allocate a suitable Krypton and NLE resource based on the capabilities requested by the voice application.
-
Natural Language Semantic Markup Language. XML-based representation of the semantic meaning of recognition results. The semantic meaning can often be more useful to applications than literal results. The language was developed by the World Wide Web Consortium (W3C). It has been superseded by EMMA.
-
Natural Language Understanding (NLU) is a speech recognition technique (or set of techniques) that permits a caller to answer a prompt with a full phrase or sentence, as in everyday conversation. Typically, natural speech is longer in duration and has a broad range of possible meanings. A grammar capable of natural language understanding must accept a wide variety of different phrases.
-
Nuance On Demand. Application hosting service that can handle many tenants and applications on the same pool of servers.
-
In a reference architecture each node is considered a self-contained entity, comprising all the elements needed to deploy service, including application servers, database servers, and clusters of hosts.
-
Single-letter codes, in brackets, that transcribers use to denote certain events in the utterance; for example, the transcription string “[n]” indicates noise, and the string “[c] hello” indicates the caller coughed and then said “hello.”
-
Process by which text (words/tokens) is converted to a standard form based on global parameter settings and token-specific rule settings. For example, to expand abbreviations ("PIN" to "personal identification number"), or to convert currency symbols into full words (currency symbols to "dollars" or "euros").
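A minimal sketch of this kind of rule-driven normalization. The rule tables below are invented for illustration; real engines use configurable, locale-specific rule sets:

```python
import re

# Hypothetical rule tables, for illustration only.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street"}
CURRENCY = {"$": "dollars", "€": "euros"}

def normalize(text):
    # Expand abbreviations by simple substitution.
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # "$5" -> "5 dollars": move the amount ahead of the spoken currency word.
    for sym, word in CURRENCY.items():
        text = re.sub(re.escape(sym) + r"(\d+)", r"\1 " + word, text)
    return text

print(normalize("Dr. Smith paid $5 on Main St."))
```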
-
Nuance recognition server. Service that provides access to Nuance Recognizer and a pool of preloaded grammars.
-
Nuance Speech Server. Central control and communication hub for speech-processing resources. Speech Server provides an open, protocol-based mechanism for using resources such as speech recognition and text-to-speech. Speech Server interacts with a voice platform via the MRCP protocol and with telephony via the RTP protocol.
-
The Nuance Text Processing Engine (NTpE) is Nuance's normalization and tokenization (or lexical analysis) engine. NTpE applies transformation rules and formats output for display or for further processing by semantic engines such as NLE.
-
Nuance Application Studio is a web-based tool for designing and developing speech and touchtone applications. It streamlines the design process, facilitates communication with stakeholders, and generates code and other collateral.
-
Nuance Experience Studio is a web-based tool for creating natural language understanding models for virtual assistant and call steering applications. These models or artifacts enable the application to understand what users mean when they contact the application.
-
Nuance Insights for IVR is an analysis and reporting tool that provides usage information about speech, mobile, and touchtone applications based on call logs and audio files collected from applications.
-
Nuance License Manager. Service responsible for managing the allocation of licenses across a network.
-
Component that provides centralized management of the network.
-
Application hosting service that can handle many tenants and applications on the same pool of servers.
-
Server that provides access to a Nuance Recognizer instance and a pool of preloaded grammars. Systems like directory assistance can use grammars that are so large that it is not possible to load them dynamically while handling a call. Therefore, these large grammars are preloaded on specific recognition servers during initialization. At runtime, the recognition requests are directed to the server serving the requested preloaded grammar.
-
Runtime component that performs speech recognition on an audio stream that begins with speech.
-
Distributes requests and provides load balancing to Dragon Voice recognition and interpretation resources (Krypton recognition engine, Natural Language Engine, and Nuance Text Processing Engine).
-
Central control and communication hub for speech-processing resources. Speech Server provides an open, protocol-based mechanism for using resources such as speech recognition and text-to-speech. Speech Server interacts with a voice platform via the MRCP protocol and with telephony via the RTP protocol.
-
The Nuance Text Processing Engine (NTpE) is Nuance's normalization and tokenization (or lexical analysis) engine. NTpE applies transformation rules and formats output for display or for further processing by semantic engines such as NLE.
-
Nuance text-to-speech engine. Using Nuance supplied voice and language data optionally supplemented by application recordings and tuning data, Vocalizer speaks computer supplied text in a human voice.
-
Operations, Administration, and Maintenance (or “Management”). Processes, procedures, and tools for operating, administering, managing, and maintaining a computer system or network.
-
Operations, Administration, and Maintenance (or “Management”). Processes, procedures, and tools for operating, administering, managing, and maintaining a computer system or network.
-
Sound where an obstruction in the vocal tract is sufficient to cause noise. Examples are all plosives, fricatives, and affricates. Sounds without such a noise component are called sonorants. Nuance does not, however, use this distinction in its description of sounds, because the distinction is irrelevant for these purposes.
-
Term used to represent a managed host that cannot be reached by Management Station. This typically means the watcher service is not currently running on that host.
-
Technique that allows a caller to respond to a yes/no confirmation with the correct answer, without having to go back to the original directed dialog prompt.
-
Words or phrases not specified in the active grammar as well as laughter, coughing, and so on.
-
Prompt that asks a question without expecting a response from a constrained list of options. For example, “How may I help you?”
-
SLM that contains a special class, called unknown (UNK), that lets you add new words to an SLM without retraining the model.
-
Set of rules that define how to write a language. A phonemic orthography is a writing system where each symbol (grapheme) corresponds to a single phoneme in the language.
-
Caller speech that cannot be parsed by a given grammar. Out-of-grammar errors are typically the greatest factor affecting caller-perceived accuracy. There are two types of out-of-grammar errors—near and far.
-
Words or phrases not specified in the active grammar as well as laughter, coughing, and so on.
-
Phrase known only by the caller. Password phrases add security to an application by asking the speaker to provide additional information known by that speaker only.
-
Data portion of a packet.
-
Private Branch Exchange. Telephone exchange that serves a particular business, as opposed to a public exchange operated by a telephone company.
-
Peripheral Component Interconnect. Hardware computer bus for attaching hardware devices.
-
Pulse Code Modulation. Digital transmission format used by traditional telephony applications, in which analog signals are sampled 8000 times per second.
-
PEM is a de facto file format for storing and sending cryptography keys, certificates, and other data, based on a set of IETF standards defining Privacy-Enhanced Mail. PEM is a standard format for OpenSSL and many other SSL tools.
-
Measure of the quality of a language model: that is, how well the model can predict phrases. Perplexity is a function of the model and the test set on which it is measured. For a given test set, the lower the perplexity, the better the model. Perplexity is also a measure of the complexity of the task: given two models of equivalent complexity trained for two different tasks, the task whose model has the higher perplexity is the more complex. Because perplexity is correlated with recognition accuracy, it can be used to tune applications.
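Perplexity can be computed as the exponential of the average negative log-probability the model assigns to the test words. A sketch with made-up per-word probabilities:

```python
import math

# Hypothetical probabilities a language model assigns to each word
# of a 4-word test phrase (invented for illustration).
word_probs = [0.25, 0.5, 0.125, 0.25]

# Perplexity = exp of the average negative log-probability.
n = len(word_probs)
perplexity = math.exp(-sum(math.log(p) for p in word_probs) / n)
print(round(perplexity, 3))
```

Here the geometric mean probability per word is 1/4, so the perplexity is 4: on average the model is as uncertain as a uniform choice among 4 words.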
-
Personality of a system defined for an application, reflected in the voice, language use, and the audio environment. The persona is based on the target audience, user feedback, and imagery associated with the company’s brand.
-
Perceptually distinct sound unit that does not yet represent a particular phoneme.
-
Basic unit of speech, representing a single distinguishable sound used in spoken language. Examples: /f/, /S/, /m/, /@/, /I/, and so on. Nuance Recognizer represents words as sequences of phonemes. Unlike older technologies, this allows you to recognize any word without having a special model for it.
-
Alphabet in which every symbol represents one and only one specific sound in a given language. A phonetic alphabet is useful for writing a pronunciation, because it leaves no room for ambiguity.
-
Process of determining each segment's phoneme.
-
Process that breaks a word down into its phonemes. Examples: full /fUl/, bin /bIn/, yes /jes/.
-
Public Key Cryptography Standards
-
Runtime environment into which Nuance speech components are integrated. The platform typically includes a voice browser, an MRCP client and a telephony interface.
-
Sound that may or may not be voiced and involves a non-nasal oral closure. Examples: /p/, /b/, /t/, /d/, /k/, and so on.
-
Virtual/logical data connection that can be used by programs to exchange data. One computer port is required to handle each call. For example, to handle 24 simultaneous calls, a system must have 24 ports. The more ports a given computer can handle, the more economical the system is to buy and maintain.
-
Point-to-Point Switching. Telephony switching between two end-points.
-
Morphological element placed before the root of a word. Examples: un- as in unseen or dis- as in disagree.
-
Primary Rate Interface. Integrated Services Digital Network (ISDN) interface most likely to be found in business service and is typically used for carrying multiple voice and data transmissions between two physical locations. PRI offers 23 bearer (B) channels for user payload, plus one data (D) channel for signaling and control, which is equivalent to the 24 channels of a T1 line. The European version, known as primary rate access (PRA), offers 30 B channels, plus one D channel, which is equivalent to the 31 channels of an E1 line. The B channels can be used individually to connect on demand to any other ISDN device, and multiple B channels can be bonded and treated as a single fast connection for bandwidth-intensive applications such as data file transfers, videoconferencing, and any multimedia combination.
-
Speech played to a caller either to ask a question or to provide feedback. For example, “What is your account number?” or “Please wait while I check flight information.” A prompt can be played from a pre-recorded file, generated from text (TTS), or a combination of the two. The audio content can be specified via a URI, or (in VoiceXML) as a previously recorded audio variable.
-
Transcription of the sound and stress patterns of a spoken word or phrase. This is used both in text-to-speech output, and in speech recognition to match utterances against text.
-
Intermediary program that routes SIP call requests within the network and performs SIP services.
-
Public Switched Telephone Network. Traditional telephony system, based on circuit-switched technology and using PCM conversion. The connection between the sender and receiver must be reserved before data transfer begins. Connection resources are dedicated to the circuit for the duration of the call session.
-
Quality of Service. Network requirements (such as latency and maximum packet loss) to support a specific application.
-
Text of the vocabulary item returned by the recognizer. Also called the literal or the recognized item.
-
Process of identifying and interpreting spoken language. Recognition is performed by a recognizer such as Nuance Recognizer or Krypton, which, in turn require grammars or models (respectively) to define the words and phrases that can be recognized.
-
Set of grammars, speech models, and parameters used when recognizing a specific utterance.
-
Host CPU or DSP card used to handle speech recognition computation. Recognition resources are often shared to save cost and physical space. For example, one DSP might handle 8 simultaneous telephone calls (in other words, the calls on 8 Ports). Sharing works well because typically only a few of the callers on different calls are speaking simultaneously. The remainder are listening to prompts or waiting for database queries to finish, and therefore not using the recognition resources.
-
All the information generated during the recognition operation, including a text transcription of the recognized phrase, an interpretation of the utterance, the number of words recognized, and the confidence score and overall probability of the result.
-
Event that occurs when there is no speech detected several seconds after a prompt has ended. Usually when this happens, the application tells the caller that it did not hear anything, and it reprompts the caller.
-
Text of the vocabulary item returned by the recognizer. Also called the literal or the raw text.
-
Runtime component that performs speech recognition on an audio stream that begins with speech.
-
Time between when the recognizer detects end-of-speech and the recognizer returns a result. The caller perceives a larger delay than this: see caller-perceived response time. Calculations like median and 90th percentile response time are approximate: they are typically precise to within 0.05 second below 0.8 seconds, to within 0.1 second below 2 seconds and less precise above that value.
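Such median and 90th-percentile figures can be computed with a simple nearest-rank method. A sketch over invented response times:

```python
import math

# Hypothetical recognizer response times in seconds (invented figures).
times = sorted([0.31, 0.42, 0.28, 0.55, 0.47, 0.62, 0.38, 0.91, 0.44, 0.50])

def percentile(data, pct):
    """Nearest-rank percentile on already-sorted data."""
    rank = math.ceil(pct / 100 * len(data))      # 1-based rank
    return data[min(rank, len(data)) - 1]

median = percentile(times, 50)
p90 = percentile(times, 90)
print(median, p90)
```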
-
Semantic value of the vocabulary item returned by the recognizer. The grammar associates a value for each item.
-
Classification of recognition outcomes.
-
Grammar that references itself.
-
Detecting utterances (words, coughing, laughter, and so on) not defined in the active vocabulary. When the recognizer rejects an utterance, the application prompts the caller to try again. For example: “Sorry, I didn't understand you. Please say the name again.” An utterance is rejected when its confidence score is so low that the system should not even attempt confirmation.
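A common dialog pattern compares the confidence score against two thresholds, rejecting below one and confirming below the other. A minimal sketch; the threshold values are invented, and real applications tune them:

```python
# Hypothetical thresholds; real values are tuned per application.
REJECT_BELOW = 0.30    # below this: reject and reprompt
CONFIRM_BELOW = 0.70   # below this (but above reject): confirm with the caller

def classify(confidence):
    """Map a recognition confidence score to a dialog action."""
    if confidence < REJECT_BELOW:
        return "reject"      # "Sorry, I didn't understand you..."
    if confidence < CONFIRM_BELOW:
        return "confirm"     # "Did you say ...?"
    return "accept"

print([classify(c) for c in (0.15, 0.55, 0.90)])
```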
-
Distributes requests and provides load balancing to Dragon Voice recognition and interpretation resources (Krypton recognition engine, Natural Language Engine, and Nuance Text Processing Engine).
-
Time between when the caller stops speaking and when the recognizer returns a response. The response time depends on several factors, including the recognizer’s efficiency, difficulty of the recognition task, system load, and other computation required to return the response (for example, querying a database).
-
After a recognition timeout or rejection, the application often prompts the caller again to speak. This is called a retry. A good user interface design gives the caller information about why the retry is needed and sometimes alerts the caller to fallback methods.
-
Request for Comments (IETF standard). Memorandum published by the Internet Engineering Task Force (IETF) describing methods, behaviors, research, or innovations applicable to the working of the Internet and Internet-connected systems. Through the Internet Society, engineers and computer scientists may publish discourse in the form of an RFC, either for peer review or simply to convey new concepts, information, or (occasionally) engineering humor. The IETF adopts some of the proposals published as RFCs as Internet standards.
-
Natural-language processing mechanism that allows the recognizer to fill natural language slots no matter where the core words appear in a phrase. This means that not all words in the utterance need to be parsed, so the grammar need not include every possible combination of core words, and can ignore grammar filler words.
-
Natural-language processing mechanism that allows the recognizer to fill natural language slots no matter where the core words appear in a phrase. This means that not all words in the utterance need to be parsed, so the grammar need not include every possible combination of core words, and can ignore grammar filler words.
-
A robust parsing grammar is able to identify the key items within a user utterance, while ignoring any dysfluencies or filler words that carry no significant meaning. A robust parsing grammar is a collection of SRGS rules (or concepts) that are applied to input text. Unlike a regular SRGS grammar, which applies rules in a specific order, a robust parsing grammar applies rules flexibly wherever they provide the best matches, and extracts meaning from those matching fragments.
-
Configuration defining a combination of services, number of instances of each service, and service property settings for each service, that can be assigned to a host through Management Station.
-
Real-Time Transport Control Protocol. Internet protocol that provides out-of-band statistics and control information for an RTP flow.
-
Real-Time Transport Protocol. Internet protocol used for transmitting data with realtime properties, including audio and video. RTP typically runs over UDP.
-
Grammar compilation mechanism that allows grammar content to be passed directly to the recognizer at runtime, compiled, recognized, then discarded.
-
S3 is middleware for managing a speech recognition workflow using MREC and TextProc, as implemented in the Krypton speech recognition engine.
-
Use of a subset of calls to an application instead of all calls. Using a subset saves disk space and computation time. For example, if you have a million calls a day to your system, you can sample only 20,000 a day. This enables statistics that are precise enough for analysis and reporting.
-
Microsoft Speech API
-
Session Description Protocol. IETF proposed standard for streaming parameters for media initialization.
-
Final step in the internal recognition process, where the recognizer searches for the word or phrase that most closely matches what the caller said.
-
Secure Sockets Layer (SSL) provides a secure channel between two machines or devices operating over the internet or an internal network. SSL was the most widely deployed cryptographic protocol to provide security over internet communications before it was superseded by TLS (Transport Layer Security).
-
Process of breaking speech into pieces that are then used to perform recognition. There are several methods for segmentation. Nuance Recognizer uses an approach where speech is broken into phonemes. Many other vendors break speech into pieces of fixed-duration that are independent of where the phonemic boundaries are. The Nuance Recognizer approach results in fewer segments to analyze and thus requires less CPU resources.
-
Prevents accidental interruption by allowing applications to define a small set of key words (to be spoken by callers) that trigger an intended barge-in. A client application that supports selective barge-in always listens for commands, whether the caller is speaking or listening to prompts. Selective barge-in interrupts the conversation or prompt only when it recognizes an utterance that is part of a predetermined grammar. Krypton does not support selective barge-in.
-
Mechanism by which the recognizer adapts its behavior based on previous results, with the aim of improving recognition accuracy.
-
Semantic interpretation allows utterances to be interpreted into structured objects that can be understood by an application.
-
Semantic models consist of concept grammars, grammars inferred from a training corpus, and trained classifiers, among other data. The Natural Language Engine consumes semantic models that are application- or project-specific.
-
Recognition result processing mechanism whereby the recognizer ensures that if multiple answers have the same natural language interpretation, only one of those answers is returned. This ensures that all answers returned in an n-best list have different semantic interpretations.
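The effect can be sketched as a first-match-wins filter over the n-best list; the field names here are illustrative, not an actual Nuance API:

```python
# Hypothetical n-best list in which two literals share one interpretation.
n_best = [
    {"literal": "five hundred", "interpretation": 500, "confidence": 0.80},
    {"literal": "5 hundred",    "interpretation": 500, "confidence": 0.72},
    {"literal": "nine hundred", "interpretation": 900, "confidence": 0.41},
]

# Keep only the highest-ranked answer for each distinct interpretation.
seen, deduped = set(), []
for result in n_best:           # list is already sorted by confidence
    if result["interpretation"] not in seen:
        seen.add(result["interpretation"])
        deduped.append(result)

print([r["literal"] for r in deduped])
```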
-
Sound functioning phonologically as a consonant, but sharing the phonetic qualities of a vowel. Example: /j/ as in yet /jet/.
-
GUI software development environment used to create a voice- and DTMF-based user interface and associated service logic for IVR applications.
-
Process participating in a service type.
-
Property set on a service to customize its behavior.
-
Logical entity identifying services that provide a specific feature.
-
Complete, continual interaction between a speaker and an application. In telephony environments, this is the duration of a call.
-
Unique identifier for each call received or placed. Typically generated at the start of a session, by the telephony interface for an inbound call, or by the VXML browser for an outbound call. The session ID is automatically written along with every log message to enable sorting of logs on a per-call basis.
-
Session Initiation Protocol. In TCP/IP communications, the application layer signaling protocol.
-
Mechanism for creating a set of possible results that should not be returned by the recognizer. Typically this is used when reprompting for misrecognized information, so that the same incorrect results are not returned in two successive dialog states.
-
Statistical Language Model. A mathematical object that computes the probability of sequences of words. In speech recognition systems, such a sequence is a speaker’s utterance, and the SLMs can be used to estimate the probability of the next word in the sequence based on the recognition status of the preceding words. For example, if the system has previously recognized "give me", it is more likely to recognize "more" than "sore".
-
Call transfer method in which the original call is reestablished if the transfer line is busy and the transfer cannot complete.
-
Nuance product that improves caller-perceived accuracy for directed dialogs. It allows the system to more easily detect and properly interpret near out-of-grammar utterances that include filler phrases without explicitly defining them in the original grammar.
-
Simple Network Management Protocol. IETF standard that enables system managers to control a variety of network devices from a single operational console.
-
Signal-to-Noise Ratio. Ratio of signal to noise, in dB, indicating line quality.
-
Software application enabling a caller to place or receive phone calls from a PC. Calls can be made from a PC to a phone or to another PC.
-
Sound whose production does not involve a noise component. Sonorants can form the nucleus of a syllable. Examples: all vowels and /m/, /n/, /N/, /j/, /w/, and so on. Sounds in which there is a noise component are called obstruents. Nuance does not, however, use this distinction in its description of sounds, because the distinction is irrelevant for these purposes.
-
Provisioning technique of providing standby hardware that can be quickly substituted for any systems that fail in your network. This is often referred to as N+1 sparing or N:M sparing, where N is the number of machines active at a given time and M is some number between 1 and N indicating how many standby systems you have available.
-
Person whose speech is being recognized.
-
Process of identifying a speaker from a set of valid users, enabling a more personalized interaction. Speaker identification can be performed on a small set of users.
-
Process of identifying which speaker said the utterance and authenticating the speaker. Speaker recognition encompasses identification and verification.
-
Nuance product name for the technology of using SLMs and SSMs to interpret a response to an open-ended prompt. It is useful for far out-of-grammar responses.
-
Refers to the physical link that carries the voice or audio data. The speech channel provides the bridge between an application running on a specific port and the telephony session service.
-
Percentage of time during a call occupied by the caller’s speech (as opposed to prompts or database queries, and so on). Sometimes called the recognizer duty cycle. Since speech uses recognition resources, applications with high speech density tend to require more resources.
-
Process of detecting the beginning of caller speech against silence or background noise.
-
Runtime component that detects speech on an audio stream, and returns an indication of when speech began.
-
Statistical model that allows the system to translate utterances into phonetic representations of speech. Derived by speech scientists from analyzing many different people speaking well-defined words and phrases.
-
Central control and communication hub for speech-processing resources. Speech Server provides an open, protocol-based mechanism for using resources such as speech recognition and text-to-speech. Speech Server interacts with a voice platform via the MRCP protocol and with telephony via the RTP protocol.
-
Speech Recognition Grammar Specification. The W3C standard specification for writing grammars. It includes an XML format and an ABNF format. Nuance supports only the XML format (called GrXML for convenience).
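As an illustration, a minimal grammar in the SRGS XML format (GrXML) might look like the following sketch; the rule name and contents are hypothetical:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-US" root="yesno">
  <!-- The root rule accepts exactly one of the listed alternatives. -->
  <rule id="yesno" scope="public">
    <one-of>
      <item>yes</item>
      <item>no</item>
    </one-of>
  </rule>
</grammar>
```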
-
Secure Sockets Layer (SSL) provides a secure channel between two machines or devices operating over the internet or an internal network. SSL was the most widely deployed cryptographic protocol to provide security over internet communications before it was superseded by TLS (Transport Layer Security).
-
Statistical Semantic Model. A mathematical model that aids recognition based on the context in which words appear. An SSM uses an SLM to calculate the probability that a given sequence of words will appear and match recognized phrases to their intended meaning.
-
Speech Synthesis Markup Language. XML-based markup language for speech synthesis applications. Gives authors of synthesizable content a standard way to control aspects of speech output such as pronunciation, volume, pitch, rate, and so on across different synthesis-capable platforms. It is a recommendation of the W3C's voice-browser working group.
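A small illustrative SSML fragment (the wording is invented) showing control of rate, volume, and pauses:

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  Your balance is
  <prosody rate="slow" volume="loud">
    <say-as interpret-as="currency">$42.50</say-as>
  </prosody>.
  <break time="500ms"/>
  Goodbye.
</speak>
```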
-
Grammar content that is precompiled into a binary .gram file. Such a grammar loads quickly, but cannot be changed at runtime based on caller input.
-
A statistical language model (SLM) determines the probability that a given word sequence will appear in response to a given prompt. This probability can then be used to guide recognition to the most likely results. When the SLM is used with another natural language technique, a meaning can be assigned to the recognized text. All natural language grammars require underlying SLMs. SLMs are trained from actual user input.
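The probability idea can be sketched with a toy bigram model; the corpus below is invented, and real SLMs are trained on large corpora with smoothing rather than these raw maximum-likelihood estimates:

```python
from collections import Counter

# Toy corpus of transcribed caller responses (hypothetical data).
corpus = [
    "i want to check my balance",
    "check my balance please",
    "i want to pay my bill",
]

# Count unigrams and bigrams, padding each sentence with a start token.
unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def sentence_prob(sentence):
    """Maximum-likelihood probability of a word sequence under the bigram model."""
    words = ["<s>"] + sentence.split()
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    return p

print(sentence_prob("check my balance"))  # ≈ 0.22: a likely in-grammar phrase
print(sentence_prob("balance my check"))  # 0.0: a word order never seen in training
```

During recognition, such probabilities bias the decoder toward word sequences resembling actual user input.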
-
A statistical semantic model (SSM) uses a trained classifier to assign meanings to each recognized utterance. Like an SLM, it is automatically generated from an XML training file that assigns meanings to the training set sentences. An SSM is most effective when you want to fill a single information slot, but the prompt is so open-ended that the caller could reply using many different combinations of words.
-
Management Station service that gathers call statistics for use in billing and capacity monitoring.
-
Component of a grammar used to define words and phrases the grammar can accept. Grammar rules can be nested one within another to create a complex grammar capable of interpreting long and detailed utterances.
-
Morphological element that is attached to the word root and may cause a change of word class. Examples: -ship as in fellowship, or -ment as in improvement (which changes the word class).
-
Special key returned by the recognizer that contains the name of the grammar to identify which of a set of parallel grammars parsed the result on the n-best list.
-
Special key returned by the recognizer that contains the raw text answer (the actual recognized text).
-
Special key returned by the recognizer that contains the beginning and ending times of the words recognized. Also provides word confidence scores in recognition results.
-
Special key used internally by the recognizer; the recognizer always returns this key, even if you didn’t explicitly set it in your grammar. For this reason, some VoiceXML integrations may treat this as a key that is internal to the recognizer, and therefore shouldn’t be returned to the calling VoiceXML application (unless the grammar did not explicitly set any other keys). Thus, if you want the value of SWI_meaning to be returned to the calling VoiceXML application, then you need to explicitly set another key to the same value inside your grammar.
-
A unit of organization for a sequence of speech sounds, typically made up of a syllable nucleus (usually a vowel) with optional initial and final margins (typically consonants). Syllables are often considered the phonological “building blocks” of words. For example, the word lady contains two syllables, /leI/ and /di/.
-
Two separate items in a vocabulary that have the same meaning in a system. For example, nicknames such as Bill, Billy, William, and Will, or commands like purchase and buy, or hangup, quit, exit, and logoff.
-
Format for digital transmission using PCM and Time-Division Multiplexing at a rate of 1.544 Mbps. Each line consists of 24 channels; each channel can be configured to carry 64 Kbps of voice or data traffic.
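The 1.544 Mbps line rate follows from the channel arithmetic: 24 channels at 64 kbps each, plus 8 kbps of framing overhead. A quick check:

```python
# T1 line-rate arithmetic.
channels = 24
channel_rate_kbps = 64       # one DS0 voice/data channel
framing_overhead_kbps = 8    # 8,000 framing bits per second

total_kbps = channels * channel_rate_kbps + framing_overhead_kbps
print(total_kbps)  # 1544, i.e. 1.544 Mbps
```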
-
Two-B-Channel Transfer. TBCT enables an ISDN PRI user to request the switch to connect together two independent calls on the user's interface. The two calls can be served by the same PRI trunk or by two different PRI trunks that both serve the user. If the switch accepts the request, the user is released from the calls and the two other users are connected directly. Billing for the two original calls continues in the same manner as if the transfer had not occurred.
-
Transmission Control Protocol. Connection-oriented transport layer protocol that guarantees delivery of a data stream sent from one host to another without duplicating or losing data.
-
Time Division Multiplexing. Method for combining multiple digital signals for transmission over a single channel.
-
Application, such as a SIP softphone, that provides telephony functionality in non-telephony mode.
-
Module that converts standard telephony signalling and Time-Division Multiplexing (TDM) audio into VoIP signaling and transport. The interface uses the Session Initiation Protocol (SIP) for call control, and to signal a VoiceXML interpreter or other standard execution platform that a call has arrived.
-
A corporate entity that deploys one or more applications in a hosted environment (such as Nuance On Demand). The system automatically organizes log files according to the company and application names.
-
Process by which text (words/tokens) is converted to a standard form based on global parameter settings and token-specific rule settings. For example, to expand abbreviations ("PIN" to "personal identification number"), or to convert currency symbols into full words (currency symbols to "dollars" or "euros").
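A minimal sketch of the idea in Python; the rule tables are hypothetical, and a real engine uses locale-specific data and far richer rules:

```python
import re

# Hypothetical expansion tables for illustration only.
ABBREVIATIONS = {"PIN": "personal identification number", "Dr.": "Doctor"}
CURRENCY = {"$": "dollars", "€": "euros"}

def normalize(text):
    """Expand abbreviations and convert simple currency amounts to spoken form."""
    for abbr, expansion in ABBREVIATIONS.items():
        text = text.replace(abbr, expansion)
    # "$5" -> "5 dollars": amount first, then the currency word.
    for symbol, word in CURRENCY.items():
        text = re.sub(re.escape(symbol) + r"(\d+)", r"\1 " + word, text)
    return text

print(normalize("Enter your PIN to send $5"))
# Enter your personal identification number to send 5 dollars
```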
-
Text Processor. TextProc comprises a tokenizer, which transforms written text into a graph of tokens where any path represents a probable way someone might read or dictate the text; a formatter (also known as Inverse Text Normalization, or ITN), which transforms a token sequence into written text based on global parameter settings and token-specific rule settings; and a lexicon, which helps define the token philosophy and thus the semantic handling of the words in the vocabulary. TextProc is used in multiple Nuance products, including the Nuance Text Processing Engine (NTpE).
-
Event that occurs when no speech is detected by the recognizer over a specified period of time. For some Dialog Modules, a timeout is normal. For example, a caller may be prompted to choose an item from a read list by remaining silent until hearing the desired item.
-
Transport Layer Security. Cryptographic protocol that provides endpoint authentication and communications confidentiality over networks such as the Internet. TLS and its predecessor, Secure Sockets Layer (SSL), encrypt the segments of network connections at the Transport Layer end-to-end.
-
Process in which a string of text is broken up into individual words, keywords, phrases, symbols, and other elements called tokens. Tokens can be individual words or phrases.
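A simplistic rule-based sketch of tokenization in Python; the regular expression here is illustrative, not how any particular tokenizer works:

```python
import re

def tokenize(text):
    """Split text into word, number, and punctuation tokens."""
    # Words (with optional apostrophes), decimal numbers, or single symbols.
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?|\d+(?:\.\d+)?|[^\w\s]", text)

print(tokenize("Pay $42.50 on May 5, please."))
# ['Pay', '$', '42.50', 'on', 'May', '5', ',', 'please', '.']
```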
-
Using user input to refine the accuracy of an SLM. May also refer to the process of creating a voiceprint for a caller and storing it in a database.
-
Exchange of information between the caller and the application. Each call flow defines its own transactions, and it writes log messages at their start and completion. Using this information, you can calculate transaction completion rates, and identify problematic transactions that require attention and tuning.
-
High-level task or goal of a voice application. A single transaction might consist of a series of executions that together produce a useful outcome for the caller or the system. For example, obtaining flight departure information by providing the flight number, date, and confirmation of the departure city.
-
Voice application that exchanges information between the application and the caller. The caller can effect change in the application’s databases.
-
Written text corresponding to a spoken phrase. As part of application tuning, developers create transcriptions from captured audio and compare them to recognition results recorded in the call logs. The transcriptions identify exact causes of successes and failures.
-
Events logged by the application when an automated call is transferred to an agent or other destination. These events are not logged automatically by a platform and require that the application provide a reason for the transfer and optionally other information about the transfer. The reason indicates the cause of the transfer.
-
Sound produced with a very fast movement of the tongue tip (a "front" trill) or the uvula (a "back" trill). Trills occur especially in Romance languages, for example in the Italian pronunciation of /r/. Example: rosso /rOsso/.
-
Call transfer method in which the application transfers a caller to a third party, while staying connected to both lines. The application can also monitor the call. Also referred to as "bridged transfer."
-
A collection and any confirmation associated with it. An optimal collection gets the desired information with a single try. Otherwise, the application must perform retries or attempt a new strategy for getting the information.
-
Text-To-Speech. The process of synthesizing audible speech from typed text.
-
A system prompt followed by a caller response. The caller response can be a hangup, silence, or even noise that triggers the recognizer; a recognition can exit after detecting speech, DTMF, a hangup, or after a timeout. Most transactional applications require multiple turns. For example, trading a stock or paying a credit card. Simpler applications (for example, an auto-attendant) require one or two turns. The more turns in a dialog, the more complex it tends to be to design.
-
User Datagram Protocol. Connectionless, unreliable transport protocol used for data streaming.
-
Command that can be spoken during any dialog state, such as “help”, “cancel”, or “exit”. Also called a global command.
-
Service that must be run to use the universal commands that are included with the platform.
-
Uniform Resource Locator. Global address of documents and other resources on the web; a URL is a kind of URI.
-
Uniform Resource Identifier. Generic term for all types of names and addresses that refer to objects on the Web.
-
Coordinated Universal Time.
-
Distinct chunk of caller speech, usually in response to a prompt, that is recognized using a specific active grammar. An utterance is referred to colloquially as an “utt.”
-
Voice Activity Detection, also known as speech activity detection or speech detection. Technique used in speech processing to detect the presence or absence of human speech. The main uses of VAD are in speech coding and speech recognition. It can facilitate speech processing, and can also deactivate some processes during non-speech sections of an audio session: for example, it can avoid unnecessary coding and transmission of silence packets in Voice over Internet Protocol applications, saving computation and network bandwidth.
-
System performance statistics presented by Management Station for the network or specific hosts or services.
-
Set of words that can be understood as a part of an application. The grammar determines the sequences of these words that are allowed. For example, both cents and dollars are in the vocabulary of the currency built-in grammars, but they can only be said in particular locations in the phrase.
-
Nuance text-to-speech engine. Using Nuance-supplied voice and language data, optionally supplemented by application recordings and tuning data, Vocalizer speaks computer-supplied text in a human voice.
-
A voc(abulary) delta file (or VocDelta) is a file that encapsulates a set of changes from/to the currently loaded vocabulary, such as adding and removing words, moving words between states, setting pronunciations, and building topic slots. Use of the VocDelta allows for grouping/isolating of changes from the current active vocabulary, so that changes may be restored later or transferred to a speaker profile.
-
Software application that works with various markup languages to interpret VoiceXML content and invoke other services as necessary to interpret voice input and generate voice output.
-
Process of adding phrases to a dynamic grammar through a voice interface—that is, by speaking them. Grammars created with this mechanism are speaker-dependent: because the pronunciations are generated based on the caller’s spoken input, they should only be used for recognition with that speaker. Also called speaker-dependent grammar or speaker-trained recognition.
-
Complete set of recorded audio files used by Nuance Vocalizer for Network to enable a TTS persona. Each voice pack provides a male or a female voice for a particular locale (for example, for Canadian French, American English, or Australian English).
-
The Nuance Voice Platform for Speech Suite 11 is a carrier-grade VoiceXML platform that supports voice applications using open web standards. The Voice Platform provides a complete, off-the-shelf solution to develop, deploy, and monitor voice applications implemented in VoiceXML.
-
Quality of a sound that involves vibration of the vocal folds. Example: /v/ as in wives /waIvz/.
-
Quality of a sound that does not involve the vibration of the vocal folds. Example: /f/ as in wife /waIf/.
-
Matrix of numbers reflecting physical characteristics of a person’s vocal tract, as well as behavioral characteristics of the way a person speaks.
-
Voice-enabled markup language based on the Extensible Markup Language (XML).
-
Voice-over-IP. Transmission of analog sound as digital data across an Internet Protocol (IP) network.
-
Voiced sound produced without any central obstruction in the oral tract. Examples: /i:/, /e/, /A:/, /U/, /Q/, and so on.
-
Voice Response Unit, also known as Interactive Voice Response or IVR. Technology that allows communication between humans and an automated computer system using voice and/or touchtone.
-
Voice User Interface. Voice application equivalent of a Graphical User Interface (GUI). It is the application's presentation of audio to callers, with regard to usability.
-
World Wide Web Consortium. Principal body that defines standards for the World Wide Web (WWW).
-
Wide Area Network. Computer network covering a broad physical area, typically using leased communication lines and linking together many smaller LANs.
-
Service that provides communication services between Management Station and a managed host. Runs on each managed host as a Windows native service or Unix daemon.
-
WebSocket is a communications protocol that provides full-duplex communications channels over a single TCP connection.
-
Weight values set the relative importance between different grammar packages when processing speech input. Weights apply to speech recognition only and have no impact on meaning extraction; therefore, recognition weights are relevant to Nuance Recognizer (dynamic grammars) and to the Krypton recognition engine (not to NLE/NTpE).
-
Recording of a complete conversation, that is, a realtime mixed capture of both the inbound and outbound audio streams of a call (a particular RTSP [MRCPv1] or SIP [MRCPv2] session) exactly as they occurred.
-
A wordset is a set of words that customize the vocabulary used by an application at runtime. For example, an application might use wordsets to fetch identified user-specific information to add recognizable values into a grammar (such as the appropriate bank account information for a specific user). The Krypton recognition engine and NLE use wordsets for dynamic content injection, whereas Nuance Recognizer uses dynamic(-link) grammars.
-
Wizard of Oz. Form of usability testing that simulates the call flow of a voice application. A human ("the man behind the curtain") uses a well-defined script to impersonate the prompts and responses of the application (with no improvisation). Test subjects place calls to the pseudo-application. A WOZ test enables quick and inexpensive feedback on VUI concepts before implementation.
-
WebSocket Secure. WebSocket over TLS.
-
eXtensible Markup Language. Set of rules for encoding documents based on several specifications produced by the W3C and others.
-
Human-readable data serialization language commonly used for configuration files.
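A small illustrative YAML fragment (the keys are hypothetical):

```yaml
# Example configuration file.
server:
  host: 0.0.0.0
  port: 8080
logging:
  level: info
  outputs:
    - console
    - file
```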
-
Apache ZooKeeper provides a centralized configuration and coordination service to distributed processes.