Speech terms and abbreviations
-
Call in which the caller hung up before making a reasonable attempt to complete the task.
-
Augmented Backus-Naur Form. One of the two acceptable formats for W3C grammars. Nuance does not use grammars of this format, but does include a utility for converting them into the XML format—abnf2xml.
-
Automatic Call Distribution. System that routes incoming calls to a group of agents with particular capabilities or availability.
-
Feature that analyzes task-specific data like recorded utterances and recognition results, and adapts acoustic models accordingly.
-
Also referred to as a speech model. Statistical model that allows the system to translate utterances into phonetic representations of speech. Derived by speech scientists from analyzing many different people speaking well-defined words and phrases.
-
Grammar (or set of concurrent grammars) being used for the current recognition.
-
A reference that maps a sentence, phrase, or word to a recording, so Vocalizer uses a recording when speaking that text. Those references can be explicit—using the ActivePrompt name—or they can be implicit, where the engine automatically searches the ActivePrompt database for each synthesis request, substituting ActivePrompts whenever the normalized input text matches an ActivePrompt's normalized input text and boundary constraints. ActivePrompt databases are created by Vocalizer Studio.
-
Dialog Module (application building block) that incorporates additional natural language capability, including the robust parsing of SmartListener, dialog shortcuts, and one-step correction techniques.
-
Sound that involves a non-nasal oral closure that develops into a fricative. In phonological terms, you can represent affricates as a single phoneme or a series of two phonemes. Example: /tS/.
-
Automatic Gain Control. Adaptive system that adjusts gain based on input levels to provide more consistent output levels.
-
Event generated by a Management Station service when an error condition occurs. Alarms can be minor, major, or critical.
-
The phones that together form a certain phoneme. For a given phoneme, the actual sounds that are produced may vary. This depends on phoneme context, a speaker’s character, accent, social class, and so on. Example: The phoneme /l/ represents two allophones, the clear l /l/, which occurs before a vowel sound or /j/ (for example, like /laIk/), and the dark l /5/, which occurs elsewhere (for example, milk /mI5k/).
-
The application linguistic model (ALM) is consumed by the Nuance Text Processing Engine. It is generated by Nuance Experience Studio and packaged within the semantic model, and includes the extra vocabulary and text transformation rules specific to the customer’s application domain.
-
Case in a voice application where a recognized utterance maps to more than one natural language result in the current grammar.
-
Automatic Number Identification. Telephony feature that provides the called party with the caller ID of the calling party.
-
Application program interface (API). Specification of routines, data structures, object classes, and protocols for programmers to communicate with a software system.
-
Speech-enabled program that callers interact with. For the enterprise market, the application is typically written in VoiceXML and runs on a voice browser.
-
The application linguistic model (ALM) is consumed by the Nuance Text Processing Engine. It is generated by Nuance Experience Studio and packaged within the semantic model, and includes the extra vocabulary and text transformation rules specific to the customer’s application domain.
-
The artifacts required by Nuance Recognizer and Dragon Voice applications are different. Currently, the artifacts (models) required by Dragon Voice are produced only via Nuance Experience Studio. Contact Nuance about obtaining access to Experience Studio. For information on the models used by Nuance Recognizer, see the Recognizer grammar guide (Reference section).
-
Application Service Provider. A business that provides computer-based services to customers over a network. With regard to speech services, the ASP hosts applications for companies (customers), and the applications receive telephone calls from users (customers of the customers). May also refer to Active Server Page: Web pages created dynamically using HTML, scripts, and reusable ActiveX components.
-
Automatic Speech Recognition. Conversion of spoken words to interpretable text. Nuance Recognizer and Krypton are ASR engines.
-
What happens when the caller interrupts a prompt. This feature can be turned on to let the caller speak without waiting for direction, or off to make sure the entire prompt plays.
-
Telephony call transfer method in which a caller calls an application, the application connects the caller to a different number, then disconnects from the call. There is no feedback on the success or failure of the operation.
-
Expression that evaluates to one of two logical values: TRUE or FALSE (1 or 0). The expression can be a parameter, variable, or ECMAScript expression.
-
Basic Rate Interface. Telephony configuration in the physical layer as defined for ISDN networks. BRI provides two “bearer” channels for communication, and one “delta” channel for voice or user data.
-
Call transfer method in which the application transfers a caller to a third party, while staying connected to both lines. The application can also monitor the call.
-
Logical port within a voice browser service instance that communicates with corresponding telephony session service and Speech Server ports. Each browser instance handles one call.
-
Type of grammar that is provided directly by the platform. Nuance provides the built-in grammar types defined in the VoiceXML specification, as well as some standard universal command grammars and additional Nuance types for different languages.
-
Correct Acceptance Rate. Percentage of in-grammar utterances that were recognized and accepted correctly.
-
Logical flow of a voice application, including various dialog states, primary paths of informational exchanges, transaction outcomes, and decision logic.
-
Text file that records activity performed during a single call session. The log contains events generated by the application as well as by other components such as Nuance Recognizer, Nuance Vocalizer for Network, and so on.
-
Human who is talking to the speech recognition system. (“Caller” is used even when the system initiates an outbound call.)
-
Events logged by an application during the execution of a call.
-
Recognition rate observed by the caller, taking into account out-of-grammar utterances. It is calculated by subtracting the out-of-grammar rate from 100%, then applying the in-grammar accuracy to the remaining percentage. For example, with a 16% OOG and 95% accuracy, the CPA is (100-16%) * 95% = 79.8%.
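As an illustrative sketch (the function name is mine, not part of any Nuance API), the CPA arithmetic in the example above can be written as:

```python
def caller_perceived_accuracy(oog_rate: float, in_grammar_accuracy: float) -> float:
    """CPA = (1 - OOG rate) * in-grammar accuracy."""
    return (1.0 - oog_rate) * in_grammar_accuracy

# 16% OOG with 95% in-grammar accuracy -> 79.8% CPA
print(round(caller_perceived_accuracy(0.16, 0.95), 3))  # 0.798
```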
-
Time between when the caller stops speaking and when the recognizer returns a response.
-
Channel Associated Signaling. Telephony signaling method that uses routing information to send the payload to its destination. Also known as Per-Trunk Signaling (PTS). With CAS, the routing information is transmitted in the same channel as the payload itself.
-
Resources required to handle a single call. A system with 24 channels, for example, can handle 24 simultaneous calls.
-
SLM to which you cannot add new words.
-
Two or more hosts running the entire set of speech services, with each host configured to perform a specific role. Each cluster includes a primary Management Station.
-
Utterance spoken by the caller in response to a prompt.
-
System for expressing phonemes of the IPA (International Phonetic Alphabet) in notation that can be typed on a standard ASCII keyboard. SAMPA is one example.
-
Value assigned to each recognition result to indicate how closely the caller speech matches the recognition result. The higher the score, the more likely it is that the recognition result matches what the caller said. The confidence score is compared to a confidence threshold to determine the application’s behavior.
-
Value of the confidence score below which the behavior of the application changes. Usually expressed as a number between 0 and 999. For example, an application can use a threshold of 200 to decide when to reprompt. There can be two thresholds, the high confidence threshold, below which to confirm, and the low confidence threshold, below which to reject. Adjusting the confidence threshold trades off between false acceptances and false rejections.
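A minimal sketch of the two-threshold behavior described above; the threshold values 600 and 200 are hypothetical placeholders that a real application would tune:

```python
def decide(score: int, high: int = 600, low: int = 200) -> str:
    """Map a confidence score (0-999) to an application action.

    Thresholds here are illustrative, not recommended values.
    """
    if score >= high:
        return "accept"
    if score >= low:
        return "confirm"   # ask the caller to confirm the result
    return "reject"        # reprompt the caller

print(decide(750), decide(400), decide(120))
```

Raising `high` or `low` trades false acceptances for false rejections, as the definition notes.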
-
Asking the caller to indicate whether the system correctly interpreted earlier utterances.
-
Sound produced with a central obstruction in the vocal tract. According to the form of obstruction one distinguishes between different categories of consonants such as fricatives, laterals, nasals, and so on. Examples: /l/, /m/, /f/, /s/, and so on.
-
Call transfer method where a caller calls the application and the application attempts to connect the caller to the called party. If the attempt fails, the application continues interacting with the caller (the VoiceXML document interpretation proceeds). If the attempt succeeds, the application completes the transfer and disconnects from the call. You can perform a consultation transfer with or without far-end dialog.
-
Segments of caller speech that precede or follow a specific word or phrase. In speech recognition, the context surrounding a word is often significant in determining its meaning.
-
File that specifies parameter values to be used during recognition at particular stages of a voice application.
-
Words, phrases, and sentences spoken fluidly without overt pauses between each word. The technology that recognizes continuous speech is more advanced than a discrete speech recognition system (which does require pauses), but requires more tuning to improve accuracy. Nuance Recognizer and Krypton are continuous-speech engines.
-
(CA) A recognition in which an utterance is correctly recognized and accepted.
-
(CR) A recognition in which an utterance is correctly rejected as being out-of-grammar.
-
Caller Perceived Accuracy. Recognition rate observed by the caller, taking into account out-of-grammar utterances. It is calculated by subtracting the out-of-grammar rate from 100%, then applying the in-grammar accuracy to the remaining percentage. For example, with a 16% OOG and 95% accuracy, the CPA is (100-16%) * 95% = 79.8%.
-
Concatenative Prompt Recording
-
Central Processing Unit. Part of a computer that fetches, decodes, and executes instructions; attached directly to the memory. The CPU contains an arithmetic logic unit, which performs calculations and logical operations, and a control unit, which decodes and executes instructions.
-
Correct Rejection Rate. Percentage of out-of-grammar utterances that were rejected correctly.
-
A CSV is a comma-separated values file, which stores tabular data in plain text (using commas to separate the values in each field of data). This file format type is commonly used to transfer information from one software application or system to another.
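A short round-trip through the CSV format using Python's standard library (the column names are invented for illustration):

```python
import csv
import io

# Write a small table to CSV text, then read it back.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerows([["term", "score"], ["yes", 870], ["no", 645]])

rows = list(csv.reader(io.StringIO(buf.getvalue())))
print(rows)  # note: CSV is plain text, so values come back as strings
```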
-
A data pack is a set of data files that configure the Krypton recognition engine and the Nuance Text Processing Engine for a particular language. The data pack consists of an acoustic model, language model, parameter files, and other configuration files.
-
Database events may be logged by applications when performing back-end database transactions. A database event is logged as a pair of events: begin-query and end-query. These events are logged by applications, not platforms.
-
A data pack is a set of data files that configure the Krypton recognition engine and the Nuance Text Processing Engine for a particular language. The data pack consists of an acoustic model, language model, parameter files, and other configuration files.
-
decibel(s). Unit for measuring relative power ratios in terms of gain or loss.
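As a sketch, a power ratio is converted to decibels with 10 * log10(P_out / P_in):

```python
import math

def db_gain(p_out: float, p_in: float) -> float:
    """Express a power ratio in decibels."""
    return 10.0 * math.log10(p_out / p_in)

print(round(db_gain(2.0, 1.0), 2))  # doubling power is about +3 dB
print(db_gain(1.0, 10.0))           # a tenfold power loss is -10 dB
```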
-
Speech density. Percentage of time during a call occupied by the caller’s speech (as opposed to prompts or database queries, and so on). Sometimes called the recognizer duty cycle. Since speech uses recognition resources, applications with high speech density tend to require more resources.
-
Text file containing detailed information about the internal functions of Nuance speech components. This information can include entries at various levels of urgency ranging from alarms to normal status. They can also include informational messages generated by a service or application. Diagnostic logs are separate from the call logs, which contain application-, session-, and call-specific information.
-
Combination of international prefix and regional codes that must be dialed to reach another point in the network.
-
Interaction between a caller and a voice application. A single unit of interaction or single transaction is often referred to as a dialog state.
-
Application building block that provides sharable, common speech recognition tasks. For example, Dialog Modules can understand the formats of dates, monetary units, zip codes, phone numbers, and they automatically ask callers to confirm or repeat information when the confidence score is low.
-
The active grammar and other recognition parameters that are in effect for a particular Dialog Module role. For example, the confirmation consists of a grammar that understands “yes”, “no”, and synonyms.
-
Type of information being recognized—as related to the Dialog Module call flow: collection, confirmation, fallback, or disambiguation.
-
Exit status of a Dialog Module.
-
Direct Inward Dialing. Service that allocates a block of telephone numbers for calling into a private branch exchange (PBX) system without having a physical telephone line for each number. The PBX automatically switches incoming calls for a given phone number to the appropriate workstation.
-
Vowel with a single noticeable change in quality during a syllable. Diphthongs are usually subdivided into descending (falling) and ascending (rising) diphthongs depending on which of the two elements is stressed. In descending diphthongs the first element is more sonorous (as in all English diphthongs), whereas in ascending diphthongs the second element is more sonorous. Example: /aI/ as in time (descending diphthong).
-
Dialog that requests one piece of information at a time in a particular order.
-
Method used to clarify when the recognized item has more than one possible meaning.
-
Speech in which there are distinct silences between words. Discrete speech renders speech recognition relatively simple, but is stilted and unnatural.
-
A Speech Suite installation with different components on different hosts in a computer network.
-
Dialed Number Identification Service. Telephony feature that identifies the called number. This information can be used to route the call to the appropriate service based on the number that was called.
-
The Krypton recognition engine uses domain language models to identify the words and phrases most likely spoken by users of your application. Domain language models are overlaid on the factory or base data pack Krypton initializes on startup, to provide a vocabulary for the application.
-
Nuance Speech Suite components responsible for bringing virtual-assistant style conversation to the IVR experience. Includes the Krypton recognition engine, Natural Language Engine, and the Nuance Text Processing Engine, with the Natural Language Processing service responsible for managing the requests for Dragon Voice recognition and interpretation processing.
-
Digital Signal Processor. Microprocessor optimized for the large number of mathematical calculations needed for digital signaling.
-
Digital Telephony Interface. Circuit board that provides the necessary voice and fax resources that utilize a T1/ISDN connection.
-
Dual-Tone Multi-Frequency. Also known as touchtone. Two-tone signal representing the digits 0-9, *, and #. Each DTMF signal is composed of one tone from a high-frequency group of tones, and a second tone from a low-frequency group.
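The standard DTMF tone pairs can be tabulated directly; each key combines one low-group and one high-group frequency (values in Hz):

```python
# Rows select the low-group tone, columns the high-group tone.
LOW = [697, 770, 852, 941]
HIGH = [1209, 1336, 1477, 1633]
KEYS = ["123A", "456B", "789C", "*0#D"]

DTMF = {key: (LOW[r], HIGH[c])
        for r, row in enumerate(KEYS)
        for c, key in enumerate(row)}

print(DTMF["5"])  # (770, 1336)
```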
-
Feature allowing power users to enter a sequence of DTMF tones to navigate directly through a sequence of menus in a voice application without waiting to hear the prompts for each menu.
-
Grammar that can be dynamically created and modified by an application at runtime. Dynamic grammars enable applications to recognize items that cannot be known in advance. Dynamic grammars can be a W3C XML or a compiled grammar file (*.gram) referenced using external rule references. Nuance Recognizer uses dynamic grammars, whereas the Krypton recognition engine and Natural Language Engine use wordsets to customize the vocabulary at runtime.
-
European digital transmission format, which transmits at a rate of 2.048 Mbps; equivalent to the North American T1. Each signal can carry 32 channels of 64 Kbps each. E1 and T1 lines can be interconnected for international use.
-
Landmark sounds identifying place and functionality to the caller. An “icon” that is heard instead of seen.
-
Call transfer method that allows one application to listen to audio being recorded by another application.
-
Reflection of transmitted signal energy back to the sender. The echo of prompts played by the application to callers can trigger false barge-ins, causing degraded recognition performance.
-
When a prompt is played, the circuitry within the telephone network may echo it back to the recognizer. An echo cancellation algorithm softens the echo so that the recognizer does not treat the prompt as speech from the caller when the caller barges-in during the prompt.
-
Industry-standard scripting language, standardized as ECMA-262; JavaScript is its best-known dialect.
-
Extensible MultiModal Annotation markup language. XML-based representation of the semantic meaning of recognition results. The semantic meaning can often be more useful to applications than literal results. The language was developed by the World Wide Web Consortium (W3C). It replaces NLSML.
-
Point in time when a caller stops talking. It is important to detect the end-of-speech quickly so the caller does not experience a delay in response time. This detection process is referred to as endpointing.
-
Process of detecting when the caller stops talking. Detecting the end-of-speech immediately is important so the caller does not experience a delay in response time. This detection process is referred to as endpointing.
-
Used to describe both initial speech detection and end-of-speech detection.
-
Process of detecting the start and end of speech by distinguishing leading or trailing background noise/silence from an utterance before sending it to the recognizer.
-
An engine is a program or component that performs a core or essential function for other programs or components. For example, the Krypton recognition engine performs realtime large-vocabulary continuous recognition, the Natural Language Engine provides ontology-/concept-based semantic processing, and the Nuance Text Processing Engine provides tokenization and normalization of text. Each of these engines plays a crucial role in the Speech Suite system.
-
Message generated by a service and stored in a call or diagnostic log. These logs can be collected by Management Station and can be analyzed using various tools. Also referred to as "alarm."
-
Rule in a grammar that is expressed by the contents of another grammar on the file system or network.
-
Separate speech detector that runs as a pre-processor to the internal speech detector provided with Nuance Recognizer. Nuance Speech Server provides an external speech detector that establishes the beginning of speech before the utterance is forwarded to the recognizer for recognition.
-
False Acceptance (FA) rate. Percentage of utterances that were accepted incorrectly; that is, the recognizer returned an n-best but should not have. For tuning purposes, raising the confidence threshold reduces the false acceptance rate, but increases the false rejection rate.
-
Call flow strategy for collecting information after initial attempts fail. A typical fallback method is spelling. A fallback can be followed by a confirmation.
-
A recognition in which an utterance is accepted after an incorrect recognition. Such an utterance may be an out-of-grammar utterance that was wrongly recognized as in-grammar, or an in-grammar utterance that was incorrectly recognized as a different in-grammar utterance.
-
A recognition in which an utterance is incorrectly rejected as being out-of-grammar when it is in fact in-grammar.
-
The caller doesn’t know the option, but can describe the problem: “I have a strange charge on my credit card statement.” This type is addressed by SpeakFreely technology.
-
Interaction with the third-party to determine whether the transfer can be completed, such as in collect call applications. Far-end dialogs are supported by bridged transfers and consultation transfers.
-
Front-end processing that performs the spectral analysis of the audio data.
-
Service responsible for moving log files and other data from managed hosts to the Management Station host.
-
Speech process in which a voiced consonant at the end of a word is pronounced as its voiceless counterpart. Final devoicing (Auslautverhärtung) occurs in German and Dutch. Example: Hund /hUnt/ as opposed to Hunde /hUnd@/.
-
Behavior model composed of a finite number of states, transitions between those states, and actions, similar to a flow graph in which one can inspect how the logic runs when certain conditions are met. An FSM begins in one of the states (the start state), moves through transitions to other states depending on its input, and can end in any state; however, only a designated set of states (the accept states) mark a successful flow of operation. Finite state machines can solve a large number of problems; in linguistics, they are used to describe the grammars of natural languages.
-
Nuance Recognizer capability that allows callers to provide longer, unstructured input. Typically this means grammars of more than 100k words and utterances of more than 10 seconds. “Fluent speech” uses technologies including very large statistical language models (using interpolated LMs), high-accuracy speech recognition, fast-match (the large list of words is filtered by an initial “fast” search), and tight control over memory and CPU usage for long duration utterances.
-
False Rejection (FR) rate. Percentage of in-grammar utterances that were rejected but should not have been. For tuning purposes, lowering the confidence threshold reduces the false rejection rate, but increases the false acceptance rate.
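The FA and FR rates above are simple ratios over labeled tuning data; a sketch with invented counts (8 of 200 out-of-grammar utterances wrongly accepted, 30 of 1000 in-grammar utterances wrongly rejected):

```python
def fa_rate(accepted_oog: int, total_oog: int) -> float:
    """False acceptance: out-of-grammar utterances wrongly accepted."""
    return accepted_oog / total_oog

def fr_rate(rejected_ig: int, total_ig: int) -> float:
    """False rejection: in-grammar utterances wrongly rejected."""
    return rejected_ig / total_ig

print(fa_rate(8, 200), fr_rate(30, 1000))  # 0.04 0.03
```

Moving the confidence threshold shifts utterances between these two buckets, which is the tradeoff both definitions describe.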
-
Sound that involves a turbulent airstream within the vocal tract and at least one obstruction such as the teeth, the lips, and so on. Examples: /s/, /S/, /Z/, /f/, and so on.
-
A finite state machine (FSM) is a mathematical model of computation consisting of a finite set of states.
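A minimal deterministic FSM sketch (the states and input symbols are invented): it accepts a sequence of digits terminated by "#", rejecting anything without a valid transition.

```python
# Transition table: (current state, input symbol) -> next state.
TRANSITIONS = {
    ("collect", "digit"): "collect",
    ("collect", "#"): "done",
}

def run(symbols) -> bool:
    state = "collect"                 # start state
    for s in symbols:
        state = TRANSITIONS.get((state, s))
        if state is None:
            return False              # no transition defined: reject
    return state == "done"            # "done" is the only accept state

print(run(["digit", "digit", "#"]))   # True
print(run(["digit", "#", "digit"]))   # False
```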
-
FTP (File Transfer Protocol) is a standard network protocol for exchanging files over the Internet.
-
Type of circuit that allows simultaneous two-way data transmission.
-
Natural language processing that requires each word in the utterance to be parsed to fill slots in the grammar.
-
Doubled consonant sound. Example: unknown /Vnn@Un/.
-
Command that can be spoken during any dialog state, such as “help”, “cancel”, or “exit”. Also called a universal command.
-
Sound produced by a closure of the glottis and its subsequent release. In our transcriptions we generally do not transcribe glottal stops.
-
Speech science algorithm first introduced by Good and Turing, typically used to estimate probabilities from counts.
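A sketch of the basic Good-Turing adjustment, r* = (r + 1) * N[r+1] / N[r], where N[r] is the number of items observed exactly r times (no smoothing of the N[r] values, so counts whose N[r+1] is zero are simply omitted; the toy word counts are invented):

```python
from collections import Counter

def good_turing_adjusted(counts):
    """Return Good-Turing adjusted counts r* = (r + 1) * N[r+1] / N[r]."""
    n = Counter(counts.values())  # frequency of frequencies N[r]
    return {w: (r + 1) * n[r + 1] / n[r]
            for w, r in counts.items() if n[r + 1]}

# Toy data: three words seen once, one word seen twice.
counts = {"hi": 1, "yes": 1, "no": 1, "stop": 2}
print(good_turing_adjusted(counts))  # each singleton adjusted to 2 * 1/3
```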
-
Words and word sequences the recognizer can recognize, and the interpretations for those utterances. For example, a currency grammar might allow you to say things like “thirty four dollars and twelve cents” but not “cents dollars forty four.” Grammars are often specified in GrXML.
-
Feature that calculates probabilities on grammars using task-specific data and generates new, adapted grammars.
-
Part of a grammar that contains the most important meaning-bearing words.
-
Filename of the most frequently recognized grammar in a grammar set. Pointing to the grammar exemplar in a report shows the grammar set. A grammar can be the exemplar for several grammar sets.
-
File that contains a grammar definition. A Nuance grammar file must be a text file written in GrXML, or a compiled binary .gram file.
-
Part of a grammar containing words or expressions that do not convey significant meaning, such as “I’d like to,” “I’m,” “please.”
-
Component of a grammar used to define words and phrases the grammar can accept. Grammar rules can be nested one within another to create a complex grammar capable of interpreting long and detailed utterances.
-
Set of grammars active for a recognition.
-
Set of letters or letter combinations that represent a phoneme in a writing system. May also refer to a single letter or character in a system of writing.
-
Nuance shorthand for the syntax for grammars defined in the XML format of the W3C Speech Recognition Grammar Specification. May also refer to the file extension for such grammars. The current specification for GrXML is available on the Web at the W3C.
-
Call transfer method similar to a bridged transfer, but in which the application cannot monitor the audio path of the two connected parties.
-
Type of circuit that allows data transmission in two directions, but only one direction at a time.
-
Two adjacent vowels in different syllables coming together with or without a slight pause. Example: radius /reIdI@s/; the hiatus is in the sequence /I@/.
-
Characteristic of a system that is up most of the time, corresponding to 99.999% (five-nine) availability.
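As a quick arithmetic check, five-nines availability corresponds to only a few minutes of downtime per year:

```python
def downtime_per_year(availability: float) -> float:
    """Expected downtime in minutes per year for a given availability."""
    return (1.0 - availability) * 365 * 24 * 60

print(round(downtime_per_year(0.99999), 2))  # about 5.26 minutes per year
```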
-
Hidden Markov Model. Complex statistical model that provides a detailed spectral and temporal representation of the speech signal.
-
Two or more words that sound the same, but have different meanings—for example, “to”, “too”, and “two”. Since they all sound the same, the recognizer cannot tell which was spoken. The only way to determine the correct word is to consider the context.
-
Application developed for a specific tenant (customer) but hosted by an application service provider on a server that supports multiple applications from the same or other tenants.
-
Execution environment in which multiple hosted applications are supported. (Also multi-tenant environment.)
-
The unique, physical MAC address of the machine. Get the address with the "ipconfig /all" command, or use the FLEXNET tool: "lmutil lmhostid".
-
Update to acoustic speech recognition models while the system continues operation. Telephone callers do not perceive delays when these updates occur. Hot insert is nearly invisible to applications and voice browsers, although mechanisms are available to control the update process.
-
Type of recognition in which the recognizer listens for utterances, but does not respond unless a specific word or phrase is recognized. Hot word recognition is used by Nuance Recognizer for tasks such as interrupting a long TTS playback, or returning from a bridged transfer. Krypton does not support hot word recognition.
-
The Hypertext Transfer Protocol (HTTP) is the underlying protocol used by the World Wide Web and defines how messages are formatted and transmitted, and what actions Web servers and browsers should take in response to various commands. Hypertext is structured text that uses logical links (hyperlinks) between nodes containing text.
-
HTTPS (also called HTTP over TLS, HTTP over SSL, and HTTP Secure) is a protocol for secure communication over a computer network that is widely used on the Internet.
-
Process of identifying a speaker from a set of valid users, enabling a more personalized interaction. Speaker identification can be performed on a small set of users.
-
Internet Engineering Task Force. Organization that defines standards for Internet protocols.
-
Caller actively trying to get into the system using someone else’s account.
-
An utterance that can be recognized by a given grammar.
-
An utterance that can be recognized by the active grammar.
-
Representation of the meaning of a sentence. May also refer to the recognition of an utterance from text rather than audio.
-
Internet Protocol. Protocol used for communicating data across a packet-switched network using the Internet Protocol Suite (also referred to as TCP/IP). Each computer (known as a host) on the Internet has at least one IP address that uniquely identifies it from all other computers on the Internet.
-
Industry Standard Architecture. Computer bus standard for IBM-compatible computers, succeeded in 1987 by the Extended Industry Standard Architecture (EISA).
-
Integrated Services Digital Network. Communications standard for transmitting voice, audio, and data over digital or analog telephone wires. ISDN lines include B channels, which carry voice or data, and D channels, which carry control and signaling information.
-
International Organization for Standardization (often informally expanded as International Standards Organization).
-
ISDN User Part. Protocol layer used for circuit-switched connections.
-
Inverse Text Normalization
-
Interactive Voice Response. General-purpose system for developing and deploying telephony applications that perform automated operations and transactions to callers via voice and DTMF input.
-
Variation in arrival time of data packets during transmission through the IP network.
-
Java Native Interface. Programming framework that enables communication between native applications and applications running on a Java virtual machine.
-
JavaScript Object Notation. Lightweight data-interchange format that is easy for humans to read and write, and easy for machines to parse and generate.
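A sketch of JSON's round-trip property using Python's standard library (the recognition-style field names are invented):

```python
import json

result = {"literal": "check my balance", "confidence": 870, "nbest": 2}
text = json.dumps(result)            # serialize to a JSON string
print(json.loads(text) == result)    # parsing it back recovers the data
```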
-
Java Server Pages
-
Krypton is Nuance's enterprise-grade, realtime large vocabulary continuous speech recognition engine. The Krypton engine uses a WebSocket-based protocol to accept audio streams and asynchronously send back results as the recognition progresses. Krypton supports domain language models among other forms of specialization, allowing it to understand terms specific to a field of work/application.
-
Local Area Network. Computer network covering a small physical area such as a single office building. A typical LAN has high transmission rates and does not require leased telecommunications lines.
-
Language code as specified in RFC 1766.
-
Statistical model for the syntax of language constructs. The recognizer uses language models to bias it appropriately towards more common phrases. These models increase overall recognition accuracy.
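As a toy sketch of the idea (real engines use far larger corpora and smoothing), a maximum-likelihood bigram model estimates how likely one word is to follow another; the corpus below is invented:

```python
from collections import Counter

corpus = "check my balance please check my account".split()
bigrams = Counter(zip(corpus, corpus[1:]))  # counts of word pairs
unigrams = Counter(corpus[:-1])             # counts of preceding words

def p(word: str, prev: str) -> float:
    """P(word | prev) by maximum likelihood, no smoothing."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p("my", "check"))    # 1.0: "check" is always followed by "my" here
print(p("balance", "my"))  # 0.5: "my" precedes "balance" half the time
```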
-
Language-specific resources installed in addition to the recognition engine. The resources are needed to perform recognition in a particular language and locale (for example, American English, United Kingdom English, or Australian/New Zealand English). A language pack includes localized acoustic models, pronunciation dictionaries, and standard VoiceXML built-in grammar types like date and digits.
-
Caller-perceived delay. Specifically, duration of the delay between reception of audio (utterance) to the emission of the recognition response.
-
Sound that is articulated by allowing air to escape freely over one or both sides of the tongue. Example: /l/.
-
A key to enable ports for every licensed product. It is generated at the Nuance license fulfillment website and stored on a licensing server host.
-
Nuance License Manager. Service responsible for managing the allocation of licenses across a network.
-
Channel used to listen to audio recorded by another channel.
-
Text of the vocabulary item returned by the recognizer. Also called the recognized item or the raw text.
-
In SIP, the server used to identify the location of a caller in the server’s domain. A caller agent must register with a location server to be located by a proxy server for a VoIP call.
-
Magic-word mode is identical to selective barge-in, except that in magic-word mode the speech detector rejects candidates that are too short or too long before sending them to the recognizer. Krypton does not support magic-word recognition.
-
Nuance Management Station. Component that provides centralized management of a Nuance network.
-
Channel on which the audio is recorded during eavesdropping.
-
Minimally Formatted Form. Plain-text literal, consisting only of words, that indicates exactly what was recognized. Produced by Krypton and by the Nuance Text Processing Engine (NTpE).
-
Multipurpose Internet Mail Extensions
-
Dialog format that allows several pieces of information to be gathered at once, in any order, and prompts for missing items as necessary.
-
Statistical method to represent how each phoneme sounds. This method is used by most state-of-the-art recognizers.
-
Single vowel where there is no detectable change in quality during a syllable. Example: /V/ as in cut.
-
Smallest unit of characters that carries meaning. Morphemes are not necessarily identical with syllables. Examples: book, it, he. In dogs the morphemes are ‘dog’ and ‘s’ (‘s’ being the plural morpheme in that case); in disagreement the morphemes are ‘dis’, ‘agree’, and ‘ment’.
-
Media Resources Control Protocol. Communication protocol for speech servers to provide various speech services to an application.
-
Software module that communicates with Speech Server using MRCP.
-
Modular RECognizer. MREC is a language independent, highly configurable speech recognition engine used both by Nuance products (such as the Krypton recognition engine and Nuance Text Processing Engine) and for research. MREC provides a set of primitive speech recognition functions and requires middleware such as S3 to create a recognition system. It is used for realtime and batch dictation tasks for multiple products at Nuance.
-
Mean Time Between Failures. Total equipment uptime in a given period, divided by the number of failures in that period.
-
Mean Time To Repair or Replace. Total equipment downtime for a given period, divided by the number of failures in that period.
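MTBF and MTTR are often combined into a steady-state availability estimate. A minimal sketch, using invented figures for illustration:

```python
# Availability from MTBF and MTTR: the fraction of time a system is up.
# All figures here are hypothetical.
uptime_hours = 8742.0     # total uptime in the period
downtime_hours = 18.0     # total downtime in the period
failures = 3

mtbf = uptime_hours / failures      # mean time between failures
mttr = downtime_hours / failures    # mean time to repair

availability = mtbf / (mtbf + mttr)
print(f"MTBF={mtbf:.1f} h, MTTR={mttr:.1f} h, availability={availability:.4%}")
```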
-
Digital encoding standard used to store audio recordings, such as prompts. Also called mu-law or µ-law.
-
Execution environment in which multiple hosted applications are supported. (Also hosted environment).
-
Mutually Exclusive. Function used to synchronize processes operating on shared data so they don’t interfere with each other.
-
List of N possible recognition results, ranked from highest to lowest likelihood by confidence score, where the application configures the value of N. The recognizer returns an n-best list of varying length to the application depending upon the ambiguity of the recognition. Typically, applications use the first result, but they can refer to other items to deduce the caller’s intended meaning.
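A sketch of how an application might consume such a list; the field names are illustrative, not an actual Nuance API:

```python
# Hypothetical n-best list as a recognizer might return it,
# already ranked from highest to lowest confidence.
n_best = [
    {"literal": "boston",  "confidence": 0.82},
    {"literal": "austin",  "confidence": 0.61},
    {"literal": "houston", "confidence": 0.34},
]

# Typical usage: take the top result, but keep the alternatives
# available in case the caller rejects it during confirmation.
top = n_best[0]["literal"]
alternatives = [r["literal"] for r in n_best[1:]]
print(top, alternatives)
```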
-
Sequence of N consecutive words (w1w2...wN). The words w1,w2,...wN-1 are called the predecessors of wN. For example, the sequence “I would like” is a trigram where “I” and “would” are the predecessors of “like”.
-
Statistical language model in which the probability of the occurrence of any word is completely determined by a fixed number of preceding words. In an n-gram model of order N, the probability of the m-th word in a string depends only on the N-1 words that precede it. The n-gram model is the most widely used type of model in speech recognition. All word sequences are allowed, but likely phrases are assigned higher probabilities.
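A maximum-likelihood bigram (order-2) model can be estimated by simple counting. A toy sketch over a made-up corpus:

```python
from collections import Counter

# Toy corpus; a real SLM is trained on large amounts of transcribed speech.
corpus = "i would like more i would like less i would want more".split()

# Count bigrams and their predecessor words.
bigrams = Counter(zip(corpus, corpus[1:]))
predecessors = Counter(corpus[:-1])

def p(word, prev):
    """Maximum-likelihood bigram probability P(word | prev)."""
    return bigrams[(prev, word)] / predecessors[prev]

# After "would", this corpus makes "like" more probable than "want".
print(p("like", "would"), p("want", "would"))
```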
-
Sparing is a provisioning technique that provides standby hardware that can be quickly substituted for any systems that fail in your network. This is often referred to as N+1 sparing or N:M sparing, where N is the number of machines active at a given time and M is some number between 1 and N indicating how many standby systems you have available.
-
Nuance Application Studio is a web-based tool for designing and developing speech and touchtone applications. It streamlines the design process, facilitates communication with stakeholders, and generates code and other collateral.
-
Sound where a part or the whole of the respiratory stream escapes through the nose. Examples: /m/, /n/, /N/.
-
The Natural Language Engine is Nuance's enterprise-grade text-to-meaning engine or semantic engine. NLE provides ontology- and concept-based semantic processing. NLE takes as input the token sequence provided by the Nuance Text Processing Engine and from this input identifies the intent and/or meanings expressed in the human-machine turn. The outcome from NLE is typically used to drive the next machine-human turn.
-
Natural language processing (NLP) is the ability of a computer program, system, or application to understand human speech as it is spoken.
-
The Natural Language Processing service (or NLP service) provides a single interface to the Krypton recognition engine and to Natural Language Engine services. The NLP service also communicates with the Nuance Resource Manager to allocate a suitable Krypton and NLE resource based on the capabilities requested by the voice application.
-
Speech recognition techniques that permit a caller to answer a prompt with a full phrase or sentence, as in everyday conversation. Typically, natural speech is longer in duration and has a broad range of possible meanings. A grammar capable of natural language understanding must accept a wide variety of different phrases.
-
Tool for creating models used by Dragon Voice components.
-
The caller knows and says the option, surrounded by filler words: “I think I want billing,” “I have a question about my bill.” This is the most common out-of-grammar type.
-
Network Equipment-Building System. The most common set of safety, spatial and environmental design guidelines applied to telecommunications equipment in the United States.
-
Nuance Experience Studio is a web-based tool for creating natural language understanding models for virtual assistant and call steering applications. These models or artifacts enable the application to understand what users mean when they contact the application.
-
Statistical method used in phoneme classification to represent how each phoneme sounds. It can be used in conjunction with a mixture Gaussian model or on its own.
-
Nuance Insights for IVR is an analysis and reporting tool that provides usage information about speech, mobile, and touchtone applications based on call logs and audio files collected from applications.
-
The Natural Language Engine is Nuance's enterprise-grade text-to-meaning engine or semantic engine. NLE provides ontology- and concept-based semantic processing. NLE takes as input the token sequence provided by the Nuance Text Processing Engine and from this input identifies the intent and/or meanings expressed in the human-machine turn. The outcome from NLE is typically used to drive the next machine-human turn.
-
Natural language processing (NLP) is the ability of a computer program, system, or application to understand human speech as it is spoken.
-
The Natural Language Processing service (or NLP service) provides a single interface to the Krypton recognition engine and to Natural Language Engine services. The NLP service also communicates with the Nuance Resource Manager to allocate a suitable Krypton and NLE resource based on the capabilities requested by the voice application.
-
Natural Language Semantic Markup Language. XML-based representation of the semantic meaning of recognition results. The semantic meaning can often be more useful to applications than literal results. The language was developed by the World Wide Web Consortium (W3C). It has been superseded by EMMA.
-
Natural Language Understanding (NLU) is a speech recognition technique (or set of techniques) that permits a caller to answer a prompt with a full phrase or sentence, as in everyday conversation. Typically, natural speech is longer in duration and has a broad range of possible meanings. A grammar capable of natural language understanding must accept a wide variety of different phrases.
-
Nuance On Demand. Application hosting service that can handle many tenants and applications on the same pool of servers.
-
In a reference architecture each node is considered a self-contained entity, comprising all the elements needed to deploy service, including application servers, database servers, and clusters of hosts.
-
Single-letter codes, in brackets, that transcribers use to denote certain events in the utterance; for example, the transcription string “[n]” indicates noise, and the string “[c] hello” indicates the caller coughed and then said “hello.”
-
Process by which text (words/tokens) is converted to a standard form based on global parameter settings and token-specific rule settings. For example, to expand abbreviations ("PIN" to "personal identification number"), or to convert currency symbols into full words (currency symbols to "dollars" or "euros").
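A minimal sketch of this kind of rule-driven normalization. The rule tables below are invented for illustration; real engines use configurable, locale-specific rule sets:

```python
import re

# Hypothetical rule tables, for illustration only.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street"}
CURRENCY = {"$": "dollars", "€": "euros"}

def normalize(text):
    # Expand abbreviations by simple substitution.
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # "$5" -> "5 dollars": move the amount ahead of the spoken currency word.
    for sym, word in CURRENCY.items():
        text = re.sub(re.escape(sym) + r"(\d+)", r"\1 " + word, text)
    return text

print(normalize("Dr. Smith paid $5 on Main St."))
```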
-
Nuance recognition server. Service that provides access to Nuance Recognizer and a pool of preloaded grammars.
-
Nuance Speech Server. Central control and communication hub for speech-processing resources. Speech Server provides an open, protocol-based mechanism for using resources such as speech recognition and text-to-speech. Speech Server interacts with a voice platform via the MRCP protocol and with telephony via the RTP protocol.
-
The Nuance Text Processing Engine (NTpE) is Nuance's normalization and tokenization (or lexical analysis) engine. NTpE applies transformation rules and formats output for display or for further processing by semantic engines such as NLE.
-
Nuance Application Studio is a web-based tool for designing and developing speech and touchtone applications. It streamlines the design process, facilitates communication with stakeholders, and generates code and other collateral.
-
Nuance Experience Studio is a web-based tool for creating natural language understanding models for virtual assistant and call steering applications. These models or artifacts enable the application to understand what users mean when they contact the application.
-
Nuance Insights for IVR is an analysis and reporting tool that provides usage information about speech, mobile, and touchtone applications based on call logs and audio files collected from applications.
-
Nuance License Manager. Service responsible for managing the allocation of licenses across a network.
-
Component that provides centralized management of the network.
-
Application hosting service that can handle many tenants and applications on the same pool of servers.
-
Server that provides access to a Nuance Recognizer instance and a pool of preloaded grammars. Systems like directory assistance can use grammars that are so large that it is not possible to load them dynamically while handling a call. Therefore, these large grammars are preloaded on specific recognition servers during initialization. At runtime, the recognition requests are directed to the server serving the requested preloaded grammar.
-
Runtime component that performs speech recognition on an audio stream that begins with speech.
-
Distributes requests and provides load balancing to Dragon Voice recognition and interpretation resources (Krypton recognition engine, Natural Language Engine, and Nuance Text Processing Engine).
-
Central control and communication hub for speech-processing resources. Speech Server provides an open, protocol-based mechanism for using resources such as speech recognition and text-to-speech. Speech Server interacts with a voice platform via the MRCP protocol and with telephony via the RTP protocol.
-
The Nuance Text Processing Engine (NTpE) is Nuance's normalization and tokenization (or lexical analysis) engine. NTpE applies transformation rules and formats output for display or for further processing by semantic engines such as NLE.
-
Nuance text-to-speech engine. Using Nuance supplied voice and language data optionally supplemented by application recordings and tuning data, Vocalizer speaks computer supplied text in a human voice.
-
Operations, Administration, and Maintenance (or “Management”). Processes, procedures, and tools for operating, administering, managing, and maintaining a computer system or network.
-
Operations, Administration, and Maintenance (or “Management”). Processes, procedures, and tools for operating, administering, managing, and maintaining a computer system or network.
-
Sound where an obstruction in the vocal tract is sufficient to cause noise. Examples are all plosives, fricatives, and affricates. Sounds without such a noise component are called sonorants. Nuance does not, however, use this distinction in its description of sounds, because the distinction is irrelevant for these purposes.
-
Term used to represent a managed host that cannot be reached by Management Station. This typically means the watcher service is not currently running on that host.
-
Technique that allows a caller to respond to a yes/no confirmation with the correct answer, without having to go back to the original directed dialog prompt.
-
Words or phrases not specified in the active grammar as well as laughter, coughing, and so on.
-
Prompt that asks a question without expecting a response from a constrained list of options. For example, “How may I help you?”
-
SLM that contains a special class, called unknown (UNK), that lets you add new words to an SLM without retraining the model.
-
Set of rules that define how to write a language. A phonemic orthography is a writing system where each symbol (grapheme) corresponds to a single phoneme in the language.
-
Caller speech that cannot be parsed by a given grammar. Out-of-grammar errors are typically the greatest factor affecting caller-perceived accuracy. There are two types of out-of-grammar errors—near and far.
-
Words or phrases not specified in the active grammar as well as laughter, coughing, and so on.
-
Phrase known only by the caller. Password phrases add security to an application by asking the speaker to provide additional information known by that speaker only.
-
Data portion of a packet.
-
Private Branch Exchange. Telephone exchange that serves a particular business, as opposed to a public exchange operated by a telephone company.
-
Peripheral Component Interconnect. Hardware computer bus for attaching hardware devices.
-
Pulse Code Modulation. Digital transmission format used by traditional telephony applications, in which analog signals are sampled 8000 times per second.
-
PEM is a de facto file format for storing and sending cryptography keys, certificates, and other data, based on a set of IETF standards defining Privacy-Enhanced Mail. PEM is a standard format for OpenSSL and many other SSL tools.
-
Measure of the quality of a language model: that is, how well the model can predict phrases. Perplexity is a function of the model and the test set on which it is measured. For a given test set, the lower the perplexity, the better the model. Perplexity is also a measure of the complexity of the task: given two models of equivalent complexity trained for two different tasks, the task whose model has the higher perplexity is the more complex. Because perplexity is correlated with recognition accuracy, it can be used to tune applications.
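Perplexity can be computed as the exponential of the average negative log-probability the model assigns to the test words. A sketch with made-up per-word probabilities:

```python
import math

# Hypothetical probabilities a language model assigns to each word
# of a 4-word test phrase (invented for illustration).
word_probs = [0.25, 0.5, 0.125, 0.25]

# Perplexity = exp of the average negative log-probability.
n = len(word_probs)
perplexity = math.exp(-sum(math.log(p) for p in word_probs) / n)
print(round(perplexity, 3))
```

Here the geometric mean probability per word is 1/4, so the perplexity is 4: on average the model is as uncertain as a uniform choice among 4 words.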
-
Personality of a system defined for an application, reflected in the voice, language use, and the audio environment. The persona is based on the target audience, user feedback, and imagery associated with the company’s brand.
-
Perceptually distinct sound unit that does not yet represent a particular phoneme.
-
Basic unit of speech, representing a single distinguishable sound used in spoken language. Examples: /f/, /S/, /m/, /@/, /I/, and so on. Nuance Recognizer represents words as sequences of phonemes. Unlike older technologies, this allows you to recognize any word without having a special model for it.
-
Alphabet in which every symbol represents one and only one specific sound in a given language. A phonetic alphabet is useful for writing a pronunciation, because it leaves no room for ambiguity.
-
Process of determining each segment's phoneme.
-
Process that breaks a word down into its phonemes. Examples: full /fUl/, bin /bIn/, yes /jes/.
-
Public Key Cryptography Standards
-
Runtime environment into which Nuance speech components are integrated. The platform typically includes a voice browser, an MRCP client and a telephony interface.
-
Sound that may or may not be voiced and involves a non-nasal oral closure. Examples: /p/, /b/, /t/, /d/, /k/, and so on.
-
Virtual/logical data connection that can be used by programs to exchange data. One computer port is required to handle each call. For example, to handle 24 simultaneous calls, a system must have 24 ports. The more ports a given computer can handle, the more economical the system is to buy and maintain.
-
Point-to-Point Switching. Telephony switching between two end-points.
-
Morphological element placed before the root of a word. Examples: un- as in unseen or dis- as in disagree.
-
Primary Rate Interface. Integrated Services Digital Network (ISDN) interface most likely to be found in business service and is typically used for carrying multiple voice and data transmissions between two physical locations. PRI offers 23 bearer (B) channels for user payload, plus one data (D) channel for signaling and control, which is equivalent to the 24 channels of a T1 line. The European version, known as primary rate access (PRA), offers 30 B channels, plus one D channel, which is equivalent to the 31 channels of an E1 line. The B channels can be used individually to connect on demand to any other ISDN device, and multiple B channels can be bonded and treated as a single fast connection for bandwidth-intensive applications such as data file transfers, videoconferencing, and any multimedia combination.
-
Speech played to a caller either to ask a question or to provide feedback. For example, “What is your account number?” or “Please wait while I check flight information.” A prompt can be played from a pre-recorded file, generated from text (TTS), or a combination of the two. The audio content can be specified via a URI, or (in VoiceXML) as a previously recorded audio variable.
-
Transcription of the sound and stress patterns of a spoken word or phrase. This is used both in text-to-speech output, and in speech recognition to match utterances against text.
-
Intermediary program that routes SIP call requests within the network and performs SIP services.
-
Public Switched Telephone Network. Traditional telephony system, based on circuit-switched technology and using PCM conversion. The connection between the sender and receiver must be reserved before data transfer begins. Connection resources are dedicated to the circuit for the duration of the call session.
-
Quality of Service. Network requirements (such as latency and maximum packet loss) to support a specific application.
-
Text of the vocabulary item returned by the recognizer. Also called the literal or the recognized item.
-
Process of identifying and interpreting spoken language. Recognition is performed by a recognizer such as Nuance Recognizer or Krypton, which, in turn require grammars or models (respectively) to define the words and phrases that can be recognized.
-
Set of grammars, speech models, and parameters used when recognizing a specific utterance.
-
Host CPU or DSP card used to handle speech recognition computation. Recognition resources are often shared to save cost and physical space. For example, one DSP might handle 8 simultaneous telephone calls (in other words, the calls on 8 Ports). Sharing works well because typically only a few of the callers on different calls are speaking simultaneously. The remainder are listening to prompts or waiting for database queries to finish, and therefore not using the recognition resources.
-
All the information generated during the recognition operation, including a text transcription of the recognized phrase, an interpretation of the utterance, the number of words recognized, and the confidence score and overall probability of the result.
-
Event that occurs when there is no speech detected several seconds after a prompt has ended. Usually when this happens, the application tells the caller that it did not hear anything, and it reprompts the caller.
-
Text of the vocabulary item returned by the recognizer. Also called the literal or the raw text.
-
Runtime component that performs speech recognition on an audio stream that begins with speech.
-
Time between when the recognizer detects end-of-speech and the recognizer returns a result. The caller perceives a larger delay than this: see caller-perceived response time. Calculations like median and 90th percentile response time are approximate: they are typically precise to within 0.05 second below 0.8 seconds, to within 0.1 second below 2 seconds and less precise above that value.
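Such median and 90th-percentile figures can be computed with a simple nearest-rank method. A sketch over invented response times:

```python
import math

# Hypothetical recognizer response times in seconds (invented figures).
times = sorted([0.31, 0.42, 0.28, 0.55, 0.47, 0.62, 0.38, 0.91, 0.44, 0.50])

def percentile(data, pct):
    """Nearest-rank percentile on already-sorted data."""
    rank = math.ceil(pct / 100 * len(data))      # 1-based rank
    return data[min(rank, len(data)) - 1]

median = percentile(times, 50)
p90 = percentile(times, 90)
print(median, p90)
```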
-
Semantic value of the vocabulary item returned by the recognizer. The grammar associates a value for each item.
-
Classification of recognition outcomes.
-
Grammar that references itself.
-
Detecting utterances (words, coughing, laughter, and so on) not defined in the active vocabulary. When the recognizer rejects an utterance, the application prompts the caller to try again. For example: “Sorry, I didn't understand you. Please say the name again.” An utterance is rejected when its confidence score is so low that the system should not even attempt confirmation.
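A common dialog pattern compares the confidence score against two thresholds, rejecting below one and confirming below the other. A minimal sketch; the threshold values are invented, and real applications tune them:

```python
# Hypothetical thresholds; real values are tuned per application.
REJECT_BELOW = 0.30    # below this: reject and reprompt
CONFIRM_BELOW = 0.70   # below this (but above reject): confirm with the caller

def classify(confidence):
    """Map a recognition confidence score to a dialog action."""
    if confidence < REJECT_BELOW:
        return "reject"      # "Sorry, I didn't understand you..."
    if confidence < CONFIRM_BELOW:
        return "confirm"     # "Did you say ...?"
    return "accept"

print([classify(c) for c in (0.15, 0.55, 0.90)])
```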
-
Distributes requests and provides load balancing to Dragon Voice recognition and interpretation resources (Krypton recognition engine, Natural Language Engine, and Nuance Text Processing Engine).
-
Time between when the caller stops speaking and when the recognizer returns a response. The response time depends on several factors, including the recognizer’s efficiency, difficulty of the recognition task, system load, and other computation required to return the response (for example, querying a database).
-
After a recognition timeout or rejection, the application often prompts the caller again to speak. This is called a retry. A good user interface design gives the caller information about why the retry is needed and sometimes alerts the caller to fallback methods.
-
Request for Comments (IETF standard). Memorandum published by the Internet Engineering Task Force (IETF) describing methods, behaviors, research, or innovations applicable to the working of the Internet and Internet-connected systems. Through the Internet Society, engineers and computer scientists may publish discourse in the form of an RFC, either for peer review or simply to convey new concepts, information, or (occasionally) engineering humor. The IETF adopts some of the proposals published as RFCs as Internet standards.
-
Natural-language processing mechanism that allows the recognizer to fill natural language slots no matter where the core words appear in a phrase. This means that not all words in the utterance need to be parsed, so the grammar need not include every possible combination of core words, and can ignore grammar filler words.
-
Natural-language processing mechanism that allows the recognizer to fill natural language slots no matter where the core words appear in a phrase. This means that not all words in the utterance need to be parsed, so the grammar need not include every possible combination of core words, and can ignore grammar filler words.
-
A robust parsing grammar is able to identify the key items within a user utterance, while ignoring any dysfluencies or filler words that carry no significant meaning. A robust parsing grammar is a collection of SRGS rules (or concepts) that are applied to input text. Unlike a regular SRGS grammar, which applies rules in a specific order, a robust parsing grammar applies rules flexibly wherever they provide the best matches, and extracts meaning from those matching fragments.
-
Configuration defining a combination of services, number of instances of each service, and service property settings for each service, that can be assigned to a host through Management Station.
-
Real-Time Transport Control Protocol. Internet protocol that provides out-of-band statistics and control information for an RTP flow.
-
Real-Time Transport Protocol. Internet protocol used for transmitting data with realtime properties, including audio and video. RTP typically runs over UDP.
-
Grammar compilation mechanism that allows grammar content to be passed directly to the recognizer at runtime, compiled, recognized, then discarded.
-
S3 is middleware for managing a speech recognition workflow using MREC and TextProc, as implemented in the Krypton speech recognition engine.
-
Use of a subset of calls to an application instead of all calls. Using a subset saves disk space and computation time. For example, if you have a million calls a day to your system, you can sample only 20,000 a day. This enables statistics that are precise enough for analysis and reporting.
-
Microsoft Speech API
-
Session Description Protocol. IETF proposed standard for streaming parameters for media initialization.
-
Final step in the internal recognition process, where the recognizer searches for the word or phrase that most closely matches what the caller said.
-
Secure Sockets Layer (SSL) provides a secure channel between two machines or devices operating over the internet or an internal network. SSL was the most widely deployed cryptographic protocol to provide security over internet communications before it was superseded by TLS (Transport Layer Security).
-
Process of breaking speech into pieces that are then used to perform recognition. There are several methods for segmentation. Nuance Recognizer uses an approach where speech is broken into phonemes. Many other vendors break speech into pieces of fixed-duration that are independent of where the phonemic boundaries are. The Nuance Recognizer approach results in fewer segments to analyze and thus requires less CPU resources.
-
Prevents accidental interruption by allowing applications to define a small set of key words (to be spoken by callers) that trigger an intended barge-in. A client application that supports selective barge-in always listens for commands, whether the caller is speaking or listening to prompts. Selective barge-in interrupts the conversation or prompt only when it recognizes an utterance that is part of a predetermined grammar. Krypton does not support selective barge-in.
-
Mechanism by which the recognizer adapts its behavior based on previous results, with the aim of improving recognition accuracy.
-
Semantic interpretation allows utterances to be interpreted into structured objects that can be understood by an application.
-
Semantic models consist of concept grammars, grammars inferred from a training corpus, and trained classifiers, among other data. The Natural Language Engine consumes semantic models that are application- or project-specific.
-
Recognition result processing mechanism whereby the recognizer ensures that if multiple answers have the same natural language interpretation, only one of those answers is returned. This ensures that all answers returned in an n-best list have different semantic interpretations.
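The effect can be sketched as a first-match-wins filter over the n-best list; the field names here are illustrative, not an actual Nuance API:

```python
# Hypothetical n-best list in which two literals share one interpretation.
n_best = [
    {"literal": "five hundred", "interpretation": 500, "confidence": 0.80},
    {"literal": "5 hundred",    "interpretation": 500, "confidence": 0.72},
    {"literal": "nine hundred", "interpretation": 900, "confidence": 0.41},
]

# Keep only the highest-ranked answer for each distinct interpretation.
seen, deduped = set(), []
for result in n_best:           # list is already sorted by confidence
    if result["interpretation"] not in seen:
        seen.add(result["interpretation"])
        deduped.append(result)

print([r["literal"] for r in deduped])
```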
-
Sound functioning phonologically as a consonant, but sharing the phonetic qualities of a vowel. Example: /j/ as in yet /jet/.
-
GUI software development environment used to create a voice- and DTMF-based user interface and associated service logic for IVR applications.
-
Process participating in a service type.
-
Property set on a service to customize its behavior.
-
Logical entity identifying services that provide a specific feature.
-
Complete, continual interaction between a speaker and an application. In telephony environments, this is the duration of a call.
-
Unique identifier for each call received or placed. Typically generated at the start of a session, by the telephony interface for an inbound call, or by the VXML browser for an outbound call. The session ID is automatically written along with every log message to enable sorting of logs on a per-call basis.
-
Session Initiation Protocol. In TCP/IP communications, the application layer signaling protocol.
-
Mechanism for creating a set of possible results that should not be returned by the recognizer. Typically this is used when reprompting for misrecognized information, so that the same incorrect results are not returned in two successive dialog states.
-
Statistical Language Model. A mathematical object that computes the probability of sequences of words. In speech recognition systems, such a sequence is a speaker’s utterance, and the SLMs can be used to estimate the probability of the next word in the sequence based on the recognition status of the preceding words. For example, if the system has previously recognized "give me", it is more likely to recognize "more" than "sore".
-
Call transfer method in which the original call is reestablished if the transfer line is busy and the transfer cannot complete.
-
Nuance product that improves caller-perceived accuracy for directed dialogs. It allows the system to more easily detect and properly interpret near out-of-grammar utterances that include filler phrases without explicitly defining them in the original grammar.
-
Simple Network Management Protocol. IETF standard that enables system managers to control a variety of network devices from a single operational console.
-
Signal-to-Noise Ratio. Ratio of signal to noise, in dB, indicating line quality.
-
Software application enabling a caller to place or receive phone calls from a PC. Calls can be made from a PC to a phone or to another PC.
-
Sound whose production does not involve a noise component. Sonorants can form the nucleus of a syllable. Examples: all vowels and /m/, /n/, /N/, /j/, /w/, and so on. Sounds in which there is a noise component are called obstruents. Nuance does not, however, use this distinction in its description of sounds, because the distinction is irrelevant for these purposes.
-
Provisioning technique of providing standby hardware that can be quickly substituted for any systems that fail in your network. This is often referred to as N+1 sparing or N:M sparing, where N is the number of machines active at a given time and M is some number between 1 and N indicating how many standby systems you have available.
-
Person whose speech is being recognized.
-
Process of identifying a speaker from a set of valid users, enabling a more personalized interaction. Speaker identification can be performed on a small set of users.
-
Process of identifying which speaker said the utterance and authenticating the speaker. Speaker recognition encompasses identification and verification.
-
Nuance product name for the technology of using SLMs and SSMs to interpret a response to an open-ended prompt. It is useful for far out-of-grammar responses.
-
Refers to the physical link that carries the voice or audio data. The speech channel provides the bridge between an application running on a specific port and the telephony session service.
-
Percentage of time during a call occupied by the caller’s speech (as opposed to prompts or database queries, and so on). Sometimes called the recognizer duty cycle. Since speech uses recognition resources, applications with high speech density tend to require more resources.
-
Process of detecting the beginning of caller speech against silence or background noise.
-
Runtime component that detects speech on an audio stream, and returns an indication of when speech began.
-
Statistical model that allows the system to translate utterances into phonetic representations of speech. Derived by speech scientists from analyzing many different people speaking well-defined words and phrases.
-
Central control and communication hub for speech-processing resources. Speech Server provides an open, protocol-based mechanism for using resources such as speech recognition and text-to-speech. Speech Server interacts with a voice platform via the MRCP protocol and with telephony via the RTP protocol.
-
Speech Recognition Grammar Specification. The W3C standard specification for writing grammars. It includes an XML format and an ABNF format. Nuance supports only the XML format (called GrXML for convenience).
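As an illustration, a minimal grammar in the SRGS XML format (GrXML) might look like the following sketch; the rule name and contents are hypothetical:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-US" root="yesno">
  <!-- The root rule accepts exactly one of the listed alternatives. -->
  <rule id="yesno" scope="public">
    <one-of>
      <item>yes</item>
      <item>no</item>
    </one-of>
  </rule>
</grammar>
```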
-
Secure Sockets Layer (SSL) provides a secure channel between two machines or devices operating over the internet or an internal network. SSL was the most widely deployed cryptographic protocol to provide security over internet communications before it was superseded by TLS (Transport Layer Security).
-
Statistical Semantic Model. A mathematical model that aids recognition based on the context in which words appear. An SSM uses an SLM to calculate the probability that a given sequence of words will appear and match recognized phrases to their intended meaning.
-
Speech Synthesis Markup Language. XML-based markup language for speech synthesis applications. Gives authors of synthesizable content a standard way to control aspects of speech output such as pronunciation, volume, pitch, rate, and so on across different synthesis-capable platforms. It is a recommendation of the W3C's voice-browser working group.
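A small illustrative SSML fragment (the wording is invented) showing control of rate, volume, and pauses:

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  Your balance is
  <prosody rate="slow" volume="loud">
    <say-as interpret-as="currency">$42.50</say-as>
  </prosody>.
  <break time="500ms"/>
  Goodbye.
</speak>
```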
-
Grammar content that is precompiled into a binary .gram file. Such a grammar loads quickly, but cannot be changed at runtime based on caller input.
-
A statistical language model (SLM) determines the probability that a given word sequence will appear in response to a given prompt. This probability can then be used to guide recognition to the most likely results. When the SLM is used with another natural language technique, a meaning can be assigned to the recognized text. All natural language grammars require underlying SLMs. SLMs are trained from actual user input.
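The probability idea can be sketched with a toy bigram model; the corpus below is invented, and real SLMs are trained on large corpora with smoothing rather than these raw maximum-likelihood estimates:

```python
from collections import Counter

# Toy corpus of transcribed caller responses (hypothetical data).
corpus = [
    "i want to check my balance",
    "check my balance please",
    "i want to pay my bill",
]

# Count unigrams and bigrams, padding each sentence with a start token.
unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def sentence_prob(sentence):
    """Maximum-likelihood probability of a word sequence under the bigram model."""
    words = ["<s>"] + sentence.split()
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    return p

print(sentence_prob("check my balance"))  # ≈ 0.22: a likely in-grammar phrase
print(sentence_prob("balance my check"))  # 0.0: a word order never seen in training
```

During recognition, such probabilities bias the decoder toward word sequences resembling actual user input.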
-
A statistical semantic model (SSM) uses a trained classifier to assign meanings to each recognized utterance. Like an SLM, it is automatically generated from an XML training file that assigns meanings to the training set sentences. An SSM is most effective when you want to fill a single information slot, but the prompt is so open-ended that the caller could reply using many different combinations of words.
-
Management Station service that gathers call statistics for use in billing and capacity monitoring.
-
Component of a grammar used to define words and phrases the grammar can accept. Grammar rules can be nested one within another to create a complex grammar capable of interpreting long and detailed utterances.
-
Morphological element that is attached to the word root and may cause a change of word class. Examples: -ship as in fellowship, or -ment as in improvement (which changes the word class).
-
Special key returned by the recognizer that contains the name of the grammar to identify which of a set of parallel grammars parsed the result on the n-best list.
-
Special key returned by the recognizer that contains the raw text answer (the actual recognized text).
-
Special key returned by the recognizer that contains the beginning and ending times of the words recognized. Also provides word confidence scores in recognition results.
-
Special key used internally by the recognizer; the recognizer always returns this key, even if you didn’t explicitly set it in your grammar. For this reason, some VoiceXML integrations may treat this as a key that is internal to the recognizer, and therefore shouldn’t be returned to the calling VoiceXML application (unless the grammar did not explicitly set any other keys). Thus, if you want the value of SWI_meaning to be returned to the calling VoiceXML application, then you need to explicitly set another key to the same value inside your grammar.
-
A unit of organization for a sequence of speech sounds, typically made up of a syllable nucleus (usually a vowel) with optional initial and final margins (typically consonants). Syllables are often considered the phonological “building blocks” of words. For example, the word lady contains two syllables, /leI/ and /di/.
-
Two separate items in a vocabulary that have the same meaning in a system. For example, nicknames such as Bill, Billy, William, and Will, or commands like purchase and buy, or hangup, quit, exit, and logoff.
-
Format for digital transmission using PCM and Time-Division Multiplexing at a rate of 1.544 Mbps. Each line consists of 24 channels; each channel can be configured to carry 64 Kbps of voice or data traffic.
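The 1.544 Mbps line rate follows from the channel arithmetic: 24 channels at 64 kbps each, plus 8 kbps of framing overhead. A quick check:

```python
# T1 line-rate arithmetic.
channels = 24
channel_rate_kbps = 64       # one DS0 voice/data channel
framing_overhead_kbps = 8    # 8,000 framing bits per second

total_kbps = channels * channel_rate_kbps + framing_overhead_kbps
print(total_kbps)  # 1544, i.e. 1.544 Mbps
```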
-
Two-B-Channel Transfer. TBCT enables an ISDN PRI user to request the switch to connect together two independent calls on the user's interface. The two calls can be served by the same PRI trunk or by two different PRI trunks that both serve the user. If the switch accepts the request, the user is released from the calls and the two other users are connected directly. Billing for the two original calls continues in the same manner as if the transfer had not occurred.
-
Transmission Control Protocol. Connection-oriented transport layer protocol that guarantees delivery of a data stream sent from one host to another without duplicating or losing data.
-
Time Division Multiplexing. Method for combining multiple digital signals for transmission over a single channel.
-
Application, such as a SIP softphone, that provides telephony functionality in non-telephony mode.
-
Module that converts standard telephony signalling and Time-Division Multiplexing (TDM) audio into VoIP signaling and transport. The interface uses the Session Initiation Protocol (SIP) for call control, and to signal a VoiceXML interpreter or other standard execution platform that a call has arrived.
-
A corporate entity that deploys one or more applications in a hosted environment (such as Nuance On Demand). The system automatically organizes log files according to the company and application names.
-
Process by which text (words/tokens) is converted to a standard form based on global parameter settings and token-specific rule settings. For example, to expand abbreviations ("PIN" to "personal identification number"), or to convert currency symbols into full words (currency symbols to "dollars" or "euros").
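A minimal sketch of the idea in Python; the rule tables are hypothetical, and a real engine uses locale-specific data and far richer rules:

```python
import re

# Hypothetical expansion tables for illustration only.
ABBREVIATIONS = {"PIN": "personal identification number", "Dr.": "Doctor"}
CURRENCY = {"$": "dollars", "€": "euros"}

def normalize(text):
    """Expand abbreviations and convert simple currency amounts to spoken form."""
    for abbr, expansion in ABBREVIATIONS.items():
        text = text.replace(abbr, expansion)
    # "$5" -> "5 dollars": amount first, then the currency word.
    for symbol, word in CURRENCY.items():
        text = re.sub(re.escape(symbol) + r"(\d+)", r"\1 " + word, text)
    return text

print(normalize("Enter your PIN to send $5"))
# Enter your personal identification number to send 5 dollars
```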
-
Text Processor. TextProc comprises a tokenizer, which transforms written text into a graph of tokens where any path represents a probable way someone might read or dictate the text; a formatter (also known as Inverse Text Normalization, or ITN), which transforms a token sequence into written text based on global parameter settings and token-specific rule settings; and a lexicon, which helps define the token philosophy and thus the semantic handling of the words in the vocabulary. TextProc is used in multiple Nuance products, including the Nuance Text Processing Engine (NTpE).
-
Event that occurs when no speech is detected by the recognizer over a specified period of time. For some Dialog Modules, a timeout is normal. For example, a caller may be prompted to choose an item from a read list by remaining silent until hearing the desired item.
-
Transport Layer Security. Cryptographic protocol that provides endpoint authentication and communications confidentiality over networks such as the Internet. TLS and its predecessor, Secure Sockets Layer (SSL), encrypt the segments of network connections at the Transport Layer end-to-end.
-
Process in which a string of text is broken up into individual words, keywords, phrases, symbols, and other elements called tokens. Tokens can be individual words or phrases.
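A simplistic rule-based sketch of tokenization in Python; the regular expression here is illustrative, not how any particular tokenizer works:

```python
import re

def tokenize(text):
    """Split text into word, number, and punctuation tokens."""
    # Words (with optional apostrophes), decimal numbers, or single symbols.
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?|\d+(?:\.\d+)?|[^\w\s]", text)

print(tokenize("Pay $42.50 on May 5, please."))
# ['Pay', '$', '42.50', 'on', 'May', '5', ',', 'please', '.']
```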
-
Using user input to refine the accuracy of an SLM. May also refer to the process of creating a voiceprint for a caller and storing it in a database.
-
Exchange of information between the caller and the application. Each call flow defines its own transactions, and it writes log messages at their start and completion. Using this information, you can calculate transaction completion rates, and identify problematic transactions that require attention and tuning.
-
High-level task or goal of a voice application. A single transaction might consist of a series of executions that together produce a useful outcome for the caller or the system. For example, obtaining flight departure information by providing the flight number, date, and confirmation of the departure city.
-
Voice application that exchanges information between the application and the caller. The caller can effect change in the application’s databases.
-
Written text corresponding to a spoken phrase. As part of application tuning, developers create transcriptions from captured audio and compare them to recognition results recorded in the call logs. The transcriptions identify exact causes of successes and failures.
-
Events logged by the application when an automated call is transferred to an agent or other destination. These events are not logged automatically by a platform and require that the application provide a reason for the transfer and optionally other information about the transfer. The reason indicates the cause of the transfer.
-
Sound produced with a very fast movement of the tongue tip (a "front" trill) or the uvula (a "back" trill). Trills occur especially in Romance languages, for example in the Italian pronunciation of /r/. Example: rosso /rOsso/.
-
Call transfer method in which the application transfers a caller to a third party, while staying connected to both lines. The application can also monitor the call. Also referred to as "bridged transfer."
-
A collection and any confirmation associated with it. An optimal collection gets the desired information with a single try. Otherwise, the application must perform retries or attempt a new strategy for getting the information.
-
Text-To-Speech. The process of synthesizing audible speech from typed text.
-
A system prompt followed by a caller response. The caller response can be a hangup, silence, or even noise that triggers the recognizer; a recognition can exit after detecting speech, DTMF, a hangup, or after a timeout. Most transactional applications require multiple turns. For example, trading a stock or paying a credit card. Simpler applications (for example, an auto-attendant) require one or two turns. The more turns in a dialog, the more complex it tends to be to design.
-
User Datagram Protocol. Connectionless, unreliable transport protocol used for data streaming.
-
Command that can be spoken during any dialog state, such as “help”, “cancel”, or “exit”. Also called a global command.
-
Service that must be run to use the universal commands that are included with the platform.
-
Uniform Resource Locator. Global address of documents and other resources on the web; a URL is a kind of URI.
-
Uniform Resource Identifier. Generic term for all types of names and addresses that refer to objects on the Web.
-
Coordinated Universal Time.
-
Distinct chunk of caller speech, usually in response to a prompt, that is recognized using a specific active grammar. An utterance is referred to colloquially as an “utt.”
-
Voice Activity Detection, also known as speech activity detection or speech detection. Technique used in speech processing to detect the presence or absence of human speech. The main uses of VAD are in speech coding and speech recognition. It can facilitate speech processing, and can also deactivate some processes during non-speech sections of an audio session: for example, it can avoid unnecessary coding and transmission of silence packets in Voice over Internet Protocol applications, saving computation and network bandwidth.
-
System performance statistics presented by Management Station for the network or specific hosts or services.
-
Set of words that can be understood as a part of an application. The grammar determines the sequences of these words that are allowed. For example, both cents and dollars are in the vocabulary of the currency built-in grammars, but they can only be said in particular locations in the phrase.
-
Nuance text-to-speech engine. Using Nuance-supplied voice and language data, optionally supplemented by application recordings and tuning data, Vocalizer speaks computer-supplied text in a human voice.
-
A voc(abulary) delta file (or VocDelta) is a file that encapsulates a set of changes from/to the currently loaded vocabulary, such as adding and removing words, moving words between states, setting pronunciations, and building topic slots. Use of the VocDelta allows for grouping/isolating of changes from the current active vocabulary, so that changes may be restored later or transferred to a speaker profile.
-
Software application that works with various markup languages to interpret VoiceXML content and invoke other services as necessary to interpret voice input and generate voice output.
-
Process of adding phrases to a dynamic grammar through a voice interface—that is, by speaking them. Grammars created with this mechanism are speaker-dependent: because the pronunciations are generated based on the caller’s spoken input, they should only be used for recognition with that speaker. Also called speaker-dependent grammar or speaker-trained recognition.
-
Complete set of recorded audio files used by Nuance Vocalizer for Network to enable a TTS persona. Each voice pack provides a male or a female voice for a particular locale (for example, for Canadian French, American English, or Australian English).
-
The Nuance Voice Platform for Speech Suite 11 is a carrier-grade VoiceXML platform that supports voice applications using open web standards. The Voice Platform provides a complete, off-the-shelf solution to develop, deploy, and monitor voice applications implemented in VoiceXML.
-
Quality of a sound that involves vibration of the vocal folds. Example: /v/ as in wives /waIvz/.
-
Quality of a sound that does not involve the vibration of the vocal folds. Example: /f/ as in wife /waIf/.
-
Matrix of numbers reflecting physical characteristics of a person’s vocal tract, as well as behavioral characteristics of the way a person speaks.
-
Voice-enabled markup language based on the Extensible Markup Language (XML).
-
Voice-over-IP. Transmission of analog sound as digital data across an Internet Protocol (IP) network.
-
Voiced sound produced without any central obstruction in the oral tract. Examples: /i:/, /e/, /A:/, /U/, /Q/, and so on.
-
Voice Response Unit, also known as Interactive Voice Response or IVR. Technology that allows communication between humans and an automated computer system using voice and/or touchtone.
-
Voice User Interface. Voice application equivalent of a Graphical User Interface (GUI). It is the application's presentation of audio to callers, with regard to usability.
-
World Wide Web Consortium. Principal body that defines standards for the World Wide Web (WWW).
-
Wide Area Network. Computer network covering a broad physical area, typically using leased communication lines and linking together many smaller LANs.
-
Service that provides communication services between Management Station and a managed host. Runs on each managed host as a Windows native service or Unix daemon.
-
WebSocket is a communications protocol that provides full-duplex communications channels over a single TCP connection.
-
Weight values set the relative importance between different grammar packages when processing speech input. Weights apply to speech recognition only and have no impact on meaning extraction; therefore, recognition weights are relevant to Nuance Recognizer (dynamic grammars) and to the Krypton recognition engine (not to NLE/NTpE).
-
Recording of a complete conversation, that is, a realtime mixed capture of both the inbound and outbound audio streams of a call (a particular RTSP [MRCPv1] or SIP [MRCPv2] session) exactly as they occurred.
-
A wordset is a set of words that customize the vocabulary used by an application at runtime. For example, an application might use wordsets to fetch identified user-specific information to add recognizable values into a grammar (such as the appropriate bank account information for a specific user). The Krypton recognition engine and NLE use wordsets for dynamic content injection, whereas Nuance Recognizer uses dynamic(-link) grammars.
-
Wizard of Oz. Form of usability testing that simulates the call flow of a voice application. A human ("the man behind the curtain") uses a well-defined script to impersonate the prompts and responses of the application (with no improvisation). Test subjects place calls to the pseudo-application. A WOZ test enables quick and inexpensive feedback on VUI concepts before implementation.
-
WebSocket Secure. WebSocket over TLS.
-
eXtensible Markup Language. Set of rules for encoding documents based on several specifications produced by the W3C and others.
-
Human-readable data serialization language commonly used for configuration files.
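A small illustrative YAML fragment (the keys are hypothetical):

```yaml
# Example configuration file.
server:
  host: 0.0.0.0
  port: 8080
logging:
  level: info
  outputs:
    - console
    - file
```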
-
Apache ZooKeeper provides a centralized configuration and coordination service to distributed processes.