Recognition in Voice Platform

This topic explains the Nuance recognition process, which involves several phases.

Preprocessing

Before speech is sent to the Voice Platform recognition engine, the audio data is preprocessed for optimal recognition. The two most important steps in this preprocessing phase are echo cancellation and endpointing.

Echo cancellation

Echo cancellation improves the quality of a speech signal by diminishing any echo that might have been introduced by the telephone. Without echo cancellation, the recognizer cannot provide accurate results, because echo from the played prompt is often mistaken for the user’s voice.
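
The echo canceller is built into the platform and is not something you implement yourself, but the general idea can be pictured with a normalized least-mean-squares (NLMS) adaptive filter, a common technique for this task. The function below is a minimal sketch, assuming the prompt audio is available as a reference signal; the names and parameter values are illustrative and are not Voice Platform APIs.

    import numpy as np

    def nlms_echo_cancel(mic, ref, taps=128, mu=0.5, eps=1e-8):
        """Illustrative NLMS echo canceller (not the Voice Platform implementation).

        mic -- microphone signal containing speech plus echo of the prompt
        ref -- reference signal, i.e. the prompt audio that was played
        Returns the echo-reduced signal (first `taps` samples are left unchanged at zero).
        """
        w = np.zeros(taps)                # adaptive filter estimating the echo path
        out = np.zeros(len(mic))
        for n in range(taps, len(mic)):
            x = ref[n - taps:n][::-1]     # most recent reference samples, newest first
            echo_est = np.dot(w, x)       # predicted echo at this sample
            e = mic[n] - echo_est         # error signal = echo-reduced sample
            w += (mu / (eps + np.dot(x, x))) * e * x   # NLMS weight update
            out[n] = e
        return out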

Endpointing

For recognition accuracy and efficiency, the Speech Server must distinguish leading or trailing background noise or silence from the utterance. This process is called endpointing, and it is discussed fully under Using barge-in.
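
Endpointing is performed by the Speech Server itself, but the underlying idea can be sketched with a simple per-frame energy detector. The threshold, frame layout, and function name below are illustrative assumptions, not platform settings.

    import numpy as np

    def endpoint(frames, energy_threshold=0.01, min_speech_frames=3):
        """Toy energy-based endpointer: returns (start, end) frame indices of speech.

        frames -- 2-D array, one row per 10 ms frame of samples normalized to [-1, 1]
        A real endpointer also models background noise adaptively; this sketch only
        compares per-frame energy against a fixed threshold.
        """
        energies = np.mean(frames.astype(float) ** 2, axis=1)
        speech = energies > energy_threshold
        idx = np.flatnonzero(speech)          # indices of frames judged to contain speech
        if idx.size < min_speech_frames:
            return None                       # nothing but silence or noise
        return idx[0], idx[-1]                # trims leading and trailing silence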

Front-end processing

Front-end processing, also called feature extraction, allows Voice Platform to filter out a certain amount of background noise.

The audio data is typically sampled at a rate of 8000 samples per second and segmented into 10-millisecond frames (80 samples per frame), which is the standard for digital telecommunications. Voice Platform examines the energy levels of the samples in a frame across various frequency bands and extracts a feature set, a vector of numbers corresponding to the energies in these bands (usually ranging from 300 Hz to 3.3 kHz). This vector is transformed into a new vector arranged for speech recognition processing, filtering out some background noise. Further processing is performed on these feature sets, not on the original audio samples.
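
The exact feature set is internal to the engine, but the band-energy step described above can be sketched roughly as follows, assuming 10-millisecond frames at 8 kHz. The number of bands, the windowing choice, and the log compression are illustrative assumptions, not Voice Platform parameters.

    import numpy as np

    SAMPLE_RATE = 8000          # samples per second (telephony standard)
    FRAME_LEN = 80              # 10 ms at 8 kHz

    def band_energies(frame, n_bands=16, f_lo=300.0, f_hi=3300.0):
        """Sketch of feature extraction: energy per frequency band for one frame.

        The real front end applies further transforms (and noise filtering) to
        this vector; only the band-energy step is shown here.
        """
        spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
        freqs = np.fft.rfftfreq(len(frame), d=1.0 / SAMPLE_RATE)
        edges = np.linspace(f_lo, f_hi, n_bands + 1)
        features = np.empty(n_bands)
        for b in range(n_bands):
            in_band = (freqs >= edges[b]) & (freqs < edges[b + 1])
            features[b] = spectrum[in_band].sum()      # energy in this band
        return np.log(features + 1e-10)                # log compresses the dynamic range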

Recognition search

During the search phase, Voice Platform analyzes speech features to produce a text transcription of the utterance. This search is defined by the set of possibilities specified in the current grammar. In examining these possibilities, Voice Platform uses a hierarchy of search mechanisms that allows it to select the most likely match from the set of possible matches:

  • At the lowest level, individual phonemes are recognized using the specified acoustic model. Each phoneme extends across multiple frames. Phoneme models can be context dependent, which means that they can depend on the preceding and following phonemes.
  • Sequences of phonemes make up words. The recognition service uses dictionaries along with text-to-sound rules to map phoneme sequences to words. Comprehensive dictionaries are supplied by Voice Platform, and you can provide supplemental dictionaries for technical terms or special names.
  • Words combine to make phrases or sentences. The grammars active during a recognition determine the set of word sequences that Voice Platform can accept. (A simplified sketch of the dictionary-and-grammar hierarchy follows this list.)
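
A highly simplified picture of this hierarchy is sketched below. The dictionary entries, phoneme symbols, and grammar are invented for illustration and do not reflect Voice Platform’s actual dictionary format or grammar syntax.

    # Illustrative only: toy dictionary and grammar, not Voice Platform formats.

    # Each word maps to one or more phoneme sequences (pronunciations).
    DICTIONARY = {
        "yes":   [("y", "eh", "s")],
        "no":    [("n", "ow")],
        "maybe": [("m", "ey", "b", "iy")],
    }

    # The active grammar defines which word sequences can be accepted.
    GRAMMAR = {("yes",), ("no",), ("maybe",)}

    def words_for_phonemes(phonemes):
        """Return every word whose pronunciation matches the phoneme sequence."""
        return [w for w, prons in DICTIONARY.items() if tuple(phonemes) in prons]

    def accepted(word_sequence):
        """True if the grammar allows this word sequence."""
        return tuple(word_sequence) in GRAMMAR

    # Example: the phonemes (y, eh, s) map to "yes", which the grammar accepts.
    print(words_for_phonemes(["y", "eh", "s"]))   # ['yes']
    print(accepted(["yes"]))                      # True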

Acoustic-phonetic analysis

The acoustic-phonetic analysis phase of the recognition process provides a probabilistic mapping from the utterance waveform to a set of possible phonemes. Typically, Voice Platform uses from 30 to 60 phonemes, depending on the language (for example, North American English uses 41 phonemes).

Because words and sentences are constructed from phonemic models, it is important that the acoustic processing be very accurate. Voice Platform uses hidden Markov models (HMMs) as the acoustic models for mapping utterance waveforms to phoneme sequences. HMMs are complex statistical models that provide detailed spectral and temporal representations of speech signals. These models learn automatically from data and have been optimized for telephone-quality audio.
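
The acoustic models themselves are supplied with the platform, but the kind of scoring an HMM supports can be sketched with the standard forward algorithm, which computes how likely a sequence of feature frames is under a given phoneme model. The transition, emission, and start probabilities are passed in as placeholders here; in practice they would come from the trained acoustic model.

    import numpy as np

    def forward_log_likelihood(log_trans, log_emit_per_frame, log_start):
        """Standard HMM forward algorithm in log space (sketch with placeholder models).

        log_trans          -- (S, S) log transition probabilities between states
        log_emit_per_frame -- (T, S) log probability of each frame under each state
        log_start          -- (S,)   log initial-state probabilities
        Returns log P(frames | phoneme HMM), the score used to compare phoneme models.
        """
        log_alpha = log_start + log_emit_per_frame[0]
        for t in range(1, len(log_emit_per_frame)):
            # Sum (in log space) over all previous states, then add this frame's
            # emission scores.
            log_alpha = (
                np.logaddexp.reduce(log_alpha[:, None] + log_trans, axis=0)
                + log_emit_per_frame[t]
            )
        return np.logaddexp.reduce(log_alpha)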

Concurrent processing

While mapping the utterance waveform to words and to sentences, the system performs two concurrent tasks as part of the recognition process:

  • Segmentation analysis determines where in a speech stream the words are, and where in these words the phonemes are.
  • Classification determines which phoneme each segment represents, and which word was heard given a string of phonemes.

Both tasks are performed using phoneme models, word models, and a grammar.

In evaluating all possible sentences, Voice Platform considers all possible segmentations. The simultaneous evaluation of hypotheses and segmentations produces the optimal—that is, most accurate—result. Hypotheses are neither discarded too soon, which could mean discarding an accurate hypothesis, nor too late, which could mean an undue computational burden on the system.
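
One way to see how segmentation and classification emerge from a single search is the classic Viterbi algorithm over a toy frame-level model: one dynamic-programming pass yields both the best label for every frame (classification) and the frame boundaries between labels (segmentation). The sketch below is illustrative only; the engine’s actual search is more sophisticated, and the states and scores here are invented.

    import numpy as np

    def viterbi_path(log_trans, log_emit_per_frame, log_start):
        """Toy Viterbi decode: one pass yields both classification and segmentation.

        Returns the most likely state (e.g. phoneme) for every frame; consecutive
        runs of the same state give the segment boundaries.
        """
        T, S = log_emit_per_frame.shape
        delta = log_start + log_emit_per_frame[0]
        backptr = np.zeros((T, S), dtype=int)
        for t in range(1, T):
            scores = delta[:, None] + log_trans            # [i, j]: best path via state i
            backptr[t] = np.argmax(scores, axis=0)
            delta = scores[backptr[t], np.arange(S)] + log_emit_per_frame[t]
        path = np.empty(T, dtype=int)                      # trace back the single best path
        path[-1] = int(np.argmax(delta))
        for t in range(T - 1, 0, -1):
            path[t - 1] = backptr[t, path[t]]
        return path

    def segments(path):
        """Collapse the frame-level path into (state, start_frame, end_frame) runs."""
        out, start = [], 0
        for t in range(1, len(path) + 1):
            if t == len(path) or path[t] != path[start]:
                out.append((int(path[start]), start, t - 1))
                start = t
        return out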

Pruning the search space

The Voice Platform recognition engine is especially efficient because it is able to prune the search space while it analyzes an utterance. This means that as the system moves through the utterance, it discards unlikely results, reducing the search space. The result is faster recognition, which in turn requires less CPU time and less RAM.

The Voice Platform recognition service provides two methods of pruning:

  • Pruning based on likelihood scores: In theory, the recognition engine computes the likelihood of all possible hypotheses in the grammar and chooses the one with the greatest likelihood given the acoustic (HMM) models. In practice, unlikely hypotheses are pruned from the search before they are fully computed. The Voice Platform recognition engine lets you control the level of pruning by specifying the point in the search where paths with lower scores are eliminated from the search set (a minimal beam-pruning sketch follows this list).
  • Phonetic pruning: With this method, the recognition engine performs additional computation based on the last phoneme analyzed at any given time during recognition. Phonetic pruning provides an independent assessment that increases the likelihood that the recognition engine makes the right decision in keeping or pruning a hypothesis.
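
Conceptually, pruning based on likelihood scores amounts to beam pruning: at each point in the search, hypotheses whose scores fall too far below the current best are dropped. The sketch below is a minimal illustration with invented hypothesis tuples, not the engine’s internal data structures or its actual pruning parameter.

    def prune_beam(hypotheses, beam):
        """Keep only hypotheses whose log score is within `beam` of the best one.

        hypotheses -- list of (log_score, partial_result) pairs at the current frame
        beam       -- pruning threshold; smaller values prune more aggressively,
                      trading a little accuracy for less CPU time and memory
        """
        if not hypotheses:
            return hypotheses
        best = max(score for score, _ in hypotheses)
        return [(score, h) for score, h in hypotheses if score >= best - beam]

    # Example: with a beam of 5.0, the hypothesis scoring -12.0 is discarded early.
    active = [(-1.5, "yes"), (-2.0, "yeah"), (-12.0, "maybe")]
    print(prune_beam(active, beam=5.0))   # [(-1.5, 'yes'), (-2.0, 'yeah')]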

Efficient pruning distinguishes Voice Platform from some competitors’ systems, which often prune based on likelihood scores only.

You can set properties that allow Voice Platform to dynamically increase the pruning threshold, and with it the accuracy of the overall system, by using idle CPU power, for example, during periods of low call volume.