Data collection for training files

The key requirement for any training file is a list of sentences (or in a confidence engine training file, audio recordings of sentences) that might be spoken by callers. Data collection is the process of harvesting those sample utterances. The more representative sentences you provide, the better the resulting model.

Note: Your example sentences must accurately reflect how real customers will respond to the application prompt, including any jargon or technical terms.

Samples can come from any source. For example, they can be responses to an application designed specifically for data collection, historical phrases stored in a database by an existing application, sentences harvested from the web, or sentences you invent yourself:

  • Real data refers to sentences transcribed from actual telephone calls.
  • Fake data refers to sentences invented to predict what callers might say.

Real data is strongly preferred over fake data, but real data is often unavailable at the beginning of a project. In a typical project, training files are continually refined, and real data is added as it becomes available (for example, from calls transcribed during pilot testing).
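The refinement loop can be illustrated with a minimal sketch. The function, sentence lists, and layout below are assumptions for illustration only, not part of any product: a fake seed corpus is supplemented with real pilot-test transcripts, skipping duplicates.

```python
def refine_training_sentences(current, new_transcripts):
    """Append newly transcribed real utterances to the training list,
    skipping any sentence that is already present."""
    seen = set(current)
    merged = list(current)
    for sentence in new_transcripts:
        if sentence not in seen:
            merged.append(sentence)
            seen.add(sentence)
    return merged

# Fake seed sentences, later supplemented by pilot-test transcripts.
seed = ["i want to pay my bill", "check my balance"]
pilot = ["check my balance", "uh yeah my card got stolen"]
training = refine_training_sentences(seed, pilot)
```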

In addition to training files, test sentences are needed to evaluate the resulting grammar. The test set must be independent of the training sentences.
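As a rough illustration of keeping the two sets independent, the collected sentences can be deduplicated and split before training. The function name, split ratio, and sample sentences below are illustrative assumptions:

```python
import random

def split_corpus(sentences, test_fraction=0.1, seed=42):
    """Split collected sentences into disjoint training and test sets."""
    unique = sorted(set(sentences))          # dedupe so no sentence lands in both sets
    random.Random(seed).shuffle(unique)
    n_test = max(1, int(len(unique) * test_fraction))
    return unique[n_test:], unique[:n_test]

train, test = split_corpus([
    "i want to check my balance",
    "transfer money to savings",
    "speak to an agent please",
    "what is my account balance",
    "pay my credit card bill",
])
```

Deduplicating before shuffling matters: if the same sentence appears in both sets, the test result overstates the model's accuracy.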

It is strongly recommended that you collect data in live, real-time situations rather than from offline recordings. Live data collection yields an initial deployment that is more accurate, achieves a better automation rate, and reaches optimal efficiency faster during tuning. Possible methods for collecting training sets include:

  • Use an automated prompt with dummy recognition
  • Conduct a Wizard of Oz collection
  • Collect utterances using a natural language grammar

Automated prompt with dummy recognition

One effective way to collect example sentences is to use an automated prompt without attempting to recognize the caller’s response. The caller responds to the prompt, and the system records the utterance for later transcription; the caller is then transferred either to an existing automated system or to a live agent. The caller must then repeat to that destination the information already spoken to the dummy Recognizer.

This method collects realistic data at very low cost. Its impact on callers is minor: they must speak to the dummy Recognizer before their actual call begins, but this is a one-time inconvenience balanced by the low cost of collection. Overall, an automated prompt with dummy recognition offers the best balance between cost and accuracy.

However, this method does not support collection of example sentences for multi-phase natural language applications, where the dialog takes several steps and the initial prompt is followed by further prompts. Because this method collects responses only to the first prompt, it cannot gather data for the second and later prompts.
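The collection flow above can be sketched in pseudocode form. Everything here is an illustrative stand-in (the stub caller, function names, and transfer target are invented): a real system plays and records audio over a telephony channel.

```python
collected = []   # utterances awaiting offline transcription

class StubCaller:
    """Stand-in for a live telephony channel (illustrative only)."""
    def __init__(self, speech):
        self.speech = speech
        self.transferred = False
    def play(self, prompt):
        pass                         # a real system would play audio here
    def record(self):
        return self.speech           # a real system would capture audio

def collect_with_dummy_recognition(prompt, caller, store, transfer):
    """Play the prompt, record the response for transcription, then hand off.
    No recognition is attempted; the recording is the only output."""
    caller.play(prompt)
    store.append(caller.record())
    transfer(caller)                 # caller repeats the request at the destination

def transfer_to_agent(caller):
    caller.transferred = True        # placeholder for a real transfer

caller = StubCaller("i lost my card")
collect_with_dummy_recognition("How may I help you?", caller, collected, transfer_to_agent)
```

Note that only the first response is stored, which mirrors the single-prompt limitation described above.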

Wizard of Oz collections

A Wizard of Oz collection simulates the target application. A human agent plays prompts and serves the role of the recognizer. The agent does not speak directly to callers; instead, the caller hears an automated prompt. The human agent (the "wizard behind the curtain") then uses a computer keyboard to determine how the application responds to each caller utterance.

Wizard of Oz collections serve two purposes: to test ideas for application prompts and callflows, and to generate recordings of caller utterances for later transcription and semantic labeling. The advantages of this approach are:

  • Callers are promptly routed to the correct destination.
  • Callers are not inconvenienced.
  • Operators can save time by labeling calls at runtime.

If the initial utterance is ambiguous, operators with knowledge of the application’s goals can label it accordingly, and can route callers to follow-up questions. This allows for testing of follow-up questions and collection of utterances in multi-phase natural language applications.

The main disadvantage is cost: software must be written to let the human agents listen to, label, and route calls, and the agents themselves must be hired and trained.
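The runtime labeling step can be sketched with an entirely hypothetical record schema (none of these field names come from the source): the wizard assigns a semantic label and a route in one pass, so transcription and labeling effort is front-loaded into the call itself.

```python
from dataclasses import dataclass

@dataclass
class WizardLabel:
    """One record the wizard produces per caller utterance (hypothetical schema)."""
    call_id: str
    recording_path: str      # audio saved for later transcription
    semantic_label: str      # intent the wizard assigned, e.g. "billing"
    route: str               # destination the wizard chose

log = []

def wizard_handle(call_id, recording_path, semantic_label, route):
    """Log the wizard's decision so transcription and labeling happen in one pass."""
    record = WizardLabel(call_id, recording_path, semantic_label, route)
    log.append(record)
    return record.route

destination = wizard_handle("call-0001", "audio/call-0001.wav", "billing", "billing_queue")
```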

Natural language grammar collections

To collect example utterances using a natural language grammar, a designer writes a grammar that allows callers to speak some of the anticipated answers to the prompt. Recognized utterances can be routed appropriately. Unrecognized utterances can be sent to an agent or an existing automated system.

In cost, this method is a good compromise between an automated prompt with dummy recognition and a Wizard of Oz collection. It allows some level of automation from the initial deployment, and it supports multi-phase natural language applications.
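The idea can be sketched as follows. The phrase table and queue names are invented for illustration, and a real deployment would use an actual speech grammar rather than string matching: recognized utterances are automated, and everything else goes to an agent while its audio is kept for the next training-file revision.

```python
# Hypothetical anticipated answers, mapped to destinations.
GRAMMAR = {
    "check my balance": "balance_menu",
    "pay my bill": "payments_menu",
    "talk to an agent": "agent_queue",
}

unrecognized = []   # saved for transcription; fuel for the next training-file revision

def route(utterance):
    """Route recognized utterances automatically; send the rest to an agent."""
    destination = GRAMMAR.get(utterance.lower().strip())
    if destination is None:
        unrecognized.append(utterance)
        return "agent_queue"
    return destination
```

Because each prompt in a multi-phase dialog can have its own phrase table, this approach collects utterances past the first prompt, which the dummy-recognition method cannot do.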

Training files for statistical models

You need different training files for different kinds of statistical models: language models, semantic models, and confidence engine models. Each kind of training file has its own content requirements.

For details, see: