SLM training files

An SLM training file is an XML document that contains the sentences in your training set. The file contains the following information, some of it optional:

  • Header information to identify the file (required)
  • Configuration parameters (optional)
  • A list of supported vocabulary words (required)
  • A list of training sentences that callers are likely to speak (required)
  • References to sub-grammars for dates, currency, and so on (optional)

All training configuration is done with parameters inside the training file itself. Parameters set anywhere else in your configuration do not affect SLM generation.

Training file example

Here is an example of an SLM training file:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE SLMTraining SYSTEM "SLMTraining.dtd">
<SLMTraining version="1.0.0" xml:lang="en-us">
    <param name="ngram_order"><value> 2 </value></param>
    <param name="fsm_out"><value>sample.fsm</value></param>
    <param name="wordlist_out"><value>sample.wordlist</value></param>
    <vocab>
        <item>i</item>
        <item>want</item>
        <item>would</item>
        <item>desire</item>
        <item>to</item>
        <item>go</item>
        <item>the</item>
        <item>movie</item>
        <item>see</item>
        <item>really</item>
        <item>like</item>
        <item>view</item>
        <item>that</item>
        <item>film</item>
        <item>flick</item>
    </vocab>
    <training>
        <sentence>i want to go to the movie</sentence>
        <sentence>i want to see the movie</sentence>
        <sentence>i want to see that movie</sentence>
        <sentence>i want to see that film</sentence>
        <sentence>i want to view the movie</sentence>
        <sentence>i want to view that movie</sentence>
        <sentence>i want to go see that film</sentence>
        <sentence>i want to see the flick</sentence>
        <sentence>i really want to go to see that film</sentence>
    </training>
</SLMTraining>
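
As a quick sanity check on a file like the one above, you can compare the words in the training sentences against the vocabulary list. The following is a minimal sketch in Python using only the standard library. The function name out_of_vocab_words is hypothetical, and the sketch assumes that sentences contain plain text only (it does not handle sub-grammar references). Whether the SLM tools require every training word to appear in the vocabulary is not stated here, so the sketch only reports the differences.

```python
import xml.etree.ElementTree as ET


def out_of_vocab_words(xml_bytes):
    """Return the set of training-sentence words that are not listed in <vocab>.

    Hypothetical helper: assumes the element layout shown in the example
    (vocab/item and training/sentence directly under the root element).
    """
    root = ET.fromstring(xml_bytes)
    vocab = {
        item.text.strip()
        for item in root.findall("./vocab/item")
        if item.text
    }
    missing = set()
    for sentence in root.findall("./training/sentence"):
        # Only the plain text of the sentence is examined here.
        for word in (sentence.text or "").split():
            if word not in vocab:
                missing.add(word)
    return missing


sample = b"""<?xml version="1.0" encoding="UTF-8" ?>
<SLMTraining version="1.0.0" xml:lang="en-us">
    <vocab><item>i</item><item>want</item></vocab>
    <training><sentence>i want popcorn</sentence></training>
</SLMTraining>"""

print(out_of_vocab_words(sample))  # {'popcorn'}
```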

About the training set

Do not include punctuation and special characters in the training set. Write all words and abbreviations as they will be spoken. For example, write January 23 as “january twenty third” and write St. Patrick St. as “saint patrick street”.
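
Sentences that break these rules can be flagged automatically before training. The following Python sketch is an illustration only: the function name unspoken_characters is hypothetical, and the allowed character set (lowercase letters, apostrophes, and spaces) is an assumption that suits English. It reports problem characters but does not rewrite them, because an abbreviation such as "St." can stand for either "saint" or "street" and needs a human decision.

```python
import re

# Assumption: anything other than letters, apostrophes, and spaces
# (after lowercasing) suggests the sentence is not written as spoken.
_NOT_SPOKEN = re.compile(r"[^a-z' ]")


def unspoken_characters(sentence):
    """Return the sorted, distinct characters that should be written out as words."""
    return sorted(set(_NOT_SPOKEN.findall(sentence.lower())))


print(unspoken_characters("January 23"))            # ['2', '3']
print(unspoken_characters("St. Patrick St."))       # ['.']
print(unspoken_characters("january twenty third"))  # []
```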

There is a strong correlation between the quality of an SLM and the quality of its training set. The best training sets are those collected from actual user responses. A good training set also includes at least several thousand sentences. For example, for a vocabulary of up to 2,500 words, about 20,000 training examples are adequate to train an SLM.

For more about training files and the methods you can use to collect training data, see Data collection for training files.

You may also want to collect a separate set of sentences as a test set, to measure the quality of your model (see Tuning perplexity).
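
If you have one pool of collected sentences, one simple way to obtain a test set is to hold back a random portion of it. The Python sketch below shows the idea. The function name split_sentences, the 10 percent default, and the fixed seed are all assumptions made for illustration, not recommendations from this documentation.

```python
import random


def split_sentences(sentences, test_fraction=0.1, seed=0):
    """Shuffle the sentences and return (training_set, test_set).

    Hypothetical helper: test_fraction and seed are illustrative defaults.
    """
    shuffled = list(sentences)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


collected = [f"sentence number {i}" for i in range(20)]
train, test = split_sentences(collected)
print(len(train), len(test))  # 18 2
```

Keeping the test sentences out of the training file means the quality measurement reflects sentences the model has not seen.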