SLM training file header

The initial header lines define the content and structure of the XML file:

<?xml version="1.0" encoding="UTF-8" ?>

<!DOCTYPE SLMTraining SYSTEM "SLMTraining.dtd">

<SLMTraining version="1.0.0" xml:lang="en-us">

XML declaration

The first element in the header is always the XML declaration. This element specifies the version of XML used in the document (1.0 or 1.1). It also specifies the encoding that applies for the document, which determines the language(s) that can or cannot be used. Both version and encoding are required attributes.

See XML declaration and encoding type for details.

Document type and system

Optionally, you can use the !DOCTYPE element to define the document type. For a training file, this type is "SLMTraining", as shown in the example above.

The SYSTEM attribute specifies a document type definition (DTD), which must be described in a .dtd file. Specifying a DTD is optional, but is recommended to catch XML formatting errors. The installation includes a SLMTraining.dtd file, which is located in the %SWISRSDK%\config directory.

The example above assumes that the training file is located in the same directory as the DTD file. If the training file is located elsewhere, you must specify the relative path to the DTD file in the !DOCTYPE declaration of the training file header.

<SLMTraining> and language declaration

The <SLMTraining> element opens the main section of an SLM training file. It has two required attributes: the version (1.0.0), and the xml:lang attribute that specifies the main language for the training file.

You can create SLMs for any language installed for Recognizer. Use the xml:lang attribute to specify the target language. The value is a string indicating the language code, for example, en-us. See Setting the language in the grammar header.

Configuration parameters

There are several configuration parameters that can be used in training files. You can specify these parameters and set their values by using the <param> and <value> elements in your training file header:

<param name="ngram_order"><value> 2 </value></param>

<param name="fsm_out"><value>sample.fsm</value></param>

<param name="wordlist_out"><value>sample.wordlist</value></param>

The default values for these parameters are typically acceptable for your initial training iterations. In later iterations, you can test parameter values during tuning. See Tuning SLMs for additional details.

The exception is smooth_weights: when you interpolate models, a non-default setting is recommended even for the first iteration.

Many of the training file parameters tune n-grams. For an overview of n-grams, see SLMs. For a more detailed discussion, see N-gram grammars.
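As a quick check on the <param> and <value> structure described above, the following Python sketch parses a small training file with the standard library and lists the parameters it sets. It is an illustration only: placing the <param> elements directly under <SLMTraining> is an assumption (the actual layout is governed by SLMTraining.dtd), and the helper name read_params is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal training file. The <param> elements are assumed to be
# direct children of <SLMTraining>; check SLMTraining.dtd for the real layout.
TRAINING_FILE = b"""<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE SLMTraining SYSTEM "SLMTraining.dtd">
<SLMTraining version="1.0.0" xml:lang="en-us">
  <param name="ngram_order"><value> 2 </value></param>
  <param name="fsm_out"><value>sample.fsm</value></param>
  <param name="wordlist_out"><value>sample.wordlist</value></param>
</SLMTraining>
"""

# The xml: prefix is bound to this namespace by the XML specification.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_params(xml_bytes):
    """Return (version, language, {param name: value}) for a training file."""
    root = ET.fromstring(xml_bytes)  # the external DTD is not loaded
    params = {}
    for param in root.iter("param"):
        value = param.findtext("value", default="")
        params[param.get("name")] = value.strip()  # drop padding around values
    return root.get("version"), root.get(XML_LANG), params


version, lang, params = read_params(TRAINING_FILE)
print(version, lang, params)
```

The sketch strips whitespace around each value, so " 2 " and "2" are read the same way. Whether the compiler itself ignores that padding is not stated here.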

Available SLM configuration parameters:

SLM parameter	Description
cutoffs	Indicates when to remove bigrams or trigrams that occur infrequently, and are thus statistically insignificant.
discounts_in	Optional. Specifies an input filename for a discounts file (.dcnt), which lets you control the impact of n-grams not contained in the training set.
discounts_out	Optional. Specifies an output filename for computed discounts (see discounts_in).
fsm_out	Optional. Specifies an output filename for the finite state machine (FSM) file that is created when you generate an SLM.
ngram_order	Specifies whether to create a bigram or trigram language model.
print_arpa	Optional. Specifies an output file for writing the SLM in the ARPA format.
smooth_alg	Optional. Applies an industry-standard algorithm while training the language model.
smooth_weights	Specifies the interpolation weights used when interpolating 2-gram and 3-gram probabilities with 1-gram probabilities.
wordlist_out	Optional. Directs the sgc compiler to create a vocabulary word list to accompany the n-gram file specified via fsm_out (see fsm_out).

discounts_in

Optional. Specifies an input filename for a discounts file (.dcnt), which lets you control the impact of n-grams not contained in the training set.

For example:

<param name="discounts_in">

    <value> file.dcnt </value>

</param>

In the context of SLM training, a discount is a multiplier which the compiler uses to account for word permutations that are not explicitly included in the training file as sentences. Discounts are calculated using an industry-standard algorithm, as determined by the smooth_alg parameter.

Without discounting, any n-gram not seen in the training data has zero probability, which makes it unlikely to be recognized in the test data. With discounts, part of the probability mass is reserved to accommodate such n-grams, by multiplying the counts of the training n-grams by a constant smaller than 1.
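The effect of a constant discount can be seen in a small Python sketch. The counts and the multiplier 0.75 are made-up values chosen for easy arithmetic; the sketch shows only the idea of reserving probability mass, not how Recognizer distributes that mass among unseen n-grams.

```python
# Toy counts for two n-grams that share the same history (hypothetical data).
counts = {"call home": 3, "call work": 1}
DISCOUNT = 0.75  # assumed constant multiplier, smaller than 1

total = sum(counts.values())

# Without discounting, the seen n-grams use up all of the probability mass.
undiscounted = {ngram: c / total for ngram, c in counts.items()}

# With discounting, each count is scaled down before normalizing by the
# original total, so the probabilities no longer sum to 1.
discounted = {ngram: c * DISCOUNT / total for ngram, c in counts.items()}

# The remainder is the mass reserved for n-grams not seen in training.
reserved = 1.0 - sum(discounted.values())

print(undiscounted)  # {'call home': 0.75, 'call work': 0.25}
print(discounted)    # {'call home': 0.5625, 'call work': 0.1875}
print(reserved)      # 0.25
```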

If you omit this parameter, Recognizer computes Good-Turing discounts from the input training set. This is suitable if you have small amounts of training data, originating from different applications. Use discounts_in when the smooth_alg parameter specifies a Good-Turing value and you do not want to re-compute the Good-Turing discounts.

A sample discounts file appears below:

Sample discount file

Good-Turing discounts

6 0.39789599 0.68462503 0.71244669 0.87349498 0.90344799 0.79262167

8 0.21760599 0.56865501 0.72032666 0.79289001 0.82419997 0.81001669 0.81023431 0.95779902

7 0.0000000 0.46160099 0.67955333 0.67672497 0.74140197 0.75773168 0.83446717

The first line of a discount file is a comment (above, “Good-Turing discounts”).

The remaining lines correspond to the n-gram orders (1, 2, and 3), and consist of floating-point discount multipliers. The first number on each line indicates how many multipliers follow on that line. Each multiplier corresponds to an n-gram count in the training data.

In the example, there are 6 multipliers for 1-grams, 8 for 2-grams, and 7 for 3-grams. An n-gram whose original count is i has that count multiplied by the i-th multiplier on the line for its order. If i is greater than the number of multipliers on that line, the last multiplier is used.
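The lookup rule above can be sketched in Python as follows. The function names parse_discounts and discounted_count are hypothetical, and the parser assumes only what is described here: a comment line, then one line per n-gram order that starts with the number of multipliers.

```python
SAMPLE = """\
Good-Turing discounts
6 0.39789599 0.68462503 0.71244669 0.87349498 0.90344799 0.79262167
8 0.21760599 0.56865501 0.72032666 0.79289001 0.82419997 0.81001669 0.81023431 0.95779902
7 0.0000000 0.46160099 0.67955333 0.67672497 0.74140197 0.75773168 0.83446717
"""


def parse_discounts(text):
    """Return one list of multipliers per n-gram order (1-grams first)."""
    lines = [line for line in text.splitlines() if line.strip()]
    tables = []
    for line in lines[1:]:  # the first line is a comment
        fields = line.split()
        expected = int(fields[0])
        multipliers = [float(field) for field in fields[1:]]
        if len(multipliers) != expected:
            raise ValueError(f"expected {expected} multipliers, got {len(multipliers)}")
        tables.append(multipliers)
    return tables


def discounted_count(tables, order, count):
    """Multiply a count (>= 1) by the multiplier for that count and order."""
    multipliers = tables[order - 1]
    # Counts beyond the end of the line reuse the last multiplier.
    index = min(count, len(multipliers)) - 1
    return count * multipliers[index]


tables = parse_discounts(SAMPLE)
print([len(t) for t in tables])          # [6, 8, 7]
print(discounted_count(tables, 2, 1))    # 2-gram seen once: 1 * 0.21760599
print(discounted_count(tables, 1, 10))   # 1-gram seen 10 times: 10 * 0.79262167
```

In this sample, a 3-gram seen only once is discounted to zero, because the first multiplier on the 3-gram line is 0.0000000.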

smooth_alg

Optional. Applies an industry-standard algorithm while training the language model.

For an overview of how discounts are used, see discounts_in.

The value is one of the following strings:

Value	Description
GT-disc	Good-Turing discounting, no interpolation.
GT-disc-int	Good-Turing discounting, interpolation with 1-gram probabilities (using the smooth_weights parameter).
GT-discw-int	GT discounting using sentence weights, interpolation with 1-gram probabilities (using the smooth_weights parameter).
INT	No discounting, interpolation with 1-gram probabilities (using the smooth_weights parameter).
WB-disc	Discounted but non-interpolated Witten-Bell. This is the default.
WB-int	Regularly interpolated Witten-Bell, controlled by the smooth_weights parameter.

For example:

<param name="smooth_alg">

    <value> GT-disc </value>

</param>

User dictionaries

You can use the <lexicon> element to specify a user dictionary in the training file. See Pronunciation dictionaries.

The <meta> element

The <meta> element defines a configuration parameter inside the resulting grammar. These parameters are applied to the grammar during compilation. In general, the values are local to the grammar even if the grammar imports (or is imported by) another grammar.

Note: Metas are not saved in the finite state machine and wordlist files. Put them in the wrapper grammar.