SLM training file header

The initial header lines define the content and structure of the XML file:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE SLMTraining SYSTEM "SLMTraining.dtd">
<SLMTraining version="1.0.0" xml:lang="en-us">

XML declaration

The first line of the header is always the XML declaration. It specifies the version of XML used in the document (1.0 or 1.1) and the encoding that applies to the document, which determines which languages can be used in the file. Both version and encoding are required attributes.

See XML declaration and encoding type for details.

Document type and system

Optionally, you can use the !DOCTYPE element to define the document type. For a training file, this type is "SLMTraining", as shown in the example above.

The SYSTEM keyword specifies a document type definition (DTD), which must be described in a .dtd file. Specifying a DTD is optional, but recommended because it catches XML formatting errors. The installation includes an SLMTraining.dtd file, which is located in the %SWISRSDK%\config directory.

The example above assumes that the training file is located in the same directory as the DTD file. If the training file is located elsewhere, you must specify the full or relative path to the DTD file in the training file header.
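
As a rough illustration, a training file stored outside the DTD directory might reference the DTD through a relative path, as sketched below. The directory layout is hypothetical; substitute the path that applies to your installation.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<!-- Hypothetical layout: training file in project/training, DTD in project/config -->
<!DOCTYPE SLMTraining SYSTEM "../config/SLMTraining.dtd">
<SLMTraining version="1.0.0" xml:lang="en-us">
  <!-- parameters and training content go here -->
</SLMTraining>
```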

<SLMTraining> and language declaration

The <SLMTraining> element opens the main section of an SLM training file. It has two required attributes: the version (1.0.0), and the xml:lang attribute that specifies the main language for the training file.

You can create SLMs for any language installed for Recognizer. Use the xml:lang attribute to specify the target language. The value is a string indicating the language code, for example, en-us. See Setting the language in the grammar header.
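
For instance, a training file for a language other than US English might open as sketched below. This assumes that Canadian French is installed for Recognizer and that its language code is fr-ca; neither is confirmed by this section.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<!-- Assumption: fr-ca is an installed Recognizer language -->
<SLMTraining version="1.0.0" xml:lang="fr-ca">
  <!-- parameters and training content go here -->
</SLMTraining>
```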

Configuration parameters

Training files support several configuration parameters. To set a parameter, use the <param> and <value> elements in your training file header:

<param name="ngram_order"><value> 2 </value></param>
<param name="fsm_out"><value>sample.fsm</value></param>
<param name="wordlist_out"><value>sample.wordlist</value></param>

The default values for these parameters are typically acceptable for your initial training iterations. In later iterations, you can test parameter values during tuning. See Tuning SLMs for additional details.

The exception is smooth_weights: when interpolating models, a non-default setting is recommended even for the first iteration.

Many of the training file parameters tune n-grams. For an overview of n-grams, see SLMs. For a more detailed discussion, see N-gram grammars.

Available SLM configuration parameters:

| SLM parameter | Description |
|---|---|
| cutoffs | Indicates when to remove bigrams or trigrams that occur infrequently, and are thus statistically insignificant. |
| discounts_in | Optional. Specifies an input filename for a discounts file (.dcnt), which lets you control the impact of n-grams not contained in the training set. |
| discounts_out | Optional. Specifies an output filename for computed discounts (see discounts_in). |
| fsm_out | Optional. Specifies an output filename for the finite state machine (FSM) file that is created when you generate an SLM. |
| ngram_order | Specifies whether to create a bigram or trigram language model. |
| print_arpa | Optional. Specifies an output file for writing the SLM in the ARPA format. |
| smooth_alg | Optional. Applies an industry-standard algorithm while training the language model. |
| smooth_weights | Shows the interpolation weights used when interpolating 2-gram and 3-gram probabilities with 1-gram probabilities. |
| wordlist_out | Optional. Directs the sgc compiler to create a vocabulary word list to accompany the n-gram file specified via fsm_out (see fsm_out). |
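
As a rough sketch of how several of these parameters might be combined, the header below requests a trigram model and writes FSM, word list, and ARPA output files. The value 3 for ngram_order is inferred from the bigram example earlier in this section, and the sample3 filenames are hypothetical. Parameters whose value formats are not described here (cutoffs, smooth_alg, smooth_weights) are left at their defaults.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE SLMTraining SYSTEM "SLMTraining.dtd">
<SLMTraining version="1.0.0" xml:lang="en-us">
  <!-- Assumed: 3 selects a trigram model, by analogy with 2 for a bigram -->
  <param name="ngram_order"><value>3</value></param>
  <!-- Hypothetical output filenames -->
  <param name="fsm_out"><value>sample3.fsm</value></param>
  <param name="wordlist_out"><value>sample3.wordlist</value></param>
  <param name="print_arpa"><value>sample3.arpa</value></param>
  <!-- training content goes here -->
</SLMTraining>
```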

User dictionaries

You can use the <lexicon> element to specify a user dictionary in the training file. See Pronunciation dictionaries.

The <meta> element

The <meta> element defines a configuration parameter inside the resulting grammar. These parameters are applied to the grammar during compilation. In general, the values are local to the grammar even if the grammar imports (or is imported by) another grammar.

Note: <meta> settings are not saved in the finite state machine and wordlist files. Put them in the wrapper grammar instead.