SLM training file header
The initial header lines define the content and structure of the XML file:
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE SLMTraining SYSTEM "SLMTraining.dtd">
<SLMTraining version="1.0.0" xml:lang="en-us">
XML declaration
The first line of the header is always the XML declaration. It specifies the version of XML used in the document (1.0 or 1.1) and the character encoding of the document, which determines which characters, and therefore which languages, the file can contain. Both version and encoding are required attributes.
See XML declaration and encoding type for details.
Document type and system
Optionally, you can use the <!DOCTYPE> declaration to define the document type. For a training file, this type is "SLMTraining", as shown in the example above.
The SYSTEM keyword specifies a document type definition (DTD), which must be described in a .dtd file. Specifying a DTD is optional, but is recommended to catch XML formatting errors. The installation includes an SLMTraining.dtd file, which is located in the %SWISRSDK%\config directory.
The example above assumes that the training file is located in the same directory as the DTD file. If the training file is located elsewhere, you must include the relative path to the DTD file in the training file header.
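As an illustration of the path adjustment described above, the following sketch shows only the header lines of a training file stored outside the DTD directory. The directory layout is hypothetical: it assumes the DTD sits in a config directory that is a sibling of the directory holding the training file.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<!-- Hypothetical layout: SLMTraining.dtd is in ../config/ relative to this file -->
<!DOCTYPE SLMTraining SYSTEM "../config/SLMTraining.dtd">
<SLMTraining version="1.0.0" xml:lang="en-us">
```

Only the system identifier changes; the document type name (SLMTraining) stays the same.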
<SLMTraining> and language declaration
The <SLMTraining> element opens the main section of an SLM training file. It has two required attributes: the version (1.0.0), and the xml:lang attribute that specifies the main language for the training file.
You can create SLMs for any language installed for Recognizer. Use the xml:lang attribute to specify the target language. The value is a string indicating the language code, for example, en-us. See Setting the language in the grammar header.
Configuration parameters
Training files support several configuration parameters. Set them with the <param> and <value> elements in the training file header:
<param name="ngram_order"><value> 2 </value></param>
<param name="fsm_out"><value>sample.fsm</value></param>
<param name="wordlist_out"><value>sample.wordlist</value></param>
The default values for these parameters are typically acceptable for your initial training iterations. In later iterations, you can test parameter values during tuning. See Tuning SLMs for additional details.
The exception is smooth_weights: when you interpolate models, a non-default setting is recommended even for the first iteration.
Many of the training file parameters tune n-grams. For an overview of n-grams, see SLMs. For a more detailed discussion, see N-gram grammars.
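To show how the header pieces fit together, here is a minimal sketch of a training file skeleton that requests a trigram model. It rests on assumptions: that ngram_order accepts 3 for a trigram model (just as the earlier example uses 2), that <param> elements sit directly inside <SLMTraining>, and that the output filenames (orders.fsm, orders.wordlist) are placeholders. The training content itself is omitted.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE SLMTraining SYSTEM "SLMTraining.dtd">
<SLMTraining version="1.0.0" xml:lang="en-us">
  <!-- Assumption: 3 selects a trigram model, as 2 selects a bigram model -->
  <param name="ngram_order"><value>3</value></param>
  <!-- Hypothetical output filenames for the FSM and its word list -->
  <param name="fsm_out"><value>orders.fsm</value></param>
  <param name="wordlist_out"><value>orders.wordlist</value></param>
  <!-- Training content omitted from this sketch -->
</SLMTraining>
```

All other parameters are left at their defaults here, which matches the recommendation for initial training iterations.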
Available SLM configuration parameters:
SLM parameter | Description |
---|---|
 | Indicates when to remove bigrams or trigrams that occur infrequently, and are thus statistically insignificant. |
discounts_in | Optional. Specifies an input filename for a discounts file (.dcnt), which lets you control the impact of n-grams not contained in the training set. |
 | Optional. Specifies an output filename for computed discounts (see discounts_in). |
fsm_out | Optional. Specifies an output filename for the finite state machine (FSM) file that is created when you generate an SLM. |
ngram_order | Specifies whether to create a bigram or trigram language model. |
 | Optional. Specifies an output file for writing the SLM in the ARPA format. |
 | Optional. Applies an industry-standard algorithm while training the language model. |
smooth_weights | Shows the interpolation weights used when interpolating 2-gram and 3-gram probabilities with 1-gram probabilities. |
wordlist_out | Optional. Directs the sgc compiler to create a vocabulary word list to accompany the n-gram file specified via fsm_out (see fsm_out). |
User dictionaries
You can use the <lexicon> element to specify a user dictionary in the training file. See Pronunciation dictionaries.
The <meta> element
The <meta> element defines a configuration parameter inside the resulting grammar. These parameters are applied to the grammar during compilation. In general, the values are local to the grammar even if the grammar imports (or is imported by) another grammar. The following parameters can be set with <meta>:
- swirec_compile_parser
- swirec_enable_robust_compile
- swirec_first_pass_grammar
- swirec_fsm_grammar
- swirec_fsm_wordlist
- swirec_max_dict_prons
- swirec_multiword_replace
- swirec_normalize_to_probabilities
- swirec_optimization
- swirec_training_grammar
Note: Metas are not saved in the finite state machine and wordlist files. Put them in the wrapper grammar.
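As a rough sketch of the note above, a wrapper grammar header might carry the metas as shown below. This assumes the SRGS-style name and content attributes for <meta>, and assumes that swirec_fsm_grammar and swirec_fsm_wordlist take the filenames written by fsm_out and wordlist_out in the earlier example. Check both assumptions against the reference for each parameter.

```xml
<!-- Fragment of a wrapper grammar header (sketch; attribute form assumed from SRGS) -->
<meta name="swirec_fsm_grammar" content="sample.fsm"/>
<meta name="swirec_fsm_wordlist" content="sample.wordlist"/>
```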