SLM training file main body

The main body of a training file contains a vocabulary section, which defines the words allowed in the SLM, and a training section, which lists example sentences that use those words.

The vocabulary section

The vocabulary section of the training file defines all words allowed in the training section. Words that appear in the training sentences but not in the vocabulary section will be ignored by the compiler.

The vocabulary section is defined by the <vocab> element. Within the <vocab> elements, words and classes are defined with <item> and <ruleref> elements.

The <ruleref> element (grammar classes)

The training section

The training section of the file lists example sentences that use the words from the vocabulary section. These sentences are used by the compiler to determine the probabilities to be used in recognizing user utterances.

The training section is defined by the <training> element. Within the <training> elements, sentences are defined with <sentence> element pairs.

The order of the sentences has no effect on the trained results.

The test section

Optionally, the training file may include a test section defined by a pair of <test> elements. This section lists sentences that could be used to test the SLM.

The test section is not used during SLM training. However, it comes into play when the SLM is used to support an SSM (see SLMs).

Very large training files

Although large training files can improve the quality of an SLM, those large files can be difficult to manage. They can become unwieldy to edit, and some third-party software may not accept files over a certain size.

Two techniques for dividing large training files into collections of smaller files are described below.

External training files (and compression)

Optionally, you can replace any <item> or <sentence> in a <vocab>, <training>, or <test> section with an <external> element that points to an external training file. This is especially useful for large test sets because the external file requires no XML elements and can be in a compressed format.

The <external> element has one attribute, uri, which is set to the URI of an external file. The URI must be a local path. The external file must use UTF-8 encoding. Format:

<external uri="myLocalPath\myFilename"/>

The URI can specify a file compressed with GNU Zip (gzip). For example:

<external uri="vocab.txt.gz"/>

External training files have different headers depending on where they appear. The header is the first line of the file, and is one of the following:

Header	Description
::VOCAB	Header for the <vocab> section.
::SLMDATA	Header for the <training> or <test> sections.

After the header, each line contains training data:

For <vocab> each line defines one vocabulary word.
For <training> and <test> each line defines one sentence.
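As a minimal sketch of the layout described above, the following Python helper writes an external file with the header on the first line and one entry per following line, encoded as UTF-8 and gzip-compressed when the file name ends in .gz. The function name write_external_file and the use of "\n" line endings are assumptions made for illustration; they are not part of the SLM tooling.

```python
import gzip
from pathlib import Path

VALID_HEADERS = ("::VOCAB", "::SLMDATA")

def write_external_file(path, header, lines):
    """Write header + one entry per line as UTF-8 (hypothetical helper)."""
    if header not in VALID_HEADERS:
        raise ValueError(f"unknown header: {header!r}")
    # Assumption: plain "\n" line endings; the header is always the first line.
    text = "\n".join([header, *lines]) + "\n"
    data = text.encode("utf-8")
    path = Path(path)
    if path.suffix == ".gz":
        # Compress with gzip when the target name ends in .gz.
        data = gzip.compress(data)
    path.write_bytes(data)
    return path
```

For example, write_external_file("vocab.txt.gz", "::VOCAB", ["are", "animals"]) would produce a compressed vocabulary file that an <external uri="vocab.txt.gz"/> element could reference.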

The system assumes the default language unless a word has a language identifier. A language identifier consists of an exclamation mark (!) followed by a language code, for example !en-us. Blank spaces around the ! are allowed. Here are vocabulary words with mixed languages:

::VOCAB

this

is!en-us

a !en-us

vocabulary! en-us

and ! en-us

esto !es-us

es !es-us

un !es-us

vocabulario !es-us

For a <training> or <test> section, each word of a sentence can have its own language identifier. A language code applies to a single word, not to a phrase. Words with no identifier use the default language. Here are test sentences with mixed languages:

::SLMDATA

this!en-us is a vocabulary !en-us

esto !es-us es!es-us un!es-us vocabulario!es-us
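To make the identifier rules concrete, here is a rough Python sketch that pairs each word on a vocabulary line or sentence with its language code. It assumes that a word never contains "!" or whitespace and that a weight prefix, if any, has already been removed. The name tag_languages and the default_lang parameter are hypothetical and not part of the SLM compiler.

```python
import re

# A word is a run of characters that are neither whitespace nor "!".
# It may be followed by "!" and a language code, with optional blanks
# on either side of the "!".
_TOKEN = re.compile(r"([^\s!]+)(?:\s*!\s*([^\s!]+))?")

def tag_languages(line, default_lang=None):
    """Return (word, language) pairs for one vocabulary line or sentence."""
    # findall yields "" for a missing language code; fall back to the default.
    return [(word, lang or default_lang) for word, lang in _TOKEN.findall(line)]
```

Applied to the first sentence above, the sketch tags "this" and "vocabulary" with en-us and leaves "is" and "a" on the default language.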

A training file can indicate the weight of each sentence by adding a count and a prior as a prefix:

Count multiplies the occurrence of a sentence in the training data. It is a shortcut for repeating the same sentence multiple times. The value is an integer, and the default is 1.
Prior is a log probability. The value is a floating-point number, and the default is 1.0.

The default weight is 1.0. Specify the count and prior at the beginning of a sentence, separated by whitespace and followed by a comma. The following sentences are valid and have the same meaning:

this!en-us is a vocabulary !en-us

, this!en-us is a vocabulary !en-us

1 1.0, this!en-us is a vocabulary !en-us

The example above only restates the default behavior. In the following example, the first sentence has a count of 10 and a prior of 2.0, the second sentence has a count of 10 and a default prior of 1.0, and the last two sentences have default weights:

10 2.0, sentence number one

10, sentence number two

,sentence number three

sentence number four
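The prefix rules can be sketched in Python as follows. This is an illustration only: the function name parse_weighted_sentence is hypothetical. The sketch assumes the first field before the comma is an integer count and the optional second field is a floating-point prior. Any line whose text before the first comma does not fit that pattern is treated as a sentence with no prefix, which is an assumption rather than documented compiler behavior.

```python
def parse_weighted_sentence(line):
    """Split one ::SLMDATA line into (count, prior, sentence)."""
    default = (1, 1.0, line.strip())
    prefix, comma, rest = line.partition(",")
    if not comma:
        # No comma at all: the whole line is the sentence.
        return default
    fields = prefix.split()
    if len(fields) > 2:
        # Assumption: more than two fields means the comma belongs to the text.
        return default
    try:
        count = int(fields[0]) if len(fields) >= 1 else 1
        prior = float(fields[1]) if len(fields) == 2 else 1.0
    except ValueError:
        # Assumption: a non-numeric prefix is ordinary sentence text.
        return default
    return count, prior, rest.strip()
```

With this sketch, "10 2.0, sentence number one" yields a count of 10 and a prior of 2.0, while ",sentence number three" and "sentence number four" both yield the defaults.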

You can use rule references anywhere in an SLM external file where a word is allowed. The purpose is the same as using the <ruleref> element in an XML file, but the syntax is different. Format:

$$<URI>

Examples:

$$grammar.grxml

$$http://grammarServer.com/grammar.grxml
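As a small illustration of the $$ syntax, the Python sketch below collects the grammar URIs referenced on one line of an external file. It assumes tokens are separated by whitespace and that any weight prefix has already been stripped. The name rule_reference_uris is hypothetical.

```python
def rule_reference_uris(line):
    """Return the URIs of all $$ rule references on one line."""
    uris = []
    for token in line.split():
        # A rule reference is "$$" immediately followed by a non-empty URI.
        if token.startswith("$$") and len(token) > 2:
            uris.append(token[2:])
    return uris
```

A line with no $$ token yields an empty list, so the same function can be run over every line of a vocabulary or training file.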

Here is a complete external SLM vocabulary file:

::VOCAB

are

animals

$$animals.grxml

Here is a complete external SLM training file:

::SLMDATA

2.5,$$animals.grxml are animals
