SSM training file

Below is an example training file. Descriptions of the header and main body of the XML-formatted file follow.

<!DOCTYPE SSMTraining SYSTEM "SSMTraining.dtd"> 
<SSMTraining version="1.0.0" xml:lang="en-us">
    <features>
        <word>broken</word>
        <word>computer</word>
        <word>is</word>
        <!-- more words, -->
        <word>what</word>
        <word>are</word>
        <word>promotions</word>
    </features>
    <semantic_models>
        <SSM>
            <meaning prior="1.0">
                <slot name="route">sales</slot>
            </meaning>
            <meaning prior="0.8">
                <slot name="route">tech_support</slot>
            </meaning>
        </SSM>
    </semantic_models>
    <training>
        <sentence count="1">
            <semantics>
                <slot name="route">tech_support</slot>
            </semantics>
            my computer is broken
        </sentence>
        <sentence count="1">
            <semantics>
                <slot name="route">sales</slot>
            </semantics>
            what are the promotions
        </sentence>
    </training>
</SSMTraining>

SSM training file header

The initial header lines for an SSM training file are similar to those required in an SLM training file (see SLM training file header):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE SSMTraining SYSTEM "SSMTraining.dtd"> 
<SSMTraining version="1.0.0" xml:lang="en-us">

The main differences are that the document type is "SSMTraining", the related document type definition is SSMTraining.dtd (also located in the %SWISRSDK%\config directory), and the main declaration uses an <SSMTraining> element rather than the <SLMTraining> element.

<SSMTraining> element

Unlike the <SLMTraining> element, the <SSMTraining> element supports neither the <meta> element for specifying configuration parameters nor the <lexicon> element for specifying a user dictionary. To specify parameters, you must instead use the <param> and <value> elements in the main body of the file. See SSM training configuration parameters for details.

However, <SSMTraining> does support the <training> and <test> elements. See Training and test sections.

The <SSMTraining> element also supports two SSM-specific elements used in the training file main body: the <semantic_models> and <features> elements.

SSM training file main body

The main body of the training file uses several elements that are specific to SSM training. These are organized into two main sections of the file: the <features> section, and the <semantic_models> section.

<features> section (vocabulary and classes)

The <features> section in an SSM training file serves a similar role to the <vocab> section in an SLM training file: it defines a set of vocabulary words and classes using the <word> and <ruleref> elements. This section lists every word allowed in the other sections of the file. (Words omitted from it are ignored if they appear in sentences.)
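
As a rough sketch, a <features> section that mixes individual words with a class might look like the fragment below. The uri attribute on <ruleref> and the grammar file name city.grxml are assumptions made for illustration only; check SSMTraining.dtd for the exact syntax.

<features>
    <word>fly</word>
    <word>to</word>
    <!-- Hypothetical class: the attribute name and grammar file are assumed -->
    <ruleref uri="city.grxml"/>
</features>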

You can also use ECMAScript in the <features> section to modify or augment <ruleref> meanings. See Feature extraction and ECMAScript.

<semantic_models> section

The <semantic_models> section is a required section. It declares the SSM label classifiers, sets parameters, and lists all possible meanings returned by the SSM. The main entries in the section consist of <SSM> elements, which specify the label names, and define the associated meanings using the <meaning> element. The meaning element itself may have different attributes, as discussed below.

As previously discussed, SSM training configuration parameters can only be set in this section. See SSM training configuration parameters for details.

The training file fragment below specifies a classifier labelled "action". By default, the <SSM> element fills a slot of the specified label name. In this example, the action slot has possible values of dial and enroll:

<semantic_models>
    <SSM label="action">
        <meaning prior="-1.3">
            dial
        </meaning> 
        <meaning prior="-0.8">
            enroll
        </meaning>
    </SSM>
</semantic_models>

Named slots take precedence over labels. If an <SSM> element has both a label and named slots in its <meaning> elements, then the label merely identifies the SSM, and the meanings determine the values of the named slots.
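
A minimal sketch of this case follows; the label name router is hypothetical. Under the rule above, the label only identifies the SSM, and the classifier's decision fills the action slot:

<semantic_models>
    <!-- "router" is an illustrative label; it refers to the SSM itself -->
    <SSM label="router">
        <meaning>
            <slot name="action">dial</slot>
        </meaning>
        <meaning>
            <slot name="action">enroll</slot>
        </meaning>
    </SSM>
</semantic_models>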

You can specify an initial probability for each meaning using the prior attribute of the <meaning> element. The training program uses such initial probabilities as preliminary values, and adjusts them during processing. (For a related discussion, see use_prior_weight.)

By default, the special key SWI_meaning is set to the concatenation of all labels set in the SSM (see SWI_meaning). You can also set SWI_meaning explicitly as a single slot, just as you would any label:

<SSM label="SWI_meaning">
    <meaning> dial </meaning> 
</SSM>

In the next example, a single decision by the classifier sets two slots, action and destination:

<semantic_models>
    <SSM>
        <meaning>
            <slot name="action">dial</slot>
            <slot name="destination">home</slot>
        </meaning> 
        <meaning>
            <slot name="action">dial</slot>
            <slot name="destination">office</slot>
        </meaning>
    </SSM>
</semantic_models>

<meaning> element

The <meaning> element, which must be a child of an <SSM> element, is a container for slots and values. There is a limit of 5,000 meanings in an SSM.

Training and test sections

The <training> and <test> sections declare sentences for training the SSM and estimating its accuracy. The elements have identical syntax and child elements.
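
Because the two elements share the same syntax, a 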

The sentences in the <training> and <test> sections should each independently reflect the distribution of sentences seen in the actual application. The same literal sentence can be used in both sections; however, the training and test sections cannot be identical. The best way to select training sentences is to draw a number of sentences at random from your corpus of data, and then use the remaining sentences for the test section.

Typically, a training file has one <training> element and one <test> element, but you can divide sentences among several training or test sections to allow for different settings of the feature_extraction attribute.