Feature extraction and ECMAScript

When you train an SSM, the training tool automatically discovers correlations between features in an utterance and its interpretation. The primary function of the <features> element is to supply the features on which these correlations are based.

Optionally, you can also add domain knowledge to the training process by defining interpretation features in <ruleref> elements. Thus, the <features> element has a dual role: it improves the trained SSM by generalizing classes of information (such as dates, times, currency amounts, and so on), and it improves recognition results by letting scripts extract semantic information at runtime.

At runtime, in addition to the SSM classification of meanings, your ECMAScript can modify or augment those meanings. Your scripts can add new slots and values to the final results simply by assigning values to variables.

At runtime, the ECMAScript runs as follows:

  • During feature computation of an SSM, any rule that matches the recognized text can run an associated ECMAScript.
  • After all results for all parallel SSMs are defined, you can run ECMAScript on the full result object from those SSMs. Declare this script with the Recognizer configuration parameter swirec_grammar_script.

Types of interpretation features

Your rules can train several types of interpretation features:

  • Remove features specify words that must not be used as features: the SSM treats any matching words as if they did not exist. They define phrases that contain no useful information for interpretation. For example, you can specify pause fillers (um, uh, and so on), or words that appear frequently but do not contain information (the, but, please, maybe, and so on).
  • Fragment features combine words and the matching rule name (the words and the rule name are used as features), so individual words in a phrase may also be counted by the SSM. The final probabilities are thus affected by both the individual words and the phrase as a whole.

    You can use fragment features to define a phrase or concept that has a particular importance beyond that of the individual words in it. For example, suppose that your application has a product called the “hundred minute plan”. You could use fragment features so that the words “hundred”, “minute”, and “plan” affect the SSM, but there is an added effect when the complete phrase “hundred minute plan” appears in an utterance.

  • Stem features replace words with the matching rule name—the phrase is used as a feature, but the individual words within it have no additional effect. The final probabilities are thus affected by only the phrase if the rule is triggered; if the rule is not triggered—that is, if only part of the phrase was spoken, or words that happen to occur within the phrase were spoken in a different context—then only the probabilities for each individual word apply.

    You can use stem features to account for situations where the combination of two or more words changes the natural meaning of those words. For example, the phrases “statistical language model” and “statistical semantic model” both use the individual words “statistical” and “model”; but each phrase as a whole means something very different.

Note: Add interpretation features carefully, and validate their effect on the performance before using the SSM in a deployed application.

Here is how you specify feature types:

<features>
    <ruleref uri="my.xml#garbage" feature_generation="remove"/>
    <ruleref uri="my.xml#rulea" feature_generation="stem"/>
    <ruleref uri="my.xml#ruleb" feature_generation="fragment"/>
</features>

Specifying rules and scripts in features

Any <ruleref> appearing in a <features> section must have a unique name. In other words, no two <features> sections can have <ruleref> elements of the same name (text after the pound symbol, #), even if they are in different SSMs.

The <ruleref> can contain a <tag> element to define ECMAScript. The script runs when the rule fires. For example:

<ruleref uri="mydate.xml#DATE" feature_generation="stem">
    <tag> 
        date=DATE.date;
        var foo = 13;
        has_date = 1;
        SWI_meaning = "{date:"+date+"}";
    </tag>
</ruleref>

Above, the example adds a date slot to the results, and the value of date is returned by DATE (for example, 20060828). Also note:

  • The script explicitly adds the date value to SWI_meaning. (By default, the slot is not added.) To add slots and values to SWI_meaning, you must explicitly assign a string of the form "{key:value}".
  • The script adds the has_date slot to the results (with value=1), but this key is not added to SWI_meaning. (Again, because it must be added explicitly.)
  • By assigning values to variables, scripts can automatically add new slots and values to the final results (however, you must not declare these variables with "var"). In the example, the variable foo is declared using "var", and therefore has no effect on the results.
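
The difference between declared and undeclared variables follows from ordinary ECMAScript scoping, and it can be reproduced outside the Recognizer. The sketch below is only an illustration of that scoping rule, not the Recognizer's actual mechanism: runRuleScript is a hypothetical helper, and the DATE object is a stand-in for the value returned by the date rule. It runs the script text from the example and reports which variables survive as globals.

```javascript
// Hypothetical helper: runs a rule script and reports which variables it
// leaves behind as globals. Code created with the Function constructor is
// non-strict, so an undeclared assignment creates a global property, while
// a variable declared with "var" stays local to the script.
function runRuleScript(source, inputs) {
    var inputNames = Object.keys(inputs);
    inputNames.forEach(function (name) { globalThis[name] = inputs[name]; });

    var before = Object.keys(globalThis);
    new Function(source)();

    var slots = {};
    Object.keys(globalThis).forEach(function (name) {
        if (before.indexOf(name) === -1) {
            slots[name] = globalThis[name];
            delete globalThis[name];   // clean up so runs do not affect each other
        }
    });
    inputNames.forEach(function (name) { delete globalThis[name]; });
    return slots;
}

// Same statements as the <tag> example above.
var script =
    'date = DATE.date;' +
    'var foo = 13;' +
    'has_date = 1;' +
    'SWI_meaning = "{date:" + date + "}";';

// DATE is a stand-in for the result of the date rule (value assumed).
var slots = runRuleScript(script, { DATE: { date: "20060828" } });
console.log(slots);
```

In this sketch, date, has_date, and SWI_meaning appear in the returned object, while foo does not, which mirrors the behavior described in the list above.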

To continue the example, if the recognized text were "please activate my service on january fourth", the word "please" is likely to be removed by a #garbage rule, and the SSM computes a meaning using the features "activate my service on DATE".
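
The effect of the three feature_generation values on a recognized sentence can be mimicked with a small simulation. This is a rough sketch under stated assumptions, not the training tool's algorithm: the rule table, the rule name PLAN, the helper names, and the order in which fragment features are emitted are all hypothetical, and real rules are grammars rather than fixed phrase lists.

```javascript
// Hypothetical rule table standing in for <ruleref> entries. Each rule lists
// the phrases it matches; real rules are grammars, not fixed phrase lists.
var rules = [
    { name: "garbage", mode: "remove",   phrases: ["please", "um", "uh"] },
    { name: "DATE",    mode: "stem",     phrases: ["january fourth"] },
    { name: "PLAN",    mode: "fragment", phrases: ["hundred minute plan"] }
];

// Returns the first rule phrase that matches the words starting at `start`,
// or null when no rule applies at that position.
function matchAt(words, start) {
    for (var r = 0; r < rules.length; r++) {
        for (var p = 0; p < rules[r].phrases.length; p++) {
            var phrase = rules[r].phrases[p].split(" ");
            var candidate = words.slice(start, start + phrase.length);
            if (candidate.join(" ") === phrase.join(" ")) {
                return { rule: rules[r], words: phrase };
            }
        }
    }
    return null;
}

// Turns recognized text into the list of features used for classification.
function extractFeatures(text) {
    var words = text.split(" ");
    var features = [];
    var i = 0;
    while (i < words.length) {
        var match = matchAt(words, i);
        if (match === null) {
            features.push(words[i]);          // ordinary word: keep as-is
            i += 1;
            continue;
        }
        if (match.rule.mode === "stem") {
            features.push(match.rule.name);   // rule name replaces the words
        } else if (match.rule.mode === "fragment") {
            // both the individual words and the rule name become features
            features = features.concat(match.words, [match.rule.name]);
        }
        // "remove": emit nothing for the matched words
        i += match.words.length;
    }
    return features;
}

console.log(extractFeatures("please activate my service on january fourth").join(" "));
console.log(extractFeatures("um i want the hundred minute plan").join(" "));
```

The first call yields "activate my service on DATE", matching the example above. The second call keeps the words "hundred", "minute", and "plan" and adds the assumed rule name PLAN as an extra feature, which is the fragment behavior.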

Feature extraction is automatically computed from the SSM training file. There is no need to pre-expand words to their associated rules. However, if you do have a pre-expanded training file, you can disable automatic expansions to speed SSM generation (see feature_extraction attribute).

Parallel SSMs and ECMAScript

When designing SSMs, it’s helpful to remember the layers of semantic interpretation that can occur. For example, imagine two SSMs that handle sentences like the following:

“I want a return flight for Wednesday, October 13.”

One SSM fills the Action slot with the value "return_flight", and the other SSM fills the Object slot with "date". At this stage of interpretation, we know a date was recognized, but do not know which date. To provide an answer, the SSM contains a <ruleref> to the built-in date grammar (thus, training the SSM on all possible dates and not the specific dates in the sample sentences). The date rule executes ECMAScript at runtime, and the script provides the actual date.

To summarize:

  1. First, the SSM extracts features and meanings and sets its slot.
  2. Second, you can write <ruleref> scripts for individual features. At runtime, the scripts extract meanings and set slots for features identified by the SSM.
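
As a rough picture of these layers, the sketch below combines the slots from the two SSMs in the flight example with the slot set by the date rule's script. The object shapes, the combineLayers helper, and the date value are assumptions made for illustration (the sample sentence gives no year); the actual layout of the Recognizer's result object is not shown here.

```javascript
// Hypothetical helper: copies the slots from each SSM result, and then the
// slots set by rule scripts, into one combined object.
function combineLayers(ssmResults, ruleSlots) {
    var combined = {};
    ssmResults.concat([ruleSlots]).forEach(function (layer) {
        Object.keys(layer).forEach(function (slot) {
            combined[slot] = layer[slot];
        });
    });
    return combined;
}

var result = combineLayers(
    [
        { Action: "return_flight" },   // slot filled by the first SSM
        { Object: "date" }             // slot filled by the second SSM
    ],
    { date: "20101013" }               // set by the date rule's script (assumed value and year)
);

console.log(result);
```

The SSMs establish that the caller wants a return flight and mentioned a date; the rule script contributes which date was spoken.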