N-gram grammars

Recognizer supports grammar syntax for n-gram (Markovian) stochastic grammars in VoiceXML.

Normally, you set the n-gram order of an SLM with the ngram_order parameter in the SLM training file header. This is the recommended method. However, Recognizer also supports the Stochastic Language Models (N-Gram) Specification W3C draft proposal (3 January 2001), in two forms that have important format differences between them:

  • SWIlanguageModel n-gram specified in an SRGS grammar: Provided for backwards compatibility with old releases of the OpenSpeech Recognizer (OSR). This form of the n-gram grammar is used to specify a language model for an SRGS grammar. See SWIlanguageModel n-gram.
  • Standalone n-gram grammar as used by Recognizer:

    N-gram language models can be used to predict the likelihood that sequences of words, such as word pairs (bigrams) or word triples (trigrams), will be spoken as part of a user utterance. Like other natural language models, the n-gram language model is constructed from a large sample of training sentences that display the characteristics expected in regular user input.

    The grammar compiler can write an n-gram grammar when compiling Statistical Language Models (SLMs). See Compiling n-gram grammars.

Syntax and standards

Recognizer supports the vocabulary (<vocab>) and the count tree (<tree>) elements of the W3C n-gram draft proposal, which allows for a broad interpretation of its contents. Recognizer also extends the specification to include meta tags (<meta>) and pronunciation dictionaries (<lexicon>).

  • The W3C draft allows for arbitrary n-grams; Recognizer is limited to bigrams and trigrams.
  • Recognizer does not allow direct input of weights. Instead, count structures are used as input (ideally from a training corpus), and these are computed into the probabilities used for recognition.
  • For details, see the W3C proposal.
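The relationship between count structures and probabilities can be illustrated with a minimal Python sketch. It is not Recognizer's implementation: the source does not say how Recognizer smooths or estimates its probabilities, so the sketch assumes a plain relative-frequency estimate, and the function names ngram_counts and conditional_probability are hypothetical.

```python
from collections import Counter


def ngram_counts(sentences, max_order=3):
    """Count every n-gram of order 1..max_order in whitespace-tokenized sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for order in range(1, max_order + 1):
            for start in range(len(words) - order + 1):
                counts[tuple(words[start:start + order])] += 1
    return counts


def conditional_probability(counts, history, word):
    """Unsmoothed estimate: count(history + word) / count(history).

    Assumption: a real recognizer would normally apply smoothing or
    back-off; none is applied here.
    """
    history = tuple(history)
    denominator = counts[history]
    if denominator == 0:
        return 0.0  # unseen history: no estimate available
    return counts[history + (word,)] / denominator


# Illustrative training corpus (made up for this sketch).
corpus = ["fly to boston", "fly to denver", "drive to boston"]
counts = ngram_counts(corpus)
print(conditional_probability(counts, ("to",), "boston"))
print(conditional_probability(counts, ("fly", "to"), "boston"))
```

The max_order default of 3 mirrors the bigram and trigram limit described above; unigram counts are kept because they serve as the denominators for the bigram estimates.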

Media types and import guidelines

The media type for n-gram source grammars is:

application/x-swi-ngram+xml

The media type for a compiled n-gram grammar is the same as for SRGS:

application/x-swi-grammar

Below are guidelines for importing:

  • An n-gram grammar has an implicit root rule, so SRGS grammars can import it directly.
  • The n-gram grammar may import SRGS rules for modeling as lexical tokens with unigram, bigram, and trigram counts and weights.
  • Recursive importing between the same SRGS and n-gram grammars is not allowed (and results in an error).
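Because a recursive import results in an error, it can be useful to check a set of grammars for import cycles before compiling them. The following Python sketch shows one way to do that, assuming the import relationships have already been collected into a dictionary. The dictionary, the file names, and the function find_import_cycle are hypothetical; none of them is part of Recognizer.

```python
def find_import_cycle(imports, start):
    """Return the grammars forming an import cycle reachable from start, or None.

    imports maps a grammar name to the list of grammars it imports.
    """
    path = []        # grammars on the current import chain, in order
    on_path = set()  # same grammars, for fast membership tests
    done = set()     # grammars already proven cycle-free

    def visit(name):
        if name in on_path:
            # The chain has returned to a grammar it already contains.
            return path[path.index(name):] + [name]
        if name in done:
            return None
        on_path.add(name)
        path.append(name)
        for target in imports.get(name, ()):
            cycle = visit(target)
            if cycle:
                return cycle
        path.pop()
        on_path.discard(name)
        done.add(name)
        return None

    return visit(start)


# Hypothetical example: an SRGS grammar and an n-gram grammar import each other.
imports = {
    "menu.grxml": ["cities.ngxml"],
    "cities.ngxml": ["menu.grxml"],
}
print(find_import_cycle(imports, "menu.grxml"))
```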

Compiling n-gram grammars

You can compile, load, and activate n-gram grammars like any other speech grammar. For example, the following command compiles a grammar and produces a file mytrigram.gram:

sgc mytrigram.ngxml

Note: Precompile the parent SRGS grammar for SWIlanguageModel n-grams.

The following command produces a binary grammar for the example in SWIlanguageModel n-gram:

sgc grammar.grxml

When you train a Statistical Language Model (SLM), the sgc compiler creates an intermediate form of the grammar with n-gram count information. Using the -dump_ngram_grammar switch, you can save this information as an n-gram XML file.

The following command compiles an SLM training file and writes the n-gram XML grammar (sfgram.ngxml):

sgc -dump_ngram_grammar -train sfgram.xml

Note: There is no n-gram XML form of Statistical Semantic Models (SSMs).

Elements and attributes

An n-gram grammar contains the following elements and attributes:

Element      Attributes
<N-Gram>     xml:lang (optional)
<meta>       name, content
<lexicon>    uri, xml:lang (optional)
<vocab>      (none)
<token>      index, xml:lang (optional)
<ruleref>    uri, type (optional), xml:lang (optional)
<tree>       (none)
<node>       (none)

N-gram format

The general format of a complete n-gram grammar appears below. Square brackets enclose optional items:

<N-Gram xml:lang="en-us">
 [<meta name="name" content="content"/> …]
 [<lexicon uri="[protocol:[//host/]][path/]file[?query]"/> …]
 <vocab>
  <token index="#" [xml:lang="en-us"]>
   CDATA |
   <ruleref uri="[protocol:[//host/]][path/]file[?query][#rule]"
   [xml:lang="en-us"] [type="media-type"] />
  </token>
   …
 </vocab>
 
 <tree>
  <node> #num-uni #total-count </node> <!-- root -->
  <node> #tokenindex [#succ-entities] #entitycount </node>
  …
 </tree>
</N-Gram>

N-gram header

Like any other XML document, an n-gram document begins with a header specifying important global information about the document.

N-gram document main body

The main body of an n-gram document consists of two sections:

  • A vocabulary section, delimited by the <vocab> element, which defines all the words and imported rules that will appear in the n-gram tree.
  • A tree section, delimited by the <tree> element, which represents the n-gram counts with <node> elements as described below.
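To make the vocabulary section concrete, the following Python sketch reads the <token> entries of an n-gram document into a lookup table keyed by token index. It relies only on the element and attribute names listed above. The sample document, its index values, the cities.grxml URI, and the function read_vocab are illustrative assumptions, and the tree section is omitted from the sample because its node format is not covered here.

```python
import xml.etree.ElementTree as ET

# Illustrative document; words, index values, and the ruleref URI are made up.
SAMPLE = """<N-Gram xml:lang="en-us">
 <vocab>
  <token index="1">flights</token>
  <token index="2">to</token>
  <token index="3"><ruleref uri="cities.grxml#city"/></token>
 </vocab>
</N-Gram>"""


def read_vocab(xml_text):
    """Map each token index to ("word", text) or ("ruleref", uri)."""
    root = ET.fromstring(xml_text)
    vocab = {}
    for token in root.find("vocab").findall("token"):
        index = int(token.get("index"))
        ruleref = token.find("ruleref")
        if ruleref is not None:
            # The token stands for an imported SRGS rule.
            vocab[index] = ("ruleref", ruleref.get("uri"))
        else:
            # The token is a plain word given as character data.
            vocab[index] = ("word", (token.text or "").strip())
    return vocab


for index, entry in sorted(read_vocab(SAMPLE).items()):
    print(index, entry)
```

A table like this is what lets the token indexes that appear in <node> elements be resolved back to words or imported rules.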