N-gram grammars
Recognizer supports grammar syntax for n-gram (Markovian) stochastic grammars in VoiceXML.
Normally, you set the n-gram order of an SLM with the ngram_order parameter in the SLM training file header. This is the recommended method. However, Recognizer also supports the Stochastic Language Models (N-Gram) Specification W3C draft proposal (3 January 2001). It does so in two forms, which have important format differences:
- SWIlanguageModel n-gram specified in an SRGS grammar: Provided for backwards compatibility with old releases of the OpenSpeech Recognizer (OSR). This form of the n-gram grammar is used to specify a language model for an SRGS grammar. See SWIlanguageModel n-gram.
- Standalone n-gram grammar as used by Recognizer: N-gram language models predict the likelihood that sequences of words, such as word pairs (bigrams) or word triples (trigrams), will be spoken as part of a user utterance. Like other natural language models, an n-gram language model is built from a large sample of training sentences that display the characteristics expected in regular user input.
The grammar compiler can write an n-gram grammar when compiling Statistical Language Models (SLMs). See Compiling n-gram grammars.
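As an illustration of the word-pair and word-triple counts that an n-gram model is built from, the following Python sketch counts unigrams, bigrams, and trigrams in a tiny training set. It is not part of Recognizer: the function name count_ngrams, the whitespace tokenization, and the sample sentences are all assumptions made for this example.

```python
from collections import Counter


def count_ngrams(sentences, max_order=3):
    """Count every n-gram of order 1..max_order in the given sentences.

    Hypothetical helper for illustration; tokenization is a plain
    lowercase whitespace split, which a real system would refine.
    """
    counts = {n: Counter() for n in range(1, max_order + 1)}
    for sentence in sentences:
        words = sentence.lower().split()
        for n in range(1, max_order + 1):
            # Slide a window of width n across the sentence.
            for i in range(len(words) - n + 1):
                counts[n][tuple(words[i:i + n])] += 1
    return counts


# Made-up training sentences, standing in for a real corpus.
training = [
    "i want to fly to boston",
    "i want to fly to denver",
    "i need a flight to boston",
]
counts = count_ngrams(training)
print(counts[2][("to", "boston")])       # bigram count
print(counts[3][("want", "to", "fly")])  # trigram count
```

A real training corpus would be far larger; the point is only that each count records how often a particular word sequence was observed.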
Syntax and standards
Recognizer supports the vocabulary (<vocab>) and the count tree (<tree>) elements of the W3C n-gram draft proposal, which allows for a broad interpretation of its contents. Recognizer also extends the specification to include meta tags (<meta>) and pronunciation dictionaries (<lexicon>).
- The W3C draft allows for arbitrary n-grams; Recognizer is limited to bigrams and trigrams.
- Recognizer does not allow direct input of weights. Instead, count structures are used as input (ideally from a training corpus), and these are computed into the probabilities used for recognition.
- For details, see the W3C proposal.
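To show why counts are sufficient input, the sketch below derives a conditional probability from counts using a plain relative-frequency (maximum-likelihood) estimate. This is an assumption chosen for simplicity: this section does not describe how Recognizer actually converts counts into probabilities, and a production recognizer would typically also smooth or back off for unseen sequences. The function name and the counts are hypothetical.

```python
from collections import Counter


def conditional_probability(ngram_counts, history_counts, ngram):
    """Relative-frequency estimate: count(history + word) / count(history).

    Illustrative only; not Recognizer's documented computation.
    """
    history = ngram[:-1]
    if history_counts[history] == 0:
        # Unseen history: no estimate is possible without smoothing.
        return 0.0
    return ngram_counts[ngram] / history_counts[history]


# Made-up counts: "to" was seen 5 times, followed by "boston" twice.
bigram_counts = Counter({("to", "boston"): 2, ("to", "denver"): 1, ("to", "fly"): 2})
unigram_counts = Counter({("to",): 5})

p_boston = conditional_probability(bigram_counts, unigram_counts, ("to", "boston"))
print(p_boston)
```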
Media types and import guidelines
The media type for n-gram source grammars is:
application/x-swi-ngram+xml
The media type for a compiled n-gram grammar is the same as for SRGS:
application/x-swi-grammar
Below are guidelines for importing:
- An n-gram grammar has an implicit root rule, so SRGS grammars can import it directly.
- The n-gram grammar may import SRGS rules for modeling as lexical tokens with unigram, bigram, and trigram counts and weights.
- Recursive importing between the same SRGS and n-gram grammars is not allowed (and results in an error).
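Because a recursive import results in an error, it can be useful to check a set of grammars before compiling them. The following Python sketch walks a hand-written import map and reports the first cycle it finds. The map, the file names, and the function find_import_cycle are hypothetical; this is a pre-flight check under those assumptions, not a description of how Recognizer detects the condition.

```python
def find_import_cycle(imports, start):
    """Return the grammars forming an import cycle reachable from start, or None.

    imports maps a grammar name to the list of grammars it imports.
    """
    path = []        # grammars on the current import chain, in order
    on_path = set()  # same grammars, for fast membership tests
    done = set()     # grammars already fully explored without a cycle

    def visit(grammar):
        if grammar in on_path:
            # Back at a grammar already on the chain: report the loop.
            return path[path.index(grammar):] + [grammar]
        if grammar in done:
            return None
        path.append(grammar)
        on_path.add(grammar)
        for target in imports.get(grammar, []):
            cycle = visit(target)
            if cycle:
                return cycle
        path.pop()
        on_path.discard(grammar)
        done.add(grammar)
        return None

    return visit(start)


# Hypothetical grammars: an SRGS grammar and an n-gram grammar that import each other.
imports = {
    "menu.grxml": ["cities.ngxml"],
    "cities.ngxml": ["menu.grxml"],
}
cycle = find_import_cycle(imports, "menu.grxml")
print(cycle)
```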
Compiling n-gram grammars
You can compile, load, and activate n-gram grammars like any other speech grammar. For example, the following command compiles a grammar and produces a file mytrigram.gram:
sgc mytrigram.ngxml
Note: Precompile the parent SRGS grammar for SWIlanguageModel n-grams.
The following command produces a binary grammar for the example in SWIlanguageModel n-gram:
sgc grammar.grxml
When you train a Statistical Language Model (SLM), the sgc compiler creates an intermediate form of the grammar with n-gram count information. Using the -dump_ngram_grammar switch, you can save this information as an n-gram XML file.
The following command compiles an SLM training file and writes the n-gram XML grammar (sfgram.ngxml):
sgc -dump_ngram_grammar -train sfgram.xml
Note: There is no n-gram XML form of Statistical Semantic Models (SSMs).
Elements and attributes
An n-gram grammar contains the following elements and attributes:
Element | Attribute |
---|---|
<N-Gram> | xml:lang (optional) |
<meta> | name, content |
<lexicon> | uri, xml:lang (optional) |
<vocab> | |
<token> | index, xml:lang (optional) |
<ruleref> | uri, type (optional), xml:lang (optional) |
<tree> | (none) |
<node> | (none) |
N-gram format
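The general form of a complete n-gram grammar appears below. Square brackets enclose optional items, a vertical bar separates alternatives, and an ellipsis indicates that the preceding item can repeat: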
<N-Gram xml:lang="en-us">
[<meta name="name" content="content"/> …]
[<lexicon uri="[protocol:[//host/]][path/]file[?query]"/> …]
<vocab>
<token index="#" [xml:lang="en-us"]>
CDATA |
<ruleref uri="[protocol:[//host/]][path/]file[?query][#rule]"
[xml:lang="en-us"] [type="media-type"] />
</token>
…
</vocab>
<tree>
<node> #num-uni #total-count </node> <!-- root -->
<node> #tokenindex [#succ-entities] #entitycount </node>
…
</tree>
</N-Gram>
N-gram header
Like any other XML document, an n-gram document begins with a header that specifies important global information about the document.
N-gram document main body
The main body of an n-gram document consists of two sections:
- A vocabulary section, delimited by the <vocab> element, which defines all the words and imported rules that will appear in the n-gram tree.
- A tree section, delimited by the <tree> element, which represents the n-gram counts with <node> elements as described below.
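To make the two sections concrete, the following Python sketch reads the vocabulary and the tree from a small n-gram document using the standard library XML parser. The sample document is made up for this example and holds only a root node and two unigram nodes, so it illustrates parsing rather than a grammar validated against Recognizer. The function read_ngram is hypothetical, and the sketch returns the numeric fields of each node as-is without interpreting how nodes are ordered.

```python
import xml.etree.ElementTree as ET

# Made-up sample: two vocabulary tokens, a root node, and two unigram nodes.
SAMPLE = """<N-Gram xml:lang="en-us">
  <vocab>
    <token index="1">boston</token>
    <token index="2">denver</token>
  </vocab>
  <tree>
    <node> 2 10 </node> <!-- root: number of unigrams, total count -->
    <node> 1 6 </node>
    <node> 2 4 </node>
  </tree>
</N-Gram>"""


def read_ngram(xml_text):
    """Return (vocab, nodes) from an n-gram document.

    vocab maps each token index to its text (empty when the token
    holds a <ruleref> instead of a word); nodes is a list of tuples
    of the integers found in each <node>, in document order.
    """
    root = ET.fromstring(xml_text)
    vocab = {
        int(token.get("index")): (token.text or "").strip()
        for token in root.find("vocab").iter("token")
    }
    nodes = [
        tuple(int(field) for field in node.text.split())
        for node in root.find("tree").iter("node")
    ]
    return vocab, nodes


vocab, nodes = read_ngram(SAMPLE)
print(vocab)
print(nodes)
```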