Count tree (<tree>)

The event count tree (<tree>) is a depth-first representation of the raw unigram, bigram, and trigram counts (Recognizer does not support n-grams beyond trigrams). The notation is highly compressed and hard to read, because the syntax was designed to be generated by a computer program rather than written by human beings.

The tree syntax diverges from standard XML practice: each <node> element contains whitespace-separated numbers in its body, and together the nodes represent a flattened tree whose structure is determined by specific number fields in preceding nodes. This flattening avoids the significant overhead of a nested XML tree structure.

Format:

<tree>
 <node> #num-uni #total-count </node> <!-- root -->
 <node> #tokenindex [#succ-entities] #entitycount </node>
 …
</tree>

#num-uni

The number of unigrams represented in the tree.

Used for the first node (the 0-gram root) of the tree.

#total-count

The total number of tokens found in the training corpus (which is usually the sum of all unigram counts; see below).

Used for the first node (the 0-gram root) of the tree.

#tokenindex

The vocabulary token index for the node.

#succ-entities

The number of successor nodes at the next level of the tree (if 0, this field is omitted).

#entitycount

The event count.
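
The #succ-entities field is what turns the flat list of <node> elements back into a tree: each node's successors are the nodes that immediately follow it, depth first. The following minimal Python sketch (a hypothetical reader, not Recognizer's actual parser, run on an invented flat list) shows the reconstruction:

# Each tuple is (tokenindex, succ_entities, entitycount), with
# succ_entities normalized to 0 where the field was omitted.
def read_subtree(nodes, pos, depth):
    """Consume one node plus all its descendants; return the next position."""
    token, succ, count = nodes[pos]
    print("  " * depth + "token %d count %d" % (token, count))
    pos += 1
    for _ in range(succ):          # the next `succ` subtrees are its children
        pos = read_subtree(nodes, pos, depth + 1)
    return pos

# Invented example: two unigrams; the first has one bigram successor.
flat = [(0, 1, 3), (1, 0, 2), (1, 0, 4)]
pos = 0
for _ in range(2):                 # #num-uni, taken from the root node
    pos = read_subtree(flat, pos, 0)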

The sgc grammar compiler has an option to convert an SLM training file into an XML n-gram file. Even if you don't have a training file, it is easiest to understand the count structure as originating from one.

To explain the depth-first n-gram event count format, suppose we have a pseudo-corpus consisting of the single sentence "-pau- A B A B C -pau2-". The unigram, bigram, and trigram information can be represented as follows:

unigram  bigram  trigram  successors  event count      n-gram
-pau-                     bigrams=1   unigram-count=1  ("-pau-")
         A                trigrams=1  bigram-count=1   ("-pau- A")
                 B        4-grams=0   trigram-count=1  ("-pau- A B")
A                         bigrams=1   unigram-count=2  ("A")
         B                trigrams=2  bigram-count=2   ("A B")
                 A        4-grams=0   trigram-count=1  ("A B A")
                 C        4-grams=0   trigram-count=1  ("A B C")
B                         bigrams=2   unigram-count=2  ("B")
         A                trigrams=1  bigram-count=1   ("B A")
                 B        4-grams=0   trigram-count=1  ("B A B")
         C                trigrams=1  bigram-count=1   ("B C")
                 -pau2-   4-grams=0   trigram-count=1  ("B C -pau2-")
C                         bigrams=1   unigram-count=1  ("C")
         -pau2-           trigrams=0  bigram-count=1   ("C -pau2-")
-pau2-                    bigrams=0   unigram-count=1  ("-pau2-")

The event count tree refers to tokens by their vocabulary index numbers and omits branches whose count is 0 (count values of 0 are not allowed in the tree). The first node is a special 0-gram root node giving the number of unigrams and the total corpus count.
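
Applying the encoding to the pseudo-corpus above, and assuming the vocabulary section assigns the indices -pau-=0, A=1, B=2, C=3, -pau2-=4 (the real indices depend on your vocabulary ordering), the count tree would look like this:

<tree>
 <node> 5 7 </node>     <!-- root: 5 unigrams, 7 corpus tokens -->
 <node> 0 1 1 </node>   <!-- "-pau-": 1 successor, count 1 -->
 <node> 1 1 1 </node>   <!-- "-pau- A": 1 successor, count 1 -->
 <node> 2 1 </node>     <!-- "-pau- A B": leaf, successor field omitted -->
 <node> 1 1 2 </node>   <!-- "A": 1 successor, count 2 -->
 <node> 2 2 2 </node>   <!-- "A B": 2 successors, count 2 -->
 <node> 1 1 </node>     <!-- "A B A" -->
 <node> 3 1 </node>     <!-- "A B C" -->
 <node> 2 2 2 </node>   <!-- "B": 2 successors, count 2 -->
 <node> 1 1 1 </node>   <!-- "B A": 1 successor, count 1 -->
 <node> 2 1 </node>     <!-- "B A B" -->
 <node> 3 1 1 </node>   <!-- "B C": 1 successor, count 1 -->
 <node> 4 1 </node>     <!-- "B C -pau2-" -->
 <node> 3 1 1 </node>   <!-- "C": 1 successor, count 1 -->
 <node> 4 1 </node>     <!-- "C -pau2-": no trigram successors -->
 <node> 4 1 </node>     <!-- "-pau2-": no bigram successors -->
</tree>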

In a ten-thousand-word vocabulary trained on a five-hundred-thousand-word corpus, many vocabulary words occur so infrequently that you will prefer to rely on backoff weights instead of explicit models. These words must appear in the vocabulary section of the n-gram, but they are not required in the count tree. When words are omitted from the tree, Recognizer automatically adds them to the grammar with a unigram count of 0 and calculates a backoff weight. The same applies when modeling bigrams and especially trigrams (it is inefficient to model every trigram in a corpus). Do not rely on backoff probabilities for the utterance beginning and ending tokens ('-pau-' and '-pau2-'); if you do not model them in the tree, Recognizer writes a warning to the diagnostic log.