Writing a grammar main body

The main body of a grammar consists of rules. Each rule is a child of <grammar> and is defined in its own <rule> element, which combines words with child elements (<one-of>, <item>, or <ruleref>) to define what the rule matches. Recognizable words are entered as text. Unless you specify otherwise, Recognizer treats all text as implicit <token> content.
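As a rough illustration of how these elements fit together, the following sketch shows a small grammar with two rules. The rule names (order, size) and the vocabulary are hypothetical and chosen only for this example; the namespace and attributes follow the SRGS specification.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         xml:lang="en-US" version="1.0" root="order">

  <!-- Root rule (hypothetical): plain text plus a reference to another rule -->
  <rule id="order" scope="public">
    i would like a
    <ruleref uri="#size"/>
    coffee
  </rule>

  <!-- Alternatives: exactly one <item> inside <one-of> is matched -->
  <rule id="size">
    <one-of>
      <item> small </item>
      <item> medium </item>
      <item> large </item>
    </one-of>
  </rule>

</grammar>
```

Under these assumptions, the grammar would cover phrases such as "i would like a small coffee". The words entered as plain text are treated as tokens without explicit <token> markup.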

For full details on GrXML elements and their attributes, refer to the SRGS specification.

GrXML elements

See "Sample grammar file" for an example of how GrXML elements can be used in the main body of a grammar. These elements are described here:

Allowed symbols and digit strings

You can use certain non-alphabetic symbols and strings of digits in your grammars, but Recognizer’s interpretation of them depends on the grammar language. For best results, spell out vocabulary items as words, and avoid digit strings and symbols where possible.

Consider the following two items:

<item> 50% </item>
<item> fifty percent </item>

In this example, the second item is preferable because it explicitly defines what words can be spoken, and avoids using non-alphabetic symbols (the percent sign) and strings of digits (50).

When you use a digit string or other abbreviation, you cannot be certain which spoken phrases the grammar covers. Always test the pronunciations (see Checking pronunciations with dicttest), because what is usable in one language may not be usable in another. Some digit strings generate no pronunciation at all, which can cause grammar compilation to fail. To avoid such failures, set swirec_enable_robust_compile in a <meta> element in the grammar.
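A <meta> element for this setting might look like the following sketch. The name/content attribute form is standard SRGS, but the value "1" is an assumption made for this example; confirm the accepted values in the Recognizer parameter reference.

```xml
<!-- Placed in the grammar header, before the first <rule>. -->
<!-- Assumption: content="1" turns the setting on; verify the accepted values for your release. -->
<meta name="swirec_enable_robust_compile" content="1"/>
```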

Recognizer interprets some symbols in each language automatically (such as the percent symbol "%" and dollar sign "$" in en-US), but most symbols cause an error unless defined in a dictionary. Problem symbols include:

Character name           Character    Character name    Character
hyphen                   -            period            .
underscore               _            comma             ,
opening parenthesis      (            forward slash     /
closing parenthesis      )            question mark     ?
single quotation mark    '

Double quotation marks (") are never allowed inside vocabulary items; they are reserved as delimiters.

Use digit strings cautiously to avoid problems:

  • Individual digits (0–9) are acceptable. Recognizer interprets them as the number they represent (zero to nine).
  • Strings of digits with matching entries in a user dictionary are acceptable. If a grammar covers the digits “32564,” and you provide a dictionary pronunciation for that number, recognition accuracy remains high.
  • For arbitrary strings of digits, use extra caution: spelling out the numbers as words yields better recognition accuracy.

    For example, the phrase “one hundred and twenty three” generates accurate pronunciations and gets high accuracy. But the same phrase written as the digits “123,” with no matching entry in a user dictionary, generates additional, dissimilar pronunciations (for example, “one two three,” “one hundred and twenty three,” “one hundred twenty three,” “twelve three,” and so on) and decreases accuracy.

  • Some languages limit the length of digit strings to avoid accuracy problems. If you exceed the allowed length, you get a parsing error similar to this:

    SWI_ERROR_GENERIC| error| lookupIndividualWords | Could not generate pronunciation for phrase '1234' (lang en-gb).
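If an application needs to accept more than one spoken form of a number, one option is to list the wanted forms explicitly as words rather than relying on a digit string. The following fragment is a minimal sketch; the choice of which forms to include is an assumption for illustration.

```xml
<one-of>
  <!-- Each item states exactly which words may be spoken -->
  <item> one hundred twenty three </item>
  <item> one hundred and twenty three </item>
  <item> one two three </item>
</one-of>
```

Listing the forms this way keeps unwanted variants (such as "twelve three") out of the grammar, at the cost of a few extra items.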

Special rules (NULL, VOID, and GARBAGE)

GrXML offers three special rules that simplify grammar development:

  • NULL: defines a rule that is matched automatically.
  • VOID: defines a rule that cannot be spoken.
  • GARBAGE: defines a rule that matches any speech up to the next rule match, the next token, or to the end of spoken input.

These names are all reserved in GrXML, so you must not use them for your own rules. Recognizer interprets them automatically. To invoke one of these rules, use the special attribute of the <ruleref> element as shown:

<ruleref special="NULL"/> 
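The other two special rules are invoked the same way. The following sketch places all three in context, based on the behavior defined in the SRGS specification. The rule names (flight_request, city) and the vocabulary are hypothetical.

```xml
<rule id="flight_request">
  <!-- GARBAGE absorbs leading speech until the next token ("to") matches -->
  <ruleref special="GARBAGE"/>
  to
  <ruleref uri="#city"/>
  <one-of>
    <item> please </item>
    <!-- NULL matches without any speech, so "please" becomes optional -->
    <item> <ruleref special="NULL"/> </item>
  </one-of>
</rule>

<rule id="city">
  <one-of>
    <item> boston </item>
    <!-- VOID can never be spoken, so this alternative is effectively disabled -->
    <item> <ruleref special="VOID"/> denver </item>
  </one-of>
</rule>
```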

Details on each of these rules appear below.