Overview of grammars

The topics that follow provide an overview of the grammar creation process: how grammars are implemented in XML, the issues involved in planning a well-formed dialog, and how dialog design informs your grammar strategy.

A grammar is a description of the words and phrases that a speech recognizer will understand and interpret. Grammars are loaded by the recognizer at runtime to convert the user’s spoken responses and commands into information the voice application can use. This happens in the following basic steps:

  1. The voice browser instructs the recognizer to load and activate a grammar.
  2. The voice browser plays prompt audio designed to elicit a user response, and simultaneously records incoming audio to send to the recognizer.
  3. The recognizer listens for speech.
  4. After detecting speech, the recognizer searches for a match between the words spoken and the activated grammar(s).
  5. The recognizer returns the result of this search to the voice browser, or to another component (for interpretation, for example). If the result is positive, the recognizer may pass values to application variables.

A well-designed grammar must be able to accept many different user responses, and interpret them quickly, accurately, and efficiently. This means that a developer must be able to predict what sort of responses each application prompt will produce, and encode them in a grammar as efficiently as possible.

This in turn means that grammars must be designed very much in parallel with the voice application. Grammar development is a central and essential part of overall voice application development.

Grammars are based on the SRGS grammar format (Speech Recognition Grammar Specification Version 1.0 [W3C Recommendation March 16, 2004]), and they include optimized features for VoiceXML browsers such as URI resolution, XML result format, parallel grammars, and grammar caching.

To create a grammar for Nuance Recognizer, you must first write it as a text file that describes the vocabulary that falls within the range of acceptable responses, and specifies the information to be passed to the voice application for each listed response. This file must be written in a specific XML format, and then precompiled or compiled at runtime for Recognizer to use.

Note: Within the context of Nuance Recognizer, this text format is referred to as “GrXML” for convenience.

To use the grammar you have created, you must invoke it in a field within a VoiceXML document to interpret the response to a specific prompt in the voice application.
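For example, a field in a VoiceXML document can reference an external grammar file and use the returned value. The following is a minimal sketch; the file name yesno.grxml and the field name answer are illustrative placeholders:

<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
    <form id="confirmForm">
        <field name="answer">
            <prompt>Is that correct?</prompt>
            <!-- Load and activate the external grammar for this field -->
            <grammar src="yesno.grxml" type="application/srgs+xml"/>
            <filled>
                <!-- The recognition result fills the field variable -->
                <prompt>You said <value expr="answer"/>.</prompt>
            </filled>
        </field>
    </form>
</vxml>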

How grammars work (the n-best list)

Each time the recognizer interprets a user utterance, it compiles a list of the closest possible matches to return to the voice application. This list is called the n-best list, since it is made up of a prespecified number n of interpretations that best match what the user seems to have said.

  1. Parameter settings in the active grammars determine the number n of interpretations that are to be listed (more than one grammar can be active at the same time). By default, the recognizer returns only one interpretation unless you specify a higher number n (see the sketch following this list).
  2. When the user speaks, the recognizer searches for the best matches among the items defined in the grammar(s), and adds each matching interpretation to the candidates being considered for the n-best list. During this search the recognizer uses acoustic models to analyze the audio input, lexical models to determine the most probable sentences in the grammars, and semantic models to determine the most probable meanings of what the caller has said.

    The recognizer searches until it has found all possible interpretations, or until the remaining items could not possibly match what was heard.

  3. The recognizer assigns a confidence score to each item in the candidate list, and ranks them from highest confidence to lowest. The recognizer re-assesses and fine-tunes these scores as new interpretations are found.

    If the grammars allow homonyms (words that sound identical but have different meanings), and one is spoken, the recognizer assigns the homonyms to separate interpretations with identical confidence scores.

  4. The recognizer refines the candidate list by processing any constraint lists (in the case of Nuance Recognizer) or semantic interpretation scripts (ECMAScript) specified in the grammars.
  5. The recognizer removes any interpretations that do not meet the target confidence levels configured for the recognition.
  6. The recognizer returns the final top n results to the application. This n-best list contains the matched text (for the entire utterance, and for individual slots), confidence scores, and any keys and values set for the utterances.
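In a VoiceXML application, one standard way to request more than one interpretation is the maxnbest property; the ranked results are then exposed through the application.lastresult$ array. A minimal sketch (the grammar reference cities.grxml and the prompt wording are illustrative):

<property name="maxnbest" value="3"/>
<field name="city">
    <prompt>Which city?</prompt>
    <grammar src="cities.grxml" type="application/srgs+xml"/>
    <filled>
        <!-- application.lastresult$ holds up to three ranked interpretations,
             each with utterance, confidence, and interpretation properties -->
        <log>Best match: <value expr="application.lastresult$[0].utterance"/>,
             confidence: <value expr="application.lastresult$[0].confidence"/></log>
    </filled>
</field>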

As this description shows, a grammar has three functions:

  • To define the words, phrases, and sentences that the recognizer will accept.
  • To return a set of keys and values to the application.
  • Optionally, to filter the candidate interpretations using scripts and constraint lists before the recognizer finalizes the n-best list.
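To illustrate the second function, a rule can attach keys and values to the result using tag elements. The sketch below uses the W3C ECMAScript tag format (semantics/1.0) rather than the swi-semantics format shown in the sample file later in this topic; it demonstrates the general mechanism, not Nuance-specific syntax:

<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
         xml:lang="en-US" root="Drink" tag-format="semantics/1.0">
    <rule id="Drink" scope="public">
        <one-of>
            <!-- Each tag assigns a key ("drink") and a value to the result -->
            <item>coffee <tag>out.drink = "coffee";</tag></item>
            <item>tea <tag>out.drink = "tea";</tag></item>
        </one-of>
    </rule>
</grammar>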

Grammar sources

There are three basic sources for grammars:

  • External grammars: These grammars are created individually by a grammar developer. They are contained in separate files, which are invoked as needed from within the voice application (.vxml) file.
  • Built-in grammars: These grammars are included automatically as part of the recognition package, to provide coverage for common words like numbers and dates. They can be used without further development.
  • Inline grammars: Very small grammars may be coded in the voice application file directly, and interpreted by the voice browser at runtime.

External and inline grammars must be coded by the grammar developer, while built-in grammars are available when the built-in grammar service is running.
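For example, a built-in grammar is invoked through the type attribute of a VoiceXML field, while an inline grammar is embedded directly in the .vxml file. Both snippets are illustrative sketches:

<!-- Built-in grammar: accepts common yes/no responses -->
<field name="confirm" type="boolean">
    <prompt>Shall I continue?</prompt>
</field>

<!-- Inline grammar: a very small vocabulary coded directly in the document -->
<field name="color">
    <prompt>Pick a color.</prompt>
    <grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
             xml:lang="en-US" root="Color">
        <rule id="Color">
            <one-of>
                <item>red</item>
                <item>green</item>
                <item>blue</item>
            </one-of>
        </rule>
    </grammar>
</field>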

Sample grammar file

A short sample GrXML grammar appears below:

<?xml version="1.0" encoding="UTF-8" ?>
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-US" root="YesNo"
         tag-format="swi-semantics/1.0">
    <rule id="YesNo" scope="public">
        <one-of>
            <item>
                <ruleref uri="#Yes"/>
                <tag>YesNo='yes'</tag>
            </item>
            <item>
                <ruleref uri="#No"/>
                <tag>YesNo='no'</tag>
            </item>
        </one-of>
    </rule>
    <!-- Subgrammar identifying Yes responses -->
    <rule id="Yes">
        <one-of>
            <item>yes</item>
            <item>yeah</item>
            <item>right</item>
            <item>correct</item>
        </one-of>
    </rule>
    <!-- Subgrammar identifying No responses -->
    <rule id="No">
        <one-of>
            <item>no</item>
            <item>nope</item>
            <item>wrong</item>
            <item>incorrect</item>
        </one-of>
    </rule>
</grammar>

This grammar is invoked when the user answers a simple yes/no question. It interprets responses of “yes”, “yeah”, “right”, or “correct” as yes, and responses of “no”, “nope”, “wrong”, or “incorrect” as no. This yes or no answer is then returned to the voice application, which uses it to fill a variable.

For additional deconstruction of this sample grammar file, see Sample grammar file revisited.