acc_test

You can create acc_test scripts to test grammars for basic accuracy and generate reports.

The acc_test utility tests grammar accuracy by comparing audio files that have known meanings with their recognition results, and analyzing the results in one or more statistical reports. The utility is located in: %SWISRSDK%\amd64\bin

This utility takes one or more prepared scripts as input. Each script contains instructions for recognition testing, encoded in a proprietary format. The script lists the grammars to load for testing, the input audio files, the correct meaning of each input audio file, the reports to be generated, and so on. As output, the utility generates the specified reports based on the recognition results.

Note: The acc_test utility consumes three licenses in order to run.

Supported audio

Best practice: use end-pointed wave files generated by Recognizer, as these have correct begin and end silence times. Wave files generated or processed by other methods are not recommended.

Format	Description
audio/basic	WAV or ulaw
audio/x-alaw-basic	alaw
audio/L16	16-bit linear data
application/x-aurora	Aurora data (original bitstream)
application/x-feature	Aurora data (advanced bitstream for encoding=ES_202_050)

Usage

acc_test script1 [script2 ...] -local_log filename

   [-keep_cache]

   [-no_rectest_log]

   [-real_time]

   [-report_dir directory]

Options

script1 [script2 ...]

One or more scripts written in the proprietary format. If more than one script is specified, each script runs on a separate channel.

-local_log filename

Outputs event logging to the named file.

-keep_cache

Prevents deletion of existing grammar and inet cache directories.

-no_rectest_log

Disables the logging of script commands to the event log. This means that certain events (STARTSCRIPT, ALTdtst and ALTdtnd, SWIdtst, ENDSCRIPT, and others) will not appear in the log.

-real_time

Writes audio in real time.

-report_dir directory

Specifies a directory to which the reports will be written. This directory must already exist: the command will not create it.

Example

> acc_test test.script -no_rectest_log -local_log mylog.log

This command runs the test.script testing script, and logs the results in the mylog.log file. Since -no_rectest_log was used, this log does not include any events that describe the script commands.
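When acc_test is driven from a test harness, it can help to assemble the command line programmatically and to check the report directory up front, since the utility does not create it. The following is a minimal sketch: the helper name build_acc_test_command and its parameters are hypothetical, and only the options listed above are used.

```python
import os

def build_acc_test_command(scripts, local_log, report_dir=None,
                           keep_cache=False, no_rectest_log=False,
                           real_time=False):
    """Return the argument list for an acc_test run (hypothetical helper)."""
    if not scripts:
        raise ValueError("at least one script is required")
    cmd = ["acc_test", *scripts, "-local_log", local_log]
    if keep_cache:
        cmd.append("-keep_cache")
    if no_rectest_log:
        cmd.append("-no_rectest_log")
    if real_time:
        cmd.append("-real_time")
    if report_dir is not None:
        # acc_test does not create the report directory, so fail early.
        if not os.path.isdir(report_dir):
            raise FileNotFoundError(f"report directory does not exist: {report_dir}")
        cmd += ["-report_dir", report_dir]
    return cmd

cmd = build_acc_test_command(["test.script"], "mylog.log", no_rectest_log=True)
print(" ".join(cmd))  # acc_test test.script -local_log mylog.log -no_rectest_log
```

The resulting list can be passed to a process launcher such as Python's subprocess.run, provided that acc_test is on the search path.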

Script format example

A simple sample script appears below:

# Example script. Use pound sign (#) for Comments

# Header (:ACC)

:ACC

# Load grammars

SWIrecGrammarLoad G0 g0.grxml

SWIrecGrammarLoad G1 g1.grxml

SWIrecGrammarLoad G2 g2.grxml

# Define the contexts

context_define context1 500 800

context_add G1 1

context_add G2 1

context_end

context_define context2 200 900

context_add G0 1

context_end

# Open the cumulative files

open utd test_grammar.utd

open errors test_grammar.err

open xmlresult test_grammar_nlsml.xml

xmlresult_media_type application/x-vnd.speechworks.emma+xml

# Test the contexts

context_use context1

transcription blue elephant

meaning {toto}

recognize blue_elephant.ulaw

transcription blue

recognize blue.ulaw

# Reset channel normalization

SWIrecAcousticStateReset

context_use context2

transcription i want to fly from denver to boston at five o'clock

recognize boston_denver_at_5.ulaw

# Generate reports

report summary test_grammar.summary

report confidence test_grammar.confidence

report nbest test_grammar.nbest

report oov test_grammar.oov

report words test_grammar.words

# Close the cumulative files

close utd

close errors

close xmlresult

Header (:ACC)

Each script begins with the following header:

:ACC

This header tells acc_test to call Recognizer, which uses the speech detector to detect end of speech. This means that the utility may declare the end of speech before the end of the file, based on the "incompletetimeout" parameter. Typical input waveform files will have already been endpointed and padded with approximately 200 ms of silence before and after the speech.

Comments

To enter comments anywhere in the script, use a hashmark (#) at the beginning of the line. This character tells acc_test to ignore the rest of the line.

Load grammars

The next section of the script tells acc_test which grammars to load for testing:

SWIrecGrammarLoad G0 g0.grxml

SWIrecGrammarLoad G1 g1.grxml

SWIrecGrammarLoad G2 g2.grxml

Here, each SWIrecGrammarLoad command defines an internal name (for example, G0) and matches it with a grammar to be loaded (g0.grxml):

SWIrecGrammarLoad gname gpath

In this syntax, gname is the name assigned to the grammar for the rest of the script, and gpath is the URI of the grammar. The URI must include the complete path to the grammar, relative to the script. In the examples above, all the grammars are assumed to be in the same directory as the script itself. If there is a problem loading the grammar, the script exits with an error message.

Define the contexts

Once the grammars are loaded, they are used to define grammar contexts:

context_define context1 500 800

context_add G1 1

context_add G2 1

context_end

In this excerpt, the script defines "context1" as a combination of grammars G1 and G2, weights these grammars equally, and sets the confidence thresholds that will be considered low (500) and high (800) on a scale of 1000.

The commands used to define contexts are:

context_define cname low_thresh high_thresh

This command begins each definition, specifying the name to be used for the context (cname), and setting the low (low_thresh) and high (high_thresh) confidence thresholds for the context. These thresholds range from 1 to 1000. Both are required, but you can use the same number for both if desired.

context_add gname weight

This command adds the grammar gname to the current context, assigning it the specified weight within the context. To weight all grammars equally, assign each one the same weight.

context_end

This command marks the end of the current context definition.
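When a test suite covers many contexts, the context definitions can be generated instead of typed by hand. The sketch below renders one definition from the three commands described above. The function name context_block is hypothetical, and the range check only reflects the 1 to 1000 limits stated for the thresholds.

```python
def context_block(cname, low_thresh, high_thresh, grammars):
    """Render a context definition; grammars is a list of (gname, weight)."""
    for value in (low_thresh, high_thresh):
        # Thresholds are documented as ranging from 1 to 1000.
        if not 1 <= value <= 1000:
            raise ValueError(f"threshold out of range 1-1000: {value}")
    lines = [f"context_define {cname} {low_thresh} {high_thresh}"]
    lines += [f"context_add {gname} {weight}" for gname, weight in grammars]
    lines.append("context_end")
    return "\n".join(lines)

block = context_block("context1", 500, 800, [("G1", 1), ("G2", 1)])
print(block)
```

With these arguments the output matches the context1 definition in the sample script.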

Open the cumulative files

The acc_test reports represent data which has been accumulated internally. The data is written when the report command is processed. However, some kinds of data are written as the recognitions happen:

utd: Up-to-date files recording every transaction.
errors: Error files recording occurrences of errors.
xmlresult: Output of the recognition results in XML format. The xmlresult format can be further specified using the xmlresult_media_type command, as shown in the example.
fr: Tracks all false rejections, where Recognizer incorrectly assigns a confidence score below or equal to the low confidence threshold, and thus rejects valid input.
fa: Tracks all false acceptances, where Recognizer incorrectly assigns a confidence score above the high confidence threshold to an incorrect recognition (the utterance is out-of-grammar, or is an in-grammar utterance that has been incorrectly interpreted as a different in-grammar utterance).
cpu: Tracks the CPU used for each test recognition.

To activate these files so new results will be written during the current session, the script uses the open command:

open filetype fname

Here, filetype is one of the file types listed above (for example, utd, errors, or xmlresult) and fname is the name and location of the file. If the named file already exists, it is overwritten with the new results.

These cumulative files can be closed later, normally at the end of the script, by using the close command. See Close the cumulative files for details.
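The fr and fa file types described above depend on how a confidence score compares with the two thresholds of the context. The following sketch restates those two definitions as code. The function name classify and its arguments are hypothetical, and this is an illustration of the rules as worded here, not the utility's actual implementation.

```python
def classify(confidence, is_correct, low_thresh, high_thresh):
    """Return "fr", "fa", or None for one recognition result."""
    # False rejection: a correct result scored at or below the low threshold.
    if is_correct and confidence <= low_thresh:
        return "fr"
    # False acceptance: an incorrect result scored above the high threshold.
    if not is_correct and confidence > high_thresh:
        return "fa"
    return None

print(classify(500, True, 500, 800))   # fr
print(classify(801, False, 500, 800))  # fa
print(classify(650, True, 500, 800))   # None
```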

Test the contexts

The testing section specifies the tests themselves. Each subsection uses a context_use command to invoke a context with which to test recognitions, and specifies the tests to be conducted. For example, the sample file above tests the context1 context with two items:

context_use context1

transcription blue elephant

meaning {toto}

recognize blue_elephant.ulaw

transcription blue

recognize blue.ulaw

The context_use command takes one argument: the name of the context to be tested (the cname specified in the context_define command). Only one context can be active at a time, so each context_use command implicitly deactivates whatever context was active up to that point.

Each test can include the following commands:

transcription: The transcription of the audio file being used for recognition.
meaning: The meaning to be assigned to the item when recognized, if this is different from the meaning that will be returned by Recognizer (optional). Enclose the meaning in braces {like this}.
recognize: The name and location of the audio file to be recognized. In the example, both audio files are in the same directory as the script.

Use the -format option to specify an audio type (the default is 8-bit, 8 kHz ulaw audio). For example:

recognize blue.alaw -format audio/x-alaw-basic

You can use a recognize command to specify text as well:

recognize -format text/plain "blue elephant"

However, acc_test is intended for audio tests, so this is not recommended.

It is strongly recommended that you use end-pointed wave files generated by Recognizer, as these have correct begin and end silence times that the acc_test utility requires. Wave files generated or processed by other methods are not recommended.

The transcription and meaning values are used in reports (see below).
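Test entries usually come from a list of audio files and their transcriptions, so they are a natural candidate for generation. The sketch below renders one entry using the three commands described above. The function name render_test_entry is hypothetical, and the meaning line and -format option are emitted only when supplied.

```python
def render_test_entry(audio, transcription, meaning=None, audio_format=None):
    """Render the script lines for one recognition test."""
    lines = [f"transcription {transcription}"]
    if meaning is not None:
        # The meaning is enclosed in braces, as in the sample script.
        lines.append(f"meaning {{{meaning}}}")
    command = f"recognize {audio}"
    if audio_format is not None:
        command += f" -format {audio_format}"
    lines.append(command)
    return "\n".join(lines)

print(render_test_entry("blue_elephant.ulaw", "blue elephant", meaning="toto"))
print(render_test_entry("blue.alaw", "blue", audio_format="audio/x-alaw-basic"))
```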

Reset channel normalization

To reset the speaker/channel normalization between tests (in order to simulate the start of a new call, for example), use a SWIrecAcousticStateReset command:

SWIrecAcousticStateReset

Reset recognition count within a script

You can reset the result count at any point with a context_reset command:

context_reset

This command erases the results from all recognitions performed up to this point, so they are not counted for subsequent reports. You will probably only want to do this if you have already generated one set of reports (see below), and want to reinitialize for the next set of reports.

Generate reports

Once the tests are complete, you can use the results to write several different kinds of reports, using a separate command for each desired report:

report rtype reportfile

Here, the rtype is the type of report to be generated, while the reportfile specifies the location and name for the report. The name can include an environment variable (for example, %fname.err%). You can use a report command anywhere in the script. Usually, reports are generated before a context_reset or just before the script’s end. See Reset recognition count within a script.

The types of reports recommended for your testing include:

summary: Provides an overall summary of the results for each context.
nbest: Shows where on the nbest list the correct answer occurred.
oov: Lists the words found to be out-of-vocabulary, by count.
words: Evaluates the overall accuracy of recognition for each word.
confidence: Lists the confidences of recognitions for each context.

Close the cumulative files

You stop writing to the cumulative files using the close command:

close filetype

Normally these files will be closed at the end of the script, as in the example. Only one file of each type can be open at a time, so there is no need to specify the file name when you close. You can make any number of cumulative files, but when you open a second file of a type, the first closes automatically.

Reports

The acc_test utility can generate many different types of reports. This section describes the five recommended report types.

Summary report

A summary report summarizes the results of a test according to context. Information for each context appears on a separate line. The fields are:

Name: The name of the context.
#utts: The total number of utterances that were used to test the context (including all in and out of vocabulary utterances).
%err(iv): The error rate. This is the number of misrecognitions (including failures) divided by the total number of in-vocabulary utterances, expressed as a percentage.
%cr(iv): The correct reject rate. This is the percentage of correct recognitions which fell below the low confidence threshold.
%fa_in(iv): The false acceptance rate. This is the percentage of misrecognized in-vocabulary utterances which had a confidence above the high threshold.
%oov(tot): The percentage of out-of-vocabulary utterances.
%fa_out(oov): The percentage of out-of-vocabulary utterances which had a confidence score above the high threshold.
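To make some of these fields concrete, the sketch below computes #utts, %err(iv), %oov(tot), and %fa_out(oov) from a small set of per-utterance results. The record layout, the function name summarize, and the sample data are all hypothetical, and the arithmetic follows the field descriptions above rather than the utility's source.

```python
def summarize(results, high_thresh):
    """Compute a few summary fields from per-utterance result records."""
    iv = [r for r in results if r["in_vocab"]]
    oov = [r for r in results if not r["in_vocab"]]

    def pct(part, whole):
        return 100.0 * part / whole if whole else 0.0

    return {
        "#utts": len(results),
        # Misrecognitions divided by in-vocabulary utterances.
        "%err(iv)": pct(sum(not r["correct"] for r in iv), len(iv)),
        # Out-of-vocabulary utterances as a share of all utterances.
        "%oov(tot)": pct(len(oov), len(results)),
        # Out-of-vocabulary utterances scored above the high threshold.
        "%fa_out(oov)": pct(sum(r["confidence"] > high_thresh for r in oov), len(oov)),
    }

results = [
    {"in_vocab": True, "correct": True, "confidence": 950},
    {"in_vocab": True, "correct": True, "confidence": 900},
    {"in_vocab": True, "correct": True, "confidence": 870},
    {"in_vocab": True, "correct": False, "confidence": 600},
    {"in_vocab": False, "correct": False, "confidence": 850},
    {"in_vocab": False, "correct": False, "confidence": 300},
]
summary = summarize(results, high_thresh=800)
print(summary)
```

In this made-up data set, one of four in-vocabulary utterances is misrecognized (25%), and one of two out-of-vocabulary utterances scores above the high threshold (50%).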

N-best report

An n-best report counts the ranking of correct recognitions on the nbest list. It includes a separate section for each tested context. The fields are:

Context: The name of the context.
iv utts: The total number of in-vocabulary utterances for the context.
nbest n: The total number of in-vocabulary utterances that were ranked nth on the n-best list (there can be several lines in each section, one for each n).
total nbest inclusion: The ratio of total nbest results (sum of all nbest n lines) to the total in-vocabulary utterances (iv utts).
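The total nbest inclusion value follows directly from the other fields. A minimal sketch, using made-up counts and a hypothetical function name:

```python
def total_nbest_inclusion(nbest_counts, iv_utts):
    """Ratio of the summed "nbest n" counts to the in-vocabulary utterances."""
    return sum(nbest_counts.values()) / iv_utts

# Hypothetical context: 50 in-vocabulary utterances, with the correct answer
# ranked first 40 times, second 5 times, and third 2 times.
ratio = total_nbest_inclusion({1: 40, 2: 5, 3: 2}, iv_utts=50)
print(f"{ratio:.2f}")  # 0.94
```

The remaining 3 utterances in this example are those for which the correct answer did not appear on the n-best list at all.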

Out-of-vocabulary report

For each tested context, an out-of-vocabulary (oov) report lists all utterance transcriptions that were found to be out-of-vocabulary, sorted in descending order by count so the most common oov utterance transcriptions appear at the top. The format is:

context: cname

number transcription

...

Here, cname is the name of the context. Each line under this heading lists the number of times (number) each oov transcription (transcription) appeared:

context: context1

   12 red squirrel

   9 purple snake

   7 chartreuse pig
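The same layout can be reproduced from a list of out-of-vocabulary transcriptions, which is useful when comparing results outside acc_test. This is a sketch only: the function name oov_report and the input list are hypothetical.

```python
from collections import Counter

def oov_report(cname, oov_transcriptions):
    """Render oov transcriptions by count, most frequent first."""
    counts = Counter(oov_transcriptions)
    lines = [f"context: {cname}"]
    # most_common() sorts in descending order by count.
    for transcription, count in counts.most_common():
        lines.append(f"   {count} {transcription}")
    return "\n".join(lines)

utterances = ["red squirrel"] * 3 + ["purple snake"] * 2 + ["chartreuse pig"]
report = oov_report("context1", utterances)
print(report)
```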

Words report

The words report has two parts for each context. First, for every word that appeared in the transcriptions or in the recognized text, it gives a report on accuracy for that word (based on dynamic string alignment). For example:

elephant

   total:        6

   good:         4 (0.67)

   sub away:     2 (0.33)

   del:          0 (0.00)

---

   sub to:       3

   ins:          0

Here, “elephant” is the word. The fields in this first section are:

total: The number of times this word appeared in the transcriptions.
good: How many of these occurrences aligned correctly (the fraction in parentheses is this number divided by the total).
sub away: How many occurrences aligned incorrectly to another word. Again, the fraction in parentheses is this number divided by the total.
del: How many occurrences of the word were deleted.
sub to: How many times some other word aligned with this word.
ins: How many times the word was inserted.
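The fractions in parentheses are each count divided by the total. The sketch below reproduces the figures from the "elephant" example above. The function name word_rates is hypothetical.

```python
def word_rates(total, good, sub_away, deleted):
    """Format each count with its fraction of the total, to two decimals."""
    counts = {"good": good, "sub away": sub_away, "del": deleted}
    return {name: f"{count} ({count / total:.2f})" for name, count in counts.items()}

rates = word_rates(total=6, good=4, sub_away=2, deleted=0)
for name, value in rates.items():
    print(f"{name}: {value}")
```

For the example, this yields 4 (0.67) for good, 2 (0.33) for sub away, and 0 (0.00) for del.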

The second part of the report shows confusions (word pairings which the alignment algorithm identified as substitutions), sorted by count. For example:

CONFUSIONS

    6 Boston -> Austin

    3 Boston -> Houston

Confidence report

This report provides an overview of how different confidence thresholds affect the recognition results for a given context, so you can determine an acceptable tradeoff between recognition accuracy and utterance rejection rates.

The first part of the report specifies the confidence thresholds to set in order to generate correct acceptance rates of 10% through to 90%, and lists the rates of false acceptances at each level. The second part specifies the confidence thresholds to set in order to generate false rejection rates of 5%, 2%, 1%, 0.5%, and 0.1%, and lists the correct rejection rates experienced at each level.

mycontext:  38 inv correct, 1 inv incorrect, 20 oov utts

10% CA at  990     FA(inv) =   0.0%     FA(oov) =   0.0%

20% CA at  986     FA(inv) =   0.0%     FA(oov) =   0.0%

30% CA at  984     FA(inv) =   0.0%     FA(oov) =   0.0%

40% CA at  979     FA(inv) =   0.0%     FA(oov) =   0.0%

50% CA at  976     FA(inv) =   0.0%     FA(oov) =   0.0%

60% CA at  971     FA(inv) =   0.0%     FA(oov) =   0.0%

70% CA at  964     FA(inv) =   0.0%     FA(oov) =   5.0%

80% CA at  953     FA(inv) =   0.0%     FA(oov) =   5.0%

90% CA at  819     FA(inv) = 100.0%     FA(oov) =  10.0%

 5% FR at  693     CR(inv) =   0.0%     CR(oov) =  75.0%

 2% FR at  370     CR(inv) =   0.0%     CR(oov) =  65.0%

 1% FR at  370     CR(inv) =   0.0%     CR(oov) =  65.0%

.5% FR at  370     CR(inv) =   0.0%     CR(oov) =  65.0%

.1% FR at  370     CR(inv) =   0.0%     CR(oov) =  65.0%

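The first line of each context section has the following format:

context-name   total-invocab-correct total-invocab-incorrect  total-oovocab

The false acceptance (FA) and correct rejection (CR) percentages are calculated as follows:

FA(inv) = 100.0 * tot_fa_iv_incorrect / tot_iv_incorrect

FA(oov) = 100.0 * tot_fa_oov / tot_oov

CR(inv) = 100.0 * tot_cr_iv_incorrect / tot_iv_incorrect

CR(oov) = 100.0 * tot_cr_oov / tot_oov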

The report begins with the totals for the context: the number of correctly recognized in-vocabulary utterances, the number of incorrectly recognized in-vocabulary utterances, and the number of out-of-vocabulary utterances.

It then lists the confidence scores required to achieve the specified percent of correct acceptances. For example, the report above shows that to obtain a correct acceptance rate of only 10%, the required confidence level is 990 (out of 1000). To obtain a correct acceptance rate of 20%, the required confidence level is 986. Both confidence levels result in a false acceptance rate of 0%.
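The percentages in the sample report can be checked against the FA and CR formulas. The sketch below uses the totals from the sample (1 incorrect in-vocabulary utterance and 20 out-of-vocabulary utterances). The per-row counts are not printed in the report: they are inferred here from the percentages, and the function name rate is hypothetical.

```python
def rate(part, whole):
    """Percentage used by the FA and CR formulas: 100.0 * part / whole."""
    return 100.0 * part / whole if whole else 0.0

tot_iv_incorrect = 1   # "1 inv incorrect" in the sample header line
tot_oov = 20           # "20 oov utts" in the sample header line

# 90% CA row (threshold 819): inferred counts of 1 accepted incorrect
# in-vocabulary utterance and 2 accepted oov utterances.
fa_inv = rate(1, tot_iv_incorrect)
fa_oov = rate(2, tot_oov)

# 5% FR row (threshold 693): inferred count of 15 rejected oov utterances.
cr_oov = rate(15, tot_oov)

print(fa_inv, fa_oov, cr_oov)  # 100.0 10.0 75.0
```

Because there is only one incorrect in-vocabulary utterance in the sample, FA(inv) can only be 0% or 100%, which explains the jump in the 90% CA row.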