Configuring voice enrollment
Voice enrollment is a speech dialog where a user associates a pronunciation with a given function. For example, an enrollment application could associate the spoken phrase “call home” with a command to dial a phone number.
The main task of a voice enrollment application is to enroll a word or phrase (the pronunciation). The caller is prompted to speak the same utterance several times so that the system can compute a pronunciation for it. Then, the application adds the pronunciation to a user dictionary. For example, the caller repeats the phrase “call mom” several times, the pronunciation is computed, and the following entry is added to a dictionary:
<entry key="phone_5551234_1"> <definition value=" k ah l m ah m" /> </entry>
At this point, the enrollment is complete and the application, or any other application with access to the dictionary, can use the pronunciation in a grammar (which enables recognition when the caller speaks the enrolled utterance).
Note: Voice enrollment is a feature of Nuance Recognizer only.
In the example above, the key name (phone_5551234_1) is completely arbitrary with respect to the meaning of the utterance. The application does not know the meaning or spelling of the enrolled utterance. Instead, the application hears the phrase repeatedly and compares each collected utterance until it can compute a phonetic sequence that reliably represents the sounds heard. The enrolled utterance is a piece of data that will match when that specific caller speaks the same sounds. When the application inserts the pronunciation into a dictionary, it can assign any useful key name.
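As a rough illustration of the dictionary entry described above, the following Python sketch builds an entry element from an arbitrary key and a computed phoneme sequence. The function name make_entry is hypothetical, and the sketch assumes the entry/definition layout shown in the example is all that is needed; a real user dictionary may require additional wrapper elements or attributes.

```python
import xml.etree.ElementTree as ET
from typing import Sequence


def make_entry(key: str, phonemes: Sequence[str]) -> str:
    """Build one dictionary <entry> element for an enrolled pronunciation.

    The key is arbitrary: it does not need to reflect the meaning or
    spelling of the utterance, only to be useful to the application.
    """
    entry = ET.Element("entry", {"key": key})
    # The pronunciation is stored as a space-separated phoneme string.
    ET.SubElement(entry, "definition", {"value": " ".join(phonemes)})
    return ET.tostring(entry, encoding="unicode")


# Example: the "call mom" pronunciation from the text above.
print(make_entry("phone_5551234_1", ["k", "ah", "l", "m", "ah", "m"]))
```

Because the key carries no meaning for the recognizer, the application is free to encode whatever it needs in it, such as the phone number to dial.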
Voice enrollment headers
Speech Server supports the following MRCP headers for voice enrollment:
The Abort-Phrase-Enrollment header can optionally be specified in the END-PHRASE-ENROLLMENT method to abort the phrase enrollment, rather than committing the phrase to the personal grammar.
abort-phrase-enrollment = "Abort-Phrase-Enrollment" ":" Boolean-Value CRLF
The Clash-Threshold header can be sent as part of the START-PHRASE-ENROLLMENT, SET-PARAMS, or GET-PARAMS method. Used during voice enrollment, this header specifies how similar the pronunciations of two different phrases can be before they are considered clashing. For example, pronunciations of phrases such as “John Smith” and “Jon Smits” may be so similar that they are difficult to distinguish correctly. A smaller threshold reduces the number of clashes detected. The range for this threshold is a float value between 0.0 and 1.0. The default value for this header is implementation specific. You can turn off clash testing by setting the Clash-Threshold header value to 0.
clash-threshold = "Clash-Threshold" ":" FLOAT CRLF
The Confusable-Phrases-URI header specifies a grammar that defines invalid phrases for enrollment. For example, typical applications do not allow an enrolled phrase that is also a command word. This header may occur in RECOGNIZE requests that are part of an enrollment session.
confusable-phrases-uri = "Confusable-Phrases-URI" ":" Uri CRLF
The Consistency-Threshold header may be sent as part of the START-PHRASE-ENROLLMENT, SET-PARAMS, or GET-PARAMS method. Used during voice enrollment, this header specifies how similar to a previously enrolled pronunciation of the same phrase an utterance needs to be in order to be considered consistent. The higher the threshold, the closer the match between an utterance and previous pronunciations must be. The range for this threshold is a float value between 0.0 and 1.0. The default value for this header is implementation specific.
consistency-threshold = "Consistency-Threshold" ":" FLOAT CRLF
The Enroll-Utterance header may be specified in the RECOGNIZE method. If this header is set to TRUE, and an Enrollment is active, the RECOGNIZE command must add the collected utterance to the personal grammar that is being enrolled. The default value for this header is FALSE.
enroll-utterance = "Enroll-Utterance" ":" Boolean-Value CRLF
The client is expected to set this header to TRUE when the RECOGNIZE request is for an enrollment, and to FALSE when performing a regular RECOGNIZE during an enrollment session.
The New-Phrase-Id header replaces the ID used to identify the phrase in a personal grammar. Recognizer returns the new ID when using an enrollment grammar. This header may occur in MODIFY-PHRASE requests.
new-phrase-id = "New-Phrase-ID" ":" 1*VCHAR CRLF
New-Phrase-Id is used for MODIFY-PHRASE and changes the rule in the personal grammar.
The Num-Min-Consistent-Pronunciations header may be specified in a START-PHRASE-ENROLLMENT, SET-PARAMS, or GET-PARAMS method and is used to specify the minimum number of consistent pronunciations that must be obtained to voice enroll a new phrase. The minimum value is 1. The default value is implementation specific and may be greater than 1.
num-min-consistent-pronunciations = "Num-Min-Consistent-Pronunciations" ":" 1*DIGIT CRLF
Nuance Speech Server controls the number of consistent pronunciations and does not count no-matches or no-inputs. Speech Server also parses the enrollment result to determine the consistency status, and counts only consistent utterances towards Num-Min-Consistent-Pronunciations.
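The counting rule above can be sketched in Python as follows. This is only an illustration: the function name and the status labels ("consistent", "inconsistent", "no-match", "no-input") are assumptions made for the example and are not values defined by Speech Server.

```python
from typing import Iterable


def repetitions_still_needed(statuses: Iterable[str], num_min_consistent: int) -> int:
    """Return how many more consistent utterances are still required.

    `statuses` holds one hypothetical label per RECOGNIZE attempt.
    Only "consistent" attempts count; no-matches, no-inputs, and
    inconsistent utterances are ignored.
    """
    if num_min_consistent < 1:
        raise ValueError("Num-Min-Consistent-Pronunciations must be at least 1")
    consistent = sum(1 for status in statuses if status == "consistent")
    return max(0, num_min_consistent - consistent)


attempts = ["no-input", "consistent", "no-match", "consistent"]
print(repetitions_still_needed(attempts, num_min_consistent=2))
```

With a minimum of 2, the four attempts above are enough, because the no-input and no-match attempts neither help nor count against the caller.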
The Personal-Grammar-URI header specifies the speaker-trained grammar to be used or referenced during enrollment operations. Phrases are added to this grammar during enrollment. For example, a contact list for user “Jeff” could be stored at the Personal-Grammar-URI http://myserver.example.com/myenrollmentdb/jeff-list. Nuance Speech Server, using the HTTP database, stores the source grammar at this location. The generated grammar syntax may be implementation specific. There is no default value for this header.
personal-grammar-uri = "Personal-Grammar-URI" ":" Uri CRLF
In a request, the Phrase-Id header identifies a phrase in an existing personal grammar for which enrollment is desired. It is also returned to the client in the RECOGNIZE complete event. This header may occur in START-PHRASE-ENROLLMENT, MODIFY-PHRASE or DELETE-PHRASE requests. There is no default value for this header.
phrase-id = "Phrase-ID" ":" 1*VCHAR CRLF
Nuance Speech Server stores this rule in a personal grammar.
The Phrase-NL header is a string that specifies a natural language statement in one of the active grammars that applies to the phrase once the phrase is recognized. This header can occur in START-PHRASE-ENROLLMENT and MODIFY-PHRASE requests. There is no default value for this header.
phrase-nl = "Phrase-NL" ":" 1*VCHAR CRLF
Nuance Speech Server stores Phrase-NL as SWI_meaning in a personal grammar.
The Save-Best-Waveform header allows the client to request that the recognizer resource save the audio stream for the best repetition of the phrase that was used during the enrollment session. The recognizer must attempt to record the recognized audio and make it available to the client in the form of a URI returned in the Waveform-URI header in the response to the END-PHRASE-ENROLLMENT method. If there was an error in recording the stream, or the audio data is otherwise not available, the recognizer must return an empty Waveform-URI header.
save-best-waveform = "Save-Best-Waveform" ":" Boolean-Value CRLF
Nuance Speech Server uses the last utterance or, if possible, parses recognition results and keeps the utterance with the highest confidence.
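One way to picture this selection rule is the Python sketch below. The Utterance class, its fields, and the file names are hypothetical; the sketch assumes that a confidence score is available only when the recognition result could be parsed, and that the last utterance is the fallback otherwise.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Utterance:
    waveform_uri: str
    confidence: Optional[float]  # None when the result could not be parsed


def pick_best_waveform(utterances: Sequence[Utterance]) -> Optional[str]:
    """Pick the recording to report as the best repetition."""
    if not utterances:
        return None  # nothing was recorded
    scored = [u for u in utterances if u.confidence is not None]
    if scored:
        # Prefer the utterance with the highest parsed confidence.
        return max(scored, key=lambda u: u.confidence).waveform_uri
    # Otherwise fall back to the last utterance collected.
    return utterances[-1].waveform_uri


takes = [Utterance("take1.wav", 0.61), Utterance("take2.wav", 0.87), Utterance("take3.wav", 0.74)]
print(pick_best_waveform(takes))
```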
The value of the Weight header represents the occurrence likelihood of a phrase in an enrolled grammar. When using grammar enrollment, the system is essentially constructing a grammar segment consisting of a list of possible match phrases. This is similar to the dynamic construction of a <one-of> tag in the W3C grammar specification. Each enrolled phrase becomes an item in the list that can be matched against spoken input similar to an <item> within a <one-of> list.
This header allows you to assign a weight to the phrase (that is, <item> entry) in the <one-of> list that is enrolled. Grammar weights are normalized to a sum of one at grammar compilation time, so a weight value of 1 for each phrase in an enrolled grammar list indicates that all items in the list have the same weight. This header may occur in START-PHRASE-ENROLLMENT and MODIFY-PHRASE requests. The default value for this header is implementation specific.
weight = "Weight" ":" weight-value CRLF
Nuance Speech Server stores Weight as SWI_scoreDelta in a personal grammar.
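To make the normalization of grammar weights concrete, here is a minimal Python sketch. It assumes simple proportional normalization to a sum of one; the function name and phrase IDs are hypothetical, and the actual computation performed at grammar compilation time (including how Weight maps to SWI_scoreDelta) may differ.

```python
from typing import Dict


def normalize_weights(weights: Dict[str, float]) -> Dict[str, float]:
    """Scale per-phrase weights so that they sum to one."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {phrase_id: weight / total for phrase_id, weight in weights.items()}


# Four enrolled phrases, each given a Weight of 1, end up equally likely.
enrolled = {"mom": 1.0, "home": 1.0, "office": 1.0, "jeff": 1.0}
print(normalize_weights(enrolled))
```

This shows why a weight of 1 on every phrase simply means that all items in the list are equally weighted: only the ratios between weights matter.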
Voice enrollment methods
Speech Server supports these MRCP methods for voice enrollment:
The DELETE-PHRASE method sent from the client to the server is used to delete a phrase in a personal grammar added through voice enrollment or text enrollment. If the specified phrase does not exist, this method has no effect.
C->S: MRCP/2.0 123 DELETE-PHRASE 543266
Channel-Identifier:32AECB23433801@speechrecog
Personal-Grammar-URI:personal_grammar_uri
Phrase-Id:phrase_id
S->C: MRCP/2.0 49 543266 200 COMPLETE
Channel-Identifier:32AECB23433801@speechrecog
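The message lengths in the examples in this section (such as 123 and 49) are illustrative. In MRCPv2 the message-length field counts every byte of the message, including the start line that contains the length itself. The Python sketch below shows one way a client might assemble a request such as DELETE-PHRASE with a matching length; build_request is a hypothetical helper, not part of any Nuance API.

```python
from typing import Dict


def build_request(method: str, request_id: int, headers: Dict[str, str]) -> bytes:
    """Assemble an MRCPv2 request whose message-length matches its size."""
    # Header lines, followed by the empty line that ends the header section.
    header_block = "".join(f"{name}:{value}\r\n" for name, value in headers.items()) + "\r\n"
    length = 0
    # The length field is part of the message, so recompute until it is stable.
    for _ in range(10):
        start_line = f"MRCP/2.0 {length} {method} {request_id}\r\n"
        message = (start_line + header_block).encode("utf-8")
        if len(message) == length:
            return message
        length = len(message)
    raise RuntimeError("message-length did not converge")


request = build_request(
    "DELETE-PHRASE",
    543266,
    {
        "Channel-Identifier": "32AECB23433801@speechrecog",
        "Personal-Grammar-URI": "personal_grammar_uri",  # placeholder value
        "Phrase-Id": "phrase_id",  # placeholder value
    },
)
print(request.decode("utf-8"))
```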
Do not call the END-PHRASE-ENROLLMENT method during an ongoing RECOGNIZE operation.
Instead, call it to commit a new phrase to the grammar during an active phrase-enrollment session. Do this after successive successful calls to RECOGNIZE, once Num-Repetitions-Still-Needed is returned as 0 in the RECOGNITION-COMPLETE event. Alternatively, call it with the Abort-Phrase-Enrollment header to abort the phrase-enrollment session.
If the client specified Save-Best-Waveform as true in the START-PHRASE-ENROLLMENT request, the response includes the location/URI of a recording of the best repetition of the learned phrase. For example:
C->S: MRCP/2.0 49 END-PHRASE-ENROLLMENT 543262
Channel-Identifier:32AECB23433801@speechrecog
S->C: MRCP/2.0 123 543262 200 COMPLETE
Channel-Identifier:32AECB23433801@speechrecog
Waveform-URI:http://mediaserver.com/recordings/file1324.wav;size=242453;duration=25432
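A client that wants to use the returned recording needs to separate the URI from its size and duration parameters, and to handle the empty header that is returned when no audio is available. The Python sketch below is one possible approach; the names are hypothetical, and it assumes the URI itself contains no semicolons.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WaveformInfo:
    uri: str
    size: Optional[int]
    duration: Optional[int]


def parse_waveform_uri(value: str) -> Optional[WaveformInfo]:
    """Parse the value of a Waveform-URI header (the text after the colon)."""
    value = value.strip()
    if not value:
        return None  # empty header: the recording is not available
    uri_part, *params = value.split(";")
    # Tolerate a URI wrapped in angle brackets.
    uri = uri_part.strip().lstrip("<").rstrip(">")
    attrs = {}
    for param in params:
        name, _, number = param.partition("=")
        attrs[name.strip()] = int(number)
    return WaveformInfo(uri, attrs.get("size"), attrs.get("duration"))


header_value = "http://mediaserver.com/recordings/file1324.wav;size=242453;duration=25432"
print(parse_waveform_uri(header_value))
```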
The ENROLLMENT-ROLLBACK method discards the last live utterance from the RECOGNIZE operation. Use this method when the caller provides undesirable input such as non-speech noises, side speech, commands, or utterances from the RECOGNIZE grammar.
This method does not provide a stack of rollback states. Executing ENROLLMENT-ROLLBACK twice in succession without an intervening recognition operation has no effect on the second attempt as shown in this example:
C->S: MRCP/2.0 49 ENROLLMENT-ROLLBACK 543261
Channel-Identifier:32AECB23433801@speechrecog
S->C: MRCP/2.0 49 543261 200 COMPLETE
Channel-Identifier:32AECB23433801@speechrecog
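The single-level rollback behavior can be modeled with a small Python sketch. This is a conceptual model only, with hypothetical class and method names; it does not reflect how Speech Server stores utterances internally.

```python
class EnrollmentSessionModel:
    """Toy model of an enrollment session with one level of rollback."""

    def __init__(self) -> None:
        self.utterances = []
        self._can_rollback = False

    def recognize(self, utterance: str) -> None:
        # Each recognition adds an utterance and re-arms the rollback.
        self.utterances.append(utterance)
        self._can_rollback = True

    def rollback(self) -> bool:
        """Discard the last utterance; return False if nothing was discarded."""
        if not self._can_rollback:
            return False  # no stack of states: a second rollback has no effect
        self.utterances.pop()
        self._can_rollback = False
        return True


session = EnrollmentSessionModel()
session.recognize("call mom")
session.recognize("(cough)")
print(session.rollback())  # discards "(cough)"
print(session.rollback())  # no intervening recognition, so nothing changes
print(session.utterances)
```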
The MODIFY-PHRASE method sent from the client to the server is used to change the phrase ID, NL phrase, and/or weight for a given phrase in a personal grammar.
If no fields are supplied then calling this method has no effect.
C->S: MRCP/2.0 123 MODIFY-PHRASE 543265
Channel-Identifier:32AECB23433801@speechrecog
Personal-Grammar-URI:personal_grammar_uri
Phrase-Id:phrase_id
New-Phrase-Id:new_phrase_id
Phrase-NL:NL_phrase
Weight:1
S->C: MRCP/2.0 49 543265 200 COMPLETE
The START-PHRASE-ENROLLMENT method from the client to the server starts a new phrase-enrollment session during which the client can call RECOGNIZE multiple times to enroll a new utterance in a grammar. An enrollment session consists of a set of calls to RECOGNIZE in which the caller speaks a phrase several times so the system can learn it. The phrase is then added to a personal grammar (speaker-trained grammar), so that the system can recognize it later.
Only one phrase-enrollment session may be active at a time for a resource. The Personal-Grammar-URI identifies the grammar that is used during enrollment to store the personal list of phrases. Once RECOGNIZE is called, the result is returned in a RECOGNITION-COMPLETE event and may contain either an enrollment result or a recognition result for a regular recognition. Calling END-PHRASE-ENROLLMENT ends the ongoing phrase-enrollment session, which is typically done after a sequence of successful calls to RECOGNIZE. END-PHRASE-ENROLLMENT can either commit the new phrase to the personal grammar or abort the phrase-enrollment session.
The Personal-Grammar-URI, which specifies the grammar to contain the new enrolled phrase, is created if it does not exist. Also, the personal grammar can only contain phrases added via a phrase-enrollment session.
The Phrase-ID passed to this method is used to identify this phrase in the grammar and is returned as the speech input when doing a RECOGNIZE on the grammar. The Phrase-NL similarly is returned in a RECOGNITION-COMPLETE event in the same manner as other NL in a grammar. The tag-format of this NL is implementation specific.
If the client specifies Save-Best-Waveform as true, the response sent after the phrase-enrollment session ends includes the location/URI of a recording of the best repetition of the learned phrase.
C->S: MRCP/2.0 123 START-PHRASE-ENROLLMENT 543258 5
Channel-Identifier:32AECB23433801@speechrecog
Num-Min-Consistent-Pronunciations:2
Consistency-Threshold:30
Clash-Threshold:12
Personal-Grammar-URI:personal_grammar_uri
Phrase-NL:NL_phrase
Weight:1
Save-Best-Waveform:true
S->C: MRCP/2.0 49 543258 200 COMPLETE
Channel-Identifier:32AECB23433801@speechrecog