Results
The results returned by ASRaaS applications can range from a simple transcript of an individual utterance to thousands of lines of JSON information. The scale of these results depends on two main factors: the recognition parameters in the request and the fields chosen by the client application.
As well as recognition results, ASRaaS can return two types of notifications. See Notifications in results.
Recognition parameters in request
One way to customize the results from ASRaaS is with two RecognitionParameters in the request: result_type and utterance_detection_mode:
RecognitionParameters(
    language = 'en-US',
    topic = 'GEN',
    audio_format = AudioFormat(...),
    result_type = 'FINAL|PARTIAL|IMMUTABLE_PARTIAL',
    utterance_detection_mode = 'SINGLE|MULTIPLE|DISABLED'
)
Result type
The result type specifies the level of detail that ASRaaS returns in its streaming result. Set the desired result type in RecognitionParameters: result_type. In the response, the type is indicated in Result: result_type. This parameter has three possible values:
- FINAL (default): Only the final version of each hypothesis is returned. The result type is FINAL, but the field is not shown in the results in Python applications because it is the default value. To show this information to users, the app can determine the result type and display it using code such as this:

elif message.HasField('result'):
    restype = 'partial' if message.result.result_type else 'final'
    print(f'{restype}: {message.result.hypotheses[0].formatted_text}')
These are the results with FINAL result type and SINGLE utterance detection mode.
final : It's Monday morning and the sun is shining
- PARTIAL: Partial and final results are returned. Partial results of each utterance are delivered as soon as speech is detected, but with low recognition confidence. These results usually change as more speech is processed and the context is better understood. The result type is shown as PARTIAL. Final results are returned at the end of each utterance. (For a display pattern suited to these results, see the sketch after the note below.)

partial : It's
partial : It's me
partial : It's month
partial : It's Monday
partial : It's Monday no
partial : It's Monday more
partial : It's Monday March
partial : It's Monday morning
partial : It's Monday morning and
partial : It's Monday morning and the
partial : It's Monday morning and this
partial : It's Monday morning and the sun
partial : It's Monday morning and the center
partial : It's Monday morning and the sun is
partial : It's Monday morning and the sonny's
partial : It's Monday morning and the sunshine
final : It's Monday morning and the sun is shining
- IMMUTABLE_PARTIAL: Stabilized partial and final results are returned. Partial results are delivered after a slight delay to ensure that the recognized words do not change with the rest of the received speech. The result type is shown as PARTIAL (not IMMUTABLE_PARTIAL). Final results are returned at the end of each utterance.

partial : It's Monday
partial : It's Monday morning and the
final : It's Monday morning and the sun is shining
Note: In these examples, the client displays only a few basic fields. If the client displays more fields, the results include all those additional fields. See Fields chosen by client below.

Some data packs perform additional processing after the initial recognition. The transcript may change slightly during this second pass, even for immutable partial results. For example, ASRaaS originally recognized “the seven fifty eight train” as “the 750 A-Train” but adjusted it during a second pass, returning “the 758 train” in the final hypothesis:
partial : I'll catch the 750
partial : I'll catch the 750 A-Train
final : I'll catch the 758 train from Cedar Park station
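Because each partial hypothesis supersedes the one before it, a client that shows PARTIAL results typically overwrites its display line rather than appending. Here is a minimal sketch of that pattern, assuming stream_in is the streaming response iterator used in the other examples on this page:

import sys

for message in stream_in:
    if message.HasField('result'):
        text = message.result.hypotheses[0].formatted_text
        if message.result.result_type:
            # Non-default result type: a partial result. Overwrite the
            # current console line with the latest hypothesis. (A production
            # client would also clear leftover characters from a longer
            # previous partial.)
            sys.stdout.write(f'\rpartial : {text}')
            sys.stdout.flush()
        else:
            # Default value (FINAL): keep the finished utterance on its own line.
            sys.stdout.write(f'\rfinal : {text}\n')
            sys.stdout.flush()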
Utterance detection mode
Another recognition parameter, utterance_detection_mode, determines how much of the audio ASRaaS will process. Specify the desired mode in RecognitionParameters: utterance_detection_mode. This parameter has three possible values:
- SINGLE (default): Return recognition results for one utterance only, ignoring any trailing audio.
- MULTIPLE: Return results for all utterances detected in the audio stream. These are the results with MULTIPLE utterance detection mode and FINAL result type:

final: It's Monday morning and the sun is shining
final: I'm getting ready to walk to the train and commute into work
final: I'll catch the 758 train from Cedar Park station
final: It will take me an hour to get into town
- DISABLED: Return recognition results for all audio provided by the client, without separating it into utterances. The maximum allowed audio length for this detection mode is 30 seconds.
The combination of these two parameters returns different results. In all cases, the actual returned fields also depend on which fields the client application chooses to display. In the following table, rows are result types and columns are utterance detection modes; a request sketch that combines the two parameters follows the table:
| Result type | SINGLE | MULTIPLE | DISABLED |
|---|---|---|---|
| FINAL | Final version of first utterance. | Final version of each utterance. | Final version of all speech. |
| PARTIAL | Partial results, including corrections, of first utterance. | Partial results of each utterance. | Partial results of all speech. |
| IMMUTABLE_PARTIAL | Stabilized partial results of first utterance. | Stabilized partial results of each utterance. | Stabilized partial results of all speech. |
The detection modes do not support all the timer parameters in RecognitionParameters. See Timeouts and detection modes.
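For illustration, here is a minimal sketch of a request that sets both parameters. It assumes Python stubs generated from the ASRaaS proto files; the import path and the RecognitionInitMessage wrapper are assumptions that may differ in your environment:

# A sketch, not a definitive client: the import path and the
# RecognitionInitMessage wrapper are assumptions.
from nuance.asr.v1.recognizer_pb2 import (
    AudioFormat, PCM, RecognitionInitMessage, RecognitionParameters,
    RecognitionRequest)

# Final results only, for every utterance detected in the stream.
params = RecognitionParameters(
    language='en-US',
    topic='GEN',
    audio_format=AudioFormat(pcm=PCM(sample_rate_hz=16000)),
    result_type='FINAL',
    utterance_detection_mode='MULTIPLE')

# The parameters travel in the first message of the streaming request;
# later messages carry the audio bytes.
init_request = RecognitionRequest(
    recognition_init_message=RecognitionInitMessage(parameters=params))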
Fields chosen by client
Another way to customize your results is by selecting specific fields, or all fields, in your client application.
From the complete results returned by ASRaaS, the client selects the information to display to users. It can be just a few basic fields or the complete results in JSON format.
Basic fields
In this example, the client displays only a few essential fields: the status code and message, the result type, and the formatted text of the best hypothesis of each utterance. The recognition parameters in this request include result type FINAL and utterance detection mode MULTIPLE, meaning that ASRaaS returns only the final version of each utterance and processes all utterances in the audio:
for message in stream_in:
if message.HasField('status'):
if message.status.details:
print(f'{message.status.code} {message.status.message} - {message.status.details}')
else:
print(f'{message.status.code} {message.status.message}')
elif message.HasField('result'):
restype = 'partial' if message.result.result_type else 'final'
print(f'{restype}: {message.result.hypotheses[0].formatted_text}')
The client displays a few basic fields, giving a relatively short result:
stream ../../audio/weather16.wav
100 Continue - recognition started on audio/l16;rate=16000 stream
final: There is more snow coming to the Montreal area in the next few days
final: We're expecting 10 cm overnight and the winds are blowing hard
final: Radar and satellite pictures show that we're on the western edge of the storm system as it continues to track further to the east
200 Success
All fields
This example prints all results returned by ASRaaS, giving a long JSON output of all fields:
for message in stream_in:
print(message)
The response starts with the initial status and start-of-speech information, followed by ASRaaS’s recognition results.
ASRaaS starts its recognition process by breaking the audio into utterances, or portions of audio identified by pauses in the audio stream. The default pause between utterances is set by the server (usually 500 ms, or half a second), but you can adjust it in the client with RecognitionParameters: utterance_end_silence_ms.
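For example, to end an utterance after 800 ms of silence instead of the server default, set the field alongside the other recognition parameters (a sketch reusing the names from the request example above):

# A sketch; see the request example above for the assumed imports.
params = RecognitionParameters(
    language='en-US',
    topic='GEN',
    audio_format=AudioFormat(pcm=PCM(sample_rate_hz=16000)),
    utterance_detection_mode='MULTIPLE',
    utterance_end_silence_ms=800)  # treat 800 ms of silence as the end of an utterance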
Each result in the response contains information about an utterance, including statistics, transcription hypotheses of the utterance and its words, plus the data pack used for recognition. By default, several hypotheses are returned for each utterance, showing confidence levels as well as formatted and minimally formatted text of the utterance. (See Formatted text for the difference between formatted and minimally formatted text.)
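For instance, a client could walk the hypotheses of each result and compare their confidence scores and text fields, using the field names visible in the output below (a sketch):

for message in stream_in:
    if message.HasField('result'):
        for i, hyp in enumerate(message.result.hypotheses):
            # Each hypothesis carries its own confidence scores and both
            # renderings of the transcript.
            print(f'hypothesis {i}: average confidence {hyp.average_confidence:.3f}')
            print(f'  formatted          : {hyp.formatted_text}')
            print(f'  minimally formatted: {hyp.minimally_formatted_text}')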
Depending on the recognition parameters in the request, these results can include one or all utterances, and can show more or less of ASRaaS’s “thinking process” as it recognizes the words the user is speaking.
In this example, the result type is FINAL, meaning ASRaaS returns several hypotheses for each utterance but only the final version of each hypothesis.
With result type PARTIAL, the results can be much longer, with many variations in each hypothesis as the words in the utterance are recognized and transcribed.
stream ../../audio/weather16.wav
status {
code: 100
message: "Continue"
details: "recognition started on audio/l16;rate=16000 stream"
}
start_of_speech {
first_audio_to_start_of_speech_ms: 840
}
result {
abs_start_ms: 840
abs_end_ms: 5600
utterance_info {
duration_ms: 4760
dsp {
snr_estimate_db: 35.0
level: 6657.0
num_channels: 1
initial_silence_ms: 260
initial_energy: -34.87929916381836
final_energy: -83.81700134277344
mean_energy: 88.64420318603516
}
}
hypotheses {
confidence: 0.24500000476837158
average_confidence: 0.8640000224113464
formatted_text: "There is more snow coming to the Montreal area in the next few days."
minimally_formatted_text: "There is more snow coming to the Montreal area in the next few days."
words {
text: "There"
confidence: 0.6790000200271606
start_ms: 260
end_ms: 400
}
words {
text: "is"
confidence: 0.765999972820282
start_ms: 400
end_ms: 580
}
words {
text: "more"
confidence: 0.8619999885559082
start_ms: 580
end_ms: 860
}
words {
text: "snow"
confidence: 0.6949999928474426
start_ms: 860
end_ms: 1220
}
*** More words here ***
*** More hypotheses here ***
data_pack {
language: "eng-USA"
topic: "GEN"
version: "4.12.1"
id: "GMT20231026205000"
}
}
*** Results for additional utterances here ***
status {
code: 200
message: "Success"
}
See RecognitionResponse: Result for a description of the fields in the response.
Notifications in results
ASRaaS may return two types of notifications in RecognitionResponse > Result: recognition warnings and warnings of imminent shutdown.
Recognition warnings
Notifications about the recognition process are returned, when applicable, in result > notifications, along with the recognition results. These messages are warnings or errors that don't trigger a run-time error, so recognition can continue with the limitations or suggestions in the message.
For example, this notification warns that a wordset was compiled using a different version of the data pack and should be recompiled. The wordset is used, but may be less effective because of the mismatch:
result: {
result_type: PARTIAL
abs_start_ms: 160
abs_end_ms: 3510
hypotheses: [
*** Hypotheses here ***
],
data_pack: {
language: "eng-USA"
topic: "GEN"
version: "4.11.1"
id: "GMT20230830154712"
}
notifications: [
{
code: 1002
severity: SEVERITY_WARNING
message: {
locale: "en-US"
message: "Wordset-pkg should be recompiled."
message_resource_id: 1002
}
data: {
application/x-nuance-wordset-pkg: "urn:nuance-mix:tag:wordset:lang/names-places/places-compiled-ws/eng-USA/mix.asr"
application/x-nuance-domainlm: "urn:nuance-mix:tag:model/names-places/mix.asr?=language=eng-usa"
}
}
]
}
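A client can surface these warnings without interrupting recognition by checking the notifications field of each result, using the field names shown above (a sketch):

for message in stream_in:
    if message.HasField('result'):
        for notification in message.result.notifications:
            # Report the warning, then carry on processing hypotheses as usual.
            print(f'notification {notification.code} '
                  f'(severity {notification.severity}): '
                  f'{notification.message.message}')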
Shutdown warnings
ASRaaS also returns a notification when the session receives a SIGTERM signal, indicating the session is about to end. This situation can occur in long recognition sessions, when the Kubernetes pod running ASRaaS terminates automatically.
ASRaaS returns the NOTIFICATIONS result type, with details in a notifications message, allowing the client to handle the shutdown gracefully. The timeout period is shown in the data field.
This notification is not tied to any partial or final recognition results. For example:
result {
result_type: NOTIFICATIONS
notifications {
code: 1005
severity: SEVERITY_INFO
message {
locale: "en-US"
message: "Imminent shutdown."
message_resource_id: "1005"
}
data {
key: "timeout_ms"
value: "10000"
}
}
}
After the timeout period, ASRaaS sends a final status message with status code 512: Shutdown while processing a request, indicating that the recognition session has ended.
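A client can watch for this notification and wind down within the advertised timeout. In this sketch, the code value 1005 and the timeout_ms key are taken from the example above, and the data field is assumed to behave as a string-to-string map:

for message in stream_in:
    if message.HasField('result'):
        for notification in message.result.notifications:
            if notification.code == 1005:  # imminent shutdown, per the example above
                timeout_ms = int(notification.data.get('timeout_ms', '0'))
                print(f'server shutting down in {timeout_ms} ms; closing stream')
                # Stop sending audio here so the session can end cleanly
                # before the final 512 status arrives.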