Results

The results returned by ASRaaS applications can range from a simple transcript of an individual utterance to thousands of lines of JSON information. The scale of these results depends on two main factors: the recognition parameters in the request and the fields chosen by the client application.

Recognition parameters in request

One way to customize the results from ASRaaS is with two RecognitionParameters in the request: result_type and utterance_detection_mode:

RecognitionParameters(
    language = 'en-US',
    topic = 'GEN',
    audio_format = AudioFormat(...),
    result_type = 'FINAL|PARTIAL|IMMUTABLE_PARTIAL',
    utterance_detection_mode = 'SINGLE|MULTIPLE|DISABLED'
)

Result type

The result type specifies the level of detail that ASRaaS returns in its streaming result. Set the desired type in RecognitionParameters: result_type. In the response, the type is indicated in Result: result_type. This parameter has three possible values (a code sketch follows the list):

  • FINAL (default): Only the final version of each hypothesis is returned. The result type is FINAL, but because that is the default (zero) enum value, the field is not shown in results printed by Python applications. To show this information to users, the app can determine the result type and display it using code such as this:

    # FINAL is the default (zero) enum value, so a nonzero result_type means partial
    elif message.HasField('result'):
        restype = 'partial' if message.result.result_type else 'final'
        print(f'{restype}: {message.result.hypotheses[0].formatted_text}')
    

    These are the results with FINAL result type and SINGLE utterance detection mode.

    final : It's Monday morning and the sun is shining
    
  • PARTIAL: Partial and final results are returned. Partial results of each utterance are delivered as soon as speech is detected, but with low recognition confidence. These results usually change as more speech is processed and the context is better understood. The result type is shown as PARTIAL. Final results are returned at the end of each utterance.

    partial : It's
    partial : It's me
    partial : It's month
    partial : It's Monday
    partial : It's Monday no
    partial : It's Monday more
    partial : It's Monday March
    partial : It's Monday morning
    partial : It's Monday morning and
    partial : It's Monday morning and the
    partial : It's Monday morning and this
    partial : It's Monday morning and the sun
    partial : It's Monday morning and the center
    partial : It's Monday morning and the sun is
    partial : It's Monday morning and the sonny's
    partial : It's Monday morning and the sunshine
    final : It's Monday morning and the sun is shining
    
  • IMMUTABLE_PARTIAL: Stabilized partial and final results are returned. Partial results are delivered after a slight delay, to ensure that the recognized words will not change as the rest of the speech is received. The result type is shown as PARTIAL (not IMMUTABLE_PARTIAL). Final results are returned at the end of each utterance.

    partial : It's Monday
    partial : It's Monday morning and the
    final : It's Monday morning and the sun is shining
    
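The following sketch ties these values together. It follows the conventions of the other examples on this page: AudioFormat(...) stands in for a complete audio format, and stream_in is the iterator of streamed responses. Since IMMUTABLE_PARTIAL results arrive marked PARTIAL, the client only needs to distinguish final results from partial ones:

params = RecognitionParameters(
    language = 'en-US',
    topic = 'GEN',
    audio_format = AudioFormat(...),
    result_type = 'IMMUTABLE_PARTIAL',
    utterance_detection_mode = 'SINGLE'
)

for message in stream_in:
    if message.HasField('result'):
        # FINAL is the default (zero) enum value; anything else is partial
        restype = 'partial' if message.result.result_type else 'final'
        print(f'{restype} : {message.result.hypotheses[0].formatted_text}')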

Some data packs perform additional processing after the initial recognition. The transcript may change slightly during this second pass, even for immutable partial results. For example, ASRaaS initially recognized the spoken words “the seven fifty eight train” as “the 750 A-Train,” then corrected it during the second pass, returning “the 758 train” in the final hypothesis:

partial : I'll catch the 750
partial : I'll catch the 750 A-Train
final : I'll catch the 758 train from Cedar Park station

Utterance detection mode

Another recognition parameter, utterance_detection_mode, determines how much of the audio ASRaaS will process. Specify the desired mode in RecognitionParameters: utterance_detection_mode. This parameter has three possible values (a code sketch follows the list):

  • SINGLE (default): Return recognition results for one utterance only, ignoring any trailing audio.

  • MULTIPLE: Return results for all utterances detected in the audio stream. These are the results with MULTIPLE utterance detection mode and FINAL result type:

    final: It's Monday morning and the sun is shining
    final: I'm getting ready to walk to the train and commute into work
    final: I'll catch the 757 train from Cedar Park station
    final: It will take me an hour to get into town
    
  • DISABLED: Return recognition results for all audio provided by the client, without separating it into utterances. The maximum allowed audio length for this detection mode is 30 seconds.
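
As a sketch of the MULTIPLE mode (again following the conventions of the examples above), a client that wants one final transcript line per utterance in a longer recording can combine MULTIPLE with the default FINAL result type and collect the best hypothesis of each utterance:

params = RecognitionParameters(
    language = 'en-US',
    topic = 'GEN',
    audio_format = AudioFormat(...),
    result_type = 'FINAL',
    utterance_detection_mode = 'MULTIPLE'
)

transcript = []
for message in stream_in:
    if message.HasField('result'):
        # With result type FINAL, every result is the final version of one utterance
        transcript.append(message.result.hypotheses[0].formatted_text)
print('\n'.join(transcript))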

The combination of these two parameters returns different results. In all cases, the actual returned fields also depend on which fields the client application chooses to display:

                   Utterance detection mode
Result type        SINGLE                             MULTIPLE                           DISABLED
FINAL              Final version of first             Final version of each              Final version of all
                   utterance.                         utterance.                         speech.
PARTIAL            Partial results, including         Partial results of each            Partial results of all
                   corrections, of first utterance.   utterance.                         speech.
IMMUTABLE_PARTIAL  Stabilized partial results of      Stabilized partial results of      Stabilized partial results of
                   first utterance.                   each utterance.                    all speech.

Not all timer parameters in RecognitionParameters are supported by every detection mode. See Timeouts and detection modes.

Fields chosen by client

Another way to customize your results is by selecting specific fields, or all fields, in your client application.

From the complete results returned by ASRaaS, the client selects the information to display to users. It can be just a few basic fields or the complete results in JSON format.

Basic fields

In this example, the client displays only a few essential fields: the status code and message, the result type, and the formatted text of the best hypothesis of each utterance. The recognition parameters in this request include result type FINAL and utterance detection mode MULTIPLE, meaning that only the final version of each utterance is returned and that all utterances in the audio are processed:

for message in stream_in:
    if message.HasField('status'):
        # Status messages carry a code and message, plus optional details
        if message.status.details:
            print(f'{message.status.code} {message.status.message} - {message.status.details}')
        else:
            print(f'{message.status.code} {message.status.message}')
    elif message.HasField('result'):
        # Result messages: label as partial or final, then print the best hypothesis
        restype = 'partial' if message.result.result_type else 'final'
        print(f'{restype}: {message.result.hypotheses[0].formatted_text}')

The client displays a few basic fields, giving a relatively short result:

stream ../../audio/weather16.wav
100 Continue - recognition started on audio/l16;rate=16000 stream
final: There is more snow coming to the Montreal area in the next few days
final: We're expecting 10 cm overnight and the winds are blowing hard
final: Radar and satellite pictures show that we're on the western edge of the storm system as it continues to track further to the east
200 Success

All fields

This example prints all results returned by ASRaaS, giving a long JSON output of all fields:

for message in stream_in:
    print(message)

The response starts with the initial status and start-of-speech information, followed by ASRaaS’s recognition results.

ASRaaS starts its recognition process by breaking the audio into “utterances,” or portions of audio identified by pauses in the audio stream. The default pause between utterances is set by the server (usually 500 ms, or half a second) but you can adjust it in the client with RecognitionParameters: utterance_end_silence_ms.
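
For example, a client transcribing slow, deliberate speech might raise this value so that natural hesitations are not split into separate utterances. This is a sketch using the parameter conventions shown above:

params = RecognitionParameters(
    language = 'en-US',
    topic = 'GEN',
    audio_format = AudioFormat(...),
    utterance_detection_mode = 'MULTIPLE',
    utterance_end_silence_ms = 1000   # treat pauses up to 1 s as part of the utterance
)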

Each result in the response contains information about an utterance, including statistics, transcription hypotheses of the utterance and its words, plus the data pack used for recognition. By default, several hypotheses are returned for each utterance, showing confidence levels as well as formatted and minimally formatted text of the utterance. (See Formatted text for the difference between formatted and minimally formatted text.)
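
Rather than printing the whole message, a client can also walk these fields directly. This sketch uses the field names visible in the sample output below to print each hypothesis with its confidence, then the word timings of the best hypothesis:

for message in stream_in:
    if message.HasField('result'):
        # Print every hypothesis of the utterance with its average confidence
        for i, hyp in enumerate(message.result.hypotheses):
            print(f'hypothesis {i}: confidence {hyp.average_confidence:.3f}: {hyp.formatted_text}')
        # Then the per-word timings of the best (first) hypothesis
        for word in message.result.hypotheses[0].words:
            print(f'  {word.start_ms} ms - {word.end_ms} ms: {word.text}')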

Depending on the recognition parameters in the request, these results can include one or all utterances, and can show more or less of ASRaaS’s “thinking process” as it recognizes the words the user is speaking.

In this example, the result type is FINAL, meaning ASRaaS returns several hypotheses for each utterance but only the final version of each hypothesis.

With result type PARTIAL, the results can be much longer, with many variations in each hypothesis as the words in the utterance are recognized and transcribed.

stream ../../audio/weather16.wav
status {
  code: 100
  message: "Continue"
  details: "recognition started on audio/l16;rate=16000 stream"
}

start_of_speech {
  first_audio_to_start_of_speech_ms: 840
}

result {
  abs_start_ms: 840
  abs_end_ms: 5600
  utterance_info {
    duration_ms: 4760
    dsp {
      snr_estimate_db: 35.0
      level: 6657.0
      num_channels: 1
      initial_silence_ms: 260
      initial_energy: -34.87929916381836
      final_energy: -83.81700134277344
      mean_energy: 88.64420318603516
    }
  }
  hypotheses {
    confidence: 0.24500000476837158
    average_confidence: 0.8640000224113464
    formatted_text: "There is more snow coming to the Montreal area in the next few days."
    minimally_formatted_text: "There is more snow coming to the Montreal area in the next few days."
    words {
      text: "There"
      confidence: 0.6790000200271606
      start_ms: 260
      end_ms: 400
    }
    words {
      text: "is"
      confidence: 0.765999972820282
      start_ms: 400
      end_ms: 580
    }
    words {
      text: "more"
      confidence: 0.8619999885559082
      start_ms: 580
      end_ms: 860
    }
    words {
      text: "snow"
      confidence: 0.6949999928474426
      start_ms: 860
      end_ms: 1220
    }
*** More words here ***
*** More hypotheses here ***
  data_pack {
    language: "eng-USA"
    topic: "GEN"
    version: "4.12.1"
    id: "GMT20231026205000"
  }
}
*** Results for additional utterances here ***
status {
  code: 200
  message: "Success"
}
Complete results, showing multiple utterances and hypotheses

See RecognitionResponse: Result for a description of the fields in the response.