Results

The results returned by ASRaaS applications can range from a simple transcript of an individual utterance to thousands of lines of JSON information. The scale of these results depends on two main factors: the recognition parameters in the request and the fields chosen by the client application.

As well as recognition results, ASRaaS can return two types of notifications. See Notifications in results.

Recognition parameters in request

One way to customize the results from ASRaaS is with two fields in RecognitionParameters in the request: result_type and utterance_detection_mode:

RecognitionParameters(
    language = 'en-US',
    topic = 'GEN',
    audio_format = AudioFormat(...),
    result_type = 'FINAL|PARTIAL|IMMUTABLE_PARTIAL',
    utterance_detection_mode = 'SINGLE|MULTIPLE|DISABLED'
)

Result type

The result type specifies the level of detail that ASRaaS returns in its streaming result. Set the desired result type in RecognitionParameters: result_type. In the response, the type is indicated in Result: result_type. This parameter has three possible values:

  • FINAL (default): Only the final version of each hypothesis is returned. The result type is FINAL, but because FINAL is the default value, the field is omitted from the results shown by Python applications. To show this information to users, the app can determine the result type and display it using code such as this:

    elif message.HasField('result'):
        # result_type is omitted when it has the default value, FINAL
        restype = 'partial' if message.result.result_type else 'final'
        print(f'{restype}: {message.result.hypotheses[0].formatted_text}')
    

    These are the results with FINAL result type and SINGLE utterance detection mode.

    final : It's Monday morning and the sun is shining
    
  • PARTIAL: Partial and final results are returned. Partial results of each utterance are delivered as soon as speech is detected, but with low recognition confidence. These results usually change as more speech is processed and the context is better understood. The result type is shown as PARTIAL. Final results are returned at the end of each utterance.

    partial : It's
    partial : It's me
    partial : It's month
    partial : It's Monday
    partial : It's Monday no
    partial : It's Monday more
    partial : It's Monday March
    partial : It's Monday morning
    partial : It's Monday morning and
    partial : It's Monday morning and the
    partial : It's Monday morning and this
    partial : It's Monday morning and the sun
    partial : It's Monday morning and the center
    partial : It's Monday morning and the sun is
    partial : It's Monday morning and the sonny's
    partial : It's Monday morning and the sunshine
    final : It's Monday morning and the sun is shining
    
  • IMMUTABLE_PARTIAL: Stabilized partial and final results are returned. Partial results are delivered after a slight delay to ensure that the recognized words do not change with the rest of the received speech. The result type is shown as PARTIAL (not IMMUTABLE_PARTIAL). Final results are returned at the end of each utterance.

    partial : It's Monday
    partial : It's Monday morning and the
    final : It's Monday morning and the sun is shining
    

Some data packs perform additional processing after the initial recognition. The transcript may change slightly during this second pass, even for immutable partial results. For example, ASRaaS originally recognized “the seven fifty eight train” as “the 750 A-Train” but adjusted it during a second pass, returning “the 758 train” in the final hypothesis:

partial : I'll catch the 750
partial : I'll catch the 750 A-Train
final : I'll catch the 758 train from Cedar Park station
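The difference between the result types matters mainly for how a client renders the stream: each partial supersedes the previous one, while a final closes the utterance. This is a minimal, illustrative sketch in which simulated (result type, text) pairs stand in for the streaming RecognitionResponse messages shown above:

```python
# Illustrative only: simulated (result_type, text) pairs stand in for
# streaming RecognitionResponse messages.
def render_stream(results):
    """Let each partial supersede the previous one; keep a line for
    each final result."""
    lines = []
    current = ''
    for result_type, text in results:
        if result_type == 'FINAL':
            lines.append(text)  # the final text replaces any partials
            current = ''
        else:
            current = text      # a newer partial supersedes the old one
    if current:
        lines.append(current)   # stream ended mid-utterance
    return lines

simulated = [
    ('PARTIAL', "It's"),
    ('PARTIAL', "It's Monday"),
    ('PARTIAL', "It's Monday morning and the sunshine"),
    ('FINAL',   "It's Monday morning and the sun is shining"),
]
print(render_stream(simulated))
```

Only the final version survives; a user interface would typically overwrite the in-progress line each time a new partial arrives.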

Utterance detection mode

Another recognition parameter, utterance_detection_mode, determines how much of the audio ASRaaS will process. Specify the desired mode in RecognitionParameters: utterance_detection_mode. This parameter has three possible values:

  • SINGLE (default): Return recognition results for one utterance only, ignoring any trailing audio.

  • MULTIPLE: Return results for all utterances detected in the audio stream. These are the results with MULTIPLE utterance detection mode and FINAL result type:

    final: It's Monday morning and the sun is shining
    final: I'm getting ready to walk to the train and commute into work
    final: I'll catch the 758 train from Cedar Park station
    final: It will take me an hour to get into town
    
  • DISABLED: Return recognition results for all audio provided by the client, without separating it into utterances. The maximum allowed audio length for this detection mode is 30 seconds.
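Because DISABLED caps the audio at 30 seconds, a client may want to check its audio length before choosing that mode. A small illustrative sketch (the 30-second limit comes from the description above; frame count and rate are whatever your audio library reports):

```python
MAX_DISABLED_AUDIO_S = 30.0  # maximum audio length for DISABLED detection mode

def fits_disabled_mode(num_frames, frame_rate_hz):
    """True if the audio is short enough for utterance_detection_mode = 'DISABLED'."""
    return num_frames / frame_rate_hz <= MAX_DISABLED_AUDIO_S

# 29 s of 16 kHz audio fits; 31 s does not
print(fits_disabled_mode(29 * 16000, 16000))  # True
print(fits_disabled_mode(31 * 16000, 16000))  # False
```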

The combination of these two parameters returns different results. In all cases, the actual returned fields also depend on which fields the client application chooses to display:

                      Utterance detection mode
  Result type         SINGLE                           MULTIPLE                         DISABLED
  FINAL               Final version of first           Final version of each            Final version of all speech.
                      utterance.                       utterance.
  PARTIAL             Partial results, including       Partial results of each          Partial results of all speech.
                      corrections, of first            utterance.
                      utterance.
  IMMUTABLE_PARTIAL   Stabilized partial results       Stabilized partial results       Stabilized partial results of
                      of first utterance.              of each utterance.               all speech.

The detection modes do not support all the timer parameters in RecognitionParameters. See Timeouts and detection modes.

Fields chosen by client

Another way to customize your results is by selecting specific fields, or all fields, in your client application.

From the complete results returned by ASRaaS, the client selects the information to display to users. It can be just a few basic fields or the complete results in JSON format.

Basic fields

In this example, the client displays only a few essential fields: the status code and message, the result type, and the formatted text of the best hypothesis of each utterance. The recognition parameters in this request include result type FINAL and utterance detection mode MULTIPLE, meaning only the final and best version of the utterance is returned and all utterances in the audio are processed:

for message in stream_in:
    if message.HasField('status'):
        if message.status.details:
            print(f'{message.status.code} {message.status.message} - {message.status.details}')
        else:
            print(f'{message.status.code} {message.status.message}')
    elif message.HasField('result'):
        restype = 'partial' if message.result.result_type else 'final'
        print(f'{restype}: {message.result.hypotheses[0].formatted_text}')

The client displays a few basic fields, giving a relatively short result:

stream ../../audio/weather16.wav
100 Continue - recognition started on audio/l16;rate=16000 stream
final: There is more snow coming to the Montreal area in the next few days
final: We're expecting 10 cm overnight and the winds are blowing hard
final: Radar and satellite pictures show that we're on the western edge of the storm system as it continues to track further to the east
200 Success

All fields

This example prints all results returned by ASRaaS, giving a long JSON output of all fields:

for message in stream_in:
    print(message)

The response starts with the initial status and start-of-speech information, followed by ASRaaS’s recognition results.

ASRaaS starts its recognition process by breaking the audio into utterances, or portions of audio identified by pauses in the audio stream. The default pause between utterances is set by the server (usually 500 ms, or half a second) but you can adjust it in the client with RecognitionParameters utterance_end_silence_ms.
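For example, to treat pauses of up to 800 ms as part of the same utterance (800 is an arbitrary value chosen for illustration), the client could set:

    RecognitionParameters(
        language = 'en-US',
        topic = 'GEN',
        audio_format = AudioFormat(...),
        utterance_detection_mode = 'MULTIPLE',
        utterance_end_silence_ms = 800
    )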

Each result in the response contains information about an utterance, including statistics, transcription hypotheses of the utterance and its words, plus the data pack used for recognition. By default, several hypotheses are returned for each utterance, showing confidence levels as well as formatted and minimally formatted text of the utterance. (See Formatted text for the difference between formatted and minimally formatted text.)
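A client that wants to show the alternatives rather than only hypotheses[0] can walk the hypothesis list. This is an illustrative sketch; the dict (with made-up confidence values) stands in for the protobuf Result message:

```python
# Illustrative only: a dict stands in for the protobuf Result message;
# field names follow the RecognitionResponse schema.
def summarize_hypotheses(result):
    """Return one summary line per hypothesis, best first."""
    lines = []
    for i, hyp in enumerate(result['hypotheses'], start=1):
        lines.append(f"{i}. ({hyp['average_confidence']:.3f}) {hyp['formatted_text']}")
    return lines

result = {
    'hypotheses': [
        {'average_confidence': 0.864,
         'formatted_text': 'There is more snow coming to the Montreal area in the next few days.'},
        {'average_confidence': 0.798,
         'formatted_text': 'There is more snow coming to the Montreal area in the next few day.'},
    ]
}
for line in summarize_hypotheses(result):
    print(line)
```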

Depending on the recognition parameters in the request, these results can include one or all utterances, and can show more or less of ASRaaS’s “thinking process” as it recognizes the words the user is speaking.

In this example, the result type is FINAL, meaning ASRaaS returns several hypotheses for each utterance but only the final version of each hypothesis.

With result type PARTIAL, the results can be much longer, with many variations in each hypothesis as the words in the utterance are recognized and transcribed.

stream ../../audio/weather16.wav
status {
  code: 100
  message: "Continue"
  details: "recognition started on audio/l16;rate=16000 stream"
}

start_of_speech {
  first_audio_to_start_of_speech_ms: 840
}

result {
  abs_start_ms: 840
  abs_end_ms: 5600
  utterance_info {
    duration_ms: 4760
    dsp {
      snr_estimate_db: 35.0
      level: 6657.0
      num_channels: 1
      initial_silence_ms: 260
      initial_energy: -34.87929916381836
      final_energy: -83.81700134277344
      mean_energy: 88.64420318603516
    }
  }
  hypotheses {
    confidence: 0.24500000476837158
    average_confidence: 0.8640000224113464
    formatted_text: "There is more snow coming to the Montreal area in the next few days."
    minimally_formatted_text: "There is more snow coming to the Montreal area in the next few days."
    words {
      text: "There"
      confidence: 0.6790000200271606
      start_ms: 260
      end_ms: 400
    }
    words {
      text: "is"
      confidence: 0.765999972820282
      start_ms: 400
      end_ms: 580
    }
    words {
      text: "more"
      confidence: 0.8619999885559082
      start_ms: 580
      end_ms: 860
    }
    words {
      text: "snow"
      confidence: 0.6949999928474426
      start_ms: 860
      end_ms: 1220
    }
*** More words here ***
*** More hypotheses here ***
  data_pack {
    language: "eng-USA"
    topic: "GEN"
    version: "4.12.1"
    id: "GMT20231026205000"
  }
}
*** Results for additional utterances here ***
status {
  code: 200
  message: "Success"
}
These complete results show multiple utterances and hypotheses.

See RecognitionResponse: Result for a description of the fields in the response.

Notifications in results

ASRaaS may return two types of notifications in RecognitionResponse > Result: recognition warnings and warnings of imminent shutdown.

Recognition warnings

Notifications about the recognition process are returned, when applicable, in result > notifications, along with the recognition results. These messages are warnings or errors that don’t trigger a run-time error, so recognition can continue with the limitations or suggestions in the message.

For example, this notification warns that a wordset was compiled using a different version of the data pack and should be recompiled. The wordset is used, but may be less effective because of the mismatch:

result: {
  result_type: PARTIAL
  abs_start_ms: 160
  abs_end_ms: 3510
  hypotheses: [ 
  *** Hypotheses here *** 
  ],
  data_pack: {
    language: "eng-USA"
    topic: "GEN"
    version: "4.11.1"
    id: "GMT20230830154712"
  }
  notifications: [
    {
      code: 1002
      severity: SEVERITY_WARNING
      message: {
        locale: "en-US"
        message: "Wordset-pkg should be recompiled."
        message_resource_id: "1002"
      }
      data: {
        application/x-nuance-wordset-pkg: "urn:nuance-mix:tag:wordset:lang/names-places/places-compiled-ws/eng-USA/mix.asr"
        application/x-nuance-domainlm: "urn:nuance-mix:tag:model/names-places/mix.asr?=language=eng-usa"
      }
    }
  ]
}
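Since these notifications do not stop recognition, a client typically logs them and carries on. An illustrative sketch, with dicts (following the field names in the example above) standing in for the Result and Notification messages:

```python
# Illustrative only: dicts stand in for Result and Notification messages;
# field names follow the notification example above.
def collect_warnings(result):
    """Gather warning-level notifications so the client can log them
    while recognition continues."""
    return [
        (note['code'], note['message']['message'])
        for note in result.get('notifications', [])
        if note['severity'] == 'SEVERITY_WARNING'
    ]

warning_result = {
    'notifications': [{
        'code': 1002,
        'severity': 'SEVERITY_WARNING',
        'message': {'locale': 'en-US',
                    'message': 'Wordset-pkg should be recompiled.'},
    }],
}
print(collect_warnings(warning_result))
```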

Shutdown warnings

ASRaaS also returns a notification when the session receives a SIGTERM signal, indicating the session is about to end. This situation can occur in long recognition sessions, when the Kubernetes pod running ASRaaS terminates automatically.

ASRaaS returns the NOTIFICATIONS result type, with details in a notifications message, allowing the client to handle the shutdown gracefully. The timeout period is shown in the data field.

This notification is not tied to any partial or final recognition results. For example:

result {
  result_type: NOTIFICATIONS
  notifications {
    code: 1005
    severity: SEVERITY_INFO
    message {
      locale: "en-US"
      message: "Imminent shutdown."
      message_resource_id: "1005"
    }
    data {
      key: "timeout_ms"
      value: "10000"
    }
  }
}

After the timeout period, ASRaaS sends a final status message with status code 512: Shutdown while processing a request, indicating that the recognition session has ended.
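To handle the shutdown gracefully, the client can read the timeout from the notification and wind down before the 512 status arrives. An illustrative sketch, with a dict (mirroring the example above, including code 1005 and the timeout_ms key) standing in for the Result message:

```python
# Illustrative only: a dict stands in for a Result message with the
# NOTIFICATIONS result type; code 1005 and the 'timeout_ms' key follow
# the example shown above.
def shutdown_deadline_ms(result):
    """Return milliseconds until shutdown, or None if this result is not
    an imminent-shutdown notification."""
    if result.get('result_type') != 'NOTIFICATIONS':
        return None
    for note in result.get('notifications', []):
        if note.get('code') == 1005:
            return int(note['data']['timeout_ms'])
    return None

shutdown_result = {
    'result_type': 'NOTIFICATIONS',
    'notifications': [{
        'code': 1005,
        'severity': 'SEVERITY_INFO',
        'data': {'timeout_ms': '10000'},
    }],
}
print(shutdown_deadline_ms(shutdown_result))  # 10000
```

A client might use the returned value to stop sending new audio and flush any pending results within the remaining time.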