Results

The results returned by ASRaaS applications can range from a simple transcript of an individual utterance to thousands of lines of JSON information. The scale of these results depends on two main factors: the recognition parameters in the request and the fields chosen by the client application.

Recognition parameters in request

One way to customize the results from ASRaaS is with two RecognitionParameters in the request: result_type and utterance_detection_mode:

RecognitionParameters(
    language = 'en-US',
    topic = 'GEN',
    audio_format = AudioFormat(...),
    result_type = 'FINAL|PARTIAL|IMMUTABLE_PARTIAL',
    utterance_detection_mode = 'SINGLE|MULTIPLE|DISABLED'
)

Result type

The result type specifies the level of detail that ASRaaS returns in its streaming result. Set the desired type in RecognitionParameters: result_type. In the response, the type is indicated in Result: result_type. This parameter has three possible values (a code sketch follows the list):

  • FINAL (default): Only the final version of each hypothesis is returned. The result type is FINAL, but because that is the default (zero) enum value, the field is not shown in results printed by Python applications. To show this information to users, the app can determine the result type and display it using code such as this:

    # FINAL is the default (zero) enum value, so a nonzero result_type means partial
    elif message.HasField('result'):
        restype = 'partial' if message.result.result_type else 'final'
        print(f'{restype}: {message.result.hypotheses[0].formatted_text}')
    

    These are the results with FINAL result type and SINGLE utterance detection mode.

    final : It's Monday morning and the sun is shining
    
  • PARTIAL: Partial and final results are returned. Partial results of each utterance are delivered as soon as speech is detected, but with low recognition confidence. These results usually change as more speech is processed and the context is better understood. The result type is shown as PARTIAL. Final results are returned at the end of each utterance.

    partial : It's
    partial : It's me
    partial : It's month
    partial : It's Monday
    partial : It's Monday no
    partial : It's Monday more
    partial : It's Monday March
    partial : It's Monday morning
    partial : It's Monday morning and
    partial : It's Monday morning and the
    partial : It's Monday morning and this
    partial : It's Monday morning and the sun
    partial : It's Monday morning and the center
    partial : It's Monday morning and the sun is
    partial : It's Monday morning and the sonny's
    partial : It's Monday morning and the sunshine
    final : It's Monday morning and the sun is shining
    
  • IMMUTABLE_PARTIAL: Stabilized partial and final results are returned. Partial results are delivered after a slight delay, to ensure that the recognized words will not change as the rest of the speech is received. The result type is shown as PARTIAL (not IMMUTABLE_PARTIAL). Final results are returned at the end of each utterance.

    partial : It's Monday
    partial : It's Monday morning and the
    final : It's Monday morning and the sun is shining
    
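The following sketch ties these values together. It follows the conventions of the other examples on this page: AudioFormat(...) stands in for a complete audio format, and stream_in is the iterator of streamed responses. Since IMMUTABLE_PARTIAL results arrive marked PARTIAL, the client only needs to distinguish final results from partial ones:

params = RecognitionParameters(
    language = 'en-US',
    topic = 'GEN',
    audio_format = AudioFormat(...),
    result_type = 'IMMUTABLE_PARTIAL',
    utterance_detection_mode = 'SINGLE'
)

for message in stream_in:
    if message.HasField('result'):
        # FINAL is the default (zero) enum value; anything else is partial
        restype = 'partial' if message.result.result_type else 'final'
        print(f'{restype} : {message.result.hypotheses[0].formatted_text}')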

Some data packs perform additional processing after the initial recognition. The transcript may change slightly during this second pass, even for immutable partial results. For example, ASRaaS initially recognized the spoken words “the seven fifty eight train” as “the 750 A-Train,” then corrected it during the second pass, returning “the 758 train” in the final hypothesis:

partial : I'll catch the 750
partial : I'll catch the 750 A-Train
final : I'll catch the 758 train from Cedar Park station

Utterance detection mode

Another recognition parameter, utterance_detection_mode, determines how much of the audio ASRaaS will process. Specify the desired mode in RecognitionParameters: utterance_detection_mode. This parameter has three possible values (a code sketch follows the list):

  • SINGLE (default): Return recognition results for one utterance only, ignoring any trailing audio.

  • MULTIPLE: Return results for all utterances detected in the audio stream. These are the results with MULTIPLE utterance detection mode and FINAL result type:

    final: It's Monday morning and the sun is shining
    final: I'm getting ready to walk to the train and commute into work
    final: I'll catch the 757 train from Cedar Park station
    final: It will take me an hour to get into town
    
  • DISABLED: Return recognition results for all audio provided by the client, without separating it into utterances. The maximum allowed audio length for this detection mode is 30 seconds.
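
As a sketch of the MULTIPLE mode (again following the conventions of the examples above), a client that wants one final transcript line per utterance in a longer recording can combine MULTIPLE with the default FINAL result type and collect the best hypothesis of each utterance:

params = RecognitionParameters(
    language = 'en-US',
    topic = 'GEN',
    audio_format = AudioFormat(...),
    result_type = 'FINAL',
    utterance_detection_mode = 'MULTIPLE'
)

transcript = []
for message in stream_in:
    if message.HasField('result'):
        # With result type FINAL, every result is the final version of one utterance
        transcript.append(message.result.hypotheses[0].formatted_text)
print('\n'.join(transcript))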

The combination of these two parameters returns different results. In all cases, the actual returned fields also depend on which fields the client application chooses to display:

                   Utterance detection mode
Result type        SINGLE                             MULTIPLE                           DISABLED
FINAL              Final version of first             Final version of each              Final version of all
                   utterance.                         utterance.                         speech.
PARTIAL            Partial results, including         Partial results of each            Partial results of all
                   corrections, of first utterance.   utterance.                         speech.
IMMUTABLE_PARTIAL  Stabilized partial results of      Stabilized partial results of      Stabilized partial results of
                   first utterance.                   each utterance.                    all speech.

Not all timer parameters in RecognitionParameters are supported by every detection mode. See Timeouts and detection modes.

Fields chosen by client

Another way to customize your results is by selecting specific fields, or all fields, in your client application.

From the complete results returned by ASRaaS, the client selects the information to display to users. It can be just a few basic fields or the complete results in JSON format.

Basic fields

In this example, the client displays only a few essential fields: the status code and message, the result type, and the formatted text of the best hypothesis of each utterance. The recognition parameters in this request include result type FINAL and utterance detection mode MULTIPLE, meaning that only the final version of each utterance is returned and that all utterances in the audio are processed:

for message in stream_in:
    if message.HasField('status'):
        # Status messages carry a code and message, plus optional details
        if message.status.details:
            print(f'{message.status.code} {message.status.message} - {message.status.details}')
        else:
            print(f'{message.status.code} {message.status.message}')
    elif message.HasField('result'):
        # Result messages: label as partial or final, then print the best hypothesis
        restype = 'partial' if message.result.result_type else 'final'
        print(f'{restype}: {message.result.hypotheses[0].formatted_text}')

The client displays a few basic fields, giving a relatively short result:

stream ../../audio/weather16.wav
100 Continue - recognition started on audio/l16;rate=16000 stream
final: There is more snow coming to the Montreal area in the next few days
final: We're expecting 10 cm overnight and the winds are blowing hard
final: Radar and satellite pictures show that we're on the western edge of the storm system as it continues to track further to the east
200 Success

All fields

This example prints all results returned by ASRaaS, giving a long JSON output of all fields:

for message in stream_in:
    print(message)

The response starts with the initial status and start-of-speech information, followed by ASRaaS’s recognition results.

ASRaaS starts its recognition process by breaking the audio into “utterances,” or portions of audio identified by pauses in the audio stream. The default pause between utterances is set by the server (usually 500 ms, or half a second) but you can adjust it in the client with RecognitionParameters: utterance_end_silence_ms.
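
For example, a client transcribing slow, deliberate speech might raise this value so that natural hesitations are not split into separate utterances. This is a sketch using the parameter conventions shown above:

params = RecognitionParameters(
    language = 'en-US',
    topic = 'GEN',
    audio_format = AudioFormat(...),
    utterance_detection_mode = 'MULTIPLE',
    utterance_end_silence_ms = 1000   # treat pauses up to 1 s as part of the utterance
)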

Each result in the response contains information about an utterance, including statistics, transcription hypotheses of the utterance and its words, plus the data pack used for recognition. By default, several hypotheses are returned for each utterance, showing confidence levels as well as formatted and minimally formatted text of the utterance. (See Formatted text for the difference between formatted and minimally formatted text.)
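
Rather than printing the whole message, a client can also walk these fields directly. This sketch uses the field names visible in the sample output below to print each hypothesis with its confidence, then the word timings of the best hypothesis:

for message in stream_in:
    if message.HasField('result'):
        # Print every hypothesis of the utterance with its average confidence
        for i, hyp in enumerate(message.result.hypotheses):
            print(f'hypothesis {i}: confidence {hyp.average_confidence:.3f}: {hyp.formatted_text}')
        # Then the per-word timings of the best (first) hypothesis
        for word in message.result.hypotheses[0].words:
            print(f'  {word.start_ms} ms - {word.end_ms} ms: {word.text}')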

Depending on the recognition parameters in the request, these results can include one or all utterances, and can show more or less of ASRaaS’s “thinking process” as it recognizes the words the user is speaking.

In this example, the result type is FINAL, meaning ASRaaS returns several hypotheses for each utterance but only the final version of each hypothesis.

With result type PARTIAL, the results can be much longer, with many variations in each hypothesis as the words in the utterance are recognized and transcribed.

stream ../../audio/weather16.wav
status {
  code: 100
  message: "Continue"
  details: "recognition started on audio/l16;rate=16000 stream"
}

start_of_speech {
  first_audio_to_start_of_speech_ms: 840
}

result {
  abs_start_ms: 840
  abs_end_ms: 5600
  utterance_info {
    duration_ms: 4760
    dsp {
      snr_estimate_db: 35.0
      level: 6657.0
      num_channels: 1
      initial_silence_ms: 260
      initial_energy: -34.87929916381836
      final_energy: -83.81700134277344
      mean_energy: 88.64420318603516
    }
  }
  hypotheses {
    confidence: 0.24500000476837158
    average_confidence: 0.8640000224113464
    formatted_text: "There is more snow coming to the Montreal area in the next few days."
    minimally_formatted_text: "There is more snow coming to the Montreal area in the next few days."
    words {
      text: "There"
      confidence: 0.6790000200271606
      start_ms: 260
      end_ms: 400
    }
    words {
      text: "is"
      confidence: 0.765999972820282
      start_ms: 400
      end_ms: 580
    }
    words {
      text: "more"
      confidence: 0.8619999885559082
      start_ms: 580
      end_ms: 860
    }
    words {
      text: "snow"
      confidence: 0.6949999928474426
      start_ms: 860
      end_ms: 1220
    }
*** More words here ***
*** More hypotheses here ***
  data_pack {
    language: "eng-USA"
    topic: "GEN"
    version: "4.12.1"
    id: "GMT20231026205000"
  }
}
*** Results for additional utterances here ***
status {
  code: 200
  message: "Success"
}
Complete results, showing multiple utterances and hypotheses

See RecognitionResponse: Result for a description of the fields in the response.