Recognizer gRPC API

The Recognizer gRPC API contains methods for requesting speech recognition.

Proto file structure

ASRaaS provides protocol buffer (.proto) files to define Nuance’s ASR Recognizer service for gRPC. These files contain the building blocks of your speech recognition applications. See gRPC setup to download these proto files.

  • recognizer.proto contains the main Recognize streaming service.
  • resource.proto contains recognition resources such as domain language models and wordsets.
  • result.proto contains the recognition results that ASRaaS streams back to the client application.
  • The RPC files contain status and error messages used by other Nuance APIs.
└── nuance
    ├── asr
    │   └── v1
    │       ├── recognizer.proto
    │       ├── resource.proto
    │       └── result.proto
    └── rpc (RPC message files)
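
If you generate the Python stubs yourself, a typical grpcio-tools invocation looks like this (a sketch assuming the layout above; the nuance/rpc files must also be generated so that their imports resolve):

python -m grpc_tools.protoc --proto_path=. --python_out=. --grpc_python_out=. \
    nuance/asr/v1/recognizer.proto nuance/asr/v1/resource.proto nuance/asr/v1/result.proto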

The proto files define a Recognizer service with a Recognize method that streams a RecognitionRequest and RecognitionResponse. The following sections describe each component, which the proto files reference by name.

  • Principal fields in recognition request
  • Fields in recognition response

For the RPC fields, see RPC gRPC messages.

Recognizer

The Recognizer service offers one RPC method to perform streaming recognition. The method uses bidirectional streaming: the client streams RecognitionRequest messages and the server streams RecognitionResponse messages.

Recognizer service
Name Request Response Description
Recognize RecognitionRequest stream RecognitionResponse stream Starts a recognition request and returns a response. Both request and response are streamed.

This example shows the Recognizer stub and Recognize method:

# Assumes stubs generated from recognizer.proto, for example:
# from nuance.asr.v1.recognizer_pb2_grpc import RecognizerStub
with grpc.secure_channel(hostaddr, credentials=channel_credentials) as channel:
    stub = RecognizerStub(channel)
    stream_in = stub.Recognize(client_stream(wf))

RecognitionRequest

Input stream messages that request recognition, sent one at a time in a specific order. The first message, which is mandatory, sends recognition parameters and resources; subsequent messages send the audio to be recognized. Included in Recognize method.

Recognition request
Field Type Description
One of:    
   recognition_init_message RecognitionInitMessage Mandatory. First message in the RPC input stream, sends parameters and resources for recognition.
   control_message ControlMessage Optional second message in the RPC input stream, for timer control.
   audio bytes Mandatory. Subsequent message containing audio samples in the selected encoding for recognition. See AudioFormat.

ASRaaS is a real-time service and audio should be streamed at a speed as close to real time as possible. For the best recognition results, we recommend an audio chunk size of 20 to 100 milliseconds.
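At 16 kHz, 16-bit mono PCM, for example, a 20 ms chunk is 320 samples, or 640 bytes.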

The RecognitionRequest message includes:

RecognitionRequest
  recognition_init_message (RecognitionInitMessage)
    parameters (RecognitionParameters)
    resources (RecognitionResource)
    client_data
    user_id
  control_message (ControlMessage)
  audio

This RecognitionRequest example sends a recognition init message and then audio to be transcribed:

def client_stream(wf):
    try:
        # Start the recognition
        init = RecognitionInitMessage(. . .)
        yield RecognitionRequest(recognition_init_message = init)

        # Simulate a typical realtime audio stream
        print(f'stream {wf.name}')
        packet_duration = 0.020
        packet_samples = int(wf.getframerate() * packet_duration)
        for packet in iter(lambda: wf.readframes(packet_samples), b''):
            yield RecognitionRequest(audio=packet)
            sleep(packet_duration)  # Pace the audio at real time (requires: from time import sleep)

For a control_message example, see Timers and timeouts.

RecognitionInitMessage

Input message that initiates a new recognition turn. Included in RecognitionRequest.

Recognition initial message
Field Type Description
parameters RecognitionParameters Mandatory. Language, audio format, and other recognition parameters.
resources repeated RecognitionResource One or more resources (DLMs, wordsets, builtins) to improve recognition.
client_data map<string, string> Map of client-supplied key, value pairs to inject into the call log.
user_id string Identifies a specific user within the application.

The RecognitionInitMessage message includes:

RecognitionRequest
  recognition_init_message (RecognitionInitMessage)
    parameters (RecognitionParameters)
      language
      topic
      audio_format
      utterance_detection_mode
      result_type
      etc.
    resources (RecognitionResource)
      external_reference
        type
        uri
      inline_wordset
      builtin
      inline_grammar
      weight_enum | weight_value
      reuse
    client_data
    user_id

This RecognitionInitMessage example includes many optional fields:

RecognitionInitMessage(
    parameters = RecognitionParameters(
        language = 'en-US',
        topic = 'GEN',
        audio_format = AudioFormat(
            pcm = PCM(
                sample_rate_hz=wf.getframerate()
            )
        ),
        result_type = 'FINAL',
        utterance_detection_mode = 'MULTIPLE',
        recognition_flags = RecognitionFlags(auto_punctuate=True)
    ),
    resources = [travel_dlm, places_wordset],
    client_data = {'company':'Aardvark','user':'James'},
    user_id = 'james.somebody@aardvark.com'
)

This minimal RecognitionInitMessage contains just the mandatory fields:

RecognitionInitMessage(
    parameters = RecognitionParameters(
        language = 'en-US',
        audio_format = AudioFormat(pcm=PCM(sample_rate_hz=16000))
    )
)

RecognitionParameters

Input message that defines parameters for the recognition process. Included in RecognitionInitMessage.

The language and audio_format parameters are mandatory. All others are optional.

See Defaults for a list of default values.

If the request includes a DLM, both request and DLM must use the same language and topic. See Language in request and DLM.

Recognition parameters
Field Type Description
language string

Mandatory. Language and region (locale) code as xx-XX or xxx-XXX, for example, en-US or eng-USA. Not case-sensitive.
For available languages, see Geographies.

topic string Specialized language model in data pack. Case-sensitive, uppercase. Default is GEN (general).
audio_format AudioFormat Mandatory. Audio codec type and sample rate.
utterance_detection_mode EnumUtteranceDetectionMode How many utterances within the audio stream are processed. Default SINGLE.
result_type EnumResultType The level of recognition results. Default FINAL.
recognition_flags RecognitionFlags Boolean recognition parameters.
no_input_timeout_ms uint32 Maximum silence, in ms, allowed while waiting for user input after recognition timers are started. Default (0) means server default, usually no timeout. See Timers and timeouts.
recognition_timeout_ms uint32 Maximum duration, in ms, of recognition turn. Default (0) means server default, usually no timeout. See Timers and timeouts.
utterance_end_silence_ms uint32 Minimum silence, in ms, that determines the end of an utterance. Default (0) means server default, usually 500 ms (half a second). See Timers and timeouts.
speech_detection_sensitivity float A balance between detecting speech and noise (breathing, etc.), 0 to 1.
0 means ignore all noise, 1 means interpret all noise as speech. Default is 0.5.
max_hypotheses uint32 Maximum number of n-best hypotheses to return. Default (0) means server default, usually 10 hypotheses.
speech_domain string Mapping to internal weight sets for language models in the data pack. Values depend on the data pack.
formatting Formatting Formatting keywords.

The RecognitionParameters message includes:

RecognitionRequest
  recognition_init_message (RecognitionInitMessage)
    parameters (RecognitionParameters)
      language
      topic
      audio_format (AudioFormat):
        pcm|alaw|ulaw|opus|ogg_opus
      utterance_detection_mode (EnumUtteranceDetectionMode): SINGLE|MULTIPLE|DISABLED
      result_type (EnumResultType): FINAL|PARTIAL|IMMUTABLE_PARTIAL
      recognition_flags (RecognitionFlags):
        auto_punctuate
        filter_profanity
        mask_load_failures
        etc.
      no_input_timeout_ms
      recognition_timeout_ms
      utterance_end_silence_ms
      speech_detection_sensitivity
      max_hypotheses
      speech_domain
      formatting (Formatting)

This RecognitionParameters example includes recognition flags:

RecognitionInitMessage(
    parameters = RecognitionParameters(
        language = 'en-US',
        topic = 'GEN',
        audio_format = AudioFormat(
            pcm = PCM(
                sample_rate_hz = wf.getframerate()
            )
        ),
        result_type = 'PARTIAL',
        utterance_detection_mode = 'MULTIPLE',
        recognition_flags = RecognitionFlags(
            auto_punctuate = True,
            filter_wakeup_word = True
        )
    )
)

For examples of the formatting parameter, see Formatting and Formatted text.

AudioFormat

Mandatory input message containing the audio format of the audio to transcribe. Included in RecognitionParameters.

Audio format
Field Type Description
One of:
   pcm PCM Signed 16-bit little-endian PCM, 8kHz or 16kHz.
   alaw ALaw G.711 A-law, 8kHz.
   ulaw ULaw G.711 µ-law, 8kHz.
   opus Opus RFC 6716 Opus, 8kHz or 16kHz.
   ogg_opus OggOpus RFC 7845 Ogg-encapsulated Opus, 8kHz or 16kHz.

This AudioFormat example sets PCM format, with alternatives shown in commented lines:

RecognitionInitMessage(
    parameters = RecognitionParameters(
        language = 'en-US',
        topic = 'GEN',
        audio_format = AudioFormat(
            pcm = PCM(
                sample_rate_hz = wf.getframerate()
            )
        ),
#        audio_format = AudioFormat(pcm = PCM()),
#        audio_format = AudioFormat(pcm = PCM(sample_rate_hz = 16000)),
#        audio_format = AudioFormat(alaw = ALaw()),
#        audio_format = AudioFormat(ulaw = ULaw()),
#        audio_format = AudioFormat(opus = Opus(source_rate_hz = 16000)),
#        audio_format = AudioFormat(ogg_opus = OggOpus(output_rate_hz = 16000)),
        result_type = 'FINAL',
        utterance_detection_mode = 'MULTIPLE'
    )
)

PCM

Input message defining PCM sample rate. Included in AudioFormat.

PCM audio
Field Type Description
sample_rate_hz uint32 Audio sample rate in Hertz: 0, 8000, 16000. Default 0, meaning 8000.

ALaw

Input message defining A-law audio format. G.711 audio formats are set to 8kHz. Included in AudioFormat.

ULaw

Input message defining µ-law audio format. G.711 audio formats are set to 8kHz. Included in AudioFormat.

Opus

Input message defining Opus packet stream decoding parameters. Included in AudioFormat.

Opus audio
Field Type Description
decode_rate_hz uint32 Decoder output rate in Hertz: 0, 8000, 16000. Default 0, meaning 8000.
preskip_samples uint32 Decoder 48 kHz output samples to skip.
source_rate_hz uint32 Input source sample rate in Hertz.

OggOpus

Input message defining Ogg-encapsulated Opus audio stream parameters. Included in AudioFormat.

Ogg Opus audio
Field Type Description
output_rate_hz uint32 Decoder output rate in Hertz: 0, 8000, 16000. Default 0, meaning 8000.

ASRaaS supports the Opus audio format, either raw Opus (RFC 6716) or Ogg-encapsulated Opus (RFC 7845). The recommended encoder settings for Opus for speech recognition are:

  • Sampling rate: 16 kHz
  • Complexity: 3
  • Bitrate: 28 kbps recommended (20 kbps minimum)
  • Bitrate type: VBR (variable bitrate) or CBR (constant bitrate)
  • Packet length: 20 ms
  • Encoder mode: SILK-only mode
  • With Ogg encapsulation, the maximum Ogg container delay should be <= 100 ms.

Opus is a lossy codec, so recognition results may not be identical to those obtained with PCM audio.
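
For example, an Ogg Opus file can be streamed by sending the container bytes in small chunks. This sketch is illustrative (the function name, file handling, and chunk size are not part of the API) and assumes ASRaaS parses the Ogg container itself:

def client_stream_opus(opus_file):
    # Initiate recognition for an Ogg Opus stream
    init = RecognitionInitMessage(
        parameters = RecognitionParameters(
            language = 'en-US',
            audio_format = AudioFormat(ogg_opus = OggOpus(output_rate_hz = 16000))
        )
    )
    yield RecognitionRequest(recognition_init_message = init)

    # Send the Ogg container bytes in small chunks
    with open(opus_file, 'rb') as f:
        for chunk in iter(lambda: f.read(4096), b''):
            yield RecognitionRequest(audio = chunk)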

EnumUtteranceDetectionMode

Input field specifying how utterances should be detected and transcribed within the audio stream. Included in RecognitionParameters. The default is SINGLE. When the detection mode is DISABLED, the recognition ends only when the client stops sending audio.

Utterance detection mode
Name Number Description
SINGLE 0 Return recognition results for one utterance only, ignoring any trailing audio. Default.
MULTIPLE 1 Return results for all utterances detected in the audio stream. Does not support RecognitionParameters recognition_timeout_ms.
DISABLED 2 Return recognition results for all audio provided by the client, without separating it into utterances. The maximum allowed audio length for this detection mode is 30 seconds. Does not support RecognitionParameters no_input_timeout_ms, recognition_timeout_ms, or utterance_end_silence_ms.

The detection modes do not support all the timer parameters in RecognitionParameters. See Timeouts and detection modes.

The MULTIPLE detection mode detects and recognizes each utterance in the audio stream:

RecognitionInitMessage(
    parameters = RecognitionParameters(
        language = 'en-US',
        topic = 'GEN',
        audio_format = AudioFormat(pcm=PCM(sample_rate_hz=wf.getframerate())),
        result_type = 'PARTIAL',
        utterance_detection_mode = 'MULTIPLE'
    )
)

EnumResultType

Input and output field specifying how results for each utterance are returned.

As input in RecognitionParameters, EnumResultType specifies the desired result type: FINAL (default), PARTIAL, or IMMUTABLE_PARTIAL.

As output in Result, it indicates the actual result type being returned: FINAL, PARTIAL, or NOTIFICATIONS.

  • For final results, the result_type field is not returned in Python applications, as FINAL is the default.
  • The PARTIAL result type is returned for both partial and immutable partial results.
  • NOTIFICATIONS is returned when the service receives a SIGTERM signal, to warn of an imminent termination.

For examples, see Results > Result type and Notifications in results.

Result type
Name Number Description
FINAL 0 Only the final version of each utterance is returned. Default.
PARTIAL 1 Variable partial results are returned, followed by a final result. In response (Result), this value is used for partial and immutable partial results.
IMMUTABLE_PARTIAL 2 Stabilized partial results are returned, followed by a final result. Used in request (RecognitionParameters) only, not in response (Result).
NOTIFICATIONS 3 In response (Result) only, a pseudo-result that carries only notifications (one or more).

ASRaaS instances emit a notification when they receive a SIGTERM signal. This warns the client of an imminent termination and the shutdown grace period. Details are returned in a notifications message (see Notification).

Shutdown notification results are not tied to any partial or final results that may also be emitted.

In a recognition request, the PARTIAL result type asks ASRaaS to return a stream of partial results, including corrections:

RecognitionInitMessage(
    parameters = RecognitionParameters(
        language = 'en-US',
        topic = 'GEN',
        audio_format = AudioFormat(pcm=PCM(sample_rate_hz=wf.getframerate())),
        result_type = 'PARTIAL',
        utterance_detection_mode = 'MULTIPLE'
    )
)

The response to this request indicates the result type:

result {
  result_type: PARTIAL
  abs_start_ms: 840
  abs_end_ms: 5600
  ... 
}  

In a recognition response, the NOTIFICATIONS result type reports an imminent shutdown:

result {
  result_type: NOTIFICATIONS
  notifications {
    code: 1005
    severity: SEVERITY_INFO
    message {
      locale: "en-US"
      message: "Imminent shutdown."
      message_resource_id: "1005"
    }
    data {
      key: "timeout_ms"
      value: "10000"
    }
  }
}

RecognitionFlags

Input message containing boolean recognition parameters. Included in RecognitionParameters. The default is false in all cases.

Recognition flags
Field Type Description
auto_punctuate bool Whether to add punctuation to the transcription of each utterance, if available for the language. Punctuation such as commas and periods (full stops) is applied based on the grammatical logic of the language, not on pauses in the audio. See Capitalization and punctuation.
filter_profanity bool Whether to mask known profanities as *** in the result, if available for the language.
include_tokenization bool Whether to include a tokenized recognition result.
stall_timers bool Whether to disable the no-input timer. By default, this timer starts when recognition begins. See Timers and timeouts.
discard_speaker_adaptation bool If speaker profiles are used, whether to discard updated speaker data. By default, data is stored.
suppress_call_recording bool Whether to redact transcription results in the call logs and disable audio capture. By default, transcription results, audio, and metadata are generated.
mask_load_failures bool When true, errors loading external resources are not reflected in the Status message and do not terminate recognition. They are still reflected in logs.
To set this flag for a specific resource (compiled wordsets only), use mask_load_failures in ResourceReference.
suppress_initial_capitalization bool When true, the first word in a transcribed utterance is not automatically capitalized. This option does not affect words that are capitalized by definition, such as proper nouns and place names. See Capitalization and punctuation.
allow_zero_base_lm_weight bool When true, custom resources (DLMs, wordsets, and others) can use the entire weight space, disabling the base LM contribution. By default, the base LM uses at least 10% of the weight space. Even when true, words from the base LM are still recognized, but with lower probability. See Resource weights.
filter_wakeup_word bool Whether to remove the wakeup word from the final result. This field is ignored in some situations. See Wakeup words.
send_multiple_start_of_speech bool When true, send a StartOfSpeech message for each detected utterance. By default, StartOfSpeech is sent for the first utterance only.

Recognition flags are set within recognition parameters:

RecognitionInitMessage(
    parameters = RecognitionParameters(
        language = 'en-US',
        topic = 'GEN',
        audio_format = AudioFormat(pcm=PCM(sample_rate_hz=wf.getframerate())),
        result_type = 'PARTIAL',
        utterance_detection_mode = 'MULTIPLE',
        recognition_flags = RecognitionFlags(
            auto_punctuate = True,
            filter_profanity = True,
            suppress_initial_capitalization = True,
            allow_zero_base_lm_weight = True,
            filter_wakeup_word = True,
            send_multiple_start_of_speech = True
        )
    )
)

Capitalization and punctuation

When suppress_initial_capitalization is true, the first word in each utterance is left uncapitalized unless it’s a proper noun. For example:

final: it's Monday morning and the sun is shining
final: I'm getting ready to walk to the train and commute into work.
final: I'll catch the 757 train from Cedar Park station
final: it will take me an hour to get into town

When auto_punctuate is true, the utterance contains punctuation such as commas and periods, based on the grammar rules of the language. With this feature, the utterance may appear to contain multiple sentences. You’ll notice this effect when utterance_detection_mode is DISABLED, meaning the audio stream is treated as one utterance.

final: There is more snow coming to the Montreal area in the next few days. We're 
expecting 10 cm overnight and the winds are blowing hard. Our radar and satellite 
pictures show that we're on the western edge of the storm system as it continues 
to track further to the east.

When auto_punctuate and suppress_initial_capitalization are both true, the first word in the utterance is not capitalized, but the first word after each sentence-ending punctuation is capitalized as normal. For example:

final: there is more snow coming to the Montreal area in the next few days. We're 
expecting 10 cm overnight and the winds are blowing hard. Our radar and satellite 
pictures show that we're on the western edge of the storm system as it continues 
to track further to the east.
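
A parameter combination along these lines produces output like the example above (a sketch; the sample rate is illustrative):

RecognitionInitMessage(
    parameters = RecognitionParameters(
        language = 'en-US',
        topic = 'GEN',
        audio_format = AudioFormat(pcm=PCM(sample_rate_hz=16000)),
        utterance_detection_mode = 'DISABLED',
        recognition_flags = RecognitionFlags(
            auto_punctuate = True,
            suppress_initial_capitalization = True
        )
    )
)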

Formatting

Input message specifying how the results are presented, using keywords for formatting types and options supported by the data pack. Included in RecognitionParameters. See Formatted text.

Formatting
Field Type Description
scheme string Keyword for a formatting type defined in the data pack.
options map<string, bool> Map of key, value pairs of formatting options and values defined in the data pack.

This example includes a formatting scheme (date) and several formatting options.

RecognitionInitMessage(
    parameters = RecognitionParameters(
        language = 'en-US',
        topic = 'GEN',
        audio_format = AudioFormat(pcm=PCM(sample_rate_hz=wf.getframerate())),
        result_type = 'IMMUTABLE_PARTIAL',
        utterance_detection_mode = 'MULTIPLE',
        formatting = Formatting(
            scheme = 'date',
            options = {
                'abbreviate_titles': True,
                'abbreviate_units': False,
                'censor_profanities': True,
                'censor_full_words': True
            }
        )
    )
)

ControlMessage

Input message that starts the recognition no-input timer. Included in RecognitionRequest. This message is effective only if timers were disabled in the recognition request. See Timers and timeouts.

Control message
Field Type Description
start_timers_message StartTimersControlMessage Starts the recognition no-input timer.

StartTimersControlMessage

Input message the client sends when starting the no-input timer. Included in ControlMessage.
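
For example, a client that disabled the no-input timer with the stall_timers recognition flag can start the timer later, perhaps once a prompt finishes playing, by sending a control message in the input stream. A minimal sketch:

def client_stream(wf):
    init = RecognitionInitMessage(
        parameters = RecognitionParameters(
            language = 'en-US',
            audio_format = AudioFormat(pcm=PCM(sample_rate_hz=16000)),
            recognition_flags = RecognitionFlags(stall_timers = True)
        )
    )
    yield RecognitionRequest(recognition_init_message = init)

    # Start the no-input timer when the application is ready for user input
    yield RecognitionRequest(
        control_message = ControlMessage(start_timers_message = StartTimersControlMessage())
    )

    # ... then stream the audio as in the RecognitionRequest example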

RecognitionResource

Input message defining one or more recognition resources (domain LMs, wordsets, etc.) to improve recognition. Included in RecognitionInitMessage.

Recognition resource
Field Type Description
One of:    
   external_reference ResourceReference The resource is an external file. Mandatory for DLMs, compiled wordsets, speaker profiles, and settings files.
   inline_wordset string Inline wordset JSON resource. Default empty, meaning no inline wordset. See Wordsets for the format.
   builtin string Name of a builtin resource in the data pack. Default empty, meaning no builtins. See Builtins.
   inline_grammar string Inline grammar, SRGS XML format. Default empty, meaning no inline grammar. For Nuance internal use only.
   wakeup_word WakeupWord List of wakeup words. See Wakeup words.
One of:   Weight applies to DLMs, builtins, and wordsets. As dictated by gRPC rules, if both weight_enum and weight_value are provided, weight_value takes precedence. See Resource weights.
   weight_enum EnumWeight Keyword for weight relative to data pack. If DEFAULT_WEIGHT or not supplied, defaults to MEDIUM (0.25) for DLMs and builtins, to LOW (0.1) for wordsets.
   weight_value float Weight relative to data pack as value from 0 to 1. If 0.0 or not supplied, defaults to 0.25 (MEDIUM) for DLMs and builtins, to 0.1 (LOW) for wordsets.
reuse EnumResourceReuse Whether the resource will be used multiple times. Default LOW_REUSE.

The RecognitionResource message includes:

RecognitionRequest
  recognition_init_message (RecognitionInitMessage)
    parameters (RecognitionParameters)
    resources (RecognitionResource)
      external_reference (ResourceReference)
      inline_wordset
      builtin
      inline_grammar
      wakeup_word
      weight_enum (EnumWeight): LOWEST to HIGHEST | weight_value
      reuse (EnumResourceReuse): LOW_REUSE|HIGH_REUSE

This RecognitionResource example includes a DLM, two wordsets, and wakeup words:

# Declare a DLM (names-places is the context tag)
travel_dlm = RecognitionResource(
    external_reference = ResourceReference(
        type = 'DOMAIN_LM',
        uri = 'urn:nuance-mix:tag:model/names-places/mix.asr?=language=eng-USA'
    ),
    weight_value = 0.5
)

# Define an inline wordset for an entity in that DLM
places_wordset = RecognitionResource(
    inline_wordset = '{"PLACES":[{"literal":"La Jolla","spoken":["la hoya"]},{"literal":"Llanfairpwllgwyngyll","spoken":["lan vire pool guin gill"]},{"literal":"Abington Pigotts"},{"literal":"Steeple Morden"},{"literal":"Hoyland Common"},{"literal":"Cogenhoe","spoken":["cook no"]},{"literal":"Fordoun","spoken":["forden"]},{"literal":"Llangollen","spoken":["lan-goth-lin","lhan-goth-luhn"]},{"literal":"Auchenblae"}]}',
    weight_value = 0.25
)

# Declare an existing compiled wordset
places_compiled_ws = RecognitionResource(
    external_reference = ResourceReference(
        type = 'COMPILED_WORDSET',
        uri = 'urn:nuance-mix:tag:wordset:lang/names-places/places-compiled-ws/eng-USA/mix.asr',
        mask_load_failures = True
    ),
    weight_value = 0.25
)

# Define wakeup words
wakeups = RecognitionResource(
    wakeup_word = WakeupWord(
        words = ["Hello Nuance", "Hey Nuance"]
    )
)

# Include resources in RecognitionInitMessage
def client_stream(wf):
    try:
        init = RecognitionInitMessage(
            parameters = RecognitionParameters(. . .),
            resources = [travel_dlm, places_wordset, places_compiled_ws, wakeups]
        )

ResourceReference

Input message for fetching an external DLM or settings file that exists in your Mix project, or for creating or updating a speaker profile. Included in RecognitionResource. See Domain LMs and Speaker profiles.

Resource reference
Field Type Description
type EnumResourceType Resource type. Default UNDEFINED_RESOURCE_TYPE.
uri string Location of the resource as a URN reference. See URN format.
mask_load_failures bool Applies to compiled wordsets only. When true, errors loading the wordset are not reflected in the Status message and do not terminate recognition. They are still reflected in logs. To apply this flag to all resources, use mask_load_failures in RecognitionFlags.
request_timeout_ms uint32 Time to wait when downloading resources. Default (0) means server default, usually 10000 ms or 10 seconds.
headers map<string, string> Map of HTTP cache-control directives, including max-age, max-stale, min-fresh, etc. For example, in Python:
headers = {'cache-control': 'max-age=604800, max-stale=3600'}

This example includes several external references:

# Declare a DLM (names-places is the context tag)
travel_dlm = RecognitionResource(
    external_reference = ResourceReference(
        type = 'DOMAIN_LM',
        uri = 'urn:nuance-mix:tag:model/names-places/mix.asr?=language=eng-USA'
    ),
    weight_value = 0.5
)

# Declare a compiled wordset
places_compiled_ws = RecognitionResource(
    external_reference = ResourceReference(
        type = 'COMPILED_WORDSET',
        uri = 'urn:nuance-mix:tag:wordset:lang/names-places/places-compiled-ws/eng-USA/mix.asr',
        mask_load_failures = True
    ),
    weight_value = 0.25
)

# Declare a settings file
settings = RecognitionResource(
    external_reference = ResourceReference(
        type = 'SETTINGS',
        uri = 'urn:nuance-mix:tag:settings/names-places/asr'
    )
)

# Declare a speaker profile (no URI)
speaker_profile = RecognitionResource(
    external_reference = ResourceReference(
        type = 'SPEAKER_PROFILE'
    )
)

# Include selected resources in recognition
def client_stream(wf):
    try:
        init = RecognitionInitMessage(
            parameters = RecognitionParameters(. . .),
            resources = [travel_dlm, places_compiled_ws, settings, speaker_profile]
        )

WakeupWord

One or more words or phrases that activate the application. Included in RecognitionResource. See the related parameter, RecognitionFlags: filter_wakeup_word. See also Wakeup words.

Wakeup word
Field Type Description
words repeated string One or more wakeup words.

This defines wakeup words for the application and removes them from final results:

# Define wakeup words
wakeups = RecognitionResource(
    wakeup_word = WakeupWord(
        words = ["Hi Dragon", "Hey Dragon", "Yo Dragon"] )
)
# Add wakeups to resource list, filter in final results
def client_stream(wf):
    try:
        init = RecognitionInitMessage(
            parameters = RecognitionParameters(
                ...
                recognition_flags = RecognitionFlags(
                    filter_wakeup_word = True)
            ),
            resources = [travel_dlm, places_wordset, wakeups]
        )

EnumResourceType

Input field defining the content type of an external recognition resource. Included in ResourceReference.

Resource type
Name Number Description
UNDEFINED_RESOURCE_TYPE 0 Resource type is not specified. Client must always specify a type.
WORDSET 1 Resource is a plain-text JSON wordset. Not currently supported, although inline_wordset is supported.
COMPILED_WORDSET 2 Resource is a compiled wordset. See Compiled wordsets.
DOMAIN_LM 3 Resource is a domain LM. See Domain LMs.
SPEAKER_PROFILE 4 Resource is a speaker profile in a datastore. See Speaker profiles.
GRAMMAR 5 Resource is an SRGS XML file. Not currently supported.
SETTINGS 6 Resource is ASR settings metadata, including the desired data pack version.

EnumWeight

Input field setting the weight of the resource relative to the data pack, as a keyword. Included in RecognitionResource. See weight_value to specify a numeric value. See Resource weights.

Weight
Name Number Description
DEFAULT_WEIGHT 0 Same effect as MEDIUM for DLMs and builtins, LOW for wordsets.
LOWEST 1 The resource has minimal influence on the recognition process, equivalent to weight_value 0.05.
LOW 2 The resource has noticeable influence, equivalent to weight_value 0.1.
MEDIUM 3 The resource has roughly an equal effect compared to the data pack, equivalent to weight_value 0.25.
HIGH 4 Words from the resource may be favored over words from the data pack, equivalent to weight_value 0.5.
HIGHEST 5 The resource has the greatest influence on the recognition, equivalent to weight_value 0.9.
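
For example, this declares the DLM from the earlier examples with a keyword weight instead of a numeric value; HIGH is equivalent to weight_value 0.5:

travel_dlm = RecognitionResource(
    external_reference = ResourceReference(
        type = 'DOMAIN_LM',
        uri = 'urn:nuance-mix:tag:model/names-places/mix.asr?=language=eng-USA'
    ),
    weight_enum = 'HIGH'
)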

EnumResourceReuse

Input field specifying whether the domain LM or wordset will be used for one or many recognition turns. Included in RecognitionResource.

Resource reuse
Name Number Description
UNDEFINED_REUSE 0 Not specified: currently defaults to LOW_REUSE.
LOW_REUSE 1 The resource will be used for only one recognition turn.
HIGH_REUSE 5 The resource will be used for a sequence of recognition turns.
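
For example, a DLM intended for a sequence of recognition turns in the same session can be declared with HIGH_REUSE (a sketch based on the earlier DLM declaration):

travel_dlm = RecognitionResource(
    external_reference = ResourceReference(
        type = 'DOMAIN_LM',
        uri = 'urn:nuance-mix:tag:model/names-places/mix.asr?=language=eng-USA'
    ),
    reuse = 'HIGH_REUSE'
)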

RecognitionResponse

Output stream of messages in response to a recognize request. Included in Recognize method.

Recognition response
Field Type Description
status Status Always the first message returned, indicating whether recognition was initiated successfully.
start_of_speech StartOfSpeech When speech was detected.
result Result The partial or final recognition result. A series of partial results may precede the final result.

The response contains all possible fields of information about the recognized audio, and your application may choose to print all or only some fields. The sample application prints only the status and the best hypothesis of each utterance; other examples also include the data pack version and some DSP information.

Your application may instead print all fields with (in Python) a simple print(message). In this case, the results contain the status and start-of-speech information, followed by the result itself: overall information about the utterance, then several hypotheses of the utterance and its words, including confidence scores.

The response depends on two recognition parameters: result_type, which specifies how much of ASRaaS’s internal processing is reflected in the results, and utterance_detection_mode, which determines whether to process all utterances in the audio or just the first one.

The RecognitionResponse message includes:

RecognitionResponse
  status (Status)
    code
    message
    details
  start_of_speech (StartOfSpeech)
    first_audio_to_start_of_speech_ms
  result (Result)
    result_type (EnumResultType): FINAL|PARTIAL|NOTIFICATIONS
    abs_start_ms
    abs_end_ms
    utterance_info (UtteranceInfo)
      duration_ms
      clipping_duration_ms
      dropped_speech_packets
      dropped_nonspeech_packets
      dsp (Dsp)
        digital signal processing results
    hypotheses (Hypothesis)
      confidence
      average_confidence
      rejected
      formatted_text
      minimally_formatted_text
      words (Word)
        text
        confidence
        start_ms
        end_ms
        silence_after_word_ms
        grammar_rule
      encrypted_tokenization
      grammar_id
      detected_wakeup_word
      tokenization (Token)
    data_pack (DataPack)
      language
      topic
      version
      id

This RecognitionResponse example prints selected fields from the results returned from ASRaaS:

try:
    # Iterate through messages returned from server
    for message in stream_in:
        if message.HasField('status'):
            if message.status.details:
                 print(f'{message.status.code} {message.status.message} - {message.status.details}')
            else:
                 print(f'{message.status.code} {message.status.message}')
        elif message.HasField('result'):
            restype = 'partial' if message.result.result_type else 'final'
            print(f'{restype}: {message.result.hypotheses[0].formatted_text}')

This prints all available fields from the message returned from ASRaaS:

try:
    # Iterate through messages returned from server, returning all information
    for message in stream_in:
        print(message)

For examples of the response, see Results.

Status

Output message indicating the status of the job. Included in RecognitionResponse.

See Status codes for details about the codes. The message and details are developer-facing error messages in English. User-facing messages should be localized by the client based on the status code.

Status
Field Type Description
code uint32 HTTP-style return code: 100, 200, 4xx, or 5xx as appropriate.
message string Brief description of the status.
details string Longer description if available.
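
Because message and details are developer-facing English strings, a client might key its own localized messages on the code. A minimal sketch (the user-facing messages are illustrative, not part of the API):

USER_MESSAGES = {
    100: 'Recognition started.',
    200: 'Transcription complete.',
    400: 'Something went wrong with the request.',
}

for message in stream_in:
    if message.HasField('status'):
        # Fall back to the developer-facing message for unmapped codes
        print(USER_MESSAGES.get(message.status.code, message.status.message))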

This example references and formats status messages:

try:
    # Iterate through messages returned from server
    for message in stream_in:
        if message.HasField('status'):
            if message.status.details:
                 print(f'{message.status.code} {message.status.message} - {message.status.details}')
            else:
                 print(f'{message.status.code} {message.status.message}')

The output in this client is:

stream ../audio/weather16.wav
100 Continue - recognition started on audio/l16;rate=16000 stream
final: There is more snow coming to the Montreal area in the next few days
final: We're expecting 10 cm overnight and the winds are blowing hard
final: Our radar and satellite pictures show that we're on the western edge of the storm system as it continues to track further to the east
stream complete
200 Success

The full results look like this:

stream ../audio/weather16.wav
status {
  code: 100
  message: "Continue"
  details: "recognition started on audio/l16;rate=16000 stream"
}
*** Results in here ***
stream complete
status {
  code: 200
  message: "Success"
}

StartOfSpeech

Output message containing the start-of-speech message. Included in RecognitionResponse.

Start of speech
Field Type Description
first_audio_to_start_of_speech_ms uint32 Offset from start of audio stream to start of speech detected.

By default, a start-of-speech message is sent only for the first utterance in the audio stream. The message is sent for each utterance when the send_multiple_start_of_speech flag is set in RecognitionFlags.

The behavior is the same for all result types: FINAL, IMMUTABLE_PARTIAL, and PARTIAL. The start-of-speech message is sent as soon as speech is detected for the current utterance, before any results.

This example sets utterance detection to MULTIPLE and includes the recognition flag that enables multiple start of speech messages:

init = RecognitionInitMessage(
    parameters = RecognitionParameters(
        language = 'en-us',
        topic = 'GEN',
        audio_format = AudioFormat(pcm=PCM(sample_rate_hz=wf.getframerate())),
        result_type = 'FINAL',
        utterance_detection_mode = 'MULTIPLE',
        recognition_flags = RecognitionFlags(
            filter_wakeup_word = True,
            suppress_call_recording = True,
            send_multiple_start_of_speech = True
        ),

The output shows the start of speech information for each utterance:

start_of_speech {
  first_audio_to_start_of_speech_ms: 880
}

result {
  abs_start_ms: 880
  abs_end_ms: 5410
  utterance_info {...}
  hypotheses {
    confidence: 0.414000004529953
    average_confidence: 0.8600000143051147
    formatted_text: "There is more snow coming to the Montreal area in the next few days"
    ...
}

start_of_speech {
  first_audio_to_start_of_speech_ms: 5410
}

result {
  abs_start_ms: 5410
  abs_end_ms: 10280
  utterance_info {...}
  hypotheses {
    confidence: 0.1420000046491623
    average_confidence: 0.8790000081062317
    formatted_text: "We\'re expecting 10 cm overnight and the winds are blowing hard"
    ...
}

start_of_speech {
  first_audio_to_start_of_speech_ms: 10280
}

result {
  abs_start_ms: 10280
  abs_end_ms: 18520
  utterance_info {...}
  hypotheses {
    confidence: 0.008999999612569809
    average_confidence: 0.8730000257492065
    formatted_text: "Our radar and satellite pictures show that we\'re on the western edge of the storm system as it continues to track further to the east"
    ...
}

Result

Output message containing the result, including the result type, the start and end times, metadata about the job, and one or more recognition hypotheses. Included in RecognitionResponse. For examples, see Results.

Result
Field Type Description
result_type EnumResultType Whether final results, partial results, or notifications are returned.
abs_start_ms uint32 Start time of the audio segment that generated this result. Offset, in milliseconds, from the beginning of the audio stream.
abs_end_ms uint32 End time of the audio segment that generated this result. Offset, in milliseconds, from the beginning of the audio stream.
utterance_info UtteranceInfo Information about each utterance.
hypotheses repeated Hypothesis One or more recognition variations.
data_pack DataPack Data pack information.
notifications repeated Notification List of errors or warnings that don’t trigger run-time errors, if any.

This prints only a few fields from the result: the status and the formatted text of the best hypothesis:

try:
    # Iterate through messages returned from server
    for message in stream_in:
        if message.HasField('status'):
            ...
        elif message.HasField('result'):
            restype = 'partial' if message.result.result_type else 'final'
            print(f'{restype}: {message.result.hypotheses[0].formatted_text}')

This prints all fields:

try:
    # Iterate through messages returned from server
    for message in stream_in:
        print(message)

See Results and Formatted text for examples of output in different formats. For other examples, see Dsp, Hypothesis, and DataPack.

UtteranceInfo

Output message containing information about the transcribed utterance in the result. Included in Result.

Utterance information
Field Type Description
duration_ms uint32 Utterance duration in milliseconds.
clipping_duration_ms uint32 Milliseconds of clipping detected.
dropped_speech_packets uint32 Number of speech audio buffers discarded during processing.
dropped_nonspeech_packets uint32 Number of non-speech audio buffers discarded during processing.
dsp Dsp Digital signal processing results.
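
For example, a client can report the utterance duration and dropped packet counts with each result (a sketch following the pattern of the other examples):

for message in stream_in:
    if message.HasField('result'):
        info = message.result.utterance_info
        print(f'duration: {info.duration_ms} ms, '
              f'dropped speech packets: {info.dropped_speech_packets}, '
              f'dropped non-speech packets: {info.dropped_nonspeech_packets}')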

Dsp

Output message containing digital signal processing results. Included in UtteranceInfo.

Digital signal processing
Field Type Description
snr_estimate_db float The estimated speech-to-noise ratio.
level float Estimated speech signal level.
num_channels uint32 Number of channels. Default is 1, meaning mono audio.
initial_silence_ms uint32 Milliseconds of silence observed before start of utterance.
initial_energy float Energy feature value of first speech frame.
final_energy float Energy feature value of last speech frame.
mean_energy float Average energy feature value of utterance.

This Dsp example includes the speech signal level:

try:
    # Iterate through messages returned from server
    for message in stream_in:
        if message.HasField('status'):
            if message.status.details:
                 print(f'{message.status.code} {message.status.message} - {message.status.details}')
            else:
                 print(f'{message.status.code} {message.status.message}')
        elif message.HasField('result'):
            restype = 'partial' if message.result.result_type else 'final'
            print(f'{restype}: {message.result.hypotheses[0].formatted_text}')
            print(f'Speech signal level: {message.result.utterance_info.dsp.level} SNR: {message.result.utterance_info.dsp.snr_estimate_db}')

This output shows the speech signal level and speech-to-noise ratio for each utterance:

stream ../audio/weather16.wav
100 Continue - recognition started on audio/l16;rate=16000 stream
final: There is more snow coming to the Montreal area in the next few days
Speech signal level: 20993.0 SNR: 15.0
final: We're expecting 10 cm overnight and the winds are blowing hard
Speech signal level: 18433.0 SNR: 15.0
final: Radar and satellite pictures show that we're on the western edge of the storm system as it continues to track further to the east
Speech signal level: 21505.0 SNR: 14.0
stream complete
200 Success

Hypothesis

Output message containing one or more proposed transcripts of the audio stream. Included in Result. Each variation has its own confidence level along with the text in two levels of formatting. See Formatted text.

Hypothesis
Field Type Description
confidence float The confidence score for the entire result, 0 to 1.
average_confidence float The confidence score for the hypothesis, 0 to 1: the average of all word confidence scores based on their duration.
rejected bool Whether the hypothesis was rejected or accepted.
  • True: The hypothesis was rejected.
  • False: The hypothesis was accepted.
The recognizer determines rejection based on an internal algorithm. If the audio input cannot be assigned to a sequence of tokens with sufficiently high probability, it is rejected. Recognition can be improved with domain LMs, wordsets, and builtins. The rejected field is returned for final results only, not for partial results.
formatted_text string Formatted text of the result, for example, $500. Formatting is controlled by formatting schemes and options. See Formatted text.
minimally_formatted_text string Slightly formatted text of the result, for example, Five hundred dollars. Words are spelled out, but basic capitalization and punctuation are included. See the formatting scheme, all_as_words.
words repeated Word One or more recognized words in the result.
encrypted_tokenization string Nuance-internal representation of the recognition result. Not returned when result originates from a grammar. Activated by RecognitionFlags: include_tokenization.
grammar_id string Identifier of the matching grammar, as grammar_0, grammar_1, etc. representing the order the grammars were provided as resources. Returned when result originates from an SRGS grammar rather than generic dictation.
detected_wakeup_word string The detected wakeup word when using a wakeup word resource in RecognitionResource. See Wakeup words.
tokenization repeated Token Nuance-internal representation of the recognition result in plain form. Not used in Krypton ASR v4.

This hypothesis example includes formatted_text, confidence, and whether the utterance was rejected (False means it was accepted):

try:
    # Iterate through messages returned from server
    for message in stream_in:
        if message.HasField('status'):
            if message.status.details:
                 print(f'{message.status.code} {message.status.message} - {message.status.details}')
            else:
                 print(f'{message.status.code} {message.status.message}')
        elif message.HasField('result'):
            restype = 'partial' if message.result.result_type else 'final'
            print(f'{restype}: {message.result.hypotheses[0].formatted_text}')
            print(f'Average confidence: {message.result.hypotheses[0].average_confidence} Rejected? {message.result.hypotheses[0].rejected}')

This output shows the formatted text lines, including abbreviations such as “10 cm”:

stream ../audio/weather16.wav
100 Continue - recognition started on audio/l16;rate=16000 stream
final: There is more snow coming to the Montreal area in the next few days
Average confidence: 0.4129999876022339 Rejected? False
final: We're expecting 10 cm overnight and the winds are blowing hard
Average confidence: 0.7960000038146973 Rejected? False
final: Radar and satellite pictures show that we're on the western edge of the storm system as it continues to track further to the east
Average confidence: 0.6150000095367432 Rejected? False
stream complete
200 Success

For full results, see Results: All fields.

Word

Output message containing one or more recognized words in the hypothesis, including the text, confidence score, and timing information. Included in Hypothesis.

Word
Field Type Description
text string The recognized word.
confidence float The confidence score of the recognized word, 0 to 1.
start_ms uint32 Start time of the word. Offset, in milliseconds, from the beginning of the current audio segment (abs_start_ms).
end_ms uint32 End time of the word. Offset, in milliseconds, from the beginning of the current audio segment (abs_start_ms).
silence_after_word_ms uint32 The amount of silence, in ms, detected after the word.
grammar_rule string The grammar rule that recognized the word text. Returned when result originates from an SRGS grammar rather than generic dictation.
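
For example, a client can walk the word sequence of the best hypothesis and print per-word timing and confidence (a sketch following the pattern of the other examples):

for message in stream_in:
    if message.HasField('result'):
        for word in message.result.hypotheses[0].words:
            print(f'{word.text}: confidence {word.confidence}, '
                  f'{word.start_ms}-{word.end_ms} ms')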

This output example shows the words in an utterance about paying a credit card:

  hypotheses {
    confidence: 0.8019999861717224
    average_confidence: 0.9599999785423279
    formatted_text: "Pay my Visa card from my checking account tomorrow for $10"
    minimally_formatted_text: "Pay my Visa card from my checking account tomorrow for ten dollars"
    words {
      text: "Pay"
      confidence: 0.8579999804496765
      start_ms: 2180
      end_ms: 2300
    }
    words {
      text: "my"
      confidence: 0.9610000252723694
      start_ms: 2300
      end_ms: 2600
    }
    words {
      text: "Visa"
      confidence: 0.9490000009536743
      start_ms: 2600
      end_ms: 3040
    }
    words {
      text: "card"
      confidence: 0.9679999947547913
      start_ms: 3040
      end_ms: 3440
    }
    words {
      text: "from"
      confidence: 0.9649999737739563
      start_ms: 3440
      end_ms: 3720
    }
    words {
      text: "my"
      confidence: 0.9869999885559082
      start_ms: 3720
      end_ms: 3840
    }
    words {
      text: "checking"
      confidence: 0.9900000095367432
      start_ms: 3840
      end_ms: 4320
    }
    words {
      text: "account"
      confidence: 0.9739999771118164
      start_ms: 4320
      end_ms: 4780
      silence_after_word_ms: 120
    }
    words {
      text: "tomorrow"
      confidence: 0.9509999752044678
      start_ms: 4900
      end_ms: 5320
    }
    words {
      text: "for"
      confidence: 0.8980000019073486
      start_ms: 5320
      end_ms: 5600
    }
    words {
      text: "$10"
      confidence: 0.9700000286102295
      start_ms: 5600
      end_ms: 6360
    }
  }

Token

Output message containing plain text tokenization information. Included in Hypothesis. Not used in Krypton ASR v4.

DataPack

Output message containing information about the current data pack. Included in Result.

Data pack
Field Type Description
language string Language of the data pack.
topic string Topic of the data pack.
version string Version of the data pack.
id string Identifier string of the data pack, including nightly update information if a nightly build was loaded.

This DataPack example uses the dp_displayed flag to include the data pack language and version:

try:
    # Iterate through messages returned from server
    dp_displayed = False
    for message in stream_in:
        if message.HasField('status'):
            if message.status.details:
                 print(f'{message.status.code} {message.status.message} - {message.status.details}')
            else:
                 print(f'{message.status.code} {message.status.message}')
        elif message.HasField('result'):
            restype = 'partial' if message.result.result_type else 'final'
            if restype == 'final' and not dp_displayed:
                print(f'Data pack: {message.result.data_pack.language} {message.result.data_pack.version}')
                dp_displayed = True
            print(f'{restype}: {message.result.hypotheses[0].formatted_text}')

The output includes the language and version of the data pack:

stream ../audio/monday_morning_16.wav
100 Continue - recognition started on audio/l16;rate=16000 stream
Data pack: eng-USA 4.2.0
final: It's Monday morning and the sun is shining
final: I'm getting ready to walk to the train and and commute into work
final: I'll catch the 757 train from Cedar Park station
final: It will take me an hour to get into town
stream complete
200 Success

In the full response, the data pack information is returned within each result. See Results: All fields.

result {
  abs_start_ms: 1020
  abs_end_ms: 4330
  utterance_info {
    duration_ms: 3310
    dsp {
      snr_estimate_db: 39
      level: 9217
      num_channels: 1
      initial_silence_ms: 220
      initial_energy: -51.201
      final_energy: -76.437
      mean_energy: 146.309
    }
  }
  hypotheses {
    confidence: 0.788
    average_confidence: 0.967
    formatted_text: "It\'s Monday morning and the sun is shining."
    minimally_formatted_text: "It\'s Monday morning and the sun is shining."
    words {
    *** Individual words in the hypothesis ***
    }
  }

  data_pack {
    language: "eng-USA"
    topic: "GEN"
    version: "4.12.1"
    id: "GMT20231026205000"
  }
}

Notification

Output message containing a notification structure. Notifications can be warnings or information about the recognition process or alerts about an upcoming shutdown. Included in Result.

See Notifications in results.

Notification
Field Type Description
code uint32 Notification unique code.
severity EnumSeverityType Severity of the notification.
message nuance.rpc.LocalizedMessage The notification message in the local language.
data map<string, string> Map of additional key, value pairs related to the notification.

In this example, a notification warning is returned because the DLM and its associated compiled wordset were created using different data pack versions:

result: {
  result_type: PARTIAL
  abs_start_ms: 160
  abs_end_ms: 3510
  hypotheses: [ 
  *** Hypotheses here *** 
  ],
  data_pack: {
    language: "eng-USA"
    topic: "GEN"
    version: "4.11.1"
    id: "GMT20230830154712"
  }
  notifications: [
    {
      code: 1002
      severity: SEVERITY_WARNING
      message: {
        locale: "en-US"
        message: "Wordset-pkg should be recompiled."
        message_resource_id: "1002"
      }
      data: {
        application/x-nuance-wordset-pkg: "urn:nuance-mix:tag:wordset:lang/names-places/places-compiled-ws/eng-USA/mix.asr"
        application/x-nuance-domainlm: "urn:nuance-mix:tag:model/names-places/mix.asr?=language=eng-usa"
      }
    }
  ]
}

EnumSeverityType

Output field specifying a notification’s severity. Included in Notification.

Severity type
Name Number Description
SEVERITY_UNKNOWN 0 The notification has an unknown severity. Default.
SEVERITY_ERROR 10 The notification is an error message.
SEVERITY_WARNING 20 The notification is a warning message.
SEVERITY_INFO 30 The notification is an information message.

Scalar value types

The data types in the proto files are mapped to equivalent types in the generated client stub files.

Scalar data types
Proto Notes C++ Java Python
double double double float
float float float float
int32 Uses variable-length encoding. Inefficient for encoding negative numbers. If your field is likely to have negative values, use sint32 instead. int32 int int
int64 Uses variable-length encoding. Inefficient for encoding negative numbers. If your field is likely to have negative values, use sint64 instead. int64 long int/long
uint32 Uses variable-length encoding. uint32 int int/long
uint64 Uses variable-length encoding. uint64 long int/long
sint32 Uses variable-length encoding. Signed int value. These encode negative numbers more efficiently than regular int32s. int32 int int
sint64 Uses variable-length encoding. Signed int value. These encode negative numbers more efficiently than regular int64s. int64 long int/long
fixed32 Always four bytes. More efficient than uint32 if values are often greater than 2^28. uint32 int int
fixed64 Always eight bytes. More efficient than uint64 if values are often greater than 2^56. uint64 long int/long
sfixed32 Always four bytes. int32 int int
sfixed64 Always eight bytes. int64 long int/long
bool bool boolean boolean
string A string must always contain UTF-8 encoded or 7-bit ASCII text. string String str/unicode
bytes May contain any arbitrary sequence of bytes. string ByteString str