Sample recognizer client

This topic describes how to set up and run a simple Python application that uses NRaaS to recognize spoken numbers.

The sample Python app nr-speech-client.py recognizes US English spoken numbers from an audio file using a built-in recognition grammar. It also shows how to use inline and URI grammars; the URI grammar is commented out in recognition_init. You can run the app from the shell script, which generates a token and then runs the app. Pass the script the name of an audio file.

Prerequisites

To run this client, you need:

Python 3.6 or later.
The generated Python stub files from gRPC setup.
Your client ID and secret from Prerequisites from Mix.
The files from sample-nr-speech-client.zip, downloaded and extracted.
An audio file in μ-law 8 kHz format. Use the audio files in sample-nr-speech-client or provide your own. Stereo audio files are not supported.

Build the app

Download the zip file containing the sample Python client and extract it. For convenience, the zip file contains a copy of the Nuance Recognizer proto file mentioned in gRPC setup.

The zip file contains the following files:

nr-speech-client.py: Sample Python client.
run-nr-speech-client.sh: Convenience script to acquire a token and run the sample client.
generate-protobuf-and-grpc-python.sh: Convenience script to generate the Python stubs from the proto file.
fruits.grxml: Example GRXML grammar that could be loaded as a URI grammar if hosted on a web server.
README.txt: Brief information on how to run the sample client.
nuance/nrc/v1/nrc.proto: The gRPC proto file for the Nuance Recognizer service.
audio/: Sample μ-law audio files that can be used with the client.

Note:

The sample app contains a copy of the proto file. If you already have the proto file in your directory, you can overwrite the existing file.

View nr-speech-client.py

#!/usr/bin/env python3

import sys, grpc, re
from time import sleep

from nuance.nrc.v1.nrc_pb2 import *
from nuance.nrc.v1.nrc_pb2_grpc import *


def usage():
    print (f'''
Usage:
    {sys.argv[0]} <server_address> <access_token> <audio_file>

    server_address - address of the server with a port in the form server:port
      access_token - Mix access token string
        audio_file - file containing audio to recognize, ex. "audio/01234.ulaw"

Examples:
    $ export TOKEN=<access_token>
    $ {sys.argv[0]} nr.api.nuance.com:443 $TOKEN audio/01234.ulaw    // will recognize a "0-1-2-3-4" audio utterance.
    $ {sys.argv[0]} nr.api.nuance.com:443 $TOKEN audio/orange.ulaw   // will recognize the word "orange".

    // Note: By default the client expects a raw mu-law encoded audio file.
'''
    )

#----------------------------------------------------------------------------------------------------
# Builtin grammar
#----------------------------------------------------------------------------------------------------
builtin_digits_grammar = RecognitionResource(
    builtin = "builtin:grammar/digits",
    language="en-US",
    weight=1
)

#----------------------------------------------------------------------------------------------------
# Inline grammar
#----------------------------------------------------------------------------------------------------
inline_grammar_str = '''<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar" xml:lang="en-US" version="1.0" root="colors">
    <rule id="colors" scope="public">
        <one-of>
            <item>red</item>
            <item>blue</item>
            <item>green</item>
            <item>yellow</item>
            <item>orange</item>
            <item>black</item>
            <item>white</item>
        </one-of>
    </rule>
</grammar>
'''

# The language should correspond to the language of the grammar.
inline_colors_grammar = RecognitionResource(
    inline_grammar = InlineGrammar(
        # For a more compact string, remove newlines and spaces between xml tags.
        grammar = bytes(re.sub(r'>\s*', '>', inline_grammar_str), 'utf-8'),
        media_type=1
    ),
    language="en-US",
    weight=1
)

#----------------------------------------------------------------------------------------------------
# URI grammar
#----------------------------------------------------------------------------------------------------
# This example uri grammar points to a file on a http server.
# The language should correspond to the language of the grammar.
uri_fruits_grammar = RecognitionResource(
    uri_grammar = UriGrammar(
        uri="http://myserver/mygrammars/fruits.grxml",
        media_type=1
    ),
    language="en-US",
    weight = 1
)


def recognition_init():
    init = RecognitionInit(
        parameters = RecognitionParameters(
            audio_format = AudioFormat(ulaw = ULaw()),  # default, other options are ALaw and PCM.
            no_input_timeout_ms = 2000,                 # default is 7000 milliseconds.
            confidence_level = 0.4                      # return a no-match if the confidence is less than 0.4.
        ),
        resources = [
            builtin_digits_grammar,
            inline_colors_grammar,
            #uri_fruits_grammar
        ]
    )
    return init


def client_stream(audio_file):
    # Start the recognition
    init = recognition_init()

    # Log init message.
    initstr = re.sub('^|\n', '\n  ', repr(init)) # indent init message for better readability
    initstr = re.sub(r'\s*$', '', initstr)       # remove extra spaces at end of string
    print("Sending recognition_init {" + initstr + "\n}\n")

    yield RecognitionRequest(recognition_init = init)

    # Send audio in packets of 160 bytes (for mu-law: 1 byte = 1 audio sample).
    # 160 bytes of mu-law audio at 8kHz = 20 milliseconds of audio
    packet_size = 160
    packet_duration = 0.020

    packet = 0 # To log how many packets are sent.
    try :
        with open(audio_file, "rb") as f:
            while True:
                data = f.read(packet_size)
                if not data:
                    break
                else:
                    packet += 1
                    print ("Sending audio packet " + repr(packet) + " length " + repr(len(data)))
                    yield RecognitionRequest(audio = bytes(data))
                    # Simulate audio streaming by sleeping for the duration of the audio packet.
                    # Note that this is just for this demo, it is not required.
                    sleep(packet_duration)
    except Exception as e:
        print("File read exception: " + repr(e))

    print("DONE Sending audio")


def recognize(server_address, access_token, audio_file):
    call_credentials = grpc.access_token_call_credentials(access_token)
    ssl_credentials = grpc.ssl_channel_credentials()
    channel_credentials = grpc.composite_channel_credentials(ssl_credentials, call_credentials)

    with grpc.secure_channel(server_address, credentials=channel_credentials) as channel:

        nrc = NRCStub(channel)

        response_iterator = nrc.Recognize(client_stream(audio_file))
        try:
            for response in response_iterator:
                print("\nRecognize() reponse --> " + repr(response))
        except Exception as e:
            print('Server stream exception: ' + repr(e))


if __name__ == '__main__':
    server_address = access_token = audio_file = None
    try:
        server_address = sys.argv[1]
        access_token = sys.argv[2]
        audio_file = sys.argv[3]
    except IndexError:
        # Fewer than three arguments were given.
        usage()
        sys.exit(1)
    recognize(server_address, access_token, audio_file)

The script generate-protobuf-and-grpc-python.sh can be used to generate the Python stubs from the proto file. After you generate the stubs, the client files sit alongside the nuance directory, which holds the proto file and its corresponding Python stubs:

├── README.txt
├── generate-protobuf-and-grpc-python.sh
├── run-nr-speech-client.sh
├── nr-speech-client.py
├── fruits.grxml
├── audio
│   └── ...
└── nuance
    └── nrc
        └── v1
            ├── nrc.proto
            ├── nrc_pb2.py
            └── nrc_pb2_grpc.py

You can use this client to request speech recognition, optionally including recognition resources such as inline or URI grammars for recognizing specific words or sentences.

Edit shell script

First, edit the sample shell script or batch file to add your Mix client ID and secret. The script also changes the colons in the client ID to %3A so curl can parse the value correctly.

#!/bin/bash

CLIENT_ID="<ENTER_YOUR_MIX_CLIENT_ID_HERE>"
SECRET="<ENTER_YOUR_MIX_CLIENT_SECRET_HERE>"

# URL encode the client id by converting ':' characters to '%3A'
CLIENT_ID=${CLIENT_ID//:/%3A}

AUTHURL="https://auth.crt.nuance.com/oauth2/token"

# Acquire token from server, extract token string from the json response.
export TOKEN="$(curl -s -u "$CLIENT_ID:$SECRET" "$AUTHURL" \
  -d "grant_type=client_credentials" \
  -d "scope=nr" \
  | python -c 'import sys, json; print(json.load(sys.stdin)["access_token"])'
  )"

# Run the client. Pass the token and audio file to recognize as command line arguments.
./nr-speech-client.py nr.api.nuance.com:443 "$TOKEN" "$1"

Alternatively, you might incorporate the token-generation code within the client, reading the credentials from a configuration file.
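
As a rough illustration of that alternative, the following sketch builds the same client-credentials request that the shell script sends with curl, using only the Python standard library. The configuration file layout (a [mix] section with client_id and secret keys) and the function names are assumptions made for this example; they are not part of the sample client.

```python
import base64
import configparser
import json
import urllib.parse
import urllib.request

# Same token endpoint as in the shell script.
AUTH_URL = "https://auth.crt.nuance.com/oauth2/token"


def load_credentials(path):
    """Read credentials from a hypothetical INI file with a [mix] section."""
    # Disable interpolation so '%' characters in a secret are read literally.
    config = configparser.ConfigParser(interpolation=None)
    config.read(path)
    return config["mix"]["client_id"], config["mix"]["secret"]


def build_token_request(client_id, secret, auth_url=AUTH_URL):
    """Build the client-credentials POST request without sending it."""
    # Convert ':' to '%3A' in the client ID, as the shell script does.
    user = client_id.replace(":", "%3A")
    basic = base64.b64encode(f"{user}:{secret}".encode("utf-8")).decode("ascii")
    body = urllib.parse.urlencode(
        {"grant_type": "client_credentials", "scope": "nr"}
    ).encode("ascii")
    return urllib.request.Request(
        auth_url,
        data=body,
        headers={"Authorization": f"Basic {basic}"},
        method="POST",
    )


def fetch_token(client_id, secret):
    """Send the request and return the access_token field of the JSON response."""
    with urllib.request.urlopen(build_token_request(client_id, secret)) as response:
        return json.load(response)["access_token"]
```

With helpers like these, the client could call fetch_token(*load_credentials("mix.ini")) and pass the result to recognize() instead of reading the token from the command line. The file name mix.ini is also hypothetical.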

This client accepts arguments positionally, without names:

server_address = sys.argv[1]
access_token = sys.argv[2]
audio_file = sys.argv[3]

Pass these arguments as you run the client using the shell script:

The URI of the Nuance Recognizer region and language group. For the United States region and North America language group: nr.api.nuance.com:443.
An access token generated by the Mix OAuth server, usually as an environment variable, in this example $TOKEN.
The path to a μ-law encoded audio file to be recognized.
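
If you adapt the client, one option is to replace the sys.argv lookups with argparse, which keeps the same three positional arguments but generates the usage message and error handling for you. This is a sketch of that variation, not the sample client's own code; the parse_args function name is an assumption.

```python
import argparse


def parse_args(argv=None):
    """Parse the same three positional arguments the sample client expects."""
    parser = argparse.ArgumentParser(
        description="Recognize a mu-law audio file with Nuance Recognizer."
    )
    parser.add_argument(
        "server_address", help="server and port, for example nr.api.nuance.com:443"
    )
    parser.add_argument("access_token", help="Mix access token string")
    parser.add_argument(
        "audio_file", help='audio to recognize, for example "audio/01234.ulaw"'
    )
    # argv=None makes argparse read sys.argv[1:], as the sample client does.
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    print(args.server_address, args.audio_file)
```

When an argument is missing, argparse prints a usage line and exits with a non-zero status, so no separate usage() function is needed.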

Run the recognition client

The client accepts an audio file and passes it to Nuance Recognizer for recognition. The client sends the content of the audio file in 160-byte packets (20 milliseconds of audio per packet), sleeping 20 ms after each packet to simulate recognition from streamed audio input.
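
The packet arithmetic can be checked with a short calculation. In 8 kHz μ-law audio, one byte is one sample, so the file size alone gives the number of packets and the audio duration. The helper below is an illustrative sketch (packet_plan is not part of the sample client), and the file sizes in the usage lines are inferred from the packet counts in the console output for the two scenarios on this page.

```python
SAMPLE_RATE_HZ = 8000  # mu-law at 8 kHz: 1 byte = 1 sample
PACKET_SIZE = 160      # bytes per packet, as in the sample client


def packet_plan(num_bytes, packet_size=PACKET_SIZE):
    """Return (packet count, bytes in a short final packet, duration in ms).

    The second value is 0 when the file size is an exact multiple of the
    packet size, meaning the final packet is a full one.
    """
    full_packets, remainder = divmod(num_bytes, packet_size)
    packets = full_packets + (1 if remainder else 0)
    duration_ms = num_bytes * 1000 / SAMPLE_RATE_HZ
    return packets, remainder, duration_ms


# 141 full packets plus a final 14-byte packet (inferred size of 01234.ulaw).
print(packet_plan(141 * 160 + 14))  # (142, 14, 2821.75)
# 48 full packets plus a final 99-byte packet (inferred size of orange.ulaw).
print(packet_plan(48 * 160 + 99))   # (49, 99, 972.375)
```

The computed durations are consistent with the first_audio_to_end_of_speech_ms values of 2821 and 972 reported in the two scenarios.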

Run the client from the shell script, passing it an audio file to recognize. The client loads a builtin grammar for recognizing digits and also an inline grammar that recognizes color names such as red, green, and blue.

Scenario 1: Recognize numbers

This scenario recognizes the sentence “zero one two three four”:

$ ./run-nr-speech-client.sh audio/01234.ulaw
Sending recognition_init {
  parameters {
    audio_format {
      ulaw {
      }
    }
    no_input_timeout_ms: 2000
    confidence_level: 0.4000000059604645
  }
  resources {
    builtin: "builtin:grammar/digits"
    language: "en-US"
    weight: 1
  }
  resources {
    inline_grammar {
      media_type: APPLICATION_SRGS_XML
      grammar: "<?xml version=\"1.0\" encoding=\"UTF-8\"?><grammar xmlns=\"http://www.w3.org/2001/06/grammar\" xml:lang=\"en-US\" version=\"1.0\" root=\"colors\"><rule id=\"colors\" scope=\"public\"><one-of><item>red</item><item>blue</item><item>green</item><item>yellow</item><item>orange</item><item>black</item><item>white</item></one-of></rule></grammar>"
    }
    language: "en-US"
    weight: 1
  }
}

Sending audio packet 1 length 160
Sending audio packet 2 length 160

Recognize() reponse --> status {
  code: 200
  message: "OK"
}

Sending audio packet 3 length 160
Sending audio packet 4 length 160
Sending audio packet 5 length 160
Sending audio packet 6 length 160
Sending audio packet 7 length 160
Sending audio packet 8 length 160
Sending audio packet 9 length 160
Sending audio packet 10 length 160

Recognize() reponse --> start_of_speech {
}

Sending audio packet 11 length 160
Sending audio packet 12 length 160
. . .
Sending audio packet 141 length 160
Sending audio packet 142 length 14
DONE Sending audio

Recognize() reponse --> end_of_speech {
  first_audio_to_end_of_speech_ms: 2821
}


Recognize() reponse --> result {
  formatted_text: "<result><interpretation conf=\"0.91\"><text mode=\"voice\">zero one two three four</text><instance grammar=\"builtin:grammar/digits\"><SWI_meaning>01234</SWI_meaning><MEANING conf=\"0.91\">01234</MEANING><SWI_literal>zero one two three four</SWI_literal><SWI_grammarName>builtin:grammar/digits</SWI_grammarName></instance></interpretation><interpretation conf=\"0.49\"><text mode=\"voice\">zero oh one two three four</text><instance grammar=\"builtin:grammar/digits\"><SWI_meaning>001234</SWI_meaning><MEANING conf=\"0.49\">001234</MEANING><SWI_literal>zero oh one two three four</SWI_literal><SWI_grammarName>builtin:grammar/digits</SWI_grammarName></instance></interpretation></result>"
  status: "SUCCESS"
}

In this example, the recognizer returned two potential interpretations in the result. The first one, “zero one two three four,” has the higher confidence value of 0.91.

<result>
  <interpretation conf="0.91">
    <text mode="voice">zero one two three four</text>
    <instance grammar="builtin:grammar/digits">
      <SWI_meaning>01234</SWI_meaning>
      <MEANING conf="0.91">01234</MEANING>
      <SWI_literal>zero one two three four</SWI_literal>
      <SWI_grammarName>builtin:grammar/digits</SWI_grammarName>
    </instance>
  </interpretation>
  <interpretation conf="0.49">
    <text mode="voice">zero oh one two three four</text>
    <instance grammar="builtin:grammar/digits">
      <SWI_meaning>001234</SWI_meaning>
      <MEANING conf="0.49">001234</MEANING>
      <SWI_literal>zero oh one two three four</SWI_literal>
      <SWI_grammarName>builtin:grammar/digits</SWI_grammarName>
    </instance>
  </interpretation>
</result>
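
An application would typically read these values programmatically rather than from the printed response. The following sketch parses the XML in formatted_text with the standard library and picks the interpretation with the highest confidence. The parse_interpretations function and the abbreviated SAMPLE_RESULT string are assumptions for illustration, and the sketch relies only on the elements visible in the result above.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the result above, keeping only the elements used here.
SAMPLE_RESULT = (
    '<result>'
    '<interpretation conf="0.91"><text mode="voice">zero one two three four</text>'
    '<instance grammar="builtin:grammar/digits"><SWI_meaning>01234</SWI_meaning>'
    '</instance></interpretation>'
    '<interpretation conf="0.49"><text mode="voice">zero oh one two three four</text>'
    '<instance grammar="builtin:grammar/digits"><SWI_meaning>001234</SWI_meaning>'
    '</instance></interpretation>'
    '</result>'
)


def parse_interpretations(formatted_text):
    """Return one dict per <interpretation> element in the result XML."""
    root = ET.fromstring(formatted_text)
    interpretations = []
    for interp in root.findall("interpretation"):
        instance = interp.find("instance")
        interpretations.append({
            "text": interp.findtext("text"),
            "confidence": float(interp.get("conf")),
            "grammar": instance.get("grammar") if instance is not None else None,
            "meaning": interp.findtext("instance/SWI_meaning"),
        })
    return interpretations


if __name__ == "__main__":
    best = max(parse_interpretations(SAMPLE_RESULT), key=lambda i: i["confidence"])
    print(best["text"], best["meaning"], best["confidence"])
```

In the client, the same function could be applied to response.result.formatted_text when a response carries a result.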

Scenario 2: Recognize a fruit

This scenario recognizes the word “orange”:

$ ./run-nr-speech-client.sh audio/orange.ulaw
Sending recognition_init {
  parameters {
    audio_format {
      ulaw {
      }
    }
    no_input_timeout_ms: 2000
    confidence_level: 0.4000000059604645
  }
  resources {
    builtin: "builtin:grammar/digits"
    language: "en-US"
    weight: 1
  }
  resources {
    inline_grammar {
      media_type: APPLICATION_SRGS_XML
      grammar: "<?xml version=\"1.0\" encoding=\"UTF-8\"?><grammar xmlns=\"http://www.w3.org/2001/06/grammar\" xml:lang=\"en-US\" version=\"1.0\" root=\"colors\"><rule id=\"colors\" scope=\"public\"><one-of><item>red</item><item>blue</item><item>green</item><item>yellow</item><item>orange</item><item>black</item><item>white</item></one-of></rule></grammar>"
    }
    language: "en-US"
    weight: 1
  }
}

Sending audio packet 1 length 160
Sending audio packet 2 length 160

Recognize() reponse --> status {
  code: 200
  message: "OK"
}

Sending audio packet 3 length 160
Sending audio packet 4 length 160
Sending audio packet 5 length 160
Sending audio packet 6 length 160
Sending audio packet 7 length 160
Sending audio packet 8 length 160
Sending audio packet 9 length 160

Recognize() reponse --> start_of_speech {
  first_audio_to_start_of_speech_ms: 20
}

Sending audio packet 10 length 160
Sending audio packet 11 length 160
...
Sending audio packet 48 length 160
Sending audio packet 49 length 99
DONE Sending audio

Recognize() reponse --> end_of_speech {
  first_audio_to_end_of_speech_ms: 972
}


Recognize() reponse --> result {
  formatted_text: "<result><interpretation conf=\"0.88\"><text mode=\"voice\">orange</text><instance grammar=\"3935729581237448155\"><SWI_literal>orange</SWI_literal><SWI_grammarName>3935729581237448155</SWI_grammarName><SWI_meaning>{SWI_literal:orange}</SWI_meaning></instance></interpretation><interpretation conf=\"0.04\"><text mode=\"voice\">four</text><instance grammar=\"builtin:grammar/digits\"><SWI_meaning>4</SWI_meaning><MEANING conf=\"0.04\">4</MEANING><SWI_literal>four</SWI_literal><SWI_grammarName>builtin:grammar/digits</SWI_grammarName></instance></interpretation></result>"
  status: "SUCCESS"
}

In this example, the recognizer returned two potential interpretations in the result. The first one, “orange,” has the higher confidence value of 0.88.

<result>
  <interpretation conf="0.88">
    <text mode="voice">orange</text>
    <instance grammar="3935729581237448155">
      <SWI_literal>orange</SWI_literal>
      <SWI_grammarName>3935729581237448155</SWI_grammarName>
      <SWI_meaning>{SWI_literal:orange}</SWI_meaning>
    </instance>
  </interpretation>
  <interpretation conf="0.04">
    <text mode="voice">four</text>
    <instance grammar="builtin:grammar/digits">
      <SWI_meaning>4</SWI_meaning>
      <MEANING conf="0.04">4</MEANING>
      <SWI_literal>four</SWI_literal>
      <SWI_grammarName>builtin:grammar/digits</SWI_grammarName>
    </instance>
  </interpretation>
</result>
