Sample recognizer client

This topic describes how to set up and run a simple Python application that uses NRaaS to recognize spoken numbers.

The sample Python app, nr-speech-client.py, recognizes US English spoken numbers from an audio file using a built-in recognition grammar. It also shows how to use external and inline grammars, but these are commented out in the part of the code that builds recognition_init. You can run the app from the shell script run-nr-speech-client.sh, which generates a token and then runs the app. Pass the script the name of an audio file.

Prerequisites

To run this client, you need:

  • Python 3.6 or later.
  • The generated Python stub files from gRPC setup.
  • Your client ID and secret from Prerequisites from Mix.
  • The files downloaded and extracted from sample-nr-speech-client.zip.
  • An audio file in μ-law 8 kHz format. Use the audio files in sample-nr-speech-client or provide your own. Stereo audio files are not supported.

Build the app

Download the zip file containing the sample Python client and extract it. For convenience, the zip file contains a copy of the Nuance Recognizer proto file mentioned in gRPC setup.

The zip file contains the following files:

  • nr-speech-client.py: Sample Python client.
  • run-nr-speech-client.sh: Convenience script to acquire a token and run the sample client.
  • generate-protobuf-and-grpc-python.sh: Convenience script to generate the Python stubs from the proto file.
  • fruits.grxml: Example GRXML grammar, which could be loaded as a URI grammar if hosted on a web server.
  • README.txt: Brief information on how to run the sample client.
  • nuance/nrc/v1/nrc.proto: The gRPC proto file for the Nuance Recognizer service.
  • audio/: Sample μ-law audio files that can be used with the client.

The script generate-protobuf-and-grpc-python.sh can be used to generate the Python stubs from the proto file. These are the resulting client files, above the nuance directory holding the proto file and its corresponding Python stubs:

├── README.txt
├── generate-protobuf-and-grpc-python.sh
├── run-nr-speech-client.sh
├── nr-speech-client.py
├── fruits.grxml
├── audio
│   └── ...
└── nuance
    └── nrc
        └── v1
            ├── nrc.proto
            ├── nrc_pb2.py
            └── nrc_pb2_grpc.py
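
The contents of generate-protobuf-and-grpc-python.sh are not reproduced in this topic. As a rough sketch of what a stub-generation step of this kind typically does, the following Python assumes the grpcio-tools package is installed and calls its protoc entry point. The helper names protoc_args and generate_stubs are illustrative and are not part of the sample client.

```python
def protoc_args(proto_file, root="."):
    """Build the argument list for grpc_tools.protoc.

    Stubs are written under `root`, mirroring the proto file's package path,
    so nuance/nrc/v1/nrc.proto yields nuance/nrc/v1/nrc_pb2.py and
    nuance/nrc/v1/nrc_pb2_grpc.py.
    """
    return [
        "grpc_tools.protoc",           # argv[0] placeholder expected by protoc.main
        f"--proto_path={root}",
        f"--python_out={root}",
        f"--grpc_python_out={root}",
        proto_file,
    ]


def generate_stubs(proto_file="nuance/nrc/v1/nrc.proto"):
    """Run the protoc compiler bundled with grpcio-tools; returns its exit code."""
    # Imported here so that protoc_args() can be used without grpcio-tools.
    from grpc_tools import protoc
    return protoc.main(protoc_args(proto_file))
```

Calling generate_stubs() from the directory that contains the nuance folder should produce the two stub files shown in the tree above, provided grpcio-tools is available.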

You can use this client to request speech recognition, optionally including recognition resources such as inline or URI grammars for recognizing specific words or sentences.

Edit shell script

First, edit the sample shell script or batch file to add your Mix client ID and secret. The script also changes the colons in the client ID to %3A so curl can parse the value correctly.

#!/bin/bash

CLIENT_ID="<ENTER_YOUR_MIX_CLIENT_ID_HERE>"
SECRET="<ENTER_YOUR_MIX_CLIENT_SECRET_HERE>"

# URL encode the client id by converting ':' characters to '%3A'
CLIENT_ID=${CLIENT_ID//:/%3A}

AUTHURL="https://auth.crt.nuance.com/oauth2/token"

# Acquire token from server, extract token string from the json response.
export TOKEN="$(curl -s -u "$CLIENT_ID:$SECRET" "$AUTHURL" \
  -d "grant_type=client_credentials" \
  -d "scope=nr" \
  | python -c 'import sys, json; print(json.load(sys.stdin)["access_token"])'
  )"

# Run the client. Pass the token and audio file to recognize as command line arguments.
./nr-speech-client.py nr.api.nuance.com:443 "$TOKEN" "$1"

Alternatively, you might incorporate the token-generation code within the client, reading the credentials from a configuration file.
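
One way to do this is sketched below, using only the Python standard library. It mirrors the curl command in the shell script: the same token URL, the same %3A encoding of the client ID, and the same form fields. The configuration file layout (a [mix] section with client_id and secret keys) and the function names are assumptions made for illustration; they are not part of the sample client.

```python
import base64
import configparser
import json
import urllib.parse
import urllib.request

AUTH_URL = "https://auth.crt.nuance.com/oauth2/token"  # same URL as the shell script


def load_credentials(path):
    """Read client_id and secret from a hypothetical [mix] section of an INI file."""
    config = configparser.ConfigParser(interpolation=None)  # keep '%' in secrets literal
    if not config.read(path):
        raise FileNotFoundError(path)
    return config["mix"]["client_id"], config["mix"]["secret"]


def build_token_request(client_id, secret, scope="nr"):
    """Build the POST request that the curl command sends, without sending it."""
    # Convert ':' to '%3A' in the client ID, as the shell script does.
    encoded_id = client_id.replace(":", "%3A")
    basic = base64.b64encode(f"{encoded_id}:{secret}".encode("utf-8")).decode("ascii")
    body = urllib.parse.urlencode(
        {"grant_type": "client_credentials", "scope": scope}
    ).encode("ascii")
    return urllib.request.Request(
        AUTH_URL,
        data=body,
        headers={"Authorization": f"Basic {basic}"},
        method="POST",
    )


def fetch_token(client_id, secret):
    """Send the request and return the access_token field of the JSON response."""
    with urllib.request.urlopen(build_token_request(client_id, secret)) as response:
        return json.load(response)["access_token"]
```

With this in place, the client could call fetch_token(*load_credentials("credentials.ini")) at startup instead of reading the token from the command line. The file name credentials.ini is only an example.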

This client accepts arguments positionally, without names:

server_address = sys.argv[1]
access_token = sys.argv[2]
audio_file = sys.argv[3]

Pass these arguments as you run the client using the shell script:

  • The URI of the Nuance Recognizer region and language group. For the United States region and North America language group: nr.api.nuance.com:443.

  • An access token generated by the Mix OAuth server, usually held in an environment variable ($TOKEN in this example).

  • The path to a μ-law encoded audio file to be recognized.
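
Because the arguments are positional, a missing or misordered value only shows up later as a connection or recognition error. The following sketch shows one way to check the three arguments up front. The parse_args function and its error messages are hypothetical additions, not code from the sample client.

```python
import os


def parse_args(argv):
    """Return (server_address, access_token, audio_file) from positional arguments."""
    program = argv[0] if argv else "nr-speech-client.py"
    if len(argv) != 4:
        raise SystemExit(f"usage: {program} <server:port> <token> <audio-file>")
    server_address, access_token, audio_file = argv[1:4]
    if ":" not in server_address:
        raise SystemExit("server address needs a port, for example nr.api.nuance.com:443")
    if not access_token:
        raise SystemExit("access token is empty; check that token generation succeeded")
    if not os.path.isfile(audio_file):
        raise SystemExit(f"audio file not found: {audio_file}")
    return server_address, access_token, audio_file
```

A client would typically call parse_args(sys.argv) once at startup. The empty-token check is useful with the shell script above, where a failed token request can leave $TOKEN empty.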

Run the recognition client

The client accepts an audio file and passes it to Nuance Recognizer for recognition. To simulate recognition from streamed audio input, the client sends the content of the audio file in 160-byte packets (20 milliseconds of audio per packet), sleeping 20 ms between packets.
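
The packet arithmetic follows from the audio format: μ-law at 8 kHz uses one byte per sample, so 20 ms of audio is 8000 × 0.020 = 160 bytes. The generator below is a minimal sketch of this pacing, not the sample client's own code. The name audio_packets and the injectable sleep parameter are illustrative choices.

```python
import time

PACKET_BYTES = 160       # 8000 samples/s * 1 byte/sample * 0.020 s
PACKET_SECONDS = 0.020   # pause between packets to approximate real-time streaming


def audio_packets(audio, packet_bytes=PACKET_BYTES, delay=PACKET_SECONDS, sleep=time.sleep):
    """Yield successive packets of raw mu-law audio, pausing between packets."""
    for offset in range(0, len(audio), packet_bytes):
        if offset:
            sleep(delay)  # no pause before the first packet
        # The final slice may be shorter than packet_bytes.
        yield audio[offset:offset + packet_bytes]
```

A file of 22,574 bytes, for example, yields 141 full packets followed by a final packet of 14 bytes. Passing a no-op function as sleep makes the generator easy to exercise without waiting.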

Run the client from the shell script, passing it an audio file to recognize. The client loads a built-in grammar for recognizing digits and also an inline grammar that recognizes color names (red, blue, green, yellow, orange, black, and white).
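
An inline grammar of this kind is a short SRGS XML document with one public rule listing the allowed words. As an illustration, the helper below assembles such a string from a word list. The function name build_word_grammar is hypothetical, and how the string is attached to the recognition request depends on the generated stubs, which this sketch does not cover.

```python
from xml.sax.saxutils import escape, quoteattr


def build_word_grammar(rule_id, words, language="en-US"):
    """Return an SRGS XML string whose root rule matches any one of the given words."""
    # Escape each word so characters such as '&' or '<' cannot break the XML.
    items = "".join(f"<item>{escape(word)}</item>" for word in words)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<grammar xmlns="http://www.w3.org/2001/06/grammar"'
        f" xml:lang={quoteattr(language)}"
        f' version="1.0" root={quoteattr(rule_id)}>'
        f'<rule id={quoteattr(rule_id)} scope="public">'
        f"<one-of>{items}</one-of>"
        "</rule></grammar>"
    )
```

Called as build_word_grammar("colors", ["red", "blue", "green", "yellow", "orange", "black", "white"]), this produces a grammar with the same structure as the one shown in the recognition_init output in the scenarios below.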

Scenario 1: Recognize numbers

This scenario recognizes the sentence “zero one two three four”:

$ ./run-nr-speech-client.sh audio/01234.ulaw
Sending recognition_init {
  parameters {
    audio_format {
      ulaw {
      }
    }
    no_input_timeout_ms: 2000
    confidence_level: 0.4000000059604645
  }
  resources {
    builtin: "builtin:grammar/digits"
    language: "en-US"
    weight: 1
  }
  resources {
    inline_grammar {
      media_type: APPLICATION_SRGS_XML
      grammar: "<?xml version=\"1.0\" encoding=\"UTF-8\"?><grammar xmlns=\"http://www.w3.org/2001/06/grammar\" xml:lang=\"en-US\" version=\"1.0\" root=\"colors\"><rule id=\"colors\" scope=\"public\"><one-of><item>red</item><item>blue</item><item>green</item><item>yellow</item><item>orange</item><item>black</item><item>white</item></one-of></rule></grammar>"
    }
    language: "en-US"
    weight: 1
  }
}

Sending audio packet 1 length 160
Sending audio packet 2 length 160

Recognize() reponse --> status {
  code: 200
  message: "OK"
}

Sending audio packet 3 length 160
Sending audio packet 4 length 160
Sending audio packet 5 length 160
Sending audio packet 6 length 160
Sending audio packet 7 length 160
Sending audio packet 8 length 160
Sending audio packet 9 length 160
Sending audio packet 10 length 160

Recognize() reponse --> start_of_speech {
}

Sending audio packet 11 length 160
Sending audio packet 12 length 160
. . .
Sending audio packet 141 length 160
Sending audio packet 142 length 14
DONE Sending audio

Recognize() reponse --> end_of_speech {
  first_audio_to_end_of_speech_ms: 2821
}


Recognize() reponse --> result {
  formatted_text: "<result><interpretation conf=\"0.91\"><text mode=\"voice\">zero one two three four</text><instance grammar=\"builtin:grammar/digits\"><SWI_meaning>01234</SWI_meaning><MEANING conf=\"0.91\">01234</MEANING><SWI_literal>zero one two three four</SWI_literal><SWI_grammarName>builtin:grammar/digits</SWI_grammarName></instance></interpretation><interpretation conf=\"0.49\"><text mode=\"voice\">zero oh one two three four</text><instance grammar=\"builtin:grammar/digits\"><SWI_meaning>001234</SWI_meaning><MEANING conf=\"0.49\">001234</MEANING><SWI_literal>zero oh one two three four</SWI_literal><SWI_grammarName>builtin:grammar/digits</SWI_grammarName></instance></interpretation></result>"
  status: "SUCCESS"
}

In this example, the recognizer returned two potential interpretations in the result. The first one, “zero one two three four,” has the higher confidence value of 0.91.

<result>
  <interpretation conf="0.91">
    <text mode="voice">zero one two three four</text>
    <instance grammar="builtin:grammar/digits">
      <SWI_meaning>01234</SWI_meaning>
      <MEANING conf="0.91">01234</MEANING>
      <SWI_literal>zero one two three four</SWI_literal>
      <SWI_grammarName>builtin:grammar/digits</SWI_grammarName>
    </instance>
  </interpretation>
  <interpretation conf="0.49">
    <text mode="voice">zero oh one two three four</text>
    <instance grammar="builtin:grammar/digits">
      <SWI_meaning>001234</SWI_meaning>
      <MEANING conf="0.49">001234</MEANING>
      <SWI_literal>zero oh one two three four</SWI_literal>
      <SWI_grammarName>builtin:grammar/digits</SWI_grammarName>
    </instance>
  </interpretation>
</result>
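
An application usually needs only the top interpretation. Assuming the formatted_text field holds a result document shaped like the one above, a sketch using the standard xml.etree.ElementTree module could select it as follows. The best_interpretation function is illustrative and not part of the sample client.

```python
import xml.etree.ElementTree as ET


def best_interpretation(formatted_text):
    """Return (confidence, text, meaning) for the highest-confidence interpretation.

    Returns None when the result contains no interpretation elements.
    """
    root = ET.fromstring(formatted_text)
    interpretations = list(root.iter("interpretation"))
    if not interpretations:
        return None
    best = max(interpretations, key=lambda node: float(node.get("conf", "0")))
    return (
        float(best.get("conf", "0")),
        best.findtext("text", default=""),
        best.findtext("instance/SWI_meaning"),  # None if the element is absent
    )
```

For the result above, this returns the confidence 0.91, the text “zero one two three four”, and the meaning 01234.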

Scenario 2: Recognize a fruit

This scenario recognizes the word “orange”:

$ ./run-nr-speech-client.sh audio/orange.ulaw
Sending recognition_init {
  parameters {
    audio_format {
      ulaw {
      }
    }
    no_input_timeout_ms: 2000
    confidence_level: 0.4000000059604645
  }
  resources {
    builtin: "builtin:grammar/digits"
    language: "en-US"
    weight: 1
  }
  resources {
    inline_grammar {
      media_type: APPLICATION_SRGS_XML
      grammar: "<?xml version=\"1.0\" encoding=\"UTF-8\"?><grammar xmlns=\"http://www.w3.org/2001/06/grammar\" xml:lang=\"en-US\" version=\"1.0\" root=\"colors\"><rule id=\"colors\" scope=\"public\"><one-of><item>red</item><item>blue</item><item>green</item><item>yellow</item><item>orange</item><item>black</item><item>white</item></one-of></rule></grammar>"
    }
    language: "en-US"
    weight: 1
  }
}

Sending audio packet 1 length 160
Sending audio packet 2 length 160

Recognize() reponse --> status {
  code: 200
  message: "OK"
}

Sending audio packet 3 length 160
Sending audio packet 4 length 160
Sending audio packet 5 length 160
Sending audio packet 6 length 160
Sending audio packet 7 length 160
Sending audio packet 8 length 160
Sending audio packet 9 length 160

Recognize() reponse --> start_of_speech {
  first_audio_to_start_of_speech_ms: 20
}

Sending audio packet 10 length 160
Sending audio packet 11 length 160
...
Sending audio packet 48 length 160
Sending audio packet 49 length 99
DONE Sending audio

Recognize() reponse --> end_of_speech {
  first_audio_to_end_of_speech_ms: 972
}


Recognize() reponse --> result {
  formatted_text: "<result><interpretation conf=\"0.88\"><text mode=\"voice\">orange</text><instance grammar=\"3935729581237448155\"><SWI_literal>orange</SWI_literal><SWI_grammarName>3935729581237448155</SWI_grammarName><SWI_meaning>{SWI_literal:orange}</SWI_meaning></instance></interpretation><interpretation conf=\"0.04\"><text mode=\"voice\">four</text><instance grammar=\"builtin:grammar/digits\"><SWI_meaning>4</SWI_meaning><MEANING conf=\"0.04\">4</MEANING><SWI_literal>four</SWI_literal><SWI_grammarName>builtin:grammar/digits</SWI_grammarName></instance></interpretation></result>"
  status: "SUCCESS"
}

In this example, the recognizer returned two potential interpretations in the result. The first one, “orange,” has the higher confidence value of 0.88.

<result>
  <interpretation conf="0.88">
    <text mode="voice">orange</text>
    <instance grammar="3935729581237448155">
      <SWI_literal>orange</SWI_literal>
      <SWI_grammarName>3935729581237448155</SWI_grammarName>
      <SWI_meaning>{SWI_literal:orange}</SWI_meaning>
    </instance>
  </interpretation>
  <interpretation conf="0.04">
    <text mode="voice">four</text>
    <instance grammar="builtin:grammar/digits">
      <SWI_meaning>4</SWI_meaning>
      <MEANING conf="0.04">4</MEANING>
      <SWI_literal>four</SWI_literal>
      <SWI_grammarName>builtin:grammar/digits</SWI_grammarName>
    </instance>
  </interpretation>
</result>
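
When several grammars are active, the grammar attribute of each instance element shows which grammar produced a given interpretation. Here the digits interpretation is labeled builtin:grammar/digits, while “orange” carries a numeric identifier. The sketch below groups interpretations by that attribute, assuming the same result layout as above. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def interpretations_by_grammar(formatted_text):
    """Map each grammar identifier to a list of (literal, confidence) pairs."""
    matches = {}
    for interpretation in ET.fromstring(formatted_text).iter("interpretation"):
        confidence = float(interpretation.get("conf", "0"))
        for instance in interpretation.iter("instance"):
            grammar = instance.get("grammar", "")
            literal = instance.findtext("SWI_literal", default="")
            matches.setdefault(grammar, []).append((literal, confidence))
    return matches
```

Applied to the result above, the mapping has two keys: the numeric identifier with the entry for “orange” at 0.88, and builtin:grammar/digits with the entry for “four” at 0.04.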