Wordsets

A wordset is a set of words and short phrases that customizes the vocabulary of a model by providing additional values. Like models, wordsets are declared as resources in InterpretRequest.

Defining wordsets

The wordset is defined in JSON format as one or more arrays. Each array is named after a dynamic list entity defined within a semantic model. Wordsets allow you to add values and literals to such entities at runtime.

For example, you might have an entity, CONTACTS, containing personal names, or CITY, with place names used by the application. The wordset adds to the existing terms in the entity, but applies only to the current session. The terms in the wordset are not added permanently to the entity.

All entities extended by a wordset must be defined in the semantic model, which is loaded and activated along with the wordset.

The wordset includes additional values for one or more entities. The syntax is:

{
  "entity-1": [
    {
      "canonical": "value",
      "literal": "written form"
    },
    {
      "canonical": "value",
      "literal": "written form"
    },
    ...
  ],
  "entity-n": [...]
}

Where:

Wordset fields
Field Type Description
entity String Name of a dynamic list entity defined in a model. The name is case-sensitive. Consult the model for entity names.
canonical String (Optional) Value of the entity returned by NLU interpretation. If not provided, the literal is used.
literal String A written form by which a user could realistically refer to the value.

For ASR (automatic speech recognition) purposes, wordsets can also include a "spoken" field to indicate pronunciations of the value. If you reuse the same wordset for both ASR and NLU, NLU ignores this field.
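As an illustration, a wordset entry carrying a "spoken" pronunciation hint might look like the sketch below. The pronunciation text itself is invented here; consult the ASR documentation for the exact "spoken" format.

```python
import json

# Sketch of a wordset entry that adds a "spoken" field for ASR.
# NLU ignores "spoken"; it only uses "canonical" and "literal".
wordset = {
    "PAYEE": [
        {
            "canonical": "SOCALGAS",
            "literal": "socal gas",
            "spoken": ["so cal gas"]  # pronunciation hint (example only)
        }
    ]
}

# Serialize compactly for use as an inline_wordset string
inline = json.dumps(wordset, separators=(",", ":"))
print(inline)
```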

You can provide the wordset either inline in the request or reference a compiled wordset using its URN in the Mix environment.

Inline wordsets

For inline wordsets, the contents are provided as a string as part of the request object. You can either include the string literal directly in the request or read the string in from a local file programmatically.

The example wordset below extends a PAYEE entity in the model with additional payees.

{
  "PAYEE" : [
    {
      "canonical" : "AMEX",
      "literal" : "amex"
    },
    {
      "canonical" : "AMEX",
      "literal" : "american express"
    },
    {
      "canonical" : "VISA",
      "literal" : "visa"
    },
    {
      "canonical" : "SOCALGAS",
      "literal" : "southern california gas"
    },
    {
      "canonical" : "SOCALGAS",
      "literal" : "southern california gas company"
    },
    {
      "canonical" : "SOCALGAS",
      "literal" : "the gas company"
    },
    {
      "canonical" : "SOCALGAS",
      "literal" : "socal gas"
    }
  ]
}

To use a source wordset, you can specify it as inline_wordset in InterpretationResource, with the contents specified in one of two ways.

First, you may include the JSON definition directly in the inline_wordset field, compressed (without spaces) and enclosed in single quotation marks, as shown in this example:

# Text to interpret
input = InterpretationInput(text = "I want to check the balance on my Amex")
# Define semantic model
semantic_model = ResourceReference(
        type = 'SEMANTIC_MODEL', 
        uri = 'urn:nuance-mix:tag:model/bank-app/mix.nlu?=language=eng-USA'
        )

# Define the wordset inline 
payees_wordset = InterpretationResource(
    inline_wordset = '{"PAYEE":[{"canonical": "AMEX","literal":"amex"},{"canonical":"AMEX","literal":"american express"},{"canonical":"VISA","literal":"visa"},{"canonical":"SOCALGAS","literal":"southern california gas"},{"canonical":"SOCALGAS","literal":"southern california gas company"},{"canonical":"SOCALGAS","literal":"the gas company"},{"canonical":"SOCALGAS","literal":"socal gas"}]}')

# Include the semantic model and wordset in InterpretRequest
interpret_request = InterpretRequest(
    parameters = InterpretationParameters(...),
    model = semantic_model,
    resources = [ payees_wordset ],
    input = input
)

Alternatively, you may store the source wordset in a local JSON file (here, payees-wordset.json) and read its contents into a string with a programming-language function, as shown in the second example.

# Text to interpret
input = InterpretationInput(text = "I want to check the balance on my Amex")
# Define semantic model
semantic_model = ResourceReference(
        type = 'SEMANTIC_MODEL', 
        uri = 'urn:nuance-mix:tag:model/bank-app/mix.nlu?=language=eng-USA'
        )

# Read wordset from local file 
payees_wordset_content = None
with open('payees-wordset.json', 'r') as f:
    payees_wordset_content = f.read()
payees_wordset = InterpretationResource(
    inline_wordset = payees_wordset_content)

# Include the semantic model and wordset in InterpretRequest
interpret_request = InterpretRequest(
    parameters = InterpretationParameters(...),
    model = semantic_model,
    resources = [ payees_wordset ],
    input = input
)
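Because a malformed inline wordset only surfaces as an error at request time, it can help to check the JSON locally before building the request. The helper below is a sketch (not part of the NLUaaS API) that verifies the basic structure described earlier: a top-level object keyed by entity name, each holding an array of entries with a required literal.

```python
import json

def validate_wordset(source: str) -> list:
    """Return a list of problems found in a wordset JSON string (empty if OK)."""
    problems = []
    try:
        data = json.loads(source)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e}"]
    if not isinstance(data, dict):
        return ["top level must be an object keyed by entity name"]
    for entity, values in data.items():
        if not isinstance(values, list):
            problems.append(f"{entity}: value must be an array")
            continue
        for i, item in enumerate(values):
            # "literal" is required; "canonical" is optional per the field table
            if "literal" not in item:
                problems.append(f"{entity}[{i}]: missing required 'literal'")
    return problems

print(validate_wordset('{"PAYEE":[{"canonical":"AMEX","literal":"amex"}]}'))  # []
```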

Compiled wordsets

Alternatively, you may reference a compiled wordset that was created with the Wordset API. To use a compiled wordset, specify it in ResourceReference as COMPILED_WORDSET and provide its URN in the Mix environment.

In the following example, a compiled wordset is referenced to extend a PAYEE entity in the model with a list of relevant payees.

# Text to interpret
input = InterpretationInput(text = "I want to check the balance on my Amex")
# Define semantic model as before
semantic_model = ResourceReference(
        type = 'SEMANTIC_MODEL', 
        uri = 'urn:nuance-mix:tag:model/bank-app/mix.nlu?=language=eng-USA'
        )

# Define a compiled wordset (here its context is the same as the semantic model)
payees_compiled_ws = InterpretationResource(
    external_reference = ResourceReference(
        type = 'COMPILED_WORDSET',
        uri = 'urn:nuance-mix:tag:wordset:lang/bank-app/payees-compiled-ws/eng-USA/mix.nlu')
)

# Include the semantic model and wordset in InterpretRequest
interpret_request = InterpretRequest(
    parameters = InterpretationParameters(...),
    model = semantic_model,
    resources = [ payees_compiled_ws ],
    input = input
)

Inline or compiled?

Wordsets can be brought in to aid interpretation in one of two ways, depending on the size of the wordset:

  • Small wordsets (fewer than 40 terms in an entity): You can include these inline along with each interpretation request at runtime. The wordset is compiled and applied as a resource.
  • Larger wordsets: You can compile these once ahead of time using the NLUaaS Wordset API. The compiled wordset is stored in Mix and can then be referenced and loaded as an external interpretation resource at runtime. This improves latency significantly for large wordsets.

If you are unsure which approach to take, start by testing the latency of your requests with inline wordsets.
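The size-based guidance above can be sketched as a simple helper. The 40-term threshold comes from the guideline above, but the right cutoff for your application should be confirmed by latency testing.

```python
import json

INLINE_TERM_LIMIT = 40  # guideline threshold; tune via latency testing

def should_compile(wordset_json: str) -> bool:
    """Return True if any entity has enough terms to favor a compiled wordset."""
    wordset = json.loads(wordset_json)
    return any(len(values) >= INLINE_TERM_LIMIT for values in wordset.values())
```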

Wordset URNs

To compile a wordset, the following need to be provided:

  • URN for the wordset. This is the location in URN format where the compiled wordset will be created. You will use this URN at runtime to reference the wordset as an interpretation resource.
  • URN for the companion NLU model. This is the model containing the entity that the wordset extends.
  • The wordset JSON source.

Wordsets can be either:

  • Application-level: For example, a list of names of employees in a company.
  • User-level: For example, a list of contact names for a particular user.

The URN for the wordset needs to have one of the following structures, depending on the level of the wordset:

  • Application-level wordset: urn:nuance-mix:tag:wordset:lang/contextTag/resourceName/lang/mix.nlu
  • User-level wordset: urn:nuance-mix:tag:wordset:lang/contextTag/resourceName/lang/mix.nlu?=user_id=userId

Where:

  • contextTag is an application context tag from Mix
  • resourceName is a name for the wordset
  • lang is the language and country code for which the wordset applies, in xxx-XXX format. For example, eng-USA.
  • userId is a unique identifier for the user
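Putting the pieces together, the two URN structures can be assembled with a small helper. This function is purely illustrative (it is not part of any Nuance client library); it simply interpolates the components into the formats shown above.

```python
def wordset_urn(context_tag: str, resource_name: str, lang: str,
                user_id: str = None) -> str:
    """Build an application-level or user-level wordset URN (illustrative helper)."""
    urn = f"urn:nuance-mix:tag:wordset:lang/{context_tag}/{resource_name}/{lang}/mix.nlu"
    if user_id is not None:
        # User-level wordsets append the user ID as a suffix
        urn += f"?=user_id={user_id}"
    return urn

print(wordset_urn("bank-app", "payees-compiled-ws", "eng-USA"))
# urn:nuance-mix:tag:wordset:lang/bank-app/payees-compiled-ws/eng-USA/mix.nlu
```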

Once the wordset is compiled, it is stored in Mix and can be referenced at runtime by a client application using the same model and wordset URNs.

Scope of compiled wordsets

Wordsets are specific to a Mix App ID and can be used by any Mix applications under the same App ID.

The context tag used for the wordset does not have to match the context tag of the companion model but it is good practice to do so for easier wordset management.

The wordset must be compatible with the companion model. The companion model needs to contain the dynamic list entity that the wordset relates to.

Compiled wordset lifecycle and considerations

After a wordset is compiled, it can later be updated.

Compiling a wordset using an existing wordset URN will replace the existing wordset with the newer version if:

  • The model URN is different
  • The wordset payload is different
  • The time to live (TTL) for the wordset is almost expired

Otherwise, the wordset compilation request may return a status ALREADY_EXISTS. In this case, the wordset remains usable at runtime.

A compiled wordset can only be accessed at runtime within the App ID under which it was compiled. It must be used with a compatible model, ideally the one used at compilation time.

Wordsets are available for 28 days after compilation, after which they will be deleted automatically and will need to be compiled again.

Wordsets can also be manually deleted if no longer needed. Once deleted, a wordset is completely removed and cannot be restored.

Runtime requests and wordset issues

If an NLUaaS runtime request references an incompatible or missing wordset, the request will still succeed, but a warning message will be included to indicate that the wordset was incompatible or not found. For more details, see Success with warnings.

Wordset metadata

Wordsets have associated metadata. Some metadata keys are available by default. Optionally, you can provide a list of custom metadata to associate with the compiled wordset. One or more metadata entries can be provided as key-value pairs along with CompileWordsetRequest.

Both default and custom metadata can be retrieved using the Wordset API.
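As a sketch, custom metadata can be assembled as simple key-value pairs before compilation. The keys below are purely illustrative examples, and the exact field through which they are attached to CompileWordsetRequest should be confirmed in the Wordset API reference.

```python
# Illustrative custom metadata for a compiled wordset; keys are examples only.
custom_metadata = {
    "app_version": "2.1.0",
    "department": "billing",
    "refresh_policy": "weekly",
}
```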