Creating a training set

This section describes how to construct a training set that can be used to build resources in Mix.

We’ll look at training set requirements for two cases:

  • Training sets for building a DLM
  • Training sets for building a DLM and NLU model

Training set for DLM purposes

A DLM simply needs to know the sorts of words that appear often in the domain and the sorts of sentence structures users will use. As a result, ontology definition can often be simpler than for an NLU model and sample annotations can often be correspondingly simpler too. However, there are still some steps that can be taken in building your training set to help create a more effective DLM.

Collecting data

Your DLM will be most effective when trained on sentences or phrases that match what users will say to your application. If an application is already deployed, we recommend you collect data from that application, transcribe it, and use it to build a DLM. This will give you the most accurate results. However, this is often not possible when building an application from scratch. In this case you will need to create some training data.

Here are different ways you could create training data for your DLM.

Data from existing products

The best data to build a DLM comes from an existing product. If you have deployed a version of the product (even to a limited test group), extract the customer portion of the conversation and use it to train a DLM. If the product is audio-based, transcribe the data before using it to train the DLM, otherwise ASR errors will be reinforced.

Suppose you have existing audio or text-based chat logs from a related product, possibly from human-to-human conversations rather than from an application. Extract the customer portion of those conversations, transcribing as needed, and use it to train the DLM.

Web or document text

Extract text from your company’s website containing product names, office locations, and other phrases that might be used in the application. Then train a DLM with that text. If your company has an FAQ document, extract the questions from the document and train a DLM with that text.
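
As a rough sketch, extracting FAQ questions can be as simple as filtering lines that end in a question mark. The `Q:` prefix and the heuristic itself are assumptions for illustration only, not part of Mix:

```python
def extract_questions(text: str) -> list[str]:
    """Pull question lines out of an FAQ document to seed DLM training text."""
    questions = []
    for line in text.splitlines():
        line = line.strip()
        if line.endswith("?"):
            # Drop a hypothetical "Q:" label if present.
            questions.append(line.removeprefix("Q:").strip())
    return questions

faq = """Q: How do I reset my router?
Follow the steps below.
Q: Why is my connection slow?"""

extract_questions(faq)
# → ["How do I reset my router?", "Why is my connection slow?"]
```

A real FAQ page would likely need HTML parsing and more robust question detection, but the idea is the same: keep only the user-facing phrasings.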

Crowdsource text

Internally or externally crowdsource example interactions with your application. Invent scenarios and write prompts asking participants to write down what they would ask in that scenario. For example, “Your internet stopped working and resetting your router did not fix the problem. You navigate to XYZ’s website and ask for help from their application. How would you ask the question?”

Start with some real or example interactions and create or crowdsource variations. For example, if you have the utterance “My internet doesn’t work,” you could come up with related variations such as, “My internet is not working,” “The internet is broken,” or “I can’t connect to the Internet.”

Generate text

Write a grammar and randomly sample or enumerate sentences from the grammar to train the DLM. Note that Mix can already do this with samples containing basic entities.
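
A minimal illustration of grammar-based generation, using a toy grammar for an internet-support domain. The grammar format here is invented for the example and is not Mix's sample syntax:

```python
import random

# Toy grammar: each non-terminal (in angle brackets) maps to a list of
# possible expansions; anything not in the table is a terminal word.
GRAMMAR = {
    "<request>": [["<verb>", "my", "<object>"],
                  ["please", "<verb>", "the", "<object>"]],
    "<verb>": [["reset"], ["restart"], ["check"]],
    "<object>": [["router"], ["modem"], ["connection"]],
}

def sample(symbol="<request>"):
    """Randomly expand a symbol into a flat list of words."""
    if symbol not in GRAMMAR:
        return [symbol]
    words = []
    for sym in random.choice(GRAMMAR[symbol]):
        words.extend(sample(sym))
    return words

# Draw 50 random sentences; duplicates collapse in the set.
sentences = {" ".join(sample()) for _ in range(50)}
```

For small grammars you can also enumerate every sentence exhaustively instead of sampling, which guarantees uniform coverage in the training text.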

Preparing data

DLM training data should be written with proper casing and formatting.

Mix automatic normalization

Mix will attempt to normalize your data as much as possible, but any help you can provide with good training text will benefit your DLM.

If you enter all-uppercase text, Mix will convert it to the proper case, but when the correct casing is ambiguous it will prefer the case given in the text. For example “subway” (place) vs. “Subway” (proper name) vs. “SUBWAY” (company).

Since punctuation is not spoken, Mix will remove it prior to DLM training and replace it with an end-of-sentence token.

Words are normalized to in-vocabulary tokens for consistency and accuracy purposes. For example “wifi” will be changed to “Wi-Fi” and “web site” will be changed to “website.” If you would like a specific word form recognized instead of the normalized form, you must add it to a wordset.
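
The normalizations described above might be sketched as follows. This is purely illustrative: Mix's actual normalization pipeline, vocabulary, and end-of-sentence token are internal to the product, so the `</s>` token and the word list here are assumptions:

```python
import re

# Assumed examples of in-vocabulary forms, per the text above.
NORMALIZED_FORMS = {"wifi": "Wi-Fi", "web site": "website"}
EOS_TOKEN = "</s>"  # hypothetical end-of-sentence token

def normalize(sample: str) -> str:
    text = sample
    # Map known variants to their in-vocabulary forms.
    for variant, canonical in NORMALIZED_FORMS.items():
        text = re.sub(rf"\b{re.escape(variant)}\b", canonical,
                      text, flags=re.IGNORECASE)
    # Punctuation is not spoken: replace sentence-final punctuation
    # with an end-of-sentence token and drop internal punctuation.
    text = re.sub(r"[.!?]+\s*$", f" {EOS_TOKEN}", text)
    text = re.sub(r"[,;:]", "", text)
    return re.sub(r"\s+", " ", text).strip()

normalize("My wifi stopped working.")
# → "My Wi-Fi stopped working </s>"
```

If your application needs a specific surface form (say, “wifi” rather than “Wi-Fi”), remember that you must add it to a wordset rather than rely on the training text.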

Numeric expressions

Formatted numeric expressions like “$120” have multiple possible representations as tokens. For example, “one hundred twenty dollars,” “one hundred and twenty dollars,” and “a hundred and twenty dollars” all mean the same thing as $120, and different people use different forms to say it out loud.

Individual samples with numeric expressions are not automatically expanded by Mix into all possible representations. When there are multiple equally good possibilities, one is chosen at random for the model to train on. However, if the training set contains many samples with numeric expressions, the various expansions tend to appear across the set as a whole, allowing your DLM to represent these possibilities. Alternatively, you can spell out numbers in the DLM training text, but then you will not get these variations unless you provide them explicitly as multiple similar entries, one per spelled-out variation.
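
To see why a single written amount maps to several spoken forms, the variants for $120 can be enumerated combinatorially. This is only an illustration of the ambiguity; it does not reproduce Mix's expansion logic:

```python
from itertools import product

def spoken_forms(hundreds_options, conjunction_options, tens_word, unit):
    """Enumerate spoken variants of an amount like $120 by combining
    alternative hundreds words ("one"/"a") with an optional "and"."""
    forms = []
    for h, c in product(hundreds_options, conjunction_options):
        # Empty strings (e.g. the omitted "and") are filtered out.
        forms.append(" ".join(w for w in (h, "hundred", c, tens_word, unit) if w))
    return forms

forms = spoken_forms(["one", "a"], ["", "and"], "twenty", "dollars")
# → ["one hundred twenty dollars", "one hundred and twenty dollars",
#    "a hundred twenty dollars", "a hundred and twenty dollars"]
```

Multiplied across every numeric expression in a training set, this combinatorial growth is why relying on many samples, rather than expanding each one exhaustively, is the practical way to cover the variants.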

Sample sentence examples

Here are some examples of good and bad training sample sentences.

Training sample dos and don’ts

  Good: My internet stopped working. Can you help?
  Bad:  MY INTERNET STOPPED WORKING. CAN YOU HELP?

  Good: Call Mrs. Jones in New York
  Bad:  call mrs. jones in new York

  Good: I’m looking for a 90-Watt PAR38 halogen dimmable flood light
  Bad:  i’m looking for a 90 WATT P A R thirty eight halogen dimmable flood light

Training set for NLU model and DLM

Training samples that are to be used to build NLU models generally involve more detailed annotations about the semantic intent of the training samples and any related entities. An important part of NLU model development is designing an ontology that fully represents the expected user intents in the domain and the entities users may use to specify these intents. This is necessary in NLU models to accurately understand what the user wants so that the system can interpret and fulfill that intent.

The details of designing ontologies and training sets for NLU models are beyond the scope of the present document. For more details, see the document on NLU modeling best practices.