Generating data and training the initial model

If you are using a new NLU model to automate an existing application, then real (production) user utterances should be available for that application, and this usage data should be leveraged to create the training data for the initial model. See Best practices around leveraging deployment usage data.

Otherwise, if the new NLU model is for a new application for which no usage data exists, then artificial data will need to be generated to train the initial model. The basic process for creating artificial training data is documented in Build your training set. See also Best practices around creating artificial data in this topic.

Training the initial model is documented in Train your model.

Best practices around leveraging deployment usage data

This section provides best practices around selecting training data from usage data.

Note that although this section focuses on selecting training data from usage data, the same best practices apply to other use cases for selecting data sets from usage data, such as selecting a test set for evaluating NLU accuracy.

Use uniform random sampling to select training data

Typically, the amount of usage data that is available will be larger than what is needed to train (and test) a model. Therefore, a training set is typically generated by sampling utterances from the usage data.

Different sampling methods could in principle be used. Nuance recommends using uniform random sampling to select a training set from usage data. Random sampling has the advantage of drawing samples from the head and tail of the distribution in proportions that mirror real-world usage: if an utterance appears more frequently in the usage data, then it will appear more frequently in the sampled training set. This results in the trained NLU model generally having the best accuracy on the most frequently encountered utterances.

The most obvious alternatives to uniform random sampling involve giving the tail of the distribution more weight in the training data. For example, selecting training data randomly from the list of unique usage data utterances will result in training data where commonly occurring usage data utterances are significantly underrepresented. This results in an NLU model with worse accuracy on the most frequent utterances. This is not desirable and, therefore, is not the recommended approach.
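
To make the recommendation concrete, here is a minimal Python sketch of uniform random sampling from a file of usage utterances. The file names, sample size, and seed are placeholder assumptions for illustration, not part of Mix.

    import random

    # Load raw usage utterances, one per line ("usage_data.txt" is a placeholder).
    with open("usage_data.txt", encoding="utf-8") as f:
        utterances = [line.strip() for line in f if line.strip()]

    # Sample uniformly over all occurrences, not over unique utterances, so that
    # frequent utterances remain proportionally represented. Sampling from
    # set(utterances) instead would underrepresent the head of the distribution.
    random.seed(42)  # fixed seed so the sample is reproducible
    sample_size = min(5000, len(utterances))
    training_sample = random.sample(utterances, sample_size)

    with open("training_sample.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(training_sample))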

Annotate data using Mix

Training data must first be annotated with the correct intents and entities in Mix.nlu. Mix has the ability to import a text file of unannotated utterances, and the Optimize tab provides a convenient UI for annotating both the intent and entities of utterances in a single view.

Before you begin to annotate, you should create an annotation guide that provides instructions on how to annotate: what types of utterances do and don’t belong to each intent, and what kinds of words do and don’t belong to each entity.

An annotation guide provides three concrete benefits:

  1. It acts as a place to document decisions that are made while annotating. For example, if you have a RELATIVE_LOCATION entity and you encounter an utterance that includes “in the neighborhood”, you’ll need to decide whether the word “in” should be part of the entity or not. Once you have decided, you can document that decision in the guide to help ensure that future instances of “in the neighborhood(/area/region)” will be consistently annotated.
  2. It makes a single annotator’s annotations more self-consistent, because the annotator doesn’t need to remember all of the annotation decisions—the guide acts as a reference.
  3. It makes multiple annotators’ annotations more consistent with each other, because all annotators will be annotating according to the same guide.

Usage data and sensitive personal information

Ensure that your training set does not include sensitive personally identifiable information (PII) from users. If your model will collect entities that are flagged as sensitive, use representative artificially generated samples instead.

Usage data and problematic content

Usage data is generally a great source of training data precisely because it represents the sorts of things real users actually say to the system. The flip side, however, is that you get everything users say to the system, and sometimes that includes problematic content such as profanity, talk of violence, or hate speech. Needless to say, you don’t want this kind of content in your training set, and you have a responsibility to remove these samples before you train your model.
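
One simple first pass, sketched below in Python, is to screen the sampled utterances against a blocklist of terms before annotation. The file names are placeholder assumptions, and token-level matching will miss multi-word phrases and spelling variants, so manual review is still required.

    # "blocklist.txt" is a hypothetical file of problematic terms, one per line.
    with open("blocklist.txt", encoding="utf-8") as f:
        blocklist = {line.strip().lower() for line in f if line.strip()}

    def is_clean(utterance: str) -> bool:
        # Token-level matching only; treat this as a first pass, not a full filter.
        return not any(token in blocklist for token in utterance.lower().split())

    with open("training_sample.txt", encoding="utf-8") as f:
        sample = [line.strip() for line in f if line.strip()]

    clean_sample = [u for u in sample if is_clean(u)]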

Best practices around creating artificial data

If you don’t have an existing application from which you can draw samples of real usage, then you will have to start off with artificially generated data.

Move as quickly as possible to training on real usage data

For reasons described below, artificial training data is a poor substitute for training data selected from production usage data. In short, prior to collecting usage data, it is simply impossible to know what the distribution of that usage data will be. For this reason, the focus of creating artificial data should be getting an NLU model into production as quickly as possible, so that real usage data can be collected and used to test and train the production model.

In other words, the primary focus of an initial system built with artificial training data should not be accuracy per se, since there is no good way to measure accuracy without usage data. Instead, the primary focus should be the speed of getting a “good enough” NLU system into production so that real accuracy testing on logged usage data can happen as quickly as possible. Obviously, the notion of “good enough”—that is, meeting minimum quality standards such as happy path coverage tests—is also critical.

Bootstrap data to get started

If you’re creating a new application with no earlier version and no previous user data, you will be starting from scratch. To get started, you can bootstrap a small amount of sample data by creating samples you imagine the users might say. It won’t be perfect, but it gives you some data to train an initial model. You can then start playing with the initial model, testing it out, and seeing how it works.

This very rough initial model can serve as a starting base that you can build on for further artificial data generation internally and for external trials. This is just a rough first effort, so the samples can be created by a single developer. When you were designing your model intents and entities earlier, you should already have been thinking about the sort of things your future users would say. You can leverage your notes from this earlier step to create some initial samples for each intent in your model.

Run data collections rather than rely on a single NLU developer

A single NLU developer thinking of different ways to phrase various utterances can be thought of as a “data collection of one person”. However, a data collection from many people is preferred, since this will provide a wider variety of utterances and thus give the model a better chance of performing well in production.

One recommended way to conduct a data collection is to provide source utterances and ask survey-takers to provide variants of these utterances. There are several ways to provide variants of source utterances:

  • Variants that don’t change the meaning of the utterance:
    • Vary the carrier phrase: For example, a variant of “I want to (order a coffee)” could be “I’d like to”, or possibly “Can I please.”
    • Replace entity literals with synonyms: For example, a variant of “big” could be “large”; a variant of “nearby” could be “in the area.”
  • Variants that change the meaning of the utterance:
    • Replace entity literals with different entity literals with different meanings: For example, the entity literal “latte” in “I’d like a latte please” could be replaced with “mocha”, giving “I’d like a mocha please.”

Note that even though a data collection is preferred to relying on a single NLU developer, data collected this way is still artificial data, and the emphasis should still be on deploying and getting real usage data as quickly as possible.

Collect enough training data to cover many entity literals and carrier phrases

As a machine learning system, Mix.nlu is more likely to predict the correct NLU annotation for utterances that are similar to the utterances that the system was trained on. Therefore, because you don’t know in advance what your users will say to the system, the way to maximize accuracy is to include as many different kinds of utterances in the training data as possible. This means including in your training data as many different entity literals and as many different carrier phrases as possible, in many different combinations. (However, note that you don’t need to include all possible entity literals in your training data. You just need enough variation so that the model doesn’t begin to “memorize” specific literals. Ten different literals for each entity type is a good rule of thumb.)
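
For illustration, the following Python sketch generates varied utterances by crossing a few carrier phrases with entity literals and keeping a random subset. The phrases and literals are invented for the coffee-ordering example; generated utterances would still need review and annotation in Mix.

    import itertools
    import random

    # Hypothetical carrier phrases and entity literals for coffee ordering.
    carrier_phrases = [
        "I want to order a {size} {drink}",
        "I'd like a {size} {drink} please",
        "Can I get a {size} {drink}",
    ]
    sizes = ["small", "medium", "large", "short", "tall"]
    drinks = ["latte", "mocha", "cappuccino", "espresso", "flat white"]

    # Cross every carrier phrase with every combination of literals, then keep a
    # random subset rather than exhaustively including all combinations.
    combinations = [
        phrase.format(size=size, drink=drink)
        for phrase, size, drink in itertools.product(carrier_phrases, sizes, drinks)
    ]
    random.seed(0)
    for utterance in random.sample(combinations, 10):
        print(utterance)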

Note that if an entity has a known, finite list of values, you should create that entity in Mix.nlu as either a list entity or a dynamic list entity. A regular list entity is used when the list of options is stable and known ahead of time. A dynamic list entity is used when the list of options is only known at runtime, for example, a list of the user’s local contacts. With list entities, Mix will know the values to expect at runtime. It is not necessary to include samples of all the entity values in the training set. However, including a few samples with different literal values helps the model effectively learn how to recognize the literal in realistic sentence contexts.

You should also include utterances with different numbers of entities. If an intent in your ontology has three entities, add training utterances for that intent that contain one entity, two entities, and three entities—at least for all combinations of entities that are likely to be spoken by your users.

The amount of training data you need for your model to be good enough to take to production depends on many factors, but as a rule of thumb it makes sense for your initial training data to include at least 20 instances of each intent and each entity, with as many different carrier phrases and different entity literals as possible.

For example, if your ontology contains four intents and three entities, then this suggests an initial training data size of (4 intents * 20) + (3 entities * 20) = 80 + 60 = 140 utterances. More training data will be needed for more complex use cases; at the end of the day it will be an empirical question based on how well your specific model is performing on test data (see Evaluating NLU accuracy).

Note that the amount of training data required for a model that is good enough to take to production is much less than the amount of training data required for a mature, highly accurate model. But the additional training data that brings the model from “good enough for initial production” to “highly accurate” should come from production usage data, not additional artificial data.

Keep your training data realistic

There is no point in your trained model being able to understand things that no user will actually ever say. For this reason, don’t add training data that is not similar to utterances that users might actually say. For example, in the coffee-ordering scenario, you don’t want to add an utterance like “My good man, I would be delighted if you could provide me with a modest latte”.

The best practice to add a wide range of entity literals and carrier phrases (above) needs to be balanced with the best practice to keep training data realistic. You need a wide range of training utterances, but those utterances must all be realistic. If you can’t think of another realistic way to phrase a particular intent or entity, but you need to add additional training data, then repeat a phrasing that you have already used.

Training data also includes entity lists that you provide to the model; these entity lists should also be as realistic as possible. For example, in cases where you expect the model to encounter utterances that contain OOV (out-of-vocabulary) entity literals (typically entities with large numbers of possible literals, such as song titles), you will want to include training utterances that similarly contain entity literals that are OOV with respect to the entity lists. A data collection is one way of accomplishing this.
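
As a rough sanity check, you can measure how many annotated entity literals fall outside your entity lists. The Python sketch below uses invented data; in practice the (entity type, literal) pairs would be extracted from your annotated training set.

    # Invented example data: annotated literals versus a song-title entity list.
    annotated_literals = [
        ("SONG_TITLE", "bohemian rhapsody"),
        ("SONG_TITLE", "some unreleased demo"),
        ("SONG_TITLE", "hey jude"),
    ]
    entity_lists = {"SONG_TITLE": {"bohemian rhapsody", "hey jude"}}

    # Literals not found in the corresponding entity list are OOV.
    oov = [
        (etype, literal)
        for etype, literal in annotated_literals
        if literal not in entity_lists.get(etype, set())
    ]
    print(f"OOV literals: {oov} (rate {len(oov) / len(annotated_literals):.0%})")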

Include fragments in your training data

Don’t forget to include fragments in your training data: utterances consisting of just entity values, with no carrier phrase. Users are likely to frequently speak fragments to your deployed system. Use the predefined intent NO_INTENT for annotating fragment training data. For example:

  • {NO_INTENT} [DRINK_TYPE] Italian soda [/] {/}
  • {NO_INTENT} [SIZE] short [/] [DRINK_TYPE] latte [/] {/}

Include anaphora references in samples

In conversations between people, participants will often use anaphoras—indirect, generic references to a subject that was mentioned recently in the conversation. This could be a:

  • Person (him, her, them)
  • Place (there, here, that place)
  • Thing (it)
  • Moment in time (then, at that time)

For example, if a person was just talking about plans to travel to Boston soon, that person might reasonably say “I want to go there on Wednesday,” or “Can you show me hotel rooms available there on Wednesday?”, where “there” is understood implicitly from the recent context to mean Boston. Similarly, a person you were just talking about might be referred to with “him” or “her”, or, for multiple people, with “them”.

If you expect users to do this in conversations built on your model, you should mark the relevant entities as referable using anaphoras, and include some samples in the training set showing anaphora references.

For more information, see Anaphoras in the Mix.nlu documentation.

Include samples using logical modifiers

In conversations you will also see sentences where people combine or modify entities using logical modifiers—and, or, or not.

For example:

  • “I would like to activate service for TV, internet, and mobile phone.”
  • “No, this call is not about my internet connection.”

You can tag sample sentences with modifiers to capture these sorts of common logical relations.

For more information see Tag modifiers in the Mix.nlu documentation.

Make sure the distribution of your training data is appropriate

In any production system, the frequency with which different intents and entities appear will vary widely. In particular, there will almost always be a few intents and entities that occur extremely frequently, and then a long tail of much less frequent types of utterances. However, when creating artificial training data for an initial model, it is impossible or at least difficult to know exactly what the distribution of production usage data will be. Thus, it’s more important to make sure that all intents and entities have enough training data, rather than trying to guess what the precise distribution should be.

However, in some cases you can be confident that certain intents and entities will be more frequent. For example, in a coffee-ordering NLU model, users will ask to order a drink much more frequently than they will ask to change their order. In these types of cases, it makes sense to create more data for the “order drink” intent than the “change order” intent. But again, it’s very difficult to know exactly what the relative frequency of these intents will be in production, so it doesn’t make sense to spend much time trying to enforce a precise distribution before you have usage data.
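
One simple way to check that every intent has enough data, without trying to enforce a precise distribution, is to count samples per intent. The Python sketch below uses the rule of thumb of 20 instances per intent from earlier; the sample data and input format are assumptions for illustration.

    from collections import Counter

    # Assumes training samples as (intent, utterance) pairs; data is invented.
    training_samples = [
        ("ORDER_DRINK", "I'd like a large latte"),
        ("ORDER_DRINK", "can I get a mocha"),
        ("CHANGE_ORDER", "make that a cappuccino instead"),
    ]

    MIN_PER_INTENT = 20  # rule of thumb from above
    counts = Counter(intent for intent, _ in training_samples)
    for intent, count in sorted(counts.items()):
        status = "OK" if count >= MIN_PER_INTENT else "needs more data"
        print(f"{intent}: {count} ({status})")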