FAQs

Ontology design

Q: If I’m considering using two separate entities that have a great deal of overlap (greater number of members in intersection than not, for example), is it best to merge them into one entity to avoid redundancy? Or better to keep separate to guarantee integrity of generated examples?

A: In general, it is better to keep them separate. But be mindful of the kinds of sentences in which these occur. Are these identical or nearly so? If so, it may make sense to combine.

Over-generation

Q: How much does over-generation of examples hurt?

A: In test data, you absolutely do not want to over-generate (such as use a grammar that generates unrealistic utterances), since you want the test data to maximally match what the system will encounter in production.

For training data, some over-generation is acceptable. It is unknown exactly how much unrealistic training data can be added before the performance of the model begins to degrade, but as a rule of thumb, try to limit over-generated/unrealistic data to no more than 20% of training data.

Entity spans

Q: Does it ever make sense to include full noun phrases (preceding determiner, adjective) in an entity? For example:

{OPEN_INSURANCE_POLICY} I want to get insurance for [INSURED_OBJECT] my new car [/] {/}

vs.

{OPEN_INSURANCE_POLICY} I want to get insurance for my new [INSURED_OBJECT] car [/] {/}

A: Whether a determiner is included in an entity is a decision that needs to be made for each project and documented in the annotation guide. However, in general, it is probably better to leave determiners out of entities, simply because they are not necessary to understand the meaning of the entity, and it is often possible for different determiners (or no determiners) to precede an entity.

As for adjectives like “new” in the example above, the question is whether the notion of a “new car” is important to the overall application. For example, if you are booking a space on a ferry, the application logic will be the same so matter how old the car is, so the NLU doesn’t need to capture it. On the other hand, if you are applying for car insurance, then the age of the car is very important to the business rules that decide how much the insurance will cost. In this case, it’s easy to imagine a user saying not just “new car”, but also “old car”, “new truck”, “old motorcycle”, and so on. Given this ability to combine different ages and insured objects, it would be more efficient (both for the NLU model and for downstream processing) to create a second entity [AGE] : “my [AGE] new [/] [INSURED_OBJECT] car [/]”.

Literal-to-value mappings

Q: When is it appropriate for the literal and value in entity {literal, value} pairs to be identical, and when should they differ? Can/should the literal-value mapping be used to provide more fine-grained information? For example, consider entity type PAYMENT_METHOD for an intent PAY_BILL—we want to know whether the user will pay by credit card, check, or money order. Here are two example utterances:

  • Pay my bill by credit card

  • Pay my bill by Mastercard

The value for the literal “credit card” should be simply “credit card”. But since a Mastercard is a kind of credit card, should the value for “Mastercard” be “mastercard”, or should it be something like “credit card.mastercard”?

A: Values exist primarily because there are often different ways to express the same entity value. For example, “big” and “large” both mean the same thing, so these two literals should map to the same value (which could be either “big” or “large”). But it doesn’t actually matter what the string of the value is, as long as it is unique—instead of “big” or “large”, it could just as easily be “arcfrgu”—but this would make the NLU model and downstream application logic harder to understand. From this point of view, it doesn’t matter what the value associated with a literal is.

However, the specific value “credit card.mastercard” is problematic because the dot implies that the value has internal structure, but it doesn’t—it is simply a text string. The NLU model doesn’t need to know or care that a Mastercard is a kind of credit card—and downstream processing will have to know in any event. Therefore, “mastercard” is the better value here.