Training Dataset Format
The Intently NLU library leverages machine learning algorithms and some training data in order to produce a powerful intent recognition engine.
The better your training data is, the more accurate your NLU engine will be. Thus, it is worth spending a bit of time to create a dataset that matches your use case well.
Intently NLU accepts two dataset formats. The first one, which relies on YAML, is the preferred option if you want to create or edit a dataset manually. The second one uses JSON and is better suited to creating or editing datasets programmatically.
YAML format
The YAML dataset format allows you to define intents and entities using the YAML syntax.
Entity
Here is what a minimal entity file looks like:
# city entity
---
type: entity
name: city
values:
- berlin
The name of an entity can be anything you want. We recommend using a namespace to avoid name collisions. A namespace typically looks like this: group/entities/entity_name. In our example, it could be flights/entities/city.
You can specify entity values either as single YAML scalars (e.g. berlin) or as lists (e.g. [berlin, new york, tokyo]). If you don't set map_synonyms to true, every item in a list is treated as a separate value. Otherwise, the items of a list are treated as synonyms of one another.
Here is a more comprehensive example that includes all the optional attributes:
# city entity
---
type: entity
name: flights/entities/city
automatically_extensible: true # default value is false
map_synonyms: true # default value is false
matching_strictness: 1.0 # default value is 0.0
values:
- berlin
- [new york, big apple]
- tokyo
Entity attributes
type: Must be set to entity.
name: Name(space) of the entity.
automatically_extensible: Whether or not the entity can be extended with values not present in the data. Defaults to false.
map_synonyms: Whether or not the first value of a synonyms list must be used for output. This is only guaranteed if automatically_extensible is set to false; otherwise, values that are not listed under the values attribute could be parsed. Defaults to false.
matching_strictness: Controls how similar a value must be to the values used for training. Defaults to 0.0.
values: Possible (if automatically_extensible is false) or example (if automatically_extensible is true) values for this entity.
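The effect of map_synonyms on a values list can be sketched in a few lines of Python. This is only an illustration of the behaviour described above, not the library's implementation: the function name build_value_map is hypothetical, and the sketch assumes that the first item of a list is the reference value, as the map_synonyms description suggests.

```python
from typing import Dict, List, Union

# An entity value is either a single string or a list of strings.
EntityValue = Union[str, List[str]]


def build_value_map(values: List[EntityValue], map_synonyms: bool) -> Dict[str, str]:
    """Map every accepted surface form to the value reported in the output.

    Hypothetical helper: it mirrors the documented behaviour only.
    """
    value_map: Dict[str, str] = {}
    for entry in values:
        forms = [entry] if isinstance(entry, str) else list(entry)
        for form in forms:
            # With map_synonyms, the first item of a list is the reference value;
            # without it, every item stands for itself.
            value_map[form] = forms[0] if map_synonyms else form
    return value_map


values = ["berlin", ["new york", "big apple"], "tokyo"]
with_synonyms = build_value_map(values, map_synonyms=True)
without_synonyms = build_value_map(values, map_synonyms=False)
print(with_synonyms["big apple"])     # new york
print(without_synonyms["big apple"])  # big apple
```

With map_synonyms enabled, "big apple" resolves to "new york"; with it disabled, "big apple" is a value in its own right.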
Intent
Here is the minimal format used to describe an intent:
# turnLightOn intent
---
type: intent
name: turnLightOn
utterances:
- Turn on the lights.
The name of an intent can be anything you want. We recommend using a namespace to avoid name collisions. A namespace typically looks like this: group/intents/intent_name. In our example, it could be home_assistant/intents/turnLightOn.
An intent is not required to have any slots. However, if it does have slots, they must be defined in the required_slots or optional_slots attributes:
# setTemperature intent
---
type: intent
name: home_assistant/intents/setTemperature
required_slots: # Parsing will fail if these slots cannot be filled
  - name: room_temperature
    entity: home_assistant/entities/temperature
optional_slots: # If recognized, these slots will be filled, but parsing will not fail if not
  - name: room
    entity: home_assistant/entities/room
matching_strictness: 0.5
utterances:
- Set the temperature to [room_temperature] in the [room]
- please set the [room]'s temperature to [room_temperature]
- I want [room_temperature] in the [room] please
- Can you increase the temperature to [room_temperature]?
Intent attributes
type: Must be set to intent.
name: Name(space) of the intent.
required_slots: Slots that must be filled, otherwise the intent cannot be parsed.
  name: Name of the slot (used in output and utterances).
  entity: Entity type of the slot.
optional_slots: Slots that do not necessarily have to be filled. Parsing will not fail if such a slot cannot be filled, but the result will not contain a value for it in that case.
  name: Name of the slot (used in output and utterances).
  entity: Entity type of the slot.
matching_strictness: Controls how similar an utterance must be to the training data. Defaults to 0.0.
utterances: A list of example utterances with slots in square brackets [ ].
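Because slots are referenced in utterances by name inside square brackets, it can be useful to check that every referenced slot is actually declared. The sketch below is an independent illustration, not part of Intently NLU: the names SLOT_PATTERN, slot_names and undeclared_slots are hypothetical, and it assumes that slot names contain no square brackets.

```python
import re
from typing import Iterable, List, Set

# Matches "[slot_name]" and captures the name between the brackets.
SLOT_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def slot_names(utterance: str) -> List[str]:
    """Return the slot names referenced in one utterance, in order of appearance."""
    return SLOT_PATTERN.findall(utterance)


def undeclared_slots(utterances: Iterable[str], declared: Set[str]) -> Set[str]:
    """Return slot names used in the utterances but missing from the declared set."""
    used = {name for utterance in utterances for name in slot_names(utterance)}
    return used - declared


utterances = [
    "Set the temperature to [room_temperature] in the [room]",
    "please set the [room]'s temperature to [room_temperature]",
]
# Slots declared under required_slots and optional_slots in the example above.
declared = {"room_temperature", "room"}
missing = undeclared_slots(utterances, declared)
print(missing)  # set()
```

An empty result means every bracketed name in the utterances corresponds to a declared slot.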
Dataset
You are free to organize the YAML documents as you want. You can have one YAML file for each intent and each entity, or gather several documents (e.g. all entities, or all intents) in the same YAML file. All files will be used together when generating the dataset. Here is the YAML file corresponding to the previous city entity and a searchFlight intent merged together:
# city entity
---
type: entity
name: flights/entities/city
automatically_extensible: true
map_synonyms: true
values:
- berlin
- [new york, big apple]
- tokyo
# searchFlight intent
---
type: intent
name: flights/intents/searchFlight
required_slots:
  - name: origin
    entity: flights/entities/city
  - name: destination
    entity: flights/entities/city
utterances:
- find me a flight from [origin] to [destination]
- I need a flight from [origin] to [destination]
- show me flights to go to [destination] from [origin]
If you plan to have more than one entity or intent in a YAML file, you must separate them using the YAML document separator: ---
Once your intents and entities are created using the YAML format described previously, you can produce a dataset using the Command Line Interface (CLI):
python -m intently_nlu generate_dataset en dataset.yaml
Or alternatively, you can provide multiple YAML files to the CLI:
python -m intently_nlu generate_dataset en entities.yaml intents.yaml
This will generate a JSON dataset which can be used to train your engine.
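If you want to drive the CLI from a script, the command line shown above can be assembled as an argument list. This is a minimal sketch: the helper name generate_dataset_command is hypothetical, and it only builds the same command as above without running it or making any assumption about where the generated dataset is written.

```python
import sys
from typing import List, Sequence


def generate_dataset_command(language: str, yaml_files: Sequence[str]) -> List[str]:
    """Build the argument list for the generate_dataset CLI command shown above."""
    if not yaml_files:
        raise ValueError("at least one YAML file is required")
    # sys.executable is the Python interpreter currently running this script.
    return [sys.executable, "-m", "intently_nlu", "generate_dataset", language, *yaml_files]


command = generate_dataset_command("en", ["entities.yaml", "intents.yaml"])
print(" ".join(command[1:]))
```

The resulting list can then be passed to subprocess.run(command, check=True) from the standard library.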
JSON format
The JSON format can be used to create datasets too, but it is not recommended for manual editing because it does not support comments and is less human-readable than the YAML format. It is also more verbose:
{
  "entities": {
    "flights/entities/city": {
      "automatically_extensible": true,
      "map_synonyms": true,
      "matching_strictness": 0,
      "name": "flights/entities/city",
      "values": {
        "berlin": "berlin",
        "big apple": "new york",
        "new york": "new york",
        "tokyo": "tokyo"
      }
    }
  },
  "intents": {
    "flights/intents/searchFlight": {
      "matching_strictness": 0,
      "required_slots": {
        "destination": "flights/entities/city",
        "origin": "flights/entities/city"
      },
      "utterances": [
        "find me a flight from [origin] to [destination]",
        "I need a flight from [origin] to [destination]",
        "show me flights to go to [destination] from [origin]"
      ]
    }
  },
  "language": "en"
}
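When a JSON dataset is produced programmatically, a quick consistency check can catch slots that point to entities missing from the dataset. The sketch below uses only the standard library and the structure shown above. The function name find_unknown_entities is hypothetical, and the sketch assumes that optional_slots, when present, has the same name-to-entity shape as required_slots.

```python
import json
from typing import Any, Dict, List


def find_unknown_entities(dataset: Dict[str, Any]) -> List[str]:
    """List the slots whose entity is not defined in the dataset's entities."""
    known = set(dataset.get("entities", {}))
    problems: List[str] = []
    for intent_name, intent in dataset.get("intents", {}).items():
        # Assumption: optional_slots mirrors the shape of required_slots.
        for key in ("required_slots", "optional_slots"):
            for slot, entity in intent.get(key, {}).items():
                if entity not in known:
                    problems.append(
                        f"{intent_name}: slot '{slot}' uses unknown entity '{entity}'"
                    )
    return problems


raw = """
{
  "entities": {"flights/entities/city": {"values": {"berlin": "berlin"}}},
  "intents": {
    "flights/intents/searchFlight": {
      "required_slots": {
        "destination": "flights/entities/city",
        "origin": "flights/entities/city"
      },
      "utterances": ["find me a flight from [origin] to [destination]"]
    }
  },
  "language": "en"
}
"""
dataset = json.loads(raw)
print(find_unknown_entities(dataset))  # []
```

An empty list means every slot refers to an entity that exists in the dataset.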
Once you have created a JSON dataset, either directly or from YAML files, you can use it to train an NLU engine. To do so, you can use the CLI, as described in its documentation, or the Python API.