Textual dictionaries must be encoded in UTF-8. Note that they may not contain the 3-byte UTF-8 preamble, also known as the UTF-8 BOM or signature.
The general format of a textual dictionary consists of one [Header] label with its properties, followed by several [SubHeader]/[Data] label pairs with their properties and data. Each [SubHeader] describes the expected data properties (such as orthographic or phonetic text), while each [Data] section lists the source strings to be replaced together with their destination strings.
You can represent the destination string of a dictionary entry using orthographic or phonetic text. For phonetic strings, you must use the L&H+ phonetic alphabet. For information about phonemes, see your Language Supplement.
The simplest dictionary consists of one [Header] label and one [Data] label; but while it’s syntactically correct, such a dictionary doesn’t specify any actions.
Here is an example of the format:
[Header]
Language = language_code
[SubHeader]
Content=content_type
Representation=representation_type
Language = language_code
[Data]
source_string separator destination_string
| Item | Description |
| --- | --- |
| language_code | Three-letter code used to identify the language; for example, ENU for American English. The language code is mandatory; it must be specified either in the header or in each sub-header. Only one language may be used in each dictionary. |
| content_type | Type of content checked against the dictionary. There are two options: EDCT_CONTENT_ORTHOGRAPHIC for orthographic strings, and EDCT_CONTENT_BROAD_NARROWS for phonetic strings. The content type determines the representation type. You must specify the content type in each sub-header in the dictionary. |
| representation_type | Representation type used for the output: EDCT_REPR_SZ_STRING if the content type is EDCT_CONTENT_ORTHOGRAPHIC, or EDCT_REPR_SZZ_STRING if the content type is EDCT_CONTENT_BROAD_NARROWS. You must specify the representation type in each sub-header in the dictionary. |
| source_string | Source string that is to be replaced. If the string has multiple words, enclose them in double quotes ("). Optionally, you can add whitespace characters to a multi-word phrase with the <ESC>\mw\ control sequence. (This is not required; the syntax is kept for compatibility with previous releases.) |
| separator | Separator between the source string and the destination string. This separator must be a tab character. |
| destination_string | One or more words used to replace the source string. If the string consists of phonetic symbols, precede it with two forward slashes (//). If the string has multiple words, enclose them in double quotes ("). |
Each dictionary can include several sub-header sections; each sub-header can include several data sections; and each data section can include several different source/destination string pairings. Each source/destination string pair must appear on a separate line within the data section.
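The layout described above can be illustrated with a small reader. The following Python sketch is not part of the product; the function name parse_dictionary and the returned structure are hypothetical, and the handling of a [Data] label that has no preceding [SubHeader] is an assumption made only so that the simplest dictionary can be read without an error.

```python
def parse_dictionary(text):
    """Split textual dictionary content into header properties and sections.

    Returns (header, sections), where each section is a dict holding the
    sub-header properties and a list of (source, destination) pairs.
    """
    header = {}
    sections = []    # one entry per [SubHeader]
    current = None   # property dict currently being filled
    mode = None      # "header", "subheader" or "data"

    for line in text.splitlines():
        label = line.strip()
        if not label:
            continue
        if label == "[Header]":
            mode, current = "header", header
        elif label == "[SubHeader]":
            section = {"properties": {}, "entries": []}
            sections.append(section)
            mode, current = "subheader", section["properties"]
        elif label == "[Data]":
            if not sections:
                # Assumption: a [Data] label without a sub-header gets an
                # empty property set.
                sections.append({"properties": {}, "entries": []})
            mode = "data"
        elif mode == "data":
            # Source and destination are separated by a tab character.
            source, sep, destination = line.partition("\t")
            if not sep:
                raise ValueError(f"missing tab separator: {line!r}")
            sections[-1]["entries"].append((source, destination))
        elif mode in ("header", "subheader"):
            # Properties look like "Language = ENU" or "Content=...".
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip()
        else:
            raise ValueError(f"content before [Header]: {line!r}")
    return header, sections
```

Each sub-header starts a new section, and every pair in a following [Data] section is attached to the most recent sub-header, which mirrors the nesting described above.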
Here is an example of a short dictionary:
[Header]
Language = ENU
[SubHeader]
Content=EDCT_CONTENT_ORTHOGRAPHIC
Representation=EDCT_REPR_SZ_STRING
[Data]
DLL	Dynamic Link Library
Hello	Welcome to the demonstration of the American English Text-to-Speech system.
info	Information
[SubHeader]
Content = EDCT_CONTENT_BROAD_NARROWS
Representation = EDCT_REPR_SZZ_STRING
[Data]
addr	// '@.dR+Es
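Because the separator must be a tab and the file must be UTF-8, it can be safer to generate a dictionary from a script than to type it in an editor that may silently convert tabs or change the encoding. The following is a minimal Python sketch under those assumptions; build_dictionary and write_dictionary are hypothetical helper names, not part of any Nuance tool.

```python
def build_dictionary(language, sections):
    """Return dictionary text for one language.

    sections is a list of (content_type, representation_type, entries)
    tuples, where entries is a list of (source, destination) pairs.
    """
    lines = ["[Header]", f"Language = {language}"]
    for content, representation, entries in sections:
        lines += [
            "[SubHeader]",
            f"Content={content}",
            f"Representation={representation}",
            "[Data]",
        ]
        for source, destination in entries:
            # The separator between source and destination is a tab.
            lines.append(f"{source}\t{destination}")
    return "\n".join(lines) + "\n"


def write_dictionary(path, language, sections):
    """Write the dictionary as UTF-8; Python's "utf-8" codec emits no BOM."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(build_dictionary(language, sections))


# Example: the same entries as the short dictionary above.
example_sections = [
    ("EDCT_CONTENT_ORTHOGRAPHIC", "EDCT_REPR_SZ_STRING",
     [("DLL", "Dynamic Link Library"), ("info", "Information")]),
    ("EDCT_CONTENT_BROAD_NARROWS", "EDCT_REPR_SZZ_STRING",
     [("addr", "// '@.dR+Es")]),
]
example_text = build_dictionary("ENU", example_sections)
```

The language is written once in the header here; placing it in each sub-header instead would be an equally valid layout according to the table above.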
Troubleshooting for possible errors
If you experience an error, it may be one of the following:
- Textual dictionaries must be encoded in UTF-8. All characters in the 7-bit US-ASCII range (hex 20 to 7F) are encoded identically in UTF-8, US-ASCII, Windows-1252, ISO-8859-1, and other formats, so a dictionary that uses only ASCII characters can be saved in, for example, Windows-1252 without causing a problem.
If a non-ASCII character (for example, ä) is present and the file is encoded in, for example, Windows-1252, an error is returned when the dictionary is compiled. Similarly, when the dictionary file is opened in Nuance Vocalizer Studio (see below), a fatal error is displayed.
- When the content type is EDCT_CONTENT_ORTHOGRAPHIC, the destination strings for this subheader must consist only of orthographic characters. A phonetic string is interpreted as an orthographic string, and no error is returned.
- When the content type is EDCT_CONTENT_BROAD_NARROWS, the destination strings expected for this subheader must consist only of phonetic characters; an error is returned for any destination string that isn't preceded by two forward slashes (//).
- When unknown symbols are used in phonetic content, they are ignored.
- Only one language can be specified. If more than one language is specified, no error is returned, but the dictionary is ignored.
- The specified language has to be installed. If the language is not installed, no error is returned, but the dictionary is ignored.
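Several of the conditions listed above produce no error message, so a quick check before compiling can save time. The following Python sketch is a hypothetical pre-flight check, not a replacement for the compiler: check_dictionary_bytes is an invented name, and it only tests for valid UTF-8, a tab separator on data lines, the // prefix in phonetic sections, and the number of languages specified. It cannot tell whether a language is installed or whether a phonetic symbol is known.

```python
def check_dictionary_bytes(data):
    """Return a list of problem descriptions for raw dictionary bytes."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # Typical for a Windows-1252 file containing a character such as ä.
        return [f"not valid UTF-8 (byte offset {exc.start})"]

    problems = []
    languages = set()
    content = None     # content type of the current sub-header
    in_data = False

    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped:
            continue
        if stripped in ("[Header]", "[SubHeader]"):
            in_data = False
            if stripped == "[SubHeader]":
                content = None
        elif stripped == "[Data]":
            in_data = True
        elif in_data:
            source, sep, destination = line.partition("\t")
            if not sep:
                problems.append(f"line {number}: no tab separator")
            elif (content == "EDCT_CONTENT_BROAD_NARROWS"
                  and not destination.lstrip().startswith("//")):
                problems.append(
                    f"line {number}: phonetic destination must start with //")
        else:
            key, _, value = stripped.partition("=")
            key, value = key.strip(), value.strip()
            if key == "Language":
                languages.add(value)
            elif key == "Content":
                content = value

    if len(languages) > 1:
        problems.append(
            "more than one language specified: " + ", ".join(sorted(languages)))
    elif not languages:
        problems.append("no Language property found")
    return problems
```

An empty list means that none of these particular checks failed; it does not guarantee that the dictionary compiles or is used at run time.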