Formatted text
ASRaaS returns the results of your utterances in two Hypothesis fields: formatted_text and minimally_formatted_text.
The formatted text field includes initial capitals for recognized names and places, numbers expressed as digits, currency symbols, and common abbreviations. In minimally formatted text, words are spelled out but basic capitalization and punctuation are included.
In many cases, both formats are identical.
ASRaaS uses the default data pack settings to format the material in formatted_text, for example, displaying ten centimeters as “10 cm”:
Formatted text: December 9, 2005
Minimally formatted text: December nine two thousand and five
Formatted text: $500
Minimally formatted text: Five hundred dollars
Formatted text: I'll catch the 758 train
Minimally formatted text: I'll catch the seven fifty eight train
Formatted text: We're expecting 10 cm overnight
Minimally formatted text: We're expecting ten centimeters overnight
Formatted text: I'm okay James, how about yourself?
Minimally formatted text: I'm okay James, how about yourself?
The default settings in the data pack provide good results in most cases. For more precise control, you may specify a formatting scheme and/or option as a recognition parameter. See RecognitionParameters > Formatting.
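As a rough illustration of how an application might consume the two fields described above, the sketch below uses a hypothetical Hypothesis dataclass as a stand-in for the real response message. Only the field names formatted_text and minimally_formatted_text come from this page; the class and helper functions are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """Hypothetical stand-in for the response message; not the real class."""
    formatted_text: str
    minimally_formatted_text: str


def choose_text(hyp: Hypothesis, prefer_formatted: bool = True) -> str:
    """Return the transcript variant the application wants to display."""
    return hyp.formatted_text if prefer_formatted else hyp.minimally_formatted_text


def formats_differ(hyp: Hypothesis) -> bool:
    """True when formatting changed the text (digits, symbols, abbreviations)."""
    return hyp.formatted_text != hyp.minimally_formatted_text


# Sample values copied from the examples on this page.
snow = Hypothesis("We're expecting 10 cm overnight",
                  "We're expecting ten centimeters overnight")
greeting = Hypothesis("I'm okay James, how about yourself?",
                      "I'm okay James, how about yourself?")
```

Here formats_differ(snow) is true because the units were abbreviated, while the greeting is identical in both fields.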
Formatting scheme
The formatting scheme determines how ambiguous numbers are displayed in the formatted_text field. Only one scheme may be specified, for example, scheme = 'date'.
The available schemes depend on the data pack, but most data packs support date, time, phone, address, all_as_words, default, and num_as_digits.
Each scheme is a collection of many options (see Formatting options below), but the defining option is PatternBias, which sets the preferred pattern for numbers that cannot otherwise be interpreted. The values of PatternBias give their name to most of the schemes: date, time, phone, address, and default.
The PatternBias option cannot be modified, but you may adjust other options using formatting options.
RecognitionInitMessage(
    parameters = RecognitionParameters(
        language = 'en-US',
        topic = 'GEN',
        audio_format = AudioFormat(pcm=PCM(sample_rate_hz=wf.getframerate())),
        result_type = 'FINAL',
        utterance_detection_mode = 'MULTIPLE',
        formatting = Formatting(
            scheme = 'date',
            options = {
                'abbreviate_titles': True,
                'abbreviate_units': False,
                'censor_profanities': True,
                'censor_full_words': True
            }
        )
    )
)
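The rules above (one scheme at a time, PatternBias not modifiable) could be checked on the client before a request is built. The following is a minimal sketch under stated assumptions: build_formatting_kwargs and COMMON_SCHEMES are hypothetical names, and the scheme list is the set this page says most data packs support, so a particular data pack may accept more or fewer.

```python
# Schemes this page says most data packs support; a given data pack may differ.
COMMON_SCHEMES = {"date", "time", "phone", "address",
                  "all_as_words", "default", "num_as_digits"}


def build_formatting_kwargs(scheme="default", options=None):
    """Hypothetical helper: validate a scheme and options before a request."""
    options = dict(options or {})
    if scheme not in COMMON_SCHEMES:
        raise ValueError(f"unknown scheme: {scheme!r}")
    if "PatternBias" in options:
        # PatternBias is fixed by the scheme and cannot be modified.
        raise ValueError("PatternBias cannot be set as an option")
    if scheme == "num_as_digits" and options:
        # This page states that num_as_digits has no modifiable options.
        raise ValueError("num_as_digits has no modifiable options")
    return {"scheme": scheme, "options": options}


kwargs = build_formatting_kwargs("date", {"abbreviate_titles": True})
```

The returned dictionary could then be passed on as Formatting(**kwargs), assuming the generated class accepts scheme and options as keyword arguments in the way the example above suggests.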
date, time, phone, and address
The formatting schemes date, time, phone, and address tell the engine to prefer one pattern for ambiguous numbers.
By default, the engine can identify some numbers as a date, a time, or a phone number, for example:
- I’ll catch the seven twenty six a m train is identified as a time because of a m.
- I was born on eleven twenty six nineteen ninety four is identified as a date (in American English) because of the sequence of month, day, and year.
- It’s six nine seven three two nine four is identified as a phone number because of the pattern of the numbers.
However, the engine considers some numbers ambiguous:
- I’ll catch the seven twenty six train is not recognized as a specific pattern, so ASRaaS displays it as a simple cardinal number: “I’ll catch the 726 train.”
- My birthday is eleven twenty six. Similarly, the engine displays this as: “My birthday is 1126.”
By setting the formatting scheme to date, time, phone, or address, you instruct the engine to interpret these ambiguous numbers as the specified pattern. For example, if you know that the utterances coming into your application are likely to contain dates rather than times, set scheme: 'date'.
For example, the engine interprets the ambiguous utterance, It’s seven twenty six, based on the formatting scheme in effect:
- With the default scheme: “It’s 726”
- With the date scheme: “It’s 7/26”
- With the time scheme: “It’s 7:26”
- With the address scheme: “It’s 726”
- With the phone scheme: “It’s 726”
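The list above can be mimicked with a toy function. This is only an illustration of the pattern bias for this one utterance, not the engine's logic; render_ambiguous is a hypothetical name.

```python
def render_ambiguous(first: int, second: int, scheme: str = "default") -> str:
    """Toy model only: render an ambiguous pair such as 'seven twenty six'."""
    if scheme == "date":
        return f"{first}/{second}"
    if scheme == "time":
        return f"{first}:{second:02d}"
    # In this toy model, every other scheme (default, address, phone)
    # leaves the pair as a plain cardinal number.
    return f"{first}{second}"


results = {scheme: render_ambiguous(7, 26, scheme)
           for scheme in ("default", "date", "time", "address", "phone")}
```

For the pair (7, 26), results maps the date scheme to "7/26", the time scheme to "7:26", and the remaining schemes to "726", matching the list above.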
all_as_words
The all_as_words scheme displays all numbers as words, even when a pattern (date, time, phone, or address) is found. For example, ASRaaS identifies this utterance as an address: My address is seven twenty six brookline avenue cambridge mass oh two one three nine:
- With the all_as_words scheme, however, the address formatting is ignored and the numbers are written out: “My address is seven twenty six Brookline Avenue, Cambridge, Mass. Oh two one three nine”
- With all other schemes, the text is formatted as a standard address: “My address is 726 Brookline Ave., Cambridge, MA 02139”
Similarly, this utterance is identified as a time: I’ll catch the seven twenty six a m train:
- With the all_as_words scheme, it’s formatted neutrally as: “I’ll catch the seven twenty six a.m. train”
- With the default or any other scheme, it’s formatted as a time: “I’ll catch the 7:26 AM train”
num_as_digits
The num_as_digits scheme is the same as default, except in its treatment of numbers under 10:
- The default scheme formats numbers as numerals from 10 upwards: one, two, three … nine, 10, 11, 12, etc.
- num_as_digits formats all numbers as numerals: 1, 2, 3, etc.
The num_as_digits scheme affects isolated cardinal and ordinal numbers, plural cardinals (ones, twos, nineteen fifties, and so on), some prices, and fractions. “Isolated” means a number that is not found within a greater pattern such as date or time.
This scheme has no modifiable options.
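The difference between default and num_as_digits for isolated cardinals can be sketched as follows. This is a toy model that covers positive integers only; render_cardinal is a hypothetical name, and ordinals, plurals, prices, and fractions are left out.

```python
WORDS_BELOW_TEN = {1: "one", 2: "two", 3: "three", 4: "four", 5: "five",
                   6: "six", 7: "seven", 8: "eight", 9: "nine"}


def render_cardinal(n: int, scheme: str = "default") -> str:
    """Toy model of how an isolated cardinal is displayed under each scheme."""
    if n < 1:
        raise ValueError("this sketch only covers positive integers")
    if scheme == "num_as_digits" or n >= 10:
        # Numerals from 10 upwards in any scheme; always with num_as_digits.
        return str(n)
    return WORDS_BELOW_TEN[n]


default_view = [render_cardinal(n) for n in (3, 9, 10, 12)]
digits_view = [render_cardinal(n, "num_as_digits") for n in (3, 9, 10, 12)]
```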
all_as_katakana
Available for Japanese only, the all_as_katakana scheme returns the transcript in Katakana, meaning the output is entirely in the phonetic Katakana script, without Kanji, Arabic numbers, or Latin characters.
When all_as_katakana is not specified, the output is a mix of scripts representing standard written Japanese.
This scheme has no modifiable options.
For the Japanese form of How many kilograms can I check in?:
- With the all_as_katakana scheme, this is formatted as: アズケルニモツノオモサハナンキロマデデスカ
- With the default or any other scheme, it’s formatted as: 預ける荷物の重さは何キロまでですか
default
This scheme is the default. It has the same effect as not specifying a scheme. If ASRaaS cannot determine the format of the number, it interprets it as a cardinal number.
Formatting options
Formatting options are individual parameters for displaying words and numbers in the formatted_text result field. All options are part of the current formatting scheme (default if not specified) but can be set on their own to override the current setting.
Examples
With no formatting scheme or options, the default scheme is in effect:
RecognitionInitMessage(
    parameters = RecognitionParameters(
        language = 'en-US',
        topic = 'GEN',
        audio_format = AudioFormat(pcm=PCM(sample_rate_hz=wf.getframerate())),
        result_type = 'FINAL',
        utterance_detection_mode = 'MULTIPLE'
    )
)
With a scheme only, all options in the date scheme are in effect. See RecognitionParameters > Formatting.
RecognitionInitMessage(
    parameters = RecognitionParameters(
        language = 'en-US',
        topic = 'GEN',
        audio_format = AudioFormat(pcm=PCM(sample_rate_hz=wf.getframerate())),
        result_type = 'FINAL',
        utterance_detection_mode = 'MULTIPLE',
        formatting = Formatting(
            scheme = 'date'
        )
    )
)
With options only, options in the default scheme are overridden by specific options.
RecognitionInitMessage(
    parameters = RecognitionParameters(
        ...
        formatting = Formatting(
            options = {
                'abbreviate_titles': True,
                'abbreviate_units': False,
                'censor_profanities': True,
                'censor_full_words': True,
            }
        )
    )
)
With a scheme and options, options in the date scheme are overridden by specific options:
RecognitionInitMessage(
    parameters = RecognitionParameters(
        ...
        formatting = Formatting(
            scheme = 'date',
            options = {
                'abbreviate_titles': True,
                'abbreviate_units': False,
                'censor_profanities': True,
                'censor_full_words': True,
            }
        )
    )
)
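The override behavior in these examples amounts to a dictionary merge: the scheme supplies defaults and explicitly set options replace them. The sketch below assumes that model; effective_options and SCHEME_DEFAULTS are hypothetical names, and the handful of default values are copied from the Principal options table on this page.

```python
# A few defaults copied from the Principal options table on this page.
SCHEME_DEFAULTS = {
    "default": {"abbreviate_titles": False, "abbreviate_units": True,
                "censor_profanities": False, "censor_full_words": False},
    "all_as_words": {"abbreviate_titles": False, "abbreviate_units": False,
                     "censor_profanities": False, "censor_full_words": False},
}
# date, time, phone, and address share the default column of that table.
for name in ("date", "time", "phone", "address"):
    SCHEME_DEFAULTS[name] = dict(SCHEME_DEFAULTS["default"])


def effective_options(scheme="default", overrides=None):
    """Scheme defaults first, then explicit options override them."""
    merged = dict(SCHEME_DEFAULTS[scheme])
    merged.update(overrides or {})
    return merged


# Mirrors the "scheme and options" example above.
opts = effective_options("date", {"abbreviate_titles": True,
                                  "abbreviate_units": False,
                                  "censor_profanities": True,
                                  "censor_full_words": True})
```

Options that are not overridden keep the scheme's value, so effective_options("date", {"abbreviate_titles": True}) still reports abbreviate_units as True.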
Principal options
The available options depend on the data pack. See Formatting options by language.
All options are boolean. The values are set in the scheme to which they belong. (The num_as_digits scheme has no modifiable options.)
Formatting option | Schemes default, date, time, phone, address | Scheme all_as_words |
---|---|---|
PatternBias: The defining characteristic of the scheme. Not modifiable. | default, date, time, phone, address | |
abbreviate_titles: Whether to abbreviate titles such as Captain (Capt), Director (Dir), Madame (Mme), Professor (Prof), etc. In American English, a period follows the abbreviation. The titles Mr, Mrs, and Dr are always abbreviated. | False | False |
abbreviate_units: Whether to abbreviate units of measure such as centimeters (cm), meters (m), megabytes (MB), pounds (lbs), ounces (oz), miles per hour (mph), etc. When true, metric units are always abbreviated, but imperial one-word tokens are not abbreviated, so ten feet is 10 feet and twelve quarts is 12 quarts. The formatting of expressions with multiple units depends on the units involved: only common combinations are formatted. | True | False |
Arabic_numerals_not_Kanji (Japanese): How to display numbers. See Japanese options. | True | False |
capitalize_2nd_person_pronouns (German): Whether to capitalize second-person personal pronouns such as Du, Dich, etc. | False | False |
capitalize_3rd_person_pronouns (German): Whether to capitalize third-person personal pronouns such as Sie, Ihnen, etc. | True | True |
censor_profanities: Whether to mask profanities partially with asterisks, for example, "fr*gging" versus "frigging." | False | False |
censor_full_words: Whether to mask profanities completely with asterisks, for example, "********" versus "frigging." When true, censor_profanities must also be true. | False | False |
expand_contractions: In English, whether to expand common contractions, for example, "don't" versus "do not" or "it's nice" versus "it is nice." | False | False |
format_addresses: Whether to format text identified as postal addresses. This does not include adding commas or new lines. Full street address formatting is done for most languages, following the standards of the country's postal service. | True | False |
format_currency_codes: Whether to replace the currency symbol with its ISO currency code, for example, USD125 instead of $125. When true, format_prices must also be true. | False | False |
format_dates: Whether to format text identified as dates as, for example, 7/26/1994, 7/26/94, or 7/26. The order of month and day depends on the language. | True | False |
format_non-USA_postcodes: For non-US languages, whether to format UK and Canadian postcodes. UK postcodes have the form A9 9AA, A99 9AA, etc. Canadian postal codes have the form A9A 9A9. | False | False |
format_phone_numbers: For US and Canadian, whether to format numbers identified as phone numbers, as 123-456-7890 or 456-7899, optionally with 1 or +1 before the number. | True | False |
format_prices: Whether to format numbers identified as prices, including currency symbols and price ranges. The currency symbol depends on the language. | True | False |
format_social_security_numbers: Whether to format numbers identified as US social security numbers or (for Canadian) Canadian social insurance numbers. Both are a series of nine digits formatted as 123-45-6789 or 123 456 789. | False | False |
format_times: Whether to format numbers identified as times (including both 12- and 24-hour times) as, for example, 10:35 with optional AM or PM. | True | False |
format_URLs_and_email_addresses: Whether to format web and email addresses, including @ (for at) and most suffixes, including multiple suffixes, for example, .ac.edu. Numbers are displayed as digits and output is in lowercase. | True | False |
format_USA_phone_numbers (Mexican): Whether to use US phone formatting instead of Mexican. | False | False |
improper_fractions_as_numerals: Whether to express improper fractions as numbers, for example, 5/4 versus five fourths. | True | False |
million_as_numerals: Whether to half-format numbers ending in million, billion, trillion, and so on, for example, 5 million. | True | Inactive |
mixed_numbers_as_numerals: How to express numbers that are a combination of an integer and a fraction. | True | False |
names_as_katakana (Japanese): Whether recognized first and last names are transcribed in Katakana. This option can improve the transcription of homophone Japanese names, reducing variation and increasing accuracy. This option is true in the all_as_katakana scheme. In other schemes, the option is false by default, meaning names are transcribed in the script usually associated with the name. | False | False |
two_spaces_after_period: Whether to insert two spaces (instead of one) following a period (full stop), question mark, or exclamation mark. | False | False |
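Two rows in the table state a dependency between options: censor_full_words needs censor_profanities, and format_currency_codes needs format_prices. A client-side check for these could look like the sketch below. The names REQUIRES, DEFAULT_SCHEME_VALUES, and missing_prerequisites are hypothetical, and the fallback values are the default-scheme column of the table; how the service itself reacts to an inconsistent combination is not covered here.

```python
# Dependencies stated in the table: option -> option that must also be true.
REQUIRES = {
    "censor_full_words": "censor_profanities",
    "format_currency_codes": "format_prices",
}

# Default-scheme values for the prerequisites, copied from the table.
DEFAULT_SCHEME_VALUES = {"censor_profanities": False, "format_prices": True}


def missing_prerequisites(options, defaults=DEFAULT_SCHEME_VALUES):
    """Return (option, prerequisite) pairs whose prerequisite is not true."""
    effective = {**defaults, **options}  # explicit options override defaults
    return [(option, prerequisite)
            for option, prerequisite in REQUIRES.items()
            if effective.get(option) and not effective.get(prerequisite)]


problems = missing_prerequisites({"censor_full_words": True})
```

Because format_prices is already true in the default scheme, setting only format_currency_codes raises no problem in this sketch, whereas setting only censor_full_words does.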
Japanese options
Japanese data packs support the formatting options listed in Japanese (jpn-JPN).
In Japanese, two options work together to specify how numbers are displayed:
- Arabic_numerals_not_Kanji determines whether numbers are shown in Arabic, Kanji, or both.
  For words containing numbers, the formatting output depends on whether the word is defined in the system. For example, 八百屋 is a defined word meaning “greengrocer” (although literally “800 shop”). Even when Arabic_numerals_not_Kanji is True, it is always output as 八百屋, never as 800屋.
  If the word containing a number is not defined in the system, the formatting output depends on the context and the formatting scheme in effect (date, time, price, address, and so on).
- million_as_numerals determines whether magnitude words (thousands, millions, etc.) are in Kanji and the rest in Arabic, or numbers are entirely in Arabic. When million_as_numerals is True, magnitudes are written in Kanji, as shown below.
  万 10,000
  億 100,000,000
  兆 1,000,000,000,000
  京 10,000,000,000,000,000
  This also affects currency values, so $50,000 is written as $5万.
You can control how numbers are displayed by combining Arabic_numerals_not_Kanji and million_as_numerals:
All Kanji | Half-formatted (default) | All Arabic |
---|---|---|
Arabic_numerals_not_Kanji: False | Arabic_numerals_not_Kanji: True, million_as_numerals: True | Arabic_numerals_not_Kanji: True, million_as_numerals: False |
All numbers are displayed in Kanji. | Magnitude words are in Kanji and the rest in Arabic. | All numbers are displayed in Arabic. |
三 | 3 | 3 |
十一 | 11 | 11 |
六十五 | 65 | 65 |
八百三十七 | 837 | 837 |
千 | 1,000 | 1,000 |
千九百四十五 | 1,945 | 1,945 |
八千五百 | 8,500 | 8,500 |
一万 | 1万 | 10,000 |
一万五千 | 1万5,000 | 15,000 |
一億三千万 | 1億3,000万 | 130,000,000 |
二億五 | 2億5 | 200,000,005 |
For example, setting this option displays all numbers in Kanji:
RecognitionInitMessage(
    parameters = RecognitionParameters(
        language = 'ja-JP',
        ...
        formatting = Formatting(
            options = {'Arabic_numerals_not_Kanji': False}
        )
    )
)
This combination of options displays numbers in Kanji and Arabic. This is the default setting, so it may be omitted:
RecognitionInitMessage(
    parameters = RecognitionParameters(
        language = 'ja-JP',
        ...
        formatting = Formatting(
            options = {'Arabic_numerals_not_Kanji': True,
                       'million_as_numerals': True}
        )
    )
)
These settings display all numbers in Arabic:
RecognitionInitMessage(
    parameters = RecognitionParameters(
        language = 'ja-JP',
        ...
        formatting = Formatting(
            options = {'Arabic_numerals_not_Kanji': True,
                       'million_as_numerals': False}
        )
    )
)
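The half-formatted and all-Arabic columns of the table above follow a regular pattern: Japanese magnitudes advance every four digits, so a number can be split into groups of 10,000. The sketch below reproduces the table's values under that assumption. It is an illustration only, not the engine's implementation, and half_format and all_arabic are hypothetical names.

```python
# Magnitude words for successive powers of 10,000 (10^4, 10^8, 10^12, 10^16).
UNITS = ["", "万", "億", "兆", "京"]


def half_format(n: int) -> str:
    """Magnitude words in Kanji, the remaining digits in Arabic numerals."""
    if n < 0 or n >= 10 ** 20:
        raise ValueError("this sketch covers 0 up to (but not including) 10**20")
    if n == 0:
        return "0"
    groups = []  # least significant four-digit group first
    while n > 0:
        n, group = divmod(n, 10_000)
        groups.append(group)
    parts = []
    for power in reversed(range(len(groups))):
        if groups[power]:  # empty groups are skipped entirely
            parts.append(f"{groups[power]:,}{UNITS[power]}")
    return "".join(parts)


def all_arabic(n: int) -> str:
    """Arabic numerals only, with thousands separators."""
    return f"{n:,}"
```

For example, half_format(15000) gives "1万5,000" and all_arabic(15000) gives "15,000", as in the table.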
Scheme vs. options
Some formatting schemes have names similar to formatting options, for example, the date, phone, time, and address schemes and the options format_dates, format_times, and so on. What’s the difference?
The scheme helps interpret ambiguous numbers, while options format text for display. For example:
- The formatting scheme 'date' interprets eleven twenty six as the date 11/26 (November 26).
- The formatting option 'format_dates': True displays numbers identified as dates in the locale’s date format, for example, 11/26 in American English. This is the default setting.
- The formatting option 'format_dates': False displays numbers as cardinal numbers (1126) or writes them out (eleven twenty-six), even for numbers identified as dates.
When you set a formatting option, be aware of its default value in the scheme in effect. For example, format_prices is True for most schemes, so there is no need to set it explicitly if you want prices to be shown with currency symbols and characters.
For example, for the utterance My address is seven twenty six brookline avenue cambridge mass:
- With any formatting scheme and the formatting option format_addresses set to True, it’s shown as: “My address is 726 Brookline Ave., Cambridge, MA”
- With format_addresses set to False, it’s displayed neutrally, not as an address: “My address is 726 Brookline Avenue Cambridge Mass”
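One way to picture the split between scheme and options is as two stages: the scheme decides what an ambiguous number means, and the option decides how that meaning is shown. The toy sketch below models only the date case; interpret and display are hypothetical names, and when format_dates is False it picks the cardinal rendering, although the text above notes the number may also be written out.

```python
def interpret(scheme: str) -> str:
    """Stage 1 (scheme): decide what an ambiguous pair of numbers means."""
    return "date" if scheme == "date" else "cardinal"


def display(first: int, second: int,
            scheme: str = "default", format_dates: bool = True) -> str:
    """Stage 2 (options): decide how the interpreted value is shown."""
    if interpret(scheme) == "date" and format_dates:
        return f"{first}/{second}"
    # Not interpreted as a date, or date formatting switched off.
    return f"{first}{second}"


as_date = display(11, 26, scheme="date")
unformatted = display(11, 26, scheme="date", format_dates=False)
no_scheme = display(11, 26)
```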
Formatting options by language
Each language supports a different set of formatting options, which you may modify to customize the way that ASRaaS formats its results. See Formatting options.
More Info:
Options may differ slightly depending on the data pack’s topic domain (gen, dtv, and so on). For example, some topics in the same locale support both censor options (censor_full_words, censor_profanities) and some support only censor_profanities.
Arabic (ara-XWW)
censor_profanities
format_dates
format_times
format_URLs_and_email_addresses
Chinese (China, chm-CHN)
abbreviate_units
censor_profanities
format_addresses
format_channel_numbers
format_dates
format_phone_numbers
format_times
million_as_numerals
no_math_symbols
Chinese (Taiwan, chm-TWN)
As Chinese (China) plus:
censor_full_words
format_prices
Croatian (hrv-HRV)
abbreviate_units
format_currency_codes
format_dates
format_prices
format_times
format_URLs_and_email_addresses
million_as_numerals
Czech (ces-CZE)
abbreviate_units
censor_profanities
format_currency_codes
format_dates
format_phone_numbers
format_prices
format_times
format_URLs_and_email_addresses
format_social_security_numbers
Danish (dan-DNK)
abbreviate_units
censor_full_words
censor_profanities
format_currency_codes
format_dates
format_phone_numbers
format_prices
format_times
format_URLs_and_email_addresses
million_as_numerals
Dutch (nld-NLD)
As Danish plus:
format_addresses
English (USA eng-USA)
abbreviate_titles
abbreviate_units
censor_full_words
censor_profanities
expand_contractions
format_addresses
format_currency_codes
format_dates
format_non-USA_postcodes
format_phone_numbers
format_prices
format_social_security_numbers
format_times
format_URLs_and_email_addresses
improper_fractions_as_numerals
million_as_numerals
mixed_numbers_as_numerals
two_spaces_after_period
English (Australia eng-AUS, Britain eng-GBR)
As English (USA) excluding:
format_non-USA_postcodes
format_social_security_numbers
English (India eng-IND)
As English (USA) excluding:
format_addresses
format_non-USA_postcodes
Finnish (fin-FIN)
abbreviate_units
censor_profanities
format_currency_codes
format_prices
format_times
format_URLs_and_email_addresses
French (France, fra-FRA), Italian (ita-ITA)
abbreviate_units
censor_profanities
format_addresses
format_currency_codes
format_dates
format_phone_numbers
format_prices
format_times
format_URLs_and_email_addresses
million_as_numerals
French (Canada fra-CAN)
As French plus:
format_social_insurance_numbers
German (deu-DEU)
abbreviate_units
capitalize_2nd_person_pronouns
capitalize_3rd_person_pronouns
censor_profanities
format_addresses
format_currency_codes
format_dates
format_phone_numbers
format_prices
format_times
format_URLs_and_email_addresses
million_as_numerals
Greek (ell-GRC)
abbreviate_units
censor_profanities
format_currency_codes
format_dates
format_prices
format_times
format_URLs_and_email_addresses
million_as_numerals
Hebrew (heb-ISR)
abbreviate_units
format_currency_codes
format_dates
format_prices
format_times
format_URLs_and_email_addresses
million_as_numerals
Hindi (hin-IND)
abbreviate_units
format_dates
format_prices
format_times
Hungarian (hun-HUN)
abbreviate_units
censor_profanities
format_addresses
format_currency_codes
format_dates
format_prices
format_times
format_URLs_and_email_addresses
million_as_numerals
Indonesian (ind-IDN)
abbreviate_units
censor_profanities
format_dates
format_phone_numbers
format_prices
format_times
Japanese (jpn-JPN)
abbreviate_units
Arabic_numerals_not_Kanji
censor_full_words
censor_profanities
format_addresses
format_dates
format_phone_numbers
format_prices
format_times
format_URLs_and_email_addresses
million_as_numerals
names_as_katakana
Korean (kor-KOR)
abbreviate_units
censor_profanities
format_addresses
format_currency_codes
format_dates
format_phone_numbers
format_prices
format_times
format_URLs_and_email_addresses
Norwegian (nor-NOR), Polish (pol-POL)
abbreviate_units
censor_profanities
format_currency_codes
format_dates
format_phone_numbers
format_prices
format_times
format_URLs_and_email_addresses
Portuguese (Brazil por-BRA, Portugal por-PRT)
abbreviate_units
censor_profanities
format_addresses
format_currency_codes
format_dates
format_phone_numbers
format_prices
format_times
format_URLs_and_email_addresses
million_as_numerals
Romanian (ron-ROU)
abbreviate_units
censor_profanities
format_currency_codes
format_dates
format_phone_numbers
format_prices
format_times
format_URLs_and_email_addresses
million_as_numerals
Slovak (slk-SVK), Ukrainian (ukr-UKR)
abbreviate_units
censor_profanities
format_currency_codes
format_dates
format_prices
format_times
format_URLs_and_email_addresses
Spanish (spa-ESP)
abbreviate_units
censor_profanities
format_addresses
format_currency_codes
format_dates
format_phone_numbers
format_prices
format_times
format_URLs_and_email_addresses
format_USA_phone_numbers
million_as_numerals
Spanish Latin America (spa-XLA), USA (spa-USA)
As Spanish plus:
format_USA_phone_numbers
Thai (tha-THA)
abbreviate_units
censor_profanities
format_dates
format_prices
format_times
Turkish (tur-TUR), Swedish (swe-SWE), Russian (rus-RUS)
abbreviate_units
censor_full_words
censor_profanities
format_addresses
format_currency_codes
format_dates
format_prices
format_times
format_URLs_and_email_addresses
million_as_numerals
Vietnamese (vie-VNM)
abbreviate_units
censor_full_words
censor_profanities
format_dates
format_prices
format_times
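Because the supported options vary by language, an application might check its options against the relevant list before sending a request. The sketch below copies three of the lists above into a dictionary keyed by the codes shown in the headings. SUPPORTED_OPTIONS and unsupported_options are hypothetical names, and the table would need to be extended for other languages and kept in step with the data pack in use.

```python
# Three of the per-language lists above, keyed by the code in each heading.
SUPPORTED_OPTIONS = {
    "hin-IND": {"abbreviate_units", "format_dates", "format_prices",
                "format_times"},
    "tha-THA": {"abbreviate_units", "censor_profanities", "format_dates",
                "format_prices", "format_times"},
    "fin-FIN": {"abbreviate_units", "censor_profanities",
                "format_currency_codes", "format_prices", "format_times",
                "format_URLs_and_email_addresses"},
}


def unsupported_options(language: str, options: dict) -> list:
    """Return the option names not listed for the language, sorted by name."""
    supported = SUPPORTED_OPTIONS[language]
    return sorted(name for name in options if name not in supported)


rejected = unsupported_options("hin-IND", {"format_dates": True,
                                           "censor_profanities": True})
```

In this case rejected contains only censor_profanities, which is not in the Hindi list.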