Formatted text

ASRaaS returns the transcript of each utterance in two Hypothesis fields: formatted_text and minimally_formatted_text.

The formatted text field includes initial capitals for recognized names and places, numbers expressed as digits, currency symbols, and common abbreviations. In minimally formatted text, words are spelled out but basic capitalization and punctuation are included.

In many cases, both formats are identical.

ASRaaS uses the default data pack settings to format the material in formatted_text, for example, displaying ten centimeters as “10 cm”:

Formatted text:           December 9, 2005
Minimally formatted text: December nine two thousand and five

Formatted text:           $500
Minimally formatted text: Five hundred dollars

Formatted text:           I'll catch the 758 train
Minimally formatted text: I'll catch the seven fifty eight train

Formatted text:           We're expecting 10 cm overnight
Minimally formatted text: We're expecting ten centimeters overnight

Formatted text:           I'm okay James, how about yourself?
Minimally formatted text: I'm okay James, how about yourself?
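
As a rough illustration of how an application might compare the two fields, the sketch below uses a stand-in Hypothesis class (an assumption for this example, not the generated protobuf class) that carries only formatted_text and minimally_formatted_text. The sample values are copied from the pairs above.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """Stand-in for the ASRaaS Hypothesis message (two text fields only)."""
    formatted_text: str
    minimally_formatted_text: str


def formatting_changed(hyp):
    """Return True when the two result fields differ for this hypothesis."""
    return hyp.formatted_text != hyp.minimally_formatted_text


# Sample pairs copied from the examples above.
samples = [
    Hypothesis("$500", "Five hundred dollars"),
    Hypothesis("I'll catch the 758 train",
               "I'll catch the seven fifty eight train"),
    Hypothesis("I'm okay James, how about yourself?",
               "I'm okay James, how about yourself?"),
]

for hyp in samples:
    marker = "differs" if formatting_changed(hyp) else "identical"
    print(f"{marker:9} | {hyp.formatted_text}")
```

In a real client, the same comparison would be applied to the hypotheses returned in the recognition result.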

The default settings in the data pack provide good results in most cases. For more precise control, you may specify a formatting scheme and/or option as a recognition parameter. See RecognitionParameters > Formatting.

Formatting scheme

The formatting scheme determines how ambiguous numbers are displayed in the formatted_text field. Only one scheme may be specified per request, for example, scheme = 'date'.

The available schemes depend on the data pack, but most data packs support date, time, phone, address, all_as_words, default, and num_as_digits.

Each scheme is a collection of many options (see Formatting options below), but the defining option is PatternBias, which sets the preferred pattern for numbers that cannot otherwise be interpreted. The values of PatternBias give their name to most of the schemes: date, time, phone, address, and default.

The PatternBias option cannot be modified, but you may override the other options in a scheme individually by setting formatting options, as in this request:

RecognitionInitMessage(
    parameters = RecognitionParameters(
        language = 'en-US',
        topic = 'GEN',
        audio_format = AudioFormat(pcm=PCM(sample_rate_hz=wf.getframerate())),
        result_type = 'FINAL',
        utterance_detection_mode = 'MULTIPLE',
        formatting = Formatting(
            scheme = 'date',
            options = {
                'abbreviate_titles': True,
                'abbreviate_units': False,
                'censor_profanities': True,
                'censor_full_words': True
            }
        )
    )
)

date, time, phone, and address

The formatting schemes date, time, phone, and address tell the engine to prefer one pattern for ambiguous numbers.

By default, the engine can identify some numbers as dates, times, or phone numbers, for example:

  • I’ll catch the seven twenty six a m train is identified as a time because of a m.

  • I was born on eleven twenty six nineteen ninety four is identified as a date (in American English) because of the sequence of month, day, and year.

  • It’s six nine seven three two nine four is identified as a phone number because of the pattern of the numbers.

However, the engine considers some numbers ambiguous:

  • I’ll catch the seven twenty six train is not recognized as a specific pattern, so ASRaaS displays it as a simple cardinal number: “I’ll catch the 726 train.”

  • My birthday is eleven twenty six. Similarly, the engine displays this as: “My birthday is 1126.”

By setting the formatting scheme to date, time, phone, or address, you instruct the engine to interpret these ambiguous numbers as the specified pattern. For example, if you know that the utterances coming into your application are likely to contain dates rather than times, set scheme = 'date'.

For example, the engine interprets the ambiguous utterance, It’s seven twenty six, based on the formatting scheme in effect:

  • With the default scheme: “It’s 726”
  • With the date scheme: “It’s 7/26”
  • With the time scheme: “It’s 7:26”
  • With the address scheme: “It’s 726”
  • With the phone scheme: “It’s 726”
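
The outcomes listed above can be mimicked with a toy function. This is only a sketch of the documented results for a three- or four-digit ambiguous number; render_ambiguous is a hypothetical name, and the logic is not the engine's algorithm.

```python
def render_ambiguous(digits, scheme="default"):
    """Toy model of PatternBias for an ambiguous three- or four-digit number.

    It mirrors the documented results for utterances such as
    "seven twenty six"; it is not how the engine works internally.
    """
    if len(digits) not in (3, 4) or not digits.isdigit():
        raise ValueError("this sketch expects three or four digits")
    head, tail = digits[:-2], digits[-2:]
    if scheme == "date":
        return f"{head}/{tail}"   # month/day, as in American English
    if scheme == "time":
        return f"{head}:{tail}"   # hours:minutes
    # default, address, and phone leave the number as a plain cardinal.
    return digits


for scheme in ("default", "date", "time", "address", "phone"):
    print(f"{scheme:8} It's {render_ambiguous('726', scheme)}")
```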

all_as_words

The all_as_words scheme displays all numbers as words, even when a pattern (date, time, phone, or address) is found. For example, ASRaaS identifies this utterance as an address: My address is seven twenty six brookline avenue cambridge mass oh two one three nine:

  • With the all_as_words scheme, however, the address formatting is ignored and the numbers are written out: “My address is seven twenty six Brookline Avenue, Cambridge, Mass. Oh two one three nine”

  • With all other schemes, the text is formatted as a standard address: “My address is 726 Brookline Ave., Cambridge, MA 02139”

Similarly, this utterance is identified as a time: I’ll catch the seven twenty six a m train:

  • With the all_as_words scheme, it’s formatted neutrally as: “I’ll catch the seven twenty six a.m. train”

  • With the default or any other scheme, it’s formatted as a time: “I’ll catch the 7:26 AM train”

num_as_digits

The num_as_digits scheme is the same as default, except in its treatment of numbers under 10:

  • The default scheme formats numbers as numerals from 10 upwards: one, two, three … nine, 10, 11, 12, etc.

  • num_as_digits formats all numbers as numerals: 1, 2, 3, etc.

The num_as_digits scheme affects isolated cardinal and ordinal numbers, plural cardinals (ones, twos, nineteen fifties, and so on), some prices, and fractions. “Isolated” means a number that is not found within a greater pattern such as a date or time.

This scheme has no modifiable options.
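
The difference between default and num_as_digits for isolated cardinals can be sketched as follows. The function name and the restriction to positive integers are assumptions made for this illustration; ordinals, plurals, prices, and fractions are not modeled.

```python
# Words used by the default scheme for isolated cardinals below 10.
WORDS = {1: "one", 2: "two", 3: "three", 4: "four", 5: "five",
         6: "six", 7: "seven", 8: "eight", 9: "nine"}


def render_cardinal(n, scheme="default"):
    """Render an isolated positive cardinal under default or num_as_digits."""
    if n < 1:
        raise ValueError("this sketch only covers positive integers")
    if scheme == "num_as_digits":
        return str(n)              # numerals for every number
    return WORDS.get(n, str(n))    # words below 10, numerals from 10 upwards


print([render_cardinal(n) for n in (1, 9, 10, 12)])
print([render_cardinal(n, "num_as_digits") for n in (1, 9, 10, 12)])
```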

all_as_katakana

Available for Japanese only, the all_as_katakana scheme returns the transcript in Katakana, meaning the output is entirely in the phonetic Katakana script, without Kanji, Arabic numerals, or Latin characters.

When all_as_katakana is not specified, the output is a mix of scripts representing standard written Japanese.

This scheme has no modifiable options.

For the Japanese form of How many kilograms can I check in?:

  • With the all_as_katakana scheme, this is formatted as:
    アズケルニモツノオモサハナンキロマデデスカ

  • With the default or any other scheme, it’s formatted as:
    預ける荷物の重さは何キロまでですか

default

This scheme is the default. It has the same effect as not specifying a scheme. If ASRaaS cannot determine the format of the number, it interprets it as a cardinal number.

Formatting options

Formatting options are individual parameters for displaying words and numbers in the formatted_text result field. All options are part of the current formatting scheme (default if not specified) but can be set on their own to override the current setting.

Examples

With no formatting scheme or options, the default scheme is in effect:

RecognitionInitMessage(
    parameters = RecognitionParameters(
        language = 'en-US',
        topic = 'GEN',
        audio_format = AudioFormat(pcm=PCM(sample_rate_hz=wf.getframerate())),
        result_type = 'FINAL',
        utterance_detection_mode = 'MULTIPLE'
    )
)

With a scheme only, all options in the date scheme are in effect. See RecognitionParameters > Formatting.

RecognitionInitMessage(
    parameters = RecognitionParameters(
        language = 'en-US',
        topic = 'GEN',
        audio_format = AudioFormat(pcm=PCM(sample_rate_hz=wf.getframerate())),
        result_type = 'FINAL',
        utterance_detection_mode = 'MULTIPLE',
        formatting = Formatting(
            scheme = 'date'
        )
    )
)

With options only, options in the default scheme are overridden by specific options:

RecognitionInitMessage(
    parameters = RecognitionParameters(
        ...
        formatting = Formatting(
            options = {
                'abbreviate_titles': True,
                'abbreviate_units': False,
                'censor_profanities': True,
                'censor_full_words': True,
            }
        )
    )
)

With a scheme and options, options in the date scheme are overridden by specific options:

RecognitionInitMessage(
    parameters = RecognitionParameters(
        ...
        formatting = Formatting(
            scheme = 'date',
            options = {
                'abbreviate_titles': True,
                'abbreviate_units': False,
                'censor_profanities': True,
                'censor_full_words': True,
            }
        )
    )
)
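
The four cases above amount to merging the scheme's defaults with any options set explicitly. The sketch below models that merge on the client side, purely for illustration: the default values are copied from the principal options table in this document for a handful of options, effective_options is a hypothetical helper, and ASRaaS itself performs the real merge on the server.

```python
# Defaults for a few options, copied from the principal options table.
# The dictionaries are illustrative, not exhaustive.
STANDARD_DEFAULTS = {
    "abbreviate_titles": False,
    "abbreviate_units": True,
    "censor_profanities": False,
    "censor_full_words": False,
    "format_dates": True,
    "format_times": True,
}
ALL_AS_WORDS_DEFAULTS = {
    "abbreviate_titles": False,
    "abbreviate_units": False,
    "censor_profanities": False,
    "censor_full_words": False,
    "format_dates": False,
    "format_times": False,
}
STANDARD_SCHEMES = {"default", "date", "time", "phone", "address"}


def effective_options(scheme="default", overrides=None):
    """Return the scheme defaults with explicit options applied on top."""
    if scheme == "all_as_words":
        base = ALL_AS_WORDS_DEFAULTS
    elif scheme in STANDARD_SCHEMES:
        base = STANDARD_DEFAULTS
    else:
        raise ValueError(f"scheme not covered by this sketch: {scheme}")
    merged = dict(base)
    merged.update(overrides or {})
    return merged


# Same options as the "scheme and options" request above.
overrides = {
    "abbreviate_titles": True,
    "abbreviate_units": False,
    "censor_profanities": True,
    "censor_full_words": True,
}
print(effective_options("date", overrides))
```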

Principal options

The available options depend on the data pack. See Formatting options by language.

All options are boolean. The values are set in the scheme to which they belong. (The num_as_digits scheme has no modifiable options.)

 
Formatting options and scheme

For each option below, the last two lines give its value in the default, date, time, phone, and address schemes, and its value in the all_as_words scheme.

PatternBias
The defining characteristic of the scheme. Not modifiable.
  default, date, time, phone, address: default, date, time, phone, or address, matching the scheme
  all_as_words: (none listed)

abbreviate_titles
Whether to abbreviate titles such as Captain (Capt), Director (Dir), Madame (Mme), Professor (Prof), etc. In American English, a period follows the abbreviation. The titles Mr, Mrs, and Dr are always abbreviated.
  default, date, time, phone, address: False
  all_as_words: False

abbreviate_units
Whether to abbreviate units of measure such as centimeters (cm), meters (m), megabytes (MB), pounds (lbs), ounces (oz), miles per hour (mph), etc. When true, metric units are always abbreviated, but imperial one-word tokens are not abbreviated, so ten feet is 10 feet and twelve quarts is 12 quarts. The formatting of expressions with multiple units depends on the units involved: only common combinations are formatted.
  default, date, time, phone, address: True
  all_as_words: False

Arabic_numerals_not_Kanji (Japanese)
How to display numbers.
  • True: Numbers are either Arabic or half-formatted, depending on the half-formatted (million_as_numerals) setting.
  • False: All numbers are displayed in Kanji.
By default, cardinals are half-formatted, meaning that magnitude words (thousands, millions, etc.) are in Kanji.
  default, date, time, phone, address: True
  all_as_words: False

capitalize_2nd_person_pronouns (German)
Whether to capitalize second person personal pronouns such as Du, Dich, etc.
  default, date, time, phone, address: False
  all_as_words: False

capitalize_3rd_person_pronouns (German)
Whether to capitalize third-person personal pronouns such as Sie, Ihnen, etc.
  default, date, time, phone, address: True
  all_as_words: True

censor_profanities
Whether to mask profanities partially with asterisks, for example, "fr*gging" versus "frigging."
  default, date, time, phone, address: False
  all_as_words: False

censor_full_words
Whether to mask profanities completely with asterisks, for example, "********" versus "frigging."
When true, censor_profanities must also be true.
  default, date, time, phone, address: False
  all_as_words: False

expand_contractions
In English, whether to expand common contractions, for example, "don't" versus "do not" or "it's nice" versus "it is nice."
  default, date, time, phone, address: False
  all_as_words: False

format_addresses
Whether to format text identified as postal addresses. This does not include adding commas or new lines. Full street address formatting is done for most languages, following the standards of the country's postal service.
  default, date, time, phone, address: True
  all_as_words: False

format_currency_codes
Whether to replace the currency symbol with its ISO currency code, for example, USD125 instead of $125.
When true, format_prices must also be true.
  default, date, time, phone, address: False
  all_as_words: False

format_dates
Whether to format text identified as dates as, for example, 7/26/1994, 7/26/94, or 7/26. The order of month and day depends on the language.
  default, date, time, phone, address: True
  all_as_words: False

format_non-USA_postcodes
For non-US languages, whether to format UK and Canadian postcodes. UK postcodes have the form A9 9AA, A99 9AA, etc. Canadian postal codes have the form A9A 9A9.
  default, date, time, phone, address: False
  all_as_words: False

format_phone_numbers
For US and Canadian, whether to format numbers identified as phone numbers, as 123-456-7890 or 456-7899, optionally with 1 or +1 before the number.
  default, date, time, phone, address: True
  all_as_words: False

format_prices
Whether to format numbers identified as prices, including currency symbols and price ranges. The currency symbol depends on the language.
  default, date, time, phone, address: True
  all_as_words: False

format_social_security_numbers
Whether to format numbers identified as US social security numbers or (for Canadian) Canadian social insurance numbers. Both are a series of nine digits formatted as 123-45-6789 or 123 456 789.
  default, date, time, phone, address: False
  all_as_words: False

format_times
Whether to format numbers identified as times (including both 12- and 24-hour times) as, for example, 10:35 with optional AM or PM.
  default, date, time, phone, address: True
  all_as_words: False

format_URLs_and_email_addresses
Whether to format web and email addresses, including @ (for at) and most suffixes, including multiple suffixes, for example, .ac.edu. Numbers are displayed as digits and output is in lowercase.
  default, date, time, phone, address: True
  all_as_words: False

format_USA_phone_numbers (Mexican)
Whether to use US phone formatting instead of Mexican.
  default, date, time, phone, address: False
  all_as_words: False

improper_fractions_as_numerals
Whether to express improper fractions as numbers, for example, 5/4 versus five fourths.
  default, date, time, phone, address: True
  all_as_words: False

million_as_numerals
Whether to half-format numbers ending in million, billion, trillion, and so on, for example, 5 million.
  default, date, time, phone, address: True
  all_as_words: Inactive

mixed_numbers_as_numerals
How to express numbers that are a combination of an integer and a fraction:
  • True: As numerals (3 1/2)
  • False: As words (three and a half)
This option affects isolated mixed numbers, meaning numbers that are not part of a greater pattern such as a measurement or address. Mixed numbers in such patterns are usually transcribed as numerals even when this option is false. For example, The recipe calls for three and a half cups is transcribed as "The recipe calls for 3 1/2 cups."
  default, date, time, phone, address: True
  all_as_words: False

names_as_katakana (Japanese)
Whether recognized first and last names are transcribed in Katakana. This option can improve the transcription of homophone Japanese names, reducing variation and increasing accuracy. This option is true in the all_as_katakana scheme. In other schemes, the option is false by default, meaning names are transcribed in the script usually associated with the name.
  default, date, time, phone, address: False
  all_as_words: False

two_spaces_after_period
Whether to insert two spaces (instead of one) following a period (full stop), question mark, or exclamation mark.
  default, date, time, phone, address: False
  all_as_words: False
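
Two of the options above depend on another option: censor_full_words requires censor_profanities, and format_currency_codes requires format_prices. A client could check these dependencies before sending a request, along the lines of this sketch. The helper name and the idea of checking on the client side are assumptions for illustration; how ASRaaS reacts to an inconsistent combination is not described here.

```python
# Each key requires the option it maps to, per the option descriptions.
REQUIRES = {
    "censor_full_words": "censor_profanities",
    "format_currency_codes": "format_prices",
}


def dependency_problems(options, defaults):
    """List unmet dependencies once options are applied over scheme defaults."""
    effective = {**defaults, **options}
    problems = []
    for option, prerequisite in REQUIRES.items():
        if effective.get(option, False) and not effective.get(prerequisite, False):
            problems.append(f"{option} requires {prerequisite} to be true")
    return problems


# Defaults for the default, date, time, phone, and address schemes.
defaults = {
    "censor_profanities": False,
    "censor_full_words": False,
    "format_prices": True,
    "format_currency_codes": False,
}
print(dependency_problems({"censor_full_words": True}, defaults))
print(dependency_problems({"format_currency_codes": True}, defaults))
```

Because format_prices is already true in most schemes, setting format_currency_codes alone raises no problem in this sketch, whereas censor_full_words alone does.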

Japanese options

Japanese data packs support the formatting options listed in Japanese (jpn-JPN).

In Japanese, two options work together to specify how numbers are displayed:

  • Arabic_numerals_not_Kanji determines whether numbers are shown in Arabic, Kanji, or both.

    For words containing numbers, the formatting output depends on whether the word is defined in the system. For example, 八百屋 is a defined word meaning “greengrocer” (although literally “800 shop”). Even when Arabic_numerals_not_Kanji is True, it is always output as 八百屋, never as 800屋.

    If the word containing a number is not defined in the system, the formatting output depends on the context and the formatting scheme in effect (date, time, price, address, and so on).

  • million_as_numerals determines whether magnitude words (thousands, millions, etc.) are in Kanji and the rest in Arabic, or numbers are entirely in Arabic. When million_as_numerals is True, magnitudes are written in Kanji, as shown below.
    万 10,000
    億 100,000,000
    兆 1,000,000,000,000
    京 10,000,000,000,000,000

    This also affects currency values, so $50,000 is written as $5万.

You can control how numbers are displayed by combining Arabic_numerals_not_Kanji and million_as_numerals:

Japanese options

  • All Kanji (Arabic_numerals_not_Kanji: False): All numbers are displayed in Kanji.
  • Half-formatted, the default (Arabic_numerals_not_Kanji: True, million_as_numerals: True): Magnitude words are in Kanji and the rest in Arabic.
  • All Arabic (Arabic_numerals_not_Kanji: True, million_as_numerals: False): All numbers are displayed in Arabic.

All Kanji       Half-formatted    All Arabic
三              3                 3
十一            11                11
六十五          65                65
八百三十七      837               837
千              1,000             1,000
千九百四十五    1,945             1,945
八千五百        8,500             8,500
一万            1万               10,000
一万五千        1万5,000          15,000
一億三千万      1億3,000万        130,000,000
二億五          2億5              200,000,005
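
The three display styles in the table can be reproduced with a short sketch that splits a number into four-digit groups. The function names are hypothetical, and the Kanji rules are a simplification that has only been checked against the rows of the table above; it is not the engine's implementation.

```python
KANJI_DIGITS = "〇一二三四五六七八九"
SMALL_UNITS = ["", "十", "百", "千"]   # units inside a four-digit group
MAGNITUDES = ["", "万", "億", "兆"]    # 10^0, 10^4, 10^8, 10^12


def split_groups(n):
    """Split a positive integer into four-digit groups, least significant first."""
    groups = []
    while n > 0:
        n, group = divmod(n, 10_000)
        groups.append(group)
    return groups


def group_to_kanji(group):
    """Write 1..9999 in Kanji, omitting 一 before 十, 百, and 千."""
    out = ""
    for position in (3, 2, 1, 0):
        digit = (group // 10 ** position) % 10
        if digit == 0:
            continue
        if digit > 1 or position == 0:
            out += KANJI_DIGITS[digit]
        out += SMALL_UNITS[position]
    return out


def format_number(n, arabic_numerals_not_kanji=True, million_as_numerals=True):
    """Render n as all Kanji, half-formatted, or all Arabic."""
    if not 0 < n < 10 ** 16:
        raise ValueError("this sketch covers 1 up to, but not including, 10^16")
    if arabic_numerals_not_kanji and not million_as_numerals:
        return f"{n:,}"                                  # all Arabic
    if arabic_numerals_not_kanji:
        render = lambda group: f"{group:,}"              # half-formatted
    else:
        render = group_to_kanji                          # all Kanji
    groups = split_groups(n)
    parts = [render(groups[i]) + MAGNITUDES[i]
             for i in reversed(range(len(groups))) if groups[i]]
    return "".join(parts)


for value in (1945, 15000, 130000000, 200000005):
    print(format_number(value, arabic_numerals_not_kanji=False),
          format_number(value),
          format_number(value, million_as_numerals=False))
```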

For example, setting this option displays all numbers in Kanji:

RecognitionInitMessage(
    parameters = RecognitionParameters(
        language = 'ja-JP',
        ...
        formatting = Formatting(
            options = {'Arabic_numerals_not_Kanji':False}
        )
    )
)

This combination of options displays numbers in Kanji and Arabic. It’s the default setting so may be omitted:

RecognitionInitMessage(
    parameters = RecognitionParameters(
        language = 'ja-JP',
        ...
        formatting = Formatting(
            options = {'Arabic_numerals_not_Kanji':True,
                       'million_as_numerals':True}
        )
    )
)

These settings display all numbers in Arabic:

RecognitionInitMessage(
    parameters = RecognitionParameters(
        language = 'ja-JP',
        ...
        formatting = Formatting(
            options = {'Arabic_numerals_not_Kanji':True,
                       'million_as_numerals':False}
        )
    )
)

Scheme vs. options

Some formatting schemes have similar names to formatting options, for example, the date, phone, time, and address schemes and the options format_dates, format_times, and so on. What’s the difference?

The scheme helps interpret ambiguous numbers, while options format text for display. For example:

  • formatting scheme: 'date': Interpret eleven twenty six as the date 11/26 (November 26).

  • formatting options 'format_dates': True: Display numbers identified as dates in the locale’s date format, for example, 11/26 in American English. This is the default setting.

  • formatting options 'format_dates': False: Display numbers as cardinal numbers (1126) or write them out (eleven twenty-six), even for numbers identified as dates.

When you set a formatting option, be aware of its default value in the scheme in effect. For example, format_prices is True for most schemes, so there is no need to set it explicitly if you want prices to be shown with currency symbols and characters.

For example, for the utterance My address is seven twenty six brookline avenue cambridge mass:

  • With any formatting scheme and the formatting option format_addresses set to True, it’s shown as: “My address is 726 Brookline Ave., Cambridge, MA”

  • With format_addresses set to False, it’s displayed neutrally, not as an address: “My address is 726 Brookline Avenue Cambridge Mass”

Formatting options by language

Each language supports a different set of formatting options, which you may modify to customize the way that ASRaaS formats its results. See Formatting options.

Arabic (ara-XWW)

censor_profanities
format_dates
format_times
format_URLs_and_email_addresses

Chinese (China, chm-CHN)

abbreviate_units
censor_profanities
format_addresses
format_channel_numbers
format_dates
format_phone_numbers
format_times
million_as_numerals
no_math_symbols

Chinese (Taiwan, chm-TWN)

As Chinese plus:

censor_full_words
format_prices

Croatian (hrv-HRV)

abbreviate_units
format_currency_codes
format_dates
format_prices
format_times
format_URLs_and_email_addresses
million_as_numerals

Czech (ces-CZE)

abbreviate_units
censor_profanities
format_currency_codes
format_dates
format_phone_numbers
format_prices
format_times
format_URLs_and_email_addresses
format_social_security_numbers

Danish (dan-DNK)

abbreviate_units
censor_full_words
censor_profanities
format_currency_codes
format_dates
format_phone_numbers
format_prices
format_times
format_URLs_and_email_addresses
million_as_numerals

Dutch (nld-NLD)

As Danish plus:

format_addresses

English (USA eng-USA)

abbreviate_titles
abbreviate_units
censor_full_words
censor_profanities
expand_contractions
format_addresses
format_currency_codes
format_dates
format_non-USA_postcodes
format_phone_numbers
format_prices
format_social_security_numbers
format_times
format_URLs_and_email_addresses
improper_fractions_as_numerals
million_as_numerals
mixed_numbers_as_numerals
two_spaces_after_period

English (Australia eng-AUS, Britain eng-GBR)

As English (USA) excluding:

format_non-USA_postcodes
format_social_security_numbers

English (India eng-IND)

As English (USA) excluding:

format_addresses
format_non-USA_postcodes

Finnish (fin-FIN)

abbreviate_units
censor_profanities
format_currency_codes
format_prices
format_times
format_URLs_and_email_addresses

French (France, fra-FRA), Italian (ita-ITA)

abbreviate_units
censor_profanities
format_addresses
format_currency_codes
format_dates
format_phone_numbers
format_prices
format_times
format_URLs_and_email_addresses
million_as_numerals

French (Canada fra-CAN)

As French plus:

format_social_insurance_numbers

German (deu-DEU)

abbreviate_units
capitalize_2nd_person_pronouns
capitalize_3rd_person_pronouns
censor_profanities
format_addresses
format_currency_codes
format_dates
format_phone_numbers
format_prices
format_times
format_URLs_and_email_addresses
million_as_numerals

Greek (ell-GRC)

abbreviate_units
censor_profanities
format_currency_codes
format_dates
format_prices
format_times
format_URLs_and_email_addresses
million_as_numerals

Hebrew (heb-ISR)

abbreviate_units
format_currency_codes
format_dates
format_prices
format_times
format_URLs_and_email_addresses
million_as_numerals

Hindi (hin-IND)

abbreviate_units
format_dates
format_prices
format_times

Hungarian (hun-HUN)

abbreviate_units
censor_profanities
format_addresses
format_currency_codes
format_dates
format_prices
format_times
format_URLs_and_email_addresses
million_as_numerals

Indonesian (ind-IDN)

abbreviate_units
censor_profanities
format_dates
format_phone_numbers
format_prices
format_times

Japanese (jpn-JPN)

abbreviate_units
Arabic_numerals_not_Kanji
censor_full_words
censor_profanities
format_addresses
format_dates
format_phone_numbers
format_prices
format_times
format_URLs_and_email_addresses
million_as_numerals
names_as_katakana

Korean (kor-KOR)

abbreviate_units
censor_profanities
format_addresses
format_currency_codes
format_dates
format_phone_numbers
format_prices
format_times
format_URLs_and_email_addresses

Norwegian (nor-NOR), Polish (pol-POL)

abbreviate_units
censor_profanities
format_currency_codes
format_dates
format_phone_numbers
format_prices
format_times
format_URLs_and_email_addresses

Portuguese (Brazil por-BRA, Portugal por-PRT)

abbreviate_units
censor_profanities
format_addresses
format_currency_codes
format_dates
format_phone_numbers
format_prices
format_times
format_URLs_and_email_addresses
million_as_numerals

Romanian (ron-ROU)

abbreviate_units
censor_profanities
format_currency_codes
format_dates
format_phone_numbers
format_prices
format_times
format_URLs_and_email_addresses
million_as_numerals

Slovak (slk-SVK), Ukrainian (ukr-UKR)

abbreviate_units
censor_profanities
format_currency_codes
format_dates
format_prices
format_times
format_URLs_and_email_addresses

Spanish (spa-ESP)

abbreviate_units
censor_profanities
format_addresses
format_currency_codes
format_dates
format_phone_numbers
format_prices
format_times
format_URLs_and_email_addresses
format_USA_phone_numbers
million_as_numerals

Spanish Latin America (spa-XLA), USA (spa-USA)

As Spanish plus:

format_USA_phone_numbers

Thai (tha-THA)

abbreviate_units
censor_profanities
format_dates
format_prices
format_times

Turkish (tur-TUR), Swedish (swe-SWE), Russian (rus-RUS)

abbreviate_units
censor_full_words
censor_profanities
format_addresses
format_currency_codes
format_dates
format_prices
format_times
format_URLs_and_email_addresses
million_as_numerals

Vietnamese (vie-VNM)

abbreviate_units
censor_full_words
censor_profanities
format_dates
format_prices
format_times
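
Since the supported options differ by language, an application may want to check its options against these lists before sending a request. The sketch below encodes a few of the lists above, including the "plus" and "excluding" variants, as Python sets. The helper name is hypothetical, the keys use the language codes shown in the headings above, and option names are spelled as in the principal options table.

```python
DANISH = {
    "abbreviate_units", "censor_full_words", "censor_profanities",
    "format_currency_codes", "format_dates", "format_phone_numbers",
    "format_prices", "format_times", "format_URLs_and_email_addresses",
    "million_as_numerals",
}
ENGLISH_USA = {
    "abbreviate_titles", "abbreviate_units", "censor_full_words",
    "censor_profanities", "expand_contractions", "format_addresses",
    "format_currency_codes", "format_dates", "format_non-USA_postcodes",
    "format_phone_numbers", "format_prices",
    "format_social_security_numbers", "format_times",
    "format_URLs_and_email_addresses", "improper_fractions_as_numerals",
    "million_as_numerals", "mixed_numbers_as_numerals",
    "two_spaces_after_period",
}
ENGLISH_AUS_GBR = ENGLISH_USA - {"format_non-USA_postcodes",
                                 "format_social_security_numbers"}

# Only a few languages are encoded here; extend the mapping as needed.
SUPPORTED = {
    "dan-DNK": DANISH,
    "nld-NLD": DANISH | {"format_addresses"},          # "As Danish plus"
    "eng-USA": ENGLISH_USA,
    "eng-AUS": ENGLISH_AUS_GBR,                        # "As English (USA) excluding"
    "eng-GBR": ENGLISH_AUS_GBR,
    "eng-IND": ENGLISH_USA - {"format_addresses", "format_non-USA_postcodes"},
}


def unsupported_options(language, options):
    """Return the option names in `options` that the language does not list."""
    return sorted(set(options) - SUPPORTED[language])


print(unsupported_options("dan-DNK", {"format_addresses": True}))
print(unsupported_options("nld-NLD", {"format_addresses": True}))
```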