Cantonese Hong Kong (cn-HK)

This documentation was updated on November 22, 2023.

Creating grammars

The following subsections describe key issues for working with grammar documents in the Cantonese language.

Character encoding

Nuance Recognizer has full internal Unicode support. Create your grammars using UTF-8. For example, your grammar header might be:

<?xml version='1.0' encoding='UTF-8'?>
<grammar xml:lang="cn-HK" version="1.0" root="test">
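
As a rough illustration of the encoding requirement, the following Python sketch encodes a small grammar as UTF-8 and checks that the result is well-formed XML. Only the header attributes come from the text above; the rule body (a single rule that accepts 零 or 一) is a hypothetical example added so that the document is complete.

```python
import xml.etree.ElementTree as ET

# Header attributes follow the example above; the rule body is hypothetical.
GRAMMAR_TEXT = """<?xml version='1.0' encoding='UTF-8'?>
<grammar xml:lang="cn-HK" version="1.0" root="test">
  <rule id="test">
    <one-of>
      <item>零</item>
      <item>一</item>
    </one-of>
  </rule>
</grammar>
"""

# Encode explicitly so the bytes match the encoding named in the declaration.
grammar_bytes = GRAMMAR_TEXT.encode("utf-8")

# Parsing the bytes confirms that the grammar is well-formed XML.
root = ET.fromstring(grammar_bytes)

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
items = [item.text for item in root.iter("item")]
print(root.get(XML_LANG), items)
```

Saving grammar_bytes to a file in binary mode keeps the file content consistent with the declared encoding.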

alphanum_lc built-in grammar

The alphanum_lc built-in grammar recognizes a connected string of up to 20 digits and lower case alphabetic characters. For example, this grammar could be used to recognize a product code or order number.

Valid characters are the English letters of the alphabet (a–z) so callers can speak English characters in addition to Cantonese numbers. The pronunciation of the letter z as the British-style “zed” is recognized, but the American-style “zii” is likely to be misrecognized as the letter c.

Valid digits are 0–9. Although specified as Arabic numbers, callers speak the Cantonese equivalents: 零 一 二 三 四 五 六 七 八 九.

Non-alphanumeric characters such as hyphens (-), dots (.), and underscores (_) are not recognized; if spoken they reduce recognition accuracy.

Return keys/values

MEANING Contains a string of ISO-8859-1 digits and lowercase letters, with no embedded spaces.
SWI_literal Contains the exact text that was recognized.

Examples

The following examples include English letters because callers can speak English letters in addition to Cantonese digits.

Spaces between digits indicate individually spoken numbers.

Caller says MEANING key
零 一 二 三 四 五 六 七 八 九 0123456789
a b c d e f g abcdefg
a b c 1 e 6 g abc1e6g
a 一 s 二 d 三 f 四 a1s2d3f4

Here are examples of utterances that do not parse when spoken by callers:

Caller says Reason for not being recognized
十二 Natural numbers are not recognized with this grammar. Each digit must be spoken individually.
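
The mapping from spoken tokens to the MEANING value can be sketched in Python as follows. This only illustrates the expected result for test data; it is not the recognizer's implementation, and the function name expected_meaning is hypothetical.

```python
import re

# Cantonese digit characters listed in the text, mapped to Arabic digits.
CANTONESE_DIGITS = {
    "零": "0", "一": "1", "二": "2", "三": "3", "四": "4",
    "五": "5", "六": "6", "七": "7", "八": "8", "九": "9",
}

# MEANING shape described above: up to 20 digits or lowercase letters, no spaces.
MEANING_PATTERN = re.compile(r"[0-9a-z]{1,20}")


def expected_meaning(utterance: str) -> str:
    """Map a space-separated utterance to the MEANING string it should produce.

    Raises ValueError for tokens outside the documented character set.
    """
    characters = []
    for token in utterance.split():
        token = CANTONESE_DIGITS.get(token, token)
        if not re.fullmatch(r"[0-9a-z]", token):
            raise ValueError(f"token not covered by alphanum_lc: {token!r}")
        characters.append(token)
    meaning = "".join(characters)
    if not MEANING_PATTERN.fullmatch(meaning):
        raise ValueError("expected between 1 and 20 characters")
    return meaning


print(expected_meaning("a 一 s 二 d 三 f 四"))
```

A natural number such as 十二 is rejected by this sketch, which matches the table above.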

alphanum built-in grammar

NOTE: This grammar is retained for backward compatibility only. It has been replaced by the alphanum_lc built-in grammar; use alphanum_lc for new implementations.

The alphanum built-in grammar recognizes a connected string of up to 20 digits and upper and lower case alphabetic characters. For example, this grammar could be used to recognize a product code or order number.

Valid characters are the English letters of the alphabet (a–z) so callers can speak English characters in addition to Cantonese numbers. The pronunciation of the letter z as the British-style “zed” is recognized, but the American-style “zii” is likely to be misrecognized as the letter c.

Valid digits are 0–9. Although specified as Arabic numbers, callers speak the Cantonese equivalents: 零 一 二 三 四 五 六 七 八 九.

Non-alphanumeric characters such as hyphens (-), dots (.), and underscores (_) are not recognized; if spoken they reduce recognition accuracy.

Return keys/values

MEANING Contains a string of ISO-8859-1 digits and lowercase letters, with no embedded spaces.
SWI_literal Contains the exact text that was recognized.

Examples

The following examples include English letters because callers can speak English letters in addition to Cantonese digits.

Spaces between digits indicate individually spoken numbers.

Caller says MEANING key
零 一 二 三 四 五 六 七 八 九 0123456789
a b c d e f g abcdefg
a b c 1 e 6 g abc1e6g
a 一 s 二 d 三 f 四 a1s2d3f4

Here are examples of utterances that do not parse when spoken by callers:

Caller says Reason for not being recognized
十二 Natural numbers are not recognized with this grammar. Each digit must be spoken individually.

boolean built-in grammar

The boolean grammar collects an affirmative or negative response.

Properties

The y and n parameters let you associate any two touchtone buttons as synonyms for yes and no.

Parameter Description
y Desired DTMF digit to be equivalent to 岩 (default = 1)
n Desired DTMF digit to be equivalent to 錯 (default = 2)

Examples

Caller says MEANING key
岩 true
錯 false

ccexpdate built-in grammar

The ccexpdate grammar understands the expiration date on a credit card. Expiration dates usually consist of a month and a year, and are often embossed on the card in the form “mm/yy.” The grammar recognizes variations on the date, for example, December 2005 (二 零 零 五 年 十 二 月) and oh four oh five (二 零 零 五 年 四 月).

Some credit cards are stamped with a day of the month as well as the month and year; the ccexpdate grammar recognizes these dates as well. However, the only day of the month it recognizes is the last day of the given month, for example, November 30th, 2005 (二 零 零 五 年 十 一 月 三 十 號). The grammar does not check for leap years: both February 28 and February 29 are recognized, regardless of the year.

Return keys/values

Upon return, the MEANING key is set to the recognized date in YYYYMMDD format, where YYYY is the year, MM is the month, and DD is the day. For example, 20100331 refers to March 31, 2010. The value is the same whether or not the caller specified a day of the month; the day is always set to the last day of the month. For example, both “oh six three oh oh five” (二 零 零 五 年 六 月 三 十 號) and “oh six oh five” (二 零 零 五 年 六 月) return 20050630. If the expiration month is February, MMDD is always 0228, regardless of what the caller said and of whether the expiration year is a leap year.
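
The rule for the returned value can be sketched in Python as follows. This is a minimal illustration of the YYYYMMDD convention described above, useful for building expected values in tests; the function name ccexpdate_meaning is hypothetical.

```python
import calendar


def ccexpdate_meaning(year: int, month: int) -> str:
    """Build the YYYYMMDD string described above for an expiration month and year."""
    if not 1 <= month <= 12:
        raise ValueError("month must be in 1..12")
    if month == 2:
        # Per the text, February always yields day 28, leap year or not.
        last_day = 28
    else:
        # monthrange returns (weekday of first day, number of days in month).
        last_day = calendar.monthrange(year, month)[1]
    return f"{year:04d}{month:02d}{last_day:02d}"


print(ccexpdate_meaning(2005, 6))
```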

citizenid built-in grammar

The citizenid grammar understands Hong Kong citizen ID numbers that are 8 or 9 characters long:

  • The 8-character ID has this pattern: LDDDDDDX
  • The 9-character ID has this pattern: LLDDDDDDX

The ID number contains these parts:

L - letter a-z
D - digit 0-9
X - checksum character, 0-9 or a

The checksum calculation is described in the header of the source grammar.

Example

Caller says MEANING key
a b 一 二 三 四 五 六 九 ab1234569
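
The two ID patterns can be expressed as a single regular expression. The Python sketch below checks only the shape of a MEANING value; it does not verify the checksum character, because the checksum calculation is documented in the source grammar and not here. The function name has_citizen_id_shape is hypothetical.

```python
import re

# One or two letters, six digits, then a check character (0-9 or a).
CITIZEN_ID_PATTERN = re.compile(r"[a-z]{1,2}[0-9]{6}[0-9a]")


def has_citizen_id_shape(meaning: str) -> bool:
    """Return True if the value matches LDDDDDDX or LLDDDDDDX (shape only)."""
    return CITIZEN_ID_PATTERN.fullmatch(meaning) is not None


print(has_citizen_id_shape("ab1234569"))
```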

creditcard built-in grammar

The creditcard grammar understands a caller saying a credit card number, optionally preceded by the credit card name or by the words “account number” (账號) or “account” (账户). For example, a caller can say “visa account number four seven six four…” (维萨 卡 账號 四 七 六 四), “mastercard five two seven eight…” (万事达 卡 五 二 七 八), or “three seven three five…” (三 七 三 五).

The following card types are allowed by default: Visa, Mastercard, JCB, American Express, and DinersClub.

To allow other card types, add the default card tags (visa+mastercard+jcb+amex+dinersclub) plus your additional card types to the grammar load line, joined by + signs. For example:
[credit card grammar]?SWI_vars.typesallowed=mastercard+visa+dinersclub+private+amex+discover+jcb+cup

In addition to the default card types, the source grammar implements the following card types: Discover (tag: discover) and China UnionPay card (tag: cup).
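
The load line can be assembled programmatically. The Python sketch below joins the default tags with any additional tags, as described above. The grammar URI "creditcard.xml" in the usage line is a placeholder assumption, and the function name with_types_allowed is hypothetical.

```python
from typing import Iterable

# Default card tags as listed in the text.
DEFAULT_CARD_TAGS = ["visa", "mastercard", "jcb", "amex", "dinersclub"]


def with_types_allowed(grammar_uri: str, extra_tags: Iterable[str]) -> str:
    """Append SWI_vars.typesallowed with the default tags plus extra tags."""
    tags = list(DEFAULT_CARD_TAGS)
    for tag in extra_tags:
        if tag not in tags:  # avoid listing a tag twice
            tags.append(tag)
    return f"{grammar_uri}?SWI_vars.typesallowed={'+'.join(tags)}"


# "creditcard.xml" is a placeholder; substitute the URI of your grammar.
print(with_types_allowed("creditcard.xml", ["discover", "cup"]))
```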

currency built-in grammar

The currency grammar collects currency amounts in Hong Kong dollars (written as 蚊, 元, or 個) and the subunit 毫 (10 Hong Kong cents).

Return keys/values

MEANING Contains the currency code HKD followed by the amount with two decimal places (for example, HKD5.00). If the speaker does not mention a unit name (main units 蚊, 元, 個, or subunit 毫), the utterance is interpreted as a main-unit amount; that is, “五” is interpreted as 5 Hong Kong dollars.
SWI_literal Contains the exact text that was recognized.

Examples

Caller says MEANING
五 蚊 HKD5.00
五 元 HKD5.00
五 蚊 零 五 HKD5.05
五 蚊 兩 毫 半 HKD5.25
五 蚊 兩 毫 五 HKD5.25
六 十 二 萬 五 千 四 百 六 十 四 蚊 HKD625464.00
四 十 一 萬 二 千 五 百 六 十 元 HKD412560.00
四 十 一 萬 二 千 五 百 六 十 元 一 毫 HKD412560.10
一 蚊 HKD1.00
兩 個 半 HKD2.50
兩 毫 半 HKD0.25
兩 個 兩 毫 半 HKD2.25
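
An application that receives these values typically needs the numeric amount. The Python sketch below parses the MEANING format shown in the examples (HKD followed by an amount with two decimal places). The format is inferred from the table above, and the function name parse_currency_meaning is hypothetical.

```python
import re
from decimal import Decimal

# Shape inferred from the examples above: "HKD" plus digits, a dot, two decimals.
CURRENCY_PATTERN = re.compile(r"HKD(\d+\.\d{2})")


def parse_currency_meaning(meaning: str) -> Decimal:
    """Return the amount in Hong Kong dollars as an exact Decimal."""
    match = CURRENCY_PATTERN.fullmatch(meaning)
    if match is None:
        raise ValueError(f"unexpected currency value: {meaning!r}")
    return Decimal(match.group(1))


print(parse_currency_meaning("HKD412560.10"))
```

Decimal is used here so that cent values such as 0.25 are not subject to binary floating-point rounding.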

date built-in grammar

The date grammar accepts a date spoken in the format of Year - Month - Day.

The grammar also accepts the following common words, and returns specific values:

Caller says Value Returned
前 天 -2
昨 天 -1
今 天 0
明 天 1
後 天 2

Examples

Caller says MEANING key
前 天 -2
昨 天 -1
今 天 0
明 天 +1
後 天 +2
一 號 ??????01
十 二 月 四 號 星 期 三 ????1204
十 二 月 四 號 ????1204
四 號 ??????04
二 零 零 一 年 六 月 四 號 20010604
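
The MEANING values in this table take two shapes: a signed day offset for relative words, and an eight-character YYYYMMDD string in which unspoken fields are filled with question marks. The Python sketch below separates the two cases. It is based only on the examples above, and the names PartialDate and parse_date_meaning are hypothetical.

```python
from typing import NamedTuple, Optional, Union


class PartialDate(NamedTuple):
    year: Optional[int]
    month: Optional[int]
    day: Optional[int]


def _field(text: str) -> Optional[int]:
    """Convert one date field; a field made only of '?' is unspecified."""
    if set(text) == {"?"}:
        return None
    return int(text)


def parse_date_meaning(meaning: str) -> Union[int, PartialDate]:
    """Return a day offset for relative words, otherwise a PartialDate."""
    if len(meaning) == 8:
        return PartialDate(
            _field(meaning[:4]), _field(meaning[4:6]), _field(meaning[6:])
        )
    # int() accepts "-2", "0", and "+1" alike.
    return int(meaning)


print(parse_date_meaning("????1204"))
print(parse_date_meaning("+2"))
```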

digits built-in grammar

The digits grammar recognizes a continuously spoken string of up to 20 digits (i.e., the caller is not required to pause after each digit).

Valid characters are the digits: 零一二三四五六七八九

Examples

Caller says MEANING key
零 0
一 1
零 一 二 三 四 五 六 七 八 九 0123456789

Here are examples of utterances that do not parse when spoken by callers:

Caller says Reason for not being recognized
Natural numbers are not recognized by this grammar
十 二 Natural numbers are not recognized by this grammar

number built-in grammar

The number grammar recognizes numbers spoken as natural quantities (the caller must not speak the individual digits).

Examples

Numbers from -99,999,999.99 to 99,999,999.99 are recognized, but by default the minallowed parameter is set to zero, which limits recognition to non-negative values.

Caller says MEANING key
十 二 12
二 十 一 21
二 十 二 22
三 十 30
一 百 零 一 101
四 百 二 十 420
三 千 零 二 3002
一 萬 兩 千 三 百 四 十 五 12345
一 百 二 十 三 123
負 四 -4
十 四 點 五 六 14.56
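
The correspondence between the spoken form and the MEANING value can be sketched in Python as follows. This is a simplified converter for preparing test data, not the recognizer's implementation: it covers only the constructions shown in the examples above (digits, 十, 百, 千, 萬, 負, and 點), and the function name cantonese_number is hypothetical.

```python
DIGITS = {
    "零": 0, "一": 1, "二": 2, "兩": 2, "三": 3, "四": 4,
    "五": 5, "六": 6, "七": 7, "八": 8, "九": 9,
}
UNITS = {"十": 10, "百": 100, "千": 1000}


def cantonese_number(utterance: str) -> str:
    """Convert a space-separated spoken number to its digit string."""
    tokens = utterance.split()
    sign = ""
    if tokens and tokens[0] == "負":
        sign, tokens = "-", tokens[1:]
    if "點" in tokens:
        point = tokens.index("點")
        int_tokens, frac_tokens = tokens[:point], tokens[point + 1:]
    else:
        int_tokens, frac_tokens = tokens, []
    if not int_tokens:
        raise ValueError("no integer part spoken")

    total = 0      # value of completed 萬 (10,000) groups
    section = 0    # value accumulated below the current 萬 boundary
    digit = None   # digit waiting for a following unit
    for token in int_tokens:
        if token in DIGITS:
            if digit not in (None, 0):
                # Two digits in a row means the digits were spoken individually.
                raise ValueError("not a natural number")
            digit = DIGITS[token]
        elif token in UNITS:
            # A bare unit such as 十 in "十 二" counts as one of that unit.
            section += (1 if digit is None else digit) * UNITS[token]
            digit = None
        elif token == "萬":
            total += (section + (digit or 0)) * 10000
            section, digit = 0, None
        else:
            raise ValueError(f"unsupported token: {token!r}")

    text = f"{sign}{total + section + (digit or 0)}"
    if frac_tokens:
        if any(token not in DIGITS for token in frac_tokens):
            raise ValueError("fraction digits must be spoken individually")
        text += "." + "".join(str(DIGITS[token]) for token in frac_tokens)
    return text


print(cantonese_number("一 萬 兩 千 三 百 四 十 五"))
```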

phone built-in grammar

The phone grammar recognizes telephone numbers (landline and cellular). Optionally, the caller can speak an extension number of up to 4 digits.

This is the phone number coverage list:

  • 3-digit (emergency): 112, 189, 990-999
  • 4-digit (directory assistance): 1083

Landline:

  • 8-digit numbers - starting with 2,3
  • 8-digit numbers - starting with 2,3 - plus 1-to-4-digit extension

Cellular:

  • 8-digit numbers - starting with 5,6,9

Pager numbers:

  • 8-digit numbers - starting with 7

Personal service numbers:

  • 8-digit numbers - starting with 8

Toll-free numbers:

  • 8-digit numbers - starting with 800

The variable SWI_vars.typesallowed can be used to switch on or off the following phone number groups:

Available tags:

  • landline - landline numbers (with optional extension)
  • cellular - cellular numbers
  • special - special numbers
  • pager - pager numbers
  • service - personal service numbers
  • tollfree - toll-free numbers

The following groups are active by default: landline+cellular

To allow only certain groups, set SWI_vars.typesallowed when loading the grammar. For example, to allow cellular, landline, and pager numbers:
phone.xml?SWI_vars.typesallowed=cellular+landline+pager
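
The coverage list and the group tags above can be combined into a small classifier, for example to check test data against the groups you have enabled. The Python sketch below rests on two assumptions that the text does not state explicitly: the 3-digit and 4-digit numbers belong to the special group, and numbers starting with 800 are treated as toll-free and not as personal service numbers. The function name classify_phone_number is hypothetical.

```python
from typing import Optional

# Assumption: the 3-digit and 4-digit numbers in the coverage list are "special".
SPECIAL_NUMBERS = {"112", "189", "1083"} | {str(n) for n in range(990, 1000)}

# First digit of an 8-digit number mapped to its group tag.
GROUP_BY_FIRST_DIGIT = {
    "2": "landline", "3": "landline",
    "5": "cellular", "6": "cellular", "9": "cellular",
    "7": "pager",
    "8": "service",
}


def classify_phone_number(digits: str) -> Optional[str]:
    """Return the group tag for a digit string, or None if it is not covered."""
    if not (digits.isascii() and digits.isdigit()):
        return None
    if digits in SPECIAL_NUMBERS:
        return "special"
    if len(digits) != 8:
        return None
    if digits.startswith("800"):
        # Assumption: the 800 prefix takes precedence over the general 8 prefix.
        return "tollfree"
    return GROUP_BY_FIRST_DIGIT.get(digits[0])


print(classify_phone_number("23456789"))
```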

Examples

Caller says MEANING key
九 九 九 999
二 三 四 五 六 七 八 九 23456789
二 三 四 五 六 七 八 九 內 線 二 三 四 五 23456789x2345
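
As the last example shows, an extension is appended to the number after an x. A minimal Python sketch for splitting such a value follows; the function name split_phone_meaning is hypothetical.

```python
from typing import Optional, Tuple


def split_phone_meaning(meaning: str) -> Tuple[str, Optional[str]]:
    """Split a MEANING value into (number, extension or None)."""
    number, separator, extension = meaning.partition("x")
    return number, (extension if separator else None)


print(split_phone_meaning("23456789x2345"))
```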

Here are examples of utterances that do not parse when spoken by callers:

Caller says Reason for not being recognized
五 三 四 五 六 七 八 九 53456789 Telephone numbers do not begin with 5.
九 九 九 內 線 二 三 四 五 999x2345 The '999' number is a special number; it is never followed by extra digits.

time built-in grammar

The time grammar recognizes a time of day.

Recognized phrases include:

Times spoken in… Example
12-hour format 五 點
24-hour format 二 十 三 點 十 五 分

Callers can specify 5-minute increments with: 個 字

In addition, the grammar recognizes “qualified” times. For example:

Qualifiers Description
之 前 Sets the QUALIFIER key to 'before'.
大 約 Sets the QUALIFIER key to 'approx'.
大 約 五 點 之 前 Not recognized; the grammar does not expect callers to speak qualifiers before and after the time.

Examples

For each entry, the values returned in the MEANING and QUALIFIER keys are shown. (Not shown are the values of the HOUR, MINUTE and AMPM keys.)

Caller says MEANING QUALIFIER
中 午 1200p exact
中 午 之 前 1200p before
八 點 半 0830? exact
夜 晚 七 點 一 個 字 0705p exact
零 晨 一 點 0100a exact
二 十 三 點 2300h exact
一 點 十 分 0110? exact
一 點 一 個 字 0105? exact
大 約 一 點 一 個 字 0105? approx
下 晝 一 點 一 個 字 0105p exact
上 晝 一 點 一 個 字 0105a exact
一 點 半 0130? exact
十 二 點 十 分 1210? exact
中 午 十 二 點 1200p exact
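
The MEANING values above consist of four digits (HHMM) and a one-character marker. The Python sketch below parses that shape and, where possible, converts the hour to 24-hour form. The interpretation of the markers is inferred from the examples and is an assumption: a for times before noon, p for noon or later, h for times spoken in 24-hour format, and ? for times whose half of the day was not stated. The names TimeMeaning, parse_time_meaning, and to_24_hour are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class TimeMeaning(NamedTuple):
    hour: int
    minute: int
    marker: str  # "a", "p", "h", or "?"


TIME_PATTERN = re.compile(r"(\d{2})(\d{2})([aph?])")


def parse_time_meaning(meaning: str) -> TimeMeaning:
    """Split a value such as '0705p' into hour, minute, and marker."""
    match = TIME_PATTERN.fullmatch(meaning)
    if match is None:
        raise ValueError(f"unexpected time value: {meaning!r}")
    return TimeMeaning(int(match.group(1)), int(match.group(2)), match.group(3))


def to_24_hour(parsed: TimeMeaning) -> Optional[int]:
    """Return the hour on a 24-hour clock, or None if it is ambiguous."""
    if parsed.marker == "h":
        return parsed.hour
    if parsed.marker == "a":
        return parsed.hour % 12
    if parsed.marker == "p":
        return parsed.hour % 12 + 12
    return None  # "?": the caller did not say which half of the day


print(parse_time_meaning("0705p"), to_24_hour(parse_time_meaning("0705p")))
```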

Vocabulary items and pronunciations

This chapter describes considerations for vocabularies and their pronunciations in Cantonese (cn-HK).

Cantonese pronunciations

This section provides detailed reference information to help create pronunciation dictionaries. It is intended for people who have sufficient knowledge of the Cantonese language as spoken in Hong Kong. It provides information about transcription and pronunciation.

The Cantonese phoneme system

There are six different types of Cantonese consonants:

  • Plosives
  • Fricatives
  • Affricates
  • Glides
  • Nasals
  • Liquids

Cantonese symbol set grouped by phoneme classes

SAMPA IPA Examples of use
Consonants
Plosives
b p
p /pu3n/
d t /do3w/
t /ti4n/
g k /gE1y/
k /ka1/
Fricatives
f f
s z /su3G/
h h /ha6w/
Affricates
q ts
j dz /ja3t/
kw kw /kw@3n/
gw gw /gwo2G/
Glides
w w
y j /yi4G/
Nasals
m m
n n /na5/
G ŋ /Ga5w/
Liquids
l l
Vowels
Single vowels
a aː
@ a /b@1n/
u ʊ - uː 籠 固 /lu4G/ /gu3/
i i - iː - ɪ 僥 思 匿 /hi1w/ /si1/ /ni1k/
o o - ɔ - u 勞 框 戊 /lo4w/ /ho1G/ /mo6w/
v yː /tv4n/
8 œ - œː 輪 嚐 /l84n/ /s84G/
E e - ɛ 幾 些 /gE2y/ /sE1/

Cantonese consonants

The Cantonese consonant system has:

  • Six plosives
  • Three fricatives
  • Four affricates
  • Three nasals
  • Two glides
  • One liquid

Plosives

There are three aspirated and three unaspirated plosives in Cantonese, which can be arranged in pairs as shown here:

Unaspirated Examples Aspirated Examples
b /ba1/ p
d /do3w/ t
g /gE1y/ k

/b/, /d/, and /g/ may be realized as voiced stops, but the real distinction is between aspirated and unaspirated plosives, because the voicing is not systematic. Syllable-final plosives are usually unreleased.

Fricatives

There are three fricatives in Cantonese:

f /fu1/
s /su3G/
h /ha6w/

Affricates

In Cantonese there are two real affricates and two co-articulated consonants:

q /qo1/ kw /kw@3n/
j /ja3t/ gw /gwo2G/

/kw/ and /gw/ are not actually affricates but co-articulated consonants because the velar plosives /k/ or /g/ are uttered simultaneously with the glide /w/.

Nasals

There are three nasals in Cantonese:

m /mo4/
n /na5/
G /Ga5w/

/m/ and /G/ can also denote semivowels (as syllable nucleus) whereas /n/ always denotes the alveolar nasal.

Glides

There are two glides in Cantonese:

w /w@1n/
y /yi4G/

Liquids

There is one liquid in Cantonese:

l /li1/

Cantonese vowels

Monophthongs

There are eight vowels (monophthongs) in Cantonese.

a /ka1/
@ /b@1n/
u /lu4G/ /gu3/
i /ni1k/ /si1/
o /ho1G/
v /tv4n/
8 /l84n/ /s84G/
E /sE1/

Diphthongs

Diphthongs are formed from a sequence of a vowel (monophthong) and a glide.

Note: The jyutping diphthong eoi is transcribed by the phoneme /8/ + tone + /H/

/ja3y/ /sa3y/
/ja2w/ /ja6w/
/hi1w/
/mo6w/ /lo4w/ /jo6w/
/jo6y/
/gE2y/ /gE1y/ 鍛錘 /dv3nq84H/
/ji1w/
/s85y/
/bu1y/

The Cantonese tone system

Overview

Cantonese is a tone language, which means that a syllable carries different meanings depending on the tone with which it is pronounced. Hence, tone is obligatory to the construction of a syllable.

There are six tones in Cantonese and every syllable must be assigned one of these six tones, otherwise the transcription is invalid.

Note: Syllables that contain only a consonant (syllabic consonants) must also have a tone indicator. In the current language pack, the following syllabic phonemes exist: /G4/, /G5/, /G6/, and /m4/.
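
When preparing dictionary entries, a quick check for missing tone indicators can be useful. The Python sketch below extracts the tone digits from a transcription. It relies on the fact that the symbol set uses 8 as a vowel symbol but none of the digits 1 to 6, so any of those digits can be read as a tone. It does not segment syllables, so it can only confirm that at least one tone is present. The function names are hypothetical.

```python
from typing import List

TONE_CHARACTERS = "123456"


def tone_digits(transcription: str) -> List[int]:
    """Return the tone digits found in a slash-delimited transcription."""
    # The vowel symbol 8 is skipped because it is not one of the six tones.
    return [int(ch) for ch in transcription.strip("/") if ch in TONE_CHARACTERS]


def has_tone(transcription: str) -> bool:
    """Return True if the transcription carries at least one tone indicator."""
    return len(tone_digits(transcription)) > 0


print(tone_digits("/dv3nq84H/"))
print(has_tone("/G4/"))
```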

The Cantonese tone system is summarized in the following table.

TONE DESCRIPTION EXAMPLE
1 High Level
2 High Rising
3 Mid Level
4 Low Falling
5 Low Rising
6 Low Level

Tone sandhi

One common tone phenomenon in Cantonese is tone sandhi, which is the change of tones when syllables are in sequence. That is, a syllable has one of the tones in isolation, and the same syllable may take on a different tone without any change in meaning when it is followed by another syllable.

There are no hard and fast rules on where and when tone sandhi occurs, and there are many exceptions. However, it occurs mostly on the second character of two-character compound nouns, especially when this character is normally pronounced with Tone 4 or Tone 6 (the two tones with the lowest pitch).

Examples

甜 甜 | /ti4m//ti2m/
伶 伶 | /li4G//li2G/

Please note that tone sandhi does not occur in all two-character compound nouns with tone 4 or 6.

The Cantonese symbol set in alphabetical order

The following table shows the Cantonese symbol set (left column without tone markers) in alphabetical order:

SAMPA IPA Examples of use
@ a 嘔 賓
8 œ - œː 輪 嚐
a aː
b p
d t
E e - ɛ 幾 些
f f
g k
G ŋ
gw gw
h h
i i - iː - ɪ 僥 思 匿
j dz
k
kw kw
l l
m m
n n
o u - o - ɔ 戊 勞 框
p
q ts
s z
t
u ʊ - uː 籠 固
v yː
w w
y j

Automatic pronunciation module

The automatic pronunciation module is provided to pronounce words that are not in any dictionary.

The automatic pronunciation module supports a wide set of Chinese characters:

  • 19,568 characters that are part of the Unihan database, have a value in the Cantonese field, and belong to the Basic Multilingual Plane (BMP) subset.
  • 2,636 characters that are part of the 2008 revision of the Hong Kong Supplementary Character Set and belong to the Basic Multilingual Plane (BMP) subset.

A complete list of supported characters can be provided upon request.

Below is the statement found in the Unihan database that acknowledges the contribution of the Linguistic Society of Hong Kong to the Cantonese field.
“The jyutping phrase box from the Linguistic Society of Hong Kong. The copyright of the Jyutping phrase box belongs to the Linguistic Society of Hong Kong.
We would like to thank the Jyutping Group of the Linguistic Society of Hong Kong for permission to use the electronic file in our research and/or product development.
Note that the inclusion of the phrase box in the Unihan database requires that any products developed using the Cantonese field needs to include this acknowledgment.”

The web address for the Unihan database is http://unicode.org/charts/unihan.html.
The web address for the Linguistic Society of Hong Kong on the Jyutping Romanization scheme is http://www.lshk.org/cantonese.php.
The web address for the Office of the Government Chief Information Officer on the Hong Kong Supplementary Character Set is http://www.ogcio.gov.hk/ccli/eng/hkscs/introduction.html.