Cantonese Hong Kong (cn-HK)
This documentation was updated on November 22, 2023.
Creating grammars
The following subsections describe key issues for working with grammar documents in the Cantonese language.
Character encoding
Nuance Recognizer has full internal Unicode support. Create your grammars using UTF-8. For example, your grammar header might be:
<?xml version='1.0' encoding='UTF-8'?>
<grammar xml:lang="cn-HK" version="1.0" root="test">
alphanum_lc built-in grammar
The alphanum_lc built-in grammar recognizes a connected string of up to 20 digits and lower case alphabetic characters. For example, this grammar could be used to recognize a product code or order number.
Valid characters are the English letters of the alphabet (a–z) so callers can speak English characters in addition to Cantonese numbers. The pronunciation of the letter z as the British-style “zed” is recognized, but the American-style “zii” is likely to be misrecognized as the letter c.
Valid digits are 0–9. Although specified as Arabic numbers, callers speak the Cantonese equivalents: 零 一 二 三 四 五 六 七 八 九.
Non-alphanumeric characters such as hyphens (-), dots (.), and underscores (_) are not recognized; if spoken they reduce recognition accuracy.
Return keys/values
Key | Description |
---|---|
MEANING | Contains a string of ISO-8859-1 digits and lowercase letters, with no embedded spaces. |
SWI_literal | Contains the exact text that was recognized. |
Examples
In the following examples, note that the English letters of the alphabet are allowed. This is done to allow callers to speak English characters in addition to Cantonese.
Caller says | MEANING key |
---|---|
Spaces between digits indicate individually spoken numbers: 零 一 二 三 四 五 六 七 八 九 | 0123456789 |
a b c d e f g | abcdefg |
a b c 1 e 6 g | abc1e6g |
a 一 s 二 d 三 f 四 | a1s2d3f4 |
Here are examples of utterances that do not parse when spoken by callers:
Caller says | Reason for not being recognized |
---|---|
十二 | Natural numbers are not recognized with this grammar. Each digit must be spoken individually. |
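The mapping from spoken tokens to the MEANING key can be sketched as a small test helper. This is only an illustration of the tables above, not part of the Recognizer: the function name `expected_meaning` is hypothetical, and the lower bound of one character is an assumption.

```python
# Cantonese digit characters listed above, mapped to Arabic digits 0-9.
CANTONESE_DIGITS = {ch: str(i) for i, ch in enumerate("零一二三四五六七八九")}
MAX_LENGTH = 20  # the grammar accepts up to 20 digits and letters


def expected_meaning(utterance: str) -> str:
    """Return the MEANING value expected for a space-separated utterance.

    Hypothetical helper for test scripts. Raises ValueError for any token
    that is not a single letter a-z, a digit 0-9, or a Cantonese digit.
    """
    out = []
    for token in utterance.split():
        if token in CANTONESE_DIGITS:
            out.append(CANTONESE_DIGITS[token])
        elif len(token) == 1 and (token in "0123456789" or "a" <= token <= "z"):
            out.append(token)
        else:
            raise ValueError(f"not a single digit or letter: {token!r}")
    # Assumption: at least one character must be spoken.
    if not 1 <= len(out) <= MAX_LENGTH:
        raise ValueError("expected between 1 and 20 characters")
    return "".join(out)
```

For example, `expected_meaning("a 一 s 二 d 三 f 四")` yields `a1s2d3f4`, while `十二` raises an error because it is a natural number rather than individually spoken digits.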
alphanum built-in grammar
**NOTE:** This grammar is provided for backward compatibility only. It has been replaced by the alphanum_lc grammar but remains available. For new implementations, use the alphanum_lc built-in grammar.
The alphanum built-in grammar recognizes a connected string of up to 20 digits and upper and lower case alphabetic characters. For example, this grammar could be used to recognize a product code or order number.
Valid characters are the English letters of the alphabet (a–z) so callers can speak English characters in addition to Cantonese numbers. The pronunciation of the letter z as the British-style “zed” is recognized, but the American-style “zii” is likely to be misrecognized as the letter c.
Valid digits are 0–9. Although specified as Arabic numbers, callers speak the Cantonese equivalents: 零 一 二 三 四 五 六 七 八 九.
Non-alphanumeric characters such as hyphens (-), dots (.), and underscores (_) are not recognized; if spoken they reduce recognition accuracy.
Return keys/values
Key | Description |
---|---|
MEANING | Contains a string of ISO-8859-1 digits and lowercase letters, with no embedded spaces. |
SWI_literal | Contains the exact text that was recognized. |
Examples
In the following examples, note that the English letters of the alphabet are allowed. This is done to allow callers to speak English characters in addition to Cantonese.
Caller says | MEANING key |
---|---|
Spaces between digits indicate individually spoken numbers: 零 一 二 三 四 五 六 七 八 九 | 0123456789 |
a b c d e f g | abcdefg |
a b c 1 e 6 g | abc1e6g |
a 一 s 二 d 三 f 四 | a1s2d3f4 |
Here are examples of utterances that do not parse when spoken by callers:
Caller says | Reason for not being recognized |
---|---|
十二 | Natural numbers are not recognized with this grammar. Each digit must be spoken individually. |
boolean built-in grammar
The boolean grammar collects an affirmative or negative response.
Properties
The y and n parameters let you associate any two touchtone buttons as synonyms for yes and no.
Parameter | Description |
---|---|
y | Desired DTMF digit to be equivalent to 岩 (default = 1) |
n | Desired DTMF digit to be equivalent to 錯 (default = 2) |
Examples
Caller says… | MEANING key |
---|---|
岩 | true |
錯 | false |
ccexpdate built-in grammar
The ccexpdate grammar understands the expiration date on a credit card. Expiration dates are usually a month and a year, and are often embossed on a credit card in the form “mm/yy.” The grammar recognizes variations on the date, for example, December 2005 (二 零 零 五 年 十 二 月) and oh four oh five ( 二 零 零 五 年 四 月).
Some credit cards are stamped with a day of the month as well as the month and year; the ccexpdate grammar recognizes these dates as well. However, the only day of the month it recognizes is the last day of a given month, for example, November 30th, 2005 ( 二 零 零 五 年 十 一 月 三 十 號). The grammar does not check for leap years: both February 28 and February 29 are recognized, regardless of the given year.
Return keys/values
Upon return, the MEANING key is assigned to the recognized date in YYYYMMDD format, where YYYY is the year, MM is the month, and DD is the day. For example, 20100331 refers to March 31, 2010. The value is the same regardless of whether the caller specified a day of the month or not; the day is always set to the last day of the month. For example, both “oh six three oh oh five” ( 二 零 零 五 年 六 月 三 十 號) and “oh six oh five” ( 二 零 零 五 年 六 月) return 20050630. Note that if the expiration month is February, MMDD is always 0228, regardless of what the caller said or whether or not the expiration year is a leap year.
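The return-value rule above can be sketched in a few lines. The function name `ccexpdate_meaning` is hypothetical and serves only to illustrate how the YYYYMMDD value is derived from a year and month.

```python
import calendar


def ccexpdate_meaning(year: int, month: int) -> str:
    """Build the YYYYMMDD value described above (hypothetical helper)."""
    if not 1 <= month <= 12:
        raise ValueError("month must be 1-12")
    if month == 2:
        # February is always reported as 0228, even in leap years.
        day = 28
    else:
        # Last day of the given month.
        day = calendar.monthrange(year, month)[1]
    return f"{year:04d}{month:02d}{day:02d}"
```

With this sketch, June 2005 gives `20050630` and March 2010 gives `20100331`, matching the examples in the text.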
citizenid built-in grammar
The citizenid grammar understands 8- or 9-character Hong Kong citizen ID numbers:
- The 8 character ID has this pattern: LDDDDDDX
- The 9 character ID has this pattern: LLDDDDDDX
These are the parts contained in the ID number:
L - letter a-z
D - digits 0-9
X - check sum character 0-9 or a
A description of the check sum calculation process can be found in the header of the source grammar.
Example
Caller says | MEANING key |
---|---|
a b 一 二 三 四 五 六 九 | ab1234569 |
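An application that receives the MEANING key may want to confirm that it has one of the two shapes described above. The following is a minimal sketch under that assumption. The names are hypothetical, and the check covers only the character pattern; it does not verify the check sum, whose calculation is documented in the source grammar.

```python
import re

# One or two letters, six digits, then a check character (0-9 or "a").
CITIZENID_PATTERN = re.compile(r"[a-z]{1,2}[0-9]{6}[0-9a]")


def looks_like_citizenid(meaning: str) -> bool:
    """Return True if the string matches LDDDDDDX or LLDDDDDDX."""
    return CITIZENID_PATTERN.fullmatch(meaning) is not None
```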
creditcard built-in grammar
The creditcard grammar understands a caller saying a credit card number, optionally preceding the number with the credit card name, or the words “account number” (账號) or “account” (账户). For example, a caller can say, “visa account number four seven six four…” ( 维萨 卡 账號 四 七 六 四), “mastercard five two seven eight…” ( 万事达 卡 五 二 七 八), or “three seven three five…” ( 三 七 三 五).
The following card types are allowed by default: Visa, Mastercard, JCB, American Express and DinersClub.
To allow other card types, append them to the default card tags (“visa+mastercard+jcb+amex+dinersclub”) on your grammar load line, joining all tags with + signs. For example:
[credit card grammar]?SWI_vars.typesallowed=mastercard+visa+dinersclub+private+amex+discover+jcb+cup
In addition to the default card types, the source grammar also implements Discover (tag: discover) and China UnionPay (tag: cup).
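Building the load line by hand is error-prone, so a small helper can assemble it. This is a sketch only: the function name and the file name `creditcard.xml` in the usage note are hypothetical placeholders for your actual credit card grammar reference.

```python
from typing import Sequence

# Default card tags, corresponding to the card types allowed by default.
DEFAULT_CARD_TAGS = ["visa", "mastercard", "jcb", "amex", "dinersclub"]


def with_types_allowed(grammar_uri: str, extra_tags: Sequence[str]) -> str:
    """Append SWI_vars.typesallowed with the default tags plus extra tags."""
    tags = list(DEFAULT_CARD_TAGS)
    for tag in extra_tags:
        if tag not in tags:  # avoid listing a tag twice
            tags.append(tag)
    return f"{grammar_uri}?SWI_vars.typesallowed={'+'.join(tags)}"
```

Calling `with_types_allowed("creditcard.xml", ["discover", "cup"])` produces a load line that keeps the five default tags and adds Discover and China UnionPay.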
currency built-in grammar
The currency grammar collects currency amounts in Hong Kong dollars (written as 蚊, 元, or 個) and the subunit 毫 (10 Hong Kong cents).
Return keys/values
Key | Description |
---|---|
MEANING | Contains the recognized amount. If the speaker does not explicitly mention any unit name (main units 蚊, 元, 個, or subunit 毫), the utterance is interpreted as a main unit amount; that is, “五” is interpreted as 5 Hong Kong dollars. |
SWI_literal | Contains the exact text that was recognized. |
Examples
Caller says | MEANING |
---|---|
五 蚊 | HKD5.00 |
五 元 | HKD5.00 |
五 蚊 零 五 | HKD5.05 |
五 蚊 兩 毫 半 | HKD5.25 |
五 蚊 兩 毫 五 | HKD5.25 |
六 十 二 萬 五 千 四 百 六 十 四 蚊 | HKD625464.00 |
四 十 一 萬 二 千 五 百 六 十 元 | HKD412560.00 |
四 十 一 萬 二 千 五 百 六 十 元 一 毫 | HKD412560.10 |
一 蚊 | HKD1.00 |
兩 個 半 | HKD2.50 |
兩 毫 半 | HKD0.25 |
兩 個 兩 毫 半 | HKD2.25 |
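The MEANING values in the table combine a currency code with a decimal amount. A minimal sketch of splitting such a value follows; the function name is hypothetical, and the pattern (three uppercase letters followed by an amount with two decimal places) is inferred from the examples above rather than from a formal specification.

```python
import re
from decimal import Decimal
from typing import Tuple

# Shape inferred from the examples: currency code, then amount with 2 decimals.
CURRENCY_PATTERN = re.compile(r"([A-Z]{3})([0-9]+\.[0-9]{2})")


def parse_currency_meaning(meaning: str) -> Tuple[str, Decimal]:
    """Split a value such as 'HKD5.25' into ('HKD', Decimal('5.25'))."""
    match = CURRENCY_PATTERN.fullmatch(meaning)
    if match is None:
        raise ValueError(f"unexpected currency value: {meaning!r}")
    return match.group(1), Decimal(match.group(2))
```

Using `Decimal` rather than `float` keeps amounts such as 0.25 exact when they are later added or compared.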
date built-in grammar
The date grammar accepts a date spoken in the format of Year - Month - Day.
The grammar also accepts the following common words, and returns specific values:
Caller says | Value Returned |
---|---|
前 天 | -2 |
昨 天 | -1 |
今 天 | 0 |
明 天 | 1 |
後 天 | 2 |
Examples
Caller says | MEANING key |
---|---|
前 天 | -2 |
昨 天 | -1 |
今 天 | 0 |
明 天 | +1 |
後 天 | +2 |
一 號 | ??????01 |
十 二 月 四 號 星 期 三 | ????1204 |
十 二 月 四 號 | ????1204 |
四 號 | ??????04 |
二 零 零 一 年 六 月 四 號 | 20010604 |
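An application typically has to turn these MEANING values into a concrete calendar date. The sketch below shows one way to do that. It is an assumption, not documented behavior, that unknown year or month fields (`?`) should be filled from a reference day; the function name is hypothetical. Relative values are accepted with or without a leading sign, since both forms appear in the tables above.

```python
import re
from datetime import date, timedelta


def resolve_date_meaning(meaning: str, today: date) -> date:
    """Resolve a date MEANING value against a reference day."""
    # Relative day offsets such as -2, 0, 1, or +2.
    if re.fullmatch(r"[+-]?[0-9]", meaning):
        return today + timedelta(days=int(meaning))
    if not re.fullmatch(r"[0-9?]{8}", meaning):
        raise ValueError(f"unexpected date value: {meaning!r}")
    # Assumption: fill fully unknown fields from the reference day.
    year = today.year if meaning[:4] == "????" else int(meaning[:4])
    month = today.month if meaning[4:6] == "??" else int(meaning[4:6])
    day = int(meaning[6:8])
    return date(year, month, day)
```

A real dialog might instead prompt the caller for the missing year or month; filling from the current date is only the simplest policy.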
digits built-in grammar
The digits grammar recognizes a continuously spoken string of up to 20 digits (i.e., the caller is not required to pause after each digit).
Valid characters are the digits: 零一二三四五六七八九
Examples
Caller says | MEANING key |
---|---|
零 | 0 |
一 | 1 |
零 一 二 三 四 五 六 七 八 九 | 0123456789 |
Here are examples of utterances that do not parse when spoken by callers:
Caller says | Reason for not being recognized |
---|---|
十 | Natural numbers are not recognized by this grammar |
十 二 | Natural numbers are not recognized by this grammar |
number built-in grammar
The number grammar recognizes numeric values spoken as natural numbers, for example 十 二 for 12 (the caller must not spell out a whole number digit by digit).
Examples
Numbers from -99,999,999.99 to 99,999,999.99 are recognized, but by default the minallowed parameter is set to zero, which limits recognition to non-negative values.
Caller says | MEANING key |
---|---|
十 二 | 12 |
二 十 一 | 21 |
二 十 二 | 22 |
三 十 | 30 |
一 百 零 一 | 101 |
四 百 二 十 | 420 |
三 千 零 二 | 3002 |
一 萬 兩 千 三 百 四 十 五 | 12345 |
一 百 二 十 三 | 123 |
負 四 | -4 |
十 四 點 五 六 | 14.56 |
phone built-in grammar
The phone grammar recognizes telephone numbers (landline and cellular). Optionally, the caller can speak an extension number of up to 4 digits.
This is the phone number coverage list:
- 3-digit (emergency): 112, 189, 990-999
- 4-digit (directory assistance): 1083
Landline:
- 8-digit numbers - starting with 2,3
- 8-digit numbers - starting with 2,3 - plus 1-to-4-digit extension
Cellular:
- 8-digit numbers - starting with 5,6,9
Pager numbers:
- 8-digit numbers - starting with 7
Personal service numbers:
- 8-digit numbers - starting with 8
Toll-free numbers:
- 8-digit numbers - starting with 800
The variable SWI_vars.typesallowed can be used to switch on or off the following phone number groups:
Available tags:
- landline - landline numbers (with optional extension)
- cellular - cellular numbers
- special - special numbers
- pager - pager numbers
- service - personal service numbers
- tollfree - toll-free numbers
The following groups are active by default: landline+cellular
Sample settings to only allow one or some groups:
Allow cellular, landline and pager numbers:
phone.xml?SWI_vars.typesallowed=cellular+landline+pager
Examples
Caller says | MEANING key |
---|---|
九 九 九 | 999 |
二 三 四 五 六 七 八 九 | 23456789 |
二 三 四 五 六 七 八 九 內 線 二 三 四 五 | 23456789x2345 |
Here are examples of utterances that do not parse when spoken by callers:
Caller says | Reason for not being recognized |
---|---|
五 三 四 五 六 七 八 九 | 53456789 Telephone numbers do not begin with 5. |
九 九 九 內 線 二 三 四 五 | 999x2345 The “999” number is a special number; it is never followed by extra digits. |
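As the examples show, an extension is appended to the number after an `x` separator in the MEANING key. A minimal sketch of splitting the two parts follows; the function name is hypothetical, and the pattern is inferred from the example values above.

```python
import re
from typing import Optional, Tuple

# Digits, optionally followed by "x" and an extension of 1 to 4 digits.
PHONE_PATTERN = re.compile(r"([0-9]+)(?:x([0-9]{1,4}))?")


def split_phone_meaning(meaning: str) -> Tuple[str, Optional[str]]:
    """Split a value such as '23456789x2345' into number and extension."""
    match = PHONE_PATTERN.fullmatch(meaning)
    if match is None:
        raise ValueError(f"unexpected phone value: {meaning!r}")
    # group(2) is None when no extension was spoken.
    return match.group(1), match.group(2)
```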
time built-in grammar
The time grammar recognizes a time of day.
Recognized phrases include:
Times spoken in… | Example |
---|---|
12-hour format | 五 點 |
24-hour format | 二 十 三 點 十 五 分 |
Callers can specify 5-minute increments with: 個 字
In addition, the grammar recognizes “qualified” times. For example:
Qualifiers | Description |
---|---|
之 前 | Sets the QUALIFIER key to “before”. |
大 約 | Sets the QUALIFIER key to “approx”. |
大 約 五 點 之 前 | Not recognized; the grammar does not expect callers to speak qualifiers before and after the time. |
Examples
For each entry, the values returned in the MEANING and QUALIFIER keys are shown. (Not shown are the values of the HOUR, MINUTE and AMPM keys.)
Caller says | MEANING | QUALIFIER |
---|---|---|
中 午 | 1200p | exact |
中 午 之 前 | 1200p | before |
八 點 半 | 0830? | exact |
夜 晚 七 點 一 個 字 | 0705p | exact |
零 晨 一 點 | 0100a | exact |
二 十 三 點 | 2300h | exact |
一 點 十 分 | 0110? | exact |
一 點 一 個 字 | 0105? | exact |
大 約 一 點 一 個 字 | 0105? | approx |
下 晝 一 點 一 個 字 | 0105p | exact |
上 晝 一 點 一 個 字 | 0105a | exact |
一 點 半 | 0130? | exact |
十 二 點 十 分 | 1210? | exact |
中 午 十 二 點 | 1200p | exact |
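The MEANING values above consist of four digits (hour and minute) followed by a one-character suffix. The sketch below splits such a value. The function name is hypothetical, and the reading of the suffixes (`a` for morning, `p` for afternoon or evening, `h` for 24-hour time, `?` for a time whose half of the day was not stated) is inferred from the example table rather than taken from a formal definition.

```python
import re
from typing import Tuple

# Suffix meanings as inferred from the examples (assumption).
SUFFIXES = {"a": "am", "p": "pm", "h": "24-hour", "?": "ambiguous"}

TIME_PATTERN = re.compile(r"([0-9]{2})([0-9]{2})([aph?])")


def parse_time_meaning(meaning: str) -> Tuple[int, int, str]:
    """Split a value such as '0830?' into (hour, minute, suffix meaning)."""
    match = TIME_PATTERN.fullmatch(meaning)
    if match is None:
        raise ValueError(f"unexpected time value: {meaning!r}")
    hour, minute, suffix = match.groups()
    return int(hour), int(minute), SUFFIXES[suffix]
```

An application that receives an ambiguous value such as `0830?` would usually ask a follow-up question to determine morning or evening.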
Vocabulary items and pronunciations
This chapter describes considerations for vocabularies and their pronunciations in Cantonese (cn-HK).
Cantonese pronunciations
This section provides detailed reference information to help create pronunciation dictionaries. It is intended for people who have sufficient knowledge of the Cantonese language as spoken in Hong Kong. It provides information about transcription and pronunciation.
The Cantonese phoneme system
There are six different types of Cantonese consonants:
- Plosives
- Fricatives
- Affricates
- Glides
- Nasals
- Liquids
Cantonese symbol set grouped by phoneme classes
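Type | Phoneme class | SAMPA | IPA | Examples of use | Transcription |
---|---|---|---|---|---|
Consonants | Plosives | b | p | 叭 | /ba1/ |
Consonants | Plosives | p | pʰ | 判 | /pu3n/ |
Consonants | Plosives | d | t | 到 | /do3w/ |
Consonants | Plosives | t | tʰ | 填 | /ti4n/ |
Consonants | Plosives | g | k | 其 | /gE1y/ |
Consonants | Plosives | k | kʰ | 卡 | /ka1/ |
Consonants | Fricatives | f | f | 呼 | /fu1/ |
Consonants | Fricatives | s | z | 送 | /su3G/ |
Consonants | Fricatives | h | h | 效 | /ha6w/ |
Consonants | Affricates | q | ts | 初 | /qo1/ |
Consonants | Affricates | j | dz | 扎 | /ja3t/ |
Consonants | Affricates | kw | kw | 困 | /kw@3n/ |
Consonants | Affricates | gw | gw | 廣 | /gwo2G/ |
Consonants | Glides | w | w | 熅 | /w@1n/ |
Consonants | Glides | y | j | 延 | /yi4G/ |
Consonants | Nasals | m | m | 磨 | /mo4/ |
Consonants | Nasals | n | n | 哪 | /na5/ |
Consonants | Nasals | G | ŋ | 咬 | /Ga5w/ |
Consonants | Liquids | l | l | 唎 | /li1/ |
Vowels | Single vowels | a | aː | 卡 | /ka1/ |
Vowels | Single vowels | @ | a | 賓 | /b@1n/ |
Vowels | Single vowels | u | ʊ - uː | 籠 固 | /lu4G/ /gu3/ |
Vowels | Single vowels | i | i - iː - ɪ | 僥 思 匿 | /hi1w/ /si1/ /ni1k/ |
Vowels | Single vowels | o | o - ɔ - u | 勞 框 戊 | /lo4w/ /ho1G/ /mo6w/ |
Vowels | Single vowels | v | yː | 團 | /tv4n/ |
Vowels | Single vowels | 8 | œ - œː | 輪 嚐 | /l84n/ /s84G/ |
Vowels | Single vowels | E | e - ɛ | 幾 些 | /gE2y/ /sE1/ |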
Cantonese consonants
The Cantonese consonant system has:
- Six plosives
- Three fricatives
- Four affricates
- Three nasals
- Two glides
- One liquid
Plosives
There are three aspirated and three unaspirated plosives in Cantonese, which can be arranged in pairs as shown here:
Unaspirated | Example | Transcription | Aspirated | Example | Transcription |
---|---|---|---|---|---|
b | 叭 | /ba1/ | p | 判 | /pu3n/ |
d | 到 | /do3w/ | t | 填 | /ti4n/ |
g | 其 | /gE1y/ | k | 卡 | /ka1/ |
/b/, /d/, and /g/ may be realized as voiced stops, but because voicing is not systematic, the real distinction is between aspirated and unaspirated plosives. Syllable-final plosives are usually unreleased.
Fricatives
There are three fricatives in Cantonese:
SAMPA | Example | Transcription |
---|---|---|
f | 呼 | /fu1/ |
s | 送 | /su3G/ |
h | 效 | /ha6w/ |
Affricates
In Cantonese there are two real affricates and two co-articulated consonants:
Affricate | Example | Transcription | Co-articulated consonant | Example | Transcription |
---|---|---|---|---|---|
q | 初 | /qo1/ | kw | 困 | /kw@3n/ |
j | 扎 | /ja3t/ | gw | 廣 | /gwo2G/ |
/kw/ and /gw/ are not actually affricates but co-articulated consonants because the velar plosives /k/ or /g/ are uttered simultaneously with the glide /w/.
Nasals
There are three nasals in Cantonese:
SAMPA | Example | Transcription |
---|---|---|
m | 磨 | /mo4/ |
n | 哪 | /na5/ |
G | 咬 | /Ga5w/ |
/m/ and /G/ can also serve as syllabic consonants (forming the syllable nucleus), whereas /n/ always denotes the alveolar nasal.
Glides
There are two glides in Cantonese:
SAMPA | Example | Transcription |
---|---|---|
w | 熅 | /w@1n/ |
y | 延 | /yi4G/ |
Liquids
There is one liquid in Cantonese:
SAMPA | Example | Transcription |
---|---|---|
l | 唎 | /li1/ |
Cantonese vowels
Monophthongs
There are eight vowels (monophthongs) in Cantonese.
SAMPA | Example | Transcription | Example | Transcription |
---|---|---|---|---|
a | 卡 | /ka1/ | | |
@ | 賓 | /b@1n/ | | |
u | 籠 | /lu4G/ | 固 | /gu3/ |
i | 匿 | /ni1k/ | 思 | /si1/ |
o | 框 | /ho1G/ | | |
v | 團 | /tv4n/ | | |
8 | 輪 | /l84n/ | 嚐 | /s84G/ |
E | 些 | /sE1/ | | |
Diphthongs
Diphthongs are formed from a sequence of a vowel (monophthong) and a glide.
Note: The Jyutping diphthong eoi is transcribed as the phoneme /8/ + tone + /H/.
Example | Transcription | Example | Transcription | Example | Transcription |
---|---|---|---|---|---|
債 | /ja3y/ | 世 | /sa3y/ | | |
爪 | /ja2w/ | 宙 | /ja6w/ | | |
僥 | /hi1w/ | | | | |
戊 | /mo6w/ | 勞 | /lo4w/ | 做 | /jo6w/ |
在 | /jo6y/ | | | | |
幾 | /gE2y/ | 肌 | /gE1y/ | 鍛錘 | /dv3nq84H/ |
招 | /ji1w/ | | | | |
嶲 | /s85y/ | | | | |
杯 | /bu1y/ | | | | |
The Cantonese tone system
Overview
Cantonese is a tone language, which means that a syllable carries different meanings depending on the tone with which it is pronounced. Hence, tone is obligatory to the construction of a syllable.
There are six tones in Cantonese and every syllable must be assigned one of these six tones, otherwise the transcription is invalid.
Note: Syllables that contain only a consonant (syllabic consonants) must also have a tone indicator. In the current language pack, the following syllabic phonemes exist: /G4/, /G5/, /G6/, and /m4/.
The Cantonese tone system is summarized in the following table.
TONE | DESCRIPTION | EXAMPLE |
---|---|---|
1 | High Level | 依 |
2 | High Rising | 咦 |
3 | Mid Level | 意 |
4 | Falling | 怡 |
5 | Low Rising | 洱 |
6 | Low Level | 二 |
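Because a transcription without a tone is invalid, it can be useful to check single-syllable entries before adding them to a pronunciation dictionary. The following is a rough sketch, not the Recognizer's own validation. The function name is hypothetical, the onset and vowel sets come from the symbol tables in this document, and the set of allowed final consonants is an assumption inferred from the example transcriptions. Note that `8` is a vowel symbol, so only the digits 1 to 6 count as tones.

```python
import re

# onset (optional) + vowel, or a syllabic consonant; then tone 1-6; then
# an optional final consonant (set inferred from the examples: assumption).
SYLLABLE = re.compile(
    r"(?:(?:kw|gw|[bpdtgkfshqjwymnGl])?[a@uiov8E]|[Gm])"
    r"[1-6]"
    r"[ptkmnGwyH]?"
)


def is_valid_syllable(transcription: str) -> bool:
    """Check one syllable written without slashes, such as 'pu3n'."""
    return SYLLABLE.fullmatch(transcription) is not None
```

For example, `pu3n`, `l84n`, and the syllabic `G4` pass, while `ka` (no tone) and `ka7` (no such tone) are rejected. Multi-syllable transcriptions such as /dv3nq84H/ would need to be split into syllables first, which this sketch does not attempt.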
Tone sandhi
One common tone phenomenon in Cantonese is tone sandhi, which is the change of tones when syllables are in sequence. That is, a syllable has one of the tones in isolation, and the same syllable may take on a different tone without any change in meaning when it is followed by another syllable.
There are no hard and fast rules for where and when tone sandhi occurs, and there are many exceptions. However, it occurs mostly on the second character of two-character compound nouns, especially when this character normally carries tone 4 or 6 (the two tones with the lowest pitch).
Examples
甜 甜 | /ti4m//ti2m/
伶 伶 | /li4G//li2G/
Please note that tone sandhi does not occur in all two-character compound nouns with tone 4 or 6.
The Cantonese symbol set in alphabetical order
The following table shows the Cantonese symbol set (left column without tone markers) in alphabetical order:
SAMPA | IPA | Examples of use |
---|---|---|
@ | a | 嘔 賓 |
8 | œ / œː | 輪 嚐 |
a | a: | 卡 |
b | p | 叭 |
d | t | 到 |
E | e - ɛ | 幾 些 |
f | f | 呼 |
g | k | 其 |
G | ŋ | 咬 |
gw | gw | 廣 |
h | h | 效 |
i | i - i: - ɪ | 僥 思 匿 |
j | dz | 扎 |
k | kʰ | 卡 |
kw | kw | 困 |
l | l | 唎 |
m | m | 磨 |
n | n | 哪 |
o | u - o - ɔ | 戊 勞 框 |
p | pʰ | 判 |
q | ts | 初 |
s | z | 送 |
t | tʰ | 填 |
u | ʊ - u: | 籠 固 |
v | y: | 團 |
w | w | 熅 |
y | j | 延 |
Automatic pronunciation module
The automatic pronunciation module is provided to pronounce words that are not in any dictionary.
The automatic pronunciation module supports a wide set of Chinese characters:
- 19,568 characters that have a value in the Cantonese field of the Unihan database and belong to the Basic Multilingual Plane (BMP).
- 2,636 characters that belong to the 2008 revision of the Hong Kong Supplementary Character Set and to the Basic Multilingual Plane (BMP).
A complete list of supported characters can be provided upon request.
Below is the statement found in the Unihan database that acknowledges the contribution of the Linguistic Society of Hong Kong to the Cantonese field.
“The jyutping phrase box from the Linguistic Society of Hong Kong. The copyright of the Jyutping phrase box belongs to the Linguistic Society of Hong Kong.
We would like to thank the Jyutping Group of the Linguistic Society of Hong Kong for permission to use the electronic file in our research and/or product development.
Note that the inclusion of the phrase box in the Unihan database requires that any products developed using the Cantonese field needs to include this acknowledgment.”
The web address for the Unihan database is http://unicode.org/charts/unihan.html .
The web address for the Linguistic Society of Hong Kong on the Jyutping Romanization scheme is http://www.lshk.org/cantonese.php .
The web address for the Office of the Government Chief Information Officer on the Hong Kong Supplementary Character Set is http://www.ogcio.gov.hk/ccli/eng/hkscs/introduction.html .