Cantonese Hong Kong (cn-HK)

This documentation was updated on November 22, 2023.

Creating grammars

The following subsections describe key issues for working with grammar documents in the Cantonese language.

Character encoding

Nuance Recognizer has full internal Unicode support. Create your grammars using UTF-8. For example, your grammar header might be:

<?xml version='1.0' encoding='UTF-8'?>
<grammar xml:lang="cn-HK" version="1.0" root="test">
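As a rough illustration of the encoding advice, the following Python sketch encodes a grammar document as UTF-8 and parses it back to confirm that the declaration and the xml:lang attribute survive. Only the header attributes come from the example above; the rule body and the helper names encode_grammar and grammar_language are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal grammar: only the header attributes are taken from the
# documentation; the rule content is an illustrative assumption.
GRAMMAR = (
    "<?xml version='1.0' encoding='UTF-8'?>\n"
    '<grammar xml:lang="cn-HK" version="1.0" root="test">\n'
    '  <rule id="test"><one-of><item>岩</item><item>錯</item></one-of></rule>\n'
    "</grammar>\n"
)


def encode_grammar(text: str) -> bytes:
    """Encode the grammar text as UTF-8, matching its encoding declaration."""
    return text.encode("utf-8")


def grammar_language(data: bytes) -> str:
    """Parse UTF-8 grammar bytes and return the value of xml:lang."""
    root = ET.fromstring(data)
    # The xml: prefix is bound to this fixed namespace by the XML specification.
    return root.get("{http://www.w3.org/XML/1998/namespace}lang")


data = encode_grammar(GRAMMAR)
print(grammar_language(data))
```

Writing the returned bytes to disk in binary mode keeps the file content consistent with the declared encoding.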

alphanum_lc built-in grammar

The alphanum_lc built-in grammar recognizes a connected string of up to 20 digits and lower case alphabetic characters. For example, this grammar could be used to recognize a product code or order number.

Valid characters are the English letters of the alphabet (a–z) so callers can speak English characters in addition to Cantonese numbers. The pronunciation of the letter z as the British-style “zed” is recognized, but the American-style “zii” is likely to be misrecognized as the letter c.

Valid digits are 0–9. Although specified as Arabic numbers, callers speak the Cantonese equivalents: 零一二三四五六七八九.

Non-alphanumeric characters such as hyphens (-), dots (.), and underscores (_) are not recognized; if spoken they reduce recognition accuracy.

Return keys/values

MEANING	Contains a string of ISO-8859-1 digits and lowercase letters, with no embedded spaces.
SWI_literal	Contains the exact text that was recognized.

Examples

In the following examples, note that the English letters of the alphabet are allowed. This is done to allow callers to speak English characters in addition to Cantonese.

Caller says	MEANING key
Spaces between digits indicate individually spoken numbers: 零一二三四五六七八九	0123456789
a b c d e f g	abcdefg
a b c 1 e 6 g	abc1e6g
a 一 s 二 d 三 f 四	a1s2d3f4

Here are examples of utterances that do not parse when spoken by callers:

Caller says	Reason for not being recognized
十二	Natural numbers are not recognized with this grammar. Each digit must be spoken individually.
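The mapping from spoken characters to the MEANING key in the tables above can be sketched in Python, for example to generate expected values for test utterances. This is an assumption-laden illustration, not the grammar's implementation; the function name expected_meaning is hypothetical.

```python
CANTONESE_DIGITS = "零一二三四五六七八九"
# Map each Cantonese digit character to its Arabic digit.
_DIGIT_MAP = {ch: str(i) for i, ch in enumerate(CANTONESE_DIGITS)}
_PASS_THROUGH = set("abcdefghijklmnopqrstuvwxyz0123456789")
MAX_LENGTH = 20  # documented limit for the connected string


def expected_meaning(spoken: str) -> str:
    """Return the MEANING value expected for a transcript of spoken characters.

    Raises ValueError for anything other than single digits and lowercase
    letters, for example natural numbers such as 十二.
    """
    out = []
    for ch in spoken:
        if ch.isspace():
            continue  # spaces only separate individually spoken items
        if ch in _DIGIT_MAP:
            out.append(_DIGIT_MAP[ch])
        elif ch in _PASS_THROUGH:
            out.append(ch)
        else:
            raise ValueError(f"not a single digit or lowercase letter: {ch!r}")
    if len(out) > MAX_LENGTH:
        raise ValueError("more than 20 characters")
    return "".join(out)


print(expected_meaning("a 一 s 二 d 三 f 四"))
```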

alphanum built-in grammar

NOTE: This grammar is retained for backward compatibility only. It has been replaced by the alphanum_lc built-in grammar; use alphanum_lc for new implementations.

The alphanum built-in grammar recognizes a connected string of up to 20 digits and upper and lower case alphabetic characters. For example, this grammar could be used to recognize a product code or order number.

Valid digits are 0–9. Although specified as Arabic numbers, callers speak the Cantonese equivalents: 零一二三四五六七八九.

Non-alphanumeric characters such as hyphens (-), dots (.), and underscores (_) are not recognized; if spoken they reduce recognition accuracy.

Return keys/values

MEANING	Contains a string of ISO-8859-1 digits and lowercase letters, with no embedded spaces.
SWI_literal	Contains the exact text that was recognized.

Examples

In the following examples, note that the English letters of the alphabet are allowed. This is done to allow callers to speak English characters in addition to Cantonese.

Caller says	MEANING key
Spaces between digits indicate individually spoken numbers: 零一二三四五六七八九	0123456789
a b c d e f g	abcdefg
a b c 1 e 6 g	abc1e6g
a 一 s 二 d 三 f 四	a1s2d3f4

Here are examples of utterances that do not parse when spoken by callers:

Caller says	Reason for not being recognized
十二	Natural numbers are not recognized with this grammar. Each digit must be spoken individually.

boolean built-in grammar

The boolean grammar collects an affirmative or negative response.

Properties

The y and n parameters let you associate any two touchtone buttons as synonyms for yes and no.

Parameter	Description
y	Desired DTMF digit to be equivalent to 岩 (default = 1)
n	Desired DTMF digit to be equivalent to 錯 (default = 2)

Examples

Caller says…	MEANING key
岩	true
錯	false

ccexpdate built-in grammar

The ccexpdate grammar understands the expiration date on a credit card. Expiration dates are usually a month and a year, and are often embossed on a credit card in the form “mm/yy.” The grammar recognizes variations on the date, for example, December 2005 (二零零五年十二月) and oh four oh five ( 二零零五年四月).

Some credit cards are stamped with a day of the month as well as the month and year; the ccexpdate grammar recognizes these dates as well. However, the only day of the month it recognizes is the last day of a given month, for example, November 30th, 2005 ( 二零零五年十一月三十號). The grammar does not check for leap years: both February 28 and February 29 are recognized, regardless of the given year.

Return keys/values

Upon return, the MEANING key is assigned to the recognized date in YYYYMMDD format, where YYYY is the year, MM is the month, and DD is the day. For example, 20100331 refers to March 31, 2010. The value is the same regardless of whether the caller specified a day of the month or not; the day is always set to the last day of the month. For example, both “oh six three oh oh five” ( 二零零五年六月三十號) and “oh six oh five” ( 二零零五年六月) return 20050630. Note that if the expiration month is February, MMDD is always 0228, regardless of what the caller said or whether or not the expiration year is a leap year.
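The return-value rule can be mirrored in a few lines of Python, which may help when writing test expectations. This is a sketch of the documented behavior only; the function name ccexpdate_meaning is hypothetical.

```python
import calendar


def ccexpdate_meaning(year: int, month: int) -> str:
    """Build the YYYYMMDD value described above for an expiration month.

    The day is always the last day of the month, except that February is
    always reported as the 28th, even in leap years.
    """
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    if month == 2:
        day = 28  # documented: MMDD is always 0228 for February
    else:
        day = calendar.monthrange(year, month)[1]  # number of days in month
    return f"{year:04d}{month:02d}{day:02d}"


print(ccexpdate_meaning(2005, 6))
```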

citizenid built-in grammar

The citizenid grammar understands 8- or 9-character Hong Kong citizen ID numbers:

The 8 character ID has this pattern: LDDDDDDX
The 9 character ID has this pattern: LLDDDDDDX

These are the parts contained in the ID number:

L - letter a-z
D - digits 0-9
X - check sum character 0-9 or a

A description of the check sum calculation process can be found in the header of the source grammar.

Example

Caller says	MEANING key
a b 一二三四五六九	ab1234569
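The authoritative description of the check sum is in the header of the source grammar. As a hedged illustration, the Python sketch below uses the widely published Hong Kong identity card scheme (weights 9 down to 2, modulo 11), which is an assumption here and not taken from the grammar itself; it does reproduce the example above. The function names are hypothetical.

```python
def hkid_check_character(body: str) -> str:
    """Compute the check character for 1 or 2 lowercase letters plus 6 digits.

    Assumed scheme: letters count as a=10 .. z=35, a missing first letter
    counts as 36, and the eight values are weighted 9, 8, ..., 2.
    """
    letters, digits = body[:-6], body[-6:]
    if len(letters) not in (1, 2) or not all("a" <= c <= "z" for c in letters):
        raise ValueError("expected 1 or 2 lowercase letters")
    if len(digits) != 6 or not all(c in "0123456789" for c in digits):
        raise ValueError("expected 6 digits")
    values = [36] if len(letters) == 1 else []
    values += [ord(c) - ord("a") + 10 for c in letters]
    values += [int(d) for d in digits]
    total = sum(w * v for w, v in zip(range(9, 1, -1), values))
    remainder = total % 11
    if remainder == 0:
        return "0"
    if remainder == 1:
        return "a"  # check value 10 is written as the letter a
    return str(11 - remainder)


def is_valid_citizenid(meaning: str) -> bool:
    """Check a MEANING value such as 'ab1234569' against its last character."""
    try:
        return hkid_check_character(meaning[:-1]) == meaning[-1]
    except ValueError:
        return False


print(is_valid_citizenid("ab1234569"))
```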

creditcard built-in grammar

The creditcard grammar understands a caller saying a credit card number, optionally preceding the number with the credit card name, or the words “account number” (账號) or “account” (账户). For example, a caller can say, “visa account number four seven six four…” ( 维萨卡账號四七六四), “mastercard five two seven eight…” ( 万事达卡五二七八), or “three seven three five…” ( 三七三五).

The following card types are allowed by default: Visa, Mastercard, JCB, American Express and DinersClub.

To allow other card types, add the default card tags “visa+mastercard+jcb+amex+dinersclub” plus the additional card types to your grammar load line, joined by + signs. For example:
[credit card grammar]?SWI_vars.typesallowed=mastercard+visa+dinersclub+private+amex+discover+jcb+cup

In addition to the default card types, the source grammar also implements the following card types: Discover (tag: discover) and China UnionPay (tag: cup).
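A small Python helper can assemble the typesallowed value so that the default tags are never dropped by accident. This is only a sketch: the file name creditcard.xml and the function name are hypothetical placeholders for your actual grammar reference.

```python
# Default tags as listed in the documentation above.
DEFAULT_CARD_TAGS = ["visa", "mastercard", "jcb", "amex", "dinersclub"]


def creditcard_grammar_uri(base_uri: str, extra_tags=()) -> str:
    """Append SWI_vars.typesallowed with the default tags plus extra tags."""
    tags = list(DEFAULT_CARD_TAGS)
    for tag in extra_tags:
        if tag not in tags:  # avoid listing a tag twice
            tags.append(tag)
    # The tags are joined with literal + signs, as in the load line example.
    return f"{base_uri}?SWI_vars.typesallowed={'+'.join(tags)}"


# "creditcard.xml" is a hypothetical grammar file name.
print(creditcard_grammar_uri("creditcard.xml", ["discover", "cup"]))
```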

currency built-in grammar

The currency grammar collects currency amounts in Hong Kong dollars (written as 蚊, 元, or 個) and the subunit 毫 (10 Hong Kong cents).

Return keys/values

MEANING	If the speaker does not explicitly mention any unit name (main units 蚊 , 元 , 個 , or subunit 毫 ), then the utterance is interpreted as referring to a main unit amount, that is, " 五 " will be interpreted as 5 Hong Kong Dollar.
SWI_literal	contains the exact text that was recognized.

Examples

Caller says	MEANING
五蚊	HKD5.00
五元	HKD5.00
五蚊零五	HKD5.05
五蚊兩毫半	HKD5.25
五蚊兩毫五	HKD5.25
六十二萬五千四百六十四蚊	HKD625464.00
四十一萬二千五百六十元	HKD412560.00
四十一萬二千五百六十元一毫	HKD412560.10
一蚊	HKD1.00
兩個半	HKD2.50
兩毫半	HKD0.25
兩個兩毫半	HKD2.25
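An application that consumes the MEANING value typically needs the currency code and the amount separately. The Python sketch below assumes the format seen in the examples above (a three-letter code followed by an amount with two decimal places); the function name is hypothetical.

```python
import re
from decimal import Decimal

# Format inferred from the examples: e.g. HKD5.25 or HKD412560.10
_MEANING_RE = re.compile(r"([A-Z]{3})(\d+\.\d{2})")


def parse_currency_meaning(meaning: str):
    """Split a value such as 'HKD5.25' into ('HKD', Decimal('5.25'))."""
    match = _MEANING_RE.fullmatch(meaning)
    if match is None:
        raise ValueError(f"unexpected currency value: {meaning!r}")
    # Decimal avoids binary floating-point rounding of money amounts.
    return match.group(1), Decimal(match.group(2))


print(parse_currency_meaning("HKD5.25"))
```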

date built-in grammar

The date grammar accepts a date spoken in the format of Year - Month - Day.

The grammar also accepts the following common words, and returns specific values:

Caller says	Value Returned
前天	-2
昨天	-1
今天	0
明天	1
後天	2

Examples

Caller says	MEANING key
前天	-2
昨天	-1
今天	0
明天	+1
後天	+2
一號	??????01
十二月四號星期三	????1204
十二月四號	????1204
四號	??????04
二零零一年六月四號	20010604
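The MEANING key can therefore hold a relative day offset, a partially specified date with ? placeholders, or a full date. The Python sketch below shows one possible way for an application to resolve all three forms against a reference date. Filling unknown fields from the reference date is an application policy assumed here, not grammar behavior, and the function name is hypothetical.

```python
from datetime import date, timedelta


def resolve_date_meaning(meaning: str, today: date) -> date:
    """Resolve a date MEANING value to a concrete date.

    Eight-character values are treated as YYYYMMDD, where '????' or '??'
    marks a field the caller did not say. Anything else is treated as a
    signed day offset such as '-2', '0', '1', or '+1'.
    """
    if len(meaning) == 8:
        year = today.year if meaning[:4] == "????" else int(meaning[:4])
        month = today.month if meaning[4:6] == "??" else int(meaning[4:6])
        day = int(meaning[6:8])
        return date(year, month, day)
    return today + timedelta(days=int(meaning))


reference = date(2023, 11, 22)
print(resolve_date_meaning("????1204", reference))
```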

digits built-in grammar

The digits grammar recognizes a continuously spoken string of up to 20 digits (i.e., the caller is not required to pause after each digit).

Valid characters are the digits: 零一二三四五六七八九

Examples

Caller says	MEANING key
零	0
一	1
零一二三四五六七八九	0123456789

Here are examples of utterances that do not parse when spoken by callers:

Caller says	Reason for not being recognized
十	Natural numbers are not recognized by this grammar
十二	Natural numbers are not recognized by this grammar

number built-in grammar

The number grammar recognizes numbers spoken as natural numbers, for example 十二 for 12 (the caller must not speak the individual digits).

Examples

Numbers from -99,999,999.99 to 99,999,999.99 are recognized, but by default the minallowed parameter is set to zero, which limits recognition to non-negative values.

Caller says	MEANING key
十二	12
二十一	21
二十二	22
三十	30
一百零一	101
四百二十	420
三千零二	3002
一萬兩千三百四十五	12345
一百二十三	123
負四	-4
十四點五六	14.56

phone built-in grammar

The phone grammar recognizes telephone numbers (landline and cellular). Optionally, the caller can speak an extension number of up to 4 digits.

This is the phone number coverage list:

3-digit (emergency): 112, 189, 990-999
4-digit (directory assistance): 1083

Landline:

8-digit numbers - starting with 2,3
8-digit numbers - starting with 2,3 - plus 1-to-4-digit extension

Cellular:

8-digit numbers - starting with 5,6,9

Pager numbers:

8-digit numbers - starting with 7

Personal service numbers:

8-digit numbers - starting with 8

Toll-free numbers:

8-digit numbers - starting with 800

The variable SWI_vars.typesallowed can be used to switch on or off the following phone number groups:

Available tags:

landline - landline numbers (with optional extension)
cellular - cellular numbers
special - special numbers
pager - pager numbers
service - personal service numbers
tollfree - toll-free numbers

The following groups are active by default: landline+cellular
Sample settings to only allow one or some groups:

Allow cellular, landline and pager numbers:
phone.xml?SWI_vars.typesallowed=cellular+landline+pager

Examples

Caller says	MEANING key
九九九	999
二三四五六七八九	23456789
二三四五六七八九內線二三四五	23456789x2345

Here are examples of utterances that do not parse when spoken by callers:

Caller says	Reason for not being recognized
五三四五六七八九	53456789 Telephone numbers do not begin with 5.
九九九內線二三四五	999x2345 The '999' number is a special number; it is never followed by extra digits.
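In the examples above, an extension is appended to the number after an x. A minimal Python sketch for separating the two parts of the MEANING value follows; the function name is hypothetical and the format is inferred from the examples only.

```python
def split_phone_meaning(meaning: str):
    """Split a value such as '23456789x2345' into (number, extension).

    The extension is None when the caller did not speak one.
    """
    number, separator, extension = meaning.partition("x")
    if not (number.isascii() and number.isdigit()):
        raise ValueError(f"unexpected phone number: {meaning!r}")
    if separator:
        # Documented limit: an extension has at most 4 digits.
        if not (extension.isascii() and extension.isdigit()) or len(extension) > 4:
            raise ValueError(f"unexpected extension: {meaning!r}")
        return number, extension
    return number, None


print(split_phone_meaning("23456789x2345"))
```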

time built-in grammar

The time grammar recognizes a time of day.

Recognized phrases include:

Times spoken in…	Example
12-hour format	五點
24-hour format	二十三點十五分

Callers can specify 5-minute increments with: 個字

In addition, the grammar recognizes “qualified” times. For example:

Qualifiers	Description
之前	Sets the QUALIFIER key to 'before'.
大約	Sets the QUALIFIER key to 'approx'.
大約五點之前	Not recognized; the grammar does not expect callers to speak qualifiers both before and after the time.

Examples

For each entry, the values returned in the MEANING and QUALIFIER keys are shown. (Not shown are the values of the HOUR, MINUTE and AMPM keys.)

Caller says	MEANING	QUALIFIER
中午	1200p	exact
中午之前	1200p	before
八點半	0830?	exact
夜晚七點一個字	0705p	exact
零晨一點	0100a	exact
二十三點	2300h	exact
一點十分	0110?	exact
一點一個字	0105?	exact
大約一點一個字	0105?	approx
下晝一點一個字	0105p	exact
上晝一點一個字	0105a	exact
一點半	0130?	exact
十二點十分	1210?	exact
中午十二點	1200p	exact
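The MEANING values in this table end with a one-character marker. Judging only from the examples, a appears with morning times, p with afternoon and evening times, h with 24-hour times, and ? when the caller gave no such indication. Under that assumption, the Python sketch below lists the 24-hour clock times an application might consider for a value; the function name is hypothetical.

```python
import re

# HHMM followed by a marker; the marker meanings are inferred from the examples.
_TIME_RE = re.compile(r"([01]\d|2[0-3])([0-5]\d)([aph?])")


def candidate_times(meaning: str):
    """Return the possible (hour, minute) pairs on a 24-hour clock."""
    match = _TIME_RE.fullmatch(meaning)
    if match is None:
        raise ValueError(f"unexpected time value: {meaning!r}")
    hour, minute, marker = int(match.group(1)), int(match.group(2)), match.group(3)
    if marker == "h":
        return [(hour, minute)]  # already a 24-hour time
    if marker == "a":
        return [(hour % 12, minute)]
    if marker == "p":
        return [(hour % 12 + 12, minute)]
    # '?': ambiguous, so both halves of the day remain possible
    return [(hour % 12, minute), (hour % 12 + 12, minute)]


print(candidate_times("0830?"))
```

An application would usually resolve the ambiguous case from dialog context or with a follow-up question.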

Vocabulary items and pronunciations

This chapter describes considerations for vocabularies and their pronunciations in Cantonese (cn-HK).

Cantonese pronunciations

This section provides detailed reference information to help create pronunciation dictionaries. It is intended for people who have sufficient knowledge of the Cantonese language as spoken in Hong Kong. It provides information about transcription and pronunciation.

The Cantonese phoneme system

There are six different types of Cantonese consonants:

Plosives
Fricatives
Affricates
Glides
Nasals
Liquids

Cantonese symbol set grouped by phoneme classes

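Phoneme class	SAMPA	IPA	Examples of use
Consonants (plosives)	b	p	叭	/ba1/
Consonants (plosives)	p	pʰ	判	/pu3n/
Consonants (plosives)	d	t	到	/do3w/
Consonants (plosives)	t	tʰ	填	/ti4n/
Consonants (plosives)	g	k	其	/gE1y/
Consonants (plosives)	k	kʰ	卡	/ka1/
Consonants (fricatives)	f	f	呼	/fu1/
Consonants (fricatives)	s	z	送	/su3G/
Consonants (fricatives)	h	h	效	/ha6w/
Consonants (affricates)	q	ts	初	/qo1/
Consonants (affricates)	j	dz	扎	/ja3t/
Consonants (affricates)	kw	kw	困	/kw@3n/
Consonants (affricates)	gw	gw	廣	/gwo2G/
Consonants (glides)	w	w	熅	/w@1n/
Consonants (glides)	y	j	延	/yi4G/
Consonants (nasals)	m	m	磨	/mo4/
Consonants (nasals)	n	n	哪	/na5/
Consonants (nasals)	G	ŋ	咬	/Ga5w/
Consonants (liquids)	l	l	唎	/li1/
Vowels (single vowels)	a	aː	卡	/ka1/
Vowels (single vowels)	@	a	賓	/b@1n/
Vowels (single vowels)	u	ʊ - uː	籠固	/lu4G/ /gu3/
Vowels (single vowels)	i	i - iː - ɪ	僥思匿	/hi1w/ /si1/ /ni1k/
Vowels (single vowels)	o	o - ɔ - u	勞框戊	/lo4w/ /ho1G/ /mo6w/
Vowels (single vowels)	v	yː	團	/tv4n/
Vowels (single vowels)	8	œ - œː	輪嚐	/l84n/ /s84G/
Vowels (single vowels)	E	e - ɛ	幾些	/gE2y/ /sE1/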

Cantonese consonants

The Cantonese consonant system has:

Six plosives
Three fricatives
Four affricates
Three nasals
Two glides
One liquid

Plosives

There are three aspirated and three unaspirated plosives in Cantonese, which can be arranged in pairs as shown here:

Unaspirated	Examples	Aspirated	Examples
b	叭 /ba1/	p	判 /pu3n/
d	到 /do3w/	t	填 /ti4n/
g	其 /gE1y/	k	卡 /ka1/

/b/ /d/ and /g/ may be realized as voiced stops, but the distinction is really between aspirated / unaspirated, because the voicing is not systematic. Syllable-final plosives are usually unreleased.

Fricatives

There are three fricatives in Cantonese:

f	呼	/fu1/
s	送	/su3G/
h	效	/ha6w/

Affricates

In Cantonese there are two real affricates and two co-articulated consonants:

q	初	/qo1/	kw	困	/kw@3n/
j	扎	/ja3t/	gw	廣	/gwo2G/

/kw/ and /gw/ are not actually affricates but co-articulated consonants because the velar plosives /k/ or /g/ are uttered simultaneously with the glide /w/.

Nasals

There are three nasals in Cantonese:

m	磨	/mo4/
n	哪	/na5/
G	咬	/Ga5w/

/m/ and /G/ can also denote semivowels (as syllable nucleus) whereas /n/ always denotes the alveolar nasal.

Glides

There are two glides in Cantonese:

w	熅	/w@1n/
y	延	/yi4G/

Liquids

There is one liquid in Cantonese:

l	唎	/li1/

Cantonese vowels

Monophthongs

There are eight vowels (monophthongs) in Cantonese.

a	卡	/ka1/
@	賓	/b@1n/
u	籠	/lu4G/	固	/gu3/
i	匿	/ni1k/	思	/si1/
o	框	/ho1G/
v	團	/tv4n/
8	輪	/l84n/	嚐	/s84G/
E	些	/sE1/

Diphthongs

Diphthongs are formed from a sequence of a vowel (monophthong) and a glide.

Note: The jyutping diphthong eoi is transcribed by the phoneme /8/ + tone + /H/

債	/ja3y/	世	/sa3y/
爪	/ja2w/	宙	/ja6w/
僥	/hi1w/
戊	/mo6w/	勞	/lo4w/	做	/jo6w/
在	/jo6y/
幾	/gE2y/	肌	/gE1y/	鍛錘	/dv3nq84H/
招	/ji1w/
嶲	/s85y/
杯	/bu1y/

The Cantonese tone system

Overview

Cantonese is a tone language, which means that a syllable carries different meanings depending on the tone with which it is pronounced. Hence, tone is obligatory to the construction of a syllable.

There are six tones in Cantonese and every syllable must be assigned one of these six tones, otherwise the transcription is invalid.

Note: Syllables that contain only a consonant (syllabic consonants) must also have a tone indicator. In the current language pack, the following syllabic phonemes exist: /G4/, /G5/, /G6/, and /m4/.

The Cantonese tone system is summarized in the following table.

TONE	DESCRIPTION	EXAMPLE
1	High Level (or High Falling)	依
2	High Rising	咦
3	Mid Level	意
4	Low Falling	怡
5	Low Rising	洱
6	Low Level	二
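Because a transcription without a tone on every syllable is invalid, it can be useful to check dictionary entries before loading them. The Python sketch below validates a transcription against the symbol set in this chapter. The syllable shape (optional initial, vowel, tone, optional final) and the set of final consonants are assumptions inferred from the examples, and the function name is hypothetical.

```python
import re

# Symbols taken from the symbol set tables in this chapter.
ONSET = r"(?:kw|gw|[bpdtgkfshqjwymnGl])"
VOWEL = r"[a@uiov8E]"
# Assumption: finals inferred from examples such as /ja3t/, /ni1k/, /q84H/.
CODA = r"[ywHmnGptk]"
# Either a regular syllable or a syllabic consonant such as /G4/ or /m4/.
SYLLABLE = rf"(?:{ONSET}?{VOWEL}[1-6]{CODA}?|[Gm][1-6])"
_TRANSCRIPTION_RE = re.compile(rf"(?:{SYLLABLE})+")


def is_valid_transcription(transcription: str) -> bool:
    """Return True if every syllable uses known symbols and has a tone 1-6."""
    return _TRANSCRIPTION_RE.fullmatch(transcription.strip("/")) is not None


for entry in ("/kw@3n/", "/dv3nq84H/", "/G4/", "/ka/"):
    print(entry, is_valid_transcription(entry))
```

Note that the digit 8 is a vowel symbol, not a tone, so only the digits 1 to 6 are accepted in the tone position.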

Tone sandhi

One common tone phenomenon in Cantonese is tone sandhi, which is the change of tones when syllables are in sequence. That is, a syllable has one of the tones in isolation, and the same syllable may take on a different tone without any change in meaning when it is followed by another syllable.

There are no hard and fast rules on where and when tone sandhi occurs, and there are many exceptions. It occurs mostly on the second character of two-character compound nouns, especially when that character is normally pronounced with tone 4 or 6 (the two tones with the lowest pitch).

Examples

甜甜 | /ti4m//ti2m/
伶伶 | /li4G//li2G/

Please note that tone sandhi does not occur in all two-character compound nouns with tone 4 or 6.

The Cantonese symbol set in alphabetical order

The following table shows the Cantonese symbol set (left column without tone markers) in alphabetical order:

SAMPA	IPA	Examples of use
@	a	嘔賓
8	œ / œː	輪嚐
a	a:	卡
b	p	叭
d	t	到
E	e - ɛ	幾些
f	f	呼
g	k	其
G	ŋ	咬
gw	gw	廣
h	h	效
i	i - i: - ɪ	僥思匿
j	dz	扎
k	kʰ	卡
kw	kw	困
l	l	唎
m	m	磨
n	n	哪
o	u - o - ɔ	戊勞框
p	pʰ	判
q	ts	初
s	z	送
t	tʰ	填
u	ʊ - u:	籠固
v	y:	團
w	w	熅
y	j	延

Automatic pronunciation module

The automatic pronunciation module is provided to pronounce words that are not in any dictionary.

The automatic pronunciation module supports a wide set of Chinese characters:

19,568 characters that are in the Unihan database, have a value in the Cantonese field, and belong to the Basic Multilingual Plane (BMP).
2,636 characters that are in the 2008 revision of the Hong Kong Supplementary Character Set and belong to the Basic Multilingual Plane (BMP).

A complete list of supported characters can be provided upon request.
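Both supported sets are limited to the Basic Multilingual Plane, so one quick sanity check on a vocabulary item is to look for characters outside it. The Python sketch below does only that; it cannot tell whether a BMP character is actually in the supported list. The function name is hypothetical.

```python
def characters_outside_bmp(word: str):
    """Return the characters of word whose code point is above U+FFFF."""
    return [ch for ch in word if ord(ch) > 0xFFFF]


# 廣 and 東 are BMP characters; U+20BB7 lies outside the BMP.
print(characters_outside_bmp("廣東"))
print(characters_outside_bmp("\U00020BB7"))
```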

Below is the statement found in the Unihan database that acknowledges the contribution of the Linguistic Society of Hong Kong to the Cantonese field.
“The jyutping phrase box from the Linguistic Society of Hong Kong. The copyright of the Jyutping phrase box belongs to the Linguistic Society of Hong Kong.
We would like to thank the Jyutping Group of the Linguistic Society of Hong Kong for permission to use the electronic file in our research and/or product development.
Note that the inclusion of the phrase box in the Unihan database requires that any products developed using the Cantonese field needs to include this acknowledgment.”

The web address for the Unihan database is http://unicode.org/charts/unihan.html .
The web address for the Linguistic Society of Hong Kong on the Jyutping Romanization scheme is http://www.lshk.org/cantonese.php .
The web address for the Office of the Government Chief Information Officer on the Hong Kong Supplementary Character Set is http://www.ogcio.gov.hk/ccli/eng/hkscs/introduction.html .
