The Extraction and Fine-grained Classification of Written Cantonese Materials through Linguistic Feature Detection
Abstract: This paper presents a linguistically informed, non-machine-learning tool for classifying Written Cantonese, Standard
Written Chinese, and the intermediate varieties used by Cantonese-speaking users from Hong Kong, which are
often grouped into a single “Traditional Chinese” label. Our approach addresses the lack of textual materials for
Cantonese NLP, a consequence of the lower sociolinguistic status of Written Cantonese and of users switching freely
between these varieties without adequate language labeling. The tool uses key lexical markers identified
from past linguistic research to determine whether a segment is Cantonese, Standard Written Chinese, mixed or
unmarked. The task is reduced to string operations, allowing flexible and efficient extraction of high-quality
Cantonese data from large datasets mixed with Standard Written Chinese. Because it bypasses model inference, the
tool can process large amounts of data at low cost, which is particularly significant
for marginalized languages. The tool also aims to provide a baseline measure for future classification systems, and
the approach may be applicable to other low-resource regional or diglossic languages.
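
The marker-based classification described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the two marker lists, the label strings, and the function names classify_segment and extract_cantonese are assumptions chosen for illustration, and the real tool draws its markers from the linguistic literature. The sketch only shows how the four-way decision reduces to substring checks.

```python
# Illustrative marker sets only (assumed for this sketch, not the paper's lists).
CANTONESE_MARKERS = ("嘅", "咗", "哋", "喺", "唔", "冇", "啲", "嗰", "佢", "嘢")
SWC_MARKERS = ("的", "是", "這", "那", "們", "他", "她", "沒")


def classify_segment(segment: str) -> str:
    """Label a segment by which marker sets it contains (plain substring tests)."""
    has_cantonese = any(marker in segment for marker in CANTONESE_MARKERS)
    has_swc = any(marker in segment for marker in SWC_MARKERS)
    if has_cantonese and has_swc:
        return "mixed"
    if has_cantonese:
        return "cantonese"
    if has_swc:
        return "swc"
    # Neither marker set matched, so the segment carries no evidence either way.
    return "unmarked"


def extract_cantonese(segments: list[str]) -> list[str]:
    """Keep only the segments labeled as Cantonese."""
    return [s for s in segments if classify_segment(s) == "cantonese"]


if __name__ == "__main__":
    samples = ["佢哋喺屋企食咗飯", "他們在家裡吃飯", "佢係我的朋友", "香港天氣好"]
    for text in samples:
        print(text, classify_segment(text))
```

Because each check is a substring test, the cost per segment grows with the segment length and the number of markers, and no model needs to be loaded. A practical marker list would also have to handle characters that are ambiguous between the two varieties, which this sketch ignores.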