PyCantonese: Cantonese Linguistics and NLP in Python

Published: 01 Jan 2022, Last Modified: 07 Feb 2025. Venue: LREC 2022. License: CC BY-SA 4.0
Abstract: This paper introduces PyCantonese, an open-source Python library for Cantonese linguistics and natural language processing. It first describes the library's design, implementation, corpus data format, and key included datasets, and then gives an overview of the currently implemented functionality: stop words, Jyutping romanization handling, word segmentation, part-of-speech tagging, and parsing of Cantonese text.
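To make "Jyutping romanization handling" concrete, the following is a minimal, standard-library-only sketch of splitting a Jyutping string into onset, nucleus, coda, and tone. It is an illustration written independently of PyCantonese and should not be read as the library's implementation or API; the names `JyutpingSyllable` and `parse_jyutping_sketch` are hypothetical, and the sound inventory encoded in the regular expressions is a simplified assumption.

```python
import re
from typing import List, NamedTuple


class JyutpingSyllable(NamedTuple):
    """Hypothetical container for this sketch (not a PyCantonese class)."""
    onset: str
    nucleus: str
    coda: str
    tone: str


# Regular syllable: optional onset, vowel nucleus, optional coda, tone 1-6.
# The inventory below is a simplified assumption for illustration.
_REGULAR = re.compile(
    r"^(gw|kw|ng|[bpmfdtnlgkhwzcsj])?"
    r"(aa|oe|eo|yu|[aeiou])"
    r"(ng|[ptkmniu])?"
    r"([1-6])$"
)
# Syllabic nasal, e.g. "m4" or "ng5", optionally preceded by "h".
_SYLLABIC = re.compile(r"^(h)?(m|ng)([1-6])$")


def parse_jyutping_sketch(text: str) -> List[JyutpingSyllable]:
    """Split a Jyutping string into syllables and their components."""
    compact = "".join(text.split())  # ignore whitespace between syllables
    chunks = re.findall(r"[a-z]+[1-6]", compact)
    if "".join(chunks) != compact:
        raise ValueError(f"not a well-formed Jyutping string: {text!r}")

    syllables = []
    for chunk in chunks:
        match = _REGULAR.match(chunk)
        if match:
            onset, nucleus, coda, tone = (g or "" for g in match.groups())
        else:
            match = _SYLLABIC.match(chunk)
            if not match:
                raise ValueError(f"unrecognized syllable: {chunk!r}")
            onset, nucleus, tone = (g or "" for g in match.groups())
            coda = ""
        syllables.append(JyutpingSyllable(onset, nucleus, coda, tone))
    return syllables


if __name__ == "__main__":
    for syllable in parse_jyutping_sketch("gwong2dung1waa2"):
        print(syllable)
```

The regular pattern is tried first and the syllabic-nasal pattern second, so that strings such as "ngo5" are read as onset plus vowel rather than as a syllabic nasal.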