Nonparametric Bayesian Semi-supervised Word Segmentation

25 Oct 2022, OpenReview Archive Direct Upload, Readers: Everyone
Abstract: This paper presents a novel hybrid generative/discriminative model of word segmentation based on nonparametric Bayesian methods. Unlike ordinary discriminative word segmentation, which relies only on labeled data, our semi-supervised model also leverages large amounts of unlabeled text to automatically learn new “words”, and further constrains them by using labeled data to segment non-standard texts such as those found in social networking services. Specifically, our hybrid model combines a discriminative classifier (CRF; Lafferty et al. (2001)) and unsupervised word segmentation (NPYLM; Mochihashi et al. (2009)), with a transparent exchange of information between these two model structures within the semi-supervised framework (JESS-CM; Suzuki and Isozaki (2008)). We confirmed that it can appropriately segment non-standard texts like those in Twitter and Weibo and has nearly state-of-the-art accuracy on standard datasets in Japanese, Chinese, and Thai.