Semi-Supervised Chinese Word Segmentation Using Partial-Label Learning With Conditional Random Fields

Fan Yang, Paul Vozila

2014 (modified: 04 Sept 2019), EMNLP 2014

Abstract: Online web data encodes rich knowledge. For example, punctuation and entity tags in Wikipedia data define some of the word boundaries in a sentence. In this paper we adopt partial-label learning with conditional random fields to make use of this valuable knowledge for semi-supervised Chinese word segmentation. The basic idea of partial-label learning is to optimize a cost function that marginalizes the probability mass in the constrained space that encodes this knowledge. By integrating domain adaptation techniques such as EasyAdapt, our result reaches an F-measure of 95.98% on the CTB-6 corpus, a significant improvement over both the supervised baseline and a previously proposed approach, namely constrained decode.
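
The partial-label objective described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two-tag scheme (B = begins a word, I = inside a word), the toy scores, and all function names are assumptions made for illustration. For a linear-chain CRF, the log-likelihood of a partially labeled sentence is the log of the probability mass of all tag sequences consistent with the constraints, computed here as the difference between a constrained and an unconstrained forward pass.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))); returns -inf if every v is -inf."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_partition(emissions, transitions, allowed):
    """Forward algorithm restricted to tag sequences permitted by `allowed`.

    emissions[t][k]   : score of tag k at position t
    transitions[j][k] : score of moving from tag j to tag k
    allowed[t]        : set of tags permitted at position t
    """
    num_tags = len(transitions)
    alpha = [emissions[0][k] if k in allowed[0] else -math.inf
             for k in range(num_tags)]
    for t in range(1, len(emissions)):
        alpha = [
            emissions[t][k]
            + logsumexp([alpha[j] + transitions[j][k] for j in range(num_tags)])
            if k in allowed[t] else -math.inf
            for k in range(num_tags)
        ]
    return logsumexp(alpha)

def partial_label_log_likelihood(emissions, transitions, allowed):
    """log P(tag sequence lies in the constrained space) under the CRF."""
    every_tag = set(range(len(transitions)))
    unconstrained = [every_tag] * len(emissions)
    return (log_partition(emissions, transitions, allowed)
            - log_partition(emissions, transitions, unconstrained))

# Hypothetical tag ids: B = 0, I = 1.
B, I = 0, 1

# Toy three-character sentence with all scores set to zero (assumed values).
emissions = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
transitions = [[0.0, 0.0], [0.0, 0.0]]

# Assumed constraints: characters 1 and 3 are known to start a word
# (for example, they follow punctuation); character 2 is unconstrained.
allowed = [{B}, {B, I}, {B}]

# With uniform scores, 2 of the 8 tag sequences satisfy the constraints,
# so the value equals log(2/8).
example_ll = partial_label_log_likelihood(emissions, transitions, allowed)
```

When every position allows exactly one tag, this reduces to the usual supervised CRF log-likelihood; when every position allows all tags, it is zero and the sentence contributes no training signal.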
