Forgetting Word Segmentation in Chinese Text Classification with L1-Regularized Logistic Regression

Qiang Fu, Xinyu Dai, Shujian Huang, Jiajun Chen

Published: 2013, Last Modified: 18 Apr 2024, PAKDD (2) 2013, CC BY-SA 4.0

Abstract: Word segmentation is a common preprocessing step for representing Chinese text when building a text classification system. We have found that a Chinese text representation based on segmented words may lose features that are valuable for classification, regardless of whether the segmentation results are correct. To preserve these features, we propose representing Chinese text with character-based N-grams in a larger-scale feature space. To address the sparsity of N-gram data, we suggest using the L1-regularized logistic regression (L1-LR) model to classify Chinese text, for better generalization and interpretability. The experimental results demonstrate that our proposed method achieves better performance than state-of-the-art methods. Further qualitative analysis also shows that the character-based N-gram representation with L1-LR is reasonable and effective for text classification.
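The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the choice of N-grams up to length 3, the proximal gradient solver, the hyperparameters (`lam`, `step`, `epochs`), the function names, and the four toy documents are all assumptions made for illustration. The sketch counts character N-grams directly from raw text, with no segmenter, and fits a logistic regression whose L1 penalty drives most N-gram weights to exactly zero.

```python
import math
from collections import Counter


def char_ngrams(text, n_max=3):
    """Count every character N-gram of length 1..n_max; no word segmentation."""
    chars = [c for c in text if not c.isspace()]
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(chars) - n + 1):
            counts["".join(chars[i:i + n])] += 1
    return counts


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(text, weights, bias, n_max=3):
    """Probability of the positive class under the sparse linear model."""
    x = char_ngrams(text, n_max)
    return sigmoid(bias + sum(weights.get(f, 0.0) * v for f, v in x.items()))


def train_l1_lr(docs, labels, lam=0.05, step=0.5, epochs=500, n_max=3):
    """Minimize mean log-loss + lam * ||w||_1 by proximal gradient descent."""
    feats = [char_ngrams(d, n_max) for d in docs]
    weights, bias, m = {}, 0.0, len(docs)
    for _ in range(epochs):
        grad, grad_b = Counter(), 0.0
        for x, y in zip(feats, labels):
            z = bias + sum(weights.get(f, 0.0) * v for f, v in x.items())
            err = sigmoid(z) - y
            grad_b += err / m
            for f, v in x.items():
                grad[f] += err * v / m
        bias -= step * grad_b  # the bias term is not penalized
        for f, g in grad.items():
            w = weights.get(f, 0.0) - step * g
            # Soft-thresholding: the proximal step for the L1 penalty.
            w = math.copysign(max(abs(w) - step * lam, 0.0), w)
            if w != 0.0:
                weights[f] = w
            else:
                weights.pop(f, None)  # exact zeros drop out of the model
    return weights, bias


# Hypothetical toy corpus: 1 = sports, 0 = finance.
docs = ["足球比赛很精彩", "篮球比赛结束了", "股票市场上涨", "银行利率下调"]
labels = [1, 1, 0, 0]
weights, bias = train_l1_lr(docs, labels)
all_features = set().union(*(char_ngrams(d) for d in docs))
print(len(weights), "of", len(all_features), "N-gram weights are nonzero")
```

Inspecting the surviving keys of `weights` shows which N-grams the model relies on, which is one way to read the interpretability the abstract mentions. A production setup would more likely use an established solver than this hand-written loop.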