Abstract: This paper proposes an approach to Chinese sentence tokenization in which word segmentation and text normalization are performed simultaneously within the framework of Viterbi decoding. The process identifies not only lexical words but also the new word classes. The approach is shown to be practical for sentence tokenization in n-gram statistical language modeling.
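The idea of segmenting and normalizing in one decoding pass can be illustrated with a minimal sketch. This is not the authors' implementation: the lexicon, its probabilities, the `<NUM>` class token, the unknown-character penalty, and the function names are all hypothetical, and a unigram model stands in for the n-gram model mentioned in the abstract. Each position in the sentence proposes lexical words and a class token for digit runs, and a Viterbi-style dynamic program keeps the best-scoring path.

```python
import math
import re

# Hypothetical toy lexicon with unigram probabilities (illustrative values only).
LEXICON = {"我": 0.2, "有": 0.2, "个": 0.1, "苹果": 0.1, "苹": 0.01, "果": 0.02,
           "<NUM>": 0.05}
UNKNOWN_LOGP = math.log(1e-6)   # assumed penalty for an out-of-lexicon character
MAX_WORD_LEN = 4                # assumed maximum length of a lexical word
NUM_RE = re.compile(r"[0-9]+")  # digit runs are normalized to the class <NUM>


def candidates(sentence, start):
    """Yield (end, token, log_prob) for every token that may begin at start."""
    m = NUM_RE.match(sentence, start)
    if m:
        # Normalization step: the whole digit run becomes one class token.
        yield m.end(), "<NUM>", math.log(LEXICON["<NUM>"])
    for end in range(start + 1, min(len(sentence), start + MAX_WORD_LEN) + 1):
        piece = sentence[start:end]
        if piece in LEXICON:
            yield end, piece, math.log(LEXICON[piece])
    # Fallback so that every sentence has at least one path.
    yield start + 1, sentence[start], UNKNOWN_LOGP


def tokenize(sentence):
    """Return the highest-scoring token sequence under the unigram model."""
    n = len(sentence)
    # best[i] = (score of best path ending at i, previous position, token)
    best = [(-math.inf, None, None)] * (n + 1)
    best[0] = (0.0, None, None)
    for start in range(n):
        score = best[start][0]
        if score == -math.inf:
            continue
        for end, token, logp in candidates(sentence, start):
            if score + logp > best[end][0]:
                best[end] = (score + logp, start, token)
    # Follow the back-pointers from the end of the sentence.
    tokens, pos = [], n
    while pos > 0:
        _, prev, token = best[pos]
        tokens.append(token)
        pos = prev
    return tokens[::-1]


print(tokenize("我有3个苹果"))
```

Because the class token competes with lexical words inside the same search, segmentation and normalization are decided jointly rather than in separate passes. Extending the score to bigrams would require tracking the previous token in the state.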
External IDs: dblp:conf/iscslp/0001LB98