Efficient Text Analysis with Pre-Trained Neural Network Models

Jia Cui, Heng Lu, Wenjie Wang, Shiyin Kang, Liqiang He, Guangzhi Li, Dong Yu

Published: 2022, Last Modified: 12 May 2023SLT 2022Readers: Everyone

Abstract: This paper investigates the application of pre-trained BERT model in three classic text analysis tasks: Chinese grapheme-to-phoneme(G2P), text normalization(TN) and sentence punctuation annotation. Even though the full-sized BERT has prominent modeling power, there are two challenges for it in real applications: the requirement for annotated training data and the considerable computational cost. In this paper, we propose BERT-based low-latency solutions. To collect sufficient training corpus for G2P, we transfer knowledge from existing rule-based system to BERT through a large amount of unlabeled corpus. The new model could convert all characters directly from raw texts with higher accuracy. We also propose a hybrid two-stage text normalization pipeline which reduces the sentence error rate by 25% compared to the rule-based system. We offer both supervised and weakly supervised versions and find that the latter has only 1% accuracy drop from the former.

0 Replies