Part-of-Speech and Confusion-Set Constrained Language Model for Vietnamese Spelling Correction Corpus Construction

Published: 01 Jan 2024, Last Modified: 17 May 2025, Venue: NLPCC (4) 2024, License: CC BY-SA 4.0
Abstract: Supervised spelling error correction models have achieved strong performance on high-resource languages. However, these models are difficult to apply directly to Vietnamese spelling correction because of corpus scarcity. To address this issue, we first construct a basic high-quality Vietnamese Spelling Correction (ViSC) corpus via automatic speech recognition (ASR) generation and human annotation. Then, we propose a part-of-speech and confusion-set double-constrained method that mimics the practical error distribution and uses both constraints as external knowledge to guide large language models (LLMs) in constructing diverse pseudo data. Finally, we use the pseudo corpora to pre-train, and the ViSC corpus to fine-tune, spelling error correction models. Experiments on the benchmark dataset show that our proposed corpus construction method consistently outperforms various baselines, yielding state-of-the-art results for all spelling correction models enhanced with Vietnamese-specific pre-trained language models. Detailed analysis demonstrates that the part-of-speech and confusion-set constraints are complementary, and that both are important for keeping corpus generation stable and diverse. In-depth comparison experiments reveal that proper utilization of the pseudo corpus is essential for improving Vietnamese spelling error correction. We release our code and constructed corpus at https://github.com/DarkFanta3y/VSEC_corpus to facilitate future research.
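To make the double constraint concrete, the following is a minimal sketch of how a part-of-speech filter and a confusion set could jointly restrict where and how errors are injected into a clean sentence. It is a rule-based simplification for illustration, not the authors' method: the abstract describes using these constraints to guide LLM generation, which is not reproduced here. The function name `make_pseudo_pair`, the tag labels, the toy confusion entries, and the `error_rate` parameter are all assumptions introduced for this example.

```python
import random

# Hypothetical toy confusion set: each token maps to plausible misspellings.
# A real confusion set would be derived from observed errors (assumption).
CONFUSION_SET = {
    "sách": ["xách"],  # s/x substitution, used here only as an illustration
    "tôi": ["tui"],
}

# Hypothetical set of part-of-speech tags that are allowed to be corrupted.
ERROR_PRONE_POS = {"N", "V"}


def make_pseudo_pair(tokens, pos_tags, confusion_set, allowed_pos, error_rate, rng):
    """Return (noisy_tokens, clean_tokens) for one sentence.

    A token is replaced only if BOTH constraints hold:
      1. its POS tag is in `allowed_pos` (part-of-speech constraint), and
      2. it has at least one entry in `confusion_set` (confusion-set constraint).
    `error_rate` is the per-token probability of corrupting an eligible token.
    """
    if len(tokens) != len(pos_tags):
        raise ValueError("tokens and pos_tags must have the same length")

    noisy = []
    for token, tag in zip(tokens, pos_tags):
        candidates = confusion_set.get(token, [])
        eligible = tag in allowed_pos and bool(candidates)
        if eligible and rng.random() < error_rate:
            noisy.append(rng.choice(candidates))
        else:
            noisy.append(token)
    return noisy, list(tokens)


# Example: "tôi" has a confusion entry but its tag "P" is not allowed,
# "đọc" has an allowed tag but no confusion entry, "sách" satisfies both.
rng = random.Random(0)
noisy, clean = make_pseudo_pair(
    ["tôi", "đọc", "sách"],
    ["P", "V", "N"],
    CONFUSION_SET,
    ERROR_PRONE_POS,
    error_rate=1.0,
    rng=rng,
)
```

With `error_rate=1.0`, only the token that satisfies both constraints is changed, which shows why the two constraints are complementary: each one alone would admit substitutions that the other rules out. The resulting (noisy, clean) pairs are the kind of pseudo data that could be used for pre-training before fine-tuning on a human-annotated corpus.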