Abstract: Pre-trained language model-based methods for Chinese Grammatical Error Correction (CGEC) fall into two categories: Seq2Seq and Seq2Edit. However, both Seq2Seq and Seq2Edit models depend heavily on high-quality training data. Given the strong generation and inference abilities of large language models (LLMs), we propose an LLM-guided optimization training method that exploits LLMs to extract error knowledge and optimize the training process of traditional CGEC models. On the one hand, we use error types and confusion sets as extra knowledge to guide LLMs to generate diverse pseudo data, thereby extending the error distribution of our training data. On the other hand, LLMs are used to perform inference over the predictions of our CGEC models and produce re-training data, thus iteratively optimizing our pre-trained CGEC models. Experiments on two benchmark datasets show that our LLM-guided optimization method with small-scale training data achieves results comparable to those of baseline models trained on large-scale data. Detailed comparison experiments demonstrate that both the early devised pseudo data and the later re-training data are extremely useful for optimizing traditional CGEC models, and that they benefit from each other. We will release our code and prompts at https://github.com/SakuraAcedia/llm-cgec-got to facilitate future work.