Abstract: In web search scenarios, erroneous queries frequently degrade the user experience by leading to irrelevant results. This underscores the critical role of Chinese Spelling Check (CSC) systems in maintaining search quality. Conventional approaches typically employ domain-specific models trained on limited corpora. While effective for frequent errors, these models exhibit two key limitations: (1) poor generalization to rare entities in open-domain search, and (2) an inability to adapt to temporal entity variations due to static training paradigms. The advent of Large Language Models (LLMs) offers a potential solution to these problems. However, LLMs suffer from severe over-correction and struggle to handle long-tail entities. To tackle this, we present RACQC, a Chinese Query Correction system with Retrieval-Augmented Generation (RAG) and multi-task learning. Specifically, our approach (1) integrates dynamic knowledge retrieval through entity-centric RAG to handle rare entities, and (2) employs contrastive correction tasks to mitigate the over-correction tendencies of LLMs. Furthermore, we propose MDCQC, a Multi-Domain Chinese Query Correction benchmark, to evaluate models' entity correction capabilities. Extensive experiments on several datasets show that RACQC significantly outperforms existing baselines on CSC tasks.
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: GEC, fine-tuning, NLP datasets
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: Chinese
Submission Number: 2938