Abstract: In web search scenarios, erroneous queries frequently degrade the user experience by leading to irrelevant results. This underscores the critical role of Chinese Spelling Check (CSC) systems in maintaining search quality. Conventional approaches typically employ domain-specific models trained on limited corpora. While effective for frequent errors, these models exhibit two key limitations: (1) poor generalization to rare entities in open-domain search, and (2) an inability to adapt to temporal entity variations due to static training paradigms. The advent of Large Language Models (LLMs) offers a potential solution to these problems. However, LLMs suffer from severe over-correction and struggle to handle long-tail entities. To tackle this, we present RACQC, a Chinese Query Correction system with Retrieval-Augmented Generation (RAG) and multi-task learning. Specifically, our approach (1) integrates dynamic knowledge retrieval through entity-centric RAG to handle rare entities, and (2) employs contrastive correction tasks to mitigate the over-correction tendencies of LLMs. Furthermore, we propose MDCQC, a Multi-Domain Chinese Query Correction benchmark, to evaluate models' entity correction capabilities. Extensive experiments on several datasets show that RACQC significantly outperforms existing baselines on CSC tasks.
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: GEC, fine-tuning, NLP datasets
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: Chinese
Submission Number: 2938