Read the Docs Before Rewriting: Equip Rewriter with Domain Knowledge via Continual Pre-training

ACL ARR 2025 February Submission 7621 Authors

16 Feb 2025 (modified: 09 May 2025), ACL ARR 2025 February Submission, CC BY 4.0
Abstract: A Retrieval-Augmented Generation (RAG)-based question-answering (QA) system enhances a large language model's knowledge by retrieving documents relevant to the user's query. Discrepancies between user queries and document phrasings often necessitate query rewriting. In specialized domains, however, the rewriter model may struggle due to limited domain-specific knowledge. To address this, we propose the R&R (Read the Docs before Rewriting) rewriter, which is continually pre-trained on professional documents, much as students prepare for open-book exams by reviewing their textbooks. R&R can further be combined with supervised fine-tuning for improved results. Experiments on multiple datasets demonstrate that R&R excels at professional QA, effectively bridging the query-document gap while maintaining good performance in general scenarios, thus advancing the application of RAG-based QA systems in specialized fields.
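The abstract describes a rewriter model that reformulates the user query before retrieval. As a rough illustration of where such a rewriter sits in a RAG pipeline, here is a minimal sketch; the function names and signatures are hypothetical placeholders, not the paper's implementation.

```python
# Hypothetical sketch of a RAG QA pipeline with a query-rewriting step.
# All component names are placeholders, not the paper's actual code.
from typing import Callable, List

def rag_answer(
    query: str,
    rewrite: Callable[[str], str],              # rewriter LM (e.g., continually pre-trained on domain docs)
    retrieve: Callable[[str], List[str]],       # retriever over the document collection
    generate: Callable[[str, List[str]], str],  # answer-generating LLM
) -> str:
    """Rewrite the user query to match document phrasing, then retrieve and answer."""
    rewritten = rewrite(query)    # bridge the query-document vocabulary gap
    docs = retrieve(rewritten)    # fetch relevant passages using the rewritten query
    return generate(query, docs)  # ground the answer in the retrieved context
```

The key design point the paper targets is the `rewrite` step: a general-purpose rewriter may lack the domain vocabulary needed to produce retrieval-friendly queries, which continual pre-training on the professional documents is meant to supply.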
Paper Type: Long
Research Area: Question Answering
Research Area Keywords: Query Rewriting, Retrieval-Augmented Generation, Large Language Models, Question Answering
Languages Studied: English, Chinese
Submission Number: 7621