SelfAug: Mitigating Catastrophic Forgetting in Retrieval-Augmented Generation via Distribution Self-Alignment

ACL ARR 2025 May Submission975 Authors

16 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Recent advancements in large language models (LLMs) have revolutionized natural language processing through their remarkable capabilities in understanding and executing diverse tasks. While supervised fine-tuning, particularly in Retrieval-Augmented Generation (RAG) scenarios, has proven effective for enhancing task-specific performance, it often leads to catastrophic forgetting, where models lose their previously acquired knowledge and general capabilities. Existing solutions either require access to general instruction data or face limitations in preserving the model's original distribution. To overcome these limitations, we propose SelfAug, a novel self-distribution alignment method. By aligning distributions through the logits of input sequences, SelfAug preserves the model's semantic distribution, thereby simultaneously mitigating catastrophic forgetting and improving downstream task performance. Through extensive experiments, we show that SelfAug achieves a better balance between downstream task learning and the retention of general capabilities compared to existing methods. Our comprehensive empirical analysis reveals a direct correlation between distribution shifts and the severity of catastrophic forgetting in RAG scenarios, particularly highlighting how the absence of RAG capabilities in general instruction tuning leads to significant distribution shifts during fine-tuning.
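To make the abstract's core idea concrete, below is a minimal PyTorch sketch of a self-distribution alignment loss: the standard SFT cross-entropy is combined with a KL term that keeps the fine-tuned model's logits on the input-sequence positions close to those of the frozen original model. This is not the authors' released implementation; the exact loss form, the `kl_weight` hyperparameter, and the use of the Hugging Face `-100` label-masking convention are assumptions made for illustration.

```python
# Sketch of a SelfAug-style training objective (illustrative, not the official code).
import torch
import torch.nn.functional as F

def self_distribution_alignment_loss(model, ref_model, input_ids, attention_mask,
                                      labels, kl_weight=0.1):
    """SFT cross-entropy plus logit-level alignment on the input sequence."""
    # Standard supervised fine-tuning loss on the response tokens.
    out = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
    sft_loss = out.loss

    # Logits of the frozen original model on the same sequence.
    with torch.no_grad():
        ref_logits = ref_model(input_ids=input_ids, attention_mask=attention_mask).logits

    # Align distributions only on input (prompt/context) positions, i.e. positions
    # masked out of the SFT loss (label == -100) that are real tokens.
    input_pos = (labels == -100) & attention_mask.bool()
    kl = F.kl_div(
        F.log_softmax(out.logits[input_pos], dim=-1),
        F.softmax(ref_logits[input_pos], dim=-1),
        reduction="batchmean",
    )
    return sft_loss + kl_weight * kl
```

The intent of the sketch is only to show where the alignment signal comes from (the model's own logits on the input sequence) and how it can be added to the downstream objective without any external general-instruction data.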
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: fine-tuning, retrieval-augmented generation
Contribution Types: NLP engineering experiment
Languages Studied: English, Chinese
Submission Number: 975