Towards Efficient and No Forgetting Domain Continual Pretraining by Mitigating the Stability Gap

ICLR 2025 Conference Submission 13381 Authors

28 Sept 2024 (modified: 13 Oct 2024) · ICLR 2025 Conference Submission · CC BY 4.0
Keywords: Continual pretraining
Abstract: Adapting Large Language Models (LLMs) to specialized domains such as medicine and law through domain continual pre-training has become the cutting-edge approach. However, contrary to the expectation of immediate gains, we uncover a surprising phenomenon: a temporary performance drop at the start of continual pre-training, followed by a recovery phase. This drop is not only unexpected but remarkably consistent across different model sizes and domains, such as medicine and law. To gain a deeper understanding of this issue, we introduce the concept of the stability gap, originally proposed for visual models learning new classes, to explain this initial drop in LLM performance. Based on this concept, we hypothesize that the initial performance drop arises from instability in the model's general abilities, which our experiments further validate. We further reveal that this initial instability is closely tied to training settings that involve distribution shifts. To address this instability and enhance LLM performance within a fixed compute budget, we propose one training strategy that reduces instability by increasing the number of training epochs, along with two data sampling strategies focused on data quality and corpus distribution. We conduct extensive experiments on Llama-family models to validate the effectiveness of our strategies in both medical and legal continual pre-training and instruction tuning. For example, our strategies improve the average medical task performance of the OpenLlama-3B model from 36.2% to 40.7% with only 40% of the original training budget, and enhance average general task performance without causing forgetting. Furthermore, we apply our strategies to continually pre-train and instruction-tune the Llama-3-8B model. The resulting model, Llama-3-Physician, achieves the best medical performance among current open-source models and performs comparably to or even better than GPT-4 on several medical benchmarks.
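The three strategies sketched in the abstract (more epochs within a fixed budget, data-quality filtering, and corpus-distribution mixing) can be illustrated with a short sketch. The following Python snippet is a minimal illustration only, not the authors' implementation: the helper structure (documents as dicts with hypothetical "quality" and "n_tokens" fields), the mixing ratio, and the epoch count are placeholder assumptions.

```python
import random

def build_continual_pretraining_corpus(domain_docs, general_docs,
                                        token_budget, epochs=4,
                                        general_ratio=0.2):
    """Illustrative sketch of budget-constrained corpus construction
    for domain continual pre-training.

    - Rank domain documents by a (hypothetical) quality score and keep
      the highest-quality subset that fits the per-epoch budget.
    - Mix in a fraction of general-domain text so the training
      distribution stays closer to the original pre-training corpus.
    - Revisit the resulting subset for several epochs instead of doing
      a single pass over the full domain corpus.
    """
    per_epoch_tokens = token_budget // epochs
    domain_share = int(per_epoch_tokens * (1 - general_ratio))

    # Placeholder quality criterion; the paper's actual scorer may differ.
    ranked = sorted(domain_docs, key=lambda d: d["quality"], reverse=True)

    subset, used = [], 0
    for doc in ranked:
        if used + doc["n_tokens"] > domain_share:
            break
        subset.append(doc)
        used += doc["n_tokens"]

    # Fill the remaining per-epoch budget with general-domain documents.
    general_share = per_epoch_tokens - used
    random.shuffle(general_docs)
    mixed, g_used = list(subset), 0
    for doc in general_docs:
        if g_used + doc["n_tokens"] > general_share:
            break
        mixed.append(doc)
        g_used += doc["n_tokens"]

    # The same mixed subset is traversed `epochs` times during training.
    return mixed, epochs
```

Under these assumptions, the total number of tokens seen (subset size times epochs) stays within the fixed compute budget while the repeated, higher-quality, distribution-matched subset is what the abstract credits with reducing the initial instability.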
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 13381