Investigating Continual Pretraining in Large Language Models: Insights and Implications

ACL ARR 2024 June Submission3071 Authors

15 Jun 2024 (modified: 02 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Continual learning (CL) in large language models (LLMs) is an evolving domain that focuses on developing strategies for efficient and sustainable training. Our primary emphasis is on \emph{continual domain-adaptive pretraining}, a process designed to equip LLMs with the ability to integrate new information from various domains while retaining previously learned knowledge and enhancing cross-domain knowledge transfer without relying on domain-specific identification. Unlike previous studies, which mostly concentrate on a limited selection of tasks or domains and primarily aim to address the issue of forgetting, our research evaluates the adaptability and capabilities of LLMs to changing data landscapes in practical scenarios. To this end, we introduce a new benchmark designed to measure the adaptability of LLMs to these evolving data environments, offering a comprehensive framework for evaluation. We examine the impact of model size on learning efficacy and forgetting, as well as how the progression and similarity of emerging domains affect the knowledge transfer within these models. Our findings uncover several key insights: (i) performance improves only if the adaptation corpora match the original pretraining scale, (ii) smaller models are particularly sensitive to continual pretraining, showing the most significant rates of both forgetting and learning, (iii) when the sequence of domains shows semantic similarity, continual pretraining enables LLMs to specialize better compared to stand-alone pretraining, and (iv) fine-tuning performance on standard benchmarks is indeed influenced by continual pretraining domains. We posit that our research marks a shift towards establishing a more realistic benchmark for investigating CL in LLMs.
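To make the setting concrete, the sketch below illustrates the general idea of continual domain-adaptive pretraining as described in the abstract: a single model is further pretrained on a sequence of domain corpora, without domain identifiers, so each new domain continues from the weights learned on earlier ones. This is a minimal illustration, not the authors' implementation; the model choice ("gpt2"), the domain list, the synthetic corpus helper, and all hyperparameters are assumptions for demonstration only.

```python
# Minimal sketch of continual domain-adaptive pretraining (illustrative, not the paper's code).
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

def load_domain_corpus(domain: str, tokenizer):
    # Illustrative stand-in for a real domain corpus: a few tokenized sentences.
    texts = [f"Sample {domain} document number {i}." for i in range(32)]
    enc = tokenizer(texts, padding="max_length", truncation=True,
                    max_length=32, return_tensors="pt")
    return enc["input_ids"]

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token          # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

domains = ["biomedical", "legal", "finance"]       # assumed domain sequence

for domain in domains:                             # domains arrive one after another
    loader = DataLoader(load_domain_corpus(domain, tokenizer), batch_size=8, shuffle=True)
    model.train()
    for batch in loader:                           # standard causal-LM objective per domain
        batch = batch.to(device)
        loss = model(input_ids=batch, labels=batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    # After finishing a domain, one would evaluate perplexity on all earlier domains
    # to quantify forgetting and backward/forward knowledge transfer.
```

The key property shown here is that the optimizer state and model weights carry over across domains, with no replay buffer or domain label; evaluating on earlier domains after each stage is what surfaces forgetting and transfer effects.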
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: transfer learning / domain adaptation, continual learning
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 3071