Abstract: Continual learning (CL) in large language models (LLMs) is an evolving domain that focuses on developing efficient and sustainable training strategies to adapt models to emerging knowledge and achieve robustness in dynamic environments. Our primary emphasis is on continual domain-adaptive pretraining, a process designed to equip LLMs with the ability to integrate new information from various domains while retaining previously learned knowledge. Since existing works concentrate mostly on continual fine-tuning for a limited selection of downstream tasks or training domains, we introduce a new benchmark designed to measure the adaptability of LLMs to changing pretraining data landscapes. We further examine the impact of model size on learning efficacy and forgetting, as well as how the progression and similarity of emerging domains affect the knowledge transfer within these models.
Our findings uncover several key insights: (i) continual pretraining consistently improves the <1.5B-parameter models studied in this work and is also superior to domain adaptation; (ii) larger models always achieve better perplexity than smaller ones when continually pretrained on the same corpus; (iii) smaller models are particularly sensitive to continual pretraining, showing the most significant rates of both learning and forgetting; (iv) continual pretraining boosts the downstream task performance of the GPT-2 family; (v) continual pretraining enables LLMs to specialize better when the sequence of training domains is semantically similar, whereas randomizing the domain order leads to better transfer and final performance otherwise. We posit that our research establishes a new benchmark for CL in LLMs, providing a more realistic evaluation of knowledge retention and transfer across diverse domains.
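To make the learning/forgetting bookkeeping behind findings (ii) and (iii) concrete, here is a minimal illustrative sketch, not the paper's actual metric definitions: it assumes a hypothetical perplexity matrix `ppl[i][j]` (perplexity on held-out data from domain `j`, measured after continual pretraining stage `i`) and derives simple learning and forgetting scores from it.

```python
# Illustrative only: hypothetical perplexity bookkeeping for continual pretraining.
# ppl[i][j] = perplexity on the held-out set of domain j after training stage i
# (stage i means the model has been continually pretrained on domains 0..i).
import numpy as np

ppl = np.array([
    [12.0, 30.0, 28.0],   # after stage 0 (trained on domain 0)
    [14.5, 11.0, 26.0],   # after stage 1 (then domain 1)
    [16.0, 13.5, 10.5],   # after stage 2 (then domain 2)
])

num_stages, num_domains = ppl.shape

# "Learning": how much each stage improves the model on the domain it just saw,
# relative to the perplexity before that stage.
learning = [ppl[i - 1, i] - ppl[i, i] for i in range(1, num_stages)]

# "Forgetting": how much perplexity on an earlier domain degrades by the end of
# training, relative to its best (lowest) value observed along the sequence.
forgetting = [ppl[-1, j] - ppl[:, j].min() for j in range(num_domains - 1)]

print("per-stage learning (perplexity drop on the new domain):", learning)
print("final forgetting (perplexity rise on earlier domains):", forgetting)
```

Comparing such per-stage improvements and end-of-training degradations across model sizes is the kind of analysis the abstract's findings summarize.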
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Added Appendix A3 to address:
- Clarify the procedure used for constructing similar-order training sequences (see "How did we order domains?" paragraph)
- Add a new experiment with an alternative similar-order variant (see "Additional similar order experiments" paragraph)
- Add a detailed table listing the domain sequences (see the remaining paragraphs)
Added Figure 11 to address:
- Introduce a new baseline experiment using fully mixed domain data (non-continual i.i.d. training)
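As background for this baseline (an illustration only, not the authors' implementation), the contrast between continual pretraining and the fully mixed i.i.d. run reduces to how examples from the different domains are ordered during training. The sketch below uses made-up domain names, documents, and function names.

```python
# Illustrative only: sequential (continual) vs. fully mixed (i.i.d.) ordering of
# pretraining examples drawn from several domains. Domain contents are dummies.
import random

domains = {
    "wiki":  [f"wiki_doc_{k}" for k in range(4)],
    "bio":   [f"bio_doc_{k}" for k in range(4)],
    "legal": [f"legal_doc_{k}" for k in range(4)],
}

def continual_stream(domains, seed=0):
    """One domain at a time, shuffled within each domain (continual setup)."""
    rng = random.Random(seed)
    for name, docs in domains.items():
        docs = docs[:]
        rng.shuffle(docs)
        for doc in docs:
            yield name, doc

def mixed_iid_stream(domains, seed=0):
    """All domains pooled and shuffled together (non-continual i.i.d. baseline)."""
    rng = random.Random(seed)
    pool = [(name, doc) for name, docs in domains.items() for doc in docs]
    rng.shuffle(pool)
    yield from pool

print(list(continual_stream(domains))[:6])   # blocks of one domain at a time
print(list(mixed_iid_stream(domains))[:6])   # domains interleaved at random
```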
Updated Sections 2 and 3 by moving all the notation to a new "Metric expressed explicitly" paragraph at the end of Section 3 and rephrasing certain passages to address:
- Clarify domain-adaptive pretraining (DAPT) baseline implementation details
- Revise and improve the explanation of evaluation metrics in Section 3
- Fix overloaded notation issues
- Add training details (more explanation of the baseline implementations)
Added a new opening paragraph to Section 4.6 to address:
- Provide a clearer justification and explanation for the rank-based analysis in Section 4.6
Updated the last paragraph of Section 4.6 to address:
- Highlight results of the rank-based knowledge accumulation analysis using GPT2-M
Added Subsection A.4 to address:
- Discuss learning rate choices in more detail
We realized that the following change would push our submission beyond 12 pages of main content, so we will check with the action editor before making it:
- Move key figures (e.g., Figures 11, 12, and 14) from the appendix into the main paper
We unfortunately did not have the resources to perform the following experiments:
- Consider adding additional domains via subsampling from domains >5GB
- Explore a parameter averaging experiment
Assigned Action Editor: ~Elahe_Arani1
Submission Number: 4174