Simple and Scalable Strategies to Continually Pre-train Large Language Models

Adam Ibrahim; Benjamin Thérien; Kshitij Gupta; Mats Leon Richter; Quentin Gregory Anthony; Eugene Belilovsky; Timothée Lesort; Irina Rish

Published: 08 Jul 2024, Last Modified: 17 Sept 2024. Accepted by TMLR. License: CC BY 4.0

Abstract: Large language models (LLMs) are routinely pre-trained on billions of tokens, only to start the process over again once new data becomes available. A much more efficient solution is to continually pre-train these models—saving significant compute compared to re-training. However, the distribution shift induced by new data typically results in degraded performance on previous data or poor adaptation to the new data. In this work, we show that a simple and scalable combination of learning rate (LR) re-warming, LR re-decaying, and replay of previous data is sufficient to match the performance of fully re-training from scratch on all available data, as measured by the final loss and the average score on several language model (LM) evaluation benchmarks. Specifically, we show this for a weak but realistic distribution shift between two commonly used LLM pre-training datasets (English$\rightarrow$English) and a stronger distribution shift (English$\rightarrow$German) at the $405$M parameter model scale with large dataset sizes (hundreds of billions of tokens). Selecting the weak but realistic shift for larger-scale experiments, we also find that our continual learning strategies match the re-training baseline for a 10B parameter LLM. Our results demonstrate that autoregressive transformer-based LLMs can be successfully updated via simple and scalable continual learning strategies, matching the re-training baseline using only a fraction of the compute. Finally, inspired by previous work, we propose alternatives to the cosine learning rate schedule that help circumvent forgetting induced by LR re-warming and that are not bound to a fixed token budget.
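The three ingredients named in the abstract (LR re-warming, LR re-decaying, and replay of previous data) can be illustrated with a minimal sketch. This is not the authors' implementation, which lives in the GPT-NeoX repository. The function names, the linear re-warming from the minimum LR, and the per-sample replay sampling are all assumptions made for illustration, and the hyperparameter values in the usage note are placeholders rather than values from the paper.

```python
import math
import random


def rewarmed_cosine_lr(step, total_steps, warmup_steps, max_lr, min_lr):
    """LR for one continual pre-training stage (hypothetical helper).

    Assumes linear re-warming from min_lr to max_lr, followed by a
    cosine re-decay back down to min_lr over the remaining steps.
    """
    if step < warmup_steps:
        # Re-warming: increase the LR linearly from min_lr towards max_lr.
        return min_lr + (max_lr - min_lr) * step / warmup_steps
    # Re-decaying: follow a half cosine from max_lr down to min_lr.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def mix_with_replay(new_docs, old_docs, replay_fraction, num_samples, seed=0):
    """Build a training stream mixing new data with replayed old data.

    Each sample is drawn from old_docs with probability replay_fraction,
    otherwise from new_docs (an assumed, simplified replay scheme).
    """
    rng = random.Random(seed)
    stream = []
    for _ in range(num_samples):
        source = old_docs if rng.random() < replay_fraction else new_docs
        stream.append(rng.choice(source))
    return stream
```

For example, `rewarmed_cosine_lr(step, 1000, 10, 3e-4, 3e-5)` would be evaluated once per optimizer step when training starts on the new dataset, while `mix_with_replay(new_docs, old_docs, 0.05, n)` would interleave a small share of previous data into the new-data stream.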

Submission Length: Long submission (more than 12 pages of main content)

Changes Since Last Submission: As part of our camera-ready submission, we have added a paragraph in Section 2 about recent works that employ our techniques, and we have elaborated our discussion of MMLU performance in Section 6.4.2 in light of these recent works.

Video: https://youtu.be/y4sUn3sYWFc

Code: https://github.com/EleutherAI/gpt-neox

Supplementary Material: pdf

Assigned Action Editor: ~Andrew_Kyle_Lampinen1

Submission Number: 2356
