Continual Pre-Training of Large Language Models: How to re-warm your model?

Kshitij Gupta; Benjamin Thérien; Adam Ibrahim; Mats Leon Richter; Quentin Gregory Anthony; Eugene Belilovsky; Irina Rish; Timothée Lesort

Published: 20 Jun 2023, Last Modified: 16 Jul 2023

Venue: ES-FoMO 2023 Poster

Keywords: Large Language Models, Pre-training, Continual Learning, Optimization, Transfer Learning, Continual Pre-training, Learning rate schedule

TL;DR: We investigate re-warming the learning rate of large language models for continual pre-training on large-scale downstream datasets.

Abstract: Large language models (LLMs) are routinely pre-trained on billions of tokens, only to restart the process from scratch once new data becomes available. A much cheaper and more efficient solution would be to enable the continual pre-training of these models, i.e., updating pre-trained models with new data instead of re-training them from scratch. However, the distribution shift induced by novel data typically results in degraded performance on past data. As a step towards efficient continual pre-training, we examine the effect of different warm-up strategies. Our hypothesis is that the learning rate must be re-increased to improve compute efficiency when training on a new dataset. We study the warm-up phase of models pre-trained on the Pile (upstream data, 300B tokens) as we continue to pre-train on SlimPajama (downstream data, 297B tokens), following a linear warm-up and cosine decay schedule. We conduct all experiments on the Pythia 410M language model architecture and evaluate performance through validation perplexity. We experiment with different pre-training checkpoints, various maximum learning rates, and various warm-up lengths within the first 50B tokens. Our results show that not warming up at all and keeping a constant learning rate gives the best performance for both downstream and upstream validation data.
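
To make the schedule named in the abstract concrete, the following is a minimal sketch of a linear warm-up followed by cosine decay, next to the constant-learning-rate baseline it is compared against. It is an illustration only, not the authors' training code: the function names, the step-based parameterization, the choice to start warm-up from zero, and all numeric values are assumptions introduced here.

```python
import math


def warmup_cosine_lr(step, total_steps, warmup_steps, max_lr, min_lr=0.0):
    """Learning rate at `step` for linear warm-up then cosine decay.

    Assumptions (not taken from the paper): warm-up starts from 0,
    the schedule is indexed by optimizer step, and the rate decays
    to `min_lr` at `total_steps` and stays there afterwards.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp from 0 up to max_lr over the warm-up phase.
        return max_lr * step / warmup_steps
    # Cosine decay from max_lr down to min_lr over the remaining steps.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def constant_lr(step, lr):
    """Baseline with no re-warming: the same rate at every step."""
    return lr


if __name__ == "__main__":
    # Hypothetical values chosen only to show the shape of the schedule.
    for s in (0, 50, 100, 550, 1000):
        print(s, warmup_cosine_lr(s, total_steps=1000, warmup_steps=100,
                                  max_lr=3e-4, min_lr=3e-5))
```

Setting `warmup_steps=0` gives a pure cosine decay from `max_lr`, while `constant_lr` corresponds to the "no warm-up, constant learning rate" setting the abstract reports as performing best.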

Submission Number: 51
