Abstract: Due to the continuous emergence of new data, version updates have become an indispensable requirement for Large Language Models (LLMs). The training paradigms for version updates of LLMs include pre-training from scratch (PTFS) and continual pre-training (CPT). Preliminary experiments demonstrate that PTFS achieves better pre-training performance, while CPT has a lower training cost. Moreover, their gaps in performance and training cost widen gradually as version updates proceed. To investigate the underlying reasons for this phenomenon, we analyze the effect of learning rate adjustments during the two stages of CPT: preparing an initialization checkpoint and conducting pre-training based on this checkpoint. We find that a large learning rate in the first stage and a complete learning rate decay process in the second stage are crucial for version updates of LLMs. Hence, we propose a learning rate path switching training paradigm. Our paradigm comprises one main path, on which we pre-train an LLM with the maximal learning rate, and multiple branching paths, each of which corresponds to an update of the LLM with newly added training data. Compared with PTFS, when training four versions of LLMs, our paradigm reduces the total training cost to 58% while maintaining comparable pre-training performance. In addition, we validate the generalization of our paradigm, further demonstrating its practicability.
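The path-switching idea described in the abstract can be illustrated with a minimal learning-rate schedule sketch: the main path keeps the maximal learning rate, while each branching path (one per version update) runs a complete decay from that rate. The cosine decay shape, step counts, and learning-rate values below are illustrative assumptions, not details taken from the paper; only the main-path/branch-path structure follows the abstract.

```python
import math


def main_path_lr(max_lr: float) -> float:
    """Main path: pre-train with the maximal (constant) learning rate."""
    return max_lr


def branch_path_lr(step: int, branch_steps: int, max_lr: float, min_lr: float) -> float:
    """Branching path: a complete decay from max_lr to min_lr over branch_steps.
    The cosine shape is an assumption for illustration."""
    progress = min(step / branch_steps, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    MAX_LR, MIN_LR, BRANCH_STEPS = 3e-4, 3e-5, 1000  # assumed values
    # The main-path checkpoint is trained at MAX_LR; each version update
    # branches off and decays the learning rate to completion on its branch.
    for step in (0, 250, 500, 750, 1000):
        print(f"step {step:4d}: main={main_path_lr(MAX_LR):.2e} "
              f"branch={branch_path_lr(step, BRANCH_STEPS, MAX_LR, MIN_LR):.2e}")
```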
Paper Type: Long
Research Area: Special Theme (conference specific)
Research Area Keywords: efficient training
Contribution Types: Approaches for low compute settings-efficiency
Languages Studied: English
Submission Number: 5145