Abstract: The rapid proliferation of large language models (LLMs) such as GPT-4 and Gemini highlights the intense resource demands of their training, which incurs substantial computational and environmental costs. In this paper, we introduce a novel checkpoint merging strategy that makes efficient use of the intermediate checkpoints saved during LLM pretraining. The method merges intermediate checkpoints that share a training trajectory and searches a large space of merging weights for the best combination via Bayesian optimization. Through extensive experiments, we demonstrate that: (1) our method can improve pretrained models at minimal additional cost, effectively yielding substantial benefits almost for free; (2) although it relies on a given held-out dataset, the merged models still generalize robustly across diverse domains, a pivotal property in pretraining.
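For concreteness, the merging step can be written as a weighted average of checkpoint parameters, with the weight chosen against a held-out loss. The notation below is an illustrative sketch based on the description above (a single weight lambda over two adjacent checkpoints), not necessarily the exact formulation in the paper:

% Illustrative formulation (assumed notation): theta_t and theta_{t+1} are
% adjacent pretraining checkpoints, lambda is the merging weight found by
% Bayesian optimization against a held-out loss.
\theta_{\mathrm{merged}}(\lambda) = \lambda\,\theta_{t} + (1-\lambda)\,\theta_{t+1},
\qquad
\lambda^{\star} = \operatorname*{arg\,min}_{\lambda \in [0,1]} \mathcal{L}_{\mathrm{held\text{-}out}}\!\bigl(\theta_{\mathrm{merged}}(\lambda)\bigr).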
Lay Summary: Training large language models (LLMs) like GPT-4 demands immense computational resources, contributing to high costs and environmental impact. A key challenge is making better use of intermediate checkpoints saved during this lengthy pretraining process, as current methods primarily focus on architectural improvements. Merging these checkpoints could be beneficial but is complex due to the intricate loss landscapes and the need to find optimal parameter combinations.
We introduce a novel strategy that merges intermediate LLM checkpoints by linearly combining their parameters. To overcome the difficulty of finding the best combination, we leverage Bayesian optimization, which efficiently searches for the optimal merging weights. Guided by pilot experiments, we focus on merging adjacent checkpoints along the training trajectory.
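The sketch below illustrates this procedure in Python. It is not the authors' implementation: the checkpoint paths and the held-out evaluation hook are placeholders, and scikit-optimize's gp_minimize is used as a stand-in Bayesian optimizer, all of which are assumptions for illustration.

# A minimal sketch of the idea (not the authors' code). Two adjacent
# pretraining checkpoints are merged by linear interpolation of their
# parameters, and the merging weight is chosen by Bayesian optimization
# against a held-out loss.
import torch
from skopt import gp_minimize  # Gaussian-process-based Bayesian optimization

# Hypothetical paths to two adjacent checkpoints from the same training run.
ckpt_a = torch.load("checkpoint_step_100000.pt", map_location="cpu")
ckpt_b = torch.load("checkpoint_step_110000.pt", map_location="cpu")

def merge_checkpoints(state_a, state_b, lam):
    """Linear combination of parameters: lam * theta_a + (1 - lam) * theta_b."""
    return {name: lam * state_a[name] + (1.0 - lam) * state_b[name] for name in state_a}

def evaluate_on_heldout(state_dict):
    """Placeholder: load `state_dict` into your model and return its loss
    (or perplexity) on a held-out dataset; this must be supplied by the user."""
    raise NotImplementedError

def objective(params):
    # gp_minimize passes a list of parameter values; here it is just [lambda].
    lam = params[0]
    merged = merge_checkpoints(ckpt_a, ckpt_b, lam)
    return evaluate_on_heldout(merged)  # lower held-out loss is better

# Search the merging weight in [0, 1] with a small budget of evaluations.
result = gp_minimize(objective, dimensions=[(0.0, 1.0)], n_calls=20, random_state=0)
print(f"best merging weight: {result.x[0]:.3f}, held-out loss: {result.fun:.4f}")

Because only a single scalar weight is searched per pair of checkpoints, each Bayesian-optimization trial costs one forward-pass evaluation on the held-out set, which is negligible compared with continued pretraining.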
Our method acts as a "free lunch" for LLM pretraining, significantly boosting model performance at minimal additional computational cost. Importantly, the merged models generalize well across diverse tasks and languages, even on unseen data. This approach offers a practical way to maximize the value of LLM pretraining and reduce its resource footprint.
Primary Area: Deep Learning->Large Language Models
Keywords: Large Language Models (LLMs), Pretraining, Checkpoint Merging, Bayesian Optimization
Submission Number: 11299