Enhancing LLM Pretraining by Checkpoint Merging: An Almost Free Lunch Approach

ACL ARR 2024 December Submission 2025 Authors

16 Dec 2024 (modified: 05 Feb 2025), ACL ARR 2024 December Submission, CC BY 4.0
Abstract: Large language models (LLMs) such as GPT-4, LLaMA, and Gemini have achieved remarkable success across a wide range of natural language processing (NLP) tasks. Pretraining is a foundational step in LLM training, in which the model acquires a general understanding of language through exposure to vast amounts of text data. However, pretraining an LLM incurs high costs and has a significant impact on energy consumption and the environment. To alleviate this issue, we propose a simple, almost-free-lunch approach: merging checkpoints that share a training trajectory during the pretraining phase. Beyond improving pretraining without increasing the compute budget, our method relaxes the requirement for label information imposed by previous merging methods, which is achieved by using generation quality as the indicator for determining the merging weights. Through extensive experiments, we demonstrate that the merged checkpoint outperforms the best-performing individual checkpoint across multiple datasets and also exhibits stronger generalization in the out-of-distribution setting.
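To make the abstract's idea concrete, the sketch below (in Python/PyTorch) merges pretraining checkpoints from the same trajectory by averaging their parameters with weights derived from held-out perplexity, one plausible reading of "generation quality as the indicator." This is an illustrative assumption, not the authors' exact procedure: the softmax weighting scheme, the checkpoint file names, and the perplexity values are all hypothetical.

    import torch

    def perplexity_weights(perplexities, temperature=1.0):
        # Lower perplexity -> higher weight; softmax over negated perplexities.
        # The softmax-with-temperature scheme is an assumption for illustration.
        ppl = torch.tensor(perplexities, dtype=torch.float64)
        return torch.softmax(-ppl / temperature, dim=0).tolist()

    def merge_checkpoints(state_dicts, weights):
        # Weighted average of parameters; checkpoints share keys and shapes
        # because they come from the same pretraining run.
        merged = {}
        for key in state_dicts[0]:
            acc = sum(w * sd[key].to(torch.float64)
                      for w, sd in zip(weights, state_dicts))
            merged[key] = acc.to(state_dicts[0][key].dtype)
        return merged

    # Hypothetical example: two checkpoints from one pretraining trajectory,
    # with held-out perplexities 8.2 and 7.9 (made-up values).
    ckpts = [torch.load("ckpt_step_100k.pt"), torch.load("ckpt_step_110k.pt")]
    weights = perplexity_weights([8.2, 7.9])
    torch.save(merge_checkpoints(ckpts, weights), "ckpt_merged.pt")

In practice, the weighting rule (e.g., softmax temperature versus a simple inverse-perplexity scheme) and the held-out data used to score generation quality would follow the paper's own evaluation protocol.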
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: Large Language Models (LLMs), Pretraining, Checkpoint Merging, Perplexity
Languages Studied: English
Submission Number: 2025