Keywords: pruning, sparsity, large language model, pretraining
TL;DR: We make dense scaling laws fit sparsely pretrained models.
Abstract: Parameter pruning has emerged as a promising technique to address the growing computational demand of large language models (LLMs). While many studies focus on post-training pruning of LLMs, sparse pre-training offers a compelling alternative: sparsifying during pre-training reduces both training and inference costs. In this work, we conduct the first comprehensive study of optimal sparse pre-training configurations for LLMs, exploring various pruning schedules across different sparsity levels and training durations. We evaluate 80 unique configurations and find that a pruning schedule starting at 25% of total training compute and ending at 75% achieves near-optimal final evaluation loss. Our findings provide valuable insights for efficient and effective sparse pre-training of LLMs. Furthermore, we propose a new scaling law that modifies the Chinchilla scaling law to use the average number of active parameters during training. We present both empirical and theoretical evidence that this modification accurately models evaluation loss for both sparsely and densely pre-trained LLMs, thus offering a unified scaling law for dense and sparse model training. Our insights suggest that, while sparse pre-training yields model loss similar to that of dense pre-training for the same compute budget, it offers a clear advantage: the final model is smaller, resulting in significant potential computational savings during inference.
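For reference, the Chinchilla scaling law expresses loss as a function of parameter count N and training tokens D; the modification described in the abstract replaces N with the average number of active parameters over training. The sketch below is a minimal illustration of that idea, not the paper's exact formulation: the functional form follows the standard Chinchilla fit, and the time-average definition of the active-parameter count is an assumption beyond what the abstract states.

% Chinchilla form (Hoffmann et al., 2022): loss as a function of
% parameters N and training tokens D:
%   L(N, D) = E + A / N^{\alpha} + B / D^{\beta}
%
% Modification described in the abstract: replace N with the average
% number of active parameters over training, \bar{N}. The time-average
% below is an illustrative assumption, not the paper's definition.
\[
  L(\bar{N}, D) \;=\; E \;+\; \frac{A}{\bar{N}^{\alpha}} \;+\; \frac{B}{D^{\beta}},
  \qquad
  \bar{N} \;=\; \frac{1}{T}\int_{0}^{T} N_{\mathrm{active}}(t)\,\mathrm{d}t .
\]

For a dense model, \(\bar{N}\) reduces to the usual parameter count N, which is consistent with the claim that the modified law unifies dense and sparse pre-training.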
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9129