On the Efficiency of Structured Pruning in Small Language Model Pretraining

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: large language models, pretraining, pruning, efficiency
Abstract: Recent advances in generative language models have intensified the need for efficient, deployable models that fit limited inference budgets, even as the organizations training them command enormous computational resources. This scenario opens a new regime and raises a fundamental question: given ample training resources but strict inference constraints, what is the most effective way to obtain the best possible small generative language model? One solution is to use structured pruning to compress a large model into a small one. However, while existing work shows that structured pruning is promising compared with training a target-size model from scratch, its overall efficiency becomes unclear once we account for the cost of pretraining the large model, which in this scenario serves only as an intermediate step. In this paper, we first ask whether it is worth pretraining the large model even if it is never deployed. Our results show that once the pretraining cost of the large model is taken into account, existing pruning methods are less token-efficient than training the target-size model from scratch. We therefore investigate how to improve the efficiency of the entire pipeline for producing small models. To this end, we propose an integrated enlarge-and-prune pipeline that combines enlarged-model training, pruning, and recovery under a single cosine annealing learning rate schedule, complemented by an iterative structured pruning method that removes parameters gradually. We conduct comprehensive experiments compressing 2.8B-parameter models to 1.3B with up to 2T pretraining tokens. Our results demonstrate that the integrated approach not only provides insights into the token efficiency of structured pruning but also yields pruned models with superior performance.
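To make the pipeline described above concrete, the following is a minimal sketch, not the authors' method or code, of an integrated enlarge-and-prune loop: an enlarged model is trained, structured units are removed gradually at a few pruning events, and training continues for recovery, all under one cosine annealing learning rate schedule indexed by the absolute step. The toy model `TinyFFNLM`, the helper `prune_ffn_channels`, the choice of FFN hidden channels as the structured unit, the L2-norm pruning criterion, and every hyperparameter are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of an enlarge-and-prune pipeline under one cosine LR schedule.
# All names and hyperparameters below are assumptions for illustration only.
import math
import torch
import torch.nn as nn

TOTAL_STEPS    = 1000   # whole pipeline: enlarged training + pruning + recovery
PRUNE_START    = 400    # step at which iterative pruning begins
PRUNE_END      = 700    # step by which the target size is reached
N_PRUNE_EVENTS = 6      # number of gradual pruning events
KEEP_RATIO     = 0.5    # fraction of FFN hidden channels kept at the end
BASE_LR        = 3e-4
D_HIDDEN       = 256

class TinyFFNLM(nn.Module):
    """Toy stand-in for a transformer block whose FFN hidden channels get pruned."""
    def __init__(self, d_model=64, d_hidden=D_HIDDEN, vocab=100):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, x):
        h = self.embed(x)
        h = h + self.down(torch.relu(self.up(h)))
        return self.head(h)

def cosine_lr(step):
    """A single global cosine anneal, indexed by the absolute pipeline step."""
    t = min(step, TOTAL_STEPS) / TOTAL_STEPS
    return 0.5 * BASE_LR * (1.0 + math.cos(math.pi * t))

def prune_ffn_channels(model, n_keep):
    """Structured pruning: keep the n_keep FFN hidden channels with largest L2 norm."""
    w = model.up.weight.data                                # (d_hidden, d_model)
    idx = w.norm(dim=1).topk(n_keep).indices.sort().values
    new_up = nn.Linear(w.size(1), n_keep)
    new_down = nn.Linear(n_keep, model.down.out_features)
    new_up.weight.data = w[idx].clone()
    new_up.bias.data = model.up.bias.data[idx].clone()
    new_down.weight.data = model.down.weight.data[:, idx].clone()
    new_down.bias.data = model.down.bias.data.clone()
    model.up, model.down = new_up, new_down

model = TinyFFNLM()
opt = torch.optim.AdamW(model.parameters(), lr=BASE_LR)

# Gradual pruning schedule: when to prune and how many channels to keep each time.
prune_steps = torch.linspace(PRUNE_START, PRUNE_END, N_PRUNE_EVENTS).long().tolist()
targets = torch.linspace(D_HIDDEN, int(D_HIDDEN * KEEP_RATIO),
                         N_PRUNE_EVENTS + 1).long().tolist()[1:]

for step in range(TOTAL_STEPS):
    for g in opt.param_groups:                              # shared cosine anneal
        g["lr"] = cosine_lr(step)
    x = torch.randint(0, 100, (8, 32))                      # dummy token batch
    loss = nn.functional.cross_entropy(model(x).transpose(1, 2), x)
    opt.zero_grad(); loss.backward(); opt.step()

    if step in prune_steps:
        # Remove another slice of channels, then keep training (recovery)
        # without restarting or re-warming the learning rate schedule.
        prune_ffn_channels(model, targets[prune_steps.index(step)])
        opt = torch.optim.AdamW(model.parameters(), lr=cosine_lr(step))
```

In this sketch, rebuilding the optimizer after each pruning event resets its moment estimates, but because the learning rate is looked up from the absolute step, the anneal itself continues unbroken across the enlarged-training, pruning, and recovery phases, which is the spirit of a single-schedule pipeline.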
Primary Area: foundation or frontier models, including LLMs
Submission Number: 22967