['3c3', '< Abstract: LLMs are computationally expensive to pre-train due to their large scale. Model growth emerges as a promising approach by leveraging smaller models to accelerate the training of larger ones. However, the viability of these model growth methods in efficient LLM pre-training remains underexplored. This work identifies three critical Obstacles: (O1) lack of comprehensive evaluation, (O2) untested viability for scaling, and (O3) lack of empirical guidelines. To tackle O1, we summarize existing approaches into four atomic growth operators and systematically evaluate them in a standardized LLM pre-training setting. Our findings reveal that a depthwise stacking operator, called G stack , exhibits remarkable acceleration in training, leading to decreased loss and improved overall performance on eight standard NLP benchmarks compared to strong baselines. Motivated by these promising results, we conduct extensive experiments to delve deeper into G stack to address O2 and O3. For O2 (untested scalability), our study shows that G stack is scalable and consistently performs well, with experiments up to 7B LLMs after growth and pre-training LLMs with 750B tokens. For example, compared to a conventionally trained 7B model using 300B tokens, our G stack model converges to the same loss with 194B tokens, resulting in a 54.6% speedup. We further address O3 (lack of empirical guidelines) by formalizing guidelines to determine growth timing and growth factor for G stack , making it practical in general LLM pre-training. We also provide in-depth discussions and comprehensive ablation studies of G stack . Our code and pre-trained model are available at https://llm-stacking.github.io/.', '---', "> Abstract: The escalating computational demands for pre-training Large Language Models (LLMs) necessitate innovative efficiency solutions. 
Model growth, a technique that leverages smaller, pre-trained models to accelerate the training of larger ones, holds significant promise but remains largely unvalidated for efficient LLM pre-training. This work systematically addresses three critical obstacles hindering its adoption: (O1) the absence of a comprehensive evaluation framework, (O2) unverified scalability across diverse LLM sizes and data regimes, and (O3) the lack of practical empirical guidelines. To tackle (O1), we formalize existing approaches into four atomic growth operators and conduct a rigorous, standardized evaluation in an LLM pre-training setting. Our key finding is that a depthwise stacking operator, termed G stack, delivers substantial training acceleration, leading to lower loss and better performance on eight standard NLP benchmarks than strong baselines. Addressing (O2), we show that G stack scales, with experiments on LLMs of up to 7B parameters after growth and pre-training on up to 750B tokens. For example, a 7B G stack model reaches with 194B tokens the same loss that a conventionally trained 7B model reaches with 300B tokens, a 54.6% speedup. For (O3), we establish empirical guidelines for choosing the growth timing and growth factor of G stack, making it practical for general LLM pre-training. We also provide extensive discussions and ablation studies to deepen understanding. Our code and pre-trained models are publicly available at https://llm-stacking.github.io/.", '6,12c6,12', '< Emergent abilities of Large Language Models (LLMs) rely on scaling-up [1,2]. Empirical evidence from scaling laws [3][4][5] fuels the development of increasingly larger models, pushing the boundaries of LLMs capabilities. 
However, pre-training these gigantic models comes at a significant cost in terms of energy consumption and environmental impact [6] (e.g., pre-training Llama-3 [7] consumes a total of 7.7M GPU hours and generates 2290 tons of carbon dioxide equivalent of carbon emissions). The efficient pre-training of LLMs is thus crucial, both from a scientific and a societal perspective, to ensure the continual growth and adoption of AI [8,9].', '< One promising research direction to accelerate model training involves leveraging trained smaller (base) models to expedite the training of larger (target) models, a technique known as model growth.', "< Concretely, model growth studies how to leverage the trained smaller model's parameters Θ (s) to initialize the larger model's parameters Θ (l) . Current popular methods generally focus on expanding the parameters of the base model through techniques like splitting [10][11][12], copying [13,14], or matrix mapping [15]. There are also some approaches that initialize new parameters from scratch [16,12,17]. The primary objective is to accelerate the training of large models, and existing methods demonstrate promising speedup results on models such as BERT [11,14,18,15,12,13]. Despite such empirical evidence and its alignment with the goal of efficient LLM pre-training, model growth methods are not widely adopted in the context of LLM pre-training [7,19]. To our best knowledge, the only LLM that utilizes model growth for accelerating is FLM-101B [20], but it lacks a baseline LLM trained from scratch to compare. We observe three key Obstacles that hinder LLM pre-training from using existing model growth techniques, specifically:", '< • O1: Lack of comprehensive assessment. Some existing model growth methods report results on LLM pre-training, but either lack a baseline comparison [20] or are still in exploratory stages [15,13]. 
In contrast, most growth approaches are evaluated in encoder-based BERT models [14,11,18,12,13,16,17], which have different architecture and training configurations compared to prominent decoder-based LLMs such as Llama [21].', '< • O2: The untested scalability. This scalability has two aspects: the model size and the amount of pretraining data. Regarding the model size, the existing approaches are only evaluated on smaller-scale BERT models or in preliminary experiments with LLMs. It is unclear whether these growth methods will continue accelerating training when applied to large-scale LLMs with more extensive evaluation. As for the amount of pre-training data, there are debates [22] over whether certain efficient training strategies may initially converge faster but ultimately perform similarly or worse than vanilla training methods when given ample computational resources (i.e., more training data).', '< • O3: Lack of empirical guidelines. Scaling laws [3,4] give clear empirical guidelines on pre-training computational-optimized LLMs, greatly stimulating and advancing the field. Yet, there is a lack of empirical guidelines on growth techniques, discouraging LLM practitioners from adopting these approaches, especially considering the high costs of LLM pre-training.', '< These three obstacles are consequential in nature. Hence, in this work, we empirically revisit the concept of model growth as a solution to efficient LLM pre-training by tackling them one by one.   The training loss for two 7B LLMs, trained from scratch and with G ↑ direct (G stack ). At 300B tokens, G stack accelerates by 54.6% compared to scratch.', '---', '> The remarkable emergent abilities of Large Language Models (LLMs) are fundamentally linked to their scale [1,2]. This scaling trend, driven by empirical scaling laws [3,4,5], continuously pushes the boundaries of LLM capabilities. 
However, pre-training these increasingly large models incurs a significant cost in computational resources, energy consumption, and environmental impact [6] (e.g., pre-training Llama-3 [7] consumed 7.7M GPU hours and generated 2290 tons of carbon dioxide equivalent emissions). Efficient LLM pre-training is therefore crucial, from both a scientific and a societal perspective, for the continued growth and adoption of AI [8,9].', '> ', '> Model growth is a promising research direction that accelerates the training of a larger (target) model by leveraging a smaller, already-trained (base) model. Concretely, model growth studies how to use the parameters Θ (s) of the smaller model to initialize the parameters Θ (l) of the larger model. Existing methods expand the base model by splitting [10,11,12], copying [13,14], or matrix mapping [15], or initialize the new parameters from scratch [16,12,17]; in every case the objective is to speed up the training of the large model. These techniques show promising speedups on models such as BERT [11,14,18,15,12,13]. Yet, despite this empirical evidence and its direct relevance to efficient LLM pre-training, model growth is not widely adopted in LLM pre-training [7,19]. To our knowledge, FLM-101B [20] is the only LLM that uses model growth to accelerate training, but it lacks a from-scratch baseline for comparison. We identify three key Obstacles that hinder the adoption of model growth techniques in LLM pre-training:', '> • O1: Limited Evaluation on LLMs. Existing model growth methods are predominantly evaluated on encoder-based BERT models [14,11,18,12,13,16,17], which differ significantly in architecture and training paradigms from modern decoder-based LLMs like Llama [21]. 
Evaluations on LLMs are often incomplete, lacking comprehensive baseline comparisons [20] or remaining in exploratory stages [15,13].', '> • O2: Unverified Scalability for LLMs. The scalability of model growth across both increasing model sizes and vast pre-training data volumes for LLMs remains largely untested. Current evaluations are restricted to smaller BERT models or preliminary LLM experiments, leaving it unclear whether these methods sustain acceleration for large-scale LLMs and extensive data regimes. This is critical given debates [22] on whether initial speedups translate to long-term performance gains with abundant computational resources.', '> • O3: Absence of Practical Guidelines. Unlike well-established scaling laws [3,4] that provide clear empirical guidance for LLM pre-training, there is a distinct lack of practical guidelines for implementing model growth techniques. This absence discourages LLM practitioners from adopting these methods, especially considering the substantial costs associated with LLM pre-training.', '> These critical obstacles collectively impede the broader adoption of model growth. This work systematically addresses these challenges, empirically revisiting model growth as a potent solution for efficient LLM pre-training.', '16c16,20', '< To summarize, our contributions are four-fold: 1) We first systematically investigate model growth techniques and identify four atomic model growth operators, establishing a better understanding of the field in Section 3.1. 2) We then design a standard LLM pre-training testbed and perform comprehensive evaluations on these operators, finding that a simple depthwise stacking G stack exhibits significant superiority in Section 3. 3) We further demonstrate the scalability of G stack with experiments on LLMs ranging from 410M to 7B parameters and up to 750B training tokens in Section 4. 1. 
4) We also provide guidelines of equations on determining growth timing and growth factors for optimal use of G stack in Section 4.2.', '---', '> To summarize, our contributions are four-fold:', '> 1) We first systematically investigate model growth techniques and identify four atomic growth operators, establishing a better understanding of the field (Section 3.1).', '> 2) We design a standardized LLM pre-training testbed and comprehensively evaluate these operators, finding that a simple depthwise stacking operator, G stack, is clearly superior to the others (Section 3).', '> 3) We demonstrate the scalability of G stack with experiments on LLMs ranging from 410M to 7B parameters and up to 750B training tokens (Section 4.1).', '> 4) We provide empirical guidelines, in the form of equations, for choosing the growth timing and growth factor of G stack (Section 4.2).', '748d751', '< ']
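The 54.6% speedup quoted in the abstract is consistent with measuring speedup as the ratio of tokens needed to reach the same loss, minus one. The sketch below reproduces that arithmetic from the two token counts given in the text (300B for the conventionally trained 7B model, 194B for the G stack model). The definition is an assumption inferred from the quoted numbers rather than one stated in this section, and `token_speedup` is a hypothetical helper name.

```python
def token_speedup(baseline_tokens: float, grown_tokens: float) -> float:
    """Relative speedup at equal loss (assumed definition, inferred from the text).

    Returns the extra tokens the baseline needs, expressed as a fraction
    of the tokens the grown model needs to reach the same loss.
    """
    if grown_tokens <= 0:
        raise ValueError("grown_tokens must be positive")
    return baseline_tokens / grown_tokens - 1.0


# Token counts quoted in the abstract for the 7B models.
speedup_7b = token_speedup(baseline_tokens=300e9, grown_tokens=194e9)
print(f"{speedup_7b:.1%}")  # prints 54.6%
```

Under this reading, the figure is relative to the grown model's budget: the baseline needs about 54.6% more tokens, which is the same as saying the grown model needs roughly 35% fewer tokens than the baseline.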
