Title: Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training

Abstract: LLMs are computationally expensive to pre-train due to their large scale. Model growth emerges as a promising approach by leveraging smaller models to accelerate the training of larger ones. However, the viability of these model growth methods in efficient LLM pre-training remains underexplored. This work identifies three critical Obstacles: (O1) lack of comprehensive evaluation, (O2) untested viability for scaling, and (O3) lack of empirical guidelines. To tackle O1, we summarize existing approaches into four atomic growth operators and systematically evaluate them in a standardized LLM pre-training setting. Our findings reveal that a depthwise stacking operator, called G stack , exhibits remarkable acceleration in training, leading to decreased loss and improved overall performance on eight standard NLP benchmarks compared to strong baselines. Motivated by these promising results, we conduct extensive experiments to delve deeper into G stack to address O2 and O3. For O2 (untested scalability), our study shows that G stack is scalable and consistently performs well, with experiments up to 7B LLMs after growth and pre-training LLMs with 750B tokens. For example, compared to a conventionally trained 7B model using 300B tokens, our G stack model converges to the same loss with 194B tokens, resulting in a 54.6% speedup. We further address O3 (lack of empirical guidelines) by formalizing guidelines to determine growth timing and growth factor for G stack , making it practical in general LLM pre-training. We also provide in-depth discussions and comprehensive ablation studies of G stack . Our code and pre-trained model are available at https://llm-stacking.github.io/.

Section: Introduction
Emergent abilities of Large Language Models (LLMs) rely on scaling up [1,2]. Empirical evidence from scaling laws [3][4][5] fuels the development of increasingly larger models, pushing the boundaries of LLM capabilities. However, pre-training these gigantic models comes at a significant cost in terms of energy consumption and environmental impact [6] (e.g., pre-training Llama-3 [7] consumes a total of 7.7M GPU hours and generates 2290 tons of carbon dioxide equivalent of carbon emissions). The efficient pre-training of LLMs is thus crucial, both from a scientific and a societal perspective, to ensure the continual growth and adoption of AI [8,9].
One promising research direction to accelerate model training involves leveraging trained smaller (base) models to expedite the training of larger (target) models, a technique known as model growth.
Concretely, model growth studies how to leverage the trained smaller model's parameters Θ^(s) to initialize the larger model's parameters Θ^(l). Current popular methods generally focus on expanding the parameters of the base model through techniques like splitting [10][11][12], copying [13,14], or matrix mapping [15]. There are also some approaches that initialize new parameters from scratch [16,12,17]. The primary objective is to accelerate the training of large models, and existing methods demonstrate promising speedup results on models such as BERT [11,14,18,15,12,13]. Despite such empirical evidence and its alignment with the goal of efficient LLM pre-training, model growth methods are not widely adopted in the context of LLM pre-training [7,19]. To the best of our knowledge, the only LLM that uses model growth to accelerate training is FLM-101B [20], but it lacks a baseline LLM trained from scratch for comparison. We observe three key Obstacles that hinder LLM pre-training from using existing model growth techniques, specifically:
• O1: Lack of comprehensive assessment. Some existing model growth methods report results on LLM pre-training, but either lack a baseline comparison [20] or are still in exploratory stages [15,13]. In contrast, most growth approaches are evaluated in encoder-based BERT models [14,11,18,12,13,16,17], which have different architecture and training configurations compared to prominent decoder-based LLMs such as Llama [21].
• O2: The untested scalability. This scalability has two aspects: the model size and the amount of pretraining data. Regarding the model size, the existing approaches are only evaluated on smaller-scale BERT models or in preliminary experiments with LLMs. It is unclear whether these growth methods will continue accelerating training when applied to large-scale LLMs with more extensive evaluation. As for the amount of pre-training data, there are debates [22] over whether certain efficient training strategies may initially converge faster but ultimately perform similarly or worse than vanilla training methods when given ample computational resources (i.e., more training data).
• O3: Lack of empirical guidelines. Scaling laws [3,4] give clear empirical guidelines on pre-training computational-optimized LLMs, greatly stimulating and advancing the field. Yet, there is a lack of empirical guidelines on growth techniques, discouraging LLM practitioners from adopting these approaches, especially considering the high costs of LLM pre-training.
These three obstacles are consequential in nature. Hence, in this work, we empirically revisit the concept of model growth as a solution to efficient LLM pre-training by tackling them one by one.
Figure 1: The training loss for two 7B LLMs, trained from scratch and with G ↑ direct (G stack). At 300B tokens, G stack accelerates by 54.6% compared to scratch.
To tackle O1, we systematically evaluate model growth techniques on practical LLM pre-training. We first categorize existing growth methods and summarize them into four atomic growth operators, each of which can grow along two directions: widthwise (intra-layer) and depthwise (layer-wise). We illustrate them in Figure 2. These operators serve as representative choices for evaluating the performance of model growth techniques. We use these operators to expand 400M base models to 1.1B Llama-like LLMs and continually pre-train them. Next, we evaluate these growth techniques on the training loss and eight standard NLP benchmarks from the Harness toolkit [23]. We find that the direct operator applied depthwise, G stack, consistently outperforms the others across the overall evaluation metrics, demonstrating its potential in accelerating LLM pre-training. This motivates us to investigate G stack extensively by addressing O2 and O3.
To address O2, we investigate the G stack operator's scalability to larger model sizes and to more training data. We conduct extensive experiments by scaling model size up to 7B parameters trained with 300B tokens, and pre-training a 410M model with over 750B training tokens. This is in contrast to the previous largest LLM pre-training experiment that uses model growth methods and has baselines for comparison, which is reported in LiGO [15], where a GPT2-1.5B model is trained for 15k steps (approximately 15B tokens). The results are encouraging, as we consistently observe significant improvements from G stack in both scenarios. For example, we achieve a remarkable 54.6% speedup in pre-training for a 7B model with 300B tokens (Figure 1). Interestingly, the loss improvement in our 750B-token experiment aligns with a logarithmic function. We further extend this logarithmic curve and determine that the improvement continues to be substantial even for the LLM trained with over 8T tokens. Moreover, we summarize all our experiments by estimating the LLM scaling law for LLMs pre-trained with G stack. Given the same target loss value, our analysis reveals a significantly reduced computational cost compared to the common scaling law [4].
For O3, we explore the practical guidelines for using G stack in LLM pre-training. Given a computational budget, we determine the optimal strategy for two key factors of G stack , growth timing d and growth factor g. Growth timing d relates to the training tokens used for small models before growing, and growth factor g refers to the factor between the non-embedding parameter number of the large models and the small models. We formalize our findings into equations that offer concrete suggestions for utilizing G stack . We believe this work could significantly pique the interest and bolster confidence in future LLM pre-training with model growth techniques, both in academia and industry.
To summarize, our contributions are four-fold: 1) We first systematically investigate model growth techniques and identify four atomic model growth operators, establishing a better understanding of the field in Section 3.1. 2) We then design a standard LLM pre-training testbed and perform comprehensive evaluations on these operators, finding that a simple depthwise stacking operator, G stack, exhibits significant superiority in Section 3. 3) We further demonstrate the scalability of G stack with experiments on LLMs ranging from 410M to 7B parameters and up to 750B training tokens in Section 4.1. 4) We also provide guidelines, in the form of equations, for determining growth timing and growth factor for optimal use of G stack in Section 4.2.

Section: Related Work -Model Growth for Efficient Pre-training
The idea of growing neural networks dates back to the 1990s [24][25][26]. The pioneering work of Net2Net [10] marks a milestone as the first attempt to study model growth in the deep learning era. Net2Net expands width and depth while keeping the original function (namely, function preserving) by randomly splitting old neurons and injecting new identity layers. The widthwise splitting method of Net2Net represents a series of works that aim to "expand" the existing neurons to the desired larger size. Bert2Bert [11] serves as a BERT-based extension of the widthwise Net2Net. StagedGrow [13] doubles the width by concatenating two identical layers and halves the final loss to remain function-preserving. Lemon [12] suggests integrating a parameter into the splitting of neurons in Bert2Bert, aiming to break weight symmetry. In the depthwise direction, StackedBert [14] simply stacks duplicated layers to form a deeper model. In contrast to the above direct copy/split approaches, LiGO [15] presents a learning-based method that initializes the larger model's parameters by learning a linear mapping from the smaller model's parameters.
Alongside the approaches that expand existing parameters, there are works that initialize new ones without relying on existing ones. For instance, MSG [17] proposes a multi-staged growing strategy that progressively expands transformer components, where the newly grown neurons are randomly initialized using a masking mechanism to ensure function preservation. Besides, some works have assigned specific values, like zero, to the newly initialized neurons to negate their influence [16,12].
All the above methods are primarily explored in BERT or earlier stages of LLM pre-training. On the other hand, our objective is to present the first systematic review of model growth techniques in the LLMs era. To our knowledge, FLM-101B [20] is the only existing LLM that uses the growth method [17] for accelerating billion-scale LLM pre-training. Nonetheless, this work lacks a baseline model trained from scratch, making it difficult to assess the effectiveness of the model growth technique. In contrast, we aim to provide a comprehensive study by establishing a standardized testbed to compare LLMs trained from scratch and with various growth methods in LLM pre-training.

Section: Systematically Assessing Model Growth for LLM Pre-Training
Existing model growth methods [14,11,18,15,12,13,16,17] are mainly evaluated on BERT [27], with limited focus on decoder-only large-scale language models such as Llama [21]. Moreover, these growth methods are often not comparable due to different training settings [14,11,17,12].
Even when growth methods are evaluated on LLMs, the results are often incomplete [20,15]. To overcome these limitations, we first summarize existing works [14,11,18,15,12,13,16,17] into four atomic growth operators to represent these growth techniques. Then we build a standardized LLM training testbed to pre-train LLMs with the four growth operators in the depthwise and widthwise directions, and evaluate the results with both training loss and eight evaluation metrics in Harness [23].

Section: Growing LLMs with Growth Operators
In recent years, researchers have focused on enhancing the efficiency of training large models by making use of smaller pre-existing models [10,11,14,18,15,12,13,16,17]. These state-of-the-art methods can be categorized into two distinct groups. The first group focuses on deriving new neurons from the existing ones [10,11,14,12,15], while the second group focuses on initializing new parameters separately [18,13,16,17]. Drawing from these two lines of research, we summarize four atomic growth operators: (A) directly duplicating and stacking old layers in a depthwise manner or splitting neurons in the same layer widthwise, denoted as G direct; (B) generating expanded parameters by applying a learnable mapping matrix to the existing parameters, denoted as G learn; (C) setting the new parameters to zero, denoted as G zero; and (D) randomly initializing the new parameters, denoted as G random. The four operators are illustrated in Figure 2. The G direct and G learn growth operators produce new neurons from the current ones, in contrast to G zero and G random, which initialize new parameters independently. For the formal definitions of the operators and their differences in design from the existing growth methods, please refer to Appendix A. Complex growth methods, such as those involving auxiliary loss or exploring training dynamics like learning rates [28,29,16]
To make a fair comparison of the four growth operators for LLM pre-training, we define a standardized "one-hop" growth process that involves two training phases: small model training before growth and large model training after growth. We first train the small LLMs with d tokens before growing. Then, we use operator G to grow them to the target LLMs by a factor of g for non-embedding parameters, and continually pre-train the large LLMs for D tokens.
Two key factors in the procedure are worth noting: the growth factor g and the data for base model training d, which can be interpreted as "growth timing". We further evaluate each growth operator by separately examining depthwise (layer-wise) growth G ↑ and widthwise (intra-layer) growth G →. Concretely, we start with base models (400M LLMs) trained on d = 10B tokens, apply the four operators in both directions to scale them up to the target size of 1.1B (approximately a growth factor of g = 4), and then continue training for an additional D = 97.5B tokens. Appendix B contains the LLM's architecture configuration and training details.
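As a rough illustration of the one-hop schedule above, the following sketch tallies the compute of the two training phases. It assumes the common approximation of about 6·N·D FLOPs for training N parameters on D tokens, which this section does not state, and it treats the nominal model sizes (400M and 1.1B) as parameter counts. The function name and the resulting numbers are illustrative only.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute, assuming C ~= 6 * N * D (an assumption)."""
    return 6.0 * n_params * n_tokens


# One-hop growth setting from the text (nominal sizes, not exact counts).
small_params, large_params = 400e6, 1.1e9
d_tokens, big_d_tokens = 10e9, 97.5e9

c1 = training_flops(small_params, d_tokens)      # small-model phase before growth
c2 = training_flops(large_params, big_d_tokens)  # large-model phase after growth
total = c1 + c2

print(f"C1={c1:.3e}  C2={c2:.3e}  total={total:.3e}  "
      f"small-phase share={c1 / total:.1%}")
```

Under these assumptions the small-model phase accounts for only a small fraction of the total budget, which is why the choice of d can be explored fairly cheaply.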

Section: Pre-Training 1.1B LLMs
We report results on training loss, eight standard Harness NLP benchmarks along with the average accuracy, and the speedup ratio in Figure 3. Our key finding is that depthwise growth G ↑ exhibits a significant acceleration over both widthwise growth G → and training models from scratch, while, surprisingly, G → does not offer any notable advantages. Among the depthwise growth operators, G ↑ direct, G ↑ learn, and G ↑ zero all outperform the baseline and G ↑ random. The underperformance of G ↑ random in our study may be attributed to its design for gradual "mini-step" growth [17], whereas our unified approach uses a single step.
Figure 3: ... [30], ARC-c [31], ARC-e [31], Logiqa [32], PIQA [33], Sciq [34], Winogrande [35] and Wikitext PPL [36], totaling eight standard NLP benchmarks. After 8 × 10^20 FLOPs of training, G ↑ direct demonstrates a significant speedup.

Section: Delving Deeper Into Depthwise Stacking (G stack )
The empirical evidence suggests that certain growth operators, most notably G ↑ direct , exhibit an impressive acceleration in LLM pre-training compared to the baseline approach of training models from scratch. We now turn our attention to a more in-depth examination of the G ↑ direct . For ease of reference, we will henceforth denote this depthwise stacking approach as operator G stack :
$$\mathcal{M} = \underbrace{M \circ M \circ \cdots \circ M}_{g \times M}$$
where $M$ is a small base model trained with d tokens, $\mathcal{M}$ is the target model, and g is the growth factor. This section addresses the two main challenges (O2 and O3) outlined in the introduction: 1) evaluating the performance of G stack in scaling scenarios, i.e., larger model sizes and more training tokens; and 2) determining the hyperparameters when using G stack, i.e., the growth timing d and growth factor g.
Then, we train the stacked models using over 300B tokens (D = 300B) for both sizes. Figures 4 and 5 show the loss and the NLP benchmarks average accuracy, evaluated using the Harness evaluator, for training 3B and 7B LLMs with 300B tokens, respectively. The acceleration of G stack is consistent across the two models and all evaluation metrics. For instance, considering the 3B model, Figure 4 demonstrates that G stack achieves a 54.5% speedup in pre-training and an improvement of 2.1 in NLP benchmarks average accuracy compared to the baseline 3B model trained with 240B tokens.
When comparing the 1B, 3B, and 7B models, it is evident that the benefits of G stack are not reduced as the model size increases, implying that its acceleration effect can be leveraged even with larger models. Details of the evaluation results, including evaluation with instruction tuning, can be found in Appendix D. Appendix E compares our baselines with the open-source LLMs Pythia and tinyLlama.
Figure 6b. The fitting curve indicates G stack will continue to exhibit acceleration effects even after 8T tokens, which is over 1000 times longer than the recommended token number [4]. It is also notable that this loss improvement after 8T training is not trivial for LLM pre-training, as previous studies [39] suggest that even minor improvements in the later phase can have a relatively substantial impact on downstream performance.
From an LLM practitioner's perspective, this is also crucial considering "overtraining", the now-common practice of training LLMs with significantly larger amounts of data than recommended by scaling laws [3][4][5]. A notable example is the training of Llama 3-8B with 15T tokens, which is nearly 100 times greater than the token count recommended by the Chinchilla scaling laws [4]. Hence, this finding provides confidence in the consistently excellent acceleration of G stack throughout the entire practical LLM pre-training process.
Estimating Scaling Laws. To further explore our findings, we plot our four models (410M, 1.1B, 3B, and 7B) on the same figure and attempt to uncover our "scaling law" using the G stack operator. Following [3,4], we define the scaling power law using the equation $L_C = aC^b$, where a and b are constants we need to fit, C represents the FLOPs, and $L_C$ denotes the model's final loss under this FLOP budget. We use the curve_fit function in SciPy [40] to fit both the scratch model and the G stack model and present the estimated scaling laws in Figure 7. The figure shows that our G stack scaling law exhibits improved efficiency compared to the scaling law estimated from baseline LLMs, achieving the same target loss while requiring much less computational resources. However, in light of the significant computational resources devoted to other scaling law studies [3,4], we acknowledge that our G stack scaling law is an initial estimate subject to computation constraints, and a comprehensive study is left for future research.
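To make the fitting step concrete, here is a minimal sketch that fits the power law L_C = a·C^b. The text uses SciPy's curve_fit; to stay dependency-free, this sketch swaps in ordinary least squares in log-log space, a different estimator that agrees with a nonlinear fit only on noise-free data. The data points are synthetic, generated from made-up constants, and are not results from the paper.

```python
import math


def fit_power_law(flops, losses):
    """Fit loss = a * flops**b by linear least squares on (log C, log L)."""
    xs = [math.log(c) for c in flops]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                        # slope in log-log space = exponent b
    a = math.exp(mean_y - b * mean_x)    # intercept in log space = log(a)
    return a, b


# Synthetic points from hypothetical constants (NOT values from the paper).
true_a, true_b = 10.0, -0.05
flops = [1e19, 1e20, 1e21, 1e22]
losses = [true_a * c ** true_b for c in flops]

a, b = fit_power_law(flops, losses)
print(f"a={a:.4f}, b={b:.4f}")
```

With real, noisy measurements the two estimators can return different constants, because the log-space fit penalizes relative rather than absolute errors.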

Section: Determining Growth Timing and Growth Factor for Using G stack
We comprehensively validate the effectiveness of G stack compared to training from scratch in Section 4.1. However, to incorporate G stack into an LLM's pre-training process, we need to determine two crucial hyperparameters: the growth timing (d) and the growth factor (g). In our previous experiments, we rely on ad-hoc choices for these parameters and thus lack a systematic approach to determining them when using G stack. There exists research investigating the growth timing [41], but the settings are quite different from LLM pre-training. Therefore, this section offers a clear guide for practitioners looking to make the best use of the G stack operator in LLM pre-training.
We begin by offering a formal definition. When given a computational budget C, established scaling power laws [3,4] exist to guide the non-embedding parameters N and the number of training tokens D to achieve the lowest model loss in the case of training from scratch. However, tuning hyperparameters becomes more complex when the fixed budget C is allocated to find the optimal model training strategy using the G stack operator, which involves two training phases. Consequently, the overall computational budget C can be expressed as the sum of the two components:
C = C1 + C2.
Here, C1 and C2 denote the compute spent in the two training phases: small-model training before growth and large-model training after growth, respectively.
Determining Growth Timing: d. We first explore the effect of growth timing, i.e., the number of training tokens d for the small model. In particular, we apply the G stack operator to a series of small models trained with d = 0B, 1B, 5B, 10B, 20B, 50B tokens. Subsequently, we stack them to the target layers with growth factor g = 4 and train for a fixed set of computational FLOPs. We replicate the above experiments using three target model sizes N = 410M, 1.1B, 3B and plot each set of IsoFLOP points in Figures 8a, 8b, and 8c. Surprisingly, even a small model trained with just 1B tokens exhibits a significant speedup compared to directly stacking small randomly initialized models (represented as "0B"). In contrast, the 0B setting performs similarly to models trained from scratch, implying that stacking itself does not serve as an effective initialization method. Furthermore, by applying smoothing techniques to model the IsoFLOP curves as parabolas, we identify the optimized value of d that minimizes loss for each FLOP count, leading us to hypothesize the existence of a logarithmic equation involving N, C, and d.
Determining Growth Factor: g. Another factor we determine is the growth factor g.
As models with 3B and 7B parameters have identical depths, we run experiments using two model sizes: 1.1B (24 layers) and 3B (32 layers). Specifically, we vary the stack factors to g = 2, 4, 8, 24 for the 1.1B model and g = 4, 8, 16, 32 for the 3B model while keeping the base models trained with d = 10B tokens. The smoothed IsoFLOP curves are plotted in Figure 10. Interestingly, even with a relatively shallow 2-layer base model and a growth factor of g = 16, we observe a remarkable improvement compared to the baseline 3B model (g = 1). However, when using a 1-layer base model, G stack underperforms compared to the baseline. Our curves indicate that the optimal growth factor g lies between 2 and 4.
However, unlike the case of the training tokens d, we cannot generate sufficient data to estimate the relationship between N, C, and g, due to computational constraints. Thus, this work suggests a constant growth factor of g = 4. We also include our preliminary estimated equation and contour figure for g in Appendix F. All evaluation results of Section 4.2 are listed in Appendix G.
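The growth-timing analysis above models each IsoFLOP curve as a parabola and reads off the d that minimizes loss. A minimal sketch of that step follows. It assumes the parabola is fitted to loss as a function of log10(d), which is an assumed axis choice, and it uses made-up loss values rather than measurements. The d = 0B point is left out because its logarithm is undefined.

```python
import math


def fit_parabola(xs, ys):
    """Least-squares fit of y = p*x^2 + q*x + r via the 3x3 normal equations."""
    s = [sum(x ** k for x in xs) for k in range(5)]                  # power sums
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    a_mat = [[s[4], s[3], s[2]],
             [s[3], s[2], s[1]],
             [s[2], s[1], s[0]]]
    rhs = [t[2], t[1], t[0]]

    def det3(m):
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

    d = det3(a_mat)
    coeffs = []
    for col in range(3):                 # Cramer's rule, one unknown at a time
        m = [row[:] for row in a_mat]
        for i in range(3):
            m[i][col] = rhs[i]
        coeffs.append(det3(m) / d)
    return tuple(coeffs)                 # (p, q, r)


d_tokens_b = [1, 5, 10, 20, 50]          # growth timing d, in billions of tokens
xs = [math.log10(d) for d in d_tokens_b]
# Hypothetical IsoFLOP losses, generated from a known parabola for illustration.
losses = [0.02 * (x - 1.0) ** 2 + 2.5 for x in xs]

p, q, r = fit_parabola(xs, losses)
best_log_d = -q / (2 * p)                # vertex of the fitted parabola
print(f"estimated optimal d ~ {10 ** best_log_d:.2f}B tokens")
```

Repeating this for several FLOP budgets and model sizes would yield the (N, C, d) points from which a relationship between the three could then be fitted.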

Section: Ablation and Discussion
To give further insight into adopting model growth techniques in LLM pre-training, we ablate variants of G stack and discuss function preservation in general model growth techniques.

Section: Ablation: How to Stack?
It is worth noting that G stack differs from the algorithm proposed in StackedBERT [14], which utilizes a gradual stacking strategy. Hence, we compare our "one-hop" G stack with their gradual stacking approach. Following the methodology introduced in StackedBERT, we employ a two-step stacking strategy. Given our target model size of 1.1B with 24 layers, we start with a 6-layer model. Subsequently, we train it on 10B tokens and double the model's depth through stacking, repeating this step twice (train-stack-train-stack) to achieve the desired scale. Our experiments demonstrate that G stack outperforms the gradual stacking approach on loss and downstream evaluations. For example, the evaluation results show that G stack achieves a 2.4 higher average accuracy and 0.6 better Wikitext PPL than gradual stacking when pre-training large models for 100B tokens. The results can be found in Appendix H.1. We further compare other stacking variations, such as stacking via interpolation and partial stacking of certain layers, which are also adopted in LlamaPro [42] and Solar [43], and leave our detailed findings to Appendices H.2 and H.3.

Section: Discussion: Why Does Function Preserving Fail?
Function preservation (FP) is a key concept that underlies most model growth approaches [10][11][12][17].
The idea is intuitive: a larger model should be initialized with parameters that represent the same function as those of the smaller model, i.e., $\forall x,\ f(x; \Theta^{(s)}) = f(x; \Theta^{(l)}_{init})$, where x is the input. We give a mathematical definition of FP in Appendix I.1.
We find it intriguing that our G stack approach, which violates FP, emerges as the most effective in our study. To investigate further, we conduct a simple ablation study that breaks FP by introducing noise into the strict-FP operator G → direct. We initialize the new neurons by a weighted combination of two sets of parameters: those from G → direct and those from random initialization. The weighting factor is controlled by a noise ratio. Our findings are intriguing. After 40B tokens of training, adding 20% noise outperforms the original G → direct by 0.27 on the Wikitext PPL and 0.41 on the average accuracy score. We also add noise to G stack. When we add 20% noise, our LLM performs slightly better than the no-noise model. However, when the noise level exceeds 20%, the performance significantly deteriorates. These results indicate that function preservation may not be the sole determining factor for model growth. In other words, accelerating the training of larger models and strictly preserving the function during growth might represent two overlapping yet distinct research directions. The experimental details are provided in Appendix I.2.
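The noise ablation can be pictured as a per-parameter interpolation. The sketch below assumes the weighted combination is linear, (1 - r)·θ_direct + r·θ_random with noise ratio r. The text does not give the exact formula, so this form, the function name, and the Gaussian initializer are all assumptions.

```python
import random


def mix_with_noise(direct_params, noise_ratio, std=0.02, seed=0):
    """Blend grown parameters with a random init: (1 - r) * direct + r * random."""
    if not 0.0 <= noise_ratio <= 1.0:
        raise ValueError("noise_ratio must lie in [0, 1]")
    rng = random.Random(seed)
    return [(1.0 - noise_ratio) * w + noise_ratio * rng.gauss(0.0, std)
            for w in direct_params]


direct = [0.5, -0.3, 0.1, 0.8]          # toy values standing in for grown weights
no_noise = mix_with_noise(direct, 0.0)  # r = 0 keeps the grown parameters as-is
noisy = mix_with_noise(direct, 0.2)     # the 20% setting discussed in the text
print(no_noise)
print(noisy)
```

At r = 0 the operator is unchanged (and remains function-preserving if it was before); any r > 0 breaks strict function preservation by construction.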

Section: Conclusion
This work empirically explores model growth approaches for efficient LLM pre-training. We address three key challenges of current model growth research for efficient LLM pre-training. We first summarize model growth techniques into four atomic operators, evaluate them comprehensively, and find that depthwise growth G stack beats all other methods and baselines in various evaluations. We next address concerns about the scalability of G stack by extending the model and training data scales. Furthermore, we systematically analyze the usage of the G stack operator, focusing on growth timing and growth factor. Based on this analysis, we formalize a set of guidelines for effectively utilizing the G stack operator. In addition, we provide in-depth discussions and comprehensive ablation studies of G stack, shedding light on the broader implications of our work.

Section: Limitations
While our work has demonstrated remarkable potential, four limitations deserve further attention. One limitation is the constraint of computation resources. For example, we only compare two sets of growth factor g configurations, which limits our capacity to derive a formula for determining the optimal growth factor g. Another limitation of our work is the focus on relatively simple operator choices, where we prioritize simplicity over exploring more sophisticated strategies. For instance, we do not extensively investigate multi-step growth or dynamic modifications to the training process, such as adjusting the learning rate during continual pre-training. The third limitation involves the incomplete cosine learning rate schedule during training. This also arises from the resource-intensive nature of pre-training LLMs and the constraints on available computational resources. Therefore, we adopt a strategy where we initially set a large number of training tokens and then pre-train LLMs until the training runs are interrupted by tasks with higher priority. Lastly, although this study's scope is an empirical exploration and the content is self-contained, there is a lack of theoretical insights into the success of G stack in LLM pre-training. Nonetheless, we will release all LLM checkpoints to facilitate the community's investigation into the theoretical principles behind our observations.

Section: A Details of Growth Operators
Section: A.1 Four Growth Operators

Section: A.1.1 Operator G direct : Direct Derivation of Grown Parameters From Old Parameters
One intuitive strategy for expanding neural networks involves directly duplicating or splitting existing neurons [14,11,12]. Unlike other growth operators, we distinguish between growth in terms of depth and width.
For width-wise expansion, the Net2Net technique and its transformer implementations [10,11] involve splitting old neurons into two or more parts, with each splitting step achieving a=b+c. Depending on the specific splitting mechanism, there are two variations: even splitting and uneven splitting. The latter is proposed to address symmetry issues that arise when neurons are evenly split. In this paper, we adopt the approach of uneven splitting.
In the context of depth-wise expansion, a common practice is to duplicate layers, often referred to as "stacking" [14]. Therefore, we use the term G stack to represent this operator. While this approach may appear to deviate from function preservation, it surprisingly yields a strong baseline.
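A minimal sketch of depthwise stacking, mirroring the loop of Algorithm 1: the target model is the trained base model's layer list repeated g times. Layers are plain dictionaries here purely for illustration. A real implementation would copy transformer-block weights, and the function and key names are hypothetical.

```python
import copy


def g_stack(base_layers, growth_factor):
    """Return growth_factor deep copies of base_layers, concatenated in order."""
    if growth_factor < 1:
        raise ValueError("growth_factor must be >= 1")
    stacked = []
    for _ in range(growth_factor):
        # Deep-copy so the duplicated blocks can diverge during later training.
        stacked.extend(copy.deepcopy(base_layers))
    return stacked


# Toy 6-layer base model grown by g = 4 into a 24-layer target model.
base = [{"name": f"block{i}", "weight": [0.1 * i, 0.2 * i]} for i in range(6)]
target = g_stack(base, growth_factor=4)
print(len(target), [layer["name"] for layer in target[:7]])
```

Because every copy after the first changes the composed function, the stacked model generally does not reproduce the base model's outputs, which is the sense in which stacking departs from function preservation.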

Section: A.1.2 Operator G learn : Generation of New Parameters through Matrix Transformation
G learn is an operator that learns a matrix transformation function to map small models to a larger one [15]. This operator is applicable to both width and depth expansion. Considering the original model f with parameters θ, the target model F with parameters Θ, and G learn as the hypernetwork for meta-learning, the training corpus is denoted as D, and the language model loss is denoted as L. Then, we optimize the following objective:
$$\arg\min_{G_{learn}} \; \mathbb{E}_{x \sim D} \, L(x; F_{\Theta}), \quad \text{where } \Theta = G_{learn}(\theta) \tag{3}$$

Section: A.1.3 Operator G zero : Setting New Parameters to 0
Setting new parameters to zero is often considered a simple method to achieve function preservation. However, optimizing networks with a significant number of zeros can present challenges. To tackle this issue, we adopt current practices that selectively zero out either the fan-in or fan-out parameters [13,16,12]. Specifically, for operator G zero , during width growing, we zero out only the set of fan-out parameters for new neurons and randomly initialize the remaining ones. In the case of depthwise expansion, we zero out the final output layer of the newly-duplicated transformer blocks' MultiHead Attention and MLP.
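Why zeroing the output projections preserves the function in the depthwise case can be checked on a toy residual block. The sketch assumes a residual sub-layer of the form y = x + W_out·h(x), with a ReLU projection standing in for attention or the MLP and with normalization omitted. With W_out = 0, the newly added block is the identity map. All names are illustrative.

```python
def matvec(matrix, vec):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def residual_block(x, w_in, w_out):
    """Toy residual sub-layer: x + w_out @ relu(w_in @ x)."""
    hidden = [max(0.0, h) for h in matvec(w_in, x)]
    update = matvec(w_out, hidden)
    return [xi + ui for xi, ui in zip(x, update)]


x = [1.0, -2.0, 0.5]
w_in = [[0.3, -0.1, 0.2], [0.0, 0.4, -0.5]]   # copied (non-zero) inner weights
w_out_zero = [[0.0, 0.0] for _ in range(3)]    # zeroed output projection

y = residual_block(x, w_in, w_out_zero)
print(y)  # equals x: the newly added block starts as an identity map
```

Only the output projection is zeroed, so the inner weights keep their copied values, which matches the stated motivation of avoiding a network that contains a large number of zeros.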

Section: A.1.4 Operator G random : Random Initialization of New Parameters
This group follows the common practice of randomly initializing new parameters. In earlier attempts, old neurons were frozen after the growth process [18,17]. However, to ensure function preservation, a recent study introduces a mask for new neurons after expansion [17]. This mask is gradually removed during ongoing training. We refer to this new approach as the growth operator G random .

Section: A.2 Difference of Our Operators and Base Methods
The operator G → direct shares a similar setting with Lemon, with minor variations due to the Llama architecture. G learn is consistent with the method of LiGO, but uses our own implementation. For G zero, our approach aligns with Lemon in terms of depth, but differs from stagedTraining in width. Unlike stagedTraining, we do not double the width and assign zeros to the off-diagonal entries. Instead, our approach is more flexible; by zeroing out the submatrix in the bottom-left corner, we can extend it to any dimension. Our G random does not exhibit the "multi-hop" growth of MSG; instead, it grows "one-hop" directly to the target size. Our implementation of G ↑ direct (G stack) differs from the algorithm employed in StackedBert. In StackedBert, a gradual growing technique is utilized, whereas our operator follows a more direct approach.

Section: A.3 Details of G direct
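Embedding Consider $E \in \mathbb{R}^{V \times d}$; our goal is to expand it to $E' \in \mathbb{R}^{V \times D}$. $G_{\text{direct}}$ simply copies some columns:
$$\begin{aligned} E' &= G_{\text{direct}}(E) && (4) \\ &= E R && (5) \\ &= E \begin{bmatrix} I & 0 & I \\ 0 & I & 0 \end{bmatrix} && (6) \end{aligned}$$
where $R \in \mathbb{R}^{d \times D}$ is used to copy the embedding matrix $E$.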
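Linear Consider $W \in \mathbb{R}^{d_{out} \times d_{in}}$ and the target parameter $W' \in \mathbb{R}^{D_{out} \times D_{in}}$, where $d_{out} \le D_{out}$ and $d_{in} \le D_{in}$. $G_{\text{direct}}$ is defined as:
$$\begin{aligned} W' &= G_{\text{direct}}(W) && (7) \\ &= L W R && (8) \\ &= \begin{bmatrix} I & 0 \\ 0 & I \\ I & 0 \end{bmatrix} W \begin{bmatrix} \alpha & 0 & \beta \\ 0 & I & 0 \end{bmatrix} && (9) \end{aligned}$$
where $R \in \mathbb{R}^{d_{in} \times D_{in}}$ is used for expanding the fan-in and $L \in \mathbb{R}^{D_{out} \times d_{out}}$ is used for expanding the fan-out. To preserve the function, we ensure that $\alpha + \beta = I$.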
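The role of the constraint α + β = I can be checked numerically. The following is a minimal sketch in plain Python, assuming one particular block layout (input feature 0 is duplicated into a new last position) and scalar α and β. The helper names and toy values are illustrative assumptions, not part of the released code.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(row) for row in zip(*m)]

# Original layer with d_in = d_out = 2, using the convention Linear(x) = x W^T.
W = [[1.0, 2.0],
     [3.0, 4.0]]
x = [[0.5, -1.0]]

alpha, beta = 0.3, 0.7   # assumed split; any pair with alpha + beta = 1 works

# R (d_in x D_in): splits the weight of input feature 0 between column 0 and new column 2.
R = [[alpha, 0.0, beta],
     [0.0,   1.0, 0.0]]
# L (D_out x d_out): copies output feature 0 into the new output row 2.
L = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 0.0]]
# Copy matrix for the incoming activations: x' = [x_0, x_1, x_0].
C = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]

W_big = matmul(matmul(L, W), R)           # W' = L W R
x_big = matmul(x, C)

y = matmul(x, transpose(W))               # original output
y_big = matmul(x_big, transpose(W_big))   # expanded output

# The expanded output repeats the original features: [y_0, y_1, y_0].
print(y[0], y_big[0])
```

Because the duplicated input appears twice in x', its contribution is counted once with weight α and once with weight β, so the two contributions only add up to the original one when α + β = 1.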
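RMSNorm For RMSNorm, a similar approach is adopted. Consider the parameter $\mu \in \mathbb{R}^{d}$ and the expanded parameter $\mu' = \frac{\sqrt{d}}{\sqrt{D}}[\mu, \mu_{0,D-d}] \in \mathbb{R}^{D}$:
$$\begin{aligned} \mathrm{RMSNorm}'(x') &= \frac{x'}{\sqrt{\frac{1}{D}\sum_{i=1}^{D} x_i'^2}} \odot \mu' && (10) \\ &= \left[ \frac{\sqrt{\sum_{i=1}^{d} x_i^2}}{\sqrt{\sum_{i=1}^{D} x_i'^2}} \times \mathrm{RMSNorm}(x),\ \zeta \right] && (11) \end{aligned}$$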
Therefore, using $G_{\text{direct}}$, it is not possible to achieve function preservation for RMSNorm.
Depth ($G_{\text{stack}}$) Consider a transformer with $l$ layers represented as
$$F = f_0 \circ f_1 \circ \cdots \circ f_l.$$
Our objective is to expand it to $L$ layers, where $L$ is a multiple of $l$. We have various stacking forms for this purpose, such as (a) direct stacking:
$$F' = F \circ F \circ \cdots \circ F.$$
Algorithm 1 Operator $G_{\text{stack}}$
Input: Base model $M^{l}_{k}$ with $l$ layers trained using dataset $d_k$, where $k$ is the number of iteration steps. Growth factor $g$.
Output: Target model $M^{gl}_{0}$ with $gl$ layers.
$M^{l}_{0} = M^{l}_{k}$
for $t = 2$ to $g$ do ▷ Model Stacking
    $M^{tl}_{0} = M^{(t-1)l}_{0} \circ M^{l}_{k}$
end

Section: A.4 Details of G zero
Embedding Consider an embedding matrix $E \in \mathbb{R}^{V \times d}$. The $G_{\text{zero}}$ operator expands it to $E' \in \mathbb{R}^{V \times D}$, where $d \le D$, by appending a zero matrix $O$. Formally:
$$E' = [E, O] \quad (12)$$
Therefore, given a token $x$, the expanded embedding can be expressed as:
$$\mathrm{Embedding}'(x) = \mathbb{1}_x E' = [\mathrm{Embedding}(x), 0_{D-d}] \quad (13)$$
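Linear Consider the parameter $W \in \mathbb{R}^{d_{out} \times d_{in}}$. $G_{\text{zero}}$ expands it to $W' \in \mathbb{R}^{D_{out} \times D_{in}}$, where $d_{out} \le D_{out}$ and $d_{in} \le D_{in}$. Formally:
$$W' = \begin{bmatrix} W & A \\ O & C \end{bmatrix} \quad (14)$$
where $A$ and $C$ are randomly initialized new parameters. Consider the input token $x \in \mathbb{R}^{d_{in}}$ before expansion and the input after expansion $x' \in \mathbb{R}^{D_{in}}$:
$$x' = [x, 0_{D_{in}-d_{in}}] \quad (15)$$
$$\begin{aligned} \mathrm{Linear}'(x') &= x' W'^{T} && (16) \\ &= [x, 0_{D_{in}-d_{in}}] \begin{bmatrix} W^{T} & O^{T} \\ A^{T} & C^{T} \end{bmatrix} && (17) \\ &= [x W^{T}, 0_{D_{out}-d_{out}}] && (18) \\ &= [\mathrm{Linear}(x), 0_{D_{out}-d_{out}}] && (19) \end{aligned}$$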
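Equations 14 to 19 can be verified on a toy layer. The sketch below is plain Python with illustrative sizes and random values chosen here as assumptions; it builds W' from the blocks W, A, O, C and checks that a zero-padded input yields the original output followed by zeros.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(row) for row in zip(*m)]

random.seed(0)
d_in, d_out, D_in, D_out = 2, 2, 3, 4    # toy sizes (assumed)

W = [[1.0, 2.0],
     [3.0, 4.0]]                                                                        # d_out x d_in
A = [[random.gauss(0, 1) for _ in range(D_in - d_in)] for _ in range(d_out)]           # random block
O = [[0.0] * d_in for _ in range(D_out - d_out)]                                       # zero block
C = [[random.gauss(0, 1) for _ in range(D_in - d_in)] for _ in range(D_out - d_out)]   # random block

# W' = [[W, A], [O, C]]
W_big = [w + a for w, a in zip(W, A)] + [o + c for o, c in zip(O, C)]

x = [[0.5, -1.0]]
x_big = [x[0] + [0.0] * (D_in - d_in)]   # x' = [x, 0]

y = matmul(x, transpose(W))
y_big = matmul(x_big, transpose(W_big))
print(y[0], y_big[0])
```

The random blocks A and C never affect the result here because they are only multiplied by the zero-padded part of x', while the zero block O keeps the new output neurons silent.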
RMSNorm Considering the parameter $\mu \in \mathbb{R}^{d}$, $G_{\text{zero}}$ expands it to $\mu' = [\alpha\mu, \xi]$, as $G_{\text{random}}$ does in Appendix A.5, because the input must be $x' = [x, 0_{D-d}] \in \mathbb{R}^{D}$.
Depth In depth, by retaining only the residual part and initializing the MHA and SwiGLU final linear projections to zero, the MHA and SwiGLU layers can achieve function preservation.

Section: A.5 Details of G random
Embedding Consider an embedding matrix $E \in \mathbb{R}^{V \times d}$. The goal of $G_{\text{random}}$ is to expand it to $E' \in \mathbb{R}^{V \times D}$, where $d \le D$. Formally:
$$E' = [E, \mathcal{E}] \quad (20)$$
where $\mathcal{E} \in \mathbb{R}^{V \times (D-d)}$ represents randomly initialized new parameters. We use a mask $c \in \mathbb{R}^{D}$ to mask out the randomly initialized parts:
$$c = [1_d, 0_{D-d}] \rightarrow [1_d, 1_{D-d}] \quad (21)$$
Therefore, for a token $x$, the masked embedding can be expressed as:
$$\mathrm{Embedding}'(x) = \mathbb{1}_x E' \odot c = [\mathrm{Embedding}(x), 0_{D-d}] \quad (22)$$
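Linear Consider the parameter $W \in \mathbb{R}^{d_{out} \times d_{in}}$. Our goal is to expand it to $W' \in \mathbb{R}^{D_{out} \times D_{in}}$, where $d_{out} \le D_{out}$ and $d_{in} \le D_{in}$. Formally:
$$W' = \begin{bmatrix} W & A \\ B & C \end{bmatrix} \quad (23)$$
where $A$, $B$, and $C$ are randomly initialized new parameters. Consider the input token $x \in \mathbb{R}^{d_{in}}$ before expansion and the input after expansion $x' \in \mathbb{R}^{D_{in}}$:
$$x' = [x, 0_{D_{in}-d_{in}}] \quad (24)$$
$$\begin{aligned} x' W'^{T} &= [x, 0_{D_{in}-d_{in}}] \begin{bmatrix} W^{T} & B^{T} \\ A^{T} & C^{T} \end{bmatrix} && (25) \\ &= [x W^{T}, x B^{T}] && (26) \end{aligned}$$
To ensure that the expanded part of the output starts with zeros, we again utilize a mask:
$$c = [1_{d_{out}}, 0_{D_{out}-d_{out}}] \rightarrow [1_{d_{out}}, 1_{D_{out}-d_{out}}] \quad (27)$$
$$\mathrm{Linear}'(x') = x' W'^{T} \odot c = [\mathrm{Linear}(x), 0_{D_{out}-d_{out}}] \quad (28)$$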
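The masked linear layer of Equations 27 and 28 can be sketched as follows. This is a minimal illustration in plain Python; the text only states that the mask is removed gradually, so the linear ramp, the `ramp_steps` parameter, and all numeric values are assumptions made for this example.

```python
def growth_mask(step, d_old, d_new, ramp_steps):
    """Mask c: ones on the old dimensions, new dimensions ramp from 0 to 1.

    The linear ramp over `ramp_steps` is an assumed schedule.
    """
    ramp = min(1.0, step / ramp_steps)
    return [1.0] * d_old + [ramp] * (d_new - d_old)

def masked_linear(x_big, W_big, c):
    """Compute (x' W'^T) * c element-wise for a single input row."""
    out = [sum(xi * wi for xi, wi in zip(x_big, row)) for row in W_big]
    return [o * ci for o, ci in zip(out, c)]

# W' = [[W, A], [B, C]] with d_in = d_out = 2 grown to D_in = D_out = 3.
W_big = [[1.0, 2.0, 0.4],     # [W, A]
         [3.0, 4.0, -0.2],    # [W, A]
         [0.7, -0.5, 0.9]]    # [B, C], the new output neuron
x_big = [0.5, -1.0, 0.0]      # x' = [x, 0]

at_growth = masked_linear(x_big, W_big, growth_mask(0, 2, 3, ramp_steps=100))
after_ramp = masked_linear(x_big, W_big, growth_mask(100, 2, 3, ramp_steps=100))
print(at_growth)    # new neuron is masked to zero right after growth
print(after_ramp)   # new neuron contributes x B^T once the mask reaches one
```

At step 0 the layer reproduces the small model's output padded with a zero, which is the function-preserving state; as the mask opens, the randomly initialized neuron is blended in.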
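RMSNorm Considering the parameter $\mu \in \mathbb{R}^{d}$, our objective is to expand it to
$$\mu' = [\alpha\mu, \xi] \in \mathbb{R}^{D},$$
where $\alpha$ is an undetermined coefficient and $\xi$ is a randomly initialized new parameter. Let the input be $x' = [x, 0_{D-d}] \in \mathbb{R}^{D}$; then we have:
$$\sum_{i=1}^{D} x_i'^2 = \sum_{i=1}^{d} x_i^2 \quad (29)$$
$$\begin{aligned} \mathrm{RMSNorm}'(x') &= \frac{x'}{\sqrt{\frac{1}{D}\sum_{i=1}^{D} x_i'^2}} \odot \mu' && (30) \\ &= \frac{[x, 0_{D-d}]}{\sqrt{\frac{1}{D}\sum_{i=1}^{d} x_i^2}} \odot [\alpha\mu, \xi] && (31) \\ &= \left[ \frac{\sqrt{D}}{\sqrt{d}} \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2}} \odot \alpha\mu,\ 0_{D-d} \right] && (32) \end{aligned}$$
From Equation 32, function preservation requires $\alpha = \frac{\sqrt{d}}{\sqrt{D}}$. Finally, we can conclude:
$$\mathrm{RMSNorm}'(x') = [\mathrm{RMSNorm}(x), 0_{D-d}] \quad (33)$$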
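The choice α = √d / √D can be checked numerically. The sketch below is plain Python with toy values chosen here as assumptions, and it omits the small epsilon term that practical RMSNorm implementations usually add inside the square root.

```python
import math

def rmsnorm(x, mu):
    """RMSNorm(x) = x / sqrt(mean(x_i^2)) * mu, element-wise, without epsilon."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return [v / rms * m for v, m in zip(x, mu)]

d, D = 3, 5
x = [1.0, -2.0, 0.5]
mu = [0.9, 1.1, 1.0]

alpha = math.sqrt(d) / math.sqrt(D)
xi = [0.3, -0.7]                          # arbitrary new parameters (assumed values)

x_big = x + [0.0] * (D - d)               # x' = [x, 0]
mu_big = [alpha * m for m in mu] + xi     # mu' = [alpha * mu, xi]

small = rmsnorm(x, mu)
big = rmsnorm(x_big, mu_big)
print(small)
print(big)   # first d entries match `small`, the rest are zero
```

Padding with zeros leaves the sum of squares unchanged but divides it by D instead of d, which inflates the normalized values by √D / √d; the factor α cancels exactly that inflation.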
Depth In depth, preserving only the residual part and masking the MHA and SwiGLU layers can achieve function preservation:
$$Y = X + \mathrm{MHA}(\mathrm{RMSNorm}(X)) \odot c \quad (34)$$
$$Y = X + \mathrm{SwiGLU}(\mathrm{RMSNorm}(X)) \odot c \quad (35)$$
$$c = 0_D \rightarrow 1_D \quad (36)$$

Section: A.6 Details of G learn
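Using $G_{\text{learn}}$ for width expansion, for the embedding layer $E \in \mathbb{R}^{V \times d}$, the parameter $B_{emb} \in \mathbb{R}^{D \times d}$ is defined as follows:
$$E' = E B_{emb}^{T} \quad (37)$$
For the attention layer, where $W_Q$, $W_K$, $W_V$, $W_O \in \mathbb{R}^{d \times d}$ and the RMSNorm parameter $\mu_1 \in \mathbb{R}^{d}$, with parameters $B_Q$, $B_K$, $B_V \in \mathbb{R}^{D \times d}$, we have:
$$\begin{cases} W'_Q = B_Q W_Q B_{emb}^{T} \\ W'_K = B_K W_K B_{emb}^{T} \\ W'_V = B_V W_V B_{emb}^{T} \\ W'_O = B_{emb} W_O B_V^{T} \\ \mu'_1 = B_{emb}\, \mu_1 \end{cases} \quad (38)$$
For the MLP, where $W_{up}, W_{gate} \in \mathbb{R}^{d_{mlp} \times d}$, $W_{down} \in \mathbb{R}^{d \times d_{mlp}}$, and the RMSNorm parameter $\mu_2 \in \mathbb{R}^{d}$, with parameter $B_{mlp} \in \mathbb{R}^{D_{mlp} \times d_{mlp}}$, we have:
$$\begin{cases} W'_{up} = B_{mlp} W_{up} B_{emb}^{T} \\ W'_{down} = B_{emb} W_{down} B_{mlp}^{T} \\ W'_{gate} = B_{mlp} W_{gate} B_{emb}^{T} \\ \mu'_2 = B_{emb}\, \mu_2 \end{cases} \quad (39)$$
For the output head $W_{head} \in \mathbb{R}^{V \times d}$, we have:
$$W'_{head} = W_{head} B_{emb}^{T} \quad (40)$$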
Using $G_{\text{learn}}$ for depth expansion, consider a transformer model with $L_1$ layers that we expand to $L_2$ layers. For $l \in \{1, 2, \cdots, L_2\}$:
$$\begin{cases} W^{Q\prime}_{l} = \sum_{j=1}^{L_1} D^{Q}_{l,j} W^{Q}_{j} \\ W^{K\prime}_{l} = \sum_{j=1}^{L_1} D^{K}_{l,j} W^{K}_{j} \\ W^{V\prime}_{l} = \sum_{j=1}^{L_1} D^{V}_{l,j} W^{V}_{j} \\ W^{O\prime}_{l} = \sum_{j=1}^{L_1} D^{O}_{l,j} W^{O}_{j} \\ \mu^{(ln1)\prime}_{l} = \sum_{j=1}^{L_1} D^{(ln1)}_{l,j} \mu^{(ln1)}_{j} \end{cases} \quad (41)$$
where $D^{Q,K,V,O,ln1} \in \mathbb{R}^{L_2 \times L_1}$ represents learnable parameters. These parameters are used to expand the MHA vertically in depth. For SwiGLU, we perform the expansion in the same way. Formally, this can be written as:
$$\begin{cases} W^{up\prime}_{l} = \sum_{j=1}^{L_1} D^{up}_{l,j} W^{up}_{j} \\ W^{down\prime}_{l} = \sum_{j=1}^{L_1} D^{down}_{l,j} W^{down}_{j} \\ W^{gate\prime}_{l} = \sum_{j=1}^{L_1} D^{gate}_{l,j} W^{gate}_{j} \\ \mu^{(ln2)\prime}_{l} = \sum_{j=1}^{L_1} D^{(ln2)}_{l,j} \mu^{(ln2)}_{j} \end{cases} \quad (42)$$
where $D^{up,down,gate,ln2} \in \mathbb{R}^{L_2 \times L_1}$ represents learnable parameters used for expanding SwiGLU in depth.
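The depth expansion above is a per-layer linear combination of the base layers' weights. The following plain-Python sketch, with toy 2x2 weights chosen here as assumptions, shows the combination and one special case: when each row of D is one-hot, the expansion simply copies base layers, and the pattern D[l][l mod L1] = 1 reproduces the layer order of direct stacking. How D is actually initialized or trained is not covered by this sketch.

```python
def expand_depth(layer_weights, D):
    """Return W'_l = sum_j D[l][j] * W_j for every target layer l.

    layer_weights: list of L1 matrices (lists of rows) with identical shapes.
    D: L2 x L1 coefficient matrix.
    """
    rows, cols = len(layer_weights[0]), len(layer_weights[0][0])
    return [
        [[sum(D[l][j] * layer_weights[j][r][c] for j in range(len(layer_weights)))
          for c in range(cols)] for r in range(rows)]
        for l in range(len(D))
    ]

# Two base layers (L1 = 2) expanded to four layers (L2 = 4).
base = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 3.0], [4.0, 5.0]],
]
L1, L2 = 2, 4

# One-hot coefficients that copy layers in stacking order: 0, 1, 0, 1.
D_stack = [[1.0 if j == l % L1 else 0.0 for j in range(L1)] for l in range(L2)]
stacked = expand_depth(base, D_stack)

# A general D mixes layers, e.g. every target layer is the average of both base layers.
D_mix = [[0.5, 0.5]] * L2
mixed = expand_depth(base, D_mix)

print(stacked[2] == base[0], stacked[3] == base[1])
print(mixed[0])
```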

Section: B LLMs Framework and Training Details
Embedding Consider a vocabulary size $V$ and embedding size $d$. The embedding matrix is $E \in \mathbb{R}^{V \times d}$, and the one-hot matrix for the input tokens $X$ is denoted as $\mathbb{1}_X \in \mathbb{R}^{T \times V}$, where $T$ is the sequence length. Formally, it can be written as:
$$\mathrm{Embedding}(X) = \mathbb{1}_X E \quad (43)$$
For $i, j \in [V]$ with $i \neq j$, it is guaranteed that $E_i \neq E_j$.
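Multi-Head Attention Multi-Head Attention (MHA) consists of multiple attention heads, each of which computes its own self-attention. The results of these attention heads are then concatenated and projected to obtain the following output:
$$\begin{aligned} Q_i, K_i, V_i &= X W^{Q}_{i},\ X W^{K}_{i},\ X W^{V}_{i} \\ H_i &= \mathrm{softmax}\!\left(\frac{Q_i K_i^{T}}{\sqrt{d_h}}\right) V_i \\ \mathrm{MHA}(X) &= \mathrm{Concat}(H_1, \cdots, H_n)\, W^{O} \end{aligned} \quad (44)$$
Here, the input is $X \in \mathbb{R}^{T \times d}$ and the parameters are $W^{Q}_{i} \in \mathbb{R}^{d \times d_h}$, $W^{K}_{i} \in \mathbb{R}^{d \times d_h}$, $W^{V}_{i} \in \mathbb{R}^{d \times d_h}$, and $W^{O} \in \mathbb{R}^{d \times d}$, where $n \times d_h = d$.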
Feed Forward Network The Feed Forward Network (FFN) consists of two linear layers and the GeLU activation function. Typically, the two linear layers first perform an up-projection to $d_{FFN}$ and then down-project back to the dimension $d$. Therefore, FFN is defined as:
$$\mathrm{FFN}(X) = \mathrm{GeLU}(X W_{up})\, W_{down} \quad (45)$$
where the input is $X \in \mathbb{R}^{T \times d}$ and the parameters are $W_{up} \in \mathbb{R}^{d \times d_{FFN}}$ and $W_{down} \in \mathbb{R}^{d_{FFN} \times d}$.
SwiGLU LLaMA replaces the original FFN in the Transformer Decoder with SwiGLU, resulting in improved performance. SwiGLU consists of three linear layers and the swiglu activation function. It can be defined as:
$$\mathrm{SwiGLU}(X) = (X W_{gate} \odot \mathrm{swiglu}(X W_{up}))\, W_{down} \quad (46)$$
where $\odot$ denotes element-wise multiplication, the input is $X \in \mathbb{R}^{T \times d}$, and the parameters are $W_{up} \in \mathbb{R}^{d \times d_{FFN}}$, $W_{gate} \in \mathbb{R}^{d \times d_{FFN}}$, and $W_{down} \in \mathbb{R}^{d_{FFN} \times d}$.
RMSNorm Before MHA, FFN, or SwiGLU, there is a layer of RMSNorm to enhance the stability of the model. Compared to LayerNorm, RMSNorm is simpler in form. Formally, it can be written as:
$$\mathrm{RMSNorm}(X) = \frac{X}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} X_i^2}} \odot \mu \quad (47)$$
where $X \in \mathbb{R}^{T \times d}$ and the parameter $\mu \in \mathbb{R}^{d}$.
    loss $= \mathcal{L}(M_{t-1}, d_t)$
    $M_t \leftarrow \mathcal{A}(M_{t-1}, \text{loss})$
end
$M_0 = G(M_k)$
for $t = 1$ to $K$ do ▷ Target Model Training
    loss $= \mathcal{L}(M_{t-1}, D_t)$
    $M_t \leftarrow \mathcal{A}(M_{t-1}, \text{loss})$
end

Section: B.2 Details of Speedup Calculation
We calculate the speedup $sp$ between operator $G$ and scratch model pre-training by:
$$sp = \frac{\mathrm{FLOPs}_{scratch}}{\mathrm{FLOPs}_{G}} - 1 \quad (48)$$
where $\mathrm{FLOPs}_{scratch}$ and $\mathrm{FLOPs}_{G}$ represent the FLOPs required by the scratch model and the $G$ model, respectively, to achieve the same loss.
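As a worked instance of Equation 48, the snippet below reproduces the 7B figure quoted in the abstract (300B tokens from scratch versus 194B tokens with G stack). It is a small sketch that assumes FLOPs grow linearly with the number of training tokens at a fixed model size, so the token ratio is used in place of the FLOPs ratio; the function name is illustrative.

```python
def speedup(flops_scratch, flops_growth):
    """sp = FLOPs_scratch / FLOPs_G - 1, both measured at the same loss."""
    return flops_scratch / flops_growth - 1.0

# Assumption: FLOPs are proportional to tokens for a fixed model size,
# so token counts can stand in for FLOPs in the ratio.
sp = speedup(300e9, 194e9)
print(f"{sp:.1%}")   # prints 54.6%
```

A speedup of 0 means both runs need the same compute, and a value of 1 would mean the grown model needs half the compute of the scratch model.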

Section: B.3 Details of Training Settings
We use TinyLlama [45] as our pre-training codebase. We employ FSDP (Fully Sharded Data Parallel) along with FlashAttention 2.0 [46] and other acceleration techniques. We use the open-source dataset Slimpajama-627B [47] for pre-training. The hyperparameters used for each model size are listed in Table 1. Our 7B model is trained at a rate of around 100B tokens per day on an NVIDIA Hopper cluster.

Section: D.4 Instruction Tuning Results on 3B


Section: E Compare with Other Opensource LLMs
In Table 3, we compare the harness evaluation results of the G stack model and the scratch model (Baseline), each trained for 100B tokens, with Pythia-1B [51] and TinyLlama-1.1B, which are trained on the same number of tokens. The comparison indicates that our baseline performs as expected and is comparable to Pythia-1B. Meanwhile, the G stack model significantly outperforms both the baseline and Pythia-1B, demonstrating the acceleration effect of G stack on the pre-training process.

Section: F Fitting Results for the Growth Factor g
Due to computational resource limitations, we only explore predicting g given N and C on the 1.1B and 3B models. We nevertheless attempt a fit using the equation:
$$\log_{10}(g) = a \log_{10}(N) + b \log_{10}(C) + c \quad (49)$$
In Equation 49, N represents the number of target parameters and g represents the growth factor. The fitting result is as follows:
(50)
We also visualize the fitted curves in Figure 22, but the results are mediocre due to the lack of data.

Section: F.1 Stacking Law Guidelines For Llama Families
We give an example of the empirical usage of G stack by using the configurations of the Llama2 and Llama3 families [21,7] to show the estimated optimal base model training tokens d and growth factor g in Table 4.

Section: H.2 Ablation: $f_2 \circ f_1 \circ f_0 \circ f_2 \circ f_1 \circ f_0$ or $f_2 \circ f_2 \circ f_1 \circ f_1 \circ f_0 \circ f_0$ (interpolation)
To investigate whether the connections between layers affect the performance of stacking, we compare two approaches for stacking small models into larger ones. The first approach takes the entire small model as a unit and stacks it directly, which retains the connections between most layers. The second approach replicates and interleaves each layer of the small model, which breaks most of the connections. To measure the degree of retention of inter-layer connections after stacking, we define the connection rate $R_c$:
$$R_c = \frac{Con_r}{Con_{all}} \quad (51)$$
where $Con_r$ is the number of retained connections and $Con_{all}$ is the number of all connections.
For example, if we had a small model with three layers, denoted as $f_2 \circ f_1 \circ f_0$, and desired a model depth of 6, the first approach would result in $f_2 \circ f_1 \circ f_0 \circ f_2 \circ f_1 \circ f_0$, with $R_c = 80\%$. The second approach would result in $f_2 \circ f_2 \circ f_1 \circ f_1 \circ f_0 \circ f_0$, with $R_c = 40\%$.
In our experiments, we stack a small model with 8 layers into a 24-layer target model. The growth timing d is 10B tokens and the growth factor g is 3. The $R_c$ of G stack is 91.3% and the $R_c$ of G interpolate is 30.4%. We report the training loss and the average accuracy on standard NLP benchmarks in Figure 35.
At the beginning of training, interpolated stacking performs as well as stacking the entire small model. However, as training continues, the performance of interpolated stacking deteriorates.
Therefore, we can conclude that the higher the connection rate of stacking, the better the effect of stacking. In Appendix H.3, we continue to validate this conclusion.
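The connection rates quoted above can be reproduced with a few lines of Python. In this sketch a connection between two adjacent layers of the grown model counts as retained when the same ordered pair was adjacent in the small model; layers are listed from input to output and identified by their index in the small model. The function names are illustrative.

```python
def connection_rate(layers):
    """R_c = retained connections / all connections between adjacent layers."""
    pairs = list(zip(layers, layers[1:]))
    retained = sum(1 for lower, upper in pairs if upper == lower + 1)
    return retained / len(pairs)

def stack(num_layers, g):
    """Direct stacking: repeat the whole small model g times."""
    return list(range(num_layers)) * g

def interpolate(num_layers, g):
    """Interpolation: repeat each layer g times in place."""
    return [i for i in range(num_layers) for _ in range(g)]

# 3-layer model grown to 6 layers.
print(connection_rate(stack(3, 2)), connection_rate(interpolate(3, 2)))   # 0.8 0.4

# 8-layer model grown to 24 layers with growth factor 3.
print(round(100 * connection_rate(stack(8, 3)), 1))         # 91.3
print(round(100 * connection_rate(interpolate(8, 3)), 1))   # 30.4
```

Direct stacking only breaks the connections at the seams between copies (2 of 23 in the 24-layer case), whereas interpolation keeps just one connection per pair of distinct neighbouring layers (7 of 23).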
We also report the detailed evaluation results on the eight standard NLP benchmarks.

Section: H.3 Ablation: Partial Stacking
Partial stacking has been explored in LLMs such as LlamaPro [42] and Solar [43]. However, their goal is to stack an off-the-shelf LLM such as Llama2, while our aim is to accelerate the LLM pre-training process.
To explore which layers of the small model should be stacked to achieve the best performance, we conduct experiments on partial stacking. In our experiments, we stack a small model with 6 layers ({L 1 , L
We report the training loss and the average accuracy on standard NLP benchmarks in Figure 37. By observing the loss curves in Figure 37a, we can see that the eight partial stacking methods are clearly divided into three groups based on their loss. The first group, {123456*4, 12-3456*5-56, 12-345*7-6, 123-456*7}, achieves the best performance. The second group, {1234-56*10, 12-34*10-56, 1-234*7-56}, achieves moderate performance. The third group, {123*7-456}, performs poorly, even worse than the baseline.
In Table 5, we summarize the eight partial stacking methods and calculate the $R_c$ of each method based on Equation 51.
For partial stacking, we conclude that: all > middle ≈ back ≫ front. Meanwhile, when the stacked parts are the same, the larger the R c , the better the performance.   
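The pattern names used above can be expanded into explicit layer sequences to compute their connection rates. The sketch below assumes the following reading of the notation, which is an interpretation made for this example: `-` separates segments, each digit is a layer index of the 6-layer small model, and `*k` repeats the segment k times. Under this reading every pattern expands to 24 layers. The computed rates follow the adjacency-based definition of Equation 51 and are not copied from Table 5.

```python
def parse_pattern(pattern):
    """Expand e.g. '12-345*7-6' into [1, 2, 3, 4, 5, 3, 4, 5, ..., 6].

    Assumed notation: '-' separates segments, digits are layer ids,
    and '*k' repeats the segment k times.
    """
    layers = []
    for segment in pattern.split("-"):
        block, _, repeat = segment.partition("*")
        layers += [int(ch) for ch in block] * (int(repeat) if repeat else 1)
    return layers

def connection_rate(layers):
    """R_c = retained / all, counting (i, i + 1) neighbours as retained."""
    pairs = list(zip(layers, layers[1:]))
    return sum(1 for a, b in pairs if b == a + 1) / len(pairs)

patterns = ["123456*4", "12-3456*5-56", "12-345*7-6", "123-456*7",
            "1234-56*10", "12-34*10-56", "1-234*7-56", "123*7-456"]

for p in patterns:
    seq = parse_pattern(p)
    print(f"{p:>14}  depth={len(seq)}  R_c={connection_rate(seq):.1%}")
```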

Section: I.2 Breaking Function Preserving by Adding Noise
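For the down projection in SwiGLU and the output projection in MultiHeadAttention, we apply noise:
$$W_{noise} \leftarrow (1 - \alpha) W + \alpha \epsilon, \quad \text{where } \epsilon \sim \mathcal{N}\!\left(0, \frac{1}{d \times l^{2}}\right) \quad (53)$$
For the Embedding Layer and other Linear Layers, we apply noise:
$$W_{noise} \leftarrow (1 - \alpha) W + \alpha \epsilon, \quad \text{where } \epsilon \sim \mathcal{N}\!\left(0, \frac{2}{5d}\right) \quad (54)$$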
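The interpolation between the grown weights and fresh Gaussian noise can be sketched as follows. This is a minimal plain-Python illustration: the second argument of the normal distribution is read as a variance, `l` is taken to be the number of layers, and the toy sizes and the value of α are assumptions made for the example.

```python
import math
import random

def add_noise(W, alpha, variance, rng):
    """Return (1 - alpha) * W + alpha * eps with eps ~ N(0, variance), element-wise."""
    std = math.sqrt(variance)
    return [[(1 - alpha) * w + alpha * rng.gauss(0.0, std) for w in row] for row in W]

rng = random.Random(0)
d, l = 4, 2                  # toy hidden size and layer count (assumed values)
W = [[0.1 * (i + j) for j in range(d)] for i in range(d)]

# Down projection in SwiGLU / output projection in MHA: variance 1 / (d * l^2), Equation 53.
W_out_noisy = add_noise(W, alpha=0.2, variance=1.0 / (d * l ** 2), rng=rng)

# Embedding and other linear layers: variance 2 / (5 d), Equation 54.
W_other_noisy = add_noise(W, alpha=0.2, variance=2.0 / (5 * d), rng=rng)

print(W_out_noisy[0])
print(W_other_noisy[0])
```

With α = 0 the weights are returned unchanged and function preservation is kept; larger values of α move the weights toward a fresh random initialization and break function preservation progressively.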
[Figure: Adding Noise on G direct to Break FP; x-axis: FLOPs (1e+20)]

Section: J Results on Samba
We utilize the codebase from Samba, which implements a hybrid State Space Model using the Slimpajama dataset for LM. In this experiment, we follow the guidelines outlined in the main paper to guide our stacking process. With a parameter size of 410M and training on 100B tokens, we set the growth timing to 8B tokens and the growth factor to 3. We opted for 3 instead of 4 because Samba interleaves Mamba and self-attention layers. Since the target model has 12 layers, the base model must have an even number of layers, leading us to select a 4-layer base model (Mamba-SA-Mamba-SA).
Our experimental results on loss curves (Figure 43) and downstream tasks (Table 7) indicate that stacking also works beyond Transformer-based LLMs. Please note that in Table 7, we report the stacked model at 47B rather than 50B tokens to account for the additional compute required to train the base model on 8B tokens.

Section: L Societal Impacts
As a successful exploration of efficient LLM pre-training, our work has great potential to have a positive societal impact towards sustainable AI. Nevertheless, as is common for LLMs, there is a chance that our LLMs might be misused intentionally or unintentionally.
NeurIPS Paper Checklist

Section: Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes] Justification: In the Abstract, we clearly elucidate our contributions, and at the end of Section 1 Introduction, we further detail our contributions and scope.
Guidelines:
• The answer NA means that the abstract and introduction do not include the claims made in the paper. • The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. • The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. • It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

Section: Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: In Section 7, we discuss the limitations of our work.
Guidelines:
• The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. • The authors are encouraged to create a separate "Limitations" section in their paper.
• The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. • The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. • The authors should reflect on the factors that influence the performance of the approach.
For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. • The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. • If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. • While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

Section: Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
• If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. • Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. • The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. • The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. • If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

Section: Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [No] Justification: Our study is an empirical exploration. The dataset we use is an open-source high-quality corpus, and the models we release are intended solely for further research and are not meant for direct industrial application.
Guidelines:
• The answer NA means that the paper poses no such risks.
• Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. • Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. • We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

Section: Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes] Justification: Please refer to Appendix B.3.
Guidelines:
• The answer NA means that the paper does not use existing assets.
• The authors should cite the original paper that produced the code package or dataset.
• The authors should state which version of the asset is used and, if possible, include a URL. • The name of the license (e.g., CC-BY 4.0) should be included for each asset.
• Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. • We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution. • For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.

Section: Acknowledgments
We thank all constructive comments from anonymous reviewers. Reynold Cheng and Wenyu Du were supported by the Hong Kong Jockey Club Charities Trust (Project 260920140), the University of Hong Kong (Project 109000579), the HKU Outstanding Research Student Supervisor Award 2022-23, and the HKU Faculty Exchange Award 2024 (Faculty of Engineering).

Section: 
Answer: [NA] Justification: Our study is an empirical exploration.
Guidelines:
• The answer NA means that the paper does not include theoretical results.
• All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. • All assumptions should be clearly stated or referenced in the statement of any theorems.
• The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. • Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. • Theorems and Lemmas that the proof relies upon should be properly referenced.

Section: Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes] Justification: We report our detailed training settings in Appendix B.3.

Guidelines:
• The answer NA means that the paper does not include experiments.
• If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. • If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. • Depending on the contribution, reproducibility can be accomplished in various ways.
For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. • While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. , with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility.
In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. • Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

Section: Open access to data and code


Section: Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes] Justification: We report the detailed settings in Appendix B.3
Guidelines:
• The answer NA means that the paper does not include experiments.
• The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. • The full details can be provided either with the code, in appendix, or as supplemental material.

Section: Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [No] Justification: LLMs pre-training consumes a significant amount of computational resources, making it impractical to conduct multiple experiments to obtain error bars.
Guidelines:
• The answer NA means that the paper does not include experiments.
• The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
• The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). • The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) • The assumptions made should be given (e.g., Normally distributed errors).
• It should be clear whether the error bar is the standard deviation or the standard error of the mean.
Guidelines:
• The answer NA means that the paper does not include experiments.
• The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. • The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. • The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

Section: Code Of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?

Answer: [Yes]
Justification: We have read this code.
Guidelines:
• The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
• If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. • The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

Section: Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [Yes] Justification: We have a section in the Appendix L to discuss societal impacts.
Guidelines:
• The answer NA means that there is no societal impact of the work performed.
• For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. • If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. • For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. • If this information is not available online, the authors are encouraged to reach out to the asset's creators.

Section: New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [Yes]
Justification: All code and models will be fully released under the CC-BY 4.0 license.
Guidelines:
• The answer NA means that the paper does not release new assets.
• Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. • The paper should discuss whether and how consent was obtained from people whose asset is used. • At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

Section: Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA] Justification: This work does not involve crowdsourcing nor research with human subjects.
Guidelines:
• The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. • Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. • According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

Section: Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: This work does not involve crowdsourcing nor research with human subjects.
Guidelines:
• The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.


References:
[b0] Tom B Brown; Benjamin Mann; Nick Ryder; Melanie Subbiah; Jared Kaplan; Prafulla Dhariwal; Arvind Neelakantan; Pranav Shyam; Girish Sastry; Amanda Askell; Sandhini Agarwal; Ariel Herbert-Voss; Gretchen Krueger; Tom Henighan; Rewon Child; Aditya Ramesh; Daniel M Ziegler; Jeffrey Wu; Clemens Winter; Christopher Hesse; Mark Chen; Eric Sigler; Mateusz Litwin; Scott Gray; Benjamin Chess; Jack Clark; Christopher Berner; Sam McCandlish; Alec Radford; Ilya Sutskever; Dario Amodei (2020). Language models are few-shot learners. 
[b1] Jason Wei; Yi Tay; Rishi Bommasani; Colin Raffel; Barret Zoph; Sebastian Borgeaud; Dani Yogatama; Maarten Bosma; Denny Zhou; Donald Metzler; Ed H Chi; Tatsunori Hashimoto; Oriol Vinyals; Percy Liang; Jeff Dean; William Fedus (2022). Emergent abilities of large language models. 
[b2] Jared Kaplan; Sam Mccandlish; Tom Henighan; Tom B Brown; Benjamin Chess; Rewon Child; Scott Gray; Alec Radford; Jeffrey Wu; Dario Amodei (2020). Scaling laws for neural language models. 
[b3] Jordan Hoffmann; Sebastian Borgeaud; Arthur Mensch; Elena Buchatskaya; Trevor Cai; Eliza Rutherford; Diego de Las Casas; Lisa Anne Hendricks; Johannes Welbl; Aidan Clark; Tom Hennigan; Eric Noland; Katie Millican; George van den Driessche; Bogdan Damoc; Aurelia Guy; Simon Osindero; Karen Simonyan; Erich Elsen; Jack W Rae; Oriol Vinyals; Laurent Sifre (2022). Training compute-optimal large language models. 
[b4] Ibrahim Alabdulmohsin; Behnam Neyshabur; Xiaohua Zhai (2022). Revisiting neural scaling laws in language and vision. 
[b5] Mengwei Xu; Wangsong Yin; Dongqi Cai; Rongjie Yi; Daliang Xu; Qipeng Wang; Bingyang Wu; Yihao Zhao; Chen Yang; Shihe Wang; Qiyang Zhang; Zhenyan Lu; Li Zhang; Shangguang Wang; Yuanchun Li; Yunxin Liu; Xin Jin; Xuanzhe Liu (2024). A survey of resource-efficient llm and multimodal foundation models. 
[b6] AI@Meta (2024). Llama 3 model card. 
[b7] Carole-Jean Wu; Ramya Raghavendra; Udit Gupta; Bilge Acun; Newsha Ardalani; Kiwan Maeng; Gloria Chang; Fiona Aga Behram; James Huang; Charles Bai; Michael Gschwind; Anurag Gupta; Myle Ott; Anastasia Melnikov; Salvatore Candido; David Brooks; Geeta Chauhan; Benjamin Lee; Hsien-Hsin S Lee; Bugra Akyildiz; Maximilian Balandat; Joe Spisak; Ravi Jain; Mike Rabbat; Kim Hazelwood (2022). Sustainable ai: Environmental implications, challenges and opportunities. 
[b8] Alex de Vries (2023). The growing energy footprint of artificial intelligence. Joule
[b9] Tianqi Chen; Ian Goodfellow; Jonathon Shlens (2015). Net2net: Accelerating learning via knowledge transfer. 
[b10] Cheng Chen; Yichun Yin; Lifeng Shang; Xin Jiang; Yujia Qin; Fengyu Wang; Zhi Wang; Xiao Chen; Zhiyuan Liu; Qun Liu (2021). bert2bert: Towards reusable pretrained language models. 
[b11] Yite Wang; Jiahao Su; Hanlin Lu; Cong Xie; Tianyi Liu; Jianbo Yuan; Haibin Lin; Ruoyu Sun; Hongxia Yang (2023). Lemon: Lossless model expansion. 
[b12] Sheng Shen; Pete Walsh; Kurt Keutzer; Jesse Dodge; Matthew Peters; Iz Beltagy (2022). Staged training for transformer language models. PMLR
[b13] Linyuan Gong; Di He; Zhuohan Li; Tao Qin; Liwei Wang; Tieyan Liu (2019). Efficient training of bert by progressively stacking. PMLR
[b14] Peihao Wang; Rameswar Panda; Lucas Torroba Hennigen; Philip Greengard; Leonid Karlinsky; Rogerio Feris; David Daniel Cox; Zhangyang Wang; Yoon Kim (2023). Learning to grow pretrained models for efficient transformer training. 
[b15] Utku Evci; Bart Van Merrienboer; Thomas Unterthiner; Max Vladymyrov; Fabian Pedregosa (2022). Gradmax: Growing neural networks using gradient information. 
[b16] Yiqun Yao; Zheng Zhang; Jing Li; Yequan Wang (2024). Masked structural growth for 2x faster language model pre-training. 
[b17] Cheng Yang; Shengnan Wang; Chao Yang; Yuechuan Li; Ru He; Jingqiao Zhang (2020). Progressively stacking 2.0: A multi-stage layerwise training method for bert training speedup. 
[b18] Albert Q Jiang; Alexandre Sablayrolles; Arthur Mensch; Chris Bamford; Devendra Singh Chaplot; Diego de Las Casas; Florian Bressand; Gianna Lengyel; Guillaume Lample; Lucile Saulnier; Lélio Renard Lavaud; Marie-Anne Lachaux; Pierre Stock; Teven Le Scao; Thibaut Lavril; Thomas Wang; Timothée Lacroix; William El Sayed (2023). Mistral 7B. 
[b19] Xiang Li; Yiqun Yao; Xin Jiang; Xuezhi Fang; Xuying Meng; Siqi Fan; Peng Han; Jing Li; Li Du; Bowen Qin; Zheng Zhang; Aixin Sun; Yequan Wang (2023). Flm-101b: An open llm and how to train it with $100k budget. 
[b20] Hugo Touvron; Louis Martin; Kevin Stone; Peter Albert; Amjad Almahairi; Yasmine Babaei; Nikolay Bashlykov; Soumya Batra; Prajjwal Bhargava; Shruti Bhosale; Dan Bikel; Lukas Blecher; Cristian Canton Ferrer; Moya Chen; Guillem Cucurull; David Esiobu; Jude Fernandes; Jeremy Fu; Wenyin Fu; Brian Fuller; Cynthia Gao; Vedanuj Goswami; Naman Goyal; Anthony Hartshorn; Saghar Hosseini; Rui Hou; Hakan Inan; Marcin Kardas; Viktor Kerkez; Madian Khabsa; Isabel Kloumann; Artem Korenev; Punit Singh Koura; Marie-Anne Lachaux; Thibaut Lavril; Jenya Lee; Diana Liskovich; Yinghai Lu; Yuning Mao; Xavier Martinet; Todor Mihaylov; Pushkar Mishra; Igor Molybog; Yixin Nie; Andrew Poulton; Jeremy Reizenstein; Rashi Rungta; Kalyan Saladi; Alan Schelten; Ruan Silva; Eric Michael Smith; Ranjan Subramanian; Xiaoqing Ellen Tan; Binh Tang; Ross Taylor; Adina Williams; Jian Xiang Kuan; Puxin Xu; Zheng Yan; Iliyan Zarov; Yuchen Zhang; Angela Fan; Melanie Kambadur; Sharan Narang; Aurelien Rodriguez; Robert Stojnic; Sergey Edunov; Thomas Scialom (2023). Llama 2: Open foundation and fine-tuned chat models. 
[b21] Jean Kaddour; Oscar Key; Piotr Nawrot; Pasquale Minervini; Matt J Kusner (2023). No train no gain: Revisiting efficient training algorithms for transformer-based language models. 
[b22] Leo Gao; Jonathan Tow; Baber Abbasi; Stella Biderman; Sid Black; Anthony DiPofi; Charles Foster; Laurence Golding; Jeffrey Hsu; Alain Le Noac'h; Haonan Li; Kyle McDonell; Niklas Muennighoff; Chris Ociepa; Jason Phang; Laria Reynolds; Hailey Schoelkopf; Aviya Skowron; Lintang Sutawika; Eric Tang; Anish Thite; Ben Wang; Kevin Wang; Andy Zou (2023). A framework for few-shot language model evaluation. 
[b23] Scott Fahlman; Christian Lebiere (1989). The cascade-correlation learning architecture. Morgan-Kaufmann
[b24] Scott E Fahlman (1990). The recurrent cascade-correlation architecture. 
[b25] Steven Gutstein; Olac Fuentes; Eric A Freudenthal (2007). Knowledge transfer in deep convolutional neural nets. 
[b26] Jacob Devlin; Ming-Wei Chang; Kenton Lee; Kristina Toutanova (2019). Bert: Pre-training of deep bidirectional transformers for language understanding. 
[b27] Lemeng Wu; Bo Liu; Peter Stone; Qiang Liu (2021). Firefly neural architecture descent: a general approach for growing neural networks. 
[b28] Xin Yuan; Pedro Savarese; Michael Maire (2023). Accelerated training via incrementally growing neural networks using variance transfer and learning rate adaptation. 
[b29] Denis Paperno; Germán Kruszewski; Angeliki Lazaridou; Quan Ngoc Pham; Raffaella Bernardi; Sandro Pezzelle; Marco Baroni; Gemma Boleda; Raquel Fernández (2016). The lambada dataset: Word prediction requiring a broad discourse context. 
[b30] Peter Clark; Isaac Cowhey; Oren Etzioni; Tushar Khot; Ashish Sabharwal; Carissa Schoenick; Oyvind Tafjord (2018). Think you have solved question answering? try arc, the ai2 reasoning challenge. 
[b31] Jian Liu; Leyang Cui; Hanmeng Liu; Dandan Huang; Yile Wang; Yue Zhang (2020). Logiqa: A challenge dataset for machine reading comprehension with logical reasoning. 
[b32] Yonatan Bisk; Rowan Zellers; Ronan Le Bras; Jianfeng Gao; Yejin Choi (2019). Piqa: Reasoning about physical commonsense in natural language. 
[b33] Johannes Welbl; Nelson F Liu; Matt Gardner (2017). Crowdsourcing multiple choice science questions. 
[b34] Keisuke Sakaguchi; Ronan Le Bras; Chandra Bhagavatula; Yejin Choi (2019). Winogrande: An adversarial winograd schema challenge at scale. 
[b35] Stephen Merity; Caiming Xiong; James Bradbury; Richard Socher (2016). Pointer sentinel mixture models. 
[b36] Wayne Xin Zhao; Kun Zhou; Junyi Li; Tianyi Tang; Xiaolei Wang; Yupeng Hou; Yingqian Min; Beichen Zhang; Junjie Zhang; Zican Dong; Yifan Du; Chen Yang; Yushuo Chen; Zhipeng Chen; Jinhao Jiang; Ruiyang Ren; Yifan Li; Xinyu Tang; Zikang Liu; Peiyu Liu; Jian-Yun Nie; Ji-Rong Wen (2023). A survey of large language models. 
[b37] Ziheng Jiang; Haibin Lin; Yinmin Zhong; Qi Huang; Yangrui Chen; Zhi Zhang; Yanghua Peng; Xiang Li; Cong Xie; Shibiao Nong; Yulu Jia; Sun He; Hongmin Chen; Zhihao Bai; Qi Hou; Shipeng Yan; Ding Zhou; Yiyao Sheng; Zhuo Jiang; Haohan Xu; Haoran Wei; Zhang Zhang; Pengfei Nie; Leqi Zou; Sida Zhao; Liang Xiang; Zherui Liu; Zhe Li; Xiaoying Jia; Jianxi Ye; Xin Jin; Xin Liu (2024). Megascale: Scaling large language model training to more than 10,000 gpus. 
[b38] Zhengxiao Du; Aohan Zeng; Yuxiao Dong; Jie Tang (2024). Understanding emergent abilities of language models from the loss perspective. 
[b39] Pauli Virtanen; Ralf Gommers; Travis E Oliphant; Matt Haberland; Tyler Reddy; David Cournapeau; Evgeni Burovski; Pearu Peterson; Warren Weckesser; Jonathan Bright; J Stéfan; Matthew Van Der Walt; Joshua Brett; K Wilson; Nikolay Jarrod Millman;  Mayorov; R J Andrew; Eric Nelson; Robert Jones; Eric Kern; C J Larson; İlhan Carey; Yu Polat; Eric W Feng; Jake Moore; Denis Vanderplas; Josef Laxalde; Robert Perktold; Ian Cimrman; E A Henriksen; Charles R Quintero; Anne M Harris; Antônio H Archibald; Fabian Ribeiro; Paul Pedregosa; Aditya Van Mulbregt; Alessandro Pietro Vijaykumar; Alex Bardelli; Andreas Rothberg; Andreas Hilboll; Anthony Kloeckner; Antony Scopatz; Ariel Lee; C Nathan Rokem; Chad Woods; Charles Fulton; Christian Masson; Clark Häggström; David A Fitzgerald; David R Nicholson; Dmitrii V Hagen; Emanuele Pasechnik; Eric Olivetti; Eric Martin; Fabrice Wieser; Felix Silva; Florian Lenders; G Wilhelm; Gavin A Young; Gert-Ludwig Price; Gregory E Ingold; Gregory R Allen; Hervé Lee; Irvin Audren; Jörg P Probst; Jacob Dietrich; James T Silterra; Janko Webber; Joel Slavič; Johannes Nothman; Johannes Buchner; Johannes L Kulick; José Schönberger; Miranda Vinícius De; Joscha Cardoso; Joseph Reimer; Juan Harrington; Juan Luis Cano Rodríguez; Justin Nunez-Iglesias; Kevin Kuczynski; Martin Tritz; Matthew Thoma; Matthias Newville; Maximilian Kümmerer; Michael Bolingbroke; Mikhail Tartre; Nathaniel J Pak; Nikolai Smith; Nikolay Nowaczyk; Oleksandr Shebanov;  Pavlyk; A Per; Perry Brodtkorb; Robert T Lee; Roman Mcgibbon; Sam Feldbauer; Sam Lewis; Scott Tygier; Sebastiano Sievert; Stefan Vigna; Surhud Peterson; Tadeusz More; Takuya Pudlik; Thomas J Oshima; Thomas P Pingel; Thomas Robitaille;  Spura; R Thouis; Tim Jones; Tim Cera; Tiziano Leslie; Tom Zito; Utkarsh Krauss; Yaroslav O Upadhyay; Yoshiki Halchenko;  Vázquez-Baeza (2020-02). Scipy 1.0: fundamental algorithms for scientific computing in python. Nature Methods
[b40] Haihang Wu; Wei Wang; Tamasha Malepathirana; Damith Senanayake; Denny Oetomo; Saman Halgamuge (2024). When to grow? a fitting risk-aware policy for layer growing in deep neural networks. 
[b41] Chengyue Wu; Yukang Gan; Yixiao Ge; Zeyu Lu; Jiahao Wang; Ye Feng; Ping Luo; Ying Shan (2024). Llama pro: Progressive llama with block expansion. 
[b42] Dahyun Kim; Chanjun Park; Sanghoon Kim; Wonsung Lee; Wonho Song; Yunsu Kim; Hyeonwoo Kim; Yungi Kim; Hyeonju Lee; Jihoo Kim (2023). Solar 10.7 b: Scaling large language models with simple yet effective depth up-scaling. 
[b43] Nikunj Saunshi; Stefani Karp; Shankar Krishnan; Sobhan Miryoosefi; Sashank J Reddi; Sanjiv Kumar (2024). On the inductive bias of stacking towards improving reasoning. 
[b44] Peiyuan Zhang; Guangtao Zeng; Tianduo Wang; Wei Lu (2024). Tinyllama: An open-source small language model. 
[b45] Tri Dao; Daniel Y Fu; Stefano Ermon; Atri Rudra; Christopher Ré (2022). Flashattention: Fast and memory-efficient exact attention with io-awareness. 
[b46] Daria Soboleva; Faisal Al-Khateeb; Robert Myers; Jacob R Steeves; Joel Hestness; Nolan Dey (2023). SlimPajama: A 627B token cleaned and deduplicated version of RedPajama. 
[b47] Rowan Zellers; Ari Holtzman; Yonatan Bisk; Ali Farhadi; Yejin Choi (2019). Hellaswag: Can a machine really finish your sentence?. 
[b48] Stephanie Lin; Jacob Hilton; Owain Evans (2022). Truthfulqa: Measuring how models mimic human falsehoods. 
[b49] Dan Hendrycks; Collin Burns; Steven Basart; Andy Zou; Mantas Mazeika; Dawn Song; Jacob Steinhardt (2021). Measuring massive multitask language understanding. 
[b50] Stella Biderman; Hailey Schoelkopf; Quentin Gregory Anthony; Herbie Bradley; Kyle O'Brien; Eric Hallahan; Mohammad Aflah Khan; Shivanshu Purohit; USVSN Sai Prashanth; Edward Raff (2023). Pythia: A suite for analyzing large language models across training and scaling. PMLR
[b51] Leo Gao; Stella Biderman; Sid Black; Laurence Golding; Travis Hoppe; Charles Foster; Jason Phang; Horace He; Anish Thite; Noa Nabeshima (2020). The pile: An 800gb dataset of diverse text for language modeling. 
[b52] Dirk Groeneveld; Iz Beltagy; Pete Walsh; Akshita Bhagia; Rodney Kinney; Oyvind Tafjord; Ananya Harsh Jha; Hamish Ivison; Ian Magnusson; Yizhong Wang; Shane Arora; David Atkinson; Russell Authur; Khyathi Chandu; Arman Cohan; Jennifer Dumas; Yanai Elazar; Yuling Gu; Jack Hessel; Tushar Khot; William Merrill; Jacob Morrison; Niklas Muennighoff; Aakanksha Naik; Crystal Nam; Matthew E Peters; Valentina Pyatkin; Abhilasha Ravichander; Dustin Schwenk; Saurabh Shah; Will Smith; Nishant Subramani; Mitchell Wortsman; Pradeep Dasigi; Nathan Lambert; Kyle Richardson; Jesse Dodge; Kyle Lo; Luca Soldaini; Noah A Smith; Hannaneh Hajishirzi (2024). Olmo: Accelerating the science of language models. 
[b53] Zhengzhong Liu; Aurick Qiao; Willie Neiswanger; Hongyi Wang; Bowen Tan; Tianhua Tao; Junbo Li; Yuqi Wang; Suqi Sun; Omkar Pangarkar; Richard Fan; Yi Gu; Victor Miller; Yonghao Zhuang; Guowei He; Haonan Li; Fajri Koto; Liping Tang; Nikhil Ranjan; Zhiqiang Shen; Xuguang Ren; Roberto Iriondo; Cun Mu; Zhiting Hu; Mark Schulze; Preslav Nakov; Tim Baldwin; Eric P Xing (2023). Llm360: Towards fully transparent open-source llms. 
[b54] Luca Soldaini; Rodney Kinney; Akshita Bhagia; Dustin Schwenk; David Atkinson; Russell Authur; Ben Bogin; Khyathi Chandu; Jennifer Dumas; Yanai Elazar; Valentin Hofmann; Ananya Harsh Jha; Sachin Kumar; Li Lucy; Xinxi Lyu; Nathan Lambert; Ian Magnusson; Jacob Morrison; Niklas Muennighoff; Aakanksha Naik; Crystal Nam; Matthew E Peters; Abhilasha Ravichander; Kyle Richardson; Zejiang Shen; Emma Strubell; Nishant Subramani; Oyvind Tafjord; Pete Walsh; Luke Zettlemoyer; Noah A Smith; Hannaneh Hajishirzi; Iz Beltagy; Dirk Groeneveld; Jesse Dodge; Kyle Lo (2024). Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research. 

Figures:
Figure fig_1: 1
Type: figure
Caption: Figure 1: The training loss for two 7B LLMs, trained from scratch and with $G^{\uparrow}_{\text{direct}}$ ($G_{\text{stack}}$). At 300B tokens, $G_{\text{stack}}$ accelerates by 54.6% compared to scratch.
Data: 

Figure fig_2: 2
Type: figure
Caption: Figure 2: A simplified illustration of the four growth operators $G_{\text{direct}}$, $G_{\text{learn}}$, $G_{\text{zero}}$ and $G_{\text{random}}$, each of which can grow the model widthwise (intra-layer, $G^{\rightarrow}$) or depthwise (layer-wise, $G^{\uparrow}$). $W_n$ denotes the parameters before growth, while $D_n$, $R_n$ and $O$ denote growth parameters that are derived from the old parameters, randomly initialized, and zero-initialized, respectively. Except for $G_{\text{direct}}$, the other three operators are illustrated only for widthwise growth.
Data: 

Figure fig_3: 
Type: figure
Caption: Lambada
Data: 

Figure fig_4: 3
Type: figure
Caption: Figure 3: We evaluate operators using training loss and eight standard NLP benchmarks: Lambada [30], ARC-c [31], ARC-e [31], Logiqa [32], PIQA [33], Sciq [34], Winogrande [35] and Wikitext PPL [36]. After $8 \times 10^{20}$ FLOPs of training, $G^{\uparrow}_{\text{direct}}$ demonstrates a significant speedup.
Data: 

Figure fig_7: 7
Type: figure
Caption: Figure 7: We plot scaling law lines based on 410M, 1.1B, 3B and 7B LLMs and make two predictions at the same losses as the original computationally optimized 13B and 70B LLMs.
Data: 

Figure fig_8: 
Type: figure
Caption: $C_1$ and $C_2$ represent the FLOPs required to train the initial small model, $C_1 = \mathrm{FLOPs}(n, d)$, and the large model, $C_2 = \mathrm{FLOPs}(N, D)$, respectively, where $n$ and $d$ denote the parameters and training tokens of the small model, and $N$ and $D$ represent the parameters and training tokens of the large model. Since the large model is grown by a factor of $g$ such that $N = gn$, we have $C = C_1 + C_2 = \mathrm{FLOPs}(g, N, d) + \mathrm{FLOPs}(N, D) = \mathrm{FLOPs}(g, N, d, D)$.
Data: 

Figure fig_9: 89
Type: figure
Caption: Figure 8: For 410M, 1.1B and 3B LLMs, we plot smoothed loss curves for different growth timings $d$ at a set of fixed FLOPs budgets to form IsoFLOP figures. We find a clear valley in the loss, indicating that for a given FLOP budget there exists an optimal growth timing $d$ for the $G_{\text{stack}}$ operation. So when given a budget $C$, our objective is to identify the optimized values $D$, $N$, $d$, $g$ that minimize the loss $L(D, N, d, g)$. However, simultaneously optimizing these four variables can be computationally expensive. Therefore, instead of searching for a global optimum, we separately determine two factors closely related to $G_{\text{stack}}$: the training tokens for the small model (growth timing) $d$ and the growth factor $g$:
Data: 

Figure fig_10: 102
Type: figure
Caption: $\log_{10}(d) = a \log_{10}(N) + \frac{b}{\log_{10}(C)} + c \quad (2)$ After fitting, we obtain $a = 0.88$, $b = 163.27$ and $c = -5.74$, and we plot the contour figure in Figure 9. It can be observed that our estimated curves align well with the actual optimal points.
Data: 
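The fitted growth-timing relation can be evaluated directly. The sketch below is illustrative rather than the authors' code: it reads the `b` term as `b / log10(C)` (the fraction bar appears to have been lost in extraction) and assumes the common estimate `C ≈ 6 * N * D` for training FLOPs. Under these two assumptions the function returns values close to the `d` column of the "Stacking Law" guidelines table (Table 4). The names `approx_flops` and `growth_timing_tokens` are hypothetical.

```python
import math

# Fitted coefficients reported in the text: a, b, c.
A, B, C_CONST = 0.88, 163.27, -5.74


def approx_flops(n_params, n_tokens):
    # Assumption: the widely used C ≈ 6 * N * D estimate of training FLOPs.
    return 6.0 * n_params * n_tokens


def growth_timing_tokens(n_params, n_tokens):
    """Estimate d (training tokens for the small model) for a target model.

    n_params: parameters N of the target (grown) model.
    n_tokens: training tokens D of the target model.
    """
    c = approx_flops(n_params, n_tokens)
    log_d = A * math.log10(n_params) + B / math.log10(c) + C_CONST
    return 10.0 ** log_d


for name, n, tokens in [("Llama2-7B", 7e9, 2e12), ("Llama3-8B", 8e9, 15e12)]:
    d = growth_timing_tokens(n, tokens)
    print(f"{name}: d is about {d / 1e9:.2f}B tokens")
```

Because the fit is logarithmic in both `N` and `C`, small changes in the FLOPs estimate shift `d` only mildly, so the `6 * N * D` approximation is usually adequate for a first estimate.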

Figure fig_11: 10
Type: figure
Caption: Figure 10: For 1.1B and 3B LLMs, we plot smoothed loss curves for different growth factors $g$ at a set of fixed FLOPs budgets as IsoFLOP figures. The optimal $g$ falls between 2 and 4.
Data: 

Figure fig_12: 11
Type: figure
Caption: Figure 11: Training Loss on Slimpajama.
Data: 

Figure fig_13: 12
Type: figure
Caption: Figure 12: Evaluation results on growth in depth from small model (10B) by four operators.
Data: 

Figure fig_14: 13
Type: figure
Caption: Figure 13: Evaluation results on growth in depth from small model (50B) by four operators.
Data: 

Figure fig_15: 14
Type: figure
Caption: Figure 14: Evaluation results on growth in width from small model (10B) by four operators.
Data: 

Figure fig_16: 15
Type: figure
Caption: Figure 15: Evaluation results on growth in width from small model (50B) by four operators.
Data: 

Figure fig_17: 1617
Type: figure
Caption: Figure 16: Average accuracy of seven standard NLP benchmarks.
Data: 

Figure fig_18: 181920
Type: figure
Caption: Figure 18: Evaluation results on scratch model and $G_{\text{stack}}$ model at 3B size.
Data: 

Figure fig_19: 21
Type: figure
Caption: Figure 21: Evaluation results on scratch model and $G_{\text{stack}}$ model at 410M size.
Data: 

Figure fig_20: 10
Type: figure
Caption: $\log_{10}(g) = 1.01 \log_{10}(N) - \frac{29.88}{\log_{10}(C)} - 7.36 \quad (50)$
Data: 

Figure fig_21: 22
Type: figure
Caption: Figure 22: Visualization of Equation 50.
Data: 

Figure fig_22: 24
Type: figure
Caption: Figure 24: Evaluation results on 410M.
Data: 

Figure fig_23: 25
Type: figure
Caption: Figure 25: Training loss and standard NLP benchmarks average accuracy of 1.1B.
Data: 

Figure fig_25: 26
Type: figure
Caption: Figure 26: Evaluation results on 1.1B.
Data: 

Figure fig_26: 27
Type: figure
Caption: Figure 27: Training loss and standard NLP benchmarks average accuracy of 3B.
Data: 

Figure fig_28: 29
Type: figure
Caption: Figure 29: Training loss and standard NLP benchmarks average accuracy of 1.1B.
Data: 

Figure fig_30: 30
Type: figure
Caption: Figure 30: Evaluation results on 1.1B.
Data: 

Figure fig_32: 33
Type: figure
Caption: Figure 33: Training loss and standard NLP benchmarks average accuracy of scratch, $G_{\text{stack}}$ and $G_{\text{gradual}}$.
Data: 

Figure fig_33: 34
Type: figure
Caption: Figure 34: Evaluation results on scratch, $G_{\text{stack}}$ and gradual stacking in StackBert.
Data: 

Figure fig_34: 35
Type: figure
Caption: Figure 35: Training loss and standard NLP benchmarks average accuracy of scratch, $G_{\text{stack}}$ and interpolation.
Data: 

Figure fig_36: 36
Type: figure
Caption: Figure 36: Evaluation results on scratch, $G_{\text{stack}}$ and interpolation.
Data: 

Figure fig_37: 37
Type: figure
Caption: Figure 37: Training loss and standard NLP benchmarks average accuracy of scratch, $G_{\text{stack}}$ and other partial stacking.
Data: 

Figure fig_38: 38
Type: figure
Caption: Figure 38: Evaluation results on scratch, $G_{\text{stack}}$ and other partial stacking.
Data: 

Figure fig_40: 39
Type: figure
Caption: Figure 39: Training loss and standard NLP benchmarks average accuracy of scratch, $G^{\rightarrow}_{\text{direct}}$ and $G^{\rightarrow}_{\text{direct}}$ with 20% noise.
Data: 

Figure fig_42: 42
Type: figure
Caption: Figure 42: Evaluation results on scratch, $G_{\text{stack}}$ and $G_{\text{stack}}$ with 20% noise.
Data: 

Figure fig_44: 44
Type: figure
Caption: Figure 44 illustrates the loss spikes that occur right after stacking.
Data: 

Figure fig_45: 44
Type: figure
Caption: Figure 44: Loss Spikes in $G_{\text{stack}}$ (Non-FP) and $G^{\uparrow}_{\text{random}}$ (FP)
Data: 

Figure tab_1: 
Type: table
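Caption: Most notably, depthwise stacking $G^{\uparrow}_{\text{direct}}$ emerges as the clear winner among growth operators, surpassing its competitors in speedup, training loss and nearly every Harness evaluation metric. For example, compared to training models from scratch for 100B tokens, $G^{\uparrow}_{\text{direct}}$ achieves a significant efficiency gain, increasing training speed by 49.1%. For the calculation of speedup, please refer to Appendix B.2. Appendix C presents more experiments on these operators, including their training loss and evaluation figures.
Data: (The operator names heading columns 2 to 8 were lost in extraction; the first column is $G^{\uparrow}_{\text{direct}}$ and the last is scratch.)
| Metric | $G^{\uparrow}_{\text{direct}}$ | col. 2 | col. 3 | col. 4 | col. 5 | col. 6 | col. 7 | col. 8 | scratch |
|---|---|---|---|---|---|---|---|---|---|
| Lambada | 48.20 | 48.67 | 44.14 | 48.36 | 46.16 | 44.67 | 44.24 | 45.66 | 47.87 |
| ARC-c | 29.18 | 28.32 | 28.41 | 27.38 | 28.58 | 26.70 | 27.64 | 26.70 | 27.21 |
| ARC-e | 54.25 | 51.76 | 52.69 | 51.17 | 51.55 | 49.70 | 53.82 | 50.37 | 48.86 |
| Logiqa | 28.87 | 27.95 | 25.96 | 28.11 | 27.34 | 25.03 | 26.11 | 26.57 | 25.96 |
| PIQA | 71.98 | 71.81 | 70.78 | 71.16 | 69.47 | 69.74 | 70.13 | 69.91 | 69.64 |
| Sciq | 81.1 | 81.9 | 77.7 | 80.0 | 81.4 | 76.0 | 79.5 | 79.5 | 76.8 |
| Winogrande | 56.03 | 56.98 | 53.35 | 54.45 | 54.22 | 54.93 | 52.95 | 53.51 | 54.53 |
| Avg. | 52.80 | 52.48 | 50.43 | 51.52 | 51.25 | 49.54 | 50.63 | 50.32 | 50.12 |
| Wikitext PPL | 16.73 | 17.35 | 17.85 | 16.93 | 18.03 | 18.76 | 18.29 | 18.44 | 17.98 |
| Loss | 2.151 | 2.161 | 2.258 | 2.156 | 2.209 | 2.249 | 2.227 | 2.233 | 2.204 |
| Speedup | 49.1% | 46.6% | -25.7% | 48.6% | -0.7% | -17.9% | -13.8% | -15.4% | 0.0% |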

Figure tab_6: 
Type: table
Caption: Algorithm 2: LLMs Training with Growth Operator
Input: Growth operator $G$, loss function $\mathcal{L}$, iterative optimizer $A$. Dataset $\{d_1, d_2, \cdots, d_k\}$ for the base model. Dataset $\{D_1, D_2, \cdots, D_K\}$ for the target model.
Output: Target model $M_K$
Initial Phase: Initialize a base model $M_0$ from scratch.
for $t = 1$ to $k$ do ▷ Base Model Training
    loss
Data: 

Figure tab_7: 1
Type: table
Caption: Hyperparameters
Data:
| Size | Context Length | Batch Size | max-LR | min-LR | Warmup Steps | LR Scheduler |
|---|---|---|---|---|---|---|
| 410M | 2048 | 2M tokens | 6e-4 | 6e-5 | 3000 | cosine |
| 1.1B | 2048 | 2M tokens | 3e-4 | 3e-5 | 3000 | cosine |
| 3B | 2048 | 2M tokens | 1.6e-4 | 1.6e-5 | 3000 | cosine |
| 7B | 2048 | 2M tokens | 1e-4 | 1e-5 | 3000 | cosine |
C Training Loss and Evaluation Results of Four Operators in both Depth and Width growth
We have two small (base) models, one trained with token count d = 10B and another trained with token count d = 50B.

Figure tab_8: 2
Type: table
Caption: Evaluation Results after Instruction-Tuning (Higher better)
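Data: (Each method has two rows; the marks in the Tuning column were lost in extraction.)
| Method | Tokens | Tuning | lambada | arc-c | arc-e | logiqa | piqa | sciq | winogrande | avg |
|---|---|---|---|---|---|---|---|---|---|---|
| scratch | 400B | | 54.07 | 28.84 | 55.35 | 26.88 | 73.94 | 82.0 | 59.43 | 54.36 |
| scratch | 400B | | 60.35 | 31.48 | 56.1 | 27.04 | 74.32 | 81.2 | 60.14 | 55.8 |
| Gstack | 290B | | 55.04 | 32.34 | 58.08 | 28.88 | 73.88 | 79.6 | 61.8 | 55.66 |
| Gstack | 290B | | 61.34 | 34.98 | 59.97 | 29.65 | 75.14 | 80.1 | 60.22 | 57.34 |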

Figure tab_9: 3
Type: table
Caption: Compare with opensource LLMs on 1B
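Data:
| | Pythia-1B | TinyLlama-1.1B | G stack -1.1B | Baseline-1.1B |
|---|---|---|---|---|
| Datasets | Pile-300B [52] | Slimpajama-627B & Starcoder | Slimpajama-627B | Slimpajama-627B |
| Tokens | 100B | 103B | 100B | 100B |
| lambada | 53.52 | - | 48.20 | 47.87 |
| ARC-c | 25.59 | 24.32 | 29.18 | 27.21 |
| ARC-e | 47.26 | 44.91 | 54.25 | 48.86 |
| piqa | 69.31 | 67.30 | 71.98 | 69.64 |
| logiqa | 29.49 | - | 28.87 | 25.96 |
| sciq | 77.3 | - | 81.1 | 76.8 |
| winogrande | 51.22 | 53.28 | 56.03 | 54.53 |
| Avg. | 50.53 | - | 52.80 | 50.09 |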

Figure tab_10: 4
Type: table
Caption: "Stacking Law" Guidelines
Data:
| Model | N | D | d | g |
|---|---|---|---|---|
| Llama3-8B | 8B | 15T | 6.58B | 4 |
| Llama2-7B | 7B | 2T | 11.11B | 4 |
| Llama2-13B | 13B | 2T | 15.84B | 4 |
| Llama2-70B | 70B | 2T | 42.48B | 4 |
G Training Loss and Evaluation Results of "growth timing" and "growth factor"
G.1 "Growth Timing" d

Figure tab_19: 
Type: table
Caption: 2 
Data: 

Figure tab_20: 
Type: table
Caption: , $\cdots$, $L_6\}$) to a 24-layer target model. We set growth timing d = 10B tokens and growth factor g = 4. For simplicity, we use a format such as 1-234*7-56 to denote stacking layers 2, 3 and 4 seven times, while layer 1 and layers 5 and 6 are kept once before and after the repeated block.
Data: [Plot residue. Panel (a): Training Loss; panel (b): Average Accuracy. Axes: FLOPs (1e+20) and Tokens (Billions). Curves: scratch, Gstack(123456*4), Gstack(123*7-456), Gstack(1-234*7-56), Gstack(12-345*7-6), Gstack(12-34*10-56), Gstack(1234-56*10), Gstack(12-3456*5-56), Gstack(123-456*7).]
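The partial-stacking notation can be made concrete with a small parser. The following is an illustrative sketch, not code from the paper: the function name `expand_stacking` is hypothetical, and it assumes single-digit layer indices as in the 6-layer base model used here.

```python
def expand_stacking(spec):
    """Expand a spec such as "1-234*7-56" into the ordered list of source layers.

    Segments are separated by "-". A segment like "234*7" repeats layers
    2, 3, 4 seven times; a segment without "*" is kept once.
    Assumes single-digit layer indices.
    """
    order = []
    for segment in spec.split("-"):
        block, _, times = segment.partition("*")
        repeats = int(times) if times else 1
        order.extend([int(ch) for ch in block] * repeats)
    return order


for spec in ["123456*4", "12-3456*5-56", "1-234*7-56", "123*7-456"]:
    layers = expand_stacking(spec)
    print(spec, "->", len(layers), "layers, starting with", layers[:8])
```

Each of the listed specs expands to 24 layers, which matches the 24-layer target model described above.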

Figure tab_21: 5
Type: table
Caption: R c and stacked parts of each partial stacking method
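Data:
| Group | Method | Stacked parts | $R_c$ |
|---|---|---|---|
| First | 123456*4 | all | 87.0% |
| First | 12-3456*5-56 | middle-back | 78.3% |
| First | 12-345*7-6 | middle-back | 74.0% |
| First | 123-456*7 | back | 74.0% |
| Second | 1234-56*10 | back | 60.7% |
| Second | 12-34*10-56 | middle | 60.7% |
| Third | 1-234*7-56 | front-middle | 74.0% |
| Third | 123*7-456 | front | 74.0% |
Then, we report the evaluation results here.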

Figure tab_22: 6
Type: table
Caption: Comparison with open-source 7B LLMs at 130B tokens. Function preservation is a key concept that underlies diverse model growth approaches. It entails ensuring consistent output from a model, regardless of its expansion. Mathematically, let us define a function as F and a growth operator as G. The ultimate aim is to apply the operator G to the function
Data:
| | Pythia-6.9B | OLMo-7B [53] | Amber-7B [54] | G stack -7B |
|---|---|---|---|---|
| Datasets | Pile-300B [52] | Dolma [55] | Amber | Slimpajama-627B |
| Tokens | 130B | 133B | 132B | 130B |
| ARC-c | 33.28 | 28.58 | 29.01 | 35.24 |
| ARC-e | 59.81 | 51.60 | 55.05 | 63.64 |
| boolq | 63.39 | 55.05 | 60.18 | 66.45 |
| hellaswag | 60.03 | 54.52 | 61.21 | 65.85 |
| lambada | 65.11 | 49.91 | 57.13 | 57.93 |
| logiqa | 28.88 | 28.42 | 26.73 | 26.88 |
| obqa | 37.20 | 33.60 | 37.40 | 36.40 |
| piqa | 75.03 | 74.43 | 76.01 | 76.82 |
| sciq | 82.7 | 74.4 | 82.0 | 85.9 |
| winogrande | 60.14 | 53.75 | 56.83 | 62.75 |
| Avg. | 56.56 | 50.43 | 54.16 | 57.79 |
| Wikitext | 13.3340 | 18.4690 | 15.6202 | 12.5635 |
I Details of Function Preserving
I.1 Function Preserving

Figure tab_25: 
Type: table
Caption: Figure 43: The training loss for two Samba LLMs, trained from scratch and with G stack . At loss=2.48, 2.45, 2.42, G stack accelerates by 61.7%, 61.5% and 58.2% compared to scratch.
Data: 

Figure tab_26: 7
Type: table
Caption: Evaluation Results on Samba LLMs
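Data:
| Method | Tokens | lambada | arc-c | arc-e | logiqa | piqa | sciq | avg |
|---|---|---|---|---|---|---|---|---|
| scratch | 50B | 36.41 | 25.34 | 43.77 | 27.50 | 67.36 | 70.00 | 45.06 |
| Gstack | 47B | 38.44 | 26.19 | 44.95 | 26.88 | 67.95 | 72.80 | 46.20 |
K Loss Spikes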


Formulas:
Formula formula_0: $\mathcal{M} = \underbrace{M \circ M \circ \cdots \circ M}_{g \times M}$
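The depthwise stacking expressed by this formula can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: layers are stood in for by plain Python objects, and the function name `g_stack` is hypothetical.

```python
import copy


def g_stack(layers, g):
    """Return g deep-copied repetitions of `layers`, concatenated in order.

    `layers` is the trained small model's layer list; the result has
    g * len(layers) entries, mirroring M composed with itself g times.
    """
    if g < 1:
        raise ValueError("growth factor g must be >= 1")
    # Deep copies keep the stacked blocks independent during later training.
    return [copy.deepcopy(layer) for _ in range(g) for layer in layers]


small = [{"name": "f0"}, {"name": "f1"}, {"name": "f2"}]
grown = g_stack(small, g=2)
print([layer["name"] for layer in grown])
```

Copying rather than aliasing the layers matters in practice: if the repeated blocks shared storage, continued training would update them in lockstep instead of letting each copy diverge.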

Formula formula_1: $C = C_1 + C_2$.

Formula formula_2: $\arg\min_{G_{\text{learn}}} \mathbb{E}_{x \sim \mathcal{D}} \, \mathcal{L}(x; F_{\Theta}), \quad \text{where } \Theta = G_{\text{learn}}(\theta) \quad (3)$

Formula formula_3: E ′ = G direct (E) (4) = ER (5) = E I I d I(6)

Formula formula_4: Linear Consider W ∈ R dout×din , target parameter W ′ ∈ R Dout×Din , where d out ≤ D out , d in ≤ D in , G direct is defined as: W ′ = G direct (W ) (7) = LW R (8) = d out I I I W α β d in I(9)

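Formula formula_5: $\mu \in \mathbb{R}^{d}$, expanded parameter $\mu' = \frac{\sqrt{d}}{\sqrt{D}} [\mu, \mu_{0,D-d}] \in \mathbb{R}^{D}$: $\mathrm{RMSNorm}'(x') = \frac{x'}{\sqrt{\frac{1}{D} \sum_{i=1}^{D} x_i'^2}} \odot \mu' \quad (10) = \left[ \frac{\sqrt{\sum_{i=1}^{d} x_i^2}}{\sqrt{\sum_{i=1}^{D} x_i'^2}} \times \mathrm{RMSNorm}(x), \zeta \right] \quad (11)$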

Formula formula_6: $F = f_0 \circ f_1 \circ \cdots \circ f_l$.

Formula formula_7: $F' = F \circ F \circ \cdots \circ F$.
Algorithm 1: Operator $G_{\text{stack}}$

Formula formula_8: $M_0^{l} = M_k^{l}$
for $t = 2$ to $g$ do ▷ Model Stacking
    $M_0^{tl} = M_0^{(t-1)l} \circ M_k^{l}$
end
A.4 Details of $G_{\text{zero}}$

Formula formula_9: $E' = [E, O] \quad (12)$

Formula formula_10: $\mathrm{Embedding}'(x) = \mathbb{1}_x E' = [\mathrm{Embedding}(x), \mathbf{0}_{D-d}] \quad (13)$

Formula formula_11: $W' = \begin{bmatrix} W & A \\ O & C \end{bmatrix} \quad (14)$

Formula formula_12: $x' = [x, \mathbf{0}_{D_{in}-d_{in}}] \quad (15)$

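Formula formula_13: $\mathrm{Linear}'(x') = x' W'^{T} \quad (16) = [x, \mathbf{0}_{D_{in}-d_{in}}] \begin{bmatrix} W^{T} & O \\ A^{T} & C^{T} \end{bmatrix} \quad (17) = [x W^{T}, \mathbf{0}_{D_{out}-d_{out}}] \quad (18) = [\mathrm{Linear}(x), \mathbf{0}_{D_{out}-d_{out}}] \quad (19)$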
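The zero-block construction in Eqs. (14) to (19) can be checked numerically. The sketch below uses plain Python lists so that it needs no third-party library; the helper names `matmul_xWT` and `g_zero_linear` are hypothetical, and the random blocks `A` and `C` are arbitrary illustrative values.

```python
import random


def matmul_xWT(x, W):
    """Compute x @ W^T for a row vector x and a matrix W given as a list of rows."""
    return [sum(xi * wi for xi, wi in zip(x, row)) for row in W]


def g_zero_linear(W, A, C):
    """Build W' = [[W, A], [O, C]], where O is a zero block under W."""
    d_in = len(W[0])
    top = [w_row + a_row for w_row, a_row in zip(W, A)]
    bottom = [[0.0] * d_in + c_row for c_row in C]
    return top + bottom


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
d_in, d_out, D_in, D_out = 3, 2, 5, 4
W = rand_matrix(d_out, d_in)
A = rand_matrix(d_out, D_in - d_in)
C = rand_matrix(D_out - d_out, D_in - d_in)

x = [0.5, -1.0, 2.0]
x_big = x + [0.0] * (D_in - d_in)                  # x' = [x, 0]
y = matmul_xWT(x, W)                               # Linear(x)
y_big = matmul_xWT(x_big, g_zero_linear(W, A, C))  # Linear'(x')
print(y)
print(y_big)
```

The first `d_out` entries of `y_big` equal `y`, and the remaining entries are zero: `A` and `C` only ever multiply the zero padding of `x'`, and the zero block `O` removes any contribution of `x` to the new output dimensions.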

Formula formula_14: $\mu \in \mathbb{R}^{d}$, $G_{\text{zero}}$ expands it to $\mu' = [\alpha\mu, \xi]$ like $G_{\text{random}}$ in Appendix A.5, because the input must be $x' = [x, \mathbf{0}_{D-d}] \in \mathbb{R}^{D}$.

Formula formula_15: $E \in \mathbb{R}^{V \times d}$. The goal of $G_{\text{random}}$ is to expand it to $E' \in \mathbb{R}^{V \times D}$

Formula formula_16: $E' = [E, \mathcal{E}] \quad (20)$

Formula formula_17: $c = [\mathbf{1}_{d}, \mathbf{0}_{D-d}] \rightarrow [\mathbf{1}_{d}, \mathbf{1}_{D-d}] \quad (21)$

Formula formula_18: $\mathrm{Embedding}'(x) = \mathbb{1}_x E' \odot c = [\mathrm{Embedding}(x), \mathbf{0}_{D-d}] \quad (22)$

Formula formula_19: $W' = \begin{bmatrix} W & A \\ B & C \end{bmatrix} \quad (23)$

Formula formula_20: $x' = [x, \mathbf{0}_{D_{in}-d_{in}}] \quad (24)$

Formula formula_21: $x' W'^{T} = [x, \mathbf{0}_{D_{in}-d_{in}}] \begin{bmatrix} W^{T} & B^{T} \\ A^{T} & C^{T} \end{bmatrix} \quad (25) = [x W^{T}, x B^{T}] \quad (26)$

Formula formula_22: $c = [\mathbf{1}_{d_{out}}, \mathbf{0}_{D_{out}-d_{out}}] \rightarrow [\mathbf{1}_{d_{out}}, \mathbf{1}_{D_{out}-d_{out}}] \quad (27)$

Formula formula_23: $\mathrm{Linear}'(x') = x' W'^{T} \odot c = [\mathrm{Linear}(x), \mathbf{0}_{D_{out}-d_{out}}] \quad (28)$

Formula formula_24: $\mu' = [\alpha\mu, \xi] \in \mathbb{R}^{D}$,

Formula formula_25: $\sum_{i=0}^{D} x_i'^2 = \sum_{i=0}^{d} x_i^2 \quad (29)$

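Formula formula_26: $\mathrm{RMSNorm}'(x') = \frac{x'}{\sqrt{\frac{1}{D} \sum_{i=0}^{D} x_i'^2}} \odot \mu' \quad (30) = \frac{[x, \mathbf{0}_{D-d}]}{\sqrt{\frac{1}{D} \sum_{i=0}^{d} x_i^2}} \odot [\alpha\mu, \xi] \quad (31)$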

Formula formula_27: $= \left[ \frac{\sqrt{D}}{\sqrt{d}} \frac{x}{\sqrt{\frac{1}{d} \sum_{i=0}^{d} x_i^2}} \odot \alpha\mu, \; \mathbf{0}_{D-d} \right] \quad (32)$

Formula formula_28: $\mathrm{RMSNorm}'(x') = [\mathrm{RMSNorm}(x), \mathbf{0}_{D-d}] \quad (33)$
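The step from Eq. (32) to Eq. (33) can be verified numerically. The sketch below is illustrative only: it assumes the scale $\alpha = \sqrt{d}/\sqrt{D}$, which is the value that makes the prefactor in Eq. (32) cancel, and it omits the small epsilon that practical RMSNorm implementations add for numerical stability. The function name `rms_norm` is hypothetical.

```python
import math


def rms_norm(x, mu):
    """RMSNorm(x) = x / sqrt(mean(x_i^2)) * mu, elementwise, without epsilon."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return [v / rms * m for v, m in zip(x, mu)]


d, D = 4, 6
x = [1.0, -2.0, 0.5, 3.0]
mu = [0.9, 1.1, 1.0, 0.8]
xi = [0.3, -0.7]                       # arbitrary new weights; they only multiply zeros

alpha = math.sqrt(d / D)               # assumption: scale implied by Eqs. (32)-(33)
x_big = x + [0.0] * (D - d)            # x' = [x, 0]
mu_big = [alpha * m for m in mu] + xi  # mu' = [alpha * mu, xi]

small = rms_norm(x, mu)
big = rms_norm(x_big, mu_big)
print(small)
print(big)
```

Zero-padding the input lowers the mean square by a factor of `d / D`, which inflates the normalized activations by `sqrt(D / d)`; scaling the old weights by `alpha` cancels exactly that factor.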

Formula formula_29: $Y = X + \mathrm{MHA}(\mathrm{RMSNorm}(X)) \odot c \quad (34)$; $Y = X + \mathrm{SwiGLU}(\mathrm{RMSNorm}(X)) \odot c \quad (35)$; $c = \mathbf{0}_{D} \rightarrow \mathbf{1}_{D} \quad (36)$

Formula formula_30: $E' = E B_{\text{emb}}^{T} \quad (37)$

Formula formula_31: $\begin{cases} W'_Q = B_Q W_Q B_{\text{emb}}^{T} \\ W'_K = B_K W_K B_{\text{emb}}^{T} \\ W'_V = B_V W_V B_{\text{emb}}^{T} \\ W'_O = B_{\text{emb}} W_O B_V^{T} \\ \mu'_1 = B_{\text{emb}} \mu_1 \end{cases} \quad (38)$

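Formula formula_32: $W_{\text{up}}, W_{\text{gate}} \in \mathbb{R}^{d_{\text{mlp}} \times d}$, $W_{\text{down}} \in \mathbb{R}^{d \times d_{\text{mlp}}}$, RMSNorm $\mu_2 \in \mathbb{R}^{d}$, the parameter $B_{\text{mlp}} \in \mathbb{R}^{D_{\text{mlp}} \times d_{\text{mlp}}}$, we have: $\begin{cases} W'_{\text{up}} = B_{\text{mlp}} W_{\text{up}} B_{\text{emb}}^{T} \\ W'_{\text{down}} = B_{\text{emb}} W_{\text{down}} B_{\text{mlp}}^{T} \\ W'_{\text{gate}} = B_{\text{mlp}} W_{\text{gate}} B_{\text{emb}}^{T} \\ \mu'_2 = B_{\text{emb}} \mu_2 \end{cases} \quad (39)$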

Formula formula_33: $W'_{\text{head}} = W_{\text{head}} B_{\text{emb}} \quad (40)$

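Formula formula_34: $\begin{cases} W^{Q\prime}_{l} = \sum_{j=1}^{L_1} D^{Q}_{l,j} W^{Q}_{j} \\ W^{K\prime}_{l} = \sum_{j=1}^{L_1} D^{K}_{l,j} W^{K}_{j} \\ W^{V\prime}_{l} = \sum_{j=1}^{L_1} D^{V}_{l,j} W^{V}_{j} \\ W^{O\prime}_{l} = \sum_{j=1}^{L_1} D^{O}_{l,j} W^{O}_{j} \\ \mu^{(\text{ln1})\prime}_{l} = \sum_{j=1}^{L_1} D^{(\text{ln1})}_{l,j} \mu^{(\text{ln1})}_{j} \end{cases} \quad (41)$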

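Formula formula_35: $\begin{cases} W^{\text{up}\prime}_{l} = \sum_{j=1}^{L_1} D^{\text{up}}_{l,j} W^{\text{up}}_{j} \\ W^{\text{down}\prime}_{l} = \sum_{j=1}^{L_1} D^{\text{down}}_{l,j} W^{\text{down}}_{j} \\ W^{\text{gate}\prime}_{l} = \sum_{j=1}^{L_1} D^{\text{gate}}_{l,j} W^{\text{gate}}_{j} \\ \mu^{(\text{ln2})\prime}_{l} = \sum_{j=1}^{L_1} D^{(\text{ln2})}_{l,j} \mu^{(\text{ln2})}_{j} \end{cases} \quad (42)$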

Formula formula_36: $\mathrm{Embedding}(X) = \mathbb{1}_X E \quad (43)$

Formula formula_37: $E_i \neq E_j$.

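Formula formula_38: $Q_i, K_i, V_i = X W^{Q}_i, X W^{K}_i, X W^{V}_i$; $H_i = \mathrm{softmax}\left(\frac{Q_i K_i^{T}}{\sqrt{d_h}}\right) V_i$; $\mathrm{MHA}(X) = \mathrm{Concat}(H_1, \cdots, H_n) W^{O} \quad (44)$ here, the input $X \in \mathbb{R}^{T \times d}$, parameters $W^{Q}_i \in \mathbb{R}^{d \times d_h}$, $W^{K}_i \in \mathbb{R}^{d \times d_h}$, $W^{V}_i \in \mathbb{R}^{d \times d_h}$, and $W^{O} \in \mathbb{R}^{d \times d}$, where $n \times d_h = d$.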

Formula formula_39: $\mathrm{FFN}(X) = \mathrm{GeLU}(X W_{\text{up}}) W_{\text{down}} \quad (45)$

Formula formula_41: $W_{\text{up}} \in \mathbb{R}^{d \times d_{FFN}}$ and $W_{\text{down}} \in \mathbb{R}^{d_{FFN} \times d}$.

Formula formula_42: $\mathrm{SwiGLU}(X) = (X W_{\text{gate}} \odot \mathrm{swiglu}(X W_{\text{up}})) W_{\text{down}} \quad (46)$

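Formula formula_44: $X \in \mathbb{R}^{T \times d}$, parameter $W_{\text{up}} \in \mathbb{R}^{d \times d_{FFN}}$, $W_{\text{gate}} \in \mathbb{R}^{d \times d_{FFN}}$ and $W_{\text{down}} \in \mathbb{R}^{d_{FFN} \times d}$.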

Formula formula_45: $\mathrm{RMSNorm}(X) = \frac{X}{\sqrt{\frac{1}{d} \sum_{i=1}^{d} X_i^2}} \odot \mu \quad (47)$

Formula formula_46: $X \in \mathbb{R}^{T \times d}$, parameter $\mu \in \mathbb{R}^{d}$. B.

Formula formula_47: $= \mathcal{L}(M_{t-1}, d_t)$
    $M_t \leftarrow A(M_{t-1}, \text{loss})$
end
$M_0 = G(M_k)$
for $t = 1$ to $K$ do ▷ Target Model Training
    loss $= \mathcal{L}(M_{t-1}, D_t)$
    $M_t \leftarrow A(M_{t-1}, \text{loss})$
end
B.

Formula formula_48: $sp = \frac{\mathrm{FLOPs}_{\text{scratch}}}{\mathrm{FLOPs}_{G}} - 1 \quad (48)$
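As a worked instance of the speedup definition, the snippet below plugs in the 7B example from the abstract (300B tokens from scratch versus 194B tokens with $G_{\text{stack}}$ at the same loss). It is a sketch under one stated assumption: token counts are used as a stand-in for FLOPs, which holds only because both runs train a model of the same size. The function name `speedup` is hypothetical.

```python
def speedup(flops_scratch, flops_growth):
    """sp = FLOPs_scratch / FLOPs_G - 1, both measured at the same target loss."""
    if flops_growth <= 0:
        raise ValueError("flops_growth must be positive")
    return flops_scratch / flops_growth - 1.0


# Assumption: tokens stand in for FLOPs because both runs use the same model size.
sp = speedup(300e9, 194e9)
print(f"{sp:.1%}")
```

A value of zero means no gain over training from scratch, and negative values (as for several operators in the operator comparison table) mean the grown model needed more compute than the baseline to reach the same loss.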

Formula formula_49: $\log_{10}(g) = a \log_{10}(N) + \frac{b}{\log_{10}(C)} + c \quad (49)$

Formula formula_50: $f_2 \circ f_1 \circ f_0 \circ f_2 \circ f_1 \circ f_0$ or $f_2 \circ f_2 \circ f_1 \circ f_1 \circ f_0 \circ f_0$ (interpolation)

Formula formula_51: $R_c = \frac{Con_r}{Con_{all}} \quad (51)$

Formula formula_52: $f_2 \circ f_1 \circ f_0 \circ f_2 \circ f_1 \circ f_0$, where its $R_c = 80\%$. The second approach would result in $f_2 \circ f_2 \circ f_1 \circ f_1 \circ f_0 \circ f_0$, where its $R_c = 40\%$.

Formula formula_54: $W_{\text{noise}} \leftarrow (1 - \alpha) W + \alpha\epsilon$ where $\epsilon \sim \mathcal{N}\left(0, \frac{1}{d \times l^2}\right) \quad (53)$

Formula formula_55: $W_{\text{noise}} \leftarrow (1 - \alpha) W + \alpha\epsilon$ where $\epsilon \sim \mathcal{N}\left(0, \frac{2}{5d}\right) \quad (54)$

