Chain-of-Learngene: A Scalable Learngene-based Paradigm for Building and Initializing Variable-Sized Language Models

16 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Learngene, Large Language Model, Model Initialization
Abstract: Large language models (LLMs) show strong performance across a wide range of tasks, yet deploying them in resource-constrained environments remains costly. A common alternative is to pre-train small language models (SLMs) from scratch, but this approach demands substantial computation and often suffers from limited model capacity. Knowledge distillation (KD) improves SLM performance by transferring knowledge from LLMs, but generating SLMs of varying sizes typically requires repeated teacher (LLM) inference, which remains computationally expensive. To address these challenges, we propose \textbf{Chain-of-Learngene (CoL)}, a scalable framework for efficiently initializing multi-scale SLMs for diverse resource-constrained settings. \textbf{CoL} is inspired by the Learngene framework, which extracts compact yet expressive components (\textit{learngene}) from a pre-trained ancestor model (AnsNet) to initialize descendant models (DesNets) of different sizes. Building on this idea, \textbf{CoL} constructs a sparse sequence of intermediate models, forming a \textit{learngene chain}, through a small number of stepwise distillation steps from the AnsNet. In addition, a \textit{bridge distillation} mechanism is introduced to support AnsNets with different architectures or vocabularies. Finally, \textbf{CoL} initializes variable-sized SLMs via parameter interpolation between adjacent models in the chain, thereby eliminating repeated access to the LLM. Experiments show that \textbf{CoL} significantly improves efficiency, scalability, and downstream performance. For instance, a 138M DesNet initialized by \textbf{CoL}, without any recovery pre-training, outperforms models trained from scratch on a 10B-token corpus.
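The interpolation step described above can be illustrated with a minimal sketch. The snippet below is not the paper's released code: it assumes the two adjacent chain models share parameter names and shapes at the blended granularity (the names `state_a`, `state_b`, `alpha`, and `interpolate_init` are illustrative), and it isolates only the linear-blending idea.

```python
import torch


def interpolate_init(state_a, state_b, alpha):
    """Initialize a target-sized model by blending two adjacent chain models.

    state_a, state_b: state dicts of the chain models that bracket the
    target size; alpha in [0, 1] weights toward state_b (e.g., set from
    the target size's relative position between the two chain sizes).
    Assumption for this sketch: matching parameter names and shapes.
    """
    return {name: torch.lerp(state_a[name], state_b[name], alpha)
            for name in state_a}
```

In the actual method, adjacent chain models differ in size, so some mapping between their parameter spaces (e.g., slicing shared dimensions or selecting layers) would have to precede the blend; the sketch shows only the interpolation itself.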
Supplementary Material: zip
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 6763