A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs

ICLR 2026 Conference Submission21184 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Large language models, knowledge distillation, data selection, efficiency
TL;DR: We leverage small LMs as teachers during knowledge distillation to improve large LM pre-training in both quality and training efficiency, and we rigorously support our method with novel statistical results.
Abstract: A primary challenge in developing large language models (LLMs) is their onerous pre-training cost. This paper explores a promising paradigm to improve LLM pre-training efficiency and quality by leveraging a small language model (SLM). In particular, this paradigm relies on an SLM to both (1) provide soft labels as additional supervision, and (2) select a small subset of valuable training examples. Put together, this enables an effective transfer of the SLM's predictive distribution to the LLM, while prioritizing specific regions of the training data distribution. Empirically, this leads to reduced LLM training time compared to standard training, while improving the overall quality. Theoretically, we develop a statistical framework to study the utility of SLMs in enabling efficient training of high-quality LLMs. Our framework characterizes how the SLM's seemingly low-quality supervision can enhance the training of a much more capable LLM. Furthermore, it also highlights the need for an adaptive utilization of such supervision, by striking a balance between the bias and variance introduced by the SLM-provided soft labels. We corroborate our theoretical framework by using smaller LMs to improve the pre-training of LLMs with 2.8B and 8.6B parameters on the Pile dataset.
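The two roles the abstract assigns to the SLM can be illustrated with a minimal sketch: a distillation loss that mixes hard-label cross-entropy with a KL term toward the SLM's soft labels, and an SLM-based score used to keep a subset of examples. This is not the authors' released code; the mixing weight `alpha`, the temperature, the `keep_fraction`, and the "keep the SLM's hardest examples" selection rule are illustrative assumptions.

```python
# Minimal sketch of SLM-assisted LLM pre-training (illustrative, not the paper's code):
# (1) soft-label supervision from a frozen small LM, blended with the usual
#     next-token cross-entropy, and
# (2) SLM-based scoring to select a subset of training examples.
import torch
import torch.nn.functional as F


def slm_assisted_loss(llm_logits, slm_logits, targets, alpha=0.5, temperature=1.0):
    """Blend hard-label cross-entropy with a KL term toward the SLM's soft labels."""
    vocab = llm_logits.size(-1)
    ce = F.cross_entropy(llm_logits.view(-1, vocab), targets.view(-1))
    # Soft labels from the frozen small LM; temperature controls their sharpness.
    soft_targets = F.softmax(slm_logits.view(-1, vocab) / temperature, dim=-1)
    log_probs = F.log_softmax(llm_logits.view(-1, vocab) / temperature, dim=-1)
    kd = F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature**2
    # alpha trades off the (biased) SLM supervision against the (noisy) hard labels.
    return alpha * kd + (1.0 - alpha) * ce


def select_examples(slm_losses, keep_fraction=0.5):
    """Keep the examples on which the SLM's loss is highest (one plausible rule)."""
    k = max(1, int(keep_fraction * slm_losses.numel()))
    return torch.topk(slm_losses, k).indices


if __name__ == "__main__":
    batch, seq, vocab = 4, 16, 100
    llm_logits = torch.randn(batch, seq, vocab, requires_grad=True)
    slm_logits = torch.randn(batch, seq, vocab)  # produced by a frozen small LM
    targets = torch.randint(0, vocab, (batch, seq))

    loss = slm_assisted_loss(llm_logits, slm_logits, targets, alpha=0.3)
    loss.backward()

    # Per-example SLM loss used as a data-selection score.
    per_example_slm_loss = F.cross_entropy(
        slm_logits.view(-1, vocab), targets.view(-1), reduction="none"
    ).view(batch, seq).mean(dim=1)
    kept = select_examples(per_example_slm_loss, keep_fraction=0.5)
    print(loss.item(), kept.tolist())
```

In the paper's terms, `alpha` is the knob that would be adapted to balance the bias introduced by the SLM's soft labels against the variance of training on hard labels alone; the exact adaptation schedule and selection criterion are left abstract here.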
Primary Area: foundation or frontier models, including LLMs
Submission Number: 21184