Keywords: Small Language Models, Efficient Pretraining, Distillation, Pruning, Sub-network extraction
TL;DR: We present a library for efficient pretraining of SLMs with sub-network extraction and distillation
Abstract: Small Language Models (SLMs) offer an efficient and accessible alternative to Large Language Models (LLMs), delivering strong performance while using far fewer resources. We introduce a simple and effective framework for pretraining SLMs that brings together three complementary ideas. First, we identify structurally sparse **sub-network initializations** that consistently outperform randomly initialized models of similar size under the same compute budget. Second, we use **evolutionary search** to automatically discover high-quality sub-network initializations, providing better starting points for pretraining. Third, we apply **knowledge distillation** from larger teacher models to speed up training and improve generalization. Together, these components make SLM pretraining substantially more efficient: our best model, discovered via evolutionary search and initialized with LLM weights, matches the validation perplexity of a comparable Pythia SLM while requiring **5.16x** and **1.26x** fewer floating-point operations for token budgets of 10B and 100B, respectively. We release all code publicly, offering a practical and reproducible path toward cost-efficient small language model development at scale.
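As a rough illustration of the distillation component described in the abstract, the sketch below shows one common way to blend a hard-label cross-entropy loss with a temperature-scaled KL term against a frozen teacher. The function name, `temperature`, and mixing weight `alpha` are illustrative assumptions; this is not the released library's API.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Minimal sketch of a standard distillation objective (assumed
    hyperparameters; not the paper's exact formulation).

    student_logits, teacher_logits: (batch, seq_len, vocab)
    labels: (batch, seq_len) token ids
    """
    vocab = student_logits.size(-1)
    student_flat = student_logits.view(-1, vocab)
    teacher_flat = teacher_logits.view(-1, vocab)

    # Hard-label next-token cross-entropy (the usual LM loss).
    ce = F.cross_entropy(student_flat, labels.view(-1))

    # Soft-target loss: KL divergence between temperature-scaled
    # student and teacher distributions, rescaled by T^2.
    kl = F.kl_div(
        F.log_softmax(student_flat / temperature, dim=-1),
        F.log_softmax(teacher_flat / temperature, dim=-1),
        log_target=True,
        reduction="batchmean",
    ) * temperature ** 2

    return alpha * ce + (1.0 - alpha) * kl
```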
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 5592