Keywords: Small Language Models, Efficient Pretraining, Distillation, Pruning, Sub-network extraction
TL;DR: We present a library for efficient pretraining of SLMs with sub-network extraction and distillation
Abstract: Small Language Models (SLMs) offer an efficient and accessible alternative to Large Language Models (LLMs), delivering strong performance while using far fewer resources. We introduce a simple and effective framework for pretraining SLMs that brings together three complementary ideas. First, we identify structurally sparse **sub-network initializations** that consistently outperform randomly initialized models of similar size under the same compute budget. Second, we use **evolutionary search** to automatically discover high-quality sub-network initializations, providing better starting points for pretraining. Third, we apply **knowledge distillation** from larger teacher models to speed up training and improve generalization. Together, these components make SLM pretraining substantially more efficient: our best model, discovered via evolutionary search and initialized with LLM weights, matches the validation perplexity of a comparable Pythia SLM while requiring **5.16x** and **1.26x** fewer floating-point operations for token budgets of 10B and 100B, respectively. We release all code publicly, offering a practical and reproducible path toward cost-efficient small language model development at scale.
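As a rough illustration of the distillation component described in the abstract, the sketch below shows one common way to blend a hard-label cross-entropy loss with a temperature-scaled KL term against a frozen teacher. The function name, `temperature`, and mixing weight `alpha` are illustrative assumptions; this is not the released library's API.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Minimal sketch of a standard distillation objective (assumed
    hyperparameters; not the paper's exact formulation).

    student_logits, teacher_logits: (batch, seq_len, vocab)
    labels: (batch, seq_len) token ids
    """
    vocab = student_logits.size(-1)
    student_flat = student_logits.view(-1, vocab)
    teacher_flat = teacher_logits.view(-1, vocab)

    # Hard-label next-token cross-entropy (the usual LM loss).
    ce = F.cross_entropy(student_flat, labels.view(-1))

    # Soft-target loss: KL divergence between temperature-scaled
    # student and teacher distributions, rescaled by T^2.
    kl = F.kl_div(
        F.log_softmax(student_flat / temperature, dim=-1),
        F.log_softmax(teacher_flat / temperature, dim=-1),
        log_target=True,
        reduction="batchmean",
    ) * temperature ** 2

    return alpha * ce + (1.0 - alpha) * kl
```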
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 5592