SmolLM2: When Smol Goes Big — Data-Centric Training of a Fully Open Small Language Model

Published: 08 Jul 2025, Last Modified: 26 Aug 2025 · COLM 2025 · CC BY 4.0
Keywords: small language models, dataset, pretraining
TL;DR: SmolLM2 is a fully open 1.7B-parameter LM that achieves state-of-the-art performance through multi-stage training on diverse high-quality data and is released alongside new math, code, and instruction-tuning datasets.
Abstract: Large language models, while groundbreaking, are computationally expensive and difficult to deploy in resource-constrained settings. To address this challenge, small language models have emerged, but their performance critically depends on the quality and composition of the pretraining datasets—yet many recent models, such as Qwen2.5-1.5B and Llama3.2-1B, remain opaque about their training data, limiting reproducibility and scientific understanding. In this paper, we document and publicly release SmolLM2, a fully transparent state-of-the-art "small" (1.7 billion parameter) language model (LM), along with its training datasets and code. To attain strong performance, we overtrain SmolLM2 on 11 trillion tokens of data using a multi-stage training process that mixes web text with specialized math, code, and instruction-following data. We additionally curate and release new specialized datasets (FineMath, Stack-Edu, and SmolTalk) at stages where we found existing datasets to be problematically small or low-quality. To inform our design decisions, we perform both small-scale ablations and a manual refinement process that updates the dataset mixing rates at each stage based on the performance at the previous one. Ultimately, we demonstrate that SmolLM2 outperforms other recent small LMs including Qwen2.5-1.5B, Llama3.2-1B, and Falcon3-1.6B. By releasing our model, datasets, and code, we aim to facilitate future research on LM development as well as applications of small LMs.
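To make the stage-wise mixing idea described in the abstract concrete, below is a minimal Python sketch of how per-stage dataset sampling weights could be expressed and applied. The stage names, token budgets, and mixing rates here are illustrative placeholders only, not the actual SmolLM2 configuration reported in the paper.

```python
# Hypothetical sketch of stage-wise dataset mixing.
# Stage names, token budgets, and weights are placeholders, NOT SmolLM2's real settings.
from dataclasses import dataclass
import random


@dataclass
class Stage:
    name: str
    tokens: float            # token budget for this stage, in trillions (illustrative)
    mix: dict[str, float]    # dataset name -> sampling weight, summing to 1.0


# Later stages upweight specialized math/code data, mirroring the paper's
# idea of revising mixing rates between stages based on earlier results.
stages = [
    Stage("stable-web", tokens=6.0, mix={"web": 0.90, "code": 0.05, "math": 0.05}),
    Stage("upsample-specialized", tokens=4.0, mix={"web": 0.70, "code": 0.15, "math": 0.15}),
    Stage("final-high-quality", tokens=1.0, mix={"web": 0.50, "code": 0.20, "math": 0.30}),
]


def sample_source(stage: Stage) -> str:
    """Pick which dataset the next document is drawn from, per this stage's mix."""
    names, weights = zip(*stage.mix.items())
    return random.choices(names, weights=weights, k=1)[0]


if __name__ == "__main__":
    for stage in stages:
        counts = {name: 0 for name in stage.mix}
        for _ in range(10_000):  # simulate sampling documents for one training chunk
            counts[sample_source(stage)] += 1
        print(stage.name, counts)
```

Running the sketch simply shows that the empirical draw frequencies track each stage's declared weights; in an actual pipeline the same per-stage weights would drive the data loader rather than a simulation.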
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 477