Distillation Scaling Laws

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We provide a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher.
Abstract: We propose a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings mitigate the risks associated with large-scale distillation by enabling compute-optimal allocation for both the teacher and student to maximize student performance. We provide compute-optimal distillation recipes for two key scenarios: when a teacher already exists, and when a teacher needs training. In settings involving many students or an existing teacher, distillation outperforms supervised learning up to a compute level that scales predictably with student size. Conversely, if only one student is to be distilled and a teacher also requires training, supervised learning is generally preferable. Additionally, our large-scale study of distillation increases our understanding of the process and helps inform experimental design.
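To make the compute accounting in the abstract concrete, the sketch below shows one hypothetical way to search for a compute-optimal teacher/student allocation under a fixed FLOP budget. It is an illustrative assumption, not the paper's fitted law: it charges training with the standard ~6ND approximation and teacher inference over the distillation tokens with ~2ND, uses a Chinchilla-style parametric loss (with Hoffmann et al.'s published coefficients as placeholders) floored by the teacher's loss as a stand-in for the distilled student's loss, and searches over arbitrary candidate grids.

```python
# Hypothetical sketch of compute-optimal teacher/student allocation under a
# fixed budget. The loss form, coefficients, and candidate grids below are
# illustrative assumptions, NOT the distillation scaling law fitted in the paper.

def train_flops(params, tokens):
    # Standard ~6*N*D approximation for transformer training FLOPs.
    return 6 * params * tokens

def infer_flops(params, tokens):
    # Standard ~2*N*D approximation for the teacher's forward passes
    # over the distillation tokens.
    return 2 * params * tokens

def parametric_loss(n, d, a=406.4, b=410.7, e=1.69, alpha=0.34, beta=0.28):
    # Chinchilla-style supervised loss, used here only as a placeholder form.
    return e + a / n**alpha + b / d**beta

def student_loss(n_student, d_student, teacher_loss):
    # Illustrative stand-in: the distilled student behaves like a supervised
    # model of the same size/tokens, but cannot beat its teacher's loss.
    return max(parametric_loss(n_student, d_student), teacher_loss)

def best_allocation(budget_flops, n_student, teacher_exists=False):
    """Grid-search teacher size/tokens and distillation tokens under one budget.

    If the teacher already exists, only its inference over the distillation
    tokens is charged; otherwise its pretraining compute is charged as well,
    mirroring the two scenarios described in the abstract.
    """
    best = None
    for n_teacher in [3e8, 1e9, 3e9, 1e10]:            # candidate teacher sizes
        for d_teacher in [2e10, 1e11, 5e11]:           # candidate teacher tokens
            for d_student in [2e10, 1e11, 5e11, 2e12]: # candidate distillation tokens
                cost = (train_flops(n_student, d_student)
                        + infer_flops(n_teacher, d_student))
                if not teacher_exists:
                    cost += train_flops(n_teacher, d_teacher)
                if cost > budget_flops:
                    continue
                loss = student_loss(n_student, d_student,
                                    parametric_loss(n_teacher, d_teacher))
                if best is None or loss < best[0]:
                    best = (loss, n_teacher, d_teacher, d_student)
    return best

if __name__ == "__main__":
    # Example: 1e21 FLOPs, a 500M-parameter student, teacher still to be trained.
    print(best_allocation(budget_flops=1e21, n_student=5e8, teacher_exists=False))
```

The `teacher_exists` flag is the only difference between the two recipes the abstract distinguishes: when the teacher is free (already trained, or amortized over many students), more of the budget flows to distillation tokens; when the teacher must be trained for a single student, the search naturally favors spending the budget on supervised training instead.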
Lay Summary: (1) Training powerful machine learning models is very expensive. A technique called "distillation" can create smaller, more efficient "student" models by having them learn from more capable "teacher" models. But figuring out how to best spend limited computing power to train both and get a good student is a major challenge, making large projects risky. (2) We developed a predictive formula – a "distillation scaling law" – that estimates how well the student will perform based on the total computing budget and how it's split between training the teacher and the student. This allowed us to create practical "recipes" for the best way to allocate these resources, whether a teacher model already exists or also needs to be built. (3) Our research helps take the guesswork out of distillation, reducing costs and risks. These recipes guide users to get the best possible student for their budget, showing that distillation can outperform traditional methods in specific situations, especially when an expert "teacher" model is already available or many "students" need training. Ultimately, this work improves our understanding of how machine learning models can efficiently learn from each other.
Primary Area: Deep Learning->Large Language Models
Keywords: scaling laws, distillation, pretraining, LLMs, large language models
Submission Number: 12010