DistiLLM-2: A Contrastive Approach Boosts the Distillation of LLMs

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 Oral · CC BY 4.0
TL;DR: DistiLLM-2 improves Large Language Model (LLM) distillation by leveraging a contrastive approach that increases the likelihood of teacher responses while decreasing that of student responses.
Abstract: Despite the success of distillation in large language models (LLMs), most prior work applies identical loss functions to both teacher- and student-generated data. These strategies overlook the synergy between loss formulations and data types, limiting the performance gains of student models. To address this, we propose DistiLLM-2, a contrastive approach that simultaneously increases the likelihood of teacher responses and decreases that of student responses by harnessing this synergy. Our extensive experiments show that DistiLLM-2 not only builds high-performing student models across a wide range of tasks, including instruction-following and code generation, but also supports diverse applications, such as preference alignment and vision-language extensions. These findings highlight the potential of a contrastive approach to enhance the efficacy of LLM distillation by effectively aligning teacher and student models across varied data types.
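To make the abstract's core idea concrete, here is a minimal, hypothetical PyTorch sketch of a contrastive distillation objective that raises the student's likelihood of teacher-generated responses while lowering its likelihood of its own (student-generated) responses. This is an illustrative assumption, not the paper's actual loss; the function names, the pairwise log-sigmoid form, and the `beta` weight are all made up for illustration (see the linked repository for the authors' implementation).

```python
# Illustrative sketch only: a simple contrastive distillation loss that pushes
# the student's likelihood of teacher-generated responses up and the likelihood
# of its own responses down. Not the DistiLLM-2 objective; `beta` and the
# log-sigmoid pairing are assumptions for this example.
import torch
import torch.nn.functional as F


def sequence_log_prob(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Mean token log-probability of `labels` under `logits`.

    logits: (batch, seq_len, vocab); labels: (batch, seq_len), with -100 marking padding.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    mask = labels.ne(-100)
    safe_labels = labels.clamp_min(0)
    token_logp = log_probs.gather(-1, safe_labels.unsqueeze(-1)).squeeze(-1)
    return (token_logp * mask).sum(-1) / mask.sum(-1).clamp_min(1)


def contrastive_distill_loss(
    student_logits_on_teacher_resp: torch.Tensor,  # student scored on teacher-generated responses
    teacher_resp_labels: torch.Tensor,
    student_logits_on_student_resp: torch.Tensor,  # student scored on its own responses
    student_resp_labels: torch.Tensor,
    beta: float = 0.1,  # assumed trade-off weight, not taken from the paper
) -> torch.Tensor:
    # Likelihood the student assigns to each type of response.
    logp_teacher_resp = sequence_log_prob(student_logits_on_teacher_resp, teacher_resp_labels)
    logp_student_resp = sequence_log_prob(student_logits_on_student_resp, student_resp_labels)
    # Contrastive margin: prefer teacher responses over the student's own.
    return -F.logsigmoid(beta * (logp_teacher_resp - logp_student_resp)).mean()
```

In this toy form, the two response types are handled asymmetrically, which is the synergy between loss formulation and data type that the abstract refers to; the published method realizes this with its own loss design rather than the pairwise margin used here.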
Lay Summary: Large language models (LLMs) like ChatGPT are powerful but require a lot of computing power, making them hard to use in real-world applications. To solve this, researchers use a process called "distillation," which helps create smaller, faster models that try to mimic the abilities of larger ones. However, most previous distillation methods used the same training strategy for all types of model outputs, missing the chance to tailor training to each type of data. Our work introduces DistiLLM-2, a new approach that uses a "contrastive" training strategy. Instead of treating all responses the same way, DistiLLM-2 encourages the student model to imitate the good answers provided by the teacher model while steering it away from its own mistakes. By using different training techniques for each type of answer, we help the smaller model learn more effectively. We tested DistiLLM-2 on a wide variety of tasks, including following instructions, solving math problems, and writing code. The results show that our method creates small language models that perform better than those trained with older techniques. Beyond language, our method also improves models that can understand images and text together, and helps make AI run faster and more efficiently. We believe this work will help make AI more accessible and practical for everyone.
Link To Code: https://github.com/jongwooko/distillm-2
Primary Area: Deep Learning->Large Language Models
Keywords: knowledge distillation, efficiency, contrastive approach
Submission Number: 5637