Keywords: Knowledge distillation, Gradient-based metrics, Teacher–student alignment, Gradient diversity, Unified metric
TL;DR: A simple gradient-based metric, GradCV, predicts student performance after distilling from a teacher at different generation temperatures.
Abstract: Knowledge distillation is a primary strategy for producing powerful small models, where a “student” learns to mimic the generations of a more capable “teacher” model. It is of high practical value to understand what makes a teacher suitable for distillation, so that one can efficiently identify the teacher that yields the best student from a possibly large set of candidates. In this work, we show that good teachers should both align with the student and provide diverse training signals. Combining the two yields a single metric, GradCV, that strongly correlates with the student’s post-distillation performance. We demonstrate the effectiveness of GradCV on GSM8k and MATH with LLaMA and OLMo student models.
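The abstract does not spell out how GradCV is computed, so the sketch below is only one plausible reading, assuming “CV” refers to a coefficient-of-variation-style statistic over per-example student gradients on teacher-generated data (a small mean-gradient norm signals weak alignment, large per-example spread signals diverse training signal). The helper names (per_example_grads, grad_cv), the combination rule, and the toy regression setup are illustrative assumptions, not the authors’ implementation.

```python
# Hypothetical sketch of a gradient-based alignment/diversity statistic.
# NOT the paper's definition of GradCV; an assumed coefficient-of-variation reading.
import torch
import torch.nn as nn


def per_example_grads(model, inputs, targets, loss_fn):
    """Return an (num_examples, num_params) matrix of flattened per-example gradients."""
    grads = []
    for x, y in zip(inputs, targets):
        model.zero_grad()
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        loss.backward()
        flat = torch.cat([p.grad.reshape(-1) for p in model.parameters()])
        grads.append(flat.detach().clone())
    return torch.stack(grads)


def grad_cv(model, inputs, targets, loss_fn, eps=1e-12):
    """Ratio of per-example gradient spread (diversity) to the norm of the
    mean gradient (alignment); higher values indicate more diverse signals
    relative to the shared descent direction."""
    G = per_example_grads(model, inputs, targets, loss_fn)  # (n, d)
    mean_grad = G.mean(dim=0)                               # shared/aligned direction
    spread = (G - mean_grad).norm(dim=1).mean()             # diversity term
    return (spread / (mean_grad.norm() + eps)).item()


if __name__ == "__main__":
    torch.manual_seed(0)
    # Toy student and toy "teacher-generated" regression targets, purely for illustration.
    student = nn.Sequential(nn.Linear(8, 16), nn.Tanh(), nn.Linear(16, 1))
    xs = torch.randn(32, 8)
    ys = torch.randn(32, 1)
    print("GradCV (toy):", grad_cv(student, xs, ys, nn.MSELoss()))
```

In this reading, one would compute the statistic for a fixed student on data generated by each candidate teacher (e.g., at different temperatures) and compare the scores across teachers; the actual scoring rule in the paper may differ.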
Submission Number: 204