Keywords: Knowledge distillation, Directional coverage, Gradient variance, Cross-validation, Best teacher prediction
TL;DR: GRACE is a gradient-based score that efficiently predicts the best teacher for knowledge distillation, without requiring teacher internals or test data
Abstract: Knowledge distillation is an efficient strategy to use data generated by large teacher language models to train smaller “capable” student models, but selecting the optimal teacher for a specific student-task combination requires expensive trial-and-error. We propose a lightweight score called GRACE to quantify how effective a teacher will be when post-training a student model to solve math problems. GRACE efficiently measures distributional properties of student gradients, and it can be computed without access to a verifier, teacher logits, teacher internals, or test data. From an information-theoretic perspective, GRACE measures leave-one-out stability in gradient-based algorithms, directly connecting it to the generalization performance of distilled student models. On GSM8K and MATH, GRACE correlates strongly (up to 86%) with the performance of the distilled Llama and OLMo students. In particular, training on the GRACE-selected teacher provides at least a 6% improvement over naively using the best-performing teacher. We further demonstrate the utility of GRACE in providing guidance on crucial design choices in distillation, including (1) the best temperature to use when generating from the teacher, (2) the best teacher to use given a size constraint, and (3) the best teacher to use within a specific model family. Altogether, our findings demonstrate that GRACE can efficiently and effectively identify the most compatible teacher for a given student and provide fine-grained guidance on how to perform distillation.
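The abstract does not spell out GRACE's formula, so the sketch below is only a rough, non-authoritative illustration of what "distributional properties of student gradients" could look like in code: it scores a teacher's generated data by the variance and directional spread of per-example student gradients, echoing the "gradient variance" and "directional coverage" keywords. The HuggingFace-style model interface, the helper names, and the way the two statistics are combined are all assumptions, not the paper's method.

```python
# Illustrative sketch only: GRACE's actual definition is not given in this abstract.
# Assumed idea: score a candidate teacher by distributional statistics of the
# student's per-example loss gradients on that teacher's generated data,
# using no verifier, teacher logits, teacher internals, or test data.

import torch
import torch.nn.functional as F


def per_example_gradients(student, batch_input_ids, batch_labels):
    """Return a [B, D] matrix of flattened per-example loss gradients w.r.t.
    the student's parameters. Assumes a HuggingFace-style causal LM that
    returns an object with a .logits field; in practice one would likely
    restrict D to a parameter subset to keep this tractable."""
    grads = []
    for input_ids, labels in zip(batch_input_ids, batch_labels):
        student.zero_grad(set_to_none=True)
        logits = student(input_ids.unsqueeze(0)).logits
        loss = F.cross_entropy(
            logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100
        )
        loss.backward()
        flat = torch.cat(
            [p.grad.flatten() for p in student.parameters() if p.grad is not None]
        )
        grads.append(flat.detach())
    return torch.stack(grads)


def grace_like_score(grad_matrix):
    """Toy teacher-compatibility score (hypothetical): combine low gradient
    variance with high directional coverage of per-example gradients."""
    unit = F.normalize(grad_matrix, dim=1)           # per-example gradient directions
    mean_dir = unit.mean(dim=0)
    coverage = 1.0 - mean_dir.norm().item()          # 0 = all aligned, 1 = widely spread
    variance = grad_matrix.var(dim=0).mean().item()  # mean per-coordinate variance
    return coverage / (1.0 + variance)               # higher = more compatible (toy)
```

Under this toy scoring, one would compute the score once per candidate teacher (on a small batch of that teacher's generated traces) and pick the teacher with the highest value; the actual GRACE score and its leave-one-out stability interpretation are described in the paper itself.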
Primary Area: foundation or frontier models, including LLMs
Submission Number: 20641