Generalization Analysis of Linear Knowledge Distillation

Published: 29 May 2026, Last Modified: 29 May 2026, Venue: HiLD at ICML 2026 Poster, License: CC BY 4.0
Keywords: knowledge distillation, generalization, Gaussian mixture, implicit bias
TL;DR: We theoretically study the generalization of a linear student distilled from various teachers.
Abstract: Knowledge distillation (KD), a framework in which a smaller student model is trained under the guidance of a stronger teacher, has become a popular technique for model compression. Despite its empirical success, KD remains poorly understood theoretically. In this work, we study the generalization behavior of linear knowledge distillation (LKD), a simplified setting in which the student is restricted to a linear model. We first characterize the implicit bias of gradient descent on separable training data when the student is trained with LKD. Building on this characterization, we derive a bound on the population zero-one risk of the distilled student under binary Gaussian mixture data. Finally, we quantify the provable generalization benefit of LKD with various teachers over standard hard-label training.
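The setting described in the abstract can be illustrated with a minimal sketch. The abstract does not state the LKD objective, so everything below is an assumption made for illustration: a cross-entropy loss whose target is a convex combination of the hard label and a linear teacher's soft label, the mixing weight `alpha`, the identity-covariance mixture, and the helper names `sample_gmm`, `train_linear_student`, and `zero_one_risk`. It is not the paper's formulation.

```python
import numpy as np


def sample_gmm(n, mu, rng):
    """Binary Gaussian mixture (assumed form): x = y * mu + z, z ~ N(0, I), y in {-1, +1}."""
    y = rng.choice([-1.0, 1.0], size=n)
    x = y[:, None] * mu + rng.standard_normal((n, mu.shape[0]))
    return x, y


def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))


def train_linear_student(x, y, teacher_w, alpha=0.5, lr=0.1, steps=500):
    """Gradient descent on a linear student with mixed hard/soft targets.

    Assumed objective: cross-entropy against
    (1 - alpha) * hard_label + alpha * teacher_probability.
    """
    n, d = x.shape
    hard = (y + 1.0) / 2.0            # map {-1, +1} to {0, 1}
    soft = sigmoid(x @ teacher_w)     # soft labels from a linear teacher
    target = (1.0 - alpha) * hard + alpha * soft
    w = np.zeros(d)
    for _ in range(steps):
        # Gradient of the mean cross-entropy with respect to w
        grad = x.T @ (sigmoid(x @ w) - target) / n
        w -= lr * grad
    return w


def zero_one_risk(w, x, y):
    """Empirical zero-one risk of the classifier sign(<w, x>)."""
    return float(np.mean(np.sign(x @ w) != y))
```

Setting `alpha=0` recovers standard hard-label logistic regression, so the two training modes can be compared on a held-out sample drawn with `sample_gmm` by evaluating `zero_one_risk` for each.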
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 118