Keywords: knowledge distillation, regularization, understanding, underfitting
TL;DR: On the lack of fidelity in knowledge distillation
Abstract: Knowledge distillation has been widely used to improve the performance of a ``student'' network by training it to mimic the soft probabilities of a ``teacher'' network. Yet, for self-distillation to work, the student {\em must} deviate from the teacher in some manner \citep{stanton21does}. We conduct a variety of experiments across image and language classification datasets to understand more precisely the nature of student-teacher deviations and how they relate to accuracy gains. Our first key empirical observation is that in a majority of our settings, the student underfits points that the teacher finds hard. Next, we find that student-teacher deviations during the \textit{initial} phase of training are \textit{not} crucial for obtaining the benefits of distillation: simply switching to distillation in the middle of training can recover a significant fraction of distillation's accuracy gains.
We then provide two parallel theoretical perspectives on student-teacher deviations, one casting distillation as a regularizer in eigenspace and the other as a denoiser of gradients. Under both views, we argue how the student-teacher deviations reported in our experiments may emerge, and how they may relate to generalization. Importantly, our analysis bridges key gaps between existing theory and practice by focusing on gradient descent and avoiding label-noise assumptions.
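For concreteness, the sketch below illustrates the standard distillation setup the abstract refers to: the student matches the teacher's temperature-softened probabilities, and the "switch to distillation in the middle of training" setting simply replaces the cross-entropy loss with the distillation loss after a chosen step. This is a minimal illustrative sketch, not the authors' exact training code; the temperature value and switch_step argument are assumptions for illustration.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=4.0):
        """KL divergence between temperature-softened teacher and student distributions."""
        log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
        p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
        # The T^2 factor keeps gradient magnitudes comparable across temperatures.
        return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

    def training_loss(student_logits, teacher_logits, labels, step, switch_step):
        """One-hot cross-entropy early in training, distillation afterwards
        (the mid-training switch described in the abstract)."""
        if step < switch_step:
            return F.cross_entropy(student_logits, labels)
        return distillation_loss(student_logits, teacher_logits)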
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Deep Learning and representational learning