Abstract: Knowledge distillation is the procedure of transferring "knowledge" from a large model (the teacher) to a more compact one (the student), often used in the context of model compression. When both models have the same architecture, this procedure is called self-distillation. Several works have anecdotally shown that a self-distilled student can outperform the teacher on held-out data. In this study, we conduct a comprehensive analysis of self-distillation with a focus on vision classification across various settings. First, we show that even with a highly accurate teacher, self-distillation allows a student to surpass the teacher in all cases. Second, we revisit published works on self-distillation and present empirical experiments suggesting that their proposed explanations may be incomplete. Third, we provide an alternative explanation for the dynamics of self-distillation through the lens of loss landscape geometry. We conduct extensive experiments to show that self-distillation leads to flatter minima, thereby improving generalization. Finally, we study which properties self-distillation can transfer from teachers to students, beyond task accuracy. We show that a student can inherit natural robustness by leveraging the soft outputs of the teacher, whereas training solely on ground-truth labels leaves the student less robust.
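As background for the distillation procedure described in the abstract, the sketch below shows the standard soft-target distillation objective (Hinton et al., 2015), which combines a temperature-softened KL term against the teacher with ordinary cross-entropy on ground-truth labels. The temperature and mixing weight are illustrative placeholders, not values from this submission, and the paper's exact training objective may differ.

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(student_logits, teacher_logits, labels,
                           temperature=4.0, alpha=0.5):
    """Soft-target distillation loss (sketch; hyperparameters are illustrative).

    Mixes a KL term between temperature-softened teacher and student
    distributions with cross-entropy on ground-truth labels.
    """
    # Soft targets: KL(teacher || student) at temperature T,
    # scaled by T^2 to keep gradient magnitudes comparable.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # Hard targets: ordinary cross-entropy with ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```

In self-distillation, the teacher and student share the same architecture; the teacher is a previously trained copy whose logits are computed with gradients detached.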
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Alexander_A_Alemi1
Submission Number: 956