Revisiting Self-Distillation

TMLR Paper956 Authors

16 Mar 2023 (modified: 18 Jun 2023) · Rejected by TMLR
Abstract: Knowledge distillation is the procedure of transferring "knowledge" from a large model (the teacher) to a more compact one (the student), and is often used in the context of model compression. When both models have the same architecture, the procedure is called self-distillation. Several works have anecdotally shown that a self-distilled student can outperform the teacher on held-out data. In this study, we conduct a comprehensive analysis of self-distillation across various settings, with a focus on vision classification. First, we show that even with a highly accurate teacher, self-distillation allows the student to surpass the teacher in all cases. Second, we revisit published explanations of self-distillation and present empirical experiments suggesting that they are potentially incomplete. Third, we offer an alternative explanation for the dynamics of self-distillation through the lens of loss-landscape geometry: extensive experiments show that self-distillation leads to flatter minima and thereby to better generalization. Finally, we study which properties, beyond task accuracy, self-distillation can transfer from teacher to student. We show that the student can inherit natural robustness by leveraging the soft outputs of the teacher, whereas training merely on ground-truth labels makes the student less robust.
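For context, the sketch below shows the standard soft-target distillation objective in the spirit of Hinton et al. (2015), which is the usual starting point for self-distillation when teacher and student share an architecture: a KL term between temperature-softened teacher and student outputs mixed with cross-entropy on the ground-truth labels. The temperature and mixing weight are illustrative assumptions, not the submission's exact hyperparameters.

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(student_logits, teacher_logits, labels,
                           temperature=4.0, alpha=0.9):
    """Generic soft-target distillation loss (Hinton et al., 2015).

    In self-distillation the teacher and student have the same
    architecture; only the soft teacher outputs differ from plain
    supervised training. Values of `temperature` and `alpha` here are
    placeholders, not the paper's settings.
    """
    # Softened teacher distribution and student log-distribution.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=1)
    # Scale the KL term by T^2 so its gradients are comparable to cross-entropy.
    kd_loss = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2
    # Usual supervised term on ground-truth labels.
    ce_loss = F.cross_entropy(student_logits, labels)
    return alpha * kd_loss + (1.0 - alpha) * ce_loss
```

Setting `alpha` to 1.0 recovers training purely on the teacher's soft outputs, while `alpha = 0.0` reduces to ordinary supervised training on ground-truth labels, the baseline the abstract contrasts against for robustness.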
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Alexander_A_Alemi1
Submission Number: 956