Insights into the mechanism behind reusing Teacher's classifier in Knowledge Distillation

01 Mar 2023 (modified: 01 Jun 2023) · Submitted to Tiny Papers @ ICLR 2023
Keywords: knowledge distillation, reused classifier, alignment, image classification, teacher-student framework
TL;DR: Even in vanilla knowledge distillation, the student classifier aligns with the teacher classifier. This alignment decreases as the temperature increases.
Abstract: Knowledge distillation (KD) has emerged as an effective approach to compress deep neural networks by transferring knowledge from a powerful yet cumbersome teacher model to a lightweight student model. Recent research has suggested that reusing the teacher's final layer (i.e., the classifier) can be a straightforward and effective method for knowledge distillation. The underlying mechanism for this method's success remains unclear. Our study aims to shed light on how the knowledge distillation loss affects the alignment between the weights of the student classifier and the teacher classifier. Specifically, we track the $L^2$ norm of the difference between the weights of the student classifier and the teacher classifier during training. Our experiments demonstrate that the knowledge distillation loss encourages alignment between the student and teacher classifiers, as indicated by a strong positive correlation ($>0.97$) between the $L^2$ norm and the loss during training. We also observe that as temperature increases, this alignment decreases and the $L^2$ norm behaves similarly to normal (non-KD) training. Our analysis aims to provide a better understanding of knowledge distillation and a starting point for the development of new KD frameworks.
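A minimal sketch of the measurement described in the abstract: the standard temperature-scaled KD loss alongside the $L^2$ norm of the student-teacher classifier weight difference. This is illustrative only, not the paper's code; the function names `kd_loss` and `classifier_alignment` are assumptions, and the sketch assumes the student and teacher classifiers have identical shapes (in practice a projector may be needed if their feature dimensions differ).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """Vanilla KD loss: KL divergence between temperature-softened teacher and
    student distributions, scaled by T^2 as in Hinton et al."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

def classifier_alignment(student_fc: nn.Linear, teacher_fc: nn.Linear) -> float:
    """L2 norm of the difference between student and teacher classifier
    parameters (weights and bias of the final linear layer), assuming both
    layers have the same shape."""
    with torch.no_grad():
        w_diff = (student_fc.weight - teacher_fc.weight).norm(p=2)
        b_diff = (student_fc.bias - teacher_fc.bias).norm(p=2)
        return torch.sqrt(w_diff ** 2 + b_diff ** 2).item()
```

Logging `classifier_alignment` at each epoch next to the KD loss is enough to reproduce the kind of correlation analysis the abstract reports; varying `temperature` shows how the alignment weakens at higher values.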