Primary Area: general machine learning (i.e., none of the above)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: knowledge distillation, calibration
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: Calibration error serves as an effective criterion for selecting teachers in KD, and employing calibration methods can further enhance KD performance.
Abstract: Knowledge distillation (KD) is a successful deep learning compression method for edge devices; it transfers knowledge from a large model, known as the *teacher*, to a smaller model, referred to as the *student*. KD has demonstrated remarkable performance since it was first introduced. However, recent research on KD reveals that using a higher-performance teacher network does not guarantee better performance of the student network. This naturally raises the question of which criterion should be used to choose an appropriate teacher. In this paper, we reveal a strong correlation between the calibration error of the teacher and the accuracy of the student. We therefore claim that the calibration error of the teacher model can serve as a selection criterion for knowledge distillation. Furthermore, we demonstrate that KD performance can be improved by simply applying a temperature-based calibration method that reduces the teacher's calibration error. Our algorithm can easily be applied to other methods, and when applied on top of the current state-of-the-art (SOTA) model, it achieves a new SOTA performance.
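As a minimal sketch of the two ingredients named in the abstract, assuming the standard notions of expected calibration error (ECE) and post-hoc temperature scaling on held-out teacher logits, the snippet below shows how one might measure a teacher's calibration error and fit a calibration temperature. The function names `expected_calibration_error` and `tune_temperature` are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def expected_calibration_error(logits, labels, n_bins=15):
    """Standard ECE: bin predictions by confidence and average the
    |accuracy - confidence| gap, weighted by the fraction of samples per bin."""
    probs = F.softmax(logits, dim=1)
    confidences, predictions = probs.max(dim=1)
    accuracies = predictions.eq(labels).float()

    ece = torch.zeros(1)
    bin_edges = torch.linspace(0, 1, n_bins + 1)
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        prop_in_bin = in_bin.float().mean()
        if prop_in_bin > 0:
            gap = (accuracies[in_bin].mean() - confidences[in_bin].mean()).abs()
            ece += gap * prop_in_bin
    return ece.item()

def tune_temperature(logits, labels, max_iter=50):
    """Standard temperature scaling: fit a single scalar T on held-out
    logits by minimizing the negative log-likelihood."""
    log_t = torch.zeros(1, requires_grad=True)
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=max_iter)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()
```

Under this reading, a teacher with lower ECE (or one recalibrated by dividing its logits by the fitted temperature before distillation) would be preferred, but the exact way calibration is combined with the KD loss is described in the paper itself, not here.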
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1117