Understanding Calibration Transfer in Knowledge Distillation

18 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: Knowledge distillation, Calibration, Trustworthy ML
TL;DR: We show, arguably for the first time, that only calibrated teachers distill the best-calibrated students, and thus a recipe for producing accurate and calibrated classifiers must first calibrate the teacher classifier.
Abstract: Modern deep neural networks are often miscalibrated, leading to overconfident mistakes that erode their reliability and limit their use in critical applications. Existing confidence calibration techniques range from train-time modifications of the loss function to post-hoc smoothing of the classifier's predicted confidence vector. Despite the success of these approaches, it remains relatively unclear whether supervision from an already-trained expert classifier can further enhance a given classifier's confidence calibration. Knowledge distillation (KD) has been shown to help classifiers achieve better accuracy; however, little attention has been paid to a systematic understanding of whether calibration can also be transferred via KD. In this work, we provide new insights into how and when expert supervision can produce well-calibrated classifiers by studying a special class of linear teacher and student classifiers. Specifically, we provide theoretical insights into the working mechanisms of KD and show that calibrated teachers can distill calibrated students. We further show that, unlike traditional KD where a smaller-capacity classifier learns reliably from a larger-capacity expert, calibration can also be transferred from lower-capacity teachers to larger-capacity students (aka reverse KD). Furthermore, our findings indicate that not all training regimes are equally suitable: a teacher classifier trained with dynamic label smoothing leads to better calibration of student classifiers via KD. Moreover, the proposed KD-based calibration yields a state-of-the-art (SOTA) calibration framework surpassing all existing calibration techniques. Our claims are backed by extensive experiments on standard computer vision classification tasks. On CIFAR-100 with the WRN-40-1 feature extractor, we report an ECE of 0.98, compared to 7.61 and 2.1 for the current SOTA calibration techniques AdaFocal (Ghosh et al., NeurIPS 2022) and CPC (Cheng and Vasconcelos, CVPR 2022), respectively, and 11.16 for the baseline NLL loss (lower ECE is better). The calibration improvement holds across architectures: using MobileNetV2 on CIFAR-100, we report ECEs of 0.88/1.83/4.17/7.76 for Ours/AdaFocal/CPC/NLL.
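For readers unfamiliar with the two quantities the abstract leans on, the sketch below gives minimal PyTorch definitions of Expected Calibration Error (ECE), the metric behind the reported 0.98 vs. 7.61/2.1/11.16 numbers, and the standard Hinton-style KD loss through which calibration would be transferred from a teacher. This is not the authors' code; the bin count, temperature T, and mixing weight alpha are illustrative assumptions, not values from the paper.

```python
# Minimal sketch (not the authors' implementation). Assumed defaults:
# n_bins=15, T=4.0, alpha=0.9 are common choices in the literature,
# not values reported in this submission.
import torch
import torch.nn.functional as F

def expected_calibration_error(logits, labels, n_bins=15):
    """ECE: bin predictions by confidence and average |accuracy - confidence|
    per bin, weighted by the fraction of samples in the bin (lower is better)."""
    probs = logits.softmax(dim=-1)
    conf, pred = probs.max(dim=-1)
    correct = pred.eq(labels).float()
    edges = torch.linspace(0.0, 1.0, n_bins + 1)
    ece = torch.zeros(())
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = (correct[in_bin].mean() - conf[in_bin].mean()).abs()
            ece += in_bin.float().mean() * gap
    return ece

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Standard KD objective: KL divergence to the (ideally calibrated)
    teacher's temperature-softened distribution, plus cross-entropy to
    the ground-truth labels."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

Under the paper's thesis, the student's post-training ECE depends on how well-calibrated teacher_logits are, which is why the teacher would be calibrated (e.g., via dynamic label smoothing) before distillation.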
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1318