Abstract: Knowledge distillation is widely used to improve the performance of a relatively simple “student” model using the predictions of a complex “teacher” model. Several works have shown that distillation significantly boosts the student’s overall performance; however, are these gains uniform across all data subgroups? In this paper, we show that distillation can harm performance on certain subgroups, e.g., classes with few associated samples, compared to the vanilla student trained on one-hot labels. We trace this behavior to errors in the teacher’s predicted distribution being transferred to and amplified by the student model. To mitigate this problem, we present techniques that soften the teacher’s influence on subgroups where it is less reliable. Experiments on several image classification benchmarks show that these modifications of distillation maintain the boost in overall accuracy while additionally improving subgroup performance.